Diagnostics and Stale Lease Repair for Telegram Listeners
The Telegram listener recovery plan makes stale leases visible and repairable, and keeps diagnostics aligned with the same files the runtime uses.
If a listener can stop and restart, the next requirement is honesty. Operators need to know whether the listener is healthy, whether a lease is stale, and whether the runtime state matches what the CLI says.
The diagnostics slice is the observability half of the recovery work. It turns the Telegram listener into something you can inspect from the command line without guessing at process state or reading the source code to interpret the files.
Why stale leases matter
A stale lease is what remains when a listener never gets to finish its own shutdown. A crash, a hard stop, or an interrupted run can leave behind state that still looks active even though no real listener owns it anymore.
That is a dangerous condition in a channel system. It blocks new starts, confuses operators, and creates the impression that work is still in flight when it is not. Worse, it makes recovery ambiguous. If the lease is stale but still marked active, the system has to decide whether to trust the file or the actual runtime.
The recovery plan treats stale lease detection as a first-class recovery problem. The point is not to hide the stale state. The point is to recognize it and repair it safely.
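One way to recognize a stale lease is sketched below. It rests on assumptions that are not taken from the plan itself: the lease is a JSON file, the owning listener refreshes a `heartbeat_at` timestamp in it, and a hypothetical `STALE_AFTER_SECONDS` threshold separates a live owner from an abandoned one. The real lease format and detection rule may differ.

```python
import json
import time
from pathlib import Path
from typing import Optional

# Hypothetical threshold: a lease with no heartbeat for this long counts as stale.
STALE_AFTER_SECONDS = 60.0


def classify_lease(lease_path: Path, now: Optional[float] = None) -> str:
    """Return 'none', 'held', or 'stale' for an assumed JSON lease file."""
    now = time.time() if now is None else now
    if not lease_path.exists():
        return "none"
    try:
        lease = json.loads(lease_path.read_text())
        heartbeat_at = float(lease["heartbeat_at"])  # assumed field name
    except (ValueError, KeyError, TypeError):
        # An unreadable lease cannot prove live ownership, so surface it as stale
        # instead of hiding it or treating it as a clean start.
        return "stale"
    return "held" if now - heartbeat_at <= STALE_AFTER_SECONDS else "stale"
```

The function only classifies. It never deletes or rewrites the lease, which keeps detection separate from repair.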
What the status surface should tell the truth about
The current Telegram docs already describe the listener state, lease, history, inbox, and offset model. The recovery work extends that model into a dedicated status surface so operators can inspect the listener without navigating multiple files by hand.
The status report should line up with the same state the listener uses internally. At minimum, it should expose:
- whether the listener is active,
- whether a lease is currently held,
- whether the lease looks stale or recoverable,
- the last saved offset,
- the inbox count or equivalent progress signal,
- and the last meaningful lifecycle event.
That makes the command useful for more than just curiosity. A status command should answer the operator question: can I start this listener, or do I need to repair state first?
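The fields listed above could be carried in a single record, as in this minimal sketch. The class name, field names, and the three `next_step` answers are illustrative assumptions, not the actual status schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ListenerStatus:
    """Hypothetical status record; field names are illustrative only."""
    active: bool
    lease_held: bool
    lease_stale: bool
    last_offset: Optional[int]
    inbox_count: int
    last_event: Optional[str]

    def next_step(self) -> str:
        """Answer the operator question: start, repair first, or leave it running."""
        if self.lease_stale:
            return "repair"
        if self.active or self.lease_held:
            return "running"
        return "start"


def render(status: ListenerStatus) -> str:
    """Format the status as plain key: value lines for terminal output."""
    rows = [
        ("active", status.active),
        ("lease held", status.lease_held),
        ("lease stale", status.lease_stale),
        ("last offset", status.last_offset),
        ("inbox count", status.inbox_count),
        ("last event", status.last_event),
        ("next step", status.next_step()),
    ]
    return "\n".join(f"{key}: {value}" for key, value in rows)
```

A stale lease takes priority over the other flags here, on the reasoning that an "active" marker next to a stale lease is exactly the mismatch the operator needs to see first.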
Why diagnostics belong in the CLI
The recovery work does not turn the listener into a hidden daemon with private health endpoints. The design stays aligned with the rest of the sprint: the filesystem is the source of truth, and the CLI is the visible control surface.
That choice keeps diagnostics concrete. If an operator runs a status command, they should see the same answer the listener uses to decide whether it is safe to resume. If they run the close command, they should see the stop contract reflected in the files. If they inspect the listener history, it should line up with both.
This is the core observability idea in the slice:
- the runtime shell owns the lifecycle,
- the files record the lifecycle,
- and the CLI summarizes the lifecycle in human terms.
If those three views disagree, the system is not observable enough yet.
What stale lease repair should do
The repair path needs to be specific. A stale lease should not be treated as a clean start, and it should not be left in place just because the status surface can describe it.
The intended repair behavior is:
- detect that the lease is stale rather than actively owned,
- mark or clear the stale lease in the filesystem,
- preserve the inbox and offset history,
- and allow a new listener start to proceed safely.
That is the practical difference between inspection and repair. Inspection tells the operator what is wrong. Repair changes the state so the next listener start is unambiguous.
The repair path also has to avoid duplicating inbox work. Clearing a stale lease should not reset the offset, and it should not pretend prior updates were never acknowledged. Repair stays focused on ownership and operability, not on rewriting history.
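A repair step along these lines might look like the sketch below. The file names (`lease.json`, `history.jsonl`), the choice to rename the lease instead of deleting it, and the `stale_lease_cleared` event name are all assumptions made for illustration. The caller decides staleness and passes it in as `is_stale`.

```python
import json
import time
from pathlib import Path


def repair_stale_lease(state_dir: Path, is_stale: bool) -> bool:
    """Move a stale lease aside and record the repair. Returns True if repaired.

    Only the lease and the history file are touched. The offset and inbox
    are never opened, so acknowledged updates stay acknowledged.
    """
    lease = state_dir / "lease.json"  # hypothetical file name
    if not is_stale or not lease.exists():
        return False

    # Keep the stale lease as evidence instead of deleting it outright.
    quarantined = state_dir / f"lease.stale.{int(time.time())}.json"
    lease.replace(quarantined)

    # Append a lifecycle event so the history explains why the lease vanished.
    event = {
        "event": "stale_lease_cleared",
        "at": time.time(),
        "moved_to": quarantined.name,
    }
    with (state_dir / "history.jsonl").open("a") as history:
        history.write(json.dumps(event) + "\n")
    return True
```

Calling it a second time returns `False` because no `lease.json` is left to repair, so the operation is safe to retry.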
Why the status contract matters for restart safety
It is easy to think of diagnostics as a separate convenience layer. In this system, they are part of the restart contract.
If the listener reports active but the lease is stale, the operator needs to know before starting a second process. If the listener reports stopped but the state files still suggest a live lease, that mismatch needs to be visible. If the offset is behind the inbox contents, that needs to be explainable before the next intake cycle begins.
The status command becomes the bridge between the durable files and the operator's next move. It helps answer the questions that matter after a crash:
- Is the channel safe to restart?
- Is there a stale lease to clear?
- Is the saved offset still trustworthy?
- Has the inbox already committed the latest update?
Those are operational questions, not abstract ones. The recovery work makes them visible from the CLI so they can be acted on quickly.
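The questions above can be turned into a small pre-restart check, sketched here under stated assumptions: the lease state arrives as one of the strings `"none"`, `"held"`, or `"stale"`, the saved offset is the last acknowledged update id, and the inbox can report the highest update id it has committed. None of these names or conventions come from the actual listener.

```python
from typing import List, Optional


def restart_findings(
    lease_state: str,
    saved_offset: Optional[int],
    inbox_max_update_id: Optional[int],
) -> List[str]:
    """Collect everything an operator should resolve before restarting."""
    findings: List[str] = []

    if lease_state == "held":
        findings.append("lease is held by a live listener; do not start a second process")
    elif lease_state == "stale":
        findings.append("stale lease present; repair it before starting")

    if inbox_max_update_id is not None:
        if saved_offset is None:
            findings.append("inbox contains updates but no offset was saved")
        elif saved_offset < inbox_max_update_id:
            # The inbox committed an update that the offset does not reflect yet.
            findings.append(
                f"offset {saved_offset} is behind inbox update {inbox_max_update_id}"
            )
    return findings


def safe_to_restart(findings: List[str]) -> bool:
    """A restart is considered safe only when nothing needs explaining."""
    return not findings
```

Returning the findings as text, not as a bare boolean, keeps the check explainable: the operator sees why a restart is blocked, not just that it is.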
What this slice does not do
Observability is useful only if it stays scoped. The recovery work is not trying to replace the listener shell with a richer admin console. It is not adding new Telegram intake semantics. It is not changing the meaning of inbound updates or outbound sends.
The slice is intentionally narrow:
- repair stale ownership,
- surface listener health,
- expose the fields that matter for restart decisions,
- and keep the runtime story aligned with the on-disk files.
That restraint matters because recovery tools can become their own source of confusion if they start inventing a second truth. The Telegram listener should have one truth, represented in files, and one readable summary of that truth, exposed through the CLI.
What this prepares
Once stale leases are detectable and repairable, the recovery contract becomes much easier to operate. The listener can stop cleanly, the next start can recover safely, and the status surface can explain what happened in between.
That prepares the sprint for the final integration and packaging work. Before the system can be proven end to end, operators need a reliable answer to a simpler question: what state is the Telegram listener in right now, and is it safe to continue?
The design rule
Do not let diagnostics drift away from the filesystem contract. If the files say one thing and the CLI says another, the system is hiding its own state. The recovery work uses status and stale lease repair to make the Telegram listener legible again.