Production Listener Recovery and Observability
Recovery and observability belong to the same Telegram listener contract: clean stop, safe restart, stale lease repair, and readable files.
If a listener can stop and restart, the next requirement is honesty. Operators need to know whether the listener is healthy, whether a lease is stale, and whether the runtime state matches what the CLI says.
Recovery and observability are the same contract viewed from different angles.
Stop And Restart Must Stay Linked
The clean-stop work made close a durable transition instead of a best-effort
cleanup step. The restart work preserves the saved offset and the inbox trail so
the next listen run can resume from the last acknowledged position.
That linkage matters because a listener that can stop but cannot restart safely is still fragile. If the state file says one thing and the inbox says another, the operator is left guessing. The files need to tell the same story.
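One way to picture the link between stop and restart is the order of writes. The sketch below is a minimal illustration, not the listener's actual code: `ListenerState`, `commit_batch`, and `next_request_offset` are hypothetical names, and it assumes the saved offset holds the highest `update_id` already committed to the inbox. The one external fact it leans on is that Telegram's `getUpdates` expects an `offset` one greater than the highest update already received.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ListenerState:
    # Hypothetical in-memory stand-in for the saved offset and inbox trail.
    offset: Optional[int] = None  # highest update_id acknowledged so far
    inbox: list = field(default_factory=list)


def commit_batch(state: ListenerState, updates: list) -> None:
    """Record each update in the inbox before advancing the offset."""
    for update in updates:
        state.inbox.append(update)          # commit the work first
        state.offset = update["update_id"]  # then acknowledge it


def next_request_offset(state: ListenerState) -> Optional[int]:
    """Offset to send on the next poll; None means no saved position."""
    if state.offset is None:
        return None
    return state.offset + 1
```

Because the inbox write comes first, a crash between the two steps can at worst replay an update on restart. It cannot skip one.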
What The Files Have To Agree On
The runtime files are only useful if they line up:
- state.yaml should describe the listener lifecycle.
- lease.yaml should show who owns the listener and whether it is active.
- history.md should preserve the lifecycle timeline.
- inbox/ should show what was actually committed.
When those files agree, the operator can trust the system. When they disagree, the mismatch is the signal that recovery or repair is needed.
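The agreement check can be sketched as a function over already-parsed file contents. Everything about the shape of the data here is an assumption made for illustration: the `status`, `offset`, and `active` keys, the idea that history events share names with lifecycle statuses, and the function name `find_mismatches` are all hypothetical, since the real file schemas are not shown.

```python
def find_mismatches(state: dict, lease: dict, history: list, inbox_count: int) -> list:
    """Return human-readable disagreements between the runtime files."""
    problems = []
    status = state.get("status")      # hypothetical key in state.yaml
    running = status == "running"

    # state.yaml and lease.yaml should agree on whether a listener is live.
    if running and not lease.get("active"):
        problems.append("state says running but lease is not active")
    if not running and lease.get("active"):
        problems.append("lease is active but state is not running")

    # The newest history entry should match the recorded lifecycle status.
    last_event = history[-1] if history else None
    if last_event is not None and last_event != status:
        problems.append(f"last history event {last_event!r} does not match state {status!r}")

    # Committed inbox work implies an offset was saved at some point.
    if inbox_count > 0 and state.get("offset") is None:
        problems.append("inbox has committed work but no offset is saved")

    return problems
```

An empty list means the files tell the same story. Anything else is a list of the specific contradictions to repair.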
Why Stale Leases Matter
A stale lease is what happens when the process does not get to finish its own shutdown. A crash, an interrupted stop, or a hard kill can leave behind a file that still looks active even though nothing is actually listening anymore.
That is not a small annoyance. It blocks new starts and makes the system look busier than it is. The recovery surface has to make stale ownership visible and repairable without rewriting the inbox or pretending prior work never happened.
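A common way to make stale ownership visible is to check whether the process recorded in the lease still exists. The sketch below assumes the lease stores a `pid` and an `active` flag, which is a guess at the schema rather than a description of it, and `pid_alive` and `lease_is_stale` are hypothetical names. The probe is POSIX-only: on Windows, `os.kill` with signal 0 does not behave as a harmless existence check.

```python
import os


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def lease_is_stale(lease: dict, probe=pid_alive) -> bool:
    """A lease is stale when it claims to be active but its owner is gone."""
    if not lease.get("active"):
        return False  # a released lease cannot be stale
    pid = lease.get("pid")
    if pid is None:
        return True   # active lease with no recorded owner cannot be verified
    return not probe(pid)
```

A PID check alone can be fooled by PID reuse, so a real implementation would likely pair it with something like a start time or heartbeat. The probe is injectable so that choice stays open.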
Observability Should Stay Operator-Readable
The Telegram listener should be inspectable from the CLI without reading source code or guessing at process state. The status surface should summarize the same truth that the runtime files encode.
At minimum, the operator needs to know:
- whether the listener is active,
- whether the lease is stale,
- what the last saved offset was,
- whether inbox work exists,
- and what the last lifecycle event was.
That is enough to answer the real question: can I start this listener again, or do I need to repair state first?
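Those five facts and the start-or-repair question fit in one small summary. This is a sketch of what a status surface could compute, not the actual CLI: the function name `summarize`, the dictionary keys, and the verdict strings are all hypothetical, and the staleness decision is passed in rather than derived here.

```python
def summarize(state: dict, lease: dict, history: list, inbox_count: int, lease_stale: bool) -> dict:
    """Collect the five operator-facing facts plus a start-or-repair verdict."""
    # A stale lease only looks active, so it does not count as a live listener.
    active = bool(lease.get("active")) and not lease_stale

    if lease_stale:
        verdict = "repair state first"
    elif active:
        verdict = "already running"
    else:
        verdict = "safe to start"

    return {
        "active": active,
        "lease_stale": lease_stale,
        "last_offset": state.get("offset"),
        "inbox_pending": inbox_count > 0,
        "last_event": history[-1] if history else None,
        "verdict": verdict,
    }
```

Keeping the verdict next to the raw facts lets the operator see both the answer and the evidence behind it.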
Repair Should Not Rewrite History
Repair is about ownership, not erasure.
If a stale lease needs to be cleared, the fix should not reset the offset or pretend the inbox never happened. The listener already committed work to disk. Recovery should restore the ability to start safely without discarding the evidence of what already ran.
That restraint is the difference between a repair tool and a second source of truth.
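In code, that restraint means the repair touches the lease and the history and nothing else. The sketch below works on a hypothetical in-memory `runtime` dictionary with `lease`, `state`, `history`, and `inbox` keys. The key names, the `lease-repaired` event label, and the function name are assumptions made for illustration.

```python
import copy


def repair_stale_lease(runtime: dict) -> dict:
    """Release ownership and log the repair; leave offset and inbox untouched."""
    repaired = copy.deepcopy(runtime)  # never mutate the caller's view
    repaired["lease"]["active"] = False
    repaired["lease"]["pid"] = None
    # The repair is itself a lifecycle event, so it is appended, not substituted.
    repaired["history"].append("lease-repaired")
    return repaired


before = {
    "lease": {"active": True, "pid": 4242},
    "state": {"offset": 17},
    "history": ["started"],
    "inbox": ["msg-1", "msg-2"],
}
after = repair_stale_lease(before)
```

After the call, `after["state"]["offset"]` is still 17 and the inbox is unchanged. Only ownership and the timeline differ.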
The Design Rule
Do not split recovery from observability. A listener that cannot explain its own state cannot be operated with confidence. The Telegram listener uses the same files for stop, restart, and inspection because those files are what make the runtime legible.