telegram.recovery.clean-stop

Clean Stop and Restart Recovery for Telegram Listeners

The Telegram listener recovery plan ties clean shutdown, restart recovery, and offset continuity into one filesystem-first contract.

2026-05-18
telegram / channel-server / recovery / filesystem-first / matic

A production listener is not finished when it can start. It is finished when it can stop cleanly, leave a readable filesystem trail, and resume without duplicating work after the next start.

The recovery work focuses on the Telegram listener's shutdown and restart contract. The goal is not to add new Telegram intake behavior. The goal is to make the existing foreground listener shell survivable in the real world: stop it, restart it, and trust the files to describe what happened.

Why stop and recovery belong together

In a filesystem-first system, stop behavior is never just cleanup. It is part of the operational contract. If a listener exits without marking its own state, releasing its lease, and preserving its offset, the next operator has to guess whether the channel is safe to start again.

That is the problem the recovery work is designed to remove.

The planned work focuses on three linked guarantees:

`close` should leave a consistent listener record.
Restart should continue from the saved offset.
A crash should not force manual repair before the next listen.

Those guarantees are related. If the listener finalizes state but loses the offset, restart safety breaks. If it preserves offset but leaves a stale lease behind, restart safety breaks in a different way. If it hides the stop event behind process disappearance, the filesystem contract becomes unreliable. The recovery work ties those pieces together so the operator can trust one story.

What `close` must mean

The stop command is not a best-effort cleanup hook. The recovery plan treats close as a durable transition that should be visible in the listener files.

The expected shape is straightforward:

the listener state should move to stopped,
the lease should be released,
the stop timestamp should be recorded,
the last event should reflect close,
and the history file should show the lifecycle transition.

That is the difference between a stop command and a polite suggestion. A stop command should leave evidence that the listener actually finalized. An operator should be able to inspect .matic/channels/telegram/listener/ and see the result without chasing terminal output.

The plan also keeps the stop path strict. If there is no existing state or lease, close should not pretend it succeeded. Silent success would hide errors in the caller's assumptions and make recovery harder to reason about later. In this slice, explicit failure is better than ambiguous cleanup.
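As a rough illustration, the close contract described here could be sketched in Python as follows. Only the `.matic/channels/telegram/listener/` directory comes from the plan; the file names `state.json`, `lease.json`, and `history.jsonl`, the field names, and the `CloseError` type are assumptions made for this sketch, not the listener's actual layout.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path


class CloseError(RuntimeError):
    """Raised when close is asked to finalize a listener with no state or lease."""


def write_json_atomic(path: Path, payload: dict) -> None:
    # Write to a sibling temp file, then rename over the target, so a crash
    # mid-write never leaves a half-written record behind.
    tmp_path = path.with_name(path.name + ".tmp")
    with tmp_path.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)


def close_listener(listener_dir: Path) -> dict:
    state_path = listener_dir / "state.json"        # hypothetical file name
    lease_path = listener_dir / "lease.json"        # hypothetical file name
    history_path = listener_dir / "history.jsonl"   # hypothetical file name

    # Strict stop path: refuse to report success when there is nothing to close.
    if not state_path.exists() or not lease_path.exists():
        raise CloseError(f"no listener state or lease under {listener_dir}")

    state = json.loads(state_path.read_text(encoding="utf-8"))
    stopped_at = datetime.now(timezone.utc).isoformat()

    # Record the transition; any saved offset file is deliberately left untouched.
    state.update({"status": "stopped", "stopped_at": stopped_at, "last_event": "close"})
    write_json_atomic(state_path, state)

    # History is append-only: one JSON line per lifecycle transition.
    with history_path.open("a", encoding="utf-8") as history:
        history.write(json.dumps({"event": "close", "at": stopped_at}) + "\n")

    # Release the lease last, so an interrupted close still leaves a lease
    # behind for the next start to notice.
    lease_path.unlink()
    return state
```

Releasing the lease as the final step is a design choice of the sketch, not something the plan specifies. The reasoning is that a close interrupted halfway leaves visible evidence on disk rather than a directory that looks idle.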

Why offset continuity matters

The inbound-intake work established the intake boundary and the inbox records. The recovery work has to protect that work during shutdown and restart.

The listener already treats Telegram update IDs as the high-water mark for durable intake. The next offset is only meaningful if it survives a stop and restart. If the listener loses that number, the next run can re-poll updates it already acknowledged or, worse, skip updates that were never actually written to the inbox.

The recovery slice keeps that boundary intact:

the inbox record is written before the offset advances,
the saved offset is the source of truth on restart,
and a clean stop should preserve the last acknowledged position.

This is why the recovery work does not add new intake semantics. The job is not to expand the Telegram API surface. The job is to make the existing intake contract durable under interruption.
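The ordering rule can be made concrete with a minimal Python sketch, offered under stated assumptions. The `offset.json` file, the one-file-per-update inbox naming, and the function names are hypothetical. The `update_id + 1` step follows the Telegram Bot API convention that the `getUpdates` offset must be one greater than the highest update already received.

```python
import json
import os
from pathlib import Path


def load_next_offset(listener_dir: Path) -> int:
    # The saved offset is the source of truth on restart; 0 means nothing
    # has been acknowledged yet.
    offset_path = listener_dir / "offset.json"  # hypothetical file name
    if not offset_path.exists():
        return 0
    return json.loads(offset_path.read_text(encoding="utf-8"))["next_offset"]


def save_next_offset(listener_dir: Path, next_offset: int) -> None:
    offset_path = listener_dir / "offset.json"
    tmp_path = listener_dir / "offset.json.tmp"
    tmp_path.write_text(json.dumps({"next_offset": next_offset}), encoding="utf-8")
    os.replace(tmp_path, offset_path)  # atomic swap on the same filesystem


def ingest_updates(listener_dir: Path, inbox_dir: Path, updates: list) -> int:
    """Persist each update to the inbox, then advance the offset past it."""
    next_offset = load_next_offset(listener_dir)
    for update in sorted(updates, key=lambda u: u["update_id"]):
        update_id = update["update_id"]
        if update_id < next_offset:
            continue  # already acknowledged in an earlier run
        record_path = inbox_dir / f"{update_id}.json"  # hypothetical naming scheme
        record_path.write_text(json.dumps(update), encoding="utf-8")
        # Only after the inbox record exists does the high-water mark move.
        next_offset = update_id + 1
        save_next_offset(listener_dir, next_offset)
    return next_offset
```

If the process dies between the inbox write and the offset save, the next run re-reads the same update and rewrites the same inbox file. Under this sketch the failure mode is a repeated write, not a skipped update.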

Recovery without manual intervention

A restart-safe listener should not require the operator to clean up its state by hand after a crash. That would turn the filesystem contract into a suggestion rather than a guarantee.

The intended recovery behavior is more concrete:

the foreground shell checks the listener files on startup,
stale or inactive lifecycle state is recognized,
the saved offset is reused if the inbox already contains acknowledged work,
and the listener resumes from the point it last committed.

That makes recovery an ordinary start path, not a special recovery playbook. The operator should not need to delete state files or invent a repair ritual just to get the channel listening again.
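One way to picture recovery as an ordinary start path is the Python sketch below. It is illustrative only: the plan does not say how staleness is decided, so the time-based `LEASE_TTL_SECONDS` check, the `renewed_at` field, the file names, and the `start` versus `recover` event labels are all assumptions.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional

LEASE_TTL_SECONDS = 60.0  # hypothetical staleness threshold


def read_json(path: Path, default: dict) -> dict:
    # Missing files are normal on a first start, so fall back to a default.
    if not path.exists():
        return dict(default)
    return json.loads(path.read_text(encoding="utf-8"))


def start_listener(listener_dir: Path, now: Optional[float] = None) -> dict:
    now = time.time() if now is None else now
    listener_dir.mkdir(parents=True, exist_ok=True)
    state_path = listener_dir / "state.json"        # hypothetical file names
    lease_path = listener_dir / "lease.json"
    offset_path = listener_dir / "offset.json"
    history_path = listener_dir / "history.jsonl"

    state = read_json(state_path, {"status": "stopped"})
    lease = read_json(lease_path, {})

    # A recently renewed lease means another listener may still be alive.
    if lease and now - lease["renewed_at"] < LEASE_TTL_SECONDS:
        raise RuntimeError("listener lease is still active")

    # A leftover expired lease, or a state that still says "running",
    # is treated as the trace of a crash rather than as a blocker.
    recovered = bool(lease) or state.get("status") == "running"
    event = "recover" if recovered else "start"

    # Reuse the saved offset; it is never reset by the start path.
    next_offset = read_json(offset_path, {"next_offset": 0})["next_offset"]

    lease_path.write_text(json.dumps({"pid": os.getpid(), "renewed_at": now}), encoding="utf-8")
    state.update({"status": "running", "last_event": event})
    state_path.write_text(json.dumps(state), encoding="utf-8")
    with history_path.open("a", encoding="utf-8") as history:
        history.write(json.dumps({"event": event, "at": now}) + "\n")
    return {"event": event, "next_offset": next_offset}
```

The same function handles a first start, a start after a clean stop, and a start after a crash. The only difference between them is the event written to history.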

Clean stop is also an observability problem

Stop and recovery are often treated as reliability features only, but they are also observability features. If the listener can stop cleanly, then the files show a coherent timeline. If the listener can restart safely, then the files show where work resumed.

That matters because the Telegram channel server is not a hidden service. It is an org-local runtime surface. Operators inspect the org directory to understand what the listener did. A clean stop plus a safe restart turns that directory into evidence instead of guesswork.

The recovery work therefore keeps the runtime shell readable:

lifecycle state remains explicit,
leases remain visible,
history remains append-only,
and offsets remain tied to durable inbox writes.

Those are the ingredients that let later diagnostics say something meaningful.
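To show what reading that evidence might look like, here is a small Python sketch that turns an append-only history file into a timeline. The `history.jsonl` name and the `event` and `at` fields are assumptions for illustration; the plan only states that history is append-only.

```python
import json
from pathlib import Path


def read_timeline(listener_dir: Path) -> list:
    """Return one 'timestamp event' string per recorded lifecycle transition."""
    history_path = listener_dir / "history.jsonl"  # hypothetical file name
    if not history_path.exists():
        return []
    timeline = []
    for line in history_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between entries
        entry = json.loads(line)
        timeline.append(f'{entry["at"]} {entry["event"]}')
    return timeline
```

Because entries are only ever appended, the order of lines in the file is the order of events, and the reader needs no sorting or merging.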

What this sets up

This work prepares the listener for the observability and repair surface that follows. Once stop and restart behavior are durable, the next question becomes whether the operator can tell when the listener is healthy, when a lease is stale, and when a manual repair is needed.

That is the right order. First make the listener safe to stop and restart. Then make it easier to inspect and repair.

The design rule

Do not treat crash recovery as a separate system from clean shutdown. They are the same contract viewed from different directions. A listener that cannot stop cleanly cannot restart safely, and a listener that cannot restart safely has not really been operationalized yet.