runtime-harness

Principles, Practices, and Patterns in Harness Engineering

Harnesses define the thin control plane around models and runtimes, with durable observation, verification, and cleanup.

2026-05-15m0 / python / harness-engineering / filesystem-first / runtime-boundaries

Harnesses are the control plane around models and runtimes. They are not the product, and they are not the system of record. Their job is to create the boundary that makes work observable, reproducible, and safe to evolve.

What a harness is

A good harness prepares inputs, drives execution, records outputs, checks assertions, and cleans up afterward. That sounds simple, but the design choices matter because the harness can easily absorb responsibilities that belong to the system it wraps.

The useful harness is thin. It stays accountable to durable state instead of becoming a second hidden application layer.
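As a rough illustration of how thin that lifecycle can be, here is a minimal sketch of a single run. The names `run_once`, `execute`, and `runs_dir`, and the `run.json` record shape, are hypothetical choices made for this example, not an established API. The only durable output is the record written under `runs_dir`; everything else is scratch.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_once(task: str, execute: Callable[[str], str], runs_dir: Path) -> dict:
    """Prepare, drive, verify, record, and clean up a single run."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    # Prepare: scratch space that belongs to this run only.
    scratch = Path(tempfile.mkdtemp(prefix="harness-"))
    try:
        input_path = scratch / "input.txt"
        input_path.write_text(task, encoding="utf-8")

        # Drive: the harness calls the runtime but does not own its logic.
        output = execute(input_path.read_text(encoding="utf-8"))

        # Verify: a deliberately trivial assertion for the sketch.
        record = {"task": task, "output": output, "passed": bool(output.strip())}

        # Record: the durable result lives outside the scratch directory.
        (runs_dir / "run.json").write_text(
            json.dumps(record, indent=2), encoding="utf-8"
        )
        return record
    finally:
        # Clean up: scratch state never outlives the run, even on failure.
        shutil.rmtree(scratch, ignore_errors=True)
```

A caller might pass any callable as the runtime, for example `run_once("say hi", str.upper, Path("runs"))`. The harness holds no state of its own once the function returns.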

The layer model

Harness engineering becomes easier to reason about when you split it into layers:

workspace harness: prepares files, contexts, and inputs
execution harness: drives a model or runtime without becoming the system of record
observation harness: records logs, artifacts, session records, and state snapshots
verification harness: turns outputs into assertions, approvals, and repeatable tests
portability harness: keeps adapters, environment isolation, and runtime swaps manageable

That breakdown is practical because each layer has a different failure mode. You debug each one differently, which is the reason to keep them separate.
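One way to keep those failure modes distinguishable is to give each layer its own function and its own exception type, so a failed run reports which layer broke. The sketch below covers four of the five layers and leaves the portability layer out. All class and function names are assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable
import tempfile


class HarnessError(Exception):
    """Base class; subclasses name the layer that failed."""
    layer = "unknown"


class WorkspaceError(HarnessError):
    layer = "workspace"


class ExecutionError(HarnessError):
    layer = "execution"


class ObservationError(HarnessError):
    layer = "observation"


class VerificationError(HarnessError):
    layer = "verification"


def prepare(workspace: Path, prompt: str) -> Path:
    """Workspace layer: write the input file."""
    try:
        workspace.mkdir(parents=True, exist_ok=True)
        path = workspace / "prompt.txt"
        path.write_text(prompt, encoding="utf-8")
        return path
    except OSError as exc:
        raise WorkspaceError(str(exc)) from exc


def execute(prompt_path: Path, runtime: Callable[[str], str]) -> str:
    """Execution layer: call the runtime and wrap anything it raises."""
    prompt = prompt_path.read_text(encoding="utf-8")
    try:
        return runtime(prompt)
    except Exception as exc:
        raise ExecutionError(str(exc)) from exc


def observe(workspace: Path, output: str) -> None:
    """Observation layer: persist the output as an artifact."""
    try:
        (workspace / "output.txt").write_text(output, encoding="utf-8")
    except OSError as exc:
        raise ObservationError(str(exc)) from exc


def verify(output: str, expected: str) -> None:
    """Verification layer: turn the output into a pass/fail assertion."""
    if expected not in output:
        raise VerificationError(f"expected {expected!r} in output")


def run(workspace: Path, prompt: str, runtime: Callable[[str], str], expected: str) -> str:
    """Return 'ok', or the name of the layer that failed."""
    try:
        output = execute(prepare(workspace, prompt), runtime)
        observe(workspace, output)
        verify(output, expected)
    except HarnessError as err:
        return err.layer
    return "ok"
```

Because `run` returns the layer name, a runtime that raises is reported as `"execution"` while a wrong answer is reported as `"verification"`, and the two are never confused in the run log.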

Why filesystem-first state wins

Hidden in-memory orchestration is hard to replay and harder to audit. Filesystem-first state is better by default because it stays inspectable, diffable, and recoverable with ordinary tools.

That is especially valuable in harnesses that need to support replay and review. If the artifacts live on disk, the run can be inspected after the fact. If the state is only in memory, the harness becomes hard to trust as soon as the process ends.
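A minimal sketch of what filesystem-first state can look like, assuming one JSON file per step under a hypothetical `steps/` directory. The helper names (`write_json_atomic`, `record_step`, `replay`) are illustrative. Each write goes through a temporary file and `os.replace`, so a crash mid-write does not leave a truncated record for later readers.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so that readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file sits in the target directory so the rename stays
    # on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def record_step(run_dir: Path, index: int, event: dict) -> Path:
    """Persist one step; zero-padded names keep lexical order equal to step order."""
    path = run_dir / "steps" / f"{index:04d}.json"
    write_json_atomic(path, event)
    return path


def replay(run_dir: Path) -> list[dict]:
    """Rebuild the run history from disk alone, in step order."""
    step_files = sorted((run_dir / "steps").glob("*.json"))
    return [json.loads(p.read_text(encoding="utf-8")) for p in step_files]
```

Since `replay` reads nothing but files, it works from a fresh process long after the original run has exited, and the same directory can be inspected with `ls`, `cat`, or `diff`.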

Common failure modes

The usual harness problems are predictable:

leaky context makes runs non-reproducible
hidden side effects make failures hard to diagnose
overgrown fixtures turn the harness into the product
brittle adapters make runtime swaps expensive

Those are all signs that the boundary has gotten too thick. A harness should help the system run, not absorb the system into itself.
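Hidden side effects, in particular, can be surfaced with very little machinery. The sketch below hashes a workspace before and after a run and reports any changed path outside an allowed prefix. The function names and the idea of an allow-list of output prefixes are assumptions for illustration, and the approach only sees effects inside the snapshotted directory.

```python
import hashlib
import tempfile
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file's relative path to a SHA-256 digest of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or changed between two snapshots."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


def unexpected_changes(
    changes: dict[str, list[str]], allowed_prefixes: tuple[str, ...]
) -> list[str]:
    """Return touched paths that fall outside the declared output locations."""
    touched = changes["added"] + changes["removed"] + changes["changed"]
    return sorted(p for p in touched if not p.startswith(allowed_prefixes))
```

A harness could call `snapshot` before driving the runtime, call it again afterward, and fail verification if `unexpected_changes` returns anything. That turns "the run quietly edited a config file" into an explicit, recorded finding.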

When the harness is too complex

A harness is too complex when it requires more coordination than the thing it is supposed to test, or when it starts owning durable state that should live in the workspace or runtime layer.

That rule keeps the boundary honest. The harness can prepare, observe, and verify, but the durable work still has to live somewhere legible.

Matic tie-in

The work on runtime adapters and operational loops shows why both need explicit boundaries. Matic treats harnesses as thin wrappers around durable work, not as hidden orchestrators.

That keeps the filesystem as the source of truth for tasks, sessions, approvals, and artifacts. It also makes the system easier to inspect when something fails, which is the real test of harness quality.