Agent-readable wiki

duet-agent Mental Model Wiki

duet-agent is a TypeScript agent harness for jobs that outlive a single chat session: it combines a five-state relay state machine, PGlite-backed observational memory, and a protocol-first TurnRunner so any process—serverless function, cron job, or interactive TUI—can resume exactly where the last one stopped.

Pages

  1. The Mental Model — Three Answers to Process Death: The simplest accurate model of what duet-agent is and why its three subsystems (relay state machine, observational memory, TurnState snapshot) are one coherent answer to the same problem: work that must survive a dead process.
  2. TurnState & the Command/Event Protocol: TurnState is the only thing that must survive between process boundaries. This page traces the lifecycle of a TurnState from creation through prompt/answer/wake commands, terminal events (complete, ask, sleep, interrupted), and what each field owns—agent messages, stateMachine session, usage accounting, and todos.
  3. The Five State Kinds — Vocabulary of a Relay: Every relay is built from exactly five state kinds: agent (sub-agent with a prompt), script (shell command), poll (recurring external check), timer (pure wall-clock delay), and terminal (named business outcome). This page explains the invariants, input schema templating, and what each kind can and cannot do—including why integrations like GitHub or email are always script/poll states, never engine primitives.
  4. State Machine Execution Flow — How the Runner Agent Drives Transitions: The runner agent—not a config file—selects the next state every turn. This page traces how state-machine-controller.ts dispatches a state, records audit events in StateMachineSession.history, emits sleep for poll/timer states, handles interruptions, and runs a terminal acknowledgment turn. Includes the carry-forward invariant and the mid-session start rule.
  5. Observational Memory — How Transcripts Become Durable Rows: Memory and compaction are the same primitive. After each turn an observer model reads the transcript and appends Observation rows to PGlite; when rows grow beyond a threshold a reflector condenses them. Embeddings run in a background worker (embedding-worker.ts) so foreground turns never block. This page covers the observe → reflect → embed pipeline, trigger conditions, failure isolation, and the image-to-text path that keeps screenshots recallable.
  6. Recall & the Frozen Context Pack — What Survives Into Every Prompt: Every turn is prefixed with a frozen two-layer memory pack: global cross-session observations ranked by recency half-life, and local session compaction. The pack rebuilds only on three specific events (initial load, reflector replacement, wire-shaping eviction) so the provider's prompt cache survives turn-over-turn. recall_memory tool uses hybrid RRF retrieval (pgvector cosine + tsvector keyword) to surface anything that missed the pack. This page explains pack structure, rebuild triggers, cache stability invariant, and RRF fusion.
  7. Wire Shaping & Model Resolution — Context Budget Enforcement: wire-shaping.ts enforces a byte budget (15 MB trigger, 80% target) and a token budget (200k default effectiveContext) on the dispatched message list. Eviction advances the WireGuardHorizon, trimming oldest messages in one block to minimize prompt-cache invalidations. Images get a fixed 1,600-token estimate to prevent base64 byte inflation from triggering early eviction. Model resolution (resolver.ts, catalog.ts, duet-gateway.ts) abstracts Anthropic/OpenAI/OpenRouter/Duet Gateway behind a single resolveModelName call. This page explains the two-gate eviction system, cache-miss cost model, and BYOK/BYOC model routing.
  8. Invariants, Failure Modes & Safe-Change Rules: A synthesis of the core invariants that hold across all subsystems, the failure modes that break them, and how to change the codebase safely. Covers: TurnState as the only cross-process contract; memory pack rebuild triggers (the three-event rule); prompt-cache stability conditions; state machine history append-only guarantee; PGlite cross-process lock; transient-error retry scope; and which files are safe to change in isolation versus which touch multiple invariants.

Complete Markdown

# duet-agent Mental Model Wiki

> duet-agent is a TypeScript agent harness for jobs that outlive a single chat session: it combines a five-state relay state machine, PGlite-backed observational memory, and a protocol-first TurnRunner so any process—serverless function, cron job, or interactive TUI—can resume exactly where the last one stopped.

## Context Links

- [Agent index](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/llms.txt)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a)
- [GitHub repository](https://github.com/dzhng/duet-agent)

## Repository Metadata

- Repository: dzhng/duet-agent
- Generated: 2026-05-22T00:33:51.703Z
- Updated: 2026-05-22T00:34:49.281Z
- Runtime: Claude Code
- Format: Mental Model
- Pages: 8

## Page Index

- 01. [The Mental Model — Three Answers to Process Death](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/01-the-mental-model-three-answers-to-process-death.md) - The simplest accurate model of what duet-agent is and why its three subsystems (relay state machine, observational memory, TurnState snapshot) are one coherent answer to the same problem: work that must survive a dead process.
- 02. [TurnState & the Command/Event Protocol](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/02-turnstate-the-command-event-protocol.md) - TurnState is the only thing that must survive between process boundaries. This page traces the lifecycle of a TurnState from creation through prompt/answer/wake commands, terminal events (complete, ask, sleep, interrupted), and what each field owns—agent messages, stateMachine session, usage accounting, and todos.
- 03. [The Five State Kinds — Vocabulary of a Relay](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/03-the-five-state-kinds-vocabulary-of-a-relay.md) - Every relay is built from exactly five state kinds: agent (sub-agent with a prompt), script (shell command), poll (recurring external check), timer (pure wall-clock delay), and terminal (named business outcome). This page explains the invariants, input schema templating, and what each kind can and cannot do—including why integrations like GitHub or email are always script/poll states, never engine primitives.
- 04. [State Machine Execution Flow — How the Runner Agent Drives Transitions](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/04-state-machine-execution-flow-how-the-runner-agent-drives-transitions.md) - The runner agent—not a config file—selects the next state every turn. This page traces how state-machine-controller.ts dispatches a state, records audit events in StateMachineSession.history, emits sleep for poll/timer states, handles interruptions, and runs a terminal acknowledgment turn. Includes the carry-forward invariant and the mid-session start rule.
- 05. [Observational Memory — How Transcripts Become Durable Rows](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/05-observational-memory-how-transcripts-become-durable-rows.md) - Memory and compaction are the same primitive. After each turn an observer model reads the transcript and appends Observation rows to PGlite; when rows grow beyond a threshold a reflector condenses them. Embeddings run in a background worker (embedding-worker.ts) so foreground turns never block. This page covers the observe → reflect → embed pipeline, trigger conditions, failure isolation, and the image-to-text path that keeps screenshots recallable.
- 06. [Recall & the Frozen Context Pack — What Survives Into Every Prompt](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/06-recall-the-frozen-context-pack-what-survives-into-every-prompt.md) - Every turn is prefixed with a frozen two-layer memory pack: global cross-session observations ranked by recency half-life, and local session compaction. The pack rebuilds only on three specific events (initial load, reflector replacement, wire-shaping eviction) so the provider's prompt cache survives turn-over-turn. recall_memory tool uses hybrid RRF retrieval (pgvector cosine + tsvector keyword) to surface anything that missed the pack. This page explains pack structure, rebuild triggers, cache stability invariant, and RRF fusion.
- 07. [Wire Shaping & Model Resolution — Context Budget Enforcement](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/07-wire-shaping-model-resolution-context-budget-enforcement.md) - wire-shaping.ts enforces a byte budget (15 MB trigger, 80% target) and a token budget (200k default effectiveContext) on the dispatched message list. Eviction advances the WireGuardHorizon, trimming oldest messages in one block to minimize prompt-cache invalidations. Images get a fixed 1,600-token estimate to prevent base64 byte inflation from triggering early eviction. Model resolution (resolver.ts, catalog.ts, duet-gateway.ts) abstracts Anthropic/OpenAI/OpenRouter/Duet Gateway behind a single resolveModelName call. This page explains the two-gate eviction system, cache-miss cost model, and BYOK/BYOC model routing.
- 08. [Invariants, Failure Modes & Safe-Change Rules](https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/08-invariants-failure-modes-safe-change-rules.md) - A synthesis of the core invariants that hold across all subsystems, the failure modes that break them, and how to change the codebase safely. Covers: TurnState as the only cross-process contract; memory pack rebuild triggers (the three-event rule); prompt-cache stability conditions; state machine history append-only guarantee; PGlite cross-process lock; transient-error retry scope; and which files are safe to change in isolation versus which touch multiple invariants.

## Source File Index

- `evals/context-overflow-recovery.eval.ts`
- `evals/memory-reflect.eval.ts`
- `evals/outreach-lifecycle.eval.ts`
- `evals/prompt-cache.eval.ts`
- `evals/recall-memory-cross-session.eval.ts`
- `evals/recall-memory-implicit-triggers.eval.ts`
- `evals/source-of-truth-first.eval.ts`
- `evals/state-machine-interrupt-resume.eval.ts`
- `evals/state-machine-real-session-carry-forward.eval.ts`
- `evals/state-machine-routing.eval.ts`
- `evals/state-machine-tool-call-shape.eval.ts`
- `evals/thread-context-loss.eval.ts`
- `examples/state-machine.ts`
- `README.md`
- `src/guardrails/firewall.ts`
- `src/guardrails/semantic.ts`
- `src/index.ts`
- `src/memory/context-pack.ts`
- `src/memory/embedding-worker.ts`
- `src/memory/embedding.ts`
- `src/memory/loader.ts`
- `src/memory/migrations.ts`
- `src/memory/observation-groups.ts`
- `src/memory/observational-prompts.ts`
- `src/memory/observational.ts`
- `src/memory/pglite.ts`
- `src/memory/recall.ts`
- `src/memory/session.ts`
- `src/memory/storage.ts`
- `src/memory/store.ts`
- `src/model-resolution/catalog.ts`
- `src/model-resolution/duet-gateway.ts`
- `src/model-resolution/resolver.ts`
- `src/session/session-manager.ts`
- `src/session/session.ts`
- `src/turn-runner/prompts.ts`
- `src/turn-runner/shell-state-handle.ts`
- `src/turn-runner/state-compaction.ts`
- `src/turn-runner/state-machine-controller.ts`
- `src/turn-runner/state-machine-session.ts`
- `src/turn-runner/tools.ts`
- `src/turn-runner/transient-error.ts`
- `src/turn-runner/turn-runner.ts`
- `src/turn-runner/turn-state.ts`
- `src/turn-runner/usage-accounting.ts`
- `src/turn-runner/wire-shaping.ts`
- `src/types/protocol.ts`
- `src/types/state-machine.ts`
- `test/memory-recall.test.ts`
- `test/memory-reflect-planner.test.ts`
- `test/transient-error.test.ts`
- `test/turn-runner-state-machine-agent-events.test.ts`

---

## 01. The Mental Model — Three Answers to Process Death

> The simplest accurate model of what duet-agent is and why its three subsystems (relay state machine, observational memory, TurnState snapshot) are one coherent answer to the same problem: work that must survive a dead process.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/01-the-mental-model-three-answers-to-process-death.md
- Generated: 2026-05-22T00:31:17.815Z

### Source Files

- `README.md`
- `src/index.ts`
- `src/turn-runner/turn-runner.ts`
- `src/types/protocol.ts`
- `src/session/session.ts`
- `src/types/state-machine.ts`
- `src/memory/observational.ts`

# The Mental Model — Three Answers to Process Death

duet-agent is built around one central problem: how do you keep agent work alive when processes don't? Most harnesses offer no answer: when the chat session ends, the process dies, and the context and progress die with it. duet-agent answers the problem three times over, with three distinct subsystems that each defend against the same threat from a different angle. Understanding why all three exist, and why no one of them is enough on its own, is the fastest path to a reliable mental model of the whole framework.

This page explains what each subsystem is, what process-death scenario it defends against, and how the three fit together into one coherent architecture. Every claim is traced to a specific file and line in the repository.

---

## The One Problem, Stated Plainly

A long-running job (reach out to a prospect, wait two weeks for a reply, book a meeting) needs at least three things to survive across process restarts:

1. **The business-process position**: which step of the workflow was active, what was decided, and where to pick up.
2. **The conversation memory**: what the agent has learned, tried, and been told across many turns and sessions.
3. **The in-process runtime state**: what model, tools, prompt shape, todos, follow-up queue, and session options the agent currently holds.

Lose (1) and the agent restarts the workflow from scratch. Lose (2) and the agent repeats work and forgets instructions. Lose (3) and a resumed process cannot reconstruct its live runtime without a snapshot.

duet-agent maps each of these three threats to one subsystem.

---

## Subsystem One — The Relay State Machine

### What It Is

A relay (internally `StateMachineSession` + `StateMachineDefinition`) is an agent-routed state machine over five possible state kinds: `agent`, `script`, `poll`, `timer`, and `terminal`. The available states are defined in a `StateMachineDefinition`; which one runs next is always a live agent decision, not a hard-coded graph.

```typescript
// src/types/state-machine.ts:153-163
export interface StateMachineDefinition {
  name: string;
  prompt: string;
  states: StateMachineState[];
}

export type StateMachineState =
  | StateMachineAgentState
  | StateMachineScriptState
  | StateMachinePollState
  | StateMachineTimerState
  | StateMachineTerminalState;
```
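
To make the shape concrete, here is an illustrative definition for the outreach job described above. It is a sketch under stated assumptions: the real `StateMachineAgentState`, `StateMachineScriptState`, and sibling types are not reproduced on this page, so the `kind` discriminant, the per-kind fields (`command`, `intervalMs`, `waitMs`), and the script paths are hypothetical stand-ins.

```typescript
// Hypothetical stand-in types: the real StateMachineState variants live in
// src/types/state-machine.ts and their fields are not reproduced here.
type SketchState =
  | { kind: "agent"; name: string; prompt: string }
  | { kind: "script"; name: string; command: string }
  | { kind: "poll"; name: string; command: string; intervalMs: number }
  | { kind: "timer"; name: string; waitMs: number }
  | { kind: "terminal"; name: string };

interface SketchDefinition {
  name: string;
  prompt: string;
  states: SketchState[];
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical outreach relay; the script paths are placeholders.
const outreach: SketchDefinition = {
  name: "outreach",
  prompt: "Contact the prospect and book a meeting if they reply.",
  states: [
    { kind: "agent", name: "draft_email", prompt: "Write a short intro email." },
    { kind: "script", name: "send_email", command: "./send-email.sh" },
    { kind: "poll", name: "await_reply", command: "./check-inbox.sh", intervalMs: DAY_MS },
    { kind: "timer", name: "follow_up_delay", waitMs: 14 * DAY_MS },
    { kind: "terminal", name: "meeting_booked" },
  ],
};

// Groups state names by kind; the definition itself encodes no transitions.
function namesByKind(def: SketchDefinition): Record<SketchState["kind"], string[]> {
  const out: Record<SketchState["kind"], string[]> = {
    agent: [], script: [], poll: [], timer: [], terminal: [],
  };
  for (const s of def.states) out[s.kind].push(s.name);
  return out;
}

console.log(namesByKind(outreach).poll); // [ 'await_reply' ]
```

Note that the list carries no edges: nothing in the definition says `send_email` must follow `draft_email`. Consistent with the design described in this section, that choice belongs to the runner agent on each turn.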

### What Death Scenario It Solves

The relay owns **business-process position**. Because `StateMachineSession` is serialized into `TurnState` and flushed to disk after every state transition, a process can die at any point (between states, during a two-week `poll` wait, or mid-script) and resume from the last recorded position; an interrupted `agent` or `script` state is re-run, as the table below shows. The entire audit log (`history`), the current state name, the transition input, and the progress counters all survive.

```typescript
// src/types/state-machine.ts:170-217
export interface StateMachineSession {
  definition: StateMachineDefinition;
  prompt: string;
  currentState?: string;
  currentInput?: Record<string, unknown>;
  progress?: StateMachineProgress;
  history: StateMachineSessionEvent[];
  terminal?: StateMachineTerminalResult;
  terminalAcknowledged?: boolean;
  createdAt: number;
  updatedAt: number;
}
```
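
Because the session is plain data, surviving process death reduces to JSON serialization. The sketch below uses hypothetical stand-ins (a subset of the fields above, plus an assumed `{ type, state, at }` event shape; the real `StateMachineSessionEvent` is not shown here) to illustrate append-only recording followed by a JSON round trip.

```typescript
// Minimal stand-ins: a subset of the StateMachineSession fields shown above.
// The event shape ({ type, state, at }) is an assumption for this sketch.
interface SketchEvent {
  type: string;
  state?: string;
  at: number;
}

interface SketchSession {
  prompt: string;
  currentState?: string;
  history: SketchEvent[];
  createdAt: number;
  updatedAt: number;
}

// Returns a new session with one event appended; earlier history entries
// are never rewritten, so the audit log only grows.
function recordTransition(session: SketchSession, state: string, at: number): SketchSession {
  return {
    ...session,
    currentState: state,
    history: [...session.history, { type: "state_entered", state, at }],
    updatedAt: at,
  };
}

let session: SketchSession = { prompt: "outreach", history: [], createdAt: 0, updatedAt: 0 };
session = recordTransition(session, "send_email", 1000);
session = recordTransition(session, "await_reply", 2000);

// Everything is plain JSON, so a fresh process recovers the same values.
const restored: SketchSession = JSON.parse(JSON.stringify(session));
console.log(restored.currentState, restored.history.length); // await_reply 2
```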

The `sleep` terminal event is the relay's answer to long waits. When a `poll` or `timer` state has nothing to do yet, the runner emits `sleep` with a `wakeAt` timestamp. The outer layer (`Session`) persists the state, schedules a polling interval, and dispatches a `wake` command when the deadline arrives. No process needs to stay alive in between.

```typescript
// src/session/session.ts:549-567
private scheduleWake(terminal: Extract<TurnTerminalEvent, { type: "sleep" }>): void {
  this.cancelWake();
  const fire = (): void => {
    if (Date.now() < terminal.wakeAt) return;
    this.cancelWake();
    const state = this.runner.getState();
    if (!state || state.status !== "sleeping") return;
    this.dispatchTurn({ type: "wake" });
  };
  this.wakeTimer = setInterval(fire, WAKE_POLL_INTERVAL_MS);
  ...
}
```

### The Five State Kinds

| Kind | What It Does | Process-death behavior |
|---|---|---|
| `agent` | Runs a sub-agent with a prompt and optional skills | Output saved to history; agent re-runs if interrupted |
| `script` | Shells out to bash, curl, or any CLI | stdout/stderr captured; retries on resume |
| `poll` | Runs a command on an interval, sleeps between attempts | Emits `sleep`; outer layer owns the timer |
| `timer` | Waits for an absolute wall-clock time | Emits `sleep` with `wakeAt`; process can die |
| `terminal` | Finalizes the session with a named outcome | Written to `StateMachineSession.terminal`; persisted |

The key design choice: the runner agent, not a config graph, picks the next state on every step. This means a relay can start in the middle ("I already sent the email, just wait for a reply") without any workaround — the agent reads context and history and selects the appropriate state directly.

Sources: [src/types/state-machine.ts:219-347](src/types/state-machine.ts)

---

## Subsystem Two — Observational Memory

### What It Is

Observational memory is a PGlite database of derived text observations, extracted by a background observer/reflector pipeline from each turn's raw transcript. It is **not** a raw chat-log dump. After each agent run, the runner observes the latest unobserved transcript suffix and writes compact, text-only observations to persistent storage. When the observation log grows large, a reflector compresses groups into summaries.

The in-process budget math is explicit:

```typescript
// src/memory/observational.ts:73-80
export const MEMORY_BUDGET_RATIOS = {
  messageTokens: 0.6,       // raw-message tail
  observationTokens: 0.325, // local-memory pack in prefix
  globalContextTokenBudget: 0.075, // cross-session pack
} as const;
```
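
The three ratios sum to 1, so they partition the effective context window. The sketch below applies them to the 200k default `effectiveContext` cited on the wire-shaping page; `splitBudget` is a hypothetical helper, and rounding with `Math.round` is this sketch's choice, not necessarily what the real code does.

```typescript
// Same ratios as MEMORY_BUDGET_RATIOS above, copied so the sketch is standalone.
const RATIOS = {
  messageTokens: 0.6,
  observationTokens: 0.325,
  globalContextTokenBudget: 0.075,
} as const;

type BudgetKey = keyof typeof RATIOS;

// Splits an effective context window into the three token budgets.
function splitBudget(effectiveContext: number): Record<BudgetKey, number> {
  return {
    messageTokens: Math.round(effectiveContext * RATIOS.messageTokens),
    observationTokens: Math.round(effectiveContext * RATIOS.observationTokens),
    globalContextTokenBudget: Math.round(effectiveContext * RATIOS.globalContextTokenBudget),
  };
}

const budget = splitBudget(200000);
// messageTokens 120000, observationTokens 65000, globalContextTokenBudget 15000
console.log(budget);
```

At the default window, then, the raw-message tail gets 120k tokens, the local pack 65k, and the global cross-session pack 15k.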

### What Death Scenario It Solves

Observational memory owns **cross-session knowledge**. The raw `TurnState.agent.messages` array is the in-process conversation transcript. But raw messages are not the same as memory: they are too large to fit in the full context window across many sessions, and they describe tool invocations and intermediate reasoning rather than durable facts.

The observational memory pipeline converts raw transcript activity into a frozen, two-layer prefix: a global pack of cross-session observations (ranked within a 7.5% budget of `effectiveContext`) and a local pack holding this session's compaction summary (within 32.5%). Both layers are rebuilt only on three events (initial load, reflector replacement, and wire-shaping eviction), so the provider's prompt cache survives turn-over-turn.

This means a session resumed six months later on a different machine automatically gets both long-term cross-session knowledge and the current session's summary injected into the agent's prefix — without replaying raw messages.

The observer and reflector run as background work and never block a foreground turn; the runner waits for their durable writes only at observation boundaries, after a pi-agent run completes.

A critical consequence: without a `memoryDbPath`, the observational pipeline is entirely disabled. The README is explicit that this means no compaction — the raw transcript grows until a provider context-length error terminates the session.

Sources: [src/memory/observational.ts:55-80](src/memory/observational.ts), [src/turn-runner/turn-runner.ts:218-260](src/turn-runner/turn-runner.ts)

---

## Subsystem Three — TurnState Snapshot

### What It Is

`TurnState` is a serializable snapshot of everything the runner needs to reconstruct a live turn. It holds the agent's full message history, the state machine session if one is active, the current todo list, the follow-up queue, any queued but not yet executed commands, the active mode, runtime options (model, memory model, thinking level), and the overall lifecycle status.

```typescript
// src/types/protocol.ts:169-208
export interface TurnState {
  status: TurnStateStatus;
  mode: TurnMode;
  options?: TurnOptions;
  agent: AgentSession;
  stateMachine?: StateMachineSession;
  todos?: TurnTodo[];
  followUpQueue?: TurnFollowUpQueueEntry[];
  queuedCommands?: TurnCommand[];
}
```

### What Death Scenario It Solves

`TurnState` owns **in-process runtime state**. The relay knows which business state is active. Observational memory knows what the agent has learned across sessions. But the running process also holds transient state — which todos were in progress, which follow-up messages were queued, which model the user had switched to, whether a `wake` command was queued but not yet dispatched — that would be silently lost without an explicit snapshot.

`TurnState` captures all of it. The `Session` layer writes the snapshot to `state.json` on every terminal event (and on every `usage` tick for the context bar):

```typescript
// src/session/session.ts:652-663
private async writeStoredEnvelope(state: TurnState): Promise<void> {
  const payload: StoredSessionFile = {
    sessionId: this.id,
    updatedAt: Date.now(),
    state,
    sessionCostUsd: this.sessionCostUsd,
  };
  ...
  await writeFile(this.sessionFilePath(), `${JSON.stringify(payload, null, 2)}\n`, "utf-8");
}
```

And any fresh process passes that snapshot back via `runner.start({ state })`:

```typescript
// src/types/protocol.ts:316-340
export interface TurnStartCommand {
  type: "start";
  mode?: TurnMode;
  state?: TurnState;
  options?: TurnOptions;
  mcpServers?: Record<string, McpHttpServerConfig>;
}
```

There is one additional safety net: `TurnRunner` automatically compacts `TurnState` on every terminal event, before the snapshot leaves the runner. Eviction drops the oldest agent messages (while preserving tool-call/result pairs), so `state.json` cannot grow without bound no matter how many times observational memory compaction has run. This `autoStateCompaction` behavior is on by default with a 100 MB ceiling.
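
The eviction rule can be sketched as follows. This is not the `autoStateCompaction` implementation: the `Msg` shape is hypothetical, and the string length of the serialized list stands in for the byte count checked against the real 100 MB ceiling. What it does show is one way to honor the pairing rule: after dropping from the front, any leading tool results are dropped as well, so a result never outlives the assistant message that issued its call.

```typescript
// Hypothetical message shape; real pi-agent messages carry more fields.
type Msg =
  | { role: "user"; text: string }
  | { role: "assistant"; text: string; toolCallIds?: string[] }
  | { role: "toolResult"; toolCallId: string; text: string };

// Serialized length, used here as an approximation of bytes on disk.
const serializedSize = (msgs: Msg[]): number => JSON.stringify(msgs).length;

// Drops the oldest messages until the list fits within maxSize.
function evictOldest(msgs: Msg[], maxSize: number): Msg[] {
  let start = 0;
  while (start < msgs.length && serializedSize(msgs.slice(start)) > maxSize) {
    start += 1;
    // Never leave a tool result at the front without its originating call.
    while (start < msgs.length && msgs[start].role === "toolResult") start += 1;
  }
  return msgs.slice(start);
}

const transcript: Msg[] = [
  { role: "user", text: "a".repeat(50) },
  { role: "assistant", text: "calling", toolCallIds: ["t1"] },
  { role: "toolResult", toolCallId: "t1", text: "b".repeat(50) },
  { role: "user", text: "next" },
];

console.log(evictOldest(transcript, 200).map((m) => m.role)); // [ 'assistant', 'toolResult', 'user' ]
console.log(evictOldest(transcript, 40).map((m) => m.role)); // [ 'user' ]
```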

Sources: [src/types/protocol.ts:169-208](src/types/protocol.ts), [src/session/session.ts:626-667](src/session/session.ts)

---

## How the Three Subsystems Compose

```text
┌─────────────────────────────────────────────────────────┐
│                    Incoming Turn                         │
│          runner.start({ state }) / runner.turn()         │
└───────────────────────┬─────────────────────────────────┘
                        │
          ┌─────────────▼──────────────┐
          │         TurnState          │  ← "where am I right now?"
          │  mode / options / todos /  │
          │  followUpQueue / messages  │
          └──────┬─────────────┬───────┘
                 │             │
   ┌─────────────▼──┐    ┌─────▼──────────────────────────┐
   │ StateMachine   │    │  Observational Memory           │
   │ Session        │    │  (PGlite — observer/reflector)  │
   │                │    │                                 │
   │ "which step?"  │    │ "what have I learned?"          │
   │ audit log,     │    │ frozen 2-layer prefix,          │
   │ currentState,  │    │ global + local packs,           │
   │ terminal       │    │ hybrid recall tool              │
   └────────┬───────┘    └──────────────┬──────────────────┘
            │                           │
            └──────────┬────────────────┘
                       │
          ┌────────────▼───────────────┐
          │   Disk / state.json        │
          │   (Session.writeStored     │
          │    Envelope on every       │
          │    terminal event)         │
          └────────────────────────────┘
```

The three subsystems are layered, not redundant:

- **TurnState** is the envelope that any process can hold and hand back to a fresh runner. It is the cheapest unit of persistence — JSON on disk.
- **StateMachineSession** (embedded in TurnState) is the business-process ledger. It answers "what step are we on?" but says nothing about what the agent has learned.
- **Observational memory** (in PGlite, separate from TurnState) is the knowledge layer. It answers "what has the agent learned across all sessions?" but says nothing about workflow position or runtime options.

Each layer fails differently. Without observational memory, sessions resume with no long-term recall and no compaction — context overflows. Without the relay state machine, multi-step workflows restart from scratch on every process death. Without TurnState snapshots, the runner cannot reconstruct the exact model config, queued commands, or todo list a previous process was holding.

The protocol makes the composition explicit: every terminal event (`complete`, `ask`, `sleep`, `interrupted`) carries the latest `TurnState`. Callers that need process-level durability persist it; callers that do not can ignore it. Either way, the runner has already done the work of collecting the three answers into one serializable envelope.
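
A caller that wants process-level durability might handle terminal events roughly like this. Only the four event names, the `wakeAt` field on `sleep`, and the fact that each event carries `state` come from the protocol described here; `SnapshotState`, the `SnapshotStore` interface, and the returned action strings are hypothetical.

```typescript
// Hypothetical minimal shapes; the real TurnState has many more fields.
type SnapshotState = { status: string };

type TerminalEvent =
  | { type: "complete"; state: SnapshotState }
  | { type: "ask"; state: SnapshotState }
  | { type: "sleep"; wakeAt: number; state: SnapshotState }
  | { type: "interrupted"; state: SnapshotState };

interface SnapshotStore {
  save(state: SnapshotState): void;
}

// Persist the snapshot first, then tell the outer layer what comes next.
function onTerminal(event: TerminalEvent, store: SnapshotStore): string {
  store.save(event.state);
  switch (event.type) {
    case "complete":
      return "exit";
    case "ask":
      return "await answer";
    case "sleep":
      return `wake at ${new Date(event.wakeAt).toISOString()}`;
    case "interrupted":
      return "await prompt";
  }
}

const saved: SnapshotState[] = [];
const store: SnapshotStore = { save: (s) => { saved.push(s); } };
console.log(onTerminal({ type: "sleep", wakeAt: 0, state: { status: "sleeping" } }, store));
// wake at 1970-01-01T00:00:00.000Z
```

Saving before branching means that whichever path the caller takes next, including exiting, the latest snapshot is already on disk.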

Sources: [src/types/protocol.ts:643-692](src/types/protocol.ts), [src/turn-runner/turn-runner.ts:364-390](src/turn-runner/turn-runner.ts)

---

## Terminal Events as the Handoff Point

The protocol design expresses the mental model precisely. There are exactly four terminal event types, and every one of them carries a `TurnState`:

| Terminal event | When emitted | Process-death implication |
|---|---|---|
| `complete` | Agent or state machine finished | Safe to exit; resume with the same state if user follows up |
| `ask` | Agent needs structured human input | Resume with an `answer` command; state is preserved |
| `sleep` | Poll or timer state waiting on external signal | Outer layer schedules a wake; process can exit |
| `interrupted` | User or system aborted the turn | State at the interruption point is persisted; the next prompt starts a new turn from it |

The `sleep` event is where the three answers converge most visibly. When the relay enters a `poll` or `timer` state, it emits `sleep` with a `wakeAt` timestamp. The `Session` layer writes `state.json` (TurnState snapshot), arms a polling timer to fire a `wake` command when the wall clock reaches `wakeAt`, and any subscribing UI renders a sleeping banner. If the process dies before the timer fires, the persisted `TurnState` (with `status: "sleeping"`) is enough for the next process to synthesize a fresh `sleep` event and re-arm the timer.

```typescript
// src/session/session.ts:209-229
private async replaySleepFromResumedState(state: TurnState): Promise<void> {
  const scheduled = this.currentScheduledState(state);
  ...
  await this.handleTurnEvent({ type: "sleep", wakeAt, state });
}
```

This pattern — persist the snapshot, let the outer layer own the clock, make the runner stateless between turns — is why a serverless invocation, a container that didn't exist when the work started, or a different machine next month can all resume the same session by calling `runner.start({ state })` with the saved JSON.
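
The resume decision a fresh process faces can be reduced to a small pure function. This sketch assumes `wakeAt` has already been recovered from the snapshot (the real code does this in `currentScheduledState`, which is elided above) and reuses the `state.json` envelope fields shown earlier; `planResume` and `ResumePlan` are hypothetical names.

```typescript
// Envelope fields as written by writeStoredEnvelope; `state` is reduced to
// the one field this sketch reads.
interface StoredEnvelope {
  sessionId: string;
  updatedAt: number;
  state: { status: string };
  sessionCostUsd: number;
}

type ResumePlan =
  | { action: "wake" } // deadline already passed while no process was alive
  | { action: "rearm"; delayMs: number } // deadline still ahead
  | { action: "none" }; // not sleeping, nothing to schedule

// Hypothetical helper: wakeAt is assumed to be recovered from the snapshot.
function planResume(stored: StoredEnvelope, wakeAt: number | undefined, now: number): ResumePlan {
  if (stored.state.status !== "sleeping" || wakeAt === undefined) {
    return { action: "none" };
  }
  if (now >= wakeAt) return { action: "wake" };
  return { action: "rearm", delayMs: wakeAt - now };
}

const envelope: StoredEnvelope = JSON.parse(
  '{"sessionId":"s1","updatedAt":0,"state":{"status":"sleeping"},"sessionCostUsd":0}',
);
console.log(planResume(envelope, 5000, 2000)); // { action: 'rearm', delayMs: 3000 }
```

In the real `Session`, both outcomes go through the replayed `sleep` event and `scheduleWake`, whose `fire` check compares `Date.now()` against `wakeAt`; the sketch only makes the two branches explicit.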

Sources: [src/session/session.ts:192-229](src/session/session.ts), [src/types/protocol.ts:664-691](src/types/protocol.ts)

---

## Summary

duet-agent's three subsystems are not independent features bolted together. They are three answers to the same architectural threat — process death — each operating at a different layer of the stack. The relay state machine preserves business-process position as a serializable session object embedded in `TurnState`. Observational memory preserves cross-session knowledge as derived observations in a local PGlite database, injected as a frozen prefix so the provider's prompt cache survives turn boundaries. The `TurnState` snapshot preserves all live runtime state — options, todos, queued commands, conversation history — in a single JSON envelope that any fresh process can hand back to a runner. Remove any one of the three and a class of long-running jobs becomes unreliable; together they make the deployment model tractable: cron wakes a container, hands it the snapshot, runs one turn, persists, exits.

Sources: [src/turn-runner/turn-runner.ts:281-340](src/turn-runner/turn-runner.ts)

---

## 02. TurnState & the Command/Event Protocol

> TurnState is the only thing that must survive between process boundaries. This page traces the lifecycle of a TurnState from creation through prompt/answer/wake commands, terminal events (complete, ask, sleep, interrupted), and what each field owns—agent messages, stateMachine session, usage accounting, and todos.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/02-turnstate-the-command-event-protocol.md
- Generated: 2026-05-22T00:31:42.066Z

### Source Files

- `src/types/protocol.ts`
- `src/turn-runner/turn-state.ts`
- `src/turn-runner/turn-runner.ts`
- `src/turn-runner/usage-accounting.ts`
- `src/session/session-manager.ts`
- `src/session/session.ts`

# TurnState & the Command/Event Protocol

`TurnState` is the only data structure that must survive when the process exits and restarts. It is the serialized checkpoint that lets a session resume an agent conversation, a state-machine workflow, an in-progress todo list, and queued user prompts — all from a single JSON file. The command/event protocol built around `TurnState` is the control surface that every transport layer (CLI, HTTP server, daemon) speaks: commands drive the runner forward, and events report work or end the turn.

Understanding this protocol is the key to predicting what the system will do after a crash, a sleep, an interrupt, or a user follow-up that arrives while the agent is mid-task.

---

## The TurnState Snapshot

`TurnState` is defined as an interface in `src/types/protocol.ts`. Every field is intentionally carried forward: when the runner emits a terminal event, the receiving layer persists this snapshot and hands it back to the runner on the next `start` command.

```ts
// src/types/protocol.ts
export interface TurnState {
  status: TurnStateStatus;       // lifecycle of this snapshot
  mode: TurnMode;                // "agent" | "auto" | StateMachineDefinition
  options?: TurnOptions;         // model, memoryModel, thinkingLevel
  agent: AgentSession;           // full conversation transcript
  stateMachine?: StateMachineSession;  // non-null in state-machine mode
  todos?: TurnTodo[];            // current work plan
  followUpQueue?: TurnFollowUpQueueEntry[];  // buffered pending user prompts
  queuedCommands?: TurnCommand[];            // commands not yet executed
}
```

Sources: [src/types/protocol.ts:169-208](src/types/protocol.ts)

### Field ownership

| Field | Owner | Why it survives |
|---|---|---|
| `status` | Runner | Tells the receiver whether to arm a wake timer, show a question, or accept prompts. |
| `mode` | Runner (set at start) | Prevents `auto` sessions from switching to a constrained definition after resume. |
| `options` | Runner (set at start) | Keeps model and thinking level stable across all pi-agent turns inside one session. |
| `agent` | pi-agent transcript | The entire conversation history; state-machine sessions share it as the parent transcript. |
| `stateMachine` | `StateMachineController` | Progress, definition, and current state for long-running workflows. |
| `todos` | todo tool | Preserved so a resumed runner does not lose the work plan. |
| `followUpQueue` | Runner | Multimodal payloads that were queued but not yet delivered. |
| `queuedCommands` | Runner | Commands that arrived while non-agent work was driving the turn; replayed on resume. |

### TurnStateStatus lifecycle

```text
running  (new turn starts, wake received, prompt absorbed)
  │
  ├── completed | failed | cancelled   (complete event)
  ├── sleeping                         (sleep event, wakeAt set)
  ├── waiting_for_human                (ask event)
  └── interrupted                      (interrupted event)
```

Sources: [src/types/protocol.ts:161-167](src/types/protocol.ts)

---

## Command Types

Commands are the inputs that drive one turn. The three live-turn commands (`prompt`, `answer`, and `wake`) share the `TurnCommand` union; `start` is separate because it is a setup step, not a turn.

### start — session setup

`TurnStartCommand` bootstraps the runner: loads memory and skills, hydrates `state` (fresh or persisted), and emits `turn_started`. No LLM work happens here.

```ts
export interface TurnStartCommand {
  type: "start";
  mode?: TurnMode;
  state?: TurnState;    // provide to resume a previous session
  options?: TurnOptions;
  mcpServers?: Record<string, McpHttpServerConfig>;
}
```

Sources: [src/types/protocol.ts:317-340](src/types/protocol.ts)

`TurnRunner.start()` calls `createInitialTurnState` only for a fresh session (no `command.state`). When `command.state` is present, the persisted snapshot, including its agent messages and state-machine history, is used as-is, with only `options` re-resolved:

```ts
// src/turn-runner/turn-runner.ts:326-333
const state = command.state
  ? {
      ...command.state,
      options: this.resolveTurnOptions(startOptions, command.state.options),
    }
  : createInitialTurnState(mode, this.resolveTurnOptions(startOptions));
this.stateMachineController.hydrate(state.stateMachine);
```

Sources: [src/turn-runner/turn-runner.ts:326-334](src/turn-runner/turn-runner.ts)

### prompt — user message

`TurnPromptCommand` delivers a user message (with optional images) against the current state.

```ts
export interface TurnPromptCommand {
  type: "prompt";
  message: string;
  behavior: TurnPromptBehavior;   // "steer" | "follow_up"
  images?: TurnPromptImage[];
}
```

`behavior` controls delivery when a pi-agent session is already active:
- `"steer"` — calls `agent.steer()`, injecting the message into the running turn as immediate context.
- `"follow_up"` — calls `agent.followUp()`, queued until the current pi-agent turn finishes.

Sources: [src/types/protocol.ts:377-392](src/types/protocol.ts)
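The difference between the two behaviors can be modeled with a toy agent. This is a sketch only: `ToyAgent`, its `context` array, and `finishTurn()` are hypothetical stand-ins, not the pi-agent API. The point is that `steer` lands in the running turn while `follow_up` waits for the turn boundary.

```typescript
type TurnPromptBehavior = "steer" | "follow_up";

// Toy stand-in for a running pi-agent turn (hypothetical API, not pi-agent's).
class ToyAgent {
  readonly context: string[] = []; // messages visible to the current turn
  private readonly pending: string[] = []; // follow-ups waiting for the turn to end

  steer(message: string): void {
    this.context.push(message); // visible immediately, mid-turn
  }

  followUp(message: string): void {
    this.pending.push(message); // deferred until finishTurn()
  }

  // Called when the current turn ends; returns the next queued prompt, if any.
  finishTurn(): string | undefined {
    return this.pending.shift();
  }
}

function deliverPrompt(agent: ToyAgent, message: string, behavior: TurnPromptBehavior): void {
  if (behavior === "steer") agent.steer(message);
  else agent.followUp(message);
}

const agent = new ToyAgent();
deliverPrompt(agent, "use the staging DB", "steer");
deliverPrompt(agent, "then write a summary", "follow_up");
console.log(agent.context); // [ 'use the staging DB' ]
console.log(agent.finishTurn()); // then write a summary
```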

### answer — structured question response

`TurnAnswerCommand` serializes a picker answer into XML and delivers it as a prompt. It uses the same `behavior` field and eventually routes through `TurnRunner.prompt()`:

```ts
// src/turn-runner/turn-runner.ts:675-683
protected async answer(command: TurnAnswerCommand): Promise<TurnTerminalEvent> {
  const message = this.commandToUserMessage(command);
  return this.prompt({ type: "prompt", message, behavior: command.behavior, images: command.images });
}
```

Sources: [src/turn-runner/turn-runner.ts:675-683](src/turn-runner/turn-runner.ts)
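The source does not show the XML shape that `commandToUserMessage` produces. The sketch below assumes a hypothetical `<answers>`/`<answer>` layout purely to illustrate the idea: answers become escaped XML text, and the result travels the ordinary prompt path with the caller's `behavior`. `PickerAnswer` and `answersToXml` are hypothetical names.

```typescript
// Hypothetical answer shape; the real TurnAnswerCommand fields may differ.
interface PickerAnswer {
  question: string;
  selected: string[];
}

// Escape the characters that are unsafe in XML text and attribute values.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize answers into an XML string suitable for use as a prompt message.
function answersToXml(answers: PickerAnswer[]): string {
  const items = answers.map(
    (a) => `  <answer question="${escapeXml(a.question)}">${a.selected.map(escapeXml).join(", ")}</answer>`,
  );
  return ["<answers>", ...items, "</answers>"].join("\n");
}

const message = answersToXml([{ question: "Deploy to?", selected: ["staging & prod"] }]);
// The message would then travel as { type: "prompt", message, behavior }.
console.log(message);
```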

### wake — resume a sleeping session

`TurnWakeCommand` is the simplest command. If the runner's state is `"sleeping"`, it calls `stateMachineController.wake()` and drives the resulting state-machine step. If the runner is not sleeping, or the controller returns no step to drive, it immediately returns `complete` with `"Nothing to wake."`, which makes `wake` safe to replay:

```ts
// src/turn-runner/turn-runner.ts:685-703
protected async wake(): Promise<TurnTerminalEvent> {
  const originalState = this.requireRunnerState();
  const state: TurnState = { ...originalState, status: "running" };
  if (originalState.status === "sleeping") {
    const result = await this.stateMachineController.wake();
    if (result) return this.driveStateMachineResult(result, state);
  }
  return { type: "complete", status: "completed", state: originalState, result: "Nothing to wake." };
}
```

Sources: [src/turn-runner/turn-runner.ts:685-703](src/turn-runner/turn-runner.ts)

---

## Terminal Events

Every turn ends with exactly one terminal event. All four carry the updated `TurnState` so the receiver can persist and resume.

```ts
// src/types/protocol.ts:687-691
export type TurnTerminalEvent =
  | TurnAskEvent
  | TurnCompletedEvent
  | TurnInterruptedEvent
  | TurnSleepEvent;
```

Sources: [src/types/protocol.ts:687-691](src/types/protocol.ts)

### complete

`TurnCompletedEvent` is emitted when the parent agent finishes its turn. `status` is one of `"completed" | "failed" | "cancelled"`.

### ask

`TurnAskEvent` pauses the turn and surfaces structured questions to the caller. It sets `state.status = "waiting_for_human"`. The caller sends a `TurnAnswerCommand` to continue.

### sleep

`TurnSleepEvent` is emitted when a state-machine poll or timer state needs to wait. It carries a `wakeAt` Unix timestamp in milliseconds. The session layer (`Session`) persists the state and schedules a wall-clock wakeup:

```ts
// src/session/session.ts:549-567
private scheduleWake(terminal: Extract<TurnTerminalEvent, { type: "sleep" }>): void {
  this.cancelWake();
  const fire = (): void => {
    if (Date.now() < terminal.wakeAt) return;
    this.cancelWake();
    const state = this.runner.getState();
    if (!state || state.status !== "sleeping") return;
    this.dispatchTurn({ type: "wake" });
  };
  this.wakeTimer = setInterval(fire, WAKE_POLL_INTERVAL_MS);
  const remaining = terminal.wakeAt - Date.now();
  if (remaining < WAKE_POLL_INTERVAL_MS) {
    this.wakeFastPath = setTimeout(fire, Math.max(0, remaining));
  }
}
```

Sources: [src/session/session.ts:549-567](src/session/session.ts)

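The poll interval is 30 seconds (`WAKE_POLL_INTERVAL_MS = 30_000`). Timers measured on a monotonic clock can pause while the OS is suspended, so a single long `setTimeout` may fire late by the length of the suspension; re-checking `Date.now()` on every interval bounds that delay to one poll period after resume. The `setTimeout` fast path covers wakes that are due sooner than one interval.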
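The `scheduleWake` pattern can be isolated into a small class with an injectable clock, which makes it testable without real waiting. This is a sketch, not the `Session` code: `WallClockWake`, its constructor parameters, and the public `fire` method are hypothetical; only the interval-plus-fast-path structure and the 30-second default come from the source.

```typescript
type Clock = () => number;

// Hypothetical re-creation of the Session wake pattern: a coarse interval that
// re-checks the wall clock, plus a one-shot fast path for short waits.
class WallClockWake {
  private interval: ReturnType<typeof setInterval> | undefined;
  private fastPath: ReturnType<typeof setTimeout> | undefined;
  private readonly onWake: () => void;
  private readonly now: Clock;
  private readonly pollMs: number;

  constructor(onWake: () => void, now: Clock = Date.now, pollMs = 30_000) {
    this.onWake = onWake;
    this.now = now;
    this.pollMs = pollMs;
  }

  schedule(wakeAt: number): void {
    this.cancel();
    this.interval = setInterval(() => this.fire(wakeAt), this.pollMs);
    const remaining = wakeAt - this.now();
    if (remaining < this.pollMs) {
      this.fastPath = setTimeout(() => this.fire(wakeAt), Math.max(0, remaining));
    }
  }

  // Public so a caller can force a check, for example after an OS resume event.
  fire(wakeAt: number): void {
    if (this.now() < wakeAt) return; // too early: keep waiting
    this.cancel();
    this.onWake();
  }

  cancel(): void {
    clearInterval(this.interval);
    clearTimeout(this.fastPath);
    this.interval = undefined;
    this.fastPath = undefined;
  }
}
```

Because `fire` compares against the injected clock rather than trusting the timer, an early or late callback is harmless: early calls return, and the first call at or after `wakeAt` wakes exactly once.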

### interrupted

`TurnInterruptedEvent` is emitted when `runner.interrupt()` is called mid-turn. The runner marks `state.status = "interrupted"`, aborts the parent pi-agent, and clears all queues. If state-machine work was active, the controller records an interrupt marker on the session.

---

## During-Turn Events

These events stream while the runner is still working. They update the UI but do not end the turn.

| Event type | Payload | Purpose |
|---|---|---|
| `step` | `TurnStep` (text, reasoning, tool call, system) | Streaming agent progress |
| `todos` | `TurnTodo[]` | Updated work plan |
| `follow_up_queue` | `TurnFollowUpQueueEntry[]` | Updated buffer of pending user prompts |
| `state_machine` | `StateMachineSession` | Full session snapshot for Kanban rendering |
| `memory` | `ObservationalMemoryActivityEvent` | Memory observation/reflection activity |
| `usage` | `TurnUsageFields` | Running token cost after each LLM boundary |
| `system` | `level`, `message` | Diagnostic info, warnings, error notices |

Sources: [src/types/protocol.ts:677-684](src/types/protocol.ts)

---

## Turn Lifecycle Sequence

```mermaid
sequenceDiagram
    participant Caller as CLI / TUI / Session
    participant Runner as TurnRunner
    participant Agent as pi-agent (parent)
    participant SM as StateMachineController

    Caller->>Runner: start({ type:"start", state? })
    Runner->>Runner: ensureMemoryLoaded(), ensureSkillsLoaded()
    Runner-->>Caller: emit turn_started (TurnState)

    Caller->>Runner: turn({ type:"prompt", message, behavior })
    Runner->>Agent: agent.prompt(text, images)
    Agent-->>Runner: step events (text_delta, tool_call, ...)
    Runner-->>Caller: emit step, todos, usage (during events)
    Agent-->>Runner: message_end (usage)
    Runner->>SM: runDecision / runState (if state-machine mode)
    SM-->>Runner: StateMachineExecutionResult
    Runner-->>Caller: emit state_machine, usage
    Runner-->>Caller: emit complete | ask | sleep | interrupted (terminal)
    Caller->>Caller: Session persists state.json
```

Sources: [src/turn-runner/turn-runner.ts:342-390](src/turn-runner/turn-runner.ts), [src/session/session.ts:436-455](src/session/session.ts)

---

## State Persistence and Resume

The `Session` class is the persistence owner. After every terminal event it writes `state.json` inside the session's directory (`~/.duet/sessions/<id>/state.json`):

```ts
// src/session/session.ts:652-663
private async writeStoredEnvelope(state: TurnState): Promise<void> {
  const payload: StoredSessionFile = {
    sessionId: this.id,
    updatedAt: Date.now(),
    state,
    sessionCostUsd: this.sessionCostUsd,
  };
  if (this.lastUsage !== undefined) payload.lastUsage = this.lastUsage;
  await writeFile(this.sessionFilePath(), `${JSON.stringify(payload, null, 2)}\n`, "utf-8");
}
```

Sources: [src/session/session.ts:652-663](src/session/session.ts)

On resume, `Session.start()` reads `state.json`, passes the stored `TurnState` through `TurnStartCommand.state`, and re-arms the wake timer when `state.status === "sleeping"`.
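A minimal sketch of that resume path, assuming the `state.json` envelope written by `writeStoredEnvelope` (`sessionId`, `updatedAt`, `state`, ...). `loadStartCommand` is a hypothetical helper, and the types are trimmed to the fields it touches; the real `Session.start()` also handles options, MCP servers, and error cases.

```typescript
import { mkdtemp, readFile, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Trimmed, hypothetical views of the protocol types.
interface TurnStateLite {
  status: string;
  [key: string]: unknown;
}
interface StoredSessionFileLite {
  sessionId: string;
  updatedAt: number;
  state: TurnStateLite;
}

// Read a persisted envelope and turn it into a resume `start` command.
async function loadStartCommand(sessionDir: string) {
  const raw = await readFile(join(sessionDir, "state.json"), "utf-8");
  const stored = JSON.parse(raw) as StoredSessionFileLite;
  return {
    command: { type: "start" as const, state: stored.state },
    // Only sleeping sessions need their wake timer re-armed.
    rearmWake: stored.state.status === "sleeping",
  };
}

// Usage: write a sample envelope to a temp dir, then resume from it.
async function demo(): Promise<void> {
  const dir = await mkdtemp(join(tmpdir(), "duet-session-"));
  const envelope = { sessionId: "s1", updatedAt: Date.now(), state: { status: "sleeping", mode: "agent" } };
  await writeFile(join(dir, "state.json"), `${JSON.stringify(envelope, null, 2)}\n`, "utf-8");
  const { command, rearmWake } = await loadStartCommand(dir);
  console.log(command.state.status, rearmWake); // sleeping true
}
void demo();
```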

The `SessionManager` creates or resumes sessions by session id. New sessions get a fresh `nanoid`-based id; resumed sessions load their stored state:

```ts
// src/session/session-manager.ts:86-92
resume(sessionId: string): Session {
  const existing = this.sessions.get(sessionId);
  if (existing) return existing;
  const session = this.createSession(sessionId, true);
  this.sessions.set(sessionId, session);
  return session;
}
```

Sources: [src/session/session-manager.ts:86-92](src/session/session-manager.ts)

---

## Usage Accounting

Token usage is tracked across every LLM boundary (parent worker plus each state-machine agent). The `addUsage` function is a pure accumulator — earlier in-place mutation semantics were deliberately replaced to eliminate silent discard bugs:

```ts
// src/turn-runner/usage-accounting.ts:12-33
export function addUsage(
  a: TurnTokenUsage | undefined,
  b: TurnTokenUsage | undefined,
): TurnTokenUsage | undefined {
  if (!a && !b) return undefined;
  if (!a) return cloneUsage(b!);
  if (!b) return cloneUsage(a);
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: { input: ..., output: ..., cacheRead: ..., cacheWrite: ..., total: ... },
  };
}
```

Sources: [src/turn-runner/usage-accounting.ts:12-33](src/turn-runner/usage-accounting.ts)
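The excerpt above elides the `cost` arithmetic, so here is a self-contained sketch of the same pure-accumulator idea. The `Usage` type is a hypothetical stand-in for `TurnTokenUsage`, and summing each cost field independently is an assumption. The property that matters is that neither argument is mutated and every call returns a fresh object, so folding per-call usages into a turn total is safe.

```typescript
interface Cost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  total: number;
}
interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: Cost;
}

const cloneUsage = (u: Usage): Usage => ({ ...u, cost: { ...u.cost } });

// Pure: returns a new object and never mutates `a` or `b`.
function addUsageSketch(a: Usage | undefined, b: Usage | undefined): Usage | undefined {
  if (!a && !b) return undefined;
  if (!a) return cloneUsage(b!);
  if (!b) return cloneUsage(a);
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: {
      input: a.cost.input + b.cost.input,
      output: a.cost.output + b.cost.output,
      cacheRead: a.cost.cacheRead + b.cost.cacheRead,
      cacheWrite: a.cost.cacheWrite + b.cost.cacheWrite,
      total: a.cost.total + b.cost.total,
    },
  };
}

// Fold per-call usages (parent + state agents) into one turn total.
const calls: Usage[] = [
  { input: 100, output: 20, cacheRead: 0, cacheWrite: 0, totalTokens: 120,
    cost: { input: 0.5, output: 0.25, cacheRead: 0, cacheWrite: 0, total: 0.75 } },
  { input: 50, output: 10, cacheRead: 30, cacheWrite: 0, totalTokens: 90,
    cost: { input: 0.25, output: 0.125, cacheRead: 0.125, cacheWrite: 0, total: 0.5 } },
];
const turnUsage = calls.reduce<Usage | undefined>((acc, u) => addUsageSketch(acc, u), undefined);
```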

The protocol exposes two distinct token totals in `TurnUsageFields`:

- **`turnUsage`**: cumulative sum across every LLM call in the turn (parent + all state agents). Use this for cost accounting.
- **`lastMessageUsage`**: exact provider-reported usage of the most recent parent call. Use this for context-window pressure display.

`contextWindowUsage` provides a heuristic segment breakdown (`systemPrompt`, `messages`, `localMemory`, `globalMemory`) that `scaleContextWindowUsageToTotalTokens` rescales to sum exactly to `lastMessageUsage.totalTokens` before emission.

Sources: [src/types/protocol.ts:581-620](src/types/protocol.ts)
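One way to make heuristic segments sum exactly to a provider-reported total is proportional scaling followed by largest-remainder rounding. The source does not show how `scaleContextWindowUsageToTotalTokens` distributes rounding error, so treat this as an illustrative sketch; `scaleSegmentsToTotal` is a hypothetical name, and it assumes `total` is a non-negative integer.

```typescript
type Segments = Record<string, number>;

// Scale estimated segment sizes so their integer sum equals `total` exactly.
function scaleSegmentsToTotal(estimates: Segments, total: number): Segments {
  const keys = Object.keys(estimates);
  const estimatedSum = keys.reduce((sum, k) => sum + estimates[k], 0);
  if (keys.length === 0 || estimatedSum === 0) {
    // No estimates to scale against: report every segment as zero.
    return Object.fromEntries(keys.map((k) => [k, 0]));
  }
  const scaled = keys.map((k) => (estimates[k] / estimatedSum) * total);
  const floors = scaled.map(Math.floor);
  let leftover = total - floors.reduce((sum, v) => sum + v, 0);
  // Hand the remaining tokens to the segments with the largest fractional parts.
  const order = keys
    .map((_, i) => i)
    .sort((i, j) => (scaled[j] - floors[j]) - (scaled[i] - floors[i]));
  for (const i of order) {
    if (leftover <= 0) break;
    floors[i] += 1;
    leftover -= 1;
  }
  return Object.fromEntries(keys.map((k, i) => [k, floors[i]]));
}

const breakdown = scaleSegmentsToTotal(
  { systemPrompt: 1000, messages: 3000, localMemory: 500, globalMemory: 500 },
  12_345,
);
```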

---

## Initial State Creation and snapshotState

`createInitialTurnState` produces the zeroed snapshot for a fresh session:

```ts
// src/turn-runner/turn-state.ts:9-18
export function createInitialTurnState(mode: TurnMode, options?: TurnOptions): TurnState {
  return {
    status: "running",
    mode,
    options,
    agent: { status: "running", messages: [] },
  };
}
```

Sources: [src/turn-runner/turn-state.ts:9-18](src/turn-runner/turn-state.ts)

The runner's `snapshotState` method is the single choke point that reconciles in-flight pi-agent messages, the `StateMachineController`'s live session, and the current todo/followUpQueue/queuedCommands arrays into one consistent snapshot. Every state leaving the runner — via emit, return, or `getState()` — passes through this method:

```ts
// src/turn-runner/turn-runner.ts:1004-1022
private snapshotState(state: TurnState): TurnState {
  const parentAgent = this.parentAgent
    ? { ...state.agent, status: state.agent.status, messages: this.parentAgent.state.messages }
    : state.agent;
  const snapshot: TurnState = {
    ...state,
    agent: parentAgent,
    stateMachine: this.stateMachineController.getSession(),
    todos: copyOptionalArray(state.todos ?? this.state?.todos),
    followUpQueue: copyOptionalArray(state.followUpQueue ?? this.state?.followUpQueue),
    queuedCommands: copyOptionalArray(state.queuedCommands ?? this.state?.queuedCommands),
  };
  return this.applyAutoStateCompaction(snapshot);
}
```

Sources: [src/turn-runner/turn-runner.ts:1004-1022](src/turn-runner/turn-runner.ts)

Auto-compaction is enabled by default and evicts the oldest messages when the state snapshot exceeds the 100 MB ceiling, preventing unbounded `state.json` growth from wedging persistence.
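The source states only the policy: evict the oldest messages once the snapshot passes a 100 MB ceiling. A minimal sketch of such a policy follows, assuming size is the UTF-8 byte length of the serialized snapshot and that eviction simply drops messages from the front. `compactOldestFirst` and `SnapshotLite` are hypothetical; the real `applyAutoStateCompaction` may preserve system messages or tool-call pairs, which this sketch does not.

```typescript
interface SnapshotLite {
  agent: { messages: unknown[] };
  [key: string]: unknown;
}

// Hypothetical reading of "100 MB" as 100 MiB.
const DEFAULT_MAX_STATE_BYTES = 100 * 1024 * 1024;

function serializedBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

// Drop the oldest messages until the serialized snapshot fits under maxBytes.
// Serializes the whole snapshot once, then subtracts each dropped message's
// size plus at most one separating comma, so the loop stays linear.
function compactOldestFirst<T extends SnapshotLite>(snapshot: T, maxBytes = DEFAULT_MAX_STATE_BYTES): T {
  const messages = snapshot.agent.messages;
  let size = serializedBytes(snapshot);
  let dropped = 0;
  while (dropped < messages.length && size > maxBytes) {
    size -= serializedBytes(messages[dropped]) + 1;
    dropped += 1;
  }
  // Return a new snapshot; the input object and its array are left untouched.
  return { ...snapshot, agent: { ...snapshot.agent, messages: messages.slice(dropped) } } as T;
}
```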

---

## Invariants and Failure Modes

| Invariant | Where enforced |
|---|---|
| Exactly one terminal event per turn chain | `runTurnChain` emits after `drainQueuedTurnCommands` completes |
| `start` must precede `turn` | `requireStarted()` throws otherwise |
| Only one parent agent worker active at a time | `parentAgentRunning` guard in `runAgentWorker` |
| `wake` on a non-sleeping session is a no-op | Early return with `"Nothing to wake."` |
| Interrupted turn cost is not persisted | `sessionCostUsd` only increments on the terminal event |
| A sleeping `state.status` is restored after a user prompt if the state machine is still waiting | `restoreSleepAfterTurn` flag in `Session` |

The protocol's constraint that every turn must end with a terminal event means callers can safely `await runner.turn(command)` and then persist the returned `state` — the state snapshot is always consistent and complete at that boundary, regardless of whether the turn produced agent work, a state-machine transition, or an immediate `ask`.

Sources: [src/turn-runner/turn-runner.ts:364-390](src/turn-runner/turn-runner.ts), [src/session/session.ts:481-500](src/session/session.ts)
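That guarantee keeps a transport's outer loop simple. The sketch below uses trimmed, hypothetical versions of the terminal event types (including a hypothetical `questions` field on `ask`) and a hypothetical `TurnRunnerLike` interface. It shows one reasonable way to persist after every turn and branch on the terminal event type; it is not the actual `Session` code.

```typescript
// Trimmed, hypothetical shapes; see src/types/protocol.ts for the real ones.
interface CallerStateLite {
  status: string;
}
type TerminalLite =
  | { type: "complete"; status: "completed" | "failed" | "cancelled"; state: CallerStateLite; result?: string }
  | { type: "ask"; state: CallerStateLite; questions: unknown[] }
  | { type: "sleep"; state: CallerStateLite; wakeAt: number }
  | { type: "interrupted"; state: CallerStateLite };

interface TurnRunnerLike {
  turn(command: { type: "prompt"; message: string; behavior: "steer" | "follow_up" }): Promise<TerminalLite>;
}

// Run one prompt, persist the returned snapshot, then decide what to do next.
async function runOnePrompt(
  runner: TurnRunnerLike,
  message: string,
  persist: (state: CallerStateLite) => Promise<void>,
): Promise<string> {
  const terminal = await runner.turn({ type: "prompt", message, behavior: "follow_up" });
  await persist(terminal.state); // the snapshot is consistent at this boundary
  switch (terminal.type) {
    case "complete":
      return `done (${terminal.status})`;
    case "ask":
      return `needs ${terminal.questions.length} answer(s)`;
    case "sleep":
      return `sleeping until ${new Date(terminal.wakeAt).toISOString()}`;
    case "interrupted":
      return "interrupted";
  }
}
```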

---

## 03. The Five State Kinds — Vocabulary of a Relay

> Every relay is built from exactly five state kinds: agent (sub-agent with a prompt), script (shell command), poll (recurring external check), timer (pure wall-clock delay), and terminal (named business outcome). This page explains the invariants, input schema templating, and what each kind can and cannot do—including why integrations like GitHub or email are always script/poll states, never engine primitives.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/03-the-five-state-kinds-vocabulary-of-a-relay.md
- Generated: 2026-05-22T00:28:40.358Z

### Source Files

- `src/types/state-machine.ts`
- `src/turn-runner/shell-state-handle.ts`
- `examples/state-machine.ts`
- `evals/state-machine-tool-call-shape.eval.ts`
- `evals/outreach-lifecycle.eval.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/types/state-machine.ts](src/types/state-machine.ts)
- [src/turn-runner/shell-state-handle.ts](src/turn-runner/shell-state-handle.ts)
- [src/turn-runner/state-machine-controller.ts](src/turn-runner/state-machine-controller.ts)
- [src/turn-runner/state-machine-session.ts](src/turn-runner/state-machine-session.ts)
- [examples/state-machine.ts](examples/state-machine.ts)
- [evals/outreach-lifecycle.eval.ts](evals/outreach-lifecycle.eval.ts)
- [evals/state-machine-tool-call-shape.eval.ts](evals/state-machine-tool-call-shape.eval.ts)
</details>

# The Five State Kinds — Vocabulary of a Relay

A relay (called a `StateMachineDefinition` in the codebase) is composed from exactly five state kinds: `agent`, `script`, `poll`, `timer`, and `terminal`. Every state in every relay definition must be one of these five. No sixth kind exists, and no external integration—email, GitHub, Slack, Calendly, webhooks—is a built-in primitive. The engine's deliberate design philosophy, stated directly in the type comments, is that "any external system with an API or CLI is a bash script away." That constraint is what makes the vocabulary minimal and the relay engine portable.

This page explains the invariants, input schema templating, and execution contract for each kind, then explores what each kind can and cannot do and why that boundary exists.

---

## The Common Base: `StateMachineBaseState`

Before examining each kind, note that every state shares a base structure:

```typescript
// src/types/state-machine.ts:249-260
export interface StateMachineBaseState {
  name: string;
  when?: string;
  inputSchema?: Record<string, unknown>;
}
```

- **`name`** — the string key the runner agent uses to select this state; recorded in every session history entry.
- **`when`** — optional guidance prose telling the runner agent when this state is appropriate. It is not evaluated by the engine; it is only injected into the runner's prompt so the LLM can make a better routing decision.
- **`inputSchema`** — an optional JSON Schema object. When present, the runner agent must supply a `Record<string, unknown>` that satisfies the schema when it selects this state. The controller validates the provided input and then stores it as `currentInput` on the session, making it available for template rendering.

### Template Rendering

Both `agent` and `script`/`poll` states accept `{{ input.fieldName }}` placeholders in their `prompt` or `command` fields. The engine renders these placeholders before execution using `renderTemplate` in `shell-state-handle.ts`:

```typescript
// src/turn-runner/shell-state-handle.ts:185-193
export function renderTemplate(template: string, input: Record<string, unknown>): string {
  return template.replace(TEMPLATE_PLACEHOLDER_PATTERN, (placeholder) => {
    const path = TEMPLATE_PLACEHOLDER_CAPTURE_PATTERN.exec(placeholder)?.[1];
    if (!path) return "";
    const value = readPath(input, path);
    if (value === undefined || value === null) return "";
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}
```

Dot-path access is supported (e.g., `{{ input.reply.text }}`). Non-string values are serialized as JSON. Missing values render as empty string. This rendering happens immediately before execution, using `session.currentInput` as the source:

```typescript
// src/turn-runner/state-machine-controller.ts:267, 304
const prompt = renderTemplate(state.prompt, this.session?.currentInput ?? {});
const command = renderTemplate(state.command, this.session?.currentInput ?? {});
```
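Because `TEMPLATE_PLACEHOLDER_PATTERN` and `readPath` are not shown, here is a self-contained approximation of the same rules: dot paths, JSON for non-string values, and an empty string for missing values. The regular expression and the `readPath` implementation are assumptions, not the engine's exact definitions.

```typescript
// Hypothetical pattern; the engine's real TEMPLATE_PLACEHOLDER_PATTERN may differ.
const PLACEHOLDER = /\{\{\s*input\.([A-Za-z0-9_.]+)\s*\}\}/g;

// Walk a dot path such as "reply.text" through nested objects.
function readPath(root: Record<string, unknown>, path: string): unknown {
  let current: unknown = root;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

function renderTemplateSketch(template: string, input: Record<string, unknown>): string {
  return template.replace(PLACEHOLDER, (_match: string, path: string) => {
    const value = readPath(input, path);
    if (value === undefined || value === null) return ""; // missing → empty string
    return typeof value === "string" ? value : JSON.stringify(value); // non-strings → JSON
  });
}

console.log(renderTemplateSketch("Hi {{ input.reply.text }} ({{ input.count }})", { reply: { text: "yes" }, count: 2 }));
// → Hi yes (2)
```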

---

## The Five Kinds

### 1. `agent` — Sub-Agent with a Prompt

```typescript
// src/types/state-machine.ts:262-290
export interface StateMachineAgentState extends StateMachineBaseState {
  kind: "agent";
  prompt: string;
  systemPrompt?: string;
  allowedSkills?: string[];
  cwd?: string;
}
```

An `agent` state delegates execution to a fresh sub-agent turn. The controller calls `createStateAgent`, which builds a new turn-runner prompt from the rendered `prompt` field plus automatically injected history context, then runs the sub-agent to completion.

**What it can do:**
- Invoke tools (bash, read, write, edit) within `cwd`.
- Ask the user follow-up questions (surfaces as `state_asked_user` in history).
- Produce free-form textual output that becomes the state's completion payload.
- Scope its available skill set via `allowedSkills`.
- Operate in a different working directory than the parent runner.

**What it cannot do:**
- Directly select the next state — that belongs to the parent runner agent after the sub-agent completes.
- Put the session to sleep; only `poll` and `timer` states can emit a `sleep` event.

**Invariant:** The runner injects the original session prompt and relevant history into the sub-agent automatically. The state definition only describes the sub-agent's task-specific prompt. The comment in the type file is explicit: "The runner injects original prompt/history outside this state config."

**Example from the outreach eval:**

```typescript
// evals/outreach-lifecycle.eval.ts:66-80
{
  kind: "agent",
  name: "research_prospect",
  inputSchema: {
    type: "object",
    properties: {
      prospectName: { type: "string" },
      company: { type: "string" },
    },
    required: ["prospectName", "company"],
  },
  prompt:
    "Do not call tools. Write one concise research note for {{ input.prospectName }} from {{ input.company }}.",
},
```

The `{{ input.prospectName }}` and `{{ input.company }}` placeholders are filled from the runner's selected `input` before the sub-agent starts.

---

### 2. `script` — Shell Command (the Generic Integration Primitive)

```typescript
// src/types/state-machine.ts:292-308
export interface StateMachineScriptState extends StateMachineBaseState {
  kind: "script";
  command: string;
  cwd?: string;
  timeoutMs?: number;
  successCodes?: number[];
}
```

A `script` state runs an arbitrary shell command through `sh -lc`. This is intentionally the generic integration primitive for the relay engine. GitHub PRs, email sending, Slack notifications, Calendly links — all of these are `script` states, not engine primitives. The type file's comment is unambiguous: "Do not hardcode integrations such as email, GitHub, Slack, Calendly, or webhooks into the engine."

**What it can do:**
- Call any CLI tool (`gh`, `git`, `curl`, `sendmail`, custom scripts).
- Accept a `timeoutMs` to kill long-running commands.
- Treat non-zero exit codes as failure unless overridden in `successCodes`.
- Capture stdout both as a trimmed raw string and as parsed JSON (via `parseStructuredOutput`) in the completion payload.

**What it cannot do:**
- Sleep and retry automatically — that is `poll`'s job. A script either succeeds, fails, or times out in one shot.
- Interact with the user.

**Execution contract:** The controller spawns the command with `detached: true` so the process tree can be cleanly killed on interruption. On success, stdout is trimmed and parsed as JSON if possible, making downstream states able to read structured data from `input.parsed`:

```typescript
// src/turn-runner/state-machine-controller.ts:463-472
function normalizeStructuredShellOutput(shellOutput: ShellCommandOutput): ShellCommandOutput & {
  parsed: Record<string, unknown>;
} {
  return {
    ...shellOutput,
    stdout: shellOutput.stdout.trim(),
    stderr: shellOutput.stderr.trim(),
    parsed: parseStructuredOutput(shellOutput.stdout),
  };
}
```
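The execution contract can be approximated with Node's `child_process.spawn`. This is a sketch under stated assumptions: `runScriptSketch` and `parseStructuredOutputSketch` are hypothetical names, the rejection shape is invented, and treating non-object JSON as `{}` is a guess at what `parseStructuredOutput` does. On POSIX systems, `detached: true` makes the child the leader of a new process group, which is why a negative pid can kill the whole tree on timeout.

```typescript
import { spawn } from "node:child_process";

interface ShellOutput {
  exitCode: number;
  stdout: string;
  stderr: string;
}

// Hypothetical options mirroring StateMachineScriptState's optional fields.
interface ScriptOptions {
  cwd?: string;
  timeoutMs?: number;
  successCodes?: number[];
}

// Run `command` through `sh -lc` in its own process group; resolve on a
// success code, reject (with the captured output attached) otherwise.
function runScriptSketch(command: string, opts: ScriptOptions = {}): Promise<ShellOutput> {
  const successCodes = opts.successCodes ?? [0];
  return new Promise((resolve, reject) => {
    const child = spawn("sh", ["-lc", command], { cwd: opts.cwd, detached: true });
    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (chunk) => { stdout += chunk; });
    child.stderr.on("data", (chunk) => { stderr += chunk; });
    const timer =
      opts.timeoutMs === undefined
        ? undefined
        : setTimeout(() => {
            try {
              // A negative pid targets the whole process group created by detached: true.
              if (child.pid !== undefined) process.kill(-child.pid, "SIGKILL");
            } catch {
              // The group already exited; nothing to kill.
            }
          }, opts.timeoutMs);
    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      // A signal-killed child has a null code; report it as -1.
      const result = { exitCode: code ?? -1, stdout: stdout.trim(), stderr: stderr.trim() };
      if (successCodes.includes(result.exitCode)) resolve(result);
      else reject(Object.assign(new Error(`exit code ${result.exitCode}`), result));
    });
  });
}

// Hypothetical parse step: JSON objects pass through, anything else becomes {}.
function parseStructuredOutputSketch(stdout: string): Record<string, unknown> {
  try {
    const value: unknown = JSON.parse(stdout);
    return value !== null && typeof value === "object" && !Array.isArray(value)
      ? (value as Record<string, unknown>)
      : {};
  } catch {
    return {};
  }
}
```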

**Example from the outreach eval:**

```typescript
// evals/outreach-lifecycle.eval.ts:82-97
{
  kind: "script",
  name: "send_outreach",
  inputSchema: { ... },
  command:
    'printf \'{"sent":true,"email":"{{ input.email }}","messageId":"eval-message-1",...}\'',
},
```

In a real deployment, this would be `scripts/send-email.sh '{{ input.email }}'` or similar.

---

### 3. `poll` — Recurring External Check

```typescript
// src/types/state-machine.ts:310-331
export interface StateMachinePollState extends StateMachineBaseState {
  kind: "poll";
  intervalMs: number;
  timeoutMs?: number;
  command: string;
  cwd?: string;
  successCodes?: number[];
}
```

A `poll` state is a `script` state that knows how to retry. The engine runs one poll attempt per wake, checks the exit code, and either completes (exit code in `successCodes`) or emits a `sleep` event scheduling the next attempt.

**What it can do:**
- Periodically poll any external system — PR status, email inbox, webhook delivery, CI pipeline — through any CLI or API wrapper.
- Capture structured JSON from stdout on success, identical to `script`.
- Time out the entire polling period with `timeoutMs`, not just a single attempt.

**What it cannot do:**
- Run multiple attempts in one wake. One wake = one attempt.
- Signal completion through stdout. Only exit codes listed in `successCodes` (default `[0]`) complete the poll, so a non-zero code means "found a result" only if it is listed there.

**Key invariant — exit code is king:**

The controller comment is explicit: stdout parsing does not affect poll completion. Only the exit code matters:

```typescript
// src/turn-runner/state-machine-controller.ts:356-374
try {
  // Poll success is determined purely by the script's exit code being
  // in `successCodes` (default [0]). `shell.run()` resolves when the
  // exit code is in the success set and rejects otherwise, so reaching
  // this branch means "this poll attempt found a result." Stdout is
  // parsed as JSON when possible for convenience, but the result of
  // that parse does NOT affect whether the poll completes — only the
  // exit code does.
  const shellOutput = await shell.run();
  const rawOutput = normalizePollShellOutput(shellOutput);
  // ...
} catch (error) {
  // Exit code not in `successCodes` (or shell error) → keep polling.
  const wakeAt = Date.now() + state.intervalMs;
  this.session = recordStateSleep(this.requireSession(), state, wakeAt);
  return { type: "sleep", wakeAt };
}
```

**Timeout behavior:** The total elapsed time is checked at the start of each wake attempt using `elapsedSinceStateStarted`. Once it reaches `timeoutMs`, the state fails the entire session instead of attempting another poll:

```typescript
// src/turn-runner/state-machine-controller.ts:339-343
const elapsedMs = elapsedSinceStateStarted(this.session, state.name);
if (state.timeoutMs !== undefined && elapsedMs >= state.timeoutMs) {
  // ... fails the session
}
```
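The poll rules (timeout checked first, exactly one attempt per wake, exit code decides) can be condensed into a pure decision function. `decidePollAttempt`, `PollStateLite`, and the result shape are hypothetical; the real controller also records session history and normalizes the shell output.

```typescript
interface PollStateLite {
  name: string;
  intervalMs: number;
  timeoutMs?: number;
  successCodes?: number[];
}

type PollDecision =
  | { type: "failed"; reason: string }
  | { type: "state_completed"; stateName: string }
  | { type: "sleep"; wakeAt: number };

// Decide the outcome of one wake for a poll state. `runAttempt` performs the
// single attempt allowed on this wake and returns its exit code.
function decidePollAttempt(
  state: PollStateLite,
  elapsedMs: number,
  runAttempt: () => number,
  now: number,
): PollDecision {
  // 1. The timeout is checked before any attempt is made.
  if (state.timeoutMs !== undefined && elapsedMs >= state.timeoutMs) {
    return { type: "failed", reason: `${state.name} timed out after ${elapsedMs}ms` };
  }
  // 2. Exactly one attempt per wake; only the exit code matters.
  const exitCode = runAttempt();
  if ((state.successCodes ?? [0]).includes(exitCode)) {
    return { type: "state_completed", stateName: state.name };
  }
  // 3. Otherwise sleep one interval from now and retry on the next wake.
  return { type: "sleep", wakeAt: now + state.intervalMs };
}
```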

**Why integrations are always script/poll, never engine primitives:** The outreach lifecycle described in the type-file comments illustrates this: waiting for an email reply is a `poll` state running a CLI wrapper, not a native email connector. This keeps the relay definition serializable (data only, no functions) and BYOC-friendly (bring your own connector), at the cost of polling latency measured in minutes rather than milliseconds, a trade-off the engine explicitly accepts.

---

### 4. `timer` — Pure Wall-Clock Delay

```typescript
// src/types/state-machine.ts:333-338
export interface StateMachineTimerState extends StateMachineBaseState {
  kind: "timer";
  wakeAt: number;
}
```

A `timer` state carries exactly one field beyond the base: an absolute Unix epoch millisecond timestamp. When the engine reaches a timer state, if the target time is in the future, it emits a `sleep` event with `wakeAt` matching the specified timestamp. When the outer layer wakes the session at or after that time, the controller marks the state as completed and lets the runner agent choose what comes next.

**What it can do:**
- Enforce a fixed delay before a subsequent state (e.g., "wait 24 hours before sending a follow-up").
- Encode a specific calendar time as an epoch timestamp.

**What it cannot do:**
- Run any code or shell command.
- Derive its `wakeAt` from input: `wakeAt` is a static number in the definition, and no template rendering applies to it.
- Know what the next state will be — the parent runner decides after the timer completes.

**Execution contract:** The controller checks whether `wakeAt` is still in the future:

```typescript
// src/turn-runner/state-machine-controller.ts:381-393
private runTimerState(state: StateMachineTimerState, woke = false): StateMachineExecutionResult {
  if (!woke && state.wakeAt > Date.now()) {
    this.session = recordStateSleep(this.requireSession(), state, state.wakeAt);
    return { type: "sleep", wakeAt: state.wakeAt };
  }

  const output = {
    elapsedMs: elapsedSinceStateStarted(this.session, state.name),
    timestamp: Date.now(),
  };
  this.session = recordStateCompleted(this.requireSession(), state.name, output);
  return { type: "state_completed", stateName: state.name, output };
}
```

When it completes, the output carries `elapsedMs` (actual elapsed time) and `timestamp` (wall-clock completion time), which the next state can read via `input`.

**Example from the outreach eval:**

```typescript
// evals/outreach-lifecycle.eval.ts:95-98
{
  kind: "timer",
  name: "wait_for_reply",
  wakeAt: Date.now() + 60_000,
},
```

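The eval sets a 60-second timer. Because `Date.now()` is evaluated when the definition object is built, the deadline falls 60 seconds after construction, not 60 seconds after the state is entered. The eval confirms that after the first turn the session sleeps at `wait_for_reply` and resumes once the timer expires.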

**`timer` vs `poll`:** Both emit `sleep` and both require the outer layer to wake the session. The distinction is that a `timer` knows exactly when to wake (fixed `wakeAt`) and runs no code at all. A `poll` recalculates its next wake on every failed attempt (`Date.now() + intervalMs`) and runs a shell command on each wake.

---

### 5. `terminal` — Named Business Outcome

```typescript
// src/types/state-machine.ts:340-347
export interface StateMachineTerminalState extends StateMachineBaseState {
  kind: "terminal";
  status: "completed" | "failed" | "cancelled";
  reason?: string;
}
```

A `terminal` state finalizes the session. When the runner agent selects a terminal state, the controller immediately records `terminal` on the session and returns `{ type: "terminal", status, result }` (with `result` set to the resolved reason) to the turn runner, which then closes the prompt loop and runs a final acknowledgment turn.

**What it can do:**
- Map a named business outcome (`meeting_scheduled`, `prospect_not_interested`, `merged`, `closed`) to one of three lifecycle statuses: `completed`, `failed`, or `cancelled`.
- Carry an optional `reason` string shown to users and recorded in history.
- Accept an override reason supplied in the runner agent's decision, which is useful for dynamic failure messages.

**What it cannot do:**
- Run shell commands.
- Wait for external events.
- Leave the session open. Reaching a terminal state is irreversible.

**Invariant — caller reason wins:** The controller resolves the final reason by preferring the runner agent's `decisionReason` over the state's static `reason`:

```typescript
// src/turn-runner/state-machine-controller.ts:396-407
private async runTerminalState(
  state: StateMachineTerminalState,
  decisionReason?: string,
): Promise<StateMachineExecutionResult> {
  const reason = decisionReason ?? state.reason;
  const terminal = { state: state.name, status: state.status, reason };
  this.session = recordStateMachineCompleted(this.requireSession(), terminal);
  return { type: "terminal", status: state.status, result: reason };
}
```

**Terminal states are not "error" or "done":** The type intentionally allows multiple named terminal states per relay. The outreach lifecycle has `meeting_scheduled` (completed), `prospect_not_interested` (completed), `negative_response` (failed), and `no_response_after_followups` (failed). Each carries semantic meaning for the business process, not just success/failure flags.

**Auto-injected escape hatches:** The eval comments note that "the runner auto-injects 'failed' and 'cancelled' terminal escape hatches, so the definition does not need to spell those out." This means every relay gets generic failure and cancellation terminals for free, without the author having to define them.
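Both terminal rules can be sketched together: reason precedence and auto-injected escape hatches. The injection logic is a hypothetical reconstruction from the eval comment (it adds `failed` and `cancelled` terminals only when no state already uses those names); the actual runner may name, place, or describe them differently. `resolveTerminalReason` mirrors the `decisionReason ?? state.reason` line in `runTerminalState`.

```typescript
interface TerminalStateLite {
  kind: "terminal";
  name: string;
  status: "completed" | "failed" | "cancelled";
  reason?: string;
}
interface OtherStateLite {
  kind: "agent" | "script" | "poll" | "timer";
  name: string;
}
type RelayStateLite = TerminalStateLite | OtherStateLite;

// Caller reason wins over the state's static reason.
function resolveTerminalReason(state: TerminalStateLite, decisionReason?: string): string | undefined {
  return decisionReason ?? state.reason;
}

// Hypothetical: append generic escape hatches unless the author already defined them.
function withEscapeHatches(states: RelayStateLite[]): RelayStateLite[] {
  const names = new Set(states.map((s) => s.name));
  const hatches: TerminalStateLite[] = [
    { kind: "terminal", name: "failed", status: "failed" },
    { kind: "terminal", name: "cancelled", status: "cancelled" },
  ];
  return [...states, ...hatches.filter((h) => !names.has(h.name))];
}
```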

---

## Lifecycle and State Transitions

```text
StateMachineDefinition
  ├── agent states    → sub-agent executes → state_completed / ask / interrupted / failed
  ├── script states   → shell runs once   → state_completed / interrupted / failed (terminal)
  ├── poll states     → on each wake:
  │     elapsed >= timeoutMs → failed (terminal), no attempt made
  │     otherwise one shell attempt:
  │       exit in successCodes     → state_completed
  │       exit not in successCodes → sleep(wakeAt = now + intervalMs) → retry on wake
  ├── timer states    → wakeAt in future → sleep(wakeAt) → on wake → state_completed
  └── terminal states → immediately finalizes session → terminal(status, reason)
```

The session event log (`StateMachineSessionEvent`) captures every transition: `state_machine_started`, `runner_decided`, `state_started`, `state_completed`, `state_failed`, `state_interrupted`, `state_asked_user`, and `state_machine_completed`. It is an append-only audit log capped at 100 entries, with the oldest entries dropped first, so it stays bounded even for long-running relays with many poll sleep cycles.

Sources: [src/turn-runner/state-machine-session.ts:23-24]()

---

## Comparison Table

| Kind | Runs code? | Can sleep/retry? | Can ask user? | Ends session? | Integration use |
|---|---|---|---|---|---|
| `agent` | Yes (via sub-agent tools) | No | Yes | On failure | Research, drafting, classification |
| `script` | Yes (shell, one-shot) | No | No | On failure | Email send, PR create, setup, cleanup |
| `poll` | Yes (shell, per-attempt) | Yes (intervalMs) | No | On timeout | PR status, inbox check, webhook wait |
| `timer` | No | Yes (fixed wakeAt) | No | No | Cadence delays, calendar scheduling |
| `terminal` | No | No | No | Always | Named business outcomes |

---

## Why Integrations Are Always `script` or `poll`

The type-file comments explain the philosophy directly:

> "Do not hardcode integrations such as email, GitHub, Slack, Calendly, or webhooks into the engine. Any external system with an API or CLI is a bash script away, and this engine can accept a few minutes of polling latency instead of requiring realtime responsiveness."

Sources: [src/types/state-machine.ts:17-20]()

The concrete benefit is that relay definitions remain plain serializable JSON/TypeScript data with no embedded logic. A relay can be stored in a database, transmitted over a wire, and resumed on a different process without any special deserialization. When an integration changes—e.g., switching email providers—only the script that a state references changes, not the engine or the state machine type system.

The outreach lifecycle eval shows the kinds working together in one relay: an `agent` for research, a `script` for email sending, a `timer` for delay, another `script` for reply fetching, another `agent` for classification, and a `terminal` for the named outcome `meeting_scheduled`. Sources: [evals/outreach-lifecycle.eval.ts:61-132]()
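To make the "plain data" point concrete, here is a hypothetical outreach-style relay written as a TypeScript object and round-tripped through JSON. The field names (`prompt`, `command`, `wakeAt`, `intervalMs`, `timeoutMs`, `successCodes`), the script paths, and the fixed timestamp are assumptions for illustration. Only the five `kind` values come from this page, and this is not the eval's actual definition.

```typescript
// Hypothetical field names; only the five `kind` values come from this page.
type RelayState =
  | { kind: "agent"; name: string; prompt: string }
  | { kind: "script"; name: string; command: string }
  | { kind: "poll"; name: string; command: string; intervalMs: number; timeoutMs: number; successCodes: number[] }
  | { kind: "timer"; name: string; wakeAt: number }
  | { kind: "terminal"; name: string; status: "completed" | "failed"; reason: string };

interface RelayDefinition {
  name: string;
  states: RelayState[];
}

// An outreach-style relay as plain data (illustrative, not the eval's definition).
// The union covers all five kinds; this particular relay uses four of them.
const outreach: RelayDefinition = {
  name: "outreach-lifecycle",
  states: [
    { kind: "agent", name: "research", prompt: "Research {{ input.company }} and draft an intro email." },
    { kind: "script", name: "send_email", command: "./scripts/send-email.sh" },
    { kind: "timer", name: "wait_for_reply", wakeAt: Date.UTC(2030, 0, 1) },
    { kind: "script", name: "fetch_replies", command: "./scripts/fetch-replies.sh" },
    { kind: "agent", name: "classify_reply", prompt: "Classify the prospect's reply." },
    { kind: "terminal", name: "meeting_scheduled", status: "completed", reason: "The prospect booked a meeting." },
  ],
};

// No functions or class instances inside, so a JSON round trip loses nothing:
// the definition can be stored in a database or sent to another process.
const restored: RelayDefinition = JSON.parse(JSON.stringify(outreach));
```

Swapping email providers would change only the script paths referenced by `send_email` and `fetch_replies`; the shape of the data stays the same.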

---

## 04. State Machine Execution Flow — How the Runner Agent Drives Transitions

> The runner agent—not a config file—selects the next state every turn. This page traces how state-machine-controller.ts dispatches a state, records audit events in StateMachineSession.history, emits sleep for poll/timer states, handles interruptions, and runs a terminal acknowledgment turn. Includes the carry-forward invariant and the mid-session start rule.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/04-state-machine-execution-flow-how-the-runner-agent-drives-transitions.md
- Generated: 2026-05-22T00:29:10.116Z

### Source Files

- `src/turn-runner/state-machine-controller.ts`
- `src/turn-runner/state-machine-session.ts`
- `src/turn-runner/tools.ts`
- `src/turn-runner/prompts.ts`
- `evals/state-machine-routing.eval.ts`
- `evals/state-machine-interrupt-resume.eval.ts`
- `test/turn-runner-state-machine-agent-events.test.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/turn-runner/state-machine-controller.ts](src/turn-runner/state-machine-controller.ts)
- [src/turn-runner/state-machine-session.ts](src/turn-runner/state-machine-session.ts)
- [src/turn-runner/tools.ts](src/turn-runner/tools.ts)
- [src/turn-runner/prompts.ts](src/turn-runner/prompts.ts)
- [src/turn-runner/turn-runner.ts](src/turn-runner/turn-runner.ts)
- [evals/state-machine-routing.eval.ts](evals/state-machine-routing.eval.ts)
- [evals/state-machine-interrupt-resume.eval.ts](evals/state-machine-interrupt-resume.eval.ts)
- [test/turn-runner-state-machine-agent-events.test.ts](test/turn-runner-state-machine-agent-events.test.ts)
</details>

# State Machine Execution Flow — How the Runner Agent Drives Transitions

The duet-agent state machine is orchestrated entirely at runtime by a parent LLM agent — not by a static config file or a hard-coded transition table. Every state transition is a deliberate tool call made by the runner agent. This page traces that loop in full: how `StateMachineController` dispatches each state kind, how audit events accumulate in `StateMachineSession.history`, when `sleep` events are emitted for poll and timer states, how interruptions are absorbed without data loss, how the terminal acknowledgment turn works, and the two invariants every caller must honor — carry-forward and mid-session start.

Understanding this flow matters because the runner agent is the only entity that reads state output and selects the next state. Sub-agents and shell scripts only handle one state at a time; they have no view of prior states unless the runner explicitly carries that context forward.

---

## The Dispatch Loop

### Parent Agent Selects a State

After a state finishes, `TurnRunner.selectNextStateAfterCompletion` re-prompts the parent agent with the completed state's name and output. The parent must reply with a `select_state_machine_state` tool call. The runner makes up to three attempts; if none yields a valid selection, the machine is failed automatically. The tool is thin: it validates that the named state exists in the active definition, applies any inline override, and ends the parent turn immediately with `terminate: true`.

```ts
// src/turn-runner/turn-runner.ts:862-918
private async selectNextStateAfterCompletion(
  stateName: string,
  output?: unknown,
): Promise<StateMachineExecutionResult> {
  for (let attempt = 1; attempt <= 3; attempt++) {
    const workerResult = await this.runAgentWorkerWithUsage({
      prompt: dedent`
        The state "${stateName}" finished.
        ...
        you must end this turn by calling the select_state_machine_state tool
      `,
      ...
    });
    const result = await this.controllerResultFromWorkerResult(...);
    if (!result) continue;  // retry
    return result;
  }
  return { type: "terminal", status: "failed", error: "..." };
}
```

Sources: [src/turn-runner/turn-runner.ts:862-918]()

The parent's `select_state_machine_state` tool call produces a `StateMachineRunnerDecision` — a plain object carrying `{ state, reason?, override?, input? }`. The decision is then handed to `StateMachineController.runDecision`.
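A minimal sketch of the decision shape and the existence check the tool performs, assuming simplified types. `validateDecision` and `ValidationResult` are hypothetical names. The real tool also applies the override and returns `terminate: true`, which this sketch does not model.

```typescript
// Simplified decision shape; field names follow this page, value types are assumptions.
interface StateMachineRunnerDecision {
  state: string;
  reason?: string;
  override?: Record<string, unknown>;
  input?: Record<string, unknown>;
}

interface DefinitionLike {
  states: { name: string }[];
}

type ValidationResult =
  | { ok: true; decision: StateMachineRunnerDecision }
  | { ok: false; error: string };

// Reject decisions that name a state the active definition does not contain.
function validateDecision(definition: DefinitionLike, decision: StateMachineRunnerDecision): ValidationResult {
  const known = definition.states.some((state) => state.name === decision.state);
  if (!known) {
    const names = definition.states.map((state) => state.name).join(", ");
    return { ok: false, error: `Unknown state "${decision.state}". Known states: ${names}` };
  }
  return { ok: true, decision };
}
```

Returning a descriptive error instead of throwing fits the retry loop: an invalid selection simply leads to another attempt.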

### Controller Records the Decision and Dispatches

`runDecision` is the single dispatch point in `StateMachineController`. Each call first interrupts any previously active work and waits for its teardown, then records the runner's decision into `session.history` and looks up the target state in the definition. For non-terminal states, `applyStateOverride` merges any inline override before `recordStateStarted` sets `currentState` and `currentInput` on the session snapshot. The controller then fires `onSessionChanged`, the callback the `TurnRunner` uses to emit a `state_machine` protocol event to connected UIs.

```ts
// src/turn-runner/state-machine-controller.ts:204-258
async runDecision(decision): Promise<StateMachineExecutionResult> {
  // 1. Interrupt any previously active work
  if (previous) {
    this.interrupt("Replaced by a newly selected state.");
    await previous.finished;  // await teardown before starting replacement
  }
  // 2. Record runner's choice in history
  this.session = recordRunnerDecision(stateMachine, decision);
  // 3. Look up state, fail if unknown
  const selectedState = findState(this.session, decision.state);
  // 4. Apply override, record state_started, notify UI
  this.session = recordStateStarted(this.session, effectiveState, decision.input);
  this.config.onSessionChanged?.(this.session);
  // 5. Dispatch by kind
  switch (effectiveState.kind) {
    case "agent":  return this.runAgentState(effectiveState);
    case "script": return this.runScriptState(effectiveState);
    case "poll":   return this.runPollState(effectiveState);
    case "timer":  return this.runTimerState(effectiveState);
    case "terminal": return this.runTerminalState(effectiveState, decision.reason);
  }
}
```

Sources: [src/turn-runner/state-machine-controller.ts:204-258]()
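Step 1 in the excerpt above, interrupt and then await teardown, is what keeps an old sub-agent or shell from leaking into its replacement. The pattern can be sketched in isolation with a standard `AbortController`. `SingleActiveRunner` and its log are hypothetical, and the real controller's interrupt path is richer than a single abort signal.

```typescript
// Hypothetical sketch: one piece of active work at a time, with teardown awaited
// before a replacement starts.
interface ActiveWork {
  abort: AbortController;
  finished: Promise<void>;
}

class SingleActiveRunner {
  private active: ActiveWork | undefined;
  readonly log: string[] = [];

  async start(name: string, work: (signal: AbortSignal) => Promise<void>): Promise<void> {
    const previous = this.active;
    if (previous) {
      // Ask the old work to stop, then wait until it has fully settled so it
      // cannot emit anything after the replacement begins.
      previous.abort.abort("Replaced by a newly selected state.");
      await previous.finished;
    }
    const abort = new AbortController();
    this.log.push(`start:${name}`);
    const finished = work(abort.signal)
      .catch(() => undefined) // this sketch ignores the work's own errors; teardown only needs settlement
      .finally(() => {
        this.log.push(`settled:${name}`);
      });
    this.active = { abort, finished };
  }
}
```

Without the `await previous.finished` line, `start:b` could be logged before `settled:a`, which is exactly the overlap the controller avoids.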

The five `kind` values map to five distinct execution paths described below.

---

## State Kinds and Their Execution Paths

```text
StateMachineRunnerDecision
        │
        ▼
StateMachineController.runDecision()
        │
        ├─ kind: "agent"    → runAgentState()   → StateAgentHandle.prompt()
        ├─ kind: "script"   → runScriptState()  → ShellStateHandle.run()
        ├─ kind: "poll"     → runPollState()    → ShellStateHandle.run() → sleep loop
        ├─ kind: "timer"    → runTimerState()   → immediate or deferred completion
        └─ kind: "terminal" → runTerminalState() → recordStateMachineCompleted()
```
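Because `kind` is a string-literal discriminant, the dispatch can be written as an exhaustive `switch` in which TypeScript itself proves every kind is handled. The sketch below uses hypothetical minimal state shapes and returns a label instead of running anything. It illustrates the pattern, not the controller's exact code.

```typescript
// Hypothetical minimal state shapes; only the `kind` values come from this page.
type State =
  | { kind: "agent"; name: string; prompt: string }
  | { kind: "script"; name: string; command: string }
  | { kind: "poll"; name: string; command: string; intervalMs: number }
  | { kind: "timer"; name: string; wakeAt: number }
  | { kind: "terminal"; name: string; status: "completed" | "failed" };

// Returns the name of the execution path a state would take.
function describeDispatch(state: State): string {
  switch (state.kind) {
    case "agent":
      return `runAgentState(${state.name})`;
    case "script":
      return `runScriptState(${state.name})`;
    case "poll":
      return `runPollState(${state.name})`;
    case "timer":
      return `runTimerState(${state.name})`;
    case "terminal":
      return `runTerminalState(${state.name})`;
    default: {
      // If a sixth kind is added to `State` without a case, this stops compiling.
      const unreachable: never = state;
      return unreachable;
    }
  }
}
```

The `never` assignment turns "did we handle every kind?" into a compile-time question, so adding a kind forces every dispatch site to be updated.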

### `agent` States

A fresh `StateAgentHandle` is created for each agent state execution — sub-agents get no view of the parent transcript or prior sub-agent work. The handle wraps a `pi-agent-core` `Agent` instance configured with mode `"agent"` tools (coding tools, `todo_write`, `ask_user_question`, `read_skill`; no state-machine control tools). When `agent.prompt()` resolves:

- `interrupted` → `recordInterruptedState`, return `{ type: "interrupted" }`.
- `ask` → `recordStateAskedUser`, return `{ type: "ask", questions }`.
- `failed` → `recordStateFailed`, return `{ type: "terminal", status: "failed" }`.
- `complete` → `recordStateCompleted`, return `{ type: "state_completed" }`.

State prompts may use `{{ input.field }}` templates; `renderTemplate` expands them from `session.currentInput` before the handle is built.

Sources: [src/turn-runner/state-machine-controller.ts:266-299](), [src/turn-runner/turn-runner.ts:920-990]()
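A sketch of how `{{ input.field }}` expansion could work, assuming dotted paths and JSON-stringification of non-string values. `renderTemplateSketch` is a hypothetical stand-in, and the real `renderTemplate` may treat missing fields or nested values differently. Here, unknown placeholders are left in place so that a missing input stays visible in the rendered prompt.

```typescript
// Hypothetical re-implementation for illustration only.
function renderTemplateSketch(template: string, input: Record<string, unknown> | undefined): string {
  return template.replace(/\{\{\s*input\.([A-Za-z0-9_.]+)\s*\}\}/g, (match: string, path: string) => {
    let value: unknown = input;
    // Walk dotted paths such as "repo.name" one key at a time.
    for (const key of path.split(".")) {
      if (value === null || typeof value !== "object") return match;
      value = (value as Record<string, unknown>)[key];
    }
    if (value === undefined) return match; // leave unknown placeholders untouched
    return typeof value === "string" ? value : JSON.stringify(value);
  });
}
```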

### `script` States

`runScriptState` renders the command template, creates a `ShellStateHandle`, and awaits `shell.run()`. Success produces `{ type: "state_completed" }` with the shell output normalized: `stdout`/`stderr` are trimmed and `parsed` is populated by `parseStructuredOutput`. Failure checks `shell.interruptedReason()` to distinguish an interrupt (→ `{ type: "interrupted" }`) from a genuine error (→ `{ type: "terminal", status: "failed" }`).

Sources: [src/turn-runner/state-machine-controller.ts:301-336]()
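A sketch of the normalization step, assuming that "structured output" means stdout that parses as JSON. `normalizeShellOutput` and `tryParseJson` are hypothetical helpers, not the actual `parseStructuredOutput`, whose parsing rules are not shown on this page.

```typescript
interface ShellOutput {
  stdout: string;
  stderr: string;
  parsed?: unknown;
}

// Trim both streams and attach a parsed value when stdout is JSON (an assumption).
function normalizeShellOutput(rawStdout: string, rawStderr: string): ShellOutput {
  const stdout = rawStdout.trim();
  const stderr = rawStderr.trim();
  return { stdout, stderr, parsed: tryParseJson(stdout) };
}

function tryParseJson(text: string): unknown {
  if (text === "") return undefined;
  try {
    return JSON.parse(text);
  } catch {
    return undefined; // plain-text output is still available as `stdout`
  }
}
```

Keeping the raw trimmed text alongside `parsed` means a script that prints plain text is not an error; the runner simply has no structured value to route on.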

### `poll` States and the Sleep Loop

`runPollState` first checks whether `timeoutMs` has elapsed using `elapsedSinceStateStarted`, which reads `session.progress.states[name].startedAt`. If the timeout is exceeded, the machine fails immediately. Otherwise, the shell command runs once. If the exit code is in `successCodes`, the poll completes normally. If not — the shell error path — the state sleeps:

```ts
// src/turn-runner/state-machine-controller.ts:371-375
// Exit code not in successCodes → keep polling.
const wakeAt = Date.now() + state.intervalMs;
this.session = recordStateSleep(this.requireSession(), state, wakeAt);
return { type: "sleep", wakeAt };
```

`recordStateSleep` increments `progress.states[name].sleeps` and writes `nextWakeAt`. The `TurnRunner` converts the `sleep` result into a `TurnTerminalEvent` of type `"sleep"`, which signals the caller to put the session to rest and call `wake()` at `wakeAt`.

Sources: [src/turn-runner/state-machine-controller.ts:338-379]()
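The poll rules above can be restated as a pure function, which makes the ordering explicit: the timeout check happens before the command runs, and a non-success exit code schedules the next attempt rather than failing. `runPollAttempt` is a hypothetical name, and the synchronous `runCommand` callback stands in for the real asynchronous `ShellStateHandle.run()`.

```typescript
interface PollState {
  name: string;
  intervalMs: number;
  timeoutMs: number;
  successCodes: number[];
}

type PollOutcome =
  | { type: "failed"; reason: string }
  | { type: "state_completed" }
  | { type: "sleep"; wakeAt: number };

function runPollAttempt(state: PollState, startedAt: number, now: number, runCommand: () => number): PollOutcome {
  // 1. Timeout first: an expired poll never runs its command again.
  if (now - startedAt > state.timeoutMs) {
    return { type: "failed", reason: `Poll "${state.name}" timed out after ${state.timeoutMs} ms.` };
  }
  // 2. One attempt per wake.
  const exitCode = runCommand();
  if (state.successCodes.includes(exitCode)) return { type: "state_completed" };
  // 3. Not done yet: sleep until the next interval.
  return { type: "sleep", wakeAt: now + state.intervalMs };
}
```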

### `timer` States

Timer states are the simplest: if `wakeAt > Date.now()` and the call did not come from `wake()`, emit sleep. When woken (or when `wakeAt` is already in the past), complete with `{ elapsedMs, timestamp }`.

Sources: [src/turn-runner/state-machine-controller.ts:381-393]()
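The timer rule fits in a few lines. This sketch assumes `elapsedMs` is measured from when the state started and that `timestamp` is an ISO-8601 string; both are assumptions, as is the `runTimerSketch` name.

```typescript
type TimerOutcome =
  | { type: "sleep"; wakeAt: number }
  | { type: "state_completed"; output: { elapsedMs: number; timestamp: string } };

function runTimerSketch(wakeAt: number, startedAt: number, now: number, fromWake: boolean): TimerOutcome {
  // Still in the future and not a wake() call: go back to sleep.
  if (wakeAt > now && !fromWake) return { type: "sleep", wakeAt };
  // Woken, or the deadline already passed: complete immediately.
  return { type: "state_completed", output: { elapsedMs: now - startedAt, timestamp: new Date(now).toISOString() } };
}
```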

### `terminal` States

`runTerminalState` calls `recordStateMachineCompleted`, which appends a `state_machine_completed` event to history and sets `session.terminal`. The caller-supplied `reason` wins over the state's static `reason` field — this lets the runner agent attach a specific failure message when selecting the auto-injected `failed` terminal.

Auto-injection: `assertValidDefinition` (called when the runner creates a definition) calls `injectMissingTerminalEscapeHatches`, ensuring `"failed"` and `"cancelled"` terminals exist in every definition so the runner can always abort without writing boilerplate.

Sources: [src/turn-runner/state-machine-controller.ts:395-407](), [src/turn-runner/tools.ts:1071-1079]()

---

## History Audit Trail

Every meaningful transition appends to `session.history` through one of the named recorder functions in `state-machine-session.ts`. The append helper enforces a hard cap of 100 entries (`STATE_MACHINE_HISTORY_LIMIT`), dropping the oldest entries when exceeded. The starting `state_machine_started` marker can fall off under this cap; consumers must not rely on its presence.

| Recorder | History event type | When called |
|---|---|---|
| `createStateMachineSession` | `state_machine_started` | Session creation |
| `recordRunnerDecision` | `runner_decided` | Every `runDecision` call |
| `recordStateStarted` | `state_started` | Before dispatching any state kind |
| `recordStateCompleted` | `state_completed` | Agent/script/poll/timer success |
| `recordStateSleep` | *(no history event; updates progress only)* | Poll/timer going to sleep |
| `recordStateFailed` | `state_failed` + `state_machine_completed` | State error or timeout |
| `recordStateInterrupted` | `state_interrupted` | Interrupt received |
| `recordStateAskedUser` | `state_asked_user` | Agent sub-agent asked user a question |
| `recordStateMachineCompleted` | `state_machine_completed` | Terminal state reached |

Sources: [src/turn-runner/state-machine-session.ts:76-257]()

`recordStateStarted` also clears `nextWakeAt` on all states (via `clearProgressWakeTimes`) when any new state begins, which prevents stale wake times from accumulating across transitions.
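Both bookkeeping rules, the 100-entry cap and the wake-time reset, can be sketched as small immutable helpers. `appendCapped` and `clearWakeTimes` are hypothetical stand-ins for the session module's append helper and `clearProgressWakeTimes`, and the progress shape is simplified to the two fields this page mentions.

```typescript
const HISTORY_LIMIT = 100; // cap stated on this page

interface StateProgress {
  sleeps: number;
  nextWakeAt?: number;
}

// Append one event and keep only the newest `limit` entries.
function appendCapped<T>(history: readonly T[], event: T, limit: number = HISTORY_LIMIT): T[] {
  const next = [...history, event];
  return next.length > limit ? next.slice(next.length - limit) : next;
}

// When any new state starts, drop every pending wake time but keep the counters.
function clearWakeTimes(states: Record<string, StateProgress>): Record<string, StateProgress> {
  const cleared: Record<string, StateProgress> = {};
  for (const [name, progress] of Object.entries(states)) {
    const copy: StateProgress = { ...progress };
    delete copy.nextWakeAt;
    cleared[name] = copy;
  }
  return cleared;
}
```

Because the oldest entries fall off first, a consumer scanning history should look for the most recent events it needs rather than assuming the opening `state_machine_started` marker is present.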

---

## Sleep / Wake Protocol

```mermaid
stateDiagram-v2
    [*] --> Running : runDecision()
    Running --> StateCompleted : state exits normally
    StateCompleted --> Running : selectNextStateAfterCompletion (parent picks next state)
    Running --> Sleeping : poll/timer emits {type:"sleep", wakeAt}
    Sleeping --> Running : wake() called at wakeAt → controller.wake()
    Running --> WaitingForHuman : agent state calls ask_user_question
    WaitingForHuman --> Running : TurnRunner.answer() → selectNextStateAfterCompletion
    Running --> Terminal : terminal state selected
    Terminal --> [*] : acknowledgment turn
    Running --> Interrupted : interrupt() called
    Interrupted --> Running : resume → selectNextStateAfterCompletion
```

When `driveStateMachineResult` receives `{ type: "sleep" }`, `TurnRunner.controllerResultToTerminal` converts it to a public `TurnTerminalEvent` of type `"sleep"`. The session status is set to `"sleeping"`. `TurnRunner.wake()` resumes by calling `stateMachineController.wake()`, which calls `currentScheduledState` to identify the current poll or timer state and re-dispatches it.

Sources: [src/turn-runner/turn-runner.ts:685-703](), [src/turn-runner/state-machine-controller.ts:260-264]()
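From the caller's side, a sleep terminal only needs to become a delay for whatever scheduler will call `wake()`: a cron job, a queue with delayed delivery, or a timer in a long-lived process. The event union and `wakeDelayMs` below are hypothetical; only the four terminal type names (`complete`, `ask`, `sleep`, `interrupted`) come from this wiki.

```typescript
type TerminalEventLike =
  | { type: "complete" }
  | { type: "ask" }
  | { type: "interrupted" }
  | { type: "sleep"; wakeAt: number };

// Milliseconds the scheduler should wait before calling wake(), or undefined
// if the event does not ask to be woken.
function wakeDelayMs(event: TerminalEventLike, now: number): number | undefined {
  if (event.type !== "sleep") return undefined;
  return Math.max(0, event.wakeAt - now); // a wakeAt in the past means "wake now"
}
```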

**Mid-session start rule:** when a user prompt arrives while the session is sleeping, `restoreSleepAfterPromptIfNeeded` returns a new `sleep` terminal after the parent-agent turn completes — preserving the `nextWakeAt` timestamp so the poll loop is not disrupted. Only follow-up commands are queued; steer commands fire an immediate parent prompt that can redirect the state machine.

Sources: [src/turn-runner/turn-runner.ts:705-735]()
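A hedged sketch of the restore decision. It assumes the runner snapshots the session before the prompt and knows afterwards whether the parent turn selected a new state. `restoreSleepIfNeeded`, `SleepSnapshot`, and the `stateChanged` flag are hypothetical, and the real `restoreSleepAfterPromptIfNeeded` may use different conditions.

```typescript
interface SleepSnapshot {
  status: "sleeping" | "running" | "idle";
  nextWakeAt?: number;
}

function restoreSleepIfNeeded(
  before: SleepSnapshot,
  stateChanged: boolean,
): { type: "sleep"; wakeAt: number } | undefined {
  // Only sessions that were asleep when the prompt arrived are candidates.
  if (before.status !== "sleeping" || before.nextWakeAt === undefined) return undefined;
  // Assumption: if the parent turn redirected the machine, the old schedule no longer applies.
  if (stateChanged) return undefined;
  // Re-emit sleep with the original wake time so the poll cadence is unchanged.
  return { type: "sleep", wakeAt: before.nextWakeAt };
}
```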

---

## Interruption Handling

Interruptions can originate from two places:

1. **External interrupt** (`TurnRunner.interrupt()`): aborts the parent pi-agent and calls `stateMachineController.interrupt("Interrupted")`.
2. **State replacement** (`runDecision` with active work): the controller self-interrupts with `"Replaced by a newly selected state."` and **awaits `previous.finished`** before starting the replacement state. This teardown wait is critical — without it, the orphaned sub-agent or shell could keep emitting events into the new state's turn.

In both cases, `recordInterruptedState` sets `session.currentState` to the reserved sentinel `INTERRUPTED_STATE_MACHINE_STATE` (`"interrupted"`). The `state_interrupted` history event captures the reason and any partial output (assistant text for agent states, `{ stdout, stderr }` for shell states).

The dedup guard in `recordInterruptedState` prevents double-recording when the sub-agent's `finished` promise settles after the interrupt has already been written:

```ts
// src/turn-runner/state-machine-controller.ts:417-437
// If the last history event is already a state_interrupted for this state,
// update it in place rather than appending a second entry.
if (
  session.currentState === INTERRUPTED_STATE_MACHINE_STATE &&
  last?.type === "state_interrupted" &&
  last.state === stateName
) {
  this.session = { ...session, history: [...history.slice(0, -1), { ...last, reason, output }] };
  return;
}
```

Sources: [src/turn-runner/state-machine-controller.ts:409-437](), [src/turn-runner/state-machine-controller.ts:183-202]()
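The dedup guard is easier to test as a pure function over the history array. This sketch is hypothetical (`recordInterrupt` and the two-variant `HistoryEntry` type are simplifications). It reproduces the three-part condition from the excerpt above: the current state is the sentinel, the last event is `state_interrupted`, and that event names the same state.

```typescript
interface InterruptedEvent {
  type: "state_interrupted";
  state: string;
  reason: string;
  output?: unknown;
}

type HistoryEntry = InterruptedEvent | { type: "state_started"; state: string };

const INTERRUPTED_SENTINEL = "interrupted"; // reserved state name from this page

function recordInterrupt(
  history: readonly HistoryEntry[],
  currentState: string,
  event: InterruptedEvent,
): HistoryEntry[] {
  const last = history[history.length - 1];
  if (
    currentState === INTERRUPTED_SENTINEL &&
    last !== undefined &&
    last.type === "state_interrupted" &&
    last.state === event.state
  ) {
    // Same interruption settling twice: update the entry in place.
    return [...history.slice(0, -1), { ...last, reason: event.reason, output: event.output }];
  }
  return [...history, event];
}
```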

---

## Terminal Acknowledgment Turn

Every terminal — whether selected explicitly by the runner or produced by a runtime failure — triggers one additional parent-agent turn before the public `TurnTerminalEvent` is emitted. This is the **terminal acknowledgment turn**.

`TurnRunner.runStateMachineTerminalAcknowledgment` gates on `session.terminal` being set and `session.terminalAcknowledged` being false. It calls `markTerminalAcknowledged()` immediately (setting `session.terminalAcknowledged = true`) to ensure the same terminal cannot be acknowledged twice.

The acknowledgment prompt, built by `formatStateMachineTerminalAcknowledgmentPrompt`, is deliberately neutral about whether the terminal was chosen or a runtime failure:

```ts
// src/turn-runner/tools.ts:859-884
return dedent`
  The state machine "${session.definition.name}" has reached a terminal state and is no longer running.

  ${toXML({ state_machine_terminal: { state, status, reason } })}

  Respond now:
  - If you want to start follow-up work, call create_state_machine_definition.
  - Otherwise reply to the user in plain text...

  Do not call select_state_machine_state — there is no active state machine to advance.
`;
```

The parent's reply on this turn may:

- **Plain text only** → `runStateMachineTerminalAcknowledgment` returns `undefined`, and the caller emits the original terminal event. The natural-language summary lands as the final assistant message.
- **`create_state_machine_definition`** → a new session is created (`createStateMachineSession` returns a fresh object), so it has its own `terminalAcknowledged = false`, and the new machine will receive its own acknowledgment when it terminates.

Sources: [src/turn-runner/turn-runner.ts:832-860](), [src/turn-runner/tools.ts:859-884]()
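The "acknowledge at most once" rule is a claim-then-act guard: mark first, then run the turn. A minimal sketch, in which the hypothetical `claimAcknowledgment` mutates a simplified session object (the real code may produce a new snapshot instead):

```typescript
interface AckSession {
  terminal?: { state: string; status: string; reason?: string };
  terminalAcknowledged: boolean;
}

// Returns true exactly once per terminal; the caller runs the acknowledgment turn only then.
function claimAcknowledgment(session: AckSession): boolean {
  if (!session.terminal || session.terminalAcknowledged) return false;
  // Mark before the acknowledgment turn runs, so a re-entrant call cannot claim it again.
  session.terminalAcknowledged = true;
  return true;
}
```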

---

## The Carry-Forward Invariant

Each agent state runs in a **fresh sub-agent context** with no view of prior agent transcripts, tool output, or state output. The only inputs are the rendered prompt (after template expansion) and the `input` object passed via `runDecision`. The system prompt layer (`createStateMachineSystemPromptLayer`) codifies this as a rule:

> "Treat every transition as a chance to update the next state's prompt or input with whatever the orchestrator now knows that the sub-agent will need." — [src/turn-runner/prompts.ts:94]()

Mechanically, the runner agent must either:

- Pass concrete facts as `input` (when the next state has a matching `inputSchema`), or
- Use `override.prompt` to inline findings from the prior state into the next state's prompt before selecting it.

A static prompt referring to "the findings from the previous step" without `input` or an `override` is a bug — the sub-agent has no channel to receive those findings. `applyStateOverride` in the controller merges the override fields shallowly (`{ ...state, ...override.state }`) before `runAgentState` renders templates.

Sources: [src/turn-runner/tools.ts:534-543](), [src/turn-runner/state-machine-controller.ts:237-248]()
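A small illustration of the carry-forward fix using a shallow merge. `AgentStateLike`, `applyOverridePatch`, and the patch shape are hypothetical simplifications, since this page does not pin down the exact structure of `override`. The point is that the concrete finding ends up inside the prompt the fresh sub-agent actually receives.

```typescript
interface AgentStateLike {
  kind: "agent";
  name: string;
  prompt: string;
}

// Shallow merge: any field present in the patch replaces the original wholesale.
function applyOverridePatch(state: AgentStateLike, patch: Partial<AgentStateLike> | undefined): AgentStateLike {
  return patch ? { ...state, ...patch } : state;
}

// A static prompt that points at context the sub-agent will never see.
const analyze: AgentStateLike = {
  kind: "agent",
  name: "analyze-data",
  prompt: "Analyze the findings from the previous step.",
};

// Fix: inline the concrete facts into the prompt before selecting the state.
const findings = "3 of 12 endpoints return 500 under load.";
const effective = applyOverridePatch(analyze, {
  prompt: `Analyze these findings from the fetch step: ${findings}`,
});
```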

---

## Definition Validation and Auto-Injection

When the runner agent calls `create_state_machine_definition`, `assertValidDefinition` runs before the tool returns:

1. **Reserved name check** — no state may be named `"interrupted"`.
2. **Schema validation** — each `inputSchema` must be a valid JSON Schema.
3. **Schedule validation** — poll states must have a positive `intervalMs`; timer states must have a finite `wakeAt`.
4. **Minimum cadence** — newly created poll states must have `intervalMs ≥ 15 minutes`; timer `wakeAt` must be ≥ 15 minutes in the future. This floor is only enforced at creation, not on externally supplied definitions passed via `mode:` config.
5. **Auto-injection** — `"failed"` and `"cancelled"` terminal escape hatches are added if missing.
6. **Completed terminal required** — at least one terminal with `status: "completed"` must exist.

Sources: [src/turn-runner/tools.ts:1055-1090](), [src/turn-runner/tools.ts:1049-1053]()
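A sketch of checks 1, 3, 4, and 6 from the list above, collecting messages instead of throwing. `validateNewDefinition` is hypothetical. It omits JSON Schema validation and escape-hatch injection, and it assumes "at least 15 minutes in the future" is measured from a caller-supplied `now`.

```typescript
const MIN_CADENCE_MS = 15 * 60 * 1000; // 15-minute floor stated on this page

type DefState =
  | { kind: "agent" | "script"; name: string }
  | { kind: "poll"; name: string; intervalMs: number }
  | { kind: "timer"; name: string; wakeAt: number }
  | { kind: "terminal"; name: string; status: string };

function validateNewDefinition(states: DefState[], now: number): string[] {
  const errors: string[] = [];
  for (const state of states) {
    if (state.name === "interrupted") errors.push(`"interrupted" is a reserved state name.`);
    if (state.kind === "poll") {
      if (!(state.intervalMs > 0)) errors.push(`${state.name}: intervalMs must be positive.`);
      else if (state.intervalMs < MIN_CADENCE_MS) errors.push(`${state.name}: intervalMs must be at least 15 minutes.`);
    }
    if (state.kind === "timer") {
      if (!Number.isFinite(state.wakeAt)) errors.push(`${state.name}: wakeAt must be finite.`);
      else if (state.wakeAt - now < MIN_CADENCE_MS) errors.push(`${state.name}: wakeAt must be at least 15 minutes ahead.`);
    }
  }
  if (!states.some((s) => s.kind === "terminal" && s.status === "completed")) {
    errors.push(`At least one terminal state must have status "completed".`);
  }
  return errors;
}
```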

---

## Full Sequence: One State Completion Cycle

```mermaid
sequenceDiagram
    participant TR as TurnRunner
    participant SMC as StateMachineController
    participant SM as StateMachineSession (history)
    participant Parent as Parent Agent
    participant Sub as Sub-Agent / Shell

    TR->>SMC: runDecision({ state: "fetch-data", input: {url} })
    SMC->>SM: recordRunnerDecision()      [runner_decided]
    SMC->>SM: recordStateStarted()        [state_started]
    SMC->>TR: onSessionChanged(session)   [UI event: state_machine]
    SMC->>Sub: agent.prompt() / shell.run()
    Sub-->>SMC: { type: "complete", result }
    SMC->>SM: recordStateCompleted()      [state_completed]
    SMC-->>TR: { type: "state_completed", stateName, output }
    TR->>Parent: re-prompt with completed output
    Parent->>TR: select_state_machine_state({ state: "analyze-data" })
    TR->>SMC: runDecision({ state: "analyze-data", input: {...} })
    Note over SMC,SM: next cycle begins
```

Sources: [src/turn-runner/state-machine-controller.ts:204-258](), [src/turn-runner/turn-runner.ts:760-797]()

---

## Summary

The runner agent is the only entity that selects states. `StateMachineController.runDecision` is the single dispatch entry point: it records the decision, applies any override, records state start, fires a UI notification, and then dispatches to one of five `run*State` methods by state kind. Every transition appends audit events to `StateMachineSession.history` (capped at 100). Poll and timer states emit a `sleep` result that suspends the session until `wake()` restores it. Interruptions are safely absorbed through a teardown-wait (`previous.finished`) and a dedup guard on the history. Every terminal — chosen or runtime-failed — triggers one acknowledgment turn that gives the parent agent a chance to summarize or chain follow-up work. The carry-forward invariant is the key operational rule: because sub-agents start fresh, the runner must explicitly pass any facts the next state needs via `input` or `override.prompt` on every transition. These invariants are documented in the system prompt layer at [src/turn-runner/prompts.ts:88-94]() and enforced mechanically by the controller and tool validators.

---

## 05. Observational Memory — How Transcripts Become Durable Rows

> Memory and compaction are the same primitive. After each turn an observer model reads the transcript and appends Observation rows to PGlite; when rows grow beyond a threshold a reflector condenses them. Embeddings run in a background worker (embedding-worker.ts) so foreground turns never block. This page covers the observe → reflect → embed pipeline, trigger conditions, failure isolation, and the image-to-text path that keeps screenshots recallable.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/05-observational-memory-how-transcripts-become-durable-rows.md
- Generated: 2026-05-22T00:28:54.291Z

### Source Files

- `src/memory/observational.ts`
- `src/memory/observational-prompts.ts`
- `src/memory/observation-groups.ts`
- `src/memory/embedding-worker.ts`
- `src/memory/embedding.ts`
- `src/memory/storage.ts`
- `evals/memory-reflect.eval.ts`
- `test/memory-reflect-planner.test.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/memory/observational.ts](src/memory/observational.ts)
- [src/memory/observational-prompts.ts](src/memory/observational-prompts.ts)
- [src/memory/observation-groups.ts](src/memory/observation-groups.ts)
- [src/memory/embedding-worker.ts](src/memory/embedding-worker.ts)
- [src/memory/embedding.ts](src/memory/embedding.ts)
- [src/memory/storage.ts](src/memory/storage.ts)
- [src/memory/context-pack.ts](src/memory/context-pack.ts)
- [src/memory/loader.ts](src/memory/loader.ts)
- [src/memory/migrations.ts](src/memory/migrations.ts)
- [evals/memory-reflect.eval.ts](evals/memory-reflect.eval.ts)
- [test/memory-reflect-planner.test.ts](test/memory-reflect-planner.test.ts)
</details>

# Observational Memory — How Transcripts Become Durable Rows

Duet's observational memory system converts raw conversation transcripts into structured `Observation` rows stored in a per-user PGlite (embedded Postgres) database. After each turn, an *observer* model reads the unobserved message tail and appends new rows; when the local session's accumulated rows reach a token threshold, a *reflector* condenses them. Embeddings are generated in a background worker, so the foreground turn never blocks on a vector computation. A separate `duet memory reflect` command operates at global scope, pruning the entire cross-session pool into atomic reflection rows.

This page covers the full pipeline: how transcripts are serialized for the observer, how observation groups anchor progress markers, how reflection compaction works at session and global scope, and how the embedding backfill worker runs without interfering with foreground turns. Understanding this pipeline is essential for predicting memory latency, debugging blank or stale recall, and reasoning about what gets preserved across long-running or resumed sessions.

---

## The Observe → Reflect → Embed Pipeline

```text
Turn N completes
     │
     ▼
updateObservationalMemory()
     │
     ├─ agentMessagesToRaw()           serialize AgentMessage → RawMemoryMessage
     │                                 (text, images pass through; tool-calls flattened)
     │
     ├─ getUnobservedMessageTail()     skip messages already covered by an
     │                                 <observation-group> range marker
     │
     ├─ [unobserved tail > 0 ?]
     │     └─ observe()               LLM call → ObserverResult
     │           └─ appendObservation() → PGlite  (wrapped in <observation-group>)
     │
     ├─ [session obs tokens ≥ trigger ?]
     │     └─ reflectObservations()   LLM call → ReflectorResult
     │           └─ replaceSessionObservations() → PGlite
     │
     └─ return { observations, reflections }

Background (EmbeddingBackfillWorker, every 10 s):
     SELECT rows WHERE embedding IS NULL
     → embed()  (POST /api/v1/embed)
     → INSERT INTO observation_embeddings
```

Sources: [src/memory/observational.ts:609-696](), [src/memory/storage.ts:145-303]()
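A sketch of the background loop's shape, assuming hypothetical `EmbeddingStore` and `Embedder` interfaces in place of the real PGlite queries and the `POST /api/v1/embed` call. It shows the two properties that matter for foreground latency: passes never overlap, and a failed pass is logged and retried on the next tick instead of propagating. The 10-second default mirrors the diagram; the batch size is an assumption.

```typescript
interface PendingRow {
  id: string;
  text: string;
}

interface EmbeddingStore {
  rowsMissingEmbeddings(limit: number): Promise<PendingRow[]>;
  saveEmbedding(id: string, vector: number[]): Promise<void>;
}

type Embedder = (texts: string[]) => Promise<number[][]>;

class BackfillWorker {
  private running = false;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly store: EmbeddingStore;
  private readonly embed: Embedder;
  private readonly batchSize: number;

  constructor(store: EmbeddingStore, embed: Embedder, batchSize = 32) {
    this.store = store;
    this.embed = embed;
    this.batchSize = batchSize;
  }

  start(intervalMs = 10_000): void {
    this.timer = setInterval(() => void this.tick(), intervalMs);
  }

  stop(): void {
    if (this.timer !== undefined) clearInterval(this.timer);
  }

  // One pass: embed a batch of rows that have no vector yet; returns rows embedded.
  async tick(): Promise<number> {
    if (this.running) return 0; // skip if the previous pass is still in flight
    this.running = true;
    try {
      const rows = await this.store.rowsMissingEmbeddings(this.batchSize);
      if (rows.length === 0) return 0;
      const vectors = await this.embed(rows.map((row) => row.text));
      for (let i = 0; i < rows.length; i++) {
        await this.store.saveEmbedding(rows[i].id, vectors[i]);
      }
      return rows.length;
    } catch (error) {
      // Failure isolation: log and retry on the next tick; nothing propagates to turns.
      console.error("embedding backfill failed", error);
      return 0;
    } finally {
      this.running = false;
    }
  }
}
```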

---

## Budget Arithmetic

All numeric token budgets flow from one caller-supplied value: `effectiveContext` — the actor model's effective context window size, already clamped to the provider's hard limit. A single `deriveMemoryBudgets()` call derives every threshold at a fixed ratio.

| Budget name | Ratio of `effectiveContext` | Default at 200k |
|---|---|---|
| `messageTokens` (raw tail trigger) | 0.60 | 120,000 |
| `observationTokens` (reflection trigger) | 0.325 | 65,000 |
| `globalContextTokenBudget` | 0.075 | 15,000 |
| `bufferActivation` (target after compaction) | 0.5 × trigger | 60,000 / 32,500 |

`bufferActivation` is always half the trigger. After a compaction event the retained content shrinks to half the trigger size, giving the session room to grow again before retriggering. Three fixed, context-independent caps guard the observer call itself:

| Cap | Value | Purpose |
|---|---|---|
| `maxTranscriptTokens` | 35,000 | Observer call never sees more than this from the unobserved tail |
| `maxObservationLogTokens` | 8,000 | Hard upper bound on the observation text the observer may produce |
| `previousObserverTokens` | 4,000 | Prior observations shown to the observer for dedupe context |

Sources: [src/memory/observational.ts:73-134](), [src/memory/observational.ts:170-195]()
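The ratio table translates directly into code. This sketch assumes results are rounded to whole tokens with `Math.round`, which the page does not specify, and returns the two buffer-activation values separately. `deriveBudgetsSketch` is a hypothetical stand-in for `deriveMemoryBudgets()`.

```typescript
interface MemoryBudgetsSketch {
  messageTokens: number; // raw-tail trigger
  observationTokens: number; // reflection trigger
  globalContextTokenBudget: number;
  messageBufferActivation: number; // half of messageTokens
  observationBufferActivation: number; // half of observationTokens
}

function deriveBudgetsSketch(effectiveContext: number): MemoryBudgetsSketch {
  const messageTokens = Math.round(effectiveContext * 0.6);
  const observationTokens = Math.round(effectiveContext * 0.325);
  return {
    messageTokens,
    observationTokens,
    globalContextTokenBudget: Math.round(effectiveContext * 0.075),
    messageBufferActivation: Math.round(messageTokens * 0.5),
    observationBufferActivation: Math.round(observationTokens * 0.5),
  };
}
```

For a 200k context this yields the table's defaults: 120,000 / 65,000 / 15,000 tokens, with buffer activation at 60,000 and 32,500.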

---

## Message Serialization for the Observer

`agentMessageToRaw()` projects each `AgentMessage` into a `RawMemoryMessage` — a compact, observer-readable form. The key decisions:

- **Text and image blocks** pass through verbatim. Image blocks appear inline in the observer's multimodal prompt so screenshots are directly inspectable.
- **Tool-call blocks** become compact `[toolCall name(args)]` snippets, with arguments truncated to 1,500 characters.
- **Tool-result blocks** are prefixed with `[toolResult toolName (error?)]` and truncated to 1,500 characters. This keeps the observer aware of what was called without drowning the batch budget in raw JSON payloads.
- **`thinking` blocks are dropped entirely.** The observer records decisions, not the assistant's reasoning steps.
- **Synthetic memory-reminder messages** injected by `createObservationalContextTransform` are stripped before the observer ever sees them via `stripObservationalContextMessages()`.

```ts
// src/memory/observational.ts:1652-1697
function serializeMessageForObserver(message: AgentMessage): ObserverMessagePreview {
  // ... text/image pass through; toolCall flattened; thinking dropped
}
```

Each message gets a stable, deterministic `id` (`msg_assistant_<responseId>`, `msg_tool_<toolCallId>`, or a hash-based fallback). These ids are the backbone of the progress-marker system.

Sources: [src/memory/observational.ts:1563-1812]()
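A sketch of the flattening and id rules described above. The exact snippet formats, the `...` truncation marker, the `toolResult` role name, and the FNV-1a-style fallback hash are all assumptions; the real code's fallback hash and id format for messages without provider ids may differ.

```typescript
const MAX_SNIPPET_CHARS = 1_500; // truncation limit stated on this page

function truncate(text: string, max: number = MAX_SNIPPET_CHARS): string {
  // Assumption: an ASCII "..." marks the cut.
  return text.length <= max ? text : `${text.slice(0, max)}...`;
}

function formatToolCall(name: string, args: unknown): string {
  return `[toolCall ${name}(${truncate(JSON.stringify(args) ?? "")})]`;
}

function formatToolResult(toolName: string, text: string, isError: boolean): string {
  const header = isError ? `[toolResult ${toolName} (error)]` : `[toolResult ${toolName}]`;
  return `${header} ${truncate(text)}`;
}

// FNV-1a-style 32-bit hash over UTF-16 code units: small and deterministic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

interface MessageLike {
  role: string;
  text: string;
  responseId?: string;
  toolCallId?: string;
}

// Stable ids: provider ids when available, otherwise a content hash.
function messageId(message: MessageLike): string {
  if (message.role === "assistant" && message.responseId) return `msg_assistant_${message.responseId}`;
  if (message.role === "toolResult" && message.toolCallId) return `msg_tool_${message.toolCallId}`;
  return `msg_${message.role}_${fnv1a(`${message.role}:${message.text}`)}`;
}
```

Determinism is the important property: the same message must map to the same id on every run, or stored `range` markers would stop matching the transcript.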

---

## Observation Groups — Progress Markers

Every observation row's content is wrapped in an `<observation-group>` XML element carrying four key attributes:

```xml
<observation-group id="a3f7c2b1d8e4" range="msg_user_1716000000_abc:msg_assistant_resp_xyz" kind="observation" cwd="/home/user/projects/duet">
  🟡 (14:33) Fixed null-check in auth.ts:45 …
</observation-group>
```

The `range` attribute is a `firstMessageId:lastMessageId` span. `getUnobservedMessageTail()` parses all observation groups from all local session observations, finds the highest message index covered by any range, and returns only messages beyond that index as the "unobserved tail." This is what prevents the observer from re-observing turns it already summarized.

```ts
// src/memory/observational.ts:1528-1561
export function getUnobservedMessageTail(
  messages: RawMemoryMessage[],
  observations: Observation[],
): RawMemoryMessage[] {
  const lastObservedIndex = getLastObservedMessageIndex(messages, observations);
  // ...
}
```

The `cwd` attribute persists the working directory onto each row so the reflector and any downstream reader can identify which project the row belongs to — essential when memory is read back weeks later or across multiple repositories.

Sources: [src/memory/observation-groups.ts:74-118](), [src/memory/observational.ts:1528-1561]()
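The progress-marker logic can be sketched with a regular expression over stored rows. This assumes the `range` attribute always holds two colon-separated message ids with no colons inside them, and it uses each range's last id as the "highest covered" point. `coveredRanges` and `unobservedTail` are hypothetical helpers.

```typescript
interface MsgLike {
  id: string;
}

// Extract every range="first:last" pair from stored observation text.
function coveredRanges(observationTexts: string[]): Array<[string, string]> {
  const ranges: Array<[string, string]> = [];
  const pattern = /<observation-group\b[^>]*\brange="([^":]+):([^"]+)"/g;
  for (const text of observationTexts) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      ranges.push([match[1], match[2]]);
    }
  }
  return ranges;
}

// Return only the messages after the highest index any range covers.
function unobservedTail<T extends MsgLike>(messages: T[], observationTexts: string[]): T[] {
  const indexById = new Map<string, number>();
  messages.forEach((message, index) => indexById.set(message.id, index));
  let lastObserved = -1;
  for (const [, lastId] of coveredRanges(observationTexts)) {
    const index = indexById.get(lastId);
    if (index !== undefined && index > lastObserved) lastObserved = index;
  }
  return messages.slice(lastObserved + 1);
}
```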

---

## The Observer Model Call

`observe()` builds a structured-output prompt containing:

1. A system prompt instructing the observer to extract decision traces (trigger → investigation → decision → rationale), priority-tagged observations (`🔴`/`🟡`/`🟢`/`✅`), and temporal anchors.
2. Prior local observations (for dedupe) and cross-session global pack rows (with explicit `[memory id: mem_xxx]` markers so the observer can attribute `usedObservationIds`).
3. The serialized message history, trimmed to `maxTranscriptTokens` (35k) from the newest end.

The observer returns a structured `ObserverResult` with:

- `hasMemory: boolean` — gating flag. When false, no row is written and the message range is left unobserved.
- `observations: string` — the extracted observation log text.
- `usedObservationIds: string[]` — prior memory ids whose content actually informed this turn's response. These trigger a `bumpLastUsed()` update to refresh the `lastUsedAt` freshness signal for those rows.
- `currentTask`, `suggestedContinuation`, `threadTitle` — continuity metadata.

If the observation text exceeds `maxObservationLogTokens`, the observer is retried with the explicit token count stated in the prompt. If the retry is still over budget, `trimObservationTextToTokenBudget()` truncates the text to enforce the hard cap.

Sources: [src/memory/observational.ts:1325-1399](), [src/memory/observational-prompts.ts:248-343]()
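
The three-stage budget enforcement (first attempt, retry with an explicit limit, hard trim) can be sketched as follows. This assumes a rough `ceil(chars / 4)` token estimate, the same heuristic the wire-shaping code uses for text, and assumes the hard trim keeps whole lines from the start of the log. `observeWithinBudget`, `trimToTokenBudget`, and the `generate` callback are hypothetical names, not the real signatures in `observational.ts`.

```ts
// Sketch only: hypothetical helpers; the real trim lives in trimObservationTextToTokenBudget().
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep whole lines from the start until the next line would exceed the budget.
export function trimToTokenBudget(text: string, maxTokens: number): string {
  const kept: string[] = [];
  let used = 0;
  for (const line of text.split("\n")) {
    const cost = estimateTokens(line + "\n");
    if (used + cost > maxTokens) break;
    kept.push(line);
    used += cost;
  }
  return kept.join("\n");
}

// `generate` stands in for the observer model call; the optional argument is the
// explicit token limit passed on the retry.
export async function observeWithinBudget(
  generate: (tokenLimitHint?: number) => Promise<string>,
  maxTokens: number,
): Promise<string> {
  let text = await generate();
  if (estimateTokens(text) > maxTokens) text = await generate(maxTokens); // retry with the limit
  if (estimateTokens(text) > maxTokens) text = trimToTokenBudget(text, maxTokens); // hard cap
  return text;
}
```

The retry gives the model a chance to condense on its own; the trim only guarantees the cap when it does not.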

### Image-to-Text Path

Images pass through `serializeMessageForObserver` as inline `ImageContent` blocks in the observer's multimodal `content` array. The observer prompt's final instruction explicitly asks the model to "inspect [images] directly and summarize relevant visual details, user-visible text, UI state, diagrams, errors, or other facts needed for future continuity." This is the only path by which screenshots become recallable: the observer converts visual state into prose that lands in a durable `Observation` row.

Sources: [src/memory/observational-prompts.ts:329-343](), [src/memory/observational.ts:1666-1731]()

---

## In-Session Reflection

After each observer call, `updateObservationalMemory()` re-reads the session's observation token count. If it has reached `settings.reflection.observationTokens` (default ~65k at 200k context), `reflectObservations()` fires.

The reflector receives all current session observations, rendered via `renderObservationGroupsForReflection()` into a `## Group <id>` / `_range: <range>_` format. It returns an array of atomic `ReflectorReflection` rows — each a self-contained narrative of 150–600 tokens covering trigger → investigation → decision → rationale.

After reflection:

1. The array is joined into a single blob via `joinReflectorRows()`.
2. Token budget is enforced (retry + hard trim).
3. `reconcileObservationGroupsFromReflection()` re-wraps the output in `<observation-group>` elements using provenance derived from which source groups' content lines appear in the reflected sections.
4. `replaceSessionObservations()` atomically deletes this session's old rows and inserts the single new reflection row (kind=`"reflection"`).

```ts
// src/memory/observational.ts:875-931
async function reflectObservations(args): Promise<Observation[] | undefined> {
  // ...
  await replaceSessionObservations(session, sessionId, [reflected]);
  return [reflected];
}
```

This is a session-scoped operation. Other sessions' rows in the database are never touched.

Sources: [src/memory/observational.ts:875-932](), [src/memory/observation-groups.ts:219-254]()
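
The trigger check and the session-scoped swap (steps 1 and 4 above) can be modeled with an in-memory store. This is a sketch under stated assumptions: `InMemoryObservationStore`, the `"\n\n"` separator in `joinRows`, and `shouldReflect` are hypothetical; the real code runs against PGlite inside a transaction.

```ts
// Sketch only: an in-memory model of the session-scoped reflection swap.
interface StoredObservation {
  id: string;
  sessionId: string | null;
  kind: "observation" | "reflection";
  content: string;
}

// Reflection fires once the session's observation tokens reach the threshold.
export function shouldReflect(sessionObservationTokens: number, thresholdTokens: number): boolean {
  return sessionObservationTokens >= thresholdTokens;
}

// Hypothetical stand-in for joinReflectorRows(): atomic rows become one blob.
export function joinRows(rows: string[]): string {
  return rows
    .map((r) => r.trim())
    .filter((r) => r.length > 0)
    .join("\n\n");
}

export class InMemoryObservationStore {
  rows: StoredObservation[] = [];

  // Delete only this session's rows, then add the replacements. Building the new
  // array first and assigning it once mimics the atomic swap.
  replaceSessionObservations(sessionId: string, replacements: StoredObservation[]): void {
    const others = this.rows.filter((r) => r.sessionId !== sessionId);
    this.rows = [...others, ...replacements];
  }
}
```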

---

## Global Cross-Session Reflection

`reflectAllObservations()` (exposed as `duet memory reflect`) condenses the entire global pool. It differs from in-session reflection in important ways:

### Eligibility Rules (enforced by `planReflectionBatches`)

| Row type | Rule |
|---|---|
| Global reflection row (`kind="reflection"`, `sessionId="__global_reflection__"`) | **Always preserved.** Re-reflecting condensed text degrades specificity. |
| Local reflection row (`kind="reflection"` with real `sessionId`) | **Eligible.** `duet memory reflect` breaks these into atomic global rows. |
| Raw observation row younger than 3 days | **Preserved.** Resume-info-loss risk too high. |
| Raw observation row older than 3 days | **Eligible.** |

Eligible rows are sorted chronologically and packed into batches up to `batchTokens` (default = reflection trigger). The 3-day minimum age is documented with explicit tradeoff reasoning: the specifics of a session still matter for resume continuity within 3 days; beyond that, the higher-level shape the reflector captures is what survives in human memory too.

Sources: [src/memory/observational.ts:940-1001](), [src/memory/observational.ts:1113-1160](), [test/memory-reflect-planner.test.ts:48-80]()
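
A minimal sketch of the eligibility table and the chronological batch packing. Assumptions: rows expose `createdAt` in epoch milliseconds and a precomputed `tokens` count, a row exactly 3 days old counts as eligible, and a single row larger than `batchTokens` gets a batch of its own. `planBatches`, `isEligible`, and the row shape are illustrative, not the real `planReflectionBatches` signature.

```ts
// Sketch only: hypothetical names and row shape.
const GLOBAL_REFLECTION_SESSION = "__global_reflection__";
const MIN_AGE_MS = 3 * 24 * 60 * 60 * 1000;

interface PoolRow {
  id: string;
  sessionId: string | null;
  kind: "observation" | "reflection";
  createdAt: number; // epoch ms
  tokens: number;
}

export function isEligible(row: PoolRow, now: number): boolean {
  if (row.kind === "reflection") {
    // Global reflections are always preserved; local reflections are eligible.
    return row.sessionId !== GLOBAL_REFLECTION_SESSION;
  }
  // Raw observations must be at least 3 days old.
  return now - row.createdAt >= MIN_AGE_MS;
}

export function planBatches(rows: PoolRow[], now: number, batchTokens: number): PoolRow[][] {
  const eligible = rows
    .filter((r) => isEligible(r, now))
    .sort((a, b) => a.createdAt - b.createdAt); // chronological
  const batches: PoolRow[][] = [];
  let current: PoolRow[] = [];
  let used = 0;
  for (const row of eligible) {
    // Start a new batch when this row would overflow a non-empty one.
    if (current.length > 0 && used + row.tokens > batchTokens) {
      batches.push(current);
      current = [];
      used = 0;
    }
    current.push(row);
    used += row.tokens;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```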

### Batch Processing

Each eligible batch runs through `reflectBatch()`:

1. A `generateStructuredOutput` call produces an array of atomic rows.
2. Each row is sanitized (lines capped at 10,000 chars) and trimmed to a per-row share of `targetTokens`.
3. A combined budget cap drops trailing rows if cumulative tokens would exceed the caller's `targetTokens`.
4. Each surviving row becomes its own `Observation` with `sessionId = "__global_reflection__"` and tags including `"global-prune"`.

After all batches complete, `replaceAllObservations()` atomically swaps the entire store: eligible rows disappear and new global reflection rows appear in a single transaction. This prevents peer CLIs from seeing a half-pruned pool.

Sources: [src/memory/observational.ts:1261-1315](), [src/memory/storage.ts:241-258]()
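
Steps 2 to 4 of `reflectBatch()` can be approximated as below. Assumptions: each row's share of `targetTokens` is an equal split rounded up (the rounding is what lets the combined cap actually drop rows here), truncation uses the same `ceil(chars / 4)` estimate, and `finalizeGlobalRows` plus its output shape are hypothetical names chosen for illustration.

```ts
// Sketch only: finalizeGlobalRows and its helpers are hypothetical.
const MAX_LINE_CHARS = 10_000;
const estimate = (text: string): number => Math.ceil(text.length / 4);

interface GlobalReflectionRow {
  sessionId: "__global_reflection__";
  kind: "reflection";
  tags: string[];
  content: string;
}

// Cap every line at 10,000 characters.
function sanitize(row: string): string {
  return row
    .split("\n")
    .map((line) => line.slice(0, MAX_LINE_CHARS))
    .join("\n");
}

// Truncate so the 4-chars-per-token estimate is at most maxTokens.
function truncateToTokens(text: string, maxTokens: number): string {
  return text.slice(0, maxTokens * 4);
}

export function finalizeGlobalRows(rows: string[], targetTokens: number): GlobalReflectionRow[] {
  if (rows.length === 0) return [];
  const perRow = Math.ceil(targetTokens / rows.length); // assumed equal share
  const out: GlobalReflectionRow[] = [];
  let total = 0;
  for (const raw of rows) {
    const content = truncateToTokens(sanitize(raw), perRow);
    const cost = estimate(content);
    if (total + cost > targetTokens) break; // combined cap: drop this and all trailing rows
    total += cost;
    out.push({
      sessionId: "__global_reflection__",
      kind: "reflection",
      tags: ["observational-memory", "reflection", "global-prune"],
      content,
    });
  }
  return out;
}
```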

---

## The Embedding Backfill Worker

`EmbeddingBackfillWorker` runs as a persistent background loop started by `loadStoredMemory()`. Its contract: never block a foreground turn.

### Tick Shape

```
while not aborted:
  withDb(session):                    # acquires cross-process lock
    loop:
      SELECT observations WHERE no embedding AND id NOT IN cooldown
        ORDER BY priority DESC, created_at DESC
        LIMIT 50
      if empty: break
      embed(batch.map(r => r.content))  # POST /api/v1/embed
      INSERT INTO observation_embeddings (ON CONFLICT DO UPDATE)
  sleep(10s)
```

The worker exits `withDb` after draining the current batch so the idle-close timer releases the cross-process lock. A peer duet CLI can acquire it between drain ticks.
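
The tick shape above translates roughly into the following TypeScript. Everything here is an assumption for illustration: the `Db` and `WorkerDeps` interfaces, the `drainOnce`/`runWorker` names, and applying the 5-minute cooldown to ids right after they are embedded. It shows the two properties the section describes: the lock is held only while draining, and the worker sleeps outside `withDb`.

```ts
// Sketch only: hypothetical interfaces standing in for PGlite access and the embed call.
interface Row {
  id: string;
  content: string;
}
interface Db {
  // Rows with no embedding, highest priority and newest first, skipping excludeIds.
  selectUnembedded(limit: number, excludeIds: ReadonlySet<string>): Promise<Row[]>;
  upsertEmbeddings(rows: Array<{ id: string; vector: number[] }>): Promise<void>;
}
interface WorkerDeps {
  // Resolves to undefined when a peer process holds the lock.
  withDb<T>(fn: (db: Db) => Promise<T>): Promise<T | undefined>;
  // Assumed to return one vector per input text, in order.
  embed(texts: string[]): Promise<number[][]>;
}

const BATCH_SIZE = 50;
const COOLDOWN_MS = 5 * 60 * 1000;

// Drop expired cooldown entries and return the ids still cooling down.
function activeCooldown(cooldown: Map<string, number>, now: number): Set<string> {
  for (const [id, until] of cooldown) if (until <= now) cooldown.delete(id);
  return new Set(cooldown.keys());
}

export async function drainOnce(deps: WorkerDeps, cooldown: Map<string, number>): Promise<number> {
  const embedded = await deps.withDb(async (db) => {
    let count = 0;
    for (;;) {
      const batch = await db.selectUnembedded(BATCH_SIZE, activeCooldown(cooldown, Date.now()));
      if (batch.length === 0) break;
      const vectors = await deps.embed(batch.map((r) => r.content));
      await db.upsertEmbeddings(batch.map((r, i) => ({ id: r.id, vector: vectors[i] })));
      for (const r of batch) cooldown.set(r.id, Date.now() + COOLDOWN_MS);
      count += batch.length;
    }
    return count;
  });
  return embedded ?? 0; // undefined means the lock was unavailable
}

function sleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve) => {
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

export async function runWorker(deps: WorkerDeps, signal: AbortSignal): Promise<void> {
  const cooldown = new Map<string, number>();
  while (!signal.aborted) {
    try {
      await drainOnce(deps, cooldown); // withDb has returned, so the idle close can release the lock
      await sleep(10_000, signal);
    } catch {
      await sleep(60_000, signal); // back off after an embedding error
    }
  }
}
```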

### Failure Isolation

- Per-embedding errors are logged, and the worker backs off for 60 seconds before the next tick.
- `EmbeddingUnavailableError` (missing `DUET_API_KEY`, 4xx from the endpoint) propagates as a typed error; callers degrade to keyword-only retrieval rather than crashing.
- 5xx responses retry with exponential backoff up to 3 attempts.
- A per-id cooldown (default 5 minutes) prevents unbounded hot loops when the reflector's delete-and-reinsert cycle keeps wiping a row's embedding between drain ticks.
- The log file rotates at 1 MB (`<path>.1` keeps one prior rotation) to prevent unbounded growth.

Sources: [src/memory/embedding-worker.ts:83-273](), [src/memory/embedding.ts:55-176]()
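
A sketch of the status-code policy in the list above, assuming the embed endpoint is reached through a hypothetical `send` callback that returns an HTTP status and parsed vectors. `EmbeddingUnavailableError` is named after the type this section mentions, but its shape here is illustrative, as are the 500 ms backoff base and the treatment of unexpected non-4xx statuses as retryable.

```ts
// Sketch only: illustrative error type and retry helper, not the code in embedding.ts.
export class EmbeddingUnavailableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "EmbeddingUnavailableError";
  }
}

interface EmbedResponse {
  status: number;
  vectors?: number[][];
}

const MAX_ATTEMPTS = 3;

export async function embedWithRetry(
  send: (texts: string[]) => Promise<EmbedResponse>,
  texts: string[],
  apiKey: string | undefined,
  baseDelayMs = 500,
): Promise<number[][]> {
  // A missing key is a configuration problem, not a transient failure.
  if (!apiKey) throw new EmbeddingUnavailableError("DUET_API_KEY is not set");
  for (let attempt = 1; ; attempt++) {
    const res = await send(texts);
    if (res.status >= 200 && res.status < 300 && res.vectors) return res.vectors;
    // 4xx: callers should degrade to keyword-only retrieval instead of retrying.
    if (res.status >= 400 && res.status < 500) {
      throw new EmbeddingUnavailableError(`embed endpoint returned ${res.status}`);
    }
    if (attempt >= MAX_ATTEMPTS) {
      throw new Error(`embed failed after ${attempt} attempts (status ${res.status})`);
    }
    // 5xx: exponential backoff (1x, 2x, 4x the base delay).
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
  }
}
```

Keeping the "unavailable" case as a distinct type is what lets recall catch it and fall back to the keyword path without swallowing genuine bugs.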

### No FK, No Orphan Risk

Migration 7 dropped the foreign key from `observation_embeddings` to `observations`. An embedding row can survive after its parent observation is deleted (e.g., by a reflection replace). This is intentional: the orphan is harmless — recall queries JOIN back to `observations` and filter it out — and avoids the cascade-delete race that would otherwise force a re-embed on every reflection cycle.

Sources: [src/memory/embedding-worker.ts:216-224]()

---

## Context Pack Rendering and Cache Stability

The frozen context pack — the memory prefix rendered above the message tail — is rebuilt by `rebuildMemoryContextPack()` at exactly three moments:

1. `loadStoredMemory()` finishes — initial seed before turn 1.
2. The reflector replaces observations — condensed view changed.
3. The wire-shaping eviction horizon advances mid-turn — prompt cache is already invalidating, so the refresh piggybacks on a cache miss the model is already paying.

All other paths (observer appending a row, `recall_memory` tool returning rows) deliberately do NOT refresh. This keeps the prefix content-deterministic between compaction events so the provider's prompt cache survives turn over turn.

The pack has two layers, rendered in order:

- `<global_observations>` — highest-ranked cross-session rows, excluded from the current session, greedy-fitted to `globalContextTokenBudget` (default ~15k at 200k context). Ranked by `ln(priority) + ln(kindBias) + lastUsedAt / halfLife` in SQL; reflections rank higher via `reflectionBias` (default 1.3×).
- `<local_observations>` — current session's rows in chronological order, unranked. Represents the session's own compaction summary.

Sources: [src/memory/context-pack.ts:1-54](), [src/memory/loader.ts:1-78](), [src/memory/observational.ts:557-607]()
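
A minimal rendering sketch, assuming each row contributes its stored `content` verbatim, layers are joined with newlines, and an empty layer is omitted; the exact wrapper markup produced by `context-pack.ts` may differ, and `renderContextPack` is a hypothetical name. The point it illustrates is that rendering is a pure function of the cached pack, so an unchanged pack yields a byte-identical prefix.

```ts
// Sketch only: hypothetical renderer for the two-layer pack.
interface PackRow {
  content: string;
}
interface ContextPack {
  global: PackRow[];
  local: PackRow[];
}

// Pure function: same pack in, same bytes out. No clocks, no randomness.
export function renderContextPack(pack: ContextPack): string {
  const section = (tag: string, rows: PackRow[]): string =>
    rows.length === 0 ? "" : `<${tag}>\n${rows.map((r) => r.content).join("\n")}\n</${tag}>`;
  return [section("global_observations", pack.global), section("local_observations", pack.local)]
    .filter((s) => s.length > 0)
    .join("\n\n");
}
```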

---

## Storage Schema Invariants

The `observations` table is the single source of truth. Key columns:

| Column | Purpose |
|---|---|
| `id` | `mem_<nanoid(12)>` — stable identifier used by range markers and `usedObservationIds` |
| `session_id` | Session owner. `NULL` for pre-session rows; `__global_reflection__` for global prune reflections |
| `kind` | `"observation"` or `"reflection"` |
| `priority` | `"high"` / `"medium"` / `"low"` — inferred from emoji in content (`🔴`→high, `🟡`→medium) |
| `last_used_at` | Bumped by `bumpLastUsed()` when `usedObservationIds` cites the row |
| `content` | The observation text, wrapped in `<observation-group>` |
| `tags_json` | `["observational-memory"]` base; reflections add `"reflection"` and optionally `"global-prune"` |

`observation_embeddings` stores the 3072-dimension vectors with the model tag echoed from the server. An `ON CONFLICT DO UPDATE` upsert means a re-embed after a model swap safely updates the stale row.

Sources: [src/memory/storage.ts:305-355](), [src/memory/migrations.ts:44-60]()

---

## Failure Isolation Summary

| Failure | Behavior |
|---|---|
| Observer LLM call fails | `activateObservations` propagates the error and `updateObservationalMemory` throws; the error is logged, but by default the user's reply is not blocked (the runner wraps the update in a background task) |
| Reflector produces over-budget output | Retry with explicit token count, then hard `trimObservationTextToTokenBudget` |
| `withDb` lock contention (peer CLI holds the lock) | Returns `undefined`; `appendObservation`, `bumpLastUsed`, `replaceSessionObservations` all no-op silently. A single warning is surfaced via `onWarn` the first time this happens. |
| Corrupted PGlite directory | `quarantineDataDirectory` renames the directory aside and starts fresh; `onRecover` is called with the backup path |
| `DUET_API_KEY` missing | `EmbeddingUnavailableError` thrown; backfill worker backs off; recall degrades to keyword-only |
| Embedding endpoint 5xx | Exponential backoff, up to 3 retries; then worker sleeps 60s |
| Global reflect produces empty batch | Eligible rows are left unprocessed; store is not written; next `duet memory reflect` run retries them |

The design principle throughout is that memory bookkeeping failures must never crash or visibly stall a foreground turn. The observation pipeline runs after the actor turn completes; the embedding worker is fully decoupled; and all storage calls treat lock contention as a silent no-op rather than an error.

Sources: [src/memory/storage.ts:144-160](), [src/memory/embedding-worker.ts:124-160](), [src/memory/storage.ts:54-129]()

---

## 06. Recall & the Frozen Context Pack — What Survives Into Every Prompt

> Every turn is prefixed with a frozen two-layer memory pack: global cross-session observations ranked by recency half-life, and the local session's compaction summary. The pack rebuilds only on three specific events (initial load, reflector replacement, wire-shaping eviction) so the provider's prompt cache survives turn-over-turn. The `recall_memory` tool uses hybrid RRF retrieval (pgvector cosine + tsvector keyword) to surface anything that missed the pack. This page explains pack structure, rebuild triggers, the cache-stability invariant, and RRF fusion.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/06-recall-the-frozen-context-pack-what-survives-into-every-prompt.md
- Generated: 2026-05-22T00:33:20.926Z

### Source Files

- `src/memory/context-pack.ts`
- `src/memory/recall.ts`
- `src/memory/loader.ts`
- `src/memory/store.ts`
- `src/memory/pglite.ts`
- `evals/recall-memory-cross-session.eval.ts`
- `evals/recall-memory-implicit-triggers.eval.ts`
- `test/memory-recall.test.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/memory/context-pack.ts](src/memory/context-pack.ts)
- [src/memory/store.ts](src/memory/store.ts)
- [src/memory/loader.ts](src/memory/loader.ts)
- [src/memory/recall.ts](src/memory/recall.ts)
- [src/memory/session.ts](src/memory/session.ts)
- [src/memory/pglite.ts](src/memory/pglite.ts)
- [evals/recall-memory-cross-session.eval.ts](evals/recall-memory-cross-session.eval.ts)
- [evals/recall-memory-implicit-triggers.eval.ts](evals/recall-memory-implicit-triggers.eval.ts)
- [test/memory-recall.test.ts](test/memory-recall.test.ts)
</details>

# Recall & the Frozen Context Pack — What Survives Into Every Prompt

Every agent turn is prefixed with a two-layer, frozen memory pack that is inserted between the system prompt and the live message tail. The pack is built from the durable PGlite observation store and held byte-identical between rebuilds, so the provider's prompt cache keeps hitting instead of re-encoding the prefix on every turn. When the model needs something that didn't make it into the pack (older sessions, long-tail observations, exact code symbols), it calls the `recall_memory` tool, which runs a hybrid Reciprocal Rank Fusion (RRF) search combining pgvector cosine similarity and PostgreSQL full-text (`tsvector`) ranking.

This page explains how the pack is structured, when it is rebuilt, why the rebuild triggers are deliberately narrow, and how RRF fusion fills the gap between the frozen pack and the full durable store.

---

## The Two-Layer Pack

The frozen pack is the single piece of memory state kept in-process by the runner. Everything else — raw observations, rankings, vectors — lives in PGlite and is read on demand.

```
┌─────────────────────────────────────┐
│          System prompt              │
├─────────────────────────────────────┤
│  GLOBAL layer  (cross-session,      │
│                ranked by score,     │
│                budget-fitted)       │
├─────────────────────────────────────┤
│  LOCAL layer   (current session,    │
│                chronological,       │
│                full fidelity)       │
├─────────────────────────────────────┤
│  Live message tail  (grows/turn)    │
└─────────────────────────────────────┘
```

The two layers are intentionally separated so the model can tell what comes from this conversation versus what comes from accumulated cross-session knowledge.

### Global Layer

The global layer contains the highest-signal observations from every session except the caller's current one. Rows are ranked by a composite score computed entirely in SQL:

```
rank = ln(priority_weight) + ln(kind_bias) + last_used_at / recency_half_life_ms
```

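`priority_weight` maps `high→3`, `medium→2`, `low→1`. `kind_bias` applies a configurable `reflectionBias` multiplier to `kind = 'reflection'` rows, so compacted reflections outrank raw observations. The `last_used_at / h` term stands in for the log of the exponential decay `0.5^((now - last_used_at) / h)`. That log equals `ln 2 · (last_used_at - now) / h`, so up to the constant factor `ln 2` it differs from `last_used_at / h` only by a `now` term that is identical for every candidate and drops out of the ordering (without the `ln 2` factor, the effective half-life relative to the priority and kind terms is `h · ln 2` rather than `h`). What remains is a pure function of stored columns that the `idx_obs_kind_priority_lastused` index can cover directly.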

After Postgres returns the ranked candidates, JavaScript performs a greedy token-budget fit. It walks the ranked list and skips any row that would overflow the budget, but keeps scanning past it: a smaller row later in the ranking may still fit, and stopping at the first oversized row (for example, a long top-ranked reflection) would leave prompt space unused.

Sources: [src/memory/loader.ts:79-138](src/memory/loader.ts)
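
The ordering and the budget fit can be reproduced in a few lines. In the real code the score is computed in SQL and only the fit runs in JavaScript; here both run in TypeScript for illustration. `rankScore`, `greedyFit`, the row shape, and the half-life used in the example are assumptions (this section does not state the default half-life).

```ts
// Sketch only: hypothetical names; the real ranking runs as SQL in loader.ts.
const PRIORITY_WEIGHT = { high: 3, medium: 2, low: 1 } as const;

interface RankedRow {
  id: string;
  priority: keyof typeof PRIORITY_WEIGHT;
  kind: "observation" | "reflection";
  lastUsedAt: number; // epoch ms
  tokens: number;
}

export function rankScore(row: RankedRow, halfLifeMs: number, reflectionBias = 1.3): number {
  const kindBias = row.kind === "reflection" ? reflectionBias : 1;
  return Math.log(PRIORITY_WEIGHT[row.priority]) + Math.log(kindBias) + row.lastUsedAt / halfLifeMs;
}

// Walk the ranked list; skip rows that do not fit but keep scanning for smaller ones.
export function greedyFit(rows: RankedRow[], budgetTokens: number, halfLifeMs: number): RankedRow[] {
  const ranked = [...rows].sort((a, b) => rankScore(b, halfLifeMs) - rankScore(a, halfLifeMs));
  const picked: RankedRow[] = [];
  let used = 0;
  for (const row of ranked) {
    if (used + row.tokens > budgetTokens) continue;
    picked.push(row);
    used += row.tokens;
  }
  return picked;
}
```

In the example below, the top-ranked reflection is too large for a 500-token budget, yet the two smaller rows behind it still make it into the pack.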

### Local Layer

The local layer contains every observation the current session wrote, ordered chronologically and loaded in full regardless of size. It represents the session's own compaction summary — the observations and reflections that replaced earlier transcript. Bounding the local layer is the observer/reflector pipeline's job; their thresholds keep the set from unbounded growth, and a reflection condenses it back down when it gets too large.

Legacy rows with a `NULL` `session_id` (written before session-id tracking was introduced) are kept in the global pool and excluded from the current session's local pack. The loader uses `session_id IS DISTINCT FROM sessionId` rather than `session_id <> sessionId` because `NULL <> sessionId` evaluates to `NULL`, not true, which would silently drop those rows from every query; `IS DISTINCT FROM` treats `NULL` as a comparable value and returns true.

Sources: [src/memory/loader.ts:153-166](src/memory/loader.ts)

---

## The MemoryContextCache: In-Process State

`MemoryContextCache` is the runner's sole in-process memory state. It wraps a single `ContextPack` struct:

```typescript
// src/memory/store.ts
export interface ContextPack {
  /** Cross-session ranked memory; rendered above the local section. */
  global: Observation[];
  /** Current session's chronological compaction summary; rendered below global. */
  local: Observation[];
}
```

The cache exposes only `setContextPack()` and `getContextPack()`. The rendered prefix stays byte-identical between rebuild events because the transform reads `getContextPack()` on every dispatch without recomputing it. Only `rebuildMemoryContextPack()` may call `setContextPack()`.

Sources: [src/memory/store.ts:1-51](src/memory/store.ts)

---

## Three Rebuild Triggers — And Nothing Else

The pack is rebuilt on exactly three events. The comment in `context-pack.ts` names them explicitly:

> 1. `loadStoredMemory()` finishes — initial seed.
> 2. The reflector replaces observations — condensed view changed.
> 3. The wire-shaping eviction horizon advances — prompt cache is already invalidating, so piggyback the refresh for free.

Any other path — the observer appending a row mid-turn, a `recall_memory` tool call returning rows — deliberately does **not** refresh the pack. The prefix stays stable so the provider's prompt cache survives.

Failure during rebuild is non-fatal: a missing database, a planner glitch, or a corrupted index leaves the previous pack in place. The runner logs and continues; the user's turn is never blocked behind memory bookkeeping.

Sources: [src/memory/context-pack.ts:6-24](src/memory/context-pack.ts)
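
The cache contract and the narrow trigger set can be sketched together. `PackCache` is a stand-in modeled on `MemoryContextCache`, and the `MemoryEvent` union, the `onMemoryEvent` dispatcher, and the `loadPack` callback are hypothetical. The sketch only encodes what this section states: three events rebuild, everything else is ignored, and a failed rebuild keeps the previous pack.

```ts
// Sketch only: hypothetical event names and dispatcher.
interface ContextPack<Row> {
  global: Row[];
  local: Row[];
}

export class PackCache<Row> {
  private pack: ContextPack<Row> = { global: [], local: [] };
  getContextPack(): ContextPack<Row> {
    return this.pack;
  }
  setContextPack(pack: ContextPack<Row>): void {
    this.pack = pack;
  }
}

type RebuildTrigger = "initial-load" | "reflection-replaced" | "eviction-horizon-advanced";
type MemoryEvent = RebuildTrigger | "observation-appended" | "recall-returned";

const REBUILD_TRIGGERS: ReadonlySet<MemoryEvent> = new Set<MemoryEvent>([
  "initial-load",
  "reflection-replaced",
  "eviction-horizon-advanced",
]);

export async function onMemoryEvent<Row>(
  event: MemoryEvent,
  cache: PackCache<Row>,
  loadPack: () => Promise<ContextPack<Row>>,
  log: (msg: string) => void = console.warn,
): Promise<boolean> {
  if (!REBUILD_TRIGGERS.has(event)) return false; // prefix stays byte-identical
  try {
    cache.setContextPack(await loadPack());
    return true;
  } catch (err) {
    // Non-fatal: keep serving the previous pack and let the turn proceed.
    log(`memory pack rebuild failed, keeping previous pack: ${String(err)}`);
    return false;
  }
}
```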

### Why This Invariant Matters

Provider prompt caches (Anthropic, OpenAI) hash the prefix bytes. If the prefix changes on every turn, every turn pays the full re-encoding cost. By rebuilding only on compaction events — which happen orders of magnitude less often than turns — the system pays exactly one cache invalidation per compaction, not one per turn.

The consequence is that new observations written by the observer mid-session are **not visible in the pack until the next rebuild trigger**. This is intentional. The model can still retrieve them immediately via `recall_memory`.

---

## Pack Rebuild Implementation

`rebuildMemoryContextPack` holds the open PGlite handle across both layer queries using one `withDb` call so the cross-process lock is acquired just once:

```typescript
// src/memory/context-pack.ts
await options.session.withDb(async (db) => {
  const [globalPack, localPack] = await Promise.all([
    loadGlobalPack(db, { ... }),
    options.sessionId !== undefined
      ? loadLocalPack(db, { sessionId: options.sessionId })
      : Promise.resolve([]),
  ]);
  options.cache.setContextPack({ global: globalPack, local: localPack });
});
```

The local pack is skipped when the runner has no session id (one-shot tools, tests). The global pack always runs; its `excludeSessionId` parameter is optional, and passing `undefined` deliberately includes every session's rows, which is the mode unrestricted recall uses.

Sources: [src/memory/context-pack.ts:36-53](src/memory/context-pack.ts)

---

## The `recall_memory` Tool: Hybrid RRF Retrieval

When the model calls `recall_memory`, it reaches into the full durable store for anything that missed the frozen pack. The retrieval runs two search paths in parallel inside a single `withDb` call, then fuses the ranked lists via Reciprocal Rank Fusion.

### Search Paths

| Path | Mechanism | Strength | Fallback behavior |
|---|---|---|---|
| Keyword (tsvector) | `websearch_to_tsquery` + `ts_rank` + GIN index | Exact tokens, proper nouns, code symbols | Always runs if query is non-empty |
| Vector (pgvector) | `embedding <=> $1::vector` cosine distance + HNSW index | Fuzzy paraphrases, semantic similarity | Skipped if no `embed` function provided, or if the embed call throws |

Each path fetches up to `PER_PATH_TOP_K = 30` candidates. The paths degrade independently: if the embedding endpoint is unavailable, the vector path is dropped and the function returns keyword-only results. If both paths fail to return anything, the function returns an empty list rather than throwing.

Sources: [src/memory/recall.ts:28-111](src/memory/recall.ts)

### Reciprocal Rank Fusion

RRF is a score-free rank aggregation method. Each candidate receives a contribution from each ranked list it appears in:

```
score(id) += 1 / (RRF_K + rank)
```

`RRF_K = 60` matches the value recommended in the original Cormack et al. paper and used by systems like gbrain and Zep. Smaller `k` weights the top-of-list more aggressively; 60 trades some top-1 sharpness for stability across lists with different score scales.

Candidates that rank highly in both lists accumulate contributions from both, rising above candidates that only appear in one. Ties resolve by first-seen insertion order, which mirrors gbrain's tiebreak convention.

```typescript
// src/memory/recall.ts
export function reciprocalRankFusion(rankedLists: ScoredHit[][]): string[] {
  const scores = new Map<string, { score: number; firstSeen: number }>();
  let order = 0;
  for (const list of rankedLists) {
    for (const hit of list) {
      const contribution = 1 / (RRF_K + hit.rank);
      const existing = scores.get(hit.id);
      if (existing) {
        existing.score += contribution;
      } else {
        scores.set(hit.id, { score: contribution, firstSeen: order++ });
      }
    }
  }
  // sort descending by score, then by first-seen for ties
  ...
}
```

After fusion, IDs are passed to `hydrate()`, which fetches full `Observation` rows from the database in a single `WHERE id = ANY($1::text[])` query and reorders them to match the fused ranking.

Sources: [src/memory/recall.ts:229-269](src/memory/recall.ts)
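
Because the excerpt above elides the final sort, here is a self-contained sketch of RRF plus the hydrate-time reordering. It is a reconstruction for illustration, not the verbatim source: it assumes ranks are 1-based and that a hit carries only `id` and `rank`, and `rrfFuse` and `reorderByIds` are hypothetical names (the latter for the step `hydrate()` performs after its `ANY($1::text[])` query).

```ts
// Sketch only: hypothetical names; the sort follows the excerpt's own comment.
const RRF_K = 60;

interface ScoredHit {
  id: string;
  rank: number; // 1-based position within its list (assumption)
}

export function rrfFuse(rankedLists: ScoredHit[][]): string[] {
  const scores = new Map<string, { score: number; firstSeen: number }>();
  let order = 0;
  for (const list of rankedLists) {
    for (const hit of list) {
      const contribution = 1 / (RRF_K + hit.rank);
      const existing = scores.get(hit.id);
      if (existing) existing.score += contribution;
      else scores.set(hit.id, { score: contribution, firstSeen: order++ });
    }
  }
  // Descending score; ties fall back to first-seen order.
  return [...scores.entries()]
    .sort(([, a], [, b]) => b.score - a.score || a.firstSeen - b.firstSeen)
    .map(([id]) => id);
}

// Database rows come back in arbitrary order; restore the fused ranking.
export function reorderByIds<T extends { id: string }>(rows: T[], ids: string[]): T[] {
  const byId = new Map(rows.map((r): [string, T] => [r.id, r]));
  return ids.flatMap((id) => {
    const row = byId.get(id);
    return row ? [row] : []; // ids deleted since the search are dropped
  });
}
```

With `k = 60`, a hit ranked 2nd in both lists (`2/62 ≈ 0.0323`) beats a hit ranked 1st in only one list (`1/61 ≈ 0.0164`), which is the "agreement wins" behavior described above.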

### Scope Filtering

The `recallMemory` function accepts a `scope` parameter controlling which sessions are searched:

| Scope | SQL behavior |
|---|---|
| `"all"` | No session filter applied |
| `"session"` | `session_id = $N` |
| `"global"` | `session_id IS DISTINCT FROM $N` (includes legacy `NULL` rows) |

The `IS DISTINCT FROM` form is required for the global scope so that pre-session-id legacy rows with `NULL` session ids remain in the pool.

Sources: [src/memory/recall.ts:193-216](src/memory/recall.ts)
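
A sketch of how the scope table maps to a parameterized SQL fragment. `scopeClause` is hypothetical and returns the fragment plus its parameter list; only the three predicates come from the table above, and rejecting a missing session id for the non-`"all"` scopes is an assumption.

```ts
// Sketch only: hypothetical helper producing the scope predicate for recall queries.
type RecallScope = "all" | "session" | "global";

export function scopeClause(
  scope: RecallScope,
  sessionId: string | undefined,
  paramIndex: number,
): { sql: string; params: string[] } {
  if (scope === "all") return { sql: "TRUE", params: [] };
  if (sessionId === undefined) throw new Error(`scope "${scope}" requires a session id`);
  if (scope === "session") return { sql: `session_id = $${paramIndex}`, params: [sessionId] };
  // IS DISTINCT FROM keeps legacy NULL session_id rows; `<>` would yield NULL and drop them.
  return { sql: `session_id IS DISTINCT FROM $${paramIndex}`, params: [sessionId] };
}
```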

---

## PGlite Handle Management: MemorySession

`MemorySession` manages the single PGlite handle per data directory with refcounted opens and an idle-close timer. The idle-close default is 2 seconds, keeping short write bursts (observer + reflector + embedding upserts in one turn) on a single open handle without permanently holding the cross-process lock against a second CLI process.

```
withDb call 1 ─┐
withDb call 2 ─┤─→ one PGlite.create ─→ fn1, fn2 run concurrently
               └─┐
                 └─ idle timer fires after 2s with no in-flight ops
                    → db.close() + lock released
```
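
A minimal sketch of the refcount-plus-idle-timer pattern, with the PGlite open and close abstracted behind a hypothetical `open` callback and `Closable` handle. `RefcountedSession` is not the actual `MemorySession` class; it only mirrors the behavior in the diagram: concurrent calls share one open, and the handle closes once no call has been in flight for `idleMs`.

```ts
// Sketch only: hypothetical class mirroring the diagram above.
interface Closable {
  close(): Promise<void>;
}

export class RefcountedSession<H extends Closable> {
  private handle: Promise<H> | undefined;
  private inFlight = 0;
  private idleTimer: ReturnType<typeof setTimeout> | undefined;
  private readonly open: () => Promise<H>;
  private readonly idleMs: number;

  constructor(open: () => Promise<H>, idleMs = 2_000) {
    this.open = open;
    this.idleMs = idleMs;
  }

  async withDb<T>(fn: (db: H) => Promise<T>): Promise<T> {
    // A new call cancels any pending idle close and reuses the open handle.
    if (this.idleTimer !== undefined) {
      clearTimeout(this.idleTimer);
      this.idleTimer = undefined;
    }
    this.inFlight++;
    try {
      this.handle ??= this.open();
      let db: H;
      try {
        db = await this.handle;
      } catch (err) {
        this.handle = undefined; // let the next call retry the open
        throw err;
      }
      return await fn(db);
    } finally {
      this.inFlight--;
      if (this.inFlight === 0) this.scheduleClose();
    }
  }

  private scheduleClose(): void {
    this.idleTimer = setTimeout(() => {
      const pending = this.handle;
      this.handle = undefined;
      this.idleTimer = undefined;
      // Closing the handle is what releases the cross-process lock.
      void pending?.then((db) => db.close()).catch(() => {});
    }, this.idleMs);
  }
}
```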

The cross-process open-lock (`~/.duet/memory.db/.duet-open.lock`) ensures two `duet` CLI processes cannot both call `PGlite.create` on the same fresh data directory and corrupt each other's migrations. The lock stores the holder's PID; stale locks from crashed processes are detected via `process.kill(pid, 0)` and taken over atomically.

Sources: [src/memory/session.ts:61-115](src/memory/session.ts), [src/memory/pglite.ts:638-684](src/memory/pglite.ts)
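
The PID-file lock with stale detection can be sketched with Node's `fs` module and `process.kill(pid, 0)`, which sends no signal and only checks whether the process exists. This is a simplified illustration under assumptions: the lock file holds only a decimal PID, `EPERM` is treated as "alive", and `tryAcquireLock`/`releaseLock` are hypothetical names. Unlike the real lock, whose takeover this page describes as atomic, the unlink-then-create takeover here is not race-free between two competing takers.

```ts
// Sketch only: simplified PID lock; not the implementation in pglite.ts.
import { closeSync, openSync, readFileSync, unlinkSync, writeSync } from "node:fs";

export function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

function createExclusive(lockPath: string, pid: number): boolean {
  try {
    const fd = openSync(lockPath, "wx"); // fails with EEXIST if the file exists
    writeSync(fd, String(pid));
    closeSync(fd);
    return true;
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") return false;
    throw err;
  }
}

export function tryAcquireLock(lockPath: string, pid: number = process.pid): boolean {
  if (createExclusive(lockPath, pid)) return true;
  const holder = Number.parseInt(readFileSync(lockPath, "utf8"), 10);
  if (Number.isInteger(holder) && isProcessAlive(holder)) return false;
  // Stale lock left by a crashed process: remove it and try once more.
  unlinkSync(lockPath);
  return createExclusive(lockPath, pid);
}

export function releaseLock(lockPath: string): void {
  try {
    unlinkSync(lockPath);
  } catch (err) {
    if ((err as { code?: string }).code !== "ENOENT") throw err;
  }
}
```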

---

## Eval Coverage

Two eval files validate the `recall_memory` trigger behavior end-to-end against the real Anthropic API inside Docker:

**`recall-memory-cross-session.eval.ts`** — explicit trigger scenarios: past-tense markers ("yesterday", "previous session", "already done X"). Asserts that at least one `recall_memory` tool call fires on cross-session questions and exactly zero fire on a self-contained arithmetic prompt.

**`recall-memory-implicit-triggers.eval.ts`** — implicit trigger scenarios: un-anchored named referents (a pet name, a colleague, a release ID, a codenamed project artifact) without past-tense markers. These are harder: the model must infer that "Doughy" refers to a sourdough starter seeded in the durable store rather than hedging or answering generically. The eval documents that `opus-4.7` (the production default) is expected to fail all five implicit positives while `sonnet-4.6` handles them; the current prompt layer is the Pareto-best prose found after iterating against both.

Sources: [evals/recall-memory-cross-session.eval.ts:19-30](evals/recall-memory-cross-session.eval.ts), [evals/recall-memory-implicit-triggers.eval.ts:62-115](evals/recall-memory-implicit-triggers.eval.ts)

---

## Summary

The frozen context pack is the key mechanism that keeps prompt costs predictable as durable memory accumulates across sessions. By rebuilding only on three precisely defined events and holding the pack byte-identical between them, the system pays one prompt-cache invalidation per rebuild rather than one per turn. The global layer's SQL ranking with recency half-life decay, followed by greedy budget fitting, ensures the highest-ranked cross-session rows fill the available token budget; the local layer preserves the current session's compaction summary in full. Everything that misses the pack (long-tail observations, exact code symbols, older sessions) is recoverable through `recall_memory`'s hybrid RRF retrieval, which degrades gracefully to keyword-only when the embedding endpoint is unavailable. The invariant that only those three events may trigger `setContextPack()` is what makes this system coherent; refreshing on observer writes or recall tool calls would silently invalidate the prompt cache on every turn.

Sources: [src/memory/store.ts:6-16](src/memory/store.ts)

---

## 07. Wire Shaping & Model Resolution — Context Budget Enforcement

> wire-shaping.ts enforces a byte budget (15 MB trigger, 80% target) and a token budget (200k default effectiveContext) on the dispatched message list. Eviction advances the WireGuardHorizon, trimming oldest messages in one block to minimize prompt-cache invalidations. Images get a fixed 1,600-token estimate to prevent base64 byte inflation from triggering early eviction. Model resolution (resolver.ts, catalog.ts, duet-gateway.ts) abstracts Anthropic/OpenAI/OpenRouter/Duet Gateway behind a single resolveModelName call. This page explains the two-gate eviction system, cache-miss cost model, and BYOK/BYOC model routing.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/07-wire-shaping-model-resolution-context-budget-enforcement.md
- Generated: 2026-05-22T00:33:51.680Z

### Source Files

- `src/turn-runner/wire-shaping.ts`
- `src/turn-runner/state-compaction.ts`
- `src/model-resolution/resolver.ts`
- `src/model-resolution/catalog.ts`
- `src/model-resolution/duet-gateway.ts`
- `evals/thread-context-loss.eval.ts`
- `evals/context-overflow-recovery.eval.ts`
- `evals/prompt-cache.eval.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/turn-runner/wire-shaping.ts](src/turn-runner/wire-shaping.ts)
- [src/turn-runner/state-compaction.ts](src/turn-runner/state-compaction.ts)
- [src/memory/observational.ts](src/memory/observational.ts)
- [src/model-resolution/resolver.ts](src/model-resolution/resolver.ts)
- [src/model-resolution/catalog.ts](src/model-resolution/catalog.ts)
- [src/model-resolution/duet-gateway.ts](src/model-resolution/duet-gateway.ts)
- [src/turn-runner/turn-runner.ts](src/turn-runner/turn-runner.ts)
- [evals/thread-context-loss.eval.ts](evals/thread-context-loss.eval.ts)
- [evals/prompt-cache.eval.ts](evals/prompt-cache.eval.ts)
</details>

# Wire Shaping & Model Resolution — Context Budget Enforcement

The duet-agent runner maintains a context window that is bounded by two independent constraints: a **token budget** that governs normal text-and-tool sessions, and a **byte budget** that prevents image-heavy threads from inflating serialized request sizes past any sane limit. These two gates are implemented in `wire-shaping.ts` and called from the observational memory transform. Separately, a three-level model resolution chain (`resolver.ts`, `catalog.ts`, `duet-gateway.ts`) normalizes user-supplied model names into concrete provider-specific model objects — abstracting Anthropic, OpenAI, OpenRouter, Vercel AI Gateway, and the Duet hosted gateway behind a single `resolveModelName` call.

Both systems interact at a critical seam: model resolution determines the hard context-window limit that `effectiveContext` is clamped against, and wire-shaping determines how many messages actually reach the provider. Understanding these two mechanisms together is essential for predicting when the system will evict messages, what it will preserve, and how cost-sensitive prompt-cache behavior is protected during eviction.

---

## The Two-Gate Eviction System

### Gate 1: Token Budget (`messageTokens`)

The primary eviction gate tracks estimated token consumption across the dispatched message list. The `DEFAULT_EFFECTIVE_CONTEXT` is **200,000 tokens**, a target well below frontier-model windows (Sources: [src/memory/observational.ts:62](src/memory/observational.ts)). This value is clamped to the model's hard `contextWindow` at runtime:

```ts
// src/turn-runner/turn-runner.ts (resolveEffectiveContext)
protected resolveEffectiveContext(modelWindow?: number): number {
  const userValue = this.config.effectiveContext ?? DEFAULT_EFFECTIVE_CONTEXT;
  return modelWindow !== undefined ? Math.min(userValue, modelWindow) : userValue;
}
```

Within `effectiveContext`, three ratios govern how the budget is allocated:

| Segment | Ratio | Purpose |
|---|---|---|
| `messageTokens` | 60% | Raw-message tail compaction trigger |
| `observationTokens` | 32.5% | Local memory-pack ceiling |
| `globalContextTokenBudget` | 7.5% | Cross-session global pack ceiling |

Sources: [src/memory/observational.ts:73-80](src/memory/observational.ts)
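
As a worked example, the split can be computed directly. The helpers below are hypothetical: `resolveEffectiveContext` repeats the clamp from the runner excerpt as a free function, and `splitBudget` expresses the table's ratios in per-mille so the arithmetic stays in integers. At the 200k default it reproduces the ~65k reflection trigger and ~15k global budget quoted on the memory pages.

```ts
// Sketch only: hypothetical helpers reproducing the table's arithmetic.
const DEFAULT_EFFECTIVE_CONTEXT = 200_000;

// Ratios from the table, in per-mille: 60%, 32.5%, 7.5%.
const RATIOS_PER_MILLE = {
  messageTokens: 600,
  observationTokens: 325,
  globalContextTokenBudget: 75,
} as const;

export function resolveEffectiveContext(userValue: number | undefined, modelWindow?: number): number {
  const value = userValue ?? DEFAULT_EFFECTIVE_CONTEXT;
  return modelWindow !== undefined ? Math.min(value, modelWindow) : value;
}

export function splitBudget(effectiveContext: number) {
  const share = (perMille: number) => Math.floor((effectiveContext * perMille) / 1000);
  return {
    messageTokens: share(RATIOS_PER_MILLE.messageTokens),
    observationTokens: share(RATIOS_PER_MILLE.observationTokens),
    globalContextTokenBudget: share(RATIOS_PER_MILLE.globalContextTokenBudget),
  };
}
```

With a 128k-window model, the clamp yields 128,000, so `messageTokens` becomes 76,800.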

When `messageTokens` is exceeded, the observational context transform calls `findEvictionHorizon` to advance the `WireGuardHorizon` until the token estimate for the post-horizon message slice falls back below the trigger.

### Gate 2: Byte Budget (`WIRE_BYTE_TRIGGER`)

The byte gate exists specifically because **image attachments break the token gate**. A single 2 MB inline base64 image would produce a naive `ceil(bytes/4) ≈ 500k` token estimate, which would prematurely evict earlier user messages from the wire. The actual per-image provider charge is far lower and unrelated to the base64 payload length: Claude downscales images whose long edge exceeds 1,568 px or whose cost would exceed roughly 1,600 tokens, and OpenAI's high-detail tiling is likewise bounded.

Two constants define this gate:

```ts
// src/turn-runner/wire-shaping.ts
export const WIRE_BYTE_TRIGGER = 15 * 1024 * 1024;  // 15 MB
export const WIRE_BYTE_TARGET  = Math.floor(WIRE_BYTE_TRIGGER * 0.8); // 12 MB
```

Sources: [src/turn-runner/wire-shaping.ts:18-28](src/turn-runner/wire-shaping.ts)

The target is 80% of the trigger so that a single block-eviction leaves room for several more turns of growth before tripping the gate again.

### Image Token Estimate: Preventing Base64 Inflation

To avoid false-positive token evictions on image messages, image blocks are counted at a fixed **1,600 tokens** regardless of their actual base64 payload size:

```ts
// src/turn-runner/wire-shaping.ts
export const IMAGE_WIRE_TOKEN_ESTIMATE = 1_600;

function calculateMessageTokens(msg: AgentMessage): number {
  // ...
  for (const block of content) {
    // Images count a fixed estimate, independent of their base64 payload size.
    if (isImageBlock(block)) total += IMAGE_WIRE_TOKEN_ESTIMATE;
    // Text uses a ~4 characters-per-token heuristic.
    else if (isTextBlock(block)) total += Math.ceil(block.text.length / 4);
    // ...
  }
  return total;
}
```

Sources: [src/turn-runner/wire-shaping.ts:43-169](src/turn-runner/wire-shaping.ts)

Byte size is counted separately (raw `block.data.length`) and compared only against `WIRE_BYTE_TRIGGER`, never against the token budget. Each gate sees the signal it was built for. The token gate sees estimated provider cost, and the byte gate sees raw request-payload size.
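The two accountings can be sketched side by side. The block shapes, `estimateTokens`, and `estimateBytes` are simplified stand-ins for the real `calculateWireTokens` / `calculateWireBytes`, whose signatures are not shown on this page. Text bytes are measured as JavaScript string length (UTF-16 code units), which matches the gate diagram later on this page.

```typescript
// Minimal content-block shapes for this sketch only.
type Block =
  | { type: "text"; text: string }
  | { type: "image"; data: string; mimeType: string };

const IMAGE_WIRE_TOKEN_ESTIMATE = 1_600;

// Token view: roughly what the provider will charge.
function estimateTokens(blocks: Block[]): number {
  let total = 0;
  for (const block of blocks) {
    if (block.type === "image") total += IMAGE_WIRE_TOKEN_ESTIMATE;
    else total += Math.ceil(block.text.length / 4);
  }
  return total;
}

// Byte view: how large the request payload is.
function estimateBytes(blocks: Block[]): number {
  let total = 0;
  for (const block of blocks) {
    total += block.type === "image" ? block.data.length : block.text.length;
  }
  return total;
}

// A ~2 MB base64 image is large on the byte gate but small on the token gate.
const screenshot: Block[] = [
  { type: "text", text: "What is wrong in this screenshot?" },
  { type: "image", data: "A".repeat(2 * 1024 * 1024), mimeType: "image/png" },
];
console.log(estimateTokens(screenshot), estimateBytes(screenshot));
```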

---

## The WireGuardHorizon: Sticky Eviction to Protect Prompt-Cache

### Why a Sticky Horizon

The runner calls `transformContext` on every turn against the **full** untransformed message history. A naïve design that re-computed the eviction cut each turn would advance the cut by one message every turn, invalidating the provider's cached prefix each time. Prompt-cache entries are keyed to an exact token prefix; any change forces a cold miss.

The `WireGuardHorizon` pins the cut to a monotonically-advancing timestamp:

```ts
// src/turn-runner/wire-shaping.ts
export interface WireGuardHorizon {
  /** Messages with timestamp <= evictionHorizon are dropped from the wire. */
  evictionHorizon: number;
}
```

Sources: [src/turn-runner/wire-shaping.ts:66-69](src/turn-runner/wire-shaping.ts)

Once the horizon is set, every subsequent turn produces the same shape — the same evicted prefix, the same surviving tail — until the budget is exceeded again and the horizon must advance. Advancing the horizon invalidates the cache once; not advancing it preserves the cache across every turn in between.

The horizon is held in memory on the runner instance and resets on session resume. This is intentional: provider-side prompt caches typically do not survive resume gaps anyway, so persisting the horizon across a restart would buy at most one avoided cache miss.

Sources: [src/turn-runner/turn-runner.ts:1656](src/turn-runner/turn-runner.ts)

### Block Eviction vs. Incremental Trim

`findEvictionHorizon` walks messages oldest-first and advances the horizon past whole messages until the caller-supplied predicate (`satisfiesBudget`) returns true. The result is a **block eviction**, one large jump, rather than one-message-at-a-time trimming. A comment in the source explains the tradeoff:

> "One large block-evict per crossing is far cheaper for prompt caching than incrementally trimming on every turn (each advance invalidates the cached prefix once, so fewer advances = fewer invalidations)."

Sources: [src/turn-runner/wire-shaping.ts:22-28](src/turn-runner/wire-shaping.ts)
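The block-eviction walk can be sketched as follows, assuming messages are sorted by timestamp. `findEvictionHorizonSketch` and the `Msg` shape are hypothetical. The real function's signature is not shown here. Keeping at least one message mirrors the `MIN_HISTORY_TAIL = 1` rule described below.

```typescript
// Minimal message shape for this sketch; real messages carry content blocks.
interface Msg {
  role: "user" | "assistant" | "toolResult";
  timestamp: number;
  tokens: number;
}

// Starting from the current horizon, evict whole messages oldest-first until
// `satisfiesBudget` accepts the surviving tail. One call can jump many
// messages (block eviction), and the result never moves backwards.
function findEvictionHorizonSketch(
  messages: readonly Msg[],
  currentHorizon: number,
  satisfiesBudget: (tail: readonly Msg[]) => boolean,
): number {
  let horizon = currentHorizon;
  let start = messages.findIndex((m) => m.timestamp > currentHorizon);
  if (start === -1) return horizon; // nothing left after the horizon
  // Stop one short of the end so at least one message always survives.
  while (start < messages.length - 1 && !satisfiesBudget(messages.slice(start))) {
    horizon = messages[start]!.timestamp;
    start += 1;
  }
  return horizon;
}
```

The prompt cache stays warm because of how the caller uses this function. It calls the function only when a gate trips and reuses the stored horizon otherwise.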

### `applyEvictionHorizon`: Orphan Cleanup

After finding the horizon, `applyEvictionHorizon` removes messages at or before it and also drops any orphaned `toolResult` or `assistant` messages that end up at the new head — both Anthropic and OpenAI require the first message to have `role: user`:

```ts
// src/turn-runner/wire-shaping.ts:182-193
export function applyEvictionHorizon(messages: AgentMessage[], horizon: number): AgentMessage[] {
  if (horizon <= 0) return messages; // non-positive horizon: nothing is evicted
  let firstKept = 0;
  // Skip every message at or before the horizon.
  while (firstKept < messages.length && messageTimestamp(messages[firstKept]!) <= horizon)
    firstKept += 1;
  // Then skip orphaned toolResult/assistant heads so the slice opens on a user message.
  while (firstKept < messages.length && messages[firstKept]!.role !== "user")
    firstKept += 1;
  // ...
}
```

Sources: [src/turn-runner/wire-shaping.ts:182-193](src/turn-runner/wire-shaping.ts)

The `MIN_HISTORY_TAIL = 1` constant ensures at least the most recent user message always survives eviction — this message is the actor's current prompt and cannot be dropped.
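Here is a self-contained version of the horizon-plus-orphan-cleanup pass. `applyEvictionHorizonSketch` and `WireMessage` are hypothetical. The fallback when everything would be dropped is also an assumption: it keeps everything from the latest user message onward, which is one way to honour `MIN_HISTORY_TAIL = 1`, and the real code in the elided `// ...` may differ.

```typescript
interface WireMessage {
  role: "user" | "assistant" | "toolResult";
  timestamp: number;
}

function applyEvictionHorizonSketch(messages: WireMessage[], horizon: number): WireMessage[] {
  if (horizon <= 0) return messages;
  let firstKept = 0;
  // 1. Skip every message at or before the horizon.
  while (firstKept < messages.length && messages[firstKept]!.timestamp <= horizon) firstKept += 1;
  // 2. Skip orphaned assistant/toolResult heads so the slice opens on a user turn.
  while (firstKept < messages.length && messages[firstKept]!.role !== "user") firstKept += 1;
  // 3. Assumed fallback: never dispatch an empty slice; keep the latest user message.
  if (firstKept >= messages.length) {
    const lastUser = messages.map((m) => m.role).lastIndexOf("user");
    return lastUser === -1 ? [] : messages.slice(lastUser);
  }
  return messages.slice(firstKept);
}
```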

---

## Gate Interaction Diagram

```text
Per-turn dispatch pipeline
═══════════════════════════════════════════════════════════════════

Full transcript (all messages, all time)
         │
         ▼
  ┌─────────────────────────┐
  │ applyEvictionHorizon    │  Drops messages ≤ WireGuardHorizon.evictionHorizon
  │ (wire-shaping.ts)       │  + strips orphaned tool-result / assistant heads
  └───────────┬─────────────┘
              │ Dispatched slice
              ▼
  ┌───────────────────────────────────────┐
  │  Gate 1: Token check                  │
  │  calculateWireTokens(slice)           │  Images → 1,600 fixed tokens
  │  vs. 0.60 × effectiveContext          │  Text/tools → ceil(chars/4)
  └───────────┬─────────┬─────────────────┘
        OK    │         │ Over budget
              │         └──► findEvictionHorizon → advance horizon → retry
              │
  ┌───────────▼───────────────────────────┐
  │  Gate 2: Byte check                   │
  │  calculateWireBytes(slice)            │  Images → raw base64 length
  │  vs. WIRE_BYTE_TRIGGER (15 MB)        │  Text → UTF-16 length
  └───────────┬─────────┬─────────────────┘
        OK    │         │ Over budget
              │         └──► findEvictionHorizon → advance to 80% target → retry
              │
              ▼
         Provider API call
```

---

## State-Level Compaction: A Separate Concern

`state-compaction.ts` handles a distinct problem: the **in-memory and on-disk TurnState** can grow without bound as tool calls accumulate across a long session. Its ceiling is a **100 MB JSON serialized size** limit (`DEFAULT_STATE_MAX_BYTES`), controlled per-runner via `autoStateCompaction.maxBytes`. Unlike wire-shaping, state compaction operates on the stored transcript, not the dispatched slice. It also performs the same `toolResult`/`assistant` head-fix to ensure the post-eviction message list starts with a `user` role.

Sources: [src/turn-runner/state-compaction.ts:14-98](src/turn-runner/state-compaction.ts)
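The size check and head-fix can be sketched as below. The sketch assumes size is measured as the UTF-8 byte length of `JSON.stringify` output, and it takes 100 MB to mean 100 MiB. `compactMessagesSketch` and `serializedBytes` are hypothetical names. The real module measures the whole `TurnState`, not just its messages. Re-serializing on every step is quadratic, so this only illustrates the rule and is not meant for a 100 MB transcript.

```typescript
interface StoredMessage {
  role: "user" | "assistant" | "toolResult";
  content: string;
}

const DEFAULT_STATE_MAX_BYTES = 100 * 1024 * 1024; // assumption: 100 MiB

function serializedBytes(value: unknown): number {
  return new TextEncoder().encode(JSON.stringify(value)).length;
}

function compactMessagesSketch(
  messages: StoredMessage[],
  maxBytes = DEFAULT_STATE_MAX_BYTES,
): StoredMessage[] {
  let start = 0;
  // Drop the oldest messages until the serialized transcript fits (keep at least one).
  while (start < messages.length - 1 && serializedBytes(messages.slice(start)) > maxBytes) start += 1;
  // Head-fix: skip leading assistant/toolResult messages so the list starts with a user turn.
  while (start < messages.length - 1 && messages[start]!.role !== "user") start += 1;
  return messages.slice(start);
}
```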

These two layers are orthogonal:

| Layer | File | Trigger | Target | Scope |
|---|---|---|---|---|
| Wire shaping | `wire-shaping.ts` | 15 MB of bytes, or 60% of `effectiveContext` tokens (120k at the 200k default) | Bytes: 80% of trigger; tokens: back below trigger | Dispatched message slice only |
| State compaction | `state-compaction.ts` | 100 MB JSON size | ≥1 message retained | Full stored transcript |

---

## Model Resolution: `resolveModelName`

### Entry Point

All model references in the runner ultimately pass through a single function:

```ts
// src/model-resolution/resolver.ts:40-63
export function resolveModelName(model: string): Model<any> {
  model = resolveModelReference(model);
  const separator = model.indexOf(":");
  if (separator === -1) throw new Error("Models must use provider:modelId syntax");
  const rawProvider = model.slice(0, separator);
  const rawModelId = model.slice(separator + 1);
  const provider = resolveProviderShorthand(rawProvider) ?? rawProvider;
  const modelId = isKnownProvider(provider)
    ? canonicalizeProviderModelId(provider, rawModelId)
    : rawModelId;
  if (provider === "duet-gateway") {
    const resolved = resolveDuetGatewayModel(modelId);
    if (!resolved) throw new Error(`Unknown duet-gateway model: ${modelId}`);
    return resolved;
  }
  return getModel(provider, modelId);
}
```

Sources: [src/model-resolution/resolver.ts:40-63](src/model-resolution/resolver.ts)

The flow is: normalize shorthand → parse `provider:modelId` → canonicalize aliases → dispatch to Duet gateway branch or standard `getModel`.
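The parsing step splits on the *first* colon only, via `indexOf(":")`, so a model ID that itself contains colons survives intact. The sketch below isolates that step. `parseModelRef` is a hypothetical name. The example ID uses an OpenRouter-style `:free` suffix for illustration. In the real resolver, a bare shorthand such as `opus-4.7` is expanded by `resolveModelReference` before this split runs.

```typescript
interface ParsedModelRef {
  provider: string;
  modelId: string;
}

// Mirrors the separator logic of resolveModelName: split on the first colon.
function parseModelRef(ref: string): ParsedModelRef {
  const separator = ref.indexOf(":");
  if (separator === -1) throw new Error("Models must use provider:modelId syntax");
  return { provider: ref.slice(0, separator), modelId: ref.slice(separator + 1) };
}

console.log(parseModelRef("openrouter:meta-llama/llama-3-8b-instruct:free"));
```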

### Provider Order and Inference

`PROVIDER_ORDER` defines both the resolution priority and the env-var inference order. `duet-gateway` must precede `vercel-ai-gateway` because of a startup shim. The CLI copies `DUET_API_KEY` into `AI_GATEWAY_API_KEY` at startup, so a Duet-only user also appears to have a Vercel key. Checking `duet-gateway` first lets it win before the Vercel entry reads that variable:

```ts
// src/model-resolution/catalog.ts:38-47
export const PROVIDER_ORDER: readonly ProviderPreference[] = [
  { provider: "duet-gateway", customEnvVar: () => process.env[DUET_GATEWAY_API_KEY_ENV] ? DUET_GATEWAY_API_KEY_ENV : null },
  { provider: "vercel-ai-gateway" },
  { provider: "openrouter" },
  { provider: "anthropic" },
  { provider: "openai" },
];
```

Sources: [src/model-resolution/catalog.ts:38-47](src/model-resolution/catalog.ts)

When no `--model` flag is passed, `resolveCliModel` walks `PROVIDER_ORDER` and picks the first provider whose env var is set. This makes the system BYOK-friendly: set `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `OPENROUTER_API_KEY`, or `DUET_API_KEY` and the right provider is selected automatically.
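The inference walk reduces to a first-match loop over an ordered list. `inferProvider` is a hypothetical name. `DUET_API_KEY` stands in for `DUET_GATEWAY_API_KEY_ENV`, whose value is not shown above. The env object is passed in explicitly so the sketch does not touch `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Same order as PROVIDER_ORDER; env-var names follow the text above.
const INFERENCE_ORDER: ReadonlyArray<{ provider: string; envVar: string }> = [
  { provider: "duet-gateway", envVar: "DUET_API_KEY" },
  { provider: "vercel-ai-gateway", envVar: "AI_GATEWAY_API_KEY" },
  { provider: "openrouter", envVar: "OPENROUTER_API_KEY" },
  { provider: "anthropic", envVar: "ANTHROPIC_API_KEY" },
  { provider: "openai", envVar: "OPENAI_API_KEY" },
];

// Returns the first provider, in priority order, whose key is set.
function inferProvider(env: Env): string | null {
  for (const { provider, envVar } of INFERENCE_ORDER) {
    if (env[envVar]) return provider;
  }
  return null;
}
```

Consider the shimmed case, where `DUET_API_KEY` has already been copied into `AI_GATEWAY_API_KEY`. Here both entries match, and only the ordering decides that `duet-gateway` wins.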

### The Catalog: Shorthands and Aliases

`catalog.ts` defines `MODEL_DEFINITIONS` — a list of shorthand names, aliases, and per-provider model IDs. For example, the shorthand `opus-4.7` resolves differently per provider:

| Provider | Resolved model ID |
|---|---|
| `duet-gateway` | `anthropic/claude-opus-4.7` |
| `vercel-ai-gateway` | `anthropic/claude-opus-4.7` |
| `openrouter` | `anthropic/claude-opus-4.7` |
| `anthropic` | `claude-opus-4-7` |

Sources: [src/model-resolution/catalog.ts:65-80](src/model-resolution/catalog.ts)

This alias table lets users pass `claude-opus-4-7`, `claude-opus-4.7`, or bare `opus-4.7` interchangeably. `canonicalizeModelName` normalizes everything to the canonical shorthand; `canonicalizeProviderModelId` maps it to the correct provider-specific ID.
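The two-step normalization can be sketched with a one-entry catalog built from the table above. The `ModelDefinitionSketch` shape and the helpers `toShorthand` / `toProviderModelId` are hypothetical stand-ins for `canonicalizeModelName` / `canonicalizeProviderModelId`. The real `MODEL_DEFINITIONS` entries may be structured differently.

```typescript
interface ModelDefinitionSketch {
  shorthand: string;
  aliases: string[];
  providerIds: Partial<Record<string, string>>;
}

const DEFINITIONS: ModelDefinitionSketch[] = [
  {
    shorthand: "opus-4.7",
    aliases: ["claude-opus-4-7", "claude-opus-4.7"],
    providerIds: {
      "duet-gateway": "anthropic/claude-opus-4.7",
      "vercel-ai-gateway": "anthropic/claude-opus-4.7",
      openrouter: "anthropic/claude-opus-4.7",
      anthropic: "claude-opus-4-7",
    },
  },
];

// Step 1: any alias (or the shorthand itself) -> canonical shorthand.
function toShorthand(name: string): string | undefined {
  return DEFINITIONS.find((d) => d.shorthand === name || d.aliases.includes(name))?.shorthand;
}

// Step 2: canonical shorthand -> provider-specific ID; unknown names pass through.
function toProviderModelId(provider: string, name: string): string {
  const shorthand = toShorthand(name);
  const def = DEFINITIONS.find((d) => d.shorthand === shorthand);
  return def?.providerIds[provider] ?? name;
}
```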

Default models are also separated by use case:

```ts
export const DEFAULT_CLI_MODEL = "opus-4.7";
export const DEFAULT_CLI_MEMORY_MODEL = "gpt-5.4-mini";
```

Sources: [src/model-resolution/catalog.ts:30-31](src/model-resolution/catalog.ts)

The memory model is intentionally cheaper — it runs observational summarization, not actor turns.

### Duet Gateway: BYOC Routing

The `duet-gateway` provider is a thin proxy layer over Vercel AI Gateway. It reuses Vercel's model registry and request/response contract verbatim, overriding only `baseUrl` to point at the Duet-hosted proxy:

```ts
// src/model-resolution/duet-gateway.ts:40-48
export function resolveDuetGatewayModel(modelId: string): Model<any> | undefined {
  forceDuetGatewayAuth();
  const upstream = getModel("vercel-ai-gateway", modelId);
  if (!upstream) return undefined;
  return { ...upstream, baseUrl: getDuetGatewayBaseUrl() };
}
```

Sources: [src/model-resolution/duet-gateway.ts:40-48](src/model-resolution/duet-gateway.ts)

Auth is handled by `forceDuetGatewayAuth`, which overwrites `AI_GATEWAY_API_KEY` with the value of `DUET_API_KEY` before every Duet request. This is intentionally stricter than the startup shim (`shimDuetApiKeyToAiGateway`), which avoids clobbering an existing Vercel key. The force-overwrite matters when the user has both keys set, because the Duet proxy rejects Vercel-issued `vck_...` tokens with an opaque 500.

The base URL is `${DUET_APP_BASE_URL}/api/v1/ai-gateway`, resolved via `resolveDuetAppBaseUrl()`. Users override the app origin with `DUET_APP_BASE_URL`; the gateway path is fixed.

Sources: [src/model-resolution/duet-gateway.ts:1-78](src/model-resolution/duet-gateway.ts)
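The difference between the startup shim and the per-request force fits in two small functions. All names here are hypothetical. Trimming trailing slashes from the app origin is an assumption about how `DUET_APP_BASE_URL` is normalized. The env object is passed explicitly instead of mutating `process.env`.

```typescript
type Env = Record<string, string | undefined>;

const GATEWAY_PATH = "/api/v1/ai-gateway"; // fixed path, per the text above

function duetGatewayBaseUrl(appBaseUrl: string): string {
  return appBaseUrl.replace(/\/+$/, "") + GATEWAY_PATH;
}

// Startup shim: fills AI_GATEWAY_API_KEY only when it is empty.
function shimKey(env: Env): void {
  if (env.DUET_API_KEY && !env.AI_GATEWAY_API_KEY) env.AI_GATEWAY_API_KEY = env.DUET_API_KEY;
}

// Per-request force: always overwrites, so a vck_ key never reaches the Duet proxy.
function forceKey(env: Env): void {
  if (env.DUET_API_KEY) env.AI_GATEWAY_API_KEY = env.DUET_API_KEY;
}
```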

---

## Cache Miss Cost Model

`evals/prompt-cache.eval.ts` verifies that `cacheRead > 0` on the second turn of a resumed session. This confirms that the serialized `TurnState` reconstructs an exact byte-for-byte prefix. The `WireGuardHorizon` keeps this true from turn to turn. While the horizon does not advance, the dispatched slice is deterministic, and the provider finds its cached entry.

Sources: [evals/prompt-cache.eval.ts:33-53](evals/prompt-cache.eval.ts)

When the horizon does advance (either from a token/byte budget crossing or from a provider context-overflow recovery), the runner treats that cache miss as an opportunity to also refresh the frozen memory pack — piggybacking on the already-invalidated prefix rather than paying a second miss.

Sources: [src/turn-runner/turn-runner.ts:1641-1644](src/turn-runner/turn-runner.ts)

The thread-context-loss eval (`evals/thread-context-loss.eval.ts`) documents the original failure mode this system corrects. In image-heavy threads, base64 byte inflation triggered premature eviction of early user messages, so the model appeared to "forget" prior conversation. The fix routes images to `IMAGE_WIRE_TOKEN_ESTIMATE` for token accounting and isolates raw bytes for the byte gate, which resolved the mismatch.

Sources: [evals/thread-context-loss.eval.ts:20-45](evals/thread-context-loss.eval.ts)

---

## Summary

Wire shaping enforces a two-gate eviction system: a token gate (60% of `effectiveContext`, default 200k) for normal sessions, and a byte gate (15 MB trigger, 80% target) that catches image-payload inflation the token gate would miscount. Both gates advance the same `WireGuardHorizon`, which pins the eviction cut to a monotonically-increasing timestamp so subsequent turns reconstruct an identical dispatched prefix and re-hit the provider's prompt cache. Model resolution normalizes any user-supplied shorthand, alias, or `provider:modelId` string into a concrete `Model` object through a five-provider catalog, with `duet-gateway` taking priority when `DUET_API_KEY` is set and reusing Vercel AI Gateway's transport and model registry with only the base URL and auth token swapped.

Sources: [src/turn-runner/wire-shaping.ts:1-214](src/turn-runner/wire-shaping.ts), [src/model-resolution/resolver.ts:40-63](src/model-resolution/resolver.ts)

---

## 08. Invariants, Failure Modes & Safe-Change Rules

> A synthesis of the core invariants that hold across all subsystems, the failure modes that break them, and how to change the codebase safely. Covers: TurnState as the only cross-process contract; memory pack rebuild triggers (the three-event rule); prompt-cache stability conditions; state machine history append-only guarantee; PGlite cross-process lock; transient-error retry scope; and which files are safe to change in isolation versus which touch multiple invariants.

- Page Markdown: https://grok-wiki.com/public/wiki/dzhng-duet-agent-82dbe2572d3a/pages/08-invariants-failure-modes-safe-change-rules.md
- Generated: 2026-05-22T00:30:22.501Z

### Source Files

- `src/turn-runner/transient-error.ts`
- `src/guardrails/firewall.ts`
- `src/guardrails/semantic.ts`
- `src/memory/session.ts`
- `src/memory/migrations.ts`
- `evals/state-machine-real-session-carry-forward.eval.ts`
- `evals/source-of-truth-first.eval.ts`
- `test/transient-error.test.ts`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [src/turn-runner/transient-error.ts](src/turn-runner/transient-error.ts)
- [src/guardrails/firewall.ts](src/guardrails/firewall.ts)
- [src/guardrails/semantic.ts](src/guardrails/semantic.ts)
- [src/memory/session.ts](src/memory/session.ts)
- [src/memory/migrations.ts](src/memory/migrations.ts)
- [src/memory/pglite.ts](src/memory/pglite.ts)
- [src/memory/context-pack.ts](src/memory/context-pack.ts)
- [src/types/protocol.ts](src/types/protocol.ts)
- [src/types/state-machine.ts](src/types/state-machine.ts)
- [src/turn-runner/turn-runner.ts](src/turn-runner/turn-runner.ts)
- [evals/state-machine-real-session-carry-forward.eval.ts](evals/state-machine-real-session-carry-forward.eval.ts)
- [evals/source-of-truth-first.eval.ts](evals/source-of-truth-first.eval.ts)
- [test/transient-error.test.ts](test/transient-error.test.ts)
</details>

# Invariants, Failure Modes & Safe-Change Rules

This page synthesizes the core contracts that must hold across the duet-agent subsystems, explains which failure modes break each contract, and gives concrete rules for making changes safely. Understanding these invariants is the difference between a clean feature addition and a subtle regression that degrades only under production load or across concurrent processes.

The system has six interacting areas where invariants are load-bearing: the `TurnState` cross-process contract, the three-event rule for memory pack rebuilds, prompt-cache stability conditions, the state-machine history append-only guarantee, the PGlite cross-process open lock, and the transient-error retry scope. Each area constrains what you can change in adjacent code without breaking the whole.

---

## 1. TurnState — The Only Cross-Process Contract

### What it is

`TurnState` is the serializable snapshot that survives process boundaries. Every terminal event (`complete`, `ask`, `interrupted`, `sleep`) carries a `TurnState` back to the caller. The caller — CLI, daemon, HTTP server, or parent process — persists it and passes it back into `TurnRunner.start({ state })` on resume.

```typescript
// src/types/protocol.ts:169-208
export interface TurnState {
  status: TurnStateStatus;
  mode: TurnMode;
  options?: TurnOptions;
  agent: AgentSession;
  stateMachine?: StateMachineSession;
  todos?: TurnTodo[];
  followUpQueue?: TurnFollowUpQueueEntry[];
  queuedCommands?: TurnCommand[];
}
```

### Invariants

| Field | Invariant |
|---|---|
| `status` | Always one of the lifecycle values: `running`, `waiting_for_human`, `sleeping`, `interrupted`, `completed`, `failed`, `cancelled`. |
| `mode` | Frozen at session start. An `"auto"` session stays `"auto"` so it can keep creating definitions across turns; explicit-definition sessions stay constrained to their declared state set. |
| `agent` | Always present. Owns the conversation transcript regardless of whether a state machine is active. |
| `stateMachine` | Present iff the session is in state-machine mode. Contains the full append-only history. |

### Failure modes

- **Dropping fields on serialize/deserialize**: Any field omitted from persistence breaks resume. `followUpQueue` and `queuedCommands` are easy to miss — they carry pending prompts that must replay on resume. Missing them silently drops user input.
- **Mutating `mode` post-start**: Changing `mode` between turns of the same session makes the routing guard misclassify turns. `"auto"` sessions converted to `"agent"` stop creating definitions. Explicit-definition sessions flipped to `"auto"` ignore the caller's intent.
- **Losing `stateMachine.history`**: The history array is the only durable audit log for state transitions. Truncating it breaks replay evals and causes carry-forward failures (see Section 4).

### Safe-change rules

- Add optional fields freely — the runner ignores unknown fields on resume.
- Never remove or rename a field without a migration path that rewrites persisted `TurnState` blobs.
- Never mutate `mode` after `start()` has been called for a session.

Sources: [src/types/protocol.ts:169-208](src/types/protocol.ts)
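The "dropped fields" failure mode can be caught with a round-trip check in tests. `TurnStateSketch`, `roundTrip`, and `missingKeys` are hypothetical. The sketch also assumes the caller persists the snapshot as JSON. `JSON.stringify` omits keys whose value is `undefined`, so the check ignores those keys and flags only real data that went missing.

```typescript
// Simplified stand-in for TurnState with only the fields needed here.
interface TurnStateSketch {
  status: string;
  mode: string;
  agent: { messages: unknown[] };
  followUpQueue?: { prompt: string }[];
  queuedCommands?: { type: string }[];
}

// Persist and restore through JSON, as a CLI or HTTP caller might.
function roundTrip(state: TurnStateSketch): TurnStateSketch {
  return JSON.parse(JSON.stringify(state)) as TurnStateSketch;
}

// Top-level keys that held a value before persistence but are absent after.
function missingKeys(before: object, after: object): string[] {
  return Object.entries(before)
    .filter(([, value]) => value !== undefined)
    .map(([key]) => key)
    .filter((key) => !(key in after));
}
```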

---

## 2. Memory Pack Rebuild — The Three-Event Rule

### What it is

The memory pack is a frozen prefix rendered above the agent's message tail. It holds global cross-session observations and reflections, plus local session-scoped ones. Rebuilding it requires an expensive DB query, and rebuilding too often breaks prompt-cache stability (Section 3). `context-pack.ts` enforces exactly three rebuild triggers.

```typescript
// src/memory/context-pack.ts:8-23
// Three events trigger a refresh and exactly three:
//   1. `loadStoredMemory()` finishes — initial seed.
//   2. The reflector replaces observations — condensed view changed.
//   3. The wire-shaping eviction horizon advances — prompt cache is
//      already invalidating, so piggyback the refresh for free.
//
// Any other path (observer appending a row mid-turn, recall_memory
// tool returning rows) deliberately does NOT refresh...
```

### The three events in code

| Event | Where | Why |
|---|---|---|
| `loadStoredMemory()` completes | `turn-runner.ts:1714-1718` | Initial seed so the first dispatched turn sees a frozen prefix. |
| Reflector writes reflections | `turn-runner.ts:1559-1566` | Reflector condensed the observation set; pack must pick up the new view. |
| Wire-shaping eviction horizon advances | `turn-runner.ts:1641-1646` | Cache is already invalidating; piggyback the refresh instead of paying two misses. |

### Invariants

- The observer writing a row mid-turn does **not** trigger a rebuild. The prefix stays stable for that turn's LLM calls.
- The `recall_memory` tool returning rows does **not** trigger a rebuild. Tool results do not alter the frozen prefix.
- Only reflector output (condensed summaries, not raw observations) earns a rebuild.
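The rule can be encoded as an exhaustive switch over a discriminated union. Every new event kind then has to be classified on purpose. The event names and `shouldRebuildPack` are hypothetical. The runner does not necessarily centralize the decision like this.

```typescript
// Events that reach the memory layer during a session (names hypothetical).
type MemoryEvent =
  | { kind: "storedMemoryLoaded" }
  | { kind: "reflectorFinished"; reflections: number }
  | { kind: "evictionHorizonAdvanced" }
  | { kind: "observerAppended" }
  | { kind: "recallMemoryReturned"; rows: number };

// The three-event rule: only these paths may rebuild the frozen pack.
function shouldRebuildPack(event: MemoryEvent): boolean {
  switch (event.kind) {
    case "storedMemoryLoaded":
      return true; // 1. initial seed
    case "reflectorFinished":
      return event.reflections > 0; // 2. only if the condensed view changed
    case "evictionHorizonAdvanced":
      return true; // 3. cache already invalidating, so piggyback
    case "observerAppended":
    case "recallMemoryReturned":
      return false; // keep the prefix byte-stable mid-turn
  }
}
```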

### Failure modes

- **Adding a fourth rebuild trigger**: Any observer write path that calls `refreshMemoryContextPack()` will bust the prompt cache on every tool result, turning a cache hit into a cache miss per observation — catastrophic for long agentic turns.
- **Suppressing the reflector trigger**: If the `result.reflections.length > 0` branch that calls the refresh is removed, the pack goes stale after a reflection run. The agent then works from an outdated condensed view for the rest of the session.
- **Rebuilding outside an idle-close window**: `rebuildMemoryContextPack` calls `session.withDb(...)`, which pins the cross-process open lock. Rebuilding too often inside a tight loop keeps the lock held longer, starving a concurrent CLI.

### Safe-change rules

- If you add a new memory writer (e.g., a new tagging step), verify it does **not** call `refreshMemoryContextPack`. It should only write rows; the reflector cycle will pick them up in the next condensation.
- When changing the reflector output shape, update the check at `turn-runner.ts:1564` to match the new reflections field, or the trigger will silently stop firing.

Sources: [src/memory/context-pack.ts:1-54](src/memory/context-pack.ts), [src/turn-runner/turn-runner.ts:1559-1649](src/turn-runner/turn-runner.ts)

---

## 3. Prompt-Cache Stability Conditions

### What it is

The provider's prompt cache survives between turns when the rendered system prompt prefix (including the frozen memory pack) is byte-identical between calls. A cache miss wastes tokens and increases latency. The codebase has three explicit stability guarantees that protect the cache.

### Stability guarantees

```text
Stable between turns (cache hits expected):
  [system prompt + AGENTS.md]
  [frozen global memory pack]
  [frozen local memory pack]
  ──────────────────────────── (frozen prefix)
  [message history tail]
  [latest user turn]

Unstable (invalidates cache):
  • Memory pack rebuild (three-event rule)
  • Wire-shaping advances eviction horizon
  • Model change mid-session (TurnOptions.model)
```

1. **Memory pack is frozen per turn**: Observer appends and `recall_memory` tool calls do not alter the frozen pack (see Section 2). The prefix bytes stay identical across all LLM calls in a single turn.
2. **Wire-shaping uses a sticky horizon**: `applyEvictionHorizon` applies the same eviction decision across all calls in a turn, so the provider sees the same message tail. The runner creates the horizon with `createInitialHorizon` once at session start. It advances the horizon only when the dispatched slice would exceed a wire budget (`turn-runner.ts:1631-1635`).
3. **Memory transform re-injects synthetic wrappers transiently**: The `createObservationalContextTransform` re-injects memory wrappers on each request using the frozen pack, not new DB queries. They do not count toward `TurnState.messages` bytes (`protocol.ts:545-548`).
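Byte-stability violations are easier to debug with a cheap fingerprint of the frozen prefix taken before each provider call. This helper is a hypothetical debugging aid, not part of the codebase. It uses Node's `node:crypto` `createHash`. Two calls can only share a cache entry if their fingerprints match.

```typescript
import { createHash } from "node:crypto";

// Hash the frozen prefix segments in dispatch order. A NUL separator keeps
// ["ab", "c"] distinct from ["a", "bc"], since they are separate blocks.
function prefixFingerprint(segments: readonly string[]): string {
  const hash = createHash("sha256");
  for (const segment of segments) {
    hash.update(segment);
    hash.update("\u0000");
  }
  return hash.digest("hex");
}
```

Logging the fingerprint next to `cacheRead` makes an unexpected cold miss easy to trace to the segment that changed, for example a single trailing space in the memory template.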

### Failure modes

- **Calling `refreshMemoryContextPack` from an observer**: Covered in Section 2, but from the cache perspective: any extra rebuild mid-turn breaks the byte-stability of the prefix and guarantees a cache miss for every subsequent call in that turn.
- **Changing the memory rendering template without bumping version**: If the template that produces the frozen prefix changes (whitespace, tag names, ordering), every existing cached prefix becomes stale. New renders produce different bytes and the provider cold-starts every user's session.
- **Not applying the sticky horizon**: If the eviction horizon is recalculated per-call instead of per-turn, the provider sees a different message tail on each call within the same turn — guaranteed cache misses.

### Safe-change rules

- When editing `observational.ts` rendering logic, compare cache hit rates before and after the change.
- Never reset `wireGuardHorizon` between calls within the same turn. It is `private` on `TurnRunner` and should remain reset only at `start()`.
- The `contextWindowUsage` breakdown in `TurnState` is heuristic and read-only for display; changing its calculation does not affect cache.

Sources: [src/memory/context-pack.ts:8-23](src/memory/context-pack.ts), [src/types/protocol.ts:544-548](src/types/protocol.ts), [src/turn-runner/turn-runner.ts:1631-1648](src/turn-runner/turn-runner.ts)

---

## 4. State-Machine History — Append-Only Guarantee

### What it is

`StateMachineSession.history` is the audit log of every state transition. It is append-only by design: events are pushed as `state_started`, `state_completed`, `state_failed`, `state_interrupted`, and `state_machine_completed`. Nothing ever removes or edits a row.

```typescript
// src/types/state-machine.ts:194
/** Append-only audit log used for debugging, replay, and persistence. */
history: StateMachineSessionEvent[];
```

The eval in `evals/state-machine-real-session-carry-forward.eval.ts` verifies this guarantee directly:

```typescript
// evals/state-machine-real-session-carry-forward.eval.ts:188-192
const history = terminal.state.stateMachine?.history ?? [];
const fixOutput = history.find(
  (event) => event.type === "state_completed" && event.state === "fix_and_recover",
);
expect(fixOutput).toBeTruthy();
```

### The carry-forward invariant

When the parent agent transitions to a new state, it **must** inline findings from previous states into `override.prompt` or `input`. A fresh sub-agent for the new state cannot see prior states' output — it only sees its own rendered prompt. The eval reproduces a real production failure (`session c_cGfNEIotLU`) where the third sub-agent asked three clarifying questions because the parent passed `"Fix the root cause ... from the corrupted DBs found earlier"` with no concrete antecedent.

The transition must carry:
- Specific repo file paths (e.g. `src/memory/pglite.ts`)
- Root cause facts (e.g. the process-level open race, concurrent migration paths)
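The parent can satisfy the carry-forward rule by inlining completed states' findings into the next prompt. This is a sketch under stated assumptions. The `output` field on `state_completed` events is hypothetical, although the eval's lookup of a `fix_and_recover` completion suggests outputs are recorded. `buildTransitionPrompt` is not a real API.

```typescript
type HistoryEvent =
  | { type: "state_started"; state: string }
  | { type: "state_completed"; state: string; output: string };

type Completed = Extract<HistoryEvent, { type: "state_completed" }>;

// Inline prior states' findings, because a fresh sub-agent sees only its own prompt.
function buildTransitionPrompt(history: readonly HistoryEvent[], nextTask: string): string {
  const findings = history
    .filter((e): e is Completed => e.type === "state_completed")
    .map((e) => `- ${e.state}: ${e.output}`);
  if (findings.length === 0) return nextTask;
  return [nextTask, "", "Findings from earlier states:", ...findings].join("\n");
}
```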

### Failure modes

- **Truncating history on serialize**: Dropping old entries to save payload bytes breaks replay evals. It also lets a stale terminal trigger the one-time terminal acknowledgment turn (gated by the `terminalAcknowledged` flag) a second time.
- **Not carrying findings forward**: The sub-agent for the new state asks clarifying questions the parent already answered, degrading quality and burning tokens. The eval gates this at `evals/state-machine-real-session-carry-forward.eval.ts:174-181`.
- **Mutating a completed entry**: Any code that rewrites a `state_completed` event destroys the audit trail and makes replay evals non-deterministic.
- **Re-acknowledging a terminal**: `terminalAcknowledged` is a one-time flag per session. If it is reset or not persisted, the runner runs the terminal acknowledgment turn again, which creates a duplicate conversation turn.

### Safe-change rules

- Add event types to `StateMachineSessionEvent` with new discriminants — never change existing discriminants.
- If history needs trimming for payload size, trim from the **oldest** end, never from the middle. Better: expose a compact `progress` summary alongside history rather than trimming history itself (the `progress` field at `state-machine.ts:192` is already designed for this).
- When adding a new state kind, ensure the runner appends a `state_started` before execution and a `state_completed` / `state_failed` after — not doing so leaves gaps that confuse carry-forward logic.
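Both rules can be checked in tests. Appends produce a new frozen array, and a validator flags any terminal-for-a-state event that lacks an open `state_started`. `appendEvent` and `findGaps` are hypothetical helpers, and the event shapes are trimmed to the discriminants listed above. The sketch also assumes that `state_interrupted` pairs with `state_started` the same way completion and failure do.

```typescript
type SessionEvent =
  | { type: "state_started"; state: string }
  | { type: "state_completed"; state: string }
  | { type: "state_failed"; state: string }
  | { type: "state_interrupted"; state: string }
  | { type: "state_machine_completed" };

// Append-only: returns a new frozen array and never edits existing entries.
function appendEvent(history: readonly SessionEvent[], event: SessionEvent): readonly SessionEvent[] {
  return Object.freeze([...history, event]);
}

// Every completed/failed/interrupted state needs a matching open state_started.
function findGaps(history: readonly SessionEvent[]): string[] {
  const open = new Set<string>();
  const gaps: string[] = [];
  for (const event of history) {
    if (event.type === "state_started") open.add(event.state);
    else if (event.type === "state_machine_completed") continue;
    else if (!open.delete(event.state)) gaps.push(`${event.type} without state_started: ${event.state}`);
  }
  return gaps;
}
```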

Sources: [src/types/state-machine.ts:168-218](src/types/state-machine.ts), [evals/state-machine-real-session-carry-forward.eval.ts:153-198](evals/state-machine-real-session-carry-forward.eval.ts)

---

## 5. PGlite Cross-Process Open Lock

### What it is

PGlite is a single-writer embedded database. If two processes call `PGlite.create` on the same data directory simultaneously, the directory's WAL can corrupt. The cross-process open lock (`.duet-open.lock` inside the data directory) ensures only one process holds an open handle at a time.

```typescript
// src/memory/pglite.ts:28-34
// Filename used for the cross-process open-lock written into each
// managed data directory. The first line is the holder's pid; other
// processes only proceed past O_EXCL if that pid is no longer alive.
const OPEN_LOCK_FILE = ".duet-open.lock";
```

### The lock lifecycle

```text
tryAcquireOpenLock(dataDir)
  ├─ O_EXCL create succeeds → hold lock (write pid to file)
  ├─ File exists, holder PID alive → return { holderPid } (retry later)
  └─ File exists, PID dead (stale) → unlink + re-create atomically

MemorySession.withDb():
  refs++
  → ensureOpen() → pollAcquireOpenLock() → openPGliteHoldingLock()
  → fn(db)
  refs--
  → if refs == 0: scheduleIdleClose() [2s default]

idle timer fires:
  → db.close() → releaseOpenLock(lockPath)
```

### Invariants

- The lock file must exist for the entire duration a `PGlite` handle is open. Releasing it before `db.close()` allows a second process to open, creating two concurrent writers.
- The idle-close window (default 2 seconds) keeps the lock held for short write bursts (observer + reflector + embedding upserts in the same turn) without permanently blocking a peer CLI.
- The lock must be released on process exit. An exit handler registered by `installExitCleanup()` does this via `process.on("exit", ...)`.
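The acquire path in the lifecycle diagram maps onto two Node primitives. `fs.openSync(path, "wx")` fails with `EEXIST` when the file exists (`O_CREAT | O_EXCL`). `process.kill(pid, 0)` checks that a process exists without signalling it. The sketch below is not `tryAcquireOpenLock`. Its stale path unlinks and then retries, which is not atomic, and it ignores races such as the file vanishing between steps. The real implementation must handle both.

```typescript
import { closeSync, openSync, readFileSync, unlinkSync, writeSync } from "node:fs";

function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Returns null once this process holds the lock, or the live holder's pid.
function tryAcquireLockSketch(lockPath: string): { holderPid: number } | null {
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      const fd = openSync(lockPath, "wx"); // exclusive create
      writeSync(fd, `${process.pid}\n`); // first line: holder pid
      closeSync(fd);
      return null;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
    }
    const holderPid = Number.parseInt(readFileSync(lockPath, "utf8").split("\n")[0] ?? "", 10);
    if (Number.isFinite(holderPid) && isAlive(holderPid)) return { holderPid };
    unlinkSync(lockPath); // stale holder: remove and retry once
  }
  throw new Error(`could not acquire ${lockPath}`);
}

// Call only after the database handle has been closed.
function releaseLockSketch(lockPath: string): void {
  unlinkSync(lockPath);
}
```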

### Failure modes

- **Releasing the lock before closing `db`**: This would open a window for a second writer. `closeNow()` in `MemorySession` clears `this.lockPath` before calling `db.close()`. The lock file itself is removed by `releaseOpenLock` in the `finally` block of the wrapping open path. The lock is therefore released even if `db.close()` throws, which PGlite occasionally does on aborted ops.
- **Skipping the idle-close refcount check**: If `closeNow` is called while `refs > 0`, a concurrent `withDb` call will run against a closed handle. The refcount drain in `runDispose` guards against this.
- **Quarantine without lock release**: During backup restoration, the lock must be released before the data directory is renamed aside (the lock file lives inside it). `openPGliteHoldingLock` calls `releaseOpenLock(lockPath)` before `quarantineDataDirectory` (`pglite.ts:300-302`).
- **ENOENT on `pglite.data` during upgrade**: A `duet upgrade` running concurrently with an active CLI rewrites `node_modules` mid-flight. The `isExternalAssetError` guard prevents quarantining a healthy `memory.db` when the failure is an ENOENT on a PGlite runtime asset outside the data directory.

### Safe-change rules

- Never call `PGlite.create` directly. Always go through `openPGlite`, `openPGliteWaitingForLock`, or `MemorySession.withDb`.
- The backup rotation (`MAX_BACKUPS = 5`, `BACKUP_DEDUPE_WINDOW_MS = 5 min`) is fine to tune, but the snapshot must only happen **after** a successful open + init, never before (a corrupted directory snapshotted would rotate out the last good copy).
- Adding a new memory operation? Use `session.withDb(...)` — it handles refcounting and the idle-close timer correctly.
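The refcount and idle-close pattern behind `withDb` can be sketched with an injectable scheduler. This makes the idle window testable without real timers. `RefCountedHandle` is hypothetical and omits locking. In the real session, open and close would also acquire and release the cross-process lock. The 2-second default matches the idle window described above.

```typescript
type Cancel = { cancel(): void };
type Schedule = (fn: () => void, ms: number) => Cancel;

// Opens on first use; closes only after an idle window with zero active users.
class RefCountedHandle<T> {
  private refs = 0;
  private value: T | null = null;
  private idleTimer: Cancel | null = null;
  private readonly open: () => T;
  private readonly close: (value: T) => void;
  private readonly schedule: Schedule;
  private readonly idleMs: number;

  constructor(open: () => T, close: (value: T) => void, schedule: Schedule, idleMs = 2_000) {
    this.open = open;
    this.close = close;
    this.schedule = schedule;
    this.idleMs = idleMs;
  }

  withValue<R>(fn: (value: T) => R): R {
    // A new user cancels any pending idle close, so a burst of writes reuses one handle.
    this.idleTimer?.cancel();
    this.idleTimer = null;
    this.refs += 1;
    try {
      const value = this.value ?? this.open();
      this.value = value;
      return fn(value);
    } finally {
      this.refs -= 1;
      if (this.refs === 0) this.idleTimer = this.schedule(() => this.closeNow(), this.idleMs);
    }
  }

  private closeNow(): void {
    this.idleTimer = null;
    if (this.refs > 0 || this.value === null) return; // never close under an active user
    const value = this.value;
    this.value = null;
    this.close(value);
  }
}
```

In production the scheduler would wrap `setTimeout`/`clearTimeout`. A test can capture the callback and fire it by hand.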

Sources: [src/memory/pglite.ts:28-34, 154-184, 251-338](src/memory/pglite.ts), [src/memory/session.ts:61-201](src/memory/session.ts)

---

## 6. Transient-Error Retry Scope

### What it is

`TurnRunner` drives the `Agent` directly (not through `AgentSession`) to own turn semantics. This means `AgentSession`'s built-in retry path does not apply. The transient-error module in `src/turn-runner/transient-error.ts` mirrors `AgentSession`'s detection regex exactly to ensure consistent behavior.

```typescript
// src/turn-runner/transient-error.ts:61-62
const TRANSIENT_PATTERN =
  /overloaded|provider.?returned.?error|rate.?limit|...|stream ended|.../i;
```

### Retry policy

```typescript
// src/turn-runner/transient-error.ts:112-116
export const DEFAULT_TRANSIENT_RETRY_POLICY: TransientRetryPolicy = {
  maxAttempts: 3,    // total prompt+continue attempts (initial + 2 retries)
  baseDelayMs: 2_000,
  maxDelayMs: 30_000,
};
```
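This policy could translate into concrete waits in several ways. One plausible schedule is exponential backoff capped at `maxDelayMs`, sketched below. The doubling curve, the lack of jitter, and the `retryDelayMs` name are assumptions. The policy values are copied from the block above.

```typescript
interface TransientRetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

const DEFAULT_TRANSIENT_RETRY_POLICY: TransientRetryPolicy = {
  maxAttempts: 3,
  baseDelayMs: 2_000,
  maxDelayMs: 30_000,
};

// `retry` is 1 for the first retry (attempt 2 of maxAttempts). Assumed curve:
// baseDelayMs * 2^(retry - 1), never longer than maxDelayMs.
function retryDelayMs(retry: number, policy = DEFAULT_TRANSIENT_RETRY_POLICY): number {
  return Math.min(policy.baseDelayMs * 2 ** (retry - 1), policy.maxDelayMs);
}
```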

### The progress-reset invariant

The retry counter resets when the agent makes forward progress between failures. "Forward progress" means at least one non-error assistant message was appended between two failures. This is tested explicitly:

```typescript
// test/transient-error.test.ts:169-215
// Both retry log lines should read "attempt 2/3": the first
// failure burned the implicit attempt 1, and the second failure
// resets to attempt 1 because the agent emitted an intermediate
// success between them.
expect(retryAttemptLabels(events)).toEqual(["2/3", "2/3"]);
```

Without the reset, a session that fails, recovers, and fails again would charge two unrelated transient events against a single `maxAttempts` budget; with the default of 3, any further blip would make the runner give up. That is far too aggressive.
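
A minimal model of the counter follows. It assumes failures and forward progress are reported as discrete calls; the class and method names are hypothetical. Replaying the test's scenario (failure, intermediate success, failure) reproduces the `["2/3", "2/3"]` labels from the excerpt above.

```typescript
// Hypothetical retry budget with a progress reset. The initial prompt counts
// as attempt 1, so the first retry after a failure is labeled "2/3".
class RetryBudget {
  private readonly maxAttempts: number;
  private attempt = 1;
  private progressedSinceFailure = false;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Call when a non-error assistant message is appended.
  recordProgress(): void {
    this.progressedSinceFailure = true;
  }

  // Call on a transient failure. Returns the label of the retry about to
  // run, or null when the budget is exhausted and the runner should give up.
  recordFailure(): string | null {
    if (this.progressedSinceFailure) {
      // Forward progress since the last failure: start a fresh budget, with
      // the call that just failed counted as attempt 1.
      this.attempt = 1;
      this.progressedSinceFailure = false;
    }
    if (this.attempt >= this.maxAttempts) return null;
    this.attempt += 1;
    return `${this.attempt}/${this.maxAttempts}`;
  }
}
```

Without the `recordProgress()` call in between, the same budget yields `"2/3"`, `"3/3"`, then `null`.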

### What is NOT retried

| Pattern | Reason |
|---|---|
| `400 Bad Request` | Retrying with the same payload will not change the outcome |
| `401 Unauthorized` | Auth failure is permanent |
| `403 Forbidden` | Permission failure; the same credentials will be refused again |
| `404 Not Found` | Wrong endpoint; retrying doesn't help |
| Context overflow (`prompt is too long`) | Handled separately by `tryRecoverFromContextOverflow`, not this module |
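
The three outcomes described so far (retry, fail fast, hand off to overflow recovery) can be sketched as one classifier. The patterns below are an illustrative subset built only from strings quoted on this page. They are not the elided production regexes. `classifyProviderError` is a hypothetical name, and the check order is an assumption.

```typescript
type ErrorDisposition = "context-overflow" | "non-retryable" | "transient" | "unknown";

// Illustrative subsets only; the real TRANSIENT_PATTERN has more alternatives.
const OVERFLOW_PATTERN = /prompt is too long/i;
// Any 4xx status except 429 (rate limit), matched as a whole 3-digit number.
const NON_RETRYABLE_PATTERN = /\b4(?!29)\d{2}\b/;
const TRANSIENT_PATTERN = /overloaded|provider.?returned.?error|rate.?limit|stream ended/i;

function classifyProviderError(message: string): ErrorDisposition {
  // Overflow first: providers often report it with a 400 status, but it is
  // recovered by compaction (tryRecoverFromContextOverflow), not by retrying
  // or failing outright.
  if (OVERFLOW_PATTERN.test(message)) return "context-overflow";
  // Non-retryable before transient, so a 4xx never burns retry attempts.
  if (NON_RETRYABLE_PATTERN.test(message)) return "non-retryable";
  if (TRANSIENT_PATTERN.test(message)) return "transient";
  return "unknown";
}
```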

### Failure modes

- **Adding a 4xx to the retry list**: `NON_RETRYABLE_PATTERN` explicitly blocks 4xx codes other than 429. Carving a code out of that block to treat it as transient (e.g., 422) wastes retries on payloads that will always fail.
- **Diverging the regex from `AgentSession._isRetryableError`**: The comment at `transient-error.ts:29` explains that the regex deliberately mirrors `pi-coding-agent`'s detection. Drift means pi-ecosystem callers and duet-agent disagree on whether the same error is retryable, producing inconsistent behavior.
- **Suppressing the progress reset**: If the counter is not reset on intermediate success, long agentic tasks that hit one transient blip per state exhaust `maxAttempts` cumulatively, so the runner gives up partway through a valid multi-state session.
- **Retrying context overflow**: Context overflow must be handled by compaction, not by `agent.continue()` — the same oversized payload will fail identically.

### Safe-change rules

- When adding new provider error shapes to retry, add them to `TRANSIENT_PATTERN` and add a test row to `test/transient-error.test.ts` under the `retries:` table.
- When excluding new error shapes, add them to `NON_RETRYABLE_PATTERN` and add a `does not retry:` row.
- Keep `DEFAULT_TRANSIENT_RETRY_POLICY.maxAttempts` and `baseDelayMs` in sync with `AgentSession`'s values. The test at `test/transient-error.test.ts:106` explicitly asserts this synchronization.

Sources: [src/turn-runner/transient-error.ts:1-127](src/turn-runner/transient-error.ts), [test/transient-error.test.ts:1-256](test/transient-error.test.ts)

---

## 7. Schema Migrations — Forward-Only Guarantee

### What it is

`src/memory/migrations.ts` applies schema changes to PGlite in monotonically increasing version order. Each migration runs inside a single transaction; failure rolls back to the prior version. The `schema_version` table records every applied migration. The invariant is: **we never go backwards**.

```typescript
// src/memory/migrations.ts:22-23
// Forward-only by design. A bad migration is fixed by writing the
// next migration, not by reverting.
```
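
A sketch of a forward-only runner under assumed interfaces follows. `Migration`, `MigrationDb`, `Tx`, and `runMigrations` are hypothetical stand-ins rather than exports of `src/memory/migrations.ts`, and the `schema_version (version)` column name is assumed. The shape matches the description above: ascending versions, one transaction per migration, and the version row written inside that same transaction.

```typescript
interface Tx {
  exec(sql: string): Promise<void>;
}

interface MigrationDb {
  currentVersion(): Promise<number>;
  // Runs fn atomically: if fn throws, every statement inside is rolled back.
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

interface Migration {
  version: number;
  statements: string[];
}

async function runMigrations(db: MigrationDb, migrations: Migration[]): Promise<number> {
  let current = await db.currentVersion();
  // Only versions above the current one run, in ascending order; nothing
  // here can lower the schema version.
  const pending = [...migrations]
    .filter((m) => m.version > current)
    .sort((a, b) => a.version - b.version);
  for (const m of pending) {
    await db.transaction(async (tx) => {
      for (const sql of m.statements) await tx.exec(sql);
      // Recorded in the same transaction, so a failed step leaves no row.
      await tx.exec(`INSERT INTO schema_version (version) VALUES (${m.version})`);
    });
    current = m.version;
  }
  return current;
}
```

A failing statement propagates out of `transaction`, aborting the loop with the schema still at the last fully applied version.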

### Why drop-and-recreate for embedding tables

Migrations 6 and 7 both drop `observation_embeddings` entirely rather than attempting data-preserving rebuilds. The comment in migration 7 explains the root cause:

> PGlite's TOAST storage of `vector(3072)` columns hit `missing chunk number 0 for toast value ...` on production databases that had been heavily UPSERTed by the worker.

Backfill is cheap (the embedding worker repopulates the table within minutes); TOAST corruption is not recoverable. This is a codebase-specific invariant: observation rows are the only data that genuinely needs in-place transforms; embedding tables can always be rebuilt.

### Failure modes

- **Reverting a migration**: Setting `LATEST_SCHEMA_VERSION` to an older value does not undo applied migrations — it just stops newer ones from running. Callers that need to downgrade must write a new forward migration that undoes the schema change.
- **Non-idempotent migrations**: Migration 1 uses `CREATE TABLE IF NOT EXISTS` specifically to allow both fresh installs and pre-migration databases to converge on the same baseline. Removing `IF NOT EXISTS` breaks upgrades from pre-migration databases, where those tables already exist.
- **Running migrations outside a transaction**: The `db.transaction(...)` wrapper means a DDL failure on step 3 of a multi-step migration rolls back steps 1 and 2. Removing this wrapper leaves the schema in a half-applied state.

### Safe-change rules

- New migrations always append at the end with `version: LATEST + 1`.
- Embedding table changes: prefer drop-and-recreate (the backfill worker handles repopulation) over data-preserving rebuilds that read `vector(3072)` columns.
- Observation table changes: use in-place transforms with `ALTER TABLE` + backfill DML, as in migrations 2 and 4.
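
The append-only rule can be checked mechanically, for example in a unit test. `assertAppendOnly` is a hypothetical helper. It assumes migrations are declared as an array of objects with a numeric `version` starting at 1, and that the latest schema version equals the number of declared migrations.

```typescript
interface VersionedMigration {
  version: number;
}

// Hypothetical check: versions must be exactly 1, 2, ..., N in declaration
// order, and N must equal the declared latest version. A gap, duplicate, or
// reordering usually means a migration was edited or inserted in the middle
// instead of appended at LATEST + 1.
function assertAppendOnly(migrations: VersionedMigration[], latest: number): void {
  migrations.forEach((m, i) => {
    if (m.version !== i + 1) {
      throw new Error(`migration at index ${i} has version ${m.version}, expected ${i + 1}`);
    }
  });
  if (migrations.length !== latest) {
    throw new Error(`latest schema version is ${latest}, but ${migrations.length} migrations are declared`);
  }
}
```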

Sources: [src/memory/migrations.ts:1-439](src/memory/migrations.ts)

---

## 8. Guardrail Firewall — Fail-Fast Composition

### What it is

`createFirewall` in `src/guardrails/firewall.ts` composes multiple guardrails into one. It evaluates them in order and short-circuits on the first block: a blocked result is returned immediately, not accumulated.

```typescript
// src/guardrails/firewall.ts:14-21
for (const g of guardrails) {
  const result = await g.evaluate(context);
  if (!result.allowed) {
    return {
      allowed: false,
      reason: `[${g.name}] ${result.reason}`,
      suggestion: result.suggestion,
    };
  }
}
```

`SemanticGuardrail` makes an LLM call on every evaluation, so it is the expensive member of any composition. The intended usage is to place pattern-based guardrails ahead of semantic ones.

### Invariants

- Guardrail order is semantically significant: a cheaper pattern guardrail first catches obvious violations without paying the LLM cost.
- Warnings (non-blocking reasons from passing guardrails) are accumulated and returned in the final `allowed: true` result. They are not silently dropped.
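
A self-contained sketch of the composition that honors both invariants above: the first block wins, and warnings from passing guardrails survive. The `Guardrail` and `GuardrailResult` shapes are inferred from the excerpt. The `warning` field name, the `warnings` array, and `createFirewallSketch` are assumptions.

```typescript
interface GuardrailResult {
  allowed: boolean;
  reason?: string;
  suggestion?: string;
  warning?: string; // assumed name for a non-blocking note
}

interface Guardrail<C> {
  name: string;
  evaluate(context: C): Promise<GuardrailResult>;
}

interface FirewallResult {
  allowed: boolean;
  reason?: string;
  suggestion?: string;
  warnings?: string[];
}

function createFirewallSketch<C>(guardrails: Guardrail<C>[]) {
  return {
    name: "firewall",
    async evaluate(context: C): Promise<FirewallResult> {
      const warnings: string[] = [];
      // Sequential on purpose: order decides which guardrails get to run.
      for (const g of guardrails) {
        const result = await g.evaluate(context);
        if (!result.allowed) {
          // Short-circuit: later (possibly expensive) guardrails never run.
          return {
            allowed: false,
            reason: `[${g.name}] ${result.reason ?? "blocked"}`,
            suggestion: result.suggestion,
          };
        }
        if (result.warning) warnings.push(`[${g.name}] ${result.warning}`);
      }
      return { allowed: true, warnings };
    },
  };
}
```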

### Failure modes

- **Placing `SemanticGuardrail` first**: Every action pays an LLM call even when a trivial pattern rule would have blocked it.
- **Swallowing the `reason` field on block**: The reason and suggestion are the user-visible explanation. Stripping them makes blocks opaque.

### Safe-change rules

- Compose the firewall in order: regex/pattern guardrails first, semantic guardrail last.
- The `SemanticGuardrail` model is injected at construction — keep it BYOK/BYOC compatible, do not hardcode a provider model string in the guardrail definition.
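
Keeping the guardrail BYOK/BYOC-compatible means the model arrives as a dependency instead of a provider string baked into the definition. A sketch follows, assuming the injected model can be reduced to a text-completion function. `CompleteFn`, `createSemanticGuardrailSketch`, the ALLOW/BLOCK reply convention, and the fail-closed parsing are all hypothetical choices, not the contents of `src/guardrails/semantic.ts`.

```typescript
// Any provider (hosted, self-hosted, BYOK key) can be adapted to this shape.
type CompleteFn = (prompt: string) => Promise<string>;

interface Verdict {
  allowed: boolean;
  reason?: string;
}

function createSemanticGuardrailSketch(policy: string, complete: CompleteFn) {
  return {
    name: "semantic",
    async evaluate(action: string): Promise<Verdict> {
      const reply = await complete(
        `Policy:\n${policy}\n\nAction:\n${action}\n\n` +
          `Answer "ALLOW" or "BLOCK: <reason>".`,
      );
      const text = reply.trim();
      if (/^ALLOW\b/i.test(text)) return { allowed: true };
      // Fail closed: anything other than an explicit ALLOW blocks the action.
      return { allowed: false, reason: text.replace(/^BLOCK:?\s*/i, "") || "unparseable verdict" };
    },
  };
}
```

Because the model is just a function argument, tests can pass a deterministic fake, and callers swap providers without touching the guardrail definition.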

Sources: [src/guardrails/firewall.ts:1-34](src/guardrails/firewall.ts), [src/guardrails/semantic.ts:1-54](src/guardrails/semantic.ts)

---

## 9. Which Files Are Safe to Change in Isolation

### Low blast radius (safe to change alone)

| File | Why |
|---|---|
| `src/guardrails/firewall.ts` | Composition only; no state owned |
| `src/guardrails/semantic.ts` | LLM call wrapper; only the policy string and model are observable |
| `src/turn-runner/transient-error.ts` | Self-contained; only affects retry behavior, not state transitions |
| `src/memory/migrations.ts` | Add-only at the end; each migration is transactional |
| `src/types/state-machine.ts` (additive changes only) | Adding new optional fields or new event discriminants is backward-compatible |

### High blast radius (touch multiple invariants)

| File | Why |
|---|---|
| `src/memory/context-pack.ts` | Directly governs the three-event rule. Any new call site breaks prompt-cache stability. |
| `src/memory/pglite.ts` | Lock lifecycle, quarantine, backup rotation. Changes here affect every concurrent-CLI scenario. |
| `src/memory/session.ts` | Refcount logic. Getting refcounting wrong causes use-after-close or lock starvation. |
| `src/turn-runner/turn-runner.ts` | Orchestrates all six invariants. Every section of this file is entangled with at least one of them. |
| `src/types/protocol.ts` | Renaming or removing `TurnState` fields breaks cross-process serialization for all callers (CLI, HTTP, daemon). |
| `src/memory/observational.ts` | Wire-shaping transform. Changes to the rendering template invalidate existing prompt caches. |

---

## Summary

The six load-bearing invariants depend on one another: `TurnState` as the only cross-process contract, the three-event rule for memory pack rebuilds, prompt-cache stability through frozen prefixes, append-only state-machine history, PGlite's per-directory cross-process lock, and transient-error retry scope mirroring `AgentSession`. The most dangerous change pattern is adding a fourth memory rebuild trigger, because it simultaneously breaks prompt-cache stability, increases DB lock contention, and can cascade into PGlite lock exhaustion when multiple CLI sessions are running concurrently. The eval at `evals/state-machine-real-session-carry-forward.eval.ts:153-181` and the unit test at `test/transient-error.test.ts:169-215` are the two regression gates most directly protecting these invariants at CI time.

Sources: [src/memory/context-pack.ts:8-23](src/memory/context-pack.ts), [src/types/protocol.ts:169-208](src/types/protocol.ts)

---