# Durability Invariants & Failure Modes

> What Postgres owns, what lives only in-process, how reconnects and pod replacements are safe, and the exact failure scenarios where state can be lost or replayed incorrectly. The distinction between the event stream (client contract) and in-memory runtime state is the critical boundary.

- Repository: paradigmxyz/centaur
- GitHub: https://github.com/paradigmxyz/centaur
- Human wiki: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2
- Complete Markdown: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2/llms-full.txt

## Source Files

- `services/api/api/agent.py`
- `services/api/api/runtime_control.py`
- `services/api/api/models.py`
- `services/api/tests/test_agent_resilience.py`
- `services/api/tests/test_inflight_replay.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [services/api/api/agent.py](services/api/api/agent.py)
- [services/api/api/sandbox/base.py](services/api/api/sandbox/base.py)
- [services/api/api/models.py](services/api/api/models.py)
- [services/api/api/db.py](services/api/api/db.py)
- [services/api/api/runtime_control.py](services/api/api/runtime_control.py)
- [services/api/tests/test_agent_resilience.py](services/api/tests/test_agent_resilience.py)
- [services/api/tests/test_inflight_replay.py](services/api/tests/test_inflight_replay.py)
- [services/api/db/migrations/001_initial.sql](services/api/db/migrations/001_initial.sql)
- [services/api/db/migrations/006_add_inflight_turn_state.sql](services/api/db/migrations/006_add_inflight_turn_state.sql)
- [services/api/db/migrations/008_agent_runtime_control_plane.sql](services/api/db/migrations/008_agent_runtime_control_plane.sql)
</details>

# Durability Invariants & Failure Modes

Centaur runs AI agents in ephemeral sandbox containers (Kubernetes pods) while serving clients over long-lived Server-Sent Event streams. The core durability challenge is that containers can crash, be replaced, or be evicted at any moment, yet clients and downstream callers must receive a consistent, deduplicated event stream. This page maps exactly what Postgres owns, what lives only in process memory, and the specific failure scenarios where state can be lost, replayed, or delivered more than once.

Understanding the boundary between the **event stream** (the SSE wire a client reads) and **in-memory runtime state** (stream handles, turn counters) is the central design question. Postgres is the source of truth for session identity and turn lifecycle; the process-local `_runtime` dict is a disposable cache of IO handles that gets rebuilt on reconnect.

---

## State Ownership Map

### What Postgres Owns (Durable)

All session and turn state that must survive a pod replacement lives in Postgres. The key tables and columns are:

**`sandbox_sessions`** — one row per active `thread_key`:

| Column | Purpose |
|---|---|
| `thread_key` (PK) | Stable identifier for a user's conversation thread |
| `sandbox_id` | The Kubernetes pod name of the current container |
| `state` | Lifecycle state: `creating → running → idle → suspended / gone / stopped` |
| `agent_thread_id` | The upstream agent's conversation ID (for resume after replacement) |
| `last_delivered_id` | Cursor into `chat_messages`; which messages have been flushed into stdin |
| `inflight_turn_id` | UUID of the active turn; non-null if a turn is in progress |
| `inflight_turn_input` | Full JSONB payload sent to the sandbox — the exact bytes used for replay |
| `inflight_attempts` | How many times this turn has been attempted (incremented on each replay) |
| `last_result` | Final turn result text; written atomically with turn completion |
| `wire_lease_id` | Non-null while a client SSE wire is connected; heartbeated every 30 s |

Sources: [services/api/db/migrations/001_initial.sql:2-13](), [services/api/db/migrations/006_add_inflight_turn_state.sql:4-11]()

**`chat_messages`** — append-only message log:

Every user message, system message, and assistant response is appended here. The `last_delivered_id` column in `sandbox_sessions` is a cursor into this table. On reconnect or replay, unflushed messages are re-read and re-delivered to the sandbox stdin.

Sources: [services/api/db/migrations/001_initial.sql:15-23]()

**Control-plane tables** (added in migration 008):

- `agent_runtime_assignments` — records which runtime container is assigned to a thread, along with prompt identity hash for change detection.
- `agent_execution_requests` — durable queue for the execution worker pool; each execution row tracks `status` (`queued → running → completed / failed_permanent / cancelled`).
- `agent_spawn_requests` / `agent_release_requests` — idempotency records keyed by caller-supplied `spawn_id` / `release_id`.
- `agent_message_requests` — message delivery records that bridge spawn and execution events.

Sources: [services/api/db/migrations/008_agent_runtime_control_plane.sql:3-85]()

### What Lives Only In-Process (Ephemeral)

```text
_runtime: dict[sandbox_id → RuntimeState]
```

`RuntimeState` (defined in `sandbox/base.py`) holds:

| Field | Type | Meaning |
|---|---|---|
| `turn_counter` | `int` | Monotonically increasing per-process turn number |
| `stdout_stream` | backend handle | Open async iterator for sandbox stdout |
| `stdin_stream` | backend handle | Open write handle for sandbox stdin |
| `attach_context` | backend-specific | Context manager for the current attach session |
| `prefetched_stdout` | `list[str] | None` | Lines buffered before live attach |
| `stdout_read_lock` | `asyncio.Lock` | Prevents concurrent stdout readers |
| `last_result` | `str | None` | Turn result; a best-effort bridge before DB write commits |

These values are populated by `_get_runtime()` and dropped by `_drop_runtime()`. They are not persisted. After a pod restart or API server restart, this dict is empty and every `sandbox_id` gets a fresh `RuntimeState()` on first access.

Sources: [services/api/api/sandbox/base.py:13-28](), [services/api/api/agent.py:97-112]()

---

## The Critical Boundary: Event Stream vs. Runtime State

```text
┌──────────────────────────────────────────────────────────────┐
│  CLIENT (SSE consumer)                                       │
│    ← receives: wire.ready, turn.started, content_block_*,   │
│                turn.done, keepalive                          │
└───────────────────┬──────────────────────────────────────────┘
                    │  SSE wire (EventSourceResponse)
┌───────────────────▼──────────────────────────────────────────┐
│  API SERVER PROCESS                                          │
│  stream_connect() / _stream_stdout()                         │
│  ┌────────────────┐  ┌───────────────────────────────────┐   │
│  │ _runtime       │  │ DB writes (before yield)          │   │
│  │ turn_counter   │  │ _db_set_inflight_turn()           │   │
│  │ stdout_stream  │  │ _db_complete_inflight_turn()      │   │
│  │ last_result    │  │ _persist_turn_messages()          │   │
│  └────────────────┘  └───────────────────────────────────┘   │
└───────────────────┬──────────────────────────────────────────┘
                    │  asyncpg pool
┌───────────────────▼──────────────────────────────────────────┐
│  POSTGRES                                                    │
│  sandbox_sessions  chat_messages  agent_execution_requests   │
└──────────────────────────────────────────────────────────────┘
```

The ordering guarantee: **Postgres is written before the `turn.done` event is yielded to the client**. The code in `_stream_stdout` calls `asyncio.gather(_persist_turn_messages(...), _db_complete_inflight_turn(...))` and only then emits the `turn.done` SSE payload. A client that disconnects just before receiving `turn.done` can reconnect and get the event from the newly-idle session.

Sources: [services/api/api/agent.py:933-953]()

---

## Lifecycle States and Valid Transitions

```mermaid
stateDiagram-v2
    [*] --> creating : get_or_spawn()
    creating --> running : stream_connect() called
    running --> idle : turn.done committed to DB
    running --> suspended : reconcile_tick() / idle TTL eviction
    idle --> running : inject_stdin() or stream_connect()
    idle --> suspended : idle TTL expires (default 24 h)
    running --> gone : container crash, no wire, no inflight
    suspended --> running : get_or_spawn() (resume path)
    gone --> [*] : cleaned up by reconcile Step E
    stopped --> [*] : cleaned up by reconcile Step E
```

The state machine is enforced by a `CHECK` constraint on `sandbox_sessions.state`. The `db.py` startup check verifies all eight required states are present before accepting traffic.

Sources: [services/api/api/db.py:19-30](), [services/api/api/agent.py:1516-1736]()

---

## Reconnect and Pod Replacement Safety

### Normal Client Reconnect (Same Container)

1. `stream_connect()` is called again on the same `sandbox_id`.
2. `_get_runtime()` returns the existing `RuntimeState` (turn counter preserved in process memory).
3. `backend.attach()` re-opens the stdout iterator from the container.
4. A `wire.ready` event is emitted with the current `turn_counter` so the client knows where it is.
5. If a turn is still running, the container continues emitting events; the client resumes reading from the live stream.

If the first stream connection dropped mid-turn without receiving `turn.done`, the container is still running and the client will receive the remaining events starting from the reattach point.

Sources: [services/api/api/agent.py:1024-1109]()

### Container Replacement (Pod Restart / Node Eviction)

`get_or_spawn()` handles this transparently:

1. The DB row for the `thread_key` exists with a stale `sandbox_id`.
2. `backend.status(session)` returns `"gone"` or similar.
3. The old row is deleted; `agent_thread_id`, `last_delivered_id`, `inflight_turn_id`, `inflight_turn_input`, and `inflight_attempts` are saved in local variables before the delete.
4. A new container is spawned, and `_db_insert_session()` inserts a new row carrying the preserved fields.
5. `inject_stdin()` or `replay_inflight_turn()` re-injects the in-flight turn payload at `attempts + 1`.
6. `_flush_pending()` re-reads any `chat_messages` newer than `last_delivered_id` to catch up messages that weren't delivered before the crash.

The `resume_thread_id=old_agent_thread_id` argument to `backend.create()` tells the new container to resume the upstream agent's conversation thread, preserving conversation context inside the harness.

Sources: [services/api/api/agent.py:613-750](), [services/api/api/agent.py:1287-1369]()

### Stdin Broken Pipe Recovery

Both `inject_stdin()` and `replay_inflight_turn()` catch `BrokenPipeError / OSError` on the first `write_stdin` attempt. If the container reports status `"running"`, a `reattach_stdin()` is performed and the write is retried once. If the container is gone, a `RuntimeError` is raised and the caller falls into the container-replacement path.

Sources: [services/api/api/agent.py:1184-1210](), [services/api/api/agent.py:1325-1353]()

### EOF Reattach (Transient Stream Drop)

`_stream_stdout` implements an EOF reattach loop: if the stdout iterator returns EOF but `backend.status()` still returns `"running"`, `close_streams()` is called and `backend.attach()` is retried up to `STREAM_EOF_REATTACH_MAX` times (default: 6) with `STREAM_EOF_REATTACH_BACKOFF_S` (default: 1 s) delay. This handles transient network interruptions to the container without triggering a full spawn cycle.

Sources: [services/api/api/agent.py:966-1007](), [services/api/tests/test_agent_resilience.py:72-107]()

---

## The In-Flight Turn Invariant

The in-flight turn is the most critical durability invariant. The lifecycle is:

```
inject_stdin() called
  → _db_set_inflight_turn() [inflight_turn_id, inflight_turn_input, attempts=1]
  → backend.write_stdin()
  → sandbox runs the turn

sandbox emits is_turn_done event
  → _db_complete_inflight_turn() [clears inflight_turn_id, writes last_result]
  → turn.done yielded to client
```

**Key guarantee**: `inflight_turn_input` is written to Postgres *before* the payload reaches the sandbox stdin. If the process dies between those two steps, the turn input is safe and can be replayed. If the process dies after the sandbox receives the input but before `_db_complete_inflight_turn()`, the row still has `inflight_turn_id` set and the replay path (`replay_inflight_turn()`) will re-send the same input to a new container.

Sources: [services/api/api/agent.py:1159-1165](), [services/api/api/agent.py:1287-1310]()

---

## Failure Modes and Exact Loss Scenarios

### Scenario 1: API Process Crash During a Turn (In-Flight Turn Preserved)

**What happens**: Process dies after `_db_set_inflight_turn()` but before `_db_complete_inflight_turn()`.

**State**: `inflight_turn_id` is non-null in Postgres; `_runtime` is empty in the new process.

**Recovery**: On next `get_or_spawn()`, the old row is found, `inflight_turn_id` and `inflight_turn_input` are preserved through the delete-and-reinsert cycle. `replay_inflight_turn()` re-sends the payload. **No user input is lost**; the turn may be retried with `inflight_attempts` incremented.

**Risk**: The sandbox may have already completed the turn but the result was not written. In that case, `_db_complete_inflight_turn()` was never called, and the replay will produce a second agent response to the same input. This is the primary **double-delivery risk**.

### Scenario 2: API Process Crash After Turn Completes (Last Result Preserved)

**What happens**: `_db_complete_inflight_turn()` wrote `last_result` and cleared `inflight_turn_id`, but `_persist_turn_messages()` for the assistant's `chat_messages` row failed (it is called concurrently and its failure is silently logged, not re-raised).

**State**: `last_result` is in Postgres but `chat_messages` may lack the assistant message.

**Impact**: The assistant turn result is visible via `get_status()`, but `_flush_pending()` on the next reconnect will not replay the missing assistant message (assistant messages are filtered out of flush, except `history_backfill` rows). The assistant response is **not replayed** to the sandbox on resume; only user messages are re-flushed.

Sources: [services/api/api/agent.py:452-483](), [services/api/api/agent.py:1767-1803]()

### Scenario 3: Container Crash Without Wire Connected (Turn Lost)

**What happens**: The sandbox pod is evicted while idle, and `reconcile_tick()` Step A detects the container is gone.

**State**: `reconcile_tick()` calls `_mark_inactive()`, which sets `state = 'suspended'` and **clears `inflight_turn_id` and `inflight_turn_input`**.

```python
# reconcile_tick Step D: stale inflight rows reaped after IDLE_TTL_S
"inflight_turn_id = NULL, inflight_turn_input = NULL, inflight_started_at = NULL"
```

**Impact**: Any in-flight turn that was active but had no wire (e.g., the client disconnected) and that times out past `IDLE_TTL_S` (default 24 h) will have its in-flight payload wiped. The user must resend their request.

Sources: [services/api/api/agent.py:1525-1549](), [services/api/api/agent.py:1666-1710]()

### Scenario 4: Wire Lease Staleness (Client Disconnect Not Detected Immediately)

**What happens**: The SSE client disconnects but the heartbeat task's 30-second poll is not yet overdue.

**State**: `wire_lease_id` remains set in Postgres; `supervise_wires()` or `reconcile_tick()` will detect it after 120 seconds of no heartbeat and clear the lease.

**Impact**: The session appears "delivering" for up to 120 seconds after the client disconnects. This does not cause data loss but may delay capacity eviction.

Sources: [services/api/api/agent.py:1375-1436]()

### Scenario 5: Spawn Race (Concurrent Spawn for Same Thread Key)

**What happens**: Two requests race to create a session for the same `thread_key`.

**Protection**: `_db_insert_session()` uses `INSERT ... ON CONFLICT (thread_key) DO NOTHING RETURNING thread_key`. The loser detects `row is None`, stops its container with `backend.stop_by_id()`, drops its runtime, and fetches the winning session from Postgres.

**Impact**: One container is wasted (stopped immediately). The client gets the winner's session. No state loss.

Sources: [services/api/api/agent.py:727-748]()

### Scenario 6: Double Completion (turn.done Emitted Twice)

**What happens**: A reconnect arrives during the window between `_db_complete_inflight_turn()` and the `turn.done` yield, and `stream_reconnect()` is called with `logs=True`.

**Protection**: `stream_reconnect()` accepts a `skip_done_count` parameter to skip already-seen `turn.done` events from the container's log buffer. The client is responsible for tracking which `turn.done` events it has already processed.

**Risk**: If the container has already emitted `turn.done` to its log buffer and the reconnecting client does not supply the correct `skip_done_count`, it may receive a duplicate `turn.done`.

Sources: [services/api/api/agent.py:1739-1764]()

---

## The Flush Pipeline: Message Cursor Semantics

`_flush_pending()` reads `chat_messages` rows that the sandbox has not yet received. The cursor (`last_delivered_id`) points to the `id` of the last message delivered to stdin.

```python
# Flush logic (simplified):
# If last_delivered_id is set: return rows created AFTER that message's created_at
# If NULL: return all replayable messages
role_filter = "(role <> 'assistant' OR metadata->>'history_backfill' = 'true')"
```

**What is re-flushed on reconnect**: User messages and system messages not yet seen by the sandbox. History-backfill assistant messages (imported Slack context) are also included.

**What is NOT re-flushed**: Regular assistant messages generated by the sandbox. The sandbox is assumed to hold its own context for these via the harness's internal memory (`agent_thread_id` resume).

**Cursor advancement**: `_advance_cursor()` writes `last_delivered_id` only after `backend.write_stdin()` succeeds. If stdin write fails fatally, the cursor is not advanced and the same messages will be re-flushed on the next attempt.

Sources: [services/api/api/agent.py:452-521]()

---

## Periodic Reconciliation

`reconcile_tick()` runs every 60 seconds and enforces five invariants:

| Step | Query target | Action |
|---|---|---|
| A | `state IN (running, idle, delivering, error)` | Check backend; mark `suspended` if container gone |
| B | `state = idle AND updated_at < NOW() - IDLE_TTL_S` | Stop container; mark `suspended` |
| C | `state IN (running, delivering, error)` with no wire, no inflight, no execution | Stop container; mark `suspended` |
| D | `state IN (running, delivering, error)` with stale `inflight_turn_id` and no execution | Reap the in-flight record; mark `suspended` |
| E | `state IN (gone, stopped)` older than 1 hour, or `suspended` older than `SUSPENDED_RETENTION_S` (7 days) | Delete the row |

The reconciler is conservative: transient backend errors cause `continue` (skip the row, don't destroy it). This prevents accidental reaping of healthy sessions during brief API hiccups.

Sources: [services/api/api/agent.py:1516-1736]()

---

## Schema Startup Validation

At every startup, `check_schema_compatibility()` verifies that the Postgres schema has:

- All eight required `sandbox_sessions` state values in the check constraint
- All eight required columns (`agent_thread_id`, `inflight_turn_id`, `inflight_turn_input`, `inflight_started_at`, `inflight_attempts`, `last_result`, `last_result_at`, `trace_id`)
- All required migration versions applied (`005`, `006`, `007`, `008`, `009`, `010`, `011`, `035`)

If any check fails, the health endpoint reflects incompatibility and the API should not accept traffic. This prevents a rolled-back migration from causing silent data loss where the code expects durable columns that are missing.

Sources: [services/api/api/db.py:177-240]()

---

## Summary

Centaur's durability model is built on a single rule: **Postgres is written before the event is yielded to the client**. The `_runtime` dict is a disposable cache; losing it on restart costs nothing except the stream handles. The two columns that make pod replacement transparent are `inflight_turn_input` (the exact bytes to replay) and `last_delivered_id` (which messages the sandbox already received). The primary failure mode where state can be silently lost is stale in-flight records reaped by `reconcile_tick()` Step D after `IDLE_TTL_S` elapses with no active wire and no execution — a scenario that occurs when both the client and the container die before the turn completes. All other failure paths either replay safely or surface an explicit error to the caller.

Sources: [services/api/api/agent.py:933-939](), [services/api/api/agent.py:1667-1710]()
