# API Lifecycle: spawn → message → execute → events → release

> The five-step durable API lifecycle, how each call maps to a Postgres write, what the SSE event stream guarantees, and what happens when a client reconnects mid-run with after_event_id. Also covers the Slackbot ingress path including HMAC signature verification.

- Repository: paradigmxyz/centaur
- GitHub: https://github.com/paradigmxyz/centaur
- Human wiki: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2
- Complete Markdown: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2/llms-full.txt

## Source Files

- `services/api/api/app.py`
- `services/api/api/agent.py`
- `services/api/api/runtime_control.py`
- `services/api/api/slackbot_client.py`
- `docs/pages/architecture.mdx`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [services/api/api/app.py](services/api/api/app.py)
- [services/api/api/agent.py](services/api/api/agent.py)
- [services/api/api/runtime_control.py](services/api/api/runtime_control.py)
- [services/api/api/routers/agent.py](services/api/api/routers/agent.py)
- [services/api/api/slackbot_client.py](services/api/api/slackbot_client.py)
- [services/slackbot/src/slack/signature.ts](services/slackbot/src/slack/signature.ts)
- [services/slackbot/src/index.ts](services/slackbot/src/index.ts)
- [services/api/tests/test_direct_api_e2e.py](services/api/tests/test_direct_api_e2e.py)
</details>


Centaur's public-facing API is built around a five-step durable lifecycle that turns a natural-language request into a running sandbox, streams its output to connected clients, and then releases the runtime when the thread is done. Every step is a distinct HTTP call, and every mutating call commits to Postgres before its response is returned, so the system can survive process restarts, network blips, and mid-turn client disconnections without losing work.

This page covers each step in detail, explains the Postgres writes that back each transition, describes the SSE event stream and its reconnection semantics, and explains how the Slackbot ingress path feeds into the same lifecycle after verifying a Slack HMAC signature.

---

## Overview: the five calls

```text
POST /agent/spawn     → agent_runtime_assignments row (state=active)
POST /agent/message   → agent_message_requests + chat_messages rows
POST /agent/execute   → agent_execution_requests row (status=queued)
GET  /agent/threads/{key}/events (SSE) → streams agent_execution_events rows
POST /agent/threads/{key}/release → agent_runtime_assignments state=released
```

All five endpoints are registered on the `APIRouter` with prefix `/agent`, require an API key (checked by `verify_api_key`), and accept an optional `agent:execute` scope guard.

Sources: [services/api/api/routers/agent.py:58-62]()
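As a rough illustration of the call order, the sketch below builds the five requests as plain tuples without sending anything. The `content` field name and the generated ID formats are assumptions made for the example. A real client would take `assignment_generation` from the spawn response rather than passing it in.

```python
import uuid
from urllib.parse import quote


def lifecycle_requests(thread_key: str, text: str, generation: int = 1) -> list:
    """Return the five calls as (method, path, json_body) tuples, in order."""
    key = quote(thread_key, safe="")
    return [
        ("POST", "/agent/spawn",
         {"thread_key": thread_key, "spawn_id": uuid.uuid4().hex}),
        # "content" is a hypothetical field name for the message payload.
        ("POST", "/agent/message",
         {"thread_key": thread_key, "assignment_generation": generation,
          "message_id": uuid.uuid4().hex, "content": text}),
        ("POST", "/agent/execute",
         {"thread_key": thread_key, "assignment_generation": generation,
          "execute_id": uuid.uuid4().hex}),
        ("GET", f"/agent/threads/{key}/events", None),
        ("POST", f"/agent/threads/{key}/release", None),
    ]


plan = lifecycle_requests("demo-thread", "hello")
for method, path, _body in plan:
    print(method, path)
```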

---

## Step 1 — spawn

**Endpoint:** `POST /agent/spawn`

`spawn` creates or re-attaches a runtime assignment for a `thread_key`. A runtime assignment is the durable record binding a thread to a specific sandbox container and prompt identity for the lifetime of that conversation.

### What happens

1. The caller supplies `thread_key` and, optionally, `spawn_id` (the idempotency key), `harness`, `engine`, `persona_id`, and `agents_md_override`.
2. `spawn_assignment` checks `agent_spawn_requests` for an existing row matching `(thread_key, spawn_id)` and returns the cached response if found.
3. A Postgres advisory lock `pg_advisory_xact_lock(hashtext(thread_key))` serializes concurrent spawns for the same thread.
4. `get_or_spawn` in `agent.py` resolves the container: it checks `sandbox_sessions` for a live session, falls back to the warm pool (`claim_container`), and finally does a cold spawn via the backend.
5. Inside the same transaction, `agent_runtime_assignments` is either re-used (if an active assignment with the same prompt identity already exists) or a new row is inserted with an incremented `assignment_generation`.
6. The idempotency response is recorded in `agent_spawn_requests`.
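
The idempotency check, per-thread serialization, and generation bump described above can be modelled in memory. This is a sketch only: dictionaries stand in for `agent_spawn_requests` and `agent_runtime_assignments`, a `threading.Lock` stands in for the advisory lock, and container resolution is left out entirely.

```python
import threading
from collections import defaultdict

_spawn_requests = {}                   # stands in for agent_spawn_requests
_assignments = {}                      # thread_key -> active assignment
_locks = defaultdict(threading.Lock)   # stands in for pg_advisory_xact_lock


def spawn_assignment(thread_key, spawn_id, prompt_ref="harness:codex"):
    key = (thread_key, spawn_id)
    if key in _spawn_requests:         # replayed request: return cached response
        return _spawn_requests[key]
    with _locks[thread_key]:           # serialize spawns for the same thread
        active = _assignments.get(thread_key)
        if active and active["prompt_ref"] == prompt_ref:
            generation = active["assignment_generation"]   # re-use as-is
        else:
            generation = (active["assignment_generation"] if active else 0) + 1
            _assignments[thread_key] = {
                "prompt_ref": prompt_ref,
                "assignment_generation": generation,
            }
        response = {"ok": True, "thread_key": thread_key,
                    "assignment_generation": generation,
                    "prompt_ref": prompt_ref}
        _spawn_requests[key] = response
        return response
```

Calling it twice with the same `spawn_id` returns the cached response. A new `spawn_id` with an unchanged prompt identity re-uses the generation, and a changed prompt identity increments it.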

### Return value

```json
{
  "ok": true,
  "runtime_id": "<sandbox-id>",
  "thread_key": "...",
  "assignment_generation": 1,
  "persona_id": null,
  "prompt_ref": "harness:codex",
  "effective_agents_md_sha256": "<sha256>"
}
```

`assignment_generation` is the handle clients pass to all subsequent calls.

Sources: [services/api/api/runtime_control.py:417-636](), [services/api/api/agent.py:613-750]()

---

## Step 2 — message

**Endpoint:** `POST /agent/message`

`message` appends a user turn to durable storage without executing anything. The sandbox does not receive the input yet.

### Postgres writes

- `chat_messages` — one row per logical message. `id` is `msg:{thread_key}:{message_id}`. Inline `base64` image/document blobs are extracted into `attachments` and replaced with `attachment_ref` parts.
- `agent_message_requests` — idempotency log with `(thread_key, message_id, request_hash, event_json)`.

The advisory lock on `thread_key` is taken inside the transaction, so message inserts cannot interleave with a concurrent spawn for the same thread.

Validation: the active assignment's `assignment_generation` must match the caller's value, otherwise the call fails with `ASSIGNMENT_GENERATION_STALE`.

Sources: [services/api/api/runtime_control.py:735-902]()
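
The generation check and the idempotency log can be illustrated with a small in-memory model. How `request_hash` is computed and what happens when a `message_id` is reused with a different body are assumptions here, not behaviour confirmed by the source.

```python
import hashlib
import json


class AssignmentGenerationStale(Exception):
    """Mirrors the ASSIGNMENT_GENERATION_STALE error code."""


chat_messages = {}            # id -> row
agent_message_requests = {}   # (thread_key, message_id) -> request_hash


def append_message(thread_key, message_id, generation, active_generation, parts):
    if generation != active_generation:
        raise AssignmentGenerationStale("ASSIGNMENT_GENERATION_STALE")
    # Assumed hashing scheme: SHA-256 over canonical JSON of the parts.
    request_hash = hashlib.sha256(
        json.dumps(parts, sort_keys=True).encode("utf-8")).hexdigest()
    key = (thread_key, message_id)
    row_id = f"msg:{thread_key}:{message_id}"
    previous = agent_message_requests.get(key)
    if previous is not None:
        # Same id and same body is a retry; a different body is a conflict.
        if previous != request_hash:
            raise ValueError("message_id reused with a different payload")
        return chat_messages[row_id]
    row = {"id": row_id, "parts": parts}
    chat_messages[row_id] = row
    agent_message_requests[key] = request_hash
    return row
```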

---

## Step 3 — execute

**Endpoint:** `POST /agent/execute`

`execute` enqueues a turn for the worker pool. The queue entry carries the `delivery` target (Slack channel, platform, etc.), and the call then wakes an in-process worker.

### Postgres writes

- `agent_execution_requests` — one row inserted with `status='queued'`, `silence_deadline_at`, and `hard_deadline_at`. `execute_id` is the idempotency key.
- `agent_execution_events` — an initial `execution_state` event with `status='queued'` is appended immediately.
- `agent_final_delivery_outbox` — if the delivery platform is not `'dev'` and live-delivery is not enabled, an outbox row is inserted so the final answer can be sent to Slack even after SSE clients disconnect.

After the DB writes, `_worker_wake.set()` pings the in-process execution worker pool to pick up the new row without polling.

Sources: [services/api/api/runtime_control.py:1151-1317](), [services/api/api/routers/agent.py:251-295]()
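
The write-then-wake pattern can be sketched with an `asyncio.Event`. Whether `_worker_wake` is actually an `asyncio.Event` is an assumption, and a list stands in for `agent_execution_requests`. The point is only that the row exists before the worker is signalled.

```python
import asyncio

execution_requests = []   # stands in for agent_execution_requests


async def enqueue_execution(wake: asyncio.Event, execute_id: str) -> dict:
    row = {"execute_id": execute_id, "status": "queued"}
    execution_requests.append(row)   # the durable write happens first
    wake.set()                       # then the worker is nudged
    return row


async def worker(wake: asyncio.Event, handled: list) -> None:
    await wake.wait()                # sleeps until an enqueue signals
    wake.clear()
    for row in execution_requests:
        if row["status"] == "queued":
            row["status"] = "running"
            handled.append(row["execute_id"])


async def main() -> list:
    wake, handled = asyncio.Event(), []
    task = asyncio.create_task(worker(wake, handled))
    await enqueue_execution(wake, "exe_1")
    await task
    return handled


handled_ids = asyncio.run(main())
```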

### Auto-orchestration convenience path

When `assignment_generation` is omitted but `message` is set, `POST /agent/execute` auto-orchestrates all three steps internally. A single call runs `spawn_assignment → append_message → enqueue_execution` with generated idempotency IDs.

Sources: [services/api/api/routers/agent.py:298-358]()
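
A minimal sketch of that branching follows. The three step functions are injected as callables with hypothetical signatures, and the `auto-` ID prefix is invented for the example.

```python
import uuid


def wants_auto_orchestration(body: dict) -> bool:
    """True when the caller sent a message but no assignment_generation."""
    return body.get("assignment_generation") is None and bool(body.get("message"))


def auto_execute(body: dict, spawn, append, enqueue) -> dict:
    """Chain spawn, message, and execute with generated idempotency IDs."""
    thread_key = body["thread_key"]
    generation = spawn(thread_key, f"auto-{uuid.uuid4().hex}")
    append(thread_key, f"auto-{uuid.uuid4().hex}", generation, body["message"])
    return enqueue(thread_key, f"auto-{uuid.uuid4().hex}", generation)
```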

---

## Step 4 — events (SSE stream)

**Endpoint:** `GET /agent/threads/{thread_key}/events`

This is the real-time output channel. It is a persistent Server-Sent Events stream backed by a polling loop over `agent_execution_events`.

### Query parameters

| Parameter | Default | Meaning |
|---|---|---|
| `after_event_id` | `0` | Resume cursor: only return rows with `event_id > after_event_id` |
| `execution_id` | none | Filter to events for one execution |
| `poll_ms` | `500` | Polling interval (clamped 50 ms – 5 s) |

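### How the loop works

The excerpt below is condensed rather than copied verbatim: the bracketed `AND execution_id = $3` clause applies only when the `execution_id` filter is set, and `...` marks elided arguments.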

```python
cursor = max(0, after_event_id)
while True:
    rows = await pool.fetch(
        "SELECT event_id, event_kind, event_json FROM agent_execution_events "
        "WHERE thread_key = $1 AND event_id > $2 [AND execution_id = $3] "
        "ORDER BY event_id ASC LIMIT 200",
        thread_key, cursor, ...
    )
    if not rows:
        if execution_id:
            snapshot = await get_execution_terminal_snapshot(pool, execution_id)
            if snapshot:  # execution is already terminal — emit final state and return
                yield ServerSentEvent(...)
                return
        await asyncio.sleep(poll_s)
        continue
    for row in rows:
        cursor = int(row["event_id"])
        yield ServerSentEvent(id=str(cursor), event=row["event_kind"], data=...)
```

Sources: [services/api/api/routers/agent.py:884-957]()

### SSE event schema

Each event has:
- `id`: the `event_id` integer (monotonically increasing per Postgres insert order)
- `event`: the `event_kind` string (e.g. `execution_state`, `harness_event`)
- `data`: JSON payload

```text
id: 42
event: execution_state
data: {"type":"execution.state","execution_id":"exe_abc","thread_key":"...","status":"completed","result_text":"..."}
```

Keepalive comments are emitted every 15 s via `sse-starlette`'s `ping_message_factory`.
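
On the client side, frames in this shape can be decoded with a small parser. The sketch assumes one `data:` line per frame (the SSE format also allows several) and skips comment lines, which is how keepalive pings arrive.

```python
import json


def parse_sse(text: str) -> list:
    """Parse SSE wire text into dicts with id, event, and decoded data."""
    events = []
    for frame in text.replace("\r\n", "\n").split("\n\n"):
        fields = {}
        for line in frame.split("\n"):
            if not line or line.startswith(":"):   # blank or keepalive comment
                continue
            name, _, value = line.partition(":")
            fields[name] = value[1:] if value.startswith(" ") else value
        if "data" in fields:
            events.append({
                "id": int(fields["id"]) if "id" in fields else None,
                "event": fields.get("event", "message"),
                "data": json.loads(fields["data"]),
            })
    return events


sample = (
    ": ping\n\n"
    "id: 42\n"
    "event: execution_state\n"
    'data: {"type":"execution.state","status":"completed"}\n\n'
)
events = parse_sse(sample)
```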

### Reconnecting with `after_event_id`

When a client disconnects and reconnects, it passes the highest `event_id` it received as `after_event_id`. The server resumes from `event_id > cursor` so no events are duplicated and none are missed.

If the execution is already terminal when the client reconnects (e.g., the job completed while the client was disconnected), and no new rows exist above the cursor, `get_execution_terminal_snapshot` synthesizes a final `execution_state` event from `agent_execution_requests` and returns it immediately rather than hanging the stream indefinitely.

The E2E test verifies this contract: the reconnect stream emits exactly the terminal event at the same `event_id` as the original stream's last event.

Sources: [services/api/api/routers/agent.py:921-934](), [services/api/api/runtime_control.py:1114-1148](), [services/api/tests/test_direct_api_e2e.py:169-188]()
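
The cursor semantics can be checked with a toy model: a helper that builds the reconnect URL and a filter that mimics the `event_id > cursor` query. The base URL is a placeholder, and `rows_after` is a stand-in for the SQL, not the server's code.

```python
from urllib.parse import quote, urlencode


def events_url(base_url, thread_key, after_event_id=0, execution_id=None):
    """Build the reconnect URL from the last event_id the client processed."""
    params = {"after_event_id": max(0, after_event_id)}
    if execution_id:
        params["execution_id"] = execution_id
    path = f"/agent/threads/{quote(thread_key, safe='')}/events"
    return f"{base_url}{path}?{urlencode(params)}"


def rows_after(rows, after_event_id):
    """Mimic the query: strictly greater than the cursor, ascending."""
    return sorted((r for r in rows if r["event_id"] > after_event_id),
                  key=lambda r: r["event_id"])


rows = [{"event_id": i} for i in (1, 2, 3)]
first = rows_after(rows, 0)[:2]                      # client drops after event 2
resumed = rows_after(rows, first[-1]["event_id"])    # reconnect from cursor 2
```

Because the comparison is strict, the two reads together cover each row exactly once, provided the client only advances its cursor after it has fully processed an event.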

### Sandbox stdout → SSE event path

The execution worker attaches to a sandbox's stdout via `_stream_stdout` in `agent.py`. Raw NDJSON lines from the container are:

1. Parsed as JSON.
2. Normalized by `normalize_harness_event(engine, evt)` into canonical event dicts.
3. Persisted into `agent_execution_events` via `append_execution_event`.
4. Yielded to `EventSourceResponse` as `{"data": "<json>"}` dicts.

The harness-level turn boundary is detected by `is_turn_done`. Before emitting `turn.done`, the worker runs `_db_complete_inflight_turn` (state=idle, clear inflight) and `_persist_turn_messages` (write assistant message to `chat_messages`) concurrently via `asyncio.gather` and awaits both. `asyncio.gather` does not make the two writes atomic; the guarantee comes from ordering. Because both writes complete before the event is emitted, a reconnecting client cannot read a terminal SSE event before the durable state is committed.

Sources: [services/api/api/agent.py:839-953](), [services/api/api/agent.py:932-939]()
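
The persist-before-emit ordering can be sketched as an async generator. Normalization is omitted, the `type == "turn.done"` check is a hypothetical stand-in for `is_turn_done`, and skipping non-JSON lines is an assumed policy.

```python
import asyncio
import json

events = []   # stands in for agent_execution_events
log = []      # records the order of side effects


async def complete_inflight_turn():
    log.append("session_idle")


async def persist_turn_messages():
    log.append("assistant_message_saved")


async def stream_stdout(lines):
    """Parse NDJSON lines, persist each event, then yield it."""
    for line in lines:
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue                         # skip non-JSON output (assumed)
        if evt.get("type") == "turn.done":   # hypothetical turn-boundary check
            # Both writes finish before the terminal event becomes visible.
            await asyncio.gather(complete_inflight_turn(), persist_turn_messages())
            log.append("turn_done_emitted")
        events.append(evt)
        yield {"data": json.dumps(evt)}


async def main():
    raw = ['{"type": "text", "text": "hi"}', "not json", '{"type": "turn.done"}']
    return [chunk async for chunk in stream_stdout(raw)]


chunks = asyncio.run(main())
```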

---

## Step 5 — release

**Endpoint:** `POST /agent/threads/{thread_key}/release`

`release` tears down the runtime assignment for a thread. It is idempotent via `agent_release_requests`.

### What happens

1. The advisory lock on `thread_key` is taken.
2. The active `agent_runtime_assignments` row is transitioned to `state='released'`.
3. If `cancel_inflight=true`, any `queued/running/cancel_requested/retry_wait` executions for the thread are cancelled in the same transaction.
4. If `stop_runtime=true` (default), the sandbox container is stopped asynchronously.
5. The response is written to `agent_release_requests` for idempotency.

Sources: [services/api/api/runtime_control.py:1643-1720]()
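
An in-memory sketch of the release steps is below. `release_id` is a hypothetical name for the idempotency key, the `cancelled` response field is invented for the example, and stopping the container is left out.

```python
assignments = {"t1": {"state": "active", "assignment_generation": 1}}
executions = [
    {"thread_key": "t1", "execution_id": "exe_1", "status": "running"},
    {"thread_key": "t1", "execution_id": "exe_0", "status": "completed"},
]
release_requests = {}   # stands in for agent_release_requests

CANCELLABLE = {"queued", "running", "cancel_requested", "retry_wait"}


def release(thread_key, release_id, cancel_inflight=False):
    key = (thread_key, release_id)
    if key in release_requests:              # idempotent replay
        return release_requests[key]
    assignment = assignments.get(thread_key)
    released = bool(assignment) and assignment["state"] == "active"
    if released:
        assignment["state"] = "released"
    cancelled = []
    if cancel_inflight:
        for row in executions:
            if row["thread_key"] == thread_key and row["status"] in CANCELLABLE:
                row["status"] = "cancelled"
                cancelled.append(row["execution_id"])
    response = {"ok": True, "released": released, "cancelled": cancelled}
    release_requests[key] = response
    return response
```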

---

## Database state machine

```text
                   spawn
                     │
                     ▼
         ┌─── agent_runtime_assignments ───┐
         │   state: active                 │
         │   assignment_generation: N      │
         └─────────────────────────────────┘
                     │  message
                     ▼
         ┌─── agent_message_requests ──────┐
         │   chat_messages (user turn)     │
         └─────────────────────────────────┘
                     │  execute
                     ▼
         ┌─── agent_execution_requests ────┐
         │   status: queued → running      │
         │          → completed            │
         │          → failed_permanent     │
         │          → cancelled            │
         └─────────────────────────────────┘
                     │  worker streams
                     ▼
         ┌─── agent_execution_events ──────┐
         │   event_id (monotone)           │
         │   event_kind / event_json       │
         └─────────────────────────────────┘
                     │  release
                     ▼
         agent_runtime_assignments
           state: released
```

Every transition is a Postgres row write. There is no in-process queue between steps.
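
The execution statuses in the diagram can be read as a small transition table. The sketch encodes only the edges drawn above. `cancel_requested` and `retry_wait` are left out because the source does not spell out their transitions.

```python
TERMINAL = {"completed", "failed_permanent", "cancelled"}

# Edges taken from the diagram only; other statuses are intentionally omitted.
TRANSITIONS = {
    "queued": {"running"},
    "running": {"completed", "failed_permanent", "cancelled"},
}


def can_transition(current: str, new: str) -> bool:
    """Terminal statuses have no outgoing edges in this model."""
    return new in TRANSITIONS.get(current, set())
```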

---

## Sandbox session durability

Independent of the control-plane tables above, `sandbox_sessions` tracks live container state. Key columns:

| Column | Role |
|---|---|
| `state` | `idle`, `running`, `delivering`, `error`, `suspended`, `gone`, `stopped` |
| `inflight_turn_id` | UUID of the active turn; set before stdin write, cleared on completion |
| `inflight_turn_input` | Full turn payload — enables replay if the container is replaced mid-turn |
| `last_delivered_id` | Cursor into `chat_messages`; prevents double-delivery on reconnect |
| `wire_lease_id` | UUID granted when `stream_connect` attaches to stdout |

If a sandbox dies mid-turn, `replay_inflight_turn` reads `inflight_turn_input` back from Postgres and re-sends it to a freshly spawned container with an incremented attempt counter.

Sources: [services/api/api/agent.py:221-266](), [services/api/api/agent.py:1287-1369]()
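
The replay mechanism can be illustrated with a dataclass standing in for one `sandbox_sessions` row. The `attempt` field and the injected `send` callable are assumptions for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class SandboxSession:
    state: str = "idle"
    inflight_turn_id: Optional[str] = None
    inflight_turn_input: Optional[dict] = None
    attempt: int = 0   # hypothetical attempt counter


def begin_turn(session: SandboxSession, payload: dict) -> None:
    """Record the turn before anything is written to stdin."""
    session.inflight_turn_id = str(uuid.uuid4())
    session.inflight_turn_input = payload
    session.state = "running"
    session.attempt = 1


def replay_inflight_turn(session: SandboxSession, send) -> bool:
    """Re-send the saved payload to a replacement container, if any."""
    if session.inflight_turn_input is None:
        return False   # nothing was in flight
    session.attempt += 1
    send(session.inflight_turn_input)
    return True
```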

---

## Slackbot ingress path

The Slackbot is a separate Node.js/Hono service (`services/slackbot`). All Slack-originated events pass through HMAC signature verification before any application logic runs.

### HMAC verification

The `verifySlackSignature` function in `services/slackbot/src/slack/signature.ts` enforces Slack's v0 signing scheme:

```text
base = "v0:{X-Slack-Request-Timestamp}:{raw_body}"
expected = "v0=" + HMAC-SHA256(signing_secret, base).hex()
```

Steps:
1. Read the raw request body (before JSON parsing).
2. Reject if `X-Slack-Signature` or `X-Slack-Request-Timestamp` headers are missing.
3. Reject if the timestamp is more than 5 minutes from `now` (replay protection).
4. Compute HMAC-SHA256 and compare with `timingSafeEqual` to prevent timing attacks.
5. Return `{ ok: false, status: 401 }` on any failure; `{ ok: true }` on success.

The middleware `slackSignatureMiddleware` applies this check to all inbound Slack routes (`/api/slack/events`, `/api/slack/actions`, `/api/slack/options`, `/api/slack/commands`).

Sources: [services/slackbot/src/slack/signature.ts:1-46](), [services/slackbot/src/index.ts:90-143]()
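
For readers more at home in Python, the same v0 check can be sketched with the standard library. This is an illustrative port, not the Slackbot's TypeScript: the `error` key and every error string except `stale_signature_timestamp` are invented for the example.

```python
import hashlib
import hmac
import time


def verify_slack_signature(signing_secret, raw_body, signature, timestamp,
                           now=None, tolerance_s=300):
    """Return {"ok": True} or {"ok": False, "status": 401, "error": ...}."""
    def reject(error):
        return {"ok": False, "status": 401, "error": error}

    if not signature or not timestamp:
        return reject("missing_signature_headers")
    try:
        ts = int(timestamp)
    except ValueError:
        return reject("invalid_signature_timestamp")
    now = time.time() if now is None else now
    if abs(now - ts) > tolerance_s:          # replay protection
        return reject("stale_signature_timestamp")
    # Sign the raw bytes exactly as received, before any JSON parsing.
    base = b"v0:" + timestamp.encode() + b":" + raw_body
    expected = "v0=" + hmac.new(
        signing_secret.encode(), base, hashlib.sha256).hexdigest()
    # compare_digest is the constant-time analogue of timingSafeEqual.
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return reject("signature_mismatch")
    return {"ok": True}
```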

`SLACK_SIGNING_SECRET` is a required environment variable validated at container startup by `entrypoint.sh`.

Sources: [services/api/entrypoint.sh:9]()

### From Slack event to execution

Once signature verification passes, the Slackbot calls the Centaur API, and the API drives the Slack UI through `slackbot_client`:
1. The API opens a UI session with a progress channel in Slack by calling `POST /api/slack/agent-sessions` on the Slackbot (via `slackbot_client.open_agent_session`).
2. The request enters the `spawn → message → execute` lifecycle.
3. The execution worker streams harness events back through `slackbot_client.harness_event` and `slackbot_client.session_text` / `session_step`, progressively rendering the agent's output as Slack message blocks.
4. On completion, `slackbot_client.session_done` closes the Slack session.

The `slackbot_client.py` module is a thin async HTTP client for the Slackbot's internal REST API. It uses bearer-token auth (`SLACKBOT_API_KEY`) and retries on `{408, 429, 500-504}` with exponential back-off (base 0.25 s, 3 attempts).

Sources: [services/api/api/slackbot_client.py:32-85](), [services/api/api/slackbot_client.py:122-188]()
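
The retry policy can be sketched independently of any HTTP library. The sketch assumes "3 attempts" means three tries in total and that the delay doubles each time. `send` is a hypothetical callable that returns a status code.

```python
import time

RETRYABLE = {408, 429, 500, 501, 502, 503, 504}


def post_with_retry(send, attempts=3, base_delay_s=0.25, sleep=time.sleep):
    """Call send() until the status is not retryable or attempts run out."""
    status = None
    for attempt in range(attempts):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt < attempts - 1:
            sleep(base_delay_s * (2 ** attempt))   # 0.25 s, then 0.5 s
    return status
```

Injecting `sleep` keeps the function testable without real delays.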

---

## Sequence diagram: full lifecycle

```mermaid
sequenceDiagram
    participant C as Client
    participant API as FastAPI /agent
    participant W as Execution Worker
    participant SB as Sandbox (container)
    participant DB as Postgres

    C->>API: POST /agent/spawn {thread_key, spawn_id}
    API->>DB: pg_advisory_xact_lock; upsert sandbox_sessions; INSERT agent_runtime_assignments
    API-->>C: {assignment_generation: 1, runtime_id}

    C->>API: POST /agent/message {thread_key, assignment_generation: 1, message_id}
    API->>DB: INSERT chat_messages + agent_message_requests
    API-->>C: {ok: true, message_id}

    C->>API: POST /agent/execute {thread_key, assignment_generation: 1, execute_id}
    API->>DB: INSERT agent_execution_requests (status=queued) + execution_state event
    API-->>C: 202 {execution_id}

    C->>API: GET /agent/threads/{key}/events?after_event_id=0
    Note over API: poll loop begins

    W->>DB: claim execution row (status→running)
    W->>SB: flush pending messages → write stdin
    SB-->>W: stdout NDJSON stream
    W->>DB: INSERT agent_execution_events (harness events)
    W-->>API: SSE event stream

    API-->>C: id:1 event:harness_event data:{...}
    API-->>C: id:2 event:execution_state data:{status:running}
    SB-->>W: turn.done
    W->>DB: UPDATE sandbox_sessions (inflight cleared, state=idle)
    W->>DB: INSERT chat_messages (assistant reply)
    W->>DB: UPDATE agent_execution_requests (status=completed)
    W->>DB: INSERT agent_execution_events (execution_state completed)
    API-->>C: id:3 event:execution_state data:{status:completed,result_text:...}

    C->>API: POST /agent/threads/{key}/release
    API->>DB: UPDATE agent_runtime_assignments (state=released)
    API-->>C: {ok: true, released: true}
```

---

## Failure modes and invariants

| Failure | What breaks | Recovery |
|---|---|---|
| Client disconnects mid-stream | SSE loop exits; no data lost | Reconnect with `after_event_id` = last seen `event_id` |
| Execution is already terminal on reconnect | No new DB rows above cursor | `get_execution_terminal_snapshot` synthesizes a final event from the row |
| Sandbox dies mid-turn | The in-flight turn loses its container; the payload remains in `inflight_turn_input` | `replay_inflight_turn` re-sends payload to new container (attempt counter increments) |
| Slack replay attack | Timestamp too old | `verifySlackSignature` rejects with `stale_signature_timestamp` after 5-minute window |
| Concurrent spawns for same thread | Two API pods race | `pg_advisory_xact_lock` + `ON CONFLICT DO NOTHING` on `sandbox_sessions`; loser calls `backend.stop_by_id` on its own container |
| Repeated turn failures on same thread | Failure loop burns resources | `enqueue_execution` counts `failed_permanent` rows in a 5-minute window; rejects with `THREAD_FAILURE_LOOP` after threshold |
| Idle sandbox exceeds TTL | Container keeps consuming node resources | `reconcile_tick` evicts idle sessions older than `IDLE_TTL_S` (default 24 h) |

Sources: [services/api/api/agent.py:726-748](), [services/api/api/runtime_control.py:1250-1281](), [services/api/api/agent.py:1516-1736]()
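
The failure-loop guard amounts to a sliding-window count. In the sketch below the threshold of 3 and the `>=` comparison are placeholders, since the source states only the 5-minute window.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)
THRESHOLD = 3   # hypothetical; the source does not state the threshold


def in_failure_loop(failed_at, now, window=WINDOW, threshold=THRESHOLD):
    """True when enough failed_permanent timestamps fall inside the window."""
    recent = [t for t in failed_at if now - t <= window]
    return len(recent) >= threshold


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
failures = [now - timedelta(minutes=m) for m in (1, 2, 10)]
```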

---

## Summary

The five-step lifecycle (`spawn → message → execute → events → release`) is designed so that every mutation is a Postgres write that can be replayed or observed independently. The SSE stream is a polling view over `agent_execution_events` rows identified by a monotone `event_id`; clients reconnect by passing `after_event_id` and receive events from exactly the next undelivered row. The Slackbot ingress adds one pre-flight gate — HMAC-SHA256 signature verification with a 5-minute replay window — before feeding events into the same control-plane tables. The overall design has no in-process queues between steps: durability is Postgres, delivery is SSE, and recovery is row replay.

Sources: [services/api/api/runtime_control.py:1071-1088](), [services/slackbot/src/slack/signature.ts:33-45]()
