# Sandbox Pods & the Warm Pool

> How sandbox Kubernetes pods are created, claimed, and recycled; why the warm pool exists (a roughly 15-second cold-start cost); the POOL_EVICT_ON_STARTUP invariant that ensures warm pods run fresh code after a deploy; and the sandbox session state machine (idle → running → delivering → idle, plus the suspended, released, and gone states).

- Repository: paradigmxyz/centaur
- GitHub: https://github.com/paradigmxyz/centaur
- Human wiki: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2
- Complete Markdown: https://grok-wiki.com/public/wiki/paradigmxyz-centaur-57fc6b2755e2/llms-full.txt

## Source Files

- `services/api/api/warm_pool.py`
- `services/api/api/sandbox/kubernetes.py`
- `services/api/api/sandbox/base.py`
- `services/api/api/sandbox/registry.py`
- `services/api/tests/test_warm_pool.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [services/api/api/warm_pool.py](services/api/api/warm_pool.py)
- [services/api/api/sandbox/kubernetes.py](services/api/api/sandbox/kubernetes.py)
- [services/api/api/sandbox/base.py](services/api/api/sandbox/base.py)
- [services/api/api/sandbox/registry.py](services/api/api/sandbox/registry.py)
- [services/api/tests/test_warm_pool.py](services/api/tests/test_warm_pool.py)
- [services/api/api/agent.py](services/api/api/agent.py)
- [services/api/db/migrations/001_initial.sql](services/api/db/migrations/001_initial.sql)
- [services/api/db/migrations/003_add_delivering_state.sql](services/api/db/migrations/003_add_delivering_state.sql)
</details>


Centaur runs each agent conversation inside an isolated Kubernetes pod. Creating a pod from scratch takes roughly 15 seconds — enough latency to make every new conversation feel sluggish. The warm pool eliminates that wait by keeping a small set of pre-booted pods idle in the cluster, ready to be instantly claimed when a new thread arrives.

This page explains how pods are created and structured, how the warm pool maintains its target size, how the `POOL_EVICT_ON_STARTUP` invariant prevents stale code from surviving a deploy, and the full lifecycle of a sandbox session from creation through terminal release.

---

## Pod Structure

Each sandbox is a Kubernetes Pod with:

- A **sandbox container** running the agent harness (codex, amp, claude-code, etc.) as UID 1001 (`_AGENT_UID`), with all Linux capabilities dropped and a seccomp `RuntimeDefault` profile applied. `stdin: true` is set so the API can pipe NDJSON turns in.
- An **iron-proxy sidecar** (per-pod), which forwards egress through a managed firewall. Its health is polled via `/healthz` before the sandbox container starts.
- An optional **overlay init container**, which copies org-specific files into a shared `emptyDir` volume before the main containers start.

The pod name is derived deterministically from the thread key via a SHA-1 digest: `centaur-centaur-sandbox-<normalized>-<sha1[:10]>`. Warm pods use a `warm-<timestamp>-<id>` placeholder key and carry the label `centaur.ai/warm=true` so they are distinguishable from assigned pods.

Sources: [services/api/api/sandbox/kubernetes.py:1028-1073](services/api/api/sandbox/kubernetes.py), [services/api/api/sandbox/kubernetes.py:1182-1210](services/api/api/sandbox/kubernetes.py)
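
The deterministic naming scheme described above can be sketched as follows. This is a minimal illustration, not the code in `kubernetes.py`: the normalization rule, the 63-character cap, and the function name `pod_name_for_thread` are all assumptions. Only the prefix and the 10-character SHA-1 suffix come from the pattern shown above.

```python
import hashlib
import re

POD_NAME_PREFIX = "centaur-centaur-sandbox"  # taken from the naming pattern above
MAX_NAME_LEN = 63  # assumption: keep the whole name within one DNS label


def pod_name_for_thread(thread_key: str) -> str:
    """Derive a stable pod name from a thread key (hypothetical helper)."""
    # Assumed normalization: lowercase, runs of other characters become '-'.
    normalized = re.sub(r"[^a-z0-9]+", "-", thread_key.lower()).strip("-")
    # The digest is computed over the raw key, so keys that normalize to the
    # same string still get different names.
    digest = hashlib.sha1(thread_key.encode("utf-8")).hexdigest()[:10]
    # Leave room for the prefix, two separators, and the digest.
    budget = MAX_NAME_LEN - len(POD_NAME_PREFIX) - len(digest) - 2
    normalized = normalized[:budget].strip("-") or "thread"
    return f"{POD_NAME_PREFIX}-{normalized}-{digest}"
```

Because the name depends only on the thread key, a reconnecting thread can recompute the pod name without a lookup, which is presumably why a digest is used rather than a random suffix.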

---

## The Warm Pool

### Why It Exists

A cold start creates a prompt Secret, a proxy ConfigMap, a Service, NetworkPolicies, and an iron-proxy Pod (waiting for it to become Ready), then creates the sandbox Pod itself (waiting for `/home/agent/.ready`). Together these steps take roughly 15 seconds. The warm pool runs all of them ahead of time so the first user turn connects without that delay.

### Configuration

| Env var | Default | Purpose |
|---|---|---|
| `WARM_POOL_SIZE` | `5` | Target number of idle pods to maintain |
| `WARM_POOL_HARNESS` | `codex` | Harness type for warm pods |
| `WARM_POOL_REPLENISH_INTERVAL` | `5.0` s | How often the background loop top-fills the pool |
| `WARM_POOL_BACKEND_TIMEOUT` | `30.0` s | Per-operation timeout for backend calls |
| `WARM_POOL_EVICT_ON_STARTUP` | `1` (enabled) | Whether to kill old warm pods on API start |

Sources: [services/api/api/warm_pool.py:29-42](services/api/api/warm_pool.py)
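
As a rough sketch of how the variables in the table could be read, the snippet below loads them with the documented defaults. The `WarmPoolConfig` dataclass, the `load_warm_pool_config` function, and the set of accepted truthy strings are assumptions for illustration; the real module may parse these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WarmPoolConfig:
    size: int
    harness: str
    replenish_interval_s: float
    backend_timeout_s: float
    evict_on_startup: bool


def _as_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real module may accept a different set.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_warm_pool_config(env: Mapping[str, str] = os.environ) -> WarmPoolConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return WarmPoolConfig(
        size=int(env.get("WARM_POOL_SIZE", "5")),
        harness=env.get("WARM_POOL_HARNESS", "codex"),
        replenish_interval_s=float(env.get("WARM_POOL_REPLENISH_INTERVAL", "5.0")),
        backend_timeout_s=float(env.get("WARM_POOL_BACKEND_TIMEOUT", "30.0")),
        evict_on_startup=_as_bool(env.get("WARM_POOL_EVICT_ON_STARTUP", "1")),
    )
```

Passing the environment as a mapping keeps the loader easy to exercise in tests without mutating `os.environ`.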

### Data Model

```python
@dataclass
class WarmContainer:
    sandbox_id: str   # Pod name
    harness:    str
    engine:     str
    created_at: float  # wall-clock epoch for age-based health checks
```

The pool is a plain Python list (`_pool: list[WarmContainer]`). All mutation is serialized through an `asyncio.Lock` (`_pool_lock`) so concurrent claims or replenishments do not race.

Sources: [services/api/api/warm_pool.py:45-53](services/api/api/warm_pool.py), [services/api/api/warm_pool.py:55-59](services/api/api/warm_pool.py)
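
The lock-serialized list can be illustrated with a small sketch. The source keeps `_pool` and `_pool_lock` as module-level globals; the `WarmPool` wrapper class and its `add` and `pop` methods below are hypothetical and exist only to keep the example self-contained.

```python
import asyncio
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WarmContainer:
    sandbox_id: str
    harness: str
    engine: str
    created_at: float = field(default_factory=time.time)


class WarmPool:
    """Hypothetical wrapper around a list guarded by an asyncio.Lock."""

    def __init__(self) -> None:
        self._pool: list[WarmContainer] = []
        self._pool_lock = asyncio.Lock()

    async def add(self, entry: WarmContainer) -> None:
        async with self._pool_lock:
            self._pool.append(entry)

    async def pop(self) -> Optional[WarmContainer]:
        # Check-and-pop happens under the lock, so two concurrent claims
        # can never receive the same entry.
        async with self._pool_lock:
            return self._pool.pop(0) if self._pool else None
```

An `asyncio.Lock` only serializes coroutines within one event loop, which fits a process-local pool; it would not coordinate multiple API processes.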

### Backend Capability Gate

Only backends that declare `supports_warm_pool = True` participate. The `SandboxBackend` ABC defaults this property to `False`; `KubernetesExecutorBackend` overrides it to `True`. The registry always returns `KubernetesExecutorBackend` in production via `auto_configure()`.

Sources: [services/api/api/sandbox/base.py:60-62](services/api/api/sandbox/base.py), [services/api/api/sandbox/kubernetes.py:406-407](services/api/api/sandbox/kubernetes.py), [services/api/api/sandbox/registry.py:24-28](services/api/api/sandbox/registry.py)
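
The capability gate amounts to a property with a conservative default that one backend overrides. The sketch below mirrors that shape under assumptions: the real classes have many more members, and `LocalBackend` and `warm_pool_enabled` are hypothetical names added for illustration.

```python
from abc import ABC


class SandboxBackend(ABC):
    @property
    def supports_warm_pool(self) -> bool:
        # Conservative default: backends opt in explicitly.
        return False


class KubernetesExecutorBackend(SandboxBackend):
    @property
    def supports_warm_pool(self) -> bool:
        return True


class LocalBackend(SandboxBackend):
    """Hypothetical backend that inherits the default (no warm pool)."""


def warm_pool_enabled(backend: SandboxBackend) -> bool:
    """Hypothetical helper: callers check the flag before touching the pool."""
    return bool(backend.supports_warm_pool)
```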

---

## Warm Pool Lifecycle

```text
API process start
     │
     ├── POOL_EVICT_ON_STARTUP=true  → _evict_existing_warm()   (kill old pods from prior deploy)
     └── POOL_EVICT_ON_STARTUP=false → _recover_warm()          (adopt surviving pods)
           │
           ▼
     replenish()  ←─────────────────────────────────────────────────────────┐
     (spawn pods until len(_pool) == POOL_SIZE)                             │
           │                                                                 │
           └──► background loop: sleep(POOL_REPLENISH_INTERVAL) ───────────►┘
```

### Replenishment

`replenish()` first health-checks every existing pool entry by calling `backend.status_by_id()`. Entries that are not in the `"running"` state, or that are older than 3600 seconds (1 hour), are evicted and stopped. It then spawns new pods until the pool reaches `POOL_SIZE`. Each `_spawn_warm_container()` call goes through the full `backend.create(..., warm=True)` path with a 30-second timeout (matching the `WARM_POOL_BACKEND_TIMEOUT` default). Failures are logged and swallowed, and the loop tries again on the next tick.

Sources: [services/api/api/warm_pool.py:118-161](services/api/api/warm_pool.py), [services/api/api/warm_pool.py:91-115](services/api/api/warm_pool.py)
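
The health-check-then-top-fill behaviour described above can be sketched against an in-memory stand-in. `FakeBackend`, `stop_by_id`, the exact `create` signature, and the `now` parameter are assumptions for illustration, and the pool lock is omitted for brevity.

```python
import asyncio
import time
import uuid
from dataclasses import dataclass
from typing import Optional

POOL_SIZE = 5
MAX_AGE_S = 3600.0
BACKEND_TIMEOUT_S = 30.0


@dataclass
class WarmContainer:
    sandbox_id: str
    created_at: float


class FakeBackend:
    """Hypothetical in-memory backend; method signatures are assumptions."""

    def __init__(self) -> None:
        self.pods: dict[str, str] = {}  # sandbox_id -> status

    async def status_by_id(self, sandbox_id: str) -> str:
        return self.pods.get(sandbox_id, "gone")

    async def stop_by_id(self, sandbox_id: str) -> None:
        self.pods.pop(sandbox_id, None)

    async def create(self, key: str, warm: bool = False) -> str:
        sandbox_id = f"pod-{key}"
        self.pods[sandbox_id] = "running"
        return sandbox_id


async def replenish(pool: list, backend: FakeBackend, now: Optional[float] = None) -> int:
    now = time.time() if now is None else now
    healthy = []
    for entry in pool:
        status = await asyncio.wait_for(backend.status_by_id(entry.sandbox_id), BACKEND_TIMEOUT_S)
        if status == "running" and now - entry.created_at <= MAX_AGE_S:
            healthy.append(entry)
        else:
            await backend.stop_by_id(entry.sandbox_id)  # evict dead or old pods
    pool[:] = healthy
    spawned = 0
    for _ in range(POOL_SIZE - len(pool)):
        key = f"warm-{int(now)}-{uuid.uuid4().hex[:8]}"
        try:
            sandbox_id = await asyncio.wait_for(backend.create(key, warm=True), BACKEND_TIMEOUT_S)
        except Exception:
            continue  # swallowed here; the next tick retries
        pool.append(WarmContainer(sandbox_id, now))
        spawned += 1
    return spawned
```

A background task would call `replenish()` in a loop with `asyncio.sleep()` between ticks, so a failed spawn costs at most one interval before the next attempt.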

### The POOL_EVICT_ON_STARTUP Invariant

When the API restarts after a deploy (new container image, new overlay), any warm pods still running in the cluster were built with the **old** image and overlay refs. If those pods were adopted, the first claims after the deploy would run stale code.

`POOL_EVICT_ON_STARTUP` (enabled by default) guards against this. On startup, `_evict_existing_warm()` calls `backend.recover_warm(POOL_HARNESS)` to list all pods with label `centaur.ai/warm=true`, then stops every one that is not already assigned to a live thread. Only **assigned** sandbox IDs — pulled from `sandbox_sessions WHERE state IN ('running', 'idle', 'error')` — are spared.

```python
# warm_pool.py: start_replenish_loop
assigned = await _get_assigned_sandbox_ids()
if POOL_EVICT_ON_STARTUP:
    evicted = await _evict_existing_warm(assigned)
else:
    recovered = await _recover_warm(assigned)
count = await replenish()  # fill with fresh pods
```

After eviction, `replenish()` immediately fills the pool with freshly-spawned pods that use the current image. The cost is one full cold-start cycle on deploy, not on every user request.

Sources: [services/api/api/warm_pool.py:435-470](services/api/api/warm_pool.py), [services/api/api/warm_pool.py:484-513](services/api/api/warm_pool.py), [services/api/api/warm_pool.py:473-481](services/api/api/warm_pool.py)
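
The eviction step itself reduces to "list warm pods, stop everything not in the assigned set". The sketch below shows that filter under assumptions: `FakeBackend`, `stop_by_id`, and the return type of `recover_warm` (a list of sandbox IDs) are hypothetical, and the error handling is a guess at reasonable behaviour rather than a copy of the source.

```python
import asyncio


class FakeBackend:
    """Hypothetical stand-in; the real recover_warm() may return richer objects."""

    def __init__(self, warm_pod_ids) -> None:
        self.warm_pod_ids = list(warm_pod_ids)
        self.stopped: list[str] = []

    async def recover_warm(self, harness: str) -> list:
        # Stands in for listing pods labelled centaur.ai/warm=true.
        return list(self.warm_pod_ids)

    async def stop_by_id(self, sandbox_id: str) -> None:
        self.stopped.append(sandbox_id)


async def evict_existing_warm(backend: FakeBackend, harness: str, assigned: set) -> int:
    evicted = 0
    for sandbox_id in await backend.recover_warm(harness):
        if sandbox_id in assigned:
            continue  # bound to a live thread: never kill it
        try:
            await backend.stop_by_id(sandbox_id)
            evicted += 1
        except Exception:
            # Assumption: one failed stop should not block the others.
            pass
    return evicted
```

The assigned-set check matters because a claimed pod may still carry the warm label; without it, a deploy could kill sandboxes that are serving active threads.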

---

## Claiming a Warm Pod

`claim_container(thread_key, harness, ...)` is called before any cold spawn is attempted:

1. **Harness match**: if `harness != POOL_HARNESS`, returns `None` immediately.
2. **Backend gate**: if the backend does not support warm pools, returns `None`.
3. **Kubernetes + persona/repo**: warm pods are generic, and on the Kubernetes backend persona or repo injection requires a cold spawn, so `claim_container` returns `None` in that case. (Non-Kubernetes backends handle injection via exec after the claim; see step 8.)
4. **Pop**: the first entry in `_pool` is popped under `_pool_lock`. If the pool is empty, `None` is returned and the caller cold-spawns.
5. **Liveness check**: `backend.status_by_id()` is called; if the pod is not `"running"`, it is stopped and `None` is returned.
6. **Token refresh**: a fresh sandbox API token is minted for the thread and written into the pod via `exec_run`. A failure here is logged as a warning and does not abort the claim.
7. **Trace injection**: if a `trace_id` is provided, it is written to `/home/agent/.trace_id`.
8. **Persona/repo injection**: on non-Kubernetes backends, if a persona or repo was requested, `_inject_persona()` clones the repo, copies skills, and assembles the prompt inside the running pod via `exec_run`.
9. **Return**: a `SandboxSession` is constructed with the warm pod's `sandbox_id` bound to the new `thread_key`.

```python
# warm_pool.py:301-385 (condensed)
session = SandboxSession(
    sandbox_id=warm.sandbox_id,
    thread_key=thread_key,
    harness=harness,
    engine=warm.engine,
    started_at=time.time(),
    trace_id=trace_id or "",
)
```

The pool replenish loop will notice the deficit on its next tick and spawn a replacement pod.

Sources: [services/api/api/warm_pool.py:301-385](services/api/api/warm_pool.py)
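
The early-exit gates and the pop-then-verify steps can be sketched as below. This covers only the first five steps and returns a bare sandbox ID rather than a `SandboxSession`. `FakeBackend`, its `is_kubernetes` flag, `stop_by_id`, and the function name `claim_sandbox_id` are assumptions made for illustration.

```python
import asyncio
from typing import Optional

POOL_HARNESS = "codex"


class FakeBackend:
    """Hypothetical stand-in; is_kubernetes is an illustrative flag."""

    supports_warm_pool = True
    is_kubernetes = True

    def __init__(self, statuses: dict) -> None:
        self.statuses = dict(statuses)
        self.stopped: list[str] = []

    async def status_by_id(self, sandbox_id: str) -> str:
        return self.statuses.get(sandbox_id, "gone")

    async def stop_by_id(self, sandbox_id: str) -> None:
        self.stopped.append(sandbox_id)


async def claim_sandbox_id(pool, lock, backend, harness, persona=None, repo=None) -> Optional[str]:
    if harness != POOL_HARNESS:
        return None  # warm pods only exist for one harness
    if not backend.supports_warm_pool:
        return None
    if backend.is_kubernetes and (persona or repo):
        return None  # generic warm pods cannot carry a persona or repo
    async with lock:
        if not pool:
            return None
        sandbox_id = pool.pop(0)
    # The liveness check runs outside the lock so a slow backend call
    # does not block other claims.
    if await backend.status_by_id(sandbox_id) != "running":
        await backend.stop_by_id(sandbox_id)
        return None
    return sandbox_id
```

Note that a dead pod consumes the claim attempt: the function returns `None` rather than trying the next entry, which matches the behaviour described in the list above.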

---

## Sandbox Session State Machine

Once a pod is claimed (warm or cold), its lifecycle is tracked in the `sandbox_sessions` PostgreSQL table. The `state` column drives scheduling, reconciliation, and idle eviction decisions.

```mermaid
stateDiagram-v2
    [*] --> idle : fresh spawn, no in-flight turn
    [*] --> running : resumed with in-flight turn
    idle --> running : turn dispatched (_db_set_inflight_turn)
    running --> delivering : result delivery claimed (atomic UPDATE)
    running --> idle : SSE disconnect, no in-flight turn
    delivering --> idle : delivery complete
    idle --> suspended : idle TTL expired (reconcile_tick)
    running --> suspended : stale running TTL expired
    delivering --> suspended : stale running TTL expired
    idle --> gone : hard stop / pod deleted
    running --> gone : hard stop / pod deleted
    suspended --> [*]
    gone --> [*]
    idle --> released : runtime assignment GC
```

### State Descriptions

| State | Meaning |
|---|---|
| `idle` | Pod is alive and attached; no turn is active. The thread accepts new turns. |
| `running` | A turn is in-flight (`inflight_turn_id` set). Pod is processing. |
| `delivering` | The turn result has been claimed for delivery to a client via an atomic `UPDATE`; delivery is in progress. |
| `suspended` | Pod has been stopped; session row retained for context continuity. |
| `released` | Runtime assignment (control-plane record) was GC'd; pod no longer active. |
| `gone` | Pod deleted; row may be retained briefly for diagnostics. |
| `error` | Turn ended abnormally; may be retried. |
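
For reference, the edges drawn in the state diagram can be encoded as a lookup table. This is an illustrative sketch only: the source is not known to contain such a table, `ALLOWED_TRANSITIONS` and `can_transition` are hypothetical names, and the `error` state is left out because the diagram shows no edges for it.

```python
# Edges copied from the state diagram; terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    "idle": {"running", "suspended", "gone", "released"},
    "running": {"delivering", "idle", "suspended", "gone"},
    "delivering": {"idle", "suspended"},
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram contains an edge current -> target."""
    return target in ALLOWED_TRANSITIONS.get(current, set())
```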

### Transitions in Code

- **→ running**: `_db_set_inflight_turn()` sets `state = 'running'` atomically with the turn payload. Also set when SSE connects via `_db_update_state(thread_key, "running")`.
- **→ idle**: On SSE disconnect, if no in-flight turn remains, `_db_update_state(thread_key, "idle")` is called.
- **→ delivering**: Added by migration 003; claimed by an atomic `UPDATE ... SET state = 'delivering'` to prevent duplicate delivery.
- **→ released**: The `_release_stale_runtime_assignments()` GC marks `agent_runtime_assignments.state = 'released'` for assignments whose backend pod is no longer `running` or `created`.
- **→ suspended/gone**: `reconcile_tick()` (runs every 60 s) enforces `IDLE_TTL_S` (default 24 h) on idle rows and `inactive_running_ttl` on stuck running/delivering rows.

The constant `_REUSABLE_DB_STATES = {"running", "idle", "delivering", "error", "suspended"}` controls which sessions are considered eligible for reuse when a thread reconnects.

Sources: [services/api/api/agent.py:89](services/api/api/agent.py), [services/api/api/agent.py:238](services/api/api/agent.py), [services/api/api/agent.py:279](services/api/api/agent.py), [services/api/api/agent.py:1042](services/api/api/agent.py), [services/api/api/agent.py:1100-1101](services/api/api/agent.py), [services/api/api/agent.py:1492-1497](services/api/api/agent.py), [services/api/db/migrations/003_add_delivering_state.sql:3-6](services/api/db/migrations/003_add_delivering_state.sql)
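
The duplicate-delivery guard relies on a conditional `UPDATE` acting as a compare-and-set: only one caller can move a row out of `running`. The sketch below demonstrates the idea with SQLite from the standard library standing in for PostgreSQL. The two-column schema, the `WHERE` clause, and the function name `claim_delivery` are assumptions, not the statement used in `agent.py`.

```python
import sqlite3


def claim_delivery(conn: sqlite3.Connection, thread_key: str) -> bool:
    """Try to move a session from 'running' to 'delivering'; True if we won."""
    cur = conn.execute(
        "UPDATE sandbox_sessions SET state = 'delivering' "
        "WHERE thread_key = ? AND state = 'running'",
        (thread_key,),
    )
    conn.commit()
    # Exactly one row changes for the winner; later callers match zero rows.
    return cur.rowcount == 1


# Simplified, assumed schema for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sandbox_sessions (thread_key TEXT PRIMARY KEY, state TEXT NOT NULL)")
conn.execute("INSERT INTO sandbox_sessions VALUES ('thread-1', 'running')")

first = claim_delivery(conn, "thread-1")
second = claim_delivery(conn, "thread-1")
```

The first call wins the claim and the second finds no row still in `running`, so only one caller proceeds to deliver the result.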

---

## Warm vs Cold Spawn: Decision Flow

```text
New thread request for (thread_key, harness)
          │
          ▼
claim_container(thread_key, harness)
          │
    ┌─────┴──────┐
    │  harness   │ ──mismatch──► None
    │  matches?  │
    └─────┬──────┘
          │ match
          ▼
    pool non-empty?
          │ yes                   no
          ▼                       ▼
    status_by_id()          ─────────────────────
    running? ─no─► discard   Cold spawn path:
          │ yes              _evict_idle_for_capacity()
          ▼                  backend.create(thread_key, ...)
    token refresh             (15 s typical)
    trace inject
    persona inject (if any)
          │
          ▼
    SandboxSession returned
```

If `claim_container` returns `None`, the caller falls through to `backend.create()`, which goes through the full pod creation sequence in `KubernetesExecutorBackend.create()`.

Sources: [services/api/api/agent.py:681-750](services/api/api/agent.py), [services/api/api/warm_pool.py:301-385](services/api/api/warm_pool.py)
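
The fallback from warm claim to cold spawn can be sketched as a small wrapper. The names `get_sandbox` and `FakeBackend`, the `claim` callable, and the simplified `create` signature are assumptions; the capacity eviction step (`_evict_idle_for_capacity()`) shown in the flow is omitted.

```python
import asyncio


class FakeBackend:
    """Hypothetical stand-in for the cold-spawn path."""

    def __init__(self) -> None:
        self.created: list[str] = []

    async def create(self, thread_key: str, harness: str) -> str:
        self.created.append(thread_key)
        return f"cold-{thread_key}"


async def get_sandbox(thread_key: str, harness: str, claim, backend: FakeBackend):
    """Try the warm pool first, then fall back to a cold spawn."""
    sandbox_id = await claim(thread_key, harness)
    if sandbox_id is not None:
        return sandbox_id, "warm"
    # Any None from the claim (harness mismatch, empty pool, dead pod)
    # takes the same fallback path.
    return await backend.create(thread_key, harness), "cold"
```

Treating every `None` identically keeps the caller simple: it never needs to know why the warm pool declined.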

---

## Failure Modes

| Scenario | Behavior |
|---|---|
| Warm pod dies between replenish and claim | `status_by_id()` returns non-`running`; pod is stopped, `None` returned; caller falls back to cold spawn |
| Token refresh fails on claim | Warning logged; claim continues — stale token will expire on next use |
| `replenish()` spawn fails | Exception swallowed, `warm_container_spawn_failed` logged; next replenish tick retries |
| Pod older than 1 hour | Evicted by `replenish()` health check; replaced by a fresh pod |
| Deploy with `POOL_EVICT_ON_STARTUP=false` | Old pods are adopted — may run stale image if image was bumped. Default is enabled to prevent this. |
| Kubernetes backend + persona or repo requested | `claim_container` returns `None` by design; always cold-spawns to get the correct persona bundle baked in at pod creation |

Sources: [services/api/api/warm_pool.py:118-161](services/api/api/warm_pool.py), [services/api/api/warm_pool.py:311-345](services/api/api/warm_pool.py)

---

## Summary

The warm pool is a process-local, in-memory list of pre-booted Kubernetes pods, maintained at `POOL_SIZE` (default 5) by a background asyncio task that runs every 5 seconds by default. Claiming a warm pod skips the roughly 15-second cold-start path: the claim injects a fresh token in place, and on non-Kubernetes backends it can also inject a persona or repo (on Kubernetes, persona or repo requests always cold-spawn). With `POOL_EVICT_ON_STARTUP` enabled, which is the default, all unassigned pre-existing warm pods are stopped at API startup, so warm claims after a deploy run the just-deployed code. Once claimed, a sandbox session cycles through `idle → running → delivering → idle` in PostgreSQL, with `reconcile_tick()` enforcing TTLs on idle and stuck sessions every 60 seconds.

Sources: [services/api/api/warm_pool.py:484-513](services/api/api/warm_pool.py)
