# Sandbox lifecycle

> SQLite-backed status machine (creating, running, stopped, error), container naming (s-{ulid}), reconcile-on-boot, and destroy vs purge semantics.

- Repository: tastyeffectco/sandboxes
- GitHub: https://github.com/tastyeffectco/sandboxes
- Human docs: https://grok-wiki.com/public/docs/tastyeffectco-sandboxes-f551c1a2e9a0
- Complete Markdown: https://grok-wiki.com/public/docs/tastyeffectco-sandboxes-f551c1a2e9a0/llms-full.txt

## Source Files

- `control-plane/migrations/0001_init.sql`
- `control-plane/internal/store/writer.go`
- `control-plane/internal/reconcile/reconcile.go`
- `control-plane/internal/api/handlers.go`
- `control-plane/internal/docker/docker.go`
- `ARCHITECTURE.md`

---

Every sandbox is a row in SQLite (`sandbox` table) keyed by a ULID. `sandboxd` shells out to Docker for a sibling container named `s-{id}`; Traefik preview hostnames use the same id (`s-{id}-{port}.preview.{domain}`). The database is the source of truth: the boot reconciler converges Docker to SQLite, not the reverse.

## Status machine

Four string statuses are stored on each row. There is no separate enum type in code — callers and SQL use these literals exactly.

| Status | Meaning | Typical `container_id` | Workspace on disk |
|--------|---------|------------------------|-------------------|
| `creating` | Row inserted; loopback provision and `docker run` in progress | `NULL` | Directory created or being seeded |
| `running` | Container up; row records short Docker id and cgroup path | Set (12-char prefix) | Bind-mounted into container |
| `stopped` | Container removed or `docker stop` succeeded; row kept for wake / id-reuse | Often preserved for audit | **Retained** under `workspaces/<id>/` |
| `error` | Create or reconcile failure; `error_message` set | May be partial | Left for operator review; reconciler does not auto-heal |

```mermaid
stateDiagram-v2
    [*] --> creating: POST /sandbox\nStore.Create
    creating --> running: MarkRunning\n(container up)
    creating --> error: abort()\nMarkError
    running --> stopped: idle/pressure reaper\nor POST /v1/.../stop
    stopped --> running: wake\nMarkRunningWoke
    running --> error: egress reconcile failure
    creating --> error: reconcile\nstale > 5 min
    stopped --> stopped: reconcile\ncontainer missing
    error --> [*]: operator fixes\nor purge
    stopped --> [*]: DELETE /sandbox\n(row gone, disk kept)
    [*] --> [*]: POST .../purge\n(rows + disk gone)
```

### Transitions by component

| From | To | Writer | Trigger |
|------|-----|--------|---------|
| — | `creating` | `Store.Create` | `POST /sandbox` after validation |
| `creating` | `running` | `MarkRunning` | Successful `docker run` + optional egress install |
| `creating` | `error` | `MarkError` via `abort()` | Any failure after row insert (provision, run, inspect, egress) |
| `running` | `stopped` | `MarkStoppedAt` | Idle reaper, pressure reaper, or `POST /v1/sandboxes/{id}/stop` |
| `stopped` | `running` | `MarkRunningWoke` | Wake handler after `docker start` |
| `running` / `creating` / `stopped` | `stopped` | `MarkStopped` | Reconciler: container missing or not running |
| `creating` | `error` | `MarkError` | Reconciler: row in `creating` older than 5 minutes |
| `running` | `error` | `MarkError` | Reconciler: egress repopulation failed (when egress enabled) |

`MarkRunning` and `MarkRunningWoke` clear `error_message`. `MarkStoppedAt` additionally records `stopped_at` (Unix seconds) for audit and is the writer used by the reapers and the explicit stop route; the reconciler's `MarkStopped` does not set `stopped_at`.
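
The transition table above can be restated as a small lookup. This is a minimal sketch for illustration only: `allowed` and `canTransition` are hypothetical names, and nothing here implies that the real store writers consult such a table.

```go
package main

import "fmt"

// Status literals as stored on the sandbox row.
const (
	StatusCreating = "creating"
	StatusRunning  = "running"
	StatusStopped  = "stopped"
	StatusError    = "error"
)

// allowed lists the transitions from the table, keyed by the "from" status.
// Hypothetical helper: it only mirrors the documented rows.
var allowed = map[string][]string{
	StatusCreating: {StatusRunning, StatusError, StatusStopped},
	StatusRunning:  {StatusStopped, StatusError},
	StatusStopped:  {StatusRunning, StatusStopped},
	StatusError:    {}, // no writer moves a row out of error
}

// canTransition reports whether the table documents a from -> to transition.
func canTransition(from, to string) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(StatusStopped, StatusRunning)) // wake: true
	fmt.Println(canTransition(StatusError, StatusRunning))   // not documented: false
}
```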

<Note>
Wake refuses `creating` and `error` rows. For `running`, it returns success so the client can refresh onto the live Traefik route. Only `stopped` rows proceed through admission and `docker start`.
</Note>
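
The wake rule in the note can be sketched as a switch on the stored status. The names `decideWake`, `wakeAction`, and `errNotWakeable` are assumptions made for this example and are not taken from the handler code.

```go
package main

import (
	"errors"
	"fmt"
)

type wakeAction int

const (
	wakeRefuse         wakeAction = iota // creating or error: do nothing
	wakeAlreadyRunning                   // running: report success, no docker call
	wakeStart                            // stopped: continue to admission and docker start
)

// errNotWakeable is a hypothetical error value for refused rows.
var errNotWakeable = errors.New("sandbox is not wakeable in its current status")

// decideWake maps a row status to the wake handler's next step.
func decideWake(status string) (wakeAction, error) {
	switch status {
	case "running":
		return wakeAlreadyRunning, nil
	case "stopped":
		return wakeStart, nil
	default: // "creating", "error", or anything unexpected
		return wakeRefuse, errNotWakeable
	}
}

func main() {
	for _, s := range []string{"creating", "running", "stopped", "error"} {
		action, err := decideWake(s)
		fmt.Println(s, action, err)
	}
}
```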

## Identity and container naming

<ParamField body="id" type="ULID" required>
Primary key on `sandbox.id`. Omit on create to auto-generate; if supplied, must parse as a ULID (`400` with `id must be a ULID` otherwise).
</ParamField>

Docker containers use a fixed prefix:

```text
Container --name / --hostname:  s-{ulid}
Example:                        s-01ARZ3NDEKTSV4RRFFQ69G5FAV
Preview Host (port 3000):       s-01ARZ3NDEKTSV4RRFFQ69G5FAV-3000.preview.localhost
```

`internal/docker.RunSpec.Name` and `Hostname` are both set to `s-` + id on create. All lifecycle paths (`inspect`, `exec`, `stop`, `rm`, wake `start`) use that name.
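
As a rough illustration of the naming scheme, the sketch below derives the container name and preview host from an id. The helper names are hypothetical, and the ULID check is a plain pattern match on the canonical uppercase 26-character form; whether the service also accepts lowercase input is not stated here.

```go
package main

import (
	"fmt"
	"regexp"
)

// ulidPattern matches the canonical 26-character Crockford base32 form.
// The first character is limited to 0-7 so the value fits in 128 bits.
var ulidPattern = regexp.MustCompile(`^[0-7][0-9A-HJKMNP-TV-Z]{25}$`)

// isULID reports whether s looks like a canonical uppercase ULID.
func isULID(s string) bool { return ulidPattern.MatchString(s) }

// containerName returns the Docker --name / --hostname for a sandbox id.
func containerName(id string) string { return "s-" + id }

// previewHost returns the Traefik preview hostname for one exposed port.
func previewHost(id string, port int, domain string) string {
	return fmt.Sprintf("s-%s-%d.preview.%s", id, port, domain)
}

func main() {
	id := "01ARZ3NDEKTSV4RRFFQ69G5FAV"
	fmt.Println(isULID(id))
	fmt.Println(containerName(id))
	fmt.Println(previewHost(id, 3000, "localhost"))
}
```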

The row also stores `workspace_img` and `workspace_mnt` paths (historical loopback naming). In the OSS directory layout, both resolve to `SANDBOXED_DATA_DIR/workspaces/<id>/`.

## Create path

<Steps>
<Step title="Insert row">
`POST /sandbox` inserts `status='creating'` with `container_id` and `cgroup_path` NULL, plus `sandbox_port` rows for each exposed port.
</Step>
<Step title="Provision workspace">
`Loopback.Provision` (or `ProvisionFromTemplate`) creates `workspaces/<id>/` and seeds from the base image skeleton when the directory is empty. Idempotent for id-reuse.
</Step>
<Step title="Run container">
`docker run -d` with hardened flags, Traefik labels, env injection, and bind mount `workspaces/<id>:/home/sandbox`.
</Step>
<Step title="Mark running">
`MarkRunning` records container id and cgroup path; optional egress adds bridge IP to nftables and `container_ip` on the row. `BumpLastActive` seeds activity for the idle reaper.
</Step>
</Steps>

On any failure after the row exists, `abort()` runs: best-effort `docker rm s-{id}`, `Loopback.Release` (no-op for directory storage), and `MarkError` with a diagnostic message.
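
The create steps and the `abort()` cleanup can be sketched with function-valued stand-ins. Everything in `backend` is hypothetical (the real store, workspace, and Docker signatures may differ), and `Loopback.Release` is omitted because the text describes it as a no-op for directory storage.

```go
package main

import (
	"errors"
	"fmt"
)

// backend bundles hypothetical stand-ins for the store, workspace, and Docker calls.
type backend struct {
	insertRow       func(id string) error
	provision       func(id string) error
	runContainer    func(name string) (containerID string, err error)
	markRunning     func(id, containerID string) error
	removeContainer func(name string) error
	markError       func(id, msg string) error
}

// create walks the documented steps and runs abort on any failure after the insert.
func create(b backend, id string) error {
	if err := b.insertRow(id); err != nil {
		return err // no row exists yet, so there is nothing to abort
	}
	name := "s-" + id
	abort := func(cause error) error {
		_ = b.removeContainer(name)        // best effort
		_ = b.markError(id, cause.Error()) // row ends in status "error"
		return cause
	}
	if err := b.provision(id); err != nil {
		return abort(fmt.Errorf("provision: %w", err))
	}
	cid, err := b.runContainer(name)
	if err != nil {
		return abort(fmt.Errorf("docker run: %w", err))
	}
	if err := b.markRunning(id, cid); err != nil {
		return abort(fmt.Errorf("mark running: %w", err))
	}
	return nil
}

// fakeBackend records each call in order; failRun makes the run step fail.
func fakeBackend(failRun bool) (backend, *[]string) {
	calls := &[]string{}
	note := func(s string) { *calls = append(*calls, s) }
	return backend{
		insertRow: func(string) error { note("insert"); return nil },
		provision: func(string) error { note("provision"); return nil },
		runContainer: func(string) (string, error) {
			note("run")
			if failRun {
				return "", errors.New("image missing")
			}
			return "0123456789ab", nil
		},
		markRunning:     func(string, string) error { note("markRunning"); return nil },
		removeContainer: func(string) error { note("rm"); return nil },
		markError:       func(string, string) error { note("markError"); return nil },
	}, calls
}

func main() {
	b, calls := fakeBackend(true)
	err := create(b, "01ARZ3NDEKTSV4RRFFQ69G5FAV")
	fmt.Println(err, *calls)
}
```

With `failRun` set, the recorded calls end in the container removal and the error write, which is the cleanup order described above.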

## Reconcile on boot

`reconcile.Once` runs **once** at process startup, before the HTTP listener accepts traffic. It follows a single rule: **SQLite is truth; Docker is converged to match.**

```text
sandboxd main
    │
    ├─ store.Open + migrations
    ├─ BackfillRunningActivity (legacy last_active_at)
    ├─ reconcile.Once  ◄── blocks listener until complete
    └─ HTTP server
```

For each row in `running`, `creating`, or `stopped`:

1. **Loopback** — If workspace data exists and status ≠ `error`, `Provision` re-establishes storage (directory storage: ensure mount path exists; idempotent).
2. **Inspect `s-{id}`** — Missing container → `MarkStopped`, clear `container_ip`. Present but not running → `MarkStopped`. Running → `MarkRunning` + re-apply `memory.high` when enabled + refresh egress IP when configured.
3. **Stale `creating`** — Older than 5 minutes → `MarkError` with `interrupted while creating; reconciler timeout`; mount left for manual review.
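
The per-row decision in steps 2 and 3 can be sketched as a pure function. This is an illustration under assumptions: `reconcileAction` and `containerState` are hypothetical names, and the sketch checks the container state before the stale-`creating` cutoff, an ordering the steps above do not spell out.

```go
package main

import (
	"fmt"
	"time"
)

type containerState int

const (
	containerMissing containerState = iota
	containerNotRunning
	containerRunning
)

// staleCreating is the 5-minute cutoff named in the text.
const staleCreating = 5 * time.Minute

// reconcileAction returns the store writer the reconciler would call for one
// row, or "skip" for rows it does not visit. age is the time since the row
// was created.
func reconcileAction(status string, state containerState, age time.Duration) string {
	switch status {
	case "running", "creating", "stopped":
	default:
		return "skip" // error rows are not auto-healed
	}
	if state == containerRunning {
		return "MarkRunning"
	}
	if status == "creating" && age > staleCreating {
		return "MarkError" // interrupted while creating
	}
	return "MarkStopped" // container missing or present but not running
}

func main() {
	fmt.Println(reconcileAction("running", containerMissing, time.Hour))
	fmt.Println(reconcileAction("creating", containerMissing, 6*time.Minute))
	fmt.Println(reconcileAction("stopped", containerRunning, time.Hour))
}
```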

Orphan detection **logs only** — it does not delete:

| Orphan type | Detection | Action |
|-------------|-----------|--------|
| Container `s-*` | No matching sandbox row | Warn + metric |
| Mount under workspaces | No matching row (legacy loopback builds) | Warn; OSS reports zero mounts |
| `workspace_owner` without workspace dir | `.img`/directory missing | Warn; manual disposition |

<Warning>
The reconciler never auto-recreates a missing container for a `stopped` row and never adopts orphan containers into SQLite. Wake or an explicit create path must bring compute back.
</Warning>
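
Orphan container detection amounts to a set difference between `s-*` container names and sandbox ids. The sketch below assumes the names and ids have already been fetched; `orphanContainers` is a hypothetical helper, and it only reports orphans, in line with the log-only behavior.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const namePrefix = "s-"

// orphanContainers returns the sandbox-style container names that have no
// matching row id, sorted for stable log output.
func orphanContainers(containerNames []string, rowIDs map[string]bool) []string {
	var orphans []string
	for _, name := range containerNames {
		if !strings.HasPrefix(name, namePrefix) {
			continue // not a sandbox container
		}
		if id := strings.TrimPrefix(name, namePrefix); !rowIDs[id] {
			orphans = append(orphans, name)
		}
	}
	sort.Strings(orphans)
	return orphans
}

func main() {
	names := []string{"traefik", "s-01ARZ3NDEKTSV4RRFFQ69G5FAV", "s-01BX5ZZKBKACTAV9WEVGEMMVRZ"}
	rows := map[string]bool{"01ARZ3NDEKTSV4RRFFQ69G5FAV": true}
	for _, name := range orphanContainers(names, rows) {
		fmt.Println("warn: orphan container", name) // log only, never delete
	}
}
```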

## Stop vs destroy vs purge

Stopping frees RAM but keeps metadata and disk. Destroy removes the container and DB row but keeps the workspace for id-reuse. Purge is irreversible end-to-end deletion.

| Operation | HTTP | Container | SQLite row | `workspace_owner` | Workspace dir | Snapshots dir |
|-----------|------|-----------|------------|-------------------|---------------|---------------|
| **Stop** | `POST /v1/sandboxes/{id}/stop` | `docker stop` (10s) | `running` → `stopped`, sets `stopped_at` | Unchanged | Kept | Kept |
| **Destroy (soft)** | `DELETE /sandbox/{id}` | `docker rm` | Row deleted (`Store.Delete`) | **Survives** | **Kept** | Kept |
| **Purge (hard)** | `POST /sandbox/{id}/purge` | stop + `docker rm` if present | `PurgeSandbox` (row + owner) | Deleted | `RemoveAll` workspace | `RemoveAll` `_snapshots/<id>/` if configured |

### Destroy — `DELETE /sandbox/{id}`

Audit action: `sandbox.destroy`. Response: `204 No Content`.

When egress is enabled, order matters: remove the sandbox's nftables source entry **before** `docker rm`. `Loopback.Release` (a no-op for directory workspaces) and `Store.Delete` follow, and the `sandbox_port` rows are removed by foreign-key cascade.

<Info>
**Id-reuse:** A workspace directory can exist without a row (for example after destroy). `POST /sandbox` with the same `id` is allowed if no row exists; Phase 8 checks `workspace_owner` so `external.user_id` matches before reattaching.
</Info>
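
The id-reuse rule can be sketched as a pre-create check. `checkReuse` and both error values are hypothetical, and the sketch assumes an empty owner string means no `workspace_owner` binding is recorded for the id.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errRowExists     = errors.New("sandbox row already exists")
	errOwnerMismatch = errors.New("workspace is owned by a different external user")
)

// checkReuse decides whether a create with a caller-supplied id may proceed.
// owner is the recorded workspace_owner for the id ("" when none exists);
// requester is the external user id on the create request.
func checkReuse(rowExists bool, owner, requester string) error {
	if rowExists {
		return errRowExists // the prior row must be destroyed or purged first
	}
	if owner != "" && owner != requester {
		return errOwnerMismatch // do not reattach another user's workspace
	}
	return nil
}

func main() {
	fmt.Println(checkReuse(false, "user-1", "user-1")) // reattach allowed
	fmt.Println(checkReuse(false, "user-1", "user-2"))
	fmt.Println(checkReuse(true, "user-1", "user-1"))
}
```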

### Purge — `POST /sandbox/{id}/purge`

Shared implementation: `purgeOne` (also used by `POST /external-users/{id}/purge` and `POST /external-projects/{id}/purge`).

<ResponseExample>
```json
{
  "purged": true,
  "freed_bytes": 12345678
}
```
</ResponseExample>

`purgeOne` holds the per-id lock for the whole operation. Scope purges (external user or external project) return `500` on the first failure; sandboxes purged before the failure stay purged, so the caller retries to finish the rest.
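
A scope purge that stops at the first failure can be sketched as a loop over ids, taking a per-id lock around each purge. The names `idLocks` and `purgeScope` and the callback shape are assumptions for illustration; the real `purgeOne` signature is not shown on this page.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// idLocks hands out one mutex per sandbox id (hypothetical helper).
type idLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// lock acquires the mutex for id and returns its unlock function.
func (l *idLocks) lock(id string) func() {
	l.mu.Lock()
	if l.locks == nil {
		l.locks = make(map[string]*sync.Mutex)
	}
	m, ok := l.locks[id]
	if !ok {
		m = &sync.Mutex{}
		l.locks[id] = m
	}
	l.mu.Unlock()
	m.Lock()
	return m.Unlock
}

// errBoom is a sample failure used in the demo below.
var errBoom = errors.New("boom")

// purgeScope purges ids in order and stops at the first failure. Ids purged
// before the failure stay purged; the caller is expected to retry.
func purgeScope(l *idLocks, ids []string, purgeOne func(id string) (int64, error)) (int64, error) {
	var freed int64
	for _, id := range ids {
		unlock := l.lock(id)
		n, err := purgeOne(id)
		unlock()
		if err != nil {
			return freed, fmt.Errorf("purge %s: %w", id, err)
		}
		freed += n
	}
	return freed, nil
}

func main() {
	var locks idLocks
	freed, err := purgeScope(&locks, []string{"a", "b", "c"}, func(id string) (int64, error) {
		if id == "b" {
			return 0, errBoom
		}
		return 100, nil
	})
	fmt.Println(freed, err)
}
```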

### v1 `DELETE /v1/sandboxes/{id}`

Public destroy delegates to **purge**, not soft delete:

```text
DELETE /v1/sandboxes/{id}  →  POST /sandbox/{id}/purge  →  204
```

In v1, “delete project sandbox” means removing the disk and the ownership binding, not preserving the workspace for reuse.
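
The delegation can be sketched as an HTTP handler that calls a purge function and replies `204`. This is a sketch under assumptions: `v1Delete` and the `purge` callback are hypothetical, the path is parsed by simple prefix trimming, and mapping a purge error to `500` is assumed rather than taken from the handler code.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// errPurge is a sample purge failure used in the demo below.
var errPurge = errors.New("purge failed")

// v1Delete delegates DELETE /v1/sandboxes/{id} to a hard purge.
func v1Delete(purge func(id string) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodDelete {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		id := strings.TrimPrefix(r.URL.Path, "/v1/sandboxes/")
		if err := purge(id); err != nil {
			// Assumed mapping: any purge failure becomes a 500.
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusNoContent) // no body on success
	}
}

// deleteStatus runs one DELETE through the handler and returns the status code.
func deleteStatus(purgeErr error) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodDelete, "/v1/sandboxes/01ARZ3NDEKTSV4RRFFQ69G5FAV", nil)
	v1Delete(func(string) error { return purgeErr }).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(deleteStatus(nil))      // 204
	fmt.Println(deleteStatus(errPurge)) // 500
}
```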

## Persistence model

```text
SANDBOXED_DATA_DIR/
├── state/sandboxd.db     ← SQLite (WAL), single-writer goroutine
└── workspaces/<ulid>/    ← bind mount; survives stop & soft destroy
```

| Column | Role while `creating` | While `running` | While `stopped` / `error` |
|--------|----------------------|-----------------|---------------------------|
| `container_id` | NULL | Short Docker id | Often retained |
| `cgroup_path` | NULL | Relative cgroup path | Often retained |
| `container_ip` | — | Bridge IP (egress builds) | Cleared on stop/reconcile |
| `error_message` | NULL until failure | NULL | Set on `error` |
| `last_active_at` | Bumped at create | Bumped by traffic/exec/wake | Used by idle logic |
| `stopped_at` | NULL | NULL | Set by explicit stop / reapers |
| `keepalive_until` | — | Idle reaper skips while > now | — |

`workspace_owner` (Phase 8) maps `sandbox_id` → upstream `external_user_id` and survives soft destroy so reattach and snapshot authorization stay consistent; only `PurgeSandbox` deletes it.

## API surface (lifecycle)

| Method | Path | Effect on lifecycle |
|--------|------|---------------------|
| `POST` | `/sandbox` | Create row → `creating` → `running` |
| `GET` | `/sandbox/{id}`, `/sandboxes` | Read row + optional live `docker inspect` |
| `DELETE` | `/sandbox/{id}` | Soft destroy (workspace kept) |
| `POST` | `/sandbox/{id}/purge` | Hard purge |
| `POST` | `/v1/sandboxes/{id}/stop` | `running` → `stopped` (409 if task active) |
| `DELETE` | `/v1/sandboxes/{id}` | Hard purge via delegation |
| `POST` | `/sandbox/{id}/keepalive` | Extends `keepalive_until` (idle exemption) |

A preview wake leaves the row status unchanged until `docker start` succeeds; see the wake and idle docs for admission and reaper interaction.
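
The keepalive exemption can be sketched as a check the idle reaper might run per `running` row. `idleReapable` is a hypothetical name, the idle timeout is an assumed parameter, and all timestamps are taken to be Unix seconds (this page states that only for `stopped_at`).

```go
package main

import "fmt"

// idleReapable reports whether a running sandbox may be stopped for idleness.
// All arguments are Unix seconds except idleTimeout, which is a duration in seconds.
func idleReapable(now, lastActiveAt, keepaliveUntil, idleTimeout int64) bool {
	if keepaliveUntil > now {
		return false // keepalive window still open: the reaper skips this row
	}
	return now-lastActiveAt >= idleTimeout
}

func main() {
	const now, timeout = 10_000, 900
	fmt.Println(idleReapable(now, now-1000, 0, timeout))       // idle past timeout
	fmt.Println(idleReapable(now, now-1000, now+300, timeout)) // protected by keepalive
	fmt.Println(idleReapable(now, now-60, 0, timeout))         // recently active
}
```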

## Operator signals

| Symptom | Likely row state | What to check |
|---------|------------------|---------------|
| Preview shows “warming” forever | `stopped` waking, or nothing listening on port | `GET /sandbox/{id}`, container logs |
| `id must be a ULID` | — | Pass valid ULID or omit `id` |
| Create stuck | `creating` > 5 min → reconciler sets `error` | `error_message`, workspace dir, `docker ps -a --filter name=s-` |
| Row exists, create 409 | Prior row not destroyed | `DELETE /sandbox/{id}` or purge |
| Orphan `s-*` container | No SQLite row | Manual `docker rm`; reconciler will not remove it |

Startup logs include a reconcile summary with these fields: `rows`, `reapplied`, `stopped`, `errored`, `orphan_containers`, `orphan_mounts`.

## Related pages

<CardGroup>
<Card title="Workspaces and isolation" href="/workspaces-persistence">
Per-sandbox directories, seeding, bind mounts, and what survives stop vs purge.
</Card>
<Card title="Wake, idle, and pressure" href="/wake-idle-reapers">
How sandboxes move between `running` and `stopped` without deleting rows or disk.
</Card>
<Card title="Manage sandboxes" href="/sandbox-operations">
Operational workflows: create, exec, keepalive, stop, destroy, purge, and claim.
</Card>
<Card title="Control plane API (legacy)" href="/legacy-api-reference">
`/sandbox*` routes including destroy, purge, and health endpoints.
</Card>
<Card title="v1 API reference" href="/v1-api-reference">
`POST /v1/sandboxes/{id}/stop` and `DELETE /v1/sandboxes/{id}` (purge semantics).
</Card>
<Card title="Uninstall and maintenance" href="/uninstall-maintenance">
What uninstall scripts remove vs retained workspaces and SQLite state.
</Card>
</CardGroup>
