# Wake, idle, and pressure

> Stop-on-idle (SANDBOXD_IDLE_THRESHOLD_SECONDS), wake-on-preview (catch-all → sandboxd), memory admission/refusal, pressure reaper, keepalive, and warming-page behavior.

- Repository: tastyeffectco/sandboxes
- GitHub: https://github.com/tastyeffectco/sandboxes
- Human docs: https://grok-wiki.com/public/docs/tastyeffectco-sandboxes-f551c1a2e9a0
- Complete Markdown: https://grok-wiki.com/public/docs/tastyeffectco-sandboxes-f551c1a2e9a0/llms-full.txt

## Source Files

- `control-plane/internal/wake/handler.go`
- `control-plane/internal/reaper/idle.go`
- `control-plane/internal/reaper/pressure.go`
- `control-plane/cmd/sandboxd/main.go`
- `traefik/dynamic/wake.yml`
- `ARCHITECTURE.md`

---

`sandboxd` frees host RAM by stopping idle sandboxes and reclaiming memory under pressure, then restarts containers on the next preview hit or programmatic wake. Traefik’s priority-1 catch-all (`traefik/dynamic/wake.yml`) forwards preview traffic for stopped sandboxes to `sandboxd:9000`. There, `internal/wake` handles admission, `docker start`, the optional TCP readiness probe, and the HTML warming pages. `internal/reaper` runs the idle and pressure loops, both configured from `SANDBOXD_*` env vars in `control-plane/cmd/sandboxd/main.go`.

## Activity signals

Idle and pressure reapers only stop sandboxes whose SQLite row has `status='running'` and `last_active_at` older than the cutoff. Activity that postpones a stop comes from five signals:

| Signal | Mechanism | Effect on reapers |
|--------|-----------|-------------------|
| HTTP preview traffic | `activity.Tailer` bumps `last_active_at` from Traefik access logs | Rows stay above idle cutoff |
| WebSocket / SSE | `activity.Poller` (when `SANDBOXD_POLLER_METRIC_RE` is set) bumps `last_active_at` | Same; without poller, widen `SANDBOXD_WAKE_GRACE_SECONDS` |
| In-flight `exec` | `activity.InflightExec` per-id counter | Idle and pressure **skip** that id |
| Explicit keepalive | `keepalive_until` column via `POST /sandbox/{id}/keepalive` | Idle and pressure **skip** while `keepalive_until > now` |
| Running coding task | `SandboxHasRunningTask` | Idle reaper **skips** |

<Note>
The pressure reaper’s **&lt;5% emergency** branch stops the heaviest-RSS cgroup **even if** the sandbox has an in-flight exec or active task. Advisory bands (10–15% and 5–10%) only stop the oldest *idle-running* candidate and never kill active work to defend a soft threshold.
</Note>

## Stop-on-idle

The idle reaper (`internal/reaper/idle.go`) ticks every `SANDBOXD_IDLE_REAP_INTERVAL_SECONDS` (default **30s**). Each tick calls `Store.ListIdleCandidates` with cutoff `now - SANDBOXD_IDLE_THRESHOLD_SECONDS` (default **2100s / 35 min**), ordered by `last_active_at ASC`.

For each candidate it applies skip rules (in-flight exec, `keepalive_until`, running task), then `docker stop` with a 10s grace, optional egress IP cleanup, and `MarkStoppedAt`. The workspace bind mount is untouched; the next preview or API wake runs `docker start` again.
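The skip rules above can be sketched as a pure function. This is a minimal illustration, not the code in `internal/reaper/idle.go`: the `candidate` struct, the `skipReason` helper, and the reason strings are hypothetical, and times are plain unix seconds to keep the example self-contained.

```go
package main

import "fmt"

// candidate is a hypothetical, simplified view of a sandbox row (times in unix seconds).
type candidate struct {
	ID             string
	LastActiveAt   int64
	KeepaliveUntil int64 // 0 means no keepalive
	InflightExecs  int
	HasRunningTask bool
}

// skipReason returns why the idle reaper would leave c running,
// or "" when c is past the cutoff and nothing protects it.
func skipReason(c candidate, now, thresholdSeconds int64) string {
	cutoff := now - thresholdSeconds
	switch {
	case c.LastActiveAt > cutoff:
		return "recently_active"
	case c.InflightExecs > 0:
		return "inflight_exec"
	case c.KeepaliveUntil > now:
		return "keepalive"
	case c.HasRunningTask:
		return "running_task"
	default:
		return ""
	}
}

func main() {
	const now, threshold = int64(1_700_000_000), int64(2100)
	rows := []candidate{
		{ID: "A", LastActiveAt: now - 2400},
		{ID: "B", LastActiveAt: now - 2400, InflightExecs: 1},
		{ID: "C", LastActiveAt: now - 2400, KeepaliveUntil: now + 600},
		{ID: "D", LastActiveAt: now - 300},
	}
	for _, c := range rows {
		if r := skipReason(c, now, threshold); r != "" {
			fmt.Printf("%s: skip (%s)\n", c.ID, r)
			continue
		}
		fmt.Printf("%s: docker stop\n", c.ID)
	}
}
```

With the default 2100s threshold, only sandbox `A` is stopped here; `B` and `C` are old enough but protected, and `D` is not yet past the cutoff.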

Set `SANDBOXD_IDLE_REAP_INTERVAL_SECONDS` to **0** to disable the loop entirely.

```bash
# Postpone automatic idle stop for 2 hours ("until" is unix seconds).
# Shell arithmetic keeps this portable across GNU and BSD date.
curl -s -XPOST http://127.0.0.1:9090/sandbox/$ID/keepalive \
  -H 'content-type: application/json' \
  -d '{"until":'$(( $(date +%s) + 7200 ))'}'
```

<ParamField body="until" type="integer" required>
Unix timestamp (seconds). Capped to `now + SANDBOXD_KEEPALIVE_MAX_SECONDS` (default **86400** / 24h). Must be in the future.
</ParamField>
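The cap on `until` can be illustrated with a small helper. This is a sketch under assumptions: `clampKeepalive` and `errNotFuture` are hypothetical names, and whether the real handler rejects a past timestamp with an error (as done here) is not confirmed by the source.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFuture is a hypothetical error for a timestamp that is not in the future.
var errNotFuture = errors.New("until must be in the future")

// clampKeepalive validates a requested until (unix seconds) and caps it
// at now+maxSeconds, mirroring SANDBOXD_KEEPALIVE_MAX_SECONDS.
func clampKeepalive(until, now, maxSeconds int64) (int64, error) {
	if until <= now {
		return 0, errNotFuture
	}
	if limit := now + maxSeconds; until > limit {
		return limit, nil
	}
	return until, nil
}

func main() {
	now := int64(1_700_000_000)
	// Asking for 7 days is capped to the 24h default.
	got, err := clampKeepalive(now+7*86400, now, 86400)
	fmt.Println(got-now, err) // prints: 86400 <nil>
}
```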

## Wake-on-preview routing

Running sandboxes register Traefik Docker labels with **priority 100**. The file-provider catch-all uses **priority 1** so it only matches when no live per-sandbox router exists:

```text
Browser → s-{ULID}-{port}.preview.{PREVIEW_DOMAIN}
    │
    ├─ container running → Traefik priority-100 router → dev server :port
    │
    └─ container stopped → catch-all (wake.yml) → sandboxd:9000
                              hostDispatch → ServeCatchAll
```

`hostDispatch` in `main.go` inspects the `Host` header with the same regex as `wake.Handler` (`^s-([0-9A-Za-z]+)-([0-9]+)\.preview\.{domain}`). Because browsers send hostnames in lowercase, the captured id is uppercased to its canonical ULID form before the DB lookup.
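A minimal sketch of that host parsing follows. The helper names `newHostPattern` and `parsePreviewHost` are hypothetical, escaping the domain with `regexp.QuoteMeta` is an assumption about how `{domain}` is substituted, and `example.test` stands in for `PREVIEW_DOMAIN`.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// newHostPattern builds the preview-host matcher for one preview domain.
func newHostPattern(domain string) *regexp.Regexp {
	return regexp.MustCompile(`^s-([0-9A-Za-z]+)-([0-9]+)\.preview\.` + regexp.QuoteMeta(domain))
}

// parsePreviewHost extracts the sandbox id (uppercased) and the port from a Host header.
func parsePreviewHost(re *regexp.Regexp, host string) (string, int, bool) {
	m := re.FindStringSubmatch(host)
	if m == nil {
		return "", 0, false
	}
	port, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return strings.ToUpper(m[1]), port, true
}

func main() {
	re := newHostPattern("example.test")
	id, port, ok := parsePreviewHost(re, "s-01arz3ndektsv4rrffq69g5fav-3000.preview.example.test")
	fmt.Println(id, port, ok) // prints: 01ARZ3NDEKTSV4RRFFQ69G5FAV 3000 true
}
```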

```mermaid
sequenceDiagram
  participant Browser
  participant Traefik
  participant sandboxd
  participant Docker
  participant DevServer

  Browser->>Traefik: GET preview Host (stopped sandbox)
  Traefik->>sandboxd: catch-all priority 1
  sandboxd->>sandboxd: Admit (memory)
  sandboxd->>Docker: docker start s-{id}
  sandboxd->>DevServer: TCP probe :port (optional)
  sandboxd-->>Browser: 200 HTML meta-refresh (2s)
  Docker-->>Traefik: labels → priority-100 router
  Browser->>Traefik: refresh
  Traefik->>DevServer: proxy to container
```

## Wake handler behavior

`internal/wake/handler.go` serves two entry points:

| Entry | Path / trigger | Response |
|-------|----------------|----------|
| Catch-all (HTML) | Any method; preview `Host` via Traefik | Warming / busy / error HTML |
| Programmatic (JSON) | `POST /wake/{id}` on API mux | JSON `{"id","status","wake_duration_ms"}` or error |

**Status handling:**

| Row status | Behavior |
|------------|----------|
| `running` | Success immediately (refresh or JSON) |
| `stopped` | Admission → `docker start` → optional TCP probe → `MarkRunningWoke` |
| `creating` | Error `creating`; the container is never started while its row is still half-built |
| `error` | Error with `error_message` when set |
| not found | `not_found` |

**Concurrency:** A per-id in-flight map deduplicates concurrent wakes; `idlock.Registry` keeps a wake from racing a snapshot, restore, or destroy of the same sandbox.
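One common way to build a per-id in-flight map is the single-flight pattern sketched below. This illustrates the general technique only; `wakeGroup`, `call`, and `Do` are hypothetical names, and the real handler may differ (for example in how it handles a panicking start function, which this sketch does not cover).

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// call tracks one in-progress wake for a sandbox id.
type call struct {
	done chan struct{}
	err  error
}

// wakeGroup deduplicates concurrent wakes per sandbox id.
type wakeGroup struct {
	mu       sync.Mutex
	inflight map[string]*call
}

func newWakeGroup() *wakeGroup {
	return &wakeGroup{inflight: make(map[string]*call)}
}

// Do runs start at most once per id at a time. Callers that arrive while a
// wake is in progress wait for it and share its result (shared == true).
func (g *wakeGroup) Do(id string, start func() error) (shared bool, err error) {
	g.mu.Lock()
	if c, ok := g.inflight[id]; ok {
		g.mu.Unlock()
		<-c.done // wait for the leader's result
		return true, c.err
	}
	c := &call{done: make(chan struct{})}
	g.inflight[id] = c
	g.mu.Unlock()

	c.err = start()

	g.mu.Lock()
	delete(g.inflight, id)
	g.mu.Unlock()
	close(c.done) // publishes c.err to the waiters
	return false, c.err
}

func main() {
	g := newWakeGroup()
	var starts int32
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Do("01ARZ3NDEKTSV4RRFFQ69G5FAV", func() error {
				atomic.AddInt32(&starts, 1)
				time.Sleep(50 * time.Millisecond) // stand-in for docker start
				return nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("start calls:", atomic.LoadInt32(&starts))
}
```

Five simultaneous preview requests typically result in a single start call, because the later callers join the wake already in flight.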

**Private sandboxes (HTML only):** Stopped `visibility=private` sandboxes run the same preview-token check as `/forward-auth` before start. Service/operator actors skip the cookie gate (enables wake-on-task-submit). JSON `POST /wake/{id}` is not cookie-gated.

**TCP readiness:** The catch-all path probes `bridgeIP:port` for up to `SANDBOXD_WAKE_TCP_READY_TIMEOUT_SECONDS` (default **8s**). A timeout increments the `tcp_ready_timeout` metric but still returns the refresh page, because the next browser retry may hit the live route.
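A bounded TCP readiness probe of this kind can be sketched with the standard `net` package. The `waitTCP` helper and its retry interval are assumptions for illustration; the source only states the overall timeout.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitTCP dials addr repeatedly until a connection succeeds or timeout elapses.
func waitTCP(addr string, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			return false
		}
		conn, err := net.DialTimeout("tcp", addr, remaining)
		if err == nil {
			conn.Close()
			return true
		}
		time.Sleep(interval) // dev server not listening yet; retry shortly
	}
}

func main() {
	// A local listener stands in for the dev server inside the container.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	fmt.Println(waitTCP(ln.Addr().String(), 8*time.Second, 100*time.Millisecond)) // prints: true
}
```

Returning `false` on timeout rather than an error matches the behavior described above: the caller can record the timeout and still serve the refresh page.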

## Memory admission and refusal

`wake.Admit` (`internal/wake/admit.go`) is shared by preview wake, `POST /wake/{id}`, and **`POST /sandbox` create**:

1. Read `/proc/meminfo` → `MemAvailable` percent.
2. `cost_pct = SANDBOXD_WAKE_COST_MB (800) / MemTotal × 100`.
3. If `(avail_pct - cost_pct) >= SANDBOXD_MEM_REFUSE_WAKES_PCT (10)` → admit.
4. Else if pressure reaper’s `Refused` flag is set → deny `wakes_refused` (no sync tick).
5. Else run one synchronous `pressure.Tick` (may stop oldest idle sandbox).
6. Re-read meminfo; if still below the floor → deny `low_memory`, otherwise admit.

HTML denial serves the busy page with `Retry-After: 30` and `X-Retry-After-Reason`. JSON returns **503** with `mem_available_percent` and the same `Retry-After`.
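The six admission steps can be sketched as follows. This is an illustration of the arithmetic, not the code in `internal/wake/admit.go`: `memInfo`, `decision`, `headroomAfterWake`, and the callback-based `admit` signature are hypothetical, and memory is passed in as megabytes rather than parsed from `/proc/meminfo`.

```go
package main

import "fmt"

// memInfo is a hypothetical snapshot of MemTotal and MemAvailable in MB.
type memInfo struct{ TotalMB, AvailableMB float64 }

type decision struct {
	Admit  bool
	Reason string // "", "wakes_refused", or "low_memory"
}

// headroomAfterWake returns avail_pct - cost_pct for one wake.
func headroomAfterWake(mem memInfo, costMB float64) float64 {
	availPct := mem.AvailableMB / mem.TotalMB * 100
	costPct := costMB / mem.TotalMB * 100
	return availPct - costPct
}

// admit follows the steps above: read re-reads memory, tick runs one
// synchronous pressure tick, refused is the pressure reaper's flag.
func admit(read func() memInfo, tick func(), refused bool, costMB, floorPct float64) decision {
	if headroomAfterWake(read(), costMB) >= floorPct {
		return decision{Admit: true}
	}
	if refused {
		return decision{Reason: "wakes_refused"} // no synchronous tick
	}
	tick()
	if headroomAfterWake(read(), costMB) >= floorPct {
		return decision{Admit: true}
	}
	return decision{Reason: "low_memory"}
}

func main() {
	// 16 GiB host with 2 GiB available: 12.50% - 4.88% leaves 7.62%, below the 10% floor.
	mem := memInfo{TotalMB: 16384, AvailableMB: 2048}
	fmt.Printf("headroom after wake: %.2f%%\n", headroomAfterWake(mem, 800)) // prints: headroom after wake: 7.62%

	read := func() memInfo { return mem }
	tick := func() { mem.AvailableMB += 1024 } // pretend the tick stopped one idle sandbox
	d := admit(read, tick, false, 800, 10)
	fmt.Println("admitted:", d.Admit)
}
```

In this example the first check fails, the synchronous tick frees 1 GiB, and the re-read leaves about 13.9% headroom, so the wake is admitted.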

## Host memory pressure reaper

The pressure reaper ticks every `SANDBOXD_PRESSURE_INTERVAL_SECONDS` (default **10s**), reading `MemAvailable` from `/proc/meminfo` (not `MemFree`).

| MemAvailable % | Action |
|----------------|--------|
| ≥ `SANDBOXD_MEM_HEADROOM_PCT` (15) | No stop |
| 10–15 | Stop **one** oldest idle-running sandbox (`reason=memory_pressure`, band `10-15`) |
| 5–10 | Same as 10–15, plus set `Refused` so new wakes are denied; logs a warning |
| &lt; 5 | **Emergency:** stop the single heaviest sandbox by cgroup `memory.current`, active or idle |

Wake refusal uses hysteresis: `Refused` clears when availability rises to **`RefuseWakesPct + 2`** (default **12%**) to avoid flapping at 10%.

On startup, if `MemAvailable` is already below headroom, `main.go` runs one synchronous pressure tick before the HTTP listener accepts traffic.
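The band table and the refusal hysteresis can be expressed as two small functions. This is a sketch under assumptions: the `thresholds` type, the band names other than the documented percentages, and the choice to keep `Refused` set in the emergency band are all hypothetical. Lower bounds are treated as inclusive, so exactly 10% falls in the 10–15 band.

```go
package main

import "fmt"

// thresholds mirrors the three SANDBOXD_MEM_*_PCT settings.
type thresholds struct{ Headroom, RefuseWakes, Emergency float64 }

type band string

const (
	bandHealthy   band = "healthy"
	bandAdvisory  band = "10-15"
	bandRefuse    band = "5-10"
	bandEmergency band = "emergency"
)

// classify maps a MemAvailable percentage to a pressure band.
func classify(availPct float64, th thresholds) band {
	switch {
	case availPct >= th.Headroom:
		return bandHealthy
	case availPct >= th.RefuseWakes:
		return bandAdvisory
	case availPct >= th.Emergency:
		return bandRefuse
	default:
		return bandEmergency
	}
}

// nextRefused applies the hysteresis: the flag is set below RefuseWakes and
// cleared only once availability reaches RefuseWakes+2.
// Assumption: any reading below RefuseWakes, including emergency, sets the flag.
func nextRefused(refused bool, availPct float64, th thresholds) bool {
	switch {
	case availPct < th.RefuseWakes:
		return true
	case availPct >= th.RefuseWakes+2:
		return false
	default:
		return refused // between 10% and 12%: keep the previous state
	}
}

func main() {
	th := thresholds{Headroom: 15, RefuseWakes: 10, Emergency: 5}
	refused := false
	for _, avail := range []float64{16, 9, 11, 12.5} {
		refused = nextRefused(refused, avail, th)
		fmt.Printf("avail=%.1f%% band=%s refused=%v\n", avail, classify(avail, th), refused)
	}
}
```

The 11% reading shows the point of the hysteresis: availability is back in the 10–15 band, but wakes stay refused until it reaches 12%.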

```mermaid
stateDiagram-v2
  [*] --> Healthy: avail >= 15%
  Healthy --> Band1015: avail 10-15%
  Band1015 --> Healthy: avail >= 15%
  Band1015 --> Band510: avail < 10%
  Band510 --> RefusingWakes: Refused=true
  RefusingWakes --> Band510: avail >= 12% clears Refused
  Band510 --> Emergency: avail < 5%
  Emergency --> [*]: stop heaviest RSS once per tick
```

<Warning>
Emergency stops can interrupt active agent work. Monitor `sandboxd_pressure_reaper_stops_total` and `sandboxd_wakes_refused_active` via Prometheus (`GET /metrics`).
</Warning>

## Warming and error pages

`internal/wake/html.go` returns white-label HTML (no sandbox id in body). Machine-readable reasons use response headers:

| Page | HTTP | User-visible title | Headers |
|------|------|-------------------|---------|
| Success / warming | 200 | “Spinning up your app!” | `meta refresh` every **2s** (`RefreshSeconds` in handler config) |
| Admission denied | 503 | “Almost ready…” | `Retry-After`, `X-Retry-After-Reason` (`wakes_refused`, `low_memory`) |
| Other failure | 503 | “We couldn't load your app” | `X-Wake-Error` (`not_found`, `start_failed`, `creating`, …) |

After the refresh, Traefik’s Docker provider usually observes the started container and the priority-100 router serves the dev server directly.
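The three page types in the table can be sketched as plain values plus a writer. This is not the markup from `internal/wake/html.go`: the `page` type and the `warmingPage`, `busyPage`, and `errorPage` helpers are hypothetical, and the HTML bodies are reduced to a title (plus the meta refresh for the warming page).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// page is a hypothetical description of one white-label response.
type page struct {
	Status  int
	Headers map[string]string
	Body    string
}

// warmingPage reloads the same URL after refreshSeconds via meta refresh.
func warmingPage(refreshSeconds int) page {
	body := fmt.Sprintf(`<!doctype html><meta http-equiv="refresh" content="%d"><title>Spinning up your app!</title>`, refreshSeconds)
	return page{Status: http.StatusOK, Headers: map[string]string{}, Body: body}
}

// busyPage is served when admission is denied.
func busyPage(reason string, retryAfterSeconds int) page {
	return page{
		Status: http.StatusServiceUnavailable,
		Headers: map[string]string{
			"Retry-After":          strconv.Itoa(retryAfterSeconds),
			"X-Retry-After-Reason": reason,
		},
		Body: `<!doctype html><title>Almost ready…</title>`,
	}
}

// errorPage is served for any other wake failure.
func errorPage(code string) page {
	return page{
		Status:  http.StatusServiceUnavailable,
		Headers: map[string]string{"X-Wake-Error": code},
		Body:    `<!doctype html><title>We couldn't load your app</title>`,
	}
}

// write sends the page; the sandbox id never appears in the body.
func (p page) write(w http.ResponseWriter) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	for k, v := range p.Headers {
		w.Header().Set(k, v)
	}
	w.WriteHeader(p.Status)
	fmt.Fprint(w, p.Body)
}

func main() {
	rec := httptest.NewRecorder()
	busyPage("low_memory", 30).write(rec)
	fmt.Println(rec.Code, rec.Header().Get("Retry-After"), rec.Header().Get("X-Retry-After-Reason"))
	// prints: 503 30 low_memory
}
```

Keeping the machine-readable reason in headers lets scripts and monitors distinguish failures without parsing the HTML.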

<Info>
If the warming page loops indefinitely, the dev server may not be listening on the exposed port yet, or wake failed — check `X-Wake-Error`, `sandboxd` logs (`component=wake`), and the troubleshooting page.
</Info>

## Programmatic wake and task submit

`POST /wake/{id}` returns JSON on success:

```json
{"id":"<ULID>","status":"running","wake_duration_ms":1234}
```
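A client can decode that response and honor `Retry-After` on a 503 roughly as follows. This is a sketch: `wakeResponse`, `interpretWake`, and `errBusy` are hypothetical names, a `200` success status is assumed, and the function takes the status, header value, and body directly so it can be shown without a live server.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
	"time"
)

// wakeResponse mirrors the documented success body.
type wakeResponse struct {
	ID             string `json:"id"`
	Status         string `json:"status"`
	WakeDurationMS int64  `json:"wake_duration_ms"`
}

// errBusy is a hypothetical error for a 503 from the wake endpoint.
var errBusy = errors.New("wake refused: retry later")

// interpretWake maps an HTTP status, Retry-After value, and body to a result.
// On 503 it returns how long the caller should wait before retrying.
func interpretWake(status int, retryAfter string, body []byte) (wakeResponse, time.Duration, error) {
	switch status {
	case 200:
		var out wakeResponse
		if err := json.Unmarshal(body, &out); err != nil {
			return wakeResponse{}, 0, err
		}
		return out, 0, nil
	case 503:
		secs, err := strconv.Atoi(retryAfter)
		if err != nil {
			secs = 30 // fall back to the documented Retry-After value
		}
		return wakeResponse{}, time.Duration(secs) * time.Second, errBusy
	default:
		return wakeResponse{}, 0, fmt.Errorf("unexpected wake status %d", status)
	}
}

func main() {
	res, _, err := interpretWake(200, "", []byte(`{"id":"01ARZ3NDEKTSV4RRFFQ69G5FAV","status":"running","wake_duration_ms":1234}`))
	fmt.Println(res.Status, res.WakeDurationMS, err) // prints: running 1234 <nil>

	_, wait, err := interpretWake(503, "30", nil)
	fmt.Println(wait, err) // prints: 30s wake refused: retry later
}
```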

`POST /v1/sandboxes/{id}/tasks` delegates to the same wake path when `status=stopped` (service-token auth satisfies private-sandbox policy for API callers).

Manual stop without waiting for idle:

:::endpoint POST /v1/sandboxes/{id}/stop
Stop the container now (idempotent if already stopped). Rejects when a runtimed task is active. Next preview hit runs the wake path.
:::

## Configuration

| Variable | Default | Role |
|----------|---------|------|
| `SANDBOXD_IDLE_THRESHOLD_SECONDS` | `2100` | Idle window before `docker stop` |
| `SANDBOXD_IDLE_REAP_INTERVAL_SECONDS` | `30` | Idle reaper period; `0` disables |
| `SANDBOXD_PRESSURE_INTERVAL_SECONDS` | `10` | Pressure reaper period; `0` disables |
| `SANDBOXD_MEM_HEADROOM_PCT` | `15` | Healthy band floor |
| `SANDBOXD_MEM_REFUSE_WAKES_PCT` | `10` | Wake admission floor + refusal band |
| `SANDBOXD_MEM_EMERGENCY_PCT` | `5` | Emergency RSS kill band |
| `SANDBOXD_WAKE_COST_MB` | `800` | Estimated RAM per wake in admission |
| `SANDBOXD_WAKE_TCP_READY_TIMEOUT_SECONDS` | `8` | TCP probe timeout (catch-all only) |
| `SANDBOXD_WAKE_GRACE_SECONDS` | `60` | Activity grace when poller is in fallback |
| `SANDBOXD_KEEPALIVE_MAX_SECONDS` | `86400` | Max `keepalive` extension |
| `SANDBOXD_SET_MEMORY_HIGH` | `false` | Re-apply cgroup `memory.high` after wake |

Optional activity tuning: `SANDBOXD_ACCESS_LOG`, `SANDBOXD_TAILER_OFFSET`, `SANDBOXD_POLLER_METRIC_RE`, `SANDBOXD_POLLER_URL`, `SANDBOXD_POLLER_INTERVAL_SECONDS`.

<Steps>
<Step title="Verify idle stop">
Create a sandbox, wait longer than `SANDBOXD_IDLE_THRESHOLD_SECONDS` with no preview traffic, then `GET /sandbox/{id}` — expect `status: "stopped"`.
</Step>
<Step title="Verify wake-on-preview">
`curl -H "Host: s-$ID-3000.preview.localhost" http://127.0.0.1:${HTTP_PORT:-80}/` — first response is HTML warming; repeat until the app body appears.
</Step>
<Step title="Verify keepalive">
`POST /sandbox/{id}/keepalive` with a future `until`; confirm the sandbox stays `running` through the idle threshold.
</Step>
</Steps>

## Related pages

<CardGroup>
<Card title="Preview routing" href="/preview-routing">
Traefik priority 100 vs catch-all priority 1, `PREVIEW_DOMAIN`, and label constraints.
</Card>
<Card title="Sandbox lifecycle" href="/sandbox-lifecycle">
SQLite status machine (`running` / `stopped` / `creating` / `error`) and reconcile-on-boot.
</Card>
<Card title="Manage sandboxes" href="/sandbox-operations">
`keepalive`, `POST /v1/sandboxes/{id}/stop`, exec, and purge workflows.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full compose-backed env catalog including reaper and wake knobs.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Warming-page stalls, admission denial, and compose log probes.
</Card>
</CardGroup>
