# Agent Sandbox Developer Reference Wiki

> A developer reference for kubernetes-sigs/agent-sandbox: a Kubernetes controller and SDK ecosystem that delivers a Sandbox CRD plus extensions (SandboxClaim, SandboxTemplate, SandboxWarmPool) for managing isolated, stateful, singleton workloads such as AI agent runtimes.

## Context Links

- [Agent index](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/llms.txt)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a)
- [GitHub repository](https://github.com/kubernetes-sigs/agent-sandbox)

## Repository Metadata

- Repository: kubernetes-sigs/agent-sandbox
- Generated: 2026-05-25T23:21:59.478Z
- Updated: 2026-05-25T23:22:25.082Z
- Runtime: Claude Code
- Format: Technical
- Pages: 30

## Page Index

- 01. [Technical Orientation](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/01-technical-orientation.md) - Purpose of agent-sandbox, its core/extensions split, controller-manager entry point, and a map for navigating the rest of this developer reference.
- 02. [Installation & Deployment Modes](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/02-installation-deployment-modes.md) - Installing the controller via released YAML, Helm chart, or kind clusters; choosing between core-only and core+extensions deployments.
- 03. [Quickstart Paths (gVisor, Kata, Vanilla)](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/03-quickstart-paths-gvisor-kata-vanilla.md) - Walkthrough of the quickstart manifests for vanilla, gVisor, and Kata runtimes, plus how the examples directory organizes runnable scenarios.
- 04. [Controller Configuration & Tuning Flags](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/04-controller-configuration-tuning-flags.md) - All command-line flags exposed by the controller binary: QPS/burst, worker concurrency, warm-pool batch size, leader election, pprof, tracing, and cluster domain.
- 05. [Sandbox CRD (agents.x-k8s.io/v1beta1)](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/05-sandbox-crd-agents.x-k8s.io-v1beta1.md) - Field-by-field reference for the core Sandbox resource: PodTemplate, VolumeClaimTemplates, Lifecycle, Replicas (0/1), and Service toggle.
- 06. [SandboxTemplate CRD](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/06-sandboxtemplate-crd.md) - Reusable template type used by SandboxClaim and SandboxWarmPool, including the embedded Sandbox spec it encapsulates.
- 07. [SandboxClaim CRD](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/07-sandboxclaim-crd.md) - Claim resource that resolves to a Sandbox: template references, warm-pool policy, env injection, additional pod metadata, and shutdown policies (Delete, DeleteForeground, Retain).
- 08. [SandboxWarmPool CRD](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/08-sandboxwarmpool-crd.md) - Specification of pre-warmed sandbox pools: template binding, replica counts, and adoption semantics consumed by SandboxClaim.
- 09. [Conditions, Reasons & Status Surfaces](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/09-conditions-reasons-status-surfaces.md) - Catalogue of condition types (Ready, Suspended, Finished), reason strings, and the annotation/label keys (pod-name, template-ref, propagated-labels) that controllers use to coordinate state.
- 10. [Controller Manager Entry Point](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/10-controller-manager-entry-point.md) - How cmd/agent-sandbox-controller/main.go wires schemes, the controller-runtime Manager, metrics/pprof servers, leader election, and the optional extensions reconciler set.
- 11. [Sandbox Reconciler](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/11-sandbox-reconciler.md) - Reconciliation loop for the core Sandbox: pod/PVC/service materialization, identity propagation, status conditions, scale subresource, and the cluster-domain FQDN logic.
- 12. [SandboxClaim Reconciler](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/12-sandboxclaim-reconciler.md) - The largest controller in the project: template resolution, env/metadata injection, warm-pool adoption, pod-exclusivity invariants, foreground deletion, and TTL after finish.
- 13. [SandboxTemplate Reconciler](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/13-sandboxtemplate-reconciler.md) - Validation and bookkeeping done by the template controller, including how template changes ripple to claims and warm pools.
- 14. [SandboxWarmPool Reconciler](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/14-sandboxwarmpool-reconciler.md) - Pool maintenance loop: parallel batch creation/deletion bounded by max-batch-size, rollout on template changes, and watcher coordination with SandboxClaim.
- 15. [Warm Sandbox Queue](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/15-warm-sandbox-queue.md) - The in-memory queue shared between the warm-pool and claim reconcilers that hands off warm sandboxes to incoming claims.
- 16. [Lifecycle & Expiry Logic](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/16-lifecycle-expiry-logic.md) - Shared expiry helpers used by Sandbox and SandboxClaim controllers to compute shutdown times, requeue durations, and policy-driven cleanup.
- 17. [Metrics & Sandbox Collector](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/17-metrics-sandbox-collector.md) - Reconciler latency/result metrics plus the custom Prometheus collector that surfaces per-sandbox phase, age, and warm-pool stats.
- 18. [OpenTelemetry Tracing Setup](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/18-opentelemetry-tracing-setup.md) - Provider-neutral OTLP tracing wiring used by both the controller binary and the SDKs; instrumenter interface and no-op fallback.
- 19. [Go High-Level SDK (clients/go/sandbox)](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/19-go-high-level-sdk-clients-go-sandbox.md) - The high-level Go client: Sandbox lifecycle, command execution, file transfer, port tunnels, gateway, connector strategies, and tracing helpers.
- 20. [Generated Go Clientsets, Informers & Listers](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/20-generated-go-clientsets-informers-listers.md) - The k8s.io/client-go-style generated machinery for Sandbox and extensions: typed clientsets, informers, listers, and the codegen wiring.
- 21. [Python Sync SDK Core](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/21-python-sync-sdk-core.md) - Synchronous Python client surface: SandboxClient, Sandbox, connector, command executor, filesystem helpers, and the k8s helper layer.
- 22. [Python Async SDK](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/22-python-async-sdk.md) - The asyncio mirror of the sync SDK: AsyncSandboxClient, AsyncSandbox, async connector, async filesystem, and async command executor.
- 23. [Python Extensions, Gateway & Sandbox Router](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/23-python-extensions-gateway-sandbox-router.md) - Optional Python add-ons: computer-use extension, GKE pod-snapshot extensions, the sandbox-router service, and the kind-based gateway harness.
- 24. [Helm Chart Layout](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/24-helm-chart-layout.md) - Structure of the Helm chart: CRD shipping, deployment template, controller-args helper, RBAC bindings, and values knobs that map to controller flags.
- 25. [Static Manifests & Generated RBAC](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/25-static-manifests-generated-rbac.md) - The kubectl-apply-ready manifests in k8s/ plus the generated ClusterRole/Binding files used by both core and extensions controllers.
- 26. [Examples Library Map](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/26-examples-library-map.md) - Tour of the examples/ tree: AIO sandbox, Chrome/VSCode/JupyterLab, agent frameworks (Hermes, LangChain, ADK, Ray, Kueue), policy and scaling scenarios.
- 27. [Build, Codegen & Repository Tools](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/27-build-codegen-repository-tools.md) - Make targets, the codegen.go shim, deepcopy/CRD generation, lint configuration, and the dev/tools scripts that power local development.
- 28. [E2E Test Framework](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/28-e2e-test-framework.md) - Layout of the Go e2e suite, the framework client/predicates/watchset helpers, and the parallel/replica/shutdown scenario coverage.
- 29. [Load Testing & CI Pipelines](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/29-load-testing-ci-pipelines.md) - cluster-loader-based load-test recipes plus the prowjob presubmit/periodic configuration that runs them in CI.
- 30. [KEPs & Roadmap](https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/30-keps-roadmap.md) - In-flight design proposals tracked under docs/keps (suspended state, metadata propagation, Python SDK refactor) and the published roadmap.

## Source File Index

- `AGENTS.md`
- `api/v1beta1/groupversion_info.go`
- `api/v1beta1/sandbox_types.go`
- `api/v1beta1/zz_generated.deepcopy.go`
- `clients/go/sandbox/client.go`
- `clients/go/sandbox/commands.go`
- `clients/go/sandbox/connector.go`
- `clients/go/sandbox/files.go`
- `clients/go/sandbox/gateway.go`
- `clients/go/sandbox/sandbox.go`
- `clients/go/sandbox/tracing.go`
- `clients/go/sandbox/tunnel.go`
- `clients/k8s/clientset/versioned`
- `clients/k8s/extensions/clientset`
- `clients/k8s/extensions/informers`
- `clients/k8s/extensions/listers`
- `clients/python/agentic-sandbox-client/gateway-kind/README.md`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py`
- `clients/python/agentic-sandbox-client/otel-collector-config.yaml.example`
- `clients/python/agentic-sandbox-client/sandbox-router/README.md`
- `clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py`
- `cmd/agent-sandbox-controller/main.go`
- `codegen.go`
- `controllers/sandbox_controller_test.go`
- `controllers/sandbox_controller.go`
- `controllers/testmain_test.go`
- `dev/ci/periodics`
- `dev/ci/presubmits`
- `dev/load-test/cluster-loader-sandbox.yaml`
- `dev/load-test/README.md`
- `dev/tools/client-gen-go.sh`
- `dev/tools/create-kind-cluster`
- `dev/tools/deploy-to-kube`
- `dev/tools/lint-api`
- `dev/tools/test-unit`
- `Dockerfile`
- `docs/api.md`
- `docs/configuration.md`
- `docs/development.md`
- `docs/keps/119-sandbox-suspended-state/README.md`
- `docs/keps/174-metadata-propagation/README.md`
- `docs/keps/359-refactor-python-sdk/README.md`
- `docs/keps/README.md`
- `docs/prowjob_manual_run.md`
- `docs/testing.md`
- `examples/chrome-sandbox`
- `examples/hello-world-sandbox`
- `examples/hpa-swp-scaling`
- `examples/jupyterlab`
- `examples/policy`
- `examples/quickstart/gvisor.md`
- `examples/quickstart/kata-containers.md`
- `examples/quickstart/README.md`
- `examples/README.md`
- `examples/vscode-sandbox`
- `extensions/api/v1beta1/groupversion_info.go`
- `extensions/api/v1beta1/sandboxclaim_types.go`
- `extensions/api/v1beta1/sandboxtemplate_types.go`
- `extensions/api/v1beta1/sandboxwarmpool_types.go`
- `extensions/controllers/queue/simple_sandbox_queue_test.go`
- `extensions/controllers/queue/simple_sandbox_queue.go`
- `extensions/controllers/sandboxclaim_controller_test.go`
- `extensions/controllers/sandboxclaim_controller.go`
- `extensions/controllers/sandboxclaim_pod_exclusivity_test.go`
- `extensions/controllers/sandboxtemplate_controller_test.go`
- `extensions/controllers/sandboxtemplate_controller.go`
- `extensions/controllers/sandboxwarmpool_controller_test.go`
- `extensions/controllers/sandboxwarmpool_controller.go`
- `extensions/controllers/utils.go`
- `extensions/examples/README.md`
- `extensions/examples/sandbox-claim.yaml`
- `extensions/examples/sandboxclaim.yaml`
- `extensions/examples/sandboxtemplate.yaml`
- `extensions/examples/sandboxwarmpool.yaml`
- `extensions/examples/secure-sandboxtemplate.yaml`
- `go.mod`
- `helm/Chart.yaml`
- `helm/crds/agents.x-k8s.io_sandboxes.yaml`
- `helm/README.md`
- `helm/templates/_controller-args.tpl`
- `helm/templates/deployment.yaml`
- `helm/values.yaml`
- `internal/lifecycle/expiry_test.go`
- `internal/lifecycle/expiry.go`
- `internal/metrics/metrics_test.go`
- `internal/metrics/metrics.go`
- `internal/metrics/sandbox_collector_test.go`
- `internal/metrics/sandbox_collector.go`
- `internal/metrics/tracing.go`
- `internal/version/`
- `k8s/controller.yaml`
- `k8s/crds/agents.x-k8s.io_sandboxes.yaml`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml`
- `k8s/extensions-rbac.generated.yaml`
- `k8s/extensions.controller.yaml`
- `k8s/extensions.yaml`
- `k8s/rbac.generated.yaml`
- `Makefile`
- `README.md`
- `roadmap.md`
- `test/e2e/basic_test.go`
- `test/e2e/extensions/warmpool_rollout_test.go`
- `test/e2e/framework/client.go`
- `test/e2e/framework/watchset.go`
- `test/e2e/parallelism_test.go`
- `test/e2e/README.md`
- `tools.mod`

---

## 01. Technical Orientation

> Purpose of agent-sandbox, its core/extensions split, controller-manager entry point, and a map for navigating the rest of this developer reference.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/01-technical-orientation.md
- Generated: 2026-05-25T22:29:32.308Z

### Source Files

- `README.md`
- `AGENTS.md`
- `cmd/agent-sandbox-controller/main.go`
- `go.mod`
- `roadmap.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [README.md](README.md)
- [AGENTS.md](AGENTS.md)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [go.mod](go.mod)
- [roadmap.md](roadmap.md)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [api/v1beta1/groupversion_info.go](api/v1beta1/groupversion_info.go)
- [extensions/api/v1beta1/groupversion_info.go](extensions/api/v1beta1/groupversion_info.go)
- [Dockerfile](Dockerfile)
- [codegen.go](codegen.go)
</details>

# Technical Orientation

`agent-sandbox` is a Kubernetes SIG Apps subproject that introduces a `Sandbox` Custom Resource Definition (CRD) and an associated controller for managing **stateful, singleton, pod-backed workloads with a stable identity**. The project targets AI agent runtimes, development environments, notebooks, and similar workloads that fit neither the stateless replicated model of `Deployment` nor the numbered-set model of `StatefulSet`. The Go module is `sigs.k8s.io/agent-sandbox` (Go 1.26.2), built on `controller-runtime` v0.23.x and Kubernetes API libraries v0.35.x.

This page is the entry point for the developer reference. It explains the **core / extensions split**, walks through the **controller-manager entry point** in `cmd/agent-sandbox-controller`, and maps the rest of the repository so other pages can be navigated with confidence.

Sources: [README.md:16-18](), [AGENTS.md:5-12](), [go.mod:1-24]()

## Why agent-sandbox exists

Kubernetes excels at stateless replicated workloads (`Deployment`) and stable numbered sets (`StatefulSet`), but use cases such as per-user dev environments, isolated runtimes for LLM-generated code, Jupyter-style sessions, and single-pod stateful services are awkward to express through those primitives. The README enumerates the gap and the desired characteristics (strong runtime isolation such as gVisor or Kata, deep hibernation, automatic resume, efficient persistence, programmatic API), and the roadmap shows the project is iterating toward Beta/GA while expanding runtime support and SDK surface.

Sources: [README.md:134-159](), [roadmap.md:1-29]()

## Core / Extensions split

The project is intentionally partitioned into a small, stable **core** and an opt-in **extensions** module. The core ships the primitive `Sandbox`; extensions build higher-level lifecycle and pool semantics on top of it without forcing them on every operator.

| Layer | Go package | API group / version | Kinds | Controller package |
| --- | --- | --- | --- | --- |
| Core | `sigs.k8s.io/agent-sandbox/api/v1beta1` | `agents.x-k8s.io / v1beta1` | `Sandbox` | `controllers/` |
| Extensions | `sigs.k8s.io/agent-sandbox/extensions/api/v1beta1` | `extensions.agents.x-k8s.io / v1beta1` | `SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool` | `extensions/controllers/` |

Older docs may still refer to both groups as `v1alpha1`, but the code registers them as `v1beta1` today: see the `+groupName` markers in `api/v1beta1/groupversion_info.go:17` and `extensions/api/v1beta1/groupversion_info.go:17`, and the import `extensionsv1beta1 "sigs.k8s.io/agent-sandbox/extensions/api/v1beta1"` in `cmd/agent-sandbox-controller/main.go:34`.

The extensions form a small pattern stack on top of `Sandbox`:

- **`SandboxTemplate`** — reusable spec a `Sandbox` can be stamped from.
- **`SandboxClaim`** — user-facing request that resolves to a `Sandbox`, optionally by adopting one from a warm pool.
- **`SandboxWarmPool`** — pre-creates Sandboxes so claims can hand back a ready instance immediately.

Sources: [README.md:22-83](), [AGENTS.md:9-23](), [api/v1beta1/groupversion_info.go:14-36](), [extensions/api/v1beta1/groupversion_info.go:14-36]()
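
As a minimal sketch of the split above, the following maps each kind to the `apiVersion` string a manifest would carry. The `groupVersion` type and the `apiVersionFor` helper are hypothetical stand-ins defined locally for illustration; the real project uses the `GroupVersion` values declared in its `groupversion_info.go` files.

```go
package main

import "fmt"

// groupVersion is a simplified, local stand-in for the group/version pair
// that Kubernetes API machinery uses; it is not the project's own type.
type groupVersion struct {
	Group   string
	Version string
}

// apiVersion renders the value expected in a manifest's apiVersion field.
func (gv groupVersion) apiVersion() string {
	return gv.Group + "/" + gv.Version
}

var (
	coreGV       = groupVersion{Group: "agents.x-k8s.io", Version: "v1beta1"}
	extensionsGV = groupVersion{Group: "extensions.agents.x-k8s.io", Version: "v1beta1"}
)

// kindsByGroup mirrors the layer table: one core kind, three extension kinds.
var kindsByGroup = map[groupVersion][]string{
	coreGV:       {"Sandbox"},
	extensionsGV: {"SandboxClaim", "SandboxTemplate", "SandboxWarmPool"},
}

// apiVersionFor (hypothetical helper) returns the apiVersion for a kind,
// or false when the kind belongs to neither group.
func apiVersionFor(kind string) (string, bool) {
	for gv, kinds := range kindsByGroup {
		for _, k := range kinds {
			if k == kind {
				return gv.apiVersion(), true
			}
		}
	}
	return "", false
}

func main() {
	for _, kind := range []string{"Sandbox", "SandboxClaim", "SandboxTemplate", "SandboxWarmPool"} {
		v, _ := apiVersionFor(kind)
		fmt.Printf("kind: %s -> apiVersion: %s\n", kind, v)
	}
}
```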

### Ownership map

```text
agent-sandbox repo
├── api/v1beta1/                ← Sandbox types (CRD source of truth)
├── controllers/                ← SandboxReconciler + envtest
├── extensions/
│   ├── api/v1beta1/            ← SandboxClaim / Template / WarmPool types
│   └── controllers/            ← three reconcilers + shared queue/
├── cmd/agent-sandbox-controller/
│   └── main.go                 ← single manager binary
├── internal/                   ← lifecycle, metrics, version (not importable externally)
├── k8s/                        ← generated CRDs, RBAC, controller manifests
├── helm/                       ← parallel Helm packaging
├── clients/
│   ├── k8s/                    ← generated typed clientset/listers/informers
│   ├── go/                     ← hand-written high-level Go SDK
│   └── python/agentic-sandbox-client/  ← Python SDK (PyPI: k8s-agent-sandbox)
├── docs/                       ← development.md, testing.md, configuration.md, keps/
├── examples/, extensions/examples/
└── test/e2e/, test/benchmarks/
```

Sources: [AGENTS.md:13-34](), [codegen.go:14-28]()

## Controller-manager entry point

A single binary (`bin/manager`, container entrypoint `/agent-sandbox-controller`) hosts every reconciler. The `--extensions` flag decides whether the extension CRDs and their three reconcilers are also registered on the manager; the core path always runs.
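
The gating can be sketched with a toy model, assuming nothing about the real wiring beyond what this page states. The `manager`, `warmQueue`, and `setup` names below are hypothetical; the actual binary registers controller-runtime reconcilers on a real `Manager`.

```go
package main

import "fmt"

// warmQueue is a hypothetical stand-in for the shared in-memory queue.
type warmQueue struct{ items []string }

// manager is a toy stand-in for the controller-runtime Manager; it only
// records the names of the reconcilers registered on it.
type manager struct{ registered []string }

func (m *manager) register(name string) { m.registered = append(m.registered, name) }

// claimReconciler and warmPoolReconciler each hold a pointer to the same queue.
type claimReconciler struct{ queue *warmQueue }
type warmPoolReconciler struct{ queue *warmQueue }

// setup registers the core reconciler unconditionally and the three
// extension reconcilers only when extensions is true.
func setup(m *manager, extensions bool) (*claimReconciler, *warmPoolReconciler) {
	m.register("SandboxReconciler")
	if !extensions {
		return nil, nil
	}
	q := &warmQueue{} // one queue instance, shared by claim and warm-pool reconcilers
	claim := &claimReconciler{queue: q}
	pool := &warmPoolReconciler{queue: q}
	m.register("SandboxClaimReconciler")
	m.register("SandboxTemplateReconciler")
	m.register("SandboxWarmPoolReconciler")
	return claim, pool
}

func main() {
	for _, extensions := range []bool{false, true} {
		m := &manager{}
		setup(m, extensions)
		fmt.Printf("extensions=%v -> %v\n", extensions, m.registered)
	}
}
```

Passing one queue pointer to both reconcilers at construction time keeps the hand-off explicit and avoids a package-level global.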

### Startup architecture

```mermaid
flowchart LR
    subgraph CLI[cmd/agent-sandbox-controller/main.go]
      Flags[flag.Parse]
      Ver[version.Print]
      ZapLog[zap.New logger]
      Sched[ctrl.SetupSignalHandler]
      OTel[asmetrics.SetupOTel\nif --enable-tracing]
      MetricsOpts[metricsserver.Options\n+pprof handlers]
      RC[ctrl.GetConfigOrDie\n+QPS/Burst]
      MGR[ctrl.NewManager\nLeaderElectionID=a3317529...]
    end

    subgraph CoreScheme[controllers.Scheme]
      coreInit[init: clientgo + sandboxv1beta1\nAddToScheme]
    end

    subgraph Reconcilers
      SR[SandboxReconciler]
      SCR[SandboxClaimReconciler]
      STR[SandboxTemplateReconciler]
      SWP[SandboxWarmPoolReconciler]
      Q[queue.NewSimpleSandboxQueue]
    end

    subgraph Probes
      Health[/healthz/]
      Ready[/readyz/]
      Metrics[/metrics + /debug/pprof]
    end

    Flags --> RC --> MGR
    ZapLog --> MGR
    Sched --> MGR
    OTel -. instrumenter .-> SR
    OTel -. instrumenter .-> SCR
    OTel -. instrumenter .-> STR
    CoreScheme --> MGR
    MGR -->|register| SR
    MGR -->|register if --extensions| SCR
    MGR -->|register if --extensions| STR
    MGR -->|register if --extensions| SWP
    SCR <-->|warm queue| Q
    SWP <-->|warm queue| Q
    MetricsOpts --> MGR
    MGR --> Health
    MGR --> Ready
    MGR --> Metrics
    MGR -->|mgr.Start ctx| Run((reconcile loops))
```

Sources: [cmd/agent-sandbox-controller/main.go:50-295](), [controllers/sandbox_controller.go:112-128]()
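
The dotted "instrumenter" edges in the diagram rely on a no-op fallback when tracing is disabled. A minimal sketch of that pattern follows. The `instrumenter` interface, its `startSpan` signature, and the `recorder` type are assumptions made for illustration and are not the signatures in `internal/metrics`.

```go
package main

import (
	"context"
	"fmt"
)

// instrumenter is a hypothetical, simplified tracing interface.
type instrumenter interface {
	// startSpan returns a (possibly derived) context and a function that ends the span.
	startSpan(ctx context.Context, name string) (context.Context, func())
}

// noOp satisfies instrumenter without recording anything.
type noOp struct{}

func (noOp) startSpan(ctx context.Context, _ string) (context.Context, func()) {
	return ctx, func() {}
}

// recorder is a toy "real" instrumenter that remembers the span names it saw.
type recorder struct{ spans []string }

func (r *recorder) startSpan(ctx context.Context, name string) (context.Context, func()) {
	r.spans = append(r.spans, name)
	return ctx, func() {}
}

// newInstrumenter picks the implementation once, at startup.
func newInstrumenter(enableTracing bool) instrumenter {
	if enableTracing {
		return &recorder{}
	}
	return noOp{}
}

// reconciler never checks whether tracing is enabled; it always calls startSpan.
type reconciler struct{ tracer instrumenter }

func (r *reconciler) reconcile(ctx context.Context, name string) string {
	_, end := r.tracer.startSpan(ctx, "Reconcile")
	defer end()
	return "reconciled " + name
}

// runOnce reconciles one object and reports how many spans were recorded.
func runOnce(enableTracing bool) (string, int) {
	tracer := newInstrumenter(enableTracing)
	r := &reconciler{tracer: tracer}
	out := r.reconcile(context.Background(), "demo")
	if rec, ok := tracer.(*recorder); ok {
		return out, len(rec.spans)
	}
	return out, 0
}

func main() {
	for _, enabled := range []bool{false, true} {
		out, spans := runOnce(enabled)
		fmt.Printf("tracing=%v result=%q spans=%d\n", enabled, out, spans)
	}
}
```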

### Flags and defaults

The flags exposed by `main` are the contract operators tune against. Defaults below are taken directly from the `flag.*Var` calls in `cmd/agent-sandbox-controller/main.go`.

| Flag | Default | Purpose |
| --- | --- | --- |
| `--version` | `false` | Print build info from `internal/version` and exit. |
| `--cluster-domain` | `cluster.local` | Used when the Sandbox controller composes service FQDNs. |
| `--metrics-bind-address` | `:8080` | controller-runtime metrics + optional pprof. |
| `--health-probe-bind-address` | `:8081` | `/healthz` and `/readyz` (both wired to `healthz.Ping`). |
| `--leader-elect` | `true` | Single-active manager via Lease `a3317529.agent-sandbox.x-k8s.io`. |
| `--leader-election-namespace` | `""` | Empty → controller-runtime auto-detection. |
| `--extensions` | `false` | Register `SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool` reconcilers. |
| `--enable-tracing` | `false` | Initialize OTLP tracing via `asmetrics.SetupOTel` (10s init timeout). |
| `--enable-pprof` / `--enable-pprof-debug` | `false` | Mount pprof handlers; debug variant exposes heap/goroutine/block/mutex/fgprof. |
| `--pprof-block-profile-rate` | `1000000` | ns sampling rate for `/debug/pprof/block` when debug is on. |
| `--pprof-mutex-profile-fraction` | `10` | 1/N sampling for `/debug/pprof/mutex` when debug is on. |
| `--kube-api-qps` | `-1.0` | Client-side QPS limit; `-1` means unlimited. |
| `--kube-api-burst` | `10` | Client-side burst. |
| `--sandbox-concurrent-workers` | `1` | Reconciles in flight for `SandboxReconciler`. |
| `--sandbox-claim-concurrent-workers` | `1` | Reconciles in flight for `SandboxClaimReconciler`. |
| `--sandbox-warm-pool-concurrent-workers` | `1` | Reconciles in flight for `SandboxWarmPoolReconciler`. |
| `--sandbox-template-concurrent-workers` | `1` | Reconciles in flight for `SandboxTemplateReconciler`. |
| `--sandbox-warm-pool-max-batch-size` | `300` | Cap on parallel sandbox create/delete per warm-pool reconcile. |

Validation in `main` enforces that worker counts and `--kube-api-burst` are positive. It logs a warning when the total worker count exceeds 1000, or when it exceeds `--kube-api-burst` while QPS is positive, and it clamps negative pprof sampling values to 0.

Sources: [cmd/agent-sandbox-controller/main.go:70-145](), [cmd/agent-sandbox-controller/main.go:185-227]()
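
The validation rules can be sketched as a pure function over the parsed flag values. This is an illustration under stated assumptions rather than the code in `main.go`: the `settings` struct, `defaults`, and `validate` are hypothetical names, and the warning texts are invented.

```go
package main

import (
	"errors"
	"fmt"
)

// settings holds the subset of flag values that the validation rules refer to.
type settings struct {
	Workers              map[string]int // flag name -> concurrent workers
	KubeAPIQPS           float64
	KubeAPIBurst         int
	BlockProfileRate     int
	MutexProfileFraction int
}

// defaults returns the documented default values from the flag table.
func defaults() settings {
	return settings{
		Workers: map[string]int{
			"sandbox-concurrent-workers":           1,
			"sandbox-claim-concurrent-workers":     1,
			"sandbox-warm-pool-concurrent-workers": 1,
			"sandbox-template-concurrent-workers":  1,
		},
		KubeAPIQPS:           -1,
		KubeAPIBurst:         10,
		BlockProfileRate:     1000000,
		MutexProfileFraction: 10,
	}
}

// validate rejects non-positive worker counts and burst, collects warnings,
// and clamps negative pprof sampling values to 0 in place.
func validate(s *settings) (warnings []string, err error) {
	total := 0
	for name, n := range s.Workers {
		if n <= 0 {
			return nil, fmt.Errorf("--%s must be positive, got %d", name, n)
		}
		total += n
	}
	if s.KubeAPIBurst <= 0 {
		return nil, errors.New("--kube-api-burst must be positive")
	}
	if total > 1000 {
		warnings = append(warnings, "total workers exceed 1000")
	}
	if s.KubeAPIQPS > 0 && total > s.KubeAPIBurst {
		warnings = append(warnings, "total workers exceed --kube-api-burst")
	}
	if s.BlockProfileRate < 0 {
		s.BlockProfileRate = 0
	}
	if s.MutexProfileFraction < 0 {
		s.MutexProfileFraction = 0
	}
	return warnings, nil
}

func main() {
	s := defaults()
	s.KubeAPIQPS = 50
	s.Workers["sandbox-concurrent-workers"] = 20
	warnings, err := validate(&s)
	fmt.Println("warnings:", warnings, "err:", err)
}
```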

### Startup sequence

The order of operations in `main` matters because some side effects (pprof, scheme registration, leader election ID) must be in place before `mgr.Start` blocks the main goroutine.

1. Parse flags, optionally short-circuit on `--version`.
2. Install the `zap` logger via `ctrl.SetLogger`.
3. Validate worker / QPS settings and log a "Concurrency settings" summary.
4. Install a signal-handled context (`ctrl.SetupSignalHandler`).
5. If `--enable-tracing` is set, build the OTLP instrumenter; otherwise use `asmetrics.NewNoOp()` so reconcilers can always call `r.Tracer.StartSpan` unconditionally.
6. Reset `http.DefaultServeMux` so importing `net/http/pprof` does **not** accidentally expose pprof on the default mux.
7. Build the scheme: always register the core (`controllers.Scheme` already includes `clientgo` + `sandboxv1beta1`); add `extensionsv1beta1` only when `--extensions` is true.
8. Configure `metricsserver.Options` with optional pprof handlers and set `runtime.SetBlockProfileRate` / `SetMutexProfileFraction` for the debug variant.
9. Take the REST config, apply QPS/Burst, build the manager with leader election ID `a3317529.agent-sandbox.x-k8s.io`.
10. Register `asmetrics.RegisterSandboxCollector` against the manager client.
11. Register `SandboxReconciler` (always) and, when `--extensions` is set, `SandboxClaimReconciler` (sharing a `queue.NewSimpleSandboxQueue` with the warm-pool reconciler), `SandboxTemplateReconciler`, and `SandboxWarmPoolReconciler`.
12. Wire `/healthz` and `/readyz` to `healthz.Ping`, then `mgr.Start(ctx)`.

Sources: [cmd/agent-sandbox-controller/main.go:151-294]()
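
Step 6 deserves a concrete illustration, because importing `net/http/pprof` registers handlers on `http.DefaultServeMux` as a side effect. The sketch below uses only the standard library and shows one way to isolate those handlers. The `isolatePprof`, `statusOn`, and `demo` helpers are hypothetical, and the real controller mounts its handlers through `metricsserver.Options` rather than a bare `ServeMux`.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// statusOn sends an in-process GET to h and returns the response status code.
func statusOn(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

// isolatePprof replaces the default mux with an empty one and returns a
// dedicated mux that carries the pprof handlers explicitly.
func isolatePprof() *http.ServeMux {
	http.DefaultServeMux = http.NewServeMux()

	dedicated := http.NewServeMux()
	dedicated.HandleFunc("/debug/pprof/", pprof.Index)
	dedicated.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	return dedicated
}

// demo reports the status for the same path on the default mux (before and
// after the reset) and on the dedicated mux.
func demo() (before, afterDefault, afterDedicated int) {
	const path = "/debug/pprof/cmdline"
	before = statusOn(http.DefaultServeMux, path)
	dedicated := isolatePprof()
	afterDefault = statusOn(http.DefaultServeMux, path)
	afterDedicated = statusOn(dedicated, path)
	return before, afterDefault, afterDedicated
}

func main() {
	before, afterDefault, afterDedicated := demo()
	fmt.Println("default mux before reset:", before)
	fmt.Println("default mux after reset: ", afterDefault)
	fmt.Println("dedicated mux:           ", afterDedicated)
}
```

After the reset, any server that falls back to the default mux no longer serves profiling endpoints, so exposure becomes an explicit opt-in on the dedicated handler.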

### Scheme registration in core

`controllers.Scheme` is a package-level `runtime.Scheme` populated in an `init()`, so any importer of `controllers` gets the core types registered. `cmd/agent-sandbox-controller/main.go` then layers extensions on top of the same scheme rather than constructing a parallel one.

```go
// controllers/sandbox_controller.go
var (
    Scheme = runtime.NewScheme()
)

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(Scheme))
    utilruntime.Must(sandboxv1beta1.AddToScheme(Scheme))
}
```

Sources: [controllers/sandbox_controller.go:112-120](), [cmd/agent-sandbox-controller/main.go:174-177]()

## Map of the rest of the developer reference

This section is the navigational hub for the other pages in the wiki. Each row points to the directory or file that owns the code, so deeper pages can be opened from the actual source rather than from prose.

| Topic | Where to look | Notes |
| --- | --- | --- |
| Core `Sandbox` types and kubebuilder markers | [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go) | Spec, Status, validation/defaulting markers. |
| Core reconciler | [controllers/sandbox_controller.go](controllers/sandbox_controller.go) | Ownership, finalizers, pod/service/PVC management. |
| Extension types | [extensions/api/v1beta1/](extensions/api/v1beta1/) | `sandboxclaim_types.go`, `sandboxtemplate_types.go`, `sandboxwarmpool_types.go`. |
| Extension reconcilers | [extensions/controllers/](extensions/controllers/) | Claim/template/warm-pool reconcilers, shared `queue/` package, exclusivity tests. |
| Shared internals | [internal/lifecycle/](internal/lifecycle/), [internal/metrics/](internal/metrics/), [internal/version/](internal/version/) | Expiry helpers, custom Sandbox metric collector, OTel setup, build-info injected via `-ldflags`. |
| Generated manifests | [k8s/](k8s/), [helm/](helm/) | CRDs in `k8s/crds/` and `helm/crds/`; RBAC in `*.generated.yaml`. Regenerate via `make all` per `codegen.go`. |
| Generated typed clientset | [clients/k8s/](clients/k8s/) | Output of `dev/tools/client-gen-go.sh`. Do not hand-edit. |
| Hand-written Go SDK | [clients/go/](clients/go/) | High-level `SandboxClaim` lifecycle, Gateway / port-forward / direct connectivity. |
| Hand-written Python SDK | [clients/python/agentic-sandbox-client/](clients/python/agentic-sandbox-client/) | Directory `agentic-sandbox-client`, package `k8s_agent_sandbox`, PyPI `k8s-agent-sandbox`; sync/async parity required. |
| Examples | [examples/](examples/), [extensions/examples/](extensions/examples/) | Many `README.md` files double as docs-site pages via mounts in `site/hugo.yaml`. |
| Tests | [test/e2e/](test/e2e/), [test/benchmarks/](test/benchmarks/), `*_test.go` next to code | E2E expects `bin/KUBECONFIG` from `make deploy-kind`. |
| Developer docs and KEPs | [docs/development.md](docs/development.md), [docs/testing.md](docs/testing.md), [docs/configuration.md](docs/configuration.md), [docs/keps/](docs/keps/) | Source of truth for contribution workflow and design proposals. |
| Tooling and CI | [dev/tools/](dev/tools/), [dev/ci/](dev/ci/) | Lint, generate, deploy-kind, release scripts, Prow presubmits. |
| Public site | [site/](site/) | Hugo + Docsy; many pages are includes of repo files — edits to mounted files change the public docs. |

Sources: [AGENTS.md:13-62](), [codegen.go:14-28]()

## Build, packaging, and operational shape

The container image is built from a multi-stage `Dockerfile` that compiles a static binary with version metadata injected via `-ldflags` into `internal/version`, then copies it into `gcr.io/distroless/static-debian13:nonroot`. The image entrypoint is the controller binary, so all configuration flows through the flags documented above.

```dockerfile
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build \
    -ldflags="-s -w -X sigs.k8s.io/agent-sandbox/internal/version.gitVersion=${GIT_VERSION} ..." \
    -o /agent-sandbox-controller ./cmd/agent-sandbox-controller
FROM gcr.io/distroless/static-debian13:nonroot
COPY --from=builder /agent-sandbox-controller /agent-sandbox-controller
ENTRYPOINT ["/agent-sandbox-controller"]
```

For day-to-day development, every common task is fronted by a `make` target (`make all`, `make build`, `make test-unit`, `make test-e2e`, `make lint-go`, `make lint-api`, `make deploy-kind`); the `deploy-kind` target additionally accepts `EXTENSIONS=true` and `CONTROLLER_ARGS="..."` to mirror the runtime split described above. Code generation (CRDs, RBAC, deepcopy, typed clients) is driven by the `//go:generate` directives in `codegen.go` and runs through `make fix-go-generate`.

Sources: [Dockerfile:1-37](), [AGENTS.md:36-49](), [codegen.go:14-28]()
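
The `-ldflags "-X ..."` mechanism used in the Dockerfile can be sketched in isolation. The example below places the variable in package `main` for brevity, whereas the project injects into `internal/version`. The default value `"unknown"` and the `versionString` helper are assumptions for illustration.

```go
package main

import "fmt"

// gitVersion can be overridden at link time, for example:
//
//	go build -ldflags "-X main.gitVersion=v0.0.0-example" .
//
// The -X flag only sets package-level string variables; a plain `go build`
// leaves the default below in place.
var gitVersion = "unknown"

// versionString formats the build info for a --version style output.
func versionString() string {
	return fmt.Sprintf("agent-sandbox-controller %s", gitVersion)
}

func main() {
	fmt.Println(versionString())
}
```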

## How to navigate from here

After reading this page, you should be able to:

- Deploy the controller to a local kind cluster via `make deploy-kind` (optionally `EXTENSIONS=true`) and know which CRDs land in the cluster.
- Open `cmd/agent-sandbox-controller/main.go` and identify which flag controls which reconciler and which side-effect (tracing, pprof, leader election, QPS shaping).
- Choose between `controllers/` and `extensions/controllers/` based on whether a change touches the core `Sandbox` primitive or a higher-level claim/template/warm-pool semantic.
- Find the right SDK surface for client work — `clients/go/` for Go, `clients/python/agentic-sandbox-client/` for Python (published as `k8s-agent-sandbox`), and `clients/k8s/` only when typed Kubernetes-style clients are needed.

Subsequent pages drill into each of these areas: the `Sandbox` reconciler's ownership model, the `SandboxClaim`/`SandboxTemplate`/`SandboxWarmPool` workflow, the shared queue and exclusivity model, the metrics/tracing internals, and the CRD/RBAC generation pipeline.

Sources: [AGENTS.md:36-62](), [cmd/agent-sandbox-controller/main.go:236-277]()

---

## 02. Installation & Deployment Modes

> Installing the controller via released YAML, Helm chart, or kind clusters; choosing between core-only and core+extensions deployments.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/02-installation-deployment-modes.md
- Generated: 2026-05-25T22:29:39.258Z

### Source Files

- `README.md`
- `helm/README.md`
- `k8s/controller.yaml`
- `k8s/extensions.controller.yaml`
- `dev/tools/deploy-to-kube`
- `dev/tools/create-kind-cluster`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [README.md](README.md)
- [helm/README.md](helm/README.md)
- [helm/Chart.yaml](helm/Chart.yaml)
- [helm/values.yaml](helm/values.yaml)
- [helm/templates/deployment.yaml](helm/templates/deployment.yaml)
- [helm/templates/_controller-args.tpl](helm/templates/_controller-args.tpl)
- [helm/templates/clusterrolebinding-extensions.yaml](helm/templates/clusterrolebinding-extensions.yaml)
- [helm/templates/namespace.yaml](helm/templates/namespace.yaml)
- [k8s/controller.yaml](k8s/controller.yaml)
- [k8s/extensions.controller.yaml](k8s/extensions.controller.yaml)
- [k8s/extensions.yaml](k8s/extensions.yaml)
- [k8s/rbac.generated.yaml](k8s/rbac.generated.yaml)
- [k8s/extensions-rbac.generated.yaml](k8s/extensions-rbac.generated.yaml)
- [dev/tools/deploy-to-kube](dev/tools/deploy-to-kube)
- [dev/tools/create-kind-cluster](dev/tools/create-kind-cluster)
- [dev/tools/release](dev/tools/release)
- [Makefile](Makefile)
</details>

# Installation & Deployment Modes

`agent-sandbox` ships as a single Go controller (`agent-sandbox-controller`) plus a small set of CRDs. The same controller binary can run in two reconciliation modes — **core only** (just the `Sandbox` CRD) or **core + extensions** (`Sandbox`, `SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool`) — selected at deploy time by a single `--extensions` flag. This page describes the three supported installation paths, what each path actually applies to the cluster, and how the two deployment modes differ at the manifest level.

The three installation paths correspond to three audiences: cluster operators consuming released artifacts (raw YAML from a GitHub release), platform teams that want templated, upgradeable installs (Helm chart in `helm/`), and contributors hacking on the controller locally (kind cluster scripts under `dev/tools/`). All three converge on the same set of cluster objects: a `Namespace`, a `ServiceAccount`, one or two `ClusterRoleBinding`s, a `Service` for metrics, and a `Deployment` running the controller image.

## What gets installed

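Regardless of installation method, a working install contains the objects below. The "Extensions only" column marks objects that appear only when extensions are enabled. The Helm chart is a partial exception: it always installs the extension CRDs and the extensions `ClusterRole`, and gates only the extra `ClusterRoleBinding` and the `--extensions` flag on `controller.extensions`.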

| Object | Kind | Source | Extensions only |
| --- | --- | --- | --- |
| `agent-sandbox-system` | `Namespace` | [k8s/controller.yaml:1-5](k8s/controller.yaml) / [helm/templates/namespace.yaml](helm/templates/namespace.yaml) | No |
| `agent-sandbox-controller` | `ServiceAccount` | [k8s/controller.yaml:9-15](k8s/controller.yaml) | No |
| `agent-sandbox-controller` | `ClusterRole` / `ClusterRoleBinding` | [k8s/rbac.generated.yaml:1-60](k8s/rbac.generated.yaml), [k8s/controller.yaml:19-30](k8s/controller.yaml) | No |
| `agent-sandbox-controller` | `Service` (metrics on `:8080`) | [k8s/controller.yaml:34-48](k8s/controller.yaml) | No |
| `agent-sandbox-controller` | `Deployment` | [k8s/controller.yaml:52-81](k8s/controller.yaml) or [k8s/extensions.controller.yaml:1-32](k8s/extensions.controller.yaml) | No (mode swap) |
| `sandboxes.agents.x-k8s.io` | `CustomResourceDefinition` | `k8s/crds/agents.x-k8s.io_sandboxes.yaml` | No |
| `agent-sandbox-controller-extensions` | `ClusterRole` / `ClusterRoleBinding` | [k8s/extensions-rbac.generated.yaml:1-87](k8s/extensions-rbac.generated.yaml), [k8s/extensions.yaml:1-14](k8s/extensions.yaml) | Yes |
| `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools` (`extensions.agents.x-k8s.io`) | `CustomResourceDefinition` | `k8s/crds/extensions.agents.x-k8s.io_*.yaml` | Yes |

The controller exposes two container ports: `metrics` on 8080 (also fronted by the `Service`) and `healthz` on 8081, used by the Helm chart's liveness/readiness probes ([helm/templates/deployment.yaml:30-48](helm/templates/deployment.yaml)).

## Deployment topology

```text
                  +------------------------------------------+
                  |  Namespace: agent-sandbox-system         |
                  |                                          |
                  |  ServiceAccount: agent-sandbox-controller|
                  |              |                           |
                  |              v                           |
                  |   Deployment: agent-sandbox-controller   |
                  |   args: --leader-elect=true              |
                  |         [--extensions  if enabled]       |
                  |   ports: metrics:8080  healthz:8081      |
                  |              |                           |
                  |              v                           |
                  |   Service: agent-sandbox-controller      |
                  |             (TCP/8080 metrics)           |
                  +-------------------|----------------------+
                                      |
   Cluster-scoped:                    |
   ClusterRole/ClusterRoleBinding "agent-sandbox-controller"
        (Sandbox CRD + pods/pvcs/services/leases/events)
   ClusterRole/ClusterRoleBinding "agent-sandbox-controller-extensions"
        (only when --extensions is set; adds SandboxClaim/Template/WarmPool
         plus networkpolicies)
   CRDs: agents.x-k8s.io_sandboxes
         extensions.agents.x-k8s.io_{sandboxclaims,sandboxtemplates,sandboxwarmpools}
```

This shape is identical across the three install paths; only the way the manifests are rendered and applied differs.

## Installation path 1 — Released YAML

The released-artifact path is what the README documents for production consumers ([README.md:85-101](README.md)). Two manifests are published on each GitHub release:

| Asset | Contents | Apply when |
| --- | --- | --- |
| `manifest.yaml` | Namespace, `ServiceAccount`, core `ClusterRole`/binding, `Service`, core `Deployment` (without `--extensions`), and the `sandboxes.agents.x-k8s.io` CRD | Always (core install) |
| `extensions.yaml` | Extensions `ClusterRoleBinding`, extensions CRDs (`sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools`) | After `manifest.yaml`, only if you want extensions enabled |

```sh
export VERSION="vX.Y.Z"
# Core only:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml
# Add extensions:
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml
```

Both files are produced by `dev/tools/release`. It globs everything under `k8s/`, splits files by basename — anything whose basename starts with `extensions` goes into `extensions.yaml`, everything else into `manifest.yaml` — then rewrites the `ko://` image placeholder with the real release image `registry.k8s.io/agent-sandbox/agent-sandbox-controller:<tag>` ([dev/tools/release:43-85](dev/tools/release)).
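
As a rough illustration of the basename rule, the split can be sketched in Python. This is not the code in `dev/tools/release`; the function name `split_release_manifests` and the sample paths are assumptions made for the example, and the image rewrite step is left out.

```python
from pathlib import PurePosixPath


def split_release_manifests(paths):
    """Partition manifest paths into the two release assets by basename.

    Hypothetical helper: it mirrors the rule described above, not the real script.
    """
    assets = {"manifest.yaml": [], "extensions.yaml": []}
    for path in sorted(paths):
        basename = PurePosixPath(path).name
        # Only the file name is inspected, never the directory it lives in.
        if basename.startswith("extensions"):
            assets["extensions.yaml"].append(path)
        else:
            assets["manifest.yaml"].append(path)
    return assets


if __name__ == "__main__":
    sample = [
        "k8s/controller.yaml",
        "k8s/extensions.yaml",
        "k8s/rbac.generated.yaml",
        "k8s/extensions-rbac.generated.yaml",
    ]
    for asset, members in split_release_manifests(sample).items():
        print(asset, members)
```

Because the rule looks only at the basename, a CRD file such as `k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml` lands in `extensions.yaml` even though it sits in a subdirectory.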

> Important: `extensions.yaml` only carries the extra RBAC binding and extension CRDs. It does **not** flip the controller into extensions mode by itself — the released `manifest.yaml` ships the **non-extensions** controller `Deployment` from [k8s/controller.yaml:52-81](k8s/controller.yaml), whose args are just `--leader-elect=true`. To actually run the extensions controllers you must either (a) install via Helm with `controller.extensions=true`, or (b) edit the deployed controller `Deployment` to add `--extensions` to its args (matching [k8s/extensions.controller.yaml:23-25](k8s/extensions.controller.yaml)).

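Option (b) amounts to appending one argument to the controller container. A minimal Python sketch, assuming the Deployment has already been loaded into a dict (for example from `kubectl get deployment ... -o json`) and that the controller container is named `agent-sandbox-controller`. Both the helper name `enable_extensions` and the container name are assumptions, not values taken from the manifests.

```python
import copy


def enable_extensions(deployment, container_name="agent-sandbox-controller"):
    """Return a copy of a Deployment dict with --extensions in the controller args.

    Hypothetical helper; container_name is an assumed default.
    """
    patched = copy.deepcopy(deployment)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        if container.get("name") != container_name:
            continue
        args = container.setdefault("args", [])
        already_set = any(
            arg == "--extensions" or arg.startswith("--extensions=") for arg in args
        )
        if not already_set:
            args.append("--extensions")
    return patched


if __name__ == "__main__":
    core = {
        "spec": {"template": {"spec": {"containers": [
            {"name": "agent-sandbox-controller", "args": ["--leader-elect=true"]},
        ]}}}
    }
    print(enable_extensions(core)["spec"]["template"]["spec"]["containers"][0]["args"])
```

The helper is idempotent, so running it against a Deployment that already carries the flag leaves the args unchanged.
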
## Installation path 2 — Helm chart

The Helm chart in `helm/` is the recommended path when you want a single, repeatable, value-driven install with first-class support for the mode toggle. Chart metadata: `apiVersion: v2`, `name: agent-sandbox`, `version: 0.1.0` ([helm/Chart.yaml:1-9](helm/Chart.yaml)).

```bash
# Core only
helm install agent-sandbox ./helm/ \
  --namespace agent-sandbox-system \
  --create-namespace \
  --set image.tag=<version>

# Core + extensions (WarmPool, Template, Claim)
helm install agent-sandbox ./helm/ \
  --namespace agent-sandbox-system \
  --create-namespace \
  --set image.tag=<version> \
  --set controller.extensions=true
```

Sourced from [helm/README.md:7-37](helm/README.md). `image.tag` is **required** — `values.yaml` ships an empty string so you must supply it explicitly ([helm/values.yaml:6-9](helm/values.yaml)).

### How the chart wires the mode flag

A single value, `controller.extensions`, drives both the controller flag and the extra RBAC binding:

1. The args template renders `--extensions=true` into the controller container args when the value is true ([helm/templates/_controller-args.tpl:11-13](helm/templates/_controller-args.tpl)).
2. The extra `ClusterRoleBinding` template is wrapped in `{{- if .Values.controller.extensions }}` so it is only produced in extensions mode ([helm/templates/clusterrolebinding-extensions.yaml:1-16](helm/templates/clusterrolebinding-extensions.yaml)).
3. The base `ClusterRole` for extensions, the `Service`, `ServiceAccount`, and CRDs are always present. Helm installs anything in `helm/crds/` automatically before other resources, including the extension CRDs ([helm/README.md:3-5](helm/README.md), [helm/crds/](helm/crds/)).

That last point creates two important operational caveats called out in `helm/README.md`:

- **CRDs are not upgraded** by `helm upgrade`. After bumping the chart version, apply CRDs manually: `kubectl apply -f helm/crds/` ([helm/README.md:48-53](helm/README.md)).
- **CRDs are not deleted** by `helm uninstall`. Removing them with `kubectl delete -f helm/crds/` will cascade-delete every `Sandbox`, `SandboxWarmPool`, `SandboxTemplate`, and `SandboxClaim` in the cluster ([helm/README.md:54-67](helm/README.md)).

### Controller flags exposed through values

The `_controller-args.tpl` partial is the single source of truth for what flags the chart produces. Each value in `.Values.controller.*` maps to one controller flag, and unset values are omitted entirely:

| Helm value | Flag emitted | Purpose |
| --- | --- | --- |
| `controller.leaderElect` | `--leader-elect=` | Enable leader election (default `true`) |
| `controller.leaderElectionNamespace` | `--leader-election-namespace=` | Override lease namespace |
| `controller.clusterDomain` | `--cluster-domain=` | FQDN suffix used by the controller |
| `controller.extensions` | `--extensions=` | Switch to core + extensions mode |
| `controller.enableTracing` | `--enable-tracing=` | OTLP tracing |
| `controller.enablePprof` / `enablePprofDebug` | `--enable-pprof=`, `--enable-pprof-debug=` | Profiling endpoints |
| `controller.pprofBlockProfileRate` / `pprofMutexProfileFraction` | `--pprof-block-profile-rate=`, `--pprof-mutex-profile-fraction=` | Profiling tunables |
| `controller.kubeApiQps` / `kubeApiBurst` | `--kube-api-qps=`, `--kube-api-burst=` | API client throttle |
| `controller.sandboxConcurrentWorkers` | `--sandbox-concurrent-workers=` | Worker pool for `Sandbox` controller |
| `controller.sandboxClaimConcurrentWorkers` | `--sandbox-claim-concurrent-workers=` | Extensions worker pool |
| `controller.sandboxWarmPoolConcurrentWorkers` | `--sandbox-warm-pool-concurrent-workers=` | Extensions worker pool |
| `controller.sandboxTemplateConcurrentWorkers` | `--sandbox-template-concurrent-workers=` | Extensions worker pool |
| `controller.extraArgs` | passthrough | Anything not modeled above (e.g. `--zap-log-level=debug`) |

Defaults and full descriptions are tabulated in [helm/README.md:70-101](helm/README.md). Worker-count flags for extensions resources are only meaningful when `controller.extensions=true`; the controller manager simply will not start those reconcilers otherwise.
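
The value-to-flag mapping can be pictured as a small lookup table. The Python sketch below is only an approximation of what the Helm partial does: the mapping covers a subset of the table above, the function name `render_controller_args` is hypothetical, and the way `false` booleans are rendered for flags other than `--extensions` is an assumption.

```python
# Assumed subset of the value -> flag table above (keys live under .Values.controller).
VALUE_TO_FLAG = {
    "leaderElect": "--leader-elect",
    "leaderElectionNamespace": "--leader-election-namespace",
    "clusterDomain": "--cluster-domain",
    "extensions": "--extensions",
    "kubeApiQps": "--kube-api-qps",
    "kubeApiBurst": "--kube-api-burst",
    "sandboxConcurrentWorkers": "--sandbox-concurrent-workers",
}

# Per the text, --extensions=true is rendered only when the value is true.
EMIT_ONLY_WHEN_TRUE = {"extensions"}


def render_controller_args(controller_values):
    """Build the controller argument list from a dict of chart values."""
    args = []
    for key, flag in VALUE_TO_FLAG.items():
        value = controller_values.get(key)
        if value is None:
            continue  # unset values are omitted entirely
        if key in EMIT_ONLY_WHEN_TRUE and not value:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"{flag}={value}")
    # extraArgs is passed through untouched, after the modeled flags.
    args.extend(controller_values.get("extraArgs") or [])
    return args


if __name__ == "__main__":
    print(render_controller_args({"leaderElect": True, "extensions": True}))
```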

### Reusing an existing namespace

`namespace.create` (default `true`) gates the `Namespace` object ([helm/templates/namespace.yaml:1-8](helm/templates/namespace.yaml)). To install into a namespace you manage yourself, set both `namespace.create=false` and `namespace.name=<your-ns>`; the deployment, service account, service, and the subject of the extensions `ClusterRoleBinding` will all resolve to that namespace through the `agent-sandbox.namespace` helper ([helm/README.md:29-37](helm/README.md), [helm/templates/deployment.yaml:5](helm/templates/deployment.yaml)).

## Installation path 3 — kind for local development

The kind path is wired through the `deploy-kind` Make target and a set of Python scripts under `dev/tools/`. It is the only path that builds the controller image from your working tree before deploying it.

```bash
# Core only
make deploy-kind

# Core + extensions
EXTENSIONS=true make deploy-kind

# Custom controller flags (do NOT use this to enable extensions)
CONTROLLER_ARGS="--enable-pprof-debug --zap-log-level=debug" make deploy-kind
```

These invocations come straight from [Makefile:31-40](Makefile). The target runs three scripts in sequence: `create-kind-cluster`, `push-images`, and `deploy-to-kube`.

### `create-kind-cluster`

Bootstraps (or recreates) a kind cluster named `agent-sandbox` using an inline `kind.x-k8s.io/v1alpha4` `Cluster` config. The config enables `NodeRestriction` and `OwnerReferencesPermissionEnforcement` admission plugins and turns kubelet verbosity up to `8`; `--containerd-loglevel debug` is the default, and `--kubeconfig bin/KUBECONFIG` exports a kubeconfig for the new cluster ([dev/tools/create-kind-cluster:24-101](dev/tools/create-kind-cluster)). `--recreate` deletes any existing cluster of the same name first ([dev/tools/create-kind-cluster:55-72](dev/tools/create-kind-cluster)).

### `deploy-to-kube`

This is the script that ties the deployment-mode selection to the manifest set. The flow is gather → process → apply ([dev/tools/deploy-to-kube:236-251](dev/tools/deploy-to-kube)):

```text
k8s/**/*.yaml
   |
   v
gather_manifests   --> [ {doc, filename, kind}, ... ]
   |
   v
process_manifests( --extensions ? )
   |   - Skip files whose basename starts with "extensions" unless --extensions
   |   - If --extensions: drop the *core* controller Deployment (replaced by
   |     k8s/extensions.controller.yaml which sets --extensions in args)
   |   - Rewrite ko://... image refs to <image-prefix><cmd>:<tag>
   |   - Append --controller-args flags to the controller container
   |
   v
apply_manifests in three ordered batches via `kubectl apply -f -`:
   1. prereq: Namespace + CRDs
   2. other:  ServiceAccount, ClusterRole(s), ClusterRoleBinding, Service, Deployment
   3. extensions: only if --extensions (extensions ClusterRoleBinding)
```

The mode swap is explicit at [dev/tools/deploy-to-kube:185-194](dev/tools/deploy-to-kube): when `--extensions` is set the core `Deployment` from `k8s/controller.yaml` is skipped because the deployment in `k8s/extensions.controller.yaml` — which adds `--extensions` to the container args ([k8s/extensions.controller.yaml:23-25](k8s/extensions.controller.yaml)) — replaces it.
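
The filtering half of that flow can be sketched as follows. This is a simplified illustration, not the script's actual code: the function reuses the name `process_manifests` from the flow above but its signature is assumed, it treats every `Deployment` outside an `extensions*` file as the core controller Deployment, and it skips the image rewrite and argument-append steps.

```python
from pathlib import PurePosixPath


def process_manifests(manifests, extensions=False):
    """Select the manifests to apply for the chosen deployment mode.

    Each manifest is a dict with "filename", "kind", and "doc" keys.
    """
    selected = []
    for manifest in manifests:
        basename = PurePosixPath(manifest["filename"]).name
        from_extensions_file = basename.startswith("extensions")
        if from_extensions_file and not extensions:
            continue  # core-only mode ignores every extensions* file
        if extensions and manifest["kind"] == "Deployment" and not from_extensions_file:
            continue  # replaced by the Deployment in extensions.controller.yaml
        selected.append(manifest)
    return selected


if __name__ == "__main__":
    gathered = [
        {"filename": "k8s/controller.yaml", "kind": "Deployment", "doc": {}},
        {"filename": "k8s/controller.yaml", "kind": "Service", "doc": {}},
        {"filename": "k8s/extensions.controller.yaml", "kind": "Deployment", "doc": {}},
    ]
    for mode in (False, True):
        kept = process_manifests(gathered, extensions=mode)
        print(mode, [(m["filename"], m["kind"]) for m in kept])
```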

Image rewriting is path-aware: a value like `ko://sigs.k8s.io/agent-sandbox/cmd/agent-sandbox-controller` is rewritten to `<image_prefix>agent-sandbox-controller:<tag>` ([dev/tools/deploy-to-kube:37-51](dev/tools/deploy-to-kube)). For kind builds the prefix is `kind.local/` and the tag is generated from date + `git describe --always --dirty` when not pinned via `IMAGE_TAG` ([Makefile:39-40](Makefile), [dev/tools/shared/utils.py:33-40](dev/tools/shared/utils.py)).
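
A minimal sketch of that rewrite, assuming the command name is simply the last path segment of the `ko://` reference. The helper name `rewrite_ko_image` and the sample tag are made up for the example and may not match the script.

```python
KO_SCHEME = "ko://"


def rewrite_ko_image(image, image_prefix, tag):
    """Turn a ko:// import path into '<image_prefix><cmd>:<tag>'.

    References that do not use the ko:// scheme are returned unchanged.
    """
    if not image.startswith(KO_SCHEME):
        return image
    import_path = image[len(KO_SCHEME):].rstrip("/")
    command = import_path.rsplit("/", 1)[-1]  # last path segment
    return f"{image_prefix}{command}:{tag}"


if __name__ == "__main__":
    print(rewrite_ko_image(
        "ko://sigs.k8s.io/agent-sandbox/cmd/agent-sandbox-controller",
        "kind.local/",
        "example-tag",
    ))
```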

`--controller-args` (or `CONTROLLER_ARGS` in the Makefile) appends arbitrary flags to the controller container — useful for one-off debugging — but the Makefile help and the script's own help string warn that it must **not** be used to set `--extensions`; use the `EXTENSIONS=true` / `--extensions` toggle so the matching RBAC and Deployment swap also happen ([Makefile:35](Makefile), [dev/tools/deploy-to-kube:268-273](dev/tools/deploy-to-kube), [dev/tools/deploy-to-kube:56-69](dev/tools/deploy-to-kube)).
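
The source describes this only as a warning in help text; whether the script enforces it is not stated here. If you wrap the deploy step in your own tooling, a guard along these lines is one hypothetical way to catch the mistake early. The function name `parse_controller_args` is an assumption.

```python
import shlex


def parse_controller_args(raw):
    """Split a CONTROLLER_ARGS-style string and refuse the --extensions flag."""
    args = shlex.split(raw)
    for arg in args:
        if arg == "--extensions" or arg.startswith("--extensions="):
            raise ValueError(
                "do not pass --extensions via controller args; "
                "use EXTENSIONS=true so RBAC and the Deployment swap also happen"
            )
    return args


if __name__ == "__main__":
    print(parse_controller_args("--enable-pprof-debug --zap-log-level=debug"))
```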

`make delete-kind` tears the cluster down via `kind delete cluster --name agent-sandbox` ([Makefile:46-48](Makefile)).

## Core vs. core + extensions: choosing a mode

Both modes run the same binary; the difference is what reconcilers it activates and what RBAC and CRDs are required to back them.

| Aspect | Core only | Core + extensions |
| --- | --- | --- |
| Controller arg | `--leader-elect=true` | `--leader-elect=true` plus `--extensions` (or `--extensions=true`) |
| CRDs required | `sandboxes.agents.x-k8s.io` | The above plus `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools` (all under `extensions.agents.x-k8s.io`) |
| Extra ClusterRoleBinding | none | `agent-sandbox-controller-extensions` (grants `extensions.agents.x-k8s.io/*` plus `networkpolicies`) |
| What the controller reconciles | `Sandbox` | `Sandbox`, `SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool` |
| Released YAML | `manifest.yaml` | `manifest.yaml` + `extensions.yaml` (and you must add `--extensions` to the Deployment args) |
| Helm | default (`controller.extensions=false`) | `--set controller.extensions=true` |
| kind / Make | `make deploy-kind` | `EXTENSIONS=true make deploy-kind` |
| Worker-count flags | `--sandbox-concurrent-workers` | Above plus `--sandbox-claim-…`, `--sandbox-warm-pool-…`, `--sandbox-template-concurrent-workers` |

The extensions-only RBAC additions are visible in [k8s/extensions-rbac.generated.yaml:51-87](k8s/extensions-rbac.generated.yaml): rights over `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools` (and their `/status`, `/finalizers` subresources) and over `networking.k8s.io/networkpolicies`, which the extension controllers create around warm-pool / claim sandboxes.

A practical rule of thumb: choose **core only** when you just want users (or another controller) to author `Sandbox` objects directly; choose **core + extensions** when you need pooling (`SandboxWarmPool`), templated provisioning (`SandboxTemplate`), or self-service claims (`SandboxClaim`).

## Summary

Three installation paths front the same controller deployment: raw YAML from GitHub releases for production consumers, the `helm/` chart for templated installs with a clean `controller.extensions` toggle, and the `dev/tools/deploy-to-kube` + `create-kind-cluster` pair for local kind development driven from the `Makefile`. Across all three, the mode choice — core only vs. core + extensions — flips a single controller flag (`--extensions`), adds one cluster role binding, and brings the three extension CRDs into play; everything else (namespace, service account, metrics service, core CRD, core RBAC, Deployment shape) is identical.

---

## 03. Quickstart Paths (gVisor, Kata, Vanilla)

> Walkthrough of the quickstart manifests for vanilla, gVisor, and Kata runtimes, plus how the examples directory organizes runnable scenarios.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/03-quickstart-paths-gvisor-kata-vanilla.md
- Generated: 2026-05-25T22:28:08.521Z

### Source Files

- `examples/quickstart/README.md`
- `examples/quickstart/gvisor.md`
- `examples/quickstart/kata-containers.md`
- `examples/hello-world-sandbox`
- `examples/README.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [examples/quickstart/README.md](examples/quickstart/README.md)
- [examples/quickstart/gvisor.md](examples/quickstart/gvisor.md)
- [examples/quickstart/kata-containers.md](examples/quickstart/kata-containers.md)
- [examples/hello-world-sandbox/README.md](examples/hello-world-sandbox/README.md)
- [examples/hello-world-sandbox/hello-world.yaml](examples/hello-world-sandbox/hello-world.yaml)
- [examples/hello-world-sandbox/Dockerfile](examples/hello-world-sandbox/Dockerfile)
- [examples/README.md](examples/README.md)
- [examples/kata-gke-sandbox/README.md](examples/kata-gke-sandbox/README.md)
- [clients/python/agentic-sandbox-client/python-sandbox-template.yaml](clients/python/agentic-sandbox-client/python-sandbox-template.yaml)
- [extensions/examples/sandboxwarmpool.yaml](extensions/examples/sandboxwarmpool.yaml)
</details>

# Quickstart Paths (gVisor, Kata, Vanilla)

The Agent Sandbox repository ships a single "Secure Agent Sandbox Quickstart" that walks the operator through provisioning a working cluster, installing the controller and extensions, applying a `SandboxTemplate`/`SandboxWarmPool`, deploying the router, and exercising the Python SDK. The walkthrough is structured as a linear ten-step path that branches at a single decision point: whether to use a plain KIND cluster (the **vanilla** path), a KIND cluster augmented with the `runsc`/gVisor shim (the **gVisor** path), or a minikube cluster with Kata Containers installed via Helm (the **Kata** path). Steps 3–10 follow the same sequence on all three paths; the only path-specific details are cluster creation (Step 2), the optional `runtimeClassName` (Step 4), router image loading (Step 7), and cleanup (Step 10).

This page explains how those three quickstart variants relate, what each branch changes in the shared steps, and how the larger `examples/` directory organizes other runnable scenarios that build on the same `Sandbox`, `SandboxTemplate`, `SandboxClaim`, and `SandboxWarmPool` resources.

Sources: [examples/quickstart/README.md:1-54](examples/quickstart/README.md), [examples/README.md:1-20](examples/README.md)

## Topology of the Quickstart

The quickstart README is authoritative; the gVisor and Kata documents are "branch points" that handle only cluster-runtime setup, then redirect the reader back to Step 3 of the main README. All three branches converge on the same controller install, the same `SandboxTemplate` file (with one `runtimeClassName` line conditionally uncommented), and the same Python SDK test client.

```text
                Step 1: clone repo + export env vars
                                │
                                ▼
       ┌─────────── Step 2: create cluster ───────────┐
       │                       │                      │
       ▼                       ▼                      ▼
  vanilla KIND          KIND + gVisor          minikube + Kata
  (basic cluster)    (gvisor.md branch)    (kata-containers.md branch)
       │                       │                      │
       └──────► Step 3: install controller + extensions ◄──────┘
                                │
                                ▼
       Step 4: apply python-sandbox-template.yaml
         (optionally uncomment runtimeClassName: gvisor|kata-qemu)
                                │
                                ▼
       Step 5: SandboxWarmPool  →  5.1 SandboxClaim probe
                                │
                                ▼
       Step 6–7: build, load, deploy sandbox-router
         (kind load docker-image  OR  minikube image load)
                                │
                                ▼
       Step 8–9: install Python SDK + run test_client.py
                                │
                                ▼
       Step 10: kubectl delete ns + (kind|minikube) delete
```

Sources: [examples/quickstart/README.md:25-300](examples/quickstart/README.md), [examples/quickstart/gvisor.md:1-57](examples/quickstart/gvisor.md), [examples/quickstart/kata-containers.md:1-77](examples/quickstart/kata-containers.md)

## The Vanilla Path (KIND, no isolation)

The default path uses an unmodified KIND cluster: no custom `containerd` patches, no `RuntimeClass`, no virtualization prerequisites. The user runs `kind create cluster --name agent-sandbox-demo`, installs the controller by following the project root `README.md`, and proceeds straight to applying the template. The `SandboxTemplate` is applied as-is, with both `runtimeClassName` lines left commented out, so pods run under whatever default runtime KIND provides on the host.

The cluster is later torn down with `kind delete cluster --name agent-sandbox-demo` in Step 10.

Sources: [examples/quickstart/README.md:43-49](examples/quickstart/README.md), [examples/quickstart/README.md:280-300](examples/quickstart/README.md), [clients/python/agentic-sandbox-client/python-sandbox-template.yaml:1-37](clients/python/agentic-sandbox-client/python-sandbox-template.yaml)

## The gVisor Path (KIND + runsc)

The gVisor branch is documented in `examples/quickstart/gvisor.md`. It still uses KIND, but adds:

1. A Linux host prerequisite (gVisor's `runsc` only runs on Linux).
2. A `kind-config.yaml` that patches `containerd` to register the `runsc` runtime and bind-mounts the `runsc` and `containerd-shim-runsc-v1` binaries from `/usr/local/bin/` on the host into the KIND node.
3. A `RuntimeClass` named `gvisor` with handler `runsc`, applied after the cluster is up.

```yaml
# examples/quickstart/gvisor.md (cluster config excerpt)
containerdConfigPatches:
- |-
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
    runtime_type = "io.containerd.runsc.v1"
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /usr/local/bin/runsc
    containerPath: /usr/local/bin/runsc
  - hostPath: /usr/local/bin/containerd-shim-runsc-v1
    containerPath: /usr/local/bin/containerd-shim-runsc-v1
```

The guide notes that `io.containerd.runsc.v1` is gVisor's shim implementation version, not the containerd shim-protocol version. After cluster creation and `RuntimeClass` apply, the user returns to README Step 3 and uncomments `runtimeClassName: gvisor` in `python-sandbox-template.yaml` before Step 4.

Because the cluster is still KIND, Step 7's image-load command remains `kind load docker-image`, and the Step 10 cleanup is the same `kind delete cluster` as the vanilla path.

The appendix supplies gVisor-specific validation: `kubectl get pod -o jsonpath='{.spec.runtimeClassName}'` should print `gvisor`, `dmesg | head` should show gVisor's boot messages, and `ls /dev | wc -l` should report roughly 16 devices, far fewer than a normal container.

Sources: [examples/quickstart/gvisor.md:1-99](examples/quickstart/gvisor.md), [examples/quickstart/README.md:51-55](examples/quickstart/README.md), [clients/python/agentic-sandbox-client/python-sandbox-template.yaml:10-12](clients/python/agentic-sandbox-client/python-sandbox-template.yaml)

## The Kata Path (minikube + kata-deploy)

The Kata Containers branch is documented in `examples/quickstart/kata-containers.md`. Unlike gVisor, it does not use KIND — Kata needs VM-level virtualization, so the guide spins up a minikube profile named `agent-sandbox-kata` with the `kvm2` driver and `containerd` runtime:

```bash
minikube start \
  --driver=kvm2 \
  --container-runtime=containerd \
  --cpus=4 \
  --memory=8192 \
  --profile=agent-sandbox-kata
```

Kata itself is installed via its upstream Helm chart in the `kube-system` namespace, then the minikube node is labeled `kata-containers=enabled`. A quick smoke test creates a `busybox` pod with `runtimeClassName: kata-qemu` and verifies it boots a different kernel from the host.

Switching to minikube affects three places in the main README:

- Step 4: uncomment `runtimeClassName: kata-qemu` in `python-sandbox-template.yaml`.
- Step 7: use `minikube image load ${ROUTER_IMAGE} -p agent-sandbox-kata` instead of `kind load`.
- Step 10: use `minikube delete -p agent-sandbox-kata` instead of `kind delete cluster`.

The appendix validates Kata isolation by checking `runtimeClassName: kata-qemu`, by running `uname -r` inside the pod (should differ from the host kernel), and by grepping `/proc/cpuinfo` for the `hypervisor` flag.

Sources: [examples/quickstart/kata-containers.md:1-114](examples/quickstart/kata-containers.md), [examples/quickstart/README.md:165-181](examples/quickstart/README.md), [examples/quickstart/README.md:295-300](examples/quickstart/README.md)
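
The two Kata checks from the appendix reduce to simple text comparisons once the command output has been captured. The sketch below assumes an x86-style `/proc/cpuinfo` with a `flags` line; the helper names and the sample strings are illustrative only.

```python
def has_hypervisor_flag(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo lists 'hypervisor'."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "hypervisor" in value.split():
            return True
    return False


def kernel_differs(host_release, pod_release):
    """Compare `uname -r` output from the host and from inside the pod."""
    return host_release.strip() != pod_release.strip()


if __name__ == "__main__":
    sample_cpuinfo = "processor\t: 0\nflags\t\t: fpu vme sse2 hypervisor\n"
    print(has_hypervisor_flag(sample_cpuinfo))
    print(kernel_differs("6.8.0-generic\n", "6.12.1\n"))
```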

## What the Three Paths Change

The quickstart deliberately keeps the diff between the three paths small. The table below summarizes the only steps where the path actually matters; everything else (controller install, namespace, WarmPool, SDK install, test client) is identical.

| Concern                       | Vanilla (KIND)                 | gVisor (KIND + runsc)                              | Kata (minikube + kata-deploy)                              |
|-------------------------------|--------------------------------|----------------------------------------------------|------------------------------------------------------------|
| Host requirement              | Docker only                    | Linux host with `runsc` + `containerd-shim-runsc-v1` installed | KVM/QEMU, ≥4 CPU, ≥8 GB RAM, ≥20 GB disk                   |
| Cluster command (Step 2)      | `kind create cluster …`        | `kind create cluster … --config kind-config.yaml`  | `minikube start --driver=kvm2 --container-runtime=containerd --profile=agent-sandbox-kata` |
| Extra cluster-level resource  | None                           | `RuntimeClass/gvisor` (handler `runsc`)            | Kata Helm chart in `kube-system`, node label `kata-containers=enabled` |
| `python-sandbox-template.yaml`| Both `runtimeClassName` lines stay commented | Uncomment `runtimeClassName: gvisor`     | Uncomment `runtimeClassName: kata-qemu`                    |
| Router image load (Step 7.1)  | `kind load docker-image …`     | `kind load docker-image …`                         | `minikube image load … -p agent-sandbox-kata`              |
| Cleanup (Step 10)             | `kind delete cluster --name agent-sandbox-demo` | same as vanilla                       | `minikube delete -p agent-sandbox-kata`                    |
| Isolation validation          | None                           | `dmesg` shows gVisor boot, `/dev` ≈16 entries      | `uname -r` differs from host, `/proc/cpuinfo` has `hypervisor` |

Sources: [examples/quickstart/README.md:43-181](examples/quickstart/README.md), [examples/quickstart/gvisor.md:17-93](examples/quickstart/gvisor.md), [examples/quickstart/kata-containers.md:19-113](examples/quickstart/kata-containers.md)

## The Shared Backbone: Template, WarmPool, Router, SDK

All three quickstart paths converge on the same `SandboxTemplate` file at `clients/python/agentic-sandbox-client/python-sandbox-template.yaml`. The template uses `apiVersion: extensions.agents.x-k8s.io/v1alpha1`, is parameterized by `${SANDBOX_NAMESPACE}` and `${SANDBOX_TEMPLATE_NAME}` (substituted in via `envsubst`), and embeds a `podTemplate.spec` whose two `runtimeClassName` candidates are intentionally left commented for the operator to choose:

```yaml
# clients/python/agentic-sandbox-client/python-sandbox-template.yaml
spec:
  podTemplate:
    spec:
      # Optional: uncomment one of the lines below to enable sandbox isolation.
      # runtimeClassName: gvisor      # gVisor (see examples/quickstart/gvisor.md)
      # runtimeClassName: kata-qemu   # Kata Containers (see examples/quickstart/kata-containers.md)
      containers:
      - name: python-runtime
        image: us-central1-docker.pkg.dev/k8s-staging-images/agent-sandbox/python-runtime-sandbox:latest-main
        ports:
        - containerPort: 8888
```
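
The quickstart expects the operator to uncomment the chosen line by hand. For scripted runs, the same edit can be approximated with a regular expression. This is a sketch under the assumption that the commented lines look exactly like the excerpt above; `select_runtime_class` is a hypothetical helper, not part of the repository.

```python
import re


def select_runtime_class(template_text, runtime_class):
    """Uncomment the 'runtimeClassName: <runtime_class>' line, if present.

    The trailing explanatory comment on that line is dropped; other
    commented candidates are left untouched.
    """
    pattern = re.compile(
        r"^(?P<indent>[ \t]*)#[ \t]*"
        r"(?P<line>runtimeClassName:[ \t]*" + re.escape(runtime_class) + r")"
        r"(?=\s|$).*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: m.group("indent") + m.group("line"), template_text)


if __name__ == "__main__":
    excerpt = (
        "    spec:\n"
        "      # runtimeClassName: gvisor      # gVisor\n"
        "      # runtimeClassName: kata-qemu   # Kata Containers\n"
        "      containers:\n"
    )
    print(select_runtime_class(excerpt, "kata-qemu"))
```

For the vanilla path no call is needed, since both lines stay commented.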

Step 5 then layers a `SandboxWarmPool` (`extensions.agents.x-k8s.io/v1alpha1`) on top of that template with two replicas, and Step 5.1 proves the WarmPool works by creating a one-off `SandboxClaim` and inspecting the `agents.x-k8s.io/pod-name` annotation. The README explicitly documents that WarmPool produces *pods* (not Sandbox resources), labels them `agents.x-k8s.io/pool=<hash>`, and that on claim the controller swaps that label for `sandbox-name-hash=<hash>` and refills the pool to maintain replica count.

The router (`clients/python/agentic-sandbox-client/sandbox-router/`) is built locally as `${ROUTER_IMAGE}` (default `sandbox-router:local`), loaded into whichever cluster type was created, then applied with `imagePullPolicy: Never` uncommented so the cluster does not try to pull `sandbox-router:local` from a registry. Step 7.4 health-checks it via `kubectl port-forward svc/sandbox-router-svc 8080:8080` and a `curl /healthz` that expects `{"status":"ok"}`. Step 8 installs the Python SDK with `pip install -e clients/python/agentic-sandbox-client`, and Step 9 runs `test_client.py --template-name python-runtime-template --namespace agent-sandbox-demo` to exercise the full lifecycle.

Step 9.2 contains a small shell snippet that compares `SandboxClaim.metadata.creationTimestamp` with the underlying pod's `metadata.creationTimestamp`. If the pod predates the claim, the test reports "Pod was PRE-WARMED from WarmPool!". This timestamp comparison is the quickstart's only measurement-based evidence that the WarmPool is serving claims.

Sources: [clients/python/agentic-sandbox-client/python-sandbox-template.yaml:1-37](clients/python/agentic-sandbox-client/python-sandbox-template.yaml), [examples/quickstart/README.md:76-274](examples/quickstart/README.md), [extensions/examples/sandboxwarmpool.yaml:1-11](extensions/examples/sandboxwarmpool.yaml)
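
The pre-warm check in Step 9.2 can be sketched in Go as follows. This is a minimal illustration, not the quickstart's shell snippet: the function name `preWarmed` and the sample timestamps are hypothetical, and it assumes both values arrive as RFC 3339 strings, the format Kubernetes uses for `creationTimestamp`.

```go
package main

import (
	"fmt"
	"time"
)

// preWarmed reports whether the pod was created strictly before the claim.
// Both arguments are RFC 3339 creationTimestamp strings.
// The name and signature are hypothetical, chosen for this sketch.
func preWarmed(claimCreated, podCreated string) (bool, error) {
	claim, err := time.Parse(time.RFC3339, claimCreated)
	if err != nil {
		return false, fmt.Errorf("parse claim timestamp: %w", err)
	}
	pod, err := time.Parse(time.RFC3339, podCreated)
	if err != nil {
		return false, fmt.Errorf("parse pod timestamp: %w", err)
	}
	return pod.Before(claim), nil
}

func main() {
	// Made-up timestamps, not taken from a real cluster.
	ok, err := preWarmed("2025-01-01T12:00:30Z", "2025-01-01T12:00:00Z")
	if err != nil {
		panic(err)
	}
	fmt.Println("pre-warmed:", ok)
}
```

Because `creationTimestamp` has one-second resolution, a pod and claim created within the same second compare as equal and are reported as not pre-warmed by this sketch.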

## Where `hello-world-sandbox` Fits

`examples/hello-world-sandbox/` is a deliberately minimal example that uses the core `Sandbox` CRD directly, without the extensions-layer template/claim/warmpool machinery and without isolation. Its `hello-world.yaml` is a single `agents.x-k8s.io/v1alpha1 Sandbox` with an inline `podTemplate.spec`:

```yaml
# examples/hello-world-sandbox/hello-world.yaml
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: hello-world
spec:
  podTemplate:
    spec:
      containers:
      - name: my-container
        image: ${IMAGE}
      restartPolicy: Never
```

The accompanying README walks through building a tiny `alpine:latest` image whose entrypoint is `echo 'Hello, World from Kubernetes!' && sleep 3600`, pushing it to a Google Artifact Registry repository, and applying the manifest via `cat hello-world.yaml | envsubst | kubectl apply -f -`. Verification uses `kubectl get sandbox hello-world`, `kubectl describe pod hello-world`, and `kubectl logs hello-world -c my-container`. This example is what the quickstart README's `Sandbox` bullet refers to as "Core isolated environment for running untrusted code" — it shows that the controller can reconcile a bare `Sandbox` into a `Pod` without any of the extensions API surface.

Sources: [examples/hello-world-sandbox/README.md:1-87](examples/hello-world-sandbox/README.md), [examples/hello-world-sandbox/hello-world.yaml:1-11](examples/hello-world-sandbox/hello-world.yaml), [examples/hello-world-sandbox/Dockerfile:1-5](examples/hello-world-sandbox/Dockerfile)
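
For readers unfamiliar with `envsubst`, the substitution step can be approximated in Go with the standard library's `os.Expand`. This is only a sketch of the idea: the helper `render` and the image reference `example.invalid/hello-world:dev` are assumptions for illustration, and the README itself uses the shell pipeline shown above.

```go
package main

import (
	"fmt"
	"os"
)

// manifest mirrors the hello-world Sandbox shown above.
const manifest = `apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: hello-world
spec:
  podTemplate:
    spec:
      containers:
      - name: my-container
        image: ${IMAGE}
      restartPolicy: Never
`

// render replaces ${VAR} references with values from vars.
// Unknown variables become empty strings, as with envsubst.
// The helper name is hypothetical.
func render(tmpl string, vars map[string]string) string {
	return os.Expand(tmpl, func(key string) string { return vars[key] })
}

func main() {
	// The image reference below is a placeholder, not a real registry path.
	fmt.Print(render(manifest, map[string]string{"IMAGE": "example.invalid/hello-world:dev"}))
}
```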

## How `examples/` Is Organized

`examples/README.md` enumerates the runnable scenarios under `examples/`. They fall into a few rough groups, all of which depend on the same controller installed by the quickstart:

| Category                                   | Subdirectories                                                                                                                                |
|--------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|
| Onboarding / quickstart                    | `quickstart/`, `hello-world-sandbox/`, `python-sdk-quickstart/`                                                                               |
| Runtime images for sandboxed workloads     | `python-runtime-sandbox/`, `jupyterlab/`, `vscode-sandbox/`, `chrome-sandbox/`, `aio-sandbox/`, `openclaw-sandbox/`                            |
| Agent framework integrations               | `code-interpreter-agent-on-adk/` (Google ADK), `langchain/` (LangGraph), `hermes-agent/`, `gemini-cu-sandbox/`, `kueue-agent-sandbox/`, `ray-integration/`, `analytics-tool/` |
| Operational / scaling / policy patterns    | `hpa-swp-scaling/`, `manual-pdb/`, `policy/`, `composing-sandbox-nw-policies/`, `sandbox-ksa/`                                                 |
| Alternative cluster runtimes               | `kata-gke-sandbox/`                                                                                                                            |

`kata-gke-sandbox/` is worth calling out alongside the quickstart's Kata path: it covers Kata Containers on managed **GKE** rather than local **minikube**. It documents the hardware constraints (Intel N2 machine type, Ubuntu node image, nested-virt-supporting zone), runs a `setup.sh` to provision the cluster, and then applies `sandbox-kata-gke.yaml` — a core `Sandbox` (not a `SandboxClaim`) that pins `runtimeClassName: kata-qemu` and uses a `nodeSelector` of `cloud.google.com/gke-os-distribution: ubuntu` to land on the Kata-capable node pool. Its isolation check is the same `uname -r` comparison the quickstart Kata appendix uses. It is the production-flavored counterpart to the local `quickstart/kata-containers.md` walkthrough.

Sources: [examples/README.md:1-20](examples/README.md), [examples/kata-gke-sandbox/README.md:1-128](examples/kata-gke-sandbox/README.md)

## Reading the Quickstart as a Decision Tree

When picking which path to follow, the prerequisites listed at the tops of the three documents are the real differentiator: the vanilla path needs only Docker and kubectl, the gVisor path additionally requires a Linux host with `runsc` installed, and the Kata path requires KVM-capable hardware plus minikube and Helm. The body of the quickstart is otherwise the same — the controller, extensions, template, warm pool, router, SDK, and test script do not change behavior between paths. The chosen path determines the *isolation boundary of the sandbox pod*, not the *shape of the Agent Sandbox API* the operator interacts with.

Once any path is finished, the broader `examples/` directory provides drop-in workloads (Python runtime, JupyterLab, VSCode, Chrome, OpenClaw) and integration patterns (LangGraph, ADK, Ray, HPA scaling, network policy) that all consume the same `Sandbox` / `SandboxTemplate` / `SandboxClaim` / `SandboxWarmPool` resources the quickstart left running.

Sources: [examples/quickstart/README.md:16-23](examples/quickstart/README.md), [examples/quickstart/gvisor.md:6-10](examples/quickstart/gvisor.md), [examples/quickstart/kata-containers.md:6-16](examples/quickstart/kata-containers.md), [examples/README.md:1-20](examples/README.md)

---

## 04. Controller Configuration & Tuning Flags

> All command-line flags exposed by the controller binary: QPS/burst, worker concurrency, warm-pool batch size, leader election, pprof, tracing, and cluster domain.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/04-controller-configuration-tuning-flags.md
- Generated: 2026-05-25T22:28:24.224Z

### Source Files

- `cmd/agent-sandbox-controller/main.go`
- `docs/configuration.md`
- `helm/templates/_controller-args.tpl`
- `helm/values.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [docs/configuration.md](docs/configuration.md)
- [helm/templates/_controller-args.tpl](helm/templates/_controller-args.tpl)
- [helm/values.yaml](helm/values.yaml)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [internal/metrics/tracing.go](internal/metrics/tracing.go)
</details>

# Controller Configuration & Tuning Flags

The `agent-sandbox-controller` binary is a controller-runtime Manager that hosts the core `Sandbox` reconciler and, optionally, the extension reconcilers for `SandboxClaim`, `SandboxTemplate`, and `SandboxWarmPool`. Its runtime behavior — how aggressively it talks to the kube-apiserver, how many objects it reconciles in parallel, how it elects a leader, and which diagnostic endpoints it exposes — is governed entirely by command-line flags parsed in `main()`. This page enumerates each flag, the value it defaults to, the code path that consumes it, and how the Helm chart surfaces it.

The flags fall into six logical groups: cluster identity, kube-client tuning (QPS/burst), per-controller worker concurrency, warm-pool batch sizing, leader election, and observability (tracing + pprof). All of them are registered with the standard `flag` package on the global `flag.CommandLine`, alongside controller-runtime's zap logger flags which are bound via `opts.BindFlags(flag.CommandLine)`.

## Flag Registration Surface

All flags are declared as local variables in `main()` and bound with `flag.StringVar`, `flag.BoolVar`, `flag.IntVar`, or `flag.Float64Var` before `flag.Parse()` is called. Validation runs immediately after parsing. The concurrency values, `--kube-api-burst`, and `--sandbox-warm-pool-max-batch-size` must all be positive. A worker total above 1000, or above the burst limit when client-side throttling is enabled, only logs a warning instead of failing startup.

Sources: [cmd/agent-sandbox-controller/main.go:50-145]()
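
The register-parse-validate pattern described above can be sketched with the standard `flag` package. This is an illustrative subset, not the controller's `main()`: the `config` struct, the `parse` helper, and the error messages are hypothetical, and a private `FlagSet` is used here so the example can be called repeatedly, whereas the controller binds to the global `flag.CommandLine`. Flag names and defaults follow the reference table below.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config holds a hypothetical subset of the controller's flags.
type config struct {
	kubeAPIQPS     float64
	kubeAPIBurst   int
	sandboxWorkers int
	extensions     bool
}

// parse registers the flags, parses args, and then validates the result.
func parse(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("agent-sandbox-controller", flag.ContinueOnError)
	fs.Float64Var(&c.kubeAPIQPS, "kube-api-qps", -1.0, "QPS limit; -1 disables client-side throttling")
	fs.IntVar(&c.kubeAPIBurst, "kube-api-burst", 10, "burst limit; must be > 0")
	fs.IntVar(&c.sandboxWorkers, "sandbox-concurrent-workers", 1, "max concurrent Sandbox reconciles")
	fs.BoolVar(&c.extensions, "extensions", false, "enable extension reconcilers")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	// Validation runs only after parsing, mirroring the order described above.
	if c.kubeAPIBurst <= 0 {
		return c, errors.New("--kube-api-burst must be positive")
	}
	if c.sandboxWorkers <= 0 {
		return c, errors.New("--sandbox-concurrent-workers must be positive")
	}
	return c, nil
}

func main() {
	c, err := parse([]string{"--extensions", "--kube-api-burst=100"})
	fmt.Printf("%+v err=%v\n", c, err)
}
```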

### Complete Flag Reference

| Flag | Type | Default | Purpose |
| --- | --- | --- | --- |
| `--version` | bool | `false` | Print version and exit. |
| `--cluster-domain` | string | `cluster.local` | Cluster DNS suffix used when composing the Sandbox `ServiceFQDN`. |
| `--metrics-bind-address` | string | `:8080` | Address for the Prometheus metrics endpoint (and pprof, when enabled). |
| `--health-probe-bind-address` | string | `:8081` | Address for `/healthz` and `/readyz`. |
| `--leader-elect` | bool | `true` | Enable controller-runtime leader election. |
| `--leader-election-namespace` | string | `""` | Namespace for the leader election Lease; auto-detected when empty. |
| `--extensions` | bool | `false` | Register the `SandboxClaim`, `SandboxTemplate`, and `SandboxWarmPool` reconcilers in addition to the core `Sandbox` reconciler. |
| `--enable-tracing` | bool | `false` | Initialize the OpenTelemetry SDK and export spans via OTLP/gRPC. |
| `--enable-pprof` | bool | `false` | Expose only `/debug/pprof/profile` on the metrics server. |
| `--enable-pprof-debug` | bool | `false` | Expose the full pprof/fgprof handler set. Implies `--enable-pprof`. |
| `--pprof-block-profile-rate` | int | `1000000` | `runtime.SetBlockProfileRate` value when pprof-debug is enabled. `<=0` disables; nanoseconds otherwise. |
| `--pprof-mutex-profile-fraction` | int | `10` | `runtime.SetMutexProfileFraction` value when pprof-debug is enabled. `<=0` disables; samples ~1/N. |
| `--kube-api-qps` | float64 | `-1.0` | QPS limit applied to the REST config (`-1` disables client-side throttling). |
| `--kube-api-burst` | int | `10` | Burst limit applied to the REST config. Must be `> 0`. |
| `--sandbox-concurrent-workers` | int | `1` | `MaxConcurrentReconciles` for the `Sandbox` controller. |
| `--sandbox-claim-concurrent-workers` | int | `1` | `MaxConcurrentReconciles` for the `SandboxClaim` controller (extensions only). |
| `--sandbox-warm-pool-concurrent-workers` | int | `1` | `MaxConcurrentReconciles` for the `SandboxWarmPool` controller (extensions only). |
| `--sandbox-template-concurrent-workers` | int | `1` | `MaxConcurrentReconciles` for the `SandboxTemplate` controller (extensions only). |
| `--sandbox-warm-pool-max-batch-size` | int | `300` | Maximum sandboxes the warm-pool reconciler creates or deletes in a single reconcile. |

Sources: [cmd/agent-sandbox-controller/main.go:70-97]()

## Flag-to-Manager Wiring

The diagram below maps each CLI flag except `--version` to the controller-runtime construct or `runtime` setting it ultimately drives. It is grounded in `main.go` and the per-controller `SetupWithManager` calls.

```mermaid
flowchart LR
    subgraph CLI["CLI flags (flag.CommandLine)"]
        F_QPS["--kube-api-qps"]
        F_BURST["--kube-api-burst"]
        F_LE["--leader-elect"]
        F_LEN["--leader-election-namespace"]
        F_METRICS["--metrics-bind-address"]
        F_PROBE["--health-probe-bind-address"]
        F_PPROF["--enable-pprof / --enable-pprof-debug"]
        F_PPRATE["--pprof-block-profile-rate"]
        F_PPMUT["--pprof-mutex-profile-fraction"]
        F_TRACE["--enable-tracing"]
        F_EXT["--extensions"]
        F_CD["--cluster-domain"]
        F_W1["--sandbox-concurrent-workers"]
        F_W2["--sandbox-claim-concurrent-workers"]
        F_W3["--sandbox-warm-pool-concurrent-workers"]
        F_W4["--sandbox-template-concurrent-workers"]
        F_BATCH["--sandbox-warm-pool-max-batch-size"]
    end

    subgraph REST["rest.Config (ctrl.GetConfigOrDie)"]
        REST_QPS["restConfig.QPS"]
        REST_BURST["restConfig.Burst"]
    end

    subgraph MGR["ctrl.Manager (ctrl.NewManager)"]
        MGR_LE["LeaderElection / Namespace / LeaderElectionID"]
        MGR_METRICS["metricsserver.Options{BindAddress, ExtraHandlers}"]
        MGR_PROBE["HealthProbeBindAddress"]
    end

    subgraph RT["go runtime"]
        RT_BLOCK["runtime.SetBlockProfileRate"]
        RT_MUTEX["runtime.SetMutexProfileFraction"]
    end

    subgraph OBS["Observability"]
        OTEL["asmetrics.SetupOTel → otlptracegrpc exporter"]
    end

    subgraph CTRLS["Reconcilers"]
        SBX["SandboxReconciler{ClusterDomain, Tracer}"]
        CLM["SandboxClaimReconciler"]
        WP["SandboxWarmPoolReconciler{MaxBatchSize}"]
        TPL["SandboxTemplateReconciler"]
    end

    F_QPS --> REST_QPS
    F_BURST --> REST_BURST
    F_LE --> MGR_LE
    F_LEN --> MGR_LE
    F_METRICS --> MGR_METRICS
    F_PROBE --> MGR_PROBE
    F_PPROF --> MGR_METRICS
    F_PPRATE --> RT_BLOCK
    F_PPMUT --> RT_MUTEX
    F_TRACE --> OTEL --> SBX
    F_CD --> SBX
    F_W1 --> SBX
    F_EXT --> CLM
    F_EXT --> WP
    F_EXT --> TPL
    F_W2 --> CLM
    F_W3 --> WP
    F_W4 --> TPL
    F_BATCH --> WP
```

Sources: [cmd/agent-sandbox-controller/main.go:179-277]()

## Kube API Client Tuning

`--kube-api-qps` and `--kube-api-burst` are stamped onto the REST config returned by `ctrl.GetConfigOrDie()` before the Manager is constructed. The QPS value is cast from `float64` to `float32`; the documented default of `-1.0` disables the client-side rate limiter entirely.

```go
restConfig := ctrl.GetConfigOrDie()
restConfig.QPS = float32(kubeAPIQPS)
restConfig.Burst = kubeAPIBurst
```

After parsing, `main()` computes `totalWorkers = sandbox + sandboxClaim + sandboxWarmPool + sandboxTemplate`. If QPS is positive and the worker total exceeds `kubeAPIBurst`, the setup logger emits a warning about likely client-side throttling. A separate warning fires when `totalWorkers > 1000` regardless of QPS, on the theory that this would create excessive apiserver load.

Sources: [cmd/agent-sandbox-controller/main.go:130-145](), [cmd/agent-sandbox-controller/main.go:216-218]()
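
The two warning conditions can be expressed as a small pure function. This is a sketch under stated assumptions: the function name `throttlingWarnings` and the warning texts are hypothetical, and only the two comparisons described above are modelled.

```go
package main

import "fmt"

// throttlingWarnings returns the advisory messages that the described
// checks would produce. It never returns an error: both conditions are
// warnings, not validation failures.
func throttlingWarnings(qps float64, burst int, workers ...int) []string {
	total := 0
	for _, w := range workers {
		total += w
	}
	var warnings []string
	// Only relevant when client-side rate limiting is active (QPS > 0).
	if qps > 0 && total > burst {
		warnings = append(warnings,
			fmt.Sprintf("total workers (%d) exceed kube-api-burst (%d)", total, burst))
	}
	// Fires regardless of the QPS setting.
	if total > 1000 {
		warnings = append(warnings,
			fmt.Sprintf("total workers (%d) exceed 1000", total))
	}
	return warnings
}

func main() {
	// Four worker counts: sandbox, claim, warm pool, template.
	for _, w := range throttlingWarnings(50, 10, 10, 10, 10, 1) {
		fmt.Println("warning:", w)
	}
}
```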

## Worker Concurrency

Each reconciler's `SetupWithManager` accepts a `concurrentWorkers int` and passes it to controller-runtime as `controller.Options{MaxConcurrentReconciles: concurrentWorkers}`. The flag-to-controller mapping is one-to-one:

| Flag | Reconciler | Setup call |
| --- | --- | --- |
| `--sandbox-concurrent-workers` | `controllers.SandboxReconciler` | `controllers/sandbox_controller.go:1130` |
| `--sandbox-claim-concurrent-workers` | `extensions/controllers.SandboxClaimReconciler` | `extensions/controllers/sandboxclaim_controller.go:1270` |
| `--sandbox-warm-pool-concurrent-workers` | `extensions/controllers.SandboxWarmPoolReconciler` | `extensions/controllers/sandboxwarmpool_controller.go:534` |
| `--sandbox-template-concurrent-workers` | `extensions/controllers.SandboxTemplateReconciler` | `extensions/controllers/sandboxtemplate_controller.go:216` |

The three extension reconcilers are only constructed when `--extensions=true`. Setting their worker flags without enabling extensions has no effect because the setup branch is skipped.

Sources: [cmd/agent-sandbox-controller/main.go:241-276](), [controllers/sandbox_controller.go:1130-1149]()

## Warm-Pool Batch Size

`--sandbox-warm-pool-max-batch-size` is passed into `SandboxWarmPoolReconciler.MaxBatchSize` and bounds how many Sandbox creations or deletions the warm-pool controller fans out in a single reconcile (the field is read as `int32(r.MaxBatchSize)` to drive parallel batches). Validation in `main()` rejects values `<= 0`. As a safety net, `SetupWithManager` also resets any `MaxBatchSize <= 0` to the package constant `sandboxCreateDeleteMaxBatchSize = 300`.

Sources: [cmd/agent-sandbox-controller/main.go:97](), [cmd/agent-sandbox-controller/main.go:269-273](), [extensions/controllers/sandboxwarmpool_controller.go:48-58](), [extensions/controllers/sandboxwarmpool_controller.go:534-537]()
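
The effect of the batch cap can be illustrated with a simplified model. The sketch below assumes the reconciler compares desired and current pool size and caps the resulting work per reconcile; the function names `effectiveBatchSize` and `plan` are hypothetical, and the real reconciler's bookkeeping is likely more involved.

```go
package main

import "fmt"

// defaultMaxBatchSize matches the documented fallback of 300.
const defaultMaxBatchSize = 300

// effectiveBatchSize applies the safety net: non-positive values
// fall back to the default.
func effectiveBatchSize(configured int) int {
	if configured <= 0 {
		return defaultMaxBatchSize
	}
	return configured
}

// capAt returns n limited to at most limit.
func capAt(n, limit int) int {
	if n > limit {
		return limit
	}
	return n
}

// plan returns how many sandboxes to create or remove in one reconcile,
// never exceeding maxBatch in either direction.
func plan(desired, current, maxBatch int) (create, remove int) {
	diff := desired - current
	switch {
	case diff > 0:
		create = capAt(diff, maxBatch)
	case diff < 0:
		remove = capAt(-diff, maxBatch)
	}
	return create, remove
}

func main() {
	create, remove := plan(1000, 0, effectiveBatchSize(0))
	fmt.Printf("create=%d remove=%d\n", create, remove)
}
```

Under this model, a pool that is far from its target size converges over several reconciles rather than in one burst.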

## Leader Election

`--leader-elect` defaults to `true`, which matches the chart default in `helm/values.yaml`. The Manager is constructed with a fixed `LeaderElectionID` of `a3317529.agent-sandbox.x-k8s.io`, so multiple replicas of the same controller image will contend for the same Lease. `--leader-election-namespace` controls where that Lease lives; when empty, controller-runtime falls back to its auto-detection (in-cluster service account namespace), and the setup logger emits a V(1) note to that effect.

```go
mgr, err := ctrl.NewManager(restConfig, ctrl.Options{
    Scheme:                  scheme,
    Metrics:                 metricsOpts,
    HealthProbeBindAddress:  probeAddr,
    LeaderElection:          enableLeaderElection,
    LeaderElectionNamespace: leaderElectionNamespace,
    LeaderElectionID:        "a3317529.agent-sandbox.x-k8s.io",
})
```

Sources: [cmd/agent-sandbox-controller/main.go:147-149](), [cmd/agent-sandbox-controller/main.go:220-227]()

## Observability: Tracing and pprof

### Tracing

When `--enable-tracing` is set, `main()` calls `asmetrics.SetupOTel(initCtx, "agent-sandbox-controller")` with a 10-second initialization timeout. `SetupOTel` creates an OTLP/gRPC exporter (`otlptracegrpc.New(ctx)`), wires it into a batching `sdktrace.TracerProvider`, sets the global propagator to W3C `TraceContext` only, and returns a cleanup closure that calls `tp.Shutdown` on exit. The exporter respects the standard `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_INSECURE` environment variables. The resulting `Instrumenter` is plumbed into the `SandboxReconciler`, `SandboxClaimReconciler`, and `SandboxTemplateReconciler` as their `Tracer` field; if tracing is disabled, a no-op instrumenter (`asmetrics.NewNoOp()`) is used instead.

Sources: [cmd/agent-sandbox-controller/main.go:153-168](), [cmd/agent-sandbox-controller/main.go:236-265](), [internal/metrics/tracing.go:124-147]()

### pprof

The pprof handlers are mounted on the metrics server, **not** on `http.DefaultServeMux`. Importing `net/http/pprof` registers its handlers on the default mux as a side effect, so to keep them from leaking onto any server that uses that mux, `main()` deliberately resets it:

```go
http.DefaultServeMux = http.NewServeMux()
```

Two flags govern exposure:

- `--enable-pprof` mounts only `/debug/pprof/profile` (CPU profile).
- `--enable-pprof-debug` implies `--enable-pprof` and additionally mounts `/debug/pprof/`, `cmdline`, `symbol`, `heap`, `goroutine`, `allocs`, `block`, `mutex`, `trace`, and `/debug/fgprof`. It also activates block/mutex profiling at the Go runtime level by calling `runtime.SetBlockProfileRate(pprofBlockProfileRate)` and `runtime.SetMutexProfileFraction(pprofMutexProfileFraction)`. Negative inputs are clamped to `0` with a warning.

The comment on `--enable-pprof-debug` notes it "may expose sensitive information and comes with performance overhead" — leaving it off is the safe default for production.

Sources: [cmd/agent-sandbox-controller/main.go:80-90](), [cmd/agent-sandbox-controller/main.go:170-214]()
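
The two exposure levels can be sketched as a function that builds a path-to-handler map, the shape that `metricsserver.Options.ExtraHandlers` accepts. This is an approximation using only the standard library: the function name `pprofHandlers` is hypothetical, and `/debug/fgprof` is omitted because it comes from a third-party package.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/pprof"
)

// pprofHandlers returns the handlers to mount for the given flag values.
// debug implies enable, matching the flag semantics described above.
func pprofHandlers(enable, debug bool) map[string]http.Handler {
	handlers := map[string]http.Handler{}
	if !enable && !debug {
		return handlers
	}
	// The CPU profile endpoint is the only one exposed by --enable-pprof.
	handlers["/debug/pprof/profile"] = http.HandlerFunc(pprof.Profile)
	if debug {
		handlers["/debug/pprof/"] = http.HandlerFunc(pprof.Index)
		handlers["/debug/pprof/cmdline"] = http.HandlerFunc(pprof.Cmdline)
		handlers["/debug/pprof/symbol"] = http.HandlerFunc(pprof.Symbol)
		handlers["/debug/pprof/trace"] = http.HandlerFunc(pprof.Trace)
		for _, name := range []string{"heap", "goroutine", "allocs", "block", "mutex"} {
			handlers["/debug/pprof/"+name] = pprof.Handler(name)
		}
	}
	return handlers
}

func main() {
	// Drop the handlers that the net/http/pprof import registered on the
	// default mux, as the controller does.
	http.DefaultServeMux = http.NewServeMux()
	fmt.Println(len(pprofHandlers(true, false)), "handler(s) with --enable-pprof")
	fmt.Println(len(pprofHandlers(false, true)), "handler(s) with --enable-pprof-debug")
}
```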

## Cluster Domain

`--cluster-domain` is passed straight into `SandboxReconciler.ClusterDomain` and used to assemble `sandbox.Status.ServiceFQDN`:

```go
sandbox.Status.ServiceFQDN = service.Name + "." + service.Namespace + ".svc." + r.ClusterDomain
```

Override it only when the cluster is configured with a non-default DNS suffix.

Sources: [cmd/agent-sandbox-controller/main.go:71](), [cmd/agent-sandbox-controller/main.go:236-241](), [controllers/sandbox_controller.go:127](), [controllers/sandbox_controller.go:614]()

## Helm Chart Mapping

The chart template `helm/templates/_controller-args.tpl` emits a `--flag=value` line for each populated key under `.Values.controller`. Empty/false-y values are omitted, so the binary falls back to its own defaults; this is why most fields in `helm/values.yaml` are commented out. The mapping is purely camelCase-to-kebab-case:

| `controller.*` value | CLI flag |
| --- | --- |
| `leaderElect` | `--leader-elect` |
| `clusterDomain` | `--cluster-domain` |
| `leaderElectionNamespace` | `--leader-election-namespace` |
| `extensions` | `--extensions` |
| `enableTracing` | `--enable-tracing` |
| `enablePprof` | `--enable-pprof` |
| `enablePprofDebug` | `--enable-pprof-debug` |
| `pprofBlockProfileRate` | `--pprof-block-profile-rate` |
| `pprofMutexProfileFraction` | `--pprof-mutex-profile-fraction` |
| `kubeApiQps` | `--kube-api-qps` |
| `kubeApiBurst` | `--kube-api-burst` |
| `sandboxConcurrentWorkers` | `--sandbox-concurrent-workers` |
| `sandboxClaimConcurrentWorkers` | `--sandbox-claim-concurrent-workers` |
| `sandboxWarmPoolConcurrentWorkers` | `--sandbox-warm-pool-concurrent-workers` |
| `sandboxTemplateConcurrentWorkers` | `--sandbox-template-concurrent-workers` |
| `extraArgs[]` | any flag not listed above (e.g. zap logger flags, `--sandbox-warm-pool-max-batch-size`) |

Note that `--sandbox-warm-pool-max-batch-size` and `--metrics-bind-address`/`--health-probe-bind-address` are not first-class keys in the chart; pass them through `controller.extraArgs`.

Sources: [helm/templates/_controller-args.tpl:1-50](), [helm/values.yaml:29-48]()
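
The naming rule in the table can be illustrated with a short conversion function. This is a sketch of the convention only: the helpers `flagName` and `renderArg` are hypothetical, the actual template may enumerate keys explicitly instead of computing names, and only the empty-value case of the omit rule is modelled.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// flagName converts a camelCase chart key into its kebab-case CLI flag
// by inserting a hyphen before each uppercase letter and lowercasing it.
func flagName(key string) string {
	var b strings.Builder
	b.WriteString("--")
	for _, r := range key {
		if unicode.IsUpper(r) {
			b.WriteByte('-')
			r = unicode.ToLower(r)
		}
		b.WriteRune(r)
	}
	return b.String()
}

// renderArg returns "--flag=value", or "" when the value is empty so
// that the binary falls back to its own default.
func renderArg(key, value string) string {
	if value == "" {
		return ""
	}
	return flagName(key) + "=" + value
}

func main() {
	for _, kv := range [][2]string{
		{"kubeApiQps", "50"},
		{"sandboxWarmPoolConcurrentWorkers", "10"},
		{"clusterDomain", ""}, // omitted: empty value
	} {
		if arg := renderArg(kv[0], kv[1]); arg != "" {
			fmt.Println(arg)
		}
	}
}
```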

## Worked Example: High-Throughput Extensions Deployment

The pattern documented in `docs/configuration.md` for a high-throughput cluster combines extension enablement, raised per-controller worker counts, a larger warm-pool batch, and explicit kube-client QPS/burst sized to the worker total:

```yaml
args:
  - --leader-elect=true
  - --extensions
  - --sandbox-concurrent-workers=10
  - --sandbox-claim-concurrent-workers=10
  - --sandbox-warm-pool-concurrent-workers=10
  - --sandbox-warm-pool-max-batch-size=500
  - --kube-api-qps=50
  - --kube-api-burst=100
```

With these values, the validation in `main()` is satisfied: all worker counts are positive, and the three explicitly set worker counts sum to `30`. Adding the default `sandboxTemplateConcurrentWorkers=1` gives a four-controller total of `31`, which is well under `1000` and fits within `kubeAPIBurst=100`, so no throttling warning fires.

Sources: [docs/configuration.md:37-52](), [cmd/agent-sandbox-controller/main.go:119-145]()

## Summary

The controller's tunable surface is intentionally small and entirely flag-driven: a Manager is constructed from `--leader-elect*`, `--metrics-bind-address`, and `--health-probe-bind-address`; its REST client is shaped by `--kube-api-qps`/`--kube-api-burst`; each reconciler's `MaxConcurrentReconciles` comes from one `--*-concurrent-workers` flag; the warm-pool fan-out is bounded by `--sandbox-warm-pool-max-batch-size`; and the diagnostic surface (`--enable-tracing`, `--enable-pprof`, `--enable-pprof-debug`, and the two pprof sampling-rate knobs) is wired into OpenTelemetry and the Go runtime respectively. The Helm chart's `controller.*` keys are a thin, omit-if-empty pass-through to these same flags, with `controller.extraArgs` as the escape hatch for anything the template does not enumerate.

---

## 05. Sandbox CRD (agents.x-k8s.io/v1beta1)

> Field-by-field reference for the core Sandbox resource: PodTemplate, VolumeClaimTemplates, Lifecycle, Replicas (0/1), and Service toggle.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/05-sandbox-crd-agents.x-k8s.io-v1beta1.md
- Generated: 2026-05-25T22:31:53.460Z

### Source Files

- `api/v1beta1/sandbox_types.go`
- `api/v1beta1/groupversion_info.go`
- `api/v1beta1/zz_generated.deepcopy.go`
- `k8s/crds/agents.x-k8s.io_sandboxes.yaml`
- `docs/api.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [api/v1beta1/groupversion_info.go](api/v1beta1/groupversion_info.go)
- [api/v1beta1/zz_generated.deepcopy.go](api/v1beta1/zz_generated.deepcopy.go)
- [k8s/crds/agents.x-k8s.io_sandboxes.yaml](k8s/crds/agents.x-k8s.io_sandboxes.yaml)
- [docs/api.md](docs/api.md)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
</details>

# Sandbox CRD (agents.x-k8s.io/v1beta1)

The `Sandbox` resource is the core CRD of agent-sandbox. It defines a single, controller-managed Kubernetes Pod (with optional headless `Service` and `PersistentVolumeClaim`s) intended to host an agent workload. The schema lives in package `v1beta1`, GroupVersion `agents.x-k8s.io/v1beta1`, and is registered through a standard kubebuilder `SchemeBuilder`.

This page is a field-by-field reference for the `Sandbox` resource as defined in the Go API types and rendered into the published CRD. It covers the `PodTemplate` and `VolumeClaimTemplates` inputs, the inline `Lifecycle` block, the `Replicas` 0/1 toggle, the tri-state `Service` field, well-known annotations, and the `status` shape. The semantics tied to each field by the `Sandbox` controller are documented where they affect what the field means in practice.

Sources: [api/v1beta1/sandbox_types.go:130-244](), [api/v1beta1/groupversion_info.go:25-36]()

## Group, Version, and Resource Identity

The CRD is registered as a namespaced resource with the short name `sandbox`. It exposes both the `/status` subresource and the `/scale` subresource, which maps to `.spec.replicas`, `.status.replicas`, and `.status.selector` so that `kubectl scale` works against a single Sandbox.

| Property | Value | Source |
| --- | --- | --- |
| API group | `agents.x-k8s.io` | [api/v1beta1/groupversion_info.go:27]() |
| Version | `v1beta1` | [api/v1beta1/groupversion_info.go:27]() |
| Kind / List Kind | `Sandbox` / `SandboxList` | [k8s/crds/agents.x-k8s.io_sandboxes.yaml:10-13]() |
| Plural / Singular | `sandboxes` / `sandbox` | [k8s/crds/agents.x-k8s.io_sandboxes.yaml:13-16]() |
| Short name | `sandbox` | [api/v1beta1/sandbox_types.go:228]() |
| Scope | `Namespaced` | [api/v1beta1/sandbox_types.go:228]() |
| Status subresource | enabled | [k8s/crds/agents.x-k8s.io_sandboxes.yaml:4020-4025]() |
| Scale subresource | `specReplicasPath=.spec.replicas`, `statusReplicasPath=.status.replicas`, `labelSelectorPath=.status.selector` | [k8s/crds/agents.x-k8s.io_sandboxes.yaml:4020-4025]() |

The kubebuilder markers on the top-level Go type drive these CRD attributes directly:

```go
// api/v1beta1/sandbox_types.go
// +genclient
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector
// +kubebuilder:resource:scope=Namespaced,shortName=sandbox
type Sandbox struct { ... }
```

Sources: [api/v1beta1/sandbox_types.go:224-244](), [k8s/crds/agents.x-k8s.io_sandboxes.yaml:1-25]()

## Top-Level Shape

A `Sandbox` follows the standard Kubernetes object layout: `apiVersion`, `kind`, `metadata`, `spec`, and `status`. Only `spec` is `+required`; `status` is `omitempty,omitzero`.

```text
Sandbox
├── metadata (ObjectMeta)
├── spec  (SandboxSpec, required)
│   ├── podTemplate            (PodTemplate, required)
│   ├── volumeClaimTemplates   ([]PersistentVolumeClaimTemplate, atomic)
│   ├── shutdownTime           (metav1.Time, inline Lifecycle)
│   ├── shutdownPolicy         (ShutdownPolicy, inline Lifecycle, default=Retain)
│   ├── replicas               (*int32, 0..1, default=1)
│   └── service                (*bool, tri-state)
└── status (SandboxStatus)
    ├── serviceFQDN, service, podIPs
    ├── replicas, selector
    └── conditions ([]metav1.Condition)
```

The `Lifecycle` struct is embedded inline into `SandboxSpec` via `` Lifecycle `json:",inline"` ``, which is why `shutdownTime` and `shutdownPolicy` appear at the top level of `spec` rather than under a nested `lifecycle` key.

Sources: [api/v1beta1/sandbox_types.go:129-166](), [api/v1beta1/sandbox_types.go:181-222](), [k8s/crds/agents.x-k8s.io_sandboxes.yaml:3833-4017]()

## `spec.podTemplate`

`podTemplate` is the only required field on `SandboxSpec`. It carries the full Pod specification that the controller materializes for the Sandbox, plus a restricted `metadata` block.

```go
// api/v1beta1/sandbox_types.go:109-117
type PodTemplate struct {
    Spec       corev1.PodSpec `json:"spec"`           // required
    ObjectMeta PodMetadata    `json:"metadata"`        // optional, labels/annotations only
}
```

- `podTemplate.spec` is a full upstream `corev1.PodSpec`, surfaced verbatim through the generated CRD OpenAPI schema. Anything that can appear on a Pod (containers, volumes, affinity, security context, runtime class, etc.) can appear here.
- `podTemplate.metadata` is a narrowed `PodMetadata` shape: only `labels` and `annotations` are honored ([api/v1beta1/sandbox_types.go:68-82]()). A `name` is intentionally not surfaced; the underlying Pod is named after the Sandbox.

The following well-known keys (four annotations and one label) participate in controller bookkeeping. The first two track label/annotation propagation from Sandbox to Pod:

| Constant | Key | Purpose |
| --- | --- | --- |
| `SandboxPropagatedLabelsAnnotation` | `agents.x-k8s.io/propagated-labels` | Tracks labels explicitly propagated from `Sandbox` spec to Pod. |
| `SandboxPropagatedAnnotationsAnnotation` | `agents.x-k8s.io/propagated-annotations` | Tracks annotations explicitly propagated from `Sandbox` spec to Pod. |
| `SandboxPodTemplateHashLabel` | `agents.x-k8s.io/sandbox-pod-template-hash` | Hash label set on the Pod for template comparison. |
| `SandboxPodNameAnnotation` | `agents.x-k8s.io/pod-name` | Records the Pod name when the Sandbox adopts one from a warm pool. |
| `SandboxTemplateRefAnnotation` | `agents.x-k8s.io/sandbox-template-ref` | Records the `SandboxTemplate` reference, when used. |

Sources: [api/v1beta1/sandbox_types.go:56-66](), [api/v1beta1/sandbox_types.go:68-117](), [docs/api.md:75-108]()

## `spec.volumeClaimTemplates`

`volumeClaimTemplates` is an optional, atomic list of PVC templates the Sandbox is allowed to reference. Each entry combines a narrowed metadata block with a standard `corev1.PersistentVolumeClaimSpec`.

```go
// api/v1beta1/sandbox_types.go:119-127
type PersistentVolumeClaimTemplate struct {
    EmbeddedObjectMetadata `json:"metadata"`
    Spec corev1.PersistentVolumeClaimSpec `json:"spec"` // required
}
```

Key field characteristics:

- Listed as `+listType=atomic` in `SandboxSpec`, so the entire list is replaced on update rather than merged ([api/v1beta1/sandbox_types.go:138-142](), [k8s/crds/agents.x-k8s.io_sandboxes.yaml:3850-3958]()).
- `EmbeddedObjectMetadata` exposes `name`, `labels`, and `annotations` on the template ([api/v1beta1/sandbox_types.go:84-107]()). The `name` is required for the controller to derive the actual PVC name.
- The API-level docstring requires that every claim has at least one matching access mode with a provisioner volume ([api/v1beta1/sandbox_types.go:138-141]()).

The `Sandbox` controller materializes each template into a real `PersistentVolumeClaim`. The PVC name is `"<template.Name>-<sandbox.Name>"`, the sandbox name hash label is added, and an existing same-named PVC that is unowned is adopted; PVCs owned by other controllers cause reconciliation to fail.

```go
// controllers/sandbox_controller.go:952-1008 (excerpt)
for _, pvcTemplate := range sandbox.Spec.VolumeClaimTemplates {
    pvcName := pvcTemplate.Name + "-" + sandbox.Name
    ...
    // Adopt unowned PVC, refuse to use one owned by a different controller,
    // otherwise create from pvcTemplate.Spec with cloned labels/annotations.
}
```

Sources: [api/v1beta1/sandbox_types.go:84-127](), [api/v1beta1/sandbox_types.go:138-142](), [controllers/sandbox_controller.go:945-1010]()
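
The naming and adoption rules can be modelled without any Kubernetes dependencies. The sketch below is a simplification: the `pvc` struct, the `action` values, and the `decide` function are hypothetical stand-ins, and ownership is reduced to a single controller UID string.

```go
package main

import (
	"errors"
	"fmt"
)

// pvc is a hypothetical stand-in for the PersistentVolumeClaim fields
// that matter here. controllerUID is empty when no controller owns it.
type pvc struct {
	name          string
	controllerUID string
}

type action string

const (
	actionCreate action = "create"
	actionAdopt  action = "adopt"
	actionReuse  action = "reuse"
)

// pvcName derives the claim name as "<template.Name>-<sandbox.Name>".
func pvcName(templateName, sandboxName string) string {
	return templateName + "-" + sandboxName
}

// decide picks what to do with an existing same-named PVC (nil if none).
func decide(existing *pvc, sandboxUID string) (action, error) {
	switch {
	case existing == nil:
		return actionCreate, nil
	case existing.controllerUID == "":
		return actionAdopt, nil
	case existing.controllerUID == sandboxUID:
		return actionReuse, nil
	default:
		return "", errors.New("PVC " + existing.name + " is owned by a different controller")
	}
}

func main() {
	name := pvcName("data", "hello-world")
	act, err := decide(&pvc{name: name}, "uid-1")
	fmt.Println(name, act, err)
}
```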

## `spec.shutdownTime` and `spec.shutdownPolicy` (inline `Lifecycle`)

`Lifecycle` is embedded inline into `SandboxSpec`, exposing two top-level keys on `spec` that govern expiry.

```go
// api/v1beta1/sandbox_types.go:181-192
type Lifecycle struct {
    ShutdownTime   *metav1.Time     `json:"shutdownTime,omitempty"`
    ShutdownPolicy *ShutdownPolicy  `json:"shutdownPolicy,omitempty"`
}
```

| Field | Type | Default | Validation | Effect |
| --- | --- | --- | --- | --- |
| `shutdownTime` | RFC 3339 timestamp | (none) | `format: date-time` | Absolute wall-clock expiry. When `now >= shutdownTime`, the Sandbox is treated as expired. |
| `shutdownPolicy` | `Delete` \| `Retain` | `Retain` | `+kubebuilder:validation:Enum=Delete;Retain` | After child cleanup at expiry, `Delete` removes the `Sandbox` object too; `Retain` keeps it with an `Expired` condition. |

The `ShutdownPolicy` enum is defined as:

```go
// api/v1beta1/sandbox_types.go:168-178
const (
    ShutdownPolicyDelete ShutdownPolicy = "Delete"
    ShutdownPolicyRetain ShutdownPolicy = "Retain"
)
```

Controller semantics:

- `checkSandboxExpiry` returns `expired=true` when `now` is no longer before `spec.ShutdownTime`. If `ShutdownTime` is nil, the Sandbox never expires ([controllers/sandbox_controller.go:1092-1111]()).
- On expiry, `handleSandboxExpiry` deletes the child Pod, owned `Service`, and owned PVCs regardless of `ShutdownPolicy`. The doc on `ShutdownPolicy` confirms this: "Underlying resources (Pods, Services) are always deleted on expiry." ([api/v1beta1/sandbox_types.go:187-188]())
- If `ShutdownPolicy == Delete`, the controller then deletes the `Sandbox` resource itself ([controllers/sandbox_controller.go:1065-1071]()).
- If `ShutdownPolicy == Retain` (default), the controller resets `status` (keeping conditions) and sets a `Ready=False` condition with reason `SandboxExpired` ([controllers/sandbox_controller.go:1074-1087](), [api/v1beta1/sandbox_types.go:53-54]()).

Sources: [api/v1beta1/sandbox_types.go:168-192](), [controllers/sandbox_controller.go:197-216](), [controllers/sandbox_controller.go:1065-1127]()

## `spec.replicas` (0 or 1)

`replicas` is a pointer-to-`int32` constrained to the closed range `[0, 1]`, defaulting to `1`. A Sandbox is intentionally a single-Pod resource; the field exists as an on/off toggle that is compatible with the standard `/scale` subresource.

```go
// api/v1beta1/sandbox_types.go:148-155
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=1
// +kubebuilder:default=1
// +optional
Replicas *int32 `json:"replicas,omitempty"`
```

The controller treats `replicas=0` as a suspension signal rather than a separate "paused" state:

```mermaid
stateDiagram-v2
    [*] --> Running: replicas=1 (default)
    Running --> Suspending: spec.replicas set to 0
    Suspending --> Suspended: Pod terminated
    Suspended --> Running: spec.replicas set to 1
    Running --> Expired: now >= shutdownTime
    Suspended --> Expired: now >= shutdownTime
    Expired --> [*]: shutdownPolicy=Delete
    Expired --> Expired: shutdownPolicy=Retain
```

Concrete behavior driven from `Spec.Replicas`:

- `computeSuspendedCondition` only emits a `Suspended` condition when `*Spec.Replicas == 0`. If the Pod still exists, the condition is `False` with reason `PodNotTerminated`; once the Pod is gone, it becomes `True` with reason `PodTerminated` ([controllers/sandbox_controller.go:289-311](), [api/v1beta1/sandbox_types.go:28-34]()).
- When `replicas=0`, `computeReadyCondition` short-circuits with `Ready=False` and reason `SandboxSuspended`, with message `"Sandbox is suspending"` if the Pod is still terminating, otherwise `"Sandbox is suspended"` ([controllers/sandbox_controller.go:328-337]()).
- During reconciliation, `replicas=0` causes the controller to delete the backing Pod ([controllers/sandbox_controller.go:671-678]()).
- Because `replicas` is the spec path for the `scale` subresource, `kubectl scale sandbox/<name> --replicas=0` (or `1`) is the supported way to toggle this state.

Sources: [api/v1beta1/sandbox_types.go:148-155](), [api/v1beta1/sandbox_types.go:28-44](), [controllers/sandbox_controller.go:188-216](), [controllers/sandbox_controller.go:289-337](), [controllers/sandbox_controller.go:660-690]()
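
The condition rules in the list above can be summarized in a few lines of Go. This is a minimal sketch under simplifying assumptions: `condition` is a local stand-in for `metav1.Condition`, and `suspendedCondition` and `readyWhenSuspended` are hypothetical names that only mirror the documented status, reason, and message values.

```go
package main

import "fmt"

// condition is a reduced stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason, Message string
}

// suspendedCondition returns nil unless replicas is 0. While the Pod still
// exists the condition is False/PodNotTerminated; once it is gone the
// condition becomes True/PodTerminated.
func suspendedCondition(replicas int32, podExists bool) *condition {
	if replicas != 0 {
		return nil
	}
	if podExists {
		return &condition{Type: "Suspended", Status: "False", Reason: "PodNotTerminated"}
	}
	return &condition{Type: "Suspended", Status: "True", Reason: "PodTerminated"}
}

// readyWhenSuspended builds the short-circuited Ready condition used when
// replicas is 0.
func readyWhenSuspended(podExists bool) condition {
	msg := "Sandbox is suspended"
	if podExists {
		msg = "Sandbox is suspending"
	}
	return condition{Type: "Ready", Status: "False", Reason: "SandboxSuspended", Message: msg}
}

func main() {
	fmt.Println(suspendedCondition(1, true) == nil)
	fmt.Println(*suspendedCondition(0, true))
	fmt.Println(*suspendedCondition(0, false))
	fmt.Println(readyWhenSuspended(true).Message)
}
```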

## `spec.service` (tri-state Service toggle)

`service` is a `*bool` rather than a plain `bool`, because the controller distinguishes three states. The `nolint` comments on the field document why this was intentional rather than promoted to an enum:

```go
// api/v1beta1/sandbox_types.go:157-165
// service controls whether the controller should automatically create a
// headless Service for this Sandbox.
// When unset, the controller preserves existing Services for backward
// compatibility but does not create new ones. Set to true to enable or false
// to explicitly disable and remove the Service.
//nolint:kubeapilinter
//nolint:nobools // Enum not used to avoid duplicating the Service API; field is not expected to extend (issue #746).
// +optional
Service *bool `json:"service,omitempty"`
```

| `spec.service` | Behavior in `reconcileService` | Source |
| --- | --- | --- |
| `nil` (unset) | Do not create a new Service. If a Service already exists, leave it untouched (do not adopt unowned ones, do not delete). Used for backward compatibility with older Sandboxes that predated this field. | [controllers/sandbox_controller.go:462-503](), [controllers/sandbox_controller.go:524-538]() |
| `true` | Create a headless Service (`ClusterIP: None`) named after the Sandbox, with a selector keyed by the sandbox name hash. Adopt an unowned same-named Service if its `ClusterIP` is `None` (or empty); refuse to adopt one already owned by another controller. | [controllers/sandbox_controller.go:471-499](), [controllers/sandbox_controller.go:539-563]() |
| `false` | Delete the Service if and only if it is owned by this Sandbox. Services owned by other controllers or unowned Services are left alone. | [controllers/sandbox_controller.go:511-522]() |

The Sandbox-managed Service is always headless: the controller hard-codes `Spec.ClusterIP = "None"` and a `Spec.Selector` of `{ <sandboxLabel>: <nameHash> }` so the Service resolves to the Pod via DNS.

The readiness gate also reflects the tri-state. `svcRequired` is `true` if `*Spec.Service` is `true`, or if `Spec.Service` is `nil` but a Service already exists (legacy preservation):

```go
// controllers/sandbox_controller.go:364-372
svcRequired := false
if sandbox.Spec.Service != nil {
    svcRequired = *sandbox.Spec.Service
} else if svc != nil {
    // Backward compatibility: require service readiness
    svcRequired = true
}
```

When the Service exists, the controller sets `status.service` to its name and `status.serviceFQDN` to a fully qualified DNS name for the Pod inside the headless Service.

Sources: [api/v1beta1/sandbox_types.go:157-165](), [controllers/sandbox_controller.go:460-594](), [controllers/sandbox_controller.go:364-389]()
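
The tri-state table can be read as a small decision function. The sketch below is an illustration only: `svcState` and `planService` are hypothetical, the action strings are invented for the example, and the `ClusterIP` check that gates adoption is deliberately left out. `svcRequired` follows the excerpt shown above.

```go
package main

import (
	"errors"
	"fmt"
)

// svcState is a hypothetical summary of what exists in the cluster.
type svcState int

const (
	svcAbsent svcState = iota
	svcUnowned
	svcOwnedBySandbox
	svcOwnedByOther
)

// planService maps spec.service (nil, true, false) and the observed Service
// state to an action. The headless ClusterIP check on adoption is omitted.
func planService(want *bool, state svcState) (string, error) {
	switch {
	case want == nil:
		return "leave", nil // unset: never create, adopt, or delete
	case *want:
		switch state {
		case svcAbsent:
			return "create", nil
		case svcUnowned:
			return "adopt", nil
		case svcOwnedBySandbox:
			return "keep", nil
		default:
			return "", errors.New("service is owned by another controller")
		}
	case state == svcOwnedBySandbox:
		return "delete", nil // false: delete only what this Sandbox owns
	default:
		return "leave", nil
	}
}

// svcRequired mirrors the readiness gate: explicit value wins, otherwise an
// existing Service is required for backward compatibility.
func svcRequired(want *bool, svcExists bool) bool {
	if want != nil {
		return *want
	}
	return svcExists
}

func main() {
	enabled := true
	action, err := planService(&enabled, svcAbsent)
	fmt.Println(action, err)
	fmt.Println(svcRequired(nil, true))
}
```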

## `status` Shape

`SandboxStatus` is published on the `/status` subresource. It reports observed state only and is owned by the controller.

```go
// api/v1beta1/sandbox_types.go:194-222
type SandboxStatus struct {
    ServiceFQDN   string             `json:"serviceFQDN,omitempty"`
    Service       string             `json:"service,omitempty"`
    Conditions    []metav1.Condition `json:"conditions,omitempty"`
    Replicas      int32              `json:"replicas,omitempty"`
    LabelSelector string             `json:"selector,omitempty"`
    PodIPs        []string           `json:"podIPs,omitempty"`
}
```

| Status field | JSON key | Meaning |
| --- | --- | --- |
| `ServiceFQDN` | `serviceFQDN` | Fully qualified DNS name valid for default cluster settings. The cluster domain defaults to `cluster.local` but is configurable via the controller flag `--cluster-domain`. |
| `Service` | `service` | Name of the Service the Sandbox is using (when one exists). |
| `Conditions` | `conditions` | Standard `metav1.Condition` array (see below). |
| `Replicas` | `replicas` | Actual replica count (0 or 1), wired into the `scale` subresource. |
| `LabelSelector` | `selector` | String form of the label selector matching the Pod, used by `kubectl scale` and HPAs. |
| `PodIPs` | `podIPs` | IPs of the backing Pod; can be multiple on dual-stack clusters. |

### Status conditions

The controller emits three condition types defined as `ConditionType` constants:

| Condition `type` | Possible `reason` values | When set |
| --- | --- | --- |
| `Ready` | `DependenciesReady`, `DependenciesNotReady`, `SandboxSuspended`, `SandboxExpired`, `ReconcilerError` | Always computed. `True` only when the Pod is `Ready` and has `PodIPs`, and the Service readiness gate (`svcRequired`) is satisfied. |
| `Suspended` | `PodTerminated`, `PodNotTerminated` | Only when `spec.replicas == 0`. `True` once the Pod has been terminated. |
| `Finished` | `PodSucceeded`, `PodFailed` | Set when the backing Pod reaches a terminal phase. |

The reason constants are defined together with the condition types:

```go
// api/v1beta1/sandbox_types.go:27-54
const (
    SandboxConditionSuspended ConditionType = "Suspended"
    SandboxReasonSuspendedPodTerminated    = "PodTerminated"
    SandboxReasonSuspendedPodNotTerminated = "PodNotTerminated"

    SandboxConditionReady ConditionType = "Ready"
    SandboxReasonDependenciesReady       = "DependenciesReady"
    SandboxReasonDependenciesNotReady    = "DependenciesNotReady"
    SandboxReasonSuspended               = "SandboxSuspended"

    SandboxConditionFinished ConditionType = "Finished"
    SandboxReasonPodSucceeded             = "PodSucceeded"
    SandboxReasonPodFailed                = "PodFailed"

    SandboxReasonExpired = "SandboxExpired"
)
```

Sources: [api/v1beta1/sandbox_types.go:22-66](), [api/v1beta1/sandbox_types.go:194-222](), [controllers/sandbox_controller.go:280-417]()
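
A client that consumes `status.conditions` typically looks up a condition by type and checks its status. The sketch below shows that lookup with a local `condition` struct so it runs without Kubernetes dependencies; `findCondition` and `isReady` are hypothetical helpers. In real code, `meta.FindStatusCondition` and `meta.IsStatusConditionTrue` from `k8s.io/apimachinery/pkg/api/meta` provide the same behavior for `[]metav1.Condition`.

```go
package main

import "fmt"

// condition is a reduced stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason string
}

// findCondition returns a pointer to the first condition of the given type,
// or nil when the type is not present.
func findCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// isReady reports whether the Ready condition exists and is True.
func isReady(conds []condition) bool {
	c := findCondition(conds, "Ready")
	return c != nil && c.Status == "True"
}

func main() {
	suspended := []condition{
		{Type: "Ready", Status: "False", Reason: "SandboxSuspended"},
		{Type: "Suspended", Status: "True", Reason: "PodTerminated"},
	}
	fmt.Println(isReady(suspended), findCondition(suspended, "Suspended").Reason)
}
```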

## Minimal Example

The smallest valid `v1beta1` Sandbox just sets `podTemplate.spec`. The remaining fields default sensibly: `replicas=1`, `shutdownPolicy=Retain`, no `shutdownTime`, no Service, no PVCs.

```yaml
apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
metadata:
  name: hello-world
spec:
  podTemplate:
    spec:
      containers:
        - name: agent
          image: ghcr.io/example/agent:latest
      restartPolicy: Never
```

A more complete example exercising every top-level `spec` field documented above:

```yaml
apiVersion: agents.x-k8s.io/v1beta1
kind: Sandbox
metadata:
  name: agent-with-storage
spec:
  podTemplate:
    metadata:
      labels:
        app: my-agent
    spec:
      containers:
        - name: agent
          image: ghcr.io/example/agent:latest
          volumeMounts:
            - name: work
              mountPath: /work
  volumeClaimTemplates:
    - metadata:
        name: work
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  shutdownTime: "2026-06-01T00:00:00Z"
  shutdownPolicy: Delete       # default is Retain
  replicas: 1                  # 0 to suspend
  service: true                # create a headless Service
```

Sources: [api/v1beta1/sandbox_types.go:129-166](), [k8s/crds/agents.x-k8s.io_sandboxes.yaml:3833-3960]()

## Summary

The `Sandbox` CRD is a deliberately small surface area on top of a full `corev1.PodSpec`. `podTemplate` is the only required field; `volumeClaimTemplates` provides PVC materialization tied to the Sandbox lifetime; the inline `Lifecycle` fields (`shutdownTime`, `shutdownPolicy`) implement timed expiry with a default-`Retain` cleanup policy; `replicas` is a 0/1 scale toggle that doubles as a suspend switch; and `service` is a tri-state pointer-bool that distinguishes "create a headless Service", "delete the owned Service", and "leave any existing Service alone for backward compatibility". All field defaults and validation rules are expressed as kubebuilder markers on the Go types and surface verbatim in the generated CRD schema and the published `docs/api.md` reference.

Sources: [api/v1beta1/sandbox_types.go:129-222](), [docs/api.md:111-172]()

---

## 06. SandboxTemplate CRD

> Reusable template type used by SandboxClaim and SandboxWarmPool, including the embedded Sandbox spec it encapsulates.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/06-sandboxtemplate-crd.md
- Generated: 2026-05-25T22:32:10.225Z

### Source Files

- `extensions/api/v1beta1/sandboxtemplate_types.go`
- `extensions/api/v1beta1/groupversion_info.go`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml`
- `extensions/examples/sandboxtemplate.yaml`
- `extensions/examples/secure-sandboxtemplate.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [extensions/api/v1beta1/groupversion_info.go](extensions/api/v1beta1/groupversion_info.go)
- [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml](k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml)
- [extensions/examples/sandboxtemplate.yaml](extensions/examples/sandboxtemplate.yaml)
- [extensions/examples/secure-sandboxtemplate.yaml](extensions/examples/secure-sandboxtemplate.yaml)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/api/v1beta1/sandboxwarmpool_types.go](extensions/api/v1beta1/sandboxwarmpool_types.go)
- [extensions/controllers/sandboxtemplate_controller.go](extensions/controllers/sandboxtemplate_controller.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
</details>

# SandboxTemplate CRD

`SandboxTemplate` is the reusable, namespaced blueprint that describes how an agent sandbox should be built. It is part of the `extensions.agents.x-k8s.io` API group and is consumed by `SandboxClaim` (one-shot rentals) and `SandboxWarmPool` (pre-provisioned pools). The template encapsulates the Pod shape, persistent storage, environment-variable injection policy, headless `Service` opt-in, and a shared `NetworkPolicy` that the template controller materializes into the cluster.

This page covers the API shape of `SandboxTemplate`, the constants and enums that govern its semantics, how the embedded `Sandbox` spec (`PodTemplate`, `VolumeClaimTemplates`, `Service`) is propagated to derived `Sandbox` objects, and how the dedicated template controller manages a single shared `NetworkPolicy` per template.

## API identity

The type is registered under the `extensions.agents.x-k8s.io` group at version `v1beta1`. It is namespaced, exposes the short name `sandboxtemplate`, and is registered into the scheme by the package `init()`.

```go
// extensions/api/v1beta1/groupversion_info.go
GroupVersion = schema.GroupVersion{Group: "extensions.agents.x-k8s.io", Version: "v1beta1"}
```

```go
// extensions/api/v1beta1/sandboxtemplate_types.go
// +kubebuilder:resource:scope=Namespaced,shortName=sandboxtemplate
type SandboxTemplate struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty,omitzero"`
    Spec SandboxTemplateSpec `json:"spec"`
}
```

The generated CRD confirms the group/kind/plural names, lists `v1beta1` as the served and stored version, and marks `spec.podTemplate` as the only required spec field.

Sources: [extensions/api/v1beta1/groupversion_info.go:25-36](), [extensions/api/v1beta1/sandboxtemplate_types.go:141-167](), [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml:1-19](), [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml:4142-4149]()

## Shape of `SandboxTemplateSpec`

`SandboxTemplateSpec` reuses the embedded `PodTemplate` and `PersistentVolumeClaimTemplate` types defined in the core sandbox API (`sigs.k8s.io/agent-sandbox/api/v1beta1`), then adds the policy fields that distinguish a template from a raw `Sandbox`.

| Field | Type | Required | Default | Purpose |
|---|---|---|---|---|
| `podTemplate` | `sandboxv1beta1.PodTemplate` | yes | - | Pod metadata + `corev1.PodSpec` used to materialize each sandbox Pod. |
| `volumeClaimTemplates` | `[]sandboxv1beta1.PersistentVolumeClaimTemplate` | no | empty | PVCs created per derived sandbox; list is atomic. |
| `networkPolicy` | `*NetworkPolicySpec` | no | nil → secure default | Ingress/egress rules applied to the shared `NetworkPolicy`. |
| `networkPolicyManagement` | `NetworkPolicyManagement` enum | no | `Managed` | Whether the controller creates/owns the `NetworkPolicy`. |
| `envVarsInjectionPolicy` | `EnvVarsInjectionPolicy` enum | no | `Disallowed` | Whether a `SandboxClaim` may inject env vars. |
| `service` | `*bool` | no | nil | Opt-in/out for a headless `Service` per sandbox. |

The embedded Pod and PVC types come from the core sandbox API:

```go
// api/v1beta1/sandbox_types.go
type PodTemplate struct {
    Spec corev1.PodSpec `json:"spec"`
    ObjectMeta PodMetadata `json:"metadata"`
}
type PersistentVolumeClaimTemplate struct {
    EmbeddedObjectMetadata `json:"metadata"`
    Spec corev1.PersistentVolumeClaimSpec `json:"spec"`
}
```

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:73-139](), [api/v1beta1/sandbox_types.go:109-127]()

### Enums and constants

The template package declares two string-typed enums and one well-known label key. All three are central to how the template binds to derived sandboxes and Pods.

```go
// extensions/api/v1beta1/sandboxtemplate_types.go
const (
    SandboxIDLabel = "agents.x-k8s.io/claim-uid"

    NetworkPolicyManagementManaged   NetworkPolicyManagement = "Managed"
    NetworkPolicyManagementUnmanaged NetworkPolicyManagement = "Unmanaged"

    EnvVarsInjectionPolicyAllowed    EnvVarsInjectionPolicy = "Allowed"
    EnvVarsInjectionPolicyOverrides  EnvVarsInjectionPolicy = "Overrides"
    EnvVarsInjectionPolicyDisallowed EnvVarsInjectionPolicy = "Disallowed"
)
```

| Enum | Allowed values | CRD default | Effect |
|---|---|---|---|
| `NetworkPolicyManagement` | `Managed`, `Unmanaged` | `Managed` | `Unmanaged` short-circuits the template controller and lets external systems (e.g. Cilium) own networking. |
| `EnvVarsInjectionPolicy` | `Allowed`, `Overrides`, `Disallowed` | `Disallowed` | Gate that `SandboxClaim.spec.env` is evaluated against by the claim controller. |

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:33-56](), [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml:31-37](), [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml:223-228]()

### Restricted `NetworkPolicySpec`

`NetworkPolicySpec` is deliberately a **subset** of `networkingv1.NetworkPolicySpec`. Only `ingress` and `egress` are exposed; `PodSelector` and `PolicyTypes` are intentionally excluded because the template controller fills them in to guarantee a default-deny posture targeted at the template's hashed pod selector.

```go
// extensions/api/v1beta1/sandboxtemplate_types.go
type NetworkPolicySpec struct {
    Ingress []networkingv1.NetworkPolicyIngressRule `json:"ingress,omitempty"`
    Egress  []networkingv1.NetworkPolicyEgressRule  `json:"egress,omitempty"`
}
```

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:58-71](), [extensions/api/v1beta1/sandboxtemplate_types.go:91-114]()

## Structural relationship

The diagram below shows the embedded reuse of the core sandbox `PodTemplate`/`PersistentVolumeClaimTemplate` and the two consumers that reference the template by name.

```mermaid
classDiagram
    class SandboxTemplate {
        +ObjectMeta metadata
        +SandboxTemplateSpec spec
    }
    class SandboxTemplateSpec {
        +PodTemplate podTemplate
        +[]PersistentVolumeClaimTemplate volumeClaimTemplates
        +*NetworkPolicySpec networkPolicy
        +NetworkPolicyManagement networkPolicyManagement
        +EnvVarsInjectionPolicy envVarsInjectionPolicy
        +*bool service
    }
    class PodTemplate {
        +PodMetadata metadata
        +corev1.PodSpec spec
    }
    class PersistentVolumeClaimTemplate {
        +EmbeddedObjectMetadata metadata
        +corev1.PersistentVolumeClaimSpec spec
    }
    class NetworkPolicySpec {
        +[]NetworkPolicyIngressRule ingress
        +[]NetworkPolicyEgressRule egress
    }
    class SandboxTemplateRef {
        +string name
    }
    class SandboxClaim {
        +SandboxTemplateRef sandboxTemplateRef
    }
    class SandboxWarmPool {
        +SandboxTemplateRef sandboxTemplateRef
    }
    SandboxTemplate --> SandboxTemplateSpec
    SandboxTemplateSpec --> PodTemplate
    SandboxTemplateSpec --> PersistentVolumeClaimTemplate
    SandboxTemplateSpec --> NetworkPolicySpec
    SandboxClaim --> SandboxTemplateRef
    SandboxWarmPool --> SandboxTemplateRef
    SandboxTemplateRef ..> SandboxTemplate : by name
```

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:73-139](), [extensions/api/v1beta1/sandboxclaim_types.go:101-128](), [extensions/api/v1beta1/sandboxwarmpool_types.go:24-47]()

## How consumers dereference the template

Both `SandboxClaim` and `SandboxWarmPool` carry a `SandboxTemplateRef` that is a bare name (no namespace, no UID). The reference is resolved at reconcile time, and the resolved template's `Spec.PodTemplate`, `Spec.VolumeClaimTemplates`, and `Spec.Service` are deep-copied into the generated `Sandbox`. The template ref name is also annotated and hashed onto labels so that the shared `NetworkPolicy` and warm-pool bookkeeping can target the right Pods.

```go
// extensions/api/v1beta1/sandboxclaim_types.go
type SandboxTemplateRef struct {
    Name string `json:"name,omitempty"`
}

type SandboxClaimSpec struct {
    TemplateRef SandboxTemplateRef `json:"sandboxTemplateRef,omitempty"`
    ...
}
```

```go
// extensions/api/v1beta1/sandboxwarmpool_types.go
const TemplateRefField = ".spec.sandboxTemplateRef.name"

type SandboxWarmPoolSpec struct {
    Replicas    int32                          `json:"replicas"`
    TemplateRef SandboxTemplateRef             `json:"sandboxTemplateRef,omitempty"`
    UpdateStrategy *SandboxWarmPoolUpdateStrategy `json:"updateStrategy,omitempty"`
}
```

The `SandboxClaim` controller copies the template into the `Sandbox` and stamps identity labels on the Pod template:

```go
// extensions/controllers/sandboxclaim_controller.go
sandbox.Annotations[v1beta1.SandboxTemplateRefAnnotation] = template.Name
template.Spec.PodTemplate.DeepCopyInto(&sandbox.Spec.PodTemplate)
sandbox.Spec.Service = template.Spec.Service
for i, vct := range template.Spec.VolumeClaimTemplates {
    vct.DeepCopyInto(&sandbox.Spec.VolumeClaimTemplates[i])
}
sandbox.Spec.PodTemplate.ObjectMeta.Labels[sandboxTemplateRefHash] = SandboxTemplateRefHash(template.Name)
```

The `SandboxWarmPool` controller does the same and additionally computes a JSON-marshalled hash of `template.Spec.PodTemplate` (`SandboxPodTemplateHashLabel`) so that pool members can be detected as "stale" when the template drifts:

```go
// extensions/controllers/sandboxwarmpool_controller.go
specJSON, err := json.Marshal(template.Spec.PodTemplate)
// ... NameHash(string(specJSON)) -> currentPodTemplateHash
PodTemplate: sandboxv1beta1.PodTemplate{
    Spec:       *template.Spec.PodTemplate.Spec.DeepCopy(),
    ObjectMeta: sandboxv1beta1.PodMetadata{Labels: podLabels, Annotations: podAnnotations},
}
```

The hashed label key bound to the template name is defined once in the warm-pool controller and reused by all three controllers:

```go
// extensions/controllers/sandboxwarmpool_controller.go
sandboxTemplateRefHash = "agents.x-k8s.io/sandbox-template-ref-hash"
```

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:101-128](), [extensions/api/v1beta1/sandboxwarmpool_types.go:24-47](), [extensions/controllers/sandboxclaim_controller.go:948-966](), [extensions/controllers/sandboxwarmpool_controller.go:303-379](), [extensions/controllers/sandboxwarmpool_controller.go:49-49]()
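
To make the drift-detection idea concrete, the following sketch hashes a JSON-marshalled template and compares it with the hash recorded on a pool member. It is an illustration under stated assumptions: `podTemplate` is a tiny hypothetical struct rather than `sandboxv1beta1.PodTemplate`, and SHA-256 stands in for the repository's `NameHash`, whose algorithm is not shown on this page.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// podTemplate is a hypothetical, heavily reduced template shape.
type podTemplate struct {
	Image   string            `json:"image"`
	Command []string          `json:"command,omitempty"`
	Env     map[string]string `json:"env,omitempty"`
}

// templateHash marshals the template to JSON and hashes the bytes.
// encoding/json emits struct fields in declaration order and sorts map keys,
// so equal templates produce equal hashes.
func templateHash(t podTemplate) (string, error) {
	b, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])[:16], nil // shortened to fit a label value
}

// isStale reports whether a pool member's recorded hash differs from the
// hash of the current template.
func isStale(memberHash string, current podTemplate) (bool, error) {
	h, err := templateHash(current)
	if err != nil {
		return false, err
	}
	return memberHash != h, nil
}

func main() {
	v1 := podTemplate{Image: "busybox:1.36", Env: map[string]string{"A": "1", "B": "2"}}
	recorded, _ := templateHash(v1)

	v2 := v1
	v2.Image = "busybox:1.37"

	same, _ := isStale(recorded, v1)
	changed, _ := isStale(recorded, v2)
	fmt.Println(same, changed)
}
```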

### Env var injection policy gate

`EnvVarsInjectionPolicy` is enforced in `SandboxClaimReconciler.createSandbox` after the template has been copied: if a claim supplies any `spec.env` while the template's policy is `Disallowed`, the claim is rejected. `Allowed` permits new variables but not overrides; `Overrides` permits both. The default in the CRD schema is `Disallowed`.

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:48-55](), [extensions/api/v1beta1/sandboxtemplate_types.go:123-128](), [extensions/controllers/sandboxclaim_controller.go:972-978]()
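
The three policy values can be illustrated with a small merge function. This is a sketch, not the claim controller's implementation: `mergeEnv` is a hypothetical helper, environment variables are modeled as plain string maps, and treating an empty policy as `Disallowed` is an assumption based on the CRD default.

```go
package main

import (
	"fmt"
)

// mergeEnv combines template and claim environment variables according to
// the injection policy:
//   - Disallowed: any claim-supplied variable is rejected.
//   - Allowed: new variables are accepted, overriding a template variable is rejected.
//   - Overrides: new variables and overrides are both accepted.
func mergeEnv(policy string, templateEnv, claimEnv map[string]string) (map[string]string, error) {
	if policy == "" {
		policy = "Disallowed" // assumed to match the CRD default
	}
	if policy != "Allowed" && policy != "Overrides" && policy != "Disallowed" {
		return nil, fmt.Errorf("unknown policy %q", policy)
	}
	if policy == "Disallowed" && len(claimEnv) > 0 {
		return nil, fmt.Errorf("template does not allow env var injection")
	}

	merged := make(map[string]string, len(templateEnv)+len(claimEnv))
	for k, v := range templateEnv {
		merged[k] = v
	}
	for k, v := range claimEnv {
		if _, exists := templateEnv[k]; exists && policy == "Allowed" {
			return nil, fmt.Errorf("policy Allowed does not permit overriding %q", k)
		}
		merged[k] = v
	}
	return merged, nil
}

func main() {
	tmpl := map[string]string{"MODE": "prod"}

	_, err := mergeEnv("Disallowed", tmpl, map[string]string{"TOKEN": "x"})
	fmt.Println(err)

	merged, err := mergeEnv("Overrides", tmpl, map[string]string{"MODE": "dev", "TOKEN": "x"})
	fmt.Println(merged, err)
}
```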

## Template controller: shared `NetworkPolicy`

`SandboxTemplateReconciler` watches `SandboxTemplate` and owns a single `NetworkPolicy` per template, named `<template>-network-policy` in the template's namespace. It does not create Pods, Services, or PVCs directly; those are the domain of the claim and warm-pool controllers. Its only job is to materialize the shared `NetworkPolicy`.

```mermaid
flowchart TD
    subgraph User["User-authored objects"]
        T[SandboxTemplate]
        C[SandboxClaim]
        W[SandboxWarmPool]
    end
    subgraph Reconcilers["extensions/controllers"]
        TR[SandboxTemplateReconciler]
        CR[SandboxClaimReconciler]
        WR[SandboxWarmPoolReconciler]
    end
    subgraph Cluster["Materialized cluster state"]
        NP["NetworkPolicy &lt;tmpl&gt;-network-policy<br/>podSelector: sandbox-template-ref-hash"]
        SB[Sandbox]
        Pods[Pod with hashed template label]
        PVCs[PVCs from volumeClaimTemplates]
        Svc[headless Service]
    end
    T --> TR --> NP
    C -->|sandboxTemplateRef| CR
    W -->|sandboxTemplateRef| WR
    CR -->|DeepCopy PodTemplate, VCTs, Service| SB
    WR -->|DeepCopy PodTemplate, VCTs, Service| SB
    SB --> Pods
    SB --> PVCs
    SB --> Svc
    Pods -. matched by .-> NP
```

Sources: [extensions/controllers/sandboxtemplate_controller.go:38-154](), [extensions/controllers/sandboxclaim_controller.go:923-966](), [extensions/controllers/sandboxwarmpool_controller.go:303-382]()

### Reconcile branches

The controller branches on `Spec.NetworkPolicyManagement` and the presence of `Spec.NetworkPolicy`:

| Management | `spec.networkPolicy` | Behavior |
|---|---|---|
| `Unmanaged` | any | The template's NetworkPolicy (if any) is deleted; controller exits early. |
| `Managed` (or empty) | nil | `buildDefaultNetworkPolicySpec` is used: ingress from `app=sandbox-router`; egress to `0.0.0.0/0` and `::/0` minus RFC1918 and link-local. |
| `Managed` (or empty) | non-nil | User-provided `ingress`/`egress` are wrapped with controller-injected `PodSelector` (the template-name hash) and `PolicyTypes`. |

The controller compares the existing policy with `equality.Semantic.DeepEqual` and patches only on drift; if no policy exists it creates one and sets a controller reference back to the `SandboxTemplate`. The reconciler ignores objects whose `DeletionTimestamp` is set, relying on owner-reference garbage collection to clean up the policy.

```go
// extensions/controllers/sandboxtemplate_controller.go
npName := template.Name + "-network-policy"
if management == extensionsv1beta1.NetworkPolicyManagementUnmanaged {
    r.Delete(ctx, existingNP) // tolerate NotFound
    return ctrl.Result{}, nil
}
if template.Spec.NetworkPolicy == nil {
    desiredSpec = buildDefaultNetworkPolicySpec(template.Name)
} else {
    desiredSpec = networkingv1.NetworkPolicySpec{
        PodSelector: metav1.LabelSelector{MatchLabels: map[string]string{
            sandboxTemplateRefHash: SandboxTemplateRefHash(template.Name),
        }},
        PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress, networkingv1.PolicyTypeEgress},
        Ingress: template.Spec.NetworkPolicy.Ingress,
        Egress:  template.Spec.NetworkPolicy.Egress,
    }
}
```

Sources: [extensions/controllers/sandboxtemplate_controller.go:67-153](), [extensions/controllers/sandboxtemplate_controller.go:156-213]()
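
The branch table and the create-or-patch step can be condensed into two small functions. The sketch below is hypothetical throughout: `npSpec`, `planNetworkPolicy`, and `syncAction` are illustrative names, the action strings are invented, and `reflect.DeepEqual` stands in for `equality.Semantic.DeepEqual` (which, unlike `reflect.DeepEqual`, treats nil and empty slices or maps as equal).

```go
package main

import (
	"fmt"
	"reflect"
)

// npSpec is a reduced stand-in for networkingv1.NetworkPolicySpec.
type npSpec struct {
	PodSelector map[string]string
	PolicyTypes []string
	Ingress     []string
	Egress      []string
}

// planNetworkPolicy selects the reconcile branch. An empty management value
// is treated like Managed.
func planNetworkPolicy(management string, hasCustomSpec bool) string {
	switch {
	case management == "Unmanaged":
		return "delete-and-exit"
	case !hasCustomSpec:
		return "apply-default"
	default:
		return "apply-custom"
	}
}

// syncAction decides how to converge the cluster on the desired spec:
// create when nothing exists, patch on drift, otherwise do nothing.
func syncAction(existing *npSpec, desired npSpec) string {
	switch {
	case existing == nil:
		return "create"
	case !reflect.DeepEqual(*existing, desired):
		return "patch"
	default:
		return "noop"
	}
}

func main() {
	desired := npSpec{
		PodSelector: map[string]string{"agents.x-k8s.io/sandbox-template-ref-hash": "abc123"},
		PolicyTypes: []string{"Ingress", "Egress"},
	}
	drifted := desired
	drifted.PolicyTypes = []string{"Ingress"}

	fmt.Println(planNetworkPolicy("", false))
	fmt.Println(syncAction(nil, desired), syncAction(&drifted, desired), syncAction(&desired, desired))
}
```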

### Secure-by-default policy

When `networkPolicy` is omitted under `Managed`, the controller installs a policy that blocks traffic to cluster-internal addresses: ingress is restricted to Pods labeled `app=sandbox-router`, and egress is `0.0.0.0/0` minus the three RFC1918 ranges and `169.254.0.0/16` (link-local / cloud metadata), plus an IPv6 catch-all that excludes `fc00::/7`. The accompanying `ApplySandboxSecureDefaults` helper also sets `AutomountServiceAccountToken=false` if it is unset. Only in secure-by-default mode, and only when `DNSPolicy` is unset, it switches `DNSPolicy` to `None` with explicit public resolvers to block internal DNS enumeration.

```go
// extensions/controllers/utils.go
if spec.AutomountServiceAccountToken == nil {
    automount := false
    spec.AutomountServiceAccountToken = &automount
}
isSecureByDefault := isManaged && template.Spec.NetworkPolicy == nil
if isSecureByDefault && spec.DNSPolicy == "" {
    spec.DNSPolicy = corev1.DNSNone
    spec.DNSConfig = &corev1.PodDNSConfig{Nameservers: []string{"8.8.8.8", "1.1.1.1"}}
}
```

The secure-by-default policy enforces a strict "Default Deny" ingress posture. As the field documentation warns, sidecars (Istio proxy, monitoring agents) that need their own ingress ports must be added explicitly to the `Ingress` list or they will fail health checks.

Sources: [extensions/controllers/sandboxtemplate_controller.go:156-213](), [extensions/controllers/utils.go:23-48](), [extensions/api/v1beta1/sandboxtemplate_types.go:91-114]()

## Volume claim templates and Service opt-in

`volumeClaimTemplates` is an atomic list of PVC templates; updates replace the entire list rather than merging. Both consumer controllers deep-copy each entry into the derived `Sandbox.Spec.VolumeClaimTemplates`, leaving downstream PVC creation to the core sandbox controller.

`spec.service` is a pointer to `bool` (it intentionally uses `*bool` rather than an enum to mirror the headless-`Service` field on the underlying `Sandbox`; see issue #746 referenced in the source). When set, the value is propagated verbatim into the generated `Sandbox.Spec.Service`. When unset, the controller "preserves existing Services for backward compatibility but does not create new ones."

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:82-89](), [extensions/api/v1beta1/sandboxtemplate_types.go:130-139](), [extensions/controllers/sandboxclaim_controller.go:950-958](), [extensions/controllers/sandboxwarmpool_controller.go:360-379]()

## Examples

The repository ships two illustrative templates under `extensions/examples/`. Note that these example manifests declare `apiVersion: extensions.agents.x-k8s.io/v1alpha1`, while the served and stored CRD version in `k8s/crds/` is `v1beta1`; the examples have not been refreshed to match the current API version.

```yaml
# extensions/examples/sandboxtemplate.yaml
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxTemplate
metadata:
  name: secure-datascience-template
spec:
  podTemplate:
    spec:
      securityContext:
        runAsUser: 1000
        runAsNonRoot: true
      containers:
      - name: my-container
        image: busybox
        command: ["/bin/sh", "-c", "sleep 36000"]
        volumeMounts: [{ name: workspace, mountPath: /workspace }]
  volumeClaimTemplates:
  - metadata: { name: workspace }
    spec:
      accessModes: ["ReadWriteOnce"]
      resources: { requests: { storage: 1Gi } }
```

The `secure-sandboxtemplate.yaml` example shows an explicit `networkPolicy` block that overrides the controller's secure default: the only allowed ingress is the Istio ingress gateway, and the only allowed egress is DNS on port 53 (UDP/TCP). This implicitly denies traffic to the Kubernetes API server and to peer sandboxes.

```yaml
# extensions/examples/secure-sandboxtemplate.yaml (excerpt)
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      ...
  networkPolicy:
    ingress:
      - from:
        - namespaceSelector: { matchLabels: { istio-injection: enabled } }
          podSelector:       { matchLabels: { app: istio-ingressgateway } }
    egress:
      - ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```

Sources: [extensions/examples/sandboxtemplate.yaml:1-39](), [extensions/examples/secure-sandboxtemplate.yaml:1-64]()

## Operational notes

- **Single shared `NetworkPolicy` per template.** Updating `spec.networkPolicy` updates the one policy object owned by the template; the CNI re-enforces rules across all existing and future sandboxes that match the hashed pod selector. There is no per-sandbox `NetworkPolicy`.
- **Template ref by name only.** Both `SandboxClaim.Spec.TemplateRef` and `SandboxWarmPool.Spec.TemplateRef` carry just a `name`. `SandboxWarmPool` exposes `TemplateRefField = ".spec.sandboxTemplateRef.name"` for indexer lookups; any rename of the JSON tag must be mirrored in this constant.
- **Pod template hashing for pools.** `SandboxWarmPool` JSON-marshals `template.Spec.PodTemplate` to detect drift; PodTemplate content changes drive `Recreate` or `OnReplenish` update strategies, but pure label/annotation changes do not (per the `Recreate` doc comment).
- **Required fields.** The CRD declares `spec` and `spec.podTemplate` as required; everything else is optional and defaulted as documented above.
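
The pod-template hashing note above can be sketched as follows. This is a minimal illustration, not the controller's code: `podTemplate` and `container` are hypothetical stand-ins for the Kubernetes types, and FNV-1a is an assumed hash function, since this page does not say which hash the controller uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// container and podTemplate are hypothetical, simplified stand-ins for the
// Kubernetes pod template types.
type container struct {
	Name  string `json:"name"`
	Image string `json:"image"`
}

type podTemplate struct {
	RuntimeClassName string      `json:"runtimeClassName,omitempty"`
	Containers       []container `json:"containers"`
}

// hashPodTemplate JSON-marshals the template and hashes the bytes.
// FNV-1a (32-bit) is an assumption made for this sketch.
func hashPodTemplate(t podTemplate) (string, error) {
	raw, err := json.Marshal(t)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	h.Write(raw)
	return fmt.Sprintf("%08x", h.Sum32()), nil
}

// isStale reports whether a pool member that recorded recordedHash no longer
// matches the current template content.
func isStale(recordedHash string, current podTemplate) (bool, error) {
	now, err := hashPodTemplate(current)
	if err != nil {
		return false, err
	}
	return recordedHash != now, nil
}

func main() {
	v1 := podTemplate{Containers: []container{{Name: "main", Image: "busybox:1.36"}}}
	v2 := podTemplate{Containers: []container{{Name: "main", Image: "busybox:1.37"}}}

	recorded, _ := hashPodTemplate(v1)
	unchangedStale, _ := isStale(recorded, v1)
	driftedStale, _ := isStale(recorded, v2)
	fmt.Println("unchanged template stale:", unchangedStale)
	fmt.Println("changed image stale:", driftedStale)
}
```

Hashing the serialized template keeps drift detection to a single string comparison per pool member, at the cost of treating any byte-level change in the hashed content as drift.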

The template is therefore best thought of as the immutable-style policy surface (security defaults, network policy, env-injection rules) plus the embedded `Sandbox` spec fragment (`podTemplate`, `volumeClaimTemplates`, `service`) that the claim and warm-pool controllers stamp into concrete `Sandbox` objects.

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:91-138](), [extensions/api/v1beta1/sandboxwarmpool_types.go:24-69](), [k8s/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml:4142-4149]()

---

## 07. SandboxClaim CRD

> Claim resource that resolves to a Sandbox: template references, warm-pool policy, env injection, additional pod metadata, and shutdown policies (Delete, DeleteForeground, Retain).

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/07-sandboxclaim-crd.md
- Generated: 2026-05-25T22:32:20.517Z

### Source Files

- `extensions/api/v1beta1/sandboxclaim_types.go`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml`
- `extensions/examples/sandboxclaim.yaml`
- `extensions/examples/sandbox-claim.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml](k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml)
- [extensions/examples/sandboxclaim.yaml](extensions/examples/sandboxclaim.yaml)
- [extensions/examples/sandbox-claim.yaml](extensions/examples/sandbox-claim.yaml)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
</details>

# SandboxClaim CRD

`SandboxClaim` is the namespaced "user-intent" resource in the `extensions.agents.x-k8s.io` API group. A claim names a `SandboxTemplate`, optionally narrows warm-pool selection, layers extra pod labels/annotations on top of the template, injects environment variables, and declares when and how the resulting `Sandbox` should be torn down. The `SandboxClaimReconciler` resolves the claim into exactly one owned `Sandbox` in the same namespace, either by adopting a pre-provisioned pod from a warm pool or by creating a fresh one from the template.

This page documents the schema, the spec/status surfaces, the warm-pool resolution flow, metadata and environment-variable injection rules, the lifecycle and shutdown policies (`Delete`, `DeleteForeground`, `Retain`), and the status conditions emitted by the controller.

## API identification and scoping

The custom resource is registered as `SandboxClaim` (singular/short name `sandboxclaim`, plural `sandboxclaims`) in group `extensions.agents.x-k8s.io`, namespaced, with `v1beta1` served and stored. The `status` subresource is enabled so spec/status updates are decoupled.

```yaml
group: extensions.agents.x-k8s.io
names: { kind: SandboxClaim, plural: sandboxclaims, singular: sandboxclaim, shortNames: [sandboxclaim] }
scope: Namespaced
versions:
- name: v1beta1
  served: true
  storage: true
  subresources: { status: {} }
```

Note that the example manifests under `extensions/examples/` declare `apiVersion: extensions.agents.x-k8s.io/v1alpha1`, while the CRD published in `k8s/crds/` only serves `v1beta1`. Use `v1beta1` to match the installed CRD.

Sources: [k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml:1-141](), [extensions/api/v1beta1/sandboxclaim_types.go:175-202](), [extensions/examples/sandboxclaim.yaml:1-20]()

## Spec surface

`SandboxClaimSpec` is intentionally small — the heavy lifting belongs in the referenced `SandboxTemplate`. The claim only carries fields that vary per consumer.

| Field | Type | Required | Default | Purpose |
| --- | --- | --- | --- | --- |
| `sandboxTemplateRef.name` | string | yes | — | Name of the `SandboxTemplate` in the same namespace. |
| `warmpool` | string | no | `default` | One of `none`, `default`, or a specific warm-pool name. Controls adoption. |
| `additionalPodMetadata` | object | no | `{}` | Extra `labels`/`annotations` merged onto the pod template. |
| `env` | `[]EnvVar` | no | `[]` | Environment variables injected into one or more containers. |
| `lifecycle.shutdownTime` | RFC3339 timestamp | no | — | Absolute expiration time for the claim. |
| `lifecycle.ttlSecondsAfterFinished` | int32 ≥ 0 | no | — | Retention window after the mirrored `Finished` condition transitions. |
| `lifecycle.shutdownPolicy` | enum | no | `Retain` | One of `Delete`, `DeleteForeground`, `Retain`. |

`sandboxTemplateRef` only carries `name`; cross-namespace references are not modeled — the template must live in the same namespace as the claim. The OpenAPI schema enforces both the `sandboxTemplateRef` requirement and the `shutdownPolicy` enum.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:101-151](), [k8s/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml:29-86]()

### Template reference

```go
type SandboxTemplateRef struct {
    Name string `json:"name,omitempty"`
}
```

The reconciler resolves the template only when needed: it first tries to find or adopt an existing `Sandbox` and only fetches the template if it has to create one from scratch, or if metadata needs to be merged after adoption. A missing template causes a requeue (`ErrTemplateNotFound`) instead of being returned as an error, which avoids log spam.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:101-106](), [extensions/controllers/sandboxclaim_controller.go:62-62](), [extensions/controllers/sandboxclaim_controller.go:263-272](), [extensions/controllers/sandboxclaim_controller.go:1182-1197]()

### Warm-pool policy

`WarmPoolPolicy` is a free-form string with two sentinel values:

| Value | Meaning |
| --- | --- |
| `none` | Never adopt from a warm pool; always cold-start from the template. |
| `default` | Adopt from any matching warm pool (default). |
| any other string | Adopt only from the named pool. `IsSpecificPool()` returns true. |

Specific-pool matching is enforced in the adoption loop by comparing `Labels[warmPoolSandboxLabel]` against `NameHash(policy)`; non-matching candidates are pushed back onto the queue. Two important interactions:

- If `warmpool != none` **and** the claim sets `spec.env`, the controller refuses adoption and returns an error — env injection mutates the pod spec at create time and cannot be applied to a pre-running warm sandbox.
- If `warmpool == none`, the controller skips the warm-pool queue entirely and falls through to `createSandbox`.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:33-55](), [extensions/controllers/sandboxclaim_controller.go:74-80](), [extensions/controllers/sandboxclaim_controller.go:591-646](), [extensions/controllers/sandboxclaim_controller.go:1155-1180]()
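
The policy table and the specific-pool check can be sketched as below. Only `IsSpecificPool()` is named on this page; the constant names, the `candidate` type, the `eligible` helper, and the treatment of an empty string as `default` are assumptions for illustration. The real `NameHash` is not reproduced, so the hash function is passed in.

```go
package main

import "fmt"

// WarmPoolPolicy mirrors the string-typed field described above.
type WarmPoolPolicy string

// Constant names are hypothetical; only the string values come from the page.
const (
	WarmPoolPolicyNone    WarmPoolPolicy = "none"
	WarmPoolPolicyDefault WarmPoolPolicy = "default"
)

// IsSpecificPool reports whether the policy names a concrete pool.
// Treating the empty string as "default" is an assumption of this sketch.
func (p WarmPoolPolicy) IsSpecificPool() bool {
	return p != "" && p != WarmPoolPolicyNone && p != WarmPoolPolicyDefault
}

// candidate is a hypothetical view of a warm sandbox: its name and the value
// of its pool-membership label.
type candidate struct {
	Name      string
	PoolLabel string
}

// eligible decides whether a candidate may be adopted under the policy.
// nameHash stands in for the controller's NameHash.
func eligible(p WarmPoolPolicy, c candidate, nameHash func(string) string) bool {
	switch {
	case p == WarmPoolPolicyNone:
		return false // always cold-start
	case p.IsSpecificPool():
		return c.PoolLabel == nameHash(string(p))
	default:
		return true // any matching pool
	}
}

func main() {
	// demoHash is a placeholder, not the real hashing scheme.
	demoHash := func(s string) string { return "h-" + s }
	c := candidate{Name: "warm-1", PoolLabel: demoHash("fast-pool")}
	for _, p := range []WarmPoolPolicy{"none", "default", "fast-pool", "slow-pool"} {
		fmt.Printf("%-10s specific=%-5t eligible=%t\n", p, p.IsSpecificPool(), eligible(p, c, demoHash))
	}
}
```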

### Additional pod metadata

`additionalPodMetadata` reuses `sandboxv1beta1.PodMetadata` (labels + annotations) and is merged onto the pod template that ends up in `Sandbox.spec.podTemplate.metadata`. Two rules are enforced server-side by the controller:

1. **No restricted-domain keys.** Keys whose domain prefix is `kubernetes.io`, `k8s.io`, or `agents.x-k8s.io` (or any subdomain) are rejected via `ErrInvalidMetadata`. Label values must additionally pass `validation.IsValidLabelValue` (max 63 chars, standard pattern).
2. **No silent overrides.** If the template already defines the same label or annotation key with a different value, `mergePodMetadata` returns a metadata-override conflict error. Identical values are allowed; new keys are appended.

The merged metadata also receives controller-injected identity labels (`agents.x-k8s.io/claim-uid` from the claim UID and `agents.x-k8s.io/sandbox-template-ref-hash`). On every reconcile, the controller diffs the recomputed `mergedMeta` against the live `Sandbox.spec.podTemplate.ObjectMeta` and pushes an update when they drift, so changes to `additionalPodMetadata` propagate even after creation.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:142-145](), [api/v1beta1/sandbox_types.go:68-82](), [extensions/controllers/sandboxclaim_controller.go:64-72](), [extensions/controllers/sandboxclaim_controller.go:324-374](), [extensions/controllers/sandboxclaim_controller.go:806-894]()
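
A minimal sketch of the two metadata rules, assuming a single map stands in for either labels or annotations. The function and error names here are hypothetical (the page names `mergePodMetadata` and `ErrInvalidMetadata` but does not show their signatures), and label-value validation is left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Hypothetical sentinel errors for this sketch.
var (
	errInvalidMetadata  = errors.New("invalid metadata")
	errMetadataConflict = errors.New("metadata override conflict")
)

// Restricted domains as listed on this page.
var restrictedDomains = []string{"kubernetes.io", "k8s.io", "agents.x-k8s.io"}

// isRestrictedKey checks the optional domain prefix (the part before "/")
// against the restricted domains and their subdomains.
func isRestrictedKey(key string) bool {
	prefix, _, found := strings.Cut(key, "/")
	if !found {
		return false // no domain prefix at all
	}
	for _, d := range restrictedDomains {
		if prefix == d || strings.HasSuffix(prefix, "."+d) {
			return true
		}
	}
	return false
}

// mergeKeys layers claim-supplied keys on top of template keys. Restricted
// keys are rejected, identical values are accepted, differing values conflict.
func mergeKeys(template, extra map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(template)+len(extra))
	for k, v := range template {
		out[k] = v
	}
	for k, v := range extra {
		if isRestrictedKey(k) {
			return nil, fmt.Errorf("%w: key %q uses a restricted domain", errInvalidMetadata, k)
		}
		if old, ok := out[k]; ok && old != v {
			return nil, fmt.Errorf("%w: key %q", errMetadataConflict, k)
		}
		out[k] = v
	}
	return out, nil
}

func main() {
	merged, err := mergeKeys(map[string]string{"app": "notebook"}, map[string]string{"team": "ml"})
	fmt.Println(merged, err)

	_, err = mergeKeys(map[string]string{"app": "notebook"}, map[string]string{"app": "other"})
	fmt.Println("conflict:", errors.Is(err, errMetadataConflict))

	_, err = mergeKeys(nil, map[string]string{"foo.kubernetes.io/x": "y"})
	fmt.Println("restricted:", errors.Is(err, errInvalidMetadata))
}
```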

### Env injection

Each entry in `spec.env` is `{name, value, containerName?}`. Both `name` and `value` are required by the schema; `containerName` is optional.

```go
type EnvVar struct {
    Name          string `json:"name"`
    Value         string `json:"value"`
    ContainerName string `json:"containerName,omitempty"`
}
```

Resolution rules:

- The template's `envVarsInjectionPolicy` gates whether injection happens at all: `Allowed` permits new variables, `Overrides` additionally permits replacing existing values, and any other value (including the explicit `Disallowed`) causes creation to fail with "environment variable injection is not allowed by the template policy."
- Env entries are grouped by `containerName`. Entries without a `containerName` (the "default" bucket) are appended only to the **first** main container in the template (`Spec.Containers[0]`).
- Entries with a `containerName` target that exact container, scanning both `InitContainers` and `Containers`. If the named container is not present in the resolved pod template, the reconcile fails with `target container %q not found in template for environment variable %q`.
- For each var, if a same-name entry already exists on the container, the injection is treated as an override and is rejected unless the template's policy is `Overrides`; otherwise it is appended.

Combined with the warm-pool rule above, env injection is only allowed when the claim does not adopt — that is, when a fresh sandbox is created from the template by `createSandbox`.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:108-122](), [extensions/api/v1beta1/sandboxtemplate_types.go:30-55](), [extensions/controllers/sandboxclaim_controller.go:896-921](), [extensions/controllers/sandboxclaim_controller.go:972-1038](), [extensions/controllers/sandboxclaim_controller.go:1157-1162]()
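
The resolution rules above can be sketched with simplified types. `podSpec`, `ctr`, `kv`, and `injectEnv` are hypothetical names for this illustration, and skipping the policy check when the claim carries no env entries is an assumption. The sketch also mutates the pod spec as it goes, so a failure part-way through leaves earlier entries applied.

```go
package main

import (
	"errors"
	"fmt"
)

// envVar mirrors the claim-level entry shown above.
type envVar struct {
	Name, Value, ContainerName string
}

// kv, ctr, and podSpec are simplified stand-ins for the Kubernetes types.
type kv struct{ Name, Value string }

type ctr struct {
	Name string
	Env  []kv
}

type podSpec struct {
	InitContainers []ctr
	Containers     []ctr
}

// findContainer scans init containers first, then main containers.
func findContainer(p *podSpec, name string) *ctr {
	for i := range p.InitContainers {
		if p.InitContainers[i].Name == name {
			return &p.InitContainers[i]
		}
	}
	for i := range p.Containers {
		if p.Containers[i].Name == name {
			return &p.Containers[i]
		}
	}
	return nil
}

// injectEnv applies claim env entries under the template's policy string.
func injectEnv(p *podSpec, policy string, envs []envVar) error {
	if len(envs) == 0 {
		return nil // assumption: nothing to inject, nothing to gate
	}
	if policy != "Allowed" && policy != "Overrides" {
		return errors.New("environment variable injection is not allowed by the template policy")
	}
	for _, e := range envs {
		var target *ctr
		if e.ContainerName == "" {
			if len(p.Containers) > 0 {
				target = &p.Containers[0] // default bucket: first main container
			}
		} else {
			target = findContainer(p, e.ContainerName)
		}
		if target == nil {
			return fmt.Errorf("target container %q not found in template for environment variable %q", e.ContainerName, e.Name)
		}
		replaced := false
		for i := range target.Env {
			if target.Env[i].Name != e.Name {
				continue
			}
			if policy != "Overrides" {
				return fmt.Errorf("environment variable %q already exists and the policy does not allow overrides", e.Name)
			}
			target.Env[i].Value = e.Value
			replaced = true
			break
		}
		if !replaced {
			target.Env = append(target.Env, kv{e.Name, e.Value})
		}
	}
	return nil
}

func main() {
	pod := podSpec{Containers: []ctr{{Name: "notebook", Env: []kv{{"MODE", "dev"}}}, {Name: "trainer"}}}
	err := injectEnv(&pod, "Allowed", []envVar{
		{Name: "NOTEBOOK_DIR", Value: "/workspace"},
		{Name: "GPU_FLAGS", Value: "--mig", ContainerName: "trainer"},
	})
	fmt.Println(err, pod.Containers[0].Env, pod.Containers[1].Env)
	fmt.Println(injectEnv(&pod, "Allowed", []envVar{{Name: "MODE", Value: "prod"}}))
}
```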

## Lifecycle and shutdown policies

`Lifecycle` separates *when* the claim expires from *what to do* when it does. Three policies are defined in `ShutdownPolicy`:

```go
ShutdownPolicyDelete           = "Delete"
ShutdownPolicyDeleteForeground = "DeleteForeground"
ShutdownPolicyRetain           = "Retain"  // default
```

| Policy | API behavior on expiration | Resource cleanup | When to use |
| --- | --- | --- | --- |
| `Delete` | The `SandboxClaim` object is deleted (default propagation), which cascades to the owned `Sandbox`. | Everything is garbage-collected. | You don't need to observe shutdown completion. |
| `DeleteForeground` | The claim is deleted with `metav1.DeletePropagationForeground`. The claim stays in the API with a `deletionTimestamp` until its `Sandbox`/Pod terminate. | Same as Delete, but observable. | External systems poll for the claim's disappearance as a "fully torn down" signal. |
| `Retain` | The claim object is preserved; only the owned `Sandbox` (and its Pod, Service, etc.) are deleted. Status reflects `ClaimExpired`. | Sandbox resources are reclaimed; the claim record stays. | Historical/audit retention, or driving downstream cleanup off the persisted claim. |

The reconciler computes expiration via `lifecycle.TimeLeft(now, shutdownTime, ttlSecondsAfterFinished, finishedCondition)`. If `claim.Spec.Lifecycle` is nil, the claim never expires. When expired:

- `Delete` / `DeleteForeground` paths short-circuit the reconcile after emitting a `ClaimExpired` event and issuing `r.Delete` with the appropriate propagation option. The reconcile returns immediately because subsequent status updates against a deleted object would fail.
- `Retain` falls through to `reconcileExpired`, which fetches the owned sandbox, verifies controller-ownership, and issues a delete on the sandbox while leaving the claim in place. If the sandbox is not controlled by this claim, the call fails with `ErrSandboxNotOwned` (suppressed in the requeue path to avoid a crash loop).

The `ttlSecondsAfterFinished` countdown is anchored on the `Finished` condition mirrored from the underlying `Sandbox`. The claim does **not** propagate `shutdownTime` down to the `Sandbox`; expiration is enforced entirely by the claim controller.

```text
                      lifecycle.TimeLeft(now, shutdownTime, ttlSecondsAfterFinished, finishedCond)
                                          │
                  ┌───────────────────────┼───────────────────────┐
       claim not expired             expired + Delete*       expired + Retain
                  │                       │                       │
                  ▼                       ▼                       ▼
          reconcileActive          r.Delete(claim, ...)     reconcileExpired
       (adopt or create)          [Foreground? add opt]    (delete owned Sandbox,
                                                            keep the claim)
```

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:57-99](), [extensions/controllers/sandboxclaim_controller.go:165-282](), [extensions/controllers/sandboxclaim_controller.go:309-317](), [extensions/controllers/sandboxclaim_controller.go:388-420]()
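
The branch structure in the diagram can be sketched as below. The exact semantics of `lifecycle.TimeLeft` are not shown on this page, so this sketch assumes that the earliest applicable deadline wins and that the TTL only counts once a finish time is known. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// lifecycle is a simplified stand-in for the claim's lifecycle block.
type lifecycle struct {
	ShutdownTime            *time.Time
	TTLSecondsAfterFinished *int32
	ShutdownPolicy          string // "", "Delete", "DeleteForeground", "Retain"
}

// timeLeft returns the remaining time and whether any deadline applies.
// Choosing the earliest deadline is an assumption of this sketch.
func timeLeft(now time.Time, lc *lifecycle, finishedAt *time.Time) (time.Duration, bool) {
	if lc == nil {
		return 0, false // no lifecycle: the claim never expires
	}
	var deadline *time.Time
	if lc.ShutdownTime != nil {
		t := *lc.ShutdownTime
		deadline = &t
	}
	if lc.TTLSecondsAfterFinished != nil && finishedAt != nil {
		t := finishedAt.Add(time.Duration(*lc.TTLSecondsAfterFinished) * time.Second)
		if deadline == nil || t.Before(*deadline) {
			deadline = &t
		}
	}
	if deadline == nil {
		return 0, false
	}
	return deadline.Sub(now), true
}

// nextStep names the reconcile branch taken for a claim.
func nextStep(now time.Time, lc *lifecycle, finishedAt *time.Time) string {
	left, ok := timeLeft(now, lc, finishedAt)
	if !ok || left > 0 {
		return "reconcileActive"
	}
	switch lc.ShutdownPolicy {
	case "Delete":
		return "delete claim"
	case "DeleteForeground":
		return "delete claim (foreground)"
	default: // Retain is the documented default
		return "reconcileExpired"
	}
}

// mustParse is a small helper for the demo values.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2026-01-01T00:00:00Z")
	past := mustParse("2025-12-31T23:59:59Z")
	fmt.Println(nextStep(now, nil, nil))
	fmt.Println(nextStep(now, &lifecycle{ShutdownTime: &past, ShutdownPolicy: "Delete"}, nil))
	fmt.Println(nextStep(now, &lifecycle{ShutdownTime: &past}, nil))
}
```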

## Resolution flow

A single `Reconcile` pass walks the claim through expiration check, fast-path adoption, optional cold creation, and status update.

```mermaid
flowchart TD
    subgraph API["extensions.agents.x-k8s.io / v1beta1"]
      Claim["SandboxClaim<br/>spec.sandboxTemplateRef<br/>spec.warmpool<br/>spec.env<br/>spec.additionalPodMetadata<br/>spec.lifecycle"]
    end

    subgraph Reconciler["SandboxClaimReconciler"]
      Reconcile["Reconcile()"]
      Expire["checkExpiration()"]
      Active["reconcileActive()"]
      Expired["reconcileExpired()"]
      GetOrCreate["getOrCreateSandbox()"]
      Adopt["adoptSandboxFromCandidates()"]
      Create["createSandbox()"]
      Merge["mergePodMetadata()"]
      InjectEnv["injectEnvs()"]
      Status["computeAndSetStatus() / updateStatus()"]
    end

    subgraph WarmPool["Warm pool side"]
      Queue["WarmSandboxQueue<br/>(templateHash → SandboxKey)"]
      Template["SandboxTemplate"]
    end

    subgraph Core["agents.x-k8s.io / Sandbox"]
      Sandbox["Sandbox (owned)"]
    end

    Claim --> Reconcile
    Reconcile --> Expire
    Expire -->|not expired| Active
    Expire -->|expired + Delete/DeleteForeground| Reconcile -.->|r.Delete claim| Claim
    Expire -->|expired + Retain| Expired --> Sandbox
    Active --> GetOrCreate
    GetOrCreate -->|status/label hit or name lookup| Sandbox
    GetOrCreate -->|warmpool != none| Adopt --> Queue
    Adopt -->|adopt success| Sandbox
    GetOrCreate -->|no candidate or warmpool=none| Create
    Create --> Template
    Create --> Merge --> Sandbox
    Create --> InjectEnv --> Sandbox
    Active --> Status
    Expired --> Status
    Status --> Claim
```

Key invariants in `reconcileActive` / `getOrCreateSandbox`:

1. The status pointer (`claim.Status.SandboxStatus.Name`) is the primary discovery hint; the `agents.x-k8s.io/sandbox-name` label on the claim is the secondary one. Both are validated with `metav1.IsControlledBy` before being trusted.
2. Name-based lookup uses `claim.Name` as the sandbox name when creating from scratch (`createSandbox` writes `Name: claim.Name`), so re-running a reconcile is idempotent.
3. Warm-pool adoption is a two-phase update: first the claim is patched with the `agents.x-k8s.io/sandbox-name` label under optimistic locking, then `completeAdoption` strips warm-pool labels (`warmPoolSandboxLabel`, `sandboxTemplateRefHash`, `agents.x-k8s.io/sandbox-pod-template-hash`), re-parents the sandbox via `SetControllerReference(claim, ...)`, and forces `Spec.PodTemplate.ObjectMeta` to the merged metadata.
4. Cross-namespace adoption is rejected (`ErrCrossNamespaceAdoption`); ineligible candidates are pushed back onto the queue rather than dropped.

Sources: [extensions/controllers/sandboxclaim_controller.go:140-282](), [extensions/controllers/sandboxclaim_controller.go:319-386](), [extensions/controllers/sandboxclaim_controller.go:591-794](), [extensions/controllers/sandboxclaim_controller.go:923-1068](), [extensions/controllers/sandboxclaim_controller.go:1070-1180]()
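
Invariants 1 and 2 describe an ordered lookup with an ownership check. A minimal sketch, assuming simplified `claim` and `sandbox` types and assuming the name-based lookup is ownership-checked the same way as the two hints (the page only states this for the status pointer and the label):

```go
package main

import "fmt"

// sandbox and claim are hypothetical, flattened views of the real objects.
type sandbox struct {
	Name          string
	ControllerUID string // UID of the controlling owner, if any
}

type claim struct {
	UID               string
	Name              string
	StatusSandboxName string // status.sandbox.name
	SandboxNameLabel  string // agents.x-k8s.io/sandbox-name label value
}

// findOwned tries the status pointer, then the label, then the claim name,
// and trusts a hit only if the sandbox is controlled by this claim.
func findOwned(c claim, get func(name string) (sandbox, bool)) (sandbox, bool) {
	for _, hint := range []string{c.StatusSandboxName, c.SandboxNameLabel, c.Name} {
		if hint == "" {
			continue
		}
		if sb, ok := get(hint); ok && sb.ControllerUID == c.UID {
			return sb, true
		}
	}
	return sandbox{}, false
}

func main() {
	store := map[string]sandbox{
		"warm-abc": {Name: "warm-abc", ControllerUID: "claim-1"},
		"data":     {Name: "data", ControllerUID: "someone-else"},
	}
	get := func(name string) (sandbox, bool) {
		sb, ok := store[name]
		return sb, ok
	}
	c := claim{UID: "claim-1", Name: "data", SandboxNameLabel: "warm-abc"}
	sb, ok := findOwned(c, get)
	fmt.Println(sb.Name, ok)
}
```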

## Status

`SandboxClaimStatus` exposes two fields:

```go
type SandboxClaimStatus struct {
    Conditions    []metav1.Condition `json:"conditions,omitempty"`
    SandboxStatus SandboxStatus      `json:"sandbox,omitempty"`
}

type SandboxStatus struct {
    Name   string   `json:"name,omitempty"`   // resolved Sandbox name
    PodIPs []string `json:"podIPs,omitempty"` // mirrored from the Sandbox
}
```

The reconciler maintains two condition types on the claim:

| Condition | Source | Representative `Reason` values |
| --- | --- | --- |
| `Ready` | Computed by `computeReadyCondition`. Mirrors `Ready` from the owned `Sandbox` while the claim is healthy. | `TemplateNotFound`, `InvalidMetadata`, `SandboxMissing`, `SandboxNotReady`, `Expired`, `ClaimExpired`, `ReconcilerError`. |
| `Finished` | Mirrored from `Sandbox.status.conditions[Finished]` via `syncFinishedCondition`. Removed when the sandbox is gone and the claim is not yet expired. | Carried through from the core controller. |

When the underlying `Sandbox` reports `SandboxReasonExpired`, the claim's `Ready` condition surfaces the same reason so callers can distinguish "sandbox expired on its own" from "claim expired and reaped the sandbox" (the latter uses `Reason=ClaimExpired`). The controller emits Kubernetes events with reasons `ClaimExpired`, `SandboxAdopted`, and `SandboxProvisioned` corresponding to the major lifecycle transitions.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:153-173](), [extensions/controllers/sandboxclaim_controller.go:459-576](), [extensions/controllers/sandboxclaim_controller.go:185-188](), [extensions/controllers/sandboxclaim_controller.go:700-712]()

## Example manifests

Minimal claim (template ref only, defaults to `warmpool: default`, no lifecycle, `Retain` policy implicit):

```yaml
# extensions/examples/sandboxclaim.yaml
apiVersion: extensions.agents.x-k8s.io/v1beta1   # use v1beta1 (CRD-served version)
kind: SandboxClaim
metadata:
  name: my-secure-sandbox
  namespace: default
spec:
  sandboxTemplateRef:
    name: secure-datascience-template
  # warmpool: "default"        # implicit default
```

Claim with an explicit expiration and clean-up policy:

```yaml
# extensions/examples/sandbox-claim.yaml
apiVersion: extensions.agents.x-k8s.io/v1beta1
kind: SandboxClaim
metadata:
  name: my-secure-sandbox
  namespace: default
spec:
  sandboxTemplateRef:
    name: secure-datascience-template
  lifecycle:
    shutdownPolicy: Delete                  # Delete | DeleteForeground | Retain
    shutdownTime: "2025-12-31T23:59:59Z"
```

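A richer claim that shows every field of the schema in one manifest. Treat it as a field reference rather than a working configuration: per the warm-pool rule above, the controller returns an error for a claim that sets `spec.env` while `warmpool` is anything other than `none`, so a real claim would either drop `env` or set `warmpool: none`.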

```yaml
apiVersion: extensions.agents.x-k8s.io/v1beta1
kind: SandboxClaim
metadata:
  name: data-prep
  namespace: ml-team
spec:
  sandboxTemplateRef:
    name: jupyter-template
  warmpool: fast-pool                       # adopt only from this named pool
  additionalPodMetadata:
    labels:
      team: ml
    annotations:
      cost-center: "1234"
  env:                                       # requires template policy Allowed/Overrides
    - name: NOTEBOOK_DIR
      value: /workspace
    - name: GPU_FLAGS
      value: "--mig"
      containerName: trainer                 # target a specific container
  lifecycle:
    shutdownTime: "2026-01-01T00:00:00Z"
    ttlSecondsAfterFinished: 600
    shutdownPolicy: DeleteForeground
```

Sources: [extensions/examples/sandboxclaim.yaml:1-20](), [extensions/examples/sandbox-claim.yaml:1-20]()

## Summary

The `SandboxClaim` CRD is a thin, intent-bearing wrapper around a `SandboxTemplate`: it declares which template to use, whether warm-pool adoption is allowed, what extra metadata and env vars to apply, and when/how the resulting sandbox should be torn down. The controller enforces a strict separation between adopt (no env injection, no template overrides) and create-from-template (full merge with no key conflicts on `additionalPodMetadata`), and pins the expiration semantics to the claim with three distinct shutdown behaviors — `Delete`, `DeleteForeground`, and `Retain` — so callers can pick between fire-and-forget cleanup, observable teardown, and audit-style retention without changing the underlying `Sandbox` API.

---

## 08. SandboxWarmPool CRD

> Specification of pre-warmed sandbox pools: template binding, replica counts, and adoption semantics consumed by SandboxClaim.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/08-sandboxwarmpool-crd.md
- Generated: 2026-05-25T22:32:24.116Z

### Source Files

- `extensions/api/v1beta1/sandboxwarmpool_types.go`
- `k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml`
- `extensions/examples/sandboxwarmpool.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/api/v1beta1/sandboxwarmpool_types.go](extensions/api/v1beta1/sandboxwarmpool_types.go)
- [k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml](k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml)
- [extensions/examples/sandboxwarmpool.yaml](extensions/examples/sandboxwarmpool.yaml)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
</details>

# SandboxWarmPool CRD

`SandboxWarmPool` is a namespaced Custom Resource in API group `extensions.agents.x-k8s.io/v1beta1` that maintains a population of pre-allocated, ready-to-use `Sandbox` objects derived from a `SandboxTemplate`. It is the supply side of the warm-pool pattern: the `SandboxWarmPool` controller continually drives the live replica count toward `spec.replicas`, while the `SandboxClaim` controller consumes those pre-warmed sandboxes through an adoption protocol that flips ownership from the pool to the claim. The CRD is short, but it ties together a template binding, a `HorizontalPodAutoscaler`-friendly `scale` subresource, an `updateStrategy` for handling template drift, and a label/ownership convention that is load-bearing for adoption.

This page documents the schema, controller behavior, the labels and owner references involved in pool membership, the two update strategies, and the handoff contract that lets `SandboxClaim` adopt pool members.

## Resource Identity and Scope

The CRD is registered as a namespaced kind with short name `swp` and exposes a `scale` subresource whose `specReplicasPath`, `statusReplicasPath`, and `labelSelectorPath` point at the warm pool's own fields. This makes the resource directly compatible with `HorizontalPodAutoscaler`, and the Go type's comment explicitly anticipates that case.

| Aspect | Value |
| --- | --- |
| API group / version | `extensions.agents.x-k8s.io/v1beta1` |
| Kind / list kind | `SandboxWarmPool` / `SandboxWarmPoolList` |
| Plural / singular | `sandboxwarmpools` / `sandboxwarmpool` |
| Short name | `swp` |
| Scope | Namespaced |
| Subresources | `status`, `scale` |
| Printer columns | `Ready` (`.status.readyReplicas`), `Age` |

Sources: [k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml:1-84](k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml), [extensions/api/v1beta1/sandboxwarmpool_types.go:86-120](extensions/api/v1beta1/sandboxwarmpool_types.go)

## Spec Schema

```go
// extensions/api/v1beta1/sandboxwarmpool_types.go
type SandboxWarmPoolSpec struct {
    Replicas       int32                          `json:"replicas"`
    TemplateRef    SandboxTemplateRef             `json:"sandboxTemplateRef,omitempty"`
    UpdateStrategy *SandboxWarmPoolUpdateStrategy `json:"updateStrategy,omitempty"`
}
```

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| `spec.replicas` | `int32`, `minimum: 0` | yes | Desired number of pre-allocated sandboxes. Exposed through the `scale` subresource for HPAs. |
| `spec.sandboxTemplateRef.name` | `string` | yes | Name of the `SandboxTemplate` (same namespace) whose `podTemplate`, `service`, and `volumeClaimTemplates` are used to build pool sandboxes. Indexed by `.spec.sandboxTemplateRef.name` via `TemplateRefField`. |
| `spec.updateStrategy.type` | `Recreate` \| `OnReplenish` | no | Defaults to `OnReplenish`. Governs how stale sandboxes are reconciled when the underlying template drifts. |

The `TemplateRefField` constant is wired into a manager field indexer in `SetupWithManager` and used by `findWarmPoolsForTemplate` so that template events fan out to all pools referencing that template name. The code comment in the spec is explicit that the JSON tag `sandboxTemplateRef` and the indexer constant must stay in sync.

Sources: [extensions/api/v1beta1/sandboxwarmpool_types.go:24-69](extensions/api/v1beta1/sandboxwarmpool_types.go), [extensions/controllers/sandboxwarmpool_controller.go:534-583](extensions/controllers/sandboxwarmpool_controller.go)
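
The fan-out from a template event to its pools can be pictured with an in-memory index. This is a sketch of the idea only: the real controller registers the field with the manager's field indexer, whereas the `warmPool` type, `buildIndex`, and `poolsForTemplate` below are hypothetical and use a plain map.

```go
package main

import (
	"fmt"
	"sort"
)

// templateRefField mirrors the index key named on this page.
const templateRefField = ".spec.sandboxTemplateRef.name"

// warmPool is a flattened, hypothetical view of a SandboxWarmPool.
type warmPool struct {
	Namespace       string
	Name            string
	TemplateRefName string
}

// indexByTemplateRef extracts the indexed value(s) from one pool.
func indexByTemplateRef(p warmPool) []string {
	if p.TemplateRefName == "" {
		return nil
	}
	return []string{p.TemplateRefName}
}

// index maps "namespace/templateName" to the pools that reference it.
type index map[string][]string

func buildIndex(pools []warmPool) index {
	idx := index{}
	for _, p := range pools {
		for _, v := range indexByTemplateRef(p) {
			key := p.Namespace + "/" + v
			idx[key] = append(idx[key], p.Name)
		}
	}
	for k := range idx {
		sort.Strings(idx[k]) // stable output for the demo
	}
	return idx
}

// poolsForTemplate returns the pools to enqueue when a template changes.
func poolsForTemplate(idx index, namespace, template string) []string {
	return idx[namespace+"/"+template]
}

func main() {
	idx := buildIndex([]warmPool{
		{Namespace: "ml", Name: "fast-pool", TemplateRefName: "jupyter-template"},
		{Namespace: "ml", Name: "big-pool", TemplateRefName: "jupyter-template"},
		{Namespace: "ml", Name: "other", TemplateRefName: "batch-template"},
	})
	fmt.Println("indexed field:", templateRefField)
	fmt.Println(poolsForTemplate(idx, "ml", "jupyter-template"))
}
```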

### Status Schema

```go
type SandboxWarmPoolStatus struct {
    Replicas      int32  `json:"replicas,omitempty"`
    ReadyReplicas int32  `json:"readyReplicas,omitempty"`
    Selector      string `json:"selector,omitempty"`
}
```

`status.replicas` is the count of active (non-deleting, owned-or-adopted, non-stale) sandboxes the controller currently observes. `status.readyReplicas` counts those whose `Sandbox` `Ready` condition is `True`. `status.selector` is the stringified label selector (`agents.x-k8s.io/warm-pool-sandbox=<NameHash(poolName)>`) used internally and exported so the `scale` subresource can attach an HPA via `labelSelectorPath: .status.selector`.

Status writes go through Server-Side Apply with field owner `warmpool-controller` and `ForceOwnership`, and are skipped when semantically unchanged from the prior status snapshot.

Sources: [extensions/api/v1beta1/sandboxwarmpool_types.go:71-84](extensions/api/v1beta1/sandboxwarmpool_types.go), [extensions/controllers/sandboxwarmpool_controller.go:159-170](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:416-443](extensions/controllers/sandboxwarmpool_controller.go)
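
A minimal sketch of how the three status fields relate to the pool members, and of the skip-when-unchanged check. The `member` type and both function names are hypothetical; the selector format follows the description above, and the hash value in the demo is made up.

```go
package main

import "fmt"

// warmPoolSandboxLabel is the membership label key described on this page.
const warmPoolSandboxLabel = "agents.x-k8s.io/warm-pool-sandbox"

// member is a hypothetical summary of one candidate sandbox.
type member struct {
	Deleting bool
	Stale    bool
	Ready    bool
}

// poolStatus mirrors the three status fields.
type poolStatus struct {
	Replicas      int32
	ReadyReplicas int32
	Selector      string
}

// computeStatus counts active members and builds the selector string.
func computeStatus(poolNameHash string, members []member) poolStatus {
	st := poolStatus{Selector: warmPoolSandboxLabel + "=" + poolNameHash}
	for _, m := range members {
		if m.Deleting || m.Stale {
			continue // not counted as active
		}
		st.Replicas++
		if m.Ready {
			st.ReadyReplicas++
		}
	}
	return st
}

// needsWrite reports whether the status changed since the last snapshot.
func needsWrite(prev, next poolStatus) bool {
	return prev != next
}

func main() {
	members := []member{{Ready: true}, {Ready: false}, {Deleting: true, Ready: true}, {Stale: true}}
	prev := poolStatus{}
	next := computeStatus("1a2b3c4d", members) // hash value is illustrative
	fmt.Printf("%+v write=%t\n", next, needsWrite(prev, next))
	fmt.Println("second pass write:", needsWrite(next, computeStatus("1a2b3c4d", members)))
}
```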

## Example

```yaml
# extensions/examples/sandboxwarmpool.yaml
apiVersion: extensions.agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: sandboxwarmpool-example
spec:
  updateStrategy:
    type: Recreate
  replicas: 1
  sandboxTemplateRef:
    name: secure-datascience-template
```

Note that the bundled example file uses `v1alpha1` in its `apiVersion`, while the generated CRD (`extensions.agents.x-k8s.io_sandboxwarmpools.yaml`) only serves and stores `v1beta1`. Change the `apiVersion` to `extensions.agents.x-k8s.io/v1beta1` before applying the manifest to a cluster.

Sources: [extensions/examples/sandboxwarmpool.yaml:1-11](extensions/examples/sandboxwarmpool.yaml), [k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml:18-27](k8s/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml)

## Controller Architecture

```mermaid
flowchart LR
    subgraph User["User / HPA"]
        SWP["SandboxWarmPool (spec.replicas, sandboxTemplateRef, updateStrategy)"]
    end

    subgraph TemplateNS["Same namespace"]
        ST["SandboxTemplate (referenced by name)"]
    end

    subgraph Ctrl["SandboxWarmPoolReconciler"]
        REC["Reconcile()"]
        POOL["reconcilePool()"]
        FILTER["filterActiveSandboxes() — adopt orphans, drop stale"]
        STALE["isSandboxStale() — pod template hash + DeepEqual"]
        BUILD["buildSandboxCR() — apply secure defaults, set ownerRef"]
        SLOW["slowStartBatch() — create/delete in parallel"]
    end

    subgraph Cluster["Pool members"]
        SB1["Sandbox (label warm-pool-sandbox=H(poolName))"]
        SB2["Sandbox ..."]
    end

    subgraph Claim["Consumer side"]
        SC["SandboxClaim"]
        SCR["SandboxClaimReconciler.adoptSandboxFromCandidates()"]
        Q["WarmSandboxQueue (templateRefHash → sandbox keys)"]
    end

    SWP --> REC --> POOL --> FILTER --> STALE
    POOL -->|need more| BUILD --> SLOW --> SB1
    POOL -->|too many| SLOW -.->|delete| SB2
    ST -. watched .-> REC
    SB1 -. ownedBy .-> SWP
    SB1 -. enqueued by hash .-> Q --> SCR --> SC
    SCR -->|completeAdoption: strip labels, reset ownerRef| SB1
```

`SandboxWarmPoolReconciler` owns the supply side. It watches `SandboxWarmPool` objects, owns `Sandbox` objects via `Owns(&sandboxv1beta1.Sandbox{})`, and also watches `SandboxTemplate` so that a template change triggers reconciliation of every warm pool that references it. Concurrency is controlled by `MaxConcurrentReconciles` passed at setup, and `MaxBatchSize` (defaulting to `sandboxCreateDeleteMaxBatchSize = 300`) caps the number of creates/deletes per reconcile pass.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:48-66](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:534-583](extensions/controllers/sandboxwarmpool_controller.go)

## Reconciliation Loop

`Reconcile` fetches the pool, returns early on deletion, snapshots the status, calls `reconcilePool`, and then patches the status via SSA. The interesting work happens in `reconcilePool` and its helpers.

1. **Hash the pool name.** `poolNameHash := sandboxcontrollers.NameHash(warmPool.Name)` produces an 8-character hash used as the value of the membership label `agents.x-k8s.io/warm-pool-sandbox`.
2. **List candidate sandboxes.** Sandboxes carrying that label are listed within the pool's namespace.
3. **Resolve the template and compute its pod-template hash.** `fetchTemplateAndHash` retrieves the `SandboxTemplate` and computes `computePodTemplateHash(template)` from `Spec.PodTemplate` (JSON-marshaled, then hashed). A `NotFound` result is tolerated so that create/delete decisions can still proceed when the template is missing; any other failure is joined into the returned error.
4. **Filter, adopt, drop stale.** `filterActiveSandboxes` walks each candidate and decides between *ignore*, *adopt*, *delete stale*, or *keep active* (see [Adoption and Ownership](#adoption-and-ownership) and [Update Strategies](#update-strategies)).
5. **Sweep stuck sandboxes.** Any active sandbox that is not `Ready` and is older than the constant `warmPoolReadinessGracePeriod = 5 * time.Minute` is deleted. This bounds how long a wedged pre-warm can occupy a slot.
6. **Compute deltas.** `currentReplicas` is the count of healthy active sandboxes; the controller creates or deletes to converge toward `spec.replicas`. Both operations use `slowStartBatch`, which starts with one parallel call and doubles the batch on every success up to the remaining count, capped per reconcile by `MaxBatchSize`.
7. **Delete ordering.** When over-provisioned, the pool sorts its sandboxes so that deletion takes *unready first, then newest first*, which preserves the older Ready members that are the most valuable adoption candidates.
8. **Status update.** `status.replicas`, `status.readyReplicas`, and `status.selector` are written and applied with SSA only if changed.
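
The delete ordering in step 7 can be sketched with a stable sort. This is a minimal illustration using only the standard library; the `member` type, `pickVictims`, and `demoPool` are hypothetical stand-ins and not the controller's actual types or helpers.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// member is a hypothetical, reduced stand-in for a pool-owned Sandbox.
type member struct {
	Name    string
	Ready   bool
	Created time.Time
}

// pickVictims returns the names of the first `excess` members in deletion
// order: unready before ready, and newest before oldest within each group.
func pickVictims(members []member, excess int) []string {
	sorted := append([]member(nil), members...) // copy; leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Ready != sorted[j].Ready {
			return !sorted[i].Ready // unready sorts ahead of ready
		}
		return sorted[i].Created.After(sorted[j].Created) // newest first
	})
	if excess < 0 {
		excess = 0
	}
	if excess > len(sorted) {
		excess = len(sorted)
	}
	names := make([]string, 0, excess)
	for _, m := range sorted[:excess] {
		names = append(names, m.Name)
	}
	return names
}

// demoPool builds four members with distinct ages and readiness.
func demoPool() []member {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return []member{
		{Name: "old-ready", Ready: true, Created: t0},
		{Name: "new-ready", Ready: true, Created: t0.Add(2 * time.Hour)},
		{Name: "old-unready", Ready: false, Created: t0.Add(1 * time.Hour)},
		{Name: "new-unready", Ready: false, Created: t0.Add(3 * time.Hour)},
	}
}

func main() {
	// The pool holds 4 members but only 1 is wanted, so 3 must go.
	fmt.Println(pickVictims(demoPool(), 3))
}
```

With this input the three victims are `new-unready`, `old-unready`, and `new-ready`, so the sole survivor is the oldest Ready member.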

Sources: [extensions/controllers/sandboxwarmpool_controller.go:67-229](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:303-325](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:585-621](extensions/controllers/sandboxwarmpool_controller.go)

## Labels, Annotations, and Ownership

`buildSandboxCR` constructs each pool member. It sets a controller `OwnerReference` from the new `Sandbox` to its `SandboxWarmPool` (so deletions cascade), copies `template.Spec.PodTemplate.ObjectMeta.Labels` and `Annotations` into the pod template, and overlays the following membership/identity labels and annotations:

| Key | Where | Value | Purpose |
| --- | --- | --- | --- |
| `agents.x-k8s.io/warm-pool-sandbox` | Sandbox + pod template labels | `NameHash(poolName)` | Pool membership; used as the list selector and in `status.selector`. Also used by `SandboxClaim` to pin adoption to a specific pool. |
| `agents.x-k8s.io/sandbox-template-ref-hash` | Sandbox + pod template labels | `SandboxTemplateRefHash(templateRefName)` | Allows `SandboxClaim` to find pool members for a given template by hash. |
| `SandboxPodTemplateHashLabel` (from `sandboxv1beta1`) | Sandbox + pod template labels | `computePodTemplateHash(template)` | Identifies the exact pod-template revision a member was built from; consumed by `isSandboxStale`. |
| `SandboxTemplateRefAnnotation` | Sandbox annotation | `warmPool.Spec.TemplateRef.Name` | Plain-text record of which template produced the sandbox. |

The pod spec is normalized with `ApplySandboxSecureDefaults(template, &sandbox.Spec.PodTemplate.Spec)` before creation, and the same normalization is reapplied when computing the expected spec in `comparePodSpecs`. This is what lets the staleness check use `equality.Semantic.DeepEqual` without false positives from controller-defaulted fields.
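
To illustrate why normalizing both sides matters, the sketch below compares a template spec with a stored, already-defaulted spec, and hashes the JSON form. Everything here is an assumption made for illustration: `podSpec` is a tiny stand-in for the real pod spec, `applyDefaults` invents a single default, `reflect.DeepEqual` stands in for `equality.Semantic.DeepEqual`, and FNV-1a is a placeholder because the source only says the template is JSON-marshaled and then hashed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"reflect"
)

// podSpec is a hypothetical, heavily reduced stand-in for a pod spec.
type podSpec struct {
	Image          string `json:"image"`
	AutomountToken *bool  `json:"automountToken,omitempty"`
}

// applyDefaults mimics a defaulting pass by filling a field the author left
// unset. The specific default is invented for this sketch.
func applyDefaults(s *podSpec) {
	if s.AutomountToken == nil {
		off := false
		s.AutomountToken = &off
	}
}

// hashSpec JSON-marshals the spec and hashes the bytes (FNV-1a is an assumption).
func hashSpec(s podSpec) (string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	h := fnv.New32a()
	_, _ = h.Write(raw) // hash.Hash.Write never returns an error
	return fmt.Sprintf("%08x", h.Sum32()), nil
}

// matchesTemplate normalizes a copy of the template before comparing it with
// the stored spec, so defaulted fields do not register as drift.
func matchesTemplate(template, stored podSpec) bool {
	applyDefaults(&template) // template is a copy; the caller's value is unchanged
	return reflect.DeepEqual(template, stored)
}

// demo returns the result of a naive comparison and of a normalized one.
func demo() (naive, normalized bool) {
	template := podSpec{Image: "agent:v1"}
	stored := template
	applyDefaults(&stored) // what would have been persisted at creation time
	return reflect.DeepEqual(template, stored), matchesTemplate(template, stored)
}

func main() {
	naive, normalized := demo()
	fmt.Println("naive equal:", naive, "normalized equal:", normalized)
}
```

The naive comparison reports a difference purely because of the defaulted field, while the normalized comparison treats the two specs as equal.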

Sources: [extensions/controllers/sandboxwarmpool_controller.go:48-52](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:327-390](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:521-531](extensions/controllers/sandboxwarmpool_controller.go)

## Adoption and Ownership

`filterActiveSandboxes` distinguishes three states for each labeled sandbox:

```text
┌──────────────────────────────┐
│   Sandbox carries pool label │
└────────────┬─────────────────┘
             ▼
   controllerRef? ──── nil (orphan)
        │                 │
        │                 ├──> If stale → Delete
        │                 └──> Else      → adoptSandbox()  ───┐
        │                                                     │
        ├── ref.UID == warmPool.UID  (owned)                  │
        │   ├──> If Recreate strategy && stale → Delete       │
        │   └──> Else → keep                                  │
        │                                                     │
        └── ref.UID != warmPool.UID  (foreign)                │
            └──> Ignore (log only)                            │
                                                              │
adoptSandbox: SetControllerReference(warmPool, sb)  <─────────┘
              + r.Update(ctx, sb)
```

Adoption is the mechanism that lets an orphaned, label-matching sandbox rejoin a pool — for example, after the previous owner pool was deleted or after the `SandboxClaim` controller's owner-reference flip raced with the pool's lister. Conversely, sandboxes whose `controllerRef.UID` does not match the pool are explicitly ignored: once a `SandboxClaim` takes ownership during adoption, that sandbox is no longer counted toward `status.replicas`.
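
The decision tree above can be restated as a small pure function. This is a sketch under simplifying assumptions: `candidate`, `action`, and `classify` are hypothetical names, a UID is modeled as a plain string, and an empty `ControllerUID` stands for a missing controller reference.

```go
package main

import "fmt"

type action string

const (
	actionIgnore action = "ignore"
	actionAdopt  action = "adopt"
	actionDelete action = "delete"
	actionKeep   action = "keep"
)

// candidate is a hypothetical projection of a labeled Sandbox.
type candidate struct {
	ControllerUID string // "" means no controller owner reference (orphan)
	Stale         bool
}

// classify mirrors the orphan / owned / foreign branches of the diagram.
func classify(c candidate, poolUID string, recreate bool) action {
	switch {
	case c.ControllerUID == "": // orphan
		if c.Stale {
			return actionDelete
		}
		return actionAdopt
	case c.ControllerUID == poolUID: // owned by this pool
		if recreate && c.Stale {
			return actionDelete
		}
		return actionKeep
	default: // owned by something else, for example a SandboxClaim
		return actionIgnore
	}
}

func main() {
	const pool = "pool-uid"
	fmt.Println(classify(candidate{}, pool, false))                                   // fresh orphan
	fmt.Println(classify(candidate{Stale: true}, pool, false))                        // stale orphan
	fmt.Println(classify(candidate{ControllerUID: pool, Stale: true}, pool, false))   // stale, OnReplenish
	fmt.Println(classify(candidate{ControllerUID: pool, Stale: true}, pool, true))    // stale, Recreate
	fmt.Println(classify(candidate{ControllerUID: "claim-uid", Stale: true}, pool, true)) // foreign
}
```

Note that a stale owned member is kept under `OnReplenish` and deleted only under `Recreate`, while a stale orphan is deleted regardless of strategy.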

Sources: [extensions/controllers/sandboxwarmpool_controller.go:231-301](extensions/controllers/sandboxwarmpool_controller.go)

## Update Strategies

```go
const (
    RecreateSandboxWarmPoolUpdateStrategyType    SandboxWarmPoolUpdateStrategyType = "Recreate"
    OnReplenishSandboxWarmPoolUpdateStrategyType SandboxWarmPoolUpdateStrategyType = "OnReplenish"
)
```

| Strategy | Behavior when template drifts | When it triggers replacement |
| --- | --- | --- |
| `OnReplenish` (default) | Stale members keep running until they are manually deleted or adopted out by a `SandboxClaim`. Fresh members built from the new template fill the gap on the next reconcile. | Replacement happens lazily as members are removed for other reasons. |
| `Recreate` | Stale members are eagerly deleted in the reconcile pass. The pool then refills under `slowStartBatch`. Per the type comment, only `PodTemplate` spec changes trigger recreate; pure label/annotation edits on the template do not. | Replacement happens as soon as the controller observes drift. |

`isSandboxStale` evaluates drift in three layers, short-circuiting where possible to avoid per-sandbox `DeepEqual` work:

1. If the sandbox's `agents.x-k8s.io/sandbox-template-ref-hash` label does not match `SandboxTemplateRefHash(template.Name)`, it is stale (template binding changed at the name level).
2. For owned members, if the recorded `SandboxPodTemplateHashLabel` matches `currentPodTemplateHash`, it is fresh — no further work.
3. Otherwise (mismatched hashes, or orphan), `comparePodSpecs` runs `ApplySandboxSecureDefaults` against the template and uses `equality.Semantic.DeepEqual` to compare. Results are memoized in `vettedHashes` per reconcile so identical legacy revisions are only compared once.

If `currentPodTemplateHash` could not be computed (marshal error), the staleness check is skipped to avoid a mass deletion on a transient hashing failure.

Note that the `updateStrategy.type` field has `+kubebuilder:default=OnReplenish`. The Go path additionally falls back to `OnReplenish` if `spec.updateStrategy` is unset or contains an unknown value.
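
The layered check and the strategy fallback can be sketched as follows. This is an illustrative model, not the controller's code: `sandboxView` and `staleChecker` are hypothetical, a string comparison stands in for the `DeepEqual` layer, an empty `currentHash` models a hash that could not be computed, and keying the memo by the recorded hash (and skipping the memo when the label is absent) is an assumption.

```go
package main

import "fmt"

// sandboxView is a hypothetical projection of the fields the check reads.
type sandboxView struct {
	Owned           bool   // controller reference points at this pool
	TemplateRefHash string // sandbox-template-ref-hash label
	PodTemplateHash string // recorded pod-template hash ("" if absent)
	Spec            string // stand-in for the normalized pod spec
}

type staleChecker struct {
	wantRefHash  string
	currentHash  string // "" models a hash that could not be computed
	wantSpec     string
	vetted       map[string]bool // per-reconcile memo: recorded hash -> stale?
	deepCompares int             // how often the expensive layer ran
}

func (c *staleChecker) isStale(sb sandboxView) bool {
	if c.currentHash == "" {
		return false // skip the check rather than risk a mass deletion
	}
	if sb.TemplateRefHash != c.wantRefHash {
		return true // layer 1: bound to a different template name
	}
	if sb.Owned && sb.PodTemplateHash == c.currentHash {
		return false // layer 2: hash matches, nothing more to do
	}
	if sb.PodTemplateHash != "" {
		if stale, seen := c.vetted[sb.PodTemplateHash]; seen {
			return stale // same legacy revision was already compared
		}
	}
	c.deepCompares++ // layer 3: the expensive comparison
	stale := sb.Spec != c.wantSpec
	if sb.PodTemplateHash != "" {
		c.vetted[sb.PodTemplateHash] = stale
	}
	return stale
}

// effectiveStrategy falls back to OnReplenish for unset or unknown values.
func effectiveStrategy(s string) string {
	if s == "Recreate" {
		return "Recreate"
	}
	return "OnReplenish"
}

// demo runs five sandboxes through one checker and reports each verdict
// plus the number of deep comparisons performed.
func demo() ([]bool, int) {
	c := &staleChecker{wantRefHash: "ref1", currentHash: "h2", wantSpec: "v2", vetted: map[string]bool{}}
	pool := []sandboxView{
		{Owned: true, TemplateRefHash: "ref1", PodTemplateHash: "h2", Spec: "v2"}, // fresh via layer 2
		{Owned: true, TemplateRefHash: "ref1", PodTemplateHash: "h1", Spec: "v1"}, // stale via layer 3
		{Owned: true, TemplateRefHash: "ref1", PodTemplateHash: "h1", Spec: "v1"}, // stale via memo
		{Owned: true, TemplateRefHash: "ref0", PodTemplateHash: "h2", Spec: "v2"}, // stale via layer 1
		{Owned: false, TemplateRefHash: "ref1", Spec: "v2"},                       // orphan, compared directly
	}
	out := make([]bool, 0, len(pool))
	for _, sb := range pool {
		out = append(out, c.isStale(sb))
	}
	return out, c.deepCompares
}

func main() {
	verdicts, compares := demo()
	fmt.Println(verdicts, "deep compares:", compares)
}
```

In this run only two of the five sandboxes reach the expensive comparison: the first legacy revision and the unlabeled orphan.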

Sources: [extensions/api/v1beta1/sandboxwarmpool_types.go:49-69](extensions/api/v1beta1/sandboxwarmpool_types.go), [extensions/controllers/sandboxwarmpool_controller.go:240-301](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:462-531](extensions/controllers/sandboxwarmpool_controller.go)

## Consumption by SandboxClaim

A `SandboxClaim` chooses how to consume warm pools through its `spec.warmpool` field (`WarmPoolPolicy`):

| Policy value | Meaning |
| --- | --- |
| `"default"` (default) | Adopt any pool member whose `sandbox-template-ref-hash` matches the claim's template. |
| `"none"` | Skip warm pools entirely and always create a cold sandbox. |
| `<pool name>` | Only adopt members carrying `agents.x-k8s.io/warm-pool-sandbox = NameHash(<pool name>)`. |

The `SandboxClaim` controller keeps a `WarmSandboxQueue` keyed by `SandboxTemplateRefHash(templateRefName)`. When picking a candidate, `getCandidate` enforces the `IsSpecificPool` check by comparing the candidate's pool-membership label against `NameHash(string(policy))`, returning unmatched candidates to the queue.
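
The three policy values can be modeled with a short selection routine. The sketch below is an approximation under stated assumptions: `warmSandbox`, `getCandidate`, and `nameHash` are hypothetical stand-ins (the real `NameHash` algorithm is not shown in the source), the slice represents a single template bucket of the queue, and an empty policy is treated like `"default"`. Unmatched entries simply keep their position here, whereas the text above describes them as being returned to the queue.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type warmPoolPolicy string

const (
	policyDefault warmPoolPolicy = "default"
	policyNone    warmPoolPolicy = "none"
)

// isSpecificPool reports whether the policy names a particular pool.
func (p warmPoolPolicy) isSpecificPool() bool {
	return p != "" && p != policyDefault && p != policyNone
}

// nameHash is a hypothetical 8-hex-character hash used only for this sketch.
func nameHash(name string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(name))
	return fmt.Sprintf("%08x", h.Sum32())
}

// warmSandbox is a reduced stand-in for a queued pool member.
type warmSandbox struct {
	Name      string
	PoolLabel string // value of agents.x-k8s.io/warm-pool-sandbox
}

// getCandidate picks the first queued sandbox allowed by the policy and
// returns it together with the remaining queue.
func getCandidate(queue []warmSandbox, policy warmPoolPolicy) (*warmSandbox, []warmSandbox) {
	if policy == policyNone {
		return nil, queue // cold start: never touch the warm queue
	}
	for i := range queue {
		if policy.isSpecificPool() && queue[i].PoolLabel != nameHash(string(policy)) {
			continue // wrong pool; leave it for other claims
		}
		picked := queue[i]
		rest := append(append([]warmSandbox(nil), queue[:i]...), queue[i+1:]...)
		return &picked, rest
	}
	return nil, queue
}

func demoQueue() []warmSandbox {
	return []warmSandbox{
		{Name: "sb-a", PoolLabel: nameHash("pool-a")},
		{Name: "sb-b", PoolLabel: nameHash("pool-b")},
	}
}

func main() {
	for _, p := range []warmPoolPolicy{"default", "none", "pool-b"} {
		picked, rest := getCandidate(demoQueue(), p)
		name := "<cold start>"
		if picked != nil {
			name = picked.Name
		}
		fmt.Printf("policy=%s picked=%s remaining=%d\n", p, name, len(rest))
	}
}
```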

Adoption completes in `completeAdoption`, which is what makes the handoff irreversible from the pool's perspective:

```go
// extensions/controllers/sandboxclaim_controller.go
delete(adopted.Labels, warmPoolSandboxLabel)
delete(adopted.Labels, sandboxTemplateRefHash)
delete(adopted.Labels, v1beta1.SandboxPodTemplateHashLabel)

// Transfer ownership from SandboxWarmPool to SandboxClaim
adopted.OwnerReferences = nil
if err := controllerutil.SetControllerReference(claim, adopted, r.Scheme); err != nil {
    return fmt.Errorf("failed to set controller reference on adopted sandbox: %w", err)
}
```

After this point, the pool controller's list (filtered by the membership label) no longer sees the sandbox, so its `status.replicas` drops by one and the next reconcile schedules a fresh replacement. The handoff also explains the asymmetry of update strategies: `OnReplenish` deliberately leans on this drain-and-refill path to roll the pool forward.

One constraint propagates from the warm-pool design to claims: when `WarmPool` policy is anything other than `none`, `SandboxClaim` rejects `spec.env`, because injecting environment variables into an already-running pool sandbox is not supported.

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:33-55](extensions/api/v1beta1/sandboxclaim_types.go), [extensions/api/v1beta1/sandboxclaim_types.go:124-151](extensions/api/v1beta1/sandboxclaim_types.go), [extensions/controllers/sandboxclaim_controller.go:591-645](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:728-741](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1155-1178](extensions/controllers/sandboxclaim_controller.go)

## Operational Notes

- **HPA integration.** Because the `scale` subresource maps `spec.replicas` and `status.replicas` plus a label selector (`status.selector`), a standard `HorizontalPodAutoscaler` can target a `SandboxWarmPool` directly. The Go comment on `Spec.Replicas` calls this out explicitly.
- **Throughput caps.** `MaxBatchSize` (default 300) bounds creates and deletes per reconcile. Within that cap, the `slowStartBatch` helper starts with one parallel call and doubles the batch after each success, which keeps the cluster from being hit with a burst of simultaneous Sandbox creates on startup.
- **Readiness watchdog.** Sandboxes that fail to reach `Ready` within `warmPoolReadinessGracePeriod` (5 minutes) are deleted on the next reconcile, ensuring the pool does not accumulate wedged slots.
- **Cross-resource indexing.** The `TemplateRefField = ".spec.sandboxTemplateRef.name"` constant is shared with `SandboxClaim` for the same purpose; both controllers register a field indexer so that watching `SandboxTemplate` produces correct fan-out via `findWarmPoolsForTemplate`.
- **Deletion.** When a `SandboxWarmPool` is being deleted (`DeletionTimestamp` non-zero), `Reconcile` returns immediately; cascading delete of the owned `Sandbox` objects is left to Kubernetes garbage collection driven by the controller-owner reference set in `buildSandboxCR`.
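
The slow-start behavior mentioned under throughput caps can be sketched as below. The function follows the well-known pattern of the helper with the same name in upstream Kubernetes workload controllers; whether the agent-sandbox helper matches it line for line is an assumption. `runCreates` and `minInt` are hypothetical helpers added so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// slowStartBatch calls fn up to count times in parallel batches that start at
// initialBatch and double after every fully successful batch. It stops after
// the first batch that contains a failure and returns the number of successes.
func slowStartBatch(count, initialBatch int, fn func() error) (int, error) {
	remaining := count
	successes := 0
	for batch := minInt(remaining, initialBatch); batch > 0; batch = minInt(2*batch, remaining) {
		errCh := make(chan error, batch)
		var wg sync.WaitGroup
		wg.Add(batch)
		for i := 0; i < batch; i++ {
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					errCh <- err
				}
			}()
		}
		wg.Wait()
		successes += batch - len(errCh)
		if len(errCh) > 0 {
			return successes, <-errCh
		}
		remaining -= batch
	}
	return successes, nil
}

// runCreates simulates `needed` creates capped at maxBatch per pass. The
// failAt-th call (1-based, 0 means never) returns an error.
func runCreates(needed, maxBatch, failAt int) (int, error) {
	var calls int64
	return slowStartBatch(minInt(needed, maxBatch), 1, func() error {
		if n := atomic.AddInt64(&calls, 1); failAt > 0 && n == int64(failAt) {
			return errors.New("simulated create failure")
		}
		return nil
	})
}

func main() {
	fmt.Println(runCreates(7, 300, 0))   // batches of 1, 2, 4
	fmt.Println(runCreates(500, 300, 0)) // capped at 300 for this pass
	fmt.Println(runCreates(7, 300, 4))   // third batch contains the failure
}
```

Seven creates run as batches of 1, 2, and 4. When the fourth call fails, the third batch still completes its other three calls, so six creates succeed and the remaining work waits for a later reconcile.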

Sources: [extensions/api/v1beta1/sandboxwarmpool_types.go:24-47](extensions/api/v1beta1/sandboxwarmpool_types.go), [extensions/controllers/sandboxwarmpool_controller.go:48-52](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:80-148](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:534-583](extensions/controllers/sandboxwarmpool_controller.go)

## Summary

`SandboxWarmPool` is a small CRD with three knobs (`replicas`, `sandboxTemplateRef`, and `updateStrategy`) that drive a controller responsible for keeping a population of pre-warmed `Sandbox` objects in lockstep with a referenced `SandboxTemplate`. Membership is encoded in a hashed label on each pool member, ownership is held by the pool until a `SandboxClaim` adopts a member, and template drift is handled either lazily (`OnReplenish`) or eagerly (`Recreate`). The `scale` subresource and `status.selector` make the pool first-class for HPAs, while the slow-start batching, the 5-minute readiness watchdog, and the staleness hash cache shape the controller's runtime behavior under churn.

---

## 09. Conditions, Reasons & Status Surfaces

> Catalogue of condition types (Ready, Suspended, Finished), reason strings, and the annotation/label keys (pod-name, template-ref, propagated-labels) that controllers use to coordinate state.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/09-conditions-reasons-status-surfaces.md
- Generated: 2026-05-25T22:36:11.523Z

### Source Files

- `api/v1beta1/sandbox_types.go`
- `extensions/api/v1beta1/sandboxclaim_types.go`
- `extensions/api/v1beta1/sandboxwarmpool_types.go`
- `docs/api.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/api/v1beta1/sandboxwarmpool_types.go](extensions/api/v1beta1/sandboxwarmpool_types.go)
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [internal/lifecycle/expiry.go](internal/lifecycle/expiry.go)
</details>

# Conditions, Reasons & Status Surfaces

This page catalogues the **status conditions** (`Ready`, `Suspended`, `Finished`), their machine-readable **reason strings**, and the **annotation/label keys** the `agent-sandbox` controllers use to coordinate state between `Sandbox`, `SandboxClaim`, and `SandboxWarmPool`. These surfaces are the wire format by which the core `Sandbox` controller publishes pod state, the `SandboxClaim` controller mirrors and translates that state for callers, and the `SandboxWarmPool` controller tracks which pods are still considered fresh for adoption.

Every condition follows the standard `metav1.Condition` shape (`Type`, `Status`, `Reason`, `Message`, `ObservedGeneration`, `LastTransitionTime`), so consumers can rely on `meta.FindStatusCondition` semantics. The annotation/label vocabulary is intentionally small and namespaced under `agents.x-k8s.io/` so callers can write selectors and tooling against a stable set of keys.

## Condition Types

The three condition types reported on `Sandbox.status.conditions` are declared as `ConditionType` constants in `api/v1beta1/sandbox_types.go`. The `SandboxClaim` re-uses the same condition type strings for `Ready` and `Finished` so a caller can read either resource with the same parsing logic.

| Type | Declared at | Reported on | Meaning |
|------|-------------|-------------|---------|
| `Ready` | `SandboxConditionReady` | `Sandbox.status`, `SandboxClaim.status` | The sandbox's pod (and optional Service) are observed Ready, or — when False — the reason captures why not. |
| `Suspended` | `SandboxConditionSuspended` | `Sandbox.status` only | Emitted only while `spec.replicas == 0`; its status tracks whether the underlying pod has actually been torn down yet. |
| `Finished` | `SandboxConditionFinished` | `Sandbox.status`, mirrored onto `SandboxClaim.status` | The backing pod has reached a terminal phase (`PodSucceeded`/`PodFailed`). Only present while the pod still exists. |

Sources: [api/v1beta1/sandbox_types.go:27-55](api/v1beta1/sandbox_types.go), [controllers/sandbox_controller.go:273-417](controllers/sandbox_controller.go)

### Ready

`computeReadyCondition` is the single funnel for `Sandbox`'s `Ready` value. It starts pessimistic (`False`, `Reason=DependenciesNotReady`) and only transitions to `True` once the pod has `PodReady=True`, has at least one `PodIP`, and the optional headless `Service` is present when required.

```go
// controllers/sandbox_controller.go (excerpt)
readyCondition := metav1.Condition{
    Type:               string(sandboxv1beta1.SandboxConditionReady),
    ObservedGeneration: sandbox.Generation,
    Status:             metav1.ConditionFalse,
    Reason:             sandboxv1beta1.SandboxReasonDependenciesNotReady,
}
// ... if pod Ready AND service satisfied ...
readyCondition.Status = metav1.ConditionTrue
readyCondition.Reason = sandboxv1beta1.SandboxReasonDependenciesReady
```

Sources: [controllers/sandbox_controller.go:313-392](controllers/sandbox_controller.go)

### Suspended

`computeSuspendedCondition` **emits a condition only when `spec.replicas == 0`**. The `Status` depends on whether the pod has actually been deleted: `True/PodTerminated` once the pod is gone, `False/PodNotTerminated` while it is still draining. When `replicas != 0` the controller does not emit a `Suspended` condition at all, so any existing entry keeps whatever value `meta.SetStatusCondition` last wrote.
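
A reduced model of this behavior is shown below. The `condition` struct and the three functions are hypothetical stand-ins written for illustration; `setCondition` only approximates what `meta.SetStatusCondition` does (it upserts by type and ignores timestamps and generations).

```go
package main

import "fmt"

// condition is a reduced stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason string
}

// computeSuspended returns nil unless replicas is zero.
func computeSuspended(replicas int32, podExists bool) *condition {
	if replicas != 0 {
		return nil
	}
	if podExists {
		return &condition{Type: "Suspended", Status: "False", Reason: "PodNotTerminated"}
	}
	return &condition{Type: "Suspended", Status: "True", Reason: "PodTerminated"}
}

// setCondition upserts by Type.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// reconcileSuspended applies the computed condition and leaves prior entries
// untouched when nothing is emitted.
func reconcileSuspended(conds []condition, replicas int32, podExists bool) []condition {
	if c := computeSuspended(replicas, podExists); c != nil {
		return setCondition(conds, *c)
	}
	return conds
}

func main() {
	var conds []condition
	conds = reconcileSuspended(conds, 0, true) // scale to zero, pod draining
	fmt.Println(conds)
	conds = reconcileSuspended(conds, 0, false) // pod gone
	fmt.Println(conds)
	conds = reconcileSuspended(conds, 1, false) // scaled back up: entry is left as is
	fmt.Println(conds)
}
```

The last call shows the consequence for consumers: after scaling back up, the old `Suspended` entry is still present in this model, so readers should interpret it together with `spec.replicas`.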

Sources: [controllers/sandbox_controller.go:289-311](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:28-33](api/v1beta1/sandbox_types.go)

### Finished

`computeFinishedCondition` returns `nil` unless the pod exists **and** its `Status.Phase` is `PodSucceeded` or `PodFailed`. The reconcile loop strips the `Finished` condition whenever the pod is missing or non-terminal, so its presence is the authoritative "this run is over" signal:

```go
// controllers/sandbox_controller.go (excerpt)
if !hasFinished {
    meta.RemoveStatusCondition(&sandbox.Status.Conditions,
        string(sandboxv1beta1.SandboxConditionFinished))
}
```

`SandboxClaim` reflects the same condition into its own status array via `syncFinishedCondition`, and the `Lifecycle.ttlSecondsAfterFinished` countdown is anchored on `FinishedCondition.LastTransitionTime`. The `lifecycle.FinishedCondition` helper requires `Status == True`; transient or False entries are ignored for TTL purposes.
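
The TTL arithmetic can be sketched as follows. This is a minimal model, assuming the TTL is an optional number of seconds; `condition`, `finishedCondition`, `finishedExpiry`, and the `expired` demo helper are hypothetical and do not reproduce the `internal/lifecycle` package.

```go
package main

import (
	"fmt"
	"time"
)

// condition is a reduced stand-in for metav1.Condition.
type condition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// finishedCondition returns the Finished entry only if its Status is "True".
func finishedCondition(conds []condition) *condition {
	for i := range conds {
		if conds[i].Type == "Finished" && conds[i].Status == "True" {
			return &conds[i]
		}
	}
	return nil
}

// finishedExpiry returns the instant at which the TTL elapses, or false when
// no TTL is set or the workload has not finished.
func finishedExpiry(conds []condition, ttlSeconds *int32) (time.Time, bool) {
	fin := finishedCondition(conds)
	if fin == nil || ttlSeconds == nil {
		return time.Time{}, false
	}
	return fin.LastTransitionTime.Add(time.Duration(*ttlSeconds) * time.Second), true
}

// expired is a demo helper: a Finished condition with the given status
// transitioned `elapsed` seconds ago and is checked against a TTL in seconds.
func expired(status string, ttl int32, elapsed int) bool {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	conds := []condition{{
		Type:               "Finished",
		Status:             status,
		LastTransitionTime: now.Add(-time.Duration(elapsed) * time.Second),
	}}
	at, ok := finishedExpiry(conds, &ttl)
	return ok && !now.Before(at)
}

func main() {
	fmt.Println(expired("True", 60, 90))  // finished 90s ago, TTL 60s
	fmt.Println(expired("True", 60, 30))  // finished 30s ago, TTL 60s
	fmt.Println(expired("False", 60, 90)) // not a True condition, so no countdown
}
```

Because the countdown starts at `LastTransitionTime`, repeated reconciles do not extend it as long as the condition stays `True`.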

Sources: [controllers/sandbox_controller.go:256-268](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:394-417](controllers/sandbox_controller.go), [extensions/controllers/sandboxclaim_controller.go:562-576](extensions/controllers/sandboxclaim_controller.go), [internal/lifecycle/expiry.go:24-82](internal/lifecycle/expiry.go)

## Reason Strings

Reasons are machine-readable codes intended for selectors, alerts, and metric labels. They are declared as untyped string constants alongside the condition types.

### Sandbox-emitted reasons

| Condition | Reason | Status | Source |
|-----------|--------|--------|--------|
| `Ready` | `DependenciesReady` | True | `SandboxReasonDependenciesReady` |
| `Ready` | `DependenciesNotReady` | False | `SandboxReasonDependenciesNotReady` |
| `Ready` | `SandboxSuspended` | False | `SandboxReasonSuspended` (set when `replicas==0`) |
| `Ready` | `SandboxExpired` | False | `SandboxReasonExpired` (set when shutdownTime passes) |
| `Ready` | `ReconcilerError` | False | Free-form, set when reconcile returns a non-nil error |
| `Suspended` | `PodTerminated` | True | `SandboxReasonSuspendedPodTerminated` |
| `Suspended` | `PodNotTerminated` | False | `SandboxReasonSuspendedPodNotTerminated` |
| `Finished` | `PodSucceeded` | True | `SandboxReasonPodSucceeded` |
| `Finished` | `PodFailed` | True | `SandboxReasonPodFailed` |

Sources: [api/v1beta1/sandbox_types.go:27-55](api/v1beta1/sandbox_types.go), [controllers/sandbox_controller.go:289-417](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1080-1127](controllers/sandbox_controller.go)

### SandboxClaim-emitted reasons

The claim controller adds its own reasons for `Ready` while still mirroring the underlying sandbox state when neither expiry nor an error applies.

| Reason | When emitted |
|--------|--------------|
| `ClaimExpired` (`extensionsv1beta1.ClaimExpiredReason`) | `Lifecycle.shutdownTime` or `ttlSecondsAfterFinished` has elapsed; also used as the event reason passed to `Eventf` when the controller records the deletion. |
| `TemplateNotFound` | `getTemplate` returned `ErrTemplateNotFound`. |
| `InvalidMetadata` | `validateAdditionalPodMetadata` rejected the claim's `additionalPodMetadata`. |
| `SandboxMissing` | Reconcile succeeded with no error but the owned `Sandbox` does not exist (and the claim is not expired). |
| `SandboxNotReady` | Underlying `Sandbox` exists but no `Ready` condition was found to forward. |
| `SandboxExpired` | Forwarded from the underlying Sandbox via `hasSandboxExpiredCondition`. |
| `ReconcilerError` | Generic fallback for unrecognized reconcile errors. |

`hasClaimExpiredCondition` and `hasSandboxExpiredCondition` both query `Ready` by reason — the reason is the load-bearing signal that distinguishes "claim TTL fired" from "sandbox TTL fired" even though both surface as `Ready=False`.
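
A caller can apply the same reason-based lookup. The sketch below uses a reduced `condition` struct and a hand-written `findCondition` so that it runs with the standard library alone; with the real API types, `meta.FindStatusCondition` from `k8s.io/apimachinery` fills the role of `findCondition`. The helper names `hasReadyReason` and `expirySource` are hypothetical.

```go
package main

import "fmt"

// condition is a reduced stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason string
}

// findCondition returns the first condition of the given type, or nil.
func findCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// hasReadyReason reports whether the Ready condition carries the given reason.
func hasReadyReason(conds []condition, reason string) bool {
	c := findCondition(conds, "Ready")
	return c != nil && c.Reason == reason
}

// expirySource tells the two expiry paths apart using only the reason.
func expirySource(conds []condition) string {
	switch {
	case hasReadyReason(conds, "ClaimExpired"):
		return "claim"
	case hasReadyReason(conds, "SandboxExpired"):
		return "sandbox"
	default:
		return "none"
	}
}

func main() {
	claimTTL := []condition{{Type: "Ready", Status: "False", Reason: "ClaimExpired"}}
	sandboxTTL := []condition{{Type: "Ready", Status: "False", Reason: "SandboxExpired"}}
	pending := []condition{{Type: "Ready", Status: "False", Reason: "DependenciesNotReady"}}
	fmt.Println(expirySource(claimTTL), expirySource(sandboxTTL), expirySource(pending))
}
```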

Sources: [extensions/api/v1beta1/sandboxclaim_types.go:25-31](extensions/api/v1beta1/sandboxclaim_types.go), [extensions/controllers/sandboxclaim_controller.go:459-546](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1436-1444](extensions/controllers/sandboxclaim_controller.go)

## State Coordination Across Controllers

The conditions and reasons above are wired together so that a single read of the claim's status is sufficient to decide whether to keep waiting, drop the connection, or retry. The diagram captures who writes what, and how the SandboxClaim reconciler folds the underlying Sandbox status back into its own:

```mermaid
flowchart LR
  subgraph CoreCtrl["controllers/sandbox_controller.go"]
    direction TB
    SR[SandboxReconciler]
    SR --> SCond[Sandbox.status.conditions<br/>Ready / Suspended / Finished]
  end

  subgraph ExtCtrl["extensions/controllers/sandboxclaim_controller.go"]
    direction TB
    SCR[SandboxClaimReconciler]
    SCR --> CCond[SandboxClaim.status.conditions<br/>Ready + mirrored Finished]
  end

  subgraph PoolCtrl["extensions/controllers/sandboxwarmpool_controller.go"]
    direction TB
    WPR[SandboxWarmPoolReconciler]
    WPR -.writes labels.-> SandboxObj
  end

  Pod[(corev1.Pod)] -- phase + Ready --> SR
  SandboxObj[(Sandbox CR)] -- mirrored Finished --> SCR
  SCond --> SandboxObj
  SandboxObj -- read in computeReadyCondition --> SCR
  CCond --> ClaimObj[(SandboxClaim CR)]
  SCR -- Eventf<br/>ClaimExpired --> Events[(events.k8s.io)]
```

The mirroring path lives in `computeReadyCondition` and `syncFinishedCondition`: if the underlying `Sandbox` has a `Ready` condition the claim simply forwards it (preserving `Reason`, `Message`, and `LastTransitionTime`); if the sandbox carries a terminal `Finished` it is copied onto the claim, and removed when no longer applicable. Expiry is a special case — the claim emits `Ready=False/ClaimExpired` regardless of the sandbox's own condition so callers can distinguish a deliberate caller-initiated shutdown from an internal failure.

Sources: [extensions/controllers/sandboxclaim_controller.go:459-576](extensions/controllers/sandboxclaim_controller.go), [controllers/sandbox_controller.go:240-271](controllers/sandbox_controller.go)

## Lifecycle State Machine

`Sandbox.status.conditions` evolves through a small state machine driven by pod phase and `spec.replicas`. `Finished` is orthogonal: it can appear alongside `Ready=False` for as long as the terminal pod still exists.

```mermaid
stateDiagram-v2
  [*] --> NotReady: pod missing / pending
  NotReady --> Ready: pod Ready=True\nReason=DependenciesReady
  Ready --> NotReady: pod loses readiness\nReason=DependenciesNotReady
  Ready --> Suspending: replicas=0\nSuspended=False/PodNotTerminated
  Suspending --> Suspended: pod deleted\nSuspended=True/PodTerminated
  Suspended --> NotReady: replicas back to 1
  Ready --> Finished: pod Succeeded/Failed\nFinished=True
  NotReady --> Finished: pod Succeeded/Failed
  Ready --> Expired: shutdownTime elapsed\nReady=False/SandboxExpired
  Finished --> Expired: ttlSecondsAfterFinished elapsed\n(claim-side)
  Expired --> [*]: ShutdownPolicy=Delete\nresource removed
```

Sources: [controllers/sandbox_controller.go:273-417](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1065-1127](controllers/sandbox_controller.go), [internal/lifecycle/expiry.go:47-82](internal/lifecycle/expiry.go)

## Annotation and Label Keys

All cross-controller coordination keys are namespaced under `agents.x-k8s.io/`. Every key is declared once as a constant so test fixtures, the warm-pool controller, and the claim controller all share the same vocabulary.

### Annotations

| Key | Declared at | Written by | Read by | Purpose |
|-----|-------------|------------|---------|---------|
| `agents.x-k8s.io/pod-name` | `SandboxPodNameAnnotation` | `SandboxClaim` completeAdoption; pod-create path | core Sandbox controller (`resolvePodName`), `getLaunchType` | Records the pod name an adopted warm-pool sandbox is bound to. Differs from `sandbox.Name` only during/after adoption. Also used as the "this was a warm start" signal for metrics. |
| `agents.x-k8s.io/sandbox-template-ref` | `SandboxTemplateRefAnnotation` | warm-pool controller; claim's `createSandbox` | metrics collector | Stores the originating `SandboxTemplate.name` for cardinality-bounded metric labels and audit. |
| `agents.x-k8s.io/propagated-labels` | `SandboxPropagatedLabelsAnnotation` | core Sandbox controller (`reconcilePod`, `updatePodMetadata`) | core Sandbox controller (next reconcile) | Sorted, comma-joined list of label keys the controller copied from `Sandbox.spec.podTemplate.metadata.labels` onto the Pod. Lets the controller detect which labels it owns so it can delete ones that are removed from spec. |
| `agents.x-k8s.io/propagated-annotations` | `SandboxPropagatedAnnotationsAnnotation` | core Sandbox controller | core Sandbox controller | Same idea as above, but for annotations. |
| `opentelemetry.io/trace-context` | `internal/metrics.TraceContextAnnotation` | webhook / claim / sandbox controllers | tracing helpers | Propagates a W3C trace context across CRD boundaries (not in the `agents.x-k8s.io/` namespace, but participates in the same propagation chain). |

Sources: [api/v1beta1/sandbox_types.go:56-66](api/v1beta1/sandbox_types.go), [controllers/sandbox_controller.go:82-90](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:785-943](controllers/sandbox_controller.go), [extensions/controllers/sandboxclaim_controller.go:748-755](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:335-338](extensions/controllers/sandboxwarmpool_controller.go)

### Labels

| Key | Declared at | Owner | Purpose |
|-----|-------------|-------|---------|
| `agents.x-k8s.io/sandbox-name-hash` | `sandboxLabel` (package-private) | core Sandbox controller | Selector that ties a Pod and headless Service to its owning `Sandbox`. The hash form keeps the value inside Kubernetes' 63-character label limit. |
| `agents.x-k8s.io/sandbox-pod-template-hash` | `SandboxPodTemplateHashLabel` | warm-pool controller | Tags warm-pool sandboxes with the template-content hash; the pool reconciler uses it (via `isSandboxStale`) to decide which pods are still fresh. Stripped on adoption so adopted sandboxes don't accidentally re-enter the pool. |
| `agents.x-k8s.io/sandbox-name` | `AssignedSandboxNameLabel` | `SandboxClaim` controller | Written on the `SandboxClaim` itself once a sandbox is bound; lets callers selector-match `claim → sandbox` without reading status. |
| `agents.x-k8s.io/claim-uid` | `SandboxIDLabel` | `SandboxClaim` controller | Written on `Sandbox.metadata.labels` and `Sandbox.spec.podTemplate.metadata.labels` so NetworkPolicies and external informers can resolve a Pod or Sandbox back to its owning claim by UID. |
| `agents.x-k8s.io/warm-pool-sandbox` | `warmPoolSandboxLabel` (package-private) | warm-pool controller | Hashed pool-name marker that lets the warm-pool informer enumerate its own sandboxes; stripped during claim adoption. |
| `agents.x-k8s.io/sandbox-template-ref-hash` | `sandboxTemplateRefHash` (package-private) | warm-pool + claim controllers | Hashed template-name marker on the pod template; used to bucket warm pods by template for the adoption queue. |

Sources: [controllers/sandbox_controller.go:49-53](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:60-61](api/v1beta1/sandbox_types.go), [extensions/api/v1beta1/sandboxclaim_types.go:25-31](extensions/api/v1beta1/sandboxclaim_types.go), [extensions/api/v1beta1/sandboxtemplate_types.go:33-37](extensions/api/v1beta1/sandboxtemplate_types.go), [extensions/controllers/sandboxwarmpool_controller.go:47-50](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxclaim_controller.go:578-589](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:728-794](extensions/controllers/sandboxclaim_controller.go)

### Propagation Algorithm

The `propagated-labels` / `propagated-annotations` annotations exist because Kubernetes has no "owner record" for fields a controller stamps onto someone else's object. Without a tracking annotation, removing a key from `Sandbox.spec.podTemplate.metadata.labels` would leave the previously-written value orphaned on the Pod. The Sandbox controller solves this by writing the *sorted, comma-joined set of keys it currently manages* back onto the Pod and consulting it on the next reconcile:

```text
   spec.podTemplate.metadata.labels       Pod.metadata.labels
   ┌──────────────────────────┐           ┌──────────────────────────┐
   │ app=demo                 │ ────────► │ app=demo                 │
   │ tier=web                 │           │ tier=web                 │
   └──────────────────────────┘           │ agents.x-k8s.io/         │
                ▲                         │   sandbox-name-hash=...  │
                │                         └──────────────────────────┘
                │                                       │
        next reconcile: diff                            │ annotation
        managedKeys vs.                                 ▼
        Pod.annotations["propagated-labels"]    propagated-labels=app,tier
```

If `tier` is later removed from spec, the next reconcile sees `tier` in the tracked set but absent from spec, and deletes it from `Pod.labels`. Annotations follow the identical procedure.
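
The procedure can be sketched with plain maps. This is an illustrative model rather than the controller's code: `syncManaged`, `splitKeys`, and `demo` are hypothetical names, and the tracking value is passed in and returned as a string instead of being read from and written to a Pod annotation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// splitKeys parses the comma-joined tracking value; "" means nothing tracked.
func splitKeys(tracked string) []string {
	if tracked == "" {
		return nil
	}
	return strings.Split(tracked, ",")
}

// syncManaged removes previously tracked keys that are no longer desired,
// copies the desired entries onto target, and returns the new tracking value
// (sorted, comma-joined). Keys that were never tracked are left alone.
func syncManaged(desired, target map[string]string, tracked string) string {
	for _, k := range splitKeys(tracked) {
		if _, still := desired[k]; !still {
			delete(target, k)
		}
	}
	keys := make([]string, 0, len(desired))
	for k, v := range desired {
		target[k] = v
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

// demo reconciles twice: first with app and tier, then with tier removed.
func demo() (first, second string, podLabels map[string]string) {
	podLabels = map[string]string{"agents.x-k8s.io/sandbox-name-hash": "abc123"}
	first = syncManaged(map[string]string{"app": "demo", "tier": "web"}, podLabels, "")
	second = syncManaged(map[string]string{"app": "demo"}, podLabels, first)
	return first, second, podLabels
}

func main() {
	first, second, labels := demo()
	fmt.Println("tracked after first reconcile: ", first)
	fmt.Println("tracked after second reconcile:", second)
	fmt.Println("pod label count:", len(labels))
}
```

The `sandbox-name-hash` label survives both passes because it was never in the tracked set, which is exactly the ownership distinction the annotation exists to record.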

Sources: [controllers/sandbox_controller.go:785-943](controllers/sandbox_controller.go)

## Worked Example: A Claim's `status` Across Phases

The table maps user-observable lifecycle phases onto the conditions and metadata the `SandboxClaim` controller actually writes. Reading top-to-bottom corresponds to the normal flow of a successful claim that later expires.

| Phase | `Ready` | `Finished` | Notable labels/annotations on `Sandbox` |
|-------|---------|------------|-----------------------------------------|
| Claim accepted, template missing | `False / TemplateNotFound` | — | (no sandbox yet) |
| Cold start, pod pending | `False / DependenciesNotReady` (forwarded) | — | `claim-uid`, `sandbox-template-ref-hash` |
| Warm-pool adoption in progress | `False / SandboxNotReady` | — | `pod-name`, `claim-uid` added; `warm-pool-sandbox`, `sandbox-pod-template-hash` removed |
| Running | `True / DependenciesReady` (forwarded) | — | `claim-uid`, `pod-name` (if warm) |
| `replicas=0` requested | `False / SandboxSuspended` | — | `Suspended` condition set on Sandbox |
| Pod exited 0 | `True / DependenciesReady` then transitions | `True / PodSucceeded` | `Finished` mirrored from Sandbox |
| `ttlSecondsAfterFinished` elapses | `False / ClaimExpired` | preserved | `ClaimExpired` event emitted |
| `shutdownTime` elapses with `ShutdownPolicy=Retain` | `False / ClaimExpired` | (cleared if sandbox deleted) | Sandbox deleted; claim object retained |

Sources: [extensions/controllers/sandboxclaim_controller.go:160-249](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:459-576](extensions/controllers/sandboxclaim_controller.go), [controllers/sandbox_controller.go:289-417](controllers/sandbox_controller.go)

## Summary

The status surface of an agent sandbox is intentionally minimal: three condition types (`Ready`, `Suspended`, `Finished`), a closed set of reason strings declared as constants in `api/v1beta1/sandbox_types.go` and `extensions/api/v1beta1/sandboxclaim_types.go`, and a small handful of `agents.x-k8s.io/`-namespaced annotations and labels that let controllers reconstruct ownership and propagation history across reconciles. Together they form the contract that lets the `SandboxClaim` mirror the core `Sandbox`'s status verbatim while still expressing claim-only concerns like `ClaimExpired` and `ttlSecondsAfterFinished`, and let the `SandboxWarmPool` decide which pre-warmed sandboxes are still safe to hand out.

---

## 10. Controller Manager Entry Point

> How cmd/agent-sandbox-controller/main.go wires schemes, the controller-runtime Manager, metrics/pprof servers, leader election, and the optional extensions reconciler set.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/10-controller-manager-entry-point.md
- Generated: 2026-05-25T22:35:42.375Z

### Source Files

- `cmd/agent-sandbox-controller/main.go`
- `internal/version/`
- `Dockerfile`
- `helm/templates/deployment.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [internal/version/version.go](internal/version/version.go)
- [Dockerfile](Dockerfile)
- [helm/templates/deployment.yaml](helm/templates/deployment.yaml)
- [helm/templates/_controller-args.tpl](helm/templates/_controller-args.tpl)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [extensions/api/v1beta1/groupversion_info.go](extensions/api/v1beta1/groupversion_info.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxtemplate_controller.go](extensions/controllers/sandboxtemplate_controller.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [internal/metrics/tracing.go](internal/metrics/tracing.go)
- [internal/metrics/sandbox_collector.go](internal/metrics/sandbox_collector.go)
</details>

# Controller Manager Entry Point

The single binary that runs the agent-sandbox operator is built from `cmd/agent-sandbox-controller/main.go`. It is a thin program by Kubernetes operator standards: it parses CLI flags, builds the runtime `Scheme`, configures the metrics/healthz/pprof servers, constructs a controller-runtime `Manager`, registers the `Sandbox` reconciler unconditionally, optionally wires three extra reconcilers behind an `--extensions` switch, and finally blocks on `mgr.Start(ctx)` until a termination signal arrives.

This page maps each section of `main.go` to the concrete pieces of the project it wires together (scheme registration, the controllers package, the `internal/metrics` instrumenter, the `extensions/controllers` set, and the `internal/version` build-stamp data), and shows how the Helm deployment and the multi-stage Dockerfile match those flags and probes.

## Binary layout and build provenance

The `main` package lives at `cmd/agent-sandbox-controller/main.go`. There is exactly one entry point and exactly one binary; both the Helm deployment and the Dockerfile reference it as `/agent-sandbox-controller`.

The two-stage Dockerfile compiles the binary statically with `CGO_ENABLED=0`, strips symbols (`-s -w`), and injects three build-time identifiers into the `internal/version` package via `-ldflags -X`:

```dockerfile
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} go build \
    -ldflags="-s -w -X sigs.k8s.io/agent-sandbox/internal/version.gitVersion=${GIT_VERSION} \
              -X sigs.k8s.io/agent-sandbox/internal/version.gitSHA=${GIT_SHA} \
              -X sigs.k8s.io/agent-sandbox/internal/version.buildDate=${BUILD_DATE}" \
    -o /agent-sandbox-controller ./cmd/agent-sandbox-controller
```

The runtime image is `gcr.io/distroless/static-debian13:nonroot`, with the binary as the `ENTRYPOINT`. Those `-X` symbols target the package-level variables declared in `internal/version/version.go`:

```go
var (
    gitVersion = "unknown"
    gitSHA     = "unknown"
    buildDate  = "unknown"
    goVersion  = runtime.Version()
    goCompiler = runtime.Compiler
    goOS       = runtime.GOOS
    goArch     = runtime.GOARCH
)
```

`version.Print("agent-sandbox-controller")` renders a small text template that includes program name, git version, git SHA, build date, Go version, compiler, and `GOOS/GOARCH`. `main` calls it under the `--version` flag and exits before any other initialization runs, so the binary can be queried without contacting the API server.
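The build-stamp mechanism can be sketched with the standard library alone. The template layout and the `render` helper below are assumptions made for illustration; the actual template in `internal/version` may format these fields differently, and the real variables live in that package rather than in `main`.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
	"text/template"
)

// These defaults can be overridden at link time, for example with
// -ldflags "-X main.gitVersion=v1.2.3". In the project they live in
// internal/version; package main is used here only to keep the sketch small.
var (
	gitVersion = "unknown"
	gitSHA     = "unknown"
	buildDate  = "unknown"
)

// versionTmpl is a hypothetical layout, not the project's real template.
const versionTmpl = `{{.Program}}, version {{.GitVersion}} (sha: {{.GitSHA}})
  build date: {{.BuildDate}}
  go version: {{.GoVersion}}
  platform:   {{.Platform}}
`

// render fills the template with the build-stamp values for one program name.
func render(program string) (string, error) {
	t, err := template.New("version").Parse(versionTmpl)
	if err != nil {
		return "", err
	}
	data := map[string]string{
		"Program":    program,
		"GitVersion": gitVersion,
		"GitSHA":     gitSHA,
		"BuildDate":  buildDate,
		"GoVersion":  runtime.Version(),
		"Platform":   runtime.GOOS + "/" + runtime.GOARCH,
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render("agent-sandbox-controller")
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Built without `-ldflags`, the three stamped fields print as `unknown`, which is a quick way to spot an image that was compiled outside the release pipeline.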

Sources: [cmd/agent-sandbox-controller/main.go:50-107](), [internal/version/version.go:26-91](), [Dockerfile:1-37]()

## Flag surface

All runtime configuration is exposed as `flag` values registered on the default command line. Each flag has a single owner; nothing reads environment variables directly in this file. The Helm chart renders the same names into the container `args` block, defined by `helm/templates/_controller-args.tpl` and consumed by `helm/templates/deployment.yaml` via the `agent-sandbox.controllerArgs` template.

| Flag | Default | Owner / Effect |
|------|---------|----------------|
| `--version` | `false` | Prints `version.Print(...)` and exits |
| `--cluster-domain` | `cluster.local` | Passed to `SandboxReconciler.ClusterDomain` for FQDN generation |
| `--metrics-bind-address` | `:8080` | Bind for controller-runtime metrics server (and pprof when enabled) |
| `--health-probe-bind-address` | `:8081` | Bind for `/healthz` and `/readyz` |
| `--leader-elect` | `true` | Toggles controller-runtime leader election |
| `--leader-election-namespace` | `""` | When empty with leader-elect on, falls back to controller-runtime auto-detection |
| `--extensions` | `false` | Registers the extensions scheme and three extra reconcilers |
| `--enable-tracing` | `false` | Initializes OTel via `asmetrics.SetupOTel` |
| `--enable-pprof` | `false` | Exposes only `/debug/pprof/profile` on the metrics server |
| `--enable-pprof-debug` | `false` | Exposes the full pprof index plus `fgprof`; implies `--enable-pprof` |
| `--pprof-block-profile-rate` | `1000000` | `runtime.SetBlockProfileRate` value when pprof debug is on |
| `--pprof-mutex-profile-fraction` | `10` | `runtime.SetMutexProfileFraction` value when pprof debug is on |
| `--kube-api-qps` | `-1.0` | Set on `restConfig.QPS`; `-1` keeps the client unlimited |
| `--kube-api-burst` | `10` | Set on `restConfig.Burst`; validated `> 0` |
| `--sandbox-concurrent-workers` | `1` | `MaxConcurrentReconciles` for the core Sandbox controller |
| `--sandbox-claim-concurrent-workers` | `1` | `MaxConcurrentReconciles` for `SandboxClaim` (extensions only) |
| `--sandbox-warm-pool-concurrent-workers` | `1` | `MaxConcurrentReconciles` for `SandboxWarmPool` (extensions only) |
| `--sandbox-template-concurrent-workers` | `1` | `MaxConcurrentReconciles` for `SandboxTemplate` (extensions only) |
| `--sandbox-warm-pool-max-batch-size` | `300` | `SandboxWarmPoolReconciler.MaxBatchSize` for parallel create/delete |

After `flag.Parse()`, `main` performs early validation: concurrency values must be positive, `--kube-api-burst` must be positive, the warm-pool batch size must be positive, and the sum of all worker counts is logged as a warning if it exceeds `1000` (or exceeds `kube-api-burst` when QPS is set).
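A minimal sketch of this kind of post-parse validation follows, covering three of the flags from the table. The `config` struct and `parseAndValidate` helper are hypothetical, and collecting failures with `errors.Join` is a choice made for the sketch; how `main.go` reports invalid values is not shown here.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// config holds a subset of the flags documented above (hypothetical type).
type config struct {
	kubeAPIBurst   int
	sandboxWorkers int
	warmPoolBatch  int
}

// parseAndValidate parses args into a private FlagSet and rejects
// non-positive values, collecting every violation instead of stopping early.
func parseAndValidate(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("agent-sandbox-controller", flag.ContinueOnError)
	fs.IntVar(&c.kubeAPIBurst, "kube-api-burst", 10, "burst for the API client")
	fs.IntVar(&c.sandboxWorkers, "sandbox-concurrent-workers", 1, "Sandbox reconcile workers")
	fs.IntVar(&c.warmPoolBatch, "sandbox-warm-pool-max-batch-size", 300, "warm-pool batch size")
	if err := fs.Parse(args); err != nil {
		return c, err
	}

	var errs []error
	check := func(name string, v int) {
		if v <= 0 {
			errs = append(errs, fmt.Errorf("--%s must be positive, got %d", name, v))
		}
	}
	check("kube-api-burst", c.kubeAPIBurst)
	check("sandbox-concurrent-workers", c.sandboxWorkers)
	check("sandbox-warm-pool-max-batch-size", c.warmPoolBatch)

	// errors.Join returns nil when no violation was recorded.
	return c, errors.Join(errs...)
}

func main() {
	_, err := parseAndValidate([]string{"--kube-api-burst=0", "--sandbox-concurrent-workers=4"})
	fmt.Println(err)
}
```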

Sources: [cmd/agent-sandbox-controller/main.go:51-149](), [helm/templates/_controller-args.tpl:1-50]()

## Scheme assembly

The `Scheme` consumed by the manager is `controllers.Scheme`, which is built in a package-level `init()` and already contains the core Kubernetes client-go types and the sandbox v1beta1 group:

```go
// controllers/sandbox_controller.go
var Scheme = runtime.NewScheme()

func init() {
    utilruntime.Must(clientgoscheme.AddToScheme(Scheme))
    utilruntime.Must(sandboxv1beta1.AddToScheme(Scheme))
}
```

`main` then conditionally extends it with the extensions group only when `--extensions` is set:

```go
scheme := controllers.Scheme
if extensions {
    utilruntime.Must(extensionsv1beta1.AddToScheme(scheme))
}
```

The extensions group is `extensions.agents.x-k8s.io/v1beta1`, declared in `extensions/api/v1beta1/groupversion_info.go`. Skipping that registration when extensions are off keeps the manager's cache and RBAC surface limited to the core `Sandbox` CRD.

Sources: [cmd/agent-sandbox-controller/main.go:174-177](), [controllers/sandbox_controller.go:112-120](), [extensions/api/v1beta1/groupversion_info.go:25-36]()

## Wiring diagram

```mermaid
flowchart TB
    subgraph CLI["cmd/agent-sandbox-controller/main.go"]
        Flags["flag.Parse()<br/>--leader-elect, --extensions,<br/>--enable-tracing, --enable-pprof,<br/>--kube-api-*, --sandbox-*-workers"]
        Logger["zap.New + ctrl.SetLogger"]
        SigCtx["ctrl.SetupSignalHandler()"]
        OTel["asmetrics.SetupOTel<br/>(10s init timeout)"]
        Mux["http.DefaultServeMux =<br/>http.NewServeMux()"]
        Mgr["ctrl.NewManager(restConfig, Options{...})<br/>LeaderElectionID a3317529.agent-sandbox.x-k8s.io"]
    end

    subgraph Scheme["Scheme (runtime.NewScheme)"]
        Core["clientgoscheme +<br/>sandboxv1beta1<br/>(controllers.init())"]
        Ext["extensionsv1beta1<br/>(only if --extensions)"]
    end

    subgraph Servers["Manager-owned servers"]
        Metrics["metricsserver.Options<br/>BindAddress :8080<br/>+ ExtraHandlers (pprof)"]
        Probe["HealthProbeBindAddress :8081<br/>healthz.Ping /healthz, /readyz"]
    end

    subgraph Recs["Reconcilers"]
        Core1["controllers.SandboxReconciler<br/>--sandbox-concurrent-workers"]
        Claim["extensionscontrollers.SandboxClaimReconciler<br/>+ queue.SimpleSandboxQueue"]
        Tmpl["extensionscontrollers.SandboxTemplateReconciler"]
        Warm["extensionscontrollers.SandboxWarmPoolReconciler<br/>MaxBatchSize"]
    end

    SandColl["asmetrics.RegisterSandboxCollector(mgr.GetClient())"]

    Flags --> Logger --> SigCtx --> OTel --> Mux --> Mgr
    Core --> Mgr
    Ext --> Mgr
    Mgr --> Metrics
    Mgr --> Probe
    Mgr --> SandColl
    Mgr --> Core1
    Mgr -.->|--extensions| Claim
    Mgr -.->|--extensions| Tmpl
    Mgr -.->|--extensions| Warm
    SigCtx -->|ctx| Mgr
```

Sources: [cmd/agent-sandbox-controller/main.go:150-294]()

## Logger, signal context, and tracing initialization

The zap options struct is bound to the command line so `--zap-*` flags work, then `ctrl.SetLogger` installs the resulting logger globally:

```go
opts := zap.Options{Development: false}
opts.BindFlags(flag.CommandLine)
flag.Parse()
...
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
```

`ctrl.SetupSignalHandler()` returns the parent `context.Context` used for everything downstream (`SetupOTel`, leader election, reconcilers, `mgr.Start`). When SIGTERM or SIGINT arrives, this context is cancelled and the manager exits cleanly.

Tracing is opt-in. If `--enable-tracing` is set, `main` creates a child context with a 10-second timeout purely for the OTel bootstrap, calls `asmetrics.SetupOTel(initCtx, "agent-sandbox-controller")`, and defers the returned `cleanup`. Otherwise the program uses `asmetrics.NewNoOp()`. Either way, the resulting `instrumenter` is passed as the `Tracer` field to the core `SandboxReconciler` and to the extensions' `SandboxClaim` and `SandboxTemplate` reconcilers.
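The relationship between the long-lived signal context and the short-lived bootstrap context can be sketched with the standard library. `signal.NotifyContext` stands in for `ctrl.SetupSignalHandler()`, and `bootstrap`, `runUntilCancelled`, and `demo` are hypothetical helpers; the sketch simulates the termination signal with a timer so that it exits on its own.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// bootstrap stands in for a tracing initializer (hypothetical): it needs
// `work` to finish and gives up if ctx expires first.
func bootstrap(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runUntilCancelled stands in for mgr.Start: it blocks until ctx is cancelled.
func runUntilCancelled(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo wires the two contexts together. All durations are in milliseconds.
// signalAfterMs cancels the parent the way an incoming SIGTERM would.
func demo(initTimeoutMs, initWorkMs, signalAfterMs int) (initErr, runErr error) {
	ms := func(n int) time.Duration { return time.Duration(n) * time.Millisecond }

	// Parent context: cancelled on SIGINT/SIGTERM or when stop is called.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Child context: bounds only the bootstrap step.
	initCtx, cancel := context.WithTimeout(ctx, ms(initTimeoutMs))
	initErr = bootstrap(initCtx, ms(initWorkMs))
	cancel()

	timer := time.AfterFunc(ms(signalAfterMs), stop)
	defer timer.Stop()
	runErr = runUntilCancelled(ctx)
	return initErr, runErr
}

func main() {
	initErr, runErr := demo(10000, 5, 50)
	fmt.Println("bootstrap:", initErr)
	fmt.Println("manager:", runErr)
}
```

Because the timeout lives on a child context, a slow bootstrap cannot shorten the lifetime of the parent context that the manager runs under.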

Sources: [cmd/agent-sandbox-controller/main.go:98-168](), [internal/metrics/tracing.go:49](), [internal/metrics/tracing.go:124]()

## Metrics server, pprof, and the default ServeMux defense

Before configuring the metrics server, `main` explicitly resets the process-wide HTTP mux:

```go
// Importing net/http/pprof registers handlers on the global DefaultServeMux.
// Reset it to avoid accidentally exposing pprof via any server that uses the default mux.
http.DefaultServeMux = http.NewServeMux()
```

This protects against the transitive import of `net/http/pprof` (only used via its individual handler functions here) silently exposing pprof on any future server that uses `http.DefaultServeMux`. All pprof exposure is then explicit and routed through the metrics server's `ExtraHandlers`:

| Flag combination | Endpoints mounted on `metrics-bind-address` |
|------------------|---------------------------------------------|
| Neither flag | `/metrics` only |
| `--enable-pprof` | `/metrics`, `/debug/pprof/profile` |
| `--enable-pprof-debug` | All of the above plus `/debug/pprof/` index, `cmdline`, `symbol`, `heap`, `goroutine`, `allocs`, `block`, `mutex`, `trace`, and `/debug/fgprof` |

When `--enable-pprof-debug` is active, `main` also clamps any negative sampling values to zero and applies `runtime.SetBlockProfileRate` and `runtime.SetMutexProfileFraction`. The setup log explicitly warns that the debug surface may expose sensitive information.
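The flag-to-endpoint mapping can be sketched as a function that builds the handler map, plus a reset of the default mux. The helper names (`extraHandlers`, `muxFor`, `statusOn`, `resetDefaultMux`, `defaultMuxStatus`) are hypothetical, and only a few of the debug routes are listed to keep the sketch short. In the real binary the map is handed to the metrics server's `ExtraHandlers` rather than to a private mux.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// extraHandlers returns the pprof routes to mount for a flag combination.
// The debug flag implies the basic one, as in the flag table above.
func extraHandlers(enablePprof, enablePprofDebug bool) map[string]http.Handler {
	h := map[string]http.Handler{}
	if enablePprof || enablePprofDebug {
		h["/debug/pprof/profile"] = http.HandlerFunc(pprof.Profile)
	}
	if enablePprofDebug {
		h["/debug/pprof/"] = http.HandlerFunc(pprof.Index)
		h["/debug/pprof/cmdline"] = http.HandlerFunc(pprof.Cmdline)
		h["/debug/pprof/symbol"] = http.HandlerFunc(pprof.Symbol)
		h["/debug/pprof/trace"] = http.HandlerFunc(pprof.Trace)
	}
	return h
}

// muxFor mounts the handlers on a private mux, never on the default one.
func muxFor(handlers map[string]http.Handler) *http.ServeMux {
	mux := http.NewServeMux()
	for path, h := range handlers {
		mux.Handle(path, h)
	}
	return mux
}

// statusOn issues an in-memory GET and reports the response status code.
func statusOn(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

// resetDefaultMux drops whatever importing net/http/pprof registered globally.
func resetDefaultMux() { http.DefaultServeMux = http.NewServeMux() }

// defaultMuxStatus probes the process-wide default mux.
func defaultMuxStatus(path string) int { return statusOn(http.DefaultServeMux, path) }

func main() {
	resetDefaultMux()
	const path = "/debug/pprof/cmdline"
	fmt.Println("default mux after reset:", defaultMuxStatus(path))
	fmt.Println("--enable-pprof:", statusOn(muxFor(extraHandlers(true, false)), path))
	fmt.Println("--enable-pprof-debug:", statusOn(muxFor(extraHandlers(false, true)), path))
}
```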

Sources: [cmd/agent-sandbox-controller/main.go:170-214]()

## Manager construction and leader election

The REST config is taken from in-cluster or kubeconfig discovery and overridden with the QPS/burst flags before constructing the manager:

```go
restConfig := ctrl.GetConfigOrDie()
restConfig.QPS   = float32(kubeAPIQPS)
restConfig.Burst = kubeAPIBurst

mgr, err := ctrl.NewManager(restConfig, ctrl.Options{
    Scheme:                  scheme,
    Metrics:                 metricsOpts,
    HealthProbeBindAddress:  probeAddr,
    LeaderElection:          enableLeaderElection,
    LeaderElectionNamespace: leaderElectionNamespace,
    LeaderElectionID:        "a3317529.agent-sandbox.x-k8s.io",
})
```

The lease name `a3317529.agent-sandbox.x-k8s.io` is the stable identifier under which only one replica becomes the active reconciler. When `--leader-elect=true` and `--leader-election-namespace=""`, the setup log records that auto-detection of the namespace is being attempted by controller-runtime.

Immediately after the manager exists, `asmetrics.RegisterSandboxCollector(mgr.GetClient(), …)` attaches a custom Prometheus collector backed by the manager's cached client; it is registered globally and surfaced on the same `/metrics` endpoint as the controller-runtime metrics.

Sources: [cmd/agent-sandbox-controller/main.go:216-234](), [internal/metrics/sandbox_collector.go:62]()

## Reconciler registration

The core reconciler is always registered:

```go
if err = (&controllers.SandboxReconciler{
    Client:        mgr.GetClient(),
    Scheme:        mgr.GetScheme(),
    Tracer:        instrumenter,
    ClusterDomain: clusterDomain,
}).SetupWithManager(mgr, sandboxConcurrentWorkers); err != nil {
    setupLog.Error(err, "unable to create controller", "controller", "Sandbox")
    os.Exit(1)
}
```

When `--extensions` is set, three more are registered, sharing a single in-memory `SimpleSandboxQueue` that the claim controller uses to track warm sandboxes:

| Reconciler (extensions only) | Concurrency flag | Notable fields |
|------------------------------|------------------|----------------|
| `extensionscontrollers.SandboxClaimReconciler` | `--sandbox-claim-concurrent-workers` | `WarmSandboxQueue: queue.NewSimpleSandboxQueue()`, `Recorder: mgr.GetEventRecorder("sandboxclaim-controller")`, `Tracer: instrumenter` |
| `extensionscontrollers.SandboxTemplateReconciler` | `--sandbox-template-concurrent-workers` | `Recorder: mgr.GetEventRecorder("sandboxtemplate-controller")`, `Tracer: instrumenter` |
| `extensionscontrollers.SandboxWarmPoolReconciler` | `--sandbox-warm-pool-concurrent-workers` | `MaxBatchSize: sandboxWarmPoolMaxBatchSize` |

Each reconciler's own `SetupWithManager(mgr, concurrentWorkers)` is what actually registers watches and applies the worker count; `main.go` is only responsible for instantiation and ordering. Errors at any step cause `os.Exit(1)`.

Sources: [cmd/agent-sandbox-controller/main.go:236-277](), [extensions/controllers/sandboxclaim_controller.go:1269-1273](), [extensions/controllers/sandboxtemplate_controller.go:215-218](), [extensions/controllers/sandboxwarmpool_controller.go:533-537]()

## Health probes and main loop

After reconciler wiring, `main` attaches two trivial liveness/readiness probes and starts the manager:

```go
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil { ... }
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil { ... }

setupLog.Info("starting manager")
if err := mgr.Start(ctx); err != nil { ... }
```

`mgr.Start(ctx)` blocks until the signal context is cancelled, at which point the manager runs internal shutdown and the deferred OTel `cleanup` fires. The probe server listens on `HealthProbeBindAddress` (`:8081` by default), distinct from the metrics server on `:8080`.

The Helm deployment matches these defaults exactly:

```yaml
ports:
- name: metrics
  containerPort: 8080
- name: healthz
  containerPort: 8081
livenessProbe:
  httpGet: { path: /healthz, port: healthz }
readinessProbe:
  httpGet: { path: /readyz,  port: healthz }
```

Renaming or moving either port in `main.go` would require a matching change in `helm/templates/deployment.yaml`, since the probes are wired to the `healthz` named port and Prometheus scrape config typically targets the `metrics` port.

Sources: [cmd/agent-sandbox-controller/main.go:281-294](), [helm/templates/deployment.yaml:30-48]()

## Operational summary

The entry point is intentionally small and procedural: parse flags, build a logger and signal context, optionally initialize tracing, decide which CRD groups go into the scheme, configure the metrics/pprof and probe endpoints, build the manager with a fixed leader-election ID, register the Prometheus collector, instantiate the reconcilers (always Sandbox, optionally the three extensions reconcilers behind `--extensions`), add the health probes, and block on `mgr.Start`. Build-time provenance comes from `internal/version` symbols injected by the Dockerfile's `-ldflags`, and the Helm chart's controller args template is the canonical mapping from the flags documented above to what a deployed pod actually runs.

---

## 11. Sandbox Reconciler

> Reconciliation loop for the core Sandbox: pod/PVC/service materialization, identity propagation, status conditions, scale subresource, and the cluster-domain FQDN logic.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/11-sandbox-reconciler.md
- Generated: 2026-05-25T22:35:38.064Z

### Source Files

- `controllers/sandbox_controller.go`
- `controllers/sandbox_controller_test.go`
- `controllers/testmain_test.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [controllers/sandbox_controller_test.go](controllers/sandbox_controller_test.go)
- [controllers/testmain_test.go](controllers/testmain_test.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
</details>

# Sandbox Reconciler

The Sandbox reconciler is the controller that turns a `Sandbox` custom resource into a running set of Kubernetes primitives: a single `Pod`, an optional headless `Service`, and one `PersistentVolumeClaim` per entry in `spec.volumeClaimTemplates`. It enforces single-controller ownership of those primitives, propagates a hash-based identity label and pod template metadata down to the Pod, surfaces overall state through three status conditions, drives the CRD's `scale` subresource, and computes the service FQDN from the controller's configured cluster domain. It also implements an expiry path (`shutdownTime` + `shutdownPolicy`) that tears the live resources down while keeping terminal status conditions intact.

This page walks the reconcile loop end-to-end against `controllers/sandbox_controller.go`, including the warm-pool pod adoption path that lets the Sandbox attach to an existing Pod rather than always creating one. The `Reconcile` entry point and its `reconcileChildResources` body assume `spec.replicas` is either 0 or 1 — the CRD scale subresource is intentionally constrained to that range.

## Controller wiring and configuration

`SandboxReconciler` is a small `client.Client`-backed struct with three injected dependencies: the runtime `Scheme`, a metrics/tracing `Instrumenter`, and the cluster's DNS suffix used to build service FQDNs. The controller is wired in `cmd/agent-sandbox-controller/main.go`, which exposes the suffix as a `--cluster-domain` flag defaulting to `cluster.local`.

```go
// controllers/sandbox_controller.go
type SandboxReconciler struct {
    client.Client
    Scheme        *runtime.Scheme
    Tracer        asmetrics.Instrumenter
    ClusterDomain string
}
```

`SetupWithManager` registers the controller for `Sandbox` resources and uses `Owns(...)` for `Pod` and `Service`. Both owned watches are filtered with a `LabelSelectorPredicate` that only fires for objects carrying the `agents.x-k8s.io/sandbox-name-hash` label, so warm-pool Pods that don't yet belong to a Sandbox don't enqueue spurious reconciles. The maximum concurrency is plumbed through from the binary's `--sandbox-concurrent-workers` flag as the `concurrentWorkers` argument.

Sources: [controllers/sandbox_controller.go:122-128](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1129-1150](controllers/sandbox_controller.go), [cmd/agent-sandbox-controller/main.go:71](cmd/agent-sandbox-controller/main.go), [cmd/agent-sandbox-controller/main.go:236-244](cmd/agent-sandbox-controller/main.go)

```text
                         ┌──────────────────────────────┐
                         │       SandboxReconciler      │
                         │  Client, Scheme, Tracer,     │
                         │  ClusterDomain               │
                         └──────────────┬───────────────┘
                                        │
                                        ▼
                                   Reconcile(req)
                                        │
              ┌─────────────┬───────────┼────────────┬────────────────┐
              ▼             ▼           ▼            ▼                ▼
        reconcilePVCs   reconcilePod  reconcile   computeConditions  updateStatus
       (one per VCT)    (≤1 pod)      Service     (Suspended/        (status subresource;
                                      (headless,  Ready/Finished)    skipped when unchanged)
                                      ClusterIP=None)
```

## Reconcile entry point

The top-level `Reconcile` function does the following in order, gating each step on the previous one:

1. Load the `Sandbox`. A NotFound is treated as a successful no-op so deletions are quiet.
2. Open a tracing span (`ReconcileSandbox`) and, the first time the sandbox is seen, write the active trace ID into `agents.x-k8s.io/trace-context` via a `MergeFrom` patch. The patch is applied inline, without requeueing for another reconcile.
3. Short-circuit if `DeletionTimestamp` is non-zero. Garbage collection of owned children is delegated to Kubernetes via controller references; the reconciler does not finalize anything.
4. Default `spec.replicas` to 1 when nil.
5. Branch on expiry: if `checkSandboxExpiry` returns true, run `handleSandboxExpiry`; otherwise call `reconcileChildResources` and then re-check expiry to set `RequeueAfter`.
6. Persist status via `updateStatus`, which skips the API call when `oldStatus` `DeepEqual`s the new status.

Errors from child reconciliation and from `updateStatus` are joined with `errors.Join` so a status failure does not mask the original error.
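The last two behaviours, skipping the status write when nothing changed and joining the two error sources, can be sketched as follows. The `status` and `statusWriter` types are simplified stand-ins for the real `SandboxStatus` and status client, and the error values are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// status is a simplified stand-in for the Sandbox status struct.
type status struct {
	Replicas   int
	Conditions []string
}

// statusWriter is a fake status client that counts calls and can fail.
type statusWriter struct {
	calls int
	err   error
}

func (w *statusWriter) update(_ status) error {
	w.calls++
	return w.err
}

// Example errors, invented for this sketch.
var (
	errChild  = errors.New("pod create failed")
	errStatus = errors.New("status update conflict")
)

// updateStatus skips the write entirely when the status did not change.
func updateStatus(w *statusWriter, oldStatus, newStatus status) error {
	if reflect.DeepEqual(oldStatus, newStatus) {
		return nil
	}
	return w.update(newStatus)
}

// finish reports both failures; errors.Join drops nil operands and
// returns nil when every operand is nil.
func finish(reconcileErr error, w *statusWriter, oldStatus, newStatus status) error {
	return errors.Join(reconcileErr, updateStatus(w, oldStatus, newStatus))
}

func main() {
	w := &statusWriter{err: errStatus}
	err := finish(errChild, w, status{Replicas: 1}, status{Replicas: 0})
	fmt.Println(errors.Is(err, errChild), errors.Is(err, errStatus))
}
```

Callers can still match either cause with `errors.Is`, since the joined error unwraps to both.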

Sources: [controllers/sandbox_controller.go:148-228](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:431-445](controllers/sandbox_controller.go)

```mermaid
sequenceDiagram
    participant K as controller-runtime
    participant R as SandboxReconciler
    participant API as kube-apiserver
    K->>R: Reconcile(req)
    R->>API: Get Sandbox
    R->>R: StartSpan + maybe patch trace-context
    alt DeletionTimestamp set
        R-->>K: ctrl.Result{}, nil
    else expired
        R->>R: setSandboxExpiredCondition (if not yet marked)
        R->>R: handleSandboxExpiry (delete Pod/Service, maybe delete Sandbox)
    else normal path
        R->>API: reconcilePVCs (per volumeClaimTemplates)
        R->>API: reconcilePod (Get/Create/Adopt)
        R->>API: reconcileService (Get/Create/Adopt/Delete)
        R->>R: computeConditions (Suspended, Ready, Finished)
        R->>R: checkSandboxExpiry → RequeueAfter
    end
    R->>API: Status().Update if changed
    R-->>K: ctrl.Result{RequeueAfter}, err
```

## Identity: the sandbox-name-hash label

Every owned object is stamped with the label `agents.x-k8s.io/sandbox-name-hash`. The value is an 8-character lowercase hex FNV-1a hash of the sandbox name, computed by `NameHash`:

```go
// controllers/sandbox_controller.go
const sandboxLabel = "agents.x-k8s.io/sandbox-name-hash"

func NameHash(objectName string) string {
    return fmt.Sprintf("%08x", GetNumericHash(objectName))
}
```

This hash powers four things that all need to agree:

- The label on the Pod, the Service, and (on create) the PVC.
- The Service's `spec.selector`, which is rewritten to `{sandboxLabel: nameHash}` on adoption and on drift.
- The `Pods` listing inside `reconcilePod`, which uses a `labels.Selector` to enumerate matching Pods in the namespace (with a warning log if more than one is found, since a Sandbox is expected to own at most one).
- The watch predicate in `SetupWithManager`, which only enqueues Pods and Services that carry the label key.

`sandbox.Status.LabelSelector` is published in the `<label>=<hash>` form so the CRD's scale subresource (declared `selectorpath=.status.selector`) can be used by `kubectl scale` and HPA-style clients.
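A self-contained sketch of the hash and the selector string is shown below. It assumes `GetNumericHash` is the 32-bit FNV-1a hash from the standard library, which matches the 8-hex-character description but is not confirmed here; `numericHash`, `nameHash`, and `selectorFor` are illustrative names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const sandboxLabel = "agents.x-k8s.io/sandbox-name-hash"

// numericHash is an assumed stand-in for GetNumericHash: 32-bit FNV-1a.
func numericHash(name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name)) // hash.Hash's Write never returns an error
	return h.Sum32()
}

// nameHash zero-pads to 8 lowercase hex characters, like NameHash above.
func nameHash(name string) string {
	return fmt.Sprintf("%08x", numericHash(name))
}

// selectorFor renders the <label>=<hash> form published in status.
func selectorFor(name string) string {
	return sandboxLabel + "=" + nameHash(name)
}

func main() {
	fmt.Println(nameHash("my-sandbox"))
	fmt.Println(selectorFor("my-sandbox"))
}
```

The hash keeps the label value short and valid no matter how long the sandbox name is, and the zero padding keeps every value exactly 8 characters wide.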

Sources: [controllers/sandbox_controller.go:49-53](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:230-271](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:447-458](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1130-1150](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:225-244](api/v1beta1/sandbox_types.go)

## Resource ownership model

Before mutating any object, the reconciler classifies its relationship to the current Sandbox using `checkOwnership`, which inspects `metav1.GetControllerOf` and returns one of three states:

| State | Trigger | Reaction |
|---|---|---|
| `resourceOwnedBySandbox` | `controllerRef.UID == sandbox.UID` | Drive drift back to desired (labels, selectors, metadata). Delete on shrink/expiry. |
| `resourceUnowned` | No controllerRef on the object | Adopt by calling `ctrl.SetControllerReference`. Services have extra preconditions: adoption happens only when `spec.service` is `true` (an unset `service` leaves an unowned Service alone) and the Service's `clusterIP` is `None` or empty. |
| `resourceOwnedByOther` | A different controllerRef | Refuse to touch the object. For Pods, return a hard error so `Ready` flips to `ReconcilerError`. For Services, the reconciler returns an error from `reconcileService`. |

This three-way classification is used identically in `reconcilePod`, `reconcileService`, `reconcilePVCs`, and `handleSandboxExpiry`, so adoption-vs-refusal is consistent across the entire lifecycle.
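The classification can be sketched without any Kubernetes dependencies. The `ownerRef` type and `controllerOf` helper below are simplified stand-ins for `metav1.OwnerReference` and `metav1.GetControllerOf`; the real `checkOwnership` takes API objects rather than plain slices.

```go
package main

import "fmt"

// ownerRef is a simplified stand-in for metav1.OwnerReference.
type ownerRef struct {
	UID        string
	Controller bool
}

type ownership int

const (
	resourceUnowned ownership = iota
	resourceOwnedBySandbox
	resourceOwnedByOther
)

func (o ownership) String() string {
	return [...]string{"unowned", "owned-by-sandbox", "owned-by-other"}[o]
}

// controllerOf returns the owner reference flagged as controller, if any.
func controllerOf(refs []ownerRef) *ownerRef {
	for i := range refs {
		if refs[i].Controller {
			return &refs[i]
		}
	}
	return nil
}

// checkOwnership compares the controlling owner's UID with the sandbox UID.
// Non-controller owner references do not count as ownership.
func checkOwnership(refs []ownerRef, sandboxUID string) ownership {
	ref := controllerOf(refs)
	switch {
	case ref == nil:
		return resourceUnowned
	case ref.UID == sandboxUID:
		return resourceOwnedBySandbox
	default:
		return resourceOwnedByOther
	}
}

func main() {
	fmt.Println(checkOwnership(nil, "uid-1"))
	fmt.Println(checkOwnership([]ownerRef{{UID: "uid-1", Controller: true}}, "uid-1"))
	fmt.Println(checkOwnership([]ownerRef{{UID: "uid-2", Controller: true}}, "uid-1"))
}
```

Comparing UIDs rather than names means a Sandbox that is deleted and recreated under the same name does not silently inherit objects from its predecessor.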

Sources: [controllers/sandbox_controller.go:55-80](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:509-590](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:671-781](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:945-1011](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1014-1063](controllers/sandbox_controller.go)

## Pod reconciliation

`reconcilePod` is the most involved subroutine. It must support three scenarios with one code path:

1. **Fresh creation** — no Pod exists; create one named after the Sandbox.
2. **Warm-pool adoption** — a pre-existing Pod whose name is tracked in the sandbox annotation `agents.x-k8s.io/pod-name`. The reconciler reads that annotation through `resolvePodName`, so the Pod's name may differ from `sandbox.Name`.
3. **Suspend** — `spec.replicas == 0` deletes the Pod (if owned) and clears the tracking annotation.

Key behaviors:

- The reconciler first does a label-based `List` for diagnostic purposes (logs a warning if more than one Pod matches), then a direct `Get` on the resolved pod name.
- If the annotated pod has gone missing, `clearPodNameAnnotation` removes the annotation via a `MergeFrom` patch so the next reconcile can fall back to creating a Pod named after the Sandbox.
- On adoption of an unowned Pod, `ctrl.SetControllerReference` is called and `updatePodMetadata` propagates labels and annotations from `spec.podTemplate.metadata` to the live Pod.
- On create, the desired pod is constructed from a deep-copied `spec.podTemplate.Spec`, with `MergeVolumeClaimVolumes` overlaying any `volumeClaimTemplates`-derived volumes by name. The newly built `Pod` gets the sandbox-name-hash label plus every label/annotation from the pod template. The set of keys it stamped is recorded in `agents.x-k8s.io/propagated-labels` and `agents.x-k8s.io/propagated-annotations` (comma-separated, sorted) so a later reconcile can detect and *remove* keys that have since been dropped from the template.
- Pod create races (`AlreadyExists`) fall back to a `Get` plus `reconcileExistingPod`, so a controller crash mid-create is recoverable.
- An `ensurePodNameAnnotation` closure writes the annotation after a successful create/adopt — but only if the sandbox doesn't already track a *different* pod name, to avoid hijacking a warm-pool record.

```go
// controllers/sandbox_controller.go
func resolvePodName(sandbox *sandboxv1beta1.Sandbox) string {
    if name, ok := sandbox.Annotations[sandboxv1beta1.SandboxPodNameAnnotation]; ok && name != "" {
        return name
    }
    return sandbox.Name
}
```

Sources: [controllers/sandbox_controller.go:82-90](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:92-110](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:597-609](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:623-864](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:866-943](controllers/sandbox_controller.go)

### Label and annotation propagation

`updatePodMetadata` is the source of truth for keeping the live Pod's metadata in sync with the template. It implements three jobs in one pass:

1. Force the sandbox-name-hash label to the current hash.
2. Apply every `(k, v)` from `spec.podTemplate.metadata.labels` and `.annotations`, updating only on differences.
3. Use the `propagated-labels` / `propagated-annotations` tracking annotations to delete keys that the template no longer mentions, then write back the new key list (sorted). Without this bookkeeping, a key removed from the template would linger on the Pod forever: an imperative `Update` has no record of which keys the controller set earlier, so it cannot tell a stale propagated key from one that another actor added.
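The tracked-key bookkeeping from the third job can be sketched on plain maps. `syncPropagated` is a hypothetical helper; the real `updatePodMetadata` works on the Pod object and stores the tracked list in an annotation rather than returning it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// syncPropagated makes `live` reflect `desired` for controller-managed keys.
// `tracked` is the comma-separated key list recorded by the previous pass.
// It returns the new tracked list. Keys the controller never set are kept.
func syncPropagated(live, desired map[string]string, tracked string) string {
	// Remove keys that were propagated before but left the template since.
	for _, k := range strings.Split(tracked, ",") {
		if k == "" {
			continue
		}
		if _, still := desired[k]; !still {
			delete(live, k)
		}
	}
	// Apply the current template and record which keys were stamped.
	keys := make([]string, 0, len(desired))
	for k, v := range desired {
		live[k] = v
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return strings.Join(keys, ",")
}

func main() {
	live := map[string]string{"app": "web", "tier": "backend", "team": "infra"}
	tracked := syncPropagated(live, map[string]string{"app": "web"}, "app,tier")
	fmt.Println(tracked, live)
}
```

In the example, `tier` is removed because it was tracked and is no longer desired, while `team` survives because the controller never recorded it as propagated.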

Sources: [controllers/sandbox_controller.go:866-943](controllers/sandbox_controller.go)

## Service reconciliation and cluster-domain FQDN

`reconcileService` produces a single headless service per Sandbox. Its desired state is keyed off the optional `spec.service *bool`:

| `spec.service` | Service exists & ownership | Action |
|---|---|---|
| `nil` | not found | No-op; clear `status.service`/`status.serviceFQDN`. |
| `nil` | found, owned-by-sandbox | Reconcile drift on labels and selector. |
| `nil` | found, unowned | Leave as-is (backward compatibility), but `computeReadyCondition` still requires it for Ready. |
| `nil` | found, owned-by-other | Error (`refusing to use service`). |
| `true` | not found | Create a headless service (`ClusterIP: None`) named after the sandbox. |
| `true` | found, unowned | Adopt — but only if `service.spec.clusterIP` is `None` or empty, because `clusterIP` is immutable. |
| `true` | found, owned-by-sandbox | Patch back label and selector to `{sandboxLabel: nameHash}`. |
| `false` | found, owned-by-sandbox | Delete the service. |
| `false` | any other state | Do not delete; clear status. |

`setServiceStatus` writes both `status.service` and `status.serviceFQDN`, where the FQDN is constructed by string concatenation:

```go
// controllers/sandbox_controller.go
sandbox.Status.ServiceFQDN = service.Name + "." + service.Namespace + ".svc." + r.ClusterDomain
```

The cluster domain comes from the `--cluster-domain` flag (default `cluster.local`); `TestSetServiceStatusCustomDomain` exercises both the default and a custom value such as `custom.domain`.
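
The decision table above can be read as a pure function from the `spec.service` value and the observed Service state to an action. The sketch below is an assumption-laden illustration: the `ownership` type, the `notFound` state, and the action strings are hypothetical names, and the `true` + owned-by-other combination is returned as `not-covered-by-table` because the table does not list it.

```go
package main

import "fmt"

// ownership is a hypothetical stand-in for the controller's ownership
// classification, extended with a "not found" state for this sketch.
type ownership int

const (
	notFound ownership = iota
	ownedBySandbox
	unowned
	ownedByOther
)

// serviceAction names the step the table prescribes. want is nil when
// spec.service is unset.
func serviceAction(want *bool, state ownership) string {
	if want != nil && !*want { // spec.service == false
		if state == ownedBySandbox {
			return "delete"
		}
		return "clear-status"
	}
	if want != nil { // spec.service == true
		switch state {
		case notFound:
			return "create-headless"
		case unowned:
			return "adopt-if-headless"
		case ownedBySandbox:
			return "reconcile-drift"
		default:
			return "not-covered-by-table"
		}
	}
	switch state { // spec.service unset
	case notFound:
		return "clear-status"
	case ownedBySandbox:
		return "reconcile-drift"
	case unowned:
		return "leave-as-is"
	default:
		return "error"
	}
}

func main() {
	yes := true
	fmt.Println(serviceAction(nil, unowned))
	fmt.Println(serviceAction(&yes, notFound))
}
```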

Sources: [controllers/sandbox_controller.go:460-594](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:611-621](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:2429-2463](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:194-222](api/v1beta1/sandbox_types.go)

## PVC reconciliation

`reconcilePVCs` iterates `spec.volumeClaimTemplates` and, for each entry named `vct`, manages a PVC named `vct + "-" + sandbox.Name` (the same naming convention that `reconcilePod` uses when wiring `corev1.Volume` entries through `MergeVolumeClaimVolumes`).

- If the PVC exists and is owned by the Sandbox, nothing happens.
- If it exists but is unowned, `ctrl.SetControllerReference` is called and the object is `Update`d to take ownership.
- If it exists but is owned by another controller, `reconcilePVCs` returns an error which propagates up into the Ready condition as `ReconcilerError`.
- If it does not exist, the controller creates it using the template's `Spec`, copying labels/annotations from the template and adding the sandbox-name-hash label.

PVC deletion follows the standard owner-reference garbage collection: nothing in this reconciler deletes a PVC explicitly, even on `replicas=0` or expiry.

Sources: [controllers/sandbox_controller.go:92-110](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:809-824](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:945-1011](controllers/sandbox_controller.go)

## Status conditions

`computeConditions` produces up to three `metav1.Condition` values per pass. `Suspended` and `Finished` are conditional; `Ready` is always present. Each condition carries `ObservedGeneration: sandbox.Generation`. The reconciler also explicitly *removes* the `Finished` condition when neither `PodSucceeded` nor `PodFailed` is observed in the current pass, so a Pod that's been recreated does not carry over a stale `Finished=True`.

| Type | Status / Reason | When set |
|---|---|---|
| `Ready` | `True` / `DependenciesReady` | Pod is `Running`, Pod Ready is `True`, `len(PodIPs) > 0`, and the service requirement is satisfied. |
| `Ready` | `False` / `DependenciesNotReady` | Default for any not-yet-ready dependency state; message describes Pod phase and whether the Service exists. |
| `Ready` | `False` / `ReconcilerError` | Any error returned from `reconcileChildResources`. Message is `"Error seen: " + err.Error()`. |
| `Ready` | `False` / `SandboxSuspended` | `spec.replicas == 0`. Message distinguishes "suspending" (pod still around) from "suspended" (no pod). |
| `Ready` | `False` / `SandboxExpired` | Set by `setSandboxExpiredCondition` and persisted by `handleSandboxExpiry`. |
| `Suspended` | `True` / `PodTerminated` | `spec.replicas == 0` and no Pod exists. |
| `Suspended` | `False` / `PodNotTerminated` | `spec.replicas == 0` but the Pod is still present. |
| `Finished` | `True` / `PodSucceeded` | Pod phase is `Succeeded`. |
| `Finished` | `True` / `PodFailed` | Pod phase is `Failed`. |

A service is "required" for Ready when either `spec.service == true` or a Service currently exists (the backward-compatibility branch around `controllers/sandbox_controller.go:367-372`). `TestComputeConditions` enumerates a dozen permutations that pin down this matrix.
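
As a rough illustration of the `Ready` rows in the table, the sketch below computes a status and reason from a flattened view of the observed state. The `observed` struct and `readyCondition` function are hypothetical, the order in which the error and suspended checks run is an assumption, and the `SandboxExpired` case is left out because it is set on the expiry path rather than here.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for any error returned by child-resource reconciliation.
var errExample = errors.New("pvc owned by another controller")

// observed is a hypothetical flattened view of what one reconcile pass sees.
type observed struct {
	reconcileErr  error
	replicas      int
	podPhase      string // empty when no Pod exists
	podReady      bool
	podIPs        []string
	serviceWanted bool // spec.service == true
	serviceExists bool
}

// readyCondition returns the Ready status and reason for the observed state.
func readyCondition(o observed) (status, reason string) {
	switch {
	case o.reconcileErr != nil:
		return "False", "ReconcilerError"
	case o.replicas == 0:
		return "False", "SandboxSuspended"
	}
	// A Service is required when requested or when one already exists.
	serviceRequired := o.serviceWanted || o.serviceExists
	serviceOK := !serviceRequired || o.serviceExists
	podOK := o.podPhase == "Running" && o.podReady && len(o.podIPs) > 0
	if podOK && serviceOK {
		return "True", "DependenciesReady"
	}
	return "False", "DependenciesNotReady"
}

func main() {
	status, reason := readyCondition(observed{
		replicas: 1, podPhase: "Running", podReady: true,
		podIPs: []string{"10.0.0.7"}, serviceWanted: true,
	})
	fmt.Println(status, reason) // Service requested but missing
	status, reason = readyCondition(observed{reconcileErr: errExample, replicas: 1})
	fmt.Println(status, reason)
}
```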

Sources: [controllers/sandbox_controller.go:256-417](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:63-239](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:24-54](api/v1beta1/sandbox_types.go)

## Scale subresource and pod IP surfacing

The `Sandbox` CRD declares both `+kubebuilder:subresource:status` and `+kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector`. On each pass the reconciler keeps the status fields behind the scale subresource, plus the Pod IP list, coherent with the observed Pod:

- `status.replicas` is set to `1` when a Pod is present, `0` otherwise.
- `status.selector` is filled with `<sandboxLabel>=<NameHash(sandbox.Name)>` when the Pod exists.
- `status.podIPs` is mirrored from `pod.Status.PodIPs` (dual-stack aware) via `podIPsFromStatus`.

When `replicas` is 0, the Pod-IP and selector status fields are cleared in the same block. The `replicas` field itself only ever takes the values 0 or 1 because of the CRD's `minimum=0, maximum=1` markers.

Sources: [controllers/sandbox_controller.go:241-250](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:419-429](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:148-156](api/v1beta1/sandbox_types.go), [api/v1beta1/sandbox_types.go:209-222](api/v1beta1/sandbox_types.go), [api/v1beta1/sandbox_types.go:225-244](api/v1beta1/sandbox_types.go)

## Lifecycle: shutdownTime and shutdownPolicy

The inlined `Lifecycle` substruct adds two fields to `SandboxSpec`: `shutdownTime` (absolute) and `shutdownPolicy` (`Delete` or `Retain`, defaulting to `Retain`). `checkSandboxExpiry` returns whether the sandbox is past its `shutdownTime` and, if not, how long to wait before reconsidering. The wait is clamped to a 2-second minimum so reconcile thrash is bounded:

```go
// controllers/sandbox_controller.go
requeueAfter := max(remainingTime, 2*time.Second)
```

On expiry, the controller follows a two-pass protocol designed to preserve any terminal `Finished` condition for observability:

```mermaid
stateDiagram-v2
    [*] --> Live: shutdownTime in future
    Live --> ExpiringMarked: shutdownTime ≤ now\nsetSandboxExpiredCondition
    ExpiringMarked --> ExpiringMarked: requeue (immediateRequeueDelay)
    ExpiringMarked --> Cleaning: sandboxMarkedExpired = true
    Cleaning --> Retained: ShutdownPolicy=Retain\nPod & Service deleted\nstatus.Conditions preserved
    Cleaning --> Deleted: ShutdownPolicy=Delete\nSandbox object deleted
    Retained --> [*]
    Deleted --> [*]
```

Pass 1 (`Reconcile` line 198) sets `Ready=False` with reason `SandboxExpired`, updates status, and returns `RequeueAfter: immediateRequeueDelay` so the next pass observes the marker via `sandboxMarkedExpired`.

Pass 2 calls `handleSandboxExpiry`, which:

1. Deletes the Pod if owned by this Sandbox; logs and skips deletion if it's unowned or owned by another controller.
2. Deletes the Service under the same ownership rule.
3. If `shutdownPolicy == Delete`, deletes the Sandbox itself and returns `sandboxDeleted = true`, suppressing the trailing status update.
4. Otherwise, resets `sandbox.Status` to an empty struct *while preserving `Conditions`* (so `Finished=PodSucceeded` or `Finished=PodFailed` survives the cleanup), then re-asserts `Ready=False / SandboxExpired`.

`TestSandboxShutdownExpiryUsesTwoPassAndPreservesFinishedCondition` exercises the full sequence for both `PodSucceeded` and `PodFailed` and asserts that the `Finished` condition is still present after the Pod and Service are gone.
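
The requeue clamp and the two-pass protocol can be modelled as a small state machine. This is a simplified sketch under stated assumptions: the `sandbox` struct, `reconcileExpiry`, and the 1 ms value chosen for `immediateRequeueDelay` are illustrative, and Service deletion and ownership checks are omitted.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minRequeue            = 2 * time.Second
	immediateRequeueDelay = time.Millisecond // assumed value for this sketch
)

// epoch is a fixed reference time so the example is deterministic.
var epoch = time.Unix(0, 0)

// sandbox is a hypothetical, flattened model of the fields that matter here.
type sandbox struct {
	shutdownTime  time.Time
	policy        string // "Retain" or "Delete"
	markedExpired bool   // stands in for Ready=False / SandboxExpired
	finished      string // terminal condition reason, preserved on Retain
	podExists     bool
	podIPs        []string
	deleted       bool
}

// reconcileExpiry runs one pass and returns the requeue delay (0 = none).
func reconcileExpiry(sb *sandbox, now time.Time) time.Duration {
	if now.Before(sb.shutdownTime) {
		remaining := sb.shutdownTime.Sub(now)
		if remaining < minRequeue {
			remaining = minRequeue // bound reconcile thrash
		}
		return remaining
	}
	if !sb.markedExpired {
		sb.markedExpired = true // pass 1: mark only, clean up next time
		return immediateRequeueDelay
	}
	sb.podExists = false // pass 2: drop live resources
	if sb.policy == "Delete" {
		sb.deleted = true
		return 0
	}
	sb.podIPs = nil // reset status but keep finished and the expired marker
	return 0
}

func main() {
	sb := &sandbox{
		shutdownTime: epoch.Add(time.Second), policy: "Retain",
		finished: "PodSucceeded", podExists: true, podIPs: []string{"10.0.0.7"},
	}
	fmt.Println(reconcileExpiry(sb, epoch)) // clamped wait
	later := epoch.Add(time.Minute)
	fmt.Println(reconcileExpiry(sb, later), sb.markedExpired)
	fmt.Println(reconcileExpiry(sb, later), sb.podExists, sb.finished)
}
```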

Sources: [controllers/sandbox_controller.go:188-228](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1014-1090](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1092-1127](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:2244-2306](controllers/sandbox_controller_test.go), [controllers/sandbox_controller_test.go:2308-2427](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:168-192](api/v1beta1/sandbox_types.go)

## Status persistence and field ownership

`updateStatus` does a `reflect.DeepEqual` between the snapshot taken at the top of `Reconcile` (`oldStatus`) and the post-reconcile `sandbox.Status`, calling `r.Status().Update` only when they differ. This keeps the controller from generating spurious status revisions that would themselves enqueue reconciles via the watch on `Sandbox`.

For the spec/metadata side, the controller mixes two approaches:

- `Create` calls use `client.FieldOwner("sandbox-controller")` so server-side-apply conflict detection points at this controller.
- `Update` calls (for adoption and for patching label/selector drift on owned services) do not specify a field owner; `Patch` with `client.MergeFrom` is used for narrow annotation changes (trace context, pod-name tracking).

Sources: [controllers/sandbox_controller.go:49-53](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:175-186](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:431-445](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:493-499](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:838-850](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1005-1007](controllers/sandbox_controller.go)

## Summary

The Sandbox reconciler is a single-replica, single-pod controller whose interesting behavior lives at the boundaries: a hash-derived identity label that ties together a Pod, a headless Service, and one PVC per template; a three-state ownership classifier (`resourceOwnedBySandbox` / `resourceUnowned` / `resourceOwnedByOther`) that decides between drive-to-state, adoption, and refusal; a label/annotation propagator that uses tracking annotations to detect deletions; a status surface comprising `Ready`, `Suspended`, and `Finished` plus scale-subresource fields; and a two-pass expiry path that drops live resources but preserves terminal conditions for observability. The cluster-domain FQDN is the simplest of these — a literal `<svc>.<ns>.svc.<--cluster-domain>` concatenation seeded by the controller flag — but it ties the published `status.serviceFQDN` to operator-controlled DNS configuration rather than cluster auto-discovery.

---

## 12. SandboxClaim Reconciler

> The largest controller in the project: template resolution, env/metadata injection, warm-pool adoption, pod-exclusivity invariants, foreground deletion, and TTL after finish.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/12-sandboxclaim-reconciler.md
- Generated: 2026-05-25T22:36:50.404Z

### Source Files

- `extensions/controllers/sandboxclaim_controller.go`
- `extensions/controllers/sandboxclaim_controller_test.go`
- `extensions/controllers/sandboxclaim_pod_exclusivity_test.go`
- `extensions/controllers/utils.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxclaim_controller_test.go](extensions/controllers/sandboxclaim_controller_test.go)
- [extensions/controllers/sandboxclaim_pod_exclusivity_test.go](extensions/controllers/sandboxclaim_pod_exclusivity_test.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [internal/lifecycle/expiry.go](internal/lifecycle/expiry.go)
</details>

# SandboxClaim Reconciler

The `SandboxClaimReconciler` is the largest controller in the `extensions/controllers` package. It turns a user-facing `SandboxClaim` into a working `Sandbox` resource, choosing between three sources — adopting a warm pool sandbox, taking over a previously created one by status/label/name, or cold-starting from a `SandboxTemplate`. It also owns the lifetime of that sandbox, including expiration, TTL-after-finished, foreground deletion, and the 1:1 pod-exclusivity invariant that prevents two claims from binding the same warm pool pod.

The page maps the reconcile loop to the responsibilities that show up in the implementation: template resolution and metadata merging, environment variable injection under template policy, the warm-pool adoption protocol with optimistic ownership transfer, status/condition forwarding from the core `Sandbox`, expiration with the three `ShutdownPolicy` modes, and the watches/predicates that feed the controller.

## Reconcile entry point and high-level flow

`Reconcile` runs in a single pass per request. It loads the claim, opportunistically cleans up a legacy per-claim `NetworkPolicy`, starts a trace span, initializes observability annotations, then decides between the *active* and *expired* branch based on `checkExpiration`. After the chosen branch returns a `Sandbox` (or `nil`), the reconciler computes the `Ready` and `Finished` conditions, writes status with `r.updateStatus`, records latency metrics, and returns a requeue duration matching the next expiry boundary.

Three sentinel errors suppress error returns to avoid crash loops: `ErrTemplateNotFound` causes a 1-minute requeue, while `ErrInvalidMetadata` and `ErrSandboxNotOwned` are logged at V(1) and swallowed.

Sources: [extensions/controllers/sandboxclaim_controller.go:140-282](extensions/controllers/sandboxclaim_controller.go)

```mermaid
flowchart TD
    Start[Reconcile request] --> Get[Get SandboxClaim]
    Get --> Cleanup[cleanupLegacyNetworkPolicy]
    Cleanup --> Init[initializeAnnotations<br/>trace + first-observed]
    Init --> Exp{checkExpiration<br/>shutdownTime / TTL}
    Exp -->|expired + Delete*| DeleteClaim[Delete claim<br/>Foreground prop. if set]
    Exp -->|expired + Retain| ReconcileExpired[reconcileExpired:<br/>delete owned Sandbox]
    Exp -->|active| ReconcileActive[reconcileActive]

    ReconcileActive --> Validate[validateAdditionalPodMetadata]
    Validate --> GetOrCreate[getOrCreateSandbox]
    GetOrCreate -->|hit by status/label/name| MetaSync[mergePodMetadata<br/>+ Update if drifted]
    GetOrCreate -->|warm queue pop| Adopt[adoptSandboxFromCandidates]
    GetOrCreate -->|miss| ColdCreate[createSandbox<br/>from SandboxTemplate]

    ReconcileExpired --> Status
    MetaSync --> Status
    Adopt --> Status
    ColdCreate --> Status
    DeleteClaim --> Done

    Status[computeAndSetStatus<br/>Ready + Finished mirror] --> Persist[updateStatus<br/>Status().Patch]
    Persist --> Metrics[recordCreationLatencyMetric]
    Metrics --> Requeue{post-expiration<br/>or timeLeft > 0?}
    Requeue --> Done[Result + err]
```

## Public API and constants

The reconciler is a `client.Client` plus injected collaborators:

| Field | Purpose |
|---|---|
| `Scheme` | Used by `controllerutil.SetControllerReference` when creating/adopting Sandboxes. |
| `WarmSandboxQueue` | In-memory per-template hash queue (`queue.SandboxQueue`) of warm pool candidates. |
| `Recorder` | `events.EventRecorder` used for `SandboxProvisioned`, `SandboxAdopted`, `ClaimExpired` events. |
| `Tracer` | `asmetrics.Instrumenter` for tracing spans and propagating a trace-context annotation. |
| `MaxConcurrentReconciles` | Concurrency for the controller manager. |
| `observedTimes` | Type-safe `sync.Map` keyed by `NamespacedName` and tagged by UID for latency tracking. |

Sentinel errors drive flow control: `ErrTemplateNotFound`, `ErrInvalidMetadata`, `ErrSandboxNotOwned`, `ErrCrossNamespaceAdoption`. Two annotation keys gate behavior: `agents.x-k8s.io/controller-first-observed-at` (observability) and `asmetrics.TraceContextAnnotation` (trace propagation). The `restrictedDomains` list (`kubernetes.io`, `k8s.io`, `agents.x-k8s.io`) is enforced when validating user-supplied label/annotation keys.

Sources: [extensions/controllers/sandboxclaim_controller.go:58-125](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:60-72](extensions/controllers/sandboxclaim_controller.go)

## Template resolution and sandbox provisioning

### Cold path: `createSandbox`

`createSandbox` is the cold path. It deep-copies `template.Spec.PodTemplate` into a new `Sandbox` named after the claim, propagates the trace context, and copies `VolumeClaimTemplates`. Identity labels (`SandboxIDLabel = claim.UID` and the template-ref hash `agents.x-k8s.io/sandbox-template-ref-hash`) are applied to both the top-level `Sandbox` metadata and the pod template: KEP-0174 only propagates pod-template labels, but the platform informer reads top-level `Sandbox.metadata.labels`, so both locations need them.

It then merges `claim.Spec.AdditionalPodMetadata`, applies `ApplySandboxSecureDefaults`, sets `Replicas = 1`, attaches the controller owner reference, and creates the `Sandbox`. Cold start is recorded as `LaunchTypeCold` in `RecordSandboxClaimCreation`.

Sources: [extensions/controllers/sandboxclaim_controller.go:923-1068](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/utils.go:23-48](extensions/controllers/utils.go)

### Environment variable injection

The controller injects `claim.Spec.Env` into the pod template, but only when the template's `EnvVarsInjectionPolicy` permits it:

| Policy | Behavior |
|---|---|
| `Disallowed` (default when policy is not `Allowed` / `Overrides`) | Any `claim.Spec.Env` causes rejection with "environment variable injection is not allowed by the template policy". |
| `Allowed` | New env vars may be appended; collision with an existing name is rejected. |
| `Overrides` | Existing variables with the same name are replaced in place. |

Env vars without `ContainerName` are appended only to the first regular container; vars with `ContainerName` are routed to that container across init- and regular-container lists, and any unknown container name fails the reconcile with a precise error referencing the offending variable name. The implementation lives in `injectEnvs`, with grouping/validation in `createSandbox`.

A second hard rule: `getOrCreateSandbox` rejects `claim.Spec.Env` when `WarmPool != "none"`, because warm pool pods are pre-baked and per-claim env injection would silently miss them.
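
The three policies in the table can be sketched as a merge over a single container's variable list. This is a minimal sketch, not the repository's `injectEnvs`: the `envVar` type and the function signature are hypothetical, and routing by `ContainerName` is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// envVar is a simplified stand-in for corev1.EnvVar.
type envVar struct{ Name, Value string }

var errNotAllowed = errors.New("environment variable injection is not allowed by the template policy")

// injectEnvs merges extra into existing under the given policy and returns a
// new slice; existing is never modified.
func injectEnvs(policy string, existing, extra []envVar) ([]envVar, error) {
	if len(extra) == 0 {
		return existing, nil
	}
	if policy != "Allowed" && policy != "Overrides" {
		return nil, errNotAllowed
	}
	out := append([]envVar(nil), existing...)
	for _, e := range extra {
		idx := -1
		for i := range out {
			if out[i].Name == e.Name {
				idx = i
				break
			}
		}
		switch {
		case idx < 0:
			out = append(out, e) // new name: append
		case policy == "Overrides":
			out[idx] = e // same name: replace in place
		default:
			return nil, fmt.Errorf("environment variable %q already exists in the template", e.Name)
		}
	}
	return out, nil
}

func main() {
	base := []envVar{{"MODE", "prod"}}
	out, err := injectEnvs("Overrides", base, []envVar{{"MODE", "dev"}, {"REGION", "eu"}})
	fmt.Println(out, err)
	_, err = injectEnvs("Allowed", base, []envVar{{"MODE", "dev"}})
	fmt.Println(err)
}
```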

Sources: [extensions/controllers/sandboxclaim_controller.go:896-1038](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1155-1163](extensions/controllers/sandboxclaim_controller.go)

### `additionalPodMetadata` validation and merge

`validateAdditionalPodMetadata` rejects any key whose `/`-prefix domain is one of `kubernetes.io`, `k8s.io`, or `agents.x-k8s.io` (or a sub-domain), and runs `validation.IsValidLabelValue` for label values. `mergePodMetadata` then performs a strict "no overrides" merge: a claim label or annotation whose key already exists in the template with a *different* value fails the reconcile, while matching values and new keys are merged in. This is enforced both when creating and when adopting.
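
A stdlib-only sketch of the two rules above, restricted-domain rejection and the "no overrides" merge, might look like the following. The function names `isRestrictedKey` and `mergeNoOverride` are hypothetical, and label-value validation (`validation.IsValidLabelValue` in the real code) is omitted.

```go
package main

import (
	"fmt"
	"strings"
)

// restrictedDomains mirrors the list named in the text.
var restrictedDomains = []string{"kubernetes.io", "k8s.io", "agents.x-k8s.io"}

// isRestrictedKey reports whether the key's prefix (the part before "/") is a
// restricted domain or a sub-domain of one.
func isRestrictedKey(key string) bool {
	prefix, _, found := strings.Cut(key, "/")
	if !found {
		return false // unprefixed keys have no domain
	}
	for _, d := range restrictedDomains {
		if prefix == d || strings.HasSuffix(prefix, "."+d) {
			return true
		}
	}
	return false
}

// mergeNoOverride returns base plus extra, failing if extra would change the
// value of a key that base already defines.
func mergeNoOverride(base, extra map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(base)+len(extra))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range extra {
		if cur, ok := out[k]; ok && cur != v {
			return nil, fmt.Errorf("key %q conflicts with the template value", k)
		}
		out[k] = v
	}
	return out, nil
}

func main() {
	fmt.Println(isRestrictedKey("app.kubernetes.io/name"), isRestrictedKey("example.com/team"))
	merged, err := mergeNoOverride(
		map[string]string{"tier": "web"},
		map[string]string{"tier": "web", "example.com/team": "a"},
	)
	fmt.Println(merged, err)
}
```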

Sources: [extensions/controllers/sandboxclaim_controller.go:806-894](extensions/controllers/sandboxclaim_controller.go)

## Warm pool adoption

### Lookup order in `getOrCreateSandbox`

`getOrCreateSandbox` exhausts existing-binding paths before touching the warm queue:

1. `claim.Status.SandboxStatus.Name` — if set and the named `Sandbox` is owned by this claim, return it.
2. `claim.Labels[AssignedSandboxNameLabel]` — the optimistic lock label written during a previous adoption attempt. If the sandbox is still `Kind: SandboxWarmPool`-owned, the controller retries `completeAdoption` and returns an "in progress" error so the next reconcile sees it controlled by this claim. If it does not exist, the stale label is removed by patch.
3. Name-based lookup at `claim.Name` — picks up cold-path sandboxes the controller previously created; verifies controller ownership and refuses to silently overwrite a foreign-owned sandbox with the same name.
4. Otherwise, if `WarmPool == "none"` return `nil, nil` (caller cold-starts). For `default` or a specific pool name, pop a candidate from `WarmSandboxQueue`.

Sources: [extensions/controllers/sandboxclaim_controller.go:1070-1180](extensions/controllers/sandboxclaim_controller.go)

### `getCandidate` and `adoptSandboxFromCandidates`

`getCandidate` pops keys from the per-template-hash queue, fetches each one, and discards *ghost pods* (queue keys whose `Sandbox` is gone from the informer cache). Candidates that fail `verifySandboxCandidate` are dropped; those that are simply in the wrong namespace (`ErrCrossNamespaceAdoption`) are skipped and re-queued via a deferred re-add. When `WarmPoolPolicy.IsSpecificPool()` is true, candidates whose `warmPoolSandboxLabel` does not equal `NameHash(<pool>)` are also skipped back to the queue.

`adoptSandboxFromCandidates` then performs an optimistic two-step claim adoption, retried up to three times:

1. Set `claim.Labels[AssignedSandboxNameLabel] = <sandbox-name>` and `Update` the claim. A `Conflict` here means another reconciler raced us; the candidate is re-queued and the loop retries.
2. `completeAdoption` patches the sandbox: strips warm pool labels (`warmPoolSandboxLabel`, `sandboxTemplateRefHash`, `SandboxPodTemplateHashLabel`), drops the old `SandboxWarmPool` owner ref, sets the claim as controller, ensures `SandboxPodNameAnnotation == sandbox.Name`, propagates the trace-context annotation, re-applies identity labels, then rebuilds the pod-template `ObjectMeta` exactly as the active path does (template metadata + identity labels + merged claim metadata). A missing-template fallback merges directly into the existing sandbox's pod template.

Successful adoption records `LaunchTypeWarm` with the source warm-pool name and the candidate's readiness, and emits a `SandboxAdopted` event.

Sources: [extensions/controllers/sandboxclaim_controller.go:591-794](extensions/controllers/sandboxclaim_controller.go)

### Pod-exclusivity invariant

The 1:1 invariant — every warm pool pod is adopted by at most one claim, and every claim ends up owning exactly one `Sandbox` — is enforced by two independent mechanisms that the test `TestWarmPoolPodExclusivity` exercises end-to-end with three claims and two warm pods:

- The queue itself: `r.WarmSandboxQueue.Get(templateHash)` is a pop, not a list. A consumed key is gone unless the reconciler explicitly re-adds it.
- The `AssignedSandboxNameLabel` + claim `Update` acts as an optimistic lock; concurrent claims that pop the same key will lose the race when persisting the label, push the key back, and retry.

The result is verified by collecting `sandbox → owning-claim` from `controllerRef` and asserting both directions are 1:1, and that both warm pods are adopted while the third claim cold-starts.
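
The first mechanism, pop semantics on the queue, is enough to reproduce the shape of that test in miniature. The sketch below is illustrative only: `warmQueue` and `bindClaims` are hypothetical stand-ins for `queue.SandboxQueue` and the reconciler, the optimistic label lock is not modelled, and a cold start is represented by a sandbox named after its claim.

```go
package main

import (
	"fmt"
	"sync"
)

// warmQueue is a hypothetical pop-based queue of warm sandbox names.
type warmQueue struct {
	mu   sync.Mutex
	keys []string
}

func (q *warmQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.keys = append(q.keys, key)
}

// Get pops the oldest key; a popped key is gone unless it is re-added.
func (q *warmQueue) Get() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.keys) == 0 {
		return "", false
	}
	key := q.keys[0]
	q.keys = q.keys[1:]
	return key, true
}

// bindClaims runs one goroutine per claim and returns sandbox -> owning claim.
func bindClaims(claims []string, q *warmQueue) map[string]string {
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		owners = make(map[string]string)
	)
	for _, claim := range claims {
		wg.Add(1)
		go func(claim string) {
			defer wg.Done()
			sandbox, ok := q.Get()
			if !ok {
				sandbox = claim // queue empty: cold start
			}
			mu.Lock()
			owners[sandbox] = claim
			mu.Unlock()
		}(claim)
	}
	wg.Wait()
	return owners
}

func main() {
	q := &warmQueue{}
	q.Add("warm-0")
	q.Add("warm-1")
	owners := bindClaims([]string{"claim-a", "claim-b", "claim-c"}, q)
	fmt.Println(len(owners), owners)
}
```

Which claims receive the warm sandboxes varies between runs, but every run ends with three distinct sandboxes, both warm ones adopted, and exactly one cold start.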

Sources: [extensions/controllers/sandboxclaim_pod_exclusivity_test.go:41-187](extensions/controllers/sandboxclaim_pod_exclusivity_test.go), [extensions/controllers/sandboxclaim_controller.go:648-726](extensions/controllers/sandboxclaim_controller.go)

```text
WarmSandboxQueue[templateHash]
        |
        | pop()  (single-consumer per key)
        v
 +--------------+   1. claim.Update(label=sb)      +------------------+
 | reconcile A  | -------------------------------> | apiserver        |
 +--------------+   2. patch sandbox controller    | (resourceVersion)|
                                                   +------------------+
        ^                                                   |
        | re-add key on Conflict                            v
 +--------------+   1. claim.Update(label=sb) -> 409 Conflict
 | reconcile B  |   re-queue and pick next candidate
 +--------------+
```

## Status, conditions, and events

`computeAndSetStatus` produces a `Ready` condition (`computeReadyCondition`) and then mirrors the `Finished` condition from the `Sandbox` onto the claim via `syncFinishedCondition`. `SandboxStatus.{Name, PodIPs}` is copied from the sandbox when present and cleared otherwise.

`computeReadyCondition` short-circuits in a strict order:

| Input | `Ready` Reason | Notes |
|---|---|---|
| `err = ErrTemplateNotFound` | `TemplateNotFound` | False; reconcile requeues in 1 min. |
| `err = ErrInvalidMetadata` | `InvalidMetadata` | False; error suppressed (no requeue spam). |
| `err = ErrSandboxNotOwned` | `ClaimExpired` | False; treated as expired-state cleanup blocked. |
| any other `err` | `ReconcilerError` | False; error returned for backoff. |
| `isClaimExpired` true | `ClaimExpired` | "Sandbox cleanup initiated." |
| `sandbox == nil` | `SandboxMissing` | False. |
| underlying sandbox has `Ready=False, Reason=SandboxExpired` | `SandboxExpired` | Forwards core-controller expiry. |
| else | forwards the sandbox's `Ready` condition verbatim | falls back to `SandboxNotReady` if absent. |

`syncFinishedCondition` only mirrors `Sandbox.Status.SandboxConditionFinished` when a sandbox is present; if no sandbox exists and the claim is *not* expired, it removes any stale `Finished` condition to avoid keeping a terminal marker on a re-provisioned claim.

`updateStatus` sorts both old and new conditions deterministically by `Type` and then `Status().Patch` only when the semantic deep-equal differs, keeping resourceVersion churn minimal.
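
The short-circuit order in the `computeReadyCondition` table can be sketched as one `switch`. The sentinel names come from the text, but the `sandboxView` type and the `claimReadyReason` function are hypothetical simplifications that return only the reason string.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins for the controller's sentinel errors.
var (
	ErrTemplateNotFound = errors.New("sandbox template not found")
	ErrInvalidMetadata  = errors.New("invalid additional pod metadata")
	ErrSandboxNotOwned  = errors.New("sandbox not owned by claim")
	errTransient        = errors.New("transient API failure")
)

// sandboxView is a hypothetical summary of the underlying Sandbox's Ready
// condition; readyReason is empty when the condition is absent.
type sandboxView struct {
	expired     bool // Ready=False with the core controller's expired reason
	readyReason string
}

// claimReadyReason evaluates the checks in the order the table lists them.
func claimReadyReason(err error, claimExpired bool, sb *sandboxView) string {
	switch {
	case errors.Is(err, ErrTemplateNotFound):
		return "TemplateNotFound"
	case errors.Is(err, ErrInvalidMetadata):
		return "InvalidMetadata"
	case errors.Is(err, ErrSandboxNotOwned):
		return "ClaimExpired"
	case err != nil:
		return "ReconcilerError"
	case claimExpired:
		return "ClaimExpired"
	case sb == nil:
		return "SandboxMissing"
	case sb.expired:
		return "SandboxExpired"
	case sb.readyReason == "":
		return "SandboxNotReady"
	default:
		return sb.readyReason // forward the sandbox's own reason
	}
}

func main() {
	// errors.Is also matches wrapped sentinels.
	wrapped := fmt.Errorf("resolve template: %w", ErrTemplateNotFound)
	fmt.Println(claimReadyReason(wrapped, false, nil))
	fmt.Println(claimReadyReason(errTransient, false, nil))
	fmt.Println(claimReadyReason(nil, false, &sandboxView{readyReason: "DependenciesReady"}))
}
```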

Sources: [extensions/controllers/sandboxclaim_controller.go:422-576](extensions/controllers/sandboxclaim_controller.go)

## Lifecycle: expiration, TTL after finished, shutdown policies

Expiration is computed by `checkExpiration`, which delegates to `lifecycle.TimeLeft(now, ShutdownTime, TTLSecondsAfterFinished, finishedCondition)`. The library returns `(true, 0)` once expired, `(false, dur)` with the remaining duration otherwise, choosing the earliest of `ShutdownTime` and `finishedAt + TTL`.
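
The "earliest deadline wins" rule can be sketched as follows. This is not the signature of `lifecycle.TimeLeft`: the sketch takes the TTL as a `*time.Duration` and the finish time as a `*time.Time` instead of a seconds field and a condition, the helpers `at` and `secs` are hypothetical, and returning `(false, 0)` when no deadline is configured is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time so the example is deterministic.
var epoch = time.Unix(0, 0)

// at and secs are small helpers for building optional inputs.
func at(seconds int) *time.Time {
	t := epoch.Add(time.Duration(seconds) * time.Second)
	return &t
}

func secs(n int) *time.Duration {
	d := time.Duration(n) * time.Second
	return &d
}

// timeLeft reports whether the earliest configured deadline has passed and,
// if not, how long remains. Nil inputs mean "not set".
func timeLeft(now time.Time, shutdownTime *time.Time, ttl *time.Duration, finishedAt *time.Time) (bool, time.Duration) {
	deadline := shutdownTime
	if ttl != nil && finishedAt != nil {
		ttlDeadline := finishedAt.Add(*ttl)
		if deadline == nil || ttlDeadline.Before(*deadline) {
			deadline = &ttlDeadline
		}
	}
	if deadline == nil {
		return false, 0 // nothing to wait for (assumed convention)
	}
	if !now.Before(*deadline) {
		return true, 0
	}
	return false, deadline.Sub(now)
}

func main() {
	now := *at(100)
	// Finished at t=90 with a 60s TTL expires before the t=400 shutdown time.
	expired, left := timeLeft(now, at(400), secs(60), at(90))
	fmt.Println(expired, left)
	// A 5s TTL has already elapsed.
	expired, left = timeLeft(now, at(400), secs(5), at(90))
	fmt.Println(expired, left)
}
```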

```mermaid
stateDiagram-v2
    [*] --> Active
    Active --> Active: requeue at min(ShutdownTime, finishedAt+TTL)
    Active --> Expired: now >= expireAt

    state Expired {
        [*] --> RetainBranch: Lifecycle.ShutdownPolicy = Retain
        [*] --> DeleteBranch: Lifecycle.ShutdownPolicy = Delete
        [*] --> ForegroundBranch: Lifecycle.ShutdownPolicy = DeleteForeground

        RetainBranch --> SandboxDeleted: reconcileExpired Deletes Sandbox\nKeeps Claim with Ready=ClaimExpired
        DeleteBranch --> ClaimDeleted: Delete(claim)\nNo propagation policy
        ForegroundBranch --> ClaimDeleted: Delete(claim,\nPropagationPolicy=Foreground)
    }

    SandboxDeleted --> [*]
    ClaimDeleted --> [*]
```

`Reconcile` only takes the delete-claim branch when `claimExpired` *and* either policy `Delete` or `DeleteForeground` is configured. `DeleteForeground` adds `client.PropagationPolicy(metav1.DeletePropagationForeground)`, ensuring the API server blocks finalization of the claim until its owned `Sandbox` (and its dependents) are removed. After issuing the delete, the reconciler returns immediately — continuing would attempt to patch the status of an object already in deletion.

For `Retain`, `reconcileExpired` looks up the sandbox by `Status.SandboxStatus.Name` (falling back to `claim.Name`), verifies controller ownership (otherwise returns `ErrSandboxNotOwned`), and issues a non-foreground delete. The claim itself is preserved with the `ClaimExpired` Ready reason.

After computing status on the active path, `Reconcile` re-runs `checkExpiration` (`postExpiration`) to handle the case where mirroring the `Finished` condition from the sandbox *during this same reconcile* made the claim newly TTL-expired. When this happens, the controller writes status and requeues at `immediateRequeueDelay = 1ms` so the next pass enters the expired branch. `TestSandboxClaimMirrorsFinishedConditionAndSchedulesTTL` and `TestSandboxClaimTTLAfterFinishedCleanupPolicy` cover this two-pass behavior: pass 1 mirrors `Finished` and computes a positive `RequeueAfter`; pass 2 triggers cleanup according to the policy.

Sources: [extensions/controllers/sandboxclaim_controller.go:165-261](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:309-420](extensions/controllers/sandboxclaim_controller.go), [internal/lifecycle/expiry.go:24-82](internal/lifecycle/expiry.go), [extensions/controllers/sandboxclaim_controller_test.go:1110-1303](extensions/controllers/sandboxclaim_controller_test.go)

## Watches, predicates, and queue feeders

`SetupWithManager` builds a controller with:

- `For(&SandboxClaim{}, WithPredicates(getTimingPredicate()))` — the timing predicate stamps the first-observed time per UID in `observedTimes` and removes the entry on delete events so the map cannot leak.
- `Owns(&Sandbox{})` — standard owner-driven requeues.
- `Watches(&Sandbox{}, &sandboxEventHandler{...})` — pushes adoptable sandboxes into the per-template-hash queue and removes ghost-pod keys on delete.
- `Watches(&SandboxTemplate{}, &templateEventHandler{...})` — drops the entire warm queue for a deleted template via `RemoveQueue`.
- `Watches(&SandboxTemplate{}, EnqueueRequestsFromMapFunc(mapTemplateToClaims), WithPredicates(ResourceVersionChangedPredicate{}))` — when a template changes, re-enqueue every claim that references it through the indexed `TemplateRefField`.
- A field index on `TemplateRefField` is registered via `mgr.GetFieldIndexer().IndexField` for the template→claims mapping.

`sandboxEventHandler.Update` enqueues a key when a sandbox transitions from not-adoptable to adoptable, or when an already-adoptable sandbox changes its template hash. `isAdoptable` requires: not deleting, has `warmPoolSandboxLabel`, has `sandboxTemplateRefHash`, and (if controlled) the controller kind is `SandboxWarmPool`. `verifySandboxCandidate` adds a same-namespace check via the `ErrCrossNamespaceAdoption` sentinel.
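
The adoptability check and the update-enqueue rule translate into two small predicates. In this sketch the `sandboxMeta` struct and `shouldEnqueue` are hypothetical, and the value of `warmPoolSandboxLabel` is a placeholder because the text does not give the real label key.

```go
package main

import "fmt"

const (
	// Placeholder key; the real label name is not stated in the text.
	warmPoolSandboxLabel   = "example.invalid/warm-pool-sandbox"
	sandboxTemplateRefHash = "agents.x-k8s.io/sandbox-template-ref-hash"
)

// sandboxMeta is a hypothetical summary of the metadata the handler inspects.
type sandboxMeta struct {
	deleting       bool
	labels         map[string]string
	controllerKind string // empty when there is no controller owner reference
}

// isAdoptable applies the four requirements listed above.
func isAdoptable(sb sandboxMeta) bool {
	if sb.deleting {
		return false
	}
	if _, ok := sb.labels[warmPoolSandboxLabel]; !ok {
		return false
	}
	if _, ok := sb.labels[sandboxTemplateRefHash]; !ok {
		return false
	}
	return sb.controllerKind == "" || sb.controllerKind == "SandboxWarmPool"
}

// shouldEnqueue mirrors the update rule: enqueue on a transition into the
// adoptable state, or when an adoptable sandbox changes its template hash.
func shouldEnqueue(old, cur sandboxMeta) bool {
	if !isAdoptable(cur) {
		return false
	}
	if !isAdoptable(old) {
		return true
	}
	return old.labels[sandboxTemplateRefHash] != cur.labels[sandboxTemplateRefHash]
}

func main() {
	warm := sandboxMeta{
		labels:         map[string]string{warmPoolSandboxLabel: "pool", sandboxTemplateRefHash: "abc"},
		controllerKind: "SandboxWarmPool",
	}
	claimed := warm
	claimed.controllerKind = "SandboxClaim"
	fmt.Println(isAdoptable(warm), isAdoptable(claimed), shouldEnqueue(claimed, warm))
}
```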

Sources: [extensions/controllers/sandboxclaim_controller.go:1228-1297](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1446-1565](extensions/controllers/sandboxclaim_controller.go)

## Secure defaults and legacy cleanup

`ApplySandboxSecureDefaults` (in `utils.go`) is applied once by `createSandbox`:

- Sets `AutomountServiceAccountToken = false` when unset.
- When the template uses managed network policy (`NetworkPolicyManagement == ""` or `Managed`) *and* defines no `NetworkPolicy`, the controller is in "Secure by Default" mode: it overrides `DNSPolicy = None` and injects the external resolvers `8.8.8.8` and `1.1.1.1`, on the theory that internal DNS would let a sandbox enumerate cluster services. Custom rules or `Unmanaged` mode leave DNS alone so air-gapped or proxied environments still work.

`cleanupLegacyNetworkPolicy` runs every reconcile and idempotently deletes a deprecated per-claim `NetworkPolicy` named `<claim>-network-policy`, but only if the policy is actually controlled by the claim — a user-created policy with the same reserved name is logged and left alone. Errors are non-fatal so a transient API issue cannot block sandbox provisioning.
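
The two `ApplySandboxSecureDefaults` rules above can be sketched on simplified types. The `podSpec` and `templateSpec` structs and the `applySecureDefaults` function are hypothetical reductions of the real pod and template specs, kept only to show when each default fires.

```go
package main

import "fmt"

// podSpec is a hypothetical subset of the pod fields the defaults touch.
type podSpec struct {
	AutomountServiceAccountToken *bool
	DNSPolicy                    string
	Nameservers                  []string
}

// templateSpec is a hypothetical subset of the template's network settings.
type templateSpec struct {
	NetworkPolicyManagement string // "", "Managed", or "Unmanaged"
	HasNetworkPolicy        bool   // true when the template defines custom rules
}

// applySecureDefaults mutates p according to the two rules described above.
func applySecureDefaults(t templateSpec, p *podSpec) {
	// Rule 1: never mount the service account token unless explicitly set.
	if p.AutomountServiceAccountToken == nil {
		off := false
		p.AutomountServiceAccountToken = &off
	}
	// Rule 2: managed mode without custom rules switches to external DNS.
	managed := t.NetworkPolicyManagement == "" || t.NetworkPolicyManagement == "Managed"
	if managed && !t.HasNetworkPolicy {
		p.DNSPolicy = "None"
		p.Nameservers = []string{"8.8.8.8", "1.1.1.1"}
	}
}

func main() {
	secure := &podSpec{}
	applySecureDefaults(templateSpec{}, secure)
	fmt.Println(*secure.AutomountServiceAccountToken, secure.DNSPolicy, secure.Nameservers)

	custom := &podSpec{DNSPolicy: "ClusterFirst"}
	applySecureDefaults(templateSpec{NetworkPolicyManagement: "Unmanaged"}, custom)
	fmt.Println(*custom.AutomountServiceAccountToken, custom.DNSPolicy, custom.Nameservers)
}
```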

Sources: [extensions/controllers/utils.go:23-59](extensions/controllers/utils.go), [extensions/controllers/sandboxclaim_controller.go:1300-1332](extensions/controllers/sandboxclaim_controller.go)

## Latency metrics and observability

The reconciler records four metrics tied to launch type (`cold` / `warm`), template, and namespace:

- `RecordSandboxClaimCreation` at create/adopt time (with pool name and ready/not-ready state).
- `RecordClaimStartupLatency` from the webhook-stamped `WebhookAnnotation` time to the moment `Ready=True` is first observed.
- `RecordClaimControllerStartupLatency` from the controller-stamped `ObservabilityAnnotation` time.
- `RecordSandboxCreationLatency` from `sandbox.CreationTimestamp` to the underlying `Sandbox`'s `Ready=True` `LastTransitionTime`.

`recordCreationLatencyMetric` only fires on the *first* transition to `Ready=True` (`oldReady != True && newReady == True`). On re-reconciles after Ready, it also drains any `observedTimes` entry that a post-Ready `UpdateFunc` may have re-added, preventing duplicate latency emissions.
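
The first-transition rule and the drain step can be sketched in isolation. This is a minimal illustration, not the controller's code: `latencyTracker`, `observe`, and `onReconcile` are hypothetical names, times are plain seconds, and only the `observedTimes` name and the `oldReady != True && newReady == True` condition come from the text above.

```go
package main

import "fmt"

// latencyTracker is a hypothetical stand-in for the reconciler's bookkeeping.
// Only the observedTimes name comes from the text; times are plain seconds.
type latencyTracker struct {
	observedTimes map[string]int64 // claim key -> second at which it was observed
	emitted       []int64          // recorded latencies (stand-in for a metrics backend)
}

func newLatencyTracker() *latencyTracker {
	return &latencyTracker{observedTimes: map[string]int64{}}
}

// observe remembers when a claim was seen (or re-added by an update event).
func (t *latencyTracker) observe(key string, atSecond int64) {
	t.observedTimes[key] = atSecond
}

// onReconcile emits a latency sample only on the first transition to
// Ready=True. On later reconciles of an already-Ready claim it drains any
// re-added entry without emitting. It reports whether a sample was emitted.
func (t *latencyTracker) onReconcile(key, oldReady, newReady string, nowSecond int64) bool {
	start, tracked := t.observedTimes[key]
	if newReady != "True" {
		return false // not Ready yet: keep the entry for a later pass
	}
	delete(t.observedTimes, key) // Ready: the entry is consumed either way
	if oldReady == "True" || !tracked {
		return false // post-Ready re-reconcile, or nothing to measure
	}
	t.emitted = append(t.emitted, nowSecond-start)
	return true
}

func main() {
	t := newLatencyTracker()
	t.observe("ns/claim", 10)
	fmt.Println(t.onReconcile("ns/claim", "False", "True", 13)) // first transition
	t.observe("ns/claim", 20)                                    // simulated post-Ready re-add
	fmt.Println(t.onReconcile("ns/claim", "True", "True", 25))  // drained, not emitted
	fmt.Println(t.emitted, len(t.observedTimes))
}
```

Consuming the entry on every Ready pass is what keeps a late re-add from producing a second sample.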

`getLaunchType` distinguishes warm vs cold by the presence of `SandboxPodNameAnnotation` on the sandbox — warm-adopted sandboxes are stamped with their pre-existing pod name in `completeAdoption`, while cold-started ones leave the annotation empty.

Sources: [extensions/controllers/sandboxclaim_controller.go:1334-1434](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:748-755](extensions/controllers/sandboxclaim_controller.go)

## Summary

The `SandboxClaimReconciler` is small in surface but dense in invariants. A single `Reconcile` pass picks an active-vs-expired branch and threads a strict template/metadata/env policy through both cold creation and warm adoption. It enforces 1:1 sandbox ownership through a pop-based queue plus an optimistic claim-label lock, mirrors the `Finished` condition so TTL-after-finished can drive a second-pass cleanup, and routes expiration through three `ShutdownPolicy` modes, using foreground propagation when full subtree teardown is required. The watches, predicates, and ghost-pod handling around the warm pool queue make this the controller most worth reading carefully when changing claim, sandbox, or warm-pool semantics.

---

## 13. SandboxTemplate Reconciler

> Validation and bookkeeping done by the template controller, including how template changes ripple to claims and warm pools.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/13-sandboxtemplate-reconciler.md
- Generated: 2026-05-25T22:38:47.157Z

### Source Files

- `extensions/controllers/sandboxtemplate_controller.go`
- `extensions/controllers/sandboxtemplate_controller_test.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/controllers/sandboxtemplate_controller.go](extensions/controllers/sandboxtemplate_controller.go)
- [extensions/controllers/sandboxtemplate_controller_test.go](extensions/controllers/sandboxtemplate_controller_test.go)
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
</details>

# SandboxTemplate Reconciler

The `SandboxTemplateReconciler` is a focused controller-runtime reconciler whose sole job is to materialize one shared `NetworkPolicy` per `SandboxTemplate`. The `SandboxTemplate` CR itself is mostly inert spec — it carries the `PodTemplate`, `VolumeClaimTemplates`, an `EnvVarsInjectionPolicy`, an optional `NetworkPolicy` block and a `NetworkPolicyManagement` mode. The template controller owns only the lifecycle of the derived shared `NetworkPolicy`; the *rippling* of template spec changes to existing `Sandbox`, `SandboxClaim`, and `SandboxWarmPool` objects is performed by the claim and warm-pool reconcilers, which watch templates and react to events.

This page describes the reconciler's responsibilities, the secure-by-default policy it synthesizes, the management modes, the label hash that ties everything together, and how that hash is the mechanism by which template changes propagate to downstream consumers.

## Scope and responsibilities

The reconciler is intentionally narrow. Looking at the imports and `SetupWithManager`, it watches `SandboxTemplate` as its primary resource and `Owns` only `NetworkPolicy`; it does not own `Sandbox`, `SandboxClaim`, or `SandboxWarmPool` objects, and it does not write template status.

```go
// extensions/controllers/sandboxtemplate_controller.go
func (r *SandboxTemplateReconciler) SetupWithManager(mgr ctrl.Manager, concurrentWorkers int) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&extensionsv1beta1.SandboxTemplate{}).
        Owns(&networkingv1.NetworkPolicy{}).
        WithOptions(controller.Options{MaxConcurrentReconciles: concurrentWorkers}).
        Complete(r)
}
```

The reconciler struct holds the controller-runtime `Client`, the runtime `Scheme`, an `events.EventRecorder`, and an `asmetrics.Instrumenter` for span/tracing. RBAC kubebuilder markers grant verbs only on `sandboxtemplates`, `sandboxtemplates/finalizers`, `networkpolicies`, and `events` — there are no permissions for sandboxes, claims, or pools.

Sources: [extensions/controllers/sandboxtemplate_controller.go:38-50](extensions/controllers/sandboxtemplate_controller.go), [extensions/controllers/sandboxtemplate_controller.go:216-222](extensions/controllers/sandboxtemplate_controller.go)

## Reconcile loop

Each `Reconcile` call is a six-step state machine that converges the shared `NetworkPolicy` named `<template-name>-network-policy` in the template's own namespace.

| Step | Action | Source |
|------|--------|--------|
| 1 | `Get` the `SandboxTemplate`; on `NotFound` return cleanly. | [sandboxtemplate_controller.go:56-62](extensions/controllers/sandboxtemplate_controller.go) |
| 2 | Open a trace span via `Tracer.StartSpan`; bail early if `DeletionTimestamp` is set. | [sandboxtemplate_controller.go:64-69](extensions/controllers/sandboxtemplate_controller.go) |
| 3 | Resolve `npName = template.Name + "-network-policy"`, default `NetworkPolicyManagement` to `Managed` when empty. | [sandboxtemplate_controller.go:72-78](extensions/controllers/sandboxtemplate_controller.go) |
| 4 | If management is `Unmanaged`, delete the named policy (ignoring `NotFound`) and exit. | [sandboxtemplate_controller.go:81-92](extensions/controllers/sandboxtemplate_controller.go) |
| 5 | Construct the desired `NetworkPolicySpec` — either the secure default or a controller-shaped wrapper around the user spec. | [sandboxtemplate_controller.go:95-112](extensions/controllers/sandboxtemplate_controller.go) |
| 6 | Reconcile by diff: `Get` the existing policy; if `equality.Semantic.DeepEqual` matches, no-op; otherwise `Update`. If missing, `Create` with the template set as controller reference. | [sandboxtemplate_controller.go:115-153](extensions/controllers/sandboxtemplate_controller.go) |

Two implementation choices are worth noting. First, an unmanaged template will actively *delete* any pre-existing policy with the canonical name, which is how a user transitions a template from `Managed` to `Unmanaged` and trusts an external CNI such as Cilium to take over. Second, the semantic-deep-equal short-circuit (`return ctrl.Result{}, nil // Perfect match, O(1) efficiency.`) is what makes high-frequency requeues from downstream watchers cheap.

Only the create path calls `controllerutil.SetControllerReference`. Once the policy exists, subsequent updates only overwrite its `Spec`, so the owner reference is established exactly once and is what triggers `Owns(&networkingv1.NetworkPolicy{})` to requeue the template when its policy is mutated externally.

```mermaid
stateDiagram-v2
    [*] --> Fetched: Get SandboxTemplate
    Fetched --> Done: DeletionTimestamp != 0
    Fetched --> ResolveScope: alive
    ResolveScope --> DeleteNP: management == Unmanaged
    ResolveScope --> BuildDesired: management == Managed
    BuildDesired --> CheckExisting: spec built
    CheckExisting --> Done: DeepEqual match (no-op)
    CheckExisting --> UpdateNP: drift detected
    CheckExisting --> CreateNP: NotFound
    UpdateNP --> Done
    CreateNP --> Done
    DeleteNP --> Done
```
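
The branch structure in the diagram can be sketched as a pure decision function. This is an illustration under simplifying assumptions: `policySpec`, `action`, and `decide` are hypothetical names, `reflect.DeepEqual` stands in for `equality.Semantic.DeepEqual`, and a `nil` existing policy models a `NotFound` from the API server.

```go
package main

import (
	"fmt"
	"reflect"
)

// policySpec is a hypothetical, simplified stand-in for networkingv1.NetworkPolicySpec.
type policySpec struct {
	SelectorHash string
	Ingress      []string
	Egress       []string
}

// action names the single write (if any) that one reconcile pass performs.
type action string

const (
	actionNone   action = "none"
	actionCreate action = "create"
	actionUpdate action = "update"
	actionDelete action = "delete"
)

// decide mirrors the diagram: Unmanaged deletes, Managed creates when the
// policy is missing, updates on drift, and does nothing on a match.
func decide(management string, desired policySpec, existing *policySpec) action {
	if management == "" {
		management = "Managed" // same defaulting as step 3 of the table
	}
	if management == "Unmanaged" {
		if existing == nil {
			return actionNone // a delete that hits NotFound changes nothing
		}
		return actionDelete
	}
	if existing == nil {
		return actionCreate
	}
	// reflect.DeepEqual is only a stand-in for equality.Semantic.DeepEqual.
	if reflect.DeepEqual(*existing, desired) {
		return actionNone
	}
	return actionUpdate
}

func main() {
	desired := policySpec{SelectorHash: "abcd1234", Ingress: []string{"sandbox-router"}}
	same := policySpec{SelectorHash: "abcd1234", Ingress: []string{"sandbox-router"}}
	drifted := policySpec{SelectorHash: "abcd1234", Ingress: []string{"old-rule"}}

	fmt.Println(decide("", desired, nil))
	fmt.Println(decide("Managed", desired, &same))
	fmt.Println(decide("Managed", desired, &drifted))
	fmt.Println(decide("Unmanaged", desired, &same))
}
```

Keeping the decision free of API calls makes the no-op path easy to reason about: a matching spec never results in a write.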

Sources: [extensions/controllers/sandboxtemplate_controller.go:52-154](extensions/controllers/sandboxtemplate_controller.go)

## Management modes

`NetworkPolicyManagement` is a typed string enum on the template spec with kubebuilder-validated values `Managed` and `Unmanaged`. The CRD defaults to `Managed` and so does the reconciler when the field is empty, giving identical behavior whether the field is omitted or explicitly set.

| Mode | `template.Spec.NetworkPolicy` | Reconciler behavior |
|------|-------------------------------|---------------------|
| `Managed` (default) | `nil` | Builds the **Secure by Default** policy via `buildDefaultNetworkPolicySpec`. |
| `Managed` | set | Builds a controller-shaped policy whose `Ingress`/`Egress` come from the user spec, but whose `PodSelector` and `PolicyTypes` are still controller-owned. |
| `Unmanaged` | ignored | Deletes the canonical policy if present and returns; the `NetworkPolicy` field is *completely ignored* in this mode. |

The "Unmanaged ignores `NetworkPolicy`" semantic is exercised explicitly by the `templateOptOut` fixture in the test file, where the template carries a non-nil `NetworkPolicy.Egress` rule but no policy is expected to be created.

Sources: [extensions/api/v1beta1/sandboxtemplate_types.go:26-128](extensions/api/v1beta1/sandboxtemplate_types.go), [extensions/controllers/sandboxtemplate_controller.go:74-92](extensions/controllers/sandboxtemplate_controller.go), [extensions/controllers/sandboxtemplate_controller_test.go:71-79](extensions/controllers/sandboxtemplate_controller_test.go)

## Desired spec construction

Two fields of the resulting `NetworkPolicySpec` are always controller-owned, regardless of whether the template provided a custom policy: `PodSelector` and `PolicyTypes`. The CRD type comment calls this out explicitly — those fields are intentionally absent from `extensionsv1beta1.NetworkPolicySpec` so they cannot be overridden.

`PodSelector` always selects on the label key `agents.x-k8s.io/sandbox-template-ref-hash` with a value computed from the template name (see [Template hash propagation](#template-hash-propagation) below). `PolicyTypes` is always `[Ingress, Egress]`, which ensures a default-deny posture for both directions even when the user supplies only one rule list.

### Secure-by-default policy

When the template omits `NetworkPolicy` under `Managed`, `buildDefaultNetworkPolicySpec` produces a strict isolation profile:

- **Ingress**: a single rule allowing traffic only from pods labelled `app: sandbox-router`.
- **Egress**: a single rule whose `IPBlock` peers allow `0.0.0.0/0` minus `10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`, and `169.254.0.0/16`, plus `::/0` minus `fc00::/7` for IPv6.
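
To see what the egress ranges mean for individual destinations, the allow/except logic can be evaluated with the standard `net/netip` package. This is a sketch, assuming the CIDRs listed above: `ipBlock`, `defaultEgress`, and `egressAllowed` are hypothetical names, and real enforcement is done by the cluster CNI, not by code like this.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipBlock is a hypothetical, simplified stand-in for networkingv1.IPBlock.
type ipBlock struct {
	CIDR   netip.Prefix
	Except []netip.Prefix
}

// defaultEgress reproduces the ranges listed above.
var defaultEgress = []ipBlock{
	{
		CIDR: netip.MustParsePrefix("0.0.0.0/0"),
		Except: []netip.Prefix{
			netip.MustParsePrefix("10.0.0.0/8"),     // RFC1918
			netip.MustParsePrefix("172.16.0.0/12"),  // RFC1918
			netip.MustParsePrefix("192.168.0.0/16"), // RFC1918
			netip.MustParsePrefix("169.254.0.0/16"), // link-local (metadata server)
		},
	},
	{
		CIDR:   netip.MustParsePrefix("::/0"),
		Except: []netip.Prefix{netip.MustParsePrefix("fc00::/7")}, // IPv6 ULA
	},
}

// egressAllowed reports whether dst lies inside some block's CIDR and
// outside every exception of that same block.
func egressAllowed(dst string) bool {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false
	}
	for _, b := range defaultEgress {
		if !b.CIDR.Contains(addr) {
			continue
		}
		excluded := false
		for _, ex := range b.Except {
			if ex.Contains(addr) {
				excluded = true
				break
			}
		}
		if !excluded {
			return true
		}
	}
	return false
}

func main() {
	for _, dst := range []string{"8.8.8.8", "10.96.0.10", "169.254.169.254", "fd00::1", "2001:4860:4860::8888"} {
		fmt.Printf("%-22s allowed=%t\n", dst, egressAllowed(dst))
	}
}
```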

The inline comments emphasize the security stance: the egress block list deliberately excludes RFC1918 ranges (cluster/VPC traffic), the link-local metadata server, and IPv6 ULA. The comment on `169.254.0.0/16` is "Block Link-Local (Metadata Server)" and a comment notes "This intentionally blocks internal cluster DNS (CoreDNS) by default to prevent agents from probing for service discovery and leaking internal service names."

That last point is why `ApplySandboxSecureDefaults` in `utils.go` injects `DNSPolicy: None` plus public resolvers (`8.8.8.8`, `1.1.1.1`) into the Pod spec *only* when the template is in secure-by-default mode (managed and no custom `NetworkPolicy`). Custom policies and unmanaged templates leave DNS alone for "air-gapped/proxy compatibility." This is the one cross-cutting touchpoint between template configuration and the per-pod spec; it is invoked by the claim and warm-pool reconcilers, not by the template reconciler itself.

```go
// extensions/controllers/utils.go
isSecureByDefault := isManaged && template.Spec.NetworkPolicy == nil
if isSecureByDefault && spec.DNSPolicy == "" {
    spec.DNSPolicy = corev1.DNSNone
    spec.DNSConfig = &corev1.PodDNSConfig{
        Nameservers: []string{"8.8.8.8", "1.1.1.1"},
    }
}
```

### Custom policy

When `NetworkPolicy` is set under `Managed`, the reconciler still owns the selector and policy types but copies the user's `Ingress` and `Egress` lists verbatim:

```go
desiredSpec = networkingv1.NetworkPolicySpec{
    PodSelector: metav1.LabelSelector{
        MatchLabels: map[string]string{
            sandboxTemplateRefHash: SandboxTemplateRefHash(template.Name),
        },
    },
    PolicyTypes: []networkingv1.PolicyType{networkingv1.PolicyTypeIngress, networkingv1.PolicyTypeEgress},
    Ingress: template.Spec.NetworkPolicy.Ingress,
    Egress:  template.Spec.NetworkPolicy.Egress,
}
```

Because `PolicyTypes` always includes both directions, an empty `Ingress` or empty `Egress` list is a default-deny for that direction. The CRD field comment warns that this can block sidecar health checks (Istio proxy, monitoring agents) if their ports are not explicitly allowed.

Sources: [extensions/controllers/sandboxtemplate_controller.go:94-112](extensions/controllers/sandboxtemplate_controller.go), [extensions/controllers/sandboxtemplate_controller.go:156-213](extensions/controllers/sandboxtemplate_controller.go), [extensions/controllers/utils.go:23-48](extensions/controllers/utils.go), [extensions/api/v1beta1/sandboxtemplate_types.go:58-114](extensions/api/v1beta1/sandboxtemplate_types.go)

## Template hash propagation

The reconciler does not directly update sandboxes when a template changes; it has no permissions to do so. Instead, propagation rides on a name-derived label, `agents.x-k8s.io/sandbox-template-ref-hash`, declared as the package-level constant `sandboxTemplateRefHash` in `sandboxwarmpool_controller.go`. The label value is `NameHash(templateName)`, an FNV-1a 32-bit hash formatted as 8 hex characters. Because the hash covers only the template name, it stays constant when the template's spec is edited.

```go
// controllers/sandbox_controller.go
func NameHash(objectName string) string {
    return fmt.Sprintf("%08x", GetNumericHash(objectName))
}
```
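
`GetNumericHash` itself is not shown here. Assuming it is a plain FNV-1a 32-bit hash of the name, as described above, an equivalent can be sketched with the standard `hash/fnv` package; `nameHash` below is a hypothetical re-implementation for illustration, not the repository's function.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nameHash is a hypothetical equivalent of NameHash, assuming the underlying
// numeric hash is FNV-1a 32-bit over the raw bytes of the object name.
func nameHash(objectName string) string {
	h := fnv.New32a()
	h.Write([]byte(objectName)) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	// The same name always yields the same 8-character label value;
	// different names almost always yield different values.
	for _, name := range []string{"python-template", "python-template", "node-template"} {
		fmt.Printf("%s -> %s\n", name, nameHash(name))
	}
}
```

The fixed 8-character output fits comfortably inside the 63-character limit on Kubernetes label values, which an arbitrary template name may not. A 32-bit hash can in principle collide for two different names.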

Three controllers cooperate via this single label:

```mermaid
flowchart LR
    subgraph Template["SandboxTemplate domain"]
        TPL[SandboxTemplate spec]
        TC[SandboxTemplateReconciler]
        NP["Shared NetworkPolicy\n&lt;name&gt;-network-policy\nPodSelector: ref-hash=H"]
    end
    subgraph Pool["Warm pool domain"]
        WP[SandboxWarmPool]
        WPC[SandboxWarmPoolReconciler]
        WS["Warm Sandboxes\nlabel ref-hash=H"]
    end
    subgraph Claim["Claim domain"]
        SC[SandboxClaim]
        SCC[SandboxClaimReconciler]
        SB["Claimed Sandbox\nlabel ref-hash=H"]
    end
    TPL --> TC --> NP
    TPL -.watches.-> WPC
    TPL -.watches.-> SCC
    WPC --> WS
    SCC --> SB
    NP -. CNI selects pods .-> WS
    NP -. CNI selects pods .-> SB
```

- The template reconciler stamps the hash into the `NetworkPolicy.PodSelector.MatchLabels`.
- The claim reconciler stamps the same hash onto the `Sandbox.Spec.PodTemplate.ObjectMeta.Labels` (and the merged pod metadata) when materializing or adopting a sandbox.
- The warm-pool reconciler stamps the same hash onto its warm-up sandboxes and uses it as the queue key for prewarmed candidates.

Because the policy's `PodSelector` and every downstream sandbox carry the same value, the CNI binds them at runtime — no extra plumbing required when the policy spec changes.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:48-52](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/utils.go:50-59](extensions/controllers/utils.go), [controllers/sandbox_controller.go:454-458](controllers/sandbox_controller.go), [extensions/controllers/sandboxclaim_controller.go:358-372](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:960-970](extensions/controllers/sandboxclaim_controller.go)

## How template changes ripple to claims and warm pools

The template reconciler only updates the `NetworkPolicy`. Spec changes that affect `PodTemplate`, `VolumeClaimTemplates`, or `EnvVarsInjectionPolicy` reach existing sandboxes through *watch-driven requeue* in the other two reconcilers.

### Claim reconciler watch wiring

The claim controller installs two watches against `SandboxTemplate` in its `SetupWithManager`:

```go
// extensions/controllers/sandboxclaim_controller.go
Watches(&extensionsv1beta1.SandboxTemplate{}, &templateEventHandler{sandboxQueue: r.WarmSandboxQueue}).
Watches(
    &extensionsv1beta1.SandboxTemplate{},
    handler.EnqueueRequestsFromMapFunc(r.mapTemplateToClaims),
    builder.WithPredicates(predicate.ResourceVersionChangedPredicate{}),
).
```

- `mapTemplateToClaims` uses a `TemplateRefField` field index over `SandboxClaim` objects to enqueue every claim that references the changed template, scoped to its namespace.
- A `templateEventHandler` listens for template **deletion** only, and on delete calls `sandboxQueue.RemoveQueue(templateHash)` to flush warm-pool candidates from the in-memory queue, so claims do not pick up sandboxes destined for a vanished template.

When a claim is requeued, it re-runs its warm/cold logic; the warm path may rewrite the sandbox's merged pod metadata to match the latest template labels and annotations, including refreshing the `sandboxTemplateRefHash` label.
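
The reverse lookup behind `mapTemplateToClaims` can be pictured as a namespace-scoped index from template name to claim names. The sketch below is an in-memory analogy only: `claim`, `templateIndex`, `buildIndex`, and `claimsForTemplate` are hypothetical, and the real controller queries a controller-runtime field index, not a Go map.

```go
package main

import (
	"fmt"
	"sort"
)

// claim is a hypothetical, minimal stand-in for a SandboxClaim.
type claim struct {
	Namespace   string
	Name        string
	TemplateRef string
}

// templateIndex maps "namespace/templateName" to the claims that reference
// that template, playing the role a field index plays in the informer cache.
type templateIndex map[string][]string

func buildIndex(claims []claim) templateIndex {
	idx := templateIndex{}
	for _, c := range claims {
		key := c.Namespace + "/" + c.TemplateRef
		idx[key] = append(idx[key], c.Name)
	}
	return idx
}

// claimsForTemplate returns the claim names to enqueue when the given
// template changes. Claims in other namespaces are never returned.
func (idx templateIndex) claimsForTemplate(namespace, template string) []string {
	out := append([]string(nil), idx[namespace+"/"+template]...)
	sort.Strings(out)
	return out
}

func main() {
	idx := buildIndex([]claim{
		{Namespace: "team-a", Name: "claim-1", TemplateRef: "python"},
		{Namespace: "team-a", Name: "claim-2", TemplateRef: "python"},
		{Namespace: "team-b", Name: "claim-3", TemplateRef: "python"},
		{Namespace: "team-a", Name: "claim-4", TemplateRef: "node"},
	})
	fmt.Println(idx.claimsForTemplate("team-a", "python"))
	fmt.Println(idx.claimsForTemplate("team-b", "python"))
}
```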

### Warm-pool reconciler watch wiring

The warm-pool reconciler installs an equivalent watch:

```go
// extensions/controllers/sandboxwarmpool_controller.go
Watches(
    &extensionsv1beta1.SandboxTemplate{},
    handler.EnqueueRequestsFromMapFunc(r.findWarmPoolsForTemplate),
).
```

`findWarmPoolsForTemplate` lists `SandboxWarmPool` objects via a `TemplateRefField` field index and requeues each one. Inside the reconcile loop, `isSandboxStale` then evaluates each existing warm-pool sandbox against the current template: it checks the `sandboxTemplateRefHash` label, then uses `comparePodSpecs`, which applies `ApplySandboxSecureDefaults` to a deep copy of the template spec and `equality.Semantic.DeepEqual`s it against the live sandbox spec. Stale sandboxes are recycled.

```go
// extensions/controllers/sandboxwarmpool_controller.go
if sandbox.Labels[sandboxTemplateRefHash] != SandboxTemplateRefHash(template.Name) {
    return true
}
// ... hash-cache + semantic DeepEqual on normalized pod specs.
```

### Network-policy ripple

For the `NetworkPolicy` itself the ripple is implicit: the template reconciler updates a single shared object, and the cluster CNI re-evaluates rules against every pod that currently carries the matching `sandboxTemplateRefHash` label (both warm-pool sandboxes and claimed sandboxes) without any per-sandbox write from the template controller. The CRD field comment makes this explicit: "any updates to these rules will be applied to the single shared policy object. The underlying Kubernetes CNI will then dynamically enforce the updated rules across all existing and future sandboxes referencing this template."

Sources: [extensions/controllers/sandboxclaim_controller.go:1249-1298](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1543-1566](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:533-583](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:462-531](extensions/controllers/sandboxwarmpool_controller.go), [extensions/api/v1beta1/sandboxtemplate_types.go:99-104](extensions/api/v1beta1/sandboxtemplate_types.go)

## Test coverage and invariants

`TestSandboxTemplateReconcileNetworkPolicy` is a table-driven test covering the five user-visible behaviors of the reconciler:

| Case | Pre-state | Expected post-state |
|------|-----------|--------------------|
| `Creates Default Secure Policy (Strict Isolation) when template has none` | template with empty `NetworkPolicy`, managed | shared NP with `PolicyTypes` length 2, ingress from `app: sandbox-router`, egress `0.0.0.0/0`, selector key `agents.x-k8s.io/sandbox-template-ref-hash` |
| `Creates custom network policy when defined in template` | template with explicit ingress/egress | shared NP whose selector value equals `NameHash("test-template-custom")`, ingress preserved as `app: ingress` |
| `NetworkPolicy is not created when template is Unmanaged` | unmanaged template, no existing NP | no NP exists |
| `Existing NetworkPolicy is deleted when template updates to Unmanaged` | unmanaged template + pre-existing NP | NP deleted |
| `Existing NetworkPolicy is updated when template spec changes` | template with new spec + outdated NP | NP overwritten; stale `old-label` removed, new ingress applied |

The "update" case is the strongest behavioral invariant: outdated `PodSelector.MatchLabels["old-label"]` must disappear after one reconcile, confirming that the reconciler replaces `existingNP.Spec` wholesale rather than merging. That matches the implementation line `existingNP.Spec = desiredSpec` followed by `r.Update(ctx, existingNP)`.

Sources: [extensions/controllers/sandboxtemplate_controller_test.go:39-201](extensions/controllers/sandboxtemplate_controller_test.go), [extensions/controllers/sandboxtemplate_controller.go:118-130](extensions/controllers/sandboxtemplate_controller.go)

## Summary

The `SandboxTemplateReconciler` is deliberately small: it converges a single shared `NetworkPolicy` per `SandboxTemplate`, honors a `Managed`/`Unmanaged` mode switch, defaults to a strict secure-by-default profile that blocks RFC1918 and metadata-server egress, and idempotently exits via `equality.Semantic.DeepEqual`. It never writes downstream resources directly. Spec changes ripple to existing claims and warm pools through field-indexed watch handlers in the `SandboxClaim` and `SandboxWarmPool` controllers, while the shared `NetworkPolicy` itself ripples via CNI enforcement against the controller-owned `sandboxTemplateRefHash` pod-selector label that every downstream sandbox carries.

---

## 14. SandboxWarmPool Reconciler

> Pool maintenance loop: parallel batch creation/deletion bounded by max-batch-size, rollout on template changes, and watcher coordination with SandboxClaim.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/14-sandboxwarmpool-reconciler.md
- Generated: 2026-05-25T22:38:21.559Z

### Source Files

- `extensions/controllers/sandboxwarmpool_controller.go`
- `extensions/controllers/sandboxwarmpool_controller_test.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [extensions/controllers/sandboxwarmpool_controller_test.go](extensions/controllers/sandboxwarmpool_controller_test.go)
- [extensions/api/v1beta1/sandboxwarmpool_types.go](extensions/api/v1beta1/sandboxwarmpool_types.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
</details>

# SandboxWarmPool Reconciler

The `SandboxWarmPoolReconciler` maintains a pool of pre-allocated `Sandbox` CRs so that a `SandboxClaim` can adopt a warm sandbox instead of paying a full cold-start. Each pool is bound to a `SandboxTemplate`; the reconciler keeps the actual replica count converging on `Spec.Replicas`, garbage-collects sandboxes that fail to become Ready, and rolls forward stale sandboxes when the underlying template drifts. Coordination with `SandboxClaimReconciler` happens through shared labels and a controller-side queue rather than a direct API.

This page covers the reconcile loop, the parallel slow-start batching used for create/delete, the staleness model and rollout strategies, and how the controller hands off sandboxes to claims through ownership transfer.

## Controller Shape and Watches

`SandboxWarmPoolReconciler` reconciles `SandboxWarmPool` as its primary resource and owns the `Sandbox` CRs it creates. It also watches `SandboxTemplate` so that any template mutation re-queues every pool that references it. A field indexer on `.spec.sandboxTemplateRef.name` powers the reverse lookup.

```go
// extensions/controllers/sandboxwarmpool_controller.go:548-557
return ctrl.NewControllerManagedBy(mgr).
    For(&extensionsv1beta1.SandboxWarmPool{}).
    Owns(&sandboxv1beta1.Sandbox{}).
    WithOptions(controller.Options{MaxConcurrentReconciles: concurrentWorkers}).
    Watches(
        &extensionsv1beta1.SandboxTemplate{},
        handler.EnqueueRequestsFromMapFunc(r.findWarmPoolsForTemplate),
    ).
    Complete(r)
```

`findWarmPoolsForTemplate` lists pools by the `TemplateRefField` index and returns one `reconcile.Request` per match, so a template change fans out to every dependent pool exactly once.

```mermaid
flowchart LR
    subgraph API["Kubernetes API"]
        WP["SandboxWarmPool"]
        T["SandboxTemplate"]
        S["Sandbox (owned)"]
        SC["SandboxClaim"]
    end
    subgraph WPC["SandboxWarmPoolReconciler"]
        REC["reconcilePool()"]
        FW["findWarmPoolsForTemplate()"]
    end
    subgraph SCC["SandboxClaimReconciler"]
        EH["sandboxEventHandler"]
        Q["WarmSandboxQueue"]
        ADOPT["adoptSandboxFromCandidates"]
    end
    WP -- Reconcile --> REC
    T -- Watch --> FW --> REC
    REC -- Create/Delete/Adopt --> S
    S -- Owner watch --> WP
    S -- Watch --> EH
    EH -- Add/Remove key --> Q
    SC -- Reconcile --> ADOPT
    ADOPT -- Pop key, Patch ownership --> S
```

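Sources: [extensions/controllers/sandboxwarmpool_controller.go:533-583](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxclaim_controller.go:1446-1540](extensions/controllers/sandboxclaim_controller.go)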

## API Surface

`SandboxWarmPoolSpec` is intentionally small: a desired replica count, a template reference, and an optional update strategy. The CRD is `Scale`-subresource enabled, which lets an HPA drive `replicas`.

| Field | Type | Notes |
|---|---|---|
| `spec.replicas` | int32, required, min 0 | Desired pool size; targeted by the `scale` subresource. |
| `spec.sandboxTemplateRef.name` | string | Indexed as `.spec.sandboxTemplateRef.name`; powers the template watch. |
| `spec.updateStrategy.type` | enum `Recreate`/`OnReplenish`, default `OnReplenish` | Controls how stale sandboxes are reconciled. |
| `status.replicas` | int32 | Count of active (non-deleting, owned/adoptable) sandboxes. |
| `status.readyReplicas` | int32 | Count whose `Ready` condition is `True`. |
| `status.selector` | string | Label-selector string used by the `scale` subresource. |

Sources: [extensions/api/v1beta1/sandboxwarmpool_types.go:24-119](extensions/api/v1beta1/sandboxwarmpool_types.go)

## Reconcile Pipeline

`Reconcile` itself is thin: fetch the CR, exit on deletion, snapshot status, run `reconcilePool`, then patch status via Server-Side Apply with field owner `warmpool-controller`. The work happens inside `reconcilePool`.

```
reconcilePool steps                                File reference
─────────────────────────────────────────────────  ───────────────────────────
1. NameHash(pool.Name)                  → label    controller.go:108-115
2. List Sandboxes by warmPoolSandboxLabel          controller.go:111-123
3. fetchTemplateAndHash                            controller.go:125-126,312-325
4. filterActiveSandboxes (delete stale + adopt)    controller.go:128-129,239-301
5. Garbage-collect sandboxes stuck non-Ready > 5m  controller.go:131-148
6. Write Replicas / ReadyReplicas / Selector       controller.go:159-169
7. Create (replicas < desired)                     controller.go:174-192
8. Delete (replicas > desired)                     controller.go:195-222
9. Join template error if not NotFound             controller.go:224-228
```

Two non-obvious properties: the template is fetched only once per reconcile (its hash is reused for staleness checks and label propagation), and a missing template (`IsNotFound`) is *swallowed* — the pool stops creating but does not report the missing template as a hard error, while every other template error is surfaced.
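
The asymmetric error handling can be sketched with sentinel errors. This is a minimal illustration: `errNotFound`, `errTransient`, `poolOutcome`, and `afterTemplateFetch` are hypothetical names, and `errors.Is` against a sentinel stands in for the `IsNotFound` check mentioned above.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for API server responses.
var (
	errNotFound        = errors.New("sandboxtemplate not found")
	errTransient       = errors.New("api server timeout")
	errWrappedNotFound = fmt.Errorf("fetch template: %w", errNotFound)
)

// poolOutcome summarizes what the rest of the reconcile pass may do.
type poolOutcome struct {
	canCreate bool  // whether new sandboxes may be created this pass
	err       error // error surfaced to the caller, if any
}

// afterTemplateFetch swallows a missing template (creation stops, no error
// is reported) and surfaces every other fetch error.
func afterTemplateFetch(fetchErr error) poolOutcome {
	switch {
	case fetchErr == nil:
		return poolOutcome{canCreate: true}
	case errors.Is(fetchErr, errNotFound):
		return poolOutcome{} // swallowed: no creation, no error
	default:
		return poolOutcome{err: fetchErr}
	}
}

func main() {
	for _, e := range []error{nil, errWrappedNotFound, errTransient} {
		o := afterTemplateFetch(e)
		fmt.Printf("fetchErr=%v canCreate=%t surfaced=%v\n", e, o.canCreate, o.err)
	}
}
```

Only surfaced errors would lead to a retry with backoff, so a pool pointing at a deleted template stays quiet until something requeues it, for example the template watch.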

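Sources: [extensions/controllers/sandboxwarmpool_controller.go:67-229](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:312-325](extensions/controllers/sandboxwarmpool_controller.go)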

### Status Patching

Status writes are skipped when `equality.Semantic.DeepEqual(oldStatus, &warmPool.Status)` holds; otherwise the status is applied as a typed SSA patch with `client.ForceOwnership`. The `nolint:staticcheck` annotation reflects that `client.Apply` is used without generated apply configurations.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:415-443](extensions/controllers/sandboxwarmpool_controller.go)

## Parallel Batch Creation and Deletion

Both create and delete loops are bounded by `MaxBatchSize`. The constructor defaults `MaxBatchSize` to `sandboxCreateDeleteMaxBatchSize = 300` if the caller passes a zero or negative value.

```go
// extensions/controllers/sandboxwarmpool_controller.go:534-537
if r.MaxBatchSize <= 0 {
    r.MaxBatchSize = sandboxCreateDeleteMaxBatchSize
}
```

`min(desiredDelta, maxBatchSize)` caps the per-reconcile delta; the remainder is left for subsequent reconciles. Inside the cap, work runs through `slowStartBatch`, which doubles the parallelism each successful round, starting at one:

```
batch sizes for count=14, initialBatchSize=1:
1 → 2 → 4 → 7  (the last batch is trimmed to `remaining`)
```

`TestSlowStartBatch` verifies this: the case "all succeed with batch trimming (count=14)" expects 14 successful calls across those four rounds, and "early exit on failure" expects the loop to stop after `1+2+4 = 7` calls when an injected error fires.

Failure semantics:

- An `errgroup.WithContext` collects the first error per batch; on failure the loop returns immediately without launching the next round.
- The returned `successes` includes the partial successes of the failed batch (via `batchSuccesses atomic.Int64`).
- Context cancellation is checked at the top of every round; mid-batch cancellations short-circuit subsequent rounds, as `TestSlowStartBatch` "context canceled in middle of batch" demonstrates.
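
The slow-start behavior described above can be sketched with the standard library alone. This is not the controller's implementation: it uses `sync.WaitGroup` and `sync.Once` in place of `errgroup.WithContext` (so the context is not cancelled on the first error), it also returns the round sizes for inspection, and `minInt` and `runBatches` are hypothetical helpers.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// slowStartBatch calls fn count times in rounds whose size doubles after
// each fully successful round. It returns the number of successful calls,
// the size of every round it launched, and the first error it saw.
func slowStartBatch(ctx context.Context, count, initialBatchSize int, fn func() error) (int, []int, error) {
	successes := 0
	var rounds []int
	remaining := count
	for batch := minInt(remaining, initialBatchSize); batch > 0; batch = minInt(2*batch, remaining) {
		if err := ctx.Err(); err != nil {
			return successes, rounds, err // cancellation is checked before every round
		}
		rounds = append(rounds, batch)
		var (
			wg             sync.WaitGroup
			batchSuccesses atomic.Int64
			once           sync.Once
			firstErr       error
		)
		for i := 0; i < batch; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				if err := fn(); err != nil {
					once.Do(func() { firstErr = err })
					return
				}
				batchSuccesses.Add(1)
			}()
		}
		wg.Wait()
		successes += int(batchSuccesses.Load()) // partial successes still count
		if firstErr != nil {
			return successes, rounds, firstErr // do not launch the next round
		}
		remaining -= batch
	}
	return successes, rounds, nil
}

// runBatches is a hypothetical driver: the failOnCall-th call (1-based,
// 0 means never) returns an injected error.
func runBatches(count, failOnCall int) (int, []int, error) {
	var calls atomic.Int64
	return slowStartBatch(context.Background(), count, 1, func() error {
		if int(calls.Add(1)) == failOnCall {
			return errors.New("injected failure")
		}
		return nil
	})
}

func main() {
	fmt.Println(runBatches(14, 0)) // all calls succeed
	fmt.Println(runBatches(14, 5)) // one call in the third round fails
}
```

Starting at one call means a systematically failing operation (for example a rejected create) costs a single request before the loop gives up, while a healthy one reaches full parallelism within a few rounds.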

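Sources: [extensions/controllers/sandboxwarmpool_controller.go:48-52](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:171-222](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:585-621](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller_test.go:1461-1537](extensions/controllers/sandboxwarmpool_controller_test.go)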

### Deletion Priority

When the pool is over-provisioned, `slices.SortFunc` sorts candidates so unready sandboxes are deleted before ready ones; within each group, newest first. This biases the pool toward keeping ready, settled capacity:

```go
// extensions/controllers/sandboxwarmpool_controller.go:201-211
slices.SortFunc(activeSandboxes, func(a, b sandboxv1beta1.Sandbox) int {
    aReady := isSandboxReady(&a)
    bReady := isSandboxReady(&b)
    if aReady != bReady {
        if aReady { return 1 }
        return -1
    }
    return b.CreationTimestamp.Compare(a.CreationTimestamp.Time)
})
```

`deletePoolSandbox` ignores `NotFound` so racing deletions inside a batch do not abort the rest.
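
The effect of the comparator is easiest to see on a small pool. The sketch below is a standalone illustration with simplified types: `poolSandbox`, `sortForDeletion`, and `deletionOrder` are hypothetical, and it uses `sort.SliceStable` with plain Unix seconds in place of `slices.SortFunc` and Kubernetes timestamps.

```go
package main

import (
	"fmt"
	"sort"
)

// poolSandbox is a hypothetical, minimal stand-in for a warm-pool Sandbox.
type poolSandbox struct {
	Name        string
	Ready       bool
	CreatedUnix int64 // creation time in Unix seconds
}

// sortForDeletion orders candidates so the front of the slice is deleted
// first: unready before ready, and newest first within each group.
func sortForDeletion(s []poolSandbox) {
	sort.SliceStable(s, func(i, j int) bool {
		if s[i].Ready != s[j].Ready {
			return !s[i].Ready // the unready one sorts earlier
		}
		return s[i].CreatedUnix > s[j].CreatedUnix // newer sorts earlier
	})
}

// deletionOrder returns the names of the first excess sandboxes to delete.
func deletionOrder(s []poolSandbox, excess int) []string {
	sorted := append([]poolSandbox(nil), s...) // leave the caller's slice untouched
	sortForDeletion(sorted)
	if excess > len(sorted) {
		excess = len(sorted)
	}
	names := make([]string, 0, excess)
	for _, sb := range sorted[:excess] {
		names = append(names, sb.Name)
	}
	return names
}

func main() {
	pool := []poolSandbox{
		{Name: "old-ready", Ready: true, CreatedUnix: 100},
		{Name: "new-ready", Ready: true, CreatedUnix: 300},
		{Name: "old-unready", Ready: false, CreatedUnix: 50},
		{Name: "new-unready", Ready: false, CreatedUnix: 200},
	}
	fmt.Println(deletionOrder(pool, 2)) // scale down by two
	fmt.Println(deletionOrder(pool, 4)) // full ordering
}
```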

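Sources: [extensions/controllers/sandboxwarmpool_controller.go:195-222](extensions/controllers/sandboxwarmpool_controller.go), [extensions/controllers/sandboxwarmpool_controller.go:405-413](extensions/controllers/sandboxwarmpool_controller.go)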

## Sandbox Construction

`buildSandboxCR` builds one blueprint per reconcile and the create loop deep-copies it for each replica. The blueprint:

- Carries three pool-identity labels: `warmPoolSandboxLabel` (hash of pool name), `sandboxTemplateRefHash` (hash of template name), `SandboxPodTemplateHashLabel` (hash of pod template JSON).
- Annotates the sandbox with `SandboxTemplateRefAnnotation` for metrics linkage.
- Propagates the same three labels into the pod template labels so platform informers can target the pods.
- Copies `VolumeClaimTemplates` from the template (verified by `TestCreatePoolSandboxPropagatesVolumeClaimTemplates`).
- Calls `ApplySandboxSecureDefaults` to set `AutomountServiceAccountToken: false` when unset and, under "Secure By Default" (`NetworkPolicyManagement` empty or `Managed` and `NetworkPolicy == nil`), pins `DNSPolicy: None` with public resolvers `8.8.8.8` and `1.1.1.1`.
- Sets the `SandboxWarmPool` as `OwnerReference` (controller=true) so deletions cascade.

Names are server-generated via `GenerateName: "<poolName>-"`, so the same blueprint can be created N times concurrently.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:327-403](), [extensions/controllers/utils.go:23-48](), [extensions/controllers/sandboxwarmpool_controller_test.go:459-648]()

## Staleness, Adoption, and Rollout

`filterActiveSandboxes` walks the listed sandboxes and partitions them into "kept" vs. "must delete or skip":

```mermaid
flowchart TD
    A[Sandbox candidate] --> B{DeletionTimestamp set?}
    B -- yes --> SKIP[skip]
    B -- no --> C{controllerRef?}
    C -- foreign --> SKIP2[ignore – belongs to other controller]
    C -- nil (orphan) --> D
    C -- this pool --> E
    D[Vet staleness via comparePodSpecs] --> F{stale?}
    E{strategy == Recreate?} -- yes --> G[Vet staleness]
    E -- no, OnReplenish --> KEEP
    G --> F
    F -- yes --> DEL[Delete sandbox]
    F -- no, orphan --> ADOPT[SetControllerReference + Update]
    F -- no, owned --> KEEP[append to activeSandboxes]
    ADOPT --> KEEP
```

Sources: [extensions/controllers/sandboxwarmpool_controller.go:239-301](), [extensions/controllers/sandboxwarmpool_controller.go:231-237]()

### isSandboxStale

`isSandboxStale` is layered for cost and security:

1. If `sandboxTemplateRefHash` label does not match `SandboxTemplateRefHash(template.Name)`, the sandbox is stale immediately — covers the "template ref rename" case proven by `TestReconcilePool_TemplateRefUpdate_SameSpec` (sandboxes are recreated even if the pod spec is identical).
2. Orphans always run the full semantic comparison (`comparePodSpecs`) regardless of the hash label, because an unowned sandbox could be carrying a spoofed hash with a mutated PodSpec. `TestIsSandboxStale_OrphanedSandboxVetting` exercises both the spoofed and genuine paths.
3. Otherwise, if `SandboxPodTemplateHashLabel` equals the freshly computed template hash, the sandbox is fresh.
4. If hash computation failed (`currentPodTemplateHash == ""`), the function returns *not stale* and logs, to avoid mass-deleting pods because of a transient marshal failure.
5. As a final fallback, `comparePodSpecs` runs and the verdict is memoized in `vettedHashes` so two sandboxes with the same drift hash do not pay the cost twice.

`comparePodSpecs` defends against false positives by re-applying `ApplySandboxSecureDefaults` to a copy of the template's pod spec before the `equality.Semantic.DeepEqual` — without that normalization, every fresh sandbox would appear stale because the controller injected `AutomountServiceAccountToken` or DNS fields the template did not set. `TestComparePodSpecsNormalization` pins these cases.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:462-531](), [extensions/controllers/sandboxwarmpool_controller_test.go:1194-1296](), [extensions/controllers/sandboxwarmpool_controller_test.go:1384-1459](), [extensions/controllers/sandboxwarmpool_controller_test.go:1027-1137]()
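The layering above can be sketched as a single decision function. This is an illustrative model only, not the controller's code: `candidate`, `staleChecker`, and the `SpecMatches` flag (standing in for the result of `comparePodSpecs`) are assumptions made so the example runs without Kubernetes types.

```go
package main

import "fmt"

// candidate is a hypothetical view of the fields the staleness check reads.
type candidate struct {
	TemplateRefHash string // value of the template-ref hash label
	PodTemplateHash string // value of the pod-template hash label
	Orphan          bool   // true when the sandbox has no controller owner
	SpecMatches     bool   // stand-in for the semantic pod-spec comparison
}

// staleChecker holds the per-reconcile state assumed by this sketch.
type staleChecker struct {
	wantRefHash  string
	wantPodHash  string          // "" models a failed hash computation
	vetted       map[string]bool // memoized verdicts keyed by drifted hash
	deepCompares int             // counts how often the expensive path ran
}

func newStaleChecker(refHash, podHash string) *staleChecker {
	return &staleChecker{wantRefHash: refHash, wantPodHash: podHash, vetted: map[string]bool{}}
}

func (c *staleChecker) isStale(s candidate) bool {
	// 1. A template-ref mismatch is stale immediately.
	if s.TemplateRefHash != c.wantRefHash {
		return true
	}
	// 2. Orphans always pay for the full comparison; labels are not trusted.
	if s.Orphan {
		c.deepCompares++
		return !s.SpecMatches
	}
	// 3. A matching pod-template hash means the sandbox is fresh.
	if s.PodTemplateHash == c.wantPodHash {
		return false
	}
	// 4. If hashing failed, err on the side of keeping the sandbox.
	if c.wantPodHash == "" {
		return false
	}
	// 5. Fall back to the deep comparison, memoized per drifted hash.
	if verdict, ok := c.vetted[s.PodTemplateHash]; ok {
		return verdict
	}
	c.deepCompares++
	stale := !s.SpecMatches
	c.vetted[s.PodTemplateHash] = stale
	return stale
}

func main() {
	c := newStaleChecker("ref-v2", "pod-v2")
	drifted := candidate{TemplateRefHash: "ref-v2", PodTemplateHash: "pod-v1"}
	fmt.Println(c.isStale(drifted), c.isStale(drifted), c.deepCompares)
}
```

The `deepCompares` counter exists only to make the memoization observable: two owned sandboxes carrying the same drifted hash trigger one deep comparison, while every orphan triggers its own.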

### Update Strategies

| Strategy | Behavior on template drift | Behavior on orphans |
|---|---|---|
| `Recreate` | Stale owned sandboxes are deleted every reconcile; pool replenishes at the next cycle. | Always vetted; stale orphans deleted. |
| `OnReplenish` (default, also fallback for unknown values) | Owned sandboxes keep their old spec until they are deleted (manually or by claim adoption); only then is the replacement created from the current template. | Orphans are still vetted and may be deleted. |

`TestReconcilePool_TemplateUpdateRollout` verifies both directions: after `image-v1 → image-v2`, `Recreate` flips every replica to `image-v2`, while `OnReplenish` keeps `image-v1` and only the replenishment sandbox observes `image-v2`. The same test asserts an empty `Spec.UpdateStrategy.Type` is treated as `OnReplenish`; the `switch` at controller.go:253-262 also defaults *unknown* values to `OnReplenish` with a log line.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:247-262](), [extensions/api/v1beta1/sandboxwarmpool_types.go:49-69](), [extensions/controllers/sandboxwarmpool_controller_test.go:858-1025]()
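The defaulting behaviour described above can be sketched as a small `switch`. The type and constant names below (`updateStrategy`, `strategyRecreate`, `strategyOnReplenish`, `effectiveStrategy`) are hypothetical; only the two strategy values and the fallback rule come from the text.

```go
package main

import "fmt"

// updateStrategy is a hypothetical stand-in for the pool's strategy type.
type updateStrategy string

const (
	strategyRecreate    updateStrategy = "Recreate"
	strategyOnReplenish updateStrategy = "OnReplenish"
)

// effectiveStrategy resolves the configured value to the strategy that is
// actually applied: empty and unknown values both fall back to OnReplenish.
func effectiveStrategy(configured updateStrategy) updateStrategy {
	switch configured {
	case strategyRecreate:
		return strategyRecreate
	case strategyOnReplenish, "":
		return strategyOnReplenish
	default:
		// The real controller is described as logging here; this sketch
		// only applies the fallback.
		return strategyOnReplenish
	}
}

func main() {
	for _, s := range []updateStrategy{"Recreate", "OnReplenish", "", "Bogus"} {
		fmt.Printf("%q -> %s\n", string(s), effectiveStrategy(s))
	}
}
```

Falling back to the conservative strategy means a typo in the spec can delay a rollout but cannot trigger a mass recreation of warm sandboxes.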

### Stuck-Sandbox Garbage Collection

Independent of staleness, any non-Ready sandbox older than the constant `warmPoolReadinessGracePeriod = 5 * time.Minute` is deleted in-line. This prevents image-pull or scheduling failures from monopolizing pool slots; `TestReconcilePoolGCStuckSandboxes` covers both the "delete after grace period" and "keep within grace period" cases.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:131-148](), [extensions/controllers/sandboxwarmpool_controller_test.go:755-856]()
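As a rough sketch, the stuck-sandbox rule reduces to a two-condition predicate. The names `readinessGracePeriod`, `isStuck`, and `demoNow` are illustrative, and the strict "older than" boundary is an assumption about how the controller compares the age against the five-minute constant.

```go
package main

import (
	"fmt"
	"time"
)

// readinessGracePeriod mirrors the five-minute constant named in the text.
const readinessGracePeriod = 5 * time.Minute

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// isStuck reports whether a sandbox should be garbage-collected: it must be
// non-Ready and strictly older than the grace period (boundary is assumed).
func isStuck(ready bool, created, now time.Time) bool {
	return !ready && now.Sub(created) > readinessGracePeriod
}

func main() {
	old := demoNow.Add(-6 * time.Minute)
	young := demoNow.Add(-2 * time.Minute)
	fmt.Println(isStuck(false, old, demoNow))   // unready and past the grace period
	fmt.Println(isStuck(false, young, demoNow)) // unready but still within grace
	fmt.Println(isStuck(true, old, demoNow))    // ready sandboxes are never stuck
}
```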

## Coordination with SandboxClaim

The warm pool reconciler does not actively hand sandboxes to claims. Instead, `SandboxClaimReconciler` runs a `sandboxEventHandler` against the same `Sandbox` watch and maintains an in-memory `WarmSandboxQueue` keyed by `sandboxTemplateRefHash`. The contract is enforced through `isAdoptable` and three pool-managed labels:

```go
// extensions/controllers/sandboxclaim_controller.go:1503-1519
if !candidate.DeletionTimestamp.IsZero() { ... }
if _, ok := candidate.Labels[warmPoolSandboxLabel]; !ok { ... }
if _, ok := candidate.Labels[sandboxTemplateRefHash]; !ok { ... }
if controllerRef != nil && controllerRef.Kind != "SandboxWarmPool" { ... }
```

When a sandbox transitions from non-adoptable to adoptable (or its template-ref hash changes), the handler adds the key to the queue; on Delete the key is purged. The claim reconciler pops keys from the queue, verifies the sandbox is still adoptable, and then `completeAdoption` strips the three pool labels and re-parents the sandbox to the claim:

```go
// extensions/controllers/sandboxclaim_controller.go:732-741
delete(adopted.Labels, warmPoolSandboxLabel)
delete(adopted.Labels, sandboxTemplateRefHash)
delete(adopted.Labels, v1beta1.SandboxPodTemplateHashLabel)
adopted.OwnerReferences = nil
if err := controllerutil.SetControllerReference(claim, adopted, r.Scheme); err != nil { ... }
```

Because the pool's `List` is gated on `warmPoolSandboxLabel`, removing that label is exactly what causes the next `reconcilePool` to treat the adoption as a missing replica and create a replacement — that is the entire "replenish" signal. For `WarmPoolPolicyNone` claims, the claim reconciler skips this path and creates a cold sandbox directly.

```mermaid
sequenceDiagram
    participant WPR as SandboxWarmPoolReconciler
    participant API as Kube API
    participant CEH as sandboxEventHandler
    participant Q as WarmSandboxQueue
    participant SCR as SandboxClaimReconciler

    WPR->>API: Create Sandbox (labels: warmPool, refHash, podHash; owner=WarmPool)
    API-->>CEH: CreateEvent
    CEH->>Q: Add(refHash, key) if adoptable
    SCR->>Q: Get(refHash) → key
    SCR->>API: Patch Sandbox (drop labels, owner=Claim)
    API-->>WPR: Watch event (Sandbox no longer matches selector)
    WPR->>API: List → currentReplicas < desired → Create replacement
```

Sources: [extensions/controllers/sandboxclaim_controller.go:591-746](), [extensions/controllers/sandboxclaim_controller.go:1446-1540](), [extensions/controllers/sandboxclaim_controller.go:1155-1180](), [extensions/controllers/sandboxwarmpool_controller.go:111-123]()

## Failure Modes and Error Aggregation

Errors are accumulated with `errors.Join` rather than short-circuiting. Stale-delete failures, adoption failures, stuck-GC failures, batch-create failures, and batch-delete failures all surface together at the end of `reconcilePool`, and `Reconcile` returns the joined error so controller-runtime requeues with backoff.

The one swallowed error is `IsNotFound` on the template fetch: the controller logs, refuses to *create* (`tmplErr == nil` gates `buildSandboxCR`), but still updates status and may continue to *delete* excess sandboxes. Other template errors are joined into the returned error.

Sources: [extensions/controllers/sandboxwarmpool_controller.go:128-229](), [extensions/controllers/sandboxwarmpool_controller.go:445-460]()
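The accumulate-then-return pattern can be illustrated with the standard library alone. This is a sketch of the general technique, not the controller's code; `runAll`, `demoReconcile`, and the two sentinel errors are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for individual step failures.
var (
	errStaleDelete = errors.New("stale delete failed")
	errBatchCreate = errors.New("batch create failed")
)

// runAll executes every step even when an earlier one failed, and returns
// how many steps ran together with all failures joined into one error.
func runAll(steps ...func() error) (int, error) {
	var errs []error
	ran := 0
	for _, step := range steps {
		ran++
		if err := step(); err != nil {
			errs = append(errs, err)
		}
	}
	// errors.Join returns nil when no non-nil errors were collected.
	return ran, errors.Join(errs...)
}

// demoReconcile models a reconcile in which the first and third steps fail.
func demoReconcile() (int, error) {
	return runAll(
		func() error { return errStaleDelete },
		func() error { return nil },
		func() error { return errBatchCreate },
	)
}

func main() {
	ran, err := demoReconcile()
	fmt.Println(ran, errors.Is(err, errStaleDelete), errors.Is(err, errBatchCreate))
}
```

Because the joined error still satisfies `errors.Is` for each underlying failure, a caller can return it as-is and let the requeue machinery retry while logs retain every cause.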

## Summary

The `SandboxWarmPool` reconciler is a small set-reconciliation loop with three notable properties. First, `slowStartBatch` provides bounded-parallel batching that doubles concurrency on success and stops on the first error. Second, a layered staleness check combines cheap label-hash comparisons with a memoized semantic-equality fallback, normalized through `ApplySandboxSecureDefaults` to suppress drift caused by the controller's own injected fields. Third, coordination with `SandboxClaim` relies entirely on label and ownership transitions: the warm pool produces labelled sandboxes, the claim reconciler watches the same objects and strips the labels on adoption, and the resulting "missing replica" reconcile is what drives replenishment.

---

## 15. Warm Sandbox Queue

> The in-memory queue shared between the warm-pool and claim reconcilers that hands off warm sandboxes to incoming claims.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/15-warm-sandbox-queue.md
- Generated: 2026-05-25T22:38:36.026Z

### Source Files

- `extensions/controllers/queue/simple_sandbox_queue.go`
- `extensions/controllers/queue/simple_sandbox_queue_test.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [extensions/controllers/queue/simple_sandbox_queue.go](extensions/controllers/queue/simple_sandbox_queue.go)
- [extensions/controllers/queue/simple_sandbox_queue_test.go](extensions/controllers/queue/simple_sandbox_queue_test.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
</details>

# Warm Sandbox Queue

The Warm Sandbox Queue is the in-process handoff structure that lets the `SandboxClaim` controller skip cold-starting a `Sandbox` whenever a `SandboxWarmPool` has already pre-provisioned an adoptable one for the same template. It is a single in-memory data structure, owned by the controller manager process, partitioned by template hash, and shared between the controller's `Sandbox` watch (producer) and the `SandboxClaim` reconciler (consumer).

This page covers the public interface, the underlying `sync.Map`-of-FIFOs implementation, how items enter and leave the queue, the concurrency and "Ghost Pod" semantics that drive the design, and how the queue is wired into the controller manager. The queue lives entirely in memory; nothing is persisted, so it is rebuilt from `Watches` on each controller restart.

## Role in the warm-pool handoff

There is no Kubernetes work-queue or message bus between the warm pool and claim sides — the queue is just a Go object held by `SandboxClaimReconciler.WarmSandboxQueue` and populated by event handlers registered on the same controller. The producer side is a `Watches(&v1beta1.Sandbox{}, &sandboxEventHandler{...})` registration that observes every `Sandbox` create/update/delete in the cluster; the consumer side is `getCandidate`, which is called from the claim reconcile path before any cold-start logic runs.

```text
SandboxWarmPool reconciler          SandboxClaim reconciler
        creates                            reconciles
           |                                    |
           v                                    v
   +-----------------+               +----------------------+
   | Sandbox object  | -- watch -->  | sandboxEventHandler  |
   | (adoptable)     |               |   .Update -> Add()   |
   +-----------------+               +----------+-----------+
                                                |
                                                v
                                  +-------------------------------+
                                  |  WarmSandboxQueue (in-mem)    |
                                  |   templateHash -> FIFO+dedup  |
                                  +-------------------------------+
                                                ^
                                                |
                                         Get() / RemoveItem()
                                                |
                                   +------------+--------------+
                                   |  getCandidate /            |
                                   |  templateEventHandler      |
                                   +----------------------------+
```

Sources: [extensions/controllers/sandboxclaim_controller.go:1286-1298](), [extensions/controllers/sandboxclaim_controller.go:591-646]()

## The `SandboxQueue` interface

The queue is defined as a four-method interface so the reconciler depends on behavior, not the concrete map type. `SandboxKey` is a defined type whose underlying type is `types.NamespacedName` (not a Go type alias), so a key uniquely identifies a `Sandbox` resource by namespace and name.

| Method | Signature | Purpose |
|--------|-----------|---------|
| `Add` | `Add(templateHash string, item SandboxKey)` | Push an adoptable sandbox onto the per-template FIFO; idempotent on duplicate keys. |
| `Get` | `Get(templateHash string) (SandboxKey, bool)` | Pop the front item; `ok=false` when the queue is empty or absent. |
| `RemoveItem` | `RemoveItem(templateHash string, item SandboxKey)` | Eagerly drop a specific entry; used when a sandbox is deleted from the cluster. |
| `RemoveQueue` | `RemoveQueue(templateHash string)` | Drop an entire per-template queue; used when the owning `SandboxTemplate` is deleted. |

```go
// extensions/controllers/queue/simple_sandbox_queue.go
type SandboxKey types.NamespacedName

type SandboxQueue interface {
    Add(templateHash string, item SandboxKey)
    Get(templateHash string) (SandboxKey, bool)
    RemoveQueue(templateHash string)
    RemoveItem(templateHash string, item SandboxKey)
}
```

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:23-33]()

## `SimpleSandboxQueue` implementation

`SimpleSandboxQueue` is the only implementation. It composes two layers of concurrency primitives:

- An outer `sync.Map` named `queues`, keyed by `templateHash` (string), valued by `*synchronizedQueue`. This map handles the high-churn read path of "find or create the per-template queue" without a global lock.
- An inner `synchronizedQueue` per template, which holds an ordered `[]SandboxKey` slice and a `map[SandboxKey]struct{}` set guarded by a single `sync.Mutex`. The slice gives FIFO ordering for fair adoption; the set gives O(1) deduplication so the same sandbox cannot be queued twice.

```go
// extensions/controllers/queue/simple_sandbox_queue.go
type SimpleSandboxQueue struct {
    queues sync.Map  // templateHash -> *synchronizedQueue
}

type synchronizedQueue struct {
    mu    sync.Mutex
    items []SandboxKey
    set   map[SandboxKey]struct{} // Used for O(1) deduplication
}
```

The hash key itself is computed by `SandboxTemplateRefHash`, which delegates to `sandboxcontrollers.NameHash(templateRefName)`. So all consumers and producers agree on partitioning by template name hash, not by template UID.

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:35-107](), [extensions/controllers/utils.go:50-58]()

### Push and Pop semantics

`Push` is gated on the dedup set: if the key already exists, the call is a no-op. Otherwise it is appended to the tail and inserted into the set. `Pop` removes the head element, clears the slot to release the underlying string allocations to the garbage collector, and reslices.

```go
// extensions/controllers/queue/simple_sandbox_queue.go
func (q *synchronizedQueue) Push(key SandboxKey) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if _, exists := q.set[key]; !exists {
        q.set[key] = struct{}{}
        q.items = append(q.items, key)
    }
}

func (q *synchronizedQueue) Pop() (SandboxKey, bool) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == 0 {
        return SandboxKey{}, false
    }
    item := q.items[0]
    q.items[0] = SandboxKey{} // release references to the GC
    q.items = q.items[1:]
    delete(q.set, item)
    return item, true
}
```

`Add` and `Get` on `SimpleSandboxQueue` wrap these calls with a `LoadOrStore` / `Load` against the outer `sync.Map`. `Add` lazily creates a per-template queue; `Get` returns `false` cleanly if no queue has ever been created for the hash. The basic FIFO contract is verified by `TestSimpleSandboxQueue_BasicOperations` and dedup by `TestSynchronizedQueue_Deduplication`.

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:47-58](), [extensions/controllers/queue/simple_sandbox_queue.go:109-140](), [extensions/controllers/queue/simple_sandbox_queue_test.go:20-47](), [extensions/controllers/queue/simple_sandbox_queue_test.go:98-116]()
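The `LoadOrStore` / `Load` wrapping described above can be sketched as follows. This is a simplified model under stated assumptions: `SandboxKey` is redeclared here as a plain struct instead of the `types.NamespacedName`-based type, `newSynchronizedQueue` is a hypothetical constructor, and the real `Add` may avoid allocating a fresh queue on every call.

```go
package main

import (
	"fmt"
	"sync"
)

// SandboxKey is a stand-in with the same two fields as a namespaced name.
type SandboxKey struct{ Namespace, Name string }

type synchronizedQueue struct {
	mu    sync.Mutex
	items []SandboxKey
	set   map[SandboxKey]struct{}
}

// newSynchronizedQueue is a hypothetical constructor for this sketch.
func newSynchronizedQueue() *synchronizedQueue {
	return &synchronizedQueue{set: make(map[SandboxKey]struct{})}
}

func (q *synchronizedQueue) Push(key SandboxKey) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, exists := q.set[key]; !exists {
		q.set[key] = struct{}{}
		q.items = append(q.items, key)
	}
}

func (q *synchronizedQueue) Pop() (SandboxKey, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return SandboxKey{}, false
	}
	item := q.items[0]
	q.items[0] = SandboxKey{}
	q.items = q.items[1:]
	delete(q.set, item)
	return item, true
}

type SimpleSandboxQueue struct {
	queues sync.Map // templateHash -> *synchronizedQueue
}

// Add lazily creates the per-template queue, then pushes with dedup.
func (s *SimpleSandboxQueue) Add(templateHash string, item SandboxKey) {
	q, _ := s.queues.LoadOrStore(templateHash, newSynchronizedQueue())
	q.(*synchronizedQueue).Push(item)
}

// Get pops the head of the per-template queue, or reports false when the
// queue is empty or was never created for this hash.
func (s *SimpleSandboxQueue) Get(templateHash string) (SandboxKey, bool) {
	q, ok := s.queues.Load(templateHash)
	if !ok {
		return SandboxKey{}, false
	}
	return q.(*synchronizedQueue).Pop()
}

func main() {
	var q SimpleSandboxQueue
	q.Add("h1", SandboxKey{Namespace: "ns", Name: "a"})
	q.Add("h1", SandboxKey{Namespace: "ns", Name: "a"}) // duplicate, ignored
	q.Add("h1", SandboxKey{Namespace: "ns", Name: "b"})
	first, _ := q.Get("h1")
	second, _ := q.Get("h1")
	_, ok := q.Get("h1")
	fmt.Println(first.Name, second.Name, ok)
}
```

The zero value of `SimpleSandboxQueue` is usable in this sketch because `sync.Map` needs no initialization; the per-template queues are the only part that requires a constructor.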

### Ghost Pod removal

`RemoveItem` exists specifically to keep the in-memory queue consistent with cluster state when a `Sandbox` is deleted out from under it (the controller calls these "Ghost Pods"). It scans the slice for the key, shifts subsequent entries left, and explicitly zeroes the tail slot so the removed `SandboxKey` cannot linger in the backing array. The dedup set is updated in lockstep, so a future `Add` for the same key after removal will succeed.

```go
// extensions/controllers/queue/simple_sandbox_queue.go
func (q *synchronizedQueue) Remove(key SandboxKey) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if _, exists := q.set[key]; !exists {
        return
    }
    delete(q.set, key)
    for i, k := range q.items {
        if k == key {
            last := len(q.items) - 1
            copy(q.items[i:], q.items[i+1:])
            q.items[last] = SandboxKey{}
            q.items = q.items[:last]
            break
        }
    }
}
```

`TestSimpleSandboxQueue_RemoveItem_GhostPodFix` asserts both the post-removal ordering and that every unused slot in the backing array, from `len(items)` up to `cap(items)`, holds the zero value, preventing the slice from retaining references to deleted sandboxes.

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:62-91](), [extensions/controllers/queue/simple_sandbox_queue_test.go:49-96]()
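The backing-array behaviour is easy to miss, so here is a minimal sketch that isolates the shift-and-zero step and inspects the slots beyond `len`. The free function `removeKey` and the plain-struct `SandboxKey` are illustrative stand-ins; the real method also holds the mutex and updates the dedup set.

```go
package main

import "fmt"

// SandboxKey is a stand-in with the same two fields as a namespaced name.
type SandboxKey struct{ Namespace, Name string }

// removeKey deletes the first occurrence of key, shifting later entries left
// and zeroing the vacated tail slot so the backing array keeps no stale copy.
func removeKey(items []SandboxKey, key SandboxKey) []SandboxKey {
	for i, k := range items {
		if k == key {
			last := len(items) - 1
			copy(items[i:], items[i+1:])
			items[last] = SandboxKey{}
			return items[:last]
		}
	}
	return items
}

func main() {
	items := make([]SandboxKey, 0, 4)
	items = append(items,
		SandboxKey{Namespace: "ns", Name: "a"},
		SandboxKey{Namespace: "ns", Name: "b"},
		SandboxKey{Namespace: "ns", Name: "c"},
	)
	items = removeKey(items, SandboxKey{Namespace: "ns", Name: "b"})

	// Reslice up to capacity to look at the slots past len(items).
	backing := items[:cap(items)]
	fmt.Println(len(items), backing)
}
```

Without the explicit zeroing, the slot at the old tail index would still hold a copy of the last key after the `copy`, which is exactly the kind of lingering reference the test guards against.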

### Queue lifecycle and memory leaks

`RemoveQueue` deletes the entire per-template entry from the outer `sync.Map`. It is invoked when a `SandboxTemplate` is deleted, after which no claim will ever again ask for that hash. The comment on `synchronizedQueue` flags an open TODO to also remove queues when their `SandboxWarmPool` is deleted; today, the deletion is template-driven only.

```go
// extensions/controllers/queue/simple_sandbox_queue.go
// TODO(vicentefb): Implement queue cleanup mechanism.
// We should remove the queue from the sync.Map when the corresponding
// SandboxWarmPool for a given template is deleted to prevent memory leaks.
func (s *SimpleSandboxQueue) RemoveQueue(templateHash string) {
    s.queues.Delete(templateHash)
}
```

The `TestSimpleSandboxQueue_RemoveQueue_MemoryLeakFix` test pins the contract: after `RemoveQueue`, `Get` for the same hash returns `false`.

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:93-100](), [extensions/controllers/queue/simple_sandbox_queue.go:142-146](), [extensions/controllers/queue/simple_sandbox_queue_test.go:118-133]()

## Producers: who calls `Add` and `RemoveItem`

The queue does not poll the API server. It is filled and pruned by two event handlers attached to `SandboxClaimReconciler` via `Watches`:

| Source event | Handler | Action |
|--------------|---------|--------|
| `Sandbox` Create / Update (transitions to adoptable, or template-hash label change while adoptable) | `sandboxEventHandler.Update` | `Add(templateHash, key)` |
| `Sandbox` Delete | `sandboxEventHandler.Delete` | `RemoveItem(templateHash, key)` |
| `SandboxTemplate` Delete | `templateEventHandler.Delete` | `RemoveQueue(templateHash)` |

`isAdoptable` defines what "adoptable" means: not being deleted, owned by a `SandboxWarmPool` (or unowned), and labeled with both `warmPoolSandboxLabel` and the `sandboxTemplateRefHash` label. The hash used as the queue key is `newSandbox.Labels[sandboxTemplateRefHash]`, i.e. it comes from the live label set rather than being recomputed.

```go
// extensions/controllers/sandboxclaim_controller.go
if (!oldAdoptable && newAdoptable) || (newAdoptable && hashChanged) {
    key := queue.SandboxKey{Namespace: newSandbox.Namespace, Name: newSandbox.Name}
    h.sandboxQueue.Add(newSandbox.Labels[sandboxTemplateRefHash], key)
}
```

Sources: [extensions/controllers/sandboxclaim_controller.go:1446-1481](), [extensions/controllers/sandboxclaim_controller.go:1503-1541](), [extensions/controllers/sandboxclaim_controller.go:1543-1566]()
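The adoptability rule and the enqueue transition can be modelled together in a few lines. Everything named here is a simplification: `sandboxView` is a hypothetical projection of the object, and the two label keys are placeholder strings, since the actual values behind `warmPoolSandboxLabel` and `sandboxTemplateRefHash` are not shown in this page.

```go
package main

import "fmt"

// Placeholder label keys; the real constant values are not shown in the text.
const (
	warmPoolLabel    = "example.test/warm-pool"
	templateRefLabel = "example.test/template-ref-hash"
)

// sandboxView is a hypothetical projection of the fields the checks read.
type sandboxView struct {
	Deleting       bool
	Labels         map[string]string
	ControllerKind string // "" means the sandbox has no controller owner
}

// isAdoptable models the rule: not deleting, both labels present, and either
// unowned or controlled by a SandboxWarmPool.
func isAdoptable(s sandboxView) bool {
	if s.Deleting {
		return false
	}
	if _, ok := s.Labels[warmPoolLabel]; !ok {
		return false
	}
	if _, ok := s.Labels[templateRefLabel]; !ok {
		return false
	}
	return s.ControllerKind == "" || s.ControllerKind == "SandboxWarmPool"
}

// shouldEnqueue models the update-handler condition: enqueue when the object
// becomes adoptable, or stays adoptable while its template hash changes.
func shouldEnqueue(oldS, newS sandboxView) bool {
	oldOK, newOK := isAdoptable(oldS), isAdoptable(newS)
	hashChanged := oldS.Labels[templateRefLabel] != newS.Labels[templateRefLabel]
	return (!oldOK && newOK) || (newOK && hashChanged)
}

func main() {
	labels := map[string]string{warmPoolLabel: "p", templateRefLabel: "h1"}
	before := sandboxView{Labels: labels, ControllerKind: "SandboxClaim"}
	after := sandboxView{Labels: labels, ControllerKind: "SandboxWarmPool"}
	fmt.Println(shouldEnqueue(before, after), shouldEnqueue(after, after))
}
```

An update that leaves an already-adoptable sandbox unchanged does not enqueue again in this model, and even if it did, the queue's dedup set would absorb the duplicate.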

## Consumer: `getCandidate` and the skip-list pattern

The claim reconciler pulls from the queue inside `getCandidate`. The shape of the loop is the most important behavior to understand because it explains why the queue tolerates stale entries:

1. Pop a key. If the queue is empty, return a nil sandbox, a zero-value key, and a nil error (`nil, queue.SandboxKey{}, nil`); the warm pool is exhausted and the caller falls back to cold-start.
2. `Get` the corresponding `Sandbox` from the informer cache. If it is `NotFound`, the entry was a Ghost Pod and is simply skipped (no re-queue). On any other error, the key is `Add`ed back and the error is returned.
3. Run `verifySandboxCandidate`: rejects wrong-namespace, deleted, mislabeled, or wrong-template-hash sandboxes. Cross-namespace matches are added to a local `skipped` list (so they go back into the queue via the `defer`); other rejections are dropped.
4. If the claim asks for a specific warm pool, sandboxes from a different pool are also pushed onto the local `skipped` list and the loop continues.
5. On success, returns the chosen `Sandbox` and its key. A `defer` re-queues anything on `skipped` so other claims can still adopt those entries.

```go
// extensions/controllers/sandboxclaim_controller.go
var skipped []queue.SandboxKey
defer func() {
    for _, key := range skipped {
        r.WarmSandboxQueue.Add(templateHash, key)
    }
}()
for {
    adoptedKey, ok := r.WarmSandboxQueue.Get(templateHash)
    if !ok {
        return nil, queue.SandboxKey{}, nil
    }
    ...
    if k8errors.IsNotFound(err) {
        continue // Ghost Pod: silently drop
    }
    ...
}
```

The adoption finalization path in `adoptSandboxFromCandidates` also re-queues the candidate on `Update` or `completeAdoption` failures (`r.WarmSandboxQueue.Add(templateHash, adoptedKey)`), so a transient API conflict does not permanently remove a usable warm sandbox.

Sources: [extensions/controllers/sandboxclaim_controller.go:591-646](), [extensions/controllers/sandboxclaim_controller.go:648-699](), [extensions/controllers/sandboxclaim_controller.go:1487-1519]()
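The pop loop with its deferred re-queue can be sketched without any Kubernetes dependencies. This is a simplified model under stated assumptions: `fifo` replaces the per-template queue, a `map` lookup replaces the informer-cache `Get`, and `pickCandidate` only models the Ghost Pod, cross-namespace, and wrong-pool branches (error handling and the other rejection reasons are omitted).

```go
package main

import "fmt"

type key struct{ Namespace, Name string }

// sandbox is a hypothetical projection of the fields the loop inspects.
type sandbox struct{ Namespace, Name, Pool string }

// fifo is a minimal single-template stand-in for the warm sandbox queue.
type fifo struct{ items []key }

func (f *fifo) Add(k key) { f.items = append(f.items, k) }

func (f *fifo) Get() (key, bool) {
	if len(f.items) == 0 {
		return key{}, false
	}
	k := f.items[0]
	f.items = f.items[1:]
	return k, true
}

// pickCandidate pops keys until it finds a usable sandbox. Entries that are
// valid but unsuitable for this claim are parked on skipped and re-queued by
// the deferred function; entries missing from the cluster are dropped.
func pickCandidate(q *fifo, cluster map[key]sandbox, claimNamespace, wantPool string) *sandbox {
	var skipped []key
	defer func() {
		for _, k := range skipped {
			q.Add(k)
		}
	}()
	for {
		k, ok := q.Get()
		if !ok {
			return nil // queue exhausted: caller falls back to cold-start
		}
		sb, exists := cluster[k]
		if !exists {
			continue // Ghost Pod: drop without re-queueing
		}
		if sb.Namespace != claimNamespace {
			skipped = append(skipped, k)
			continue
		}
		if wantPool != "" && sb.Pool != wantPool {
			skipped = append(skipped, k)
			continue
		}
		return &sb
	}
}

func main() {
	ghost := key{"team-a", "ghost"}
	otherNS := key{"team-b", "other-ns"}
	wrongPool := key{"team-a", "wrong-pool"}
	good := key{"team-a", "good"}
	cluster := map[key]sandbox{
		otherNS:   {"team-b", "other-ns", "pool-1"},
		wrongPool: {"team-a", "wrong-pool", "pool-2"},
		good:      {"team-a", "good", "pool-1"},
	}
	q := &fifo{items: []key{ghost, otherNS, wrongPool, good}}
	picked := pickCandidate(q, cluster, "team-a", "pool-1")
	fmt.Println(picked.Name, len(q.items))
}
```

Parking skipped keys in a local slice, instead of re-adding them immediately, is what keeps the loop from popping the same unsuitable entry forever within a single call.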

## Wiring in the controller manager

A single `SimpleSandboxQueue` is constructed at program start and shared by reference. Because both the producer event handlers and the consumer reconciler hold the same `queue.SandboxQueue` value, there is no need for serialization or fan-out — the in-memory pointer is the integration point.

```go
// cmd/agent-sandbox-controller/main.go
if extensions {
    warmSandboxQueue := queue.NewSimpleSandboxQueue()
    if err = (&extensionscontrollers.SandboxClaimReconciler{
        Client:           mgr.GetClient(),
        Scheme:           mgr.GetScheme(),
        WarmSandboxQueue: warmSandboxQueue,
        ...
    }).SetupWithManager(mgr, sandboxClaimConcurrentWorkers); err != nil {
        ...
    }
}
```

Inside `SetupWithManager`, the same `r.WarmSandboxQueue` is passed into both `&sandboxEventHandler{sandboxQueue: r.WarmSandboxQueue}` and `&templateEventHandler{sandboxQueue: r.WarmSandboxQueue}`. The queue is therefore one Go value shared across N reconciler workers (`MaxConcurrentReconciles`), which is exactly why the inner mutex and outer `sync.Map` matter.

Because the queue is in-memory only, a controller restart starts with an empty queue. After the restart, the informer's initial sync delivers a `Create` event for every existing `Sandbox`, so the watch-driven producer handlers re-populate the queue with any sandboxes that are already adoptable.

Sources: [cmd/agent-sandbox-controller/main.go:246-257](), [extensions/controllers/sandboxclaim_controller.go:1286-1298]()

## Failure modes the queue is designed for

| Failure mode | How the queue copes |
|--------------|--------------------|
| Same `Sandbox` observed multiple times by the informer | `Push` is a no-op on dedup set hits, so the queue holds at most one entry per key. |
| `Sandbox` deleted while still in the queue ("Ghost Pod") | `RemoveItem` from the delete handler eagerly prunes; if missed, `getCandidate` silently drops `NotFound` entries on pop. |
| Adoption fails mid-flight (API conflict, transient error) | The reconciler `Add`s the key back, preserving warm-pool capacity. |
| Cross-namespace or wrong-pool candidate found | Held in a local `skipped` slice and re-queued via `defer` so other claims can still adopt it. |
| `SandboxTemplate` deleted | `templateEventHandler.Delete` calls `RemoveQueue`, wiping the entire per-template queue from the `sync.Map`. |
| `SandboxWarmPool` deleted (no template deletion) | Not handled today — see the `TODO(vicentefb)` on `synchronizedQueue`. The per-template queue would persist with stale or never-popped entries until the template itself is deleted. |
| Controller restart | Queue is empty; producer event handlers refill it as the informer replays `Sandbox` state. |

Sources: [extensions/controllers/queue/simple_sandbox_queue.go:93-99](), [extensions/controllers/sandboxclaim_controller.go:603-645](), [extensions/controllers/sandboxclaim_controller.go:1521-1566]()

## Summary

The Warm Sandbox Queue is a deliberately small piece of infrastructure: a `sync.Map` of per-template FIFOs with O(1) deduplication, four methods on its `SandboxQueue` interface, and two short event handlers that keep it in step with the cluster's `Sandbox` and `SandboxTemplate` state. The cleverness lives in the consumer protocol — Ghost Pod tolerance, the `skipped` defer-requeue pattern, and explicit re-`Add` on adoption errors — which together let the queue stay non-authoritative: it is an opportunistic cache of hints about who is adoptable right now, and any inconsistency is reconciled the next time `getCandidate` runs.

---

## 16. Lifecycle & Expiry Logic

> Shared expiry helpers used by Sandbox and SandboxClaim controllers to compute shutdown times, requeue durations, and policy-driven cleanup.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/16-lifecycle-expiry-logic.md
- Generated: 2026-05-25T22:39:58.855Z

### Source Files

- `internal/lifecycle/expiry.go`
- `internal/lifecycle/expiry_test.go`
- `api/v1beta1/sandbox_types.go`
- `extensions/api/v1beta1/sandboxclaim_types.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [internal/lifecycle/expiry.go](internal/lifecycle/expiry.go)
- [internal/lifecycle/expiry_test.go](internal/lifecycle/expiry_test.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
</details>

# Lifecycle & Expiry Logic

This page documents how `Sandbox` and `SandboxClaim` resources reach the **expired** state, who computes that decision, and what shape the resulting reconciler behaviour takes (status updates, child-resource cleanup, requeue durations, and policy-driven deletion). The single source of truth for the math is `internal/lifecycle/expiry.go`, a small pure-function package that combines an absolute `shutdownTime` with the relative `ttlSecondsAfterFinished` timer and returns either an expiry verdict or a duration to wait.

The package exists to give controllers one consistent answer to *"is this resource expired, and if not, when should I look again?"* without duplicating the policy ordering between an explicit deadline and a finish-relative timer. In the current codebase the `SandboxClaim` controller is the only direct consumer of the helpers; the core `Sandbox` controller still keeps an inline `checkSandboxExpiry` that handles its narrower spec (`shutdownTime` only). Both controllers share the same vocabulary (the `Finished` condition, the `Expired` reason, the `ShutdownPolicy` enum) so the behaviour reads consistently from the outside, even though the implementations are split.

## Domain Model

`Lifecycle` is embedded into the spec of both CRDs but with different fields. The asymmetry is what motivates the shared helper: `SandboxClaim` carries an extra TTL timer that needs combining with the absolute deadline.

| Field | `Sandbox.spec.lifecycle` | `SandboxClaim.spec.lifecycle` |
| --- | --- | --- |
| `shutdownTime` (`*metav1.Time`) | yes | yes |
| `ttlSecondsAfterFinished` (`*int32`, min 0) | — | yes |
| `shutdownPolicy` (enum) | `Delete`, `Retain` (default `Retain`) | `Delete`, `DeleteForeground`, `Retain` (default `Retain`) |

The `Sandbox` `Lifecycle` struct is inlined into `SandboxSpec` at [api/v1beta1/sandbox_types.go:180-192](api/v1beta1/sandbox_types.go); the `SandboxClaim` variant adds `TTLSecondsAfterFinished` at [extensions/api/v1beta1/sandboxclaim_types.go:77-99](extensions/api/v1beta1/sandboxclaim_types.go) and exposes a third `DeleteForeground` policy that blocks claim removal until the Sandbox and Pod have actually terminated.

The terminal signal is the `Finished` condition (type `SandboxConditionFinished`). The core Sandbox controller raises it when the backing Pod reaches `PodSucceeded` or `PodFailed` (`computeFinishedCondition` at [controllers/sandbox_controller.go:394-417](controllers/sandbox_controller.go)). The Claim controller mirrors that condition onto the claim via `syncFinishedCondition` at [extensions/controllers/sandboxclaim_controller.go:562-576](extensions/controllers/sandboxclaim_controller.go), so the TTL timer measured by the shared helper reads the mirrored `LastTransitionTime` rather than reaching across into the Sandbox object.

Sources: [api/v1beta1/sandbox_types.go:46-54,168-192](api/v1beta1/sandbox_types.go), [extensions/api/v1beta1/sandboxclaim_types.go:57-99](extensions/api/v1beta1/sandboxclaim_types.go), [controllers/sandbox_controller.go:394-417](controllers/sandbox_controller.go), [extensions/controllers/sandboxclaim_controller.go:562-576](extensions/controllers/sandboxclaim_controller.go)

## The `internal/lifecycle` Package

Five small helpers compose into a single `TimeLeft` answer. Each is a pure function; nothing reads from the cluster, so they are trivially testable and free of side effects.

```text
                shutdownTime (absolute) ──┐
                                          ▼
ttlSecondsAfterFinished ─► NeedsCleanup ─► ExpireAt ─► TimeLeft ─► (expired, requeueAfter)
        ▲                          ▲
        └──────── FinishedCondition (terminal Finished=True with LastTransitionTime)
```

### `FinishedCondition`

Returns the terminal condition only when it is both present and `Status == True`; otherwise `nil`. This keeps callers from accidentally treating a `Finished=False` placeholder as a finish event.

```go
// internal/lifecycle/expiry.go
func FinishedCondition(conditions []metav1.Condition, conditionType string) *metav1.Condition {
    condition := meta.FindStatusCondition(conditions, conditionType)
    if condition == nil || condition.Status != metav1.ConditionTrue {
        return nil
    }
    return condition
}
```

### `NeedsCleanup` and `FinishedTime`

`NeedsCleanup` is a guard: TTL applies only when both the spec field and the terminal condition exist. `FinishedTime` extracts the `LastTransitionTime` as the TTL start, defending against a zero timestamp.

### `ExpireAt` — the policy ordering

`ExpireAt` returns the **earliest** of the two configured deadlines. If only `shutdownTime` is set, that's the answer. If only the TTL is configured and the resource has finished, the TTL deadline (`finishedAt + ttlSeconds`) is the answer. When both are set, the function picks whichever fires first:

```go
// internal/lifecycle/expiry.go
ttlExpireAt := finishedAt.Add(time.Duration(*ttlSecondsAfterFinished) * time.Second)
if expireAt == nil || ttlExpireAt.Before(*expireAt) {
    expireAt = &ttlExpireAt
}
```

This ordering is exercised by the table tests at [internal/lifecycle/expiry_test.go:37-93](internal/lifecycle/expiry_test.go): a `120s` TTL beats a five-minute shutdown (TTL fires at +90s from `now`), while a five-second shutdown beats the same `120s` TTL.

### `TimeLeft` — the controller-facing API

`TimeLeft(now, shutdownTime, ttl, finishedCondition)` returns `(expired bool, requeueAfter time.Duration)`. A `nil` expiry collapses to `(false, 0)`, so a Sandbox with no lifecycle configured never reports as expired and never asks for a timed requeue. When `now` has reached or passed the deadline it returns `(true, 0)`; otherwise the remaining duration so the controller can schedule a precise requeue.

| Input combination | `expired` | `requeueAfter` |
| --- | --- | --- |
| no `shutdownTime`, no TTL | `false` | `0` |
| TTL only, finished 30s ago, TTL=120s | `false` | `90s` |
| TTL only, finished 30s ago, TTL=0 | `true` | `0` |
| `shutdownTime` in 5s, TTL=120s after finish | `false` | `5s` (shutdown wins) |
| `shutdownTime` in 5m, TTL=120s after finish at -30s | `false` | `90s` (TTL wins) |

Sources: [internal/lifecycle/expiry.go:24-82](internal/lifecycle/expiry.go), [internal/lifecycle/expiry_test.go:25-94](internal/lifecycle/expiry_test.go)

## SandboxClaim Reconcile Flow

The Claim controller wraps `TimeLeft` in a small private helper that short-circuits when no lifecycle stanza is set:

```go
// extensions/controllers/sandboxclaim_controller.go
func (r *SandboxClaimReconciler) checkExpiration(claim *extensionsv1beta1.SandboxClaim) (bool, time.Duration) {
    if claim.Spec.Lifecycle == nil {
        return false, 0
    }
    finishedCondition := lifecycle.FinishedCondition(
        claim.Status.Conditions, string(v1beta1.SandboxConditionFinished))
    return lifecycle.TimeLeft(time.Now(),
        claim.Spec.Lifecycle.ShutdownTime,
        claim.Spec.Lifecycle.TTLSecondsAfterFinished,
        finishedCondition)
}
```

The reconcile loop calls `checkExpiration` **twice**: once at the top of the loop to decide which arm to take, and again after reconciling active state, in case that pass mirrored a brand-new `Finished` condition that pushes the claim across the TTL boundary in the same tick. The remaining time from the second call, `postTimeLeft`, becomes the final `RequeueAfter`.

```mermaid
stateDiagram-v2
    [*] --> CheckExpiration
    CheckExpiration --> MarkExpired: expired && !already marked
    MarkExpired --> [*]: requeue after immediateRequeueDelay
    CheckExpiration --> DeletePath: expired && policy in {Delete, DeleteForeground}
    DeletePath --> [*]: Delete(claim, [PropagationForeground?])
    CheckExpiration --> RetainCleanup: expired && policy = Retain
    RetainCleanup --> reconcileExpired: delete owned Sandbox, keep Claim
    CheckExpiration --> ReconcileActive: not expired
    ReconcileActive --> PostExpirationCheck: re-run checkExpiration
    PostExpirationCheck --> MarkExpired: now expired
    PostExpirationCheck --> [*]: requeue after postTimeLeft
```

Policy handling lives in [extensions/controllers/sandboxclaim_controller.go:178-261](extensions/controllers/sandboxclaim_controller.go). Notable details:

- `immediateRequeueDelay = time.Millisecond` (line 59) drives a near-instant requeue after the status update writes `Reason=ClaimExpired`, so the next pass enters the cleanup branch without waiting for a full controller resync.
- `ShutdownPolicyDelete` issues a normal `Delete`; `ShutdownPolicyDeleteForeground` adds `client.PropagationPolicy(metav1.DeletePropagationForeground)` so the claim object hangs around with a `deletionTimestamp` until the owned Sandbox and Pod are gone. The reconcile then returns to avoid touching status on an object that may already be gone.
- `ShutdownPolicyRetain` (the default) flows into `reconcileExpired` ([lines 388-420](extensions/controllers/sandboxclaim_controller.go)), which deletes the controlled Sandbox but leaves the claim object in place with a `Ready=False / Reason=ClaimExpired` condition.
- `reconcileExpired` refuses to delete a Sandbox it does not control (`metav1.IsControlledBy`), returning `ErrSandboxNotOwned` which is suppressed downstream to avoid crash-looping.
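
The first branch of the state diagram can be read as a pure decision function. The sketch below is illustrative only: `shutdownPolicy`, its string values, `nextStep`, and the returned step names are hypothetical stand-ins, and the real controller performs API calls at each step rather than returning a label.

```go
package main

import "fmt"

// shutdownPolicy and its values are stand-ins for the claim's ShutdownPolicy
// constants; the string values are assumptions for this sketch.
type shutdownPolicy string

const (
	policyRetain           shutdownPolicy = "Retain"
	policyDelete           shutdownPolicy = "Delete"
	policyDeleteForeground shutdownPolicy = "DeleteForeground"
)

// nextStep picks the reconcile arm from the expiry result, whether the
// expired condition was already written, and the shutdown policy.
func nextStep(expired, alreadyMarked bool, policy shutdownPolicy) string {
	switch {
	case !expired:
		return "reconcile-active"
	case !alreadyMarked:
		return "mark-expired-and-requeue"
	case policy == policyDelete:
		return "delete-claim"
	case policy == policyDeleteForeground:
		return "delete-claim-foreground"
	default: // Retain, which the text above describes as the default
		return "delete-sandbox-keep-claim"
	}
}

func main() {
	fmt.Println(nextStep(false, false, policyRetain))
	fmt.Println(nextStep(true, false, policyDelete))
	fmt.Println(nextStep(true, true, policyDeleteForeground))
	fmt.Println(nextStep(true, true, policyRetain))
}
```

Marking the claim expired before acting on the policy is what makes the one-millisecond requeue necessary: the status write and the cleanup happen in separate passes.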

Sources: [extensions/controllers/sandboxclaim_controller.go:59,176-261,309-317,388-420](extensions/controllers/sandboxclaim_controller.go)

## Sandbox Controller's Inline Expiry

The core `Sandbox` controller does **not** import `internal/lifecycle`. Because `Sandbox.spec.lifecycle` carries only `shutdownTime`, the controller uses its own minimal helper:

```go
// controllers/sandbox_controller.go
func checkSandboxExpiry(sandbox *sandboxv1beta1.Sandbox, now time.Time) (bool, time.Duration) {
    if sandbox.Spec.ShutdownTime == nil {
        return false, 0
    }
    shutdownTime := sandbox.Spec.ShutdownTime.Time
    if !now.Before(shutdownTime) {
        return true, 0
    }
    remainingTime := shutdownTime.Sub(now)
    requeueAfter := max(remainingTime, 2*time.Second)
    return false, requeueAfter
}
```

It differs from `lifecycle.TimeLeft` in two ways worth noting:

1. It applies a `2 * time.Second` floor to `requeueAfter`, presumably to avoid hot-looping on near-expiry; the shared helper returns the exact remaining duration and leaves rate-limiting to the manager.
2. It has no notion of TTL-after-finished, because the `Sandbox` CRD does not define that field.

The reconcile loop at [controllers/sandbox_controller.go:197-217](controllers/sandbox_controller.go) calls `checkSandboxExpiry` twice, once before reconciling child resources and once after, mirroring the Claim controller's double check. Because the helper reads only `shutdownTime` and `now`, the second call presumably catches a deadline that passes while child resources are being reconciled rather than a newly set `Finished` condition. On expiry it sets `Ready=False / Reason=SandboxExpired` via `setSandboxExpiredCondition`, then `handleSandboxExpiry` deletes owned child resources and, if `ShutdownPolicy=Delete`, the Sandbox object itself ([lines 1065-1071](controllers/sandbox_controller.go)). After successful cleanup it strips live-resource status while preserving terminal conditions ([lines 1075-1087](controllers/sandbox_controller.go)).

Sources: [controllers/sandbox_controller.go:49-53,192-227,1043-1127](controllers/sandbox_controller.go)

## Where the Two Implementations Meet

The Claim controller surfaces independent core-controller expiry into the claim's own `Ready` condition. After active reconcile it checks whether the underlying Sandbox carries `Reason=SandboxExpired` and, if so, reports the claim as not-ready with a distinguishing message:

```go
// extensions/controllers/sandboxclaim_controller.go:521-530
if hasSandboxExpiredCondition(sandbox.Status.Conditions) {
    return metav1.Condition{
        Type:    string(v1beta1.SandboxConditionReady),
        Status:  metav1.ConditionFalse,
        Reason:  v1beta1.SandboxReasonExpired,
        Message: "Underlying Sandbox resource has expired independently of the Claim.",
        ...
    }
}
```

This is the seam where the two reconcilers agree on a shared vocabulary even though they compute expiry separately: the Sandbox controller decides its own deadline and writes `SandboxReasonExpired`; the Claim controller decides its (potentially earlier) deadline using `lifecycle.TimeLeft` and writes `ClaimExpiredReason`; an observer looking at the claim's `Ready` condition can tell which side fired.

Sources: [extensions/controllers/sandboxclaim_controller.go:459-545](extensions/controllers/sandboxclaim_controller.go), [api/v1beta1/sandbox_types.go:46-54](api/v1beta1/sandbox_types.go), [extensions/api/v1beta1/sandboxclaim_types.go:26-31](extensions/api/v1beta1/sandboxclaim_types.go)

## Test Coverage and Boundary Cases

`internal/lifecycle/expiry_test.go` is a single table-driven test that anchors the policy ordering. The fixture uses a `now` of `2026-04-13 12:00:00 UTC` and a `Finished` condition transitioned thirty seconds before `now`, then varies `shutdownTime` and `ttlSecondsAfterFinished` independently. The cases worth calling out:

- A `ttlSecondsAfterFinished == 0` expires immediately upon finish — useful for "delete as soon as the pod terminates" claims.
- The "earlier shutdown time wins" case (5s vs 120s TTL) and the "later shutdown time loses to ttl" case (5m vs 120s TTL) together prove the min-of-deadlines behaviour in both directions.
- The "no expiry configured" case returns `(false, 0)`, which the controllers translate into "do not set a `RequeueAfter`" — without this guard, every reconcile of an unbounded Sandbox would re-arm a timer.

Sources: [internal/lifecycle/expiry_test.go:25-94](internal/lifecycle/expiry_test.go)

## Summary

`internal/lifecycle/expiry.go` is intentionally a thin policy library: it answers "which deadline wins, and how long until it fires" without touching the API server. The `SandboxClaim` controller leans on it directly because it has to reconcile two deadline sources (`shutdownTime` plus `ttlSecondsAfterFinished` against a mirrored `Finished` condition) and three shutdown policies; the `Sandbox` controller keeps its narrower deadline logic inline. Both controllers report expiry through the same `Ready` condition type, with `SandboxExpired` and `ClaimExpired` as the distinguishing reasons, so consumers can reason about expiry uniformly across the two layers even though the helpers and the inline `checkSandboxExpiry` are not yet unified.

---

## 17. Metrics & Sandbox Collector

> Reconciler latency/result metrics plus the custom Prometheus collector that surfaces per-sandbox phase, age, and warm-pool stats.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/17-metrics-sandbox-collector.md
- Generated: 2026-05-25T22:41:14.467Z

### Source Files

- `internal/metrics/metrics.go`
- `internal/metrics/sandbox_collector.go`
- `internal/metrics/metrics_test.go`
- `internal/metrics/sandbox_collector_test.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [internal/metrics/metrics.go](internal/metrics/metrics.go)
- [internal/metrics/sandbox_collector.go](internal/metrics/sandbox_collector.go)
- [internal/metrics/metrics_test.go](internal/metrics/metrics_test.go)
- [internal/metrics/sandbox_collector_test.go](internal/metrics/sandbox_collector_test.go)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
</details>

# Metrics & Sandbox Collector

The `internal/metrics` package owns every Prometheus surface that the `agent-sandbox-controller` exposes beyond the metrics that controller-runtime emits automatically. It bundles two independent concerns: a small set of histogram and counter recorders that the `SandboxClaim` reconciler calls when a claim crosses lifecycle boundaries, and a custom Prometheus collector (`SandboxCollector`) that derives point-in-time `agent_sandboxes` gauge series on every scrape by listing `Sandbox` objects through the controller-runtime cache.

This page documents both halves, the labels they emit, where reconciler code drives them, and the trade-offs the collector makes when it runs against a large cache.

## Package Layout and Registration

All metrics, descriptors, and the collector live in a single package and register themselves into the global `controller-runtime` Prometheus registry that backs the metrics HTTP server configured by the controller manager.

| File | Responsibility |
|---|---|
| `internal/metrics/metrics.go` | Declares histogram, counter, and gauge vectors and the `AgentSandboxesDesc` descriptor; defines launch-type / annotation constants; registers the vectors and `BuildInfo` in `init()`; provides `Record*` helpers. |
| `internal/metrics/sandbox_collector.go` | Defines `SandboxCollector`, the `AgentSandboxesMetricKey` aggregation key, and `RegisterSandboxCollector`. |
| `internal/metrics/tracing.go` | Unrelated OpenTelemetry `Instrumenter` (out of scope for this page, but shares the package). |
| `internal/metrics/metrics_test.go`, `sandbox_collector_test.go`, `testmain_test.go` | Recorder unit tests, fake-client collector tests, and a `goleak` `TestMain`. |

Vector metrics and `BuildInfo` are registered up-front in `init()` so they are visible on `/metrics` even before any reconcile has occurred. The custom collector is registered later, from `main.go`, once the controller manager has a usable client:

```go
// cmd/agent-sandbox-controller/main.go:234
// Register the custom Sandbox metric collector globally.
asmetrics.RegisterSandboxCollector(mgr.GetClient(), mgr.GetLogger().WithName("sandbox-collector"))
```

`RegisterSandboxCollector` is idempotent: it swallows `prometheus.AlreadyRegisteredError` and logs at info level instead, so re-invocations during test setup or manager restarts do not crash the process.
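
The tolerate-duplicates pattern can be sketched without the Prometheus dependency. Everything here is a stand-in: `alreadyRegisteredError` plays the role of `prometheus.AlreadyRegisteredError`, `registry` is a toy that rejects duplicate names, and `registerOnce` is a hypothetical name rather than the package's function.

```go
package main

import (
	"errors"
	"fmt"
)

// alreadyRegisteredError is a stand-in for prometheus.AlreadyRegisteredError.
type alreadyRegisteredError struct{ name string }

func (e alreadyRegisteredError) Error() string {
	return "collector already registered: " + e.name
}

// registry is a toy registry that rejects duplicate collector names.
type registry struct{ seen map[string]bool }

func (r *registry) Register(name string) error {
	if r.seen[name] {
		return alreadyRegisteredError{name: name}
	}
	r.seen[name] = true
	return nil
}

// registerOnce treats a duplicate registration as success (logging it at
// info level) and returns any other error unchanged.
func registerOnce(r *registry, name string, logInfo func(string)) error {
	err := r.Register(name)
	var dup alreadyRegisteredError
	if errors.As(err, &dup) {
		logInfo("collector already registered, skipping: " + name)
		return nil
	}
	return err
}

func main() {
	reg := &registry{seen: map[string]bool{}}
	logInfo := func(msg string) { fmt.Println("INFO", msg) }

	fmt.Println(registerOnce(reg, "sandbox-collector", logInfo)) // first call registers
	fmt.Println(registerOnce(reg, "sandbox-collector", logInfo)) // second call is a no-op
}
```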

Sources: [internal/metrics/metrics.go:38-139](internal/metrics/metrics.go), [internal/metrics/sandbox_collector.go:61-71](internal/metrics/sandbox_collector.go), [cmd/agent-sandbox-controller/main.go:179-234](cmd/agent-sandbox-controller/main.go)

## Component Map

```mermaid
flowchart LR
    subgraph Reconciler["extensions/controllers/sandboxclaim_controller.go"]
        Init["initializeAnnotations<br/>stamps ObservabilityAnnotation"]
        Adopt["tryAdoptSandbox<br/>warm path"]
        ColdCreate["create cold Sandbox"]
        ReadyTx["recordCreationLatencyMetric<br/>on Ready transition"]
    end

    subgraph MetricsPkg["internal/metrics"]
        Helpers["Record* helpers<br/>metrics.go"]
        Vectors["HistogramVec / CounterVec / GaugeFunc<br/>metrics.go"]
        Collector["SandboxCollector<br/>sandbox_collector.go"]
        Desc["AgentSandboxesDesc"]
    end

    subgraph CR["controller-runtime"]
        Reg["metrics.Registry"]
        Server["metricsserver on :metricsAddr"]
    end

    APIServer[("Kubernetes API<br/>v1beta1.SandboxList")]

    Init --> Helpers
    Adopt --> Helpers
    ColdCreate --> Helpers
    ReadyTx --> Helpers
    Helpers --> Vectors
    Vectors --> Reg
    Collector --> Desc
    Collector -->|"List with<br/>UnsafeDisableDeepCopy"| APIServer
    Collector --> Reg
    Reg --> Server
```

The reconciler never touches the collector; the collector never touches the reconciler. They meet only through Prometheus' scrape protocol via the shared `metrics.Registry`.

Sources: [internal/metrics/metrics.go:132-139](internal/metrics/metrics.go), [internal/metrics/sandbox_collector.go:62-107](internal/metrics/sandbox_collector.go), [extensions/controllers/sandboxclaim_controller.go:285-300](extensions/controllers/sandboxclaim_controller.go)

## Reconciler-Driven Metrics

Four vector metrics and one build-info gauge are pre-registered in `init()` and updated through small package-level helpers. The helpers all live in `metrics.go` and exist so callers do not have to know the exact label order.

### Metric Catalog

| Name | Type | Labels | Helper | What it measures |
|---|---|---|---|---|
| `agent_sandbox_claim_startup_latency_ms` | HistogramVec | `launch_type`, `sandbox_template` | `RecordClaimStartupLatency` | End-to-end ms from the *webhook* first observing the claim to `SandboxClaim` Ready. |
| `agent_sandbox_claim_controller_startup_latency_ms` | HistogramVec | `launch_type`, `sandbox_template` | `RecordClaimControllerStartupLatency` | ms from the *controller* first observing the claim to `SandboxClaim` Ready (excludes admission lag). |
| `agent_sandbox_creation_latency_ms` | HistogramVec | `namespace`, `launch_type`, `sandbox_template` | `RecordSandboxCreationLatency` | ms from `Sandbox` creation to underlying Pod Ready. For warm launches this collapses to controller synchronization overhead. |
| `agent_sandbox_claim_creation_total` | CounterVec | `namespace`, `sandbox_template`, `launch_type`, `warmpool_name`, `pod_condition` | `RecordSandboxClaimCreation` | One increment per claim that produces a Sandbox; emitted at warm adoption and at cold creation. |
| `agent_sandbox_build_info` | GaugeFunc, constant `1` | `git_version`, `git_commit`, `build_date`, `go_version`, `compiler`, `platform` | n/a (constant) | Build provenance, populated from `internal/version` at process start. |

Latency histograms use two bucket sets. Claim histograms span `100ms`..`240000ms` (4 minutes), as in the snippet below; sandbox-creation histograms extend further to `600000ms` (10 minutes) because they cover cold Pod scheduling and image pull:

```go
// internal/metrics/metrics.go:48
Buckets: []float64{100, 250, 500, 750, 1000, 1250, 1500, 2000, 2500, 5000, 10000, 30000, 60000, 120000, 240000},
```

Sources: [internal/metrics/metrics.go:38-130](internal/metrics/metrics.go), [internal/metrics/metrics.go:141-161](internal/metrics/metrics.go)

### Launch-Type Vocabulary

Three constants normalize the `launch_type` label across every metric:

```go
// internal/metrics/metrics.go:26
LaunchTypeWarm    = "warm"    // Pod from a SandboxWarmPool
LaunchTypeCold    = "cold"    // Pod not from a SandboxWarmPool
LaunchTypeUnknown = "unknown" // Used when Sandbox is nil during failure
```

The claim controller derives the value from the resulting `Sandbox`: a non-empty `agents.x-k8s.io/pod-name` annotation marks an adopted warm pod; absence of that annotation marks a cold creation; a nil `*Sandbox` (typically during failure paths) falls back to `unknown`. See `getLaunchType` in `extensions/controllers/sandboxclaim_controller.go:1335-1343`.
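
A minimal sketch of that derivation, using a stand-in `sandbox` struct that carries only annotations. The function name `launchType` and the struct are assumptions for illustration; the annotation key and label values are the ones quoted above.

```go
package main

import "fmt"

const podNameAnnotation = "agents.x-k8s.io/pod-name"

// sandbox is a stand-in for the Sandbox object, reduced to its annotations.
type sandbox struct {
	Annotations map[string]string
}

// launchType maps a Sandbox to the launch_type label value.
func launchType(sb *sandbox) string {
	if sb == nil {
		return "unknown" // failure paths may not have a Sandbox yet
	}
	if sb.Annotations[podNameAnnotation] != "" {
		return "warm" // adopted from a warm pool
	}
	return "cold"
}

func main() {
	warm := &sandbox{Annotations: map[string]string{podNameAnnotation: "pool-pod-abc"}}
	fmt.Println(launchType(warm))       // warm
	fmt.Println(launchType(&sandbox{})) // cold
	fmt.Println(launchType(nil))        // unknown
}
```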

### Where Recorders Fire

Two timing annotations bracket the claim lifecycle. The webhook stamps `agents.x-k8s.io/webhook-first-observed-at` (`WebhookAnnotation`) when it first admits the claim, and `SandboxClaimReconciler.initializeAnnotations` stamps `agents.x-k8s.io/controller-first-observed-at` (`ObservabilityAnnotation`) on the first reconcile pass:

```go
// extensions/controllers/sandboxclaim_controller.go:287
needObservabilityPatch := claim.Annotations[asmetrics.ObservabilityAnnotation] == ""
...
claim.Annotations[asmetrics.ObservabilityAnnotation] = timestamp.Format(time.RFC3339Nano)
```

When the claim later transitions to Ready, the reconciler reads those annotations, parses them as `time.RFC3339Nano`, guards against parse failures and negative durations, and calls the matching helper. The value passed to `RecordSandboxCreationLatency` is instead measured from `Sandbox.CreationTimestamp` to the Ready condition's `LastTransitionTime`, so it can be computed even for warm launches where the pod predates the claim. See `recordClaimStartupLatency`, `recordControllerStartupLatency`, and `recordSandboxCreationLatency` in `extensions/controllers/sandboxclaim_controller.go:1345-1395`.

`RecordSandboxClaimCreation` is invoked from two different places in the reconciler — once when warm adoption succeeds and once when a cold Sandbox is created — and uses the `pod_condition` label (`ready` vs. `not_ready`) to distinguish whether the adopted warm pod was already Ready at handoff:

```go
// extensions/controllers/sandboxclaim_controller.go:710
asmetrics.RecordSandboxClaimCreation(claim.Namespace, claim.Spec.TemplateRef.Name,
    asmetrics.LaunchTypeWarm, poolName, podCondition)
// extensions/controllers/sandboxclaim_controller.go:1065
asmetrics.RecordSandboxClaimCreation(claim.Namespace, claim.Spec.TemplateRef.Name,
    asmetrics.LaunchTypeCold, "none", "not_ready")
```

Sources: [internal/metrics/metrics.go:26-36](internal/metrics/metrics.go), [extensions/controllers/sandboxclaim_controller.go:285-300](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:706-710](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1334-1395](extensions/controllers/sandboxclaim_controller.go)

### Latency Recording Sequence

```mermaid
sequenceDiagram
    autonumber
    participant W as Admission Webhook
    participant API as kube-apiserver
    participant R as SandboxClaim Reconciler
    participant M as metrics.Record*
    participant P as Prometheus /metrics

    W->>API: stamp WebhookAnnotation on create
    API-->>R: watch event (claim)
    R->>API: initializeAnnotations<br/>stamp ObservabilityAnnotation
    R->>R: adopt warm Sandbox<br/>or create cold Sandbox
    R->>M: RecordSandboxClaimCreation<br/>(namespace, template, launch_type,<br/>warmpool_name, pod_condition)
    Note over R,API: ...Sandbox transitions to Ready...
    R->>M: RecordClaimStartupLatency<br/>(now - WebhookAnnotation)
    R->>M: RecordClaimControllerStartupLatency<br/>(now - ObservabilityAnnotation)
    R->>M: RecordSandboxCreationLatency<br/>(SandboxReady - Sandbox.CreationTimestamp)
    P-->>M: scrape
```

Sources: [extensions/controllers/sandboxclaim_controller.go:285-300](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1345-1395](extensions/controllers/sandboxclaim_controller.go), [internal/metrics/metrics.go:141-161](internal/metrics/metrics.go)

### Build Info

`BuildInfo` is a `prometheus.NewGaugeFunc` that always returns `1` and carries `git_version`, `git_commit`, `build_date`, `go_version`, `compiler`, and `platform` as constant labels resolved from `sigs.k8s.io/agent-sandbox/internal/version`. `TestBuildInfo` pins the exact `# HELP`/`# TYPE`/sample line that `/metrics` must produce, so any change to the label set or help string fails the build.

Sources: [internal/metrics/metrics.go:112-129](internal/metrics/metrics.go), [internal/metrics/metrics_test.go:101-111](internal/metrics/metrics_test.go)

## The Custom Sandbox Collector

`SandboxCollector` implements the `prometheus.Collector` interface (`Describe` + `Collect`) and is the only path that produces the `agent_sandboxes` gauge. Unlike the reconciler-driven vectors, this metric is *not* maintained incrementally; each Prometheus scrape recomputes the full set of label combinations by listing live `Sandbox` objects.

### Descriptor and Aggregation Key

The descriptor and its companion key struct establish the label contract:

```go
// internal/metrics/metrics.go:105
AgentSandboxesDesc = prometheus.NewDesc(
    "agent_sandboxes",
    "Monitor the point-in-time number of sandboxes in the cluster.",
    []string{"namespace", "ready_condition", "expired", "launch_type", "sandbox_template", "owned_by"},
    nil,
)
```

```go
// internal/metrics/sandbox_collector.go:37
type AgentSandboxesMetricKey struct {
    Namespace      string
    ReadyCondition string
    Expired        string
    LaunchType     string
    Template       string
    OwnedBy        string
}
```

`NewAgentSandboxesConstMetric` projects a `(count, key)` pair into a `prometheus.MustNewConstMetric` of type `GaugeValue`. Using `ConstMetric` is what lets the collector emit a series only when it observes one — there is no risk of stale label combinations lingering after the underlying Sandboxes disappear, which is the failure mode a long-lived `GaugeVec` would have here.

Sources: [internal/metrics/metrics.go:97-110](internal/metrics/metrics.go), [internal/metrics/sandbox_collector.go:37-59](internal/metrics/sandbox_collector.go)

### Collect Pipeline

`Collect` is a four-stage pipeline executed under a 5-second per-scrape deadline (`metricsCollectTimeout`). The collector applies `client.UnsafeDisableDeepCopy` to the `List` call to skip the per-object deep copy controller-runtime normally performs when serving cached reads; the source comments document why this is safe and why a `GaugeVec` was deliberately rejected:

```go
// internal/metrics/sandbox_collector.go:94
// Collect fetches sandboxes, calculates labels, and sends metrics to the channel.
// UnsafeDisableDeepCopy avoids O(N) deep-copy overhead on every scrape; safe here because
// Collect only reads fields for label aggregation and never mutates or retains the objects.
// A GaugeVec updated in the Reconcile loop would be more performant (O(1) per scrape),
// but this is a known trade-off to keep the Reconcile loop simpler.
```

```text
            ┌───────────────────────────────────────────────────────────┐
            │ Collect(ch)                                               │
            ├───────────────────────────────────────────────────────────┤
            │ 1. ctx with 5s timeout                                    │
            │ 2. client.List(&SandboxList, UnsafeDisableDeepCopy)       │
            │      │                                                    │
            │      └─▶ on error: logger.Error + return (no series)      │
            │ 3. for each sandbox:                                      │
            │      derive (namespace, ready_condition, expired,         │
            │              launch_type, sandbox_template, owned_by)     │
            │      counts[key]++                                        │
            │ 4. for key,count := range counts:                         │
            │      ch <- NewAgentSandboxesConstMetric(count, key)       │
            └───────────────────────────────────────────────────────────┘
```

If `List` fails, the collector logs and returns without emitting any series for that scrape — Prometheus simply observes the absence and continues. There is no fall-back to a cached snapshot.

Sources: [internal/metrics/sandbox_collector.go:32-34](internal/metrics/sandbox_collector.go), [internal/metrics/sandbox_collector.go:89-162](internal/metrics/sandbox_collector.go)

### Label Derivation Rules

Each Sandbox is mapped to exactly one `AgentSandboxesMetricKey`, then the keys are aggregated by count:

| Label | Default | Becomes non-default when… | Source |
|---|---|---|---|
| `namespace` | `sandbox.Namespace` | — | `sandbox_collector.go:149` |
| `ready_condition` | `"false"` | `Ready` condition has `Status == ConditionTrue` | `sandbox_collector.go:111-117` |
| `expired` | `"false"` | `Ready` condition has `Reason == SandboxReasonExpired` (`"SandboxExpired"`) | `sandbox_collector.go:111-121`, `api/v1beta1/sandbox_types.go:53-54` |
| `launch_type` | `"cold"` | `SandboxPodNameAnnotation` (`agents.x-k8s.io/pod-name`) is set and non-empty (warm-pool adoption) | `sandbox_collector.go:123-126`, `api/v1beta1/sandbox_types.go:56-57` |
| `sandbox_template` | `"unknown"` | `SandboxTemplateRefAnnotation` (`agents.x-k8s.io/sandbox-template-ref`) is set and non-empty | `sandbox_collector.go:128-133`, `api/v1beta1/sandbox_types.go:58-59` |
| `owned_by` | `"None"` | Controller `OwnerReference.APIVersion` matches `extensions/v1beta1.GroupVersion` and `Kind` is `"SandboxClaim"` or `"SandboxWarmPool"` | `sandbox_collector.go:135-146` |

Two subtleties are explicit in code: an `expired` reading depends on the `Ready` condition being present at all (a missing condition leaves both `ready_condition` and `expired` at `"false"`), and `owned_by` only recognizes controllers from the `agent-sandbox` extensions group — third-party controllers wrapping a Sandbox would surface as `"None"`. A user-created Sandbox without the template annotation defaults to `sandbox_template="unknown"`, called out in a comment at `sandbox_collector.go:129-130`.
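
The derivation rules in the table can be sketched as one function over stand-in types. The structs below are simplified replacements for the Kubernetes API types, `deriveKey` is a hypothetical name, and the `extensionsAPIVersion` string is an assumed placeholder for the extensions `GroupVersion`, which this page does not spell out.

```go
package main

import "fmt"

const (
	podNameAnnotation     = "agents.x-k8s.io/pod-name"
	templateRefAnnotation = "agents.x-k8s.io/sandbox-template-ref"
	reasonExpired         = "SandboxExpired"
	// extensionsAPIVersion is an assumed placeholder value for this sketch.
	extensionsAPIVersion = "extensions.agents.x-k8s.io/v1beta1"
)

type condition struct{ Type, Status, Reason string }

type ownerRef struct {
	APIVersion, Kind string
	Controller       bool
}

type sandbox struct {
	Namespace   string
	Annotations map[string]string
	Conditions  []condition
	Owners      []ownerRef
}

type metricKey struct {
	Namespace, ReadyCondition, Expired, LaunchType, Template, OwnedBy string
}

// deriveKey starts from the documented defaults and overrides each label
// only when the corresponding rule matches.
func deriveKey(sb sandbox) metricKey {
	k := metricKey{
		Namespace: sb.Namespace, ReadyCondition: "false", Expired: "false",
		LaunchType: "cold", Template: "unknown", OwnedBy: "None",
	}
	for _, c := range sb.Conditions {
		if c.Type != "Ready" {
			continue
		}
		if c.Status == "True" {
			k.ReadyCondition = "true"
		}
		if c.Reason == reasonExpired {
			k.Expired = "true"
		}
	}
	if sb.Annotations[podNameAnnotation] != "" {
		k.LaunchType = "warm"
	}
	if t := sb.Annotations[templateRefAnnotation]; t != "" {
		k.Template = t
	}
	for _, o := range sb.Owners {
		if o.Controller && o.APIVersion == extensionsAPIVersion &&
			(o.Kind == "SandboxClaim" || o.Kind == "SandboxWarmPool") {
			k.OwnedBy = o.Kind
		}
	}
	return k
}

func main() {
	plain := sandbox{Namespace: "default", Conditions: []condition{{Type: "Ready", Status: "True"}}}
	fmt.Printf("%+v\n", deriveKey(plain))
}
```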

Sources: [internal/metrics/sandbox_collector.go:109-156](internal/metrics/sandbox_collector.go), [api/v1beta1/sandbox_types.go:35-59](api/v1beta1/sandbox_types.go)

### Verified Cases from the Tests

`TestSandboxCollector` builds a fake client with hand-crafted Sandboxes and asserts the exact set of `agent_sandboxes` series produced. The matrix below summarizes what each case proves about the derivation rules above:

| Test case | Inputs | Asserted series |
|---|---|---|
| `single ready cold unknown sandbox` | `Ready=True`, no annotations, no owner | `namespace=default ready=true expired=false launch=cold template=unknown owned_by=None → 1` |
| `missing ready condition` | `Conditions: nil` | Both `ready_condition` and `expired` default to `"false"` |
| `mixed sandboxes` | 4 Sandboxes, one warm + expired in `test-ns`, two duplicate cold not-ready in `default` | 3 distinct series; the two duplicates collapse to the same key and count `2` |
| `claimed sandbox` | controller OwnerReference `Kind=SandboxClaim` | `owned_by=SandboxClaim` |
| `warmpool sandbox` | controller OwnerReference `Kind=SandboxWarmPool` | `owned_by=SandboxWarmPool` |

The `mixed sandboxes` case in particular pins the aggregation contract — Sandboxes with identical label tuples must collapse into a single gauge sample whose value is the count.

Sources: [internal/metrics/sandbox_collector_test.go:39-256](internal/metrics/sandbox_collector_test.go)

## Operational Notes

- **Metrics endpoint.** The controller manager runs the Prometheus HTTP server on the address passed to `--metrics-bind-address`. `metricsserver.Options{BindAddress: metricsAddr}` is the only wiring, so the standard controller-runtime flags govern auth, TLS, and `ExtraHandlers` (`pprof` is mounted on the same listener when `--enable-pprof` is on). See `cmd/agent-sandbox-controller/main.go:179-214`.
- **Scrape cost.** Every scrape lists *all* Sandboxes through the cache. The `UnsafeDisableDeepCopy` opt-in keeps that cost to a pointer walk over already-cached objects, but very large clusters still pay an O(N) iteration per scrape; the source explicitly chose this over the O(1) `GaugeVec` alternative to keep the Reconcile loop simpler.
- **Per-scrape timeout.** A 5-second deadline (`metricsCollectTimeout`) bounds the List call. Exceeding it produces a logged error and no `agent_sandboxes` series for that scrape, not a partial series set.
- **Idempotent registration.** `RegisterSandboxCollector` tolerates `prometheus.AlreadyRegisteredError`, which matters when tests or restart paths re-invoke registration against the global registry.
- **Leak discipline.** `internal/metrics/testmain_test.go` wraps the package's tests with `goleak.VerifyTestMain`, so any goroutine leaked by the collector or the OTel instrumenter would fail the suite.

Sources: [internal/metrics/sandbox_collector.go:32-107](internal/metrics/sandbox_collector.go), [cmd/agent-sandbox-controller/main.go:179-234](cmd/agent-sandbox-controller/main.go), [internal/metrics/testmain_test.go:1-25](internal/metrics/testmain_test.go)

## Summary

The metrics package is intentionally split along a push/pull seam. The reconciler pushes timing and counting events into pre-registered histograms and counters via thin `Record*` helpers, with claim lifecycle bracketed by two well-defined RFC3339Nano annotations stamped by the webhook and the controller's first reconcile. In parallel, `SandboxCollector` pulls a fresh snapshot of all Sandboxes on every Prometheus scrape, aggregates them by `(namespace, ready_condition, expired, launch_type, sandbox_template, owned_by)`, and emits `agent_sandboxes` as a set of `ConstMetric` gauge samples — accepting an O(N) per-scrape List, with `UnsafeDisableDeepCopy` softening the cost, in exchange for a Reconcile loop that never needs to bookkeep cardinality.

---

## 18. OpenTelemetry Tracing Setup

> Provider-neutral OTLP tracing wiring used by both the controller binary and the SDKs; instrumenter interface and no-op fallback.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/18-opentelemetry-tracing-setup.md
- Generated: 2026-05-25T22:42:02.056Z

### Source Files

- `internal/metrics/tracing.go`
- `clients/go/sandbox/tracing.go`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py`
- `clients/python/agentic-sandbox-client/otel-collector-config.yaml.example`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [internal/metrics/tracing.go](internal/metrics/tracing.go)
- [clients/go/sandbox/tracing.go](clients/go/sandbox/tracing.go)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)
- [clients/python/agentic-sandbox-client/otel-collector-config.yaml.example](clients/python/agentic-sandbox-client/otel-collector-config.yaml.example)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
- [clients/go/sandbox/options.go](clients/go/sandbox/options.go)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [clients/go/sandbox/sandbox.go](clients/go/sandbox/sandbox.go)
- [clients/go/sandbox/tracing_test.go](clients/go/sandbox/tracing_test.go)
- [clients/python/agentic-sandbox-client/GCP.md](clients/python/agentic-sandbox-client/GCP.md)

</details>

# OpenTelemetry Tracing Setup

The repository ships provider-neutral OpenTelemetry (OTel) tracing for three runtime surfaces: the `agent-sandbox-controller` binary, the Go SDK (`clients/go/sandbox`), and the Python SDK (`clients/python/agentic-sandbox-client`). Each surface configures its own `TracerProvider` with the OTLP/gRPC exporter, then emits spans whose lineage is stitched back across the controller/SDK boundary using a JSON-encoded W3C Trace Context written to the Kubernetes object annotation `opentelemetry.io/trace-context`.

The wiring is deliberately minimal: no vendor SDKs are pulled in, the exporter endpoint is taken from the standard `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable, and every surface degrades safely to a no-op when tracing is disabled or the optional dependency is missing. A user-run OTel Collector (an `otel-collector-config.yaml.example` is supplied) is the integration point for any backend (Google Cloud, Jaeger, Tempo, etc.).

## Architecture Overview

The controller and SDKs participate in one logical trace per sandbox. The controller starts a span during reconcile, parents it to any carrier already present in the object's annotation, and persists the carrier on the `Sandbox` CR when none exists; the SDKs start a per-instance "lifecycle" span that becomes the parent for all sandbox operations (`run`, `read`, `write`, `list`, `exists`, `create_claim`, `wait_for_sandbox_ready`).

```mermaid
flowchart LR
  subgraph Controller["agent-sandbox-controller (Go)"]
    Main["cmd/.../main.go<br/>--enable-tracing"]
    Setup["internal/metrics<br/>SetupOTel / NewNoOp"]
    Inst["Instrumenter interface<br/>StartSpan / GetTraceContext<br/>AddEvent / IsRecording"]
    Recon["controllers/sandbox_controller.go<br/>Reconcile()"]
  end

  subgraph GoSDK["Go SDK (clients/go/sandbox)"]
    GoOpts["Options.TracerProvider<br/>Options.TraceServiceName"]
    GoTrace["tracing.go<br/>NewTracerProvider / newTracer<br/>startSpan / recordError"]
    GoLife["sandbox.go<br/>{svc}.lifecycle span"]
  end

  subgraph PySDK["Python SDK (k8s_agent_sandbox)"]
    PyCfg["SandboxTracerConfig<br/>enable_tracing / trace_service_name"]
    PyTM["trace_manager.py<br/>initialize_tracer<br/>TracerManager<br/>@trace_span / @async_trace_span"]
  end

  subgraph K8s["Kubernetes API"]
    CR["Sandbox / SandboxClaim<br/>annotations[opentelemetry.io/trace-context]<br/>= JSON {traceparent, tracestate}"]
  end

  Collector["OTel Collector<br/>OTLP :4317 gRPC / :4318 HTTP<br/>processors: batch<br/>exporters: googlecloud / ..."]

  Main --> Setup --> Inst --> Recon
  Recon -- Inject carrier --> CR
  CR -- Extract carrier --> Inst
  GoOpts --> GoTrace --> GoLife --> CR
  PyCfg --> PyTM --> CR
  Setup -- "OTLP/gRPC<br/>OTEL_EXPORTER_OTLP_ENDPOINT" --> Collector
  GoTrace -- "OTLP/gRPC" --> Collector
  PyTM -- "OTLP/gRPC" --> Collector
```

Sources: [internal/metrics/tracing.go:35-147](internal/metrics/tracing.go), [clients/go/sandbox/tracing.go:47-72](clients/go/sandbox/tracing.go), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:114-161](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py), [controllers/sandbox_controller.go:160-186](controllers/sandbox_controller.go)

## Controller Tracing (`internal/metrics`)

### Instrumenter Interface and No-Op Fallback

`internal/metrics/tracing.go` defines a small abstraction the controller uses everywhere instead of calling OTel APIs directly. This keeps the controller compilable and runnable without an OTel backend and lets reconcilers be unit-tested without a real `TracerProvider`.

```go
// internal/metrics/tracing.go
type Instrumenter interface {
    StartSpan(ctx context.Context, obj metav1.Object, spanName string, attrs map[string]string) (context.Context, func())
    GetTraceContext(ctx context.Context) string
    AddEvent(ctx context.Context, name string, attrs map[string]string)
    IsRecording(ctx context.Context) bool
}
```

Two implementations satisfy it:

| Type | Constructor | Behavior |
| --- | --- | --- |
| `noopInstrumenter` | `metrics.NewNoOp()` | Returns the input context, a no-op closer, empty trace context, and `IsRecording=false`. Used when `--enable-tracing` is unset. |
| `otelInstrumenter` | Returned by `metrics.SetupOTel(ctx, serviceName)` | Wraps a real `trace.Tracer`, a `propagation.TextMapPropagator`, and a `logr.Logger`. |

Sources: [internal/metrics/tracing.go:41-57](internal/metrics/tracing.go)
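
The value of the no-op fallback is easiest to see from the caller's side. The sketch below is a loose Python analogue, not the repository's Go code: `Instrumenter`, `NoOpInstrumenter`, and `reconcile_once` are hypothetical names, contexts and objects are plain dicts, and only the four-method shape is taken from the interface above.

```python
from typing import Callable, Dict, Protocol, Tuple


class Instrumenter(Protocol):
    """Hypothetical Python analogue of the four-method Go interface."""

    def start_span(self, ctx: dict, obj: dict, span_name: str,
                   attrs: Dict[str, str]) -> Tuple[dict, Callable[[], None]]: ...

    def get_trace_context(self, ctx: dict) -> str: ...

    def add_event(self, ctx: dict, name: str, attrs: Dict[str, str]) -> None: ...

    def is_recording(self, ctx: dict) -> bool: ...


class NoOpInstrumenter:
    """Returns the input context, a no-op closer, and an empty carrier."""

    def start_span(self, ctx, obj, span_name, attrs):
        return ctx, lambda: None

    def get_trace_context(self, ctx):
        return ""

    def add_event(self, ctx, name, attrs):
        return None

    def is_recording(self, ctx):
        return False


def reconcile_once(tracer: Instrumenter, obj: dict) -> str:
    """Hypothetical reconcile step that depends only on the interface."""
    ctx, end = tracer.start_span({}, obj, "ReconcileSandbox", {"name": obj["name"]})
    try:
        tracer.add_event(ctx, "Observed", {})
        return tracer.get_trace_context(ctx)
    finally:
        end()  # always close the span, even on error
```

Because `reconcile_once` never touches a tracing SDK directly, a test can pass `NoOpInstrumenter()` and exercise the reconcile logic without any exporter or provider.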

### `SetupOTel`: Bootstrapping the SDK

`SetupOTel` is the controller's single entry point for initializing the OTel SDK. It constructs an OTLP/gRPC exporter (which honors `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_INSECURE`), wires it into a batching `TracerProvider`, installs that provider and a W3C `TraceContext` propagator as the OTel globals, and returns the `Instrumenter` together with a shutdown closer.

```go
// internal/metrics/tracing.go
exporter, err := otlptracegrpc.New(ctx)
...
tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),
    sdktrace.WithResource(resource.NewWithAttributes(
        semconv.SchemaURL,
        semconv.ServiceNameKey.String(serviceName),
    )),
)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.TraceContext{})
```

Notes from the implementation:
- Only the W3C `TraceContext` propagator is installed; baggage is intentionally excluded.
- The resource is built with `semconv/v1.38.0` and a single `service.name` attribute.
- The instrumentation scope name is `agent-sandbox-controller`.

Sources: [internal/metrics/tracing.go:123-147](internal/metrics/tracing.go)

### Controller Binary Wiring

`cmd/agent-sandbox-controller/main.go` registers a `--enable-tracing` flag. When unset, the controller runs with `NewNoOp()`; when set, `SetupOTel` is invoked under a 10-second initialization timeout, and the shutdown closer is deferred for graceful flush.

```go
// cmd/agent-sandbox-controller/main.go
flag.BoolVar(&enableTracing, "enable-tracing", false, "Enable OpenTelemetry tracing via OTLP.")
...
var instrumenter = asmetrics.NewNoOp()
if enableTracing {
    initCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()
    instrumenter, cleanup, err = asmetrics.SetupOTel(initCtx, "agent-sandbox-controller")
    if err != nil {
        setupLog.Error(err, "unable to initialize tracing")
        os.Exit(1)
    }
    defer cleanup()
}
```

The `Instrumenter` is then handed to reconcilers (for example, `controllers.SandboxReconciler.Tracer`).

Sources: [cmd/agent-sandbox-controller/main.go:57-168](cmd/agent-sandbox-controller/main.go), [controllers/sandbox_controller.go:126-186](controllers/sandbox_controller.go)

### Trace-Context Propagation Across Reconcile Boundaries

`otelInstrumenter.StartSpan` does two things that make cross-binary tracing work:

1. If `obj` has the `opentelemetry.io/trace-context` annotation, it JSON-decodes the carrier and extracts a parent context using the global W3C propagator before starting the span.
2. It converts callers' `map[string]string` attrs into `attribute.KeyValue` slices and attaches them via `trace.WithAttributes(...)`.

`GetTraceContext` does the inverse: it injects the active span into an empty `propagation.MapCarrier` and JSON-marshals it. `SandboxReconciler.Reconcile` calls this immediately after starting the `ReconcileSandbox` span and, when the annotation is missing, patches the resulting JSON onto the `Sandbox`'s annotations. That single write becomes the carrier that SDK clients later extract.

```go
// internal/metrics/tracing.go
const TraceContextAnnotation = "opentelemetry.io/trace-context"
// Example: {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
```

Sources: [internal/metrics/tracing.go:36-105](internal/metrics/tracing.go), [controllers/sandbox_controller.go:160-186](controllers/sandbox_controller.go)
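
To make the annotation format concrete, here is a minimal standard-library sketch of encoding and decoding the carrier. It assumes the W3C `traceparent` layout (`version-traceid-parentid-flags`) and does not use the OTel propagator that the real code relies on; `encode_carrier` and `decode_carrier` are hypothetical helper names.

```python
import json
import re
from typing import Optional

TRACE_CONTEXT_ANNOTATION = "opentelemetry.io/trace-context"

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def _parse_traceparent(value: str) -> Optional[dict]:
    match = _TRACEPARENT_RE.match(value)
    if match is None:
        return None
    _version, trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def encode_carrier(trace_id: str, span_id: str, sampled: bool) -> str:
    """Return the JSON annotation value, or "" when the span is not valid."""
    traceparent = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    if _parse_traceparent(traceparent) is None:
        return ""
    return json.dumps({"traceparent": traceparent})


def decode_carrier(annotations: dict) -> Optional[dict]:
    """Parse the annotation into trace fields; None if absent or malformed."""
    raw = annotations.get(TRACE_CONTEXT_ANNOTATION, "")
    if not raw:
        return None
    try:
        carrier = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(carrier, dict):
        return None
    return _parse_traceparent(str(carrier.get("traceparent", "")))
```

Returning `""` for an invalid span mirrors the behavior described for the SDK helpers, which avoid writing a meaningless annotation when tracing is off.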

## Go SDK Tracing (`clients/go/sandbox`)

The Go SDK does not call `SetupOTel`. It is library code, so it never installs OTel globals; instead, it asks the caller for a `TracerProvider` and exposes a convenience constructor (`NewTracerProvider`) for callers that just want OTLP/gRPC defaults.

### Options Surface

```go
// clients/go/sandbox/options.go
// TraceServiceName is the OpenTelemetry service name used for the tracer's
// instrumentation scope and the resource's service.name attribute.
// Default: "sandbox-client".
TraceServiceName string

// TracerProvider sets the OpenTelemetry TracerProvider for span creation.
// If nil, falls back to otel.GetTracerProvider (noop by default).
TracerProvider trace.TracerProvider
```

`setDefaults` fills `TraceServiceName` with `"sandbox-client"` when unset. `newTracer(opts)` (in `tracing.go`) selects `opts.TracerProvider` if non-nil, falls back to `otel.GetTracerProvider()` (a no-op unless the application installed one), and converts the service name into the instrumentation scope by replacing dashes with underscores.

| Option | Default | Effect |
| --- | --- | --- |
| `TraceServiceName` | `"sandbox-client"` | Span name prefix (e.g. `sandbox-client.run`), resource `service.name`, and underscored scope. |
| `TracerProvider` | `nil` → `otel.GetTracerProvider()` | When neither an explicit nor a global provider is installed, spans are no-ops with negligible overhead. |

Sources: [clients/go/sandbox/options.go:131-180](clients/go/sandbox/options.go), [clients/go/sandbox/tracing.go:107-120](clients/go/sandbox/tracing.go)
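
The naming and fallback rules above are small enough to express directly. The following Python sketch restates them for illustration only; the function names are hypothetical and the provider arguments are opaque stand-ins, not OTel types.

```python
DEFAULT_SERVICE_NAME = "sandbox-client"


def resolve_service_name(configured: str) -> str:
    """An empty service name falls back to the documented default."""
    return configured or DEFAULT_SERVICE_NAME


def instrumentation_scope(service_name: str) -> str:
    """Dashes become underscores in the instrumentation scope name."""
    return service_name.replace("-", "_")


def span_name(service_name: str, operation: str) -> str:
    """Span names take the form {service}.{operation}."""
    return f"{service_name}.{operation}"


def resolve_provider(explicit, global_provider):
    """Prefer the caller-supplied provider; otherwise use the global one."""
    return explicit if explicit is not None else global_provider
```

For example, with no configuration the `run` operation would be named `sandbox-client.run` under the scope `sandbox_client`.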

### `NewTracerProvider`: Optional OTLP Helper

For applications that don't already have a global `TracerProvider`, the SDK ships `NewTracerProvider(ctx, serviceName)`, which constructs an OTLP/gRPC exporter, merges a default resource with a `service.name` attribute, and returns a batching `*sdktrace.TracerProvider`. Ownership of `Shutdown` is left to the caller.

```go
// clients/go/sandbox/tracing.go
// NewTracerProvider creates a TracerProvider with an OTLP/gRPC exporter.
// The endpoint is read from OTEL_EXPORTER_OTLP_ENDPOINT (default: localhost:4317).
// serviceName becomes the service.name resource attribute.
// The caller owns the returned provider and must call Shutdown when done.
```

Sources: [clients/go/sandbox/tracing.go:47-72](clients/go/sandbox/tracing.go)

### Lifecycle Spans, Error Recording, and Carrier Injection

`Sandbox.Open` starts a `{svcName}.lifecycle` span as the per-instance root. Each operation (`Run`, `Read`, `Write`, `List`, `Exists`, `reconnect`, …) is parented to it via `startSpan(ctx, tracer, svcName, "<op>", attrs...)`, which formats span names as `{svcName}.{operation}`.

When an operation fails, `recordError(span, err)` calls `span.RecordError(err)` and `span.SetStatus(codes.Error, err.Error())`. When the SDK needs to hand the lineage to the controller (e.g., during claim creation), `traceContextJSON(ctx)` injects the active span into a `propagation.MapCarrier` and JSON-marshals it for the `opentelemetry.io/trace-context` annotation. The function returns `""` when no active span is present, so callers won't write a meaningless annotation when tracing is off.

The semantic attribute keys are centralized at the top of `tracing.go`:

| Constant | String key | Used by |
| --- | --- | --- |
| `AttrClaimName` | `sandbox.claim.name` | `create_claim` |
| `AttrCommand` | `sandbox.command` | `run` |
| `AttrExitCode` | `sandbox.exit_code` | `run` |
| `AttrFilePath` | `sandbox.file.path` | `read`, `write`, `list`, `exists` |
| `AttrFileSize` | `sandbox.file.size` | `read`, `write` |
| `AttrFileCount` | `sandbox.file.count` | `list` |
| `AttrFileExists` | `sandbox.file.exists` | `exists` |
| `AttrGatewayName` / `AttrGatewayNamespace` | `sandbox.gateway.{name,namespace}` | Gateway connection strategy |
| `AttrRequestID` | `sandbox.request_id` | HTTP requests |

These keys (and the lifecycle-as-parent invariant) are asserted in `clients/go/sandbox/tracing_test.go`, which uses `tracetest.NewInMemoryExporter` to verify that all `test-svc.*` spans share the lifecycle span's `SpanID` as parent and that the captured `SandboxClaim` carries a `traceparent` annotation.

Sources: [clients/go/sandbox/tracing.go:33-105](clients/go/sandbox/tracing.go), [clients/go/sandbox/sandbox.go:80-216](clients/go/sandbox/sandbox.go), [clients/go/sandbox/tracing_test.go:155-302](clients/go/sandbox/tracing_test.go)

## Python SDK Tracing (`k8s_agent_sandbox.trace_manager`)

The Python SDK mirrors the Go SDK's shape but adds two Python-specific concerns: OpenTelemetry is an *optional* dependency (graceful import failure must keep the SDK usable), and a process-wide `TracerProvider` is needed so that multiple `SandboxClient` instances share one exporter.

### Optional-Dependency Mock Layer

The module attempts the OTel imports inside a `try`; on `ImportError` it sets `OPENTELEMETRY_AVAILABLE = False` and defines `MockSpan`, `MockTracer`, `TraceStub`, `ContextStub`, and a mock `TraceContextTextMapPropagator`. `MockTracer.start_as_current_span` returns `contextlib.nullcontext()`, so existing `with tracer.start_as_current_span(...)` call sites in the SDK keep working without modification.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py
try:
    from opentelemetry import trace, context
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    ...
    OPENTELEMETRY_AVAILABLE = True
except ImportError:
    OPENTELEMETRY_AVAILABLE = False
    logging.debug("OpenTelemetry not installed; using MockTracer.")
    class MockSpan: ...
    class MockTracer:
        def start_as_current_span(self, *a, **k):
            return nullcontext()
    ...
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:32-107](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)

### `initialize_tracer`: Process-Wide Singleton

`initialize_tracer(service_name)` uses double-checked locking around a module-level `_TRACER_PROVIDER`. On first call (when OTel is installed and no provider is set), it builds a `Resource(attributes={"service.name": service_name})`, a `TracerProvider`, and a `BatchSpanProcessor(OTLPSpanExporter())`, installs it as the global provider, and registers `atexit` shutdown. On subsequent calls, if the requested `service_name` differs from the one already installed, a warning is logged and the existing provider is kept — the **first** client to initialize wins for the whole process.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py
with _TRACER_PROVIDER_LOCK:
    if _TRACER_PROVIDER is None:
        resource = Resource(attributes={"service.name": service_name})
        _TRACER_PROVIDER = TracerProvider(resource=resource)
        _TRACER_PROVIDER.add_span_processor(
            BatchSpanProcessor(OTLPSpanExporter())
        )
        trace.set_tracer_provider(_TRACER_PROVIDER)
        atexit.register(_TRACER_PROVIDER.shutdown)
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:109-161](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)
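
The excerpt above shows only the locked half of the pattern. A self-contained sketch of the full double-checked, first-caller-wins behavior might look like the following; it uses a plain dict as a stand-in for a real `TracerProvider`, and `initialize_provider` is a hypothetical name rather than the SDK's function.

```python
import logging
import threading
from typing import Optional

logger = logging.getLogger(__name__)

# Stand-in for a real TracerProvider; a dict keeps the sketch dependency-free.
_PROVIDER: Optional[dict] = None
_PROVIDER_LOCK = threading.Lock()


def initialize_provider(service_name: str) -> dict:
    """Create the process-wide provider once; later callers reuse it."""
    global _PROVIDER
    provider = _PROVIDER
    if provider is None:                 # first check, without the lock
        with _PROVIDER_LOCK:
            if _PROVIDER is None:        # second check, under the lock
                _PROVIDER = {"service.name": service_name}
            provider = _PROVIDER
    if provider["service.name"] != service_name:
        # The first initializer wins; a mismatch is only reported.
        logger.warning(
            "provider already initialized as %r; ignoring %r",
            provider["service.name"], service_name,
        )
    return provider
```

The unlocked first check keeps the common path cheap once the provider exists, while the second check under the lock prevents two racing threads from both constructing one.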

### `TracerManager`: Per-Client Lifecycle Context

Each `SandboxClient` instance owns a `TracerManager(service_name)` that:

- Builds a tracer with an instrumentation scope derived from `service_name` (dashes → underscores), matching the Go SDK convention.
- Holds a single `parent_span` named `{service_name}.lifecycle` and a `context_token` returned from `context.attach(...)`.
- Exposes `get_trace_context_json()` which uses `TraceContextTextMapPropagator` to inject the current context into a dict carrier and `json.dumps` it for the `opentelemetry.io/trace-context` annotation (returns `""` if the carrier is empty).

`create_tracer_manager(config)` is the factory used by `SandboxClient`. It short-circuits to `(None, None)` either when `config.enable_tracing` is false or when OTel is not installed; the calling code then treats `self.tracer is None` as "tracing disabled" and skips the decorators.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:219-271](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)

### `@trace_span` / `@async_trace_span` Decorators

These decorators wrap sync/async methods on objects that expose `self.tracer` and `self.trace_service_name`. The span name is composed at call time as `f"{self.trace_service_name}.{span_suffix}"`. When `self.tracer` is `None`, the wrapped function is invoked without any span — the safe path when tracing is disabled.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py
def trace_span(span_suffix):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            tracer = getattr(self, 'tracer', None)
            if not tracer:
                return func(self, *args, **kwargs)
            service_name = getattr(self, 'trace_service_name', 'sandbox-client')
            with tracer.start_as_current_span(f"{service_name}.{span_suffix}"):
                return func(self, *args, **kwargs)
        return wrapper
    return decorator
```

The async variant has the same shape with `async def wrapper` and `await func(...)`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:164-216](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)
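
For completeness, the async variant can be sketched as follows. This is an illustration written to match the description above, not a copy of the repository's code; `RecordingTracer` and `Executor` are hypothetical test doubles that avoid any OTel dependency.

```python
import asyncio
import functools
from contextlib import contextmanager


class RecordingTracer:
    """Hypothetical stand-in for an OTel tracer; records span names."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def start_as_current_span(self, name):
        self.spans.append(name)
        yield


def async_trace_span(span_suffix):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(self, *args, **kwargs):
            tracer = getattr(self, "tracer", None)
            if not tracer:
                # Tracing disabled: call through without creating a span.
                return await func(self, *args, **kwargs)
            service_name = getattr(self, "trace_service_name", "sandbox-client")
            with tracer.start_as_current_span(f"{service_name}.{span_suffix}"):
                return await func(self, *args, **kwargs)
        return wrapper
    return decorator


class Executor:
    """Hypothetical subordinate object exposing the two expected attributes."""

    def __init__(self, tracer, trace_service_name="sandbox-client"):
        self.tracer = tracer
        self.trace_service_name = trace_service_name

    @async_trace_span("run")
    async def run(self, command):
        await asyncio.sleep(0)
        return f"ran {command}"
```

Awaiting inside the `with` block keeps the span open for the full duration of the coroutine, which is the property that distinguishes this from wrapping the coroutine object itself.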

### Client Wiring (`SandboxTracerConfig`)

The Python SDK enables tracing through a Pydantic config:

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py
class SandboxTracerConfig(BaseModel):
    enable_tracing: bool = False
    trace_service_name: str = "sandbox-client"
```

`SandboxClient.__init__` calls `initialize_tracer(self.tracer_config.trace_service_name)` if enabled, then `create_tracer_manager(self.tracer_config)` to obtain `(tracing_manager, tracer)`. The pair is passed through to executors, file helpers, and other subordinates, which read `self.tracer` and `self.trace_service_name` for the decorators.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:29-82](clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py)

## Comparison Across Surfaces

| Concern | Controller (`internal/metrics`) | Go SDK (`clients/go/sandbox`) | Python SDK (`k8s_agent_sandbox`) |
| --- | --- | --- | --- |
| Owns global `TracerProvider`? | Yes — installs via `otel.SetTracerProvider`. | No — uses caller-provided or `otel.GetTracerProvider()`. | Yes — module-level singleton with `atexit` shutdown. |
| Enable switch | `--enable-tracing` CLI flag | `Options.TracerProvider != nil`, or a global provider installed by the application | `SandboxTracerConfig.enable_tracing` |
| Fallback when disabled | `metrics.NewNoOp()` Instrumenter | Global noop `TracerProvider` | `None` tracer; decorators short-circuit |
| Fallback when OTel missing | N/A (build-time dependency) | N/A (build-time dependency) | `MockSpan` / `MockTracer` / stubs |
| Exporter | `otlptracegrpc.New(ctx)` | `otlptracegrpc.New(ctx)` (via helper) | `OTLPSpanExporter()` with `BatchSpanProcessor` |
| Endpoint source | `OTEL_EXPORTER_OTLP_ENDPOINT` env | `OTEL_EXPORTER_OTLP_ENDPOINT` env (default `localhost:4317`) | Environment, as resolved by the OTel Python SDK |
| Propagator | W3C `TraceContext` only | W3C `TraceContext` only | `TraceContextTextMapPropagator` |
| Annotation key | `opentelemetry.io/trace-context` | `opentelemetry.io/trace-context` | `opentelemetry.io/trace-context` |
| Span name shape | `ReconcileSandbox`, `reconcilePod`, `reconcilePVCs` | `{TraceServiceName}.{op}`, e.g. `sandbox-client.run` | `{trace_service_name}.{op}`, e.g. `sandbox-client.run` |
| Instrumentation scope | `"agent-sandbox-controller"` | `strings.ReplaceAll(svcName, "-", "_")` | `service_name.replace('-', '_')` |
| Lifecycle root span | Per `Reconcile` invocation | `{svc}.lifecycle` per `Sandbox` instance | `{svc}.lifecycle` per `SandboxClient` |
| Shutdown | `defer cleanup()` in `main` | Caller-owned `tp.Shutdown(...)` | `atexit.register(_TRACER_PROVIDER.shutdown)` |

Sources: [internal/metrics/tracing.go:123-147](internal/metrics/tracing.go), [clients/go/sandbox/tracing.go:47-120](clients/go/sandbox/tracing.go), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:114-271](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py)

## End-to-End Span Flow

The sequence below shows how a single trace travels from a Python or Go client through the Kubernetes API to the controller and out to a collector, using the annotation as the only out-of-band carrier.

```mermaid
sequenceDiagram
  participant App as Client App
  participant SDK as SDK (Go / Python)
  participant K8s as Kubernetes API
  participant Ctrl as Controller (SandboxReconciler)
  participant Col as OTel Collector (OTLP/gRPC)

  App->>SDK: New / open SandboxClient
  SDK->>SDK: start "{svc}.lifecycle" span
  SDK->>SDK: traceContextJSON / get_trace_context_json
  SDK->>K8s: create SandboxClaim<br/>annotations[opentelemetry.io/trace-context] = JSON
  K8s-->>Ctrl: Reconcile(Sandbox)
  Ctrl->>Ctrl: StartSpan(obj, "ReconcileSandbox", attrs)<br/>extract carrier from annotation
  alt annotation missing
    Ctrl->>K8s: Patch Sandbox.annotations[trace-context] = GetTraceContext(ctx)
  end
  Ctrl->>Ctrl: AddEvent("NewPodStatusObserved", ...)
  Ctrl-->>Col: batched OTLP spans (controller scope)
  SDK->>SDK: child spans: run / read / write / list / exists
  SDK-->>Col: batched OTLP spans (SDK scope)
  App->>SDK: Close / end_lifecycle_span
  SDK->>SDK: span.End() + recordError (if any)
```

Sources: [controllers/sandbox_controller.go:160-186](controllers/sandbox_controller.go), [clients/go/sandbox/sandbox.go:210-233](clients/go/sandbox/sandbox.go), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py:237-257](clients/python/agentic-sandbox-client/k8s_agent_sandbox/trace_manager.py), [internal/metrics/tracing.go:66-105](internal/metrics/tracing.go)
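
The `alt annotation missing` branch in the diagram reduces to a small decision. The Python sketch below models it under stated assumptions: `plan_trace_context` is a hypothetical helper, annotations are a plain dict, and a malformed existing annotation is treated as "no parent, no patch", which the source does not specify.

```python
import json
from typing import Optional, Tuple

TRACE_CONTEXT_ANNOTATION = "opentelemetry.io/trace-context"


def plan_trace_context(
    annotations: dict, new_span_carrier: str
) -> Tuple[Optional[dict], Optional[dict]]:
    """Return (parent_carrier, patch).

    parent_carrier: carrier decoded from an existing annotation, else None.
    patch: annotations to write, or None when nothing should be patched.
    """
    existing = annotations.get(TRACE_CONTEXT_ANNOTATION, "")
    if existing:
        try:
            parent = json.loads(existing)
        except json.JSONDecodeError:
            parent = None  # assumption: ignore a malformed carrier
        # An annotation is already present, so it is never overwritten.
        return (parent if isinstance(parent, dict) else None), None
    if not new_span_carrier:
        # Tracing disabled: the no-op instrumenter yields an empty carrier.
        return None, None
    return None, {TRACE_CONTEXT_ANNOTATION: new_span_carrier}
```

The write happens at most once per object, which is what lets later reconciles and SDK clients attach to the same trace.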

## Collector Configuration

The Python client ships a reference collector config that documents the expected OTLP receivers on `0.0.0.0:4317` (gRPC) and `0.0.0.0:4318` (HTTP), a batch processor, and a single `googlecloud` exporter. It is a starting point: any exporter the Collector supports (Jaeger, Tempo, OTLP HTTP, debug) can replace it without code changes in this repository.

```yaml
# clients/python/agentic-sandbox-client/otel-collector-config.yaml.example
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  googlecloud:
    project: "YOUR-GCP-PROJECT-ID"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [googlecloud]
```

Pointing clients at a non-local collector is purely an environment-variable concern; for example, `OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector.default.svc.cluster.local:4317"` is the in-cluster pattern called out by `clients/python/agentic-sandbox-client/GCP.md`. The design is provider-neutral for one reason: none of the three surfaces (the controller, the Go SDK, and the Python SDK) imports a vendor SDK. They speak only OTLP, and the collector handles fan-out.

Sources: [clients/python/agentic-sandbox-client/otel-collector-config.yaml.example:1-22](clients/python/agentic-sandbox-client/otel-collector-config.yaml.example), [clients/python/agentic-sandbox-client/GCP.md:33-112](clients/python/agentic-sandbox-client/GCP.md)

## Summary

The repository implements OpenTelemetry tracing as three small, structurally similar modules that all speak OTLP/gRPC to an external collector and propagate context across the controller/SDK boundary through one JSON annotation (`opentelemetry.io/trace-context`). The controller hides span emission behind a four-method `Instrumenter` interface so it can run with a no-op when tracing is disabled; the Go SDK exposes `TracerProvider`/`TraceServiceName` options and a convenience `NewTracerProvider` helper but never touches OTel globals; the Python SDK guards optional imports with a complete mock layer and protects the process-wide `TracerProvider` with a double-checked-locked singleton. Four properties keep the integration portable across any OTLP-compatible backend: a provider-neutral exporter, W3C-only propagation, lifecycle-rooted spans, and annotation-based context handoff.

---

## 19. Go High-Level SDK (clients/go/sandbox)

> The high-level Go client: Sandbox lifecycle, command execution, file transfer, port tunnels, gateway, connector strategies, and tracing helpers.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/19-go-high-level-sdk-clients-go-sandbox.md
- Generated: 2026-05-25T22:42:19.203Z

### Source Files

- `clients/go/sandbox/sandbox.go`
- `clients/go/sandbox/client.go`
- `clients/go/sandbox/connector.go`
- `clients/go/sandbox/commands.go`
- `clients/go/sandbox/files.go`
- `clients/go/sandbox/tunnel.go`
- `clients/go/sandbox/gateway.go`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [clients/go/sandbox/sandbox.go](clients/go/sandbox/sandbox.go)
- [clients/go/sandbox/client.go](clients/go/sandbox/client.go)
- [clients/go/sandbox/connector.go](clients/go/sandbox/connector.go)
- [clients/go/sandbox/commands.go](clients/go/sandbox/commands.go)
- [clients/go/sandbox/files.go](clients/go/sandbox/files.go)
- [clients/go/sandbox/tunnel.go](clients/go/sandbox/tunnel.go)
- [clients/go/sandbox/gateway.go](clients/go/sandbox/gateway.go)
- [clients/go/sandbox/strategy.go](clients/go/sandbox/strategy.go)
- [clients/go/sandbox/options.go](clients/go/sandbox/options.go)
- [clients/go/sandbox/types.go](clients/go/sandbox/types.go)
- [clients/go/sandbox/tracing.go](clients/go/sandbox/tracing.go)
- [clients/go/sandbox/k8s.go](clients/go/sandbox/k8s.go)
</details>

# Go High-Level SDK (clients/go/sandbox)

The `sigs.k8s.io/agent-sandbox/clients/go/sandbox` package is the high-level Go client for the agent-sandbox project. It hides the Kubernetes API plumbing required to provision a sandbox (creating a `SandboxClaim`, waiting for a `Sandbox` to become ready), and exposes a small surface for running commands and moving files inside the sandbox over HTTP. It also encapsulates how the SDK reaches the in-cluster `sandbox-router`: directly, through a `Gateway` resource, or via an SPDY port-forward tunnel.

This page maps the package by responsibility: the `Sandbox` lifecycle object and the `Client` registry, the `connector` retry/transport layer, the three `ConnectionStrategy` implementations, the `Commands` and `Files` sub-objects, and the OpenTelemetry tracing helpers that thread spans through every operation.

## Package architecture

The package is layered: user code holds a `Sandbox` (or a `Client` that owns many). The `Sandbox` exposes `Commands` and `Files` sub-objects; both delegate to a shared `connector` that owns the HTTP client, retry policy, and request-ID instrumentation. The `connector` does not know how to find the sandbox; it asks a `ConnectionStrategy` for the base URL on `Connect`. Three strategies are wired by `New` based on `Options`, and a `K8sHelper` provides the Kubernetes clientsets used for claim creation, name resolution, readiness watches, and pod discovery.

```mermaid
flowchart LR
    subgraph user["User code"]
        App[Application]
    end
    subgraph sdk["clients/go/sandbox"]
        Client[Client<br/>registry + signals]
        Sandbox[Sandbox<br/>lifecycle + lock]
        Commands[Commands]
        Files[Files]
        Connector[connector<br/>HTTP + retry]
        subgraph strategies["ConnectionStrategy"]
            Direct[DirectStrategy]
            Gateway[gatewayStrategy]
            Tunnel[tunnelStrategy]
        end
        K8s[K8sHelper<br/>clientsets]
        Tracing[tracing.go<br/>spans + OTLP]
    end
    subgraph cluster["Kubernetes / sandbox-router"]
        Claim[SandboxClaim]
        SBox[Sandbox CR]
        GW[Gateway]
        Router[sandbox-router Pod]
    end

    App --> Client --> Sandbox
    App --> Sandbox
    Sandbox --> Commands --> Connector
    Sandbox --> Files --> Connector
    Sandbox --> K8s
    Connector --> Direct
    Connector --> Gateway
    Connector --> Tunnel
    K8s --> Claim
    K8s --> SBox
    Gateway --> GW
    Tunnel --> Router
    Direct --> Router
    Commands -. spans .-> Tracing
    Files -. spans .-> Tracing
    Sandbox -. lifecycle span .-> Tracing
```

Sources: [clients/go/sandbox/sandbox.go:32-169](clients/go/sandbox/sandbox.go), [clients/go/sandbox/client.go:32-76](clients/go/sandbox/client.go), [clients/go/sandbox/connector.go:54-119](clients/go/sandbox/connector.go), [clients/go/sandbox/strategy.go:20-32](clients/go/sandbox/strategy.go)

## Sandbox lifecycle

`Sandbox` is the central object. Construction (`New`) only configures dependencies; the cluster is not touched until `Open(ctx)` is called. `Open` serializes through a single-slot `lifecycleSem` channel so concurrent `Open`/`Close`/`Disconnect` cannot interleave, registers an `openCancel` so a parallel `Close` or `Disconnect` can preempt a stuck `Open`, and starts a long-lived lifecycle span that becomes the parent of all subsequent operation spans.

The first `Open` follows the create path: `createClaim` → `resolveSandboxName` (warm-pool aware: the resolved sandbox name may differ from the generated claim name) → `waitForSandboxReady` → `connector.Connect`. The remaining ready-timeout budget is reduced by the time spent resolving the name so the wait cannot overrun. If any step after claim creation fails, `rollbackOpen` deletes the claim on a detached `CleanupTimeout` context and either clears or preserves `claimName` depending on whether deletion succeeded. A preserved `claimName` is what causes the next `Open` to take the reconnect path via `ErrOrphanedClaim` retry semantics.

`Close` first cancels any in-progress `Open`, then tears down the transport before flipping `draining = true` and swapping the in-flight `WaitGroup`. The ordering is deliberate: closing the transport first makes new operations fail fast with `ErrNotReady`, so no operation can slip past `trackOp` and still reach a live connection. The drain budget is half of `CleanupTimeout`; the other half is reserved for claim deletion. `Disconnect` is the suspend variant — it releases the transport but does not delete the claim, and notably does not call `connector.Close()` on a timed-out semaphore acquisition because doing so would race with a just-succeeding `Open`.

```mermaid
stateDiagram-v2
    [*] --> New: sandbox.New(opts)
    New --> Opening: Open(ctx)
    Opening --> Ready: createClaim<br/>resolveSandboxName<br/>waitForSandboxReady<br/>connector.Connect
    Opening --> Orphaned: failure after claim create<br/>(rollbackOpen could not delete)
    Opening --> Failed: failure before claim<br/>or rollback succeeded
    Orphaned --> Opening: Open() retries via reconnect path
    Orphaned --> Closed: Close() succeeds
    Ready --> Disconnected: Disconnect(ctx)
    Disconnected --> Opening: Open(ctx) → reconnect()
    Ready --> Draining: Close(ctx)
    Draining --> Closed: drain + deleteClaim
    Failed --> [*]
    Closed --> [*]
```

Sources: [clients/go/sandbox/sandbox.go:175-275](clients/go/sandbox/sandbox.go), [clients/go/sandbox/sandbox.go:277-336](clients/go/sandbox/sandbox.go), [clients/go/sandbox/sandbox.go:338-364](clients/go/sandbox/sandbox.go), [clients/go/sandbox/sandbox.go:367-503](clients/go/sandbox/sandbox.go), [clients/go/sandbox/k8s.go:123-269](clients/go/sandbox/k8s.go)
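
The two timeout budgets described above are simple arithmetic. The sketch below works through them in Python; the function names are hypothetical, and clamping the remaining budget at zero is an assumption rather than documented behavior.

```python
from typing import Tuple


def remaining_ready_budget(ready_timeout_s: float, resolve_elapsed_s: float) -> float:
    """Ready-wait budget left after name resolution has consumed some time.

    Clamped at zero (an assumption) so an overlong resolve cannot produce
    a negative timeout.
    """
    return max(0.0, ready_timeout_s - resolve_elapsed_s)


def close_budgets(cleanup_timeout_s: float) -> Tuple[float, float]:
    """Split CleanupTimeout into (drain budget, claim-deletion budget)."""
    drain = cleanup_timeout_s / 2
    return drain, cleanup_timeout_s - drain
```

With a 180-second ready timeout and 12.5 seconds spent resolving the sandbox name, 167.5 seconds remain for the readiness wait; a 30-second `CleanupTimeout` splits into 15 seconds for draining and 15 seconds for claim deletion.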

### In-flight operation tracking

Per-call methods on `Commands` and `Files` call `defer trackOp()()` on entry. `trackOp` reads `draining` and the current `inflightOps` under `s.mu`: if draining, it returns a no-op; otherwise it adds to the WaitGroup. `Close` swaps `inflightOps` for a fresh `WaitGroup` while marking `draining = true` under the same lock, so the two states are observed atomically. This is the only mechanism the SDK uses to guarantee that operations cannot slip into a sandbox that is about to delete its claim.
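
The pattern described above can be sketched in a few lines. This is a simplified model under stated assumptions: the `sandbox` type, `newSandbox`, and `beginDrain` are hypothetical stand-ins, and only the `draining` flag, the swappable `WaitGroup`, and the shared mutex mirror the text.

```go
package main

import (
    "fmt"
    "sync"
)

// sandbox is a hypothetical stand-in for the SDK's Sandbox type; only the
// fields needed to illustrate operation tracking are modelled.
type sandbox struct {
    mu          sync.Mutex
    draining    bool
    inflightOps *sync.WaitGroup
}

func newSandbox() *sandbox {
    return &sandbox{inflightOps: &sync.WaitGroup{}}
}

// trackOp registers an in-flight operation and returns the function that
// marks it done. While draining it returns a no-op, so late callers never
// join the group that Close is waiting on.
func (s *sandbox) trackOp() func() {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.draining {
        return func() {}
    }
    wg := s.inflightOps
    wg.Add(1)
    return wg.Done
}

// beginDrain sets draining and swaps in a fresh WaitGroup under a single
// lock acquisition, returning the old group for the caller to wait on.
func (s *sandbox) beginDrain() *sync.WaitGroup {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.draining = true
    old := s.inflightOps
    s.inflightOps = &sync.WaitGroup{}
    return old
}

func main() {
    s := newSandbox()
    done := s.trackOp() // admitted before the drain starts
    old := s.beginDrain()
    late := s.trackOp() // arrives while draining: receives a no-op
    late()
    done()
    old.Wait() // returns once the single tracked operation has finished
    fmt.Println("drained")
}
```

Because the flag and the `WaitGroup` pointer change under the same lock, a caller either joins the old group before the swap or sees `draining` and joins nothing.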

Sources: [clients/go/sandbox/sandbox.go:404-414](clients/go/sandbox/sandbox.go), [clients/go/sandbox/sandbox.go:574-589](clients/go/sandbox/sandbox.go)

## Client: multi-sandbox registry

`Client` is a thin manager over many `Sandbox` instances. It keeps a `map[Key]*Sandbox` keyed by `{Namespace, ClaimName}` and shares a single `K8sHelper` (so the underlying clientsets and REST config are not rebuilt per sandbox). `CreateSandbox` provisions and opens a new one, registering it on success. `GetSandbox` returns a cached, ready handle, evicts a stale one, or constructs a new `Sandbox` and forces the reconnect path by pre-populating `claimName` and `sandboxName` before `Open`.

`EnableAutoCleanup` installs a SIGINT/SIGTERM handler that calls `DeleteAll` and then re-raises the signal so the default handler can terminate the process. The returned `stop` function detaches the handler.

| Method | Effect |
|---|---|
| `CreateSandbox(ctx, template, ns)` | New claim + open + register |
| `GetSandbox(ctx, claim, ns)` | Cached if ready; else verify + reconnect |
| `ListActiveSandboxes()` | Tracked handles, prunes unready entries |
| `ListAllSandboxes(ctx, ns)` | Lists all `SandboxClaim` names in namespace |
| `DeleteSandbox(ctx, claim, ns)` | Close if tracked, else delete claim directly |
| `DeleteAll(ctx)` | Best-effort close of every tracked handle |
| `EnableAutoCleanup()` | SIGINT/SIGTERM → `DeleteAll` |
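
The cached, evict, or reconnect decision in `GetSandbox` can be sketched as a locked map lookup. This is an illustrative reduction, not the SDK's implementation: `handle`, `registry`, `newRegistry`, and the `reconnect` hook are hypothetical names, and holding the lock across the reconnect is a simplification.

```go
package main

import (
    "fmt"
    "sync"
)

// Key identifies a sandbox by namespace and claim name, as described above.
type Key struct {
    Namespace string
    ClaimName string
}

// handle is a hypothetical stand-in for *Sandbox.
type handle struct {
    key   Key
    ready bool
}

// registry is a hypothetical reduction of Client: a locked map plus a
// reconnect hook that plays the role of "construct a Sandbox and Open it".
type registry struct {
    mu        sync.Mutex
    handles   map[Key]*handle
    reconnect func(Key) (*handle, error)
}

func newRegistry(reconnect func(Key) (*handle, error)) *registry {
    return &registry{handles: make(map[Key]*handle), reconnect: reconnect}
}

// get returns the handle for k and whether it came from the cache.
// The lock is held for the whole call to keep the sketch simple; a real
// client would likely avoid holding it across network I/O.
func (r *registry) get(k Key) (*handle, bool, error) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if h, ok := r.handles[k]; ok {
        if h.ready {
            return h, true, nil
        }
        delete(r.handles, k) // evict the stale handle
    }
    h, err := r.reconnect(k)
    if err != nil {
        return nil, false, err
    }
    r.handles[k] = h
    return h, false, nil
}

func main() {
    r := newRegistry(func(k Key) (*handle, error) {
        return &handle{key: k, ready: true}, nil
    })
    k := Key{Namespace: "default", ClaimName: "claim-1"}
    _, cached, _ := r.get(k)
    fmt.Println("first call cached:", cached)
    _, cached, _ = r.get(k)
    fmt.Println("second call cached:", cached)
}
```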

Sources: [clients/go/sandbox/client.go:38-231](clients/go/sandbox/client.go), [clients/go/sandbox/client.go:235-265](clients/go/sandbox/client.go)

## Connection strategies

`ConnectionStrategy` is the seam between the SDK and the cluster topology. Each strategy returns a `baseURL`; the connector then composes requests by appending an endpoint and setting the routing headers.

```text
                         +-----------------+
Options.APIURL       ->  | DirectStrategy  |  returns Options.APIURL verbatim
                         +-----------------+
Options.GatewayName  ->  | gatewayStrategy |  watches Gateway.status.addresses[0].value
                         +-----------------+  builds http(s)://<addr> after validation
default              ->  | tunnelStrategy  |  SPDY port-forward to a router endpoint pod
                         +-----------------+  base URL = http://127.0.0.1:<local-port>
```

`New` picks one in `clients/go/sandbox/sandbox.go:83-109`: `APIURL` wins (advanced/test mode), then `GatewayName` (production routing), otherwise port-forward (developer mode). After `Connect`, the connector logs the discovered URL with a mode label so logs unambiguously identify which path was taken.

Sources: [clients/go/sandbox/sandbox.go:83-126](clients/go/sandbox/sandbox.go), [clients/go/sandbox/strategy.go:20-32](clients/go/sandbox/strategy.go), [clients/go/sandbox/connector.go:128-147](clients/go/sandbox/connector.go)

### DirectStrategy

The simplest case: a pre-configured URL passed in `Options.APIURL`. `Options.validate` enforces an `http`/`https` scheme and a non-empty host. Useful for tests against `httptest` servers and for environments that already expose a reachable router URL.

Sources: [clients/go/sandbox/strategy.go:26-31](clients/go/sandbox/strategy.go), [clients/go/sandbox/options.go:242-253](clients/go/sandbox/options.go)

### gatewayStrategy

`gatewayStrategy` lists then watches a Gateway resource (`gateway.networking.k8s.io/v1`, plural `gateways`) by name within `GatewayNamespace`. It loops list→watch with exponential backoff capped at 5 s and re-lists with a cleared `ResourceVersion` after a watch closes. `extractGatewayAddress` reads `status.addresses[0].value` and validates it as either an IP or a hostname matching `[a-zA-Z0-9.-]` (no empty labels, no leading/trailing dot or dash). Addresses containing `/`, `?`, `#`, or `@` are rejected to prevent SSRF via a compromised Gateway. IPv6 addresses are wrapped in brackets in `formatURL`. Watch events of type `Deleted` return `ErrGatewayDeleted`.
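
The address checks described above can be approximated with the standard library. The sketch below is not the SDK's `extractGatewayAddress` or `formatURL`: `validateAddress` and `bracketedURL` are hypothetical names, and the exact error wording and rule order are assumptions.

```go
package main

import (
    "errors"
    "fmt"
    "net"
    "strings"
)

// isHostnameChar reports whether c is in the allowed set [a-zA-Z0-9.-].
func isHostnameChar(c byte) bool {
    return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' ||
        c >= '0' && c <= '9' || c == '.' || c == '-'
}

// validateAddress accepts an IP literal or a conservative hostname and
// rejects anything that could smuggle a path, query, fragment, or userinfo.
func validateAddress(addr string) error {
    if addr == "" {
        return errors.New("empty gateway address")
    }
    if strings.ContainsAny(addr, "/?#@") {
        return fmt.Errorf("gateway address %q contains a URL delimiter", addr)
    }
    if net.ParseIP(addr) != nil {
        return nil
    }
    if strings.HasPrefix(addr, "-") || strings.HasSuffix(addr, "-") {
        return fmt.Errorf("gateway address %q starts or ends with a dash", addr)
    }
    // An empty label also covers a leading or trailing dot.
    for _, label := range strings.Split(addr, ".") {
        if label == "" {
            return fmt.Errorf("gateway address %q has an empty label", addr)
        }
    }
    for i := 0; i < len(addr); i++ {
        if !isHostnameChar(addr[i]) {
            return fmt.Errorf("gateway address %q has an invalid character", addr)
        }
    }
    return nil
}

// bracketedURL builds scheme://addr, wrapping IPv6 literals in brackets.
func bracketedURL(scheme, addr string) string {
    if net.ParseIP(addr) != nil && strings.Contains(addr, ":") {
        return scheme + "://[" + addr + "]"
    }
    return scheme + "://" + addr
}

func main() {
    for _, addr := range []string{"10.0.0.7", "2001:db8::1", "gw.example.com", "evil.test/x"} {
        if err := validateAddress(addr); err != nil {
            fmt.Println("rejected:", err)
            continue
        }
        fmt.Println(bracketedURL("http", addr))
    }
}
```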

Sources: [clients/go/sandbox/gateway.go:33-128](clients/go/sandbox/gateway.go), [clients/go/sandbox/gateway.go:168-232](clients/go/sandbox/gateway.go)

### tunnelStrategy

The developer-mode path. `tunnelStrategy.Connect` resolves a ready `sandbox-router-svc` endpoint pod through `EndpointSlices` (label selector `kubernetes.io/service-name=sandbox-router-svc`), opens an SPDY round-tripper using `spdy.RoundTripperFor`, wraps it in a `trackingDialer` (so a stop-during-dial can force-close the connection), and asks `client-go`'s `portforward.New` to forward `0:8080`. The local port is read back via `fw.GetPorts()` and turned into `http://127.0.0.1:<local>`.

A background `monitorPortForward` goroutine waits on either the port-forward error channel or the stop channel. When the port-forward dies unexpectedly, it calls `connector.SetLastError` with `ErrPortForwardDied` plus up to 256 bytes of stderr; the connector then surfaces that error on the next `SendRequest` instead of a bare `ErrNotReady`. The stderr buffer is a `syncBuffer` capped at 64 KB to bound memory under chatty failures.
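
A capped, mutex-guarded buffer of the kind described above might look like the following. It is a sketch only: `cappedBuffer` and `Snippet` are hypothetical names rather than the SDK's `syncBuffer` API, and whether the SDK reports the head or the tail of stderr is not stated here, so taking the head is an assumption.

```go
package main

import (
    "bytes"
    "fmt"
    "sync"
)

const maxStderrBytes = 64 * 1024 // 64 KB cap, as stated above

// cappedBuffer is an io.Writer that is safe for concurrent use and stops
// growing once it holds maxStderrBytes.
type cappedBuffer struct {
    mu  sync.Mutex
    buf bytes.Buffer
}

// Write stores as much of p as still fits and drops the rest. It always
// reports len(p) so the writer never sees a short-write error.
func (b *cappedBuffer) Write(p []byte) (int, error) {
    b.mu.Lock()
    defer b.mu.Unlock()
    if room := maxStderrBytes - b.buf.Len(); room > 0 {
        if len(p) > room {
            b.buf.Write(p[:room])
        } else {
            b.buf.Write(p)
        }
    }
    return len(p), nil
}

// Snippet returns at most n bytes from the start of the buffer (assumption:
// the head is used; the tail would be an equally plausible choice).
func (b *cappedBuffer) Snippet(n int) string {
    b.mu.Lock()
    defer b.mu.Unlock()
    data := b.buf.Bytes()
    if len(data) > n {
        data = data[:n]
    }
    return string(data)
}

func main() {
    var b cappedBuffer
    fmt.Fprintf(&b, "error: lost connection to pod\n")
    b.Write(make([]byte, 100*1024)) // a chatty failure; the excess is dropped
    fmt.Printf("port-forward died: %q\n", b.Snippet(29))
}
```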

```mermaid
sequenceDiagram
    participant U as Sandbox.Open
    participant T as tunnelStrategy
    participant K as Kubernetes API
    participant PF as portforward.ForwardPorts
    participant C as connector

    U->>T: Connect(ctx)
    T->>K: List EndpointSlices for sandbox-router-svc
    K-->>T: pod = first ready endpoint
    T->>K: POST .../pods/<pod>/portforward (SPDY upgrade)
    T->>PF: New(trackingDialer, [0:8080], stop, ready)
    PF->>K: Stream port-forward
    PF-->>T: readyChan closed
    T->>T: fw.GetPorts() → local=<random>
    T-->>U: "http://127.0.0.1:<local>"
    T->>T: go monitorPortForward(...)
    Note over PF,C: tunnel later dies
    PF-->>T: errChan ← err
    T->>C: SetLastError(ErrPortForwardDied: err)
```

Sources: [clients/go/sandbox/tunnel.go:109-256](clients/go/sandbox/tunnel.go), [clients/go/sandbox/tunnel.go:261-335](clients/go/sandbox/tunnel.go), [clients/go/sandbox/tunnel.go:38-70](clients/go/sandbox/tunnel.go)

## connector: HTTP transport and retries

`connector` owns the `*http.Client`, the discovered base URL, retry state, and the routing headers attached to every request. It is constructed once per `Sandbox` and shared by both `Commands` and `Files`.

The default transport applies `PerAttemptTimeout` to both the dial (via `DialContext`) and `ResponseHeaderTimeout`, and allows 100 idle connections in total and 10 per host, with a 90 s idle timeout and a 10 s TLS handshake timeout. A caller-supplied `HTTPTransport` bypasses these defaults; in that case `ownsTransport` is false and `Close` will not call `CloseIdleConnections` on it.

`SendRequest` is the single retry engine:

1. If the caller's ctx has no deadline, it wraps it with `RequestTimeout`.
2. A request ID is generated and stamped onto the current span as `sandbox.request_id`.
3. Up to `maxAttempts` (default 6), it: reads `baseURL`/`sandboxID`/`namespace`/`serverPort` under `c.mu`, builds the URL by trimming slashes, attaches headers (`X-Sandbox-ID`, `X-Sandbox-Namespace`, `X-Sandbox-Port`, `X-Request-ID`, `Content-Type`, W3C trace context), and runs `c.httpClient.Do` under a per-attempt `time.AfterFunc(PerAttemptTimeout, attemptCancel)`.
4. Retries on transport error and on `500/502/503/504`. Non-retryable status codes are returned as `HTTPError`. If a retry is needed but the request body is not an `io.Seeker`, the body cannot be replayed, so `SendRequest` returns early with a clear error.
5. Backoff is `baseBackoff * 2^(attempt-1)` (500 ms, doubling up to the 8 s cap) with uniform jitter of ±25 %. `backoffScale` is a test hook that compresses sleeps to milliseconds in unit tests.
6. The successful response body is wrapped in `cancelOnClose` so closing the body cancels both the per-attempt context and the request-wide timeout, preventing leaked goroutines on long-lived bodies.
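
The backoff schedule in step 5 can be written out as follows. This is a sketch under stated assumptions: `backoffFor` and `jittered` are hypothetical helpers, the cap is applied before the jitter, and the random draw is passed in as `u` so the arithmetic stays deterministic.

```go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

const (
    maxAttempts = 6
    baseBackoff = 500 * time.Millisecond
    maxBackoff  = 8 * time.Second
)

// backoffFor returns the un-jittered sleep after the given failed attempt
// (1-based): baseBackoff * 2^(attempt-1), capped at maxBackoff.
func backoffFor(attempt int) time.Duration {
    d := baseBackoff
    for i := 1; i < attempt; i++ {
        d *= 2
        if d >= maxBackoff {
            return maxBackoff
        }
    }
    return d
}

// jittered scales d by a factor in [0.75, 1.25) for u drawn from [0, 1).
func jittered(d time.Duration, u float64) time.Duration {
    return time.Duration(float64(d) * (0.75 + 0.5*u))
}

func main() {
    // With 6 attempts there are at most 5 sleeps between them.
    for attempt := 1; attempt < maxAttempts; attempt++ {
        base := backoffFor(attempt)
        fmt.Printf("after attempt %d: base %v, jittered %v\n",
            attempt, base, jittered(base, rand.Float64()))
    }
}
```

With these constants the un-jittered sleeps are 500 ms, 1 s, 2 s, 4 s, and 8 s, so a fully exhausted request spends roughly 15.5 s sleeping before jitter.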

Two reliability invariants are encoded inline:

- The "per-attempt timer fires between `Do` returning and `Stop`" race is detected by re-checking `attemptCtx.Err()`. If it fires after a successful `Do`, the response body is unusable, so the body is drained, closed, and the attempt is retried.
- Tunnel death is surfaced via `c.lastError` (set by `tunnelStrategy.monitorPortForward`). When `baseURL == ""` and `lastError != nil`, requests fail with `ErrNotReady: <port-forward error>` instead of a generic message.

Both the success and failure paths drain up to `maxDrainBytes` (4 KiB) before closing — this is what allows `net/http`'s transport to reuse the underlying TCP connection.
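
A bounded drain of this kind is a short idiom in Go. The sketch below assumes only the 4 KiB limit from the text; `drainAndClose` and `fakeBody` are hypothetical helper names.

```go
package main

import (
    "fmt"
    "io"
    "strings"
)

const maxDrainBytes = 4 << 10 // 4 KiB

// drainAndClose discards up to maxDrainBytes from body and then closes it.
// Reading a short body to EOF lets net/http return the connection to the
// idle pool; the limit keeps a huge body from stalling the caller.
func drainAndClose(body io.ReadCloser) (int64, error) {
    n, _ := io.CopyN(io.Discard, body, maxDrainBytes) // io.EOF is expected for short bodies
    return n, body.Close()
}

// fakeBody builds an in-memory response body of the given size.
func fakeBody(size int) io.ReadCloser {
    return io.NopCloser(strings.NewReader(strings.Repeat("x", size)))
}

func main() {
    small, _ := drainAndClose(fakeBody(100))
    large, _ := drainAndClose(fakeBody(1 << 20))
    fmt.Println(small, large) // prints: 100 4096
}
```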

| Constant | Value | Role |
|---|---|---|
| `maxAttempts` | 6 | retry cap |
| `baseBackoff` | 500 ms | first non-zero sleep |
| `maxBackoff` | 8 s | per-attempt sleep cap |
| `maxDrainBytes` | 4 KiB | body drain for keepalive |
| `retryableStatusCodes` | 500, 502, 503, 504 | status codes that trigger a retry |

Sources: [clients/go/sandbox/connector.go:37-119](clients/go/sandbox/connector.go), [clients/go/sandbox/connector.go:129-184](clients/go/sandbox/connector.go), [clients/go/sandbox/connector.go:201-377](clients/go/sandbox/connector.go), [clients/go/sandbox/types.go:33-37](clients/go/sandbox/types.go)

## Commands sub-object

`Commands.Run` POSTs `{"command": "<cmd>"}` to `execute`, decodes the JSON response into `ExecutionResult{Stdout, Stderr, ExitCode}`, and bounds the decode at `maxExecutionResponseSize = 16 MB`. If decoding hits the limit (`lr.N <= 0`), the error is wrapped as `ErrResponseTooLarge` rather than a generic JSON error.
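
The bounded decode can be sketched with `io.LimitedReader`. This is an illustration rather than the SDK's code: `decodeBounded` and `decodeString` are hypothetical helpers, the JSON field names are assumptions, and the limit is a parameter so that the oversize path is easy to exercise.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "io"
    "strings"
)

// ErrResponseTooLarge mirrors the sentinel named above.
var ErrResponseTooLarge = errors.New("response too large")

// ExecutionResult uses assumed JSON field names.
type ExecutionResult struct {
    Stdout   string `json:"stdout"`
    Stderr   string `json:"stderr"`
    ExitCode int    `json:"exit_code"`
}

// decodeBounded decodes one JSON value from r while reading at most limit
// bytes. If decoding fails with the budget used up, the failure is
// reported as ErrResponseTooLarge instead of a generic JSON error.
func decodeBounded(r io.Reader, limit int64) (*ExecutionResult, error) {
    lr := &io.LimitedReader{R: r, N: limit}
    var res ExecutionResult
    if err := json.NewDecoder(lr).Decode(&res); err != nil {
        if lr.N <= 0 {
            return nil, fmt.Errorf("%w: %v", ErrResponseTooLarge, err)
        }
        return nil, fmt.Errorf("decode execution result: %w", err)
    }
    return &res, nil
}

// decodeString is a convenience wrapper for in-memory bodies.
func decodeString(s string, limit int64) (*ExecutionResult, error) {
    return decodeBounded(strings.NewReader(s), limit)
}

func main() {
    res, err := decodeString(`{"stdout":"ok\n","stderr":"","exit_code":0}`, 1024)
    fmt.Println(res.ExitCode, err)

    _, err = decodeString(`{"stdout":"this body is longer than the limit"}`, 16)
    fmt.Println(errors.Is(err, ErrResponseTooLarge))
}
```

One consequence of this check is that a malformed body whose length happens to reach the limit is also reported as too large.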

Because execution is non-idempotent, the default is a single attempt. Callers that know their command is safe to retry opt in via `WithMaxAttempts(n)`:

```go
// clients/go/sandbox/commands.go:51
result, err := s.Run(ctx, "cat /etc/hostname", sandbox.WithMaxAttempts(6))
```

Per-call options are applied through `applyCallOpts`, which derives an optional `context.WithTimeout` from `WithTimeout(d)` and forwards `maxAttempts`. A `WithTimeout` of zero leaves the caller's deadline (or the connector's `RequestTimeout`) in charge.

Sources: [clients/go/sandbox/commands.go:29-97](clients/go/sandbox/commands.go), [clients/go/sandbox/types.go:68-95](clients/go/sandbox/types.go), [clients/go/sandbox/files.go:59-69](clients/go/sandbox/files.go)

## Files sub-object

`Files` exposes `Write`, `Read`, `List`, and `Exists`. The defaults differ from `Commands` because file operations on the server are idempotent: `maxAttempts` falls through to the connector default of 6.

| Method | HTTP | Endpoint | Notable behavior |
|---|---|---|---|
| `Write` | POST | `upload` (multipart) | Rejects content over `MaxUploadSize` *before* I/O; rejects non-plain filenames (no separators, no `.`/`..`) |
| `Read` | GET | `download/<percent-encoded-path>` | Limits body to `MaxDownloadSize+1`; returns oversize error if exceeded |
| `List` | GET | `list/<percent-encoded-path>` | 8 MB JSON cap; filters out entries that are neither `file` nor `directory` |
| `Exists` | GET | `exists/<percent-encoded-path>` | 8 MB JSON cap; returns `{exists: bool}` |

Path encoding goes through `percentEncode`, which encodes everything outside RFC 3986 unreserved (`A-Za-z0-9-_.~`) — including `/`. The server therefore receives the full path as a single opaque path segment, eliminating any router-level path traversal.
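
A minimal encoder with that behavior is shown below. It is a sketch of the technique, not a copy of the SDK's function: the helper name `isUnreserved` and the uppercase hex digits are assumptions.

```go
package main

import (
    "fmt"
    "strings"
)

// isUnreserved reports whether c is an RFC 3986 unreserved character.
func isUnreserved(c byte) bool {
    return c >= 'A' && c <= 'Z' || c >= 'a' && c <= 'z' ||
        c >= '0' && c <= '9' || c == '-' || c == '_' || c == '.' || c == '~'
}

// percentEncode escapes every byte outside the unreserved set, including
// '/', so the result is always a single URL path segment.
func percentEncode(s string) string {
    const hex = "0123456789ABCDEF"
    var b strings.Builder
    for i := 0; i < len(s); i++ {
        c := s[i]
        if isUnreserved(c) {
            b.WriteByte(c)
            continue
        }
        b.WriteByte('%')
        b.WriteByte(hex[c>>4])
        b.WriteByte(hex[c&0x0F])
    }
    return b.String()
}

func main() {
    fmt.Println("download/" + percentEncode("/tmp/my report.txt"))
    fmt.Println("download/" + percentEncode("../etc/passwd"))
}
```

Note that `..` survives encoding because dots are unreserved, but with every `/` escaped it can no longer form a separate path segment.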

`Write` buffers the entire upload in memory as a `multipart` body. This is by design: a `*bytes.Reader` is an `io.Seeker`, so the connector's retry loop can `Seek(0, io.SeekStart)` before re-sending. A streaming body would have made the first failure terminal.
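
The retry-friendly upload body can be sketched as follows. `buildUpload` and `rewind` are hypothetical helpers, and the multipart field name `file` is an assumption; only the buffering and the `Seek(0, io.SeekStart)` step come from the description above.

```go
package main

import (
    "bytes"
    "errors"
    "fmt"
    "io"
    "mime/multipart"
)

// buildUpload assembles the whole multipart body in memory and returns it
// as a *bytes.Reader, which implements io.Seeker.
func buildUpload(filename string, content []byte) (*bytes.Reader, string, error) {
    var buf bytes.Buffer
    w := multipart.NewWriter(&buf)
    part, err := w.CreateFormFile("file", filename) // field name is an assumption
    if err != nil {
        return nil, "", err
    }
    if _, err := part.Write(content); err != nil {
        return nil, "", err
    }
    if err := w.Close(); err != nil {
        return nil, "", err
    }
    return bytes.NewReader(buf.Bytes()), w.FormDataContentType(), nil
}

// rewind prepares body for another attempt. A body that cannot seek has
// already been consumed, so it cannot be sent again.
func rewind(body io.Reader) error {
    seeker, ok := body.(io.Seeker)
    if !ok {
        return errors.New("request body is not seekable; cannot retry")
    }
    _, err := seeker.Seek(0, io.SeekStart)
    return err
}

func main() {
    body, contentType, err := buildUpload("notes.txt", []byte("hello"))
    if err != nil {
        fmt.Println("build failed:", err)
        return
    }
    first, _ := io.ReadAll(body) // first attempt consumes the body
    if err := rewind(body); err != nil {
        fmt.Println("rewind failed:", err)
        return
    }
    second, _ := io.ReadAll(body) // the retry sees identical bytes
    fmt.Println(contentType != "", bytes.Equal(first, second))
}
```

The trade-off is memory: the full upload, bounded by `MaxUploadSize`, is held in RAM for the duration of the call.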

Sources: [clients/go/sandbox/files.go:32-69](clients/go/sandbox/files.go), [clients/go/sandbox/files.go:90-277](clients/go/sandbox/files.go), [clients/go/sandbox/connector.go:241-253](clients/go/sandbox/connector.go), [clients/go/sandbox/types.go:122-143](clients/go/sandbox/types.go)

## Options and validation

`Options` is the single configuration struct. `setDefaults` fills in safe defaults (including a `funcr.New` stderr logger unless `Quiet` is set), and `validate` then checks the resulting values. `TemplateName` is required and validated as a DNS subdomain. `Namespace` and `GatewayNamespace` are validated as DNS labels (at most 63 characters, no dots). All durations and size limits must be strictly positive.

| Field | Default | Purpose |
|---|---|---|
| `TemplateName` | required | `SandboxTemplate` to clone |
| `Namespace` | `default` | where to create the `SandboxClaim` |
| `GatewayName` | — | enables `gatewayStrategy` |
| `GatewayScheme` | `http` | scheme used to build URL from address |
| `APIURL` | — | enables `DirectStrategy`; wins over Gateway |
| `ServerPort` | 8888 | value sent as `X-Sandbox-Port` |
| `SandboxReadyTimeout` | 180 s | budget for resolve + ready |
| `GatewayReadyTimeout` | 180 s | per-`Gateway` discovery |
| `PortForwardReadyTimeout` | 30 s | `readyChan` wait + SPDY HTTP client timeout |
| `CleanupTimeout` | 30 s | detached drain + claim delete (`Close` allows 2× this when acquiring the semaphore) |
| `RequestTimeout` | 180 s | wraps caller ctx when deadline-less |
| `PerAttemptTimeout` | 60 s | per-attempt dial and response-header timeout |
| `MaxDownloadSize` | 256 MB | cap for `Read` |
| `MaxUploadSize` | 256 MB | cap for `Write` |
| `HTTPTransport` | — | override transport (skips idle-conn close on `Close`) |
| `TraceServiceName` | `sandbox-client` | OTel instrumentation scope + resource attribute |
| `TracerProvider` | `otel.GetTracerProvider()` | source of spans |
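
The two name checks can be approximated with the RFC 1123 patterns that Kubernetes uses for object names. Whether the SDK calls a library helper or carries its own checks is not shown on this page, so `isDNSLabel` and `isDNSSubdomain` below are hypothetical stand-ins.

```go
package main

import (
    "fmt"
    "regexp"
    "strings"
)

// dnsLabelRE matches one RFC 1123 label: lowercase alphanumerics and
// dashes, starting and ending with an alphanumeric.
var dnsLabelRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// isDNSLabel applies the rule described for Namespace and GatewayNamespace:
// a single label of at most 63 characters, hence no dots.
func isDNSLabel(s string) bool {
    return len(s) <= 63 && dnsLabelRE.MatchString(s)
}

// isDNSSubdomain applies the rule described for TemplateName: one or more
// dot-separated labels, at most 253 characters in total.
func isDNSSubdomain(s string) bool {
    if len(s) == 0 || len(s) > 253 {
        return false
    }
    for _, label := range strings.Split(s, ".") {
        if !dnsLabelRE.MatchString(label) {
            return false
        }
    }
    return true
}

func main() {
    fmt.Println(isDNSLabel("default"), isDNSLabel("team.a"))
    fmt.Println(isDNSSubdomain("python-runtime.v1"), isDNSSubdomain("Bad_Name"))
}
```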

Sources: [clients/go/sandbox/options.go:30-194](clients/go/sandbox/options.go), [clients/go/sandbox/options.go:196-300](clients/go/sandbox/options.go)

## Tracing

Tracing is wired through every layer. `newTracer` picks `Options.TracerProvider` (or the global), turns `TraceServiceName` into an OTel scope name (with `-` → `_`), and the same `tracer` is shared by the sandbox, commands, files, and strategies. A long-lived lifecycle span (`<svc>.lifecycle`) is started by `Open` and ended by `Close`/`Disconnect`/failed `Open`. Each operation calls `withLifecycleSpan` so that operation spans (`<svc>.run`, `<svc>.read`, `<svc>.upload` and so on) are children of the lifecycle span even if the caller's context has changed.

W3C trace context is propagated outward in two places: `connector.SendRequest` injects it into HTTP headers per attempt, and `K8sHelper.createClaim` injects it into the `opentelemetry.io/trace-context` annotation on the new `SandboxClaim` so the controller can continue the trace server-side. `recordError` is the consistent shape for failures: it both records the error and sets span status to `codes.Error`.

The package also exposes attribute keys in the `sandbox.*` namespace (`sandbox.claim.name`, `sandbox.command`, `sandbox.exit_code`, `sandbox.file.*`, `sandbox.gateway.*`, `sandbox.request_id`) and a convenience `NewTracerProvider(ctx, serviceName)` that returns an OTLP/gRPC batched provider reading `OTEL_EXPORTER_OTLP_ENDPOINT`.

Sources: [clients/go/sandbox/tracing.go:34-120](clients/go/sandbox/tracing.go), [clients/go/sandbox/sandbox.go:210-233](clients/go/sandbox/sandbox.go), [clients/go/sandbox/k8s.go:128-157](clients/go/sandbox/k8s.go), [clients/go/sandbox/connector.go:217-272](clients/go/sandbox/connector.go)

## Error model

Errors are sentinel values exported from `types.go`, optionally wrapped in `HTTPError` for non-OK server responses. `HTTPError.Error` truncates server bodies at 256 bytes before printing, and `maxErrorBodySize` (512) bounds how much body is read into the error chain. Sandbox-scoped errors prefix `sandbox[<namespace>/<claim>]:` so a multi-sandbox log is greppable by claim.

| Sentinel | Raised by |
|---|---|
| `ErrNotReady` | `connector.SendRequest` when no base URL |
| `ErrTimeout` | `gatewayStrategy`, `K8sHelper` watches |
| `ErrClaimFailed` | `K8sHelper.createClaim` |
| `ErrPortForwardDied` | `tunnelStrategy.monitorPortForward` |
| `ErrAlreadyOpen` | `Sandbox.Open` when already connected |
| `ErrOrphanedClaim` | `Sandbox.Open`/`reconnect` when claim exists but verification fails |
| `ErrRetriesExhausted` | `connector.SendRequest` after `maxAttempts` |
| `ErrSandboxDeleted` | claim watch sees `Deleted` |
| `ErrGatewayDeleted` | gateway watch sees `Deleted` |
| `ErrResponseTooLarge` | `Commands.Run` decode bounded by 16 MB |
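
Callers would typically match these errors with `errors.Is` and `errors.As`. The sketch below models the shape described above with assumed details: the `HTTPError` field names, the message format, and the `scoped` helper are hypothetical, while the 256-byte print limit and the `sandbox[<namespace>/<claim>]:` prefix come from the text.

```go
package main

import (
    "errors"
    "fmt"
)

// ErrRetriesExhausted mirrors one of the sentinels listed above.
var ErrRetriesExhausted = errors.New("retries exhausted")

const maxPrintedBody = 256 // print limit stated above

// HTTPError is a hypothetical reduction of the SDK type.
type HTTPError struct {
    StatusCode int
    Body       string
}

// Error truncates the body by bytes, so a multi-byte character at the
// boundary may be cut.
func (e *HTTPError) Error() string {
    body := e.Body
    if len(body) > maxPrintedBody {
        body = body[:maxPrintedBody]
    }
    return fmt.Sprintf("HTTP %d: %s", e.StatusCode, body)
}

// scoped adds the sandbox prefix while keeping err matchable via %w.
func scoped(namespace, claim string, err error) error {
    return fmt.Errorf("sandbox[%s/%s]: %w", namespace, claim, err)
}

func main() {
    err := scoped("default", "claim-abc", ErrRetriesExhausted)
    fmt.Println(err, errors.Is(err, ErrRetriesExhausted))

    err = scoped("default", "claim-abc", &HTTPError{StatusCode: 404, Body: "no such file"})
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        fmt.Println("status:", httpErr.StatusCode)
    }
}
```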

Sources: [clients/go/sandbox/types.go:39-66](clients/go/sandbox/types.go), [clients/go/sandbox/connector.go:222-355](clients/go/sandbox/connector.go), [clients/go/sandbox/commands.go:83-92](clients/go/sandbox/commands.go)

## Summary

`clients/go/sandbox` is intentionally narrow: `Sandbox` owns lifecycle and locking, `Client` owns multi-sandbox bookkeeping, the `connector` owns transport and retries, and three pluggable `ConnectionStrategy` implementations isolate how the router URL is discovered. The same `tracer` flows from `Options` through every layer, so a single OTel span tree covers claim creation, gateway/tunnel discovery, the lifecycle window, and every individual command or file operation.

---

## 20. Generated Go Clientsets, Informers & Listers

> The k8s.io/client-go-style generated machinery for Sandbox and extensions: typed clientsets, informers, listers, and the codegen wiring.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/20-generated-go-clientsets-informers-listers.md
- Generated: 2026-05-25T23:13:02.111Z

### Source Files

- `clients/k8s/clientset/versioned`
- `clients/k8s/extensions/clientset`
- `clients/k8s/extensions/informers`
- `clients/k8s/extensions/listers`
- `codegen.go`
- `dev/tools/client-gen-go.sh`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [codegen.go](codegen.go)
- [dev/tools/client-gen-go.sh](dev/tools/client-gen-go.sh)
- [clients/k8s/clientset/versioned/clientset.go](clients/k8s/clientset/versioned/clientset.go)
- [clients/k8s/clientset/versioned/typed/api/v1beta1/api_client.go](clients/k8s/clientset/versioned/typed/api/v1beta1/api_client.go)
- [clients/k8s/clientset/versioned/typed/api/v1beta1/sandbox.go](clients/k8s/clientset/versioned/typed/api/v1beta1/sandbox.go)
- [clients/k8s/clientset/versioned/typed/api/v1beta1/generated_expansion.go](clients/k8s/clientset/versioned/typed/api/v1beta1/generated_expansion.go)
- [clients/k8s/clientset/versioned/scheme/register.go](clients/k8s/clientset/versioned/scheme/register.go)
- [clients/k8s/clientset/versioned/fake/clientset_generated.go](clients/k8s/clientset/versioned/fake/clientset_generated.go)
- [clients/k8s/extensions/clientset/versioned/clientset.go](clients/k8s/extensions/clientset/versioned/clientset.go)
- [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/api_client.go](clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/api_client.go)
- [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/sandboxclaim.go](clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/sandboxclaim.go)
- [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/generated_expansion.go](clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/generated_expansion.go)
- [clients/k8s/extensions/clientset/versioned/fake/clientset_generated.go](clients/k8s/extensions/clientset/versioned/fake/clientset_generated.go)
- [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_api_client.go](clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_api_client.go)
- [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_sandboxclaim.go](clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_sandboxclaim.go)
- [clients/k8s/informers/externalversions/factory.go](clients/k8s/informers/externalversions/factory.go)
- [clients/k8s/informers/externalversions/generic.go](clients/k8s/informers/externalversions/generic.go)
- [clients/k8s/informers/externalversions/api/interface.go](clients/k8s/informers/externalversions/api/interface.go)
- [clients/k8s/informers/externalversions/api/v1beta1/interface.go](clients/k8s/informers/externalversions/api/v1beta1/interface.go)
- [clients/k8s/informers/externalversions/api/v1beta1/sandbox.go](clients/k8s/informers/externalversions/api/v1beta1/sandbox.go)
- [clients/k8s/informers/externalversions/internalinterfaces/factory_interfaces.go](clients/k8s/informers/externalversions/internalinterfaces/factory_interfaces.go)
- [clients/k8s/extensions/informers/externalversions/generic.go](clients/k8s/extensions/informers/externalversions/generic.go)
- [clients/k8s/extensions/informers/externalversions/api/v1beta1/interface.go](clients/k8s/extensions/informers/externalversions/api/v1beta1/interface.go)
- [clients/k8s/listers/api/v1beta1/sandbox.go](clients/k8s/listers/api/v1beta1/sandbox.go)
- [clients/k8s/listers/api/v1beta1/expansion_generated.go](clients/k8s/listers/api/v1beta1/expansion_generated.go)
- [clients/k8s/extensions/listers/api/v1beta1/sandboxclaim.go](clients/k8s/extensions/listers/api/v1beta1/sandboxclaim.go)
- [clients/k8s/extensions/listers/api/v1beta1/expansion_generated.go](clients/k8s/extensions/listers/api/v1beta1/expansion_generated.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [api/v1beta1/groupversion_info.go](api/v1beta1/groupversion_info.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/api/v1beta1/groupversion_info.go](extensions/api/v1beta1/groupversion_info.go)
- [clients/go/sandbox/k8s.go](clients/go/sandbox/k8s.go)
- [clients/go/sandbox/sandbox_test.go](clients/go/sandbox/sandbox_test.go)
</details>

# Generated Go Clientsets, Informers & Listers

The `clients/k8s` tree contains the `k8s.io/client-go`-style generated machinery used to interact with this project's Custom Resources from Go. It is produced by the `k8s.io/code-generator` tools (`client-gen`, `lister-gen`, `informer-gen`) and follows conventions that any consumer of `client-go` will already recognise: typed clientsets per group/version, shared informer factories with per-resource informers, and listers backed by an informer's indexer. Two parallel hierarchies are emitted — one for the core `agents.x-k8s.io` group (`Sandbox`) and a separate one for the `extensions.agents.x-k8s.io` group (`SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool`) — because each lives in its own Go API package and gets its own clientset, scheme, listers, informers, and fake.

This page maps the generated layout onto the source: where each layer lives, how the typed REST shims are assembled, how the shared informer factory wires informers to listers, and how everything is regenerated. The wrapper SDK in `clients/go/sandbox` consumes these types directly and is used as the canonical example of how the layers fit together.

Sources: [codegen.go:19-28](codegen.go), [dev/tools/client-gen-go.sh:24-78](dev/tools/client-gen-go.sh)

## Where the generated tree comes from

A single shell script, `dev/tools/client-gen-go.sh`, drives all three code generators against the two API packages and writes everything into `clients/k8s/{clientset,listers,informers}` and `clients/k8s/extensions/{clientset,listers,informers}`. It is invoked through the top-level `//go:generate ./dev/tools/client-gen-go.sh` directive in `codegen.go`, so `go generate ./...` is the only entry point a developer needs.

```bash
# dev/tools/client-gen-go.sh:24-48
CMD="go run -modfile=tools.mod k8s.io/code-generator"
API_PKG="sigs.k8s.io/agent-sandbox/api/v1beta1"
CLIENT_PKG="sigs.k8s.io/agent-sandbox/clients/k8s"

${CMD}/cmd/client-gen \
  --output-dir "clients/k8s/clientset" \
  --output-pkg "${CLIENT_PKG}/clientset" \
  --clientset-name "versioned" \
  --input-base "" \
  --input "${API_PKG}"

${CMD}/cmd/lister-gen   --output-dir "clients/k8s/listers"   …  "${API_PKG}"
${CMD}/cmd/informer-gen --output-dir "clients/k8s/informers" \
  --versioned-clientset-package "${CLIENT_PKG}/clientset/versioned" \
  --listers-package "${CLIENT_PKG}/listers" \
  …  "${API_PKG}"
```

The same three invocations are then repeated with `EXT_API_PKG=sigs.k8s.io/agent-sandbox/extensions/api/v1beta1` and `EXT_CLIENT_PKG=…/clients/k8s/extensions`, which is why the extensions tree is a structurally identical sibling rather than additional group/version folders under the core clientset. Finally, `dev/tools/fix-boilerplate` re-applies the project's Apache-2.0 license header to every generated file.

Resources are discovered by the generators from `+genclient` markers on the API types — `Sandbox` (`api/v1beta1/sandbox_types.go:224`) for the core group; `SandboxClaim`, `SandboxTemplate`, and `SandboxWarmPool` (`extensions/api/v1beta1/sandboxclaim_types.go:175`, etc.) for extensions — combined with each package's `SchemeGroupVersion` (`api/v1beta1/groupversion_info.go:27-29`, `extensions/api/v1beta1/groupversion_info.go:27-29`).

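Sources: [codegen.go:28](codegen.go), [dev/tools/client-gen-go.sh:24-78](dev/tools/client-gen-go.sh), [api/v1beta1/sandbox_types.go:224-244](api/v1beta1/sandbox_types.go), [api/v1beta1/groupversion_info.go:27-40](api/v1beta1/groupversion_info.go), [extensions/api/v1beta1/groupversion_info.go:27-40](extensions/api/v1beta1/groupversion_info.go)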

## Layout at a glance

The two trees mirror the upstream `kubernetes/sample-controller` layout. The `clientset` builds typed REST clients; `informers` builds a `SharedInformerFactory` over those clients; `listers` reads from each informer's `cache.Indexer`.

```text
clients/k8s/                                    clients/k8s/extensions/
├── clientset/versioned/                        ├── clientset/versioned/
│   ├── clientset.go                            │   ├── clientset.go
│   ├── scheme/register.go                      │   ├── scheme/register.go
│   ├── fake/clientset_generated.go             │   ├── fake/clientset_generated.go
│   └── typed/api/v1beta1/                      │   └── typed/api/v1beta1/
│       ├── api_client.go  (AgentsV1beta1)      │       ├── api_client.go  (ExtensionsV1beta1)
│       ├── sandbox.go                          │       ├── sandboxclaim.go
│       ├── generated_expansion.go              │       ├── sandboxtemplate.go
│       └── fake/                               │       ├── sandboxwarmpool.go
│           ├── fake_api_client.go              │       ├── generated_expansion.go
│           └── fake_sandbox.go                 │       └── fake/{fake_api_client,fake_*}.go
├── informers/externalversions/                 ├── informers/externalversions/
│   ├── factory.go                              │   ├── factory.go
│   ├── generic.go                              │   ├── generic.go
│   ├── internalinterfaces/factory_interfaces.go│   ├── internalinterfaces/factory_interfaces.go
│   └── api/{interface.go, v1beta1/...}         │   └── api/{interface.go, v1beta1/...}
└── listers/api/v1beta1/                        └── listers/api/v1beta1/
    ├── sandbox.go                                  ├── sandboxclaim.go
    └── expansion_generated.go                      ├── sandboxtemplate.go
                                                    ├── sandboxwarmpool.go
                                                    └── expansion_generated.go
```

The following table summarises which generator produces which file pattern:

| Generator      | Output root                              | Representative files                                                          | Purpose                                                       |
| -------------- | ---------------------------------------- | ----------------------------------------------------------------------------- | ------------------------------------------------------------- |
| `client-gen`   | `clientset/versioned/`                   | `clientset.go`, `typed/api/v1beta1/api_client.go`, `typed/api/v1beta1/*.go`   | Typed CRUD + `Watch` over REST per group/version              |
| `client-gen`   | `clientset/versioned/scheme/`            | `register.go`                                                                 | Runtime `Scheme` + `Codecs` + `ParameterCodec` for the group  |
| `client-gen`   | `clientset/versioned/fake/`              | `clientset_generated.go`, `typed/api/v1beta1/fake/*`                          | In-memory clientset backed by `client-go/testing.ObjectTracker` |
| `lister-gen`   | `listers/api/v1beta1/`                   | `sandbox.go`, `sandboxclaim.go`, `expansion_generated.go`                     | Indexer-backed cluster + namespaced read helpers              |
| `informer-gen` | `informers/externalversions/`            | `factory.go`, `generic.go`, `api/v1beta1/sandbox.go`                          | `SharedInformerFactory` + per-resource informers              |

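Sources: [dev/tools/client-gen-go.sh:24-78](dev/tools/client-gen-go.sh), [clients/k8s/clientset/versioned/clientset.go:15](clients/k8s/clientset/versioned/clientset.go), [clients/k8s/informers/externalversions/factory.go:15](clients/k8s/informers/externalversions/factory.go), [clients/k8s/listers/api/v1beta1/sandbox.go:15](clients/k8s/listers/api/v1beta1/sandbox.go)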

## Clientset and typed clients

`Clientset` is the per-tree aggregate that consumers usually hold. It embeds a `DiscoveryClient` and exposes one method per group/version. For the core tree there is exactly one such method, `AgentsV1beta1()`; for the extensions tree it is `ExtensionsV1beta1()`. Each tree also declares an `Interface` type so that the matching fake clientset can satisfy the same contract.

```go
// clients/k8s/clientset/versioned/clientset.go:29-43
type Interface interface {
    Discovery() discovery.DiscoveryInterface
    AgentsV1beta1() agentsv1beta1.AgentsV1beta1Interface
}

type Clientset struct {
    *discovery.DiscoveryClient
    agentsV1beta1 *agentsv1beta1.AgentsV1beta1Client
}

func (c *Clientset) AgentsV1beta1() agentsv1beta1.AgentsV1beta1Interface { return c.agentsV1beta1 }
```

The three constructors follow the upstream conventions: `NewForConfig`, `NewForConfigAndClient`, and `NewForConfigOrDie`. `NewForConfig` calls `rest.HTTPClientFor` and then delegates to `NewForConfigAndClient` (`clients/k8s/clientset/versioned/clientset.go:58-99`); when both group clientsets are needed at once, the same `httpClient` can be passed to each `NewForConfigAndClient` call so that a single transport is reused. The extensions clientset has the same shape but exposes `SandboxClaims`, `SandboxTemplates`, and `SandboxWarmPools` getters on its `ExtensionsV1beta1Client` (`clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/api_client.go:27-49`).

Per group/version the typed sub-client is configured against `SchemeGroupVersion` and a per-group `scheme.ParameterCodec`:

```go
// clients/k8s/clientset/versioned/typed/api/v1beta1/api_client.go:81-90
func setConfigDefaults(config *rest.Config) {
    gv := apiv1beta1.SchemeGroupVersion
    config.GroupVersion = &gv
    config.APIPath = "/apis"
    config.NegotiatedSerializer = rest.CodecFactoryForGeneratedClient(scheme.Scheme, scheme.Codecs).WithoutConversion()
    if config.UserAgent == "" {
        config.UserAgent = rest.DefaultKubernetesUserAgent()
    }
}
```

Each resource gets a `gentype.ClientWithList`-backed implementation that supplies CRUD, `UpdateStatus`, `DeleteCollection`, `Watch`, and `Patch`:

```go
// clients/k8s/clientset/versioned/typed/api/v1beta1/sandbox.go:37-68
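// Parameter types are written out in full; the metav1, types, and watch
// import aliases are assumed to match the generated file.
type SandboxInterface interface {
    Create(ctx context.Context, sandbox *apiv1beta1.Sandbox, opts metav1.CreateOptions) (*apiv1beta1.Sandbox, error)
    Update(ctx context.Context, sandbox *apiv1beta1.Sandbox, opts metav1.UpdateOptions) (*apiv1beta1.Sandbox, error)
    // UpdateStatus is generated because the +genclient marker does not set noStatus.
    UpdateStatus(ctx context.Context, sandbox *apiv1beta1.Sandbox, opts metav1.UpdateOptions) (*apiv1beta1.Sandbox, error)
    Delete(ctx context.Context, name string, opts metav1.DeleteOptions) error
    DeleteCollection(ctx context.Context, opts metav1.DeleteOptions, listOpts metav1.ListOptions) error
    Get(ctx context.Context, name string, opts metav1.GetOptions) (*apiv1beta1.Sandbox, error)
    List(ctx context.Context, opts metav1.ListOptions) (*apiv1beta1.SandboxList, error)
    Watch(ctx context.Context, opts metav1.ListOptions) (watch.Interface, error)
    Patch(ctx context.Context, name string, pt types.PatchType, data []byte, opts metav1.PatchOptions, subresources ...string) (*apiv1beta1.Sandbox, error)
    SandboxExpansion
}

// sandboxes implements SandboxInterface by embedding the generic typed client.
type sandboxes struct {
    *gentype.ClientWithList[*apiv1beta1.Sandbox, *apiv1beta1.SandboxList]
}

// newSandboxes returns a client for the "sandboxes" resource scoped to namespace.
func newSandboxes(c *AgentsV1beta1Client, namespace string) *sandboxes {
    return &sandboxes{
        gentype.NewClientWithList[*apiv1beta1.Sandbox, *apiv1beta1.SandboxList](
            "sandboxes",
            c.RESTClient(),
            scheme.ParameterCodec,
            namespace,
            func() *apiv1beta1.Sandbox { return &apiv1beta1.Sandbox{} },
            func() *apiv1beta1.SandboxList { return &apiv1beta1.SandboxList{} },
        ),
    }
}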
```

The presence of `UpdateStatus` is driven by the `+genclient` marker without `noStatus`; the file's own doc comment notes this (`clients/k8s/clientset/versioned/typed/api/v1beta1/sandbox.go:40-41`). The `SandboxExpansion` interface in `generated_expansion.go` is empty — it exists so a hand-written file may add methods to `SandboxInterface` without conflicting with regenerated code.

Sources: [clients/k8s/clientset/versioned/clientset.go:29-99](), [clients/k8s/clientset/versioned/typed/api/v1beta1/api_client.go:27-99](), [clients/k8s/clientset/versioned/typed/api/v1beta1/sandbox.go:30-68](), [clients/k8s/clientset/versioned/typed/api/v1beta1/generated_expansion.go:19](), [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/api_client.go:27-49]()

## Layered architecture

```mermaid
flowchart TB
    subgraph apis["API packages (+genclient sources)"]
        coreAPI["api/v1beta1<br/>Sandbox<br/>(group: agents.x-k8s.io)"]
        extAPI["extensions/api/v1beta1<br/>SandboxClaim / SandboxTemplate / SandboxWarmPool<br/>(group: extensions.agents.x-k8s.io)"]
    end

    subgraph clientset["clientset/versioned (client-gen)"]
        coreCS["Clientset<br/>AgentsV1beta1()"]
        coreTyped["AgentsV1beta1Client<br/>Sandboxes(ns)"]
        coreFake["fake.Clientset<br/>NewSimpleClientset"]
        extCS["Clientset<br/>ExtensionsV1beta1()"]
        extTyped["ExtensionsV1beta1Client<br/>SandboxClaims/Templates/WarmPools"]
        extFake["fake.Clientset<br/>NewSimpleClientset"]
        scheme["scheme.Scheme<br/>scheme.Codecs<br/>ParameterCodec"]
    end

    subgraph informers["informers/externalversions (informer-gen)"]
        factory["sharedInformerFactory<br/>InformerFor / ForResource / Start"]
        groupIface["api.Interface<br/>V1beta1()"]
        sandboxInf["SandboxInformer<br/>Informer() / Lister()"]
    end

    subgraph listers["listers/api/v1beta1 (lister-gen)"]
        sandboxLister["SandboxLister<br/>(read-only over cache.Indexer)"]
        extListers["SandboxClaim/Template/WarmPool Listers"]
    end

    consumer["clients/go/sandbox.K8sHelper"]

    coreAPI -->|+genclient| coreTyped
    extAPI  -->|+genclient| extTyped
    coreTyped --> coreCS
    extTyped  --> extCS
    coreCS -. interface .- coreFake
    extCS  -. interface .- extFake
    scheme --> coreTyped
    scheme --> extTyped

    coreCS --> factory
    factory --> groupIface --> sandboxInf
    sandboxInf -- "cache.Indexer" --> sandboxLister
    factory -- "ForResource(GVR)" --> extListers

    consumer -->|AgentsV1beta1Interface| coreTyped
    consumer -->|ExtensionsV1beta1Interface| extTyped
```

The diagram captures the dependency direction enforced by the generators: API packages are the only hand-written input; the clientset depends on them and on the local `scheme`; informers depend on the versioned clientset; listers depend on the informer's `cache.Indexer`; consumers depend on the typed group interfaces. The extensions side is a structural copy of the core side, swapped onto the extensions API package.

Sources: [clients/k8s/clientset/versioned/clientset.go:29-43](), [clients/k8s/informers/externalversions/factory.go:36-54](), [clients/k8s/informers/externalversions/api/v1beta1/sandbox.go:33-100](), [clients/k8s/listers/api/v1beta1/sandbox.go:26-69](), [clients/go/sandbox/k8s.go:50-121]()

## Scheme registration

Each clientset ships its own `scheme` subpackage. It builds a private `runtime.Scheme`, registers the API package's `AddToScheme`, and exposes `Codecs`, `ParameterCodec`, and `AddToScheme` for composition with other schemes (for example `clientsetscheme.Scheme`).

```go
// clients/k8s/clientset/versioned/scheme/register.go:28-54
var Scheme = runtime.NewScheme()
var Codecs = serializer.NewCodecFactory(Scheme)
var ParameterCodec = runtime.NewParameterCodec(Scheme)
var localSchemeBuilder = runtime.SchemeBuilder{ agentsv1beta1.AddToScheme }
var AddToScheme = localSchemeBuilder.AddToScheme

func init() {
    v1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
    utilruntime.Must(AddToScheme(Scheme))
}
```

Because the core and extensions trees are separate clientsets, they each have a separate scheme package. Consumers that need to decode both groups through the same codecs must call both `AddToScheme` functions against a shared scheme.

Sources: [clients/k8s/clientset/versioned/scheme/register.go:28-54]()

## Fake clientsets

`client-gen` emits a parallel `fake/` package per tree. `NewSimpleClientset(objects...)` returns a `*Clientset` that satisfies the same `versioned.Interface` as the real clientset but is backed by `k8s.io/client-go/testing.NewObjectTracker` and a default reactor pipeline. A watch reactor is also pre-registered against the tracker so any `Watch()` call against the fake returns events generated by tracker mutations.

```go
// clients/k8s/extensions/clientset/versioned/fake/clientset_generated.go:39-65
func NewSimpleClientset(objects ...runtime.Object) *Clientset {
    o := testing.NewObjectTracker(scheme, codecs.UniversalDecoder())
    for _, obj := range objects { _ = o.Add(obj) }
    cs := &Clientset{tracker: o}
    cs.discovery = &fakediscovery.FakeDiscovery{Fake: &cs.Fake}
    cs.AddReactor("*", "*", testing.ObjectReaction(o))
    cs.AddWatchReactor("*", func(action testing.Action) (bool, watch.Interface, error) { … })
    return cs
}
```

The fake group client mirrors the real one method-for-method, returning per-resource `gentype.FakeClientWithList`-backed fakes:

```go
// clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_api_client.go:25-39
type FakeExtensionsV1beta1 struct { *testing.Fake }
func (c *FakeExtensionsV1beta1) SandboxClaims(ns string) v1beta1.SandboxClaimInterface     { return newFakeSandboxClaims(c, ns) }
func (c *FakeExtensionsV1beta1) SandboxTemplates(ns string) v1beta1.SandboxTemplateInterface { return newFakeSandboxTemplates(c, ns) }
func (c *FakeExtensionsV1beta1) SandboxWarmPools(ns string) v1beta1.SandboxWarmPoolInterface { return newFakeSandboxWarmPools(c, ns) }

// clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_sandboxclaim.go:31-50
func newFakeSandboxClaims(fake *FakeExtensionsV1beta1, namespace string) apiv1beta1.SandboxClaimInterface {
    return &fakeSandboxClaims{ gentype.NewFakeClientWithList[*SandboxClaim, *SandboxClaimList](
        fake.Fake, namespace,
        v1beta1.SchemeGroupVersion.WithResource("sandboxclaims"),
        v1beta1.SchemeGroupVersion.WithKind("SandboxClaim"),
        func() *SandboxClaim     { return &SandboxClaim{} },
        func() *SandboxClaimList { return &SandboxClaimList{} },
        func(dst, src *SandboxClaimList) { dst.ListMeta = src.ListMeta },
        func(list *SandboxClaimList) []*SandboxClaim     { return gentype.ToPointerSlice(list.Items) },
        func(list *SandboxClaimList, items []*SandboxClaim) { list.Items = gentype.FromPointerSlice(items) },
    ), fake }
}
```

`NewSimpleClientset` is marked `Deprecated` upstream in favour of the apply-aware `NewClientset` constructor that becomes available once `--with-applyconfig` is used; the call sites in `clients/go/sandbox` suppress the deprecation with a `//nolint:staticcheck` and an inline TODO to regenerate with that flag (`clients/go/sandbox/sandbox_test.go:65-67`). The fake clientset additionally returns `true` from `IsWatchListSemanticsUnSupported()` to signal to reflectors that `WatchList` is not implemented (`clients/k8s/clientset/versioned/fake/clientset_generated.go:84-93`).

Sources: [clients/k8s/clientset/versioned/fake/clientset_generated.go:39-103](), [clients/k8s/extensions/clientset/versioned/fake/clientset_generated.go:39-103](), [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_api_client.go:25-46](), [clients/k8s/extensions/clientset/versioned/typed/api/v1beta1/fake/fake_sandboxclaim.go:25-50](), [clients/go/sandbox/sandbox_test.go:42-45]()

## Shared informer factory

`informer-gen` emits a `SharedInformerFactory` per tree. The factory is a tiny `sync.Mutex`-guarded registry over `reflect.Type → cache.SharedIndexInformer`, keyed by the object type each informer is configured for. Options are supplied with the standard `WithCustomResyncConfig`, `WithTweakListOptions`, `WithNamespace`, and `WithTransform` functional options.

```go
// clients/k8s/informers/externalversions/factory.go:36-54
type sharedInformerFactory struct {
    client           versioned.Interface
    namespace        string
    tweakListOptions internalinterfaces.TweakListOptionsFunc
    lock             sync.Mutex
    defaultResync    time.Duration
    customResync     map[reflect.Type]time.Duration
    transform        cache.TransformFunc
    informers        map[reflect.Type]cache.SharedIndexInformer
    startedInformers map[reflect.Type]bool
    wg               sync.WaitGroup
    shuttingDown     bool
}
```

`InformerFor` is the deduplicating heart of the factory: it looks up an existing informer by `reflect.TypeOf(obj)` and, if absent, invokes the `NewInformerFunc` supplied by the per-resource accessor and applies the configured transform.

```go
// clients/k8s/informers/externalversions/factory.go:179-199
func (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer {
    f.lock.Lock(); defer f.lock.Unlock()
    informerType := reflect.TypeOf(obj)
    if informer, exists := f.informers[informerType]; exists { return informer }
    resyncPeriod, exists := f.customResync[informerType]
    if !exists { resyncPeriod = f.defaultResync }
    informer := newFunc(f.client, resyncPeriod) // := because the earlier `informer` is scoped to the if statement in this condensed excerpt
    informer.SetTransform(f.transform)
    f.informers[informerType] = informer
    return informer
}
```

`Start` walks `f.informers`, launches `informer.Run(stopCh)` in a goroutine for each one that has not been started yet, and records it in `startedInformers` so a second `Start` call is a no-op for already-running informers; `WaitForCacheSync` filters by the same set when blocking on `HasSynced` (`clients/k8s/informers/externalversions/factory.go:123-175`). `Shutdown` flips `shuttingDown` and blocks on `f.wg` until all the goroutines started by previous `Start` calls return.

Group navigation is structured as `Factory → Group → Version → Resource`. Each layer is generated; the leaf returns a `SandboxInformer` (or per-extensions-resource equivalent):

```go
// clients/k8s/informers/externalversions/factory.go:259-261
func (f *sharedInformerFactory) Agents() api.Interface { return api.New(f, f.namespace, f.tweakListOptions) }

// clients/k8s/informers/externalversions/api/interface.go:24-44
type Interface interface { V1beta1() v1beta1.Interface }
func (g *group) V1beta1() v1beta1.Interface { return v1beta1.New(g.factory, g.namespace, g.tweakListOptions) }

// clients/k8s/informers/externalversions/api/v1beta1/interface.go:24-43
type Interface interface { Sandboxes() SandboxInformer }
func (v *version) Sandboxes() SandboxInformer {
    return &sandboxInformer{factory: v.factory, namespace: v.namespace, tweakListOptions: v.tweakListOptions}
}
```

The factory also implements `ForResource(GVR)` for callers that only know a `schema.GroupVersionResource`. It is a switch over the resources discovered for the tree:

```go
// clients/k8s/informers/externalversions/generic.go:51-60
func (f *sharedInformerFactory) ForResource(resource schema.GroupVersionResource) (GenericInformer, error) {
    switch resource {
    case v1beta1.SchemeGroupVersion.WithResource("sandboxes"):
        return &genericInformer{ resource: resource.GroupResource(),
            informer: f.Agents().V1beta1().Sandboxes().Informer() }, nil
    }
    return nil, fmt.Errorf("no informer found for %v", resource)
}
```

The extensions tree emits the same switch with three cases — `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools` — each routed through `f.Extensions().V1beta1()` (`clients/k8s/extensions/informers/externalversions/generic.go:51-64`, `clients/k8s/extensions/informers/externalversions/api/v1beta1/interface.go:24-57`).

Sources: [clients/k8s/informers/externalversions/factory.go:36-261](), [clients/k8s/informers/externalversions/generic.go:29-60](), [clients/k8s/informers/externalversions/api/interface.go:24-44](), [clients/k8s/informers/externalversions/api/v1beta1/interface.go:24-43](), [clients/k8s/extensions/informers/externalversions/api/v1beta1/interface.go:24-57]()

## Per-resource informer wiring

The leaf informer file (e.g. `sandbox.go`) is where the typed clientset is finally joined to a `cache.SharedIndexInformer`. `NewFilteredSandboxInformer` builds a `cache.ListWatch` whose `ListFunc`/`WatchFunc` (and their `WithContext` siblings) delegate to the typed `AgentsV1beta1().Sandboxes(namespace)` sub-client, optionally rewriting `metav1.ListOptions` via `tweakListOptions`.

```go
// clients/k8s/informers/externalversions/api/v1beta1/sandbox.go:56-100
func NewFilteredSandboxInformer(client versioned.Interface, namespace string,
    resyncPeriod time.Duration, indexers cache.Indexers,
    tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {
    return cache.NewSharedIndexInformer(
        cache.ToListWatcherWithWatchListSemantics(&cache.ListWatch{
            ListFunc:  func(o v1.ListOptions) (runtime.Object, error)   { /* tweak; client.AgentsV1beta1().Sandboxes(namespace).List(...) */ },
            WatchFunc: func(o v1.ListOptions) (watch.Interface, error)  { /* tweak; ...Watch(...) */ },
            ListWithContextFunc:  /* ctx-aware List  */,
            WatchFuncWithContext: /* ctx-aware Watch */,
        }, client),
        &agentsandboxapiv1beta1.Sandbox{}, resyncPeriod, indexers,
    )
}

func (f *sandboxInformer) defaultInformer(client versioned.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
    return NewFilteredSandboxInformer(client, f.namespace, resyncPeriod,
        cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)
}

func (f *sandboxInformer) Informer() cache.SharedIndexInformer { return f.factory.InformerFor(&Sandbox{}, f.defaultInformer) }
func (f *sandboxInformer) Lister()   apiv1beta1.SandboxLister  { return apiv1beta1.NewSandboxLister(f.Informer().GetIndexer()) }
```

A few details worth highlighting:

- The default informer always installs `cache.NamespaceIndex → cache.MetaNamespaceIndexFunc`, which is what makes `Lister().Sandboxes(namespace)` efficient.
- `cache.ToListWatcherWithWatchListSemantics(..., client)` opts the informer into the streaming `WatchList` semantics where supported by the API server — the fake clientset signals it is *not* supported through `IsWatchListSemanticsUnSupported() bool` on `Clientset` (`clients/k8s/clientset/versioned/fake/clientset_generated.go:91-93`).
- The `NewInformerFunc` signature `func(versioned.Interface, time.Duration) cache.SharedIndexInformer` is defined in `internalinterfaces` to keep the factory free of import cycles (`clients/k8s/informers/externalversions/internalinterfaces/factory_interfaces.go:28-39`).

Sources: [clients/k8s/informers/externalversions/api/v1beta1/sandbox.go:33-100](), [clients/k8s/informers/externalversions/internalinterfaces/factory_interfaces.go:28-39](), [clients/k8s/clientset/versioned/fake/clientset_generated.go:91-93]()

## Listers

Listers are tiny wrappers around `k8s.io/client-go/listers.ResourceIndexer[T]`. They expose namespaced and cluster-scoped reads over the informer's `cache.Indexer`:

```go
// clients/k8s/listers/api/v1beta1/sandbox.go:38-69
type sandboxLister struct {
    listers.ResourceIndexer[*apiv1beta1.Sandbox]
}
func NewSandboxLister(indexer cache.Indexer) SandboxLister {
    return &sandboxLister{listers.New[*apiv1beta1.Sandbox](indexer, apiv1beta1.Resource("sandbox"))}
}
func (s *sandboxLister) Sandboxes(namespace string) SandboxNamespaceLister {
    return sandboxNamespaceLister{listers.NewNamespaced[*apiv1beta1.Sandbox](s.ResourceIndexer, namespace)}
}
```

All returned objects are documented as read-only — callers must `DeepCopy` before mutation. `expansion_generated.go` emits empty marker interfaces (`SandboxListerExpansion`, `SandboxNamespaceListerExpansion`, and one pair per extensions resource) so hand-written methods can be added to a lister without conflicting with regenerated code. The extensions tree mirrors this with one file per resource (`sandboxclaim.go`, `sandboxtemplate.go`, `sandboxwarmpool.go`).

Sources: [clients/k8s/listers/api/v1beta1/sandbox.go:26-69](), [clients/k8s/listers/api/v1beta1/expansion_generated.go:19-26](), [clients/k8s/extensions/listers/api/v1beta1/sandboxclaim.go:26-69](), [clients/k8s/extensions/listers/api/v1beta1/expansion_generated.go:19-41]()

## Wiring it together

The wrapper SDK shows the canonical end-to-end consumption pattern: build a shared `*http.Client`, hand it to both `versioned.NewForConfigAndClient` calls, and keep only the per-group interfaces around for downstream code.

```go
// clients/go/sandbox/k8s.go:82-121
httpClient, err := rest.HTTPClientFor(config)
…
agentsCS, err := agentsclientset.NewForConfigAndClient(config, httpClient)
…
extensionsCS, err := extensionsclientset.NewForConfigAndClient(config, httpClient)
…
return &K8sHelper{
    AgentsClient:     agentsCS.AgentsV1beta1(),
    ExtensionsClient: extensionsCS.ExtensionsV1beta1(),
    …
}, nil
```

Calls then go directly through the typed interface, for example creating and watching a `SandboxClaim`:

```go
// clients/go/sandbox/k8s.go:148-208
created, err := h.ExtensionsClient.SandboxClaims(namespace).Create(ctx, claim, metav1.CreateOptions{})
…
watcher, err := h.ExtensionsClient.SandboxClaims(namespace).Watch(ctx, listOpts)
```

Tests follow the same pattern but swap in the fake clientsets without touching the rest of the code — that interface-symmetry is the whole point of generating both real and fake implementations together:

```go
// clients/go/sandbox/sandbox_test.go:65-72
agentsCS := fakeagents.NewSimpleClientset()
extensionsCS := fakeextensions.NewSimpleClientset()
opts.K8sHelper = &K8sHelper{
    AgentsClient:     agentsCS.AgentsV1beta1(),
    ExtensionsClient: extensionsCS.ExtensionsV1beta1(),
    Log:              opts.Logger,
}
```

Sources: [clients/go/sandbox/k8s.go:36-41](), [clients/go/sandbox/k8s.go:50-121](), [clients/go/sandbox/k8s.go:148-208](), [clients/go/sandbox/sandbox_test.go:42-72]()

## Regenerating

To regenerate everything, run `go generate ./...` from the repository root; the `//go:generate` directive at `codegen.go:28` invokes `dev/tools/client-gen-go.sh`, which produces the entire `clients/k8s/{clientset,informers,listers}` and `clients/k8s/extensions/{clientset,informers,listers}` trees in place and then re-applies the project's Apache-2.0 license headers. The generators emit `// Code generated by client-gen. DO NOT EDIT.` (and equivalents for `informer-gen` / `lister-gen`) on every file, so the only sanctioned way to add custom behaviour without losing it on regeneration is via the empty `*Expansion` interfaces emitted alongside each resource. Adding a new CR is correspondingly a matter of adding `+genclient` to the type and re-running generation — no edits to the clientset, factory, lister, or fake plumbing are required.

Sources: [codegen.go:19-28](), [dev/tools/client-gen-go.sh:55-78](), [clients/k8s/clientset/versioned/typed/api/v1beta1/generated_expansion.go:15-19]()

---

## 21. Python Sync SDK Core

> Synchronous Python client surface: SandboxClient, Sandbox, connector, command executor, filesystem helpers, and the k8s helper layer.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/21-python-sync-sdk-core.md
- Generated: 2026-05-25T22:44:31.102Z

### Source Files

- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/constants.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/constants.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/utils.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/utils.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/exceptions.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/exceptions.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py)
</details>

# Python Sync SDK Core

This page documents the synchronous Python client surface of `k8s_agent_sandbox`, the package that ships under `clients/python/agentic-sandbox-client`. The synchronous core is composed of five collaborating layers: `SandboxClient` (a registry that creates, attaches to, and tears down `SandboxClaim` custom resources), `Sandbox` (a per-instance handle), `SandboxConnector` plus its strategy classes (HTTP transport with router / port-forward / gateway / in-cluster variants), `CommandExecutor` and `Filesystem` (the two user-facing engines), and `K8sHelper` (the Kubernetes API wrapper that watches custom resources and resolves their status).

The five layers are deliberately flat: a `SandboxClient` instance owns one `K8sHelper`, one connection configuration, and a dictionary of active `Sandbox` handles keyed by `(namespace, claim_name)`. Each `Sandbox` owns its own `SandboxConnector` and exposes the engines via the `.commands` and `.files` properties. The async variants (`AsyncSandboxClient`, `async_connector`, `async_filesystem`, `async_command_executor`, `async_k8s_helper`) mirror this structure and are imported lazily — `__init__.py` falls back to a stub that instructs the user to install the `async` extras when their dependencies are missing.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py:15-36](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:48-92](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:32-80]()

## Architecture Overview

The diagram below names every module-backed class in the synchronous core and shows the ownership and call boundaries that the constructors and properties establish. Solid arrows are owns/creates; dashed arrows are HTTP/Kubernetes API egress.

```mermaid
flowchart TB
    subgraph User["User code"]
        App["import k8s_agent_sandbox"]
    end

    subgraph Client["sandbox_client.py"]
        SC["SandboxClient[T]"]
        ACT["_active_connection_sandboxes:<br/>Dict[(ns, claim_name), Sandbox]"]
    end

    subgraph Handle["sandbox.py"]
        SB["Sandbox"]
        CMD["CommandExecutor<br/>(commands/command_executor.py)"]
        FS["Filesystem<br/>(files/filesystem.py)"]
    end

    subgraph Net["connector.py"]
        CN["SandboxConnector"]
        DS["DirectConnectionStrategy"]
        GW["GatewayConnectionStrategy"]
        LT["LocalTunnelConnectionStrategy"]
        IC["InClusterConnectionStrategy"]
        SESS["requests.Session + Retry"]
    end

    subgraph K8s["k8s_helper.py"]
        KH["K8sHelper"]
        CO["CustomObjectsApi"]
        CV["CoreV1Api"]
    end

    subgraph External["External"]
        API["Kubernetes API server"]
        Router["sandbox-router-svc :8080"]
        Pod["Sandbox pod :8888"]
    end

    App --> SC
    SC --> ACT
    SC --> KH
    SC -->|create_sandbox / get_sandbox| SB
    SB --> CMD
    SB --> FS
    SB --> CN
    CN --> SESS
    CN -->|polymorphic| DS
    CN -->|polymorphic| GW
    CN -->|polymorphic| LT
    CN -->|polymorphic| IC
    KH --> CO
    KH --> CV
    CO -.->|watch/list/create/delete| API
    CV -.-> API
    LT -.->|kubectl port-forward subprocess| Router
    DS -.-> Router
    GW -.-> Router
    Router -.-> Pod
    IC -.->|svc DNS or pod IP| Pod
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:56-92](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:42-80](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py:38-304]()

## `SandboxClient` — registry and lifecycle

`SandboxClient` is generic in the handle type (`T = TypeVar('T', bound=Sandbox)`) so subclasses can swap in a richer `sandbox_class` (used by `SandboxWithSnapshotSupport` in `gke_extensions`). Construction wires three things: the connection config (defaulting to `SandboxLocalTunnelConnectionConfig`), an optional tracer manager from `trace_manager.create_tracer_manager`, and a fresh `K8sHelper`. When `cleanup=True`, `atexit.register(self.delete_all)` arranges best-effort teardown of every tracked sandbox at process exit.

The active-handle registry is `_active_connection_sandboxes: Dict[Tuple[str, str], T]`. Keys are `(namespace, claim_name)` tuples, and every public method that mutates state — `create_sandbox`, `get_sandbox`, `delete_sandbox`, `delete_all`, `list_active_sandboxes` — manipulates this dictionary directly. `list_active_sandboxes` also lazily evicts handles whose `is_active` property has flipped to `False`.
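
The registry behaviour described above can be sketched in a few lines. This is an illustration only: `HandleSketch` and `RegistrySketch` are hypothetical names, and the real `SandboxClient` stores full `Sandbox` handles rather than this stub.

```python
from typing import Dict, List, Tuple


class HandleSketch:
    """Hypothetical stand-in for a Sandbox handle; only `is_active` matters here."""

    def __init__(self, namespace: str, claim_name: str) -> None:
        self.namespace = namespace
        self.claim_name = claim_name
        self.is_active = True


class RegistrySketch:
    """Tracks handles by (namespace, claim_name) and evicts inactive ones on list."""

    def __init__(self) -> None:
        self._active: Dict[Tuple[str, str], HandleSketch] = {}

    def track(self, handle: HandleSketch) -> None:
        self._active[(handle.namespace, handle.claim_name)] = handle

    def list_active(self) -> List[Tuple[str, str]]:
        # Iterate over a snapshot so entries can be removed while scanning.
        for key, handle in list(self._active.items()):
            if not handle.is_active:
                del self._active[key]
        return sorted(self._active)


registry = RegistrySketch()
first = HandleSketch("default", "sandbox-claim-aaaa1111")
second = HandleSketch("default", "sandbox-claim-bbbb2222")
registry.track(first)
registry.track(second)
first.is_active = False  # e.g. the handle was closed elsewhere
remaining = registry.list_active()
```

Keying on the `(namespace, claim_name)` tuple keeps claims with the same name in different namespaces distinct without any string concatenation.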

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:46-92](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:215-287]()

### `create_sandbox` lifecycle

`create_sandbox` is the canonical happy path. Its steps are deterministic and the error path is precisely scoped: if anything between claim creation and the engine handshake fails, the partially created claim is deleted before the exception propagates.

| Step | Method called | Source |
| --- | --- | --- |
| Validate input | `_validate_labels`, `construct_sandbox_claim_lifecycle_spec` | `sandbox_client.py:117-122`, `utils.py:18-45` |
| Mint claim name | `f"sandbox-claim-{uuid.uuid4().hex[:8]}"` | `sandbox_client.py:124` |
| Create `SandboxClaim` resource | `_create_claim` → `k8s_helper.create_sandbox_claim` | `sandbox_client.py:336-352`, `k8s_helper.py:43-75` |
| Resolve sandbox name from claim status | `k8s_helper.resolve_sandbox_name` (watch) | `k8s_helper.py:77-129` |
| Wait until `Ready` condition is `True` | `_wait_for_sandbox_ready` → `k8s_helper.wait_for_sandbox_ready` | `sandbox_client.py:354-357`, `k8s_helper.py:131-168` |
| Instantiate handle | `self.sandbox_class(...)` | `sandbox_client.py:140-147` |
| Roll back on failure | `_delete_claim` in `except` | `sandbox_client.py:148-151` |
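
The create-then-roll-back flow in the table can be sketched as follows. `FakeHelperSketch` is a hypothetical in-memory stand-in; its method names follow the table, but the signatures and return values are assumptions, not the SDK's actual API.

```python
import uuid


class FakeHelperSketch:
    """Hypothetical in-memory stand-in for the Kubernetes helper."""

    def __init__(self, fail_on_ready: bool = False) -> None:
        self.claims = set()
        self.fail_on_ready = fail_on_ready

    def create_sandbox_claim(self, name: str, namespace: str) -> None:
        self.claims.add((namespace, name))

    def resolve_sandbox_name(self, name: str, namespace: str) -> str:
        return f"sandbox-for-{name}"

    def wait_for_sandbox_ready(self, sandbox_name: str, namespace: str) -> None:
        if self.fail_on_ready:
            raise TimeoutError(f"{sandbox_name} never became Ready")

    def delete_sandbox_claim(self, name: str, namespace: str) -> None:
        self.claims.discard((namespace, name))


def create_sandbox_sketch(helper: FakeHelperSketch, namespace: str) -> dict:
    claim_name = f"sandbox-claim-{uuid.uuid4().hex[:8]}"
    helper.create_sandbox_claim(claim_name, namespace)
    try:
        sandbox_name = helper.resolve_sandbox_name(claim_name, namespace)
        helper.wait_for_sandbox_ready(sandbox_name, namespace)
        return {"claim_name": claim_name, "sandbox_id": sandbox_name}
    except Exception:
        # Roll back the partially created claim, then let the error propagate.
        helper.delete_sandbox_claim(claim_name, namespace)
        raise


ok_helper = FakeHelperSketch()
handle = create_sandbox_sketch(ok_helper, "default")

bad_helper = FakeHelperSketch(fail_on_ready=True)
try:
    create_sandbox_sketch(bad_helper, "default")
except TimeoutError:
    pass  # the claim was already rolled back at this point
```

The claim is created outside the `try` block on purpose: if creation itself fails there is nothing to roll back.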

The `shutdown_after_seconds` argument is converted by `construct_sandbox_claim_lifecycle_spec` into a `{"shutdownTime": "<UTC ISO8601>", "shutdownPolicy": "Delete"}` block that the controller honors as a TTL. The helper validates that the argument is a positive `int`; it rejects `bool` values (the check is `type(x) is not int`, which excludes `bool` even though it is a subclass of `int`) and handles the `OverflowError` case explicitly.
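
A minimal sketch of that conversion, under stated assumptions: the function name `lifecycle_spec_sketch`, the injectable `now` parameter, the exact timestamp format, and the use of `ValueError` for every rejection are illustrative choices rather than the SDK's confirmed behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lifecycle_spec_sketch(shutdown_after_seconds: int, now: Optional[datetime] = None) -> dict:
    # `type(...) is not int` (rather than isinstance) also rejects bool,
    # which is a subclass of int.
    if type(shutdown_after_seconds) is not int or shutdown_after_seconds <= 0:
        raise ValueError("shutdown_after_seconds must be a positive int")
    # `now` is assumed to be a timezone-aware UTC datetime when supplied.
    now = now or datetime.now(timezone.utc)
    try:
        shutdown_at = now + timedelta(seconds=shutdown_after_seconds)
    except OverflowError as exc:
        # Very large values overflow timedelta or the datetime range.
        raise ValueError("shutdown_after_seconds is too large") from exc
    return {
        "shutdownTime": shutdown_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "shutdownPolicy": "Delete",
    }


fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
spec = lifecycle_spec_sketch(90, now=fixed_now)

rejected = []
for bad in (True, 0, -5, 1.5, 10**30):
    try:
        lifecycle_spec_sketch(bad, now=fixed_now)
    except ValueError:
        rejected.append(bad)
```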

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:94-154](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/utils.py:18-45]()

### Label validation

`SandboxClient` enforces Kubernetes label rules client-side before submitting the claim. `_LABEL_NAME_RE`, `_LABEL_PREFIX_RE`, `_LABEL_NAME_MAX_LENGTH` (63) and `_LABEL_PREFIX_MAX_LENGTH` (253) mirror the upstream constraints. Keys with a `prefix/name` form are split and each segment is validated independently; values may be empty but, when non-empty, must satisfy the same name regex. Errors raise `ValueError` before any API call is issued.
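
The checks can be approximated as below. The two patterns are modeled on the upstream Kubernetes label syntax and are an assumption; they are not copied from the SDK's `_LABEL_NAME_RE` and `_LABEL_PREFIX_RE`, and `validate_labels_sketch` is a hypothetical name.

```python
import re

# Patterns modeled on upstream Kubernetes label rules (assumed, not the SDK's own).
NAME_RE = re.compile(r"[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?")
PREFIX_RE = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*")
NAME_MAX_LENGTH = 63
PREFIX_MAX_LENGTH = 253


def validate_labels_sketch(labels: dict) -> None:
    """Raise ValueError on the first key or value that breaks the label rules."""
    for key, value in labels.items():
        # Split an optional "prefix/name" key at the last slash.
        prefix, sep, name = key.rpartition("/")
        if sep and (
            not prefix
            or len(prefix) > PREFIX_MAX_LENGTH
            or not PREFIX_RE.fullmatch(prefix)
        ):
            raise ValueError(f"invalid label prefix in key {key!r}")
        if not name or len(name) > NAME_MAX_LENGTH or not NAME_RE.fullmatch(name):
            raise ValueError(f"invalid label name in key {key!r}")
        # Empty values are allowed; non-empty values follow the name rules.
        if value and (len(value) > NAME_MAX_LENGTH or not NAME_RE.fullmatch(value)):
            raise ValueError(f"invalid label value for key {key!r}")


def is_valid(labels: dict) -> bool:
    try:
        validate_labels_sketch(labels)
    except ValueError:
        return False
    return True


good = is_valid({"app": "demo", "example.com/team": "", "tier": "a.b_c-d"})
bad_name = is_valid({"-bad": "x"})
bad_value = is_valid({"app": "x" * 64})
bad_prefix = is_valid({"Example.com/team": "x"})
```

Running the check before any API call means a malformed label fails fast with a local `ValueError` instead of a server-side rejection.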

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:289-334]()

### `get_sandbox`, `delete_sandbox`, `delete_all`

`get_sandbox` re-attaches to an existing claim. It always issues `resolve_sandbox_name` + `get_sandbox` against the K8s API even when a registry entry exists, so a stale handle for a deleted claim is detected and evicted before the call raises `SandboxNotFoundError`. If the registry entry is active and the underlying object is still present, the cached handle is returned; otherwise a fresh `sandbox_class(...)` is constructed and inserted.

`delete_sandbox` prefers terminating the in-memory handle (which calls `connector.close()` and then `k8s_helper.delete_sandbox_claim`); if no handle is tracked, it falls through to `_delete_claim`. `delete_all` iterates a snapshot of `_active_connection_sandboxes.items()` and logs but does not re-raise individual failures, so cleanup is best-effort.
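
The best-effort loop in `delete_all` might look roughly like this. The function name, the `delete_one` callback, and the returned list of failed keys are all hypothetical additions for illustration; the source only states that failures are logged and not re-raised.

```python
import logging
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("delete_all_sketch")


def delete_all_sketch(
    active: Dict[Tuple[str, str], object],
    delete_one: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Try to delete every tracked sandbox; return the keys that failed."""
    failed = []
    # Snapshot the items because delete_one may remove entries from `active`.
    for (namespace, claim_name), _handle in list(active.items()):
        try:
            delete_one(namespace, claim_name)
        except Exception:
            # Log and continue so one failure does not block the rest.
            logger.exception("failed to delete %s/%s", namespace, claim_name)
            failed.append((namespace, claim_name))
    return failed


tracked = {("default", "claim-a"): object(), ("default", "claim-b"): object()}


def flaky_delete(namespace: str, claim_name: str) -> None:
    if claim_name == "claim-a":
        raise RuntimeError("API unavailable")
    del tracked[(namespace, claim_name)]


failures = delete_all_sketch(tracked, flaky_delete)
```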

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:156-213](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:250-287]()

## `Sandbox` — per-instance handle

`Sandbox` is the resource handle the user actually calls into. Each instance owns a `SandboxConnector`, a `CommandExecutor`, and a `Filesystem`, and caches two derived values that are expensive to look up:

- `_pod_name`: pulled from the `agents.x-k8s.io/pod-name` annotation on the Sandbox object, falling back to `sandbox_id` when the annotation is absent.
- `_sandbox_name_hash`: parsed from `status.selector` when it matches the `agents.x-k8s.io/sandbox-name-hash=<value>` label form.
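
Both derivations can be sketched against a plain dictionary shaped like a Sandbox object. The annotation and label keys come from the text above; the function names and the exact dictionary layout are assumptions.

```python
from typing import Optional

POD_NAME_ANNOTATION = "agents.x-k8s.io/pod-name"
NAME_HASH_LABEL = "agents.x-k8s.io/sandbox-name-hash"


def pod_name_sketch(sandbox_obj: dict, sandbox_id: str) -> str:
    """Use the pod-name annotation when present, else fall back to sandbox_id."""
    annotations = (sandbox_obj.get("metadata") or {}).get("annotations") or {}
    return annotations.get(POD_NAME_ANNOTATION) or sandbox_id


def name_hash_sketch(sandbox_obj: dict) -> Optional[str]:
    """Extract the hash from a selector of the form '<label>=<value>'."""
    selector = (sandbox_obj.get("status") or {}).get("selector") or ""
    key, sep, value = selector.partition("=")
    if sep and key == NAME_HASH_LABEL and value:
        return value
    return None


annotated = {
    "metadata": {"annotations": {POD_NAME_ANNOTATION: "warm-pod-7"}},
    "status": {"selector": f"{NAME_HASH_LABEL}=ab12cd34"},
}
bare = {"metadata": {}, "status": {"selector": "app=demo"}}

annotated_pod = pod_name_sketch(annotated, "my-sandbox")
fallback_pod = pod_name_sketch(bare, "my-sandbox")
annotated_hash = name_hash_sketch(annotated)
missing_hash = name_hash_sketch(bare)
```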

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:42-112](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/constants.py:28-30]()

### Pod IP resolution

`Sandbox.get_pod_ip` is passed by reference into both `SandboxConnector` and `InClusterConnectionStrategy` as the `get_pod_ip` callable. It deliberately **does not** cache: the docstring notes that the IP can change after a pod restart (e.g. when `spec.replicas` is scaled to 0 and back), so the function always queries the K8s API and reads `status.podIPs[0]`. Callers (the connector) layer their own caching on top with explicit invalidation hooks.
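
The division of labour (an uncached lookup, with caching and invalidation owned by the caller) can be illustrated with a small wrapper. `PodIPCacheSketch` and its method names are hypothetical and do not reflect the connector's actual fields.

```python
from typing import Callable, Optional


class PodIPCacheSketch:
    """Caller-side cache around an uncached `get_pod_ip` callable."""

    def __init__(self, get_pod_ip: Callable[[], Optional[str]]) -> None:
        self._get_pod_ip = get_pod_ip
        self._cached: Optional[str] = None

    def resolve(self) -> Optional[str]:
        # Only hit the lookup when nothing is cached.
        if self._cached is None:
            self._cached = self._get_pod_ip()
        return self._cached

    def invalidate(self) -> None:
        # Called when the cached address is suspected stale, e.g. after a
        # connection error following a pod restart.
        self._cached = None


# Simulate a pod whose IP changes after a restart.
ips = iter(["10.0.0.5", "10.0.0.9"])
cache = PodIPCacheSketch(lambda: next(ips))
first = cache.resolve()
again = cache.resolve()
cache.invalidate()
after_restart = cache.resolve()
```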

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:114-123]()

### `status`, `close_connection`, `terminate`

`status()` reads the Sandbox object's `conditions` and returns one of three string tags: `"SandboxReady"`, `"SandboxNotReady"`, or `"SandboxNotFound"`, paired with the condition message.
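
A rough sketch of that mapping follows. The three tags are taken from the text; the tuple return shape, the `None`-means-missing convention, and the function name are assumptions.

```python
from typing import Optional, Tuple


def status_sketch(sandbox_obj: Optional[dict]) -> Tuple[str, str]:
    """Map a Sandbox object's Ready condition to a (tag, message) pair."""
    if sandbox_obj is None:
        return "SandboxNotFound", ""
    conditions = (sandbox_obj.get("status") or {}).get("conditions") or []
    for cond in conditions:
        if cond.get("type") == "Ready":
            ready = cond.get("status") == "True"
            tag = "SandboxReady" if ready else "SandboxNotReady"
            return tag, cond.get("message", "")
    # No Ready condition has been reported yet.
    return "SandboxNotReady", ""


ready_obj = {"status": {"conditions": [
    {"type": "Ready", "status": "True", "message": "Pod is Ready"},
]}}
pending_obj = {"status": {"conditions": [
    {"type": "Ready", "status": "False", "message": "Pod is Pending"},
]}}

ready_status = status_sketch(ready_obj)
pending_status = status_sketch(pending_obj)
missing_status = status_sketch(None)
```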

```text
                                     ┌──────────────────────┐
   create_sandbox / get_sandbox ────▶│  Sandbox (active)    │
                                     │  _is_closed = False  │
                                     └──────────┬───────────┘
                                                │
                  close_connection() ───────────┤
                  (local cleanup only)          │
                                                ▼
                                     ┌──────────────────────┐
                                     │  Sandbox (closed)    │◀── idempotent
                                     │  _commands = None    │
                                     │  _files = None       │
                                     │  remote claim alive  │
                                     └──────────┬───────────┘
                                                │
                  terminate() ──────────────────┤
                  (close + delete claim)        │
                                                ▼
                                     ┌──────────────────────┐
                                     │  Sandbox (terminated)│
                                     │  claim_name = None   │
                                     │  remote claim gone   │
                                     └──────────────────────┘
```

`close_connection` is the local-only teardown: it calls `connector.close()`, nulls `_commands` and `_files` (so any further `sandbox.commands.run(...)` raises `AttributeError`), and ends the OpenTelemetry lifecycle span. `terminate` calls `close_connection` first and then `k8s_helper.delete_sandbox_claim`; a `SandboxNotFoundError` from the delete is swallowed so the method is safely idempotent. After a successful delete, `claim_name` is cleared so retries cannot 404 against an empty name.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:125-214]()
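
The difference between the two teardown calls can be modelled with a toy handle. This is a sketch under stated assumptions: `ToyHandle` and `ClaimNotFound` are hypothetical stand-ins for `Sandbox` and `SandboxNotFoundError`, and the delete call is an injected function rather than a real Kubernetes request.

```python
class ClaimNotFound(RuntimeError):
    """Stand-in for SandboxNotFoundError."""


class ToyHandle:
    """Hypothetical handle that models only the close/terminate states."""

    def __init__(self, claim_name, delete_claim):
        self.claim_name = claim_name
        self._delete_claim = delete_claim  # callable(name); may raise ClaimNotFound
        self._is_closed = False
        self.close_calls = 0

    def close_connection(self):
        # Local-only teardown; a second call is a no-op.
        if self._is_closed:
            return
        self._is_closed = True
        self.close_calls += 1  # real code closes the connector and ends the span here

    def terminate(self):
        self.close_connection()
        if self.claim_name is None:  # already terminated
            return
        try:
            self._delete_claim(self.claim_name)
        except ClaimNotFound:
            pass  # the claim is already gone, which is the desired end state
        # Assumption of this sketch: the name is cleared in both branches.
        self.claim_name = None
```

Calling `terminate()` twice on the same handle issues at most one delete, which is the property the state diagram marks as idempotent.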

## `SandboxConnector` and connection strategies

`SandboxConnector` centralizes HTTP transport. It always builds a `requests.Session` with a 5-attempt `Retry` backing off on `500/502/503/504` for `GET/POST/PUT/DELETE`, and selects a `ConnectionStrategy` based on the runtime type of the `connection_config`.

```mermaid
classDiagram
    class ConnectionStrategy {
        <<abstract>>
        +connect() str
        +close()
        +verify_connection()
        +should_inject_router_headers() bool
    }
    class DirectConnectionStrategy {
        +config: SandboxDirectConnectionConfig
        +connect() str
    }
    class GatewayConnectionStrategy {
        +config: SandboxGatewayConnectionConfig
        +k8s_helper: K8sHelper
        +base_url: str
    }
    class LocalTunnelConnectionStrategy {
        +sandbox_id: str
        +port_forward_process: Popen
        +base_url: str
        -_get_free_port()
        -_is_port_open(port)
    }
    class InClusterConnectionStrategy {
        +_dns_url: str
        +_cached_pod_ip_url: str
        +_get_pod_ip: Callable
    }
    class SandboxConnector {
        +id: str
        +namespace: str
        +session: requests.Session
        +strategy: ConnectionStrategy
        +send_request(method, endpoint, **kwargs)
    }

    ConnectionStrategy <|.. DirectConnectionStrategy
    ConnectionStrategy <|.. GatewayConnectionStrategy
    ConnectionStrategy <|.. LocalTunnelConnectionStrategy
    ConnectionStrategy <|.. InClusterConnectionStrategy
    SandboxConnector --> ConnectionStrategy : delegates
```

The factory is a plain `isinstance` chain in `_connection_strategy`. An unrecognized config raises `ValueError("Unknown connection configuration type")`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py:40-304]()
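
A minimal sketch of an `isinstance`-based factory of this shape follows. The four config classes are simplified stand-ins for the SDK's `Sandbox*ConnectionConfig` models, and the function returns strategy names rather than real strategy objects, so treat it as an illustration of the dispatch only.

```python
from dataclasses import dataclass


@dataclass
class DirectConfig:  # stand-in for SandboxDirectConnectionConfig
    api_url: str


class GatewayConfig:  # stand-in for SandboxGatewayConnectionConfig
    pass


class LocalTunnelConfig:  # stand-in for SandboxLocalTunnelConnectionConfig
    pass


class InClusterConfig:  # stand-in for SandboxInClusterConnectionConfig
    pass


def select_strategy(config) -> str:
    """Map a config instance to the name of the strategy that would serve it."""
    if isinstance(config, DirectConfig):
        return "DirectConnectionStrategy"
    if isinstance(config, GatewayConfig):
        return "GatewayConnectionStrategy"
    if isinstance(config, LocalTunnelConfig):
        return "LocalTunnelConnectionStrategy"
    if isinstance(config, InClusterConfig):
        return "InClusterConnectionStrategy"
    raise ValueError("Unknown connection configuration type")
```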

### Strategy comparison

| Strategy | `connect()` produces | Router headers? | Notes |
| --- | --- | --- | --- |
| `DirectConnectionStrategy` | `config.api_url` verbatim | Yes | Caller supplies the router URL. No verification or teardown needed. |
| `GatewayConnectionStrategy` | `http://{ip}` from `k8s_helper.wait_for_gateway_ip` | Yes | Watches a `gateway.networking.k8s.io/v1 Gateway` object; reports `sandbox_client_discovery_latency_ms{mode="gateway"}`. |
| `LocalTunnelConnectionStrategy` | `http://127.0.0.1:{local_port}` | Yes | Spawns `kubectl port-forward svc/sandbox-router-svc <port>:8080`. Polls a TCP probe every 0.5 s until `port_forward_ready_timeout`. `verify_connection` raises `SandboxPortForwardError` if `Popen.poll()` shows the child died. |
| `InClusterConnectionStrategy` | `http://{pod_ip}:{port}` or fallback `http://{sandbox_id}.{namespace}.svc.cluster.local:{port}` | **No** | Bypasses the router; resolves to pod IP only when the `get_pod_ip` callable returns a value, then caches it. |

The local tunnel strategy is the default because `SandboxClient.__init__` defaults `connection_config` to `SandboxLocalTunnelConnectionConfig()`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py:63-255](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:76]()
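
The port handling behind the local tunnel can be approximated with the standard library alone. The sketch below assumes a loopback-only probe; `get_free_port`, `is_port_open`, and `wait_for_port` are illustrative names that mirror, but are not copied from, the strategy's private helpers.

```python
import socket
import time


def get_free_port() -> int:
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the kernel pick one
        return s.getsockname()[1]


def is_port_open(port: int, timeout: float = 0.2) -> bool:
    """Return True when something accepts TCP connections on the port."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(port: int, ready_timeout: float, interval: float = 0.5) -> bool:
    """Poll the port every `interval` seconds until it opens or the timeout passes."""
    deadline = time.monotonic() + ready_timeout
    while time.monotonic() < deadline:
        if is_port_open(port):
            return True
        time.sleep(interval)
    return False
```

A real tunnel would also check the child process between probes, since a `kubectl port-forward` that has already exited will never open the port.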

### Request path and header injection

`SandboxConnector.send_request` is the single chokepoint for every engine call.

```mermaid
sequenceDiagram
    participant Engine as CommandExecutor / Filesystem
    participant Conn as SandboxConnector
    participant Strat as ConnectionStrategy
    participant K8s as K8sHelper / get_pod_ip
    participant Srv as Router / Pod

    Engine->>Conn: send_request(method, endpoint, **kwargs)
    Conn->>Strat: connect()
    Strat-->>Conn: base_url
    Conn->>Strat: verify_connection()
    alt should_inject_router_headers
        Conn->>Conn: headers["X-Sandbox-ID"] / Namespace / Port
        opt _get_pod_ip and not auth_failed
            Conn->>K8s: get_pod_ip()
            K8s-->>Conn: pod_ip or 401/403
            Conn->>Conn: cache pod_ip or set _pod_ip_auth_failed
        end
        Conn->>Conn: headers["X-Sandbox-Pod-IP"] = pod_ip
    end
    Conn->>Srv: session.request(method, url, headers, ...)
    Srv-->>Conn: Response
    Conn->>Conn: raise_for_status
    Conn-->>Engine: Response
```

Three behaviors deserve highlighting:

1. **Auth backoff for pod-IP routing.** When `_get_pod_ip` raises with HTTP `401`/`403`, the connector flips `_pod_ip_auth_failed = True` and stops asking; transient errors are logged at debug and retried on the next request. Once the pod IP has been resolved, it is cached in `_pod_ip` and reused for the header.
2. **Crash-induced reconnect.** `SandboxPortForwardError` triggers `self.close()` and is re-raised, so the next `send_request` will rebuild the `kubectl port-forward` child.
3. **Failure wrapping.** Any `requests.exceptions.RequestException` is caught, the connector is closed, and a `SandboxRequestError(message, status_code, response)` is raised — preserving the HTTP status and raw response on the exception for callers.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py:319-372](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/exceptions.py:42-59]()
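
The header step in the diagram can be written as a small pure function. This is a sketch, not the connector's code: `build_router_headers` is a hypothetical helper, and it assumes the port is sent as a decimal string.

```python
from typing import Dict, Optional


def build_router_headers(sandbox_id: str, namespace: str, port: int,
                         pod_ip: Optional[str] = None) -> Dict[str, str]:
    """Build the X-Sandbox-* headers the router uses to pick a backend."""
    headers = {
        "X-Sandbox-ID": sandbox_id,
        "X-Sandbox-Namespace": namespace,
        "X-Sandbox-Port": str(port),
    }
    if pod_ip:
        # Only added once a pod IP has been resolved.
        headers["X-Sandbox-Pod-IP"] = pod_ip
    return headers
```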

## `CommandExecutor` — `commands.run`

`CommandExecutor` is intentionally thin: one method, `run(command, timeout=60)`, that POSTs `{"command": command}` to the router endpoint `execute`. The response is parsed first into a Python dict and then validated through the `ExecutionResult` pydantic model, which guarantees the three fields `stdout`, `stderr`, and `exit_code`. If the JSON cannot be decoded or the shape is wrong, the method raises `RuntimeError`, chaining the underlying decode or validation error as `__cause__`.

The `@trace_span("run")` decorator and the `set_attribute("sandbox.command", ...)` / `set_attribute("sandbox.exit_code", ...)` calls integrate the executor into the OpenTelemetry lifecycle started by the tracer manager.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py
@trace_span("run")
def run(self, command: str, timeout: int = 60) -> ExecutionResult:
    payload = {"command": command}
    response = self.connector.send_request(
        "POST", "execute", json=payload, timeout=timeout)
    response_data = response.json()
    result = ExecutionResult(**response_data)
    return result
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/command_executor.py:20-50](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py:18-22]()
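
The excerpt above omits the error handling the prose describes. A standard-library sketch of that step might look like the following; `ExecutionResultSketch` is a dataclass stand-in for the pydantic model, and the error message is invented for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ExecutionResultSketch:
    """Stand-in for the pydantic ExecutionResult model."""
    stdout: str
    stderr: str
    exit_code: int


def parse_execution_result(raw: str) -> ExecutionResultSketch:
    """Decode a router response body and check that the three fields exist."""
    try:
        data = json.loads(raw)
        return ExecutionResultSketch(
            stdout=data["stdout"],
            stderr=data["stderr"],
            exit_code=int(data["exit_code"]),
        )
    except (ValueError, KeyError, TypeError) as exc:
        # `from exc` sets __cause__, so the decode error stays inspectable.
        raise RuntimeError(f"Unexpected response from sandbox: {raw!r}") from exc
```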

## `Filesystem` — `files.write` / `read` / `list` / `exists`

`Filesystem` wraps four router endpoints: `upload`, `download/<path>`, `list/<path>`, and `exists/<path>`. The path for `write` travels in the multipart body; the paths for `read`, `list`, and `exists` are passed through `urllib.parse.quote(path, safe='')`, so slashes and other reserved characters are percent-encoded into a single URL segment.

| Method | HTTP | Endpoint | Body / params | Returns |
| --- | --- | --- | --- | --- |
| `write(path, content, timeout=60, allow_unsafe_paths=False)` | `POST` | `upload` | `multipart` field `file` with name `path` | None |
| `read(path, timeout=60, allow_unsafe_paths=False)` | `GET` | `download/<quoted-path>` | — | `bytes` |
| `list(path, timeout=60)` | `GET` | `list/<quoted-path>` | — | `List[FileEntry]` |
| `exists(path, timeout=60)` | `GET` | `exists/<quoted-path>` | — | `bool` |
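
The effect of `safe=''` is easiest to see on a path with a slash and a space. The helper name `endpoint_for` below is hypothetical; only the `quote` call comes from the description above.

```python
from urllib.parse import quote


def endpoint_for(action: str, path: str) -> str:
    """Join an action and a path into one URL segment after the action."""
    # safe='' means '/' is encoded too, so the path cannot add extra segments.
    return f"{action}/{quote(path, safe='')}"


print(endpoint_for("download", "dir/my file.txt"))  # download/dir%2Fmy%20file.txt
```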

### Path-traversal hardening

`write` and `read` route their `path` arguments through `Filesystem._safe_upload_path` unless the caller opts out with `allow_unsafe_paths=True`. The check is deliberately stricter than `os.path.normpath`:

1. Rejects any character with `ord(c) < 0x20` or `ord(c) == 0x7F`, including embedded `\x00`, which would otherwise survive `normpath` and truncate at the server's C/syscall layer.
2. Strips whitespace, rejects empty input, then `posixpath.normpath` and `lstrip('/')`.
3. Rejects the bare `"."` result and any path segment equal to `".."`.

The docstring calls out the exact attack the NUL check defends against: `foo\x00../etc/passwd` would otherwise normalize to `foo` on Linux while being inspected as something else by Python. `list` and `exists` do **not** call this guard — only `write` and `read`, which actually traverse the filesystem on the server.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/filesystem.py:34-161](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py:24-29]()
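
The three checks can be sketched as one function. This is an approximation of the described behavior, not the body of `_safe_upload_path`: the name `safe_relative_path`, the error messages, and the rejection of a path that normalizes to nothing (such as `/`) are assumptions of this example.

```python
import posixpath


def safe_relative_path(path: str) -> str:
    """Return a normalized relative path, or raise ValueError if it looks unsafe."""
    # Step 1: refuse control characters, including NUL (0x00) and DEL (0x7F).
    if any(ord(c) < 0x20 or ord(c) == 0x7F for c in path):
        raise ValueError("path contains control characters")

    # Step 2: strip whitespace, refuse empty input, normalize, drop leading slashes.
    stripped = path.strip()
    if not stripped:
        raise ValueError("path is empty")
    normalized = posixpath.normpath(stripped).lstrip("/")

    # Step 3: refuse the bare "." result and any ".." segment left after normpath.
    # Assumption: an empty result (input was only slashes) is refused as well.
    if normalized in ("", ".") or ".." in normalized.split("/"):
        raise ValueError("path escapes the upload root")
    return normalized
```

Because `normpath` already collapses `a/../b`, a `..` segment can only survive normalization when the path climbs above its starting point, which is exactly the case step 3 refuses.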

## `K8sHelper` — Kubernetes API layer

`K8sHelper` is the only module that imports `kubernetes`. Its constructor tries `config.load_incluster_config()` first and falls back to `config.load_kube_config()`, which makes the same client work in both in-cluster pods and developer laptops.

It exposes a small, purposeful API:

| Method | Resource | Behavior |
| --- | --- | --- |
| `create_sandbox_claim(name, template, namespace, annotations, labels, lifecycle, warmpool)` | `SandboxClaim` (`extensions.agents.x-k8s.io/v1beta1`) | Builds the manifest and calls `create_namespaced_custom_object`. `lifecycle` and `warmpool` are inserted into `spec` only when present. |
| `resolve_sandbox_name(claim_name, namespace, timeout)` | `SandboxClaim` | Opens a `watch.Watch().stream(...)` against a field-selected list. Reads `status.sandbox.name` (or legacy `status.sandbox.Name`). Raises `SandboxTemplateNotFoundError` on `Ready=False/reason=TemplateNotFound` and `SandboxMetadataError` on `DELETED`. |
| `wait_for_sandbox_ready(name, namespace, timeout)` | `Sandbox` (`agents.x-k8s.io/v1beta1`) | Watches for `condition Ready=True` and returns `status.podIPs[0]` when present. Raises `TimeoutError` on deadline, `SandboxNotFoundError` on `DELETED`. |
| `delete_sandbox_claim(name, namespace)` | `SandboxClaim` | Raises `SandboxNotFoundError` when the API returns `ApiException.status == 404`; callers such as `Sandbox.terminate` swallow it, so deleting an already-missing claim counts as success. |
| `get_sandbox(name, namespace)` / `get_sandbox_claim(name, namespace)` | `Sandbox` / `SandboxClaim` | Returns `None` on 404 instead of raising. |
| `list_sandbox_claims(namespace, label_selector=None)` | `SandboxClaim` | Returns the list of claim names; passes `label_selector` through when provided. |
| `wait_for_gateway_ip(gateway_name, namespace, timeout)` | `Gateway` (`gateway.networking.k8s.io/v1`) | Watches `status.addresses[0].value`. |

All CRD coordinates are centralized in `constants.py` so changes to API group/version flow through one file: `CLAIM_API_GROUP`, `CLAIM_API_VERSION`, `CLAIM_PLURAL_NAME`, `SANDBOX_API_GROUP`, `SANDBOX_API_VERSION`, `SANDBOX_PLURAL_NAME`, `GATEWAY_API_GROUP`, `GATEWAY_API_VERSION`, `GATEWAY_PLURAL`, and the annotation/label keys `POD_NAME_ANNOTATION` and `SANDBOX_NAME_HASH_LABEL`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py:32-273](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/constants.py:15-37]()
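
A claim manifest of the shape described for `create_sandbox_claim` could be assembled as follows. This sketch assumes only the fields named on this page (`spec.sandboxTemplateRef.name`, `spec.lifecycle`, `spec.warmpool`); `build_claim_manifest` is a hypothetical helper and the real method may set more.

```python
from typing import Any, Dict, Optional


def build_claim_manifest(name: str, template: str, namespace: str,
                         annotations: Optional[Dict[str, str]] = None,
                         labels: Optional[Dict[str, str]] = None,
                         lifecycle: Optional[Dict[str, Any]] = None,
                         warmpool: Optional[Any] = None) -> Dict[str, Any]:
    """Build a SandboxClaim body for create_namespaced_custom_object."""
    metadata: Dict[str, Any] = {"name": name, "namespace": namespace}
    if annotations:
        metadata["annotations"] = dict(annotations)
    if labels:
        metadata["labels"] = dict(labels)

    spec: Dict[str, Any] = {"sandboxTemplateRef": {"name": template}}
    # Optional sections are added only when the caller supplied them.
    if lifecycle is not None:
        spec["lifecycle"] = lifecycle
    if warmpool is not None:
        spec["warmpool"] = warmpool

    return {
        "apiVersion": "extensions.agents.x-k8s.io/v1beta1",
        "kind": "SandboxClaim",
        "metadata": metadata,
        "spec": spec,
    }
```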

### Watch loop pattern

Both `resolve_sandbox_name` and `wait_for_sandbox_ready` follow the same structure: compute a `deadline = time.monotonic() + timeout`, enter a `while True` loop, open a fresh `watch.Watch()` per outer iteration with `timeout_seconds=remaining`, and call `w.stop()` immediately before returning or raising. This pattern means a transient API-server disconnect causes the watch to end gracefully (the inner `for` exits), the outer loop recomputes the remaining budget, and a new watch is opened — no busy-loop, no swallowed deadline.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/k8s_helper.py:77-168]()
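
The deadline-and-reopen structure can be shown without a cluster by injecting the stream. In this sketch `open_stream` is a hypothetical callable standing in for `watch.Watch().stream(..., timeout_seconds=...)`, and `wait_for_event` is an illustrative name rather than an SDK function.

```python
import time
from typing import Callable, Iterable


def wait_for_event(open_stream: Callable[[int], Iterable[dict]],
                   is_done: Callable[[dict], bool],
                   timeout: float) -> dict:
    """Reopen the stream until an event satisfies is_done or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("resource did not reach the expected state in time")
        # A fresh stream per outer iteration, bounded by what is left of the budget.
        for event in open_stream(max(1, int(remaining))):
            if is_done(event):
                return event
        # The stream ended without a match (server timeout or disconnect):
        # fall through and recompute the remaining budget.
```

A real watch blocks while it waits for events, so the outer loop only spins when the server closes the stream early.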

## Exception hierarchy

All errors derive from `SandboxError(RuntimeError)`, so a broad `except RuntimeError` clause still catches sandbox-specific failures, while callers that need finer handling can catch the specific subclasses.

| Exception | Raised by | Meaning |
| --- | --- | --- |
| `SandboxNotReadyError` | (reserved for callers / extensions) | Sandbox not yet ready for communication. |
| `SandboxNotFoundError` | `k8s_helper.delete_sandbox_claim`, `SandboxClient.get_sandbox`, `Sandbox.terminate` | Claim or sandbox missing or deleted. |
| `SandboxTemplateNotFoundError` | `k8s_helper.resolve_sandbox_name` | `Ready=False/TemplateNotFound`. |
| `SandboxPortForwardError` | `LocalTunnelConnectionStrategy.connect` / `verify_connection` | `kubectl port-forward` crashed. |
| `SandboxMetadataError` | `k8s_helper.resolve_sandbox_name` | Claim deleted mid-resolution. |
| `SandboxRequestError(message, status_code, response)` | `SandboxConnector.send_request` | HTTP request to the sandbox failed; preserves HTTP status and raw response. |

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/exceptions.py:18-59]()
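
The sketch below re-declares a slice of the hierarchy locally to show how broad and narrow handlers interact. The class bodies are assumptions that follow the table (in particular the `status_code` and `response` attributes); real code would import the classes from the SDK's `exceptions` module, and `describe_failure` is a hypothetical helper.

```python
class SandboxError(RuntimeError):
    """Local stand-in for the SDK's base error."""


class SandboxNotFoundError(SandboxError):
    pass


class SandboxRequestError(SandboxError):
    def __init__(self, message, status_code=None, response=None):
        super().__init__(message)
        self.status_code = status_code
        self.response = response


def describe_failure(exc: Exception) -> str:
    """Order handlers from most to least specific."""
    try:
        raise exc
    except SandboxRequestError as e:
        return f"http failure ({e.status_code})"
    except SandboxNotFoundError:
        return "missing"
    except RuntimeError:
        return "other runtime error"
```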

## End-to-end flow

The diagram below walks a typical synchronous session end-to-end, naming each module that participates.

```mermaid
sequenceDiagram
    participant User
    participant SC as SandboxClient
    participant KH as K8sHelper
    participant SB as Sandbox
    participant CN as SandboxConnector
    participant Router as sandbox-router-svc

    User->>SC: create_sandbox(template="...")
    SC->>KH: create_sandbox_claim(name, template, ns)
    KH-->>SC: claim created
    SC->>KH: resolve_sandbox_name(claim, ns, t)
    KH-->>SC: sandbox_id
    SC->>KH: wait_for_sandbox_ready(sandbox_id, ns, t)
    KH-->>SC: Ready=True
    SC->>SB: Sandbox(claim_name, sandbox_id, ns, ...)
    SB->>CN: SandboxConnector(...)
    SC-->>User: Sandbox handle
    User->>SB: sandbox.commands.run("ls")
    SB->>CN: send_request("POST", "execute", json=...)
    CN->>Router: HTTP POST with X-Sandbox-* headers
    Router-->>CN: 200 + ExecutionResult JSON
    CN-->>SB: response
    SB-->>User: ExecutionResult
    User->>SC: delete_sandbox(claim_name)
    SC->>SB: terminate()
    SB->>CN: close() (stops port-forward, closes session)
    SB->>KH: delete_sandbox_claim(claim_name, ns)
    KH-->>SB: deleted (or 404 swallowed)
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox_client.py:94-154](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/sandbox.py:190-214](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/connector.py:319-372]()

## Summary

The synchronous SDK core is intentionally small and explicit: `SandboxClient` owns lifecycle and a registry, `Sandbox` owns engines and connection state, `SandboxConnector` owns transport and chooses one of four strategies, and `K8sHelper` is the sole touchpoint with the Kubernetes API. Layered on top are deterministic teardown semantics (`close_connection` vs. `terminate`), idempotent delete paths that absorb 404s, defensive path validation in `Filesystem._safe_upload_path`, and an exception hierarchy rooted at `SandboxError`. The async surface mirrors this structure module-for-module behind `AsyncSandboxClient`, which `__init__.py` imports lazily and replaces with an informative stub when the async extras are not installed.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py:26-36]()

---

## 22. Python Async SDK

> The asyncio mirror of the sync SDK: AsyncSandboxClient, AsyncSandbox, async connector, async filesystem, and async command executor.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/22-python-async-sdk.md
- Generated: 2026-05-25T22:45:57.244Z

### Source Files

- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py)
</details>

# Python Async SDK

The Python Async SDK is the `asyncio`-native mirror of the synchronous `SandboxClient` stack. It is intended for asynchronous applications (FastAPI services, agent frameworks built on `asyncio`, concurrent batch tools) where awaiting `kubectl`-like watches, HTTP calls to the sandbox router, and per-sandbox lifecycle work must not block the event loop. The async layer is shipped as an *optional* extra: when the underlying dependencies (`kubernetes_asyncio`, `httpx`) are absent, importing `AsyncSandboxClient` raises a guided `ImportError` pointing at `pip install k8s-agent-sandbox[async]`.

Functionally, the async API is intentionally close to the sync API — same connection-config model, same `commands` / `files` shape on a sandbox handle, same SandboxClaim → Sandbox resolution semantics — but with two deliberate differences: (1) it does **not** support `SandboxLocalTunnelConnectionConfig` (no `kubectl port-forward` subprocess), and (2) it has no `atexit` cleanup fallback, because async cleanup cannot run from an `atexit` handler. Callers are expected to use `async with` or to call `delete_all()` + `close()` explicitly.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py:26-35](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:14-58]()
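
One common way to ship an optional extra is to swap the real class for a stub that explains how to install it. The sketch below shows that pattern under assumptions: `load_optional` is a hypothetical helper, the stub raises when instantiated, and the message wording is invented; the package's own `__init__.py` may differ in when and how the error surfaces.

```python
import importlib


def load_optional(module_name: str, symbol: str, extra: str):
    """Return module.symbol, or a stub class that raises a guided ImportError."""
    try:
        module = importlib.import_module(module_name)
        return getattr(module, symbol)
    except ImportError as exc:
        reason = exc  # keep a reference; `exc` is cleared when the block ends

        class _MissingExtra:
            def __init__(self, *args, **kwargs):
                raise ImportError(
                    f"{symbol} requires optional dependencies; "
                    f"install them with: pip install k8s-agent-sandbox[{extra}]"
                ) from reason

        _MissingExtra.__name__ = symbol
        return _MissingExtra
```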

## Module Layout

The async surface lives alongside the sync surface in the `k8s_agent_sandbox` package. Each sync module has an `async_*` twin that re-uses the shared models and helpers where possible (e.g. the path sanitizer from `files.filesystem.Filesystem._safe_upload_path`).

| Concern | Async module | Public symbol |
| --- | --- | --- |
| Lifecycle / registry | `async_sandbox_client.py` | `AsyncSandboxClient` |
| Sandbox handle | `async_sandbox.py` | `AsyncSandbox` |
| HTTP transport | `async_connector.py` | `AsyncSandboxConnector` |
| Kubernetes API | `async_k8s_helper.py` | `AsyncK8sHelper` |
| Command execution | `commands/async_command_executor.py` | `AsyncCommandExecutor` |
| Filesystem | `files/async_filesystem.py` | `AsyncFilesystem` |
| Shared models | `models.py` | `SandboxConnectionConfig`, `ExecutionResult`, `FileEntry`, … |

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py:17-23](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py:17-21]()

## Architecture

The async stack composes four collaborators behind the `AsyncSandboxClient` facade. The client owns the `AsyncK8sHelper` (long-lived, shared across sandboxes) and the registry of live handles; each `AsyncSandbox` owns its own `AsyncSandboxConnector` plus `AsyncCommandExecutor` and `AsyncFilesystem` views over that connector.

```mermaid
flowchart LR
    subgraph User["Caller (asyncio app)"]
        APP[await client / sandbox]
    end

    subgraph Client["AsyncSandboxClient"]
        REG["_active_connection_sandboxes\n(namespace, claim_name) → AsyncSandbox"]
        LOCK["asyncio.Lock"]
        CLAIM["_create_claim / _wait_for_sandbox_ready / _delete_claim"]
    end

    subgraph Handle["AsyncSandbox"]
        CMDS["commands : AsyncCommandExecutor"]
        FILES["files : AsyncFilesystem"]
        CONN["connector : AsyncSandboxConnector"]
        STATE["_is_closed / get_pod_name / get_pod_ip"]
    end

    subgraph K8s["AsyncK8sHelper (kubernetes_asyncio)"]
        WATCH["watch.Watch streams"]
        CRD["CustomObjectsApi: SandboxClaim / Sandbox / Gateway"]
    end

    subgraph Sandbox["Sandbox pod / router / gateway"]
        ROUTER["/execute, /upload, /download/{path}, /list/{path}, /exists/{path}"]
    end

    APP --> Client
    Client --> Handle
    CLAIM --> K8s
    STATE --> K8s
    CMDS --> CONN
    FILES --> CONN
    CONN -->|httpx.AsyncClient| ROUTER
    K8s --> CRD
    K8s --> WATCH
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:41-105](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py:26-95](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:38-91]()

## `AsyncSandboxClient`

`AsyncSandboxClient` is a generic, registry-based lifecycle manager. The generic parameter `T = TypeVar("T", bound=AsyncSandbox)` plus the class attribute `sandbox_class: type[T] = AsyncSandbox` are the hook that subclasses (e.g. snapshot-aware variants under `gke_extensions/`) use to substitute a richer handle type without rewriting lifecycle logic.

### Construction and required configuration

The constructor refuses to build a client without a `connection_config`. This is stricter than the sync client and is deliberate: `SandboxLocalTunnelConnectionConfig` is rejected by `AsyncSandboxConnector` at construction, so there is no implicit local-dev fallback. The error message points callers explicitly at the supported configs and at the sync client for local development.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:62-85
def __init__(self, connection_config: SandboxConnectionConfig | None = None,
             tracer_config: SandboxTracerConfig | None = None):
    if connection_config is None:
        raise ValueError(
            "connection_config is required for AsyncSandboxClient. "
            "Use SandboxDirectConnectionConfig, SandboxGatewayConnectionConfig, or "
            "SandboxInClusterConnectionConfig. ..."
        )
    ...
    self.k8s_helper = AsyncK8sHelper()
    self._active_connection_sandboxes: dict[tuple[str, str], T] = {}
    self._lock = asyncio.Lock()
```

The registry key is `(namespace, claim_name)` and all mutations are guarded by `asyncio.Lock` to keep concurrent `create_sandbox` / `get_sandbox` / `delete_sandbox` calls coherent.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:38-85]()

### Context-manager cleanup contract

Async cleanup cannot run from `atexit`, so the class only supports two safe patterns:

1. `async with AsyncSandboxClient(...) as client: ...` — `__aexit__` calls `delete_all()` followed by `close()`.
2. Manual: `await client.delete_all(); await client.close()`.

If neither pattern is followed, in-flight SandboxClaims will outlive the process and accumulate in the cluster. `close()` walks every tracked sandbox, awaits `_close_connection()` on each, then shuts down the shared `AsyncK8sHelper` API client.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:50-105](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:304-313]()
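
The cleanup contract can be demonstrated with a toy client that records what `__aexit__` does. `ToyAsyncClient` is a hypothetical model of the registry, lock, and exit order only; it creates no claims and talks to no cluster.

```python
import asyncio


class ToyAsyncClient:
    """Models the (namespace, claim_name) registry and the async-with exit order."""

    def __init__(self):
        self._active = {}            # (namespace, claim_name) -> handle
        self._lock = asyncio.Lock()  # guards every registry mutation
        self.events = []

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.delete_all()
        await self.close()

    async def track(self, namespace, claim_name):
        async with self._lock:
            self._active[(namespace, claim_name)] = object()

    async def delete_all(self):
        async with self._lock:
            keys = list(self._active)
            self._active.clear()
        for namespace, claim_name in keys:
            self.events.append(f"delete {namespace}/{claim_name}")

    async def close(self):
        self.events.append("close")


async def main():
    async with ToyAsyncClient() as client:
        await client.track("default", "claim-a")
    return client.events
```

Running `asyncio.run(main())` shows the delete happening before the close, even though the body of the `async with` block never asked for either.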

### Lifecycle API

| Method | Purpose | Notes |
| --- | --- | --- |
| `create_sandbox(template, namespace, sandbox_ready_timeout, labels, warmpool, *, shutdown_after_seconds)` | Provision a new `SandboxClaim` and wait until the underlying Sandbox is `Ready`. | `shutdown_after_seconds` writes `spec.lifecycle.shutdownTime` + `shutdownPolicy=Delete`; on any failure or `CancelledError`, the in-flight claim is deleted under `asyncio.shield`. |
| `get_sandbox(claim_name, namespace, resolve_timeout, template_name)` | Reattach to an existing claim. | When `template_name` is supplied and `spec.sandboxTemplateRef.name` differs, raises `ValueError` (refuse-to-reattach guard); other failures raise `SandboxNotFoundError`. |
| `list_active_sandboxes()` | Tuples currently held in the registry. | Lazily prunes entries whose `is_active` is `False`. |
| `list_all_sandboxes(namespace, label_selector)` | Cluster-side listing of `SandboxClaim` names. | Forwards `label_selector` to the K8s list call. |
| `delete_sandbox(claim_name, namespace)` | Terminate one sandbox. | If tracked, calls `sandbox.terminate()`; otherwise deletes the claim directly. Failures are logged, not raised. |
| `delete_all()` | Bulk delete every tracked sandbox. | Per-sandbox errors are logged and do not abort the sweep. |

Cancellation safety in `create_sandbox` is non-trivial: the `except (Exception, asyncio.CancelledError)` arm wraps cleanup in `asyncio.shield(self._delete_claim(...))` so that a task cancellation racing with claim creation still releases the cluster-side resource before the exception propagates.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:107-313]()
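
The shielded cleanup can be reproduced in isolation. In this sketch the claim store is an in-memory set and both coroutines are hypothetical stand-ins for the client's create, wait, and delete steps; only the `except` and `asyncio.shield` shape follows the description above.

```python
import asyncio


async def _delete_claim(claims: set, deleted: list, name: str) -> None:
    await asyncio.sleep(0)  # yield once, as a real API call would
    claims.discard(name)
    deleted.append(name)


async def create_with_cleanup(claims: set, deleted: list, name: str,
                              ready_delay: float) -> str:
    claims.add(name)  # stands in for creating the SandboxClaim
    try:
        await asyncio.sleep(ready_delay)  # stands in for waiting until Ready
        return name
    except (Exception, asyncio.CancelledError):
        # shield() keeps the delete running even if this task is cancelled again.
        await asyncio.shield(_delete_claim(claims, deleted, name))
        raise


async def cancel_midway():
    claims, deleted = set(), []
    task = asyncio.create_task(create_with_cleanup(claims, deleted, "c1", 10))
    await asyncio.sleep(0.01)  # let the task reach the wait
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return claims, deleted
```

After `asyncio.run(cancel_midway())`, the claim set is empty and the delete was recorded once, which is the property the shielded branch exists to guarantee.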

### Label validation

Label validation is identical to the sync client and lives on `AsyncSandboxClient` as static methods, anchored by two regexes and the Kubernetes-standard length limits (`63` for names, `253` for prefixes). Keys may be `prefix/name` (prefix must be a DNS subdomain) or just `name`; values are validated with the same name regex.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:315-361]()
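
A validator along these lines is sketched below. The two patterns are the standard Kubernetes forms for label names and DNS subdomains and are an assumption about what the SDK uses; the function names are illustrative, and accepting an empty value follows Kubernetes rules rather than anything stated on this page.

```python
import re

NAME_MAX = 63
PREFIX_MAX = 253
_NAME_RE = re.compile(r"[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?")
_SUBDOMAIN_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def validate_label_key(key: str) -> None:
    """Accept 'name' or 'prefix/name'; raise ValueError otherwise."""
    prefix, sep, name = key.rpartition("/")
    if sep and (not prefix or len(prefix) > PREFIX_MAX
                or not _SUBDOMAIN_RE.fullmatch(prefix)):
        raise ValueError(f"invalid label key prefix: {prefix!r}")
    if not name or len(name) > NAME_MAX or not _NAME_RE.fullmatch(name):
        raise ValueError(f"invalid label key name: {name!r}")


def validate_label_value(value: str) -> None:
    """Values follow the name rules; an empty value is allowed (assumption)."""
    if value == "":
        return
    if len(value) > NAME_MAX or not _NAME_RE.fullmatch(value):
        raise ValueError(f"invalid label value: {value!r}")
```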

### Trace context propagation

`_create_claim`, `_wait_for_sandbox_ready`, and `_delete_claim` are wrapped with `@async_trace_span(...)`. When tracing is enabled and a tracing manager is attached, `_create_claim` injects the current trace context as JSON into the claim's `metadata.annotations["opentelemetry.io/trace-context"]`, allowing the controller side to continue the same trace. Span attributes record `sandbox.claim.name` plus optional lifecycle attributes.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:363-399]()
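
Serializing a trace context into the annotation reduces to a JSON dump of a carrier dict. The sketch assumes the carrier holds a W3C `traceparent` entry, as OpenTelemetry propagators typically produce; `annotate_with_trace_context` is a hypothetical helper and the carrier value is the example ID from the W3C Trace Context specification.

```python
import json
from typing import Dict, Optional

TRACE_ANNOTATION = "opentelemetry.io/trace-context"


def annotate_with_trace_context(annotations: Optional[Dict[str, str]],
                                carrier: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Return a copy of annotations with the carrier stored as a JSON string."""
    merged = dict(annotations or {})
    if carrier:  # nothing is added when tracing produced no context
        merged[TRACE_ANNOTATION] = json.dumps(carrier)
    return merged


example_carrier = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
```

A consumer on the controller side would `json.loads` the annotation and hand the resulting dict to its own propagator to continue the trace.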

## `AsyncSandbox`

`AsyncSandbox` is a thin coordinator: a value object plus three composed services (`connector`, `_commands`, `_files`). Like the client, it rejects `connection_config=None` because the absence of a local-tunnel fallback makes the parameter mandatory.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py:62-83
self.connector = AsyncSandboxConnector(
    sandbox_id=self.sandbox_id,
    namespace=self.namespace,
    connection_config=self.connection_config,
    k8s_helper=self.k8s_helper,
    get_pod_ip=self.get_pod_ip,
)
...
self._commands = AsyncCommandExecutor(self.connector, self.tracer, self.trace_service_name)
self._files = AsyncFilesystem(self.connector, self.tracer, self.trace_service_name)
```

Two K8s-backed accessors expose live pod state:

- `get_pod_name()` — memoized after the first lookup. Reads `metadata.annotations[POD_NAME_ANNOTATION]` from the Sandbox object; falls back to `sandbox_id` when the annotation is absent.
- `get_pod_ip()` — **not** memoized. The pod IP can change after a restart (e.g. `spec.replicas` scaled to `0` and back), so every call hits the K8s API and returns `status.podIPs[0]` or `None`.

`_close_connection()` is idempotent (guards on `_is_closed`), closes the connector's `httpx` client, drops the `_commands` / `_files` references (so `is_active` returns `False`), and ends the lifecycle tracing span. `terminate()` is the destructive variant: `_close_connection()` then `delete_sandbox_claim()`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox.py:40-143]()

## `AsyncSandboxConnector`

The connector wraps a single `httpx.AsyncClient` (60 s default timeout, `AsyncHTTPTransport(retries=3)` at the transport layer) and resolves the base URL lazily per connection mode.

### Supported connection modes

```text
DirectConnection      api_url                                  ── injected router headers
GatewayConnection     wait_for_gateway_ip → http://<ip>        ── injected router headers
InClusterConnection   http://<id>.<ns>.svc.cluster.local:<p>   ── no router headers
                      or http://<podIP>:<p> if use_pod_ip      ── no router headers
LocalTunnel           rejected at __init__ (ValueError)
```

The `_inject_router_headers` flag is `False` only for `SandboxInClusterConnectionConfig`. In all other modes, every request carries `X-Sandbox-ID`, `X-Sandbox-Namespace`, `X-Sandbox-Port`, and (when available) `X-Sandbox-Pod-IP`, which lets the router route to the correct backend.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:38-91](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:125-149]()

### Pod-IP resolution and the auth latch

The connector caches a pod IP for routing speed, but it must defend against permission errors:

- First time around it calls the `get_pod_ip` callback (provided by `AsyncSandbox.get_pod_ip`).
- If the call raises and the underlying response is `401`/`403`, `_pod_ip_auth_failed` is **permanently** set on this client instance — pod-IP routing is disabled for its lifetime to avoid hammering the K8s API with a token that cannot read sandboxes.
- Transient errors are logged at debug and retried on later requests.

For `SandboxInClusterConnectionConfig`, `_resolve_base_url` upgrades the URL once pod-IP resolution succeeds, otherwise it falls back to cluster DNS.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:69-149]()
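
The latch behavior in the list above can be sketched with an injected callback. `PodIpResolver` and `ApiError` are hypothetical stand-ins; the real connector inspects the Kubernetes client's exception type and keeps additional cache state.

```python
import asyncio


class ApiError(Exception):
    """Stand-in for a Kubernetes API exception carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status


class PodIpResolver:
    def __init__(self, get_pod_ip):
        self._get_pod_ip = get_pod_ip  # async callable returning an IP or None
        self._pod_ip = None
        self._auth_failed = False

    async def resolve(self):
        """Return the cached IP, try to look it up, or give up after 401/403."""
        if self._pod_ip is not None or self._auth_failed:
            return self._pod_ip
        try:
            self._pod_ip = await self._get_pod_ip()
        except ApiError as exc:
            if exc.status in (401, 403):
                self._auth_failed = True  # latch: never ask again on this instance
            # Any other status is treated as transient and retried next time.
        return self._pod_ip
```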

### Retry and cache invalidation

`send_request` retries up to `MAX_RETRIES = 5` times on `{500, 502, 503, 504}` with exponential backoff (`BACKOFF_FACTOR * 2**attempt`, i.e. 0.5 s, 1 s, 2 s, …). On any HTTP error it raises `SandboxRequestError` and clears caches that may have gone stale:

- For `SandboxGatewayConnectionConfig`, the `_base_url` is dropped so the next request re-queries the gateway IP.
- Pod-IP caching state (`_pod_ip_resolved`, `_cached_pod_ip_url`, `_pod_ip`) is reset.

`close()` calls `httpx.AsyncClient.aclose()` and clears the same caches.

```mermaid
sequenceDiagram
    participant Caller as AsyncCommandExecutor/AsyncFilesystem
    participant Conn as AsyncSandboxConnector
    participant K8s as AsyncK8sHelper
    participant Router as Sandbox router / pod

    Caller->>Conn: send_request(METHOD, endpoint, ...)
    Conn->>Conn: _resolve_base_url()
    alt Gateway mode, no cached IP
        Conn->>K8s: wait_for_gateway_ip(...)
        K8s-->>Conn: ip_address
    else InCluster + use_pod_ip
        Conn->>K8s: get_pod_ip via callback
        K8s-->>Conn: podIPs[0] or None
    end
    Conn->>Router: httpx.AsyncClient.request(...)
    alt 5xx and attempt < 5
        Router-->>Conn: retryable status
        Conn->>Conn: asyncio.sleep(0.5 * 2^attempt)
        Conn->>Router: retry
    end
    Router-->>Conn: response
    Conn-->>Caller: httpx.Response (or SandboxRequestError)
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:33-208]()
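
To make the backoff arithmetic concrete, the sketch below computes the delay schedule from the constants quoted above. The constant values come from the text; the helper names `backoff_delays` and `is_retryable` are hypothetical, and whether the real connector sleeps after its final attempt is not something this sketch asserts.

```python
MAX_RETRIES = 5
BACKOFF_FACTOR = 0.5
RETRYABLE_STATUSES = {500, 502, 503, 504}


def backoff_delays(max_retries: int = MAX_RETRIES, factor: float = BACKOFF_FACTOR) -> list[float]:
    """Delay before each retry: factor * 2**attempt for attempt 0..max_retries-1."""
    return [factor * 2**attempt for attempt in range(max_retries)]


def is_retryable(status: int, attempt: int) -> bool:
    """Retry only listed 5xx statuses while attempts remain."""
    return status in RETRYABLE_STATUSES and attempt < MAX_RETRIES


print(backoff_delays())  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

If every delay in this schedule were taken, the sleeps alone would add up to 15.5 seconds, which is worth keeping in mind when choosing a per-request `timeout`.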

## `AsyncK8sHelper`

`AsyncK8sHelper` wraps `kubernetes_asyncio` and is shared by the client and all of its sandboxes through `AsyncSandbox.k8s_helper`. Initialization is lazy and once-only: an `asyncio.Lock` guards `_ensure_initialized`, which tries in-cluster config first and falls back to `load_kube_config()`. `close()` shuts down the shared `ApiClient` and resets the latch so a later call would re-initialize cleanly.

### CRD operations

| Method | Resource | Behavior |
| --- | --- | --- |
| `create_sandbox_claim(name, template, namespace, annotations, labels, lifecycle, warmpool)` | `SandboxClaim` | Builds a manifest with `spec.sandboxTemplateRef`, optional `spec.lifecycle` and `spec.warmpool`; calls `create_namespaced_custom_object`. |
| `resolve_sandbox_name(claim_name, namespace, timeout)` | `SandboxClaim` (watch) | Watches the claim until `status.sandbox.name` is populated (warm-pool adoption may produce a different name from the claim). Surfaces `Ready=False` + `TemplateNotFound` as `SandboxTemplateNotFoundError`. Supports both `name` (post-rename) and legacy `Name` keys. |
| `wait_for_sandbox_ready(name, namespace, timeout)` | `Sandbox` (watch) | Watches until a condition with `type=Ready, status=True`. Returns the first `status.podIPs` entry (or `None` on older controllers). |
| `delete_sandbox_claim(name, namespace)` | `SandboxClaim` | `404` swallowed; other API errors raise `SandboxNotFoundError`. |
| `get_sandbox`, `get_sandbox_claim` | both | Return the raw object dict or `None` on `404`. |
| `list_sandbox_claims(namespace, label_selector)` | `SandboxClaim` | Returns `metadata.name` for each item, optionally filtered by selector. |
| `wait_for_gateway_ip(gateway_name, namespace, timeout)` | `Gateway` (watch) | Watches the Gateway custom resource for `status.addresses[0].value`. |

All watch-based methods (`resolve_sandbox_name`, `wait_for_sandbox_ready`, `wait_for_gateway_ip`) follow the same pattern: compute a remaining-time budget, open a `watch.Watch()`, iterate `w.stream(...)` with `timeout_seconds=remaining`, and always `await w.close()` in a `finally` block to release the streaming HTTP connection — important because `kubernetes_asyncio` watches hold a live connection until closed.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py:37-331]()
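
The shared watch pattern can be sketched without a cluster. In the example below, `FakeWatch` is a hypothetical stand-in that only mimics the `stream()` and `close()` shape of a watch object, and `wait_until_ready` is an illustrative name; the real helper streams events from the Kubernetes API instead.

```python
import asyncio
import time


class FakeWatch:
    """Hypothetical stand-in for a watch object: stream() plus close()."""

    def __init__(self, events):
        self._events = events
        self.closed = False

    async def stream(self, timeout_seconds: int):
        for event in self._events:
            await asyncio.sleep(0)  # yield control, as a network read would
            yield event

    async def close(self):
        self.closed = True


async def wait_until_ready(watch, timeout: float):
    deadline = time.monotonic() + timeout
    try:
        # Compute the remaining budget so the server-side timeout never
        # exceeds what the caller asked for.
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no time left before opening the watch")
        async for event in watch.stream(timeout_seconds=max(1, int(remaining))):
            if event.get("ready"):
                return event
        raise TimeoutError("stream ended before the object became ready")
    finally:
        # Runs on success, timeout, and cancellation alike.
        await watch.close()
```

Putting `close()` in `finally` means the connection is released on every exit path, including an early `return` from inside the loop.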

## `AsyncCommandExecutor`

The executor is a thin wrapper over `connector.send_request("POST", "execute", json={"command": ...}, timeout=...)`. The response is parsed into the shared `ExecutionResult` Pydantic model (`stdout`, `stderr`, `exit_code`); decode or schema failures are raised as `RuntimeError` so callers can distinguish them from `SandboxRequestError`. The whole `run` method is wrapped in `@async_trace_span("run")`, with `sandbox.command` and `sandbox.exit_code` recorded as span attributes when tracing is recording.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py:30-56
@async_trace_span("run")
async def run(self, command: str, timeout: int = 60) -> ExecutionResult:
    ...
    response = await self.connector.send_request("POST", "execute", json={"command": command}, timeout=timeout)
    ...
    return ExecutionResult(**response.json())
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/commands/async_command_executor.py:20-57](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/models.py:18-22]()

## `AsyncFilesystem`

`AsyncFilesystem` mirrors the four core sync filesystem methods, all decorated with `@async_trace_span(...)`:

| Method | HTTP | Notes |
| --- | --- | --- |
| `write(path, content, timeout, allow_unsafe_paths)` | `POST /upload` | Encodes `str` content as UTF-8 bytes; routes through `Filesystem._safe_upload_path` unless `allow_unsafe_paths=True`. |
| `read(path, timeout, allow_unsafe_paths)` | `GET /download/{path}` | Same path sanitization gate. Returns `response.content` bytes. |
| `list(path, timeout)` | `GET /list/{path}` | Parses into `list[FileEntry]`. |
| `exists(path, timeout)` | `GET /exists/{path}` | Returns the `exists` boolean from the JSON body. |

The path-safety contract is shared with the sync class via a direct call into `Filesystem._safe_upload_path`; re-implementing it asynchronously would risk drift. The accompanying comment explains why `os.path.basename` is not sufficient: it only strips directory components and does not validate the remaining name, so an embedded NUL survives (for example, `basename("foo\x00bar")` returns the string unchanged) and the name could then be truncated at the C layer when handed to the OS. The hardened sanitizer rejects empty or bare-`.` paths, embedded NUL and ASCII control characters, and any `..` segment that remains after normalization. URL-path interpolation goes through `urllib.parse.quote(path, safe="")` on every read/list/exists call.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/files/async_filesystem.py:24-139]()
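
The rules listed above can be approximated in a few lines. This is a sketch under stated assumptions, not the body of `Filesystem._safe_upload_path`: the names `check_upload_path` and `download_endpoint` are hypothetical, and the exact normalization the real sanitizer performs may differ.

```python
import posixpath
from urllib.parse import quote


def check_upload_path(path: str) -> str:
    """Hypothetical validator approximating the rejection rules above."""
    if path in ("", "."):
        raise ValueError("path must not be empty or '.'")
    # Reject NUL, other ASCII control characters, and DEL.
    if any(ord(ch) < 0x20 or ord(ch) == 0x7F for ch in path):
        raise ValueError("path contains NUL or ASCII control characters")
    normalized = posixpath.normpath(path)
    # Any '..' left after normalization would escape the starting directory.
    if ".." in normalized.split("/"):
        raise ValueError("path must not contain '..' segments")
    return normalized


def download_endpoint(path: str) -> str:
    # safe="" also percent-encodes "/", so the path stays a single URL segment.
    return "download/" + quote(path, safe="")


print(download_endpoint("/tmp/note.txt"))  # download/%2Ftmp%2Fnote.txt
```

Encoding the slash matters here because the path is interpolated into a route such as `/download/{path}`; an unencoded `/` would otherwise be read as additional route segments.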

## Optional dependency boundary

The async stack is opt-in. The top-level `__init__.py` imports `AsyncSandboxClient` from `async_sandbox_client`; on `ImportError` it replaces the symbol with a stub class whose `__init__` raises a guided `ImportError` telling the caller to install the `async` extra. This keeps the sync surface usable on minimal installations while still letting users `from k8s_agent_sandbox import AsyncSandboxClient` and discover the missing-extras error at instantiation time rather than at import time.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py:26-35
try:
    from .async_sandbox_client import AsyncSandboxClient
except ImportError:
    class AsyncSandboxClient:  # type: ignore[no-redef]
        def __init__(self, *args, **kwargs):
            raise ImportError(
                "AsyncSandboxClient requires the 'async' extras. "
                "Install with: pip install k8s-agent-sandbox[async]"
            )
```

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/__init__.py:26-35]()

## Sync vs. Async at a glance

The two stacks share models, label-validation rules, path sanitization, and tracing decorators. The differences worth knowing before choosing:

| Aspect | Sync | Async |
| --- | --- | --- |
| Local development via `kubectl port-forward` | Supported (`SandboxLocalTunnelConnectionConfig`) | **Not supported** — connector rejects it; client refuses `connection_config=None` |
| Process-exit safety net | `atexit` cleanup | None; must use `async with` or explicit `delete_all` + `close` |
| HTTP transport | `requests` | `httpx.AsyncClient` (transport-level retries=3) |
| K8s API | `kubernetes` | `kubernetes_asyncio` (watches must be closed in `finally`) |
| Mutual exclusion | `threading.Lock` | `asyncio.Lock` |
| Cancellation handling on create | Exceptions | `Exception \| CancelledError`, claim cleanup under `asyncio.shield` |

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:50-173](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_connector.py:55-91](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_k8s_helper.py:107-207]()

## Putting it together

A minimal end-to-end async session — provision a sandbox, run a command, upload and download a file, and let the context manager clean up — exercises every layer described above:

```python
import asyncio

from k8s_agent_sandbox import AsyncSandboxClient
from k8s_agent_sandbox.models import SandboxDirectConnectionConfig

async def main():
    config = SandboxDirectConnectionConfig(api_url="http://router.example")
    async with AsyncSandboxClient(connection_config=config) as client:
        sandbox = await client.create_sandbox(
            template="python-sandbox-template",
            shutdown_after_seconds=600,
        )
        # Run a command and show its captured stdout.
        result = await sandbox.commands.run("echo hello")
        print(result.stdout)
        # Round-trip a file; read() returns bytes.
        await sandbox.files.write("/tmp/note.txt", "hi")
        data = await sandbox.files.read("/tmp/note.txt")
        print(data.decode())
        # __aexit__ -> delete_all() -> client.close() releases all claims

asyncio.run(main())

The `async with` block is what makes this safe: it guarantees `delete_all()` runs even if the body raises, which is the only reliable cleanup path given the deliberate absence of an `atexit` fallback.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:41-105](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/async_sandbox_client.py:107-173]()

---

## 23. Python Extensions, Gateway & Sandbox Router

> Optional Python add-ons: computer-use extension, GKE pod-snapshot extensions, the sandbox-router service, and the kind-based gateway harness.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/23-python-extensions-gateway-sandbox-router.md
- Generated: 2026-05-25T22:49:37.367Z

### Source Files

- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py`
- `clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions`
- `clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py`
- `clients/python/agentic-sandbox-client/sandbox-router/README.md`
- `clients/python/agentic-sandbox-client/gateway-kind/README.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/README.md](clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/README.md)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/podsnapshot_client.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/podsnapshot_client.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/sandbox_with_snapshot_support.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/sandbox_with_snapshot_support.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/snapshot_engine.py](clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/snapshot_engine.py)
- [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/README.md](clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/README.md)
- [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py](clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py)
- [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.yaml](clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.yaml)
- [clients/python/agentic-sandbox-client/sandbox-router/gateway.yaml](clients/python/agentic-sandbox-client/sandbox-router/gateway.yaml)
- [clients/python/agentic-sandbox-client/sandbox-router/README.md](clients/python/agentic-sandbox-client/sandbox-router/README.md)
- [clients/python/agentic-sandbox-client/sandbox-router/Dockerfile](clients/python/agentic-sandbox-client/sandbox-router/Dockerfile)
- [clients/python/agentic-sandbox-client/gateway-kind/README.md](clients/python/agentic-sandbox-client/gateway-kind/README.md)
- [clients/python/agentic-sandbox-client/gateway-kind/gateway-kind.yaml](clients/python/agentic-sandbox-client/gateway-kind/gateway-kind.yaml)
- [clients/python/agentic-sandbox-client/gateway-kind/run-test-kind.sh](clients/python/agentic-sandbox-client/gateway-kind/run-test-kind.sh)
</details>

# Python Extensions, Gateway & Sandbox Router

The Python client ships a small but layered set of optional add-ons that live outside the core `SandboxClient`/`Sandbox` pair. Two of them are pure client subclasses, the `computer_use` extension and the GKE Pod Snapshot extension, which bolt extra HTTP endpoints or Kubernetes operations onto a sandbox handle without changing core code. The other two are deployment surfaces: a FastAPI reverse proxy (`sandbox-router`) that fronts the cluster on a single Gateway IP, and a `gateway-kind` harness that wires the same routing into a local KinD cluster via `cloud-provider-kind`. Together they form the optional perimeter that makes sandboxes created through the controller's in-cluster CRDs reachable from a Python program outside the cluster, and that lets sandboxes do more than execute shell commands.

This page covers what each piece is, how they compose, and the contracts they expose. The router is the linchpin: it is required by the `computer_use` extension, by Gateway Mode in the SDK, and by the KinD harness.

## High-level Topology

The four pieces fit into one request path and one in‑cluster control path, both of which the SDK can exercise from a developer laptop.

```mermaid
flowchart LR
    subgraph Client["Python SDK process"]
      CU["ComputerUseSandboxClient<br/>(extensions.computer_use)"]
      PS["PodSnapshotSandboxClient<br/>(gke_extensions.snapshots)"]
      Base["SandboxClient / Sandbox"]
    end

    subgraph Gateway["Edge"]
      GKE["GKE Gateway<br/>(gateway.yaml)"]
      KIND["KinD Gateway<br/>(gateway-kind.yaml +<br/>cloud-provider-kind)"]
    end

    subgraph Router["sandbox-router (FastAPI)"]
      Svc["sandbox-router-svc<br/>ClusterIP :8080"]
      Pods["router pods<br/>uvicorn sandbox_router:app"]
    end

    subgraph K8s["agent-sandbox in-cluster"]
      SBox["Sandbox Pod<br/>id.ns.svc.cluster.local:port"]
      Snap["PodSnapshot /<br/>PodSnapshotManualTrigger CRs"]
    end

    CU -- "HTTP POST /agent + X-Sandbox-* headers" --> GKE
    CU -. "alternate path" .-> KIND
    GKE --> Svc --> Pods --> SBox
    KIND --> Svc

    Base -- "kube-apiserver" --> SBox
    PS -- "Custom Objects API" --> Snap
    Snap -- "manages" --> SBox
```

Sources: [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py:74-128](), [clients/python/agentic-sandbox-client/sandbox-router/gateway.yaml:1-52](), [clients/python/agentic-sandbox-client/gateway-kind/gateway-kind.yaml:1-32](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py:20-39]()

## Computer-Use Extension (`extensions/computer_use.py`)

The computer-use extension is the smallest of the four components: a pair of subclasses that wire a single new HTTP endpoint (`POST /agent`) onto the sandbox handle. It is meant for sandboxes that run a virtual-desktop or browser agent runtime inside the pod, where the SDK posts a natural-language query instead of executing shell commands.

`SandboxWithComputerUseSupport` extends `Sandbox` and adds `agent(query: str, timeout: int = 60) -> ExecutionResult`. The method checks `self.is_active`, builds a `{"query": query}` payload, sends it through `self.connector` as an HTTP `POST` to the `agent` path, and parses the JSON body back into an `ExecutionResult` (so callers get the same `stdout`/`stderr`/`exit_code` shape as `execute()`). The whole method is decorated with `@trace_span("agent_query")` so that traces emitted by the SDK include a span for each agent call.

`ComputerUseSandboxClient` is a typed `SandboxClient[SandboxWithComputerUseSupport]` that simply overrides `sandbox_class`. Because the base client constructs and re‑hydrates handles through this class attribute, any sandbox returned by `create_sandbox(...)` or `get_sandbox(...)` is already a `SandboxWithComputerUseSupport` instance with `.agent()` available.

```python
# clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py
class SandboxWithComputerUseSupport(Sandbox):
    @trace_span("agent_query")
    def agent(self, query: str, timeout: int = 60) -> ExecutionResult:
        if not self.is_active:
            raise ConnectionError("Sandbox is not active. Cannot execute agent queries.")
        payload = {"query": query}
        response = self.connector.send_request("POST", "agent", json=payload, timeout=timeout)
        response_data = response.json()
        return ExecutionResult(**(response_data or {}))
```

The extension does not provision the runtime itself — the README points users at `examples/gemini-cu-sandbox/sandbox-gemini-computer-use.yaml` for the `SandboxTemplate` and at a `gemini-api-key` `Secret` for credentials.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py:15-39](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/README.md:6-43]()

## GKE Pod Snapshot Extension (`gke_extensions/snapshots/`)

The GKE Pod Snapshot extension wraps Google's experimental `podsnapshot.gke.io` CRDs into a sandbox‑centric API. It assumes a GKE Standard cluster running gVisor with the `PodSnapshot` controller installed; if the CRDs are missing the client refuses to construct.

### Class layout

```mermaid
classDiagram
    class SandboxClient~T~
    class Sandbox

    class PodSnapshotSandboxClient {
      +sandbox_class = SandboxWithSnapshotSupport
      -_check_snapshot_crd_installed() bool
    }
    class SandboxWithSnapshotSupport {
      -_snapshots: SnapshotEngine
      +snapshots: SnapshotEngine
      +suspend(snapshot_before_suspend, wait_timeout) SuspendResponse
      +resume(wait_timeout) ResumeResponse
      +is_suspended() bool
      +is_restored_from_snapshot(uid) RestoreCheckResult
      +terminate()
    }
    class SnapshotEngine {
      +create(trigger_name, timeout) SnapshotResponse
      +list(filter_by) ListSnapshotResult
      +delete(uid, timeout) DeleteSnapshotResult
      +delete_all(delete_by, label_value, timeout) DeleteSnapshotResult
      +delete_manual_triggers(max_retries)
    }

    SandboxClient <|-- PodSnapshotSandboxClient
    Sandbox <|-- SandboxWithSnapshotSupport
    SandboxWithSnapshotSupport o--> SnapshotEngine
    PodSnapshotSandboxClient ..> SandboxWithSnapshotSupport : sandbox_class
```

`PodSnapshotSandboxClient.__init__` calls `_check_snapshot_crd_installed()`, which lists API resources under `PODSNAPSHOT_API_GROUP`/`PODSNAPSHOT_API_VERSION` and looks for `PODSNAPSHOT_API_KIND`. The check treats `403` and `404` from the API server as "not installed"; when the CRD is not found, the constructor raises `RuntimeError("Pod Snapshot Controller is not ready. ...")`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/podsnapshot_client.py:24-63]()

### `SnapshotEngine` — CRD wrapper

`SnapshotEngine` is the only piece that touches `podsnapshot.gke.io` CRDs directly. `create()` builds a `PodSnapshotManualTrigger` manifest, sanitizes the trigger name into K8s-safe characters (`lower`, `_`→`-`, truncated to 38 characters before a `-{timestamp}-{suffix}` tail to stay under the 63‑char limit, defaulted to `snap` if empty), creates it via the Custom Objects API, and then blocks on `wait_for_snapshot_to_be_completed` starting from the freshly returned `resourceVersion` to avoid losing a watch event. Created trigger names are appended to `self.created_manual_triggers` so that `delete_manual_triggers(max_retries=3)` can later sweep them up on `terminate()`.

`list()` always scopes by `SANDBOX_NAME_HASH_LABEL={hash}`, optionally ANDs grouping labels, and sorts by `creationTimestamp` descending. `_execute_deletion()` is shared by `delete(uid)` and `delete_all(delete_by="all"|"labels")`; it calls `delete_namespaced_custom_object` followed by `wait_for_snapshot_deletion` to confirm removal.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/snapshot_engine.py:87-261](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/snapshot_engine.py:397-547]()
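
The trigger-name sanitization can be sketched as follows. The lowercase, `_` to `-`, 38-character truncation, and `snap` default come from the description above; the function name `build_trigger_name`, the extra character filtering, and the exact timestamp and suffix formats are assumptions chosen so the result fits in 63 characters.

```python
import re
import secrets
import time


def build_trigger_name(raw: str) -> str:
    """Hypothetical sketch of a K8s-safe, unique trigger name."""
    base = raw.lower().replace("_", "-")
    # Assumption: any other character outside [a-z0-9-] is replaced too.
    base = re.sub(r"[^a-z0-9-]", "-", base).strip("-")
    base = base[:38].rstrip("-") or "snap"
    timestamp = time.strftime("%Y%m%d%H%M%S", time.gmtime())  # 14 chars (assumed format)
    suffix = secrets.token_hex(4)  # 8 hex chars (assumed format)
    # Worst case: 38 + 1 + 14 + 1 + 8 = 62 characters.
    return f"{base}-{timestamp}-{suffix}"


name = build_trigger_name("My_Nightly_Snapshot")
```

Truncating the user-supplied part rather than the tail keeps the uniqueness-bearing timestamp and suffix intact even for very long inputs.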

### Suspend / Resume lifecycle

`SandboxWithSnapshotSupport` maps suspend/resume onto `spec.replicas` of the `Sandbox` CR.

```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Snapshotting : suspend(snapshot_before_suspend=True)
    Snapshotting --> Suspending : SnapshotResponse.success
    Snapshotting --> Failed : snapshot failed
    Running --> Suspending : suspend(snapshot_before_suspend=False)
    Suspending --> Suspended : pod terminated\n(wait_for_pod_termination)
    Suspending --> Failed : timeout
    Suspended --> Resuming : resume()
    Resuming --> Restored : pod ready &&\nis_restored_from_snapshot(uid).success
    Resuming --> NotRestored : pod ready, no prior snapshot
    Resuming --> Failed : timeout / verification fail
    Restored --> Running
    NotRestored --> Running
```

`suspend()` first cheaply checks `is_suspended()` (defined as `spec.replicas == 0` on the `Sandbox` CR, with informational logs when `podIPs` lag the spec). It then forces a resolution of `get_sandbox_name_hash()` *before* scaling — once `replicas` hits zero the pod is gone and that label can no longer be read — and only then optionally calls `snapshots.create("suspend-{sandbox_id}")`, patches `spec.replicas=0`, and waits for the pod to terminate by UID. `resume()` is symmetric: it captures `_get_latest_snapshot_uid()` *before* scaling back to 1, then after `wait_for_pod_ready` it calls `is_restored_from_snapshot(uid)` to verify the restore actually happened. If no prior snapshot existed it returns success with `restored_from_snapshot=False` rather than treating it as an error. `terminate()` always calls `_snapshots.delete_manual_triggers()` in a `try/finally` so cleanup runs even if termination throws.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/sandbox_with_snapshot_support.py:48-133](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/sandbox_with_snapshot_support.py:144-329]()

### Result types

| Type | Defined in | Fields |
|---|---|---|
| `SuspendResponse` | `sandbox_with_snapshot_support.py` | `success`, `snapshot_response`, `error_reason`, `error_code` |
| `ResumeResponse` | `sandbox_with_snapshot_support.py` | `success`, `restored_from_snapshot`, `snapshot_uid`, `error_reason`, `error_code` |
| `SnapshotResponse` | `snapshot_engine.py` | `success`, `trigger_name`, `snapshot_uid`, `snapshot_timestamp`, `error_reason`, `error_code` |
| `ListSnapshotResult` | `snapshot_engine.py` | `success`, `snapshots: list[SnapshotDetail]`, `error_reason`, `error_code` |
| `DeleteSnapshotResult` | `snapshot_engine.py` | `success`, `deleted_snapshots: list[str]`, `error_reason`, `error_code` |
| `SnapshotFilter` | `snapshot_engine.py` | `ready_only=True`, `grouping_labels`, `extra="forbid"` |

All responses use a `SUCCESS_CODE = 0` / `ERROR_CODE = 1` convention shared across the module so callers can branch on `error_code` or on the boolean `success`.

Sources: [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/snapshot_engine.py:30-86](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/sandbox_with_snapshot_support.py:28-47]()

## `sandbox-router` Service

`sandbox-router` is a FastAPI + httpx + uvicorn process whose only job is "look at headers, build an in‑cluster URL, stream the request through". It exists because creating an `HTTPRoute` per ephemeral sandbox would not scale; a single static route on the Gateway hands every request to this service instead.

### Request handling

```mermaid
sequenceDiagram
    participant Client
    participant Gateway as Gateway (gateway.yaml / gateway-kind.yaml)
    participant Router as sandbox-router pod
    participant Sandbox as Sandbox pod

    Client->>Gateway: HTTP request<br/>X-Sandbox-ID, X-Sandbox-Namespace,<br/>X-Sandbox-Port, X-Sandbox-Pod-IP?
    Gateway->>Router: HTTPRoute -> sandbox-router-svc:8080
    Router->>Router: validate headers<br/>sanitize namespace<br/>parse port
    alt X-Sandbox-Pod-IP present
        Router->>Sandbox: http://{pod_ip}:{port}/{path}
    else fallback
        Router->>Sandbox: http://{id}.{ns}.svc.{cluster_domain}:{port}/{path}
    end
    Sandbox-->>Router: response (streamed)
    Router-->>Gateway: StreamingResponse
    Gateway-->>Client: response
```

The single catch‑all route at `@app.api_route("/{full_path:path}", methods=['GET','POST','PUT','DELETE','PATCH'])` enforces the contract:

- `X-Sandbox-ID` is required; missing returns `HTTP 400 "X-Sandbox-ID header is required."`.
- `X-Sandbox-Namespace` defaults to `default`; the router rejects anything where `namespace.replace("-", "").isalnum()` is false, explicitly to "prevent DNS injection".
- `X-Sandbox-Port` defaults to `8888` and must parse as `int`.
- `X-Sandbox-Pod-IP`, when present, short-circuits DNS and is used as the target host directly; otherwise the router constructs `f"{sandbox_id}.{namespace}.svc.{cluster_domain}"`.
- The original `Host` header is filtered out before forwarding; all other headers are preserved.
- Upstream connection failures map to `HTTP 502` ("Could not connect to the backend sandbox"), other exceptions to `HTTP 500`.

The response is returned as a `StreamingResponse` over `resp.aiter_bytes()` so large or long-lived bodies do not have to be buffered in router memory. A separate `GET /healthz` always returns `{"status":"ok"}` and is what the GKE `HealthCheckPolicy` and the pod's readiness/liveness probes hit.

Sources: [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py:68-137]()
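
The header contract above boils down to a small pure function. The sketch below is not the router's handler: `resolve_target` and `BadRequest` are hypothetical names, the namespace and port error messages are placeholders, and a plain mapping is used where the real service reads case-insensitive request headers.

```python
from typing import Mapping


class BadRequest(ValueError):
    """Stands in for an HTTP 400 response in this sketch."""


def resolve_target(
    headers: Mapping[str, str],
    path: str,
    cluster_domain: str = "cluster.local",
) -> str:
    sandbox_id = headers.get("X-Sandbox-ID")
    if not sandbox_id:
        raise BadRequest("X-Sandbox-ID header is required.")

    namespace = headers.get("X-Sandbox-Namespace", "default")
    # Only letters, digits, and hyphens may reach the DNS name.
    if not namespace.replace("-", "").isalnum():
        raise BadRequest("Invalid namespace.")  # placeholder message

    try:
        port = int(headers.get("X-Sandbox-Port", "8888"))
    except ValueError:
        raise BadRequest("Invalid port.") from None  # placeholder message

    # A pod IP, when supplied, bypasses cluster DNS entirely.
    pod_ip = headers.get("X-Sandbox-Pod-IP")
    host = pod_ip or f"{sandbox_id}.{namespace}.svc.{cluster_domain}"
    return f"http://{host}:{port}/{path}"


print(resolve_target({"X-Sandbox-ID": "sb"}, "execute"))
# http://sb.default.svc.cluster.local:8888/execute
```

Keeping the namespace check ahead of string interpolation is what prevents a crafted header such as `default.evil.example` from redirecting the proxy to an arbitrary host.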

### Configuration

The router reads two environment variables at import time and logs the effective values.

| Variable | Default | Behavior |
|---|---|---|
| `PROXY_TIMEOUT_SECONDS` | `180.0` | Passed to the shared `httpx.AsyncClient(timeout=...)`. Non‑numeric or non‑positive values log a warning and fall back to the default. |
| `CLUSTER_DOMAIN` | `cluster.local` | Used when building the in‑cluster DNS target. Empty strings log a warning and fall back to the default. |

```python
# clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py
cluster_domain = _get_cluster_domain()
proxy_timeout = _get_proxy_timeout()
client = httpx.AsyncClient(timeout=proxy_timeout)
```

Sources: [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.py:26-65](), [clients/python/agentic-sandbox-client/sandbox-router/README.md:46-66]()
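
The fallback behavior in the table can be sketched with two small readers. These are illustrative only: `read_proxy_timeout` and `read_cluster_domain` are hypothetical names rather than the router's own helpers, and treating non-finite values and whitespace-only strings as invalid is an assumption.

```python
import logging
import math
import os
from typing import Mapping

logger = logging.getLogger("router-config-sketch")

DEFAULT_PROXY_TIMEOUT = 180.0
DEFAULT_CLUSTER_DOMAIN = "cluster.local"


def read_proxy_timeout(env: Mapping[str, str] = os.environ) -> float:
    raw = env.get("PROXY_TIMEOUT_SECONDS")
    if raw is None:
        return DEFAULT_PROXY_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        value = float("nan")  # funnel parse failures into the check below
    if not (math.isfinite(value) and value > 0):
        logger.warning("Invalid PROXY_TIMEOUT_SECONDS=%r; using %s", raw, DEFAULT_PROXY_TIMEOUT)
        return DEFAULT_PROXY_TIMEOUT
    return value


def read_cluster_domain(env: Mapping[str, str] = os.environ) -> str:
    raw = env.get("CLUSTER_DOMAIN")
    if raw is None:
        return DEFAULT_CLUSTER_DOMAIN
    if not raw.strip():
        logger.warning("Empty CLUSTER_DOMAIN; using %s", DEFAULT_CLUSTER_DOMAIN)
        return DEFAULT_CLUSTER_DOMAIN
    return raw.strip()
```

Accepting the environment as a parameter keeps the readers testable without mutating `os.environ`, for example `read_proxy_timeout({"PROXY_TIMEOUT_SECONDS": "30"})`.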

### Deployment manifest

`sandbox_router.yaml` defines the `sandbox-router-svc` `ClusterIP` (port `8080` → targetPort `8080`) and a `sandbox-router-deployment` with `replicas: 2` for HA, a `topologySpreadConstraints` block on `topology.kubernetes.io/zone`, readiness/liveness probes against `/healthz`, and `securityContext: { runAsUser: 1000, runAsGroup: 1000 }`. The image reference is templated as `${ROUTER_IMAGE}` and resolved by the deployment scripts via `sed`. The container itself comes from the `Dockerfile`, which installs hashed dependencies (`pip install --no-cache-dir --require-hashes -r requirements.txt`), drops to UID `1000`, exposes `8080`, and runs `uvicorn sandbox_router:app --host 0.0.0.0 --port 8080`.

Sources: [clients/python/agentic-sandbox-client/sandbox-router/sandbox_router.yaml:1-70](), [clients/python/agentic-sandbox-client/sandbox-router/Dockerfile:1-22]()

### Front-door Gateway (GKE)

`gateway.yaml` provisions:

- A `gateway.networking.k8s.io/v1` `Gateway` named `external-http-gateway` using `gatewayClassName: gke-l7-global-external-managed`.
- A single `HTTPRoute` (`sandbox-router-route`) with `path.type=PathPrefix, value=/` that points every request at `sandbox-router-svc:8080`.
- A GKE‑specific `networking.gke.io/v1 HealthCheckPolicy` that overrides the GKE Gateway's default `/` health check to use `/healthz` against the same service.

The README is explicit that the manifest is GKE-specific and that other implementations require swapping the `gatewayClassName` and dropping `HealthCheckPolicy`.

Sources: [clients/python/agentic-sandbox-client/sandbox-router/gateway.yaml:1-52](), [clients/python/agentic-sandbox-client/sandbox-router/README.md:68-78]()

## `gateway-kind` Harness

`gateway-kind/` is the local-developer equivalent of `gateway.yaml`. KinD has no native load balancer, so this directory pairs `cloud-provider-kind` (installed via `make deploy-cloud-provider-kind`) with a `Gateway` whose `gatewayClassName: cloud-provider-kind` lets that controller assign a real IP.

`gateway-kind.yaml` is intentionally minimal: a `v1 Gateway` named `kind-gateway` listening on port `80/HTTP` with `allowedRoutes.namespaces.from: All`, plus a `v1beta1 HTTPRoute` (`sandbox-router-route`) routing `PathPrefix: /` to `sandbox-router-svc:8080`. There is no GKE‑specific health-check policy here; KinD's load balancer does not need one.

The README walks operators through the manual flow (`make build`, `make deploy-kind EXTENSIONS=true`, `make deploy-cloud-provider-kind`, applying the router and template manifests with the image tag substituted in, then `kubectl apply -f gateway-kind.yaml`) and confirms success by polling `kubectl get gateway` until `ADDRESS` is populated and finally running `python ../test_client.py --gateway-name="kind-gateway"`.

`run-test-kind.sh` automates the same sequence. It resolves an `IMAGE_TAG` via the `dev/tools/shared/utils.get_image_tag` helper, exports `SANDBOX_ROUTER_IMG=kind.local/sandbox-router:${IMAGE_TAG}` and `SANDBOX_PYTHON_RUNTIME_IMG=kind.local/python-runtime-sandbox:${IMAGE_TAG}`, and applies the templated manifests. It then waits for the router deployment with `kubectl rollout status` and for the gateway with `kubectl wait --for=condition=Programmed=True gateway/kind-gateway`, creates a `.venv`, installs the SDK in editable mode, and runs `python3 ./test_client.py --gateway-name kind-gateway`. A `trap cleanup EXIT` tears down the venv, `cloud-provider-kind`, and the KinD cluster.

Sources: [clients/python/agentic-sandbox-client/gateway-kind/gateway-kind.yaml:1-32](), [clients/python/agentic-sandbox-client/gateway-kind/README.md:1-86](), [clients/python/agentic-sandbox-client/gateway-kind/run-test-kind.sh:17-99]()

## How the Pieces Compose

```text
+-------------------------+        +------------------------------+
| ComputerUseSandboxClient|        | PodSnapshotSandboxClient     |
|  (subclass of           |        |  (subclass of SandboxClient, |
|   SandboxClient)        |        |   asserts PodSnapshot CRDs)  |
+-----------+-------------+        +--------------+---------------+
            | sandbox_class                        | sandbox_class
            v                                      v
+-------------------------+        +------------------------------+
| SandboxWithComputerUse  |        | SandboxWithSnapshotSupport   |
|  .agent(query) -> POST  |        |  .snapshots: SnapshotEngine  |
|     /agent              |        |  .suspend()/.resume()        |
+-----------+-------------+        +--------------+---------------+
            |   HTTP via SDK connector            |  kube-apiserver
            v                                      v
+------------------------------+   +------------------------------+
| sandbox-router (FastAPI)     |   | podsnapshot.gke.io CRDs      |
|  Service: sandbox-router-svc |   |  (PodSnapshot,               |
|  Route: /{full_path:path}    |   |   PodSnapshotManualTrigger)  |
+--------------+---------------+   +------------------------------+
               ^
               | HTTPRoute parentRef
               |
+----------------------------------+
| Gateway (gke-l7-global-external- |
| managed) | (cloud-provider-kind) |
+----------------------------------+
```

The two SDK extensions are independent (each client subclasses `SandboxClient` on its own and neither calls the other), but they share infrastructure assumptions. The computer-use path needs the router and a Gateway because the SDK reaches the in-pod agent over HTTP; the snapshot path does not need the router for snapshotting itself (it talks to the API server), but a real workflow normally pairs it with the same Gateway and router so the resumed sandbox is still reachable. `gateway-kind` exists solely to let a developer reproduce the GKE Gateway path locally on KinD, so that `python3 ./test_client.py --gateway-name kind-gateway` exercises the same router code that production does.

In short: the SDK extensions teach the `Sandbox` handle new tricks (a new HTTP verb, a CRD-driven lifecycle), the router gives the cluster a single front door that scales to thousands of ephemeral sandboxes by routing on a header, and `gateway-kind` makes that same front door work without a cloud load balancer.

Sources: [clients/python/agentic-sandbox-client/sandbox-router/README.md:1-25](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/extensions/computer_use.py:35-39](), [clients/python/agentic-sandbox-client/k8s_agent_sandbox/gke_extensions/snapshots/podsnapshot_client.py:24-44]()

---

## 24. Helm Chart Layout

> Structure of the Helm chart: CRD shipping, deployment template, controller-args helper, RBAC bindings, and values knobs that map to controller flags.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/24-helm-chart-layout.md
- Generated: 2026-05-25T22:46:58.183Z

### Source Files

- `helm/Chart.yaml`
- `helm/values.yaml`
- `helm/templates/deployment.yaml`
- `helm/templates/_controller-args.tpl`
- `helm/README.md`
- `helm/crds/agents.x-k8s.io_sandboxes.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [helm/Chart.yaml](helm/Chart.yaml)
- [helm/values.yaml](helm/values.yaml)
- [helm/README.md](helm/README.md)
- [helm/templates/deployment.yaml](helm/templates/deployment.yaml)
- [helm/templates/_controller-args.tpl](helm/templates/_controller-args.tpl)
- [helm/templates/_helpers.tpl](helm/templates/_helpers.tpl)
- [helm/templates/serviceaccount.yaml](helm/templates/serviceaccount.yaml)
- [helm/templates/service.yaml](helm/templates/service.yaml)
- [helm/templates/namespace.yaml](helm/templates/namespace.yaml)
- [helm/templates/rbac.generated.yaml](helm/templates/rbac.generated.yaml)
- [helm/templates/extensions-rbac.generated.yaml](helm/templates/extensions-rbac.generated.yaml)
- [helm/templates/clusterrolebinding.yaml](helm/templates/clusterrolebinding.yaml)
- [helm/templates/clusterrolebinding-extensions.yaml](helm/templates/clusterrolebinding-extensions.yaml)
- [helm/crds/agents.x-k8s.io_sandboxes.yaml](helm/crds/agents.x-k8s.io_sandboxes.yaml)
- [helm/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml](helm/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml)
- [helm/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml](helm/crds/extensions.agents.x-k8s.io_sandboxtemplates.yaml)
- [helm/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml](helm/crds/extensions.agents.x-k8s.io_sandboxwarmpools.yaml)
</details>

# Helm Chart Layout

The `helm/` directory ships the agent-sandbox controller as a self-contained Helm chart named `agent-sandbox` (`type: application`, `version: 0.1.0`). The chart bundles the four custom resource definitions used by the controller, a single `Deployment` for the controller pod, a `Service` exposing the metrics endpoint, RBAC primitives (a `ServiceAccount`, two generated `ClusterRole`s, and the matching `ClusterRoleBinding`s), an optional `Namespace`, and a controller-args helper template that translates `values.yaml` knobs into command-line flags. This page maps every file in the chart to the runtime object it produces and shows how `controller.*` values become `args:` on the controller container.

The chart is intentionally small: there are no subcharts, no hooks, and no shared configmaps. Templating logic is concentrated in two helper files (`_controller-args.tpl`, which defines `agent-sandbox.controllerArgs`, and `_helpers.tpl`, which defines the naming, label, and image helpers); every other file is a thin wrapper around a single Kubernetes object.

Sources: [helm/Chart.yaml:1-15](), [helm/README.md:1-67]()

## Directory Layout

```text
helm/
├── Chart.yaml                    # chart metadata: name, version, sources
├── values.yaml                   # default knobs (image, controller flags, pod overrides)
├── README.md                     # install/upgrade docs + parameter reference
├── crds/                         # auto-installed by Helm before templates
│   ├── agents.x-k8s.io_sandboxes.yaml
│   ├── extensions.agents.x-k8s.io_sandboxclaims.yaml
│   ├── extensions.agents.x-k8s.io_sandboxtemplates.yaml
│   └── extensions.agents.x-k8s.io_sandboxwarmpools.yaml
└── templates/
    ├── _helpers.tpl              # name/namespace/labels/image helpers
    ├── _controller-args.tpl      # controller flag generator
    ├── namespace.yaml            # gated on namespace.create
    ├── serviceaccount.yaml
    ├── deployment.yaml           # the controller pod
    ├── service.yaml              # metrics service (8080)
    ├── rbac.generated.yaml       # core ClusterRole (Sandbox)
    ├── extensions-rbac.generated.yaml  # extensions ClusterRole
    ├── clusterrolebinding.yaml
    └── clusterrolebinding-extensions.yaml  # gated on controller.extensions
```

The `.generated.yaml` suffix on the two RBAC role files signals that the manifests are produced by `controller-gen` from kubebuilder markers in the Go sources; do not edit them by hand. The `clusterrolebinding*.yaml` files are hand-written because they reference Helm release context (`{{ include "agent-sandbox.namespace" . }}`).

Sources: [helm/templates/rbac.generated.yaml:1-60](), [helm/templates/extensions-rbac.generated.yaml:1-88](), [helm/templates/clusterrolebinding.yaml:1-15]()

## CRD Shipping

CRDs live under `helm/crds/`, which is a Helm-special directory: Helm installs every file there **before** it installs any rendered template, and only when the CRD does not already exist in the cluster. Helm intentionally does **not** upgrade or delete CRDs across releases, so the chart's `README.md` documents the manual `kubectl apply -f helm/crds/` step for upgrades and the manual `kubectl delete -f helm/crds/` step for full removal.

Four CRDs ship with the chart:

| File | API group | Kind | Scope |
|------|-----------|------|-------|
| `agents.x-k8s.io_sandboxes.yaml` | `agents.x-k8s.io` | `Sandbox` | Namespaced |
| `extensions.agents.x-k8s.io_sandboxclaims.yaml` | `extensions.agents.x-k8s.io` | `SandboxClaim` | Namespaced |
| `extensions.agents.x-k8s.io_sandboxtemplates.yaml` | `extensions.agents.x-k8s.io` | `SandboxTemplate` | Namespaced |
| `extensions.agents.x-k8s.io_sandboxwarmpools.yaml` | `extensions.agents.x-k8s.io` | `SandboxWarmPool` | Namespaced |

The CRDs always install — there is no value gate around `crds/` (Helm does not template that directory). The three extension CRDs are installed even when `controller.extensions=false`; what the flag actually gates is the *controller* logic that reconciles them and the extension RBAC binding (see [Conditional Templates](#conditional-templates)).

Each CRD carries the `controller-gen.kubebuilder.io/version: v0.19.0` annotation, confirming they are generated artifacts mirrored into the chart rather than maintained inline.
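
The install-once rule is easy to misread, so here is a small model of it. This is a sketch, not Helm code: `crds_helm_will_create` is a hypothetical helper, and the CRD names are derived from the file names above using the usual `<plural>.<group>` convention. It only illustrates why an upgrade needs the manual `kubectl apply` step.

```python
# Hypothetical model of how Helm treats the crds/ directory; Helm itself is written in Go.
CHART_CRDS = [
    "sandboxes.agents.x-k8s.io",
    "sandboxclaims.extensions.agents.x-k8s.io",
    "sandboxtemplates.extensions.agents.x-k8s.io",
    "sandboxwarmpools.extensions.agents.x-k8s.io",
]


def crds_helm_will_create(chart_crds, cluster_crds):
    """Return the chart CRDs Helm would create: only those absent from the cluster."""
    existing = set(cluster_crds)
    return [name for name in chart_crds if name not in existing]


# Fresh cluster: every CRD in the chart is created.
print(crds_helm_will_create(CHART_CRDS, []))
# CRDs already present (in any version): nothing is created or updated.
print(crds_helm_will_create(CHART_CRDS, CHART_CRDS))
```

In the second case a newer schema in the chart never reaches the cluster through `helm upgrade`, which is the situation the README's manual step addresses.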

Sources: [helm/README.md:48-66](), [helm/crds/agents.x-k8s.io_sandboxes.yaml:1-18](), [helm/crds/extensions.agents.x-k8s.io_sandboxclaims.yaml:1-17]()

## Rendered Object Graph

```mermaid
flowchart LR
    subgraph Values["values.yaml"]
        VImg["image.{repository,tag,pullPolicy}"]
        VNs["namespace.{create,name}"]
        VCtrl["controller.*"]
        VPod["replicaCount / podAnnotations<br/>podLabels / resources<br/>nodeSelector / tolerations / affinity"]
    end

    subgraph Helpers["templates/_*.tpl"]
        HArgs["agent-sandbox.controllerArgs"]
        HHelp["agent-sandbox.{name,namespace,<br/>labels,selectorLabels,image}"]
    end

    subgraph Workload["Controller workload"]
        Ns["Namespace<br/>(if namespace.create)"]
        SA["ServiceAccount<br/>agent-sandbox-controller"]
        Dep["Deployment<br/>agent-sandbox-controller"]
        Svc["Service :8080 metrics"]
    end

    subgraph RBAC["Cluster-scoped RBAC"]
        CRcore["ClusterRole<br/>agent-sandbox-controller"]
        CRext["ClusterRole<br/>agent-sandbox-controller-extensions"]
        CRBcore["ClusterRoleBinding<br/>core"]
        CRBext["ClusterRoleBinding<br/>extensions (gated)"]
    end

    subgraph CRDs["helm/crds/ (pre-install, once)"]
        CRDsb["Sandbox CRD"]
        CRDsc["SandboxClaim CRD"]
        CRDst["SandboxTemplate CRD"]
        CRDsw["SandboxWarmPool CRD"]
    end

    VCtrl --> HArgs --> Dep
    VImg --> HHelp --> Dep
    VPod --> Dep
    VNs --> Ns
    VNs --> HHelp

    SA --> Dep
    SA --> CRBcore
    SA --> CRBext
    CRcore --> CRBcore
    CRext --> CRBext
    Svc -. selects .-> Dep
```

The diagram traces three independent value→object pathways: `controller.*` knobs flow through the `agent-sandbox.controllerArgs` helper into the `Deployment`'s `args:`; `image.*` and naming knobs flow through `_helpers.tpl` into image and label fields; and `namespace.*` controls both the optional `Namespace` object and the namespace placement for every other object via `agent-sandbox.namespace`.

Sources: [helm/templates/deployment.yaml:1-65](), [helm/templates/_helpers.tpl:1-43](), [helm/values.yaml:1-63]()

## The Controller Deployment

`helm/templates/deployment.yaml` renders a single `Deployment` named `agent-sandbox-controller`. Notable wiring:

- `replicas: {{ .Values.replicaCount }}` — defaults to 1; leader election (on by default) makes >1 safe.
- `serviceAccountName: agent-sandbox-controller` — hard-coded, matches `serviceaccount.yaml` and both `ClusterRoleBinding`s.
- `image: {{ include "agent-sandbox.image" . }}` — uses the `_helpers.tpl` builder, which `require`s a non-empty `image.tag` and produces `repository:tag`.
- `args:` is rendered entirely by `{{- include "agent-sandbox.controllerArgs" . | nindent 8 }}`.
- Two named container ports: `metrics` on 8080 and `healthz` on 8081. The `Service` only exposes the metrics port.
- `livenessProbe` hits `GET /healthz` on the `healthz` port (15s initial delay, 20s period); `readinessProbe` hits `GET /readyz` on the same port (5s/10s).
- Pod-level overrides — `resources`, `nodeSelector`, `affinity`, `tolerations` — are all rendered behind `{{- with }}` guards, so an empty value renders nothing rather than an empty field.

Pod template labels merge the chart's `selectorLabels` with `Values.podLabels` so user labels coexist with the immutable selector:

```yaml
{{- $selectorLabels := include "agent-sandbox.selectorLabels" . | fromYaml }}
labels:
  {{- toYaml (merge (dict) $selectorLabels .Values.podLabels) | nindent 8 }}
```

The order of arguments to `merge` matters: in Helm's `merge`, earlier maps win on conflict, so selector labels take precedence over user-supplied `podLabels` — preventing a stray override from breaking selector matching.
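
The precedence rule can be illustrated with a simplified model. The sketch below assumes flat label maps and ignores Sprig's handling of nested maps and empty values; `sprig_merge` is a hypothetical stand-in, and the label values are made up for the example.

```python
def sprig_merge(dest, *sources):
    """Simplified model of Sprig's `merge`: keys already in `dest` are never overwritten."""
    for src in sources:
        for key, value in src.items():
            dest.setdefault(key, value)  # first writer wins
    return dest


# Illustrative values; the real ones come from the chart helpers and values.yaml.
selector_labels = {
    "app.kubernetes.io/name": "agent-sandbox",
    "app.kubernetes.io/instance": "demo",
}
pod_labels = {
    "app.kubernetes.io/name": "override-attempt",  # conflicts with a selector label
    "team": "ml-platform",
}

merged = sprig_merge({}, selector_labels, pod_labels)
print(merged)
```

Because the selector labels are merged into the empty destination first, the conflicting `podLabels` entry is ignored while the non-conflicting `team` label is kept.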

Sources: [helm/templates/deployment.yaml:1-65](), [helm/templates/_helpers.tpl:31-42]()

## The controller-args Helper

`templates/_controller-args.tpl` defines the named template `agent-sandbox.controllerArgs`, which is the single source of truth for converting `controller.*` values into CLI flags. Each block follows the same pattern: `{{- if .Values.controller.X }}` guards a line that emits `- --flag-name={{ value }}`.

```text
controller.<key>                       --> --<flag-name>=<value>
─────────────────────────────────────────────────────────────────
leaderElect                            --> --leader-elect
clusterDomain                          --> --cluster-domain
leaderElectionNamespace                --> --leader-election-namespace
extensions                             --> --extensions
enableTracing                          --> --enable-tracing
enablePprof                            --> --enable-pprof
enablePprofDebug                       --> --enable-pprof-debug
pprofBlockProfileRate                  --> --pprof-block-profile-rate
pprofMutexProfileFraction              --> --pprof-mutex-profile-fraction
kubeApiQps                             --> --kube-api-qps
kubeApiBurst                           --> --kube-api-burst
sandboxConcurrentWorkers               --> --sandbox-concurrent-workers
sandboxClaimConcurrentWorkers          --> --sandbox-claim-concurrent-workers
sandboxWarmPoolConcurrentWorkers       --> --sandbox-warm-pool-concurrent-workers
sandboxTemplateConcurrentWorkers       --> --sandbox-template-concurrent-workers
extraArgs[]                            --> verbatim, one per element (quoted)
```

Two consequences of the `{{- if }}` guards:

1. **Unset values produce no flag.** The controller's own defaults apply for any key the user does not set. `values.yaml` deliberately leaves most knobs commented out (`# clusterDomain: "cluster.local"`, `# kubeApiQps: -1.0`, …) so the rendered `args:` list stays minimal.
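2. **Falsy values suppress the flag.** Because the guard is a plain `if` rather than a presence check such as `not (kindIs "invalid" ...)`, a setting like `controller.leaderElect: false` *omits* the flag instead of emitting `--leader-elect=false`. Go templates also treat `0` and the empty string as false, so a numeric knob set to `0` is dropped in the same way. This is intentional for the boolean toggles but worth noting when overriding defaults.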

`extraArgs` is the escape hatch for any flag the helper does not cover (the chart's example use case is `zap` logging flags). Each element is emitted verbatim and quoted:

```yaml
{{- range .Values.controller.extraArgs }}
- {{ . | quote }}
{{- end }}
```
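
The guard and passthrough behaviour described in this section can be modelled in a few lines. This is a sketch under stated assumptions: `controller_args` is a hypothetical function, the mapping covers only a subset of the keys in the table above, the flag order is assumed, and Go's number formatting is not reproduced.

```python
# Hypothetical model of the guard pattern; the real helper is a Helm (Go template) file.
FLAG_FOR_KEY = {
    "leaderElect": "--leader-elect",
    "extensions": "--extensions",
    "kubeApiQps": "--kube-api-qps",
    "kubeApiBurst": "--kube-api-burst",
    "sandboxConcurrentWorkers": "--sandbox-concurrent-workers",
}


def render_value(value):
    """Format a value roughly as a Go template prints it (booleans in lowercase)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def controller_args(controller):
    """Return the argv entries the guards would produce for a `controller` values dict."""
    args = []
    for key, flag in FLAG_FOR_KEY.items():
        value = controller.get(key)
        if value:  # mirrors `{{- if .Values.controller.X }}`: falsy means no flag at all
            args.append(f"{flag}={render_value(value)}")
    # extraArgs entries are passed through verbatim, one argument per element.
    args.extend(str(item) for item in (controller.get("extraArgs") or []))
    return args


print(controller_args({"leaderElect": True, "extensions": False, "kubeApiQps": 50}))
print(controller_args({"leaderElect": False, "kubeApiBurst": 0}))
```

The second call returns an empty list: neither `false` nor `0` survives the guard, so the controller's own defaults apply.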

Sources: [helm/templates/_controller-args.tpl:1-50](), [helm/values.yaml:29-49]()

## Values Knobs at a Glance

The full reference lives in `helm/README.md`; the table below highlights the knobs that the helper translates into flags, plus the ones that shape the pod itself.

| Value | Default | Effect |
|-------|---------|--------|
| `image.tag` | `""` | **Required.** Enforced by `required "image.tag is required"` in `agent-sandbox.image`. |
| `image.repository` | `registry.k8s.io/agent-sandbox/agent-sandbox-controller` | Image repository half of the ref. |
| `replicaCount` | `1` | `Deployment.spec.replicas`. |
| `namespace.create` | `true` | Gates rendering of `namespace.yaml`. |
| `namespace.name` | `agent-sandbox-system` | Overrides `Release.Namespace` for every object. |
| `controller.leaderElect` | `true` | Emits `--leader-elect=true`. |
| `controller.extensions` | `false` | Emits `--extensions=true` **and** renders the extensions `ClusterRoleBinding`. |
| `controller.enableTracing` | unset | Emits `--enable-tracing=true` when truthy. |
| `controller.enablePprof` / `enablePprofDebug` | unset | Toggle pprof endpoint on the metrics port. |
| `controller.kubeApiQps` / `kubeApiBurst` | unset | Tune client-go rate limits. |
| `controller.sandbox*ConcurrentWorkers` | unset | Reconciler concurrency per controller. |
| `controller.extraArgs` | `[]` | Free-form passthrough flags. |
| `podAnnotations` / `podLabels` | `{}` | Added to the pod template; selector labels win on label conflict. |
| `resources` / `nodeSelector` / `tolerations` / `affinity` | `{}` / `[]` | Standard pod overrides, all `{{- with }}`-guarded. |

Sources: [helm/values.yaml:1-63](), [helm/README.md:72-101](), [helm/templates/_helpers.tpl:39-42]()

## Naming and Label Helpers

`_helpers.tpl` defines five named templates that the rest of the chart depends on:

| Helper | Purpose |
|--------|---------|
| `agent-sandbox.name` | Truncates chart name (or `nameOverride`) to 63 chars. Used inside the label helpers. |
| `agent-sandbox.namespace` | `default .Release.Namespace .Values.namespace.name` — every object uses this to resolve its namespace. |
| `agent-sandbox.labels` | Common labels: `helm.sh/chart`, `app.kubernetes.io/{name,instance,managed-by}`, plus `app.kubernetes.io/version` when `image.tag` is set. |
| `agent-sandbox.selectorLabels` | Stable subset (`name` + `instance`) used by both `Deployment.selector` and `Service.selector`. |
| `agent-sandbox.image` | `repository:tag`; calls `required` on `image.tag` so installs fail fast when the user forgets to set it. |

The selector helper is deliberately narrower than the labels helper: `Deployment.spec.selector` is immutable after creation, so it must not include chart-version-derived labels.
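
Two of these helpers carry real logic: the namespace fallback and the `required` check on the image tag. The sketch below models both, assuming a values dictionary shaped like `values.yaml`; the function names and the tag `v0.1.0` are hypothetical.

```python
def resolve_namespace(release_namespace, values):
    """Model of `default .Release.Namespace .Values.namespace.name`."""
    name = values.get("namespace", {}).get("name")
    # Sprig's `default` returns its first argument only when the second is empty.
    return name if name else release_namespace


def image_ref(values):
    """Model of `agent-sandbox.image`: fail fast when image.tag is empty."""
    image = values.get("image", {})
    tag = image.get("tag")
    if not tag:
        raise ValueError("image.tag is required")
    return f"{image['repository']}:{tag}"


values = {
    "namespace": {"create": True, "name": "agent-sandbox-system"},
    "image": {
        "repository": "registry.k8s.io/agent-sandbox/agent-sandbox-controller",
        "tag": "v0.1.0",  # hypothetical tag, for illustration only
    },
}
print(resolve_namespace("default", values))
print(image_ref(values))
```

With `namespace.name` set to an empty string, `resolve_namespace` falls back to the release namespace, which is the path to use when installing into an existing namespace chosen with `helm install -n`.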

Sources: [helm/templates/_helpers.tpl:1-43]()

## Service, ServiceAccount, and Namespace

These three templates are minimal:

- `serviceaccount.yaml` creates `agent-sandbox-controller` in the chart's namespace. No annotations are templated, so workload-identity wiring is not built in.
- `service.yaml` is a ClusterIP `Service` named `agent-sandbox-controller` exposing only the `metrics` port (8080 → named target port `metrics`). There is no service for the `healthz` port — probes go straight to the pod.
- `namespace.yaml` is wrapped entirely in `{{- if .Values.namespace.create }}`. When `false`, no namespace is rendered and the user is expected to point the release at a pre-existing namespace (the README's "Install into an existing namespace" path).

Sources: [helm/templates/service.yaml:1-16](), [helm/templates/serviceaccount.yaml:1-8](), [helm/templates/namespace.yaml:1-9]()

## RBAC Bindings

Cluster-scoped RBAC ships as two independent role/binding pairs:

```text
ServiceAccount: agent-sandbox-controller (in namespace)
        │
        ├── ClusterRoleBinding: agent-sandbox-controller
        │       └── ClusterRole: agent-sandbox-controller            (always)
        │              ├── core:           pods, services, persistentvolumeclaims (CRUD+watch)
        │              ├── agents.x-k8s.io: sandboxes (CRUD+watch),
        │              │                    sandboxes/{finalizers,status} (get/patch/update)
        │              ├── coordination.k8s.io: leases (CRU+watch)
        │              └── events.k8s.io: events (create/patch)
        │
        └── ClusterRoleBinding: agent-sandbox-controller-extensions   (if controller.extensions)
                └── ClusterRole: agent-sandbox-controller-extensions  (always present in chart)
                       ├── core: pods (read+patch+update+watch), events (create/patch/update)
                       ├── coordination.k8s.io: leases (CRUD+watch)
                       ├── extensions.agents.x-k8s.io: sandboxclaims, sandboxtemplates,
                       │                                sandboxwarmpools (CRUD+watch) and
                       │                                their finalizers/status subresources
                       ├── agents.x-k8s.io: sandboxes (CRUD+watch)
                       └── networking.k8s.io: networkpolicies (CRUD+watch)
```

A subtlety: `extensions-rbac.generated.yaml` (the `ClusterRole`) is **not** gated on `controller.extensions`, but `clusterrolebinding-extensions.yaml` **is**. The role is therefore present even when extensions are disabled — it simply has no subjects bound to it until the user opts in. Both binding files reference `agent-sandbox-controller` as the only subject and the matching `ClusterRole` name as `roleRef`.

Sources: [helm/templates/rbac.generated.yaml:1-60](), [helm/templates/extensions-rbac.generated.yaml:1-88](), [helm/templates/clusterrolebinding.yaml:1-15](), [helm/templates/clusterrolebinding-extensions.yaml:1-16]()

## Conditional Templates

Only a handful of templates have install-time gates; the rest always render.

| Template | Gate | Behavior when gate is false |
|----------|------|-----------------------------|
| `namespace.yaml` | `Values.namespace.create` | Nothing rendered; existing namespace is reused via `agent-sandbox.namespace`. |
| `clusterrolebinding-extensions.yaml` | `Values.controller.extensions` | The extensions `ClusterRole` exists but is unbound; the controller flag is also omitted. |
| `_controller-args.tpl` entries | per-key `{{- if .Values.controller.X }}` | Flag is omitted; controller falls back to its own default. |
| `Deployment` pod fields | `{{- with }}` on `resources` / `nodeSelector` / `affinity` / `tolerations` / `podAnnotations` | Field is omitted entirely from the rendered manifest. |

Both `namespace.yaml` and `clusterrolebinding-extensions.yaml` use a top-level `{{- if … }} … {{- end }}` wrapper, so disabling them yields zero output (no empty document separator) and Helm produces a clean manifest.
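
The two object-level gates can be summarised as a function from values to the set of rendered objects. This is only a model built from the table above; `rendered_objects` is a hypothetical name and the output order is arbitrary.

```python
def rendered_objects(values):
    """Return the objects the chart templates would render for the given values."""
    objects = []
    if values.get("namespace", {}).get("create"):
        objects.append("Namespace")
    # These templates have no install-time gate and always render.
    objects += [
        "ServiceAccount/agent-sandbox-controller",
        "Deployment/agent-sandbox-controller",
        "Service/agent-sandbox-controller",
        "ClusterRole/agent-sandbox-controller",
        "ClusterRole/agent-sandbox-controller-extensions",
        "ClusterRoleBinding/agent-sandbox-controller",
    ]
    if values.get("controller", {}).get("extensions"):
        objects.append("ClusterRoleBinding/agent-sandbox-controller-extensions")
    return objects


defaults = {"namespace": {"create": True}, "controller": {"extensions": False}}
print(rendered_objects(defaults))
```

Note that the extensions `ClusterRole` appears in both cases; only its binding depends on `controller.extensions`.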

Sources: [helm/templates/namespace.yaml:1-9](), [helm/templates/clusterrolebinding-extensions.yaml:1-16](), [helm/templates/deployment.yaml:49-64]()

## End-to-End: From values to args

The flow below shows what the user sets, how it threads through the chart, and what lands inside the running pod's argv. This is the path a maintainer follows when adding a new controller flag.

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant Vals as values.yaml
    participant Tpl as templates/_controller-args.tpl
    participant Dep as templates/deployment.yaml
    participant K8s as kube-apiserver
    participant Pod as agent-sandbox-controller pod

    User->>Vals: --set controller.extensions=true<br/>--set controller.kubeApiQps=50
    Vals->>Tpl: render agent-sandbox.controllerArgs
    Note over Tpl: emits --extensions=true and<br/>--kube-api-qps=50,<br/>skips guarded-out flags
    Tpl-->>Dep: include via {{ include "agent-sandbox.controllerArgs" . }}
    Dep->>K8s: apply Deployment with args[]
    K8s->>Pod: start container with rendered argv
    Pod->>Pod: parse flags and start reconcilers
```

To add a new flag, append a guarded block to `_controller-args.tpl`, add the default (or commented-out placeholder) to `values.yaml`, and document it in `helm/README.md`'s configuration table — no other template needs to change.

Sources: [helm/templates/_controller-args.tpl:1-50](), [helm/templates/deployment.yaml:28-29](), [helm/values.yaml:29-49]()

## Summary

The agent-sandbox Helm chart is a deliberately thin packaging of one controller `Deployment`, its `Service`, a `ServiceAccount`, two generated `ClusterRole`s with their bindings, an optional `Namespace`, and four CRDs shipped under the auto-installed `crds/` directory. All controller-tuning lives in `controller.*` values, and the `agent-sandbox.controllerArgs` named template is the single junction that turns those values into command-line flags — keeping the deployment template itself almost free of templating logic. The `controller.extensions` toggle is the one knob with multi-object impact: it both injects `--extensions=true` and renders the extensions `ClusterRoleBinding`, while the extension CRDs and `ClusterRole` always ship.

---

## 25. Static Manifests & Generated RBAC

> The kubectl-apply-ready manifests in k8s/ plus the generated ClusterRole/Binding files used by both core and extensions controllers.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/25-static-manifests-generated-rbac.md
- Generated: 2026-05-25T22:48:51.322Z

### Source Files

- `k8s/controller.yaml`
- `k8s/extensions.controller.yaml`
- `k8s/extensions.yaml`
- `k8s/rbac.generated.yaml`
- `k8s/extensions-rbac.generated.yaml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [k8s/controller.yaml](k8s/controller.yaml)
- [k8s/extensions.controller.yaml](k8s/extensions.controller.yaml)
- [k8s/extensions.yaml](k8s/extensions.yaml)
- [k8s/rbac.generated.yaml](k8s/rbac.generated.yaml)
- [k8s/extensions-rbac.generated.yaml](k8s/extensions-rbac.generated.yaml)
- [codegen.go](codegen.go)
- [dev/tools/deploy-to-kube](dev/tools/deploy-to-kube)
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxtemplate_controller.go](extensions/controllers/sandboxtemplate_controller.go)
- [extensions/controllers/sandboxwarmpool_controller.go](extensions/controllers/sandboxwarmpool_controller.go)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
</details>

# Static Manifests & Generated RBAC

The `k8s/` directory is the canonical home for the agent-sandbox controller's cluster install, ready for `kubectl apply` once the deploy tooling has substituted the controller image. It bundles the `Namespace`, `ServiceAccount`, `Service`, `Deployment`, and `ClusterRoleBinding` for the controller, together with machine-generated `CustomResourceDefinition` and `ClusterRole` files. The split between hand-written and `controller-gen`-produced manifests matters because RBAC and CRD changes must follow code changes to the kubebuilder markers, never the other way around.

This page documents what each file in `k8s/` provides, how the generated RBAC files are produced from `//+kubebuilder:rbac` markers, how the core and extensions controllers compose their permissions, and how the deploy tooling assembles these manifests for both the core-only and `--extensions` deployment modes.

## Layout of `k8s/`

| File | Type | Generated? | Role |
| --- | --- | --- | --- |
| `controller.yaml` | Multi-doc YAML | hand-written | Core install: `Namespace`, `ServiceAccount`, `ClusterRoleBinding`, `Service`, `Deployment` |
| `extensions.controller.yaml` | YAML | hand-written | `Deployment` override that adds `--extensions` to the controller args |
| `extensions.yaml` | YAML | hand-written | `ClusterRoleBinding` for the extensions `ClusterRole` |
| `rbac.generated.yaml` | YAML | `controller-gen` | `ClusterRole agent-sandbox-controller` (core permissions) |
| `extensions-rbac.generated.yaml` | YAML | `controller-gen` | `ClusterRole agent-sandbox-controller-extensions` (extension permissions) |
| `crds/` | YAML files | `controller-gen` | One CRD per Go API type (`sandboxes`, `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools`) |

A parallel set of files lives under `helm/templates/` for the Helm install path. The two role files there (`rbac.generated.yaml` and `extensions-rbac.generated.yaml`) are produced by matching `controller-gen` directives, while `clusterrolebinding*.yaml`, `deployment.yaml`, and the remaining templates are hand-written.

Sources: [k8s/controller.yaml:1-82](), [k8s/extensions.controller.yaml:1-33](), [k8s/extensions.yaml:1-15](), [k8s/rbac.generated.yaml:1-60](), [k8s/extensions-rbac.generated.yaml:1-88]()

## Core controller install (`controller.yaml`)

The hand-written `controller.yaml` is a single multi-document YAML that creates the namespace, identity, in-cluster service, and workload for the core controller. The five documents are emitted in install order:

1. `Namespace agent-sandbox-system` — every other object lives here.
2. `ServiceAccount agent-sandbox-controller` — the workload identity.
3. `ClusterRoleBinding agent-sandbox-controller` — binds the generated core `ClusterRole` (same name) to the service account.
4. `Service agent-sandbox-controller` — exposes the `metrics` port (`8080/TCP`) so a scraper can reach the controller pod.
5. `Deployment agent-sandbox-controller` — one replica, runs the controller with `--leader-elect=true`, exposes container ports `metrics` (`8080`) and `healthz` (`8081`).

The image is encoded as a `ko://` reference (`ko://sigs.k8s.io/agent-sandbox/cmd/agent-sandbox-controller`) and is rewritten by the deploy tooling before apply; the manifest is not directly applicable from a clean checkout without that rewrite.

```yaml
# k8s/controller.yaml (Deployment excerpt)
containers:
- name: agent-sandbox-controller
  image: ko://sigs.k8s.io/agent-sandbox/cmd/agent-sandbox-controller # placeholder value, replaced by deployment scripts
  args:
  - --leader-elect=true
  ports:
  - name: metrics
    containerPort: 8080
  - name: healthz
    containerPort: 8081
```
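
The image rewrite amounts to a text substitution on the manifest. The sketch below shows the idea only; it is not how `dev/tools/deploy-to-kube` is implemented, and both `rewrite_controller_image` and the target image name are hypothetical.

```python
KO_PLACEHOLDER = "ko://sigs.k8s.io/agent-sandbox/cmd/agent-sandbox-controller"


def rewrite_controller_image(manifest_text, image):
    """Replace the ko:// placeholder with a concrete image reference."""
    if KO_PLACEHOLDER not in manifest_text:
        # Refuse to continue rather than apply a manifest that was not rewritten.
        raise ValueError("ko:// placeholder not found in manifest")
    return manifest_text.replace(KO_PLACEHOLDER, image)


excerpt = (
    "containers:\n"
    "- name: agent-sandbox-controller\n"
    f"  image: {KO_PLACEHOLDER}\n"
)
# Hypothetical image name for a locally built controller.
print(rewrite_controller_image(excerpt, "kind.local/agent-sandbox-controller:dev"))
```

Failing loudly when the placeholder is missing is a design choice for the sketch: applying the manifest with a `ko://` reference left in place would produce a pod that cannot pull its image.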

Sources: [k8s/controller.yaml:1-82]()

## Extensions overlay (`extensions.controller.yaml`, `extensions.yaml`)

The extensions install is intentionally minimal: rather than duplicating the whole core stack, two small documents add the extension surface on top.

- `k8s/extensions.controller.yaml` is a `Deployment` with the **same name and namespace** as the core controller. Its only material difference is an additional `--extensions` arg on the container:

  ```yaml
  args:
  - "--leader-elect=true"
  - "--extensions"
  ```

  The deploy tool treats this as a replacement, not an addition (see "Deploy ordering" below).

- `k8s/extensions.yaml` is a single `ClusterRoleBinding` named `agent-sandbox-controller-extensions` that binds the generated extensions `ClusterRole` to the same `ServiceAccount agent-sandbox-controller`. There is no separate service account for extensions — the controller process is one binary that opts into reconcilers based on the `--extensions` flag.

The `--extensions` flag is read by the controller entrypoint and gates registration of the extensions scheme and the three reconcilers (`SandboxClaim`, `SandboxTemplate`, `SandboxWarmPool`):

```go
// cmd/agent-sandbox-controller/main.go (excerpt)
flag.BoolVar(&extensions, "extensions", false, "Enable extensions controllers.")
...
if extensions {
    utilruntime.Must(extensionsv1beta1.AddToScheme(scheme))
}
...
if extensions {
    if err = (&extensionscontrollers.SandboxClaimReconciler{ ... }).SetupWithManager(mgr); err != nil { ... }
    if err = (&extensionscontrollers.SandboxTemplateReconciler{ ... }).SetupWithManager(mgr); err != nil { ... }
    if err = (&extensionscontrollers.SandboxWarmPoolReconciler{ ... }).SetupWithManager(mgr); err != nil { ... }
}
```
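
The static manifest passes a bare `--extensions`, while the Helm chart emits `--extensions=true`; Go's `flag` package accepts both forms for a boolean. The Python model below illustrates that equivalence and the resulting reconciler set. It is a sketch: the function names are hypothetical, and it assumes the core `Sandbox` reconciler is always registered.

```python
# Spellings accepted by Go's strconv.ParseBool, which the flag package uses for booleans.
GO_TRUE = {"1", "t", "T", "TRUE", "true", "True"}
GO_FALSE = {"0", "f", "F", "FALSE", "false", "False"}


def extensions_enabled(argv):
    """Return the value of the --extensions flag, defaulting to False as in main.go."""
    enabled = False
    for arg in argv:
        name, sep, value = arg.lstrip("-").partition("=")
        if name != "extensions":
            continue
        if not sep:
            enabled = True  # bare flag means true
        elif value in GO_TRUE:
            enabled = True
        elif value in GO_FALSE:
            enabled = False
        else:
            raise ValueError(f"invalid boolean value {value!r} for --extensions")
    return enabled


def registered_reconcilers(argv):
    """List the reconcilers that would be set up for the given arguments."""
    reconcilers = ["Sandbox"]  # assumption: the core reconciler is always registered
    if extensions_enabled(argv):
        reconcilers += ["SandboxClaim", "SandboxTemplate", "SandboxWarmPool"]
    return reconcilers


print(registered_reconcilers(["--leader-elect=true", "--extensions"]))
print(registered_reconcilers(["--leader-elect=true"]))
```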

Sources: [k8s/extensions.controller.yaml:1-33](), [k8s/extensions.yaml:1-15](), [cmd/agent-sandbox-controller/main.go:55-78](), [cmd/agent-sandbox-controller/main.go:175-271]()

## Generated RBAC pipeline

The two `*.generated.yaml` files in `k8s/` are produced by `sigs.k8s.io/controller-tools/cmd/controller-gen` and **must not be hand-edited**. The pipeline is declared as `go:generate` directives in [codegen.go](codegen.go):

```go
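// codegen.go (excerpt). Each //go:generate directive occupies a single line in
// the source; the directives are wrapped here with "\" only for readability.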
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen \
//   paths=./controllers/... output:rbac:dir=k8s \
//   rbac:roleName=agent-sandbox-controller,fileName=rbac.generated.yaml
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen \
//   paths=./extensions/controllers/... output:rbac:dir=k8s \
//   rbac:roleName=agent-sandbox-controller-extensions,fileName=extensions-rbac.generated.yaml
```

Each invocation:
1. Walks the Go packages in `paths=…`.
2. Collects every `//+kubebuilder:rbac:…` marker found on reconciler types.
3. Aggregates the markers into a single `ClusterRole` whose name comes from `rbac:roleName=…` and writes it to `k8s/<fileName>`.

Identical directives target `helm/templates/` to keep the Helm chart in sync. Regeneration is wired through `make all` (which runs `dev/tools/fix-go-generate`), so editing markers without rerunning the generator will fail CI.

```mermaid
flowchart LR
    subgraph Source["Go sources (//+kubebuilder:rbac markers)"]
        Core["controllers/sandbox_controller.go"]
        Claim["extensions/controllers/sandboxclaim_controller.go"]
        Tmpl["extensions/controllers/sandboxtemplate_controller.go"]
        Warm["extensions/controllers/sandboxwarmpool_controller.go"]
    end
    subgraph Gen["codegen.go go:generate"]
        CG1["controller-gen<br/>roleName=agent-sandbox-controller"]
        CG2["controller-gen<br/>roleName=agent-sandbox-controller-extensions"]
    end
    subgraph Out["k8s/ generated outputs"]
        R1["rbac.generated.yaml<br/>ClusterRole agent-sandbox-controller"]
        R2["extensions-rbac.generated.yaml<br/>ClusterRole agent-sandbox-controller-extensions"]
    end
    subgraph Bind["Hand-written bindings"]
        B1["controller.yaml<br/>ClusterRoleBinding agent-sandbox-controller"]
        B2["extensions.yaml<br/>ClusterRoleBinding ...-extensions"]
        SA["ServiceAccount<br/>agent-sandbox-controller"]
    end
    Core --> CG1 --> R1
    Claim --> CG2
    Tmpl --> CG2
    Warm --> CG2
    CG2 --> R2
    R1 -. roleRef .-> B1
    R2 -. roleRef .-> B2
    B1 --> SA
    B2 --> SA
```

Sources: [codegen.go:20-27](), [controllers/sandbox_controller.go:130-137](), [extensions/controllers/sandboxclaim_controller.go:127-136](), [extensions/controllers/sandboxtemplate_controller.go:46-50](), [extensions/controllers/sandboxwarmpool_controller.go:61-64]()

## Core `ClusterRole` (`rbac.generated.yaml`)

The core role grants the controller exactly what the `Sandbox` reconciler needs: full lifecycle on pods, services, and PVCs in the user namespaces it manages; full lifecycle plus subresources on `agents.x-k8s.io/sandboxes`; leases for leader election; and event emission.

| API group | Resources | Verbs |
| --- | --- | --- |
| `""` (core) | `pods`, `services`, `persistentvolumeclaims` | `create, delete, get, list, patch, update, watch` |
| `agents.x-k8s.io` | `sandboxes` | `create, delete, get, list, patch, update, watch` |
| `agents.x-k8s.io` | `sandboxes/finalizers`, `sandboxes/status` | `get, patch, update` |
| `coordination.k8s.io` | `leases` | `create, get, list, patch, update, watch` (no `delete`) |
| `events.k8s.io` | `events` | `create, patch` |

These rules are the union of the markers above `SandboxReconciler`. Note that `groups=core` in a marker is emitted as the empty API group `""` in the generated role:

```go
// controllers/sandbox_controller.go
//+kubebuilder:rbac:groups=agents.x-k8s.io,resources=sandboxes,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=agents.x-k8s.io,resources=sandboxes/finalizers,verbs=get;update;patch
//+kubebuilder:rbac:groups=agents.x-k8s.io,resources=sandboxes/status,verbs=get;update;patch
//+kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=core,resources=persistentvolumeclaims,verbs=get;list;watch;create;update;patch;delete
//+kubebuilder:rbac:groups=events.k8s.io,resources=events,verbs=create;patch
//+kubebuilder:rbac:groups=coordination.k8s.io,resources=leases,verbs=get;list;watch;create;update;patch
```
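
To see how eight markers become the five table rows above, the sketch below merges rules that share the same API groups and verbs. It is a simplified illustration of the aggregation idea under that one assumption, not the normalization `controller-gen` actually performs; `merge_rules` is a hypothetical helper and the input covers only six of the markers.

```python
from collections import defaultdict


def merge_rules(rules):
    """Merge rules that share the same API groups and verbs into one rule."""
    merged = defaultdict(set)
    for rule in rules:
        key = (tuple(sorted(rule["apiGroups"])), tuple(sorted(rule["verbs"])))
        merged[key].update(rule["resources"])
    return [
        {"apiGroups": list(groups), "resources": sorted(resources), "verbs": list(verbs)}
        for (groups, verbs), resources in sorted(merged.items())
    ]


FULL = ["create", "delete", "get", "list", "patch", "update", "watch"]
SUBRESOURCE = ["get", "patch", "update"]

# One entry per marker, with "core" already mapped to the empty group.
CORE_MARKER_RULES = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": FULL},
    {"apiGroups": [""], "resources": ["services"], "verbs": FULL},
    {"apiGroups": [""], "resources": ["persistentvolumeclaims"], "verbs": FULL},
    {"apiGroups": ["agents.x-k8s.io"], "resources": ["sandboxes"], "verbs": FULL},
    {"apiGroups": ["agents.x-k8s.io"], "resources": ["sandboxes/finalizers"], "verbs": SUBRESOURCE},
    {"apiGroups": ["agents.x-k8s.io"], "resources": ["sandboxes/status"], "verbs": SUBRESOURCE},
]

for rule in merge_rules(CORE_MARKER_RULES):
    print(rule["apiGroups"], rule["resources"], rule["verbs"])
```

The three core-group markers collapse into a single rule, and the two subresource markers collapse into another, matching the first three rows of the table.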

Sources: [k8s/rbac.generated.yaml:1-60](), [controllers/sandbox_controller.go:130-137]()

## Extensions `ClusterRole` (`extensions-rbac.generated.yaml`)

The extensions role is the union of markers from three reconcilers and is broader than the core role in two notable ways: it can manage `extensions.agents.x-k8s.io/*` resources (claims, templates, warm pools), and it can `create/delete/get/list/patch/update/watch` `networking.k8s.io/networkpolicies` — needed because `SandboxTemplate` and `SandboxClaim` materialize network policy isolation around the sandboxes they own.

| API group | Resources | Verbs | Contributed by |
| --- | --- | --- | --- |
| `""` | `pods` | `get, list, patch, update, watch` | `SandboxClaim` |
| `""`, `events.k8s.io` | `events` | `create, patch, update` | all three |
| `agents.x-k8s.io` | `sandboxes` | `create, delete, get, list, patch, update, watch` | `SandboxClaim`, `SandboxWarmPool` |
| `coordination.k8s.io` | `leases` | `create, delete, get, list, patch, update, watch` | `SandboxClaim` |
| `extensions.agents.x-k8s.io` | `sandboxclaims`, `sandboxtemplates`, `sandboxwarmpools` | `create, delete, get, list, patch, update, watch` | all three |
| `extensions.agents.x-k8s.io` | `sandboxclaims/finalizers`, `sandboxclaims/status`, `sandboxtemplates/finalizers`, `sandboxwarmpools/finalizers`, `sandboxwarmpools/status` | `get, patch, update` | per-reconciler |
| `networking.k8s.io` | `networkpolicies` | `create, delete, get, list, patch, update, watch` | `SandboxTemplate` (full), `SandboxClaim` (`get, list, watch, delete`) |

Two quirks worth noting:

- The role contains rules for both API groups `""` and `events.k8s.io` on `events`. That's because some markers declare `groups=core` and others declare `groups=events.k8s.io` — `controller-gen` collapses them into a single rule that lists both groups.
- The extensions role permits `delete` on `leases`, while the core role does not. This comes from `SandboxClaim`'s marker (`verbs=get;list;watch;create;update;patch;delete`) and exists because the extensions controllers manage their own per-claim leases rather than only leader-election ones.
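
Both quirks can be checked mechanically. The following sketch compares the two roles using rule data transcribed from the tables on this page (only the `leases` and `events` rows); `verbs_for` is a hypothetical helper, not part of the repository.

```python
def verbs_for(role_rules, group, resource):
    """Union of the verbs a role grants on one (API group, resource) pair."""
    verbs = set()
    for rule in role_rules:
        if group in rule["apiGroups"] and resource in rule["resources"]:
            verbs.update(rule["verbs"])
    return verbs


CORE_ROLE = [
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"],
     "verbs": ["create", "get", "list", "patch", "update", "watch"]},
    {"apiGroups": ["events.k8s.io"], "resources": ["events"],
     "verbs": ["create", "patch"]},
]

EXTENSIONS_ROLE = [
    {"apiGroups": ["coordination.k8s.io"], "resources": ["leases"],
     "verbs": ["create", "delete", "get", "list", "patch", "update", "watch"]},
    # A single rule listing both groups, as described in the first quirk.
    {"apiGroups": ["", "events.k8s.io"], "resources": ["events"],
     "verbs": ["create", "patch", "update"]},
]

extra_lease_verbs = (
    verbs_for(EXTENSIONS_ROLE, "coordination.k8s.io", "leases")
    - verbs_for(CORE_ROLE, "coordination.k8s.io", "leases")
)
print(sorted(extra_lease_verbs))  # ['delete']
print(sorted(verbs_for(EXTENSIONS_ROLE, "", "events")))  # ['create', 'patch', 'update']
```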

Sources: [k8s/extensions-rbac.generated.yaml:1-88](), [extensions/controllers/sandboxclaim_controller.go:127-136](), [extensions/controllers/sandboxtemplate_controller.go:46-50](), [extensions/controllers/sandboxwarmpool_controller.go:61-64]()

## How bindings tie roles to the service account

Both `ClusterRoleBinding`s point at the same `ServiceAccount agent-sandbox-controller` in `agent-sandbox-system` but reference different `ClusterRole`s by name:

```text
ServiceAccount: agent-sandbox-controller (ns: agent-sandbox-system)
        ^                 ^
        |                 |
ClusterRoleBinding        ClusterRoleBinding
  agent-sandbox-controller     agent-sandbox-controller-extensions
        |                 |
ClusterRole               ClusterRole
  agent-sandbox-controller     agent-sandbox-controller-extensions
  (rbac.generated.yaml)        (extensions-rbac.generated.yaml)
```

The names of the two `ClusterRole`s are not coincidental — they are passed directly to `controller-gen` via `rbac:roleName=…` in [codegen.go](codegen.go), so the bindings in `controller.yaml` and `extensions.yaml` are coupled by string to the generator invocation.
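
Because the coupling is by string only, a rename on one side silently leaves a binding pointing at nothing. A minimal sketch of a consistency check, assuming the manifests have already been loaded into plain dictionaries (the `dangling_bindings` helper and the trimmed-down objects are illustrative, not taken from the repository):

```python
def dangling_bindings(cluster_roles, bindings):
    """Return names of bindings whose roleRef.name matches no known ClusterRole."""
    known = {role["metadata"]["name"] for role in cluster_roles}
    return [b["metadata"]["name"] for b in bindings if b["roleRef"]["name"] not in known]


SUBJECT = {
    "kind": "ServiceAccount",
    "name": "agent-sandbox-controller",
    "namespace": "agent-sandbox-system",
}

ROLES = [
    {"metadata": {"name": "agent-sandbox-controller"}},
    {"metadata": {"name": "agent-sandbox-controller-extensions"}},
]

BINDINGS = [
    {"metadata": {"name": "agent-sandbox-controller"},
     "roleRef": {"kind": "ClusterRole", "name": "agent-sandbox-controller"},
     "subjects": [SUBJECT]},
    {"metadata": {"name": "agent-sandbox-controller-extensions"},
     "roleRef": {"kind": "ClusterRole", "name": "agent-sandbox-controller-extensions"},
     "subjects": [SUBJECT]},
]

print(dangling_bindings(ROLES, BINDINGS))  # []
```

If `rbac:roleName=…` were changed for the extensions role without updating `extensions.yaml`, the same call would report the extensions binding as dangling.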

Sources: [k8s/controller.yaml:19-30](), [k8s/extensions.yaml:1-15](), [codegen.go:22-23]()

## Deploy ordering and the core/extensions composition

The `k8s/` directory is consumed by `dev/tools/deploy-to-kube` (and indirectly by `make deploy-kind`). The script walks `k8s/` recursively, then filters and sorts documents before `kubectl apply`:

1. **Extensions filter.** Files whose names start with `extensions` are skipped unless `--extensions` was passed.
2. **Core controller replacement.** When `--extensions` is set, the core `Deployment` from `controller.yaml` is filtered out so the `extensions.controller.yaml` `Deployment` (with `--extensions` in `args`) takes its place — they share the same name and namespace.
3. **Image rewrite.** Any `ko://…` image or `:latest` tag is rewritten using the configured `image_prefix` and `image_tag`.
4. **Flag append.** A `CONTROLLER_ARGS` string is appended to the controller container's `args`.
5. **Three-phase apply, in order:**
   - prerequisites — `Namespace` and `CustomResourceDefinition` objects (from `controller.yaml` and `k8s/crds/`)
   - other documents — `ServiceAccount`, `Service`, both `ClusterRole`s, the core `ClusterRoleBinding`, and the chosen `Deployment`
   - extension documents (only if `--extensions`) — the extensions `ClusterRoleBinding` from `extensions.yaml`

```mermaid
flowchart TD
    Start["dev/tools/deploy-to-kube"] --> Walk["gather_manifests(k8s/)"]
    Walk --> Filter{extensions flag?}
    Filter -- no --> SkipExt["drop files starting with 'extensions'"]
    Filter -- yes --> KeepExt["keep extensions files;<br/>drop core Deployment"]
    SkipExt --> Process["find_and_replace_images +<br/>append_controller_flags"]
    KeepExt --> Process
    Process --> Sort{kind?}
    Sort -- CRD/Namespace --> Pre["prereq_docs"]
    Sort -- extensions file --> Ext["extensions_docs"]
    Sort -- otherwise --> Other["other_docs"]
    Pre --> Apply1["kubectl apply (prereqs)"]
    Apply1 --> Apply2["kubectl apply (others)"]
    Apply2 --> ApplyExt{extensions?}
    ApplyExt -- yes --> Apply3["kubectl apply (extensions)"]
    ApplyExt -- no --> Done["done"]
    Apply3 --> Done
```

With `--extensions` set, none of the `extensions*` files are filtered out, so an extensions-enabled deploy ends up with both roles, both bindings, the extensions-enabled `Deployment`, and all CRDs applied to the cluster. Without the flag, the filename-prefix filter in step 1 drops every file whose name starts with `extensions`, including the binding in `extensions.yaml`.
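
The filter-and-sort logic can be sketched in a few lines of Python. This is a simplified reading of steps 1, 2, and 5 and of the flowchart's filename-based branch, not the script's actual code: the list names mirror the flowchart, while `partition_docs` and the sample data are hypothetical, and the real script may place individual documents in different phases.

```python
PREREQ_KINDS = {"Namespace", "CustomResourceDefinition"}


def partition_docs(docs, extensions):
    """Split (filename, manifest) pairs into the three apply phases."""
    prereq_docs, other_docs, extensions_docs = [], [], []
    for filename, doc in docs:
        is_extensions_file = filename.startswith("extensions")
        if is_extensions_file and not extensions:
            continue  # step 1: extensions files are opt-in
        if extensions and filename == "controller.yaml" and doc["kind"] == "Deployment":
            continue  # step 2: the extensions Deployment replaces the core one
        if doc["kind"] in PREREQ_KINDS:
            prereq_docs.append(doc)
        elif is_extensions_file:
            extensions_docs.append(doc)
        else:
            other_docs.append(doc)
    return prereq_docs, other_docs, extensions_docs


SAMPLE = [
    ("controller.yaml", {"kind": "Namespace"}),
    ("controller.yaml", {"kind": "Deployment"}),
    ("rbac.generated.yaml", {"kind": "ClusterRole"}),
    ("extensions.yaml", {"kind": "ClusterRoleBinding"}),
]

for flag in (False, True):
    phases = partition_docs(SAMPLE, extensions=flag)
    print(flag, [[d["kind"] for d in phase] for phase in phases])
```

Applying the three returned lists in order reproduces the prerequisites, others, extensions sequence described above.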

Sources: [dev/tools/deploy-to-kube:155-230](), [codegen.go:20-27]()

## Modifying RBAC: the source-of-truth rule

The static manifests in `k8s/` are deliberately the **install artifact**, not the **specification**. To change what the controller can do, edit the `//+kubebuilder:rbac` markers on the reconciler type in `controllers/` or `extensions/controllers/`, then regenerate:

```sh
make all          # runs fix-go-generate, which invokes controller-gen
# or, narrowly:
go generate ./...
```

This regenerates the matching files under both `k8s/` and `helm/templates/`. Hand-editing `rbac.generated.yaml` or `extensions-rbac.generated.yaml` is unsafe — the next generator run will overwrite the change and CI's `fix-go-generate` check will reject the PR.

Conversely, the hand-written files (`controller.yaml`, `extensions.controller.yaml`, `extensions.yaml`) are where bindings, namespaces, services, deployment shape, and controller flags live. New permissions belong in markers; new wiring belongs in the hand-written manifests.

Sources: [codegen.go:20-27](), [Makefile:1-10]()

## Summary

`k8s/` is a small, deliberately layered install surface: hand-written cluster identity and workload definitions (`controller.yaml`, `extensions.controller.yaml`, `extensions.yaml`) bind to `ClusterRole`s that are mechanically reproduced from `//+kubebuilder:rbac` markers on the reconcilers (`rbac.generated.yaml`, `extensions-rbac.generated.yaml`). The extensions install is an overlay rather than a parallel stack — same `ServiceAccount`, same `Deployment` name, with `--extensions` and an extra binding turning on additional reconcilers. The `dev/tools/deploy-to-kube` script orchestrates the apply, ensuring CRDs and namespaces land before any objects that depend on them and that the right controller `Deployment` wins when extensions are enabled.

---

## 26. Examples Library Map

> Tour of the examples/ tree: AIO sandbox, Chrome/VSCode/JupyterLab, agent frameworks (Hermes, LangChain, ADK, Ray, Kueue), policy and scaling scenarios.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/26-examples-library-map.md
- Generated: 2026-05-25T22:50:45.765Z

### Source Files

- `examples/README.md`
- `examples/chrome-sandbox`
- `examples/vscode-sandbox`
- `examples/jupyterlab`
- `examples/policy`
- `examples/hpa-swp-scaling`
- `extensions/examples/README.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [examples/README.md](examples/README.md)
- [examples/aio-sandbox/README.md](examples/aio-sandbox/README.md)
- [examples/aio-sandbox/aio-sandbox.yaml](examples/aio-sandbox/aio-sandbox.yaml)
- [examples/chrome-sandbox/README.md](examples/chrome-sandbox/README.md)
- [examples/vscode-sandbox/README.md](examples/vscode-sandbox/README.md)
- [examples/jupyterlab/README.md](examples/jupyterlab/README.md)
- [examples/hermes-agent/README.md](examples/hermes-agent/README.md)
- [examples/langchain/README.md](examples/langchain/README.md)
- [examples/code-interpreter-agent-on-adk/README.md](examples/code-interpreter-agent-on-adk/README.md)
- [examples/ray-integration/README.md](examples/ray-integration/README.md)
- [examples/kueue-agent-sandbox/README.md](examples/kueue-agent-sandbox/README.md)
- [examples/hpa-swp-scaling/README.md](examples/hpa-swp-scaling/README.md)
- [examples/hpa-swp-scaling/hpa.yaml](examples/hpa-swp-scaling/hpa.yaml)
- [examples/policy/sandbox-automount-token-policy.yaml](examples/policy/sandbox-automount-token-policy.yaml)
- [examples/policy/sandbox-automount-token-binding.yaml](examples/policy/sandbox-automount-token-binding.yaml)
- [examples/policy/kyverno/README.md](examples/policy/kyverno/README.md)
- [examples/policy/network-policy-management/README.md](examples/policy/network-policy-management/README.md)
- [examples/policy/vap/README.md](examples/policy/vap/README.md)
- [examples/policy/opa-gatekeeper/README.md](examples/policy/opa-gatekeeper/README.md)
- [examples/policy/policy-controller/README.md](examples/policy/policy-controller/README.md)
- [examples/composing-sandbox-nw-policies/README.md](examples/composing-sandbox-nw-policies/README.md)
- [examples/quickstart/README.md](examples/quickstart/README.md)
- [examples/python-sdk-quickstart/README.md](examples/python-sdk-quickstart/README.md)
- [extensions/examples/README.md](extensions/examples/README.md)
- [extensions/examples/sandboxtemplate.yaml](extensions/examples/sandboxtemplate.yaml)
</details>

# Examples Library Map

The `examples/` tree in `kubernetes-sigs/agent-sandbox` is the canonical place where the project demonstrates how its core `Sandbox` CRD (and the `extensions.agents.x-k8s.io` types `SandboxTemplate`, `SandboxClaim`, and `SandboxWarmPool`) compose with real workloads, agent frameworks, isolation runtimes, admission policies, and autoscalers. The README at the root of the directory enumerates roughly twenty self-contained subdirectories, each tuned to a different reader: a platform operator standing up a quickstart cluster, an agent author wiring an SDK into LangGraph or ADK, or a security administrator validating Kyverno/VAP/Gatekeeper guardrails.

This page maps that tree by audience and purpose, links each subdirectory to the representative manifest or script that anchors it, and clarifies the boundary between the core `Sandbox` API (in this repo) and the higher-level extension CRDs that the more sophisticated examples consume.

Sources: [examples/README.md:1-20](), [extensions/examples/README.md:1-3](), [extensions/examples/sandboxtemplate.yaml:1-44]()

## Top-Level Layout

The two example roots in the repository serve different layers of the API:

| Path | Purpose | API surface |
| --- | --- | --- |
| `examples/` | End-to-end scenarios that consume the controller. Each subdirectory pairs a `README.md` with `.yaml` manifests and (often) Python/Go test drivers. | Mixed: core `Sandbox` plus, where useful, the extension CRDs and Python SDK. |
| `extensions/examples/` | Minimal reference manifests for the extension CRDs — one `SandboxTemplate`, one `SandboxClaim`, one `SandboxWarmPool`, plus a `SecurityContext`-locked-down variant. | `extensions.agents.x-k8s.io/v1alpha1` only. |

The split is intentional: `extensions/examples/` is a starter pack of canonical CRD shapes, while `examples/` shows how a sandbox is wired into a real agent or platform.

```text
examples/
├── README.md                        # index of all examples
├── quickstart/                      # end-to-end walkthrough (KIND default)
├── python-sdk-quickstart/           # SDK-first variant
├── hello-world-sandbox/             # smallest possible Sandbox
├── aio-sandbox/                     # All-in-One runtime (VNC/VSCode/Jupyter)
├── chrome-sandbox/                  # headless Chrome, used by e2e tests
├── vscode-sandbox/                  # code-server + kustomize overlays
├── jupyterlab/                      # JupyterLab + PVC + HF models
├── analytics-tool/                  # ADK analytics tool on GKE
├── hermes-agent/                    # Hermes agent with custom skills
├── langchain/                       # LangGraph coding agent + local model
├── code-interpreter-agent-on-adk/   # ADK + Sandbox SDK as a tool
├── ray-integration/                 # Ray + warm pool + gVisor proxy exec
├── kueue-agent-sandbox/             # Admission control via Kueue
├── hpa-swp-scaling/                 # HPA scaling a SandboxWarmPool
├── manual-pdb/                      # PodDisruptionBudget recipe
├── composing-sandbox-nw-policies/   # KRO composition: Sandbox+Service+NetPol
├── policy/                          # VAP, Kyverno, OPA, Policy Controller
├── python-runtime-sandbox/          # FastAPI runtime image used by SDK
├── gemini-cu-sandbox/               # Gemini Computer Use sandboxed runtime
├── openclaw-sandbox/                # OpenClaw inside Sandbox
├── kata-gke-sandbox/                # Kata Containers on GKE node pools
└── sandbox-ksa/                     # Sandbox bound to a custom ServiceAccount

extensions/examples/                 # canonical CRD shapes
├── sandboxtemplate.yaml             # SandboxTemplate
├── sandbox-claim.yaml / sandboxclaim.yaml
├── sandboxwarmpool.yaml
├── secure-sandboxtemplate.yaml
└── llm.yaml
```

Sources: [examples/README.md:1-20](), [extensions/examples/README.md:1-3]()

## Onboarding Path: Quickstarts and Hello World

These examples are entry points: they require no agent framework and validate that the controller, runtime, and SDK are working before you move on.

- **`quickstart/`** is the full walkthrough. It enumerates the resource set a new user will encounter — `Sandbox`, `SandboxTemplate`, `SandboxClaim`, `SandboxWarmPool`, the Python SDK, and the Router Service — and branches off to `gvisor.md` or `kata-containers.md` if the reader wants hardened isolation rather than a vanilla KIND cluster. It is the only example that explicitly enumerates all the public types up front.
- **`python-sdk-quickstart/`** covers the same flow from the SDK side. It documents the three connection modes the client supports (`SandboxLocalTunnelConnectionConfig` for tunneled local development, `SandboxGatewayConnectionConfig` for a public Kubernetes Gateway, `SandboxDirectConnectionConfig` for in-cluster agents) and requires that a `python-sandbox-template` already exists from the Python Runtime example.
- **`hello-world-sandbox/`** is the smallest example: a Dockerfile, a Sandbox manifest, and instructions to push to Artifact Registry. Useful for sanity checks against a fresh cluster.

Sources: [examples/quickstart/README.md:1-60](), [examples/python-sdk-quickstart/README.md:1-30](), [examples/hello-world-sandbox/README.md:1-30]()

## Interactive Workspaces

This cluster of examples shows the `Sandbox` CRD wrapping general-purpose interactive runtimes — what an end user might think of as "a remote dev environment for an agent."

### AIO Sandbox

`aio-sandbox/` deploys the All-in-One image from `agent-infra/sandbox`, which bundles VNC, VSCode, Jupyter, and a terminal behind a single port. The manifest is minimal and demonstrates the bare-minimum hardened `securityContext`:

```yaml
# examples/aio-sandbox/aio-sandbox.yaml
spec:
  podTemplate:
    spec:
      containers:
        - name: aio-sandbox
          image: ghcr.io/agent-infra/sandbox:1.0.0.152
          securityContext:
            allowPrivilegeEscalation: false
          ports:
          - containerPort: 8080
          resources:
            limits:
              memory: "4Gi"
              cpu: "2000m"
```

The README distinguishes the in-pod `agent-sandbox` Python SDK (which drives tools inside the running sandbox — browser, shell, files) from the `agentic-sandbox-client` SDK that provisions sandboxes via the CRDs. They are designed to compose.

### Chrome Sandbox

`chrome-sandbox/` runs a headless Chrome inside a `Sandbox`, exposing the DevTools port `9222`. It doubles as the foundation for the project's e2e tests; the README points at `test/e2e/chromesandbox_test.go` as the authoritative consumer.

### VSCode Sandbox

`vscode-sandbox/` is the most layered of the workspace examples. It ships a base kustomization plus three overlays (`overlays/gvisor`, `overlays/kata`, `overlays/kata-mshv`) that inject the corresponding `runtimeClassName`. The README explains the access pattern carefully: when a hardened runtime is used, a direct `kubectl port-forward` to the Pod does not work, so traffic must be routed through the Sandbox Router using the `X-Sandbox-ID` / `X-Sandbox-Port` headers (or a Gateway in production).
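
As an illustration of header-based routing, the sketch below builds (without sending) an HTTP request that carries the two router headers. Only the header names come from the README; the router URL, sandbox identifier, port, and the `router_headers` and `router_request` helpers are placeholders chosen for this example.

```python
import urllib.request


def router_headers(sandbox_id, port):
    """Headers that tell the router which sandbox and port to forward to."""
    return {"X-Sandbox-ID": sandbox_id, "X-Sandbox-Port": str(port)}


def router_request(router_url, sandbox_id, port, path="/"):
    """Build, but do not send, a request addressed to one sandbox via the router."""
    return urllib.request.Request(
        router_url.rstrip("/") + path,
        headers=router_headers(sandbox_id, port),
    )


# Placeholder values: a locally forwarded router and a made-up sandbox and port.
request = router_request("http://localhost:8080", "my-vscode-sandbox", 13337)
print(request.full_url)
print(router_headers("my-vscode-sandbox", 13337))
```

Passing the finished request to `urllib.request.urlopen` would send it; that step is left out because it needs a reachable router.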

### JupyterLab

`jupyterlab/` offers two deployment shapes — a single-file `jupyterlab-full.yaml` or the modular `jupyterlab.yaml` plus a `kubectl create configmap` step from `files/` — and demonstrates init-container patterns: the `setup-environment` init container downloads Python deps and HuggingFace models into a `volumeClaimTemplates` PVC before the JupyterLab container starts. The README also covers force-re-init by removing `/home/jovyan/.initialized`.

Sources: [examples/aio-sandbox/README.md:1-92](), [examples/aio-sandbox/aio-sandbox.yaml:1-22](), [examples/chrome-sandbox/README.md:1-95](), [examples/vscode-sandbox/README.md:1-258](), [examples/jupyterlab/README.md:1-90]()

## Agent Framework Integrations

These examples show how to plug the sandbox primitives into common agent frameworks. They are the cleanest reference for "how do I make my agent run untrusted code in a sandbox?"

```mermaid
flowchart LR
    subgraph Local["Local / Trusted process"]
        Agent[Agent code<br/>LangGraph / ADK / Ray Actor]
        SDK[k8s-agent-sandbox SDK<br/>SandboxClient]
    end
    subgraph Cluster["Kubernetes cluster"]
        Router[sandbox-router<br/>Service]
        subgraph WarmPool["SandboxWarmPool"]
            Pod1[Pod<br/>runtime image]
            Pod2[Pod<br/>runtime image]
        end
        Template[SandboxTemplate]
    end

    Agent --> SDK
    SDK -->|Tunnel / Gateway / Direct| Router
    Router -->|X-Sandbox-ID / X-Sandbox-Port| Pod1
    WarmPool -.references.-> Template
```

- **`langchain/`** wires a LangGraph coding agent to a sandboxed Python runtime. The init container caches a HuggingFace model (`Salesforce/codegen-350M-mono` by default) on a PVC; the agent's execution loop generates code, sends it into the sandbox, captures errors, and retries up to three times. The README documents memory tuning and how to swap models.
- **`code-interpreter-agent-on-adk/`** is the lightest framework integration: a six-line ADK `Agent` whose only tool wraps `SandboxClient` to create a sandbox, write `run.py`, execute `python3 run.py`, and tear the sandbox down. It is the recommended pattern for "ADK + execute_python."
- **`hermes-agent/`** runs the Hermes agent inside a sandbox with a ConfigMap-mounted custom "Kubernetes Developer" skill (`hermes-skills` ConfigMap from `k8s-developer.md`). It includes `test_hermes.py` for automated verification and `chat_hermes.py` for interactive sessions on port 8642.
- **`ray-integration/`** is the most architecturally explicit example. It documents the **Proxy Execution Model**: a Ray rollout actor stays inside the trusted cluster, and every untrusted code execution is proxied via `SandboxClient` into a gVisor-isolated pod claimed from a `SandboxWarmPool`. The README provides `rl_poc_local.py` (using `SandboxLocalTunnelConnectionConfig`) and `rl_poc_prod.py` (using `SandboxGatewayConnectionConfig` with `gateway_name: "external-http-gateway"`).
- **`kueue-agent-sandbox/`** is admission control rather than a framework. It shows the `kueue.x-k8s.io/queue-name` label being attached to the `podTemplate.metadata.labels` of a `Sandbox`, so Kueue queues pod admission until cluster-queue quota is available. The README is explicit about scope: Kueue controls *when* a sandbox starts, not how long it runs.
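
The generate, execute, capture, retry loop described for `langchain/` can be sketched independently of any framework. The code below is an illustration only, not the example's implementation: `generate` and `execute` are caller-supplied stand-ins for the model call and the sandbox call, and the sketch treats "three" as a cap on total attempts, which is an assumption.

```python
def run_with_retries(generate, execute, task, max_attempts=3):
    """Generate code, run it, and feed errors back until success or the cap is hit."""
    error = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, error)   # the previous error, if any, guides the retry
        ok, output = execute(code)     # stand-in for a call into the sandbox
        if ok:
            return {"ok": True, "attempts": attempt, "output": output}
        error = output
    return {"ok": False, "attempts": max_attempts, "output": error}


# Fake collaborators so the loop can run without a model or a cluster.
seen_errors = []


def fake_generate(task, error):
    seen_errors.append(error)
    return f"attempt-{len(seen_errors)}"


def fake_execute(code):
    if code == "attempt-3":
        return True, "ok"
    return False, f"{code} failed"


result = run_with_retries(fake_generate, fake_execute, "print hello")
print(result)  # {'ok': True, 'attempts': 3, 'output': 'ok'}
```

Keeping the sandbox call behind a plain callable makes the loop easy to test without a cluster, as the fake collaborators show.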

Sources: [examples/langchain/README.md:1-90](), [examples/code-interpreter-agent-on-adk/README.md:1-70](), [examples/hermes-agent/README.md:1-105](), [examples/ray-integration/README.md:1-220](), [examples/kueue-agent-sandbox/README.md:1-109]()

## Scaling and Disruption

Two examples target the operational lifecycle of sandbox pools rather than the workloads inside them.

`hpa-swp-scaling/` is GKE-specific: it scales a `SandboxWarmPool` with a stock Kubernetes `HorizontalPodAutoscaler` that consumes a Prometheus counter (`agent_sandbox_claim_creation_total`) exposed through the GKE Managed Service for Prometheus and the Custom Metrics Adapter. The HPA spec scales the warm pool itself rather than a Deployment:

```yaml
# examples/hpa-swp-scaling/hpa.yaml
scaleTargetRef:
  apiVersion: extensions.agents.x-k8s.io/v1alpha1
  kind: SandboxWarmPool
  name: python-sdk-warmpool
minReplicas: 10
maxReplicas: 100
metrics:
- type: External
  external:
    metric:
      name: "prometheus.googleapis.com|agent_sandbox_claim_creation_total|counter"
    target:
      type: Value
      value: "0.5"
```

`create-claim.py` is the load generator and `pod-monitoring.yaml` exposes the metric to Managed Prometheus. The README contains a real `kubectl get hpa -w` trace showing the warm pool scaling from 10 to 100 and back down as claim rate rises and falls.
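
The scaling behaviour follows from the standard HPA ratio formula, `desired = ceil(current * metric / target)`, clamped to the replica bounds. The sketch below applies it with the bounds and target from `hpa.yaml`; it deliberately ignores the HPA's tolerance band, stabilization windows, and pod readiness handling, and the metric values passed in are made up for illustration.

```python
import math


def desired_replicas(current_replicas, metric_value, target_value=0.5,
                     min_replicas=10, max_replicas=100):
    """Simplified HPA calculation: scale by the metric/target ratio, then clamp."""
    raw = math.ceil(current_replicas * (metric_value / target_value))
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(10, 2.0))   # 40: ratio 4.0 applied to 10 replicas
print(desired_replicas(50, 5.0))   # 100: raw value 500 is capped at maxReplicas
print(desired_replicas(10, 0.0))   # 10: raw value 0 is raised to minReplicas
```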

`manual-pdb/` shows the opt-in `PodDisruptionBudget` pattern: a shared PDB selects pods carrying `sandbox-disruption-policy: "manual-protection"`, set on the `SandboxTemplate`. Sandboxes without that label are unprotected from voluntary disruptions. The README warns specifically against `minAvailable: 1` on a shared PDB.
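
Two small functions illustrate the mechanics: which pods a label-selecting PDB covers, and how many voluntary disruptions it permits. The arithmetic is the general PDB rule (healthy pods minus `minAvailable`, floored at zero). Reading it as the reason behind the README's warning is an assumption of this sketch, and both helper names are hypothetical.

```python
PDB_MATCH_LABELS = {"sandbox-disruption-policy": "manual-protection"}


def selected_by_pdb(pod_labels, match_labels=PDB_MATCH_LABELS):
    """True when the pod carries every label the PDB selector asks for."""
    return all(pod_labels.get(key) == value for key, value in match_labels.items())


def disruptions_allowed(healthy_selected_pods, min_available):
    """Voluntary evictions the PDB permits right now."""
    return max(0, healthy_selected_pods - min_available)


print(selected_by_pdb({"sandbox-disruption-policy": "manual-protection"}))  # True
print(selected_by_pdb({"app": "sandbox"}))                                  # False
# With five protected sandboxes behind one shared PDB and minAvailable: 1,
# four of them can still be evicted at once.
print(disruptions_allowed(5, 1))  # 4
```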

Sources: [examples/hpa-swp-scaling/README.md:1-97](), [examples/hpa-swp-scaling/hpa.yaml:1-23](), [examples/manual-pdb/README.md:1-40]()

## Policy and Admission

`examples/policy/` is the largest single sub-tree and demonstrates how the project layers cluster admission controllers on top of the `Sandbox` CRD. Each subdirectory targets a different policy engine but solves the same family of problems: prevent privilege escalation, enforce isolation defaults, and lock down the ServiceAccounts a `Sandbox` references.

| Subdirectory | Engine | Headline rule |
| --- | --- | --- |
| `vap/` | Kubernetes `ValidatingAdmissionPolicy` (built-in, v1.30+) | Enforces a "Secure by Default" posture across containers/initContainers/ephemeralContainers via CEL — `runtimeClassName: gvisor`, `hostNetwork/hostPID/hostIPC: false`, `automountServiceAccountToken: false`, `privileged: false`, `capabilities.drop: ["ALL"]`, resource limits, and GKE-specific `nodeSelector`/`tolerations`. |
| `kyverno/` | Kyverno `ValidatingPolicy` | Denies `RoleBinding`/`ClusterRoleBinding` creates or updates that grant permissions to a ServiceAccount used by a `Sandbox`-owned Pod, including indirect references via `system:serviceaccounts[:ns]` groups. Includes Chainsaw tests in `.chainsaw-tests/`. |
| `opa-gatekeeper/` | OPA Gatekeeper | Same goal, Rego implementation. |
| `policy-controller/` | Google Cloud Policy Controller (Anthos) | Same goal, fleet-managed on GKE. |
| (top-level) `sandbox-automount-token-{policy,binding}.yaml` | Plain VAP | A single rule: every `Sandbox` must explicitly set `spec.podTemplate.spec.automountServiceAccountToken == false`. |
| `network-policy-management/` | Controller-managed `NetworkPolicy` | Documents the Template-Level Shared Network Policy model and the Secure-by-Default ingress/egress posture (allow only the Sandbox Router; block RFC1918, link-local metadata, cluster DNS). |

The standalone `sandbox-automount-token-policy.yaml` is a useful read because it is small and concrete:

```yaml
# examples/policy/sandbox-automount-token-policy.yaml
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups:   ["agents.x-k8s.io"]
        apiVersions: ["v1alpha1"]
        operations:  ["CREATE", "UPDATE"]
        resources:   ["sandboxes"]
  validations:
    - expression: "object.spec.podTemplate.spec.automountServiceAccountToken == false"
```

The binding scopes the policy to every namespace except `kube-system`. Together these two files are the minimal hardening pattern the repo recommends.
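
For readers less familiar with CEL, the rule's intent can be restated as a Python predicate over a `Sandbox` manifest loaded as a dictionary. This is only an approximation for illustration: the function name is hypothetical, and treating a missing field as a rejection reflects the assumption that a CEL evaluation error combined with `failurePolicy: Fail` denies the request.

```python
def automount_token_disabled(sandbox):
    """True only when automountServiceAccountToken is present and explicitly False."""
    try:
        value = sandbox["spec"]["podTemplate"]["spec"]["automountServiceAccountToken"]
    except (KeyError, TypeError):
        return False  # a missing field counts as a rejection in this sketch
    return value is False


compliant = {"spec": {"podTemplate": {"spec": {"automountServiceAccountToken": False}}}}
unset = {"spec": {"podTemplate": {"spec": {}}}}
enabled = {"spec": {"podTemplate": {"spec": {"automountServiceAccountToken": True}}}}

print(automount_token_disabled(compliant))  # True
print(automount_token_disabled(unset))      # False
print(automount_token_disabled(enabled))    # False
```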

`composing-sandbox-nw-policies/` complements the policy directory by demonstrating composition with **KRO** (`ResourceGraphDefinition`). It bundles a `Sandbox`, a `Service`, and a `NetworkPolicy` into a single higher-level `AgenticSandbox` CRD, illustrating one of three composition strategies the README discusses (custom controller, KRO, Helm).
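
The egress side of the Secure-by-Default posture listed for `network-policy-management/` can be made concrete with the standard-library `ipaddress` module. The sketch covers only the RFC 1918 ranges and the IPv4 link-local block (which contains the usual cloud metadata address); cluster DNS is omitted because its address is cluster-specific. The `egress_blocked` helper is hypothetical and says nothing about how the controller renders its `NetworkPolicy`.

```python
import ipaddress

BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8",       # RFC 1918
        "172.16.0.0/12",    # RFC 1918
        "192.168.0.0/16",   # RFC 1918
        "169.254.0.0/16",   # IPv4 link-local, includes 169.254.169.254
    )
]


def egress_blocked(ip):
    """True when the destination falls inside one of the blocked ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in BLOCKED_NETWORKS)


for destination in ("10.1.2.3", "169.254.169.254", "172.32.0.1", "8.8.8.8"):
    print(destination, egress_blocked(destination))
```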

Sources: [examples/policy/vap/README.md:1-46](), [examples/policy/kyverno/README.md:1-90](), [examples/policy/opa-gatekeeper/README.md:1-60](), [examples/policy/policy-controller/README.md:1-60](), [examples/policy/sandbox-automount-token-policy.yaml:1-15](), [examples/policy/sandbox-automount-token-binding.yaml:1-15](), [examples/policy/network-policy-management/README.md:1-90](), [examples/composing-sandbox-nw-policies/README.md:1-40]()

## Runtime Hardening Scenarios

A subset of examples strengthens the isolation boundary under the `Sandbox` CRD by changing the container runtime. They all converge on `spec.podTemplate.spec.runtimeClassName`.

| Example | Runtime | Cluster target |
| --- | --- | --- |
| `vscode-sandbox/overlays/gvisor` | gVisor | Any cluster with the `gvisor` RuntimeClass |
| `vscode-sandbox/overlays/kata` | Kata Containers (QEMU) | Local minikube with nested virtualization |
| `vscode-sandbox/overlays/kata-mshv` | Kata + Microsoft Hypervisor | Azure AKS Pod Sandboxing |
| `kata-gke-sandbox/` | Kata Containers on GKE | Requires N2 Intel + Ubuntu node pool, via `setup.sh` |
| `quickstart/gvisor.md`, `quickstart/kata-containers.md` | Either | KIND/minikube introductory paths |

Each of these examples emphasizes that direct pod port-forward stops working under non-default runtimes; access must go through the Sandbox Router service or a Kubernetes Gateway.
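
What the overlays do to a manifest amounts to setting one field. A minimal sketch of that patch on a `Sandbox` loaded as a dictionary follows; the `with_runtime_class` helper and the sample manifest are hypothetical, and `gvisor` is used because it is the RuntimeClass name mentioned on this page.

```python
import copy


def with_runtime_class(sandbox, runtime_class_name):
    """Return a copy of the manifest with spec.podTemplate.spec.runtimeClassName set."""
    patched = copy.deepcopy(sandbox)
    pod_spec = (
        patched.setdefault("spec", {})
        .setdefault("podTemplate", {})
        .setdefault("spec", {})
    )
    pod_spec["runtimeClassName"] = runtime_class_name
    return patched


base = {
    "apiVersion": "agents.x-k8s.io/v1alpha1",
    "kind": "Sandbox",
    "metadata": {"name": "example"},
    "spec": {"podTemplate": {"spec": {"containers": [{"name": "main"}]}}},
}

hardened = with_runtime_class(base, "gvisor")
print(hardened["spec"]["podTemplate"]["spec"]["runtimeClassName"])  # gvisor
```

The base manifest is left untouched, which mirrors how a kustomize overlay layers a change over a shared base.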

Sources: [examples/vscode-sandbox/README.md:20-150](), [examples/kata-gke-sandbox/README.md:1-50](), [examples/quickstart/README.md:1-60]()

## Runtime Images and Specialized Workloads

The remaining examples package container images and surrounding glue for specific workloads.

- **`python-runtime-sandbox/`** is the FastAPI image exposing the `/execute` endpoint that the Python SDK and the Ray integration consume. It defines two Pydantic models, `ExecuteRequest` and `ExecuteResponse`, and ships `run-test-docker.sh` and `run-test-kind.sh` for local validation.
- **`gemini-cu-sandbox/`** wraps the `computer-use-preview` agent (Gemini Computer Use) in a sandboxed FastAPI server, requiring `GEMINI_API_KEY`.
- **`openclaw-sandbox/`** runs OpenClaw (formerly Moltbot) inside a `Sandbox`; the README documents the gVisor + port-forward incompatibility and points at the Router as the workaround.
- **`analytics-tool/`** packages an ADK analytics tool image for GKE and shows the Service/JupyterLab integration on top.
- **`sandbox-ksa/`** is the canonical pattern for binding a `Sandbox` to a custom Kubernetes ServiceAccount — useful for per-sandbox identity and RBAC scoping.

Sources: [examples/python-runtime-sandbox/README.md:1-40](), [examples/gemini-cu-sandbox/README.md:1-40](), [examples/openclaw-sandbox/README.md:1-45](), [examples/analytics-tool/README.md:1-40](), [examples/sandbox-ksa/README.md:1-40]()

## Choosing An Example

The decision tree most readers follow is roughly:

1. **Just trying it out?** Start with `quickstart/` (or `python-sdk-quickstart/` for the SDK-centric path). Use `hello-world-sandbox/` as a smoke test.
2. **Wiring an agent framework?** Pick the framework-aligned example: `code-interpreter-agent-on-adk/` for ADK, `langchain/` for LangGraph, `hermes-agent/` for Hermes, `ray-integration/` for Ray RL.
3. **Need an interactive environment for a human or model to inhabit?** `aio-sandbox/`, `vscode-sandbox/`, `jupyterlab/`, or `chrome-sandbox/` depending on the tool surface.
4. **Operating at scale?** `hpa-swp-scaling/` for autoscaling warm pools, `manual-pdb/` for disruption control, `kueue-agent-sandbox/` for admission quotas.
5. **Securing the platform?** Start with `examples/policy/vap/` for the headline VAP, then layer `examples/policy/network-policy-management/` and one of `kyverno/`, `opa-gatekeeper/`, or `policy-controller/` depending on your engine.
6. **Hardening the runtime?** Pick a `vscode-sandbox/overlays/*` overlay or `kata-gke-sandbox/` matching your cluster.

The `extensions/examples/` directory is the place to look when you only need the raw CRD shapes for `SandboxTemplate`, `SandboxClaim`, and `SandboxWarmPool` without surrounding application code — it is referenced by `manifest.yaml`/`extensions.yaml` installation flows discussed in the higher-level examples.

Sources: [examples/README.md:1-20](), [extensions/examples/README.md:1-3](), [extensions/examples/sandboxtemplate.yaml:1-44]()

---

## 27. Build, Codegen & Repository Tools

> Make targets, the codegen.go shim, deepcopy/CRD generation, lint configuration, and the dev/tools scripts that power local development.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/27-build-codegen-repository-tools.md
- Generated: 2026-05-25T22:52:34.631Z

### Source Files

- `Makefile`
- `codegen.go`
- `tools.mod`
- `dev/tools/client-gen-go.sh`
- `dev/tools/lint-api`
- `dev/tools/test-unit`
- `docs/development.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [Makefile](Makefile)
- [codegen.go](codegen.go)
- [tools.mod](tools.mod)
- [dev/tools/client-gen-go.sh](dev/tools/client-gen-go.sh)
- [dev/tools/lint-api](dev/tools/lint-api)
- [dev/tools/test-unit](dev/tools/test-unit)
- [dev/tools/lint-go](dev/tools/lint-go)
- [dev/tools/fix-go-generate](dev/tools/fix-go-generate)
- [dev/tools/build-kal](dev/tools/build-kal)
- [dev/tools/.golangci.yaml](dev/tools/.golangci.yaml)
- [dev/tools/.golangci-kal.yml](dev/tools/.golangci-kal.yml)
- [dev/tools/go.mod](dev/tools/go.mod)
- [dev/tools/shared/utils.py](dev/tools/shared/utils.py)
- [dev/tools/fix-boilerplate](dev/tools/fix-boilerplate)
- [dev/tools/update-toc](dev/tools/update-toc)
- [dev/tools/verify-toc](dev/tools/verify-toc)
- [docs/development.md](docs/development.md)
</details>

# Build, Codegen & Repository Tools

This page documents how `kubernetes-sigs/agent-sandbox` is built, how its generated code (CRDs, RBAC, deepcopy methods, typed clientsets/listers/informers) is produced, how Go and Kubernetes-API lint checks are configured, and how the `dev/tools/` helper scripts (mostly Python, plus a few Bash scripts) tie those steps together. The repository keeps developer tooling deliberately co-located: a single top-level `Makefile` exposes named targets that shell out to per-script entry points under `dev/tools/`, while two separate module files, `tools.mod` and `dev/tools/go.mod`, keep developer-tool dependencies out of the production `go.mod`.

The flow matters because the controller's CRDs, RBAC manifests (under `k8s/` and `helm/`), `zz_generated.deepcopy.go`, and the typed clients under `clients/k8s/` are all derived from the Go API packages — they must be regenerated whenever `api/` or `extensions/` change, and the same generators are wired into both `make all` and CI.

## Make targets and the `make all` pipeline

The `Makefile` declares a single canonical orchestration target `all` that runs the full local-validation pipeline:

```make
all: fix-go-generate build lint-go lint-api test-unit toc-verify
```

Sources: [Makefile:1-2](Makefile)

Each phase is a thin shell-out to a script under `dev/tools/`, so the Makefile acts mostly as a dispatcher rather than encoding logic itself.

| Target | Action | Underlying tool |
|---|---|---|
| `fix-go-generate` | Runs all `//go:generate` directives | `dev/tools/fix-go-generate` ([Makefile:4-6](Makefile)) |
| `build` | Builds the controller binary to `bin/manager` with version ldflags | `go build` ([Makefile:27-29](Makefile)) |
| `lint-go` / `fix-go` | Runs `golangci-lint` (with or without `--fix`) | `dev/tools/lint-go` ([Makefile:70-76](Makefile)) |
| `lint-api` / `fix-api` | Runs Kube-API Linter (KAL) on `api/` and `extensions/api/` | `dev/tools/lint-api` ([Makefile:78-84](Makefile)) |
| `test-unit` | Runs Go and Python unit suites | `dev/tools/test-unit` ([Makefile:54-56](Makefile)) |
| `test-e2e` / `test-e2e-race` / `test-e2e-benchmarks` | Runs the e2e suite; the `RACE` env var toggles `-race` | `dev/ci/presubmits/test-e2e` ([Makefile:58-68](Makefile)) |
| `deploy-kind` | Creates a kind cluster, pushes images, deploys controller | `dev/tools/create-kind-cluster`, `push-images`, `deploy-to-kube` ([Makefile:33-40](Makefile)) |
| `toc-update` / `toc-verify` | Updates/verifies KEP table-of-contents using `mdtoc` | `dev/tools/update-toc`, `dev/tools/verify-toc` ([Makefile:128-134](Makefile)) |
| `generate-api-docs` | Generates `docs/api.md` from CRD markers | `crd-ref-docs` ([Makefile:10-15](Makefile)) |
| `release-promote` / `release-publish` / `release-manifests` / `release-python-sdk` | Tag/publish flows; require `TAG=` | `dev/tools/tag-promote-images`, `dev/tools/release`, `dev/tools/release-python` ([Makefile:98-126](Makefile)) |
| `clean` | Removes `dev/tools/tmp` and `bin/manager` | (none) ([Makefile:136-139](Makefile)) |

### Version ldflags

The `build` target injects build metadata into `internal/version` via linker flags. `GIT_VERSION`, `GIT_SHA`, and `BUILD_DATE` are computed from `git describe`/`git rev-parse`/`date -u`, and stamped through `-X` flags into `sigs.k8s.io/agent-sandbox/internal/version`:

```make
VERSION_PKG := sigs.k8s.io/agent-sandbox/internal/version
LD_FLAGS := -s -w -X $(VERSION_PKG).gitVersion=$(GIT_VERSION) \
    -X $(VERSION_PKG).gitSHA=$(GIT_SHA) \
    -X $(VERSION_PKG).buildDate=$(BUILD_DATE)
```

Sources: [Makefile:17-29](Makefile)
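
As a rough illustration of how the three values end up in the linker flags, the following Python sketch assembles the same `-X` string that the Make variables produce. The function name `build_ldflags` and the sample values in the usage note are hypothetical; the repository does this in Make, not Python.

```python
VERSION_PKG = "sigs.k8s.io/agent-sandbox/internal/version"


def build_ldflags(git_version: str, git_sha: str, build_date: str) -> str:
    """Assemble a -ldflags value: strip symbols (-s -w) and stamp three variables."""
    # Variable names mirror the ones shown in the Makefile excerpt above.
    stamps = {
        "gitVersion": git_version,
        "gitSHA": git_sha,
        "buildDate": build_date,
    }
    parts = ["-s", "-w"]
    for name, value in stamps.items():
        # -X importpath.name=value sets a package-level string variable at link time.
        parts.append(f"-X {VERSION_PKG}.{name}={value}")
    return " ".join(parts)
```

For example, `build_ldflags("v0.1.0", "abc1234", "2026-01-01T00:00:00Z")` yields a single string that could be passed to `go build -ldflags`. Values containing spaces would need extra quoting, which this sketch does not handle.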

## The `codegen.go` shim and `tools.mod`

`codegen.go` is a tiny, otherwise-empty file at the repo root whose sole purpose is to host project-wide `//go:generate` directives. The package compiles as part of the module but contributes no runtime code:

```go
// This file just exists as a place to put //go:generate directives that should apply to the entire project
package agentsandbox
```

Sources: [codegen.go:15-17](codegen.go)

It declares eight `controller-gen` invocations plus the clientset script:

```go
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen object crd:maxDescLen=0 paths=./api/... output:crd:dir=k8s/crds
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen object crd:maxDescLen=0 paths=./extensions/... output:crd:dir=k8s/crds
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen paths=./controllers/... output:rbac:dir=k8s rbac:roleName=agent-sandbox-controller,fileName=rbac.generated.yaml
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen paths=./extensions/controllers/... output:rbac:dir=k8s rbac:roleName=agent-sandbox-controller-extensions,fileName=extensions-rbac.generated.yaml
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen object crd:maxDescLen=0 paths=./api/... output:crd:dir=helm/crds
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen object crd:maxDescLen=0 paths=./extensions/... output:crd:dir=helm/crds
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen paths=./controllers/... output:rbac:dir=helm/templates rbac:roleName=agent-sandbox-controller,fileName=rbac.generated.yaml
//go:generate go tool -modfile=tools.mod sigs.k8s.io/controller-tools/cmd/controller-gen paths=./extensions/controllers/... output:rbac:dir=helm/templates rbac:roleName=agent-sandbox-controller-extensions,fileName=extensions-rbac.generated.yaml
//go:generate ./dev/tools/client-gen-go.sh
```

Sources: [codegen.go:20-28](codegen.go)

Two design choices are worth calling out:

1. **`go tool -modfile=tools.mod`** invokes `controller-gen` from a sibling module file, not from `go.mod`. `tools.mod` declares it as a Go tool: `tool sigs.k8s.io/controller-tools/cmd/controller-gen` ([tools.mod:97](tools.mod)). That keeps `controller-tools`, `code-generator`, and `gengo` out of the production module while still pinning their versions reproducibly.
2. **Each `paths=...` input is generated twice, once per output tree** (`k8s/...` and `helm/...`). The repository deliberately duplicates the generated CRDs and RBAC into both a raw manifest tree (`k8s/crds`, `k8s/rbac.generated.yaml`) and the Helm chart tree (`helm/crds`, `helm/templates/...`). The CRD files under those trees (e.g., `k8s/crds/agents.x-k8s.io_sandboxes.yaml`) are written by the `crd` generator; the `object` generator writes deepcopy methods alongside the API types as `zz_generated.deepcopy.go` ([api/v1beta1/zz_generated.deepcopy.go:1-5](api/v1beta1/zz_generated.deepcopy.go)).

### `controller-gen` generators in use

| Generator | Inputs | Outputs |
|---|---|---|
| `object` | `./api/...`, `./extensions/...` | `zz_generated.deepcopy.go` next to API types |
| `crd:maxDescLen=0` | same | `k8s/crds/*.yaml`, `helm/crds/*.yaml` |
| `rbac:roleName=agent-sandbox-controller` | `./controllers/...` | `k8s/rbac.generated.yaml`, `helm/templates/rbac.generated.yaml` |
| `rbac:roleName=agent-sandbox-controller-extensions` | `./extensions/controllers/...` | `k8s/extensions-rbac.generated.yaml`, `helm/templates/extensions-rbac.generated.yaml` |

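`maxDescLen=0` tells `controller-gen` to drop field descriptions from the generated CRD schemas entirely (a positive value would truncate them to that many characters instead). The generated manifests therefore stay compact, and the field documentation lives in the Go source comments rather than in the on-cluster schemas.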

Sources: [codegen.go:20-28](codegen.go), [k8s/crds/](k8s/crds), [helm/crds/](helm/crds)
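
One way to see why there are exactly eight `controller-gen` directives is to treat them as the product of four generator inputs and two output trees. The Python sketch below rebuilds the command lines from that matrix. It is only an illustration: `controller_gen_commands` and the two lookup tables are hypothetical names, and the repository simply lists the directives by hand in `codegen.go`.

```python
from itertools import product

CONTROLLER_GEN = [
    "go", "tool", "-modfile=tools.mod",
    "sigs.k8s.io/controller-tools/cmd/controller-gen",
]

# (kind, input path, RBAC role name, RBAC file name)
GENERATOR_INPUTS = [
    ("crd", "./api/...", None, None),
    ("crd", "./extensions/...", None, None),
    ("rbac", "./controllers/...",
     "agent-sandbox-controller", "rbac.generated.yaml"),
    ("rbac", "./extensions/controllers/...",
     "agent-sandbox-controller-extensions", "extensions-rbac.generated.yaml"),
]

# output tree -> (CRD directory, RBAC directory)
OUTPUT_TREES = {
    "k8s": ("k8s/crds", "k8s"),
    "helm": ("helm/crds", "helm/templates"),
}


def controller_gen_commands():
    """Return one argv list per (output tree, generator input) pair."""
    commands = []
    for (crd_dir, rbac_dir), (kind, path, role, file_name) in product(
        OUTPUT_TREES.values(), GENERATOR_INPUTS
    ):
        if kind == "crd":
            # deepcopy ("object") and CRD generation share one invocation
            args = ["object", "crd:maxDescLen=0",
                    f"paths={path}", f"output:crd:dir={crd_dir}"]
        else:
            args = [f"paths={path}", f"output:rbac:dir={rbac_dir}",
                    f"rbac:roleName={role},fileName={file_name}"]
        commands.append(CONTROLLER_GEN + args)
    return commands
```

Joining each returned list with spaces reproduces the directive bodies in the order they appear in `codegen.go` (the `k8s/` tree first, then `helm/`).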

### `fix-go-generate`

`make fix-go-generate` delegates to `dev/tools/fix-go-generate`, which simply runs `go generate -v ./...` from the repo root so every package's `//go:generate` lines (in practice just `codegen.go`) fire:

```python
subprocess.check_call(["go", "generate", "-v", "./..."], cwd=repo_root)
```

Sources: [dev/tools/fix-go-generate:30-31](dev/tools/fix-go-generate)

## Typed-client generation: `client-gen-go.sh`

The last `//go:generate` directive in `codegen.go` is a shell script that produces Kubernetes typed clients, listers, and informers for both the core and extensions API groups. It bootstraps the `k8s.io/code-generator` binaries through `go run -modfile=tools.mod`, so they too come from the dev-only tool module rather than `go.mod`.

```bash
CMD="go run -modfile=tools.mod k8s.io/code-generator"
API_PKG="sigs.k8s.io/agent-sandbox/api/v1beta1"
CLIENT_PKG="sigs.k8s.io/agent-sandbox/clients/k8s"
```

Sources: [dev/tools/client-gen-go.sh:24-26](dev/tools/client-gen-go.sh)

It then invokes three generators in sequence per API group:

```text
client-gen   --output-dir clients/k8s/clientset            --clientset-name versioned   --input <API_PKG>
lister-gen   --output-dir clients/k8s/listers              <API_PKG>
informer-gen --output-dir clients/k8s/informers
             --versioned-clientset-package <CLIENT_PKG>/clientset/versioned
             --listers-package <CLIENT_PKG>/listers        <API_PKG>
```

The same three calls run again for `extensions/api/v1beta1`, writing into `clients/k8s/extensions/{clientset,listers,informers}`. Finally, the script applies repo-standard Apache license headers to the freshly generated files:

```bash
echo "Fixing license headers..."
"${SCRIPT_ROOT}"/dev/tools/fix-boilerplate
```

Sources: [dev/tools/client-gen-go.sh:24-79](dev/tools/client-gen-go.sh)

`fix-boilerplate` walks the tree and prepends the Apache 2.0 header to files that lack it, skipping anything that matches `**/zz_generated.*` or `**/site/**`. The actual header insertion is delegated to the `dev/tools/shared/headers.py` helper:

```python
excludes = [
    "**/zz_generated.*",
    "**/site/**",
]
headers.apply_headers_to_tree(repo_root, excludes=excludes)
```

Sources: [dev/tools/fix-boilerplate:31-38](dev/tools/fix-boilerplate), [dev/tools/shared/headers.py:40-72](dev/tools/shared/headers.py)
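
The exclude list can be pictured as a simple glob filter over repo-relative paths. The sketch below uses the standard library's `fnmatch` to show the idea. It is an assumption about the matching behavior, not a copy of `headers.py`: the function name `is_excluded` and the sample paths are hypothetical, and the real helper may interpret `**` differently.

```python
from fnmatch import fnmatchcase

EXCLUDES = ["**/zz_generated.*", "**/site/**"]


def is_excluded(rel_path: str, excludes=EXCLUDES) -> bool:
    """Return True when a '/'-separated, repo-relative path matches any exclude glob.

    fnmatch's '*' also matches '/', so '**' behaves like a single wildcard here,
    and a leading '**/' still requires at least one parent directory.
    """
    return any(fnmatchcase(rel_path, pattern) for pattern in excludes)
```

With these patterns, a generated file such as `api/v1beta1/zz_generated.deepcopy.go` is skipped, while an ordinary source file such as `controllers/sandbox_controller.go` would still receive a header.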

### End-to-end codegen view

```mermaid
flowchart LR
    subgraph Sources["Source packages"]
        APIv["api/v1beta1/*.go"]
        EXTAPI["extensions/api/v1beta1/*.go"]
        CTL["controllers/**.go"]
        EXTCTL["extensions/controllers/**.go"]
    end

    subgraph Driver["go generate driver"]
        CODEGEN["codegen.go<br/>//go:generate"]
        FIXGEN["dev/tools/fix-go-generate"]
    end

    subgraph Tools["tools.mod (dev-only deps)"]
        CG["controller-gen<br/>(object / crd / rbac)"]
        CODEGEN_K["k8s.io/code-generator<br/>(client/lister/informer)"]
    end

    subgraph Outputs["Generated artifacts"]
        DC["api/**/zz_generated.deepcopy.go"]
        CRDK["k8s/crds/*.yaml"]
        CRDH["helm/crds/*.yaml"]
        RBACK["k8s/*rbac.generated.yaml"]
        RBACH["helm/templates/*rbac.generated.yaml"]
        CLS["clients/k8s/{clientset,listers,informers}"]
        EXTCLS["clients/k8s/extensions/{clientset,listers,informers}"]
    end

    FIXGEN -->|go generate ./...| CODEGEN
    CODEGEN --> CG
    CODEGEN --> CODEGEN_K
    APIv --> CG
    EXTAPI --> CG
    CTL --> CG
    EXTCTL --> CG
    CG --> DC
    CG --> CRDK
    CG --> CRDH
    CG --> RBACK
    CG --> RBACH
    APIv --> CODEGEN_K
    EXTAPI --> CODEGEN_K
    CODEGEN_K --> CLS
    CODEGEN_K --> EXTCLS
    CLS -->|fix-boilerplate| CLS
    EXTCLS -->|fix-boilerplate| EXTCLS
```

Sources: [codegen.go:20-28](codegen.go), [dev/tools/client-gen-go.sh:24-79](dev/tools/client-gen-go.sh), [dev/tools/fix-go-generate:30-31](dev/tools/fix-go-generate), [tools.mod:88-97](tools.mod)

## Lint configuration

The repository runs two distinct Go linters: a general-purpose `golangci-lint` and a Kubernetes-API-specific KAL (Kube-API Linter). Both pull their binaries from the dev tool module via `go tool -modfile=dev/tools/go.mod`.

### General Go lint (`lint-go`)

`dev/tools/lint-go` shells out to `golangci-lint` with the repo's config:

```python
args = ["golangci-lint", "run", f"--config={repo_root}/dev/tools/.golangci.yaml"]
if "--fix" in sys.argv:
    args.append("--fix")
result = subprocess.run(utils.go_tool_args(*args), cwd=repo_root)
```

Sources: [dev/tools/lint-go:22-28](dev/tools/lint-go)

The helper `utils.go_tool_args` prepends `go tool -modfile=<repo>/dev/tools/go.mod`:

```python
def go_tool_args(*args):
    repo_root = get_repo_root()
    return ["go", "tool", f"-modfile={repo_root}/dev/tools/go.mod", *args]
```

Sources: [dev/tools/shared/utils.py:64-67](dev/tools/shared/utils.py)

`.golangci.yaml` enables a curated set of linters (`depguard`, `revive`, `staticcheck`, `govet`, `errcheck`, `ineffassign`, `unparam`, `testifylint`, `sloglint`, `misspell`, `modernize`, …) and bans `import "sort"` via `depguard`, steering callers to the `slices` package. Formatters (`gofmt`, `goimports`) are configured alongside, with `third_party`, `builtin`, and `examples` excluded:

```yaml
depguard:
  rules:
    forbid-pkg-errors:
      deny:
        - pkg: sort
          desc: Should be replaced with slices package
```

Sources: [dev/tools/.golangci.yaml:1-55](dev/tools/.golangci.yaml)

### Kube-API Linter (`lint-api`)

KAL is shipped as a `golangci-lint` plugin and must be compiled into a custom golangci-lint binary. `dev/tools/lint-api` builds it on first use, then runs it against the API trees only:

```python
kal_binary = os.path.join(tools_dir, "tmp", "bin", "golangci-kube-api-linter")
if not os.path.exists(kal_binary):
    build_script = os.path.join(tools_dir, "build-kal")
    build_result = subprocess.run([build_script], cwd=repo_root)
...
args = [kal_binary, "run", "--config", kal_config]
if "--fix" in sys.argv:
    args.append("--fix")
args.extend(["./api/...", "./extensions/api/..."])
```

Sources: [dev/tools/lint-api:31-53](dev/tools/lint-api)

`build-kal` compiles the custom binary using `golangci-lint custom`:

```bash
(cd "${SCRIPT_DIR}" && go tool -modfile "${SCRIPT_DIR}/go.mod" golangci-lint custom)
```

Sources: [dev/tools/build-kal:19-26](dev/tools/build-kal)

The output lives at `dev/tools/tmp/bin/golangci-kube-api-linter` — note that `make clean` removes `dev/tools/tmp` to force a rebuild ([Makefile:136-139](Makefile)).

`.golangci-kal.yml` disables all default linters (`default: none`) and enables only `kubeapilinter`, with an explicit allowlist of KAL checks that enforce CRD authoring conventions. The list includes:

| KAL linter | Enforces |
|---|---|
| `nobools` / `nofloats` | No `bool`/`float` fields in API types |
| `commentstart` / `jsontags` | godoc starts with field name, all fields have JSON tags |
| `optionalorrequired` | Every field is marked `+optional` or `+required` |
| `statussubresource` / `statusoptional` | `status` is a subresource with optional fields |
| `nophase` / `nomaps` / `nonullable` | No `phase` field, no maps, no `nullable` markers |
| `conflictingmarkers` | `default` cannot coexist with `required` |
| `duplicatemarkers` / `uniquemarkers` | No duplicate or non-unique markers on a field |
| `forbiddenmarkers` | Bans `PreserveUnknownFields`, `XPreserveUnknownFields`, `EmbeddedResource`, etc. |

Sources: [dev/tools/.golangci-kal.yml:11-57](dev/tools/.golangci-kal.yml)

Path rules restrict KAL to `api/.*` paths, and explicitly exclude `_test.go` and `zz_generated.*\.go$` so generated deepcopy code is not flagged:

```yaml
exclusions:
  generated: strict
  paths:
    - _test\.go
    - zz_generated.*\.go$
  rules:
  - path-except: "api/.*"
    linters:
      - kubeapilinter
```

Sources: [dev/tools/.golangci-kal.yml:58-68](dev/tools/.golangci-kal.yml)
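
The combined effect of the `paths` exclusions and the `path-except` rule can be approximated with a few regular-expression checks. The Python sketch below is a simplified model of that filtering, not golangci-lint's actual implementation; `kal_reports_on` and the sample file names are hypothetical.

```python
import re

# Patterns copied from the exclusions block above.
EXCLUDED_PATHS = [re.compile(r"_test\.go"), re.compile(r"zz_generated.*\.go$")]
# path-except: kubeapilinter findings are suppressed everywhere except here.
KAL_ONLY_PATHS = re.compile(r"api/.*")


def kal_reports_on(path: str) -> bool:
    """Approximate whether a kubeapilinter finding in `path` would be reported."""
    if any(pattern.search(path) for pattern in EXCLUDED_PATHS):
        return False  # tests and generated files are excluded outright
    return KAL_ONLY_PATHS.search(path) is not None
```

Under this model a hand-written type file under `api/` or `extensions/api/` is linted, while its `zz_generated.deepcopy.go` sibling and any controller code are not.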

## Unit tests: `test-unit`

`dev/tools/test-unit` is the single entry point used by both `make test-unit` and CI. It runs Go and Python suites and emits JUnit XML files under `bin/`.

### Go tests

It enumerates all packages with `go list ./...`, filters out `test/e2e`, then runs them under `gotestsum` (a Go test wrapper that produces JUnit), with `-race` enabled:

```python
go_list_output = subprocess.check_output(["go", "list", "./..."], cwd=repo_root, text=True)
packages = go_list_output.split()  # one import path per line
filtered_packages = [pkg for pkg in packages if "test/e2e" not in pkg]

result = subprocess.run(utils.go_tool_args(
    "gotestsum",
    f"--junitfile={repo_root}/bin/unit-junit.xml",
    "--",
    "-race",
    *filtered_packages
), cwd=repo_root)
```

Sources: [dev/tools/test-unit:37-53](dev/tools/test-unit)

`gotestsum` is provided by `tool gotest.tools/gotestsum` in `dev/tools/go.mod` ([dev/tools/go.mod:5-10](dev/tools/go.mod)), so it runs via the same `go tool -modfile=…` mechanism as the other dev tools.

### Python tests

Two Python suites under `clients/python/agentic-sandbox-client/` are exercised inside a clean per-suite venv:

```python
PYTHON_TEST_SUITES = [
    {"name": "sandbox-router",     "dir": "clients/python/.../sandbox-router",     "requirements": "requirements.txt"},
    {"name": "k8s-agent-sandbox",  "dir": "clients/python/.../k8s_agent_sandbox", "requirements": "requirements.txt"},
]
```

For each suite the runner:

1. Wipes any existing venv at `bin/python-venv-<name>` (verifying `pyvenv.cfg` exists before removal, to avoid trashing an unrelated directory).
2. Creates a fresh venv with `python -m venv`.
3. Installs `requirements.txt`; if absent and the suite is `k8s-agent-sandbox`, installs the package itself with `pip install -e .[test]`.
4. Installs `pytest` separately.
5. Runs `pytest` with `--junitxml=bin/python-<name>-junit.xml -v`.

Sources: [dev/tools/test-unit:55-108](dev/tools/test-unit)
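
The guard in step 1 is worth spelling out, since it is what keeps the runner from deleting an unrelated directory. The following Python sketch shows one way to implement it. The function names are hypothetical, and raising `ValueError` on a non-venv directory is an assumption: the source only says the runner verifies `pyvenv.cfg` before removal.

```python
import shutil
import tempfile
from pathlib import Path


def wipe_venv_if_safe(venv_dir: Path) -> bool:
    """Delete venv_dir only if it looks like a venv; return True if it was removed."""
    if not venv_dir.exists():
        return False  # nothing to clean up
    if not (venv_dir / "pyvenv.cfg").is_file():
        # Refuse to delete a directory that `python -m venv` did not create.
        raise ValueError(f"{venv_dir} has no pyvenv.cfg; refusing to remove it")
    shutil.rmtree(venv_dir)
    return True


def demo() -> tuple:
    """Exercise both branches inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        venv = Path(tmp) / "python-venv-example"
        venv.mkdir()
        (venv / "pyvenv.cfg").write_text("home = /usr/bin\n")
        removed = wipe_venv_if_safe(venv)

        other = Path(tmp) / "not-a-venv"
        other.mkdir()
        try:
            wipe_venv_if_safe(other)
            refused = False
        except ValueError:
            refused = True
        return removed, refused, venv.exists(), other.exists()
```

`demo()` removes the directory that contains `pyvenv.cfg` and leaves the other one in place, which is the property the real runner relies on before recreating the venv.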

The runner returns the Go exit code first if non-zero, otherwise the Python exit code — both must pass for `make test-unit` to succeed.
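
The exit-code precedence described above can be sketched in a few lines of Python. `combined_exit_code` is a hypothetical name, and returning the first non-zero Python suite code is an assumption about how the two suites are aggregated.

```python
def combined_exit_code(go_rc: int, python_rcs: list) -> int:
    """Prefer the Go exit code when it is non-zero, else the first failing Python suite."""
    if go_rc != 0:
        return go_rc
    for rc in python_rcs:
        if rc != 0:
            return rc
    return 0  # every suite passed
```

The overall run succeeds only when this function returns `0`, so a passing Go run does not mask a Python failure and vice versa.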

## Shared helpers and tool module layout

The Python scripts under `dev/tools/` share a small library:

| Module | Purpose |
|---|---|
| `shared/utils.py` | `get_repo_root()`, `go_tool_args()`, image-tag/git helpers used by deploy/release scripts |
| `shared/headers.py` | Apache-header insertion engine used by `fix-boilerplate` |
| `shared/golang.py` | Walks the repo's Go modules (used by `fix-gomod`, `fix-go-format`) |
| `shared/git_ops.py` | Git helpers used by release/tagging flows |

Sources: [dev/tools/shared/utils.py:21-67](dev/tools/shared/utils.py), [dev/tools/shared/headers.py:23-72](dev/tools/shared/headers.py), [dev/tools/fix-boilerplate:31-38](dev/tools/fix-boilerplate), [dev/tools/fix-gomod:24-39](dev/tools/fix-gomod)

A compact map of the repository's module/tool surfaces:

```text
repo/
├── go.mod              -> production module (controller, controllers, api)
├── tools.mod           -> codegen tools: controller-gen, code-generator
│                          (used by //go:generate via `go tool -modfile=tools.mod`)
├── codegen.go          -> hosts all //go:generate directives
├── api/                -> kubebuilder-marked source for CRDs / deepcopy
├── extensions/api/     -> same, for extensions group
├── clients/k8s/        -> code-generator output (clientset/listers/informers)
├── k8s/crds/, k8s/*rbac.generated.yaml         -> raw manifest output
├── helm/crds/, helm/templates/*rbac.generated.yaml -> Helm chart output
└── dev/tools/
    ├── go.mod          -> lint/test tools: golangci-lint, gotestsum,
    │                       kube-api-linter, benchstat
    ├── .golangci.yaml  -> general Go lint config
    ├── .golangci-kal.yml -> KAL config (CRD conventions)
    ├── client-gen-go.sh -> drives client/lister/informer-gen
    ├── fix-go-generate, fix-boilerplate, fix-go-format, fix-gomod
    ├── lint-go, lint-api, build-kal
    ├── test-unit, test-e2e
    ├── update-toc, verify-toc
    └── shared/         -> Python helpers
```

Sources: [tools.mod:1-97](tools.mod), [dev/tools/go.mod:1-12](dev/tools/go.mod), [codegen.go:20-28](codegen.go)

The three-module split is the key invariant: production code depends only on `go.mod`; CRD/client codegen tools come from `tools.mod`; lint/test tools come from `dev/tools/go.mod`. Each has its own `tool` block so `go tool -modfile=<path>` is the universal invocation pattern.

## Documentation utilities

Two small Bash scripts keep the tables of contents in the KEP docs current:

- `update-toc` (`make toc-update`) installs `sigs.k8s.io/mdtoc@v1.1.0` into a `mktemp -d` directory (to avoid mutating the project go.mod/go.sum), then runs it across `docs/keps/**.md` with `--max-depth=5`, filtering out the names listed in `dev/tools/.notableofcontents`. Sources: [dev/tools/update-toc:22-52](dev/tools/update-toc)
- `verify-toc` (`make toc-verify`) does the same install, then runs `mdtoc --dryrun` to fail when the TOCs are out of date. The two scripts share a `TOOL_VERSION=v1.1.0` constant with a "keep in sync" comment. Sources: [dev/tools/verify-toc:21-53](dev/tools/verify-toc)

`generate-api-docs` follows the same temp-install pattern for `github.com/elastic/crd-ref-docs`, rendering `docs/api.md` from the CRD markers ([Makefile:10-15](Makefile)).
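
The temp-install pattern can be sketched as follows. This Python version only plans the command and environment instead of running them. Using `GOBIN` to redirect the install is an assumption made for illustration (the real scripts are Bash and may use a different mechanism), and `mdtoc_install_plan` is a hypothetical name.

```python
import os

MDTOC_VERSION = "v1.1.0"  # the shared TOOL_VERSION described above


def mdtoc_install_plan(tmp_dir: str, base_env=None):
    """Return (argv, env, binary_path) for installing mdtoc into a scratch directory."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: point GOBIN at the scratch dir so nothing lands in the repo or $HOME.
    env["GOBIN"] = os.path.join(tmp_dir, "bin")
    # `go install pkg@version` resolves the tool without touching the project's go.mod.
    argv = ["go", "install", f"sigs.k8s.io/mdtoc@{MDTOC_VERSION}"]
    binary = os.path.join(env["GOBIN"], "mdtoc")
    return argv, env, binary
```

A caller could pass the result to `subprocess.check_call(argv, env=env)` with `tmp_dir` taken from `tempfile.mkdtemp()`, then invoke the returned binary path on the KEP files.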

## Putting it together

The combined effect is that a maintainer editing an API type only needs to run `make all` ([Makefile:1-2](Makefile)) to:

1. Re-run all `//go:generate` directives → updated deepcopy, CRDs (in both `k8s/` and `helm/`), RBAC, and typed clients with license headers fixed up.
2. Rebuild the controller binary with embedded version metadata.
3. Run general Go lint (`golangci-lint`) and the CRD-convention lint (KAL) against `api/`/`extensions/api/`.
4. Run Go + Python unit tests with race detection and JUnit output.
5. Verify KEP TOCs are current.

The same scripts run under Prow CI (presubmits and postsubmits), which keeps local and CI behavior consistent. Both are anchored on `dev/tools/` and the two side-car module files (`tools.mod`, `dev/tools/go.mod`) that isolate developer tooling from production dependencies.

Sources: [Makefile:1-2](Makefile), [codegen.go:20-28](codegen.go), [dev/tools/client-gen-go.sh:24-79](dev/tools/client-gen-go.sh), [dev/tools/test-unit:37-108](dev/tools/test-unit), [docs/development.md:34-49](docs/development.md)

---

## 28. E2E Test Framework

> Layout of the Go e2e suite, the framework client/predicates/watchset helpers, and the parallel/replica/shutdown scenario coverage.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/28-e2e-test-framework.md
- Generated: 2026-05-25T22:53:04.074Z

### Source Files

- `test/e2e/README.md`
- `test/e2e/framework/client.go`
- `test/e2e/framework/watchset.go`
- `test/e2e/basic_test.go`
- `test/e2e/parallelism_test.go`
- `test/e2e/extensions/warmpool_rollout_test.go`
- `docs/testing.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [test/e2e/README.md](test/e2e/README.md)
- [docs/testing.md](docs/testing.md)
- [test/e2e/framework/client.go](test/e2e/framework/client.go)
- [test/e2e/framework/watchset.go](test/e2e/framework/watchset.go)
- [test/e2e/framework/context.go](test/e2e/framework/context.go)
- [test/e2e/framework/testlogs.go](test/e2e/framework/testlogs.go)
- [test/e2e/framework/componentlogs.go](test/e2e/framework/componentlogs.go)
- [test/e2e/framework/predicates/predicates.go](test/e2e/framework/predicates/predicates.go)
- [test/e2e/framework/predicates/conditions.go](test/e2e/framework/predicates/conditions.go)
- [test/e2e/framework/predicates/metadata.go](test/e2e/framework/predicates/metadata.go)
- [test/e2e/framework/predicates/sandbox.go](test/e2e/framework/predicates/sandbox.go)
- [test/e2e/basic_test.go](test/e2e/basic_test.go)
- [test/e2e/replicas_test.go](test/e2e/replicas_test.go)
- [test/e2e/shutdown_test.go](test/e2e/shutdown_test.go)
- [test/e2e/parallelism_test.go](test/e2e/parallelism_test.go)
- [test/e2e/extensions/warmpool_rollout_test.go](test/e2e/extensions/warmpool_rollout_test.go)
- [test/e2e/utils.go](test/e2e/utils.go)
- [Makefile](Makefile)
</details>

# E2E Test Framework

The agent-sandbox end-to-end suite is a Go `testing` package that runs against a real Kubernetes cluster (typically a kind cluster created by `make deploy-kind`), driving the `Sandbox`, `SandboxClaim`, `SandboxTemplate`, and `SandboxWarmPool` CRDs through the agent-sandbox controller. It is structured around a thin `framework` package that owns cluster credentials, object lifecycle, predicate evaluation, and a shared watch fan-out, and a set of `*_test.go` scenarios that exercise basic reconciliation, replica scaling, shutdown lifecycle, parallel reconciliation under load, and warm-pool rollout strategies.

This page maps the framework's public surface, the wiring between `TestContext`, `ClusterClient`, predicates, and `WatchSet`, and then walks through the parallel, replica, shutdown, and warm-pool scenarios to show how the helpers are composed.

## How tests are launched

Tests assume that a cluster is reachable through `$KUBECONFIG` (falling back to `bin/KUBECONFIG` in the repository when the variable is unset) and that the agent-sandbox controller is already installed. The README documents the canonical invocation:

```shell
make deploy-kind
go test ./test/e2e/... --parallel=1
```

`--parallel=1` serializes the top-level tests because they share a single cluster and patch the controller `Deployment` in some scenarios.

Sources: [test/e2e/README.md:1-30](), [Makefile:33-68](), [test/e2e/framework/context.go:40-67]()

The `Makefile` exposes higher-level entry points used by CI:

| Target | Behavior |
| --- | --- |
| `make test-e2e` | Runs `./dev/ci/presubmits/test-e2e` (race detector toggled via `RACE`). |
| `make test-e2e-race` | Same script with `RACE=1`. |
| `make test-e2e-benchmarks` | Same script with `--suite benchmarks`. |
| `make deploy-kind` / `make delete-kind` | Provision/teardown the kind cluster used by the suite. |

Sources: [Makefile:33-68](), [docs/testing.md:12-35]()

## Repository layout

```text
test/e2e/
├── README.md                 # run instructions
├── basic_test.go             # TestSimpleSandbox
├── replicas_test.go          # TestSandboxReplicas
├── shutdown_test.go          # TestSandboxShutdownTime, retain-expiry
├── parallelism_test.go       # parallel Sandboxes / SandboxClaims
├── volumeclaimtemplate_test.go
├── chromesandbox_test.go, chromesandbox_claim_test.go
├── utils.go                  # AtomicTimeDuration helper
├── framework/                # shared client + watch + predicates
│   ├── client.go             # ClusterClient (Get/Create/WaitForObject/Watch)
│   ├── context.go            # TestContext, kubeconfig, before/afterEach
│   ├── watchset.go           # WatchSet/ResourceWatch/Subscription fan-out
│   ├── testlogs.go           # per-test artifacts log file
│   ├── componentlogs.go      # kubelet/containerd log capture from kind nodes
│   └── predicates/           # ObjectPredicate library
└── extensions/               # CRD-specific scenarios
    ├── warmpool_rollout_test.go
    ├── warmpool_sandbox_watcher_test.go
    ├── pythonruntime_test.go
    ├── sandboxclaim_metric_test.go
    └── shutdown_policy_test.go
```

Sources: [test/e2e/framework/client.go:15-45](), [test/e2e/framework/context.go:14-37](), [test/e2e/extensions/warmpool_rollout_test.go:15-34]()

## Framework architecture

The framework wraps `*testing.T` (or `*testing.B`) into a `TestContext` that exposes a `ClusterClient` and a per-test `WatchSet`. The `ClusterClient` delegates CRUD to a controller-runtime `client.Client`, while waits and watches go through a `dynamic.Interface` so the framework can drive arbitrary GVRs without registering Go types ahead of time.

```mermaid
flowchart LR
    subgraph TestCase["*_test.go scenario"]
        TC["framework.NewTestContext(t)"]
    end

    subgraph Framework["test/e2e/framework"]
        Ctx["TestContext\n(context.go)"]
        CC["ClusterClient\n(client.go)"]
        WS["WatchSet\n(watchset.go)"]
        Logs["logCapturingT\n(testlogs.go)"]
        Comp["dumpControllerLogs\nMustGetKubeletLogs"]
    end

    subgraph Predicates["framework/predicates"]
        IF["ObjectPredicate interface"]
        Status["StatusPredicate\nConditionReasonPredicate\nReadyReplicasPredicate"]
        Meta["Annotation/Label/Owner\nNotDeleted/HasDeletionTimestamp"]
        Sandbox["SandboxHasStatus\n(go-cmp diff)"]
    end

    subgraph K8s["Kubernetes cluster (kind)"]
        API["kube-apiserver"]
        Ctrl["agent-sandbox-controller\n(agent-sandbox-system)"]
        CRDs["Sandbox / SandboxClaim\nSandboxTemplate / SandboxWarmPool"]
    end

    TC --> Ctx
    Ctx --> CC
    Ctx --> Logs
    Ctx --> Comp
    CC --> WS
    CC -->|"controller-runtime client"| API
    WS -->|"dynamic.Interface\nWatch loop"| API
    CC --> Predicates
    Predicates --> Status
    Predicates --> Meta
    Predicates --> Sandbox
    Ctrl --> CRDs
    API --> Ctrl
```

Sources: [test/e2e/framework/context.go:91-161](), [test/e2e/framework/client.go:51-78](), [test/e2e/framework/watchset.go:59-148](), [test/e2e/framework/predicates/predicates.go:26-33]()

### `TestContext` lifecycle

`NewTestContext` is the single entry point used by every scenario. It loads the kubeconfig, builds both a typed `controller-runtime` client and a `dynamic.Interface`, creates a `WatchSet`, attaches log capture, and registers cleanup hooks. `beforeEach` validates that the CRDs, the `agent-sandbox-system` namespace, and the `agent-sandbox-controller` Deployment exist before the test body runs; `afterEach` dumps controller logs into the test's artifacts directory if the test failed.

Key responsibilities, in order:

1. Resolve `KUBECONFIG` (env var, otherwise `<repo>/bin/KUBECONFIG`).
2. Build `*rest.Config`, shared HTTP client, controller-runtime client, and dynamic client against the cluster.
3. Create a `WatchSet(dynamicClient)` and register `watchSet.Close()` as test cleanup.
4. Wrap `T` with `logCapturingT` so every `Logf` / `Errorf` is mirrored to `<artifacts>/<TestName>/test.log` with elapsed-time prefixes.
5. Run `validateAgentSandboxInstallation()` — fails fast if the controller or CRDs are missing.
6. Register `afterEach()` to call `dumpControllerLogs` when `t.Failed()`.

Sources: [test/e2e/framework/context.go:91-178](), [test/e2e/framework/context.go:180-240](), [test/e2e/framework/client.go:521-549](), [test/e2e/framework/testlogs.go:34-77]()
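
Step 1 of the list above is a plain two-level fallback. The framework itself is written in Go, so the Python sketch below is only a language-neutral illustration of the lookup order. `resolve_kubeconfig` is a hypothetical name, and treating an empty `KUBECONFIG` value as unset is an assumption.

```python
import os


def resolve_kubeconfig(repo_root: str, env=None) -> str:
    """Prefer $KUBECONFIG; otherwise fall back to <repo>/bin/KUBECONFIG."""
    env = os.environ if env is None else env
    from_env = env.get("KUBECONFIG", "")
    if from_env:
        return from_env
    return os.path.join(repo_root, "bin", "KUBECONFIG")
```

Passing `env` explicitly keeps the lookup easy to test without mutating the process environment.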

### `ClusterClient` API

`ClusterClient` is the workhorse for all scenarios. Every method is `Helper()`-marked so failures point at the calling test line, and create operations register cleanup that deletes the object and then blocks on `WaitForObjectNotFound` so the next test starts with a clean namespace.

| Method | Purpose |
| --- | --- |
| `Get` / `List` / `Update` / `Delete` | Thin error-wrapping facades over the controller-runtime client. |
| `CreateWithCleanup` / `MustCreateWithCleanup` | Create the object and schedule deletion + wait-not-found in `t.Cleanup`. Uses `context.Background()` during cleanup since the test context is already done. |
| `MustUpdateObject[T]` | Re-fetch the latest version, apply `updateFunc`, then `Update`. Avoids stale-version conflicts. |
| `MatchesPredicates` / `MustMatchPredicates` / `MustExist` | One-shot predicate evaluation against the current API state. |
| `ValidateObjectNotFound` / `WaitForObjectNotFound` | Confirm deletion; namespaces get a 3-minute timeout because of cascading cleanup, everything else is 60s. |
| `PollUntilObjectMatches` | 1-second polling fallback used when watches aren't a good fit (for example, the shutdown-time scenario). |
| `WaitForObject` / `MustWaitForObject` | Watch-driven wait with predicate evaluation; the preferred timing primitive. |
| `Watch` / `Watch[T]` / `MustWatch[T]` | Generic event callback against a `GVR + WatchFilter`. |
| `WaitForSandboxReady` / `WaitForWarmPoolReady` / `GetSandbox` | Convenience wrappers for the specific CRDs. |
| `PortForward` | Spawns `kubectl port-forward` for tests that need direct pod access (e.g., Chrome debug port). |
| `ExecuteOnNode` / `IsKindCluster` | `docker exec` into the kind node container; tests can opt into kind-only behavior. |

`DefaultTimeout` is 60 seconds; a shorter deadline can be supplied via the caller's context, and `WaitForObject` clamps to whichever is smaller.

Sources: [test/e2e/framework/client.go:46-265](), [test/e2e/framework/client.go:485-549](), [test/e2e/framework/client.go:551-718]()
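
The timeout rules described above can be modelled in a few lines. The framework itself is Go; this Python sketch is illustrative only. `effective_timeout` and `not_found_timeout` are hypothetical names, and the 60-second and 3-minute values are copied from the table and paragraph above.

```python
DEFAULT_TIMEOUT_S = 60.0      # DefaultTimeout, per the text above
NAMESPACE_TIMEOUT_S = 180.0   # namespaces get 3 minutes for cascading cleanup


def effective_timeout(caller_remaining_s=None, default_s=DEFAULT_TIMEOUT_S):
    """Return the smaller of the default timeout and the caller's remaining deadline."""
    if caller_remaining_s is None:   # the caller's context carries no deadline
        return default_s
    return min(default_s, caller_remaining_s)


def not_found_timeout(kind):
    """Pick the wait-for-deletion timeout for a kind (hypothetical helper)."""
    return NAMESPACE_TIMEOUT_S if kind == "Namespace" else DEFAULT_TIMEOUT_S
```

A caller with 5 seconds left on its context waits at most 5 seconds, while a caller with a 5-minute deadline is still capped at the 60-second default.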

### `WaitForObject` and the typed `Watch[T]` helper

`WaitForObject` is the single timing primitive that the suite leans on. It first checks whether the object already satisfies the predicates (fast path). If not, it looks up the GVR from a hard-coded GVK table (`gvrForGVK` covers Pod, Deployment, Namespace, Sandbox, SandboxWarmPool, SandboxClaim), subscribes through the shared `WatchSet`, and runs the predicate callback on each event. On timeout or non-match it logs the last observed object as YAML, which makes it easier to see which status fields diverged from the expectation.

```go
// test/e2e/framework/client.go (excerpt)
done, err := Watch(ctx, cl, gvr, watchFilter, func(event watch.Event, obj *unstructured.Unstructured) (bool, error) {
    if event.Type == watch.Deleted {
        return false, fmt.Errorf("object was deleted while waiting for predicates to be satisfied")
    }
    var notMatching []predicates.ObjectPredicate
    for _, predicate := range p {
        if match, err := predicate.Matches(obj); err != nil {
            return false, err
        } else if !match {
            notMatching = append(notMatching, predicate)
        }
    }
    lastNotMatching = notMatching
    lastObject = obj.DeepCopy()
    return len(notMatching) == 0, nil
})
```

The generic `Watch[T]` decodes events to either `*unstructured.Unstructured` (passthrough) or any concrete `client.Object` by running `DefaultUnstructuredConverter.FromUnstructured`. Bookmark events are dropped, and `Error` events become callback errors. `MustWatch` swallows `context.Canceled` so tests that intentionally cancel the watch context don't fail.

Sources: [test/e2e/framework/client.go:267-349](), [test/e2e/framework/client.go:351-428](), [test/e2e/framework/client.go:430-473]()
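
The control flow of `WaitForObject` (fast path, then per-event predicate evaluation, then a report of what never matched) can be modelled without a cluster. The Python sketch below is a simplified stand-in for the Go code, not a port of it: `Predicate` and `wait_for_object` are hypothetical names, events are plain `(type, object)` tuples, and an exhausted event list stands in for the timeout.

```python
class Predicate:
    """Hypothetical stand-in for predicates.ObjectPredicate: a match function plus a name."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def matches(self, obj):
        return self.fn(obj)

    def __str__(self):
        return self.name


def wait_for_object(current, events, preds):
    """Return (matched, names of predicates that did not match the last object seen)."""

    def unmatched(obj):
        return [str(p) for p in preds if not p.matches(obj)]

    # Fast path: the object may already satisfy every predicate.
    last = unmatched(current)
    if not last:
        return True, []

    for event_type, obj in events:
        if event_type == "DELETED":
            raise RuntimeError("object was deleted while waiting for predicates to be satisfied")
        last = unmatched(obj)
        if not last:
            return True, []

    # Events exhausted: stands in for the timeout path, which reports what never matched.
    return False, last
```

Returning the unmatched predicate names mirrors the role of `lastNotMatching` in the excerpt: the caller learns which expectation failed, not just that the wait failed.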

### `WatchSet`: shared dynamic watches

A naïve "one Watch per `WaitForObject` call" would create one HTTP watch per assertion and incur listing + setup latency on every wait. The framework instead keeps **one** persistent watch per `(GVR, namespace)` and fans out events to per-call subscriptions that apply a `WatchFilter{Namespace, Name}`.

```mermaid
classDiagram
    class WatchSet {
        +watches: map[watchKey]*ResourceWatch
        +dynamicClient: dynamic.Interface
        +Subscribe(gvr, filter) *Subscription
        +Close()
        -getOrCreateWatch(gvr, namespace) *ResourceWatch
        -removeWatchIfIdle(rw)
    }
    class ResourceWatch {
        +gvr: GroupVersionResource
        +namespace: string
        +subscriptions: map[uint64]*Subscription
        +cancelWatchLoop: CancelFunc
        -watchLoop(ctx)
        -broadcast(event)
        -subscribe(filter) *Subscription
        -unsubscribe(sub)
    }
    class Subscription {
        +Events: chan watch.Event
        +filter: WatchFilter
        +Close()
    }
    class WatchFilter {
        +Namespace: string
        +Name: string
    }
    WatchSet "1" --> "*" ResourceWatch : owns
    ResourceWatch "1" --> "*" Subscription : broadcasts to
    Subscription --> WatchFilter : per-call filter
```

Notable design details:

- The watch loop runs under `context.Background()` so it survives the caller's per-call context and only stops when `WatchSet.Close()` runs (registered in `NewTestContext`) or when the last subscription unsubscribes (`removeWatchIfIdle`).
- On a `watch.Error` event the resource version is reset to `""` and the loop restarts from scratch; transient errors back off for 100 ms before retry.
- `broadcast` always forwards `watch.Error` and `watch.Bookmark` events to all subscribers, but applies name/namespace filtering for `Added/Modified/Deleted`.
- Each subscription's `Events` channel is buffered at 100 to absorb bursts without blocking the watch loop.
- Cluster-scoped resources use `namespace == ""` and a single cluster-wide watch.

Sources: [test/e2e/framework/watchset.go:29-148](), [test/e2e/framework/watchset.go:150-211](), [test/e2e/framework/watchset.go:230-328]()
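
The fan-out design can be illustrated with an in-memory model. This Python sketch is an assumption-laden simplification of the Go `WatchSet`: there is no real watch loop, the class and method names only mirror the diagram above, and a full buffer raises `queue.Full` here, whereas the overflow behaviour of the Go code is not modelled.

```python
import queue


class Subscription:
    def __init__(self, name_filter):
        self.name_filter = name_filter           # "" means "any name"
        self.events = queue.Queue(maxsize=100)   # buffer size taken from the text


class ResourceWatch:
    def __init__(self):
        self.subscriptions = []

    def broadcast(self, event_type, name, obj=None):
        for sub in self.subscriptions:
            # ERROR and BOOKMARK always pass; object events honour the name filter.
            always = event_type in ("ERROR", "BOOKMARK")
            if always or sub.name_filter in ("", name):
                sub.events.put_nowait((event_type, name, obj))


class WatchSet:
    def __init__(self):
        self.watches = {}                        # (gvr, namespace) -> ResourceWatch

    def subscribe(self, gvr, namespace, name=""):
        # Reuse the existing watch for this key, or create it on first use.
        rw = self.watches.setdefault((gvr, namespace), ResourceWatch())
        sub = Subscription(name)
        rw.subscriptions.append(sub)
        return sub

    def unsubscribe(self, gvr, namespace, sub):
        rw = self.watches[(gvr, namespace)]
        rw.subscriptions.remove(sub)
        if not rw.subscriptions:                 # mirrors removeWatchIfIdle
            del self.watches[(gvr, namespace)]
```

Two subscriptions for different Sandbox names in the same namespace share one `ResourceWatch` entry, which is the property that keeps concurrent waits cheap.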

### Predicates

`predicates.ObjectPredicate` is a two-method interface: `Matches(obj) (bool, error)` plus `String()` from `fmt.Stringer`. Because every predicate can describe itself, `WaitForObject` can log the unmatched predicates (tracked as `lastNotMatching`) for fast debugging.

| Predicate | File | Behavior |
| --- | --- | --- |
| `ReadyConditionIsTrue` (`StatusPredicate`) | `conditions.go` | Looks for `status.conditions[Type=Ready, Status=True]` via unstructured conversion. |
| `ConditionReasonEquals(type, reason)` | `conditions.go` | Same shape but matches on `Reason`; used by the retain-expiry test. |
| `ReadyReplicasConditionIsTrue` | `conditions.go` | `status.readyReplicas == spec.replicas`; used to wait for the controller Deployment after patching. |
| `ObservedGenerationMatchesGeneration` | `conditions.go` | `status.observedGeneration == metadata.generation`; pairs with the replicas check on controller restarts. |
| `SandboxHasStatus(want)` | `sandbox.go` | Full `go-cmp` diff against `SandboxStatus`, ignoring `LastTransitionTime` and `PodIPs`. |
| `HasAnnotation` / `HasLabel` / `HasOwnerReferences` | `metadata.go` | Map and slice comparisons; owner refs are sorted by UID before diffing. |
| `NotDeleted` / `HasDeletionTimestamp` | `metadata.go` | DeletionTimestamp checks for cascading-delete and rollout assertions. |

All status predicates funnel through `asUnstructured` and `runtime.DefaultUnstructuredConverter.FromUnstructured`, which lets the same predicate evaluate any CRD without registering its Go type.

Sources: [test/e2e/framework/predicates/predicates.go:26-33](), [test/e2e/framework/predicates/conditions.go:38-189](), [test/e2e/framework/predicates/sandbox.go:38-66](), [test/e2e/framework/predicates/metadata.go:26-148]()
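
Because the predicates operate on unstructured data, their logic reduces to lookups in nested maps. The Python sketch below mirrors the behaviour listed for `ReadyConditionIsTrue` and `ObservedGenerationMatchesGeneration` using plain dictionaries; the function names are hypothetical and the Go implementations may handle missing fields differently.

```python
def ready_condition_is_true(obj):
    """True when status.conditions contains type=Ready with status=True."""
    conditions = obj.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True" for c in conditions)


def observed_generation_matches(obj):
    """True when status.observedGeneration equals metadata.generation."""
    generation = obj.get("metadata", {}).get("generation")
    observed = obj.get("status", {}).get("observedGeneration")
    return generation is not None and generation == observed
```

Neither function needs to know the object's kind, which is the same property that lets the Go predicates evaluate any CRD without registering its type.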

### Logs and artifacts

Two log sinks are wired up per test:

- `logCapturingT` mirrors `Log/Logf/Error/Errorf/Fatal/Fatalf` to `<ARTIFACTS>/<TestName>/test.log` with `[   12.345s]` elapsed-time prefixes. `ARTIFACTS` defaults to `./artifacts`.
- On failure, `dumpControllerLogs` lists pods in `agent-sandbox-system` labelled `app=agent-sandbox-controller`, writes the full log of each to `<artifacts>/controller-<pod>.log`, and inlines the last 42 lines into the test output ("following k8s e2e convention").

For kind-specific diagnostics, `componentlogs.go` exposes `MustGetKubeletLogs` and `MustGetContainerdLogs`, both of which shell out via `ClusterClient.ExecuteOnNode` (`docker exec <node> journalctl -u kubelet ...`) and persist results under the same artifacts directory.

Sources: [test/e2e/framework/testlogs.go:26-102](), [test/e2e/framework/context.go:180-240](), [test/e2e/framework/componentlogs.go:25-52]()
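
The log layout described above is easy to reproduce when post-processing artifacts. This Python sketch assumes the elapsed time is right-aligned to nine characters with millisecond precision, inferred from the `[   12.345s]` example; the exact Go format string is not shown in the text, and `format_log_line` and `log_path_for` are hypothetical names.

```python
import os


def format_log_line(elapsed_s, message):
    """Prefix a message with elapsed seconds, right-aligned to 9 characters."""
    return f"[{elapsed_s:9.3f}s] {message}"


def log_path_for(test_name, env):
    """Build <ARTIFACTS>/<TestName>/test.log, defaulting ARTIFACTS to ./artifacts."""
    return os.path.join(env.get("ARTIFACTS", "./artifacts"), test_name, "test.log")
```

For example, `format_log_line(12.345, "sandbox ready")` yields `[   12.345s] sandbox ready`.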

## Scenario coverage

The scenario tests are deliberately small and composable. Each one creates a uniquely-named namespace with `time.Now().UnixNano()` to avoid collisions, then registers it with `CreateWithCleanup` so the deferred delete tears down every child resource.

### Basic reconciliation — `TestSimpleSandbox`

`basic_test.go` exercises the happy path: create a `Sandbox` from a tiny `simpleSandbox` fixture (a single `registry.k8s.io/pause:3.10` container with a test annotation and label), wait for the controller to populate the full `SandboxStatus` (Service name, FQDN, replicas, label selector built from `NameHash` of the sandbox name, and a `Ready=True` condition with `SandboxReasonDependenciesReady`), and then assert that the generated Pod and Service exist with the expected owner references and propagated metadata. `NameHash` is an FNV-1a hex digest used to compute the `agents.x-k8s.io/sandbox-name-hash=` selector that the controller stamps onto the Service.

Sources: [test/e2e/basic_test.go:31-132]()
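
`NameHash` is described above as an FNV-1a hex digest. The Python sketch below shows what such a digest looks like, assuming the 32-bit FNV-1a variant and 8-digit lowercase hex output. The text states neither the width nor the formatting, so both are assumptions, and `fnv1a_32_hex` and `name_hash_selector` are hypothetical names.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32_hex(text):
    """Return the 32-bit FNV-1a hash of the UTF-8 bytes of text as 8 hex digits."""
    h = FNV32_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                             # FNV-1a: XOR the byte first...
        h = (h * FNV32_PRIME) & 0xFFFFFFFF    # ...then multiply, wrapping at 32 bits
    return f"{h:08x}"


def name_hash_selector(sandbox_name):
    """Build the label selector string under the 32-bit assumption."""
    return f"agents.x-k8s.io/sandbox-name-hash={fnv1a_32_hex(sandbox_name)}"
```

Hashing the name keeps the label value short and within label-value character rules regardless of how long the Sandbox name is.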

### Replica scaling — `TestSandboxReplicas`

`replicas_test.go` first creates a Sandbox with `Spec.Replicas=1` and waits for the same `Ready=True` status. It then uses `MustUpdateObject` to flip replicas to 0; `MustUpdateObject` re-reads the latest version before mutating so the test does not race the controller. The expected status flips to `Ready=False` with reason `SandboxReasonSuspended`, plus a second `Suspended=True` condition with reason `SandboxReasonSuspendedPodTerminated`. The test finally asserts the Pod is gone (`WaitForObjectNotFound`) while the Service still exists (`NotDeleted` predicate) — verifying that scaling to zero is non-destructive at the Service level.

Sources: [test/e2e/replicas_test.go:30-107](), [test/e2e/framework/client.go:113-129]()

### Shutdown lifecycle — `shutdown_test.go`

Two scenarios cover the `Spec.ShutdownTime` and `Spec.Lifecycle.ShutdownPolicy` surfaces:

- `TestSandboxShutdownTime` creates a Sandbox, then patches `ShutdownTime` to ~10 seconds in the future (truncated to RFC3339 second precision to match Kubernetes storage). It uses `PollUntilObjectMatches` (rather than the watch-based `WaitForObject`) to observe the transitioned status — `Service`/`ServiceFQDN` cleared, `Replicas: 0`, and `Ready=False` with reason `SandboxReasonExpired`. It then asserts the wall-clock time has passed the shutdown moment and that both the Pod and Service are deleted via `WaitForObjectNotFound`.
- `TestSandboxRetainedExpiryPreservesFinishedCondition` builds a Sandbox whose container exits cleanly (`busybox sh -c exit 0`) with `ShutdownPolicyRetain`. It first uses `ConditionReasonEquals(SandboxConditionFinished, SandboxReasonPodSucceeded)` to confirm the Finished condition is set. `require.Eventually` then waits until the `Ready` condition carries reason `SandboxReasonExpired` while the `Finished` condition still carries reason `SandboxReasonPodSucceeded` and the service fields are cleared, which shows that the retain policy preserves the Finished condition through expiry.

Sources: [test/e2e/shutdown_test.go:31-103](), [test/e2e/shutdown_test.go:105-160]()
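
Truncating the shutdown time matters because a value with sub-second precision would not compare equal to what the API server stores. The Python sketch below shows the same truncation under the assumption that the stored form is an RFC 3339 UTC timestamp with whole seconds; `shutdown_time_after` is a hypothetical name and expects a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone


def shutdown_time_after(now, seconds=10):
    """Return now + seconds in UTC, truncated to whole seconds, as an RFC 3339 string."""
    target = now.astimezone(timezone.utc) + timedelta(seconds=seconds)
    target = target.replace(microsecond=0)   # drop the sub-second part
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")
```

Calling `shutdown_time_after(datetime.now(timezone.utc))` produces a value that round-trips unchanged through a seconds-precision timestamp field.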

### Parallel reconciliation — `parallelism_test.go`

`parallelism_test.go` stresses the controller's worker pool. The helper `patchControllerConcurrency` is the most interesting piece: it snapshots the controller Deployment, rewrites the container args to bump `--sandbox-concurrent-workers`, `--sandbox-claim-concurrent-workers`, `--sandbox-warm-pool-concurrent-workers` to 10, raises `--kube-api-qps=50` / `--kube-api-burst=100`, ensures `--extensions` is present, and waits for the new pod to roll out (using `ReadyReplicasConditionIsTrue` + `ObservedGenerationMatchesGeneration`). Cleanup restores the original spec under `retry.RetryOnConflict` and sleeps 5 s for leader election to settle.

Three scenarios then drive load through the patched controller:

| Test | Setup | What it stresses |
| --- | --- | --- |
| `TestParallelSandboxes` | 20 Sandboxes created concurrently via goroutines, each waiting on `ReadyConditionIsTrue`. | Pure Sandbox reconciliation under concurrency. |
| `TestParallelSandboxClaimsWithSufficientWarmPool` | `SandboxWarmPool` of 25 replicas, 20 parallel claims. | Claim binding when pre-warmed pods are available. |
| `TestParallelSandboxClaimsWithInsufficientWarmPool` | Pool of 5, 20 parallel claims. | Forces on-demand pod creation when the pool is exhausted. |

Errors flow through a buffered `errCh`; the test fails after `wg.Wait()` so each goroutine completes before reporting.

Sources: [test/e2e/parallelism_test.go:36-108](), [test/e2e/parallelism_test.go:110-145](), [test/e2e/parallelism_test.go:147-218]()
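
The argument rewriting performed by `patchControllerConcurrency` can be sketched as a pure function over the container's arg list. This Python version is a model under stated assumptions: it handles only the `--flag=value` form, `patch_args` is a hypothetical name, and the Go helper may treat other flag spellings differently.

```python
def patch_args(args, overrides, required_flags=()):
    """Return a new arg list with --flag=value overrides applied and bare flags ensured."""
    patched, seen = [], set()
    for arg in args:
        flag = arg.split("=", 1)[0]
        if flag in overrides:
            patched.append(f"{flag}={overrides[flag]}")   # replace the existing value
            seen.add(flag)
        else:
            patched.append(arg)
    # Append overrides that were not already present.
    patched += [f"{f}={v}" for f, v in overrides.items() if f not in seen]
    # Ensure bare flags such as --extensions are present.
    present = {a.split("=", 1)[0] for a in patched}
    patched += [f for f in required_flags if f not in present]
    return patched
```

Returning a new list rather than mutating the input matches the snapshot-and-restore approach described above: the original args stay available for the cleanup step.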

### Warm-pool rollout — `extensions/warmpool_rollout_test.go`

The extensions suite owns the `SandboxWarmPool` rollout semantics. Four scenarios share the helpers `createSandboxTemplate`, `createSandboxWarmPool`, `updateSandboxTemplateSpec` (appends `TEST_ENV=updated` to the pod), `verifySandboxStaysSame`, `verifySandboxRecreated`, and `verifySandboxHasUpdatedSpec` (which lists Sandboxes in the namespace, filters by `metav1.IsControlledBy(warmPool)`, and uses `require.Eventually` to wait for a fresh one).

```mermaid
stateDiagram-v2
    [*] --> Created : create SandboxTemplate + SandboxWarmPool
    Created --> Ready : WaitForWarmPoolReady\n(ReadyReplicasConditionIsTrue)
    Ready --> TemplateUpdated : Update SandboxTemplate

    state Strategy <<choice>>
    TemplateUpdated --> Strategy

    Strategy --> StaysSame : default / OnReplenish\n(verifySandboxStaysSame)
    StaysSame --> Replenished : Delete existing Sandbox\n(verifyOnReplenishLifecycle)
    Replenished --> Ready : new Sandbox has\nTEST_ENV=updated

    Strategy --> Recreated : Recreate\n(verifySandboxRecreated)
    Recreated --> Ready : new Sandbox observed,\nold marked for deletion

    Strategy --> NoRollout : metadata-only update\n(TestWarmPoolRolloutMetadataUpdate)
    NoRollout --> Ready : sandbox unchanged
```

Concrete scenarios:

- `TestWarmPoolRollout` is a table test over `default`, `onreplenish`, and `recreate` strategies, asserting the expected lifecycle (`verifyOnReplenishLifecycle` for the first two, `verifySandboxRecreated` for the third).
- `TestWarmPoolRolloutMultiTemplateIsolation` verifies that updating template A under the `Recreate` strategy rolls only warm-pool A's sandbox while warm-pool B's sandbox is untouched (no deletion timestamp).
- `TestWarmPoolRolloutSwitchTemplate` changes `Spec.TemplateRef.Name` from A to B (specs identical) and asserts that the new sandbox is annotated with the new template name via `sandboxv1beta1.SandboxTemplateRefAnnotation`.
- `TestWarmPoolRolloutMetadataUpdate` updates only labels in the template's pod metadata and asserts no rollout occurs (the existing sandbox keeps an empty deletion timestamp after a 5 s settling sleep).

Sources: [test/e2e/extensions/warmpool_rollout_test.go:33-136](), [test/e2e/extensions/warmpool_rollout_test.go:138-240](), [test/e2e/extensions/warmpool_rollout_test.go:242-442]()
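
The ownership filter used by `verifySandboxHasUpdatedSpec` follows the semantics of `metav1.IsControlledBy`: an object is controlled by an owner when its controller owner reference has the owner's UID. The Python sketch below applies that rule to plain dictionaries; `is_controlled_by` and `sandboxes_owned_by` are hypothetical names for illustration.

```python
def is_controlled_by(obj, owner):
    """True when obj's controller ownerReference has the same UID as owner."""
    for ref in obj.get("metadata", {}).get("ownerReferences", []):
        if ref.get("controller"):
            # Only the first controller reference is considered.
            return ref.get("uid") == owner["metadata"]["uid"]
    return False


def sandboxes_owned_by(sandboxes, warm_pool):
    """Keep only the Sandboxes controlled by the given warm pool."""
    return [s for s in sandboxes if is_controlled_by(s, warm_pool)]
```

Filtering by controller reference rather than by name or label is what lets the multi-template isolation test tell warm-pool A's sandboxes from warm-pool B's in a shared namespace.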

## How the pieces compose in a typical test

A representative wait layers three cooperating abstractions beneath the test code:

```text
test code            framework.ClusterClient            framework.WatchSet                 dynamic.Interface
─────────            ────────────────────────           ──────────────────                ─────────────────
WaitForObject(...)
   │                 fast-path MatchesPredicates
   │                       │   ↓ miss
   │                 gvkForObject / gvrForGVK
   │                       │
   │                 Subscribe(gvr, {ns,name}) ──────►  getOrCreateWatch(gvr,ns)
   │                                                         │ first time?
   │                                                         └─►  go watchLoop ──── Watch(ctx, ListOpts{Watch:true})
   │                 ◄─────── sub.Events (buffered 100)        broadcast(event) ◄─── ResultChan()
   │                       │ run predicates per event
   │                       └─► last YAML logged on miss
   │                       └─► return on first all-match or timeout
   │                 sub.Close() ─► removeWatchIfIdle (keeps watch if other subs alive)
```

The result is that the 20 parallel `WaitForObject` calls issued by the `TestParallelSandboxes` goroutines reuse a single Sandbox watch per namespace, which keeps the API server load low while still giving each test a precise, predicate-driven completion signal.

Sources: [test/e2e/framework/client.go:267-349](), [test/e2e/framework/watchset.go:80-211](), [test/e2e/parallelism_test.go:118-145]()

## Summary

The e2e framework is intentionally small: `TestContext` boots a kubeconfig-driven client pair per test, `ClusterClient` wraps controller-runtime CRUD with cleanup-aware helpers, `predicates.ObjectPredicate` provides composable assertions over arbitrary CRDs via unstructured conversion, and `WatchSet` keeps a single dynamic watch per `(GVR, namespace)` so dozens of concurrent waits remain cheap. The scenario files build on these primitives to cover the Sandbox happy path, replica scaling, time- and policy-driven shutdown, controller behavior under parallel Sandbox and SandboxClaim load, and the full matrix of `SandboxWarmPool` rollout strategies.

---

## 29. Load Testing & CI Pipelines

> cluster-loader-based load-test recipes plus the prowjob presubmit/periodic configuration that runs them in CI.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/29-load-testing-ci-pipelines.md
- Generated: 2026-05-25T23:21:59.475Z

### Source Files

- `dev/load-test/README.md`
- `dev/load-test/cluster-loader-sandbox.yaml`
- `dev/ci/presubmits`
- `dev/ci/periodics`
- `docs/prowjob_manual_run.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [dev/load-test/README.md](dev/load-test/README.md)
- [dev/load-test/agent-sandbox-load-test.yaml](dev/load-test/agent-sandbox-load-test.yaml)
- [dev/load-test/cluster-loader-sandbox.yaml](dev/load-test/cluster-loader-sandbox.yaml)
- [dev/load-test/test-recipes/README-rapid-burst.md](dev/load-test/test-recipes/README-rapid-burst.md)
- [dev/load-test/test-recipes/rapid-burst-test.yaml](dev/load-test/test-recipes/rapid-burst-test.yaml)
- [dev/load-test/test-recipes/run_rapid_burst.sh](dev/load-test/test-recipes/run_rapid_burst.sh)
- [dev/load-test/test-recipes/high-volume-test.yaml](dev/load-test/test-recipes/high-volume-test.yaml)
- [dev/load-test/test-recipes/medium-scale-concurrent-load-test.yaml](dev/load-test/test-recipes/medium-scale-concurrent-load-test.yaml)
- [dev/load-test/test-recipes/throughput-test.yaml](dev/load-test/test-recipes/throughput-test.yaml)
- [dev/load-test/test-recipes/warmpool-burst-test.yaml](dev/load-test/test-recipes/warmpool-burst-test.yaml)
- [dev/load-test/test-recipes/monitor/agent-sandbox-controller-monitor.yaml](dev/load-test/test-recipes/monitor/agent-sandbox-controller-monitor.yaml)
- [dev/load-test/test-recipes/templates/cluster-loader-sandbox-template.yaml](dev/load-test/test-recipes/templates/cluster-loader-sandbox-template.yaml)
- [dev/load-test/test-recipes/templates/cluster-loader-warmpool.yaml](dev/load-test/test-recipes/templates/cluster-loader-warmpool.yaml)
- [dev/load-test/test-recipes/templates/cluster-loader-sandbox-claim.yaml](dev/load-test/test-recipes/templates/cluster-loader-sandbox-claim.yaml)
- [dev/load-test/test-recipes/templates/cluster-loader-hpa.yaml](dev/load-test/test-recipes/templates/cluster-loader-hpa.yaml)
- [dev/load-test/test-recipes/templates/cluster-loader-capacity-buffer.yaml](dev/load-test/test-recipes/templates/cluster-loader-capacity-buffer.yaml)
- [dev/ci/periodics/test-load-test](dev/ci/periodics/test-load-test)
- [dev/ci/presubmits/test-e2e](dev/ci/presubmits/test-e2e)
- [dev/ci/presubmits/test-unit](dev/ci/presubmits/test-unit)
- [dev/ci/presubmits/lint-go](dev/ci/presubmits/lint-go)
- [dev/ci/presubmits/lint-api](dev/ci/presubmits/lint-api)
- [dev/ci/presubmits/test-autogen-up-to-date](dev/ci/presubmits/test-autogen-up-to-date)
- [dev/ci/presubmits/shared/utils.py](dev/ci/presubmits/shared/utils.py)
- [dev/ci/shared/runner.py](dev/ci/shared/runner.py)
- [docs/prowjob_manual_run.md](docs/prowjob_manual_run.md)
</details>

# Load Testing & CI Pipelines

This page documents how `kubernetes-sigs/agent-sandbox` exercises the controller under load and how those tests are wired into Prow as presubmit and periodic jobs. Two surfaces collaborate: the `dev/load-test/` recipes built on top of [ClusterLoader2](https://github.com/kubernetes/perf-tests/tree/master/clusterloader2) (CL2), and the `dev/ci/` Python entrypoints that Prow invokes. The recipes describe *what* to measure — `Sandbox`/`SandboxClaim` startup latency, throughput, churn, warm-pool burst behaviour — while the CI entrypoints describe *how* to bring up an isolated KinD cluster, push the controller image, run a recipe, and emit a JUnit artifact that Prow can surface in Spyglass.

The two layers are deliberately separable. Any recipe can be run by hand against any cluster (the rapid-burst recipe is targeted at GKE; the entrypoint recipe at KinD), and CI just composes a thin wrapper around the entrypoint recipe with smaller defaults so it fits inside a Prow pod.

## High-Level Architecture

```mermaid
flowchart LR
  subgraph Prow["kubernetes/test-infra (Prow)"]
    PJ["pj-on-kind.sh /<br/>periodic job spec"]
  end

  subgraph CIEntry["dev/ci/ (Python wrappers)"]
    Periodic["periodics/test-load-test<br/>(LoadTestRunner)"]
    Pre_e2e["presubmits/test-e2e"]
    Pre_unit["presubmits/test-unit"]
    Pre_lintgo["presubmits/lint-go"]
    Pre_lintapi["presubmits/lint-api"]
    Pre_autogen["presubmits/test-autogen-up-to-date"]
    Runner["shared/runner.py<br/>TestRunner"]
  end

  subgraph DevTools["dev/tools/"]
    Kind["create-kind-cluster"]
    Push["push-images"]
    Deploy["deploy-to-kube<br/>+ deploy-cloud-provider"]
    Fix["fix-* scripts"]
    Lint["lint-go / lint-api / test-unit / test-e2e"]
  end

  subgraph LoadTest["dev/load-test/"]
    EntryRecipe["agent-sandbox-load-test.yaml<br/>(entrypoint recipe)"]
    SandboxTpl["cluster-loader-sandbox.yaml<br/>(Sandbox object template)"]
    Recipes["test-recipes/*.yaml<br/>(rapid-burst, throughput,<br/>high-volume, medium-scale, warmpool-burst)"]
    Templates["test-recipes/templates/*.yaml<br/>(SandboxTemplate, WarmPool, Claim,<br/>HPA, CapacityBuffer)"]
    Monitor["test-recipes/monitor/<br/>ServiceMonitor"]
  end

  PJ --> Periodic
  PJ --> Pre_e2e
  PJ --> Pre_unit
  PJ --> Pre_lintgo
  PJ --> Pre_lintapi
  PJ --> Pre_autogen

  Periodic --> Runner
  Pre_e2e --> Runner
  Periodic --> EntryRecipe
  EntryRecipe --> SandboxTpl
  Recipes --> Templates
  Recipes --> Monitor

  Runner --> Kind
  Runner --> Push
  Runner --> Deploy
  Pre_unit --> Lint
  Pre_lintgo --> Lint
  Pre_lintapi --> Lint
  Pre_autogen --> Fix
```

Sources: [dev/ci/shared/runner.py:23-80](), [dev/ci/periodics/test-load-test:48-110](), [dev/load-test/agent-sandbox-load-test.yaml:1-56](), [dev/load-test/test-recipes/rapid-burst-test.yaml:1-46]()

## The CI Entrypoint Layer (`dev/ci/`)

Every Prow job in this repo is a small Python script that Prow shells out to. There are two flavours:

1. **Lint/unit wrappers** that just delegate to a script in `dev/tools/`.
2. **Cluster-backed runners** that subclass `TestRunner` and provision a KinD cluster before running a binary.

### `TestRunner` base class

`dev/ci/shared/runner.py` is the spine shared by every cluster-backed CI job. It exposes a four-method lifecycle — `setup_cluster`, `run_tests`, `copy_artifacts`, `main` — and threads the repository root and an `--image-prefix` argument through every invocation:

```python
# dev/ci/shared/runner.py
class TestRunner:
    def __init__(self, name, description):
        ...
        self.parser.add_argument(
            "--image-prefix",
            ...
            default="kind.local/",
        )

    def setup_cluster(self, args, extra_push_images_args=None):
        image_tag = tools_utils.get_image_tag()
        subprocess.run([f"{self.repo_root}/dev/tools/create-kind-cluster",
                        self.name, "--recreate",
                        "--kubeconfig", f"{self.repo_root}/bin/KUBECONFIG"])
        ...
        subprocess.run([f"{self.repo_root}/dev/tools/push-images", ...])
        subprocess.run([f"{self.repo_root}/dev/tools/deploy-to-kube",
                        "--image-prefix", args.image_prefix,
                        "--image-tag", image_tag, "--extensions"])
        subprocess.run([f"{self.repo_root}/dev/tools/deploy-cloud-provider"])
```

`setup_cluster` follows the same sequence for every job: recreate a KinD cluster named after the job, build and load images into the cluster, deploy the controller plus extension CRDs, and stand up the cloud-provider mock. `run_tests` raises `NotImplementedError` in the base class, so subclasses must override it. `copy_artifacts` is a no-op by default; subclasses override it to publish JUnit files into `$ARTIFACTS`, the output directory that Prow sets.

Sources: [dev/ci/shared/runner.py:23-80]()
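
The lifecycle is a template-method pattern: the base class fixes the order of the steps and subclasses fill in `run_tests`. The sketch below uses a hypothetical stand-in, `SketchRunner`, that records calls instead of touching a cluster. The call order and return-code handling inside `main` are assumptions, since the excerpt above does not show that method.

```python
class SketchRunner:
    """Hypothetical stand-in for TestRunner; records calls instead of provisioning KinD."""

    def __init__(self):
        self.calls = []

    def setup_cluster(self, args):
        self.calls.append("setup_cluster")

    def run_tests(self, args):
        raise NotImplementedError("subclasses must override run_tests")

    def copy_artifacts(self, args):
        self.calls.append("copy_artifacts")   # a no-op by default in the real class

    def main(self, args=None):
        # Assumed order: provision, run, then publish artifacts.
        self.setup_cluster(args)
        exit_code = self.run_tests(args)
        self.copy_artifacts(args)
        return exit_code


class EchoRunner(SketchRunner):
    """Minimal subclass: only run_tests needs to be supplied."""

    def run_tests(self, args):
        self.calls.append("run_tests")
        return 0
```

A new cluster-backed job therefore only needs a `run_tests` override, plus `copy_artifacts` if it produces JUnit output.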

### Presubmits

| Entrypoint | Subclass / pattern | Underlying script | JUnit output |
|---|---|---|---|
| `dev/ci/presubmits/test-e2e` | `E2ETestRunner(TestRunner)` | `dev/tools/test-e2e --suite {all,tests,benchmarks}` | `bin/e2e-go-junit.xml` → `$ARTIFACTS/junit_go.xml`, `bin/e2e-python-sdk-junit.xml` → `$ARTIFACTS/junit_python_sdk.xml` |
| `dev/ci/presubmits/test-unit` | thin wrapper via `shared/utils.py` | `dev/tools/test-unit` | `bin/unit-junit.xml` → `$ARTIFACTS/junit.xml` |
| `dev/ci/presubmits/lint-go` | thin wrapper | `dev/tools/lint-go` | none |
| `dev/ci/presubmits/lint-api` | thin wrapper | `dev/tools/lint-api` | none |
| `dev/ci/presubmits/test-autogen-up-to-date` | inline | runs every `dev/tools/fix-*` and fails if `git diff --name-only` is non-empty | none |

The lint/unit wrappers reuse `dev/ci/presubmits/shared/utils.py`, which applies `os.path.dirname` five times to the resolved `__file__` path (from `dev/ci/presubmits/shared/utils.py` up to the repo root) and runs the named tool from `dev/tools/`:

```python
# dev/ci/presubmits/shared/utils.py
def get_repo_root():
    presubmit_dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
    return os.path.dirname(os.path.dirname(os.path.dirname(presubmit_dir)))

def run_dev_tool(tool_name):
    repo_root = get_repo_root()
    result = subprocess.run([f"{repo_root}/dev/tools/{tool_name}"])
    return result.returncode
```

`test-autogen-up-to-date` is the odd one out — it does not invoke a single tool but iterates over `sorted(glob.glob("dev/tools/fix-*"))`, runs each one, and exits non-zero if any of them produces a diff. It prints the full `git diff` and a remediation hint to re-run the script locally.

Sources: [dev/ci/presubmits/test-e2e:28-47](), [dev/ci/presubmits/test-unit:23-35](), [dev/ci/presubmits/lint-go:21-23](), [dev/ci/presubmits/lint-api:21-23](), [dev/ci/presubmits/test-autogen-up-to-date:22-48](), [dev/ci/presubmits/shared/utils.py:20-29]()
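
Tracing the `dirname` chain from the `get_repo_root` excerpt on a concrete path shows where it lands. The sketch below repeats the same chain on an explicit path argument instead of `__file__`, and skips `os.path.realpath` so it does not depend on the filesystem; `repo_root_from` and the `/src/agent-sandbox` checkout location are hypothetical.

```python
import os


def repo_root_from(utils_path):
    """Apply the same dirname chain as get_repo_root() to an explicit path."""
    # Two levels up from utils.py: .../dev/ci/presubmits
    presubmit_dir = os.path.dirname(os.path.dirname(utils_path))
    # Three more levels: .../dev/ci, then .../dev, then the repo root
    return os.path.dirname(os.path.dirname(os.path.dirname(presubmit_dir)))


root = repo_root_from("/src/agent-sandbox/dev/ci/presubmits/shared/utils.py")
```

Here `root` is `/src/agent-sandbox`, so `run_dev_tool("lint-go")` would resolve to `/src/agent-sandbox/dev/tools/lint-go`.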

### The periodic load-test entrypoint

`dev/ci/periodics/test-load-test` is the only entry under `dev/ci/periodics/`. It subclasses `TestRunner` as `LoadTestRunner`, takes four CL2 parameters as CLI arguments, and reduces the defaults so the test fits inside a Prow pod backed by KinD:

| Flag | Default | Forwarded as |
|---|---|---|
| `--replicas` | 5 | `CL2_REPLICAS` |
| `--namespaces` | 1 | `CL2_NAMESPACES` |
| `--qps` | 10 | `CL2_QPS` |
| `--namespace-prefix` | `agent-sandbox` | `CL2_NAMESPACE_PREFIX` |

`setup_cluster` is overridden to pass `extra_push_images_args=["--controller-only"]`, so the load-test job only loads the controller image into KinD — none of the auxiliary images needed for full e2e are pushed:

```python
# dev/ci/periodics/test-load-test
def setup_cluster(self, args):
    return super().setup_cluster(args, extra_push_images_args=["--controller-only"])
```

`run_tests` (a) installs ClusterLoader2 on the fly by `git clone --depth 1` of `kubernetes/perf-tests` and `go build -o bin/clusterloader2 ./cmd/clusterloader.go`, (b) writes the four CLI flags into a `--testoverrides` file, then (c) invokes `clusterloader2` from `dev/load-test/` (because the entrypoint recipe references `cluster-loader-sandbox.yaml` via a relative path):

```python
cmd = [cl2_path,
       f"--testconfig={test_config}",
       f"--kubeconfig={kubeconfig}",
       f"--testoverrides={overrides_path}",
       "--provider=kind",
       "--v=2",
       f"--report-dir={report_dir}"]
subprocess.run(cmd, cwd=os.path.join(self.repo_root, "dev/load-test"))
```

`copy_artifacts` then copies `bin/junit.xml` to `$ARTIFACTS/junit_load_test.xml` so the Prow UI can render it. The clusterloader2 binary and the temporary kubeconfig are removed at the end of each run so subsequent invocations rebuild fresh.

Sources: [dev/ci/periodics/test-load-test:29-116]()
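
ClusterLoader2 override files are YAML maps of parameter names to values. The sketch below renders the four parameters from the table above in that shape. The exact layout the CI script writes is not shown in this page, so the output format is an assumption and `render_overrides` is a hypothetical name.

```python
def render_overrides(replicas=5, namespaces=1, qps=10, namespace_prefix="agent-sandbox"):
    """Render the four CL2 parameters as `KEY: value` YAML lines."""
    values = {
        "CL2_REPLICAS": replicas,
        "CL2_NAMESPACES": namespaces,
        "CL2_QPS": qps,
        "CL2_NAMESPACE_PREFIX": namespace_prefix,
    }
    return "".join(f"{key}: {value}\n" for key, value in values.items())
```

The defaults mirror the reduced values that let the periodic job fit inside a Prow pod; a manual run against a larger cluster would pass bigger numbers.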

## The Load-Test Recipe Layer (`dev/load-test/`)

There are two cohorts of CL2 recipes here. The top-level `agent-sandbox-load-test.yaml` is the one driven by CI; the recipes under `test-recipes/` are tuned for larger GKE clusters and are run by humans.

### The CI-driven entrypoint recipe

`dev/load-test/agent-sandbox-load-test.yaml` is intentionally minimal. It accepts four parameters (`CL2_REPLICAS`, `CL2_NAMESPACES`, `CL2_QPS`, `CL2_NAMESPACE_PREFIX`) and executes five CL2 steps:

1. `Start Startup Latency Measurement`: starts a `PodStartupLatency` measurement under the identifier `SandboxStartupLatency`.
2. `Create Sandboxes`: instantiates `cluster-loader-sandbox.yaml` (a `Sandbox` CR with an alpine container that sleeps 3600s) under the `BurstCreate` tuning set.
3. `Wait for Sandboxes to be Ready`: uses `WaitForRunningPods` against the label selector `group=agent-sandbox-load-test`.
4. `Gather Results`: gathers the `SandboxStartupLatency` measurement started in step 1.
5. `Delete Sandboxes`: drops the replica count to 0.

A TODO at the top of the file notes that the recipe currently measures only the downstream Pod latency, not the end-to-end `Sandbox` lifecycle, and that it should switch to `GenericPrometheusQuery` once `agent_sandbox_creation_latency_ms` metrics are exposed by the controller:

```yaml
# dev/load-test/agent-sandbox-load-test.yaml
# TODO: Replace PodStartupLatency with GenericPrometheusQuery
# once agent_sandbox_creation_latency_ms metrics are available to measure
# end-to-end controller overhead.
- Identifier: SandboxStartupLatency
  Method: PodStartupLatency # Warning: this only measures the downstream Pod latency, not the end-to-end Sandbox lifecycle
```

Sources: [dev/load-test/agent-sandbox-load-test.yaml:1-56](), [dev/load-test/cluster-loader-sandbox.yaml:1-17](), [dev/load-test/README.md:33-86]()

### Test recipes under `dev/load-test/test-recipes/`

The five `*.yaml` recipes under `test-recipes/` cover different scenarios. They are designed to be driven by hand (or by a shell wrapper such as `run_rapid_burst.sh`) against a real GKE cluster running the agent-sandbox controller.

| Recipe | Tuning set(s) | Defaults | What it stresses |
|---|---|---|---|
| `rapid-burst-test.yaml` | `BurstCreate` (QPS 100) | `BURST_SIZE=50`, `TOTAL_BURSTS=100`, `WARMPOOL_SIZE=200` | Repeated bursts of `SandboxClaim` creation against a populated `SandboxWarmPool`, with optional HPA + GKE `CapacityBuffer` scaling |
| `warmpool-burst-test.yaml` | `RampUp` / `RampDown` (100 / 50 QPS) | `REPLICAS=100`, `NAMESPACES=1` | Warm-pool churn under ramp-up/down |
| `high-volume-test.yaml` | `RampUp` / `RampDown` (10 / 50 QPS) | `REPLICAS=100` | Capacity ceiling — "keep increasing this to find the max" |
| `throughput-test.yaml` | `ConstantRate` / `QuickFinalDeletion` (200 / 100 QPS) | `REPLICAS=12000` | Sustained creation throughput |
| `medium-scale-concurrent-load-test.yaml` | `ConstantChurn` / `QuickFinalDeletion` (2 / 100 QPS) | `REPLICAS=1200` | Long-running steady-state churn |

The recipes share a common set of object templates under `test-recipes/templates/`:

| Template | Kind | API |
|---|---|---|
| `cluster-loader-sandbox-template.yaml` | `SandboxTemplate` | `extensions.agents.x-k8s.io/v1beta1` |
| `cluster-loader-warmpool.yaml` | `SandboxWarmPool` | `extensions.agents.x-k8s.io/v1beta1` |
| `cluster-loader-sandbox-claim.yaml` | `SandboxClaim` | `extensions.agents.x-k8s.io/v1beta1` |
| `cluster-loader-hpa.yaml` | `HorizontalPodAutoscaler` (External metric, targets `SandboxWarmPool`) | `autoscaling/v2` |
| `cluster-loader-capacity-buffer.yaml` | `CapacityBuffer` | `autoscaling.x-k8s.io/v1beta1` |

A single `ServiceMonitor` under `test-recipes/monitor/` scrapes the controller in `agent-sandbox-system` every 10s on the `metrics` port; it is loaded into CL2 via `--prometheus-additional-monitors-path`.

Sources: [dev/load-test/test-recipes/rapid-burst-test.yaml:1-46](), [dev/load-test/test-recipes/high-volume-test.yaml:1-15](), [dev/load-test/test-recipes/medium-scale-concurrent-load-test.yaml:1-15](), [dev/load-test/test-recipes/throughput-test.yaml:1-17](), [dev/load-test/test-recipes/warmpool-burst-test.yaml:1-16](), [dev/load-test/test-recipes/templates/cluster-loader-sandbox-template.yaml:1-20](), [dev/load-test/test-recipes/templates/cluster-loader-warmpool.yaml:1-10](), [dev/load-test/test-recipes/templates/cluster-loader-sandbox-claim.yaml:1-8](), [dev/load-test/test-recipes/templates/cluster-loader-hpa.yaml:1-24](), [dev/load-test/test-recipes/templates/cluster-loader-capacity-buffer.yaml:1-12](), [dev/load-test/test-recipes/monitor/agent-sandbox-controller-monitor.yaml:1-19]()

### Anatomy of the rapid-burst recipe

`rapid-burst-test.yaml` is the most complete recipe and the one most worth understanding, because it exercises every template and uses both `PodStartupLatency` and `GenericPrometheusQuery` measurements.

```mermaid
stateDiagram-v2
    [*] --> StartMeasurements
    StartMeasurements --> SetupTemplate: Setup Sandbox Template
    SetupTemplate --> SetupWarmPool: Setup Sandbox Warm Pool
    SetupWarmPool --> WaitWarmReady: Wait for Warm Pool Sandboxes\n(WaitForGenericK8sObjects,\nlabel: agents.x-k8s.io/warm-pool-sandbox)
    WaitWarmReady --> SetupHPA: if CL2_ENABLE_HPA=true
    SetupHPA --> SetupCapacityBuffer
    WaitWarmReady --> SetupCapacityBuffer: if CL2_ENABLE_HPA=false
    SetupCapacityBuffer --> PauseForNodes: if CL2_ENABLE_CAPACITY_BUFFER=true\nSleep CL2_CAPACITY_BUFFER_PAUSE
    SetupCapacityBuffer --> BurstLoop: if CL2_ENABLE_CAPACITY_BUFFER=false
    PauseForNodes --> BurstLoop
    BurstLoop --> BurstLoop: range 1..CL2_TOTAL_BURSTS\nCreate, WaitFor (Ready=True), Sleep 20s
    BurstLoop --> PauseScrape: Pause 1m for Prometheus scrape
    PauseScrape --> Gather: Gather Results\n(PodStartupLatency +\nGenericPrometheusQuery x2)
    Gather --> TeardownClaims
    TeardownClaims --> TeardownHPA: if enabled
    TeardownClaims --> TeardownBuffer: if enabled
    TeardownHPA --> TeardownBuffer
    TeardownBuffer --> TeardownWarmPool
    TeardownClaims --> TeardownWarmPool: otherwise
    TeardownWarmPool --> TeardownTemplate
    TeardownTemplate --> [*]
```
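
The branching in the diagram can be flattened into an ordered step list for a given flag combination. The sketch below is a reading aid, not code from the repository: the step names are the diagram's node names, the function name is hypothetical, and the capacity-buffer setup node is treated as skipped when the flag is off.

```python
def rapid_burst_steps(enable_hpa: bool, enable_capacity_buffer: bool,
                      total_bursts: int) -> list[str]:
    """Return the diagram's steps in execution order for one flag combination."""
    steps = ["StartMeasurements", "SetupTemplate", "SetupWarmPool", "WaitWarmReady"]
    if enable_hpa:
        steps.append("SetupHPA")
    if enable_capacity_buffer:
        # PauseForNodes is the sleep of CL2_CAPACITY_BUFFER_PAUSE.
        steps += ["SetupCapacityBuffer", "PauseForNodes"]
    # One iteration per burst: create claims, wait for Ready=True, sleep.
    steps += [f"Burst{i}" for i in range(1, total_bursts + 1)]
    steps += ["PauseScrape", "Gather", "TeardownClaims"]
    if enable_hpa:
        steps.append("TeardownHPA")
    if enable_capacity_buffer:
        steps.append("TeardownBuffer")
    steps += ["TeardownWarmPool", "TeardownTemplate"]
    return steps


print(rapid_burst_steps(enable_hpa=True, enable_capacity_buffer=False, total_bursts=2))
```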

Two things are notable.

First, the recipe couples directly to the controller's CRDs through `WaitForGenericK8sObjects`. It waits on `agents.x-k8s.io/v1beta1` `sandboxes` (label `agents.x-k8s.io/warm-pool-sandbox`) for warm-pool readiness, and on `extensions.agents.x-k8s.io/v1beta1` `sandboxclaims` for burst readiness — both with `successfulConditions: ["Ready=True"]` and `maxFailedObjectCount: 0`:

```yaml
# dev/load-test/test-recipes/rapid-burst-test.yaml
- Identifier: WaitForBurst{{$burstIteration}}SandboxClaims
  Method: WaitForGenericK8sObjects
  Params:
    objectGroup: extensions.agents.x-k8s.io
    objectVersion: v1beta1
    objectResource: sandboxclaims
    successfulConditions: ["Ready=True"]
    failedConditions: []
    minDesiredObjectCount: {{MultiplyInt $replicaCount $namespaces}}
    maxFailedObjectCount: 0
    timeout: 60m
    refreshInterval: 20ms
```
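
To make the parameters concrete, here is a simplified model of how such a wait could be evaluated against a list of objects. It is a sketch of the semantics implied by the parameter names, not ClusterLoader2's implementation; the function names and the sample objects are hypothetical.

```python
def condition_strings(obj: dict) -> set[str]:
    """Render status.conditions as 'Type=Status' strings, e.g. 'Ready=True'."""
    conditions = obj.get("status", {}).get("conditions", [])
    return {f"{c['type']}={c['status']}" for c in conditions}


def wait_satisfied(objects: list[dict], successful: list[str], failed: list[str],
                   min_desired: int, max_failed: int) -> bool:
    """True once enough objects match a successful condition.

    Raises if more than `max_failed` objects match a failed condition.
    """
    ok = sum(1 for o in objects if condition_strings(o) & set(successful))
    bad = sum(1 for o in objects if condition_strings(o) & set(failed))
    if bad > max_failed:
        raise RuntimeError(f"{bad} objects failed, limit is {max_failed}")
    return ok >= min_desired


claims = [
    {"status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    {"status": {}},  # no conditions reported yet
]
print(wait_satisfied(claims, ["Ready=True"], [], min_desired=3, max_failed=0))
```

With an empty `failedConditions` list, as in the recipe, no object can ever count as failed, so the wait only ends by reaching `minDesiredObjectCount` or by hitting the timeout.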

Second, the recipe queries two Prometheus histograms exposed by the controller, `agent_sandbox_claim_startup_latency_ms_bucket` and `agent_sandbox_claim_controller_startup_latency_ms_bucket`, and flags `p50`, `p90`, and `p99` violations against thresholds (1s, 1s, 5s) declared inline. This is the pattern the entrypoint recipe's TODO is moving toward.
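
The threshold check can be illustrated offline. In the recipe the quantiles come from PromQL, so the sketch below is not how CL2 evaluates them; it mimics the linear interpolation that `histogram_quantile` performs over cumulative buckets. The bucket counts are made up, and the thresholds are the 1s, 1s, 5s values converted to milliseconds to match the `_ms` metric names.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from sorted (upper_bound, cumulative_count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # fall back to the highest finite bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


THRESHOLDS_MS = {0.5: 1000.0, 0.9: 1000.0, 0.99: 5000.0}
# Hypothetical cumulative bucket counts, upper bounds in milliseconds.
sample = [(100.0, 10), (500.0, 60), (1000.0, 90), (5000.0, 100), (math.inf, 100)]

for q, limit in THRESHOLDS_MS.items():
    value = histogram_quantile(q, sample)
    verdict = "violation" if value > limit else "ok"
    print(f"p{round(q * 100)} ~ {value:.0f} ms (limit {limit:.0f} ms): {verdict}")
```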

The recipe is driven by `run_rapid_burst.sh`. The shell wrapper hard-codes the GKE flow (`--provider=gke`, `$HOME/perf-tests`, `$HOME/agent-sandbox`), enables the CL2 Prometheus server, plugs in the `ServiceMonitor`, and writes a JSON test-overrides file under `dev/load-test/test-recipes/tmp/<RUN_ID>/`:

```bash
# dev/load-test/test-recipes/run_rapid_burst.sh
cd "$CL2_DIR"
go run cmd/clusterloader.go \
  --enable-prometheus-server=true \
  --kubeconfig=$HOME/.kube/config \
  --prometheus-additional-monitors-path="${TEST_DIR}/monitor" \
  --provider=gke \
  --report-dir="${LOGS_DIR}" \
  --testconfig="${TEST_CONFIG}" \
  --testoverrides="${LOGS_DIR}/testoverrides.json" \
  --v=2
```
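
The overrides file passed via `--testoverrides` is plain JSON. A minimal Python sketch of producing one follows. It assumes each wrapper setting `NAME` maps to a `CL2_NAME` key: `CL2_TOTAL_BURSTS` and `CL2_ENABLE_HPA` appear in the recipe, while the other key names, the value types, and the function name are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_test_overrides(logs_dir: Path, settings: dict) -> Path:
    """Write settings as CL2_-prefixed keys to <logs_dir>/testoverrides.json."""
    overrides = {f"CL2_{key}": value for key, value in settings.items()}
    logs_dir.mkdir(parents=True, exist_ok=True)
    path = logs_dir / "testoverrides.json"
    path.write_text(json.dumps(overrides, indent=2, sort_keys=True))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # "run-001" stands in for the <RUN_ID> directory the wrapper creates.
        out = write_test_overrides(
            Path(tmp) / "run-001",
            {"BURST_SIZE": 50, "TOTAL_BURSTS": 100, "WARMPOOL_SIZE": 200,
             "ENABLE_HPA": False},
        )
        print(out.read_text())
```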

The wrapper also performs pre-flight validation — it pings the cluster, refuses to enable `CapacityBuffer` if `autoscaling.x-k8s.io` is missing from `kubectl api-resources`, and rejects `ENABLE_HPA=true` if `WARMPOOL_SIZE` is outside `[HPA_MIN_REPLICAS, HPA_MAX_REPLICAS]`. The README documents that `ENABLE_CAPACITY_BUFFER=true` triggers a 5-minute sleep after the buffer is created so GKE node auto-provisioning has time to spin up standby nodes before the burst loop starts.
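
The two configuration checks can be expressed as a small pure function. This is a Python restatement of the rules described above, not a port of the shell script; the function name is hypothetical, and `api_groups` stands for the set of API group names parsed from `kubectl api-resources`.

```python
def preflight_errors(enable_hpa: bool, warmpool_size: int, hpa_min: int, hpa_max: int,
                     enable_capacity_buffer: bool, api_groups: set[str]) -> list[str]:
    """Return human-readable reasons why a run should be refused (empty if none)."""
    errors = []
    if enable_hpa and not hpa_min <= warmpool_size <= hpa_max:
        errors.append(
            f"WARMPOOL_SIZE={warmpool_size} is outside "
            f"[HPA_MIN_REPLICAS={hpa_min}, HPA_MAX_REPLICAS={hpa_max}]"
        )
    if enable_capacity_buffer and "autoscaling.x-k8s.io" not in api_groups:
        errors.append("CapacityBuffer requested but autoscaling.x-k8s.io is not served")
    return errors


problems = preflight_errors(
    enable_hpa=True, warmpool_size=200, hpa_min=10, hpa_max=100,
    enable_capacity_buffer=True, api_groups={"apps", "autoscaling"},
)
for problem in problems:
    print("refusing to run:", problem)
```

Collecting every problem before exiting, rather than stopping at the first, is a design choice of this sketch; the wrapper may fail fast instead.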

Sources: [dev/load-test/test-recipes/rapid-burst-test.yaml:44-238](), [dev/load-test/test-recipes/run_rapid_burst.sh:20-113](), [dev/load-test/test-recipes/README-rapid-burst.md:66-126]()

## Test-time Object Graph

The five recipe templates and the controller's CRDs form a tight graph at test time:

```mermaid
classDiagram
    class SandboxTemplate {
        +metadata.name
        +spec.podTemplate
    }
    class SandboxWarmPool {
        +spec.replicas
        +spec.sandboxTemplateRef
    }
    class SandboxClaim {
        +spec.sandboxTemplateRef
        +status.conditions[Ready]
    }
    class Sandbox {
        +metadata.labels[agents.x-k8s.io/warm-pool-sandbox]
        +status.conditions[Ready]
    }
    class HorizontalPodAutoscaler {
        +spec.scaleTargetRef = SandboxWarmPool
        +spec.metrics[External]
    }
    class CapacityBuffer {
        +spec.percentage
        +spec.scalableRef = SandboxWarmPool
        +spec.provisioningStrategy
    }

    SandboxWarmPool --> SandboxTemplate : sandboxTemplateRef
    SandboxClaim --> SandboxTemplate : sandboxTemplateRef
    SandboxWarmPool ..> Sandbox : pre-creates pool members
    SandboxClaim ..> Sandbox : binds a warm Sandbox
    HorizontalPodAutoscaler --> SandboxWarmPool : scaleTargetRef
    CapacityBuffer --> SandboxWarmPool : scalableRef
```

The HPA template uses an `External` metric with selectors on `metric.labels.warmpool_name` and `metric.labels.exported_namespace`, allowing the same HPA shape to be reused across namespaces created by CL2.

Sources: [dev/load-test/test-recipes/templates/cluster-loader-sandbox-template.yaml:1-20](), [dev/load-test/test-recipes/templates/cluster-loader-warmpool.yaml:1-10](), [dev/load-test/test-recipes/templates/cluster-loader-sandbox-claim.yaml:1-8](), [dev/load-test/test-recipes/templates/cluster-loader-hpa.yaml:1-24](), [dev/load-test/test-recipes/templates/cluster-loader-capacity-buffer.yaml:1-12](), [dev/load-test/test-recipes/rapid-burst-test.yaml:117-132]()

## End-to-End CI Flow for the Periodic

Putting the two layers together, the periodic load-test job follows this sequence inside a Prow pod:

```mermaid
sequenceDiagram
    participant Prow
    participant Periodic as periodics/test-load-test
    participant Runner as TestRunner (shared/runner.py)
    participant DevTools as dev/tools/*
    participant KinD
    participant CL2 as clusterloader2 (cloned & built)
    participant Recipe as agent-sandbox-load-test.yaml
    participant Out as $ARTIFACTS

    Prow->>Periodic: exec entrypoint
    Periodic->>Runner: LoadTestRunner.main()
    Runner->>DevTools: create-kind-cluster --recreate
    DevTools->>KinD: bootstrap cluster
    Runner->>DevTools: push-images --controller-only
    Runner->>DevTools: deploy-to-kube --extensions
    Runner->>DevTools: deploy-cloud-provider
    Periodic->>CL2: git clone perf-tests + go build
    Periodic->>CL2: clusterloader2 --testconfig=agent-sandbox-load-test.yaml --testoverrides=...
    CL2->>Recipe: execute steps 1..5
    Recipe->>KinD: create Sandboxes, WaitForRunningPods
    CL2-->>Periodic: bin/junit.xml
    Periodic->>Out: copy junit.xml -> junit_load_test.xml
```

`WaitForGenericK8sObjects` is what couples the rapid-burst recipe to the controller's CRDs, while the entrypoint recipe waits on raw `Pod` readiness via `WaitForRunningPods`. The shape of the JUnit output (one testcase per CL2 step) is illustrated in the load-test README:

```xml
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="57.957">
  <testcase name="agent-sandbox-load-test overall (...)" .../>
  <testcase name="agent-sandbox-load-test: [step: 01] Start Startup Latency Measurement ..."/>
  <testcase name="agent-sandbox-load-test: [step: 02] Create Sandboxes" .../>
  <testcase name="agent-sandbox-load-test: [step: 03] Wait for Sandboxes to be Ready ..."/>
  <testcase name="agent-sandbox-load-test: [step: 04] Gather Results ..."/>
</testsuite>
```
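
Because each CL2 step becomes one `testcase`, the report is easy to post-process. The sketch below uses only the Python standard library; the embedded XML is an abbreviated, illustrative sample (including an invented failure), not output copied from a real run.

```python
import xml.etree.ElementTree as ET

JUNIT = """\
<testsuite name="ClusterLoaderV2" tests="0" failures="0" errors="0" time="57.957">
  <testcase name="agent-sandbox-load-test overall"/>
  <testcase name="agent-sandbox-load-test: [step: 01] Start Startup Latency Measurement"/>
  <testcase name="agent-sandbox-load-test: [step: 02] Create Sandboxes"/>
  <testcase name="agent-sandbox-load-test: [step: 03] Wait for Sandboxes to be Ready">
    <failure>timed out (illustrative)</failure>
  </testcase>
</testsuite>
"""


def summarize(xml_text: str) -> list[tuple[str, str]]:
    """Return (testcase name, PASS or FAIL) for every testcase in the report."""
    root = ET.fromstring(xml_text)
    results = []
    for case in root.iter("testcase"):
        failed = case.find("failure") is not None or case.find("error") is not None
        results.append((case.get("name"), "FAIL" if failed else "PASS"))
    return results


for name, verdict in summarize(JUNIT):
    print(f"{verdict}  {name}")
```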

Sources: [dev/load-test/README.md:73-86](), [dev/load-test/test-recipes/rapid-burst-test.yaml:189-204](), [dev/load-test/agent-sandbox-load-test.yaml:34-48](), [dev/ci/periodics/test-load-test:80-116]()

## Running a Prow Job Locally

`docs/prowjob_manual_run.md` documents how an engineer reproduces the periodic outside of CI. The job runs as KinD-in-Docker (nested virtualisation), so the doc raises host inotify limits before invocation:

```bash
sudo sysctl -w fs.inotify.max_user_watches=524288
sudo sysctl -w fs.inotify.max_user_instances=512
```
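
Before changing anything, it can be useful to check whether the host already meets those limits. The following Python sketch reads the current values from `/proc/sys/fs/inotify` on Linux and compares them with the numbers above; the helper names are hypothetical.

```python
from pathlib import Path

# Minimums taken from the sysctl commands above.
REQUIRED = {"max_user_watches": 524288, "max_user_instances": 512}


def read_inotify_limits(base: Path = Path("/proc/sys/fs/inotify")) -> dict:
    """Read the current limits; keys are omitted where the file does not exist."""
    return {
        name: int((base / name).read_text())
        for name in REQUIRED
        if (base / name).exists()
    }


def limits_too_low(current: dict) -> dict:
    """Map each insufficient limit to (current value, required value)."""
    return {
        name: (current.get(name, 0), needed)
        for name, needed in REQUIRED.items()
        if current.get(name, 0) < needed
    }


if __name__ == "__main__":
    for name, (have, need) in limits_too_low(read_inotify_limits()).items():
        print(f"fs.inotify.{name} is {have}, need at least {need}")
```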

From a checkout of `kubernetes/test-infra`:

```bash
./config/pj-on-kind.sh periodic-agent-sandbox-perf-load-test
kubectl get pods
kubectl logs -f <POD_NAME> -c test
# After completion, artifacts live under the <OUT_DIR> the script prints at start
cd <OUT_DIR>/periodic-agent-sandbox-perf-load-test/<POD_NAME>
cat finished.json
```

The job referenced is the same `LoadTestRunner` entrypoint; its Prow spec is maintained in `config/jobs/kubernetes-sigs/agent-sandbox/agent-sandbox-periodics-main.yaml` in the `kubernetes/test-infra` repository, not in this one. The doc also notes minimum host resources of 8 CPU and 16 GB RAM for the nested cluster to come up cleanly.

Sources: [docs/prowjob_manual_run.md:1-84]()

## Extending the Pipelines

Three patterns recur when adding new coverage:

1. **A new presubmit shell wrapper** — copy `dev/ci/presubmits/lint-go` and change `run_dev_tool("lint-go")` to a target script in `dev/tools/`. The `shared/utils.py` helper handles repo-root resolution.
2. **A new cluster-backed test** — subclass `TestRunner` from `dev/ci/shared/runner.py`, override `run_tests` to invoke a binary, and override `copy_artifacts` to deposit a `junit_*.xml` into `$ARTIFACTS`. `setup_cluster` is inherited and handles KinD bring-up, image push, and CRD install; pass `extra_push_images_args=["--controller-only"]` when you only need the controller image.
3. **A new load-test recipe** — add a YAML under `dev/load-test/test-recipes/` (parameterised with `DefaultParam .CL2_*`), reuse the existing template manifests for `SandboxTemplate` / `SandboxWarmPool` / `SandboxClaim` / `HorizontalPodAutoscaler` / `CapacityBuffer`, and drive it from a local shell wrapper or a new CL2 invocation block in `LoadTestRunner.run_tests`. Prefer `GenericPrometheusQuery` over `PodStartupLatency` whenever the controller already exports a relevant histogram (e.g., `agent_sandbox_claim_startup_latency_ms_bucket`, used by the rapid-burst recipe).
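
The second pattern can be sketched as follows. The `TestRunner` class here is a stand-in written for this example so that the snippet runs on its own; the real class in `dev/ci/shared/runner.py` may have different constructor arguments and method signatures. Only the method names `setup_cluster`, `run_tests`, and `copy_artifacts` and the `extra_push_images_args` option come from the description above; `SmokeTestRunner` and `main` as written here are hypothetical.

```python
import os
import shutil
from pathlib import Path


class TestRunner:
    """Stand-in base class; signatures are assumptions, not the repository's."""

    def __init__(self, extra_push_images_args=None):
        self.extra_push_images_args = extra_push_images_args or []

    def setup_cluster(self):
        # The real implementation handles KinD bring-up, image push, CRD install.
        print("setup_cluster (stubbed), push args:", self.extra_push_images_args)

    def run_tests(self) -> int:
        raise NotImplementedError

    def copy_artifacts(self):
        raise NotImplementedError

    def main(self) -> int:
        self.setup_cluster()
        exit_code = self.run_tests()
        self.copy_artifacts()
        return exit_code


class SmokeTestRunner(TestRunner):
    """Hypothetical cluster-backed test that only needs the controller image."""

    def __init__(self, workdir: Path):
        super().__init__(extra_push_images_args=["--controller-only"])
        self.report = workdir / "junit.xml"

    def run_tests(self) -> int:
        # A real runner would invoke a test binary here; this writes a stub report.
        self.report.write_text('<testsuite name="smoke" tests="1" failures="0"/>')
        return 0

    def copy_artifacts(self):
        artifacts = Path(os.environ.get("ARTIFACTS", self.report.parent / "artifacts"))
        artifacts.mkdir(parents=True, exist_ok=True)
        shutil.copy(self.report, artifacts / "junit_smoke.xml")
```

Keeping cluster bring-up in the base class means a new runner only describes what to execute and which report to publish.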

Together, the CI runners and the load-test recipes give the project a stable signal (the small periodic in KinD) and a manual proving ground (the GKE-targeted recipes) without duplicating the wiring that brings the controller online.

Sources: [dev/ci/shared/runner.py:23-80](), [dev/ci/periodics/test-load-test:48-110](), [dev/load-test/test-recipes/rapid-burst-test.yaml:1-92](), [dev/load-test/test-recipes/rapid-burst-test.yaml:44-92]()

---

## 30. KEPs & Roadmap

> In-flight design proposals tracked under docs/keps (suspended state, metadata propagation, Python SDK refactor) and the published roadmap.

- Page Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/pages/30-keps-roadmap.md
- Generated: 2026-05-25T22:53:57.404Z

### Source Files

- `roadmap.md`
- `docs/keps/README.md`
- `docs/keps/119-sandbox-suspended-state/README.md`
- `docs/keps/174-metadata-propagation/README.md`
- `docs/keps/359-refactor-python-sdk/README.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [roadmap.md](roadmap.md)
- [docs/keps/README.md](docs/keps/README.md)
- [docs/keps/NNNN-template/README.md](docs/keps/NNNN-template/README.md)
- [docs/keps/NNNN-template/kep.yaml](docs/keps/NNNN-template/kep.yaml)
- [docs/keps/119-sandbox-suspended-state/README.md](docs/keps/119-sandbox-suspended-state/README.md)
- [docs/keps/119-sandbox-suspended-state/kep.yaml](docs/keps/119-sandbox-suspended-state/kep.yaml)
- [docs/keps/174-metadata-propagation/README.md](docs/keps/174-metadata-propagation/README.md)
- [docs/keps/174-metadata-propagation/kep.yaml](docs/keps/174-metadata-propagation/kep.yaml)
- [docs/keps/359-refactor-python-sdk/README.md](docs/keps/359-refactor-python-sdk/README.md)
- [docs/keps/359-refactor-python-sdk/kep.yaml](docs/keps/359-refactor-python-sdk/kep.yaml)
</details>

# KEPs & Roadmap

This page documents the forward-looking design surface of `kubernetes-sigs/agent-sandbox`: the project's enhancement proposal process (KEPs, named after the upstream Kubernetes Enhancement Proposal format), the three in-flight KEPs currently checked into `docs/keps/`, and the published `roadmap.md`. KEPs are how non-trivial changes are proposed, debated, and coordinated; the roadmap states the high-level strategic priorities for the calendar year. Together they describe what shape the project is likely to take next, independent of what is already implemented.

The body below pulls each proposal apart into its motivation, the API or behavior it changes, and what state of maturity it is in (per the `status` field in each `kep.yaml`). The roadmap section maps the roadmap bullets back to the KEPs and tracking issues they correspond to, so a reader can connect a strategic priority to the concrete design document that backs it.

## The KEP Process

The KEP process is modeled directly on the upstream [Kubernetes Enhancement Proposal](https://github.com/kubernetes/enhancements/tree/master/keps) workflow. Writing one is optional and reserved for "non-trivial changes" — controversial proposals, new features, major changes to existing features, and anything with wide project impact. Lightweight changes go straight to PRs. Discussion happens on the `#sig-apps` Kubernetes Slack channel and the SIG-Apps mailing list before a KEP is drafted.

Each KEP lives in a folder under `docs/keps/<NNN>-<short-name>/` and consists of two files:

| File | Purpose |
| :--- | :--- |
| `README.md` | The proposal itself: summary, motivation, proposal, API changes, alternatives. Sections are generated from `docs/keps/NNNN-template/README.md`. |
| `kep.yaml` | Structured metadata — title, `kep-number`, authors, reviewers, approvers, `status`, `creation-date`, and optionally `stage` and `last-updated`. |

The `NNN` prefix is the tracking issue number on the `agent-sandbox` GitHub repo, which gives every proposal a stable identifier and a single place to discuss live state. The template embeds an auto-generated `<!-- toc -->` block maintained by `make toc-update`.

Sources: [docs/keps/README.md:1-30](), [docs/keps/NNNN-template/README.md:1-94](), [docs/keps/NNNN-template/kep.yaml:1-18]()

### KEP Lifecycle and Status Vocabulary

`status` in `kep.yaml` is the canonical truth for where a proposal sits. The template offers `implementable|implemented` as the example vocabulary; existing KEPs use `provisional` and `implementable`. The optional `stage` field (e.g., `alpha`) tracks rollout maturity once an implementation lands.

```text
   draft  ─►  provisional  ─►  implementable  ─►  implemented
                                    │
                                    └─►  stage: alpha → beta → ga
```

Sources: [docs/keps/NNNN-template/kep.yaml:5](), [docs/keps/119-sandbox-suspended-state/kep.yaml:1-10](), [docs/keps/174-metadata-propagation/kep.yaml:1-9](), [docs/keps/359-refactor-python-sdk/kep.yaml:1-10]()

## In-Flight KEPs

Three KEPs are currently checked in. The table below is a quick index; full detail follows in dedicated sections.

| KEP | Title | Status | Stage | Authors | Created |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 119 | Agent Sandbox Suspended Condition Status | `implementable` | — | @SHRUTI6991 | 2026-03-16 |
| 174 | Label and Annotation Propagation to Sandbox Pods | `provisional` | `alpha` | @chenyiwang | 2026-03-18 |
| 359 | Python SDK Refactor From Context Manager to Resource Handle | `implementable` | — | @SHRUTI6991 | 2026-03-03 |

Note that the `kep-number` field inside `docs/keps/359-refactor-python-sdk/kep.yaml` is currently `174` — the folder name (`359-...`) reflects the tracking issue, while the YAML field has not been updated to match.
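
A mismatch like this is easy to detect mechanically. The sketch below is a hypothetical lint helper, not something that exists in the repository; it takes the already-parsed `kep.yaml` as a dict so that it needs no YAML library.

```python
import re


def kep_number_mismatch(folder_name: str, kep_yaml: dict):
    """Return (folder number, declared number) if they differ, else None."""
    match = re.match(r"(\d+)-", folder_name)
    if match is None:
        raise ValueError(f"folder {folder_name!r} has no numeric prefix")
    folder_number = int(match.group(1))
    declared = int(kep_yaml["kep-number"])
    if folder_number == declared:
        return None
    return (folder_number, declared)


print(kep_number_mismatch("359-refactor-python-sdk", {"kep-number": 174}))
print(kep_number_mismatch("119-sandbox-suspended-state", {"kep-number": 119}))
```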

Sources: [docs/keps/119-sandbox-suspended-state/kep.yaml:1-10](), [docs/keps/174-metadata-propagation/kep.yaml:1-9](), [docs/keps/359-refactor-python-sdk/kep.yaml:1-10]()

### KEP-119: Sandbox `Suspended` Condition

KEP-119 adds an explicit `Suspended` status condition to the `Sandbox` CRD so that observers can distinguish "scaling down" from "scaled down" without inspecting child Pods, PVCs, or Services. Today there is only the aggregate `Ready` condition; once a suspension is requested, that single signal cannot say whether the termination is in progress or already complete, and it cannot model future "soft pause" modes such as freezing container processes or hibernating to disk.

The proposal defines two conditions and a Reason-driven hierarchy that the controller evaluates top-down:

| Scenario | `Suspended` | Suspended Reason | Pod Phase | `Ready` (Root) | Ready Reason |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Provisioning | None | None | None | `False` | `DependenciesNotReady` |
| Pod Starting | None | None | Pending | `False` | `DependenciesNotReady` |
| Operational | None | None | Running & Ready | `True` | `DependenciesReady` |
| Suspending | `False` | `PodNotTerminated` | Running / Terminating | `False` | `SandboxSuspended` |
| Suspended | `True` | `PodTerminated` | None | `False` | `SandboxSuspended` |

The moment a suspension is requested (e.g., `replicas: 0`), `Ready` immediately flips to `False` with reason `SandboxSuspended`, and the `Suspended` condition tracks the actual progress. This lets standard tooling drive automation directly:

```bash
kubectl wait --for=condition=Ready sandbox/my-env
kubectl wait --for=condition=Suspended=True sandbox/my-env
```
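
The scenario table can be read as a pure function of whether suspension was requested and what the Pod is doing. The Python sketch below encodes the five rows; it is an illustration of the table, not controller code, and the function and argument names are hypothetical.

```python
from typing import Optional


def sandbox_conditions(suspend_requested: bool, pod_phase: Optional[str],
                       pod_ready: bool = False) -> dict:
    """Return {condition type: (status, reason)} following the KEP-119 table.

    `pod_phase` is None when no Pod exists. The Suspended condition is
    absent until a suspension has been requested.
    """
    if suspend_requested:
        if pod_phase is None:
            suspended = ("True", "PodTerminated")
        else:
            suspended = ("False", "PodNotTerminated")
        # Ready flips to False as soon as suspension is requested.
        return {"Suspended": suspended, "Ready": ("False", "SandboxSuspended")}
    if pod_phase == "Running" and pod_ready:
        return {"Ready": ("True", "DependenciesReady")}
    return {"Ready": ("False", "DependenciesNotReady")}


for args in [(False, None), (False, "Pending"), (False, "Running", True),
             (True, "Running"), (True, None)]:
    print(args, "->", sandbox_conditions(*args))
```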

The KEP rejects two alternatives: retaining the legacy `status.phase` string (deprecated by Kubernetes API conventions and prone to combinatorial explosion as new suspension modes are added) and overloading `Ready` with reason codes (no headroom to distinguish "freeze" vs "hibernate" vs "scale-to-zero" in the future).

```mermaid
stateDiagram-v2
    [*] --> Provisioning
    Provisioning --> PodStarting: dependencies created
    PodStarting --> Operational: Pod Ready
    Operational --> Suspending: replicas=0 / suspend request
    Suspending --> Suspended: Pod terminated
    Suspended --> Provisioning: scale back up

    Provisioning: Ready=False<br/>DependenciesNotReady
    PodStarting: Ready=False<br/>DependenciesNotReady
    Operational: Ready=True<br/>DependenciesReady
    Suspending: Ready=False / SandboxSuspended<br/>Suspended=False / PodNotTerminated
    Suspended: Ready=False / SandboxSuspended<br/>Suspended=True / PodTerminated
```

Sources: [docs/keps/119-sandbox-suspended-state/README.md:1-82]()

### KEP-174: Metadata Propagation to Sandbox Pods

KEP-174 standardizes how labels and annotations flow from a top-level `SandboxClaim` through the `Sandbox` and finally onto the backing `Pod`. The motivation is twofold: enabling per-claim "personalization" (cost attribution, observability, stateful session identifiers) on otherwise homogeneous warm-pool pods, and avoiding "template explosion" caused by creating a new `SandboxTemplate` for every unique session.

The design adds a single new field, `additionalPodMetadata`, on `SandboxClaim`. `SandboxTemplate` and `SandboxWarmPool` are left untouched so that pool resources remain interchangeable.

```go
// sandbox_types.go (existing)
type PodMetadata struct {
    Labels      map[string]string `json:"labels,omitempty" protobuf:"bytes,1,rep,name=labels"`
    Annotations map[string]string `json:"annotations,omitempty" protobuf:"bytes,2,rep,name=annotations"`
}

// sandboxclaim_types.go (new)
type SandboxClaimSpec struct {
    // ...
    AdditionalPodMetadata sandboxv1alpha1.PodMetadata `json:"additionalPodMetadata,omitempty"`
}
```

The data flow and safety contract:

```text
SandboxClaim.spec.additionalPodMetadata
        │   (SandboxClaim controller)
        ▼
Sandbox.spec.podTemplate.metadata           ◄── merged with SandboxTemplate
        │   (Sandbox controller)
        ▼
Pod.metadata.{labels,annotations}
        │
        ├── agents.x-k8s.io/propagated-labels       = "k1,k2,..."
        └── agents.x-k8s.io/propagated-annotations  = "a1,a2,..."
```

Two annotations on the resulting Pod — `agents.x-k8s.io/propagated-labels` and `agents.x-k8s.io/propagated-annotations` — record which keys the controller put there, so it can later prune removed keys without disturbing labels added by mutating webhooks or other actors. The controller refuses requests where a key collides between the template and the claim with conflicting values ("Safety Principle: No Overrides").
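
Both rules, no overrides and tracked pruning, fit in a few lines. The following Python sketch shows one way they could work for labels, assuming the tracking annotation holds a comma-separated key list as in the flow diagram; the function names are hypothetical and this is not the controller's Go implementation.

```python
PROPAGATED_LABELS = "agents.x-k8s.io/propagated-labels"


def merge_no_override(template: dict, claim: dict) -> dict:
    """Merge claim metadata into template metadata, rejecting conflicting values."""
    merged = dict(template)
    for key, value in claim.items():
        if key in template and template[key] != value:
            raise ValueError(f"claim may not override template key {key!r}")
        merged[key] = value
    return merged


def sync_labels(pod_labels: dict, pod_annotations: dict, desired: dict):
    """Apply `desired` claim labels and prune only keys propagated earlier.

    Labels added by other actors (for example mutating webhooks) are kept,
    because they never appear in the tracking annotation.
    """
    tracked = pod_annotations.get(PROPAGATED_LABELS, "")
    previously = {key for key in tracked.split(",") if key}
    labels = {k: v for k, v in pod_labels.items()
              if k not in previously or k in desired}
    labels.update(desired)
    annotations = dict(pod_annotations)
    annotations[PROPAGATED_LABELS] = ",".join(sorted(desired))
    return labels, annotations


labels, annotations = sync_labels(
    pod_labels={"team": "a", "injected-by-webhook": "x"},
    pod_annotations={PROPAGATED_LABELS: "team"},
    desired={"session": "s1"},
)
print(labels)       # "team" is pruned, the webhook label survives
print(annotations)
```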

Two propagation scenarios (cold start and warm pool) are spelled out, each with a creation-time case and a later-update case:

| Scenario | Trigger | Controller behavior |
| :--- | :--- | :--- |
| Cold Start (new Pod) | First creation, no warmpool | Sandbox controller merges template + claim metadata before Pod creation. |
| Cold Start (Pod exists) | `SandboxClaim` metadata updated after creation | In-place patch of Pod `metadata`, no restart. |
| Warmpool adoption | Claim binds to a pre-warmed Sandbox | `SandboxClaim` controller patches the Sandbox's `spec.podTemplate.metadata` for sub-millisecond dispatch; no resource re-creation. |
| Warmpool post-adoption update | Claim metadata changes later | `SandboxClaim` controller re-syncs the metadata into the Sandbox. |

The scope is intentionally narrow: only label/annotation propagation that does not require a Pod restart. Anything that imposes functional control needing a restart is deferred to [Issue 208](https://github.com/kubernetes-sigs/agent-sandbox/issues/208). For metadata that must apply uniformly across an entire pool, users still modify `SandboxTemplate` directly per [Issue 347](https://github.com/kubernetes-sigs/agent-sandbox/issues/347).

Sources: [docs/keps/174-metadata-propagation/README.md:1-108]()

### KEP-359: Python SDK Refactor — Context Manager → Resource Handle

KEP-359 reshapes the Python client from a transient `with Sandbox() as sbx:` context manager into a persistent **Resource Handle** keyed by `sandbox_id`. The context-manager pattern fits short scripts but is a poor match for long-lived agent workflows: an agent may suspend a sandbox for hours, hand ownership across processes, juggle multiple sandboxes concurrently, or reattach to one created by a different component. None of those are expressible inside a single `with` block.

The new SDK is layered into three tiers:

```mermaid
classDiagram
    class SandboxClient {
        +router_dns: str
        +create_sandbox(template, namespace) Sandbox
        +get_sandbox(sandbox_id) Sandbox
    }
    class Sandbox {
        +id: str
        +commands: CoreExecution
        +files: Filesystem
        +status()
        +suspend()
        +resume()
        +terminate()
    }
    class CoreExecution {
        +run_code()
        +run_cmd()
    }
    class Filesystem {
        +read()
        +write()
        +list()
    }
    class ProcessSystem {
        +create_process()
        +kill_process()
    }
    SandboxClient ..> Sandbox : factory / reattach
    Sandbox *-- CoreExecution
    Sandbox *-- Filesystem
    Sandbox ..> ProcessSystem : sbx.process
```

`SandboxClient` is the entry point and factory; it owns global configuration such as `router_dns`. `Sandbox` is a stable handle around a `sandbox_id` and exposes lifecycle verbs (`status`, `suspend`, `resume`, `terminate`) plus dot-namespaced engines (`sbx.files`, `sbx.commands`, `sbx.process`). Engines talk to the Sandbox Router over a stable DNS name and pass an `X-Sandbox-Id` header for session affinity.

The motivating properties the design calls out:

1. **Identity stability** — the same `Sandbox` object remains valid whether the underlying Pod is running, suspended, or resuming.
2. **Orchestration vs. execution split** — management calls go through the control plane; execution calls go through the router.
3. **Capability discovery** — new functionality slots into a new engine namespace rather than the root object.
4. **Distributed ownership** — a `sandbox_id` can be handed to another process which calls `get_sandbox(id)` and continues from there.
5. **Non-linear logic** — multiple long-lived sandboxes coexist as plain objects, no nested `with` blocks.

Worked example from the KEP:

```python
client = SandboxClient(router_dns="router.sandbox.svc")

sbx = client.create_sandbox(template="python-ml")
sbx.files.write("data.py", "x = 42")
sbx.commands.run_code("import data; print(data.x)")
sbx.suspend()

# Re-attach later, possibly in a different process
old_sbx = client.get_sandbox("sbx_123")
old_sbx.resume()
```
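
To show why a handle survives suspension and reattachment, here is a self-contained in-memory mock of the pattern. It is a sketch under stated assumptions: there is no router or cluster, the `_Record` class and the `sbx_N` id format are invented for the example, and only the `files` engine is modelled. The class and method names otherwise follow the diagram above.

```python
from dataclasses import dataclass, field


@dataclass
class _Record:
    """Server-side state for one sandbox; stands in for the control plane."""
    state: str = "running"
    files: dict = field(default_factory=dict)


class Filesystem:
    def __init__(self, sandbox):
        self._sandbox = sandbox

    def write(self, path: str, content: str) -> None:
        self._sandbox._require_running()
        self._sandbox._record.files[path] = content

    def read(self, path: str) -> str:
        self._sandbox._require_running()
        return self._sandbox._record.files[path]


class Sandbox:
    """Handle keyed by sandbox id; all durable state lives in the record."""

    def __init__(self, sandbox_id: str, record: _Record, router_dns: str):
        self.id = sandbox_id
        self._record = record
        self._router_dns = router_dns
        self.files = Filesystem(self)

    def request_headers(self) -> dict:
        # Header an engine would send to the router for session affinity.
        return {"X-Sandbox-Id": self.id}

    def status(self) -> str:
        return self._record.state

    def suspend(self) -> None:
        self._record.state = "suspended"

    def resume(self) -> None:
        self._record.state = "running"

    def _require_running(self) -> None:
        if self._record.state != "running":
            raise RuntimeError(f"sandbox {self.id} is {self._record.state}")


class SandboxClient:
    def __init__(self, router_dns: str):
        self.router_dns = router_dns
        self._records = {}

    def create_sandbox(self, template: str) -> Sandbox:
        sandbox_id = f"sbx_{len(self._records) + 1}"
        self._records[sandbox_id] = _Record()
        return self.get_sandbox(sandbox_id)

    def get_sandbox(self, sandbox_id: str) -> Sandbox:
        # Reattach: build a fresh handle around existing server-side state.
        return Sandbox(sandbox_id, self._records[sandbox_id], self.router_dns)
```

Two handles obtained for the same id observe the same state, which is the property that lets a `sandbox_id` be handed to another process.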

A proof-of-concept implementation lives at [PR #365](https://github.com/kubernetes-sigs/agent-sandbox/pull/365). The PoC initiates the router session at `Sandbox` construction time and updates the test client to use the handle-based flow. The "Scalability" section notes a future hook for a pluggable Sandbox Manager.

Sources: [docs/keps/359-refactor-python-sdk/README.md:1-129]()

## Roadmap

`roadmap.md` lists the project's "main strategic priorities for 2026." The bullets span documentation, distribution, SDK functionality, API shape, observability, isolation technology, and ecosystem integrations. Many bullets carry a GitHub issue number that links the strategic intent to the live tracking issue.

### Roadmap → KEP / Issue Map

The roadmap items split into ones that already have a KEP in flight versus ones that are tracked only by an issue and a sentence of intent.

| Roadmap theme | Tracking issue | KEP |
| :--- | :--- | :--- |
| Status Updates | [#119](https://github.com/kubernetes-sigs/agent-sandbox/issues/119) | KEP-119 (Suspended condition) |
| Metadata Propagation | [#174](https://github.com/kubernetes-sigs/agent-sandbox/issues/174) | KEP-174 |
| Expand SDK functionality (`read`, `write`, `run_code`, …) | — | KEP-359 (resource handle + engines) |
| Website Refresh | [#166](https://github.com/kubernetes-sigs/agent-sandbox/issues/166) | — |
| PyPI Distribution of `agent-sandbox-client` | [#146](https://github.com/kubernetes-sigs/agent-sandbox/issues/146) | — |
| Strict Sandbox-to-Pod Mapping | [#127](https://github.com/kubernetes-sigs/agent-sandbox/issues/127) | — |
| Go Client | [#227](https://github.com/kubernetes-sigs/agent-sandbox/issues/227) | — |
| Startup Actions | [#58](https://github.com/kubernetes-sigs/agent-sandbox/issues/58) | — |
| Creation Latency Metrics | [#123](https://github.com/kubernetes-sigs/agent-sandbox/issues/123) | — |
| Headless Service Port Handling | [#154](https://github.com/kubernetes-sigs/agent-sandbox/issues/154) | — |
| OpenEnv Support | [#132](https://github.com/kubernetes-sigs/agent-sandbox/issues/132) | — |

### Themes Without Linked Issues

Several priorities currently exist only as roadmap prose:

- **Documentation overhaul** — restructure and expand current documentation to lower the barrier to entry.
- **Benchmarking Guide** — published guidance for measuring Agent Sandbox performance.
- **Expand Sandbox use cases** — Computer Use, browser use, additional base images.
- **Decouple API from Runtime** — let users fully customize the runtime without breaking the API.
- **Scale-down / Resume PVC based** — preserve PVC when replicas drop to 0; restore on scale-up.
- **Complete CR, SDK, and template support** — feature-parity across surfaces.
- **API Support for Multi-Sandbox per Pod** — multiple sandboxes inside a single Pod.
- **Auto-deletion of bursty sandboxes** — typical RL training pattern.
- **Runtime API OTEL/Tracing instrumentation** — instrument and document.
- **Detailed Falco config extension** — surface gVisor logging configuration through the Agent Sandbox API.
- **Other isolation technologies** — QEMU, Firecracker, process isolation (pydantic).
- **Agent & RL framework integration** — CrewAI, Ray RLlib.
- **Integration with kAgent** and other sandbox offerings.
- **Deliver Beta/GA versions.**

Sources: [roadmap.md:1-29]()

## Reading and Contributing to a KEP

A practical workflow for a contributor proposing a non-trivial change:

1. Socialize the idea on `#sig-apps` Slack or the SIG-Apps mailing list.
2. Open a tracking issue on `kubernetes-sigs/agent-sandbox`; the issue number becomes the KEP's `NNN` prefix and folder name.
3. Copy `docs/keps/NNNN-template/` to `docs/keps/<NNN>-<short-name>/` and fill in the `README.md` sections (Summary, Motivation, Proposal, API Changes, Implementation Guidance, Scalability, Alternatives).
4. Populate `kep.yaml` — `title`, `kep-number`, `authors`, `reviewers`, `approvers`, `status`, `creation-date`, optionally `stage`, `last-updated`, `see-also`, and `replaces`.
5. Regenerate the table of contents with `make toc-update`.
6. Land the KEP through normal review; update `status` (and `stage` once implementation begins) as the proposal moves through `provisional` → `implementable` → `implemented`.
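
Step 3 of this workflow can be scripted. The helper below is a hypothetical convenience, not a tool shipped in the repository: it copies the template folder to a folder named after the tracking issue, and the demo runs against a temporary directory with placeholder file contents rather than a real checkout.

```python
import shutil
import tempfile
from pathlib import Path


def scaffold_kep(keps_dir: Path, issue_number: int, short_name: str) -> Path:
    """Copy docs/keps/NNNN-template to docs/keps/<issue_number>-<short_name>."""
    template = keps_dir / "NNNN-template"
    target = keps_dir / f"{issue_number}-{short_name}"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copytree(template, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        keps = Path(tmp)
        (keps / "NNNN-template").mkdir()
        (keps / "NNNN-template" / "README.md").write_text("# KEP title\n")
        (keps / "NNNN-template" / "kep.yaml").write_text("title: KEP Template\n")
        new_kep = scaffold_kep(keps, 412, "example-feature")
        print(sorted(p.name for p in new_kep.iterdir()))
```

After scaffolding, the `kep-number` in the copied `kep.yaml` still needs to be set to the issue number by hand.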

Sources: [docs/keps/README.md:1-30](), [docs/keps/NNNN-template/README.md:1-94](), [docs/keps/NNNN-template/kep.yaml:1-18]()

## Summary

The `docs/keps/` tree is the project's design archive: three live proposals (119 surfacing a `Suspended` status condition on the `Sandbox` CRD, 174 wiring `SandboxClaim` labels and annotations down to the backing Pod without breaking warm-pool homogeneity, and 359 rebuilding the Python SDK around a persistent `Sandbox` handle with namespaced engines) plus a template and process document. `roadmap.md` sits alongside as the 2026 strategic agenda; many of its bullets reference the same GitHub issues used to number the KEPs, making the roadmap → KEP → tracking-issue chain the natural path for anyone trying to understand where Agent Sandbox is heading next.

---