# SandboxClaim Reconciler

> The largest controller in the project: template resolution, env/metadata injection, warm-pool adoption, pod-exclusivity invariants, foreground deletion, and TTL after finish.

- Repository: kubernetes-sigs/agent-sandbox
- GitHub: https://github.com/kubernetes-sigs/agent-sandbox
- Human wiki: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a
- Complete Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/llms-full.txt

## Source Files

- `extensions/controllers/sandboxclaim_controller.go`
- `extensions/controllers/sandboxclaim_controller_test.go`
- `extensions/controllers/sandboxclaim_pod_exclusivity_test.go`
- `extensions/controllers/utils.go`

---

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [extensions/controllers/sandboxclaim_controller.go](extensions/controllers/sandboxclaim_controller.go)
- [extensions/controllers/sandboxclaim_controller_test.go](extensions/controllers/sandboxclaim_controller_test.go)
- [extensions/controllers/sandboxclaim_pod_exclusivity_test.go](extensions/controllers/sandboxclaim_pod_exclusivity_test.go)
- [extensions/controllers/utils.go](extensions/controllers/utils.go)
- [extensions/api/v1beta1/sandboxclaim_types.go](extensions/api/v1beta1/sandboxclaim_types.go)
- [extensions/api/v1beta1/sandboxtemplate_types.go](extensions/api/v1beta1/sandboxtemplate_types.go)
- [internal/lifecycle/expiry.go](internal/lifecycle/expiry.go)
</details>


The `SandboxClaimReconciler` is the largest controller in the `extensions/controllers` package. It turns a user-facing `SandboxClaim` into a working `Sandbox` resource, choosing between three sources — adopting a warm pool sandbox, taking over a previously created one by status/label/name, or cold-starting from a `SandboxTemplate`. It also owns the lifetime of that sandbox, including expiration, TTL-after-finished, foreground deletion, and the 1:1 pod-exclusivity invariant that prevents two claims from binding the same warm pool pod.

The page maps the reconcile loop to the responsibilities that show up in the implementation: template resolution and metadata merging, environment variable injection under template policy, the warm-pool adoption protocol with optimistic ownership transfer, status/condition forwarding from the core `Sandbox`, expiration with the three `ShutdownPolicy` modes, and the watches/predicates that feed the controller.

## Reconcile entry point and high-level flow

`Reconcile` runs in a single pass per request. It loads the claim, opportunistically cleans up a legacy per-claim `NetworkPolicy`, starts a trace span, initializes observability annotations, then decides between the *active* and *expired* branch based on `checkExpiration`. After the chosen branch returns a `Sandbox` (or `nil`), the reconciler computes the `Ready` and `Finished` conditions, writes status with `r.updateStatus`, records latency metrics, and returns a requeue duration matching the next expiry boundary.

Three sentinel errors are handled without returning an error, so the controller does not retry them with error backoff: `ErrTemplateNotFound` causes a 1-minute requeue, while `ErrInvalidMetadata` and `ErrSandboxNotOwned` are logged at V(1) and swallowed.
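
The sentinel handling described above can be sketched as a small classification function. This is a minimal illustration, not the controller's code: `classify` is a hypothetical helper, the error messages are invented, and the real reconciler returns a `ctrl.Result` rather than a bare duration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Stand-ins for the controller's sentinel errors (messages are illustrative).
var (
	ErrTemplateNotFound = errors.New("sandbox template not found")
	ErrInvalidMetadata  = errors.New("invalid additional pod metadata")
	ErrSandboxNotOwned  = errors.New("sandbox not owned by claim")
)

// classify is a hypothetical helper: it maps a reconcile error to the
// requeue delay and the error that would be handed back to the manager.
func classify(err error) (time.Duration, error) {
	switch {
	case err == nil:
		return 0, nil
	case errors.Is(err, ErrTemplateNotFound):
		return time.Minute, nil // retry later without error backoff
	case errors.Is(err, ErrInvalidMetadata), errors.Is(err, ErrSandboxNotOwned):
		return 0, nil // swallowed; reported through the Ready condition instead
	default:
		return 0, err // anything else is returned so the manager backs off
	}
}

func main() {
	wrapped := fmt.Errorf("resolving template: %w", ErrTemplateNotFound)
	delay, err := classify(wrapped)
	fmt.Println(delay, err) // prints "1m0s <nil>"
}
```

Using `errors.Is` rather than `==` keeps the classification working when a sentinel is wrapped with extra context.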

Sources: [extensions/controllers/sandboxclaim_controller.go:140-282](extensions/controllers/sandboxclaim_controller.go)

```mermaid
flowchart TD
    Start[Reconcile request] --> Get[Get SandboxClaim]
    Get --> Cleanup[cleanupLegacyNetworkPolicy]
    Cleanup --> Init[initializeAnnotations<br/>trace + first-observed]
    Init --> Exp{checkExpiration<br/>shutdownTime / TTL}
    Exp -->|expired + Delete*| DeleteClaim[Delete claim<br/>Foreground prop. if set]
    Exp -->|expired + Retain| ReconcileExpired[reconcileExpired:<br/>delete owned Sandbox]
    Exp -->|active| ReconcileActive[reconcileActive]

    ReconcileActive --> Validate[validateAdditionalPodMetadata]
    Validate --> GetOrCreate[getOrCreateSandbox]
    GetOrCreate -->|hit by status/label/name| MetaSync[mergePodMetadata<br/>+ Update if drifted]
    GetOrCreate -->|warm queue pop| Adopt[adoptSandboxFromCandidates]
    GetOrCreate -->|miss| ColdCreate[createSandbox<br/>from SandboxTemplate]

    ReconcileExpired --> Status
    MetaSync --> Status
    Adopt --> Status
    ColdCreate --> Status
    DeleteClaim --> Done

    Status[computeAndSetStatus<br/>Ready + Finished mirror] --> Persist[updateStatus<br/>Status().Patch]
    Persist --> Metrics[recordCreationLatencyMetric]
    Metrics --> Requeue{post-expiration<br/>or timeLeft > 0?}
    Requeue --> Done[Result + err]
```

## Public API and constants

The reconciler is a `client.Client` plus injected collaborators:

| Field | Purpose |
|---|---|
| `Scheme` | Used by `controllerutil.SetControllerReference` when creating/adopting Sandboxes. |
| `WarmSandboxQueue` | In-memory per-template hash queue (`queue.SandboxQueue`) of warm pool candidates. |
| `Recorder` | `events.EventRecorder` used for `SandboxProvisioned`, `SandboxAdopted`, `ClaimExpired` events. |
| `Tracer` | `asmetrics.Instrumenter` for tracing spans and propagating a trace-context annotation. |
| `MaxConcurrentReconciles` | Concurrency for the controller manager. |
| `observedTimes` | Type-safe `sync.Map` keyed by `NamespacedName` and tagged by UID for latency tracking. |

Sentinel errors drive flow control: `ErrTemplateNotFound`, `ErrInvalidMetadata`, `ErrSandboxNotOwned`, `ErrCrossNamespaceAdoption`. Two annotation keys gate behavior: `agents.x-k8s.io/controller-first-observed-at` (observability) and `asmetrics.TraceContextAnnotation` (trace propagation). The `restrictedDomains` list (`kubernetes.io`, `k8s.io`, `agents.x-k8s.io`) is enforced when validating user-supplied label/annotation keys.

Sources: [extensions/controllers/sandboxclaim_controller.go:58-125](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:60-72](extensions/controllers/sandboxclaim_controller.go)

## Template resolution and sandbox provisioning

### Cold path: `createSandbox`

`createSandbox` is the cold path. It deep-copies `template.Spec.PodTemplate` into a new `Sandbox` named after the claim, propagates the trace context, copies `VolumeClaimTemplates`, and applies identity labels (`SandboxIDLabel = claim.UID` and the template-ref hash `agents.x-k8s.io/sandbox-template-ref-hash`) to both the top-level `Sandbox` metadata and the pod template. Both locations are needed because KEP-0174 only propagates pod-template labels, while the platform informer reads top-level `Sandbox.metadata.labels`.

It then merges `claim.Spec.AdditionalPodMetadata`, applies `ApplySandboxSecureDefaults`, sets `Replicas = 1`, attaches the controller owner reference, and creates the `Sandbox`. Cold start is recorded as `LaunchTypeCold` in `RecordSandboxClaimCreation`.

Sources: [extensions/controllers/sandboxclaim_controller.go:923-1068](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/utils.go:23-48](extensions/controllers/utils.go)

### Environment variable injection

The controller injects `claim.Spec.Env` into the pod template, but only when the template's `EnvVarsInjectionPolicy` permits it:

| Policy | Behavior |
|---|---|
| `Disallowed` (default when policy is not `Allowed` / `Overrides`) | Any `claim.Spec.Env` causes rejection with "environment variable injection is not allowed by the template policy". |
| `Allowed` | New env vars may be appended; collision with an existing name is rejected. |
| `Overrides` | Existing variables with the same name are replaced in place. |

Env vars without `ContainerName` are appended only to the first regular container; vars with `ContainerName` are routed to that container across init- and regular-container lists, and any unknown container name fails the reconcile with a precise error referencing the offending variable name. The implementation lives in `injectEnvs`, with grouping/validation in `createSandbox`.

A second hard rule: `getOrCreateSandbox` rejects `claim.Spec.Env` when `WarmPool != "none"`, because warm pool pods are pre-baked and per-claim env injection would silently miss them.
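
The three policy behaviors in the table can be sketched on a single container's env list. This is a simplified illustration under stated assumptions: `EnvVar` and `Policy` are local stand-ins for the real API types, the collision error text is invented, and container routing by `ContainerName` is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// EnvVar and Policy are simplified stand-ins for the real API types.
type EnvVar struct{ Name, Value string }

type Policy string

const (
	Allowed   Policy = "Allowed"
	Overrides Policy = "Overrides"
)

// injectEnvs merges extra into a copy of base under the given policy.
// Any policy other than Allowed or Overrides is treated as Disallowed.
func injectEnvs(base, extra []EnvVar, p Policy) ([]EnvVar, error) {
	if len(extra) == 0 {
		return base, nil
	}
	if p != Allowed && p != Overrides {
		return nil, errors.New("environment variable injection is not allowed by the template policy")
	}
	out := append([]EnvVar(nil), base...)
	index := make(map[string]int, len(out))
	for i, e := range out {
		index[e.Name] = i
	}
	for _, e := range extra {
		i, exists := index[e.Name]
		switch {
		case !exists:
			index[e.Name] = len(out)
			out = append(out, e)
		case p == Overrides:
			out[i] = e // replace in place, keeping the original position
		default:
			return nil, fmt.Errorf("environment variable %q already exists in the template", e.Name)
		}
	}
	return out, nil
}

func main() {
	base := []EnvVar{{"MODE", "safe"}}
	extra := []EnvVar{{"MODE", "fast"}, {"REGION", "eu"}}

	_, err := injectEnvs(base, extra, Allowed)
	fmt.Println(err != nil) // prints "true": MODE collides under Allowed

	merged, _ := injectEnvs(base, extra, Overrides)
	fmt.Println(merged) // prints "[{MODE fast} {REGION eu}]"
}
```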

Sources: [extensions/controllers/sandboxclaim_controller.go:896-1038](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1155-1163](extensions/controllers/sandboxclaim_controller.go)

### `additionalPodMetadata` validation and merge

`validateAdditionalPodMetadata` rejects any key whose `/`-prefix domain is one of `kubernetes.io`, `k8s.io`, or `agents.x-k8s.io` (or a sub-domain), and runs `validation.IsValidLabelValue` for label values. `mergePodMetadata` then performs a strict "no overrides" merge: a claim label or annotation whose key already exists in the template with a *different* value fails the reconcile, while matching values and new keys are merged in. This is enforced both when creating and when adopting.
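
A rough sketch of the two checks described above, using plain maps. The function names `isRestrictedKey` and `mergeNoOverride` are hypothetical, the error text is invented, and label-value validation via `validation.IsValidLabelValue` is omitted to keep the example free of external modules.

```go
package main

import (
	"fmt"
	"strings"
)

// Domains that user-supplied keys may not use (taken from the text above).
var restrictedDomains = []string{"kubernetes.io", "k8s.io", "agents.x-k8s.io"}

// isRestrictedKey reports whether the key's prefix (the part before "/")
// is a restricted domain or one of its sub-domains.
func isRestrictedKey(key string) bool {
	prefix, _, found := strings.Cut(key, "/")
	if !found {
		return false // unprefixed keys have no domain
	}
	for _, d := range restrictedDomains {
		if prefix == d || strings.HasSuffix(prefix, "."+d) {
			return true
		}
	}
	return false
}

// mergeNoOverride copies extra into a copy of base. A key that already
// exists with a different value is an error; equal values are accepted.
func mergeNoOverride(base, extra map[string]string) (map[string]string, error) {
	out := make(map[string]string, len(base)+len(extra))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range extra {
		if cur, ok := out[k]; ok && cur != v {
			return nil, fmt.Errorf("key %q conflicts with the template value", k)
		}
		out[k] = v
	}
	return out, nil
}

func main() {
	fmt.Println(isRestrictedKey("app.kubernetes.io/name")) // prints "true"
	fmt.Println(isRestrictedKey("example.com/team"))       // prints "false"

	_, err := mergeNoOverride(
		map[string]string{"tier": "gold"},
		map[string]string{"tier": "silver"},
	)
	fmt.Println(err != nil) // prints "true"
}
```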

Sources: [extensions/controllers/sandboxclaim_controller.go:806-894](extensions/controllers/sandboxclaim_controller.go)

## Warm pool adoption

### Lookup order in `getOrCreateSandbox`

`getOrCreateSandbox` exhausts existing-binding paths before touching the warm queue:

1. `claim.Status.SandboxStatus.Name`: if set and the named `Sandbox` is owned by this claim, return it.
2. `claim.Labels[AssignedSandboxNameLabel]`: the optimistic lock label written during a previous adoption attempt. If the sandbox is still controlled by a `SandboxWarmPool`, the controller retries `completeAdoption` and returns an "in progress" error so that the next reconcile sees the sandbox controlled by this claim. If the sandbox no longer exists, the stale label is removed by patch.
3. Name-based lookup at `claim.Name`: picks up cold-path sandboxes the controller previously created, verifies controller ownership, and refuses to silently overwrite a foreign-owned sandbox with the same name.
4. Otherwise, if `WarmPool == "none"` return `nil, nil` (caller cold-starts). For `default` or a specific pool name, pop a candidate from `WarmSandboxQueue`.

Sources: [extensions/controllers/sandboxclaim_controller.go:1070-1180](extensions/controllers/sandboxclaim_controller.go)

### `getCandidate` and `adoptSandboxFromCandidates`

`getCandidate` pops keys from the per-template-hash queue, fetches each one, and discards *ghost pods* (queue keys whose `Sandbox` is gone from the informer cache). Pods that fail `verifySandboxCandidate` are dropped; those that are simply in the wrong namespace (`ErrCrossNamespaceAdoption`) are skipped and re-queued via a deferred re-add. When `WarmPoolPolicy.IsSpecificPool()` is true, candidates whose `warmPoolSandboxLabel` does not equal `NameHash(<pool>)` are also skipped back to the queue.

`adoptSandboxFromCandidates` then performs an optimistic two-step claim adoption, retried up to three times:

1. Set `claim.Labels[AssignedSandboxNameLabel] = <sandbox-name>` and `Update` the claim. A `Conflict` here means another reconciler raced us; the candidate is re-queued and the loop retries.
2. `completeAdoption` patches the sandbox: strips warm pool labels (`warmPoolSandboxLabel`, `sandboxTemplateRefHash`, `SandboxPodTemplateHashLabel`), drops the old `SandboxWarmPool` owner ref, sets the claim as controller, ensures `SandboxPodNameAnnotation == sandbox.Name`, propagates the trace-context annotation, re-applies identity labels, then rebuilds the pod-template `ObjectMeta` exactly as the active path does (template metadata + identity labels + merged claim metadata). A missing-template fallback merges directly into the existing sandbox's pod template.

Successful adoption records `LaunchTypeWarm` with the source warm-pool name and the candidate's readiness, and emits a `SandboxAdopted` event.
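
The first adoption step behaves like a compare-and-swap on the claim. The sketch below simulates it with an in-memory `store` that rejects writes from a stale read, the way an API server rejects a stale `resourceVersion`. All types and the `adopt` function are hypothetical, and refreshing the claim after a conflict is an assumption made for the example rather than confirmed controller behavior.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("conflict: resourceVersion is stale")

// claim is a stand-in for a stored object with optimistic concurrency.
type claim struct {
	ResourceVersion int
	AssignedSandbox string
}

// store mimics an API server that rejects updates made from a stale read.
type store struct{ current claim }

func (s *store) Get() claim { return s.current }

func (s *store) Update(c claim) error {
	if c.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	c.ResourceVersion++
	s.current = c
	return nil
}

const maxAttempts = 3

// adopt pops a candidate, records it on the claim, and retries on conflict.
// A candidate that loses the race is handed back through requeue.
func adopt(s *store, c claim, pop func() (string, bool), requeue func(string)) (string, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		name, ok := pop()
		if !ok {
			return "", nil // queue empty: the caller would cold-start
		}
		c.AssignedSandbox = name
		err := s.Update(c)
		if err == nil {
			return name, nil // step 2 (patching the sandbox) would follow here
		}
		if !errors.Is(err, errConflict) {
			return "", err
		}
		requeue(name) // give the candidate back to the queue
		c = s.Get()   // refresh the claim before retrying
	}
	return "", fmt.Errorf("adoption failed after %d attempts", maxAttempts)
}

func main() {
	s := &store{current: claim{ResourceVersion: 2}}
	queue := []string{"sb-1", "sb-2"}
	pop := func() (string, bool) {
		if len(queue) == 0 {
			return "", false
		}
		name := queue[0]
		queue = queue[1:]
		return name, true
	}
	requeue := func(name string) { queue = append(queue, name) }

	// The caller holds a stale read (version 1), so the first write conflicts.
	name, err := adopt(s, claim{ResourceVersion: 1}, pop, requeue)
	fmt.Println(name, err, queue) // prints "sb-2 <nil> [sb-1]"
}
```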

Sources: [extensions/controllers/sandboxclaim_controller.go:591-794](extensions/controllers/sandboxclaim_controller.go)

### Pod-exclusivity invariant

The 1:1 invariant — every warm pool pod is adopted by at most one claim, and every claim ends up owning exactly one `Sandbox` — is enforced by two independent mechanisms that the test `TestWarmPoolPodExclusivity` exercises end-to-end with three claims and two warm pods:

- The queue itself: `r.WarmSandboxQueue.Get(templateHash)` is a pop, not a list. A consumed key is gone unless the reconciler explicitly re-adds it.
- The `AssignedSandboxNameLabel` + claim `Update` acts as an optimistic lock; concurrent claims that pop the same key will lose the race when persisting the label, push the key back, and re-try.

The result is verified by collecting `sandbox → owning-claim` from `controllerRef` and asserting both directions are 1:1, and that both warm pods are adopted while the third claim cold-starts.
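
The pop-based half of the invariant can be demonstrated in isolation. This sketch runs one goroutine per claim against a mutex-guarded queue and mirrors the test scenario of three claims and two warm sandboxes. `popQueue` and `assign` are hypothetical names; the real `queue.SandboxQueue` API is not reproduced here, and the `cold-` prefix is invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// popQueue is a minimal stand-in for a per-template warm queue.
type popQueue struct {
	mu    sync.Mutex
	items []string
}

// Pop removes and returns the head, so no two callers can see the same key.
func (q *popQueue) Pop() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return "", false
	}
	item := q.items[0]
	q.items = q.items[1:]
	return item, true
}

// assign runs one goroutine per claim and returns claim -> sandbox.
// A claim that finds the queue empty gets a cold sandbox named after it.
func assign(claims, warm []string) map[string]string {
	q := &popQueue{items: append([]string(nil), warm...)}
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		owned = make(map[string]string)
	)
	for _, c := range claims {
		wg.Add(1)
		go func(claim string) {
			defer wg.Done()
			sb, ok := q.Pop()
			if !ok {
				sb = "cold-" + claim
			}
			mu.Lock()
			owned[claim] = sb
			mu.Unlock()
		}(c)
	}
	wg.Wait()
	return owned
}

func main() {
	owned := assign(
		[]string{"claim-a", "claim-b", "claim-c"},
		[]string{"warm-1", "warm-2"},
	)
	fmt.Println(len(owned)) // prints "3"
}
```

Which claim ends up cold depends on goroutine scheduling, but each warm sandbox is handed out exactly once regardless.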

Sources: [extensions/controllers/sandboxclaim_pod_exclusivity_test.go:41-187](extensions/controllers/sandboxclaim_pod_exclusivity_test.go), [extensions/controllers/sandboxclaim_controller.go:648-726](extensions/controllers/sandboxclaim_controller.go)

```text
WarmSandboxQueue[templateHash]
        |
        | pop()  (single-consumer per key)
        v
 +--------------+   1. claim.Update(label=sb)      +------------------+
 | reconcile A  | -------------------------------> | apiserver        |
 +--------------+   2. patch sandbox controller    | (resourceVersion)|
                                                   +------------------+
        ^                                                   |
        | re-add key on Conflict                            v
 +--------------+   1. claim.Update(label=sb) -> 409 Conflict
 | reconcile B  |   re-queue and pick next candidate
 +--------------+
```

## Status, conditions, and events

`computeAndSetStatus` produces a `Ready` condition (`computeReadyCondition`) and then mirrors the `Finished` condition from the `Sandbox` onto the claim via `syncFinishedCondition`. `SandboxStatus.{Name, PodIPs}` is copied from the sandbox when present and cleared otherwise.

`computeReadyCondition` short-circuits in a strict order:

| Input | `Ready` Reason | Notes |
|---|---|---|
| `err = ErrTemplateNotFound` | `TemplateNotFound` | False; reconcile requeues in 1 min. |
| `err = ErrInvalidMetadata` | `InvalidMetadata` | False; error suppressed (no requeue spam). |
| `err = ErrSandboxNotOwned` | `ClaimExpired` | False; expired-claim cleanup is blocked because the sandbox is not owned by this claim. |
| any other `err` | `ReconcilerError` | False; error returned for backoff. |
| `isClaimExpired` true | `ClaimExpired` | "Sandbox cleanup initiated." |
| `sandbox == nil` | `SandboxMissing` | False. |
| underlying sandbox has `Ready=False, Reason=Expired` | `SandboxExpired` | Forwards core-controller expiry. |
| else | forwards the sandbox's `Ready` condition verbatim | falls back to `SandboxNotReady` if absent. |

`syncFinishedCondition` only mirrors `Sandbox.Status.SandboxConditionFinished` when a sandbox is present; if no sandbox exists and the claim is *not* expired, it removes any stale `Finished` condition to avoid keeping a terminal marker on a re-provisioned claim.

`updateStatus` sorts both old and new conditions deterministically by `Type` and then `Status().Patch` only when the semantic deep-equal differs, keeping resourceVersion churn minimal.
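
The short-circuit order from the `computeReadyCondition` table can be sketched as a single ordered `switch`. The types here are simplified stand-ins and `readyReason` is a hypothetical function that returns only the reason string, not a full condition.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the controller's sentinel errors (messages are illustrative).
var (
	ErrTemplateNotFound = errors.New("sandbox template not found")
	ErrInvalidMetadata  = errors.New("invalid additional pod metadata")
	ErrSandboxNotOwned  = errors.New("sandbox not owned by claim")
)

// condition and sandbox are simplified stand-ins for the API types.
type condition struct{ Status, Reason string }

type sandbox struct {
	Ready *condition // nil when the sandbox has no Ready condition yet
}

// readyReason walks the cases in table order; the first match wins.
func readyReason(err error, claimExpired bool, sb *sandbox) string {
	switch {
	case errors.Is(err, ErrTemplateNotFound):
		return "TemplateNotFound"
	case errors.Is(err, ErrInvalidMetadata):
		return "InvalidMetadata"
	case errors.Is(err, ErrSandboxNotOwned):
		return "ClaimExpired"
	case err != nil:
		return "ReconcilerError"
	case claimExpired:
		return "ClaimExpired"
	case sb == nil:
		return "SandboxMissing"
	case sb.Ready == nil:
		return "SandboxNotReady"
	case sb.Ready.Status == "False" && sb.Ready.Reason == "Expired":
		return "SandboxExpired"
	default:
		return sb.Ready.Reason // forward the sandbox's own reason
	}
}

func main() {
	fmt.Println(readyReason(nil, false, nil)) // prints "SandboxMissing"

	expired := &sandbox{Ready: &condition{Status: "False", Reason: "Expired"}}
	fmt.Println(readyReason(nil, false, expired)) // prints "SandboxExpired"

	// Errors are checked first, so they mask the expired flag.
	fmt.Println(readyReason(ErrInvalidMetadata, true, nil)) // prints "InvalidMetadata"
}
```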

Sources: [extensions/controllers/sandboxclaim_controller.go:422-576](extensions/controllers/sandboxclaim_controller.go)

## Lifecycle: expiration, TTL after finished, shutdown policies

Expiration is computed by `checkExpiration`, which delegates to `lifecycle.TimeLeft(now, ShutdownTime, TTLSecondsAfterFinished, finishedCondition)`. It returns `(true, 0)` once the claim has expired and `(false, dur)` with the remaining duration otherwise. The deadline is the earlier of `ShutdownTime` and `finishedAt + TTL`.
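
The deadline arithmetic can be sketched as follows. This is not the `internal/lifecycle` implementation: `timeLeft` is a hypothetical simplification that takes a finish time instead of a condition, models unset inputs as nil pointers, and assumes `(false, 0)` when no deadline applies. The `at` helper exists only to build timestamps for the example.

```go
package main

import (
	"fmt"
	"time"
)

// at builds a timestamp a given number of seconds after the Unix epoch.
func at(sec int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(sec) * time.Second)
}

// timeLeft returns (expired, remaining). The deadline is the earlier of
// shutdownTime and finishedAt+ttlSeconds; nil inputs are treated as unset.
func timeLeft(now time.Time, shutdownTime *time.Time, ttlSeconds *int32, finishedAt *time.Time) (bool, time.Duration) {
	var deadline *time.Time
	if shutdownTime != nil {
		t := *shutdownTime
		deadline = &t
	}
	if ttlSeconds != nil && finishedAt != nil {
		t := finishedAt.Add(time.Duration(*ttlSeconds) * time.Second)
		if deadline == nil || t.Before(*deadline) {
			deadline = &t
		}
	}
	if deadline == nil {
		return false, 0 // assumption: no deadline means nothing to wait for
	}
	if !now.Before(*deadline) {
		return true, 0
	}
	return false, deadline.Sub(now)
}

func main() {
	now := at(100)
	shutdown := at(160)
	finished := at(90)
	ttl := int32(30)

	// finishedAt + TTL = 120s, which is earlier than the 160s shutdown time.
	expired, left := timeLeft(now, &shutdown, &ttl, &finished)
	fmt.Println(expired, left) // prints "false 20s"
}
```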

```mermaid
stateDiagram-v2
    [*] --> Active
    Active --> Active: requeue at min(ShutdownTime, finishedAt+TTL)
    Active --> Expired: now >= expireAt

    state Expired {
        [*] --> RetainBranch: Lifecycle.ShutdownPolicy = Retain
        [*] --> DeleteBranch: Lifecycle.ShutdownPolicy = Delete
        [*] --> ForegroundBranch: Lifecycle.ShutdownPolicy = DeleteForeground

        RetainBranch --> SandboxDeleted: reconcileExpired Deletes Sandbox\nKeeps Claim with Ready=ClaimExpired
        DeleteBranch --> ClaimDeleted: Delete(claim)\nNo propagation policy
        ForegroundBranch --> ClaimDeleted: Delete(claim,\nPropagationPolicy=Foreground)
    }

    SandboxDeleted --> [*]
    ClaimDeleted --> [*]
```

`Reconcile` only takes the delete-claim branch when `claimExpired` *and* either policy `Delete` or `DeleteForeground` is configured. `DeleteForeground` adds `client.PropagationPolicy(metav1.DeletePropagationForeground)`, ensuring the API server blocks finalization of the claim until its owned `Sandbox` (and its dependents) are removed. After issuing the delete, the reconciler returns immediately — continuing would attempt to patch the status of an object already in deletion.

For `Retain`, `reconcileExpired` looks up the sandbox by `Status.SandboxStatus.Name` (falling back to `claim.Name`), verifies controller ownership (otherwise returns `ErrSandboxNotOwned`), and issues a non-foreground delete. The claim itself is preserved with the `ClaimExpired` Ready reason.

After computing status on the active path, `Reconcile` re-runs `checkExpiration` (`postExpiration`) to handle the case where mirroring the `Finished` condition from the sandbox *during this same reconcile* made the claim newly TTL-expired. When this happens, the controller writes status and requeues at `immediateRequeueDelay = 1ms` so the next pass enters the expired branch. `TestSandboxClaimMirrorsFinishedConditionAndSchedulesTTL` and `TestSandboxClaimTTLAfterFinishedCleanupPolicy` cover this two-pass behavior: pass 1 mirrors `Finished` and computes a positive `RequeueAfter`; pass 2 triggers cleanup according to the policy.

Sources: [extensions/controllers/sandboxclaim_controller.go:165-261](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:309-420](extensions/controllers/sandboxclaim_controller.go), [internal/lifecycle/expiry.go:24-82](internal/lifecycle/expiry.go), [extensions/controllers/sandboxclaim_controller_test.go:1110-1303](extensions/controllers/sandboxclaim_controller_test.go)

## Watches, predicates, and queue feeders

`SetupWithManager` builds a controller with:

- `For(&SandboxClaim{}, WithPredicates(getTimingPredicate()))` — the timing predicate stamps the first-observed time per UID in `observedTimes` and removes the entry on delete events so the map cannot leak.
- `Owns(&Sandbox{})` — standard owner-driven requeues.
- `Watches(&Sandbox{}, &sandboxEventHandler{...})` — pushes adoptable sandboxes into the per-template-hash queue and removes ghost-pod keys on delete.
- `Watches(&SandboxTemplate{}, &templateEventHandler{...})` — drops the entire warm queue for a deleted template via `RemoveQueue`.
- `Watches(&SandboxTemplate{}, EnqueueRequestsFromMapFunc(mapTemplateToClaims), WithPredicates(ResourceVersionChangedPredicate{}))` — when a template changes, re-enqueue every claim that references it through the indexed `TemplateRefField`.
- A field index on `TemplateRefField` is registered via `mgr.GetFieldIndexer().IndexField` for the template→claims mapping.

`sandboxEventHandler.Update` enqueues a key when a sandbox transitions from not-adoptable to adoptable, or when an already-adoptable sandbox changes its template hash. `isAdoptable` requires: not deleting, has `warmPoolSandboxLabel`, has `sandboxTemplateRefHash`, and (if controlled) the controller kind is `SandboxWarmPool`. `verifySandboxCandidate` adds a same-namespace check via the `ErrCrossNamespaceAdoption` sentinel.
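
The adoptability rules and the update trigger can be sketched on a flattened sandbox struct. The struct, the function `shouldEnqueue`, and both label key values are assumptions for illustration: the warm-pool key in particular is a placeholder, since its real value is not given here.

```go
package main

import "fmt"

const (
	// Placeholder key; the real value of warmPoolSandboxLabel is not shown in this page.
	warmPoolLabel = "example.invalid/warm-pool"
	// Assumed to match the template-ref hash label named earlier in this page.
	templateHashLabel = "agents.x-k8s.io/sandbox-template-ref-hash"
)

// sandbox is a flattened stand-in for the fields the predicate inspects.
type sandbox struct {
	Deleting       bool
	Labels         map[string]string
	ControllerKind string // "" when there is no controller owner reference
}

// isAdoptable applies the four requirements listed above.
func isAdoptable(sb sandbox) bool {
	if sb.Deleting {
		return false
	}
	if _, ok := sb.Labels[warmPoolLabel]; !ok {
		return false
	}
	if _, ok := sb.Labels[templateHashLabel]; !ok {
		return false
	}
	return sb.ControllerKind == "" || sb.ControllerKind == "SandboxWarmPool"
}

// shouldEnqueue mirrors the Update rule: enqueue on a transition into
// adoptable, or when an adoptable sandbox changes its template hash.
func shouldEnqueue(oldSb, newSb sandbox) bool {
	if !isAdoptable(newSb) {
		return false
	}
	return !isAdoptable(oldSb) ||
		oldSb.Labels[templateHashLabel] != newSb.Labels[templateHashLabel]
}

func main() {
	warm := sandbox{
		Labels:         map[string]string{warmPoolLabel: "pool", templateHashLabel: "abc"},
		ControllerKind: "SandboxWarmPool",
	}
	claimed := warm
	claimed.ControllerKind = "SandboxClaim"

	fmt.Println(isAdoptable(warm))            // prints "true"
	fmt.Println(isAdoptable(claimed))         // prints "false"
	fmt.Println(shouldEnqueue(claimed, warm)) // prints "true"
	fmt.Println(shouldEnqueue(warm, warm))    // prints "false"
}
```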

Sources: [extensions/controllers/sandboxclaim_controller.go:1228-1297](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:1446-1565](extensions/controllers/sandboxclaim_controller.go)

## Secure defaults and legacy cleanup

`ApplySandboxSecureDefaults` (in `utils.go`) is applied once by `createSandbox`:

- Sets `AutomountServiceAccountToken = false` when unset.
- When the template uses managed network policy (`NetworkPolicyManagement == ""` or `Managed`) *and* defines no `NetworkPolicy`, the controller is in "Secure by Default" mode: it overrides `DNSPolicy = None` and injects external resolvers `8.8.8.8`, `1.1.1.1`, on the theory that internal DNS would let a sandbox enumerate cluster services. Custom rules or `Unmanaged` mode leave DNS alone so air-gapped or proxied environments still work.
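
The two defaults above can be sketched on flattened structs. `podSpec`, `templateSpec`, and `applySecureDefaults` are hypothetical stand-ins; the real function operates on the Kubernetes pod spec and the template's network policy fields.

```go
package main

import "fmt"

// podSpec and templateSpec are flattened stand-ins for the real API types.
type podSpec struct {
	AutomountServiceAccountToken *bool
	DNSPolicy                    string
	Nameservers                  []string
}

type templateSpec struct {
	NetworkPolicyManagement string // "", "Managed", or "Unmanaged"
	HasNetworkPolicy        bool   // true when the template defines custom rules
}

// applySecureDefaults mutates p in place, following the two rules above.
func applySecureDefaults(t templateSpec, p *podSpec) {
	// Only default the token mount when the template left it unset.
	if p.AutomountServiceAccountToken == nil {
		disabled := false
		p.AutomountServiceAccountToken = &disabled
	}
	managed := t.NetworkPolicyManagement == "" || t.NetworkPolicyManagement == "Managed"
	if managed && !t.HasNetworkPolicy {
		// Secure-by-default mode: bypass cluster DNS entirely.
		p.DNSPolicy = "None"
		p.Nameservers = []string{"8.8.8.8", "1.1.1.1"}
	}
}

func main() {
	secure := podSpec{DNSPolicy: "ClusterFirst"}
	applySecureDefaults(templateSpec{}, &secure)
	fmt.Println(secure.DNSPolicy, secure.Nameservers) // prints "None [8.8.8.8 1.1.1.1]"

	custom := podSpec{DNSPolicy: "ClusterFirst"}
	applySecureDefaults(templateSpec{NetworkPolicyManagement: "Unmanaged"}, &custom)
	fmt.Println(custom.DNSPolicy) // prints "ClusterFirst"
}
```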

`cleanupLegacyNetworkPolicy` runs every reconcile and idempotently deletes a deprecated per-claim `NetworkPolicy` named `<claim>-network-policy`, but only if the policy is actually controlled by the claim — a user-created policy with the same reserved name is logged and left alone. Errors are non-fatal so a transient API issue cannot block sandbox provisioning.

Sources: [extensions/controllers/utils.go:23-59](extensions/controllers/utils.go), [extensions/controllers/sandboxclaim_controller.go:1300-1332](extensions/controllers/sandboxclaim_controller.go)

## Latency metrics and observability

The reconciler records four metrics tied to launch type (`cold` / `warm`), template, and namespace:

- `RecordSandboxClaimCreation` at create/adopt time (with pool name and ready/not-ready state).
- `RecordClaimStartupLatency` from the webhook-stamped `WebhookAnnotation` time to the moment `Ready=True` is first observed.
- `RecordClaimControllerStartupLatency` from the controller-stamped `ObservabilityAnnotation` time.
- `RecordSandboxCreationLatency` from `sandbox.CreationTimestamp` to the underlying `Sandbox`'s `Ready=True` `LastTransitionTime`.

`recordCreationLatencyMetric` only fires on the *first* transition to `Ready=True` (`oldReady != True && newReady == True`). On re-reconciles after Ready, it also drains any `observedTimes` entry that a post-Ready `UpdateFunc` may have re-added, preventing duplicate latency emissions.

`getLaunchType` distinguishes warm vs cold by the presence of `SandboxPodNameAnnotation` on the sandbox — warm-adopted sandboxes are stamped with their pre-existing pod name in `completeAdoption`, while cold-started ones leave the annotation empty.

Sources: [extensions/controllers/sandboxclaim_controller.go:1334-1434](extensions/controllers/sandboxclaim_controller.go), [extensions/controllers/sandboxclaim_controller.go:748-755](extensions/controllers/sandboxclaim_controller.go)

## Summary

The `SandboxClaimReconciler` is small in surface but dense in invariants. A single `Reconcile` pass picks an active-vs-expired branch, threads a strict template/metadata/env policy through both cold creation and warm adoption, and enforces 1:1 sandbox ownership through a pop-based queue plus an optimistic claim-label lock. It mirrors the `Finished` condition so TTL-after-finished can drive a second-pass cleanup, and it routes expiration through three `ShutdownPolicy` modes, with foreground propagation when full subtree teardown is required. The watches, predicates, and ghost-pod handling around the warm pool queue make this the controller most worth reading carefully when changing claim, sandbox, or warm-pool semantics.
