# Sandbox Reconciler

> Reconciliation loop for the core Sandbox: pod/PVC/service materialization, identity propagation, status conditions, scale subresource, and the cluster-domain FQDN logic.

- Repository: kubernetes-sigs/agent-sandbox
- GitHub: https://github.com/kubernetes-sigs/agent-sandbox
- Human wiki: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a
- Complete Markdown: https://grok-wiki.com/public/wiki/kubernetes-sigs-agent-sandbox-c3f2597a654a/llms-full.txt

## Source Files

- `controllers/sandbox_controller.go`
- `controllers/sandbox_controller_test.go`
- `controllers/testmain_test.go`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [controllers/sandbox_controller.go](controllers/sandbox_controller.go)
- [controllers/sandbox_controller_test.go](controllers/sandbox_controller_test.go)
- [controllers/testmain_test.go](controllers/testmain_test.go)
- [api/v1beta1/sandbox_types.go](api/v1beta1/sandbox_types.go)
- [cmd/agent-sandbox-controller/main.go](cmd/agent-sandbox-controller/main.go)
</details>


The Sandbox reconciler is the controller that turns a `Sandbox` custom resource into a running set of Kubernetes primitives: a single `Pod`, an optional headless `Service`, and one `PersistentVolumeClaim` per entry in `spec.volumeClaimTemplates`. It enforces single-controller ownership of those primitives, propagates a hash-based identity label and pod template metadata down to the Pod, surfaces overall state through three status conditions, drives the CRD's `scale` subresource, and computes the service FQDN from the controller's configured cluster domain. It also implements an expiry path (`shutdownTime` + `shutdownPolicy`) that tears the live resources down while keeping terminal status conditions intact.

This page walks the reconcile loop end-to-end against `controllers/sandbox_controller.go`, including the warm-pool pod adoption path that lets the Sandbox attach to an existing Pod rather than always creating one. The `Reconcile` entry point and its `reconcileChildResources` body assume `spec.replicas` is either 0 or 1 — the CRD scale subresource is intentionally constrained to that range.

## Controller wiring and configuration

`SandboxReconciler` is a small `client.Client`-backed struct with three injected dependencies: the runtime `Scheme`, a metrics/tracing `Instrumenter`, and the cluster's DNS suffix used to build service FQDNs. The controller is wired in `cmd/agent-sandbox-controller/main.go`, which exposes the suffix as a `--cluster-domain` flag defaulting to `cluster.local`.

```go
// controllers/sandbox_controller.go
type SandboxReconciler struct {
    client.Client
    Scheme        *runtime.Scheme
    Tracer        asmetrics.Instrumenter
    ClusterDomain string
}
```

`SetupWithManager` registers the controller for `Sandbox` resources and uses `Owns(...)` for `Pod` and `Service`. Both owned watches are filtered with a `LabelSelectorPredicate` that only fires for objects carrying the `agents.x-k8s.io/sandbox-name-hash` label, so warm-pool Pods that don't yet belong to a Sandbox don't enqueue spurious reconciles. The maximum concurrency is plumbed through from the binary's `concurrentWorkers` flag.

Sources: [controllers/sandbox_controller.go:122-128](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1129-1150](controllers/sandbox_controller.go), [cmd/agent-sandbox-controller/main.go:71](cmd/agent-sandbox-controller/main.go), [cmd/agent-sandbox-controller/main.go:236-244](cmd/agent-sandbox-controller/main.go)

```text
                         ┌──────────────────────────────┐
                         │       SandboxReconciler      │
                         │  Client, Scheme, Tracer,     │
                         │  ClusterDomain               │
                         └──────────────┬───────────────┘
                                        │
                                        ▼
                                   Reconcile(req)
                                        │
              ┌─────────────┬───────────┼────────────┬────────────────┐
              ▼             ▼           ▼            ▼                ▼
        reconcilePVCs   reconcilePod  reconcile   computeConditions  updateStatus
       (one per VCT)    (≤1 pod)      Service     (Suspended/        (status subresource;
                                      (headless,  Ready/Finished)    skipped when unchanged)
                                      ClusterIP=None)
```

## Reconcile entry point

The top-level `Reconcile` function does the following in order, gating each step on the previous one:

1. Load the `Sandbox`. A NotFound is treated as a successful no-op so deletions are quiet.
2. Open a tracing span (`ReconcileSandbox`) and, the first time the sandbox is seen, write the active trace ID into the `agents.x-k8s.io/trace-context` annotation via a `MergeFrom` patch. The patch is applied inline, and reconciliation continues in the same pass rather than waiting for a re-reconcile.
3. Short-circuit if `DeletionTimestamp` is non-zero. Garbage collection of owned children is delegated to Kubernetes via controller references; the reconciler does not finalize anything.
4. Default `spec.replicas` to 1 when nil.
5. Branch on expiry: if `checkSandboxExpiry` returns true, run `handleSandboxExpiry`; otherwise call `reconcileChildResources` and then re-check expiry to set `RequeueAfter`.
6. Persist status via `updateStatus`, which skips the API call when `oldStatus` `DeepEqual`s the new status.

Errors from child reconciliation and from `updateStatus` are joined with `errors.Join` so a status failure does not mask the original error.

Sources: [controllers/sandbox_controller.go:148-228](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:431-445](controllers/sandbox_controller.go)

```mermaid
sequenceDiagram
    participant K as controller-runtime
    participant R as SandboxReconciler
    participant API as kube-apiserver
    K->>R: Reconcile(req)
    R->>API: Get Sandbox
    R->>R: StartSpan + maybe patch trace-context
    alt DeletionTimestamp set
        R-->>K: ctrl.Result{}, nil
    else expired
        R->>R: setSandboxExpiredCondition (if not yet marked)
        R->>R: handleSandboxExpiry (delete Pod/Service, maybe delete Sandbox)
    else normal path
        R->>API: reconcilePVCs (per volumeClaimTemplates)
        R->>API: reconcilePod (Get/Create/Adopt)
        R->>API: reconcileService (Get/Create/Adopt/Delete)
        R->>R: computeConditions (Suspended, Ready, Finished)
        R->>R: checkSandboxExpiry → RequeueAfter
    end
    R->>API: Status().Update if changed
    R-->>K: ctrl.Result{RequeueAfter}, err
```

## Identity: the sandbox-name-hash label

Every owned object is stamped with the label `agents.x-k8s.io/sandbox-name-hash`. The value is an 8-character lowercase hex FNV-1a hash of the sandbox name, computed by `NameHash`:

```go
// controllers/sandbox_controller.go
const sandboxLabel = "agents.x-k8s.io/sandbox-name-hash"

func NameHash(objectName string) string {
    return fmt.Sprintf("%08x", GetNumericHash(objectName))
}
```
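
The page does not show the body of `GetNumericHash`. As a rough sketch, assuming it is the 32-bit FNV-1a variant from Go's standard `hash/fnv` package (an assumption, not confirmed by the source), the label value could be derived as follows. The lowercase names `getNumericHash` and `nameHash` are stand-ins for illustration.

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// getNumericHash is a hypothetical stand-in for GetNumericHash,
// assumed here to be 32-bit FNV-1a over the raw name bytes.
func getNumericHash(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s)) // Write on a hash.Hash never returns an error
    return h.Sum32()
}

// nameHash formats the hash as 8 zero-padded lowercase hex characters.
func nameHash(objectName string) string {
    return fmt.Sprintf("%08x", getNumericHash(objectName))
}

func main() {
    fmt.Println(nameHash("my-sandbox"))
    // The output length is fixed regardless of the input length.
    fmt.Println(len(nameHash("a-much-longer-sandbox-name-than-usual")))
}
```

A fixed 8-character value stays well inside the 63-character limit on Kubernetes label values, whatever the length of the sandbox name.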

This hash powers four things that all need to agree:

- The label on the Pod, the Service, and (on create) the PVC.
- The Service's `spec.selector`, which is rewritten to `{sandboxLabel: nameHash}` on adoption and on drift.
- The `Pods` listing inside `reconcilePod`, which uses a `labels.Selector` to enumerate matching Pods in the namespace (with a warning log if more than one is found, since a Sandbox is expected to own at most one).
- The watch predicate in `SetupWithManager`, which only enqueues Pods and Services that carry the label key.

`sandbox.Status.LabelSelector` is published in the `<label>=<hash>` form so the CRD's scale subresource (declared `selectorpath=.status.selector`) can be used by `kubectl scale` and HPA-style clients.

Sources: [controllers/sandbox_controller.go:49-53](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:230-271](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:447-458](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1130-1150](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:225-244](api/v1beta1/sandbox_types.go)

## Resource ownership model

Before mutating any object, the reconciler classifies its relationship to the current Sandbox using `checkOwnership`, which inspects `metav1.GetControllerOf` and returns one of three states:

| State | Trigger | Reaction |
|---|---|---|
| `resourceOwnedBySandbox` | `controllerRef.UID == sandbox.UID` | Drive drift back to desired (labels, selectors, metadata). Delete on shrink/expiry. |
| `resourceUnowned` | No controllerRef on the object | Adopt by calling `ctrl.SetControllerReference`. Services carry extra preconditions: adoption is gated on `spec.service == true`, so an unset `service` leaves an unowned Service alone. |
| `resourceOwnedByOther` | A different controllerRef | Refuse to touch the object. For Pods, return a hard error so `Ready` flips to `ReconcilerError`. For Services, the reconciler returns an error from `reconcileService`. |

This three-way classification is used identically in `reconcilePod`, `reconcileService`, `reconcilePVCs`, and `handleSandboxExpiry`, so adoption-vs-refusal is consistent across the entire lifecycle.
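
The classification can be sketched without any Kubernetes dependencies. In the example below, `ownerRef` is a simplified stand-in for `metav1.OwnerReference`, and `classify` is a hypothetical approximation of `checkOwnership`; the real function's signature is not shown on this page.

```go
package main

import "fmt"

type ownership int

const (
    ownedBySandbox ownership = iota
    unowned
    ownedByOther
)

// ownerRef is a simplified stand-in for metav1.OwnerReference.
type ownerRef struct {
    UID        string
    Controller bool
}

// classify finds the controller reference, if any, and compares its UID
// with the sandbox UID. Non-controller owners are ignored.
func classify(refs []ownerRef, sandboxUID string) ownership {
    for _, ref := range refs {
        if !ref.Controller {
            continue
        }
        if ref.UID == sandboxUID {
            return ownedBySandbox
        }
        return ownedByOther
    }
    return unowned
}

func main() {
    refs := []ownerRef{{UID: "gc-only"}, {UID: "sb-1", Controller: true}}
    fmt.Println(classify(refs, "sb-1") == ownedBySandbox)
    fmt.Println(classify(nil, "sb-1") == unowned)
}
```

Note that an object with only non-controller owner references still counts as unowned in this sketch, which is what makes it eligible for adoption.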

Sources: [controllers/sandbox_controller.go:55-80](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:509-590](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:671-781](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:945-1011](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1014-1063](controllers/sandbox_controller.go)

## Pod reconciliation

`reconcilePod` is the most involved subroutine. It must support three scenarios with one code path:

1. **Fresh creation** — no Pod exists; create one named after the Sandbox.
2. **Warm-pool adoption** — a pre-existing Pod whose name is tracked in the sandbox annotation `agents.x-k8s.io/pod-name`. The reconciler reads that annotation through `resolvePodName`, so the Pod's name may differ from `sandbox.Name`.
3. **Suspend** — `spec.replicas == 0` deletes the Pod (if owned) and clears the tracking annotation.

Key behaviors:

- The reconciler first does a label-based `List` for diagnostic purposes (logs a warning if more than one Pod matches), then a direct `Get` on the resolved pod name.
- If the annotated pod has gone missing, `clearPodNameAnnotation` removes the annotation via a `MergeFrom` patch so the next reconcile can fall back to creating a Pod named after the Sandbox.
- On adoption of an unowned Pod, `ctrl.SetControllerReference` is called and `updatePodMetadata` propagates labels and annotations from `spec.podTemplate.metadata` to the live Pod.
- On create, the desired pod is constructed from a deep-copied `spec.podTemplate.Spec`, with `MergeVolumeClaimVolumes` overlaying any `volumeClaimTemplates`-derived volumes by name. The newly built `Pod` gets the sandbox-name-hash label plus every label/annotation from the pod template. The set of keys it stamped is recorded in `agents.x-k8s.io/propagated-labels` and `agents.x-k8s.io/propagated-annotations` (comma-separated, sorted) so a later reconcile can detect and *remove* keys that have since been dropped from the template.
- Pod create races (`AlreadyExists`) fall back to a `Get` plus `reconcileExistingPod`, so a controller crash mid-create is recoverable.
- An `ensurePodNameAnnotation` closure writes the annotation after a successful create/adopt — but only if the sandbox doesn't already track a *different* pod name, to avoid hijacking a warm-pool record.

```go
// controllers/sandbox_controller.go
func resolvePodName(sandbox *sandboxv1beta1.Sandbox) string {
    if name, ok := sandbox.Annotations[sandboxv1beta1.SandboxPodNameAnnotation]; ok && name != "" {
        return name
    }
    return sandbox.Name
}
```

Sources: [controllers/sandbox_controller.go:82-90](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:92-110](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:597-609](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:623-864](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:866-943](controllers/sandbox_controller.go)

### Label and annotation propagation

`updatePodMetadata` is the source of truth for keeping the live Pod's metadata in sync with the template. It implements three jobs in one pass:

1. Force the sandbox-name-hash label to the current hash.
2. Apply every `(k, v)` from `spec.podTemplate.metadata.labels` and `.annotations`, updating only on differences.
3. Use the `propagated-labels` / `propagated-annotations` tracking annotations to delete keys that the template no longer mentions. The new key list (sorted) is then written back. Without this bookkeeping, a key removed from the template would linger on the Pod indefinitely, because an imperative `Update` call, unlike a three-way merge, carries no record of what the controller applied previously.
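
The tracking-annotation bookkeeping can be sketched on plain maps. This is an illustration under stated assumptions rather than the body of `updatePodMetadata`: the `syncPropagated` helper is hypothetical, and it handles a single key space (labels or annotations) at a time.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
)

// syncPropagated removes keys that were tracked earlier but have since been
// dropped from the template, applies the current template keys to live,
// and returns the new sorted, comma-separated tracking value.
func syncPropagated(live, template map[string]string, tracked string) string {
    for _, k := range strings.Split(tracked, ",") {
        if _, still := template[k]; k != "" && !still {
            delete(live, k) // previously propagated, no longer wanted
        }
    }
    keys := make([]string, 0, len(template))
    for k, v := range template {
        live[k] = v
        keys = append(keys, k)
    }
    sort.Strings(keys)
    return strings.Join(keys, ",")
}

func main() {
    live := map[string]string{"team": "a", "tier": "gold", "manual": "keep"}
    template := map[string]string{"team": "b"}
    tracked := syncPropagated(live, template, "team,tier")
    fmt.Println(tracked, live)
}
```

In this run, `tier` is removed because it was tracked but is no longer in the template, while `manual` survives because the controller never recorded it as propagated.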

Sources: [controllers/sandbox_controller.go:866-943](controllers/sandbox_controller.go)

## Service reconciliation and cluster-domain FQDN

`reconcileService` produces a single headless service per Sandbox. Its desired state is keyed off the optional `spec.service *bool`:

| `spec.service` | Service exists & ownership | Action |
|---|---|---|
| `nil` | not found | No-op; clear `status.service`/`status.serviceFQDN`. |
| `nil` | found, owned-by-sandbox | Reconcile drift on labels and selector. |
| `nil` | found, unowned | Leave as-is (backward compatibility), but `computeReadyCondition` still requires it for Ready. |
| `nil` | found, owned-by-other | Error (`refusing to use service`). |
| `true` | not found | Create a headless service (`ClusterIP: None`) named after the sandbox. |
| `true` | found, unowned | Adopt — but only if `service.spec.clusterIP` is `None` or empty, because `clusterIP` is immutable. |
| `true` | found, owned-by-sandbox | Patch back label and selector to `{sandboxLabel: nameHash}`. |
| `false` | found, owned-by-sandbox | Delete the service. |
| `false` | any other state | Do not delete; clear status. |
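
The decision table can be condensed into a small pure function. The sketch below uses hypothetical names (`serviceState`, `action`, `decide`) and makes one assumption the table does not spell out: `spec.service == true` with a Service owned by another controller is treated as an error, in line with the ownership model described earlier. The `clusterIP` precondition on adoption is left out.

```go
package main

import "fmt"

// serviceState is a simplified stand-in for "does the Service exist,
// and who controls it".
type serviceState int

const (
    svcNotFound serviceState = iota
    svcOwnedBySandbox
    svcUnowned
    svcOwnedByOther
)

type action string

const (
    actNone   action = "none" // leave the cluster alone
    actCreate action = "create"
    actAdopt  action = "adopt" // still subject to the clusterIP check
    actDrift  action = "reconcile-drift"
    actDelete action = "delete"
    actError  action = "error"
)

// decide maps (spec.service, observed state) to an action.
func decide(want *bool, state serviceState) action {
    if want != nil && !*want {
        if state == svcOwnedBySandbox {
            return actDelete
        }
        return actNone // never delete what this Sandbox does not own
    }
    switch state {
    case svcOwnedByOther:
        return actError
    case svcOwnedBySandbox:
        return actDrift
    case svcUnowned:
        if want == nil {
            return actNone // backward compatibility: leave as-is
        }
        return actAdopt
    default: // svcNotFound
        if want == nil {
            return actNone
        }
        return actCreate
    }
}

func main() {
    yes := true
    fmt.Println(decide(nil, svcUnowned), decide(&yes, svcUnowned), decide(&yes, svcNotFound))
}
```

Keeping the decision separate from the API calls makes each row of the table easy to exercise in a unit test.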

`setServiceStatus` writes both `status.service` and `status.serviceFQDN`, where the FQDN is constructed by string concatenation:

```go
// controllers/sandbox_controller.go
sandbox.Status.ServiceFQDN = service.Name + "." + service.Namespace + ".svc." + r.ClusterDomain
```

The cluster domain comes from the `--cluster-domain` flag (default `cluster.local`); `TestSetServiceStatusCustomDomain` exercises both the default and a custom value such as `custom.domain`.
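
To show how the flag value flows into the FQDN, here is a self-contained sketch built on the standard `flag` package. The helper names `parseClusterDomain` and `serviceFQDN` are hypothetical, and the flag's help text is invented for the example; only the flag name, its default, and the concatenation come from the page.

```go
package main

import (
    "flag"
    "fmt"
)

// parseClusterDomain parses an argument list the way a --cluster-domain
// flag with a cluster.local default could be handled.
func parseClusterDomain(args []string) (string, error) {
    fs := flag.NewFlagSet("agent-sandbox-controller", flag.ContinueOnError)
    domain := fs.String("cluster-domain", "cluster.local", "DNS suffix of the cluster")
    if err := fs.Parse(args); err != nil {
        return "", err
    }
    return *domain, nil
}

// serviceFQDN mirrors the concatenation shown above.
func serviceFQDN(name, namespace, clusterDomain string) string {
    return name + "." + namespace + ".svc." + clusterDomain
}

func main() {
    domain, err := parseClusterDomain([]string{"--cluster-domain=custom.domain"})
    if err != nil {
        fmt.Println("flag error:", err)
        return
    }
    fmt.Println(serviceFQDN("my-sandbox", "default", domain))
}
```

Because the suffix is taken verbatim from the flag, a misconfigured value yields a well-formed but unresolvable name, so the flag should match the cluster's actual DNS configuration.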

Sources: [controllers/sandbox_controller.go:460-594](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:611-621](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:2429-2463](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:194-222](api/v1beta1/sandbox_types.go)

## PVC reconciliation

`reconcilePVCs` iterates `spec.volumeClaimTemplates` and, for each entry named `vct`, manages a PVC named `vct + "-" + sandbox.Name` (the same naming convention that `reconcilePod` uses when wiring `corev1.Volume` entries through `MergeVolumeClaimVolumes`).

- If the PVC exists and is owned by the Sandbox, nothing happens.
- If it exists but is unowned, `ctrl.SetControllerReference` is called and the object is `Update`d to take ownership.
- If it exists but is owned by another controller, `reconcilePVCs` returns an error which propagates up into the Ready condition as `ReconcilerError`.
- If it does not exist, the controller creates it using the template's `Spec`, copying labels/annotations from the template and adding the sandbox-name-hash label.

PVC deletion follows the standard owner-reference garbage collection: nothing in this reconciler deletes a PVC explicitly, even on `replicas=0` or expiry.
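
The naming convention and the by-name overlay can be sketched with simplified types. The `volume` struct stands in for `corev1.Volume`, and `mergeClaimVolumes` is a hypothetical approximation of `MergeVolumeClaimVolumes`; it assumes that a template-derived volume replaces a pod-template volume with the same name and is appended otherwise.

```go
package main

import "fmt"

// volume is a simplified stand-in for corev1.Volume.
type volume struct {
    Name      string
    ClaimName string // empty means "not backed by a PVC"
}

// pvcName follows the convention described above: template name,
// a dash, then the sandbox name.
func pvcName(templateName, sandboxName string) string {
    return templateName + "-" + sandboxName
}

// mergeClaimVolumes overlays claim-backed volumes onto the pod's volumes
// by name, without mutating the input slice.
func mergeClaimVolumes(podVolumes []volume, templateNames []string, sandboxName string) []volume {
    out := append([]volume(nil), podVolumes...)
    for _, t := range templateNames {
        v := volume{Name: t, ClaimName: pvcName(t, sandboxName)}
        replaced := false
        for i := range out {
            if out[i].Name == t {
                out[i] = v
                replaced = true
                break
            }
        }
        if !replaced {
            out = append(out, v)
        }
    }
    return out
}

func main() {
    pod := []volume{{Name: "data"}, {Name: "cache"}}
    fmt.Println(mergeClaimVolumes(pod, []string{"data", "logs"}, "sb"))
}
```

Deriving both the PVC name and the volume's claim reference from the same helper keeps the two call sites from drifting apart.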

Sources: [controllers/sandbox_controller.go:92-110](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:809-824](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:945-1011](controllers/sandbox_controller.go)

## Status conditions

`computeConditions` produces up to three `metav1.Condition` values per pass. `Suspended` and `Finished` are conditional; `Ready` is always present. Each condition carries `ObservedGeneration: sandbox.Generation`. The reconciler also explicitly *removes* the `Finished` condition when neither `PodSucceeded` nor `PodFailed` is observed in the current pass, so a Pod that's been recreated does not carry over a stale `Finished=True`.

| Type | Status / Reason | When set |
|---|---|---|
| `Ready` | `True` / `DependenciesReady` | Pod is `Running`, Pod Ready is `True`, `len(PodIPs) > 0`, and the service requirement is satisfied. |
| `Ready` | `False` / `DependenciesNotReady` | Default for any not-yet-ready dependency state; message describes Pod phase and whether the Service exists. |
| `Ready` | `False` / `ReconcilerError` | Any error returned from `reconcileChildResources`. Message is `"Error seen: " + err.Error()`. |
| `Ready` | `False` / `SandboxSuspended` | `spec.replicas == 0`. Message distinguishes "suspending" (pod still around) from "suspended" (no pod). |
| `Ready` | `False` / `SandboxExpired` | Set by `setSandboxExpiredCondition` and persisted by `handleSandboxExpiry`. |
| `Suspended` | `True` / `PodTerminated` | `spec.replicas == 0` and no Pod exists. |
| `Suspended` | `False` / `PodNotTerminated` | `spec.replicas == 0` but the Pod is still present. |
| `Finished` | `True` / `PodSucceeded` | Pod phase is `Succeeded`. |
| `Finished` | `True` / `PodFailed` | Pod phase is `Failed`. |

A service is "required" for Ready when either `spec.service == true` or a Service currently exists (the backward-compatibility branch around `controllers/sandbox_controller.go:367-372`). `TestComputeConditions` enumerates a dozen permutations that pin down this matrix.
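
The `Ready=True` row and the service requirement can be expressed as a small predicate. This is a sketch with hypothetical names (`podView`, `serviceRequired`, `isReady`) and simplified inputs; it covers only the `DependenciesReady` decision, not the suspended, expired, or error reasons.

```go
package main

import "fmt"

// podView is a simplified stand-in for the Pod fields the Ready check reads.
type podView struct {
    Exists bool
    Phase  string
    Ready  bool // the Pod's own Ready condition
    IPs    []string
}

// serviceRequired: a Service is needed when spec.service is true or when
// one currently exists (the backward-compatibility branch).
func serviceRequired(specService *bool, serviceExists bool) bool {
    return (specService != nil && *specService) || serviceExists
}

// isReady reports whether the DependenciesReady row of the table applies.
func isReady(pod podView, specService *bool, serviceExists bool) bool {
    if !pod.Exists || pod.Phase != "Running" || !pod.Ready || len(pod.IPs) == 0 {
        return false
    }
    if serviceRequired(specService, serviceExists) && !serviceExists {
        return false
    }
    return true
}

func main() {
    yes := true
    pod := podView{Exists: true, Phase: "Running", Ready: true, IPs: []string{"10.0.0.5"}}
    fmt.Println(isReady(pod, nil, false), isReady(pod, &yes, false), isReady(pod, &yes, true))
}
```

With `spec.service` unset and no Service present, the Pod alone decides readiness; once `spec.service` is `true`, a missing Service keeps `Ready` at `False`.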

Sources: [controllers/sandbox_controller.go:256-417](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:63-239](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:24-54](api/v1beta1/sandbox_types.go)

## Scale subresource and pod IP surfacing

The `Sandbox` CRD declares both `+kubebuilder:subresource:status` and `+kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.replicas,selectorpath=.status.selector`. On each pass the reconciler keeps the two status fields that the scale subresource reads, along with the Pod IP list, coherent:

- `status.replicas` is set to `1` when a Pod is present, `0` otherwise.
- `status.selector` is filled with `<sandboxLabel>=<NameHash(sandbox.Name)>` when the Pod exists.
- `status.podIPs` is mirrored from `pod.Status.PodIPs` (dual-stack aware) via `podIPsFromStatus`.

When `replicas` is 0, the Pod-IP and selector status fields are cleared in the same block. The `replicas` field itself only ever takes the values 0 or 1 because of the CRD's `minimum=0, maximum=1` markers.

Sources: [controllers/sandbox_controller.go:241-250](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:419-429](controllers/sandbox_controller.go), [api/v1beta1/sandbox_types.go:148-156](api/v1beta1/sandbox_types.go), [api/v1beta1/sandbox_types.go:209-222](api/v1beta1/sandbox_types.go), [api/v1beta1/sandbox_types.go:225-244](api/v1beta1/sandbox_types.go)

## Lifecycle: shutdownTime and shutdownPolicy

The inlined `Lifecycle` substruct adds two fields to `SandboxSpec`: `shutdownTime` (absolute) and `shutdownPolicy` (`Delete` or `Retain`, defaulting to `Retain`). `checkSandboxExpiry` returns whether the sandbox is past its `shutdownTime` and, if not, how long to wait before reconsidering. The wait is clamped to a 2-second minimum so reconcile thrash is bounded:

```go
// controllers/sandbox_controller.go
requeueAfter := max(remainingTime, 2*time.Second)
```

On expiry, the controller follows a two-pass protocol designed to preserve any terminal `Finished` condition for observability:

```mermaid
stateDiagram-v2
    [*] --> Live: shutdownTime in future
    Live --> ExpiringMarked: shutdownTime ≤ now\nsetSandboxExpiredCondition
    ExpiringMarked --> ExpiringMarked: requeue (immediateRequeueDelay)
    ExpiringMarked --> Cleaning: sandboxMarkedExpired = true
    Cleaning --> Retained: ShutdownPolicy=Retain\nPod & Service deleted\nstatus.Conditions preserved
    Cleaning --> Deleted: ShutdownPolicy=Delete\nSandbox object deleted
    Retained --> [*]
    Deleted --> [*]
```

Pass 1 (`Reconcile` line 198) sets the `Ready=False / SandboxExpired` condition, updates status, and returns `RequeueAfter: immediateRequeueDelay` so the next pass observes the marker via `sandboxMarkedExpired`.

Pass 2 calls `handleSandboxExpiry`, which:

1. Deletes the Pod if owned by this Sandbox; logs and skips deletion if it's unowned or owned by another controller.
2. Deletes the Service under the same ownership rule.
3. If `shutdownPolicy == Delete`, deletes the Sandbox itself and returns `sandboxDeleted = true`, suppressing the trailing status update.
4. Otherwise, resets `sandbox.Status` to an empty struct *while preserving `Conditions`* (so `Finished=PodSucceeded` or `Finished=PodFailed` survives the cleanup), then re-asserts `Ready=False / SandboxExpired`.
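
Step 4 is the subtle part, so here it is on simplified types. The `status` and `condition` structs are stand-ins for the real API types, and `resetForExpiry` and `setCondition` are hypothetical helpers; the sketch only demonstrates the "wipe everything except `Conditions`, then re-assert `Ready`" idea.

```go
package main

import "fmt"

// condition and status are simplified stand-ins for the API types.
type condition struct {
    Type, Status, Reason string
}

type status struct {
    Replicas    int32
    Service     string
    ServiceFQDN string
    PodIPs      []string
    Conditions  []condition
}

// setCondition replaces a condition of the same type or appends a new one.
func setCondition(st *status, c condition) {
    for i := range st.Conditions {
        if st.Conditions[i].Type == c.Type {
            st.Conditions[i] = c
            return
        }
    }
    st.Conditions = append(st.Conditions, c)
}

// resetForExpiry clears every status field except Conditions, then
// re-asserts Ready=False / SandboxExpired.
func resetForExpiry(st *status) {
    kept := st.Conditions
    *st = status{Conditions: kept}
    setCondition(st, condition{Type: "Ready", Status: "False", Reason: "SandboxExpired"})
}

func main() {
    st := status{
        Replicas: 1,
        Service:  "sb",
        PodIPs:   []string{"10.0.0.5"},
        Conditions: []condition{
            {Type: "Ready", Status: "False", Reason: "DependenciesNotReady"},
            {Type: "Finished", Status: "True", Reason: "PodSucceeded"},
        },
    }
    resetForExpiry(&st)
    fmt.Println(st.Replicas, st.Service, len(st.PodIPs), st.Conditions)
}
```

Carrying the conditions slice across the reset is what lets a terminal `Finished` result remain visible after the Pod that produced it is gone.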

`TestSandboxShutdownExpiryUsesTwoPassAndPreservesFinishedCondition` exercises the full sequence for both `PodSucceeded` and `PodFailed` and asserts that the `Finished` condition is still present after the Pod and Service are gone.

Sources: [controllers/sandbox_controller.go:188-228](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1014-1090](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1092-1127](controllers/sandbox_controller.go), [controllers/sandbox_controller_test.go:2244-2306](controllers/sandbox_controller_test.go), [controllers/sandbox_controller_test.go:2308-2427](controllers/sandbox_controller_test.go), [api/v1beta1/sandbox_types.go:168-192](api/v1beta1/sandbox_types.go)

## Status persistence and field ownership

`updateStatus` does a `reflect.DeepEqual` between the snapshot taken at the top of `Reconcile` (`oldStatus`) and the post-reconcile `sandbox.Status`, calling `r.Status().Update` only when they differ. This keeps the controller from generating spurious status revisions that would themselves enqueue reconciles via the watch on `Sandbox`.
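
The skip-when-unchanged pattern can be sketched with a fake writer that counts calls. Everything here except `reflect.DeepEqual` is hypothetical: `status` is a trimmed stand-in for the real status type, and `fakeWriter` replaces the `r.Status().Update` call.

```go
package main

import (
    "fmt"
    "reflect"
)

// status is a trimmed stand-in for the Sandbox status type.
type status struct {
    Replicas int32
    PodIPs   []string
}

// fakeWriter counts how many times a status update would hit the API.
type fakeWriter struct{ calls int }

func (w *fakeWriter) Update(status) error {
    w.calls++
    return nil
}

// updateStatus writes only when the status actually changed.
func updateStatus(w *fakeWriter, oldStatus, newStatus status) error {
    if reflect.DeepEqual(oldStatus, newStatus) {
        return nil // nothing changed, skip the API call
    }
    return w.Update(newStatus)
}

func main() {
    w := &fakeWriter{}
    before := status{Replicas: 1, PodIPs: []string{"10.0.0.5"}}
    same := status{Replicas: 1, PodIPs: []string{"10.0.0.5"}}
    changed := status{Replicas: 0}
    _ = updateStatus(w, before, same)
    _ = updateStatus(w, before, changed)
    fmt.Println(w.calls)
}
```

One detail to keep in mind with this approach: `reflect.DeepEqual` treats a nil slice and an empty non-nil slice as different, so code that clears a field should be consistent about which of the two it uses.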

For the spec/metadata side, the controller mixes two approaches:

- `Create` calls use `client.FieldOwner("sandbox-controller")` so server-side-apply conflict detection points at this controller.
- `Update` calls (for adoption and for patching label/selector drift on owned services) do not specify a field owner; `Patch` with `client.MergeFrom` is used for narrow annotation changes (trace context, pod-name tracking).

Sources: [controllers/sandbox_controller.go:49-53](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:175-186](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:431-445](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:493-499](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:838-850](controllers/sandbox_controller.go), [controllers/sandbox_controller.go:1005-1007](controllers/sandbox_controller.go)

## Summary

The Sandbox reconciler is a single-replica, single-pod controller whose interesting behavior lives at the boundaries: a hash-derived identity label that ties together a Pod, a headless Service, and one PVC per template; a three-state ownership classifier (`resourceOwnedBySandbox` / `resourceUnowned` / `resourceOwnedByOther`) that decides between drive-to-state, adoption, and refusal; a label/annotation propagator that uses tracking annotations to detect deletions; a status surface comprising `Ready`, `Suspended`, and `Finished` plus scale-subresource fields; and a two-pass expiry path that drops live resources but preserves terminal conditions for observability. The cluster-domain FQDN is the simplest of these — a literal `<svc>.<ns>.svc.<--cluster-domain>` concatenation seeded by the controller flag — but it ties the published `status.serviceFQDN` to operator-controlled DNS configuration rather than cluster auto-discovery.
