# Linux Host Execution — Chroot, Namespaces & Cgroups

> How linux-actiond runs actions directly on a Linux host: execroot construction, read-only bind mounts for CAS inputs, private mount and network namespace setup, loopback-only networking, uid/gid drop, PR_SET_NO_NEW_PRIVS, and best-effort cgroup v2 resource limits.

- Repository: hermeticbuild/actiond
- GitHub: https://github.com/hermeticbuild/actiond
- Human wiki: https://grok-wiki.com/public/wiki/hermeticbuild-actiond-796c0ee40e63
- Complete Markdown: https://grok-wiki.com/public/wiki/hermeticbuild-actiond-796c0ee40e63/llms-full.txt

## Source Files

- `src/action_executor.zig`
- `src/action_runner.zig`
- `src/execroot.zig`
- `src/runtime_mount.zig`
- `src/staged_cas_index.zig`
- `src/cas.zig`

---

`linux-actiond` runs REAPI actions directly on a Linux host, with no virtual machine layer. Each action executes inside a sandbox built from several pieces. A per-action chroot contains the execution root. A private mount namespace and a dedicated network namespace isolate the action from the host and from concurrent actions. CAS inputs are exposed as read-only bind mounts, so content is not copied unless it has to be. A cgroup v2 leaf applies, on a best-effort basis, the resource limits that callers declare via REAPI platform properties. Before `execve`, the process drops to UID/GID 65534 and clears all Linux capabilities.

This page covers the full lifecycle from input fetching through execroot construction, namespace setup, privilege drop, and cgroup management, as implemented in the `action_executor`, `action_runner`, `execroot`, and `runtime_mount` modules.

---

## Overview: Execution Path

```text
executeActionWithOptions (action_executor.zig)
  │
  ├─ Fetch Action + Command protos from CAS
  ├─ collectInputs  → flat list of Input{path, digest, is_executable}
  ├─ Materializer.materializeInputs
  │    ├─ [no-chroot] copy blobs directly into work_root
  │    └─ [chroot]    create placeholder files + BindMount{source=CAS blob, target=workspace/path}
  ├─ prepareChrootBaseDirs  → create tmp/, var/tmp/ inside work_root
  ├─ appendRuntimeMounts    → bind-mount libc/etc from runtime_root or squashfs image
  │
  └─ runCommandWithOptions (action_runner.zig)
       ├─ Cgroup.create  → /sys/fs/cgroup/actiond/action-N
       ├─ prepareChrootWritableDirs  → fchown/fchmod for sandbox uid/gid
       ├─ forkAction
       │    ├─ [child] setpgid, write cgroup.procs, PR_SET_NO_NEW_PRIVS
       │    ├─ [child] unshare(NEWNS | NEWNET)
       │    ├─ [child] childBringUpLoopback
       │    ├─ [child] mount / MS_PRIVATE|MS_REC
       │    ├─ [child] mount actiondfs / bind mounts (read-only)
       │    ├─ [child] chroot + chdir
       │    ├─ [child] dropPrivileges (setgroups/setresgid/setresuid/capset=0)
       │    └─ [child] execve
       └─ collectChildResult, poll stdout/stderr, waitpid
```

---

## Execroot Construction

### Work Directory Layout

`executeActionWithOptions` allocates a per-action `work_root` directory. When a runtime root is provided (`options.runtime_root_path != null`), the execroot lives at `work_root/workspace/` and the child chroots into `work_root`, so the execroot appears as `/workspace` inside the sandbox. Without a runtime root, the work root itself becomes the execroot and no chroot is used.

```text
work_root/          ← chroot target (= / inside sandbox)
  workspace/        ← execroot (= /workspace inside sandbox)
    <input files>   ← placeholder empty files for bind-mounted inputs
  tmp/              ← created by prepareChrootBaseDirs
  var/tmp/          ← created by prepareChrootBaseDirs
  lib/              ← bind-mounted from runtime_root/libc/<version>/<arch>/root/lib
  lib64/            ← ...
  usr/lib/          ← ...
  etc/              ← bind-mounted from runtime_root/common/root/etc (or libc etc)
```

The constant `chroot_execroot_prefix = "/workspace/"` documents the in-sandbox path that maps to `work_root/workspace/`.

Sources: [src/action_executor.zig:28](), [src/action_executor.zig:1611-1614]()

### Input Materialization Modes

`execroot.Materializer.materializeInputs` operates in two modes controlled by whether a `chroot_root_path` is set:

| Mode | How CAS blobs appear in execroot |
|---|---|
| **No-chroot** | Blobs are copied byte-for-byte into the work directory via `store.copyToFile` |
| **Chroot (bind-mount)** | A zero-byte placeholder file is created at `workspace/<path>`, then a `BindMount{source: CAS blob path, target: workspace/<path>}` is produced for the kernel |

In chroot mode, the CAS blob itself (`/cas/blobs/sha256/XX/YYYY…`) becomes the bind mount source. This avoids a copy. The input becomes read-only inside the sandbox once the child remounts the bind mount with `MS_RDONLY` (see "Read-Only Bind Mount Sequence"), so no per-file `chmod` pass is needed.

**Executable inputs that are argv[0]** are an exception: they are copied rather than bind-mounted. A bind mount exposes the CAS blob's own inode, including its mode bits, so a blob stored without execute permission for the sandbox uid could not be passed to `execve`. A copy can be given the required mode. (`MS_RDONLY` and `MS_NOSUID` do not by themselves prevent execution; `MS_NOSUID` only disables setuid/setgid bits and file capabilities.)

Tree artifact inputs (directory entries declared as `DirectoryInput`) are bound as whole directory trees: the CAS tree staging directory (`cas/trees/XX/YYYY…`) is bind-mounted directly onto `workspace/<path>`.

Sources: [src/execroot.zig:65-115](), [src/execroot.zig:145-175]()
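
The placement rules above can be condensed into a small decision function. This is a minimal sketch, not code from `execroot.zig`: `Placement` and `choosePlacement` are hypothetical names, and the real `Materializer` also handles directory inputs, which are left out here.

```zig
const std = @import("std");

/// How a single file input ends up in the execroot (illustrative type).
const Placement = enum { copy, bind_mount };

/// Hypothetical helper mirroring the documented rules for file inputs.
fn choosePlacement(chroot_enabled: bool, is_executable_argv0: bool) Placement {
    // No-chroot mode copies every blob into the work directory.
    if (!chroot_enabled) return .copy;
    // The argv[0] executable is copied even in chroot mode.
    if (is_executable_argv0) return .copy;
    // Everything else gets a placeholder file plus a BindMount entry.
    return .bind_mount;
}

test "placement follows the documented rules" {
    try std.testing.expectEqual(Placement.copy, choosePlacement(false, false));
    try std.testing.expectEqual(Placement.bind_mount, choosePlacement(true, false));
}
```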

---

## Namespace Setup

### Namespace Flags

`action_runner.zig` declares the namespace flags used for every action:

```zig
fn actionNamespaceFlags() usize {
    const linux = std.os.linux;
    return linux.CLONE.NEWNS | linux.CLONE.NEWNET;
}
```

`CLONE.NEWNS` gives the child a private copy of the mount table; once propagation is set to private (step 7 below), no `mount` or `umount` in the child affects the host. `CLONE.NEWNET` gives it a fresh network stack whose only interface is `lo`, initially down. No user namespace (`CLONE_NEWUSER`) or PID namespace is created.

Sources: [src/action_runner.zig:14-17](), [src/action_runner.zig:671-672]() (test assertion)
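
The citation above mentions a test assertion on these flags. A check of that kind might look like the sketch below. It is written independently of the repository: the `CLONE_*` constants are spelled out with their Linux ABI values rather than taken from `std.os.linux`, and the test name is made up.

```zig
const std = @import("std");

// Linux ABI values from <linux/sched.h>, written out so the sketch does not
// depend on a particular std version.
const CLONE_NEWNS: usize = 0x00020000;
const CLONE_NEWUSER: usize = 0x10000000;
const CLONE_NEWPID: usize = 0x20000000;
const CLONE_NEWNET: usize = 0x40000000;

fn actionNamespaceFlags() usize {
    return CLONE_NEWNS | CLONE_NEWNET;
}

test "only mount and network namespaces are requested" {
    const flags = actionNamespaceFlags();
    try std.testing.expect((flags & CLONE_NEWNS) != 0);
    try std.testing.expect((flags & CLONE_NEWNET) != 0);
    // Neither a user namespace nor a PID namespace is part of the mask.
    try std.testing.expectEqual(@as(usize, 0), flags & (CLONE_NEWUSER | CLONE_NEWPID));
}
```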

### Child Sandbox Steps (in order)

After `fork()`, the child follows a strict sequence before signalling readiness:

1. **`setpgid(0, 0)`** — Creates a new process group so `kill(-pid, SIGKILL)` terminates all descendants.
2. **Write `"0\n"` to `cgroup.procs`** — Moves the child into the action's cgroup leaf before any work begins.
3. **`prctl(PR_SET_NO_NEW_PRIVS, 1, …)`** — Permanently forbids the process and all descendants from gaining new privileges via `setuid`/`setgid` binaries or file capabilities, even after dropping to an unprivileged uid.
4. **`close_range` (UNSHARE flag)** — Closes all file descriptors above fd 3 (the setup signal pipe) using the kernel `close_range` syscall with `CLOSE_RANGE_UNSHARE`.
5. **`unshare(NEWNS | NEWNET)`** — Enters private mount and network namespaces.
6. **`childBringUpLoopback()`** — Opens an `AF_INET SOCK_DGRAM` socket, reads the `lo` interface flags via `SIOCGIFFLAGS`, sets `IFF_UP`, and writes back via `SIOCSIFFLAGS`. This is the only network interface available inside the sandbox.
7. **`mount(null, "/", null, MS_PRIVATE | MS_REC, 0)`** — Makes all existing mount points private so no shared subtrees can propagate into or out of the namespace.
8. **actiondfs mounts** — If the `actiondfs` kernel module is in use, mount the custom filesystem.
9. **Bind mounts (read-only)** — For each `BindMount`, `mount(source, target, null, MS_BIND, 0)` followed by `mount(null, target, null, MS_BIND | MS_REMOUNT | MS_RDONLY | MS_NOSUID | MS_NODEV, 0)`.
10. **`chroot(work_root)`** — Makes `work_root` the filesystem root of the sandbox; the execroot is then reachable as `/workspace`.
11. **`chdir(chroot_cwd)`** — Sets working directory to `command.working_directory` prefixed with `/workspace`, or `/workspace` itself if empty.
12. **`childDropPrivileges(uid=65534, gid=65534)`** — Calls `setgroups(0, …)`, `setresgid`, `setresuid`, then `capset` with all-zero effective/permitted/inheritable sets.
13. **Write `"1"` to setup pipe** — Parent reads this as the signal that setup succeeded.
14. **`execve`** — Runs the action command.

Sources: [src/action_runner.zig:484-536](), [src/action_runner.zig:560-600]()
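
Step 11 derives the in-sandbox working directory from the REAPI `working_directory` field. The following sketch shows one way to compute it. `chrootCwd` is a hypothetical helper, and it assumes `working_directory` is a relative path with no leading slash, as REAPI specifies.

```zig
const std = @import("std");

/// Hypothetical helper: maps an REAPI working_directory to the cwd used
/// after chroot. An empty value means the execroot itself.
fn chrootCwd(buf: []u8, working_directory: []const u8) ![]const u8 {
    if (working_directory.len == 0) return "/workspace";
    return try std.fmt.bufPrint(buf, "/workspace/{s}", .{working_directory});
}

test "cwd is rooted at /workspace" {
    var buf: [128]u8 = undefined;
    try std.testing.expectEqualStrings("/workspace", try chrootCwd(&buf, ""));
    try std.testing.expectEqualStrings("/workspace/src/app", try chrootCwd(&buf, "src/app"));
}
```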

### Read-Only Bind Mount Sequence

```zig
fn childBindMountReadOnly(mount: BindMount) void {
    const linux = std.os.linux;
    childSyscallName(linux.mount(mount.source.ptr, mount.target.ptr, null, linux.MS.BIND, 0), "mount_bind");
    childSyscallName(linux.mount(
        null,
        mount.target.ptr,
        null,
        linux.MS.BIND | linux.MS.REMOUNT | linux.MS.RDONLY | linux.MS.NOSUID | linux.MS.NODEV,
        0,
    ), "mount_bind_ro");
}
```

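The two-step pattern (bind, then remount read-only) is required because the Linux kernel silently ignores `MS_RDONLY` and the other mount flags (except `MS_REC`) on the initial `MS_BIND` call: the new bind mount inherits the options of the underlying mount. Only a subsequent `MS_BIND | MS_REMOUNT` call can make that one mount point read-only.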

Sources: [src/action_runner.zig:607-620]()

---

## Runtime Mounts (libc / etc)

When an action requires a specific glibc version (declared via the `libc` REAPI platform property), `action_executor` appends additional bind mounts from a pre-discovered `RuntimeMountCache`:

| Target inside sandbox | Sourced from |
|---|---|
| `/lib` | `runtime_root/libc/<version>/<arch>/root/lib` (or `usr/lib` fallback) |
| `/lib64` | `runtime_root/libc/<version>/<arch>/root/lib64` (or `usr/lib64` fallback) |
| `/usr/lib` | `runtime_root/libc/<version>/<arch>/root/usr/lib` |
| `/etc` | `runtime_root/libc/<version>/<arch>/root/etc` (overrides common) |

Supported `libc` property values: `glibc2.31`, `glibc2.35`, `glibc2.39`. Any other non-empty, non-`none` value returns `error.UnsupportedLibcRuntime`.

For actions without a `libc` property, only the common `etc` from `runtime_root/common/root/etc` is mounted.

The runtime root itself can be a pre-mounted directory or an on-demand squashfs image (mounted read-only via a loop device in `runtime_mount.zig`).

Sources: [src/action_executor.zig:33-71](), [src/action_executor.zig:519-560](), [src/runtime_mount.zig:64-100]()
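
The `libc` property handling described above amounts to a small validation step. The sketch below illustrates it under stated assumptions: `resolveLibc` and `LibcError` are hypothetical names, and the real code resolves mount paths from the `RuntimeMountCache` rather than returning the version string.

```zig
const std = @import("std");

const LibcError = error{UnsupportedLibcRuntime};

/// Hypothetical validator: returns the runtime version to mount, or null
/// when the action declares no libc (empty value or "none").
fn resolveLibc(property: []const u8) LibcError!?[]const u8 {
    if (property.len == 0 or std.mem.eql(u8, property, "none")) return null;
    const supported = [_][]const u8{ "glibc2.31", "glibc2.35", "glibc2.39" };
    for (supported) |name| {
        if (std.mem.eql(u8, property, name)) return name;
    }
    return error.UnsupportedLibcRuntime;
}

test "libc property validation" {
    try std.testing.expect((try resolveLibc("none")) == null);
    try std.testing.expectEqualStrings("glibc2.35", (try resolveLibc("glibc2.35")).?);
    try std.testing.expectError(error.UnsupportedLibcRuntime, resolveLibc("musl"));
}
```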

---

## Cgroup v2 Resource Limits

### Cgroup Lifecycle

`Cgroup.create` is called before `fork`. It:

1. Opens (or creates) `/sys/fs/cgroup/actiond/`.
2. Writes `+cpu +memory +pids` to `cgroup.subtree_control` (best-effort; failures are silently ignored).
3. Creates `/sys/fs/cgroup/actiond/action-{monotonic_id}/`.
4. Writes resource limit files as requested.
5. Returns the path to `cgroup.procs` for the child to self-join.

After the action completes, `Cgroup.deinit` writes `"1"` to `cgroup.kill`, which kills every process still in the cgroup (this file exists on Linux 5.14 and later), and then removes the cgroup directory.

If `/sys/fs/cgroup` is not accessible or any creation step fails, a zero-value `Cgroup{}` is returned and execution continues without limits (best-effort semantics).
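
As an illustration of the naming scheme, the leaf directory and its `cgroup.procs` file can be derived from the action id as follows. This is a sketch only: `leafPath` and `procsPath` are hypothetical helpers, and rendering the id in decimal is an assumption based on the `action-N` pattern.

```zig
const std = @import("std");

const cgroup_parent = "/sys/fs/cgroup/actiond";

/// Hypothetical helper: directory of the per-action cgroup leaf.
fn leafPath(buf: []u8, id: u64) ![]const u8 {
    return try std.fmt.bufPrint(buf, "{s}/action-{d}", .{ cgroup_parent, id });
}

/// Hypothetical helper: file the child writes "0\n" to in order to join the leaf.
fn procsPath(buf: []u8, id: u64) ![]const u8 {
    return try std.fmt.bufPrint(buf, "{s}/action-{d}/cgroup.procs", .{ cgroup_parent, id });
}

test "cgroup paths" {
    var buf: [128]u8 = undefined;
    try std.testing.expectEqualStrings("/sys/fs/cgroup/actiond/action-7", try leafPath(&buf, 7));
}
```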

### Platform Properties → Limits

| Platform property name(s) | Cgroup file written | Format |
|---|---|---|
| `limits.memory.bytes`, `memory`, `memory_bytes`, `resources:memory:bytes` | `memory.max` | bytes (`128M`, `1G`, raw int) |
| `limits.cpu.cores`, `cpu`, `cores`, `resources:cpu:cores` | `cpu.max` | `<quota> <period>` where period = 100 000 µs |
| `limits.pids.max`, `pids.max`, `pids` | `pids.max` | integer |

```zig
pub const CgroupLimits = struct {
    memory_max_bytes: ?u64 = null,
    cpu_max_cores: ?u32 = null,
    pids_max: ?u32 = null,
    // ...
};
```

The child writes `"0\n"` to `cgroup.procs` as the second step of its setup sequence, right after `setpgid` and before `PR_SET_NO_NEW_PRIVS`, so the remaining setup work (namespace creation, mounts, privilege drop) is already accounted against the limits.

Sources: [src/action_runner.zig:20-50](), [src/action_runner.zig:178-240]()
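
To make the table's formats concrete, here is a minimal sketch of the two conversions. Both functions are hypothetical. The suffix handling assumes binary multiples (`M` = 2^20, `G` = 2^30), which the source may or may not share, and the CPU quota is assumed to be `cores × period`, which is what the cgroup v2 `cpu.max` semantics imply for whole cores.

```zig
const std = @import("std");

/// Parses "128M", "1G", or a raw integer into bytes.
/// Assumption: suffixes are binary multiples (K = 1024).
fn parseMemoryBytes(text: []const u8) !u64 {
    if (text.len == 0) return error.InvalidLimit;
    const multiplier: u64 = switch (text[text.len - 1]) {
        'K', 'k' => 1 << 10,
        'M', 'm' => 1 << 20,
        'G', 'g' => 1 << 30,
        else => 1,
    };
    const digits = if (multiplier == 1) text else text[0 .. text.len - 1];
    const value = try std.fmt.parseInt(u64, digits, 10);
    // Reject values that would overflow u64 instead of wrapping.
    return try std.math.mul(u64, value, multiplier);
}

const cpu_period_us: u64 = 100_000;

/// Formats the "<quota> <period>" line written to cpu.max.
fn formatCpuMax(buf: []u8, cores: u32) ![]const u8 {
    const quota = @as(u64, cores) * cpu_period_us;
    return try std.fmt.bufPrint(buf, "{d} {d}", .{ quota, cpu_period_us });
}

test "limit formats" {
    var buf: [64]u8 = undefined;
    try std.testing.expectEqual(@as(u64, 134217728), try parseMemoryBytes("128M"));
    try std.testing.expectEqualStrings("200000 100000", try formatCpuMax(&buf, 2));
}
```

With these assumptions, a `cpu` property of `2` yields the `cpu.max` line `200000 100000`, meaning 200 ms of CPU time per 100 ms period.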

---

## Privilege Drop

`childDropPrivileges` executes after `chroot` and `chdir`, immediately before the setup-complete signal:

```zig
fn childDropPrivileges(uid: u32, gid: u32) void {
    const linux = std.os.linux;
    var empty_groups = [_]linux.gid_t{0};
    childSyscallName(linux.setgroups(0, &empty_groups), "setgroups");
    childSyscallName(linux.setresgid(@intCast(gid), @intCast(gid), @intCast(gid)), "setresgid");
    childSyscallName(linux.setresuid(@intCast(uid), @intCast(uid), @intCast(uid)), "setresuid");

    var header = linux.cap_user_header_t{ .version = linux_capability_version_3, .pid = 0 };
    const data = [_]linux.cap_user_data_t{
        .{ .effective = 0, .permitted = 0, .inheritable = 0 },
        .{ .effective = 0, .permitted = 0, .inheritable = 0 },
    };
    _ = linux.capset(&header, &data[0]);
}
```

The default sandbox uid and gid are `65534` (the conventional `nobody` account). `setgroups(0, …)` clears all supplementary groups, and `capset` with all-zero data drops every capability from the effective, permitted, and inheritable sets. Note that the snippet discards the `capset` return value, so a failed `capset` would not abort setup, unlike the three preceding calls, which are checked through `childSyscallName`. Combined with `PR_SET_NO_NEW_PRIVS` (set earlier), the action cannot regain privileges through setuid/setgid binaries or file capabilities.

Sources: [src/action_runner.zig:623-643](), [src/action_runner.zig:60-64]()

---

## actiondfs: FUSE-Free CAS Filesystem

When `use_actiondfs = true`, `action_executor` builds an `ActiondfsWorkspace` instead of materializing individual bind mounts per file. The `actiondfs` kernel module presents the CAS graph as a live filesystem. Two sub-modes exist:

| Mode | When used | Kernel mounts |
|---|---|---|
| `actiondfs_strict` | Default (inputs are not mutated) | `mount("actiondfs", workspace, "actiondfs", RDONLY\|NOSUID\|NODEV\|NOATIME, root=…,cas=…,stage=…)` |
| `actiondfs_overlay` | `mutates_inputs=true` platform property | actiondfs on `/lower`, then `overlay` on `/workspace` with actiondfs as lowerdir and a writable upperdir |

The `stage_dir` is a local directory for kernel-side caching of resolved CAS content. In overlay mode, the action's writes land in the overlay `upperdir` and are collected after the action exits by mounting the overlay again.

Sources: [src/action_executor.zig:456-545](), [src/action_executor.zig:170-210]()

---

## Setup Signal Protocol

The parent and child coordinate through a `setup_pipe` pair (fd 3 in the child). After all namespace, mount, chroot, and privilege setup succeeds, the child writes a single byte `"1"` and closes fd 3. The parent waits for this byte before it starts draining stdout/stderr. If setup fails at any point, the child calls `linux.exit(127)` without writing the byte; the parent then sees EOF on the pipe with nothing read and treats the setup as failed.

Sources: [src/action_runner.zig:490-494](), [src/action_runner.zig:660-668]()
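
The parent's side of this protocol reduces to classifying what it read from the pipe before EOF. A minimal sketch, with hypothetical names (`SetupOutcome`, `classifySetup`) and with the actual pipe I/O left out:

```zig
const std = @import("std");

const SetupOutcome = enum { ready, failed };

/// Hypothetical parent-side check. `bytes` is whatever was read from the
/// setup pipe before EOF: "1" on success, nothing if the child exited early.
fn classifySetup(bytes: []const u8) SetupOutcome {
    if (bytes.len >= 1 and bytes[0] == '1') return .ready;
    return .failed;
}

test "setup signal classification" {
    try std.testing.expectEqual(SetupOutcome.ready, classifySetup("1"));
    try std.testing.expectEqual(SetupOutcome.failed, classifySetup(""));
}
```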

---

## Summary

Linux host execution in `actiond` composes five isolation layers:

1. **Chroot** — the child chroots into the per-action `work_root`, so the action sees only its execroot at `/workspace` plus the runtime mounts.
2. **Private mount namespace** (`CLONE_NEWNS`) — bind mounts, runtime mounts, and the chroot are invisible to the host and to other actions.
3. **Private network namespace** (`CLONE_NEWNET`) — only the loopback interface (`lo`) is brought up; all external networking is absent.
4. **Privilege drop** — uid/gid 65534 with empty groups, zero Linux capabilities, and `PR_SET_NO_NEW_PRIVS` set before unshare.
5. **Cgroup v2** — best-effort memory, CPU, and PID limits declared via REAPI platform properties, with automatic cleanup and forced kill on completion.

CAS inputs are delivered as read-only bind mounts (source = CAS blob path, target = workspace placeholder) rather than copies, making input staging O(inputs) in metadata operations rather than O(bytes). The `actiondfs` kernel module goes further: it replaces the per-file bind mounts with a single filesystem mount, so CAS content is resolved when the action accesses a file.

Sources: [src/action_runner.zig:484-536](), [src/action_executor.zig:390-430]()
