# actiondfs — Lazy CAS-Backed Input Filesystem

> The custom Linux kernel filesystem module that exposes REAPI input trees to VM actions without per-file copies: lazy Directory proto resolution from the guest CAS, VM-lifetime parsed Directory cache keyed by digest, backing-file delegation for read/splice/mmap, strict vs. overlayfs compatibility paths for input-mutating actions, and the /proc/actiondfs_stats counter interface.

- Repository: hermeticbuild/actiond
- GitHub: https://github.com/hermeticbuild/actiond
- Human wiki: https://grok-wiki.com/public/wiki/hermeticbuild-actiond-796c0ee40e63
- Complete Markdown: https://grok-wiki.com/public/wiki/hermeticbuild-actiond-796c0ee40e63/llms-full.txt

## Source Files

- `kernel/actiondfs/actiondfs.c`
- `kernel/actiondfs/BUILD.bazel`
- `kernel/actiondfs/Kconfig`
- `kernel/actiondfs/Makefile`
- `src/staged_cas_index.zig`
- `src/cas.zig`
- `ARCHITECTURE.md`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [kernel/actiondfs/actiondfs.c](kernel/actiondfs/actiondfs.c)
- [kernel/actiondfs/BUILD.bazel](kernel/actiondfs/BUILD.bazel)
- [kernel/actiondfs/Kconfig](kernel/actiondfs/Kconfig)
- [kernel/actiondfs/Makefile](kernel/actiondfs/Makefile)
- [ARCHITECTURE.md](ARCHITECTURE.md)
- [src/action_executor.zig](src/action_executor.zig)
</details>


`actiondfs` is a custom Linux kernel filesystem module built into the actiond VM kernel. It exposes the REAPI input tree for each action as a native Linux filesystem mount — without copying any file content from the Content-Addressable Storage (CAS). Directory metadata is resolved lazily on first lookup from REAPI `Directory` protobuf blobs stored in the guest CAS, and file content reads are delegated directly to the backing CAS blob via the kernel's `backing_file_open` family of helpers.

The module exists because the VM guest needs a copy-free, isolation-correct way to present thousands of input files to compiler actions. Instead of materializing each input as a bind mount (the Linux-host strategy) or an overlayfs copy-up, `actiondfs` presents the entire input tree as a virtual mount: inode metadata comes from the parsed protobuf graph, and `read`, `splice_read`, and `mmap` calls pass through to the real CAS file without allocating intermediate page cache pages in the actiondfs inode.

---

## Module Overview

The module lives entirely in [`kernel/actiondfs/actiondfs.c`](kernel/actiondfs/actiondfs.c) and is compiled as a kernel built-in under `CONFIG_ACTIONDFS_FS`.

```
kernel/actiondfs/
  actiondfs.c    # complete implementation (~3870 lines)
  Kconfig        # CONFIG_ACTIONDFS_FS bool
  Makefile       # obj-$(CONFIG_ACTIONDFS_FS) += actiondfs.o
  BUILD.bazel    # exports srcs filegroup for linux.bzl kernel build
```

Key constants defined at the top of the source:

| Constant | Value | Purpose |
|---|---|---|
| `ACTIONDFS_MAGIC` | `0x41444653` | Filesystem magic number |
| `ACTIONDFS_MAX_DIRECTORY_PROTO_SIZE` | 64 MiB | Protobuf read guard |
| `ACTIONDFS_DIR_CACHE_BITS` | 12 (4096 buckets) | VM-lifetime dir-cache hash table |
| `ACTIONDFS_BLOB_PATH_CACHE_BITS` | 14 (16384 buckets) | CAS path resolution cache |
| `ACTIONDFS_BLOB_PATH_CACHE_MAX` | 16384 | Eviction threshold |
| `ACTIONDFS_STALE_RETRY_ATTEMPTS` | 128 | Max retries on `-ESTALE` |
| `ACTIONDFS_STALE_RETRY_MS` | 2 | Sleep between stale retries |
| `ACTIONDFS_PROC_STATS` | `"actiondfs_stats"` | `/proc` entry name |
| `ACTIONDFS_EMPTY_SHA256` | SHA-256 digest of zero-length input | Empty-file short-circuit |

Sources: [kernel/actiondfs/actiondfs.c:43-57]()
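
The derived figures in the table can be recomputed in plain C. The following is a userspace sketch rather than code from the module: the macro names mirror the table, while `bucket_count`, `stale_retry_sleep_budget_ms`, and `magic_to_ascii` are hypothetical helpers, and the sleep budget assumes one sleep per retry.

```c
#include <stddef.h>
#include <stdint.h>

#define ACTIONDFS_MAGIC                 0x41444653u
#define ACTIONDFS_DIR_CACHE_BITS        12
#define ACTIONDFS_BLOB_PATH_CACHE_BITS  14
#define ACTIONDFS_STALE_RETRY_ATTEMPTS  128
#define ACTIONDFS_STALE_RETRY_MS        2

/* Number of hash buckets for a table declared with the given bit width. */
static size_t bucket_count(unsigned bits)
{
    return (size_t)1 << bits;
}

/* Upper bound on time spent sleeping between stale retries, in ms
 * (assumes exactly one sleep per retry). */
static unsigned stale_retry_sleep_budget_ms(void)
{
    return ACTIONDFS_STALE_RETRY_ATTEMPTS * ACTIONDFS_STALE_RETRY_MS;
}

/* Writes the magic number's four bytes, most significant first, as text. */
static void magic_to_ascii(uint32_t magic, char out[5])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((magic >> (24 - 8 * i)) & 0xff);
    out[4] = '\0';
}
```

Read this way, the magic number spells `ADFS`, and under the one-sleep-per-retry assumption a fully exhausted retry loop sleeps for about 256 ms in total.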

---

## Mount Interface

Each per-action mount passes options as a comma-separated string:

```
root=<sha256>,root_size=<bytes>,cas=/cas/blobs/sha256[,stage=/path/to/stage]
```

| Option | Required | Description |
|---|---|---|
| `root=` | yes | SHA-256 hex of the REAPI input root `Directory` digest |
| `root_size=` | no | Expected size of the root proto blob in bytes |
| `cas=` | yes | Absolute path to the CAS blob directory (sharded `xx/sha256hash`) |
| `stage=` | no | Absolute path to the per-action stage directory; enables write support |

When `stage=` is absent, the superblock is mounted `SB_RDONLY`. When `stage=` is present, `sbi->staged_writes = true` and write operations on new files/directories are forwarded to the stage directory.

The root node starts with `loaded = false`; no protobuf read happens at mount time. The root `Directory` is parsed on first access via `actiondfs_ensure_loaded`.
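
To make the option grammar concrete, here is a minimal userspace sketch of a parser for the mount data string. It is not the module's parser: `struct mount_opts`, `parse_mount_opts`, and the buffer sizes are assumptions chosen for illustration, and the validation rules (64 hex characters for `root=`, absolute paths for `cas=` and `stage=`) are inferred from the option table.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace mirror of the per-mount options. */
struct mount_opts {
    char root[65];                 /* 64 hex chars + NUL */
    unsigned long long root_size;  /* 0 when root_size= is absent */
    char cas[256];
    char stage[256];
    bool staged_writes;            /* true only when stage= was given */
};

static bool key_is(const char *k, size_t klen, const char *name)
{
    return strlen(name) == klen && memcmp(k, name, klen) == 0;
}

/* Copies an absolute path value; rejects relative or oversized paths. */
static int copy_abs_path(char *dst, size_t cap, const char *val, size_t len)
{
    if (len == 0 || val[0] != '/' || len >= cap)
        return -1;
    memcpy(dst, val, len);
    dst[len] = '\0';
    return 0;
}

/* Returns 0 on success, -1 on a malformed or incomplete option string. */
static int parse_mount_opts(const char *data, struct mount_opts *out)
{
    memset(out, 0, sizeof(*out));
    while (*data) {
        size_t len = strcspn(data, ",");       /* one key=value item */
        const char *eq = memchr(data, '=', len);
        if (!eq)
            return -1;
        size_t klen = (size_t)(eq - data);
        const char *val = eq + 1;
        size_t vlen = len - klen - 1;

        if (key_is(data, klen, "root")) {
            if (vlen != 64)
                return -1;
            for (size_t i = 0; i < vlen; i++)
                if (!isxdigit((unsigned char)val[i]))
                    return -1;
            memcpy(out->root, val, 64);
        } else if (key_is(data, klen, "root_size")) {
            if (vlen == 0 || vlen > 19)        /* 19 digits fit in 64 bits */
                return -1;
            out->root_size = 0;
            for (size_t i = 0; i < vlen; i++) {
                if (!isdigit((unsigned char)val[i]))
                    return -1;
                out->root_size = out->root_size * 10 + (unsigned)(val[i] - '0');
            }
        } else if (key_is(data, klen, "cas")) {
            if (copy_abs_path(out->cas, sizeof(out->cas), val, vlen))
                return -1;
        } else if (key_is(data, klen, "stage")) {
            if (copy_abs_path(out->stage, sizeof(out->stage), val, vlen))
                return -1;
            out->staged_writes = true;
        } else {
            return -1;                         /* unknown option */
        }
        data += len;
        if (*data == ',')
            data++;
    }
    /* root= and cas= are required; stage= and root_size= are optional. */
    return (out->root[0] && out->cas[0]) ? 0 : -1;
}
```

In this sketch, a string without `stage=` leaves `staged_writes` false, which corresponds to the read-only mount described above.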

Sources: [kernel/actiondfs/actiondfs.c:2614-2661](), [kernel/actiondfs/actiondfs.c:3695-3726]()

---

## Data Structures

### Per-mount node tree (`actiondfs_node`)

Every VFS inode has a corresponding `actiondfs_node` stored in `inode->i_private`:

```c
struct actiondfs_node {
    char *name;
    enum actiondfs_node_origin origin;   // INPUT or STAGED
    char *stage_rel;                     // relative path into stage dir (STAGED only)
    u64 ino;
    umode_t mode;
    u64 size;
    char hash[65];                       // SHA-256 hex of CAS blob (INPUT only)
    struct file *blob_file;              // cached backing file (INPUT only)
    struct mutex blob_lock;
    bool loaded;                         // false until CAS proto is parsed
    struct actiondfs_node *parent;
    struct actiondfs_cached_dir *cached_dir;   // pointer into VM-lifetime cache
    struct actiondfs_materialized_child *materialized_children;
    struct actiondfs_node **file_children;
    struct actiondfs_node **dir_children;
    ...
};
```

`origin` is either `ACTIONDFS_NODE_INPUT` (read-only CAS blob) or `ACTIONDFS_NODE_STAGED` (read-write file in the stage directory).

### VM-lifetime directory cache (`actiondfs_cached_dir`)

```c
struct actiondfs_cached_dir {
    struct hlist_node hnode;
    char hash[65];
    u64 size;
    struct actiondfs_cached_child *file_children;   // sorted by name
    struct actiondfs_cached_child *dir_children;    // sorted by name
    size_t file_count, dir_count;
    ...
};
```

Each `actiondfs_cached_child` stores only the name, mode, size, and hash of a child — no per-mount node pointers. Child nodes are "materialized" (allocated as `actiondfs_node`) on demand, keyed by `(is_dir, index)` in the per-mount `materialized_children` list.
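
The split between shared child records and per-mount materialization can be sketched in userspace C as follows. The types are simplified stand-ins, not the module's structures: `struct cached_child` keeps only a name, the ordering is assumed to be plain `strcmp` order, and `find_child` and `materialize` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Shared, mount-independent child record (name only, for brevity). */
struct cached_child {
    const char *name;
};

/* Per-mount record of a cached child that has been turned into a node. */
struct materialized_child {
    bool is_dir;
    size_t index;                      /* index into the cached array */
    struct materialized_child *next;
};

/* Binary search over children sorted by name; returns the index or -1. */
static long find_child(const struct cached_child *kids, size_t count,
                       const char *name)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, kids[mid].name);

        if (cmp == 0)
            return (long)mid;
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return -1;
}

/* Returns the existing per-mount entry for (is_dir, index), or allocates
 * a new one at the head of the list. *created reports which case ran. */
static struct materialized_child *materialize(struct materialized_child **head,
                                              bool is_dir, size_t index,
                                              bool *created)
{
    for (struct materialized_child *m = *head; m; m = m->next) {
        if (m->is_dir == is_dir && m->index == index) {
            *created = false;
            return m;
        }
    }
    struct materialized_child *m = malloc(sizeof(*m));
    if (!m)
        return NULL;
    m->is_dir = is_dir;
    m->index = index;
    m->next = *head;
    *head = m;
    *created = true;
    return m;
}
```

The two outcomes of `materialize` plausibly correspond to the `cached_materialized` and `cached_reused` counters listed later, although that mapping is an inference from the counter names.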

Sources: [kernel/actiondfs/actiondfs.c:64-100](), [kernel/actiondfs/actiondfs.c:103-115]()

---

## Lazy Directory Resolution

When a `lookup` or `readdir` arrives on an unloaded directory node, `actiondfs_ensure_loaded` loads it under the per-mount `load_lock`, re-checking `loaded` once the lock is held:

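```c
static int actiondfs_ensure_loaded(struct super_block *sb,
                                   struct actiondfs_node *dir)
{
    /* Excerpt: sbi is the per-mount superblock info reached through sb;
     * its declaration is elided here. */
    int err = 0;

    if (dir->loaded)                     /* fast path, no lock taken */
        return 0;
    mutex_lock(&sbi->load_lock);
    if (!dir->loaded) {                  /* re-check under the lock */
        actiondfs_stat_inc(ACTIONDFS_STAT_DIR_LOADS);
        err = actiondfs_load_reapi_directory_locked(sbi, dir);
    }
    mutex_unlock(&sbi->load_lock);
    return err;
}
```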

Inside `actiondfs_load_reapi_directory_locked`, the path diverges based on whether the node is the root:

- **Root directory**: parsed directly into per-mount `file_children` / `dir_children` arrays; not cached. The `ACTIONDFS_STAT_ROOT_DIR_PARSES` counter is incremented. Child directory nodes are added with `loaded = false`.
- **Non-root directory**: `actiondfs_get_cached_dir` is called, which checks the VM-lifetime `actiondfs_dir_cache` hash table first. On a miss, the CAS blob is read and parsed into a new `actiondfs_cached_dir`, then inserted. On a race (concurrent miss), the duplicate is discarded and the winner's entry is returned.

The inline protobuf parser handles REAPI `Directory` field 1 (`FileNode`) and field 2 (`DirectoryNode`) and skips unknown fields. Symlinks (field 3) are rejected with `-EOPNOTSUPP`.
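
The top-level field walk described above can be illustrated with a small userspace scanner. This is a sketch of the protobuf wire format, not the module's parser: it only counts `FileNode` and `DirectoryNode` entries instead of decoding them, and `read_varint`, `scan_directory`, and `struct dir_counts` are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct dir_counts {
    size_t files;
    size_t dirs;
};

/* Decodes one base-128 varint; advances *p. Returns 0 or -EINVAL. */
static int read_varint(const uint8_t **p, const uint8_t *end, uint64_t *out)
{
    uint64_t v = 0;

    for (unsigned shift = 0; shift < 64; shift += 7) {
        if (*p == end)
            return -EINVAL;            /* truncated input */
        uint8_t b = *(*p)++;
        v |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80)) {
            *out = v;
            return 0;
        }
    }
    return -EINVAL;                    /* more than 10 bytes */
}

/* Walks the top-level fields of a serialized Directory message. */
static int scan_directory(const uint8_t *buf, size_t len, struct dir_counts *c)
{
    const uint8_t *p = buf, *end = buf + len;

    c->files = c->dirs = 0;
    while (p < end) {
        uint64_t tag, n;
        int err = read_varint(&p, end, &tag);
        if (err)
            return err;
        uint64_t field = tag >> 3;     /* field number */
        unsigned wire = tag & 7;       /* wire type */

        switch (wire) {
        case 0:                        /* varint: decode and discard */
            err = read_varint(&p, end, &n);
            if (err)
                return err;
            break;
        case 1:                        /* fixed 64-bit */
            if ((size_t)(end - p) < 8)
                return -EINVAL;
            p += 8;
            break;
        case 5:                        /* fixed 32-bit */
            if ((size_t)(end - p) < 4)
                return -EINVAL;
            p += 4;
            break;
        case 2:                        /* length-delimited */
            err = read_varint(&p, end, &n);
            if (err)
                return err;
            if (n > (uint64_t)(end - p))
                return -EINVAL;
            if (field == 3)
                return -EOPNOTSUPP;    /* symlinks are not supported */
            if (field == 1)
                c->files++;
            else if (field == 2)
                c->dirs++;
            p += n;                    /* unknown fields are skipped */
            break;
        default:                       /* groups and reserved types */
            return -EINVAL;
        }
    }
    return 0;
}
```

A real parser would descend into each field 1 and field 2 payload to extract the name, digest, and mode; the sketch stops at the top level to keep the skip-unknown-fields logic visible.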

Sources: [kernel/actiondfs/actiondfs.c:2558-2610](), [kernel/actiondfs/actiondfs.c:2470-2557]()

---

## VM-Lifetime Caches

### Directory metadata cache

```c
static DEFINE_HASHTABLE(actiondfs_dir_cache, ACTIONDFS_DIR_CACHE_BITS);
static DEFINE_MUTEX(actiondfs_dir_cache_lock);
```

- Keyed by the first 16 hex characters of the SHA-256 digest, interpreted as `unsigned long`.
- Holds `actiondfs_cached_dir` entries for the lifetime of the VM module; never evicted.
- Since Directory protos are content-addressed and immutable, the same source directory or tree artifact used across many actions reuses one parse.
- Root directories are intentionally excluded because they are unique per action.
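
The keying scheme in the first bullet can be sketched as follows. This is a userspace approximation: `digest_key` and `bucket_of` are hypothetical names, and the bucket function imitates the multiplicative style of the kernel's `hash_64()` rather than reproducing what the module calls.

```c
#include <stddef.h>
#include <stdint.h>

#define DIR_CACHE_BITS 12

/* Parses the first 16 hex characters of a digest into a 64-bit key.
 * Returns 0 on success, -1 if the prefix is short or not hexadecimal. */
static int digest_key(const char *hash, uint64_t *key)
{
    uint64_t v = 0;

    for (int i = 0; i < 16; i++) {
        char ch = hash[i];
        unsigned d;

        if (ch >= '0' && ch <= '9')
            d = (unsigned)(ch - '0');
        else if (ch >= 'a' && ch <= 'f')
            d = (unsigned)(ch - 'a') + 10;
        else if (ch >= 'A' && ch <= 'F')
            d = (unsigned)(ch - 'A') + 10;
        else
            return -1;                 /* also catches an early NUL */
        v = (v << 4) | d;
    }
    *key = v;
    return 0;
}

/* Multiplicative hash of the key down to DIR_CACHE_BITS bits. */
static unsigned bucket_of(uint64_t key)
{
    return (unsigned)((key * 0x61C8864680B583EBull) >> (64 - DIR_CACHE_BITS));
}
```

Because the key covers only a 64-bit prefix of the digest, a lookup built this way would still need to compare the full 64-character hash of each entry in the bucket before treating it as a hit.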

### Blob path cache

```c
static DEFINE_HASHTABLE(actiondfs_blob_path_cache, ACTIONDFS_BLOB_PATH_CACHE_BITS);
static LIST_HEAD(actiondfs_blob_path_cache_list);
static DEFINE_MUTEX(actiondfs_blob_path_cache_lock);
static size_t actiondfs_blob_path_cache_count;
```

- Caches resolved `struct path` for CAS blobs by digest (max 16384 entries).
- **Hot path uses RCU** (`hash_for_each_possible_rcu`); misses and evictions use the mutex.
- Eviction removes the entry with the lowest hit count found on `actiondfs_blob_path_cache_list`.
- When a stale handle is detected (`-ESTALE`), the entry is dropped via `actiondfs_drop_cached_blob_path` and the open is retried.
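
Victim selection by hit count can be pictured with the short sketch below. It uses a flat array instead of the module's hash table and list, and `struct path_entry` and `pick_victim` are hypothetical; tie-breaking toward the earliest slot is an arbitrary choice made here.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified cache slot: only the fields needed for victim selection. */
struct path_entry {
    uint64_t hits;
    int in_use;
};

/* Returns the index of the in-use entry with the fewest hits, or -1 if
 * no entry is in use. Ties go to the earliest slot. */
static long pick_victim(const struct path_entry *e, size_t n)
{
    long victim = -1;

    for (size_t i = 0; i < n; i++) {
        if (!e[i].in_use)
            continue;
        if (victim < 0 || e[i].hits < e[victim].hits)
            victim = (long)i;
    }
    return victim;
}
```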

Sources: [kernel/actiondfs/actiondfs.c:256-274](), [kernel/actiondfs/actiondfs.c:2279-2430]()

---

## Backing File Delegation

For `ACTIONDFS_NODE_INPUT` files, all data-path operations are forwarded to the real CAS blob file using the kernel `backing_file_*` API:

### `read_iter`

```c
file = actiondfs_get_node_blob_file(sbi, node, iocb->ki_filp);
nread = backing_file_read_iter(file, to, iocb, iocb->ki_flags, &ctx);
```

`actiondfs_get_node_blob_file` caches the `backing_file_open` result in `node->blob_file` under `node->blob_lock`, so repeated reads on the same node reuse the same backing `struct file` instead of reopening the blob.

### `splice_read`

```c
nread = backing_file_splice_read(file, &backing_iocb, pipe, wanted, flags, &ctx);
```

Splice passes a separate `backing_iocb`, a copy of the caller's iocb that tracks the read position against the backing file. This allows `sendfile` and pipe-based reads without an intermediate copy.

### `mmap`

```c
err = backing_file_mmap(file, vma, &ctx);
```

The VMA is attached to the backing CAS blob. Page faults go directly to the CAS filesystem's page cache, not through actiondfs folios. This is the critical path for compiler `execve` and library mapping — the actiondfs path shown in `/proc/PID/maps` remains the actiondfs path while data comes from the native ext4 page cache.

All three operations retry on `-ESTALE` up to `ACTIONDFS_STALE_RETRY_ATTEMPTS` (128) times, clearing the blob path cache and node blob cache on each stale detection.
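
The retry policy can be sketched in userspace as a bounded loop. This is an illustration under assumptions: the callback types, `retry_on_stale`, and the `flaky` demo helpers are hypothetical, `nanosleep` stands in for whatever sleep primitive the module uses, and the ordering of invalidate, sleep, and retry is a guess.

```c
#include <errno.h>
#include <time.h>

#define STALE_RETRY_ATTEMPTS 128
#define STALE_RETRY_MS       2

/* Hypothetical callbacks: op performs the I/O and returns a byte count or
 * a negative errno; invalidate drops any cached handles. */
typedef long (*io_op_fn)(void *ctx);
typedef void (*invalidate_fn)(void *ctx);

/* Runs op once, then retries up to STALE_RETRY_ATTEMPTS times while it
 * keeps failing with -ESTALE, invalidating caches before each retry. */
static long retry_on_stale(io_op_fn op, invalidate_fn invalidate, void *ctx)
{
    long ret = op(ctx);

    for (int attempt = 0;
         ret == -ESTALE && attempt < STALE_RETRY_ATTEMPTS; attempt++) {
        struct timespec ts = { 0, STALE_RETRY_MS * 1000000L };

        invalidate(ctx);
        nanosleep(&ts, NULL);
        ret = op(ctx);
    }
    return ret;
}

/* Demo state: fails with -ESTALE a fixed number of times, then succeeds. */
struct flaky {
    int stale_left;
    int invalidations;
};

static long flaky_read(void *ctx)
{
    struct flaky *f = ctx;

    if (f->stale_left > 0) {
        f->stale_left--;
        return -ESTALE;
    }
    return 42;                         /* pretend 42 bytes were read */
}

static void flaky_invalidate(void *ctx)
{
    struct flaky *f = ctx;

    f->invalidations++;
}
```

With the demo helpers, a source that goes stale twice succeeds on the third call after two invalidations, while one that stays stale gives up with `-ESTALE` after 128 retries.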

Sources: [kernel/actiondfs/actiondfs.c:1712-1820](), [kernel/actiondfs/actiondfs.c:1820-1920]()

---

## Strict vs. Overlay Compatibility Paths

The executor selects between two modes based on the `mutates_inputs` execution requirement:

```
┌─────────────────────────────────────────────────────┐
│  action_executor.zig                                │
│                                                     │
│  mutates_inputs absent / "false" / "0"              │
│    → ActionInputMode.actiondfs_strict               │
│    → mount actiondfs at /workspace                  │
│      (stage= path for output capture)               │
│                                                     │
│  mutates_inputs = "1" / "yes" / truthy              │
│    → ActionInputMode.actiondfs_overlay              │
│    → mount actiondfs (no stage=) at /lower          │
│    → mount overlayfs over /lower as /workspace      │
│      (upperdir = per-action stage dir)              │
└─────────────────────────────────────────────────────┘
```

**Strict mode** mount data (from `src/action_executor.zig`):
```
root={hash},root_size={bytes},cas={cas_blob_root},stage={stage_path}
```

**Overlay mode** mount data:
```
root={hash},root_size={bytes},cas={cas_blob_root}   ← actiondfs (no stage=, SB_RDONLY)
lowerdir={lower_path},upperdir={stage_path},workdir={work_path}  ← overlayfs on top
```

In strict mode, write operations on new files route through `actiondfs` to the stage directory, while attempts to write to CAS-backed input nodes return `-EROFS`. In overlay mode, overlayfs handles copy-up for input mutation so actions that overwrite inputs get a writable upper layer without patching `actiondfs`.
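
The executor itself is written in Zig; the mode selection and mount-data formatting above are restated here as a C sketch for consistency with the other examples. The names `select_mode` and `format_mount_data` are hypothetical, and treating every value other than absent, `"false"`, or `"0"` as truthy is an assumption about how the requirement is interpreted.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

enum input_mode { ACTIONDFS_STRICT, ACTIONDFS_OVERLAY };

/* Absent, "false", and "0" select strict mode. Any other value is
 * treated as truthy in this sketch (an assumption). */
static enum input_mode select_mode(const char *mutates_inputs)
{
    if (!mutates_inputs || strcmp(mutates_inputs, "false") == 0 ||
        strcmp(mutates_inputs, "0") == 0)
        return ACTIONDFS_STRICT;
    return ACTIONDFS_OVERLAY;
}

/* Formats the actiondfs mount data. stage= is appended only in strict
 * mode; in overlay mode the stage directory becomes the overlayfs
 * upperdir instead. Returns the snprintf-style length. */
static int format_mount_data(char *buf, size_t cap, enum input_mode mode,
                             const char *root, unsigned long long root_size,
                             const char *cas, const char *stage)
{
    if (mode == ACTIONDFS_STRICT)
        return snprintf(buf, cap, "root=%s,root_size=%llu,cas=%s,stage=%s",
                        root, root_size, cas, stage);
    return snprintf(buf, cap, "root=%s,root_size=%llu,cas=%s",
                    root, root_size, cas);
}
```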

Sources: [src/action_executor.zig:876-944](), [src/action_executor.zig:350-357]()

---

## Stage Layer (Write Operations)

When a `stage=` path is present, `actiondfs` supports create, mkdir, unlink, rmdir, rename, write, and truncate for nodes with `origin == ACTIONDFS_NODE_STAGED`. Write operations to `ACTIONDFS_NODE_INPUT` nodes always return `-EROFS`.

The stage layer uses VFS pass-through:

- **Create/mkdir**: Checks that the name does not collide with an input node. Ensures the parent path exists in the stage directory via `actiondfs_stage_ensure_dir`. Then calls `vfs_create` / `vfs_mkdir` on the real stage dentry.
- **Write**: Opens the stage file via `backing_file_open` with `O_WRONLY`, delegates to `backing_file_write_iter`, updates `node->size`.
- **Rename**: Calls `vfs_rename` on both real stage dentries; updates `node->stage_rel` and `node->parent` on success.
- **copy_file_range**: Directly calls `vfs_copy_file_range` between the real backing files (CAS blob → stage file or stage → stage), bypassing actiondfs folios entirely.

During `readdir` on a staged directory, the stage directory is iterated via `iterate_dir`; entries that match input children are suppressed so the merged view presents each name once.
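
The merged `readdir` view can be sketched as a simple name merge. This is a userspace illustration only: `merge_listing` is a hypothetical helper, the linear duplicate check is chosen for brevity, and emitting input names before stage names is an assumption about ordering.

```c
#include <stddef.h>
#include <string.h>

/* Emits input names first, then stage names that do not repeat an input
 * name. Returns the number of entries written, capped at cap. */
static size_t merge_listing(const char *const *input, size_t n_input,
                            const char *const *stage, size_t n_stage,
                            const char **out, size_t cap)
{
    size_t n = 0;

    for (size_t i = 0; i < n_input && n < cap; i++)
        out[n++] = input[i];

    for (size_t i = 0; i < n_stage && n < cap; i++) {
        int dup = 0;

        for (size_t j = 0; j < n_input; j++) {
            if (strcmp(stage[i], input[j]) == 0) {
                dup = 1;               /* already listed as an input */
                break;
            }
        }
        if (!dup)
            out[n++] = stage[i];
    }
    return n;
}
```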

Sources: [kernel/actiondfs/actiondfs.c:3101-3200](), [kernel/actiondfs/actiondfs.c:2986-3050]()

---

## VFS Operations Table

| Operation | Input node | Staged node |
|---|---|---|
| `read_iter` | backing_file_read_iter → CAS blob | backing_file_read_iter → stage file |
| `write_iter` | `-EROFS` | backing_file_write_iter → stage file |
| `mmap` | backing_file_mmap → CAS blob VMA | backing_file_mmap → stage file VMA |
| `splice_read` | backing_file_splice_read → CAS | backing_file_splice_read → stage |
| `copy_file_range` | vfs_copy_file_range (real files) | vfs_copy_file_range (real files) |
| `lookup` | parse cached dir, materialize node | stat stage path, build staged node |
| `create` | `-EROFS` | vfs_create in stage dir |
| `mkdir` | `-EROFS` | vfs_mkdir in stage dir |
| `unlink` | `-EROFS` | vfs_unlink in stage dir |
| `rename` | `-EROFS` | vfs_rename, update stage_rel |
| `setattr (size)` | `-EROFS` | vfs_truncate on stage file |

Sources: [kernel/actiondfs/actiondfs.c:3614-3650]()

---

## `/proc/actiondfs_stats` Counter Interface

At module init, a procfs entry is registered:

```c
proc_create_single(ACTIONDFS_PROC_STATS, 0444, NULL, actiondfs_stats_show);
```

The handler iterates `actiondfs_stats[]` (an array of `atomic64_t`) and prints one `name value` pair per line. Counters accumulate for the VM's lifetime and are never reset. `AGENTS.md` notes that this file is readable inside a running guest at `/proc/actiondfs_stats`.
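
A consumer of this interface might look like the sketch below, which works on an in-memory copy of the file so it can be exercised outside a guest. It assumes a single space between name and value; `stat_lookup` and `dir_cache_hit_percent` are hypothetical helpers, while the counter names come from the table that follows.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Finds the line "name value" in text. Returns 0 and stores the value,
 * or -1 if the counter is missing or malformed. */
static int stat_lookup(const char *text, const char *name, uint64_t *value)
{
    size_t nlen = strlen(name);

    while (*text) {
        size_t line = strcspn(text, "\n");

        if (line > nlen && strncmp(text, name, nlen) == 0 &&
            text[nlen] == ' ') {
            if (sscanf(text + nlen + 1, "%" SCNu64, value) == 1)
                return 0;
            return -1;
        }
        text += line;
        if (*text == '\n')
            text++;
    }
    return -1;
}

/* Directory cache hit ratio in whole percent, or -1 when the counters
 * are missing or no requests have been recorded. */
static int dir_cache_hit_percent(const char *text)
{
    uint64_t hits, misses;

    if (stat_lookup(text, "dir_cache_hits", &hits) ||
        stat_lookup(text, "dir_cache_misses", &misses))
        return -1;
    if (hits + misses == 0)
        return -1;
    return (int)(hits * 100 / (hits + misses));
}
```

Inside a guest, the same functions could be fed the contents of `/proc/actiondfs_stats` after reading the file into a NUL-terminated buffer.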

Selected counter groups:

| Group | Counters |
|---|---|
| Mount activity | `mounts` |
| Directory loading | `dir_loads`, `root_dir_parses`, `cached_dir_requests`, `dir_cache_hits`, `dir_cache_misses`, `dir_cache_races`, `cached_dir_builds`, `cached_dir_bytes` |
| Lookup/readdir | `lookups`, `lookup_hits`, `lookup_negative`, `cached_lookups`, `cached_lookup_hits`, `cached_materialized`, `cached_reused`, `readdirs`, `readdir_entries` |
| Blob path cache | `blob_path_cache_hits`, `blob_path_cache_misses`, `blob_path_cache_inserts`, `blob_path_cache_evictions`, `blob_path_cache_races` |
| Data I/O | `backing_reads`, `backing_read_bytes`, `splice_reads`, `splice_read_bytes`, `mmaps`, `mmap_bytes`, `mmap_failures` |
| Stale retries | `blob_open_stale_retries`, `backing_read_stale_retries`, `splice_read_stale_retries` |
| Stage layer | `stage_read_calls`, `stage_write_calls`, `stage_create_calls`, `stage_mkdir_calls`, `stage_rename_calls`, `stage_copy_file_range_*`, … |

Sources: [kernel/actiondfs/actiondfs.c:137-280](), [kernel/actiondfs/actiondfs.c:3848-3862]()

---

## Lifecycle Sequence

```mermaid
sequenceDiagram
    participant Ex as action_executor.zig
    participant K as Linux kernel (actiondfs)
    participant CAS as /cas/blobs/sha256

    Ex->>K: mount("actiondfs", "/workspace", "root=...,cas=...,stage=...")
    Note over K: fill_super: allocate sbi, root node (loaded=false)
    Ex->>K: open("/workspace/src/foo.cc")
    K->>K: actiondfs_lookup → actiondfs_ensure_loaded(root)
    K->>CAS: read root Directory proto blob
    K->>K: parse FileNodes/DirectoryNodes → file_children[]/dir_children[]
    K->>K: actiondfs_lookup → find "src" dir node (loaded=false)
    K->>K: actiondfs_ensure_loaded("src") → actiondfs_get_cached_dir(hash)
    K->>CAS: read src/ Directory proto (on cache miss)
    K->>K: store actiondfs_cached_dir in VM-lifetime hash table
    K->>K: materialize "foo.cc" node from cached child record
    K->>CAS: backing_file_open(sha256=...) for foo.cc
    K-->>Ex: fd pointing at actiondfs inode (backing = CAS blob)
    Ex->>K: mmap(fd) → backing_file_mmap → CAS page cache
```

---

## Build Integration

`Kconfig` declares `CONFIG_ACTIONDFS_FS` as a `bool` (not a `tristate`), so the filesystem is either built into the kernel image or absent; it cannot be built as a loadable module. `Makefile` adds `actiondfs.o` to the built-in objects when the config symbol is set. The `BUILD.bazel` `srcs` filegroup is consumed by `linux.bzl`, which drives the kernel build from within Bazel. The kernel image with `CONFIG_ACTIONDFS_FS=y` is built as `//vm:linux_kernel_zst`.

The Linux host path (`linux-actiond` without a VM) never sets `use_actiondfs`; it continues to use read-only bind mounts from the CAS. `actiondfs` is only present in the Bazel-built VM kernel, so ordinary host kernels do not need the module.

Sources: [kernel/actiondfs/Kconfig:1-6](), [kernel/actiondfs/Makefile:1](), [ARCHITECTURE.md]()

---

## Summary

`actiondfs` eliminates per-file copies and bind-mount overhead for VM action inputs by implementing a minimal Linux filesystem backed by two VM-lifetime, content-addressed caches: a 4096-bucket hash table of parsed non-root `Directory` metadata and a blob-path cache of up to 16384 resolved CAS paths with RCU lookups. All file data operations are delegated to the underlying CAS blob through the kernel `backing_file_*` API. For most actions it is mounted directly at `/workspace` with a stage path for output capture. For input-mutating actions it is mounted read-only as an overlayfs lowerdir so stock overlayfs handles copy-up without any changes to actiondfs internals. Observable behavior is exposed through `atomic64_t` counters at `/proc/actiondfs_stats`.
