# Compiler Pipeline — From .0 Source to Native Binary

> How the native C compiler (native/zero-c) processes a .0 file: lexer → parser → checker → IR → MIR verifier → direct emitter (ELF64 / Mach-O / COFF). Key invariant: generatedCBytes must be zero for a direct-frontend build. The legacy C bridge path is removed and must not be reintroduced.

- Repository: vercel-labs/zerolang
- GitHub: https://github.com/vercel-labs/zerolang
- Human wiki: https://grok-wiki.com/public/wiki/vercel-labs-zerolang-9ab46b2a38e0
- Complete Markdown: https://grok-wiki.com/public/wiki/vercel-labs-zerolang-9ab46b2a38e0/llms-full.txt

## Source Files

- `native/zero-c/src/main.c`
- `native/zero-c/src/lexer.c`
- `native/zero-c/src/parser.c`
- `native/zero-c/src/checker.c`
- `native/zero-c/src/ir.c`
- `native/zero-c/src/mir_verify.c`
- `native/zero-c/src/specialize.c`
- `native/zero-c/src/emit_elf64.c`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [native/zero-c/src/main.c](native/zero-c/src/main.c)
- [native/zero-c/src/lexer.c](native/zero-c/src/lexer.c)
- [native/zero-c/src/parser.c](native/zero-c/src/parser.c)
- [native/zero-c/src/checker.c](native/zero-c/src/checker.c)
- [native/zero-c/src/ir.c](native/zero-c/src/ir.c)
- [native/zero-c/src/mir_verify.c](native/zero-c/src/mir_verify.c)
- [native/zero-c/src/specialize.c](native/zero-c/src/specialize.c)
- [native/zero-c/src/emit_elf64.c](native/zero-c/src/emit_elf64.c)
- [native/zero-c/src/emit_macho64.c](native/zero-c/src/emit_macho64.c)
- [native/zero-c/src/emit_elf_aarch64.c](native/zero-c/src/emit_elf_aarch64.c)
- [native/zero-c/src/emit_coff.c](native/zero-c/src/emit_coff.c)
</details>


The Zero language compiler (`native/zero-c`) takes a `.0` source file and produces native machine code directly, with no intermediate C and no external compiler toolchain. The pipeline runs entirely within a single process: tokenize, parse, type-check, lower to IR, verify IR contracts, then emit an object file or executable for the resolved target.

This page documents each stage, the data structures that flow between them, the invariant that `generatedCBytes` must always be zero in a direct-frontend build, and why the legacy C bridge path is permanently removed and must not be reintroduced.

---

## Pipeline Overview

```text
.0 source text
       │
       ▼
  ┌──────────┐
  │  Lexer   │  z_tokenize()  → TokenVec
  └──────────┘
       │
       ▼
  ┌──────────┐
  │  Parser  │  z_parse()     → Program (AST)
  └──────────┘
       │
       ▼
  ┌──────────────┐
  │   Checker    │  z_check_program()  (type + borrow check)
  └──────────────┘
       │
       ▼
  ┌──────────────────────────┐
  │   IR Lowering            │  z_lower_program_with_source()
  │   + Specialization       │  → IrProgram (MIR)
  └──────────────────────────┘
       │
       ▼
  ┌──────────────────┐
  │  MIR Verifier    │  z_mir_verify_direct_contracts()
  └──────────────────┘
       │
       ▼
  ┌──────────────────────────────────────────────────────────┐
  │   Direct Emitter (selected by target)                    │
  │   zero-elf64   → ELF64 (.o / ELF exe)                   │
  │   zero-macho64 → Mach-O (.o / Mach-O exe)               │
  │   zero-elf-aarch64 → ELF AArch64 (.o / exe)             │
  │   zero-coff-x64    → COFF x64 (.obj / PE exe)           │
  └──────────────────────────────────────────────────────────┘
       │
       ▼
  native binary written to disk
```

Sources: [native/zero-c/src/main.c:4365-4385](), [native/zero-c/src/main.c:10175-10211]()

---

## Stage 1 — Lexer

**Entry point:** `z_tokenize(const char *source, ZDiag *diag)` in `native/zero-c/src/lexer.c`.

The lexer performs a single forward pass over the UTF-8 source text. It produces a `TokenVec` — a growable array of `Token` values. Each token carries:

| Field | Purpose |
|---|---|
| `kind` | `TOK_KEYWORD`, `TOK_IDENT`, or punctuation |
| `text` | heap-allocated string copy of the lexeme |
| `line`, `column` | source position for diagnostics |
| `offset`, `length` | byte range into the original source |

The keyword table is a simple null-terminated string array checked with `strcmp`. Two-character symbols (`->`, `=>`, `..`, `==`, `!=`, `<=`, `>=`, `&&`, `||`, `+%`, `+|`) are detected with a lookahead comparison before the single-character fallback.
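
The keyword scan and the two-character lookahead described above can be sketched as follows. This is a minimal illustration, not the code in `lexer.c`: the `sk_` names are hypothetical, and the three keyword entries are placeholders because the real keyword list is not reproduced on this page. Only the list of two-character symbols is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Placeholder keyword table (assumption): the real entries live in lexer.c.
   The trailing NULL terminates the scan. */
static const char *const sk_keywords[] = {"fn", "let", "return", NULL};

/* Linear scan with strcmp over the NULL-terminated table. */
static bool sk_is_keyword(const char *text) {
  for (size_t i = 0; sk_keywords[i] != NULL; i++) {
    if (strcmp(sk_keywords[i], text) == 0) return true;
  }
  return false;
}

/* The two-character symbols listed in the text. */
static const char *const sk_two_char[] = {
  "->", "=>", "..", "==", "!=", "<=", ">=", "&&", "||", "+%", "+|", NULL};

/* Length of the symbol starting at p: 2 when the next two bytes form a
   known two-character symbol, 1 for the single-character fallback, and
   0 at the end of input. */
static size_t sk_symbol_length(const char *p) {
  assert(p != NULL);
  if (p[0] == '\0') return 0;
  if (p[1] != '\0') {
    for (size_t i = 0; sk_two_char[i] != NULL; i++) {
      if (p[0] == sk_two_char[i][0] && p[1] == sk_two_char[i][1]) return 2;
    }
  }
  return 1;
}
```

Checking the two-character table first is what keeps `->` from being split into `-` and `>`; the single-character path only runs when no pair matches.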

The lexer emits a single structured `ZDiag` on error (e.g., code 3024 for a malformed character literal) and returns immediately; the caller checks `diag->code != 0` before proceeding.

Sources: [native/zero-c/src/lexer.c:20-39](), [native/zero-c/src/lexer.c:58-97]()

---

## Stage 2 — Parser

**Entry point:** `z_parse(TokenVec *tokens, ZDiag *diag)` in `native/zero-c/src/parser.c`.

The parser consumes the `TokenVec` produced by the lexer and constructs an AST stored in a `Program` struct. It maintains a `Parser` context with the token stream and current index:

```c
typedef struct {
  TokenVec *tokens;
  size_t index;
  ZDiag *diag;
} Parser;
```

The `Program` contains separate growable vectors for top-level declarations: `FunctionVec`, `EnumVec`, `ShapeVec`, `InterfaceVec`, `ConstVec`, `CImportVec`, and `UseImportVec`. Expressions and statements are heap-allocated individually (`Expr *`, `Stmt *`) and owned by their parent nodes.

The parser emits a diagnostic and aborts on the first error; there is no error recovery.

Sources: [native/zero-c/src/parser.c:1-80]()
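
To illustrate what "abort on the first error, no recovery" can look like with a context shaped like `Parser`, here is a small sketch. Every type and function below is a hypothetical stand-in (the real `Token`, `TokenVec`, and `ZDiag` layouts are not shown on this page), and `SK_PARSE_ERROR` is a placeholder value rather than a real Zero diagnostic code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the parser's types. */
typedef enum { SK_TOK_IDENT, SK_TOK_LPAREN, SK_TOK_RPAREN } SkTokKind;
typedef struct { SkTokKind kind; int line; int column; } SkToken;
typedef struct { const SkToken *items; size_t len; } SkTokenVec;
typedef struct { int code; int line; int column; char message[96]; } SkDiag;
typedef struct { const SkTokenVec *tokens; size_t index; SkDiag *diag; } SkParser;

enum { SK_PARSE_ERROR = 1 }; /* placeholder, not a real diagnostic code */

/* Consume one token of the given kind. On the first mismatch, record a
   diagnostic; once a diagnostic is set, every later call is a no-op, so
   the first error is the only one reported. */
static bool sk_expect(SkParser *p, SkTokKind kind, const char *what) {
  assert(p != NULL && p->diag != NULL && p->tokens != NULL);
  if (p->diag->code != 0) return false;
  const SkTokenVec *toks = p->tokens;
  if (p->index < toks->len && toks->items[p->index].kind == kind) {
    p->index++;
    return true;
  }
  p->diag->code = SK_PARSE_ERROR;
  if (toks->len > 0) {
    /* Point at the offending token, or at the last token on early EOF. */
    size_t at = p->index < toks->len ? p->index : toks->len - 1;
    p->diag->line = toks->items[at].line;
    p->diag->column = toks->items[at].column;
  }
  snprintf(p->diag->message, sizeof(p->diag->message), "expected %s", what);
  return false;
}
```

Because the failed state is sticky, grammar routines built on such a helper can return early without each one re-checking which error came first.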

---

## Stage 3 — Type Checker and Borrow Checker

**Entry point:** `z_check_program(Program *program, ZDiag *diag)` in `native/zero-c/src/checker.c`.

The checker performs semantic analysis in a single pass over the AST. It operates through a `Scope` structure that tracks:

- Variable names, types, mutability, and move state (`moved` flag)
- Provenance information (`ValueProvenance`, `ProvenanceEntry`) for borrow checking
- Parameter and type-parameter classification

The borrow checker computes `FunctionProvenanceSummary` for each function, tracking which return values and storage effects propagate ownership. This is a flow-sensitive analysis that detects aliasing and invalid moves without a separate lifetime pass.
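
The role of the per-variable `moved` flag can be shown with a deliberately small sketch. It assumes a fixed-capacity scope and straight-line code only; the real checker is flow-sensitive and also tracks provenance, neither of which is modeled here. All `Sk` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_SCOPE_CAP 8 /* hypothetical fixed capacity for the sketch */

typedef struct { const char *name; bool moved; } SkVar;
typedef struct { SkVar vars[SK_SCOPE_CAP]; size_t len; } SkScope;
typedef enum { SK_USE_OK, SK_USE_UNKNOWN, SK_USE_AFTER_MOVE } SkUseResult;

/* Add a variable in the "not moved" state. */
static bool sk_declare(SkScope *s, const char *name) {
  if (s->len == SK_SCOPE_CAP) return false;
  s->vars[s->len].name = name;
  s->vars[s->len].moved = false;
  s->len++;
  return true;
}

/* Search newest-first so a later declaration shadows an earlier one. */
static SkVar *sk_lookup(SkScope *s, const char *name) {
  for (size_t i = s->len; i > 0; i--) {
    if (strcmp(s->vars[i - 1].name, name) == 0) return &s->vars[i - 1];
  }
  return NULL;
}

/* Record a use. A consuming use flips the moved flag; any use after
   that point is reported as use-after-move. */
static SkUseResult sk_use(SkScope *s, const char *name, bool consumes) {
  SkVar *v = sk_lookup(s, name);
  if (v == NULL) return SK_USE_UNKNOWN;
  if (v->moved) return SK_USE_AFTER_MOVE;
  if (consumes) v->moved = true;
  return SK_USE_OK;
}
```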

The checker calls `z_set_check_target(target)` before running, making the target architecture visible for ABI-sensitive type rules (e.g., pointer-width types).

Diagnostic codes in the 3000-range cover type errors (`TYP001`–`TYP027`), ownership errors (`OWN001`–`OWN002`), borrow errors (`BOR001`–`BOR002`), interface errors (`IFC001`–`IFC005`), and match exhaustiveness (`MAT001`–`MAT005`).

Sources: [native/zero-c/src/checker.c:1-80](), [native/zero-c/src/main.c:4378-4384]()

---

## Stage 4 — IR Lowering and Generic Specialization

**Entry point:** `z_lower_program_with_source(const Program *program, const SourceInput *input)` in `native/zero-c/src/ir.c`.

### IR (MIR) Data Model

The IR is a flat, register-based mid-level representation stored in `IrProgram`:

- `IrFunction[]` — one entry per concrete function (generics are monomorphized)
- `IrLocal[]` per function — typed storage slots (scalars, arrays, records, `ByteView`, `Alloc`, `Vec`)
- `IrInstr[]` per function — flat instruction list (stores, calls, branches, returns)
- `IrValue` — typed SSA-like values referencing literals, locals, or sub-expressions
- `readonly_data[]` — embedded read-only byte strings (string literals, etc.)

`IrTypeKind` enumerates the value types the direct backend understands:

```text
IR_TYPE_VOID, IR_TYPE_BOOL, IR_TYPE_U8, IR_TYPE_U16, IR_TYPE_USIZE,
IR_TYPE_I32, IR_TYPE_U32, IR_TYPE_I64, IR_TYPE_U64,
IR_TYPE_BYTE_VIEW, IR_TYPE_ALLOC, IR_TYPE_VEC,
IR_TYPE_MAYBE_BYTE_VIEW, IR_TYPE_MAYBE_SCALAR,
IR_TYPE_RECORD, IR_TYPE_UNSUPPORTED
```

Sources: [native/zero-c/src/ir.c:145-160]()

### Lowering Sequence

`ir_lower_direct_backend_subset()` implements the main lowering logic:

1. **Guard:** Reject programs with `choices`, `interfaces`, `aliases`, or `consts` — these are not yet supported by the direct backend MVP.
2. **Collect functions:** Clone all non-test, non-generic functions into `direct_functions`. Generic (type-parameterized) functions are excluded from the initial set.
3. **Generic specialization:** Call `ir_collect_generic_specializations_from_stmt_vec()` for each function body. When a call site uses a generic function with concrete type arguments, `z_specialization_plan_add()` records a monomorphization entry. The specialization name is constructed by appending `__TypeArg` suffixes (via `z_specialized_function_name()`). The plan is capped at `IR_SPECIALIZATION_PLAN_LIMIT = 1024` entries.
4. **Sort functions** by stable ID (for deterministic output).
5. **Lower function bodies:** `ir_lower_function_body()` recursively lowers statements and expressions into `IrInstr` sequences for each `IrFunction`.
6. **MIR contract verification:** `z_mir_verify_direct_contracts(ir)` (see Stage 5).
7. **Export check:** If no function is marked `is_exported`, lowering fails with "direct backend requires at least one exported C ABI entry function".

Sources: [native/zero-c/src/ir.c:3452-3527](), [native/zero-c/src/specialize.c:58-80]()
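
Step 3 of the lowering sequence can be sketched as a name builder plus a capped plan. The `__TypeArg` suffix scheme and the 1024-entry limit come from the text; everything else is an assumption for illustration, including the `sk_` names, the 64-byte name buffer, and the choice to treat a repeated name as already planned. The exact output of `z_specialized_function_name()` may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SK_PLAN_LIMIT 1024 /* mirrors IR_SPECIALIZATION_PLAN_LIMIT in the text */
#define SK_NAME_CAP 64     /* hypothetical name buffer size */

typedef struct { char names[SK_PLAN_LIMIT][SK_NAME_CAP]; size_t len; } SkPlan;

/* Build "base__Arg1__Arg2..." into out. Returns false on truncation. */
static bool sk_specialized_name(char *out, size_t cap, const char *base,
                                const char *const *type_args, size_t count) {
  int n = snprintf(out, cap, "%s", base);
  if (n < 0 || (size_t)n >= cap) return false;
  size_t used = (size_t)n;
  for (size_t i = 0; i < count; i++) {
    n = snprintf(out + used, cap - used, "__%s", type_args[i]);
    if (n < 0 || (size_t)n >= cap - used) return false;
    used += (size_t)n;
  }
  return true;
}

/* Record one monomorphization. A name that is already present counts as
   success (assumed dedupe); a full plan or an oversized name fails. */
static bool sk_plan_add(SkPlan *plan, const char *name) {
  for (size_t i = 0; i < plan->len; i++) {
    if (strcmp(plan->names[i], name) == 0) return true;
  }
  if (plan->len == SK_PLAN_LIMIT) return false;
  if (strlen(name) >= SK_NAME_CAP) return false;
  strcpy(plan->names[plan->len++], name);
  return true;
}
```

A hard cap like this bounds the work a runaway chain of generic instantiations can cause; when the cap is hit, the caller can report a diagnostic instead of growing without limit.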

### `ir_mark_unsupported`

When lowering encounters a construct the direct backend cannot handle, it calls `ir_mark_unsupported(ir, message, line, column, actual)`. This sets `ir->mir_valid = false` and records structured diagnostic information. Subsequent IR operations check `ir->mir_valid` and short-circuit, so the program consistently fails at the first unsupported construct rather than producing partial output.

Sources: [native/zero-c/src/ir.c:554]()
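
The sticky-failure behavior described above can be sketched in a few lines. The struct and functions are hypothetical stand-ins for `IrProgram` and its helpers, and keeping only the first message is an assumption that follows from the text's "fails at the first unsupported construct".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for IrProgram. */
typedef struct {
  bool mir_valid;
  char message[96];
  int line;
  int column;
  size_t instr_len;
} SkIr;

static void sk_ir_init(SkIr *ir) {
  memset(ir, 0, sizeof(*ir));
  ir->mir_valid = true;
}

/* Record the first unsupported construct; later calls leave it intact. */
static void sk_mark_unsupported(SkIr *ir, const char *message, int line, int column) {
  if (!ir->mir_valid) return;
  ir->mir_valid = false;
  snprintf(ir->message, sizeof(ir->message), "%s", message);
  ir->line = line;
  ir->column = column;
}

/* Every IR-building operation short-circuits once the program is invalid,
   so no partial output accumulates after a failure. */
static bool sk_push_instr(SkIr *ir) {
  if (!ir->mir_valid) return false;
  ir->instr_len++;
  return true;
}
```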

---

## Stage 5 — MIR Verifier

**Entry point:** `z_mir_verify_direct_contracts(IrProgram *ir)` in `native/zero-c/src/mir_verify.c`.

After IR lowering, the MIR verifier performs a structural correctness pass over the populated `IrProgram`. It checks:

- **Local index bounds:** Every instruction that references a local slot must use a valid index within the function's `local_len`.
- **Initializer kind compatibility:** Each local's `IrTypeKind` constrains which `IrValueKind` may initialize it (e.g., `IR_TYPE_ALLOC` requires `IR_VALUE_FIXED_BUF_ALLOC`; `IR_TYPE_VEC` requires `IR_VALUE_VEC_INIT`).
- **ABI compatibility:** `mir_type_is_direct_abi()` accepts `Bool` and all integer types (`U8`–`U64`); `mir_type_is_direct_fallible_value()` accepts `Void`, `Bool`, `U8`, `U16`, `Usize`, `I32`, `U32`. These define the subset of types that may cross the exported C ABI boundary.
- **Helper requirements:** The verifier tracks `MirHelperRequirements` (allocator helpers, buffer helpers, runtime helpers, host/HTTP runtime imports) and verifies their counts against what the IR actually references.

When a violation is found, `mir_verify_mark_unsupported()` sets `ir->mir_valid = false` with a structured message explaining the violated contract. The `backend_blocker` field is also populated so diagnostics can report the precise stage (`"lower"`) and the unsupported feature to the user.

Sources: [native/zero-c/src/mir_verify.c:92-101](), [native/zero-c/src/mir_verify.c:30-48](), [native/zero-c/src/mir_verify.c:146-158]()
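
Three of the checks listed above (local index bounds, initializer compatibility, and the direct-ABI type test) can be sketched over a reduced data model. The `Sk` types are hypothetical. The `ALLOC` and `VEC` rules follow the text; the rule for all other types (literal or no initializer) is an assumption, as is the exact set of enum members counted as "integer types".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
  SK_TYPE_VOID, SK_TYPE_BOOL, SK_TYPE_U8, SK_TYPE_U16, SK_TYPE_USIZE,
  SK_TYPE_I32, SK_TYPE_U32, SK_TYPE_I64, SK_TYPE_U64,
  SK_TYPE_ALLOC, SK_TYPE_VEC
} SkTypeKind;

typedef enum {
  SK_VALUE_NONE, SK_VALUE_LITERAL, SK_VALUE_FIXED_BUF_ALLOC, SK_VALUE_VEC_INIT
} SkValueKind;

typedef struct { SkTypeKind type; SkValueKind init; } SkLocal;
typedef struct { size_t local_index; } SkInstr;
typedef struct {
  const SkLocal *locals; size_t local_len;
  const SkInstr *instrs; size_t instr_len;
} SkFunction;

/* Bool plus the integer types may cross the exported C ABI boundary. */
static bool sk_type_is_direct_abi(SkTypeKind t) {
  switch (t) {
    case SK_TYPE_BOOL: case SK_TYPE_U8: case SK_TYPE_U16: case SK_TYPE_USIZE:
    case SK_TYPE_I32: case SK_TYPE_U32: case SK_TYPE_I64: case SK_TYPE_U64:
      return true;
    default:
      return false;
  }
}

/* Which initializer kind each local type accepts. */
static bool sk_init_compatible(SkTypeKind t, SkValueKind v) {
  switch (t) {
    case SK_TYPE_ALLOC: return v == SK_VALUE_FIXED_BUF_ALLOC;
    case SK_TYPE_VEC:   return v == SK_VALUE_VEC_INIT;
    default:            return v == SK_VALUE_NONE || v == SK_VALUE_LITERAL; /* assumption */
  }
}

/* Returns NULL when the function passes, else a description of the
   first violated contract. */
static const char *sk_verify_function(const SkFunction *fn) {
  for (size_t i = 0; i < fn->local_len; i++) {
    if (!sk_init_compatible(fn->locals[i].type, fn->locals[i].init)) {
      return "initializer kind does not match local type";
    }
  }
  for (size_t i = 0; i < fn->instr_len; i++) {
    if (fn->instrs[i].local_index >= fn->local_len) {
      return "instruction references a local out of bounds";
    }
  }
  return NULL;
}
```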

---

## Stage 6 — Direct Emitter

After successful IR lowering and MIR verification, `main.c` selects an emitter based on the resolved target and requested emit kind (`EMIT_EXE` or `EMIT_OBJ`):

| Emitter string | Format | Source |
|---|---|---|
| `zero-elf64` | ELF64 x86-64 object | `emit_elf64.c` |
| `zero-elf64-exe` | ELF64 x86-64 executable | `emit_elf64.c` |
| `zero-elf-aarch64` | ELF AArch64 object | `emit_elf_aarch64.c` |
| `zero-elf-aarch64-exe` | ELF AArch64 executable | `emit_elf_aarch64.c` |
| `zero-macho64` | Mach-O x86-64 / arm64 object | `emit_macho64.c` |
| `zero-macho64-exe` | Mach-O executable | `emit_macho64.c` |
| `zero-coff-x64` | COFF x64 object | `emit_coff.c` |
| `zero-coff-x64-exe` | PE/COFF x64 executable | `emit_coff.c` |
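
The table can be read as a two-key lookup from object format and emit kind to an emitter string. The sketch below encodes exactly that mapping; the enum and function names are hypothetical, and how `main.c` actually resolves a target to a format is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical enums for the sketch; only the strings come from the table. */
typedef enum {
  SK_FMT_ELF_X64, SK_FMT_ELF_AARCH64, SK_FMT_MACHO, SK_FMT_COFF_X64
} SkObjectFormat;
typedef enum { SK_EMIT_OBJ, SK_EMIT_EXE } SkEmitKind;

/* Map (format, kind) to the emitter string listed in the table.
   Returns NULL for a format value outside the enum. */
static const char *sk_emitter_name(SkObjectFormat format, SkEmitKind kind) {
  bool exe = (kind == SK_EMIT_EXE);
  switch (format) {
    case SK_FMT_ELF_X64:     return exe ? "zero-elf64-exe" : "zero-elf64";
    case SK_FMT_ELF_AARCH64: return exe ? "zero-elf-aarch64-exe" : "zero-elf-aarch64";
    case SK_FMT_MACHO:       return exe ? "zero-macho64-exe" : "zero-macho64";
    case SK_FMT_COFF_X64:    return exe ? "zero-coff-x64-exe" : "zero-coff-x64";
  }
  return NULL;
}
```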

The emitter receives the validated `IrProgram *ir` and writes into a `ZBuf *artifact` (a growable byte buffer). It writes binary data directly using helpers like `elf_append_u8`, `elf_append_u16`, `elf_append_u32`, `elf_append_u64` — little-endian, no external library.
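
A minimal sketch of little-endian append helpers over a growable byte buffer is shown below. It is not the implementation of `ZBuf` or the `elf_append_*` functions; `SkBuf` and the `sk_append_*` names are hypothetical, and the doubling growth policy is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable byte buffer; zero-initialize before first use. */
typedef struct { uint8_t *data; size_t len; size_t cap; } SkBuf;

/* Ensure room for `extra` more bytes, doubling capacity as needed. */
static bool sk_reserve(SkBuf *b, size_t extra) {
  if (extra > SIZE_MAX - b->len) return false;
  size_t need = b->len + extra;
  if (need <= b->cap) return true;
  size_t cap = b->cap ? b->cap : 64;
  while (cap < need) {
    if (cap > SIZE_MAX / 2) { cap = need; break; }
    cap *= 2;
  }
  uint8_t *grown = realloc(b->data, cap);
  if (grown == NULL) return false;
  b->data = grown;
  b->cap = cap;
  return true;
}

/* Append the low `width` bytes of value, least significant byte first.
   Shifting makes the result independent of the host's byte order. */
static bool sk_append_le(SkBuf *b, uint64_t value, size_t width) {
  assert(width >= 1 && width <= 8);
  if (!sk_reserve(b, width)) return false;
  for (size_t i = 0; i < width; i++) {
    b->data[b->len++] = (uint8_t)(value >> (8 * i));
  }
  return true;
}

static bool sk_append_u8(SkBuf *b, uint8_t v)   { return sk_append_le(b, v, 1); }
static bool sk_append_u16(SkBuf *b, uint16_t v) { return sk_append_le(b, v, 2); }
static bool sk_append_u32(SkBuf *b, uint32_t v) { return sk_append_le(b, v, 4); }
static bool sk_append_u64(SkBuf *b, uint64_t v) { return sk_append_le(b, v, 8); }

static void sk_buf_free(SkBuf *b) {
  free(b->data);
  b->data = NULL;
  b->len = 0;
  b->cap = 0;
}
```

As a usage example, appending `0x464C457F` with `sk_append_u32` writes the bytes `0x7F 'E' 'L' 'F'`, which is the ELF magic number at the start of an ELF file.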

The ELF64 emitter's diagnostic helper illustrates the expected scope:

```c
// native/zero-c/src/emit_elf64.c:52-55
snprintf(diag->expected, sizeof(diag->expected), "direct ELF64 object MVP subset");
snprintf(diag->help, sizeof(diag->help),
  "choose a supported direct target or restrict this program to exported primitive integer arithmetic functions");
```

Unsupported constructs that passed the MIR verifier but cannot be encoded in the binary format emit diagnostic code 4004 (`CGEN004`).

Sources: [native/zero-c/src/emit_elf64.c:8-68](), [native/zero-c/src/main.c:10198-10244]()

---

## The `generatedCBytes = 0` Invariant

Every JSON output path in `main.c` emits `"generatedCBytes": 0` and `"cBridgeFallback": false`. This is not a default or a placeholder — it is a hard invariant asserting that the direct-frontend build path produces **zero C code**. Examples from the codebase:

```c
// main.c:2304
zbuf_append(buf, ",\n  \"generatedCBytes\": 0,\n  \"cBridgeFallback\": false,\n ...");

// main.c:5652 (print_build_json)
printf(",\n  \"generatedCBytes\": %lld,\n ...", generated_c_bytes, ...);
// called with generated_c_bytes = 0 at every call site
```

The value is plumbed through as a `long long generated_c_bytes` parameter but always passed as `0` in direct builds. No code path in the current compiler assigns a non-zero value.

Sources: [native/zero-c/src/main.c:2304](), [native/zero-c/src/main.c:5597-5652](), [native/zero-c/src/main.c:10243]()
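
One way a test harness might check this invariant from the outside is to inspect the build JSON. The sketch below is a hypothetical helper, not part of the compiler: it does a naive key search rather than real JSON parsing, so it assumes the keys appear once, at the top level, and never inside a string value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *sk_skip_ws(const char *p) {
  while (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') p++;
  return p;
}

/* Find "key" followed by ':' and return a pointer to the start of its
   value, or NULL when the key is absent. Naive text search. */
static const char *sk_json_value(const char *json, const char *key) {
  char quoted[64];
  int n = snprintf(quoted, sizeof(quoted), "\"%s\"", key);
  if (n < 0 || (size_t)n >= sizeof(quoted)) return NULL;
  const char *p = strstr(json, quoted);
  if (p == NULL) return NULL;
  p = sk_skip_ws(p + n);
  if (*p != ':') return NULL;
  return sk_skip_ws(p + 1);
}

/* True only when the report states generatedCBytes == 0 and
   cBridgeFallback == false. */
static bool sk_report_is_direct(const char *json) {
  const char *bytes = sk_json_value(json, "generatedCBytes");
  const char *bridge = sk_json_value(json, "cBridgeFallback");
  if (bytes == NULL || bridge == NULL) return false;
  char *end = NULL;
  long long count = strtoll(bytes, &end, 10);
  if (end == bytes || count != 0) return false;
  return strncmp(bridge, "false", 5) == 0;
}
```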

---

## Removal of the C Bridge Path

The `--emit c` and `--legacy-backend` flags are rejected at startup, before any compilation begins:

```c
// native/zero-c/src/main.c:9799-9810
if (command.legacy_backend || command.emit == EMIT_C) {
  diag.code = 2003;
  snprintf(diag.message, sizeof(diag.message), "C backend output is not supported");
  snprintf(diag.expected, sizeof(diag.expected), "zero build --emit exe|obj <input>");
  snprintf(diag.actual, sizeof(diag.actual),
    command.legacy_backend ? "--legacy-backend" : "--emit c");
  snprintf(diag.help, sizeof(diag.help),
    "use direct emitters; C backend output is not a compatibility or debug path");
  ...
  return 1;
}
```

The `EmitKind` enum still has an `EMIT_C` variant and `Command` still has a `legacy_backend` bool (used only to detect and reject the flag). The infrastructure exists solely to produce a clear error. There is no code path that generates C source text from `.0` input. The help text is explicit: C backend output is **not a compatibility or debug path** — the intent is to make reintroduction unattractive at the code level.

Sources: [native/zero-c/src/main.c:29-33](), [native/zero-c/src/main.c:9799-9811](), [native/zero-c/src/main.c:9070-9073]()

---

## Error Propagation and Diagnostics

Errors are reported through a single `ZDiag` struct threaded through every stage. It carries `code`, `line`, `column`, `length`, `message`, `expected`, `actual`, and `help` fields. The `backend_blocker` sub-struct (`ZBackendBlocker`) adds `target`, `object_format`, `backend`, `stage`, and `unsupported_feature` for structured machine-readable output.

Diagnostic codes are organized by prefix:

| Range | Prefix | Domain |
|---|---|---|
| 1001–1003 | ERR | Internal errors |
| 2001–2003 | APP/BLD | Build-level errors |
| 3000–3110 | NAM/TYP/OWN/BOR/IFC/MAT/VAR | Semantic errors |
| 4004 | CGEN | Direct backend / emit errors |
| 6001–6002 | TAR | Target errors |
| 7001–7003 | IMP | Import errors |

The `zero explain <CODE>` command maps these codes to documentation.

Sources: [native/zero-c/src/main.c:79-162](), [native/zero-c/src/main.c:165-170]()
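
The range table above can be expressed as a small lookup from numeric code to domain. The ranges are taken from the table; the `sk_` names and the short domain labels are hypothetical, and this is not how `zero explain` is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { int first; int last; const char *domain; } SkCodeRange;

/* Ranges copied from the diagnostic table; labels are illustrative. */
static const SkCodeRange sk_code_ranges[] = {
  {1001, 1003, "internal"},
  {2001, 2003, "build"},
  {3000, 3110, "semantic"},
  {4004, 4004, "backend"},
  {6001, 6002, "target"},
  {7001, 7003, "import"},
};

/* Classify a numeric diagnostic code; "unknown" when no range matches. */
static const char *sk_diag_domain(int code) {
  size_t count = sizeof(sk_code_ranges) / sizeof(sk_code_ranges[0]);
  for (size_t i = 0; i < count; i++) {
    if (code >= sk_code_ranges[i].first && code <= sk_code_ranges[i].last) {
      return sk_code_ranges[i].domain;
    }
  }
  return "unknown";
}
```

For example, code 2003 (the rejection of `--emit c`) falls in the build range, and code 3024 (malformed character literal) falls in the semantic range.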

---

## Build Dispatch Summary

The top-level build command follows this sequence in `main.c`:

1. Reject `--legacy-backend` / `--emit c` immediately (line 9799).
2. Resolve target and validate it is known (`z_find_target`).
3. Load source into `SourceInput`, recording per-phase timing (`resolve_ms`, `parse_ms`, `check_ms`).
4. Tokenize → Parse → Check (lines 4365–4384).
5. Lower to `IrProgram` via `z_lower_program_with_source` (line 10176).
6. Dispatch to the correct emitter: `EMIT_OBJ` → `z_emit_*_object_from_ir`; `EMIT_EXE` → `z_emit_*_exe_from_ir` (lines 10190–10406).
7. Write the `ZBuf` artifact to disk and emit JSON or human-readable output with `generatedCBytes: 0`.

The compiler is a direct, single-process, single-binary tool: one `.0` file in, one object file or executable out, zero generated C, zero external compiler dependencies.

Sources: [native/zero-c/src/main.c:9783-9811](), [native/zero-c/src/main.c:10175-10250]()
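
The stage-by-stage control flow summarized above can be sketched as a driver that runs stages in order and stops at the first one that sets a diagnostic code. This is an illustration of the pattern only: the `Sk` types, the stage functions, and the code value 3001 are hypothetical, and `main.c` calls its stages directly rather than through a table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reduced diagnostic: a code plus the stage that failed. */
typedef struct { int code; const char *stage; } SkDiag;
typedef void (*SkStageFn)(SkDiag *diag);
typedef struct { const char *name; SkStageFn run; } SkStage;

/* Run stages in order. After each one, check diag->code; a non-zero
   code stops the pipeline. Returns the number of stages that completed. */
static size_t sk_run_pipeline(const SkStage *stages, size_t count, SkDiag *diag) {
  size_t done = 0;
  while (done < count) {
    stages[done].run(diag);
    if (diag->code != 0) {
      diag->stage = stages[done].name;
      break;
    }
    done++;
  }
  return done;
}

/* Sample stages for the sketch. */
static void sk_stage_ok(SkDiag *diag) { (void)diag; }
static void sk_stage_type_error(SkDiag *diag) { diag->code = 3001; /* illustrative value */ }
```

With stages `lex`, `parse`, `check`, `lower` where `check` fails, the driver reports two completed stages, names `check` as the failing stage, and never runs `lower`.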
