# The One Idea to Keep & What to Read Next

> Closing recap: the core idea in one sentence, the analogy that holds, the three things that make DeepZero different from a script, and concrete pointers to where to look next in the codebase.

- Repository: 416rehman/DeepZero
- GitHub: https://github.com/416rehman/DeepZero
- Human wiki: https://grok-wiki.com/public/wiki/416rehman-deepzero-841693239324
- Complete Markdown: https://grok-wiki.com/public/wiki/416rehman-deepzero-841693239324/llms-full.txt

## Source Files

- `README.md`
- `src/deepzero/cli.py`
- `src/deepzero/engine/pipeline.py`
- `src/deepzero/api/server.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [README.md](README.md)
- [src/deepzero/cli.py](src/deepzero/cli.py)
- [src/deepzero/engine/pipeline.py](src/deepzero/engine/pipeline.py)
- [src/deepzero/engine/runner.py](src/deepzero/engine/runner.py)
- [src/deepzero/engine/state.py](src/deepzero/engine/state.py)
- [src/deepzero/engine/stage.py](src/deepzero/engine/stage.py)
- [src/deepzero/api/server.py](src/deepzero/api/server.py)
- [pipelines/loldrivers/pipeline.yaml](pipelines/loldrivers/pipeline.yaml)
</details>

# The One Idea to Keep & What to Read Next

DeepZero is a YAML-defined pipeline engine that turns a corpus of files into a structured research report by running each file through an ordered chain of processors — filtering, transforming, and LLM-assessing — while persisting every result atomically to disk so any interrupted run can pick up exactly where it left off.

This closing page distills the core concept into one sentence, revisits the analogy that holds best, names the three structural choices that make DeepZero more than a script, and tells you exactly which files to open next in the codebase.

---

## The One Sentence

> **DeepZero is a resumable, parallel, YAML-configured assembly line: each file in your corpus enters at stage 1 and either passes forward, gets filtered out, or is marked failed — and every outcome is written atomically to disk before the next stage begins.**

Everything else in the codebase is engineering that upholds that contract.

---

## The Analogy That Holds

Think of a **factory inspection line**. Raw parts (your binary files) arrive at the start. Each workstation (stage) does one job — remove duplicates, run a decompiler, scan with Semgrep, ask an LLM — and either passes the part forward, rejects it, or marks it defective. At the end of the shift, a ledger records what happened to every part.

The analogy holds in three concrete ways:

| Factory concept | DeepZero equivalent | Where in the code |
|---|---|---|
| Part on the line | `SampleState` with per-stage `history` | `src/deepzero/engine/state.py:56-131` |
| Workstation decision | `Verdict.CONTINUE` or `Verdict.FILTER` | `src/deepzero/engine/state.py:82-97` |
| End-of-shift ledger | `run_manifest.json` written after every stage | `src/deepzero/engine/state.py:244-269` |

The analogy breaks down at one point: a real factory cannot time-travel. DeepZero can. If the power goes out, the ledger is already on disk and the line restarts at the first incomplete workstation.

---

## Three Things That Make DeepZero Different From a Script

### 1. Atomic Per-Sample State (Resumability)

A script runs start-to-finish. If it dies on sample 4,000 of 10,000, you lose everything and restart.

DeepZero writes each sample's result as a `.tmp` → `os.replace()` atomic rename before moving to the next one. When `deepzero run` is called again, `_resume_or_ingest` checks the `samples/` directory first. If sample state files already exist, the expensive ingest step is skipped entirely, and only the samples that did not complete a given stage are fed back into the thread pool.

```python
# src/deepzero/engine/runner.py:275-285
existing_states = self.state_store.list_samples()
if existing_states:
    log.info(
        "resume: found %d existing samples, skipping ingest",
        len(existing_states),
    )
    sample_states: dict[str, SampleState] = {s.sample_id: s for s in existing_states}
```

The atomic write itself lives in `StateStore._atomic_write` — write to `.tmp`, then `os.replace` — with a Windows-specific retry loop to handle Defender scan locks:

```python
# src/deepzero/engine/state.py:32-42
def atomic_replace(src: Path, dst: Path, retries: int = 5) -> None:
    for i in range(retries):
        try:
            os.replace(src, dst)
            return
        except PermissionError:
            if i == retries - 1 or os.name != "nt":
                raise
            time.sleep(0.05)
```

Sources: [src/deepzero/engine/runner.py:275-285](), [src/deepzero/engine/state.py:32-42]()

---

### 2. Typed Processor Contracts (Extensibility Without Chaos)

A script accumulates `if/elif` blocks as requirements grow. DeepZero instead enforces a typed processor taxonomy at load time:

| Type | Shape | Example |
|---|---|---|
| `IngestProcessor` | `process(ctx, target_dir) → list[Sample]` | `pe_ingest.py` — discovers `.sys` files |
| `MapProcessor` | `process(ctx, entry) → ProcessorResult` | `ghidra_decompile.py` — one file, one result |
| `BulkMapProcessor` | `process(ctx, entries) → list[ProcessorResult]` | `semgrep_scanner.py` — batch scan |
| `ReduceProcessor` | `process(ctx, entries) → set[sample_id]` | `top_k` — returns the IDs to keep |

The pipeline loader enforces that stage 0 must be an `IngestProcessor` and all subsequent stages must be one of the other three types. Any violation raises a `ValueError` before execution begins:

```python
# src/deepzero/engine/pipeline.py:159-177
if i == 0:
    if not isinstance(instance, IngestProcessor):
        raise ValueError(
            f"first stage '{spec.name}' ... must be an IngestProcessor."
        )
    pipeline.ingest_processor = instance
else:
    if isinstance(instance, IngestProcessor):
        raise ValueError(...)
    if not isinstance(instance, (MapProcessor, ReduceProcessor, BulkMapProcessor)):
        raise ValueError(...)
    pipeline.stages.append((spec, instance))
```

Custom processors are referenced by a Python file path in the YAML (`processor: ghidra_decompile/ghidra_decompile.py`). The loader resolves them at startup, so errors appear before any work is done.

Sources: [src/deepzero/engine/pipeline.py:154-184]()

---

### 3. Breadth-First Execution With Sync Barriers (Observability and Safety)

A script processes files serially or with ad-hoc threads. DeepZero uses a **breadth-first** model: all surviving samples pass through stage N before any sample enters stage N+1. After every stage, the runner writes the manifest to disk and regenerates per-sample `context.md` files before proceeding.

```python
# src/deepzero/engine/runner.py:167-256 (key excerpt)
# breadth-first stage execution
for spec, processor in self.stages:
    if self._shutdown_event.is_set():
        break
    ...
    # sync barrier
    self.state_store.save_manifest(list(sample_states.values()))
    self._generate_context_files(sample_states)
```

This means:
- You can inspect `work/<pipeline>/run_manifest.json` after any stage to see exactly how many samples passed, were filtered, or failed.
- Ctrl+C is handled gracefully: a `threading.Event` (`_shutdown_event`) is set on the first SIGINT, and the runner saves state before exiting. A second SIGINT forces `os._exit(1)`.
- The REST API (`deepzero serve`) can read the same state files concurrently because they are written atomically.

Sources: [src/deepzero/engine/runner.py:167-256]()

---

## The Structural Flow, Visualized

```text
deepzero run <target> -p pipeline.yaml
        │
        ▼
  load_pipeline()          ← parse YAML, resolve processor classes, validate graph
        │                    src/deepzero/engine/pipeline.py
        ▼
  PipelineRunner.run()     ← install SIGINT handler, create RunState
        │
        ├── _resume_or_ingest()   ← skip ingest if samples/ already on disk
        │        │
        │        ▼ list[SampleState]
        │
        └── breadth-first loop over stages
               │
               ├── MapProcessor    → ThreadPoolExecutor, 1 future per sample
               ├── BulkMapProcessor → one call, list of results
               └── ReduceProcessor → one call, returns kept sample IDs
                       │
                       ▼ (after each stage)
               StateStore.save_manifest()    ← atomic write
               _generate_context_files()     ← per-sample context.md
```

---

## What to Read Next

Here is a concrete reading order, from high-level to low-level:

### Start with the real pipeline definition

[`pipelines/loldrivers/pipeline.yaml`](pipelines/loldrivers/pipeline.yaml) is the only complete working pipeline in the repo. Its comments explain all four processor types, the `${ENV_VAR:-default}` expansion syntax, `parallel:`, `timeout:`, and `on_failure:` in context. Read this before any code.

### Then follow the load path

[`src/deepzero/engine/pipeline.py`](src/deepzero/engine/pipeline.py) — `load_pipeline()` parses the YAML (line 76), expands env vars (line 256), builds `StageSpec` objects (lines 103–135), and calls `_resolve_processors()` (line 154) to validate the stage graph. This is where YAML becomes runnable objects.

### Then the execution heart

[`src/deepzero/engine/runner.py`](src/deepzero/engine/runner.py) — `PipelineRunner._run_all_stages()` (line 156) contains the breadth-first loop, the three dispatch branches (`_run_map`, `_run_reduce`, `_run_batch`), and the sync barrier. `_process_one_map()` (line 529) shows how retries, timeouts, and `should_skip()` work at the per-sample level.

### Then the state model

[`src/deepzero/engine/state.py`](src/deepzero/engine/state.py) — `SampleState` (line 57) and `StageOutput` (line 44) are the two data structures that hold every sample's complete history. `StateStore` (line 163) is a thin file-based persistence layer with no database dependency.

### Then the processor contract

[`src/deepzero/engine/stage.py`](src/deepzero/engine/stage.py) — `MapProcessor`, `ReduceProcessor`, `BulkMapProcessor`, and `IngestProcessor` are the four abstract base classes you subclass to write custom processors. `ProcessorEntry` and `ProcessorContext` are the input and execution-context types passed to every `process()` call.

### Finally, the CLI surface

[`src/deepzero/cli.py`](src/deepzero/cli.py) — `run` (line 133), `status` (line 230), `validate` (line 289), `init` (line 336), and `serve` (line 442) are the five user-facing commands. The `run` command is the definitive wiring example: it loads the pipeline, builds the `PipelineRunner`, detects an existing run (resume) or initializes a fresh `RunState`, then calls `runner.run()`.

---

## Closing Summary

The single idea worth carrying forward is this: **every sample's fate is committed to disk atomically after each stage, so the pipeline is always safe to interrupt and always safe to resume.** The YAML schema, the typed processor hierarchy, and the breadth-first sync barriers all exist in service of that one guarantee. The `StateStore._atomic_write` method in [`src/deepzero/engine/state.py:174-177`](src/deepzero/engine/state.py) — nine lines of `write → os.replace` — is the physical foundation everything else rests on.

Sources: [src/deepzero/engine/state.py:174-177]()
