# Explain It Simply: What DeepZero Does

> Plain-language explanation of the whole project — what problem it solves, the one analogy to hold in mind, and the three ideas every reader must leave with before going deeper.

- Repository: 416rehman/DeepZero
- GitHub: https://github.com/416rehman/DeepZero
- Human wiki: https://grok-wiki.com/public/wiki/416rehman-deepzero-841693239324
- Complete Markdown: https://grok-wiki.com/public/wiki/416rehman-deepzero-841693239324/llms-full.txt

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [README.md](README.md)
- [pyproject.toml](pyproject.toml)
- [src/deepzero/__main__.py](src/deepzero/__main__.py)
- [src/deepzero/cli.py](src/deepzero/cli.py)
- [src/deepzero/engine/runner.py](src/deepzero/engine/runner.py)
- [src/deepzero/engine/stage.py](src/deepzero/engine/stage.py)
- [src/deepzero/engine/state.py](src/deepzero/engine/state.py)
- [src/deepzero/engine/pipeline.py](src/deepzero/engine/pipeline.py)
- [src/deepzero/engine/llm.py](src/deepzero/engine/llm.py)
- [src/deepzero/stages/__init__.py](src/deepzero/stages/__init__.py)
- [pipelines/loldrivers/pipeline.yaml](pipelines/loldrivers/pipeline.yaml)
</details>

DeepZero is a command-line tool that lets you run automated security research pipelines against a folder of files — no glue code required. You describe *what* to do in a YAML file, and DeepZero handles *how*: running stages in parallel, saving progress after each step so you can safely interrupt and resume, and calling an LLM at the end to produce written assessments.

This page gives you the mental model you need before reading any of the architecture, API, or processor documentation. Three ideas cover everything: pipelines are YAML recipes, samples flow through typed stages, and state lives on disk so nothing is lost.

---

## The One Analogy: A Fault-Tolerant Assembly Line

Imagine a factory assembly line for Windows kernel drivers. A truck arrives with thousands of `.sys` files. The line has several stations:

1. **Receiving** — identify every item and put each one in its own bin.
2. **Screening** — reject items that are already on a known-safe list.
3. **Machining** — run each item through a decompiler to extract readable code.
4. **Inspection** — run an automated scanner over the code to find patterns.
5. **Expert review** — send only the suspicious survivors to an LLM analyst who writes up a verdict.

If the power goes out between stations 3 and 4, you do not start over. Each bin already has the machined output. The line resumes exactly where it left off.

That is DeepZero. The truck is your target directory. The stations are pipeline stages. Each bin is a per-sample folder under `work/`. The LLM analyst is whichever model you configure, called through [LiteLLM](https://github.com/BerriAI/litellm).

---

## What Problem It Solves

Vulnerability researchers who want to screen large binary corpora (driver packs, firmware images, OS packages) face the same pain every time:

- They write ad hoc shell scripts that cannot be resumed after a crash.
- Parallelism is tacked on as an afterthought, introducing race conditions.
- Integrating a decompiler, a static analysis tool, and an LLM requires custom glue code for each project.
- Results are scattered across text files with no consistent schema.

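DeepZero's YAML-defined pipeline addresses each of these: per-sample state on disk makes runs resumable, parallelism is declared per stage rather than hand-rolled, processors wrap the decompiler, scanner, and LLM behind one interface, and every sample's results land in the same `state.json` layout. The README describes it as an "automated vulnerability research pipeline engine" with "atomic per-sample state on disk; Ctrl+C and re-run to pick up where you left off."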

Sources: [README.md:6-35]()

---

## Idea 1 — A Pipeline Is a YAML Recipe

You write a single `pipeline.yaml` file. It lists a name, an optional LLM model, run-wide settings such as the work directory and worker count, and a sequence of stages. Each stage names a processor and can set parallelism, timeouts, retry behavior, and typed configuration.

```yaml
# pipelines/loldrivers/pipeline.yaml (abbreviated)
name: loldrivers
model: vertex_ai/gemini-2.5-pro

settings:
  work_dir: work
  max_workers: 4

stages:
  - name: discover
    processor: pe_ingest/pe_ingest.py   # external processor
    config:
      extensions: [".sys"]
      recursive: true

  - name: kernel_filter
    processor: metadata_filter           # built-in processor
    config:
      require:
        is_kernel_driver: true
        has_ioctl_surface: true

  - name: decompile
    processor: ghidra_decompile/ghidra_decompile.py
    parallel: 0        # 0 = auto-scale to CPU count
    timeout: 600
    config:
      ghidra_install_dir: ${GHIDRA_INSTALL_DIR}   # env-var expansion

  - name: pick_top_10
    processor: top_k
    config:
      metric_path: "semgrep_scanner.finding_count"
      keep_top: 10

  - name: assess
    processor: generic_llm
    parallel: 2
    config:
      prompt: pipelines/loldrivers/assessment.j2
      output_file: assessment.md
```

Sources: [pipelines/loldrivers/pipeline.yaml:1-93]()

The engine loads this at runtime, expands `${VAR}` and `${VAR:-default}` from environment variables, validates that every stage references a real processor class, and then enforces that the first stage is an ingest processor.

Sources: [src/deepzero/engine/pipeline.py:76-215]()
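
The `${VAR}` and `${VAR:-default}` expansion described above can be sketched in a few lines. This is a minimal illustration, not the engine's code: the regex, the function name `expand_env`, and the choice to raise on an unset variable with no default are all assumptions. It also applies the default only when the variable is unset, which is simpler than shell semantics.

```python
import os
import re

# Matches ${VAR} and ${VAR:-default}. The pattern is an assumption made for
# this sketch, not the engine's actual regex.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand_env(text, env=None):
    """Replace ${VAR} / ${VAR:-default} in text with values from env."""
    env = os.environ if env is None else env

    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Hypothetical policy: fail loudly when nothing can be substituted.
        raise KeyError(f"environment variable {name!r} is not set")

    return _VAR_PATTERN.sub(substitute, text)
```

For example, `expand_env("${GHIDRA_INSTALL_DIR}/support", {"GHIDRA_INSTALL_DIR": "/opt/ghidra"})` yields `/opt/ghidra/support`.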

---

## Idea 2 — Samples Flow Through Four Typed Stages

Every file discovered by the ingest stage becomes a *sample*. A sample travels through the remaining stages one by one. Processors are typed by their relationship to the sample stream:

| Type | Relationship | When to use |
|---|---|---|
| `IngestProcessor` | One call, returns a list of samples | File discovery, PE parsing, API ingestion |
| `MapProcessor` | Called once per sample, fan-out via `ThreadPoolExecutor` | Filtering, decompilation, per-file LLM calls |
| `ReduceProcessor` | Sees all active samples at once, returns which survive | Top-k selection, global ranking, deduplication |
| `BulkMapProcessor` | All samples in one external invocation | Semgrep batch scan, any tool with high startup cost |

```
                        ┌─────────────────────────────────────────┐
Target dir              │  Pipeline                               │
    │                   │                                         │
    └──▶ IngestProcessor│ ──▶ sample_a ─┐                        │
                        │               ├──▶ MapProcessor (×N)   │
                        │    sample_b ──┤        │                │
                        │               ├──▶ MapProcessor         │
                        │    sample_c ──┘        │                │
                        │                        ▼                │
                        │               ReduceProcessor           │
                        │               (top-k survivors)         │
                        │                        │                │
                        │                        ▼                │
                        │               MapProcessor (LLM assess) │
                        └─────────────────────────────────────────┘
```

A processor returns one of three verdicts: `ok` (sample continues downstream), `filter` (sample intentionally excluded, still tracked), or `fail` (something broke, error is logged).

Sources: [src/deepzero/engine/stage.py:139-171](), [src/deepzero/engine/stage.py:277-350]()
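
To make the Map shape and the three verdicts concrete, here is a small sketch of a per-sample function fanned out over a thread pool. The names `Result`, `is_kernel_driver`, `run_map_stage`, and `survivors` are hypothetical and do not come from the DeepZero codebase; only the verdict strings and the use of `ThreadPoolExecutor` are taken from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Result:
    verdict: str                                  # "ok", "filter", or "fail"
    data: dict = field(default_factory=dict)
    error: str = ""


def is_kernel_driver(sample):
    """Hypothetical per-sample processor body."""
    try:
        if sample["is_kernel_driver"]:
            return Result("ok", {"checked": True})
        return Result("filter")                   # excluded on purpose, still tracked
    except Exception as exc:                      # anything unexpected becomes "fail"
        return Result("fail", error=repr(exc))


def run_map_stage(process, samples, max_workers=4):
    """Call process once per sample in a thread pool; return results by id."""
    ids = list(samples)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(process, (samples[i] for i in ids)))
    return dict(zip(ids, results))


def survivors(results):
    """Only samples with an "ok" verdict continue downstream."""
    return [sid for sid, res in results.items() if res.verdict == "ok"]
```

Filtered and failed samples stay in the results dictionary, which mirrors the idea that they remain tracked even though they stop flowing.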

### Built-in processors

DeepZero ships seven built-in processors you can reference by bare name in any pipeline YAML:

| Name | Type | What it does |
|---|---|---|
| `file_discovery` | Ingest | Recursively discovers files by extension |
| `metadata_filter` | Map | Filters samples by fields stored in history data |
| `hash_exclude` | Map | Excludes samples whose SHA-256 is on a blocklist |
| `generic_llm` | Map | Renders a Jinja2 template and calls the configured LLM |
| `generic_command` | Map | Runs an arbitrary shell command per sample |
| `top_k` | Reduce | Keeps the top-N samples by a numeric field |
| `sort` | Reduce | Reorders samples by a field |

Sources: [src/deepzero/stages/__init__.py:1-17]()
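
As an illustration of the Reduce shape, the following sketch ranks samples by a dotted metric path and keeps the top N, in the spirit of `top_k`. It assumes that the first path segment names the stage whose data holds the field, and that samples without a numeric value are dropped; both are guesses about behavior, and `lookup` and `top_k` here are stand-ins rather than the shipped implementation.

```python
def lookup(data, dotted_path):
    """Follow a dotted path such as "semgrep_scanner.finding_count"."""
    current = data
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def top_k(samples, metric_path, keep_top):
    """Return ids of the keep_top samples with the highest metric value."""
    scored = []
    for sample_id, data in samples.items():
        value = lookup(data, metric_path)
        if isinstance(value, (int, float)):
            scored.append((value, sample_id))
    # Highest value first; ties break by sample id so the result is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [sample_id for _, sample_id in scored[:keep_top]]
```

Because a reducer needs every score before it can rank, it has to see the whole sample set at once, which is why it cannot run per sample like a Map stage.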

External processors (like the Ghidra decompiler or the loldrivers.io filter) are Python classes in a `processors/` directory. You reference them as `dir/file.py` or `dir/file.py:ClassName` in your YAML. The engine imports and instantiates them at load time.

Sources: [src/deepzero/engine/pipeline.py:154-177]()
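
One plausible way to resolve the `dir/file.py` and `dir/file.py:ClassName` forms is sketched here using the standard `importlib` machinery. The helper names and the fallback rule (accept a file without an explicit class name only if it defines exactly one class) are assumptions for illustration; the engine's real resolution rules may differ.

```python
import importlib.util
import inspect
from pathlib import Path


def parse_processor_ref(ref):
    """Return (file_path, class_name); built-in names return (None, name)."""
    if ".py" not in ref:
        return None, ref                          # bare built-in, e.g. "top_k"
    path, sep, class_name = ref.partition(".py:")
    if sep:
        return path + ".py", class_name           # "dir/file.py:ClassName"
    return ref, None                              # "dir/file.py"


def load_processor_class(base_dir, ref):
    """Import an external processor file and return the class it refers to."""
    rel_path, class_name = parse_processor_ref(ref)
    if rel_path is None:
        raise ValueError(f"{ref!r} is a built-in name, not a file reference")
    file_path = Path(base_dir) / rel_path
    spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    if class_name:
        return getattr(module, class_name)
    # Hypothetical fallback: the file must define exactly one class of its own.
    own = [obj for _, obj in inspect.getmembers(module, inspect.isclass)
           if obj.__module__ == module.__name__]
    if len(own) != 1:
        raise ValueError(f"{ref!r} must name a class explicitly")
    return own[0]
```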

### LLM integration is provider-neutral

`LLMProvider` wraps [LiteLLM](https://github.com/BerriAI/litellm), so `model:` in your YAML can be any string LiteLLM understands — `openai/gpt-4o`, `vertex_ai/gemini-2.5-pro`, `anthropic/claude-3-5-sonnet`, a local Ollama model, etc. Rate-limit handling, exponential backoff, and retry logic are built in.

Sources: [src/deepzero/engine/llm.py:26-114]()
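
The retry behavior mentioned above follows a common pattern, sketched here in isolation. Nothing in this block is DeepZero's or LiteLLM's API: `RateLimited` is a stand-in exception, and the attempt count, base delay, cap, and jitter are illustrative values.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical)."""


def call_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Retry call() on RateLimited, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise                             # out of attempts, surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting.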

---

## Idea 3 — State Lives on Disk, Nothing Is Lost

After every stage completes, the engine writes each sample's full history to disk atomically (write to `.tmp`, rename). A run can be interrupted at any point and resumed without replaying completed work.

The on-disk layout looks like this:

```
work/
└── loldrivers/               ← pipeline work dir
    ├── run.json              ← global run status + per-stage counters
    ├── run_manifest.json     ← summary table of all samples + verdicts
    ├── pipeline.yaml         ← snapshot of the YAML used for this run
    └── samples/
        └── <sample_id>/
            ├── state.json    ← per-stage history, verdicts, data, artifacts
            └── context.md    ← human-readable summary (auto-generated)
```

The `StateStore` class writes every file with an atomic rename, and uses a version field (`_version: 2`) to reject state files from incompatible older runs.

Sources: [src/deepzero/engine/state.py:163-286]()
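
The write-to-`.tmp`-then-rename pattern and the version check can be sketched as follows. The function names and error handling are assumptions; only the `.tmp` suffix, the rename step, and the `_version: 2` field come from the description above.

```python
import json
import os
from pathlib import Path

STATE_VERSION = 2   # mirrors the `_version: 2` field mentioned above


def save_state(path, state):
    """Write JSON to a sibling .tmp file, then rename it over the target."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {**state, "_version": STATE_VERSION}
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())     # push bytes to disk before the rename
    os.replace(tmp, path)             # readers see the old file or the new one


def load_state(path):
    """Load a state file, rejecting versions this sketch does not understand."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    if payload.get("_version") != STATE_VERSION:
        raise ValueError(f"incompatible state version: {payload.get('_version')!r}")
    return payload
```

If the process dies mid-write, only the `.tmp` file is damaged; the previous `state.json` is untouched, which is what makes resuming safe.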

When you run `deepzero run` and the work directory already holds sample state from an earlier run, the engine skips ingest and resumes at the first incomplete stage. From the runner:

> "fast resume: if states already exist on disk, skip the expensive ingest"

Sources: [src/deepzero/engine/runner.py:275-296]()
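
A rough sketch of how a resume point could be derived from one sample's saved history is shown here. The history shape (a list of `{"stage", "verdict"}` entries) and the policy are assumptions: `ok` counts as done, `filter` ends the sample's journey, and a missing or `fail` entry marks where to pick up. The real runner may treat failures differently.

```python
def resume_point(stage_names, history):
    """Return the stage a sample should resume at, or None if nothing is left."""
    verdicts = {entry["stage"]: entry["verdict"] for entry in history}
    for name in stage_names:
        verdict = verdicts.get(name)
        if verdict == "filter":
            return None               # sample was excluded; no further work
        if verdict != "ok":
            return name               # missing or failed: resume here
    return None                       # every stage already completed
```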

A `Ctrl+C` during execution sets a shutdown event, drains in-flight threads, saves state for all in-progress samples, and marks the run as `interrupted` rather than `failed`. A second `Ctrl+C` forces an immediate exit.

Sources: [src/deepzero/engine/runner.py:826-850]()
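
The two-step interrupt behavior can be modeled with a `threading.Event`. This is a simplified, single-threaded sketch with hypothetical names (`ShutdownController`, `run_stage`); the exit code 130 is a common shell convention for SIGINT, not something taken from the source.

```python
import threading


class ShutdownController:
    """First interrupt requests a graceful stop; the second forces exit."""

    def __init__(self):
        self.event = threading.Event()

    def handle_interrupt(self, signum=None, frame=None):
        if self.event.is_set():
            raise SystemExit(130)     # second Ctrl+C: leave immediately
        self.event.set()              # first Ctrl+C: ask workers to wind down


def run_stage(samples, process, shutdown):
    """Process samples until finished or until a shutdown is requested."""
    completed = []
    for sample in samples:
        if shutdown.event.is_set():
            return completed, "interrupted"
        completed.append(process(sample))
    return completed, "completed"
```

In a real program the handler would be registered from the main thread with `signal.signal(signal.SIGINT, controller.handle_interrupt)`; the `signum` and `frame` parameters exist to match that callback signature.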

---

## The Command Line in Practice

```powershell
# run a pipeline (resumes automatically if interrupted)
deepzero run C:\drivers -p .\pipelines\loldrivers\pipeline.yaml

# check status of a run without re-running it
deepzero status -p loldrivers

# validate a pipeline YAML without executing it
deepzero validate loldrivers

# scaffold a new empty pipeline
deepzero init my-new-pipeline

# list all registered processor types
deepzero list-processors

# start an interactive LLM conversation over a completed work directory
deepzero interactive --work-dir work/loldrivers
```

The `run` command accepts a `--model` flag to override the model without editing the YAML, and a `--clean` flag to discard previous state and start fresh. With `--clean`, the old work directory is moved aside rather than deleted immediately, which avoids data loss on Windows, where Defender briefly locks files.

Sources: [src/deepzero/cli.py:132-224]()
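
The "move aside instead of delete" idea behind `--clean` might look roughly like this. The function name, the `.old-<timestamp>` naming scheme, and the decision to recreate an empty directory are all assumptions for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def move_aside(work_dir: Path) -> Optional[Path]:
    """Rename work_dir to a timestamped sibling and recreate it empty."""
    if not work_dir.exists():
        return None                               # nothing to clean
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = work_dir.with_name(f"{work_dir.name}.old-{stamp}")
    counter = 0
    while target.exists():                        # never clobber an earlier backup
        counter += 1
        target = work_dir.with_name(f"{work_dir.name}.old-{stamp}-{counter}")
    work_dir.rename(target)
    work_dir.mkdir()                              # fresh, empty work directory
    return target
```

A rename is a single directory operation, so it does not need to touch each file the way a recursive delete does; the old directory can be removed later at leisure.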

---

## How the Three Ideas Connect

```text
pipeline.yaml          Engine                     Disk
─────────────          ──────                     ────
stages: [...]  ──▶  load & validate
                    expand ${ENV}
                    resolve processor classes
                            │
                            ▼
                    ingest: discover files ──────▶ work/samples/<id>/state.json
                            │
                    for each stage:
                      if Map     → ThreadPool     ──▶ state.json (per sample, atomic)
                      if Reduce  → barrier, rank  ──▶ state.json (filtered samples)
                      if BulkMap → one invocation ──▶ state.json (indexed results)
                            │
                      sync barrier: save manifest ▶ run_manifest.json
                            │
                    mark run completed            ──▶ run.json
```

DeepZero is deliberately thin: the YAML and the processor base classes are the whole contract. The engine never looks inside your data dictionaries — it just passes them forward via each sample's `history` ledger so downstream processors can read upstream results with `entry.upstream_data("stage_name", "field")`.

Sources: [src/deepzero/engine/stage.py:103-113](), [src/deepzero/engine/runner.py:65-90]()
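
The history ledger idea can be illustrated with a tiny stand-in class. Only the two-argument call shape `upstream_data("stage_name", "field")` comes from the text above; the `Entry` class, its `record` method, and the `default` parameter are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """Hypothetical stand-in for a sample's view of its own history."""
    history: dict = field(default_factory=dict)   # stage name -> data dict

    def record(self, stage_name, data):
        """Store a copy of the data a stage produced for this sample."""
        self.history[stage_name] = dict(data)

    def upstream_data(self, stage_name, key, default=None):
        """Read one field written by an earlier stage."""
        return self.history.get(stage_name, {}).get(key, default)
```

A downstream processor would call something like `entry.upstream_data("semgrep_scanner", "finding_count")` without the engine ever needing to know what that field means.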

---

## Three Things to Take Away

1. **Pipeline-as-YAML**: Every pipeline is a YAML file listing typed processor stages. Built-in processors cover common needs; external processors (Python classes) handle specialized tools. No glue code is required.

2. **Four processor shapes**: Ingest discovers samples, Map transforms one at a time (parallel), Reduce filters the whole set at once, and BulkMap batches them into a single external invocation. Knowing which shape to use determines how the engine schedules your work.

3. **Atomic, resumable state**: Each sample's history is written atomically after every stage. Interrupt a run at any point and re-run the same command; the engine picks up from the first incomplete stage, and completed work is not repeated.

The shipped `loldrivers` pipeline demonstrates all three: it ingests `.sys` driver files from a directory tree, filters by kernel driver metadata and an online hash blocklist, decompiles survivors with Ghidra, scans with Semgrep, keeps the 10 samples with the highest finding counts, and sends those to a configurable LLM for a written vulnerability assessment.

Sources: [pipelines/loldrivers/pipeline.yaml:1-93](), [src/deepzero/engine/runner.py:128-265]()