# Agent Pipeline: Five Phases from Scan to Graph

> The /understand skill orchestrates five sequential agent phases: project-scanner (file discovery), file-analyzer (per-file nodes and edges), assemble-reviewer (semantic review and gap recovery on the merged graph), architecture-analyzer (layer assignment), and tour-builder (guided walkthroughs). Agents write intermediate JSON to .understand-anything/intermediate/ to keep large payloads out of the orchestrator's context; the results are merged, validated, and the intermediate files are cleaned up. Auto-update mode replays stale phases after a git commit via the PostToolUse hook, or at session start when the stored commit hash has drifted.

- Repository: Lum1104/Understand-Anything
- GitHub: https://github.com/Lum1104/Understand-Anything
- Human wiki: https://grok-wiki.com/public/wiki/lum1104-understand-anything-3b923df96896
- Complete Markdown: https://grok-wiki.com/public/wiki/lum1104-understand-anything-3b923df96896/llms-full.txt

## Source Files

- `understand-anything-plugin/agents/project-scanner.md`
- `understand-anything-plugin/agents/file-analyzer.md`
- `understand-anything-plugin/agents/architecture-analyzer.md`
- `understand-anything-plugin/agents/tour-builder.md`
- `understand-anything-plugin/agents/assemble-reviewer.md`
- `understand-anything-plugin/hooks/hooks.json`
- `understand-anything-plugin/skills/understand/SKILL.md`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [understand-anything-plugin/agents/project-scanner.md](understand-anything-plugin/agents/project-scanner.md)
- [understand-anything-plugin/agents/file-analyzer.md](understand-anything-plugin/agents/file-analyzer.md)
- [understand-anything-plugin/agents/architecture-analyzer.md](understand-anything-plugin/agents/architecture-analyzer.md)
- [understand-anything-plugin/agents/tour-builder.md](understand-anything-plugin/agents/tour-builder.md)
- [understand-anything-plugin/agents/assemble-reviewer.md](understand-anything-plugin/agents/assemble-reviewer.md)
- [understand-anything-plugin/hooks/hooks.json](understand-anything-plugin/hooks/hooks.json)
- [understand-anything-plugin/hooks/auto-update-prompt.md](understand-anything-plugin/hooks/auto-update-prompt.md)
- [understand-anything-plugin/skills/understand/SKILL.md](understand-anything-plugin/skills/understand/SKILL.md)
- [understand-anything-plugin/skills/understand/merge-batch-graphs.py](understand-anything-plugin/skills/understand/merge-batch-graphs.py)
</details>

The `/understand` skill drives a sequential, multi-agent pipeline that transforms a raw codebase into an interactive knowledge graph. Each phase is handled by a dedicated agent that writes its results as intermediate JSON to `.understand-anything/intermediate/`, keeping large data artifacts out of the orchestrating context window. A merge script stitches the batch outputs together, and review and validation passes check the result before it is saved as a single `knowledge-graph.json`.

Understanding this pipeline matters when diagnosing partial failures, reasoning about what gets regenerated on an incremental run, or extending the system with new analysis types. Every design choice in the pipeline — writing to disk, batching files, running agents in parallel — is motivated by a single invariant: the orchestrator never accumulates large result payloads itself.

---

## Overview: Phases and Output Files

```text
Phase 0    Pre-flight + config     (orchestrator, no agent)
Phase 0.5  Ignore config           (orchestrator, no agent)
Phase 1    SCAN                    → .understand-anything/intermediate/scan-result.json
Phase 2    ANALYZE (batched)       → .understand-anything/intermediate/batch-N.json
           merge-batch-graphs.py   → .understand-anything/intermediate/assembled-graph.json
Phase 3    ASSEMBLE REVIEW         → .understand-anything/intermediate/assemble-review.json
Phase 4    ARCHITECTURE            → .understand-anything/intermediate/layers.json
Phase 5    TOUR                    → .understand-anything/intermediate/tour.json
Phase 6    REVIEW + assemble       → .understand-anything/intermediate/assembled-graph.json
Phase 7    SAVE + clean up         → .understand-anything/knowledge-graph.json
                                   → .understand-anything/meta.json
                                   → .understand-anything/fingerprints.json
```

Sources: [understand-anything-plugin/skills/understand/SKILL.md:214-718]()

---

## Phase 1: project-scanner — File Discovery and Import Resolution

The `project-scanner` agent performs a two-phase scan. In **Phase 1** it writes and executes a deterministic Node.js (or Python) script that:

1. Discovers all tracked files via `git ls-files` (falling back to a recursive listing), then applies a multi-tier exclusion filter: hardcoded defaults (node_modules, lock files, binaries) overlaid with optional user patterns from `.understand-anything/.understandignore` via the `createIgnoreFilter` function from `@understand-anything/core`.
2. Classifies each file into one of seven `fileCategory` values: `code`, `config`, `docs`, `infra`, `data`, `script`, `markup`.
3. Detects languages (20+ extension mappings), frameworks (via manifest inspection of `package.json`, `Cargo.toml`, `go.mod`, `pyproject.toml`, etc.), and estimates project complexity (`small`/`moderate`/`large`/`very-large`).
4. Builds a full `importMap`: for every **code-category** file it resolves project-internal imports using language-appropriate patterns — including TypeScript path aliases from `tsconfig.json`, Python absolute imports, Go module paths, Rust `use crate::` — and writes each file's resolved import list as an array of project-relative paths. External packages are silently dropped.

In **Phase 2** the agent reads the script's JSON output and synthesizes a human-readable `description` field from `rawDescription` or the first 10 lines of `README.md`.

The final output is written to `intermediate/scan-result.json` and includes the complete `importMap`. The orchestrator stores this as `$IMPORT_MAP` and `$FILE_LIST` for use in Phase 2 batch construction.

Sources: [understand-anything-plugin/agents/project-scanner.md:1-365]()

---

## Phase 2: file-analyzer — Per-File Nodes and Edges

The orchestrator batches the file list into groups of **20–30 files** and dispatches up to **5 concurrent** `file-analyzer` subagents. Each agent receives:

- Its batch file list (with `path`, `language`, `sizeLines`, `fileCategory`).
- A `batchImportData` slice of `$IMPORT_MAP` covering only the files in that batch.
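
The batching and import-map slicing described above can be sketched as follows. This is a minimal illustration, not the orchestrator's actual logic: the helper names `make_batches` and `slice_import_map` are hypothetical, and the batch size of 25 is simply one value inside the documented 20–30 range.

```python
from typing import Dict, List


def make_batches(files: List[dict], batch_size: int = 25) -> List[List[dict]]:
    """Split the scanned file list into consecutive groups of at most batch_size."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


def slice_import_map(batch: List[dict],
                     import_map: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the import-map entries whose key is a file in this batch."""
    return {f["path"]: import_map.get(f["path"], []) for f in batch}


# Hypothetical scan output: 60 code files, one of which imports another.
files = [{"path": f"src/mod{i}.ts", "language": "typescript",
          "sizeLines": 10, "fileCategory": "code"} for i in range(60)]
import_map = {"src/mod0.ts": ["src/mod59.ts"]}

batches = make_batches(files)
batch_import_data = slice_import_map(batches[0], import_map)
```

Note that the import target `src/mod59.ts` ends up in a different batch than its importer, which is the situation the cross-batch edge check in Phase 3 exists to handle.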

Each `file-analyzer` also runs a two-phase workflow:

**Structural extraction (Phase 1):** Invokes the bundled `extract-structure.mjs` script, which uses `web-tree-sitter` (WASM) to extract functions, classes, exports, and call-graph entries from 10 languages. For non-code files it extracts sections, definitions, services (Docker), endpoints (OpenAPI), CI steps, and Terraform resources. Languages without tree-sitter support (Swift, Kotlin, shell scripts) fall back to regex matching for function signatures.

**Semantic analysis (Phase 2):** Uses the structured extraction as the basis for producing `GraphNode` and `GraphEdge` objects. Node types depend on `fileCategory`:

| fileCategory | Node type(s) |
|---|---|
| `code` | `file`, `function`, `class` |
| `config` | `config` |
| `docs` | `document` |
| `infra` | `service` / `pipeline` / `resource` |
| `data` | `table` / `schema` / `endpoint` |

For import edges, the agent follows a strict 1:1 rule: for every path in `batchImportData[file]`, it emits exactly one `imports` edge. It must not drop or aggregate imports; the orchestrator's merge script can recover missed ones but it is the agent's responsibility to emit all of them.
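
As a rough sketch of the 1:1 rule, the function below emits one edge per resolved path and never merges or skips entries. The edge field names (`source`, `target`, `type`) and the `file:` ID prefix are assumptions made for illustration; the agent prompt defines the real schema.

```python
from typing import Dict, List


def import_edges(batch_import_data: Dict[str, List[str]]) -> List[dict]:
    """Emit exactly one `imports` edge per resolved path, with no aggregation."""
    edges = []
    for source_path, targets in batch_import_data.items():
        for target_path in targets:
            edges.append({
                "source": f"file:{source_path}",  # assumed ID format
                "target": f"file:{target_path}",
                "type": "imports",
            })
    return edges


batch_import_data = {
    "src/app.ts": ["src/db.ts", "src/routes.ts"],
    "src/db.ts": [],
}
edges = import_edges(batch_import_data)
```

The edge count always equals the total number of paths across all lists in `batchImportData`, which gives a cheap way to check whether an agent dropped anything.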

Each batch writes its output to `intermediate/batch-<N>.json`.

After all batches complete, the orchestrator runs:

```bash
python <SKILL_DIR>/merge-batch-graphs.py $PROJECT_ROOT
```

This script reads all `batch-*.json` files, normalizes node IDs (strips double-prefixes, adds missing type prefixes, canonicalizes `func:` → `function:`), normalizes complexity values, rewrites dangling edge references, deduplicates nodes and edges, and runs a `tested_by` linker that canonicalizes test-coverage edges via both LLM emission and path-convention pairing (e.g., `X.ts` ↔ `X.test.ts`, `src/main/java` ↔ `src/test/java`). The result is `intermediate/assembled-graph.json`.
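
The ID normalization step can be illustrated with a small sketch. It is not taken from `merge-batch-graphs.py`: the function name `normalize_id`, the prefix list, and the `type:path:name` ID shape are assumptions chosen to show the three documented repairs (double prefixes, missing prefixes, and `func:` → `function:`).

```python
KNOWN_PREFIXES = ("file", "function", "class")  # assumed subset of node types
ALIASES = {"func": "function"}                  # alias named in the text above


def normalize_id(raw_id: str, node_type: str) -> str:
    """Return an ID with exactly one canonical type prefix."""
    parts = raw_id.split(":")
    prefixes = []
    # Peel off leading segments that look like type prefixes, keeping the rest intact.
    while len(parts) > 1 and ALIASES.get(parts[0], parts[0]) in KNOWN_PREFIXES:
        prefixes.append(ALIASES.get(parts[0], parts[0]))
        parts = parts[1:]
    rest = ":".join(parts)
    # Use the first prefix found; fall back to the node's declared type.
    prefix = prefixes[0] if prefixes else node_type
    return f"{prefix}:{rest}"


fixed = [
    normalize_id("file:file:src/a.ts", "file"),
    normalize_id("func:src/a.ts:run", "function"),
    normalize_id("src/a.ts", "file"),
]
```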

Sources: [understand-anything-plugin/agents/file-analyzer.md:1-476](), [understand-anything-plugin/skills/understand/SKILL.md:259-328](), [understand-anything-plugin/skills/understand/merge-batch-graphs.py:1-60]()

---

## Phase 3: assemble-reviewer — Semantic Review and Gap Recovery

The `assemble-reviewer` is the quality-control agent that handles what `merge-batch-graphs.py` could not fix mechanically. It receives the merge script's stderr report (which lists fixed items and unfixable items) and the full `$IMPORT_MAP`.

Its task has four steps:

1. **Sanity-check the "Fixed" section.** If >30% of nodes required ID correction, it flags this as a systemic upstream issue.
2. **Investigate "Could not fix" items.** For nodes missing an `id` field, it reconstructs the ID from `type`, `filePath`, and `name`. It remaps unknown node types (e.g., `"svc"` → `"service"`) and maps unknown complexity values to the nearest valid value.
3. **Check for cross-batch edge gaps.** For every import relationship in `$IMPORT_MAP`, it verifies a corresponding `imports` edge exists in the assembled graph. Missing edges backed by the import map are added with `weight: 0.7`; speculative edges are never added.
4. **Apply fixes in-place** to `assembled-graph.json` and write an `assemble-review.json` summary.
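
Step 3 is the most mechanical of the four and can be sketched in a few lines. The reviewer itself is an LLM agent, so this is only an illustration of the rule it follows; the function name, the edge field names, and the requirement that both endpoints already exist as nodes are assumptions.

```python
from typing import Dict, List


def recover_missing_import_edges(graph: dict,
                                 import_map: Dict[str, List[str]]) -> List[dict]:
    """Add an `imports` edge for every import-map entry the graph lacks."""
    node_ids = {n["id"] for n in graph["nodes"]}
    existing = {(e["source"], e["target"])
                for e in graph["edges"] if e["type"] == "imports"}
    added = []
    for src, targets in import_map.items():
        for tgt in targets:
            s, t = f"file:{src}", f"file:{tgt}"  # assumed ID format
            # Only add edges backed by the import map whose endpoints both exist.
            if s in node_ids and t in node_ids and (s, t) not in existing:
                edge = {"source": s, "target": t, "type": "imports", "weight": 0.7}
                graph["edges"].append(edge)
                existing.add((s, t))
                added.append(edge)
    return added


graph = {
    "nodes": [{"id": "file:a.ts"}, {"id": "file:b.ts"}, {"id": "file:c.ts"}],
    "edges": [{"source": "file:a.ts", "target": "file:b.ts", "type": "imports"}],
}
import_map = {"a.ts": ["b.ts", "c.ts"], "c.ts": ["missing.ts"]}
added = recover_missing_import_edges(graph, import_map)
```

Here only `a.ts → c.ts` is recovered. The `c.ts → missing.ts` entry is skipped because adding it would create a dangling reference.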

Sources: [understand-anything-plugin/agents/assemble-reviewer.md:1-97](), [understand-anything-plugin/skills/understand/SKILL.md:344-366]()

---

## Phase 4: architecture-analyzer — Cross-Cutting Layer Assignment

The `architecture-analyzer` assigns every file-level node to exactly one of 3–10 architectural layers. Like the other agents, it runs a **two-phase** workflow.

**Structural analysis script (Phase 1):** Computes directory groupings, node-type distribution, import adjacency matrices (fan-in, fan-out per file), inter-group and intra-group import density, directory-name pattern matching (`routes` → `api`, `services` → `service`, etc.), deployment topology detection (Dockerfile/K8s/Terraform chain), data pipeline detection, and documentation coverage ratios.

**Semantic assignment (Phase 2):** Uses the script's output to select layers and assign every node. Non-code nodes follow type-based rules: `config` → Configuration layer, `document` → Documentation layer, `service`/`resource` → Infrastructure layer, `pipeline` → CI/CD layer, `table`/`schema`/`endpoint` → Data layer. For incremental runs, previous layer definitions are injected to enforce naming consistency.
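
The type-based rules for non-code nodes amount to a lookup table. The sketch below encodes the mapping listed above; the layer names are written as plain strings and the helper `layer_for` is hypothetical, since the agent assigns layers through its prompt rather than through code like this.

```python
from typing import Optional

# Node type -> layer, as listed in the paragraph above.
NON_CODE_LAYER = {
    "config": "Configuration",
    "document": "Documentation",
    "service": "Infrastructure",
    "resource": "Infrastructure",
    "pipeline": "CI/CD",
    "table": "Data",
    "schema": "Data",
    "endpoint": "Data",
}


def layer_for(node: dict) -> Optional[str]:
    """Return the rule-based layer, or None when the node needs semantic assignment."""
    return NON_CODE_LAYER.get(node["type"])


assigned = [layer_for({"type": t}) for t in ("config", "pipeline", "schema", "file")]
```

A `None` result marks a code node, whose layer is decided from the structural script's output instead of a fixed rule.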

The orchestrator normalizes the output (renames `nodes` → `nodeIds`, synthesizes missing `id` fields, drops dangling refs) before writing the final layers array.

Sources: [understand-anything-plugin/agents/architecture-analyzer.md:1-482](), [understand-anything-plugin/skills/understand/SKILL.md:369-437]()

---

## Phase 5: tour-builder — Guided Learning Walkthroughs

The `tour-builder` designs 5–15 pedagogical tour steps through the knowledge graph.

**Graph topology script (Phase 1):** Scores entry-point candidates (filename patterns, fan-out, fan-in), runs BFS from the top code entry point following `imports`/`calls` edges to produce a depth-ordered traversal, identifies tightly coupled clusters (mutual bidirectional edges), and groups non-code files (documentation, infrastructure, data, config) for scheduled tour inclusion.
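
The depth-ordered traversal is a standard breadth-first search. The following sketch assumes edges carry `source`, `target`, and `type` fields and that the entry point has already been chosen; it is a generic BFS, not the script the agent generates.

```python
from collections import deque
from typing import Dict, List


def bfs_depths(entry: str, edges: List[dict]) -> Dict[str, int]:
    """Return each reachable node's distance from the entry point."""
    adjacency: Dict[str, List[str]] = {}
    for e in edges:
        if e["type"] in ("imports", "calls"):  # only follow these edge types
            adjacency.setdefault(e["source"], []).append(e["target"])

    depths = {entry: 0}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in depths:  # first visit is the shortest path
                depths[nxt] = depths[node] + 1
                queue.append(nxt)
    return depths


edges = [
    {"source": "main", "target": "a", "type": "imports"},
    {"source": "main", "target": "b", "type": "calls"},
    {"source": "a", "target": "c", "type": "imports"},
    {"source": "c", "target": "main", "type": "imports"},   # cycle back to entry
    {"source": "main", "target": "doc", "type": "contains"},  # ignored edge type
]
depths = bfs_depths("main", edges)
```

Because visited nodes are never re-queued, import cycles do not cause the traversal to loop.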

**Tour design (Phase 2):** Uses BFS depth to structure tour order — depth 0 is the entry point, depth 1 covers direct dependencies, depth 2+ covers feature modules, and non-code stops (Dockerfile, CI config, SQL schema) anchor the end of the tour. Steps are connected narratively: each description explicitly references earlier steps to build a coherent mental model. Optional `languageLesson` fields attach language-specific educational notes (Docker multi-stage builds, TypeScript barrel patterns, SQL migration ordering, etc.).

The orchestrator normalizes the tour output (renames `nodesToInspect` → `nodeIds`, converts bare file paths to `file:` prefixed IDs, sorts by `order`).
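
A minimal sketch of that normalization is shown below. Treating any ID without a colon as a bare file path is an assumption made for this example, as is the function name `normalize_tour`.

```python
from typing import List


def normalize_tour(steps: List[dict]) -> List[dict]:
    """Rename fields, prefix bare paths, and sort steps by `order`."""
    normalized = []
    for step in steps:
        step = dict(step)  # shallow copy so the input is left untouched
        if "nodesToInspect" in step:
            step["nodeIds"] = step.pop("nodesToInspect")
        # Assumption: an ID with no colon is a bare project-relative path.
        step["nodeIds"] = [nid if ":" in nid else f"file:{nid}"
                           for nid in step.get("nodeIds", [])]
        normalized.append(step)
    return sorted(normalized, key=lambda s: s["order"])


steps = [
    {"order": 2, "title": "Routes", "nodesToInspect": ["src/routes.ts"]},
    {"order": 1, "title": "Entry", "nodeIds": ["file:src/index.ts", "src/app.ts"]},
]
tour = normalize_tour(steps)
```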

Sources: [understand-anything-plugin/agents/tour-builder.md:1-379](), [understand-anything-plugin/skills/understand/SKILL.md:452-517]()

---

## Phase 6–7: Review, Save, and Cleanup

Phase 6 assembles the full `KnowledgeGraph` object (`version`, `project`, `nodes`, `edges`, `layers`, `tour`) and validates it. The default path runs an inline deterministic Node.js validator that checks: all nodes have required fields, edge sources/targets exist, every file-level node appears in exactly one layer, and no dangling tour step references exist. The `--review` flag substitutes the full LLM `graph-reviewer` agent.
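
The checks listed above can be sketched as a single pass over the graph. The real validator is an inline Node.js script; this Python version only illustrates the four checks. The required-field list and the choice of which node types count as "file-level" are assumptions.

```python
from collections import Counter
from typing import List

REQUIRED_NODE_FIELDS = ("id", "type")  # assumed; the real schema likely requires more


def validate_graph(graph: dict, file_level_types=("file",)) -> List[str]:
    """Return a list of human-readable validation errors (empty when valid)."""
    errors = []
    for node in graph["nodes"]:
        missing = [f for f in REQUIRED_NODE_FIELDS if f not in node]
        if missing:
            errors.append(f"node missing fields {missing}")

    node_ids = {n["id"] for n in graph["nodes"] if "id" in n}
    for edge in graph["edges"]:
        for end in ("source", "target"):
            if edge[end] not in node_ids:
                errors.append(f"edge {end} not found: {edge[end]}")

    # Count how many layers reference each node ID.
    layer_counts = Counter(nid for layer in graph["layers"] for nid in layer["nodeIds"])
    for node in graph["nodes"]:
        if node.get("type") in file_level_types and layer_counts[node.get("id")] != 1:
            errors.append(f"node in {layer_counts[node.get('id')]} layers: {node.get('id')}")

    for step in graph["tour"]:
        for nid in step["nodeIds"]:
            if nid not in node_ids:
                errors.append(f"tour step {step['order']} references missing node: {nid}")
    return errors


graph = {
    "nodes": [{"id": "file:a.ts", "type": "file"}, {"id": "file:b.ts", "type": "file"}],
    "edges": [{"source": "file:a.ts", "target": "file:gone.ts", "type": "imports"}],
    "layers": [{"id": "layer:core", "nodeIds": ["file:a.ts"]}],
    "tour": [{"order": 1, "nodeIds": ["file:b.ts"]}],
}
errors = validate_graph(graph)
```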

Phase 7 writes three artifacts and then cleans up:

1. `knowledge-graph.json` — the final graph.
2. `fingerprints.json` — structural fingerprints (SHA-256 content hash + extracted functions/classes/imports/exports) for every source file, generated via the bundled `build-fingerprints.mjs` script using tree-sitter for precision. **This must succeed before `meta.json` is written** — a failure here causes auto-update to classify every subsequent commit as a full structural change (issue #152).
3. `meta.json` — `gitCommitHash`, `lastAnalyzedAt`, `analyzedFiles`.

Intermediate and tmp directories are deleted:

```bash
rm -rf $PROJECT_ROOT/.understand-anything/intermediate
rm -rf $PROJECT_ROOT/.understand-anything/tmp
```

Sources: [understand-anything-plugin/skills/understand/SKILL.md:522-735]()

---

## Auto-Update: Replaying Stale Phases After a Commit

When `autoUpdate: true` is stored in `.understand-anything/config.json`, two hooks fire to detect and update a stale graph.

**PostToolUse hook** fires after any `Bash` command matching a git commit/merge/cherry-pick/rebase pattern and injects an instruction into the conversation to run the auto-update prompt:

```json
{
  "matcher": "Bash",
  "hooks": [{
    "type": "command",
    "command": "printf '%s' \"$TOOL_INPUT\" | grep -qE 'git\\s+(commit|merge|cherry-pick|rebase)' && [ -f .understand-anything/config.json ] && grep -q '\"autoUpdate\".*true' .understand-anything/config.json && [ -f .understand-anything/knowledge-graph.json ] && echo \"[understand-anything] Commit detected with auto-update enabled...\""
  }]
}
```
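
To see which commands the hook's first `grep -qE` test would match, the same pattern can be tried in Python. This only mirrors the pattern-matching step; the checks for `config.json` and `knowledge-graph.json` are omitted, and the function name is hypothetical.

```python
import re

# Same pattern as the hook's grep -E expression, unescaped from JSON.
GIT_WRITE = re.compile(r"git\s+(commit|merge|cherry-pick|rebase)")


def triggers_auto_update(tool_input: str) -> bool:
    """True when the Bash command text contains a matching git subcommand."""
    return GIT_WRITE.search(tool_input) is not None


results = {
    cmd: triggers_auto_update(cmd)
    for cmd in ("git commit -m 'fix'", "git   rebase main",
                "git status", "echo 'git commit'")
}
```

The pattern is unanchored, so it also matches text such as `echo 'git commit'` where no commit actually happens. In that case the auto-update prompt still runs, and its fingerprint check should find nothing to do.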

**SessionStart hook** detects commit hash drift by comparing `meta.json`'s stored `gitCommitHash` against `git rev-parse HEAD` and fires the same auto-update instruction if they diverge.

Sources: [understand-anything-plugin/hooks/hooks.json:1-25]()

The auto-update workflow (`hooks/auto-update-prompt.md`) is designed to minimize token cost:

| Phase | Cost |
|---|---|
| Phase 0 — Pre-flight, `.understandignore` filtering | Zero tokens |
| Phase 1 — Structural fingerprint check (script) | Zero tokens |
| Phase 2 — Targeted re-analysis (LLM agents) | Tokens only for structurally changed files |
| Phase 3 — Conditional architecture/tour + save | Tokens only if `ARCHITECTURE_UPDATE` |

The fingerprint check classifies each changed source file as `NONE`, `COSMETIC`, or `STRUCTURAL` by comparing SHA-256 content hashes and regex-extracted function/class/import signatures against `fingerprints.json`. The decision gate produces four actions:

| Action | Trigger | Behavior |
|---|---|---|
| `SKIP` | All changes cosmetic | Update `meta.json`, zero tokens |
| `PARTIAL_UPDATE` | ≤10 structural files, same dirs | Re-analyze changed files only |
| `ARCHITECTURE_UPDATE` | New/deleted dirs or >10 structural files | Re-analyze + re-run architecture phase |
| `FULL_UPDATE` | >30 structural files or >50% of graph | Recommend `/understand --full`, stop |
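
The classification and decision gate can be sketched together. Several details here are assumptions rather than documented behavior: the fingerprint field names, the precedence among overlapping triggers (the largest action is checked first), and reading ">50% of graph" as more than half of the analyzed file count.

```python
from typing import List

SIGNATURE_FIELDS = ("functions", "classes", "imports", "exports")  # assumed field names


def classify_change(old_fp: dict, new_fp: dict) -> str:
    """Compare two fingerprints of the same file."""
    if old_fp["hash"] == new_fp["hash"]:
        return "NONE"
    same_signatures = all(sorted(old_fp[f]) == sorted(new_fp[f]) for f in SIGNATURE_FIELDS)
    return "COSMETIC" if same_signatures else "STRUCTURAL"


def decide_action(classes: List[str], dirs_changed: bool, graph_file_count: int) -> str:
    """Map per-file classifications to one of the four actions."""
    structural = classes.count("STRUCTURAL")
    if structural == 0 and not dirs_changed:
        return "SKIP"
    if structural > 30 or structural > 0.5 * graph_file_count:
        return "FULL_UPDATE"
    if dirs_changed or structural > 10:
        return "ARCHITECTURE_UPDATE"
    return "PARTIAL_UPDATE"


old = {"hash": "h1", "functions": ["run"], "classes": [],
       "imports": ["./db"], "exports": ["run"]}
reformatted = dict(old, hash="h2")
extended = dict(old, hash="h3", functions=["run", "stop"])

kinds = [classify_change(old, old),
         classify_change(old, reformatted),
         classify_change(old, extended)]
action = decide_action(["STRUCTURAL"] * 3 + ["COSMETIC"], False, 100)
```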

After re-analysis, fingerprints are updated using a load-patch-save pattern: all existing entries are loaded first, then only the re-analyzed entries are patched in-place, and the full dict is saved back. Overwriting only the batch subset would discard all other files' fingerprints and cause permanent `FULL_UPDATE` escalation on every subsequent commit.
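
The load-patch-save pattern looks roughly like this. The function name and the per-file fingerprint shape are placeholders, and the sketch does not handle entries for deleted files.

```python
import json
import tempfile
from pathlib import Path


def patch_fingerprints(path: Path, updated: dict) -> dict:
    """Load every existing entry, overwrite only the updated ones, save the full dict."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    existing.update(updated)  # entries for untouched files are preserved
    path.write_text(json.dumps(existing, indent=2, sort_keys=True))
    return existing


with tempfile.TemporaryDirectory() as tmp:
    fp_path = Path(tmp) / "fingerprints.json"
    fp_path.write_text(json.dumps({"a.ts": {"hash": "1"}, "b.ts": {"hash": "2"}}))

    # Only b.ts was re-analyzed; a.ts must survive the save.
    result = patch_fingerprints(fp_path, {"b.ts": {"hash": "3"}})
    on_disk = json.loads(fp_path.read_text())
```

Writing `updated` directly to the file would drop the `a.ts` entry, which is the failure mode described above.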

Sources: [understand-anything-plugin/hooks/auto-update-prompt.md:1-321]()

---

## Summary

The five-agent pipeline (project-scanner → file-analyzer × N → assemble-reviewer → architecture-analyzer → tour-builder) is held together by a single invariant: agents write to `.understand-anything/intermediate/` rather than returning large payloads to the orchestrator. The merge script and inline validator provide two deterministic passes that repair or flag mechanical errors before the final knowledge graph is persisted. Auto-update extends this pipeline into an incremental mode: it spends LLM tokens only on structurally changed files and uses a zero-token fingerprint comparison to gate the decision. That comparison depends on the tree-sitter-generated baseline in `fingerprints.json`, which is written at the end of every full analysis run.

Sources: [understand-anything-plugin/skills/understand/SKILL.md:683-735]()