# Graphify Explain Like I'm 5 Wiki

> Graphify turns folders of code, docs, media, and notes into a queryable knowledge graph with reports, exports, assistant skills, and optional model backends. This structure is source-backed by repository code; `graphify-out/GRAPH_REPORT.md`, `STRATEGY.md`, and `docs/solutions` were not present in this checkout.

## Context Links

- [Agent index](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/llms.txt)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d)
- [GitHub repository](https://github.com/safishamsi/graphify)

## Repository Metadata

- Repository: safishamsi/graphify
- Generated: 2026-05-22T20:43:19.852Z
- Updated: 2026-05-22T20:44:51.340Z
- Runtime: Codex CLI
- Format: Explain Like I'm 5
- Pages: 6

## Page Index

- 01. [Explain It Simply](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/01-explain-it-simply.md) - What Graphify does in plain language: it reads a pile of project material, finds the named things and relationships, and draws a map that agents and humans can query later.
- 02. [First Run & Assistant Setup](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/02-first-run-assistant-setup.md) - The smallest useful path from installing graphifyy to running graphify, registering assistant skills, and understanding why project-scoped installs stay portable across coding agents.
- 03. [What Gets Put In The Box](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/03-what-gets-put-in-the-box.md) - How Graphify decides which files count, skips sensitive or noisy inputs, converts Office and Google Workspace files, handles transcripts, and caches work for later runs.
- 04. [How Files Become Tiny Facts](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/04-how-files-become-tiny-facts.md) - The extraction stage: tree-sitter reads code, optional semantic extraction reads richer context, validation keeps the shape consistent, and provider-neutral backends support BYOC and BYOK choices.
- 05. [How Facts Become A Map](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/05-how-facts-become-a-map.md) - How extracted nodes and edges become a NetworkX graph, get deduplicated, clustered, analyzed, and exported as graph.json, graph.html, GRAPH_REPORT.md, wiki output, and call-flow HTML.
- 06. [Ask The Map, Keep It Fresh](https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/06-ask-the-map-keep-it-fresh.md) - The closing page: remember Graphify as a reusable project map, then use query, path, explain, MCP, global graphs, update, and watch flows to keep that map useful after the first build.

## Source File Index

- `ARCHITECTURE.md`
- `graphify/__main__.py`
- `graphify/affected.py`
- `graphify/analyze.py`
- `graphify/build.py`
- `graphify/cache.py`
- `graphify/callflow_html.py`
- `graphify/cluster.py`
- `graphify/dedup.py`
- `graphify/detect.py`
- `graphify/export.py`
- `graphify/extract.py`
- `graphify/global_graph.py`
- `graphify/google_workspace.py`
- `graphify/hooks.py`
- `graphify/ingest.py`
- `graphify/llm.py`
- `graphify/prs.py`
- `graphify/report.py`
- `graphify/security.py`
- `graphify/serve.py`
- `graphify/skill-codex.md`
- `graphify/skill.md`
- `graphify/symbol_resolution.py`
- `graphify/transcribe.py`
- `graphify/validate.py`
- `graphify/watch.py`
- `graphify/wiki.py`
- `pyproject.toml`
- `README.md`
- `tests/test_detect.py`
- `tests/test_extract.py`
- `tests/test_google_workspace.py`
- `tests/test_hooks.py`
- `tests/test_install.py`
- `tests/test_languages.py`
- `tests/test_llm_backends.py`
- `tests/test_query_cli.py`
- `tests/test_symbol_resolution.py`
- `tests/test_transcribe.py`
- `tests/test_watch.py`

---

## 01. Explain It Simply

> What Graphify does in plain language: it reads a pile of project material, finds the named things and relationships, and draws a map that agents and humans can query later.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/01-explain-it-simply.md
- Generated: 2026-05-22T20:43:19.848Z

### Source Files

- `README.md`
- `pyproject.toml`
- `ARCHITECTURE.md`
- `graphify/__main__.py`
- `graphify/skill-codex.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [README.md](README.md)
- [pyproject.toml](pyproject.toml)
- [ARCHITECTURE.md](ARCHITECTURE.md)
- [graphify/__main__.py](graphify/__main__.py)
- [graphify/skill-codex.md](graphify/skill-codex.md)
- [graphify/detect.py](graphify/detect.py)
- [graphify/extract.py](graphify/extract.py)
- [graphify/build.py](graphify/build.py)
- [graphify/cluster.py](graphify/cluster.py)
- [graphify/analyze.py](graphify/analyze.py)
- [graphify/report.py](graphify/report.py)
- [graphify/export.py](graphify/export.py)
- [graphify/serve.py](graphify/serve.py)
- [graphify/llm.py](graphify/llm.py)
- [graphify/wiki.py](graphify/wiki.py)
- [tests/test_cli_export.py](tests/test_cli_export.py)
</details>

# Explain It Simply

Graphify turns a messy pile of project material into a map. The pile can include code, Markdown, PDFs, images, office files, videos, and URLs. The map is a knowledge graph: named things become nodes, relationships become edges, and the finished graph can be opened, queried, exported, or committed for a team.

This page explains Graphify in plain language first, then maps each simple idea back to the actual modules and commands in the repository. The goal is to help a smart newcomer understand what Graphify does before they learn every internal detail.

Sources: [README.md:26-41](), [README.md:208-221](), [ARCHITECTURE.md:5-12]()

## The Short Version

Imagine dumping a project folder onto a table. Graphify sorts the pile, writes labels on the important things, draws strings between related things, groups nearby things into neighborhoods, and saves the result so people and agents can ask better questions later.

In real repository terms:

```text
project files
  -> graphify.detect       finds supported files and skips noise
  -> graphify.extract      reads code structure with tree-sitter
  -> graphify.llm          reads docs/papers/images through a configured backend
  -> graphify.build        turns nodes + edges into a NetworkX graph
  -> graphify.cluster      groups related nodes into communities
  -> graphify.analyze      finds central nodes and surprising links
  -> graphify.export/report/wiki
                           writes graph.json, graph.html, reports, and wiki pages
```

The architecture document describes this as a staged pipeline where each stage is its own module and stages pass plain Python dicts or NetworkX graphs rather than sharing hidden state.

Sources: [ARCHITECTURE.md:5-31](), [graphify/__main__.py:2771-2777](), [graphify/__main__.py:3209-3252]()

## What Goes In

Graphify starts by detecting files. The detector knows about categories such as code, documents, papers, images, office files, and video/audio files. It also skips common generated folders and dependency caches, including `node_modules`, `.git`, build directories, framework caches, and Graphify's own output folder.

That means the first job is not "read everything." The first job is "find the useful project material and avoid obvious noise."

| Input kind | Examples from the code |
|---|---|
| Code | `.py`, `.ts`, `.js`, `.go`, `.rs`, `.java`, `.cpp`, `.rb`, `.swift`, `.kt`, `.cs`, `.php`, `.sql`, `.json`, and more |
| Docs | `.md`, `.mdx`, `.txt`, `.rst`, `.html`, `.yaml`, `.yml` |
| Papers | `.pdf` |
| Images | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.svg` |
| Office | `.docx`, `.xlsx` |
| Video/audio | `.mp4`, `.mov`, `.mp3`, `.wav`, and related formats |

Sources: [graphify/detect.py:28-37](), [graphify/detect.py:537-557](), [graphify/detect.py:862-940]()

## What Gets Found

Graphify looks for two kinds of meaning.

### Structural Meaning

For code, Graphify uses tree-sitter parsers. It finds things like files, imports, classes, functions, calls, and definitions. Those become graph nodes and edges with source file and line information.

A tiny code-shaped example of the internal data looks like this:

```json
{
  "nodes": [
    {
      "id": "auth_service",
      "label": "AuthService",
      "file_type": "code",
      "source_file": "src/auth.py",
      "source_location": "L12"
    }
  ],
  "edges": [
    {
      "source": "auth_service",
      "target": "database_pool",
      "relation": "calls",
      "confidence": "EXTRACTED"
    }
  ]
}
```

The important word is `EXTRACTED`: Graphify applies that label when a relationship is directly present in the source, such as an import or a call it can see.

Sources: [ARCHITECTURE.md:33-49](), [graphify/extract.py:1310-1382](), [graphify/extract.py:7505-7528]()
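
To make the structural idea concrete, here is a minimal sketch that produces the same node and edge shape from Python source. It uses Python's standard `ast` module as a stand-in for tree-sitter, which is what Graphify actually uses. The function name `extract_structure` and the ID scheme (the bare function name) are assumptions for illustration only.

```python
import ast


def extract_structure(source: str, source_file: str) -> dict:
    """Collect function definitions as nodes and direct calls as edges."""
    tree = ast.parse(source)
    nodes, edges = [], []
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        nodes.append({
            "id": func.name,  # hypothetical ID scheme: the bare function name
            "label": func.name,
            "file_type": "code",
            "source_file": source_file,
            "source_location": f"L{func.lineno}",
        })
        # Every call to a plain name inside this function becomes an edge.
        for call in ast.walk(func):
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                edges.append({
                    "source": func.name,
                    "target": call.func.id,
                    "relation": "calls",
                    "confidence": "EXTRACTED",
                })
    return {"nodes": nodes, "edges": edges}


SAMPLE = "def load():\n    return 1\n\ndef run():\n    return load()\n"
result = extract_structure(SAMPLE, "src/app.py")
```

Because the call `load()` is written directly in the body of `run`, the edge is something the parser can see, which is why it earns the `EXTRACTED` label.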

### Semantic Meaning

For docs, papers, and images, Graphify can use an AI backend to extract concepts and relationships that are not available through code parsing. The CLI supports multiple backends, including Claude, Kimi, Ollama, Gemini, OpenAI, DeepSeek, Bedrock, and Claude CLI. That keeps the design provider-neutral: the graph format and pipeline do not require one hosted model or one proprietary connector.

The direct CLI path detects or validates the chosen backend before semantic extraction. Ollama can run without an API key on loopback, Bedrock can use AWS credentials, and Claude CLI can use a locally authenticated CLI.

Sources: [graphify/__main__.py:2897-2970](), [graphify/llm.py:47-118](), [README.md:471-488]()

## How The Map Is Built

After extraction, Graphify merges the AST result and semantic result. The build layer validates and normalizes the extraction shape, then creates a NetworkX graph. Edges keep relation details like `calls`, `imports`, or `uses`, plus confidence labels such as `EXTRACTED`, `INFERRED`, and `AMBIGUOUS`.

The builder also handles practical cleanup: old `links` fields can be read as `edges`, absolute source paths can be made repo-relative, and slightly mismatched IDs can be normalized so relationships are not dropped too easily.

Sources: [graphify/build.py:107-189](), [graphify/build.py:192-220](), [graphify/skill-codex.md:396-429]()
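
The cleanup steps can be sketched roughly as follows. This is not the code in `graphify/build.py`: the function names and the ID rule (lowercase, with other characters collapsed to underscores) are assumptions chosen to show the idea with the standard library only.

```python
import re
from pathlib import PurePosixPath


def norm_id(value: str) -> str:
    """Hypothetical ID rule: lowercase, non-alphanumerics collapsed to '_'."""
    return re.sub(r"[^a-z0-9]+", "_", value.lower()).strip("_")


def normalize_extraction(raw: dict, repo_root: str) -> dict:
    """Return a cleaned copy that uses `edges`, relative paths, and normalized IDs."""
    # Older payloads may carry relationships under `links` instead of `edges`.
    edges = raw.get("edges") or raw.get("links") or []
    root = PurePosixPath(repo_root)
    nodes = []
    for node in raw.get("nodes", []):
        node = {**node, "id": norm_id(node["id"])}
        path = PurePosixPath(node.get("source_file", ""))
        # Only rewrite absolute paths that live inside the repository root.
        if path.is_absolute() and root in path.parents:
            node["source_file"] = str(path.relative_to(root))
        nodes.append(node)
    fixed = [
        {**e, "source": norm_id(e["source"]), "target": norm_id(e["target"])}
        for e in edges
    ]
    return {"nodes": nodes, "edges": fixed}


raw = {
    "nodes": [
        {"id": "auth_service", "source_file": "/repo/src/auth.py"},
        {"id": "database_pool", "source_file": "src/db.py"},
    ],
    "links": [{"source": "Auth-Service", "target": "database_pool", "relation": "calls"}],
}
clean = normalize_extraction(raw, "/repo")
```

In this example the edge written as `Auth-Service` still lands on the `auth_service` node, so the relationship survives a small spelling mismatch instead of being dropped.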

## How Graphify Groups Things

Once it has a graph, Graphify clusters related nodes into communities. A community is like a neighborhood on the map: functions, classes, concepts, files, or documents that are more connected to each other than to the rest of the project.

Graphify also looks for:

| Report idea | Plain meaning |
|---|---|
| God nodes | The most-connected real entities, often core abstractions |
| Surprising connections | Links across files, file types, or communities that may not be obvious |
| Knowledge gaps | Thin or isolated areas that may need review |
| Confidence mix | How much of the graph was directly extracted versus inferred or ambiguous |

Sources: [graphify/cluster.py:86-106](), [graphify/analyze.py:85-136](), [graphify/report.py:67-120](), [graphify/report.py:151-180]()
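
Two of the report ideas are simple enough to sketch with the standard library. The version below ranks nodes by plain degree and counts confidence labels; Graphify's own analysis may filter or weight nodes differently, and the function names here are assumptions.

```python
from collections import Counter


def god_nodes(edges: list[dict], top_n: int = 3) -> list[tuple[str, int]]:
    """Rank nodes by degree, meaning how many edge endpoints touch them."""
    degree = Counter()
    for edge in edges:
        degree[edge["source"]] += 1
        degree[edge["target"]] += 1
    return degree.most_common(top_n)


def confidence_mix(edges: list[dict]) -> dict[str, int]:
    """Count how many edges carry each confidence label."""
    return dict(Counter(edge["confidence"] for edge in edges))


edges = [
    {"source": "a", "target": "db", "confidence": "EXTRACTED"},
    {"source": "b", "target": "db", "confidence": "EXTRACTED"},
    {"source": "c", "target": "db", "confidence": "INFERRED"},
    {"source": "a", "target": "b", "confidence": "AMBIGUOUS"},
]
top = god_nodes(edges, top_n=1)
mix = confidence_mix(edges)
```

Here `db` is touched by three edges, more than any other node, so it is the single "god node" of this tiny graph.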

## What Comes Out

The README describes the default user-facing result as three core files under `graphify-out/`: an interactive HTML graph, a human-readable `GRAPH_REPORT.md`, and the full machine-readable `graph.json`.

The code and tests show additional export paths too: HTML, Obsidian vaults, wiki pages, GraphML, SVG, Neo4j Cypher, and call-flow HTML. The wiki exporter creates an `index.md`, community articles, and god-node articles, which makes the graph easier for agents to crawl as Markdown.

Sources: [README.md:34-47](), [README.md:239-265](), [graphify/export.py:1-17](), [graphify/wiki.py:1-2](), [tests/test_cli_export.py:64-116]()

## How People And Agents Query It Later

The point of saving `graph.json` is that the project does not have to be reread from scratch every time. A user can ask:

```bash
graphify query "show the auth flow"
graphify path "UserService" "DatabasePool"
graphify explain "RateLimiter"
```

Under the hood, `query` loads `graph.json`, scores matching nodes, picks seed nodes, walks the graph with BFS or DFS, and renders a compact text subgraph with a token budget. `path` finds the shortest path between two matched nodes. `explain` finds a node and describes its neighbors.

Sources: [README.md:308-323](), [graphify/__main__.py:1703-1772](), [graphify/__main__.py:1852-1935](), [graphify/serve.py:314-339]()
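
The traversal behind `query` and `path` can be sketched with plain breadth-first search. This is an illustration, not the code in `graphify/__main__.py` or `graphify/serve.py`: the helper names are hypothetical, edges are treated as undirected, and `max_nodes` is a simple stand-in for the token budget.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an edge list into an undirected adjacency map."""
    adj = {}
    for edge in edges:
        adj.setdefault(edge["source"], set()).add(edge["target"])
        adj.setdefault(edge["target"], set()).add(edge["source"])
    return adj


def bfs_subgraph(adj, seeds, max_nodes):
    """Visit nodes breadth-first from the seeds, stopping at max_nodes."""
    seen, queue = [], deque(seeds)
    while queue and len(seen) < max_nodes:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(sorted(adj.get(node, ())))
    return seen


def shortest_path(adj, start, goal):
    """Return the fewest-hops path from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


adj = build_adjacency([
    {"source": "UserService", "target": "AuthService"},
    {"source": "AuthService", "target": "DatabasePool"},
    {"source": "UserService", "target": "Logger"},
])
neighborhood = bfs_subgraph(adj, ["UserService"], max_nodes=3)
route = shortest_path(adj, "UserService", "DatabasePool")
```

The budget is what keeps the answer compact: the search stops after three nodes even though a fourth is reachable.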

## Why It Helps

Plain search answers "where does this word appear?" Graphify tries to answer "what is this thing connected to?" That difference matters in large projects because important relationships are often spread across files, languages, documents, and notes.

The report is also intentionally honest. It shows extraction confidence, highlights ambiguous edges for review, and separates directly found relationships from inferred ones. That makes it useful for human review and safer for agents that need scoped context instead of a full-project file dump.

Sources: [README.md:198-205](), [ARCHITECTURE.md:50-57](), [graphify/report.py:35-75]()

## Provider-Neutral And Portable By Design

Graphify's core artifact is a file-backed graph, not a hosted database or single model provider. The package exposes optional extras for different capabilities, and semantic extraction can route through different backends depending on environment and CLI flags. The assistant skill files are also platform-specific wrappers around the same graph-building idea.

For Grok-Wiki-style integration, the portable flow is: read repository files, generate or update `graphify-out/graph.json`, then crawl `graphify-out/wiki/index.md` or run `graphify query`. That stays BYOC/BYOK-friendly because the integration depends on files and CLI commands, while model choice remains a backend configuration detail.

Sources: [pyproject.toml:50-69](), [pyproject.toml:94-100](), [graphify/__main__.py:206-280](), [graphify/llm.py:47-118](), [graphify/wiki.py:141-178]()

## Final Summary

Graphify is a project-mapping tool. It reads supported project material, extracts named things and relationships, builds a graph, clusters it into meaningful neighborhoods, writes useful artifacts, and lets humans or agents ask scoped questions later. The simple idea is "turn the pile into a map"; the implementation is a Python CLI and library pipeline that saves that map under `graphify-out/`.

Sources: [ARCHITECTURE.md:7-12](), [README.md:26-41](), [graphify/__main__.py:3250-3307]()

---

## 02. First Run & Assistant Setup

> The smallest useful path from installing graphifyy to running graphify, registering assistant skills, and understanding why project-scoped installs stay portable across coding agents.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/02-first-run-assistant-setup.md
- Generated: 2026-05-22T20:42:56.742Z

### Source Files

- `graphify/__main__.py`
- `graphify/skill.md`
- `graphify/skill-codex.md`
- `graphify/hooks.py`
- `tests/test_install.py`
- `tests/test_hooks.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [README.md](README.md)
- [pyproject.toml](pyproject.toml)
- [graphify/__main__.py](graphify/__main__.py)
- [graphify/skill.md](graphify/skill.md)
- [graphify/skill-codex.md](graphify/skill-codex.md)
- [graphify/hooks.py](graphify/hooks.py)
- [graphify/llm.py](graphify/llm.py)
- [tests/test_install.py](tests/test_install.py)
- [tests/test_hooks.py](tests/test_hooks.py)
</details>

# First Run & Assistant Setup

This page shows the smallest useful path from installing the `graphifyy` Python package to using the `graphify` CLI and registering the assistant skill that powers `/graphify`.

The simple mental model is: install the tool once, register a small instruction file for your assistant, then let the assistant build or query `graphify-out/`. Project-scoped installs matter because they put those instructions inside the repository, so the setup can travel with the code instead of depending on one developer's home directory.

## The Shortest First Run

Use the package name `graphifyy`, then run the CLI named `graphify`:

```bash
uv tool install graphifyy
graphify install
```

Then open the supported coding assistant and run:

```text
/graphify .
```

The README calls out the double-y package name, recommends `uv tool install graphifyy`, and explains that the command exposed to users is still `graphify`. It also documents the two-step flow: install the package, then run `graphify install` to register the assistant skill.  
Sources: [README.md:80-99](), [pyproject.toml:5-12](), [pyproject.toml:68-70]()

If `graphify` is not on `PATH`, the README recommends `uv tool install graphifyy` or `pipx install graphifyy`; plain `pip` may require adding the user script directory to `PATH` or running `python -m graphify`.  
Sources: [README.md:84-90](), [README.md:115-117]()

## What `graphify install` Actually Does

`graphify install` copies a packaged skill file into the selected assistant's skill directory. By default it targets Claude Code on Linux/macOS and the Windows Claude skill on Windows. The platform map includes Codex, OpenCode, Aider, OpenClaw, Factory Droid, Trae, Hermes, Pi, Antigravity, Kimi, Gemini, Cursor, and related variants.  
Sources: [graphify/__main__.py:206-287](), [graphify/__main__.py:1496-1541]()

A few concrete destinations:

| Platform | Skill/config destination |
|---|---|
| Claude Code | `.claude/skills/graphify/SKILL.md` |
| Codex | `.agents/skills/graphify/SKILL.md` |
| OpenCode | `.config/opencode/skills/graphify/SKILL.md` |
| OpenClaw | `.openclaw/skills/graphify/SKILL.md` |
| Factory Droid | `.factory/skills/graphify/SKILL.md` |
| Trae | `.trae/skills/graphify/SKILL.md` |

The install tests verify these destinations and also verify that all packaged skill markdown files exist in the package.  
Sources: [tests/test_install.py:9-18](), [tests/test_install.py:32-45](), [tests/test_install.py:254-260]()

## User-Scoped vs Project-Scoped Setup

By default, installs go into the current user's assistant configuration. That is good for a personal machine, but it does not travel with the repository.

For a portable repository setup, add `--project`:

```bash
graphify install --project
graphify install --project --platform codex
```

Project-scoped installs write into the current directory, for example `.claude/skills/graphify/SKILL.md` or `.agents/skills/graphify/SKILL.md`, and print a `git add` hint so the files can be committed.  
Sources: [README.md:101-113](), [graphify/__main__.py:160-173]()

```text
User-scoped install
  ~/.claude/skills/graphify/SKILL.md
  ~/.agents/skills/graphify/SKILL.md

Project-scoped install
  ./AGENTS.md
  ./.agents/skills/graphify/SKILL.md
  ./.codex/hooks.json
  ./.claude/skills/graphify/SKILL.md
  ./.claude/settings.json
```

The key safety property is scope separation. Tests create a fake home directory and a fake project, install with `--project`, and assert that the project receives the skill files while the home directory is not modified. Separate uninstall tests assert that project uninstall removes only project files and leaves user-scoped skill files alone.  
Sources: [tests/test_install.py:57-87](), [tests/test_install.py:89-141](), [graphify/__main__.py:1059-1115]()
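
The scope rule boils down to choosing a base directory. A minimal sketch, using the relative paths from the destination table above; the function name `skill_destination` and its parameters are hypothetical, and the real installer handles many more platforms and files.

```python
from pathlib import PurePosixPath

# Relative skill paths copied from the destination table on this page.
SKILL_PATHS = {
    "claude": PurePosixPath(".claude/skills/graphify/SKILL.md"),
    "codex": PurePosixPath(".agents/skills/graphify/SKILL.md"),
}


def skill_destination(platform, project, home, cwd):
    """Project scope writes under the repository; user scope writes under home."""
    base = cwd if project else home
    return base / SKILL_PATHS[platform]


home = PurePosixPath("/home/dev")
repo = PurePosixPath("/work/repo")
user_path = skill_destination("codex", False, home, repo)
project_path = skill_destination("codex", True, home, repo)
```

Because the two scopes never share a base directory, a project install or uninstall has no reason to touch anything under the home directory.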

## Assistant Instructions: Query First, Then Read Files

The installed instructions tell assistants to use the graph before broad source browsing. For Claude-style installs, the generated section says to run `graphify query "<question>"` when `graphify-out/graph.json` exists, use `graphify path "<A>" "<B>"` for relationship questions, and use `graphify explain "<concept>"` for focused concepts. It also tells the assistant to prefer `graphify-out/wiki/index.md` when present and to update the graph after code edits.  
Sources: [graphify/__main__.py:395-405]()

For Codex and other AGENTS.md-based platforms, the generated section adds one practical caveat: dirty `graphify-out/` files are expected after hooks or incremental updates and are not by themselves a reason to skip Graphify.  
Sources: [graphify/__main__.py:409-424](), [tests/test_install.py:217-222]()

The skill files carry the same behavior. The generic skill says that if `graphify-out/graph.json` exists and the user asks a natural-language codebase question, the assistant should run `graphify query` immediately instead of re-detecting or re-extracting. The Codex skill says the same graph-first behavior should still apply when `graphify-out/` is dirty.  
Sources: [graphify/skill.md:37-58](), [graphify/skill-codex.md:52-60]()

## Hooks: Helpful Nudges, Not the Whole System

Some assistants support hooks that can run before tool use. Graphify uses those hooks as nudges toward the graph, but the durable setup is still the skill or instruction file.

Claude project setup writes `CLAUDE.md` and registers a `PreToolUse` hook in `.claude/settings.json`. Codex project setup writes `AGENTS.md` and registers a `PreToolUse` hook in `.codex/hooks.json`; the hook calls `graphify hook-check`, which intentionally exits silently because Codex Desktop rejects payload-bearing `additionalContext` from that hook path.  
Sources: [graphify/__main__.py:1159-1205](), [graphify/__main__.py:979-1008](), [graphify/__main__.py:2223-2279](), [tests/test_hooks.py:164-180]()

For Git repositories, `graphify hook install` installs both `post-commit` and `post-checkout` hooks. The post-commit hook rebuilds the code graph in the background after commits, and the post-checkout hook rebuilds after branch switches if `graphify-out/` already exists. Both use local AST/code rebuild paths and write logs to `~/.cache/graphify-rebuild.log`.  
Sources: [graphify/hooks.py:46-101](), [graphify/hooks.py:104-158](), [graphify/hooks.py:262-304]()

The hook tests verify that install creates executable hooks, is idempotent, preserves existing hook content, creates `post-checkout`, removes hooks on uninstall, and reports both hook statuses.  
Sources: [tests/test_hooks.py:15-52](), [tests/test_hooks.py:88-120]()
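
The properties those tests check (idempotent, preserves existing content, executable) can be sketched like this. The marker string, the placeholder `echo` command, and the function name are all hypothetical; the real hook body lives in `graphify/hooks.py`.

```python
import stat
import tempfile
from pathlib import Path

MARKER = "# graphify-hook (hypothetical marker)"
SNIPPET = f"{MARKER}\necho 'rebuild graph here'\n"  # placeholder command


def install_hook(hooks_dir: Path, name: str) -> bool:
    """Append the snippet to a Git hook once; return True if the file changed."""
    hook = hooks_dir / name
    existing = hook.read_text() if hook.exists() else "#!/bin/sh\n"
    if MARKER in existing:
        return False  # already installed, so a second run is a no-op
    if not existing.endswith("\n"):
        existing += "\n"
    hook.write_text(existing + SNIPPET)  # keep whatever was there before
    hook.chmod(hook.stat().st_mode | stat.S_IXUSR)
    return True


with tempfile.TemporaryDirectory() as tmp:
    hooks = Path(tmp)
    (hooks / "post-commit").write_text("#!/bin/sh\necho 'existing step'\n")
    first = install_hook(hooks, "post-commit")
    second = install_hook(hooks, "post-commit")
    content = (hooks / "post-commit").read_text()
```

Checking for a marker before writing is one common way to make an installer safe to run twice.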

## Running Graphify After Setup

Once the assistant is registered, `/graphify .` is the normal first run from inside the assistant. The generic skill handles the workflow: ensure the package is importable, detect files, write metadata into `graphify-out/`, and continue into extraction. The metadata records the interpreter in `graphify-out/.graphify_python` and the scan root in `graphify-out/.graphify_root`, which helps later commands reuse the right environment and path.  
Sources: [graphify/skill.md:103-143](), [graphify/skill.md:169-181]()

For direct CLI usage, `graphify --help` exposes the same operational surface: `query`, `path`, `explain`, `watch`, `update`, `extract`, `hook install`, and per-platform install commands. The `update` command is explicitly code-only and prints that document, paper, and image changes require `/graphify --update` in the assistant.  
Sources: [graphify/__main__.py:1365-1481](), [graphify/__main__.py:2223-2272]()

## Provider-Neutral and BYOK-Friendly Behavior

Graphify is not tied to one model provider. The README says code is extracted locally with tree-sitter and no API calls, while docs, PDFs, and images use the assistant's model API or a configured headless backend.  
Sources: [README.md:208-221](), [README.md:357-362]()

For headless `graphify extract`, the backend layer supports multiple provider styles: Anthropic Claude, Kimi, local Ollama, Gemini, OpenAI, DeepSeek, AWS Bedrock, and a `claude-cli` route that uses a local Claude Code CLI session. API keys are read from backend-specific environment variables, while Bedrock uses the AWS credential chain and Ollama can use a local base URL.  
Sources: [graphify/llm.py:47-118](), [graphify/llm.py:214-245](), [graphify/llm.py:535-568](), [README.md:332-361]()

This keeps the assistant setup BYOC/BYOK-friendly: the skill files are ordinary repository or user files, and the extraction backend is selected by command flags, environment variables, local tools, or the active IDE assistant session rather than by a hard dependency on one hosted service.

## Practical Recommendation

For a team repository, use a project-scoped install for each assistant family the team actually uses:

```bash
graphify install --project --platform codex
graphify install --project --platform claude
graphify hook install
```

Commit only the project-scoped assistant files you want the team to share. This makes the graph-first workflow portable across coding agents while keeping each developer's personal keys, local models, and assistant subscriptions outside the repository. Project-scoped install and uninstall behavior is covered by tests that confirm user-level skill files are not touched.  
Sources: [README.md:101-113](), [graphify/__main__.py:1059-1115](), [tests/test_install.py:89-141]()

---

## 03. What Gets Put In The Box

> How Graphify decides which files count, skips sensitive or noisy inputs, converts Office and Google Workspace files, handles transcripts, and caches work for later runs.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/03-what-gets-put-in-the-box.md
- Generated: 2026-05-22T20:43:03.175Z

### Source Files

- `graphify/detect.py`
- `graphify/ingest.py`
- `graphify/google_workspace.py`
- `graphify/transcribe.py`
- `graphify/cache.py`
- `tests/test_detect.py`
- `tests/test_google_workspace.py`
- `tests/test_transcribe.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/detect.py](graphify/detect.py)
- [graphify/ingest.py](graphify/ingest.py)
- [graphify/google_workspace.py](graphify/google_workspace.py)
- [graphify/transcribe.py](graphify/transcribe.py)
- [graphify/cache.py](graphify/cache.py)
- [graphify/__main__.py](graphify/__main__.py)
- [pyproject.toml](pyproject.toml)
- [README.md](README.md)
- [tests/test_detect.py](tests/test_detect.py)
- [tests/test_google_workspace.py](tests/test_google_workspace.py)
- [tests/test_transcribe.py](tests/test_transcribe.py)
- [tests/test_cache.py](tests/test_cache.py)
</details>

# What Gets Put In The Box

Graphify's "box" is the set of files it will turn into graph input. The box is not just "everything under this folder." Graphify first walks the tree, avoids common junk, skips likely secrets, converts some files into readable Markdown sidecars, and remembers previous work so later runs can focus on what changed.

No generated wiki context, `STRATEGY.md`, or `docs/solutions/**` files were present in this checkout. The Compound Engineering profile was used only as page-shape guidance; the implementation claims below come from repository code and tests.

## The Short Mental Model

Think of Graphify as packing a moving box:

```text
folder on disk
  -> walk files, but skip trash piles and secret drawers
  -> classify remaining files by type
  -> convert Office / Google shortcuts when possible
  -> keep videos as media inputs and transcripts as text outputs
  -> hash and cache extracted results for next time
```

The important boundary is that file selection is provider-neutral. Detection, skipping, conversion, transcript caching, and manifest comparison happen locally. Later semantic extraction can use different configured backends, but the "what counts as input" layer is not tied to one model provider.

Sources: [graphify/detect.py:862-1005](), [graphify/__main__.py:2978-3017](), [README.md:357-362]()

## File Types Graphify Recognizes

Graphify classifies files into five buckets: `code`, `document`, `paper`, `image`, and `video`. The extension sets are declared near the top of `detect.py`, and `classify_file()` applies them in a fixed order.

| Bucket | Examples | Notes |
|---|---|---|
| `code` | `.py`, `.ts`, `.tsx`, `.js`, `.go`, `.rs`, `.java`, `.sql`, `.json`, shell files | Extensionless scripts can count as code when a supported shebang is found. |
| `document` | `.md`, `.mdx`, `.txt`, `.rst`, `.html`, `.yaml`, `.yml`, `.docx`, `.xlsx`, `.gdoc`, `.gsheet`, `.gslides` | Office and Google Workspace files may be converted before extraction. |
| `paper` | `.pdf`, or Markdown/text that looks academic | Markdown/text becomes `paper` only after enough paper-like signals are found. |
| `image` | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.svg` | Images are semantic extraction inputs. |
| `video` | `.mp4`, `.mov`, `.webm`, `.mkv`, `.avi`, `.m4v`, `.mp3`, `.wav`, `.m4a`, `.ogg` | Video and audio files are counted as files, but contribute no readable words during detection. |

A small but important edge case: PDFs inside Xcode asset catalogs are skipped, because those PDFs are usually vector assets rather than papers.

```python
# graphify/detect.py
if ext in PAPER_EXTENSIONS:
    if any(part.endswith(tuple(_ASSET_DIR_MARKERS)) for part in path.parts):
        return None
    return FileType.PAPER
```

Sources: [graphify/detect.py:18-33](), [graphify/detect.py:289-316](), [tests/test_detect.py:6-33](), [tests/test_detect.py:301-315]()
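
A fixed-order classifier can be sketched in a few lines. The extension sets below are small illustrative subsets, the bucket order is simply the dictionary order chosen here, and the shebang check (only `python`) is an assumption; the real sets and rules live in `detect.py`.

```python
from pathlib import PurePosixPath

# Illustrative subsets only; checked in this order.
EXTENSIONS = {
    "code": {".py", ".ts", ".go"},
    "document": {".md", ".txt", ".docx"},
    "paper": {".pdf"},
    "image": {".png", ".svg"},
    "video": {".mp4", ".mp3"},
}


def classify(path: str, first_line: str = ""):
    """Return a bucket name, or None when the file does not count as input."""
    ext = PurePosixPath(path).suffix.lower()
    for bucket, extensions in EXTENSIONS.items():
        if ext in extensions:
            return bucket
    # Extensionless scripts count as code when the shebang names an interpreter.
    if not ext and first_line.startswith("#!") and "python" in first_line:
        return "code"
    return None


kinds = [classify("src/App.PY"), classify("notes.md"), classify("talk.mp3")]
script_kind = classify("bin/tool", "#!/usr/bin/env python3")
```

Returning `None` for unknown files is what keeps them out of the box without raising an error.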

## Walking The Tree Without Eating The Build Folder

`detect()` starts from a scan root and builds a `files` dictionary with all five buckets. It uses `os.walk()`, prunes known-noise directories before descending, skips lockfiles, then classifies each remaining file.

The always-skipped directory list includes dependency folders, virtual environments, language build outputs, framework caches, coverage reports, visual regression bundles, Storybook builds, Graphify's own output, and `.worktrees`. This prevents generated output from becoming false architecture input.

Graphify does not blindly skip every dot directory. Tests show `.github/` is allowed, while `.next/` and `.graphify/` are still skipped. That means useful hidden project configuration can enter the box, but framework caches and Graphify's own cache do not.

Sources: [graphify/detect.py:537-575](), [graphify/detect.py:896-925](), [tests/test_detect.py:372-459](), [tests/test_detect.py:647-657]()
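
Pruning before descending is a standard `os.walk()` idiom: editing the `dirnames` list in place tells the walker which subdirectories to skip. The sketch below uses a short illustrative skip set, not the full list from `detect.py`, and the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Illustrative subset; the real skip list is longer.
SKIP_DIRS = {"node_modules", ".git", ".next", ".graphify", "graphify-out"}


def walk_project(root: Path) -> list[str]:
    """Return relative file paths, never descending into skipped directories."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place assignment prunes the walk; rebinding the name would not.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            found.append((Path(dirpath) / name).relative_to(root).as_posix())
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("src/app.py", "node_modules/lib/index.js", ".github/ci.yml", ".next/cache.js"):
        (root / rel).parent.mkdir(parents=True, exist_ok=True)
        (root / rel).write_text("x")
    files = walk_project(root)
```

Note that `.github/` survives because the filter is a named skip set, not a blanket rule against dot directories.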

## Ignore, Include, And Exclude Rules

Graphify supports project-level filtering with `.graphifyignore`. If `.graphifyignore` is absent, it falls back to `.gitignore`; if both exist, `.graphifyignore` wins. Patterns are loaded from the nearest VCS root down to the scan root, so subdirectory scans inside a repo still inherit repo-level ignore rules.

The matching behavior follows important gitignore ideas:

| Rule | Behavior |
|---|---|
| Last match wins | Later patterns override earlier ones. |
| `!` negation | A negated pattern can re-include something already ignored. |
| Parent exclusion still wins | A file cannot be rescued if an ancestor directory remains excluded. |
| CLI excludes | `extra_excludes` are appended last, so command-line excludes override ignore files. |
| Include file | `.graphifyinclude` can opt hidden files or directories into traversal, but sensitive files and hard-skipped noise dirs are still excluded. |

Tests cover all of these decisions, including `.gitignore` fallback, `.graphifyignore` precedence, negation behavior, and explicit extra excludes.

Sources: [graphify/detect.py:618-731](), [graphify/detect.py:734-840](), [graphify/detect.py:876-885](), [tests/test_detect.py:89-121](), [tests/test_detect.py:464-503](), [tests/test_detect.py:611-672]()
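
The first four rows of the table can be sketched with `fnmatch`. This is a simplified matcher for illustration, not Graphify's implementation and not full gitignore syntax: patterns are tested against both the whole relative path and the final name, and all function names are hypothetical.

```python
from fnmatch import fnmatch


def _matches(pattern: str, rel_path: str) -> bool:
    pattern = pattern.rstrip("/")
    name = rel_path.rsplit("/", 1)[-1]
    return fnmatch(rel_path, pattern) or fnmatch(name, pattern)


def _verdict(patterns: list[str], rel_path: str) -> bool:
    ignored = False
    for pattern in patterns:  # later patterns override earlier ones
        negated = pattern.startswith("!")
        if _matches(pattern[1:] if negated else pattern, rel_path):
            ignored = not negated
    return ignored


def is_ignored(patterns: list[str], rel_path: str) -> bool:
    parts = rel_path.split("/")
    # An excluded ancestor directory cannot be rescued by negating the file.
    for depth in range(1, len(parts)):
        if _verdict(patterns, "/".join(parts[:depth])):
            return True
    return _verdict(patterns, rel_path)


patterns = ["*.log", "!keep.log", "build/", "!build/report.md"]
checks = {
    path: is_ignored(patterns, path)
    for path in ("debug.log", "keep.log", "build/report.md", "src/app.py")
}
# Extra excludes are appended last, so they beat the earlier negation.
with_cli_exclude = is_ignored(patterns + ["keep.log"], "keep.log")
```

`build/report.md` stays ignored even though a negation names it, because its parent `build/` is still excluded.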

## Sensitive Inputs Stay Out

Before a classified file is accepted, Graphify checks whether the path looks sensitive. There are two layers:

1. Parent directory names such as `.ssh`, `.gnupg`, `.aws`, `.gcloud`, `secrets`, `.secrets`, and `credentials`.
2. Filename patterns such as `.env`, private key and certificate extensions, passwords, secrets, tokens, `.netrc`, `.pgpass`, and common cloud credential names.

This check happens after ignore filtering and before conversion or word counting. Sensitive paths are recorded in `skipped_sensitive` rather than added to the input buckets.

The tests are intentionally specific: `api_token.txt`, `oauth_token.json`, `app_secret.yaml`, `passwords.py`, SSH keys, and `config/secrets/db.json` are flagged, while false friends like `tokenizer.py` and `tokenize.py` are not.

Sources: [graphify/detect.py:39-61](), [graphify/detect.py:82-91](), [graphify/detect.py:935-940](), [tests/test_detect.py:506-558]()
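A minimal sketch of the two-layer check, assuming simplified patterns. The directory names come from the list above; the regular expressions and the `looks_sensitive` helper are hypothetical stand-ins for the longer pattern list in `detect.py`.

```python
import re
from pathlib import PurePosixPath

# Layer 1: parent directory names listed in the prose above.
SENSITIVE_DIRS = {".ssh", ".gnupg", ".aws", ".gcloud", "secrets", ".secrets", "credentials"}

# Layer 2: hypothetical filename patterns (the real list is longer).
SENSITIVE_NAME_PATTERNS = [
    re.compile(r"^\.env(\..+)?$"),
    re.compile(r"\.(pem|key|p12|pfx)$"),
    re.compile(r"^id_(rsa|dsa|ecdsa|ed25519)$"),
    # Require a delimiter on both sides so "tokenizer" does not match "token".
    re.compile(r"(^|[_\-.])(passwords?|secrets?|tokens?)([_\-.]|$)"),
    re.compile(r"^\.(netrc|pgpass)$"),
]


def looks_sensitive(rel_path: str) -> bool:
    path = PurePosixPath(rel_path)
    if any(part in SENSITIVE_DIRS for part in path.parts[:-1]):
        return True
    name = path.name.lower()
    return any(pattern.search(name) for pattern in SENSITIVE_NAME_PATTERNS)


for candidate in ("api_token.txt", "config/secrets/db.json", "tokenizer.py"):
    print(candidate, looks_sensitive(candidate))
```

The delimiter requirement around `token`, `secret`, and `password` is one way to get the behavior the tests describe, where `api_token.txt` is flagged but `tokenizer.py` is not.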

## Office Files Become Markdown Sidecars

`.docx` and `.xlsx` files are first classified as documents, then converted into Markdown files under `graphify-out/converted/`. Graphify does this because the later extraction path needs readable text, not raw Office binaries.

For Word documents, `docx_to_markdown()` reads paragraphs, maps heading styles to Markdown headings, maps list styles to bullets, and serializes tables as Markdown tables. For Excel workbooks, `xlsx_to_markdown()` reads each sheet and turns non-empty rows into sheet sections and tables.

If conversion produces no text, or the optional libraries are missing, the original Office file is skipped with a note suggesting `pip install graphifyy[office]`.

Sources: [graphify/detect.py:334-371](), [graphify/detect.py:374-401](), [graphify/detect.py:494-520](), [graphify/detect.py:963-974](), [pyproject.toml:50-59]()
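The table-serialization step can be illustrated without any Office library. The sketch below assumes rows have already been read into Python lists; `rows_to_markdown_table` is a hypothetical helper, not the function used by `docx_to_markdown()` or `xlsx_to_markdown()`.

```python
def rows_to_markdown_table(rows: list) -> str:
    """Render non-empty rows as a Markdown table; the first row is the header."""
    cleaned = [
        ["" if cell is None else str(cell).replace("|", "\\|").replace("\n", " ")
         for cell in row]
        for row in rows
        if any(cell not in (None, "") for cell in row)   # skip fully empty rows
    ]
    if not cleaned:
        return ""
    width = max(len(row) for row in cleaned)
    padded = [row + [""] * (width - len(row)) for row in cleaned]
    header, *body = padded
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


sheet = [["name", "qty"], [None, None], ["bolt", 3]]
print(rows_to_markdown_table(sheet))
```

Returning an empty string for an empty sheet mirrors the rule above that a conversion producing no text leads to the file being skipped.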

## Google Workspace Shortcuts Are Opt-In

Google Drive desktop files such as `.gdoc`, `.gsheet`, and `.gslides` are shortcuts, not document content. By default, Graphify classifies them as documents but skips them with a message telling the user to pass `--google-workspace` or set `GRAPHIFY_GOOGLE_WORKSPACE=1`.

When enabled, Graphify reads the shortcut JSON, extracts a Drive file ID from fields like `doc_id`, `file_id`, `fileId`, `id`, `resource_id`, or the URL, then exports the real content through the `gws` CLI. Google Docs export as Markdown, Slides export as plain text, and Sheets export as `.xlsx` before passing through the spreadsheet-to-Markdown converter.

The converted sidecar includes frontmatter that records the source file, source type, Google file ID, export MIME type, source URL, and a hash of the Google account email when present. That account hash preserves traceability without storing the raw email in the sidecar.

Sources: [graphify/google_workspace.py:1-29](), [graphify/google_workspace.py:63-91](), [graphify/google_workspace.py:94-122](), [graphify/google_workspace.py:129-147](), [graphify/google_workspace.py:150-223](), [graphify/detect.py:942-962](), [tests/test_google_workspace.py:7-31](), [tests/test_google_workspace.py:33-75]()
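A hedged sketch of the ID lookup and the account hash. The field names come from the paragraph above; the URL pattern, the SHA-256 choice, the 16-character truncation, and both helper names are assumptions for illustration. Real shortcut files may also prefix some ID fields with a type, which this sketch does not handle.

```python
import hashlib
import json
import re
from typing import Optional

ID_FIELDS = ("doc_id", "file_id", "fileId", "id", "resource_id")
# Hypothetical pattern for Drive links shaped like .../d/<id>/...
_URL_ID = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_drive_id(shortcut_json: str) -> Optional[str]:
    """Try the known ID fields in order, then fall back to the URL."""
    data = json.loads(shortcut_json)
    for field in ID_FIELDS:
        value = data.get(field)
        if isinstance(value, str) and value:
            return value
    match = _URL_ID.search(str(data.get("url", "")))
    return match.group(1) if match else None


def account_hash(email: str) -> str:
    """Hash the normalized email so the sidecar never stores it in the clear."""
    normalized = email.strip().lower().encode("utf-8")
    return hashlib.sha256(normalized).hexdigest()[:16]


print(extract_drive_id('{"url": "https://docs.google.com/document/d/XyZ_9-k/edit"}'))
```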

## URLs, Web Pages, PDFs, Images, And YouTube Adds

`ingest.py` handles content added by URL. It first classifies the URL as tweet, arXiv, GitHub, YouTube, PDF, image, or generic webpage. PDFs and images are downloaded as binary files. Web pages, tweets, and arXiv pages are saved as annotated Markdown with YAML frontmatter. YouTube URLs are handed to the video downloader in `transcribe.py`.

The URL path is security-aware: `ingest()` validates URLs before fetching, and the lower-level fetches go through `graphify.security.safe_fetch` or `safe_fetch_text`.

Sources: [graphify/ingest.py:64-81](), [graphify/ingest.py:84-100](), [graphify/ingest.py:136-162](), [graphify/ingest.py:165-207](), [graphify/ingest.py:218-269]()
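The classification step could look roughly like the sketch below. The host lists, the extension lists, and the check order are assumptions; `classify_url` here is a hypothetical stand-in, not the function in `ingest.py`.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Return one of: tweet, arxiv, github, youtube, pdf, image, webpage."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.lower()

    # Site-specific kinds are checked before generic file extensions.
    if host in {"twitter.com", "x.com"}:
        return "tweet"
    if host == "arxiv.org":
        return "arxiv"
    if host == "github.com":
        return "github"
    if host in {"youtube.com", "m.youtube.com", "youtu.be"}:
        return "youtube"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".png", ".jpg", ".jpeg", ".gif", ".webp")):
        return "image"
    return "webpage"


print(classify_url("https://www.youtube.com/watch?v=abc"))
```

Classifying on the parsed hostname and path, rather than on substring matches against the whole URL, avoids misreading a query string that merely mentions another site.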

## Video And Transcript Handling

Graphify detects audio/video files, but detection does not count their bytes as words. The transcript layer is separate.

`transcribe.py` can transcribe a local media file or a URL. For URLs, it validates the URL, downloads an audio-only stream through `yt-dlp`, and names the downloaded file from a stable URL hash. For transcription, it uses `faster-whisper` locally with a model name from `GRAPHIFY_WHISPER_MODEL`, defaulting to `base`.

Caching is simple: if `graphify-out/transcripts/<media-stem>.txt` already exists, `transcribe()` returns that path unless `force=True`. `transcribe_all()` processes a list and skips files that fail, warning instead of stopping the whole batch.

The Whisper prompt is also local and provider-neutral. It uses `GRAPHIFY_WHISPER_PROMPT` if set; otherwise it formats up to five graph "god node" labels into a domain hint, or falls back to a punctuation/paragraph prompt.

Sources: [graphify/transcribe.py:9-18](), [graphify/transcribe.py:43-90](), [graphify/transcribe.py:93-113](), [graphify/transcribe.py:116-183](), [tests/test_transcribe.py:22-54](), [tests/test_transcribe.py:68-110](), [tests/test_detect.py:353-369]()
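The prompt fallback chain can be sketched as below. The environment variable name and the five-label limit come from the text; the prompt wording and the `build_whisper_prompt` helper are hypothetical.

```python
import os
from typing import Mapping, Optional

FALLBACK_PROMPT = "Use proper punctuation and paragraph breaks."  # hypothetical wording


def build_whisper_prompt(god_node_labels: list, env: Optional[Mapping] = None) -> str:
    env = os.environ if env is None else env
    override = env.get("GRAPHIFY_WHISPER_PROMPT")
    if override:                       # 1. an explicit user prompt wins
        return override
    labels = [label for label in god_node_labels if label][:5]
    if labels:                         # 2. otherwise hint with up to five labels
        return "Technical discussion about " + ", ".join(labels) + "."  # hypothetical template
    return FALLBACK_PROMPT             # 3. generic punctuation prompt


print(build_whisper_prompt(["Router", "Cache", "Manifest"], env={}))
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.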

## Caching: Remember The Work, Not The Whole Run

Graphify has two related memory systems: extraction caches and the manifest.

The extraction cache stores result JSON under `graphify-out/cache/{kind}/{hash}.json`, where `kind` is usually `ast` or `semantic`. The hash is based on file content plus the path relative to the cache root, which makes cache entries portable across checkout directories. For Markdown, Graphify strips YAML frontmatter before hashing, so metadata-only changes do not invalidate extraction results.

A stat index at `graphify-out/cache/stat-index.json` avoids rereading unchanged files. If file size and `mtime_ns` match the previous entry, Graphify reuses the previous hash. If the stat data changes, it rereads and hashes the file.

Semantic caching groups nodes, edges, and hyperedges by `source_file`, then saves one cache entry per source file. During extraction, cached semantic results are merged directly, and only uncached files go to fresh semantic extraction.

Sources: [graphify/cache.py:17-41](), [graphify/cache.py:97-146](), [graphify/cache.py:149-190](), [graphify/cache.py:193-245](), [graphify/cache.py:263-329](), [tests/test_cache.py:19-51](), [tests/test_cache.py:79-128](), [graphify/__main__.py:3045-3135]()
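To make the cache key and the stat shortcut concrete, here is a minimal sketch under stated assumptions: SHA-256 as the hash, a NUL byte separating path from content, and hypothetical helper names. The real functions in `cache.py` may differ in all three respects.

```python
import hashlib


def strip_frontmatter(text: str) -> str:
    """Drop a leading YAML frontmatter block delimited by '---' lines."""
    if not text.startswith("---\n"):
        return text
    end = text.find("\n---\n", 3)
    return text if end == -1 else text[end + len("\n---\n"):]


def cache_key(rel_path: str, content: bytes) -> str:
    """Hash the relative path plus content; Markdown ignores frontmatter."""
    if rel_path.lower().endswith(".md"):
        text = content.decode("utf-8", errors="replace")
        content = strip_frontmatter(text).encode("utf-8")
    digest = hashlib.sha256()
    digest.update(rel_path.encode("utf-8") + b"\0")
    digest.update(content)
    return digest.hexdigest()


def reuse_hash(size: int, mtime_ns: int, previous: dict):
    """Return the stored hash when size and mtime_ns both match, else None."""
    if previous.get("size") == size and previous.get("mtime_ns") == mtime_ns:
        return previous.get("hash")
    return None


first = cache_key("docs/note.md", b"---\ntitle: one\n---\nbody\n")
second = cache_key("docs/note.md", b"---\ntitle: two\n---\nbody\n")
print(first == second)  # only the frontmatter differs
```

Because the key uses the path relative to the cache root rather than an absolute path, the same checkout in a different directory produces the same key.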

## Incremental Runs: What Changed Since Last Time

The manifest is the second memory system. `save_manifest()` writes file mtimes plus separate `ast_hash` and `semantic_hash` values. This separation matters because `graphify update` can refresh AST-only code information without pretending that semantic document extraction is also current.

`detect_incremental()` runs a normal detection pass, loads the previous manifest, and returns changed files separately from unchanged files. It has a fast path for unchanged mtimes and matching hashes, and a slower path that compares content hashes when mtimes move. It also reports deleted files so the graph builder can prune old sources.

During full extraction, `__main__.py` is careful not to stamp semantic success for documents, papers, or images whose semantic chunks failed. That keeps failed files eligible for retry on the next incremental run.

Sources: [graphify/detect.py:1021-1091](), [graphify/detect.py:1094-1126](), [graphify/__main__.py:2983-3017](), [graphify/__main__.py:3152-3166](), [tests/test_detect.py:270-299](), [tests/test_detect.py:560-610]()
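The changed, unchanged, and deleted split can be sketched as follows. This simplification keeps one `hash` per file instead of separate `ast_hash` and `semantic_hash` values, and the fast path checks only the mtime; the function name and manifest layout are hypothetical.

```python
from typing import Callable


def diff_against_manifest(current_mtimes: dict, manifest: dict,
                          hash_of: Callable[[str], str]) -> dict:
    changed, unchanged = [], []
    for path in sorted(current_mtimes):
        old = manifest.get(path)
        if old is None:
            changed.append(path)                    # new file
        elif old["mtime"] == current_mtimes[path]:
            unchanged.append(path)                  # fast path: no hashing
        elif old["hash"] == hash_of(path):
            unchanged.append(path)                  # touched, same content
        else:
            changed.append(path)
    deleted = sorted(set(manifest) - set(current_mtimes))
    return {"changed": changed, "unchanged": unchanged, "deleted": deleted}


manifest = {
    "a.py": {"mtime": 1.0, "hash": "h-a"},
    "b.md": {"mtime": 1.0, "hash": "h-b"},
    "c.py": {"mtime": 1.0, "hash": "h-c"},
    "old.py": {"mtime": 1.0, "hash": "h-old"},
}
current = {"a.py": 1.0, "b.md": 2.0, "c.py": 2.0, "new.py": 2.0}
fake_hashes = {"b.md": "h-b", "c.py": "h-c2"}
hashed = []


def fake_hash(path: str) -> str:
    hashed.append(path)          # record which files needed the slow path
    return fake_hashes[path]


result = diff_against_manifest(current, manifest, fake_hash)
print(result)
```

Only files whose mtime moved reach `hash_of`, which is the point of the fast path: an untouched file costs one dictionary lookup.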

## What Does Not Go In The Box

Graphify deliberately leaves out several things:

| Input | Why it stays out |
|---|---|
| Known dependency/build/cache directories | They are generated or redundant, not source knowledge. |
| Lockfiles such as `package-lock.json`, `Cargo.lock`, `poetry.lock` | They are large generated dependency state, not usually useful graph input. |
| Sensitive-looking paths | Secrets should not enter extraction. |
| Google Workspace shortcuts without opt-in | Shortcut files are pointers and may require authenticated export. |
| Failed Office/Google conversions | Graphify needs readable text sidecars. |
| Video bytes in word counts | Media becomes useful after transcription, not by counting binary data. |
| `graphify-out/converted/` sidecars during the original walk | This prevents Graphify from re-processing its own conversion output. |

Sources: [graphify/detect.py:537-565](), [graphify/detect.py:887-891](), [graphify/detect.py:929-934](), [graphify/detect.py:937-974](), [tests/test_detect.py:318-343]()

## Summary

Graphify's input box is built in layers: recognize useful file types, avoid obvious noise, respect ignore rules, refuse likely secrets, convert unreadable-but-supported formats into Markdown, handle media through transcript files, and use caches plus manifests so repeated runs only redo necessary work. This keeps the architecture portable: local file discovery and caching are independent of the model backend, while semantic extraction can run through whichever configured provider or local backend the user chooses.

Sources: [graphify/detect.py:862-1005](), [graphify/cache.py:263-329](), [README.md:357-362]()

---

## 04. How Files Become Tiny Facts

> The extraction stage: tree-sitter reads code, optional semantic extraction reads richer context, validation keeps the shape consistent, and provider-neutral backends support BYOC and BYOK choices.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/04-how-files-become-tiny-facts.md
- Generated: 2026-05-22T20:43:04.595Z

### Source Files

- `graphify/extract.py`
- `graphify/symbol_resolution.py`
- `graphify/llm.py`
- `graphify/validate.py`
- `tests/test_extract.py`
- `tests/test_languages.py`
- `tests/test_symbol_resolution.py`
- `tests/test_llm_backends.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/extract.py](graphify/extract.py)
- [graphify/symbol_resolution.py](graphify/symbol_resolution.py)
- [graphify/llm.py](graphify/llm.py)
- [graphify/validate.py](graphify/validate.py)
- [graphify/skill-pi.md](graphify/skill-pi.md)
- [tests/test_extract.py](tests/test_extract.py)
- [tests/test_languages.py](tests/test_languages.py)
- [tests/test_symbol_resolution.py](tests/test_symbol_resolution.py)
- [tests/test_llm_backends.py](tests/test_llm_backends.py)
- [tests/test_validate.py](tests/test_validate.py)
</details>

# How Files Become Tiny Facts

Graphify turns a repository into small, linkable facts: "this file contains this class," "this function calls that function," "this document mentions this concept," and "this chunk came from this source file." Those facts become nodes and edges that later graph-building steps can index, query, and visualize.

The extraction stage has two lanes. The first lane is deterministic: tree-sitter reads code syntax and emits structural facts without spending model tokens. The second lane is optional semantic extraction: an AI backend can read richer context from documents, papers, images, or code relationships that syntax alone cannot see. The key design choice is that both lanes produce the same graph-shaped JSON, which lets Graphify stay provider-neutral and support BYOC/BYOK choices.

Sources: [graphify/extract.py:1310-1395](), [graphify/llm.py:133-147](), [graphify/skill-pi.md:157-195]()

## The Smallest Mental Model

Think of extraction like sorting a box of index cards.

- A **node** is one card: a file, class, function, concept, document, or image-derived item.
- An **edge** is a string between cards: imports, contains, calls, references, implements, cites, or similar.
- A **confidence** says how the fact was found: `EXTRACTED` for directly visible evidence, `INFERRED` for a reasonable link, and `AMBIGUOUS` when the extractor keeps uncertainty visible instead of pretending.

The deterministic code extractor starts each file with a file node, then adds child nodes and edges as the AST walker sees syntax. `add_node` records `id`, `label`, `file_type`, `source_file`, and `source_location`; `add_edge` records endpoints, relation, confidence, source file, line, and weight.

Sources: [graphify/extract.py:1357-1382](), [graphify/validate.py:4-8]()

```text
source file
  -> file node
       -> class/function nodes
       -> import/call/inheritance/reference edges
  -> raw calls saved for later cross-file resolution
```
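The recording helpers described above can be approximated like this. The node fields follow the prose; the edge key names (`line`, `weight`), the `L<line>` location format, and the `make_recorders` factory are assumptions for illustration.

```python
def make_recorders(source_file: str):
    """Return (nodes, edges, add_node, add_edge) for one source file."""
    nodes, edges, seen_ids = [], [], set()

    def add_node(node_id, label, line, file_type="code"):
        if node_id in seen_ids:          # keep the first sighting of each ID
            return
        seen_ids.add(node_id)
        nodes.append({
            "id": node_id, "label": label, "file_type": file_type,
            "source_file": source_file, "source_location": f"L{line}",
        })

    def add_edge(source, target, relation, line, confidence="EXTRACTED", weight=1.0):
        edges.append({
            "source": source, "target": target, "relation": relation,
            "confidence": confidence, "source_file": source_file,
            "line": line, "weight": weight,
        })

    return nodes, edges, add_node, add_edge


nodes, edges, add_node, add_edge = make_recorders("pkg/greeter.py")
add_node("pkg_greeter", "greeter.py", line=1)        # every file starts with a file node
add_node("pkg_greeter_greeter", "Greeter", line=3)
add_node("pkg_greeter_greeter", "Greeter", line=9)   # duplicate ID is ignored
add_edge("pkg_greeter", "pkg_greeter_greeter", "contains", line=3)
print(len(nodes), len(edges))
```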

## Lane One: Tree-Sitter Reads Code

The core structural extractor is `_extract_generic(path, config)`. It imports the configured tree-sitter language module, parses file bytes into a syntax tree, and walks the root node. The `LanguageConfig` object tells the generic walker which AST node types count as classes, functions, imports, calls, static properties, helper functions, and language-specific boundaries.

That keeps the design simple: Graphify does not need a separate architecture for every language. Each language config supplies the grammar-specific labels, while the generic walker emits the same node and edge shape.

Sources: [graphify/extract.py:320-361](), [graphify/extract.py:1310-1344](), [graphify/extract.py:1396-1428]()
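The config-driven idea can be shown with a toy tree in place of a real tree-sitter parse. The `LanguageConfig` below has only two fields and `SyntaxNode` is a hypothetical stand-in for a parsed node; the node type strings are the ones tree-sitter's Python grammar uses for class and function definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageConfig:
    class_types: frozenset
    function_types: frozenset


@dataclass
class SyntaxNode:
    type: str
    name: str = ""
    children: list = field(default_factory=list)


def walk(node, config, parent_id, facts, in_class=False):
    """Emit (parent, relation, child) facts using only the config's labels."""
    is_class = node.type in config.class_types
    is_func = node.type in config.function_types
    if is_class or is_func:
        node_id = f"{parent_id}.{node.name}"
        relation = "method" if (is_func and in_class) else "contains"
        facts.append((parent_id, relation, node_id))
        parent_id, in_class = node_id, is_class
    for child in node.children:
        walk(child, config, parent_id, facts, in_class)
    return facts


PYTHON = LanguageConfig(frozenset({"class_definition"}), frozenset({"function_definition"}))
tree = SyntaxNode("module", children=[
    SyntaxNode("class_definition", "Greeter", [
        SyntaxNode("block", children=[SyntaxNode("function_definition", "hello")]),
    ]),
    SyntaxNode("function_definition", "main"),
])
facts = walk(tree, PYTHON, "app", [])
print(facts)
```

Supporting another grammar in this sketch means building a second `LanguageConfig` with that grammar's node type names; `walk` itself does not change.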

### What The AST Lane Emits

The AST lane favors facts that are visible in source text:

| Fact type | Example relation | Confidence |
|---|---|---|
| File owns symbol | `contains`, `method` | `EXTRACTED` |
| Source imports module | `imports`, `imports_from` | `EXTRACTED` |
| Class hierarchy | `inherits`, `extends`, `implements` | usually `EXTRACTED` |
| In-file call | `calls` | `EXTRACTED` when resolved from AST context |
| Cross-file unqualified call | `calls` | `INFERRED` unless import evidence proves it |

Tests assert that structural edges such as `contains`, `method`, `inherits`, `imports`, and `imports_from` stay `EXTRACTED`, and that AST-resolved call edges carry deterministic confidence and weight.

Sources: [tests/test_extract.py:43-50](), [tests/test_extract.py:261-274](), [tests/test_languages.py:99-121]()

## File Support Comes From The Dispatch Table

Graphify chooses a structural extractor by file extension. The `_DISPATCH` table covers many code and text-like formats, including Python, JavaScript/TypeScript, Go, Rust, Java, C/C++, Ruby, C#, Kotlin, Scala, PHP, Swift, Lua, Zig, PowerShell, Elixir, Objective-C, Julia, Fortran, Svelte, Astro, Dart, Verilog, SQL, Markdown, Pascal, shell scripts, and JSON.

`collect_files()` uses the same dispatch keys to discover supported files. It skips noise directories and graphify-ignore patterns, and can optionally follow symlinks with cycle protection.

Sources: [graphify/extract.py:7281-7348](), [graphify/extract.py:7762-7797](), [tests/test_extract.py:212-247]()

## IDs Are Boring On Purpose

Node IDs are normalized so that graph facts remain stable. `_make_id()` joins name parts, normalizes Unicode with NFKC, replaces non-word runs with underscores, collapses duplicate underscores, strips the edges, and case-folds the result. File stems include the parent directory name to reduce collisions when multiple folders contain the same filename.

After all files are extracted, `extract()` remaps absolute file-node IDs to project-relative IDs and relativizes `source_file` paths. That makes graph JSON more portable across machines and checkouts.

Sources: [graphify/extract.py:33-55](), [graphify/extract.py:7599-7619](), [graphify/extract.py:7741-7759](), [tests/test_extract.py:7-20]()
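The normalization steps listed above translate almost directly into code. This is a sketch that follows the prose, not a copy of `_make_id()`; the underscore used to join parts is an assumption.

```python
import re
import unicodedata


def make_id(*parts: str) -> str:
    joined = "_".join(part for part in parts if part)      # join name parts
    normalized = unicodedata.normalize("NFKC", joined)     # fold compatibility forms
    cleaned = re.sub(r"\W+", "_", normalized)              # non-word runs -> "_"
    cleaned = re.sub(r"_+", "_", cleaned).strip("_")       # collapse and trim
    return cleaned.casefold()


print(make_id("src/Utils", "HTTPClient.get"))
```

NFKC matters for inputs such as the `ﬁ` ligature, which would otherwise produce a different ID from the plain letters `fi`.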

## Cross-File Facts Need A Second Look

Some relationships only become clear after every file has contributed its local facts. `extract()` first gathers per-file nodes, edges, and raw calls, then runs post-passes for symbol resolution, stub rewiring, language-specific import resolution, and cross-file call resolution.

The conservative rule is: do not guess when names are ambiguous. For raw calls, Graphify skips member calls like `obj.log()`, skips duplicate candidate labels, avoids self-edges, and only adds a call edge when there is exactly one matching target. Import-backed evidence can promote a call from `INFERRED` to `EXTRACTED`.

Sources: [graphify/extract.py:7589-7624](), [graphify/extract.py:7645-7739](), [graphify/symbol_resolution.py:305-356]()
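The conservative rule can be sketched as a single pass over raw calls. The record fields (`caller`, `callee`, `is_member_call`, `imported`) are hypothetical names chosen for this example, and the function is an illustration rather than the resolver in `extract.py`.

```python
from collections import defaultdict


def resolve_raw_calls(raw_calls: list, nodes: list) -> list:
    by_label = defaultdict(list)
    for node in nodes:
        by_label[node["label"]].append(node["id"])

    edges = []
    for call in raw_calls:
        if call.get("is_member_call"):        # obj.log() style: receiver unknown
            continue
        candidates = by_label.get(call["callee"], [])
        if len(candidates) != 1:              # zero or several matches: do not guess
            continue
        target = candidates[0]
        if target == call["caller"]:          # no self-edges
            continue
        edges.append({
            "source": call["caller"], "target": target, "relation": "calls",
            # Import evidence promotes the edge from INFERRED to EXTRACTED.
            "confidence": "EXTRACTED" if call.get("imported") else "INFERRED",
        })
    return edges


nodes = [
    {"id": "a_run", "label": "run"},
    {"id": "b_load", "label": "load"},
    {"id": "c_log", "label": "log"},
    {"id": "d_log", "label": "log"},
]
raw_calls = [
    {"caller": "a_run", "callee": "load", "imported": True},
    {"caller": "a_run", "callee": "log"},                         # ambiguous: two targets
    {"caller": "a_run", "callee": "load", "is_member_call": True},
    {"caller": "a_run", "callee": "run"},                         # would be a self-edge
]
resolved = resolve_raw_calls(raw_calls, nodes)
print(resolved)
```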

### Python Import-Guided Calls

Python gets an extra deterministic helper. `parse_python_import_aliases()` reads top-level `from module import symbol [as alias]` statements with Python's `ast` module. `resolve_python_import_guided_calls()` then uses those imports to connect raw call records to exactly one target node. These edges are marked `EXTRACTED` because the call is backed by explicit import evidence.

Sources: [graphify/symbol_resolution.py:121-167](), [graphify/symbol_resolution.py:216-302](), [tests/test_symbol_resolution.py:146-157](), [tests/test_symbol_resolution.py:188-242]()

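```python
# graphify/symbol_resolution.py (excerpt; `path`, `symbol_index`, and `imported` come from the surrounding code)
# Collect the file's top-level `from module import symbol [as alias]` bindings.
aliases = parse_python_import_aliases(path)
# Look up the imported symbol; a call edge is added only when exactly one node matches.
target = find_unique_python_symbol(symbol_index, imported)
```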

### Bash Source Edges

Shell scripts also need care. `resolve_bash_source_edges()` treats `source` targets as static-analysis facts, resolving relative paths against the source file's directory for deterministic results. Once a sourced file is known, calls to functions from that file can be emitted as `EXTRACTED` call edges when there is exactly one match.

Sources: [graphify/symbol_resolution.py:378-403](), [graphify/symbol_resolution.py:433-527](), [tests/test_symbol_resolution.py:250-353]()

## Lane Two: Optional Semantic Extraction

Semantic extraction exists for richer context: documents, papers, images, and code relationships that syntax cannot reliably find. The workflow instructions describe extraction as two parts: deterministic AST extraction and semantic extraction. A code-only corpus can skip semantic extraction, because AST already handles code structure.

The direct LLM path in `graphify/llm.py` asks the backend to return the same JSON shape: nodes, edges, hyperedges, token counts, and confidence values. This is what keeps semantic facts compatible with AST facts.

Sources: [graphify/skill-pi.md:161-195](), [graphify/skill-pi.md:238-250](), [graphify/llm.py:133-147]()

```json
{
  "nodes": [
    {
      "id": "stem_entity",
      "label": "Human Readable Name",
      "file_type": "code|document|paper|image|rationale|concept",
      "source_file": "relative/path"
    }
  ],
  "edges": [
    {
      "source": "node_id",
      "target": "node_id",
      "relation": "calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to",
      "confidence": "EXTRACTED|INFERRED|AMBIGUOUS"
    }
  ]
}
```

Note that the current code accepts `concept` in both the validation layer and the direct LLM prompt, while one older skill instruction says not to invent `concept`. Where the two disagree, treat `validate.py` and `llm.py` as the source of truth because they are the active code and schema surfaces.

Sources: [graphify/validate.py:4-8](), [graphify/llm.py:145-146](), [graphify/skill-pi.md:238-248]()

## Provider-Neutral Backends And BYOK

Graphify's direct semantic extractor stays backend-neutral through table-driven dispatch. `BACKENDS` includes Claude, Kimi, Ollama, Gemini, OpenAI, DeepSeek, Bedrock, and `claude-cli`. Each backend declares defaults such as base URL, model, environment key names, pricing, temperature, and token limits. Model overrides come from backend-specific environment variables where configured.

This supports BYOK because users can provide their own API keys through environment variables or direct arguments. It supports BYOC because local or customer-controlled routes exist through Ollama, Bedrock credentials, OpenAI-compatible endpoints, and the Claude CLI path. The architecture should not assume that one hosted model provider is always present.

Sources: [graphify/llm.py:47-118](), [graphify/llm.py:214-249](), [graphify/llm.py:533-587]()

| Backend family | How Graphify routes it | BYOC/BYOK note |
|---|---|---|
| OpenAI-compatible | `_call_openai_compat()` with configured `base_url` | Works for OpenAI, Kimi, Gemini OpenAI-compatible API, DeepSeek, Ollama-style endpoints |
| Anthropic direct | `_call_claude()` | Uses `ANTHROPIC_API_KEY` |
| Local Claude Code | `_call_claude_cli()` | Uses existing local `claude` CLI auth instead of a separate API key |
| AWS Bedrock | `_call_bedrock()` | Uses AWS region/profile credential chain |
| Ollama | OpenAI-compatible client pointed at `OLLAMA_BASE_URL` | Can stay local; warns if endpoint is non-loopback |

Sources: [graphify/llm.py:252-384](), [graphify/llm.py:387-530](), [graphify/llm.py:1056-1088]()
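Table-driven dispatch with BYOK key lookup can be sketched as below. The table is deliberately tiny and hypothetical: the real `BACKENDS` entries carry more fields, and apart from `ANTHROPIC_API_KEY` the environment variable names and route labels are assumptions.

```python
# Hypothetical table; not the real BACKENDS definition.
EXAMPLE_BACKENDS = {
    "claude": {"route": "anthropic", "key_env": "ANTHROPIC_API_KEY"},
    "openai": {"route": "openai_compat", "key_env": "OPENAI_API_KEY"},
    "ollama": {"route": "openai_compat", "key_env": None},  # local, no key needed
}


def select_backend(name: str, env: dict, api_key: str = "") -> dict:
    """Pick a route from the table and confirm a key is available if required."""
    config = EXAMPLE_BACKENDS.get(name)
    if config is None:
        raise ValueError(f"unknown backend: {name}")
    key_env = config["key_env"]
    # A direct argument takes priority over the environment variable.
    key = api_key or (env.get(key_env, "") if key_env else "")
    if key_env and not key:
        raise RuntimeError(f"{name} needs {key_env} or an explicit api_key")
    return {"backend": name, "route": config["route"], "uses_key": bool(key)}


print(select_backend("ollama", env={}))
```

Adding a provider in this style means adding a table row, not a new code path, which is what keeps the caller provider-neutral.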

## Chunking Keeps Semantic Extraction Practical

Semantic extraction reads file contents into a prompt, but it caps each file at `_FILE_CHAR_CAP` and packs files by estimated token cost. When a backend reports context overflow, returns truncated output, or gives a hollow response, Graphify bisects the chunk and retries. That avoids dropping an entire corpus because one request was too large.

Ollama gets special handling because local models can silently truncate or return empty responses under load. Graphify derives `num_ctx` from actual prompt size, warns about risky settings, and defaults Ollama semantic extraction to serial execution unless the user opts into parallelism.

Sources: [graphify/llm.py:150-166](), [graphify/llm.py:590-651](), [graphify/llm.py:683-810](), [graphify/llm.py:887-895](), [tests/test_llm_backends.py:117-176](), [tests/test_llm_backends.py:360-515]()
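The bisect-and-retry behavior can be illustrated with a fake backend. `ChunkTooLarge` stands in for all three failure signals named above, and skipping a single file that still fails is an assumption of this sketch, not a confirmed detail of `llm.py`.

```python
class ChunkTooLarge(Exception):
    """Stand-in for context overflow, truncated output, or a hollow response."""


def extract_with_bisect(files: list, call_backend) -> tuple:
    """Return (results, skipped) after splitting failing chunks in half."""
    if not files:
        return [], []
    try:
        return [call_backend(files)], []
    except ChunkTooLarge:
        if len(files) == 1:
            return [], list(files)            # cannot split further
        mid = len(files) // 2
        left_ok, left_skip = extract_with_bisect(files[:mid], call_backend)
        right_ok, right_skip = extract_with_bisect(files[mid:], call_backend)
        return left_ok + right_ok, left_skip + right_skip


def fake_backend(files, limit=100):
    """Pretend model call that rejects chunks above `limit` characters."""
    if sum(len(text) for _, text in files) > limit:
        raise ChunkTooLarge
    return [name for name, _ in files]


corpus = [("a.md", "x" * 40), ("b.md", "x" * 40), ("c.md", "x" * 40), ("d.md", "x" * 200)]
results, skipped = extract_with_bisect(corpus, fake_backend)
print(results, [name for name, _ in skipped])
```

One oversized file costs a few extra requests, but the other three files still produce results.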

## Validation Is The Shape Guard

`validate_extraction()` is deliberately small. It checks that extraction output is a JSON object, that `nodes` and `edges` exist and are lists, that required fields are present, that file types and confidence labels are allowed, and that edge endpoints point to known node IDs when node IDs are available.

This does not prove the graph is semantically correct. It proves the graph fragment has the shape later stages expect.

Sources: [graphify/validate.py:10-64](), [graphify/validate.py:67-72](), [tests/test_validate.py:15-87]()
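A shape guard of this kind fits in a short function. The allowed file types and confidence labels are taken from the JSON schema shown earlier on this page; the required-field lists, the error wording, and the name `check_shape` are assumptions, so treat this as a sketch rather than `validate_extraction()` itself.

```python
VALID_FILE_TYPES = {"code", "document", "paper", "image", "rationale", "concept"}
VALID_CONFIDENCES = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}
REQUIRED_NODE_FIELDS = ("id", "label", "file_type", "source_file")
REQUIRED_EDGE_FIELDS = ("source", "target", "relation", "confidence")


def check_shape(data) -> list:
    """Return a list of readable problems; an empty list means the shape is fine."""
    if not isinstance(data, dict):
        return ["extraction must be a JSON object"]
    errors = [f"'{key}' must be a list" for key in ("nodes", "edges")
              if not isinstance(data.get(key), list)]
    if errors:
        return errors

    node_ids = set()
    for i, node in enumerate(data["nodes"]):
        errors += [f"node {i} is missing '{f}'" for f in REQUIRED_NODE_FIELDS if f not in node]
        if "file_type" in node and node["file_type"] not in VALID_FILE_TYPES:
            errors.append(f"node {i} has invalid file_type {node['file_type']!r}")
        if "id" in node:
            node_ids.add(node["id"])

    for i, edge in enumerate(data["edges"]):
        errors += [f"edge {i} is missing '{f}'" for f in REQUIRED_EDGE_FIELDS if f not in edge]
        if "confidence" in edge and edge["confidence"] not in VALID_CONFIDENCES:
            errors.append(f"edge {i} has invalid confidence {edge['confidence']!r}")
        if node_ids:  # only check endpoints when node IDs are available
            for end in ("source", "target"):
                if end in edge and edge[end] not in node_ids:
                    errors.append(f"edge {i} {end} {edge[end]!r} is not a known node")
    return errors


good = {"nodes": [{"id": "a", "label": "A", "file_type": "code", "source_file": "a.py"}],
        "edges": []}
bad = {"nodes": good["nodes"],
       "edges": [{"source": "a", "target": "zzz", "relation": "calls", "confidence": "MAYBE"}]}
print(check_shape(good), check_shape(bad))
```

Returning a list of problems instead of raising on the first one makes a bad fragment easier to diagnose in a single pass.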

## How The Lanes Merge

The workflow instructions in `skill-pi.md` describe a simple merge: AST nodes come first, semantic nodes are deduplicated by `id`, semantic edges are appended to AST edges, and semantic hyperedges are preserved. Token counts come from semantic extraction because deterministic AST extraction returns zero token usage.

Sources: [graphify/skill-pi.md:327-358](), [tests/test_extract.py:52-57]()

```text
AST lane                         Semantic lane
tree-sitter facts                model/context facts
nodes + edges + 0 tokens         nodes + edges + hyperedges + token counts
        \                         /
         \                       /
          -> graph-shaped extraction JSON
```
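The merge in the diagram can be written as a small function. This sketch follows the workflow description (AST nodes first, semantic nodes skipped when the `id` already exists); the token field names `input_tokens` and `output_tokens` are assumptions.

```python
def merge_lanes(ast: dict, semantic: dict) -> dict:
    seen = set()
    nodes = []
    for node in ast.get("nodes", []) + semantic.get("nodes", []):
        if node["id"] in seen:     # AST nodes come first, so they win on a clash
            continue
        seen.add(node["id"])
        nodes.append(node)
    return {
        "nodes": nodes,
        "edges": ast.get("edges", []) + semantic.get("edges", []),
        "hyperedges": semantic.get("hyperedges", []),
        # The AST lane spends no model tokens, so counts come from the semantic lane.
        "input_tokens": semantic.get("input_tokens", 0),
        "output_tokens": semantic.get("output_tokens", 0),
    }


ast_result = {
    "nodes": [{"id": "app_main", "label": "main"}],
    "edges": [{"source": "app", "target": "app_main", "relation": "contains"}],
}
semantic_result = {
    "nodes": [{"id": "app_main", "label": "Main entry point"}, {"id": "retry_policy", "label": "Retry policy"}],
    "edges": [{"source": "app_main", "target": "retry_policy", "relation": "references"}],
    "hyperedges": [{"id": "startup_flow", "nodes": ["app_main", "retry_policy"]}],
    "input_tokens": 1200,
    "output_tokens": 300,
}
merged = merge_lanes(ast_result, semantic_result)
print([node["id"] for node in merged["nodes"]])
```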

## Practical Rules For Changing Extraction

When adding or changing extraction behavior, keep these constraints in mind:

| Rule | Why it matters |
|---|---|
| Prefer direct AST evidence for code facts | It is deterministic, cheap, and testable |
| Keep raw ambiguous calls unresolved | Bad cross-file edges create misleading graph hubs |
| Use `EXTRACTED` only when source evidence is visible | Confidence is part of the data contract |
| Preserve relative `source_file` paths | Graph output should be portable across machines |
| Keep backend choices configurable | BYOC/BYOK users may use hosted, local, AWS, CLI, or OpenAI-compatible providers |
| Validate shape before downstream graph assembly | Bad fragments are easier to diagnose early |

Sources: [graphify/extract.py:7645-7739](), [graphify/symbol_resolution.py:310-317](), [graphify/llm.py:1091-1111](), [graphify/validate.py:10-64]()

## Closing Summary

Graphify's extraction stage is a two-lane fact factory: tree-sitter creates deterministic code facts, optional semantic extraction adds richer context through provider-neutral backends, symbol-resolution passes connect facts only when evidence is strong enough, and validation keeps the output shape predictable for the rest of the graph pipeline. The result is portable graph JSON that can work with user-chosen compute and user-owned keys rather than depending on one fixed model provider.

Sources: [graphify/extract.py:7505-7759](), [graphify/llm.py:47-118](), [graphify/validate.py:10-64]()

---

## 05. How Facts Become A Map

> How extracted nodes and edges become a NetworkX graph, get deduplicated, clustered, analyzed, and exported as graph.json, graph.html, GRAPH_REPORT.md, wiki output, and call-flow HTML.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/05-how-facts-become-a-map.md
- Generated: 2026-05-22T20:43:09.330Z

### Source Files

- `graphify/build.py`
- `graphify/dedup.py`
- `graphify/cluster.py`
- `graphify/analyze.py`
- `graphify/report.py`
- `graphify/export.py`
- `graphify/callflow_html.py`
- `graphify/wiki.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/build.py](graphify/build.py)
- [graphify/dedup.py](graphify/dedup.py)
- [graphify/cluster.py](graphify/cluster.py)
- [graphify/analyze.py](graphify/analyze.py)
- [graphify/report.py](graphify/report.py)
- [graphify/export.py](graphify/export.py)
- [graphify/wiki.py](graphify/wiki.py)
- [graphify/callflow_html.py](graphify/callflow_html.py)
- [graphify/__main__.py](graphify/__main__.py)
- [graphify/watch.py](graphify/watch.py)
- [README.md](README.md)
</details>

# How Facts Become A Map

This page explains the part of graphify that starts after facts have been extracted. At that point, graphify has plain dictionaries: nodes, edges, optional hyperedges, and token counts. The job is to turn those facts into a useful map: a NetworkX graph, stable communities, ranked highlights, and files people or agents can read.

Think of it like sorting notes onto a wall. First graphify removes duplicate sticky notes, then draws strings between the remaining notes, then groups nearby notes into neighborhoods, then writes several views of the wall: `graph.json`, `graph.html`, `GRAPH_REPORT.md`, wiki pages, Obsidian notes, and call-flow HTML.

Sources: [graphify/__main__.py:3137-3149](), [graphify/build.py:192-227](), [README.md:34-47]()

## The Short Version

```text
extracted facts
  nodes + edges + hyperedges
        |
        v
deduplicate entities
  exact IDs, exact labels, fuzzy labels, optional LLM tie-breaks
        |
        v
NetworkX graph
  nodes get attributes, valid internal edges are added, direction is preserved
        |
        v
community map + analysis
  cluster(), score_all(), god_nodes(), surprising_connections()
        |
        v
outputs
  graph.json, graph.html, GRAPH_REPORT.md, wiki/, Obsidian vault, call-flow HTML
```

The important design point is that graphify keeps the graph-building path local and data-structure based. Optional LLM behavior appears in semantic extraction and in the dedup tie-breaker, but the build, cluster, analysis, report, and export stages operate on local files and Python data. This keeps the architecture BYOC/BYOK friendly: API keys or local runtimes are inputs to optional model-backed stages, not a requirement for reading, clustering, exporting, or querying an existing graph.

Sources: [graphify/build.py:107-189](), [graphify/dedup.py:129-145](), [README.md:361-361]()

## Step 1: The CLI Merges Extracted Facts

The main extraction command combines AST facts and semantic facts into one `merged` dictionary. AST results are added first, semantic results second, and the comment explains why: when the same node appears in both, semantic node attributes should win because they tend to carry richer labels and document context. Hyperedges come only from the semantic side in this path.

If the user asks for `--no-cluster`, graphify stops early and writes that raw merged dictionary directly to `graphify-out/graph.json`. That mode deliberately skips NetworkX, community detection, and the analysis sidecar. Otherwise, the CLI builds a graph, clusters it, analyzes it, and writes the normal outputs.

Sources: [graphify/__main__.py:3137-3147](), [graphify/__main__.py:3168-3183](), [graphify/__main__.py:3209-3252]()

## Step 2: `build.py` Turns Dictionaries Into a Graph

`build()` accepts one or more extraction dictionaries. It appends all nodes, edges, hyperedges, and token counts into a single combined payload. When deduplication is enabled, it calls `deduplicate_entities()` before constructing the NetworkX graph.

Then `build_from_json()` does the graph work:

- It supports both modern `edges` and legacy NetworkX `links`.
- It normalizes old node fields such as `source` into `source_file`.
- It fills missing `file_type` with `concept` and maps known bad file-type synonyms to valid values.
- It validates the extraction, but treats dangling external or stdlib edges as expected.
- It creates either `nx.Graph()` or `nx.DiGraph()`.
- It adds node attributes by ID.
- It adds only edges whose endpoints resolve to known nodes.
- It stores `_src` and `_tgt` on every edge so later exports can restore original direction even if an undirected graph canonicalizes edge order.

That last point matters. A map can be undirected for clustering while still remembering that `caller -> callee` was the original direction for display and JSON export.

Sources: [graphify/build.py:107-189](), [graphify/build.py:192-227]()

```python
# graphify/build.py
# Remember the original direction before an undirected graph reorders endpoints.
attrs["_src"] = src
attrs["_tgt"] = tgt
G.add_edge(src, tgt, **attrs)
```

Sources: [graphify/build.py:178-185]()
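To show why `_src` and `_tgt` are needed, the sketch below uses a plain dictionary keyed by sorted endpoint pairs in place of a NetworkX graph, so it runs without dependencies. Both function names are hypothetical.

```python
def build_undirected(nodes: list, edges: list) -> dict:
    known = {node["id"] for node in nodes}
    graph = {}
    for edge in edges:
        src, tgt = edge["source"], edge["target"]
        if src not in known or tgt not in known:
            continue                               # dangling external or stdlib edge
        attrs = {k: v for k, v in edge.items() if k not in ("source", "target")}
        attrs["_src"], attrs["_tgt"] = src, tgt    # remember the original direction
        graph[tuple(sorted((src, tgt)))] = attrs   # canonical, order-free key
    return graph


def export_links(graph: dict) -> list:
    """Rebuild directed links from the stored _src/_tgt attributes."""
    links = []
    for attrs in graph.values():
        public = {k: v for k, v in attrs.items() if not k.startswith("_")}
        links.append({"source": attrs["_src"], "target": attrs["_tgt"], **public})
    return links


nodes = [{"id": "z_caller"}, {"id": "a_callee"}]
edges = [
    {"source": "z_caller", "target": "a_callee", "relation": "calls"},
    {"source": "z_caller", "target": "os", "relation": "imports"},
]
graph = build_undirected(nodes, edges)
links = export_links(graph)
print(list(graph), links)
```

The canonical key sorts `a_callee` before `z_caller`, yet the exported link still points from caller to callee.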

## Step 3: Deduplication Picks One Name for the Same Thing

Deduplication is a pipeline, not a single string comparison. The module-level summary names the stages: exact normalization, entropy gate, MinHash/LSH blocking, Jaro-Winkler verification, same-community boost, and union-find merge.

In simpler terms:

| Stage | What it does |
| --- | --- |
| Exact ID pass | Keeps the first node with each ID. |
| Exact normalized label pass | Merges same-label nodes within the same source file. |
| Fuzzy pass | Uses MinHash/LSH to find candidates, then Jaro-Winkler to verify high-similarity labels. |
| Safety gates | Blocks many short-label variants so `M1` and `M1 Pro` do not merge just because they look similar. |
| Optional LLM tie-break | If configured, asks a selected backend only for ambiguous pairs. |
| Edge rewrite | Repoints edges to the surviving node and drops self-loops created by the merge. |

There is also an explicit cross-repository guard. If nodes carry more than one `repo` value, `deduplicate_entities()` raises instead of merging across projects by label similarity. That is important for global or multi-repo use: two repositories can both have a `Config` or `Client`, and graphify should not assume they are the same entity.

Sources: [graphify/dedup.py:1-5](), [graphify/dedup.py:147-155](), [graphify/dedup.py:160-204](), [graphify/dedup.py:257-309]()

The optional LLM tie-breaker is backend-driven. It checks whether the requested backend exists and whether its API key is available before calling the model path, then skips cleanly when those conditions are not met. That preserves provider neutrality: the dedup algorithm does useful local work without a model, and model-backed resolution is an opt-in extension.

Sources: [graphify/dedup.py:324-343](), [graphify/dedup.py:371-416]()
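The final two stages, union-find merge and edge rewrite, can be sketched on their own. The sketch assumes the candidate pairs were already verified by earlier stages; the cross-repository guard follows the prose, while the "smallest ID survives" rule and the function name are assumptions.

```python
def merge_duplicates(nodes: list, edges: list, pairs: list) -> tuple:
    repos = {node["repo"] for node in nodes if node.get("repo")}
    if len(repos) > 1:
        raise ValueError("refusing to merge entities across repositories")

    parent = {node["id"]: node["id"] for node in nodes}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep                # deterministic survivor

    kept = [node for node in nodes if find(node["id"]) == node["id"]]
    rewritten = []
    for edge in edges:
        src, tgt = find(edge["source"]), find(edge["target"])
        if src != tgt:                         # drop self-loops created by the merge
            rewritten.append({**edge, "source": src, "target": tgt})
    return kept, rewritten


nodes = [{"id": "auth_service"}, {"id": "authservice"}, {"id": "db"}]
edges = [
    {"source": "authservice", "target": "db", "relation": "calls"},
    {"source": "auth_service", "target": "authservice", "relation": "references"},
]
kept, rewritten = merge_duplicates(nodes, edges, [("auth_service", "authservice")])
print([node["id"] for node in kept], rewritten)
```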

## Step 4: Clustering Finds Neighborhoods

Once graphify has a graph, `cluster()` groups nodes into communities. It accepts directed or undirected graphs, but converts directed graphs to undirected internally because Leiden and Louvain require undirected input. If the graph has no edges, every node becomes its own community.

The clustering strategy is deterministic where possible. `_partition()` rebuilds a stable graph with sorted nodes and sorted edge rows before running community detection. It tries Leiden through `graspologic` first and falls back to NetworkX Louvain when `graspologic` is not installed. It also uses seed-like parameters where the underlying library supports them.

After the first partition, graphify handles practical graph-shape problems:

- Isolates become single-node communities.
- Optional hub exclusion removes very high-degree nodes during partitioning, then reattaches them by neighbor vote.
- Oversized communities are split with a second pass.
- Large low-cohesion communities may be split again.
- Final community IDs are re-indexed by size, largest first.

Sources: [graphify/cluster.py:22-77](), [graphify/cluster.py:86-183](), [graphify/cluster.py:204-216]()
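The last two clean-up steps, isolates as single-node communities and size-ordered IDs, can be sketched independently of the partitioning library. The tie-break on sorted member names is an assumption added to keep the output deterministic; `finalize_communities` is a hypothetical name.

```python
def finalize_communities(groups: list, all_nodes: list) -> dict:
    assigned = {node for group in groups for node in group}
    final = [sorted(group) for group in groups if group]
    # Nodes the partition did not place become single-node communities.
    final += [[node] for node in sorted(all_nodes) if node not in assigned]
    # Largest community gets ID 0; ties break on member names.
    final.sort(key=lambda group: (-len(group), group))
    return dict(enumerate(final))


partition = [{"b", "a"}, {"c", "d", "e"}]
communities = finalize_communities(partition, ["a", "b", "c", "d", "e", "f"])
print(communities)
```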

## Step 5: Analysis Turns Structure Into Highlights

The analysis stage asks, “What should a human look at first?” It produces several kinds of signals.

`god_nodes()` ranks the most-connected real entities, but filters out file-level hubs, concept nodes, and noisy JSON key nodes. This keeps the list focused on meaningful abstractions instead of mechanical containers.

`surprising_connections()` chooses a strategy based on the corpus. For multi-source graphs, it looks for cross-file edges between real entities and scores them by confidence, file-type crossing, top-level directory crossing, community crossing, semantic similarity, and peripheral-to-hub shape. For single-source graphs, it looks for cross-community bridges, or falls back to edge betweenness when there is no community map.

`suggest_questions()` turns graph signals into review prompts: ambiguous edges, bridge nodes, god nodes with inferred edges, weakly connected nodes, and low-cohesion communities.

Sources: [graphify/analyze.py:85-104](), [graphify/analyze.py:107-136](), [graphify/analyze.py:251-311](), [graphify/analyze.py:314-399](), [graphify/analyze.py:402-520]()
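The `god_nodes()` idea, rank by connectivity but skip mechanical containers, can be sketched as follows. The `kind` field and the category names in `EXCLUDED_KINDS` are hypothetical stand-ins; the real function identifies file-level hubs, concept nodes, and JSON key nodes by its own rules.

```python
from collections import Counter

# Hypothetical category names for the node types the text says are filtered out.
EXCLUDED_KINDS = {"file", "concept", "json_key"}


def rank_god_nodes(nodes, edges, top_n=3):
    """Return the top_n (node id, degree) pairs among non-excluded nodes."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    candidates = [n for n in nodes if n["kind"] not in EXCLUDED_KINDS]
    candidates.sort(key=lambda n: (-degree[n["id"]], n["id"]))
    return [(n["id"], degree[n["id"]]) for n in candidates[:top_n]]


nodes = [
    {"id": "app.py", "kind": "file"},
    {"id": "Parser", "kind": "class"},
    {"id": "Lexer", "kind": "class"},
    {"id": "Token", "kind": "class"},
]
edges = [
    ("app.py", "Parser"), ("app.py", "Lexer"), ("app.py", "Token"),
    ("Parser", "Lexer"), ("Parser", "Token"),
]
print(rank_god_nodes(nodes, edges, top_n=2))  # [('Parser', 3), ('Lexer', 2)]
```

Here `app.py` ties `Parser` on degree but is left out because it is only a container for the things defined in it.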

## Step 6: `graph.json` Preserves the Machine Map

`to_json()` writes the graph in NetworkX node-link form. It attaches each node’s community ID and a normalized label. For links, it fills missing `confidence_score` defaults and restores the true `source` and `target` from `_src` and `_tgt` before writing. It also includes hyperedges from graph metadata and, when available, records the git commit used to build the graph.

There is a safety check before overwriting an existing graph: unless `force=True`, graphify refuses to silently replace a larger existing graph with a smaller one. That protects users from accidental partial rebuilds.

Sources: [graphify/export.py:475-525]()
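The overwrite safety check can be illustrated with a small sketch. It assumes that "smaller" is judged by node count and that refusal takes the form of an exception; the helper name `safe_write_graph` and the `RuntimeError` type are illustrative and are not taken from `to_json()`.

```python
import json
import tempfile
from pathlib import Path


def safe_write_graph(path, graph, force=False):
    """Write graph JSON, refusing to shrink an existing graph unless forced."""
    path = Path(path)
    if path.exists() and not force:
        existing = json.loads(path.read_text(encoding="utf-8"))
        old_count = len(existing.get("nodes", []))
        new_count = len(graph.get("nodes", []))
        if new_count < old_count:
            # Assumption: size is compared by node count, and refusal raises.
            raise RuntimeError(
                f"refusing to replace {old_count} nodes with {new_count}; "
                "pass force=True to overwrite"
            )
    path.write_text(json.dumps(graph), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "graph.json"
    safe_write_graph(target, {"nodes": [{"id": "a"}, {"id": "b"}], "links": []})
    try:
        safe_write_graph(target, {"nodes": [{"id": "a"}], "links": []})
    except RuntimeError as exc:
        print(exc)
    # An explicit force acknowledges the smaller graph is intended.
    safe_write_graph(target, {"nodes": [{"id": "a"}], "links": []}, force=True)
```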

## Step 7: `graph.html` Gives an Interactive View

`to_html()` generates a standalone vis-network HTML page. It sizes nodes by degree or community member count, colors nodes by community, styles edges by confidence, and includes search, node inspection, and community filtering. It restores edge direction from `_src` and `_tgt` for rendered arrows.

For large graphs, the function either raises a size error or, when a node limit is provided, builds an aggregated community-level meta-graph. The default limit is controlled by `MAX_NODES_FOR_VIZ` and can be overridden with `GRAPHIFY_VIZ_NODE_LIMIT`.

Sources: [graphify/export.py:147-164](), [graphify/export.py:615-668](), [graphify/export.py:670-779]()

## Step 8: `GRAPH_REPORT.md` Explains the Map

`report.generate()` writes the human-readable audit trail. It summarizes corpus size, graph size, extraction confidence mix, token cost, and graph freshness. Then it lists community hubs, god nodes, surprising connections, optional hyperedges, communities with cohesion scores, ambiguous edges, knowledge gaps, and suggested questions.

The report is not just a pretty summary. It is also a navigation surface: community hub links are added so `GRAPH_REPORT.md` can point into generated Obsidian community notes instead of becoming a dead end.

Sources: [graphify/report.py:15-84](), [graphify/report.py:86-130](), [graphify/report.py:132-203]()

## Step 9: Wiki Output Makes the Graph Agent-Crawlable

`wiki.py` creates a smaller article-style Markdown wiki. It writes:

| Output | Purpose |
| --- | --- |
| `index.md` | Entry point listing communities and god nodes. |
| `<CommunityName>.md` | One article per community with key concepts, relationships, source files, and confidence breakdown. |
| `<GodNodeLabel>.md` | One article per god node with connections grouped by relation. |

Before writing, `to_wiki()` refuses to run on an empty community map, drops stale node IDs that no longer exist in the graph, and clears old Markdown articles from the wiki directory. That cleanup is intentional because community labels can change across runs, and stale files would otherwise accumulate.

Sources: [graphify/wiki.py:37-102](), [graphify/wiki.py:105-178](), [graphify/wiki.py:181-280]()

## Step 10: Obsidian and Call-Flow HTML Are Specialized Views

The Obsidian exporter writes one Markdown note per node plus one `_COMMUNITY_*.md` overview per community. Node notes include YAML frontmatter, graphify tags, community tags, and wikilinks to neighbors. Community notes include member lists, optional cohesion, cross-community counts, bridge nodes, and an Obsidian graph configuration file for community colors.

Sources: [graphify/export.py:786-897](), [graphify/export.py:898-1014](), [graphify/export.py:1016-1028]()

The call-flow HTML exporter starts from `graph.json`, optional `GRAPH_REPORT.md`, optional `.graphify_labels.json`, and optional sections JSON. It normalizes graph schema variants, derives or loads sections, classifies edges across sections, then writes a self-contained Mermaid-based architecture document. If no graph file exists, it tells the user to run graphify first or pass a graph path.

Sources: [graphify/callflow_html.py:3-18](), [graphify/callflow_html.py:253-293](), [graphify/callflow_html.py:1577-1644](), [graphify/callflow_html.py:1645-1703]()

## How Re-Clustering and Updates Reuse the Same Pipeline

`cluster-only` loads an existing `graph.json`, rebuilds a NetworkX graph with `build_from_json()`, reruns clustering, recomputes cohesion, god nodes, surprising connections, and suggested questions, then rewrites `GRAPH_REPORT.md`, `graph.json`, labels, and optionally `graph.html`.

The watch/update path does similar local work for code changes. It clusters, scores, analyzes, generates a report, writes a temporary graph JSON, compares it to the old graph, backs up protected outputs, replaces `graph.json`, writes `GRAPH_REPORT.md`, and regenerates `graph.html` when appropriate. If call-flow HTML files already exist, watch mode regenerates them too, so call-flow output is opt-in: it is refreshed only when the files are already present.

Sources: [graphify/__main__.py:2159-2202](), [graphify/__main__.py:2204-2221](), [graphify/watch.py:500-554](), [graphify/watch.py:562-585]()

## Practical Mental Model

Graphify’s map is not one file. It is a set of projections over the same underlying graph.

| Artifact | Best for | Source of truth |
| --- | --- | --- |
| `graph.json` | Machines, queries, re-clustering, integrations | NetworkX graph serialized by `to_json()` |
| `graph.html` | Visual exploration in a browser | Same graph plus community colors and confidence styling |
| `GRAPH_REPORT.md` | Human audit and high-level findings | Analysis outputs and community map |
| `wiki/index.md` and articles | Agent-readable navigation | Community and god-node article generation |
| Obsidian vault | Local note graph and wikilinks | Node/community Markdown export |
| `*-callflow.html` | Architecture and call-flow documentation | Existing graph output plus labels/report/sections |

The safest way to reason about the system is: `graph.json` is the durable machine map; `GRAPH_REPORT.md`, `graph.html`, wiki pages, Obsidian notes, and call-flow HTML are views built from that map and its analysis side data.

Sources: [graphify/export.py:475-525](), [graphify/report.py:45-84](), [graphify/wiki.py:181-197](), [graphify/callflow_html.py:1616-1644]()

## Closing Summary

Facts become a map through a local, composable pipeline: merge extracted dictionaries, deduplicate entities, build a NetworkX graph, cluster it into communities, score the structure, and export several views for humans and tools. The code keeps model-backed work optional and backend-selected, while the core graph, analysis, and export stages remain portable across local files, repositories, and catalog-style sources.

Sources: [graphify/__main__.py:3209-3252](), [graphify/export.py:475-525]()

---

## 06. Ask The Map, Keep It Fresh

> The closing page: remember Graphify as a reusable project map, then use query, path, explain, MCP, global graphs, update, and watch flows to keep that map useful after the first build.

- Page Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/pages/06-ask-the-map-keep-it-fresh.md
- Generated: 2026-05-22T20:43:01.559Z

### Source Files

- `graphify/serve.py`
- `graphify/watch.py`
- `graphify/global_graph.py`
- `graphify/affected.py`
- `graphify/prs.py`
- `graphify/security.py`
- `tests/test_query_cli.py`
- `tests/test_watch.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/__main__.py](graphify/__main__.py)
- [graphify/serve.py](graphify/serve.py)
- [graphify/watch.py](graphify/watch.py)
- [graphify/global_graph.py](graphify/global_graph.py)
- [graphify/affected.py](graphify/affected.py)
- [graphify/prs.py](graphify/prs.py)
- [graphify/security.py](graphify/security.py)
- [tests/test_query_cli.py](tests/test_query_cli.py)
- [tests/test_watch.py](tests/test_watch.py)
- [tests/test_path_cli.py](tests/test_path_cli.py)
- [tests/test_explain_cli.py](tests/test_explain_cli.py)
- [tests/test_global_graph.py](tests/test_global_graph.py)
</details>

# Ask The Map, Keep It Fresh

Graphify is easiest to remember as a reusable project map. First it builds `graphify-out/graph.json`; after that, the useful habit is to ask the map before opening the whole codebase, then refresh the map when the code changes.

This page is the closing workflow: use `query`, `path`, `explain`, MCP tools, PR impact, global graphs, `update`, and `watch` as small loops. The goal is simple: keep answers grounded in the graph, but keep the graph cheap and fresh enough that people actually use it.

Source note: this page uses repository code as the source of truth. The requested Compound Engineering profile was applied as page-shaping guidance from the provided bundled snapshot metadata; no local `STRATEGY.md`, `docs/solutions/`, or `graphify-out/` context was present in this checkout.

Sources: [graphify/__main__.py:1365-1444](), [graphify/serve.py:447-569]()

## The Small Mental Model

Think of Graphify like a city map for a codebase. You can ask, "Where is this place?", "How do these two places connect?", "What roads lead into this building?", or "Which neighborhoods changed?" The map is not the city. You still inspect source code before editing. But the map helps you choose the right source files first.

```text
Build once:
  repository files -> graphify-out/graph.json -> report/wiki/views

Use many times:
  query      -> nearby context for a question
  path       -> shortest relationship between two concepts
  explain    -> one node and its direct connections
  MCP        -> the same map exposed to agents
  prs        -> changed files mapped to graph communities
  global     -> many repo maps merged under repo tags

Keep fresh:
  update     -> AST-only code refresh, no LLM needed
  watch      -> automatic code refresh; flag docs/images for semantic update
```

The CLI help makes these workflows explicit: `query`, `path`, `explain`, `affected`, `watch`, `update`, `extract`, and `global` are all first-class commands, not hidden internals. The installed agent instructions also tell assistants to prefer `query`, `path`, and `explain` before broad source browsing when a graph exists.

Sources: [graphify/__main__.py:395-424](), [graphify/__main__.py:1365-1444]()

## Ask The Map First

### Use `query` for "where should I look?"

`graphify query "<question>"` loads a graph JSON file, scores nodes from the query terms, picks seed nodes, and returns a bounded text subgraph. It defaults to `graphify-out/graph.json`, supports `--graph`, supports `--budget`, and can use `--dfs` instead of the default breadth-first traversal.

The important product behavior is that the answer is scoped. Instead of handing an agent the whole repository, Graphify returns matching nodes, source locations, communities, and related edges under an approximate token budget.

Sources: [graphify/__main__.py:1703-1772](), [graphify/serve.py:314-339]()

Example:

```bash
graphify query "who calls extract" --context call
```

The query system can filter by edge context. A user can pass `--context call`, or Graphify can infer a call filter from a question like "who calls extract". The tests check both the explicit and heuristic cases: the `call` edge is kept, while an unrelated `import` edge is excluded.

Sources: [graphify/serve.py:143-185](), [graphify/serve.py:188-202](), [tests/test_query_cli.py:24-51]()
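The query behavior described in this section (score nodes, pick seeds, filter edges by context, and walk outward under a budget) can be sketched as below. This is a simplified illustration under several assumptions: the `context` edge field, the keyword heuristic in `infer_context`, and a budget counted in nodes rather than approximate tokens are all stand-ins for whatever `graphify/serve.py` actually does.

```python
from collections import deque


def infer_context(question):
    """Hypothetical heuristic: questions about callers imply the 'call' context."""
    words = set(question.lower().split())
    return "call" if words & {"call", "calls", "caller", "callers"} else None


def query(nodes, edges, question, context=None, budget=3):
    """nodes: id -> label. Returns up to `budget` node ids, breadth-first from seeds."""
    context = context or infer_context(question)
    terms = set(question.lower().split())

    # Score each node by how many query terms appear in its label.
    scores = {nid: sum(t in label.lower() for t in terms) for nid, label in nodes.items()}
    seeds = sorted((n for n, s in scores.items() if s > 0), key=lambda n: (-scores[n], n))

    # Keep only edges matching the requested or inferred context.
    adjacency = {}
    for edge in edges:
        if context is None or edge["context"] == context:
            adjacency.setdefault(edge["source"], []).append(edge["target"])
            adjacency.setdefault(edge["target"], []).append(edge["source"])

    visited, queue = [], deque(seeds)
    while queue and len(visited) < budget:
        node = queue.popleft()
        if node in visited:
            continue
        visited.append(node)
        queue.extend(sorted(adjacency.get(node, [])))
    return visited


nodes = {"extract": "extract()", "main": "main()", "json": "json"}
edges = [
    {"source": "main", "target": "extract", "context": "call"},
    {"source": "extract", "target": "json", "context": "import"},
]
print(query(nodes, edges, "who calls extract"))  # ['extract', 'main']
```

In this toy graph the `import` edge to `json` is dropped because the question implies a call filter, which mirrors the behavior the tests are described as checking.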

### Use `path` for "how are these connected?"

`graphify path "A" "B"` finds the shortest route between two matched nodes. It uses an undirected view for path finding so the user can ask in either order, but it renders arrows using the stored edge direction. That matters because the order of the question should not change the facts: if A calls B, a path requested from B to A still shows A calling B, never the reverse.

Sources: [graphify/__main__.py:1852-1934](), [tests/test_path_cli.py:36-48]()

Example output shape:

```text
Shortest path (1 hops):
  createPatchHandler() --calls [EXTRACTED]--> validateSanitySession()
```
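The "undirected search, directed rendering" idea behind `path` can be sketched with a plain breadth-first search. The function names and the exact arrow format below are illustrative assumptions; only the technique (search ignoring direction, then look up the stored direction for each hop) comes from the description above.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """edges: (source, relation, target) triples. BFS over an undirected view."""
    neighbors = {}
    for source, _relation, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if goal not in previous:
        return None
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]


def render(edges, path):
    """Draw each hop with the direction stored in the graph, not the walk order."""
    stored = {(s, t): rel for s, rel, t in edges}
    parts = [path[0]]
    for a, b in zip(path, path[1:]):
        if (a, b) in stored:
            parts.append(f"--{stored[(a, b)]}--> {b}")
        else:
            parts.append(f"<--{stored[(b, a)]}-- {b}")
    return " ".join(parts)


edges = [("handler", "calls", "validate"), ("validate", "imports", "session")]
print(render(edges, shortest_path(edges, "handler", "session")))
# handler --calls--> validate --imports--> session
print(render(edges, shortest_path(edges, "session", "handler")))
# session <--imports-- validate <--calls-- handler
```

Both questions find the same route, and in both renderings `handler` is still the caller.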

### Use `explain` for "what is this one thing?"

`graphify explain "X"` prints one matched node with its ID, source file, file type, community, degree, and up to 20 sorted connections. Like `path`, it preserves direction: outgoing neighbors use `-->`, incoming callers use `<--`.

Sources: [graphify/__main__.py:1936-1990](), [tests/test_explain_cli.py:42-56]()

## Ask Through MCP When An Agent Needs Tools

The MCP server exposes the same graph as agent-callable tools. It starts from a graph JSON file, loads it into NetworkX, and registers tools such as `query_graph`, `get_node`, `get_neighbors`, `get_community`, `god_nodes`, `graph_stats`, `shortest_path`, `list_prs`, `get_pr_impact`, and `triage_prs`.

Sources: [graphify/serve.py:395-445](), [graphify/serve.py:447-569]()

The MCP flow stays provider-neutral. The server is a local stdio server over MCP; it reads `graphify-out/graph.json` and returns text/resources. It does not require a specific model provider. Any agent or UI that can speak MCP can use the same map.

The server also hot-reloads when `graph.json` changes. It tracks file `mtime_ns` and size, reloads under a lock, and keeps serving the previous graph if a transient reload fails.

Sources: [graphify/serve.py:410-443]()
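The hot-reload pattern described above can be sketched with the standard library. The class name `GraphCache` and its method names are hypothetical; the sketch only shows the three ingredients the text mentions: an `(mtime_ns, size)` signature, a lock around reloads, and keeping the previous graph when a reload fails.

```python
import json
import os
import tempfile
import threading


class GraphCache:
    """Hypothetical holder that reloads graph JSON when the file changes."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        self._signature = None
        self.graph = None
        self.refresh()

    def refresh(self):
        """Reload if (mtime_ns, size) changed; keep the old graph on failure."""
        with self._lock:
            try:
                stat = os.stat(self.path)
                signature = (stat.st_mtime_ns, stat.st_size)
                if signature != self._signature:
                    with open(self.path, encoding="utf-8") as handle:
                        self.graph = json.load(handle)
                    # Only remember the signature after a successful parse.
                    self._signature = signature
            except (OSError, json.JSONDecodeError):
                pass  # transient failure: keep serving the previous graph
            return self.graph


with tempfile.TemporaryDirectory() as tmp:
    graph_path = os.path.join(tmp, "graph.json")
    with open(graph_path, "w", encoding="utf-8") as handle:
        handle.write('{"nodes": [1]}')
    cache = GraphCache(graph_path)

    with open(graph_path, "w", encoding="utf-8") as handle:
        handle.write("{")  # simulate a half-written file
    print(cache.refresh())  # still the previous graph
```

Because the signature is updated only after a successful parse, the next call retries the reload once the writer has finished.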

## Look Backward With `affected`

`graphify affected "X"` answers a slightly different question: "What depends on this?" It resolves a seed by node ID, exact label, exact source file, or a single substring match. Then it walks incoming edges for impact-style relations such as `calls`, `references`, `imports`, `inherits`, `implements`, `uses`, and `embeds`.

Sources: [graphify/affected.py:11-23](), [graphify/affected.py:46-80](), [graphify/affected.py:86-140]()

Use it when you are about to change a function, class, component, or file and want likely dependents before editing.
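A reverse walk over impact-style relations can be sketched as follows. The relation names come from the list above; the function shape, and the choice to follow dependents transitively with no depth limit, are assumptions rather than a description of `graphify/affected.py`.

```python
from collections import deque

IMPACT_RELATIONS = {
    "calls", "references", "imports", "inherits", "implements", "uses", "embeds",
}


def affected(edges, seed):
    """edges: (source, relation, target). Return nodes that depend on seed."""
    incoming = {}
    for source, relation, target in edges:
        if relation in IMPACT_RELATIONS:
            incoming.setdefault(target, []).append(source)

    seen, queue = set(), deque([seed])
    while queue:
        node = queue.popleft()
        for dependent in incoming.get(node, []):
            if dependent != seed and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


edges = [
    ("api", "calls", "parse"),
    ("cli", "calls", "api"),
    ("parse", "calls", "tokenize"),   # outgoing from parse: not a dependent
    ("docs", "mentions", "parse"),    # not an impact relation
]
print(affected(edges, "parse"))  # ['api', 'cli']
```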

## Use PR Impact To Review The Right Changes First

`graphify prs` combines GitHub PR metadata with graph impact when needed. Its data model includes CI status, review decision, staleness, worktree path, changed files, touched communities, and affected node count. The "blast radius" string is built from node and community counts.

Sources: [graphify/prs.py:57-90](), [graphify/prs.py:94-113]()

The graph-native impact calculation builds a file-to-community index from graph nodes, matches changed PR files to `source_file`, and returns `(communities_touched, nodes_affected)`. The CLI only fetches impact for deeper flows such as a PR detail, triage, or conflict detection, rather than every dashboard render.

Sources: [graphify/prs.py:243-273](), [graphify/prs.py:668-748]()
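The file-to-community lookup can be sketched with a dictionary index. The `source_file` field and the `(communities_touched, nodes_affected)` return shape follow the description above; the `community` field name and the concrete return types are assumptions.

```python
def pr_impact(nodes, changed_files):
    """Return (communities_touched, nodes_affected) for a set of changed files."""
    # Index graph nodes by the file they were extracted from.
    by_file = {}
    for node in nodes:
        by_file.setdefault(node["source_file"], []).append(node)

    communities, nodes_affected = set(), 0
    for path in set(changed_files):
        for node in by_file.get(path, []):
            communities.add(node["community"])
            nodes_affected += 1
    return sorted(communities), nodes_affected


nodes = [
    {"id": "login", "source_file": "src/auth.py", "community": 2},
    {"id": "logout", "source_file": "src/auth.py", "community": 2},
    {"id": "connect", "source_file": "src/db.py", "community": 5},
    {"id": "render", "source_file": "src/ui.py", "community": 1},
]
print(pr_impact(nodes, ["src/auth.py", "src/db.py", "README.md"]))  # ([2, 5], 3)
```

Files that have no nodes in the graph, such as `README.md` here, simply contribute nothing to the counts.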

## Keep More Than One Repo On The Map

A local project graph is useful. A global graph is useful when work crosses repositories.

`graphify global add <graph.json> --as <tag>` stores graphs under `~/.graphify/global-graph.json` and tracks source hashes in `~/.graphify/global-manifest.json`. When a graph has not changed, `global_add` skips the merge. When it has changed, Graphify prefixes repo-local node IDs, prunes stale nodes for that repo tag, and merges the new graph.

Sources: [graphify/global_graph.py:10-27](), [graphify/global_graph.py:58-133](), [graphify/__main__.py:2715-2769]()

Tests verify the design point: two repositories can both have a node like `userservice`, and the global graph keeps `repoA::userservice` separate from `repoB::userservice` instead of silently merging them.

Sources: [tests/test_global_graph.py:40-67](), [tests/test_global_graph.py:129-150]()
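The skip, prune, and prefix steps can be sketched over in-memory dictionaries. The `tag::id` naming matches the test description above; the SHA-256 hash of the serialized graph, the `repo` field used for pruning, and the `links` key are assumptions made for the sake of a runnable example.

```python
import hashlib
import json


def global_add(global_graph, manifest, repo_graph, tag):
    """Merge repo_graph under tag; return False when the source is unchanged."""
    digest = hashlib.sha256(
        json.dumps(repo_graph, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if manifest.get(tag) == digest:
        return False  # unchanged graph: skip the merge

    # Prune everything previously stored for this repo tag.
    global_graph["nodes"] = [n for n in global_graph["nodes"] if n["repo"] != tag]
    global_graph["links"] = [e for e in global_graph["links"] if e["repo"] != tag]

    # Re-add nodes and edges with repo-prefixed IDs.
    for node in repo_graph["nodes"]:
        global_graph["nodes"].append({**node, "id": f"{tag}::{node['id']}", "repo": tag})
    for edge in repo_graph["links"]:
        global_graph["links"].append({
            **edge,
            "source": f"{tag}::{edge['source']}",
            "target": f"{tag}::{edge['target']}",
            "repo": tag,
        })
    manifest[tag] = digest
    return True


merged = {"nodes": [], "links": []}
manifest = {}
repo = {"nodes": [{"id": "userservice"}], "links": []}
global_add(merged, manifest, repo, "repoA")
global_add(merged, manifest, repo, "repoB")
print(sorted(n["id"] for n in merged["nodes"]))
# ['repoA::userservice', 'repoB::userservice']
```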

## Keep The Map Fresh

### Use `update` after code edits

`graphify update .` is the everyday refresh command after code changes. It calls `_rebuild_code`, prints that code files are being re-extracted, blocks on the per-repo lock for an explicit user-run update, and reports that doc, paper, or image changes still need the fuller semantic update flow.

Sources: [graphify/__main__.py:2223-2273]()

The rebuild path is AST-only for code. It detects code files, extracts them, preserves previous semantic nodes and edges where possible, evicts changed or deleted sources, rebuilds the graph, reclusters when topology changes, and writes `graph.json`, `GRAPH_REPORT.md`, labels, and optionally `graph.html`.

Sources: [graphify/watch.py:274-305](), [graphify/watch.py:336-424](), [graphify/watch.py:478-604]()

### Use `watch` when the map should follow you

`graphify watch <path>` listens for file changes with `watchdog`. Code changes trigger a rebuild after debounce. Doc, paper, and image changes write `graphify-out/needs_update` and tell the user to run the semantic update flow, because those changes may require LLM-backed extraction.

Sources: [graphify/watch.py:641-657](), [graphify/watch.py:701-719]()

The watcher filters noise before doing work. It loads `.graphifyignore` once at startup, skips ignored paths, skips unsupported extensions, skips dot paths, and skips files under the Graphify output directory. Tests cover ignored `node_modules` and `build` writes, and verify that the ignore file is not reparsed for every event.

Sources: [graphify/watch.py:663-690](), [tests/test_watch.py:225-260](), [tests/test_watch.py:263-296]()
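The event filter can be sketched as a single predicate. This is a simplified stand-in: the extension set is an illustrative subset, and matching each path component against ignore patterns with `fnmatch` is an assumption, since the real `.graphifyignore` matching rules are not described here.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

SUPPORTED_EXTENSIONS = {".py", ".js", ".md"}  # illustrative subset, not the real list
OUTPUT_DIR = "graphify-out"


def should_process(rel_path, ignore_patterns):
    """Return True when a changed file deserves a rebuild or update flag."""
    path = PurePosixPath(rel_path)
    parts = path.parts
    if OUTPUT_DIR in parts:
        return False  # never react to Graphify's own output
    if any(part.startswith(".") for part in parts):
        return False  # dot paths such as .git or .venv
    if path.suffix not in SUPPORTED_EXTENSIONS:
        return False
    # Patterns are loaded once by the caller and reused for every event.
    if any(fnmatch(part, pattern) for part in parts for pattern in ignore_patterns):
        return False
    return True


patterns = ["node_modules", "build"]
for candidate in ["src/app.py", "node_modules/lib/index.js", "build/out.py",
                  ".git/hooks/pre-commit.py", "graphify-out/graph.json", "data.bin"]:
    print(candidate, should_process(candidate, patterns))
```

Passing the parsed patterns in as an argument reflects the design point that the ignore file is read once at startup rather than on every event.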

### Use `check-update` for a lightweight reminder

`graphify check-update <path>` checks whether `graphify-out/needs_update` exists. It prints a reminder but does not clear the flag. Tests confirm the flag is created idempotently by `_notify_only`, and `check_update` leaves it in place.

Sources: [graphify/watch.py:611-634](), [tests/test_watch.py:13-29](), [tests/test_watch.py:65-84]()
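The flag-file behavior can be sketched in a few lines. The flag path `graphify-out/needs_update` comes from the text; the function bodies, the use of an empty file, and the boolean return value are assumptions.

```python
import tempfile
from pathlib import Path

FLAG = Path("graphify-out") / "needs_update"


def notify_only(root):
    """Create the flag file; calling this twice leaves the same single flag."""
    flag = Path(root) / FLAG
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return flag


def check_update(root):
    """Report whether a semantic update is pending, without clearing the flag."""
    pending = (Path(root) / FLAG).exists()
    if pending:
        print("Docs or images changed: run the semantic update flow.")
    return pending


with tempfile.TemporaryDirectory() as tmp:
    notify_only(tmp)
    notify_only(tmp)           # idempotent
    check_update(tmp)          # prints the reminder
    print(check_update(tmp))   # flag is still there
```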

## Safety Rails

Graphify treats graph files and corpus-derived labels as untrusted inputs. Query-like commands reject oversized `graph.json` files before parsing, using a 512 MiB cap. The query tests explicitly patch the cap lower and assert that the CLI fails before loading the graph.

Sources: [graphify/security.py:21-25](), [graphify/security.py:239-260](), [tests/test_query_cli.py:54-70]()
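The size check before parsing can be sketched as below. The 512 MiB figure comes from the text; the constant name, function name, and `ValueError` type are illustrative assumptions. Making the cap a parameter is what lets a test lower it, much like the patched cap in the query tests.

```python
import json
import os
import tempfile

MAX_GRAPH_BYTES = 512 * 1024 * 1024  # 512 MiB, per the text above


def load_graph(path, max_bytes=MAX_GRAPH_BYTES):
    """Reject oversized files using the file size alone, then parse."""
    size = os.path.getsize(path)
    if size > max_bytes:
        raise ValueError(f"{path} is {size} bytes, over the {max_bytes} byte cap")
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    graph_path = os.path.join(tmp, "graph.json")
    with open(graph_path, "w", encoding="utf-8") as handle:
        json.dump({"nodes": [], "links": []}, handle)
    print(load_graph(graph_path))
    try:
        load_graph(graph_path, max_bytes=4)  # lowered cap, as a test would do
    except ValueError as exc:
        print("rejected:", exc)
```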

MCP output sanitizes node labels, source files, locations, communities, relations, and confidence strings before concatenating them into tool output. That keeps corpus-controlled text from becoming terminal control sequences or prompt-injection-shaped context.

Sources: [graphify/serve.py:261-311]()
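Output sanitizing can be sketched with two regular expressions. The rules here (drop ANSI escape sequences, replace other control characters with spaces, truncate to a hypothetical `max_len`) are one reasonable reading of the paragraph above, not the exact rules in `graphify/serve.py`.

```python
import re

ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[ -/]*[@-~]")
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def sanitize_field(value, max_len=200):
    """Make corpus-controlled text safe to concatenate into tool output."""
    text = ANSI_ESCAPE.sub("", str(value))   # remove terminal color/cursor codes
    text = CONTROL_CHARS.sub(" ", text)      # newlines, tabs, and other controls
    return text[:max_len]                    # max_len is a hypothetical limit


label = "Parser\x1b[31m\nIGNORE PREVIOUS INSTRUCTIONS"
print(sanitize_field(label))  # Parser IGNORE PREVIOUS INSTRUCTIONS
```

Flattening newlines keeps a hostile label on one line, so it reads as data inside the tool output rather than as a new instruction block.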

For URL ingestion, Graphify only allows HTTP and HTTPS, blocks private/reserved/link-local/cloud metadata targets, revalidates DNS during fetch, and blocks redirect-based SSRF paths.

Sources: [graphify/security.py:17-25](), [graphify/security.py:42-86](), [graphify/security.py:89-125]()
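The address checks can be sketched with the standard `ipaddress` and `urllib.parse` modules. The function name and the idea of passing the resolved IP as an argument are assumptions that keep the example free of network access; a real fetcher would resolve the host itself and repeat this check for every redirect hop.

```python
import ipaddress
from urllib.parse import urlsplit


def is_allowed_url(url, resolved_ip):
    """Allow only http(s) URLs whose resolved address is publicly routable."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"} or not parts.hostname:
        return False
    ip = ipaddress.ip_address(resolved_ip)
    blocked = (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )
    return not blocked


print(is_allowed_url("https://example.com/doc", "8.8.8.8"))          # True
print(is_allowed_url("http://metadata.internal", "169.254.169.254"))  # False
print(is_allowed_url("file:///etc/passwd", "8.8.8.8"))               # False
```

Checking the resolved address instead of the hostname is what defeats a public-looking name that points at a private or cloud metadata address.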

## Provider-Neutral Integration Guidance

A Grok-Wiki or agent UI should treat Graphify as a portable file-backed map, not as a model-provider feature.

| Flow | Portable design |
| --- | --- |
| Ask | Read `graphify-out/graph.json`, then call CLI or MCP query tools. |
| Navigate | Prefer `graphify-out/wiki/index.md` when present, then source files. |
| Explain | Use `explain`, `get_node`, and `get_neighbors` for focused concepts. |
| Relate | Use `path` or MCP `shortest_path` for relationship questions. |
| Refresh | Run `graphify update .` after code edits; use `watch` for continuous refresh. |
| Cross-repo | Add local graphs into the global graph under explicit repo tags. |
| Skill sources | Load skills from file, repository, or catalog sources; do not bind the workflow to one hosted model or connector. |

This is BYOC/BYOK friendly because the stable interface is local files plus CLI/MCP. Semantic extraction can be supplied by whichever backend the deployment chooses, while AST-only `update` and watch rebuilds do not require an LLM call.

Sources: [graphify/__main__.py:1427-1438](), [graphify/__main__.py:2223-2273](), [graphify/serve.py:395-405]()

## Closing Habit

The durable habit is: ask the map, inspect the cited source, change the code, then refresh the map. `query`, `path`, `explain`, MCP tools, PR impact, and global graphs make the map reusable; `update`, `watch`, and `check-update` keep it current after the first build.

Sources: [graphify/__main__.py:395-424](), [graphify/watch.py:592-604]()

---