# How Files Become Tiny Facts

> The extraction stage: tree-sitter reads code, optional semantic extraction reads richer context, validation keeps the shape consistent, and provider-neutral backends support BYOC and BYOK choices.

- Repository: safishamsi/graphify
- GitHub: https://github.com/safishamsi/graphify
- Human wiki: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d
- Complete Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/llms-full.txt

## Source Files

- `graphify/extract.py`
- `graphify/symbol_resolution.py`
- `graphify/llm.py`
- `graphify/validate.py`
- `tests/test_extract.py`
- `tests/test_languages.py`
- `tests/test_symbol_resolution.py`
- `tests/test_llm_backends.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/extract.py](graphify/extract.py)
- [graphify/symbol_resolution.py](graphify/symbol_resolution.py)
- [graphify/llm.py](graphify/llm.py)
- [graphify/validate.py](graphify/validate.py)
- [graphify/skill-pi.md](graphify/skill-pi.md)
- [tests/test_extract.py](tests/test_extract.py)
- [tests/test_languages.py](tests/test_languages.py)
- [tests/test_symbol_resolution.py](tests/test_symbol_resolution.py)
- [tests/test_llm_backends.py](tests/test_llm_backends.py)
- [tests/test_validate.py](tests/test_validate.py)
</details>

Graphify turns a repository into small, linkable facts: "this file contains this class," "this function calls that function," "this document mentions this concept," and "this chunk came from this source file." Those facts become nodes and edges that later graph-building steps can index, query, and visualize.

The extraction stage has two lanes. The first lane is deterministic: tree-sitter reads code syntax and emits structural facts without spending model tokens. The second lane is optional semantic extraction: an AI backend can read richer context from documents, papers, images, or code relationships that syntax alone cannot see. The key design decision is that both lanes produce the same graph-shaped JSON, which lets Graphify stay provider-neutral and support bring-your-own-compute (BYOC) and bring-your-own-key (BYOK) setups.

Sources: [graphify/extract.py:1310-1395](), [graphify/llm.py:133-147](), [graphify/skill-pi.md:157-195]()

## The Smallest Mental Model

Think of extraction like sorting a box of index cards.

- A **node** is one card: a file, class, function, concept, document, or image-derived item.
- An **edge** is a string between cards: imports, contains, calls, references, implements, cites, or similar.
- A **confidence** says how the fact was found: `EXTRACTED` for directly visible evidence, `INFERRED` for a reasonable link, and `AMBIGUOUS` when the extractor keeps uncertainty visible instead of pretending.

The deterministic code extractor starts each file with a file node, then adds child nodes and edges as the AST walker sees syntax. `add_node` records `id`, `label`, `file_type`, `source_file`, and `source_location`; `add_edge` records endpoints, relation, confidence, source file, line, and weight.

Sources: [graphify/extract.py:1357-1382](), [graphify/validate.py:4-8]()

```text
source file
  -> file node
       -> class/function nodes
       -> import/call/inheritance/reference edges
  -> raw calls saved for later cross-file resolution
```
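The record shapes described above can be sketched as follows. This is a minimal illustration, not the code in `graphify/extract.py`: the `new_fragment` helper, the helper signatures, the `L<line>` location format, and the edge key names for line and weight are assumptions; only the field list comes from the prose.

```python
# Hypothetical sketch of the node and edge records; names and formats are assumed.
def new_fragment(source_file):
    """Start an empty per-file fragment that remembers its source path."""
    return {"nodes": [], "edges": [], "source_file": source_file}


def add_node(fragment, node_id, label, line, file_type="code"):
    """Record one node with the five fields named in the text."""
    fragment["nodes"].append({
        "id": node_id,
        "label": label,
        "file_type": file_type,
        "source_file": fragment["source_file"],
        "source_location": f"L{line}",  # assumed location format
    })


def add_edge(fragment, source, target, relation, line,
             confidence="EXTRACTED", weight=1.0):
    """Record one edge: endpoints, relation, confidence, provenance, weight."""
    fragment["edges"].append({
        "source": source,
        "target": target,
        "relation": relation,
        "confidence": confidence,
        "source_file": fragment["source_file"],
        "source_location": f"L{line}",  # assumed key name for the line
        "weight": weight,
    })


fragment = new_fragment("pkg/app.py")
add_node(fragment, "pkg_app", "app.py", line=1)
add_node(fragment, "pkg_app_main", "main()", line=3)
add_edge(fragment, "pkg_app", "pkg_app_main", "contains", line=3)
```

Every fact carries its own provenance, so a later stage can trace an edge back to a file and line without consulting the extractor again.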

## Lane One: Tree-Sitter Reads Code

The core structural extractor is `_extract_generic(path, config)`. It imports the configured tree-sitter language module, parses file bytes into a syntax tree, and walks the root node. The `LanguageConfig` object tells the generic walker which AST node types count as classes, functions, imports, calls, static properties, helper functions, and language-specific boundaries.

That keeps the design simple: Graphify does not need a separate architecture for every language. Each language config supplies the grammar-specific labels, while the generic walker emits the same node and edge shape.

Sources: [graphify/extract.py:320-361](), [graphify/extract.py:1310-1344](), [graphify/extract.py:1396-1428]()
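To make the config-driven idea concrete, here is a small sketch that avoids a tree-sitter dependency by walking a toy tree. `ToyLanguageConfig`, `ToyNode`, and `walk` are hypothetical names, and the real `LanguageConfig` has more fields than the two shown; the node type strings are the ones the Python and JavaScript tree-sitter grammars use for these constructs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyLanguageConfig:
    """Hypothetical, reduced stand-in for a per-language config."""
    class_types: frozenset
    function_types: frozenset


@dataclass
class ToyNode:
    """Stand-in for a syntax-tree node: a type, an optional name, children."""
    type: str
    name: str = ""
    children: list = field(default_factory=list)


PYTHON = ToyLanguageConfig(
    class_types=frozenset({"class_definition"}),
    function_types=frozenset({"function_definition"}),
)
JAVASCRIPT = ToyLanguageConfig(
    class_types=frozenset({"class_declaration"}),
    function_types=frozenset({"function_declaration", "method_definition"}),
)


def walk(node, config, owner, owner_is_class, facts):
    """One walker for every language: only the config decides what counts."""
    for child in node.children:
        if child.type in config.class_types:
            facts.append((owner, "contains", child.name))
            walk(child, config, child.name, True, facts)
        elif child.type in config.function_types:
            relation = "method" if owner_is_class else "contains"
            facts.append((owner, relation, child.name))
            walk(child, config, child.name, False, facts)
        else:
            # Wrapper nodes such as blocks keep the current owner.
            walk(child, config, owner, owner_is_class, facts)
    return facts


tree = ToyNode("module", children=[
    ToyNode("class_definition", "Greeter", [
        ToyNode("block", children=[ToyNode("function_definition", "hello")]),
    ]),
    ToyNode("function_definition", "main"),
])
facts = walk(tree, PYTHON, "greeter_py", False, [])
```

Swapping `PYTHON` for `JAVASCRIPT` changes which node types are recognized without touching `walk`, which is the property the generic extractor relies on.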

### What The AST Lane Emits

The AST lane favors facts that are visible in source text:

| Fact type | Example relation | Confidence |
|---|---|---|
| File owns symbol | `contains`, `method` | `EXTRACTED` |
| Source imports module | `imports`, `imports_from` | `EXTRACTED` |
| Class hierarchy | `inherits`, `extends`, `implements` | usually `EXTRACTED` |
| In-file call | `calls` | `EXTRACTED` when resolved from AST context |
| Cross-file unqualified call | `calls` | `INFERRED` unless import evidence proves it |

Tests assert that structural edges such as `contains`, `method`, `inherits`, `imports`, and `imports_from` stay `EXTRACTED`, and that AST-resolved call edges carry deterministic confidence and weight.

Sources: [tests/test_extract.py:43-50](), [tests/test_extract.py:261-274](), [tests/test_languages.py:99-121]()

## File Support Comes From The Dispatch Table

Graphify chooses a structural extractor by file extension. The `_DISPATCH` table covers many code and text-like formats, including Python, JavaScript/TypeScript, Go, Rust, Java, C/C++, Ruby, C#, Kotlin, Scala, PHP, Swift, Lua, Zig, PowerShell, Elixir, Objective-C, Julia, Fortran, Svelte, Astro, Dart, Verilog, SQL, Markdown, Pascal, shell scripts, and JSON.

`collect_files()` uses the same dispatch keys to discover supported files. It skips noise directories and graphify-ignore patterns, and can optionally follow symlinks with cycle protection.

Sources: [graphify/extract.py:7281-7348](), [graphify/extract.py:7762-7797](), [tests/test_extract.py:212-247]()
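The relationship between the dispatch table and file discovery can be sketched as below. The table contents, the skip list, and every function name here are hypothetical; the real `_DISPATCH` and `collect_files()` also handle ignore patterns and symlinks, which this sketch leaves out.

```python
import tempfile
from pathlib import Path


def extract_python_stub(path):
    """Placeholder extractor; a real one would return nodes and edges."""
    return {"language": "python", "file": path.name}


def extract_shell_stub(path):
    return {"language": "shell", "file": path.name}


# Hypothetical two-entry table keyed by file extension.
TOY_DISPATCH = {".py": extract_python_stub, ".sh": extract_shell_stub}
TOY_SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def collect_supported(root):
    """Discover files whose extension has an extractor, skipping noise dirs."""
    found = []
    for path in sorted(root.rglob("*")):
        parent_parts = path.relative_to(root).parts[:-1]
        if not path.is_file() or any(p in TOY_SKIP_DIRS for p in parent_parts):
            continue
        if path.suffix.lower() in TOY_DISPATCH:
            found.append(path)
    return found


def extract_all(root):
    """Route each discovered file to the extractor registered for its suffix."""
    return [TOY_DISPATCH[p.suffix.lower()](p) for p in collect_supported(root)]


def demo():
    """Build a throwaway tree and return the names of the extracted files."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.py").write_text("x = 1\n")
        (root / "run.sh").write_text("echo hi\n")
        (root / "notes.xyz").write_text("unsupported\n")
        (root / "node_modules").mkdir()
        (root / "node_modules" / "b.py").write_text("y = 2\n")
        return [fact["file"] for fact in extract_all(root)]
```

Because discovery and extraction read the same keys, adding an entry to the table is enough to make a new extension both discoverable and extractable.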

## IDs Are Boring On Purpose

Node IDs are normalized so that graph facts remain stable. `_make_id()` joins name parts, normalizes Unicode with NFKC, replaces non-word runs with underscores, collapses duplicate underscores, strips leading and trailing underscores, and case-folds the result. File stems include the parent directory name to reduce collisions when multiple folders contain the same filename.

After all files are extracted, `extract()` remaps absolute file-node IDs to project-relative IDs and relativizes `source_file` paths. That makes graph JSON more portable across machines and checkouts.

Sources: [graphify/extract.py:33-55](), [graphify/extract.py:7599-7619](), [graphify/extract.py:7741-7759](), [tests/test_extract.py:7-20]()
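The normalization steps listed above map directly onto a few standard-library calls. The sketch below follows that order; the function name `make_id_sketch` and the underscore join separator are assumptions, and the real `_make_id()` may differ in details.

```python
import re
import unicodedata


def make_id_sketch(*parts):
    """Join name parts and normalize them into a stable identifier."""
    joined = "_".join(part for part in parts if part)  # assumed separator
    normalized = unicodedata.normalize("NFKC", joined)  # fold compatibility forms
    underscored = re.sub(r"\W+", "_", normalized)       # non-word runs -> "_"
    collapsed = re.sub(r"_+", "_", underscored)         # "__" -> "_"
    return collapsed.strip("_").casefold()


method_id = make_id_sketch("src", "My-Class.method")
dunder_id = make_id_sketch("Utils", "__init__")
fullwidth_id = make_id_sketch("ＡＢＣ")  # full-width letters
```

The separate collapse step matters because `\W` does not match underscores, so names such as `__init__` would otherwise keep their doubled underscores.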

## Cross-File Facts Need A Second Look

Some relationships only become clear after every file has contributed its local facts. `extract()` first gathers per-file nodes, edges, and raw calls, then runs post-passes for symbol resolution, stub rewiring, language-specific import resolution, and cross-file call resolution.

The conservative rule is: do not guess when names are ambiguous. For raw calls, Graphify skips member calls like `obj.log()`, skips duplicate candidate labels, avoids self-edges, and only adds a call edge when there is exactly one matching target. Import-backed evidence can promote a call from `INFERRED` to `EXTRACTED`.

Sources: [graphify/extract.py:7589-7624](), [graphify/extract.py:7645-7739](), [graphify/symbol_resolution.py:305-356]()
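The conservative rule can be illustrated with a short sketch. The record keys (`caller_id`, `callee`, `is_member_call`) and the function name are hypothetical; the point is only the "exactly one candidate, otherwise no edge" logic described above.

```python
from collections import defaultdict


def resolve_raw_calls_sketch(nodes, raw_calls):
    """Emit a call edge only when a callee name maps to exactly one node."""
    ids_by_label = defaultdict(list)
    for node in nodes:
        ids_by_label[node["label"]].append(node["id"])

    edges = []
    for call in raw_calls:
        if call.get("is_member_call"):
            continue  # obj.log() style calls are too ambiguous to resolve
        candidates = ids_by_label.get(call["callee"], [])
        if len(candidates) != 1:
            continue  # zero or several matches: do not guess
        target = candidates[0]
        if target == call["caller_id"]:
            continue  # no self-edges
        edges.append({
            "source": call["caller_id"],
            "target": target,
            "relation": "calls",
            "confidence": "INFERRED",  # no import evidence in this sketch
        })
    return edges


nodes = [
    {"id": "a_run", "label": "run"},
    {"id": "b_helper", "label": "helper"},
    {"id": "c_log", "label": "log"},
    {"id": "d_log", "label": "log"},
]
raw_calls = [
    {"caller_id": "a_run", "callee": "helper"},                       # unique
    {"caller_id": "a_run", "callee": "log"},                          # duplicate label
    {"caller_id": "a_run", "callee": "run"},                          # self-call
    {"caller_id": "a_run", "callee": "helper", "is_member_call": True},
]
resolved = resolve_raw_calls_sketch(nodes, raw_calls)
```

Only the first raw call produces an edge; the other three are dropped rather than guessed.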

### Python Import-Guided Calls

Python gets an extra deterministic helper. `parse_python_import_aliases()` reads top-level `from module import symbol [as alias]` statements with Python's `ast` module. `resolve_python_import_guided_calls()` then uses those imports to connect raw call records to exactly one target node. These edges are marked `EXTRACTED` because the call is backed by explicit import evidence.

Sources: [graphify/symbol_resolution.py:121-167](), [graphify/symbol_resolution.py:216-302](), [tests/test_symbol_resolution.py:146-157](), [tests/test_symbol_resolution.py:188-242]()

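```python
# graphify/symbol_resolution.py (abridged; the surrounding loop is omitted)
# Read the file's top-level `from module import symbol [as alias]` statements.
aliases = parse_python_import_aliases(path)
# `imported` stands for one imported symbol taken from `aliases`.
# A call edge is added only when exactly one node matches it.
target = find_unique_python_symbol(symbol_index, imported)
```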
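As a rough illustration of the alias-parsing step, the sketch below uses Python's `ast` module to read top-level `from ... import ...` statements. It is not the project's implementation: the function name, the choice to take source text instead of a path, the return shape, and the decision to skip relative and star imports are all assumptions.

```python
import ast


def parse_import_aliases_sketch(source):
    """Map each local name to the (module, symbol) it was imported from."""
    aliases = {}
    for stmt in ast.parse(source).body:  # module body only: top-level statements
        if not isinstance(stmt, ast.ImportFrom):
            continue
        if stmt.level != 0 or not stmt.module:
            continue  # assumption: relative imports are left unresolved
        for alias in stmt.names:
            if alias.name == "*":
                continue  # star imports name no specific symbol
            aliases[alias.asname or alias.name] = (stmt.module, alias.name)
    return aliases


example = (
    "from pkg.util import helper as h\n"
    "from pkg.io import load\n"
    "import os\n"
    "\n"
    "def f():\n"
    "    from inner import hidden\n"
)
aliases = parse_import_aliases_sketch(example)
```

A call to `h()` in this file can then be tied to `pkg.util.helper` with explicit evidence, which is why such edges can be marked `EXTRACTED` instead of `INFERRED`.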

### Bash Source Edges

Shell scripts also need care. `resolve_bash_source_edges()` treats `source` targets as static-analysis facts, resolving relative paths against the source file's directory for deterministic results. Once a sourced file is known, calls to functions from that file can be emitted as `EXTRACTED` call edges when there is exactly one match.

Sources: [graphify/symbol_resolution.py:378-403](), [graphify/symbol_resolution.py:433-527](), [tests/test_symbol_resolution.py:250-353]()
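Resolving `source` targets against the script's own directory might look like the sketch below. The regular expression, the function name, and the rule for skipping paths that contain variables are assumptions made for illustration; `resolve_bash_source_edges()` may parse shell syntax differently.

```python
import re
from pathlib import PurePosixPath

# Assumed pattern: a line starting with `source <path>` or `. <path>`.
SOURCE_RE = re.compile(r"^\s*(?:source|\.)\s+([^\s;#]+)", re.MULTILINE)


def sourced_paths_sketch(script_path, script_text):
    """Return statically resolvable `source` targets for one script."""
    base = PurePosixPath(script_path).parent
    resolved = []
    for match in SOURCE_RE.finditer(script_text):
        target = match.group(1).strip("\"'")
        if "$" in target:
            continue  # depends on runtime state, so it is not a static fact
        path = PurePosixPath(target)
        resolved.append(str(path if path.is_absolute() else base / path))
    return resolved


script = (
    "source lib/common.sh\n"
    ". ./env.sh\n"
    'source "$HOME/extra.sh"\n'
    "echo done\n"
)
paths = sourced_paths_sketch("scripts/deploy.sh", script)
```

Anchoring on the script's directory instead of the current working directory is what keeps the result the same on every machine.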

## Lane Two: Optional Semantic Extraction

Semantic extraction exists for richer context: documents, papers, images, and code relationships that syntax cannot reliably find. The workflow instructions describe extraction as two parts: deterministic AST extraction and semantic extraction. A code-only corpus can skip semantic extraction, because AST already handles code structure.

The direct LLM path in `graphify/llm.py` asks the backend to return the same JSON shape: nodes, edges, hyperedges, token counts, and confidence values. This is what keeps semantic facts compatible with AST facts.

Sources: [graphify/skill-pi.md:161-195](), [graphify/skill-pi.md:238-250](), [graphify/llm.py:133-147]()

```json
{
  "nodes": [
    {
      "id": "stem_entity",
      "label": "Human Readable Name",
      "file_type": "code|document|paper|image|rationale|concept",
      "source_file": "relative/path"
    }
  ],
  "edges": [
    {
      "source": "node_id",
      "target": "node_id",
      "relation": "calls|implements|references|cites|conceptually_related_to|shares_data_with|semantically_similar_to",
      "confidence": "EXTRACTED|INFERRED|AMBIGUOUS"
    }
  ]
}
```

Note that the current code accepts `concept` as a `file_type` in both the validation layer and the direct LLM prompt, while one older skill instruction says not to invent `concept`. For implementation truth, prefer `validate.py` and `llm.py`, because they are the active code and schema surfaces.

Sources: [graphify/validate.py:4-8](), [graphify/llm.py:145-146](), [graphify/skill-pi.md:238-248]()

## Provider-Neutral Backends And BYOK

Graphify's direct semantic extractor stays backend-neutral through table-driven dispatch. `BACKENDS` includes Claude, Kimi, Ollama, Gemini, OpenAI, DeepSeek, Bedrock, and `claude-cli`. Each backend declares defaults such as base URL, model, environment key names, pricing, temperature, and token limits. Model overrides come from backend-specific environment variables where configured.

This supports BYOK because users can provide their own API keys through environment variables or direct arguments. It supports BYOC because local or customer-controlled routes exist through Ollama, Bedrock credentials, OpenAI-compatible endpoints, and the Claude CLI path. The architecture should not assume that one hosted model provider is always present.

Sources: [graphify/llm.py:47-118](), [graphify/llm.py:214-249](), [graphify/llm.py:533-587]()

| Backend family | How Graphify routes it | BYOC/BYOK note |
|---|---|---|
| OpenAI-compatible | `_call_openai_compat()` with configured `base_url` | Works for OpenAI, Kimi, Gemini OpenAI-compatible API, DeepSeek, Ollama-style endpoints |
| Anthropic direct | `_call_claude()` | Uses `ANTHROPIC_API_KEY` |
| Local Claude Code | `_call_claude_cli()` | Uses existing local `claude` CLI auth instead of a separate API key |
| AWS Bedrock | `_call_bedrock()` | Uses AWS region/profile credential chain |
| Ollama | OpenAI-compatible client pointed at `OLLAMA_BASE_URL` | Can stay local; warns if endpoint is non-loopback |

Sources: [graphify/llm.py:252-384](), [graphify/llm.py:387-530](), [graphify/llm.py:1056-1088]()
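A table-driven backend lookup with user-supplied keys could be sketched as follows. The table layout, the `resolve_backend` and `is_loopback` helpers, and the placeholder model names are hypothetical; only the `ANTHROPIC_API_KEY` and `OLLAMA_BASE_URL` variable names come from the text above, and the real `BACKENDS` entries carry more fields.

```python
import os
from urllib.parse import urlparse

# Hypothetical, reduced table; values are illustrative defaults.
TOY_BACKENDS = {
    "claude": {
        "key_env": "ANTHROPIC_API_KEY",
        "base_url_env": None,
        "base_url": "https://api.anthropic.com",
        "model": "placeholder-model",
    },
    "ollama": {
        "key_env": None,  # a local endpoint needs no API key in this sketch
        "base_url_env": "OLLAMA_BASE_URL",
        "base_url": "http://localhost:11434/v1",
        "model": "placeholder-model",
    },
}


def resolve_backend(name, env=None, api_key=None):
    """Combine table defaults with a user-supplied key or endpoint."""
    env = os.environ if env is None else env
    spec = TOY_BACKENDS[name]
    key = api_key or (env.get(spec["key_env"]) if spec["key_env"] else None)
    if spec["key_env"] and not key:
        raise RuntimeError(f"{name} needs {spec['key_env']} or an explicit api_key")
    base_url = spec["base_url"]
    if spec["base_url_env"]:
        base_url = env.get(spec["base_url_env"], base_url)
    return {"backend": name, "api_key": key,
            "base_url": base_url, "model": spec["model"]}


def is_loopback(url):
    """True when the endpoint host is the local machine."""
    return urlparse(url).hostname in {"localhost", "127.0.0.1", "::1"}
```

Passing `env` as a plain mapping keeps the lookup easy to test, and a check such as `is_loopback` is one way a caller could decide when to warn that prompts will leave the machine.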

## Chunking Keeps Semantic Extraction Practical

Semantic extraction reads file contents into a prompt, but it caps each file at `_FILE_CHAR_CAP` and packs files by estimated token cost. When a backend reports context overflow, returns truncated output, or gives a hollow response, Graphify bisects the chunk and retries. That avoids dropping an entire corpus because one request was too large.

Ollama gets special handling because local models can silently truncate or return empty responses under load. Graphify derives `num_ctx` from actual prompt size, warns about risky settings, and defaults Ollama semantic extraction to serial execution unless the user opts into parallelism.

Sources: [graphify/llm.py:150-166](), [graphify/llm.py:590-651](), [graphify/llm.py:683-810](), [graphify/llm.py:887-895](), [tests/test_llm_backends.py:117-176](), [tests/test_llm_backends.py:360-515]()
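The bisect-and-retry idea can be shown with a small recursive sketch. `ChunkTooLarge`, `extract_with_bisect`, and the fake backend are hypothetical; the real code also reacts to truncated and hollow responses, and its handling of a single oversized file is not described here, so this sketch simply re-raises in that case.

```python
class ChunkTooLarge(Exception):
    """Stand-in for a context-overflow signal from a backend."""


def extract_with_bisect(files, call_backend):
    """Try the whole chunk; on overflow, split it in half and retry each half."""
    try:
        return [call_backend(files)]
    except ChunkTooLarge:
        if len(files) <= 1:
            raise  # assumption: a single oversized file cannot be split further
        mid = len(files) // 2
        return (extract_with_bisect(files[:mid], call_backend)
                + extract_with_bisect(files[mid:], call_backend))


def fake_backend(files):
    """Pretend the model can only handle two files per request."""
    if len(files) > 2:
        raise ChunkTooLarge(f"{len(files)} files exceed the toy context limit")
    return list(files)


results = extract_with_bisect(["a", "b", "c", "d", "e"], fake_backend)
```

One oversized request costs a few extra calls instead of failing the whole corpus, and every file still ends up in exactly one successful chunk.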

## Validation Is The Shape Guard

`validate_extraction()` is deliberately small. It checks that extraction output is a JSON object, that `nodes` and `edges` exist and are lists, that required fields are present, that file types and confidence labels are allowed, and that edge endpoints point to known node IDs when node IDs are available.

This does not prove the graph is semantically correct. It proves the graph fragment has the shape later stages expect.

Sources: [graphify/validate.py:10-64](), [graphify/validate.py:67-72](), [tests/test_validate.py:15-87]()
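A shape check along these lines might look like the sketch below. The required field lists are inferred from the JSON shape shown earlier on this page, and the function name, error messages, and list-of-strings return type are assumptions, not the actual `validate_extraction()` contract.

```python
ALLOWED_FILE_TYPES = {"code", "document", "paper", "image", "rationale", "concept"}
ALLOWED_CONFIDENCE = {"EXTRACTED", "INFERRED", "AMBIGUOUS"}
NODE_FIELDS = ("id", "label", "file_type", "source_file")    # assumed required
EDGE_FIELDS = ("source", "target", "relation", "confidence")  # assumed required


def validate_sketch(data):
    """Return a list of shape errors; an empty list means the shape is fine."""
    if not isinstance(data, dict):
        return ["extraction must be a JSON object"]
    errors = [f"'{key}' must be a list" for key in ("nodes", "edges")
              if not isinstance(data.get(key), list)]
    if errors:
        return errors

    node_ids = set()
    for i, node in enumerate(data["nodes"]):
        errors += [f"node {i} missing '{name}'" for name in NODE_FIELDS
                   if name not in node]
        if "file_type" in node and node["file_type"] not in ALLOWED_FILE_TYPES:
            errors.append(f"node {i} has invalid file_type {node['file_type']!r}")
        if "id" in node:
            node_ids.add(node["id"])

    for i, edge in enumerate(data["edges"]):
        errors += [f"edge {i} missing '{name}'" for name in EDGE_FIELDS
                   if name not in edge]
        if "confidence" in edge and edge["confidence"] not in ALLOWED_CONFIDENCE:
            errors.append(f"edge {i} has invalid confidence {edge['confidence']!r}")
        for end in ("source", "target"):
            # Endpoint checks only run when node IDs are available.
            if node_ids and end in edge and edge[end] not in node_ids:
                errors.append(f"edge {i} {end} {edge[end]!r} is not a known node id")
    return errors


good = {
    "nodes": [{"id": "a", "label": "A", "file_type": "code", "source_file": "a.py"}],
    "edges": [{"source": "a", "target": "a", "relation": "references",
               "confidence": "EXTRACTED"}],
}
bad = {
    "nodes": good["nodes"],
    "edges": [{"source": "a", "target": "ghost", "relation": "calls",
               "confidence": "CERTAIN"}],
}
```

Here `validate_sketch(good)` returns an empty list, while `validate_sketch(bad)` reports the unknown confidence label and the dangling `ghost` endpoint.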

## How The Lanes Merge

The workflow instructions describe a simple merge: AST nodes come first, semantic nodes are deduplicated by `id`, semantic edges are appended to AST edges, and semantic hyperedges are preserved. Token counts come from semantic extraction because deterministic AST extraction returns zero token usage.

Sources: [graphify/skill-pi.md:327-358](), [tests/test_extract.py:52-57]()

```text
AST lane                         Semantic lane
tree-sitter facts                model/context facts
nodes + edges + 0 tokens         nodes + edges + hyperedges + token counts
        \                         /
         \                       /
          -> graph-shaped extraction JSON
```
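The merge rules above fit in a few lines. This sketch assumes both lanes return dictionaries with the keys shown; the function name and the `input_tokens` and `output_tokens` key names are hypothetical.

```python
def merge_lanes_sketch(ast_result, semantic_result):
    """AST nodes win on id clashes; semantic edges and hyperedges are appended."""
    seen = {node["id"] for node in ast_result["nodes"]}
    nodes = list(ast_result["nodes"])
    for node in semantic_result.get("nodes", []):
        if node["id"] not in seen:  # deduplicate by id, AST first
            seen.add(node["id"])
            nodes.append(node)
    return {
        "nodes": nodes,
        "edges": list(ast_result["edges"]) + list(semantic_result.get("edges", [])),
        "hyperedges": list(semantic_result.get("hyperedges", [])),
        # The AST lane spends no tokens, so counts come from the semantic lane.
        "input_tokens": semantic_result.get("input_tokens", 0),
        "output_tokens": semantic_result.get("output_tokens", 0),
    }


ast_result = {
    "nodes": [{"id": "app", "label": "app.py"}],
    "edges": [{"source": "app", "target": "app", "relation": "contains"}],
}
semantic_result = {
    "nodes": [{"id": "app", "label": "App module"},
              {"id": "caching", "label": "Caching strategy"}],
    "edges": [{"source": "app", "target": "caching", "relation": "references"}],
    "hyperedges": [{"members": ["app", "caching"]}],
    "input_tokens": 120,
    "output_tokens": 40,
}
merged = merge_lanes_sketch(ast_result, semantic_result)
```

The duplicate `app` node keeps its AST label, and the semantic lane contributes only the new `caching` node plus its edge and hyperedge.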

## Practical Rules For Changing Extraction

When adding or changing extraction behavior, keep these constraints in mind:

| Rule | Why it matters |
|---|---|
| Prefer direct AST evidence for code facts | It is deterministic, cheap, and testable |
| Keep raw ambiguous calls unresolved | Bad cross-file edges create misleading graph hubs |
| Use `EXTRACTED` only when source evidence is visible | Confidence is part of the data contract |
| Preserve relative `source_file` paths | Graph output should be portable across machines |
| Keep backend choices configurable | BYOC/BYOK users may use hosted, local, AWS, CLI, or OpenAI-compatible providers |
| Validate shape before downstream graph assembly | Bad fragments are easier to diagnose early |

Sources: [graphify/extract.py:7645-7739](), [graphify/symbol_resolution.py:310-317](), [graphify/llm.py:1091-1111](), [graphify/validate.py:10-64]()

## Closing Summary

Graphify's extraction stage is a two-lane fact factory: tree-sitter creates deterministic code facts, optional semantic extraction adds richer context through provider-neutral backends, symbol-resolution passes connect facts only when evidence is strong enough, and validation keeps the output shape predictable for the rest of the graph pipeline. The result is portable graph JSON that can work with user-chosen compute and user-owned keys rather than depending on one fixed model provider.

Sources: [graphify/extract.py:7505-7759](), [graphify/llm.py:47-118](), [graphify/validate.py:10-64]()