# How Facts Become A Map

> How extracted nodes and edges become a NetworkX graph, get deduplicated, clustered, analyzed, and exported as graph.json, graph.html, GRAPH_REPORT.md, wiki output, and call-flow HTML.

- Repository: safishamsi/graphify
- GitHub: https://github.com/safishamsi/graphify
- Human wiki: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d
- Complete Markdown: https://grok-wiki.com/public/wiki/safishamsi-graphify-af19ef9fd72d/llms-full.txt

## Source Files

- `graphify/build.py`
- `graphify/dedup.py`
- `graphify/cluster.py`
- `graphify/analyze.py`
- `graphify/report.py`
- `graphify/export.py`
- `graphify/callflow_html.py`
- `graphify/wiki.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [graphify/build.py](graphify/build.py)
- [graphify/dedup.py](graphify/dedup.py)
- [graphify/cluster.py](graphify/cluster.py)
- [graphify/analyze.py](graphify/analyze.py)
- [graphify/report.py](graphify/report.py)
- [graphify/export.py](graphify/export.py)
- [graphify/wiki.py](graphify/wiki.py)
- [graphify/callflow_html.py](graphify/callflow_html.py)
- [graphify/__main__.py](graphify/__main__.py)
- [graphify/watch.py](graphify/watch.py)
- [README.md](README.md)
</details>

This page explains the part of graphify that starts after facts have been extracted. At that point, graphify has plain dictionaries: nodes, edges, optional hyperedges, and token counts. The job is to turn those facts into a useful map: a NetworkX graph, stable communities, ranked highlights, and files people or agents can read.

Think of it like sorting notes onto a wall. First graphify removes duplicate sticky notes, then draws strings between the remaining notes, then groups nearby notes into neighborhoods, then writes several views of the wall: `graph.json`, `graph.html`, `GRAPH_REPORT.md`, wiki pages, Obsidian notes, and call-flow HTML.

Sources: [graphify/__main__.py:3137-3149](), [graphify/build.py:192-227](), [README.md:34-47]()

## The Short Version

```text
extracted facts
  nodes + edges + hyperedges
        |
        v
deduplicate entities
  exact IDs, exact labels, fuzzy labels, optional LLM tie-breaks
        |
        v
NetworkX graph
  nodes get attributes, valid internal edges are added, direction is preserved
        |
        v
community map + analysis
  cluster(), score_all(), god_nodes(), surprising_connections()
        |
        v
outputs
  graph.json, graph.html, GRAPH_REPORT.md, wiki/, Obsidian vault, call-flow HTML
```

The key design point is that the graph-building path stays local and works on plain data structures. Optional LLM behavior appears in semantic extraction and in the dedup tie-breaker, but the build, cluster, analysis, report, and export stages operate on local files and Python data. This keeps the architecture BYOC/BYOK friendly: API keys or local runtimes are inputs to optional model-backed stages, not a requirement for reading, clustering, exporting, or querying an existing graph.

Sources: [graphify/build.py:107-189](), [graphify/dedup.py:129-145](), [README.md:361-361]()

## Step 1: The CLI Merges Extracted Facts

The main extraction command combines AST facts and semantic facts into one `merged` dictionary. AST results are added first and semantic results second. A comment in the code explains why: when the same node appears in both, semantic node attributes should win because they tend to carry richer labels and document context. Hyperedges come only from the semantic side in this path.

If the user asks for `--no-cluster`, graphify stops early and writes that raw merged dictionary directly to `graphify-out/graph.json`. That mode deliberately skips NetworkX, community detection, and the analysis sidecar. Otherwise, the CLI builds a graph, clusters it, analyzes it, and writes the normal outputs.

Sources: [graphify/__main__.py:3137-3147](), [graphify/__main__.py:3168-3183](), [graphify/__main__.py:3209-3252]()
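
The merge order described above can be sketched in a few lines. This is a minimal illustration, not the CLI's code: `merge_extractions` is a hypothetical helper, and the assumption that attributes are merged per node ID (rather than simply concatenated and resolved later) is ours.

```python
from typing import Any

Extraction = dict[str, Any]


def merge_extractions(ast: Extraction, semantic: Extraction) -> Extraction:
    """Hypothetical helper: combine AST facts and semantic facts into one dict."""
    nodes_by_id: dict[str, dict[str, Any]] = {}
    # AST nodes are visited first and semantic nodes second, so on an ID
    # collision the semantic attributes overwrite the AST ones.
    for node in ast.get("nodes", []) + semantic.get("nodes", []):
        nodes_by_id.setdefault(node["id"], {}).update(node)
    return {
        "nodes": list(nodes_by_id.values()),
        "edges": ast.get("edges", []) + semantic.get("edges", []),
        # In this path hyperedges come only from the semantic side.
        "hyperedges": semantic.get("hyperedges", []),
    }
```

Attributes that only the AST side knows (such as a source file) survive the merge, while shared attributes such as `label` take the semantic value.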

## Step 2: `build.py` Turns Dictionaries Into a Graph

`build()` accepts one or more extraction dictionaries. It appends all nodes, edges, hyperedges, and token counts into a single combined payload. When deduplication is enabled, it calls `deduplicate_entities()` before constructing the NetworkX graph.

Then `build_from_json()` does the graph work:

- It supports both modern `edges` and legacy NetworkX `links`.
- It normalizes old node fields such as `source` into `source_file`.
- It fills missing `file_type` with `concept` and maps known bad file-type synonyms to valid values.
- It validates the extraction, but treats dangling external or stdlib edges as expected.
- It creates either `nx.Graph()` or `nx.DiGraph()`.
- It adds node attributes by ID.
- It adds only edges whose endpoints resolve to known nodes.
- It stores `_src` and `_tgt` on every edge so later exports can restore original direction even if an undirected graph canonicalizes edge order.

That last point matters. A map can be undirected for clustering while still remembering that `caller -> callee` was the original direction for display and JSON export.

Sources: [graphify/build.py:107-189](), [graphify/build.py:192-227]()

```python
# graphify/build.py
attrs["_src"] = src
attrs["_tgt"] = tgt
G.add_edge(src, tgt, **attrs)
```

Sources: [graphify/build.py:178-185]()
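
To see why storing `_src` and `_tgt` helps, here is a simplified stand-in for the steps listed above. `build_graph_sketch` is a hypothetical name and covers only a few of the normalizations; it assumes NetworkX is installed.

```python
import networkx as nx


def build_graph_sketch(data: dict, directed: bool = False) -> nx.Graph:
    """Simplified illustration of the build steps, not graphify's own code."""
    G = nx.DiGraph() if directed else nx.Graph()
    for node in data.get("nodes", []):
        attrs = {k: v for k, v in node.items() if k != "id"}
        # Legacy field name: `source` becomes `source_file`.
        if "source" in attrs and "source_file" not in attrs:
            attrs["source_file"] = attrs.pop("source")
        attrs.setdefault("file_type", "concept")
        G.add_node(node["id"], **attrs)
    # Accept both the modern `edges` key and the legacy `links` key.
    for edge in data.get("edges", data.get("links", [])):
        src, tgt = edge["source"], edge["target"]
        if src not in G or tgt not in G:
            continue  # dangling edge, e.g. to an external or stdlib symbol
        attrs = {k: v for k, v in edge.items() if k not in ("source", "target")}
        attrs["_src"], attrs["_tgt"] = src, tgt
        G.add_edge(src, tgt, **attrs)
    return G


data = {
    "nodes": [{"id": "callee", "source": "b.py"}, {"id": "caller"}],
    "links": [
        {"source": "caller", "target": "callee", "relation": "calls"},
        {"source": "caller", "target": "os.path.join", "relation": "calls"},
    ],
}
G = build_graph_sketch(data)
# An undirected graph may report an edge's endpoints in either order,
# so the original direction is read back from the stored attributes.
restored = [(d["_src"], d["_tgt"]) for _, _, d in G.edges(data=True)]
```

The edge to `os.path.join` is skipped because that endpoint is not a known node, and `restored` carries the caller-to-callee direction regardless of how the undirected graph orders the pair.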

## Step 3: Deduplication Picks One Name for the Same Thing

Deduplication is a pipeline, not a single string comparison. The module-level summary names the stages: exact normalization, entropy gate, MinHash/LSH blocking, Jaro-Winkler verification, same-community boost, and union-find merge.

In simpler terms:

| Stage | What it does |
| --- | --- |
| Exact ID pass | Keeps the first node with each ID. |
| Exact normalized label pass | Merges same-label nodes within the same source file. |
| Fuzzy pass | Uses MinHash/LSH to find candidates, then Jaro-Winkler to verify high-similarity labels. |
| Safety gates | Blocks many short-label variants so `M1` and `M1 Pro` do not merge just because they look similar. |
| Optional LLM tie-break | If configured, asks a selected backend only for ambiguous pairs. |
| Edge rewrite | Repoints edges to the surviving node and drops self-loops created by the merge. |

There is also an explicit cross-repository guard. If nodes carry more than one `repo` value, `deduplicate_entities()` raises instead of merging across projects by label similarity. That is important for global or multi-repo use: two repositories can both have a `Config` or `Client`, and graphify should not assume they are the same entity.

Sources: [graphify/dedup.py:1-5](), [graphify/dedup.py:147-155](), [graphify/dedup.py:160-204](), [graphify/dedup.py:257-309]()
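
The exact passes, the union-find merge, the edge rewrite, and the cross-repository guard can be sketched together. This is a minimal illustration under stated assumptions: `dedup_sketch`, `UnionFind`, and `normalize` are hypothetical names, and the fuzzy MinHash/LSH and Jaro-Winkler stages are deliberately left out.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(label.lower().split())


class UnionFind:
    def __init__(self) -> None:
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, keep: str, drop: str) -> None:
        self.parent[self.find(drop)] = self.find(keep)


def dedup_sketch(nodes: list[dict], edges: list[dict]) -> tuple[list[dict], list[dict]]:
    # Cross-repository guard: never merge by label across projects.
    repos = {n["repo"] for n in nodes if n.get("repo")}
    if len(repos) > 1:
        raise ValueError(f"refusing to dedup across repos: {sorted(repos)}")

    # Exact ID pass: the first node with each ID wins.
    by_id: dict[str, dict] = {}
    for node in nodes:
        by_id.setdefault(node["id"], node)

    # Exact normalized label pass, scoped to the same source file.
    uf = UnionFind()
    first_by_key: dict[tuple[str, str], str] = {}
    for node_id, node in by_id.items():
        key = (node.get("source_file", ""), normalize(node.get("label", node_id)))
        keeper = first_by_key.setdefault(key, node_id)
        if keeper != node_id:
            uf.union(keeper, node_id)

    kept = [n for node_id, n in by_id.items() if uf.find(node_id) == node_id]

    # Edge rewrite: repoint to survivors, drop self-loops the merge created.
    rewritten = []
    for edge in edges:
        src, tgt = uf.find(edge["source"]), uf.find(edge["target"])
        if src == tgt and edge["source"] != edge["target"]:
            continue
        rewritten.append({**edge, "source": src, "target": tgt})
    return kept, rewritten
```

Union-find keeps the merge transitive: if a later stage decided that `b` matches `c`, every edge touching `c` would resolve to the same survivor as edges touching `b`.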

The optional LLM tie-breaker is backend-driven. It checks whether the requested backend exists and whether its API key is available before calling the model path, then skips cleanly when those conditions are not met. That preserves provider neutrality: the dedup algorithm does useful local work without a model, and model-backed resolution is an opt-in extension.

Sources: [graphify/dedup.py:324-343](), [graphify/dedup.py:371-416]()
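
The "skip cleanly" behavior amounts to two guards in front of the model call. The sketch below is illustrative only: the `BACKENDS` registry, the `EXAMPLE_API_KEY` variable, and the `tie_break` function are hypothetical, and the lambda stands in for a real model request.

```python
import os
from typing import Callable, Optional

Judge = Callable[[str, str], bool]

# Hypothetical registry: backend name -> (env var holding the key, judge function).
BACKENDS: dict[str, tuple[str, Judge]] = {
    "example": ("EXAMPLE_API_KEY", lambda a, b: a.lower() == b.lower()),
}


def tie_break(pairs: list[tuple[str, str]], backend: Optional[str]) -> dict:
    """Return merge decisions for ambiguous pairs, or {} when the model path is off."""
    if backend is None or backend not in BACKENDS:
        return {}  # no backend requested, or an unknown one: skip
    env_var, judge = BACKENDS[backend]
    if not os.environ.get(env_var):
        return {}  # backend known but no API key available: skip
    return {pair: judge(*pair) for pair in pairs}
```

An empty result leaves the ambiguous pairs unmerged, so the local stages remain the only thing the pipeline depends on.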

## Step 4: Clustering Finds Neighborhoods

Once graphify has a graph, `cluster()` groups nodes into communities. It accepts directed or undirected graphs, but converts directed graphs to undirected internally because Leiden and Louvain require undirected input. If the graph has no edges, every node becomes its own community.

The clustering strategy is deterministic where possible. `_partition()` rebuilds a stable graph with sorted nodes and sorted edge rows before running community detection. It tries Leiden through `graspologic` first and falls back to NetworkX Louvain when `graspologic` is not installed. It also passes seed-style parameters where the underlying library accepts them, so repeated runs on the same input tend to produce the same partition.

After the first partition, graphify handles practical graph-shape problems:

- Isolates become single-node communities.
- Optional hub exclusion removes very high-degree nodes during partitioning, then reattaches them by neighbor vote.
- Oversized communities are split with a second pass.
- Large low-cohesion communities may be split again.
- Final community IDs are re-indexed by size, largest first.

Sources: [graphify/cluster.py:22-77](), [graphify/cluster.py:86-183](), [graphify/cluster.py:204-216]()
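
A reduced version of this strategy, using only the NetworkX Louvain fallback, might look like the sketch below. `cluster_sketch` is a hypothetical name; it assumes string node IDs and omits hub exclusion and the oversized-community splits.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities


def cluster_sketch(G: nx.Graph, seed: int = 42) -> dict[int, list[str]]:
    """Stable rebuild, seeded Louvain, singleton isolates, size-ordered IDs."""
    U = G.to_undirected() if G.is_directed() else G
    # Rebuild with sorted nodes and edges so insertion order cannot
    # influence the partition.
    stable = nx.Graph()
    stable.add_nodes_from(sorted(U.nodes))
    stable.add_edges_from(sorted(tuple(sorted(e)) for e in U.edges))

    isolates = [n for n in stable.nodes if stable.degree(n) == 0]
    connected = stable.copy()
    connected.remove_nodes_from(isolates)

    groups: list[list[str]] = []
    if connected.number_of_edges() > 0:
        groups = [sorted(c) for c in louvain_communities(connected, seed=seed)]
    groups += [[n] for n in isolates]  # each isolate is its own community

    # Re-index by size, largest first; ties broken by member names.
    groups.sort(key=lambda members: (-len(members), members))
    return dict(enumerate(groups))
```

Sorting before partitioning and fixing the seed are what make two runs over the same graph comparable, which matters when community IDs feed into reports and exported notes.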

## Step 5: Analysis Turns Structure Into Highlights

The analysis stage asks, “What should a human look at first?” It produces several kinds of signals.

`god_nodes()` ranks the most-connected real entities, but filters out file-level hubs, concept nodes, and noisy JSON key nodes. This keeps the list focused on meaningful abstractions instead of mechanical containers.
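
As a rough illustration of that ranking, the sketch below counts degree and drops container-like nodes. The function name and the `kind` attribute are hypothetical; the real filters may inspect different fields.

```python
from collections import Counter


def god_nodes_sketch(
    nodes: dict[str, dict], edges: list[tuple[str, str]], top_n: int = 5
) -> list[tuple[str, int]]:
    """Rank nodes by degree after removing mechanical containers."""
    degree: Counter = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1

    def is_container(attrs: dict) -> bool:
        # Assumed markers for file hubs, concept nodes, and JSON key nodes.
        return attrs.get("file_type") == "concept" or attrs.get("kind") in {"file", "json_key"}

    ranked = [(n, degree[n]) for n, attrs in nodes.items() if not is_container(attrs)]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Note that the filter is applied after counting, so edges to a file hub still contribute to the degree of the real entities it contains.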

`surprising_connections()` chooses a strategy based on the corpus. For multi-source graphs, it looks for cross-file edges between real entities and scores them by confidence, file-type crossing, top-level directory crossing, community crossing, semantic similarity, and peripheral-to-hub shape. For single-source graphs, it looks for cross-community bridges, or falls back to edge betweenness when there is no community map.

`suggest_questions()` turns graph signals into review prompts: ambiguous edges, bridge nodes, god nodes with inferred edges, weakly connected nodes, and low-cohesion communities.

Sources: [graphify/analyze.py:85-104](), [graphify/analyze.py:107-136](), [graphify/analyze.py:251-311](), [graphify/analyze.py:314-399](), [graphify/analyze.py:402-520]()

## Step 6: `graph.json` Preserves the Machine Map

`to_json()` writes the graph in NetworkX node-link form. It attaches each node’s community ID and a normalized label. For links, it fills missing `confidence_score` defaults and restores the true `source` and `target` from `_src` and `_tgt` before writing. It also includes hyperedges from graph metadata and, when available, records the git commit used to build the graph.

There is a safety check before overwriting an existing graph: unless `force=True`, graphify refuses to silently replace a larger existing graph with a smaller one. That protects users from accidental partial rebuilds.

Sources: [graphify/export.py:475-525]()
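
The direction restore and the overwrite guard can be sketched as follows. `write_graph_json` is a hypothetical helper; the default `confidence_score` of `1.0`, the removal of `_src` and `_tgt` from the output, and returning `False` instead of raising are all assumptions made for the illustration.

```python
import json
from pathlib import Path


def write_graph_json(path: Path, nodes: list[dict], links: list[dict], force: bool = False) -> bool:
    """Write node-link JSON; refuse to shrink an existing graph unless forced."""
    out_links = []
    for link in links:
        link = dict(link)  # copy so the caller's data is not mutated
        # Restore the original direction recorded at build time.
        link["source"] = link.pop("_src", link["source"])
        link["target"] = link.pop("_tgt", link["target"])
        link.setdefault("confidence_score", 1.0)  # assumed default value
        out_links.append(link)

    if path.exists() and not force:
        old = json.loads(path.read_text(encoding="utf-8"))
        if len(old.get("nodes", [])) > len(nodes):
            return False  # existing graph is larger: do not overwrite silently

    payload = {"nodes": nodes, "links": out_links}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return True
```

Comparing node counts is a cheap proxy for "this rebuild looks partial", which is why the guard can run before any expensive diffing.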

## Step 7: `graph.html` Gives an Interactive View

`to_html()` generates a standalone vis-network HTML page. It sizes nodes by degree or community member count, colors nodes by community, styles edges by confidence, and includes search, node inspection, and community filtering. It restores edge direction from `_src` and `_tgt` for rendered arrows.

For large graphs, the function either raises a size error or, when a node limit is provided, builds an aggregated community-level meta-graph. The default limit is controlled by `MAX_NODES_FOR_VIZ` and can be overridden with `GRAPHIFY_VIZ_NODE_LIMIT`.

Sources: [graphify/export.py:147-164](), [graphify/export.py:615-668](), [graphify/export.py:670-779]()
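
The limit lookup and the community-level aggregation can be sketched like this. The value assigned to `MAX_NODES_FOR_VIZ` here is a placeholder rather than the real default, and `viz_node_limit` and `community_meta_graph` are hypothetical names.

```python
import os
from collections import Counter

MAX_NODES_FOR_VIZ = 5000  # placeholder value, not graphify's actual default


def viz_node_limit() -> int:
    """Use GRAPHIFY_VIZ_NODE_LIMIT when it holds a number, else the default."""
    raw = os.environ.get("GRAPHIFY_VIZ_NODE_LIMIT", "")
    return int(raw) if raw.isdigit() else MAX_NODES_FOR_VIZ


def community_meta_graph(
    communities: dict[int, list[str]], edges: list[tuple[str, str]]
) -> tuple[dict[int, int], dict[tuple[int, int], int]]:
    """Collapse each community to one meta-node and count cross-community edges."""
    owner = {n: cid for cid, members in communities.items() for n in members}
    meta_nodes = {cid: len(members) for cid, members in communities.items()}
    meta_edges: Counter = Counter()
    for u, v in edges:
        cu, cv = owner.get(u), owner.get(v)
        if cu is None or cv is None or cu == cv:
            continue  # unknown endpoint or edge inside a single community
        meta_edges[(min(cu, cv), max(cu, cv))] += 1
    return meta_nodes, dict(meta_edges)
```

The member count per meta-node is what a renderer could use for node size, and the edge counts for line weight.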

## Step 8: `GRAPH_REPORT.md` Explains the Map

`report.generate()` writes the human-readable audit trail. It summarizes corpus size, graph size, extraction confidence mix, token cost, and graph freshness. Then it lists community hubs, god nodes, surprising connections, optional hyperedges, communities with cohesion scores, ambiguous edges, knowledge gaps, and suggested questions.

The report is not just a pretty summary. It is also a navigation surface: community hub links are added so `GRAPH_REPORT.md` can point into generated Obsidian community notes instead of becoming a dead end.

Sources: [graphify/report.py:15-84](), [graphify/report.py:86-130](), [graphify/report.py:132-203]()
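
One of the simpler report inputs, the extraction confidence mix, is just a percentage breakdown over edges. The sketch below assumes each edge carries a `confidence` field with labels such as `EXTRACTED`, `INFERRED`, and `AMBIGUOUS`; both the field name and the labels are assumptions here, and `confidence_mix` is a hypothetical helper.

```python
from collections import Counter


def confidence_mix(edges: list[dict]) -> dict[str, float]:
    """Percentage of edges per confidence label, sorted by label."""
    counts = Counter(e.get("confidence", "UNKNOWN") for e in edges)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in sorted(counts.items())}
```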

## Step 9: Wiki Output Makes the Graph Agent-Crawlable

`wiki.py` creates a smaller article-style Markdown wiki. It writes:

| Output | Purpose |
| --- | --- |
| `index.md` | Entry point listing communities and god nodes. |
| `<CommunityName>.md` | One article per community with key concepts, relationships, source files, and confidence breakdown. |
| `<GodNodeLabel>.md` | One article per god node with connections grouped by relation. |

Before writing, `to_wiki()` refuses to run on an empty community map, drops stale node IDs that no longer exist in the graph, and clears old Markdown articles from the wiki directory. That cleanup is intentional because community labels can change across runs, and stale files would otherwise accumulate.

Sources: [graphify/wiki.py:37-102](), [graphify/wiki.py:105-178](), [graphify/wiki.py:181-280]()
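
The three pre-write checks can be sketched in one function. `prepare_wiki_dir` is a hypothetical name, and raising `ValueError` on an empty map is an assumption about how the refusal is signalled.

```python
from pathlib import Path


def prepare_wiki_dir(
    wiki_dir: Path, communities: dict[int, list[str]], graph_nodes: set[str]
) -> dict[int, list[str]]:
    """Validate the community map, drop stale IDs, and clear old articles."""
    if not communities:
        raise ValueError("community map is empty; nothing to write")
    # Keep only node IDs that still exist in the graph.
    live = {
        cid: [n for n in members if n in graph_nodes]
        for cid, members in communities.items()
    }
    wiki_dir.mkdir(parents=True, exist_ok=True)
    # Community labels can change between runs, so old Markdown is removed.
    for old in list(wiki_dir.glob("*.md")):
        old.unlink()
    return live
```

Only `*.md` files in the wiki directory are removed, so unrelated files placed there are left alone in this sketch.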

## Step 10: Obsidian and Call-Flow HTML Are Specialized Views

The Obsidian exporter writes one Markdown note per node plus one `_COMMUNITY_*.md` overview per community. Node notes include YAML frontmatter, graphify tags, community tags, and wikilinks to neighbors. Community notes include member lists, optional cohesion, cross-community counts, bridge nodes, and an Obsidian graph configuration file for community colors.

Sources: [graphify/export.py:786-897](), [graphify/export.py:898-1014](), [graphify/export.py:1016-1028]()
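
A node note of this shape can be produced with plain string formatting. The sketch below is illustrative: `node_note` is a hypothetical function, and the exact frontmatter keys and tag formats (`graphify/...`, `community/...`) are assumptions, not the exporter's real output.

```python
def node_note(label: str, file_type: str, community: int, neighbors: list[str]) -> str:
    """Render one Obsidian-style note: YAML frontmatter, tags, and wikilinks."""
    lines = [
        "---",
        f"file_type: {file_type}",
        f"community: {community}",
        "tags:",
        f"  - graphify/{file_type}",
        f"  - community/{community}",
        "---",
        "",
        f"# {label}",
        "",
        "## Connections",
    ]
    # Wikilinks let Obsidian draw the same edges in its own graph view.
    lines += [f"- [[{name}]]" for name in sorted(neighbors)]
    return "\n".join(lines) + "\n"
```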

The call-flow HTML exporter starts from `graph.json`, optional `GRAPH_REPORT.md`, optional `.graphify_labels.json`, and optional sections JSON. It normalizes graph schema variants, derives or loads sections, classifies edges across sections, then writes a self-contained Mermaid-based architecture document. If no graph file exists, it tells the user to run graphify first or pass a graph path.

Sources: [graphify/callflow_html.py:3-18](), [graphify/callflow_html.py:253-293](), [graphify/callflow_html.py:1577-1644](), [graphify/callflow_html.py:1645-1703]()
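
The section classification and Mermaid step can be approximated as below. `classify_edges` and `to_mermaid` are hypothetical names, and the output is a bare `flowchart` definition rather than the full self-contained HTML document the exporter writes.

```python
def classify_edges(
    sections: dict[str, list[str]], edges: list[tuple[str, str]]
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Split edges into those inside one section and those crossing sections."""
    owner = {n: s for s, members in sections.items() for n in members}
    internal, crossing = [], []
    for u, v in edges:
        su, sv = owner.get(u), owner.get(v)
        if su is None or sv is None:
            continue  # endpoint not assigned to any section
        (internal if su == sv else crossing).append((u, v))
    return internal, crossing


def to_mermaid(sections: dict[str, list[str]], edges: list[tuple[str, str]]) -> str:
    """Emit a section-level Mermaid flowchart from cross-section edges."""
    owner = {n: s for s, members in sections.items() for n in members}
    _, crossing = classify_edges(sections, edges)
    pairs = sorted({(owner[u], owner[v]) for u, v in crossing})
    ids = {name: f"s{i}" for i, name in enumerate(sorted(sections))}
    lines = ["flowchart LR"]
    lines += [f'    {ids[name]}["{name}"]' for name in sorted(sections)]
    lines += [f"    {ids[a]} --> {ids[b]}" for a, b in pairs]
    return "\n".join(lines)
```

For example, `to_mermaid({"cli": ["main"], "core": ["build", "cluster"]}, [("main", "build"), ("build", "cluster")])` yields two section boxes and a single arrow from `cli` to `core`, because the `build` to `cluster` edge stays inside one section.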

## How Re-Clustering and Updates Reuse the Same Pipeline

`cluster-only` loads an existing `graph.json`, rebuilds a NetworkX graph with `build_from_json()`, reruns clustering, recomputes cohesion, god nodes, surprising connections, and suggested questions, then rewrites `GRAPH_REPORT.md`, `graph.json`, labels, and optionally `graph.html`.

The watch/update path does similar local work for code changes. It clusters, scores, analyzes, generates a report, writes a temporary graph JSON, compares it to the old graph, backs up protected outputs, replaces `graph.json`, writes `GRAPH_REPORT.md`, and regenerates `graph.html` when appropriate. Watch mode also regenerates call-flow HTML, but only for files that already exist: the presence of an earlier call-flow file is what opts a project in.

Sources: [graphify/__main__.py:2159-2202](), [graphify/__main__.py:2204-2221](), [graphify/watch.py:500-554](), [graphify/watch.py:562-585]()

## Practical Mental Model

Graphify’s map is not one file. It is a set of projections over the same underlying graph.

| Artifact | Best for | Source of truth |
| --- | --- | --- |
| `graph.json` | Machines, queries, re-clustering, integrations | NetworkX graph serialized by `to_json()` |
| `graph.html` | Visual exploration in a browser | Same graph plus community colors and confidence styling |
| `GRAPH_REPORT.md` | Human audit and high-level findings | Analysis outputs and community map |
| `wiki/index.md` and articles | Agent-readable navigation | Community and god-node article generation |
| Obsidian vault | Local note graph and wikilinks | Node/community Markdown export |
| `*-callflow.html` | Architecture and call-flow documentation | Existing graph output plus labels/report/sections |

The safest way to reason about the system is: `graph.json` is the durable machine map; `GRAPH_REPORT.md`, `graph.html`, wiki pages, Obsidian notes, and call-flow HTML are views built from that map and its analysis side data.

Sources: [graphify/export.py:475-525](), [graphify/report.py:45-84](), [graphify/wiki.py:181-197](), [graphify/callflow_html.py:1616-1644]()

## Closing Summary

Facts become a map through a local, composable pipeline: merge extracted dictionaries, deduplicate entities, build a NetworkX graph, cluster it into communities, score the structure, and export several views for humans and tools. The code keeps model-backed work optional and backend-selected, while the core graph, analysis, and export stages remain portable across local files, repositories, and catalog-style sources.

Sources: [graphify/__main__.py:3209-3252](), [graphify/export.py:475-525]()
