# ingest.py & extract.py — Building the Capability Graph

> How ingest.py walks the repo and introspects the activegraph module to produce File and Chunk objects (deduped on path + sha256), and how extract.py applies regex/heuristic patterns over those chunks to emit Capability, API, Behavior, ObjectType, Constraint, AuthorityRule, and RelationType nodes. Covers the deterministic vs. optional LLM-augment split and the SELFGRAPH_OBJECTTYPE_MATCH env flag (literal vs. relaxed) that controls which ObjectType regex fires.

- Repository: yoheinakajima/activegraph-selfgraph
- GitHub: https://github.com/yoheinakajima/activegraph-selfgraph
- Human wiki: https://grok-wiki.com/public/wiki/yoheinakajima-activegraph-selfgraph-41747ef30393
- Complete Markdown: https://grok-wiki.com/public/wiki/yoheinakajima-activegraph-selfgraph-41747ef30393/llms-full.txt

## Source Files

- `selfgraph/ingest.py`
- `selfgraph/extract.py`
- `selfgraph/cli.py`

---



`ingest.py` and `extract.py` are the two-stage pipeline that transforms a raw repository and Python module into a queryable capability graph. `ingest.py` walks file trees and introspects Python packages to produce `File` and `Chunk` objects, while `extract.py` applies deterministic regex and heuristic patterns over those chunks to emit higher-level graph nodes — `Capability`, `API`, `Behavior`, `ObjectType`, `Constraint`, `AuthorityRule`, and `RelationType`. Together they form the "build" step that every downstream query, proposal, and validation command depends on.

Understanding this pipeline matters because the graph is append-only and event-sourced: every node and edge added here is a durable, traceable fact. The doc comment in `ingest.py` states the design intent directly: "the trace is the proof the agent really read what it claims to know."

---

## Stage 1 — ingest.py: Files and Chunks

### Entry Points

The CLI's `cmd_build` in `cli.py` calls both entry points in sequence:

```python
# selfgraph/cli.py, lines 52-54
ingest_paths(graph, [repo])
ingest_module_docs(graph, "activegraph", max_submodules=40)
extract_capabilities(graph)
```

There are two ingestion paths:

| Function | Source | Emitted File kind |
|---|---|---|
| `ingest_paths` | Filesystem walk of one or more root paths | `"repo"` |
| `ingest_module_docs` | Live Python `importlib` introspection | `"module"` |

### ingest_paths — Filesystem Walk

`ingest_paths` accepts a list of root paths and walks each one with `os.walk`, skipping directories named `.git`, `__pycache__`, `.venv`, and `node_modules`. Only files whose suffix is in `TEXT_EXT` (`.md`, `.py`, `.toml`, `.yaml`, `.json`, etc.) are ingested, and files larger than 200 000 bytes are silently skipped.

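Both `dirnames` and `filenames` are sorted before processing, so the walk order is stable across machines regardless of filesystem ordering. Assigning to `dirnames[:]` in place also prunes the skipped directories, because `os.walk` only descends into the names left in that list: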

```python
# selfgraph/ingest.py, lines 108-110
dirnames[:] = sorted(d for d in dirnames if d not in skip)
for fn in sorted(filenames):
    files.append(Path(dirpath) / fn)
```

Sources: [selfgraph/ingest.py:83-122]()
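
The walk described above can be sketched as a standalone function. This is a minimal illustration, not the repository's code: `walk_text_files`, `SKIP_DIRS`, and `MAX_BYTES` are names assumed here, and `TEXT_EXT` is abbreviated to the suffixes the text lists explicitly.

```python
import os
from pathlib import Path

# Assumed values for illustration; the real constants live in selfgraph/ingest.py.
SKIP_DIRS = {".git", "__pycache__", ".venv", "node_modules"}
TEXT_EXT = {".md", ".py", ".toml", ".yaml", ".json"}
MAX_BYTES = 200_000


def walk_text_files(root: str) -> list[Path]:
    """Return ingestible files under root in a stable, sorted order."""
    files: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # In-place assignment prunes skipped dirs and fixes traversal order.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for fn in sorted(filenames):
            path = Path(dirpath) / fn
            if path.suffix.lower() not in TEXT_EXT:
                continue  # not a recognised text extension
            if path.stat().st_size > MAX_BYTES:
                continue  # oversized files are skipped without an error
            files.append(path)
    return files
```

Because both lists are sorted, two machines with the same checkout should produce the same file sequence, which in turn keeps object creation order reproducible.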

### ingest_module_docs — Module Introspection

`ingest_module_docs` imports a Python package with `importlib.import_module`, then uses `pkgutil.walk_packages` to enumerate every submodule. For each module it builds a synthetic Markdown-like text document using `_render_module`, which serialises:

- The module docstring
- Each public class (name, constructor signature, class docstring, public method signatures and docstrings)
- Each public function (name, signature, docstring)

The synthetic document is stored with a `module://` pseudo-path (e.g., `module://activegraph.graph`) so downstream extraction can identify it as a module artifact. Script-style entry points that call `sys.exit()` at import time are skipped to avoid crashing the ingestion process.

Sources: [selfgraph/ingest.py:125-170]()
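
To make the shape of the synthetic document concrete, here is a rough sketch of a renderer built on the standard `inspect` module. The function name `render_module` and the exact line layout are assumptions; only the `## def name(...)` / `## class name(...)` heading style is taken from the extraction patterns described later on this page, and method-level rendering is omitted for brevity.

```python
import inspect
import types


def render_module(mod: types.ModuleType) -> str:
    """Render a module's public surface as Markdown-like text (hypothetical layout)."""
    lines = [f"# module {mod.__name__}", inspect.getdoc(mod) or ""]
    for name, obj in sorted(vars(mod).items()):
        if name.startswith("_"):
            continue  # private and dunder names are not part of the public surface
        if inspect.isclass(obj) or inspect.isfunction(obj):
            kind = "class" if inspect.isclass(obj) else "def"
            try:
                sig = str(inspect.signature(obj))
            except (TypeError, ValueError):
                sig = "(...)"  # some callables expose no introspectable signature
            lines.append(f"## {kind} {name}{sig}")
            lines.append(inspect.getdoc(obj) or "")
    return "\n".join(lines)
```

Rendering signatures as headings gives the extractor a regular, line-oriented target, so API discovery does not need to parse Python source.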

### _emit_file — Deduplication on (path, sha256)

Both ingestion paths ultimately call `_emit_file`, which handles deduplication. Before creating a new `File` object the function checks whether an existing `File` already matches both the `path` and the `sha256` hash of the content:

```python
# selfgraph/ingest.py, lines 49-53
digest = _sha(content)
for existing in graph.objects(type="File"):
    if (existing.data.get("path") == path
            and existing.data.get("sha256") == digest):
        return existing.id
```

If an unchanged file is found, its id is returned and no new objects are emitted — re-running `build` is safe. If the content has changed, a new `File` is created (the old one is not deleted), so history is preserved.

After creating the `File`, content is split into `Chunk` objects of at most 2000 characters (`CHUNK_CHARS = 2000`). Each chunk is linked to its parent `File` via a `FILE_HAS_CHUNK` relation. Chunks carry their own `sha256` for later deduplication by the extractor.

Sources: [selfgraph/ingest.py:41-80]()
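
The dedup-then-chunk contract can be exercised without a real graph. The sketch below uses a plain dict as a hypothetical stand-in for the graph store and assumes fixed-width slicing and the 16-hex-digit hash form from the schema table; `emit_file`, `split_chunks`, and `sha16` are illustrative names, not the module's own.

```python
import hashlib

CHUNK_CHARS = 2000  # matches the constant named in the text


def sha16(content: str) -> str:
    """Truncated SHA-256 (assumes the 16-hex-digit form from the File schema)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]


def split_chunks(content: str, size: int = CHUNK_CHARS) -> list[dict]:
    """Slice content into fixed-size chunks, each carrying its own hash."""
    return [
        {"index": i // size, "text": content[i:i + size],
         "sha256": sha16(content[i:i + size])}
        for i in range(0, len(content), size)
    ]


def emit_file(store: dict, path: str, content: str) -> tuple[str, bool]:
    """Return (file_id, created). `store` is a hypothetical stand-in for the graph."""
    digest = sha16(content)
    for file_id, data in store.items():
        if data["path"] == path and data["sha256"] == digest:
            return file_id, False  # unchanged: reuse the existing File
    file_id = f"file-{len(store) + 1}"
    store[file_id] = {"path": path, "sha256": digest,
                      "chunks": split_chunks(content)}
    return file_id, True
```

Calling `emit_file` twice with identical content returns the same id and creates nothing; calling it with changed content under the same path adds a second record alongside the first, which mirrors the history-preserving behaviour described above.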

### File Object Schema

| Field | Value |
|---|---|
| `path` | Filesystem path or `module://` pseudo-path |
| `kind` | `"repo"` or `"module"` |
| `ext` | Lowercased file extension |
| `sha256` | First 16 hex digits of the content hash |
| `ingested_at` | UTC ISO-8601 timestamp |
| `size` | Content length in bytes |
| `preview` | First 400 characters of content |
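
As an illustration of the schema, the following sketch builds a record with the same field names. `file_record` is a hypothetical helper, and details such as UTF-8 encoding for `size` and the hash are assumptions rather than confirmed behaviour.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_record(path: str, content: str, kind: str = "repo") -> dict:
    """Build a File data dict following the schema table (field semantics assumed)."""
    raw = content.encode("utf-8")  # assumption: size and hash use UTF-8 bytes
    return {
        "path": path,
        "kind": kind,
        "ext": Path(path).suffix.lower(),
        "sha256": hashlib.sha256(raw).hexdigest()[:16],
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size": len(raw),
        "preview": content[:400],
    }
```

Note that `size` counts bytes while `preview` counts characters, so the two can diverge for non-ASCII content.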

---

## Stage 2 — extract.py: Capability Graph Nodes

`extract_capabilities` is the single entry point for this stage. It always runs the deterministic pass, and optionally runs an LLM augmentation pass when `ANTHROPIC_API_KEY` is set.

```text
┌──────────────────────────────────────────────────────────────┐
│  extract_capabilities(graph)                                 │
│                                                              │
│  1. _seed()            → Capability + AuthorityRule anchors  │
│  2. _scan_chunk()      → API, Behavior, ObjectType,          │
│     (for each Chunk)     Constraint, RelationType, Example   │
│  3. _llm_augment()     → optional Capability + Constraint    │
│     (if API key set)     from doc chunks via Claude          │
└──────────────────────────────────────────────────────────────┘
```

### Seeding Stable Anchors

Before scanning any chunks, `_seed` writes six named `Capability` nodes and four `AuthorityRule` nodes into the graph. These are heuristic anchors that remain stable across re-runs; `_add_unique` prevents duplicates. Every `Capability` is then linked to the `no-authority-mutation` `AuthorityRule` via a `CAPABILITY_REQUIRES_APPROVAL` edge:

```python
# selfgraph/extract.py, lines 96-103
_SEED_CAPABILITIES = [
    ("ingest-repo",        "Read and chunk local files into File/Chunk objects."),
    ("extract-capability", "Mine signatures and docs for graph-native capabilities."),
    ("answer-question",    "Answer questions by querying the capability graph."),
    ("propose-patch",      "Generate a structured PatchProposal for a user goal."),
    ("validate-patch",     "Reject unsafe or out-of-scope patch proposals."),
    ("sandbox-apply",      "Apply a proposal in a fork, run test events, diff."),
]
```

Sources: [selfgraph/extract.py:96-128]()

### Deterministic Regex Patterns

The deterministic scan in `_scan_chunk` applies the following compiled patterns to each chunk's text:

| Pattern | Targets | Emits |
|---|---|---|
| `_RE_BEHAVIOR_DECO` | `@behavior`, `@llm_behavior`, `@relation_behavior` decorators | `Behavior` node + `BEHAVIOR_SUBSCRIBES_TO` edge for each `on=[...]` event |
| `_RE_TOOL_DECO` | `@tool(...)` decorator | `API` node with `kind="tool"` |
| `_RE_API_SIG` | `## def name(...)` / `## class name(...)` lines in `module://` files | `API` node; infers `API_CREATES`, `API_WRITES`, or `API_READS` relation from name |
| `_RE_OBJTYPE_HINT` + `_RE_OBJTYPE_CONSTRUCTOR` | `add_object("Type", ...)` and `ObjectType(name="...")` calls | `ObjectType` node |
| `` ``` `` in `.md` files | Markdown code fences | `Example` node |
| `_RE_MUST` | Sentences containing "must" or "must not" | `Constraint` node |

Every emitted object carries `source_chunk_id` and `source_file_path` metadata so grounding traces can cite back to the exact ingested artifact.

Sources: [selfgraph/extract.py:36-63](), [selfgraph/extract.py:207-305]()
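
The actual pattern definitions are not reproduced on this page, so the sketch below uses hypothetical stand-ins to show how two of the table rows might work: collecting `on=[...]` events from behavior decorators and picking out "must" sentences. The regexes and the `scan` function are assumptions for illustration only.

```python
import re

# Hypothetical stand-ins; the real patterns are defined in selfgraph/extract.py.
RE_BEHAVIOR_DECO = re.compile(
    r"@(behavior|llm_behavior|relation_behavior)\s*\((?P<args>[^)]*)\)"
)
RE_ON_EVENTS = re.compile(r"\bon\s*=\s*\[(?P<events>[^\]]*)\]")
RE_MUST = re.compile(r"[^.\n]*\bmust(?:\s+not)?\b[^.\n]*\.")


def scan(text: str) -> dict:
    """Collect subscribed events and 'must' sentences from one chunk of text."""
    events: list[str] = []
    for deco in RE_BEHAVIOR_DECO.finditer(text):
        on = RE_ON_EVENTS.search(deco.group("args"))
        if on:
            # Split the list literal on commas and strip quotes and spaces.
            events += [e.strip(" '\"")
                       for e in on.group("events").split(",") if e.strip()]
    constraints = [m.group(0).strip() for m in RE_MUST.finditer(text)]
    return {"events": events, "constraints": constraints}
```

A regex pass like this is cheap and deterministic, at the cost of missing decorators whose arguments span nested parentheses or sentences that wrap across lines.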

### API Relation Inference

When processing `module://` synthetic files, the extractor uses a simple name-based heuristic to classify which relation an API node should have to its owning `Capability`. The name of the symbol is lowercased and matched against known vocabulary:

```python
# selfgraph/extract.py, lines 254-260
rel = (
    "API_CREATES" if any(k in lname for k in
                         ("add_", "create", "emit", "propose"))
    else "API_WRITES" if any(k in lname for k in
                             ("apply", "patch", "update", "remove"))
    else "API_READS"
)
```

This is a heuristic — all unrecognized names default to `API_READS`.

Sources: [selfgraph/extract.py:244-267]()
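
Wrapping the same expression in a function makes its precedence and its blind spots easy to probe. `infer_relation` is a name chosen here for illustration; the keyword tuples are copied from the excerpt above.

```python
def infer_relation(name: str) -> str:
    """Classify an API symbol by name, mirroring the precedence of the excerpt."""
    lname = name.lower()
    # The "creates" vocabulary is checked first, so it wins on overlap.
    if any(k in lname for k in ("add_", "create", "emit", "propose")):
        return "API_CREATES"
    if any(k in lname for k in ("apply", "patch", "update", "remove")):
        return "API_WRITES"
    return "API_READS"
```

Because matching is by substring, `propose_patch` resolves to `API_CREATES` (the first branch wins), while a name such as `dispatch` resolves to `API_WRITES` only because it happens to contain `patch`. Results from this heuristic are best read as hints rather than verified semantics.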

### The SELFGRAPH_OBJECTTYPE_MATCH Flag

Two regex patterns exist for ObjectType discovery, controlled by the `SELFGRAPH_OBJECTTYPE_MATCH` environment variable:

| Mode | Env Value | Regexes Active | What it catches |
|---|---|---|---|
| **Relaxed** (default) | `"relaxed"` or unset | `_RE_OBJTYPE_HINT` + `_RE_OBJTYPE_CONSTRUCTOR` | Capitalized `add_object("Type", ...)` calls **and** lowercase `ObjectType(name="company")` constructor calls used by the activegraph runtime |
| **Literal** | `"literal"` | `_RE_OBJTYPE_HINT` only | Only capitalized identifiers in `add_object(...)` / `ObjectType(name=...)` |

Any other value raises `ValueError` immediately. This is a deliberate guard so that a typo cannot silently change which ObjectType nodes are extracted:

```python
# selfgraph/extract.py, lines 83-91
mode = os.environ.get("SELFGRAPH_OBJECTTYPE_MATCH", "relaxed")
if mode == "literal":
    return [_RE_OBJTYPE_HINT]
if mode == "relaxed":
    return [_RE_OBJTYPE_HINT, _RE_OBJTYPE_CONSTRUCTOR]
raise ValueError(
    f"SELFGRAPH_OBJECTTYPE_MATCH={mode!r}; expected "
    f"'literal' or 'relaxed'"
)
```

The code comment explains the purpose: the activegraph runtime registers ObjectTypes with lowercase names like `"company"` and `"document"`, which the original literal regex misses entirely. Keeping both modes behind a flag makes the BEFORE/AFTER comparison reproducible from a cold start: set the env var to `literal` or `relaxed` to rebuild either baseline.

Sources: [selfgraph/extract.py:49-91]()
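
The effect of the flag can be demonstrated end to end with two hypothetical regexes. `RE_HINT` and `RE_CONSTRUCTOR` below are simplified assumptions, not the patterns from `extract.py`; only the mode-selection logic follows the excerpt above.

```python
import os
import re

# Hypothetical stand-ins; the real patterns live in selfgraph/extract.py.
RE_HINT = re.compile(
    r"""(?:add_object\(\s*|ObjectType\(\s*name\s*=\s*)["']([A-Z]\w*)["']"""
)
RE_CONSTRUCTOR = re.compile(r"""ObjectType\(\s*name\s*=\s*["']([a-z]\w*)["']""")


def objecttype_regexes() -> list[re.Pattern]:
    """Select the active patterns from the environment, failing on unknown modes."""
    mode = os.environ.get("SELFGRAPH_OBJECTTYPE_MATCH", "relaxed")
    if mode == "literal":
        return [RE_HINT]
    if mode == "relaxed":
        return [RE_HINT, RE_CONSTRUCTOR]
    raise ValueError(
        f"SELFGRAPH_OBJECTTYPE_MATCH={mode!r}; expected 'literal' or 'relaxed'"
    )


def find_object_types(text: str) -> set[str]:
    """Return every ObjectType name matched by the currently active patterns."""
    return {m.group(1) for rx in objecttype_regexes() for m in rx.finditer(text)}
```

With a source containing `add_object("Invoice", {})` and `ObjectType(name="company")`, literal mode would yield only `Invoice`, relaxed mode would yield both names, and a misspelled value such as `Relaxed` would raise instead of quietly falling back.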

### Optional LLM Augmentation Pass

When `ANTHROPIC_API_KEY` is set, `_llm_augment` makes a second pass over the first eight `.md` chunks in the graph. It sends each chunk's first 1500 characters to `claude-sonnet-4-6` with a structured prompt asking for a JSON object containing `capabilities` (list of `{name, description}`) and `constraints` (list of strings). Returned nodes are merged via `_add_unique` so the LLM cannot create duplicates of deterministically extracted nodes.

The LLM pass is strictly additive: the deterministic pass always runs first and cannot be bypassed. If the Anthropic SDK is not installed or any exception occurs, the LLM pass is skipped silently and `counts["llm_added"]` is set to 0. LLM-sourced nodes carry `"source": "llm"` in their data to distinguish them from deterministic extractions.

Sources: [selfgraph/extract.py:334-389]()
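
The merge step of such a pass can be sketched without calling any model. The helper below is hypothetical (`merge_llm_reply` does not appear in the source); it assumes the reply is the JSON shape described above and shows one way to keep the pass additive and tolerant of malformed output.

```python
import json


def merge_llm_reply(reply_text: str, known_names: set[str],
                    known_constraints: set[str]) -> list[dict]:
    """Parse a model reply and return only nodes not already known."""
    try:
        payload = json.loads(reply_text)
    except json.JSONDecodeError:
        return []  # malformed replies contribute nothing
    if not isinstance(payload, dict):
        return []
    added: list[dict] = []
    for cap in payload.get("capabilities") or []:
        name = cap.get("name") if isinstance(cap, dict) else None
        if name and name not in known_names:
            known_names.add(name)
            added.append({"type": "Capability", "name": name,
                          "description": cap.get("description", ""),
                          "source": "llm"})
    for text in payload.get("constraints") or []:
        if isinstance(text, str) and text and text not in known_constraints:
            known_constraints.add(text)
            added.append({"type": "Constraint", "text": text, "source": "llm"})
    return added
```

Tagging each node with `"source": "llm"` keeps model-derived facts separable from deterministic ones, so a reproducibility check can filter them out.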

### _add_unique — Deduplication in the Extract Pass

All nodes emitted by `_scan_chunk` go through `_add_unique`, which checks existing objects of the same type for a matching key before inserting. The key is the first non-empty value among `name`, `text`, and `snippet`; data with no key at all is always inserted:

```python
# selfgraph/extract.py, lines 308-319
def _add_unique(graph: Graph, type_: str, data: dict):
    key = data.get("name") or data.get("text") or data.get("snippet", "")
    if not key:
        return graph.add_object(type_, data, actor="extract")
    for o in graph.objects(type=type_):
        existing_key = o.data.get("name") or o.data.get("text") \
                       or o.data.get("snippet", "")
        if existing_key == key:
            return None
    return graph.add_object(type_, data, actor="extract")
```

This prevents the same `Behavior`, `API`, or `Constraint` from being recorded twice even when multiple chunks reference the same symbol.
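
The contract is easiest to see against a tiny in-memory graph. `StubGraph` and `Obj` below are hypothetical stand-ins exposing only the two methods the function uses, and `add_unique` restates the logic of the excerpt in condensed form.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Obj:
    id: str
    type: str
    data: dict


@dataclass
class StubGraph:
    """Hypothetical in-memory stand-in for the graph API used by _add_unique."""
    _objs: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def objects(self, type: str):
        return [o for o in self._objs if o.type == type]

    def add_object(self, type_: str, data: dict, actor: str = ""):
        obj = Obj(f"obj-{next(self._ids)}", type_, data)
        self._objs.append(obj)
        return obj


def _key(data: dict) -> str:
    return data.get("name") or data.get("text") or data.get("snippet", "")


def add_unique(graph, type_: str, data: dict):
    key = _key(data)
    if key and any(_key(o.data) == key for o in graph.objects(type=type_)):
        return None  # duplicate within this type: nothing is added
    return graph.add_object(type_, data, actor="extract")
```

Three properties follow from this shape: uniqueness is scoped per type, so an `API` and a `Behavior` may share a name; the first writer wins, so later data under the same key is dropped rather than merged; and each insert scans all existing objects of that type, which is linear in their count.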

---

## Data Flow Diagram

```text
  selfgraph build
        │
        ├─ ingest_paths(graph, [repo])
        │       │
        │       └─ per text file → _emit_file()
        │               ├─ dedup check (path + sha256)
        │               ├─ File object  (kind="repo")
        │               └─ Chunk objects (2000-char slices)
        │                       └─ FILE_HAS_CHUNK relation
        │
        ├─ ingest_module_docs(graph, "activegraph")
        │       │
        │       └─ per submodule → _render_module() → _emit_file()
        │               ├─ File object  (kind="module", path="module://...")
        │               └─ Chunk objects
        │
        └─ extract_capabilities(graph)
                │
                ├─ _seed()        → Capability + AuthorityRule nodes
                │                   CAPABILITY_REQUIRES_APPROVAL edges
                │
                ├─ _scan_chunk()  (for each Chunk)
                │   ├─ @behavior/* → Behavior + BEHAVIOR_SUBSCRIBES_TO
                │   ├─ @tool       → API (kind=tool)
                │   ├─ module:// sigs → API + API_CREATES/READS/WRITES
                │   ├─ ObjectType  → ObjectType (literal | relaxed)
                │   ├─ must/must not → Constraint
                │   └─ ``` in .md → Example
                │
                └─ _llm_augment() [optional, if ANTHROPIC_API_KEY set]
                        └─ Capability + Constraint (source="llm")
```

---

## Reading Order for New Contributors

1. **`selfgraph/ingest.py:41-80`** — `_emit_file`: understand the dedup contract before anything else; every other piece of ingest delegates here.
2. **`selfgraph/ingest.py:125-169`** — `ingest_module_docs`: see how Python module introspection generates the `module://` synthetic corpus that feeds API extraction.
3. **`selfgraph/extract.py:65-91`** — `_objecttype_regexes`: read the env-flag gate and understand why two regexes exist before looking at `_scan_chunk`.
4. **`selfgraph/extract.py:207-305`** — `_scan_chunk`: the full deterministic pattern sweep; maps directly to the node-type table above.
5. **`selfgraph/cli.py:48-57`** — `cmd_build`: the wiring that calls the above in order and resets the database with `create=True` on each full rebuild.

The deterministic pass is the contract; the LLM pass is enrichment. Any feature or test that relies on reproducible graph content should run with `ANTHROPIC_API_KEY` unset and `SELFGRAPH_OBJECTTYPE_MATCH` pinned explicitly to either `literal` or `relaxed`.

Sources: [selfgraph/extract.py:131-162]()
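
One way to pin both settings inside a test is a small context manager that restores the previous environment afterwards. This is a sketch under the assumption that both variables are read at call time, as the `os.environ.get` excerpt suggests for the ObjectType flag; `pinned_extract_env` is a hypothetical helper, not part of the repository.

```python
import os
from contextlib import contextmanager

_KEYS = ("ANTHROPIC_API_KEY", "SELFGRAPH_OBJECTTYPE_MATCH")


@contextmanager
def pinned_extract_env(objecttype_mode: str = "relaxed"):
    """Temporarily unset ANTHROPIC_API_KEY and pin the ObjectType mode."""
    if objecttype_mode not in ("literal", "relaxed"):
        raise ValueError(f"unexpected mode: {objecttype_mode!r}")
    saved = {k: os.environ.get(k) for k in _KEYS}
    os.environ.pop("ANTHROPIC_API_KEY", None)  # keeps the LLM pass disabled
    os.environ["SELFGRAPH_OBJECTTYPE_MATCH"] = objecttype_mode
    try:
        yield
    finally:
        # Restore whatever was set before entering the block.
        for key, value in saved.items():
            if value is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
```

A test would wrap its build and assertions in `with pinned_extract_env("literal"):` so the result does not depend on the developer's shell configuration.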
