# Indexing & Language Extraction Pipeline

> The pipeline from project files to graph records: git-aware scanning, include and exclude filters, language detection, tree-sitter WASM loading, worker-based parsing, extraction results, and tests that prove supported language behavior.

- Repository: colbymchenry/codegraph
- GitHub: https://github.com/colbymchenry/codegraph
- Human wiki: https://grok-wiki.com/public/wiki/colbymchenry-codegraph-89e8b2c4d43a
- Complete Markdown: https://grok-wiki.com/public/wiki/colbymchenry-codegraph-89e8b2c4d43a/llms-full.txt

## Source Files

- `src/extraction/index.ts`
- `src/extraction/tree-sitter.ts`
- `src/extraction/parse-worker.ts`
- `src/extraction/grammars.ts`
- `src/extraction/tree-sitter-types.ts`
- `src/extraction/languages/typescript.ts`
- `src/extraction/languages/python.ts`
- `__tests__/extraction.test.ts`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [src/extraction/index.ts](src/extraction/index.ts)
- [src/extraction/tree-sitter.ts](src/extraction/tree-sitter.ts)
- [src/extraction/parse-worker.ts](src/extraction/parse-worker.ts)
- [src/extraction/grammars.ts](src/extraction/grammars.ts)
- [src/extraction/tree-sitter-types.ts](src/extraction/tree-sitter-types.ts)
- [src/extraction/languages/index.ts](src/extraction/languages/index.ts)
- [src/extraction/languages/typescript.ts](src/extraction/languages/typescript.ts)
- [src/extraction/languages/python.ts](src/extraction/languages/python.ts)
- [src/types.ts](src/types.ts)
- [src/db/queries.ts](src/db/queries.ts)
- [__tests__/extraction.test.ts](__tests__/extraction.test.ts)
</details>


This page explains how CodeGraph turns project files into graph records. The pipeline starts with git-aware file discovery, applies include and exclude filters, detects languages, loads only the needed tree-sitter WASM grammars, parses files in a worker when available, extracts graph nodes and references, and finally persists the result into the local graph database.


## Pipeline At A Glance

```text
project root
  -> scanDirectoryAsync()
     -> git-visible files when possible
     -> filesystem walk fallback
     -> include/exclude filtering
  -> detect frameworks once per full index
  -> detect needed languages
  -> load tree-sitter WASM grammars in parse worker
  -> read files in small I/O batches
  -> parse/extract per file
     -> file node
     -> symbol nodes
     -> contains edges
     -> unresolved imports/calls/types/decorators/etc.
  -> storeExtractionResult()
     -> nodes
     -> edges
     -> unresolved_refs
     -> files metadata
```

The key boundary: scanning and SQLite writes stay in the orchestrator on the main thread, while parsing is delegated to `extractFromSource()`, usually inside a worker thread. Sources: [src/extraction/index.ts:512-600](), [src/extraction/index.ts:734-849](), [src/extraction/tree-sitter.ts:2487-2547]()

## File Discovery And Filtering

`shouldIncludeFile()` is the first explicit filter: exclude patterns win before include patterns are considered. Both checks use normalized forward-slash paths and `picomatch` with dotfile matching enabled. Sources: [src/extraction/index.ts:97-126]()
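
The exclude-before-include ordering can be illustrated with a minimal sketch. It does not use `picomatch`: the `Matcher` type and the `dirMatcher` and `extMatcher` helpers are hypothetical stand-ins for compiled glob matchers, and `shouldIncludeSketch` is an illustrative name, not the repository function.

```typescript
// A path predicate; in CodeGraph this role is played by picomatch matchers.
type Matcher = (path: string) => boolean;

// Normalize Windows separators so matchers only ever see forward slashes.
function normalizePath(path: string): string {
  return path.replace(/\\/g, "/");
}

// Hypothetical helper: matches when any path segment equals `dir`.
function dirMatcher(dir: string): Matcher {
  return (path) => path.split("/").indexOf(dir) !== -1;
}

// Hypothetical helper: matches on a file extension such as ".ts".
function extMatcher(ext: string): Matcher {
  return (path) => path.slice(-ext.length) === ext;
}

function shouldIncludeSketch(
  path: string,
  includeMatchers: Matcher[],
  excludeMatchers: Matcher[]
): boolean {
  const normalized = normalizePath(path);
  // Exclusion is checked first, so an excluded path can never be re-included.
  if (excludeMatchers.some((m) => m(normalized))) return false;
  return includeMatchers.some((m) => m(normalized));
}

const includeMatchers = [extMatcher(".ts"), extMatcher(".py")];
const excludeMatchers = [dirMatcher("node_modules")];
```

With these matchers, `node_modules/pkg/index.ts` is rejected even though it matches the `.ts` include rule, because the exclude check runs first.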

The default configuration includes source extensions for TypeScript, JavaScript, Python, Go, Rust, Java, C/C++, C#, PHP, Ruby, Swift, Kotlin, Dart, Svelte, Vue, Liquid, Pascal/Delphi, and Scala. It excludes common dependency, build, cache, coverage, IDE, and generated-output directories, and caps processed files at 1 MB by default. Sources: [src/types.ts:453-490](), [src/types.ts:495-696]()

### Git-Aware Scanning

In git repositories, the scanner uses `git ls-files` rather than walking the filesystem first. It collects tracked files with `--recurse-submodules`, separately collects untracked files with `--exclude-standard`, and recurses into embedded nested git repositories that git reports as opaque trailing-slash entries. If git discovery fails or the root is ignored by a parent repository, scanning falls back to the filesystem walker. Sources: [src/extraction/index.ts:128-215]()
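
As a rough sketch of the trailing-slash handling described above, the function below splits raw `git ls-files` output into regular files and directory entries that would need their own nested scan. `partitionGitListing` and `GitListing` are hypothetical names, and the sketch assumes newline-separated, unquoted paths.

```typescript
interface GitListing {
  files: string[];         // regular file paths
  embeddedRepos: string[]; // trailing-slash entries that need a nested scan
}

// Split raw `git ls-files` output into files and opaque directory entries.
// Assumption: newline-separated output with no quoted or newline-containing paths.
function partitionGitListing(stdout: string): GitListing {
  const files: string[] = [];
  const embeddedRepos: string[] = [];
  for (const line of stdout.split("\n")) {
    const entry = line.replace(/\r$/, "");
    if (entry === "") continue;
    if (entry.charAt(entry.length - 1) === "/") {
      // Strip the trailing slash so the entry can be used as a directory path.
      embeddedRepos.push(entry.slice(0, -1));
    } else {
      files.push(entry);
    }
  }
  return { files, embeddedRepos };
}
```

A caller would then run the same discovery again inside each `embeddedRepos` directory and prefix the results with that directory's path.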

The public scanner entry points are:

| Function | Role |
|---|---|
| `scanDirectory()` | Synchronous scan for source paths. |
| `scanDirectoryAsync()` | Async scan that periodically yields during large git file lists so progress rendering can continue. |
| `scanDirectoryWalk()` | Non-git fallback that follows readable directories, guards symlink cycles, respects `.codegraphignore`, and applies include/exclude rules. |

Sources: [src/extraction/index.ts:270-334](), [src/extraction/index.ts:336-435]()

## Language Detection And Grammar Loading

Language detection is extension-first. `EXTENSION_MAP` maps each known extension to the repository's `Language` union. `.h` files default to C but are upgraded to C++ if the first 8 KB contain C++-specific syntax such as namespaces, classes, templates, access labels, virtual methods, or `using namespace`. Sources: [src/extraction/grammars.ts:40-81](), [src/extraction/grammars.ts:178-200]()
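
The extension-first lookup and the `.h` upgrade can be sketched as follows. The map is a deliberately small subset, and the `CPP_HINTS` pattern is a hypothetical approximation of the heuristics listed above, not the repository's actual checks.

```typescript
type Language = "typescript" | "python" | "c" | "cpp" | "unknown";

// A tiny subset for illustration; the real EXTENSION_MAP covers many more extensions.
const EXTENSION_MAP_SKETCH: { [ext: string]: Language } = {
  ".ts": "typescript",
  ".py": "python",
  ".c": "c",
  ".cpp": "cpp",
  ".h": "c",
};

// Hypothetical heuristic: lines that begin with typical C++-only constructs.
const CPP_HINTS =
  /^\s*(namespace\s+\w+|class\s+\w+|template\s*<|using\s+namespace\s|(public|private|protected)\s*:|virtual\s)/m;

function detectLanguageSketch(filePath: string, content: string = ""): Language {
  const dot = filePath.lastIndexOf(".");
  const ext = dot === -1 ? "" : filePath.slice(dot).toLowerCase();
  // Only the first 8 KB of a header is inspected, as described above.
  if (ext === ".h" && CPP_HINTS.test(content.slice(0, 8192))) return "cpp";
  return EXTENSION_MAP_SKETCH[ext] || "unknown";
}
```

Limiting the check to a fixed prefix keeps detection cheap for large headers, at the cost of missing C++ syntax that first appears later in the file.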

Grammar loading is deliberately lazy. `initGrammars()` initializes the `web-tree-sitter` runtime without loading grammar WASM files. `loadGrammarsForLanguages()` deduplicates requested languages, skips already-loaded or unavailable grammars, and loads WASM grammars sequentially. Pascal and Scala use bundled local WASM files under `src/extraction/wasm`; the other grammar files come from `tree-sitter-wasms`. Sources: [src/extraction/grammars.ts:1-7](), [src/extraction/grammars.ts:19-38](), [src/extraction/grammars.ts:92-140]()
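
The deduplicate-and-skip step can be sketched as a pure planning function. `planGrammarLoads` and its parameters are hypothetical; the real `loadGrammarsForLanguages()` also performs the WASM loading itself, which is omitted here.

```typescript
// Returns the grammars that still need loading, in first-request order.
// Hypothetical helper: it only plans the work and does not touch any WASM file.
function planGrammarLoads(
  requested: string[],
  alreadyLoaded: string[],
  available: string[]
): string[] {
  const plan: string[] = [];
  for (const lang of requested) {
    const duplicate = plan.indexOf(lang) !== -1;        // requested twice
    const loaded = alreadyLoaded.indexOf(lang) !== -1;  // loaded by an earlier call
    const hasGrammar = available.indexOf(lang) !== -1;  // no WASM grammar ships for it
    if (!duplicate && !loaded && hasGrammar) plan.push(lang);
  }
  return plan;
}
```

The resulting list would then be loaded one grammar at a time, matching the sequential loading described above.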

Supported languages are those with a WASM grammar plus the custom extractor languages `svelte`, `vue`, and `liquid`; `unknown` is explicitly unsupported. Sources: [src/extraction/grammars.ts:202-227]()

## Worker-Based Parsing

`indexAll()` initializes the WASM runtime, scans files, detects frameworks, emits parsing progress, computes the needed languages, and loads grammars inside a parse worker when `parse-worker.js` exists. If the compiled worker is unavailable, it loads grammars locally and parses in-process; this is useful for test/runtime environments where compiled worker output may not exist. Sources: [src/extraction/index.ts:512-600]()

The worker protocol is small:

| Message | Direction | Effect |
|---|---|---|
| `load-grammars` | main -> worker | Load the requested language grammars. |
| `grammars-loaded` | worker -> main | Signal that parsing can begin. |
| `parse` | main -> worker | Detect language and run `extractFromSource()`. |
| `parse-result` | worker -> main | Return an `ExtractionResult`. |
| `shutdown` / `shutdown-ack` | main -> worker / worker -> main | Main requests shutdown; the worker acknowledges. |

The worker also filters known noisy Emscripten abort lines from stderr, periodically resets per-language parsers after 5,000 parses, and exits on WASM memory corruption so the main thread can spawn a clean worker. Sources: [src/extraction/parse-worker.ts:13-56](), [src/extraction/parse-worker.ts:58-100]()
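
The message table above maps naturally onto discriminated unions. In this sketch the message names come from the table, while every payload field (`languages`, `id`, `filePath`, `content`, `result`) and the `replyTo` function are assumptions made for illustration.

```typescript
// Message names come from the protocol table; payload fields are assumptions.
type MainToWorker =
  | { type: "load-grammars"; languages: string[] }
  | { type: "parse"; id: number; filePath: string; content: string }
  | { type: "shutdown" };

type WorkerToMain =
  | { type: "grammars-loaded" }
  | { type: "parse-result"; id: number; result: { nodes: unknown[]; errors: string[] } }
  | { type: "shutdown-ack" };

// Pure reply function standing in for the worker's message handler.
function replyTo(message: MainToWorker): WorkerToMain {
  switch (message.type) {
    case "load-grammars":
      return { type: "grammars-loaded" };
    case "parse":
      // A real worker would detect the language and run extractFromSource() here.
      return { type: "parse-result", id: message.id, result: { nodes: [], errors: [] } };
    case "shutdown":
      return { type: "shutdown-ack" };
  }
}
```

Carrying an `id` on `parse` and `parse-result` is one way to let the main thread match replies to pending requests; whether the repository uses such a field is not stated here.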

The orchestrator separately recycles the whole worker after 250 parses, applies per-file parse timeouts that scale with content size, rejects pending parses on worker crashes, and retries likely WASM-memory failures with fresh workers. Sources: [src/extraction/index.ts:36-50](), [src/extraction/index.ts:624-731](), [src/extraction/index.ts:881-984]()
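
A minimal sketch of the recycling counter and a size-scaled timeout follows. The 250-parse threshold is the figure quoted above; `BASE_TIMEOUT_MS`, `EXTRA_MS_PER_KB`, `parseTimeoutMs`, and `WorkerBudget` are hypothetical, since the actual timeout formula is not given on this page.

```typescript
const WORKER_RECYCLE_THRESHOLD = 250; // figure quoted in the text above

// Hypothetical constants: the real base timeout and scaling factor are not stated here.
const BASE_TIMEOUT_MS = 5000;
const EXTRA_MS_PER_KB = 10;

// Larger files get proportionally more time before a parse is abandoned.
function parseTimeoutMs(contentBytes: number): number {
  return BASE_TIMEOUT_MS + Math.ceil(contentBytes / 1024) * EXTRA_MS_PER_KB;
}

class WorkerBudget {
  private parses = 0;

  // Record one parse; returns true when the worker should be replaced.
  recordParse(): boolean {
    this.parses += 1;
    if (this.parses >= WORKER_RECYCLE_THRESHOLD) {
      this.parses = 0; // the replacement worker starts with a fresh count
      return true;
    }
    return false;
  }
}
```

Replacing the worker on a fixed budget bounds how much WASM memory a single worker can accumulate, independent of whether any individual parse fails.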

## Extraction Results

The extraction result shape is shared across parsers: `nodes`, `edges`, `unresolvedReferences`, `errors`, and `durationMs`. Nodes represent files and symbols; edges represent relationships such as `contains`, `calls`, `imports`, `extends`, `implements`, `type_of`, `returns`, `instantiates`, and `decorates`. Unresolved references hold relationships that need later resolution against graph nodes. Sources: [src/types.ts:18-60](), [src/types.ts:97-186](), [src/types.ts:221-289]()
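
To make the shape concrete, here is a sketch of the result for a file containing one function that calls `log`. The five top-level keys and the edge kinds come from the description above; the inner field names (`id`, `fromNodeId`, `referenceName`, and so on) and the plain-string `errors` are assumptions.

```typescript
type EdgeKind =
  | "contains" | "calls" | "imports" | "extends" | "implements"
  | "type_of" | "returns" | "instantiates" | "decorates";

// Inner field names below are illustrative, not the repository's exact types.
interface GraphNode { id: string; kind: string; name: string; filePath: string }
interface GraphEdge { source: string; target: string; kind: EdgeKind }
interface UnresolvedReference { fromNodeId: string; referenceName: string; referenceKind: EdgeKind }

interface ExtractionResultSketch {
  nodes: GraphNode[];
  edges: GraphEdge[];
  unresolvedReferences: UnresolvedReference[];
  errors: string[];
  durationMs: number;
}

const exampleResult: ExtractionResultSketch = {
  nodes: [
    { id: "file:src/greet.ts", kind: "file", name: "greet.ts", filePath: "src/greet.ts" },
    { id: "fn:src/greet.ts:greet", kind: "function", name: "greet", filePath: "src/greet.ts" },
  ],
  // The file node contains the function node.
  edges: [{ source: "file:src/greet.ts", target: "fn:src/greet.ts:greet", kind: "contains" }],
  // The call target is only known by name until a later resolution pass.
  unresolvedReferences: [
    { fromNodeId: "fn:src/greet.ts:greet", referenceName: "log", referenceKind: "calls" },
  ],
  errors: [],
  durationMs: 3,
};
```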

`TreeSitterExtractor.extract()` checks support, obtains a parser from the grammar cache, parses the source, creates a file node, pushes that file node as the root scope, walks the tree, then deletes the tree and releases the source string to reduce WASM and GC pressure. Sources: [src/extraction/tree-sitter.ts:140-239]()

During traversal, the extractor dispatches based on the active language extractor's node-type lists: functions, classes, methods, interfaces, structs, enums, aliases, properties, fields, variables, imports, calls, instantiations, and Rust impl items. `createNode()` assigns graph metadata and creates `contains` edges from the current scope. Sources: [src/extraction/tree-sitter.ts:263-385](), [src/extraction/tree-sitter.ts:390-433]()
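
The scope-driven `contains` edges can be sketched with a small stack. `ScopeTracker`, `withScope`, and the `::`-joined id scheme are hypothetical; only the idea that `createNode()` links each new node to the current scope comes from the text above.

```typescript
interface ScopedNode { id: string; kind: string; name: string }
interface ContainsEdge { source: string; target: string; kind: "contains" }

class ScopeTracker {
  readonly nodes: ScopedNode[] = [];
  readonly edges: ContainsEdge[] = [];
  private scopes: string[] = [];

  // Create a node and link it to whatever scope is currently on top of the stack.
  createNode(kind: string, name: string): ScopedNode {
    const parentId = this.scopes.length > 0 ? this.scopes[this.scopes.length - 1] : undefined;
    const id = parentId ? parentId + "::" + name : name; // hypothetical id scheme
    const node: ScopedNode = { id, kind, name };
    this.nodes.push(node);
    if (parentId) this.edges.push({ source: parentId, target: id, kind: "contains" });
    return node;
  }

  // Run `body` with `node` as the active scope, then restore the previous scope.
  withScope(node: ScopedNode, body: () => void): void {
    this.scopes.push(node.id);
    try { body(); } finally { this.scopes.pop(); }
  }
}

const tracker = new ScopeTracker();
const fileNode = tracker.createNode("file", "src/shapes.ts");
tracker.withScope(fileNode, () => {
  const classNode = tracker.createNode("class", "Circle");
  tracker.withScope(classNode, () => { tracker.createNode("method", "area"); });
});
```

In this usage the file node is pushed as the root scope, so `Circle` is contained by the file and `area` is contained by `Circle`.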

Imports, calls, constructor invocations, decorators, inheritance, and type annotations are generally recorded as unresolved references first. For example, import extraction creates an `import` node and an unresolved `imports` reference; call extraction records a `calls` reference; instantiation extraction records an `instantiates` reference to the constructor class name. Sources: [src/extraction/tree-sitter.ts:1234-1270](), [src/extraction/tree-sitter.ts:1446-1454](), [src/extraction/tree-sitter.ts:1457-1506](), [src/extraction/tree-sitter.ts:1508-1569]()

## Language Extractor Contract

Per-language behavior is configured through the `LanguageExtractor` interface. It names AST node types for language concepts, field names for common syntax roles, and optional hooks for signatures, visibility, exports, async/static flags, imports, variables, custom visitors, body resolution, class classification, receiver types, and parser-misparse handling. Sources: [src/extraction/tree-sitter-types.ts:73-151](), [src/extraction/tree-sitter-types.ts:153-208]()

The extractor registry maps repository `Language` values to concrete language extractors. TypeScript and TSX share `typescriptExtractor`; JavaScript and JSX share `javascriptExtractor`; other languages have their own modules. Sources: [src/extraction/languages/index.ts:1-46]()

TypeScript extraction covers function declarations, arrow functions, function expressions, classes, abstract classes, methods, interfaces, enums, type aliases, imports, calls, and top-level variables. It also handles class-field arrow functions by resolving the nested arrow/function body, computes signatures from parameters and return types, walks parents to detect exports, and detects `const`. Sources: [src/extraction/languages/typescript.ts:4-58](), [src/extraction/languages/typescript.ts:59-118]()

Python extraction maps `function_definition` to both functions and methods depending on class scope, handles `class_definition`, import nodes, calls, assignments, signatures with return annotations, async functions, `@staticmethod`, and `from ... import ...` module extraction. Sources: [src/extraction/languages/python.ts:4-53]()
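
A simplified view of this configuration-driven design is sketched below. `LanguageExtractorSketch` and its field names are hypothetical and far smaller than the real `LanguageExtractor` interface; the node type strings are taken from the tree-sitter Python grammar.

```typescript
// Field names are illustrative; the real LanguageExtractor has many more members.
interface LanguageExtractorSketch {
  functionTypes: string[];
  classTypes: string[];
  importTypes: string[];
  // Optional hook, mirroring how a language can refine generic behavior.
  isAsync?: (nodeText: string) => boolean;
}

const pythonSketch: LanguageExtractorSketch = {
  functionTypes: ["function_definition"],
  classTypes: ["class_definition"],
  importTypes: ["import_statement", "import_from_statement"],
  isAsync: (text) => /^async\s/.test(text),
};

type Concept = "function" | "method" | "class" | "import" | "other";

// The same AST node type becomes a function or a method depending on class scope.
function classifyNode(
  extractor: LanguageExtractorSketch,
  nodeType: string,
  insideClass: boolean
): Concept {
  if (extractor.functionTypes.indexOf(nodeType) !== -1) return insideClass ? "method" : "function";
  if (extractor.classTypes.indexOf(nodeType) !== -1) return "class";
  if (extractor.importTypes.indexOf(nodeType) !== -1) return "import";
  return "other";
}
```

Keeping the per-language part as data plus optional hooks means the traversal code stays shared, and a new language mostly needs a new configuration object.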

## Custom Extractors And Framework Add-Ons

`extractFromSource()` routes Svelte, Vue, Liquid, and Pascal DFM/FMX files to custom extractors instead of the generic tree-sitter extractor. All other supported languages go through `TreeSitterExtractor`. After the language pass, framework-specific extractors can run if `frameworkNames` are supplied, and their nodes and unresolved references are merged into the same result. Sources: [src/extraction/tree-sitter.ts:2487-2547]()
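
The routing and merge steps can be sketched as two small functions. `chooseExtractor`, `mergeFrameworkResult`, and the `PartialResult` shape are hypothetical simplifications; the real results carry full node and reference objects rather than strings.

```typescript
type ExtractorRoute = "svelte" | "vue" | "liquid" | "dfm" | "tree-sitter";

// Pick a custom extractor where one exists, else fall back to the generic one.
function chooseExtractor(language: string, filePath: string): ExtractorRoute {
  const lower = filePath.toLowerCase();
  if (language === "svelte") return "svelte";
  if (language === "vue") return "vue";
  if (language === "liquid") return "liquid";
  // Pascal form files are routed by extension rather than by language alone.
  if (lower.slice(-4) === ".dfm" || lower.slice(-4) === ".fmx") return "dfm";
  return "tree-sitter";
}

// Simplified result shape: strings stand in for node and reference objects.
interface PartialResult { nodes: string[]; unresolvedReferences: string[] }

// Framework output is appended to the language pass instead of replacing it.
function mergeFrameworkResult(base: PartialResult, extra: PartialResult): PartialResult {
  return {
    nodes: base.nodes.concat(extra.nodes),
    unresolvedReferences: base.unresolvedReferences.concat(extra.unresolvedReferences),
  };
}
```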

`indexAll()` detects frameworks once per full index from the scanned file list, caches the names on the orchestrator for the run, and passes those names into each parse call. Single-file indexing can also detect frameworks on demand if a full run has not populated the cache. Sources: [src/extraction/index.ts:437-507](), [src/extraction/index.ts:546-552](), [src/extraction/index.ts:1140-1145]()

## Persistence Into Graph Records

`storeExtractionResult()` hashes file content, skips unchanged files, deletes stale rows for changed files, filters invalid nodes, inserts valid nodes, filters edges so both endpoints exist, inserts unresolved references with denormalized file and language context, and upserts the file metadata record. Sources: [src/extraction/index.ts:1154-1225]()
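
The skip-unchanged and endpoint-filtering steps can be sketched without a database. `prepareForStorage`, the `isValidNode` rule, and the precomputed hash strings are assumptions; the real function computes the hash itself and performs the writes.

```typescript
interface StoredNode { id: string; name: string }
interface StoredEdge { source: string; target: string; kind: string }
interface StoragePlan { skipped: boolean; nodes: StoredNode[]; edges: StoredEdge[] }

// Hypothetical validity rule: a node needs a non-empty id and name.
function isValidNode(node: StoredNode): boolean {
  return node.id !== "" && node.name !== "";
}

function prepareForStorage(
  nodes: StoredNode[],
  edges: StoredEdge[],
  previousHash: string | undefined,
  currentHash: string
): StoragePlan {
  // An unchanged content hash means there is nothing to rewrite for this file.
  if (previousHash === currentHash) return { skipped: true, nodes: [], edges: [] };

  const validNodes = nodes.filter(isValidNode);
  const knownIds: { [id: string]: boolean } = {};
  validNodes.forEach((node) => { knownIds[node.id] = true; });

  // Keep an edge only when both of its endpoints survived node filtering.
  const validEdges = edges.filter(
    (edge) => knownIds[edge.source] === true && knownIds[edge.target] === true
  );
  return { skipped: false, nodes: validNodes, edges: validEdges };
}
```

Filtering edges after nodes avoids dangling edge rows when a node was rejected as invalid.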

Database writes are handled by `QueryBuilder`: nodes and edges are inserted in transactions, file records are upserted by path, deleting a file also deletes its nodes, and unresolved references are inserted in a batch transaction. Sources: [src/db/queries.ts:193-264](), [src/db/queries.ts:960-992](), [src/db/queries.ts:1074-1116](), [src/db/queries.ts:1160-1189]()

## Sync Path

Full indexing is not the only entry point. `sync()` initializes grammars, then tries `git status --porcelain --no-renames` to identify modified, added, and deleted files that still pass include/exclude filtering. Deleted files are removed from the database; modified and untracked files are hashed and re-indexed only when needed. If git change detection is unavailable, sync falls back to a full scan and compares current files against stored file records. Sources: [src/extraction/index.ts:218-268](), [src/extraction/index.ts:1227-1320]()
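
As a sketch of the change-detection step, the function below classifies lines of `git status --porcelain --no-renames` output, where each line is two status characters, a space, and a path. `parsePorcelain` and `GitChanges` are hypothetical names, and quoted paths containing special characters are not handled.

```typescript
interface GitChanges { modified: string[]; added: string[]; deleted: string[] }

// Parses porcelain v1 lines of the form "XY path".
// Assumption: paths are unquoted, and renames are disabled so no "->" appears.
function parsePorcelain(stdout: string): GitChanges {
  const changes: GitChanges = { modified: [], added: [], deleted: [] };
  for (const line of stdout.split("\n")) {
    if (line.length < 4) continue; // blank or malformed line
    const code = line.slice(0, 2);
    const filePath = line.slice(3);
    if (code.indexOf("D") !== -1) {
      changes.deleted.push(filePath); // deleted in the index or the working tree
    } else if (code === "??" || code.indexOf("A") !== -1) {
      changes.added.push(filePath);   // untracked or newly staged
    } else {
      changes.modified.push(filePath);
    }
  }
  return changes;
}
```

Each resulting path would still need to pass the include and exclude filters before it is re-indexed or removed.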

## Behavior Proven By Tests

The extraction test suite loads all grammars before tests, then verifies language detection, language support reporting, extraction behavior, scanner filtering, git submodules, embedded git repositories, and Scala support. Sources: [__tests__/extraction.test.ts:17-20](), [__tests__/extraction.test.ts:34-125]()

Key tested behaviors include:

| Area | Evidence |
|---|---|
| TypeScript extraction | Functions, classes, interfaces, calls, arrow/function-expression exports, type aliases, exported constants, file nodes, and containment edges. Sources: [__tests__/extraction.test.ts:128-214](), [__tests__/extraction.test.ts:216-314](), [__tests__/extraction.test.ts:316-530]() |
| Python, Go, Rust, Java, PHP, Swift, Kotlin, Dart | Representative symbol extraction and language-specific relationships such as Rust trait implementation references and Swift inheritance references. Sources: [__tests__/extraction.test.ts:532-725](), [__tests__/extraction.test.ts:727-965](), [__tests__/extraction.test.ts:967-1155]() |
| Scanner exclusions | `node_modules`, nested `node_modules`, `.git`, normalized paths, and `.codegraphignore`. Sources: [__tests__/extraction.test.ts:3006-3080]() |
| Git repository boundaries | Submodule files are included; embedded non-submodule repos are traversed; each embedded repo's `.gitignore` is respected. Sources: [__tests__/extraction.test.ts:3083-3205]() |
| Scala support | Detection, support reporting, classes, objects, traits, methods, signatures, and top-level functions. Sources: [__tests__/extraction.test.ts:3212-3299]() |

## Provider-Neutral Architecture Notes

This pipeline is BYOC/BYOK-friendly because the extraction path is local and provider-neutral: it reads files from the project root, optionally asks local git for visible or changed paths, parses with tree-sitter WASM grammars, uses worker threads for isolation, and writes graph records through the local database query layer. The extraction code shown here does not depend on a hosted model provider, proprietary API key, or connector-specific runtime. Sources: [src/extraction/index.ts:7-26](), [src/extraction/grammars.ts:9-11](), [src/extraction/parse-worker.ts:8-11](), [src/db/queries.ts:7-23]()

For Grok-Wiki or skill-pack integration, keep the same boundary: treat generated wiki context, solved-problem notes, strategy files, and skill catalogs as portable source inputs that can live in files, repositories, or catalogs. They should orient documentation, but code and tests should remain the source of truth for implementation claims.

In short, CodeGraph indexing is a local, staged pipeline: discover eligible files, detect languages, load only needed grammars, parse safely through a worker when possible, extract graph-shaped records, and persist only valid nodes, edges, unresolved references, and file metadata. Sources: [src/extraction/index.ts:512-600](), [src/extraction/tree-sitter.ts:2487-2547](), [src/extraction/index.ts:1154-1225]()
