# Static Analysis: Tree-Sitter Extractors & Parsers

> Two plugin families produce deterministic graph nodes without LLM calls. Language extractors (TypeScript, Python, Go, Java, Rust, C++, Ruby, C#, PHP) use web-tree-sitter (WASM) via tree-sitter-plugin.ts to parse ASTs and emit function/class/module nodes. Config parsers (JSON, YAML, TOML, SQL, GraphQL, Dockerfile, Protobuf, Makefile, shell, Markdown, Terraform, .env) extract config/schema/document nodes. The plugin registry in registry.ts and discovery.ts wires both families together. The WASM constraint — no native bindings — is a hard invariant: never swap in the native tree-sitter package.

- Repository: Lum1104/Understand-Anything
- GitHub: https://github.com/Lum1104/Understand-Anything
- Human wiki: https://grok-wiki.com/public/wiki/lum1104-understand-anything-3b923df96896
- Complete Markdown: https://grok-wiki.com/public/wiki/lum1104-understand-anything-3b923df96896/llms-full.txt

## Source Files

- `understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts`
- `understand-anything-plugin/packages/core/src/plugins/registry.ts`
- `understand-anything-plugin/packages/core/src/plugins/discovery.ts`
- `understand-anything-plugin/packages/core/src/plugins/extractors/base-extractor.ts`
- `understand-anything-plugin/packages/core/src/plugins/extractors/typescript-extractor.ts`
- `understand-anything-plugin/packages/core/src/plugins/parsers/index.ts`
- `understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.test.ts`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts](understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts)
- [understand-anything-plugin/packages/core/src/plugins/registry.ts](understand-anything-plugin/packages/core/src/plugins/registry.ts)
- [understand-anything-plugin/packages/core/src/plugins/discovery.ts](understand-anything-plugin/packages/core/src/plugins/discovery.ts)
- [understand-anything-plugin/packages/core/src/plugins/extractors/base-extractor.ts](understand-anything-plugin/packages/core/src/plugins/extractors/base-extractor.ts)
- [understand-anything-plugin/packages/core/src/plugins/extractors/typescript-extractor.ts](understand-anything-plugin/packages/core/src/plugins/extractors/typescript-extractor.ts)
- [understand-anything-plugin/packages/core/src/plugins/extractors/types.ts](understand-anything-plugin/packages/core/src/plugins/extractors/types.ts)
- [understand-anything-plugin/packages/core/src/plugins/extractors/index.ts](understand-anything-plugin/packages/core/src/plugins/extractors/index.ts)
- [understand-anything-plugin/packages/core/src/plugins/parsers/index.ts](understand-anything-plugin/packages/core/src/plugins/parsers/index.ts)
- [understand-anything-plugin/packages/core/src/plugins/parsers/yaml-parser.ts](understand-anything-plugin/packages/core/src/plugins/parsers/yaml-parser.ts)
- [understand-anything-plugin/packages/core/src/languages/configs/index.ts](understand-anything-plugin/packages/core/src/languages/configs/index.ts)
- [understand-anything-plugin/packages/core/src/languages/configs/typescript.ts](understand-anything-plugin/packages/core/src/languages/configs/typescript.ts)
- [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.test.ts](understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.test.ts)
</details>


The static analysis layer is the deterministic, LLM-free core of Understand Anything's knowledge graph construction. It is responsible for turning raw source files into structured graph nodes — functions, classes, imports, exports, and config sections — without ever making an AI call. Two plugin families do this work: **language extractors** that parse programming-language ASTs via `web-tree-sitter` (WASM), and **config parsers** that parse infrastructure and documentation formats using ordinary text or library-level parsers.

Understanding this layer is important for two reasons. First, it defines the accuracy ceiling for structural analysis: every node that appears in the graph without an LLM annotation was produced here. Second, the WASM-only constraint is a hard architectural invariant: `web-tree-sitter` must be used instead of native bindings because native `tree-sitter` bindings fail on darwin/arm64 + Node 24. Swapping in the native package would break on Apple Silicon machines and in any CI job that runs on that platform, so this page makes the boundary explicit.

---

## Architecture Overview

```text
LanguageConfig (configs/index.ts)
  └─ treeSitter: { wasmPackage, wasmFile }   ← present only for code langs

Plugin initialization path:
  TreeSitterPlugin(configs[]) ──init()──► load WASM grammars via LanguageCls.load()
                                                    │
                              builtinExtractors[]    │
                              (TypeScript, Python,   │
                               Go, Java, Rust, …)   │
                                         ▼
                              analyzeFile(path, src)
                                   │
                              parser.parse(src) → AST root
                                   │
                              LanguageExtractor.extractStructure(root)
                                   │
                              → StructuralAnalysis { functions, classes, imports, exports }

Config parsers (non-code family):
  registerAllParsers(registry)  ←  MarkdownParser, YAMLConfigParser, SQLParser, …
  Each implements AnalyzerPlugin but uses library/regex parsing, not tree-sitter
  Returns: StructuralAnalysis { sections: SectionInfo[] }

PluginRegistry (registry.ts)
  register(plugin) ──► languageMap: lang → plugin
  analyzeFile(path, src) ──► getPluginForFile(path) ──► plugin.analyzeFile()
```

Sources: [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:32-98](), [understand-anything-plugin/packages/core/src/plugins/parsers/index.ts:31-44](), [understand-anything-plugin/packages/core/src/plugins/registry.ts:11-81]()

---

## The WASM Constraint

The single most important invariant in this subsystem: **`web-tree-sitter` (WASM) must be used instead of native `tree-sitter` bindings.**

The comment at the top of `tree-sitter-plugin.ts` explains the mechanics:

```ts
import { createRequire } from "node:module";

// web-tree-sitter uses CJS internally; we need createRequire for .wasm resolution
const require = createRequire(import.meta.url);
```

All `.wasm` grammar files are resolved with `require.resolve()`, not `import()`: each grammar package (for example `tree-sitter-typescript`) ships its `.wasm` file as a plain asset that Node's CommonJS resolver can locate by package-relative path. The `Parser` and `Language` classes come from `web-tree-sitter` exclusively; there is no code path that conditionally falls back to native bindings.

During `init()`, grammars are loaded with `LanguageCls.load(wasmPath)` in parallel:

```ts
const wasmPath = require.resolve(
  `${config.treeSitter!.wasmPackage}/${config.treeSitter!.wasmFile}`,
);
const lang = await LanguageCls.load(wasmPath);
this._languages.set(config.id, lang);
```

Sources: [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:1-14, 125-198]()
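
The parallel load described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's actual code: `GrammarConfig`, `loadGrammars`, and `fakeLoad` are hypothetical names, and the injected `load` callback stands in for `LanguageCls.load(wasmPath)` so the snippet runs without any WASM files. Using `Promise.allSettled` to isolate a single failed grammar is also an assumption about how failures are contained.

```ts
// Hypothetical shape: only the fields this sketch needs.
interface GrammarConfig {
  id: string;
  wasmPath: string;
}

// Load every grammar concurrently; keep the ones that succeed and
// report the ids that failed instead of throwing.
async function loadGrammars<T>(
  configs: GrammarConfig[],
  load: (wasmPath: string) => Promise<T>, // stand-in for LanguageCls.load
): Promise<{ loaded: Map<string, T>; failed: string[] }> {
  const results = await Promise.allSettled(
    configs.map((c) => load(c.wasmPath)),
  );
  const loaded = new Map<string, T>();
  const failed: string[] = [];
  results.forEach((result, i) => {
    const id = configs[i].id;
    if (result.status === "fulfilled") loaded.set(id, result.value);
    else failed.push(id);
  });
  return { loaded, failed };
}

// Fake loader for demonstration: pretends the Go grammar is missing.
const fakeLoad = async (wasmPath: string): Promise<string> => {
  if (wasmPath.includes("go")) throw new Error(`cannot find ${wasmPath}`);
  return `grammar:${wasmPath}`;
};
```

Called with a TypeScript and a Go config, this sketch would place `typescript` in `loaded` and `go` in `failed`, which matches the "skip the language, keep going" behavior described in the failure-mode table.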

---

## Language Configuration

Each programming language that has tree-sitter support carries a `LanguageConfig` with a `treeSitter` field. Configs without that field are non-code languages and are handled by the config-parser family instead.

Example — the TypeScript config:

```ts
// understand-anything-plugin/packages/core/src/languages/configs/typescript.ts
export const typescriptConfig = {
  id: "typescript",
  displayName: "TypeScript",
  extensions: [".ts", ".tsx"],
  treeSitter: {
    wasmPackage: "tree-sitter-typescript",
    wasmFile: "tree-sitter-typescript.wasm",
  },
  // ...
} satisfies LanguageConfig;
```

The `TreeSitterPlugin` constructor filters `configs` to only those with a `treeSitter` field, then builds the extension-to-language map from them. Languages without a `treeSitter` field are excluded from tree-sitter parsing: a config parser handles them if one is registered for that language, and otherwise the file falls through to the LLM agent.

Sources: [understand-anything-plugin/packages/core/src/languages/configs/typescript.ts:1-30](), [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:57-97]()
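
As a rough sketch of the filtering step, the snippet below keeps only configs that declare a grammar and indexes them by file extension. `MiniLanguageConfig`, `buildExtensionMap`, and the sample data are hypothetical and trimmed down for illustration; the real constructor may organize this differently.

```ts
// Hypothetical, trimmed-down version of LanguageConfig.
interface MiniLanguageConfig {
  id: string;
  extensions: string[];
  treeSitter?: { wasmPackage: string; wasmFile: string };
}

// Keep only configs that declare a grammar, then index them by extension.
function buildExtensionMap(configs: MiniLanguageConfig[]): Map<string, string> {
  const extensionToLanguage = new Map<string, string>();
  for (const config of configs) {
    if (!config.treeSitter) continue; // non-code language: not tree-sitter's job
    for (const ext of config.extensions) {
      extensionToLanguage.set(ext, config.id);
    }
  }
  return extensionToLanguage;
}

const sampleConfigs: MiniLanguageConfig[] = [
  {
    id: "typescript",
    extensions: [".ts", ".tsx"],
    treeSitter: {
      wasmPackage: "tree-sitter-typescript",
      wasmFile: "tree-sitter-typescript.wasm",
    },
  },
  { id: "yaml", extensions: [".yaml", ".yml"] }, // no treeSitter field
];

const extensionMap = buildExtensionMap(sampleConfigs);
console.log(extensionMap.get(".ts")); // "typescript"
console.log(extensionMap.has(".yaml")); // false
```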

### TSX Special Case

TypeScript has a two-grammar special case: when the TypeScript grammar is loaded, the plugin also attempts to load `tree-sitter-tsx.wasm` from the same WASM package. `.tsx` files are parsed with the TSX grammar (stored under its own grammar key, `"tsx"`), but extraction is routed to the `TypeScriptExtractor` because the extraction logic is identical for both grammars:

```ts
private getExtractor(langKey: string): LanguageExtractor | null {
  // tsx is a synthetic grammar key — extraction logic is identical to typescript
  const key = langKey === "tsx" ? "typescript" : langKey;
  return this.extractors.get(key) ?? null;
}
```

Sources: [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:106-110, 150-162]()
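
The two-step routing (pick a grammar key, then pick an extractor key) might look like the sketch below. `grammarKeyFor`, `extractorKeyFor`, and `loadedGrammars` are hypothetical names. Falling back to the plain TypeScript grammar when the TSX grammar did not load is an assumption, since the source only says the plugin "attempts" to load it.

```ts
// Pick the grammar key for a file. `loadedGrammars` is a hypothetical set
// of grammar keys that loaded successfully during init().
function grammarKeyFor(
  filePath: string,
  languageId: string,
  loadedGrammars: ReadonlySet<string>,
): string {
  const isTsx = languageId === "typescript" && filePath.endsWith(".tsx");
  // Assumption: fall back to the plain TypeScript grammar if TSX is absent.
  return isTsx && loadedGrammars.has("tsx") ? "tsx" : languageId;
}

// Both grammar keys share one extractor, mirroring getExtractor() above.
function extractorKeyFor(grammarKey: string): string {
  return grammarKey === "tsx" ? "typescript" : grammarKey;
}

const availableGrammars = new Set(["typescript", "tsx"]);
const tsxGrammarKey = grammarKeyFor("src/App.tsx", "typescript", availableGrammars);
console.log(tsxGrammarKey, extractorKeyFor(tsxGrammarKey)); // tsx typescript
```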

---

## Language Extractor Family

### Interface Contract

Every language extractor implements the `LanguageExtractor` interface:

```ts
// understand-anything-plugin/packages/core/src/plugins/extractors/types.ts
export interface LanguageExtractor {
  languageIds: string[];
  extractStructure(rootNode: TreeSitterNode): StructuralAnalysis;
  extractCallGraph(rootNode: TreeSitterNode): CallGraphEntry[];
}
```

`extractStructure` returns `{ functions, classes, imports, exports }`. `extractCallGraph` returns `{ caller, callee, lineNumber }[]`. The `TreeSitterNode` passed in is `web-tree-sitter`'s `Node` type: the root of an already-parsed AST.

Sources: [understand-anything-plugin/packages/core/src/plugins/extractors/types.ts:1-19]()
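
To make the contract concrete, here is a toy extractor written against stand-in types. Everything in it is hypothetical (`AstNode`, `ToyStructure`, `ToyCall`, `ToyExtractor`, `toyExtractor`): the real `StructuralAnalysis` and `TreeSitterNode` carry more fields, and this sketch only records the names of top-level function declarations.

```ts
// Minimal stand-ins for the real types; field names here are assumptions.
interface AstNode {
  type: string;
  text: string;
  children: AstNode[];
}
interface ToyStructure {
  functions: { name: string }[];
  classes: { name: string }[];
  imports: string[];
  exports: string[];
}
interface ToyCall {
  caller: string;
  callee: string;
  lineNumber: number;
}
interface ToyExtractor {
  languageIds: string[];
  extractStructure(rootNode: AstNode): ToyStructure;
  extractCallGraph(rootNode: AstNode): ToyCall[];
}

// Toy extractor: records the identifier of each top-level function_declaration.
const toyExtractor: ToyExtractor = {
  languageIds: ["toy"],
  extractStructure(rootNode) {
    const functions = rootNode.children
      .filter((child) => child.type === "function_declaration")
      .map((fn) => {
        const id = fn.children.find((c) => c.type === "identifier");
        return { name: id ? id.text : "<anonymous>" };
      });
    return { functions, classes: [], imports: [], exports: [] };
  },
  extractCallGraph() {
    return []; // call graph omitted in this sketch
  },
};

// Hand-built tree standing in for a parsed `function greet() {}`.
const toyRoot: AstNode = {
  type: "program",
  text: "",
  children: [
    {
      type: "function_declaration",
      text: "function greet() {}",
      children: [{ type: "identifier", text: "greet", children: [] }],
    },
  ],
};

console.log(toyExtractor.extractStructure(toyRoot).functions.map((f) => f.name));
```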

### Built-in Extractors

Nine extractors ship as builtins:

| Extractor | `languageIds` |
|---|---|
| `TypeScriptExtractor` | `typescript`, `javascript` |
| `PythonExtractor` | `python` |
| `GoExtractor` | `go` |
| `RustExtractor` | `rust` |
| `JavaExtractor` | `java` |
| `RubyExtractor` | `ruby` |
| `PhpExtractor` | `php` |
| `CppExtractor` | `cpp` (and `c`) |
| `CSharpExtractor` | `csharp` |

All nine are instantiated at module load and collected in `builtinExtractors[]`. The `TreeSitterPlugin` constructor registers them all by default:

```ts
for (const extractor of builtinExtractors) {
  this.registerExtractor(extractor);
}
```

Sources: [understand-anything-plugin/packages/core/src/plugins/extractors/index.ts:24-34](), [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:88-97]()

### Base Extractor Utilities

Rather than duplicating traversal logic in every extractor, `base-extractor.ts` provides shared helpers:

| Helper | Purpose |
|---|---|
| `traverse(node, visitor)` | Depth-first recursive walk |
| `getStringValue(node)` | Strips quotes from string-like nodes |
| `findChild(node, type)` | First child with matching `node.type` |
| `findChildren(node, type)` | All children with matching `node.type` |
| `hasChildOfType(node, type)` | Boolean check for export/visibility |

Sources: [understand-anything-plugin/packages/core/src/plugins/extractors/base-extractor.ts:1-53]()
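
The helpers in the table are small enough to sketch in full. The versions below are written against a hypothetical `WalkNode` shape rather than the real tree-sitter node, and their details (pre-order visiting, returning `null` when nothing matches, stripping exactly one pair of matching quotes) are assumptions rather than a copy of `base-extractor.ts`.

```ts
// Hypothetical minimal node shape for this sketch.
interface WalkNode {
  type: string;
  text: string;
  children: WalkNode[];
}

// Depth-first walk; this sketch visits the parent before its children.
function traverse(node: WalkNode, visitor: (n: WalkNode) => void): void {
  visitor(node);
  for (const child of node.children) traverse(child, visitor);
}

// First direct child with the given type, or null.
function findChild(node: WalkNode, type: string): WalkNode | null {
  return node.children.find((c) => c.type === type) ?? null;
}

// All direct children with the given type.
function findChildren(node: WalkNode, type: string): WalkNode[] {
  return node.children.filter((c) => c.type === type);
}

// True if any direct child has the given type.
function hasChildOfType(node: WalkNode, type: string): boolean {
  return node.children.some((c) => c.type === type);
}

// Strip one pair of matching surrounding quotes, if present.
function getStringValue(node: WalkNode): string {
  return node.text.replace(/^(["'`])([\s\S]*)\1$/, "$2");
}

const importNode: WalkNode = {
  type: "import_statement",
  text: 'import x from "./foo"',
  children: [
    { type: "identifier", text: "x", children: [] },
    { type: "string", text: '"./foo"', children: [] },
  ],
};

const sourceNode = findChild(importNode, "string");
console.log(sourceNode ? getStringValue(sourceNode) : null); // ./foo
```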

### TypeScript Extractor in Detail

The `TypeScriptExtractor` is the most fully-featured extractor and illustrates the pattern used by all others. It processes top-level AST nodes in a single pass:

```ts
// understand-anything-plugin/packages/core/src/plugins/extractors/typescript-extractor.ts
switch (node.type) {
  case "function_declaration":    this.extractFunction(node, functions); break;
  case "class_declaration":       this.extractClass(node, classes); break;
  case "lexical_declaration":
  case "variable_declaration":    this.extractVariableDeclarations(node, functions); break;
  case "import_statement":        this.extractImport(node, imports); break;
  case "export_statement":        this.processExportStatement(...); break;
}
```

**Function extraction** handles three forms: `function` declarations, arrow functions assigned to `const`/`let`, and function expressions. Parameters are extracted with full support for `required_parameter`, `optional_parameter`, rest parameters (`...args`), and plain JavaScript identifiers. Return type annotations are stripped of their leading `:`.

**Call graph extraction** uses a function stack. A depth-first walk pushes function names as scope is entered and pops on exit. Every `call_expression` node encountered emits a `{ caller, callee, lineNumber }` entry using the top of the stack as caller:

```ts
if (node.type === "call_expression") {
  const callee = node.childForFieldName("function");
  if (callee && functionStack.length > 0) {
    entries.push({
      caller: functionStack[functionStack.length - 1],
      callee: callee.text,
      lineNumber: node.startPosition.row + 1,
    });
  }
}
```

Sources: [understand-anything-plugin/packages/core/src/plugins/extractors/typescript-extractor.ts:106-194]()
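
The function-stack technique can be shown end to end on a hand-built tree. In this sketch `CallNode`, `CallEdge`, and `extractCalls` are hypothetical, and the node carries `name`, `callee`, and `row` directly instead of going through `childForFieldName` and `startPosition`, so it runs without a parser.

```ts
// Hypothetical simplified node for this sketch.
interface CallNode {
  type: string;
  name?: string; // set on function_declaration nodes in this sketch
  callee?: string; // set on call_expression nodes in this sketch
  row: number; // zero-based line, like startPosition.row
  children: CallNode[];
}
interface CallEdge {
  caller: string;
  callee: string;
  lineNumber: number;
}

function extractCalls(root: CallNode): CallEdge[] {
  const entries: CallEdge[] = [];
  const functionStack: string[] = [];

  const walk = (node: CallNode): void => {
    // Entering a named function pushes it as the current caller scope.
    const fnName = node.type === "function_declaration" ? node.name : undefined;
    if (fnName !== undefined) functionStack.push(fnName);

    // Calls outside any function (empty stack) are ignored, as in the excerpt above.
    if (
      node.type === "call_expression" &&
      node.callee !== undefined &&
      functionStack.length > 0
    ) {
      entries.push({
        caller: functionStack[functionStack.length - 1],
        callee: node.callee,
        lineNumber: node.row + 1, // convert zero-based row to one-based line
      });
    }

    for (const child of node.children) walk(child);

    // Leaving the function pops its scope.
    if (fnName !== undefined) functionStack.pop();
  };

  walk(root);
  return entries;
}

// outer() calls helper() on line 2; main() is called at top level on line 4.
const callTree: CallNode = {
  type: "program",
  row: 0,
  children: [
    {
      type: "function_declaration",
      name: "outer",
      row: 0,
      children: [{ type: "call_expression", callee: "helper", row: 1, children: [] }],
    },
    { type: "call_expression", callee: "main", row: 3, children: [] },
  ],
};

console.log(extractCalls(callTree)); // one edge: outer -> helper at line 2
```

Note that the top-level `main()` call produces no edge because the stack is empty at that point; this follows from the `functionStack.length > 0` guard in the original excerpt.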

### Lifecycle: init() → analyzeFile()

The plugin separates async initialization from synchronous analysis. Grammar loading is async (WASM); parsing and extraction are synchronous once grammars are resident in memory.

```text
await plugin.init()      // loads all .wasm grammars in parallel
↓
plugin.analyzeFile(path, src)
  getParser(path)         // synchronous: looks up pre-loaded Language, creates Parser
  parser.parse(src)       // synchronous: returns Tree
  extractor.extractStructure(tree.rootNode)
  tree.delete(); parser.delete()  // explicit WASM memory cleanup
  return StructuralAnalysis
```

If a grammar failed to load during `init()` (the npm package is missing or the WASM file is absent), `getParser()` returns `null` and `analyzeFile()` returns an empty `StructuralAnalysis`. This graceful degradation means an unsupported language never throws — it simply contributes no structural nodes, and the LLM agent fills the gap.

Sources: [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:125-250]()
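
The synchronous half of this lifecycle, including the degrade-to-empty path and explicit cleanup, could be sketched as below. `FakeTree`, `FakeParser`, `BasicAnalysis`, and `analyzeWith` are hypothetical stand-ins, and the nested `try`/`finally` is one way to guarantee cleanup on every exit path; the real `analyzeFile()` may structure this differently.

```ts
// Hypothetical stand-ins for the web-tree-sitter Tree and Parser objects.
interface FakeTree {
  rootNode: { type: string };
  delete(): void;
}
interface FakeParser {
  parse(src: string): FakeTree | null;
  delete(): void;
}
interface BasicAnalysis {
  functions: string[];
  classes: string[];
  imports: string[];
  exports: string[];
}

const emptyAnalysis = (): BasicAnalysis => ({
  functions: [],
  classes: [],
  imports: [],
  exports: [],
});

function analyzeWith(
  getParser: () => FakeParser | null,
  extract: (root: { type: string }) => BasicAnalysis,
  src: string,
): BasicAnalysis {
  const parser = getParser();
  if (!parser) return emptyAnalysis(); // grammar never loaded: degrade, do not throw
  try {
    const tree = parser.parse(src);
    if (!tree) return emptyAnalysis();
    try {
      return extract(tree.rootNode);
    } finally {
      tree.delete(); // free the tree's WASM-side memory
    }
  } finally {
    parser.delete(); // free the parser on every exit path
  }
}

// Demonstration with a fake parser that counts cleanup calls.
let deleteCalls = 0;
const countingParser: FakeParser = {
  parse: () => ({ rootNode: { type: "program" }, delete: () => void deleteCalls++ }),
  delete: () => void deleteCalls++,
};

const analyzed = analyzeWith(
  () => countingParser,
  () => ({ ...emptyAnalysis(), functions: ["greet"] }),
  "function greet() {}",
);
const degraded = analyzeWith(() => null, () => emptyAnalysis(), "anything");
console.log(analyzed.functions, degraded.functions, deleteCalls);
```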

---

## Config Parser Family

Config parsers handle non-code file formats. They implement the same `AnalyzerPlugin` interface but do not use tree-sitter at all — they use format-specific libraries (e.g., the `yaml` npm package) or regular expressions.

### Built-in Config Parsers

```ts
// understand-anything-plugin/packages/core/src/plugins/parsers/index.ts
export function registerAllParsers(registry: PluginRegistry): void {
  registry.register(new MarkdownParser());
  registry.register(new YAMLConfigParser());
  registry.register(new JSONConfigParser());
  registry.register(new TOMLParser());
  registry.register(new EnvParser());
  registry.register(new DockerfileParser());
  registry.register(new SQLParser());
  registry.register(new GraphQLParser());
  registry.register(new ProtobufParser());
  registry.register(new TerraformParser());
  registry.register(new MakefileParser());
  registry.register(new ShellParser());
}
```

Sources: [understand-anything-plugin/packages/core/src/plugins/parsers/index.ts:31-44]()

### Output Shape

Config parsers emit `StructuralAnalysis` with `sections: SectionInfo[]` populated and `functions/classes/imports/exports` left empty. A `SectionInfo` carries `{ name, level, lineRange }` — enough for the knowledge graph to represent top-level configuration structure (e.g., YAML top-level keys, SQL statement blocks, Makefile targets).

### YAML Parser: Library + Regex Fallback

The `YAMLConfigParser` illustrates the robustness pattern used across config parsers. It first attempts a proper library parse with the `yaml` npm package; if that throws, it falls back to a line-level regex:

```ts
try {
  const doc = parseYAML(content);
  // extract top-level keys from the parsed object
} catch (err) {
  console.warn(`[yaml-parser] YAML parse failed, falling back to regex...`);
  const lines = content.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const match = lines[i].match(/^(\w[\w-]*)\s*:/);
    if (match) sections.push({ name: match[1], level: 1, lineRange: [i+1, i+1] });
  }
}
```

The `YAMLConfigParser.languages` array also includes YAML-flavored variants (`docker-compose`, `kubernetes`, `github-actions`, `openapi`) so the language registry can route those file types without falling through to the "no plugin matched" branch.

Sources: [understand-anything-plugin/packages/core/src/plugins/parsers/yaml-parser.ts:14-106]()
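
To see what the regex fallback recovers on its own, the loop above can be lifted into a standalone function. `YamlSection` and `regexTopLevelKeys` are hypothetical names for this sketch; the regex is the one shown in the excerpt.

```ts
// Hypothetical section shape matching the { name, level, lineRange } description.
interface YamlSection {
  name: string;
  level: number;
  lineRange: [number, number];
}

// Line-level fallback: any unindented `key:` becomes a level-1 section.
function regexTopLevelKeys(content: string): YamlSection[] {
  const sections: YamlSection[] = [];
  const lines = content.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const match = lines[i].match(/^(\w[\w-]*)\s*:/);
    if (match) {
      sections.push({ name: match[1], level: 1, lineRange: [i + 1, i + 1] });
    }
  }
  return sections;
}

// Malformed YAML (unclosed flow sequence) that a strict parser would reject.
const brokenYaml = [
  "services:",
  "  web:",
  "    image: nginx",
  "volumes: [unclosed",
].join("\n");

console.log(regexTopLevelKeys(brokenYaml).map((s) => s.name)); // services, volumes
```

Indented keys such as `web` and `image` are skipped because the pattern is anchored at column zero. The same anchoring means quoted keys or keys containing dots are not picked up by this fallback.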

---

## Plugin Registry Wiring

`PluginRegistry` is the unified dispatch layer. It maintains a `languageMap: Map<string, AnalyzerPlugin>` populated by calls to `register()`. Both `TreeSitterPlugin` and config parsers register through the same interface.

File-to-plugin routing goes through `LanguageRegistry.getForFile(filePath)` — an extension-based lookup that returns a `LanguageConfig`. The config's `id` is then used to look up the plugin:

```ts
getPluginForFile(filePath: string): AnalyzerPlugin | null {
  const langConfig = this.languageRegistry.getForFile(filePath);
  if (!langConfig) return null;
  return this.getPluginForLanguage(langConfig.id);
}
```

This means the extension-to-language mapping lives in `LanguageRegistry` (driven by `builtinLanguageConfigs`), not duplicated in each plugin.

Sources: [understand-anything-plugin/packages/core/src/plugins/registry.ts:39-48]()
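
A compact sketch of this dispatch path is shown below. `MiniAnalyzer` and `MiniRegistry` are hypothetical, a plain extension map replaces `LanguageRegistry`, and "last registration wins" for a language id is an assumption about conflict handling.

```ts
// Hypothetical, reduced plugin interface for this sketch.
interface MiniAnalyzer {
  name: string;
  languages: string[];
  analyzeFile(filePath: string, content: string): { sections: string[] };
}

class MiniRegistry {
  private languageMap = new Map<string, MiniAnalyzer>();
  private extensionToLanguage: Map<string, string>;

  // The extension map stands in for LanguageRegistry.getForFile().
  constructor(extensionToLanguage: Map<string, string>) {
    this.extensionToLanguage = extensionToLanguage;
  }

  // Assumption: a later registration for the same language replaces the earlier one.
  register(plugin: MiniAnalyzer): void {
    for (const lang of plugin.languages) this.languageMap.set(lang, plugin);
  }

  getPluginForFile(filePath: string): MiniAnalyzer | null {
    const dot = filePath.lastIndexOf(".");
    if (dot === -1) return null;
    const languageId = this.extensionToLanguage.get(filePath.slice(dot));
    if (languageId === undefined) return null;
    return this.languageMap.get(languageId) ?? null;
  }

  analyzeFile(filePath: string, content: string): { sections: string[] } | null {
    const plugin = this.getPluginForFile(filePath);
    return plugin ? plugin.analyzeFile(filePath, content) : null;
  }
}

const miniRegistry = new MiniRegistry(
  new Map([
    [".yml", "yaml"],
    [".yaml", "yaml"],
  ]),
);
miniRegistry.register({
  name: "yaml-config",
  languages: ["yaml"],
  analyzeFile: () => ({ sections: ["services"] }),
});

console.log(miniRegistry.getPluginForFile("ci/deploy.yml")?.name); // yaml-config
console.log(miniRegistry.analyzeFile("notes.xyz", "")); // null
```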

### Default Plugin Configuration

`discovery.ts` holds `DEFAULT_PLUGIN_CONFIG`, which is derived at module load from `builtinLanguageConfigs` — specifically those configs that carry a `treeSitter` field. This is what gets written to disk when no user-supplied config exists:

```ts
export const DEFAULT_PLUGIN_CONFIG: PluginConfig = {
  plugins: [
    {
      name: "tree-sitter",
      enabled: true,
      languages: builtinLanguageConfigs
        .filter((c) => c.treeSitter)
        .map((c) => c.id),
    },
  ],
};
```

Sources: [understand-anything-plugin/packages/core/src/plugins/discovery.ts:14-24]()

---

## Failure Modes and Boundaries

| Failure | Behavior |
|---|---|
| WASM grammar `.wasm` file missing at `require.resolve()` time | `init()` logs a debug message, that language is skipped; `analyzeFile()` returns empty `StructuralAnalysis` |
| File extension not in any language config | `getPluginForFile()` returns `null`; no analysis is produced |
| `parser.parse()` returns `null` | Extraction is skipped for that file; `tree.delete()` / `parser.delete()` cleanup still runs on every exit path |
| Config parser library throws (e.g., malformed YAML) | Regex fallback extracts sections where possible |
| Native `tree-sitter` package swapped in | Build failure on darwin/arm64 + Node 24 — this is a hard invariant, not a handled failure |

The design choice to return empty results rather than throw means analysis can proceed even when some languages are unavailable. LLM agents are the designated fallback for files that static analysis cannot parse.

---

## Summary

The static analysis subsystem is built around two plugin families that share a single `PluginRegistry`. Language extractors use `web-tree-sitter` (WASM) — a non-negotiable constraint driven by platform compatibility — to parse ASTs and extract functions, classes, imports, and call graphs for nine programming languages. Config parsers handle a dozen infrastructure and documentation formats using library or regex parsing, emitting section-level structure. Both families are registered identically into `PluginRegistry`, which dispatches by language id derived from `LanguageRegistry`. The WASM approach means `init()` is async but `analyzeFile()` is synchronous, and missing grammars degrade gracefully to empty output rather than hard failures.

Sources: [understand-anything-plugin/packages/core/src/plugins/tree-sitter-plugin.ts:222-251]()
