# Knowledge Abstracts

> The on-disk Knowledge Abstract (KA) model: `data.json`, `metadata.json`, and `index/` layout; lifecycle methods (`parse`, `feed_text`, `dump`, `load`, `build_index`); and incremental evolution via `he feed`.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/types/base.py`
- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/cli/config.py`
- `hyperextract/utils/template_engine/factory.py`

---


A Knowledge Abstract (KA) is a directory written by `BaseAutoType.dump()` and restored by `BaseAutoType.load()`. Every object returned by `Template.create()` is an instance of a `BaseAutoType` subclass. `he parse` writes a new KA directory; `he feed` loads an existing one, merges new text, and saves it back.

## On-disk layout

`dump()` writes three components under a single output directory:

:::files
my_ka/
├── data.json          # Structured knowledge (Pydantic schema JSON)
├── metadata.json      # Template ID, language, timestamps, type
└── index/             # FAISS vector store (optional until built)
    ├── index.faiss        # AutoModel, AutoList, AutoSet
    ├── docstore.json
    ├── node_index/        # AutoGraph and graph variants
    │   ├── index.faiss
    │   └── docstore.json
    └── edge_index/
        ├── index.faiss
        └── docstore.json
:::

| File | Required | Role |
|------|----------|------|
| `data.json` | Yes | Serialized `ka.data` via `model_dump()` |
| `metadata.json` | Recommended | Provenance and CLI template resolution |
| `index/` | No | Semantic search index; empty until `build_index()` |

<Warning>
`he search` and `he talk` require a non-empty `index/` directory. Run `he build-index <ka_path>` if you passed `--no-index` to `he parse`, or after any `he feed`.
</Warning>

### `data.json` shape by AutoType

The JSON mirrors the active Pydantic schema for the template's AutoType:

| AutoType | Top-level keys | Example use |
|----------|----------------|-------------|
| `AutoModel` | Schema field names (`name`, `summary`, …) | Single structured record |
| `AutoList` / `AutoSet` | `items` (array of objects) | Collections |
| `AutoGraph` (+ variants) | `nodes`, `edges` | Entity–relation graphs |

`he info` counts nodes/edges from `nodes`/`edges` or falls back to `entities`/`relations` for method outputs.
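
The fallback described above can be sketched in a few lines. This is only an illustration of the key lookup, not the CLI's actual code: `count_graph_items` is a hypothetical helper, and the fields inside each node and edge are made up for the example.

```python
def count_graph_items(data: dict) -> tuple[int, int]:
    """Return (node_count, edge_count) for a graph-shaped data.json payload."""
    # Prefer the AutoGraph keys; fall back to the keys used by method outputs.
    nodes = data["nodes"] if "nodes" in data else data.get("entities", [])
    edges = data["edges"] if "edges" in data else data.get("relations", [])
    return len(nodes), len(edges)


# Illustrative payloads; real node and edge fields depend on the template schema.
graph_ka = {
    "nodes": [{"name": "Tesla"}, {"name": "Edison"}],
    "edges": [{"source": "Tesla", "target": "Edison"}],
}
method_ka = {"entities": [{"name": "Tesla"}], "relations": []}

print(count_graph_items(graph_ka))   # (2, 1)
print(count_graph_items(method_ka))  # (1, 0)
```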

### `metadata.json` fields

Written by `dump_metadata()` from the in-memory `metadata` dict. `TemplateFactory.create()` seeds:

<ResponseField name="template" type="string">
Template path (e.g. `general/biography_graph`, `method/light_rag`) or custom YAML stem. Used by `get_template_from_ka()` to recreate the correct AutoType on load.
</ResponseField>

<ResponseField name="lang" type="string">
Language code (`zh`, `en`). Knowledge templates require the same value at load time. Method templates always store `en`.
</ResponseField>

<ResponseField name="type" type="string">
AutoType discriminator: `model`, `list`, `set`, `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, or `spatio_temporal_graph`.
</ResponseField>

<ResponseField name="created_at" type="string">
ISO timestamp set at first extraction.
</ResponseField>

<ResponseField name="updated_at" type="string">
ISO timestamp updated on every `feed_text()` or `clear()`.
</ResponseField>

Custom keys can be added before `dump()` and round-trip through `load_metadata()`.
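
As a rough picture of what such a file could hold, the sketch below writes a metadata dict with one extra key to a temporary `metadata.json` and reads it back using only the standard library. The field values, the `source_file` key, and the `roundtrip_metadata` helper are assumptions for illustration; the library's own `dump_metadata()` and `load_metadata()` may format the file differently.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def roundtrip_metadata(metadata: dict) -> dict:
    """Write metadata to a temporary metadata.json and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "metadata.json"
        path.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")
        return json.loads(path.read_text(encoding="utf-8"))


now = datetime.now(timezone.utc).isoformat()
metadata = {
    "template": "general/biography_graph",
    "lang": "en",
    "type": "graph",                          # illustrative value
    "created_at": now,
    "updated_at": now,
    "source_file": "examples/en/tesla.md",    # hypothetical custom key
}

restored = roundtrip_metadata(metadata)
print(restored["source_file"])
```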

## Lifecycle architecture

`BaseAutoType` owns extraction, merge, indexing, and serialization. State changes flow through three hooks:

```mermaid
stateDiagram-v2
    [*] --> Empty: __init__ / clear()
    Empty --> Populated: parse() or feed_text()
    Populated --> Populated: feed_text() merge
    Populated --> Indexed: build_index()
    Indexed --> StaleIndex: feed_text() clears index
    StaleIndex --> Indexed: build_index()
    Populated --> OnDisk: dump()
    Indexed --> OnDisk: dump()
    OnDisk --> Populated: load()
    OnDisk --> Indexed: load() with index/
```

| Method | Mutates caller | Index effect | Typical use |
|--------|----------------|--------------|-------------|
| `parse(text)` | No — returns new instance | Fresh instance, no index | Preview, branch, immutable pipeline |
| `feed_text(text)` | Yes — returns `self` | Clears index via `_update_data_state` | Incremental ingestion |
| `build_index()` | Yes | Builds FAISS from current data | Enable `search` / `chat` |
| `dump(folder)` | No | Writes data, metadata, index | Persist KA |
| `load(folder)` | Yes | Restores all three when present | Resume from disk |
| `clear()` | Yes | Resets data and index | Start over in memory |
| `clear_index()` | Yes | Drops index only | Force rebuild |
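
One way to reason about the in-memory part of the diagram is as a lookup table keyed by (state, call). The sketch below is a toy model only: the state names come from the diagram, while `TRANSITIONS`, `next_state`, and `replay` are hypothetical names, and the `StaleIndex` + `feed_text` edge is an assumption that the diagram does not draw.

```python
# (current state, method called) -> next state, following the diagram above.
TRANSITIONS = {
    ("Empty", "parse"): "Populated",
    ("Empty", "feed_text"): "Populated",
    ("Populated", "feed_text"): "Populated",
    ("Populated", "build_index"): "Indexed",
    ("Indexed", "feed_text"): "StaleIndex",
    ("StaleIndex", "build_index"): "Indexed",
    ("StaleIndex", "feed_text"): "StaleIndex",  # assumed; not drawn in the diagram
}


def next_state(state: str, call: str) -> str:
    """Return the state after one call; clear() resets from any state."""
    if call == "clear":
        return "Empty"
    return TRANSITIONS[(state, call)]  # KeyError for transitions not modeled here


def replay(calls: list[str], state: str = "Empty") -> str:
    """Apply a sequence of calls and return the final state."""
    for call in calls:
        state = next_state(state, call)
    return state


print(replay(["feed_text", "build_index", "feed_text"]))  # StaleIndex
```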

<Info>
Long texts are chunked (`chunk_size` default 2048, `chunk_overlap` 256), extracted in parallel (`max_workers` default 10), then merged with type-specific `merge_batch_data()` logic.
</Info>
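
To make the chunk-then-extract flow concrete, here is a minimal sketch that assumes character-based chunks and a thread pool. The real splitter may count tokens rather than characters and may parallelize differently; `chunk_text` and `extract_all` are hypothetical helpers, and the `extract` callable stands in for the LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def extract_all(chunks: list[str], extract: Callable[[str], object], max_workers: int = 10) -> list:
    """Run extract() over every chunk in parallel, keeping chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, chunks))


sample = "".join(chr(97 + i % 26) for i in range(5000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])  # [2048, 2048, 1416]
```

The per-chunk results would then be handed to the type-specific merge step.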

### `parse` vs `feed_text`

<CodeGroup>
```python title="Python — branch without mutating"
from hyperextract import Template

ka = Template.create("general/biography_graph", "en")
branch = ka.parse(document_text)   # new instance, original stays empty
branch.build_index()
branch.dump("./preview_ka/")
```

```python title="Python — evolve in place"
from hyperextract import Template

ka = Template.create("general/biography_graph", "en")
ka.feed_text(doc_a).feed_text(doc_b)   # method chaining
ka.build_index()
ka.dump("./main_ka/")
```
</CodeGroup>

`parse()` calls `_set_data_state()` (full replace). `feed_text()` calls `_update_data_state()` (incremental merge with deduplication rules per AutoType).

Two instances with the same schema can be merged with the `+` operator, which calls `merge_batch_data()` and returns a new instance.
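
The difference between the three operations can be mimicked with a toy class that has no LLM behind it. `ToySet` below is purely hypothetical: it treats whitespace-separated words as extracted items and uses set union as its merge rule, which only loosely stands in for the per-AutoType merge logic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToySet:
    """Toy stand-in for a KA; words play the role of extracted items."""

    items: set[str] = field(default_factory=set)

    @staticmethod
    def _extract(text: str) -> set[str]:
        return set(text.split())

    def parse(self, text: str) -> ToySet:
        # Returns a new instance; the caller is left untouched.
        return ToySet(items=self._extract(text))

    def feed_text(self, text: str) -> ToySet:
        # Merges into the caller and returns self, so calls can be chained.
        self.items |= self._extract(text)
        return self

    def __add__(self, other: ToySet) -> ToySet:
        # Merges two instances into a third one.
        return ToySet(items=self.items | other.items)


ka = ToySet()
branch = ka.parse("tesla edison")
print(sorted(ka.items), sorted(branch.items))  # [] ['edison', 'tesla']

ka.feed_text("tesla").feed_text("westinghouse")
merged = ka + branch
print(sorted(merged.items))  # ['edison', 'tesla', 'westinghouse']
```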

## CLI workflows

### Create a KA — `he parse`

<Steps>
<Step title="Resolve template and validate config">
`he parse` calls `validate_config()`, resolves `--template` / `--method`, and requires `--lang` for knowledge templates (methods force `en`).
</Step>
<Step title="Extract and save">
`Template.create()` → `feed_text()` on the input → `dump(output)`. Unless `--no-index` is set, it then calls `build_index()` and `dump()` again to persist the index.
</Step>
<Step title="Verify">
```bash
he info ./my_ka/
ls ./my_ka/    # expect data.json, metadata.json, index/
```
</Step>
</Steps>

<ParamField body="--output / -o" type="string" required>
Output directory. Fails if non-empty unless `--force`.
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip `build_index()`; KA is searchable only after `he build-index`.
</ParamField>

### Evolve a KA — `he feed`

`he feed` loads template and language from `metadata.json` (override with `--template` / `--lang`), then:

1. `ka.load(ka_path)`
2. `ka.feed_text(new_text)`
3. `ka.dump(ka_path)`

<Warning>
`he feed` does not rebuild the index. After feeding, run `he build-index <ka_path>` (or `he build-index <ka_path> --force`) before `he search` or `he talk`.
</Warning>
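
From Python, the feed and the rebuild can be done in one pass. The sketch below assumes `ka` is an object created with the KA's template and language and that it exposes the lifecycle methods described on this page; `feed_and_reindex` itself is a hypothetical helper, not part of the library.

```python
def feed_and_reindex(ka, ka_path: str, new_text: str) -> None:
    """Load a KA, merge new text, rebuild the index, and save in place.

    `ka` is assumed to expose load, feed_text, build_index, and dump.
    """
    ka.load(ka_path)         # restore data, metadata, and any existing index
    ka.feed_text(new_text)   # merge new knowledge; this clears the index
    ka.build_index()         # rebuild so search and chat work again
    ka.dump(ka_path)         # persist data, metadata, and the fresh index
```

A call would look like `feed_and_reindex(ka, "./tesla_ka/", new_text)`, with `ka` obtained from `Template.create(...)`.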

<RequestExample>
```bash
he parse examples/en/tesla.md -t general/biography_graph -l en -o ./tesla_ka/
he feed ./tesla_ka/ examples/en/another_doc.md
he build-index ./tesla_ka/ --force
he search ./tesla_ka/ "AC motor"
```
</RequestExample>

### Rebuild index — `he build-index`

Loads the KA, calls `clear_index()` first when `--force` is passed, then calls `build_index()` and `dump()` to refresh `index/`. Exits early if an index already exists and `--force` is omitted.
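
The control flow of that command might look roughly like the sketch below. It is an assumption-laden outline rather than the CLI's code: `has_index` and `rebuild_index` are hypothetical names, "index exists" is taken to mean a non-empty `index/` directory, and the early exit is placed before the load, which the real command may order differently.

```python
from pathlib import Path


def has_index(ka_path: str) -> bool:
    """True when index/ exists and contains at least one entry."""
    index_dir = Path(ka_path) / "index"
    return index_dir.is_dir() and any(index_dir.iterdir())


def rebuild_index(ka, ka_path: str, force: bool = False) -> bool:
    """Rebuild the index for a KA; return False when nothing was done."""
    if has_index(ka_path) and not force:
        return False          # index already present and --force omitted
    ka.load(ka_path)
    if force:
        ka.clear_index()      # drop the old index before rebuilding
    ka.build_index()
    ka.dump(ka_path)          # refresh index/ on disk
    return True
```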

## Python API reference

All lifecycle methods are defined on `BaseAutoType` and are available on the object returned by `Template.create()`:

```python
from hyperextract import Template

# Create (reads ~/.he/config.toml when clients omitted)
ka = Template.create("general/biography_graph", "en")

# Full cycle
with open("doc.md", encoding="utf-8") as f:   # close the file once it is read
    ka.feed_text(f.read())
ka.build_index()
ka.dump("./my_ka/")

# Reload later — template + lang must match metadata
ka2 = Template.create("general/biography_graph", "en")
ka2.load("./my_ka/")
results = ka2.search("wireless power", top_k=5)
answer = ka2.chat("What did Tesla invent?")
```

Granular serialization is also available:

| Method | Target |
|--------|--------|
| `dump_data(path)` | Single `data.json` |
| `dump_metadata(path)` | Single `metadata.json` |
| `dump_index(folder)` | FAISS files under `index/` |
| `load_data(path)` | Validates against schema, replaces state |
| `load_metadata(path)` | Merges into `metadata` dict |
| `load_index(folder)` | Restores FAISS; needs matching embedder |

<Info>
Index failures are non-fatal: `dump()` prints a warning if the index save fails; `load()` warns and leaves data intact so you can call `build_index()`.
</Info>

## Index internals

Hyper-Extract uses LangChain `FAISS` as the vector backend:

- **Scalar types** (`AutoModel`, `AutoList`, `AutoSet`): one `index/` with `index.faiss` and `docstore.json`.
- **Graph types** (`AutoGraph`, `AutoHypergraph`, temporal/spatial variants): `index/node_index/` and `index/edge_index/` subdirectories, each with its own FAISS store.

`build_index()` embeds text derived from stored objects. `search()` runs `similarity_search` and restores Pydantic items from `Document.metadata["raw"]`.

## Validation rules

The CLI enforces KA integrity before commands run:

| Validator | Checks |
|-----------|--------|
| `validate_ka_path` | Path exists and is a directory |
| `validate_ka_with_data` | `data.json` present |
| `validate_ka_with_index` | `index/` exists and is non-empty |
| `get_template_from_ka` | `metadata.json` with resolvable `template` (preset or local `{template}.yaml`) |

Custom templates copied into the KA directory during `he parse` are resolved from the local YAML on reload.
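
The same checks are easy to approximate outside the CLI, for example in a test or a pre-flight script. The sketch below is a simplified stand-in for the validators in the table: `check_ka` is a hypothetical function, and it only checks that `template` is present in `metadata.json` without resolving it against presets or a local YAML file.

```python
import json
from pathlib import Path


def check_ka(ka_path: str, need_index: bool = False) -> list[str]:
    """Return a list of problems found in a KA directory (empty means OK)."""
    root = Path(ka_path)
    if not root.is_dir():
        return [f"{root} is not a directory"]

    problems = []
    if not (root / "data.json").is_file():
        problems.append("data.json is missing")

    meta_path = root / "metadata.json"
    if not meta_path.is_file():
        problems.append("metadata.json is missing")
    elif not json.loads(meta_path.read_text(encoding="utf-8")).get("template"):
        problems.append("metadata.json has no template")

    index_dir = root / "index"
    if need_index and not (index_dir.is_dir() and any(index_dir.iterdir())):
        problems.append("index/ is missing or empty")
    return problems
```

Passing `need_index=True` mirrors what `he search` and `he talk` need; leaving it off mirrors commands that only read the data.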

## Related pages

<CardGroup>
<Card title="Auto-Types" href="/auto-types">
Eight extraction primitives, merge behavior, and index layout differences across `model`, `list`, `set`, and graph types.
</Card>
<Card title="Extract and evolve" href="/extract-and-evolve">
End-to-end `he parse`, `he feed`, and `he build-index` workflows with input formats and flags.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
Query a built KA with `he search`, `he talk`, and `he show` after the index exists.
</Card>
<Card title="Python API reference" href="/python-api-reference">
Full `Template.create`, `BaseAutoType` method signatures, and client factory exports.
</Card>
</CardGroup>
