# Knowledge Abstracts

> The on-disk Knowledge Abstract (KA) model: `data.json`, `metadata.json`, and `index/` layout; lifecycle methods (`parse`, `feed_text`, `dump`, `load`, `build_index`); and incremental evolution via `he feed`.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/types/base.py`
- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/cli/config.py`
- `hyperextract/utils/template_engine/factory.py`

---

A Knowledge Abstract (KA) is a directory persisted by `BaseAutoType.dump()` and restored by `BaseAutoType.load()`. Every `Template.create()` instance is a `BaseAutoType` subclass; `he parse` writes a new KA directory, and `he feed` loads an existing one, merges new text, and saves it back.

## On-disk layout

`dump()` writes three components under a single output directory:

:::files
my_ka/
├── data.json          # Structured knowledge (Pydantic schema JSON)
├── metadata.json      # Template ID, language, timestamps, type
└── index/             # FAISS vector store (optional until built)
    ├── index.faiss    # AutoModel, AutoList, AutoSet
    ├── docstore.json
    ├── node_index/    # AutoGraph and graph variants
    │   ├── index.faiss
    │   └── docstore.json
    └── edge_index/
        ├── index.faiss
        └── docstore.json
:::

| File | Required | Role |
|------|----------|------|
| `data.json` | Yes | Serialized `ka.data` via `model_dump()` |
| `metadata.json` | Recommended | Provenance and CLI template resolution |
| `index/` | No | Semantic search index; empty until `build_index()` |

`he search` and `he talk` require a non-empty `index/` directory. Run `he build-index` if you used `--no-index` during parse or after `he feed`.

### `data.json` shape by AutoType

The JSON mirrors the active Pydantic schema for the template's AutoType:

| AutoType | Top-level keys | Example use |
|----------|----------------|-------------|
| `AutoModel` | Schema field names (`name`, `summary`, …) | Single structured record |
| `AutoList` / `AutoSet` | `items` (array of objects) | Collections |
| `AutoGraph` (+ variants) | `nodes`, `edges` | Entity–relation graphs |

`he info` counts nodes/edges from `nodes`/`edges` or falls back to `entities`/`relations` for method outputs.

### `metadata.json` fields

Written by `dump_metadata()` from the in-memory `metadata` dict. `TemplateFactory.create()` seeds:

- Template path (e.g. `general/biography_graph`, `method/light_rag`) or custom YAML stem. Used by `get_template_from_ka()` to recreate the correct AutoType on load.
- Language code (`zh`, `en`). Knowledge templates require the same value at load time. Method templates always store `en`.
- AutoType discriminator: `model`, `list`, `set`, `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, or `spatio_temporal_graph`.
- ISO timestamp set at first extraction.
- ISO timestamp updated on every `feed_text()` or `clear()`.

Custom keys can be added before `dump()` and round-trip through `load_metadata()`.

## Lifecycle architecture

`BaseAutoType` owns extraction, merge, indexing, and serialization. State changes flow through three hooks:

```mermaid
stateDiagram-v2
    [*] --> Empty: __init__ / clear()
    Empty --> Populated: parse() or feed_text()
    Populated --> Populated: feed_text() merge
    Populated --> Indexed: build_index()
    Indexed --> StaleIndex: feed_text() clears index
    StaleIndex --> Indexed: build_index()
    Populated --> OnDisk: dump()
    Indexed --> OnDisk: dump()
    OnDisk --> Populated: load()
    OnDisk --> Indexed: load() with index/
```

| Method | Mutates caller | Index effect | Typical use |
|--------|----------------|--------------|-------------|
| `parse(text)` | No; returns new instance | Fresh instance, no index | Preview, branch, immutable pipeline |
| `feed_text(text)` | Yes; returns `self` | Clears index via `_update_data_state` | Incremental ingestion |
| `build_index()` | Yes | Builds FAISS from current data | Enable `search` / `chat` |
| `dump(folder)` | No | Writes data, metadata, index | Persist KA |
| `load(folder)` | Yes | Restores all three when present | Resume from disk |
| `clear()` | Yes | Resets data and index | Start over in memory |
| `clear_index()` | Yes | Drops index only | Force rebuild |

Long texts are chunked (`chunk_size` default 2048, `chunk_overlap` default 256), extracted in parallel (`max_workers` default 10), then merged with type-specific `merge_batch_data()` logic.
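
The chunk, extract, and merge flow described above can be pictured with a small standard-library sketch. It is only an illustration under stated assumptions: `chunk_text`, `extract_stub`, and `extract_parallel` are hypothetical names, the splitter here cuts on raw character offsets (the library's own splitter may behave differently), and the "extraction" simply collects capital letters where the real pipeline would call an LLM. Only the default values (2048, 256, 10) come from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, chunk_size=2048, chunk_overlap=256):
    """Split text into character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def extract_stub(chunk):
    """Hypothetical stand-in for an LLM extraction call: capital letters as 'entities'."""
    return {ch for ch in chunk if ch.isupper()}


def extract_parallel(text, chunk_size=2048, chunk_overlap=256, max_workers=10):
    """Chunk the text, run the stub on each chunk in parallel, then merge."""
    chunks = chunk_text(text, chunk_size, chunk_overlap)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(extract_stub, chunks))
    merged = set()
    for partial in partials:
        merged |= partial  # set union stands in for a deduplicating merge
    return chunks, merged
```

The overlap means an item that straddles a chunk boundary is still seen whole by at least one worker, which is why the merge step has to deduplicate.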

### `parse` vs `feed_text`

```python title="Python: branch without mutating"
from hyperextract import Template

# document_text is a str holding the source document
ka = Template.create("general/biography_graph", "en")
branch = ka.parse(document_text)   # new instance, original stays empty
branch.build_index()
branch.dump("./preview_ka/")
```

```python title="Python: evolve in place"
from hyperextract import Template

# doc_a and doc_b are str values holding two source documents
ka = Template.create("general/biography_graph", "en")
ka.feed_text(doc_a).feed_text(doc_b)   # method chaining
ka.build_index()
ka.dump("./main_ka/")
```

`parse()` calls `_set_data_state()` (full replace). `feed_text()` calls `_update_data_state()` (incremental merge with deduplication rules per AutoType).

Two instances with the same schema can be merged with the `+` operator, which calls `merge_batch_data()` and returns a new instance.

## CLI workflows

### Create a KA: `he parse`

`he parse` calls `validate_config()`, resolves `--template` / `--method`, and requires `--lang` for knowledge templates (methods force `en`). It then runs `Template.create()` → `feed_text()` on the input → `dump(output)`. Unless `--no-index` is given, it then calls `build_index()` and `dump()` again to persist the index.

```bash
he info ./my_ka/
ls ./my_ka/   # expect data.json, metadata.json, index/
```

- Output directory (`-o`): fails if non-empty unless `--force`.
- `--no-index`: skips `build_index()`; the KA is searchable only after `he build-index`.

### Evolve a KA: `he feed`

`he feed` loads template and language from `metadata.json` (override with `--template` / `--lang`), then:

1. `ka.load(ka_path)`
2. `ka.feed_text(new_text)`
3. `ka.dump(ka_path)`

`he feed` does not rebuild the index. After feeding, run `he build-index` (or `he build-index --force`) before `he search` or `he talk`.
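
Why a rebuild is needed after feeding can be seen in a toy model of the state transitions. This is a sketch only: `ToyKA` is a hypothetical in-memory stand-in, not `BaseAutoType`, and its "index" is a sorted word list instead of a FAISS store. It mirrors just the documented behavior that `feed_text()` returns `self` and clears the index.

```python
class ToyKA:
    """Toy stand-in for a KA; it is not the library's BaseAutoType."""

    def __init__(self):
        self.items = set()   # stands in for ka.data
        self.index = None    # None means no usable index

    def feed_text(self, text):
        # Merge with deduplication, then drop the now-stale index.
        self.items |= set(text.lower().split())
        self.index = None
        return self  # returning self allows method chaining

    def build_index(self):
        # A real index would hold embeddings; a sorted list is enough here.
        self.index = sorted(self.items)
        return self

    def search(self, query):
        if self.index is None:
            raise RuntimeError("index missing or stale: call build_index() first")
        return [item for item in self.index if query.lower() in item]
```

In this model, calling `search()` right after `feed_text()` raises until `build_index()` runs again, which is the same ordering the CLI asks for.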

```bash
he parse examples/en/tesla.md -t general/biography_graph -l en -o ./tesla_ka/
he feed ./tesla_ka/ examples/en/another_doc.md
he build-index ./tesla_ka/ --force
he search ./tesla_ka/ "AC motor"
```

### Rebuild index: `he build-index`

Loads the KA, optionally calls `clear_index()` when `--force` is given, calls `build_index()`, and then calls `dump()` to refresh `index/`. Exits early if an index already exists and `--force` is omitted.

## Python API reference

All lifecycle methods live on `BaseAutoType` and are available on the instances returned by `Template.create()`:

```python
from pathlib import Path

from hyperextract import Template

# Create (reads ~/.he/config.toml when clients omitted)
ka = Template.create("general/biography_graph", "en")

# Full cycle
ka.feed_text(Path("doc.md").read_text(encoding="utf-8"))
ka.build_index()
ka.dump("./my_ka/")

# Reload later: template + lang must match metadata
ka2 = Template.create("general/biography_graph", "en")
ka2.load("./my_ka/")
results = ka2.search("wireless power", top_k=5)
answer = ka2.chat("What did Tesla invent?")
```

Granular serialization is also available:

| Method | Target or effect |
|--------|------------------|
| `dump_data(path)` | Single `data.json` |
| `dump_metadata(path)` | Single `metadata.json` |
| `dump_index(folder)` | FAISS files under `index/` |
| `load_data(path)` | Validates against schema, replaces state |
| `load_metadata(path)` | Merges into `metadata` dict |
| `load_index(folder)` | Restores FAISS; needs matching embedder |

## Index internals

Hyper-Extract uses LangChain `FAISS` as the vector backend:

- **Scalar types** (`AutoModel`, `AutoList`, `AutoSet`): one `index/` with `index.faiss` and `docstore.json`.
- **Graph types** (`AutoGraph`, `AutoHypergraph`, temporal/spatial variants): `index/node_index/` and `index/edge_index/` subdirectories, each with its own FAISS store.

`build_index()` embeds text derived from stored objects. `search()` runs `similarity_search` and restores Pydantic items from `Document.metadata["raw"]`.
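
The two index layouts can be told apart by inspecting the directory alone. The helper below is a minimal standard-library sketch, not part of Hyper-Extract: `index_layout` is a hypothetical name, and it assumes a graph KA always has both `node_index/` and `edge_index/` once built, which the source does not state explicitly.

```python
from pathlib import Path


def index_layout(ka_dir):
    """Classify index/ under a KA directory as 'missing', 'graph', 'scalar', or 'unknown'."""
    index_dir = Path(ka_dir) / "index"
    # An absent or empty index/ means the index has not been built.
    if not index_dir.is_dir() or not any(index_dir.iterdir()):
        return "missing"
    # Graph types keep separate stores for nodes and edges.
    if (index_dir / "node_index").is_dir() and (index_dir / "edge_index").is_dir():
        return "graph"
    # Scalar types keep a single FAISS file directly under index/.
    if (index_dir / "index.faiss").is_file():
        return "scalar"
    return "unknown"
```

A check like this only reports which files exist; it cannot tell whether the index still matches `data.json` after new text has been fed.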

## Validation rules

The CLI enforces KA integrity before commands run:

| Validator | Checks |
|-----------|--------|
| `validate_ka_path` | Path exists and is a directory |
| `validate_ka_with_data` | `data.json` present |
| `validate_ka_with_index` | `index/` exists and is non-empty |
| `get_template_from_ka` | `metadata.json` with resolvable `template` (preset or local `{template}.yaml`) |

Custom templates copied into the KA directory during `he parse` are resolved from the local YAML on reload.

## Related pages

- Eight extraction primitives, merge behavior, and index layout differences across `model`, `list`, `set`, and graph types.
- End-to-end `he parse`, `he feed`, and `he build-index` workflows with input formats and flags.
- Query a built KA with `he search`, `he talk`, and `he show` after the index exists.
- Full `Template.create`, `BaseAutoType` method signatures, and client factory exports.