# Hyper-Extract Documentation

> Reference for the Hyper-Extract LLM knowledge extraction framework: CLI (`he`), Python API (`Template`, AutoTypes, `create_client`), YAML templates, extraction methods, and Knowledge Abstract lifecycle.

## Context Links

- [Agent index](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms.txt)
- [Human interactive docs](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf)
- [GitHub repository](https://github.com/yifanfeng97/Hyper-Extract)

## Repository Metadata

- Repository: yifanfeng97/Hyper-Extract

- Generated: 2026-06-18T20:59:59.470Z
- Updated: 2026-06-18T21:02:44.802Z
- Runtime: Grok CLI
- Format: Documentation
- Pages: 22

## Page Index

- 01. [Overview](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/01-overview.md) - What Hyper-Extract exposes (CLI `he`, Python `Template` API, 8 AutoTypes, 80+ YAML presets, 9 extraction methods), runtime assumptions (Python 3.11+, structured LLM output), and the shortest path from install to a queryable Knowledge Abstract.
- 02. [Installation](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/02-installation.md) - Install via `uv tool install hyperextract` or `uv pip install hyperextract`, Python version constraints, optional provider extras (`anthropic`, `google`, `all`), and first-run configuration prerequisites.
- 03. [Quickstart](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/03-quickstart.md) - First successful extraction: `he config init`, `he parse` with a preset template, `he search` / `he show`, and the equivalent Python `Template.create` + `feed_text` path using the Tesla biography example.
- 04. [Knowledge Abstracts](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/04-knowledge-abstracts.md) - The on-disk Knowledge Abstract (KA) model: `data.json`, `metadata.json`, and `index/` layout; lifecycle methods (`parse`, `feed_text`, `dump`, `load`, `build_index`); and incremental evolution via `he feed`.
- 05. [Auto-Types](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/05-auto-types.md) - Eight strongly-typed extraction primitives (`AutoModel`, `AutoList`, `AutoSet`, `AutoGraph`, `AutoHypergraph`, `AutoTemporalGraph`, `AutoSpatialGraph`, `AutoSpatioTemporalGraph`): structure, merge behavior, indexing, and when to pick each type.
- 06. [Templates vs methods](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/06-templates-vs-methods.md) - Domain YAML templates (`general/biography_graph`, `finance/earnings_summary`, etc.) versus algorithm-driven method templates (`method/light_rag`, `method/atom`); language requirements (`--lang` for templates, English-only for methods); and selection criteria.
- 07. [Provider system](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/07-provider-system.md) - BYOC/BYOK provider model: `openai`, `bailian`, and `vllm` presets; `provider:model@url` shorthand; `CompatibleEmbeddings` for non-OpenAI endpoints; and verified model compatibility requirements (`json_schema` / function calling).
- 08. [Configure providers](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/08-configure-providers.md) - Set up LLM and embedder clients via `he config init`, per-service `he config llm` / `he config embedder`, environment variables, or programmatic `create_client()` for mixed cloud and local vLLM deployments.
- 09. [Extract and evolve knowledge](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/09-extract-and-evolve-knowledge.md) - Run `he parse` (single file, directory of `.md`/`.txt`, or stdin), choose templates interactively or by ID, control indexing with `--no-index`, append documents with `he feed`, and rebuild indexes with `he build-index`.
- 10. [Search, chat, and visualize](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/10-search-chat-and-visualize.md) - Query Knowledge Abstracts with `he search` and `he talk` (single query or `-i` interactive mode), inspect stats via `he info`, and render graphs through OntoSight with `he show` or `AutoType.show()`.
- 11. [Create custom templates](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/11-create-custom-templates.md) - Author domain YAML templates: type selection, field and identifier design, multilingual `language` blocks, merge strategies, and validation workflow per the design guide and preset base templates.
- 12. [Use extraction methods](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/12-use-extraction-methods.md) - Invoke algorithm templates via `he parse -m light_rag` or `Template.create("method/hyper_rag")`; direct method classes (`Light_RAG`, `Atom`, etc.); and method-specific kwargs such as `observation_time` for temporal extractors.
- 13. [Template design skills](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/13-template-design-skills.md) - Agent-assisted template authoring with `hyperextract-skills`: brainstorm requirements, record/graph designers, yaml-validator rules, template-optimizer fixes, and multilingual conversion workflows.
- 14. [CLI reference](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/14-cli-reference.md) - Complete `he` command surface: `parse`, `feed`, `build-index`, `search`, `talk`, `show`, `info`, `list template`, `list method`, `config` subcommands, flags, defaults, exit conditions, and input/output contracts.
- 15. [Python API reference](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/15-python-api-reference.md) - Exported SDK: `Template.create/get/list`, `BaseAutoType` lifecycle (`parse`, `feed_text`, `search`, `chat`, `dump`, `load`, `build_index`, `show`), `create_client` / `create_llm` / `create_embedder` / `get_client`, and logging helpers.
- 16. [Configuration reference](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/16-configuration-reference.md) - `~/.he/config.toml` schema for `[llm]` and `[embedder]`, provider presets and default models, environment variable precedence (`OPENAI_API_KEY`, `OPENAI_BASE_URL`, `HYPER_EXTRACT_LOG_LEVEL`, `HYPER_EXTRACT_LOG_FILE`), and validation rules.
- 17. [Template schema reference](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/17-template-schema-reference.md) - YAML template fields (`language`, `name`, `type`, `tags`, `description`, `output`, `guideline`, `identifiers`, `options`, `display`), valid autotypes and field types, merge strategies, and identifier patterns.
- 18. [Extraction methods reference](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/18-extraction-methods-reference.md) - Registered methods (`graph_rag`, `light_rag`, `hyper_rag`, `hypergraph_rag`, `cog_rag`, `itext2kg`, `itext2kg_star`, `kg_gen`, `atom`): autotype output, descriptions, registry API, and constructor kwargs.
- 19. [Tesla biography recipe](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/19-tesla-biography-recipe.md) - End-to-end CLI and Python workflow using `examples/en/tesla.md` with `general/biography_graph`: parse, visualize, semantic search, and Q&A with expected artifacts under the output directory.
- 20. [Method demos](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/20-method-demos.md) - Runnable scripts under `examples/en/methods/` for each extraction engine: instantiate method classes, `feed_text`, `chat`, and `show` with LangChain clients and dotenv configuration.
- 21. [Troubleshooting](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/21-troubleshooting.md) - Common failure modes: missing API keys, vLLM `base_url` requirements, `--lang` required for knowledge templates, empty output directory conflicts, missing `data.json` or index for `search`/`talk`, template resolution errors, and debug logging via `HYPER_EXTRACT_LOG_LEVEL`.
- 22. [Contributing](https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/22-contributing.md) - Development setup with `uv`, running `pytest` and coverage, CI matrix (Python 3.11–3.12, Ubuntu/macOS), lint workflow, optional integration tests, and how to add templates or register new extraction methods.

## Source File Index

- `.env.example`
- `.github/workflows/integration.yml`
- `.github/workflows/lint.yml`
- `.github/workflows/test.yml`
- `.python-version`
- `examples/en/methods/atom_demo.py`
- `examples/en/methods/graph_rag_demo.py`
- `examples/en/methods/hyper_rag_demo.py`
- `examples/en/methods/kg_gen_demo.py`
- `examples/en/methods/light_rag_demo.py`
- `examples/en/tesla_question.md`
- `examples/en/tesla.md`
- `hyperextract-skills/graph-designer/SKILL.md`
- `hyperextract-skills/README.md`
- `hyperextract-skills/SKILL.md`
- `hyperextract-skills/template-optimizer/SKILL.md`
- `hyperextract-skills/yaml-validator/SKILL.md`
- `hyperextract/__init__.py`
- `hyperextract/cli/__main__.py`
- `hyperextract/cli/cli.py`
- `hyperextract/cli/commands/config.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/cli/config.py`
- `hyperextract/cli/README.md`
- `hyperextract/cli/utils.py`
- `hyperextract/methods/rag/graph_rag.py`
- `hyperextract/methods/rag/hyper_rag.py`
- `hyperextract/methods/rag/light_rag.py`
- `hyperextract/methods/registry.py`
- `hyperextract/methods/typical/atom.py`
- `hyperextract/methods/typical/kg_gen.py`
- `hyperextract/templates/DESIGN_GUIDE.md`
- `hyperextract/templates/presets/finance/earnings_summary.yaml`
- `hyperextract/templates/presets/general/base_graph.yaml`
- `hyperextract/templates/presets/general/biography_graph.yaml`
- `hyperextract/templates/README.md`
- `hyperextract/types/__init__.py`
- `hyperextract/types/base.py`
- `hyperextract/types/graph.py`
- `hyperextract/types/hypergraph.py`
- `hyperextract/types/model.py`
- `hyperextract/types/spatial_graph.py`
- `hyperextract/types/spatio_temporal_graph.py`
- `hyperextract/types/temporal_graph.py`
- `hyperextract/utils/client.py`
- `hyperextract/utils/logging.py`
- `hyperextract/utils/template_engine/factory.py`
- `hyperextract/utils/template_engine/gallery.py`
- `hyperextract/utils/template_engine/parsers/loader.py`
- `hyperextract/utils/template_engine/parsers/schemas/base.py`
- `hyperextract/utils/template_engine/parsers/schemas/graph.py`
- `hyperextract/utils/template_engine/parsers/schemas/naive.py`
- `hyperextract/utils/template_engine/template.py`
- `pyproject.toml`
- `README.md`
- `tests/cli/test_verbose.py`
- `tests/conftest.py`

---

## 01. Overview

> What Hyper-Extract exposes (CLI `he`, Python `Template` API, 8 AutoTypes, 80+ YAML presets, 9 extraction methods), runtime assumptions (Python 3.11+, structured LLM output), and the shortest path from install to a queryable Knowledge Abstract.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/01-overview.md
- Generated: 2026-06-18T20:53:53.719Z

### Source Files

- `README.md`
- `pyproject.toml`
- `hyperextract/__init__.py`
- `hyperextract/cli/cli.py`
- `hyperextract/utils/template_engine/template.py`

---
title: "Overview"
description: "What Hyper-Extract exposes (CLI `he`, Python `Template` API, 8 AutoTypes, 80+ YAML presets, 9 extraction methods), runtime assumptions (Python 3.11+, structured LLM output), and the shortest path from install to a queryable Knowledge Abstract."
---

Hyper-Extract (`hyperextract` v0.2.0) is an LLM-powered knowledge extraction framework that turns unstructured text into on-disk **Knowledge Abstracts** (KAs): strongly typed JSON plus optional FAISS vector indexes. The package exposes a Typer-based CLI (`he`), a Python SDK centered on `Template.create()`, eight `AutoType` primitives, domain YAML presets, and nine registered extraction methods.

## Surfaces

| Surface | Entry point | Primary use |
|---------|-------------|-------------|
| CLI | `he` (`hyperextract.cli:app`) | Parse documents, evolve KAs, search, chat, visualize |
| Python SDK | `from hyperextract import Template` | Programmatic extraction, indexing, and Q&A |
| Presets | `hyperextract/templates/presets/` | Domain YAML templates (general, finance, legal, medicine, TCM, industry) |
| Methods | `hyperextract.methods.registry` | Algorithm-driven extractors (`graph_rag`, `light_rag`, `atom`, …) |

The public SDK exports `BaseAutoType`, all eight `AutoType` classes, `Template`, client factories (`create_client`, `create_llm`, `create_embedder`, `get_client`), and logging helpers.

## Three-layer architecture

Hyper-Extract organizes extraction around three composable layers:

```mermaid
flowchart TB
    subgraph input["Input"]
        TXT["Unstructured text<br/>.md / .txt / stdin"]
    end

    subgraph layers["Hyper-Extract layers"]
        TPL["Templates<br/>YAML presets + method/*"]
        METH["Methods<br/>9 extraction algorithms"]
        AT["AutoTypes<br/>8 typed structures"]
    end

    subgraph output["Knowledge Abstract"]
        KA["output/<br/>data.json · metadata.json · index/"]
    end

    subgraph ops["Operations"]
        SRCH["search / chat"]
        VIS["show (OntoSight)"]
    end

    TXT --> TPL
    TXT --> METH
    TPL --> AT
    METH --> AT
    AT --> KA
    KA --> SRCH
    KA --> VIS
```

<AccordionGroup>
<Accordion title="AutoTypes — typed extraction primitives">

Eight `AutoType` classes inherit from `BaseAutoType` and own the full KA lifecycle: chunking, parallel LLM extraction, merge, indexing, serialization, search, and chat.

| Class | Structure | Typical use |
|-------|-----------|-------------|
| `AutoModel` | Single Pydantic object | Summaries, metadata, structured reports |
| `AutoList` | Ordered list | Logs, enumerations, ordered events |
| `AutoSet` | Deduplicated set | Glossaries, entity registries |
| `AutoGraph` | Nodes + binary edges | Concept maps, social networks |
| `AutoHypergraph` | Nodes + hyperedges | N-ary relationships, complex events |
| `AutoTemporalGraph` | Graph + time | Timelines, chronologies |
| `AutoSpatialGraph` | Graph + space | Physical layouts, spatial relations |
| `AutoSpatioTemporalGraph` | Graph + time + space | Event networks with location context |

Extraction uses LangChain structured output: `llm_client.with_structured_output(schema)` on chunked text (default chunk size 2048, overlap 256).
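
The chunking defaults can be pictured with a minimal sketch. It assumes sizes are counted in characters and uses a hypothetical `chunk_text` helper that is not part of the `hyperextract` API; the library's own splitter may count tokens or split on separators instead.

```python
def chunk_text(text: str, size: int = 2048, overlap: int = 256) -> list[str]:
    """Split text into windows of `size` characters sharing `overlap` characters.

    Hypothetical helper for illustration only; not the hyperextract splitter.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each window starts `step` characters after the last
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reaches the end
            break
        start += step
    return chunks


# A 5000-character document yields three overlapping windows.
print([len(chunk) for chunk in chunk_text("x" * 5000)])
```

With these defaults, consecutive windows start 1792 characters apart, so an entity that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.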

</Accordion>

<Accordion title="Templates — domain YAML presets">

Domain templates live under `hyperextract/templates/presets/` and are discovered by `Gallery` at import time. Each preset defines autotype, fields, identifiers, merge rules, and multilingual `language` blocks (`zh`, `en`). Resolve presets by path (e.g. `general/biography_graph`, `finance/earnings_summary`).

Knowledge templates require `--lang` / `language=` at creation. Method templates (`method/*`) always use English prompts.

</Accordion>

<Accordion title="Methods — algorithm extractors">

Nine methods register in `hyperextract.methods.registry`:

| Method | Output autotype | Role |
|--------|-----------------|------|
| `graph_rag` | graph | Community-aware Graph-RAG |
| `light_rag` | graph | Lightweight binary-edge RAG |
| `hyper_rag` | hypergraph | Hyperedge RAG |
| `hypergraph_rag` | hypergraph | Advanced hypergraph construction |
| `cog_rag` | hypergraph | Cognitive RAG retrieval |
| `itext2kg` | graph | Triple-based KG extraction |
| `itext2kg_star` | graph | Enhanced iText2KG |
| `kg_gen` | graph | Flexible KG generation |
| `atom` | graph | Temporal KG with evidence attribution |

Invoke via CLI (`he parse -m light_rag`) or `Template.create("method/light_rag")`.

</Accordion>
</AccordionGroup>

## Knowledge Abstract model

A KA is a directory produced by `BaseAutoType.dump()`:

:::files
output/
├── data.json       # Structured knowledge (Pydantic model_dump)
├── metadata.json   # Timestamps, template config, provenance
└── index/          # FAISS vector store (optional; rebuild with build_index)
:::
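
As a rough illustration of this layout, the following standard-library sketch reports which parts of a KA directory are present. The `inspect_ka` helper is hypothetical and only checks file names; it does not validate the JSON contents or the index format.

```python
from pathlib import Path


def inspect_ka(folder: str) -> dict[str, bool]:
    """Report which parts of the documented KA layout exist under `folder`.

    Hypothetical helper; hyperextract itself is not imported or required.
    """
    root = Path(folder)
    return {
        "data.json": (root / "data.json").is_file(),
        "metadata.json": (root / "metadata.json").is_file(),
        "index/": (root / "index").is_dir(),
    }


status = inspect_ka("./output")
if not status["index/"]:
    # The index is optional on disk but needed for semantic search.
    print("No index/ directory found; build the index before searching.")
print(status)
```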

Lifecycle methods on every `BaseAutoType` instance:

| Method | Purpose |
|--------|---------|
| `parse(text)` | Extract from text (replaces in-memory state) |
| `feed_text(text)` | Incrementally append and merge |
| `build_index()` | Build semantic search index |
| `search(query, top_k=3)` | Vector retrieval |
| `chat(query, top_k=3)` | RAG Q&A over retrieved context |
| `dump(folder)` / `load(folder)` | Serialize / restore full KA |
| `show()` | Visualize via OntoSight (graph/list/set types) |

CLI mirrors these operations: `he parse`, `he feed`, `he build-index`, `he search`, `he talk`, `he show`, `he info`.

## Runtime assumptions

<Warning>
Hyper-Extract requires LLMs that support **structured output** — `json_schema` or function calling via LangChain's `with_structured_output`. Models without reliable schema adherence will produce extraction failures.
</Warning>

| Requirement | Value |
|-------------|-------|
| Python | `>=3.11` (classifiers: 3.11, 3.12) |
| Core deps | LangChain, FAISS (`faiss-cpu`), Pydantic, OntoSight, OntoMem, structlog |
| Optional extras | `anthropic`, `google`, `all` (additional LangChain provider packages) |
| Config file | `~/.he/config.toml` (`[llm]`, `[embedder]`) |
| Logging | `HYPER_EXTRACT_LOG_LEVEL`, `HYPER_EXTRACT_LOG_FILE` |

Provider presets ship for `openai`, `bailian` (DashScope compatible-mode), and `vllm` (local, requires explicit `base_url`). Use `provider:model@url` shorthand or `create_client()` for BYOC/BYOK deployments — no single vendor is required.
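
To make the `provider:model@url` shorthand concrete, here is a small parsing sketch. The `ClientSpec` type and `parse_shorthand` function are hypothetical names, and treating the `@url` suffix as optional is an assumption; the parser inside `hyperextract` may behave differently.

```python
from typing import NamedTuple, Optional


class ClientSpec(NamedTuple):
    provider: str
    model: str
    base_url: Optional[str]


def parse_shorthand(spec: str) -> ClientSpec:
    """Split 'provider:model@url' into parts (illustrative, not the library parser)."""
    # Split on the first ':' only, because the URL itself contains colons.
    provider, sep, rest = spec.partition(":")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider:model[@url]', got {spec!r}")
    # Split on the first '@' to separate the model name from the endpoint.
    model, _, url = rest.partition("@")
    if not model:
        raise ValueError(f"missing model name in {spec!r}")
    return ClientSpec(provider, model, url or None)


print(parse_shorthand("vllm:Qwen3.5-9B@http://localhost:8000/v1"))
print(parse_shorthand("openai:gpt-4o-mini"))
```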

## Shortest path: install to queryable KA

<Steps>
<Step title="Install">

```bash
uv tool install hyperextract
# or: uv pip install hyperextract
```

</Step>

<Step title="Configure providers">

```bash
he config init -k YOUR_API_KEY
```

Writes `~/.he/config.toml` with LLM and embedder endpoints. See [Configure providers](/configure-providers) for per-provider setup and environment-variable overrides.

</Step>

<Step title="Extract a Knowledge Abstract">

```bash
he parse examples/en/tesla.md \
  -t general/biography_graph \
  -o ./output/ \
  -l en
```

<ParamField body="--template / -t" type="string">
Preset path (e.g. `general/biography_graph`). Omit for interactive selection.
</ParamField>

<ParamField body="--lang / -l" type="string" required>
Language code (`en` or `zh`). Required for knowledge templates; ignored for `method/*`.
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip FAISS index build. Search and talk require an index later (`he build-index`).
</ParamField>

Produces `./output/data.json`, `metadata.json`, and `index/`.

</Step>

<Step title="Query and explore">

```bash
he search ./output/ "What are Tesla's major achievements?"
he talk ./output/ -i
he show ./output/
```

</Step>
</Steps>

<CodeGroup>
```bash CLI
he config init -k YOUR_API_KEY
he parse examples/en/tesla.md -t general/biography_graph -o ./output/ -l en
he search ./output/ "What are Tesla's major achievements?"
```

```python Python
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")

with open("examples/en/tesla.md", encoding="utf-8") as f:
    ka.parse(f.read())

ka.build_index()  # search and chat rely on the vector index
ka.dump("./output/")
results = ka.search("What are Tesla's major achievements?")
ka.show()
```
</CodeGroup>

<Note>
`Template.create()` reads `~/.he/config.toml` when `llm_client` and `embedder` are omitted. Pass explicit LangChain clients for mixed cloud/local setups.
</Note>

## CLI command map

| Group | Commands |
|-------|----------|
| Create / evolve | `he parse`, `he feed`, `he build-index` |
| Explore | `he search`, `he talk`, `he show`, `he info` |
| Discover | `he list template`, `he list method` |
| Configure | `he config init`, `he config llm`, `he config embedder`, `he config show` |

Run `he` with no subcommand for a grouped help panel. Full flag reference: [CLI reference](/cli-reference).

## Templates vs methods

| | Domain templates | Extraction methods |
|---|------------------|-------------------|
| Path prefix | `general/*`, `finance/*`, … | `method/*` |
| Definition | YAML in `templates/presets/` | Python classes in `hyperextract.methods` |
| Language | `--lang` required (`en` / `zh`) | English only |
| Selection | Domain fit, output autotype | Algorithm preference (RAG vs triple extraction) |
| Customization | Author new YAML presets | Register via `register_method()` |

Both paths instantiate an `AutoType` and produce the same on-disk KA layout.
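
The language rule in the table can be expressed as a short pre-flight check. This is a sketch under the assumptions stated in the table (the `method/` prefix marks method templates, and knowledge templates accept `en` or `zh`); `requires_language` and `check_template_args` are hypothetical helpers, not library functions.

```python
from typing import Optional

SUPPORTED_LANGUAGES = ("en", "zh")  # language codes named in the table above


def requires_language(template_id: str) -> bool:
    """Knowledge templates need a language; 'method/*' templates are English only."""
    return not template_id.startswith("method/")


def check_template_args(template_id: str, language: Optional[str] = None) -> None:
    """Raise early if a knowledge template is requested without a valid language."""
    if requires_language(template_id) and language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"{template_id!r} needs language in {SUPPORTED_LANGUAGES}, got {language!r}"
        )


check_template_args("general/biography_graph", language="en")  # passes
check_template_args("method/light_rag")                        # passes, no language needed
```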

## Python SDK summary

```python
from hyperextract import (
    Template,
    AutoGraph,
    create_client,
)

# Discover
Template.list()                          # presets + methods
Template.list(filter_by_type="graph")    # filter by autotype
Template.get("general/biography_graph")  # read YAML config

# Create and extract
llm, emb = create_client()               # from ~/.he/config.toml
ka = Template.create("general/biography_graph", language="en", llm_client=llm, embedder=emb)
with open("examples/en/tesla.md", encoding="utf-8") as f:
    document_text = f.read()
ka.feed_text(document_text)
ka.build_index()
ka.dump("./my_ka/")
```

Exported primitives and factories are listed in `hyperextract.__all__`. Method-level detail: [Python API reference](/python-api-reference).

## Next

<CardGroup>
<Card title="Installation" href="/installation">
Install via `uv tool install` or `uv pip install`, Python version constraints, and optional provider extras.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful extraction with `he parse`, search, show, and the Tesla biography Python path.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk KA layout, lifecycle methods, and incremental evolution with `he feed`.
</Card>
<Card title="Auto-Types" href="/auto-types">
Eight extraction primitives: structure, merge behavior, indexing, and selection guide.
</Card>
<Card title="Provider system" href="/provider-system">
BYOC/BYOK model: OpenAI, Bailian, vLLM presets, and structured-output requirements.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
When to pick domain YAML presets versus algorithm method templates.
</Card>
</CardGroup>

---

## 02. Installation

> Install via `uv tool install hyperextract` or `uv pip install hyperextract`, Python version constraints, optional provider extras (`anthropic`, `google`, `all`), and first-run configuration prerequisites.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/02-installation.md
- Generated: 2026-06-18T20:53:33.011Z

### Source Files

- `pyproject.toml`
- `README.md`
- `.python-version`
- `.env.example`
- `hyperextract/cli/__main__.py`

---
title: Installation
description: Install via `uv tool install hyperextract` or `uv pip install hyperextract`, Python version constraints, optional provider extras (`anthropic`, `google`, `all`), and first-run configuration prerequisites.
---

Hyper-Extract ships as the PyPI package `hyperextract`. Choose a **CLI install** when you want the `he` command globally, or a **library install** when you import `Template` and `create_client` from Python. Both paths require **Python 3.11+** and a configured LLM/embedder before extraction commands succeed.

## Requirements

| Requirement | Details |
| --- | --- |
| Python | **3.11 or later** (`requires-python = ">=3.11"`; classifiers and CI cover 3.11 and 3.12) |
| Package manager | [uv](https://docs.astral.sh/uv/) recommended; `pip` / `pipx` also supported |
| LLM capability | Models must support structured output (`json_schema` or function calling) |
| Credentials | API key for cloud providers, or local vLLM endpoints with `base_url` |
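
The Python constraint can be checked before installing. The sketch below mirrors the `>=3.11` bound from the table; `python_is_supported` is a hypothetical helper, not something the package provides.

```python
import sys

MIN_PYTHON = (3, 11)  # mirrors requires-python = ">=3.11"


def python_is_supported(version=None) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= MIN_PYTHON


if not python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is too old; need 3.11+")
else:
    print("Python version is compatible")
```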

The core package includes `langchain-openai` and works with any OpenAI-compatible endpoint (OpenAI, Bailian, vLLM, proxies). Optional extras add native LangChain integrations for Anthropic and Google.

## Install the CLI

Use this path when you want `he` on your PATH for parse, search, talk, and config workflows.

<Tabs>
<Tab title="uv (recommended)">

```bash
uv tool install hyperextract
```

Install with all optional provider integrations:

```bash
uv tool install 'hyperextract[all]'
```

</Tab>
<Tab title="pipx">

```bash
pipx install hyperextract
```

With extras:

```bash
pipx install 'hyperextract[all]'
```

</Tab>
</Tabs>

<Steps>
<Step title="Verify the CLI">

```bash
he --version
```

Expected output:

```
Hyper-Extract CLI version 0.2.0
```

</Step>
<Step title="Confirm the entry point">

Running `he` with no subcommand prints the command overview and exits. This confirms Typer registration and Rich rendering work in your environment.

```bash
he
```

</Step>
</Steps>

## Install as a Python library

Use this path for notebooks, services, or scripts that call `Template.create()`, `feed_text()`, and `create_client()` directly.

<Tabs>
<Tab title="uv">

```bash
uv pip install hyperextract
```

With a specific extra:

```bash
uv pip install 'hyperextract[anthropic]'
uv pip install 'hyperextract[google]'
uv pip install 'hyperextract[all]'
```

</Tab>
<Tab title="pip">

```bash
pip install hyperextract
```

With extras:

```bash
pip install 'hyperextract[anthropic,google]'
```

</Tab>
</Tabs>

<Steps>
<Step title="Verify the import">

```python
import hyperextract
print(hyperextract.__version__)
```

</Step>
<Step title="Smoke-test the API surface">

```python
from hyperextract import Template
print(len(Template.list()), "templates available")
```

</Step>
</Steps>

## Optional provider extras

Extras install additional LangChain provider packages. They are **not required** for OpenAI-compatible endpoints, which the default install already supports.

| Extra | Installs | When to use |
| --- | --- | --- |
| `anthropic` | `langchain-anthropic>=0.3.0` | Native Anthropic Claude clients |
| `google` | `langchain-google-genai>=2.1.0` | Native Google Gemini clients |
| `all` | Both `anthropic` and `google` | Multi-provider projects or CI parity |
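
To see which extras are present in the current environment, one option is to probe for the provider modules. This sketch assumes the usual import names of the two distributions (`langchain_anthropic` and `langchain_google_genai`); `installed_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Extra name -> top-level module shipped by the corresponding distribution.
EXTRA_MODULES = {
    "anthropic": "langchain_anthropic",
    "google": "langchain_google_genai",
}


def installed_extras() -> dict[str, bool]:
    """Map each optional extra to whether its module can be imported."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


print(installed_extras())
```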

<CodeGroup>
```bash title="CLI with all extras"
uv tool install 'hyperextract[all]'
```

```bash title="Library with Anthropic only"
uv pip install 'hyperextract[anthropic]'
```

```bash title="Library with all extras"
uv pip install 'hyperextract[all]'
```
</CodeGroup>

## What the default install includes

The base `hyperextract` package bundles extraction, indexing, CLI, and OpenAI-compatible client support:

- **AutoTypes and template engine** — eight knowledge structures and 80+ YAML presets
- **CLI** — `he` command via Typer + Rich (`typer`, `rich`, `tomli-w`)
- **Semantic search** — FAISS CPU backend (`faiss-cpu`)
- **LangChain stack** — `langchain`, `langchain-community`, `langchain-openai`
- **Visualization** — OntoSight integration (`ontosight`, `ontomem`)
- **Utilities** — `structlog`, `python-dotenv`, `semhash`

No separate install step is needed for Bailian or vLLM when using OpenAI-compatible URLs.

## First-run configuration

The KA commands (`he parse`, `he feed`, `he search`, `he talk`, `he show`, `he build-index`) call `validate_config()` before running. Without valid LLM and embedder settings, the CLI exits with an error.

<Steps>
<Step title="Choose a configuration method">

<Tabs>
<Tab title="Interactive (recommended)">

```bash
he config init
```

Walks through provider selection (`openai`, `bailian`, `vllm`, or custom), LLM model, embedder model, API key, and base URL. Writes `~/.he/config.toml`.

</Tab>
<Tab title="Quick setup (OpenAI)">

```bash
he config init -k YOUR_OPENAI_API_KEY
```

Defaults to OpenAI: `gpt-4o-mini` (LLM) and `text-embedding-3-small` (embedder).

</Tab>
<Tab title="Quick setup (provider preset)">

```bash
he config init -p openai -k YOUR_OPENAI_API_KEY
he config init -p bailian -k YOUR_BAILIAN_API_KEY
```

Applies preset models and base URLs from the provider registry.

</Tab>
<Tab title="Environment variables">

Copy `.env.example` as a starting point:

```bash
export OPENAI_API_KEY=sk-your-api-key-here
export OPENAI_BASE_URL=https://api.openai.com/v1
```

Environment variables fill gaps when `~/.he/config.toml` omits a key or URL.
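
The gap-filling behavior can be sketched as a simple fallback: a value from the config file wins, and the environment variable is consulted only when that value is missing. `resolve_setting` is a hypothetical helper, and treating an empty string as missing is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_setting(
    config_value: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer the config file value; fall back to the environment variable."""
    if config_value:  # assumption: empty strings count as "omitted"
        return config_value
    return env.get(env_name) or None


# Config omits the key, so the environment supplies it.
fake_env = {"OPENAI_API_KEY": "sk-from-env"}
print(resolve_setting(None, "OPENAI_API_KEY", fake_env))
# Config provides the URL, so the environment is ignored.
print(resolve_setting("http://localhost:8000/v1", "OPENAI_BASE_URL", fake_env))
```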

</Tab>
</Tabs>

</Step>
<Step title="Validate configuration">

```bash
he config show
```

Confirms provider, model, masked API key, and base URL for both LLM and embedder services.

</Step>
<Step title="Run your first extraction">

Once configuration is valid, proceed to a full workflow in [Quickstart](/quickstart).

</Step>
</Steps>

### Configuration file location

Hyper-Extract stores CLI settings at:

```
~/.he/config.toml
```

The file contains `[llm]` and `[embedder]` sections with `provider`, `model`, `api_key`, and `base_url` fields. See [Configuration reference](/configuration-reference) for the full schema and precedence rules.

### Provider-specific prerequisites

<AccordionGroup>
<Accordion title="OpenAI">

<ParamField body="api_key" type="string" required>
Valid OpenAI API key. Set via `he config init -k ...` or `OPENAI_API_KEY`.
</ParamField>

<ParamField body="base_url" type="string">
Defaults to `https://api.openai.com/v1`. Override with `-u` or `OPENAI_BASE_URL` for proxies.
</ParamField>

Verified models include `gpt-4o`, `gpt-4o-mini`, and `gpt-5`.

</Accordion>
<Accordion title="Bailian (Alibaba Cloud)">

```bash
he config init -p bailian -k YOUR_BAILIAN_API_KEY
```

Preset defaults: `qwen3.6-plus` (LLM), `text-embedding-v4` (embedder), base URL `https://dashscope.aliyuncs.com/compatible-mode/v1`.

</Accordion>
<Accordion title="Local vLLM">

vLLM requires explicit `base_url` for both services. API key can be `dummy`.

```bash
he config llm -p vllm -u http://localhost:8000/v1 -k dummy -m Qwen3.5-9B
he config embedder -p vllm -u http://localhost:8001/v1 -k dummy -m bge-m3
```

Or configure programmatically:

```python
from hyperextract import create_client

llm, emb = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)
```
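
To make the `provider:model@url` shorthand used above concrete, here is a minimal parsing sketch. `ClientSpec` and `parse_spec` are hypothetical names, and treating the `@url` part as optional is an assumption; the library's own parser may behave differently.

```python
from typing import NamedTuple, Optional


class ClientSpec(NamedTuple):
    provider: str
    model: str
    base_url: Optional[str]


def parse_spec(spec: str) -> ClientSpec:
    """Split 'provider:model@url' into its parts (hypothetical helper)."""
    # Split on the first ':' only, because the URL itself contains colons.
    provider, sep, rest = spec.partition(":")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider:model[@url]', got {spec!r}")
    model, _, url = rest.partition("@")
    if not model:
        raise ValueError(f"missing model name in {spec!r}")
    return ClientSpec(provider, model, url or None)


llm_spec = parse_spec("vllm:Qwen3.5-9B@http://localhost:8000/v1")
print(llm_spec)
```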

</Accordion>
</AccordionGroup>

### Optional debug logging

Set these environment variables before running `he` commands:

<ParamField body="HYPER_EXTRACT_LOG_LEVEL" type="string">
Log level for structlog output. Values: `DEBUG`, `INFO`, `WARNING`, `ERROR`. Default: `WARNING`.
</ParamField>

<ParamField body="HYPER_EXTRACT_LOG_FILE" type="string">
Optional file path for persistent log output.
</ParamField>
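
The sketch below shows one way a program could resolve `HYPER_EXTRACT_LOG_LEVEL` to a numeric level. It uses the standard-library `logging` module rather than structlog, and both the case-insensitive matching and the fallback to `WARNING` for unknown values are assumptions, not confirmed library behavior.

```python
import logging
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def resolve_log_level(env=None) -> int:
    """Map HYPER_EXTRACT_LOG_LEVEL to a logging constant (hypothetical helper)."""
    env = os.environ if env is None else env
    name = env.get("HYPER_EXTRACT_LOG_LEVEL", "WARNING").upper()
    if name not in VALID_LEVELS:
        name = "WARNING"  # assumed fallback for unrecognized values
    return getattr(logging, name)


level = resolve_log_level({"HYPER_EXTRACT_LOG_LEVEL": "debug"})
print(logging.getLevelName(level))
```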

## Development installation

To modify source or run tests locally, clone the repository and install in editable mode:

```bash
git clone https://github.com/yifanfeng97/hyper-extract.git
cd hyper-extract
uv pip install -e ".[all]"
uv pip install --group dev pytest pytest-cov
```

CI validates against Python **3.11** and **3.12** on Ubuntu and macOS. See [Contributing](/contributing) for the full development workflow.

## Troubleshooting installation

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| `command not found: he` | CLI not on PATH | Re-run `uv tool install hyperextract` or add the uv tools bin directory to PATH |
| `LLM API key is not configured` | Missing first-run setup | Run `he config init` or export `OPENAI_API_KEY` |
| `vLLM provider requires base_url` | vLLM preset without URL | Set `-u http://host:port/v1` on both `he config llm` and `he config embedder` |
| `No module named 'langchain_anthropic'` | Anthropic extra not installed | `uv pip install 'hyperextract[anthropic]'` |
| Extraction produces empty or invalid JSON | Model lacks structured output | Switch to a verified model; see [Provider system](/provider-system) |

More failure modes and fixes: [Troubleshooting](/troubleshooting).

## Related pages

<CardGroup cols={2}>
<Card title="Quickstart" href="/quickstart">
Run `he config init`, parse a document, and query the resulting Knowledge Abstract.
</Card>
<Card title="Configure providers" href="/configure-providers">
Deep setup for mixed cloud and local vLLM deployments.
</Card>
<Card title="Overview" href="/overview">
What Hyper-Extract exposes and the shortest path from install to a queryable Knowledge Abstract.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full `~/.he/config.toml` schema and environment variable precedence.
</Card>
</CardGroup>

---

## 03. Quickstart

> First successful extraction: `he config init`, `he parse` with a preset template, `he search` / `he show`, and the equivalent Python `Template.create` + `feed_text` path using the Tesla biography example.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/03-quickstart.md
- Generated: 2026-06-18T20:53:40.249Z

### Source Files

- `README.md`
- `hyperextract/cli/README.md`
- `examples/en/tesla.md`
- `hyperextract/templates/presets/general/biography_graph.yaml`
- `hyperextract/cli/cli.py`
- `hyperextract/__init__.py`

---
title: "Quickstart"
description: "First successful extraction: `he config init`, `he parse` with a preset template, `he search` / `he show`, and the equivalent Python `Template.create` + `feed_text` path using the Tesla biography example."
---

Hyper-Extract turns unstructured text into a queryable Knowledge Abstract (KA) on disk. The shortest path is: configure LLM and embedder clients, run `he parse` with a preset YAML template such as `general/biography_graph` over `examples/en/tesla.md`, then query the KA with `he search` or visualize it with `he show`. The Python SDK exposes the same lifecycle through `Template.create` and `feed_text`.

<Note>
Requires Python 3.11+, an LLM with structured output (`json_schema` or function calling), and an OpenAI-compatible embedder for semantic search. See [Installation](/installation) for package setup.
</Note>

## Prerequisites

| Requirement | Details |
| --- | --- |
| Package | `hyperextract` installed via `uv tool install hyperextract` (CLI) or `uv pip install hyperextract` (Python) |
| API access | LLM and embedder credentials, or a local vLLM deployment |
| Sample input | `examples/en/tesla.md` — Nikola Tesla biography in English |
| Template | `general/biography_graph` — `temporal_graph` preset for biographies |

## End-to-end workflow

```text
he config init          →  ~/.he/config.toml (LLM + embedder)
        │
        ▼
he parse tesla.md       →  ./output/  (data.json, metadata.json, index/)
  -t general/biography_graph
  -l en
        │
        ├─► he show ./output/     (OntoSight graph)
        └─► he search ./output/   (semantic retrieval)
```

The `he parse` command calls `Template.create`, ingests text with `feed_text`, writes the KA with `dump`, and builds a FAISS index by default.

## Step 1: Configure providers

Run `he config init` once. Configuration is stored at `~/.he/config.toml`. Environment variables (`OPENAI_API_KEY`, `OPENAI_BASE_URL`) fill in a key or URL that the file omits.

<Steps>
<Step title="Choose a provider preset">

<Tabs>
<Tab title="OpenAI">

```bash
he config init -k YOUR_OPENAI_API_KEY
```

Sets provider `openai`, LLM `gpt-4o-mini`, embedder `text-embedding-3-small`.

</Tab>
<Tab title="Bailian">

```bash
he config init -p bailian -k YOUR_BAILIAN_API_KEY
```

Uses Bailian defaults: `qwen3.6-plus` (LLM) and `text-embedding-v4` (embedder).

</Tab>
<Tab title="Local vLLM">

```bash
he config llm -p vllm -u http://localhost:8000/v1 -k dummy -m Qwen3.5-9B
he config embedder -p vllm -u http://localhost:8001/v1 -k dummy -m bge-m3
```

vLLM requires explicit `base_url` values for LLM and embedder endpoints.

</Tab>
</Tabs>

</Step>
<Step title="Verify configuration">

```bash
he config show
```

Confirm both LLM and embedder rows show a model and API key (or `dummy` for local vLLM).

</Step>
</Steps>

<ParamField body="--api-key" type="string" required>
API key applied to both LLM and embedder in quick-init mode (`-k` / `--api-key`).
</ParamField>

<ParamField body="--provider" type="string">
Provider preset: `openai`, `bailian`, or `vllm`. Omit for OpenAI defaults when only `--api-key` is supplied.
</ParamField>

<ParamField body="--base-url" type="string">
Custom OpenAI-compatible endpoint. Used with `--provider` or standalone OpenAI init.
</ParamField>

## Step 2: Extract with the CLI

Parse the Tesla biography into a temporal knowledge graph. Knowledge templates require `--lang`; method templates (`-m`) default to English and ignore `--lang`.

```bash
he parse examples/en/tesla.md \
  -t general/biography_graph \
  -o ./output/ \
  -l en
```

<ParamField body="-t, --template" type="string" required>
Template ID. `general/biography_graph` resolves to the biography temporal-graph preset.
</ParamField>

<ParamField body="-o, --output" type="string" required>
Output KA directory. Must be empty unless `--force` is passed.
</ParamField>

<ParamField body="-l, --lang" type="string" required>
Language code (`en` or `zh`). Required for knowledge templates.
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip FAISS index build. Search and chat require a later `he build-index` run.
</ParamField>

<ParamField body="-f, --force" type="boolean">
Overwrite a non-empty output directory.
</ParamField>

<RequestExample>

```bash
he parse examples/en/tesla.md -t general/biography_graph -o ./output/ -l en
```

</RequestExample>

<ResponseExample>

```text
Input: examples/en/tesla.md
Output: ./output/
Template: general/biography_graph
Language: en
Build Index: Yes

Template resolved: Biography Graph Template
Success! Knowledge extracted to output

What's next?
  he show ./output/                    # Visualize knowledge graph
  he feed ./output/ <new_document>     # Append more documents
  he search ./output/ "keyword"        # Semantic search
  he talk ./output/ -i                 # Interactive chat
```

</ResponseExample>

### Output layout

:::files
./output/
├── data.json       # Extracted entities and relations
├── metadata.json   # Template ID, language, timestamps
└── index/          # FAISS vector store (when index is built)
:::

`general/biography_graph` produces a `temporal_graph` with entities (`name`, `type`, `description`) and relations (`source`, `target`, `type`, `time`, `description`). Relation identifiers follow `{source}|{type}|{target}`; the `time` field captures biographical dates.
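
The identifier pattern can be sketched in a few lines. The relation records below are made-up samples that use the field names listed above, not real extraction output, and `relation_id` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative records only; real values come from data.json.
relations = [
    {"source": "Nikola Tesla", "type": "worked_for", "target": "Thomas Edison", "time": "1884"},
    {"source": "Nikola Tesla", "type": "licensed_patents_to", "target": "George Westinghouse", "time": "1888"},
    {"source": "Nikola Tesla", "type": "worked_for", "target": "Thomas Edison", "time": "1885"},
]


def relation_id(rel: dict) -> str:
    """Build the '{source}|{type}|{target}' identifier for one relation."""
    return f"{rel['source']}|{rel['type']}|{rel['target']}"


# Group the dates recorded under each identifier.
times_by_id = defaultdict(list)
for rel in relations:
    times_by_id[relation_id(rel)].append(rel["time"])

for rid, times in times_by_id.items():
    print(rid, times)
```

Because `time` is not part of this identifier, two records that differ only by date share one key in the sketch.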

## Step 3: Search the Knowledge Abstract

Semantic search requires a built index. `he parse` builds one by default.

```bash
he search ./output/ "What are Tesla's major achievements?" -n 5
```

<ParamField body="query" type="string" required>
Natural-language search string.
</ParamField>

<ParamField body="-n, --top-k" type="integer">
Number of results to return. Default: `3`.
</ParamField>

<RequestExample>

```bash
he search ./output/ "Who was Tesla's main business partner?" -n 3
```

</RequestExample>

<ResponseExample>

```text
Knowledge Abstract: ./output/
Query: Who was Tesla's main business partner?
Top K: 3

Found 3 result(s):

Result 1:
{
  "name": "George Westinghouse",
  "type": "person",
  "description": "Founder of Westinghouse Electric Company..."
}
```

</ResponseExample>

<Tip>
Run `he info ./output/` to inspect node/edge counts and whether the index exists before searching.
</Tip>

## Step 4: Visualize with OntoSight

```bash
he show ./output/
```

Loads the KA from disk, recreates the template instance, and opens an interactive graph in the browser. Entity labels use `{name}`; relation labels use `{type}@{time}` per the template `display` block.

## Python equivalent

The SDK mirrors the CLI path. `Template.create` reads `~/.he/config.toml` when `llm_client` and `embedder` are omitted. Use `feed_text` to ingest into the current instance—the same call `he parse` makes internally.

<CodeGroup>

```python title="feed_text (matches CLI parse)"
from pathlib import Path
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")

text = Path("examples/en/tesla.md").read_text(encoding="utf-8")
ka.feed_text(text)

ka.build_index()
ka.dump("./output/")
ka.show()
```

```python title="parse (one-shot, new instance)"
from pathlib import Path
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")

text = Path("examples/en/tesla.md").read_text(encoding="utf-8")
result = ka.parse(text)          # returns a new instance; ka is unchanged

result.build_index()
result.dump("./output/")
result.show()
```

```python title="search and chat"
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")
ka.load("./output/")

results = ka.search("What were Tesla's major inventions?", top_k=5)
for item in results:
    print(item)

response = ka.chat("Summarize Tesla's War of Currents")
print(response.content)
```

</CodeGroup>

| Method | Behavior |
| --- | --- |
| `feed_text(text)` | Merges extracted data into the current instance. Supports chaining. |
| `parse(text)` | Returns a new instance without modifying the caller. Use for previews or branches. |
| `build_index()` | Builds FAISS index required for `search` and `chat`. |
| `dump(path)` | Writes `data.json`, `metadata.json`, and `index/`. |
| `load(path)` | Restores a saved KA from disk. |
| `show()` | Opens OntoSight visualization. |

<Warning>
Pass `language="en"` (or `"zh"`) when creating knowledge templates. Method templates such as `method/light_rag` always use English prompts regardless of the `language` argument.
</Warning>

### Optional: explicit clients

Override global config with programmatic clients for mixed cloud/local deployments:

```python
from pathlib import Path

from hyperextract import Template, create_client

llm, embedder = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)

ka = Template.create(
    "general/biography_graph",
    language="en",
    llm_client=llm,
    embedder=embedder,
)
ka.feed_text(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
```

## Verification checklist

<Check>
`he config show` reports LLM and embedder models with valid credentials.
</Check>

<Check>
`he info ./output/` shows non-zero node/edge counts and index status **Built**.
</Check>

<Check>
`he search ./output/ "Tesla coil"` returns entity or relation matches.
</Check>

<Check>
`he show ./output/` opens a graph with Tesla, Edison, Westinghouse, and dated relations.
</Check>

## Common failures

| Symptom | Fix |
| --- | --- |
| `No API key found` / config validation error | Run `he config init` or set `OPENAI_API_KEY` |
| `--lang is required for knowledge templates` | Add `-l en` or `-l zh` to `he parse` |
| `Output directory already exists and is not empty` | Use `-f` or choose a new `-o` path |
| `search` fails on missing index | Re-run parse without `--no-index`, or run `he build-index ./output/` |
| `Template not found` | List presets with `he list template -q biography` |

Set `HYPER_EXTRACT_LOG_LEVEL=DEBUG` for extraction-stage logging (`feed_text`, index build, template resolution).

## Next

<CardGroup>
<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
Full CLI and Python walkthrough for `examples/en/tesla.md` with expected artifacts and sample queries.
</Card>
<Card title="Configure providers" href="/configure-providers">
Per-service `he config llm` / `he config embedder`, environment variables, and `create_client()` patterns.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
`he talk`, `he info`, interactive modes, and `AutoType.show()` details.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk KA model, lifecycle methods, and incremental updates via `he feed`.
</Card>
</CardGroup>

---

## 04. Knowledge Abstracts

> The on-disk Knowledge Abstract (KA) model: `data.json`, `metadata.json`, and `index/` layout; lifecycle methods (`parse`, `feed_text`, `dump`, `load`, `build_index`); and incremental evolution via `he feed`.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/04-knowledge-abstracts.md
- Generated: 2026-06-18T20:53:34.051Z

### Source Files

- `hyperextract/types/base.py`
- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/cli/config.py`
- `hyperextract/utils/template_engine/factory.py`

---
title: "Knowledge Abstracts"
description: "The on-disk Knowledge Abstract (KA) model: `data.json`, `metadata.json`, and `index/` layout; lifecycle methods (`parse`, `feed_text`, `dump`, `load`, `build_index`); and incremental evolution via `he feed`."
---

A Knowledge Abstract (KA) is a directory persisted by `BaseAutoType.dump()` and restored by `BaseAutoType.load()`. Every `Template.create()` instance is a `BaseAutoType` subclass; `he parse` writes a new KA directory, and `he feed` loads an existing one, merges new text, and saves it back.

## On-disk layout

`dump()` writes three components under a single output directory:

:::files
my_ka/
├── data.json          # Structured knowledge (Pydantic schema JSON)
├── metadata.json      # Template ID, language, timestamps, type
└── index/             # FAISS vector store (optional until built)
    ├── index.faiss        # AutoModel, AutoList, AutoSet
    ├── docstore.json
    ├── node_index/        # AutoGraph and graph variants
    │   ├── index.faiss
    │   └── docstore.json
    └── edge_index/
        ├── index.faiss
        └── docstore.json
:::

| File | Required | Role |
|------|----------|------|
| `data.json` | Yes | Serialized `ka.data` via `model_dump()` |
| `metadata.json` | Recommended | Provenance and CLI template resolution |
| `index/` | No | Semantic search index; empty until `build_index()` |

<Warning>
`he search` and `he talk` require a non-empty `index/` directory. Run `he build-index <ka_path>` if you parsed with `--no-index` or have since run `he feed`.
</Warning>

### `data.json` shape by AutoType

The JSON mirrors the active Pydantic schema for the template's AutoType:

| AutoType | Top-level keys | Example use |
|----------|----------------|-------------|
| `AutoModel` | Schema field names (`name`, `summary`, …) | Single structured record |
| `AutoList` / `AutoSet` | `items` (array of objects) | Collections |
| `AutoGraph` (+ variants) | `nodes`, `edges` | Entity–relation graphs |

`he info` counts nodes/edges from `nodes`/`edges` or falls back to `entities`/`relations` for method outputs.
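
A minimal sketch of that fallback, assuming `data.json` has already been loaded into a dict. `count_nodes_edges` is a hypothetical helper and the sample payloads are made up; this is not the CLI's actual implementation.

```python
def count_nodes_edges(data: dict) -> tuple:
    """Count graph elements, preferring nodes/edges over entities/relations."""
    nodes = data["nodes"] if "nodes" in data else data.get("entities", [])
    edges = data["edges"] if "edges" in data else data.get("relations", [])
    return len(nodes), len(edges)


graph_ka = {
    "nodes": [{"name": "Tesla"}, {"name": "Edison"}],
    "edges": [{"source": "Tesla", "target": "Edison"}],
}
method_ka = {"entities": [{"name": "Tesla"}], "relations": []}

print(count_nodes_edges(graph_ka), count_nodes_edges(method_ka))
```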

### `metadata.json` fields

Written by `dump_metadata()` from the in-memory `metadata` dict. `TemplateFactory.create()` seeds:

<ResponseField name="template" type="string">
Template path (e.g. `general/biography_graph`, `method/light_rag`) or custom YAML stem. Used by `get_template_from_ka()` to recreate the correct AutoType on load.
</ResponseField>

<ResponseField name="lang" type="string">
Language code (`zh`, `en`). Knowledge templates require the same value at load time. Method templates always store `en`.
</ResponseField>

<ResponseField name="type" type="string">
AutoType discriminator: `model`, `list`, `set`, `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, or `spatio_temporal_graph`.
</ResponseField>

<ResponseField name="created_at" type="string">
ISO timestamp set at first extraction.
</ResponseField>

<ResponseField name="updated_at" type="string">
ISO timestamp updated on every `feed_text()` or `clear()`.
</ResponseField>

Custom keys can be added before `dump()` and round-trip through `load_metadata()`.
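
The sketch below writes and re-reads a metadata dict with plain `json` to show the round trip, including one custom key. The `new_metadata` helper, the `source_file` key, and the UTC timestamp format are assumptions for illustration; the library's `dump_metadata()` / `load_metadata()` may format values differently.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def new_metadata(template: str, lang: str, ka_type: str) -> dict:
    """Seed the documented fields; both timestamps start out equal."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "template": template,
        "lang": lang,
        "type": ka_type,
        "created_at": now,
        "updated_at": now,
    }


meta = new_metadata("general/biography_graph", "en", "temporal_graph")
meta["source_file"] = "examples/en/tesla.md"  # hypothetical custom key

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.json"
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored["template"], restored["source_file"])
```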

## Lifecycle architecture

`BaseAutoType` owns extraction, merge, indexing, and serialization. State changes flow through three hooks:

```mermaid
stateDiagram-v2
    [*] --> Empty: __init__ / clear()
    Empty --> Populated: parse() or feed_text()
    Populated --> Populated: feed_text() merge
    Populated --> Indexed: build_index()
    Indexed --> StaleIndex: feed_text() clears index
    StaleIndex --> Indexed: build_index()
    Populated --> OnDisk: dump()
    Indexed --> OnDisk: dump()
    OnDisk --> Populated: load()
    OnDisk --> Indexed: load() with index/
```

| Method | Mutates caller | Index effect | Typical use |
|--------|----------------|--------------|-------------|
| `parse(text)` | No — returns new instance | Fresh instance, no index | Preview, branch, immutable pipeline |
| `feed_text(text)` | Yes — returns `self` | Clears index via `_update_data_state` | Incremental ingestion |
| `build_index()` | Yes | Builds FAISS from current data | Enable `search` / `chat` |
| `dump(folder)` | No | Writes data, metadata, index | Persist KA |
| `load(folder)` | Yes | Restores all three when present | Resume from disk |
| `clear()` | Yes | Resets data and index | Start over in memory |
| `clear_index()` | Yes | Drops index only | Force rebuild |
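
The toy class below mimics the table's semantics without any LLM or FAISS dependency, which may help when reasoning about stale indexes. `ToyKA` and its line-based "extraction" are stand-ins invented for this sketch; only the method names and their mutate/return behavior follow the table.

```python
class ToyKA:
    """Stand-in for a Knowledge Abstract; items are plain strings."""

    def __init__(self) -> None:
        self.items = []
        self.index = None  # None means "no index built"

    def _extract(self, text: str) -> list:
        # Placeholder for LLM extraction: one item per non-empty line.
        return [line.strip() for line in text.splitlines() if line.strip()]

    def parse(self, text: str) -> "ToyKA":
        fresh = ToyKA()  # the caller is left untouched
        fresh.items = self._extract(text)
        return fresh

    def feed_text(self, text: str) -> "ToyKA":
        for item in self._extract(text):
            if item not in self.items:
                self.items.append(item)
        self.index = None  # merged data makes any existing index stale
        return self  # enables chaining

    def build_index(self) -> None:
        self.index = set(self.items)

    def search(self, query: str) -> list:
        if self.index is None:
            raise RuntimeError("index missing; call build_index() first")
        return [item for item in self.items if query.lower() in item.lower()]


ka = ToyKA()
branch = ka.parse("Tesla coil\nAC motor")          # ka stays empty
ka.feed_text("AC motor").feed_text("Wardenclyffe Tower")
ka.build_index()
hits = ka.search("motor")
ka.feed_text("Induction motor")                     # index is cleared again
```

After the final `feed_text` call, `ka.search(...)` raises until `build_index()` runs again, which mirrors the Indexed to StaleIndex transition in the diagram.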

<Info>
Long texts are chunked (`chunk_size` default 2048, `chunk_overlap` 256), extracted in parallel (`max_workers` default 10), then merged with type-specific `merge_batch_data()` logic.
</Info>
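
A simplified sketch of the chunk, extract, and merge flow follows. It uses fixed-width character windows instead of the library's text splitter, and `fake_extract` is a placeholder for the per-chunk LLM call, so treat it as an illustration of the defaults quoted above rather than the real pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list:
    """Cut text into windows; each window starts chunk_size - chunk_overlap after the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]


def fake_extract(chunk: str) -> set:
    """Placeholder for the per-chunk LLM call: collect capitalized words."""
    return {word.strip(".,") for word in chunk.split() if word[:1].isupper()}


def extract(text: str, max_workers: int = 10) -> set:
    chunks = chunk_text(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(fake_extract, chunks))
    return set().union(*partials)  # stands in for merge_batch_data()


digits = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(digits)
names = extract("Tesla met Westinghouse. Tesla left Edison.")
print(len(chunks), sorted(names))
```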

### `parse` vs `feed_text`

<CodeGroup>
```python title="Python — branch without mutating"
from hyperextract import Template

document_text = "Nikola Tesla was born in 1856 in Smiljan."  # placeholder input

ka = Template.create("general/biography_graph", "en")
branch = ka.parse(document_text)   # new instance, original stays empty
branch.build_index()
branch.dump("./preview_ka/")
```

```python title="Python — evolve in place"
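from hyperextract import Template

doc_a, doc_b = "Tesla was born in 1856.", "Tesla arrived in New York in 1884."  # placeholders
ka = Template.create("general/biography_graph", "en")
ka.feed_text(doc_a).feed_text(doc_b)   # method chaining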
ka.build_index()
ka.dump("./main_ka/")
```
</CodeGroup>

`parse()` calls `_set_data_state()` (full replace). `feed_text()` calls `_update_data_state()` (incremental merge with deduplication rules per AutoType).

Two instances with the same schema can be merged with the `+` operator, which calls `merge_batch_data()` and returns a new instance.

## CLI workflows

### Create a KA — `he parse`

<Steps>
<Step title="Resolve template and validate config">
`he parse` calls `validate_config()`, resolves `--template` / `--method`, and requires `--lang` for knowledge templates (methods force `en`).
</Step>
<Step title="Extract and save">
`Template.create()` → `feed_text()` on input → `dump(output)`. Unless `--no-index` is set, it then calls `build_index()` and `dump()` again to persist the index.
</Step>
<Step title="Verify">
```bash
he info ./my_ka/
ls ./my_ka/    # expect data.json, metadata.json, index/
```
</Step>
</Steps>

<ParamField body="--output / -o" type="string" required>
Output directory. Fails if non-empty unless `--force`.
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip `build_index()`; KA is searchable only after `he build-index`.
</ParamField>

### Evolve a KA — `he feed`

`he feed` loads template and language from `metadata.json` (override with `--template` / `--lang`), then:

1. `ka.load(ka_path)`
2. `ka.feed_text(new_text)`
3. `ka.dump(ka_path)`

<Warning>
`he feed` does not rebuild the index. After feeding, run `he build-index <ka_path>` (or `he build-index <ka_path> --force`) before `he search` or `he talk`.
</Warning>

<RequestExample>
```bash
he parse examples/en/tesla.md -t general/biography_graph -l en -o ./tesla_ka/
he feed ./tesla_ka/ examples/en/another_doc.md
he build-index ./tesla_ka/ --force
he search ./tesla_ka/ "AC motor"
```
</RequestExample>

### Rebuild index — `he build-index`

Loads the KA, calls `clear_index()` first when `--force` is passed, then calls `build_index()` and `dump()` to refresh `index/`. Exits early if an index already exists and `--force` is omitted.

## Python API reference

All lifecycle methods live on `BaseAutoType` and are exported through `Template.create()`:

```python
from pathlib import Path

from hyperextract import Template

# Create (reads ~/.he/config.toml when clients omitted)
ka = Template.create("general/biography_graph", "en")

# Full cycle
ka.feed_text(Path("doc.md").read_text(encoding="utf-8"))
ka.build_index()
ka.dump("./my_ka/")

# Reload later — template + lang must match metadata
ka2 = Template.create("general/biography_graph", "en")
ka2.load("./my_ka/")
results = ka2.search("wireless power", top_k=5)
answer = ka2.chat("What did Tesla invent?")
```

Granular serialization is also available:

| Method | Target |
|--------|--------|
| `dump_data(path)` | Single `data.json` |
| `dump_metadata(path)` | Single `metadata.json` |
| `dump_index(folder)` | FAISS files under `index/` |
| `load_data(path)` | Validates against schema, replaces state |
| `load_metadata(path)` | Merges into `metadata` dict |
| `load_index(folder)` | Restores FAISS; needs matching embedder |

<Info>
Index failures are non-fatal: `dump()` prints a warning if index save fails; `load()` warns and leaves data intact so you can call `build_index()`.
</Info>

## Index internals

Hyper-Extract uses LangChain `FAISS` as the vector backend:

- **Scalar and collection types** (`AutoModel`, `AutoList`, `AutoSet`): one `index/` with `index.faiss` and `docstore.json`.
- **Graph types** (`AutoGraph`, `AutoHypergraph`, temporal/spatial variants): `index/node_index/` and `index/edge_index/` subdirectories, each with its own FAISS store.

`build_index()` embeds text derived from stored objects. `search()` runs `similarity_search` and restores Pydantic items from `Document.metadata["raw"]`.

## Validation rules

The CLI enforces KA integrity before commands run:

| Validator | Checks |
|-----------|--------|
| `validate_ka_path` | Path exists and is a directory |
| `validate_ka_with_data` | `data.json` present |
| `validate_ka_with_index` | `index/` exists and is non-empty |
| `get_template_from_ka` | `metadata.json` with resolvable `template` (preset or local `{template}.yaml`) |

Custom templates copied into the KA directory during `he parse` are resolved from the local YAML on reload.
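
The checks in the table can be approximated with `pathlib`, as in the sketch below. `check_ka` is a hypothetical helper that only tests whether a `template` value is present; it does not resolve presets or local YAML the way `get_template_from_ka` is described to.

```python
import json
import tempfile
from pathlib import Path


def check_ka(ka_dir: Path) -> list:
    """Return a list of problems; an empty list means every check passed."""
    if not ka_dir.is_dir():
        return ["path is not an existing directory"]
    problems = []
    if not (ka_dir / "data.json").is_file():
        problems.append("data.json missing")
    index_dir = ka_dir / "index"
    if not index_dir.is_dir() or not any(index_dir.iterdir()):
        problems.append("index/ missing or empty")
    try:
        metadata = json.loads((ka_dir / "metadata.json").read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        metadata = {}
    if not metadata.get("template"):
        problems.append("metadata.json has no template")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    ka_dir = Path(tmp) / "my_ka"
    (ka_dir / "index").mkdir(parents=True)
    (ka_dir / "data.json").write_text('{"nodes": [], "edges": []}', encoding="utf-8")
    (ka_dir / "metadata.json").write_text(
        '{"template": "general/biography_graph"}', encoding="utf-8"
    )
    before = check_ka(ka_dir)                 # index/ exists but is empty
    (ka_dir / "index" / "index.faiss").touch()
    after = check_ka(ka_dir)
    absent = check_ka(Path(tmp) / "nope")

print(before, after, absent)
```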

## Related pages

<CardGroup>
<Card title="Auto-Types" href="/auto-types">
Eight extraction primitives, merge behavior, and index layout differences across `model`, `list`, `set`, and graph types.
</Card>
<Card title="Extract and evolve" href="/extract-and-evolve">
End-to-end `he parse`, `he feed`, and `he build-index` workflows with input formats and flags.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
Query a built KA with `he search`, `he talk`, and `he show` after the index exists.
</Card>
<Card title="Python API reference" href="/python-api-reference">
Full `Template.create`, `BaseAutoType` method signatures, and client factory exports.
</Card>
</CardGroup>

---

## 05. Auto-Types

> Eight strongly-typed extraction primitives (`AutoModel`, `AutoList`, `AutoSet`, `AutoGraph`, `AutoHypergraph`, `AutoTemporalGraph`, `AutoSpatialGraph`, `AutoSpatioTemporalGraph`): structure, merge behavior, indexing, and when to pick each type.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/05-auto-types.md
- Generated: 2026-06-18T20:54:54.065Z

### Source Files

- `hyperextract/types/__init__.py`
- `hyperextract/types/base.py`
- `hyperextract/types/graph.py`
- `hyperextract/types/hypergraph.py`
- `hyperextract/types/temporal_graph.py`
- `hyperextract/types/spatial_graph.py`
- `hyperextract/types/spatio_temporal_graph.py`
- `hyperextract/types/model.py`

---
title: "Auto-Types"
description: "Eight strongly-typed extraction primitives (`AutoModel`, `AutoList`, `AutoSet`, `AutoGraph`, `AutoHypergraph`, `AutoTemporalGraph`, `AutoSpatialGraph`, `AutoSpatioTemporalGraph`): structure, merge behavior, indexing, and when to pick each type."
---

Hyper-Extract's eight AutoTypes are Pydantic-backed Knowledge Abstract primitives in `hyperextract.types`. Each class wraps LangChain structured output, long-text chunking, chunk-result merging, FAISS semantic indexing, and on-disk persistence (`data.json`, `metadata.json`, `index/`). YAML templates and extraction methods both resolve to one of these types via `Template.create()`.

## Architecture

All AutoTypes inherit `BaseAutoType[T]`, which owns the shared extraction pipeline:

1. Split input with `RecursiveCharacterTextSplitter` (default `chunk_size=2048`, `chunk_overlap=256`).
2. Call `llm_client.with_structured_output(schema)` per chunk (batched up to `max_workers=10`).
3. Merge chunk results through type-specific `merge_batch_data()`.
4. Apply results via `_set_data_state()` (`parse`) or `_update_data_state()` (`feed_text`); both invalidate the vector index on mutation.

```mermaid
classDiagram
    class BaseAutoType {
        +parse(text) BaseAutoType
        +feed_text(text) BaseAutoType
        +build_index()
        +search(query, top_k)
        +chat(query, top_k)
        +dump(folder_path)
        +load(folder_path)
        #merge_batch_data(data_list) T
    }
    class AutoModel
    class AutoList
    class AutoSet
    class AutoGraph
    class AutoHypergraph
    class AutoTemporalGraph
    class AutoSpatialGraph
    class AutoSpatioTemporalGraph

    BaseAutoType <|-- AutoModel
    BaseAutoType <|-- AutoList
    BaseAutoType <|-- AutoSet
    BaseAutoType <|-- AutoGraph
    BaseAutoType <|-- AutoHypergraph
    AutoGraph <|-- AutoTemporalGraph
    AutoGraph <|-- AutoSpatialGraph
    AutoGraph <|-- AutoSpatioTemporalGraph
```

<Info>
`parse()` returns a **new** instance; `feed_text()` mutates the current instance and supports chaining (`ka.feed_text(a).feed_text(b)`). Use `+` to merge two instances of the same class and schema.
</Info>

## Type selection guide

| Class | Data shape | Primary use case | Deduplication | Default merge |
| :--- | :--- | :--- | :--- | :--- |
| `AutoModel` | Single `BaseModel` | Document summary, metadata, one record per file | Treats all chunks as one object (`singleton` key) | `llm_balanced` |
| `AutoList` | `List[Item]` | Ordered collections, event logs, quotes | None (append) | Concatenate all items |
| `AutoSet` | `List[Item]` (unique) | Entity registries, glossaries, keyword sets | By `key_extractor` via `OMem` | `llm_balanced` |
| `AutoGraph` | `nodes` + `edges` | Pairwise knowledge graphs | Nodes and edges separately via `OMem` | `llm_balanced` per node/edge |
| `AutoHypergraph` | `nodes` + hyperedges | Multi-participant events, meetings, groups | Same as graph; hyperedge keys must be order-stable | `llm_balanced` per node/edge |
| `AutoTemporalGraph` | Graph + time on edges | Timelines, news, biographies | Edge key includes time component | `llm_balanced` + `observation_time` injection |
| `AutoSpatialGraph` | Graph + location on edges | Floor plans, facility layouts | Edge key includes location component | `llm_balanced` + `observation_location` injection |
| `AutoSpatioTemporalGraph` | Graph + time + location | Travel logs, incident reports | Composite edge key (`@ time at location`) | Both observation contexts injected |

## Scalar and collection types

### AutoModel

`AutoModel` targets **one structured object per document**. Every chunk is treated as a partial view of the same object; chunk results merge through `ontomem` with a constant `singleton` key.

**Structure:** A single Pydantic `data_schema` instance stored in `_data` (or `None` when empty).

**Merge behavior:**

| Trigger | Behavior |
| :--- | :--- |
| Multi-chunk extraction | `merge_batch_data()` groups all extractions under `singleton` and merges via configured strategy |
| `feed_text()` | Merges incoming object into existing via same merger |
| `+` operator | Merges two `AutoModel` instances with the same schema and returns a new `AutoModel` |

**Indexing:** `build_index()` creates one FAISS document per non-null schema field. `search()` returns field-value dictionaries from matched fields.

**Template example:** `general/base_model` (`type: model`), `finance/earnings_summary`.

### AutoList

`AutoList` extracts **many independent items** where order may matter and duplicates are acceptable.

**Structure:** `AutoListSchema` with an `items: List[ItemSchema]` field.

**Merge behavior:** `merge_batch_data()` and `feed_text()` **append** items across chunks. No key-based deduplication.

**Indexing:** Each list item becomes one FAISS document (full JSON or selected `fields_for_index`).

**Template example:** `general/base_list`, `legal/compliance_list`.

### AutoSet

`AutoSet` maintains a **deduplicated registry** keyed by a user-defined `key_extractor`. Internal storage uses `OMem`; the external `items` property exposes a list.

**Merge behavior:** Configurable `strategy_or_merger` (default `MergeStrategy.LLM.BALANCED`):

| YAML / API value | Effect |
| :--- | :--- |
| `merge_field` | Non-null fields overwrite; lists append |
| `keep_existing` | First occurrence wins |
| `keep_incoming` | Latest occurrence wins |
| `llm_balanced` | LLM synthesizes both versions (default) |
| `llm_prefer_existing` | LLM merge biased toward stored data |
| `llm_prefer_incoming` | LLM merge biased toward new data |
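
The three non-LLM strategies can be sketched over plain dicts. This is one reading of the table, not the `OMem` implementation: `merge_field`, `STRATEGIES`, and `add_items` are hypothetical names, and the LLM-backed strategies are omitted because they need a model call.

```python
def merge_field(existing: dict, incoming: dict) -> dict:
    """Non-null incoming fields overwrite; list fields are appended."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is None:
            continue
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


STRATEGIES = {
    "merge_field": merge_field,
    "keep_existing": lambda existing, incoming: existing,
    "keep_incoming": lambda existing, incoming: incoming,
}


def add_items(registry: dict, items: list, key_extractor, strategy: str) -> dict:
    """Insert items into a keyed registry, merging on key collisions."""
    merge = STRATEGIES[strategy]
    for item in items:
        key = key_extractor(item)
        registry[key] = merge(registry[key], item) if key in registry else item
    return registry


first = {"name": "AC motor", "summary": "Induction motor", "tags": ["electrical"]}
second = {"name": "AC motor", "summary": None, "tags": ["1887"]}

results = {
    name: add_items({}, [first, second], lambda item: item["name"], name)["AC motor"]
    for name in STRATEGIES
}
print(results["merge_field"])
```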

**Indexing:** Delegates to `OMem.build_index()` with optional `fields_for_index`.

**Set operations:** Supports `|` (union), `&` (intersection), `-` (difference) on compatible instances.

**Template example:** `general/base_set` (`identifiers.item_id: name`), `finance/risk_factor_set`.

## Graph types

### AutoGraph

Standard **binary** knowledge graph: `nodes` (entities) and `edges` (source→target relations).

**Structure:**

```
AutoGraphSchema
├── nodes: List[NodeSchema]
└── edges: List[EdgeSchema]
```

**Required extractors:**

<ParamField body="node_key_extractor" type="Callable" required>
Returns a stable unique key per node (e.g., `lambda x: x.name`).
</ParamField>

<ParamField body="edge_key_extractor" type="Callable" required>
Returns a stable unique key per edge (e.g., `lambda x: f"{x.source}|{x.type}|{x.target}"`).
</ParamField>

<ParamField body="nodes_in_edge_extractor" type="Callable" required>
Returns `(source_key, target_key)` for endpoint validation and pruning.
</ParamField>

**Extraction modes:**

| Mode | Pipeline |
| :--- | :--- |
| `one_stage` | Single structured call extracts nodes and edges together |
| `two_stage` (default in base templates) | Batch-extract nodes per chunk, then batch-extract edges with chunk-local node context |

After extraction, `_prune_dangling_edges()` drops edges whose endpoints are not in the node set.
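
The pruning step can be sketched as a filter over the extractor outputs. This is an illustration under assumptions: `prune_dangling_edges` is a hypothetical stand-alone function, and nodes and edges are plain dictionaries rather than the library's schema objects.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence

Record = Dict[str, Any]


def prune_dangling_edges(
    nodes: Iterable[Record],
    edges: Iterable[Record],
    node_key_extractor: Callable[[Record], str],
    nodes_in_edge_extractor: Callable[[Record], Sequence[str]],
) -> List[Record]:
    """Keep only edges whose endpoints all appear in the node set."""
    known = {node_key_extractor(node) for node in nodes}
    return [
        edge
        for edge in edges
        if all(key in known for key in nodes_in_edge_extractor(edge))
    ]


nodes = [{"name": "Tesla"}, {"name": "Edison"}]
edges = [
    {"source": "Tesla", "type": "worked_for", "target": "Edison"},
    {"source": "Tesla", "type": "met", "target": "Twain"},  # "Twain" is not a node
]
kept = prune_dangling_edges(
    nodes,
    edges,
    node_key_extractor=lambda n: n["name"],
    nodes_in_edge_extractor=lambda e: (e["source"], e["target"]),
)
```

Because the check uses `all(...)` over whatever the extractor returns, the same filter would also cover hyperedges with more than two participants.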

**Merge behavior:** Separate `node_merger` and `edge_merger` (default `llm_balanced`). `feed_text()` calls `OMem.add()` for incremental node/edge insertion.

**Indexing:** `build_index()` builds separate FAISS stores for nodes and edges. On disk, `index/node_index/` and `index/edge_index/`. `search()` returns `(nodes, edges)`; `chat()` formats both into structured context.

**Template example:** `general/base_graph`, `general/concept_graph`.

### AutoHypergraph

Extends the graph pattern for **N-ary relations** (hyperedges connecting two or more nodes).

**Key difference from `AutoGraph`:**

| Aspect | `AutoGraph` | `AutoHypergraph` |
| :--- | :--- | :--- |
| Edge arity | Exactly two endpoints | Two or more participants |
| `nodes_in_edge_extractor` | `Tuple[str, str]` | `Tuple[str, ...]` (all participants) |
| Consistency check | Both endpoints must exist | **All** participants must exist (strict mode) |
| `extraction_mode` | `one_stage` or `two_stage` (base templates default to `two_stage`) | `two_stage` (recommended) |

<Warning>
Hyperedge deduplication requires **order-stable** `edge_key_extractor` values. Sort participant keys inside the extractor so `{A, B}` and `{B, A}` map to the same key:

```python
edge_key_extractor=lambda x: f"{x.name}|{sorted(x.participants)}"
```
</Warning>

**Template example:** `general/base_hypergraph` (`relation_members: participants`), `legal/contract_obligation`.
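
The ordering caveat from the warning above can be checked in isolation. In this sketch the `Hyperedge` dataclass is hypothetical; only the `name` and `participants` field names follow the warning's example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Hyperedge:
    name: str
    participants: Tuple[str, ...]


def stable_key(edge: Hyperedge) -> str:
    """Sort participants so the key ignores extraction order."""
    return f"{edge.name}|{'|'.join(sorted(edge.participants))}"


def unstable_key(edge: Hyperedge) -> str:
    """Order-sensitive key: the same relation can yield different keys."""
    return f"{edge.name}|{'|'.join(edge.participants)}"


a = Hyperedge("signed_contract", ("Acme", "Beta", "Gamma"))
b = Hyperedge("signed_contract", ("Gamma", "Acme", "Beta"))
```

Here `stable_key(a)` and `stable_key(b)` are identical, so the two extractions collapse into one hyperedge, while `unstable_key` would keep both.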

## Context-aware graph types

`AutoTemporalGraph`, `AutoSpatialGraph`, and `AutoSpatioTemporalGraph` subclass `AutoGraph`. They inject observation context into prompts and fold time/location into edge deduplication keys.

### AutoTemporalGraph

Resolves relative time expressions ("yesterday", "last year") against an observation date.

<ParamField body="observation_time" type="string">
Reference date for relative-time resolution. Defaults to today (`YYYY-MM-DD`). Pass via `Template.create(..., observation_time="2024-06-15")` or template `options`.
</ParamField>

<ParamField body="time_in_edge_extractor" type="Callable" required>
Extracts the time component from an edge (e.g., `lambda x: x.time or ""`).
</ParamField>

**Edge identity:** `f"{raw_edge_key} @ {time_val}"` when time is present.

**Extraction rules baked into prompts:** Dates and time periods are **not** extracted as nodes; time lives on edge fields. Relative times resolve against `observation_time`.

**Template example:** `general/base_temporal_graph`, `general/biography_graph`, `finance/event_timeline`.

### AutoSpatialGraph

Resolves relative location expressions ("nearby", "here") against an observation location.

<ParamField body="observation_location" type="string">
Reference location for spatial resolution. Defaults to `"Unknown Location"`.
</ParamField>

<ParamField body="location_in_edge_extractor" type="Callable" required>
Extracts the spatial component from an edge (e.g., `lambda x: x.place or ""`).
</ParamField>

**Edge identity:** `f"{raw_edge_key} at {loc_val}"` when location is present.

**Extraction rules:** Locations and directions are **not** extracted as nodes; spatial context belongs on edges.

**Template example:** `general/base_spatial_graph`, `medicine/treatment_map`.

### AutoSpatioTemporalGraph

Combines temporal and spatial resolution in one extractor.

**Edge identity:** `raw_key`, optionally suffixed with `@ {time}` and `at {location}`.

**Template example:** `general/base_spatio_temporal_graph`, `medicine/hospital_timeline`.
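
The edge-identity rules for the three context-aware types can be sketched with one helper. `contextual_edge_key` is a hypothetical function, and placing the time suffix before the location suffix is an assumption; the suffix formats themselves follow the edge-identity lines above.

```python
from typing import Optional


def contextual_edge_key(
    raw_edge_key: str,
    time_val: Optional[str] = None,
    loc_val: Optional[str] = None,
) -> str:
    """Append time and location suffixes only when they are present."""
    key = raw_edge_key
    if time_val:
        key = f"{key} @ {time_val}"
    if loc_val:
        key = f"{key} at {loc_val}"
    return key


raw = "Tesla|moved_to|United States"
temporal_key = contextual_edge_key(raw, time_val="1884")
spatial_key = contextual_edge_key(raw, loc_val="New York")
both_key = contextual_edge_key(raw, time_val="1884", loc_val="New York")
```

Two edges with the same raw key but different times or places get different identities, so they are stored as separate facts rather than merged.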

## Merge strategies (shared reference)

Types that support configurable merging (`AutoModel`, `AutoSet`, graph family) accept `strategy_or_merger` (or per-node/edge variants in YAML):

```yaml
options:
  merge_strategy: llm_balanced          # AutoModel, AutoSet
  entity_merge_strategy: llm_balanced   # graph family
  relation_merge_strategy: merge_field  # graph family
```

Programmatic construction passes `MergeStrategy` enum values or a custom `BaseMerger` from `ontomem.merger`.

<Note>
`AutoList` has no merge-strategy knob — chunk and feed operations always concatenate. Choose `AutoSet` when duplicate items must collapse by key.
</Note>

## Indexing and query

| Type | Index unit | `build_index()` scope | `search()` return type |
| :--- | :--- | :--- | :--- |
| `AutoModel` | Non-null fields | Single FAISS store | `List[dict]` (field snapshots) |
| `AutoList` | Each item | Single FAISS store | `List[ItemSchema]` |
| `AutoSet` | Each unique item | Via `OMem` | `List[ItemSchema]` |
| `AutoGraph` / hypergraph / context graphs | Nodes and edges separately | `node_index/` + `edge_index/` | `Tuple[List[Node], List[Edge]]` |

All types use FAISS (`langchain_community.vectorstores.FAISS`) backed by the configured embedder. Calling `search()` or `chat()` without a built index raises an error (graph types report which sub-index is missing).

`show()` renders through OntoSight (`view_nodes`, `view_graph`, or `view_hypergraph`) and wires search/chat callbacks when indices exist.

## Lifecycle and persistence

Every AutoType instance is a Knowledge Abstract. Standard operations:

| Method | Effect |
| :--- | :--- |
| `parse(text)` | Extract into a new instance; does not modify `self` |
| `feed_text(text)` | Extract and merge into `self`; invalidates index |
| `build_index()` | Build or rebuild FAISS from current data |
| `dump(folder)` | Write `data.json`, `metadata.json`, `index/` |
| `load(folder)` | Restore data, metadata, and index (rebuild if index load fails) |
| `clear()` | Reset data and index |
| `clear_index()` | Drop index only |

Chunking, merge, and index invalidation run automatically — callers supply schema, extractors, and provider clients only.

## Templates and programmatic use

YAML `type` maps 1:1 to AutoType classes via `TemplateFactory`:

| Template `type` | Python class |
| :--- | :--- |
| `model` | `AutoModel` |
| `list` | `AutoList` |
| `set` | `AutoSet` |
| `graph` | `AutoGraph` |
| `hypergraph` | `AutoHypergraph` |
| `temporal_graph` | `AutoTemporalGraph` |
| `spatial_graph` | `AutoSpatialGraph` |
| `spatio_temporal_graph` | `AutoSpatioTemporalGraph` |

<CodeGroup>
```python Python API
from pathlib import Path

from hyperextract import Template, create_client

llm, embedder = create_client()
ka = Template.create("general/biography_graph", "en", llm, embedder,
                     observation_time="2024-01-15")
# Read the source document as UTF-8 text and extract into the KA
ka.feed_text(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
ka.build_index()
results = ka.search("When did Tesla move to America?", top_k=5)
ka.dump("./output/tesla")
```

```bash CLI
he config init
he parse examples/en/tesla.md -t general/biography_graph --lang en -o ./output/tesla
he search ./output/tesla "When did Tesla move to America?"
he show ./output/tesla
```
</CodeGroup>

Extraction methods (`method/light_rag`, `method/atom`, etc.) also produce AutoType instances — typically `AutoGraph` or `AutoTemporalGraph` — with algorithm-specific schemas. See the extraction methods reference for per-method output types.

## When to pick each type

<AccordionGroup>
<Accordion title="Choose AutoModel when">
- The output is **one record per document** (summary, report metadata, sentiment snapshot).
- Fields from different chunks describe the **same** object and must be synthesized, not listed.
- Example presets: `finance/earnings_summary`, `finance/sentiment_model`.
</Accordion>

<Accordion title="Choose AutoList when">
- Items are **independent** and order or repetition matters.
- You want the simplest merge semantics (append only).
- Example presets: `legal/compliance_list`, `general/base_list`.
</Accordion>

<Accordion title="Choose AutoSet when">
- Items need **deduplication** by a stable identifier.
- The same entity may appear in many chunks and attributes must merge intelligently.
- Example presets: `finance/risk_factor_set`, `legal/defined_term_set`.
</Accordion>

<Accordion title="Choose AutoGraph when">
- Relationships are **binary** (A→B).
- Standard entity–relation knowledge graphs suffice.
- Example presets: `general/concept_graph`, `tcm/meridian_graph`.
</Accordion>

<Accordion title="Choose AutoHypergraph when">
- A single relation involves **three or more** participants (meetings, transactions, obligations).
- Example presets: `legal/contract_obligation`, `tcm/formula_composition`.
</Accordion>

<Accordion title="Choose AutoTemporalGraph when">
- Edges carry **time** and relative dates must resolve ("last year", "at age 20").
- Dates should not become standalone nodes.
- Example presets: `general/biography_graph`, `finance/event_timeline`.
</Accordion>

<Accordion title="Choose AutoSpatialGraph when">
- Edges carry **where** information and relative places must resolve ("nearby", "this room").
- Example presets: `medicine/treatment_map`, `industry/equipment_topology`.
</Accordion>

<Accordion title="Choose AutoSpatioTemporalGraph when">
- Both **when and where** matter on the same edges (incidents, travel, hospital events).
- Example presets: `medicine/hospital_timeline`, `general/base_spatio_temporal_graph`.
</Accordion>
</AccordionGroup>

<Tip>
Start from the matching `general/base_*` template when authoring a custom YAML template. Each base file demonstrates the expected `output`, `identifiers`, `options`, and `display` blocks for that AutoType.
</Tip>

## Related pages

<CardGroup>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk layout (`data.json`, `metadata.json`, `index/`) and lifecycle methods shared by all AutoTypes.
</Card>
<Card title="Create custom templates" href="/create-custom-templates">
Author YAML templates: pick a `type`, define fields, identifiers, and merge strategies.
</Card>
<Card title="Template schema reference" href="/template-schema-reference">
Valid `type` values, field types, identifier patterns, and `options` keys per AutoType.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`Template.create`, `BaseAutoType` methods, and `create_client()` entry points.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
Query and render Knowledge Abstracts with `he search`, `he talk`, and `he show`.
</Card>
</CardGroup>

---

## 06. Templates vs methods

> Domain YAML templates (`general/biography_graph`, `finance/earnings_summary`, etc.) versus algorithm-driven method templates (`method/light_rag`, `method/atom`); language requirements (`--lang` for templates, English-only for methods); and selection criteria.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/06-templates-vs-methods.md
- Generated: 2026-06-18T20:54:31.827Z

### Source Files

- `hyperextract/utils/template_engine/gallery.py`
- `hyperextract/methods/registry.py`
- `hyperextract/utils/template_engine/factory.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/templates/README.md`
- `hyperextract/cli/cli.py`


Hyper-Extract exposes two extraction paths that both produce queryable Knowledge Abstracts through the same `Template.create` / `he parse` surface: **knowledge templates** (declarative YAML presets under `hyperextract/templates/presets/`) and **method templates** (registered algorithm classes under `hyperextract/methods/`). `TemplateFactory.create` routes `method/{name}` IDs to `create_method` and all other IDs through `Gallery` plus `localize_template`.

## Two extraction paths

| Aspect | Knowledge templates | Method templates |
|--------|---------------------|------------------|
| ID format | `{domain}/{name}` (e.g. `general/biography_graph`, `finance/earnings_summary`) | `method/{name}` (e.g. `method/light_rag`, `method/atom`) |
| Definition | YAML files with `output`, `guideline`, `identifiers`, `options` | Python classes registered in `hyperextract/methods/registry.py` |
| Discovery | `Gallery` scans `templates/presets/**/*.yaml` at import | `register_method` populates `_METHOD_REGISTRY` at import |
| Schema control | Full field, entity, and relation schemas per template | Fixed by algorithm; autotype comes from registry (`graph` or `hypergraph`) |
| Language | Multilingual YAML; runtime language required (`zh` or `en`) | English prompts only; `metadata["lang"]` hardcoded to `"en"` |
| Customization | Edit YAML or author new files | Pass constructor kwargs (e.g. `observation_time` for `atom`) |

Both paths return a `BaseAutoType` instance with the same lifecycle: `feed_text`, `dump`, `load`, `build_index`, `search`, `chat`, and `show`.

```text
  Knowledge path                         Method path
  ──────────────                         ───────────

  presets/{domain}/*.yaml                methods/registry.py
         │                                      │
         ▼                                      ▼
      Gallery.get()                      get_method()
         │                                      │
         └──────────► TemplateFactory.create ◄──┘
                           │
                           ▼
                    BaseAutoType instance
                           │
                           ▼
                  Knowledge Abstract (KA)
```
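
The routing rule in the diagram reduces to a prefix check on the template ID. The sketch below is illustrative only: `route_template_id` is a hypothetical helper, not the body of `TemplateFactory.create`, and it ignores custom YAML file paths.

```python
from typing import Tuple

METHOD_PREFIX = "method/"


def route_template_id(template_id: str) -> Tuple[str, str]:
    """Return (path, name): 'method' for method/{name} IDs, else 'knowledge'."""
    if template_id.startswith(METHOD_PREFIX):
        # Method path: strip the prefix to get the registry name.
        return ("method", template_id[len(METHOD_PREFIX):])
    # Knowledge path: the full {domain}/{name} ID is looked up in the gallery.
    return ("knowledge", template_id)


method_route = route_template_id("method/light_rag")
knowledge_route = route_template_id("general/biography_graph")
```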

## Knowledge templates

Knowledge templates are domain YAML presets that declare **what** to extract and **how** to prompt the LLM. Each file lives under a domain directory inside `hyperextract/templates/presets/` and is keyed at runtime as `{domain}/{name}`.

### Domains and examples

The preset library ships 38 YAML templates across six domains:

| Domain | Count | Example IDs | Typical documents |
|--------|-------|-------------|-------------------|
| `general` | 13 | `general/biography_graph`, `general/concept_graph`, `general/base_graph` | Biographies, technical docs, agent workflows |
| `finance` | 5 | `finance/earnings_summary`, `finance/event_timeline` | Earnings calls, filings, news |
| `medicine` | 5 | `medicine/treatment_map`, `medicine/hospital_timeline` | Guidelines, discharge summaries |
| `tcm` | 5 | `tcm/syndrome_reasoning`, `tcm/formula_composition` | TCM case records, formula texts |
| `industry` | 5 | `industry/operation_flow`, `industry/safety_control` | SOPs, safety handbooks |
| `legal` | 5 | `legal/contract_obligation`, `legal/case_fact_timeline` | Contracts, court judgments |

Each YAML file declares an AutoType (`model`, `list`, `set`, `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, or `spatio_temporal_graph`), multilingual `description` and `guideline` blocks, an `output` schema, and optional `identifiers`, `options`, and `display` sections. At load time, `load_template` validates every language listed in the `language` field; at runtime, `localize_template` converts multilingual fields into a single-language `TemplateCfg` before the matching `create_{type}` factory method runs.

<Info>
Template IDs without a domain prefix resolve only under `general/`. For example, `graph` maps to `general/graph`, not templates in other domains.
</Info>

### When knowledge templates fit

Choose a knowledge template when the document type maps to a known schema:

- **Structured records** — `finance/earnings_summary` extracts quarterly metrics into an `AutoModel`.
- **Domain graphs** — `general/biography_graph` builds a `temporal_graph` of life events with timestamps.
- **Multilingual extraction** — prompts and field descriptions are localized to `zh` or `en`.
- **Custom schemas** — author a standalone YAML file and pass its path to `Template.create`.

## Method templates

Method templates wrap extraction **algorithms** as first-class template IDs. They do not use YAML; each method is a Python class registered with an autotype and description.

### Registered methods

Nine methods ship in the default registry, split across `hyperextract/methods/rag` (retrieval-augmented) and `hyperextract/methods/typical` (direct extraction):

| Method ID | Autotype | Category | Description |
|-----------|----------|----------|-------------|
| `method/graph_rag` | `graph` | RAG | Graph-RAG with community detection |
| `method/light_rag` | `graph` | RAG | Lightweight graph RAG with binary edges |
| `method/hyper_rag` | `hypergraph` | RAG | Hypergraph RAG with n-ary hyperedges |
| `method/hypergraph_rag` | `hypergraph` | RAG | Advanced hypergraph knowledge construction |
| `method/cog_rag` | `hypergraph` | RAG | Cognitive RAG for reasoning-focused retrieval |
| `method/itext2kg` | `graph` | Typical | High-quality triple-based extraction |
| `method/itext2kg_star` | `graph` | Typical | Enhanced iText2KG with improved quality |
| `method/kg_gen` | `graph` | Typical | Knowledge graph generator |
| `method/atom` | `graph` | Typical | Temporal knowledge graph with evidence attribution |

`TemplateFactory.create_method` instantiates the class, then stamps metadata:

```python
instance.metadata["template"] = f"method/{method_name}"
instance.metadata["lang"] = "en"
instance.metadata["type"] = autotype
```

Method-specific kwargs pass through to the constructor. For example, `atom` accepts `observation_time`:

```python
template = Template.create(
    "method/atom",
    observation_time="2024-06-15",
)
```

### When method templates fit

Choose a method template when schema flexibility matters less than extraction strategy:

- **General-purpose graph extraction** without a domain-specific field layout (`method/light_rag`).
- **Large documents** where chunking and retrieval help (`method/graph_rag`, `method/light_rag`).
- **Complex multi-entity relations** (`method/hyper_rag`, `method/hypergraph_rag`).
- **Temporal facts with evidence** (`method/atom` with `observation_time`).
- **Algorithm comparison** across RAG and typical pipelines using the same `feed_text` / `chat` surface.

<Note>
Method demos under `examples/en/methods/` use English source documents and instantiate method classes directly (e.g. `Light_RAG`) or via `Template.create("method/light_rag")`.
</Note>

## Language requirements

Language handling diverges at the `TemplateFactory.create` boundary.

### Knowledge templates: `--lang` required

Knowledge templates store prompts and schemas in multilingual YAML (`language: [zh, en]`). The runtime language selects which localized strings `localize_template` applies.

<ParamField body="--lang" type="string" required>
Language code for knowledge templates. Accepted values: `zh`, `en`. Required on `he parse` when using `-t` (or interactive template selection). Required as the `language` argument in `Template.create` for non-method sources.
</ParamField>

If `language` is omitted for a knowledge template, `TemplateFactory.create` raises:

```text
ValueError: language is required for knowledge templates. Provide a language code (e.g., 'zh', 'en').
```

The CLI enforces the same rule:

```bash
# Error: --lang missing
he parse document.md -t general/biography_graph -o ./ka/

# Correct
he parse document.md -t general/biography_graph -o ./ka/ -l en
he parse document.md -t finance/earnings_summary -o ./ka/ -l zh
```

### Method templates: English only

Method templates use English prompts baked into the algorithm code. `TemplateFactory.create_method` hardcodes `metadata["lang"]` to `"en"`, and `Template.create` ignores any `language` argument for `method/` sources.

<ParamField body="--lang" type="string">
Optional for method templates. If provided, the CLI prints a note that the value is ignored and forces `lang = "en"`.
</ParamField>

```bash
# No --lang needed
he parse document.md -m light_rag -o ./ka/

# Equivalent template ID form
he parse document.md -t method/light_rag -o ./ka/
```

`he list template --lang zh` filters to Chinese-capable knowledge templates and **excludes** method templates. Use `he list method` to browse methods independently.
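
The two language rules can be summarized in one function. This is a sketch under assumptions: `resolve_language` is a hypothetical helper, the note text is invented for illustration, and rejecting codes outside `zh`/`en` is an assumption. Only the `ValueError` message for a missing language is taken from the text above.

```python
from typing import Optional, Tuple

SUPPORTED_LANGS = ("zh", "en")


def resolve_language(template_id: str, language: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (effective_language, note) for a template ID."""
    if template_id.startswith("method/"):
        # Methods always run in English; a supplied value is ignored.
        note = f"--lang {language} is ignored for method templates" if language else None
        return "en", note
    if not language:
        raise ValueError(
            "language is required for knowledge templates. "
            "Provide a language code (e.g., 'zh', 'en')."
        )
    if language not in SUPPORTED_LANGS:  # assumption: unknown codes are rejected
        raise ValueError(f"unsupported language: {language!r}")
    return language, None


method_lang, method_note = resolve_language("method/light_rag", "zh")
knowledge_lang, _ = resolve_language("finance/earnings_summary", "zh")
```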

## CLI invocation

`he parse` accepts templates and methods through separate flags that converge on one template ID string.

<Steps>
<Step title="List available options">

```bash
he list template          # Knowledge templates + methods (default lang: en)
he list template -l zh    # Chinese knowledge templates only
he list template --no-methods
he list method
he list method -q rag
```

</Step>
<Step title="Run extraction with a knowledge template">

```bash
he parse examples/en/tesla.md \
  -t general/biography_graph \
  -l en \
  -o ./tesla-ka
```

Omit `-t` for interactive template selection (knowledge templates only).

</Step>
<Step title="Run extraction with a method">

```bash
he parse examples/en/tesla.md \
  -m light_rag \
  -o ./tesla-ka-rag
```

The `-m` flag sets the internal template ID to `method/{name}`. No `-l` flag is required.

</Step>
<Step title="Verify output">

```bash
he info ./tesla-ka
he show ./tesla-ka
he search ./tesla-ka "AC motor"
```

Metadata records the template ID and language (`en` or `zh` for knowledge templates; always `en` for methods).

</Step>
</Steps>
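
How the `-t` and `-m` flags converge on a single template ID can be sketched as follows. `cli_template_id` is a hypothetical helper; rejecting both flags at once is an assumption, not documented CLI behavior.

```python
from typing import Optional


def cli_template_id(template: Optional[str] = None, method: Optional[str] = None) -> Optional[str]:
    """Map the -t / -m flag values to one template ID string."""
    if template and method:
        # Assumption: the two flags are treated as mutually exclusive.
        raise ValueError("pass either -t or -m, not both")
    if method:
        return f"method/{method}"
    # None here stands for "fall back to interactive template selection".
    return template


from_method_flag = cli_template_id(method="light_rag")
from_template_flag = cli_template_id(template="method/light_rag")
```

Both calls produce the same ID, which is why `-m light_rag` and `-t method/light_rag` are interchangeable.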

## Python API

Both paths use the same `Template` facade exported from `hyperextract`.

<CodeGroup>
```python Knowledge template
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")
ka.feed_text(document_text)  # document_text: source text loaded by the caller
ka.build_index()             # build the index first so dump() persists it
ka.dump("./tesla-ka")
```

```python Method template
from hyperextract import Template

ka = Template.create("method/light_rag")
ka.feed_text(document_text)
ka.dump("./tesla-ka-rag")
```

```python Custom YAML path
ka = Template.create("/path/to/my_template.yaml", language="zh")
ka.feed_text(document_text)
```
</CodeGroup>

`Template.get` resolves configs from either source: `Gallery.get` for knowledge IDs, `get_method_cfg` for `method/` IDs. `Template.list(include_methods=True)` merges gallery results with `list_method_cfgs()`.

For direct algorithm access without the template wrapper, import classes from `hyperextract.methods.rag` or `hyperextract.methods.typical` and pass `llm_client` and `embedder` explicitly.

## Selection criteria

Use the decision tree below to pick a path before tuning autotype or provider settings.

```text
Need a specific output schema for a known document type?
│
├─ Yes → Knowledge template
│         Match domain + document type (see templates catalog)
│         Set --lang to match document language
│         Pick autotype by structure need (model/list/set/graph/…)
│
└─ No → Method template
          Pick algorithm by document size and relation complexity
          English input recommended
          Pass method kwargs (e.g. observation_time for atom)
```

### Knowledge template selection

| Scenario | Recommended template | AutoType |
|----------|---------------------|----------|
| Person biography or memoir | `general/biography_graph` | `temporal_graph` |
| Earnings call transcript | `finance/earnings_summary` | `model` |
| Multi-party contract | `legal/contract_obligation` | `hypergraph` |
| Clinical guideline | `medicine/treatment_map` | `hypergraph` |
| Custom domain schema | Author YAML from `general/base_*` | Any |

Match `type` to document structure: records use `model`/`list`/`set`; relationships use `graph`/`hypergraph`; time- or location-anchored relations use `temporal_graph`, `spatial_graph`, or `spatio_temporal_graph`. Temporal, spatial, and spatio-temporal templates accept the runtime kwargs `observation_time`, `observation_location`, or both, depending on the type.

### Method template selection

| Priority | Recommended method |
|----------|-------------------|
| Fast general extraction | `method/light_rag` |
| Best triple quality | `method/itext2kg_star` |
| Large documents (10K+ words) | `method/graph_rag` |
| N-ary / multi-entity relations | `method/hyper_rag` |
| Temporal facts with evidence | `method/atom` |
| Reasoning-focused RAG | `method/cog_rag` |

<Warning>
Do not pass `--lang zh` expecting Chinese prompts from method templates. Methods always run with English prompts regardless of the flag value.
</Warning>

## Unified metadata and downstream commands

Regardless of path, the resulting Knowledge Abstract stores `template` and `lang` in `metadata.json`. Downstream CLI commands (`he feed`, `he search`, `he talk`, `he show`, `he build-index`) reload the KA via `Template.create(template, lang)` using those stored values. When feeding new documents, `he feed` inherits template and language from existing metadata unless overridden.
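
The reload step depends only on two metadata keys. The sketch below reads them back with the standard library; `read_template_and_lang` is a hypothetical helper, and a real `metadata.json` may contain more fields than the three written here.

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple


def read_template_and_lang(ka_dir: str) -> Tuple[str, str]:
    """Return the stored (template, lang) pair from a KA folder."""
    metadata_path = Path(ka_dir) / "metadata.json"
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    return metadata["template"], metadata["lang"]


# Simulate a dumped KA folder with a minimal metadata file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "metadata.json").write_text(
        json.dumps({"template": "method/light_rag", "lang": "en", "type": "graph"}),
        encoding="utf-8",
    )
    template, lang = read_template_and_lang(tmp)
```

A downstream command would then pass the recovered pair to `Template.create(template, lang)`, as described above.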

## Related pages

<CardGroup>
<Card title="Auto-Types" href="/auto-types">
Eight extraction primitives, merge behavior, and autotype selection for YAML `type` fields.
</Card>
<Card title="Create custom templates" href="/create-custom-templates">
Author domain YAML templates with multilingual blocks, identifiers, and validation.
</Card>
<Card title="Use extraction methods" href="/use-extraction-methods">
Invoke methods via CLI, `Template.create`, or direct class instantiation with kwargs.
</Card>
<Card title="Extraction methods reference" href="/extraction-methods-reference">
Full registry of nine methods with autotypes, descriptions, and constructor parameters.
</Card>
<Card title="Template schema reference" href="/template-schema-reference">
YAML field definitions for knowledge template authoring.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Missing `--lang`, template resolution errors, and method-specific failure modes.
</Card>
</CardGroup>

---

## 07. Provider system

> BYOC/BYOK provider model: `openai`, `bailian`, and `vllm` presets; `provider:model@url` shorthand; `CompatibleEmbeddings` for non-OpenAI endpoints; and verified model compatibility requirements (`json_schema` / function calling).

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/07-provider-system.md
- Generated: 2026-06-18T20:54:35.210Z

### Source Files

- `hyperextract/utils/client.py`
- `hyperextract/cli/config.py`
- `hyperextract/cli/commands/config.py`
- `README.md`
- `.env.example`


Hyper-Extract is **bring-your-own-cloud (BYOC)** and **bring-your-own-key (BYOK)** by design. You choose where LLM and embedding traffic goes (OpenAI, Alibaba Bailian, a local vLLM stack, or any OpenAI-compatible endpoint), and Hyper-Extract wires the same extraction pipeline to that backend. The provider layer does not lock you to a single vendor; presets and shorthand strings are conveniences, not hard dependencies.

Every extraction path ultimately needs two LangChain clients: a **chat model** for structured extraction and an **embedder** for semantic search. Hyper-Extract centralizes both in `hyperextract.utils.client`.

## Architecture overview

```mermaid
flowchart LR
  subgraph consumers [Consumers]
    CLI["he parse / search / talk"]
    Template["Template.create()"]
    AutoType["AutoGraph, AutoList, ..."]
  end

  subgraph factory [Client factory]
    CC["create_client()"]
    GL["get_client()"]
    CL["create_llm()"]
    CE["create_embedder()"]
  end

  subgraph backends [OpenAI-compatible backends]
    OAI["openai preset"]
    BL["bailian preset"]
    VLLM["vllm + custom URL"]
  end

  CLI --> GL
  Template --> GL
  AutoType --> CC
  CC --> CL
  CC --> CE
  GL --> CL
  GL --> CE
  CL --> OAI
  CL --> BL
  CL --> VLLM
  CE --> OAI
  CE --> BL
  CE --> VLLM
```

| Layer | Role |
|-------|------|
| **Presets** | Named bundles of `base_url` and default models for `openai`, `bailian`, and `vllm` |
| **Shorthand parser** | Turns `provider:model@url` strings into resolved config dicts |
| **LLM client** | `ChatOpenAI` pointed at the resolved endpoint |
| **Embedder client** | `OpenAIEmbeddings` for official OpenAI, or `CompatibleEmbeddings` for everything else |
| **Config file** | `~/.he/config.toml` read by `get_client()` for CLI and `Template.create()` defaults |

## Provider presets

Three first-class presets ship in `PROVIDER_PRESETS`. Each defines a default LLM model, default embedder model, and (when applicable) a base URL.

| Preset | Base URL | Default LLM | Default embedder |
|--------|----------|-------------|------------------|
| `openai` | `https://api.openai.com/v1` | `gpt-4o-mini` | `text-embedding-3-small` |
| `bailian` | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen3.6-plus` | `text-embedding-v4` |
| `vllm` | *(none — you must supply)* | *(none)* | *(none)* |

The `vllm` preset intentionally has no defaults. Local deployments vary by host, port, and served model name, so you always specify `provider:model@url` explicitly or set `base_url` in config.

<AccordionGroup>
<Accordion title="Why presets instead of hard-coded providers">
Presets are URL and model shortcuts, not proprietary connectors. Any endpoint that speaks the OpenAI chat-completions and embeddings APIs can work when you pass a custom `base_url`. The `custom` option in `he config init` follows the same code path as `openai` or `bailian` — only the resolved URL and model names change.
</Accordion>
</AccordionGroup>

## String shorthand: `provider:model@url`

`create_client()`, `create_llm()`, and `create_embedder()` accept a compact string syntax parsed by `_parse_client_spec()`:

| Format | Example | Resolved behavior |
|--------|---------|-------------------|
| `provider` | `"bailian"` | Preset URL + default LLM/embedder models |
| `provider:model` | `"bailian:qwen-plus"` | Preset URL + overridden model |
| `provider:model@url` | `"vllm:Qwen3.5-9B@http://localhost:8000/v1"` | Full manual specification |

Dict specs are also supported for fine-grained control (temperature, extra kwargs):

```python
create_llm({"provider": "bailian", "model": "qwen-plus", "temperature": 0.5}, api_key="sk-xxx")
```
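
The shorthand grammar can be sketched with two `str.partition` calls. This is not the library's `_parse_client_spec()`: `parse_client_spec`, `PRESETS`, and the `role` parameter are hypothetical, and the sketch assumes a missing model or URL is an error, whereas the real client can also take `base_url` from config. The preset values are copied from the presets table.

```python
from typing import Dict, Optional

NO_DEFAULTS: Dict[str, Optional[str]] = {"base_url": None, "llm": None, "embedder": None}

PRESETS: Dict[str, Dict[str, Optional[str]]] = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "llm": "gpt-4o-mini",
        "embedder": "text-embedding-3-small",
    },
    "bailian": {
        "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
        "llm": "qwen3.6-plus",
        "embedder": "text-embedding-v4",
    },
    "vllm": NO_DEFAULTS,
}


def parse_client_spec(spec: str, role: str = "llm") -> Dict[str, str]:
    """Resolve 'provider', 'provider:model', or 'provider:model@url'."""
    # Split off the URL first: URLs contain ':' and must not be split further.
    head, _, url = spec.partition("@")
    provider, _, model = head.partition(":")
    preset = PRESETS.get(provider, NO_DEFAULTS)
    resolved = {
        "provider": provider,
        "model": model or preset[role],
        "base_url": url or preset["base_url"],
    }
    missing = [name for name, value in resolved.items() if not value]
    if missing:
        raise ValueError(f"{spec!r} does not define: {', '.join(missing)}")
    return resolved


cloud = parse_client_spec("bailian")
override = parse_client_spec("bailian:qwen-plus")
local = parse_client_spec("vllm:Qwen3.5-9B@http://localhost:8000/v1")
```

Partitioning on `@` before `:` is the key design choice, because the URL part contains colons of its own.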

## `create_client()` patterns

`create_client()` exposes three common deployment shapes:

<Tabs>
<Tab title="Pattern A — single cloud provider">

One preset string configures both LLM and embedder. Simplest path for OpenAI or Bailian.

```python
from hyperextract import create_client

llm, emb = create_client("bailian", api_key="sk-xxx")
# → qwen3.6-plus + text-embedding-v4 at Bailian compatible-mode URL
```

</Tab>
<Tab title="Pattern B — local vLLM (split services)">

LLM and embedder often run on different ports locally. Pass separate specs:

```python
llm, emb = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)
```

</Tab>
<Tab title="Pattern C — mixed cloud + local">

Cloud LLM with on-prem embeddings (or the reverse):

```python
llm, emb = create_client(
    llm="bailian:qwen-plus",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="sk-xxx",
)
```

</Tab>
</Tabs>

Lower-level factories are available when you only need one side:

- `create_llm(spec, api_key=..., **kwargs)` → `ChatOpenAI`
- `create_embedder(spec, api_key=..., **kwargs)` → `OpenAIEmbeddings` or `CompatibleEmbeddings`
- `get_client(config_path=None)` → reads `~/.he/config.toml` (used by CLI and `Template.create()`)

## `CompatibleEmbeddings` for non-OpenAI endpoints

LangChain's `OpenAIEmbeddings` can send **pre-tokenized integer lists** to the API. Official OpenAI accepts that format; most OpenAI-compatible providers (Bailian, Ollama, LiteLLM, local vLLM) do not.

Hyper-Extract routes embedders through `CompatibleEmbeddings` whenever `base_url` is set and is not exactly `https://api.openai.com/v1`:

| Condition | Embedder class |
|-----------|----------------|
| Official OpenAI URL (or no custom URL) | `langchain_openai.OpenAIEmbeddings` |
| Any other `base_url` | `CompatibleEmbeddings` |

`CompatibleEmbeddings` always sends **string inputs**, uses tiktoken for chunking (falling back to `cl100k_base` for unknown model names), and batches requests conservatively (`max_batch_size=10` by default) because providers like Bailian cap batch size. Texts longer than the token limit are split into chunks, and the chunk embeddings are averaged into a single vector.

<Warning>
Semantic search quality depends on the embedding model you point at. Hyper-Extract does not translate between embedding spaces — if you change embedder model or provider after building an index, rebuild with `he build-index`.
</Warning>
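
The batching and chunk-averaging behavior described for `CompatibleEmbeddings` can be pictured with the following sketch. It is not the library's code: `batched`, `mean_vector`, and `embed_chunks` are hypothetical helpers, `embed_batch` stands in for one embeddings API call, and the plain unweighted mean is an assumption (a real implementation may weight chunks by token count or normalize the result).

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Average vectors component-wise (unweighted; an assumption)."""
    if not vectors:
        raise ValueError("no vectors to average")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed_chunks(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
    max_batch_size: int = 10,
) -> Vector:
    """Embed string chunks in small batches, then average into one vector."""
    vectors: List[Vector] = []
    for batch in batched(chunks, max_batch_size):
        # Each call sends plain strings, never pre-tokenized integer lists.
        vectors.extend(embed_batch(list(batch)))
    return mean_vector(vectors)
```

With 25 chunks and the default `max_batch_size=10`, `embed_batch` is called three times, with 10, 10, and 5 strings.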

## Structured output requirement

Hyper-Extract extraction depends on the LLM returning **schema-constrained JSON**. AutoTypes chain prompts through LangChain's `with_structured_output()`:

```python
self.data_extractor = (
    self.prompt_template
    | self.llm_client.with_structured_output(self._data_schema)
)
```

That requires backend support for **`json_schema`** or **function calling**. Models that only support loose `json_object` mode will fail extraction or return unusable output.

### Verified LLM compatibility

| Platform | Model | `json_schema` | Status | Notes |
|----------|-------|:-------------:|:------:|-------|
| **OpenAI** | gpt-4o / gpt-4o-mini / gpt-5 | ✅ | ✅ Verified | Recommended cloud default |
| **Alibaba Bailian** | qwen-plus / qwen-turbo / qwen3.6-plus / deepseek-r1 | ✅ | ✅ Verified | Works out of the box |
| **Alibaba Bailian** | qwen-max / deepseek-v3 | ❌ | ❌ Incompatible | Only `json_object`; switch to qwen-plus, qwen-turbo, or deepseek-r1 |
| **Local vLLM** | Qwen3.5-9B (GPTQ-Marlin 4bit) | ✅ | ✅ Verified | AutoList / AutoGraph tested |

<AccordionGroup>
<Accordion title="Bailian troubleshooting symptoms">
If you see `messages must contain the word 'json'` or non-JSON model output, the model likely lacks `json_schema` support. Switch to qwen-plus, qwen-turbo, or deepseek-r1.
</Accordion>
<Accordion title="Thinking models on local vLLM">
Thinking models (e.g. Qwen3.5 with thinking enabled) emit `<think>…</think>` blocks that conflict with constrained JSON decoding. Disable thinking when serving locally by passing this flag to `vllm serve`:

```bash
--default-chat-template-kwargs '{"enable_thinking": false}'
```

DeepSeek-R1 via Bailian is verified because Bailian strips thinking tags server-side.
</Accordion>
</AccordionGroup>

Some extraction methods explicitly request function calling — for example, GraphRAG community reports use `method="function_calling"`. Prefer models and vLLM builds with structured-output support enabled.

### Verified embedding compatibility

| Platform | Model | Dimensions | Status |
|----------|-------|------------|--------|
| **OpenAI** | text-embedding-3-small | 1536 | ✅ Verified |
| **Alibaba Bailian** | text-embedding-v4 | 1024 | ✅ Verified |
| **Local vLLM** | BAAI/bge-m3 | — | ✅ Verified |

Any OpenAI-compatible embeddings endpoint can work when reached through `CompatibleEmbeddings`.

## CLI and config file integration

The CLI stores provider settings in `~/.he/config.toml` under `[llm]` and `[embedder]`. Each section holds `provider`, `model`, `api_key`, and `base_url`.

<Steps>
<Step title="Initialize or set a preset">

<CodeGroup>
```bash CLI quick init (OpenAI)
he config init -p openai -k sk-xxx
```

```bash CLI quick init (Bailian)
he config init -p bailian -k sk-xxx
```

```bash Interactive (vLLM)
he config init
# Select local vLLM; enter model names and base URLs
```
</CodeGroup>

</Step>
<Step title="Configure services independently">

Mixed deployments use per-service commands:

```bash
he config llm -p bailian -k sk-xxx
he config embedder -p vllm -m bge-m3 -u http://localhost:8001/v1 -k dummy
```

</Step>
<Step title="Verify before extraction">

Extraction and query commands (`he parse`, `he feed`, `he build-index`, `he search`, and `he talk`) call `validate_config()` first. Validation rules:

- **Cloud providers** (`openai`, `bailian`, custom): `api_key` required (from config or `OPENAI_API_KEY`)
- **vLLM**: `api_key` may be empty or `dummy`, but **`base_url` is mandatory** for both LLM and embedder

Environment variables fill in config fields that are empty; they never override values set in `config.toml`:

<ParamField body="OPENAI_API_KEY" type="string">
API key fallback when `api_key` is not set in `config.toml`.
</ParamField>

<ParamField body="OPENAI_BASE_URL" type="string">
Base URL fallback when `base_url` is not set in `config.toml`.
</ParamField>

</Step>
</Steps>

After configuration, CLI commands and `Template.create()` automatically call `get_client()` — no inline provider code required.

## Local vLLM deployment sketch

Typical verified layout: LLM on port 8000, embeddings on port 8001.

<CodeGroup>
```bash Start LLM service
vllm serve /path/to/qwen3.5-9b-gptq-marlin \
  --served-model-name Qwen/Qwen3.5-9B \
  --trust-remote-code \
  --quantization gptq_marlin \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --port 8000 \
  --api-key dummy
```

```bash Start embedding service
vllm serve BAAI/bge-m3 \
  --task embed \
  --dtype float16 \
  --max-model-len 8192 \
  --port 8001
```
</CodeGroup>

<RequestExample>
```python Python client for local vLLM
from hyperextract import create_client, AutoGraph

llm, emb = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)

graph = AutoGraph(
    instruction="Extract people and their relationships",
    llm_client=llm,
    embedder=emb,
    node_key_extractor=lambda n: n.name,
    edge_key_extractor=lambda e: (e.source, e.target, e.type),
    nodes_in_edge_extractor=lambda e: (e.source, e.target),
)
graph.parse("Zhang San founded ByteDance. Li Si serves as CEO.")
```
</RequestExample>

Prefer **GPTQ-Marlin** over AWQ for Qwen3.5-9B on vLLM 0.21.x due to known AWQ compatibility issues.

## Failure modes

| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| `vLLM provider requires base_url` | `vllm` preset without URL | Set `--base-url` or use `provider:model@url` shorthand |
| `LLM API key is not configured` | Missing key for cloud provider | `he config llm -k ...` or export `OPENAI_API_KEY` |
| Empty or partial extraction | Model lacks `json_schema` | Switch to a verified model (see table above) |
| Embedding batch errors on Bailian | Batch too large | `CompatibleEmbeddings` defaults to `max_batch_size=10`; lower it if the provider's cap is smaller |
| Search returns garbage after provider change | Embedding space mismatch | Rebuild index with `he build-index` |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` when diagnosing client or schema failures.

## API surface summary

| Function | Input | Output |
|----------|-------|--------|
| `create_client(provider=...)` or `create_client("bailian", ...)` | Shorthand or split `llm`/`embedder` specs | `(ChatOpenAI, Embeddings)` tuple |
| `create_llm(spec)` | Shorthand or dict | `ChatOpenAI` |
| `create_embedder(spec)` | Shorthand or dict | `OpenAIEmbeddings` or `CompatibleEmbeddings` |
| `get_client(path?)` | Optional config path | Reads TOML, returns client tuple |

Runnable provider demos live under `examples/providers/` (`openai_demo.py`, `bailian_demo.py`, `vllm_demo.py`).

## Related pages

<CardGroup cols={2}>
<Card title="Configure providers" href="/configure-providers">
Step-by-step setup for `he config init`, per-service commands, environment variables, and programmatic `create_client()` for mixed deployments.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full `~/.he/config.toml` schema, preset defaults, env var precedence, and validation rules.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`create_client`, `create_llm`, `create_embedder`, `get_client`, and AutoType lifecycle methods.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Missing API keys, vLLM `base_url` requirements, schema failures, and debug logging.
</Card>
</CardGroup>

---

## 08. Configure providers

> Set up LLM and embedder clients via `he config init`, per-service `he config llm` / `he config embedder`, environment variables, or programmatic `create_client()` for mixed cloud and local vLLM deployments.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/08-configure-providers.md
- Generated: 2026-06-18T20:55:00.846Z

### Source Files

- `hyperextract/cli/commands/config.py`
- `hyperextract/cli/config.py`
- `hyperextract/utils/client.py`
- `hyperextract/cli/README.md`
- `.env.example`

---
title: "Configure providers"
description: "Set up LLM and embedder clients via `he config init`, per-service `he config llm` / `he config embedder`, environment variables, or programmatic `create_client()` for mixed cloud and local vLLM deployments."
---

Hyper-Extract resolves LLM and embedder clients from `~/.he/config.toml`, environment-variable fallbacks, or the Python factory API (`create_client`, `create_llm`, `create_embedder`, `get_client`). CLI commands such as `he parse` call `validate_config()` before running and exit if credentials or vLLM `base_url` values are missing.

## Configuration surfaces

| Surface | Entry point | Persists to disk | Typical use |
|---------|-------------|------------------|---------------|
| Interactive CLI | `he config init` | Yes (`~/.he/config.toml`) | First-time setup |
| Per-service CLI | `he config llm`, `he config embedder` | Yes | Mixed providers, model overrides |
| Environment variables | `OPENAI_API_KEY`, `OPENAI_BASE_URL` | No | CI/CD, temporary overrides |
| Python factory | `create_client()`, `get_client()` | No (reads file if using `get_client`) | Scripts, notebooks, custom deployments |

```mermaid
flowchart LR
  subgraph cli ["CLI layer"]
    init["he config init"]
    llmCmd["he config llm"]
    embCmd["he config embedder"]
  end

  subgraph store ["~/.he/config.toml"]
    llmSec["[llm]"]
    embSec["[embedder]"]
  end

  subgraph resolve ["ConfigManager.get_*_config()"]
    fileVal["File values"]
    envFallback["OPENAI_API_KEY / OPENAI_BASE_URL"]
    preset["PROVIDER_PRESETS base_url"]
  end

  subgraph runtime ["Client factory"]
    getClient["get_client()"]
    createClient["create_client()"]
    chat["ChatOpenAI"]
    embed["OpenAIEmbeddings / CompatibleEmbeddings"]
  end

  init --> store
  llmCmd --> store
  embCmd --> store
  store --> fileVal
  fileVal --> envFallback
  envFallback --> preset
  preset --> getClient
  createClient --> chat
  createClient --> embed
  getClient --> chat
  getClient --> embed
```

<Note>
`Template.create()` loads clients from `get_client()` when `llm_client` and `embedder` are omitted, so file-based configuration applies to both CLI and Python template workflows.
</Note>

## Provider presets

Three built-in presets are available. `openai` and `bailian` supply default models and base URLs; `vllm` has no defaults, so you must set `model` and `base_url` explicitly.

| Provider | Default LLM | Default embedder | Default `base_url` |
|----------|-------------|------------------|--------------------|
| `openai` | `gpt-4o-mini` | `text-embedding-3-small` | `https://api.openai.com/v1` |
| `bailian` | `qwen3.6-plus` | `text-embedding-v4` | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
| `vllm` | — | — | — (required) |

Interactive `he config init` also offers a **custom** OpenAI-compatible option. It behaves like a provider without preset defaults: you supply model names and `base_url` values manually.

## CLI setup

<Steps>
<Step title="Initialize configuration">

Run interactive setup or pass flags for non-interactive configuration.

<CodeGroup>
```bash Interactive
he config init
```

```bash OpenAI one-liner
he config init -p openai -k sk-your-key
```

```bash Bailian one-liner
he config init -p bailian -k sk-your-key
```

```bash API key only (OpenAI defaults)
he config init -k sk-your-key
```
</CodeGroup>

Quick mode (`-p` + `-k`) writes both `[llm]` and `[embedder]` sections using preset default models. For `vllm`, run interactive init or configure each service separately.

</Step>

<Step title="Configure services individually (optional)">

Use per-service commands when LLM and embedder run on different providers or endpoints.

```bash
# LLM only
he config llm -p bailian -k sk-your-key -m qwen-plus

# Embedder only
he config embedder -p vllm -u http://localhost:8001/v1 -k dummy -m BAAI/bge-m3
```

<ParamField body="--provider" type="string">
Provider preset: `openai`, `bailian`, or `vllm`.
</ParamField>

<ParamField body="--api-key" type="string">
API key for the service. vLLM accepts `dummy` when the server does not enforce keys.
</ParamField>

<ParamField body="--model" type="string">
Model name served by the endpoint.
</ParamField>

<ParamField body="--base-url" type="string">
OpenAI-compatible API root (for example `http://localhost:8000/v1`). Required for `vllm`.
</ParamField>

<ParamField body="--show" type="boolean">
Display current settings for the service without writing changes.
</ParamField>

<ParamField body="--unset" type="boolean">
Reset the service section to defaults and save.
</ParamField>

</Step>

<Step title="Verify configuration">

```bash
he config show
he config llm --show
he config embedder --show
```

`he config show` prints a table with provider, model, masked API key, and base URL for both services.

</Step>
</Steps>

### Config file format

`he config init` and per-service commands persist settings to `~/.he/config.toml` (Windows: `%USERPROFILE%\.he\config.toml`).

```toml
[llm]
provider = "bailian"
model = "qwen3.6-plus"
api_key = "sk-your-api-key"
base_url = ""

[embedder]
provider = "vllm"
model = "BAAI/bge-m3"
api_key = "dummy"
base_url = "http://localhost:8001/v1"
```

Empty `base_url` fields resolve from the provider preset at runtime. For `vllm`, an empty `base_url` fails validation.

## Environment variables

`.env.example` documents the two credential-related variables:

```bash
OPENAI_API_KEY=sk-your-api-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
```

| Variable | Applies to | Resolution |
|----------|------------|------------|
| `OPENAI_API_KEY` | `[llm].api_key`, `[embedder].api_key` | Used when the corresponding config field is empty |
| `OPENAI_BASE_URL` | `[llm].base_url`, `[embedder].base_url` | Used when the corresponding config field is empty, before preset resolution |

<Warning>
Config file values take precedence over environment variables. Empty fields in `config.toml` fall back to `OPENAI_API_KEY` and `OPENAI_BASE_URL`, not the other way around.
</Warning>
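
The precedence described above (config file value first, then environment variable, then provider preset for `base_url`) can be sketched as follows. This is an illustration under assumptions, not the `ConfigManager` implementation: `resolve_base_url`, `resolve_api_key`, and `PRESET_BASE_URLS` are hypothetical names, and the preset URLs are copied from the provider preset table.

```python
import os
from typing import Mapping

# Preset URLs from the provider table; vllm deliberately has no entry.
PRESET_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "bailian": "https://dashscope.aliyuncs.com/compatible-mode/v1",
}


def resolve_base_url(
    section: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve base_url: config value, then OPENAI_BASE_URL, then preset."""
    provider = section["provider"]
    url = (
        section.get("base_url")
        or env.get("OPENAI_BASE_URL")
        or PRESET_BASE_URLS.get(provider)
    )
    if not url:
        raise ValueError(f"Provider {provider!r} requires explicit base_url.")
    return url


def resolve_api_key(
    section: Mapping[str, str],
    env: Mapping[str, str] = os.environ,
) -> str:
    """Resolve api_key: config value wins, OPENAI_API_KEY is the fallback."""
    return section.get("api_key") or env.get("OPENAI_API_KEY", "")
```

One consequence of this order is that an exported `OPENAI_BASE_URL` is used for any section whose `base_url` is empty, including a `bailian` section, before the preset URL is considered.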

`create_llm()` and `create_embedder()` also read `OPENAI_API_KEY` when no `api_key` is passed in the spec or kwargs.

Logging is controlled separately:

| Variable | Purpose |
|----------|---------|
| `HYPER_EXTRACT_LOG_LEVEL` | Root log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
| `HYPER_EXTRACT_LOG_FILE` | Optional log file path |

## Programmatic client factory

The SDK exports four factory functions from `hyperextract`:

```python
from hyperextract import create_client, create_llm, create_embedder, get_client
```

| Function | Returns | Config source |
|----------|---------|---------------|
| `create_client()` | `(llm, embedder)` tuple | Arguments only |
| `create_llm()` | `ChatOpenAI` | Spec string or dict |
| `create_embedder()` | `OpenAIEmbeddings` or `CompatibleEmbeddings` | Spec string or dict |
| `get_client()` | `(llm, embedder)` tuple | `~/.he/config.toml` (or custom path) |

### String shorthand

Specs use `provider:model@url` syntax:

| Format | Example | Behavior |
|--------|---------|----------|
| `provider` | `"bailian"` | Preset defaults for model and URL |
| `provider:model` | `"bailian:qwen-plus"` | Override model, keep preset URL |
| `provider:model@url` | `"vllm:Qwen3.5-9B@http://localhost:8000/v1"` | Full manual specification |

Dict specs pass through with the same keys: `provider`, `model`, `base_url`, `api_key`.

### Deployment patterns

<Tabs>
<Tab title="Single cloud provider">

```python
llm, emb = create_client("openai", api_key="sk-xxx")
# or
llm, emb = create_client("bailian", api_key="sk-xxx")
```

Both services share the provider preset defaults.

</Tab>
<Tab title="Local vLLM (two services)">

```python
llm, emb = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)
```

LLM and embedder typically run on separate ports. See `examples/providers/vllm_demo.py`.

</Tab>
<Tab title="Mixed cloud + local">

```python
llm, emb = create_client(
    llm="bailian:qwen-plus",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="sk-xxx",
)
```

Cloud LLM with on-premise embeddings — a common cost/latency split.

</Tab>
<Tab title="Config file">

```python
llm, emb = get_client()  # reads ~/.he/config.toml
# or
llm, emb = get_client("/path/to/config.toml")
```

Equivalent CLI setup:

```bash
he config init -p bailian -k sk-xxx
```

Then use `Template.create("general/biography_graph", language="en")` without passing clients explicitly.

</Tab>
</Tabs>

### Embedder selection

`create_embedder()` chooses the implementation based on `base_url`:

- **Official OpenAI URL** (`https://api.openai.com/v1`) → `OpenAIEmbeddings` (native tiktoken batching)
- **Any other URL** → `CompatibleEmbeddings` (string-only input, conservative batch size of 10, tiktoken chunking)

OpenAI-compatible endpoints other than OpenAI itself (Bailian, vLLM, Ollama, LiteLLM proxies) require `CompatibleEmbeddings` because most of them reject pre-tokenized integer lists.

Extra kwargs passed to `create_client()` (for example `temperature=0.5`) are forwarded to `ChatOpenAI`.
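
The embedder selection rule at the start of this section reduces to one comparison. The sketch below assumes an exact string match against the official URL, as described in the provider system page; `pick_embedder_class` is a hypothetical helper that returns a class name rather than an instance.

```python
from typing import Optional

OFFICIAL_OPENAI_URL = "https://api.openai.com/v1"


def pick_embedder_class(base_url: Optional[str]) -> str:
    """Return the embedder class name implied by base_url (illustrative)."""
    # No custom URL, or exactly the official URL: native OpenAI embeddings.
    if not base_url or base_url == OFFICIAL_OPENAI_URL:
        return "OpenAIEmbeddings"
    # Anything else is treated as an OpenAI-compatible third-party endpoint.
    return "CompatibleEmbeddings"
```

Under an exact-match rule, a variant such as a trailing slash on the official URL would be routed to `CompatibleEmbeddings`; whether the library normalizes URLs first is not stated here.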

## Mixed deployment examples

<Tabs>
<Tab title="CLI">

```bash
# Cloud LLM + local embedder
he config llm -p bailian -k sk-your-key
he config embedder -p vllm \
  -u http://localhost:8001/v1 \
  -k dummy \
  -m BAAI/bge-m3

# Local LLM + cloud embedder
he config llm -p vllm \
  -u http://localhost:8000/v1 \
  -k dummy \
  -m Qwen/Qwen3.5-9B
he config embedder -p bailian -k sk-your-key
```

</Tab>
<Tab title="Python">

```python
from hyperextract import create_client, AutoGraph

llm, emb = create_client(
    llm="bailian",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="sk-xxx",
)

graph = AutoGraph(
    instruction="Extract people and their relationships",
    llm_client=llm,
    embedder=emb,
    node_key_extractor=lambda n: n.name,
    edge_key_extractor=lambda e: (e.source, e.target, e.type),
    nodes_in_edge_extractor=lambda e: (e.source, e.target),
)
```

</Tab>
</Tabs>

## Validation and CLI enforcement

`ConfigManager.validate()` checks resolved configuration before extraction commands run:

| Condition | Result |
|-----------|--------|
| `provider == "vllm"` and empty `base_url` | Fails with `vLLM provider requires base_url.` |
| Non-vLLM LLM with empty `api_key` | Fails — suggests `he config llm --api-key YOUR_KEY` |
| vLLM embedder with empty `base_url` | Fails with `vLLM embedder requires base_url.` |
| Non-vLLM embedder with empty `api_key` | Fails — suggests `he config embedder --api-key YOUR_KEY` |

`validate_config()` in the CLI prints the error and exits with code 1. Commands that call it include `he parse`, `he feed`, `he build-index`, `he search`, and `he talk`.

<Check>
After configuration, confirm services respond before running extraction:

```bash
curl http://localhost:8000/v1/models   # vLLM LLM
curl http://localhost:8001/v1/models   # vLLM embedder
he config show
```
</Check>
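
The validation table above can be read as a small rule set applied once per service. The sketch below is a simplified stand-in for `ConfigManager.validate()`: `validate_service` and `validate` are hypothetical names, the config is modeled as plain dicts of already-resolved values, and the error strings are illustrative rather than the CLI's exact messages.

```python
from typing import Dict, List

ServiceConfig = Dict[str, str]


def validate_service(name: str, cfg: ServiceConfig) -> List[str]:
    """Apply the per-service rules; return a list of problems found."""
    provider = cfg.get("provider", "")
    errors: List[str] = []
    if provider == "vllm":
        # vLLM may run without a real key but always needs an endpoint.
        if not cfg.get("base_url"):
            errors.append(f"{name}: base_url is required for provider 'vllm'")
    elif not cfg.get("api_key"):
        # Every other provider needs a key (from file or environment).
        errors.append(f"{name}: api_key is required for provider {provider!r}")
    return errors


def validate(config: Dict[str, ServiceConfig]) -> List[str]:
    """Validate both services; an empty list means the config is usable."""
    return (
        validate_service("llm", config.get("llm", {}))
        + validate_service("embedder", config.get("embedder", {}))
    )
```

A CLI wrapper in the spirit of `validate_config()` would print the first problem and exit with code 1 when the list is non-empty.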

## Common failure modes

<AccordionGroup>
<Accordion title="Missing API key on cloud provider">

```text
Error: LLM API key is not configured. Run 'he config llm --api-key YOUR_KEY'
```

Set the key via CLI or export `OPENAI_API_KEY` when the config field is empty.

</Accordion>

<Accordion title="vLLM missing base_url">

```text
Error: vLLM provider requires base_url.
```

Set `--base-url` on `he config llm` / `he config embedder`, or use the full `provider:model@url` shorthand in Python.

</Accordion>

<Accordion title="Provider requires explicit base_url at resolution time">

```text
ValueError: Provider 'vllm' requires explicit base_url.
```

Raised by `_resolve_base_url()` when a vLLM provider has no URL in config, environment, or preset.

</Accordion>

<Accordion title="create_client() called with no arguments">

```text
ValueError: Must provide llm=, embedder=, or provider= argument.
```

Pass a provider shorthand, separate `llm`/`embedder` specs, or use `get_client()` for file-based config.

</Accordion>
</AccordionGroup>

## Next

<CardGroup>
<Card title="Provider system" href="/provider-system">
BYOC/BYOK model, `provider:model@url` shorthand, `CompatibleEmbeddings`, and verified model compatibility.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full `~/.he/config.toml` schema, defaults, and environment variable precedence rules.
</Card>
<Card title="Quickstart" href="/quickstart">
First extraction after `he config init`: parse, search, and visualize a Knowledge Abstract.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`create_client`, `get_client`, `Template.create`, and AutoType lifecycle methods.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Debug logging, template errors, and provider connection failures.
</Card>
</CardGroup>

---

## 09. Extract and evolve knowledge

> Run `he parse` (single file, directory of `.md`/`.txt`, or stdin), choose templates interactively or by ID, control indexing with `--no-index`, append documents with `he feed`, and rebuild indexes with `he build-index`.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/09-extract-and-evolve-knowledge.md
- Generated: 2026-06-18T20:55:43.224Z

### Source Files

- `hyperextract/cli/cli.py`
- `hyperextract/cli/utils.py`
- `hyperextract/types/base.py`
- `hyperextract/utils/template_engine/template.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/templates/presets/finance/earnings_summary.yaml`

---
title: "Extract and evolve knowledge"
description: "Run `he parse` (single file, directory of `.md`/`.txt`, or stdin), choose templates interactively or by ID, control indexing with `--no-index`, append documents with `he feed`, and rebuild indexes with `he build-index`."
---

Hyper-Extract creates and grows Knowledge Abstracts (KAs) through three CLI commands (`he parse`, `he feed`, and `he build-index`), backed by `Template.create`, `BaseAutoType.feed_text`, `dump`, `load`, and `build_index` in the Python SDK. Every command first validates LLM and embedder configuration and resolves a YAML preset or method template. `he parse` and `he feed` then run structured LLM extraction and write `data.json` and `metadata.json`; the `index/` directory is written by `he parse` (unless `--no-index` is set) and by `he build-index`.

## Lifecycle overview

```mermaid
stateDiagram-v2
    [*] --> Empty: he parse -o ./ka/
    Empty --> Indexed: build_index (default)
    Empty --> Unindexed: --no-index
    Indexed --> StaleIndex: he feed
    Unindexed --> StaleIndex: he feed
    StaleIndex --> Indexed: he build-index
    Indexed --> Indexed: he build-index --force
    Unindexed --> Indexed: he build-index
```

| Phase | Command | Data change | Index state |
|-------|---------|-------------|-------------|
| Create | `he parse` | New `data.json` | Built by default; skipped with `--no-index` |
| Append | `he feed` | Merges into existing `data.json` | Cleared in memory; rebuild required for search/chat |
| Reindex | `he build-index` | No data change | Rebuilt from current `data.json` |

<Note>
`he feed` does not call `build_index`. After feeding, run `he build-index` before `he search` or `he talk`.
</Note>
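
The lifecycle diagram and table above can be transcribed into a small transition map, which makes it easy to check whether a sequence of commands leaves a KA searchable. This is only a model of the documented states, not library code: the state names and the `replay` and `searchable` helpers are hypothetical, the diagram's intermediate `Empty` state is folded into the `parse` transitions, and the `("stale", "feed")` entry is an assumption because the diagram does not draw repeated feeds.

```python
from typing import Dict, Iterable, Tuple

# (state, command) -> next state, transcribed from the lifecycle diagram.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("none", "parse"): "indexed",
    ("none", "parse --no-index"): "unindexed",
    ("indexed", "feed"): "stale",
    ("unindexed", "feed"): "stale",
    ("stale", "feed"): "stale",  # assumption: not drawn in the diagram
    ("stale", "build-index"): "indexed",
    ("unindexed", "build-index"): "indexed",
    ("indexed", "build-index --force"): "indexed",
}


def replay(commands: Iterable[str], state: str = "none") -> str:
    """Apply commands in order and return the resulting index state."""
    for command in commands:
        key = (state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"{command!r} is not modeled from state {state!r}")
        state = TRANSITIONS[key]
    return state


def searchable(state: str) -> bool:
    """Search and chat need a current index."""
    return state == "indexed"
```

Replaying `parse --no-index`, two `feed` calls, and one `build-index` ends in the `indexed` state, which matches the batch workflow shown later on this page.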

## Prerequisites

LLM and embedder clients must be configured before any extraction command runs. `validate_config()` checks `~/.he/config.toml` and environment fallbacks (`OPENAI_API_KEY`, `OPENAI_BASE_URL`) on every `parse`, `feed`, and `build-index` invocation.

<CardGroup>
  <Card title="Configure providers" href="/configure-providers">
    Set up `he config init`, `he config llm`, and `he config embedder` before your first extraction.
  </Card>
  <Card title="List templates" href="/cli-reference">
    Run `he list template` to discover preset IDs such as `finance/earnings_summary` or `general/biography_graph`.
  </Card>
</CardGroup>

## Create a Knowledge Abstract with `he parse`

`he parse` reads input, instantiates a template via `Template.create`, extracts structured knowledge with `feed_text`, saves the KA with `dump`, and optionally builds a vector index.

### Input sources

| Input | Behavior |
|-------|----------|
| File path | Single UTF-8 file read via `read_input` |
| Directory | All `*.md` and `*.txt` files discovered by glob, concatenated with `\n\n` |
| `-` (stdin) | Full stdin buffer (`cat doc.md \| he parse - ...`) |

<Warning>
Directory mode errors with exit code 1 when no `.md` or `.txt` files are found. Only those extensions are processed.
</Warning>
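
The three input modes in the table above can be sketched with `pathlib`. This is an approximation of the documented behavior, not the CLI's `read_input`: the name `read_input_sketch` is hypothetical, and the non-recursive directory scan and sorted file order are assumptions.

```python
import sys
from pathlib import Path

SUPPORTED_SUFFIXES = {".md", ".txt"}


def read_input_sketch(source: str) -> str:
    """Return text from stdin ('-'), a single file, or a directory."""
    if source == "-":
        return sys.stdin.read()
    path = Path(source)
    if path.is_dir():
        # Assumption: top-level files only, in sorted order.
        files = sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix in SUPPORTED_SUFFIXES
        )
        if not files:
            raise FileNotFoundError(f"No .txt or .md files found in {path}")
        # Documents are joined with a blank line between them.
        return "\n\n".join(p.read_text(encoding="utf-8") for p in files)
    return path.read_text(encoding="utf-8")
```

Files with other extensions in the directory are skipped silently, and an empty match raises instead of returning an empty string, mirroring the exit-code-1 behavior described in the warning.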

### Template selection

Templates resolve in three ways:

1. **Preset ID** — `-t finance/earnings_summary` loads a bundled YAML preset.
2. **Method shorthand** — `-m light_rag` maps to `method/light_rag` (English-only prompts; `--lang` is ignored).
3. **Interactive** — Omit `-t` and `-m` to trigger `select_template_interactive()`, which lists all presets from `Gallery.list()` and accepts a number or keyword search.

Knowledge templates require `--lang en` or `--lang zh`. Method templates always use `lang = "en"`.

```bash
# Preset template with explicit language
he parse earnings_call.md -t finance/earnings_summary -o ./finance_kb/ -l en

# Interactive selection (omit -t)
he parse document.md -o ./output/ -l en

# Extraction method (no -l required)
he parse document.md -m light_rag -o ./output/

# Directory of markdown files
he parse ./corpus/ -t general/concept_graph -o ./corpus_kb/ -l en

# Stdin
cat notes.md | he parse - -t general/biography_graph -o ./bio_kb/ -l en
```

### Flags

<ParamField body="--output / -o" type="string" required>
Output directory for the new KA. Created with `mkdir(parents=True, exist_ok=True)`.
</ParamField>

<ParamField body="--template / -t" type="string">
Preset template ID (e.g., `general/biography_graph`, `finance/earnings_summary`). Omit for interactive selection.
</ParamField>

<ParamField body="--method / -m" type="string">
Method name (e.g., `light_rag`, `hyper_rag`). Sets template to `method/{name}`.
</ParamField>

<ParamField body="--lang / -l" type="string">
Language code (`en` or `zh`). Required for knowledge templates; ignored for method templates.
</ParamField>

<ParamField body="--force / -f" type="boolean" default="false">
Overwrite a non-empty output directory. Without `-f`, a populated directory causes exit code 1.
</ParamField>

<ParamField body="--no-index" type="boolean" default="false">
Skip `build_index` after extraction. Use for batch workflows; rebuild later with `he build-index`.
</ParamField>

### Parse pipeline

<Steps>
  <Step title="Validate configuration">
    `validate_config()` ensures LLM and embedder API keys (or vLLM `base_url`) are present.
  </Step>
  <Step title="Resolve template">
    `Template.get(template)` validates the preset; `Template.create(template, lang)` builds the AutoType instance.
  </Step>
  <Step title="Extract knowledge">
    `feed_text(text)` chunks input (default 2048 chars, 256 overlap), runs structured LLM extraction, and merges results per AutoType strategy.
  </Step>
  <Step title="Persist KA">
    `dump(output_path)` writes `data.json`, `metadata.json` (template, lang, timestamps), and optionally `index/`.
  </Step>
  <Step title="Build index (default)">
    Unless `--no-index`, `build_index()` runs, then `dump` saves the FAISS index under `index/`.
  </Step>
</Steps>
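
The chunking step in the pipeline above uses a fixed window with overlap. The following sketch shows a plain character window with the documented defaults (2048 characters, 256 overlap); `chunk_text` is a hypothetical helper, and the library's splitter may additionally respect separators or sentence boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 256) -> List[str]:
    """Split text into windows of chunk_size that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A 5000-character input yields three chunks under these defaults, and the last 256 characters of each chunk repeat at the start of the next one so that entities spanning a boundary are seen whole at least once.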

### Output layout

:::files
./output/
├── data.json           # Structured extraction (entities, relations, model fields, etc.)
├── metadata.json       # template, lang, created_at, updated_at
└── index/              # FAISS vector store (when index is built)
    ├── index.faiss
    └── docstore.json
:::

`metadata.json` records the template ID and language so later commands (`he feed`, `he build-index`, `he search`) can reload the correct AutoType without re-specifying flags.

## Append documents with `he feed`

`he feed` loads an existing KA, extracts from new input, merges incrementally, and saves updated `data.json` and `metadata.json`.

```bash
# Initial extraction
he parse tesla_bio.md -t general/biography_graph -o ./tesla_kb/ -l en

# Append a second document (template/lang read from metadata.json)
he feed ./tesla_kb/ tesla_inventions.md

# Append from stdin
cat update.md | he feed ./tesla_kb/ -
```
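
As the comment in the example above notes, `he feed` takes the template and language from `metadata.json` unless flags override them. A minimal sketch of that lookup follows; `resolve_feed_template` is a hypothetical helper, and the fallback values (`general/graph`, `zh`) are the ones documented in the merge behavior section below.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def resolve_feed_template(
    ka_dir: str,
    template: Optional[str] = None,
    lang: Optional[str] = None,
) -> Tuple[str, str]:
    """Pick template and lang: explicit flag, then metadata.json, then fallback."""
    meta_path = Path(ka_dir) / "metadata.json"
    if not meta_path.is_file():
        raise FileNotFoundError(f"Not a valid Knowledge Abstract directory: {ka_dir}")
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    return (
        template or meta.get("template", "general/graph"),
        lang or meta.get("lang", "zh"),
    )
```

Explicit arguments always win, so passing `-t` or `-l` on the command line corresponds to supplying `template` or `lang` here.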

### Merge behavior

`feed_text` calls `_update_data_state`, which merges incoming extraction into the current AutoType and calls `clear_index()`. Merge semantics depend on the template's AutoType:

| AutoType | Incremental merge |
|----------|-------------------|
| `model` | Field-level merge; first extraction wins for populated fields |
| `graph` / `hypergraph` | Nodes and edges added via memory-layer deduplication |
| `list` / `set` | Items appended or deduplicated per identifier rules |

Override template or language only when necessary:

```bash
he feed ./ka/ doc.md -t general/biography_graph -l en
```

When `--template` and `--lang` are omitted, both values are read from `metadata.json`. If a key is missing there, `template` falls back to `general/graph` and `lang` falls back to `zh`.

<Info>
Verify growth with `he info ./ka/` — node/edge counts and `updated_at` should change after a successful feed.
</Info>
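
Two of the merge strategies in the table above can be illustrated on plain Python data. These helpers are hypothetical and much simpler than the AutoType implementations: `merge_model` keeps the first populated value for each field, and `merge_keyed` deduplicates items by a key function, in the same spirit as the `node_key_extractor` argument used in the `AutoGraph` examples. Keeping the first-seen item on a key collision is an assumption.

```python
from typing import Any, Callable, Dict, Hashable, List

EMPTY_VALUES = (None, "", [], {})


def merge_model(current: Dict[str, Any], incoming: Dict[str, Any]) -> Dict[str, Any]:
    """Field-level merge where already populated fields are kept."""
    merged = dict(current)
    for field, value in incoming.items():
        if merged.get(field) in EMPTY_VALUES:
            merged[field] = value
    return merged


def merge_keyed(
    current: List[Any],
    incoming: List[Any],
    key: Callable[[Any], Hashable],
) -> List[Any]:
    """Union of items deduplicated by key; first-seen item wins (assumption)."""
    by_key = {key(item): item for item in current}
    for item in incoming:
        by_key.setdefault(key(item), item)
    return list(by_key.values())
```

For example, feeding a second document that mentions an existing node name adds only the nodes whose keys were not seen before, which is why node and edge counts in `he info` grow by less than the number of extracted items.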

## Rebuild indexes with `he build-index`

`he build-index` loads the KA, optionally clears the existing index with `--force`, embeds all indexable items, and persists FAISS files to `index/`.

```bash
# Build index for a KA parsed with --no-index
he build-index ./output/

# Force rebuild after feeding or manual data.json edits
he build-index ./output/ -f
```

| Condition | Behavior |
|-----------|----------|
| Index exists, no `--force` | Prints warning and exits 0 without rebuilding |
| Index missing or `--force` | Clears index (`clear_index`), runs `build_index`, saves via `dump` |
| `data.json` missing | Exit 1 via `validate_ka_with_data` |

`he search` and `he talk` require a non-empty `index/` directory (`validate_ka_with_index`).

## Batch workflow pattern

For multiple documents, defer indexing until all content is merged:

<CodeGroup>
```bash title="CLI batch"
# Parse first doc without index
he parse doc1.md -t general/biography_graph -o ./ka/ -l en --no-index

# Append remaining docs
he feed ./ka/ doc2.md
he feed ./ka/ doc3.md

# Single index build
he build-index ./ka/

# Query
he search ./ka/ "key concept"
he talk ./ka/ -q "Summarize all documents"
```

```python title="Python batch"
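from hyperextract import Template

# doc1_text, doc2_text, and doc3_text are placeholders for your document strings.
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(doc1_text)
ka.feed_text(doc2_text)
ka.feed_text(doc3_text)
ka.dump("./finance_kb/")  # persist data.json and metadata.json
ka.build_index()          # single index build after all feeds
ka.dump("./finance_kb/")  # persist index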
```
</CodeGroup>

## Python API equivalent

The CLI commands map directly to `BaseAutoType` lifecycle methods:

| CLI | Python |
|-----|--------|
| `he parse` (new KA) | `Template.create(...)` → `feed_text(text)` → `dump(path)` → `build_index()` → `dump(path)` |
| `he feed` | `Template.create(...)` → `load(path)` → `feed_text(text)` → `dump(path)` |
| `he build-index` | `Template.create(...)` → `load(path)` → `build_index()` → `dump(path)` |
| Preview without mutation | `parse(text)` returns a new instance |

<RequestExample>
```python
from hyperextract import Template

# Create and extract (equivalent to he parse)
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(earnings_transcript)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")

# Evolve (equivalent to he feed)
ka.load("./finance_kb/")
ka.feed_text(q4_update)
ka.dump("./finance_kb/")
ka.build_index()
ka.dump("./finance_kb/")
```
</RequestExample>

`Template.create` reads LLM and embedder from global config when clients are not passed explicitly. Method templates accept extra kwargs (for example `observation_time` for temporal extractors).

## Error handling

| Error | Cause | Resolution |
|-------|-------|------------|
| `LLM API key is not configured` | Missing config before extraction | Run `he config init` or set `OPENAI_API_KEY` |
| `--lang is required for knowledge templates` | `-l` omitted on a preset template | Add `--lang en` or `--lang zh` |
| `Output directory already exists and is not empty` | Re-parse to same path | Use `-f` or choose a new `-o` path |
| `Template '...' not found` | Invalid `-t` or `-m` value | Run `he list template` or `he list method` |
| `No .txt or .md files found` | Empty or unsupported directory | Add `.md`/`.txt` files or pass a single file |
| `Index not found` on search/talk | Fed KA without rebuild | Run `he build-index ./ka/` |
| `Not a valid Knowledge Abstract directory` | Missing `metadata.json` on feed | Ensure directory was created by `he parse` |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` to trace extraction stages (`feed_text_invoked`, `knowledge_extracted`, `index_built`).

## Choosing a template

Use `he list template` to browse presets by domain, AutoType, and language. For example, the `finance/earnings_summary` preset is an `AutoModel` template with fields such as `company_name`, `quarter`, `reported_revenue`, and `overall_tone`, which suits earnings call transcripts in English or Chinese.

<Tabs>
  <Tab title="Domain presets">
    ```bash
    he parse transcript.md -t finance/earnings_summary -o ./earnings_kb/ -l en
    he parse bio.md -t general/biography_graph -o ./bio_kb/ -l en
    ```
  </Tab>
  <Tab title="Extraction methods">
    ```bash
    he parse paper.md -m hyper_rag -o ./paper_kb/
    he list method -q light
    ```
  </Tab>
</Tabs>

<CardGroup>
  <Card title="Templates vs methods" href="/templates-vs-methods">
    Compare YAML domain presets and algorithm-driven method templates, including language requirements.
  </Card>
  <Card title="Knowledge Abstracts" href="/knowledge-abstracts">
    Deep dive into `data.json`, `metadata.json`, and `index/` layout and lifecycle methods.
  </Card>
</CardGroup>

## Related pages

<CardGroup>
  <Card title="Quickstart" href="/quickstart">
    First successful extraction from install through `he search` and `he show`.
  </Card>
  <Card title="Search, chat, and visualize" href="/search-chat-visualize">
    Query and explore KAs after indexing with `he search`, `he talk`, and `he show`.
  </Card>
  <Card title="CLI reference" href="/cli-reference">
    Full `he` command surface, flags, defaults, and exit conditions.
  </Card>
  <Card title="Python API reference" href="/python-api-reference">
    `Template.create`, `feed_text`, `dump`, `load`, and `build_index` signatures.
  </Card>
  <Card title="Troubleshooting" href="/troubleshooting">
    Common failure modes for parse, feed, and index operations.
  </Card>
</CardGroup>

---

## 10. Search, chat, and visualize

> Query Knowledge Abstracts with `he search` and `he talk` (single query or `-i` interactive mode), inspect stats via `he info`, and render graphs through OntoSight with `he show` or `AutoType.show()`.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/10-search-chat-and-visualize.md
- Generated: 2026-06-18T20:55:42.063Z

### Source Files

- `hyperextract/cli/cli.py`
- `hyperextract/types/base.py`
- `hyperextract/types/graph.py`
- `hyperextract/cli/utils.py`
- `hyperextract/types/hypergraph.py`


Hyper-Extract exposes four exploration commands on a Knowledge Abstract (KA) directory: `he info` for metadata and counts, `he search` and `he talk` for semantic retrieval and Q&A over the vector index, and `he show` for OntoSight visualization. The Python SDK mirrors the same surface through `BaseAutoType.search()`, `chat()`, and `show()` on template-backed AutoType instances.

<Note>
`he search` and `he talk` require a populated `index/` directory. `he show` and `he info` only require `data.json`. LLM and embedder configuration is validated for search, talk, and show, but not for `he info`.
</Note>

## Prerequisites

Before querying or chatting, ensure the KA is ready:

<Steps>
<Step title="Create or load a Knowledge Abstract">

Run `he parse` (or `he feed` to append) so the output directory contains `data.json` and `metadata.json`. See [Extract and evolve](/extract-and-evolve).

</Step>
<Step title="Build the search index">

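`he parse` builds the index by default. If you used `--no-index`, build it now. If you appended data with `he feed` and an index already exists, add `--force`; without it, `he build-index` prints a warning and leaves the stale index in place:

```bash
he build-index ./output/          # no index yet (parsed with --no-index)
he build-index ./output/ --force  # index exists but is stale (after he feed)
```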

</Step>
<Step title="Configure providers">

Search and talk use the embedder for retrieval and the LLM for chat answers. Initialize configuration with `he config init` or set environment variables. See [Configure providers](/configure-providers).

</Step>
<Step title="Verify readiness">

```bash
he info ./output/
```

Confirm that `Nodes` is non-zero and that `Index` shows `Built`. `Edges` should also be non-zero for graph types; non-graph types always report `0`.

</Step>
</Steps>

## Inspect with `he info`

`he info` prints KA metadata and statistics without loading the full AutoType or calling an LLM.

```bash
he info ./output/
```

<ResponseExample>

```text
Knowledge Abstract Info

Path          ./output/
Template      general/biography_graph
Language      en
Created       2024-01-15 10:30:00
Updated       2024-01-15 10:35:22
Nodes         25
Edges         32
Index         Built
```

</ResponseExample>

| Field | Meaning |
|-------|---------|
| `Template` | Preset or custom template ID from `metadata.json` |
| `Language` | Processing language (`en` or `zh`) |
| `Nodes` | Entity/item count from `data.json` (`nodes`, `entities`, or list length) |
| `Edges` | Relationship count (`edges`, `relations`, or `0` for non-graph types) |
| `Index` | `Built` when `index/` exists and is non-empty; otherwise `Not Built` |

Use `he info` to confirm extraction succeeded, monitor growth after `he feed`, and check whether `he build-index` is needed before search or talk.
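
The counting rules in the table can be approximated directly from the files on disk. This is a sketch under stated assumptions, not the CLI's code: `summarize_ka` is a hypothetical helper, a top-level JSON array is assumed for list-like payloads, and any other shape without `nodes` or `entities` is reported as zero.

```python
import json
from pathlib import Path


def summarize_ka(ka_dir: str) -> dict:
    """Approximate the Nodes / Edges / Index rows of `he info` (illustrative)."""
    root = Path(ka_dir)
    data = json.loads((root / "data.json").read_text(encoding="utf-8"))

    if isinstance(data, list):
        # List-like payload: count items, no edges.
        nodes, edges = len(data), 0
    else:
        node_items = data.get("nodes") or data.get("entities") or []
        edge_items = data.get("edges") or data.get("relations") or []
        nodes, edges = len(node_items), len(edge_items)

    # "Built" means index/ exists and contains at least one entry.
    index_dir = root / "index"
    built = index_dir.is_dir() and any(index_dir.iterdir())
    return {"nodes": nodes, "edges": edges, "index": "Built" if built else "Not Built"}
```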

## Semantic search with `he search`

`he search` embeds the query, runs similarity search against the FAISS index, and prints ranked structured results as JSON.

```bash
he search ./output/ "Tesla's inventions"
he search ./output/ "electrical engineering" -n 10
```

<ParamField body="ka_path" type="string" required>
Path to the KA directory.
</ParamField>

<ParamField body="query" type="string" required>
Natural-language or keyword search string.
</ParamField>

<ParamField body="--top-k" type="integer" default="3">
Number of results. Short form: `-n`.
</ParamField>

### Retrieval pipeline

```mermaid
sequenceDiagram
    participant CLI as he search
    participant KA as AutoType instance
    participant IDX as FAISS index
    participant EMB as Embedder

    CLI->>KA: Template.create + load(ka_path)
    CLI->>KA: search(query, top_k)
    KA->>EMB: embed query
    KA->>IDX: similarity_search
    IDX-->>KA: ranked documents
    KA-->>CLI: structured items
    CLI-->>CLI: print JSON results
```

For graph and hypergraph AutoTypes, `search()` returns a tuple `(nodes, edges)`. The CLI enumerates that tuple, so output typically shows a node group and an edge group rather than flat numbered entities.

For list, set, and model AutoTypes, `search()` returns a flat list of Pydantic items—one JSON object per result.
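
Code that handles more than one AutoType has to cope with both return shapes. A minimal sketch of a normalizer follows; `group_search_results` is a hypothetical helper that relies only on the tuple-versus-list distinction described above.

```python
def group_search_results(results):
    """Normalize search() output into a dict keyed by result group."""
    if isinstance(results, tuple) and len(results) == 2:
        # Graph and hypergraph types: (nodes, edges)
        nodes, edges = results
        return {"nodes": list(nodes), "edges": list(edges)}
    # List, set, and model types: flat list of items
    return {"items": list(results)}
```

With this in place, `group_search_results(ka.search(query, top_k=3))` can be iterated the same way regardless of the template's AutoType.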

<Tip>
Use natural-language queries (`"What were the major achievements?"`) rather than bare keywords. Increase `-n` when results feel too narrow.
</Tip>

## Chat with `he talk`

`he talk` retrieves context with the same vector index, then calls the configured LLM to synthesize an answer. Single-query and interactive modes are supported.

<Tabs>
<Tab title="Single query">

```bash
he talk ./output/ -q "What were Tesla's major achievements?"
he talk ./output/ -q "Explain the War of Currents" -n 10
```

Prints the answer to stdout. When the LLM response includes retrieved context, the CLI shows truncated `Retrieved context` lines from `response.additional_kwargs["retrieved_items"]`.

</Tab>
<Tab title="Interactive mode">

```bash
he talk ./output/ -i
```

Starts a REPL. Type questions at the `>` prompt. Exit with `exit`, `quit`, or `q`. `Ctrl+C` also exits cleanly.

</Tab>
</Tabs>

<ParamField body="--query" type="string">
Question for single-query mode. Short form: `-q`. Required unless `--interactive` is set.
</ParamField>

<ParamField body="--interactive" type="boolean" default="false">
Enter interactive chat loop. Short form: `-i`.
</ParamField>

<ParamField body="--top-k" type="integer" default="3">
Number of context items retrieved before LLM generation. Short form: `-n`.
</ParamField>

### Chat pipeline

`BaseAutoType.chat()` performs retrieval → context formatting → LLM invocation:

1. `search(query, top_k)` fetches relevant items.
2. Items are serialized to JSON (or plain text for string results) and joined into a context block.
3. A QA prompt asks the LLM to answer from that context.
4. The returned `AIMessage` includes `content` and `additional_kwargs["retrieved_items"]`.

Graph and hypergraph types override `chat()` to retrieve nodes and edges separately, format them under `=== Relevant Nodes ===` and `=== Relevant Edges ===` headers, and attach `retrieved_nodes` / `retrieved_edges` in metadata.
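
The context-formatting step can be sketched as follows. Only the two section headers come from the description above; the helper names, the one-item-per-line layout, and the blank line between sections are assumptions for illustration.

```python
import json


def format_items(items) -> str:
    """Serialize each item to JSON, keeping plain strings unchanged."""
    lines = []
    for item in items:
        if isinstance(item, str):
            lines.append(item)
        elif hasattr(item, "model_dump"):
            # Pydantic v2 models expose model_dump()
            lines.append(json.dumps(item.model_dump(), ensure_ascii=False))
        else:
            lines.append(json.dumps(item, ensure_ascii=False))
    return "\n".join(lines)


def format_graph_context(nodes, edges) -> str:
    """Join nodes and edges under the headers used by graph-type chat()."""
    return (
        "=== Relevant Nodes ===\n" + format_items(nodes)
        + "\n\n=== Relevant Edges ===\n" + format_items(edges)
    )
```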

<Warning>
`he talk` requires either `-q` or `-i`. Running `he talk ./output/` without either exits with an error.
</Warning>

## Visualize with `he show`

`he show` loads the KA, resolves the template from `metadata.json`, and opens an OntoSight viewer in the default browser.

```bash
he show ./output/
```

Visualization works for all eight AutoTypes. Graph-based types (`AutoGraph`, `AutoHypergraph`, `AutoTemporalGraph`, `AutoSpatialGraph`, `AutoSpatioTemporalGraph`) render nodes and edges. `AutoList`, `AutoSet`, and `AutoModel` use list, set, and structured views respectively.

When both node and edge indices exist (graph types) or a single FAISS index exists (list/set/model), OntoSight wires **search** and **chat** callbacks into the viewer so you can query from the UI. Without indices, visualization is read-only.

| AutoType | OntoSight viewer | In-viewer search/chat |
|----------|------------------|----------------------|
| `AutoGraph` | `view_graph` | When `node_index` and `edge_index` exist |
| `AutoHypergraph` | `view_hypergraph` | When both indices exist |
| `AutoList` / `AutoSet` | List/set view | When FAISS index exists |
| `AutoModel` | Structured view | When index exists |

If the browser does not open automatically, check the terminal for the localhost URL and open it manually.

## Python API equivalents

The CLI commands delegate to `Template.create(template, lang)` → `load(ka_path)` → AutoType methods.

<CodeGroup>

```python Python — search and chat
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")
ka.load("./output/")
ka.build_index()  # skip if index already on disk

# Graph types return (nodes, edges)
nodes, edges = ka.search("AC power system", top_k=3)

response = ka.chat("Who was Nikola Tesla?", top_k=3)
print(response.content)
print(response.additional_kwargs.get("retrieved_nodes", []))
```

```python Python — visualize
ka.show(
    node_label_extractor=lambda n: n.name,
    edge_label_extractor=lambda e: e.type,
)
```

```python Python — in-memory workflow
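from pathlib import Path
from hyperextract import Template

# Any graph template works here; nothing is loaded from or dumped to disk.
graph = Template.create("general/biography_graph", language="en")
text = Path("examples/en/tesla.md").read_text(encoding="utf-8")
questions = ["Who was Nikola Tesla?", "What did he invent?"]

graph.feed_text(text)
graph.build_index()

for q in questions:
    print(graph.chat(q).content)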

graph.show()
```

</CodeGroup>

`AutoType.show()` accepts optional label extractors and `top_k_*_for_search` / `top_k_*_for_chat` kwargs on graph types to control OntoSight callback retrieval depth.

## Search vs talk

| | `he search` | `he talk` |
|---|-------------|-----------|
| Output | Raw structured items (JSON) | Natural-language answer |
| LLM call | No (embedder only) | Yes |
| Speed | Faster | Slower |
| Best for | Locating specific entities/relations | Explanations, summaries, follow-up Q&A |
| Index required | Yes | Yes |

A typical workflow: `he search` to locate relevant nodes or edges, then `he talk -q` for a synthesized explanation, then `he show` to inspect structure visually.

## Index layout

Graph and hypergraph KAs store separate FAISS indices under `index/`:

:::files
output/
├── data.json
├── metadata.json
└── index/
    ├── node_index/
    └── edge_index/
:::

List, set, and model types use a single FAISS directory at `index/`. After `he feed`, indexes may be stale; run `he build-index ./output/ --force` before searching or chatting on updated data.

## Troubleshooting

<AccordionGroup>
<Accordion title="Index not found">

```text
Error: Index not found. Please run 'he build-index <ka_path>' first.
```

Run `he build-index ./output/`. If an index already exists but data changed, add `--force`.

</Accordion>

<Accordion title="No search results">

- Broaden the query or increase `-n`.
- Confirm data exists: `he info ./output/` should show `Nodes > 0`.
- Rebuild the index after feeding new documents.

</Accordion>

<Accordion title="Empty or missing visualization">

`he show` requires `data.json`. Check `he info` for zero nodes/edges—extraction may have failed or produced an empty graph.

</Accordion>

<Accordion title="Configuration errors">

Search, talk, and show call `validate_config()`. Run `he config init` and configure LLM and embedder providers. See [Troubleshooting](/troubleshooting) for API key and vLLM `base_url` issues.

</Accordion>

<Accordion title="Template resolution failures">

`he search`, `he talk`, and `he show` read `template` and `lang` from `metadata.json`. Missing or unknown templates raise load errors. Custom templates must have a `{template}.yaml` file in the KA directory.

</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Quickstart" href="/quickstart">
First extraction with `he parse`, then `he search` and `he show` on the Tesla biography example.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk KA layout (`data.json`, `metadata.json`, `index/`) and lifecycle methods.
</Card>
<Card title="CLI reference" href="/cli-reference">
Full `he search`, `he talk`, `he show`, and `he info` flag and exit contracts.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`BaseAutoType.search`, `chat`, `show`, `load`, and `build_index` signatures.
</Card>
<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
End-to-end parse → visualize → search → Q&A workflow.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Missing index, empty results, and provider configuration failures.
</Card>
</CardGroup>

---

## 11. Create custom templates

> Author domain YAML templates: type selection, field and identifier design, multilingual `language` blocks, merge strategies, and validation workflow per the design guide and preset base templates.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/11-create-custom-templates.md
- Generated: 2026-06-18T20:56:21.321Z

### Source Files

- `hyperextract/templates/DESIGN_GUIDE.md`
- `hyperextract/utils/template_engine/parsers/loader.py`
- `hyperextract/utils/template_engine/parsers/schemas/base.py`
- `hyperextract/templates/presets/general/base_graph.yaml`
- `hyperextract/utils/template_engine/factory.py`
- `hyperextract/templates/README.md`


Domain YAML templates in Hyper-Extract declare an AutoType (`model`, `list`, `set`, or a graph family), an `output` schema, LLM `guideline` rules, `identifiers` for deduplication, and optional `options` / `display` settings. `TemplateFactory.create` loads presets from `hyperextract/templates/presets/` via `Gallery`, or accepts an absolute path to a standalone `.yaml` file; `load_template` validates structure with Pydantic and checks each entry in `language` through `localize_template` before extraction runs.

<Note>
Knowledge templates require a language code (`zh` or `en`) at runtime. Method templates under `method/` are English-only and ignore `--lang`. See [Templates vs methods](/templates-vs-methods).
</Note>

## Design workflow

Hyper-Extract follows a four-stage authoring pipeline documented in `hyperextract/templates/DESIGN_GUIDE.md`:

```text
Requirements → brainstorm → designer → optimizer (optional) → validator
                    ↓            ↓              ↓                  ↓
              Type selection   YAML draft    Auto-fix rules    Schema check
```

<Steps>
<Step title="Brainstorm requirements">

Clarify input source, extraction targets, entity granularity, relation types, and whether time or location matter. Record a draft with the chosen AutoType and field list.

</Step>
<Step title="Draft YAML from a base template">

Copy the matching base preset from `hyperextract/templates/presets/general/` (`base_model`, `base_list`, `base_set`, `base_graph`, `base_hypergraph`, `base_temporal_graph`, `base_spatial_graph`, or `base_spatio_temporal_graph`) and rename fields, tags, and guidelines for your domain.

</Step>
<Step title="Optimize (optional)">

Apply naming fixes (`relation_type` → `type`, `event_date` → `time`), separate mixed-language blocks, and trim fields above the five-field guideline. The bundled `hyperextract-skills/template-optimizer` skill automates these patterns.

</Step>
<Step title="Validate">

Run structural validation (see [Validation workflow](#validation-workflow)) before parsing sample documents.

</Step>
</Steps>

## Choose an AutoType

Pick the container type by whether the data has relationships and which extra dimensions (time, location) those relationships carry. `TemplateFactory.create` dispatches to the matching constructor for all eight types.

| Need | AutoType | Output shape |
|------|----------|--------------|
| Single structured record | `model` | Flat fields on one object |
| Ordered sequence | `list` | Array of items |
| Deduplicated registry | `set` | Unique items keyed by `identifiers.item_id` |
| Binary relations (A→B) | `graph` | Nodes + edges (`source`, `target`, `type`) |
| Multi-party relations | `hypergraph` | Nodes + hyperedges with `participants` or role groups |
| Relations + time | `temporal_graph` | Edges carry `time`; set `identifiers.time_field` |
| Relations + location | `spatial_graph` | Edges carry `location`; set `identifiers.location_field` |
| Relations + time + location | `spatio_temporal_graph` | Both `time_field` and `location_field` |

```text
Need relationships?
├─ No → model | list | set
└─ Yes → graph (binary) | hypergraph (multi-party)
         └─ + time → temporal_graph
         └─ + location → spatial_graph
         └─ + both → spatio_temporal_graph
```
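
The decision tree can also be written as a function. This is a sketch only: `choose_autotype` and its parameters are hypothetical, and letting `multi_party` take precedence over time and location is an assumption the tree does not spell out.

```python
def choose_autotype(relations=False, multi_party=False, needs_time=False,
                    needs_location=False, unique=False, many=False) -> str:
    """Map extraction requirements to a template `type` value."""
    if not relations:
        if unique:
            return "set"  # deduplicated registry
        return "list" if many else "model"
    if multi_party:
        return "hypergraph"
    if needs_time and needs_location:
        return "spatio_temporal_graph"
    if needs_time:
        return "temporal_graph"
    if needs_location:
        return "spatial_graph"
    return "graph"
```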

Domain presets such as `finance/earnings_summary` (`model`) and `general/biography_graph` (`temporal_graph`) extend the base patterns. See [Auto-Types](/auto-types) for merge behavior and indexing details.

## Template skeleton

Every knowledge template shares the same top-level keys validated by `TemplateCfg`:

| Key | Purpose |
|-----|---------|
| `language` | Supported locales, e.g. `[zh, en]` |
| `name` | Template identifier; Gallery indexes presets as `{domain}/{name}` |
| `type` | One of the eight AutoTypes |
| `tags` | Lowercase domain labels |
| `description` | Human-readable summary per language |
| `output` | Schema the LLM must populate |
| `guideline` | Extraction strategy and quality rules |
| `identifiers` | Deduplication keys (required for `set` and graph types) |
| `options` | Chunking, merge strategies, `extraction_mode`, index fields |
| `display` | Labels for OntoSight visualization |

### Schema vs guideline

**Schema (`output`) defines what fields exist; guideline defines how to extract them well.** Do not repeat field definitions in `guideline.rules` or `rules_for_entities` — keep guidelines focused on strategy, quality bar, and common mistakes.

Record types (`model`, `list`, `set`) use `output.fields`. Graph types use `output.entities` and `output.relations`, each with their own `fields` list.

### Field design rules

<ParamField body="name" type="string" required>
Field identifier in `snake_case`.
</ParamField>

<ParamField body="type" type="string" required>
One of `str`, `int`, `float`, `bool`, or `list`.
</ParamField>

<ParamField body="description" type="string | {zh, en}" required>
Semantic meaning for the LLM. Use pure Chinese in `zh` blocks and pure English in `en` blocks — no mixed scripts.
</ParamField>

<ParamField body="required" type="boolean">
When `false` or omitted, the field is optional.
</ParamField>

Keep at most five fields per entity, relation, or list item component. Prioritize essential identifiers (`source`, `target`, `participants`) before optional metadata.
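
A quick lint pass over a `fields` list can catch naming and type mistakes before the template is loaded. The sketch below encodes only the rules stated in this section; `check_fields` and the exact `snake_case` regular expression are hypothetical, and this is not the validation that `load_template` performs.

```python
import re

ALLOWED_TYPES = {"str", "int", "float", "bool", "list"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_fields(fields: list[dict], max_fields: int = 5) -> list[str]:
    """Return rule violations for one `fields` list (empty list means OK)."""
    problems = []
    if len(fields) > max_fields:
        problems.append(f"{len(fields)} fields exceeds the {max_fields}-field guideline")
    for field in fields:
        name = field.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"{name!r} is not snake_case")
        if field.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name!r} has unsupported type {field.get('type')!r}")
    return problems
```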

**Record type example** (from `base_model.yaml`):

```yaml
output:
  fields:
    - name: name
      type: str
      description:
        zh: '对象的名称或标题'
        en: 'Name or title of the object'
    - name: description
      type: str
      required: false
      description:
        zh: '对象的简要描述'
        en: 'Brief description of the object'

guideline:
  target:
    zh: '你是一位信息提取专家…'
    en: 'You are an information extraction expert…'
  rules:
    zh: ['提取文本中核心的、结构化的对象。']
    en: ['Extract the core, structured object from the text.']

display:
  label: '{name}'
```

**Graph type example** (from `base_graph.yaml`):

```yaml
output:
  entities:
    fields:
      - name: name
        type: str
      - name: type
        type: str
  relations:
    fields:
      - name: source
        type: str
      - name: target
        type: str
      - name: type
        type: str

identifiers:
  entity_id: name
  relation_id: '{source}|{type}|{target}'
  relation_members:
    source: source
    target: target

options:
  extraction_mode: two_stage

display:
  entity_label: '{name} ({type})'
  relation_label: '{type}'
```

## Identifier design

`parse_identifiers` turns YAML identifier config into runtime key extractors. Misconfigured identifiers cause duplicate nodes, failed merges, or broken `he feed` evolution.

| AutoType | Required identifiers | Pattern |
|----------|---------------------|---------|
| `set` | `item_id` | Field name, e.g. `name` |
| `graph` | `entity_id`, `relation_id`, `relation_members` | `relation_id` supports `{field}` templates |
| `hypergraph` (flat) | same + `relation_members: participants` | String pointing to a `list` field |
| `hypergraph` (nested) | `relation_members: [group_a, group_b]` | List of `list`-typed role fields |
| `temporal_graph` | + `time_field` | e.g. `time` on relation fields |
| `spatial_graph` | + `location_field` | e.g. `location` on relation fields |
| `spatio_temporal_graph` | both `time_field` and `location_field` | Combines temporal and spatial |

`relation_id` templates interpolate field values: `'{source}|{type}|{target}'` for graphs, `'{name}|{type}'` for simple hypergraphs (see `base_hypergraph.yaml`).
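
To see how a `relation_id` template yields a deduplication key, the interpolation can be reproduced with the standard library. This is an illustrative sketch, not `parse_identifiers` itself; `relation_key` and its error message are hypothetical.

```python
from string import Formatter


def relation_key(template: str, relation: dict) -> str:
    """Interpolate a relation_id template such as '{source}|{type}|{target}'."""
    # Collect the field names referenced by the template.
    needed = [name for _, name, _, _ in Formatter().parse(template) if name]
    missing = [name for name in needed if name not in relation]
    if missing:
        raise KeyError(f"Missing fields for relation_id: {missing}")
    return template.format(**relation)
```

Two extracted edges that produce the same key are candidates for merging, which is why every field named in the template must exist on the relation schema.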

<Warning>
Use `type` for relation type fields, not `relation_type`. Use `time` for temporal edges, not `event_date`. The design guide prescribes these names, and the `template-optimizer` skill applies the renames automatically.
</Warning>

## Multilingual `language` blocks

Set `language: [zh, en]` (or a single code) at the top level. Any string field can be:

- A plain string (single-language template)
- A dict `{zh: '…', en: '…'}`
- A dict of lists for numbered rules: `{zh: ['规则1'], en: ['Rule 1']}`

At runtime, `localize_template(config, language)` collapses multilingual values to the requested locale before `TemplateFactory` builds prompts. `load_template` validates localization for **every** language listed in `language` and raises `ValueError` if a locale is incomplete.
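
The collapsing step can be pictured with a simplified stand-in. The sketch below is not the library's `localize_template`: `localize` is hypothetical, and it assumes that a dict whose keys are all language codes (`zh`, `en`) is a multilingual value.

```python
LANGS = {"zh", "en"}


def localize(value, lang: str):
    """Recursively collapse {zh: ..., en: ...} values to a single locale."""
    if isinstance(value, dict):
        if value and set(value) <= LANGS:
            # Multilingual leaf: pick the requested locale or fail loudly.
            if lang not in value:
                raise ValueError(f"missing locale {lang!r}")
            return value[lang]
        return {key: localize(item, lang) for key, item in value.items()}
    if isinstance(value, list):
        return [localize(item, lang) for item in value]
    return value
```

Calling `localize(cfg, lang)` for each entry in `cfg["language"]` mirrors the every-locale check described above: an incomplete locale raises `ValueError` before any extraction runs.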

CLI and Python both require an explicit language for knowledge templates:

<CodeGroup>
```bash title="CLI"
he parse examples/en/tesla.md -o ./out -t general/biography_graph --lang en
```

```python title="Python"
from hyperextract import Template

ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(document_text)
```
</CodeGroup>

## Merge strategies and options

`options` maps to AutoType constructor kwargs through `parse_option`. YAML keys are translated to internal names (e.g. `merge_strategy` → `strategy_or_merger`, `entity_merge_strategy` → `node_strategy_or_merger`).

### Record and set types

| YAML key | Applies to | Valid values |
|----------|-----------|--------------|
| `merge_strategy` | `model`, `set` | `merge_field`, `keep_incoming`, `keep_existing`, `llm_balanced`, `llm_prefer_incoming`, `llm_prefer_existing` |
| `fields_for_search` | `list`, `set` | List of field names indexed for semantic search |

`merge_field` overwrites non-null fields and appends lists. `llm_balanced` (default when unset) asks the LLM to synthesize conflicting chunk results.
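
As a rough illustration of the `merge_field` rule, the sketch below merges two plain dicts. It is not the library's merger: the function is a stand-in, and plain list concatenation without deduplication is an assumption.

```python
def merge_field(existing: dict, incoming: dict) -> dict:
    """Overwrite with non-null incoming values; append when both sides are lists."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is None:
            continue  # a null never replaces an existing value
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged
```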

### Graph types

| YAML key | Purpose |
|----------|---------|
| `extraction_mode` | `one_stage` (joint node+edge) or `two_stage` (nodes first, then edges). Base graph presets default to `two_stage` for accuracy. |
| `entity_merge_strategy` | Node deduplication on incremental `feed_text` |
| `relation_merge_strategy` | Edge deduplication |
| `entity_fields_for_search` | Node fields indexed for search |
| `relation_fields_for_search` | Edge fields indexed for search |
| `observation_time` | Anchor for relative dates (`temporal_graph`, `spatio_temporal_graph`) |
| `observation_location` | Fallback for fuzzy locations (`spatial_graph`, `spatio_temporal_graph`) |

Pass `observation_time` or `observation_location` as `Template.create` kwargs to override template defaults at runtime:

```python
ka = Template.create(
    "finance/event_timeline",
    "en",
    observation_time="2024-06-15",
)
```

## Validation workflow

Structural validation happens at load time — no separate CLI command.

<Steps>
<Step title="Load and parse YAML">

```python
from hyperextract.utils.template_engine.parsers import load_template

cfg = load_template("/path/to/my_template.yaml")
```

`load_template` runs Pydantic validation on `TemplateCfg` and tests `localize_template` for each language in `language`.

</Step>
<Step title="Run the checklist">

**All types**

- [ ] `language` lists supported locales
- [ ] `name`, `type`, `tags`, `description`, `output`, `guideline` present
- [ ] `type` is a valid AutoType
- [ ] `tags` are lowercase

**Graph types**

- [ ] `output.entities` and `output.relations` exist
- [ ] `identifiers.entity_id`, `relation_id`, `relation_members` configured
- [ ] Temporal/spatial types include `time_field` / `location_field`

**Hypergraph**

- [ ] `relation_members` is a string (flat list field) or list of `list`-typed role fields

</Step>
<Step title="Smoke-test extraction">

```python
from hyperextract import Template

ka = Template.create("/path/to/my_template.yaml", "en")
ka.feed_text(sample_text)
ka.dump("./test-ka")
```

Inspect `data.json` and run `ka.show()` or `he show ./test-ka` to verify field population and graph connectivity.

</Step>
</Steps>
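
The structural checklist in the second step can be approximated in code once the YAML has been parsed into a dict. The sketch below is a hypothetical `check_template` helper that encodes only the checklist items listed above (the hypergraph check is omitted); it does not replace the Pydantic validation in `load_template`.

```python
AUTOTYPES = {
    "model", "list", "set", "graph", "hypergraph",
    "temporal_graph", "spatial_graph", "spatio_temporal_graph",
}
RECORD_TYPES = {"model", "list", "set"}
TEMPORAL = {"temporal_graph", "spatio_temporal_graph"}
SPATIAL = {"spatial_graph", "spatio_temporal_graph"}
REQUIRED_KEYS = ("language", "name", "type", "tags", "description", "output", "guideline")


def check_template(cfg: dict) -> list[str]:
    """Return checklist violations for a parsed template (empty list means pass)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    kind = cfg.get("type")
    if kind not in AUTOTYPES:
        problems.append(f"unknown type: {kind!r}")
    if any(tag != tag.lower() for tag in cfg.get("tags", [])):
        problems.append("tags must be lowercase")
    if kind in AUTOTYPES - RECORD_TYPES:
        output = cfg.get("output", {})
        ids = cfg.get("identifiers", {})
        problems += [f"missing output.{part}"
                     for part in ("entities", "relations") if part not in output]
        problems += [f"missing identifiers.{key}"
                     for key in ("entity_id", "relation_id", "relation_members")
                     if key not in ids]
        if kind in TEMPORAL and "time_field" not in ids:
            problems.append("temporal types need identifiers.time_field")
        if kind in SPATIAL and "location_field" not in ids:
            problems.append("spatial types need identifiers.location_field")
    return problems
```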

<Tip>
Install `hyperextract-skills` and invoke the `yaml-validator` skill for agent-assisted checklist runs. See [Template design skills](/template-design-skills).
</Tip>

### Common errors

| Symptom | Fix |
|---------|-----|
| `The template configuration is not valid for language {lang}` | Add missing `zh`/`en` text for that locale in `description`, `guideline`, or field descriptions |
| `language is required for knowledge templates` | Pass `"zh"` or `"en"` to `Template.create` or `--lang` to `he parse` |
| Duplicate entities after `he feed` | Tighten `entity_id` / `item_id`; align naming rules in `guideline` |
| Empty relations | Switch `extraction_mode` to `two_stage`; strengthen `rules_for_relations` |
| `Missing fields` during merge | Ensure `relation_id` template fields exist on relation schema |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` if extraction succeeds but output shape is wrong. See [Troubleshooting](/troubleshooting).

## Use a custom template

### Standalone YAML file (Python)

Place the file anywhere. `TemplateFactory.create` resolves paths ending in `.yaml` or existing filesystem paths through `load_template`:

```python
from hyperextract import Template

ka = Template.create("/path/to/my_template.yaml", "zh")
ka.feed_text(text)
ka.dump("./my-ka")
```

### Preset registration (CLI and Gallery)

To make a template selectable via `he parse -t domain/name` and `Template.create("domain/name", lang)`, add the YAML under `hyperextract/templates/presets/<domain>/`. `Gallery` auto-discovers `*.yaml` files at import time and registers them as `<domain>/<name>`.

```bash
he parse input.md -o ./ka-out -t finance/earnings_summary --lang en
he list template
```
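
Auto-discovery of this kind usually amounts to a directory walk. The sketch below is a minimal stand-in for what `Gallery` does, with one labeled assumption: it derives `<name>` from the file stem, whereas the real registry may use the template's `name` field. `discover_presets` is a hypothetical helper.

```python
from pathlib import Path


def discover_presets(presets_root: str) -> dict[str, Path]:
    """Map '<domain>/<name>' to each YAML file found under presets/<domain>/."""
    root = Path(presets_root)
    registry = {}
    for path in sorted(root.glob("*/*.yaml")):
        registry[f"{path.parent.name}/{path.stem}"] = path
    return registry
```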

<Info>
When reloading a Knowledge Abstract whose template is not in presets, `get_template_from_ka` looks for `{template}.yaml` beside `data.json` in the KA directory. Copy your custom YAML into the output folder to keep `he feed` and `he search` working across sessions.
</Info>

### Publish upstream

To contribute a template to the project preset library:

1. Add YAML to `hyperextract/templates/presets/<domain>/`
2. Include both `zh` and `en` descriptions
3. Test with representative documents
4. Submit a PR per [Contributing](/contributing)

## Naming conventions

| Element | Convention | Example |
|---------|-----------|---------|
| Template `name` | Descriptive identifier (design guide recommends CamelCase for new templates; presets often use `snake_case`) | `earnings_summary` |
| Field names | `snake_case` | `company_name` |
| Relation type field | `type` | not `relation_type` |
| Time on edges | `time` | not `event_date` |
| Tags | lowercase | `[finance, investor-relations]` |

## Related pages

<CardGroup>
<Card title="Template schema reference" href="/template-schema-reference">
Field-by-field YAML schema, valid types, and identifier patterns.
</Card>
<Card title="Auto-Types" href="/auto-types">
Merge behavior, indexing, and type selection criteria.
</Card>
<Card title="Template design skills" href="/template-design-skills">
Agent-assisted authoring with `hyperextract-skills`.
</Card>
<Card title="Extract and evolve" href="/extract-and-evolve">
Run `he parse` and `he feed` with your template against documents.
</Card>
<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
End-to-end example using `general/biography_graph`.
</Card>
</CardGroup>

---

## 12. Use extraction methods

> Invoke algorithm templates via `he parse -m light_rag` or `Template.create("method/hyper_rag")`; direct method classes (`Light_RAG`, `Atom`, etc.); and method-specific kwargs such as `observation_time` for temporal extractors.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/12-use-extraction-methods.md
- Generated: 2026-06-18T20:56:04.051Z

### Source Files

- `hyperextract/methods/registry.py`
- `hyperextract/utils/template_engine/factory.py`
- `hyperextract/methods/rag/light_rag.py`
- `hyperextract/methods/typical/atom.py`
- `hyperextract/cli/cli.py`
- `examples/en/methods/light_rag_demo.py`

---
title: "Use extraction methods"
description: "Invoke algorithm templates via `he parse -m light_rag` or `Template.create(\"method/hyper_rag\")`; direct method classes (`Light_RAG`, `Atom`, etc.); and method-specific kwargs such as `observation_time` for temporal extractors."
---

Extraction methods are nine registered algorithms (`graph_rag`, `light_rag`, `hyper_rag`, `hypergraph_rag`, `cog_rag`, `itext2kg`, `itext2kg_star`, `kg_gen`, `atom`) resolved through `hyperextract/methods/registry.py`. Each method class subclasses `AutoGraph` or `AutoHypergraph` and uses fixed English prompts. Invoke them through the CLI (`he parse -m <name>`), the `Template.create("method/<name>")` API, or by constructing method classes directly (`Light_RAG`, `Atom`, etc.).

<Note>
Method templates always use English prompts. The `--lang` flag is ignored when parsing with `-m`, and `Template.create` hardcodes `metadata["lang"] = "en"`.
</Note>

## How methods resolve

Method names map to concrete classes through a central registry. `TemplateFactory.create_method` looks up the class, forwards constructor `**kwargs`, and stamps metadata before returning the instance.

```mermaid
classDiagram
    direction LR
    class Registry {
        +register_method(name, class, autotype)
        +get_method(name)
        +list_methods()
    }
    class Template {
        +create(source, **kwargs)
        +get(path)
        +list()
    }
    class TemplateFactory {
        +create_method(name, llm, embedder, **kwargs)
        +create(source, **kwargs)
    }
    class Light_RAG
    class Hyper_RAG
    class Atom
    class AutoGraph
    class AutoHypergraph

    Registry --> Light_RAG
    Registry --> Hyper_RAG
    Registry --> Atom
    Light_RAG --|> AutoGraph
    Atom --|> AutoGraph
    Hyper_RAG --|> AutoHypergraph
    Template --> TemplateFactory
    TemplateFactory --> Registry : get_method
```

| Invocation path | Entry point | Resolves to |
|---|---|---|
| CLI | `he parse <input> -m light_rag -o <dir>` | `template = "method/light_rag"` → `Template.create(...)` |
| Python API | `Template.create("method/hyper_rag")` | `TemplateFactory.create_method("hyper_rag", ...)` |
| Direct class | `Light_RAG(llm_client=llm, embedder=emb)` | Bypasses registry; same runtime behavior |
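
The lookup path in the table can be pictured as a name-to-class dictionary. The sketch below is a simplified stand-in, not the code in `hyperextract/methods/registry.py`: the `_REGISTRY` dict, the `LightRagStub` class, the `description` argument, and the error text are assumptions made for illustration. Only the function names `register_method`, `get_method`, and `list_methods` come from the diagram above.

```python
from typing import Any

# Hypothetical in-memory registry; the real module may store more fields.
_REGISTRY: dict[str, dict[str, Any]] = {}


def register_method(name: str, cls: type, autotype: str, description: str = "") -> None:
    """Map a registry key such as "light_rag" to its class and autotype."""
    _REGISTRY[name] = {"class": cls, "type": autotype, "description": description}


def get_method(name: str) -> dict[str, Any]:
    """Return the registry entry, or fail loudly for unknown keys."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown method: {name}") from None


def list_methods() -> dict[str, dict[str, Any]]:
    """Return a shallow copy so callers cannot add or remove entries."""
    return dict(_REGISTRY)


class LightRagStub:
    """Stand-in for a method class; it only records its constructor kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs = kwargs


def create_method(name: str, **kwargs: Any) -> Any:
    """Look up the class, forward kwargs, and stamp metadata on the instance."""
    entry = get_method(name)
    instance = entry["class"](**kwargs)
    instance.metadata = {"template": f"method/{name}", "lang": "en", "type": entry["type"]}
    return instance


register_method("light_rag", LightRagStub, "graph", "Lightweight binary edges")
ka = create_method("light_rag", chunk_size=4096)
print(ka.metadata["template"])  # method/light_rag
```

Because lookups use the exact registry key, a class-style name such as `Light_RAG` would fail in this sketch, which mirrors the `Unknown method` symptom described under Troubleshooting.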

## List available methods

<Steps>
<Step title="CLI">

```bash
he list method
he list method -q rag    # filter by name or description
```

Displays method ID (`method/<name>`), output autotype (`graph` or `hypergraph`), and description.

</Step>
<Step title="Python">

```python
from hyperextract import Template
from hyperextract.methods import list_methods, list_method_cfgs

# Registry view: class, autotype, description
for name, info in list_methods().items():
    print(name, info["type"], info["description"])

# TemplateCfg view (keys are "method/<name>")
for path, cfg in list_method_cfgs().items():
    print(path, cfg.type, cfg.description)

# Methods also appear in Template.list()
all_templates = Template.list(include_methods=True)
```

</Step>
</Steps>

## Registered methods

| Method ID | Class | Autotype | Category |
|---|---|---|---|
| `method/graph_rag` | `Graph_RAG` | `graph` | RAG — community detection |
| `method/light_rag` | `Light_RAG` | `graph` | RAG — lightweight binary edges |
| `method/hyper_rag` | `Hyper_RAG` | `hypergraph` | RAG — n-ary hyperedges |
| `method/hypergraph_rag` | `HyperGraph_RAG` | `hypergraph` | RAG — advanced hypergraph |
| `method/cog_rag` | `Cog_RAG` | `hypergraph` | RAG — cognitive retrieval |
| `method/itext2kg` | `iText2KG` | `graph` | Typical — triple extraction |
| `method/itext2kg_star` | `iText2KG_Star` | `graph` | Typical — enhanced triples |
| `method/kg_gen` | `KG_Gen` | `graph` | Typical — configurable KG generation |
| `method/atom` | `Atom` | `graph` | Typical — temporal KG with evidence |

RAG methods target larger documents with retrieval-augmented extraction. Typical methods run direct LLM extraction pipelines without a separate retrieval stage.

## CLI: parse with a method

<ParamField body="--method / -m" type="string">
Method name without the `method/` prefix (e.g., `light_rag`, `atom`). When set, overrides `--template` and resolves to `method/<name>`.
</ParamField>

<ParamField body="--lang / -l" type="string">
Ignored for method templates. CLI forces `lang = "en"` and prints a note if `--lang` is supplied.
</ParamField>

<ParamField body="--output / -o" type="string" required>
Output directory for the Knowledge Abstract (`data.json`, `metadata.json`, optional `index/`).
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip vector index build after extraction. Rebuild later with `he build-index`.
</ParamField>
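
The interaction between `-m`, `-t`, and `--lang` described above can be summarized in a few lines. This is an illustrative sketch only: `resolve_parse_target` is a hypothetical helper, not a function in `hyperextract/cli/cli.py`, and the note and error strings are assumptions.

```python
from typing import Optional


def resolve_parse_target(
    template: Optional[str], method: Optional[str], lang: Optional[str]
) -> tuple[str, str, list[str]]:
    """Return (template_id, lang, notes) following the documented flag rules."""
    notes: list[str] = []
    if method:
        # -m wins over -t and always resolves to method/<name> in English.
        if lang:
            notes.append("--lang is ignored for method templates; using 'en'")
        return f"method/{method}", "en", notes
    if not template:
        raise ValueError("either a template or a method is required")
    if not lang:
        # Domain templates need an explicit language.
        raise ValueError("--lang is required for domain templates")
    return template, lang, notes


print(resolve_parse_target("general/biography_graph", "light_rag", "zh"))
```

In the call above the method takes precedence, so the supplied template and language are both overridden and a note is returned.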

<RequestExample>

```bash
# Prerequisites: he config init (LLM + embedder configured)
he parse examples/en/tesla.md -m light_rag -o ./ka-light-rag/
```

</RequestExample>

<RequestExample>

```bash
# Hypergraph extraction
he parse examples/en/tesla.md -m hyper_rag -o ./ka-hyper-rag/

# Skip indexing during parse
he parse examples/en/tesla.md -m atom -o ./ka-atom/ --no-index
```

</RequestExample>

After a successful parse, the CLI suggests follow-on commands: `he show`, `he search`, `he talk`, and `he feed` for incremental updates. Method-created Knowledge Abstracts store `metadata.template` as `method/<name>` and `metadata.lang` as `en`.

<Warning>
The CLI does not expose method constructor kwargs (e.g., `observation_time`). Pass those through the Python API when temporal anchoring or tuning parameters matter.
</Warning>

## Python: Template.create

`Template.create` is the unified entry point for both domain YAML templates and method templates. For methods, omit `language` — it is ignored and always set to `"en"`.

<CodeGroup>

```python CLI-equivalent workflow
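from pathlib import Path

from hyperextract import Template

# Uses the default LLM and embedder configured by `he config init`
ka = Template.create("method/light_rag")
ka.feed_text(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
ka.build_index()
ka.dump("./ka-light-rag/")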
```

```python With explicit clients
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from hyperextract import Template

llm = ChatOpenAI(model="gpt-4o-mini")
emb = OpenAIEmbeddings(model="text-embedding-3-small")

ka = Template.create(
    "method/graph_rag",
    llm_client=llm,
    embedder=emb,
)
ka.feed_text(text)
```

```python Non-destructive preview
ka = Template.create("method/light_rag")
preview = ka.parse(text)   # returns new instance; current ka unchanged
ka.feed_text(text)         # merges into current instance
```

</CodeGroup>

`Template.get("method/light_rag")` returns a `MethodCfg` with `name`, `type`, and `description`. `Template.list(include_methods=True)` merges gallery templates with all registered methods.

## Python: direct method classes

Import method classes when you need full control over construction, post-processing hooks, or kwargs the CLI cannot pass.

<CodeGroup>

```python Light_RAG
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from hyperextract.methods.rag import Light_RAG

llm = ChatOpenAI(model="gpt-4o-mini")
emb = OpenAIEmbeddings(model="text-embedding-3-small")

rag = Light_RAG(llm_client=llm, embedder=emb)
rag.feed_text(text)
print(len(rag.nodes), len(rag.edges))
rag.chat("Who founded the company?")
rag.show()
```

```python Hyper_RAG
from hyperextract.methods.rag import Hyper_RAG

# Reuses llm and emb from the Light_RAG tab; text is the document string
rag = Hyper_RAG(llm_client=llm, embedder=emb)
rag.feed_text(text)
print(len(rag.nodes), len(rag.hyper_edges))
```

```python Atom with temporal anchor
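from hyperextract.methods.typical import Atom

# Reuses llm and emb from the Light_RAG tab; text is the document string
atom = Atom(
    llm_client=llm,
    embedder=emb,
    observation_time="2024-06-15",  # anchor for relative dates in the text
)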
atom.feed_text(text)
atom.match_nodes_and_update_edges(threshold=0.85)
atom.dump("./ka-atom/")
```

</CodeGroup>

| Import path | Exported classes |
|---|---|
| `hyperextract.methods.rag` | `Light_RAG`, `Graph_RAG`, `Hyper_RAG`, `HyperGraph_RAG`, `Cog_RAG` |
| `hyperextract.methods.typical` | `Atom`, `iText2KG`, `iText2KG_Star`, `KG_Gen` |

Direct instantiation skips registry metadata stamping. Set metadata manually before `dump` if you rely on `he show` / `he search` reloading via `metadata.template`:

```python
ka.metadata["template"] = "method/atom"
ka.metadata["lang"] = "en"
ka.metadata["type"] = "graph"
```
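
If several directly constructed instances need the same treatment, the three assignments can be wrapped in a small helper. `stamp_method_metadata` is a hypothetical name, and the sketch assumes only that `metadata` behaves like a mutable mapping, as the snippet above implies.

```python
from collections.abc import MutableMapping

# Autotypes listed for methods on this page.
VALID_AUTOTYPES = {"graph", "hypergraph"}


def stamp_method_metadata(
    metadata: MutableMapping, method: str, autotype: str
) -> MutableMapping:
    """Fill the keys that reloading via `metadata.template` depends on."""
    if autotype not in VALID_AUTOTYPES:
        raise ValueError(f"unexpected autotype: {autotype}")
    metadata["template"] = f"method/{method}"
    metadata["lang"] = "en"  # method templates are English-only
    metadata["type"] = autotype
    return metadata


print(stamp_method_metadata({}, "atom", "graph"))
```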

## Constructor kwargs

`TemplateFactory.create_method` and `Template.create("method/...")` forward `**kwargs` to the method constructor.

### Shared parameters

Most methods accept:

<ParamField body="chunk_size" type="int" default="2048">
Characters per text chunk during extraction and indexing.
</ParamField>

<ParamField body="chunk_overlap" type="int" default="256">
Overlap between consecutive chunks.
</ParamField>

<ParamField body="max_workers" type="int" default="10">
Maximum concurrent LLM calls in batch extraction.
</ParamField>

<ParamField body="verbose" type="bool" default="false">
Enable detailed execution logging.
</ParamField>

```python
ka = Template.create(
    "method/light_rag",
    chunk_size=4096,
    chunk_overlap=512,
    max_workers=5,
    verbose=True,
)
```
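
To see how `chunk_size` and `chunk_overlap` interact, the sketch below splits text with a plain character-based sliding window. This is an assumption for illustration: the library's actual splitter is not shown on this page and may split on separators or tokens instead. `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Under this model, a larger overlap means a smaller step and therefore more chunks, which generally translates into more LLM calls for the same document.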

### Method-specific parameters

| Method | Parameter | Type | Default | Purpose |
|---|---|---|---|---|
| `atom` | `observation_time` | `str \| None` | current date | Anchor for resolving relative temporal expressions (`today`, `last week`, etc.) in factoid and edge prompts |
| `atom` | `facts_per_chunk` | `int` | `10` | Max atomic facts batched per edge-extraction call |
| `itext2kg_star` | `observation_date` | `str \| None` | current datetime | Populates `edge.properties.observation_date` post-extraction |

<RequestExample>

```python
# Atom: anchor relative dates in news text
ka = Template.create(
    "method/atom",
    observation_time="2024-06-15",
    facts_per_chunk=15,
)
ka.feed_text(
    "John Doe is no longer the CEO of GreenIT since a few months ago."
)
```

</RequestExample>

`Atom` resolves relative temporal expressions against `observation_time`, writing absolute `t_start` / `t_end` values on edges and setting `t_obs` to the observation date. When `observation_time` is omitted, `Atom` defaults to `datetime.now().strftime("%Y-%m-%d")`.
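
The effect of the anchor can be shown with plain date arithmetic. The sketch below is not how `Atom` works internally (the page states that resolution happens in the factoid and edge prompts); `resolve_relative` and the `_OFFSETS` table are hypothetical and cover only three expressions.

```python
from datetime import date, datetime, timedelta
from typing import Optional

# Hypothetical lookup table: days to subtract from the observation date.
_OFFSETS = {"today": 0, "yesterday": 1, "last week": 7}


def resolve_relative(expression: str, observation_time: Optional[str] = None) -> str:
    """Turn a relative expression into an absolute YYYY-MM-DD date."""
    if observation_time is None:
        # Same default the page describes for Atom.
        observation_time = datetime.now().strftime("%Y-%m-%d")
    anchor = date.fromisoformat(observation_time)
    days_back = _OFFSETS[expression.strip().lower()]
    return (anchor - timedelta(days=days_back)).isoformat()


print(resolve_relative("last week", "2024-06-15"))  # 2024-06-08
print(resolve_relative("yesterday", "2024-03-01"))  # 2024-02-29
```

Without an explicit anchor the same text resolves to different dates depending on when extraction runs, which is why archived or historical documents benefit from a fixed `observation_time`.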

`Atom` also exposes `match_nodes_and_update_edges(threshold=0.8)` for semantic node deduplication via `SemHash` embeddings — call this after `feed_text` when alias merging is needed.
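
The role of `threshold` can be illustrated with a greedy cosine-similarity merge over toy vectors. This is a swapped-in technique for explanation only: the real method relies on `SemHash` embeddings, and `cosine` and `merge_aliases` are hypothetical helpers.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_aliases(
    embeddings: dict[str, list[float]], threshold: float = 0.8
) -> dict[str, str]:
    """Map each node name to the first earlier name it is similar enough to."""
    canonical: dict[str, str] = {}
    representatives: list[str] = []
    for name, vector in embeddings.items():
        match = next(
            (rep for rep in representatives
             if cosine(vector, embeddings[rep]) >= threshold),
            None,
        )
        if match is None:
            representatives.append(name)  # new distinct node
            canonical[name] = name
        else:
            canonical[name] = match  # alias of an existing node
    return canonical


# Toy 2-D vectors; real embeddings have far more dimensions.
nodes = {"Tesla": [1.0, 0.0], "Nikola Tesla": [0.9, 0.1], "Edison": [0.0, 1.0]}
print(merge_aliases(nodes, threshold=0.8))
```

Raising the threshold makes merging stricter: in this toy data a threshold of `0.999` would keep "Tesla" and "Nikola Tesla" as separate nodes.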

## Knowledge Abstract lifecycle

Method instances inherit `BaseAutoType` lifecycle methods. A typical end-to-end Python workflow:

<Steps>
<Step title="Configure providers">

Run `he config init` so `Template.create` can read default LLM and embedder clients from `~/.he/config.toml`, or build clients with `create_client()` and pass them explicitly.

</Step>
<Step title="Extract">

```python
ka = Template.create("method/light_rag")
ka.feed_text(document_text)
```

`feed_text` merges extracted structure into the current instance. `parse(text)` returns a new instance without modifying the caller.

</Step>
<Step title="Persist">

```python
ka.build_index()
ka.dump("./my-ka/")
```

Produces `data.json`, `metadata.json`, and `index/` (when indexed).

</Step>
<Step title="Query and visualize">

```python
ka.search("wireless power", top_k=3)
ka.chat("What did Tesla invent?")
ka.show()  # OntoSight visualization
```

Equivalent CLI commands against the dumped directory: `he search`, `he talk`, `he show`.

</Step>
<Step title="Evolve">

```python
ka.feed_text(additional_text)
ka.build_index()
ka.dump("./my-ka/")
```

Or via CLI: `he feed ./my-ka/ new_doc.md` followed by `he build-index ./my-ka/`.

</Step>
</Steps>
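
Before running CLI commands against a dumped directory, the layout produced by the Persist step can be checked with the standard library. `ka_status` is a hypothetical helper, not part of Hyper-Extract; it only looks for the three artifacts named above.

```python
from pathlib import Path


def ka_status(ka_dir: str) -> dict[str, bool]:
    """Report which Knowledge Abstract artifacts exist in a directory."""
    root = Path(ka_dir)
    status = {
        "data.json": (root / "data.json").is_file(),
        "metadata.json": (root / "metadata.json").is_file(),
        "index/": (root / "index").is_dir(),
    }
    # Search and chat need the vector index in addition to the data file.
    status["searchable"] = status["data.json"] and status["index/"]
    return status


print(ka_status("./my-ka/"))
```

A result with `searchable` set to `False` suggests running `he build-index` on the directory before `he search` or `he talk`.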

## Choosing a method

| Goal | Start with | Output shape |
|---|---|---|
| General-purpose graph, fast | `light_rag` | `AutoGraph` — `nodes`, `edges` |
| Very large documents | `graph_rag` | `AutoGraph` with community-oriented extraction |
| Multi-entity relationships | `hyper_rag` | `AutoHypergraph` — `nodes`, `hyper_edges` |
| High-quality triples | `itext2kg` / `itext2kg_star` | `AutoGraph` |
| Temporal facts with evidence | `atom` | `AutoGraph` with `t_start`, `t_end`, `atomic_facts` on edges |
| Flexible prototyping | `kg_gen` | `AutoGraph` |

<Info>
Methods produce algorithm-driven graphs with fixed schemas baked into each class (e.g., `Light_RAG` node `name`/`type`/`description`, edge `source`/`target`/`keywords`/`strength`). Domain YAML templates under `general/`, `finance/`, etc. let you customize field schemas and multilingual prompts — see [Templates vs methods](/templates-vs-methods).
</Info>

## Troubleshooting

| Symptom | Cause | Fix |
|---|---|---|
| `Unknown method: <name>` | Name not in registry | Run `he list method`; use exact registry key (e.g., `light_rag`, not `Light_RAG`) |
| `--lang is required` error | Used `-t` with a domain template but omitted `--lang` | Add `--lang en` or `--lang zh`, or switch to `-m <method>` |
| `--lang` ignored message | Expected multilingual prompts on a method | Methods are English-only; use a domain template with `--lang` instead |
| Relative dates resolve incorrectly | `observation_time` not set on `atom` | Pass `observation_time="YYYY-MM-DD"` via `Template.create` or `Atom(...)` |
| `he search` / `he talk` fails | Index not built | Omit `--no-index` during parse, or run `he build-index <ka_path>` |
| Output directory not empty error | Target dir exists and is non-empty | Pass `--force` to overwrite |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` when tracing extraction phases (Atom logs atomic-fact and edge-extraction stages).

## Related pages

<CardGroup>
<Card title="Templates vs methods" href="/templates-vs-methods">
When to pick a domain YAML template over an algorithm method, and language requirements.
</Card>
<Card title="Extraction methods reference" href="/extraction-methods-reference">
Per-method autotype output, registry API, and full constructor signatures.
</Card>
<Card title="Method demos" href="/method-demos">
Runnable scripts under `examples/en/methods/` for each engine.
</Card>
<Card title="Configure providers" href="/configure-providers">
Set up LLM and embedder clients before parsing or calling `Template.create`.
</Card>
<Card title="CLI reference" href="/cli-reference">
Complete `he parse`, `he list method`, and related flag documentation.
</Card>
</CardGroup>

---

## 13. Template design skills

> Agent-assisted template authoring with `hyperextract-skills`: brainstorm requirements, record/graph designers, yaml-validator rules, template-optimizer fixes, and multilingual conversion workflows.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/13-template-design-skills.md
- Generated: 2026-06-18T20:56:41.419Z

### Source Files

- `hyperextract-skills/README.md`
- `hyperextract-skills/SKILL.md`
- `hyperextract-skills/graph-designer/SKILL.md`
- `hyperextract-skills/yaml-validator/SKILL.md`
- `hyperextract-skills/template-optimizer/SKILL.md`
- `hyperextract/templates/DESIGN_GUIDE.md`

---
title: "Template design skills"
description: "Agent-assisted template authoring with `hyperextract-skills`: brainstorm requirements, record/graph designers, yaml-validator rules, template-optimizer fixes, and multilingual conversion workflows."
---

The `hyperextract-skills/` directory is a portable agent skill pack that guides YAML template authoring for Hyper-Extract. It does not call LLM providers itself; skills are plain `SKILL.md` instruction files with reference rules and case YAML that any compatible local agent (Claude Code, Trae, Grok-Wiki, or another skill-capable runtime) can load from a file path or repository checkout. The root skill `hyper-extract` routes work through `brainstorm` → `record-designer` or `graph-designer` → `template-optimizer` → `yaml-validator`, with optional `multilingual` conversion. Produced templates are consumed by Hyper-Extract's runtime via `load_template()` and the `he parse --template` CLI path.

<Info>
Skill packs are provider-neutral: install by copying files into an agent's skills directory. No API keys, hosted service, or specific model vendor is required to use the design workflow.
</Info>

## Skill pack layout

```text
hyperextract-skills/
├── SKILL.md                 # Root router (name: hyper-extract)
├── brainstorm/SKILL.md
├── record-designer/
│   ├── SKILL.md
│   ├── cases/               # model, list, set examples
│   └── references/
├── graph-designer/
│   ├── SKILL.md
│   ├── cases/               # graph, hypergraph, spatio_temporal examples
│   └── references/
├── template-optimizer/
│   ├── SKILL.md
│   └── references/          # naming, multilingual, field-count, consistency
├── yaml-validator/
│   ├── SKILL.md
│   └── references/          # syntax, types, identifiers, errors
└── multilingual/SKILL.md
```

| Skill | Trigger phrases | Output |
|-------|-----------------|--------|
| `brainstorm` | "design template", "unsure which type" | Design draft with type, fields, identifiers |
| `record-designer` | model/list/set extraction | Record-type YAML |
| `graph-designer` | graph, hypergraph, temporal, spatial | Graph-type YAML |
| `template-optimizer` | "optimize template", "lint template" | Auto-fixed YAML + optimization report |
| `yaml-validator` | "validate template", "check YAML" | Pass/fail report with ERROR/WARNING/INFO |
| `multilingual` | "add translation", bilingual support | YAML with `language: [zh, en]` blocks |

## Installation

Skills install as files on disk. Copy the repository folder into the agent runtime's skills directory, or use a plugin command where supported.

<Tabs>
<Tab title="Claude Code">

```bash
cp -r hyperextract-skills ~/.claude/skills/
```

Or, from inside a Claude Code session:

```text
/plugin install hyperextract-skills
```

</Tab>
<Tab title="Trae">

```bash
cp -r hyperextract-skills ~/.trae/skills/
```

</Tab>
<Tab title="Other agents">

Point the agent at `hyperextract-skills/SKILL.md` as the entry skill, or copy the folder into that runtime's equivalent skills catalog. Grok-Wiki and other BYOC/BYOK agents load the same files without provider-specific adapters.

</Tab>
</Tabs>

<Note>
The canonical design reference in the main package is `hyperextract/templates/DESIGN_GUIDE.md`. Skills encode the same workflow as executable agent instructions; the design guide is the human-readable specification both skills and runtime validation align to.
</Note>

## End-to-end workflow

```mermaid
flowchart LR
  subgraph input [User input]
    U[Requirements]
  end
  subgraph skills [hyperextract-skills]
    B[brainstorm]
    RD[record-designer]
    GD[graph-designer]
    O[template-optimizer]
    V[yaml-validator]
    M[multilingual]
  end
  subgraph runtime [Hyper-Extract runtime]
    L[load_template]
    P[he parse / Template.create]
  end
  U --> B
  B -->|model/list/set| RD
  B -->|graph types| GD
  RD --> O
  GD --> O
  O --> V
  V -->|optional| M
  M --> L
  V --> L
  L --> P
```

<Steps>
<Step title="Brainstorm requirements">

Start with `brainstorm` when the extraction type is unknown. The skill asks about input source, target fields, entity granularity, relation semantics, and time/location dimensions, then emits a design draft:

```markdown
## Type: hypergraph
## Groups: [attackers, defenders]
## Fields: [battle_name, outcome]
```

Pass this draft to the appropriate designer skill.

</Step>
<Step title="Generate YAML with a designer">

Route by AutoType:

| Draft type | Skill |
|------------|-------|
| `model`, `list`, `set` | `record-designer` |
| `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, `spatio_temporal_graph` | `graph-designer` |

Each designer loads **only** the case file matching the selected type (for example `cases/earnings-summary.yaml` for `model`, `cases/corporate-ownership.yaml` for `graph`). Reference markdown under `references/` is consulted on demand, not loaded wholesale.

</Step>
<Step title="Optimize (recommended)">

Run `template-optimizer` before validation. It parses YAML, detects issues against naming, multilingual, field-count, schema-guideline separation, and hypergraph-grouping rules, then applies **Auto-fix**, **Suggest**, or **Review** changes and prints an optimization report.

</Step>
<Step title="Validate">

Run `yaml-validator` to check syntax, required fields, AutoType values, identifier configuration, and field descriptions. Validation proceeds in order: syntax → structure → identifiers → semantic warnings.

</Step>
<Step title="Add languages (optional)">

Run `multilingual` to convert single-language templates to `language: [zh, en]` with per-language `description`, `guideline`, and field description blocks. Structural keys (`name`, `type`, field `name` values, `identifiers`, `display`) stay untranslated.

</Step>
<Step title="Verify in Hyper-Extract">

Load the finished YAML through the runtime:

```python
from hyperextract.utils.template_engine.parsers.loader import load_template

cfg = load_template("path/to/template.yaml")
```

`load_template()` instantiates `TemplateCfg` and localizes each declared language; invalid configs raise `ValueError` with the failing language. Then run `he parse --template <id> --lang <lang>` or `Template.create()` as documented on the quickstart and custom-template pages.

</Step>
</Steps>

## Type selection

Both `brainstorm` and the design guide share the same decision tree:

```text
Need relationships?
├─ No → model / list / set
└─ Yes → graph / hypergraph
    ├─ Binary (A→B) → graph
    ├─ Multi-entity
    │   ├─ Flat participants → hypergraph (simple)
    │   └─ Role groups → hypergraph (nested)
    ├─ + time → temporal_graph
    ├─ + space → spatial_graph
    └─ + both → spatio_temporal_graph
```

| Intent | AutoType |
|--------|----------|
| Summary card, single record | `model` |
| Ordered items, rankings | `list` |
| Deduplicated entities | `set` |
| Binary relations | `graph` |
| Multi-party events | `hypergraph` |
| Relations with timestamps | `temporal_graph` |
| Relations with locations | `spatial_graph` |
| Time + location on edges | `spatio_temporal_graph` |
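
The decision tree and intent table can be encoded as a small function. `pick_autotype` and its parameter names are hypothetical, and the precedence used here (time and space dimensions are checked before the binary versus multi-entity split) is an assumption, since the tree does not say how those branches combine.

```python
def pick_autotype(
    needs_relations: bool,
    multi_entity: bool = False,
    has_time: bool = False,
    has_space: bool = False,
    record_shape: str = "single",
) -> str:
    """Pick an AutoType; record_shape is 'single', 'ordered', or 'unique'."""
    if not needs_relations:
        # Record types: summary card, ordered items, or deduplicated entities.
        return {"single": "model", "ordered": "list", "unique": "set"}[record_shape]
    if has_time and has_space:
        return "spatio_temporal_graph"
    if has_time:
        return "temporal_graph"
    if has_space:
        return "spatial_graph"
    # No time or space dimension: choose by relation arity.
    return "hypergraph" if multi_entity else "graph"


print(pick_autotype(needs_relations=False, record_shape="unique"))  # set
print(pick_autotype(needs_relations=True, multi_entity=True))       # hypergraph
print(pick_autotype(needs_relations=True, has_time=True))           # temporal_graph
```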

## Core design principle

**Schema defines WHAT; guideline defines HOW TO DO WELL.** Every designer and optimizer skill enforces this split.

| `output` / schema | `guideline` |
|-------------------|-------------|
| Field names, types, descriptions | Extraction strategy |
| Required/optional flags | Quality requirements (naming consistency) |
| Identifier and display config | Creation conditions ("only when text explicitly states") |
| | Common mistakes to avoid |

<Warning>
Repeating field definitions inside `guideline.rules` is an optimizer **Auto-fix** target and a validator **WARNING**. Keep schema descriptions in `output`; keep behavioral guidance in `guideline`.
</Warning>

## Record designer (`model` / `list` / `set`)

`record-designer` turns brainstorm drafts into record-type YAML.

<ParamField body="output.fields" type="array" required>
Field list with `name`, `type` (`str` / `int` / `float` / `list`), `description`, optional `required` and `default`.
</ParamField>

<ParamField body="identifiers.item_id" type="string">
Required for `set` only. Names the deduplication key field (typically `name`).
</ParamField>

<ParamField body="display.label" type="string">
Template string for visualization, e.g. `{company_name}`.
</ParamField>

| Type | Identifiers | Case file |
|------|-------------|-----------|
| `model` | `{}` | `record-designer/cases/earnings-summary.yaml` |
| `list` | `{}` | `record-designer/cases/product-features.yaml` |
| `set` | `item_id: <field>` | `record-designer/cases/entity-registry.yaml` |

Field-count guidance: keep ≤ 5 fields per component; prioritize Essential → Important → Optional.

## Graph designer (graph family)

`graph-designer` produces templates with `output.entities`, `output.relations`, `identifiers`, and `display`.

### Identifier patterns

| Type | `relation_members` | Extra identifiers |
|------|---------------------|-------------------|
| `graph` | `{source: source, target: target}` | `entity_id`, `relation_id` |
| `hypergraph` (simple) | `participants` (string) | — |
| `hypergraph` (nested) | `[group_a, group_b]` (list of list fields) | — |
| `temporal_graph` | graph pattern + `time_field: time` | — |
| `spatial_graph` | graph pattern + `location_field: location` | — |
| `spatio_temporal_graph` | both `time_field` and `location_field` | — |

### Display labels

`display` drives OntoSight rendering. Edge labels must not repeat source/target node text.

| Type | `entity_label` | `relation_label` |
|------|----------------|------------------|
| `graph` | `{name} ({type})` | `{type}` |
| `hypergraph` | `{name}` | `{event_name}` or `{outcome}` |
| `spatio_temporal_graph` | `{name} ({type})` | `{type}@{location}({time})` |

Length targets: `entity_label` 5–20 characters; `relation_label` 10–30 characters.
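
A quick way to sanity-check label templates against these length targets is to render them with sample records. The sketch assumes the placeholders behave like Python `str.format` fields, which is an assumption about rendering rather than a description of OntoSight; `render_label` and `within_target` are hypothetical helpers.

```python
ENTITY_RANGE = (5, 20)     # length target for entity_label
RELATION_RANGE = (10, 30)  # length target for relation_label


def render_label(template: str, record: dict[str, str]) -> str:
    """Fill a display template such as "{name} ({type})" from a record."""
    return template.format(**record)


def within_target(label: str, bounds: tuple[int, int]) -> bool:
    """Check whether the rendered label length falls inside the target range."""
    low, high = bounds
    return low <= len(label) <= high


entity = render_label("{name} ({type})", {"name": "Nikola Tesla", "type": "person"})
relation = render_label(
    "{type}@{location}({time})",
    {"type": "worked_at", "location": "New York", "time": "1884"},
)
print(entity, len(entity), within_target(entity, ENTITY_RANGE))
print(relation, len(relation), within_target(relation, RELATION_RANGE))
```

Here the entity label comes out at 21 characters, one over the target, which shows how a long name combined with a type suffix can push a label past the recommended range.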

### Hypergraph grouping

<Tip>
When relations include a `role` field but `relation_members` is a simple string (`participants`), switch to nested grouping: `relation_members: [attackers, defenders]` with matching `type: list` fields. The optimizer flags this anti-pattern automatically.
</Tip>

Common nested patterns: traditional Chinese medicine formula composition (sovereigns/ministers/assistants/envoys), battles (attackers/defenders), contracts (parties/witnesses).

Case files: `corporate-ownership.yaml` (`graph`), `battle-analysis.yaml` (`hypergraph`), `biography-events.yaml` (`spatio_temporal_graph`).
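
The anti-pattern from the tip above (a `role` field combined with a flat `participants` string) is simple to detect mechanically. The check below is a hypothetical sketch of that one rule, not the optimizer's implementation; it assumes relation fields are dicts with a `name` key.

```python
from typing import Union


def needs_nested_grouping(
    relation_fields: list[dict], relation_members: Union[str, list[str]]
) -> bool:
    """True when a `role` field is paired with a flat string `relation_members`."""
    has_role = any(field.get("name") == "role" for field in relation_fields)
    return has_role and isinstance(relation_members, str)


flat = [{"name": "participants", "type": "list"}, {"name": "role", "type": "str"}]
nested = [{"name": "attackers", "type": "list"}, {"name": "defenders", "type": "list"}]

print(needs_nested_grouping(flat, "participants"))              # True
print(needs_nested_grouping(nested, ["attackers", "defenders"]))  # False
```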

## Template optimizer

The optimizer runs a four-phase pipeline: parse → analyze → apply fixes → report.

| Rule file | Checks |
|-----------|--------|
| `rules-naming.md` | `relation_type` → `type`, `event_date` → `time`, snake_case fields |
| `rules-multilingual.md` | Pure `zh` / pure `en` per language block |
| `rules-field-count.md` | > 5 fields per entity/relation component |
| `rules-consistency.md` | Schema/guideline duplication |
| `rules-hypergraph-grouping.md` | Role field vs nested `relation_members` |

| Level | Behavior | Example |
|-------|----------|---------|
| **Auto-fix** | Applied automatically | Rename `relation_type` to `type` |
| **Suggest** | Proposed; may need review | Seven relation fields → simplify |
| **Review** | Design decision flagged | Open-ended vs predefined relation types |

Recommended position in the pipeline: **after designer, before validator**.
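
As a rough picture of the analyze, fix, and report phases, the sketch below applies the two renames from `rules-naming.md` and raises a field-count suggestion. It is a hypothetical illustration, not the skill's logic (the skill is an instruction file for an agent); `optimize_fields` and the report wording are assumptions.

```python
import copy

RENAMES = {"relation_type": "type", "event_date": "time"}  # Auto-fix targets
MAX_FIELDS = 5  # recommended maximum per component


def optimize_fields(fields: list[dict]) -> tuple[list[dict], list[str]]:
    """Return (fixed_fields, report) without mutating the input list."""
    fixed = copy.deepcopy(fields)
    report: list[str] = []
    for field in fixed:
        old = field["name"]
        if old in RENAMES:
            field["name"] = RENAMES[old]
            report.append(f"Auto-fix: renamed {old} to {RENAMES[old]}")
    if len(fixed) > MAX_FIELDS:
        report.append(
            f"Suggest: {len(fixed)} fields exceed the recommended max of {MAX_FIELDS}"
        )
    return fixed, report


fields = [{"name": "relation_type"}, {"name": "event_date"}, {"name": "source"}]
fixed, report = optimize_fields(fields)
print([f["name"] for f in fixed])  # ['type', 'time', 'source']
print(report)
```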

## YAML validator

Validation severity levels:

| Level | Meaning | Action |
|-------|---------|--------|
| ERROR | Blocking | Template will not load or extract correctly |
| WARNING | Quality risk | Extraction may degrade |
| INFO | Style reference | Follow for consistency |

### Required fields (all types)

- `language`: `zh`, `en`, or `[zh, en]`
- `name`: PascalCase (case YAML examples sometimes use snake_case names; validator enforces PascalCase)
- `type`: one of the eight AutoTypes
- `tags`: lowercase array
- `description`: non-empty
- `output` and `guideline`: present

### Graph-type extras

- `output.entities` and `output.relations`
- `identifiers.entity_id`, `identifiers.relation_id`, `identifiers.relation_members`
- `identifiers.time_field` / `identifiers.location_field` when applicable
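
The required-field lists above translate into a short structural check over a parsed template. This is a hedged sketch: `validate_template` is hypothetical, it assumes the YAML has already been parsed into a dict, and treating a non-PascalCase `name` as an ERROR is an assumption about severity.

```python
import re

AUTOTYPES = {
    "model", "list", "set", "graph", "hypergraph",
    "temporal_graph", "spatial_graph", "spatio_temporal_graph",
}
GRAPH_TYPES = AUTOTYPES - {"model", "list", "set"}
REQUIRED_KEYS = ("language", "name", "type", "tags", "description", "output", "guideline")
PASCAL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def validate_template(cfg: dict) -> list[str]:
    """Return ERROR messages for missing or malformed required fields."""
    errors = [f"ERROR: missing `{key}`" for key in REQUIRED_KEYS if not cfg.get(key)]
    name = cfg.get("name", "")
    if name and not PASCAL_CASE.match(name):
        errors.append(f"ERROR: name `{name}` is not PascalCase")
    autotype = cfg.get("type")
    if autotype and autotype not in AUTOTYPES:
        errors.append(f"ERROR: unknown type `{autotype}`")
    if autotype in GRAPH_TYPES:
        output = cfg.get("output") or {}
        identifiers = cfg.get("identifiers") or {}
        for key in ("entities", "relations"):
            if key not in output:
                errors.append(f"ERROR: missing `output.{key}`")
        for key in ("entity_id", "relation_id", "relation_members"):
            if key not in identifiers:
                errors.append(f"ERROR: missing `identifiers.{key}`")
    return errors


example = {
    "language": "en", "name": "EarningsSummary", "type": "model",
    "tags": ["finance"], "description": "Quarterly earnings card",
    "output": {"fields": [{"name": "company_name", "type": "str"}]},
    "guideline": {"rules": ["Only extract figures stated in the text"]},
}
print(validate_template(example))  # []
```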

### Validation order

1. Syntax (`rules-syntax.md`)
2. Structure by AutoType (`rules-types.md`)
3. Identifiers (`rules-identifiers.md`)
4. Error lookup (`rules-errors.md`)

<RequestExample>

```markdown
## Validation Results

### Syntax Validation
✅ PASSED

### Structure Validation
✅ PASSED

### Semantic Validation
⚠️ 2 warnings
- WARNING: guideline repeats field definition for `company_name`
- WARNING: relations.fields count is 7 (recommended max 5)

### Overall Assessment
✅ Configuration valid
```

</RequestExample>

Runtime validation complements the skill validator: `load_template()` re-validates localization per declared language and raises if any language block is incomplete.

## Multilingual conversion

`multilingual` supports three conversion modes:

1. **Single → bilingual** — `language: zh` becomes `language: [zh, en]` with nested `description.zh` / `description.en`
2. **Expand existing** — add missing `en` (or other) blocks to partially translated templates
3. **Add new language** — extend `[zh, en]` with `ja`, etc.

### Translatable vs fixed fields

| Translatable | Fixed (do not translate) |
|--------------|--------------------------|
| `description`, `output.description` | `name`, `type`, `tags` |
| `fields[].description` | field `name` keys |
| `guideline.target`, `guideline.rules` | `identifiers`, `display` |

Language purity rules (also enforced by optimizer):

| Language | Rule | Forbidden |
|----------|------|-----------|
| `zh` | Pure Chinese terminology | `entity(实体)` inline mixing |
| `en` | Pure English | Chinese characters in `en` blocks |

Use list format for multi-language declaration: `language: [zh, en]`.
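
The purity rules in the table can be approximated with two regular expressions. This is a rough heuristic sketch, not the optimizer's rule set: `purity_issues` is hypothetical, the CJK range covers only the common ideograph block, and the inline-mixing pattern matches only the `word(汉字)` shape shown in the table.

```python
import re

# Common CJK ideographs (an approximation of "Chinese characters").
CJK = re.compile(r"[\u4e00-\u9fff]")
# Latin word followed by a parenthesized Chinese gloss, e.g. entity(实体).
# \uff08 and \uff09 are the full-width parentheses.
INLINE_MIX = re.compile(r"[A-Za-z]+\s*[(\uff08][\u4e00-\u9fff]+[)\uff09]")


def purity_issues(lang: str, text: str) -> list[str]:
    """Return purity problems for one localized string."""
    issues: list[str] = []
    if lang == "en" and CJK.search(text):
        issues.append("Chinese characters in `en` block")
    if lang == "zh" and INLINE_MIX.search(text):
        issues.append("inline English(Chinese) mixing in `zh` block")
    return issues


print(purity_issues("en", "Company name"))   # []
print(purity_issues("en", "Company 名称"))
print(purity_issues("zh", "entity(实体)"))
```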

## Naming conventions

| Element | Convention | Example |
|---------|------------|---------|
| Template `name` | PascalCase | `EarningsSummary` |
| Field names | snake_case | `company_name` |
| Relation type field | `type` | not `relation_type` |
| Time on edges | `time` | not `event_date` |
| Tags | lowercase | `finance, investor` |

## Example agent session

```text
User: I want to extract key information from financial reports

Agent (brainstorm): Clarifies fields → recommends model type

User: company name, revenue, reporting period

Agent (record-designer): Emits earnings-summary-style YAML

Agent (template-optimizer): Auto-fixes naming, flags field count

Agent (yaml-validator): Reports PASSED with 0 errors

User: Add English support

Agent (multilingual): Converts to language: [zh, en]
```

Save the final YAML under `hyperextract/templates/presets/<domain>/` or a project-local path, then use it via `Template.create("domain/template_name")` or `he parse -t domain/template_name --lang en`.

## Failure modes

<AccordionGroup>
<Accordion title="Validator ERROR on missing identifiers">

Graph types require `identifiers.relation_members`. For `graph`, configure `source`/`target` mappings; for hypergraph, use a string (`participants`) or list (`[group_a, group_b]`).

</Accordion>
<Accordion title="load_template ValueError for language">

Every language listed in `language: [zh, en]` must have complete localized blocks. Run `multilingual` or manually fill missing `zh`/`en` subtrees before loading.

</Accordion>
<Accordion title="Optimizer suggests too many fields">

Split the template, move optional metadata into `description` fields, or accept **Suggest**-level warnings if domain needs exceed five fields.

</Accordion>
<Accordion title="Hypergraph role field with simple participants">

Replace flat `participants` + `role` with nested list groups. See `template-optimizer/references/rules-hypergraph-grouping.md`.

</Accordion>
<Accordion title="Agent does not pick up skills">

Confirm the skill directory path for your runtime and that the root `SKILL.md` `name: hyper-extract` is visible. Re-copy `hyperextract-skills/` after repository updates.

</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Create custom templates" href="/create-custom-templates">
Manual YAML authoring workflow, preset base templates, and merge strategies without agent skills.
</Card>
<Card title="Template schema reference" href="/template-schema-reference">
Field-by-field schema for `language`, `output`, `guideline`, `identifiers`, `display`, and `options`.
</Card>
<Card title="Auto-Types" href="/auto-types">
Runtime behavior of the eight AutoTypes that the skills choose between during type selection.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Runtime errors for `he parse`, missing `--lang`, and template resolution after design.
</Card>
</CardGroup>

---

## 14. CLI reference

> Complete `he` command surface: `parse`, `feed`, `build-index`, `search`, `talk`, `show`, `info`, `list template`, `list method`, `config` subcommands, flags, defaults, exit conditions, and input/output contracts.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/14-cli-reference.md
- Generated: 2026-06-18T20:56:50.948Z

### Source Files

- `hyperextract/cli/cli.py`
- `hyperextract/cli/commands/config.py`
- `hyperextract/cli/commands/list.py`
- `hyperextract/cli/README.md`
- `hyperextract/cli/utils.py`
- `pyproject.toml`

---
title: CLI reference
description: Complete `he` command surface — subcommands, flags, defaults, exit conditions, and input/output contracts for Hyper-Extract.
---

The `he` binary is the Hyper-Extract command-line interface. It is registered in `pyproject.toml` as `he = "hyperextract.cli:app"` and requires **Python 3.11+**. All extraction, search, and chat commands that call an LLM or embedder invoke `validate_config()` before proceeding; `info` and `list` commands do not.

## Global behavior

Running `he` with no subcommand prints a branded overview of available commands and exits **0**. Global options apply to every invocation.

<ParamField body="--version" type="flag" default="false">
Print package version (`Hyper-Extract CLI version …`) and exit **0**. Eager option — evaluated before subcommands.
</ParamField>

<ParamField body="HYPER_EXTRACT_LOG_LEVEL" type="env var" default="WARNING">
Controls structlog verbosity (`DEBUG`, `INFO`, `WARNING`, `ERROR`). There is no `--verbose` flag.
</ParamField>

<ParamField body="HYPER_EXTRACT_LOG_FILE" type="env var">
Optional file path for log output in addition to stderr.
</ParamField>

<RequestExample>

```bash
he                    # Overview banner
he --version          # Print version
he --help             # Typer-generated help
```

</RequestExample>

## Command map

| Command | Purpose | Requires config | Requires KA |
|---------|---------|-----------------|-------------|
| `he parse` | Create a new Knowledge Abstract (KA) | Yes | — |
| `he feed` | Append documents to an existing KA | Yes | `metadata.json` |
| `he build-index` | Build or rebuild vector index | Yes | `data.json` |
| `he search` | Semantic search | Yes | `data.json` + `index/` |
| `he talk` | Q&A chat | Yes | `data.json` + `index/` |
| `he show` | Visualize via OntoSight | Yes | `data.json` |
| `he info` | Print KA statistics | No | `data.json` |
| `he list template` | List YAML templates | No | — |
| `he list method` | List extraction methods | No | — |
| `he config …` | Manage LLM/embedder settings | No | — |

## Knowledge Abstract I/O contract

Commands that read or write a KA expect a directory with this layout:

:::files
```
<ka_path>/
├── data.json        # Structured extraction output (required for most commands)
├── metadata.json    # template, lang, timestamps (required for feed)
└── index/           # FAISS vector store (required for search/talk)
```
:::

<ResponseField name="data.json" type="object">
Serialized AutoType payload. `he info` counts `nodes`/`entities` and `edges`/`relations` for dict-shaped data, or list length for array-shaped data.
</ResponseField>

<ResponseField name="metadata.json" type="object">
Fields used by the CLI: `template`, `lang`, `created_at`, `updated_at`. `feed` and reload commands resolve the template from metadata. Preset templates use IDs like `general/biography_graph`; custom templates may reference a local `{template}.yaml` in the KA directory.
</ResponseField>

<ResponseField name="index/" type="directory">
Non-empty directory of vector-index files. Created by `he parse` (unless `--no-index`) or `he build-index`.
</ResponseField>
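
The layout above implies a simple prerequisite check per command. The following is a minimal sketch of that check using only the standard library; `ka_readiness` is a hypothetical helper, not part of the `he` CLI, and it follows the "Requires KA" column of the command map rather than the actual implementation.

```python
from pathlib import Path
import tempfile


def ka_readiness(ka_path):
    """Report which commands have their documented KA artifacts in place."""
    root = Path(ka_path)
    index_dir = root / "index"
    has_data = (root / "data.json").is_file()
    has_metadata = (root / "metadata.json").is_file()
    # The contract calls for a non-empty index/ directory.
    has_index = index_dir.is_dir() and any(index_dir.iterdir())
    return {
        "info": has_data,
        "show": has_data,
        "build-index": has_data,
        "feed": has_metadata,
        "search": has_data and has_index,
        "talk": has_data and has_index,
    }
```

A KA produced with `--no-index` would report `search` and `talk` as not ready until `he build-index` has run.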

---

## `he parse`

Extract knowledge from text into a **new** KA directory.

<ParamField body="input" type="string" required>
File path, directory path, or `-` for stdin (UTF-8). Directories are scanned for `*.txt` and `*.md` only; files are concatenated with `\n\n`.
</ParamField>

<ParamField body="--output / -o" type="string" required>
Output KA directory. Created if missing.
</ParamField>

<ParamField body="--template / -t" type="string">
Template ID (e.g. `general/biography_graph`). Omit for interactive selection.
</ParamField>

<ParamField body="--method / -m" type="string">
Shorthand for method templates. Sets template to `method/{name}` (e.g. `-m light_rag` → `method/light_rag`). Takes precedence over interactive selection when set.
</ParamField>

<ParamField body="--lang / -l" type="string">
`zh` or `en`. **Required** for knowledge (YAML) templates. Ignored for method templates — language is forced to `en`.
</ParamField>

<ParamField body="--force / -f" type="flag" default="false">
Overwrite a non-empty output directory.
</ParamField>

<ParamField body="--no-index" type="flag" default="false">
Skip `build_index()` after extraction. Use `he build-index` later to enable `search`/`talk`.
</ParamField>

<Steps>
<Step title="Validate prerequisites">

`validate_config()` must pass. For knowledge templates, `--lang` must be supplied. For method templates, `--lang` is optional and ignored.

</Step>
<Step title="Resolve template">

If neither `-t` nor `-m` is given, an interactive picker lists all gallery templates. `-m` maps to `method/{method}`.

</Step>
<Step title="Extract and persist">

`Template.create(template, lang)` → `feed_text(text)` → `dump(output)`. Unless `--no-index` is set, the command then runs `build_index()` and calls `dump(output)` a second time so the index is saved.

</Step>
</Steps>
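
The template and language rules in the steps above can be sketched as a small pure function. This is an illustration only: `resolve_template` is a hypothetical name, and letting `-m` win when both `-t` and `-m` are supplied is an assumption, since the reference only states how `-m` relates to the interactive picker.

```python
def resolve_template(template=None, method=None, lang=None):
    """Return (template_id, lang) following the documented parse rules."""
    if method is not None:
        # -m NAME is shorthand for the template ID method/NAME.
        # Assumption: -m wins if -t is also supplied.
        template = f"method/{method}"
    if template is None:
        # The real CLI would open the interactive picker at this point.
        raise LookupError("no template given")
    if template.startswith("method/"):
        # --lang is ignored for method templates; language is forced to en.
        return template, "en"
    if lang not in ("zh", "en"):
        raise ValueError("--lang zh|en is required for knowledge templates")
    return template, lang
```

For example, `resolve_template(method="light_rag", lang="zh")` yields `("method/light_rag", "en")`, mirroring the forced-English behavior of method templates.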

<RequestExample>

```bash
# Knowledge template (language required)
he parse document.md -o my_ka -t general/biography_graph -l en

# Method template (language optional, forced to en)
he parse document.md -o my_ka -m light_rag

# Directory of markdown files
he parse ./docs/ -o my_ka -t general/graph -l zh

# Stdin
cat article.md | he parse - -o my_ka -t general/graph -l en

# Skip indexing
he parse doc.md -o my_ka -t general/graph -l en --no-index
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Extraction and save succeeded |
| **1** | Missing config; no template selected; missing `--lang` for knowledge template; non-empty output without `--force`; template not found; directory has no `.txt`/`.md` files; `FileNotFoundError` on input |
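
The directory-input rule (only `*.txt` and `*.md`, joined with `\n\n`) can be reproduced outside the CLI with a few lines of standard-library code. This is a sketch under stated assumptions: `gather_text` is a hypothetical helper, the scan is non-recursive, and files are sorted by name; the reference does not specify ordering or recursion.

```python
from pathlib import Path
import tempfile

TEXT_SUFFIXES = {".txt", ".md"}


def gather_text(input_path):
    """Read one file, or concatenate the .txt/.md files of a directory."""
    path = Path(input_path)
    if path.is_dir():
        # Assumption: non-recursive scan, sorted by file name.
        files = sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix in TEXT_SUFFIXES
        )
        if not files:
            # Mirrors the documented exit code for a directory
            # that contains no .txt/.md files.
            raise SystemExit(1)
        return "\n\n".join(p.read_text(encoding="utf-8") for p in files)
    # A missing file raises FileNotFoundError, as in the table above.
    return path.read_text(encoding="utf-8")
```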

---

## `he feed`

Append knowledge to an **existing** KA. Does **not** rebuild the search index — run `he build-index` afterward if you need updated search results.

<ParamField body="ka_path" type="string" required>
Existing KA directory with `metadata.json`.
</ParamField>

<ParamField body="input" type="string" required>
File path or `-` for stdin.
</ParamField>

<ParamField body="--template / -t" type="string">
Override template. Default: `metadata.template`, falling back to `general/graph`.
</ParamField>

<ParamField body="--lang / -l" type="string">
Override language. Default: `metadata.lang`, falling back to `zh`.
</ParamField>

<RequestExample>

```bash
he feed my_ka new_section.md
he feed my_ka - < additional.txt
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Knowledge appended and `data.json` updated |
| **1** | Invalid config; KA path invalid; missing `metadata.json`; template resolution failed; input file not found |
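
The fallback chain for `--template` and `--lang` described above reads naturally as "flag, then metadata, then built-in default". A minimal sketch, assuming `metadata.json` is a flat JSON object with `template` and `lang` keys; `resolve_feed_settings` is a hypothetical helper, not the CLI's own function.

```python
import json
from pathlib import Path
import tempfile


def resolve_feed_settings(ka_path, template=None, lang=None):
    """Resolve (template, lang) for a feed run against an existing KA."""
    metadata_file = Path(ka_path) / "metadata.json"
    if not metadata_file.is_file():
        raise FileNotFoundError(f"missing {metadata_file}")
    metadata = json.loads(metadata_file.read_text(encoding="utf-8"))
    # Flag first, then metadata, then the documented fallbacks.
    resolved_template = template or metadata.get("template") or "general/graph"
    resolved_lang = lang or metadata.get("lang") or "zh"
    return resolved_template, resolved_lang
```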

---

## `he build-index`

Build or rebuild the FAISS vector index for semantic search and chat.

<ParamField body="ka_path" type="string" required>
KA directory containing `data.json`.
</ParamField>

<ParamField body="--force / -f" type="flag" default="false">
Clear existing index and rebuild. Without `--force`, an existing non-empty `index/` prints a warning and exits **0** without changes.
</ParamField>

<RequestExample>

```bash
he build-index my_ka
he build-index my_ka --force   # Rebuild after he feed
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Index built, or index already exists (no `--force`) |
| **1** | Invalid config; KA missing `data.json`; load/build/save error |
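
The interaction between an existing `index/` and `--force` can be summarized as a decision function. The sketch below is illustrative only; `plan_build_index` is a hypothetical name and the action strings are made up for readability.

```python
from pathlib import Path
import tempfile


def plan_build_index(ka_path, force=False):
    """Return (action, exit_code) for the documented build-index cases."""
    root = Path(ka_path)
    if not (root / "data.json").is_file():
        return "error: data.json missing", 1
    index_dir = root / "index"
    has_index = index_dir.is_dir() and any(index_dir.iterdir())
    if has_index and not force:
        # Documented no-op: a warning is printed and the exit code stays 0.
        return "skip: index exists", 0
    return ("rebuild" if has_index else "build"), 0
```

Because the no-op case also exits **0**, scripts that run `he feed` followed by `he build-index` should pass `--force` when a fresh index is required.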

---

## `he search`

Run semantic search against an indexed KA.

<ParamField body="ka_path" type="string" required>
KA directory with non-empty `index/`.
</ParamField>

<ParamField body="query" type="string" required>
Natural-language search query.
</ParamField>

<ParamField body="--top-k / -n" type="integer" default="3">
Number of results to return.
</ParamField>

<ResponseField name="stdout" type="text">
Rich-formatted result blocks. Pydantic models are printed as indented JSON via `model_dump()`; other objects print as strings. Empty results print `No results found.`
</ResponseField>

<RequestExample>

```bash
he search my_ka "key findings"
he search my_ka "revenue growth" -n 5
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Search completed (including zero results) |
| **1** | Invalid config; KA or index missing; load/search error |
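
The output rule described above (objects exposing `model_dump()` become indented JSON, everything else is printed as a string) can be sketched with duck typing. This is not the CLI's rendering code: `render_results` and `FakeEntity` are hypothetical, the two-space indent is an assumption, and Rich styling is omitted.

```python
import json


def render_results(results):
    """Format search results as plain text blocks."""
    if not results:
        return "No results found."
    blocks = []
    for item in results:
        if hasattr(item, "model_dump"):
            # Pydantic v2 models expose model_dump(); print them as JSON.
            blocks.append(json.dumps(item.model_dump(), indent=2, ensure_ascii=False))
        else:
            blocks.append(str(item))
    return "\n\n".join(blocks)


class FakeEntity:
    """Stand-in for a Pydantic model, used only for this illustration."""

    def __init__(self, name):
        self.name = name

    def model_dump(self):
        return {"name": self.name}
```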

---

## `he talk`

Chat with an indexed KA using retrieval-augmented generation.

<ParamField body="ka_path" type="string" required>
KA directory with non-empty `index/`.
</ParamField>

<ParamField body="--query / -q" type="string">
Single-turn question. Required unless `--interactive` is set.
</ParamField>

<ParamField body="--top-k / -n" type="integer" default="3">
Number of context items retrieved per query.
</ParamField>

<ParamField body="--interactive / -i" type="flag" default="false">
Enter a REPL loop. Type `exit`, `quit`, or `q` to leave; `Ctrl+C` also exits gracefully.
</ParamField>

<ResponseField name="stdout" type="text">
Single-query mode prints `response.content`. If `response.additional_kwargs["retrieved_items"]` is present, truncated previews of retrieved context are printed below the answer.
</ResponseField>

<RequestExample>

```bash
he talk my_ka -q "What was the main topic?"
he talk my_ka -i
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Query answered or interactive session ended normally |
| **1** | Invalid config; KA or index missing; neither `-q` nor `-i` provided; load/chat error |
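
The argument rule and the REPL exit words can be illustrated without a model behind them. A minimal sketch: `check_talk_args` and `run_session` are hypothetical helpers, `answer` stands in for the retrieval-augmented call, and case-insensitive matching plus skipping blank lines are assumptions. `Ctrl+C` handling is left out.

```python
EXIT_WORDS = {"exit", "quit", "q"}


def check_talk_args(query=None, interactive=False):
    """Return the documented exit code for the -q / -i combination."""
    return 0 if (query or interactive) else 1


def run_session(lines, answer):
    """Feed user lines to answer() until an exit word appears."""
    replies = []
    for line in lines:
        question = line.strip()
        if question.lower() in EXIT_WORDS:  # assumption: case-insensitive
            break
        if question:  # assumption: blank lines are skipped
            replies.append(answer(question))
    return replies
```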

---

## `he show`

Visualize a KA using OntoSight. Opens an interactive graph view in the default environment (browser or embedded viewer depending on OntoSight configuration).

<ParamField body="ka_path" type="string" required>
KA directory containing `data.json`.
</ParamField>

<RequestExample>

```bash
he show my_ka
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Visualization launched successfully |
| **1** | KA missing or no `data.json`; config invalid; load or visualization error |

---

## `he info`

Print KA metadata and statistics. Does not require LLM/embedder configuration.

<ParamField body="ka_path" type="string" required>
KA directory containing `data.json`.
</ParamField>

<ResponseField name="stdout" type="table">
Rich table with: Path, Template, Language, Created, Updated, Nodes, Edges, Index status (`Built` / `Not Built`).
</ResponseField>

<RequestExample>

```bash
he info my_ka
```

</RequestExample>

**Exit conditions**

| Code | Condition |
|------|-----------|
| **0** | Info displayed |
| **1** | KA path invalid or `data.json` missing |
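
The Nodes and Edges figures follow the counting rule given for `data.json` earlier on this page: `nodes`/`entities` and `edges`/`relations` for dict-shaped data, list length for array-shaped data. A sketch of that rule, where `count_items` is a hypothetical helper and reporting array-shaped data under an `items` key is an assumption about presentation:

```python
def count_items(data):
    """Count nodes/edges in dict-shaped data, or items in a list."""
    if isinstance(data, dict):
        nodes = data.get("nodes", data.get("entities", []))
        edges = data.get("edges", data.get("relations", []))
        return {"nodes": len(nodes), "edges": len(edges)}
    if isinstance(data, list):
        return {"items": len(data)}
    return {}
```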

---

## `he list template`

List available YAML knowledge templates and, by default, method templates.

<ParamField body="--query / -q" type="string">
Keyword filter on template ID or description.
</ParamField>

<ParamField body="--autotype / -a" type="string">
Filter by AutoType (`graph`, `hypergraph`, `list`, `model`, `set`, etc.).
</ParamField>

<ParamField body="--lang / -l" type="string" default="en">
Language for descriptions: `en`, `zh`, or `all` (lists every supported language variant).
</ParamField>

<ParamField body="--include-methods / --no-methods" type="flag" default="true">
Include `method/*` entries. Method templates are excluded when `--lang zh` is set.
</ParamField>

<RequestExample>

```bash
he list template
he list template -l zh
he list template -a graph -q finance
he list template --no-methods
```

</RequestExample>

**Exit conditions:** always **0** (prints `No templates found.` when the filter matches nothing).

---

## `he list method`

List registered extraction methods (`graph_rag`, `light_rag`, `hyper_rag`, etc.).

<ParamField body="--query / -q" type="string">
Keyword filter on method name or description.
</ParamField>

<RequestExample>

```bash
he list method
he list method -q rag
```

</RequestExample>

**Exit conditions:** always **0**.

---

## `he config`

Manage LLM and embedder settings stored in `~/.he/config.toml`. Running `he config` with no subcommand prints a configuration overview and exits **0**.

<Tabs>
<Tab title="init">

Interactive or one-shot setup for both LLM and embedder.

<ParamField body="--provider / -p" type="string">
Preset: `openai`, `bailian`, `vllm`.
</ParamField>

<ParamField body="--api-key / -k" type="string">
API key. With `--provider`, configures both services. Without `--provider`, defaults to OpenAI (`gpt-4o-mini` + `text-embedding-3-small`).
</ParamField>

<ParamField body="--base-url / -u" type="string">
Custom API base URL. Required for `vllm`; optional override for other presets.
</ParamField>

```bash
he config init                                    # Interactive wizard
he config init -k sk-...                          # OpenAI quick setup
he config init -p bailian -k sk-...              # Provider preset
he config init -p vllm -k dummy -u http://localhost:8000/v1
```

</Tab>
<Tab title="show">

```bash
he config show
```

Prints a table of LLM and embedder provider, model, masked API key, and base URL.

</Tab>
<Tab title="llm">

<ParamField body="--provider / -p" type="string">
`openai`, `bailian`, `vllm`
</ParamField>

<ParamField body="--api-key / -k" type="string">
LLM API key
</ParamField>

<ParamField body="--model / -m" type="string">
Model name (default in file: `gpt-4o-mini`)
</ParamField>

<ParamField body="--base-url / -u" type="string">
Custom base URL
</ParamField>

<ParamField body="--show" type="flag">
Display current LLM config only
</ParamField>

<ParamField body="--unset" type="flag">
Reset LLM section to defaults
</ParamField>

```bash
he config llm -k sk-... -m gpt-4o
he config llm --show
he config llm --unset
```

</Tab>
<Tab title="embedder">

Same flags as `llm`. Default model in file: `text-embedding-3-small`.

```bash
he config embedder -k sk-... -m text-embedding-3-small
he config embedder --show
he config embedder --unset
```

</Tab>
</Tabs>

### Configuration validation

Commands that call `validate_config()` check resolved settings (config file merged with environment):

| Provider | LLM requirement | Embedder requirement |
|----------|-----------------|----------------------|
| `vllm` | `base_url` required; API key may be `dummy` | `base_url` required; API key may be `dummy` |
| Other | API key required (or `OPENAI_API_KEY` env) | API key required (or `OPENAI_API_KEY` env) |

<ParamField body="OPENAI_API_KEY" type="env var">
Fallback API key when not set in `config.toml`.
</ParamField>

<ParamField body="OPENAI_BASE_URL" type="env var">
Fallback base URL when not set in `config.toml`.
</ParamField>
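
The validation table and the two environment fallbacks combine into one rule per service. The sketch below is not `validate_config()` itself: `validate_service` is hypothetical, and the section keys `provider`, `api_key`, and `base_url` are assumed names for the resolved settings.

```python
def validate_service(section, env):
    """Return a list of problems for one resolved [llm] or [embedder] section."""
    provider = section.get("provider", "openai")
    # Environment variables only fill values missing from the config file.
    api_key = section.get("api_key") or env.get("OPENAI_API_KEY", "")
    base_url = section.get("base_url") or env.get("OPENAI_BASE_URL", "")
    problems = []
    if provider == "vllm":
        # vLLM needs an endpoint; the API key may be a placeholder.
        if not base_url:
            problems.append("base_url is required for vllm")
    elif not api_key:
        problems.append("api_key is required (or set OPENAI_API_KEY)")
    return problems
```

An empty list means the service would pass; any entry corresponds to an exit **1** with a remediation message in the real CLI.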

Provider presets and default models:

| Preset | Default LLM | Default embedder | Base URL |
|--------|-------------|------------------|----------|
| `openai` | `gpt-4o-mini` | `text-embedding-3-small` | `https://api.openai.com/v1` |
| `bailian` | `qwen3.6-plus` | `text-embedding-v4` | `https://dashscope.aliyuncs.com/compatible-mode/v1` |
| `vllm` | user-specified | user-specified | user-specified |

**Exit conditions:** subcommands exit **0** on success. `validate_config()` failures in other commands exit **1** with a remediation message.

---

## Typical workflows

<Steps>
<Step title="First run">

```bash
he config init
he list template -l en
```

</Step>
<Step title="Create and query">

```bash
he parse examples/en/tesla.md -o tesla_ka -t general/biography_graph -l en
he show tesla_ka
he search tesla_ka "alternating current"
he talk tesla_ka -q "What did Tesla invent?"
```

</Step>
<Step title="Evolve knowledge">

```bash
he feed tesla_ka new_chapter.md
he build-index tesla_ka --force
he info tesla_ka
```

</Step>
</Steps>

## Exit code summary

| Code | Meaning |
|------|---------|
| **0** | Success, informational exit, or no-op warning (`build-index` when index exists) |
| **1** | Validation failure, missing input/output, template resolution error, or runtime exception |
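| **≠ 0** | Usage errors rejected by the argument parser before any command runs, such as unrecognized global flags (e.g. the removed `--verbose`); Click-based CLIs such as Typer apps typically report these as **2** |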

<AccordionGroup>
<Accordion title="Interactive template picker (`he parse` without `-t`/`-m`)">

Prompts for a number (1-based index) or search keyword. Keyword search matches template ID or description; a single match is auto-selected, multiple matches are listed for re-entry. Default selection is `1`. Exits **1** if the gallery is empty or the user aborts without a selection.

</Accordion>
<Accordion title="Stdin and encoding">

`read_input("-")` reads all of `sys.stdin`. File inputs are opened as UTF-8. Missing files raise `FileNotFoundError` (surfaced as a Python traceback unless caught upstream).

</Accordion>
<Accordion title="Custom templates in KA directories">

When reloading a KA, template resolution checks gallery presets first, then looks for `{template}.yaml` inside the KA directory. This supports custom templates copied during `he parse` from a local `.yaml` path.

</Accordion>
</AccordionGroup>
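
The interactive picker's selection rule (number, or keyword with auto-select on a single match) can be sketched as a pure function. This is an illustration, not the CLI code: `pick_template` is hypothetical, the gallery is passed in as an ID-to-description mapping, and case-insensitive substring matching is an assumption.

```python
def pick_template(templates, answer="1"):
    """Return (selected_id, candidates) for one round of user input.

    templates maps template ID to description. selected_id is None when
    the user has to be prompted again with the returned candidates.
    """
    ids = list(templates)
    answer = answer.strip() or "1"  # default selection is 1
    if answer.isdigit():
        index = int(answer)  # 1-based index into the listing
        if 1 <= index <= len(ids):
            return ids[index - 1], []
        return None, ids
    needle = answer.lower()
    matches = [
        tid for tid, desc in templates.items()
        if needle in tid.lower() or needle in desc.lower()
    ]
    if len(matches) == 1:
        return matches[0], []  # a single match is auto-selected
    return None, matches
```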

## Related pages

<CardGroup cols={2}>
<Card title="Configure providers" href="/configure-providers">
Set up LLM and embedder clients before running extraction commands.
</Card>
<Card title="Extract and evolve" href="/extract-and-evolve">
Workflow guide for `parse`, `feed`, and `build-index`.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
Query and explore Knowledge Abstracts with `search`, `talk`, `show`, and `info`.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
Full `~/.he/config.toml` schema and environment variable precedence.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Common CLI failure modes and debug logging.
</Card>
<Card title="Python API reference" href="/python-api-reference">
SDK equivalents for every CLI lifecycle operation.
</Card>
</CardGroup>

---

## 15. Python API reference

> Exported SDK: `Template.create/get/list`, `BaseAutoType` lifecycle (`parse`, `feed_text`, `search`, `chat`, `dump`, `load`, `build_index`, `show`), `create_client` / `create_llm` / `create_embedder` / `get_client`, and logging helpers.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/15-python-api-reference.md
- Generated: 2026-06-18T20:58:05.281Z

### Source Files

- `hyperextract/__init__.py`
- `hyperextract/utils/template_engine/template.py`
- `hyperextract/types/base.py`
- `hyperextract/utils/client.py`
- `hyperextract/utils/logging.py`
- `hyperextract/utils/template_engine/gallery.py`

---
title: "Python API reference"
description: "Exported SDK: `Template.create/get/list`, `BaseAutoType` lifecycle (`parse`, `feed_text`, `search`, `chat`, `dump`, `load`, `build_index`, `show`), `create_client` / `create_llm` / `create_embedder` / `get_client`, and logging helpers."
---

The `hyperextract` package exports a single SDK surface: eight `AutoType` primitives, the `Template` factory, LangChain client helpers, and structlog-based logging. Every extraction path returns a `BaseAutoType` subclass that owns LLM extraction, optional FAISS indexing, serialization to a Knowledge Abstract directory, and OntoSight visualization on concrete types.

```python
from hyperextract import (
    Template,
    BaseAutoType,
    AutoModel, AutoList, AutoSet,
    AutoGraph, AutoHypergraph,
    AutoTemporalGraph, AutoSpatialGraph, AutoSpatioTemporalGraph,
    create_client, create_llm, create_embedder, get_client,
    configure_logging, get_logger, set_log_level,
)
```

| Symbol | Role |
| --- | --- |
| `Template` | Resolve YAML presets or method templates into configured `AutoType` instances |
| `BaseAutoType` + eight `Auto*` classes | Knowledge Abstract runtime: extract, index, query, persist |
| `create_client` / `create_llm` / `create_embedder` / `get_client` | Build LangChain LLM and embedder clients from shorthand or `~/.he/config.toml` |
| `configure_logging` / `get_logger` / `set_log_level` | structlog setup; respects `HYPER_EXTRACT_LOG_LEVEL` and `HYPER_EXTRACT_LOG_FILE` |

## Template API

`Template` is a static façade over the gallery, method registry, and `TemplateFactory`. All three methods are stateless.

### `Template.create`

Returns a configured `BaseAutoType` instance.

<ParamField body="source" type="string" required>
Preset path (`general/biography_graph`), method path (`method/light_rag`), or absolute path to a `.yaml` file.
</ParamField>

<ParamField body="language" type="string">
Required for knowledge templates (`general/*`, `finance/*`, custom YAML). Ignored for `method/*` templates (always English).
</ParamField>

<ParamField body="llm_client" type="BaseChatModel">
LangChain chat model. When omitted, loaded via `get_client()` from `~/.he/config.toml`.
</ParamField>

<ParamField body="embedder" type="Embeddings">
LangChain embeddings model. When omitted, loaded via `get_client()`.
</ParamField>

<ParamField body="**kwargs" type="dict">
Overrides template `options` or method constructor args (e.g. `observation_time="2024-06-15"` for temporal methods).
</ParamField>

<ResponseField name="return" type="BaseAutoType">
Configured `AutoType` with `metadata` keys `template`, `lang`, and `type` populated by the factory.
</ResponseField>

<CodeGroup>

```python Knowledge template
from hyperextract import Template

ka = Template.create(
    "general/biography_graph",
    language="en",
)
with open("examples/en/tesla.md") as f:
    result = ka.parse(f.read())
result.show()
```

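```python Method template
from hyperextract import Template, create_client

llm, emb = create_client("bailian", api_key="sk-...")
rag = Template.create(
    "method/light_rag",
    llm_client=llm,
    embedder=emb,
)

# Read the document to extract from (same sample file as the first tab)
with open("examples/en/tesla.md", encoding="utf-8") as f:
    text = f.read()
rag.feed_text(text)
```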

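```python Custom YAML
from hyperextract import Template, create_client

# Reuse one (llm, embedder) pair, as in the method-template tab
llm, emb = create_client("bailian", api_key="sk-...")
ka = Template.create(
    "/path/to/my_template.yaml",
    language="zh",
    llm_client=llm,
    embedder=emb,
)
```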

</CodeGroup>

<Warning>
`Template.create` raises `ValueError` when `language` is omitted for knowledge templates. Method paths (`method/<name>`) do not require `language`.
</Warning>

### `Template.get`

```python
config = Template.get("general/biography_graph")  # TemplateCfg | None
method_cfg = Template.get("method/light_rag")       # method TemplateCfg
```

Returns a `TemplateCfg` Pydantic model (`language`, `name`, `type`, `tags`, `description`, `output`, `guideline`, `identifiers`, `options`, `display`) without instantiating an `AutoType`. Method paths delegate to `get_method_cfg`.

### `Template.list`

```python
all_templates = Template.list()
graphs = Template.list(filter_by_type="graph", filter_by_language="en")
biography = Template.list(filter_by_query="biography")
```

| Parameter | Default | Effect |
| --- | --- | --- |
| `filter_by_query` | `None` | Match against `name` or `description` |
| `filter_by_type` | `None` | Filter by autotype (`graph`, `list`, `model`, …) |
| `filter_by_tag` | `None` | Filter by YAML `tags` entry |
| `filter_by_language` | `None` | Filter by supported language code |
| `include_methods` | `True` | Merge registered extraction methods into the result |

Returns `Dict[str, TemplateCfg]` keyed by preset path (e.g. `general/biography_graph`) or method name.

## BaseAutoType lifecycle

`BaseAutoType[T]` is the abstract Knowledge Abstract runtime. Template-created instances and directly constructed `AutoGraph` / `AutoModel` / … objects share the same method surface.

### Construction

```python
AutoGraph(
    node_schema=Entity,
    edge_schema=Relation,
    llm_client=llm,
    embedder=emb,
    node_key_extractor=lambda n: n.name,
    edge_key_extractor=lambda e: (e.source, e.target, e.type),
    nodes_in_edge_extractor=lambda e: (e.source, e.target),
    chunk_size=2048,      # default
    chunk_overlap=256,    # default
    max_workers=10,       # parallel chunk extraction
    verbose=False,
)
```

Extraction uses `llm_client.with_structured_output(data_schema)` and `RecursiveCharacterTextSplitter` for long inputs.

### Extraction methods

| Method | Mutates instance | Returns | Use when |
| --- | --- | --- | --- |
| `parse(text)` | No | New `BaseAutoType` | Preview extraction or branch a KA without touching the original |
| `feed_text(text)` | Yes | `self` (chainable) | Incrementally grow the current KA (`feed_text(t1).feed_text(t2)`) |

Both call `_extract_data`: single-chunk LLM invoke or batched parallel extraction, then `merge_batch_data`. `feed_text` routes through `_update_data_state` (incremental merge); `parse` routes through `_set_data_state` on a fresh instance.
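
The difference between the two methods is easiest to see on a toy object. The class below is a stand-in written for this page, not `BaseAutoType`: its `_extract` simply collects capitalized words instead of calling an LLM, and the `indexed` flag only mimics the documented behavior that incremental updates invalidate the index.

```python
class ToyKA:
    """Toy stand-in that mimics the parse / feed_text contract."""

    def __init__(self):
        self.items = set()
        self.indexed = False

    def _extract(self, text):
        # Placeholder for LLM extraction: keep capitalized words.
        return {w.strip(".,") for w in text.split() if w[:1].isupper()}

    def parse(self, text):
        # Non-mutating: the result lives on a fresh instance.
        fresh = ToyKA()
        fresh.items = self._extract(text)
        return fresh

    def feed_text(self, text):
        # Mutating and chainable: merge into self, then return self.
        self.items |= self._extract(text)
        self.indexed = False  # the index no longer matches the data
        return self
```

With this toy, `ka.parse(...)` leaves `ka.items` untouched, while `ka.feed_text(a).feed_text(b)` accumulates both inputs on the same object.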

### Indexing, search, and chat

| Method | Notes |
| --- | --- |
| `build_index()` | Abstract; builds FAISS vector store from current data. `AutoGraph.build_index(index_nodes=True, index_edges=True)` indexes nodes and edges separately. |
| `search(query, top_k=3)` | Abstract; signature varies by type. `AutoGraph.search` returns `(nodes, edges)` and requires a prior `build_index()`. |
| `chat(query, top_k=3)` | Concrete on `BaseAutoType`. Runs `search`, formats context, invokes `llm_client`, returns `langchain_core.messages.AIMessage`. Retrieved items are attached at `response.additional_kwargs["retrieved_items"]`. |

<Note>
Call `build_index()` before `search` or `chat` when semantic retrieval is required. Index state is cleared when data is replaced (`parse` on a loaded instance) or incrementally updated (`feed_text`).
</Note>
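
The `chat` flow in the table (search, format context, invoke the model, attach retrieved items) can be sketched with plain callables. Everything here is illustrative: `Reply` is a stand-in for `AIMessage`, `search` and `llm` are injected stubs, and the prompt layout is an assumption rather than the library's template.

```python
from dataclasses import dataclass, field


@dataclass
class Reply:
    """Minimal stand-in for langchain_core.messages.AIMessage."""
    content: str
    additional_kwargs: dict = field(default_factory=dict)


def chat(query, search, llm, top_k=3):
    """Retrieve context, ask the model, and attach what was retrieved."""
    items = search(query, top_k)
    context = "\n".join(f"- {item}" for item in items)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"  # assumed layout
    reply = Reply(content=llm(prompt))
    reply.additional_kwargs["retrieved_items"] = items
    return reply
```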

### Persistence

`dump(folder_path)` writes a Knowledge Abstract directory:

:::files
output/
├── data.json       # Pydantic `data.model_dump()`
├── metadata.json   # timestamps, template, lang, type
└── index/          # FAISS files (non-fatal if save fails)
:::

`load(folder_path)` restores data (required), metadata (optional), and index (optional). Index load failures print a warning; call `build_index()` to rebuild.

Granular helpers: `dump_data`, `load_data`, `dump_metadata`, `load_metadata`, `dump_index`, `load_index`.
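
The required and optional parts of the directory can be illustrated with a standard-library round trip. This sketch does not call the library: `dump_ka` and `load_ka` are hypothetical helpers, the metadata keys follow the fields listed in the CLI reference, and the `index/` directory is left out entirely.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
import tempfile


def dump_ka(folder, data, template, lang):
    """Write data.json and metadata.json into folder."""
    root = Path(folder)
    root.mkdir(parents=True, exist_ok=True)
    now = datetime.now(timezone.utc).isoformat()
    metadata = {"template": template, "lang": lang,
                "created_at": now, "updated_at": now}
    (root / "data.json").write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    (root / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8")


def load_ka(folder):
    """Return (data, metadata); data is required, metadata is optional."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(folder)
    data = json.loads((root / "data.json").read_text(encoding="utf-8"))
    meta_file = root / "metadata.json"
    metadata = {}
    if meta_file.is_file():
        metadata = json.loads(meta_file.read_text(encoding="utf-8"))
    return data, metadata
```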

### State helpers

| Method / property | Behavior |
| --- | --- |
| `data` | Read-only Pydantic view of stored knowledge |
| `data_schema` | Schema class (`Type[T]`) |
| `empty()` | Returns `True` when no data is stored |
| `clear()` | Reset data and index |
| `clear_index()` | Reset index only |
| `__add__(other)` | Merge two instances of the same class and schema into a new instance |

### `show()`

`show()` is **not** defined on `BaseAutoType`. Concrete types implement OntoSight visualization:

- `AutoModel.show(label_extractor=..., top_k=3)`
- `AutoList.show(...)`, `AutoSet.show(...)`
- `AutoGraph.show(node_label_extractor=..., edge_label_extractor=..., top_k_nodes_for_search=3, ...)`

When a vector index exists, `show()` wires interactive search and chat callbacks into the OntoSight viewer.

```mermaid
stateDiagram-v2
    [*] --> Empty: Template.create / direct construct
    Empty --> Populated: parse() or feed_text()
    Populated --> Indexed: build_index()
    Indexed --> Queried: search() / chat()
    Populated --> Persisted: dump()
    Persisted --> Loaded: load()
    Loaded --> Indexed: build_index()
    Queried --> Evolved: feed_text()
    Evolved --> Indexed: build_index()
```

## Client factory

All client helpers return LangChain-compatible objects and accept the `provider:model@url` shorthand.

| Function | Returns | Config source |
| --- | --- | --- |
| `create_llm(spec, api_key="", **kwargs)` | `ChatOpenAI` | String shorthand or dict |
| `create_embedder(spec, api_key="", **kwargs)` | `OpenAIEmbeddings` or `CompatibleEmbeddings` | Uses `CompatibleEmbeddings` when `base_url` is not the official OpenAI endpoint |
| `create_client(llm=, embedder=, provider=, api_key=, **kwargs)` | `(llm, embedder)` tuple | Three patterns below |
| `get_client(config_path=None)` | `(llm, embedder)` tuple | Reads `~/.he/config.toml` via `ConfigManager` |

**Pattern A — single provider (cloud defaults):**

```python
llm, emb = create_client("bailian", api_key="sk-...")
# preset: qwen3.6-plus + text-embedding-v4
```

**Pattern B — separate vLLM services:**

```python
llm, emb = create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)
```

**Pattern C — mixed deployment:**

```python
llm, emb = create_client(
    llm="bailian:qwen-plus",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="sk-...",
)
```

Shorthand parsing (`_parse_client_spec`):

| Input | Resolves to |
| --- | --- |
| `"bailian"` | Provider preset defaults |
| `"bailian:qwen-plus"` | Provider + model, preset URL |
| `"vllm:Qwen3.5-9B@http://localhost:8000/v1"` | Provider + model + base URL |

Built-in presets: `openai`, `bailian`, `vllm` (no default URL or models).
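
The three shorthand shapes in the table can be split with two `str.partition` calls. This is a sketch of the idea, not the library's `_parse_client_spec`: `parse_spec` is a hypothetical name, and `None` is used here to mean "fall back to the provider preset".

```python
def parse_spec(spec):
    """Split provider:model@url shorthand into its three parts."""
    # Split on '@' first, because the URL itself contains ':' characters.
    head, _, base_url = spec.partition("@")
    provider, _, model = head.partition(":")
    return {
        "provider": provider,
        "model": model or None,        # None: use the preset default
        "base_url": base_url or None,  # None: use the preset URL
    }
```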

<Tip>
`CompatibleEmbeddings` sends string inputs (not pre-tokenized integer lists) to OpenAI-compatible endpoints, with tiktoken chunking and conservative `max_batch_size=10` for providers like Bailian/DashScope.
</Tip>

## Logging helpers

```python
from hyperextract import configure_logging, get_logger, set_log_level

configure_logging(level="WARNING", json_output=False, output_file=None)
logger = get_logger(__name__)
set_log_level("DEBUG")
```

| Function | Behavior |
| --- | --- |
| `configure_logging` | Configures structlog processors; console handler on stderr; optional file handler |
| `get_logger(name)` | Returns a `structlog.stdlib.BoundLogger` |
| `set_log_level(level)` | Updates root logger level at runtime |

Environment variables override defaults:

| Variable | Effect |
| --- | --- |
| `HYPER_EXTRACT_LOG_LEVEL` | Overrides `configure_logging(level=...)` |
| `HYPER_EXTRACT_LOG_FILE` | Adds a file handler when `output_file` is not set |
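
The precedence in the table can be written out as a small resolver. This is an illustration of the documented rules only: `resolve_logging` is a hypothetical helper, and upper-casing the level is an assumption.

```python
import os


def resolve_logging(level="WARNING", output_file=None, env=None):
    """Return the effective (level, log_file) pair."""
    env = os.environ if env is None else env
    # The environment variable overrides the level passed in code.
    resolved_level = env.get("HYPER_EXTRACT_LOG_LEVEL", level).upper()
    # The environment file is used only when output_file is not set.
    resolved_file = output_file or env.get("HYPER_EXTRACT_LOG_FILE")
    return resolved_level, resolved_file
```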

`BaseAutoType` and client code emit structured debug events (`stage=extract_start`, `stage=feed_text_start`, etc.) through `get_logger`.

## End-to-end workflow

<Steps>

<Step title="Configure clients">

```python
from hyperextract import create_client, configure_logging

configure_logging(level="INFO")
llm, emb = create_client("openai")  # or get_client() after he config init
```

</Step>

<Step title="Create and extract">

```python
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en", llm_client=llm, embedder=emb)

with open("examples/en/tesla.md") as f:
    ka.feed_text(f.read())
```

</Step>

<Step title="Index, query, persist">

```python
ka.build_index()

results = ka.search("Tesla alternating current", top_k=3)
answer = ka.chat("What did Tesla invent?")
print(answer.content)

ka.dump("./output/tesla_ka")
```

</Step>

<Step title="Reload and visualize">

```python
loaded = Template.create("general/biography_graph", language="en", llm_client=llm, embedder=emb)
loaded.load("./output/tesla_ka")
loaded.show()
```

</Step>

</Steps>

For extraction algorithms without YAML presets, instantiate method classes directly (`from hyperextract.methods.rag import Light_RAG`) or use `Template.create("method/<name>")`. See [Use extraction methods](/use-extraction-methods).

## Error cases

| Condition | Exception / signal |
| --- | --- |
| Knowledge template without `language` | `ValueError: language is required for knowledge templates` |
| Unknown preset or method path | `ValueError: Template not found: <source>` |
| `create_client()` with no args | `ValueError` with usage examples |
| `load()` on missing directory | `FileNotFoundError` |
| `search()` without index (graph types) | `ValueError: Node index not built. Call build_index() first.` |
| Merging incompatible instances via `+` | `TypeError` (class or schema mismatch) |
| Index dump/load failure | Warning printed; data still persisted |

## Related pages

<CardGroup>

<Card title="Quickstart" href="/quickstart">
First successful extraction with `Template.create` and `feed_text` using the Tesla biography example.
</Card>

<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk `data.json` / `metadata.json` / `index/` layout and lifecycle semantics.
</Card>

<Card title="Configure providers" href="/configure-providers">
Set up `~/.he/config.toml` or programmatic `create_client()` for mixed deployments.
</Card>

<Card title="Auto-Types" href="/auto-types">
Choose and construct the eight `Auto*` primitives directly.
</Card>

<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
End-to-end CLI and Python workflow with expected artifacts.
</Card>

</CardGroup>

---

## 16. Configuration reference

> `~/.he/config.toml` schema for `[llm]` and `[embedder]`, provider presets and default models, environment variable precedence (`OPENAI_API_KEY`, `OPENAI_BASE_URL`, `HYPER_EXTRACT_LOG_LEVEL`, `HYPER_EXTRACT_LOG_FILE`), and validation rules.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/16-configuration-reference.md
- Generated: 2026-06-18T20:57:21.178Z

### Source Files

- `hyperextract/cli/config.py`
- `hyperextract/cli/commands/config.py`
- `hyperextract/utils/client.py`
- `hyperextract/utils/logging.py`
- `.env.example`
- `tests/cli/test_verbose.py`

---
title: "Configuration reference"
description: "`~/.he/config.toml` schema for `[llm]` and `[embedder]`, provider presets and default models, environment variable precedence (`OPENAI_API_KEY`, `OPENAI_BASE_URL`, `HYPER_EXTRACT_LOG_LEVEL`, `HYPER_EXTRACT_LOG_FILE`), and validation rules."
---

Hyper-Extract stores LLM and embedder credentials in `~/.he/config.toml`. The CLI loads this file through `ConfigManager`, merges environment-variable fallbacks at read time, and validates credentials before any command that calls remote models (`parse`, `feed`, `build-index`, `search`, `talk`, `show`). The Python SDK reads the same file via `get_client()`, or bypasses it with `create_client()` shorthand specs.

## File location

| Platform | Path |
|----------|------|
| Linux / macOS | `~/.he/config.toml` |
| Windows | `%USERPROFILE%\.he\config.toml` |

The directory is created on first save (`he config init`, `he config llm`, or `he config embedder`). A missing file is not an error: `ConfigManager` starts from built-in dataclass defaults and applies environment fallbacks when resolving clients.

:::files
~/.he/
└── config.toml    # [llm] and [embedder] sections
:::

## Configuration schema

The TOML file has two top-level tables. Both share the same field shape.

### `[llm]` section

<ParamField body="provider" type="string">
Provider preset identifier. Recognized presets: `openai`, `bailian`, `vllm`. Interactive `he config init` can also store `custom` for OpenAI-compatible endpoints without a built-in preset. Empty string is valid but skips preset URL resolution.
</ParamField>

<ParamField body="model" type="string" default="gpt-4o-mini">
LLM model name passed to `ChatOpenAI`. When omitted from the file, defaults to `gpt-4o-mini`.
</ParamField>

<ParamField body="api_key" type="string">
API key for the LLM endpoint. At resolution time, an empty value falls back to `OPENAI_API_KEY`. For `vllm`, a dummy value such as `dummy` is accepted.
</ParamField>

<ParamField body="base_url" type="string">
OpenAI-compatible API base URL. Resolution order: non-empty file value → `OPENAI_BASE_URL` environment variable → provider preset URL. Required when `provider = "vllm"`.
</ParamField>

### `[embedder]` section

<ParamField body="provider" type="string">
Same preset identifiers as `[llm]`. LLM and embedder can use different providers (mixed cloud + local deployments).
</ParamField>

<ParamField body="model" type="string" default="text-embedding-3-small">
Embedding model name. Defaults to `text-embedding-3-small` when absent from the file.
</ParamField>

<ParamField body="api_key" type="string">
Embedder API key. Empty value falls back to `OPENAI_API_KEY` — not to `[llm].api_key`. For `vllm`, `dummy` is accepted.
</ParamField>

<ParamField body="base_url" type="string">
Embedder endpoint URL. Same resolution order as `[llm].base_url`. Non-official URLs route through `CompatibleEmbeddings` instead of `OpenAIEmbeddings`.
</ParamField>

### Example files

<Tabs>
<Tab title="OpenAI">

```toml
[llm]
provider = "openai"
model = "gpt-4o-mini"
api_key = "sk-your-api-key"
base_url = "https://api.openai.com/v1"

[embedder]
provider = "openai"
model = "text-embedding-3-small"
api_key = ""
base_url = ""
```

Empty `api_key` / `base_url` in `[embedder]` resolve via `OPENAI_API_KEY` and the `openai` preset URL at runtime.

</Tab>
<Tab title="Bailian">

```toml
[llm]
provider = "bailian"
model = "qwen3.6-plus"
api_key = "sk-your-api-key"
base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"

[embedder]
provider = "bailian"
model = "text-embedding-v4"
api_key = "sk-your-api-key"
base_url = ""
```

</Tab>
<Tab title="Local vLLM">

```toml
[llm]
provider = "vllm"
model = "Qwen/Qwen3.5-9B"
api_key = "dummy"
base_url = "http://localhost:8000/v1"

[embedder]
provider = "vllm"
model = "BAAI/bge-m3"
api_key = "dummy"
base_url = "http://localhost:8001/v1"
```

</Tab>
<Tab title="Mixed deployment">

```toml
[llm]
provider = "bailian"
model = "qwen3.6-plus"
api_key = "sk-your-api-key"
base_url = ""

[embedder]
provider = "vllm"
model = "BAAI/bge-m3"
api_key = "dummy"
base_url = "http://localhost:8001/v1"
```

</Tab>
</Tabs>

## Provider presets

`PROVIDER_PRESETS` in `hyperextract/cli/config.py` and `hyperextract/utils/client.py` define built-in defaults. `he config init -p <preset>` and `create_client("<preset>")` both read from this table.

| Preset | `base_url` | Default LLM | Default embedder |
|--------|-----------|-------------|------------------|
| `openai` | `https://api.openai.com/v1` | `gpt-4o-mini` | `text-embedding-3-small` |
| `bailian` | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen3.6-plus` | `text-embedding-v4` |
| `vllm` | `None` (must be set explicitly) | `None` | `None` |

<Note>
There is no `deepseek` preset. DeepSeek and other OpenAI-compatible endpoints are reached via `bailian`, `custom`, or a manual `base_url`.
</Note>

### `he config init` preset behavior

| Mode | Trigger | Result |
|------|---------|--------|
| Quick preset | `-p <preset> -k <key>` | Sets both `[llm]` and `[embedder]` to the preset; fills default models and URL from the table |
| Legacy OpenAI | `-k <key>` without `-p` | Forces `openai` preset with `gpt-4o-mini` + `text-embedding-3-small` |
| Interactive | No flags | Prompts for provider; `vllm` requires manual model and URL for each service |

In interactive mode (no `-p`/`-k`), choosing `vllm` and leaving the API key prompt empty stores `dummy` as the key.

## Resolution and precedence

Configuration is resolved at read time in `ConfigManager.get_llm_config()` and `get_embedder_config()`. The on-disk file stores raw values; environment variables are merged only when fields are fetched.

```text
api_key:     config.toml (non-empty)  →  OPENAI_API_KEY
base_url:    config.toml (non-empty)  →  OPENAI_BASE_URL  →  PROVIDER_PRESETS[provider].base_url
model:       config.toml value        →  dataclass default (gpt-4o-mini / text-embedding-3-small)
```

<Warning>
`OPENAI_BASE_URL` applies to both `[llm]` and `[embedder]` when their file `base_url` is empty. Set per-service URLs in the TOML file for mixed endpoints.
</Warning>
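
The precedence chain above can be sketched in a few lines. This is an illustrative re-implementation, not the code in `ConfigManager`: the function names, the `PRESET_BASE_URLS` dict, and the `proxy.example` URL are assumptions made for the example, and the environment is passed in as a plain dict so the behavior is easy to see.

```python
import os

# Transcribed from the preset table on this page; the dict name is hypothetical.
PRESET_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "bailian": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "vllm": None,  # must be set explicitly
}


def resolve_api_key(file_value, env=None):
    """config.toml (non-empty) -> OPENAI_API_KEY."""
    env = os.environ if env is None else env
    return file_value or env.get("OPENAI_API_KEY", "")


def resolve_base_url(provider, file_value, env=None):
    """config.toml (non-empty) -> OPENAI_BASE_URL -> preset URL."""
    env = os.environ if env is None else env
    return file_value or env.get("OPENAI_BASE_URL") or PRESET_BASE_URLS.get(provider)


# A shared OPENAI_BASE_URL beats the preset for every service whose file value is empty,
# while an explicit per-service URL in the file always wins.
env = {"OPENAI_BASE_URL": "https://proxy.example/v1"}
print(resolve_base_url("bailian", "", env))
print(resolve_base_url("vllm", "http://localhost:8001/v1", env))
```

The `or` chain treats an empty string the same as a missing value, which matches the "non-empty" wording in the resolution order.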

CLI `he config llm` / `he config embedder` flags write directly to the file. Partial updates preserve unspecified fields: `model` updates only when the flag value is non-empty; `api_key` and `base_url` update when the flag is present (including empty string).
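
The difference between "flag omitted" and "flag passed with an empty value" is easiest to see in code. The following is a minimal sketch of the partial-update rule, assuming `None` stands for an omitted flag; `apply_flags` is a hypothetical helper, not a function exported by the package.

```python
def apply_flags(section, *, model=None, api_key=None, base_url=None):
    """Return a copy of `section` updated per the partial-update rule.

    None means the flag was not passed; "" means it was passed with an empty value.
    """
    updated = dict(section)
    if model:  # model changes only for a non-empty value
        updated["model"] = model
    if api_key is not None:  # api_key changes whenever the flag is present
        updated["api_key"] = api_key
    if base_url is not None:  # same rule for base_url
        updated["base_url"] = base_url
    return updated


current = {"provider": "openai", "model": "gpt-4o-mini", "api_key": "sk-old", "base_url": ""}

# An empty model leaves the stored model alone; an empty api_key clears the stored key.
print(apply_flags(current, model="", api_key=""))
```

Clearing `api_key` this way is what lets the `OPENAI_API_KEY` fallback take over at resolution time.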

### Python client resolution

| API | Config source |
|-----|---------------|
| `get_client()` / `get_client(path)` | Reads TOML via `ConfigManager`, then builds LangChain clients |
| `create_client(spec, api_key=...)` | Bypasses TOML; parses `provider:model@url` shorthand |
| `create_llm()` / `create_embedder()` | Single-service creation from spec or dict |

String shorthand examples:

<CodeGroup>
```python title="Preset only"
create_client("bailian", api_key="sk-xxx")
# → qwen3.6-plus + text-embedding-v4 at DashScope URL
```

```python title="Model override"
create_llm("bailian:qwen-plus", api_key="sk-xxx")
```

```python title="Full vLLM spec"
create_client(
    llm="vllm:Qwen3.5-9B@http://localhost:8000/v1",
    embedder="vllm:bge-m3@http://localhost:8001/v1",
    api_key="dummy",
)
```
</CodeGroup>
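
To make the `provider:model@url` grammar concrete, here is a small stand-alone parser. It is a sketch of how such a string could be split, not the parser in `hyperextract/utils/client.py`; `parse_spec` and the shape of its return value are assumptions.

```python
def parse_spec(spec):
    """Split 'provider[:model][@url]' into its parts (illustrative only)."""
    # Split off the URL first, because the URL itself contains ':'.
    head, _, url = spec.partition("@")
    provider, _, model = head.partition(":")
    return {"provider": provider, "model": model or None, "base_url": url or None}


print(parse_spec("bailian"))
print(parse_spec("bailian:qwen-plus"))
print(parse_spec("vllm:Qwen3.5-9B@http://localhost:8000/v1"))
```

Parts that come back as `None` are the ones a preset would be expected to fill in, which is why a bare `vllm` spec is not enough on its own.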

## Environment variables

### Provider credentials

| Variable | Affects | Precedence |
|----------|---------|------------|
| `OPENAI_API_KEY` | `[llm].api_key`, `[embedder].api_key` | Used when the corresponding TOML `api_key` is empty |
| `OPENAI_BASE_URL` | `[llm].base_url`, `[embedder].base_url` | Used when the corresponding TOML `base_url` is empty, before preset lookup |

`.env.example` documents the credential pair:

```bash
OPENAI_API_KEY=sk-your-api-key-here
OPENAI_BASE_URL=https://api.openai.com/v1
```

<Info>
Despite the `OPENAI_` prefix, both variables act as generic fallbacks for any OpenAI-compatible provider configured in TOML — Bailian, proxy endpoints, and local vLLM included.
</Info>

### Logging

| Variable | Affects | Default | Notes |
|----------|---------|---------|-------|
| `HYPER_EXTRACT_LOG_LEVEL` | Root logger level | `WARNING` | Accepted: `DEBUG`, `INFO`, `WARNING`, `ERROR` (case-insensitive). Invalid values fall back to `WARNING`. |
| `HYPER_EXTRACT_LOG_FILE` | Optional log file path | unset (stderr only) | Parent directories are created automatically. |

The CLI calls `configure_logging()` on every invocation. There is no `--verbose` flag; set `HYPER_EXTRACT_LOG_LEVEL=DEBUG` instead.

```bash
HYPER_EXTRACT_LOG_LEVEL=DEBUG he parse examples/en/tesla.md -o /tmp/tesla-ka -t general/biography_graph -l en
```

Programmatic override: `configure_logging(level="INFO", output_file="/tmp/he.log")` — the `output_file` argument takes precedence over `HYPER_EXTRACT_LOG_FILE`.
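
The level-parsing rule from the table (case-insensitive, four accepted names, fallback to `WARNING`) can be approximated as follows. This is a sketch under those stated rules, not the body of `configure_logging()`; `level_from_env` and `ACCEPTED_LEVELS` are hypothetical names.

```python
import logging
import os

ACCEPTED_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def level_from_env(env=None):
    """Map HYPER_EXTRACT_LOG_LEVEL to a logging constant, defaulting to WARNING."""
    env = os.environ if env is None else env
    name = env.get("HYPER_EXTRACT_LOG_LEVEL", "WARNING").strip().upper()
    if name not in ACCEPTED_LEVELS:
        name = "WARNING"  # invalid values fall back instead of raising
    return getattr(logging, name)


print(logging.getLevelName(level_from_env({"HYPER_EXTRACT_LOG_LEVEL": "debug"})))
print(logging.getLevelName(level_from_env({"HYPER_EXTRACT_LOG_LEVEL": "chatty"})))
```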

## Validation rules

`ConfigManager.validate()` runs through `validate_config()` before model-backed CLI commands. On failure the CLI prints the error message and exits with code 1.

| Condition | Rule | Error message |
|-----------|------|---------------|
| `provider = "vllm"` (LLM) | `base_url` must be non-empty after resolution | `vLLM provider requires base_url.` |
| `provider = "vllm"` (embedder) | `base_url` must be non-empty after resolution | `vLLM embedder requires base_url.` |
| Any other LLM provider | `api_key` must be non-empty after `OPENAI_API_KEY` fallback | `LLM API key is not configured. Run 'he config llm --api-key YOUR_KEY'` |
| Any other embedder provider | `api_key` must be non-empty after fallback | `Embedder API key is not configured. Run 'he config embedder --api-key YOUR_KEY'` |
| All checks pass | — | `Configuration is valid` |

Commands that **do** validate: `parse`, `feed`, `build-index`, `search`, `talk`, `show`.

Commands that **do not** validate: `config`, `list`, `info`, `he --version`.

<Warning>
`provider = "vllm"` with no resolvable `base_url` also raises `ValueError` inside `_resolve_base_url()` during client construction, so the failure surfaces even on code paths that skip `validate()`.
</Warning>
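
The rule table above reduces to one check per service. The sketch below mirrors those rules and messages on already-resolved values; it is not the body of `ConfigManager.validate()`, and `validate_service` / `validate` are hypothetical helper names.

```python
def validate_service(kind, cfg):
    """Return an error message for one resolved service config, or None if it passes.

    `kind` is "LLM" or "Embedder"; `cfg` holds values after environment fallbacks.
    """
    if cfg.get("provider") == "vllm":
        if not cfg.get("base_url"):
            return "vLLM provider requires base_url." if kind == "LLM" else "vLLM embedder requires base_url."
        return None  # vLLM is only checked for a URL, not for a key
    if not cfg.get("api_key"):
        flag = "llm" if kind == "LLM" else "embedder"
        return f"{kind} API key is not configured. Run 'he config {flag} --api-key YOUR_KEY'"
    return None


def validate(llm, embedder):
    """Check the LLM first, then the embedder, and stop at the first failure."""
    for kind, cfg in (("LLM", llm), ("Embedder", embedder)):
        error = validate_service(kind, cfg)
        if error:
            return False, error
    return True, "Configuration is valid"


print(validate({"provider": "openai", "api_key": ""},
               {"provider": "vllm", "base_url": "http://localhost:8001/v1"}))
```

Checking the LLM before the embedder is an assumption of this sketch; the table does not state the order in which failures are reported.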

### Reset configuration

```bash
he config llm --unset
he config embedder --unset
```

`--unset` restores dataclass defaults (`gpt-4o-mini`, `text-embedding-3-small`, empty provider/key/URL) and rewrites the TOML file.

## Verify configuration

<Steps>
<Step title="Inspect resolved values">

```bash
he config show
```

`he config show` displays the **resolved** configuration (including environment fallbacks), not just raw TOML values. API keys are truncated to the first 10 characters.

</Step>
<Step title="Inspect one service">

```bash
he config llm --show
he config embedder --show
```

</Step>
<Step title="Run a validated command">

```bash
HYPER_EXTRACT_LOG_LEVEL=INFO he parse --help
```

Any command that calls `validate_config()` will fail fast with a clear message if credentials are incomplete.

</Step>
</Steps>

## Related pages

<CardGroup>
<Card title="Configure providers" href="/configure-providers">
Step-by-step setup with `he config init`, per-service commands, and programmatic `create_client()` patterns.
</Card>
<Card title="Provider system" href="/provider-system">
BYOC/BYOK model, `provider:model@url` shorthand, `CompatibleEmbeddings`, and model compatibility requirements.
</Card>
<Card title="CLI reference" href="/cli-reference">
Full `he config` subcommand surface, flags, and exit behavior.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`get_client`, `create_client`, `create_llm`, `create_embedder`, and logging helpers.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Missing API keys, vLLM `base_url` errors, and debug logging with `HYPER_EXTRACT_LOG_LEVEL`.
</Card>
</CardGroup>

---

## 17. Template schema reference

> YAML template fields (`language`, `name`, `type`, `tags`, `description`, `output`, `guideline`, `identifiers`, `options`, `display`), valid autotypes and field types, merge strategies, and identifier patterns.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/17-template-schema-reference.md
- Generated: 2026-06-18T20:58:15.410Z

### Source Files

- `hyperextract/utils/template_engine/parsers/loader.py`
- `hyperextract/utils/template_engine/parsers/schemas/base.py`
- `hyperextract/utils/template_engine/parsers/schemas/graph.py`
- `hyperextract/utils/template_engine/parsers/schemas/naive.py`
- `hyperextract/templates/DESIGN_GUIDE.md`
- `hyperextract/templates/presets/general/biography_graph.yaml`

---
title: "Template schema reference"
description: "YAML template fields (`language`, `name`, `type`, `tags`, `description`, `output`, `guideline`, `identifiers`, `options`, `display`), valid autotypes and field types, merge strategies, and identifier patterns."
---

Domain YAML templates are validated by `TemplateCfg` (Pydantic) when loaded through `load_template()`. The loader localizes multilingual fields per language, then `TemplateFactory` maps each section to runtime AutoType constructors: Pydantic schemas from `output`, LLM prompts from `guideline`, deduplication keys from `identifiers`, merge/index options from `options`, and visualization labels from `display`.

<Note>
Schema (`output`) defines **what** to extract. Guideline defines **how** to extract well. Do not repeat field definitions inside guideline rules.
</Note>

## Top-level document shape

Every domain template is a single YAML file with these top-level keys:

| Key | Required | Applies to |
|-----|----------|------------|
| `language` | Yes (defaults to `en`) | All types |
| `name` | Yes | All types |
| `type` | Yes | All types |
| `tags` | Yes | All types |
| `description` | Yes | All types |
| `output` | Yes | All types |
| `guideline` | Yes | All types |
| `identifiers` | Conditional | `set` and all graph types |
| `options` | No | All types |
| `display` | Yes | All types |

<ParamField body="language" type="string | string[]" default="en">
  Language code or list of supported codes (for example `en`, `zh`, or `[zh, en]`). At load time, every listed language is localized and validated. CLI extraction requires `--lang` to match one of these codes.
</ParamField>

<ParamField body="name" type="string" required>
  Template identifier used in gallery paths (for example `general/biography_graph` resolves to `name: biography_graph`). Use stable, lowercase or snake_case names to match preset conventions.
</ParamField>

<ParamField body="type" type="VALID_AUTOTYPES" required>
  Selects the target AutoType and which output/guideline/identifier schema applies. See [AutoTypes](#autotypes).
</ParamField>

<ParamField body="tags" type="string[]" required>
  Domain labels for discovery and filtering (for example `[general, biography]`). Use lowercase tokens.
</ParamField>

<ParamField body="description" type="string | Record&lt;lang, string&gt;" required>
  Human-readable summary. Supports per-language dicts (`zh`, `en`) or a plain string.
</ParamField>

## AutoTypes

The `type` field must be one of eight values. It determines output shape, guideline keys, identifier requirements, and factory routing.

| `type` value | AutoType class | Output shape | Identifier keys |
|--------------|----------------|--------------|-------------------|
| `model` | `AutoModel` | `output.fields` | Optional (unused) |
| `list` | `AutoList` | `output.fields` | Optional (unused) |
| `set` | `AutoSet` | `output.fields` | `item_id` |
| `graph` | `AutoGraph` | `output.entities` + `output.relations` | `entity_id`, `relation_id`, `relation_members` |
| `hypergraph` | `AutoHypergraph` | entities + relations | Same as graph |
| `temporal_graph` | `AutoTemporalGraph` | entities + relations (+ `time`) | + `time_field` |
| `spatial_graph` | `AutoSpatialGraph` | entities + relations (+ `location`) | + `location_field` |
| `spatio_temporal_graph` | `AutoSpatioTemporalGraph` | entities + relations (+ `time`, `location`) | + `time_field`, `location_field` |

<Info>
Record types (`model`, `list`, `set`) use a flat `output.fields` list. Graph types use nested `output.entities` and `output.relations`, each with their own `description` and `fields`.
</Info>

## Field types

Each field under `output.fields`, `output.entities.fields`, or `output.relations.fields` is a `FieldSchema`:

<ParamField body="name" type="string" required>
  Field name in snake_case (for example `company_name`, `source`, `time`).
</ParamField>

<ParamField body="type" type="VALID_FIELD_TYPES" required>
  One of `str`, `int`, `float`, `bool`, or `list`. The `list` type maps to `List[str]` in generated Pydantic schemas.
</ParamField>

<ParamField body="description" type="string | Record&lt;lang, string&gt;" required>
  Semantic meaning for the LLM. Supports multilingual dicts or plain strings.
</ParamField>

<ParamField body="required" type="boolean">
  When `false` and no `default` is set, the generated schema uses `default=None`. When omitted, the field is required.
</ParamField>

<ParamField body="default" type="any">
  Explicit default value passed into the generated Pydantic field.
</ParamField>
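
The interaction between `type`, `required`, and `default` can be sketched without any third-party imports. The helper below returns an `(annotation, default)` pair per field, using `...` (Ellipsis) to mark a required field, which is the tuple convention `pydantic.create_model` accepts for field definitions. `field_spec` and `PY_TYPES` are hypothetical names, and wrapping non-required fields in `Optional[...]` is an assumption; the source only states that the default becomes `None`.

```python
from typing import List, Optional

# YAML type names from the field-type list above.
PY_TYPES = {"str": str, "int": int, "float": float, "bool": bool, "list": List[str]}


def field_spec(field):
    """Return (annotation, default) for one field dict; Ellipsis marks a required field."""
    annotation = PY_TYPES[field["type"]]
    if "default" in field:
        return annotation, field["default"]  # an explicit default always wins
    if field.get("required", True) is False:
        return Optional[annotation], None  # required: false with no default
    return annotation, ...  # `required` omitted or true


print(field_spec({"name": "company_name", "type": "str", "description": "Company name"}))
print(field_spec({"name": "revenue", "type": "str", "description": "Revenue amount", "required": False}))
print(field_spec({"name": "participants", "type": "list", "description": "Participants"}))
```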

### Naming conventions

| Element | Convention | Example |
|---------|------------|---------|
| Template `name` | snake_case (presets) | `biography_graph` |
| Field names | snake_case | `company_name` |
| Relation type field | `type` | not `relation_type` |
| Time field | `time` | not `event_date` |
| Location field | `location` | |
| Tags | lowercase | `finance`, `biography` |

Keep entity and relation field counts at five or fewer per component when possible.

## Output

### Record types (`model`, `list`, `set`)

```yaml
output:
  description: '...'          # or { zh: '...', en: '...' }
  fields:
    - name: company_name
      type: str
      description: 'Company name'
      required: true
    - name: revenue
      type: str
      description: 'Revenue amount'
      required: false
```

`parse_output()` builds a single Pydantic `DataSchema` from `output.fields`.

### Graph types

```yaml
output:
  description: '...'
  entities:
    description: 'Entity definitions'
    fields:
      - name: name
        type: str
        description: 'Entity name'
  relations:
    description: 'Relation definitions'
    fields:
      - name: source
        type: str
        description: 'Source entity'
      - name: target
        type: str
        description: 'Target entity'
      - name: type
        type: str
        description: 'Relation type'
```

For graph types, `parse_output()` returns a `(NodeSchema, EdgeSchema)` pair. Hypergraph relations often include a `participants` field with `type: list`.

## Guideline

Guideline content becomes LLM extraction prompts after localization.

### Record types

<ParamField body="guideline.target" type="string | string[] | Record&lt;lang, string&gt;" required>
  Role and task preamble for the extractor.
</ParamField>

<ParamField body="guideline.rules" type="string | string[] | Record&lt;lang, string | string[]&gt;" required>
  Numbered extraction rules. Lists are joined into numbered lines at localization time.
</ParamField>

### Graph types

<ParamField body="guideline.rules_for_entities" type="string | string[] | Record&lt;lang, string | string[]&gt;" required>
  Entity extraction rules.
</ParamField>

<ParamField body="guideline.rules_for_relations" type="string | string[] | Record&lt;lang, string | string[]&gt;" required>
  Relation extraction rules.
</ParamField>

<ParamField body="guideline.rules_for_time" type="string | string[] | Record&lt;lang, string | string[]&gt;">
  Required for `temporal_graph` and `spatio_temporal_graph`. Reference `{observation_time}` placeholder when relative-time conversion is needed.
</ParamField>

<ParamField body="guideline.rules_for_location" type="string | string[] | Record&lt;lang, string | string[]&gt;">
  Required for `spatial_graph` and `spatio_temporal_graph`. Reference `{observation_location}` placeholder when needed.
</ParamField>

For graph AutoTypes, `parse_guideline()` emits three prompts: a main prompt, a node-only prompt, and an edge-only prompt (used when `extraction_mode: two_stage`).

## Identifiers

Identifiers define deduplication keys via `parse_identifiers()`. Values are either a single field name or a `{field}` template.

### `set`

```yaml
identifiers:
  item_id: name
```

`item_id` is required. The extractor reads the named field and stringifies it.

### `graph` (binary relations)

```yaml
identifiers:
  entity_id: name
  relation_id: '{source}|{type}|{target}'
  relation_members:
    source: source
    target: target
```

| Key | Pattern | Purpose |
|-----|---------|---------|
| `entity_id` | Field name or `{field}` template | Unique node key |
| `relation_id` | `{source}\|{type}\|{target}` (extend with `\|{time}` or `\|{location}` for spatio-temporal variants) | Unique edge key |
| `relation_members` | Dict `{source: source, target: target}` | Sorted endpoint tuple for graph dedup |

### `hypergraph`

**Flat participants (equal roles):**

```yaml
identifiers:
  entity_id: name
  relation_id: '{name}|{type}'
  relation_members: participants   # string → sorted tuple of list field
```

**Nested role groups:**

```yaml
identifiers:
  entity_id: name
  relation_id: '{event_name}'
  relation_members: [group_a, group_b]   # list → each entry must be type: list
```

### Temporal and spatial extensions

| AutoType | Additional identifier keys | Typical `relation_id` pattern |
|----------|---------------------------|-------------------------------|
| `temporal_graph` | `time_field: time` | `'{source}\|{type}\|{target}\|{time}'` |
| `spatial_graph` | `location_field: location` | `'{source}\|{type}\|{target}\|{location}'` |
| `spatio_temporal_graph` | `time_field`, `location_field` | `'{source}\|{type}\|{target}\|{time}\|{location}'` |

<Warning>
Brace templates (`{field}`) reference attribute names on extracted items. Missing fields raise `AttributeError` at runtime. Keep template placeholders aligned with `output` field names.
</Warning>
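
A short sketch shows how an identifier value, either a bare field name or a `{field}` template, could be turned into a deduplication key, and why a misspelled placeholder fails with `AttributeError`. This is an illustration rather than the body of `parse_identifiers()`; `render_key` and the sample edge are made up for the example.

```python
import re
from types import SimpleNamespace

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_key(pattern, item):
    """Resolve a bare field name or a '{field}' template against an item's attributes."""
    if "{" not in pattern:
        return str(getattr(item, pattern))
    # getattr raises AttributeError when a placeholder has no matching field.
    return PLACEHOLDER.sub(lambda m: str(getattr(item, m.group(1))), pattern)


# Stand-in for one extracted relation.
edge = SimpleNamespace(source="Tesla", type="worked_for", target="Edison", time="1884")

print(render_key("{source}|{type}|{target}|{time}", edge))
print(render_key("source", edge))

try:
    render_key("{source}|{location}", edge)  # no `location` field on this edge
except AttributeError as exc:
    print("AttributeError:", exc)
```

Two relations that render to the same key are treated as the same edge, so adding `|{time}` to the pattern is what keeps repeated events at different times apart.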

## Options

`options` is optional. `parse_option()` filters keys by AutoType and maps YAML names to runtime kwargs.

### Common options (all types)

| YAML key | Runtime param | Type |
|----------|---------------|------|
| `chunk_size` | `chunk_size` | `int` |
| `chunk_overlap` | `chunk_overlap` | `int` |
| `max_workers` | `max_workers` | `int` |
| `verbose` | `verbose` | `bool` |

### Record-type options

| YAML key | Runtime param | Applies to |
|----------|---------------|------------|
| `merge_strategy` | `strategy_or_merger` | `model`, `set` |
| `fields_for_search` | `fields_for_index` | `list`, `set` |

### Graph-type options

| YAML key | Runtime param | Applies to |
|----------|---------------|------------|
| `entity_merge_strategy` | `node_strategy_or_merger` | All graph types |
| `relation_merge_strategy` | `edge_strategy_or_merger` | All graph types |
| `entity_fields_for_search` | `node_fields_for_index` | All graph types |
| `relation_fields_for_search` | `edge_fields_for_index` | All graph types |
| `extraction_mode` | `extraction_mode` | All graph types (`two_stage` or `one_stage`) |
| `observation_time` | `observation_time` | `temporal_graph`, `spatio_temporal_graph` |
| `observation_location` | `observation_location` | `spatial_graph`, `spatio_temporal_graph` |

Preset graph templates default to `extraction_mode: two_stage`.
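
The three option tables amount to a rename-and-filter step. The sketch below transcribes them into one lookup and shows the effect on a `temporal_graph` template; it is not the body of `parse_option()`. Silently dropping keys that do not apply to the AutoType is an assumption, and the conversion of strategy names into `MergeStrategy` values is left out.

```python
GRAPH_TYPES = {"graph", "hypergraph", "temporal_graph", "spatial_graph", "spatio_temporal_graph"}
ALL_TYPES = GRAPH_TYPES | {"model", "list", "set"}

# YAML key -> (runtime kwarg, AutoTypes it applies to), transcribed from the tables above.
OPTION_MAP = {
    "chunk_size": ("chunk_size", ALL_TYPES),
    "chunk_overlap": ("chunk_overlap", ALL_TYPES),
    "max_workers": ("max_workers", ALL_TYPES),
    "verbose": ("verbose", ALL_TYPES),
    "merge_strategy": ("strategy_or_merger", {"model", "set"}),
    "fields_for_search": ("fields_for_index", {"list", "set"}),
    "entity_merge_strategy": ("node_strategy_or_merger", GRAPH_TYPES),
    "relation_merge_strategy": ("edge_strategy_or_merger", GRAPH_TYPES),
    "entity_fields_for_search": ("node_fields_for_index", GRAPH_TYPES),
    "relation_fields_for_search": ("edge_fields_for_index", GRAPH_TYPES),
    "extraction_mode": ("extraction_mode", GRAPH_TYPES),
    "observation_time": ("observation_time", {"temporal_graph", "spatio_temporal_graph"}),
    "observation_location": ("observation_location", {"spatial_graph", "spatio_temporal_graph"}),
}


def map_options(autotype, options):
    """Keep only the options that apply to `autotype`, renamed to runtime kwargs."""
    kwargs = {}
    for key, value in options.items():
        runtime_name, applies_to = OPTION_MAP.get(key, (None, ()))
        if autotype in applies_to:
            kwargs[runtime_name] = value
    return kwargs


yaml_options = {
    "chunk_size": 2048,
    "entity_merge_strategy": "merge_field",
    "merge_strategy": "keep_incoming",  # record-type key, ignored for graph types here
    "observation_time": "2024-01-01",
}
print(map_options("temporal_graph", yaml_options))
```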

### Merge strategies

All merge strategy fields accept the same six values, resolved to `ontomem.merger.MergeStrategy`:

| Strategy | Behavior |
|----------|----------|
| `merge_field` | Field-level merge of conflicting values |
| `keep_incoming` | Prefer the newly extracted record |
| `keep_existing` | Prefer the stored record |
| `llm_balanced` | LLM-mediated balanced merge |
| `llm_prefer_incoming` | LLM merge biased toward incoming |
| `llm_prefer_existing` | LLM merge biased toward existing |

Use `merge_strategy` on `model`/`set`, `entity_merge_strategy` for nodes, and `relation_merge_strategy` for edges.
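
For intuition, the three non-LLM strategies can be modeled on plain dicts. This is only a mental model: the real mergers live in `ontomem`, and the `merge_field` rule used here (a non-`None` incoming value replaces the stored one, field by field) is an assumption, not a description of that library's conflict handling. `merge_records` is a hypothetical helper.

```python
def merge_records(existing, incoming, strategy):
    """Merge two dict records that share the same identifier (non-LLM strategies only)."""
    if strategy == "keep_incoming":
        return dict(incoming)
    if strategy == "keep_existing":
        return dict(existing)
    if strategy == "merge_field":
        # ASSUMPTION: per field, a non-None incoming value replaces the stored one.
        merged = dict(existing)
        merged.update({k: v for k, v in incoming.items() if v is not None})
        return merged
    raise ValueError(f"strategy needs an LLM or is unknown: {strategy}")


stored = {"name": "Tesla", "type": "person", "born": None}
fresh = {"name": "Tesla", "type": None, "born": "1856"}

for strategy in ("keep_incoming", "keep_existing", "merge_field"):
    print(strategy, merge_records(stored, fresh, strategy))
```

The `llm_*` strategies are omitted because they delegate conflict resolution to the configured model.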

## Display

Display templates drive OntoSight labels and search result formatting via the same `{field}` syntax as identifiers.

### Record types

```yaml
display:
  label: '{name}'
```

### Graph types

```yaml
display:
  entity_label: '{name} ({type})'
  relation_label: '{type}@{time}'
```

| AutoType | Typical `entity_label` | Typical `relation_label` |
|----------|--------------------------|--------------------------|
| `graph` | `{name} ({type})` | `{type}` |
| `hypergraph` | `{name}` | `{name}` or `{outcome}` |
| `temporal_graph` | `{name} ({type})` | `{type}@{time}` |
| `spatial_graph` | `{name} ({type})` | `{type}@{location}` |
| `spatio_temporal_graph` | `{name} ({type})` | `{type}@{location}({time})` |

## Multilingual values

String fields accept three formats:

1. **Plain string** — used as-is for all languages.
2. **Language dict** — `{ zh: '...', en: '...' }`; `localize_template(config, lang)` picks the matching entry.
3. **String list** — joined into numbered lines (`1. ...`, `2. ...`) at localization.

`load_template()` validates localization for every code in `language`. A missing translation for a declared language raises `ValueError`.
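
The three value formats collapse to a single string per language. The sketch below shows that reduction for one value, assuming the numbering format given in the list above; it is not the library's `localize_template()`, and `localize_value` and its error text are hypothetical.

```python
def localize_value(value, lang):
    """Collapse a multilingual template value to one string for `lang`."""
    if isinstance(value, dict):
        if lang not in value:
            raise ValueError(f"missing translation for language: {lang}")
        value = value[lang]  # may itself be a string or a list of rules
    if isinstance(value, list):
        return "\n".join(f"{i}. {line}" for i, line in enumerate(value, start=1))
    return value


rules = {
    "en": ["Extract the biographical subject as the core entity",
           "Create relationships only when explicitly stated"],
}

print(localize_value("Entity name", "zh"))  # plain strings pass through for any language
print(localize_value(rules, "en"))

try:
    localize_value(rules, "zh")  # declared language without a translation
except ValueError as exc:
    print("ValueError:", exc)
```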

<Check>
Each language block should use that language only. Do not mix scripts (for example English terms inside `zh` descriptions).
</Check>

## Validation and loading

<Steps>
<Step title="Parse YAML into TemplateCfg">
  `load_template(path)` reads the file and instantiates `TemplateCfg` via Pydantic. Invalid `type`, field `type`, or missing required keys fail immediately.
</Step>
<Step title="Localize per language">
  For each code in `language`, `localize_template()` converts multilingual dicts/lists to single-language strings and verifies completeness.
</Step>
<Step title="Factory assembly">
  At runtime, `TemplateFactory.create_*` calls `parse_output`, `parse_guideline`, `parse_identifiers`, `parse_option`, and `parse_display` to construct the AutoType instance.
</Step>
</Steps>

### Validation checklist

**All types**
- `language` lists only codes for which every multilingual field provides a translation
- `type` is a valid AutoType
- `tags` is a non-empty lowercase array
- `description`, `output`, `guideline`, and `display` are present

**Graph types**
- `output.entities` and `output.relations` exist
- `identifiers.entity_id`, `identifiers.relation_id`, and `identifiers.relation_members` are configured
- `display.entity_label` and `display.relation_label` are set

**`set`**
- `identifiers.item_id` is set

**`temporal_graph` / `spatio_temporal_graph`**
- `identifiers.time_field` points to the time relation field (typically `time`)
- `guideline.rules_for_time` is present

**`spatial_graph` / `spatio_temporal_graph`**
- `identifiers.location_field` points to the location relation field (typically `location`)
- `guideline.rules_for_location` is present

**Hypergraph**
- `relation_members` is a string (flat list field) or a list of list-typed field names (nested groups)
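
Part of this checklist is mechanical and can be run against a template that has already been parsed into a dict. The sketch below covers the presence checks only; it does not replace `TemplateCfg` validation, and `check_template` and its message format are hypothetical.

```python
GRAPH_TYPES = {"graph", "hypergraph", "temporal_graph", "spatial_graph", "spatio_temporal_graph"}


def check_template(cfg):
    """Return checklist violations for a template that is already parsed into a dict."""
    problems = []
    kind = cfg.get("type")
    ids = cfg.get("identifiers") or {}
    guideline = cfg.get("guideline") or {}
    display = cfg.get("display") or {}

    def require(mapping, section, keys):
        problems.extend(f"missing {section}.{key}" for key in keys if key not in mapping)

    require(cfg, "template", ["description", "output", "guideline", "display"])
    if kind == "set":
        require(ids, "identifiers", ["item_id"])
    if kind in GRAPH_TYPES:
        require(cfg.get("output") or {}, "output", ["entities", "relations"])
        require(ids, "identifiers", ["entity_id", "relation_id", "relation_members"])
        require(display, "display", ["entity_label", "relation_label"])
    if kind in {"temporal_graph", "spatio_temporal_graph"}:
        require(ids, "identifiers", ["time_field"])
        require(guideline, "guideline", ["rules_for_time"])
    if kind in {"spatial_graph", "spatio_temporal_graph"}:
        require(ids, "identifiers", ["location_field"])
        require(guideline, "guideline", ["rules_for_location"])
    return problems


# A temporal_graph draft that forgot its time-specific entries.
draft = {
    "type": "temporal_graph",
    "description": "demo",
    "output": {"entities": {}, "relations": {}},
    "guideline": {"rules_for_entities": [], "rules_for_relations": []},
    "identifiers": {
        "entity_id": "name",
        "relation_id": "{source}|{type}|{target}",
        "relation_members": {"source": "source", "target": "target"},
    },
    "display": {"entity_label": "{name}", "relation_label": "{type}@{time}"},
}
print(check_template(draft))
```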

## Complete example

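`general/biography_graph` is a `temporal_graph` preset showing multilingual fields, temporal identifiers, and display labels. The listing is abridged: most `zh` entries are omitted and rule lists are truncated with `...`, so treat it as an outline of the shape rather than a loadable file: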

```yaml
language: [zh, en]
name: biography_graph
type: temporal_graph
tags: [general, biography, life, events, timeline]

description:
  zh: '传记图模板 - 从人物传记、回忆录、年谱中提取实体和关系'
  en: 'Biography Graph Template - Extract entities and relationships from biographies'

output:
  description:
    en: "Entity and relationship network from biographies"
  entities:
    description:
      en: 'Entities in the biography'
    fields:
      - name: name
        type: str
        description: { en: 'Entity name' }
        required: true
      - name: type
        type: str
        description: { en: 'Entity type' }
        required: true
  relations:
    description:
      en: 'Relationships between entities'
    fields:
      - name: source
        type: str
        description: { en: 'Source entity' }
        required: true
      - name: target
        type: str
        description: { en: 'Target entity' }
        required: true
      - name: type
        type: str
        description: { en: 'Relation type' }
        required: true
      - name: time
        type: str
        description: { en: 'Time associated with the relation' }
        required: false

guideline:
  target:
    en: 'You are a professional biography analyst...'
  rules_for_entities:
    en: ['Extract the biographical subject as the core entity', ...]
  rules_for_relations:
    en: ['Create relationships only when explicitly stated', ...]
  rules_for_time:
    en: ['Observation time baseline: {observation_time}', ...]

identifiers:
  entity_id: name
  relation_id: '{source}|{type}|{target}'
  relation_members:
    source: source
    target: target
  time_field: time

display:
  entity_label: '{name}'
  relation_label: '{type}@{time}'
```

Load and inspect programmatically:

```python
from hyperextract.utils.template_engine.parsers import load_template, localize_template

cfg = load_template("path/to/biography_graph.yaml")
cfg_en = localize_template(cfg, "en")
```

## Related pages

<CardGroup>
<Card title="Create custom templates" href="/create-custom-templates">
  End-to-end authoring workflow: type selection, field design, multilingual blocks, and validation.
</Card>
<Card title="Auto-Types" href="/auto-types">
  Runtime behavior, merge semantics, and indexing for each of the eight AutoTypes.
</Card>
<Card title="Template design skills" href="/template-design-skills">
  Agent-assisted authoring with `hyperextract-skills` validators and optimizers.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
  When to use domain YAML templates versus algorithm `method/*` templates.
</Card>
</CardGroup>

---

## 18. Extraction methods reference

> Registered methods (`graph_rag`, `light_rag`, `hyper_rag`, `hypergraph_rag`, `cog_rag`, `itext2kg`, `itext2kg_star`, `kg_gen`, `atom`): autotype output, descriptions, registry API, and constructor kwargs.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/18-extraction-methods-reference.md
- Generated: 2026-06-18T20:57:53.493Z

### Source Files

- `hyperextract/methods/registry.py`
- `hyperextract/methods/rag/graph_rag.py`
- `hyperextract/methods/rag/light_rag.py`
- `hyperextract/methods/rag/hyper_rag.py`
- `hyperextract/methods/typical/kg_gen.py`
- `hyperextract/methods/typical/atom.py`
- `hyperextract/utils/template_engine/factory.py`

---
title: "Extraction methods reference"
description: "Registered methods (`graph_rag`, `light_rag`, `hyper_rag`, `hypergraph_rag`, `cog_rag`, `itext2kg`, `itext2kg_star`, `kg_gen`, `atom`): autotype output, descriptions, registry API, and constructor kwargs."
---

Nine extraction algorithms are registered in `hyperextract/methods/registry.py` at import time. Each entry maps a short name (for example `light_rag`) to a Python class, an autotype label (`graph` or `hypergraph`), and a description. `TemplateFactory.create_method` and `Template.create("method/<name>")` resolve names through this registry, instantiate the class with `llm_client` and `embedder`, and stamp instance metadata (`template`, `lang="en"`, `type`).

<Info>
Method templates always use English prompts. The `language` argument to `Template.create` is ignored for `method/` sources.
</Info>

## Method catalog

| Registry name | Python class | Autotype | Base engine | Description |
|---|---|---|---|---|
| `graph_rag` | `Graph_RAG` | `graph` | `AutoGraph` | Graph-RAG with community detection and global search |
| `light_rag` | `Light_RAG` | `graph` | `AutoGraph` | Lightweight graph RAG with binary entity–relation edges |
| `hyper_rag` | `Hyper_RAG` | `hypergraph` | `AutoHypergraph` | Hypergraph RAG with n-ary hyperedges |
| `hypergraph_rag` | `HyperGraph_RAG` | `hypergraph` | `AutoHypergraph` | Knowledge-segment hyperedges with nested entities |
| `cog_rag` | `Cog_RAG` | `hypergraph` | Dual `AutoHypergraph` layers | Theme layer + detail layer cognitive RAG |
| `itext2kg` | `iText2KG` | `graph` | `AutoGraph` | High-quality triple-based knowledge graph extraction |
| `itext2kg_star` | `iText2KG_Star` | `graph` | `AutoGraph` | Edge-first extraction with semantic node deduplication |
| `kg_gen` | `KG_Gen` | `graph` | `AutoGraph` | Subject–predicate–object triple generator |
| `atom` | `Atom` | `graph` | `AutoGraph` | Temporal triple extraction with evidence attribution |

Template IDs use the `method/<registry_name>` prefix (for example `method/atom`). List registered methods from the CLI with `he list method` or `he list method -q rag`.

```mermaid
classDiagram
    class MethodRegistry {
        +register_method(name, class, autotype, description)
        +get_method(name)
        +list_methods()
        +get_method_cfg(name)
        +list_method_cfgs()
    }
    class TemplateFactory {
        +create_method(name, llm, embedder, **kwargs)
        +create(source, language, llm, embedder, **kwargs)
    }
    class AutoGraph
    class AutoHypergraph
    class Cog_RAG

    MethodRegistry --> TemplateFactory : get_method()
    TemplateFactory --> AutoGraph : graph methods
    TemplateFactory --> AutoHypergraph : hypergraph methods
    TemplateFactory --> Cog_RAG : cog_rag
    Cog_RAG --> AutoHypergraph : theme_layer + detail_layer
```

## Registry API

Import from `hyperextract.methods`:

```python
from hyperextract.methods import (
    register_method,
    get_method,
    list_methods,
    get_method_cfg,
    list_method_cfgs,
    MethodCfg,
)
```

### `register_method`

<ParamField body="name" type="string" required>
Short registry key used by CLI `-m` and `method/<name>` template paths.
</ParamField>

<ParamField body="method_class" type="Type" required>
Callable class. Must accept `llm_client` and `embedder` as constructor arguments.
</ParamField>

<ParamField body="autotype" type="string" required>
Output category written to instance metadata `type`. Built-in values: `graph`, `hypergraph`.
</ParamField>

<ParamField body="description" type="string">
Human-readable summary shown by `he list method` and `Template.list()`.
</ParamField>

### `get_method`

<ResponseField name="class" type="Type">
Instantiable method class.
</ResponseField>

<ResponseField name="type" type="string">
Autotype label (`graph` or `hypergraph`).
</ResponseField>

<ResponseField name="description" type="string">
Registered description string.
</ResponseField>

`get_method(name)` returns the registry entry with the three fields above, or `None` when the name is unknown. `TemplateFactory.create_method` raises `ValueError` in that case.

### `list_methods`

Returns a shallow copy of the full registry: `Dict[str, Dict[str, Any]]` keyed by method name.

### `get_method_cfg` / `list_method_cfgs`

`get_method_cfg(name)` returns a `MethodCfg` Pydantic model or `None`.

<ResponseField name="name" type="string">
Registry key.
</ResponseField>

<ResponseField name="type" type="string">
Autotype label.
</ResponseField>

<ResponseField name="description" type="string">
Description string.
</ResponseField>

`list_method_cfgs()` returns `Dict[str, MethodCfg]` keyed by `method/<name>`, suitable for merging into `Template.list(include_methods=True)`.
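
The lookup behavior described in this section (a `None` result for unknown names, a shallow copy from `list_methods`) can be pictured with a small standard-library model. This is a minimal sketch, not the code in `registry.py`: `_REGISTRY` and `DummyGraphMethod` are hypothetical names, and the three functions only mirror the documented signatures.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the module-level registry; not the library's code.
_REGISTRY: Dict[str, Dict[str, Any]] = {}


def register_method(name: str, method_class: type, autotype: str, description: str = "") -> None:
    """Store one entry under a short name such as "light_rag"."""
    _REGISTRY[name] = {"class": method_class, "type": autotype, "description": description}


def get_method(name: str) -> Optional[Dict[str, Any]]:
    """Return the entry for `name`, or None when the name is unknown."""
    return _REGISTRY.get(name)


def list_methods() -> Dict[str, Dict[str, Any]]:
    """Return a shallow copy so callers cannot add or remove registry keys."""
    return dict(_REGISTRY)


class DummyGraphMethod:
    """Hypothetical method class accepting the two required clients."""

    def __init__(self, llm_client: Any, embedder: Any, **kwargs: Any) -> None:
        self.llm_client = llm_client
        self.embedder = embedder


register_method("dummy", DummyGraphMethod, "graph", "Illustrative entry")

snapshot = list_methods()
snapshot.pop("dummy")  # mutating the copy leaves the registry itself intact
```

Because the copy is shallow, removing a key from `snapshot` does not affect later lookups, but the inner entry dictionaries are still shared with the registry.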

### Register a custom method

```python
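from hyperextract import Template
from hyperextract.methods import register_method
# Assumption: AutoGraph is importable from the package root; adjust to your install.
from hyperextract import AutoGraph


class MyMethod(AutoGraph):
    def __init__(self, llm_client, embedder, **kwargs):
        ...


register_method("my_method", MyMethod, "graph", "Custom graph extractor")

# llm and embedder are LangChain-compatible clients created elsewhere.
ka = Template.create("method/my_method", llm_client=llm, embedder=embedder)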
```

Built-in methods are registered in `_init_registry()` inside `registry.py`; add new built-ins there and export the class from `hyperextract/methods/rag` or `hyperextract/methods/typical`.

## Autotype output

Registry `autotype` values map to Hyper-Extract primitives, not full class names:

| Registry `type` | Runtime instance | On-disk KA shape |
|---|---|---|
| `graph` | `AutoGraph` subclass | `data.json` with `nodes` and `edges` (binary or triple schemas) |
| `hypergraph` | `AutoHypergraph` subclass | `data.json` with `nodes` and hyperedge records |
| `hypergraph` (`cog_rag` only) | `Cog_RAG` wrapper | Root `data.json` (aggregated) plus `theme_layer/` and `detail_layer/` subdirectories |

After `TemplateFactory.create_method`, every instance carries:

```python
instance.metadata["template"]  # "method/<name>"
instance.metadata["lang"]      # "en"
instance.metadata["type"]      # registry autotype
```
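
The resolve, instantiate, and stamp sequence can be sketched without the library. This is an illustration under stated assumptions: `create_method`, `REGISTRY`, and `EchoMethod` below are hypothetical stand-ins, and only the metadata keys, the `method/` prefix, and the `ValueError` on unknown names come from this page.

```python
from typing import Any, Dict

METHOD_PREFIX = "method/"


class EchoMethod:
    """Hypothetical method class that records what it was constructed with."""

    def __init__(self, llm_client: Any, embedder: Any, **kwargs: Any) -> None:
        self.llm_client = llm_client
        self.embedder = embedder
        self.kwargs = kwargs
        self.metadata: Dict[str, Any] = {}


# Hypothetical registry contents for the sketch.
REGISTRY: Dict[str, Dict[str, Any]] = {
    "echo": {"class": EchoMethod, "type": "graph", "description": "Illustrative entry"},
}


def create_method(source: str, llm_client: Any, embedder: Any, **kwargs: Any) -> Any:
    """Resolve "method/<name>", instantiate the class, and stamp metadata."""
    name = source[len(METHOD_PREFIX):] if source.startswith(METHOD_PREFIX) else source
    entry = REGISTRY.get(name)
    if entry is None:
        raise ValueError(f"Unknown method: {name!r}")
    # Extra kwargs are forwarded to the constructor unchanged.
    instance = entry["class"](llm_client=llm_client, embedder=embedder, **kwargs)
    instance.metadata.update(
        {"template": f"{METHOD_PREFIX}{name}", "lang": "en", "type": entry["type"]}
    )
    return instance


demo = create_method("method/echo", llm_client=object(), embedder=object(), chunk_size=1024)
```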

## Constructor kwargs

All methods require a LangChain-compatible LLM and embedder. Additional kwargs pass through `Template.create(..., **kwargs)` and `TemplateFactory.create_method(..., **kwargs)` unchanged.

### Shared parameters

| Parameter | Type | Default | Applies to |
|---|---|---|---|
| `llm_client` | `BaseChatModel` | — (required) | All |
| `embedder` | `Embeddings` | — (required) | All |
| `chunk_size` | `int` | `2048` | All |
| `chunk_overlap` | `int` | `256` | All |
| `max_workers` | `int` | `10` | All |
| `verbose` | `bool` | `False` | All |
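
The interaction between `chunk_size` and `chunk_overlap` can be pictured with a naive sliding window. This is a minimal sketch, assuming character-based windows; `chunk_text` is a hypothetical helper, and the library's actual splitter may choose boundaries differently.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> List[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


sample = "x" * 5000
pieces = chunk_text(sample)  # defaults: 2048-character windows, 256 shared characters
```

With the defaults, each window starts 1,792 characters after the previous one, so a smaller `chunk_size` produces more chunks and therefore more LLM calls.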

### Method-specific parameters

| Method | Extra kwargs | Default | Purpose |
|---|---|---|---|
| `atom` | `observation_time` | current date if omitted | Anchor for resolving relative temporal expressions in factoid and edge prompts |
| `atom` | `facts_per_chunk` | `10` | Max atomic facts per edge-extraction batch |
| `itext2kg_star` | `observation_date` | current datetime if omitted | Written to `edge.properties.observation_date` post-extraction |

<CodeGroup>
```python Python API
from hyperextract import Template

atom = Template.create(
    "method/atom",
    llm_client=llm,
    embedder=embedder,
    observation_time="2024-06-15",
    facts_per_chunk=8,
    chunk_size=1024,
    verbose=True,
)

star = Template.create(
    "method/itext2kg_star",
    llm_client=llm,
    embedder=embedder,
    observation_date="2024-06-15",
)
```

```bash CLI
he parse examples/en/tesla.md -m atom -o ./out/atom
he list method -q temporal
```
</CodeGroup>

## Per-method reference

### RAG methods (`hyperextract/methods/rag`)

#### `graph_rag` — `Graph_RAG`

Binary directed edges (`source`, `target`, `description`, `strength`). Two-stage extraction with `CustomRuleMerger` deduplication. Extends `AutoGraph` with community features:

- `build_communities(level=0)` — Leiden clustering via `graspologic` (optional; requires `networkx` and `graspologic`)
- `search(query, use_community=True)` — optional community-enhanced retrieval
- `dump` / `load` — persists `community_data.json` alongside standard KA files

#### `light_rag` — `Light_RAG`

Binary edges with `keywords` and `strength`. Edge keys use `f"{source}->{target}"`. Indexed fields: node `name/type/description`, edge `keywords/description`.

#### `hyper_rag` — `Hyper_RAG`

N-ary hyperedges via `participants: List[str]`. Edge keys are sorted participant tuples. Two-stage node-then-hyperedge extraction.

#### `hypergraph_rag` — `HyperGraph_RAG`

Edge-first extraction: each hyperedge is a `knowledge_segment` with embedded `related_entities` (`NodeSchema` list including `key_score`). Edge key is MD5 of segment text. Uses `MergeStrategy.LLM.BALANCED` for both nodes and edges. No separate node-extraction stage.
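
The three edge-key schemes described for `light_rag`, `hyper_rag`, and `hypergraph_rag` can be compared side by side. This is a minimal sketch: the function names are hypothetical, and the UTF-8 encoding and hex digest form of the MD5 key are assumptions rather than confirmed details of the implementation.

```python
import hashlib
from typing import List, Tuple


def light_rag_edge_key(source: str, target: str) -> str:
    """Binary edge key: direction matters, so A->B and B->A differ."""
    return f"{source}->{target}"


def hyper_rag_edge_key(participants: List[str]) -> Tuple[str, ...]:
    """Hyperedge key: sorting makes the key independent of participant order."""
    return tuple(sorted(participants))


def hypergraph_rag_edge_key(knowledge_segment: str) -> str:
    """Segment key: MD5 of the segment text (encoding and hex form assumed)."""
    return hashlib.md5(knowledge_segment.encode("utf-8")).hexdigest()


binary_key = light_rag_edge_key("Tesla", "Edison")
hyper_key = hyper_rag_edge_key(["Westinghouse", "Tesla", "Edison"])
segment_key = hypergraph_rag_edge_key("Tesla demonstrated AC power in 1893.")
```

One consequence of these choices: two hyperedges with the same participants collide under the sorted-tuple key, while text-hashed keys only collide when the segment text is identical.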

#### `cog_rag` — `Cog_RAG`

Dual-layer system, not a direct `AutoHypergraph` subclass:

| Layer | Class | Pattern | Output |
|---|---|---|---|
| Theme (macro) | `Cog_RAG_ThemeLayer` | Edge-first themes | `ThemeSchema` hyperedges with nested participant nodes |
| Detail (micro) | `Cog_RAG_DetailLayer` | Two-stage entities + hyperedges | `EdgeSchema` with `participants`, `keywords`, `strength` |

Public API: `feed_text`, `build_index`, `search(top_k_themes=3, top_k_entities=3)`, `chat`, `dump`, `load`, `show` (interactive layer picker). `nodes` and `edges` properties aggregate both layers.

### Typical methods (`hyperextract/methods/typical`)

#### `itext2kg` — `iText2KG`

Nested `NodeSchema` (`label`, `name`) and `EdgeSchema` (`startNode`, `endNode`, `name`). Two-stage extraction. Merge strategy: `MergeStrategy.LLM.BALANCED`.

#### `itext2kg_star` — `iText2KG_Star`

Edge-first extraction; nodes derived from edge endpoints. Sets `properties.observation_date` after extraction. Includes `match_nodes_and_update_edges(threshold=0.8)` for SemHash-based semantic deduplication. Merge strategy: `MergeStrategy.KEEP_EXISTING`.

#### `kg_gen` — `KG_Gen`

Minimal triple schema: nodes have `name` only; edges use `subject`, `predicate`, `object`. Two-stage extraction with `MergeStrategy.KEEP_EXISTING`. Post-extraction helpers:

- `deduplicate(threshold=0.9)` — returns new `KG_Gen` instance
- `self_deduplicate(threshold=0.9)` — in-place SemHash deduplication

#### `atom` — `Atom`

Temporal triple extraction pipeline:

1. Extract atomic factoids (`AtomicFactSchema`) per text chunk
2. Batch-extract edges from grouped facts
3. Derive nodes from edge `startNode` / `endNode`; set `t_obs` to observation timestamp

Edge fields: `t_start`, `t_end`, `t_obs`, `atomic_facts` (verbatim evidence). Custom `EDGE_MERGE_RULE` via `CustomRuleMerger`. Includes `match_nodes_and_update_edges(threshold=0.8)` for semantic node merging.
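
The data flow of the three steps above can be sketched without any LLM calls. This is an illustration only: `batch_facts`, `derive_nodes`, and `stamp_observation` are hypothetical helpers, the edges are hand-written sample data, and the real pipeline produces both facts and edges through prompts.

```python
from typing import Dict, Iterator, List, Set


def batch_facts(facts: List[str], facts_per_chunk: int = 10) -> Iterator[List[str]]:
    """Yield consecutive groups of at most facts_per_chunk atomic facts."""
    if facts_per_chunk < 1:
        raise ValueError("facts_per_chunk must be >= 1")
    for start in range(0, len(facts), facts_per_chunk):
        yield facts[start:start + facts_per_chunk]


def derive_nodes(edges: List[Dict[str, str]]) -> Set[str]:
    """Collect node names from edge endpoints instead of a separate node pass."""
    nodes: Set[str] = set()
    for edge in edges:
        nodes.add(edge["startNode"])
        nodes.add(edge["endNode"])
    return nodes


def stamp_observation(edges: List[Dict[str, str]], t_obs: str) -> List[Dict[str, str]]:
    """Return copies of the edges with t_obs set to the observation timestamp."""
    return [{**edge, "t_obs": t_obs} for edge in edges]


facts = [f"fact {i}" for i in range(25)]
batches = list(batch_facts(facts))

sample_edges = [
    {"startNode": "Tesla", "endNode": "Edison", "t_start": "1884"},
    {"startNode": "Tesla", "endNode": "Westinghouse", "t_start": "1888"},
]
nodes = derive_nodes(sample_edges)
stamped = stamp_observation(sample_edges, "2024-06-15")
```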

<Warning>
This applies to `graph_rag` only: `Graph_RAG.build_communities()` requires the optional packages `networkx` and `graspologic`. Community detection is skipped with a verbose message when `graspologic` is not installed.
</Warning>

## Extraction patterns

| Pattern | Methods | Behavior |
|---|---|---|
| Two-stage (nodes → edges) | `light_rag`, `graph_rag`, `hyper_rag`, `itext2kg`, `kg_gen`, `cog_rag` detail layer | `extraction_mode="two_stage"` |
| Edge-first (nodes derived) | `hypergraph_rag`, `itext2kg_star`, `cog_rag` theme layer | Custom `_extract_data` overrides |
| Facts → edges → nodes | `atom` | Atomic factoid pass, then edge extraction from fact batches |

## Direct class instantiation

Bypass the registry when tuning algorithm parameters directly:

```python
from hyperextract.methods.rag import Light_RAG
from hyperextract.methods.typical import Atom

rag = Light_RAG(llm_client=llm, embedder=embedder, chunk_size=1024, max_workers=5)
rag.feed_text(text)
rag.build_index()

atom = Atom(llm_client=llm, embedder=embedder, observation_time="2024-06-15")
atom.feed_text(text)
atom.match_nodes_and_update_edges(threshold=0.85)
```

Equivalent template path: `Template.create("method/light_rag", llm_client=llm, embedder=embedder, chunk_size=1024)`.

## Related pages

<CardGroup>
<Card title="Use extraction methods" href="/use-extraction-methods">
CLI `he parse -m` and Python `Template.create("method/...")` workflows with method-specific kwargs.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
When to pick domain YAML templates versus algorithm method templates.
</Card>
<Card title="Auto-Types" href="/auto-types">
`AutoGraph` and `AutoHypergraph` lifecycle, merge behavior, and indexing.
</Card>
<Card title="Method demos" href="/method-demos">
Runnable scripts under `examples/en/methods/` for each registered engine.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`Template.create`, `BaseAutoType` lifecycle, and `create_client` helpers.
</Card>
<Card title="Contributing" href="/contributing">
Development setup and how to add templates or register new extraction methods.
</Card>
</CardGroup>

---

## 19. Tesla biography recipe

> End-to-end CLI and Python workflow using `examples/en/tesla.md` with `general/biography_graph`: parse, visualize, semantic search, and Q&A with expected artifacts under the output directory.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/19-tesla-biography-recipe.md
- Generated: 2026-06-18T20:59:59.444Z

### Source Files

- `examples/en/tesla.md`
- `examples/en/tesla_question.md`
- `hyperextract/templates/presets/general/biography_graph.yaml`
- `README.md`
- `hyperextract/cli/cli.py`
- `hyperextract/__init__.py`

---

The canonical biography walkthrough feeds `examples/en/tesla.md` through the `general/biography_graph` preset (`type: temporal_graph`). The result is a Knowledge Abstract persisted under an output directory as `data.json`, `metadata.json`, and an `index/` folder. The index backs semantic `he search` and `he talk`; `he show` or `AutoTemporalGraph.show()` opens the OntoSight visualization.

## Recipe components

| Component | Path or ID | Role |
|-----------|------------|------|
| Source document | `examples/en/tesla.md` | English biography of Nikola Tesla (1856–1943): early life, Edison, War of Currents, Colorado Springs, Wardenclyffe, relationships, timeline |
| Sample questions | `examples/en/tesla_question.md` | Three bullet prompts for search and chat verification |
| Template preset | `general/biography_graph` | Domain YAML mapped to `AutoTemporalGraph`; extracts entities and time-stamped relations |
| Output directory | `./output/` (CLI default in README) | On-disk Knowledge Abstract consumed by `he show`, `he search`, `he talk`, `he feed`, `he info` |

The template declares bilingual support (`language: [zh, en]`) but this recipe uses `-l en` / `language="en"`.

### What `general/biography_graph` extracts

`general/biography_graph` is a `temporal_graph` template. It instructs the LLM to extract:

| Layer | Fields | Notes |
|-------|--------|-------|
| **Entities** | `name`, `type`, `description` | Types include person, location, organization, invention, event, and similar biography categories |
| **Relations** | `source`, `target`, `type`, `time`, `description` | `time` anchors events (e.g., `1884`, `1893`); relation labels render as `{type}@{time}` |

Identifier rules deduplicate by entity `name` and relation key `{source}|{type}|{target}`. Extraction guidelines require explicit text evidence—no inferred common-sense links—and use `observation_time` (defaults to today's date in `YYYY-MM-DD`) to resolve relative dates.
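
The identifier and label rules can be illustrated on plain dictionaries. This is a minimal sketch with hand-written sample relations: `relation_key`, `relation_label`, and `dedupe_relations` are hypothetical helpers, keeping the first occurrence is an assumption (the template's merge strategy may combine records instead), and falling back to the bare type when `time` is missing is also assumed.

```python
from typing import Dict, List


def relation_key(rel: Dict[str, str]) -> str:
    """Build the documented identifier {source}|{type}|{target}."""
    return f"{rel['source']}|{rel['type']}|{rel['target']}"


def relation_label(rel: Dict[str, str]) -> str:
    """Render {type}@{time}; use the bare type when no time is present (assumed)."""
    time = rel.get("time")
    return f"{rel['type']}@{time}" if time else rel["type"]


def dedupe_relations(relations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first relation seen for each identifier."""
    unique: Dict[str, Dict[str, str]] = {}
    for rel in relations:
        unique.setdefault(relation_key(rel), rel)
    return list(unique.values())


sample = [
    {"source": "Tesla", "type": "employed_by", "target": "Edison Machine Works", "time": "1884"},
    {"source": "Tesla", "type": "employed_by", "target": "Edison Machine Works", "time": "1884"},
    {"source": "Tesla", "type": "invented", "target": "Tesla coil"},
]
unique_relations = dedupe_relations(sample)
labels = [relation_label(rel) for rel in unique_relations]
```

Note that `time` is not part of the key, so two relations that differ only in their date share one identifier.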

## Prerequisites

<Steps>
<Step title="Install Hyper-Extract">

```bash
uv tool install hyperextract
# or: uv pip install hyperextract
```

</Step>

<Step title="Configure LLM and embedder">

Provider setup is BYOC/BYOK. Any endpoint with structured output (`json_schema` or function calling) works.

```bash
he config init -k YOUR_API_KEY
```

For local vLLM, configure LLM and embedder separately. See [Configure providers](/configure-providers).

</Step>

<Step title="Confirm sample files">

From the repository root:

```bash
ls examples/en/tesla.md examples/en/tesla_question.md
```

</Step>
</Steps>

<Warning>
Knowledge templates require `--lang` / `language`. Omitting it causes `he parse` to exit with an error. Method templates ignore language and always use English prompts.
</Warning>

## CLI workflow

<Steps>
<Step title="Parse the biography">

```bash
he parse examples/en/tesla.md \
  -t general/biography_graph \
  -o ./output/ \
  -l en
```

| Flag | Purpose |
|------|---------|
| `-t general/biography_graph` | Resolve preset biography graph template |
| `-o ./output/` | Write Knowledge Abstract artifacts |
| `-l en` | Localize prompts and field descriptions to English |
| `-f` | Overwrite when output directory is non-empty |
| `--no-index` | Skip index build; run `he build-index ./output/` before search/talk |

`he parse` calls `Template.create`, `feed_text`, `dump`, `build_index` (unless `--no-index`), and dumps again with the index.

</Step>

<Step title="Inspect the Knowledge Abstract">

```bash
he info ./output/
```

Expect `Template: general/biography_graph`, `Language: en`, non-zero `Nodes` and `Edges`, and `Index: Built` when indexing ran during parse.

</Step>

<Step title="Visualize with OntoSight">

```bash
he show ./output/
```

Loads metadata from `metadata.json`, recreates the template instance, and opens an interactive graph in the browser. Nodes are entities; edges are relations labeled with type and time.

</Step>

<Step title="Run semantic search">

```bash
he search ./output/ "What are Tesla's major inventions and their significance?" -n 5
```

Additional queries from `examples/en/tesla_question.md`:

```bash
he search ./output/ "War of Currents participants"
he search ./output/ "Tesla Edison relationship"
```

`he search` requires a populated `index/` directory.

</Step>

<Step title="Chat over the graph">

Single question:

```bash
he talk ./output/ -q "How did Tesla's relationship with Edison evolve over time?"
```

Interactive session:

```bash
he talk ./output/ -i
```

Use `-n` to widen retrieved context (default `top_k=3`).

</Step>

<Step title="Optionally evolve the graph">

```bash
he feed ./output/ additional_tesla_notes.md
he build-index ./output/   # if index was stale or built with --no-index
he show ./output/
```

`he feed` reads template and language from `metadata.json` automatically.

</Step>
</Steps>

### Expected output layout

:::files
./output/
├── data.json          # entities + relations (temporal graph payload)
├── metadata.json      # template, lang, type, timestamps
└── index/             # vector store for search/chat (FAISS-backed)
:::

`data.json` holds the structured graph. `metadata.json` records at minimum:

| Key | Example value |
|-----|---------------|
| `template` | `general/biography_graph` |
| `lang` | `en` |
| `type` | `temporal_graph` |
| `created_at` / `updated_at` | ISO timestamps |

```mermaid
flowchart LR
  subgraph input
    T[examples/en/tesla.md]
  end
  subgraph runtime
    P[he parse / Template.create]
    E[LLM structured extraction]
    I[build_index + embedder]
  end
  subgraph ka["./output/ Knowledge Abstract"]
    D[data.json]
    M[metadata.json]
    X[index/]
  end
  subgraph query
    S[he search]
    C[he talk]
    V[he show / OntoSight]
  end
  T --> P --> E --> D
  P --> M
  E --> I --> X
  D --> S
  X --> S
  D --> C
  X --> C
  D --> V
```

## Python workflow

The Python path mirrors `he parse` but gives explicit control over indexing, persistence, and `observation_time`.

<CodeGroup>

```python title="Minimal (matches README)"
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")

with open("examples/en/tesla.md") as f:
    ka.feed_text(f.read())

ka.build_index()
ka.show()
```

```python title="Full lifecycle with persistence and Q&A"
from pathlib import Path
from hyperextract import Template

OUTPUT = Path("./output")

ka = Template.create(
    "general/biography_graph",
    language="en",
    observation_time="2026-06-18",  # optional; defaults to today
)

text = Path("examples/en/tesla.md").read_text(encoding="utf-8")
ka.feed_text(text)

ka.build_index()
ka.dump(OUTPUT)

# Semantic search returns (nodes, edges)
nodes, edges = ka.search("Tesla coil inventions", top_k=5)
for node in nodes:
    print(node.name, node.type)

# Chat with retrieved context
response = ka.chat("What was the War of Currents?")
print(response.content)

ka.show()

# Reload later without re-parsing
ka2 = Template.create("general/biography_graph", language="en")
ka2.load(OUTPUT)
print(ka2.chat("Summarize Tesla's career in three sentences").content)
```

```python title="Branching with parse() (non-destructive preview)"
from pathlib import Path
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")
preview = ka.parse(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
preview.build_index()
preview.show()
```

</CodeGroup>

<Tip>
`Template.create` reads LLM and embedder clients from `~/.he/config.toml` when not passed explicitly. Pass `llm_client` and `embedder` for programmatic mixed-cloud or vLLM setups.
</Tip>

### Python vs CLI equivalence

| Step | CLI | Python |
|------|-----|--------|
| Instantiate template | implicit in `he parse` | `Template.create("general/biography_graph", language="en")` |
| Extract | implicit in `he parse` | `ka.feed_text(text)`, or `ka.parse(text)` for a new instance |
| Build index | default in `he parse` (skipped with `--no-index`) | `ka.build_index()` |
| Persist | implicit in `he parse` (writes to the `-o` directory) | `ka.dump(Path("./output"))` |
| Visualize | `he show ./output/` | `ka.show()` |
| Search | `he search ./output/ QUERY` | `ka.search(query, top_k=n)` → `(nodes, edges)` |
| Chat | `he talk ./output/ -q QUERY` | `ka.chat(query, top_k=n)` → `AIMessage` |

## Verification checklist

Run these checks after extraction to confirm the recipe succeeded:

| Check | Command or action | Pass signal |
|-------|-------------------|-------------|
| Artifacts exist | `ls ./output/data.json ./output/metadata.json ./output/index/` | All three present |
| Template bound | `he info ./output/` | `general/biography_graph`, `en` |
| Graph populated | `he info ./output/` | `Nodes` and `Edges` > 0 |
| Search works | `he search ./output/ "induction motor"` | JSON results returned |
| Chat grounded | `he talk ./output/ -q "Who was George Westinghouse?"` | Answer references extracted entities |
| Questions file | Loop `examples/en/tesla_question.md` bullets through `he talk` | Coherent answers per prompt |
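
The artifact checks in the first two rows can also be scripted. This is a minimal sketch using only the standard library: `check_knowledge_abstract` is a hypothetical helper, it tests the documented layout and metadata keys only, and it does not validate the graph payload or the index contents.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

REQUIRED_METADATA_KEYS = ("template", "lang", "type")


def check_knowledge_abstract(ka_dir: Path) -> Dict[str, bool]:
    """Report whether the documented files and metadata keys are present."""
    root = Path(ka_dir)
    metadata_path = root / "metadata.json"
    index_dir = root / "index"
    checks = {
        "data.json": (root / "data.json").is_file(),
        "metadata.json": metadata_path.is_file(),
        "index/": index_dir.is_dir() and any(index_dir.iterdir()),
    }
    metadata = {}
    if checks["metadata.json"]:
        metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    checks["metadata keys"] = all(key in metadata for key in REQUIRED_METADATA_KEYS)
    return checks


# Demo against a throwaway directory that mimics the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data.json").write_text("{}", encoding="utf-8")
    (root / "metadata.json").write_text(
        json.dumps({"template": "general/biography_graph", "lang": "en", "type": "temporal_graph"}),
        encoding="utf-8",
    )
    (root / "index").mkdir()
    (root / "index" / "placeholder.bin").write_bytes(b"\x00")
    report = check_knowledge_abstract(root)
    missing_report = check_knowledge_abstract(root / "missing")
```

Against a real run, call `check_knowledge_abstract(Path("./output"))` and inspect which entries are `False`.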

<Check>
A successful run produces a queryable temporal graph where biography entities (Tesla, Edison, Westinghouse, Wardenclyffe Tower, AC motor, and similar) appear as nodes and dated relations (employed_by, invented, collaborated_with, and similar) appear as edges with optional `time` fields.
</Check>

## Troubleshooting

| Symptom | Cause | Fix |
|---------|-------|-----|
| `--lang is required for knowledge templates` | Missing `-l` on `he parse` | Add `-l en` |
| `Output directory already exists and is not empty` | Re-run into same path | Use `-f` or a new `-o` path |
| `Index not found` on search/talk | Parsed with `--no-index` or empty `index/` | `he build-index ./output/` |
| `No API key found` | Unconfigured provider | `he config init` or set `OPENAI_API_KEY` |
| `Not a valid Knowledge Abstract (no data.json)` | Wrong directory passed to `he show` | Point to the `-o` output folder |
| Empty or sparse graph | LLM without reliable structured output | Switch to a verified model per provider docs |
| Relative dates mis-resolved | Default `observation_time` is today | Pass `observation_time="YYYY-MM-DD"` in `Template.create` |

Enable debug logging when extraction fails silently:

```bash
export HYPER_EXTRACT_LOG_LEVEL=DEBUG
he parse examples/en/tesla.md -t general/biography_graph -o ./output/ -l en
```

## Related pages

<CardGroup>
<Card title="Quickstart" href="/quickstart">
First successful extraction with `he config init`, `he parse`, `he search`, and the Python `Template.create` path.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk `data.json`, `metadata.json`, and `index/` layout plus lifecycle methods.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
`he search`, `he talk`, `he show`, and `AutoType.show()` in depth.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
Why domain templates like `general/biography_graph` require `--lang` while method templates do not.
</Card>
<Card title="Troubleshooting" href="/troubleshooting">
Common failure modes for missing keys, indexes, and template resolution.
</Card>
</CardGroup>

---

## 20. Method demos

> Runnable scripts under `examples/en/methods/` for each extraction engine: instantiate method classes, `feed_text`, `chat`, and `show` with LangChain clients and dotenv configuration.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/20-method-demos.md
- Generated: 2026-06-18T20:59:41.235Z

### Source Files

- `examples/en/methods/light_rag_demo.py`
- `examples/en/methods/graph_rag_demo.py`
- `examples/en/methods/hyper_rag_demo.py`
- `examples/en/methods/atom_demo.py`
- `examples/en/methods/kg_gen_demo.py`
- `examples/en/tesla.md`

---

Nine runnable Python scripts under `examples/en/methods/` exercise every registered extraction method against the shared Tesla biography corpus. Each script follows the same lifecycle: load credentials with `dotenv`, construct LangChain LLM and embedder clients, instantiate a method class, call `feed_text`, run batch Q&A via `chat`, and open an OntoSight visualization with `show`.

## Demo inventory

All nine demos share input files and control flow; they differ only in the method class imported and the extraction statistics printed after `feed_text`.

| Script | Method class | Import path | Registry ID | AutoType | Post-extraction stats |
|--------|-------------|-------------|-------------|----------|----------------------|
| `light_rag_demo.py` | `Light_RAG` | `hyperextract.methods.rag` | `light_rag` | `graph` | `nodes`, `edges` |
| `graph_rag_demo.py` | `Graph_RAG` | `hyperextract.methods.rag` | `graph_rag` | `graph` | `nodes`, `edges` |
| `hyper_rag_demo.py` | `Hyper_RAG` | `hyperextract.methods.rag` | `hyper_rag` | `hypergraph` | `nodes`, `edges` (hyperedges) |
| `hypergraph_rag_demo.py` | `HyperGraph_RAG` | `hyperextract.methods.rag` | `hypergraph_rag` | `hypergraph` | `nodes`, `edges` (hyperedges) |
| `cog_rag_demo.py` | `Cog_RAG` | `hyperextract.methods.rag` | `cog_rag` | `hypergraph` | `nodes`, `edges` (dual-layer aggregate) |
| `itext2kg_demo.py` | `iText2KG` | `hyperextract.methods.typical` | `itext2kg` | `graph` | `nodes`, `edges` |
| `itext2kg_star_demo.py` | `iText2KG_Star` | `hyperextract.methods.typical` | `itext2kg_star` | `graph` | `nodes`, `edges` |
| `kg_gen_demo.py` | `KG_Gen` | `hyperextract.methods.typical` | `kg_gen` | `graph` | `nodes`, `edges` |
| `atom_demo.py` | `Atom` | `hyperextract.methods.typical` | `atom` | `graph` | `nodes`, `edges` (derived from atomic facts) |

<Note>
Method prompts are authored in English. The `examples/zh/methods/` directory mirrors every script with Chinese corpus files (`sushi.md`, `sushi_question.md`); entity names preserve the input language during extraction.
</Note>

## File layout

:::files
examples/
├── en/
│   ├── tesla.md              # Shared biography input (English)
│   ├── tesla_question.md     # Three evaluation questions
│   └── methods/
│       ├── light_rag_demo.py
│       ├── graph_rag_demo.py
│       ├── hyper_rag_demo.py
│       ├── hypergraph_rag_demo.py
│       ├── cog_rag_demo.py
│       ├── itext2kg_demo.py
│       ├── itext2kg_star_demo.py
│       ├── kg_gen_demo.py
│       └── atom_demo.py
└── zh/
    └── methods/              # Same nine scripts, Chinese corpus
:::

## Prerequisites

<Steps>
<Step title="Install Hyper-Extract">

```bash
uv pip install hyperextract
```

Python 3.11+ is required. Optional provider extras (`anthropic`, `google`, `all`) apply only when using non-OpenAI LangChain clients.

</Step>
<Step title="Configure credentials">

Copy `.env.example` to `.env` at the repository root and set provider variables:

```bash
cp .env.example .env
```

<ParamField body="OPENAI_API_KEY" type="string" required>
API key for the LLM and embedder endpoints.
</ParamField>

<ParamField body="OPENAI_BASE_URL" type="string">
Optional proxy or compatible endpoint base URL. Defaults to `https://api.openai.com/v1`.
</ParamField>

Each demo calls `load_dotenv()` at startup, so a root-level `.env` file is picked up automatically.

</Step>
<Step title="Verify LangChain clients">

Demos instantiate clients directly:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

llm = ChatOpenAI(model="gpt-4o-mini")
embedder = OpenAIEmbeddings(model="text-embedding-3-small")
```

For BYOC/BYOK deployments, swap in `create_client()` from `hyperextract` instead — see [Configure providers](/configure-providers).

</Step>
</Steps>

## Shared demo pattern

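Every script under `examples/en/methods/` implements the same pipeline: load the environment, read the inputs, build the clients, instantiate the method, extract, answer the questions, and visualize.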

```mermaid
sequenceDiagram
    participant Script as Demo script
    participant Env as dotenv / .env
    participant LC as LangChain clients
    participant Method as Method class
    participant OS as OntoSight (show)

    Script->>Env: load_dotenv()
    Script->>Script: read tesla.md + tesla_question.md
    Script->>LC: ChatOpenAI + OpenAIEmbeddings
    Script->>Method: __init__(llm_client, embedder)
    Script->>Method: feed_text(text)
    Method-->>Script: nodes / edges populated
    loop Each question
        Script->>Method: chat(question)
        Method-->>Script: AIMessage.content
    end
    Script->>OS: show()
```

### Path resolution

Scripts resolve the repository root four levels above the demo file:

```python
project_root = Path(__file__).resolve().parent.parent.parent.parent
INPUT_FILE = project_root / "examples" / "en" / "tesla.md"
QUESTION_FILE = project_root / "examples" / "en" / "tesla_question.md"
```

Run demos from any working directory; paths are anchored to the script location, not the current shell cwd.
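
The four-level climb can be checked with pure path arithmetic. The sketch below uses a hypothetical checkout location (`/repo`) and `PurePosixPath`, so it touches no files; it shows that the chained `.parent` calls are equivalent to `parents[3]`.

```python
from pathlib import PurePosixPath

# Hypothetical checkout location, used only to illustrate the arithmetic.
demo_file = PurePosixPath("/repo/examples/en/methods/light_rag_demo.py")

via_parent_chain = demo_file.parent.parent.parent.parent
via_parents_index = demo_file.parents[3]  # parents[0] is the containing directory

input_file = via_parents_index / "examples" / "en" / "tesla.md"
```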

### Core lifecycle calls

<ParamField body="llm_client" type="BaseChatModel" required>
LangChain chat model passed to the method constructor.
</ParamField>

<ParamField body="embedder" type="Embeddings" required>
LangChain embedding model for vector indexing inside `feed_text`.
</ParamField>

<ParamField body="feed_text(text)" type="str → self">
Ingests document text, runs the method's extraction pipeline, and merges results into the in-memory Knowledge Abstract. Supports chaining: `method.feed_text(t1).feed_text(t2)`.
</ParamField>

<ParamField body="chat(query)" type="str → AIMessage">
Retrieves relevant graph or hypergraph context, then generates an answer. Demos print `result.content`.
</ParamField>

<ParamField body="show()" type="None">
Opens an OntoSight interactive visualization with embedded search and chat callbacks when vector indices exist.
</ParamField>

### Representative script

`light_rag_demo.py` is the canonical template; other demos differ only in the import, class name, and stat line:

```python
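from pathlib import Path

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from hyperextract.methods.rag import Light_RAG

load_dotenv()  # picks up OPENAI_API_KEY / OPENAI_BASE_URL from .env

# Inputs are anchored to the repository root (see "Path resolution").
project_root = Path(__file__).resolve().parent.parent.parent.parent
text = (project_root / "examples" / "en" / "tesla.md").read_text(encoding="utf-8")
question_file = project_root / "examples" / "en" / "tesla_question.md"
question_lines = question_file.read_text(encoding="utf-8").splitlines()
questions = [line.strip() for line in question_lines if line.strip()]

llm = ChatOpenAI(model="gpt-4o-mini")
embedder = OpenAIEmbeddings(model="text-embedding-3-small")

rag = Light_RAG(llm_client=llm, embedder=embedder)
rag.feed_text(text)
print(f"Extracted {len(rag.nodes)} entities, {len(rag.edges)} relations")

for q in questions:
    result = rag.chat(q)
    print(result.content)

rag.show()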
```

## Input corpus and questions

### `examples/en/tesla.md`

Nikola Tesla biography (1856–1943) covering early life, work with Edison, the War of Currents, Colorado Springs experiments, and Wardenclyffe Tower. All nine English method demos feed this single document.

### `examples/en/tesla_question.md`

Three evaluation questions, one per line:

- What are Tesla's major inventions and their significance?
- What was the "War of Currents" and who were the main participants?
- How did Tesla's relationship with Edison evolve over time?

Demos read non-empty lines into a list and iterate `chat` over each question inside a try/except block.
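
The read-and-iterate pattern can be sketched with a stub in place of the method's `chat`. This is a minimal illustration: `load_questions`, `ask_all`, and `fake_chat` are hypothetical names, stripping a leading `- ` bullet is an assumption about the file format, and the stub returns a plain string where the real `chat` returns an `AIMessage`.

```python
from typing import Callable, List, Tuple


def load_questions(raw: str) -> List[str]:
    """Keep non-empty lines; drop a leading "- " bullet if present (assumed format)."""
    questions: List[str] = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("- "):
            line = line[2:].strip()
        if line:
            questions.append(line)
    return questions


def ask_all(questions: List[str], chat: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Ask every question; a failure is recorded and the loop continues."""
    results: List[Tuple[str, str]] = []
    for question in questions:
        try:
            answer = chat(question)
        except Exception as exc:  # one failing question must not stop the rest
            answer = f"Error: {exc}"
        results.append((question, answer))
    return results


def fake_chat(question: str) -> str:
    """Stub standing in for method.chat(question).content."""
    if "Edison" in question:
        raise RuntimeError("boom")
    return f"answer to {question}"


raw_file = "- What was the War of Currents?\n\n- Who was Edison?\n"
qa_pairs = ask_all(load_questions(raw_file), fake_chat)
```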

## Run a demo

<Steps>
<Step title="Pick a method">

```bash
python examples/en/methods/light_rag_demo.py
```

Replace `light_rag_demo.py` with any script from the inventory table.

</Step>
<Step title="Observe extraction output">

After `feed_text`, the script prints entity and relation counts. Graph methods report `len(rag.nodes)` and `len(rag.edges)`. Hypergraph methods (`Hyper_RAG`, `HyperGraph_RAG`) store hyperedges in the `edges` property on `AutoHypergraph`.

</Step>
<Step title="Review Q&A results">

Each question prints as `Q: …` followed by `A: …` from `result.content`. Errors are caught and printed without stopping the loop.

</Step>
<Step title="Visualize">

`show()` launches OntoSight. For graph and hypergraph AutoTypes, search and chat callbacks are wired when vector indices are present.

</Step>
</Steps>

<RequestExample>

```bash
python examples/en/methods/graph_rag_demo.py
```

</RequestExample>

<ResponseExample>

```text
============================================================
Graph RAG Demo
============================================================
Extracting entities and relations from Tesla's biography...

✓ Extracted 42 entities, 38 relations

------------------------------------------------------------
Q&A
------------------------------------------------------------

Q: What are Tesla's major inventions and their significance?
A: Tesla's major inventions include the AC induction motor, the Tesla coil, ...
```

</ResponseExample>

## Method-specific behavior

### RAG methods (`hyperextract.methods.rag`)

Five graph- and hypergraph-based RAG engines share constructor kwargs:

<ParamField body="chunk_size" type="int" default="2048">
Characters per text chunk during extraction.
</ParamField>

<ParamField body="chunk_overlap" type="int" default="256">
Overlap between consecutive chunks.
</ParamField>

<ParamField body="max_workers" type="int" default="10">
Concurrency limit for batch LLM calls.
</ParamField>

<ParamField body="verbose" type="bool" default="false">
Emit detailed extraction logs when `True`.
</ParamField>
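
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a rough character-window sketch. It assumes plain fixed-width slicing; the splitter the methods actually use may respect sentence or token boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 2048, chunk_overlap: int = 256) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

With the defaults, each window advances by 1,792 characters, so a larger overlap raises the number of LLM calls for the same document.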

| Class | Distinction |
|-------|-------------|
| `Light_RAG` | Lightweight binary-edge graph RAG |
| `Graph_RAG` | Community-detection graph RAG |
| `Hyper_RAG` | N-ary hyperedge extraction via `AutoHypergraph` |
| `HyperGraph_RAG` | Advanced multi-entity hypergraph construction |
| `Cog_RAG` | Dual-layer system: theme layer (macro narratives) + detail layer (micro entities) |

`Cog_RAG` is structurally different: it does not subclass `AutoGraph` or `AutoHypergraph`. It owns `theme_layer` and `detail_layer` sub-instances, aggregates `nodes` and `edges` across both, and its `show()` prompts interactively for layer selection (Detail Layer default, Theme Layer optional).

### Typical methods (`hyperextract.methods.typical`)

Four canonical graph-building pipelines subclass `AutoGraph`:

| Class | Role |
|-------|------|
| `iText2KG` | Triple-based knowledge graph extraction |
| `iText2KG_Star` | iText2KG with semantic deduplication |
| `KG_Gen` | Structured knowledge graph generation |
| `Atom` | Two-stage atomic-fact extraction with temporal fields (`t_start`, `t_end`, `t_obs`) and evidence attribution |

`Atom` accepts two additional constructor arguments:

<ParamField body="observation_time" type="string">
Date baseline for resolving relative temporal expressions (for example `1997-10-10`). Defaults to the current date when omitted.
</ParamField>

<ParamField body="facts_per_chunk" type="int" default="10">
Maximum atomic facts batched into a single edge-extraction call.
</ParamField>

## Alternative client setup

Demos use explicit LangChain constructors for clarity. For provider-preset deployments, replace the client block with `create_client()`:

<CodeGroup>
```python title="OpenAI preset"
from hyperextract import create_client
from hyperextract.methods.rag import Light_RAG

llm, embedder = create_client("openai")
rag = Light_RAG(llm_client=llm, embedder=embedder)
```

```python title="Bailian preset"
llm, embedder = create_client("bailian", api_key="sk-xxx")
```

```python title="Mixed vLLM + cloud"
llm, embedder = create_client(
    llm="bailian:qwen-plus",
    embedder="vllm:bge-m3@localhost:8001/v1",
    api_key="sk-xxx",
)
```
</CodeGroup>

See [Provider system](/provider-system) for preset identifiers and compatibility requirements.
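
The `provider:model@url` strings used in the mixed example can be read as three optional parts. The sketch below only illustrates that shape; `ClientSpec` and `parse_spec` are hypothetical names, and the library's own parser may normalize the URL (for example by adding a scheme) in ways not shown here.

```python
from typing import NamedTuple, Optional


class ClientSpec(NamedTuple):
    provider: str
    model: Optional[str]
    base_url: Optional[str]


def parse_spec(spec: str) -> ClientSpec:
    """Split 'provider[:model[@url]]' into its parts (illustrative only)."""
    # Split on the first ':' only, because the URL part may contain a port.
    provider, _, rest = spec.partition(":")
    model, _, url = rest.partition("@")
    return ClientSpec(provider, model or None, url or None)
```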

## CLI equivalent

Each demo method maps to a registered template ID usable from the CLI:

```bash
he parse examples/en/tesla.md -m light_rag -o ./output/light_rag
he talk ./output/light_rag -q "What was the War of Currents?"
he show ./output/light_rag
```

List all registered methods:

```bash
he list method
```

Demos exercise the direct Python class API; the CLI path persists an on-disk Knowledge Abstract under `-o`.

## Troubleshooting

<AccordionGroup>
<Accordion title="Missing API key or authentication error">

Confirm `OPENAI_API_KEY` is set in `.env` or exported in the shell before `load_dotenv()` runs. For non-OpenAI endpoints, set `OPENAI_BASE_URL` or switch to `create_client()` with the correct provider preset.

</Accordion>

<Accordion title="AttributeError on stat properties">

Hypergraph demos reference `hyper_edges` in print statements, but `AutoHypergraph` exposes hyperedges via the `edges` property. Use `len(method.edges)` when adapting demo output. `Atom` derives a graph from internal atomic facts; report `len(atom.nodes)` and `len(atom.edges)` rather than a `facts` attribute.

</Accordion>

<Accordion title="chat returns empty or generic answers">

Extraction quality depends on model structured-output support. Use models verified for `json_schema` or function calling. Enable debug logging:

```bash
export HYPER_EXTRACT_LOG_LEVEL=DEBUG
```

</Accordion>

<Accordion title="show() does not open or errors">

OntoSight requires a display environment. In headless environments, use `dump()` to persist the Knowledge Abstract and inspect `data.json` instead.

</Accordion>

<Accordion title="Cog_RAG show() blocks on input">

`Cog_RAG.show()` prompts for layer `1` (Detail) or `2` (Theme). In non-interactive runs it defaults to Detail Layer on `EOFError`.
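
The fallback behaves roughly like the sketch below. The prompt text and the `choose_layer` name are assumptions for illustration, not the library's code.

```python
from typing import Callable


def choose_layer(read: Callable[[str], str] = input) -> str:
    """Return 'detail' or 'theme'; fall back to 'detail' when stdin is closed."""
    try:
        choice = read("Layer [1=Detail, 2=Theme] (default 1): ").strip()
    except EOFError:
        return "detail"  # non-interactive run: no input available
    return "theme" if choice == "2" else "detail"
```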

</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Use extraction methods" href="/use-extraction-methods">
Invoke methods via `he parse -m` or `Template.create("method/…")`, including method-specific kwargs.
</Card>
<Card title="Extraction methods reference" href="/extraction-methods-reference">
Registry IDs, AutoType outputs, descriptions, and constructor parameters for all nine methods.
</Card>
<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
End-to-end CLI and template workflow on the same `tesla.md` corpus with `general/biography_graph`.
</Card>
<Card title="Configure providers" href="/configure-providers">
Set up LLM and embedder clients via `he config`, environment variables, or `create_client()`.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
CLI equivalents for `chat`, `search`, and `show` on persisted Knowledge Abstracts.
</Card>
</CardGroup>

---

## 21. Troubleshooting

> Common failure modes: missing API keys, vLLM `base_url` requirements, `--lang` required for knowledge templates, empty output directory conflicts, missing `data.json` or index for `search`/`talk`, template resolution errors, and debug logging via `HYPER_EXTRACT_LOG_LEVEL`.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/21-troubleshooting.md
- Generated: 2026-06-18T20:59:09.794Z

### Source Files

- `hyperextract/cli/utils.py`
- `hyperextract/cli/config.py`
- `hyperextract/cli/cli.py`
- `hyperextract/utils/logging.py`
- `hyperextract/cli/README.md`
- `tests/cli/test_verbose.py`

---
title: "Troubleshooting"
description: "Common failure modes: missing API keys, vLLM `base_url` requirements, `--lang` required for knowledge templates, empty output directory conflicts, missing `data.json` or index for `search`/`talk`, template resolution errors, and debug logging via `HYPER_EXTRACT_LOG_LEVEL`."
---

Hyper-Extract surfaces most CLI failures as `Error:` messages on stderr and exits with code `1`. Commands that call LLM or embedder clients (`he parse`, `he feed`, `he search`, `he talk`, `he show`, `he build-index`) run `validate_config()` first; Knowledge Abstract (KA) commands additionally validate directory layout via `validate_ka_path`, `validate_ka_with_data`, or `validate_ka_with_index` in `hyperextract/cli/utils.py`.

## Quick diagnosis

| Symptom | Likely cause | First fix |
|---------|--------------|-----------|
| `LLM API key is not configured` | Missing `[llm]` key and no `OPENAI_API_KEY` | `he config init -k YOUR_KEY` |
| `vLLM provider requires base_url` | `provider = "vllm"` without URL | `he config llm -p vllm -u http://localhost:8000/v1` |
| `--lang is required for knowledge templates` | Domain template without `-l` | Add `--lang en` or `--lang zh` |
| `Output directory already exists and is not empty` | Reusing a populated `-o` path | `-f` or pick a new directory |
| `no data.json` | Path is not a valid KA | Run `he parse` or point at the KA root |
| `Index not found` | `he parse --no-index` or never built index | `he build-index <ka_path>` |
| `Template '…' not found` | Wrong template ID or missing local YAML | `he list template` |
| No stage logs during extraction | Default log level is `WARNING` | `export HYPER_EXTRACT_LOG_LEVEL=DEBUG` |

<Info>
`he info <ka_path>` reports whether `data.json` exists and whether `index/` is populated. Use it before `he search` or `he talk`.
</Info>

## Configuration failures

### Missing API keys

`validate_config()` in `hyperextract/cli/config.py` checks both LLM and embedder credentials before any network call.

For non-vLLM providers, an empty API key triggers:

```text
Error: LLM API key is not configured. Run 'he config llm --api-key YOUR_KEY'
```

or the equivalent embedder message.

<Steps>
<Step title="Configure credentials">

```bash
he config init -k YOUR_API_KEY
```

Or set per service:

```bash
he config llm --api-key YOUR_KEY --model gpt-4o-mini
he config embedder --api-key YOUR_KEY --model text-embedding-3-small
```

</Step>
<Step title="Verify">

```bash
he config show
```

Keys resolve from `~/.he/config.toml` first, then fall back to `OPENAI_API_KEY` and `OPENAI_BASE_URL`.

</Step>
</Steps>

<ParamField body="OPENAI_API_KEY" type="string">
API key used when `[llm].api_key` or `[embedder].api_key` is empty in `~/.he/config.toml`.
</ParamField>

<ParamField body="OPENAI_BASE_URL" type="string">
Optional override for `[llm].base_url` and `[embedder].base_url` when not set in config.
</ParamField>
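
The precedence described above (config file first, environment second) can be sketched as a small lookup. `resolve_setting` is a hypothetical helper that takes an already-parsed config mapping; it is not the function used in `hyperextract/cli/config.py`.

```python
from typing import Mapping, Optional


def resolve_setting(
    config: Mapping[str, Mapping[str, str]],
    section: str,
    field: str,
    env: Mapping[str, str],
    env_var: str,
) -> Optional[str]:
    """Prefer the value from config.toml; otherwise fall back to the environment."""
    value = config.get(section, {}).get(field) or env.get(env_var)
    return value or None  # empty strings count as "not configured"
```

For example, `resolve_setting(cfg, "llm", "api_key", os.environ, "OPENAI_API_KEY")` returns the environment value only when `[llm].api_key` is missing or empty.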

### vLLM `base_url` requirements

The `vllm` provider preset sets `base_url`, `default_llm`, and `default_embedder` to `None`. Hyper-Extract never infers a local endpoint—you must supply URLs explicitly.

Validation errors:

```text
Error: vLLM provider requires base_url.
Error: vLLM embedder requires base_url.
```

If `provider` is `vllm` (or any preset with `base_url: None`) and no URL is in config or `OPENAI_BASE_URL`, `_resolve_base_url` raises:

```text
Provider 'vllm' requires explicit base_url. Please set it via config or environment variable.
```

<CodeGroup>
```bash title="CLI — separate LLM and embedder endpoints"
he config llm -p vllm -m Qwen3.5-9B -u http://localhost:8000/v1 -k dummy
he config embedder -p vllm -m bge-m3 -u http://localhost:8001/v1 -k dummy
```

```bash title="Environment variable"
export OPENAI_BASE_URL=http://localhost:8000/v1
```

```toml title="~/.he/config.toml"
[llm]
provider = "vllm"
model = "Qwen3.5-9B"
api_key = "dummy"
base_url = "http://localhost:8000/v1"

[embedder]
provider = "vllm"
model = "bge-m3"
api_key = "dummy"
base_url = "http://localhost:8001/v1"
```
</CodeGroup>
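
As a rough model of the check, assuming the exception type is `ValueError` (the source only shows the message) and that `vllm` is the only preset without a default URL:

```python
from typing import Mapping, Optional

# Assumption: presets whose base_url is None must be given an explicit endpoint.
PRESETS_WITHOUT_DEFAULT_URL = {"vllm"}


def require_base_url(
    provider: str, configured: Optional[str], env: Mapping[str, str]
) -> Optional[str]:
    """Return the effective base_url, or raise when a URL-less preset has none."""
    url = configured or env.get("OPENAI_BASE_URL")
    if not url and provider in PRESETS_WITHOUT_DEFAULT_URL:
        raise ValueError(
            f"Provider '{provider}' requires explicit base_url. "
            "Please set it via config or environment variable."
        )
    return url or None
```

`require_base_url` is a hypothetical stand-in for `_resolve_base_url`, written only to show why an empty config plus an unset `OPENAI_BASE_URL` fails for vLLM but not for presets that ship a default endpoint.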

<Warning>
`he config init -p vllm -k dummy` without `-u` saves vLLM credentials but leaves `base_url` empty. The next `he parse` fails at `validate_config()`. Always pass `--base-url` for vLLM quick init, or complete interactive setup and enter URLs when prompted.
</Warning>

<Tip>
Interactive `he config init` accepts `dummy` as the API key for vLLM and prompts separately for LLM and embedder base URLs (for example `http://localhost:8000/v1` and `http://localhost:8001/v1`).
</Tip>

## Parse and extraction failures

### `--lang` required for knowledge templates

Domain YAML templates (`general/biography_graph`, `finance/earnings_summary`, etc.) require a language because `TemplateFactory.create` calls `localize_template` with the supplied code.

CLI guard (knowledge templates only):

```text
Error: --lang is required for knowledge templates. Use --lang en or --lang zh.
```

Python API equivalent:

```text
ValueError: language is required for knowledge templates. Provide a language code (e.g., 'zh', 'en').
```

<RequestExample>
```bash
# Knowledge template — -l required
he parse document.md -t general/biography_graph -o ./out/ -l en

# Method template — -l optional (forced to en)
he parse document.md -m light_rag -o ./out/
```
</RequestExample>

Method templates (`method/light_rag`, `-m hyper_rag`, etc.) always use English prompts. If you pass `--lang`, the CLI prints a dim note and ignores it.

### Non-empty output directory conflicts

`he parse` refuses to write into a non-empty output directory unless `--force` is set:

```text
Error: Output directory already exists and is not empty. Use --force to overwrite.
```

An existing but **empty** directory is allowed. Use `-f` / `--force` to overwrite populated KA artifacts (`data.json`, `metadata.json`, `index/`).
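
The rule amounts to three conditions: the path is a directory, it has at least one entry, and `--force` was not given. A minimal sketch, with `check_output_dir` as a hypothetical name and `FileExistsError` as an assumed exception type:

```python
from pathlib import Path


def check_output_dir(output: Path, force: bool = False) -> None:
    """Raise if `output` is a populated directory and overwriting was not requested."""
    if output.is_dir() and any(output.iterdir()) and not force:
        raise FileExistsError(
            "Output directory already exists and is not empty. Use --force to overwrite."
        )
```

A missing path and an empty directory both pass, which matches the behavior described above.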

### Template resolution errors

Template lookup happens at two stages: config resolution (`Template.get`) before extraction, and instance creation (`Template.create` → `TemplateFactory.create`).

| Error | When | Fix |
|-------|------|-----|
| `Template 'xxx' not found` | Preset ID not in gallery and not a `.yaml` path | `he list template` or `he list template -q keyword` |
| `Template not found: {source}` | Python `Template.create` with invalid path | Confirm spelling; use full preset path like `general/biography_graph` |
| `Config file not found: {file_path}` | Custom YAML path missing | Check file path and permissions |
| `Template '…' not found in presets and local file '…yaml' does not exist` | Reloading KA whose template left the gallery | Copy `{template}.yaml` into the KA directory or re-parse with a current preset |

<Steps>
<Step title="List presets">

```bash
he list template
he list template -l zh -q finance
```

</Step>
<Step title="Resolve at parse time">

```bash
he parse doc.md -t general/biography_graph -o ./ka/ -l en
```

Omit `-t` for interactive selection when unsure of the ID.

</Step>
<Step title="Resolve when reloading a KA">

`he show`, `he search`, `he talk`, and `he build-index` read `metadata.json` and resolve the template via `get_template_from_ka`: preset name first, then `{ka_path}/{template}.yaml`.

</Step>
</Steps>
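
The reload order (preset name first, then a YAML file inside the KA directory) can be sketched as below. `resolve_template_source` is a hypothetical helper that mirrors the error messages listed on this page; it takes the set of known preset IDs as an argument instead of consulting the real gallery.

```python
import json
from pathlib import Path


def resolve_template_source(ka_path: Path, presets: set[str]) -> str:
    """Return a preset ID or a local YAML path for the template named in metadata.json."""
    meta_file = ka_path / "metadata.json"
    if not meta_file.is_file():
        raise FileNotFoundError("No metadata.json found in Knowledge Abstract")
    template = json.loads(meta_file.read_text(encoding="utf-8")).get("template")
    if not template:
        raise ValueError("No template specified in metadata.json")
    if template in presets:
        return template  # a matching preset wins over any local file
    local_yaml = ka_path / f"{template}.yaml"
    if local_yaml.is_file():
        return str(local_yaml)
    raise FileNotFoundError(
        f"Template '{template}' not found in presets and local file "
        f"'{local_yaml}' does not exist"
    )
```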

### Other parse input errors

```text
Error: No .txt or .md files found in {input}
```

Directory inputs only include `*.txt` and `*.md` files. Other extensions are skipped.

```text
Input file not found: {input_path}
```

Check the path or use `-` for stdin.

## Knowledge Abstract validation

A valid KA directory contains at minimum `data.json` and `metadata.json`. Semantic search and chat additionally require a non-empty `index/` subdirectory.

:::files
my_ka/
├── data.json        # extracted knowledge (required for most commands)
├── metadata.json    # template, lang, timestamps (required for feed/reload)
└── index/           # vector index (required for search/talk)
    └── …
:::

### Missing `data.json`

Commands using `validate_ka_with_data` (`he info`, `he show`, `he build-index`):

```text
Error: Not a valid Knowledge Abstract: {ka_path} (no data.json)
```

`he feed` only requires `metadata.json` at the KA root; it loads existing data via `ka.load()` and will fail later if `data.json` is absent.

### Missing index for `search` and `talk`

Both commands call `validate_ka_with_index`:

```text
Error: Index not found. Please run 'he build-index {ka_path}' first.
```

Common causes:

- `he parse … --no-index` skipped index creation
- `he feed` appended data without rebuilding the index
- `index/` exists but is empty
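
All three causes reduce to the same condition: `index/` is missing or has no entries. A pre-flight check along these lines can be run before scripting `he search` or `he talk`; `ka_status` is a hypothetical helper, not part of the CLI.

```python
from pathlib import Path


def ka_status(ka_path: Path) -> dict[str, bool]:
    """Report which parts of a Knowledge Abstract directory are present."""
    index_dir = ka_path / "index"
    return {
        "data": (ka_path / "data.json").is_file(),
        "metadata": (ka_path / "metadata.json").is_file(),
        # An existing but empty index/ counts as "not built".
        "index": index_dir.is_dir() and any(index_dir.iterdir()),
    }
```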

<Steps>
<Step title="Build or rebuild the index">

```bash
he build-index ./my_ka/
he build-index ./my_ka/ --force   # replace existing index
```

</Step>
<Step title="Confirm readiness">

```bash
he info ./my_ka/
# Index row should show: Built
```

</Step>
<Step title="Query">

```bash
he search ./my_ka/ "your query"
he talk ./my_ka/ -q "your question"
he talk ./my_ka/ -i
```

</Step>
</Steps>

<Note>
`he build-index` exits `0` with a warning when an index already exists and `--force` is not passed—it does not rebuild silently. Pass `-f` to force a rebuild.
</Note>

### Metadata and template reload failures

| Error | Command | Fix |
|-------|---------|-----|
| `Knowledge Abstract not found` | Any KA command | Check path |
| `Not a directory` | Any KA command | Point at the KA folder, not `data.json` |
| `Not a valid Knowledge Abstract directory` (no `metadata.json`) | `he feed` | Re-parse or restore `metadata.json` |
| `No metadata.json found in Knowledge Abstract` | `get_template_from_ka` | Restore metadata or re-parse |
| `No template specified in metadata.json` | `get_template_from_ka` | Ensure `template` field is set |
| `Please provide a query or use --interactive mode` | `he talk` without `-q` or `-i` | Add `-q "…"` or `-i` |

## Debug logging

Log level is controlled **only** by the `HYPER_EXTRACT_LOG_LEVEL` environment variable. The CLI configures structlog on every invocation via `configure_logging()` in `hyperextract/utils/logging.py`. The `--verbose` flag is not supported.

<ParamField body="HYPER_EXTRACT_LOG_LEVEL" type="string" default="WARNING">
Root log level. Accepted values: `DEBUG`, `INFO`, `WARNING`, `ERROR` (case-insensitive). Invalid values fall back to `WARNING`.
</ParamField>

<ParamField body="HYPER_EXTRACT_LOG_FILE" type="string">
Optional file path for duplicate log output alongside stderr.
</ParamField>
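
The level lookup can be pictured as a case-insensitive match with a `WARNING` fallback. This sketch uses the standard `logging` constants and a hypothetical `resolve_log_level` function; it does not reproduce `configure_logging()`.

```python
import logging
from typing import Mapping


def resolve_log_level(env: Mapping[str, str]) -> int:
    """Map HYPER_EXTRACT_LOG_LEVEL to a logging constant, defaulting to WARNING."""
    allowed = {
        "DEBUG": logging.DEBUG,
        "INFO": logging.INFO,
        "WARNING": logging.WARNING,
        "ERROR": logging.ERROR,
    }
    name = env.get("HYPER_EXTRACT_LOG_LEVEL", "WARNING").strip().upper()
    return allowed.get(name, logging.WARNING)  # invalid values fall back
```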

<CodeGroup>
```bash title="Trace a parse run"
export HYPER_EXTRACT_LOG_LEVEL=DEBUG
he parse examples/en/tesla.md -t general/biography_graph -o ./tesla_ka/ -l en
```

```bash title="Write logs to a file"
export HYPER_EXTRACT_LOG_LEVEL=INFO
export HYPER_EXTRACT_LOG_FILE=/tmp/hyper-extract.log
he search ./tesla_ka/ "Tesla"
```

```python title="Python API"
import os
os.environ["HYPER_EXTRACT_LOG_LEVEL"] = "DEBUG"

from hyperextract.utils.logging import configure_logging
configure_logging()

from hyperextract import Template
ka = Template.create("general/biography_graph", "en")
```
</CodeGroup>

At `DEBUG` or `INFO`, the `he` logger emits stage markers such as `stage=config_validated`, `stage=template_resolved`, `stage=feed_text_invoked`, `stage=index_built`, and `stage=search_complete`—useful for pinpointing whether failure occurs during config, template load, extraction, or indexing.

<ResponseExample>
```text
2026-06-18T12:00:00.000000Z [info     ] command=parse input=doc.md output=./ka template=general/biography_graph lang=en [he]
2026-06-18T12:00:00.100000Z [info     ] stage=config_validated         [he]
2026-06-18T12:00:01.200000Z [info     ] stage=template_resolved template=BiographyGraph [he]
2026-06-18T12:00:45.000000Z [info     ] stage=knowledge_extracted chars=12450 [he]
2026-06-18T12:01:10.000000Z [info     ] stage=index_built              [he]
```
</ResponseExample>
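
When logs are written to a file, the stage markers can be pulled out with a short script to see how far a run got. The sketch assumes the `stage=<name>` key/value format shown in the sample output; `stages_reached` is a hypothetical helper.

```python
import re

# Matches markers such as "stage=config_validated" anywhere in a log line.
STAGE_RE = re.compile(r"\bstage=(\w+)")


def stages_reached(log_text: str) -> list[str]:
    """Return stage markers in the order they appear in the log."""
    return STAGE_RE.findall(log_text)
```

The last element of the returned list is the final stage logged before the failure, which narrows the problem to config, template load, extraction, or indexing.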

## Error decision flow

```text
he <command> fails
        │
        ├─ "API key" / "base_url" ──► he config show → fix ~/.he/config.toml or env vars
        │
        ├─ "--lang is required" ─────► add -l en|zh (skip for -m methods)
        │
        ├─ "Output directory" ───────► -f or new -o path
        │
        ├─ "Template … not found" ───► he list template → fix -t / metadata template
        │
        ├─ "no data.json" ───────────► he parse or fix KA path
        │
        └─ "Index not found" ────────► he build-index <ka_path> [-f]
```

<AccordionGroup>
<Accordion title="Authentication or model errors at runtime">
Config validation only checks that keys and URLs are present—not that the provider accepts them. If extraction fails mid-run with HTTP 401/404 or schema errors, verify the model supports structured output (`json_schema` / function calling), confirm credits/quota, and match `base_url` to your deployment. See the provider pages for compatibility requirements.
</Accordion>
<Accordion title="Empty search results">
Index present but `he search` returns no hits: confirm `he info` shows `Nodes > 0`, try broader queries, increase `-n` / `--top-k`, and ensure document language matches the `-l` used at parse time.
</Accordion>
<Accordion title="Corrupted or partial KA">
Validate JSON with `python -c "import json; json.load(open('my_ka/data.json'))"`. If files are damaged, re-run `he parse` with `-f` or extract to a fresh output directory.
</Accordion>
</AccordionGroup>

## Related pages

<CardGroup>
<Card title="Configure providers" href="/configure-providers">
Set up LLM and embedder clients for OpenAI, Bailian, vLLM, and custom compatible endpoints.
</Card>
<Card title="Configuration reference" href="/configuration-reference">
`~/.he/config.toml` schema, provider presets, and environment variable precedence.
</Card>
<Card title="CLI reference" href="/cli-reference">
Full `he` command surface, flags, and exit conditions.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk KA layout, lifecycle methods, and when to rebuild indexes.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
Language requirements and template selection criteria.
</Card>
</CardGroup>

---

## 22. Contributing

> Development setup with `uv`, running `pytest` and coverage, CI matrix (Python 3.11–3.12, Ubuntu/macOS), lint workflow, optional integration tests, and how to add templates or register new extraction methods.

- Page Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/pages/22-contributing.md
- Generated: 2026-06-18T20:59:33.448Z

### Source Files

- `.github/workflows/test.yml`
- `.github/workflows/lint.yml`
- `.github/workflows/integration.yml`
- `pyproject.toml`
- `tests/conftest.py`
- `hyperextract/methods/registry.py`
- `README.md`

---
title: "Contributing"
description: "Development setup with `uv`, running `pytest` and coverage, CI matrix (Python 3.11–3.12, Ubuntu/macOS), lint workflow, optional integration tests, and how to add templates or register new extraction methods."
---

Hyper-Extract targets Python 3.11+, uses `uv` for dependency management, and gates changes through three GitHub Actions workflows: unit tests with coverage (`test.yml`), Ruff lint/format (`lint.yml`), and optional integration tests against a live LLM (`integration.yml`). Preset YAML templates auto-load from `hyperextract/templates/presets/`; extraction methods register through `hyperextract/methods/registry.py`.

## Prerequisites

| Requirement | Value |
|-------------|-------|
| Python | `>=3.11` (repo pins `3.11` in `.python-version`) |
| Package manager | [uv](https://docs.astral.sh/uv/) |
| License | Apache-2.0 |

<Note>
Unit tests run without API keys. `tests/conftest.py` injects `MockChatModel` and `MockEmbeddings` when `OPENAI_API_KEY` is unset or empty. CI explicitly sets `OPENAI_API_KEY: ""` for deterministic runs.
</Note>

## Development setup

<Steps>
<Step title="Clone and enter the repository">

```bash
git clone https://github.com/yifanfeng97/Hyper-Extract.git
cd Hyper-Extract
```

</Step>

<Step title="Install the package in editable mode">

<Tabs>
<Tab title="All provider extras (recommended)">

```bash
uv pip install -e ".[all]"
```

Installs core dependencies plus `anthropic` and `google` optional extras.

</Tab>
<Tab title="Core only">

```bash
uv pip install -e .
```

</Tab>
<Tab title="With dev tools">

```bash
uv pip install -e ".[all]"
uv pip install --group dev pytest pytest-cov ruff
```

The `dev` dependency group in `pyproject.toml` also includes MkDocs tooling for documentation builds.

</Tab>
</Tabs>

</Step>

<Step title="Verify the install">

```bash
he --help
python -c "from hyperextract import Template; print(len(Template.list()))"
```

</Step>
</Steps>

## Running tests

### Unit tests and coverage

The default test suite exercises `hyperextract/` and `tests/` without network calls when no API key is present.

```bash
pytest --cov=hyperextract --cov-report=term -v
```

CI additionally emits XML coverage for Codecov:

```bash
pytest --cov=hyperextract --cov-report=xml --cov-report=term -v
```

| Fixture | Scope | Behavior |
|---------|-------|----------|
| `is_real_env` | session | `True` when `OPENAI_API_KEY` is set and non-placeholder |
| `llm_client` | session | `ChatOpenAI` (real) or `MockChatModel` (mock) |
| `embedder` | session | `OpenAIEmbeddings` (real) or `MockEmbeddings` (mock) |

`tests/conftest.py` loads `.env` from the repo root and the current working directory before fixtures resolve.
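
The mock/real switch can be approximated as follows. The placeholder list is an assumption made for illustration; the values `tests/conftest.py` actually treats as placeholders may differ.

```python
from typing import Mapping

# Assumption: example placeholder values, not the list used by the real conftest.
PLACEHOLDER_KEYS = {"", "sk-...", "your-api-key", "dummy"}


def is_real_env(env: Mapping[str, str]) -> bool:
    """True only when OPENAI_API_KEY is set to something that looks like a real key."""
    key = env.get("OPENAI_API_KEY", "").strip()
    return key not in PLACEHOLDER_KEYS
```

Fixtures such as `llm_client` would then pick the real or mock client based on this flag.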

### Integration tests

Integration tests live under `tests/integration/` and are marked with `@pytest.mark.integration`. They call the real OpenAI API and skip when `OPENAI_API_KEY` is absent.

```bash
export OPENAI_API_KEY=sk-...
pytest -m integration -v --tb=short
```

<Warning>
Integration tests incur API cost. They are not part of the PR gate. CI runs them nightly (02:00 UTC) and on manual `workflow_dispatch` against the `integration-tests` GitHub environment, which supplies `OPENAI_API_KEY` from repository secrets.
</Warning>

### Test layout

:::files
tests/
├── conftest.py              # Mock/real auto-detection, shared fixtures
├── mocks.py                 # MockChatModel, MockEmbeddings
├── types/                   # AutoType unit tests
├── template_engine/         # Gallery, parser, factory tests
├── cli/                     # CLI behavior tests
├── utils/                   # Client/config tests
├── integration/             # Live-API extraction tests
└── test_data/               # Sample documents (en/, zh/)
:::

## Linting

CI runs Ruff on every push and PR to `main` or `develop` when `hyperextract/**/*.py` or `pyproject.toml` changes.

```bash
uv tool install ruff
ruff check hyperextract
ruff format --check hyperextract
```

Apply fixes locally:

```bash
ruff check --fix hyperextract
ruff format hyperextract
```

`pyproject.toml` configures `[tool.ruff.lint]` with `ignore = ["E731"]`.

## CI workflows

| Workflow | Trigger | Runner matrix | Purpose |
|----------|---------|---------------|---------|
| `test.yml` | Push/PR to `main`, `develop` | Ubuntu + macOS × Python 3.11/3.12 | Unit tests + coverage |
| `lint.yml` | Push/PR to `main`, `develop` | `ubuntu-latest` | Ruff check + format |
| `integration.yml` | Nightly cron, `workflow_dispatch` | `ubuntu-latest` | Live-API integration tests |

### Test matrix detail

The test workflow uses `fail-fast: false` and runs these combinations:

| OS | Python 3.11 | Python 3.12 |
|----|:-----------:|:-----------:|
| `ubuntu-latest` | ✓ | ✓ |
| `macos-latest` | — | ✓ |

macOS + Python 3.11 is excluded to reduce CI time. Coverage uploads to Codecov run only on `ubuntu-latest` + Python 3.12.
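
The effective job list is the full OS × Python product minus the excluded pair, which can be checked in a few lines (variable names are illustrative):

```python
from itertools import product

OSES = ["ubuntu-latest", "macos-latest"]
PYTHONS = ["3.11", "3.12"]
EXCLUDE = {("macos-latest", "3.11")}  # dropped to reduce CI time

# Every (os, python) pair that actually runs.
matrix = [combo for combo in product(OSES, PYTHONS) if combo not in EXCLUDE]
```

This yields three jobs per push or PR rather than four.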

Path filters limit CI to changes under `hyperextract/**/*.py`, `tests/**/*.py`, and `pyproject.toml`.

```text
push/PR ──► test.yml ──► matrix(os × python) ──► pytest --cov
         ──► lint.yml ──► ruff check + format --check
         ──► integration.yml (scheduled/manual) ──► pytest -m integration
```

## Adding preset templates

Preset templates are YAML files under `hyperextract/templates/presets/{domain}/`. The `Gallery` singleton scans `*.yaml` at import time and indexes each file as `{domain}/{name}`, where `name` comes from the YAML `name` field (not the filename).

<Steps>
<Step title="Pick a domain and base template">

| Domain | Examples | Base templates |
|--------|----------|----------------|
| `general/` | `biography_graph`, `concept_graph` | `base_model`, `base_list`, `base_graph`, … |
| `finance/` | `earnings_summary`, `event_timeline` | — |
| `medicine/`, `tcm/`, `legal/`, `industry/` | Domain-specific presets | — |

Start from the `base_*` template matching your target AutoType (`model`, `list`, `set`, `graph`, `hypergraph`, `temporal_graph`, `spatial_graph`, `spatio_temporal_graph`).

</Step>

<Step title="Author the YAML">

Follow `hyperextract/templates/DESIGN_GUIDE.md` for field naming, multilingual blocks, and validation. Minimum required fields:

<ParamField body="language" type="string | string[]" required>
Supported locales, e.g. `[zh, en]`.
</ParamField>

<ParamField body="name" type="string" required>
Template ID segment; combined with domain folder to form the gallery key (e.g. `finance/earnings_summary`).
</ParamField>

<ParamField body="type" type="string" required>
One of the eight AutoTypes.
</ParamField>

<ParamField body="output" type="object" required>
Field/entity/relation schemas for extraction.
</ParamField>

<ParamField body="guideline" type="object" required>
Extraction rules and prompts.
</ParamField>
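
A quick pre-check on a parsed template (for example, the dictionary returned by a YAML loader) might look like this. Both helpers are hypothetical and only encode the rules stated on this page: five required top-level fields, and a gallery key built from the domain folder plus the YAML `name` field.

```python
from pathlib import Path

REQUIRED_FIELDS = ("language", "name", "type", "output", "guideline")


def missing_fields(config: dict) -> list[str]:
    """List required top-level fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not config.get(field)]


def gallery_key(yaml_path: Path, config: dict) -> str:
    """Build '{domain}/{name}' from the parent folder and the YAML `name` field."""
    return f"{yaml_path.parent.name}/{config['name']}"
```

Because the key uses the `name` field, renaming the file alone does not change the template ID.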

</Step>

<Step title="Validate locally">

```bash
pytest tests/template_engine/ -v
python -c "from hyperextract import Template; assert Template.get('your_domain/your_name')"
```

Gallery tests in `tests/template_engine/test_gallery.py` verify discovery, filtering by type/language/tag, and config preservation.

</Step>

<Step title="Test extraction (optional, requires API key)">

```bash
he parse tests/test_data/en/general/biography_scientist.md \
  -t your_domain/your_name -o /tmp/ka-test -l en
```

Or via the Python API:

```python
from hyperextract import Template

ka = Template.create("your_domain/your_name", "en")
result = ka.parse(open("tests/test_data/en/general/biography_scientist.md").read())
```

</Step>
</Steps>

Custom templates outside the presets tree can also be loaded by file path: `Template.create("/path/to/template.yaml", "en")`.

## Registering extraction methods

Methods are algorithm-driven extractors that subclass an AutoType (typically `AutoGraph` or `AutoHypergraph`) and register in `hyperextract/methods/registry.py`. After registration, they appear as `method/{name}` in `Template.create()`, `Template.get()`, and `he list method`.

<Steps>
<Step title="Implement the method class">

Place the class under `hyperextract/methods/rag/` or `hyperextract/methods/typical/`. The constructor must accept `llm_client` and `embedder`, with optional kwargs forwarded by `TemplateFactory.create_method()`.

```python
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import BaseChatModel

from hyperextract.types import AutoGraph

# NodeSchema and EdgeSchema are your own schema classes, defined elsewhere.


class My_RAG(AutoGraph[NodeSchema, EdgeSchema]):
    def __init__(
        self,
        llm_client: BaseChatModel,
        embedder: Embeddings,
        chunk_size: int = 2048,
        verbose: bool = False,
    ):
        super().__init__(
            node_schema=NodeSchema,
            edge_schema=EdgeSchema,
            llm_client=llm_client,
            embedder=embedder,
            # ... key extractors, prompts, mergers
        )
```

See `hyperextract/methods/rag/light_rag.py` for a complete reference implementation.

</Step>

<Step title="Register in the method registry">

Add a `register_method()` call inside `_init_registry()` in `hyperextract/methods/registry.py`:

```python
register_method(
    name="my_rag",
    method_class=My_RAG,
    autotype="graph",  # graph | hypergraph | model | list | set | ...
    description="Short description shown in he list method",
)
```

<ParamField body="name" type="string" required>
Registry key and CLI `-m` value (e.g. `light_rag`).
</ParamField>

<ParamField body="method_class" type="Type" required>
Class constructor; must accept `llm_client`, `embedder`, and `**kwargs`.
</ParamField>

<ParamField body="autotype" type="string" required>
Output AutoType category stored in method metadata.
</ParamField>
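
Conceptually, the registry is a name-to-class mapping with some metadata. The toy stand-in below mirrors the call signature shown above but is not the code in `hyperextract/methods/registry.py`; `MethodEntry`, `get_method`, and the duplicate-name check are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass(frozen=True)
class MethodEntry:
    method_class: Type
    autotype: str
    description: str


_REGISTRY: Dict[str, MethodEntry] = {}


def register_method(name: str, method_class: Type, autotype: str, description: str = "") -> None:
    """Store a method under its registry key; reject duplicate names."""
    if name in _REGISTRY:
        raise ValueError(f"Method '{name}' is already registered")
    _REGISTRY[name] = MethodEntry(method_class, autotype, description)


def get_method(template_id: str) -> MethodEntry:
    """Look up 'method/{name}' (or a bare name) in the registry."""
    return _REGISTRY[template_id.removeprefix("method/")]
```

This shows why the same `name` serves as both the CLI `-m` value and the suffix of the `method/{name}` template ID.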

</Step>

<Step title="Export and add a demo script">

Export the class from the appropriate `__init__.py` under `hyperextract/methods/`. Add a runnable demo under `examples/en/methods/` following the existing `*_demo.py` pattern (dotenv, LangChain clients, `feed_text`, `chat`, `show`).

</Step>

<Step title="Verify registration">

```bash
he list method
python -c "from hyperextract import Template; print(Template.get('method/my_rag'))"
pytest tests/ -v -k "method or factory"  # existing factory/registry coverage
```

Invocation paths after registration:

<CodeGroup>
```bash title="CLI"
he parse examples/en/tesla.md -m my_rag -o ./output/
```

```python title="Python API"
from pathlib import Path

from hyperextract import Template

ka = Template.create("method/my_rag")
ka.feed_text(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
```

```python title="Direct class"
from hyperextract.methods.rag import My_RAG

# llm and embedder are chat-model and embeddings clients created beforehand.
rag = My_RAG(llm_client=llm, embedder=embedder)
```
</CodeGroup>

</Step>
</Steps>

<Tip>
Method templates use English prompts only. `TemplateFactory.create_method()` hardcodes `metadata["lang"] = "en"` regardless of any `--lang` flag.
</Tip>

## Pull request checklist

Before opening a PR against `main` or `develop`:

1. Run `pytest --cov=hyperextract --cov-report=term -v` locally (no API key required).
2. Run `ruff check hyperextract` and `ruff format --check hyperextract`.
3. For template changes: run `pytest tests/template_engine/ -v` and confirm `Template.get()` resolves the new key.
4. For method changes: confirm `he list method` includes the new entry and add an `examples/en/methods/` demo when practical.
5. Keep changes scoped. The wheel build excludes `docs/`, `tests/`, and `.github/` per `[tool.hatch.build]`, so files in those directories do not ship with the package.
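
The test and lint commands from items 1 and 2 can be chained in a small local helper. This is only a sketch: `CHECKS` and `run_checks` are hypothetical names, no such script is implied to exist in the repository, and the `runner` parameter exists solely so the function can be exercised without launching real processes.

```python
import shlex
import subprocess

# Commands copied from checklist items 1 and 2.
CHECKS = [
    "pytest --cov=hyperextract --cov-report=term -v",
    "ruff check hyperextract",
    "ruff format --check hyperextract",
]


def run_checks(commands, runner=subprocess.run):
    """Run each command in order and return those that exited non-zero."""
    failed = []
    for command in commands:
        # shlex.split turns the command string into an argv list, so no shell is used.
        result = runner(shlex.split(command))
        if result.returncode != 0:
            failed.append(command)
    return failed


# Example usage from the repository root:
# failures = run_checks(CHECKS)
# for command in failures:
#     print(f"FAILED: {command}")
```

Running every command instead of stopping at the first failure gives one complete report per run, which tends to be more convenient before opening a PR.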

Report bugs and feature requests via [GitHub Issues](https://github.com/yifanfeng97/hyper-extract/issues).

## Related pages

<CardGroup>
<Card title="Create custom templates" href="/create-custom-templates">
Author domain YAML templates: type selection, fields, multilingual blocks, and merge strategies.
</Card>
<Card title="Extraction methods reference" href="/extraction-methods-reference">
Registered methods, autotype outputs, registry API, and constructor kwargs.
</Card>
<Card title="Template schema reference" href="/template-schema-reference">
YAML field definitions, valid AutoTypes, identifiers, and merge strategies.
</Card>
<Card title="Template design skills" href="/template-design-skills">
Agent-assisted authoring with `hyperextract-skills` validators and optimizers.
</Card>
</CardGroup>

---