# Overview

> What Hyper-Extract exposes (CLI `he`, Python `Template` API, 8 AutoTypes, 80+ YAML presets, 9 extraction methods), runtime assumptions (Python 3.11+, structured LLM output), and the shortest path from install to a queryable Knowledge Abstract.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `README.md`
- `pyproject.toml`
- `hyperextract/__init__.py`
- `hyperextract/cli/cli.py`
- `hyperextract/utils/template_engine/template.py`

---


Hyper-Extract (`hyperextract` v0.2.0) is an LLM-powered knowledge extraction framework that turns unstructured text into on-disk **Knowledge Abstracts** (KAs): strongly typed JSON plus optional FAISS vector indexes. The package exposes a Typer-based CLI (`he`), a Python SDK centered on `Template.create()`, eight `AutoType` primitives, domain YAML presets, and nine registered extraction methods.

## Surfaces

| Surface | Entry point | Primary use |
|---------|-------------|-------------|
| CLI | `he` (`hyperextract.cli:app`) | Parse documents, evolve KAs, search, chat, visualize |
| Python SDK | `from hyperextract import Template` | Programmatic extraction, indexing, and Q&A |
| Presets | `hyperextract/templates/presets/` | Domain YAML templates (general, finance, legal, medicine, TCM, industry) |
| Methods | `hyperextract.methods.registry` | Algorithm-driven extractors (`graph_rag`, `light_rag`, `atom`, …) |

The public SDK exports `BaseAutoType`, all eight `AutoType` classes, `Template`, client factories (`create_client`, `create_llm`, `create_embedder`, `get_client`), and logging helpers.

## Three-layer architecture

Hyper-Extract organizes extraction around three composable layers:

```mermaid
flowchart TB
    subgraph input["Input"]
        TXT["Unstructured text<br/>.md / .txt / stdin"]
    end

    subgraph layers["Hyper-Extract layers"]
        TPL["Templates<br/>YAML presets + method/*"]
        METH["Methods<br/>9 extraction algorithms"]
        AT["AutoTypes<br/>8 typed structures"]
    end

    subgraph output["Knowledge Abstract"]
        KA["output/<br/>data.json · metadata.json · index/"]
    end

    subgraph ops["Operations"]
        SRCH["search / chat"]
        VIS["show (OntoSight)"]
    end

    TXT --> TPL
    TXT --> METH
    TPL --> AT
    METH --> AT
    AT --> KA
    KA --> SRCH
    KA --> VIS
```

<AccordionGroup>
<Accordion title="AutoTypes — typed extraction primitives">

Eight `AutoType` classes inherit from `BaseAutoType` and own the full KA lifecycle: chunking, parallel LLM extraction, merge, indexing, serialization, search, and chat.

| Class | Structure | Typical use |
|-------|-----------|-------------|
| `AutoModel` | Single Pydantic object | Summaries, metadata, structured reports |
| `AutoList` | Ordered list | Logs, enumerations, ordered events |
| `AutoSet` | Deduplicated set | Glossaries, entity registries |
| `AutoGraph` | Nodes + binary edges | Concept maps, social networks |
| `AutoHypergraph` | Nodes + hyperedges | N-ary relationships, complex events |
| `AutoTemporalGraph` | Graph + time | Timelines, chronologies |
| `AutoSpatialGraph` | Graph + space | Physical layouts, spatial relations |
| `AutoSpatioTemporalGraph` | Graph + time + space | Event networks with location context |

Extraction uses LangChain structured output: `llm_client.with_structured_output(schema)` on chunked text (default chunk size 2048, overlap 256).
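
The chunking step can be pictured as a sliding window over the input. The sketch below is only an illustration, not Hyper-Extract's own splitter: it assumes sizes are counted in characters (the unit is not stated here), and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 256) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters.

    Hypothetical helper: assumes character-based sizes, which may differ
    from the unit the library actually uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive window starts
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Under these assumptions, a 5,000-character input with the default settings yields three windows starting at offsets 0, 1792, and 3584, and each window repeats the last 256 characters of the previous one.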

</Accordion>

<Accordion title="Templates — domain YAML presets">

Domain templates live under `hyperextract/templates/presets/` and are discovered by `Gallery` at import time. Each preset defines autotype, fields, identifiers, merge rules, and multilingual `language` blocks (`zh`, `en`). Resolve presets by path (e.g. `general/biography_graph`, `finance/earnings_summary`).

Knowledge templates require `--lang` / `language=` at creation. Method templates (`method/*`) always use English prompts.
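
To make the role of identifiers and merge rules concrete, here is a minimal sketch of identifier-keyed merging. It is not the library's merge logic: the record shape, the `name` identifier, and the "later non-empty value wins" rule are all assumptions chosen for illustration, and `merge_records` is a hypothetical helper. Real presets define their own identifiers and rules in YAML.

```python
def merge_records(
    existing: list[dict], incoming: list[dict], identifier: str = "name"
) -> list[dict]:
    """Merge two batches of extracted records that share an identifier field.

    Assumed rule (illustrative only): records with the same identifier are
    combined, and a later non-empty value overwrites an earlier one.
    """
    merged: dict[str, dict] = {}
    for record in [*existing, *incoming]:
        key = record[identifier]
        if key not in merged:
            merged[key] = dict(record)  # copy so the inputs stay untouched
            continue
        for field, value in record.items():
            if value not in (None, "", []):  # skip empty values
                merged[key][field] = value
    return list(merged.values())


first_chunk = [{"name": "Nikola Tesla", "born": "1856", "field": ""}]
second_chunk = [
    {"name": "Nikola Tesla", "born": None, "field": "electrical engineering"},
    {"name": "Thomas Edison", "born": "1847"},
]
combined = merge_records(first_chunk, second_chunk)
```

Here the two `Nikola Tesla` records collapse into one entry that keeps `born` from the first chunk and picks up `field` from the second.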

</Accordion>

<Accordion title="Methods — algorithm extractors">

Nine methods register in `hyperextract.methods.registry`:

| Method | Output autotype | Role |
|--------|-----------------|------|
| `graph_rag` | graph | Community-aware Graph-RAG |
| `light_rag` | graph | Lightweight binary-edge RAG |
| `hyper_rag` | hypergraph | Hyperedge RAG |
| `hypergraph_rag` | hypergraph | Advanced hypergraph construction |
| `cog_rag` | hypergraph | Cognitive RAG retrieval |
| `itext2kg` | graph | Triple-based KG extraction |
| `itext2kg_star` | graph | Enhanced iText2KG |
| `kg_gen` | graph | Flexible KG generation |
| `atom` | graph | Temporal KG with evidence attribution |

Invoke via CLI (`he parse -m light_rag`) or `Template.create("method/light_rag")`.

</Accordion>
</AccordionGroup>

## Knowledge Abstract model

A KA is a directory produced by `BaseAutoType.dump()`:

:::files
output/
├── data.json       # Structured knowledge (Pydantic model_dump)
├── metadata.json   # Timestamps, template config, provenance
└── index/          # FAISS vector store (optional; rebuild with build_index)
:::
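
Because a KA is plain files on disk, it can be inspected without importing the library. The following is a minimal sketch using only the standard library; `inspect_ka` is a hypothetical helper, and it makes no assumption about the keys inside either JSON file beyond reporting them.

```python
import json
from pathlib import Path


def inspect_ka(folder: str) -> dict:
    """Summarize a Knowledge Abstract directory without loading the library."""
    root = Path(folder)
    data = json.loads((root / "data.json").read_text(encoding="utf-8"))
    metadata = json.loads((root / "metadata.json").read_text(encoding="utf-8"))
    return {
        "top_level_keys": sorted(data) if isinstance(data, dict) else [],
        "metadata_keys": sorted(metadata) if isinstance(metadata, dict) else [],
        # The index is optional, so its absence is not an error.
        "has_index": (root / "index").is_dir(),
    }
```

A `has_index` value of `False` is a quick signal that the index still has to be built before searching.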

Lifecycle methods on every `BaseAutoType` instance:

| Method | Purpose |
|--------|---------|
| `parse(text)` | Extract from text (replaces in-memory state) |
| `feed_text(text)` | Incrementally append and merge |
| `build_index()` | Build semantic search index |
| `search(query, top_k=3)` | Vector retrieval |
| `chat(query, top_k=3)` | RAG Q&A over retrieved context |
| `dump(folder)` / `load(folder)` | Serialize / restore full KA |
| `show()` | Visualize via OntoSight (graph/list/set types) |

The CLI mirrors these operations with `he parse`, `he feed`, `he build-index`, `he search`, `he talk` (the counterpart of `chat`), and `he show`, and adds `he info`.
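
For readers unfamiliar with vector retrieval, the idea behind `search(query, top_k=3)` can be sketched in a few lines. This is a conceptual illustration only: Hyper-Extract delegates retrieval to a FAISS index over embedder output, whereas the code below uses brute-force cosine similarity over tiny hand-written vectors. The `cosine` and `top_k` helpers and the sample labels are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], items: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the labels of the k items most similar to the query vector."""
    scored = [(cosine(query_vec, vec), label) for label, vec in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [label for _, label in scored[:k]]


# Toy two-dimensional "embeddings" standing in for real embedder output.
items = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
nearest = top_k([1.0, 0.1], items)
```

`chat` builds on the same step: the retrieved items become the context for the LLM's answer.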

## Runtime assumptions

<Warning>
Hyper-Extract requires LLMs that support **structured output** — `json_schema` or function calling via LangChain's `with_structured_output`. Models without reliable schema adherence will produce extraction failures.
</Warning>

| Requirement | Value |
|-------------|-------|
| Python | `>=3.11` (classifiers: 3.11, 3.12) |
| Core deps | LangChain, FAISS (`faiss-cpu`), Pydantic, OntoSight, OntoMem, structlog |
| Optional extras | `anthropic`, `google`, `all` (additional LangChain provider packages) |
| Config file | `~/.he/config.toml` (`[llm]`, `[embedder]`) |
| Logging | `HYPER_EXTRACT_LOG_LEVEL`, `HYPER_EXTRACT_LOG_FILE` |

Provider presets ship for `openai`, `bailian` (DashScope compatible-mode), and `vllm` (local, requires explicit `base_url`). Use `provider:model@url` shorthand or `create_client()` for BYOC/BYOK deployments — no single vendor is required.
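
The `provider:model@url` shorthand can be read as three fields. The parser below is a rough sketch of that reading, not the library's implementation: it assumes the `@url` part is optional and that the first `:` and first `@` are the separators. `ClientSpec`, `parse_shorthand`, and the model name and URL in the example are hypothetical.

```python
from typing import NamedTuple, Optional


class ClientSpec(NamedTuple):
    provider: str
    model: str
    base_url: Optional[str]


def parse_shorthand(spec: str) -> ClientSpec:
    """Split 'provider:model[@url]' into its parts (assumed grammar)."""
    # Split on the first ':' only, because the URL itself contains colons.
    provider, sep, rest = spec.partition(":")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider:model[@url]', got {spec!r}")
    model, _, url = rest.partition("@")
    if not model:
        raise ValueError(f"missing model name in {spec!r}")
    return ClientSpec(provider, model, url or None)


local = parse_shorthand("vllm:my-model@http://localhost:8000/v1")
```

In this example `local.base_url` carries the explicit endpoint that a local `vllm` deployment needs.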

## Shortest path: install to queryable KA

<Steps>
<Step title="Install">

```bash
uv tool install hyperextract
# or: uv pip install hyperextract
```

</Step>

<Step title="Configure providers">

```bash
he config init -k YOUR_API_KEY
```

Writes `~/.he/config.toml` with LLM and embedder endpoints. See [Configure providers](/configure-providers) for per-provider setup and environment-variable overrides.

</Step>

<Step title="Extract a Knowledge Abstract">

```bash
he parse examples/en/tesla.md \
  -t general/biography_graph \
  -o ./output/ \
  -l en
```

<ParamField body="--template / -t" type="string">
Preset path (e.g. `general/biography_graph`). Omit for interactive selection.
</ParamField>

<ParamField body="--lang / -l" type="string" required>
Language code (`en` or `zh`). Required for knowledge templates; ignored for `method/*`.
</ParamField>

<ParamField body="--no-index" type="boolean">
Skip the FAISS index build. `he search` and `he talk` need an index, so run `he build-index` before using them.
</ParamField>

Produces `./output/data.json`, `metadata.json`, and `index/`.

</Step>

<Step title="Query and explore">

```bash
he search ./output/ "What are Tesla's major achievements?"
he talk ./output/ -i
he show ./output/
```

</Step>
</Steps>

<CodeGroup>
```bash CLI
he config init -k YOUR_API_KEY
he parse examples/en/tesla.md -t general/biography_graph -o ./output/ -l en
he search ./output/ "What are Tesla's major achievements?"
```

```python Python
from hyperextract import Template

ka = Template.create("general/biography_graph", language="en")

with open("examples/en/tesla.md", encoding="utf-8") as f:
    ka.parse(f.read())

ka.build_index()      # search and chat need the vector index
ka.dump("./output/")  # dump after indexing so index/ is written too
results = ka.search("What are Tesla's major achievements?")
ka.show()
```
</CodeGroup>

<Note>
`Template.create()` reads `~/.he/config.toml` when `llm_client` and `embedder` are omitted. Pass explicit LangChain clients for mixed cloud/local setups.
</Note>

## CLI command map

| Group | Commands |
|-------|----------|
| Create / evolve | `he parse`, `he feed`, `he build-index` |
| Explore | `he search`, `he talk`, `he show`, `he info` |
| Discover | `he list template`, `he list method` |
| Configure | `he config init`, `he config llm`, `he config embedder`, `he config show` |

Run `he` with no subcommand for a grouped help panel. Full flag reference: [CLI reference](/cli-reference).
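
When driving the CLI from a script, building the argument list in one place keeps invocations consistent. The sketch below assembles a `he parse` command from the flags documented on this page (`-t`, `-o`, `-l`, `--no-index`) and does not execute anything. `build_parse_command` is a hypothetical helper, not part of the package.

```python
import shlex
from typing import Optional


def build_parse_command(
    source: str,
    output: str,
    template: Optional[str] = None,
    lang: Optional[str] = None,
    no_index: bool = False,
) -> list[str]:
    """Assemble the argv list for `he parse` without running it."""
    cmd = ["he", "parse", source]
    if template:
        cmd += ["-t", template]  # omitted -> interactive selection
    cmd += ["-o", output]
    if lang:
        cmd += ["-l", lang]  # needed for knowledge templates
    if no_index:
        cmd.append("--no-index")
    return cmd


cmd = build_parse_command(
    "examples/en/tesla.md", "./output/", template="general/biography_graph", lang="en"
)
printable = shlex.join(cmd)  # shell-safe string for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `he` is installed and configured.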

## Templates vs methods

| | Domain templates | Extraction methods |
|---|------------------|-------------------|
| Path prefix | `general/*`, `finance/*`, … | `method/*` |
| Definition | YAML in `templates/presets/` | Python classes in `hyperextract.methods` |
| Language | `--lang` required (`en` / `zh`) | English only |
| Selection | Domain fit, output autotype | Algorithm preference (RAG vs triple extraction) |
| Customization | Author new YAML presets | Register via `register_method()` |

Both paths instantiate an `AutoType` and produce the same on-disk KA layout.
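
The distinction in the table can be expressed as a small check on the template path. This is a sketch of the rule as described on this page (the `method/` prefix marks an extraction method, and only domain templates need a language). `describe_template_path` is a hypothetical helper, and the library's own resolution logic may differ.

```python
def describe_template_path(path: str) -> dict:
    """Classify a template path as a domain preset or an extraction method."""
    kind = "method" if path.startswith("method/") else "domain"
    group, _, name = path.partition("/")
    return {
        "kind": kind,
        "group": group,  # e.g. "general", "finance", or "method"
        "name": name,
        "language_required": kind == "domain",  # methods use English prompts
    }


preset = describe_template_path("general/biography_graph")
method = describe_template_path("method/light_rag")
```

A wrapper script could use the `language_required` flag to decide whether to pass `-l` on the command line or `language=` to `Template.create()`.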

## Python SDK summary

```python
from hyperextract import (
    Template,
    AutoGraph,       # one of the eight AutoType classes
    create_client,
)

# Discover
Template.list()                          # presets + methods
Template.list(filter_by_type="graph")    # filter by autotype
Template.get("general/biography_graph")  # read YAML config

# Create and extract
llm, emb = create_client()               # from ~/.he/config.toml
ka = Template.create("general/biography_graph", language="en", llm_client=llm, embedder=emb)

with open("examples/en/tesla.md", encoding="utf-8") as f:
    document_text = f.read()

ka.feed_text(document_text)              # incremental: append and merge
ka.build_index()
ka.dump("./my_ka/")
```

Exported primitives and factories are listed in `hyperextract.__all__`. Method-level detail: [Python API reference](/python-api-reference).

## Next

<CardGroup>
<Card title="Installation" href="/installation">
Install via `uv tool install` or `uv pip install`, Python version constraints, and optional provider extras.
</Card>
<Card title="Quickstart" href="/quickstart">
First successful extraction with `he parse`, search, show, and the Tesla biography Python path.
</Card>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk KA layout, lifecycle methods, and incremental evolution with `he feed`.
</Card>
<Card title="Auto-Types" href="/auto-types">
Eight extraction primitives: structure, merge behavior, indexing, and selection guide.
</Card>
<Card title="Provider system" href="/provider-system">
BYOC/BYOK model: OpenAI, Bailian, vLLM presets, and structured-output requirements.
</Card>
<Card title="Templates vs methods" href="/templates-vs-methods">
When to pick domain YAML presets versus algorithm method templates.
</Card>
</CardGroup>
