# Auto-Types

> Eight strongly-typed extraction primitives (`AutoModel`, `AutoList`, `AutoSet`, `AutoGraph`, `AutoHypergraph`, `AutoTemporalGraph`, `AutoSpatialGraph`, `AutoSpatioTemporalGraph`): structure, merge behavior, indexing, and when to pick each type.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/types/__init__.py`
- `hyperextract/types/base.py`
- `hyperextract/types/graph.py`
- `hyperextract/types/hypergraph.py`
- `hyperextract/types/temporal_graph.py`
- `hyperextract/types/spatial_graph.py`
- `hyperextract/types/spatio_temporal_graph.py`
- `hyperextract/types/model.py`

---


Hyper-Extract's eight AutoTypes are Pydantic-backed Knowledge Abstract primitives in `hyperextract.types`. Each class wraps LangChain structured output, long-text chunking, chunk-result merging, FAISS semantic indexing, and on-disk persistence (`data.json`, `metadata.json`, `index/`). YAML templates and extraction methods both resolve to one of these types via `Template.create()`.

## Architecture

All AutoTypes inherit `BaseAutoType[T]`, which owns the shared extraction pipeline:

1. Split input with `RecursiveCharacterTextSplitter` (default `chunk_size=2048`, `chunk_overlap=256`).
2. Call `llm_client.with_structured_output(schema)` per chunk (batched up to `max_workers=10`).
3. Merge chunk results through type-specific `merge_batch_data()`.
4. Apply results via `_set_data_state()` (`parse`) or `_update_data_state()` (`feed_text`); both invalidate the vector index on mutation.
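
The four steps above can be illustrated with a minimal, dependency-free sketch. It is not the library's implementation: `split_with_overlap` is a fixed-width stand-in for `RecursiveCharacterTextSplitter`, the thread pool is an assumption about how batching might work, and `extract_chunk` / `run_pipeline` are hypothetical names that replace the structured LLM call with a toy function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set


def split_with_overlap(text: str, chunk_size: int = 2048,
                       chunk_overlap: int = 256) -> List[str]:
    """Fixed-width splitter: a simplified stand-in for RecursiveCharacterTextSplitter."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def run_pipeline(text: str,
                 extract_chunk: Callable[[str], Set[str]],
                 merge_batch_data: Callable[[List[Set[str]]], Set[str]],
                 max_workers: int = 10) -> Set[str]:
    """Split, extract per chunk, then merge the per-chunk results."""
    chunks = split_with_overlap(text)
    # Assumption: a thread pool stands in for the batched per-chunk LLM calls.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(extract_chunk, chunks))
    return merge_batch_data(partials)


# Toy stand-ins: "extraction" collects distinct characters, "merge" unions them.
text = "".join(chr(97 + i % 26) for i in range(5000))
result = run_pipeline(text, extract_chunk=set,
                      merge_batch_data=lambda parts: set().union(*parts))
```

With the default sizes, consecutive chunks share 256 characters, which is why a type-specific merge step is needed to reconcile content that is seen twice.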

```mermaid
classDiagram
    class BaseAutoType {
        +parse(text) BaseAutoType
        +feed_text(text) BaseAutoType
        +build_index()
        +search(query, top_k)
        +chat(query, top_k)
        +dump(folder_path)
        +load(folder_path)
        #merge_batch_data(data_list) T
    }
    class AutoModel
    class AutoList
    class AutoSet
    class AutoGraph
    class AutoHypergraph
    class AutoTemporalGraph
    class AutoSpatialGraph
    class AutoSpatioTemporalGraph

    BaseAutoType <|-- AutoModel
    BaseAutoType <|-- AutoList
    BaseAutoType <|-- AutoSet
    BaseAutoType <|-- AutoGraph
    BaseAutoType <|-- AutoHypergraph
    AutoGraph <|-- AutoTemporalGraph
    AutoGraph <|-- AutoSpatialGraph
    AutoGraph <|-- AutoSpatioTemporalGraph
```

<Info>
`parse()` returns a **new** instance; `feed_text()` mutates the current instance and supports chaining (`ka.feed_text(a).feed_text(b)`). Use `+` to merge two instances of the same class and schema.
</Info>
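
The difference between the two entry points is easiest to see on a toy class. The sketch below is a hypothetical stand-in (`ToyAbstract` is not part of the library, and its "extraction" is plain whitespace tokenizing); it only mirrors the calling conventions described above.

```python
from typing import List, Optional


class ToyAbstract:
    """Hypothetical stand-in for an AutoType; extraction is just str.split()."""

    def __init__(self, items: Optional[List[str]] = None):
        self.items = list(items or [])

    def _extract(self, text: str) -> List[str]:
        return text.split()

    def parse(self, text: str) -> "ToyAbstract":
        # Returns a new instance and leaves self untouched.
        return ToyAbstract(self._extract(text))

    def feed_text(self, text: str) -> "ToyAbstract":
        # Mutates self and returns it so calls can be chained.
        self.items.extend(self._extract(text))
        return self

    def __add__(self, other: "ToyAbstract") -> "ToyAbstract":
        # Only instances of the same class can be combined.
        if type(other) is not type(self):
            return NotImplemented
        return ToyAbstract(self.items + other.items)


ka = ToyAbstract()
fresh = ka.parse("alpha beta")            # ka stays empty
chained = ka.feed_text("gamma").feed_text("delta")
combined = fresh + ka
```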

## Type selection guide

| Class | Data shape | Primary use case | Deduplication | Default merge |
| :--- | :--- | :--- | :--- | :--- |
| `AutoModel` | Single `BaseModel` | Document summary, metadata, one record per file | Treats all chunks as one object (`singleton` key) | `llm_balanced` |
| `AutoList` | `List[Item]` | Ordered collections, event logs, quotes | None (append) | Concatenate all items |
| `AutoSet` | `List[Item]` (unique) | Entity registries, glossaries, keyword sets | By `key_extractor` via `OMem` | `llm_balanced` |
| `AutoGraph` | `nodes` + `edges` | Pairwise knowledge graphs | Nodes and edges separately via `OMem` | `llm_balanced` per node/edge |
| `AutoHypergraph` | `nodes` + hyperedges | Multi-participant events, meetings, groups | Same as graph; hyperedge keys must be order-stable | `llm_balanced` per node/edge |
| `AutoTemporalGraph` | Graph + time on edges | Timelines, news, biographies | Edge key includes time component | `llm_balanced` + `observation_time` injection |
| `AutoSpatialGraph` | Graph + location on edges | Floor plans, facility layouts | Edge key includes location component | `llm_balanced` + `observation_location` injection |
| `AutoSpatioTemporalGraph` | Graph + time + location | Travel logs, incident reports | Composite edge key (`@ time at location`) | Both observation contexts injected |

## Scalar and collection types

### AutoModel

`AutoModel` targets **one structured object per document**. Every chunk is treated as a partial view of the same object; chunk results merge through `ontomem` with a constant `singleton` key.

**Structure:** A single Pydantic `data_schema` instance stored in `_data` (or `None` when empty).

**Merge behavior:**

| Trigger | Behavior |
| :--- | :--- |
| Multi-chunk extraction | `merge_batch_data()` groups all extractions under `singleton` and merges via configured strategy |
| `feed_text()` | Merges incoming object into existing via same merger |
| `+` operator | Merges two `AutoModel` instances of the same schema via the same merger |

**Indexing:** `build_index()` creates one FAISS document per non-null schema field. `search()` returns field-value dictionaries from matched fields.
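
As a rough illustration of the singleton merge and the one-document-per-field indexing, the sketch below uses a plain dataclass instead of a Pydantic model. The `Summary` schema, `merge_singleton`, and `field_documents` are hypothetical names, and the merge rule (later non-null values overwrite) is a simple stand-in for the LLM-based default.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional


@dataclass
class Summary:
    """Hypothetical schema: each chunk fills in only the fields it can see."""
    title: Optional[str] = None
    author: Optional[str] = None
    abstract: Optional[str] = None


def merge_singleton(partials: List[Summary]) -> Summary:
    """Fold all chunk results into one object (stand-in for the LLM merge)."""
    merged = Summary()
    for partial in partials:
        for f in fields(partial):
            value = getattr(partial, f.name)
            if value is not None:
                setattr(merged, f.name, value)
    return merged


def field_documents(obj: Summary) -> List[Dict[str, str]]:
    """One indexable document per non-null field."""
    docs = []
    for f in fields(obj):
        value = getattr(obj, f.name)
        if value is not None:
            docs.append({"field": f.name, "text": f"{f.name}: {value}"})
    return docs


merged = merge_singleton([
    Summary(title="Q3 Earnings"),
    Summary(abstract="Revenue grew year over year."),
])
docs = field_documents(merged)
```

Here `author` never appears in any chunk, so it stays `None` and produces no document.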

**Template example:** `general/base_model` (`type: model`), `finance/earnings_summary`.

### AutoList

`AutoList` extracts **many independent items** where order may matter and duplicates are acceptable.

**Structure:** `AutoListSchema` with an `items: List[ItemSchema]` field.

**Merge behavior:** `merge_batch_data()` and `feed_text()` **append** items across chunks. No key-based deduplication.

**Indexing:** Each list item becomes one FAISS document (full JSON or selected `fields_for_index`).

**Template example:** `general/base_list`, `legal/compliance_list`.

### AutoSet

`AutoSet` maintains a **deduplicated registry** keyed by a user-defined `key_extractor`. Internal storage uses `OMem`; the external `items` property exposes a list.

**Merge behavior:** Configurable `strategy_or_merger` (default `MergeStrategy.LLM.BALANCED`):

| YAML / API value | Effect |
| :--- | :--- |
| `merge_field` | Non-null fields overwrite; lists append |
| `keep_existing` | First occurrence wins |
| `keep_incoming` | Latest occurrence wins |
| `llm_balanced` | LLM synthesizes both versions (default) |
| `llm_prefer_existing` | LLM merge biased toward stored data |
| `llm_prefer_incoming` | LLM merge biased toward new data |
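
The three deterministic strategies in the table can be sketched over plain dictionaries. This is an illustration under stated assumptions, not the `OMem` implementation: `dedupe` and the merger functions are hypothetical helpers, and the LLM-backed strategies are omitted because they require a model call.

```python
from typing import Any, Callable, Dict, Hashable, List

Item = Dict[str, Any]
Merger = Callable[[Item, Item], Item]


def merge_field(existing: Item, incoming: Item) -> Item:
    """Non-null incoming fields overwrite; list fields are appended."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is None:
            continue
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged


def keep_existing(existing: Item, incoming: Item) -> Item:
    return existing  # first occurrence wins


def keep_incoming(existing: Item, incoming: Item) -> Item:
    return incoming  # latest occurrence wins


def dedupe(items: List[Item], key_extractor: Callable[[Item], Hashable],
           merger: Merger) -> List[Item]:
    """Collapse items that share a key, resolving conflicts with `merger`."""
    registry: Dict[Hashable, Item] = {}
    for item in items:
        key = key_extractor(item)
        registry[key] = merger(registry[key], item) if key in registry else item
    return list(registry.values())


items = [
    {"name": "ACME", "aliases": ["Acme Corp"], "hq": None},
    {"name": "ACME", "aliases": ["ACME Inc."], "hq": "Berlin"},
    {"name": "Globex", "aliases": [], "hq": "Paris"},
]
by_name = lambda item: item["name"]
merged = dedupe(items, by_name, merge_field)
first = dedupe(items, by_name, keep_existing)
latest = dedupe(items, by_name, keep_incoming)
```

An `AutoList` would keep all three input items, whereas each strategy here yields two.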

**Indexing:** Delegates to `OMem.build_index()` with optional `fields_for_index`.

**Set operations:** Supports `|` (union), `&` (intersection), `-` (difference) on compatible instances.

**Template example:** `general/base_set` (`identifiers.item_id: name`), `finance/risk_factor_set`.

## Graph types

### AutoGraph

Standard **binary** knowledge graph: `nodes` (entities) and `edges` (source→target relations).

**Structure:**

```
AutoGraphSchema
├── nodes: List[NodeSchema]
└── edges: List[EdgeSchema]
```

**Required extractors:**

<ParamField body="node_key_extractor" type="Callable" required>
Returns a stable unique key per node (e.g., `lambda x: x.name`).
</ParamField>

<ParamField body="edge_key_extractor" type="Callable" required>
Returns a stable unique key per edge (e.g., `lambda x: f"{x.source}|{x.type}|{x.target}"`).
</ParamField>

<ParamField body="nodes_in_edge_extractor" type="Callable" required>
Returns `(source_key, target_key)` for endpoint validation and pruning.
</ParamField>

**Extraction modes:**

| Mode | Pipeline |
| :--- | :--- |
| `one_stage` | Single structured call extracts nodes and edges together |
| `two_stage` (default in base templates) | Batch-extract nodes per chunk, then batch-extract edges with chunk-local node context |

After extraction, `_prune_dangling_edges()` drops edges whose endpoints are not in the node set.
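
A minimal sketch of how the three extractors cooperate during pruning, assuming simple dataclass schemas. `Node`, `Edge`, and `prune_dangling_edges` are hypothetical names used for illustration; the library's private method may differ in detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str


@dataclass(frozen=True)
class Edge:
    source: str
    type: str
    target: str


# Extractors in the shape described above.
node_key_extractor = lambda n: n.name
edge_key_extractor = lambda e: f"{e.source}|{e.type}|{e.target}"
nodes_in_edge_extractor = lambda e: (e.source, e.target)


def prune_dangling_edges(nodes: List[Node], edges: List[Edge]) -> List[Edge]:
    """Keep only edges whose endpoints both appear in the node set."""
    known = {node_key_extractor(n) for n in nodes}
    return [e for e in edges
            if all(key in known for key in nodes_in_edge_extractor(e))]


nodes = [Node("Tesla"), Node("Edison")]
edges = [
    Edge("Tesla", "worked_for", "Edison"),
    Edge("Tesla", "partnered_with", "Westinghouse"),  # target was never extracted
]
kept = prune_dangling_edges(nodes, edges)
```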

**Merge behavior:** Separate `node_merger` and `edge_merger` (default `llm_balanced`). `feed_text()` calls `OMem.add()` for incremental node/edge insertion.

**Indexing:** `build_index()` builds separate FAISS stores for nodes and edges. On disk, `index/node_index/` and `index/edge_index/`. `search()` returns `(nodes, edges)`; `chat()` formats both into structured context.

**Template example:** `general/base_graph`, `general/concept_graph`.

### AutoHypergraph

Extends the graph pattern for **N-ary relations** (hyperedges connecting two or more nodes).

**Key difference from `AutoGraph`:**

| Aspect | `AutoGraph` | `AutoHypergraph` |
| :--- | :--- | :--- |
| Edge arity | Exactly two endpoints | Two or more participants |
| `nodes_in_edge_extractor` | `Tuple[str, str]` | `Tuple[str, ...]` (all participants) |
| Consistency check | Both endpoints must exist | **All** participants must exist (strict mode) |
| `extraction_mode` | `one_stage` or `two_stage` | `two_stage` (recommended) |

<Warning>
Hyperedge deduplication requires **order-stable** `edge_key_extractor` values. Sort participant keys inside the extractor so `{A, B}` and `{B, A}` map to the same key:

```python
edge_key_extractor=lambda x: f"{x.name}|{sorted(x.participants)}"
```
</Warning>
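
The effect of sorting can be checked directly. The sketch below assumes a hypothetical `Hyperedge` dataclass with `name` and `participants` fields, compares the order-stable extractor against a naive one, and adds an illustrative `is_consistent` helper for the strict all-participants check.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Hyperedge:
    name: str
    participants: Tuple[str, ...]


# Order-stable key: participants are sorted before formatting.
edge_key_extractor = lambda x: f"{x.name}|{sorted(x.participants)}"
# Naive key: depends on the order in which participants were listed.
naive_key_extractor = lambda x: f"{x.name}|{list(x.participants)}"


def is_consistent(edge: Hyperedge, node_keys: Set[str]) -> bool:
    """Strict mode: every participant must be a known node."""
    return all(p in node_keys for p in edge.participants)


a = Hyperedge("board_meeting", ("Alice", "Bob", "Carol"))
b = Hyperedge("board_meeting", ("Carol", "Alice", "Bob"))

same_key = edge_key_extractor(a) == edge_key_extractor(b)
naive_same_key = naive_key_extractor(a) == naive_key_extractor(b)
```

With the naive extractor the two extractions would be stored as separate hyperedges, even though they describe the same meeting.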

**Template example:** `general/base_hypergraph` (`relation_members: participants`), `legal/contract_obligation`.

## Context-aware graph types

`AutoTemporalGraph`, `AutoSpatialGraph`, and `AutoSpatioTemporalGraph` subclass `AutoGraph`. They inject observation context into prompts and fold time/location into edge deduplication keys.

### AutoTemporalGraph

Resolves relative time expressions ("yesterday", "last year") against an observation date.

<ParamField body="observation_time" type="string">
Reference date for relative-time resolution. Defaults to today (`YYYY-MM-DD`). Pass via `Template.create(..., observation_time="2024-06-15")` or template `options`.
</ParamField>

<ParamField body="time_in_edge_extractor" type="Callable" required>
Extracts the time component from an edge (e.g., `lambda x: x.time or ""`).
</ParamField>

**Edge identity:** `f"{raw_edge_key} @ {time_val}"` when time is present.

**Extraction rules baked into prompts:** Dates and time periods are **not** extracted as nodes; time lives on edge fields. Relative times resolve against `observation_time`.
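
To make the two ideas concrete, here is a small sketch. In the library the relative-time resolution is delegated to the LLM through the prompt; the `resolve_relative` function below is a hypothetical, rule-based illustration that handles only two expressions, and `temporal_edge_key` mirrors the edge-identity format given above.

```python
from datetime import date, timedelta


def resolve_relative(expr: str, observation_time: str) -> str:
    """Illustrative only: resolve two relative expressions against a reference date."""
    ref = date.fromisoformat(observation_time)
    if expr == "yesterday":
        return (ref - timedelta(days=1)).isoformat()
    if expr == "last year":
        return str(ref.year - 1)
    return expr  # already absolute, or not handled by this sketch


def temporal_edge_key(raw_edge_key: str, time_val: str) -> str:
    """Append the time component only when one is present."""
    return f"{raw_edge_key} @ {time_val}" if time_val else raw_edge_key


observation_time = "2024-06-15"
raw = "Tesla|visited|Edison"
k1 = temporal_edge_key(raw, resolve_relative("yesterday", observation_time))
k2 = temporal_edge_key(raw, resolve_relative("last year", observation_time))
k3 = temporal_edge_key(raw, "")
```

Because the time is part of the key, the same relation observed at two different times is kept as two edges instead of being merged.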

**Template example:** `general/base_temporal_graph`, `general/biography_graph`, `finance/event_timeline`.

### AutoSpatialGraph

Resolves relative location expressions ("nearby", "here") against an observation location.

<ParamField body="observation_location" type="string">
Reference location for spatial resolution. Defaults to `"Unknown Location"`.
</ParamField>

<ParamField body="location_in_edge_extractor" type="Callable" required>
Extracts the spatial component from an edge (e.g., `lambda x: x.place or ""`).
</ParamField>

**Edge identity:** `f"{raw_edge_key} at {loc_val}"` when location is present.

**Extraction rules:** Locations and directions are **not** extracted as nodes; spatial context belongs on edges.

**Template example:** `general/base_spatial_graph`, `medicine/treatment_map`.

### AutoSpatioTemporalGraph

Combines temporal and spatial resolution in one extractor.

**Edge identity:** `raw_key`, optionally suffixed with `@ {time}` and `at {location}`.
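
The composite key can be sketched as follows, assuming the suffixes are applied in the order time then location. The function name `spatio_temporal_edge_key` is hypothetical.

```python
def spatio_temporal_edge_key(raw_key: str, time_val: str = "",
                             loc_val: str = "") -> str:
    """Append '@ time' and 'at location' suffixes only when each value is present."""
    key = raw_key
    if time_val:
        key = f"{key} @ {time_val}"
    if loc_val:
        key = f"{key} at {loc_val}"
    return key


full = spatio_temporal_edge_key("Alice|met|Bob", "2024-06-15", "Berlin")
only_place = spatio_temporal_edge_key("Alice|met|Bob", loc_val="Berlin")
bare = spatio_temporal_edge_key("Alice|met|Bob")
```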

**Template example:** `general/base_spatio_temporal_graph`, `medicine/hospital_timeline`.

## Merge strategies (shared reference)

Types that support configurable merging (`AutoModel`, `AutoSet`, graph family) accept `strategy_or_merger` (or per-node/edge variants in YAML):

```yaml
options:
  merge_strategy: llm_balanced          # AutoModel, AutoSet
  entity_merge_strategy: llm_balanced   # graph family
  relation_merge_strategy: merge_field  # graph family
```

Programmatic construction passes `MergeStrategy` enum values or a custom `BaseMerger` from `ontomem.merger`.

<Note>
`AutoList` has no merge-strategy knob — chunk and feed operations always concatenate. Choose `AutoSet` when duplicate items must collapse by key.
</Note>

## Indexing and query

| Type | Index unit | `build_index()` scope | `search()` return type |
| :--- | :--- | :--- | :--- |
| `AutoModel` | Non-null fields | Single FAISS store | `List[dict]` (field snapshots) |
| `AutoList` | Each item | Single FAISS store | `List[ItemSchema]` |
| `AutoSet` | Each unique item | Via `OMem` | `List[ItemSchema]` |
| `AutoGraph` / hypergraph / context graphs | Nodes and edges separately | `node_index/` + `edge_index/` | `Tuple[List[Node], List[Edge]]` |

All types use FAISS (`langchain_community.vectorstores.FAISS`) backed by the configured embedder. Calling `search()` or `chat()` without a built index raises an error (graph types report which sub-index is missing).

`show()` renders through OntoSight (`view_nodes`, `view_graph`, or `view_hypergraph`) and wires search/chat callbacks when indices exist.

## Lifecycle and persistence

Every AutoType instance is a Knowledge Abstract. Standard operations:

| Method | Effect |
| :--- | :--- |
| `parse(text)` | Extract into a new instance; does not modify `self` |
| `feed_text(text)` | Extract and merge into `self`; invalidates index |
| `build_index()` | Build or rebuild FAISS from current data |
| `dump(folder)` | Write `data.json`, `metadata.json`, `index/` |
| `load(folder)` | Restore data, metadata, and index (rebuild if index load fails) |
| `clear()` | Reset data and index |
| `clear_index()` | Drop index only |

Chunking, merging, and index invalidation run automatically; callers supply only the schema, extractors, and provider clients.
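
The invalidation contract can be modelled with a toy store. Everything here is a hypothetical illustration: `ToyIndexedStore` uses a lowercase substring match in place of FAISS, and the `RuntimeError` is an assumption about the error type, which the source does not specify.

```python
from typing import Dict, List, Optional


class ToyIndexedStore:
    """Hypothetical model of the data/index lifecycle; not the library's class."""

    def __init__(self) -> None:
        self.items: List[str] = []
        self._index: Optional[Dict[int, str]] = None  # None means "not built"

    def feed(self, item: str) -> "ToyIndexedStore":
        self.items.append(item)
        self._index = None  # any mutation invalidates the index
        return self

    def build_index(self) -> None:
        self._index = {i: text.lower() for i, text in enumerate(self.items)}

    def search(self, query: str, top_k: int = 3) -> List[str]:
        if self._index is None:
            raise RuntimeError("index not built; call build_index() first")
        hits = [self.items[i] for i, text in self._index.items()
                if query.lower() in text]
        return hits[:top_k]

    def clear_index(self) -> None:
        self._index = None  # data is kept

    def clear(self) -> None:
        self.items = []
        self._index = None


store = ToyIndexedStore().feed("Tesla moved to America").feed("Edison founded GE")
store.build_index()
hits = store.search("tesla")
store.feed("Tesla met Westinghouse")  # index is stale again after this call
```

After the last `feed`, a `search` call raises until `build_index()` is run again, which matches the rule that mutation invalidates the index.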

## Templates and programmatic use

YAML `type` maps 1:1 to AutoType classes via `TemplateFactory`:

| Template `type` | Python class |
| :--- | :--- |
| `model` | `AutoModel` |
| `list` | `AutoList` |
| `set` | `AutoSet` |
| `graph` | `AutoGraph` |
| `hypergraph` | `AutoHypergraph` |
| `temporal_graph` | `AutoTemporalGraph` |
| `spatial_graph` | `AutoSpatialGraph` |
| `spatio_temporal_graph` | `AutoSpatioTemporalGraph` |

<CodeGroup>
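```python Python API
from pathlib import Path

from hyperextract import Template, create_client

# Create the LLM and embedding clients.
llm, embedder = create_client()

# Instantiate the template; relative dates resolve against observation_time.
ka = Template.create("general/biography_graph", "en", llm, embedder,
                     observation_time="2024-01-15")

# Extract and merge into `ka`, then build the index before searching.
ka.feed_text(Path("examples/en/tesla.md").read_text(encoding="utf-8"))
ka.build_index()
results = ka.search("When did Tesla move to America?", top_k=5)

# Persist data.json, metadata.json, and index/.
ka.dump("./output/tesla")
```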

```bash CLI
he config init
he parse examples/en/tesla.md -t general/biography_graph --lang en -o ./output/tesla
he search "When did Tesla move to America?" ./output/tesla
he show ./output/tesla
```
</CodeGroup>

Extraction methods (`method/light_rag`, `method/atom`, etc.) also produce AutoType instances — typically `AutoGraph` or `AutoTemporalGraph` — with algorithm-specific schemas. See the extraction methods reference for per-method output types.

## When to pick each type

<AccordionGroup>
<Accordion title="Choose AutoModel when">
- The output is **one record per document** (summary, report metadata, sentiment snapshot).
- Fields from different chunks describe the **same** object and must be synthesized, not listed.
- Example presets: `finance/earnings_summary`, `finance/sentiment_model`.
</Accordion>

<Accordion title="Choose AutoList when">
- Items are **independent** and order or repetition matters.
- You want the simplest merge semantics (append only).
- Example presets: `legal/compliance_list`, `general/base_list`.
</Accordion>

<Accordion title="Choose AutoSet when">
- Items need **deduplication** by a stable identifier.
- The same entity may appear in many chunks and attributes must merge intelligently.
- Example presets: `finance/risk_factor_set`, `legal/defined_term_set`.
</Accordion>

<Accordion title="Choose AutoGraph when">
- Relationships are **binary** (A→B).
- Standard entity–relation knowledge graphs suffice.
- Example presets: `general/biography_graph` (if time is not needed), `tcm/meridian_graph`.
</Accordion>

<Accordion title="Choose AutoHypergraph when">
- A single relation involves **three or more** participants (meetings, transactions, obligations).
- Example presets: `legal/contract_obligation`, `tcm/formula_composition`.
</Accordion>

<Accordion title="Choose AutoTemporalGraph when">
- Edges carry **time** and relative dates must resolve ("last year", "at age 20").
- Dates should not become standalone nodes.
- Example presets: `general/biography_graph`, `finance/event_timeline`.
</Accordion>

<Accordion title="Choose AutoSpatialGraph when">
- Edges carry **where** information and relative places must resolve ("nearby", "this room").
- Example presets: `medicine/treatment_map`, `industry/equipment_topology`.
</Accordion>

<Accordion title="Choose AutoSpatioTemporalGraph when">
- Both **when and where** matter on the same edges (incidents, travel, hospital events).
- Example presets: `medicine/hospital_timeline`, `general/base_spatio_temporal_graph`.
</Accordion>
</AccordionGroup>

<Tip>
Start from the matching `general/base_*` template when authoring a custom YAML template. Each base file demonstrates the expected `output`, `identifiers`, `options`, and `display` blocks for that AutoType.
</Tip>

## Related pages

<CardGroup>
<Card title="Knowledge Abstracts" href="/knowledge-abstracts">
On-disk layout (`data.json`, `metadata.json`, `index/`) and lifecycle methods shared by all AutoTypes.
</Card>
<Card title="Create custom templates" href="/create-custom-templates">
Author YAML templates: pick a `type`, define fields, identifiers, and merge strategies.
</Card>
<Card title="Template schema reference" href="/template-schema-reference">
Valid `type` values, field types, identifier patterns, and `options` keys per AutoType.
</Card>
<Card title="Python API reference" href="/python-api-reference">
`Template.create`, `BaseAutoType` methods, and `create_client()` entry points.
</Card>
<Card title="Search, chat, and visualize" href="/search-chat-visualize">
Query and render Knowledge Abstracts with `he search`, `he talk`, and `he show`.
</Card>
</CardGroup>
