# Create custom templates

> Author domain YAML templates: type selection, field and identifier design, multilingual `language` blocks, merge strategies, and validation workflow per the design guide and preset base templates.

- Repository: yifanfeng97/Hyper-Extract
- GitHub: https://github.com/yifanfeng97/Hyper-Extract
- Human docs: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf
- Complete Markdown: https://grok-wiki.com/public/docs/yifanfeng97-hyper-extract-7891c7254cdf/llms-full.txt

## Source Files

- `hyperextract/templates/DESIGN_GUIDE.md`
- `hyperextract/utils/template_engine/parsers/loader.py`
- `hyperextract/utils/template_engine/parsers/schemas/base.py`
- `hyperextract/templates/presets/general/base_graph.yaml`
- `hyperextract/utils/template_engine/factory.py`
- `hyperextract/templates/README.md`

---

Domain YAML templates in Hyper-Extract declare an AutoType (`model`, `list`, `set`, or one of the five graph types), an `output` schema, LLM `guideline` rules, `identifiers` for deduplication, and optional `options` and `display` settings. `TemplateFactory.create` either loads a preset from `hyperextract/templates/presets/` via `Gallery` or accepts a path to a standalone `.yaml` file. Before extraction runs, `load_template` validates the structure with Pydantic and checks every entry in `language` through `localize_template`.

<Note>
Knowledge templates require a language code (`zh` or `en`) at runtime. Method templates under `method/` are English-only and ignore `--lang`. See [Templates vs methods](/templates-vs-methods).
</Note>

## Design workflow

Hyper-Extract follows a four-stage authoring pipeline documented in `hyperextract/templates/DESIGN_GUIDE.md`:

```text
Requirements → brainstorm → designer → optimizer (optional) → validator
                    ↓            ↓              ↓                  ↓
              Type selection   YAML draft    Auto-fix rules    Schema check
```

<Steps>
<Step title="Brainstorm requirements">

Clarify the input source, extraction targets, entity granularity, relation types, and whether time or location matters. Record a draft with the chosen AutoType and field list.

</Step>
<Step title="Draft YAML from a base template">

Copy the matching base preset from `hyperextract/templates/presets/general/` (`base_model`, `base_list`, `base_set`, `base_graph`, `base_hypergraph`, `base_temporal_graph`, `base_spatial_graph`, or `base_spatio_temporal_graph`) and rename fields, tags, and guidelines for your domain.

</Step>
<Step title="Optimize (optional)">

Apply naming fixes (`relation_type` → `type`, `event_date` → `time`), separate mixed-language blocks, and trim schemas that exceed the five-field guideline. The bundled `hyperextract-skills/template-optimizer` skill automates these fixes.

</Step>
<Step title="Validate">

Run structural validation (see [Validation workflow](#validation-workflow)) before parsing sample documents.

</Step>
</Steps>

## Choose an AutoType

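Pick the container type by asking whether the data contains relationships and, if so, whether they are binary or multi-party and whether they carry time or location. `TemplateFactory.create` dispatches to the matching constructor for all eight types.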

| Need | AutoType | Output shape |
|------|----------|--------------|
| Single structured record | `model` | Flat fields on one object |
| Ordered sequence | `list` | Array of items |
| Deduplicated registry | `set` | Unique items keyed by `identifiers.item_id` |
| Binary relations (A→B) | `graph` | Nodes + edges (`source`, `target`, `type`) |
| Multi-party relations | `hypergraph` | Nodes + hyperedges with `participants` or role groups |
| Relations + time | `temporal_graph` | Edges carry `time`; set `identifiers.time_field` |
| Relations + location | `spatial_graph` | Edges carry `location`; set `identifiers.location_field` |
| Relations + time + location | `spatio_temporal_graph` | Both `time_field` and `location_field` |

```text
Need relationships?
├─ No → model | list | set
└─ Yes → graph (binary) | hypergraph (multi-party)
         ├─ + time → temporal_graph
         ├─ + location → spatial_graph
         └─ + both → spatio_temporal_graph
```
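
The decision tree can also be written as a small function. This is only an illustrative sketch: `choose_autotype` and its parameters are hypothetical names, not part of Hyper-Extract. The tie-breaks are assumptions too: time or location takes precedence over multi-party relations, and `set` over `list`. The tree above does not specify either.

```python
def choose_autotype(
    relations: bool,
    multi_party: bool = False,
    time: bool = False,
    location: bool = False,
    ordered: bool = False,
    deduplicated: bool = False,
) -> str:
    """Return the AutoType name suggested by the decision tree (hypothetical helper)."""
    if not relations:
        # Record types: no relationships between extracted items.
        if deduplicated:
            return "set"
        if ordered:
            return "list"
        return "model"
    # Graph family. Assumption: time/location win over multi-party relations.
    if time and location:
        return "spatio_temporal_graph"
    if time:
        return "temporal_graph"
    if location:
        return "spatial_graph"
    return "hypergraph" if multi_party else "graph"


print(choose_autotype(relations=True, time=True))  # temporal_graph
```

Treat the result as a starting point for the brainstorm stage rather than a rule; a domain with multi-party, time-stamped relations still needs a deliberate choice.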

Domain presets such as `finance/earnings_summary` (`model`) and `general/biography_graph` (`temporal_graph`) extend the base patterns. See [Auto-Types](/auto-types) for merge behavior and indexing details.

## Template skeleton

Every knowledge template shares the same top-level keys validated by `TemplateCfg`:

| Key | Purpose |
|-----|---------|
| `language` | Supported locales, e.g. `[zh, en]` |
| `name` | Template identifier; Gallery indexes presets as `{domain}/{name}` |
| `type` | One of the eight AutoTypes |
| `tags` | Lowercase domain labels |
| `description` | Human-readable summary per language |
| `output` | Schema the LLM must populate |
| `guideline` | Extraction strategy and quality rules |
| `identifiers` | Deduplication keys (required for `set` and graph types) |
| `options` | Chunking, merge strategies, `extraction_mode`, index fields |
| `display` | Labels for OntoSight visualization |

### Schema vs guideline

**Schema (`output`) defines what fields exist; guideline defines how to extract them well.** Do not repeat field definitions in `guideline.rules` or `rules_for_entities` — keep guidelines focused on strategy, quality bar, and common mistakes.

Record types (`model`, `list`, `set`) use `output.fields`. Graph types use `output.entities` and `output.relations`, each with its own `fields` list.

### Field design rules

<ParamField body="name" type="string" required>
Field identifier in `snake_case`.
</ParamField>

<ParamField body="type" type="string" required>
One of `str`, `int`, `float`, `bool`, or `list`.
</ParamField>

<ParamField body="description" type="string | {zh, en}" required>
Semantic meaning for the LLM. Use pure Chinese in `zh` blocks and pure English in `en` blocks — no mixed scripts.
</ParamField>

<ParamField body="required" type="boolean">
When `false` or omitted, the field is optional.
</ParamField>

Keep at most five fields per entity, relation, or list item. Prioritize essential identifiers (`source`, `target`, `participants`) over optional metadata.
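
The field rules above are mechanical enough to check before loading a template. The following is a minimal sketch, not something Hyper-Extract ships: `lint_fields`, the regular expression, and the message wording are all assumptions made for illustration. It takes the `fields` list of one entity, relation, or record as plain dictionaries.

```python
import re

VALID_TYPES = {"str", "int", "float", "bool", "list"}
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
MAX_FIELDS = 5  # the five-field guideline


def lint_fields(fields: list[dict]) -> list[str]:
    """Return human-readable problems for one `fields` list (hypothetical helper)."""
    problems = []
    if len(fields) > MAX_FIELDS:
        problems.append(f"{len(fields)} fields exceed the guideline of {MAX_FIELDS}")
    for field in fields:
        name = field.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"field name {name!r} is not snake_case")
        if field.get("type") not in VALID_TYPES:
            problems.append(f"field {name!r} has unsupported type {field.get('type')!r}")
    return problems


print(lint_fields([{"name": "company_name", "type": "str"}]))  # []
print(lint_fields([{"name": "CompanyName", "type": "string"}]))
```

An empty list means the fields pass these three checks; it says nothing about whether the descriptions are complete in every language.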

**Record type example** (from `base_model.yaml`):

```yaml
output:
  fields:
    - name: name
      type: str
      description:
        zh: '对象的名称或标题'
        en: 'Name or title of the object'
    - name: description
      type: str
      required: false
      description:
        zh: '对象的简要描述'
        en: 'Brief description of the object'

guideline:
  target:
    zh: '你是一位信息提取专家…'
    en: 'You are an information extraction expert…'
  rules:
    zh: ['提取文本中核心的、结构化的对象。']
    en: ['Extract the core, structured object from the text.']

display:
  label: '{name}'
```

**Graph type example** (from `base_graph.yaml`):

```yaml
output:
  entities:
    fields:
      - name: name
        type: str
      - name: type
        type: str
  relations:
    fields:
      - name: source
        type: str
      - name: target
        type: str
      - name: type
        type: str

identifiers:
  entity_id: name
  relation_id: '{source}|{type}|{target}'
  relation_members:
    source: source
    target: target

options:
  extraction_mode: two_stage

display:
  entity_label: '{name} ({type})'
  relation_label: '{type}'
```

## Identifier design

`parse_identifiers` turns the YAML identifier config into runtime key extractors. Misconfigured identifiers cause duplicate nodes, failed merges, or broken incremental updates via `he feed`.

| AutoType | Required identifiers | Pattern |
|----------|---------------------|---------|
| `set` | `item_id` | Field name, e.g. `name` |
| `graph` | `entity_id`, `relation_id`, `relation_members` | `relation_id` supports `{field}` templates |
| `hypergraph` (flat) | Same as `graph`, with `relation_members: participants` | String naming a `list`-typed field |
| `hypergraph` (nested) | Same as `graph`, with `relation_members: [group_a, group_b]` | List of `list`-typed role fields |
| `temporal_graph` | `graph` identifiers + `time_field` | e.g. `time` on relation fields |
| `spatial_graph` | `graph` identifiers + `location_field` | e.g. `location` on relation fields |
| `spatio_temporal_graph` | `graph` identifiers + both `time_field` and `location_field` | Combines temporal and spatial |
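
The three shapes of `relation_members` in the table (a source/target mapping, a single string, and a list of role fields) can be read as three ways of collecting the entity names that a relation connects. The sketch below illustrates that reading; `relation_member_names` is a hypothetical helper, not the extractor that `parse_identifiers` builds, and the sample relations are invented.

```python
def relation_member_names(relation: dict, relation_members) -> list[str]:
    """Collect the entity names a relation refers to (hypothetical helper)."""
    if isinstance(relation_members, dict):
        # graph: {"source": "source", "target": "target"} names two scalar fields
        return [relation[field] for field in relation_members.values()]
    if isinstance(relation_members, str):
        # flat hypergraph: one list-typed field such as "participants"
        return list(relation[relation_members])
    # nested hypergraph: several list-typed role fields, concatenated in order
    return [name for field in relation_members for name in relation[field]]


edge = {"source": "A", "target": "B", "type": "knows"}
print(relation_member_names(edge, {"source": "source", "target": "target"}))  # ['A', 'B']

meeting = {"name": "kickoff", "type": "meeting", "participants": ["A", "B", "C"]}
print(relation_member_names(meeting, "participants"))  # ['A', 'B', 'C']

deal = {"name": "merger", "type": "deal", "group_a": ["A"], "group_b": ["B", "C"]}
print(relation_member_names(deal, ["group_a", "group_b"]))  # ['A', 'B', 'C']
```

Whatever form is used, every name it yields should match some entity's `entity_id` value; otherwise the relation points at a node that does not exist.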

`relation_id` templates interpolate field values: `'{source}|{type}|{target}'` for graphs, `'{name}|{type}'` for simple hypergraphs (see `base_hypergraph.yaml`).
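
To see why the `relation_id` template matters for deduplication, the interpolation can be approximated with Python's built-in `str.format`. This is a sketch under that assumption; the library's own interpolation may differ, and `relation_key` and the sample edges are invented for illustration.

```python
RELATION_ID = "{source}|{type}|{target}"


def relation_key(template: str, relation: dict) -> str:
    """Fill a relation_id template from one relation's fields (hypothetical helper)."""
    try:
        return template.format(**relation)
    except KeyError as missing:
        # The template names a field that the relation schema does not provide.
        raise KeyError(f"relation_id references a missing field: {missing}") from None


edges = [
    {"source": "Alice", "type": "works_for", "target": "Acme"},
    {"source": "Alice", "type": "works_for", "target": "Acme"},  # repeat from a later chunk
    {"source": "Alice", "type": "founded", "target": "Acme"},
]

# Edges that produce the same key collapse into one entry.
unique = {relation_key(RELATION_ID, edge): edge for edge in edges}
print(sorted(unique))  # ['Alice|founded|Acme', 'Alice|works_for|Acme']
```

Dropping `{type}` from the template would merge the two remaining edges into one, which is why the key should include every field that distinguishes two relations.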

<Warning>
Name the relation type field `type`, not `relation_type`, and the temporal edge field `time`, not `event_date`. The design guide specifies these names, and the `template-optimizer` skill applies the renames automatically.
</Warning>

## Multilingual `language` blocks

Set `language: [zh, en]` (or a single code) at the top level. Any string field can be:

- A plain string (single-language template)
- A dict `{zh: '…', en: '…'}`
- A dict of lists for numbered rules: `{zh: ['规则1'], en: ['Rule 1']}`

At runtime, `localize_template(config, language)` collapses multilingual values to the requested locale before `TemplateFactory` builds prompts. `load_template` validates localization for **every** language listed in `language` and raises `ValueError` if a locale is incomplete.
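
The collapsing step can be pictured as a recursive walk over the parsed YAML. The sketch below is an independent illustration, not the body of `localize_template`: the function name `localize`, the `LANGS` set, and the error message are assumptions. It treats any non-empty dict whose keys are all language codes as a multilingual value.

```python
LANGS = {"zh", "en"}


def localize(value, language: str):
    """Collapse multilingual values to one locale (illustrative sketch only)."""
    if isinstance(value, dict):
        if value and set(value) <= LANGS:
            # A multilingual leaf such as {"zh": "...", "en": "..."}.
            if language not in value:
                raise ValueError(f"missing {language!r} text")
            return value[language]
        return {key: localize(item, language) for key, item in value.items()}
    if isinstance(value, list):
        return [localize(item, language) for item in value]
    return value  # plain strings, numbers, and booleans pass through


config = {
    "language": ["zh", "en"],
    "description": {"zh": "基础模型", "en": "Base model"},
    "guideline": {"rules": {"zh": ["规则1"], "en": ["Rule 1"]}},
}

# Check every declared locale, mirroring the validation described above.
for lang in config["language"]:
    localize(config, lang)

print(localize(config, "en")["guideline"]["rules"])  # ['Rule 1']
```

A value that provides only `zh` makes the `en` pass raise `ValueError`, which is the failure mode to expect when one locale is left incomplete.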

Both the CLI and the Python API require an explicit language for knowledge templates:

<CodeGroup>
```bash title="CLI"
he parse examples/en/tesla.md -o ./out -t general/biography_graph --lang en
```

```python title="Python"
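from hyperextract import Template

# "en" selects the English blocks of the template.
ka = Template.create("finance/earnings_summary", "en")
ka.feed_text(document_text)  # document_text: raw text of the source document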
```
</CodeGroup>

## Merge strategies and options

`options` maps to AutoType constructor kwargs through `parse_option`. YAML keys are translated to internal names (e.g. `merge_strategy` → `strategy_or_merger`, `entity_merge_strategy` → `node_strategy_or_merger`).

### Record and set types

| YAML key | Applies to | Valid values |
|----------|-----------|--------------|
| `merge_strategy` | `model`, `set` | `merge_field`, `keep_incoming`, `keep_existing`, `llm_balanced`, `llm_prefer_incoming`, `llm_prefer_existing` |
| `fields_for_search` | `list`, `set` | List of field names indexed for semantic search |

`merge_field` overwrites non-null fields and appends lists. `llm_balanced` (default when unset) asks the LLM to synthesize conflicting chunk results.
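
One plausible reading of the `merge_field` description is shown below. It is a sketch of the stated behavior, not the library's merger: it assumes that non-null incoming values overwrite existing ones, that null incoming values are ignored, and that lists are concatenated without deduplication. The function and the sample records are invented.

```python
def merge_field(existing: dict, incoming: dict) -> dict:
    """Merge two records field by field (illustrative reading of `merge_field`)."""
    merged = dict(existing)
    for key, value in incoming.items():
        if value is None:
            continue  # a null incoming value never overwrites existing data
        if isinstance(value, list) and isinstance(merged.get(key), list):
            merged[key] = merged[key] + value  # append list items
        else:
            merged[key] = value  # overwrite with the non-null incoming value
    return merged


first_chunk = {"name": "Acme", "ceo": "J. Doe", "products": ["A"]}
second_chunk = {"name": "Acme Corp", "ceo": None, "products": ["B"]}
print(merge_field(first_chunk, second_chunk))
# {'name': 'Acme Corp', 'ceo': 'J. Doe', 'products': ['A', 'B']}
```

Because this strategy is deterministic, it suits fields where the latest value is simply correct; conflicting free-text values are where the `llm_*` strategies are meant to help.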

### Graph types

| YAML key | Purpose |
|----------|---------|
| `extraction_mode` | `one_stage` (joint node+edge) or `two_stage` (nodes first, then edges). Base graph presets default to `two_stage` for accuracy. |
| `entity_merge_strategy` | Node deduplication on incremental `feed_text` |
| `relation_merge_strategy` | Edge deduplication |
| `entity_fields_for_search` | Node fields indexed for search |
| `relation_fields_for_search` | Edge fields indexed for search |
| `observation_time` | Anchor for relative dates (`temporal_graph`, `spatio_temporal_graph`) |
| `observation_location` | Fallback for fuzzy locations (`spatial_graph`, `spatio_temporal_graph`) |

Pass `observation_time` or `observation_location` as keyword arguments to `Template.create` to override the template defaults at runtime:

```python
ka = Template.create(
    "finance/event_timeline",
    "en",
    observation_time="2024-06-15",
)
```

## Validation workflow

Structural validation happens at load time; there is no separate CLI command.

<Steps>
<Step title="Load and parse YAML">

```python
from hyperextract.utils.template_engine.parsers import load_template

cfg = load_template("/path/to/my_template.yaml")
```

`load_template` runs Pydantic validation on `TemplateCfg` and tests `localize_template` for each language in `language`.

</Step>
<Step title="Run the checklist">

**All types**

- [ ] `language` lists supported locales
- [ ] `name`, `type`, `tags`, `description`, `output`, `guideline` present
- [ ] `type` is a valid AutoType
- [ ] `tags` are lowercase

**Graph types**

- [ ] `output.entities` and `output.relations` exist
- [ ] `identifiers.entity_id`, `relation_id`, `relation_members` configured
- [ ] Temporal/spatial types include `time_field` / `location_field`

**Hypergraph**

- [ ] `relation_members` is a string (flat list field) or list of `list`-typed role fields

</Step>
<Step title="Smoke-test extraction">

```python
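from hyperextract import Template

# sample_text: a representative excerpt from your own documents.
ka = Template.create("/path/to/my_template.yaml", "en")
ka.feed_text(sample_text)
ka.dump("./test-ka")  # writes the Knowledge Abstract, including data.json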
```

Inspect `data.json` and run `ka.show()` or `he show ./test-ka` to verify field population and graph connectivity.

</Step>
</Steps>
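
Parts of the checklist can be automated against the parsed YAML. The sketch below is a hypothetical helper, not a Hyper-Extract API: `run_checklist` and its messages are invented, and it covers only the structural items listed in the checklist. It expects the template as a plain dictionary, for example the result of parsing the file with a YAML library.

```python
AUTOTYPES = {
    "model", "list", "set", "graph", "hypergraph",
    "temporal_graph", "spatial_graph", "spatio_temporal_graph",
}
GRAPH_TYPES = AUTOTYPES - {"model", "list", "set"}
REQUIRED_KEYS = ("language", "name", "type", "tags", "description", "output", "guideline")


def run_checklist(cfg: dict) -> list[str]:
    """Return checklist violations for one template dict (hypothetical helper)."""
    problems = [f"missing top-level key: {key}" for key in REQUIRED_KEYS if key not in cfg]
    kind = cfg.get("type")
    if kind not in AUTOTYPES:
        problems.append(f"type {kind!r} is not a valid AutoType")
    if any(tag != tag.lower() for tag in cfg.get("tags", [])):
        problems.append("tags must be lowercase")
    if kind in GRAPH_TYPES:
        output = cfg.get("output", {})
        ids = cfg.get("identifiers", {})
        for section in ("entities", "relations"):
            if section not in output:
                problems.append(f"output.{section} is required for graph types")
        for key in ("entity_id", "relation_id", "relation_members"):
            if key not in ids:
                problems.append(f"identifiers.{key} is required for graph types")
        if kind in ("temporal_graph", "spatio_temporal_graph") and "time_field" not in ids:
            problems.append("identifiers.time_field is required for temporal types")
        if kind in ("spatial_graph", "spatio_temporal_graph") and "location_field" not in ids:
            problems.append("identifiers.location_field is required for spatial types")
        if kind == "hypergraph" and not isinstance(ids.get("relation_members"), (str, list)):
            problems.append("hypergraph relation_members must be a string or a list")
    return problems


template = {
    "language": ["zh", "en"], "name": "my_graph", "type": "temporal_graph",
    "tags": ["general"], "description": {"zh": "示例", "en": "Example"},
    "output": {"entities": {"fields": []}, "relations": {"fields": []}},
    "guideline": {},
    "identifiers": {"entity_id": "name", "relation_id": "{source}|{type}|{target}",
                    "relation_members": {"source": "source", "target": "target"}},
}
print(run_checklist(template))
# ['identifiers.time_field is required for temporal types']
```

A clean result complements `load_template` rather than replacing it: the loader remains the authority on schema and localization errors.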

<Tip>
Install `hyperextract-skills` and invoke the `yaml-validator` skill for agent-assisted checklist runs. See [Template design skills](/template-design-skills).
</Tip>

### Common errors

| Symptom | Fix |
|---------|-----|
| `The template configuration is not valid for language {lang}` | Add missing `zh`/`en` text for that locale in `description`, `guideline`, or field descriptions |
| `language is required for knowledge templates` | Pass `"zh"` or `"en"` to `Template.create` or `--lang` to `he parse` |
| Duplicate entities after `he feed` | Tighten `entity_id` / `item_id`; align naming rules in `guideline` |
| Empty relations | Switch `extraction_mode` to `two_stage`; strengthen `rules_for_relations` |
| `Missing fields` during merge | Ensure `relation_id` template fields exist on relation schema |

Enable debug logging with `HYPER_EXTRACT_LOG_LEVEL=DEBUG` if extraction succeeds but the output shape is wrong. See [Troubleshooting](/troubleshooting).

## Use a custom template

### Standalone YAML file (Python)

Place the file anywhere. `TemplateFactory.create` resolves paths ending in `.yaml` or existing filesystem paths through `load_template`:

```python
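from hyperextract import Template

# A path ending in .yaml is loaded from disk instead of the preset Gallery.
ka = Template.create("/path/to/my_template.yaml", "zh")
ka.feed_text(text)  # text: raw text of the source document
ka.dump("./my-ka")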
```

### Preset registration (CLI and Gallery)

To make a template selectable via `he parse -t domain/name` and `Template.create("domain/name", lang)`, add the YAML under `hyperextract/templates/presets/<domain>/`. `Gallery` auto-discovers `*.yaml` files at import time and registers them as `<domain>/<name>`.

```bash
he parse input.md -o ./ka-out -t finance/earnings_summary --lang en
he list template
```
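
To preview which keys a presets directory would expose, the discovery step can be imitated with `pathlib`. This is a rough sketch and not the `Gallery` implementation: `discover_presets` is a hypothetical function, and it assumes the key's name part equals the file stem, which may not hold if `Gallery` reads the `name` field inside the YAML instead.

```python
from pathlib import Path


def discover_presets(presets_root: Path) -> list[str]:
    """List '<domain>/<name>' keys for *.yaml files one level below the root."""
    return sorted(
        f"{path.parent.name}/{path.stem}"  # assumption: name part == file stem
        for path in presets_root.glob("*/*.yaml")
    )


if __name__ == "__main__":
    # Run from a checkout of the repository; prints an empty list elsewhere.
    print(discover_presets(Path("hyperextract/templates/presets")))
```

Comparing this listing with the output of `he list template` is a quick way to confirm that a newly added file sits in the right directory.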

<Info>
When reloading a Knowledge Abstract whose template is not in presets, `get_template_from_ka` looks for `{template}.yaml` beside `data.json` in the KA directory. Copy your custom YAML into the output folder to keep `he feed` and `he search` working across sessions.
</Info>

### Publish upstream

To contribute a template to the project preset library:

1. Add YAML to `hyperextract/templates/presets/<domain>/`
2. Include both `zh` and `en` descriptions
3. Test with representative documents
4. Submit a PR per [Contributing](/contributing)

## Naming conventions

| Element | Convention | Example |
|---------|-----------|---------|
| Template `name` | Descriptive identifier (design guide recommends CamelCase for new templates; presets often use `snake_case`) | `earnings_summary` |
| Field names | `snake_case` | `company_name` |
| Relation type field | `type` | not `relation_type` |
| Time on edges | `time` | not `event_date` |
| Tags | lowercase | `[finance, investor-relations]` |

## Related pages

<CardGroup>
<Card title="Template schema reference" href="/template-schema-reference">
Field-by-field YAML schema, valid types, and identifier patterns.
</Card>
<Card title="Auto-Types" href="/auto-types">
Merge behavior, indexing, and type selection criteria.
</Card>
<Card title="Template design skills" href="/template-design-skills">
Agent-assisted authoring with `hyperextract-skills`.
</Card>
<Card title="Extract and evolve" href="/extract-and-evolve">
Run `he parse` and `he feed` with your template against documents.
</Card>
<Card title="Tesla biography recipe" href="/tesla-biography-recipe">
End-to-end example using `general/biography_graph`.
</Card>
</CardGroup>
