Agent-readable wiki

LongMemEval-V2 Plain-Language Wiki

LongMemEval-V2 is a benchmark that tests whether an AI agent's memory system can turn long histories of web-browsing actions into the kind of practical knowledge a seasoned colleague would have. The repo ships the dataset pipeline, a pluggable memory framework, an evaluation harness, and leaderboard packaging utilities.

Pages

  1. Explain It Simply: What This Repo DoesWhat LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.
  2. Five Things a Good Memory Must KnowThe five memory abilities the benchmark tests — static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness — explained with real question categories from the harness source code.
  3. Downloading & Preparing the HaystackHow trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects — covering the three data scripts and the validate step.
  4. The Six Memory Backends: How Each One WorksA plain-English tour of the six pluggable memory backends — no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C — explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.
  5. The Evaluation Harness: From Question to ScoreHow harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer — including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.
  6. Scoring, LAFS, & What to RememberHow web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

Complete Markdown

# LongMemEval-V2 Plain-Language Wiki

> LongMemEval-V2 is a benchmark that tests whether an AI agent's memory system can turn long histories of web-browsing actions into the kind of practical knowledge a seasoned colleague would have. The repo ships the dataset pipeline, a pluggable memory framework, an evaluation harness, and leaderboard packaging utilities.

## Context Links

- [Agent index](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms.txt)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2)
- [GitHub repository](https://github.com/xiaowu0162/LongMemEval-V2)

## Repository Metadata

- Repository: xiaowu0162/LongMemEval-V2

- Generated: 2026-05-22T06:16:36.043Z
- Updated: 2026-05-22T06:49:07.550Z
- Runtime: Claude Code
- Format: Explain Like I'm 5
- Pages: 6

## Page Index

- 01. [Explain It Simply: What This Repo Does](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/01-explain-it-simply-what-this-repo-does.md) - What LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.
- 02. [Five Things a Good Memory Must Know](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/02-five-things-a-good-memory-must-know.md) - The five memory abilities the benchmark tests — static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness — explained with real question categories from the harness source code.
- 03. [Downloading & Preparing the Haystack](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/03-downloading-preparing-the-haystack.md) - How trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects — covering the three data scripts and the validate step.
- 04. [The Six Memory Backends: How Each One Works](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/04-the-six-memory-backends-how-each-one-works.md) - A plain-English tour of the six pluggable memory backends — no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C — explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.
- 05. [The Evaluation Harness: From Question to Score](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/05-the-evaluation-harness-from-question-to-score.md) - How harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer — including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.
- 06. [Scoring, LAFS, & What to Remember](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/06-scoring-lafs-what-to-remember.md) - How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

## Source File Index

- `data/download_data.py`
- `data/prepare_data.py`
- `data/public_data.py`
- `data/validate_data.py`
- `environment.yml`
- `evaluation/harness.py`
- `evaluation/memory_configs/no_retrieval.json`
- `evaluation/memory_configs/rag_query_to_slice.json`
- `evaluation/qa_eval_metrics.py`
- `evaluation/run_eval.py`
- `evaluation/scripts/run_no_retrieval.sh`
- `evaluation/scripts/run_rag_query_to_slice.sh`
- `leaderboard/build_submission_step_1_single_operating_point.py`
- `leaderboard/build_submission_step_2_build_package.py`
- `leaderboard/combine_aggregated_metrics.py`
- `leaderboard/compute_lafs.py`
- `leaderboard/README.md`
- `leaderboard/submission_utils.py`
- `memory_modules/agentrunbook_c.py`
- `memory_modules/agentrunbook_r.py`
- `memory_modules/codex.py`
- `memory_modules/memory.py`
- `memory_modules/no_retrieval.py`
- `memory_modules/rag.py`
- `memory_modules/support.py`
- `memory_modules/trajectory_store.py`
- `pyproject.toml`
- `README.md`

---

## 01. Explain It Simply: What This Repo Does

> What LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/01-explain-it-simply-what-this-repo-does.md
- Generated: 2026-05-22T06:13:14.027Z

### Source Files

- `README.md`
- `pyproject.toml`
- `memory_modules/memory.py`
- `evaluation/harness.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [README.md](README.md)
- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [memory_modules/rag.py](memory_modules/rag.py)
- [memory_modules/agentrunbook_r.py](memory_modules/agentrunbook_r.py)
- [evaluation/harness.py](evaluation/harness.py)
- [leaderboard/README.md](leaderboard/README.md)
</details>

# Explain It Simply: What This Repo Does

LongMemEval-V2 is a benchmark that asks one sharp question: **can an AI agent learn from experience the way a knowledgeable colleague does?** The repo gives you the questions, the histories the agent must learn from, a harness to run any memory system against them, and ready-made baselines to compare against — all in one place.

This page explains what the benchmark is testing, how the pieces fit together, and the three ideas every reader needs before diving deeper into the code.

---

## The One Analogy to Keep

Think of a new employee joining a company. On their first day they know nothing special about your systems. After six months, they know which pages load slowly, which workflows have traps, and which assumptions from their last job don't apply here. That accumulated knowledge — earned from experience, not from a manual — is what LongMemEval-V2 measures.

The "employee" here is an AI agent. The "six months of experience" is a pile of recorded web-browsing or enterprise-software sessions (called **trajectories**). A **memory system** reads those trajectories and must later answer factual questions about the environment — just like a colleague you tap on the shoulder to ask "wait, how does checkout work on the admin site?"

Sources: [README.md:29-34]()

---

## Three Ideas to Hold Onto

### 1. The benchmark is a haystack-and-needle problem at extreme scale

Each question comes with a **haystack**: a set of up to 500 recorded agent trajectories that together can reach 115 million tokens. The memory system must ingest all of them and then surface just the relevant evidence — a needle — when asked a specific question.

The two test tiers are called **small** and **medium**, reflecting how large the haystacks grow. Sources: [README.md:38-43]()

### 2. Memory is a simple two-method contract, not a fixed implementation

Every memory backend in the repo implements exactly two methods:

- `insert(trajectory)` — called once per trajectory during the indexing phase.
- `query(query, query_image=None)` — called once per question; must return a list of text or image evidence items.

That contract is defined in `Memory` (an abstract base class) in `memory_modules/memory.py`. Any Python class that decorates itself with `@register_memory`, sets a unique `memory_type` string, and implements those two methods is a valid backend that the harness will accept.

```python
# memory_modules/memory.py:25-53
class Memory(ABC):
    memory_type: str = ""

    @abstractmethod
    def insert(self, trajectory: dict[str, object]) -> None:
        raise NotImplementedError

    @abstractmethod
    def query(
        self,
        query: str,
        query_image: str | None = None,
    ) -> list[MemoryContextItem]:
        raise NotImplementedError
```

The return type is a list of typed items:
```python
[
    {"type": "text", "value": "retrieved notes or evidence"},
    {"type": "image", "value": "/path/to/screenshot.png"},
]
```

Sources: [memory_modules/memory.py:25-53](), [README.md:197-220]()

### 3. Accuracy alone is not the score — latency matters too

The leaderboard scores submissions using **LAFS** (Latency-Adjusted Frontier Score), which rewards both high accuracy and fast query latency. A method that is 5% more accurate but ten times slower may score lower than a balanced one. The harness tracks `memory_query_duration_seconds` for every question, and the submission packaging combines the accuracy and latency numbers into a single frontier score.

Sources: [leaderboard/README.md:26-34](), [evaluation/harness.py:1438-1458]()

---

## What Gets Tested: Five Memory Abilities

The 451 questions are hand-curated across five categories that reflect different kinds of workplace knowledge:

| Ability | What it checks |
|---|---|
| **Static state recall** | Remembers landmarks, page layouts, and subtle UI differences |
| **Dynamic state tracking** | Understands how actions change the environment over time |
| **Workflow knowledge** | Knows the steps for recurring tasks |
| **Environment gotchas** | Recognizes local failure modes and avoids them |
| **Premise awareness** | Detects questions whose premise is wrong in this specific deployment |

A question in the **abstention** variant (marked `-abs` in the code) tests whether the system correctly says "I don't know" rather than guessing. The harness grades abstention separately from factual recall.

Sources: [README.md:45-57](), [evaluation/harness.py:44-60]()

---

## How an Evaluation Run Works

The harness (`evaluation/harness.py`) orchestrates three sequential passes:

```text
Pass 1 — Build prompts
  For each question:
    → Load the haystack (the set of relevant trajectories)
    → Call memory.insert() for each trajectory
    → Call memory.query() with the question text (and optional screenshot)
    → Truncate the returned context to fit within the token budget
    → Build the final reader prompt

Pass 2 — Generate answers
  → Send all prompts concurrently to the reader model (AsyncOpenAI)
  → Parse the boxed answer from each response

Pass 3 — Score answers
  → Compare each parsed answer to the gold label
  → Use exact-match, LLM judge, or custom eval function per question type
  → Write per_question.jsonl and aggregated_metrics.json
```

If every question shares the same haystack (common for small tier), the memory is built once and reused for all queries — a significant efficiency win. When questions have different haystacks, the harness can build memory in parallel across worker threads.

Sources: [evaluation/harness.py:1135-1196](), [evaluation/harness.py:1200-1343]()

The harness communicates with models through an **OpenAI-compatible API**. You point it at any server that speaks that protocol — local vLLM, a self-hosted endpoint, or a cloud provider. No model provider is baked in; the paper used Qwen3.5-9B as the reader and Qwen3-Embedding-8B for embeddings, but those are defaults you can override.

Sources: [README.md:126-156]()

---

## The Built-In Memory Backends

The repo ships six backends, ranging from a deliberate no-op to sophisticated agent-based approaches:

| `memory_type` | What it does |
|---|---|
| `no_retrieval` | Returns nothing — the zero baseline. The reader model gets no memory context. |
| `rag_query_to_slice` | Embeds raw AXTree state slices from trajectories; retrieves top-k by cosine similarity. |
| `rag_query_to_slice_notes` | Same as above, but also generates LLM-written procedure and hint notes per trajectory and retrieves those too. |
| `agentrunbook_r` | Builds richer per-transition event notes and uses multi-query retrieval with optional reranking. |
| `codex` | Drops trajectories as files into a workspace; uses Codex (a coding agent) to search them at query time. |
| `agentrunbook_c` | Like Codex, but paired with AgentRunbook's structured workspace layout. |

The `no_retrieval` backend is literally four lines of code and illustrates the minimum required to implement the interface:

```python
# memory_modules/no_retrieval.py
@register_memory
class NoRetrievalMemory(Memory):
    memory_type = "no_retrieval"

    def insert(self, trajectory): return None
    def query(self, query, query_image=None): return []
```

The `rag` backend, by contrast, maintains in-memory NumPy embedding matrices for raw state slices, procedure notes, and hint notes, and saves them to disk as `.npy` files for reuse across runs.

Sources: [memory_modules/no_retrieval.py:1-19](), [memory_modules/rag.py:483-489](), [memory_modules/rag.py:958-968]()

---

## System Architecture at a Glance

```text
┌─────────────────────────────────────────────────────────────────┐
│  data/          download + validate trajectories and questions   │
├─────────────────────────────────────────────────────────────────┤
│  memory_modules/                                                 │
│   memory.py     Memory ABC + register_memory + build_memory      │
│   no_retrieval  zero baseline (no context)                       │
│   rag           embedding-based retrieval (raw slices + notes)   │
│   agentrunbook_r  multi-query retrieval, event notes, reranking  │
│   codex / agentrunbook_c  agent-as-search (Codex binary)         │
├─────────────────────────────────────────────────────────────────┤
│  evaluation/                                                     │
│   harness.py    3-pass runner: insert → query → score            │
│   run_eval.py   CLI wrapper for shell scripts                    │
│   scripts/      run_*.sh — one script per baseline               │
├─────────────────────────────────────────────────────────────────┤
│  leaderboard/   LAFS scoring, submission packaging (2 steps)     │
└─────────────────────────────────────────────────────────────────┘

External dependencies (bring your own):
  Reader model   →  OpenAI-compatible endpoint (default: Qwen3.5-9B)
  Embed model    →  OpenAI-compatible endpoint (default: Qwen3-Embedding-8B)
  LLM judge      →  OPENAI_API_KEY (default: gpt-5.2)
  Codex binary   →  CODEX_BINARY path (for codex/agentrunbook_c only)
```

---

## Two Domains, Two Tiers

Every run is scoped to one **domain** and one **tier**:

- **Domain** — `web` (Magento shopping site + forum) or `enterprise` (ServiceNow). The domain controls the system prompt given to the reader model and which question set is loaded.
- **Tier** — `small` or `medium`. Tiers differ in haystack size; medium has larger haystacks and heavier compute requirements.

Leaderboard submissions must cover both domains for a given tier. The `leaderboard/combine_aggregated_metrics.py` script merges the two domain result files into a single combined metrics file before packaging.

Sources: [README.md:175-185](), [evaluation/harness.py:69-88]()

---

## Adding Your Own Memory System

The extension path is intentional and documented directly in the README and enforced by the `Memory` base class:

1. Create a new file in `memory_modules/`.
2. Subclass `Memory`, set `memory_type = "your_name"`, and decorate with `@register_memory`.
3. Implement `insert` and `query`.
4. Write a config JSON: `{"memory_type": "your_name", "memory_params": {}}`.
5. Pass it to the harness with `--memory-config-path`.

The `query` method receives the question text and an optional screenshot path. It may call `self.get_query_context()` to access the `question_id`, `question_type`, and the full question item — useful for specializing retrieval by question category.

Sources: [memory_modules/memory.py:140-185](), [README.md:189-232]()

---

## Closing Summary

LongMemEval-V2 is a rigorous, open benchmark for long-term agent memory. Its core insight is that memory quality should be measured not just by whether an AI can answer questions correctly, but by whether it can absorb hundreds of task recordings and become the kind of knowledgeable colleague that knows where the traps are. The repo ships the data pipeline, a clean plug-in interface for memory backends, six reference implementations ranging from trivial to sophisticated, and an end-to-end evaluation harness that measures both accuracy and retrieval latency — the two dimensions that together determine how useful a memory system actually is in practice.

Sources: [README.md:28-43](), [leaderboard/README.md:26-34]()

---

## 02. Five Things a Good Memory Must Know

> The five memory abilities the benchmark tests — static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness — explained with real question categories from the harness source code.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/02-five-things-a-good-memory-must-know.md
- Generated: 2026-05-22T06:14:44.460Z

### Source Files

- `evaluation/harness.py`
- `evaluation/qa_eval_metrics.py`
- `evaluation/memory_configs/no_retrieval.json`
- `README.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [evaluation/harness.py](evaluation/harness.py)
- [evaluation/qa_eval_metrics.py](evaluation/qa_eval_metrics.py)
- [evaluation/memory_configs/no_retrieval.json](evaluation/memory_configs/no_retrieval.json)
- [evaluation/memory_configs/rag_query_to_slice.json](evaluation/memory_configs/rag_query_to_slice.json)
- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [data/public_data.py](data/public_data.py)
- [leaderboard/README.md](leaderboard/README.md)
- [README.md](README.md)
</details>

# Five Things a Good Memory Must Know

LongMemEval-V2 is a benchmark that asks a concrete question: can an AI agent's memory system make it act like an experienced colleague who has learned the quirks of a particular environment? To answer that, the benchmark does not ask one generic "how good is your memory?" question. It identifies five distinct skills that an experienced colleague actually needs, then tests each one independently across 451 hand-curated questions. This page explains what each skill means, where it appears in the harness source code, and how it is scored.

Understanding these five abilities is the key to interpreting any evaluation result from this benchmark—whether you are running one of the included baselines, implementing your own memory backend, or reading a leaderboard entry.

---

## The Five Abilities at a Glance

The five ability categories and their internal identifiers are defined in `evaluation/harness.py`:

```python
# evaluation/harness.py, lines 44-52
CATEGORY_MAP = {
    "static-environment": "static",
    "static-environment-abs": "static-abs",
    "dynamic-environment": "dynamic",
    "dynamic-environment-abs": "dynamic-abs",
    "procedure": "procedure",
    "procedure-abs": "procedure-abs",
    "errors-gotchas": "gotchas",
}
```

Each `question_type` string in the dataset maps to a short category key used in metrics. The suffix `-abs` marks *abstention variants*, which test a related but different skill (see [Abstention: the hidden test beneath three categories](#abstention-the-hidden-test-beneath-three-categories) below).

The four primary non-abstention categories and the gotchas category are:

| Dataset `question_type` | Harness category key | Memory ability |
|---|---|---|
| `static-environment` | `static` | Static state recall |
| `dynamic-environment` | `dynamic` | Dynamic state tracking |
| `procedure` | `procedure` | Workflow knowledge |
| `errors-gotchas` | `gotchas` | Environment gotchas |

A fifth ability — **premise awareness** — is tested by the `-abs` variants of static, dynamic, and procedure questions.

Sources: [evaluation/harness.py:44-52]()

---

## Ability 1 — Static State Recall

**What it tests:** Does the memory system remember stable facts about the environment — landmark locations, page layouts, UI element names, module affordances, and subtle state differences that do not change between sessions?

Think of it like knowing the floor plan of an office building. A new hire walks in every day unsure where the printer is. An experienced colleague knows it is on the third floor, east wing, next to the supply closet. The environment has not changed; the colleague just carries that layout in memory.

In the benchmark, static questions ask about things that were true throughout the trajectory history and remain true — configuration values, menu structures in a custom Magento storefront, field names in a ServiceNow form. A memory system must have ingested those environment details from prior agent trajectories and be able to surface them when asked.

**Scoring:** Answers are compared to a gold string using normalized phrase-set matching. The scorer strips punctuation, lowercases, normalizes hyphens, and checks that every phrase from the gold answer appears in the prediction:

```python
# evaluation/qa_eval_metrics.py, lines 71-87
def norm_phrase_set_match(
    prediction: str | None,
    answer: str | None,
    *,
    separators: Iterable[str] = DEFAULT_SEPARATORS,
    require_non_empty: bool = True,
    **normalize_kwargs: bool,
) -> bool:
    normalized_pred = normalize_phrase(prediction, **normalize_kwargs)
    answer_phrases = split_phrases(answer, separators=separators, **normalize_kwargs)
    ...
    for phrase in set(answer_phrases):
        pattern = r"\b%s\b" % re.escape(phrase)
        if re.search(pattern, normalized_pred) is None:
            return False
    return True
```

Multiple-choice static questions may also use `mc_choice_match` or `mc_choice_set_match`, which strip option-label noise before comparing letters.

Sources: [evaluation/qa_eval_metrics.py:29-87](), [evaluation/harness.py:44-46]()

---

## Ability 2 — Dynamic State Tracking

**What it tests:** Does the memory system understand how states and actions change the environment over time? A static fact is timeless; a dynamic fact has history. Which products are currently in a user's cart? What is the current approval status of a ticket? Did a configuration get changed mid-trajectory?

The analogy: an experienced colleague does not just remember that a certain setting *can* be toggled — they remember that *it was* toggled last Tuesday, and that the current state is therefore different from the default.

Dynamic questions require the memory system to have tracked sequential change across trajectory steps, not just catalogued individual observations. A memory backend that indexes each state snapshot in isolation may miss the final value when an attribute was overwritten.

**Scoring:** Same normalized phrase-matching functions as static questions. Because the answer is a specific value at a specific point in time, partial matches do not earn credit.

Sources: [evaluation/harness.py:47-49](), [evaluation/qa_eval_metrics.py:71-87]()

---

## Ability 3 — Workflow Knowledge

**What it tests:** Does the memory system know the steps needed to complete recurring tasks in a customized environment? Generic knowledge of "how Magento works" or "how ServiceNow works" is not enough — the benchmark targets *customized* deployments where the local instance has non-standard workflows.

For example: submitting an expense report in a standard ServiceNow instance might follow a documented path, but a company-specific customization could require an extra approval step, a non-obvious field, or a workaround sequence. An experienced colleague has internalized those local steps from doing them repeatedly.

Procedure questions ask for step sequences, ordered actions, or required sub-tasks. Some are answered as comma-separated lists, which the scorer handles with `norm_phrase_set_match_ordered`:

```python
# evaluation/qa_eval_metrics.py, lines 90-109
def norm_phrase_set_match_ordered(
    prediction: str | None,
    answer: str | None,
    ...
) -> bool:
    ...
    start = 0
    for phrase in answer_phrases:
        pattern = r"\b%s\b" % re.escape(phrase)
        match = re.search(pattern, normalized_pred[start:])
        if match is None:
            return False
        start += match.end()
    return True
```

This enforces that the predicted steps appear in the correct order, reflecting the real constraint that procedural knowledge is not just a bag of words.

Sources: [evaluation/harness.py:50-51](), [evaluation/qa_eval_metrics.py:90-109]()

---

## Ability 4 — Environment Gotchas

**What it tests:** Does the memory system recognize recurring local failure modes and know how to avoid them? Gotchas are traps that are invisible in documentation but become obvious to someone who has burned themselves on them before. They are the "if you do X, the system breaks in this specific way" insights that only experience teaches.

Examples might include: a button that appears active but silently drops the form submission, a field that accepts values but only saves when a related toggle is on, or a sequence that must be performed in a precise order to avoid a race condition in the customized front end.

**Scoring:** Gotchas answers are *open-ended insights*, not exact strings. Simple string matching is too strict — the same insight can be expressed in many ways. The benchmark uses an LLM judge for this category:

```python
# evaluation/qa_eval_metrics.py, lines 18-25
_GOTCHAS_JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader for gotchas-style insight questions. "
    "The reference answer describes the key insight(s). "
    "Grade 1 if the model response includes at least one correct insight point from the reference answer "
    "(paraphrase allowed), and does not contradict any reference point. "
    "If the model's direction is wrong, or it contains contradictions against any reference point, grade 0. "
    "If the model gives multiple points, partial coverage is enough for 1 as long as no contradictions appear."
)
```

The judge is invoked via `llm_gotchas_checker`, which sends the question, reference answer, full model response, and extracted final answer to an evaluator model (default `gpt-5.2` with `medium` reasoning effort). Partial insight coverage is rewarded; contradictions are penalized even if some wording accidentally overlaps.

```python
# evaluation/harness.py, lines 60-61
LLM_EVAL_FUNCTIONS = {"llm_abstention_checker", "llm_gotchas_checker"}
```

The harness routes questions whose `eval_function` is one of these to the LLM evaluator path rather than rule-based scoring.

Sources: [evaluation/qa_eval_metrics.py:18-25, 290-356](), [evaluation/harness.py:60-61]()

---

## Ability 5 — Premise Awareness

**What it tests:** Does the memory system catch assumptions that are valid in general but wrong in this specific deployment? Premise-aware questions are framed as if something is true, when in fact the memory record shows it is not. A system without good memory will confidently answer based on the (false) premise. A system with good memory will detect the contradiction and explain why the question's premise is flawed.

This is the most subtle ability. It is not enough to say "I don't know" (that earns a zero). The system must both reject the premise *and* identify the specific flaw, matching the explanation in the reference answer.

**How it appears in the harness:** Premise-awareness questions carry `question_type` values ending in `-abs` (`static-environment-abs`, `dynamic-environment-abs`, `procedure-abs`). The harness maintains a separate accounting for these:

```python
# evaluation/harness.py, lines 53-59
NON_ABSTENTION_CATEGORIES = ["static", "dynamic", "procedure", "gotchas"]
ABSTENTION_CATEGORIES = ["static-abs", "dynamic-abs", "procedure-abs"]
COMBINED_ABSTENTION_CATEGORY_PAIRS = {
    "static": ("static", "static-abs"),
    "dynamic": ("dynamic", "dynamic-abs"),
    "procedure": ("procedure", "procedure-abs"),
}
```

The domain system prompts instruct the reader model about both behaviors simultaneously:

```python
# evaluation/harness.py, lines 76-78 (web domain excerpt)
"If you do not know the answer, output exactly \\boxed{UNKNOWN}. "
"Do not guess. Never attempt to guess an answer if you are not sure. "
"If you believe the question's construction/premise is wrong, provide an "
"explanation in \\boxed{} explaining why the question is flawed."
```

**Scoring:** Premise-awareness questions use `llm_abstention_checker`, an LLM judge with a strict rubric:

```python
# evaluation/qa_eval_metrics.py, lines 8-17
_ABSTENTION_JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader for flawed-premise (abstention) questions. "
    "Judge whether a model answer correctly identifies that the question premise is wrong, "
    "consistent with the reference answer. "
    "If the model follows the flawed premise and gives a concrete answer under that premise, "
    "it must be graded 0. "
    "If the model's final answer is just UNKNOWN / cannot determine without identifying the flaw, grade 0. "
    "If the model is contradictory (both rejects premise and also gives a concrete premise-following answer), grade 0. "
    "Paraphrases are allowed when they preserve the same core flaw described by the reference answer."
)
```

There are four failure modes that all earn zero: following the premise, saying "unknown" without naming the flaw, contradicting yourself, and giving a paraphrase that misses the core flaw. Only correct flaw identification earns a 1.

Sources: [evaluation/harness.py:53-59, 69-88](), [evaluation/qa_eval_metrics.py:8-17, 221-287]()

---

## Abstention: The Hidden Test Beneath Three Categories

Static, dynamic, and procedure categories each have a parallel abstention variant. These are *not* separate categories in the final leaderboard scores — they are combined back into their parent categories for the `combined_abstention_by_category` metrics:

```python
# evaluation/harness.py, lines 986-990
combined_abstention_by_category: dict[str, Any] = {}
for cat, pair in COMBINED_ABSTENTION_CATEGORY_PAIRS.items():
    rows = [r for r in records if r["category"] in pair]
    combined_abstention_by_category[cat] = breakdown(rows)
```

This means the published accuracy for "static" in the leaderboard blends both plain static recall and premise-detection for static questions. A memory system that is excellent at recall but blind to false premises will look weaker than its recall alone would suggest.

Sources: [evaluation/harness.py:976-998]()

---

## How the Harness Uses Category Information

At runtime, the harness reads `question_type` from each question record, converts it to a category key via `CATEGORY_MAP`, and stores the result on the prepared question row. This category drives which scoring branch is used:

```python
# evaluation/harness.py, lines 1099-1100
q_eval_name = eval_name(q_eval_spec)
...
"category": category_from_question_type(qtype),
```

During scoring (pass 3 of the three-pass pipeline), the record is routed to either the LLM evaluator path (for `llm_gotchas_checker` and `llm_abstention_checker`) or the rule-based path (for all other eval function specs such as `norm_phrase_set_match` or `mc_choice_match`). The `eval_function` field in each question JSON encodes not just the function name but also pipe-delimited options:

```python
# evaluation/qa_eval_metrics.py, lines 568-595
def parse_eval_function_spec(spec: str) -> tuple[Callable[..., Any], dict[str, Any]]:
    parts = [part.strip() for part in spec.split("|")]
    name = parts[0]
    ...
    kwargs: dict[str, Any] = {}
    for part in parts[1:]:
        key, value = part.split("=", 1)
        kwargs[key] = _parse_eval_value(key, value)
    return func, kwargs
```

For example, a spec of `norm_phrase_set_match|lower=true|separators=[,;]` selects the function and passes custom normalization options without changing the harness code.

Sources: [evaluation/harness.py:1087-1119](), [evaluation/qa_eval_metrics.py:568-601]()

---

## What a Memory Backend Must Do

Every memory backend — whether the trivial `no_retrieval` baseline or a complex RAG system — must implement two methods from `memory_modules/memory.py`:

```python
# memory_modules/memory.py, lines 43-54
@abstractmethod
def insert(self, trajectory: dict[str, object]) -> None:
    """Index one full trajectory object into the backend."""
    raise NotImplementedError

@abstractmethod
def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    """Return a formatted memory context payload for a query."""
    raise NotImplementedError
```

The `insert` call runs once per trajectory during haystack construction. The `query` call runs once per question and must return a list of `{"type": "text"|"image", "value": ...}` items. The harness appends these to the reader prompt before calling the answer model.

The no-retrieval baseline shows the floor: it ignores all trajectories and returns an empty list, forcing the reader to rely entirely on parametric knowledge. Any meaningful memory improvement over this baseline requires a backend that can correctly retrieve evidence for all five ability types.

```python
# memory_modules/no_retrieval.py, lines 9-18
@register_memory
class NoRetrievalMemory(Memory):
    memory_type = "no_retrieval"

    def insert(self, trajectory: dict[str, object]) -> None:
        return None

    def query(self, query: str, query_image: str | None = None) -> list[MemoryContextItem]:
        return []
```

Sources: [memory_modules/memory.py:43-54](), [memory_modules/no_retrieval.py:4-18]()

---

## Leaderboard Metrics by Category

The leaderboard extracts five per-category accuracy numbers from the `aggregated_metrics.json` each run produces:

| Leaderboard metric | Source category | Ability tested |
|---|---|---|
| `static_accuracy` | `static` (+ `static-abs`) | Static state recall + premise awareness |
| `dynamic_accuracy` | `dynamic` (+ `dynamic-abs`) | Dynamic state tracking + premise awareness |
| `procedure_accuracy` | `procedure` (+ `procedure-abs`) | Workflow knowledge + premise awareness |
| `gotchas_accuracy` | `gotchas` | Environment gotchas |
| `overall_full_set` | All categories | Aggregate across all five abilities |

The final LAFS score combines `overall_full_set` accuracy with `memory_query_avg_seconds` latency, so a memory system cannot win by being accurate but unbearably slow. The tradeoff between all five abilities and retrieval speed is the central design challenge the benchmark is intended to expose.

Sources: [leaderboard/README.md](), [evaluation/harness.py:968-998]()

---

## Summary

LongMemEval-V2 breaks "long-term memory" into five testable components: static environment knowledge, dynamic change tracking, step-by-step workflow recall, recognition of local failure patterns, and the ability to challenge a false assumption rather than answer it blindly. Each ability maps to a specific `question_type` tag in the dataset, a specific evaluation function in `evaluation/qa_eval_metrics.py`, and a specific bucket in the `aggregated_metrics.json` output. Two of the five abilities — gotchas and premise awareness — require an LLM judge rather than rule-based matching, reflecting that insight and flaw detection cannot be reduced to string overlap. Any memory backend that performs well across all five must combine faithful retrieval of stable facts, temporal ordering of state changes, procedural sequencing, failure-mode recognition, and the epistemic discipline to say "that premise is wrong" when the evidence demands it. Sources: [evaluation/harness.py:44-61](), [evaluation/qa_eval_metrics.py:8-25]().

---

## 03. Downloading & Preparing the Haystack

> How trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects — covering the three data scripts and the validate step.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/03-downloading-preparing-the-haystack.md
- Generated: 2026-05-22T06:13:05.984Z

### Source Files

- `data/download_data.py`
- `data/prepare_data.py`
- `data/validate_data.py`
- `data/public_data.py`
- `environment.yml`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [data/download_data.py](data/download_data.py)
- [data/prepare_data.py](data/prepare_data.py)
- [data/validate_data.py](data/validate_data.py)
- [data/public_data.py](data/public_data.py)
- [environment.yml](environment.yml)
- [README.md](README.md)
</details>

# Downloading & Preparing the Haystack

This page explains how LongMemEval-V2 trajectory data travels from Hugging Face onto your local machine into the exact directory layout the evaluation harness expects. The process has three sequential steps — download, screenshot preparation, and validation — each implemented as its own Python script in the `data/` directory.

If you skip or partially complete any step, the harness will fail loudly rather than silently produce wrong results. Understanding what each script does — and why the steps are ordered — makes troubleshooting straightforward.

---

## What "the haystack" is

Think of the dataset as a library of recorded browsing sessions. Each *trajectory* is one such session: a sequence of browser states (screenshots, page text, actions). A *haystack* for a given question is the specific set of trajectory IDs the memory system must sift through to find an answer. The haystack files live at:

```
data/longmemeval-v2/haystacks/lme_v2_small.json   # small tier
data/longmemeval-v2/haystacks/lme_v2_medium.json  # medium tier
```

Each haystack file is a JSON object mapping `question_id → [trajectory_id, ...]`. The harness references those IDs, so every trajectory ID must resolve to both a metadata record (`trajectories.jsonl`) and a physical screenshot directory (`screenshots/<trajectory_id>/`).

---

## Step 1 — Download from Hugging Face (`download_data.py`)

### What it does

`data/download_data.py` calls `huggingface_hub.snapshot_download` to fetch the full dataset snapshot from the Hub repository `xiaowu0162/longmemeval-v2` into a local directory.

```python
# data/download_data.py:62-67
snapshot_download(
    repo_id=args.repo_id,
    repo_type="dataset",
    revision=args.revision,
    local_dir=str(data_root),
)
```

After downloading, it checks that the two required root files exist:

```python
# data/download_data.py:68-69
require((data_root / "questions.jsonl").exists(), ...)
require((data_root / "trajectories.jsonl").exists(), ...)
```

### Idempotency

If both sentinel files already exist the script prints `"status": "already_present"` and exits without downloading again. Pass `--force` to delete the data root and re-download from scratch.

Sources: [data/download_data.py:39-52](), [data/download_data.py:62-69]()

### Usage

```bash
python data/download_data.py --data-root data/longmemeval-v2
```

| Flag | Default | Effect |
|---|---|---|
| `--repo-id` | `xiaowu0162/longmemeval-v2` | Hugging Face dataset repo |
| `--revision` | *(latest)* | Pin to a specific commit/tag |
| `--data-root` | `data/longmemeval-v2` | Where to write the snapshot |
| `--force` | off | Wipe and re-download if present |

After a successful download the script prints a JSON object that includes the recommended `next` commands — `prepare_data.py` then `validate_data.py` — so you always know what to run next.

Sources: [data/download_data.py:78-84]()

---

## Step 2 — Extract archives and build the screenshot tree (`prepare_data.py`)

### Why this step exists

Screenshots are large. The Hugging Face snapshot ships them as `.tar.gz` archives rather than individual files to reduce transfer overhead. Before the harness can reference `screenshots/<trajectory_id>/<step>.png`, those archives must be extracted and the resulting directories must be reachable under a single stable path prefix.

`data/prepare_data.py` is a thin CLI wrapper; all real logic lives in `data/public_data.prepare_screenshots`.

```python
# data/prepare_data.py:22-26
result = prepare_screenshots(
    Path(args.data_root).expanduser().resolve(),
    mode=args.mode,
    extract_archives=not args.no_extract_archives,
)
```

Sources: [data/prepare_data.py:17-27]()

### What `prepare_screenshots` does

The function looks for screenshot content in two locations inside `data_root`:

1. **`trajectory_screenshots/`** — a directory that may contain either pre-extracted subdirectories *or* `.tar.gz` archives named after three source groups:

| Archive name | `replace` flag | Purpose |
|---|---|---|
| `web_screenshots.tar.gz` | no | Web-domain trajectory screenshots |
| `enterprise_screenshots_base.tar.gz` | no | Enterprise base screenshots |
| `enterprise_screenshots_patch.tar.gz` | **yes** | Patches/updates that overwrite base entries |

2. **`trajectory_screenshots/` (direct)** — if the directory itself contains subdirectories with `.png` files, it is treated as an already-extracted source.

For each tar archive that has not yet been extracted, `_safe_extract_tar` validates every member path against a path-traversal check before unpacking:

```python
# data/public_data.py:133-136
member_target = (destination / member.name).resolve()
require(
    destination_resolved == member_target or destination_resolved in member_target.parents,
    f"Refusing unsafe archive member path: {member.name}",
)
```

Sources: [data/public_data.py:127-138]()

### Symlinking vs copying

After expansion, each `<trajectory_id>/` subdirectory inside a source directory is linked (or copied) into the unified `screenshots/` tree:

```
data/longmemeval-v2/screenshots/<trajectory_id>/   ← what the harness reads
```

The default mode is `symlink`: a relative symlink is created using `os.path.relpath` so the layout remains portable. If `symlink` fails (e.g., on a filesystem that does not support symlinks), the code falls back automatically to `shutil.copytree`.

```python
# data/public_data.py:154-159
if mode == "symlink":
    try:
        _relative_symlink(src.resolve(), dst)
        return "symlinked"
    except OSError:
        shutil.copytree(src, dst)
        return "copied"
```

The `enterprise_screenshots_patch` source has `replace=True`, meaning its entries overwrite existing symlinks for the same trajectory ID. All other sources skip directories that already exist.

Sources: [data/public_data.py:146-162](), [data/public_data.py:165-213]()

### Usage

```bash
export DATA_ROOT="$(pwd)/data/longmemeval-v2"
python data/prepare_data.py --data-root "$DATA_ROOT" --mode symlink
```

| Flag | Default | Effect |
|---|---|---|
| `--data-root` | *(required)* | Path downloaded in Step 1 |
| `--mode` | `symlink` | `symlink` or `copy` |
| `--no-extract-archives` | off | Skip tar extraction (archives already unpacked) |

The script prints a JSON summary with counts of how many trajectory directories were symlinked, copied, or skipped.

---

## Step 3 — Validate the layout (`validate_data.py`)

`data/validate_data.py` is a sanity check that confirms every piece of data the harness will need is actually present and internally consistent. It calls `data/public_data.validate_public_data`.

Sources: [data/validate_data.py:13-30](), [data/public_data.py:217-260]()

### What it checks

The validator loads all three data files — `questions.jsonl`, `trajectories.jsonl`, and the selected haystack — then runs a sequence of assertions:

```text
1. Every question in questions.jsonl has a haystack entry.
2. Every haystack entry points to a known question.
3. Every question has domain "web" or "enterprise" and non-empty question text.
4. Any question image path (if present) resolves to a real file.
5. Every trajectory_id referenced in any haystack exists in trajectories.jsonl.
6. No duplicate trajectory IDs within a single haystack.
7. Trajectories and their haystacks share the same domain (no cross-domain mixing).
8. Every screenshot path in every trajectory state resolves to a real file under data_root.
```

Check 8 is the one that catches a missing `prepare_data.py` run:

```python
# data/public_data.py:242-253
for trajectory in trajectories.values():
    for state in trajectory.get("states", []):
        screenshot_value = state.get("screenshot")
        if isinstance(screenshot_value, str) and not (data_root / screenshot_value).exists():
            missing_screenshots += 1
require(
    missing_screenshots == 0,
    f"Missing {missing_screenshots} trajectory screenshots. Run data/prepare_data.py first.",
)
```

### Usage

```bash
python data/validate_data.py --data-root "$DATA_ROOT" --tier small
```

| Flag | Default | Effect |
|---|---|---|
| `--data-root` | *(required)* | Same path used in steps 1 and 2 |
| `--tier` | `small` | `small` or `medium` — selects which haystack file to validate |
| `--check-screenshots` / `--no-check-screenshots` | on | Skip screenshot existence checks when disk I/O is slow |

On success the script prints a JSON summary:

```json
{
  "questions": 451,
  "trajectories": ...,
  "haystack_questions": ...,
  "tier": "small",
  "check_screenshots": true
}
```

---

## Data flow diagram

```text
Hugging Face Hub
  xiaowu0162/longmemeval-v2
         │
         │  snapshot_download()
         ▼
data/longmemeval-v2/
  ├── questions.jsonl          ← 451 questions with domain, text, optional image
  ├── trajectories.jsonl       ← all trajectory records (states, screenshots paths)
  ├── haystacks/
  │   ├── lme_v2_small.json    ← question_id → [trajectory_id, ...]
  │   └── lme_v2_medium.json
  └── trajectory_screenshots/
      ├── web_screenshots.tar.gz
      ├── enterprise_screenshots_base.tar.gz
      └── enterprise_screenshots_patch.tar.gz
         │
         │  prepare_data.py  (extract + symlink)
         ▼
data/longmemeval-v2/
  └── screenshots/
      ├── <trajectory_id_A>/   ← symlink → trajectory_screenshots/web_screenshots/...
      │   ├── step_0.png
      │   └── step_1.png
      └── <trajectory_id_B>/
          └── ...
         │
         │  validate_data.py  (assert all references resolve)
         ▼
     ✓ Ready for evaluation harness
```

---

## The shared library: `public_data.py`

`data/public_data.py` is the library both `prepare_data.py` and `validate_data.py` import from. It exposes:

| Function | Used by | Purpose |
|---|---|---|
| `read_jsonl` | both | Parse a `.jsonl` file into a list of dicts |
| `load_questions` | both | Load and optionally filter questions by domain |
| `load_trajectories` | both | Load trajectories keyed by ID |
| `load_haystack` | both | Load a tier-specific haystack JSON |
| `resolve_question_image` | both | Resolve a question's optional image path |
| `prepare_screenshots` | prepare | Extract archives + build `screenshots/` tree |
| `validate_public_data` | validate | Run all integrity checks |
| `materialize_runtime_questions` | harness | Emit a filtered question list for a run |
| `materialize_runtime_haystack` | harness | Emit a filtered haystack for a run |

The `materialize_*` functions are not called by the data scripts but by the evaluation harness at runtime — they translate the raw JSONL files into per-run JSON files with resolved absolute image paths.

Sources: [data/public_data.py:34-124]()

---

## Environment prerequisites

The download step requires `huggingface_hub`, which is listed in `requirements.txt` (pulled in via `environment.yml`). The conda environment is named `lme-v2-release` and uses Python 3.11.

```yaml
# environment.yml
name: lme-v2-release
dependencies:
  - python=3.11
  - pip:
      - -r requirements-torch.txt
      - -r requirements.txt
```

If you run `download_data.py` outside the conda environment you will get a clear `RuntimeError: Missing huggingface_hub` rather than a confusing import error.

Sources: [environment.yml:1-10](), [data/download_data.py:54-59]()

---

## Summary

Three scripts in the `data/` directory form a strict pipeline: `download_data.py` fetches the raw snapshot from Hugging Face and verifies that `questions.jsonl` and `trajectories.jsonl` are present; `prepare_data.py` extracts screenshot archives and builds a stable `screenshots/<trajectory_id>/` tree through relative symlinks (with safe path-traversal validation on every archive member); and `validate_data.py` confirms that every question has a haystack, every haystack references real trajectories of the correct domain, and every screenshot path resolves on disk — catching any gap between the download and preparation steps before an expensive evaluation run begins. The shared implementation in `data/public_data.py` also provides `materialize_runtime_questions` and `materialize_runtime_haystack`, which the evaluation harness calls at runtime to assemble per-run inputs from the same prepared data root.

Sources: [data/public_data.py:217-260]()

---

## 04. The Six Memory Backends: How Each One Works

> A plain-English tour of the six pluggable memory backends — no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C — explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/04-the-six-memory-backends-how-each-one-works.md
- Generated: 2026-05-22T06:16:36.035Z

### Source Files

- `memory_modules/memory.py`
- `memory_modules/no_retrieval.py`
- `memory_modules/rag.py`
- `memory_modules/agentrunbook_r.py`
- `memory_modules/agentrunbook_c.py`
- `memory_modules/codex.py`
- `memory_modules/trajectory_store.py`
- `memory_modules/support.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [memory_modules/rag.py](memory_modules/rag.py)
- [memory_modules/agentrunbook_r.py](memory_modules/agentrunbook_r.py)
- [memory_modules/agentrunbook_c.py](memory_modules/agentrunbook_c.py)
- [memory_modules/codex.py](memory_modules/codex.py)
- [memory_modules/trajectory_store.py](memory_modules/trajectory_store.py)
</details>

# The Six Memory Backends: How Each One Works

LongMemEval-V2 evaluates how well an agent can answer questions about its own past history—long sequences of browser-automation trajectories. To answer a question, the agent needs to *retrieve* the right evidence from those past trajectories. The memory backend is the component that does this: it stores trajectories when they arrive (`insert`) and retrieves relevant context when a question appears (`query`).

There are six concrete backends, each with a different retrieval strategy ranging from "do nothing" to "spawn a full coding agent to read the files." This page explains what each backend stores, how it retrieves, and describes the two-method contract (`insert` / `query`) that any custom backend must satisfy.

---

## The Base Contract: `Memory`

Every backend inherits from the abstract class `Memory` (defined in `memory_modules/memory.py`). The contract is minimal:

```python
# memory_modules/memory.py:43-54
@abstractmethod
def insert(self, trajectory: dict[str, object]) -> None:
    """Index one full trajectory object into the backend."""
    raise NotImplementedError

@abstractmethod
def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    """Return a formatted memory context payload for a query."""
    raise NotImplementedError
```

`insert` receives one trajectory dictionary (id, goal, states, screenshots, actions) and must index it in whatever form the backend prefers. `query` receives a natural-language question and optionally an image, and must return a list of `MemoryContextItem` dicts, each with `"type"` (`"text"` or `"image"`) and `"value"`. The harness concatenates these items into the final context block passed to the answering model.

Two optional hooks matter in practice:

- `configure_runtime(**kwargs)` — called once after build/load to warm up connections (embedding client, tokenizer).
- `post_query_hook(...)` — synchronous work after retrieval, used for logging.

Backends register themselves with `@register_memory`, which adds them to the `MEMORY_TYPES` dict keyed by `memory_type`. The factory functions `build_memory` and `load_memory` look up and instantiate them from this registry.

Sources: [memory_modules/memory.py:25-54](), [memory_modules/memory.py:140-144]()

---

## The Six Backends at a Glance

```text
┌─────────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ memory_type         │ What is stored at insert time │ How query retrieves               │
├─────────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ no_retrieval        │ nothing                       │ returns []                        │
│ rag                 │ raw-state slices + embeddings │ cosine similarity vector search   │
│                     │ + optional LLM notes          │ (+ optional notes search)         │
│ agentrunbook_r      │ raw-state slices + events     │ LLM generates multi-pool queries; │
│                     │ + notes + embeddings          │ cosine search + optional rerank   │
│ codex               │ trajectory JSON files on disk │ Codex CLI subprocess explores     │
│                     │                               │ files and writes output JSON      │
│ agentrunbook_c      │ trajectory JSON + summaries   │ Codex CLI subprocess with richer  │
│                     │ (concise & full Markdown)     │ sandbox (trajectory summary files)│
└─────────────────────┴──────────────────────────────┴──────────────────────────────────┘
```

Note: The five distinct class names map to six evaluation variants because `rag` covers both `rag` (raw states only) and `rag` with `enable_notes=true` (raw states + LLM-generated notes).

---

## 1. `no_retrieval` — The Baseline

**Class:** `NoRetrievalMemory`  
**File:** `memory_modules/no_retrieval.py`

This is the evaluation baseline. It intentionally does nothing.

```python
# memory_modules/no_retrieval.py:10-18
def insert(self, trajectory: dict[str, object]) -> None:
    return None

def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    return []
```

`insert` discards every trajectory. `query` returns an empty list, so the answering model receives no memory context at all. Its purpose is to establish a lower bound: how well does the model perform on memory questions when it has zero access to its past? All other backends should score strictly higher.

Sources: [memory_modules/no_retrieval.py:4-19]()

---

## 2. `rag` — Embedding-Based Retrieval (with Optional Notes)

**Class:** `RagMemory`  
**File:** `memory_modules/rag.py`

This backend borrows the full retrieval and embedding machinery from `AgentRunbookR` (via direct method assignment) and uses it without the LLM-driven query-generation step.

### What is stored

On `insert`, each trajectory is converted into:

1. **Raw-state slices** — windows of AXTree text centered on each state (controlled by `raw_state_slice_radius`). Each slice becomes one searchable entry; its AXTree text is embedded and stored in a NumPy matrix (`raw_state_embeddings`).

2. **LLM-generated notes** (when `enable_notes=True`) — an LLM (configured via `controller_params`) reads the full simplified trajectory and produces two structured notes per trajectory:
   - **procedure note** — how to navigate or accomplish the task family (reusable procedure).
   - **hint note** — a concise answer-ready hint with key visible facts.

   Each note is embedded and stored in separate matrices.

### How query works

`query` receives the benchmark question directly as the embedding query — no LLM query-generation step:

```python
# memory_modules/rag.py:554-570
raw_results = self._search_entries(
    entries=self.raw_state_entries,
    embeddings=self.raw_state_embeddings,
    query_text=query,
    top_k=self.raw_state_search_top_k,
)
note_results = self._search_note_query(query) if self.enable_notes else {
    "procedure_results": [],
    "hint_results": [],
}
```

Retrieval is pure cosine similarity using the pre-computed embedding matrix. Top-k results are assembled into `MemoryContextItem` blocks.

### Configuration highlights

| Section | Key params |
|---|---|
| `embedding_params` | model, base_url, max_input_tokens, query_instruction |
| `index_params` | raw_state_slice_radius (window size around each state) |
| `retrieval_params` | enable_notes, raw_state_search_top_k, note_search_top_k_per_type |
| `controller_params` | model, base_url (used only when enable_notes=True) |

The backend can cache pre-computed embeddings and notes in a `trajectory_pool_root` directory, allowing fast pooled loading at insert time instead of re-embedding.

Sources: [memory_modules/rag.py:67-69](), [memory_modules/rag.py:590-656](), [memory_modules/rag.py:958-1005]()

---

## 3. `agentrunbook_r` — LLM-Driven Multi-Pool Retrieval

**Class:** `AgentRunbookR`  
**File:** `memory_modules/agentrunbook_r.py`

This is the most feature-rich retrieval backend. It adds a third pool (state-transition *events*) and an LLM query-generation step, making retrieval smarter at the cost of a controller-model call per question.

### What is stored

On `insert`, each trajectory produces three pools:

1. **Raw-state slices** — same windowed AXTree slices as `rag`, embedded with the same embedding model.
2. **Event entries** — for each state-to-state transition, the LLM (controller model) writes a short two-part description: `overview` (where in the workflow this step sits) and `state_transition` (what specifically changed after the action). Events are particularly useful for navigation and before/after questions.
3. **Notes** — same procedure and hint notes as `rag`, generated by the controller LLM.

The prompts driving event generation are inlined in the file:

```python
# memory_modules/agentrunbook_r.py:219-253 (EVENT_GENERATION_SYSTEM_PROMPT excerpt)
# "overview": one concise paragraph that briefly recaps the concrete task goal...
# "state_transition": one concise paragraph that explicitly compares the post-state to the pre-state...
```

### How query works

Unlike `rag`, `AgentRunbookR` asks the controller LLM to **decompose the question** into a structured set of retrieval queries before searching:

```json
{
  "raw_state_queries": ["incident form suggestion button fields mandatory", ...],
  "event_query": "apply 4 stars & up filter and observe what changes",
  "note_query": "ServiceNow create incident form field requirements"
}
```

The prompt (`QUERY_GENERATION_SYSTEM_PROMPT`) instructs the LLM to generate up to 5 distinct `raw_state_queries` (one per distinct UI surface), one `event_query` (only for navigation/before-after questions), and one broader `note_query`. Each pool is then searched separately with cosine similarity and a configurable top-k budget.

An optional reranking step (`enable_rerank=True`) asks the controller LLM to filter each pool's candidates down to those that actually help answer the question.

### Default models

```python
# memory_modules/agentrunbook_r.py:41-57
DEFAULT_CONTROLLER_MODEL = "Qwen/Qwen3.5-9B"
DEFAULT_CONTROLLER_BASE_URL = "http://localhost:8023/v1"
DEFAULT_EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-8B"
DEFAULT_EMBEDDING_BASE_URL = "http://localhost:8114/v1"
```

These defaults point to locally-hosted models over OpenAI-compatible endpoints—fully BYOK/BYOC.

Sources: [memory_modules/agentrunbook_r.py:41-94](), [memory_modules/agentrunbook_r.py:126-162](), [memory_modules/agentrunbook_r.py:219-253]()

---

## 4. `codex` — Coding-Agent Filesystem Exploration

**Class:** `CodexMemory`  
**File:** `memory_modules/codex.py`

Instead of building an in-memory index, `codex` stores trajectories as plain files on disk and delegates retrieval to the **Codex CLI**—a coding agent that reads those files and reasons over them.

### What is stored

`insert` materializes each trajectory as a `trajectory.json` file inside a workspace directory:

```python
# memory_modules/codex.py:525-561
def insert(self, trajectory: dict[str, object]) -> None:
    ...
    prepared = prepare_trajectory_insert_shared(trajectory, trajectories_root_dir=...)
    ...
    materialize_prepared_trajectory_shared(prepared, trajectory_dir)
    self.inserted_trajectory_ids.append(trajectory_id)
    self._write_index_files(self.workspace_dir)
```

An `index.json` and `haystack_manifest.json` are also written (and updated) to help the agent orient itself in the workspace.

### How query works

On `query`, the backend:

1. Writes `question.json` (the benchmark question, optionally including a question image) and `INSTRUCTION.md` into a sandbox directory.
2. Symlinks the `trajectories/` directory into the sandbox.
3. Launches the `codex` binary as a subprocess with `codex exec`.
4. Waits for the subprocess to write `memory_module_output.json` containing two fields:
   - `memory_markdown` — a structured Markdown narrative with a `## Support Analysis` and `## Relevant Procedure and Hint Notes` section.
   - `trajectory_spans` — a list of `{trajectory_id, start_state_index, end_state_index}` pointers to the most relevant trajectory slices.

```python
# memory_modules/codex.py:31-38 (DEFAULT_PROMPT)
"You are acting as a memory retrieval module. "
"Read the local files in this directory, especially INSTRUCTION.md and question.json. "
"The local trajectories/ directory contains the current haystack for this evaluation item..."
"Write your final result to memory_module_output.json as valid JSON."
```

The agent is given up to `MAX_TOTAL_SPAN_STATES = 20` states total across all spans. After the subprocess exits, the harness validates the output, loads the referenced trajectory states (with optional screenshots), and assembles them into `MemoryContextItem` blocks.

Up to `codex_max_attempts` retries are made if the output is missing or malformed.

Sources: [memory_modules/codex.py:22-80](), [memory_modules/codex.py:344-562](), [memory_modules/codex.py:747-767]()

---

## 5. `agentrunbook_c` — Codex Agent with Richer Scaffolding

**Class:** `AgentRunbookC`  
**File:** `memory_modules/agentrunbook_c.py`

`AgentRunbookC` extends `CodexMemory` and adds **pre-rendered trajectory summary files** and a dedicated inspection helper script to the agent's sandbox. It is the "C" (Codex + scaffolding) counterpart to the "R" (retrieval/embedding) variant.

### What is stored

`insert` is inherited from `CodexMemory`—trajectories are materialized to disk as `trajectory.json` files, just as in the plain `codex` backend.

### How query works

Before launching the Codex subprocess, `AgentRunbookC` renders two Markdown summary files:

- `TRAJECTORY_SUMMARY_CONCISE.md` — a brief overview of all trajectories in the workspace.
- `TRAJECTORY_SUMMARY_FULL.md` — a detailed narrative of every trajectory.

These are written to the `trajectories/` directory so the Codex agent can read them as a quick orientation before drilling into individual `trajectory.json` files:

```python
# memory_modules/agentrunbook_c.py:143-146
concise_output_path = self.workspace_dir / "trajectories" / TRAJECTORY_SUMMARY_CONCISE_FILENAME
full_output_path = self.workspace_dir / "trajectories" / TRAJECTORY_SUMMARY_FULL_FILENAME
```

The sandbox also receives an `inspect_trajectory.py` helper script (under `scripts/`), which gives the Codex agent a tool to inspect individual trajectories, single states, spans, or perform text matching within one trajectory quickly—without reading the entire JSON manually.

The agent is instructed via a specialized `INSTRUCTION.md` and a different default prompt:

```python
# memory_modules/agentrunbook_c.py:37-44
DEFAULT_QUERY_PROMPT = (
    "You are acting as the query-time agent for AgentRunbook-C. "
    "Read the local files in this directory, especially INSTRUCTION.md and question.json. "
    "The local trajectories/ directory contains the current haystack for this evaluation item, "
    "and you must explore trajectories/ before returning your final result. "
    "Use the local inspection helper under scripts/ when you need to inspect one trajectory..."
)
```

The output schema and post-processing logic are identical to `codex`: `memory_module_output.json` with `memory_markdown` and `trajectory_spans`.

Sources: [memory_modules/agentrunbook_c.py:24-45](), [memory_modules/agentrunbook_c.py:108-114](), [memory_modules/agentrunbook_c.py:135-223]()

---

## The insert/query Contract in Detail

The table below summarizes the complete contract for custom backends:

| Method | Required | Inputs | Expected output |
|---|---|---|---|
| `insert(trajectory)` | Yes | `dict` with id, goal, states, actions, screenshots | Side-effect only; index or persist data. Must not crash on duplicate ids without signaling error. |
| `query(query, query_image)` | Yes | `str` question, optional image path | `list[MemoryContextItem]` — each item `{"type": "text"|"image", "value": str}` |
| `configure_runtime(**kwargs)` | No | Arbitrary kwargs | Warm connections; return `None` |
| `post_query_hook(...)` | No | query, query_image, memory_context | Return `dict` or `None` |
| `_save_backend(output_dir)` | No | `Path` | Persist any non-config state (embeddings, JSONL pools, trajectory files) |
| `_load_backend(input_dir)` | No | `Path` | Restore state from a previous `save_memory` call |
| `reconcile_loaded_memory_config(saved, requested)` | No (has default) | Two `MemoryConfig` dicts | Return the effective config; raise on incompatible mismatch |

A backend that only implements `insert` and `query` is fully functional. The save/load pair is needed only if the backend builds state that is expensive to recompute (embeddings, LLM-generated notes).

### The `MemoryContextItem` format

```python
# memory_modules/memory.py:14-16
class MemoryContextItem(TypedDict):
    type: Literal["text", "image"]
    value: str
```

Text items have the retrieved text as `value`. Image items have an **absolute filesystem path** to a screenshot as `value`. The harness passes both types to the answering model; image items are only meaningful when the evaluating model is multimodal.

Sources: [memory_modules/memory.py:14-16](), [memory_modules/memory.py:43-54](), [memory_modules/memory.py:56-68]()

---

## Data Flow: Insert → Query

```text
               insert(trajectory)
               ┌───────────────────────────────────────────────────────────────┐
               │  trajectory_store.prepare_trajectory_insert()                 │
               │    → normalize states, resolve screenshots, compute fingerprint│
               │                                                               │
               │  no_retrieval  → discard                                      │
               │  rag / ar_r    → embed AXTree slices → store in NumPy matrix  │
               │               → (opt) LLM → procedure & hint notes → embed    │
               │               → (ar_r) LLM → event entries → embed            │
               │  codex / ar_c  → write trajectory.json to workspace/           │
               └───────────────────────────────────────────────────────────────┘

               query(question_text)
               ┌───────────────────────────────────────────────────────────────┐
               │  no_retrieval  → return []                                    │
               │                                                               │
               │  rag           → embed question → cosine search raw-states    │
               │               → (opt) cosine search notes                     │
               │               → assemble MemoryContextItems                   │
               │                                                               │
               │  agentrunbook_r→ LLM decomposes question into multi-pool      │
               │               → cosine search each pool separately            │
               │               → (opt) LLM reranks candidates                  │
               │               → assemble MemoryContextItems                   │
               │                                                               │
               │  codex / ar_c  → write question.json + INSTRUCTION.md         │
               │               → (ar_c) render trajectory summaries            │
               │               → spawn Codex CLI subprocess                    │
               │               → Codex reads files, writes memory_module_output.json │
               │               → harness loads spans → assemble MemoryContextItems │
               └───────────────────────────────────────────────────────────────┘
```

---

## Summary

The six backends form a progression from a zero-effort baseline to a full agentic retrieval system. `no_retrieval` measures the floor. `rag` adds dense vector retrieval over raw UI states, and optionally LLM-generated notes. `agentrunbook_r` adds a state-transition event pool and a smart query-decomposition step. `codex` replaces the embedding index entirely with a coding agent that reads trajectory files directly from disk. `agentrunbook_c` enhances `codex` by pre-rendering human-readable summaries and providing an inspection helper, giving the agent a better starting orientation before it digs into raw evidence.

Every custom backend must implement exactly two abstract methods—`insert` and `query`—and register with `@register_memory`. Saving and loading state across sessions is optional but recommended for any backend that performs expensive embedding or LLM inference at insert time.

Sources: [memory_modules/memory.py:140-144](), [memory_modules/codex.py:22-23](), [memory_modules/agentrunbook_r.py:60-68]()

---

## 05. The Evaluation Harness: From Question to Score

> How harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer — including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/05-the-evaluation-harness-from-question-to-score.md
- Generated: 2026-05-22T06:14:08.518Z

### Source Files

- `evaluation/harness.py`
- `evaluation/run_eval.py`
- `evaluation/qa_eval_metrics.py`
- `evaluation/scripts/run_no_retrieval.sh`
- `evaluation/scripts/run_rag_query_to_slice.sh`
- `evaluation/memory_configs/rag_query_to_slice.json`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [evaluation/harness.py](evaluation/harness.py)
- [evaluation/run_eval.py](evaluation/run_eval.py)
- [evaluation/qa_eval_metrics.py](evaluation/qa_eval_metrics.py)
- [evaluation/scripts/run_no_retrieval.sh](evaluation/scripts/run_no_retrieval.sh)
- [evaluation/scripts/run_rag_query_to_slice.sh](evaluation/scripts/run_rag_query_to_slice.sh)
- [evaluation/memory_configs/rag_query_to_slice.json](evaluation/memory_configs/rag_query_to_slice.json)
- [evaluation/memory_configs/no_retrieval.json](evaluation/memory_configs/no_retrieval.json)
- [memory_modules/memory.py](memory_modules/memory.py)
</details>

# The Evaluation Harness: From Question to Score

The evaluation harness is the central pipeline that turns a dataset of questions — together with a memory backend and a reader model — into per-question scores and aggregate metrics. Understanding it is essential for anyone who wants to reproduce results, add a new memory backend, or debug why a particular question was answered incorrectly.

This page traces the full lifecycle of a single question: how the haystack of trajectories is loaded into a memory backend, how that backend returns context items for the question, how those items are assembled into a prompt for the reader model, how the model's response is parsed, and finally how the harness decides whether the answer is correct — including the two special paths that call a separate LLM judge for abstention and gotchas questions.

---

## The Two Entry Points

There are two ways to launch an evaluation run.

**`run_eval.py` — the user-facing wrapper.** It accepts high-level arguments (`--method`, `--domain`, `--tier`, `--data-root`) and does three things before calling the harness: it materializes a filtered `questions.json` and a `haystack.json` into a `runtime_inputs/` directory, builds a `memory_config.json` from the chosen method, and then invokes `harness.main()` in-process with the translated `sys.argv`.

**`harness.py` — the execution engine.** It owns the three-pass pipeline (build prompts → generate reader outputs → score) and all file I/O for results. It does not know which evaluation method is being used; that is entirely determined by the `memory_config` it receives.

Sources: [evaluation/run_eval.py:190-268](), [evaluation/harness.py:119-198]()

```
# Shell script → run_eval.py → harness.py (in-process)
# run_eval.py builds runtime_inputs/, then:
sys.argv = harness_argv
from evaluation.harness import main as harness_main
harness_main()
```

Sources: [evaluation/run_eval.py:213-268]()

### Shell Script Wrappers

The `evaluation/scripts/` directory contains one Bash script per method. Each script is a thin driver: it validates that the caller has not passed arguments it owns, then iterates over both domains (`web` and `enterprise`) and calls `run_eval.py`:

```bash
for domain in web enterprise; do
  python "$REPO_ROOT/evaluation/run_eval.py" \
    --method "$METHOD" \
    --data-root "$DATA_ROOT_VALUE" \
    --domain "$domain" \
    --tier "$TIER_VALUE" \
    --output-dir "$OUTPUT_ROOT_VALUE/${METHOD}_${domain}_${TIER_VALUE}" \
    "$@"
done
```

Sources: [evaluation/scripts/run_rag_query_to_slice.sh:20-28](), [evaluation/scripts/run_no_retrieval.sh:20-28]()

The environment variables `DATA_ROOT`, `OUTPUT_ROOT`, and `TIER` drive where data is read and results are written. Additional flags (model names, concurrency limits, API keys) pass through `"$@"` unchanged to `run_eval.py`.

---

## Memory Backends and Configuration

Every method is expressed as a JSON config with exactly two keys: `memory_type` and `memory_params`. The harness loads this config, instantiates the matching `Memory` subclass, and calls `insert()` and `query()` on it — nothing else.

```json
{
  "memory_type": "rag",
  "memory_params": {
    "controller_params": { "model": "Qwen/Qwen3.5-9B", "base_url": "http://localhost:8023/v1", ... },
    "embedding_params":  { "model": "Qwen/Qwen3-Embedding-8B", "base_url": "http://localhost:8114/v1", ... },
    "index_params":      { "raw_state_slice_radius": 1 },
    "retrieval_params":  { "enable_notes": false, "raw_state_search_top_k": 6 }
  }
}
```

Sources: [evaluation/memory_configs/rag_query_to_slice.json:1-35]()

The `no_retrieval` config is the minimal case — empty params, no external services required:

```json
{ "memory_type": "no_retrieval", "memory_params": {} }
```

Sources: [evaluation/memory_configs/no_retrieval.json:1-4]()

`run_eval.py` assembles these configs programmatically from CLI arguments for the six supported methods: `no_retrieval`, `rag_query_to_slice`, `rag_query_to_slice_notes`, `agentrunbook_r`, `codex`, and `agentrunbook_c`. Sources: [evaluation/run_eval.py:21-28]()

All concrete memory classes implement the `Memory` abstract base class:

| Method | `insert()` behavior | `query()` return |
|---|---|---|
| `no_retrieval` | no-op | empty list |
| `rag` | embed trajectory slices into vector index | top-k text/image slices |
| `agentrunbook_r` | embed + build runbook summary | multi-query retrieved slices |
| `codex` | invoke Codex CLI on trajectory files | text context items |
| `agentrunbook_c` | Codex-driven agentic runbook | text context items |

Sources: [memory_modules/memory.py:25-68](), [memory_modules/memory.py:178-223]()

---

## Pass 1: Building Prompts

The harness works in three sequential passes. Pass 1 turns each question into a fully assembled prompt.

### Shared vs. Per-Question Haystack

The first fork in the pipeline decides whether one memory object can be shared across all questions.

- **Shared haystack**: all questions draw from the same ordered list of trajectory IDs. The harness builds a single `Memory` instance, inserts all trajectories once with a progress bar, then queries it for every question. This is the common case for standard benchmark runs.
- **Per-question haystack**: different questions see different trajectory sets. A separate `Memory` is built and populated for every question. This is only supported for memory types in `NONSHARED_PARALLEL_MEMORY_TYPES` (`rag`, `codex`, `agentrunbook_r`, `agentrunbook_c`) when `--prompt-build-max-workers > 1`.

Sources: [evaluation/harness.py:439-444](), [evaluation/harness.py:1136-1199]()

### `build_prompt_row`

For each question, the core function `build_prompt_row` does the following:

1. Calls `memory.set_query_context(question_id, question_type, question_item)` to pass metadata the backend may use.
2. Calls `memory.query(question_text, query_image=...)`, which returns a `list[MemoryContextItem]`. Each item is `{"type": "text"|"image", "value": "..."}`.
3. Validates the items.
4. Calls `memory.post_query_hook(...)` for optional post-retrieval work (timing is tracked separately).
5. Calls `truncate_memory_context()` if the context exceeds `--memory-context-max-tokens` (default 200,000), using a binary search over token counts measured with the Qwen3.5-9B processor.
6. Calls `build_messages()` to assemble the final prompt.

Sources: [evaluation/harness.py:536-593]()

### Token Counting and Truncation

The truncation check counts tokens using `transformers.AutoProcessor` for `Qwen/Qwen3.5-9B` — the same tokenizer the default reader model uses. Images are loaded via PIL and counted as their tokenized visual representation. A binary search finds the largest prefix of context items that fits within the limit.

Sources: [evaluation/harness.py:357-436]()

### Prompt Structure

`build_messages` assembles the final chat messages in a fixed layout:

```
[system]  domain-specific instruction
[user]    ### Memory context:
          <context items — text blocks or base64-encoded image_url>

          ### Question to answer:
          <question text>
          [optional question image]
```

The system prompt instructs the model to answer from memory, output `\boxed{UNKNOWN}` if it does not know, and explain in `\boxed{}` if the question's premise is wrong. Sources: [evaluation/harness.py:501-533](), [evaluation/harness.py:69-88]()

Two versions of the message list are produced: `messages` (with images as base64 data URLs, sent to the model API) and `messages_for_log` (with image file paths, written to disk). Sources: [evaluation/harness.py:501-533]()

---

## Pass 2: Calling the Reader Model

After all prompts are built, the harness fires all requests concurrently using `asyncio` and the OpenAI-compatible chat completions API.

```python
async def generate_all_reader_outputs(args, prompt_rows):
    client = create_async_client(...)
    semaphore = asyncio.Semaphore(args.reader_max_concurrent_requests)  # default 500
    tasks = [asyncio.create_task(run_one(row)) for row in prompt_rows]
    ...
```

Sources: [evaluation/harness.py:889-930]()

The harness is provider-neutral: it targets any OpenAI-compatible endpoint via `--base-url`. Local vLLM servers, cloud APIs, and proxies all work. When `--base-url` is set, it uses `max_tokens` (local convention); otherwise it uses `max_completion_tokens` (OpenAI API convention). Sources: [evaluation/harness.py:842-866]()

After each response arrives, `extract_boxed_answer` parses the final `\boxed{...}` from the raw text (depth-aware brace matching, taking the last occurrence). If none is found, the full response is used. `is_unknown` checks if the parsed answer equals `"UNKNOWN"` (case-insensitive). Sources: [evaluation/qa_eval_metrics.py:180-206]()

---

## Pass 3: Scoring

Pass 3 iterates `prompt_rows` sequentially, merging each row with its reader output and calling `score_prediction`. Results are written to `per_question.jsonl` incrementally (one flush per question) so partial results survive interruption. Sources: [evaluation/harness.py:1358-1411]()

### Eval Function Dispatch

Every question carries an `eval_function` field whose value is a pipe-delimited spec string, for example:

```
norm_phrase_set_match
norm_phrase_set_match|separators=[,]|lower=true
llm_abstention_checker
llm_gotchas_checker
```

`eval_from_spec` parses the name and any `key=value` options, looks up the function by name in `qa_eval_metrics.py`'s global namespace, and calls it. Sources: [evaluation/qa_eval_metrics.py:568-601]()

The available deterministic eval functions include:

| Function | What it checks |
|---|---|
| `norm_phrase_set_match` | All gold phrases appear in the prediction (word-boundary regex, normalized) |
| `norm_phrase_set_match_ordered` | Same but phrases must appear in order |
| `mc_choice_match` | Single multiple-choice letter matches |
| `mc_choice_set_match` | Set of multiple-choice letters matches (multi-select) |

Sources: [evaluation/qa_eval_metrics.py:71-178]()

### LLM Judge Paths

Two question types require a second model call for scoring: abstention questions (`-abs` suffix) and gotchas questions. These are identified by `LLM_EVAL_FUNCTIONS = {"llm_abstention_checker", "llm_gotchas_checker"}`. When `score_prediction` sees one of these, it routes the **full raw response** (not just the parsed boxed answer) to the judge. Sources: [evaluation/harness.py:60](), [evaluation/harness.py:1012-1037]()

**`llm_abstention_checker`** — evaluates whether the model correctly identified a flawed question premise. The system prompt is strict: the model must name the flaw and reach the same conclusion as the reference answer. A generic `UNKNOWN` reply without identifying the flaw scores 0. The judge outputs `{"label": 0|1, "reason": "..."}`.

Sources: [evaluation/qa_eval_metrics.py:7-17](), [evaluation/qa_eval_metrics.py:221-287](), [evaluation/qa_eval_metrics.py:380-413]()

**`llm_gotchas_checker`** — evaluates whether the model surfaced the correct insight from an errors/gotchas question. Partial coverage of a multi-point reference answer is sufficient for label 1, as long as no point is contradicted.

Sources: [evaluation/qa_eval_metrics.py:18-25](), [evaluation/qa_eval_metrics.py:290-356](), [evaluation/qa_eval_metrics.py:416-447]()

Both judge functions call a separate evaluator model (default `gpt-5.2`) configured independently from the reader model. This separation allows using a stronger model for judgment without affecting the capability being measured. Sources: [evaluation/run_eval.py:82-84](), [evaluation/harness.py:186-198]()

The judge response is parsed with `_parse_llm_binary_judgement`, which tries `json.loads` first and falls back to regex extraction for malformed outputs. Sources: [evaluation/qa_eval_metrics.py:459-488]()

### Abstention Scoring Gotcha

When a question is marked `is_unknown` (the reader output `\boxed{UNKNOWN}`), `score_prediction` forces `score_bool = False` regardless of what the eval function returns. This means a model that always replies UNKNOWN will score 0 on non-abstention questions, preventing a trivial exploit. Sources: [evaluation/harness.py:1034-1036]()

---

## Output Files

After all three passes complete, the harness writes the following files to `--output-dir`:

| File | Contents |
|---|---|
| `run_args.json` | All CLI arguments plus `started_at_utc` |
| `prompt_rows.jsonl` | One row per question with full prompt, memory context, and timing |
| `prompt_build_summary.json` | Question order and count after prompt build |
| `per_question.jsonl` | Per-question record with score, raw response, token usage |
| `aggregated_metrics.json` | Overall and per-category accuracy, token stats, memory timing |
| `memory_state/` (optional) | Saved shared memory, loadable with `--load-memory-dir` |

Sources: [evaluation/harness.py:1044-1045](), [evaluation/harness.py:1329-1341](), [evaluation/harness.py:1345-1412](), [evaluation/harness.py:1413-1492]()

### Aggregated Metrics Structure

`aggregate_metrics` partitions questions into non-abstention and abstention groups, then breaks each down by category. The top-level result includes:

```
overall_full_set            # mean score over all questions
overall_non_abstention_only # mean score ignoring -abs questions
overall_abstention_only     # mean score for -abs questions only
non_abstention_by_category  # {static, dynamic, procedure, gotchas} → {count, pct_correct, ...}
abstention_by_category      # {static-abs, dynamic-abs, procedure-abs} → breakdown
combined_abstention_by_category  # paired non-abs + abs per category
```

Sources: [evaluation/harness.py:938-998](), [evaluation/harness.py:44-59]()

---

## End-to-End Flow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│  Shell script (run_rag_query_to_slice.sh)                       │
│  Sets DATA_ROOT, TIER, loops over domain=web,enterprise         │
└───────────────────────────┬─────────────────────────────────────┘
                            │ python run_eval.py --method rag_query_to_slice ...
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  run_eval.py                                                    │
│  • materialize_runtime_questions → runtime_inputs/questions.json│
│  • materialize_runtime_haystack  → runtime_inputs/haystack.json │
│  • build_memory_config()         → runtime_inputs/memory_config │
│  • harness.main() [in-process]                                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 1: Build Prompts                             │
│                                                                 │
│  build_memory(memory_config)                                    │
│  for each trajectory_id in haystack:                           │
│      memory.insert(trajectory)                                  │
│                                                                 │
│  for each question:                                             │
│      memory.set_query_context(...)                              │
│      ctx_items = memory.query(question_text, image)  ← backend │
│      truncate_memory_context(ctx_items, max_tokens=200k)        │
│      build_messages(system_prompt, ctx_items, question)         │
│      → prompt_row                                               │
└───────────────────────────┬─────────────────────────────────────┘
                            │ prompt_rows.jsonl saved
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 2: Reader Model                              │
│                                                                 │
│  asyncio + semaphore (≤500 concurrent)                          │
│  for each prompt_row:                                           │
│      POST /chat/completions → raw_response                      │
│      extract_boxed_answer(raw) → parsed_boxed                   │
│      is_unknown(parsed_boxed) → bool                            │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 3: Score                                     │
│                                                                 │
│  for each row:                                                  │
│      eval_from_spec(eval_function, parsed, gold)                │
│        ├─ deterministic: norm_phrase_set_match, mc_choice_match  │
│        └─ LLM judge:                                            │
│             llm_abstention_checker → POST to evaluator model    │
│             llm_gotchas_checker    → POST to evaluator model    │
│      if is_unknown: force score_bool = False                    │
│      write per_question.jsonl                                   │
│                                                                 │
│  aggregate_metrics(records) → aggregated_metrics.json           │
└─────────────────────────────────────────────────────────────────┘
```

---

## Summary

The harness is a clean three-pass pipeline: memory build + prompt assembly → concurrent reader inference → sequential scoring. Its key design choices are: the `Memory` abstract interface that keeps all retrieval logic out of the harness; the OpenAI-compatible client that makes the reader and evaluator models interchangeable; the `\boxed{}` answer convention that gives all eval functions a well-defined target to match; and the two LLM judge paths that handle question types where string matching is insufficient. The shell scripts and `run_eval.py` layer on top add convenience (domain loops, runtime config assembly) without coupling the engine to any specific method. Sources: [evaluation/harness.py:1040-1496]()

---

## 06. Scoring, LAFS, & What to Remember

> How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

- Page Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/06-scoring-lafs-what-to-remember.md
- Generated: 2026-05-22T06:13:17.839Z

### Source Files

- `leaderboard/README.md`
- `leaderboard/compute_lafs.py`
- `leaderboard/combine_aggregated_metrics.py`
- `leaderboard/build_submission_step_1_single_operating_point.py`
- `leaderboard/build_submission_step_2_build_package.py`
- `leaderboard/submission_utils.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [leaderboard/README.md](leaderboard/README.md)
- [leaderboard/compute_lafs.py](leaderboard/compute_lafs.py)
- [leaderboard/combine_aggregated_metrics.py](leaderboard/combine_aggregated_metrics.py)
- [leaderboard/build_submission_step_1_single_operating_point.py](leaderboard/build_submission_step_1_single_operating_point.py)
- [leaderboard/build_submission_step_2_build_package.py](leaderboard/build_submission_step_2_build_package.py)
- [leaderboard/submission_utils.py](leaderboard/submission_utils.py)
</details>

# Scoring, LAFS, & What to Remember

This page explains how LongMemEval-V2 turns raw evaluation runs into a single comparable number on the leaderboard. It covers the two-domain merge that produces a unified metric view, how LAFS converts an accuracy-latency trade-off curve into one score, the two-command packaging process that turns folders of results into a submittable archive, and the key ideas to carry away when reading or extending this repo.

Everything described here lives inside the `leaderboard/` directory. No external service is required: the scripts run locally, all reference values are hard-coded in the source, and the resulting `.tar.gz` is submitted through a web form.

---

## How Web and Enterprise Results Are Merged

Every leaderboard entry is evaluated on two independently run domains: **web** and **enterprise**. These domains represent different haystack compositions and question styles. Because the two domains contain different numbers of questions, raw averages would be misleading. The solution is **example-count-weighted averaging**.

The merging happens in `combine_aggregated_metrics.py`. The entry point reads two `aggregated_metrics.json` files — one from the web run, one from the enterprise run — and combines every numeric field using the total question count from each domain as the weight:

```python
# leaderboard/combine_aggregated_metrics.py:100-111
def weighted_average(values: list[tuple[Any, int]]) -> float | None:
    weighted_total = 0.0
    total_count = 0
    for value, count in values:
        numeric = as_number(value)
        if numeric is None or count == 0:
            continue
        weighted_total += numeric * count
        total_count += count
    if total_count == 0:
        return None
    return weighted_total / total_count
```

This same weighted-average function is applied to:

- `overall_full_set`, `overall_non_abstention_only`, `overall_abstention_only` (accuracy scores)
- per-category accuracy breakdowns (`gotchas`, `static`, `dynamic`, `procedure`)
- timing fields: `memory_query`, `memory_post_query`
- token usage: prompt, completion, and total token counts

For timing, the combination is slightly different: totals are summed, then re-averaged by total question count. `max_seconds` takes the maximum across both domains rather than an average.

Sources: [leaderboard/combine_aggregated_metrics.py:100-111](), [leaderboard/combine_aggregated_metrics.py:288-321](), [leaderboard/combine_aggregated_metrics.py:324-375]()

---

## The Extracted Metric Overview

After merging, `submission_utils.build_metric_overview` extracts the six fields that matter for scoring and display:

| Field | Where it comes from |
|---|---|
| `overall_full_set` | `combined.overall.overall_full_set` |
| `gotchas_accuracy` | `non_abstention_by_category.gotchas.pct_correct` |
| `static_accuracy` | `combined_abstention_by_category.static.pct_correct` |
| `dynamic_accuracy` | `combined_abstention_by_category.dynamic.pct_correct` |
| `procedure_accuracy` | `combined_abstention_by_category.procedure.pct_correct` |
| `memory_query_avg_seconds` | `memory_query.avg_seconds` |

`overall_full_set` is the headline accuracy; `memory_query_avg_seconds` is the headline latency. Both feed directly into LAFS. The category-level fields give an interpretable diagnostic breakdown but are not used in the LAFS calculation.

Sources: [leaderboard/submission_utils.py:378-404]()

---

## LAFS: Accuracy-Latency Frontier Score

### The Core Idea

A memory system that is very accurate but takes three minutes per query is not as useful as one that is fast and nearly as accurate. LAFS (Latency-Adjusted Frontier Score) rewards systems that push the **Pareto frontier** of the accuracy-versus-latency trade-off — not just those that maximize accuracy alone.

Conceptually, imagine a graph with latency on the x-axis and accuracy on the y-axis. The Pareto frontier is the set of points where no other point is both faster *and* more accurate. LAFS measures the area under that frontier over a log-uniform distribution of latency budgets.

### Parameters

```python
# leaderboard/compute_lafs.py:19-22
T_MIN = 1.0      # minimum latency budget (seconds)
T_MAX = 200.0    # maximum latency budget (seconds)
FLOOR_ACC = 0.0  # accuracy floor when nothing fits the budget
```

The log-latency range 1–200 seconds corresponds to the practical operating range seen in the reference methods. Integration is log-uniform because latency improvements at 5 s feel as meaningful as improvements at 50 s.

### The Formula

```
LAFS = (1 / log(T_MAX / T_MIN)) * ∫ best_acc_under_budget(T) d(log T)
```

Implemented exactly as a step-function integral:

```python
# leaderboard/compute_lafs.py:96-127
def lafs(points, t_min=T_MIN, t_max=T_MAX, floor_acc=FLOOR_ACC):
    frontier = pareto_frontier(points)
    breakpoints = {t_min, t_max}
    for point in frontier:
        if t_min < point.latency < t_max:
            breakpoints.add(point.latency)
    breakpoints = sorted(breakpoints)
    denom = math.log(t_max / t_min)
    area = 0.0
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        acc = best_acc_under_budget(frontier, left, floor_acc=floor_acc)
        area += acc * math.log(right / left)
    return area / denom
```

### LAFS Gain

The leaderboard ranks submissions by **LAFS gain** — how much a submission improves the frontier beyond the fixed reference baseline:

```
LAFS gain = LAFS(reference_frontier ∪ submission_points) − LAFS(reference_frontier)
```

A submission that is dominated everywhere by the existing frontier receives a gain of exactly 0. A submission that opens a new accuracy-latency operating region receives positive gain proportional to the area it adds.

### The Reference Frontier

The reference frontier is hard-coded and will never change. Downstream scores depend on these exact values:

| Tier | Method | Accuracy | Latency |
|---|---|---|---|
| small | RAG: query → slice + notes | 51.0% | 0.2 s |
| small | Codex | 69.9% | 177.2 s |
| small | AgentRunbook-R | 58.6% | 26.9 s |
| small | AgentRunbook-C | 74.9% | 108.3 s |
| medium | RAG: query → slice + notes | 45.9% | 0.3 s |
| medium | Codex | 68.7% | 185.8 s |
| medium | AgentRunbook-R | 57.0% | 25.8 s |
| medium | AgentRunbook-C | 70.1% | 139.9 s |

Sources: [leaderboard/compute_lafs.py:35-48]()

### Worked Example

A system with a single operating point at 62% accuracy and 15 s latency (`Fast RAG++` in the source) falls between the RAG baseline and AgentRunbook-R on the latency axis but offers better accuracy than RAG in that region, so it extends the frontier and earns positive LAFS gain. A system at 70% accuracy but 150 s latency is dominated by AgentRunbook-C (74.9% at 108.3 s) and receives a gain of 0.

Sources: [leaderboard/compute_lafs.py:221-256]()

---

## The Two-Step Submission Package

### Why Two Steps?

A submission can contain multiple **operating points** — different speed/accuracy trade-offs of the same method (e.g., `fast`, `balanced`, `accurate`). Step 1 handles one operating point at a time; Step 2 assembles them all into the final package. Running them separately lets you add or rebuild a single operating point without repeating validation for others.

### Step 1: Validate and Stage One Operating Point

```bash
python leaderboard/build_submission_step_1_single_operating_point.py \
  runs/my_method_fast_web_small \
  runs/my_method_fast_enterprise_small \
  submission_1 \
  fast \
  small
```

Step 1 calls `validate_run` on each run folder, which checks:

- Required files exist: `aggregated_metrics.json`, `per_question.jsonl`, `run_args.json`, `runtime_inputs/questions.json`, `runtime_inputs/haystack.json`
- `run_args.json` domain matches `web` or `enterprise`
- `run_args.json` model contains `qwen3.5-9b` and evaluator model contains `gpt-5.2`
- `per_question.jsonl` covers every question in `runtime_inputs/questions.json` (no missing or extra IDs)
- Question-type counts in the output match the runtime inputs
- `aggregated_metrics.json` `count_all_questions` matches the actual question and output counts

After both runs pass, `validate_run_pair` confirms they share the same method name and tier.

Step 1 then copies the run artifacts, combines the two domains using `combine_domain_metrics`, builds the `metric_overview.json` from the combined result, and writes `operating_point_metadata.json`.

Output layout:

```text
leaderboard/submissions/submission_1/operating_points/fast/
  metric_overview.json
  operating_point_metadata.json
  web/
    aggregated_metrics.json  per_question.jsonl  run_args.json  runtime_inputs/
  enterprise/
    aggregated_metrics.json  per_question.jsonl  run_args.json  runtime_inputs/
```

Sources: [leaderboard/build_submission_step_1_single_operating_point.py:70-132](), [leaderboard/submission_utils.py:226-334]()

### Step 2: Assemble the Final Package

```bash
python leaderboard/build_submission_step_2_build_package.py \
  submission_1 \
  SYSTEM_DESCRIPTION.md \
  path/to/code_file.py \
  leaderboard/submissions/submission_1/operating_points/fast \
  leaderboard/submissions/submission_1/operating_points/balanced
```

Step 2 validates that all operating points share the same method, tier, web question IDs, enterprise question IDs, and haystack contents. It then:

1. Copies each operating point folder into the package directory
2. Copies `SYSTEM_DESCRIPTION.md` and the code file to the package root
3. Computes `lafs_summary_for_submission` over all operating points and writes `submission_overview.json`
4. Creates the final `.tar.gz` archive (symlinks are rejected)

`submission_overview.json` at the package root records the method, tier, per-operating-point accuracy and latency values, and the full LAFS summary including `reference_lafs`, `submission_lafs`, and `lafs_gain`.

Sources: [leaderboard/build_submission_step_2_build_package.py:189-287](), [leaderboard/submission_utils.py:437-512]()

---

## End-to-End Flow

```text
web run folder ─────┐
                    ├─► Step 1 (validate + merge) ─► operating_points/fast/
enterprise run folder┘                                  metric_overview.json

                     (repeat for each operating point)

operating_points/fast/  ─┐
operating_points/balanced/├─► Step 2 (assemble + LAFS) ─► submission_1.tar.gz
operating_points/accurate/┘                                submission_overview.json
```

Each box in the diagram maps to exactly one script; there is no hidden middleware.

---

## Model Constraints Enforced at Validation Time

The submission tooling enforces two fixed model constraints on every run:

| Slot | Required substring |
|---|---|
| Reader model (`model` in `run_args.json`) | `qwen3.5-9b` |
| Evaluator model (`evaluator_model` in `run_args.json`) | `gpt-5.2` |

These checks exist to keep the evaluation protocol reproducible across all leaderboard entries. A run using a different reader or judge model will be rejected at Step 1 with a clear error.

Sources: [leaderboard/submission_utils.py:19-20](), [leaderboard/submission_utils.py:194-207]()

---

## Key Takeaways

**What the leaderboard measures.** Accuracy alone is not enough. The benchmark explicitly rewards memory systems that are fast *and* accurate by measuring how much a submission expands the Pareto frontier across a 1–200 second latency range.

**Two domains, one score.** Every submission is evaluated on both `web` and `enterprise` haystacks. The merge is example-count-weighted, so a domain with more questions has proportionally more influence on the combined score.

**LAFS gain can be zero.** If every submitted operating point falls inside (i.e., is dominated by) the reference frontier, the gain is exactly 0. To earn a positive gain, at least one operating point must improve accuracy at some latency budget where the reference frontier does not already reach that accuracy level.

**Multiple operating points help in different ways.** A fast operating point with moderate accuracy and a slow operating point with high accuracy together can improve LAFS more than either alone, because they each fill a different region of the frontier.

**The reference frontier is frozen.** The four reference methods per tier are hard-coded constants. Changing them would invalidate all previously computed scores, so they will not change after release.

The `submission_overview.json` written by Step 2 is the single artifact that captures everything: method name, tier, per-operating-point numbers, and the LAFS summary. Inspect it with `python -m json.tool` before submitting to confirm the numbers look correct.

Sources: [leaderboard/compute_lafs.py:35-51](), [leaderboard/build_submission_step_2_build_package.py:189-227]()

---