# Explain It Simply: What This Repo Does

> What LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `README.md`
- `pyproject.toml`
- `memory_modules/memory.py`
- `evaluation/harness.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [README.md](README.md)
- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [memory_modules/rag.py](memory_modules/rag.py)
- [memory_modules/agentrunbook_r.py](memory_modules/agentrunbook_r.py)
- [evaluation/harness.py](evaluation/harness.py)
- [leaderboard/README.md](leaderboard/README.md)
</details>

# Explain It Simply: What This Repo Does

LongMemEval-V2 is a benchmark that asks one sharp question: **can an AI agent learn from experience the way a knowledgeable colleague does?** The repo gives you the questions, the histories the agent must learn from, a harness to run any memory system against them, and ready-made baselines to compare against — all in one place.

This page explains what the benchmark is testing, how the pieces fit together, and the three ideas every reader needs before diving deeper into the code.

---

## The One Analogy to Keep

Think of a new employee joining a company. On their first day they know nothing special about your systems. After six months, they know which pages load slowly, which workflows have traps, and which assumptions from their last job don't apply here. That accumulated knowledge — earned from experience, not from a manual — is what LongMemEval-V2 measures.

The "employee" here is an AI agent. The "six months of experience" is a pile of recorded web-browsing or enterprise-software sessions (called **trajectories**). A **memory system** reads those trajectories and must later answer factual questions about the environment — just like a colleague you tap on the shoulder to ask "wait, how does checkout work on the admin site?"

Sources: [README.md:29-34]()

---

## Three Ideas to Hold Onto

### 1. The benchmark is a haystack-and-needle problem at extreme scale

Each question comes with a **haystack**: a set of up to 500 recorded agent trajectories that together can reach 115 million tokens. The memory system must ingest all of them and then surface just the relevant evidence — a needle — when asked a specific question.

The two test tiers are called **small** and **medium**, reflecting how large the haystacks grow. Sources: [README.md:38-43]()

### 2. Memory is a simple two-method contract, not a fixed implementation

Every memory backend in the repo implements exactly two methods:

- `insert(trajectory)` — called once per trajectory during the indexing phase.
- `query(query, query_image=None)` — called once per question; must return a list of text or image evidence items.

That contract is defined in `Memory` (an abstract base class) in `memory_modules/memory.py`. Any Python class that decorates itself with `@register_memory`, sets a unique `memory_type` string, and implements those two methods is a valid backend that the harness will accept.

```python
# memory_modules/memory.py:25-53
class Memory(ABC):
    memory_type: str = ""

    @abstractmethod
    def insert(self, trajectory: dict[str, object]) -> None:
        raise NotImplementedError

    @abstractmethod
    def query(
        self,
        query: str,
        query_image: str | None = None,
    ) -> list[MemoryContextItem]:
        raise NotImplementedError
```

The return type is a list of typed items:
```python
[
    {"type": "text", "value": "retrieved notes or evidence"},
    {"type": "image", "value": "/path/to/screenshot.png"},
]
```

Sources: [memory_modules/memory.py:25-53](), [README.md:197-220]()

### 3. Accuracy alone is not the score — latency matters too

The leaderboard scores submissions using **LAFS** (Latency-Adjusted Frontier Score), which rewards both high accuracy and fast query latency. A method that is 5% more accurate but ten times slower may score lower than a balanced one. The harness tracks `memory_query_duration_seconds` for every question, and the submission packaging combines the accuracy and latency numbers into a single frontier score.

Sources: [leaderboard/README.md:26-34](), [evaluation/harness.py:1438-1458]()

---

## What Gets Tested: Five Memory Abilities

The 451 questions are hand-curated across five categories that reflect different kinds of workplace knowledge:

| Ability | What it checks |
|---|---|
| **Static state recall** | Remembers landmarks, page layouts, and subtle UI differences |
| **Dynamic state tracking** | Understands how actions change the environment over time |
| **Workflow knowledge** | Knows the steps for recurring tasks |
| **Environment gotchas** | Recognizes local failure modes and avoids them |
| **Premise awareness** | Detects questions whose premise is wrong in this specific deployment |

A question in the **abstention** variant (marked `-abs` in the code) tests whether the system correctly says "I don't know" rather than guessing. The harness grades abstention separately from factual recall.

Sources: [README.md:45-57](), [evaluation/harness.py:44-60]()

---

## How an Evaluation Run Works

The harness (`evaluation/harness.py`) orchestrates three sequential passes:

```text
Pass 1 — Build prompts
  For each question:
    → Load the haystack (the set of relevant trajectories)
    → Call memory.insert() for each trajectory
    → Call memory.query() with the question text (and optional screenshot)
    → Truncate the returned context to fit within the token budget
    → Build the final reader prompt

Pass 2 — Generate answers
  → Send all prompts concurrently to the reader model (AsyncOpenAI)
  → Parse the boxed answer from each response

Pass 3 — Score answers
  → Compare each parsed answer to the gold label
  → Use exact-match, LLM judge, or custom eval function per question type
  → Write per_question.jsonl and aggregated_metrics.json
```

If every question shares the same haystack (common for small tier), the memory is built once and reused for all queries — a significant efficiency win. When questions have different haystacks, the harness can build memory in parallel across worker threads.

Sources: [evaluation/harness.py:1135-1196](), [evaluation/harness.py:1200-1343]()

The harness communicates with models through an **OpenAI-compatible API**. You point it at any server that speaks that protocol — local vLLM, a self-hosted endpoint, or a cloud provider. No model provider is baked in; the paper used Qwen3.5-9B as the reader and Qwen3-Embedding-8B for embeddings, but those are defaults you can override.

Sources: [README.md:126-156]()

---

## The Built-In Memory Backends

The repo ships six backends, ranging from a deliberate no-op to sophisticated agent-based approaches:

| `memory_type` | What it does |
|---|---|
| `no_retrieval` | Returns nothing — the zero baseline. The reader model gets no memory context. |
| `rag_query_to_slice` | Embeds raw AXTree state slices from trajectories; retrieves top-k by cosine similarity. |
| `rag_query_to_slice_notes` | Same as above, but also generates LLM-written procedure and hint notes per trajectory and retrieves those too. |
| `agentrunbook_r` | Builds richer per-transition event notes and uses multi-query retrieval with optional reranking. |
| `codex` | Drops trajectories as files into a workspace; uses Codex (a coding agent) to search them at query time. |
| `agentrunbook_c` | Like Codex, but paired with AgentRunbook's structured workspace layout. |

The `no_retrieval` backend is literally four lines of code and illustrates the minimum required to implement the interface:

```python
# memory_modules/no_retrieval.py
@register_memory
class NoRetrievalMemory(Memory):
    memory_type = "no_retrieval"

    def insert(self, trajectory): return None
    def query(self, query, query_image=None): return []
```

The `rag` backend, by contrast, maintains in-memory NumPy embedding matrices for raw state slices, procedure notes, and hint notes, and saves them to disk as `.npy` files for reuse across runs.

Sources: [memory_modules/no_retrieval.py:1-19](), [memory_modules/rag.py:483-489](), [memory_modules/rag.py:958-968]()

---

## System Architecture at a Glance

```text
┌─────────────────────────────────────────────────────────────────┐
│  data/          download + validate trajectories and questions   │
├─────────────────────────────────────────────────────────────────┤
│  memory_modules/                                                 │
│   memory.py     Memory ABC + register_memory + build_memory      │
│   no_retrieval  zero baseline (no context)                       │
│   rag           embedding-based retrieval (raw slices + notes)   │
│   agentrunbook_r  multi-query retrieval, event notes, reranking  │
│   codex / agentrunbook_c  agent-as-search (Codex binary)         │
├─────────────────────────────────────────────────────────────────┤
│  evaluation/                                                     │
│   harness.py    3-pass runner: insert → query → score            │
│   run_eval.py   CLI wrapper for shell scripts                    │
│   scripts/      run_*.sh — one script per baseline               │
├─────────────────────────────────────────────────────────────────┤
│  leaderboard/   LAFS scoring, submission packaging (2 steps)     │
└─────────────────────────────────────────────────────────────────┘

External dependencies (bring your own):
  Reader model   →  OpenAI-compatible endpoint (default: Qwen3.5-9B)
  Embed model    →  OpenAI-compatible endpoint (default: Qwen3-Embedding-8B)
  LLM judge      →  OPENAI_API_KEY (default: gpt-5.2)
  Codex binary   →  CODEX_BINARY path (for codex/agentrunbook_c only)
```

---

## Two Domains, Two Tiers

Every run is scoped to one **domain** and one **tier**:

- **Domain** — `web` (Magento shopping site + forum) or `enterprise` (ServiceNow). The domain controls the system prompt given to the reader model and which question set is loaded.
- **Tier** — `small` or `medium`. Tiers differ in haystack size; medium has larger haystacks and heavier compute requirements.

Leaderboard submissions must cover both domains for a given tier. The `leaderboard/combine_aggregated_metrics.py` script merges the two domain result files into a single combined metrics file before packaging.

Sources: [README.md:175-185](), [evaluation/harness.py:69-88]()

---

## Adding Your Own Memory System

The extension path is intentional and documented directly in the README and enforced by the `Memory` base class:

1. Create a new file in `memory_modules/`.
2. Subclass `Memory`, set `memory_type = "your_name"`, and decorate with `@register_memory`.
3. Implement `insert` and `query`.
4. Write a config JSON: `{"memory_type": "your_name", "memory_params": {}}`.
5. Pass it to the harness with `--memory-config-path`.

The `query` method receives the question text and an optional screenshot path. It may call `self.get_query_context()` to access the `question_id`, `question_type`, and the full question item — useful for specializing retrieval by question category.

Sources: [memory_modules/memory.py:140-185](), [README.md:189-232]()

---

## Closing Summary

LongMemEval-V2 is a rigorous, open benchmark for long-term agent memory. Its core insight is that memory quality should be measured not just by whether an AI can answer questions correctly, but by whether it can absorb hundreds of task recordings and become the kind of knowledgeable colleague that knows where the traps are. The repo ships the data pipeline, a clean plug-in interface for memory backends, six reference implementations ranging from trivial to sophisticated, and an end-to-end evaluation harness that measures both accuracy and retrieval latency — the two dimensions that together determine how useful a memory system actually is in practice.

Sources: [README.md:28-43](), [leaderboard/README.md:26-34]()
