# Five Things a Good Memory Must Know

> The five memory abilities the benchmark tests — static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness — explained with real question categories from the harness source code.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `evaluation/harness.py`
- `evaluation/qa_eval_metrics.py`
- `evaluation/memory_configs/no_retrieval.json`
- `README.md`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [evaluation/harness.py](evaluation/harness.py)
- [evaluation/qa_eval_metrics.py](evaluation/qa_eval_metrics.py)
- [evaluation/memory_configs/no_retrieval.json](evaluation/memory_configs/no_retrieval.json)
- [evaluation/memory_configs/rag_query_to_slice.json](evaluation/memory_configs/rag_query_to_slice.json)
- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [data/public_data.py](data/public_data.py)
- [leaderboard/README.md](leaderboard/README.md)
- [README.md](README.md)
</details>


LongMemEval-V2 is a benchmark that asks a concrete question: can an AI agent's memory system make it act like an experienced colleague who has learned the quirks of a particular environment? To answer that, the benchmark does not ask one generic "how good is your memory?" question. It identifies five distinct skills that an experienced colleague actually needs, then tests each one independently across 451 hand-curated questions. This page explains what each skill means, where it appears in the harness source code, and how it is scored.

Understanding these five abilities is the key to interpreting any evaluation result from this benchmark—whether you are running one of the included baselines, implementing your own memory backend, or reading a leaderboard entry.

---

## The Five Abilities at a Glance

The five ability categories and their internal identifiers are defined in `evaluation/harness.py`:

```python
# evaluation/harness.py, lines 44-52
CATEGORY_MAP = {
    "static-environment": "static",
    "static-environment-abs": "static-abs",
    "dynamic-environment": "dynamic",
    "dynamic-environment-abs": "dynamic-abs",
    "procedure": "procedure",
    "procedure-abs": "procedure-abs",
    "errors-gotchas": "gotchas",
}
```

Each `question_type` string in the dataset maps to a short category key used in metrics. The suffix `-abs` marks *abstention variants*, which test a related but different skill (see [Abstention: the hidden test beneath three categories](#abstention-the-hidden-test-beneath-three-categories) below).

The four non-abstention categories, including gotchas, are:

| Dataset `question_type` | Harness category key | Memory ability |
|---|---|---|
| `static-environment` | `static` | Static state recall |
| `dynamic-environment` | `dynamic` | Dynamic state tracking |
| `procedure` | `procedure` | Workflow knowledge |
| `errors-gotchas` | `gotchas` | Environment gotchas |

A fifth ability — **premise awareness** — is tested by the `-abs` variants of static, dynamic, and procedure questions.
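As a rough illustration of how the map can be used, the sketch below looks up a category key and derives whether the question is an abstention variant. Only `CATEGORY_MAP` mirrors the harness excerpt above; the helper `describe_question_type` is hypothetical and is not taken from the repository.

```python
# Mirrors the CATEGORY_MAP excerpt from evaluation/harness.py.
CATEGORY_MAP = {
    "static-environment": "static",
    "static-environment-abs": "static-abs",
    "dynamic-environment": "dynamic",
    "dynamic-environment-abs": "dynamic-abs",
    "procedure": "procedure",
    "procedure-abs": "procedure-abs",
    "errors-gotchas": "gotchas",
}


def describe_question_type(question_type: str) -> tuple[str, bool]:
    """Hypothetical helper: return (category key, is_abstention_variant)."""
    category = CATEGORY_MAP[question_type]  # raises KeyError for an unknown type
    return category, category.endswith("-abs")


print(describe_question_type("dynamic-environment-abs"))  # ('dynamic-abs', True)
print(describe_question_type("errors-gotchas"))           # ('gotchas', False)
```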

Sources: [evaluation/harness.py:44-52]()

---

## Ability 1 — Static State Recall

**What it tests:** Does the memory system remember stable facts about the environment — landmark locations, page layouts, UI element names, module affordances, and subtle state differences that do not change between sessions?

Think of it like knowing the floor plan of an office building. A new hire walks in every day unsure where the printer is. An experienced colleague knows it is on the third floor, east wing, next to the supply closet. The environment has not changed; the colleague just carries that layout in memory.

In the benchmark, static questions ask about things that were true throughout the trajectory history and remain true — configuration values, menu structures in a custom Magento storefront, field names in a ServiceNow form. A memory system must have ingested those environment details from prior agent trajectories and be able to surface them when asked.

**Scoring:** Answers are compared to a gold string using normalized phrase-set matching. The scorer strips punctuation, lowercases, normalizes hyphens, and checks that every phrase from the gold answer appears in the prediction:

```python
# evaluation/qa_eval_metrics.py, lines 71-87
def norm_phrase_set_match(
    prediction: str | None,
    answer: str | None,
    *,
    separators: Iterable[str] = DEFAULT_SEPARATORS,
    require_non_empty: bool = True,
    **normalize_kwargs: bool,
) -> bool:
    normalized_pred = normalize_phrase(prediction, **normalize_kwargs)
    answer_phrases = split_phrases(answer, separators=separators, **normalize_kwargs)
    ...
    for phrase in set(answer_phrases):
        pattern = r"\b%s\b" % re.escape(phrase)
        if re.search(pattern, normalized_pred) is None:
            return False
    return True
```

Multiple-choice static questions may also use `mc_choice_match` or `mc_choice_set_match`, which strip option-label noise before comparing letters.
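To make the phrase-set rule concrete, here is a minimal, self-contained sketch. It is a simplified stand-in, not the repository's implementation: `simple_normalize`, `simple_phrase_set_match`, and the comma/semicolon separators are assumptions chosen for illustration, and the real `normalize_phrase` and `DEFAULT_SEPARATORS` may behave differently.

```python
import re
import string


def simple_normalize(text: str) -> str:
    """Assumed normalization: lowercase, hyphens to spaces, drop punctuation."""
    text = text.lower().replace("-", " ")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())  # collapse repeated whitespace


def simple_phrase_set_match(prediction: str, answer: str, separators=(",", ";")) -> bool:
    """True only if every gold phrase appears as a whole-word match."""
    normalized_pred = simple_normalize(prediction)
    # Split the gold answer on separators before punctuation is stripped.
    splitter = "|".join(re.escape(sep) for sep in separators)
    phrases = {simple_normalize(p) for p in re.split(splitter, answer)}
    phrases.discard("")
    if not phrases:
        return False  # an empty gold answer never matches in this sketch
    return all(
        re.search(r"\b%s\b" % re.escape(phrase), normalized_pred) is not None
        for phrase in phrases
    )


print(simple_phrase_set_match("The menus are Sales, Catalog and Customers.", "Catalog, Sales"))  # True
print(simple_phrase_set_match("Only the Sales menu.", "Catalog, Sales"))                          # False
print(simple_phrase_set_match("It is on the Third-Floor east wing.", "third floor"))              # True
```

The second call fails because one gold phrase is missing, which is why a prediction that covers only part of a multi-phrase answer scores zero under this rule.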

Sources: [evaluation/qa_eval_metrics.py:29-87](), [evaluation/harness.py:44-46]()

---

## Ability 2 — Dynamic State Tracking

**What it tests:** Does the memory system understand how states and actions change the environment over time? A static fact is timeless; a dynamic fact has history. Which products are currently in a user's cart? What is the current approval status of a ticket? Did a configuration get changed mid-trajectory?

The analogy: an experienced colleague does not just remember that a certain setting *can* be toggled — they remember that *it was* toggled last Tuesday, and that the current state is therefore different from the default.

Dynamic questions require the memory system to have tracked sequential change across trajectory steps, not just catalogued individual observations. A memory backend that indexes each state snapshot in isolation may miss the final value when an attribute was overwritten.

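**Scoring:** Dynamic questions use the same normalized phrase-matching functions as static questions. Because the gold answer is a specific value at a specific point in time and the matcher requires every gold phrase to appear, a prediction that reports a stale or partial value earns no credit.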
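The overwritten-attribute failure mode can be sketched in a few lines. The `(step, attribute, value)` observation shape and the `latest_state` helper are assumptions made for this example; the benchmark's trajectory format is not shown on this page.

```python
def latest_state(observations):
    """Fold (step, attribute, value) observations into the final value per attribute."""
    state = {}
    for _step, attribute, value in sorted(observations, key=lambda obs: obs[0]):
        state[attribute] = value  # later steps overwrite earlier ones
    return state


observations = [
    (1, "ticket_status", "pending"),
    (2, "cart_items", 2),
    (3, "ticket_status", "approved"),
]

# A lookup that treats snapshots in isolation may return the first hit it finds.
first_seen = next(value for _s, attr, value in observations if attr == "ticket_status")

print(first_seen)                                   # pending  (stale)
print(latest_state(observations)["ticket_status"])  # approved (current)
```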

Sources: [evaluation/harness.py:47-49](), [evaluation/qa_eval_metrics.py:71-87]()

---

## Ability 3 — Workflow Knowledge

**What it tests:** Does the memory system know the steps needed to complete recurring tasks in a customized environment? Generic knowledge of "how Magento works" or "how ServiceNow works" is not enough — the benchmark targets *customized* deployments where the local instance has non-standard workflows.

For example: submitting an expense report in a standard ServiceNow instance might follow a documented path, but a company-specific customization could require an extra approval step, a non-obvious field, or a workaround sequence. An experienced colleague has internalized those local steps from doing them repeatedly.

Procedure questions ask for step sequences, ordered actions, or required sub-tasks. Some are answered as comma-separated lists, which the scorer handles with `norm_phrase_set_match_ordered`:

```python
# evaluation/qa_eval_metrics.py, lines 90-109
def norm_phrase_set_match_ordered(
    prediction: str | None,
    answer: str | None,
    ...
) -> bool:
    ...
    start = 0
    for phrase in answer_phrases:
        pattern = r"\b%s\b" % re.escape(phrase)
        match = re.search(pattern, normalized_pred[start:])
        if match is None:
            return False
        start += match.end()
    return True
```

This enforces that the predicted steps appear in the correct order, reflecting the real constraint that procedural knowledge is not just a bag of words.
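The ordering loop from the excerpt can be exercised on its own. The sketch below assumes the prediction and gold phrases are already normalized; `ordered_phrases_present` is a hypothetical name for the extracted loop, not a function in the repository.

```python
import re


def ordered_phrases_present(normalized_pred: str, answer_phrases: list[str]) -> bool:
    """Each phrase must be found after the end of the previous phrase's match."""
    start = 0
    for phrase in answer_phrases:
        pattern = r"\b%s\b" % re.escape(phrase)
        match = re.search(pattern, normalized_pred[start:])
        if match is None:
            return False
        start += match.end()  # match.end() is relative to the slice
    return True


steps = ["open ticket", "set priority", "submit"]

print(ordered_phrases_present("open ticket then set priority and submit", steps))  # True
print(ordered_phrases_present("set priority then open ticket and submit", steps))  # False
```

In the second call every step is present, but once `open ticket` is matched the remaining text no longer contains `set priority`, so the prediction is rejected.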

Sources: [evaluation/harness.py:50-51](), [evaluation/qa_eval_metrics.py:90-109]()

---

## Ability 4 — Environment Gotchas

**What it tests:** Does the memory system recognize recurring local failure modes and know how to avoid them? Gotchas are traps that documentation does not mention but that are obvious to anyone who has been burned by them. They are the "if you do X, the system breaks in this specific way" insights that only experience teaches.

Examples might include: a button that appears active but silently drops the form submission, a field that accepts values but only saves when a related toggle is on, or a sequence that must be performed in a precise order to avoid a race condition in the customized front end.

**Scoring:** Gotchas answers are *open-ended insights*, not exact strings. Simple string matching is too strict — the same insight can be expressed in many ways. The benchmark uses an LLM judge for this category:

```python
# evaluation/qa_eval_metrics.py, lines 18-25
_GOTCHAS_JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader for gotchas-style insight questions. "
    "The reference answer describes the key insight(s). "
    "Grade 1 if the model response includes at least one correct insight point from the reference answer "
    "(paraphrase allowed), and does not contradict any reference point. "
    "If the model's direction is wrong, or it contains contradictions against any reference point, grade 0. "
    "If the model gives multiple points, partial coverage is enough for 1 as long as no contradictions appear."
)
```

The judge is invoked via `llm_gotchas_checker`, which sends the question, reference answer, full model response, and extracted final answer to an evaluator model (default `gpt-5.2` with `medium` reasoning effort). Partial insight coverage is rewarded; contradictions are penalized even if some wording accidentally overlaps.

```python
# evaluation/harness.py, lines 60-61
LLM_EVAL_FUNCTIONS = {"llm_abstention_checker", "llm_gotchas_checker"}
```

The harness routes questions whose `eval_function` is one of these to the LLM evaluator path rather than rule-based scoring.
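A minimal sketch of that routing decision follows. The helper `scoring_path` is hypothetical, and it assumes the function name is whatever precedes the first `|` in the `eval_function` spec; the harness's own dispatch code is not reproduced here.

```python
# Mirrors the LLM_EVAL_FUNCTIONS excerpt from evaluation/harness.py.
LLM_EVAL_FUNCTIONS = {"llm_abstention_checker", "llm_gotchas_checker"}


def scoring_path(eval_function_spec: str) -> str:
    """Hypothetical helper: return 'llm' or 'rule' for an eval_function spec."""
    name = eval_function_spec.split("|", 1)[0].strip()  # drop any options
    return "llm" if name in LLM_EVAL_FUNCTIONS else "rule"


print(scoring_path("llm_gotchas_checker"))               # llm
print(scoring_path("norm_phrase_set_match|lower=true"))  # rule
```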

Sources: [evaluation/qa_eval_metrics.py:18-25, 290-356](), [evaluation/harness.py:60-61]()

---

## Ability 5 — Premise Awareness

**What it tests:** Does the memory system catch assumptions that are valid in general but wrong in this specific deployment? Premise-aware questions are framed as if something is true, when in fact the memory record shows it is not. A system without good memory will confidently answer based on the (false) premise. A system with good memory will detect the contradiction and explain why the question's premise is flawed.

This is the most subtle ability. It is not enough to say "I don't know" (that earns a zero). The system must both reject the premise *and* identify the specific flaw, matching the explanation in the reference answer.

**How it appears in the harness:** Premise-awareness questions carry `question_type` values ending in `-abs` (`static-environment-abs`, `dynamic-environment-abs`, `procedure-abs`). The harness maintains a separate accounting for these:

```python
# evaluation/harness.py, lines 53-59
NON_ABSTENTION_CATEGORIES = ["static", "dynamic", "procedure", "gotchas"]
ABSTENTION_CATEGORIES = ["static-abs", "dynamic-abs", "procedure-abs"]
COMBINED_ABSTENTION_CATEGORY_PAIRS = {
    "static": ("static", "static-abs"),
    "dynamic": ("dynamic", "dynamic-abs"),
    "procedure": ("procedure", "procedure-abs"),
}
```

The domain system prompts instruct the reader model about both behaviors simultaneously:

```python
# evaluation/harness.py, lines 76-78 (web domain excerpt)
"If you do not know the answer, output exactly \\boxed{UNKNOWN}. "
"Do not guess. Never attempt to guess an answer if you are not sure. "
"If you believe the question's construction/premise is wrong, provide an "
"explanation in \\boxed{} explaining why the question is flawed."
```

**Scoring:** Premise-awareness questions use `llm_abstention_checker`, an LLM judge with a strict rubric:

```python
# evaluation/qa_eval_metrics.py, lines 8-17
_ABSTENTION_JUDGE_SYSTEM_PROMPT = (
    "You are a strict grader for flawed-premise (abstention) questions. "
    "Judge whether a model answer correctly identifies that the question premise is wrong, "
    "consistent with the reference answer. "
    "If the model follows the flawed premise and gives a concrete answer under that premise, "
    "it must be graded 0. "
    "If the model's final answer is just UNKNOWN / cannot determine without identifying the flaw, grade 0. "
    "If the model is contradictory (both rejects premise and also gives a concrete premise-following answer), grade 0. "
    "Paraphrases are allowed when they preserve the same core flaw described by the reference answer."
)
```

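Four failure modes earn zero: following the premise, answering "unknown" without naming the flaw, giving a self-contradictory answer, and offering a paraphrase that misses the core flaw. The first three are stated explicitly in the rubric; the fourth follows from its requirement that paraphrases preserve the flaw described by the reference answer. Only a response that identifies the correct flaw earns a 1.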

Sources: [evaluation/harness.py:53-59, 69-88](), [evaluation/qa_eval_metrics.py:8-17, 221-287]()

---

## Abstention: The Hidden Test Beneath Three Categories

Static, dynamic, and procedure categories each have a parallel abstention variant. These are *not* separate categories in the final leaderboard scores — they are combined back into their parent categories for the `combined_abstention_by_category` metrics:

```python
# evaluation/harness.py, lines 986-990
combined_abstention_by_category: dict[str, Any] = {}
for cat, pair in COMBINED_ABSTENTION_CATEGORY_PAIRS.items():
    rows = [r for r in records if r["category"] in pair]
    combined_abstention_by_category[cat] = breakdown(rows)
```

This means the published accuracy for "static" in the leaderboard blends both plain static recall and premise-detection for static questions. A memory system that is excellent at recall but blind to false premises will look weaker than its recall alone would suggest.
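The blending effect is easy to see with made-up numbers. In the sketch below the records, the `correct` field, and the `accuracy` helper are all illustrative assumptions (the harness's `breakdown` function is not shown on this page), and the counts are invented for the example rather than taken from any benchmark run.

```python
COMBINED_ABSTENTION_CATEGORY_PAIRS = {
    "static": ("static", "static-abs"),
    "dynamic": ("dynamic", "dynamic-abs"),
    "procedure": ("procedure", "procedure-abs"),
}


def accuracy(rows):
    """Assumed stand-in for breakdown(): fraction of rows marked correct."""
    return sum(r["correct"] for r in rows) / len(rows) if rows else 0.0


# Invented example: strong plain recall (9/10), weak premise detection (1/5).
records = (
    [{"category": "static", "correct": True}] * 9
    + [{"category": "static", "correct": False}] * 1
    + [{"category": "static-abs", "correct": True}] * 1
    + [{"category": "static-abs", "correct": False}] * 4
)

plain_static = accuracy([r for r in records if r["category"] == "static"])
combined_static = accuracy(
    [r for r in records if r["category"] in COMBINED_ABSTENTION_CATEGORY_PAIRS["static"]]
)

print(round(plain_static, 3))     # 0.9
print(round(combined_static, 3))  # 0.667
```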

Sources: [evaluation/harness.py:976-998]()

---

## How the Harness Uses Category Information

At runtime, the harness reads `question_type` from each question record, converts it to a category key via `CATEGORY_MAP`, and stores the result on the prepared question row. The category decides which metrics bucket the record is reported under; the scoring branch is chosen separately, from the question's `eval_function` spec:

```python
# evaluation/harness.py, lines 1099-1100
q_eval_name = eval_name(q_eval_spec)
...
"category": category_from_question_type(qtype),
```

During scoring (pass 3 of the three-pass pipeline), the record is routed to either the LLM evaluator path (for `llm_gotchas_checker` and `llm_abstention_checker`) or the rule-based path (for all other eval function specs such as `norm_phrase_set_match` or `mc_choice_match`). The `eval_function` field in each question JSON encodes not just the function name but also pipe-delimited options:

```python
# evaluation/qa_eval_metrics.py, lines 568-595
def parse_eval_function_spec(spec: str) -> tuple[Callable[..., Any], dict[str, Any]]:
    parts = [part.strip() for part in spec.split("|")]
    name = parts[0]
    ...
    kwargs: dict[str, Any] = {}
    for part in parts[1:]:
        key, value = part.split("=", 1)
        kwargs[key] = _parse_eval_value(key, value)
    return func, kwargs
```

For example, a spec of `norm_phrase_set_match|lower=true|separators=[,;]` selects the function and passes custom normalization options without changing the harness code.
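The pipe-delimited format can be tried out with a simplified parser. This sketch returns the function name rather than a callable, and `parse_value` is a hypothetical stand-in for `_parse_eval_value` that only converts booleans; the real helper may interpret values such as `[,;]` differently.

```python
def parse_value(value: str) -> object:
    """Hypothetical stand-in: booleans are converted, anything else stays a string."""
    lowered = value.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    return value


def parse_spec(spec: str) -> tuple[str, dict[str, object]]:
    """Split 'name|key=value|...' into a function name and keyword options."""
    parts = [part.strip() for part in spec.split("|")]
    name = parts[0]
    kwargs: dict[str, object] = {}
    for part in parts[1:]:
        key, value = part.split("=", 1)  # split once so values may contain '='
        kwargs[key.strip()] = parse_value(value.strip())
    return name, kwargs


print(parse_spec("norm_phrase_set_match|lower=true|separators=[,;]"))
# ('norm_phrase_set_match', {'lower': True, 'separators': '[,;]'})
```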

Sources: [evaluation/harness.py:1087-1119](), [evaluation/qa_eval_metrics.py:568-601]()

---

## What a Memory Backend Must Do

Every memory backend — whether the trivial `no_retrieval` baseline or a complex RAG system — must implement two methods from `memory_modules/memory.py`:

```python
# memory_modules/memory.py, lines 43-54
@abstractmethod
def insert(self, trajectory: dict[str, object]) -> None:
    """Index one full trajectory object into the backend."""
    raise NotImplementedError

@abstractmethod
def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    """Return a formatted memory context payload for a query."""
    raise NotImplementedError
```

The `insert` call runs once per trajectory during haystack construction. The `query` call runs once per question and must return a list of `{"type": "text"|"image", "value": ...}` items. The harness appends these to the reader prompt before calling the answer model.

The no-retrieval baseline shows the floor: it ignores all trajectories and returns an empty list, forcing the reader to rely entirely on parametric knowledge. Improving on this baseline in every category requires a backend that can retrieve relevant evidence for all five ability types.

```python
# memory_modules/no_retrieval.py, lines 9-18
@register_memory
class NoRetrievalMemory(Memory):
    memory_type = "no_retrieval"

    def insert(self, trajectory: dict[str, object]) -> None:
        return None

    def query(self, query: str, query_image: str | None = None) -> list[MemoryContextItem]:
        return []
```
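For contrast with the empty baseline, here is a toy backend that actually stores and retrieves text. It is a sketch under several assumptions: `Memory` and `MemoryContextItem` are local stand-ins for the repository's definitions, the `summary` trajectory field is hypothetical, the `@register_memory` decorator is omitted, and keyword overlap is used only because it is simple, not because the benchmark recommends it.

```python
from abc import ABC, abstractmethod
from typing import Optional

MemoryContextItem = dict  # stand-in for the repository's item type


class Memory(ABC):
    """Local stand-in for the abstract base class in memory_modules/memory.py."""

    @abstractmethod
    def insert(self, trajectory: dict) -> None: ...

    @abstractmethod
    def query(self, query: str, query_image: Optional[str] = None) -> list: ...


class KeywordOverlapMemory(Memory):
    """Toy backend: rank stored texts by the number of words shared with the query."""

    def __init__(self, top_k: int = 2) -> None:
        self.top_k = top_k
        self._texts: list[str] = []

    def insert(self, trajectory: dict) -> None:
        # Assumption: a 'summary' field exists; otherwise fall back to str().
        self._texts.append(str(trajectory.get("summary", trajectory)))

    def query(self, query: str, query_image: Optional[str] = None) -> list:
        query_words = set(query.lower().split())
        scored = [(len(query_words & set(t.lower().split())), t) for t in self._texts]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [{"type": "text", "value": text} for _score, text in ranked[: self.top_k]]


memory = KeywordOverlapMemory(top_k=1)
memory.insert({"summary": "The approval toggle lives under Settings then Workflow"})
memory.insert({"summary": "Checkout requires a shipping estimate first"})
items = memory.query("Where is the approval toggle?")
print(items)
# [{'type': 'text', 'value': 'The approval toggle lives under Settings then Workflow'}]
```

A backend like this returns items in the `{"type": ..., "value": ...}` shape described above, but whole-word overlap is a weak retrieval signal and would likely struggle with dynamic and premise-awareness questions.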

Sources: [memory_modules/memory.py:43-54](), [memory_modules/no_retrieval.py:4-18]()

---

## Leaderboard Metrics by Category

The leaderboard extracts five accuracy numbers, four per-category and one overall, from the `aggregated_metrics.json` file that each run produces:

| Leaderboard metric | Source category | Ability tested |
|---|---|---|
| `static_accuracy` | `static` (+ `static-abs`) | Static state recall + premise awareness |
| `dynamic_accuracy` | `dynamic` (+ `dynamic-abs`) | Dynamic state tracking + premise awareness |
| `procedure_accuracy` | `procedure` (+ `procedure-abs`) | Workflow knowledge + premise awareness |
| `gotchas_accuracy` | `gotchas` | Environment gotchas |
| `overall_full_set` | All categories | Aggregate across all five abilities |

The final LAFS score combines `overall_full_set` accuracy with `memory_query_avg_seconds` latency, so a memory system cannot win by being accurate but unbearably slow. The tradeoff between all five abilities and retrieval speed is the central design challenge the benchmark is intended to expose.

Sources: [leaderboard/README.md](), [evaluation/harness.py:968-998]()

---

## Summary

LongMemEval-V2 breaks "long-term memory" into five testable components: static environment knowledge, dynamic change tracking, step-by-step workflow recall, recognition of local failure patterns, and the ability to challenge a false assumption rather than answer it blindly. Each ability maps to one or more `question_type` tags in the dataset, to evaluation functions in `evaluation/qa_eval_metrics.py`, and to a bucket in the `aggregated_metrics.json` output.

Two of the five abilities, gotchas and premise awareness, require an LLM judge rather than rule-based matching, reflecting that insight and flaw detection cannot be reduced to string overlap. Any memory backend that performs well across all five must combine faithful retrieval of stable facts, temporal ordering of state changes, procedural sequencing, failure-mode recognition, and the epistemic discipline to say "that premise is wrong" when the evidence demands it.

Sources: [evaluation/harness.py:44-61](), [evaluation/qa_eval_metrics.py:8-25]()
