# LongMemEval-V2 Plain-Language Wiki

> LongMemEval-V2 is a benchmark that tests whether an AI agent's memory system can turn long histories of web-browsing actions into the kind of practical knowledge a seasoned colleague would have. The repo ships the dataset pipeline, a pluggable memory framework, an evaluation harness, and leaderboard packaging utilities.

This is a source-grounded Grok-Wiki for the repository. Use the complete Markdown link under Context Links when an agent needs the full repository context.

## Context Links

- [Complete Markdown wiki](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt)
- [Complete Markdown alias](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2.md)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2)
- [GitHub repository](https://github.com/xiaowu0162/LongMemEval-V2)

## Repository

- Repository: xiaowu0162/LongMemEval-V2
- Generated: 2026-05-22T06:16:36.043Z
- Updated: 2026-05-22T06:49:07.550Z
- Runtime: Claude Code
- Format: Explain Like I'm 5
- Pages: 6

## Pages

- [Explain It Simply: What This Repo Does](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/01-explain-it-simply-what-this-repo-does.md): What LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.
- [Five Things a Good Memory Must Know](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/02-five-things-a-good-memory-must-know.md): The five memory abilities the benchmark tests — static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness — explained with real question categories from the harness source code.
- [Downloading & Preparing the Haystack](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/03-downloading-preparing-the-haystack.md): How trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects — covering the three data scripts and the validate step.
- [The Six Memory Backends: How Each One Works](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/04-the-six-memory-backends-how-each-one-works.md): A plain-English tour of the six pluggable memory backends — no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C — explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.
- [The Evaluation Harness: From Question to Score](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/05-the-evaluation-harness-from-question-to-score.md): How harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer — including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.
- [Scoring, LAFS, & What to Remember](https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/pages/06-scoring-lafs-what-to-remember.md): How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

## Source Files

- `data/download_data.py`
- `data/prepare_data.py`
- `data/public_data.py`
- `data/validate_data.py`
- `environment.yml`
- `evaluation/harness.py`
- `evaluation/memory_configs/no_retrieval.json`
- `evaluation/memory_configs/rag_query_to_slice.json`
- `evaluation/qa_eval_metrics.py`
- `evaluation/run_eval.py`
- `evaluation/scripts/run_no_retrieval.sh`
- `evaluation/scripts/run_rag_query_to_slice.sh`
- `leaderboard/build_submission_step_1_single_operating_point.py`
- `leaderboard/build_submission_step_2_build_package.py`
- `leaderboard/combine_aggregated_metrics.py`
- `leaderboard/compute_lafs.py`
- `leaderboard/README.md`
- `leaderboard/submission_utils.py`
- `memory_modules/agentrunbook_c.py`
- `memory_modules/agentrunbook_r.py`
- `memory_modules/codex.py`
- `memory_modules/memory.py`
- `memory_modules/no_retrieval.py`
- `memory_modules/rag.py`
- `memory_modules/support.py`
- `memory_modules/trajectory_store.py`
- `pyproject.toml`
- `README.md`