Agent-readable wiki

LongMemEval-V2 Plain-Language Wiki

LongMemEval-V2 is a benchmark that tests whether an AI agent's memory system can turn long histories of web-browsing actions into the kind of practical knowledge a seasoned colleague would have. The repo ships the dataset pipeline, a pluggable memory framework, an evaluation harness, and leaderboard packaging utilities.

Pages

Explain It Simply: What This Repo Does
What LongMemEval-V2 is in plain language, the one analogy to keep, and the three ideas every reader should hold onto before going deeper.

Five Things a Good Memory Must Know
The five memory abilities the benchmark tests (static recall, dynamic tracking, workflow knowledge, gotchas, and premise awareness), explained with real question categories from the harness source code.

Downloading & Preparing the Haystack
How trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects, covering the three data scripts and the validate step.

The Six Memory Backends: How Each One Works
A plain-English tour of the six pluggable memory backends (no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C), explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.

The Evaluation Harness: From Question to Score
How harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer, including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.

Scoring, LAFS, & What to Remember
How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

Complete Markdown

The complete agent-readable Markdown files are published separately from this HTML page.