# The Evaluation Harness: From Question to Score

> How harness.py feeds each question to a memory backend, collects context items, calls the reader model, and scores the answer — including the LLM judge paths for abstention and gotchas questions, and how shell scripts wire it all together.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `evaluation/harness.py`
- `evaluation/run_eval.py`
- `evaluation/qa_eval_metrics.py`
- `evaluation/scripts/run_no_retrieval.sh`
- `evaluation/scripts/run_rag_query_to_slice.sh`
- `evaluation/memory_configs/rag_query_to_slice.json`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [evaluation/harness.py](evaluation/harness.py)
- [evaluation/run_eval.py](evaluation/run_eval.py)
- [evaluation/qa_eval_metrics.py](evaluation/qa_eval_metrics.py)
- [evaluation/scripts/run_no_retrieval.sh](evaluation/scripts/run_no_retrieval.sh)
- [evaluation/scripts/run_rag_query_to_slice.sh](evaluation/scripts/run_rag_query_to_slice.sh)
- [evaluation/memory_configs/rag_query_to_slice.json](evaluation/memory_configs/rag_query_to_slice.json)
- [evaluation/memory_configs/no_retrieval.json](evaluation/memory_configs/no_retrieval.json)
- [memory_modules/memory.py](memory_modules/memory.py)
</details>

# The Evaluation Harness: From Question to Score

The evaluation harness is the central pipeline that turns a dataset of questions — together with a memory backend and a reader model — into per-question scores and aggregate metrics. Understanding it is essential for anyone who wants to reproduce results, add a new memory backend, or debug why a particular question was answered incorrectly.

This page traces the full lifecycle of a single question: how the haystack of trajectories is loaded into a memory backend, how that backend returns context items for the question, how those items are assembled into a prompt for the reader model, how the model's response is parsed, and finally how the harness decides whether the answer is correct — including the two special paths that call a separate LLM judge for abstention and gotchas questions.

---

## The Two Entry Points

There are two ways to launch an evaluation run.

**`run_eval.py` — the user-facing wrapper.** It accepts high-level arguments (`--method`, `--domain`, `--tier`, `--data-root`) and does three things before calling the harness: it materializes a filtered `questions.json` and a `haystack.json` into a `runtime_inputs/` directory, builds a `memory_config.json` from the chosen method, and then invokes `harness.main()` in-process with the translated `sys.argv`.

**`harness.py` — the execution engine.** It owns the three-pass pipeline (build prompts → generate reader outputs → score) and all file I/O for results. It does not know which evaluation method is being used; that is entirely determined by the `memory_config` it receives.

Sources: [evaluation/run_eval.py:190-268](), [evaluation/harness.py:119-198]()

```
# Shell script → run_eval.py → harness.py (in-process)
# run_eval.py builds runtime_inputs/, then:
sys.argv = harness_argv
from evaluation.harness import main as harness_main
harness_main()
```

Sources: [evaluation/run_eval.py:213-268]()

### Shell Script Wrappers

The `evaluation/scripts/` directory contains one Bash script per method. Each script is a thin driver: it validates that the caller has not passed arguments it owns, then iterates over both domains (`web` and `enterprise`) and calls `run_eval.py`:

```bash
for domain in web enterprise; do
  python "$REPO_ROOT/evaluation/run_eval.py" \
    --method "$METHOD" \
    --data-root "$DATA_ROOT_VALUE" \
    --domain "$domain" \
    --tier "$TIER_VALUE" \
    --output-dir "$OUTPUT_ROOT_VALUE/${METHOD}_${domain}_${TIER_VALUE}" \
    "$@"
done
```

Sources: [evaluation/scripts/run_rag_query_to_slice.sh:20-28](), [evaluation/scripts/run_no_retrieval.sh:20-28]()

The environment variables `DATA_ROOT`, `OUTPUT_ROOT`, and `TIER` drive where data is read and results are written. Additional flags (model names, concurrency limits, API keys) pass through `"$@"` unchanged to `run_eval.py`.

---

## Memory Backends and Configuration

Every method is expressed as a JSON config with exactly two keys: `memory_type` and `memory_params`. The harness loads this config, instantiates the matching `Memory` subclass, and calls `insert()` and `query()` on it — nothing else.

```json
{
  "memory_type": "rag",
  "memory_params": {
    "controller_params": { "model": "Qwen/Qwen3.5-9B", "base_url": "http://localhost:8023/v1", ... },
    "embedding_params":  { "model": "Qwen/Qwen3-Embedding-8B", "base_url": "http://localhost:8114/v1", ... },
    "index_params":      { "raw_state_slice_radius": 1 },
    "retrieval_params":  { "enable_notes": false, "raw_state_search_top_k": 6 }
  }
}
```

Sources: [evaluation/memory_configs/rag_query_to_slice.json:1-35]()

The `no_retrieval` config is the minimal case — empty params, no external services required:

```json
{ "memory_type": "no_retrieval", "memory_params": {} }
```

Sources: [evaluation/memory_configs/no_retrieval.json:1-4]()

`run_eval.py` assembles these configs programmatically from CLI arguments for the six supported methods: `no_retrieval`, `rag_query_to_slice`, `rag_query_to_slice_notes`, `agentrunbook_r`, `codex`, and `agentrunbook_c`. Sources: [evaluation/run_eval.py:21-28]()

All concrete memory classes implement the `Memory` abstract base class:

| Method | `insert()` behavior | `query()` return |
|---|---|---|
| `no_retrieval` | no-op | empty list |
| `rag` | embed trajectory slices into vector index | top-k text/image slices |
| `agentrunbook_r` | embed + build runbook summary | multi-query retrieved slices |
| `codex` | invoke Codex CLI on trajectory files | text context items |
| `agentrunbook_c` | Codex-driven agentic runbook | text context items |

Sources: [memory_modules/memory.py:25-68](), [memory_modules/memory.py:178-223]()

---

## Pass 1: Building Prompts

The harness works in three sequential passes. Pass 1 turns each question into a fully assembled prompt.

### Shared vs. Per-Question Haystack

The first fork in the pipeline decides whether one memory object can be shared across all questions.

- **Shared haystack**: all questions draw from the same ordered list of trajectory IDs. The harness builds a single `Memory` instance, inserts all trajectories once with a progress bar, then queries it for every question. This is the common case for standard benchmark runs.
- **Per-question haystack**: different questions see different trajectory sets. A separate `Memory` is built and populated for every question. This is only supported for memory types in `NONSHARED_PARALLEL_MEMORY_TYPES` (`rag`, `codex`, `agentrunbook_r`, `agentrunbook_c`) when `--prompt-build-max-workers > 1`.

Sources: [evaluation/harness.py:439-444](), [evaluation/harness.py:1136-1199]()

### `build_prompt_row`

For each question, the core function `build_prompt_row` does the following:

1. Calls `memory.set_query_context(question_id, question_type, question_item)` to pass metadata the backend may use.
2. Calls `memory.query(question_text, query_image=...)`, which returns a `list[MemoryContextItem]`. Each item is `{"type": "text"|"image", "value": "..."}`.
3. Validates the items.
4. Calls `memory.post_query_hook(...)` for optional post-retrieval work (timing is tracked separately).
5. Calls `truncate_memory_context()` if the context exceeds `--memory-context-max-tokens` (default 200,000), using a binary search over token counts measured with the Qwen3.5-9B processor.
6. Calls `build_messages()` to assemble the final prompt.

Sources: [evaluation/harness.py:536-593]()

### Token Counting and Truncation

The truncation check counts tokens using `transformers.AutoProcessor` for `Qwen/Qwen3.5-9B` — the same tokenizer the default reader model uses. Images are loaded via PIL and counted as their tokenized visual representation. A binary search finds the largest prefix of context items that fits within the limit.

Sources: [evaluation/harness.py:357-436]()

### Prompt Structure

`build_messages` assembles the final chat messages in a fixed layout:

```
[system]  domain-specific instruction
[user]    ### Memory context:
          <context items — text blocks or base64-encoded image_url>

          ### Question to answer:
          <question text>
          [optional question image]
```

The system prompt instructs the model to answer from memory, output `\boxed{UNKNOWN}` if it does not know, and explain in `\boxed{}` if the question's premise is wrong. Sources: [evaluation/harness.py:501-533](), [evaluation/harness.py:69-88]()

Two versions of the message list are produced: `messages` (with images as base64 data URLs, sent to the model API) and `messages_for_log` (with image file paths, written to disk). Sources: [evaluation/harness.py:501-533]()

---

## Pass 2: Calling the Reader Model

After all prompts are built, the harness fires all requests concurrently using `asyncio` and the OpenAI-compatible chat completions API.

```python
async def generate_all_reader_outputs(args, prompt_rows):
    client = create_async_client(...)
    semaphore = asyncio.Semaphore(args.reader_max_concurrent_requests)  # default 500
    tasks = [asyncio.create_task(run_one(row)) for row in prompt_rows]
    ...
```

Sources: [evaluation/harness.py:889-930]()

The harness is provider-neutral: it targets any OpenAI-compatible endpoint via `--base-url`. Local vLLM servers, cloud APIs, and proxies all work. When `--base-url` is set, it uses `max_tokens` (local convention); otherwise it uses `max_completion_tokens` (OpenAI API convention). Sources: [evaluation/harness.py:842-866]()

After each response arrives, `extract_boxed_answer` parses the final `\boxed{...}` from the raw text (depth-aware brace matching, taking the last occurrence). If none is found, the full response is used. `is_unknown` checks if the parsed answer equals `"UNKNOWN"` (case-insensitive). Sources: [evaluation/qa_eval_metrics.py:180-206]()

---

## Pass 3: Scoring

Pass 3 iterates `prompt_rows` sequentially, merging each row with its reader output and calling `score_prediction`. Results are written to `per_question.jsonl` incrementally (one flush per question) so partial results survive interruption. Sources: [evaluation/harness.py:1358-1411]()

### Eval Function Dispatch

Every question carries an `eval_function` field whose value is a pipe-delimited spec string, for example:

```
norm_phrase_set_match
norm_phrase_set_match|separators=[,]|lower=true
llm_abstention_checker
llm_gotchas_checker
```

`eval_from_spec` parses the name and any `key=value` options, looks up the function by name in `qa_eval_metrics.py`'s global namespace, and calls it. Sources: [evaluation/qa_eval_metrics.py:568-601]()

The available deterministic eval functions include:

| Function | What it checks |
|---|---|
| `norm_phrase_set_match` | All gold phrases appear in the prediction (word-boundary regex, normalized) |
| `norm_phrase_set_match_ordered` | Same but phrases must appear in order |
| `mc_choice_match` | Single multiple-choice letter matches |
| `mc_choice_set_match` | Set of multiple-choice letters matches (multi-select) |

Sources: [evaluation/qa_eval_metrics.py:71-178]()

### LLM Judge Paths

Two question types require a second model call for scoring: abstention questions (`-abs` suffix) and gotchas questions. These are identified by `LLM_EVAL_FUNCTIONS = {"llm_abstention_checker", "llm_gotchas_checker"}`. When `score_prediction` sees one of these, it routes the **full raw response** (not just the parsed boxed answer) to the judge. Sources: [evaluation/harness.py:60](), [evaluation/harness.py:1012-1037]()

**`llm_abstention_checker`** — evaluates whether the model correctly identified a flawed question premise. The system prompt is strict: the model must name the flaw and reach the same conclusion as the reference answer. A generic `UNKNOWN` reply without identifying the flaw scores 0. The judge outputs `{"label": 0|1, "reason": "..."}`.

Sources: [evaluation/qa_eval_metrics.py:7-17](), [evaluation/qa_eval_metrics.py:221-287](), [evaluation/qa_eval_metrics.py:380-413]()

**`llm_gotchas_checker`** — evaluates whether the model surfaced the correct insight from an errors/gotchas question. Partial coverage of a multi-point reference answer is sufficient for label 1, as long as no point is contradicted.

Sources: [evaluation/qa_eval_metrics.py:18-25](), [evaluation/qa_eval_metrics.py:290-356](), [evaluation/qa_eval_metrics.py:416-447]()

Both judge functions call a separate evaluator model (default `gpt-5.2`) configured independently from the reader model. This separation allows using a stronger model for judgment without affecting the capability being measured. Sources: [evaluation/run_eval.py:82-84](), [evaluation/harness.py:186-198]()

The judge response is parsed with `_parse_llm_binary_judgement`, which tries `json.loads` first and falls back to regex extraction for malformed outputs. Sources: [evaluation/qa_eval_metrics.py:459-488]()

### Abstention Scoring Gotcha

When a question is marked `is_unknown` (the reader output `\boxed{UNKNOWN}`), `score_prediction` forces `score_bool = False` regardless of what the eval function returns. This means a model that always replies UNKNOWN will score 0 on non-abstention questions, preventing a trivial exploit. Sources: [evaluation/harness.py:1034-1036]()

---

## Output Files

After all three passes complete, the harness writes the following files to `--output-dir`:

| File | Contents |
|---|---|
| `run_args.json` | All CLI arguments plus `started_at_utc` |
| `prompt_rows.jsonl` | One row per question with full prompt, memory context, and timing |
| `prompt_build_summary.json` | Question order and count after prompt build |
| `per_question.jsonl` | Per-question record with score, raw response, token usage |
| `aggregated_metrics.json` | Overall and per-category accuracy, token stats, memory timing |
| `memory_state/` (optional) | Saved shared memory, loadable with `--load-memory-dir` |

Sources: [evaluation/harness.py:1044-1045](), [evaluation/harness.py:1329-1341](), [evaluation/harness.py:1345-1412](), [evaluation/harness.py:1413-1492]()

### Aggregated Metrics Structure

`aggregate_metrics` partitions questions into non-abstention and abstention groups, then breaks each down by category. The top-level result includes:

```
overall_full_set            # mean score over all questions
overall_non_abstention_only # mean score ignoring -abs questions
overall_abstention_only     # mean score for -abs questions only
non_abstention_by_category  # {static, dynamic, procedure, gotchas} → {count, pct_correct, ...}
abstention_by_category      # {static-abs, dynamic-abs, procedure-abs} → breakdown
combined_abstention_by_category  # paired non-abs + abs per category
```

Sources: [evaluation/harness.py:938-998](), [evaluation/harness.py:44-59]()

---

## End-to-End Flow Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│  Shell script (run_rag_query_to_slice.sh)                       │
│  Sets DATA_ROOT, TIER, loops over domain=web,enterprise         │
└───────────────────────────┬─────────────────────────────────────┘
                            │ python run_eval.py --method rag_query_to_slice ...
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  run_eval.py                                                    │
│  • materialize_runtime_questions → runtime_inputs/questions.json│
│  • materialize_runtime_haystack  → runtime_inputs/haystack.json │
│  • build_memory_config()         → runtime_inputs/memory_config │
│  • harness.main() [in-process]                                  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 1: Build Prompts                             │
│                                                                 │
│  build_memory(memory_config)                                    │
│  for each trajectory_id in haystack:                           │
│      memory.insert(trajectory)                                  │
│                                                                 │
│  for each question:                                             │
│      memory.set_query_context(...)                              │
│      ctx_items = memory.query(question_text, image)  ← backend │
│      truncate_memory_context(ctx_items, max_tokens=200k)        │
│      build_messages(system_prompt, ctx_items, question)         │
│      → prompt_row                                               │
└───────────────────────────┬─────────────────────────────────────┘
                            │ prompt_rows.jsonl saved
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 2: Reader Model                              │
│                                                                 │
│  asyncio + semaphore (≤500 concurrent)                          │
│  for each prompt_row:                                           │
│      POST /chat/completions → raw_response                      │
│      extract_boxed_answer(raw) → parsed_boxed                   │
│      is_unknown(parsed_boxed) → bool                            │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│  harness.py — PASS 3: Score                                     │
│                                                                 │
│  for each row:                                                  │
│      eval_from_spec(eval_function, parsed, gold)                │
│        ├─ deterministic: norm_phrase_set_match, mc_choice_match  │
│        └─ LLM judge:                                            │
│             llm_abstention_checker → POST to evaluator model    │
│             llm_gotchas_checker    → POST to evaluator model    │
│      if is_unknown: force score_bool = False                    │
│      write per_question.jsonl                                   │
│                                                                 │
│  aggregate_metrics(records) → aggregated_metrics.json           │
└─────────────────────────────────────────────────────────────────┘
```

---

## Summary

The harness is a clean three-pass pipeline: memory build + prompt assembly → concurrent reader inference → sequential scoring. Its key design choices are: the `Memory` abstract interface that keeps all retrieval logic out of the harness; the OpenAI-compatible client that makes the reader and evaluator models interchangeable; the `\boxed{}` answer convention that gives all eval functions a well-defined target to match; and the two LLM judge paths that handle question types where string matching is insufficient. The shell scripts and `run_eval.py` layer on top add convenience (domain loops, runtime config assembly) without coupling the engine to any specific method. Sources: [evaluation/harness.py:1040-1496]()
