# Scoring, LAFS, & What to Remember

> How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `leaderboard/README.md`
- `leaderboard/compute_lafs.py`
- `leaderboard/combine_aggregated_metrics.py`
- `leaderboard/build_submission_step_1_single_operating_point.py`
- `leaderboard/build_submission_step_2_build_package.py`
- `leaderboard/submission_utils.py`

---

This page explains how LongMemEval-V2 turns raw evaluation runs into a single comparable number on the leaderboard. It covers the two-domain merge that produces a unified metric view, how LAFS converts an accuracy-latency trade-off curve into one score, the two-command packaging process that turns folders of results into a submittable archive, and the key ideas to carry away when reading or extending this repo.

Everything described here lives inside the `leaderboard/` directory. No external service is required: the scripts run locally, all reference values are hard-coded in the source, and the resulting `.tar.gz` is submitted through a web form.

---

## How Web and Enterprise Results Are Merged

Every leaderboard entry is evaluated on two independently run domains: **web** and **enterprise**. These domains represent different haystack compositions and question styles. Because the two domains contain different numbers of questions, a plain mean of the two domain scores would give each question in the smaller domain more influence than each question in the larger one. The solution is **example-count-weighted averaging**, which weights every question equally.

The merging happens in `combine_aggregated_metrics.py`. The entry point reads two `aggregated_metrics.json` files — one from the web run, one from the enterprise run — and combines every numeric field using the total question count from each domain as the weight:

```python
# leaderboard/combine_aggregated_metrics.py:100-111
def weighted_average(values: list[tuple[Any, int]]) -> float | None:
    weighted_total = 0.0
    total_count = 0
    for value, count in values:
        numeric = as_number(value)
        if numeric is None or count == 0:
            continue
        weighted_total += numeric * count
        total_count += count
    if total_count == 0:
        return None
    return weighted_total / total_count
```

This same weighted-average function is applied to:

- `overall_full_set`, `overall_non_abstention_only`, `overall_abstention_only` (accuracy scores)
- per-category accuracy breakdowns (`gotchas`, `static`, `dynamic`, `procedure`)
- timing fields: `memory_query`, `memory_post_query`
- token usage: prompt, completion, and total token counts

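For timing, the average is obtained slightly differently: per-domain totals are summed and then divided by the combined question count. The result equals a count-weighted mean of the per-domain averages, so timing follows the same weighting principle as the accuracy fields. `max_seconds` takes the maximum across both domains rather than an average.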
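
To make the weighting concrete, here is a minimal sketch with made-up numbers. The question counts, accuracies, and timings are hypothetical, and `weighted_mean` is a simplified stand-in for illustration, not the repo's `weighted_average`.

```python
def weighted_mean(pairs):
    """Count-weighted mean of (value, count) pairs; None if nothing is usable."""
    usable = [(v, n) for v, n in pairs if v is not None and n > 0]
    total_count = sum(n for _, n in usable)
    if total_count == 0:
        return None
    return sum(v * n for v, n in usable) / total_count

# Hypothetical domains: web has 300 questions, enterprise has 100.
web_acc, web_n = 0.60, 300
ent_acc, ent_n = 0.70, 100

combined_acc = weighted_mean([(web_acc, web_n), (ent_acc, ent_n)])
unweighted_acc = (web_acc + ent_acc) / 2  # over-weights the smaller domain

# Timing: summing per-domain totals and dividing by the combined count
# gives the same value as count-weighting the per-domain averages.
web_avg_s, ent_avg_s = 4.0, 8.0
total_seconds = web_avg_s * web_n + ent_avg_s * ent_n
combined_avg_s = total_seconds / (web_n + ent_n)

print(combined_acc, unweighted_acc, combined_avg_s)
```

With these inputs the weighted accuracy lands closer to the web score than the unweighted mean does, because web contributes three times as many questions.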

Sources: [leaderboard/combine_aggregated_metrics.py:100-111](), [leaderboard/combine_aggregated_metrics.py:288-321](), [leaderboard/combine_aggregated_metrics.py:324-375]()

---

## The Extracted Metric Overview

After merging, `submission_utils.build_metric_overview` extracts the six fields that matter for scoring and display:

| Field | Where it comes from |
|---|---|
| `overall_full_set` | `combined.overall.overall_full_set` |
| `gotchas_accuracy` | `non_abstention_by_category.gotchas.pct_correct` |
| `static_accuracy` | `combined_abstention_by_category.static.pct_correct` |
| `dynamic_accuracy` | `combined_abstention_by_category.dynamic.pct_correct` |
| `procedure_accuracy` | `combined_abstention_by_category.procedure.pct_correct` |
| `memory_query_avg_seconds` | `memory_query.avg_seconds` |

`overall_full_set` is the headline accuracy; `memory_query_avg_seconds` is the headline latency. Both feed directly into LAFS. The category-level fields give an interpretable diagnostic breakdown but are not used in the LAFS calculation.
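
The extraction can be pictured as a lookup of dotted paths in a nested dictionary. The sketch below is an illustration only: the paths are copied from the table, but the helper `get_path`, the exact nesting of the merged JSON, and all numbers are assumptions, not the behavior of `build_metric_overview`.

```python
# Dotted paths copied from the table above.
OVERVIEW_PATHS = {
    "overall_full_set": "combined.overall.overall_full_set",
    "gotchas_accuracy": "non_abstention_by_category.gotchas.pct_correct",
    "static_accuracy": "combined_abstention_by_category.static.pct_correct",
    "dynamic_accuracy": "combined_abstention_by_category.dynamic.pct_correct",
    "procedure_accuracy": "combined_abstention_by_category.procedure.pct_correct",
    "memory_query_avg_seconds": "memory_query.avg_seconds",
}

def get_path(data, dotted):
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    node = data
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Hypothetical merged metrics (values invented for the example).
merged = {
    "combined": {"overall": {"overall_full_set": 61.5}},
    "non_abstention_by_category": {"gotchas": {"pct_correct": 48.0}},
    "combined_abstention_by_category": {
        "static": {"pct_correct": 70.0},
        "dynamic": {"pct_correct": 55.0},
        # "procedure" deliberately left out to show the missing-field case
    },
    "memory_query": {"avg_seconds": 12.3},
}

overview = {name: get_path(merged, path) for name, path in OVERVIEW_PATHS.items()}
print(overview)
```

Returning `None` for a missing field keeps the overview's shape fixed, which makes gaps easy to spot; whether the real tooling does the same is not confirmed by this page.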

Sources: [leaderboard/submission_utils.py:378-404]()

---

## LAFS: Latency-Adjusted Frontier Score

### The Core Idea

A memory system that is very accurate but takes three minutes per query is not as useful as one that is fast and nearly as accurate. LAFS (Latency-Adjusted Frontier Score) rewards systems that push the **Pareto frontier** of the accuracy-versus-latency trade-off — not just those that maximize accuracy alone.

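Conceptually, imagine a graph with latency on the x-axis and accuracy on the y-axis. The Pareto frontier is the set of points that no other point dominates, meaning no other point is both faster *and* more accurate. For each latency budget, take the best accuracy reachable by any frontier point that fits within the budget. LAFS is the average of that best accuracy over latency budgets drawn log-uniformly between the minimum and maximum budget, which is the normalized area under the frontier's step curve on a log-latency axis.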
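
The dominance rule can be written down directly. This is a minimal sketch using hypothetical points; `dominates` and this `pareto_frontier` are illustrative and not necessarily how `compute_lafs.py` implements the frontier.

```python
from typing import NamedTuple

class Point(NamedTuple):
    name: str
    latency: float   # seconds per query
    accuracy: float  # percent correct

def dominates(a, b):
    """True if `a` is no slower and no less accurate than `b`, and strictly better on one axis."""
    return (
        a.latency <= b.latency
        and a.accuracy >= b.accuracy
        and (a.latency < b.latency or a.accuracy > b.accuracy)
    )

def pareto_frontier(points):
    """Keep the points that no other point dominates, ordered by latency."""
    kept = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(kept, key=lambda p: p.latency)

# Hypothetical operating points.
points = [
    Point("A", 2.0, 50.0),
    Point("B", 20.0, 60.0),
    Point("C", 40.0, 55.0),   # slower and less accurate than B
    Point("D", 120.0, 72.0),
]
frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)  # ['A', 'B', 'D']
```

The pairwise check is quadratic in the number of points, which is fine for a handful of operating points and mirrors the definition most directly.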

### Parameters

```python
# leaderboard/compute_lafs.py:19-22
T_MIN = 1.0      # minimum latency budget (seconds)
T_MAX = 200.0    # maximum latency budget (seconds)
FLOOR_ACC = 0.0  # accuracy floor when nothing fits the budget
```

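The 1–200 second budget range spans the latencies of the agentic reference methods (roughly 26 s to 186 s); the RAG baselines at 0.2–0.3 s fall below `T_MIN`. Budgets are weighted log-uniformly so that equal latency ratios count equally: cutting latency from 10 s to 5 s covers the same distance along the integration axis as cutting it from 100 s to 50 s.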
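
One way to see what log-uniform weighting does is to compute how much of the normalized integral each budget interval receives. The helper `budget_weight` below is a hypothetical name introduced for this sketch; only `T_MIN` and `T_MAX` come from the source.

```python
import math

T_MIN, T_MAX = 1.0, 200.0
denom = math.log(T_MAX / T_MIN)

def budget_weight(left, right):
    """Share of the normalized log-latency axis covered by budgets in [left, right]."""
    return math.log(right / left) / denom

w_1_10 = budget_weight(1.0, 10.0)
w_10_100 = budget_weight(10.0, 100.0)
w_100_200 = budget_weight(100.0, 200.0)

print(round(w_1_10, 3), round(w_10_100, 3), round(w_100_200, 3))
```

The two full decades (1 to 10 s and 10 to 100 s) receive equal weight, about 43% each, while 100 to 200 s receives only about 13% despite being the widest interval in absolute seconds.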

### The Formula

```
LAFS = (1 / log(T_MAX / T_MIN)) * ∫ best_acc_under_budget(T) d(log T)
```

Implemented exactly as a step-function integral:

```python
# leaderboard/compute_lafs.py:96-127
def lafs(points, t_min=T_MIN, t_max=T_MAX, floor_acc=FLOOR_ACC):
    frontier = pareto_frontier(points)
    breakpoints = {t_min, t_max}
    for point in frontier:
        if t_min < point.latency < t_max:
            breakpoints.add(point.latency)
    breakpoints = sorted(breakpoints)
    denom = math.log(t_max / t_min)
    area = 0.0
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        acc = best_acc_under_budget(frontier, left, floor_acc=floor_acc)
        area += acc * math.log(right / left)
    return area / denom
```
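
Because the best accuracy under a budget only changes at frontier latencies, the integral collapses to a finite sum, which is what the loop above computes. In the notation below (introduced here for illustration), $t_0 = T_{\min} < t_1 < \dots < t_n = T_{\max}$ are the sorted breakpoints and $a(t)$ is the best accuracy achievable within budget $t$, falling back to the floor accuracy when no point fits:

```latex
\mathrm{LAFS}
  = \frac{1}{\ln\!\left(T_{\max}/T_{\min}\right)}
    \int_{\ln T_{\min}}^{\ln T_{\max}} a(t)\, d(\ln t)
  = \frac{1}{\ln\!\left(T_{\max}/T_{\min}\right)}
    \sum_{i=0}^{n-1} a(t_i)\, \ln\frac{t_{i+1}}{t_i}
```

Each term pairs the accuracy at the left edge of a segment with that segment's width in log-latency, matching `acc * math.log(right / left)` in the code.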

### LAFS Gain

The leaderboard ranks submissions by **LAFS gain** — how much a submission improves the frontier beyond the fixed reference baseline:

```
LAFS gain = LAFS(reference_frontier ∪ submission_points) − LAFS(reference_frontier)
```

A submission that is dominated everywhere by the existing frontier receives a gain of exactly 0. A submission that opens a new accuracy-latency operating region receives positive gain proportional to the area it adds.

### The Reference Frontier

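The reference points are hard-coded and frozen, because downstream scores depend on these exact values. In both tiers Codex is slower and less accurate than AgentRunbook-C, so it is listed as a reference method but does not lie on the Pareto frontier: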

| Tier | Method | Accuracy | Latency |
|---|---|---|---|
| small | RAG: query → slice + notes | 51.0% | 0.2 s |
| small | Codex | 69.9% | 177.2 s |
| small | AgentRunbook-R | 58.6% | 26.9 s |
| small | AgentRunbook-C | 74.9% | 108.3 s |
| medium | RAG: query → slice + notes | 45.9% | 0.3 s |
| medium | Codex | 68.7% | 185.8 s |
| medium | AgentRunbook-R | 57.0% | 25.8 s |
| medium | AgentRunbook-C | 70.1% | 139.9 s |

Sources: [leaderboard/compute_lafs.py:35-48]()

### Worked Example

Using the small-tier reference values, a system with a single operating point at 62% accuracy and 15 s latency (`Fast RAG++` in the source) sits between the RAG baseline (0.2 s) and AgentRunbook-R (26.9 s) on the latency axis. It beats RAG's 51.0% for every budget from 15 s upward and is both faster and more accurate than AgentRunbook-R (58.6%), so it extends the frontier and earns positive LAFS gain. A system at 70% accuracy but 150 s latency is dominated by AgentRunbook-C (74.9% at 108.3 s) and receives a gain of 0.
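
The two cases can be reproduced with a compact, self-contained re-implementation. This is a sketch under stated assumptions, not the repo's code: the bodies of `pareto_frontier` and `best_acc_under_budget` are guesses at the obvious behavior, `lafs_gain` is a hypothetical helper, and accuracy is expressed in percentage points.

```python
import math
from typing import NamedTuple

class Point(NamedTuple):
    latency: float   # seconds per query
    accuracy: float  # percentage points (assumed unit)

def pareto_frontier(points):
    """Keep points that are more accurate than every faster point."""
    frontier, best = [], -math.inf
    for p in sorted(points, key=lambda p: (p.latency, -p.accuracy)):
        if p.accuracy > best:
            frontier.append(p)
            best = p.accuracy
    return frontier

def best_acc_under_budget(frontier, budget, floor_acc=0.0):
    """Best accuracy among frontier points whose latency fits the budget."""
    return max((p.accuracy for p in frontier if p.latency <= budget), default=floor_acc)

def lafs(points, t_min=1.0, t_max=200.0, floor_acc=0.0):
    """Normalized area under the frontier's step curve on a log-latency axis."""
    frontier = pareto_frontier(points)
    inner = {p.latency for p in frontier if t_min < p.latency < t_max}
    cuts = sorted({t_min, t_max} | inner)
    area = sum(
        best_acc_under_budget(frontier, left, floor_acc) * math.log(right / left)
        for left, right in zip(cuts, cuts[1:])
    )
    return area / math.log(t_max / t_min)

# Small-tier reference points, copied from the reference table.
REFERENCE_SMALL = [
    Point(0.2, 51.0),     # RAG
    Point(177.2, 69.9),   # Codex
    Point(26.9, 58.6),    # AgentRunbook-R
    Point(108.3, 74.9),   # AgentRunbook-C
]

def lafs_gain(submission_points, reference=REFERENCE_SMALL):
    """LAFS of reference plus submission, minus LAFS of the reference alone."""
    return lafs(reference + list(submission_points)) - lafs(reference)

reference_lafs = lafs(REFERENCE_SMALL)
gain_fast = lafs_gain([Point(15.0, 62.0)])    # extends the frontier
gain_slow = lafs_gain([Point(150.0, 70.0)])   # dominated by AgentRunbook-C
print(f"{reference_lafs:.1f} {gain_fast:.1f} {gain_slow:.1f}")
```

Under these assumptions the small-tier reference scores about 55.8, the 15 s point adds a gain of about 2.1, and the 150 s point adds nothing. The official script remains the authority for real submissions.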

Sources: [leaderboard/compute_lafs.py:221-256]()

---

## The Two-Step Submission Package

### Why Two Steps?

A submission can contain multiple **operating points** — different speed/accuracy trade-offs of the same method (e.g., `fast`, `balanced`, `accurate`). Step 1 handles one operating point at a time; Step 2 assembles them all into the final package. Running them separately lets you add or rebuild a single operating point without repeating validation for others.

### Step 1: Validate and Stage One Operating Point

```bash
python leaderboard/build_submission_step_1_single_operating_point.py \
  runs/my_method_fast_web_small \
  runs/my_method_fast_enterprise_small \
  submission_1 \
  fast \
  small
```

Step 1 calls `validate_run` on each run folder, which checks:

- Required files exist: `aggregated_metrics.json`, `per_question.jsonl`, `run_args.json`, `runtime_inputs/questions.json`, `runtime_inputs/haystack.json`
- `run_args.json` domain matches `web` or `enterprise`
- `run_args.json` model contains `qwen3.5-9b` and evaluator model contains `gpt-5.2`
- `per_question.jsonl` covers every question in `runtime_inputs/questions.json` (no missing or extra IDs)
- Question-type counts in the output match the runtime inputs
- `aggregated_metrics.json` `count_all_questions` matches the actual question and output counts
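
Two of the checks listed above, file presence and question-ID coverage, can be sketched as follows. The file names come from the list; the function names `missing_required_files` and `id_mismatch` are hypothetical and do not claim to match `validate_run`.

```python
import tempfile
from pathlib import Path

# Required files, as listed above.
REQUIRED_FILES = [
    "aggregated_metrics.json",
    "per_question.jsonl",
    "run_args.json",
    "runtime_inputs/questions.json",
    "runtime_inputs/haystack.json",
]

def missing_required_files(run_dir):
    """Return the required relative paths that are not regular files under run_dir."""
    root = Path(run_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]

def id_mismatch(expected_ids, produced_ids):
    """Return (missing, extra) question IDs, each sorted."""
    expected, produced = set(expected_ids), set(produced_ids)
    return sorted(expected - produced), sorted(produced - expected)

# Demo: a run folder that only contains run_args.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "run_args.json").write_text("{}")
    missing = missing_required_files(tmp)

missing_ids, extra_ids = id_mismatch(["q1", "q2", "q3"], ["q1", "q3", "q9"])
print(missing)
print(missing_ids, extra_ids)
```

Collecting every problem before failing, rather than stopping at the first one, tends to make a validation report more useful when several files are absent at once.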

After both runs pass, `validate_run_pair` confirms they share the same method name and tier.

Step 1 then copies the run artifacts, combines the two domains using `combine_domain_metrics`, builds the `metric_overview.json` from the combined result, and writes `operating_point_metadata.json`.

Output layout:

```text
leaderboard/submissions/submission_1/operating_points/fast/
  metric_overview.json
  operating_point_metadata.json
  web/
    aggregated_metrics.json  per_question.jsonl  run_args.json  runtime_inputs/
  enterprise/
    aggregated_metrics.json  per_question.jsonl  run_args.json  runtime_inputs/
```

Sources: [leaderboard/build_submission_step_1_single_operating_point.py:70-132](), [leaderboard/submission_utils.py:226-334]()

### Step 2: Assemble the Final Package

```bash
python leaderboard/build_submission_step_2_build_package.py \
  submission_1 \
  SYSTEM_DESCRIPTION.md \
  path/to/code_file.py \
  leaderboard/submissions/submission_1/operating_points/fast \
  leaderboard/submissions/submission_1/operating_points/balanced
```

Step 2 validates that all operating points share the same method, tier, web question IDs, enterprise question IDs, and haystack contents. It then:

1. Copies each operating point folder into the package directory
2. Copies `SYSTEM_DESCRIPTION.md` and the code file to the package root
3. Computes `lafs_summary_for_submission` over all operating points and writes `submission_overview.json`
4. Creates the final `.tar.gz` archive (symlinks are rejected)
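
The final archiving step, including the symlink rejection, might look roughly like this. It is a sketch built on the standard `tarfile` module; `build_archive` is a hypothetical name and the real script may structure the check differently.

```python
import tarfile
import tempfile
from pathlib import Path

def build_archive(package_dir, archive_path):
    """Pack package_dir into a .tar.gz, refusing any symlink inside it."""
    package_dir = Path(package_dir)
    # Check before writing anything so a bad package leaves no partial archive.
    for path in package_dir.rglob("*"):
        if path.is_symlink():
            raise ValueError(f"symlink not allowed in package: {path}")
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(package_dir), arcname=package_dir.name)
    return Path(archive_path)

# Demo with a throwaway package directory.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "submission_1"
    pkg.mkdir()
    (pkg / "submission_overview.json").write_text("{}")
    archive = build_archive(pkg, Path(tmp) / "submission_1.tar.gz")
    with tarfile.open(str(archive)) as tar:
        names = tar.getnames()

print(names)
```

Writing the archive outside the package directory avoids the archive trying to include itself.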

`submission_overview.json` at the package root records the method, tier, per-operating-point accuracy and latency values, and the full LAFS summary including `reference_lafs`, `submission_lafs`, and `lafs_gain`.

Sources: [leaderboard/build_submission_step_2_build_package.py:189-287](), [leaderboard/submission_utils.py:437-512]()

---

## End-to-End Flow

```text
web run folder ────────┐
                       ├─► Step 1 (validate + merge) ─► operating_points/fast/
enterprise run folder ─┘                                  metric_overview.json

                     (repeat for each operating point)

operating_points/fast/ ─────┐
operating_points/balanced/ ─┼─► Step 2 (assemble + LAFS) ─► submission_1.tar.gz
operating_points/accurate/ ─┘                                submission_overview.json
```

Each step in the diagram maps to exactly one script; there is no hidden middleware.

---

## Model Constraints Enforced at Validation Time

The submission tooling enforces two fixed model constraints on every run:

| Slot | Required substring |
|---|---|
| Reader model (`model` in `run_args.json`) | `qwen3.5-9b` |
| Evaluator model (`evaluator_model` in `run_args.json`) | `gpt-5.2` |

These checks exist to keep the evaluation protocol reproducible across all leaderboard entries. A run using a different reader or judge model will be rejected at Step 1 with a clear error.
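
The substring rule in the table can be illustrated in a few lines. The required substrings and the two `run_args.json` keys come from the table; the function name, the example model strings, and the choice of a case-sensitive match are assumptions made for this sketch.

```python
REQUIRED_READER_SUBSTRING = "qwen3.5-9b"
REQUIRED_EVALUATOR_SUBSTRING = "gpt-5.2"

def model_constraint_errors(run_args):
    """Return a list of human-readable problems; an empty list means the run passes."""
    errors = []
    model = str(run_args.get("model", ""))
    evaluator = str(run_args.get("evaluator_model", ""))
    if REQUIRED_READER_SUBSTRING not in model:
        errors.append(f"model {model!r} does not contain {REQUIRED_READER_SUBSTRING!r}")
    if REQUIRED_EVALUATOR_SUBSTRING not in evaluator:
        errors.append(
            f"evaluator_model {evaluator!r} does not contain {REQUIRED_EVALUATOR_SUBSTRING!r}"
        )
    return errors

# Hypothetical run_args contents.
ok_run = {"model": "local/qwen3.5-9b", "evaluator_model": "gpt-5.2"}
bad_run = {"model": "some-other-reader", "evaluator_model": "gpt-5.2"}

ok_errors = model_constraint_errors(ok_run)
bad_errors = model_constraint_errors(bad_run)
print(ok_errors)
print(bad_errors)
```

A substring match rather than an exact match leaves room for provider prefixes or version suffixes in the model string.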

Sources: [leaderboard/submission_utils.py:19-20](), [leaderboard/submission_utils.py:194-207]()

---

## Key Takeaways

**What the leaderboard measures.** Accuracy alone is not enough. The benchmark explicitly rewards memory systems that are fast *and* accurate by measuring how much a submission expands the Pareto frontier across a 1–200 second latency range.

**Two domains, one score.** Every submission is evaluated on both `web` and `enterprise` haystacks. The merge is example-count-weighted, so a domain with more questions has proportionally more influence on the combined score.

**LAFS gain can be zero.** If every submitted operating point falls inside (i.e., is dominated by) the reference frontier, the gain is exactly 0. To earn a positive gain, at least one operating point must improve accuracy at some latency budget where the reference frontier does not already reach that accuracy level.

**Multiple operating points help in different ways.** A fast operating point with moderate accuracy and a slow operating point with high accuracy together can improve LAFS more than either alone, because they each fill a different region of the frontier.

**The reference frontier is frozen.** The four reference methods per tier are hard-coded constants. Changing them would invalidate all previously computed scores, so they will not change after release.

The `submission_overview.json` written by Step 2 is the single artifact that captures everything: method name, tier, per-operating-point numbers, and the LAFS summary. Inspect it with `python -m json.tool` before submitting to confirm the numbers look correct.
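
Beyond eyeballing the file, a small script can pull out the LAFS fields and run a basic sanity check. The three key names come from this page; their position inside the JSON is not documented here, so the sketch searches for them recursively. The `lafs_summary` wrapper key and every number in the sample are hypothetical.

```python
import json

def find_key(obj, key):
    """Depth-first search for the first occurrence of `key` in nested JSON data."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None

# Stand-in for the real file; in practice use json.load on submission_overview.json.
sample_text = """
{
  "method": "my_method",
  "tier": "small",
  "lafs_summary": {"reference_lafs": 55.0, "submission_lafs": 57.0, "lafs_gain": 2.0}
}
"""
overview = json.loads(sample_text)

fields = {k: find_key(overview, k) for k in ("reference_lafs", "submission_lafs", "lafs_gain")}
# Gain is LAFS(reference plus submission) minus LAFS(reference), so it cannot be negative.
gain_is_valid = fields["lafs_gain"] is not None and fields["lafs_gain"] >= 0
print(fields, gain_is_valid)
```

A negative or missing `lafs_gain` would point to a packaging problem worth investigating before uploading.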

Sources: [leaderboard/compute_lafs.py:35-51](), [leaderboard/build_submission_step_2_build_package.py:189-227]()
