# Scoring, LAFS, & What to Remember

> How web and enterprise run results are merged, how LAFS turns accuracy and latency into a single leaderboard score, the two-step submission packaging process, and a plain-English recap of the core ideas to carry away from this repo.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `leaderboard/README.md`
- `leaderboard/compute_lafs.py`
- `leaderboard/combine_aggregated_metrics.py`
- `leaderboard/build_submission_step_1_single_operating_point.py`
- `leaderboard/build_submission_step_2_build_package.py`
- `leaderboard/submission_utils.py`

---


# Scoring, LAFS, & What to Remember

This page explains how LongMemEval-V2 turns raw evaluation runs into a single comparable number on the leaderboard. It covers the two-domain merge that produces a unified metric view, how LAFS converts an accuracy-latency trade-off curve into one score, the two-command packaging process that turns folders of results into a submittable archive, and the key ideas to carry away when reading or extending this repo.

Everything described here lives inside the `leaderboard/` directory. No external service is required: the scripts run locally, all reference values are hard-coded in the source, and the resulting `.tar.gz` is submitted through a web form.

---

## How Web and Enterprise Results Are Merged

Every leaderboard entry is evaluated on two independently run domains: **web** and **enterprise**. These domains represent different haystack compositions and question styles. Because the two domains contain different numbers of questions, raw averages would be misleading. The solution is **example-count-weighted averaging**.

The merging happens in `combine_aggregated_metrics.py`.
The entry point reads two `aggregated_metrics.json` files, one from the web run and one from the enterprise run, and combines every numeric field using the total question count from each domain as the weight:

```python
# leaderboard/combine_aggregated_metrics.py:100-111
# Excerpt: `Any` and `as_number` are defined or imported elsewhere in the module.
def weighted_average(values: list[tuple[Any, int]]) -> float | None:
    weighted_total = 0.0
    total_count = 0
    for value, count in values:
        numeric = as_number(value)
        # Skip missing values and domains that contributed no examples.
        if numeric is None or count == 0:
            continue
        weighted_total += numeric * count
        total_count += count
    if total_count == 0:
        return None
    return weighted_total / total_count
```

This same weighted-average function is applied to:

- `overall_full_set`, `overall_non_abstention_only`, `overall_abstention_only` (accuracy scores)
- per-category accuracy breakdowns (`gotchas`, `static`, `dynamic`, `procedure`)
- timing fields: `memory_query`, `memory_post_query`
- token usage: prompt, completion, and total token counts

For timing fields, the combination is expressed slightly differently: per-domain totals are summed and then divided by the combined question count, which is arithmetically the same as count-weighting the per-domain averages. `max_seconds` is the exception: it takes the maximum across both domains rather than an average.
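To make the effect of count-weighting concrete, here is a minimal sketch with made-up numbers. The dictionaries `web` and `enterprise` and their keys are hypothetical stand-ins, not the real `aggregated_metrics.json` schema, and the helper is a simplified re-implementation rather than the repository's function.

```python
from typing import Optional


def weighted_average(values: list[tuple[Optional[float], int]]) -> Optional[float]:
    """Example-count-weighted mean; skips missing values and empty domains."""
    weighted_total = 0.0
    total_count = 0
    for value, count in values:
        if value is None or count == 0:
            continue
        weighted_total += value * count
        total_count += count
    return weighted_total / total_count if total_count else None


# Hypothetical per-domain summaries (all numbers invented for illustration).
web = {"count": 300, "accuracy": 0.60, "max_seconds": 42.0}
enterprise = {"count": 100, "accuracy": 0.80, "max_seconds": 57.5}

combined_accuracy = weighted_average(
    [(web["accuracy"], web["count"]), (enterprise["accuracy"], enterprise["count"])]
)
# max_seconds is merged with max(), not an average.
combined_max = max(web["max_seconds"], enterprise["max_seconds"])
# A naive mean treats both domains equally regardless of size.
naive_mean = (web["accuracy"] + enterprise["accuracy"]) / 2

print(round(combined_accuracy, 4), round(naive_mean, 4), combined_max)
```

With three times as many web questions, the combined accuracy sits closer to the web score than the naive mean does, which is the behavior the weighting is meant to produce.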
Sources: [leaderboard/combine_aggregated_metrics.py:100-111](), [leaderboard/combine_aggregated_metrics.py:288-321](), [leaderboard/combine_aggregated_metrics.py:324-375]()

---

## The Extracted Metric Overview

After merging, `submission_utils.build_metric_overview` extracts the six fields that matter for scoring and display:

| Field | Where it comes from |
|---|---|
| `overall_full_set` | `combined.overall.overall_full_set` |
| `gotchas_accuracy` | `non_abstention_by_category.gotchas.pct_correct` |
| `static_accuracy` | `combined_abstention_by_category.static.pct_correct` |
| `dynamic_accuracy` | `combined_abstention_by_category.dynamic.pct_correct` |
| `procedure_accuracy` | `combined_abstention_by_category.procedure.pct_correct` |
| `memory_query_avg_seconds` | `memory_query.avg_seconds` |

`overall_full_set` is the headline accuracy; `memory_query_avg_seconds` is the headline latency. Both feed directly into LAFS. The category-level fields give an interpretable diagnostic breakdown but are not used in the LAFS calculation.

Sources: [leaderboard/submission_utils.py:378-404]()

---

## LAFS: Accuracy-Latency Frontier Score

### The Core Idea

A memory system that is very accurate but takes three minutes per query is not as useful as one that is fast and nearly as accurate. LAFS (Latency-Adjusted Frontier Score) rewards systems that push the **Pareto frontier** of the accuracy-versus-latency trade-off, not just those that maximize accuracy alone.

Conceptually, imagine a graph with latency on the x-axis and accuracy on the y-axis. The Pareto frontier is the set of points where no other point is both faster *and* more accurate. LAFS is the normalized area under that frontier on a log-latency axis, which is the same as the average best-achievable accuracy when latency budgets are drawn log-uniformly.
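The frontier idea can be sketched in a few lines. This is an illustrative implementation, not the repository's `pareto_frontier`; the `Point` type is a hypothetical container, and the sample values are the small-tier reference numbers listed on this page, with accuracy written as a fraction.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    latency: float   # seconds per query
    accuracy: float  # fraction correct, 0.0 to 1.0


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no other point beats on both latency and accuracy."""
    frontier: list[Point] = []
    best_accuracy = float("-inf")
    # Sort fastest first; among equal latencies, most accurate first.
    for point in sorted(points, key=lambda p: (p.latency, -p.accuracy)):
        # A point survives only if it is more accurate than everything faster.
        if point.accuracy > best_accuracy:
            frontier.append(point)
            best_accuracy = point.accuracy
    return frontier


small_tier = [
    Point("RAG", 0.2, 0.510),
    Point("Codex", 177.2, 0.699),
    Point("AgentRunbook-R", 26.9, 0.586),
    Point("AgentRunbook-C", 108.3, 0.749),
]
frontier_names = [p.name for p in pareto_frontier(small_tier)]
print(frontier_names)
```

In this sketch Codex drops out of the frontier because AgentRunbook-C is both faster and more accurate.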
### Parameters

```python
# leaderboard/compute_lafs.py:19-22
T_MIN = 1.0      # minimum latency budget (seconds)
T_MAX = 200.0    # maximum latency budget (seconds)
FLOOR_ACC = 0.0  # accuracy floor when nothing fits the budget
```

The latency range of 1 to 200 seconds corresponds to the practical operating range seen in the reference methods. Integration is log-uniform because latency improvements at 5 s feel as meaningful as improvements at 50 s.

### The Formula

```
LAFS = (1 / log(T_MAX / T_MIN)) * ∫[T_MIN, T_MAX] best_acc_under_budget(T) d(log T)
```

Because the best achievable accuracy only changes at frontier latencies, the integral is implemented exactly as a step-function sum:

```python
# leaderboard/compute_lafs.py:96-127
def lafs(points, t_min=T_MIN, t_max=T_MAX, floor_acc=FLOOR_ACC):
    frontier = pareto_frontier(points)
    # Breakpoints: the budget limits plus every frontier latency inside them.
    breakpoints = {t_min, t_max}
    for point in frontier:
        if t_min < point.latency < t_max:
            breakpoints.add(point.latency)
    breakpoints = sorted(breakpoints)
    denom = math.log(t_max / t_min)
    area = 0.0
    # Accuracy is constant on each segment, so area = height * log-width.
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        acc = best_acc_under_budget(frontier, left, floor_acc=floor_acc)
        area += acc * math.log(right / left)
    return area / denom
```

### LAFS Gain

The leaderboard ranks submissions by **LAFS gain**: how much a submission improves the frontier beyond the fixed reference baseline.

```
LAFS gain = LAFS(reference_frontier ∪ submission_points) − LAFS(reference_frontier)
```

A submission that is dominated everywhere by the existing frontier receives a gain of exactly 0. A submission that opens a new accuracy-latency operating region receives positive gain proportional to the area it adds.

### The Reference Frontier

The reference frontier is hard-coded and frozen.
Downstream scores depend on these exact values:

| Tier | Method | Accuracy | Latency |
|---|---|---|---|
| small | RAG: query → slice + notes | 51.0% | 0.2 s |
| small | Codex | 69.9% | 177.2 s |
| small | AgentRunbook-R | 58.6% | 26.9 s |
| small | AgentRunbook-C | 74.9% | 108.3 s |
| medium | RAG: query → slice + notes | 45.9% | 0.3 s |
| medium | Codex | 68.7% | 185.8 s |
| medium | AgentRunbook-R | 57.0% | 25.8 s |
| medium | AgentRunbook-C | 70.1% | 139.9 s |

Sources: [leaderboard/compute_lafs.py:35-48]()

### Worked Example

A system with a single operating point at 62% accuracy and 15 s latency (`Fast RAG++` in the source) falls between the RAG baseline and AgentRunbook-R on the latency axis but offers better accuracy than RAG in that region, so it extends the frontier and earns positive LAFS gain. A system at 70% accuracy but 150 s latency is dominated by AgentRunbook-C (74.9% at 108.3 s) and receives a gain of 0.

Sources: [leaderboard/compute_lafs.py:221-256]()

---

## The Two-Step Submission Package

### Why Two Steps?

A submission can contain multiple **operating points**, which are different speed/accuracy trade-offs of the same method (e.g., `fast`, `balanced`, `accurate`). Step 1 handles one operating point at a time; Step 2 assembles them all into the final package. Running them separately lets you add or rebuild a single operating point without repeating validation for others.
### Step 1: Validate and Stage One Operating Point

```bash
python leaderboard/build_submission_step_1_single_operating_point.py \
  runs/my_method_fast_web_small \
  runs/my_method_fast_enterprise_small \
  submission_1 \
  fast \
  small
```

Step 1 calls `validate_run` on each run folder, which checks:

- Required files exist: `aggregated_metrics.json`, `per_question.jsonl`, `run_args.json`, `runtime_inputs/questions.json`, `runtime_inputs/haystack.json`
- `run_args.json` domain matches `web` or `enterprise`
- `run_args.json` model contains `qwen3.5-9b` and evaluator model contains `gpt-5.2`
- `per_question.jsonl` covers every question in `runtime_inputs/questions.json` (no missing or extra IDs)
- Question-type counts in the output match the runtime inputs
- `aggregated_metrics.json` `count_all_questions` matches the actual question and output counts

After both runs pass, `validate_run_pair` confirms they share the same method name and tier. Step 1 then copies the run artifacts, combines the two domains using `combine_domain_metrics`, builds `metric_overview.json` from the combined result, and writes `operating_point_metadata.json`.
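Two of the checks listed above, the model-substring rule and the question-ID coverage rule, can be sketched as follows. This is not the repository's `validate_run`: the function names are hypothetical, the `question_id` key is an assumed field name, case-sensitive substring matching is an assumption, and the sample data is invented.

```python
import json

REQUIRED_READER = "qwen3.5-9b"   # substring required in run_args["model"]
REQUIRED_EVALUATOR = "gpt-5.2"   # substring required in run_args["evaluator_model"]


def parse_jsonl(text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def check_models(run_args: dict) -> None:
    """Reject runs whose reader or evaluator model lacks the required substring."""
    if REQUIRED_READER not in str(run_args.get("model", "")):
        raise ValueError(f"model must contain {REQUIRED_READER!r}")
    if REQUIRED_EVALUATOR not in str(run_args.get("evaluator_model", "")):
        raise ValueError(f"evaluator_model must contain {REQUIRED_EVALUATOR!r}")


def check_question_coverage(questions, outputs, id_key="question_id") -> None:
    """Raise if outputs do not cover exactly the expected question IDs."""
    expected = {q[id_key] for q in questions}
    seen = {o[id_key] for o in outputs}
    missing = sorted(expected - seen)
    extra = sorted(seen - expected)
    if missing or extra:
        raise ValueError(f"missing IDs: {missing}; extra IDs: {extra}")


# Invented in-memory stand-ins for the files in a run folder.
questions = [{"question_id": "q1"}, {"question_id": "q2"}]
per_question_text = (
    '{"question_id": "q1", "correct": true}\n'
    '{"question_id": "q2", "correct": false}\n'
)
run_args = {"domain": "web", "model": "qwen3.5-9b", "evaluator_model": "gpt-5.2"}

check_models(run_args)
check_question_coverage(questions, parse_jsonl(per_question_text))
print("run passes the sketched checks")
```

Comparing ID sets in both directions catches a truncated `per_question.jsonl` as well as outputs left over from a different question file.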
Output layout:

```text
leaderboard/submissions/submission_1/operating_points/fast/
  metric_overview.json
  operating_point_metadata.json
  web/
    aggregated_metrics.json
    per_question.jsonl
    run_args.json
    runtime_inputs/
  enterprise/
    aggregated_metrics.json
    per_question.jsonl
    run_args.json
    runtime_inputs/
```

Sources: [leaderboard/build_submission_step_1_single_operating_point.py:70-132](), [leaderboard/submission_utils.py:226-334]()

### Step 2: Assemble the Final Package

```bash
python leaderboard/build_submission_step_2_build_package.py \
  submission_1 \
  SYSTEM_DESCRIPTION.md \
  path/to/code_file.py \
  leaderboard/submissions/submission_1/operating_points/fast \
  leaderboard/submissions/submission_1/operating_points/balanced
```

Step 2 validates that all operating points share the same method, tier, web question IDs, enterprise question IDs, and haystack contents. It then:

1. Copies each operating point folder into the package directory
2. Copies `SYSTEM_DESCRIPTION.md` and the code file to the package root
3. Computes `lafs_summary_for_submission` over all operating points and writes `submission_overview.json`
4. Creates the final `.tar.gz` archive (symlinks are rejected)

`submission_overview.json` at the package root records the method, tier, per-operating-point accuracy and latency values, and the full LAFS summary including `reference_lafs`, `submission_lafs`, and `lafs_gain`.

Sources: [leaderboard/build_submission_step_2_build_package.py:189-287](), [leaderboard/submission_utils.py:437-512]()

---

## End-to-End Flow

```text
web run folder ────────┐
                       ├─► Step 1 (validate + merge) ─► operating_points/fast/
enterprise run folder ─┘                                  metric_overview.json

(repeat for each operating point)

operating_points/fast/ ─────┐
operating_points/balanced/ ─┼─► Step 2 (assemble + LAFS) ─► submission_1.tar.gz
operating_points/accurate/ ─┘                                submission_overview.json
```

Each box in the diagram maps to exactly one script; there is no hidden middleware.
---

## Model Constraints Enforced at Validation Time

The submission tooling enforces two fixed model constraints on every run:

| Slot | Required substring |
|---|---|
| Reader model (`model` in `run_args.json`) | `qwen3.5-9b` |
| Evaluator model (`evaluator_model` in `run_args.json`) | `gpt-5.2` |

These checks exist to keep the evaluation protocol reproducible across all leaderboard entries. A run using a different reader or judge model will be rejected at Step 1 with a clear error.

Sources: [leaderboard/submission_utils.py:19-20](), [leaderboard/submission_utils.py:194-207]()

---

## Key Takeaways

**What the leaderboard measures.** Accuracy alone is not enough. The benchmark explicitly rewards memory systems that are fast *and* accurate by measuring how much a submission expands the Pareto frontier across a 1–200 second latency range.

**Two domains, one score.** Every submission is evaluated on both `web` and `enterprise` haystacks. The merge is example-count-weighted, so a domain with more questions has proportionally more influence on the combined score.

**LAFS gain can be zero.** If every submitted operating point falls inside (i.e., is dominated by) the reference frontier, the gain is exactly 0. To earn a positive gain, at least one operating point must improve accuracy at some latency budget where the reference frontier does not already reach that accuracy level.

**Multiple operating points help in different ways.** A fast operating point with moderate accuracy and a slow operating point with high accuracy together can improve LAFS more than either alone, because they each fill a different region of the frontier.

**The reference frontier is frozen.** The four reference methods per tier are hard-coded constants. Changing them would invalidate all previously computed scores, so they will not change after release.
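The takeaways about LAFS gain can be checked numerically with a compact sketch. It assumes points are plain `(latency_seconds, accuracy_fraction)` tuples and uses only the small-tier reference values from the table on this page; `lafs_gain` is a hypothetical helper name. It also skips the explicit Pareto step, since the maximum accuracy among all points that fit a budget equals the maximum among the frontier points that fit it.

```python
import math

T_MIN, T_MAX, FLOOR_ACC = 1.0, 200.0, 0.0


def best_acc_under_budget(points, budget, floor_acc=FLOOR_ACC):
    """Highest accuracy among points whose latency fits within the budget."""
    fitting = [acc for latency, acc in points if latency <= budget]
    return max(fitting, default=floor_acc)


def lafs(points, t_min=T_MIN, t_max=T_MAX, floor_acc=FLOOR_ACC):
    """Log-uniform average of the best achievable accuracy over [t_min, t_max]."""
    inside = {lat for lat, _ in points if t_min < lat < t_max}
    breakpoints = sorted({t_min, t_max} | inside)
    area = 0.0
    for left, right in zip(breakpoints[:-1], breakpoints[1:]):
        area += best_acc_under_budget(points, left, floor_acc) * math.log(right / left)
    return area / math.log(t_max / t_min)


# Small-tier reference points from this page: (latency in seconds, accuracy).
REFERENCE = [(0.2, 0.510), (177.2, 0.699), (26.9, 0.586), (108.3, 0.749)]


def lafs_gain(submission):
    """Improvement of the combined frontier over the reference alone."""
    return lafs(REFERENCE + submission) - lafs(REFERENCE)


dominated_gain = lafs_gain([(150.0, 0.70)])  # beaten by 74.9% at 108.3 s
extending_gain = lafs_gain([(15.0, 0.62)])   # fills the 15 s to 108.3 s region
two_point_gain = lafs_gain([(15.0, 0.62), (90.0, 0.80)])

print(f"dominated: {dominated_gain:.6f}")
print(f"extending: {extending_gain:.6f}")
print(f"two points: {two_point_gain:.6f}")
```

The dominated point should print a gain of zero up to floating-point rounding, while the 15 s point and the two-point submission should both be positive, with the two-point submission larger because its second point (an invented 80% at 90 s) raises the frontier in a region the first does not reach.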
The `submission_overview.json` written by Step 2 is the single artifact that captures everything: method name, tier, per-operating-point numbers, and the LAFS summary. Inspect it with `python -m json.tool` before submitting to confirm the numbers look correct.

Sources: [leaderboard/compute_lafs.py:35-51](), [leaderboard/build_submission_step_2_build_package.py:189-227]()