# Benchmarks, Leaderboard & Observability — Did It Work?

> How RLMBenchmarkCase definitions drive automated runs, how scores flow into the leaderboard, how trajectory replay lets you re-watch any session, and how observability sinks (OTel-shaped JSONL, trace analysis) record what happened.

- Repository: SuperagenticAI/rlm-code
- GitHub: https://github.com/SuperagenticAI/rlm-code
- Human wiki: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91
- Complete Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/llms-full.txt

## Source Files

- `rlm_code/rlm/benchmarks.py`
- `rlm_code/rlm/benchmark_manager.py`
- `rlm_code/rlm/leaderboard.py`
- `rlm_code/rlm/session_replay.py`
- `rlm_code/rlm/observability.py`
- `rlm_code/rlm/observability_sinks.py`
- `rlm_code/traces/store.py`

---

This page explains how RLM Code measures whether an agent run actually succeeded: how test cases are defined, how runs are executed and scored, how results flow into a ranked leaderboard, how every session can be replayed step-by-step after the fact, and how a layered observability system (local JSONL files up through OpenTelemetry, MLflow, LangSmith, LangFuse, and Logfire) records what happened in each run.

Think of it like a sports league: the benchmark cases are the game schedule, each `run_benchmark` call plays the games and records the score, the leaderboard ranks teams by multiple statistics, trajectory replay lets you rewatch any match in slow motion, and the observability sinks are the broadcast cameras capturing the action in real time.

---

## 1. Benchmark Cases: the Unit of Evaluation

The smallest piece of the benchmark system is `RLMBenchmarkCase`, a frozen dataclass that describes exactly one task to run.

```python
# rlm_code/rlm/benchmarks.py  lines 14-23
@dataclass(frozen=True, slots=True)
class RLMBenchmarkCase:
    case_id: str
    description: str
    task: str
    environment: str = "dspy"
    max_steps: int = 4
    exec_timeout: int = 30
```

Every field matters:

| Field | Purpose |
|---|---|
| `case_id` | Unique key used to track results per case across runs |
| `description` | Short human-readable label for the case |
| `task` | The plain-language prompt given to the agent |
| `environment` | Execution sandbox (`dspy`, `generic`, `pure_rlm`) |
| `max_steps` | Hard cap on agent loop iterations |
| `exec_timeout` | Per-step subprocess timeout in seconds |

### Built-in Preset Suites

Cases are organized into named *presets*. The table below lists all presets defined in the repository:

| Preset | Cases | Focus |
|---|---|---|
| `dspy_quick` | 3 | Fast DSPy coding smoke test |
| `dspy_extended` | 5 | Broader DSPy sweep |
| `generic_smoke` | 2 | Generic run_python sanity |
| `pure_rlm_smoke` | 3 | Pure RLM context-as-variable basics |
| `pure_rlm_context` | 4 | Chunking, accumulation, map-reduce |
| `deep_recursion` | 3 | Depth > 1 recursive delegation |
| `paradigm_comparison` | 3 | Pure RLM vs CodeAct side-by-side |
| `oolong_style` | 4 | OOLONG long-context tasks |
| `browsecomp_style` | 3 | BrowseComp web reasoning |
| `token_efficiency` | 3 | Token usage measurement |
| `dynamic_web_filtering` | 3 | Domain-scoped retrieval |

Sources: [rlm_code/rlm/benchmarks.py:26-478]()

### Loading External Packs

In addition to the built-in presets, `load_benchmark_packs` can parse external files in five shapes: an explicit `presets:` YAML block, a top-level `cases` list, a Google ADK `eval_cases` JSON, plain JSONL record rows, and generic `records`/`items` mappings. This makes it straightforward to import third-party evaluation datasets without modifying source code.

Sources: [rlm_code/rlm/benchmarks.py:527-579]()

---

## 2. Running a Benchmark Preset

`BenchmarkManagerMixin.run_benchmark` is the entry point that executes all cases in a preset and produces a persisted JSON summary.

```python
# rlm_code/rlm/benchmark_manager.py  lines 156-173
def run_benchmark(
    self,
    *,
    preset: str = "dspy_quick",
    mode: str = "native",
    # ... remaining keyword parameters elided
) -> RLMBenchmarkResult:
    """Execute a benchmark preset and persist aggregate summary."""
```

Three execution modes are supported:

| Mode | What it does |
|---|---|
| `native` | Runs each case through `RLMRunner.run_task` (full agent loop) |
| `harness` | Delegates to `HarnessRunner` with optional MCP tool access |
| `direct-llm` | Single LLM call, no tool loop — baseline comparison |

After all cases finish, the mixin computes aggregate statistics and writes a timestamped JSON file under `<workdir>/benchmarks/bench_YYYYMMDD_HHMMSS_<μs>.json`.

```python
# rlm_code/rlm/benchmark_manager.py  lines 282-317
avg_reward = (sum(total_rewards) / attempted_cases) if attempted_cases else 0.0
avg_steps = (sum(total_steps) / attempted_cases) if attempted_cases else 0.0
duration_stats = self._summarize_distribution(durations)
usage_totals = self._aggregate_usage_totals(case_results)
...
summary_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
```

The summary JSON contains: `benchmark_id`, `preset`, `mode`, `avg_reward`, `avg_steps`, `latency_seconds` (with avg/p50/p95/p99/max), `usage_totals`, and the full `case_results` list.

Sources: [rlm_code/rlm/benchmark_manager.py:156-333]()
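
The aggregation step can be sketched as follows. This is not the project's `_summarize_distribution`; it is a stand-in that uses nearest-rank percentiles, and the real helper may interpolate differently. The per-case dictionary keys are hypothetical.

```python
import math


def summarize_distribution(values: list[float]) -> dict[str, float]:
    """Nearest-rank summary; an assumption, not the repository's implementation."""
    if not values:
        return {"avg": 0.0, "p50": 0.0, "p95": 0.0, "p99": 0.0, "max": 0.0}
    ordered = sorted(values)

    def pct(p: float) -> float:
        # Nearest rank: smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "avg": sum(ordered) / len(ordered),
        "p50": pct(50),
        "p95": pct(95),
        "p99": pct(99),
        "max": ordered[-1],
    }


# Hypothetical per-case results; the key names are illustrative only.
case_results = [
    {"reward": 1.0, "steps": 2, "duration": 1.5},
    {"reward": 0.5, "steps": 4, "duration": 3.0},
    {"reward": 0.0, "steps": 4, "duration": 9.0},
]
attempted = len(case_results)
# Same zero-guard pattern as the excerpt above: no attempted cases yields 0.0.
avg_reward = sum(c["reward"] for c in case_results) / attempted if attempted else 0.0
avg_steps = sum(c["steps"] for c in case_results) / attempted if attempted else 0.0
latency = summarize_distribution([c["duration"] for c in case_results])
```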

### CI-Style Comparison Gates

`compare_benchmarks` loads two summary files (by ID, keyword `latest`/`previous`, or file path) and evaluates four quality gates:

```python
# rlm_code/rlm/benchmark_manager.py  lines 410-431
gates = {
    "reward": reward_delta >= float(min_reward_delta),
    "completion": completion_delta >= float(min_completion_delta),
    "steps": steps_increase <= float(max_steps_increase),
    "completion_regressions": (
        case_summary["completion_regressions"] == 0
        if fail_on_completion_regression
        else True
    ),
}
...
passed=all(bool(value) for value in gates.values()),
```

When all four gates pass, `RLMBenchmarkComparison.passed` is `True`. Reports can be exported in Markdown, CSV, or JSON format via `export_benchmark_report`.

Sources: [rlm_code/rlm/benchmark_manager.py:367-495]()
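
To make the gate logic concrete, here is a small standalone sketch. It assumes each delta is computed as candidate minus baseline and that the summaries expose a `completion_rate` key; both are assumptions of this example, and `evaluate_gates` is a hypothetical helper rather than a function in the repository.

```python
def evaluate_gates(
    baseline: dict,
    candidate: dict,
    *,
    min_reward_delta: float = 0.0,
    min_completion_delta: float = 0.0,
    max_steps_increase: float = 0.0,
    completion_regressions: int = 0,
    fail_on_completion_regression: bool = True,
) -> dict[str, bool]:
    # Assumption: every delta is candidate minus baseline.
    reward_delta = candidate["avg_reward"] - baseline["avg_reward"]
    completion_delta = candidate["completion_rate"] - baseline["completion_rate"]
    steps_increase = candidate["avg_steps"] - baseline["avg_steps"]
    return {
        "reward": reward_delta >= min_reward_delta,
        "completion": completion_delta >= min_completion_delta,
        "steps": steps_increase <= max_steps_increase,
        "completion_regressions": (
            completion_regressions == 0 if fail_on_completion_regression else True
        ),
    }


# Hypothetical summaries: reward and completion improve, but steps grow by 0.5.
baseline = {"avg_reward": 0.5, "completion_rate": 0.5, "avg_steps": 3.0}
candidate = {"avg_reward": 0.75, "completion_rate": 0.75, "avg_steps": 3.5}

strict = evaluate_gates(baseline, candidate)
lenient = evaluate_gates(baseline, candidate, max_steps_increase=1.0)
strict_passed = all(strict.values())
lenient_passed = all(lenient.values())
```

With the default thresholds the `steps` gate fails even though reward improved, which is the behavior a CI job would want when step budgets matter.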

### LLM Judge for Predictions

`judge_predictions` provides an LLM-as-judge layer on top of raw model outputs. It matches predictions to a reference JSONL by `question_id`, calls the configured LLM with a task-type-aware prompt (temporal reasoning, knowledge update, single-session preference, or generic), and appends `autoeval_label: {model, label, raw}` to a results file. It supports resumption: already-judged IDs are skipped.

Sources: [rlm_code/rlm/benchmark_manager.py:497-659]()
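
The resumption behavior can be illustrated with a simplified sketch. It replaces the LLM call with a stub, folds the reference answer into each prediction row instead of joining against a separate reference JSONL, and stores the label as a boolean; all three are assumptions made to keep the example runnable.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_judged_ids(results_path: Path) -> set[str]:
    """Collect question_ids that already carry an autoeval_label."""
    if not results_path.exists():
        return set()
    judged = set()
    for line in results_path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            row = json.loads(line)
            if "autoeval_label" in row:
                judged.add(row["question_id"])
    return judged


def judge_all(predictions: list[dict], results_path: Path,
              judge: Callable[[dict], str], model: str = "stub-model") -> int:
    """Judge only the predictions that are not yet in the results file."""
    done = load_judged_ids(results_path)
    newly_judged = 0
    with results_path.open("a", encoding="utf-8") as fh:
        for pred in predictions:
            if pred["question_id"] in done:
                continue  # resumption: skip work finished in an earlier run
            raw = judge(pred)
            label = raw.strip().lower().startswith("yes")
            row = {**pred, "autoeval_label": {"model": model, "label": label, "raw": raw}}
            fh.write(json.dumps(row) + "\n")
            newly_judged += 1
    return newly_judged


def stub_judge(pred: dict) -> str:
    # Stand-in for the real LLM call: exact-match comparison only.
    return "yes" if pred["prediction"] == pred["answer"] else "no"


predictions = [
    {"question_id": "q1", "prediction": "4", "answer": "4"},
    {"question_id": "q2", "prediction": "5", "answer": "6"},
]
results_file = Path(tempfile.mkdtemp()) / "judged.jsonl"
first_pass = judge_all(predictions, results_file, stub_judge)
second_pass = judge_all(predictions, results_file, stub_judge)
```

The second call judges nothing because both IDs are already present in the results file.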

---

## 3. Leaderboard: Ranking Results Across Runs

The `Leaderboard` class loads results from two sources and ranks them by configurable metrics.

```text
Sources loaded by Leaderboard.load_all():

  <workdir>/rlm/benchmarks/*.json   ←  one file per benchmark preset run
  <workdir>/observability/runs.jsonl ←  one line per individual task run
```

### Ranking Metrics

```python
# rlm_code/rlm/leaderboard.py  lines 29-57
class RankingMetric(Enum):
    REWARD = "reward"           # higher is better
    COMPLETION_RATE = "completion_rate"  # higher is better
    STEPS = "steps"             # lower is better
    TOKENS = "tokens"           # lower is better
    COST = "cost"               # lower is better
    DURATION = "duration"       # lower is better
    EFFICIENCY = "efficiency"   # reward per 1000 tokens, higher is better
```

`LeaderboardEntry` computes `efficiency = (avg_reward * 1000) / total_tokens` in `__post_init__`, so it is always available without a separate query.

Sources: [rlm_code/rlm/leaderboard.py:60-107]()
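
A minimal sketch of the derived `efficiency` field follows. `EntrySketch` is a hypothetical stand-in for `LeaderboardEntry` with only three fields, and the zero-token guard is an assumption of this example rather than confirmed behavior.

```python
from dataclasses import dataclass, field


@dataclass
class EntrySketch:
    name: str
    avg_reward: float
    total_tokens: int
    efficiency: float = field(init=False)

    def __post_init__(self) -> None:
        # Reward per 1000 tokens; falling back to 0.0 for zero tokens is an assumption.
        if self.total_tokens > 0:
            self.efficiency = (self.avg_reward * 1000) / self.total_tokens
        else:
            self.efficiency = 0.0


frugal = EntrySketch("frugal", avg_reward=0.5, total_tokens=2000)
chatty = EntrySketch("chatty", avg_reward=0.75, total_tokens=12000)
empty = EntrySketch("empty", avg_reward=1.0, total_tokens=0)
```

Here `chatty` earns the higher raw reward, yet `frugal` ranks higher on efficiency because it spends far fewer tokens per unit of reward.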

### Filtering

`LeaderboardFilter` supports slicing by environment, model, preset, tags, reward range, completion rate, date window, and minimum case count. The `rank()` method applies the filter before sorting:

```python
# rlm_code/rlm/leaderboard.py  lines 414-459
def rank(self, metric, order=None, limit=None, filter=None) -> RankingResult:
    entries = self._entries
    if filter:
        entries = [e for e in entries if filter.matches(e)]
    ...
    entries = sorted(entries, key=lambda e: e.get_metric(metric), reverse=reverse)
```

Sources: [rlm_code/rlm/leaderboard.py:414-459]()
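
The filter-then-sort flow can be sketched without the real classes. The example below uses a reduced metric enum, a plain `environment` argument in place of `LeaderboardFilter`, and a hypothetical `HIGHER_IS_BETTER` set to pick the sort direction; none of these names are taken from the repository.

```python
from dataclasses import dataclass
from enum import Enum


class Metric(Enum):
    REWARD = "reward"
    STEPS = "steps"
    TOKENS = "tokens"


# Hypothetical lookup: metrics not listed here sort ascending (lower is better).
HIGHER_IS_BETTER = {Metric.REWARD}


@dataclass
class Row:
    name: str
    environment: str
    reward: float
    steps: float
    tokens: int


def rank(rows, metric, environment=None, limit=None):
    # Filter first, then sort in the direction appropriate for the metric.
    if environment is not None:
        rows = [r for r in rows if r.environment == environment]
    ranked = sorted(
        rows,
        key=lambda r: getattr(r, metric.value),
        reverse=metric in HIGHER_IS_BETTER,
    )
    return ranked if limit is None else ranked[:limit]


rows = [
    Row("a", "dspy", reward=0.5, steps=4.0, tokens=3000),
    Row("b", "dspy", reward=0.75, steps=3.0, tokens=5000),
    Row("c", "generic", reward=1.0, steps=2.0, tokens=1000),
]
by_reward = [r.name for r in rank(rows, Metric.REWARD, environment="dspy")]
by_tokens = [r.name for r in rank(rows, Metric.TOKENS, limit=2)]
```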

### Export Formats

The leaderboard can be exported as a Rich terminal table (via `format_rich_table`), JSON, CSV, or Markdown. The Markdown export includes both the ranked table and a statistics block (mean, median, std dev, range).

Sources: [rlm_code/rlm/leaderboard.py:599-637]()

---

## 4. Session Replay: Rewatching Any Run

The replay system lets you load a saved run and scrub through it step by step, inspect variables and memory state at any point, find all error steps, or compare two runs to find where they diverged.

### Recording

`SessionRecorder` is the write side. During a run it records typed `SessionEvent` objects (session start/end, step start/action/result/end, LLM request/response, child spawn/result, final detection, checkpoint, error) and appends them to a JSONL file in real time.

```python
# rlm_code/rlm/session_replay.py  lines 37-68
class SessionEventType(Enum):
    SESSION_START = "session_start"
    SESSION_END = "session_end"
    STEP_START = "step_start"
    STEP_ACTION = "step_action"
    STEP_RESULT = "step_result"
    STEP_END = "step_end"
    LLM_REQUEST = "llm_request"
    LLM_RESPONSE = "llm_response"
    CHILD_SPAWN = "child_spawn"
    CHILD_RESULT = "child_result"
    FINAL_DETECTED = "final_detected"
    CHECKPOINT = "checkpoint"
    ERROR = "error"
```

At any point `create_checkpoint()` materializes a `SessionSnapshot` (a full state object) and writes it to disk.

Sources: [rlm_code/rlm/session_replay.py:273-585]()

### Replaying

`SessionReplayer` wraps a `SessionSnapshot` and offers forward/backward navigation:

```python
# rlm_code/rlm/session_replay.py  lines 668-715
def step_forward(self) -> StepState | None: ...
def step_backward(self) -> StepState | None: ...
def goto_step(self, step: int) -> StepState | None: ...
def find_errors(self) -> list[StepState]: ...
def find_successes(self) -> list[StepState]: ...
```

It can be loaded from either a compact `.json` snapshot or a raw `.jsonl` trajectory file. Legacy trajectory formats are converted automatically via `_convert_legacy_event` and `_convert_legacy_step`.

Sources: [rlm_code/rlm/session_replay.py:593-744]()

### `StepState` fields

Each replayed step carries the full picture:

| Field | Content |
|---|---|
| `action_type`, `action_code` | What the agent decided to do |
| `action_rationale` | LLM's reasoning text |
| `success`, `output`, `error` | Observation result |
| `reward`, `cumulative_reward` | Per-step and running score |
| `memory_notes` | Active memory at this point |
| `variables` | REPL variable state |
| `tokens_used` | Token cost for this step |

Sources: [rlm_code/rlm/session_replay.py:119-160]()

### Session Comparison

`compare_sessions` finds the first step where two sessions diverge (by action type, code, or success flag) and computes reward/token/efficiency deltas:

```python
# rlm_code/rlm/session_replay.py  lines 930-992
def compare_sessions(snapshot_a, snapshot_b) -> SessionComparison:
    ...
    a_efficiency = (snapshot_a.total_reward * 1000) / snapshot_a.total_tokens
```

---

## 5. Observability Sinks: What Happened, Sent Everywhere

The observability layer is built around `RLMObservabilitySink`, a Protocol with three hooks: `on_run_start`, `on_step`, and `on_run_end`. `RLMObservability` is a coordinator that fans out calls to all configured sinks, catching per-sink exceptions so a failing sink never breaks the run.

```python
# rlm_code/rlm/observability.py  lines 46-85
class RLMObservabilitySink(Protocol):
    name: str
    def on_run_start(self, run_id, *, task, environment, params) -> None: ...
    def on_step(self, run_id, *, event, cumulative_reward) -> None: ...
    def on_run_end(self, run_id, *, result, run_path) -> None: ...
```

Sources: [rlm_code/rlm/observability.py:46-85]()
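
The fan-out with per-sink isolation can be sketched in a few lines. The classes below are hypothetical stand-ins (`FanOut` is not `RLMObservability`), only the `on_step` hook is shown, and recording failures in an `errors` list is a choice of this example.

```python
class RecordingSink:
    name = "recording"

    def __init__(self):
        self.steps = []

    def on_step(self, run_id, *, event, cumulative_reward):
        self.steps.append((run_id, event, cumulative_reward))


class BrokenSink:
    name = "broken"

    def on_step(self, run_id, *, event, cumulative_reward):
        raise ConnectionError("backend unreachable")


class FanOut:
    """Forward each hook to every sink; one failing sink must not stop the rest."""

    def __init__(self, sinks):
        self.sinks = list(sinks)
        self.errors = []

    def on_step(self, run_id, *, event, cumulative_reward):
        for sink in self.sinks:
            try:
                sink.on_step(run_id, event=event, cumulative_reward=cumulative_reward)
            except Exception as exc:
                # Record the failure and continue with the remaining sinks.
                self.errors.append((sink.name, repr(exc)))


recorder = RecordingSink()
fan_out = FanOut([BrokenSink(), recorder])
fan_out.on_step("run_1", event={"step": 1}, cumulative_reward=0.5)
```

Even with the broken sink listed first, the recording sink still receives the step.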

### Sink Activation via Environment Variables

The master switch is `DSPY_RLM_OBS_ENABLED` (default `True`). The local JSONL sink is also on by default; every external sink is opt-in. Individual sinks:

| Env variable | Sink | Default |
|---|---|---|
| `DSPY_RLM_OBS_LOCAL_JSONL` | `LocalJSONLSink` | `True` |
| `DSPY_RLM_MLFLOW_ENABLED` | `MLflowSink` | `False` |
| `DSPY_RLM_OTEL_ENABLED` | `OpenTelemetrySink` | `False` |
| `DSPY_RLM_LANGSMITH_ENABLED` | `LangSmithSink` | `False` |
| `DSPY_RLM_LANGFUSE_ENABLED` | `LangFuseSink` | `False` |
| `DSPY_RLM_LOGFIRE_ENABLED` | `LogfireSink` | `False` |

Sources: [rlm_code/rlm/observability.py:308-357]()
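
One way to resolve these flags is sketched below. The variable names and defaults come from the table above, but `env_flag`, `enabled_sinks`, and the accepted truthy spellings (`1`, `true`, `yes`, `on`) are assumptions of this example; the repository may parse values differently.

```python
import os

# Env variable -> (sink name, default), taken from the table above.
SINK_FLAGS = {
    "DSPY_RLM_OBS_LOCAL_JSONL": ("LocalJSONLSink", True),
    "DSPY_RLM_MLFLOW_ENABLED": ("MLflowSink", False),
    "DSPY_RLM_OTEL_ENABLED": ("OpenTelemetrySink", False),
    "DSPY_RLM_LANGSMITH_ENABLED": ("LangSmithSink", False),
    "DSPY_RLM_LANGFUSE_ENABLED": ("LangFuseSink", False),
    "DSPY_RLM_LOGFIRE_ENABLED": ("LogfireSink", False),
}


def env_flag(name: str, default: bool, environ=os.environ) -> bool:
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def enabled_sinks(environ=os.environ) -> list[str]:
    # The master switch disables everything when turned off.
    if not env_flag("DSPY_RLM_OBS_ENABLED", True, environ):
        return []
    return [
        sink for var, (sink, default) in SINK_FLAGS.items()
        if env_flag(var, default, environ)
    ]


defaults_only = enabled_sinks({})
with_otel = enabled_sinks({"DSPY_RLM_OTEL_ENABLED": "true"})
all_off = enabled_sinks({"DSPY_RLM_OBS_ENABLED": "false"})
```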

### LocalJSONLSink: Always-On Local Storage

This sink never requires an external service. It writes two files:

- `<workdir>/observability/runs.jsonl` — one JSON line per completed run (aggregated)
- `<workdir>/observability/steps/<run_id>.jsonl` — one line per step within a run

```python
# rlm_code/rlm/observability.py  lines 87-181
@dataclass(slots=True)
class LocalJSONLSink:
    base_dir: Path
    runs_file: Path   # observability/runs.jsonl
    steps_dir: Path   # observability/steps/
```

Each step line carries: `timestamp`, `run_id`, `step`, `action`, `reward`, `cumulative_reward`, `success`.

Sources: [rlm_code/rlm/observability.py:87-181]()
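
Because the step files are plain JSONL, they can be inspected with the standard library alone. The sketch below writes two sample lines carrying the fields listed above into a temporary directory and reads them back; the run ID, timestamp format, and values are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Recreate the documented layout inside a temporary workdir.
steps_dir = Path(tempfile.mkdtemp()) / "observability" / "steps"
steps_dir.mkdir(parents=True)
run_id = "run_demo"  # hypothetical run ID

sample_steps = [
    {"timestamp": "2025-01-01T00:00:00Z", "run_id": run_id, "step": 1,
     "action": "run_python", "reward": 0.25, "cumulative_reward": 0.25, "success": True},
    {"timestamp": "2025-01-01T00:00:05Z", "run_id": run_id, "step": 2,
     "action": "run_python", "reward": 0.5, "cumulative_reward": 0.75, "success": False},
]
step_file = steps_dir / f"{run_id}.jsonl"
with step_file.open("w", encoding="utf-8") as fh:
    for record in sample_steps:
        fh.write(json.dumps(record) + "\n")

# Read the file back one JSON object per line.
loaded = [
    json.loads(line)
    for line in step_file.read_text(encoding="utf-8").splitlines()
    if line.strip()
]
failed_steps = [s["step"] for s in loaded if not s["success"]]
final_reward = loaded[-1]["cumulative_reward"] if loaded else 0.0
```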

### OpenTelemetrySink: Distributed Tracing

When enabled, this sink creates one parent OTEL span per run (`rlm.run`) and one child span per step (`rlm.step`). Code and output are added as span events. Metrics instruments (`rlm.runs`, `rlm.steps`, `rlm.reward`, `rlm.run_duration`) can be exported via OTLP to any compatible backend (Jaeger, Grafana Tempo, etc.).

```python
# rlm_code/rlm/observability_sinks.py  lines 173-196
span = self._tracer.start_span(
    "rlm.run",
    attributes={
        "rlm.run_id": run_id,
        "rlm.task": task[:500],
        "rlm.environment": environment,
        "rlm.max_steps": params.get("max_steps", 0),
    },
)
```

The trace ID for each run is available via `get_trace_id(run_id)` for cross-system correlation.

Sources: [rlm_code/rlm/observability_sinks.py:42-321]()

### Other Optional Sinks

| Sink | Integration | Key mechanism |
|---|---|---|
| `MLflowSink` | MLflow experiment tracking | `log_metric` per step, `log_artifact` on run end |
| `LangSmithSink` | LangChain's observability | `RunTree` with parent/child structure |
| `LangFuseSink` | Open-source LLM tracing | `trace()` + per-step `span()` + `score()` on completion |
| `LogfireSink` | Pydantic's Logfire | OTEL-compatible `span()` context manager |

All optional sinks degrade gracefully: if the package is not installed or the connection fails, `_available` is set to `False` and subsequent calls are silently skipped.

Sources: [rlm_code/rlm/observability_sinks.py:328-965]()
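
The graceful-degradation pattern can be sketched with a guarded import. `OptionalSink` is a hypothetical class for illustration; only the `_available` flag name comes from the description above, and the module names passed in are placeholders.

```python
import importlib


class OptionalSink:
    """Sink that disables itself when its backing package cannot be imported."""

    def __init__(self, module_name: str):
        self.name = module_name
        self.forwarded = 0
        try:
            self._client = importlib.import_module(module_name)
            self._available = True
        except ImportError:
            # Missing dependency: stay constructed but inert.
            self._client = None
            self._available = False

    def on_step(self, run_id, *, event, cumulative_reward):
        if not self._available:
            return  # silently skipped, the run continues unaffected
        self.forwarded += 1  # a real sink would call its client here


# "json" stands in for an installed backend; the second name should not exist.
present = OptionalSink("json")
missing = OptionalSink("definitely_not_installed_backend_xyz")
for sink in (present, missing):
    sink.on_step("run_1", event={"step": 1}, cumulative_reward=0.0)
```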

---

## 6. Trace Analysis: Querying Recorded OTel Spans

For richer post-hoc investigation, `TraceStore` provides a read-only query API over a JSONL file of raw OTEL spans plus a sidecar index.

```python
# rlm_code/traces/store.py  lines 73-91
class TraceStore:
    """Read-only query API over a trace JSONL file and sidecar index."""

    @classmethod
    def load(cls, trace_path, index_path=None) -> "TraceStore": ...
```

Key operations:

| Method | What it returns |
|---|---|
| `get_overview()` | Service names, model names, agent names, error counts, token totals |
| `query_traces()` | Paginated summary list, filterable by `has_errors`, model, service, agent |
| `view_trace(trace_id)` | All spans for one trace, capped at 150 000 chars; returns oversized diagnostics if exceeded |
| `view_spans(trace_id, span_ids)` | Surgical extraction of specific spans with higher attribute cap (16 384 chars) |
| `search_trace(trace_id, pattern)` | Text-searches raw bytes using the index's stored byte offsets for efficiency |
| `export_evidence_corpus()` | Layered Markdown+JSONL export for harness-optimization agents |

The search uses byte-offset seeks into the JSONL file rather than scanning all lines, making pattern search efficient even on large trace files.

Sources: [rlm_code/traces/store.py:73-268]()
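
The byte-offset idea can be illustrated with a small standalone sketch. It builds an in-memory index of `(offset, length)` pairs per trace and then seeks directly to those spans when searching. The index layout, the `trace_id` key, and both function names are assumptions of this example and not the format of the real sidecar index.

```python
import json
import tempfile
from pathlib import Path


def build_index(path: Path) -> dict[str, list[tuple[int, int]]]:
    """Map each trace_id to the (byte offset, byte length) of its span lines."""
    index: dict[str, list[tuple[int, int]]] = {}
    offset = 0
    with path.open("rb") as fh:
        for line in fh:
            span = json.loads(line)
            index.setdefault(span["trace_id"], []).append((offset, len(line)))
            offset += len(line)
    return index


def search_trace(path: Path, index, trace_id: str, pattern: str) -> list[dict]:
    """Read only the indexed lines for one trace and match on raw bytes."""
    needle = pattern.encode("utf-8")
    hits = []
    with path.open("rb") as fh:
        for offset, length in index.get(trace_id, []):
            fh.seek(offset)
            raw = fh.read(length)
            if needle in raw:
                hits.append(json.loads(raw))
    return hits


# Hypothetical spans written to a temporary trace file.
trace_file = Path(tempfile.mkdtemp()) / "traces.jsonl"
spans = [
    {"trace_id": "t1", "span_id": "s1", "name": "rlm.run"},
    {"trace_id": "t2", "span_id": "s2", "name": "rlm.step", "error": "timeout"},
    {"trace_id": "t1", "span_id": "s3", "name": "rlm.step", "error": "timeout"},
]
with trace_file.open("w", encoding="utf-8") as fh:
    for span in spans:
        fh.write(json.dumps(span) + "\n")

index = build_index(trace_file)
hits = search_trace(trace_file, index, "t1", "timeout")
```

Only the two lines belonging to `t1` are read during the search; the span for `t2` is never touched even though it also contains the pattern.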

---

## 7. End-to-End Data Flow

```text
RLMBenchmarkCase
  │  defined in benchmarks.py or loaded from YAML/JSON/JSONL pack
  ▼
BenchmarkManagerMixin.run_benchmark()
  │  iterates cases → run_task() / HarnessRunner.run() / direct LLM
  │  RLMObservability.on_run_start / on_step / on_run_end fired for each case
  ▼
  ┌─────────────────────────────────────────────────────┐
  │  LocalJSONLSink  →  runs.jsonl + steps/<id>.jsonl   │
  │  MLflowSink      →  experiment metrics              │
  │  OpenTelemetrySink → OTEL spans → any OTLP backend  │
  │  LangSmithSink   →  RunTree in LangSmith project    │
  │  LangFuseSink    →  traces + scores in LangFuse     │
  │  LogfireSink     →  structured spans in Logfire     │
  └─────────────────────────────────────────────────────┘
  │
  ▼  benchmarks/<bench_id>.json (summary with avg_reward, avg_steps, …)
  │
  ├─→ Leaderboard.load_all()  →  rank() by REWARD / EFFICIENCY / TOKENS / …
  │      Export: Rich table, JSON, CSV, Markdown
  │
  ├─→ compare_benchmarks()  →  CI gate: reward / completion / steps / regressions
  │      Export: Markdown / CSV / JSON report
  │
  └─→ SessionReplayer.from_jsonl()  →  step_forward / goto_step / find_errors
         SessionStore.save_snapshot() / load_checkpoint()
```

---

## Summary

The benchmark and observability stack in RLM Code covers the "did it work?" question from multiple angles. `RLMBenchmarkCase` definitions drive automated runs through `run_benchmark`, and the resulting JSON summaries feed the `Leaderboard` for multi-metric ranking. `compare_benchmarks` provides CI-style regression gates, and `SessionReplayer` lets you re-examine any session step by step. `RLMObservability` fans telemetry out to local JSONL files as a reliable baseline and can optionally forward to MLflow, OpenTelemetry, LangSmith, LangFuse, or Logfire without changing any agent code. All sinks are BYOK/BYOC: they are controlled through environment variables and degrade gracefully when libraries are absent.

Sources: [rlm_code/rlm/observability.py:302-357]()