Agent-readable wiki

RLM Code — Explain Like I'm 5 Wiki

RLM Code is a Python tool that runs AI agents in a looping read-execute-judge cycle, benchmarks them across environments, and lets you compare results — all from a terminal UI. It implements the Recursive Language Models paper idea: instead of stuffing a giant document into the AI's memory all at once, the AI reads a little piece, writes code to analyze it, and repeats until it has an answer.

Pages

  1. Explain It Simply — What Is RLM Code? - RLM Code in plain language: what problem it solves, the one core idea to keep, and what you will find when you open the repo for the first time.
  2. The Loop — How an Agent Actually Works Here - Step by step: how RLMRunner turns one user prompt into many rounds of context → action proposal → sandbox execution → observation → reward → memory update, and when it stops.
  3. Environments & Sandboxes — Where Code Actually Runs - The four built-in environments (DSPy coding, Generic, TraceAnalysis, PureRLM), what each one does, and the sandbox runtimes (Docker, Monty, mock) that execute untrusted code safely.
  4. Framework Adapters — Plug In Your Favourite AI Stack - How rlm_code/rlm/frameworks/ lets DSPy, Google ADK, Pydantic-AI, and DeepAgents all plug into the same RLM loop through a shared base class and a framework registry, without changing the core runner.
  5. Benchmarks, Leaderboard & Observability — Did It Work? - How RLMBenchmarkCase definitions drive automated runs, how scores flow into the leaderboard, how trajectory replay lets you re-watch any session, and how observability sinks (OTel-shaped JSONL, trace analysis) record what happened.
  6. The One Map to Keep — Core Idea, Key Files, What to Read Next - A plain-English recap of the whole system: the single analogy that holds, the five files that matter most, the two constraints every newcomer hits, and where to go from here.

Complete Markdown

# RLM Code — Explain Like I'm 5 Wiki

> RLM Code is a Python tool that runs AI agents in a looping read-execute-judge cycle, benchmarks them across environments, and lets you compare results — all from a terminal UI. It implements the Recursive Language Models paper idea: instead of stuffing a giant document into the AI's memory all at once, the AI reads a little piece, writes code to analyze it, and repeats until it has an answer.

## Context Links

- [Agent index](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/llms.txt)
- [Human interactive wiki](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91)
- [GitHub repository](https://github.com/SuperagenticAI/rlm-code)

## Repository Metadata

- Repository: SuperagenticAI/rlm-code

- Generated: 2026-05-22T02:09:16.448Z
- Updated: 2026-05-22T02:14:08.224Z
- Runtime: Claude Code
- Format: Explain Like I'm 5
- Pages: 6

## Page Index

- 01. [Explain It Simply — What Is RLM Code?](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/01-explain-it-simply-what-is-rlm-code.md) - RLM Code in plain language: what problem it solves, the one core idea to keep, and what you will find when you open the repo for the first time.
- 02. [The Loop — How an Agent Actually Works Here](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/02-the-loop-how-an-agent-actually-works-here.md) - Step by step: how RLMRunner turns one user prompt into many rounds of context → action proposal → sandbox execution → observation → reward → memory update, and when it stops.
- 03. [Environments & Sandboxes — Where Code Actually Runs](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/03-environments-sandboxes-where-code-actually-runs.md) - The four built-in environments (DSPy coding, Generic, TraceAnalysis, PureRLM), what each one does, and the sandbox runtimes (Docker, Monty, mock) that execute untrusted code safely.
- 04. [Framework Adapters — Plug In Your Favourite AI Stack](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/04-framework-adapters-plug-in-your-favourite-ai-stack.md) - How rlm_code/rlm/frameworks/ lets DSPy, Google ADK, Pydantic-AI, and DeepAgents all plug into the same RLM loop through a shared base class and a framework registry, without changing the core runner.
- 05. [Benchmarks, Leaderboard & Observability — Did It Work?](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/05-benchmarks-leaderboard-observability-did-it-work.md) - How RLMBenchmarkCase definitions drive automated runs, how scores flow into the leaderboard, how trajectory replay lets you re-watch any session, and how observability sinks (OTel-shaped JSONL, trace analysis) record what happened.
- 06. [The One Map to Keep — Core Idea, Key Files, What to Read Next](https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/06-the-one-map-to-keep-core-idea-key-files-what-to-read-next.md) - A plain-English recap of the whole system: the single analogy that holds, the five files that matter most, the two constraints every newcomer hits, and where to go from here.

## Source File Index

- `docs/core/environments.md`
- `docs/core/execution-patterns.md`
- `pyproject.toml`
- `README.md`
- `rlm_code/__main__.py`
- `rlm_code/commands/run_command.py`
- `rlm_code/commands/slash_commands.py`
- `rlm_code/execution/engine.py`
- `rlm_code/execution/sandbox.py`
- `rlm_code/main.py`
- `rlm_code/models/providers/registry.py`
- `rlm_code/rlm/action_planner.py`
- `rlm_code/rlm/benchmark_manager.py`
- `rlm_code/rlm/benchmarks.py`
- `rlm_code/rlm/context_store.py`
- `rlm_code/rlm/docker_interpreter.py`
- `rlm_code/rlm/environments.py`
- `rlm_code/rlm/events.py`
- `rlm_code/rlm/frameworks/base.py`
- `rlm_code/rlm/frameworks/deepagents_adapter.py`
- `rlm_code/rlm/frameworks/dspy_rlm_adapter.py`
- `rlm_code/rlm/frameworks/google_adk_adapter.py`
- `rlm_code/rlm/frameworks/pydantic_ai_adapter.py`
- `rlm_code/rlm/frameworks/registry.py`
- `rlm_code/rlm/leaderboard.py`
- `rlm_code/rlm/memory_compaction.py`
- `rlm_code/rlm/monty_interpreter.py`
- `rlm_code/rlm/observability_sinks.py`
- `rlm_code/rlm/observability.py`
- `rlm_code/rlm/pure_rlm_environment.py`
- `rlm_code/rlm/runner.py`
- `rlm_code/rlm/session_replay.py`
- `rlm_code/rlm/termination.py`
- `rlm_code/rlm/trajectory.py`
- `rlm_code/sandbox/runtimes.py`
- `rlm_code/traces/store.py`

---

## 01. Explain It Simply — What Is RLM Code?

> RLM Code in plain language: what problem it solves, the one core idea to keep, and what you will find when you open the repo for the first time.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/01-explain-it-simply-what-is-rlm-code.md
- Generated: 2026-05-22T02:05:30.213Z

### Source Files

- `README.md`
- `pyproject.toml`
- `rlm_code/main.py`
- `rlm_code/__main__.py`
- `rlm_code/commands/run_command.py`
- `rlm_code/commands/slash_commands.py`

<details>
<summary>Relevant source files</summary>

The following files were used as context for generating this wiki page:

- [README.md](README.md)
- [pyproject.toml](pyproject.toml)
- [rlm_code/main.py](rlm_code/main.py)
- [rlm_code/commands/run_command.py](rlm_code/commands/run_command.py)
- [rlm_code/commands/slash_commands.py](rlm_code/commands/slash_commands.py)
- [rlm_code/rlm/pure_rlm_environment.py](rlm_code/rlm/pure_rlm_environment.py)
- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
- [rlm_code/rlm/benchmarks.py](rlm_code/rlm/benchmarks.py)
- [rlm_code/rlm/termination.py](rlm_code/rlm/termination.py)
- [rlm_code/rlm/trajectory.py](rlm_code/rlm/trajectory.py)
- [rlm_code/sandbox/runtimes/docker_runtime.py](rlm_code/sandbox/runtimes/docker_runtime.py)
</details>

# Explain It Simply — What Is RLM Code?

RLM Code is a command-line research tool for running, benchmarking, and replaying LLM-powered agents that solve tasks by writing and executing code — iteration by iteration — rather than trying to process everything in one large prompt. It wraps the **Recursive Language Models (RLM)** algorithm from a 2025 research paper in an interactive terminal UI with built-in evaluation, trajectory replay, and support for any LLM provider you bring.

This page is for someone opening the repository for the first time who wants to understand the one core idea before exploring the code, and know what each major part of the project does.

---

## The Problem RLM Solves

Think about what normally happens when you ask an LLM to analyze a large document. You paste the whole thing into the prompt and hope the model holds all the details in mind simultaneously. For anything bigger than the model's context window — a long codebase, a 500-page PDF, a large JSONL trace file — this breaks down: details get dropped, the model loses focus, and token costs explode.

**RLM's answer:** don't put the document in the prompt. Instead, store it as a Python variable in a sandboxed REPL, give the model just a short description (length, a preview), and let it write code to read and process the data in manageable chunks. The model sees only what its code surfaces, builds up intermediate results in buffer variables, and finally calls `FINAL("my answer")` to terminate.

This is token-efficient and scales to inputs far larger than any model's context window.
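
A minimal sketch of the chunked-read pattern described above, written as the kind of code a model might emit inside the REPL. The sample `context`, the `buffers` list, the chunk size, and the `FINAL` stand-in are all set up here for illustration; in RLM Code the environment would supply them, and the real `FINAL()` ends the run rather than returning a value.

```python
# Stand-ins for what the REPL environment would provide (assumed for illustration).
context = "\n".join(
    f"line {i}: status=ERROR" if i % 250 == 0 else f"line {i}: status=ok"
    for i in range(1000)
)
buffers = []


def FINAL(answer):
    """Hypothetical stand-in: the real FINAL() terminates the run."""
    return answer


CHUNK_LINES = 200  # lines per slice; an arbitrary choice for this sketch

# Walk the document slice by slice instead of putting it all in a prompt.
lines = context.splitlines()
for start in range(0, len(lines), CHUNK_LINES):
    chunk = lines[start:start + CHUNK_LINES]
    hits = [line for line in chunk if "ERROR" in line]
    if hits:
        # Keep only a compact finding, not the raw chunk.
        buffers.append({"start_line": start, "errors": len(hits)})

total_errors = sum(entry["errors"] for entry in buffers)
answer = FINAL(f"{total_errors} error lines found")
```

Only the small `buffers` entries would need to be surfaced to the model between steps, which is where the token savings come from.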

---

## The One Core Idea: A REPL Loop With a Context Variable

The system prompt injected in `rlm_code/rlm/pure_rlm_environment.py` makes the design explicit:

```
The REPL environment is initialized with:
1. A `context` variable that contains extremely important information about your query.
2. A `llm_query` function that allows you to query an LLM inside your REPL environment.
3. A `llm_query_batched` function for concurrent multi-prompt queries.
4. A `SHOW_VARS()` function ...
5. print() statements to view output and continue reasoning.
6. A `buffers` list for accumulating intermediate findings across iterations.
```

The model writes Python, the REPL runs it, the model sees the output, and the cycle repeats until it calls `FINAL(answer)`. That is the entire algorithm.

Sources: [rlm_code/rlm/pure_rlm_environment.py:186-198]()
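
The cycle can be illustrated with a toy loop. This is a sketch under loud assumptions: the "model" is a scripted list of code strings, `run_repl_loop` is a hypothetical helper, and plain `exec` is used where RLM Code would use a sandboxed runtime. Only the `FinalOutput` name is borrowed from the project.

```python
import contextlib
import io


class FinalOutput(Exception):
    """Raised by FINAL() to stop the loop (simplified for this sketch)."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def FINAL(answer):
    raise FinalOutput(answer)


def run_repl_loop(snippets, context, max_steps=6):
    """Execute one scripted snippet per step until FINAL() is called."""
    namespace = {"context": context, "buffers": [], "FINAL": FINAL}
    observations = []
    for code in snippets[:max_steps]:
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(code, namespace)  # plain exec: NOT a sandbox
        except FinalOutput as done:
            return done.answer, observations
        # In a real loop this output would be fed back to the model.
        observations.append(stdout.getvalue())
    return None, observations


# A scripted "model": each string is the code it would write at one step.
scripted_steps = [
    "print(len(context))",
    "buffers.append(context.count('cat'))",
    "FINAL(f'cat appears {buffers[0]} times')",
]
answer, observations = run_repl_loop(scripted_steps, "cat dog cat bird cat")
```

The namespace persists across steps, which is what lets a later snippet read the `buffers` entry written by an earlier one.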

---

## What Happens When You Run a Task

```text
/rlm run "Summarize the key findings" context=report.pdf steps=6
```

Here is what the loop does at each step:

```text
┌──────────────────────────────────────────────────────┐
│  User task:  "Summarize the key findings"            │
│  context =  <large document stored as variable>      │
└───────────────────────┬──────────────────────────────┘
                        │
              ┌─────────▼─────────┐
              │  LLM proposes     │  ← model writes Python code
              │  next action      │    e.g. context[:5000]
              └─────────┬─────────┘
                        │
              ┌─────────▼─────────┐
              │  Sandbox REPL     │  ← Docker / local runtime
              │  executes code    │    executes the code safely
              └─────────┬─────────┘
                        │
              ┌─────────▼─────────┐
              │  Observation      │  ← output of the code is fed
              │  fed back to LLM  │    back as next prompt context
              └─────────┬─────────┘
                        │
              ┌─────────▼─────────┐
              │  Repeat or        │  ← repeat until FINAL() called
              │  FINAL(answer)    │    or step/timeout limit hit
              └───────────────────┘
```

The runner loop in `rlm_code/rlm/runner.py` drives this cycle. Each iteration emits events tagged with a `TrajectoryEventType` (e.g. `ITERATION_CODE`, `ITERATION_OUTPUT`, `LLM_RESPONSE`), which are written to a JSONL file for replay and analysis.

Sources: [rlm_code/rlm/runner.py:1-10](), [rlm_code/rlm/trajectory.py:33-50]()

---

## What You Find When You Open the Repo

### Top-level structure

```
rlm_code/
├── main.py             ← CLI entry point; launches Textual TUI
├── commands/           ← /run, slash_commands handlers
├── rlm/                ← Core RLM engine
│   ├── runner.py       ← Drives the context→action→exec→reward loop
│   ├── pure_rlm_environment.py  ← Paper-faithful REPL + security sandbox
│   ├── environments.py ← GenericRLMEnvironment, DSPyCodingRLMEnvironment, TraceAnalysisEnvironment
│   ├── benchmarks.py   ← Preset benchmark cases (pure_rlm_smoke, dspy_quick, …)
│   ├── benchmark_manager.py     ← Runs benches, scores, compares
│   ├── termination.py  ← FINAL(), FINAL_VAR(), SUBMIT() control-flow
│   ├── trajectory.py   ← JSONL trace logging + replay
│   ├── policies/       ← Reward, termination, action, compaction policies
│   └── frameworks/     ← DSPy, Pydantic AI, Google ADK, DeepAgents adapters
├── ui/                 ← Textual TUI (tabs, chat input, Research view)
├── models/             ← LLM provider connectors (Anthropic, OpenAI, Gemini, Ollama)
├── sandbox/            ← Sandboxed execution backends
│   ├── superbox.py     ← Runtime selector with fallback chain
│   └── runtimes/       ← Docker, local, Monty, cloud (Daytona, E2B, Modal), Apple Container
├── harness/            ← Tool-using coding agent harness (/harness run …)
└── mcp/                ← MCP server and client for tool integration
```

Sources: [README.md:382-390](), [rlm_code/rlm/environments.py:122-145]()

---

## The Three Execution Modes

| Mode | Command | What it does |
|------|---------|-------------|
| **Pure RLM** | `/rlm run "task" env=pure_rlm` | Paper-faithful: context as variable, `llm_query()`, `FINAL()` termination |
| **DSPy Coding** | `/rlm run "task" env=dspy` | Writes DSPy modules; uses Docker REPL with verifier scoring |
| **Harness / Coding Agent** | `/harness run "task" steps=8` | Tool-using loop (like Claude Code); supports MCP, reads/writes project files |

The `pure_rlm` environment enforces strict security: `eval()`, `exec()`, `subprocess`, `os.system()`, `__import__()`, and several other builtins are statically blocked before code is run in the REPL.

Sources: [rlm_code/rlm/pure_rlm_environment.py:143-182]()
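
One way to picture a static check of this kind is an AST walk that rejects code before it runs. The sketch below is an assumption about the general technique, not the project's scanner: the real blocklist is longer and may be implemented differently (for example with regular expressions), and `find_violations` is a hypothetical name.

```python
import ast

# Names taken from the text above; the real blocklist covers more cases.
BLOCKED_CALLS = {"eval", "exec", "__import__", "globals"}
BLOCKED_MODULES = {"subprocess"}
BLOCKED_ATTRIBUTES = {("os", "system")}


def find_violations(code: str) -> list[str]:
    """Return a description of each blocked construct found in the code."""
    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"call to {node.func.id}()")
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if (node.value.id, node.attr) in BLOCKED_ATTRIBUTES:
                violations.append(f"use of {node.value.id}.{node.attr}")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                violations.append(f"import from {node.module}")
    return violations
```

A static check like this is only a first line of defence; aliasing (`f = eval`) slips past a name-based scan, which is one reason the sandbox runtime still matters.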

---

## The Termination Contract

The LLM must call one of these functions from within its REPL code to end a run:

```python
# Direct string or dict answer
FINAL("Here is my answer")

# Reference a REPL variable you built up
FINAL_VAR("results_buffer")

# Typed multi-field output (DSPy-style)
SUBMIT(answer="...", confidence=0.9)
```

Each call raises a Python exception (`FinalOutput` for `FINAL` and `FINAL_VAR`, `SubmitOutput` for `SUBMIT`) that the runner catches, records, and uses as the run's final result. Because termination goes through this explicit API, code running in the REPL ends the run only when the model deliberately signals completion, not as a side effect of an ordinary statement.

Sources: [rlm_code/rlm/termination.py:17-55]()

---

## Benchmarks and Comparison

RLM Code ships with named preset benchmark packs:

| Preset | Description |
|--------|-------------|
| `pure_rlm_smoke` | 3 cases testing the paper-compliant RLM mode |
| `dspy_quick` | 3 DSPy coding loop smoke tests |
| `oolong_style` | 4 long-context benchmarks (paper-compatible) |
| `paradigm_comparison` | Side-by-side RLM vs CodeAct vs Traditional |
| `token_efficiency` | Token efficiency comparison benchmarks |

You run them with `/rlm bench preset=<name>`, compare two runs with `/rlm bench report candidate=latest baseline=previous`, and replay any individual run step-by-step with `/rlm replay <run_id>`.

Sources: [rlm_code/rlm/benchmarks.py:26-40]()

---

## Provider and Sandbox Flexibility (BYOK/BYOC)

RLM Code is provider-neutral. You connect whichever LLM you have access to:

```
/connect anthropic claude-opus-4-6
/connect openai gpt-5.3-codex
/connect gemini gemini-2.5-flash
/connect ollama llama3.2        ← free, no API key, runs locally
```

The `[llm-all]` install extra pulls the Anthropic, OpenAI, and Google client libraries; `[tui]` adds Textual for the terminal UI. Each extra is optional.

Sandbox execution is similarly flexible. The `superbox.py` runtime selector tries Docker first, then falls back through Daytona, E2B, Modal, Monty, and a local command runtime — configurable in `rlm_config.yaml`:

```yaml
sandbox:
  runtime: docker
  superbox_auto_fallback: true
  superbox_fallback_runtimes: [docker, daytona, e2b]
```

Sources: [pyproject.toml:79-93](), [README.md:60-70](), [README.md:352-363]()
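
A fallback chain of this shape can be sketched in a few lines. Everything here is hypothetical: `select_runtime`, `RuntimeUnavailable`, and the `probes` mapping are illustrative names, not the API of `superbox.py`. The argument names only mirror the YAML keys shown above.

```python
from typing import Callable


class RuntimeUnavailable(Exception):
    """Hypothetical error raised when no runtime can be used."""


def select_runtime(
    preferred: str,
    fallbacks: list[str],
    probes: dict[str, Callable[[], bool]],
    auto_fallback: bool = True,
) -> str:
    """Return the first runtime whose availability probe succeeds."""
    candidates = [preferred]
    if auto_fallback:
        candidates += [name for name in fallbacks if name != preferred]
    for name in candidates:
        probe = probes.get(name)
        if probe is not None and probe():
            return name
    raise RuntimeUnavailable(f"no usable runtime among {candidates}")


# Example: Docker and Daytona are unavailable, E2B responds.
probes = {"docker": lambda: False, "daytona": lambda: False, "e2b": lambda: True}
chosen = select_runtime("docker", ["docker", "daytona", "e2b"], probes)
```

With `auto_fallback=False` the same call would raise instead of silently switching runtime, which is the safer behaviour when the isolation level matters.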

---

## Safety Guardrails Built In

Two layers of safety are enforced before the project even runs code:

1. **Directory safety check** (`rlm_code/main.py`): On startup, RLM Code refuses to run from your home directory, `~/Desktop`, `~/Documents`, `/System`, `/usr`, or other sensitive paths. This prevents an agent from accidentally scanning personal files.

2. **Code pattern scanner** (`rlm_code/rlm/pure_rlm_environment.py`): Before executing any LLM-written code in the REPL, a static scanner blocks `os.system()`, `subprocess`, `eval()`, `exec()`, `__import__()`, `globals()`, and several other escape hatches.

For cost and runtime bounds, every run accepts `steps=N timeout=S budget=B` parameters, and `/rlm abort all` cancels any active run cooperatively.

Sources: [rlm_code/main.py:25-62](), [rlm_code/rlm/pure_rlm_environment.py:145-182]()

---

## Closing Summary

RLM Code is a research playground for the Recursive Language Models paradigm: instead of pasting data into a prompt, the LLM interacts with data as a Python variable through a secure REPL loop, calling `FINAL()` when done. The repository delivers this as a terminal UI application with built-in benchmarks, trajectory logging, multi-provider support, and a pluggable sandbox layer — making it practical to run experiments, compare models and approaches, and replay agent behavior step by step. The core loop lives in `rlm_code/rlm/runner.py` and `rlm_code/rlm/pure_rlm_environment.py`, which together implement the paper's context-as-variable, `llm_query()`, and `FINAL()` termination semantics.

Sources: [rlm_code/rlm/pure_rlm_environment.py:1-13]()

---

## 02. The Loop — How an Agent Actually Works Here

> Step by step: how RLMRunner turns one user prompt into many rounds of context → action proposal → sandbox execution → observation → reward → memory update, and when it stops.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/02-the-loop-how-an-agent-actually-works-here.md
- Generated: 2026-05-22T02:08:24.439Z

### Source Files

- `rlm_code/rlm/runner.py`
- `rlm_code/rlm/action_planner.py`
- `rlm_code/rlm/context_store.py`
- `rlm_code/rlm/termination.py`
- `rlm_code/rlm/events.py`
- `rlm_code/rlm/trajectory.py`
- `rlm_code/rlm/memory_compaction.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
- [rlm_code/rlm/action_planner.py](rlm_code/rlm/action_planner.py)
- [rlm_code/rlm/context_store.py](rlm_code/rlm/context_store.py)
- [rlm_code/rlm/termination.py](rlm_code/rlm/termination.py)
- [rlm_code/rlm/events.py](rlm_code/rlm/events.py)
- [rlm_code/rlm/trajectory.py](rlm_code/rlm/trajectory.py)
- [rlm_code/rlm/memory_compaction.py](rlm_code/rlm/memory_compaction.py)
- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
</details>

# The Loop — How an Agent Actually Works Here

`RLMRunner` is the engine that turns a single user task string into a multi-round agent loop. Each round ("step") follows the same pipeline: build context → ask the LLM to propose an action → execute that action in a sandbox → observe the result → assign a reward → optionally update a short-term memory note → repeat. The loop ends when the agent signals it is done, a step budget is exhausted, a time budget expires, or the user cancels it.

This page walks through each phase of that loop in the order the code runs it, with precise citations so you can follow along in the source.

---

## 1. Anatomy of one run

`RLMRunner.run_task()` is the entry point. It accepts a task string, chooses an environment, and drives the step loop. The final result is an `RLMRunResult` dataclass that carries `completed`, `steps`, `total_reward`, and a human-readable `final_response`.

```python
# rlm_code/rlm/runner.py  (simplified sketch)
for step_index in range(1, max_steps + 1):
    planner_prompt = env.planner_prompt(task, memory, trajectory, step_index)
    candidates     = self._propose_step_candidates(planner_prompt, ...)
    selected       = max(candidates, key=lambda item: item["score"])
    action_result  = env.execute_action(selected["action"], ...)
    total_reward  += action_result.reward
    trajectory.append(step_event)
    if action_result.done:
        break
```

Sources: [rlm_code/rlm/runner.py:634-774]()

---

## 2. Phase-by-phase breakdown

### 2.1 Context assembly

Before each step, the environment builds a **planner prompt** that packages:

- the original task string,
- the short rolling `memory` list (capped at the last 8 notes),
- and the growing `trajectory` of past steps.

`LazyFileContext` (`context_store.py`) supplies file-level snippets when the environment needs workspace content. It reads lazily, loading only the files actually requested, and enforces a character budget (default 8,000 characters total, 1,600 characters per file).

```python
# rlm_code/rlm/context_store.py
def render(self, refs, *, max_chars=8000, max_chars_per_ref=1600) -> str:
    ...
```

Sources: [rlm_code/rlm/context_store.py:72-89]()
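
To make the two budgets concrete, here is a small sketch of how a per-reference cap and a total cap might interact. It is an illustration only: `render_refs` and its `load` callback are hypothetical, and the real `render()` may count headers or truncate differently.

```python
from typing import Callable, Iterable


def render_refs(
    refs: Iterable[str],
    load: Callable[[str], str],
    *,
    max_chars: int = 8000,
    max_chars_per_ref: int = 1600,
) -> str:
    """Render snippets until the total budget is spent.

    Only snippet bodies count against the budget in this sketch.
    """
    parts = []
    remaining = max_chars
    for ref in refs:
        if remaining <= 0:
            break  # budget spent: later refs are never loaded
        body = load(ref)[: min(max_chars_per_ref, remaining)]
        parts.append(f"[{ref}]\n{body}")
        remaining -= len(body)
    return "\n\n".join(parts)


# Example with a fake loader that records which refs were actually read.
loaded = []


def fake_load(ref: str) -> str:
    loaded.append(ref)
    return "x" * 5000


rendered = render_refs(["a.py", "b.py", "c.py"], fake_load, max_chars=2000)
```

Here `c.py` is never loaded because the first two references already consumed the 2,000-character budget.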

### 2.2 Action proposal (with optional branching)

`ActionPlannerMixin._propose_step_candidates()` calls the LLM once (branch_width=1) or multiple times (branch_width > 1) to generate candidate actions:

- **Single-branch**: one LLM call, response parsed as JSON into an `RLMAction`.
- **Multi-branch**: `branch_width` independent LLM calls. Each candidate is speculatively **scored** by running it in a throwaway copy of the workspace (`_preview_action_score`). The candidate with the highest score is selected.

```python
# rlm_code/rlm/action_planner.py
selected = max(candidates, key=lambda item: item["score"])
```

Sources: [rlm_code/rlm/action_planner.py:94-218]()

The JSON parser is tolerant: it tries fenced code blocks first, then walks the text for balanced `{…}` pairs. If no JSON is found at all, the whole raw response is treated as a `final` action.

Sources: [rlm_code/rlm/action_planner.py:309-338]()
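
The tolerant parsing strategy can be sketched as follows. This is an assumed reconstruction of the idea, not the project's parser: `parse_action`, `_balanced_objects`, and the keys of the fallback dict are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to keep this example fence-safe
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def _balanced_objects(text):
    """Yield each top-level {...} span, ignoring braces inside JSON strings."""
    depth, start, in_string, escaped = 0, 0, False, False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[start : i + 1]


def parse_action(response: str) -> dict:
    """Try fenced blocks first, then balanced braces, then fall back to final."""
    candidates = FENCE_RE.findall(response) + list(_balanced_objects(response))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # No JSON at all: treat the raw text as the final answer.
    return {"action": "final", "final_response": response.strip()}
```

Falling back to a `final` action keeps the loop from stalling on a chatty model, at the cost of sometimes ending a run on text that was meant as commentary.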

### 2.3 Sandbox execution

Each selected action is dispatched to the active **environment** (`env.execute_action()`). Environments wrap a sandboxed Python execution engine. Available environments include:

| Environment key | Purpose |
|---|---|
| `generic` / `rlm` | Default general-purpose coding loop |
| `dspy` / `dspy-coding` | DSPy-aware code tasks |
| `trace_analysis` / `traces` | Analysing agent execution traces |
| `pure_rlm` / `pure-rlm` | Strict RLM-paper semantics with REPL and FINAL() termination |

The `pure_rlm` environment spins up a secure interpreter backend (Monty or Docker; `exec` is opt-in unsafe). The runner tries the configured backend and falls back automatically.

Sources: [rlm_code/rlm/runner.py:266-303](), [rlm_code/rlm/runner.py:426-468]()

### 2.4 Observation

`execute_action()` returns an `EnvironmentActionResult` that carries:

- **observation** — a dict with stdout, stderr, success flag, and any structured output.
- **reward** — a float in `[-1.0, 1.0]` computed by the environment based on execution outcome.
- **done** — bool indicating whether this step terminates the run.
- **memory_note** — an optional short string to add to the rolling memory list.
- **final_response** — a human-readable answer string when `done=True`.

Sources: [rlm_code/rlm/environments.py:23-32]()

### 2.5 Reward computation

The runner calls `reward_profile.apply_global_scale(raw_reward)` after each step, then accumulates `total_reward`. The `RLMRewardProfile` dataclass carries tunable weights for common scoring situations:

| Scoring situation | Relevant fields |
|---|---|
| Python execution success | `run_python_success_bonus`, `run_python_failure_penalty` |
| DSPy pattern match | `dspy_pattern_match_bonus`, `dspy_pattern_bonus_cap` |
| File write / patch verification | `verifier_*` family |
| Global scale | `global_scale` |

Sources: [rlm_code/rlm/environments.py:44-97]()
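
As a rough illustration of how such a profile could turn an execution outcome into a reward, consider the sketch below. The field names follow the table above, but the default values, the clamping inside `apply_global_scale`, and the `score_python_run` helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class RewardProfile:
    # Field names follow the table above; the default values are invented.
    run_python_success_bonus: float = 0.5
    run_python_failure_penalty: float = 0.3
    global_scale: float = 1.0

    def apply_global_scale(self, raw_reward: float) -> float:
        """Scale a raw reward and clamp it to [-1.0, 1.0] (clamping assumed)."""
        return max(-1.0, min(1.0, raw_reward * self.global_scale))


def score_python_run(profile: RewardProfile, success: bool) -> float:
    """Hypothetical helper: reward for one Python execution."""
    if success:
        raw = profile.run_python_success_bonus
    else:
        raw = -profile.run_python_failure_penalty
    return profile.apply_global_scale(raw)


default_profile = RewardProfile()
aggressive_profile = RewardProfile(global_scale=4.0)
```

With `global_scale=4.0` a successful run would score 2.0 before clamping, so the clamp is what keeps per-step rewards inside the documented `[-1.0, 1.0]` range.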

### 2.6 Memory update

After each step, if `action_result.memory_note` is set, it is appended to the `memory` list. The list is **hard-capped at 8 entries** (oldest entries are discarded):

```python
# rlm_code/rlm/runner.py:763-765
if action_result.memory_note:
    memory.append(action_result.memory_note)
    memory = memory[-8:]
```

Sources: [rlm_code/rlm/runner.py:763-765]()

For the `pure_rlm` environment, longer REPL interaction histories are managed by `MemoryCompactor`. Compaction triggers when the history grows beyond `max_entries_before_compaction` (default 10) or `max_chars_before_compaction` (default 8,000 characters). The compactor calls the LLM to produce a 2–3 sentence summary of older entries, then discards them, keeping only the summary plus the last `preserve_last_n_entries` (default 2).

Sources: [rlm_code/rlm/memory_compaction.py:83-165]()
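
The trigger-and-summarize behaviour can be sketched with a stub in place of the LLM call. `compact_history` and the `summarize` callback are hypothetical; only the three threshold names and their defaults come from the description above.

```python
from typing import Callable


def compact_history(
    entries: list[str],
    summarize: Callable[[list[str]], str],
    *,
    max_entries_before_compaction: int = 10,
    max_chars_before_compaction: int = 8000,
    preserve_last_n_entries: int = 2,
) -> list[str]:
    """Replace older entries with one summary once either threshold is exceeded."""
    total_chars = sum(len(entry) for entry in entries)
    if (
        len(entries) <= max_entries_before_compaction
        and total_chars <= max_chars_before_compaction
    ):
        return entries  # nothing to do yet
    split = max(len(entries) - preserve_last_n_entries, 0)
    older, recent = entries[:split], entries[split:]
    if not older:
        return entries
    # In RLM Code the summary comes from an LLM call; here it is a stub.
    return [f"[summary] {summarize(older)}"] + recent


history = [f"step {i}" for i in range(12)]
compacted = compact_history(history, lambda old: f"{len(old)} earlier entries")
```

Keeping the most recent entries verbatim preserves the details the model is most likely to need for its next step.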

---

## 3. Termination: how the loop ends

The loop has four exit paths:

| End condition | How it fires |
|---|---|
| Agent done | `action_result.done=True` |
| Step budget | `step_index > max_steps` |
| Time budget | `monotonic > deadline` |
| Cooperative cancel | `_is_cancel_requested()` |

The "agent done" path is triggered when the agent's code calls one of the three terminal functions defined in `termination.py`:

- **`FINAL(answer)`** — raises `FinalOutput({"answer": answer, "type": "direct"})` and immediately exits the REPL loop.
- **`FINAL_VAR("varname")`** — raises `FinalOutput({"var": varname, "type": "variable"})`, causing the runner to look up the variable from the REPL namespace.
- **`SUBMIT(**kwargs)`** — raises `SubmitOutput(fields=kwargs)`, which supports typed multi-field outputs and optional schema validation.

```python
# rlm_code/rlm/termination.py
def FINAL(answer: Any) -> NoReturn:
    raise FinalOutput({"answer": answer, "type": "direct"})

def FINAL_VAR(variable_name: str) -> NoReturn:
    raise FinalOutput({"var": variable_name, "type": "variable"})

def SUBMIT(**kwargs: Any) -> NoReturn:
    raise SubmitOutput(fields=kwargs)
```

Sources: [rlm_code/rlm/termination.py:43-86]()
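
The catching side is not shown above, so here is a sketch of how a caller might turn these exceptions into a result. `execute_step` is a hypothetical helper, the `FinalOutput` class is a simplified stand-in, and plain `exec` replaces the sandboxed interpreter.

```python
from typing import Any


class FinalOutput(Exception):
    """Simplified stand-in carrying the payload dict shown above."""

    def __init__(self, output: dict[str, Any]):
        super().__init__(output)
        self.output = output


def FINAL(answer: Any):
    raise FinalOutput({"answer": answer, "type": "direct"})


def FINAL_VAR(variable_name: str):
    raise FinalOutput({"var": variable_name, "type": "variable"})


def execute_step(code: str, namespace: dict[str, Any]) -> tuple[bool, Any]:
    """Run one snippet and return (done, answer). Not the real runner."""
    namespace.setdefault("FINAL", FINAL)
    namespace.setdefault("FINAL_VAR", FINAL_VAR)
    try:
        exec(code, namespace)  # unsandboxed, for illustration only
    except FinalOutput as final:
        if final.output["type"] == "variable":
            # Resolve the named variable from the REPL namespace.
            return True, namespace.get(final.output["var"])
        return True, final.output["answer"]
    return False, None


repl_namespace: dict[str, Any] = {}
first = execute_step("results = [1, 2]", repl_namespace)
second = execute_step("results.append(3)\nFINAL_VAR('results')", repl_namespace)
```

Using an exception means `FINAL()` stops execution at the call site even when it sits deep inside a loop or helper function in the model's code.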

`detect_final_in_text()` and `detect_final_in_code()` also scan for these patterns in the LLM's raw text response (not just in executed code), so a model that writes `FINAL("answer")` in a markdown code block will still trigger termination correctly.

Sources: [rlm_code/rlm/termination.py:122-217]()

When the step budget is exhausted without a terminal call, the runner tries a fallback: it calls `_extract_answer_from_trajectory()`, which passes the last 10 steps to the LLM and asks it to synthesize the best possible answer from the execution history.

Sources: [rlm_code/rlm/action_planner.py:341-391]()

---

## 4. Trajectory persistence

Every step event and the final event are appended as newline-delimited JSON to a `.jsonl` file under `.rlm_code/rlm/runs/`. The run ID is a timestamp string like `run_20240501_142000_123456`.

```python
# rlm_code/rlm/runner.py:1404-1414
def _append_event(self, run_path: Path, event: dict[str, Any]) -> None:
    with run_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, ...) + "\n")
```

Each step record includes: `type`, `run_id`, `environment`, `task`, `timestamp`, `step`, `action`, `observation`, `reward`, and `usage` (token counts). The final record adds `completed`, `total_reward`, `steps`, and `cancelled`.

`TrajectoryLogger` in `trajectory.py` provides a richer alternative logger with per-event types (`ITERATION_REASONING`, `ITERATION_CODE`, `ITERATION_OUTPUT`, `LLM_REQUEST`, `CHILD_SPAWN`, `FINAL_DETECTED`, etc.) and supports export to an interactive HTML viewer.

Sources: [rlm_code/rlm/trajectory.py:33-70](), [rlm_code/rlm/trajectory.py:548-630]()
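
Because the run file is plain JSONL, it can be inspected with a few lines of standard-library code. The sketch below writes and reads a throwaway file; `load_events`, `summarize_run`, and the `"step"` and `"final"` type values are assumptions, so check a real run file for the exact values before relying on them.

```python
import json
import tempfile
from pathlib import Path


def append_event(run_path: Path, event: dict) -> None:
    """Append one event as a single JSON line."""
    with run_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")


def load_events(run_path: Path) -> list[dict]:
    """Read every non-empty line back into a dict."""
    with run_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_run(events: list[dict]) -> dict:
    """Count step events and add up their rewards."""
    steps = [event for event in events if event.get("type") == "step"]
    return {
        "steps": len(steps),
        "total_reward": sum(event.get("reward", 0.0) for event in steps),
    }


with tempfile.TemporaryDirectory() as tmp:
    run_path = Path(tmp) / "run_example.jsonl"
    append_event(run_path, {"type": "step", "step": 1, "reward": 0.5})
    append_event(run_path, {"type": "step", "step": 2, "reward": 0.25})
    append_event(run_path, {"type": "final", "completed": True})
    summary = summarize_run(load_events(run_path))
```

Append-only writes mean a crashed or cancelled run still leaves a readable prefix of its trajectory on disk.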

---

## 5. Event bus and observability

A lightweight in-process `RLMEventBus` (`events.py`) fires named events at every lifecycle point: `run_start`, `step_start`, `step_end`, `run_end`, `run_cycle_guard`. Subscribers can listen to all events or filter by `RLMEventType`. The event bus is used for real-time UI updates, streaming, and integration with observability sinks.

The full set of typed events covers LLM call boundaries (`LLM_CALL_START/END`), code execution boundaries (`CODE_EXEC_START/END`), sub-LLM calls from REPL code (`SUB_LLM_START/END`), child-agent lifecycle (`CHILD_SPAWN/START/END`), and memory compaction (`MEMORY_COMPACT_START/END`).

Sources: [rlm_code/rlm/events.py:17-80](), [rlm_code/rlm/events.py:190-298]()
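
The subscribe-all versus filter-by-type behaviour can be sketched with a tiny publish/subscribe class. `EventBus` and `EventType` below are simplified stand-ins written for this example; they are not the signatures of `RLMEventBus` or `RLMEventType`.

```python
from collections import defaultdict
from enum import Enum


class EventType(Enum):
    RUN_START = "run_start"
    STEP_START = "step_start"
    STEP_END = "step_end"
    RUN_END = "run_end"


class EventBus:
    """Minimal in-process bus: subscribers get all events or one type."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # key None means "all events"

    def subscribe(self, callback, event_type=None):
        self._subscribers[event_type].append(callback)

    def emit(self, event_type, payload):
        # Wildcard subscribers first, then those filtered to this type.
        for callback in self._subscribers[None] + self._subscribers[event_type]:
            callback(event_type, payload)


bus = EventBus()
seen_types = []
step_payloads = []
bus.subscribe(lambda kind, payload: seen_types.append(kind))
bus.subscribe(lambda kind, payload: step_payloads.append(payload), EventType.STEP_END)

bus.emit(EventType.RUN_START, {})
bus.emit(EventType.STEP_END, {"step": 1, "reward": 0.5})
```

A UI panel would typically use the wildcard subscription, while a metrics sink might filter to a single event type.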

---

## 6. Cycle guard and delegation

When an action's name is `delegate` or `delegate_batch`, the runner recursively calls `run_task()` for each sub-task. A `_RecursionState` object tracks a SHA-1 fingerprint of every `(environment, task)` pair that is currently active. If a recursive call would repeat an already-active fingerprint, it is skipped and assigned a reward of `-0.25` to discourage cycles.

Sources: [rlm_code/rlm/runner.py:536-581]()

The recursion depth is bounded by `max_depth` (default 2) and an optional `time_budget_seconds` deadline shared across all nested calls.
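
A toy version of the cycle guard shows how fingerprinting active tasks stops mutual delegation. This is a sketch under assumptions: how the real fingerprint combines its fields, what happens when `max_depth` is exceeded, and the `subtasks` mapping are all invented here. Only the SHA-1 idea, the `-0.25` penalty, and the default depth of 2 come from the text.

```python
import hashlib

CYCLE_PENALTY = -0.25  # reward quoted in the text above


class RecursionState:
    def __init__(self, max_depth: int = 2):
        self.max_depth = max_depth
        self.active: set[str] = set()

    @staticmethod
    def fingerprint(environment: str, task: str) -> str:
        # Assumed encoding of the (environment, task) pair.
        return hashlib.sha1(f"{environment}\n{task}".encode("utf-8")).hexdigest()


def run_task(task, environment, state, depth=0, subtasks=None):
    """Toy recursive runner: total reward of a task plus its delegates."""
    key = state.fingerprint(environment, task)
    if key in state.active:
        return CYCLE_PENALTY  # already running further up the call stack
    if depth > state.max_depth:
        return 0.0  # assumed behaviour: skip work beyond the depth limit
    state.active.add(key)
    try:
        reward = 1.0  # pretend the task's own step succeeded
        for sub in (subtasks or {}).get(task, []):
            reward += run_task(sub, environment, state, depth + 1, subtasks)
        return reward
    finally:
        state.active.discard(key)


# "A" delegates to "B", which tries to delegate back to "A".
cyclic = {"A": ["B"], "B": ["A"]}
cycle_reward = run_task("A", "generic", RecursionState(), subtasks=cyclic)
```

Removing the fingerprint in the `finally` block means the same task can legitimately run again later; only simultaneous re-entry is penalised.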

---

## 7. The full loop as a state diagram

```mermaid
stateDiagram-v2
    [*] --> Initializing : run_task() called
    Initializing --> StepLoop : run_id created, memory=[], trajectory=[]

    state StepLoop {
        [*] --> BuildContext
        BuildContext --> ProposeAction : env.planner_prompt()
        ProposeAction --> ScoreCandidates : branch_width > 1
        ScoreCandidates --> SelectBest
        ProposeAction --> SelectBest : branch_width = 1
        SelectBest --> ExecuteAction : env.execute_action()
        ExecuteAction --> UpdateMemory : EnvironmentActionResult
        UpdateMemory --> AppendTrajectory
        AppendTrajectory --> CheckDone
        CheckDone --> [*] : done=True
        CheckDone --> BuildContext : continue
    }

    StepLoop --> Synthesize : max_steps exhausted
    StepLoop --> Done : action_result.done=True
    StepLoop --> Cancelled : cancel requested
    StepLoop --> TimedOut : deadline exceeded

    Synthesize --> Done
    Done --> PersistFinal : write final JSONL event
    Cancelled --> PersistFinal
    TimedOut --> PersistFinal
    PersistFinal --> [*]
```

---

## Summary

The RLM loop in this repository is a reinforcement-learning-style agent harness that persists every step. `RLMRunner.run_task()` drives the step loop; `ActionPlannerMixin` handles prompting and candidate scoring; environments execute code in sandboxes and assign rewards; `termination.py` defines `FINAL()`, `FINAL_VAR()`, and `SUBMIT()` as the three ways code can signal completion; `MemoryCompactor` prevents context bloat over long runs; and every event is appended to a JSONL file and broadcast over the in-process event bus. The design is model-provider-agnostic: any `llm_connector` that implements `generate_response()` works with no changes to the loop itself.

Sources: [rlm_code/rlm/runner.py:488-847]()

---

## 03. Environments & Sandboxes — Where Code Actually Runs

> The four built-in environments (DSPy coding, Generic, TraceAnalysis, PureRLM), what each one does, and the sandbox runtimes (Docker, Monty, mock) that execute untrusted code safely.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/03-environments-sandboxes-where-code-actually-runs.md
- Generated: 2026-05-22T02:06:26.020Z

### Source Files

- `rlm_code/rlm/environments.py`
- `rlm_code/rlm/pure_rlm_environment.py`
- `rlm_code/execution/sandbox.py`
- `rlm_code/execution/engine.py`
- `rlm_code/rlm/docker_interpreter.py`
- `rlm_code/rlm/monty_interpreter.py`
- `rlm_code/sandbox/runtimes.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
- [rlm_code/rlm/pure_rlm_environment.py](rlm_code/rlm/pure_rlm_environment.py)
- [rlm_code/execution/sandbox.py](rlm_code/execution/sandbox.py)
- [rlm_code/execution/engine.py](rlm_code/execution/engine.py)
- [rlm_code/rlm/docker_interpreter.py](rlm_code/rlm/docker_interpreter.py)
- [rlm_code/rlm/monty_interpreter.py](rlm_code/rlm/monty_interpreter.py)
- [rlm_code/sandbox/runtimes/registry.py](rlm_code/sandbox/runtimes/registry.py)
- [rlm_code/sandbox/runtimes/monty_runtime.py](rlm_code/sandbox/runtimes/monty_runtime.py)
- [rlm_code/sandbox/runtimes/base.py](rlm_code/sandbox/runtimes/base.py)
- [rlm_code/sandbox/superbox.py](rlm_code/sandbox/superbox.py)
</details>

# Environments & Sandboxes — Where Code Actually Runs

When the RLM (Recursive Language Model) loop produces an action, two things happen: the *environment* decides what that action means and which tools are available to the LLM planner, while the *sandbox runtime* actually executes any Python code generated, in isolation. These are deliberately separate concerns — you can mix and match any environment with any compatible runtime.

This page explains the four built-in environments (`generic`, `dspy`, `trace_analysis`, `pure_rlm`), what each one teaches the planner and how it scores results, then walks through the runtime ladder: from Monty (Rust-sandboxed Python) and Docker (container-per-step) to the simple local subprocess fallback and optional cloud providers. Together they define where code actually runs and how safe that execution is.

---

## The Environment Abstraction

Every environment implements three methods (defined in `RLMEnvironment`, a `Protocol`):

| Method | Purpose |
|---|---|
| `system_prompt()` | Tells the LLM planner what actions are available and how to format its JSON response |
| `planner_prompt(task, memory, trajectory, step_index)` | Builds the per-step prompt including task, recent history, and environment context |
| `execute_action(action, execution_engine, timeout, llm_connector)` | Dispatches the JSON action, runs code if needed, and returns an `EnvironmentActionResult` |

`EnvironmentActionResult` carries four fields that flow back to the runner: `observation` (what the planner sees next), `reward` (a float in `[-1.0, 1.0]`), `done` (stop the loop?), and `final_response` (the answer to emit when done).

Rewards are shaped by `RLMRewardProfile`, a dataclass with named fields for every scoring dimension (base, success bonus, failure penalty, verifier checks, DSPy pattern bonuses, etc.). The profile can be passed at construction from a plain dict, making reward tuning fully data-driven without changing environment code.

Sources: [rlm_code/rlm/environments.py:100-120](), [rlm_code/rlm/environments.py:44-98]()

---

## Built-in Environments

### GenericRLMEnvironment (`name = "generic"`)

The baseline environment. The planner can take exactly two actions:

- **`run_python`** — submit a code string; the environment calls `execution_engine.execute_code(code, timeout=exec_timeout)` and computes a reward from success/failure/stderr.
- **`final`** — declare the task done and emit a final answer (reward `1.0`).

The system prompt is minimal JSON-only instructions. The planner prompt shows the task, last six memory entries, and last three trajectory steps (action, success, reward).

```python
# rlm_code/rlm/environments.py:138-145
def system_prompt(self) -> str:
    return (
        "You are an RLM planner.\n"
        "Return ONLY valid JSON with keys: "
        "action, code, rationale, done, final_response.\n"
        'Valid action values: "run_python", "final".\n'
        "No markdown. JSON only."
    )
```

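Reward formula: start from a base of `0.1`, add the success bonus `0.7` when the code succeeds or subtract the failure penalty `0.3` when it fails, subtract the stderr penalty `0.1` when the run produces stderr output, then clamp the result to `[-1.0, 1.0]`.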

Sources: [rlm_code/rlm/environments.py:122-286]()
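
The arithmetic can be sketched as a small function. This is an illustration under two assumptions that the page does not spell out: the success bonus and failure penalty are mutually exclusive, and the stderr penalty applies whenever stderr is non-empty. `generic_reward` is a hypothetical name, not a function from the repository.

```python
def generic_reward(
    success: bool,
    stderr: str,
    base: float = 0.1,
    success_bonus: float = 0.7,
    failure_penalty: float = 0.3,
    stderr_penalty: float = 0.1,
) -> float:
    """Combine the documented reward terms and clamp to [-1.0, 1.0]."""
    reward = base
    reward += success_bonus if success else -failure_penalty
    if stderr.strip():
        reward -= stderr_penalty
    return max(-1.0, min(1.0, reward))
```

Under these assumptions a clean successful run scores about `0.8`, and a failing run that also writes to stderr scores about `-0.3` (floating-point rounding aside).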

---

### DSPyCodingRLMEnvironment (`name = "dspy"`)

Extends `GenericRLMEnvironment` with a rich action vocabulary for code authoring tasks, especially DSPy modules. The action set grows to fourteen verbs:

| Action | What it does |
|---|---|
| `run_python` | Execute arbitrary Python (with DSPy pattern bonus on top) |
| `write_file` | Write a new file under `workdir`; runs the post-write verifier suite |
| `patch_file` | Apply a search-and-replace or full-content rewrite |
| `read_file` | Return a line-range excerpt of a file |
| `search_code` | Regex search over `.py` files in the project |
| `list_tree` | Enumerate directory entries up to configurable depth |
| `run_tests` | Run `pytest` (via subprocess or the execution engine) |
| `analyze_dspy` / `analyze_code` | Score DSPy source quality (0–100 heuristic) |
| `llm_query` | Forward a prompt to the LLM connector for delegated analysis |
| `llm_query_batched` | Run multiple prompts concurrently with `ThreadPoolExecutor` |
| `delegate` / `delegate_batch` | Reserved for recursive subtask spawning |
| `final` | Terminate with an answer |

**Path safety:** every file action goes through `_safe_resolve`, which rejects paths that escape `workdir` via symlinks or `..` traversal.

```python
# rlm_code/rlm/environments.py:1333-1341
def _safe_resolve(self, path_raw: str) -> Path | None:
    path = Path(path_raw)
    if path.is_absolute():
        resolved = path.resolve()
    else:
        resolved = (self.workdir / path).resolve()
    if not resolved.is_relative_to(self.workdir):
        return None
    return resolved
```

**Post-write verifier suite:** after every `write_file` or `patch_file`, the environment runs three checks automatically:
1. `python -m compileall` — catches syntax/import-time parse errors.
2. `pytest -q tests/test_<stem>.py` — runs a matching test file if one exists.
3. `execution_engine.validate_code(content)` — DSPy-aware linting (deprecated API checks, etc.).

The verifier outcome feeds directly into reward calculation: compile bonus `+0.20`, pytest bonus `+0.25`, validation bonus `+0.15`, with matching penalties on failure.

**DSPy pattern bonus:** running code that imports `dspy`, uses `dspy.Signature`, `dspy.InputField`, `dspy.OutputField`, `dspy.Module`, or implements `forward()` earns a small extra reward per matched pattern, capped at `+0.20`.

Sources: [rlm_code/rlm/environments.py:565-1483]()

---

### TraceAnalysisEnvironment (`name = "trace_analysis"`)

A HALO-style environment for inspecting agent execution traces stored as one-span-per-line JSONL files. It wraps a `TraceStore` object and exposes eight trace-specific actions:

| Action | Reward | Purpose |
|---|---|---|
| `set_trace_path` | 0.55 | Load a JSONL trace dataset |
| `get_dataset_overview` | 0.45 | Count traces, spans, errors; get sample IDs |
| `query_traces` | 0.50 | List traces with optional filters (errors, model, service, agent, project) |
| `count_traces` | 0.35 | Count traces matching a filter |
| `view_trace` | 0.65 | Fetch all spans for one trace ID |
| `search_trace` | 0.65 | Substring-search spans within a trace |
| `view_spans` | 0.70 | Fetch a specific list of span IDs |
| `export_evidence_corpus` | 0.75 | Write filtered traces to a directory for downstream agents |

The planner prompt automatically injects the active trace path and a live overview (total traces, spans, error count, sample IDs) so the LLM does not need to request it manually. If the task string contains `trace=<path>` or `trace_path=<path>`, the environment loads the file proactively before the first LLM step.

The goal articulated in the system prompt is to find *systemic* harness failure modes, not one-off anomalies, and to produce concrete evidence reports with trace IDs and span references.

Sources: [rlm_code/rlm/environments.py:289-491]()

---

### PureRLMEnvironment (`name = "pure_rlm"`)

This environment implements the exact semantics from the *Recursive Language Models* paper (2025). It is structurally different from the other three:

- The planner's response is **free-form text**, not constrained JSON. Code is extracted by regex from fenced blocks tagged `repl` or `python`.
- The context is stored as a **REPL variable** (`context`) rather than appearing in the token window. The LLM only sees metadata (type, total character count, chunk sizes).
- **`llm_query(prompt)`** and **`llm_query_batched(prompts)`** are available inside the REPL as callable functions. These make recursive LLM calls from within code execution.
- Termination is via `FINAL(answer)` or `FINAL_VAR(variable_name)` called in code or written in text, rather than a JSON `"action": "final"`.
- Message history **grows** across iterations (it is never truncated) so the full chain of reasoning and REPL output is preserved.

The REPL namespace starts from a tightly restricted `SAFE_BUILTINS` dict: the builtins `eval`, `exec`, `compile`, `globals`, `locals`, and `__import__` are absent, which also leaves code without a direct route to `subprocess` or `os.system`. A pre-flight scanner additionally rejects these patterns by regex before `exec()` is called:

```python
# rlm_code/rlm/pure_rlm_environment.py:144-162
_BLOCKED_CODE_PATTERNS = [
    (re.compile(r"\b__import__\s*\("), "Dynamic __import__() is blocked"),
    (re.compile(r"\bos\.system\s*\("),  "os.system() is blocked"),
    (re.compile(r"\bsubprocess\b"),      "subprocess module is blocked"),
    (re.compile(r"\beval\s*\("),         "eval() is blocked"),
    ...
]
```

The `open()` function is replaced with a `safe_open` that only allows read-only access to files under `workdir`.

**Multi-file helpers** (`load_file`, `load_files`, `switch_to`, `list_files`, `remove_file`) and a `chunk_indices(total_length, chunk_size, overlap)` helper are pre-injected into the namespace to support large-document workflows.
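
To illustrate what a chunking helper with this signature might compute, the sketch below returns `(start, end)` index pairs that cover a string of `total_length` characters. The return shape and the validation rules are assumptions; only the parameter names come from the text above.

```python
def chunk_indices(total_length: int, chunk_size: int, overlap: int = 0) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans covering [0, total_length)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    spans: list[tuple[int, int]] = []
    start = 0
    while start < total_length:
        end = min(start + chunk_size, total_length)
        spans.append((start, end))
        if end == total_length:
            break
        start += step
    return spans
```

Inside the REPL, generated code could then slice `context[start:end]` for each span and pass the piece to `llm_query`.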

**Interpreter selection:** `PureRLMEnvironment` requires either an explicit `interpreter` (a `MontyInterpreter` or `DockerPersistentInterpreter` instance) or `allow_unsafe_exec=True` for local experiments:

```python
# rlm_code/rlm/pure_rlm_environment.py:500-507
if interpreter is None:
    if not self._allow_unsafe_exec:
        raise RuntimeError(
            "PureRLMEnvironment requires a secure interpreter by default. "
            "Pass interpreter=MontyInterpreter(...) or interpreter=DockerPersistentInterpreter(...)."
        )
```

Sources: [rlm_code/rlm/pure_rlm_environment.py:418-540](), [rlm_code/rlm/pure_rlm_environment.py:144-182]()

---

## Environment Comparison

```text
                 Actions          Code execution    LLM calls from code    Context in token window
generic          2                run_python only   no                     yes (task + history)
dspy            14                run_python + file no                     yes
trace_analysis   8                no code exec      no                     yes
pure_rlm         free-form REPL   exec()/interp     yes (llm_query*)       no — in `context` var
```

\* `llm_query` calls count against `max_llm_calls` (default 50); the counter is enforced under a thread lock.
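
A lock-guarded call budget of this kind can be sketched as follows. `LLMCallBudget` and `count_granted` are hypothetical names, and raising `RuntimeError` on exhaustion is an assumption about behaviour, not something the page confirms.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LLMCallBudget:
    """Thread-safe counter that refuses calls beyond max_llm_calls."""

    def __init__(self, max_llm_calls: int = 50) -> None:
        self.max_llm_calls = max_llm_calls
        self._used = 0
        self._lock = threading.Lock()

    def acquire(self) -> None:
        # The check and the increment happen under one lock, so concurrent
        # callers cannot both take the last remaining slot.
        with self._lock:
            if self._used >= self.max_llm_calls:
                raise RuntimeError("max_llm_calls exceeded")
            self._used += 1

    @property
    def used(self) -> int:
        with self._lock:
            return self._used


def count_granted(budget: LLMCallBudget, attempts: int, workers: int = 8) -> int:
    """Fire `attempts` concurrent acquire() calls and count how many succeed."""
    def attempt(_: int) -> bool:
        try:
            budget.acquire()
            return True
        except RuntimeError:
            return False

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(attempt, range(attempts)))
```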

---

## The Execution Stack

When an environment calls `execution_engine.execute_code(code, timeout)`, that call passes through two layers before any code runs.

```text
Environment.execute_action()
        │
        ▼
ExecutionEngine.execute_code()    [rlm_code/execution/engine.py]
   ├─ validate_code()             (AST syntax + import checks + DSPy warnings)
   └─ ExecutionSandbox.execute()  [rlm_code/execution/sandbox.py]
           │
           ▼
        Superbox.resolve_runtime()  [rlm_code/sandbox/superbox.py]
           │  priority: runtime_override → config.sandbox.runtime → fallbacks
           ├─ local           LocalSandboxRuntime   (subprocess, always available)
           ├─ monty           MontySandboxRuntime   (pydantic_monty Rust VM)
           ├─ docker          DockerSandboxRuntime  (docker run --rm)
           ├─ apple-container AppleContainerRuntime (Apple VM, macOS)
           ├─ modal           ModalSandboxRuntime   (cloud, optional)
           ├─ e2b             E2BSandboxRuntime     (cloud, optional)
           └─ daytona         DaytonaSandboxRuntime (cloud, optional)
```

`ExecutionEngine` runs `validate_code` first (AST parse, dangerous-import scan, DSPy API checks) and returns a failed `ExecutionResult` immediately if validation fails, before any subprocess is spawned. The `Superbox` layer tries runtimes in priority order, skipping known-unavailable ones, and raises `ConfigurationError` only if every candidate fails.

Sources: [rlm_code/execution/engine.py:49-195](), [rlm_code/sandbox/superbox.py:29-116]()
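
The validate-before-execute ordering can be sketched with the standard `ast` module. Everything below is a simplified stand-in: `ExecutionResult` is redeclared with only three fields, `BLOCKED_IMPORTS` is an illustrative list, and the real `validate_code` also performs DSPy-specific checks that are not reproduced here.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExecutionResult:
    success: bool
    stdout: str = ""
    error: str = ""


BLOCKED_IMPORTS = {"subprocess", "ctypes"}  # illustrative, not the project's list


def validate_code(code: str) -> list[str]:
    """Return a list of problems; an empty list means the code may run."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name.split(".")[0] in BLOCKED_IMPORTS:
                problems.append(f"blocked import: {name}")
    return problems


def execute_code(code: str, run_in_sandbox: Callable[[str], ExecutionResult]) -> ExecutionResult:
    """Fail fast on validation errors so the sandbox is never started."""
    problems = validate_code(code)
    if problems:
        return ExecutionResult(success=False, error="; ".join(problems))
    return run_in_sandbox(code)
```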

---

## Sandbox Runtimes

### local — subprocess baseline

Writes code to a temp file and runs it with `subprocess.run([python_exe, code_file, ...])`. The environment is stripped down: `PYTHONPATH=""`, `PYTHONUNBUFFERED=1`, `HOME`/`TMPDIR` point to the temp dir, and `PATH` is limited to `/usr/bin:/bin`. Additional host env vars can be allowed via `sandbox.env_allowlist`.

This runtime is always available and is used as the ultimate fallback.

Sources: [rlm_code/execution/sandbox.py:163-200]()
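
A minimal sketch of this kind of stripped-down subprocess execution is shown below. The environment values mirror the ones listed above; the function names, the `snippet.py` filename, and the use of `sys.executable` are assumptions made for illustration.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def build_sandbox_env(tmp_dir: str, env_allowlist: tuple[str, ...] = ()) -> dict[str, str]:
    """Build a minimal environment, copying only allowlisted host variables."""
    env = {
        "PYTHONPATH": "",
        "PYTHONUNBUFFERED": "1",
        "HOME": tmp_dir,
        "TMPDIR": tmp_dir,
        "PATH": "/usr/bin:/bin",
    }
    for name in env_allowlist:
        if name in os.environ:
            env[name] = os.environ[name]
    return env


def run_local(code: str, timeout: float = 10.0,
              env_allowlist: tuple[str, ...] = ()) -> subprocess.CompletedProcess:
    """Write code to a temp file and run it in a child Python process."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        code_file = Path(tmp_dir) / "snippet.py"
        code_file.write_text(code, encoding="utf-8")
        return subprocess.run(
            [sys.executable, str(code_file)],
            capture_output=True,
            text=True,
            timeout=timeout,
            cwd=tmp_dir,
            env=build_sandbox_env(tmp_dir, env_allowlist),
        )
```

Note that this limits what the child process inherits but does not isolate it from the host filesystem or network, which is why the page treats `local` as the least safe option.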

---

### monty — Rust-sandboxed Python

`MontySandboxRuntime` wraps `MontyInterpreter`, which uses `pydantic_monty.Monty` — a Python interpreter written in Rust. Key properties:

- **No filesystem access, no network, no `import`** — the sandbox is enforced at the Rust VM level, not by Python policy.
- **Resource limits** via `ResourceLimits(max_duration_secs, max_memory, max_allocations)`.
- **External function dispatch**: when Monty code calls `llm_query`, `FINAL`, `FINAL_VAR`, or any user-registered tool, execution pauses (`MontySnapshot`) and control returns to the host Python process, which runs the handler and calls `snapshot.resume(return_value=...)` to continue.
- **Variable persistence** across REPL steps is simulated: the host injects known variables as `inputs`, and a synthetic `__rlm_collect__({...})` call at the end of each block sends new variables back.
- **Optional type checking** using Monty's Ruff-based parser (set `type_check=True`).
- **Microsecond startup** — no container to spin up.

The `create_rlm_monty_interpreter()` factory wires up all standard RLM external functions (`FINAL`, `FINAL_VAR`, `SUBMIT`, `SHOW_VARS`, `llm_query`, `llm_query_batched`) in one call.

```python
# rlm_code/rlm/monty_interpreter.py:864-886
interp = MontyInterpreter(
    timeout=timeout,
    tools=tools,
    resource_limits=resource_limits,
    type_check=type_check,
)
interp.start()
interp.register_external("FINAL", lambda answer: None)
interp.register_external("FINAL_VAR", lambda var_name: None)
...
```

Sources: [rlm_code/rlm/monty_interpreter.py:254-612](), [rlm_code/sandbox/runtimes/monty_runtime.py]()
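
The simulated variable persistence can be illustrated without Monty itself. In the sketch below, plain `exec()` stands in for the sandboxed VM (so it offers no isolation at all), each block runs in a fresh namespace seeded with the known variables, and a synthetic trailing `__rlm_collect__` call reports the resulting variables back to the host. `run_block` is a hypothetical helper and the trailer text is an assumption about how such a call could be generated.

```python
from typing import Any


def run_block(code: str, known_vars: dict[str, Any]) -> dict[str, Any]:
    """Run one block statelessly and return the updated variable store."""
    collected: dict[str, Any] = {}

    def __rlm_collect__(values: dict[str, Any]) -> None:
        collected.update(values)

    # Fresh namespace per block: only the injected inputs carry over.
    namespace: dict[str, Any] = dict(known_vars)
    namespace["__rlm_collect__"] = __rlm_collect__

    # Synthetic trailer appended by the host: send every public name back.
    trailer = (
        "\n__rlm_collect__({k: v for k, v in dict(globals()).items()"
        " if not k.startswith('_')})"
    )
    exec(code + trailer, namespace)  # stand-in for the sandboxed interpreter
    return {**known_vars, **collected}
```

Calling `run_block` twice with the returned store threaded through gives the appearance of a persistent REPL even though no state lives inside the executor.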

---

### docker — container-per-step (ExecutionSandbox path)

`DockerSandboxRuntime` (used via `ExecutionSandbox`) runs each code file in an ephemeral container (`docker run --rm`). Networking is disabled by default (`--network none`). Dangerous Docker flags (`--privileged`, `--volume`, `--mount`, `--network=host`, etc.) are rejected by the registry before the runtime is created.

**DockerPersistentInterpreter** (the interpreter used by `PureRLMEnvironment`) is a separate, higher-level implementation that maintains REPL state across steps:

- A shared session directory on the host is mounted into each container as `/workspace`.
- The REPL namespace is serialized to `state.dill` (falling back to `pickle`) after each step and reloaded before the next.
- External functions (e.g., `llm_query`) are dispatched over a lightweight HTTP bridge: the container script calls `http://host.docker.internal:<port>/external` and the host's `_ProxyHandler` runs the actual function and returns the result as base64-pickled JSON.
- `FinalOutput` and `SubmitOutput` exceptions raised on the host side are forwarded back through the proxy as structured error payloads.

```python
# rlm_code/rlm/docker_interpreter.py:267-298
def _build_docker_command(self, code: str) -> list[str]:
    mount_arg = f"{self._session_dir}:/workspace:rw"
    cmd = [
        "docker", "run", "--rm",
        "--workdir", "/workspace",
        "--volume", mount_arg,
        "--add-host", "host.docker.internal:host-gateway",
    ]
    ...
    cmd.extend([self.image, "python", "-c", script])
    return cmd
```

Sources: [rlm_code/rlm/docker_interpreter.py:41-544]()

---

### apple-container — macOS VM runtime

`AppleContainerRuntime` uses Apple's `container` CLI (macOS only) with similar semantics to Docker. It requires `sandbox.apple_container_enabled=true` in config and checks for the `container` binary on startup.

---

### Cloud runtimes (Modal, E2B, Daytona)

Three optional cloud-based runtimes are registered under the same `SandboxRuntime` protocol. They are loaded at import time and silently skipped if their SDKs are not installed:

| Runtime | Install | Notes |
|---|---|---|
| Modal | `pip install modal && modal setup` | Configurable memory/CPU |
| E2B | `pip install e2b-code-interpreter` | Template-based sandboxes |
| Daytona | `pip install daytona-sdk` or CLI | Workspace-based execution |

---

## Runtime Selection & Fallback (Superbox)

`Superbox` centralizes runtime selection. On every `ExecutionSandbox.execute()` call it:

1. Reads `sandbox.runtime` from config (or uses the session-level `runtime_override`).
2. Probes all runtimes with `detect_runtime_health()`.
3. Tries the primary runtime first; if that fails, tries fallback candidates in order (`docker` → `apple-container` → `local`), skipping runtimes already flagged as unhealthy.
4. Raises `ConfigurationError` if everything fails.

Auto-fallback is enabled by default (`superbox_auto_fallback=true`) but can be turned off. The fallback list can also be overridden via `superbox_fallback_runtimes`.

```text
SUPPORTED_RUNTIMES = {"local", "monty", "docker", "apple-container",
                      "modal", "e2b", "daytona"}
```

Sources: [rlm_code/sandbox/runtimes/registry.py:46-50](), [rlm_code/sandbox/superbox.py:37-116]()
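
The selection order above can be sketched as a pure function over a health map. This is an illustration only: `resolve_runtime` here takes the health results as an argument instead of probing runtimes, and `ConfigurationError` is redeclared locally so the snippet runs on its own.

```python
class ConfigurationError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


DEFAULT_FALLBACKS = ("docker", "apple-container", "local")


def resolve_runtime(
    primary: str,
    health: dict[str, bool],
    fallbacks: tuple[str, ...] = DEFAULT_FALLBACKS,
    auto_fallback: bool = True,
) -> str:
    """Pick the primary runtime if healthy, else the first healthy fallback."""
    candidates = [primary]
    if auto_fallback:
        candidates.extend(name for name in fallbacks if name != primary)
    for name in candidates:
        if health.get(name, False):
            return name
    raise ConfigurationError(f"no usable sandbox runtime among {candidates}")
```

With `auto_fallback=False` the function fails as soon as the primary runtime is unhealthy, which mirrors turning off `superbox_auto_fallback`.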

---

## Choosing a Runtime

```text
Need                                  Recommended runtime
────────────────────────────────────  ─────────────────────────────────
Local dev, no Docker installed        local  (always works, least safe)
Fast execution, strong isolation      monty  (requires pydantic-monty)
Pure RLM with llm_query in REPL       monty or docker (DockerPersistentInterpreter)
Arbitrary OS access / packages        docker
macOS native VMs                      apple-container
Cloud execution, long timeout         modal / e2b / daytona
```

`PureRLMEnvironment` enforces its interpreter requirement at construction: passing `interpreter=MontyInterpreter(...)` or `interpreter=DockerPersistentInterpreter(...)` is required unless `allow_unsafe_exec=True` is set. Every other environment routes through `ExecutionEngine` → `ExecutionSandbox` → `Superbox`, so runtime selection is transparent to the environment code itself.

Sources: [rlm_code/rlm/pure_rlm_environment.py:499-513](), [rlm_code/sandbox/superbox.py:87-116]()

---

## 04. Framework Adapters — Plug In Your Favourite AI Stack

> How rlm_code/rlm/frameworks/ lets DSPy, Google ADK, Pydantic-AI, and DeepAgents all plug into the same RLM loop through a shared base class and a framework registry, without changing the core runner.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/04-framework-adapters-plug-in-your-favourite-ai-stack.md
- Generated: 2026-05-22T02:06:06.178Z

### Source Files

- `rlm_code/rlm/frameworks/base.py`
- `rlm_code/rlm/frameworks/registry.py`
- `rlm_code/rlm/frameworks/dspy_rlm_adapter.py`
- `rlm_code/rlm/frameworks/google_adk_adapter.py`
- `rlm_code/rlm/frameworks/pydantic_ai_adapter.py`
- `rlm_code/rlm/frameworks/deepagents_adapter.py`
- `rlm_code/models/providers/registry.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/frameworks/base.py](rlm_code/rlm/frameworks/base.py)
- [rlm_code/rlm/frameworks/registry.py](rlm_code/rlm/frameworks/registry.py)
- [rlm_code/rlm/frameworks/dspy_rlm_adapter.py](rlm_code/rlm/frameworks/dspy_rlm_adapter.py)
- [rlm_code/rlm/frameworks/google_adk_adapter.py](rlm_code/rlm/frameworks/google_adk_adapter.py)
- [rlm_code/rlm/frameworks/pydantic_ai_adapter.py](rlm_code/rlm/frameworks/pydantic_ai_adapter.py)
- [rlm_code/rlm/frameworks/deepagents_adapter.py](rlm_code/rlm/frameworks/deepagents_adapter.py)
- [rlm_code/rlm/frameworks/adk_rlm_adapter.py](rlm_code/rlm/frameworks/adk_rlm_adapter.py)
- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
</details>

# Framework Adapters — Plug In Your Favourite AI Stack

The `rlm_code/rlm/frameworks/` package is a thin plug-in layer that lets the RLM runner drive tasks through DSPy, Google ADK, Pydantic-AI, or DeepAgents — all through a single, uniform interface — without touching the core run loop. Think of it like a universal power adapter: no matter which wall socket your AI framework prefers, the RLM runner only needs to push current through the same two pins (`doctor()` and `run_episode()`).

This page explains the shared contract, the five built-in adapters, how the registry wires them in at runtime, and what happens inside the runner when you choose a framework.

---

## The Shared Contract: `RLMFrameworkAdapter`

Every adapter implements a Python `Protocol` defined in `base.py`. A `Protocol` in Python is like an interface: any class that has the right attributes and methods satisfies it automatically, without inheritance.

```python
# rlm_code/rlm/frameworks/base.py:33-55
class RLMFrameworkAdapter(Protocol):
    framework_id: str

    def doctor(self) -> tuple[bool, str]: ...

    def run_episode(
        self,
        *,
        task: str,
        llm_connector: Any,
        max_steps: int,
        exec_timeout: int,
        workdir: str,
        sub_model: str | None = None,
        sub_provider: str | None = None,
        context: dict[str, Any] | None = None,
    ) -> FrameworkEpisodeResult: ...
```

**`framework_id`** — a unique slug like `"dspy-rlm"` or `"pydantic-ai"`. This is the key the registry and the CLI both use to look up an adapter.

**`doctor()`** — a pre-flight check. Returns `(True, "ok message")` when the adapter's optional dependency is installed and ready, or `(False, "install hint")` when it is not. This lets the runner surface actionable error messages before attempting any work.

**`run_episode()`** — the main hook. Executes one complete task run inside the target framework and returns a `FrameworkEpisodeResult`.

Sources: [rlm_code/rlm/frameworks/base.py:33-55]()

### Data Shapes Crossing the Boundary

Two dataclasses carry all data between an adapter and the runner:

```python
# rlm_code/rlm/frameworks/base.py:11-29
@dataclass(slots=True)
class FrameworkStepRecord:
    action: str
    observation: dict[str, Any] = field(default_factory=dict)
    reward: float = 0.0
    done: bool = False

@dataclass(slots=True)
class FrameworkEpisodeResult:
    completed: bool
    final_response: str
    steps: list[FrameworkStepRecord] = field(default_factory=list)
    total_reward: float = 0.0
    usage_summary: dict[str, int] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)
```

`FrameworkStepRecord` is RLM's unit of trajectory data: one action, its outcome (observation), and a scalar reward signal. `FrameworkEpisodeResult` wraps the whole run: the final text answer, the full list of steps, a clipped total reward, optional token-usage totals, and a metadata dict the adapter can fill with whatever the framework exposes.

Sources: [rlm_code/rlm/frameworks/base.py:11-29]()

---

## The Registry: One Call to Bind Them All

`FrameworkAdapterRegistry` is a plain dictionary wrapper. Adapters are registered by their `framework_id` (lowercased, stripped). The `default()` classmethod constructs and registers all five built-in adapters in a single call:

```python
# rlm_code/rlm/frameworks/registry.py:32-46
@classmethod
def default(cls, *, workdir: str) -> "FrameworkAdapterRegistry":
    registry = cls()
    from .adk_rlm_adapter import ADKRLMFrameworkAdapter
    from .deepagents_adapter import DeepAgentsFrameworkAdapter
    from .dspy_rlm_adapter import DSPyRLMFrameworkAdapter
    from .google_adk_adapter import GoogleADKFrameworkAdapter
    from .pydantic_ai_adapter import PydanticAIFrameworkAdapter

    registry.register(DSPyRLMFrameworkAdapter(workdir=workdir))
    registry.register(ADKRLMFrameworkAdapter(workdir=workdir))
    registry.register(PydanticAIFrameworkAdapter(workdir=workdir))
    registry.register(GoogleADKFrameworkAdapter(workdir=workdir))
    registry.register(DeepAgentsFrameworkAdapter(workdir=workdir))
    return registry
```

The registry exposes four methods:

| Method | Purpose |
|--------|---------|
| `register(adapter)` | Add an adapter; raises `ValueError` if `framework_id` is empty |
| `get(framework_id)` | Look up by slug; returns `None` for unknown ids |
| `list_ids()` | Sorted list of all registered slugs |
| `doctor()` | Run `doctor()` on every adapter and collect results |

The registry is constructed once when the `RLMRunner` initialises, via `self.framework_registry = FrameworkAdapterRegistry.default(workdir=...)` (runner.py:210). From that point on, the runner knows nothing about individual frameworks — it only talks to the registry.

Sources: [rlm_code/rlm/frameworks/registry.py:12-61]()

---

## Class Structure

```text
┌─────────────────────────────────────────────────────────────────────────────────┐
│  RLMFrameworkAdapter  (Protocol — base.py)                                      │
│  ─────────────────────────────────────────                                      │
│  framework_id: str                                                              │
│  doctor() -> (bool, str)                                                        │
│  run_episode(...) -> FrameworkEpisodeResult                                     │
└─────────────────────────────┬──────────────────────────────────────────────────┘
                              │ satisfied by
        ┌─────────────────────┼──────────────────────────┐
        │                     │                          │
DSPyRLMFrameworkAdapter  GoogleADKFrameworkAdapter  PydanticAIFrameworkAdapter
  framework_id="dspy-rlm"  framework_id="google-adk"  framework_id="pydantic-ai"
  adapter_mode="native_rlm" adapter_mode="agent_loop"  adapter_mode="agent_loop"

ADKRLMFrameworkAdapter    DeepAgentsFrameworkAdapter
  framework_id="adk-rlm"   framework_id="deepagents"
  adapter_mode="native_rlm" adapter_mode="agent_loop"
```

The `adapter_mode` field (not part of the Protocol, but present on every adapter) lets the runner's `doctor()` report distinguish frameworks that use their own native RLM loop (`"native_rlm"`) from those that run a generic agent loop (`"agent_loop"`).

Sources: [rlm_code/rlm/frameworks/registry.py:55-61]()

---

## Built-in Adapters

### DSPy RLM (`dspy-rlm`)

**Install:** `pip install dspy`

The DSPy adapter is in `native_rlm` mode: it delegates directly to `dspy.RLM`, DSPy's own Recursive Language Model module. The adapter checks `hasattr(dspy, "RLM")` in `doctor()` because older DSPy versions do not expose this attribute.

```python
# rlm_code/rlm/frameworks/dspy_rlm_adapter.py:78-88
rlm = dspy.RLM(
    "context, query -> answer",
    max_iterations=max(2, int(max_steps)),
    sub_lm=lm,
)
with dspy.context(lm=lm):
    prediction = rlm(context=context_payload, query=task)
```

Model resolution maps RLM provider strings to DSPy's `provider/model` format. For example, `provider="google"` becomes `"gemini/model-name"`, and `provider="openai-compatible"` is normalised to `"openai/model-name"`. If the global `dspy.settings.lm` is already configured, the adapter reuses it rather than constructing a new `dspy.LM`.

Sources: [rlm_code/rlm/frameworks/dspy_rlm_adapter.py:22-190]()
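
The provider mapping can be sketched as a small lookup. Only the two aliases mentioned above are grounded in this page; passing every other provider through unchanged is an assumption, and `to_dspy_model` is a hypothetical name.

```python
# Aliases named in the text above; other providers are passed through as-is.
DSPY_PROVIDER_ALIASES = {"google": "gemini", "openai-compatible": "openai"}


def to_dspy_model(provider: str, model: str) -> str:
    """Return a `provider/model` string in the format DSPy expects."""
    key = provider.strip().lower()
    prefix = DSPY_PROVIDER_ALIASES.get(key, key)
    return f"{prefix}/{model}"
```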

---

### Google ADK (`google-adk`)

**Install:** `pip install 'rlm-code[adk]'`

The Google ADK adapter wraps the `google.adk` package's `LlmAgent` and `InMemoryRunner`. Because ADK's runner is fully async, the adapter uses a helper `_run_coro_sync()` that spins up a new thread with its own event loop when a loop is already running in the calling thread, since `asyncio.run()` raises `RuntimeError` if it is called from inside a running event loop:

```python
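# rlm_code/rlm/frameworks/google_adk_adapter.py:193-215
# Excerpt; the `_runner` thread body below is a reconstruction, not verbatim source.
import asyncio
import threading
from typing import Any

def _run_coro_sync(coro: Any) -> Any:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(coro)  # no loop running in this thread
    # Already in a running loop: run in a daemon thread with its own loop.
    box: dict[str, Any] = {}

    def _runner() -> None:
        try:
            box["value"] = asyncio.run(coro)
        except BaseException as exc:  # re-raised in the calling thread
            box["error"] = exc

    thread = threading.Thread(target=_runner, daemon=True)
    thread.start()
    thread.join()
    if "error" in box:
        raise box["error"]
    return box.get("value")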

Events are streamed with `runner.run_async(...)` and serialised into `FrameworkStepRecord` objects with actions `"model_text"`, `"tool_call"`, or `"tool_result"`. For Gemini models, the adapter strips the `provider:` prefix from the model name because Google ADK expects a bare model name like `gemini-2.0-flash`.

Sources: [rlm_code/rlm/frameworks/google_adk_adapter.py:1-215]()

---

### ADK RLM (`adk-rlm`)

**Install:** `pip install 'rlm-code[adk]'`

A second ADK-flavoured adapter in `native_rlm` mode. While `google-adk` uses the public `LlmAgent`/`InMemoryRunner` API, `adk-rlm` calls a vendored `adk_rlm.completion()` function — a sample RLM implementation that adds depth-limited recursive search on top of ADK. It exposes `max_iterations` and `max_depth` directly.

```python
# rlm_code/rlm/frameworks/adk_rlm_adapter.py:94-102
completion_result = completion(
    context=context_payload,
    prompt=task,
    model=resolved_model,
    sub_model=sub_model or resolved_model,
    max_iterations=max(2, int(max_steps)),
    max_depth=max(1, min(8, int(max_steps))),
    verbose=False,
)
```

Sources: [rlm_code/rlm/frameworks/adk_rlm_adapter.py:60-174]()

---

### Pydantic AI (`pydantic-ai`)

**Install:** `pip install 'rlm-code[pydantic]'`

Uses `pydantic_ai.Agent` in synchronous mode (`agent.run_sync(task)`). The adapter converts every message part from Pydantic-AI's message history into a `FrameworkStepRecord`, assigning reward values by part type:

| Part type | Action label | Reward |
|-----------|-------------|--------|
| `ToolCallPart` | `tool_call` | +0.02 |
| `ToolReturnPart` | `tool_result` | +0.06 |
| `RetryPromptPart` | `retry_prompt` | −0.05 |
| Any text part | `model_part` | +0.05 |

Model resolution maps several local-inference providers (LM Studio, vLLM, SGLang, TGI) to `openai:model-name` and sets `OPENAI_BASE_URL` from the connector's `base_url` field, keeping local inference working without code changes. Ollama is mapped to `ollama:model-name`.

Sources: [rlm_code/rlm/frameworks/pydantic_ai_adapter.py:109-143](), [rlm_code/rlm/frameworks/pydantic_ai_adapter.py:144-190]()
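
The per-part reward table can be expressed as a lookup with a default. The sketch below keys on part type names as strings so it runs without `pydantic_ai` installed; treating every unlisted part as a text part, and clipping the total to `[-1.0, 1.0]`, are assumptions.

```python
# (action label, reward) per part type, taken from the table above.
PART_REWARDS: dict[str, tuple[str, float]] = {
    "ToolCallPart": ("tool_call", 0.02),
    "ToolReturnPart": ("tool_result", 0.06),
    "RetryPromptPart": ("retry_prompt", -0.05),
}
DEFAULT_PART = ("model_part", 0.05)


def part_to_step(part_type: str) -> tuple[str, float]:
    """Map a message part type name to its action label and reward."""
    return PART_REWARDS.get(part_type, DEFAULT_PART)


def clipped_total(part_types: list[str], low: float = -1.0, high: float = 1.0) -> float:
    """Sum the per-part rewards and clip the result to [low, high]."""
    total = sum(part_to_step(name)[1] for name in part_types)
    return max(low, min(high, total))
```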

---

### DeepAgents / LangGraph (`deepagents`)

**Install:** `pip install 'rlm-code[deepagents]'`

Wraps `deepagents.create_deep_agent()` which is itself built on LangGraph. The adapter converts LangChain's message types (`AIMessage`, `ToolMessage`, `HumanMessage`) into step records:

| LangChain message type | Produces |
|------------------------|---------|
| `AIMessage` with tool calls | `tool_call` step (+0.02, or +0.03 for `write_todos`/`read_todos`) |
| `AIMessage` with text content | `model_text` step (+0.05) |
| `ToolMessage` (success) | `tool_result` step (+0.06) |
| `ToolMessage` (error) | `tool_result` step (−0.05) |

DeepAgents additionally supports multiple execution backends, selectable via the `deepagents_backend` key in the `context` dict:

```python
# rlm_code/rlm/frameworks/deepagents_adapter.py:151-167
if backend_name == "filesystem":
    return FilesystemBackend(root=workdir)
if backend_name == "local_shell":
    return LocalShellBackend(cwd=workdir)
return StateBackend
```

Sources: [rlm_code/rlm/frameworks/deepagents_adapter.py:169-245](), [rlm_code/rlm/frameworks/deepagents_adapter.py:151-167]()

---

## How the Runner Uses Adapters

The sequence below shows the full call chain from a user task to a logged episode:

```mermaid
sequenceDiagram
    participant User
    participant Runner as Runner (runner.py)
    participant Registry
    participant Adapter
    participant Framework

    User->>Runner: run_task(task, framework="dspy-rlm")
    Runner->>Runner: _resolve_framework_id("dspy-rlm")
    Runner->>Registry: get("dspy-rlm")
    Registry-->>Runner: DSPyRLMFrameworkAdapter
    Runner->>Adapter: doctor()
    Adapter-->>Runner: (True, "dspy RLM available")
    Runner->>Adapter: run_episode(task, llm_connector, ...)
    Adapter->>Framework: dspy.RLM(...)(context, query)
    Framework-->>Adapter: prediction
    Adapter-->>Runner: FrameworkEpisodeResult(steps=[...], ...)
    Runner->>Runner: write step events to .jsonl
    Runner->>Runner: emit run_end event
    Runner-->>User: RLMRunResult
```

The runner's `_run_task_with_framework_adapter()` method (runner.py:1240-1388) handles the adapter path:

1. Looks up the adapter from the registry by `framework_id`.
2. Calls `adapter.doctor()` — fails fast with the adapter's human-readable install hint if the dependency is missing.
3. Calls `adapter.run_episode(...)`, passing the shared `llm_connector`, `workdir`, and RLM run parameters.
4. Iterates `episode.steps`, applies the global reward scale, and writes each step as a JSON event to a `.jsonl` file.
5. Writes a `"final"` event capturing `completed`, `total_reward`, `final_response`, token usage, and the adapter's `metadata` dict.

Sources: [rlm_code/rlm/runner.py:1240-1388]()
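
The five steps above can be sketched in a few lines. This is a simplified model, not the runner's actual code: `Step`, `Episode`, `EchoAdapter`, and `run_with_adapter` are hypothetical names, the registry is a plain dict, and the event field names are assumptions.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Step:  # stand-in for FrameworkStepRecord
    action: str
    reward: float


@dataclass
class Episode:  # stand-in for FrameworkEpisodeResult
    completed: bool
    final_response: str
    steps: list = field(default_factory=list)


class EchoAdapter:  # hypothetical adapter used only in this sketch
    framework_id = "echo"

    def doctor(self):
        return (True, "echo available")

    def run_episode(self, *, task, **_):
        return Episode(True, task.upper(), [Step("answer", 0.5)])


def run_with_adapter(registry, framework_id, task, out_path, reward_scale=1.0):
    adapter = registry[framework_id]              # 1. look up the adapter
    ok, detail = adapter.doctor()                 # 2. fail fast with the hint
    if not ok:
        raise RuntimeError(f"{framework_id} unavailable: {detail}")
    episode = adapter.run_episode(task=task)      # 3. run the episode
    total = 0.0
    with out_path.open("w", encoding="utf-8") as fh:
        for index, step in enumerate(episode.steps, start=1):
            reward = step.reward * reward_scale   # 4. scale and log each step
            total += reward
            event = {"type": "step", "step": index, "action": step.action, "reward": reward}
            fh.write(json.dumps(event) + "\n")
        final = {"type": "final", "completed": episode.completed,
                 "total_reward": total, "final_response": episode.final_response}
        fh.write(json.dumps(final) + "\n")        # 5. closing event
    return total


out_file = Path(tempfile.mkdtemp()) / "run.jsonl"
total = run_with_adapter({"echo": EchoAdapter()}, "echo", "hello", out_file, reward_scale=2.0)
events = [json.loads(line) for line in out_file.read_text(encoding="utf-8").splitlines()]
```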

---

## Adding a Custom Adapter

To plug in a new framework, implement the `RLMFrameworkAdapter` Protocol and register it before the runner starts:

```python
from rlm_code.rlm.frameworks.base import FrameworkEpisodeResult, FrameworkStepRecord
from dataclasses import dataclass
from typing import Any

@dataclass(slots=True)
class MyFrameworkAdapter:
    workdir: str
    framework_id: str = "my-framework"
    adapter_mode: str = "agent_loop"
    reference_impl: str = "my_framework (installed package)"

    def doctor(self) -> tuple[bool, str]:
        try:
            import my_framework  # noqa: F401
            return (True, "my-framework available")
        except ImportError:
            return (False, "pip install my-framework")

    def run_episode(self, *, task, llm_connector, max_steps, exec_timeout,
                    workdir, sub_model=None, sub_provider=None, context=None):
        import my_framework
        result = my_framework.run(task)
        return FrameworkEpisodeResult(
            completed=True,
            final_response=result.text,
            steps=[FrameworkStepRecord(action="answer", observation={"text": result.text}, reward=0.5)],
            total_reward=0.5,
        )

# Register it alongside the built-ins (`runner` is an existing RLMRunner instance)
runner.framework_registry.register(MyFrameworkAdapter(workdir=runner.workdir))
```

Because `FrameworkAdapterRegistry.register()` only checks `framework_id` (registry.py:18-21), no changes to the core runner are needed.

Sources: [rlm_code/rlm/frameworks/registry.py:18-21]()

---

## Health Check: `rlm frameworks doctor`

The registry's `doctor()` method runs every adapter's `doctor()` in one pass and returns a list of rows, including the `adapter_mode` and `reference_impl` fields the runner surfaces to operators:

```python
# rlm_code/rlm/frameworks/registry.py:48-61
def doctor(self) -> list[dict[str, Any]]:
    rows: list[dict[str, Any]] = []
    for framework_id, adapter in sorted(self._adapters.items()):
        ok, detail = adapter.doctor()
        rows.append({
            "framework": framework_id,
            "ok": bool(ok),
            "detail": str(detail),
            "mode": str(getattr(adapter, "adapter_mode", "adapter")),
            "reference": str(getattr(adapter, "reference_impl", "")),
        })
    return rows
```

A typical healthy output might look like:

| framework | ok | mode | reference |
|-----------|-----|------|-----------|
| adk-rlm | ✓ | native_rlm | adk_rlm/main.py (vendored sample package) |
| deepagents | ✓ | agent_loop | deepagents (installed package) |
| dspy-rlm | ✓ | native_rlm | dspy.RLM (installed package) |
| google-adk | ✗ | agent_loop | google.adk (installed package) |
| pydantic-ai | ✓ | agent_loop | pydantic_ai.Agent (installed package) |

Sources: [rlm_code/rlm/frameworks/registry.py:48-61]()
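
Because `doctor()` returns plain dicts, the rows are easy to post-process. A minimal sketch, assuming rows shaped like the ones built above; `unavailable_frameworks` is a hypothetical helper and the `detail` strings are placeholders, not real adapter output.

```python
def unavailable_frameworks(rows):
    """Return 'framework: detail' strings for adapters whose doctor() check failed."""
    return [f"{row['framework']}: {row['detail']}" for row in rows if not row["ok"]]


rows = [
    {"framework": "dspy-rlm", "ok": True, "detail": "dspy RLM available",
     "mode": "native_rlm", "reference": "dspy.RLM (installed package)"},
    {"framework": "google-adk", "ok": False, "detail": "<install hint from the adapter>",
     "mode": "agent_loop", "reference": "google.adk (installed package)"},
]
problems = unavailable_frameworks(rows)
```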

---

## Summary

The `rlm_code/rlm/frameworks/` package achieves clean framework extensibility through three cooperating pieces: the `RLMFrameworkAdapter` Protocol that defines a two-method contract, the `FrameworkAdapterRegistry` that maps string slugs to adapter instances, and the shared `FrameworkEpisodeResult` / `FrameworkStepRecord` dataclasses that carry every framework's output back into the same RLM trajectory machinery. The five built-in adapters (`dspy-rlm`, `adk-rlm`, `google-adk`, `pydantic-ai`, `deepagents`) each live in their own file, import their optional dependency lazily, and express readiness through `doctor()` — so a missing package surfaces a clear install hint rather than a cryptic import error. The runner itself remains unchanged regardless of which adapter is selected; it only calls `registry.get(framework_id)` followed by `adapter.run_episode(...)`, as shown in [rlm_code/rlm/runner.py:1240-1293]().

---

## 05. Benchmarks, Leaderboard & Observability — Did It Work?

> How RLMBenchmarkCase definitions drive automated runs, how scores flow into the leaderboard, how trajectory replay lets you re-watch any session, and how observability sinks (OTel-shaped JSONL, trace analysis) record what happened.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/05-benchmarks-leaderboard-observability-did-it-work.md
- Generated: 2026-05-22T02:09:16.443Z

### Source Files

- `rlm_code/rlm/benchmarks.py`
- `rlm_code/rlm/benchmark_manager.py`
- `rlm_code/rlm/leaderboard.py`
- `rlm_code/rlm/session_replay.py`
- `rlm_code/rlm/observability.py`
- `rlm_code/rlm/observability_sinks.py`
- `rlm_code/traces/store.py`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/benchmarks.py](rlm_code/rlm/benchmarks.py)
- [rlm_code/rlm/benchmark_manager.py](rlm_code/rlm/benchmark_manager.py)
- [rlm_code/rlm/leaderboard.py](rlm_code/rlm/leaderboard.py)
- [rlm_code/rlm/session_replay.py](rlm_code/rlm/session_replay.py)
- [rlm_code/rlm/observability.py](rlm_code/rlm/observability.py)
- [rlm_code/rlm/observability_sinks.py](rlm_code/rlm/observability_sinks.py)
- [rlm_code/traces/store.py](rlm_code/traces/store.py)
</details>

# Benchmarks, Leaderboard & Observability — Did It Work?

This page explains how RLM Code measures whether an agent run actually succeeded: how test cases are defined, how runs are executed and scored, how results flow into a ranked leaderboard, how every session can be replayed step-by-step after the fact, and how a layered observability system (local JSONL files up through OpenTelemetry, MLflow, LangSmith, LangFuse, and Logfire) records what happened in each run.

Think of it like a sports league: the benchmark cases are the game schedule, each `run_benchmark` call plays the games and records the score, the leaderboard ranks teams by multiple statistics, trajectory replay lets you rewatch any match in slow motion, and the observability sinks are the broadcast cameras capturing the action in real time.

---

## 1. Benchmark Cases: the Unit of Evaluation

The smallest piece of the benchmark system is `RLMBenchmarkCase`, a frozen dataclass that describes exactly one task to run.

```python
# rlm_code/rlm/benchmarks.py  lines 14-23
@dataclass(frozen=True, slots=True)
class RLMBenchmarkCase:
    case_id: str
    description: str
    task: str
    environment: str = "dspy"
    max_steps: int = 4
    exec_timeout: int = 30
```

The table below covers the fields that identify a case and control its execution:

| Field | Purpose |
|---|---|
| `case_id` | Unique key used to track results per case across runs |
| `task` | The plain-language prompt given to the agent |
| `environment` | Execution sandbox (`dspy`, `generic`, `pure_rlm`) |
| `max_steps` | Hard cap on agent loop iterations |
| `exec_timeout` | Per-step subprocess timeout in seconds |

### Built-in Preset Suites

Cases are organized into named *presets*. The table below lists all presets defined in the repository:

| Preset | Cases | Focus |
|---|---|---|
| `dspy_quick` | 3 | Fast DSPy coding smoke test |
| `dspy_extended` | 5 | Broader DSPy sweep |
| `generic_smoke` | 2 | Generic run_python sanity |
| `pure_rlm_smoke` | 3 | Pure RLM context-as-variable basics |
| `pure_rlm_context` | 4 | Chunking, accumulation, map-reduce |
| `deep_recursion` | 3 | Depth > 1 recursive delegation |
| `paradigm_comparison` | 3 | Pure RLM vs CodeAct side-by-side |
| `oolong_style` | 4 | OOLONG long-context tasks |
| `browsecomp_style` | 3 | BrowseComp web reasoning |
| `token_efficiency` | 3 | Token usage measurement |
| `dynamic_web_filtering` | 3 | Domain-scoped retrieval |

Sources: [rlm_code/rlm/benchmarks.py:26-478]()
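
To make the shape of a case concrete, the sketch below defines a local copy of the dataclass (without `slots=True`, so it also runs on older Python versions) and builds one custom case. The case content and the `my_smoke` preset name are hypothetical, and how a custom preset is registered with the runner is not shown here.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class RLMBenchmarkCase:  # local copy of the fields shown earlier
    case_id: str
    description: str
    task: str
    environment: str = "dspy"
    max_steps: int = 4
    exec_timeout: int = 30


case = RLMBenchmarkCase(
    case_id="csv_row_count",  # hypothetical case
    description="Count rows in a small CSV string",
    task="Use run_python to count the data rows in 'a,b\\n1,2\\n3,4'.",
    environment="generic",
    max_steps=3,
)
custom_preset = {"my_smoke": [case]}  # hypothetical preset name

# Frozen dataclasses reject mutation, so a case cannot drift between runs.
try:
    case.max_steps = 10
    mutated = True
except FrozenInstanceError:
    mutated = False
```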

### Loading External Packs

In addition to the built-in presets, `load_benchmark_packs` can parse external files in five shapes: an explicit `presets:` YAML block, a top-level `cases` list, a Google ADK `eval_cases` JSON, plain JSONL record rows, and generic `records`/`items` mappings. This makes it straightforward to import third-party evaluation datasets without modifying source code.

Sources: [rlm_code/rlm/benchmarks.py:527-579]()

---

## 2. Running a Benchmark Preset

`BenchmarkManagerMixin.run_benchmark` is the entry point that executes all cases in a preset and produces a persisted JSON summary.

```python
# rlm_code/rlm/benchmark_manager.py  lines 156-173
def run_benchmark(
    self,
    *,
    preset: str = "dspy_quick",
    mode: str = "native",
    ...
) -> RLMBenchmarkResult:
    """Execute a benchmark preset and persist aggregate summary."""
```

Three execution modes are supported:

| Mode | What it does |
|---|---|
| `native` | Runs each case through `RLMRunner.run_task` (full agent loop) |
| `harness` | Delegates to `HarnessRunner` with optional MCP tool access |
| `direct-llm` | Single LLM call, no tool loop — baseline comparison |

After all cases finish, the mixin computes aggregate statistics and writes a timestamped JSON file under `<workdir>/benchmarks/bench_YYYYMMDD_HHMMSS_<μs>.json`.

```python
# rlm_code/rlm/benchmark_manager.py  lines 282-317
avg_reward = (sum(total_rewards) / attempted_cases) if attempted_cases else 0.0
avg_steps = (sum(total_steps) / attempted_cases) if attempted_cases else 0.0
duration_stats = self._summarize_distribution(durations)
usage_totals = self._aggregate_usage_totals(case_results)
...
summary_path.write_text(json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8")
```

The summary JSON contains: `benchmark_id`, `preset`, `mode`, `avg_reward`, `avg_steps`, `latency_seconds` (with avg/p50/p95/p99/max), `usage_totals`, and the full `case_results` list.

Sources: [rlm_code/rlm/benchmark_manager.py:156-333]()
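
The `latency_seconds` block reports avg/p50/p95/p99/max. The real `_summarize_distribution` helper is not reproduced on this page, so the sketch below is only one plausible way to compute those figures, using the nearest-rank percentile definition (an assumption; the repository may interpolate instead).

```python
import math


def summarize_distribution(values):
    """Return avg, p50, p95, p99, and max for a list of durations."""
    if not values:
        return {"avg": 0.0, "p50": 0.0, "p95": 0.0, "p99": 0.0, "max": 0.0}
    ordered = sorted(values)

    def nearest_rank(pct):
        # Smallest sample such that at least pct percent of samples are <= it.
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "avg": sum(ordered) / len(ordered),
        "p50": nearest_rank(50),
        "p95": nearest_rank(95),
        "p99": nearest_rank(99),
        "max": ordered[-1],
    }


durations = [1.0, 2.0, 3.0, 4.0, 10.0]
stats = summarize_distribution(durations)
```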

### CI-Style Comparison Gates

`compare_benchmarks` loads two summary files (by ID, keyword `latest`/`previous`, or file path) and evaluates four quality gates:

```python
# rlm_code/rlm/benchmark_manager.py  lines 410-431
gates = {
    "reward": reward_delta >= float(min_reward_delta),
    "completion": completion_delta >= float(min_completion_delta),
    "steps": steps_increase <= float(max_steps_increase),
    "completion_regressions": (
        case_summary["completion_regressions"] == 0
        if fail_on_completion_regression
        else True
    ),
}
...
passed=all(bool(value) for value in gates.values()),
```

When all four gates pass, `RLMBenchmarkComparison.passed` is `True`. Reports can be exported in Markdown, CSV, or JSON format via `export_benchmark_report`.

Sources: [rlm_code/rlm/benchmark_manager.py:367-495]()
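
The gate logic can be exercised in isolation. In this sketch the summary dicts, their key names (`completion_rate`, `completed`), and the default thresholds are all assumptions for illustration; only the four gate expressions mirror the excerpt above.

```python
def evaluate_gates(baseline, candidate, *, min_reward_delta=0.0,
                   min_completion_delta=0.0, max_steps_increase=0.0,
                   fail_on_completion_regression=True):
    """Compare two benchmark summaries and return (gates, passed)."""
    reward_delta = candidate["avg_reward"] - baseline["avg_reward"]
    completion_delta = candidate["completion_rate"] - baseline["completion_rate"]
    steps_increase = candidate["avg_steps"] - baseline["avg_steps"]
    # A regression is a case that completed before and no longer completes.
    regressions = sum(
        1 for case_id, done in baseline["completed"].items()
        if done and not candidate["completed"].get(case_id, False)
    )
    gates = {
        "reward": reward_delta >= min_reward_delta,
        "completion": completion_delta >= min_completion_delta,
        "steps": steps_increase <= max_steps_increase,
        "completion_regressions": regressions == 0 if fail_on_completion_regression else True,
    }
    return gates, all(gates.values())


baseline = {"avg_reward": 0.5, "completion_rate": 0.5, "avg_steps": 3.0,
            "completed": {"a": True, "b": False}}
candidate = {"avg_reward": 0.75, "completion_rate": 0.5, "avg_steps": 3.5,
             "completed": {"a": True, "b": False}}

gates_strict, passed_strict = evaluate_gates(baseline, candidate)
gates_loose, passed_loose = evaluate_gates(baseline, candidate, max_steps_increase=1.0)
```

With the strict defaults the run fails only the `steps` gate; allowing one extra step on average lets it pass.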

### LLM Judge for Predictions

`judge_predictions` provides an LLM-as-judge layer on top of raw model outputs. It matches predictions to a reference JSONL by `question_id`, calls the configured LLM with a task-type-aware prompt (temporal reasoning, knowledge update, single-session preference, or generic), and appends `autoeval_label: {model, label, raw}` to a results file. It supports resumption: already-judged IDs are skipped.

Sources: [rlm_code/rlm/benchmark_manager.py:497-659]()
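
The resumption behaviour can be sketched as "read the IDs already in the results file, then skip them". Everything below is a simplified assumption: `judge_with_resume` and `stub_judge` are hypothetical names, the `answer` key is invented for the example, and the stub replaces the LLM call with an exact-match check.

```python
import json
import tempfile
from pathlib import Path


def judge_with_resume(predictions, references, results_path, judge):
    """Judge predictions not yet present in results_path; return how many were judged."""
    already_judged = set()
    if results_path.exists():
        for line in results_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                already_judged.add(json.loads(line)["question_id"])
    judged = 0
    with results_path.open("a", encoding="utf-8") as fh:
        for pred in predictions:
            qid = pred["question_id"]
            if qid in already_judged or qid not in references:
                continue  # skip finished IDs and predictions with no reference
            raw = judge(references[qid], pred["answer"])
            label = {"model": "stub-judge", "label": raw.strip().lower(), "raw": raw}
            fh.write(json.dumps(dict(pred, autoeval_label=label)) + "\n")
            judged += 1
    return judged


def stub_judge(reference, answer):
    # Stand-in for the LLM call: exact match instead of a model verdict.
    return "Yes" if reference == answer else "No"


results_file = Path(tempfile.mkdtemp()) / "judged.jsonl"
references = {"q1": "Paris", "q2": "4"}
predictions = [{"question_id": "q1", "answer": "Paris"},
               {"question_id": "q2", "answer": "5"}]
first_pass = judge_with_resume(predictions, references, results_file, stub_judge)
second_pass = judge_with_resume(predictions, references, results_file, stub_judge)
```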

---

## 3. Leaderboard: Ranking Results Across Runs

The `Leaderboard` class loads results from two sources and ranks them by configurable metrics.

```text
Sources loaded by Leaderboard.load_all():

  <workdir>/rlm/benchmarks/*.json   ←  one file per benchmark preset run
  <workdir>/observability/runs.jsonl ←  one line per individual task run
```

### Ranking Metrics

```python
# rlm_code/rlm/leaderboard.py  lines 29-57
class RankingMetric(Enum):
    REWARD = "reward"           # higher is better
    COMPLETION_RATE = "completion_rate"  # higher is better
    STEPS = "steps"             # lower is better
    TOKENS = "tokens"           # lower is better
    COST = "cost"               # lower is better
    DURATION = "duration"       # lower is better
    EFFICIENCY = "efficiency"   # reward per 1000 tokens, higher is better
```

`LeaderboardEntry` computes `efficiency = (avg_reward * 1000) / total_tokens` in `__post_init__`, so it is always available without a separate query.

Sources: [rlm_code/rlm/leaderboard.py:60-107]()

### Filtering

`LeaderboardFilter` supports slicing by environment, model, preset, tags, reward range, completion rate, date window, and minimum case count. The `rank()` method applies the filter before sorting:

```python
# rlm_code/rlm/leaderboard.py  lines 414-459
def rank(self, metric, order=None, limit=None, filter=None) -> RankingResult:
    entries = self._entries
    if filter:
        entries = [e for e in entries if filter.matches(e)]
    ...
    entries = sorted(entries, key=lambda e: e.get_metric(metric), reverse=reverse)
```

Sources: [rlm_code/rlm/leaderboard.py:414-459]()
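
A small model of filter-then-sort ranking, including the efficiency formula from earlier in this section. `Entry` is a simplified stand-in for `LeaderboardEntry`, the filter is reduced to a plain predicate, and the zero-token guard is an assumption (the page does not show how the real class handles that case).

```python
from dataclasses import dataclass

LOWER_IS_BETTER = {"steps", "tokens", "cost", "duration"}


@dataclass
class Entry:  # simplified stand-in for LeaderboardEntry
    name: str
    avg_reward: float
    total_tokens: int
    efficiency: float = 0.0

    def __post_init__(self):
        # Reward per 1000 tokens; guarded so empty runs do not divide by zero.
        if self.total_tokens:
            self.efficiency = (self.avg_reward * 1000) / self.total_tokens


METRIC_KEYS = {
    "reward": lambda e: e.avg_reward,
    "tokens": lambda e: e.total_tokens,
    "efficiency": lambda e: e.efficiency,
}


def rank(entries, metric, limit=None, predicate=None):
    """Filter first, then sort; direction depends on the metric."""
    if predicate is not None:
        entries = [e for e in entries if predicate(e)]
    ranked = sorted(entries, key=METRIC_KEYS[metric], reverse=metric not in LOWER_IS_BETTER)
    return ranked[:limit] if limit else ranked


entries = [Entry("a", 0.8, 4000), Entry("b", 0.6, 1000), Entry("c", 0.9, 9000)]
by_reward = [e.name for e in rank(entries, "reward")]
by_efficiency = [e.name for e in rank(entries, "efficiency")]
cheap_only = [e.name for e in rank(entries, "reward", predicate=lambda e: e.total_tokens < 5000)]
```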

### Export Formats

The leaderboard can be exported as a Rich terminal table (via `format_rich_table`), JSON, CSV, or Markdown. The Markdown export includes both the ranked table and a statistics block (mean, median, std dev, range).

Sources: [rlm_code/rlm/leaderboard.py:599-637]()

---

## 4. Session Replay: Rewatching Any Run

The replay system lets you load a saved run and scrub through it step by step, inspect variables and memory state at any point, find all error steps, or compare two runs to find where they diverged.

### Recording

`SessionRecorder` is the write side. During a run it records typed `SessionEvent` objects (lifecycle, step start/action/result/end, LLM request/response, child spawn, final detection, error, checkpoint) and appends them to a JSONL file in real time.

```python
# rlm_code/rlm/session_replay.py  lines 37-68
class SessionEventType(Enum):
    SESSION_START = "session_start"
    SESSION_END = "session_end"
    STEP_START = "step_start"
    STEP_ACTION = "step_action"
    STEP_RESULT = "step_result"
    STEP_END = "step_end"
    LLM_REQUEST = "llm_request"
    LLM_RESPONSE = "llm_response"
    CHILD_SPAWN = "child_spawn"
    CHILD_RESULT = "child_result"
    FINAL_DETECTED = "final_detected"
    CHECKPOINT = "checkpoint"
    ERROR = "error"
```

At any point, `create_checkpoint()` materializes a `SessionSnapshot` (a full state object) and writes it to disk.

Sources: [rlm_code/rlm/session_replay.py:273-585]()

### Replaying

`SessionReplayer` wraps a `SessionSnapshot` and offers forward/backward navigation:

```python
# rlm_code/rlm/session_replay.py  lines 668-715
def step_forward(self) -> StepState | None: ...
def step_backward(self) -> StepState | None: ...
def goto_step(self, step: int) -> StepState | None: ...
def find_errors(self) -> list[StepState]: ...
def find_successes(self) -> list[StepState]: ...
```

It can be loaded from either a compact `.json` snapshot or a raw `.jsonl` trajectory file. Legacy trajectory formats are converted automatically via `_convert_legacy_event` and `_convert_legacy_step`.

Sources: [rlm_code/rlm/session_replay.py:593-744]()
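
The navigation methods amount to a cursor over a list of steps. The class below is a hypothetical, much-reduced model of that idea, not `SessionReplayer` itself: steps are plain dicts, step numbers are assumed to be 1-based, and out-of-range moves return `None`.

```python
class ReplayCursor:  # hypothetical, simplified stand-in for SessionReplayer
    def __init__(self, steps):
        self._steps = list(steps)
        self._pos = -1  # positioned before the first step

    def step_forward(self):
        if self._pos + 1 >= len(self._steps):
            return None
        self._pos += 1
        return self._steps[self._pos]

    def step_backward(self):
        if self._pos <= 0:
            return None
        self._pos -= 1
        return self._steps[self._pos]

    def goto_step(self, step):
        if not 1 <= step <= len(self._steps):
            return None
        self._pos = step - 1
        return self._steps[self._pos]

    def find_errors(self):
        return [s for s in self._steps if not s["success"]]


cursor = ReplayCursor([
    {"step": 1, "success": True},
    {"step": 2, "success": False},
    {"step": 3, "success": True},
])
first = cursor.step_forward()
jumped = cursor.goto_step(3)
past_end = cursor.step_forward()
back = cursor.step_backward()
```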

### `StepState` fields

Each replayed step carries the full picture:

| Field | Content |
|---|---|
| `action_type`, `action_code` | What the agent decided to do |
| `action_rationale` | LLM's reasoning text |
| `success`, `output`, `error` | Observation result |
| `reward`, `cumulative_reward` | Per-step and running score |
| `memory_notes` | Active memory at this point |
| `variables` | REPL variable state |
| `tokens_used` | Token cost for this step |

Sources: [rlm_code/rlm/session_replay.py:119-160]()

### Session Comparison

`compare_sessions` finds the first step where two sessions diverge (by action type, code, or success flag) and computes reward/token/efficiency deltas:

```python
# rlm_code/rlm/session_replay.py  lines 930-992
def compare_sessions(snapshot_a, snapshot_b) -> SessionComparison:
    ...
    a_efficiency = (snapshot_a.total_reward * 1000) / snapshot_a.total_tokens
```
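
A minimal sketch of the divergence search, assuming each step is a dict with `action_type`, `action_code`, and `success` keys. Treating a length mismatch as a divergence at the first missing step, and returning 0.0 efficiency for zero tokens, are assumptions of this example rather than documented behaviour.

```python
def first_divergence(steps_a, steps_b):
    """Return the 1-based index of the first differing step, or None if identical."""
    for index, (a, b) in enumerate(zip(steps_a, steps_b), start=1):
        key_a = (a["action_type"], a["action_code"], a["success"])
        key_b = (b["action_type"], b["action_code"], b["success"])
        if key_a != key_b:
            return index
    if len(steps_a) != len(steps_b):
        return min(len(steps_a), len(steps_b)) + 1
    return None


def efficiency(total_reward, total_tokens):
    # Reward per 1000 tokens, guarded against empty sessions.
    return (total_reward * 1000) / total_tokens if total_tokens else 0.0


run_a = [
    {"action_type": "run_python", "action_code": "len(ctx)", "success": True},
    {"action_type": "final", "action_code": "", "success": True},
]
run_b = [
    {"action_type": "run_python", "action_code": "len(ctx)", "success": True},
    {"action_type": "run_python", "action_code": "ctx[:10]", "success": False},
    {"action_type": "final", "action_code": "", "success": True},
]
diverged_at = first_divergence(run_a, run_b)
efficiency_delta = efficiency(0.5, 1000) - efficiency(0.5, 2000)
```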

---

## 5. Observability Sinks: What Happened, Sent Everywhere

The observability layer is built around `RLMObservabilitySink`, a Protocol with three hooks: `on_run_start`, `on_step`, and `on_run_end`. `RLMObservability` is a coordinator that fans out calls to all configured sinks, catching per-sink exceptions so a failing sink never breaks the run.

```python
# rlm_code/rlm/observability.py  lines 46-85
class RLMObservabilitySink(Protocol):
    name: str
    def on_run_start(self, run_id, *, task, environment, params) -> None: ...
    def on_step(self, run_id, *, event, cumulative_reward) -> None: ...
    def on_run_end(self, run_id, *, result, run_path) -> None: ...
```

Sources: [rlm_code/rlm/observability.py:46-85]()
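
The fan-out with per-sink exception handling can be sketched as follows. `FanOut`, `ListSink`, and `BrokenSink` are hypothetical names for this example; the real coordinator may log failures differently, whereas this sketch just collects them in a list.

```python
class FanOut:  # simplified model of the coordinator
    def __init__(self, sinks):
        self.sinks = list(sinks)
        self.errors = []

    def _emit(self, hook, run_id, **kwargs):
        for sink in self.sinks:
            try:
                getattr(sink, hook)(run_id, **kwargs)
            except Exception as exc:  # a failing sink must not break the run
                self.errors.append((sink.name, hook, repr(exc)))

    def on_run_start(self, run_id, **kwargs):
        self._emit("on_run_start", run_id, **kwargs)

    def on_step(self, run_id, **kwargs):
        self._emit("on_step", run_id, **kwargs)

    def on_run_end(self, run_id, **kwargs):
        self._emit("on_run_end", run_id, **kwargs)


class ListSink:
    name = "list"

    def __init__(self):
        self.events = []

    def on_run_start(self, run_id, *, task, environment, params):
        self.events.append(("start", run_id))

    def on_step(self, run_id, *, event, cumulative_reward):
        self.events.append(("step", run_id))

    def on_run_end(self, run_id, *, result, run_path):
        self.events.append(("end", run_id))


class BrokenSink:
    name = "broken"

    def _fail(self, run_id, **_):
        raise ConnectionError("backend unreachable")

    on_run_start = on_step = on_run_end = _fail


good = ListSink()
obs = FanOut([BrokenSink(), good])
obs.on_run_start("r1", task="demo", environment="generic", params={})
obs.on_step("r1", event={"action": "run_python"}, cumulative_reward=0.1)
obs.on_run_end("r1", result={"completed": True}, run_path="run.jsonl")
```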

### Sink Activation via Environment Variables

The master switch is `DSPY_RLM_OBS_ENABLED` (default `True`). The local JSONL sink is on by default; every external sink is opt-in:

| Env variable | Sink | Default |
|---|---|---|
| `DSPY_RLM_OBS_LOCAL_JSONL` | `LocalJSONLSink` | `True` |
| `DSPY_RLM_MLFLOW_ENABLED` | `MLflowSink` | `False` |
| `DSPY_RLM_OTEL_ENABLED` | `OpenTelemetrySink` | `False` |
| `DSPY_RLM_LANGSMITH_ENABLED` | `LangSmithSink` | `False` |
| `DSPY_RLM_LANGFUSE_ENABLED` | `LangFuseSink` | `False` |
| `DSPY_RLM_LOGFIRE_ENABLED` | `LogfireSink` | `False` |

Sources: [rlm_code/rlm/observability.py:308-357]()
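
The activation rules in the table can be modelled with a small flag parser. The variable names and defaults come from the table; the set of strings accepted as "true" and the helper names `env_flag` and `enabled_sinks` are assumptions, since the real parsing code is not shown on this page.

```python
import os

TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed accepted spellings

SINK_FLAGS = {  # env variable -> (sink name, default)
    "DSPY_RLM_OBS_LOCAL_JSONL": ("LocalJSONLSink", True),
    "DSPY_RLM_MLFLOW_ENABLED": ("MLflowSink", False),
    "DSPY_RLM_OTEL_ENABLED": ("OpenTelemetrySink", False),
    "DSPY_RLM_LANGSMITH_ENABLED": ("LangSmithSink", False),
    "DSPY_RLM_LANGFUSE_ENABLED": ("LangFuseSink", False),
    "DSPY_RLM_LOGFIRE_ENABLED": ("LogfireSink", False),
}


def env_flag(name, default, environ=None):
    """Read a boolean flag, falling back to the default when unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUE_VALUES


def enabled_sinks(environ=None):
    """Return the sink names that would be active for a given environment."""
    if not env_flag("DSPY_RLM_OBS_ENABLED", True, environ):
        return []  # master switch off disables everything
    return [sink for var, (sink, default) in SINK_FLAGS.items()
            if env_flag(var, default, environ)]


defaults_only = enabled_sinks({})
with_otel = enabled_sinks({"DSPY_RLM_OTEL_ENABLED": "true"})
all_off = enabled_sinks({"DSPY_RLM_OBS_ENABLED": "0", "DSPY_RLM_OTEL_ENABLED": "true"})
```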

### LocalJSONLSink: Default-On Local Storage

This sink never requires an external service. It writes two files:

- `<workdir>/observability/runs.jsonl` — one JSON line per completed run (aggregated)
- `<workdir>/observability/steps/<run_id>.jsonl` — one line per step within a run

```python
# rlm_code/rlm/observability.py  lines 87-181
@dataclass(slots=True)
class LocalJSONLSink:
    base_dir: Path
    runs_file: Path   # observability/runs.jsonl
    steps_dir: Path   # observability/steps/
```

Each step line carries: `timestamp`, `run_id`, `step`, `action`, `reward`, `cumulative_reward`, `success`.

Sources: [rlm_code/rlm/observability.py:87-181]()
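
Since the step files are plain JSONL, they can be inspected without any RLM Code imports. The sketch below writes a small fake step file to a temporary directory and reads it back; the records use the field names listed above, while the values and the helper names `load_steps` and `summarize_run` are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def load_steps(steps_dir, run_id):
    """Read observability/steps/<run_id>.jsonl into a list of dicts."""
    path = steps_dir / f"{run_id}.jsonl"
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def summarize_run(steps):
    return {
        "steps": len(steps),
        "final_cumulative_reward": steps[-1]["cumulative_reward"] if steps else 0.0,
        "failed_steps": [s["step"] for s in steps if not s["success"]],
    }


# Fake data standing in for a real run.
steps_dir = Path(tempfile.mkdtemp()) / "observability" / "steps"
steps_dir.mkdir(parents=True)
fake_steps = [
    {"run_id": "run_demo", "step": 1, "action": "run_python", "reward": 0.25,
     "cumulative_reward": 0.25, "success": True},
    {"run_id": "run_demo", "step": 2, "action": "run_python", "reward": -0.5,
     "cumulative_reward": -0.25, "success": False},
    {"run_id": "run_demo", "step": 3, "action": "final", "reward": 0.5,
     "cumulative_reward": 0.25, "success": True},
]
(steps_dir / "run_demo.jsonl").write_text(
    "\n".join(json.dumps(s) for s in fake_steps) + "\n", encoding="utf-8")

summary = summarize_run(load_steps(steps_dir, "run_demo"))
```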

### OpenTelemetrySink: Distributed Tracing

When enabled, this sink creates one parent OTEL span per run (`rlm.run`) and one child span per step (`rlm.step`). Code and output are added as span events. Metrics instruments (`rlm.runs`, `rlm.steps`, `rlm.reward`, `rlm.run_duration`) can be exported via OTLP to any compatible backend (Jaeger, Grafana Tempo, etc.).

```python
# rlm_code/rlm/observability_sinks.py  lines 173-196
span = self._tracer.start_span(
    "rlm.run",
    attributes={
        "rlm.run_id": run_id,
        "rlm.task": task[:500],
        "rlm.environment": environment,
        "rlm.max_steps": params.get("max_steps", 0),
    },
)
```

The trace ID for each run is available via `get_trace_id(run_id)` for cross-system correlation.

Sources: [rlm_code/rlm/observability_sinks.py:42-321]()

### Other Optional Sinks

| Sink | Integration | Key mechanism |
|---|---|---|
| `MLflowSink` | MLflow experiment tracking | `log_metric` per step, `log_artifact` on run end |
| `LangSmithSink` | LangChain's observability | `RunTree` with parent/child structure |
| `LangFuseSink` | Open-source LLM tracing | `trace()` + per-step `span()` + `score()` on completion |
| `LogfireSink` | Pydantic's Logfire | OTEL-compatible `span()` context manager |

All optional sinks degrade gracefully: if the package is not installed or the connection fails, `_available` is set to `False` and subsequent calls are silently skipped.

Sources: [rlm_code/rlm/observability_sinks.py:328-965]()

---

## 6. Trace Analysis: Querying Recorded OTel Spans

For richer post-hoc investigation, `TraceStore` provides a read-only query API over a JSONL file of raw OTEL spans plus a sidecar index.

```python
# rlm_code/traces/store.py  lines 73-91
class TraceStore:
    """Read-only query API over a trace JSONL file and sidecar index."""

    @classmethod
    def load(cls, trace_path, index_path=None) -> "TraceStore": ...
```

Key operations:

| Method | What it returns |
|---|---|
| `get_overview()` | Service names, model names, agent names, error counts, token totals |
| `query_traces()` | Paginated summary list, filterable by `has_errors`, model, service, agent |
| `view_trace(trace_id)` | All spans for one trace, capped at 150 000 chars; returns oversized diagnostics if exceeded |
| `view_spans(trace_id, span_ids)` | Surgical extraction of specific spans with higher attribute cap (16 384 chars) |
| `search_trace(trace_id, pattern)` | Text-searches raw bytes using the index's stored byte offsets for efficiency |
| `export_evidence_corpus()` | Layered Markdown+JSONL export for harness-optimization agents |

The search uses byte-offset seeks into the JSONL file rather than scanning all lines, making pattern search efficient even on large trace files.

Sources: [rlm_code/traces/store.py:73-268]()
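
The byte-offset idea can be illustrated independently of `TraceStore`. In this sketch the index is a simple in-memory dict mapping a trace ID to `(offset, length)` pairs; the real sidecar index format is not described on this page, and the span field names used here are assumptions.

```python
import json
import tempfile
from pathlib import Path


def build_offset_index(path):
    """Map each trace_id to the (byte offset, byte length) of its span lines."""
    index = {}
    with path.open("rb") as fh:
        while True:
            offset = fh.tell()
            line = fh.readline()
            if not line:
                break
            span = json.loads(line)
            index.setdefault(span["trace_id"], []).append((offset, len(line)))
    return index


def search_trace(path, index, trace_id, pattern):
    """Seek straight to one trace's spans and match the pattern on raw bytes."""
    needle = pattern.encode("utf-8")
    hits = []
    with path.open("rb") as fh:
        for offset, length in index.get(trace_id, []):
            fh.seek(offset)
            raw = fh.read(length)
            if needle in raw:
                hits.append(json.loads(raw)["span_id"])
    return hits


trace_file = Path(tempfile.mkdtemp()) / "traces.jsonl"
spans = [
    {"trace_id": "t1", "span_id": "s1", "message": "tool timeout error"},
    {"trace_id": "t2", "span_id": "s2", "message": "timeout in other trace"},
    {"trace_id": "t1", "span_id": "s3", "message": "ok"},
]
trace_file.write_text("".join(json.dumps(s) + "\n" for s in spans), encoding="utf-8")

offsets = build_offset_index(trace_file)
matches = search_trace(trace_file, offsets, "t1", "timeout")
```

Only the two lines belonging to `t1` are read during the search; the `t2` span is never touched even though it also contains the pattern.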

---

## 7. End-to-End Data Flow

```text
RLMBenchmarkCase
  │  defined in benchmarks.py or loaded from YAML/JSON/JSONL pack
  ▼
BenchmarkManagerMixin.run_benchmark()
  │  iterates cases → run_task() / HarnessRunner.run() / direct LLM
  │  RLMObservability.on_run_start / on_step / on_run_end fired for each case
  ▼
  ┌─────────────────────────────────────────────────────┐
  │  LocalJSONLSink  →  runs.jsonl + steps/<id>.jsonl   │
  │  MLflowSink      →  experiment metrics              │
  │  OpenTelemetrySink → OTEL spans → any OTLP backend  │
  │  LangSmithSink   →  RunTree in LangSmith project    │
  │  LangFuseSink    →  traces + scores in LangFuse     │
  │  LogfireSink     →  structured spans in Logfire     │
  └─────────────────────────────────────────────────────┘
  │
  ▼  benchmarks/<bench_id>.json (summary with avg_reward, avg_steps, …)
  │
  ├─→ Leaderboard.load_all()  →  rank() by REWARD / EFFICIENCY / TOKENS / …
  │      Export: Rich table, JSON, CSV, Markdown
  │
  ├─→ compare_benchmarks()  →  CI gate: reward / completion / steps / regressions
  │      Export: Markdown / CSV / JSON report
  │
  └─→ SessionReplayer.from_jsonl()  →  step_forward / goto_step / find_errors
         SessionStore.save_snapshot() / load_checkpoint()
```

---

## Summary

The benchmark and observability stack in RLM Code covers the full "did it work?" question from multiple angles: `RLMBenchmarkCase` definitions drive automated runs through `run_benchmark`, per-run JSON summaries feed into the `Leaderboard` for multi-metric ranking, `compare_benchmarks` provides CI-style regression gates, `SessionReplayer` lets you re-examine any session step by step, and `RLMObservability` fans telemetry out to local JSONL files as a reliable baseline while optionally forwarding to MLflow, OpenTelemetry, LangSmith, LangFuse, or Logfire without changing any agent code. All sinks are BYOK/BYOC: they activate via environment variables and degrade gracefully when libraries are absent.

Sources: [rlm_code/rlm/observability.py:302-357]()

---

## 06. The One Map to Keep — Core Idea, Key Files, What to Read Next

> A plain-English recap of the whole system: the single analogy that holds, the five files that matter most, the two constraints every newcomer hits, and where to go from here.

- Page Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/pages/06-the-one-map-to-keep-core-idea-key-files-what-to-read-next.md
- Generated: 2026-05-22T02:05:37.725Z

### Source Files

- `rlm_code/rlm/runner.py`
- `rlm_code/rlm/environments.py`
- `rlm_code/rlm/frameworks/registry.py`
- `rlm_code/rlm/benchmarks.py`
- `docs/core/environments.md`
- `docs/core/execution-patterns.md`

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
- [rlm_code/rlm/pure_rlm_environment.py](rlm_code/rlm/pure_rlm_environment.py)
- [rlm_code/rlm/frameworks/registry.py](rlm_code/rlm/frameworks/registry.py)
- [rlm_code/rlm/benchmarks.py](rlm_code/rlm/benchmarks.py)
- [docs/core/environments.md](docs/core/environments.md)
- [docs/core/execution-patterns.md](docs/core/execution-patterns.md)
- [README.md](README.md)
</details>

# The One Map to Keep — Core Idea, Key Files, What to Read Next

RLM Code implements the **Recursive Language Models** paradigm from the 2025 paper of the same name. The central problem it solves: large documents and contexts don't fit in a model's token window without expensive chunking or truncation. RLM Code's answer is to keep that data *outside* the window — stored as a Python variable in a sandboxed REPL — and let the model write code to analyze it step by step. The model calls `llm_query()` from inside that code when it needs the LLM's reasoning on a chunk. The result is a tight, measurable loop: propose an action, execute it in a sandbox, observe the outcome, accumulate a reward signal, update a short rolling memory, repeat.

This page gives you the single mental model that holds for the whole system, names the five files that define it, identifies the two constraints that trip up newcomers, and points you at what to read next.

---

## The One Analogy That Holds

Think of RLM Code as a **lab manager running experiments**.

- The **task** is the experiment brief.
- The **LLM** is the researcher who proposes the next action, always in JSON.
- The **sandbox** (Docker or Monty) is the lab bench where actions run safely.
- The **environment** is the rulebook: what actions are legal, what rewards each outcome gets.
- The **runner** is the lab manager keeping the journal — it records every action, reward, and observation to a JSONL file, manages the deadline, and signals completion.

In the "pure RLM" mode, the context (a large document, a dataset) lives in the lab's storage room as a named variable. The researcher never sees it directly — only a short preview and its size. They write code to fetch chunks, call the LLM on those chunks, and build up a result in another variable. This is what the paper calls *context stored outside the token window*.

Everything else in the codebase — framework adapters, benchmarks, observability, chat sessions — is scaffolding around that core loop.

---

## The Five Files That Matter Most

```
rlm_code/rlm/
├── runner.py            ← The lab manager. Drives the loop, persists trajectory.
├── environments.py      ← The rulebooks. Defines legal actions and reward formulas.
├── pure_rlm_environment.py  ← Paper-faithful mode. Context as REPL variable.
├── frameworks/registry.py   ← Plug-in adapters for DSPy, ADK, Pydantic-AI, etc.
└── benchmarks.py        ← Named task packs for reproducible evaluation.
```

### 1. `runner.py` — The Loop Itself

`RLMRunner.run_task()` is the method that does the work. At its core it is a `for step_index in range(1, max_steps + 1)` loop:

1. Ask the environment to build a **planner prompt** from the task, rolling memory (last 8 notes), and recent trajectory (last 3 steps).
2. Send that prompt to the LLM via `_propose_step_candidates()` and pick the highest-scoring candidate action.
3. Dispatch the action — either to the environment's `execute_action()`, or to `_execute_delegate_action()` for recursive subtasks.
4. Apply the global reward scale, accumulate the total reward, write a JSONL step event, update rolling memory.
5. If `action_result.done` is `True`, break.

Every run produces a `<run_id>.jsonl` file under `.rlm_code/rlm/runs/`. The final event in that file carries `completed`, `total_reward`, `steps`, and `usage` counts.

Sources: [rlm_code/rlm/runner.py:634-774]()
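
The loop described above can be sketched with the LLM and the environment replaced by stubs. `run_loop`, `stub_propose`, and `stub_execute` are hypothetical names, candidate scoring is omitted, and the returned dict only approximates the final event's fields.

```python
from collections import deque


def run_loop(propose, execute, task, max_steps=4, reward_scale=1.0):
    memory = deque(maxlen=8)  # rolling memory keeps the last 8 notes
    trajectory = []
    total_reward = 0.0
    completed = False
    for step_index in range(1, max_steps + 1):
        # The planner sees the task, current memory, and the last 3 steps.
        action = propose(task, list(memory), trajectory[-3:])
        observation, reward, done = execute(action)
        reward *= reward_scale
        total_reward += reward
        trajectory.append({"step": step_index, "action": action,
                           "observation": observation, "reward": reward})
        memory.append(f"step {step_index}: {action['action']} -> {observation}")
        if done:
            completed = True
            break
    return {"completed": completed, "total_reward": total_reward,
            "steps": len(trajectory)}


def stub_propose(task, memory, recent):
    # Stand-in for the LLM: run code once, then finish.
    if not recent:
        return {"action": "run_python", "code": "print(1 + 1)"}
    return {"action": "final"}


def stub_execute(action):
    # Stand-in for the environment: fixed observation and reward per action.
    if action["action"] == "final":
        return ("done", 0.5, True)
    return ("2", 0.25, False)


result = run_loop(stub_propose, stub_execute, "add two numbers")
```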

### 2. `environments.py` — What the LLM Is Allowed to Do

The `RLMEnvironment` protocol declares four methods: `system_prompt()`, `planner_prompt()`, `execute_action()`, and `doctor_checks()`. Every concrete environment implements those four.

Three concrete environments live here:

| Class | Alias(es) | Key actions |
|---|---|---|
| `GenericRLMEnvironment` | `generic`, `rlm` | `run_python`, `final` |
| `DSPyCodingRLMEnvironment` | `dspy`, `dspy-coding`, `framework` | All generic + `write_file`, `patch_file`, `read_file`, `search_code`, `list_tree`, `run_tests`, `analyze_code`, `llm_query`, `delegate` |
| `TraceAnalysisEnvironment` | `trace_analysis`, `traces` | Trace-oriented: `set_trace_path`, `query_traces`, `view_trace`, `search_trace`, `view_spans`, `export_evidence_corpus` |

Rewards are computed inside each environment's `execute_action()` and clamped to `[-1.0, 1.0]` by `RLMRewardProfile.clamp()`. The runner then applies a global scale multiplier on top.

Sources: [rlm_code/rlm/environments.py:100-119](), [rlm_code/rlm/environments.py:122-287](), [rlm_code/rlm/environments.py:565-617]()
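
The order of operations matters: the environment clamps first, then the runner scales. A tiny sketch under that reading, with hypothetical function names:

```python
def clamp(value, low=-1.0, high=1.0):
    """Restrict an environment reward to the [-1.0, 1.0] range."""
    return max(low, min(high, value))


def final_reward(raw_reward, global_scale):
    # Clamp inside the environment, then apply the runner's global scale.
    return clamp(raw_reward) * global_scale


examples = [final_reward(1.7, 0.5), final_reward(-3.0, 2.0), final_reward(0.25, 1.0)]
```

Note that with a scale above 1.0 the logged reward can fall outside `[-1.0, 1.0]`, because the scale is applied after clamping.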

### 3. `pure_rlm_environment.py` — The Paper-Faithful Mode

The `PureRLMEnvironment` class implements the actual RLM paper semantics. Its key difference from the generic environment: the input context is injected into the REPL namespace as a named Python variable, not into the LLM's token window. The LLM receives only a **metadata preview** (type, length, a short excerpt).

Inside REPL code, the model can call:
- `llm_query(prompt)` — a single recursive LLM call
- `llm_query_batched(prompts)` — concurrent parallel LLM calls
- `FINAL(answer)` or `FINAL_VAR("varname")` — clean termination
- `SHOW_VARS()` — inspect the REPL namespace

The interpreter backend is either **Docker** (recommended) or **Monty** (`pydantic-monty`). Bare `exec` is disabled unless `sandbox.pure_rlm_allow_unsafe_exec=true` is explicitly set.

Sources: [rlm_code/rlm/pure_rlm_environment.py:1-14](), [docs/core/environments.md:41-55]()
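
To show what "context as a variable" looks like in practice, the sketch below builds a metadata preview and then processes the context in chunks. It is a toy model: `metadata_preview` is a hypothetical helper, and `llm_query` here is a local stub that counts a marker string rather than calling a model.

```python
def metadata_preview(name, value, excerpt_chars=80):
    """What the model sees instead of the full context."""
    return {"name": name, "type": type(value).__name__,
            "length": len(value), "excerpt": value[:excerpt_chars]}


def llm_query(prompt):
    # Stub standing in for the recursive LLM call available inside the REPL.
    return str(prompt.count("ERROR"))


# The large context lives in a variable, not in the prompt.
context = "\n".join(
    f"line {i}: {'ERROR' if i % 50 == 0 else 'ok'}" for i in range(1000)
)
preview = metadata_preview("context", context)

# Code the model might write: split by lines, query each chunk, then reduce.
lines = context.splitlines()
chunks = ["\n".join(lines[i:i + 100]) for i in range(0, len(lines), 100)]
partial_counts = [
    int(llm_query("How many lines are marked as failures?\n" + chunk))
    for chunk in chunks
]
answer = sum(partial_counts)
```

In the real environment the last step would be `FINAL(answer)` or `FINAL_VAR("answer")` rather than leaving the result in a local variable.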

### 4. `frameworks/registry.py` — The Adapter Layer

`FrameworkAdapterRegistry.default()` registers five plug-in adapters at startup: `DSPyRLMFrameworkAdapter`, `ADKRLMFrameworkAdapter`, `PydanticAIFrameworkAdapter`, `GoogleADKFrameworkAdapter`, and `DeepAgentsFrameworkAdapter`. Each adapter implements `run_episode(task, llm_connector, max_steps, ...)` and returns a `FrameworkEpisodeResult`.

When a user picks `framework=dspy-rlm` or `framework=pydantic-ai`, the runner calls `_run_task_with_framework_adapter()` instead of its native loop. The adapter handles the framework-specific invocation; the runner still persists the trajectory and computes rewards.

If no framework is specified, `framework=native` is used, which goes through the environment loop directly.

Sources: [rlm_code/rlm/frameworks/registry.py:33-46](), [rlm_code/rlm/runner.py:1504-1536]()

### 5. `benchmarks.py` — Reproducible Evaluation

`RLMBenchmarkCase` is a frozen dataclass: `case_id`, `description`, `task`, `environment`, `max_steps`, `exec_timeout`. Named preset packs are defined inline and can also be loaded from external YAML or JSON files.

Built-in preset names include `dspy_quick` (3 cases), `dspy_extended` (5 cases), `pure_rlm_smoke` (3 cases), `oolong_style` (4 long-context cases), and `paradigm_comparison` (3 cases). The runner's `run_benchmark()` method (in `BenchmarkManagerMixin`) iterates cases, calls `run_task()` for each, and saves a JSON benchmark report.

Sources: [rlm_code/rlm/benchmarks.py:14-39]()
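
A frozen dataclass with those six fields might look like the sketch below. The field types, the default values, the top-level `"cases"` key, and the names `BenchmarkCase` and `load_cases` are all assumptions for illustration; the real pack file schema may differ.

```python
import json
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """Illustrative stand-in for RLMBenchmarkCase; types and defaults are assumed."""
    case_id: str
    description: str
    task: str
    environment: str = "generic"
    max_steps: int = 4
    exec_timeout: int = 30


def load_cases(json_text: str) -> list[BenchmarkCase]:
    # Assumed layout: {"cases": [{...}, {...}]}
    raw = json.loads(json_text)
    return [BenchmarkCase(**entry) for entry in raw["cases"]]


PACK = """
{"cases": [
  {"case_id": "sum", "description": "add numbers", "task": "Compute 2 + 2"},
  {"case_id": "long", "description": "long context", "task": "Summarize the log",
   "environment": "pure_rlm", "max_steps": 8, "exec_timeout": 60}
]}
"""

cases = load_cases(PACK)
print([case.case_id for case in cases])

try:
    cases[0].max_steps = 99  # frozen: mutation is rejected
except FrozenInstanceError:
    print("cases are immutable")
```

Freezing the cases is what makes a benchmark run reproducible: nothing downstream can quietly change a step budget or timeout between runs.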

---

## The Architecture in One Diagram

```
                    ┌─────────────────────────────────┐
                    │          RLMRunner               │
                    │  run_task() loop (runner.py)     │
                    │                                  │
                    │  step 1..N:                      │
                    │    planner_prompt → LLM          │
                    │    parse JSON action             │
                    │    execute_action()              │
                    │    record reward + observation   │
                    │    update rolling memory (8)     │
                    └───────────┬─────────────────────┘
                                │
              ┌─────────────────┼────────────────────┐
              │                 │                    │
   ┌──────────▼──────┐ ┌────────▼────────┐ ┌────────▼──────────┐
   │ GenericRLMEnv   │ │ DSPyCodingEnv   │ │ PureRLMEnvironment│
   │ run_python      │ │ write_file      │ │ context as var    │
   │ final           │ │ run_tests       │ │ llm_query()       │
   └──────────┬──────┘ │ delegate        │ │ FINAL()           │
              │        └────────┬────────┘ └────────┬──────────┘
              │                 │                    │
              └────────┬────────┘                    │
                       │                             │
              ┌────────▼─────────┐         ┌─────────▼──────────┐
              │ Sandbox          │         │ Secure Interpreter  │
              │ (execution_engine│         │ Docker / Monty      │
              │  local or docker)│         └─────────────────────┘
              └──────────────────┘

  Framework adapters (DSPy, ADK, Pydantic-AI, DeepAgents)
  wrap run_episode() → runner persists trajectory either way

  Trajectory persisted to: .rlm_code/rlm/runs/<run_id>.jsonl
```
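
The step loop in the top box of the diagram can be sketched in a few lines. This is a simplified illustration, not the code in `runner.py`: the `planner` and `execute_action` callables, the prompt format, and the returned dictionary shape are assumptions. The rolling memory of 8 entries is modeled with `collections.deque(maxlen=8)`.

```python
import json
from collections import deque


def run_task(task, planner, execute_action, max_steps=6, memory_size=8):
    """Plan, parse, execute, record; stop on a terminal action or at max_steps."""
    memory = deque(maxlen=memory_size)  # oldest entries fall off automatically
    steps = []
    total_reward = 0.0
    completed = False
    for step in range(1, max_steps + 1):
        prompt = f"Task: {task}\nRecent memory: {list(memory)}"
        action = json.loads(planner(prompt))  # the planner replies with a JSON action
        observation, reward, done = execute_action(action)
        total_reward += reward
        steps.append({"step": step, "action": action, "observation": observation, "reward": reward})
        memory.append(f"step {step}: {action['action']} -> {observation}")
        if done:
            completed = True
            break
    return {"completed": completed, "total_reward": total_reward, "steps": steps, "memory": list(memory)}


def scripted_planner(replies):
    # Test double: returns canned JSON actions instead of calling an LLM.
    pending = iter(replies)
    return lambda prompt: next(pending)


def toy_execute(action):
    if action["action"] == "final":
        return "done", 1.0, True
    return "ok", 0.8, False


result = run_task("demo", scripted_planner(['{"action": "run_python"}', '{"action": "final"}']), toy_execute)
print(result["completed"], len(result["steps"]))
```

A bounded deque keeps the planner prompt from growing without limit on long runs, at the cost of forgetting anything older than the last eight steps.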

---

## The Two Constraints Every Newcomer Hits

### Constraint 1: The Secure Sandbox Is Not Optional for Pure RLM

The `pure_rlm` environment requires a secure code interpreter. The runner tries Docker first, then Monty. If neither is available and `sandbox.pure_rlm_allow_unsafe_exec=true` is not set, the runner substitutes `_UnavailablePureRLMInterpreter`, which returns an error on every `execute()` call.

The error message in the runner is explicit:

> "Pure-RLM secure backend is unavailable. Configure a secure backend with `sandbox.pure_rlm_backend=monty` or `docker`, then install dependencies (Monty: `pip install pydantic-monty`, Docker: install Docker/OrbStack/Colima)."

**What to do:** Run `docker info` to confirm Docker is running, or `pip install pydantic-monty`. Then run `/rlm doctor` in the TUI.

Sources: [rlm_code/rlm/runner.py:251-265](), [rlm_code/rlm/runner.py:394-401]()
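
The fallback order described above (Docker, then Monty, then the opt-in unsafe `exec`, otherwise unavailable) can be sketched as a pure function. The function names and return strings are hypothetical, and the probes are deliberately shallow: `shutil.which` only confirms a `docker` binary is on `PATH`, not that the daemon is running, and the import name `pydantic_monty` is an assumption about how the `pydantic-monty` package is exposed.

```python
import importlib.util
import shutil


def docker_available() -> bool:
    # Shallow probe: binary on PATH only; `docker info` is the real check.
    return shutil.which("docker") is not None


def monty_available() -> bool:
    # Assumes pydantic-monty installs a top-level module named `pydantic_monty`.
    return importlib.util.find_spec("pydantic_monty") is not None


def choose_backend(has_docker: bool, has_monty: bool, allow_unsafe_exec: bool = False) -> str:
    """Return the first usable backend in priority order."""
    if has_docker:
        return "docker"
    if has_monty:
        return "monty"
    if allow_unsafe_exec:
        return "exec"
    return "unavailable"  # the runner would substitute an always-failing interpreter


print(choose_backend(docker_available(), monty_available()))
```

Separating the probes from the decision makes the priority order easy to test without Docker or Monty installed.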

### Constraint 2: The Cycle Guard Silently Caps Recursion

The runner maintains a `_RecursionState` object with `active_task_hashes: set[str]`. If a `delegate` action produces the same task string + environment combination as an ancestor run (detected by SHA-1 fingerprint), the child run is skipped immediately with `total_reward = -0.25` and `completed = False`. The parent run continues with the next step.

Nothing in the output signals this: no exception is raised, and the only trace is a `blocked_by_cycle_guard: True` flag in the JSONL. If you see suspiciously short runs or unexpected `-0.25` penalties, check the JSONL for that flag.

The default `max_depth=2` also limits delegation depth independently. Pure RLM strict mode (`pure_rlm_strict=true`) disables delegate actions entirely.

Sources: [rlm_code/rlm/runner.py:536-580](), [rlm_code/rlm/runner.py:697-729]()
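
A minimal sketch of the cycle guard, assuming the fingerprint is a SHA-1 over the environment name plus the task string. The exact payload format, the `RecursionState` and `run_child` names, and the result dictionary are illustrative, not taken from `runner.py`.

```python
import hashlib


def task_fingerprint(task: str, environment: str) -> str:
    # Assumed payload layout; the real runner may normalize differently.
    payload = f"{environment}\n{task.strip()}".encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class RecursionState:
    def __init__(self):
        self.active_task_hashes: set[str] = set()


def run_child(state: RecursionState, task: str, environment: str, run):
    """Run a delegated task unless an ancestor is already running the same one."""
    fingerprint = task_fingerprint(task, environment)
    if fingerprint in state.active_task_hashes:
        # Skipped without raising: the caller only sees the flag and the penalty.
        return {"completed": False, "total_reward": -0.25, "blocked_by_cycle_guard": True}
    state.active_task_hashes.add(fingerprint)
    try:
        return run(task, environment)
    finally:
        state.active_task_hashes.discard(fingerprint)  # no longer an active ancestor


state = RecursionState()


def looping_run(task, environment):
    # A task that delegates to an identical copy of itself.
    child = run_child(state, task, environment, looping_run)
    return {"completed": True, "total_reward": 1.0, "child": child}


result = run_child(state, "summarize logs", "generic", looping_run)
print(result["child"])
```

Because the hash is removed in `finally`, the guard only blocks repeats among live ancestors; the same task can still run again later as a sibling.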

---

## Reward Signal: The Quick Reference

Each `EnvironmentActionResult` carries a `reward: float` in `[-1.0, 1.0]`. The global scale from `RLMRewardProfile` is applied by the runner after every step. Key default values:

| Situation | Reward |
|---|---|
| `run_python` succeeds, no stderr | `0.1 + 0.7 = 0.8` |
| `run_python` fails | `0.1 - 0.3 = -0.2` (before stderr penalty) |
| `final` action | `1.0` |
| Unsupported action | `-0.2` |
| Cycle guard blocked | `-0.25` |
| `write_file` with compile pass + test pass | up to `~0.85` depending on DSPy score |

Sources: [rlm_code/rlm/environments.py:44-98](), [docs/core/environments.md:204-216]()
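
The `run_python` rows of the table can be reproduced with a small calculation. This is a sketch under assumptions: the field names on `RewardProfile`, the stderr penalty value, and the order of scaling and clamping are not given in the table and are chosen here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardProfile:
    """Illustrative stand-in for RLMRewardProfile; field names are assumed."""
    global_scale: float = 1.0
    run_python_base: float = 0.1
    run_python_success_bonus: float = 0.7
    run_python_failure_penalty: float = 0.3
    run_python_stderr_penalty: float = 0.1  # hypothetical value, not from the table


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def run_python_reward(profile: RewardProfile, success: bool, has_stderr: bool) -> float:
    reward = profile.run_python_base
    if success:
        reward += profile.run_python_success_bonus
    else:
        reward -= profile.run_python_failure_penalty
    if has_stderr:
        reward -= profile.run_python_stderr_penalty
    # Assumed order: scale first, then clamp into [-1.0, 1.0].
    return clamp(reward * profile.global_scale)


default = RewardProfile()
print(round(run_python_reward(default, success=True, has_stderr=False), 2))
print(round(run_python_reward(default, success=False, has_stderr=False), 2))
```

Results are rounded for display because sums such as `0.1 + 0.7` are not exact in binary floating point.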

---

## What to Read Next

| Goal | Where to go |
|---|---|
| Understand environment actions in depth | [docs/core/environments.md](docs/core/environments.md) |
| Compare the three execution patterns | [docs/core/execution-patterns.md](docs/core/execution-patterns.md) |
| Set up the sandbox and interpreter | `rlm_code/rlm/pure_rlm_environment.py` + runner `_build_pure_rlm_environment()` |
| Add or understand a framework adapter | `rlm_code/rlm/frameworks/registry.py` and the `*_adapter.py` files beside it |
| Run a benchmark sweep | `rlm_code/rlm/benchmarks.py` and `BenchmarkManagerMixin` in `benchmark_manager.py` |
| Inspect a run's raw trajectory | `.rlm_code/rlm/runs/<run_id>.jsonl` — each line is a JSON event of type `step` or `final` |
| Tune reward behavior | `RLMRewardProfile` in `environments.py:44-98`; pass as `reward_profile=` to `RLMRunner` |

The system's surface area is larger than these five files suggest — observability sinks, chat sessions, trace analysis, leaderboard — but every path eventually calls `run_task()` in `runner.py` and dispatches through an `RLMEnvironment`. That one call, that one protocol, is the map that holds.

Sources: [rlm_code/rlm/runner.py:175-210](), [rlm_code/rlm/environments.py:100-119]()
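
For inspecting a raw trajectory, a short script that tallies event types and finds cycle-guard blocks may be a useful starting point. It assumes each JSONL line is an object with a top-level `type` key and that `blocked_by_cycle_guard` appears at the top level of the event; the real layout may nest these differently, and `summarize_trajectory` is a hypothetical helper.

```python
import json
from collections import Counter
from io import StringIO


def summarize_trajectory(lines):
    """Count events by type and list the line numbers flagged by the cycle guard."""
    counts = Counter()
    blocked_lines = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
        if event.get("blocked_by_cycle_guard"):
            blocked_lines.append(line_number)
    return {"counts": dict(counts), "blocked_lines": blocked_lines}


# In practice, pass an open file handle for .rlm_code/rlm/runs/<run_id>.jsonl instead.
SAMPLE = StringIO(
    '{"type": "step", "reward": 0.8}\n'
    '{"type": "step", "reward": -0.25, "blocked_by_cycle_guard": true}\n'
    '{"type": "final", "completed": true}\n'
)

print(summarize_trajectory(SAMPLE))
```

Any iterable of lines works, so the same function can be pointed at an open file without changes.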

---