# The One Map to Keep — Core Idea, Key Files, What to Read Next

> A plain-English recap of the whole system: the single analogy that holds, the five files that matter most, the two constraints every newcomer hits, and where to go from here.

- Repository: SuperagenticAI/rlm-code
- GitHub: https://github.com/SuperagenticAI/rlm-code
- Human wiki: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91
- Complete Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/llms-full.txt

## Source Files

- `rlm_code/rlm/runner.py`
- `rlm_code/rlm/environments.py`
- `rlm_code/rlm/frameworks/registry.py`
- `rlm_code/rlm/benchmarks.py`
- `docs/core/environments.md`
- `docs/core/execution-patterns.md`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
- [rlm_code/rlm/pure_rlm_environment.py](rlm_code/rlm/pure_rlm_environment.py)
- [rlm_code/rlm/frameworks/registry.py](rlm_code/rlm/frameworks/registry.py)
- [rlm_code/rlm/benchmarks.py](rlm_code/rlm/benchmarks.py)
- [docs/core/environments.md](docs/core/environments.md)
- [docs/core/execution-patterns.md](docs/core/execution-patterns.md)
- [README.md](README.md)
</details>


RLM Code implements the **Recursive Language Models** paradigm from the 2025 paper of the same name. The central problem it solves: large documents and contexts don't fit in a model's token window without expensive chunking or truncation. RLM Code's answer is to keep that data *outside* the window — stored as a Python variable in a sandboxed REPL — and let the model write code to analyze it step by step. The model calls `llm_query()` from inside that code when it needs the LLM's reasoning on a chunk. The result is a tight, measurable loop: propose an action, execute it in a sandbox, observe the outcome, accumulate a reward signal, update a short rolling memory, repeat.

This page gives you the single mental model that holds for the whole system, names the five files that define it, identifies the two constraints that trip up newcomers, and points you at what to read next.

---

## The One Analogy That Holds

Think of RLM Code as a **lab manager running experiments**.

- The **task** is the experiment brief.
- The **LLM** is the researcher who proposes the next action, always in JSON.
- The **sandbox** (Docker or Monty) is the lab bench where actions run safely.
- The **environment** is the rulebook: what actions are legal, what rewards each outcome gets.
- The **runner** is the lab manager keeping the journal — it records every action, reward, and observation to a JSONL file, manages the deadline, and signals completion.

In the "pure RLM" mode, the context (a large document, a dataset) lives in the lab's storage room as a named variable. The researcher never sees it directly — only a short preview and its size. They write code to fetch chunks, call the LLM on those chunks, and build up a result in another variable. This is what the paper calls *context stored outside the token window*.

Everything else in the codebase — framework adapters, benchmarks, observability, chat sessions — is scaffolding around that core loop.

---

## The Five Files That Matter Most

```
rlm_code/rlm/
├── runner.py            ← The lab manager. Drives the loop, persists trajectory.
├── environments.py      ← The rulebooks. Defines legal actions and reward formulas.
├── pure_rlm_environment.py  ← Paper-faithful mode. Context as REPL variable.
├── frameworks/registry.py   ← Plug-in adapters for DSPy, ADK, Pydantic-AI, etc.
└── benchmarks.py        ← Named task packs for reproducible evaluation.
```

### 1. `runner.py` — The Loop Itself

`RLMRunner.run_task()` is the method that does the work. At its core it is a `for step_index in range(1, max_steps + 1)` loop:

1. Ask the environment to build a **planner prompt** from the task, rolling memory (last 8 notes), and recent trajectory (last 3 steps).
2. Send that prompt to the LLM via `_propose_step_candidates()` and pick the highest-scoring candidate action.
3. Dispatch the action — either to the environment's `execute_action()`, or to `_execute_delegate_action()` for recursive subtasks.
4. Apply the global reward scale, accumulate the total reward, write a JSONL step event, update rolling memory.
5. If `action_result.done` is `True`, break.
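The five steps above can be sketched as a toy loop. This is a minimal illustration, not the code in `runner.py`: `ActionResult`, `ToyEnvironment`, the `propose` callback, and the event field names other than `completed`, `total_reward`, and `steps` are hypothetical stand-ins.

```python
import json
from collections import deque
from dataclasses import dataclass


@dataclass
class ActionResult:
    """Hypothetical stand-in for the environment's action result."""
    observation: str
    reward: float
    done: bool = False


class ToyEnvironment:
    """Toy rulebook that understands only 'run_python' and 'final'."""

    def planner_prompt(self, task, memory, trajectory):
        return f"task={task} memory={list(memory)} recent={trajectory[-3:]}"

    def execute_action(self, action):
        if action["action"] == "final":
            return ActionResult("done", 1.0, done=True)
        if action["action"] == "run_python":
            return ActionResult("ok", 0.8)
        return ActionResult("unsupported", -0.2)


def run_task(task, env, propose, max_steps=4, reward_scale=1.0):
    memory = deque(maxlen=8)  # rolling memory keeps only the last 8 notes
    trajectory, events, total_reward, completed = [], [], 0.0, False
    for step_index in range(1, max_steps + 1):
        prompt = env.planner_prompt(task, memory, trajectory)
        candidates = propose(prompt)                     # list of (score, action)
        _, action = max(candidates, key=lambda c: c[0])  # highest-scoring candidate
        result = env.execute_action(action)
        reward = result.reward * reward_scale            # global scale applied per step
        total_reward += reward
        trajectory.append({"step": step_index, "action": action, "reward": reward})
        events.append(json.dumps({"type": "step", **trajectory[-1]}))
        memory.append(f"step {step_index}: {result.observation}")
        if result.done:
            completed = True
            break
    events.append(json.dumps({"type": "final", "completed": completed,
                              "total_reward": total_reward, "steps": len(trajectory)}))
    return events


# A scripted "LLM": run some code on step 1, then finish on step 2.
script = iter([
    [(0.9, {"action": "run_python", "code": "print(1)"}), (0.2, {"action": "final"})],
    [(0.7, {"action": "final", "answer": "1"})],
])
events = run_task("demo", ToyEnvironment(), lambda prompt: next(script))
```

Each string in `events` corresponds to one line of a JSONL trajectory; the last one plays the role of the final event.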

Every run produces a `<run_id>.jsonl` file under `.rlm_code/rlm/runs/`. The final event in that file carries `completed`, `total_reward`, `steps`, and `usage` counts.

Sources: [rlm_code/rlm/runner.py:634-774]()

### 2. `environments.py` — What the LLM Is Allowed to Do

The `RLMEnvironment` protocol declares four methods: `system_prompt()`, `planner_prompt()`, `execute_action()`, and `doctor_checks()`. Every concrete environment implements those four.
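A structural protocol with these four methods might look like the sketch below. Only the method names come from the source; the parameter lists, return types, and the `EnvironmentLike` and `EchoEnvironment` names are assumptions made for illustration.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class EnvironmentLike(Protocol):
    """Illustrative four-method protocol; signatures are assumed, not copied."""

    def system_prompt(self) -> str: ...

    def planner_prompt(self, task: str, memory: List[str],
                       trajectory: List[Dict[str, Any]]) -> str: ...

    def execute_action(self, action: Dict[str, Any]) -> Dict[str, Any]: ...

    def doctor_checks(self) -> List[str]: ...


class EchoEnvironment:
    """Toy environment: satisfies the protocol without inheriting from it."""

    def system_prompt(self) -> str:
        return "Reply with a JSON action."

    def planner_prompt(self, task, memory, trajectory) -> str:
        return f"Task: {task}"

    def execute_action(self, action):
        return {"observation": action, "reward": 0.0, "done": True}

    def doctor_checks(self):
        return []


conforms = isinstance(EchoEnvironment(), EnvironmentLike)
```

Because the protocol is structural, any class that defines the four methods qualifies. Note that `runtime_checkable` only checks that the methods exist, not that their signatures match.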

Three concrete environments live here:

| Class | Alias(es) | Key actions |
|---|---|---|
| `GenericRLMEnvironment` | `generic`, `rlm` | `run_python`, `final` |
| `DSPyCodingRLMEnvironment` | `dspy`, `dspy-coding`, `framework` | All generic + `write_file`, `patch_file`, `read_file`, `search_code`, `list_tree`, `run_tests`, `analyze_code`, `llm_query`, `delegate` |
| `TraceAnalysisEnvironment` | `trace_analysis`, `traces` | Trace-oriented: `set_trace_path`, `query_traces`, `view_trace`, `search_trace`, `view_spans`, `export_evidence_corpus` |

Rewards are computed inside each environment's `execute_action()` and clamped to `[-1.0, 1.0]` by `RLMRewardProfile.clamp()`. The runner then applies a global scale multiplier on top.
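The clamp-then-scale order can be illustrated with a small sketch. `RewardProfileSketch` and its field names are hypothetical; the numbers are the `run_python` defaults quoted in the reward quick reference on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardProfileSketch:
    """Hypothetical stand-in for RLMRewardProfile; field names are illustrative."""
    base: float = 0.1
    success_bonus: float = 0.7
    failure_penalty: float = 0.3
    global_scale: float = 1.0

    @staticmethod
    def clamp(value: float) -> float:
        # Keep every environment-level reward inside [-1.0, 1.0].
        return max(-1.0, min(1.0, value))

    def run_python_reward(self, succeeded: bool) -> float:
        raw = self.base + self.success_bonus if succeeded else self.base - self.failure_penalty
        return self.clamp(raw)


profile = RewardProfileSketch(global_scale=0.5)
env_reward = profile.run_python_reward(succeeded=True)  # computed inside the environment
step_reward = env_reward * profile.global_scale         # scale applied afterwards by the runner
```

The design point is the ordering: the environment's value is bounded first, and the global scale is a separate multiplier applied one layer up.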

Sources: [rlm_code/rlm/environments.py:100-119](), [rlm_code/rlm/environments.py:122-287](), [rlm_code/rlm/environments.py:565-617]()

### 3. `pure_rlm_environment.py` — The Paper-Faithful Mode

The `PureRLMEnvironment` class implements the actual RLM paper semantics. Its key difference from the generic environment: the input context is injected into the REPL namespace as a named Python variable, not into the LLM's token window. The LLM receives only a **metadata preview** (type, length, a short excerpt).

Inside REPL code, the model can call:
- `llm_query(prompt)`: a single recursive LLM call
- `llm_query_batched(prompts)`: concurrent LLM calls over a list of prompts
- `FINAL(answer)` or `FINAL_VAR("varname")`: clean termination
- `SHOW_VARS()`: inspect the REPL namespace

The interpreter backend is either **Docker** (recommended) or **Monty** (`pydantic-monty`). Bare `exec` is disabled unless `sandbox.pure_rlm_allow_unsafe_exec=true` is explicitly set.

Sources: [rlm_code/rlm/pure_rlm_environment.py:1-14](), [docs/core/environments.md:41-55]()

### 4. `frameworks/registry.py` — The Adapter Layer

`FrameworkAdapterRegistry.default()` registers five plug-in adapters at startup: `DSPyRLMFrameworkAdapter`, `ADKRLMFrameworkAdapter`, `PydanticAIFrameworkAdapter`, `GoogleADKFrameworkAdapter`, and `DeepAgentsFrameworkAdapter`. Each adapter implements `run_episode(task, llm_connector, max_steps, ...)` and returns a `FrameworkEpisodeResult`.

When a user picks `framework=dspy` or `framework=pydantic-ai`, the runner calls `_run_task_with_framework_adapter()` instead of its native loop. The adapter handles the framework-specific invocation; the runner still persists the trajectory and computes rewards.

If no framework is specified, `framework=native` is used, which goes through the environment loop directly.
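The dispatch decision can be sketched as a name-to-callable lookup with a native fallback. `AdapterRegistrySketch` is a hypothetical simplification: the real adapters are classes with a richer `run_episode` signature, and raising `ValueError` for an unknown name is an assumption.

```python
from typing import Callable, Dict

EpisodeFn = Callable[[str, int], dict]


class AdapterRegistrySketch:
    """Toy registry: maps a framework name to a run_episode-style callable."""

    def __init__(self) -> None:
        self._adapters: Dict[str, EpisodeFn] = {}

    def register(self, name: str, run_episode: EpisodeFn) -> None:
        self._adapters[name] = run_episode

    def run(self, task: str, native_loop: EpisodeFn,
            framework: str = "native", max_steps: int = 4) -> dict:
        if framework == "native":
            # No adapter involved: go straight through the environment loop.
            return native_loop(task, max_steps)
        try:
            adapter = self._adapters[framework]
        except KeyError:
            raise ValueError(f"unknown framework: {framework!r}") from None
        return adapter(task, max_steps)


registry = AdapterRegistrySketch()
registry.register("dspy", lambda task, max_steps: {"framework": "dspy", "task": task})
native = lambda task, max_steps: {"framework": "native", "task": task}

via_adapter = registry.run("fix the bug", native, framework="dspy")
via_native = registry.run("fix the bug", native)
```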

Sources: [rlm_code/rlm/frameworks/registry.py:33-46](), [rlm_code/rlm/runner.py:1504-1536]()

### 5. `benchmarks.py` — Reproducible Evaluation

`RLMBenchmarkCase` is a frozen dataclass with six fields: `case_id`, `description`, `task`, `environment`, `max_steps`, and `exec_timeout`. Named preset packs are defined inline and can also be loaded from external YAML or JSON files.

Built-in preset names include `dspy_quick` (3 cases), `dspy_extended` (5 cases), `pure_rlm_smoke` (3 cases), `oolong_style` (4 long-context cases), and `paradigm_comparison` (3 cases). The runner's `run_benchmark()` method (in `BenchmarkManagerMixin`) iterates cases, calls `run_task()` for each, and saves a JSON benchmark report.
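A frozen dataclass with those fields, plus a loader for an external JSON pack, could be sketched as follows. The field types, the default values, the JSON layout (a top-level list of objects), and the names `BenchmarkCaseSketch` and `load_cases` are assumptions.

```python
import json
from dataclasses import FrozenInstanceError, dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkCaseSketch:
    """Illustrative mirror of the documented fields; types and defaults are assumed."""
    case_id: str
    description: str
    task: str
    environment: str = "generic"
    max_steps: int = 4
    exec_timeout: int = 30


def load_cases(raw_json: str) -> List[BenchmarkCaseSketch]:
    # Assumed layout: a JSON list of objects keyed by the field names above.
    return [BenchmarkCaseSketch(**item) for item in json.loads(raw_json)]


pack = json.dumps([
    {"case_id": "demo_1", "description": "Toy case", "task": "Print 2 + 2"},
    {"case_id": "demo_2", "description": "Long context", "task": "Summarize the log",
     "environment": "pure_rlm", "max_steps": 6},
])
cases = load_cases(pack)

try:
    cases[0].max_steps = 99  # frozen dataclasses reject mutation
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the case objects means a benchmark definition cannot drift while a sweep is running, which is what makes repeated runs comparable.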

Sources: [rlm_code/rlm/benchmarks.py:14-39]()

---

## The Architecture in One Diagram

```
                    ┌─────────────────────────────────┐
                    │          RLMRunner               │
                    │  run_task() loop (runner.py)     │
                    │                                  │
                    │  step 1..N:                      │
                    │    planner_prompt → LLM          │
                    │    parse JSON action             │
                    │    execute_action()              │
                    │    record reward + observation   │
                    │    update rolling memory (8)     │
                    └───────────┬─────────────────────┘
                                │
              ┌─────────────────┼────────────────────┐
              │                 │                    │
   ┌──────────▼──────┐ ┌────────▼────────┐ ┌────────▼──────────┐
   │ GenericRLMEnv   │ │ DSPyCodingEnv   │ │ PureRLMEnvironment│
   │ run_python      │ │ write_file      │ │ context as var    │
   │ final           │ │ run_tests       │ │ llm_query()       │
   └──────────┬──────┘ │ delegate        │ │ FINAL()           │
              │        └────────┬────────┘ └────────┬──────────┘
              │                 │                    │
              └────────┬────────┘                    │
                       │                             │
              ┌────────▼─────────┐         ┌─────────▼──────────┐
              │ Sandbox          │         │ Secure Interpreter  │
              │ (execution_engine│         │ Docker / Monty      │
              │  local or docker)│         └─────────────────────┘
              └──────────────────┘

  Framework adapters (DSPy, ADK, Pydantic-AI, DeepAgents)
  wrap run_episode() → runner persists trajectory either way

  Trajectory persisted to: .rlm_code/rlm/runs/<run_id>.jsonl
```

---

## The Two Constraints Every Newcomer Hits

### Constraint 1: The Secure Sandbox Is Not Optional for Pure RLM

The `pure_rlm` environment requires a secure code interpreter. The runner tries Docker first, then Monty. If neither is available and `sandbox.pure_rlm_allow_unsafe_exec=true` is not set, the runner substitutes `_UnavailablePureRLMInterpreter`, which returns an error on every `execute()` call.

The error message in the runner is explicit:

> "Pure-RLM secure backend is unavailable. Configure a secure backend with `sandbox.pure_rlm_backend=monty` or `docker`, then install dependencies (Monty: `pip install pydantic-monty`, Docker: install Docker/OrbStack/Colima)."

**What to do:** Run `docker info` to confirm Docker is running, or `pip install pydantic-monty`. Then run `/rlm doctor` in the TUI.
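The fallback order described above can be approximated before launching a run. This is a rough pre-flight sketch, not the runner's own logic: `pick_pure_rlm_backend` is a hypothetical helper, the import name `pydantic_monty` is an assumption about how the `pydantic-monty` package is exposed, and finding a `docker` binary on `PATH` does not prove the daemon is running (`docker info` does).

```python
import importlib.util
import shutil
from typing import Optional


def pick_pure_rlm_backend(allow_unsafe_exec: bool = False,
                          has_docker: Optional[bool] = None,
                          has_monty: Optional[bool] = None) -> str:
    """Mirror the documented order: Docker first, then Monty, then the opt-in."""
    if has_docker is None:
        # Only checks that a docker CLI exists, not that the daemon is up.
        has_docker = shutil.which("docker") is not None
    if has_monty is None:
        # Assumed import name for the pydantic-monty package.
        has_monty = importlib.util.find_spec("pydantic_monty") is not None
    if has_docker:
        return "docker"
    if has_monty:
        return "monty"
    if allow_unsafe_exec:
        return "unsafe_exec"
    return "unavailable"


detected = pick_pure_rlm_backend()
```

The `has_docker` and `has_monty` overrides exist only so the decision table can be exercised without touching the host machine.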

Sources: [rlm_code/rlm/runner.py:251-265](), [rlm_code/rlm/runner.py:394-401]()

### Constraint 2: The Cycle Guard Silently Caps Recursion

The runner maintains a `_RecursionState` object with `active_task_hashes: set[str]`. If a `delegate` action produces the same task string + environment combination as an ancestor run (detected by SHA-1 fingerprint), the child run is skipped immediately with `total_reward = -0.25` and `completed = False`. The parent run continues with the next step.
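A minimal sketch of such a guard follows. The way the fingerprint string is assembled, the function names, and the shape of the returned dictionaries are assumptions; only the SHA-1 hashing, the set of active hashes, the `-0.25` reward, and the `blocked_by_cycle_guard` flag come from the description above.

```python
import hashlib


def task_fingerprint(task: str, environment: str) -> str:
    # Assumed recipe: hash the environment name and the task text together.
    return hashlib.sha1(f"{environment}\n{task}".encode("utf-8")).hexdigest()


def run_delegate(task, environment, active_hashes, run_child):
    """Run a child task unless an ancestor is already running the same one."""
    key = task_fingerprint(task, environment)
    if key in active_hashes:
        # Skip the child; the caller keeps going with this penalty recorded.
        return {"completed": False, "total_reward": -0.25, "blocked_by_cycle_guard": True}
    active_hashes.add(key)
    try:
        return run_child(task, environment)
    finally:
        active_hashes.discard(key)  # the task is no longer an ancestor


active = set()


def child(task, environment):
    # This child re-delegates the very same task, which forms an immediate cycle.
    inner = run_delegate(task, environment, active, child)
    return {"completed": True, "total_reward": 1.0 + inner["total_reward"], "inner": inner}


outer = run_delegate("summarize logs", "generic", active, child)
```

The outer call completes normally; only the nested, repeated delegation is cut off.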

The guard raises no exception and prints no warning; the only trace is a `blocked_by_cycle_guard: True` flag in the JSONL. If you see suspiciously short runs or unexpected `-0.25` penalties, check the JSONL for that flag.
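One way to check a trajectory for the flag is sketched below. Because this page does not specify where in an event the flag sits, the helper searches nested values recursively; the sample lines and their field names are invented for illustration and are not real run output.

```python
import json


def contains_flag(node, key="blocked_by_cycle_guard"):
    """Return True if `key` is set to a truthy value anywhere inside `node`."""
    if isinstance(node, dict):
        return bool(node.get(key)) or any(contains_flag(v, key) for v in node.values())
    if isinstance(node, list):
        return any(contains_flag(v, key) for v in node)
    return False


def blocked_events(lines):
    """Parse JSONL lines and keep the events that carry the cycle-guard flag."""
    events = (json.loads(line) for line in lines if line.strip())
    return [event for event in events if contains_flag(event)]


# Invented sample lines; a real file would be opened and passed in instead.
sample = [
    '{"type": "step", "step": 1, "reward": 0.8}',
    '{"type": "step", "step": 2, "reward": -0.25, "result": {"blocked_by_cycle_guard": true}}',
    '',
    '{"type": "final", "completed": true, "total_reward": 0.55, "steps": 2}',
]
hits = blocked_events(sample)
```

For a real run, pass an open file handle for `.rlm_code/rlm/runs/<run_id>.jsonl` as `lines`; iterating over a file yields one line at a time.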

The default `max_depth=2` also limits delegation depth independently. Pure RLM strict mode (`pure_rlm_strict=true`) disables delegate actions entirely.

Sources: [rlm_code/rlm/runner.py:536-580](), [rlm_code/rlm/runner.py:697-729]()

---

## Reward Signal: The Quick Reference

Each `EnvironmentActionResult` carries a `reward: float` in `[-1.0, 1.0]`. The global scale from `RLMRewardProfile` is applied by the runner after every step. Key default values:

| Situation | Reward |
|---|---|
| `run_python` succeeds, no stderr | `0.1 + 0.7 = 0.8` |
| `run_python` fails | `0.1 - 0.3 = -0.2` (before stderr penalty) |
| `final` action | `1.0` |
| Unsupported action | `-0.2` |
| Cycle guard blocked | `-0.25` |
| `write_file` with compile pass + test pass | up to `~0.85` depending on DSPy score |

Sources: [rlm_code/rlm/environments.py:44-98](), [docs/core/environments.md:204-216]()

---

## What to Read Next

| Goal | Where to go |
|---|---|
| Understand environment actions in depth | [docs/core/environments.md](docs/core/environments.md) |
| Compare the three execution patterns | [docs/core/execution-patterns.md](docs/core/execution-patterns.md) |
| Set up the sandbox and interpreter | `rlm_code/rlm/pure_rlm_environment.py` + runner `_build_pure_rlm_environment()` |
| Add or understand a framework adapter | `rlm_code/rlm/frameworks/registry.py` and the `*_adapter.py` files beside it |
| Run a benchmark sweep | `rlm_code/rlm/benchmarks.py` and `BenchmarkManagerMixin` in `benchmark_manager.py` |
| Inspect a run's raw trajectory | `.rlm_code/rlm/runs/<run_id>.jsonl` — each line is a JSON event of type `step` or `final` |
| Tune reward behavior | `RLMRewardProfile` in `environments.py:44-98`; pass as `reward_profile=` to `RLMRunner` |

The system's surface area is larger than these five files suggest — observability sinks, chat sessions, trace analysis, leaderboard — but every path eventually calls `run_task()` in `runner.py` and dispatches through an `RLMEnvironment`. That one call, that one protocol, is the map that holds.

Sources: [rlm_code/rlm/runner.py:175-210](), [rlm_code/rlm/environments.py:100-119]()
