# The Loop — How an Agent Actually Works Here

> Step by step: how RLMRunner turns one user prompt into many rounds of context → action proposal → sandbox execution → observation → reward → memory update, and when it stops.

- Repository: SuperagenticAI/rlm-code
- GitHub: https://github.com/SuperagenticAI/rlm-code
- Human wiki: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91
- Complete Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/llms-full.txt

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/runner.py](rlm_code/rlm/runner.py)
- [rlm_code/rlm/action_planner.py](rlm_code/rlm/action_planner.py)
- [rlm_code/rlm/context_store.py](rlm_code/rlm/context_store.py)
- [rlm_code/rlm/termination.py](rlm_code/rlm/termination.py)
- [rlm_code/rlm/events.py](rlm_code/rlm/events.py)
- [rlm_code/rlm/trajectory.py](rlm_code/rlm/trajectory.py)
- [rlm_code/rlm/memory_compaction.py](rlm_code/rlm/memory_compaction.py)
- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
</details>


`RLMRunner` is the engine that turns a single user task string into a multi-round agent loop. Each round ("step") follows the same pipeline: build context → ask the LLM to propose an action → execute that action in a sandbox → observe the result → assign a reward → optionally update a short-term memory note → repeat. The loop ends when the agent signals it is done, a step budget is exhausted, a time budget expires, or the user cancels it.

This page walks through each phase of that loop in the order the code runs it, with precise citations so you can follow along in the source.

---

## 1. Anatomy of one run

`RLMRunner.run_task()` is the entry point. It accepts a task string, chooses an environment, and drives the step loop. The final result is an `RLMRunResult` dataclass that carries `completed`, `steps`, `total_reward`, and a human-readable `final_response`.

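```python
# rlm_code/rlm/runner.py  (simplified sketch, not the verbatim source)
total_reward = 0.0
for step_index in range(1, max_steps + 1):
    # Build the prompt from the task, the rolling memory, and past steps.
    planner_prompt = env.planner_prompt(task, memory, trajectory, step_index)
    # One candidate per branch; each carries an "action" and a "score".
    candidates     = self._propose_step_candidates(planner_prompt, ...)
    selected       = max(candidates, key=lambda item: item["score"])
    # Run the chosen action in the sandboxed environment.
    action_result  = env.execute_action(selected["action"], ...)
    total_reward  += action_result.reward
    trajectory.append(step_event)  # step_event: the record built for this step
    if action_result.done:
        break
```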

Sources: [rlm_code/rlm/runner.py:634-774]()
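To make the shape of the loop concrete, here is a minimal, self-contained sketch that can be run without the repository. `ToyEnv`, `ToyResult`, and `run_toy_loop` are hypothetical stand-ins, not the repository's classes; only the control flow (execute, accumulate reward, update memory, append to the trajectory, stop on `done`) follows the simplified sketch above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyResult:
    """Hypothetical stand-in for the environment's per-step result."""
    reward: float
    done: bool
    memory_note: Optional[str] = None


class ToyEnv:
    """Hypothetical environment: the run finishes once the action is 'final'."""

    def execute_action(self, action: str) -> ToyResult:
        if action == "final":
            return ToyResult(reward=1.0, done=True, memory_note="answered")
        return ToyResult(reward=0.1, done=False, memory_note=f"ran {action}")


def run_toy_loop(env: ToyEnv, planned_actions: list[str], max_steps: int = 4):
    memory: list[str] = []
    trajectory: list[dict] = []
    total_reward = 0.0
    completed = False
    for step_index in range(1, max_steps + 1):
        # A real planner would call the LLM here; this one replays a script
        # and repeats the last scripted action once the script runs out.
        action = planned_actions[min(step_index - 1, len(planned_actions) - 1)]
        result = env.execute_action(action)
        total_reward += result.reward
        if result.memory_note:
            memory = (memory + [result.memory_note])[-8:]
        trajectory.append(
            {"step": step_index, "action": action, "reward": result.reward}
        )
        if result.done:
            completed = True
            break
    return completed, len(trajectory), total_reward, memory
```

Calling `run_toy_loop(ToyEnv(), ["run_python", "final"])` ends after two steps with `completed=True`, while a script that never emits `final` runs until the step budget is spent and returns `completed=False`.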

---

## 2. Phase-by-phase breakdown

### 2.1 Context assembly

Before each step, the environment builds a **planner prompt** that packages:

- the original task string,
- the short rolling `memory` list (capped at the last 8 notes),
- and the growing `trajectory` of past steps.

`LazyFileContext` (`context_store.py`) supplies file-level snippets when the environment needs workspace content. It reads lazily — only the files actually requested — and enforces a character budget (default 8 000 chars total, 1 600 chars per file).

```python
# rlm_code/rlm/context_store.py
def render(self, refs, *, max_chars=8000, max_chars_per_ref=1600) -> str:
    ...
```

Sources: [rlm_code/rlm/context_store.py:72-89]()
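The interaction between lazy reading and the two character budgets can be illustrated with a small sketch. `render_refs` below is a hypothetical helper, not the repository's `render()`: it takes zero-argument loader callables in place of file references, and it counts only snippet text (not headers) against the budget, which is an assumption.

```python
from typing import Callable


def render_refs(
    loaders: dict[str, Callable[[], str]],
    *,
    max_chars: int = 8000,
    max_chars_per_ref: int = 1600,
) -> str:
    """Render snippets until the total budget is spent (illustrative only)."""
    parts: list[str] = []
    remaining = max_chars
    for name, load in loaders.items():
        if remaining <= 0:
            break  # budget spent: later refs are never loaded
        # Each ref is capped by its own limit and by what is left overall.
        snippet = load()[: min(max_chars_per_ref, remaining)]
        parts.append(f"### {name}\n{snippet}")
        remaining -= len(snippet)
    return "\n\n".join(parts)
```

Because the loader is only called when the loop reaches a reference, anything past the point where the total budget runs out is never read at all.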

### 2.2 Action proposal (with optional branching)

`ActionPlannerMixin._propose_step_candidates()` calls the LLM once (`branch_width=1`) or several times (`branch_width > 1`) to generate candidate actions:

- **Single-branch**: one LLM call, response parsed as JSON into an `RLMAction`.
- **Multi-branch**: `branch_width` independent LLM calls. Each candidate is speculatively **scored** by running it in a throwaway copy of the workspace (`_preview_action_score`). The candidate with the highest score is selected.

```python
# rlm_code/rlm/action_planner.py
selected = max(candidates, key=lambda item: item["score"])
```

Sources: [rlm_code/rlm/action_planner.py:94-218]()
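As a rough illustration of speculative scoring, the sketch below previews each candidate against a deep copy of a toy workspace and then picks the highest score. `preview_score` and `pick_candidate` are hypothetical names, the workspace is modelled as a plain dict of path to content, and the "does it compile" heuristic is an assumption rather than the scoring used by `_preview_action_score`.

```python
import copy


def preview_score(action: dict, workspace: dict) -> float:
    """Score an action by applying it to a throwaway copy of the workspace."""
    scratch = copy.deepcopy(workspace)  # the real workspace is never touched
    scratch[action["path"]] = action["content"]
    # Toy heuristic (assumption): content that compiles as Python scores higher.
    try:
        compile(scratch[action["path"]], action["path"], "exec")
    except SyntaxError:
        return -0.5
    return 0.5


def pick_candidate(actions: list[dict], workspace: dict) -> dict:
    candidates = [
        {"action": action, "score": preview_score(action, workspace)}
        for action in actions
    ]
    # Same selection rule as the snippet above: highest score wins.
    return max(candidates, key=lambda item: item["score"])
```

Note that `max` returns the first candidate on a tie, so with equal scores the earliest proposal is kept.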

The JSON parser is tolerant: it tries fenced code blocks first, then walks the text for balanced `{…}` pairs. If no JSON is found at all, the whole raw response is treated as a `final` action.

Sources: [rlm_code/rlm/action_planner.py:309-338]()

### 2.3 Sandbox execution

Each selected action is dispatched to the active **environment** (`env.execute_action()`). Environments wrap a sandboxed Python execution engine. Available environments include:

| Environment key | Purpose |
|---|---|
| `generic` / `rlm` | Default general-purpose coding loop |
| `dspy` / `dspy-coding` | DSPy-aware code tasks |
| `trace_analysis` / `traces` | Analysing agent execution traces |
| `pure_rlm` / `pure-rlm` | Strict RLM-paper semantics with REPL and FINAL() termination |

The `pure_rlm` environment spins up a secure interpreter backend (Monty or Docker). Plain `exec` is also available, but it is unsafe and must be enabled explicitly. The runner tries the configured backend first and falls back automatically if that backend cannot be used.

Sources: [rlm_code/rlm/runner.py:266-303](), [rlm_code/rlm/runner.py:426-468]()

### 2.4 Observation

`execute_action()` returns an `EnvironmentActionResult` that carries:

- **observation** — a dict with stdout, stderr, success flag, and any structured output.
- **reward** — a float in `[-1.0, 1.0]` computed by the environment based on execution outcome.
- **done** — bool indicating whether this step terminates the run.
- **memory_note** — an optional short string to add to the rolling memory list.
- **final_response** — a human-readable answer string when `done=True`.

Sources: [rlm_code/rlm/environments.py:23-32]()
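For readers who prefer to see the shape as code, the fields listed above map naturally onto a dataclass. `ActionResultSketch` is a hypothetical mirror of `EnvironmentActionResult`: the field names come from the list above, while the defaults and the clamping in `__post_init__` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ActionResultSketch:
    observation: dict[str, Any] = field(default_factory=dict)
    reward: float = 0.0
    done: bool = False
    memory_note: Optional[str] = None
    final_response: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: keep the reward inside the documented [-1.0, 1.0] range.
        self.reward = max(-1.0, min(1.0, float(self.reward)))
```

A step that printed output but did not finish the task would then look like `ActionResultSketch(observation={"stdout": "ok", "success": True}, reward=0.2)`.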

### 2.5 Reward computation

The runner calls `reward_profile.apply_global_scale(raw_reward)` after each step, then accumulates `total_reward`. The `RLMRewardProfile` dataclass carries tunable weights for common scoring situations:

| Scoring situation | Relevant fields |
|---|---|
| Python execution success | `run_python_success_bonus`, `run_python_failure_penalty` |
| DSPy pattern match | `dspy_pattern_match_bonus`, `dspy_pattern_bonus_cap` |
| File write / patch verification | `verifier_*` family |
| Global scale | `global_scale` |

Sources: [rlm_code/rlm/environments.py:44-97]()
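A small sketch shows how a weight-carrying profile and a global scale might combine. `RewardProfileSketch` is hypothetical: the field names are taken from the table above, but the default values, the sign convention for the penalty, and the clamp to `[-1.0, 1.0]` are all assumptions rather than the behaviour of `RLMRewardProfile`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardProfileSketch:
    run_python_success_bonus: float = 0.5    # assumed default
    run_python_failure_penalty: float = 0.3  # assumed default
    global_scale: float = 1.0

    def score_run_python(self, success: bool) -> float:
        """Raw reward for one Python execution (penalty applied as negative)."""
        if success:
            return self.run_python_success_bonus
        return -self.run_python_failure_penalty

    def apply_global_scale(self, raw_reward: float) -> float:
        """Scale the raw reward, then clamp it (the clamp is an assumption)."""
        scaled = raw_reward * self.global_scale
        return max(-1.0, min(1.0, scaled))
```

Keeping the scale as a separate final step means individual weights can be tuned without changing the overall magnitude of `total_reward`.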

### 2.6 Memory update

After each step, if `action_result.memory_note` is set, it is appended to the `memory` list. The list is **hard-capped at 8 entries** (oldest entries are discarded):

```python
# rlm_code/rlm/runner.py:763-765
if action_result.memory_note:
    memory.append(action_result.memory_note)
    memory = memory[-8:]
```

Sources: [rlm_code/rlm/runner.py:763-765]()

For the `pure_rlm` environment, longer REPL interaction histories are managed by `MemoryCompactor`. Compaction triggers when the history grows beyond `max_entries_before_compaction` (default 10) or `max_chars_before_compaction` (default 8 000 chars). The compactor calls the LLM to produce a 2–3 sentence summary of older entries, then discards them, keeping only the summary plus the last `preserve_last_n_entries` (default 2).

Sources: [rlm_code/rlm/memory_compaction.py:83-165]()
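The compaction rule can be sketched in a few lines. `compact_history` is a hypothetical function, not the `MemoryCompactor` API: the parameter names and defaults follow the paragraph above, history entries are simplified to strings, and the LLM call is replaced by a `summarize` callable supplied by the caller.

```python
from typing import Callable


def compact_history(
    entries: list[str],
    summarize: Callable[[list[str]], str],
    *,
    max_entries_before_compaction: int = 10,
    max_chars_before_compaction: int = 8000,
    preserve_last_n_entries: int = 2,
) -> list[str]:
    total_chars = sum(len(entry) for entry in entries)
    over_budget = (
        len(entries) > max_entries_before_compaction
        or total_chars > max_chars_before_compaction
    )
    split = len(entries) - preserve_last_n_entries
    if not over_budget or split <= 0:
        return entries  # nothing to compact yet
    older, recent = entries[:split], entries[split:]
    # Older entries collapse into one summary; the newest are kept verbatim.
    return [summarize(older)] + recent
```

With twelve short entries and the defaults, the result is one summary entry followed by the last two original entries.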

---

## 3. Termination: how the loop ends

The loop has four exit paths:

| End condition | How it fires |
|---|---|
| Agent done | `action_result.done` is `True` |
| Step budget | the loop completes `max_steps` steps without `done` |
| Time budget | the monotonic clock reading passes the deadline |
| Cooperative cancel | `_is_cancel_requested()` returns true |

The "agent done" path is triggered when the agent's code calls one of the three terminal functions defined in `termination.py`:

- **`FINAL(answer)`** — raises `FinalOutput({"answer": answer, "type": "direct"})` and immediately exits the REPL loop.
- **`FINAL_VAR("varname")`** — raises `FinalOutput({"var": varname, "type": "variable"})`, causing the runner to look up the variable from the REPL namespace.
- **`SUBMIT(**kwargs)`** — raises `SubmitOutput(fields=kwargs)`, which supports typed multi-field outputs and optional schema validation.

```python
# rlm_code/rlm/termination.py
def FINAL(answer: Any) -> NoReturn:
    raise FinalOutput({"answer": answer, "type": "direct"})

def FINAL_VAR(variable_name: str) -> NoReturn:
    raise FinalOutput({"var": variable_name, "type": "variable"})

def SUBMIT(**kwargs: Any) -> NoReturn:
    raise SubmitOutput(fields=kwargs)
```

Sources: [rlm_code/rlm/termination.py:43-86]()
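Raising an exception is what lets a terminal call unwind out of arbitrary user code in one step. The sketch below shows one way a REPL loop could catch it; `FinalOutputSketch` and `run_snippet` are hypothetical, and the bare `exec` is used only for illustration (it is not a sandbox).

```python
from typing import Any, NoReturn


class FinalOutputSketch(Exception):
    """Hypothetical stand-in for the repository's FinalOutput exception."""

    def __init__(self, output: dict[str, Any]) -> None:
        super().__init__(output)
        self.output = output


def FINAL(answer: Any) -> NoReturn:
    raise FinalOutputSketch({"answer": answer, "type": "direct"})


def FINAL_VAR(variable_name: str) -> NoReturn:
    raise FinalOutputSketch({"var": variable_name, "type": "variable"})


def run_snippet(code: str, namespace: dict[str, Any]) -> tuple[bool, Any]:
    """Execute one snippet and return (done, answer). Not sandboxed."""
    namespace.update({"FINAL": FINAL, "FINAL_VAR": FINAL_VAR})
    try:
        exec(code, namespace)
    except FinalOutputSketch as final:
        if final.output["type"] == "variable":
            # Resolve the named variable from the persistent REPL namespace.
            return True, namespace.get(final.output["var"])
        return True, final.output["answer"]
    return False, None
```

Because the namespace persists between snippets, a variable assigned in one step can be returned by `FINAL_VAR` in a later step.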

`detect_final_in_text()` and `detect_final_in_code()` also scan for these patterns in the LLM's raw text response (not just in executed code), so a model that writes `FINAL("answer")` in a markdown code block will still trigger termination correctly.

Sources: [rlm_code/rlm/termination.py:122-217]()
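Text-level detection can be approximated with regular expressions. `detect_final_sketch` is a hypothetical, much-reduced illustration and not the repository's `detect_final_in_text()`: it only recognises calls whose argument is a single quoted string literal.

```python
import re
from typing import Optional

# Both patterns require a quoted literal argument (a simplifying assumption).
_FINAL_VAR_CALL = re.compile(r"""\bFINAL_VAR\(\s*(["'])(?P<name>\w+)\1\s*\)""")
_FINAL_CALL = re.compile(r"""\bFINAL\(\s*(["'])(?P<answer>.*?)\1\s*\)""", re.DOTALL)


def detect_final_sketch(text: str) -> Optional[dict]:
    """Return a FinalOutput-style payload if the text contains a terminal call."""
    var_match = _FINAL_VAR_CALL.search(text)
    if var_match:
        return {"var": var_match.group("name"), "type": "variable"}
    direct_match = _FINAL_CALL.search(text)
    if direct_match:
        return {"answer": direct_match.group("answer"), "type": "direct"}
    return None
```

A production detector would also need to handle non-literal arguments, nested parentheses, and escaped quotes, which this sketch does not attempt.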

When the step budget is exhausted without a terminal call, the runner tries a fallback: it calls `_extract_answer_from_trajectory()`, which passes the last 10 steps to the LLM and asks it to synthesize the best possible answer from the execution history.

Sources: [rlm_code/rlm/action_planner.py:341-391]()
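The fallback amounts to building one more prompt from the tail of the trajectory. The helper below is a hypothetical sketch of that prompt construction only; the wording of the prompt, the event keys, and the name `build_synthesis_prompt` are assumptions, and the actual LLM call is left out.

```python
import json


def build_synthesis_prompt(task: str, trajectory: list[dict], *, last_n: int = 10) -> str:
    """Assemble a prompt from the most recent steps (illustrative only)."""
    recent = trajectory[-last_n:]
    lines = [f"Task: {task}", "Execution history (most recent steps):"]
    for event in recent:
        record = {
            "step": event["step"],
            "action": event["action"],
            "observation": event["observation"],
        }
        lines.append(json.dumps(record, sort_keys=True))
    lines.append("Give the best possible final answer based only on this history.")
    return "\n".join(lines)
```

Limiting the prompt to the last few steps keeps the synthesis call bounded even when the run used its whole step budget.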

---

## 4. Trajectory persistence

Every step event and the final event are appended as newline-delimited JSON to a `.jsonl` file under `.rlm_code/rlm/runs/`. The run ID is a timestamp string like `run_20240501_142000_123456`.

```python
# rlm_code/rlm/runner.py:1404-1414
def _append_event(self, run_path: Path, event: dict[str, Any]) -> None:
    with run_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, ...) + "\n")
```

Each step record includes: `type`, `run_id`, `environment`, `task`, `timestamp`, `step`, `action`, `observation`, `reward`, and `usage` (token counts). The final record adds `completed`, `total_reward`, `steps`, and `cancelled`.

`TrajectoryLogger` in `trajectory.py` provides a richer alternative logger with per-event types (`ITERATION_REASONING`, `ITERATION_CODE`, `ITERATION_OUTPUT`, `LLM_REQUEST`, `CHILD_SPAWN`, `FINAL_DETECTED`, etc.) and supports export to an interactive HTML viewer.

Sources: [rlm_code/rlm/trajectory.py:33-70](), [rlm_code/rlm/trajectory.py:548-630]()
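Since each run is plain newline-delimited JSON, it can be inspected with the standard library alone. The sketch below appends a few events and reads them back; the helper names are hypothetical, and the `"step"` and `"final"` values for the `type` field are assumptions about the record format.

```python
import json
from pathlib import Path
from typing import Any


def append_event(run_path: Path, event: dict[str, Any]) -> None:
    """Append one event as a single JSON line."""
    run_path.parent.mkdir(parents=True, exist_ok=True)
    with run_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, default=str) + "\n")


def load_events(run_path: Path) -> list[dict[str, Any]]:
    """Read every non-empty line of a run file back into dicts."""
    with run_path.open("r", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_run(events: list[dict[str, Any]]) -> dict[str, Any]:
    steps = [event for event in events if event.get("type") == "step"]
    final = next((event for event in events if event.get("type") == "final"), None)
    return {
        "steps": len(steps),
        "reward_sum": sum(event.get("reward", 0.0) for event in steps),
        "completed": bool(final and final.get("completed")),
    }
```

Append-only writes mean a crashed run still leaves every completed step on disk, which is the main practical benefit of the JSONL format here.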

---

## 5. Event bus and observability

A lightweight in-process `RLMEventBus` (`events.py`) fires named events at every lifecycle point: `run_start`, `step_start`, `step_end`, `run_end`, `run_cycle_guard`. Subscribers can listen to all events or filter by `RLMEventType`. The event bus is used for real-time UI updates, streaming, and integration with observability sinks.

The full set of typed events covers LLM call boundaries (`LLM_CALL_START/END`), code execution boundaries (`CODE_EXEC_START/END`), sub-LLM calls from REPL code (`SUB_LLM_START/END`), child-agent lifecycle (`CHILD_SPAWN/START/END`), and memory compaction (`MEMORY_COMPACT_START/END`).

Sources: [rlm_code/rlm/events.py:17-80](), [rlm_code/rlm/events.py:190-298]()
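The subscribe-to-all versus filter-by-type behaviour can be sketched with a very small publish/subscribe class. `EventBusSketch` is hypothetical and uses plain strings for event types; the method names and handler signature are assumptions and do not describe the `RLMEventBus` API.

```python
from collections import defaultdict
from typing import Any, Callable, Optional

Handler = Callable[[str, dict[str, Any]], None]


class EventBusSketch:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._by_type: dict[str, list[Handler]] = defaultdict(list)
        self._all: list[Handler] = []

    def subscribe(self, handler: Handler, event_type: Optional[str] = None) -> None:
        # No event_type means the handler receives every event.
        if event_type is None:
            self._all.append(handler)
        else:
            self._by_type[event_type].append(handler)

    def emit(self, event_type: str, payload: dict[str, Any]) -> None:
        for handler in self._all + self._by_type.get(event_type, []):
            handler(event_type, payload)
```

A UI could subscribe to everything for a live log, while a metrics sink might register only for `step_end`.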

---

## 6. Cycle guard and delegation

When an action's name is `delegate` or `delegate_batch`, the runner recursively calls `run_task()` for each sub-task. A `_RecursionState` object tracks a SHA-1 fingerprint of every `(environment, task)` pair that is currently active. If a recursive call would repeat an already-active fingerprint, it is skipped and assigned a reward of `-0.25` to discourage cycles.

Sources: [rlm_code/rlm/runner.py:536-581]()

The recursion depth is bounded by `max_depth` (default 2) and an optional `time_budget_seconds` deadline shared across all nested calls.
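The cycle guard can be illustrated with a set of active fingerprints. `RecursionStateSketch` and `delegate` are hypothetical: the SHA-1 fingerprint and the `-0.25` reward come from the description above, while the exact string that gets hashed is an assumption and the depth and time limits are omitted for brevity.

```python
import hashlib
from typing import Callable


class RecursionStateSketch:
    """Tracks which (environment, task) pairs are currently running."""

    def __init__(self) -> None:
        self.active: set[str] = set()

    @staticmethod
    def fingerprint(environment: str, task: str) -> str:
        # Assumption: hash the environment and the stripped task text.
        payload = f"{environment}\n{task.strip()}".encode("utf-8")
        return hashlib.sha1(payload).hexdigest()


def delegate(
    state: RecursionStateSketch,
    environment: str,
    task: str,
    run: Callable[[str, str], float],
) -> float:
    key = state.fingerprint(environment, task)
    if key in state.active:
        return -0.25  # already running higher up the call stack: skip it
    state.active.add(key)
    try:
        return run(environment, task)
    finally:
        state.active.discard(key)  # finished tasks may be delegated again
```

Only tasks that are still in flight count as cycles, so the same sub-task can be delegated again once an earlier attempt has returned.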

---

## 7. The full loop as a state diagram

```mermaid
stateDiagram-v2
    [*] --> Initializing : run_task() called
    Initializing --> StepLoop : run_id created, memory=[], trajectory=[]

    state StepLoop {
        [*] --> BuildContext
        BuildContext --> ProposeAction : env.planner_prompt()
        ProposeAction --> ScoreCandidates : branch_width > 1
        ScoreCandidates --> SelectBest
        ProposeAction --> SelectBest : branch_width = 1
        SelectBest --> ExecuteAction : env.execute_action()
        ExecuteAction --> UpdateMemory : EnvironmentActionResult
        UpdateMemory --> AppendTrajectory
        AppendTrajectory --> CheckDone
        CheckDone --> [*] : done=True
        CheckDone --> BuildContext : continue
    }

    StepLoop --> Synthesize : max_steps exhausted
    StepLoop --> Done : action_result.done=True
    StepLoop --> Cancelled : cancel requested
    StepLoop --> TimedOut : deadline exceeded

    Synthesize --> Done
    Done --> PersistFinal : write final JSONL event
    Cancelled --> PersistFinal
    TimedOut --> PersistFinal
    PersistFinal --> [*]
```

---

## Summary

The RLM loop in this repository is a reinforcement-learning-style agent harness that persists every run to disk. `RLMRunner.run_task()` drives the step loop; `ActionPlannerMixin` handles prompting and candidate scoring; environments execute code in sandboxes and assign rewards; `termination.py` defines `FINAL()`, `FINAL_VAR()`, and `SUBMIT()` as the three ways code can signal completion; `MemoryCompactor` prevents context bloat over long runs; and every event is appended to a JSONL file and broadcast over the in-process event bus. The design is model-provider-agnostic: any `llm_connector` that implements `generate_response()` works with no changes to the loop itself.

Sources: [rlm_code/rlm/runner.py:488-847]()
