# Environments & Sandboxes — Where Code Actually Runs

> The four built-in environments (DSPy coding, Generic, TraceAnalysis, PureRLM), what each one does, and the sandbox runtimes (Docker, Monty, mock) that execute untrusted code safely.

- Repository: SuperagenticAI/rlm-code
- GitHub: https://github.com/SuperagenticAI/rlm-code
- Human wiki: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91
- Complete Markdown: https://grok-wiki.com/public/wiki/superagenticai-rlm-code-8e144acefc91/llms-full.txt

## Source Files

- `rlm_code/rlm/environments.py`
- `rlm_code/rlm/pure_rlm_environment.py`
- `rlm_code/execution/sandbox.py`
- `rlm_code/execution/engine.py`
- `rlm_code/rlm/docker_interpreter.py`
- `rlm_code/rlm/monty_interpreter.py`
- `rlm_code/sandbox/runtimes.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [rlm_code/rlm/environments.py](rlm_code/rlm/environments.py)
- [rlm_code/rlm/pure_rlm_environment.py](rlm_code/rlm/pure_rlm_environment.py)
- [rlm_code/execution/sandbox.py](rlm_code/execution/sandbox.py)
- [rlm_code/execution/engine.py](rlm_code/execution/engine.py)
- [rlm_code/rlm/docker_interpreter.py](rlm_code/rlm/docker_interpreter.py)
- [rlm_code/rlm/monty_interpreter.py](rlm_code/rlm/monty_interpreter.py)
- [rlm_code/sandbox/runtimes/registry.py](rlm_code/sandbox/runtimes/registry.py)
- [rlm_code/sandbox/runtimes/monty_runtime.py](rlm_code/sandbox/runtimes/monty_runtime.py)
- [rlm_code/sandbox/runtimes/base.py](rlm_code/sandbox/runtimes/base.py)
- [rlm_code/sandbox/superbox.py](rlm_code/sandbox/superbox.py)
</details>


When the RLM (Recursive Language Model) loop produces an action, two things happen: the *environment* decides what that action means and which tools are available to the LLM planner, while the *sandbox runtime* actually executes any Python code generated, in isolation. These are deliberately separate concerns — you can mix and match any environment with any compatible runtime.

This page explains the four built-in environments (`generic`, `dspy`, `trace_analysis`, `pure_rlm`), what each one teaches the planner and how it scores results, then walks through the runtime ladder: from Monty (Rust-sandboxed Python) and Docker (container-per-step) to the simple local subprocess fallback and optional cloud providers. Together they define where code actually runs and how safe that execution is.

---

## The Environment Abstraction

Every environment implements three methods (defined in `RLMEnvironment`, a `Protocol`):

| Method | Purpose |
|---|---|
| `system_prompt()` | Tells the LLM planner what actions are available and how to format its JSON response |
| `planner_prompt(task, memory, trajectory, step_index)` | Builds the per-step prompt including task, recent history, and environment context |
| `execute_action(action, execution_engine, timeout, llm_connector)` | Dispatches the JSON action, runs code if needed, and returns an `EnvironmentActionResult` |

`EnvironmentActionResult` carries four fields that flow back to the runner: `observation` (what the planner sees next), `reward` (a float in `[-1.0, 1.0]`), `done` (stop the loop?), and `final_response` (the answer to emit when done).
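
The interface can be sketched with `typing.Protocol`. Only the method names, parameter names, and the four result field names come from the text above; the type annotations, default values, and the toy `EchoEnvironment` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass
class EnvironmentActionResult:
    """Assumed shape: only the four field names come from the text."""
    observation: Any
    reward: float                       # expected to lie in [-1.0, 1.0]
    done: bool = False
    final_response: Optional[str] = None


class RLMEnvironment(Protocol):
    """Structural interface: any class with these methods qualifies."""

    def system_prompt(self) -> str: ...

    def planner_prompt(self, task: str, memory: list, trajectory: list,
                       step_index: int) -> str: ...

    def execute_action(self, action: dict, execution_engine: Any,
                       timeout: int, llm_connector: Any) -> EnvironmentActionResult: ...


class EchoEnvironment:
    """Hypothetical toy environment that finishes on the first step."""

    def system_prompt(self) -> str:
        return "Return ONLY valid JSON."

    def planner_prompt(self, task, memory, trajectory, step_index):
        return f"Step {step_index}: {task}"

    def execute_action(self, action, execution_engine, timeout, llm_connector):
        return EnvironmentActionResult(
            observation={"echo": action},
            reward=1.0,
            done=True,
            final_response=str(action.get("final_response", "")),
        )


env: RLMEnvironment = EchoEnvironment()   # accepted structurally, no inheritance
result = env.execute_action({"action": "final", "final_response": "42"}, None, 30, None)
```

Because the interface is a `Protocol`, a custom environment does not need to subclass anything; matching the three method signatures is enough.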

Rewards are shaped by `RLMRewardProfile`, a dataclass with named fields for every scoring dimension (base, success bonus, failure penalty, verifier checks, DSPy pattern bonuses, etc.). The profile can be passed at construction from a plain dict, making reward tuning fully data-driven without changing environment code.
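
As a rough illustration of the data-driven idea, the sketch below builds a profile from a plain dict. `RewardProfile`, its field names, and `from_mapping` are hypothetical stand-ins, not the actual `RLMRewardProfile` API; the default values are borrowed from the generic reward formula on this page.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass(frozen=True)
class RewardProfile:
    """Hypothetical stand-in for RLMRewardProfile; field names are illustrative."""
    base: float = 0.1
    success_bonus: float = 0.7
    failure_penalty: float = 0.3
    stderr_penalty: float = 0.1

    @classmethod
    def from_mapping(cls, raw: Optional[dict]) -> "RewardProfile":
        """Build a profile from a plain dict, ignoring keys that are not fields."""
        known = {f.name for f in fields(cls)}
        clean: dict[str, Any] = {
            key: float(value) for key, value in (raw or {}).items() if key in known
        }
        return cls(**clean)


# Override one dimension; everything else keeps its default.
profile = RewardProfile.from_mapping({"success_bonus": 0.9, "not_a_field": 1})
```

Ignoring unknown keys is a design choice of this sketch; a stricter loader could raise instead, which surfaces typos in configuration files earlier.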

Sources: [rlm_code/rlm/environments.py:100-120](), [rlm_code/rlm/environments.py:44-98]()

---

## Built-in Environments

### GenericRLMEnvironment (`name = "generic"`)

The baseline environment. The planner can take exactly two actions:

- **`run_python`** — submit a code string; the environment calls `execution_engine.execute_code(code, timeout=exec_timeout)` and computes a reward from success/failure/stderr.
- **`final`** — declare the task done and emit a final answer (reward `1.0`).

The system prompt consists of minimal JSON-only instructions. The planner prompt shows the task, the last six memory entries, and the last three trajectory steps (action, success, reward).

```python
# rlm_code/rlm/environments.py:138-145
def system_prompt(self) -> str:
    return (
        "You are an RLM planner.\n"
        "Return ONLY valid JSON with keys: "
        "action, code, rationale, done, final_response.\n"
        'Valid action values: "run_python", "final".\n'
        "No markdown. JSON only."
    )
```

Reward formula: base `0.1` + success bonus `0.7` − failure penalty `0.3` − stderr penalty `0.1`, clamped to `[-1.0, 1.0]`.
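
One plausible reading of the formula is sketched below. It assumes the success bonus and the failure penalty are mutually exclusive and that the stderr penalty applies whenever stderr is non-empty; the function name and that branching are assumptions, not taken from the source.

```python
def generic_reward(success: bool, stderr: str,
                   base: float = 0.1, success_bonus: float = 0.7,
                   failure_penalty: float = 0.3, stderr_penalty: float = 0.1) -> float:
    """Combine the scoring terms, then clamp to [-1.0, 1.0]."""
    reward = base
    # Assumption: exactly one of bonus/penalty applies per step.
    reward += success_bonus if success else -failure_penalty
    # Assumption: any non-blank stderr output costs a fixed penalty.
    if stderr.strip():
        reward -= stderr_penalty
    return max(-1.0, min(1.0, reward))


clean_success = generic_reward(True, "")
noisy_failure = generic_reward(False, "Traceback (most recent call last): ...")
```

Under these assumptions a clean success lands near `0.8` and a failure that also writes to stderr lands near `-0.3`, up to floating-point rounding.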

Sources: [rlm_code/rlm/environments.py:122-286]()

---

### DSPyCodingRLMEnvironment (`name = "dspy"`)

Extends `GenericRLMEnvironment` with a rich action vocabulary for code authoring tasks, especially DSPy modules. The action set grows to fourteen verbs:

| Action | What it does |
|---|---|
| `run_python` | Execute arbitrary Python (with DSPy pattern bonus on top) |
| `write_file` | Write a new file under `workdir`; runs the post-write verifier suite |
| `patch_file` | Apply a search-and-replace or full-content rewrite |
| `read_file` | Return a line-range excerpt of a file |
| `search_code` | Regex search over `.py` files in the project |
| `list_tree` | Enumerate directory entries up to configurable depth |
| `run_tests` | Run `pytest` (via subprocess or the execution engine) |
| `analyze_dspy` / `analyze_code` | Score DSPy source quality (0–100 heuristic) |
| `llm_query` | Forward a prompt to the LLM connector for delegated analysis |
| `llm_query_batched` | Run multiple prompts concurrently with `ThreadPoolExecutor` |
| `delegate` / `delegate_batch` | Reserved for recursive subtask spawning |
| `final` | Terminate with an answer |

**Path safety:** every file action goes through `_safe_resolve`, which rejects paths that escape `workdir` via symlinks or `..` traversal.

```python
# rlm_code/rlm/environments.py:1333-1341
def _safe_resolve(self, path_raw: str) -> Path | None:
    path = Path(path_raw)
    if path.is_absolute():
        resolved = path.resolve()
    else:
        resolved = (self.workdir / path).resolve()
    if not resolved.is_relative_to(self.workdir):
        return None
    return resolved
```
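
To see the check in action, here is a standalone variant that can be run outside the class. The function name and the temporary directory layout are made up for the demo; the one deliberate difference is that the root is resolved as well, since `is_relative_to` compares path components literally and a symlinked temp directory would otherwise cause false rejections. `Path.is_relative_to` needs Python 3.9 or newer.

```python
import tempfile
from pathlib import Path
from typing import Optional


def safe_resolve(workdir: Path, path_raw: str) -> Optional[Path]:
    """Return the resolved path, or None if it would land outside workdir."""
    root = workdir.resolve()            # resolve the root too (symlinked temp dirs)
    candidate = Path(path_raw)
    if candidate.is_absolute():
        resolved = candidate.resolve()
    else:
        resolved = (root / candidate).resolve()   # collapses ".." and follows symlinks
    return resolved if resolved.is_relative_to(root) else None


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp) / "project"
    workdir.mkdir()
    inside = safe_resolve(workdir, "pkg/module.py")      # stays under workdir
    traversal = safe_resolve(workdir, "../outside.py")   # climbs out via ".."
    absolute = safe_resolve(workdir, "/etc/passwd")      # absolute path elsewhere
```

Only `inside` comes back as a path; the other two are `None`, which the caller can turn into a failed action result.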

**Post-write verifier suite:** after every `write_file` or `patch_file`, the environment runs three checks automatically:
1. `python -m compileall` byte-compiles the code, catching syntax errors without executing it.
2. `pytest -q tests/test_<stem>.py` — runs a matching test file if one exists.
3. `execution_engine.validate_code(content)` — DSPy-aware linting (deprecated API checks, etc.).

The verifier outcome feeds directly into reward calculation: compile bonus `+0.20`, pytest bonus `+0.25`, validation bonus `+0.15`, with matching penalties on failure.

**DSPy pattern bonus:** running code that imports `dspy`, uses `dspy.Signature`, `dspy.InputField`, `dspy.OutputField`, `dspy.Module`, or implements `forward()` earns a small extra reward per matched pattern, capped at `+0.20`.

Sources: [rlm_code/rlm/environments.py:565-1483]()

---

### TraceAnalysisEnvironment (`name = "trace_analysis"`)

A HALO-style environment for inspecting agent execution traces stored as one-span-per-line JSONL files. It wraps a `TraceStore` object and exposes eight trace-specific actions:

| Action | Reward | Purpose |
|---|---|---|
| `set_trace_path` | 0.55 | Load a JSONL trace dataset |
| `get_dataset_overview` | 0.45 | Count traces, spans, errors; get sample IDs |
| `query_traces` | 0.50 | List traces with optional filters (errors, model, service, agent, project) |
| `count_traces` | 0.35 | Count traces matching a filter |
| `view_trace` | 0.65 | Fetch all spans for one trace ID |
| `search_trace` | 0.65 | Substring-search spans within a trace |
| `view_spans` | 0.70 | Fetch a specific list of span IDs |
| `export_evidence_corpus` | 0.75 | Write filtered traces to a directory for downstream agents |

The planner prompt automatically injects the active trace path and a live overview (total traces, spans, error count, sample IDs) so the LLM does not need to request it manually. If the task string contains `trace=<path>` or `trace_path=<path>`, the environment loads the file proactively before the first LLM step.
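
Detecting the path hint could look like the sketch below. The regular expression and the helper name are assumptions; the source only states that `trace=<path>` and `trace_path=<path>` are recognized.

```python
import re
from typing import Optional

# Assumed pattern: the key, "=", then a run of non-whitespace characters.
_TRACE_HINT = re.compile(r"\btrace(?:_path)?=(\S+)")


def extract_trace_path(task: str) -> Optional[str]:
    """Return the path embedded in the task string, or None if there is no hint."""
    match = _TRACE_HINT.search(task)
    return match.group(1) if match else None


hinted = extract_trace_path("Find systemic failures trace=runs/a.jsonl please")
```

A path containing spaces would be cut short by this pattern, so a real implementation may need quoting rules.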

The goal articulated in the system prompt is to find *systemic* harness failure modes, not one-off anomalies, and to produce concrete evidence reports with trace IDs and span references.

Sources: [rlm_code/rlm/environments.py:289-491]()

---

### PureRLMEnvironment (`name = "pure_rlm"`)

This environment implements the exact semantics from the *Recursive Language Models* paper (2025). It is structurally different from the other three:

- The planner's response is **free-form text**, not constrained JSON. Code blocks are extracted by regex from fenced blocks tagged `repl` or `python`.
- The context is stored as a **REPL variable** (`context`) rather than appearing in the token window. The LLM only sees metadata (type, total character count, chunk sizes).
- **`llm_query(prompt)`** and **`llm_query_batched(prompts)`** are available inside the REPL as callable functions. These make recursive LLM calls from within code execution.
- Termination is via `FINAL(answer)` or `FINAL_VAR(variable_name)` called in code or written in text, rather than a JSON `"action": "final"`.
- Message history **grows** across iterations (it is never truncated) so the full chain of reasoning and REPL output is preserved.
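
A minimal sketch of the free-form parsing described above: pull fenced `repl`/`python` blocks out of the reply and look for a textual `FINAL(...)`. The regular expressions and helper names are assumptions, not the project's actual patterns.

```python
import re
from typing import Optional

FENCE = "`" * 3   # built at runtime so this example can itself live in a fenced block
_CODE_BLOCK = re.compile(FENCE + r"(?:repl|python)[ \t]*\n(.*?)" + FENCE, re.DOTALL)
_FINAL_TEXT = re.compile(r"^\s*FINAL\((.+)\)\s*$", re.MULTILINE)


def extract_code_blocks(response: str) -> list:
    """Return the body of every fenced block tagged repl or python."""
    return [match.group(1).strip() for match in _CODE_BLOCK.finditer(response)]


def extract_final(response: str) -> Optional[str]:
    """Return the argument of a FINAL(...) line written in plain text, if any."""
    match = _FINAL_TEXT.search(response)
    return match.group(1).strip() if match else None


reply = (
    "Let me check how large the context is.\n"
    f"{FENCE}repl\nprint(len(context))\n{FENCE}\n"
    "FINAL(done)\n"
)
blocks = extract_code_blocks(reply)
answer = extract_final(reply)
```

A real parser also has to decide what happens when `FINAL(...)` appears inside a code block rather than in prose; this sketch does not distinguish the two.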

The REPL namespace starts from a tightly restricted `SAFE_BUILTINS` dict — `eval`, `exec`, `compile`, `globals`, `locals`, `__import__`, `subprocess`, and `os.system` are all absent. A pre-flight scanner also blocks these patterns via regex before `exec()` is called:

```python
# rlm_code/rlm/pure_rlm_environment.py:144-162
_BLOCKED_CODE_PATTERNS = [
    (re.compile(r"\b__import__\s*\("), "Dynamic __import__() is blocked"),
    (re.compile(r"\bos\.system\s*\("),  "os.system() is blocked"),
    (re.compile(r"\bsubprocess\b"),      "subprocess module is blocked"),
    (re.compile(r"\beval\s*\("),         "eval() is blocked"),
    ...
]
```

The `open()` function is replaced with a `safe_open` that only allows read-only access to files under `workdir`.
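
The two layers (regex pre-flight, then `exec()` with a reduced builtins table) can be sketched as follows. The tiny `SAFE_BUILTINS` table and both helper names are illustrative; only two of the patterns are copied from the excerpt above.

```python
import re
from typing import Optional

BLOCKED = [
    (re.compile(r"\bsubprocess\b"), "subprocess module is blocked"),
    (re.compile(r"\beval\s*\("), "eval() is blocked"),
]
# Deliberately tiny; the real table is larger.
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum, "print": print}


def preflight(code: str) -> Optional[str]:
    """Return the first violation message, or None if no pattern matches."""
    for pattern, message in BLOCKED:
        if pattern.search(code):
            return message
    return None


def run_restricted(code: str) -> dict:
    """Scan the code, then exec it with only SAFE_BUILTINS visible."""
    violation = preflight(code)
    if violation:
        raise PermissionError(violation)
    namespace = {"__builtins__": dict(SAFE_BUILTINS)}
    exec(code, namespace)
    namespace.pop("__builtins__")
    return namespace


variables = run_restricted("total = sum(range(4))")
```

Restricted builtins plus pattern matching is a speed bump rather than a security boundary in CPython, which is consistent with the environment asking for a real interpreter sandbox by default.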

**Multi-file helpers** (`load_file`, `load_files`, `switch_to`, `list_files`, `remove_file`) and a `chunk_indices(total_length, chunk_size, overlap)` helper are pre-injected into the namespace to support large-document workflows.

**Interpreter selection:** `PureRLMEnvironment` requires either an explicit `interpreter` (a `MontyInterpreter` or `DockerPersistentInterpreter` instance) or `allow_unsafe_exec=True` for local experiments:

```python
# rlm_code/rlm/pure_rlm_environment.py:500-507
if interpreter is None:
    if not self._allow_unsafe_exec:
        raise RuntimeError(
            "PureRLMEnvironment requires a secure interpreter by default. "
            "Pass interpreter=MontyInterpreter(...) or interpreter=DockerPersistentInterpreter(...)."
        )
```

Sources: [rlm_code/rlm/pure_rlm_environment.py:418-540](), [rlm_code/rlm/pure_rlm_environment.py:144-182]()

---

## Environment Comparison

```text
                 Actions          Code execution    LLM calls from code    Context in token window
generic          2                run_python only   no                     yes (task + history)
dspy            14                run_python + file no                     yes
trace_analysis   8                no code exec      no                     yes
pure_rlm         free-form REPL   exec()/interp     yes (llm_query*)       no — in `context` var
```

*`llm_query` calls are subject to `max_llm_calls` (default 50) enforced with a thread lock.
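
A call budget guarded by a lock might look like this. The class and method names are hypothetical; only the limit of 50 and the use of a thread lock come from the text.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LLMCallBudget:
    """Hypothetical counter that hands out at most max_llm_calls permits."""

    def __init__(self, max_llm_calls: int = 50) -> None:
        self._max = max_llm_calls
        self._used = 0
        self._lock = threading.Lock()

    def acquire(self) -> bool:
        """Reserve one call; returns False once the budget is exhausted."""
        with self._lock:                 # check-and-increment must be atomic
            if self._used >= self._max:
                return False
            self._used += 1
            return True


budget = LLMCallBudget(max_llm_calls=50)
with ThreadPoolExecutor(max_workers=8) as pool:
    # 80 concurrent attempts; True counts as 1 when summed.
    granted = sum(pool.map(lambda _: budget.acquire(), range(80)))
```

The lock matters because `llm_query_batched` issues calls from several threads at once; without it, two threads could both pass the check before either increments the counter.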

---

## The Execution Stack

When an environment calls `execution_engine.execute_code(code, timeout)`, that call passes through validation, the sandbox wrapper, and runtime resolution before any code runs.

```text
Environment.execute_action()
        │
        ▼
ExecutionEngine.execute_code()    [rlm_code/execution/engine.py]
   ├─ validate_code()             (AST syntax + import checks + DSPy warnings)
   └─ ExecutionSandbox.execute()  [rlm_code/execution/sandbox.py]
           │
           ▼
        Superbox.resolve_runtime()  [rlm_code/sandbox/superbox.py]
           │  priority: runtime_override → config.sandbox.runtime → fallbacks
           ├─ local           LocalSandboxRuntime   (subprocess, always available)
           ├─ monty           MontySandboxRuntime   (pydantic_monty Rust VM)
           ├─ docker          DockerSandboxRuntime  (docker run --rm)
           ├─ apple-container AppleContainerRuntime (Apple VM, macOS)
           ├─ modal           ModalSandboxRuntime   (cloud, optional)
           ├─ e2b             E2BSandboxRuntime     (cloud, optional)
           └─ daytona         DaytonaSandboxRuntime (cloud, optional)
```

`ExecutionEngine` runs `validate_code` first (AST parse, dangerous-import scan, DSPy API checks) and returns a failed `ExecutionResult` immediately if validation fails, before any subprocess is spawned. The `Superbox` layer tries runtimes in priority order, skipping known-unavailable ones, and raises `ConfigurationError` only if every candidate fails.
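
The validate-before-run idea can be illustrated with the standard `ast` module. The deny-list and the function name `validate_before_run` are assumptions; the engine's real checks are not reproduced on this page.

```python
import ast

# Illustrative deny-list, not the engine's actual one.
DENIED_MODULES = {"subprocess", "ctypes"}


def validate_before_run(code: str) -> list:
    """Return a list of problems; an empty list means the code may be executed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]
    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            # Compare the top-level package so "ctypes.util" is caught too.
            if name.split(".")[0] in DENIED_MODULES:
                problems.append(f"import of '{name}' is not allowed")
    return problems


issues = validate_before_run("import subprocess\nprint('hi')")
```

Because the check works on the parsed tree, nothing is executed and no subprocess is needed to reject bad input.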

Sources: [rlm_code/execution/engine.py:49-195](), [rlm_code/sandbox/superbox.py:29-116]()

---

## Sandbox Runtimes

### local — subprocess baseline

Writes code to a temp file and runs it with `subprocess.run([python_exe, code_file, ...])`. The environment is stripped down: `PYTHONPATH=""`, `PYTHONUNBUFFERED=1`, `HOME`/`TMPDIR` point to the temp dir, and `PATH` is limited to `/usr/bin:/bin`. Additional host env vars can be allowed via `sandbox.env_allowlist`.

This runtime is always available and is used as the ultimate fallback.
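
A minimal sketch of the stripped-down subprocess approach, assuming a POSIX host. The helper name, the use of `sys.executable`, and running with the temp dir as working directory are choices made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_local(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Write code to a temp file and run it with a minimal environment."""
    with tempfile.TemporaryDirectory() as tmp:
        code_file = Path(tmp) / "snippet.py"
        code_file.write_text(code, encoding="utf-8")
        env = {
            "PYTHONPATH": "",
            "PYTHONUNBUFFERED": "1",
            "HOME": tmp,
            "TMPDIR": tmp,
            "PATH": "/usr/bin:/bin",
        }
        # Passing env= replaces the parent's environment instead of extending it.
        return subprocess.run(
            [sys.executable, str(code_file)],
            env=env, cwd=tmp, capture_output=True, text=True, timeout=timeout,
        )


result = run_local("import os; print(os.environ['PATH'])")
```

This limits what the child inherits, but the process still runs as the same user with normal filesystem and network access, which is why the runtime table ranks it as the least safe option.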

Sources: [rlm_code/execution/sandbox.py:163-200]()

---

### monty — Rust-sandboxed Python

`MontySandboxRuntime` wraps `MontyInterpreter`, which uses `pydantic_monty.Monty` — a Python interpreter written in Rust. Key properties:

- **No filesystem access, no network, no `import`** — the sandbox is enforced at the Rust VM level, not by Python policy.
- **Resource limits** via `ResourceLimits(max_duration_secs, max_memory, max_allocations)`.
- **External function dispatch**: when Monty code calls `llm_query`, `FINAL`, `FINAL_VAR`, or any user-registered tool, execution pauses (`MontySnapshot`) and control returns to the host Python process, which runs the handler and calls `snapshot.resume(return_value=...)` to continue.
- **Variable persistence** across REPL steps is simulated: the host injects known variables as `inputs`, and a synthetic `__rlm_collect__({...})` call at the end of each block sends new variables back.
- **Optional type checking** using Monty's Ruff-based parser (set `type_check=True`).
- **Microsecond startup** — no container to spin up.

The `create_rlm_monty_interpreter()` factory wires up all standard RLM external functions (`FINAL`, `FINAL_VAR`, `SUBMIT`, `SHOW_VARS`, `llm_query`, `llm_query_batched`) in one call.

```python
# rlm_code/rlm/monty_interpreter.py:864-886
interp = MontyInterpreter(
    timeout=timeout,
    tools=tools,
    resource_limits=resource_limits,
    type_check=type_check,
)
interp.start()
interp.register_external("FINAL", lambda answer: None)
interp.register_external("FINAL_VAR", lambda var_name: None)
...
```

Sources: [rlm_code/rlm/monty_interpreter.py:254-612](), [rlm_code/sandbox/runtimes/monty_runtime.py]()

---

### docker — container-per-step (ExecutionSandbox path)

`DockerSandboxRuntime` (used via `ExecutionSandbox`) runs each code file in an ephemeral container (`docker run --rm`). Networking is disabled by default (`--network none`). Dangerous Docker flags (`--privileged`, `--volume`, `--mount`, `--network=host`, etc.) are rejected by the registry before the runtime is created.
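
Flag rejection could be implemented along these lines. This is a sketch under assumptions: `extra_args` is a hypothetical list of user-supplied `docker run` arguments, the deny-list holds only the flags named above plus `-v` (the short form of `--volume`), and a `ValueError` stands in for whatever the registry actually raises.

```python
# Flags named in the text, plus "-v"; the real list may be longer.
DENIED_FLAGS = ("--privileged", "--volume", "-v", "--mount")


def check_docker_args(extra_args: list) -> None:
    """Raise ValueError if a user-supplied docker argument is on the deny-list."""
    for index, arg in enumerate(extra_args):
        flag, _, value = arg.partition("=")
        if flag in DENIED_FLAGS:
            raise ValueError(f"docker flag not allowed: {flag}")
        if flag == "--network":
            # Accept both spellings: "--network=host" and "--network host".
            following = extra_args[index + 1] if index + 1 < len(extra_args) else ""
            if (value or following) == "host":
                raise ValueError("docker flag not allowed: --network=host")


check_docker_args(["--memory=256m", "--cpus=1"])   # passes silently
```

Validating before the runtime object exists means a bad configuration fails fast, rather than on the first code execution.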

**DockerPersistentInterpreter** (the interpreter used by `PureRLMEnvironment`) is a separate, higher-level implementation that maintains REPL state across steps:

- A shared session directory on the host is mounted into each container as `/workspace`.
- The REPL namespace is serialized to `state.dill` (falling back to `pickle`) after each step and reloaded before the next.
- External functions (e.g., `llm_query`) are dispatched over a lightweight HTTP bridge: the container script calls `http://host.docker.internal:<port>/external`, and the host's `_ProxyHandler` runs the actual function and returns the result as a base64-encoded pickle wrapped in JSON.
- `FinalOutput` and `SubmitOutput` exceptions raised on the host side are forwarded back through the proxy as structured error payloads.
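
The persistence and transport ideas in the list above can be sketched with the standard library. All four helper names and the JSON keys are hypothetical; the only details taken from the text are the `dill`-then-`pickle` fallback and the base64-encoded pickle carried inside JSON.

```python
import base64
import json
import pickle
import tempfile
from pathlib import Path
from typing import Any

try:                       # dill serializes more object types than pickle
    import dill as serializer
except ImportError:        # fall back to the standard library
    serializer = pickle


def save_state(namespace: dict, path: Path) -> list:
    """Persist every value that serializes; return the names that were skipped."""
    kept, skipped = {}, []
    for name, value in namespace.items():
        try:
            serializer.dumps(value)
            kept[name] = value
        except Exception:
            skipped.append(name)
    path.write_bytes(serializer.dumps(kept))
    return skipped


def load_state(path: Path) -> dict:
    """Reload the namespace, or start empty on the first step."""
    return serializer.loads(path.read_bytes()) if path.exists() else {}


def encode_result(value: Any) -> str:
    """Wrap a pickled value as base64 text inside a JSON object."""
    blob = base64.b64encode(pickle.dumps(value)).decode("ascii")
    return json.dumps({"ok": True, "result": blob})


def decode_result(payload: str) -> Any:
    return pickle.loads(base64.b64decode(json.loads(payload)["result"]))


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "state.dill"
    skipped = save_state({"n": 3, "xs": [1, 2]}, state_file)
    restored = load_state(state_file)

echoed = decode_result(encode_result({"answer": 42}))
```

Unpickling runs arbitrary code by design, so this scheme is only reasonable when both ends of the bridge are controlled by the same host process.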

```python
# rlm_code/rlm/docker_interpreter.py:267-298
def _build_docker_command(self, code: str) -> list[str]:
    mount_arg = f"{self._session_dir}:/workspace:rw"
    cmd = [
        "docker", "run", "--rm",
        "--workdir", "/workspace",
        "--volume", mount_arg,
        "--add-host", "host.docker.internal:host-gateway",
    ]
    ...
    cmd.extend([self.image, "python", "-c", script])
    return cmd
```

Sources: [rlm_code/rlm/docker_interpreter.py:41-544]()

---

### apple-container — macOS VM runtime

`AppleContainerRuntime` uses Apple's `container` CLI (macOS only) with similar semantics to Docker. It requires `sandbox.apple_container_enabled=true` in config and checks for the `container` binary on startup.

---

### Cloud runtimes (Modal, E2B, Daytona)

Three optional cloud-based runtimes are registered under the same `SandboxRuntime` protocol. They are loaded at import time and silently skipped if their SDKs are not installed:

| Runtime | Install | Notes |
|---|---|---|
| Modal | `pip install modal && modal setup` | Configurable memory/CPU |
| E2B | `pip install e2b-code-interpreter` | Template-based sandboxes |
| Daytona | `pip install daytona-sdk` or CLI | Workspace-based execution |
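
One way to skip runtimes whose SDK is missing is to probe for the module before registering it. The sketch uses `importlib.util.find_spec`, which may differ from what the project does, and the import names in `OPTIONAL_SDKS` are guesses derived from the install commands in the table.

```python
import importlib.util

# Import names are assumptions based on the pip package names above.
OPTIONAL_SDKS = {
    "modal": "modal",
    "e2b": "e2b_code_interpreter",
    "daytona": "daytona_sdk",
}


def available_cloud_runtimes(sdks: dict = OPTIONAL_SDKS) -> list:
    """Return runtime names whose SDK module can be located, without importing it."""
    found = []
    for runtime, module_name in sdks.items():
        try:
            if importlib.util.find_spec(module_name) is not None:
                found.append(runtime)
        except (ImportError, ValueError):
            continue   # treat a broken installation like a missing one
    return found


cloud_runtimes = available_cloud_runtimes()
```

Probing without importing keeps startup cheap, since heavyweight SDKs are only loaded when their runtime is actually selected.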

---

## Runtime Selection & Fallback (Superbox)

`Superbox` centralizes runtime selection. On every `ExecutionSandbox.execute()` call it:

1. Reads `sandbox.runtime` from config (or uses the session-level `runtime_override`).
2. Probes all runtimes with `detect_runtime_health()`.
3. Tries the primary runtime first; if that fails, tries fallback candidates in order (`docker` → `apple-container` → `local`), skipping runtimes already flagged as unhealthy.
4. Raises `ConfigurationError` if everything fails.

Auto-fallback is enabled by default (`superbox_auto_fallback=true`) but can be turned off. The fallback list can also be overridden via `superbox_fallback_runtimes`.
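
The selection steps above can be sketched as a single function. This is a simplified illustration: the `health` and `factories` dicts are hypothetical inputs, the `ConfigurationError` class is a local stand-in, and the sketch assumes an unhealthy primary is skipped just like an unhealthy fallback.

```python
from typing import Callable


class ConfigurationError(RuntimeError):
    """Local stand-in for the project's exception of the same name."""


def resolve_runtime(primary: str,
                    health: dict,
                    factories: dict,
                    fallbacks: tuple = ("docker", "apple-container", "local"),
                    auto_fallback: bool = True) -> tuple:
    """Try the primary runtime, then each healthy fallback, in order."""
    candidates = [primary]
    if auto_fallback:
        candidates += [name for name in fallbacks if name != primary]
    errors = []
    for name in candidates:
        if not health.get(name, False):
            errors.append(f"{name}: unhealthy")
            continue
        try:
            factory: Callable[[], object] = factories[name]
            return name, factory()
        except Exception as exc:          # creation failed, try the next candidate
            errors.append(f"{name}: {exc}")
    raise ConfigurationError("no sandbox runtime available (" + "; ".join(errors) + ")")


health = {"monty": False, "docker": False, "apple-container": False, "local": True}
factories = {"local": lambda: "local-runtime"}
name, runtime = resolve_runtime("monty", health, factories)
```

With `auto_fallback=False` the same call raises `ConfigurationError` straight away, because the unhealthy primary is the only candidate.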

```python
SUPPORTED_RUNTIMES = {"local", "monty", "docker", "apple-container",
                      "modal", "e2b", "daytona"}
```

Sources: [rlm_code/sandbox/runtimes/registry.py:46-50](), [rlm_code/sandbox/superbox.py:37-116]()

---

## Choosing a Runtime

```text
Need                                  Recommended runtime
────────────────────────────────────  ─────────────────────────────────
Local dev, no Docker installed        local  (always works, least safe)
Fast execution, strong isolation      monty  (requires pydantic-monty)
Pure RLM with llm_query in REPL       monty or docker (DockerPersistentInterpreter)
Arbitrary OS access / packages        docker
macOS native VMs                      apple-container
Cloud execution, long timeout         modal / e2b / daytona
```

`PureRLMEnvironment` enforces its interpreter requirement at construction: passing `interpreter=MontyInterpreter(...)` or `interpreter=DockerPersistentInterpreter(...)` is required unless `allow_unsafe_exec=True` is set. Every other environment routes through `ExecutionEngine` → `ExecutionSandbox` → `Superbox`, so runtime selection is transparent to the environment code itself.

Sources: [rlm_code/rlm/pure_rlm_environment.py:499-513](), [rlm_code/sandbox/superbox.py:87-116]()
