# The Six Memory Backends: How Each One Works

> A plain-English tour of the six pluggable memory backends — no_retrieval, RAG variants, AgentRunbook-R, Codex, and AgentRunbook-C — explaining what each one stores and retrieves, plus the insert/query contract every custom backend must satisfy.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `memory_modules/memory.py`
- `memory_modules/no_retrieval.py`
- `memory_modules/rag.py`
- `memory_modules/agentrunbook_r.py`
- `memory_modules/agentrunbook_c.py`
- `memory_modules/codex.py`
- `memory_modules/trajectory_store.py`
- `memory_modules/support.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [memory_modules/memory.py](memory_modules/memory.py)
- [memory_modules/no_retrieval.py](memory_modules/no_retrieval.py)
- [memory_modules/rag.py](memory_modules/rag.py)
- [memory_modules/agentrunbook_r.py](memory_modules/agentrunbook_r.py)
- [memory_modules/agentrunbook_c.py](memory_modules/agentrunbook_c.py)
- [memory_modules/codex.py](memory_modules/codex.py)
- [memory_modules/trajectory_store.py](memory_modules/trajectory_store.py)
</details>


LongMemEval-V2 evaluates how well an agent can answer questions about its own past history—long sequences of browser-automation trajectories. To answer a question, the agent needs to *retrieve* the right evidence from those past trajectories. The memory backend is the component that does this: it stores trajectories when they arrive (`insert`) and retrieves relevant context when a question appears (`query`).

There are six concrete backends, each with a different retrieval strategy ranging from "do nothing" to "spawn a full coding agent to read the files." This page explains what each backend stores, how it retrieves, and describes the two-method contract (`insert` / `query`) that any custom backend must satisfy.

---

## The Base Contract: `Memory`

Every backend inherits from the abstract class `Memory` (defined in `memory_modules/memory.py`). The contract is minimal:

```python
# memory_modules/memory.py:43-54
@abstractmethod
def insert(self, trajectory: dict[str, object]) -> None:
    """Index one full trajectory object into the backend."""
    raise NotImplementedError

@abstractmethod
def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    """Return a formatted memory context payload for a query."""
    raise NotImplementedError
```

`insert` receives one trajectory dictionary (id, goal, states, screenshots, actions) and must index it in whatever form the backend prefers. `query` receives a natural-language question and optionally an image, and must return a list of `MemoryContextItem` dicts, each with `"type"` (`"text"` or `"image"`) and `"value"`. The harness concatenates these items into the final context block passed to the answering model.

Two optional hooks matter in practice:

- `configure_runtime(**kwargs)` — called once after build/load to warm up connections (embedding client, tokenizer).
- `post_query_hook(...)` — synchronous work after retrieval, used for logging.

Backends register themselves with `@register_memory`, which adds them to the `MEMORY_TYPES` dict keyed by `memory_type`. The factory functions `build_memory` and `load_memory` look up and instantiate them from this registry.
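
The registration pattern described above can be sketched as follows. This is a stand-alone illustration rather than the repository's code: only the names `register_memory` and `MEMORY_TYPES` come from the text, while the decorator signature, the `memory_type` class attribute, and the `build_memory_sketch` helper are assumptions made for the example.

```python
# Stand-in registry; the real one lives in memory_modules/memory.py.
MEMORY_TYPES: dict[str, type] = {}


def register_memory(cls: type) -> type:
    """Record a backend class under its `memory_type` key (assumed attribute)."""
    key = getattr(cls, "memory_type", None)
    if not isinstance(key, str) or not key:
        raise ValueError(f"{cls.__name__} must define a non-empty memory_type")
    if key in MEMORY_TYPES:
        raise ValueError(f"memory_type {key!r} is already registered")
    MEMORY_TYPES[key] = cls
    return cls


@register_memory
class EchoMemory:
    memory_type = "echo_sketch"  # hypothetical name, not a real backend


def build_memory_sketch(memory_type: str) -> object:
    """Hypothetical factory: look up the class in the registry and instantiate it."""
    try:
        return MEMORY_TYPES[memory_type]()
    except KeyError:
        raise ValueError(f"unknown memory_type: {memory_type!r}") from None
```

Keying the registry by a string lets a configuration file select a backend by name without importing its class directly.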

Sources: [memory_modules/memory.py:25-54](), [memory_modules/memory.py:140-144]()

---

## The Six Backends at a Glance

```text
┌─────────────────────┬──────────────────────────────┬──────────────────────────────────┐
│ memory_type         │ What is stored at insert time │ How query retrieves               │
├─────────────────────┼──────────────────────────────┼──────────────────────────────────┤
│ no_retrieval        │ nothing                       │ returns []                        │
│ rag                 │ raw-state slices + embeddings │ cosine similarity vector search   │
│                     │ + optional LLM notes          │ (+ optional notes search)         │
│ agentrunbook_r      │ raw-state slices + events     │ LLM generates multi-pool queries; │
│                     │ + notes + embeddings          │ cosine search + optional rerank   │
│ codex               │ trajectory JSON files on disk │ Codex CLI subprocess explores     │
│                     │                               │ files and writes output JSON      │
│ agentrunbook_c      │ trajectory JSON + summaries   │ Codex CLI subprocess with richer  │
│                     │ (concise & full Markdown)     │ sandbox (trajectory summary files)│
└─────────────────────┴──────────────────────────────┴──────────────────────────────────┘
```

Note: Only five `memory_type` values (and five classes) exist. They yield six evaluation variants because `rag` is run in two configurations: raw states only, and raw states plus LLM-generated notes (`enable_notes=True`).

---

## 1. `no_retrieval` — The Baseline

**Class:** `NoRetrievalMemory`  
**File:** `memory_modules/no_retrieval.py`

This is the evaluation baseline. It intentionally does nothing.

```python
# memory_modules/no_retrieval.py:10-18
def insert(self, trajectory: dict[str, object]) -> None:
    return None

def query(
    self,
    query: str,
    query_image: str | None = None,
) -> list[MemoryContextItem]:
    return []
```

`insert` discards every trajectory. `query` returns an empty list, so the answering model receives no memory context at all. Its purpose is to establish a lower bound: how well does the model perform on memory questions when it has zero access to its past? Every other backend is expected to score higher.

Sources: [memory_modules/no_retrieval.py:4-19]()

---

## 2. `rag` — Embedding-Based Retrieval (with Optional Notes)

**Class:** `RagMemory`  
**File:** `memory_modules/rag.py`

This backend borrows the full retrieval and embedding machinery from `AgentRunbookR` (via direct method assignment) and uses it without the LLM-driven query-generation step.

### What is stored

On `insert`, each trajectory is converted into:

1. **Raw-state slices** — windows of AXTree text centered on each state (controlled by `raw_state_slice_radius`). Each slice becomes one searchable entry; its AXTree text is embedded and stored in a NumPy matrix (`raw_state_embeddings`).

2. **LLM-generated notes** (when `enable_notes=True`) — an LLM (configured via `controller_params`) reads the full simplified trajectory and produces two structured notes per trajectory:
   - **procedure note** — how to navigate or accomplish the task family (reusable procedure).
   - **hint note** — a concise answer-ready hint with key visible facts.

   Notes are embedded and stored in separate matrices, one per note type.
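
The raw-state windowing from item 1 can be sketched as below. This is a minimal illustration under stated assumptions: each state is represented by its AXTree text as a plain string, the window is symmetric and clipped at the trajectory boundaries, and the function name and output field names are hypothetical rather than taken from `rag.py`.

```python
def slice_states(axtrees: list[str], radius: int) -> list[dict[str, object]]:
    """Build one windowed slice per state, clipped at the trajectory edges."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    slices = []
    for center in range(len(axtrees)):
        start = max(0, center - radius)
        end = min(len(axtrees) - 1, center + radius)  # inclusive end index
        slices.append({
            "center_state_index": center,
            "start_state_index": start,
            "end_state_index": end,
            # The text that would be embedded for this slice.
            "text": "\n\n".join(axtrees[start:end + 1]),
        })
    return slices


slices = slice_states(["page A", "page B", "page C", "page D"], radius=1)
```

With `radius=1`, interior states get a three-state window and the first and last states get a two-state window, so every state still produces exactly one searchable entry.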

### How query works

`query` receives the benchmark question directly as the embedding query — no LLM query-generation step:

```python
# memory_modules/rag.py:554-570
raw_results = self._search_entries(
    entries=self.raw_state_entries,
    embeddings=self.raw_state_embeddings,
    query_text=query,
    top_k=self.raw_state_search_top_k,
)
note_results = self._search_note_query(query) if self.enable_notes else {
    "procedure_results": [],
    "hint_results": [],
}
```

Retrieval is pure cosine similarity using the pre-computed embedding matrix. Top-k results are assembled into `MemoryContextItem` blocks.
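
The ranking step can be illustrated with a small cosine top-k search. The sketch below uses plain Python lists so it runs without dependencies, whereas the backend described above keeps its embeddings in a NumPy matrix. The function name `cosine_top_k` is hypothetical, and treating a zero-length vector as similarity 0.0 is an assumption of this example.

```python
import math


def cosine_top_k(query_vec, matrix, top_k):
    """Return (row_index, score) pairs for the top_k rows most similar to query_vec."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query_vec)
    scored = []
    for i, row in enumerate(matrix):
        denom = q_norm * norm(row)
        dot = sum(a * b for a, b in zip(query_vec, row))
        scored.append((i, dot / denom if denom else 0.0))
    # Highest similarity first; ties keep their original row order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = cosine_top_k([1.0, 0.0], embeddings, top_k=2)
```

If the stored rows are normalized to unit length ahead of time, the per-row norm disappears and the search reduces to one matrix-vector product, which is the usual reason for precomputing the matrix.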

### Configuration highlights

| Section | Key params |
|---|---|
| `embedding_params` | model, base_url, max_input_tokens, query_instruction |
| `index_params` | raw_state_slice_radius (window size around each state) |
| `retrieval_params` | enable_notes, raw_state_search_top_k, note_search_top_k_per_type |
| `controller_params` | model, base_url (used only when enable_notes=True) |

The backend can cache pre-computed embeddings and notes in a `trajectory_pool_root` directory, allowing fast pooled loading at insert time instead of re-embedding.

Sources: [memory_modules/rag.py:67-69](), [memory_modules/rag.py:590-656](), [memory_modules/rag.py:958-1005]()

---

## 3. `agentrunbook_r` — LLM-Driven Multi-Pool Retrieval

**Class:** `AgentRunbookR`  
**File:** `memory_modules/agentrunbook_r.py`

This is the most feature-rich retrieval backend. It adds a third pool (state-transition *events*) and an LLM query-generation step, making retrieval smarter at the cost of a controller-model call per question.

### What is stored

On `insert`, each trajectory produces three pools:

1. **Raw-state slices** — same windowed AXTree slices as `rag`, embedded with the same embedding model.
2. **Event entries** — for each state-to-state transition, the LLM (controller model) writes a short two-part description: `overview` (where in the workflow this step sits) and `state_transition` (what specifically changed after the action). Events are particularly useful for navigation and before/after questions.
3. **Notes** — same procedure and hint notes as `rag`, generated by the controller LLM.

The prompts driving event generation are inlined in the file:

```python
# memory_modules/agentrunbook_r.py:219-253 (EVENT_GENERATION_SYSTEM_PROMPT excerpt)
# "overview": one concise paragraph that briefly recaps the concrete task goal...
# "state_transition": one concise paragraph that explicitly compares the post-state to the pre-state...
```

### How query works

Unlike `rag`, `AgentRunbookR` asks the controller LLM to **decompose the question** into a structured set of retrieval queries before searching:

```json
{
  "raw_state_queries": ["incident form suggestion button fields mandatory", ...],
  "event_query": "apply 4 stars & up filter and observe what changes",
  "note_query": "ServiceNow create incident form field requirements"
}
```

The prompt (`QUERY_GENERATION_SYSTEM_PROMPT`) instructs the LLM to generate up to 5 distinct `raw_state_queries` (one per distinct UI surface), one `event_query` (only for navigation/before-after questions), and one broader `note_query`. Each pool is then searched separately with cosine similarity and a configurable top-k budget.

An optional reranking step (`enable_rerank=True`) asks the controller LLM to filter each pool's candidates down to those that actually help answer the question.
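
Before the pools are searched, the controller's JSON plan has to be parsed and normalized. The following is a hedged sketch of that step: the three keys and the cap of 5 come from the description above, while the function name, the de-duplication, and the treatment of blank strings as "skip this pool" are assumptions for illustration.

```python
import json

MAX_RAW_STATE_QUERIES = 5  # the cap described in the text above


def parse_query_plan(llm_output: str) -> dict[str, object]:
    """Parse and normalize the controller's JSON retrieval plan."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("query plan must be a JSON object")

    raw = data.get("raw_state_queries") or []
    if not isinstance(raw, list):
        raise ValueError("raw_state_queries must be a list")

    # Drop blanks and duplicates while keeping order, then apply the cap.
    seen: set[str] = set()
    raw_queries = []
    for item in raw:
        text = str(item).strip()
        if text and text not in seen:
            seen.add(text)
            raw_queries.append(text)

    def optional_text(key: str):
        value = data.get(key)
        return value.strip() if isinstance(value, str) and value.strip() else None

    return {
        "raw_state_queries": raw_queries[:MAX_RAW_STATE_QUERIES],
        "event_query": optional_text("event_query"),  # None means: skip the event pool
        "note_query": optional_text("note_query"),
    }
```

Each non-empty entry in the returned plan would then drive one cosine search against its own pool.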

### Default models

```python
# memory_modules/agentrunbook_r.py:41-57
DEFAULT_CONTROLLER_MODEL = "Qwen/Qwen3.5-9B"
DEFAULT_CONTROLLER_BASE_URL = "http://localhost:8023/v1"
DEFAULT_EMBEDDING_MODEL = "Qwen/Qwen3-Embedding-8B"
DEFAULT_EMBEDDING_BASE_URL = "http://localhost:8114/v1"
```

These defaults point to locally hosted models served over OpenAI-compatible endpoints, so you supply your own models and compute (BYOK/BYOC).

Sources: [memory_modules/agentrunbook_r.py:41-94](), [memory_modules/agentrunbook_r.py:126-162](), [memory_modules/agentrunbook_r.py:219-253]()

---

## 4. `codex` — Coding-Agent Filesystem Exploration

**Class:** `CodexMemory`  
**File:** `memory_modules/codex.py`

Instead of building an in-memory index, `codex` stores trajectories as plain files on disk and delegates retrieval to the **Codex CLI**—a coding agent that reads those files and reasons over them.

### What is stored

`insert` materializes each trajectory as a `trajectory.json` file inside a workspace directory:

```python
# memory_modules/codex.py:525-561
def insert(self, trajectory: dict[str, object]) -> None:
    ...
    prepared = prepare_trajectory_insert_shared(trajectory, trajectories_root_dir=...)
    ...
    materialize_prepared_trajectory_shared(prepared, trajectory_dir)
    self.inserted_trajectory_ids.append(trajectory_id)
    self._write_index_files(self.workspace_dir)
```

An `index.json` and `haystack_manifest.json` are also written (and updated) to help the agent orient itself in the workspace.

### How query works

On `query`, the backend:

1. Writes `question.json` (the benchmark question, optionally including a question image) and `INSTRUCTION.md` into a sandbox directory.
2. Symlinks the `trajectories/` directory into the sandbox.
3. Launches the `codex` binary as a subprocess with `codex exec`.
4. Waits for the subprocess to write `memory_module_output.json` containing two fields:
   - `memory_markdown`: a structured Markdown narrative with `## Support Analysis` and `## Relevant Procedure and Hint Notes` sections.
   - `trajectory_spans`: a list of `{trajectory_id, start_state_index, end_state_index}` pointers to the most relevant trajectory slices.
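
Steps 1 and 2 amount to ordinary file operations, sketched below. This is an illustration only: the function name is hypothetical, the `"question"` field is a placeholder rather than the real `question.json` schema, and the Codex launch itself is deliberately left out.

```python
import json
import tempfile
from pathlib import Path


def prepare_sandbox(sandbox_dir: Path, trajectories_dir: Path,
                    question: str, instruction_md: str) -> None:
    """Write the query inputs and link the haystack into the sandbox."""
    sandbox_dir.mkdir(parents=True, exist_ok=True)
    # Field name "question" is a placeholder, not the repository's schema.
    (sandbox_dir / "question.json").write_text(
        json.dumps({"question": question}, indent=2), encoding="utf-8")
    (sandbox_dir / "INSTRUCTION.md").write_text(instruction_md, encoding="utf-8")
    link = sandbox_dir / "trajectories"
    if not link.exists():
        # A symlink avoids copying the whole haystack for every question.
        link.symlink_to(trajectories_dir.resolve(), target_is_directory=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "workspace" / "trajectories").mkdir(parents=True)
    prepare_sandbox(root / "sandbox", root / "workspace" / "trajectories",
                    "Which field was mandatory?", "# Instructions\n")
    print(sorted(p.name for p in (root / "sandbox").iterdir()))
```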

```python
# memory_modules/codex.py:31-38 (DEFAULT_PROMPT)
"You are acting as a memory retrieval module. "
"Read the local files in this directory, especially INSTRUCTION.md and question.json. "
"The local trajectories/ directory contains the current haystack for this evaluation item..."
"Write your final result to memory_module_output.json as valid JSON."
```

Across all spans combined, the agent may select at most `MAX_TOTAL_SPAN_STATES = 20` states. After the subprocess exits, the harness validates the output, loads the referenced trajectory states (with optional screenshots), and assembles them into `MemoryContextItem` blocks.

If the output is missing or malformed, the backend retries, up to the limit set by `codex_max_attempts`.
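
The kind of validation that decides whether an attempt counts as malformed can be sketched like this. The two output fields, the span keys, and the limit of 20 come from the text above. The function name, the specific checks, and the assumption that `end_state_index` is inclusive are illustrative and may differ from what `codex.py` actually enforces.

```python
MAX_TOTAL_SPAN_STATES = 20  # value quoted in the text above


def validate_output(payload: dict) -> list:
    """Check the agent's output and return its spans, or raise ValueError."""
    if not isinstance(payload.get("memory_markdown"), str):
        raise ValueError("memory_markdown must be a string")
    spans = payload.get("trajectory_spans")
    if not isinstance(spans, list):
        raise ValueError("trajectory_spans must be a list")

    total = 0
    for span in spans:
        if not isinstance(span, dict) or not isinstance(span.get("trajectory_id"), str):
            raise ValueError("each span needs a string trajectory_id")
        start = span.get("start_state_index")
        end = span.get("end_state_index")
        if not (isinstance(start, int) and isinstance(end, int) and 0 <= start <= end):
            raise ValueError(f"invalid span bounds: {span}")
        total += end - start + 1  # assumes an inclusive end index
    if total > MAX_TOTAL_SPAN_STATES:
        raise ValueError(
            f"spans cover {total} states, limit is {MAX_TOTAL_SPAN_STATES}")
    return spans
```

A caller could wrap this in a loop that reruns the subprocess whenever `ValueError` is raised, stopping once the attempt limit is reached.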

Sources: [memory_modules/codex.py:22-80](), [memory_modules/codex.py:344-562](), [memory_modules/codex.py:747-767]()

---

## 5. `agentrunbook_c` — Codex Agent with Richer Scaffolding

**Class:** `AgentRunbookC`  
**File:** `memory_modules/agentrunbook_c.py`

`AgentRunbookC` extends `CodexMemory` and adds **pre-rendered trajectory summary files** and a dedicated inspection helper script to the agent's sandbox. It is the "C" (Codex + scaffolding) counterpart to the "R" (retrieval/embedding) variant.

### What is stored

`insert` is inherited from `CodexMemory`—trajectories are materialized to disk as `trajectory.json` files, just as in the plain `codex` backend.

### How query works

Before launching the Codex subprocess, `AgentRunbookC` renders two Markdown summary files:

- `TRAJECTORY_SUMMARY_CONCISE.md` — a brief overview of all trajectories in the workspace.
- `TRAJECTORY_SUMMARY_FULL.md` — a detailed narrative of every trajectory.

These are written to the `trajectories/` directory so the Codex agent can read them as a quick orientation before drilling into individual `trajectory.json` files:

```python
# memory_modules/agentrunbook_c.py:143-146
concise_output_path = self.workspace_dir / "trajectories" / TRAJECTORY_SUMMARY_CONCISE_FILENAME
full_output_path = self.workspace_dir / "trajectories" / TRAJECTORY_SUMMARY_FULL_FILENAME
```

The sandbox also receives an `inspect_trajectory.py` helper script (under `scripts/`). It lets the Codex agent inspect a whole trajectory, a single state, or a span, or run a text match within one trajectory, without reading the entire JSON file by hand.

The agent is instructed via a specialized `INSTRUCTION.md` and a different default prompt:

```python
# memory_modules/agentrunbook_c.py:37-44
DEFAULT_QUERY_PROMPT = (
    "You are acting as the query-time agent for AgentRunbook-C. "
    "Read the local files in this directory, especially INSTRUCTION.md and question.json. "
    "The local trajectories/ directory contains the current haystack for this evaluation item, "
    "and you must explore trajectories/ before returning your final result. "
    "Use the local inspection helper under scripts/ when you need to inspect one trajectory..."
)
```

The output schema and post-processing logic are identical to `codex`: `memory_module_output.json` with `memory_markdown` and `trajectory_spans`.

Sources: [memory_modules/agentrunbook_c.py:24-45](), [memory_modules/agentrunbook_c.py:108-114](), [memory_modules/agentrunbook_c.py:135-223]()

---

## The insert/query Contract in Detail

The table below summarizes the complete contract for custom backends:

| Method | Required | Inputs | Expected output |
|---|---|---|---|
| `insert(trajectory)` | Yes | `dict` with id, goal, states, actions, screenshots | Side effect only: index or persist the data. Duplicate ids must be handled or rejected with a clear error, not fail silently. |
| `query(query, query_image)` | Yes | `str` question, optional image path | `list[MemoryContextItem]`, where each item is `{"type": "text"\|"image", "value": str}` |
| `configure_runtime(**kwargs)` | No | Arbitrary kwargs | Warm connections; return `None` |
| `post_query_hook(...)` | No | query, query_image, memory_context | Return `dict` or `None` |
| `_save_backend(output_dir)` | No | `Path` | Persist any non-config state (embeddings, JSONL pools, trajectory files) |
| `_load_backend(input_dir)` | No | `Path` | Restore state from a previous `save_memory` call |
| `reconcile_loaded_memory_config(saved, requested)` | No (has default) | Two `MemoryConfig` dicts | Return the effective config; raise on incompatible mismatch |

A backend that only implements `insert` and `query` is fully functional. The save/load pair is needed only if the backend builds state that is expensive to recompute (embeddings, LLM-generated notes).
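
As an illustration of how small a conforming backend can be, here is a hypothetical keyword-overlap backend. The `Memory` and `MemoryContextItem` definitions are local stand-ins so the example runs on its own; in the repository you would import them from `memory_modules/memory.py` and add `@register_memory`. `KeywordMemory`, its scoring rule, and its duplicate-id policy are assumptions, not part of the codebase.

```python
from abc import ABC, abstractmethod
from typing import Literal, Optional, TypedDict


class MemoryContextItem(TypedDict):
    type: Literal["text", "image"]
    value: str


class Memory(ABC):  # local stand-in for memory_modules.memory.Memory
    @abstractmethod
    def insert(self, trajectory: dict[str, object]) -> None: ...

    @abstractmethod
    def query(self, query: str,
              query_image: Optional[str] = None) -> list[MemoryContextItem]: ...


class KeywordMemory(Memory):
    """Hypothetical backend: ranks stored goals by word overlap with the query."""

    def __init__(self, top_k: int = 2) -> None:
        self.top_k = top_k
        self.goals: dict[str, str] = {}

    def insert(self, trajectory: dict[str, object]) -> None:
        trajectory_id = str(trajectory["id"])
        if trajectory_id in self.goals:
            raise ValueError(f"duplicate trajectory id: {trajectory_id}")
        self.goals[trajectory_id] = str(trajectory.get("goal", ""))

    def query(self, query: str,
              query_image: Optional[str] = None) -> list[MemoryContextItem]:
        words = set(query.lower().split())
        scored = [
            (len(words & set(goal.lower().split())), tid, goal)
            for tid, goal in self.goals.items()
        ]
        # Keep only goals sharing at least one word, best overlap first.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [{"type": "text", "value": f"[{tid}] {goal}"}
                for _, tid, goal in hits[: self.top_k]]


memory = KeywordMemory()
memory.insert({"id": "t1", "goal": "Create a new incident in ServiceNow"})
memory.insert({"id": "t2", "goal": "Filter products by rating"})
context = memory.query("create incident form")
```

Because this backend keeps only a small dictionary in memory, it has no need for the save/load pair.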

### The `MemoryContextItem` format

```python
# memory_modules/memory.py:14-16
class MemoryContextItem(TypedDict):
    type: Literal["text", "image"]
    value: str
```

Text items have the retrieved text as `value`. Image items have an **absolute filesystem path** to a screenshot as `value`. The harness passes both types to the answering model; image items are only meaningful when the evaluating model is multimodal.

Sources: [memory_modules/memory.py:14-16](), [memory_modules/memory.py:43-54](), [memory_modules/memory.py:56-68]()

---

## Data Flow: Insert → Query

```text
               insert(trajectory)
               ┌───────────────────────────────────────────────────────────────┐
               │  trajectory_store.prepare_trajectory_insert()                 │
               │    → normalize states, resolve screenshots, compute fingerprint│
               │                                                               │
               │  no_retrieval  → discard                                      │
               │  rag / ar_r    → embed AXTree slices → store in NumPy matrix  │
               │               → (opt) LLM → procedure & hint notes → embed    │
               │               → (ar_r) LLM → event entries → embed            │
               │  codex / ar_c  → write trajectory.json to workspace/           │
               └───────────────────────────────────────────────────────────────┘

               query(question_text)
               ┌───────────────────────────────────────────────────────────────┐
               │  no_retrieval  → return []                                    │
               │                                                               │
               │  rag           → embed question → cosine search raw-states    │
               │               → (opt) cosine search notes                     │
               │               → assemble MemoryContextItems                   │
               │                                                               │
               │  agentrunbook_r→ LLM decomposes question into multi-pool      │
               │               → cosine search each pool separately            │
               │               → (opt) LLM reranks candidates                  │
               │               → assemble MemoryContextItems                   │
               │                                                               │
               │  codex / ar_c  → write question.json + INSTRUCTION.md         │
               │               → (ar_c) render trajectory summaries            │
               │               → spawn Codex CLI subprocess                    │
               │               → Codex reads files, writes                     │
               │                 memory_module_output.json                     │
               │               → harness loads spans                           │
               │               → assemble MemoryContextItems                   │
               └───────────────────────────────────────────────────────────────┘
```

---

## Summary

The six backends form a progression from a zero-effort baseline to a full agentic retrieval system. `no_retrieval` measures the floor. `rag` adds dense vector retrieval over raw UI states, and optionally LLM-generated notes. `agentrunbook_r` adds a state-transition event pool and an LLM-driven query-decomposition step. `codex` replaces the embedding index entirely with a coding agent that reads trajectory files directly from disk. `agentrunbook_c` enhances `codex` by pre-rendering human-readable summaries and providing an inspection helper, giving the agent a better starting orientation before it digs into raw evidence.

Every custom backend must implement exactly two abstract methods—`insert` and `query`—and register with `@register_memory`. Saving and loading state across sessions is optional but recommended for any backend that performs expensive embedding or LLM inference at insert time.

Sources: [memory_modules/memory.py:140-144](), [memory_modules/codex.py:22-23](), [memory_modules/agentrunbook_r.py:60-68]()