# Downloading & Preparing the Haystack

> How trajectory data moves from Hugging Face through download, screenshot extraction, and symlink preparation into the form the harness expects — covering the three data scripts and the validate step.

- Repository: xiaowu0162/LongMemEval-V2
- GitHub: https://github.com/xiaowu0162/LongMemEval-V2
- Human wiki: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2
- Complete Markdown: https://grok-wiki.com/public/wiki/xiaowu0162-longmemeval-v2-0193366cbab2/llms-full.txt

## Source Files

- `data/download_data.py`
- `data/prepare_data.py`
- `data/validate_data.py`
- `data/public_data.py`
- `environment.yml`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [data/download_data.py](data/download_data.py)
- [data/prepare_data.py](data/prepare_data.py)
- [data/validate_data.py](data/validate_data.py)
- [data/public_data.py](data/public_data.py)
- [environment.yml](environment.yml)
- [README.md](README.md)
</details>

This page explains how LongMemEval-V2 trajectory data travels from Hugging Face onto your local machine and into the exact directory layout the evaluation harness expects. The process has three sequential steps (download, screenshot preparation, and validation), each implemented as its own Python script in the `data/` directory.

If you skip or partially complete any step, the harness will fail loudly rather than silently produce wrong results. Understanding what each script does — and why the steps are ordered — makes troubleshooting straightforward.

---

## What "the haystack" is

Think of the dataset as a library of recorded browsing sessions. Each *trajectory* is one such session: a sequence of browser states (screenshots, page text, actions). A *haystack* for a given question is the specific set of trajectory IDs the memory system must sift through to find an answer. The haystack files live at:

```
data/longmemeval-v2/haystacks/lme_v2_small.json   # small tier
data/longmemeval-v2/haystacks/lme_v2_medium.json  # medium tier
```

Each haystack file is a JSON object mapping `question_id → [trajectory_id, ...]`. The harness references those IDs, so every trajectory ID must resolve to both a metadata record (`trajectories.jsonl`) and a physical screenshot directory (`screenshots/<trajectory_id>/`).
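
To make the resolution requirement concrete, here is a minimal sketch that lists haystack entries failing to resolve. It is not the repository's code: the `trajectory_id` record key and both function names (`unresolved_trajectories`, `build_demo_root`) are assumptions made for illustration, and the demo data is invented.

```python
import json
import tempfile
from pathlib import Path


def unresolved_trajectories(data_root: Path, tier: str = "small") -> dict[str, list[str]]:
    """Return {question_id: [trajectory IDs that fail to resolve]} for one tier."""
    haystack_path = data_root / "haystacks" / f"lme_v2_{tier}.json"
    haystack = json.loads(haystack_path.read_text(encoding="utf-8"))

    # Assumption: each JSONL record stores its ID under the key "trajectory_id".
    known_ids = set()
    with (data_root / "trajectories.jsonl").open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                known_ids.add(json.loads(line)["trajectory_id"])

    problems: dict[str, list[str]] = {}
    for question_id, trajectory_ids in haystack.items():
        # An ID resolves only if it has metadata AND a screenshot directory.
        bad = [
            tid
            for tid in trajectory_ids
            if tid not in known_ids or not (data_root / "screenshots" / tid).is_dir()
        ]
        if bad:
            problems[question_id] = bad
    return problems


def build_demo_root(root: Path) -> Path:
    """Create a tiny, made-up data root: q1 -> [t1, t2], but only t1 has screenshots."""
    (root / "haystacks").mkdir(parents=True)
    (root / "haystacks" / "lme_v2_small.json").write_text(
        json.dumps({"q1": ["t1", "t2"]}), encoding="utf-8"
    )
    records = [{"trajectory_id": "t1"}, {"trajectory_id": "t2"}]
    (root / "trajectories.jsonl").write_text(
        "\n".join(json.dumps(record) for record in records) + "\n", encoding="utf-8"
    )
    (root / "screenshots" / "t1").mkdir(parents=True)
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(unresolved_trajectories(build_demo_root(Path(tmp))))
```

In the demo, `t2` has a metadata record but no screenshot directory, so it is reported under `q1`. This is the kind of gap that appears when the preparation step has not been run.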

---

## Step 1 — Download from Hugging Face (`download_data.py`)

### What it does

`data/download_data.py` calls `huggingface_hub.snapshot_download` to fetch the full dataset snapshot from the Hub repository `xiaowu0162/longmemeval-v2` into a local directory.

```python
# data/download_data.py:62-67
snapshot_download(
    repo_id=args.repo_id,
    repo_type="dataset",
    revision=args.revision,
    local_dir=str(data_root),
)
```

After downloading, it checks that the two required root files exist:

```python
# data/download_data.py:68-69
require((data_root / "questions.jsonl").exists(), ...)
require((data_root / "trajectories.jsonl").exists(), ...)
```

### Idempotency

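If both required files (`questions.jsonl` and `trajectories.jsonl`) already exist, the script treats them as sentinels for a completed download: it prints `"status": "already_present"` and exits without downloading again. Pass `--force` to delete the data root and re-download from scratch.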

Sources: [data/download_data.py:39-52](), [data/download_data.py:62-69]()
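
The idempotency rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: `plan_download` and the `"download_needed"` label are hypothetical, and only the `"already_present"` status and the `--force` behaviour come from the description above.

```python
import shutil
import tempfile
from pathlib import Path

# The two root files whose presence marks a completed download.
SENTINELS = ("questions.jsonl", "trajectories.jsonl")


def plan_download(data_root: Path, force: bool = False) -> str:
    """Decide whether a download is needed and return a status string."""
    if force and data_root.exists():
        # Mirrors --force: wipe the data root so the snapshot is fetched again.
        shutil.rmtree(data_root)
    if all((data_root / name).exists() for name in SENTINELS):
        return "already_present"
    data_root.mkdir(parents=True, exist_ok=True)
    return "download_needed"  # hypothetical label; the caller would download here


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "longmemeval-v2"
        print(plan_download(root))
        for name in SENTINELS:
            (root / name).write_text("{}\n", encoding="utf-8")
        print(plan_download(root))
        print(plan_download(root, force=True))
```

Checking for sentinel files is cheap, but it cannot detect a snapshot that is only partially downloaded beyond those two files. The validation step covers that case.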

### Usage

```bash
python data/download_data.py --data-root data/longmemeval-v2
```

| Flag | Default | Effect |
|---|---|---|
| `--repo-id` | `xiaowu0162/longmemeval-v2` | Hugging Face dataset repo |
| `--revision` | *(latest)* | Pin to a specific commit/tag |
| `--data-root` | `data/longmemeval-v2` | Where to write the snapshot |
| `--force` | off | Wipe and re-download if present |

After a successful download the script prints a JSON object that includes the recommended `next` commands — `prepare_data.py` then `validate_data.py` — so you always know what to run next.

Sources: [data/download_data.py:78-84]()

---

## Step 2 — Extract archives and build the screenshot tree (`prepare_data.py`)

### Why this step exists

Screenshots are large. The Hugging Face snapshot ships them as `.tar.gz` archives rather than individual files to reduce transfer overhead. Before the harness can reference `screenshots/<trajectory_id>/<step>.png`, those archives must be extracted and the resulting directories must be reachable under a single stable path prefix.

`data/prepare_data.py` is a thin CLI wrapper; all real logic lives in `data/public_data.prepare_screenshots`.

```python
# data/prepare_data.py:22-26
result = prepare_screenshots(
    Path(args.data_root).expanduser().resolve(),
    mode=args.mode,
    extract_archives=not args.no_extract_archives,
)
```

Sources: [data/prepare_data.py:17-27]()

### What `prepare_screenshots` does

The function looks for screenshot content under `trajectory_screenshots/` inside `data_root`, which may hold it in either of two forms:

1. **Archives**: `.tar.gz` files (or their already-extracted directories), one per source group, as listed in the table below.
2. **Direct subdirectories**: if `trajectory_screenshots/` itself contains subdirectories with `.png` files, it is treated as an already-extracted source.

| Archive name | `replace` flag | Purpose |
|---|---|---|
| `web_screenshots.tar.gz` | no | Web-domain trajectory screenshots |
| `enterprise_screenshots_base.tar.gz` | no | Enterprise base screenshots |
| `enterprise_screenshots_patch.tar.gz` | **yes** | Patches/updates that overwrite base entries |

For each tar archive that has not yet been extracted, `_safe_extract_tar` validates every member path against a path-traversal check before unpacking:

```python
# data/public_data.py:133-136
member_target = (destination / member.name).resolve()
require(
    destination_resolved == member_target or destination_resolved in member_target.parents,
    f"Refusing unsafe archive member path: {member.name}",
)
```

Sources: [data/public_data.py:127-138]()
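
For readers who want to try the path-traversal check in isolation, the following self-contained sketch applies the same comparison as the excerpt above. It is an approximation, not the repository's `_safe_extract_tar`: the names `safe_extract_tar` and `make_archive` are hypothetical, and a plain `ValueError` stands in for the project's `require` helper.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def safe_extract_tar(archive: Path, destination: Path) -> None:
    """Extract `archive` into `destination`, refusing members that would escape it."""
    destination.mkdir(parents=True, exist_ok=True)
    destination_resolved = destination.resolve()
    with tarfile.open(archive, "r:*") as tar:
        members = tar.getmembers()
        # Validate every member before anything is written to disk.
        for member in members:
            target = (destination / member.name).resolve()
            if target != destination_resolved and destination_resolved not in target.parents:
                raise ValueError(f"Refusing unsafe archive member path: {member.name}")
        tar.extractall(destination, members=members)


def make_archive(path: Path, member_name: str, payload: bytes = b"png-bytes") -> Path:
    """Build a one-file .tar.gz for demonstration purposes (hypothetical helper)."""
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)
    with tarfile.open(path, "w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        safe_extract_tar(make_archive(base / "good.tar.gz", "traj_1/step_0.png"), base / "out")
        print(sorted(p.name for p in (base / "out" / "traj_1").iterdir()))
        try:
            safe_extract_tar(make_archive(base / "bad.tar.gz", "../escape.png"), base / "out")
        except ValueError as exc:
            print(exc)
```

This sketch inspects member names only. It does not examine symlink or hard-link members, which a hardened extractor would also need to consider.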

### Symlinking vs copying

After expansion, each `<trajectory_id>/` subdirectory inside a source directory is linked (or copied) into the unified `screenshots/` tree:

```
data/longmemeval-v2/screenshots/<trajectory_id>/   ← what the harness reads
```

The default mode is `symlink`: a relative symlink is created using `os.path.relpath` so the layout remains portable when the data root is moved. If creating the symlink raises `OSError` (e.g., on a filesystem that does not support symlinks), the code falls back automatically to `shutil.copytree`.

```python
# data/public_data.py:154-159
if mode == "symlink":
    try:
        _relative_symlink(src.resolve(), dst)
        return "symlinked"
    except OSError:
        shutil.copytree(src, dst)
        return "copied"
```

The `enterprise_screenshots_patch` source has `replace=True`, meaning its entries overwrite any existing entry in `screenshots/` for the same trajectory ID. All other sources skip trajectory directories that already exist.

Sources: [data/public_data.py:146-162](), [data/public_data.py:165-213]()
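
The interaction between symlinking, the copy fallback, and the `replace` flag can be sketched as below. This is a simplified stand-in rather than the repository's implementation: `link_or_copy` is a hypothetical name, and the way an existing entry is removed before replacement is an assumption.

```python
import os
import shutil
import tempfile
from pathlib import Path


def link_or_copy(src: Path, dst: Path, mode: str = "symlink", replace: bool = False) -> str:
    """Expose `src` at `dst`; returns "symlinked", "copied", or "skipped"."""
    if dst.exists() or dst.is_symlink():
        if not replace:
            return "skipped"
        # Assumption: a replacing source removes the old entry first.
        if dst.is_symlink() or dst.is_file():
            dst.unlink()
        else:
            shutil.rmtree(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    if mode == "symlink":
        try:
            # A relative target keeps the tree valid if the data root is moved.
            target = os.path.relpath(src.resolve(), start=dst.parent.resolve())
            os.symlink(target, dst, target_is_directory=True)
            return "symlinked"
        except OSError:
            pass  # symlinks unsupported here: fall through to a copy
    shutil.copytree(src, dst)
    return "copied"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for group, text in (("enterprise_screenshots_base", "base"),
                            ("enterprise_screenshots_patch", "patch")):
            folder = root / "trajectory_screenshots" / group / "t1"
            folder.mkdir(parents=True)
            (folder / "step_0.png").write_text(text, encoding="utf-8")
        dst = root / "screenshots" / "t1"
        sources = root / "trajectory_screenshots"
        print(link_or_copy(sources / "enterprise_screenshots_base" / "t1", dst))
        print(link_or_copy(sources / "enterprise_screenshots_patch" / "t1", dst, replace=True))
        print((dst / "step_0.png").read_text(encoding="utf-8"))
```

Processing the patch source after the base source, with `replace=True`, is what lets patched trajectories win without touching the base directories themselves.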

### Usage

```bash
export DATA_ROOT="$(pwd)/data/longmemeval-v2"
python data/prepare_data.py --data-root "$DATA_ROOT" --mode symlink
```

| Flag | Default | Effect |
|---|---|---|
| `--data-root` | *(required)* | Path downloaded in Step 1 |
| `--mode` | `symlink` | `symlink` or `copy` |
| `--no-extract-archives` | off | Skip tar extraction (archives already unpacked) |

The script prints a JSON summary with counts of how many trajectory directories were symlinked, copied, or skipped.

---

## Step 3 — Validate the layout (`validate_data.py`)

`data/validate_data.py` is a sanity check that confirms every piece of data the harness will need is actually present and internally consistent. It calls `data/public_data.validate_public_data`.

Sources: [data/validate_data.py:13-30](), [data/public_data.py:217-260]()

### What it checks

The validator loads all three data files — `questions.jsonl`, `trajectories.jsonl`, and the selected haystack — then runs a sequence of assertions:

```text
1. Every question in questions.jsonl has a haystack entry.
2. Every haystack entry points to a known question.
3. Every question has domain "web" or "enterprise" and non-empty question text.
4. Any question image path (if present) resolves to a real file.
5. Every trajectory_id referenced in any haystack exists in trajectories.jsonl.
6. No duplicate trajectory IDs within a single haystack.
7. Trajectories and their haystacks share the same domain (no cross-domain mixing).
8. Every screenshot path in every trajectory state resolves to a real file under data_root.
```

Check 8 is the one that catches a missing `prepare_data.py` run:

```python
# data/public_data.py:242-253
for trajectory in trajectories.values():
    for state in trajectory.get("states", []):
        screenshot_value = state.get("screenshot")
        if isinstance(screenshot_value, str) and not (data_root / screenshot_value).exists():
            missing_screenshots += 1
require(
    missing_screenshots == 0,
    f"Missing {missing_screenshots} trajectory screenshots. Run data/prepare_data.py first.",
)
```
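
The structural checks in the list (numbers 1, 2, 5, and 6) can be illustrated on in-memory data. The sketch below is not `validate_public_data`: the function name `check_haystack` and the `question_id` key are assumptions, and it collects every problem into a list, whereas the excerpt above uses `require` to fail on a violated condition.

```python
def check_haystack(
    questions: list[dict],
    trajectories: dict[str, dict],
    haystack: dict[str, list[str]],
) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    errors: list[str] = []
    # Assumption: each question record stores its ID under "question_id".
    question_ids = {question["question_id"] for question in questions}
    haystack_ids = set(haystack)

    # Check 1: every question has a haystack entry.
    for qid in sorted(question_ids - haystack_ids):
        errors.append(f"question {qid} has no haystack entry")
    # Check 2: every haystack entry points to a known question.
    for qid in sorted(haystack_ids - question_ids):
        errors.append(f"haystack entry {qid} has no matching question")

    for qid, trajectory_ids in haystack.items():
        # Check 6: no duplicate trajectory IDs within a single haystack.
        if len(trajectory_ids) != len(set(trajectory_ids)):
            errors.append(f"haystack {qid} contains duplicate trajectory IDs")
        # Check 5: every referenced trajectory exists in the metadata.
        for tid in sorted(set(trajectory_ids)):
            if tid not in trajectories:
                errors.append(f"haystack {qid} references unknown trajectory {tid}")
    return errors


if __name__ == "__main__":
    # Invented example data with one violation of each kind.
    demo_questions = [{"question_id": "q1"}, {"question_id": "q2"}]
    demo_trajectories = {"t1": {}, "t2": {}}
    demo_haystack = {"q1": ["t1", "t1"], "q3": ["t9"]}
    for problem in check_haystack(demo_questions, demo_trajectories, demo_haystack):
        print(problem)
```

Collecting all problems at once is convenient when debugging a hand-edited data root, while a fail-fast style gives shorter output in automated runs.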

### Usage

```bash
python data/validate_data.py --data-root "$DATA_ROOT" --tier small
```

| Flag | Default | Effect |
|---|---|---|
| `--data-root` | *(required)* | Same path used in steps 1 and 2 |
| `--tier` | `small` | `small` or `medium` — selects which haystack file to validate |
| `--check-screenshots` / `--no-check-screenshots` | on | Toggle screenshot existence checks; pass `--no-check-screenshots` to skip them when disk I/O is slow |

On success the script prints a JSON summary:

```json
{
  "questions": 451,
  "trajectories": ...,
  "haystack_questions": ...,
  "tier": "small",
  "check_screenshots": true
}
```

---

## Data flow diagram

```text
Hugging Face Hub
  xiaowu0162/longmemeval-v2
         │
         │  snapshot_download()
         ▼
data/longmemeval-v2/
  ├── questions.jsonl          ← 451 questions with domain, text, optional image
  ├── trajectories.jsonl       ← all trajectory records (states, screenshot paths)
  ├── haystacks/
  │   ├── lme_v2_small.json    ← question_id → [trajectory_id, ...]
  │   └── lme_v2_medium.json
  └── trajectory_screenshots/
      ├── web_screenshots.tar.gz
      ├── enterprise_screenshots_base.tar.gz
      └── enterprise_screenshots_patch.tar.gz
         │
         │  prepare_data.py  (extract + symlink)
         ▼
data/longmemeval-v2/
  └── screenshots/
      ├── <trajectory_id_A>/   ← symlink → trajectory_screenshots/web_screenshots/...
      │   ├── step_0.png
      │   └── step_1.png
      └── <trajectory_id_B>/
          └── ...
         │
         │  validate_data.py  (assert all references resolve)
         ▼
     ✓ Ready for evaluation harness
```
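
The flow in the diagram can also be driven from a single Python entry point. The sketch below only assembles and runs the three commands using the flags documented on this page. The helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and it assumes the working directory is the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline_commands(
    data_root: Path, tier: str = "small", mode: str = "symlink"
) -> list[list[str]]:
    """Return the download, prepare, and validate commands in execution order."""
    root = str(data_root)
    return [
        [sys.executable, "data/download_data.py", "--data-root", root],
        [sys.executable, "data/prepare_data.py", "--data-root", root, "--mode", mode],
        [sys.executable, "data/validate_data.py", "--data-root", root, "--tier", tier],
    ]


def run_pipeline(data_root: Path, tier: str = "small", mode: str = "symlink") -> None:
    """Run each step in order; check=True stops at the first failing step."""
    for command in build_pipeline_commands(data_root, tier=tier, mode=mode):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try anywhere.
    for command in build_pipeline_commands(Path("data/longmemeval-v2")):
        print(" ".join(command))
```

Stopping at the first non-zero exit code preserves the ordering guarantee: validation never runs against a data root that failed to download or prepare.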

---

## The shared library: `public_data.py`

`data/public_data.py` is the library both `prepare_data.py` and `validate_data.py` import from. It exposes:

| Function | Used by | Purpose |
|---|---|---|
| `read_jsonl` | both | Parse a `.jsonl` file into a list of dicts |
| `load_questions` | both | Load and optionally filter questions by domain |
| `load_trajectories` | both | Load trajectories keyed by ID |
| `load_haystack` | both | Load a tier-specific haystack JSON |
| `resolve_question_image` | both | Resolve a question's optional image path |
| `prepare_screenshots` | prepare | Extract archives + build `screenshots/` tree |
| `validate_public_data` | validate | Run all integrity checks |
| `materialize_runtime_questions` | harness | Emit a filtered question list for a run |
| `materialize_runtime_haystack` | harness | Emit a filtered haystack for a run |

The `materialize_*` functions are not called by the data scripts but by the evaluation harness at runtime — they translate the raw JSONL files into per-run JSON files with resolved absolute image paths.

Sources: [data/public_data.py:34-124]()

---

## Environment prerequisites

The download step requires `huggingface_hub`, which is listed in `requirements.txt` (pulled in via `environment.yml`). The conda environment is named `lme-v2-release` and uses Python 3.11.

```yaml
# environment.yml
name: lme-v2-release
dependencies:
  - python=3.11
  - pip:
      - -r requirements-torch.txt
      - -r requirements.txt
```

If you run `download_data.py` in an environment where `huggingface_hub` is not installed (for example, outside the conda environment), you will get a clear `RuntimeError: Missing huggingface_hub` rather than a confusing import error.

Sources: [environment.yml:1-10](), [data/download_data.py:54-59]()
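
One common way to turn a missing dependency into a readable error is an import guard like the one sketched here. It is an illustration, not the code in `download_data.py`: `require_module` and the hint text are hypothetical, and only the `Missing huggingface_hub` prefix is taken from the description above.

```python
import importlib
from types import ModuleType


def require_module(name: str, hint: str) -> ModuleType:
    """Import `name`, converting ImportError into a RuntimeError with guidance."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Chain the original exception so the underlying cause stays visible.
        raise RuntimeError(f"Missing {name}. {hint}") from exc


if __name__ == "__main__":
    try:
        require_module("huggingface_hub", "Activate the lme-v2-release environment first.")
        print("huggingface_hub is available")
    except RuntimeError as exc:
        print(exc)
```

Raising at the point of use, rather than at module import time, lets the rest of the script (such as argument parsing and `--help`) work even when the dependency is absent.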

---

## Summary

Three scripts in the `data/` directory form a strict pipeline:

- `download_data.py` fetches the raw snapshot from Hugging Face and verifies that `questions.jsonl` and `trajectories.jsonl` are present.
- `prepare_data.py` extracts screenshot archives and builds a stable `screenshots/<trajectory_id>/` tree through relative symlinks, with path-traversal validation on every archive member.
- `validate_data.py` confirms that every question has a haystack, every haystack references real trajectories of the correct domain, and every screenshot path resolves on disk. This catches any gap between the download and preparation steps before an expensive evaluation run begins.

The shared implementation in `data/public_data.py` also provides `materialize_runtime_questions` and `materialize_runtime_haystack`, which the evaluation harness calls at runtime to assemble per-run inputs from the same prepared data root.

Sources: [data/public_data.py:217-260]()