# Why Is Running the Simulation a Process Boundary?

> The run stage asks why live simulation leaves Flask request handling: subprocesses, file-based IPC, action logs, SQLite traces, and optional graph-memory updates create the observable boundary.

- Repository: 666ghj/MiroFish
- GitHub: https://github.com/666ghj/MiroFish
- Human wiki: https://grok-wiki.com/public/wiki/666ghj-mirofish-5af7beba06b9
- Complete Markdown: https://grok-wiki.com/public/wiki/666ghj-mirofish-5af7beba06b9/llms-full.txt

## Source Files

- `backend/app/services/simulation_runner.py`
- `backend/app/services/simulation_ipc.py`
- `backend/app/services/zep_graph_memory_updater.py`
- `backend/scripts/run_parallel_simulation.py`
- `backend/scripts/run_twitter_simulation.py`
- `backend/scripts/run_reddit_simulation.py`
- `backend/scripts/action_logger.py`
- `frontend/src/components/Step3Simulation.vue`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [backend/app/services/simulation_runner.py](backend/app/services/simulation_runner.py)
- [backend/app/services/simulation_ipc.py](backend/app/services/simulation_ipc.py)
- [backend/app/services/zep_graph_memory_updater.py](backend/app/services/zep_graph_memory_updater.py)
- [backend/app/api/simulation.py](backend/app/api/simulation.py)
- [backend/scripts/run_parallel_simulation.py](backend/scripts/run_parallel_simulation.py)
- [backend/scripts/run_twitter_simulation.py](backend/scripts/run_twitter_simulation.py)
- [backend/scripts/run_reddit_simulation.py](backend/scripts/run_reddit_simulation.py)
- [backend/scripts/action_logger.py](backend/scripts/action_logger.py)
- [frontend/src/components/Step3Simulation.vue](frontend/src/components/Step3Simulation.vue)
- [frontend/src/api/simulation.js](frontend/src/api/simulation.js)
</details>

A simulation run is not just a Flask endpoint doing work. It is a handoff from request/response code into a long-lived Python worker process that owns the OASIS environments, writes observable files, accepts later commands through file IPC, and leaves traces for the API and UI to read.

The useful question is: what would break if the simulation stayed inside the Flask request? The answer is visible in the code: the run can outlive the HTTP request, expose progress through durable files, keep an environment alive for interviews after the main loop, terminate as a process group, and optionally stream actions into graph memory without coupling that work to the web server thread.

Context note: this page uses the provided Compound Engineering wiki guidance as page-shape metadata. In this checkout, no `STRATEGY.md` or `docs/solutions/**` source files were present in the focused inventory, so repository code remains the source of truth.

## What is the simplest version?

The simplest version would be: `/api/simulation/start` receives a `simulation_id`, loads `simulation_config.json`, runs the loop inline, and returns when done. The implementation chooses a different shape. The Flask route validates `simulation_id`, `platform`, `max_rounds`, `force`, and optional graph-memory settings, then delegates to `SimulationRunner.start_simulation(...)` and returns a state object that includes `runner_status`, platform flags, and `process_pid`.

That tells us the product contract is asynchronous: starting a run is not the same as completing a run. The UI also treats start as a transition into polling, not as a blocking operation. `Step3Simulation.vue` sends `platform: 'parallel'`, `force: true`, and `enable_graph_memory_update: true`, then starts status and detail polling after receiving the process id.

Sources: [backend/app/api/simulation.py:1451-1627](), [frontend/src/components/Step3Simulation.vue:382-425](), [frontend/src/api/simulation.js:79-109]()

## Where does the boundary actually sit?

The boundary sits between Flask-managed orchestration and script-managed simulation state.

`SimulationRunner.start_simulation` chooses one of three scripts: `run_twitter_simulation.py`, `run_reddit_simulation.py`, or `run_parallel_simulation.py`. It builds a command using the current Python interpreter, passes `--config`, optionally passes `--max-rounds`, sets the working directory to the simulation directory, redirects stdout/stderr to `simulation.log`, starts a new process session, stores the child PID, and launches a monitor thread.

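```python
# backend/app/services/simulation_runner.py (excerpt)
cmd = [
    sys.executable,            # same interpreter that runs Flask
    script_path,               # one of the three run_*_simulation.py scripts
    "--config", config_path,
]
process = subprocess.Popen(
    cmd,
    cwd=sim_dir,               # relative paths resolve inside the simulation directory
    stdout=main_log_file,      # file handle for simulation.log
    stderr=subprocess.STDOUT,  # merge stderr into the same log
    text=True,
    encoding='utf-8',
    bufsize=1,
    env=env,
    start_new_session=True,    # child leads its own session and process group
)
```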

The request path owns validation and launch. The worker process owns the live OASIS environment and the simulation loop. The monitor thread bridges them by reading files the worker emits.

Sources: [backend/app/services/simulation_runner.py:313-479](), [backend/app/services/simulation_runner.py:482-547]()
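
The excerpt above omits the optional `--max-rounds` flag. The sketch below shows one way the command could be assembled. `build_worker_command` is a hypothetical helper, not a function from the repository, and treating only positive values as a limit is an assumption.

```python
import sys
from typing import List, Optional


def build_worker_command(script_path: str, config_path: str,
                         max_rounds: Optional[int] = None) -> List[str]:
    """Assemble the argv for a simulation worker (hypothetical helper)."""
    cmd = [sys.executable, script_path, "--config", config_path]
    # Assumption: the flag is only passed when a positive cap was requested.
    if max_rounds is not None and max_rounds > 0:
        cmd.extend(["--max-rounds", str(max_rounds)])
    return cmd
```

Passing the argument list directly to `subprocess.Popen` avoids shell quoting issues for paths that contain spaces.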

## What crosses the boundary?

The crossing is deliberately file-shaped.

| Boundary artifact | Written by | Read by | Purpose |
|---|---|---|---|
| `simulation_config.json` | preparation flow before run | runner and worker script | input contract for the run |
| `simulation.log` | child process stdout and log manager | runner on failure, humans | main process log and failure tail |
| `run_state.json` | `SimulationRunner` monitor | API status endpoints | durable observable run state |
| `twitter/actions.jsonl` | action logger in worker | monitor/API/detail UI | platform action stream |
| `reddit/actions.jsonl` | action logger in worker | monitor/API/detail UI | platform action stream |
| `twitter_simulation.db` / `reddit_simulation.db` | OASIS environment | worker helpers and API endpoints | raw SQLite traces and post/comment data |
| `env_status.json` | IPC handler in worker | Flask IPC client/API | whether the live environment can accept commands |
| `ipc_commands/*.json` and `ipc_responses/*.json` | Flask and worker | worker and Flask | request/response IPC for interviews and close-env |

The cleanup list is also evidence of the boundary. A forced restart deletes run state, logs, action JSONL files, platform SQLite databases, and `env_status.json`, but not the original config or profile files.

Sources: [backend/app/services/simulation_runner.py:299-310](), [backend/app/services/simulation_runner.py:487-516](), [backend/app/services/simulation_runner.py:1102-1181](), [backend/scripts/action_logger.py:1-13]()
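
A forced restart can be pictured as deleting only the per-run outputs. The following is a minimal sketch under that assumption: the file list is copied from the table above, `cleanup_run_artifacts` is a hypothetical name, and the repository's actual cleanup list may differ.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the artifact table; the repository's list may differ.
RUN_ARTIFACTS = [
    "run_state.json",
    "simulation.log",
    "twitter/actions.jsonl",
    "reddit/actions.jsonl",
    "twitter_simulation.db",
    "reddit_simulation.db",
    "env_status.json",
]


def cleanup_run_artifacts(sim_dir: str) -> List[str]:
    """Delete per-run outputs and return the relative paths that were removed."""
    removed = []
    for rel in RUN_ARTIFACTS:
        path = Path(sim_dir) / rel
        if path.is_file():
            path.unlink()
            removed.append(rel)
    # Inputs such as simulation_config.json are never listed, so they survive.
    return removed
```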

## Why are action logs separate from SQLite traces?

The worker uses SQLite as the platform trace store and JSONL as the API-facing action stream.

Inside the parallel worker, each platform creates its own OASIS environment with its own database path. After each environment step, the worker queries new rows from the SQLite `trace` table using `rowid`, normalizes action names, enriches action context, and writes selected actions through `PlatformActionLogger`. That logger appends one JSON object per line to `twitter/actions.jsonl` or `reddit/actions.jsonl`, including normal actions and event markers such as `round_start`, `round_end`, `simulation_start`, and `simulation_end`.
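
Reading only new `trace` rows by `rowid` can be sketched as a cursor query. This is an illustration, not the worker's code: `fetch_new_trace_rows` is a hypothetical helper, and the column names `user_id`, `action`, and `info` are assumptions about the table layout.

```python
import sqlite3
from typing import List, Tuple


def fetch_new_trace_rows(db_path: str, last_rowid: int) -> Tuple[List[tuple], int]:
    """Return rows added to `trace` after `last_rowid`, plus the new cursor."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            # Column names are assumed for illustration.
            "SELECT rowid, user_id, action, info FROM trace "
            "WHERE rowid > ? ORDER BY rowid",
            (last_rowid,),
        ).fetchall()
    finally:
        conn.close()
    # Keep the old cursor when nothing new arrived.
    new_cursor = rows[-1][0] if rows else last_rowid
    return rows, new_cursor
```

The caller keeps the returned cursor between environment steps, so each step processes only the rows written since the previous one.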

The monitor thread then reads the JSONL streams incrementally by file position. Event rows update progress and completion flags; action rows become `AgentAction` objects and optionally feed graph memory.

Sources: [backend/scripts/run_parallel_simulation.py:657-746](), [backend/scripts/run_parallel_simulation.py:1101-1288](), [backend/scripts/action_logger.py:22-117](), [backend/app/services/simulation_runner.py:583-691]()
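
Incremental reading by file position can be sketched as follows. `read_new_jsonl` is a hypothetical helper rather than the monitor's actual method. It assumes the writer terminates every record with a newline, and it leaves a trailing partial line for the next poll.

```python
import json
from typing import Any, Dict, List, Tuple


def read_new_jsonl(path: str, position: int) -> Tuple[List[Dict[str, Any]], int]:
    """Parse complete lines appended since `position`; return them and the new offset."""
    records = []
    # Binary mode keeps seek/tell offsets exact regardless of encoding.
    with open(path, "rb") as f:
        f.seek(position)
        while True:
            line_start = f.tell()
            line = f.readline()
            if not line:
                break
            if not line.endswith(b"\n"):
                # Partial line: the writer is mid-append, so retry it next poll.
                f.seek(line_start)
                break
            text = line.decode("utf-8").strip()
            if text:
                records.append(json.loads(text))
        return records, f.tell()
```

A monitor loop would store the returned offset per platform and pass it back on the next poll, so no row is parsed twice.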

## How does the system keep a finished simulation interactive?

The worker does not necessarily exit when the main loop ends. By default, the parallel script enters a wait-for-commands mode: it creates a `ParallelIPCHandler`, marks `env_status.json` as `alive`, and polls for command files every half second. It supports three commands: `interview`, `batch_interview`, and `close_env`.

Flask uses `SimulationIPCClient` for the other side. It writes a command JSON file with a UUID, waits for a matching response JSON file, and removes both files after success. Before sending interview commands, `SimulationRunner` checks `env_status.json`; closing the environment is a graceful command, distinct from `/stop`, which terminates the process.

Sources: [backend/scripts/run_parallel_simulation.py:1595-1634](), [backend/scripts/run_parallel_simulation.py:217-299](), [backend/scripts/run_parallel_simulation.py:560-601](), [backend/app/services/simulation_ipc.py:95-187](), [backend/app/services/simulation_runner.py:1373-1489](), [backend/app/services/simulation_runner.py:1610-1656]()
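
The command/response exchange can be sketched with two small functions sharing a directory. This is a simplified illustration, not the repository's `SimulationIPCClient` or `ParallelIPCHandler`: the function names, the JSON field names, and the `"completed"` status value are assumptions. Unlike the described client, which removes both files itself, this sketch lets the worker consume the command file.

```python
import json
import os
import time
import uuid
from typing import Any, Callable, Dict, Optional


def _write_json_atomic(path: str, payload: Dict[str, Any]) -> None:
    """Write to a temp name, then rename, so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_path, path)


def send_command(sim_dir: str, command_type: str, args: Dict[str, Any],
                 timeout: float = 5.0, poll_interval: float = 0.5) -> Dict[str, Any]:
    """Client side: drop a command file, then poll for the matching response."""
    command_id = str(uuid.uuid4())
    cmd_dir = os.path.join(sim_dir, "ipc_commands")
    resp_dir = os.path.join(sim_dir, "ipc_responses")
    os.makedirs(cmd_dir, exist_ok=True)
    os.makedirs(resp_dir, exist_ok=True)
    cmd_path = os.path.join(cmd_dir, f"{command_id}.json")
    resp_path = os.path.join(resp_dir, f"{command_id}.json")
    _write_json_atomic(cmd_path, {"command_id": command_id,
                                  "command_type": command_type, "args": args})
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(resp_path):
            with open(resp_path, "r", encoding="utf-8") as f:
                response = json.load(f)
            os.remove(resp_path)  # the response has been consumed
            return response
        time.sleep(poll_interval)
    # Withdraw the command if the worker never picked it up.
    try:
        os.remove(cmd_path)
    except FileNotFoundError:
        pass
    raise TimeoutError(f"no response to {command_type} within {timeout}s")


def handle_one_command(sim_dir: str,
                       handler: Callable[[Dict[str, Any]], Any]) -> Optional[str]:
    """Worker side: answer the first pending command by filename; return its id."""
    cmd_dir = os.path.join(sim_dir, "ipc_commands")
    resp_dir = os.path.join(sim_dir, "ipc_responses")
    if not os.path.isdir(cmd_dir):
        return None
    pending = sorted(n for n in os.listdir(cmd_dir) if n.endswith(".json"))
    if not pending:
        return None
    cmd_path = os.path.join(cmd_dir, pending[0])
    with open(cmd_path, "r", encoding="utf-8") as f:
        command = json.load(f)
    os.remove(cmd_path)  # consume the command so it is handled only once
    os.makedirs(resp_dir, exist_ok=True)
    response = {"command_id": command["command_id"], "status": "completed",
                "result": handler(command)}
    _write_json_atomic(os.path.join(resp_dir, f"{command['command_id']}.json"), response)
    return command["command_id"]
```

Matching files by UUID lets several requests be in flight at once without a shared lock. The cost is polling latency on both sides.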

```mermaid
flowchart LR
  subgraph UI["UI"]
    Step3["Step3Simulation.vue"]
  end

  subgraph Flask["Flask API and Runner"]
    Start["/api/simulation/start"]
    Runner["SimulationRunner"]
    IPCClient["SimulationIPCClient"]
    Monitor["monitor thread"]
  end

  subgraph Worker["Simulation subprocess"]
    Parallel["run_parallel_simulation.py"]
    Twitter["Twitter OASIS env"]
    Reddit["Reddit OASIS env"]
    IPCHandler["ParallelIPCHandler"]
  end

  subgraph Files["Simulation directory"]
    Config["simulation_config.json"]
    State["run_state.json"]
    Actions["twitter/reddit/actions.jsonl"]
    DB["*_simulation.db trace tables"]
    EnvStatus["env_status.json"]
    IPCFiles["ipc_commands / ipc_responses"]
    MainLog["simulation.log"]
  end

  subgraph Optional["Optional graph memory"]
    ZepUpdater["ZepGraphMemoryUpdater"]
    Graph["graph.add text episodes"]
  end

  Step3 --> Start
  Start --> Runner
  Runner -->|Popen| Parallel
  Runner --> Config
  Parallel --> Twitter
  Parallel --> Reddit
  Parallel --> MainLog
  Twitter --> DB
  Reddit --> DB
  Parallel --> Actions
  Monitor --> Actions
  Monitor --> State
  IPCClient --> IPCFiles
  IPCHandler --> IPCFiles
  IPCHandler --> EnvStatus
  Monitor --> ZepUpdater
  ZepUpdater --> Graph
```

Sources: [backend/app/services/simulation_runner.py:387-469](), [backend/scripts/run_parallel_simulation.py:1540-1618](), [backend/app/services/simulation_ipc.py:102-187](), [backend/app/services/zep_graph_memory_updater.py:340-418]()

## What does `/stop` mean versus `close-env`?

There are two shutdown paths because there are two different problems.

`/stop` is process control. `SimulationRunner.stop_simulation` moves the run state to `STOPPING`, terminates the subprocess tree, then marks the run `STOPPED`. On Unix, the runner sends signals to the process group created by `start_new_session=True`; on Windows, it uses `taskkill` for the process tree. This is the right tool when the run itself must stop.

`close-env` is protocol control. It sends a `close_env` IPC command so a worker already in wait mode can exit cleanly. The API documentation explicitly distinguishes the two: `/stop` forcefully terminates the process, while `close-env` asks the simulation to shut down the environment and exit.

Sources: [backend/app/services/simulation_runner.py:720-822](), [backend/app/api/simulation.py:2649-2708](), [backend/app/services/simulation_runner.py:1610-1656]()
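
Process-tree termination for a child started with `start_new_session=True` can be sketched as below. `terminate_process_tree` is a hypothetical helper, and the escalation from `SIGTERM` to `SIGKILL` after a grace period is an assumption about a reasonable policy, not a description of the runner's exact behavior.

```python
import os
import signal
import subprocess
import sys


def terminate_process_tree(process: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Stop a child started with start_new_session=True, escalating if needed."""
    if process.poll() is not None:
        return process.returncode  # already exited
    if sys.platform == "win32":
        # /T targets the whole process tree, /F forces termination.
        subprocess.run(["taskkill", "/F", "/T", "/PID", str(process.pid)],
                       capture_output=True, check=False)
        return process.wait(timeout=grace_seconds)
    try:
        # The child leads its own session, so its group contains its descendants.
        pgid = os.getpgid(process.pid)
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return process.wait()  # exited between the poll and the signal
    try:
        return process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)  # the group ignored SIGTERM
        return process.wait(timeout=grace_seconds)
```

Signalling the group rather than the single PID matters when the worker spawns helpers of its own, because those would otherwise be orphaned.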

## Where does graph memory fit?

Graph memory is optional and sits on the Flask-side observation path, not inside the OASIS loop itself. When `/start` asks for graph-memory updates, the API resolves a `graph_id` from simulation or project state and passes it to `SimulationRunner.start_simulation`. The runner creates a `ZepGraphMemoryUpdater` before launching the subprocess. As the monitor reads action JSONL rows, it calls `graph_updater.add_activity_from_dict(...)`.

The updater is another asynchronous boundary: it owns a queue, a daemon worker thread, per-platform buffers, batching, retry behavior, and a final flush on stop. It filters `DO_NOTHING`, converts action dictionaries into natural-language activity episodes, and sends batches with `client.graph.add(...)`.

This is provider-specific in the current implementation because the class imports `zep_cloud.client.Zep` and requires `ZEP_API_KEY`. Architecturally, however, the boundary is already portable: the simulation process emits provider-neutral action JSONL, and the graph-memory adapter consumes that observable stream. A BYOC/BYOK-friendly extension would keep `actions.jsonl` as the source event contract and swap the memory sink behind an adapter rather than coupling OASIS or Flask routes to one hosted memory provider.

Sources: [backend/app/api/simulation.py:1584-1623](), [backend/app/services/simulation_runner.py:372-385](), [backend/app/services/simulation_runner.py:603-684](), [backend/app/services/zep_graph_memory_updater.py:202-246](), [backend/app/services/zep_graph_memory_updater.py:275-308](), [backend/app/services/zep_graph_memory_updater.py:340-418](), [backend/app/config.py:30-37]()
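
The adapter idea can be sketched as a queue-backed sink with an injected sender. This is not `ZepGraphMemoryUpdater`: the class name `BatchingActivitySink`, the `send_batch` callable, and the dictionary keys are hypothetical, and retry handling is omitted for brevity.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


class BatchingActivitySink:
    """Buffer activities per platform and hand them to a sender in batches."""

    def __init__(self, send_batch: Callable[[str, List[str]], None], batch_size: int = 5):
        self._send_batch = send_batch  # provider-specific call goes here
        self._batch_size = batch_size
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._buffers: Dict[str, List[str]] = {}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add_activity(self, activity: Dict[str, Any]) -> None:
        # Idle actions carry no information worth storing.
        if activity.get("action_type") == "DO_NOTHING":
            return
        self._queue.put(activity)

    def stop(self) -> None:
        self._queue.put(None)  # sentinel: drain, flush, and exit
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:
                break
            platform = item.get("platform", "unknown")
            text = f"{item.get('agent_name', 'agent')} performed {item.get('action_type')}"
            buffer = self._buffers.setdefault(platform, [])
            buffer.append(text)
            if len(buffer) >= self._batch_size:
                self._send_batch(platform, list(buffer))
                buffer.clear()
        # Final flush so partial batches are not lost on stop.
        for platform, buffer in self._buffers.items():
            if buffer:
                self._send_batch(platform, list(buffer))
                buffer.clear()
```

Because the sender is injected, a different memory provider only needs a different `send_batch` callable. The event stream and the batching logic stay unchanged.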

## What does the UI observe?

The UI does not subscribe to the child process directly. It starts the run, receives the backend state, then polls two HTTP endpoints:

- `GET /api/simulation/{id}/run-status` for coarse progress, current rounds, platform completion, and action counts.
- `GET /api/simulation/{id}/run-status/detail` for the action list used by the live activity display.

The backend implements those endpoints by reading `SimulationRunner` state and action files. The Vue component separately detects per-platform round changes and final completion. That keeps the browser out of the process boundary; it sees HTTP resources, not subprocess pipes or SQLite handles.

Sources: [backend/app/api/simulation.py:1705-1752](), [backend/app/api/simulation.py:1763-1838](), [frontend/src/components/Step3Simulation.vue:492-585](), [frontend/src/api/simulation.js:95-109]()
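
The polling contract can be illustrated independently of the Vue component. The sketch below is written in Python for consistency with the backend examples. `poll_until_done`, the `current_round` field, and the terminal status values are assumptions. Only `runner_status` is named in the start response described earlier.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical terminal values; check the runner's real status enum.
TERMINAL_STATUSES = {"completed", "stopped", "failed"}


def poll_until_done(fetch_status: Callable[[], Dict[str, Any]],
                    on_round_change: Callable[[Dict[str, Any]], None],
                    interval: float = 2.0,
                    max_polls: int = 1000,
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, Any]:
    """Poll a status callable until it reports a terminal state."""
    last_round = None
    for _ in range(max_polls):
        status = fetch_status()  # e.g. a GET on the run-status endpoint
        current_round = status.get("current_round")
        if current_round != last_round:
            on_round_change(status)  # notify only when the round advances
            last_round = current_round
        if status.get("runner_status") in TERMINAL_STATUSES:
            return status
        sleep(interval)
    raise TimeoutError("simulation did not reach a terminal state")
```

Injecting `fetch_status` and `sleep` keeps the loop testable without a running server.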

## What would break if the boundary disappeared?

If the run stayed inside Flask request handling, several implemented behaviors lose their natural owner:

| Current behavior | Why the process boundary helps |
|---|---|
| Long simulation loop | The request returns immediately with `process_pid` while the worker continues. |
| Durable progress | `run_state.json` and JSONL files survive beyond one request handler. |
| Platform-parallel execution | The worker can run the Twitter and Reddit simulations concurrently with `asyncio.gather(...)` while Flask remains responsive. |
| Post-run interviews | The worker keeps OASIS environments alive and accepts IPC commands after the main loop. |
| Force stop | The runner can terminate a process group instead of trying to interrupt in-request Python state. |
| Failure inspection | `simulation.log` captures stdout/stderr and the runner can attach the tail to failed state. |
| Optional graph updates | The monitor can transform observed actions into graph-memory updates without changing the simulation loop. |

The code is not saying subprocesses are the only possible design. It is saying this implementation treats the simulation as a separately observable runtime with file contracts between layers.

Sources: [backend/app/services/simulation_runner.py:426-469](), [backend/app/services/simulation_runner.py:501-547](), [backend/scripts/run_parallel_simulation.py:1583-1590](), [backend/scripts/run_parallel_simulation.py:1595-1634](), [backend/app/services/simulation_runner.py:720-822]()
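
The platform-parallel row can be illustrated with a toy `asyncio.gather` call. `run_platform` and `run_parallel` are stand-ins, not the worker's real coroutines, and `asyncio.sleep(0)` takes the place of an actual OASIS environment step.

```python
import asyncio
from typing import List, Tuple


async def run_platform(name: str, rounds: int, log: List[str]) -> str:
    """Stand-in for one platform's simulation loop."""
    for round_num in range(1, rounds + 1):
        await asyncio.sleep(0)  # placeholder for an awaited environment step
        log.append(f"{name}:round{round_num}")
    return name


async def run_parallel(rounds: int) -> Tuple[List[str], List[str]]:
    """Run both platform loops concurrently on one event loop."""
    log: List[str] = []
    finished = await asyncio.gather(
        run_platform("twitter", rounds, log),
        run_platform("reddit", rounds, log),
    )
    # gather returns results in argument order, regardless of finish order.
    return list(finished), log
```

Calling `asyncio.run(run_parallel(2))` drives both loops to completion inside the worker process, so the web server never has to host this event loop.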

## Closing summary

Running the simulation is a process boundary because the live simulation has a different lifecycle than a Flask request. Flask validates, launches, monitors, serves status, and sends file IPC commands. The subprocess owns OASIS environments, SQLite trace generation, action JSONL emission, and optional wait-mode interactivity. The shared simulation directory is the contract that makes the boundary observable, restartable, and portable enough to support BYOC/BYOK-oriented adapters around model config and graph-memory sinks.

Sources: [backend/app/services/simulation_runner.py:196-205](), [backend/scripts/run_parallel_simulation.py:1-26](), [backend/scripts/action_logger.py:1-13]()
