# Why Is Uploading Files Not Enough?

> The graph-building page asks why raw documents must become ontology, chunks, tasks, and Zep graph memory before they can support simulation.

- Repository: 666ghj/MiroFish
- GitHub: https://github.com/666ghj/MiroFish
- Human wiki: https://grok-wiki.com/public/wiki/666ghj-mirofish-5af7beba06b9
- Complete Markdown: https://grok-wiki.com/public/wiki/666ghj-mirofish-5af7beba06b9/llms-full.txt

## Source Files

- `backend/app/api/graph.py`
- `backend/app/services/ontology_generator.py`
- `backend/app/services/graph_builder.py`
- `backend/app/services/text_processor.py`
- `backend/app/utils/file_parser.py`
- `backend/app/models/project.py`
- `backend/app/models/task.py`
- `frontend/src/components/Step1GraphBuild.vue`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [backend/app/api/graph.py](backend/app/api/graph.py)
- [backend/app/services/ontology_generator.py](backend/app/services/ontology_generator.py)
- [backend/app/services/graph_builder.py](backend/app/services/graph_builder.py)
- [backend/app/services/text_processor.py](backend/app/services/text_processor.py)
- [backend/app/utils/file_parser.py](backend/app/utils/file_parser.py)
- [backend/app/models/project.py](backend/app/models/project.py)
- [backend/app/models/task.py](backend/app/models/task.py)
- [frontend/src/components/Step1GraphBuild.vue](frontend/src/components/Step1GraphBuild.vue)
- [backend/app/api/simulation.py](backend/app/api/simulation.py)
- [backend/app/services/simulation_manager.py](backend/app/services/simulation_manager.py)
- [backend/app/services/simulation_runner.py](backend/app/services/simulation_runner.py)
- [backend/app/services/zep_graph_memory_updater.py](backend/app/services/zep_graph_memory_updater.py)
- [backend/app/config.py](backend/app/config.py)
- [backend/app/utils/llm_client.py](backend/app/utils/llm_client.py)
</details>

Uploading files only gives MiroFish bytes and filenames. The simulation layer needs something stronger: a project with durable extracted text, an ontology that defines which actors and relationships matter, chunked episodes that can be ingested into a graph, task state that tracks long-running work, and a graph id that later simulation steps can read from or update.

The central question is simple: if the goal is social simulation, what must be true before agents can act? The code’s answer is that documents must be converted from passive source material into a structured, queryable memory substrate.

Sources: [backend/app/api/graph.py:122-247](), [backend/app/api/graph.py:260-522](), [frontend/src/components/Step1GraphBuild.vue:108-168]()

## What is the simplest version?

The simplest version would be: upload a PDF or text file, then ask a model to simulate from that text. MiroFish does not stop there because the product does more than summarize documents. The graph build step must preserve reusable state across endpoints, expose progress, produce graph nodes and edges, and hand a `graph_id` into simulation creation.

That is why `generate_ontology` creates a project, saves the uploaded files, extracts and preprocesses their text, persists the extracted text, generates the ontology, and moves the project to `ONTOLOGY_GENERATED`. The upload is only the first operation in that sequence.

```python
# backend/app/api/graph.py
project = ProjectManager.create_project(name=project_name)
project.simulation_requirement = simulation_requirement
...
text = FileParser.extract_text(file_info["path"])
text = TextProcessor.preprocess_text(text)
...
ProjectManager.save_extracted_text(project.project_id, all_text)
...
project.ontology = {
    "entity_types": ontology.get("entity_types", []),
    "edge_types": ontology.get("edge_types", [])
}
project.status = ProjectStatus.ONTOLOGY_GENERATED
```

Sources: [backend/app/api/graph.py:175-235](), [backend/app/models/project.py:17-24](), [backend/app/models/project.py:133-174](), [backend/app/models/project.py:275-290]()

## What does a raw file lack?

A raw file does not answer the operational questions the simulator asks later.

| Need | Why upload alone is insufficient | Where the repo encodes it |
| --- | --- | --- |
| Text extraction | The backend accepts files, but graph building consumes text. PDF, Markdown, and text files must be parsed first. | `FileParser.extract_text` branches by extension. |
| Simulation intent | The ontology is generated from documents plus `simulation_requirement`, not documents alone. | `/ontology/generate` requires `simulation_requirement`. |
| Entity schema | Simulation needs actors that can speak or interact, not arbitrary concepts. | Ontology prompt constrains entity types. |
| Relationship schema | Graph edges need source/target-compatible relation types. | Ontology prompt and validation normalize edge definitions. |
| Persistent context | Later endpoints use `project_id`; they do not re-upload all source files. | `ProjectManager` persists project metadata and extracted text. |
| Progress and failure handling | Graph building is long-running and external-service-backed. | `TaskManager` tracks `pending`, `processing`, `completed`, and `failed`. |

Sources: [backend/app/utils/file_parser.py:61-108](), [backend/app/api/graph.py:153-173](), [backend/app/services/ontology_generator.py:29-173](), [backend/app/models/project.py:26-73](), [backend/app/models/task.py:16-53]()
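
The first row of the table can be illustrated with a minimal sketch of extension-based dispatch. This is not the repository's `FileParser`: the `READERS` mapping, the helper names, and the exact set of extensions are assumptions made for illustration, and PDF parsing is left as a stub because it would need a third-party library.

```python
from pathlib import Path
from typing import Callable, Dict


def _read_plain(path: Path) -> str:
    """Read Markdown or plain text as UTF-8, replacing undecodable bytes."""
    return path.read_text(encoding="utf-8", errors="replace")


def _read_pdf(path: Path) -> str:
    # Stub: a real implementation would call a PDF library here.
    raise NotImplementedError("PDF parsing is out of scope for this sketch")


# Hypothetical extension table; the repository's actual list may differ.
READERS: Dict[str, Callable[[Path], str]] = {
    ".txt": _read_plain,
    ".md": _read_plain,
    ".markdown": _read_plain,
    ".pdf": _read_pdf,
}


def extract_text(path: str) -> str:
    """Pick a reader by file extension and return the extracted text."""
    p = Path(path)
    reader = READERS.get(p.suffix.lower())
    if reader is None:
        raise ValueError(f"Unsupported file type: {p.suffix or '(none)'}")
    return reader(p)
```

Rejecting unknown extensions before any read happens keeps the later stages from receiving bytes they cannot interpret as text.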

## Why must documents become ontology?

What would break if ontology disappeared? The graph builder would have chunks of text, but no domain contract for what counts as an entity or edge. The ontology generator explicitly asks for social-media-simulation-friendly actors: people, companies, organizations, government bodies, media outlets, platforms, or representative groups. It rejects abstract concepts, topics, and attitudes as entity types because those cannot act as simulated social accounts.

The generated ontology is not accepted blindly. The service normalizes entity names into PascalCase, relation names into uppercase, fixes source/target references after renaming, deduplicates entity types, caps entity and edge types at ten each, and ensures fallback `Person` and `Organization` types exist.

Sources: [backend/app/services/ontology_generator.py:41-57](), [backend/app/services/ontology_generator.py:91-130](), [backend/app/services/ontology_generator.py:277-398]()
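
The normalization steps described above can be sketched as follows. This is an illustrative approximation, not the code in `ontology_generator.py`: the `source_targets` field name, the helper functions, and the way the cap reserves room for the fallback types are all assumptions.

```python
import re
from typing import Any, Dict, List

MAX_TYPES = 10  # cap mentioned in the text above
FALLBACK_ENTITIES = ("Person", "Organization")


def to_pascal_case(name: str) -> str:
    """'media outlet' or 'media_outlet' -> 'MediaOutlet'."""
    parts = re.split(r"[^A-Za-z0-9]+", name)
    return "".join(p[:1].upper() + p[1:] for p in parts if p)


def to_relation_name(name: str) -> str:
    """'works for' -> 'WORKS_FOR'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_").upper()


def normalize_ontology(raw: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    renamed: Dict[str, str] = {}  # old entity name -> normalized name
    entities: List[Dict[str, Any]] = []
    seen = set()
    for ent in raw.get("entity_types", []):
        old = ent.get("name", "")
        new = to_pascal_case(old)
        renamed[old] = new
        if new and new not in seen:  # deduplicate after renaming
            seen.add(new)
            entities.append({**ent, "name": new})

    # Keep the fallback types (reusing LLM-provided ones when present)
    # and cap the total number of entity types.
    by_name = {e["name"]: e for e in entities}
    fallbacks = [
        by_name.get(n, {"name": n, "description": f"Fallback {n} type"})
        for n in FALLBACK_ENTITIES
    ]
    others = [e for e in entities if e["name"] not in FALLBACK_ENTITIES]
    entities = others[: MAX_TYPES - len(fallbacks)] + fallbacks

    valid = {e["name"] for e in entities}
    edges: List[Dict[str, Any]] = []
    for edge in raw.get("edge_types", [])[:MAX_TYPES]:
        pairs = []
        for st in edge.get("source_targets", []):  # hypothetical field name
            src = renamed.get(st.get("source", ""), st.get("source", ""))
            tgt = renamed.get(st.get("target", ""), st.get("target", ""))
            if src in valid and tgt in valid:  # drop dangling references
                pairs.append({"source": src, "target": tgt})
        edges.append({
            **edge,
            "name": to_relation_name(edge.get("name", "")),
            "source_targets": pairs,
        })
    return {"entity_types": entities, "edge_types": edges}
```

The rename map is the important detail: once entity names change, every edge that refers to the old name has to be rewritten, otherwise the graph schema would point at types that no longer exist.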

## Why must text become chunks?

A graph ingestion service cannot reliably receive one arbitrary-length project document as a single meaningful unit. MiroFish splits extracted text into overlapping chunks before sending it to the graph service. Chunking preserves local context while making each submitted unit small enough to process.

The split function also tries to cut at sentence or paragraph boundaries before falling back to raw character ranges. That means chunking is not just a transport trick; it shapes what evidence is available to graph extraction.

Sources: [backend/app/services/text_processor.py:17-34](), [backend/app/utils/file_parser.py:161-202](), [backend/app/api/graph.py:392-403]()
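
A minimal sketch of overlapping, boundary-aware chunking is shown below. It follows the behavior described above but is not the repository's split function: the default sizes, the `BOUNDARIES` tuple, and the rule of only accepting a boundary in the second half of the window are assumptions.

```python
from typing import List

# Candidate cut points; any match in the second half of a window may be used.
BOUNDARIES = ("\n\n", ". ", "! ", "? ", "\n")


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks, cutting at a boundary when possible."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    chunks: List[str] = []
    start, n = 0, len(text)
    while start < n:
        end = min(start + chunk_size, n)
        if end < n:
            window = text[start:end]
            cut = -1
            for boundary in BOUNDARIES:
                idx = window.rfind(boundary)
                if idx != -1:
                    cut = max(cut, idx + len(boundary))
            # Only accept a boundary that keeps the chunk reasonably full;
            # otherwise fall back to the raw character range.
            if cut > chunk_size // 2:
                end = start + cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end >= n:
            break
        # Step back by `overlap`, but always move forward at least one character.
        start = max(end - overlap, start + 1)
    return chunks
```

The overlap means the end of one chunk is repeated at the start of the next, so a fact that straddles a cut point is still visible in full to whichever extraction step reads the later chunk.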

## Why must graph building become a task?

Graph construction calls an external graph service, submits multiple batches, waits for processing, then fetches graph data. That is too long and failure-prone to model as a simple synchronous upload response. The API creates a task, marks the project as `GRAPH_BUILDING`, starts a daemon thread, updates progress, and eventually stores completion or failure state.

The guard clauses also show why ordering matters. A project still in `CREATED` state cannot build a graph because ontology has not been generated. A project already building cannot start another build unless forced.

```python
# backend/app/api/graph.py
if project.status == ProjectStatus.CREATED:
    return jsonify({
        "success": False,
        "error": t('api.ontologyNotGenerated')
    }), 400
```

Sources: [backend/app/api/graph.py:316-337](), [backend/app/api/graph.py:364-423](), [backend/app/api/graph.py:447-522](), [backend/app/models/task.py:75-164]()
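
The task pattern described above can be sketched as a small registry plus a daemon thread. The status values match those listed earlier (`pending`, `processing`, `completed`, `failed`); the `TaskRegistry` class, `run_in_background` helper, and their signatures are hypothetical and do not mirror the repository's `TaskManager`.

```python
import threading
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class TaskStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Task:
    task_id: str
    status: TaskStatus = TaskStatus.PENDING
    progress: int = 0
    error: Optional[str] = None


class TaskRegistry:
    """Thread-safe in-memory task store (hypothetical stand-in)."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}
        self._lock = threading.Lock()

    def create(self) -> Task:
        task = Task(task_id=str(uuid.uuid4()))
        with self._lock:
            self._tasks[task.task_id] = task
        return task

    def update(self, task_id: str, **changes) -> None:
        with self._lock:
            for key, value in changes.items():
                setattr(self._tasks[task_id], key, value)

    def get(self, task_id: str) -> Task:
        with self._lock:
            return self._tasks[task_id]


def run_in_background(
    registry: TaskRegistry,
    work: Callable[[Callable[[int], None]], None],
) -> Tuple[Task, threading.Thread]:
    """Start `work` on a daemon thread and record its outcome on a task."""
    task = registry.create()

    def runner() -> None:
        registry.update(task.task_id, status=TaskStatus.PROCESSING)
        try:
            # `work` receives a callback for reporting percentage progress.
            work(lambda pct: registry.update(task.task_id, progress=pct))
            registry.update(task.task_id, status=TaskStatus.COMPLETED, progress=100)
        except Exception as exc:  # any failure is recorded, not raised
            registry.update(task.task_id, status=TaskStatus.FAILED, error=str(exc))

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return task, thread
```

The HTTP handler can return the `task_id` immediately, and a separate status endpoint can poll the registry, which is what lets a client show progress instead of holding a request open.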

## Why must chunks become Zep graph memory?

In the current implementation, `GraphBuilderService` is the boundary between MiroFish project state and Zep standalone graphs. It creates a graph id, dynamically turns ontology definitions into Zep entity and edge models, submits chunks as text episodes, waits until episodes are processed, and reads back nodes and edges.

```text
Upload files
  -> extracted_text.txt in ProjectManager
  -> ontology: entity_types + edge_types
  -> chunked text episodes
  -> Zep graph_id
  -> simulation creation and profile/config generation
  -> optional graph memory updates during simulation run
```

The result is not just stored text. `get_graph_data` returns graph nodes with labels, summaries, attributes, and timestamps, plus edges with facts, source/target node ids and names, attributes, temporal fields, and episode references. That is the shape downstream simulation features can query and display.

Sources: [backend/app/services/graph_builder.py:193-203](), [backend/app/services/graph_builder.py:205-292](), [backend/app/services/graph_builder.py:294-345](), [backend/app/services/graph_builder.py:347-401](), [backend/app/services/graph_builder.py:426-501]()

## How does the UI expose this pipeline?

`Step1GraphBuild.vue` presents the workflow as three stages: ontology generation, graph/RAG build, and simulation creation. The first card shows generated entity and relation types and lets the user inspect descriptions, attributes, examples, and relation source/target connections. The second card reports graph node, edge, and schema-type counts. The third card only creates the simulation after both `project_id` and `graph_id` exist.

That UI is a product-facing explanation of the same backend constraint: users are not waiting for “upload” to finish; they are waiting for source material to become a simulation-ready graph.

Sources: [frontend/src/components/Step1GraphBuild.vue:4-105](), [frontend/src/components/Step1GraphBuild.vue:108-168](), [frontend/src/components/Step1GraphBuild.vue:213-228](), [frontend/src/components/Step1GraphBuild.vue:252-257]()

## Where does simulation depend on the graph?

Simulation creation requires `project_id` and either an explicit `graph_id` or a `graph_id` already saved on the project. If neither is available, the API returns `graphNotBuilt`. Once created, the simulation state stores both ids.

Later preparation uses `state.graph_id` to initialize profile generation and to supply it with graph context. Configuration generation also receives the same `graph_id`, along with the simulation requirement, the original document text, and the filtered entities. At run time, graph memory updates are optional, but if enabled they require `graph_id` and create a `ZepGraphMemoryUpdater`.

Sources: [backend/app/api/simulation.py:166-224](), [backend/app/services/simulation_manager.py:194-228](), [backend/app/services/simulation_manager.py:315-347](), [backend/app/services/simulation_manager.py:403-410](), [backend/app/services/simulation_runner.py:316-384]()
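
The fallback rule for `graph_id` can be sketched in a few lines. The `Project` dataclass, the `GraphNotBuilt` exception, and `resolve_graph_id` are hypothetical names used only to show the precedence; the real endpoint returns an error response rather than raising.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    project_id: str
    graph_id: Optional[str] = None  # set once the graph build has completed


class GraphNotBuilt(Exception):
    """Raised when no graph id is available for a simulation."""


def resolve_graph_id(project: Project, explicit_graph_id: Optional[str] = None) -> str:
    """Prefer an explicit graph id, then the one saved on the project."""
    graph_id = explicit_graph_id or project.graph_id
    if not graph_id:
        raise GraphNotBuilt(f"project {project.project_id} has no graph yet")
    return graph_id
```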

## What does “memory” add after the initial graph?

The initial graph is built from documents. Runtime graph memory is different: it can add simulation activities back into the graph as text. `ZepGraphMemoryUpdater` owns a queue, per-platform buffers for Twitter and Reddit activity, and a worker thread, and it sends combined activity text to the graph with `client.graph.add`.

So the graph is both seed context and, when enabled, a place to record simulated activity. Uploading files cannot provide that evolving memory loop.

Sources: [backend/app/services/zep_graph_memory_updater.py:232-269](), [backend/app/services/zep_graph_memory_updater.py:275-291](), [backend/app/services/zep_graph_memory_updater.py:407-424](), [backend/app/services/zep_graph_memory_updater.py:490-510]()
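
The queue, per-platform buffer, and worker-thread arrangement can be sketched as below. This is a simplified illustration rather than `ZepGraphMemoryUpdater` itself: the `ActivityBatcher` class, the `batch_size` default, and the injected `send` callable are assumptions, and `send` stands in for whatever call actually writes text to the graph.

```python
import queue
import threading
from collections import defaultdict
from typing import Callable, DefaultDict, List, Optional, Tuple


class ActivityBatcher:
    """Buffer activities per platform and send them in combined batches."""

    def __init__(self, send: Callable[[str, str], None], batch_size: int = 3) -> None:
        self._send = send  # hypothetical sink: (platform, combined_text)
        self._batch_size = batch_size
        self._queue: "queue.Queue[Optional[Tuple[str, str]]]" = queue.Queue()
        self._buffers: DefaultDict[str, List[str]] = defaultdict(list)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add(self, platform: str, activity: str) -> None:
        """Called by the simulation; returns immediately."""
        self._queue.put((platform, activity))

    def stop(self) -> None:
        """Signal the worker to flush remaining buffers and exit."""
        self._queue.put(None)
        self._worker.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel from stop()
                break
            platform, activity = item
            self._buffers[platform].append(activity)
            if len(self._buffers[platform]) >= self._batch_size:
                self._flush(platform)
        for platform in list(self._buffers):
            self._flush(platform)

    def _flush(self, platform: str) -> None:
        buffer = self._buffers[platform]
        if buffer:
            self._send(platform, "\n".join(buffer))
            buffer.clear()
```

Because only the worker thread touches the buffers, the simulation loop never blocks on the graph service; it only pays the cost of a queue insert.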

## Provider-neutral architecture note

The current code is partly BYOK-friendly: LLM access is configured through `LLM_API_KEY`, `LLM_BASE_URL`, and `LLM_MODEL_NAME`, and `LLMClient` uses an OpenAI-compatible client with injectable key, base URL, and model. That supports multiple OpenAI-compatible providers without changing the ontology generator interface.

The graph store is less abstract today: `GraphBuilderService` imports and instantiates Zep directly, and graph build endpoints require `ZEP_API_KEY`. A vendor-agnostic evolution should preserve the same product stages while putting graph operations behind an interface such as `create_graph`, `set_ontology`, `add_text_batches`, `wait_for_ingestion`, `get_graph_data`, and `add_memory_event`. That keeps the Grok-Wiki or skill-source layer portable across file, repository, or catalog inputs, and keeps BYOC/BYOK decisions at adapter/config boundaries rather than inside the upload workflow.

Sources: [backend/app/config.py:30-45](), [backend/app/config.py:67-74](), [backend/app/utils/llm_client.py:17-33](), [backend/app/services/graph_builder.py:13-19](), [backend/app/services/graph_builder.py:46-52]()
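
One way to express the suggested interface is a `typing.Protocol` plus a trivial in-memory adapter. The method names come from the list above; every parameter, return type, and the `InMemoryGraphStore` and `build_graph` helpers are assumptions, and nothing here reflects the Zep client's actual API.

```python
from typing import Any, Dict, List, Protocol


class GraphStore(Protocol):
    """Hypothetical provider-neutral boundary for graph operations."""

    def create_graph(self, name: str) -> str: ...
    def set_ontology(self, graph_id: str, ontology: Dict[str, Any]) -> None: ...
    def add_text_batches(self, graph_id: str, chunks: List[str], batch_size: int = 3) -> List[str]: ...
    def wait_for_ingestion(self, episode_ids: List[str], timeout: float = 600.0) -> None: ...
    def get_graph_data(self, graph_id: str) -> Dict[str, Any]: ...
    def add_memory_event(self, graph_id: str, text: str) -> None: ...


class InMemoryGraphStore:
    """Toy adapter that only records what it is given; useful for tests."""

    def __init__(self) -> None:
        self._graphs: Dict[str, Dict[str, Any]] = {}

    def create_graph(self, name: str) -> str:
        graph_id = f"graph_{len(self._graphs) + 1}"
        self._graphs[graph_id] = {"name": name, "ontology": {}, "episodes": []}
        return graph_id

    def set_ontology(self, graph_id: str, ontology: Dict[str, Any]) -> None:
        self._graphs[graph_id]["ontology"] = ontology

    def add_text_batches(self, graph_id: str, chunks: List[str], batch_size: int = 3) -> List[str]:
        episodes = self._graphs[graph_id]["episodes"]
        ids: List[str] = []
        # A real adapter would make one service call per batch.
        for i in range(0, len(chunks), batch_size):
            for chunk in chunks[i:i + batch_size]:
                episodes.append(chunk)
                ids.append(f"{graph_id}_ep_{len(episodes)}")
        return ids

    def wait_for_ingestion(self, episode_ids: List[str], timeout: float = 600.0) -> None:
        return None  # nothing is asynchronous in memory

    def get_graph_data(self, graph_id: str) -> Dict[str, Any]:
        graph = self._graphs[graph_id]
        return {"graph_id": graph_id, "name": graph["name"],
                "ontology": graph["ontology"], "episodes": list(graph["episodes"])}

    def add_memory_event(self, graph_id: str, text: str) -> None:
        self._graphs[graph_id]["episodes"].append(text)


def build_graph(store: GraphStore, name: str, ontology: Dict[str, Any], chunks: List[str]) -> str:
    """Run the same product stages against any adapter."""
    graph_id = store.create_graph(name)
    store.set_ontology(graph_id, ontology)
    episode_ids = store.add_text_batches(graph_id, chunks)
    store.wait_for_ingestion(episode_ids)
    return graph_id
```

With this shape, `build_graph` depends only on the protocol, so swapping the graph provider would mean writing another adapter rather than changing the upload or simulation workflow.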

## Source-context note

The requested knowledge profile mentioned generated wiki context, solved-problem notes under `docs/solutions/**`, and `STRATEGY.md` when present. In this repository checkout, the focused source pass found the implementation files above but did not find `STRATEGY.md` or `docs/solutions/**`, so this page treats repository code as the source of truth and does not claim prior strategy or solution-note evidence.

Sources: [backend/app/api/graph.py:120-235](), [backend/app/services/graph_builder.py:40-52]()

## Summary

Uploading files is not enough because MiroFish is not building a file viewer. It is building a simulation substrate. Files become extracted text; extracted text plus simulation intent becomes ontology; ontology plus chunks becomes a graph; graph construction becomes a tracked task; and the resulting `graph_id` becomes the handle that simulation creation, profile generation, configuration generation, and optional runtime memory updates use. Without those transformations, the simulator would have documents, but not actors, relationships, progress state, or graph memory.

Sources: [backend/app/api/graph.py:348-493](), [backend/app/api/simulation.py:211-224](), [backend/app/services/zep_graph_memory_updater.py:414-418]()
