# Verifying the Engine and Where to Go Next

> The closing page: how numeric parity is proven rather than assumed (omni-verify against Python-generated fixtures, cosine >= 0.999 with matching token ids, image/video/audio matching the upstream model.py), how the test suite and fixture generators are run, and a short map of what to read next after the first 30 minutes.

- Repository: hanxiao/omni-macos
- GitHub: https://github.com/hanxiao/omni-macos
- Human wiki: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05
- Complete Markdown: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05/llms-full.txt

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [Sources/omni-verify/main.swift](Sources/omni-verify/main.swift)
- [Tools/gen_fixtures.py](Tools/gen_fixtures.py)
- [Tests/OmniKitTests/TextEncoderTests.swift](Tests/OmniKitTests/TextEncoderTests.swift)
- [Tests/OmniKitTests/VectorStoreTests.swift](Tests/OmniKitTests/VectorStoreTests.swift)
- [Tests/OmniKitTests/VisionEncoderTests.swift](Tests/OmniKitTests/VisionEncoderTests.swift)
- [Tests/OmniKitTests/AVEncoderTests.swift](Tests/OmniKitTests/AVEncoderTests.swift)
- [Scripts/run-tests.sh](Scripts/run-tests.sh)
- [Makefile](Makefile)
</details>

Omni reimplements a Python embedding model (`jinaai/jina-embeddings-v5-omni-small-mlx`) as an in-process MLX-Swift engine. The central risk in any port is silent numeric drift: a vector that is *close enough to look right* but ranks files differently from the reference. Omni's answer is to never assume parity and instead **measure it against fixtures the original Python model produced** — exact token-id equality plus cosine similarity above a hard threshold, modality by modality. This page explains how that proof works, how to run it, and what to read once parity is no longer a mystery.

The mental model is two independent implementations meeting at a shared, version-controlled artifact. The Python reference writes embeddings and token ids to `Fixtures/`; the Swift port loads the *same* original Hugging Face weights, merges the same retrieval LoRA at load time, and must reproduce those numbers. If it cannot, the verifier and the test suite both exit non-zero.
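
To make "merges the same retrieval LoRA at load time" concrete, here is a minimal sketch of a LoRA merge on plain arrays. It assumes the standard formulation, `W' = W + (alpha / r) * B * A`. The function name `mergeLoRA` and the array layout are hypothetical, and the real `WeightStore` may organize this differently.

```swift
// Hypothetical helper, not the project's WeightStore API.
// Row-major matrices: W is out x in, B is out x r, A is r x in.
func mergeLoRA(weight w: [[Float]], a: [[Float]], b: [[Float]], alpha: Float) -> [[Float]] {
    let rank = a.count
    precondition(rank > 0, "A must have at least one row")
    precondition(b.count == w.count, "B and W must share the output dimension")
    precondition(b.allSatisfy { $0.count == rank }, "B must be out x r")
    // Assumed scaling: the standard LoRA factor alpha / r.
    let scale = alpha / Float(rank)
    var merged = w
    for i in 0..<w.count {
        precondition(a.allSatisfy { $0.count == w[i].count }, "A must be r x in")
        for j in 0..<w[i].count {
            var delta: Float = 0
            for k in 0..<rank { delta += b[i][k] * a[k][j] }
            merged[i][j] += scale * delta
        }
    }
    return merged
}

let w: [[Float]] = [[1, 0], [0, 1]]
let a: [[Float]] = [[1, 2]]       // r = 1, in = 2
let b: [[Float]] = [[0.5], [1]]   // out = 2, r = 1
let merged = mergeLoRA(weight: w, a: a, b: b, alpha: 2)
print(merged)
```

Merging once at load time means the forward pass sees a single dense matrix, so any mistake in the scale or in the `A`/`B` orientation surfaces directly as a cosine drop in the parity gate.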

## The parity loop: reference, artifact, port

```mermaid
flowchart LR
  subgraph REF["Reference side (Python, run in mls venv)"]
    GEN["Tools/gen_fixtures.py<br/>model.encode(query/passage)"]
  end
  subgraph ART["Shared artifacts (checked in)"]
    TXT["Fixtures/text_fixtures.json<br/>token ids + embeddings"]
    IMG["image_ref / video_ref / audio_ref<br/>.safetensors"]
    META["meta.json<br/>dim, prefixes, lora r/alpha"]
  end
  subgraph PORT["Port side (Swift, MLX)"]
    WS["WeightStore<br/>load HF safetensors + merge LoRA"]
    ENC["OmniTextEncoder / Vision / Audio towers"]
    VER["omni-verify (default)"]
    TST["OmniKitTests (XCTest)"]
  end
  GEN --> TXT
  GEN --> META
  TXT --> VER
  TXT --> TST
  IMG --> TST
  WS --> ENC --> VER
  ENC --> TST
  VER -->|"cos >= 0.999 + token ids exact"| GATE{"exit 0 / fail"}
  TST -->|"XCTAssertGreaterThanOrEqual"| GATE
```

The two sides share weights but not code. `gen_fixtures.py` runs inside the Python `mls` venv, calls `model.encode(...)` with `task_type="retrieval.query"` and `"retrieval.passage"`, and records both the embeddings and the token ids of the prefixed strings the reference actually encoded (`"Query: " + t` / `"Document: " + t`). The Swift side loads the same snapshot through `WeightStore` and must reproduce both.

Sources: [Tools/gen_fixtures.py:42-98](Tools/gen_fixtures.py), [Sources/omni-verify/main.swift:898-906](Sources/omni-verify/main.swift)
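
As an illustration of the shared artifact, the sketch below decodes one fixture record with `Codable`. The four field names match the ones the verifier reads (`query_token_ids`, `passage_token_ids`, `query_embedding`, `passage_embedding`). The `text` field and the top-level array layout are assumptions made for the example, not a confirmed description of `text_fixtures.json`.

```swift
import Foundation

// Field names mirror what the verifier reads from each record.
// `text` and the top-level JSON array are assumptions for illustration.
struct FixtureRecord: Decodable {
    let text: String
    let query_token_ids: [Int]
    let passage_token_ids: [Int]
    let query_embedding: [Float]
    let passage_embedding: [Float]
}

func loadFixtures(from data: Data) throws -> [FixtureRecord] {
    try JSONDecoder().decode([FixtureRecord].self, from: data)
}

// A tiny in-memory sample standing in for the real fixture file.
let sample = """
[{"text": "a",
  "query_token_ids": [1, 2, 3],
  "passage_token_ids": [4, 5, 6],
  "query_embedding": [0.6, 0.8],
  "passage_embedding": [1.0, 0.0]}]
"""
let records = try loadFixtures(from: Data(sample.utf8))
print(records[0].query_token_ids)
```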

## Text parity: token ids exact, cosine >= 0.999

The default mode of `omni-verify` (invoked as `omni-verify <modelDir> <fixturesJson>` with no benchmark subcommand) is the text parity gate. It loads the encoder once, then for every fixture record it checks two things and tracks the worst case across all records:

- **Token ids must match exactly.** `encoder.tokenIds(text, .query) == record.query_token_ids` (and the passage equivalent). A single divergent id fails the run — tokenization is treated as a correctness invariant, not a fuzzy match.
- **Embeddings must hit cosine >= 0.999** against the reference query and passage vectors.

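```swift
// Sources/omni-verify/main.swift (excerpt; comments added, elisions marked "// ...")
// Token ids must match the reference exactly for both task types.
let qTokMatch = qIds == r.query_token_ids
let pTokMatch = pIds == r.passage_token_ids
// ...
// Cosine of each Swift embedding against the Python reference embedding.
let cq = cosine(q, r.query_embedding)
let cp = cosine(p, r.passage_embedding)
let flag = (cq >= 0.999 && cp >= 0.999 && qTokMatch && pTokMatch) ? "ok " : "BAD"
// ...
// The exit status reflects the worst case across all records.
exit(worstQ >= 0.999 && worstP >= 0.999 && tokOK ? 0 : 1)
```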

The probe strings are deliberately diverse — short, long, German, Chinese, a Python snippet, punctuation, and the single character `"a"` — so the gate exercises edge cases of tokenization and pooling rather than one happy-path sentence.

Sources: [Sources/omni-verify/main.swift:908-929](Sources/omni-verify/main.swift), [Tools/gen_fixtures.py:30-39](Tools/gen_fixtures.py)
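
The gate logic can be reproduced in a few self-contained lines. This is a sketch rather than the verifier's code: the `GateResult` type and the `gate` function are hypothetical, and accumulating in `Double` is a choice made here, not something the source confirms.

```swift
// Cosine similarity, accumulated in Double to limit rounding noise.
func cosine(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count && !a.isEmpty, "vectors must share a non-zero length")
    var dot = 0.0, na = 0.0, nb = 0.0
    for i in 0..<a.count {
        let x = Double(a[i]), y = Double(b[i])
        dot += x * y
        na += x * x
        nb += y * y
    }
    let denom = (na * nb).squareRoot()
    return denom > 0 ? dot / denom : 0
}

// Hypothetical summary: worst cosine across records plus a token-id flag.
struct GateResult {
    var worst = 1.0
    var tokensOK = true
    var passed: Bool { worst >= 0.999 && tokensOK }
}

func gate(_ cases: [(got: [Float], want: [Float], gotIds: [Int], wantIds: [Int])]) -> GateResult {
    var result = GateResult()
    for c in cases {
        result.worst = min(result.worst, cosine(c.got, c.want))
        if c.gotIds != c.wantIds { result.tokensOK = false }
    }
    return result
}

let good = gate([(got: [0.6, 0.8], want: [0.6, 0.8], gotIds: [1, 2], wantIds: [1, 2])])
let drift = gate([(got: [0.6, 0.8], want: [0.8, 0.6], gotIds: [1, 2], wantIds: [1, 2])])
let badTok = gate([(got: [0.6, 0.8], want: [0.6, 0.8], gotIds: [1, 2], wantIds: [1, 3])])
print(good.passed, drift.passed, badTok.passed)  // true false false
```

Tracking the worst case instead of the mean matters: a single badly encoded record cannot hide behind many good ones.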

### The same gate, run as a test

`TextEncoderTests` asserts the identical contract through XCTest, with one critical detail: it pins the encoder to the **fp32** compute path before building weights, because the fixtures are fp32 while the shipping app defaults to bf16.

```swift
// Tests/OmniKitTests/TextEncoderTests.swift — setUp()
setenv("OMNI_BF16_COMPUTE", "0", 1)
setenv("OMNI_BACKBONE_BF16", "0", 1)
```

Three text tests guard distinct claims:

- `testTextEmbeddingsMatchReference`: the small model, cos >= 0.999.
- `testNanoEmbeddingsMatchReference`: the 768-dim nano variant against its own fixtures, with an explicit dimension check to catch a wrong-model-loaded mistake.
- `testBatchedMatchesSingle`: right-padded batched encoding must equal per-string encoding *and* the reference, so padding never perturbs a vector.

Each test `XCTSkip`s cleanly when the model snapshot is not staged locally, so the suite stays green on machines without the weights.

Sources: [Tests/OmniKitTests/TextEncoderTests.swift:17-23](Tests/OmniKitTests/TextEncoderTests.swift), [Tests/OmniKitTests/TextEncoderTests.swift:67-123](Tests/OmniKitTests/TextEncoderTests.swift)
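
Why right-padding is a risk worth testing can be seen in a small sketch of last-token pooling followed by L2 normalization. `lastTokenPool` is a hypothetical helper, not the encoder's API, and the example assumes a causal tower, so pad rows cannot change the hidden states of earlier tokens.

```swift
// Hypothetical pooling helper: picks the hidden state of the last real token,
// then L2-normalizes it. `hidden` is [sequence][dim]; `length` excludes padding.
func lastTokenPool(_ hidden: [[Float]], length: Int) -> [Float] {
    precondition(length > 0 && length <= hidden.count, "length must index a real token")
    let v = hidden[length - 1]
    let norm = v.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return norm > 0 ? v.map { $0 / norm } : v
}

let single: [[Float]] = [[1, 0], [3, 4]]                  // 2 real tokens
let padded: [[Float]] = [[1, 0], [3, 4], [9, 9], [9, 9]]  // same tokens + 2 pad rows

let fromSingle = lastTokenPool(single, length: 2)
let fromPadded = lastTokenPool(padded, length: 2)             // correct: ignores pad rows
let fromPadRow = lastTokenPool(padded, length: padded.count)  // bug: pools a pad row
print(fromSingle == fromPadded, fromSingle == fromPadRow)     // true false
```

A batched-versus-single test catches exactly the second kind of mistake, where the pooled index follows the padded length instead of the real one.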

## Image, video, and audio parity against `model.py`

Non-text modalities are verified the same way, against `.safetensors` reference dumps from the upstream model. The key technique is to feed the Swift towers the **reference's exact preprocessed inputs** (`pixel_values`, `grid_thw`, mel frames) so a parity test isolates the tower + injection + pooling math from preprocessing/resize differences. A tower-level gate is strict (cos >= 0.999); an end-to-end gate that includes Omni's own resize/decode path is looser (cos >= 0.90), reflecting that pixel resampling legitimately differs.

| Modality | What is compared | Gate | Where |
|---|---|---|---|
| Text query + passage | cosine vs reference embedding, exact token ids | cos >= 0.999, ids identical | `omni-verify` default; `TextEncoderTests` |
| Text (nano, dim 768) | same, against nano fixtures | cos >= 0.999 | `TextEncoderTests.testNanoEmbeddingsMatchReference` |
| Batched text | batched vs single vs reference | cos >= 0.999 | `TextEncoderTests.testBatchedMatchesSingle` |
| Image tower (same `pixel_values`) | cosine vs `encode_image` | cos >= 0.999 | `VisionEncoderTests` |
| Image end-to-end (resize path) | cosine vs reference | cos >= 0.90 | `VisionEncoderTests` |
| Video tower (same input) | cosine vs `encode_video` | cos >= 0.999 | `AVEncoderTests` |
| Audio tower (same mel) | cosine vs `encode_audio` | cos >= 0.999 | `AVEncoderTests` |
| Audio single-vs-batched | clip-isolation under batching | cos >= 0.99999 | `AVEncoderTests.testAudioBatchParity` |

The `0.99999` batch gate is stronger than the reference gate on purpose. Batched media forwards use block-diagonal (`cu_seqlens`) attention, and this gate checks that one image or clip never leaks into another's vector: batched output must agree with one-at-a-time output to five nines, even where the reference comparison only needs three.

Sources: [Tests/OmniKitTests/VisionEncoderTests.swift:46-69](Tests/OmniKitTests/VisionEncoderTests.swift), [Tests/OmniKitTests/AVEncoderTests.swift:42-99](Tests/OmniKitTests/AVEncoderTests.swift), [Tests/OmniKitTests/AVEncoderTests.swift:107-155](Tests/OmniKitTests/AVEncoderTests.swift)
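
The block-diagonal idea can be sketched as a boolean mask built from cumulative sequence lengths. The function below is illustrative only. The real towers presumably hand `cu_seqlens` to an MLX attention kernel and never materialize a mask like this.

```swift
// Builds a boolean attention mask from cumulative sequence lengths.
// cuSeqlens = [0, 3, 5] means clip 0 owns tokens 0..<3 and clip 1 owns 3..<5.
func blockDiagonalMask(cuSeqlens: [Int]) -> [[Bool]] {
    precondition(cuSeqlens.first == 0, "cu_seqlens must start at 0")
    let total = cuSeqlens.last ?? 0
    var mask = Array(repeating: Array(repeating: false, count: total), count: total)
    for (start, end) in zip(cuSeqlens, cuSeqlens.dropFirst()) {
        precondition(start <= end, "cu_seqlens must be non-decreasing")
        // Tokens attend only to tokens inside their own clip.
        for q in start..<end {
            for k in start..<end { mask[q][k] = true }
        }
    }
    return mask
}

let mask = blockDiagonalMask(cuSeqlens: [0, 3, 5])
for row in mask {
    print(row.map { $0 ? "1" : "." }.joined())
}
// 111..
// 111..
// 111..
// ...11
// ...11
```

Every off-block entry is `false`, which is the property the single-versus-batched gate verifies numerically: no token of one clip contributes to another clip's vector.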

## Parity vs. benchmarks: two different questions

`omni-verify` is one binary with many subcommands dispatched off `args[1]`. They fall into two kinds that should be kept apart: **parity/quality** modes answer "is the output correct?" and exit non-zero on failure; **benchmark** modes answer "how fast / how much memory?" and never gate correctness. Everything above this section concerns the first kind.

| Mode | Question it answers |
|---|---|
| *(default)* `<modelDir> <fixtures.json>` | Text token-id + cosine parity (the gate) |
| `imgbatchparity` | Image single-vs-batched (cos >= 0.99999) + reference gate (cos >= 0.999) |
| `audiocheck` | Audio path returns a finite, L2-normalized vector |
| `retrieve` / `xmodal` | Retrieval *quality*: top-1 accuracy + MRR, and text->image cross-modal search |
| `levercheck` | Optional perf levers (`OMNI_ASYNC_EVAL`, `OMNI_COMPILE_BLOCK`) are output-neutral |
| `bench`, `searchbench`, `concbench`, `concbench2`, `storemem`, `loadbench`, `crawlbench`, `indexbench`, `mediabench`, `audiobench` | Throughput, latency, memory, concurrency — performance only |

Note that `retrieve` and `xmodal` measure something parity cannot: a port can be numerically faithful yet still retrieve poorly if the *model* is weak, so these report accuracy on labelled corpora and confusable "hard" clusters as a separate signal.

Sources: [Sources/omni-verify/main.swift:459-543](Sources/omni-verify/main.swift), [Sources/omni-verify/main.swift:631-718](Sources/omni-verify/main.swift), [Sources/omni-verify/main.swift:831-873](Sources/omni-verify/main.swift)
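
For readers unfamiliar with the two quality metrics named above, here is a minimal sketch of top-1 accuracy and mean reciprocal rank (MRR) using their standard definitions. The function names and the input shape are assumptions; how `retrieve` and `xmodal` compute and report them is not shown in this page.

```swift
// Each entry is the 1-based rank at which the correct item appeared,
// or nil when it was not in the returned list at all.
func top1Accuracy(_ ranks: [Int?]) -> Double {
    guard !ranks.isEmpty else { return 0 }
    return Double(ranks.filter { $0 == 1 }.count) / Double(ranks.count)
}

func meanReciprocalRank(_ ranks: [Int?]) -> Double {
    guard !ranks.isEmpty else { return 0 }
    var total = 0.0
    for case let rank? in ranks where rank > 0 {
        total += 1.0 / Double(rank)   // a miss (nil) contributes 0
    }
    return total / Double(ranks.count)
}

let ranks: [Int?] = [1, 2, nil, 1]
print(top1Accuracy(ranks), meanReciprocalRank(ranks))  // 0.5 0.625
```

Unlike the cosine gate, these numbers depend on the corpus and on the model's quality, so they are a signal to track, not a fixed pass/fail threshold derived from the reference.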

## The store path: tested without the GPU

Not everything needs MLX. `VectorStoreTests` exercises insert, search ranking, per-file chunk ranking, delete compaction, in-place replace, and reload-from-disk using only SQLite + Accelerate, with orthonormal basis vectors so a dot product *equals* cosine and expected scores are exactly `1.0` or `0.0`. This makes the store's contiguous-`flat`-buffer bookkeeping (does a deleted row stay aligned? does reopen rebuild identically?) verifiable cheaply and deterministically, independent of the embedding engine.

```swift
// Tests/OmniKitTests/VectorStoreTests.swift
let hits = store.search(basis(1), topK: 10)
XCTAssertEqual(hits.first?.path, "/b.txt")
XCTAssertEqual(hits.first?.score ?? 0, 1.0, accuracy: 1e-6)
```

Sources: [Tests/OmniKitTests/VectorStoreTests.swift:14-89](Tests/OmniKitTests/VectorStoreTests.swift)
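
The orthonormal-basis trick is easy to reproduce in isolation. The sketch below uses hypothetical stand-ins (`basis`, `FlatIndex`) with plain loops in place of SQLite and Accelerate. It only illustrates why expected scores are exactly `1.0` or `0.0` and why delete compaction has to keep rows aligned with paths.

```swift
// Hypothetical stand-ins for the test helpers: one-hot unit vectors and a
// brute-force scan over a contiguous row-major buffer.
func basis(_ i: Int, dim: Int = 4) -> [Float] {
    var v = [Float](repeating: 0, count: dim)
    v[i] = 1
    return v
}

struct FlatIndex {
    let dim: Int
    var paths: [String] = []
    var flat: [Float] = []   // row r lives at flat[r*dim ..< (r+1)*dim]

    mutating func insert(_ path: String, _ vector: [Float]) {
        precondition(vector.count == dim)
        paths.append(path)
        flat.append(contentsOf: vector)
    }

    // Deleting compacts both arrays so row r still maps to paths[r].
    mutating func delete(_ path: String) {
        guard let r = paths.firstIndex(of: path) else { return }
        paths.remove(at: r)
        flat.removeSubrange(r * dim ..< (r + 1) * dim)
    }

    // For unit vectors the dot product equals the cosine.
    func search(_ query: [Float], topK: Int) -> [(path: String, score: Float)] {
        let scored = paths.indices.map { r -> (path: String, score: Float) in
            var dot: Float = 0
            for j in 0..<dim { dot += flat[r * dim + j] * query[j] }
            return (paths[r], dot)
        }
        return Array(scored.sorted { $0.score > $1.score }.prefix(topK))
    }
}

var index = FlatIndex(dim: 4)
index.insert("/a.txt", basis(0))
index.insert("/b.txt", basis(1))
index.insert("/c.txt", basis(2))
index.delete("/a.txt")   // rows shift down; "/b.txt" must still match basis(1)
let hits = index.search(basis(1), topK: 10)
print(hits.first?.path ?? "none", hits.first?.score ?? 0)  // /b.txt 1.0
```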

## Running it

The `Makefile` is the front door. `make fixtures` regenerates the Python references and copies `text_fixtures.json` into the test resources; `make test` runs the bundle (optionally filtered with `ONLY=`); `make app` generates and builds the Xcode project.

| Command | Effect |
|---|---|
| `make fixtures` | Run `gen_fixtures.py`, copy fixtures into `Tests/OmniKitTests/Resources/` |
| `make test` | Build + run `OmniKitTests` via `Scripts/run-tests.sh` |
| `make test ONLY=OmniKitTests.VectorStoreTests` | Run a single test class |
| `swift run omni-verify <modelDir> Fixtures/text_fixtures.json` | The text parity gate as a CLI |

`Scripts/run-tests.sh` exists because plain `xcodebuild test` fails in two ways in this project, and the script header documents both. First, `swift-tokenizers` ships its Rust FFI as an SE-0482 static-library artifact bundle that `xcodebuild` will not expose (build error `Cannot find type 'RustBuffer'`). Second, the resulting SPM test bundle is unsigned, so the test runner refuses to load it. The script applies the same global module-map / static-lib overrides the app build uses, builds *for testing*, ad-hoc signs the `.xctest` bundle, and runs it directly with `xcrun xctest`. While it runs, it moves `Omni.xcodeproj` aside so the project does not shadow the SwiftPM package, and restores it on any exit.

Sources: [Makefile:18-24](Makefile), [Scripts/run-tests.sh:7-47](Scripts/run-tests.sh)

### A note on portability

The verification design is model-agnostic by construction: the gate compares against whatever `OMNI_MODEL_DIR` / `OMNI_NANO_MODEL_DIR` point at, and the same fixture-and-cosine harness already covers both the small (dim 1024) and nano (dim 768) variants. Swapping in a different model snapshot means regenerating fixtures from that model's own reference code and rerunning — no part of the proof is tied to a hosted service or a single weights provider.

Sources: [Tests/OmniKitTests/TextEncoderTests.swift:36-44](Tests/OmniKitTests/TextEncoderTests.swift), [Tools/gen_fixtures.py:23-26](Tools/gen_fixtures.py)
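
One way such environment-driven model selection could look is sketched below. `stagedModelDir` and `expectedDim` are hypothetical helpers; only the two environment variable names and the two dimensions come from the text above. In an XCTest, a `nil` result would typically be turned into an `XCTSkip`.

```swift
import Foundation

// Hypothetical resolver: returns the snapshot directory named by an
// environment variable, or nil when it is unset or not a directory.
func stagedModelDir(envVar: String,
                    environment: [String: String] = ProcessInfo.processInfo.environment) -> URL? {
    guard let path = environment[envVar], !path.isEmpty else { return nil }
    var isDir: ObjCBool = false
    guard FileManager.default.fileExists(atPath: path, isDirectory: &isDir),
          isDir.boolValue else { return nil }
    return URL(fileURLWithPath: path, isDirectory: true)
}

// Dimension guard: 1024 for the small model, 768 for nano.
func expectedDim(forEnvVar envVar: String) -> Int? {
    switch envVar {
    case "OMNI_MODEL_DIR": return 1024
    case "OMNI_NANO_MODEL_DIR": return 768
    default: return nil
    }
}

// The temporary directory stands in for a staged snapshot in this example.
let tmp = NSTemporaryDirectory()
let found = stagedModelDir(envVar: "OMNI_MODEL_DIR", environment: ["OMNI_MODEL_DIR": tmp])
let missing = stagedModelDir(envVar: "OMNI_NANO_MODEL_DIR", environment: [:])
print(found != nil, missing == nil)
```

Checking the embedding dimension against the expected value right after loading is what turns a misconfigured path into an immediate, readable failure instead of a confusing cosine miss.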

## Where to go next after the first 30 minutes

Once you trust the numbers, read the engine that produces them, then the pipeline that uses them.

```text
Engine internals (Sources/OmniKit/)
  WeightStore.swift        load HF safetensors, merge retrieval LoRA, fp32 upcast  <- starts the parity chain
  OmniTextEncoder.swift    Qwen3 text tower, last-token pool + L2
  OmniVisionTower.swift    image/video ViT + merger, block-diagonal attention
  OmniAudioTower.swift     mel -> audio tower; OmniAudioPreprocess.swift (STFT)
  OmniConfig.swift         dims, loraScale, prefixes (mirrors Fixtures/meta.json)
  OmniEngine.swift         the serialized facade the app + benchmarks drive

Index + search (Sources/OmniKit/)
  FileCrawler.swift / FileExtractor.swift   crawl + text extraction
  Indexer.swift            crawl -> decode -> batched embed -> store
  VectorStore.swift        SQLite + contiguous flat buffer + cosine search
  SearchQueryParser.swift  query syntax

Fixtures + generators
  Tools/gen_{image,audio,video}_fixtures.py   per-modality reference dumps
  Fixtures/*.safetensors, meta.json           the artifacts the gates load

App surface (App/)
  OmniApp.swift / ContentView.swift / ResultsList.swift   SwiftUI shell
  Serving/HTTPServer.swift, Router.swift                  local serving endpoints
```

A productive path: `WeightStore.swift` (the load + LoRA merge that everything downstream depends on), then `OmniTextEncoder.swift` (where the verified text vector is actually produced), then `OmniEngine.swift` and `Indexer.swift` to see how those vectors reach `VectorStore.swift` and on-screen results. The fixtures and `meta.json` are the Rosetta stone between the Python reference and the Swift config — keep them open while reading the towers.

Sources: [Sources/omni-verify/main.swift:902-905](Sources/omni-verify/main.swift), [Sources/omni-verify/main.swift:549-555](Sources/omni-verify/main.swift)

## Summary

Omni proves its port rather than trusting it: a Python reference writes token ids and embeddings to checked-in fixtures, the Swift engine loads the same weights and merges the same LoRA, and `omni-verify` plus the `OmniKitTests` suite fail loudly unless token ids match exactly and cosine clears 0.999 for text (with strict 0.999 tower gates and 0.99999 batch-isolation gates for image/video/audio). The store is tested separately with orthonormal vectors and no GPU, and the whole suite runs through a purpose-built script that works around the `swift-tokenizers` Rust-artifact and code-signing constraints. With the gate understood, the natural next reading order is `WeightStore` → `OmniTextEncoder` → `OmniEngine`/`Indexer` → `VectorStore` — the exact chain the verifier walks to produce a number you can now trust.
