# OmniEngine: Embedding Towers, LoRA Merge, and the Priority Gate

> The heart of OmniKit: how WeightStore loads HF safetensors and merges the retrieval LoRA, how the Qwen3 text tower and Qwen3-VL / Whisper-style towers land in one shared space, and how MLX calls are serialized through a priority gate so an interactive query jumps ahead of queued indexing work.

- Repository: hanxiao/omni-macos
- GitHub: https://github.com/hanxiao/omni-macos
- Human wiki: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05
- Complete Markdown: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05/llms-full.txt

## Source Files

- `Sources/OmniKit/OmniEngine.swift`
- `Sources/OmniKit/WeightStore.swift`
- `Sources/OmniKit/OmniTextEncoder.swift`
- `Sources/OmniKit/OmniImageEncoder.swift`
- `Sources/OmniKit/Qwen3Backbone.swift`
- `Sources/OmniKit/OmniAudioEncoder.swift`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [Sources/OmniKit/OmniEngine.swift](Sources/OmniKit/OmniEngine.swift)
- [Sources/OmniKit/WeightStore.swift](Sources/OmniKit/WeightStore.swift)
- [Sources/OmniKit/OmniTextEncoder.swift](Sources/OmniKit/OmniTextEncoder.swift)
- [Sources/OmniKit/OmniImageEncoder.swift](Sources/OmniKit/OmniImageEncoder.swift)
- [Sources/OmniKit/Qwen3Backbone.swift](Sources/OmniKit/Qwen3Backbone.swift)
- [Sources/OmniKit/OmniAudioEncoder.swift](Sources/OmniKit/OmniAudioEncoder.swift)
- [Sources/OmniKit/OmniConfig.swift](Sources/OmniKit/OmniConfig.swift)
- [Sources/OmniKit/OmniAudioTower.swift](Sources/OmniKit/OmniAudioTower.swift)
</details>

`OmniEngine` is the public face of the embedding engine: a single object that loads the `jina-embeddings-v5-omni` model once and turns text, images, video, and audio into L2-normalized vectors that all land in the **same** retrieval space. Everything in OmniKit that produces a vector, from the indexer crawling your Documents folder to the search box you type into, calls through this one class. This page traces three things that make it work: how `WeightStore` loads Hugging Face safetensors and **bakes the retrieval LoRA into the backbone at load time**; how the Qwen3 text tower and the Qwen3-VL / Qwen2.5-Omni media towers all run through one shared `Qwen3Backbone`, so a text query can match a scanned PDF; and how every MLX evaluation is serialized through a **priority gate** so an interactive search jumps ahead of queued indexing work (it cannot preempt the one embed already in flight).

If you are new to the repo, read `OmniEngine.swift` first (the facade and the gate), then `WeightStore.swift` (what "the model" actually is in memory), then `Qwen3Backbone.swift` (the shared compute core). The three media encoders are thin wrappers that all delegate to that backbone.

## The shape of the engine

```mermaid
flowchart TB
    subgraph app["Callers"]
      Q["Search box<br/>embedQuery / embedFileQuery"]
      IDX["Indexer<br/>embedText / embedImages / embedAudioMelBatch"]
    end
    subgraph engine["OmniEngine (facade + priority gate)"]
      GATE["run(highPriority:)<br/>NSCondition: busy / highWaiting"]
    end
    subgraph encoders["Encoders (one shared backbone)"]
      TE["OmniTextEncoder<br/>Qwen3 text tower"]
      IE["OmniImageEncoder<br/>Qwen3-VL ViT + merger"]
      AE["OmniAudioEncoder<br/>Qwen2.5-Omni audio tower + projector"]
      BB["Qwen3Backbone<br/>28-layer GQA, last-token pool, L2"]
    end
    WS["WeightStore<br/>safetensors + merged retrieval LoRA"]
    Q --> GATE
    IDX --> GATE
    GATE --> TE & IE & AE
    TE --> BB
    IE --> BB
    AE --> BB
    WS -. weights .-> TE & IE & AE & BB
```

`OmniEngine.init` loads the config, parses the BPE tokenizer concurrently with the synchronous weight load, then constructs the three encoders over one shared `WeightStore`. The embedding dimension is just the text hidden size (1024). The image and audio encoders are failable: `OmniImageEncoder?` and `OmniAudioEncoder?` are `nil` when their tower weights are absent, which drives `supportsImages` / `supportsAudio`.

Sources: [Sources/OmniKit/OmniEngine.swift:114-165](Sources/OmniKit/OmniEngine.swift), [Sources/OmniKit/OmniConfig.swift:7-21](Sources/OmniKit/OmniConfig.swift)

## WeightStore: loading safetensors and merging the retrieval LoRA

`WeightStore` is where "the model" becomes a `[String: MLXArray]` dictionary in memory. Its job is to mirror the reference Python `JinaMultiTaskModel` + `sanitize` so the Swift vectors match the Python ones numerically. Three things happen, in order.

**1. Load and prune.** It loads `model.safetensors`, then drops keys the app won't run: `position_ids`, and (optionally) `audio_tower.*` / `audio_projector.*` / `vision_tower.*` / `merger.*`. `OmniEngine` keeps all towers (`keepVision: true, keepAudio: true`), so pruning here mostly removes `position_ids`.
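The pruning rule in step 1 can be sketched over plain key names. This is a minimal illustration, not the code in `WeightStore.swift`: the helper name `prunedKeys`, the sample keys, and the exact matching rules (prefix match for tower keys, suffix match for `position_ids`) are assumptions.

```swift
import Foundation

/// Hypothetical helper: decide which checkpoint keys survive pruning.
/// Assumes tower keys are matched by prefix and `position_ids` by suffix.
func prunedKeys(_ keys: [String], keepVision: Bool, keepAudio: Bool) -> [String] {
    let visionPrefixes = ["vision_tower.", "merger."]
    let audioPrefixes = ["audio_tower.", "audio_projector."]
    return keys.filter { key in
        // position_ids is a buffer, not a trained weight, so it is always dropped.
        if key.hasSuffix("position_ids") { return false }
        if !keepVision && visionPrefixes.contains(where: { key.hasPrefix($0) }) { return false }
        if !keepAudio && audioPrefixes.contains(where: { key.hasPrefix($0) }) { return false }
        return true
    }
}

// Made-up key names, for illustration only.
let checkpointKeys = [
    "language_model.layers.0.self_attn.q_proj.weight",
    "language_model.embeddings.position_ids",
    "vision_tower.blocks.0.attn.qkv.weight",
    "audio_projector.weight",
]
// With both towers kept (the OmniEngine setting), only position_ids is removed.
print(prunedKeys(checkpointKeys, keepVision: true, keepAudio: true).count)
```

With `keepVision: false, keepAudio: false` the same helper would leave only the `language_model.*` weight, which is the text-only configuration the optional flags allow.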

**2. Merge the retrieval LoRA in fp32.** The retrieval adapter (`adapters/retrieval/adapter_model.safetensors`) is a set of low-rank `A`/`B` pairs. For each target weight `W`, the merge computes `W += loraScale * (B @ A)`. The `loraScale` is `alpha/r = 1.0` for retrieval. This is a permanent, baked-in merge — there is no runtime adapter switching; the engine only ever produces retrieval embeddings.

```swift
// Sources/OmniKit/WeightStore.swift
let a = aArr.asType(.float32)        // [r, in]
let b = bArr.asType(.float32)        // [out, r]
let delta = matmul(b, a)             // [out, in]
w[baseKey] = base + (loraScale * delta)
```
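To make the shapes concrete, here is a small worked instance of the same update, `W + loraScale * (B @ A)`, using plain nested arrays instead of `MLXArray`. The names `Matrix`, `matmul2D`, and `mergeLoRA` are illustrative stand-ins, and the numbers are made up (rank 1, 2 inputs, 2 outputs).

```swift
typealias Matrix = [[Float]]

/// Naive dense matrix product, enough for a worked example.
func matmul2D(_ x: Matrix, _ y: Matrix) -> Matrix {
    precondition(x.first?.count == y.count, "inner dimensions must match")
    let cols = y[0].count
    return x.map { row in
        (0..<cols).map { j in
            zip(row, y).reduce(Float(0)) { acc, pair in acc + pair.0 * pair.1[j] }
        }
    }
}

/// W' = W + scale * (B @ A), with A: [r, in], B: [out, r], W: [out, in].
func mergeLoRA(base: Matrix, a: Matrix, b: Matrix, scale: Float) -> Matrix {
    let delta = matmul2D(b, a)  // [out, r] x [r, in] -> [out, in]
    return zip(base, delta).map { pair in
        zip(pair.0, pair.1).map { $0 + scale * $1 }
    }
}

let merged = mergeLoRA(
    base: [[1, 0], [0, 1]],   // W: identity, [out = 2, in = 2]
    a: [[1, 2]],              // A: [r = 1, in = 2]
    b: [[0.5], [-1]],         // B: [out = 2, r = 1]
    scale: 1.0                // alpha / r for the retrieval adapter
)
print(merged)  // [[1.5, 1.0], [-1.0, -1.0]]
```

Because the delta has the same shape as `W`, the merged weight is a drop-in replacement and no adapter state is needed at inference time.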

**3. Pay fp32 only where it matters.** The reference upcasts the entire backbone to fp32. By default, this implementation is more selective: it first scans the adapter to learn exactly which `language_model.*` linears the LoRA touches (`loraTargets`), upcasts only those to fp32 for the merge, then casts them back to bf16. For every non-target weight the upcast would be a bf16→fp32→bf16 identity round-trip, so skipping it yields byte-identical weights at a fraction of the load-time memory. The exact fp32 parity path (`OMNI_BACKBONE_BF16=0`, used by the fp32 fixtures) still upcasts everything.
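The identity claim can be checked without MLX. The sketch below models bf16 as the top 16 bits of an IEEE-754 float32 and assumes round-to-nearest-even for the narrowing cast; the helper names are hypothetical. Every one of the 65,536 bf16 bit patterns survives the widen-then-narrow trip unchanged, because widening only appends zero bits.

```swift
/// Widen bf16 to float32: the 16 stored bits become the high half of the word.
func bf16ToFloat(_ bits: UInt16) -> Float {
    Float(bitPattern: UInt32(bits) << 16)
}

/// Narrow float32 to bf16 with round-to-nearest-even (assumed rounding mode).
func floatToBF16(_ value: Float) -> UInt16 {
    let bits = value.bitPattern
    let roundingBias: UInt32 = 0x7FFF + ((bits >> 16) & 1)
    return UInt16(truncatingIfNeeded: (bits &+ roundingBias) >> 16)
}

/// True when bf16 -> fp32 -> bf16 reproduces every possible bf16 bit pattern.
func allBF16RoundTripsAreIdentity() -> Bool {
    for raw in UInt16.min...UInt16.max {
        if floatToBF16(bf16ToFloat(raw)) != raw { return false }
    }
    return true
}

print(allBF16RoundTripsAreIdentity())
```

The converse does not hold: a merged fp32 value generally loses bits when narrowed, which is why only the LoRA targets need the fp32 detour.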

Finally, only the merged language backbone is force-evaluated; vision/audio tower weights stay **lazily memory-mapped** until the first image/audio embed, so launching and running a text-only query never pays to materialize towers it won't use.

| Weight class | fp32 round-trip? | When materialized |
| --- | --- | --- |
| `language_model.*` LoRA targets | yes (merge), cast back to bf16 | eagerly at load (`eval`) |
| other `language_model.*` | no (identity, default bf16 path) | eagerly at load |
| `vision_tower.*` / `merger.*` | kept in stored dtype | lazily on first image embed |
| `audio_tower.*` / `audio_projector.*` | kept in stored dtype | lazily on first audio embed |

Sources: [Sources/OmniKit/WeightStore.swift:9-86](Sources/OmniKit/WeightStore.swift), [Sources/OmniKit/OmniConfig.swift:55](Sources/OmniKit/OmniConfig.swift)

## One shared space: how every tower lands on the same vector

The reason text can search images is structural: every modality pools at the **last token** of a sequence run through the same `Qwen3Backbone`, and every sequence is wrapped so that last token lands in a comparable position. Two pieces enforce the alignment.

First, the **retrieval prefix** (`"Query: "` / `"Document: "`) is applied to *every* modality, not just text — the v5-omni model card applies the Query/Document distinction across all inputs. Media is indexed with the `Document:` prefix; a file used as a search query gets the `Query:` prefix instead.

Second, the **media suffix**. Last-token pooling is only meaningful if every modality pools at the *same kind* of token. The text path's tokenizer post-processor may append trailing special tokens (Nano appends `<|end_of_text|>`; Small appends nothing). `OmniTextEncoder` recovers exactly those trailing tokens by diffing `encode("x")` with and without special tokens, and the media encoders append the identical suffix. Without it, an image sequence would pool at `<|vision_end|>` while text pools at end-of-text — different positions, leaving the two modalities in near-orthogonal regions of the space. The embedding version string `omni-2-mediasuffix` records this fix.

```text
text   :  [Query:/Document:] + BPE(text)                                + [suffix]   -> pool @ last
image  :  [Query:/Document:] + <|vision_start|> + ViT feats + <|vision_end|> + [suffix]   -> pool @ last
audio  :  [Query:/Document:] + <|audio_start|>  + aud feats + <|audio_end|>  + [suffix]   -> pool @ last
                                          ^ injected tower features          ^ same trailing token everywhere
```
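The suffix recovery described above (diffing `encode("x")` with and without special tokens) might look like the following. The `Encode` closure type, the function name, and the toy tokenizers with token id `9001` are all hypothetical; the real tokenizer API in `OmniTextEncoder` may differ.

```swift
/// Hypothetical tokenizer interface: text in, token ids out.
typealias Encode = (_ text: String, _ addSpecialTokens: Bool) -> [Int]

/// Returns whatever the post-processor appends after the plain tokens.
func trailingSpecialTokens(using encode: Encode) -> [Int] {
    let plain = encode("x", false)
    let full = encode("x", true)
    guard !plain.isEmpty, full.count >= plain.count else { return [] }
    // Find the last place the plain tokens occur inside the full sequence;
    // everything after that window is the trailing suffix.
    for start in stride(from: full.count - plain.count, through: 0, by: -1) {
        if Array(full[start..<(start + plain.count)]) == plain {
            return Array(full[(start + plain.count)...])
        }
    }
    return []
}

// Toy tokenizers: one appends an end-of-text id (Nano-like), one appends nothing (Small-like).
let nanoLike: Encode = { text, special in
    let body = text.unicodeScalars.map { Int($0.value) }
    return special ? body + [9001] : body
}
let smallLike: Encode = { text, _ in text.unicodeScalars.map { Int($0.value) } }

print(trailingSpecialTokens(using: nanoLike))   // [9001]
print(trailingSpecialTokens(using: smallLike))  // []
```

Deriving the suffix from the tokenizer rather than hardcoding it keeps the media path aligned with the text path for either model size.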

The injection itself is a concatenation, not a scatter: each media encoder builds `inputs_embeds` by concatenating the embedded prefix, the start token, the raw tower features (`[1, N, dim]`), the end token, and the suffix, then runs the shared backbone forward and last-token pools.
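As a rough sketch of that concatenation and the pooling step, using nested `Float` arrays in place of `MLXArray`: the function names and the 2-dimensional toy embeddings are assumptions, and the backbone forward is skipped entirely, so the pooled vector here is just the normalized suffix embedding.

```swift
typealias Embedding = [Float]

/// Concatenate the pieces of one media sequence in order (no scatter into placeholders).
func buildInputsEmbeds(prefix: [Embedding], start: Embedding, features: [Embedding],
                       end: Embedding, suffix: [Embedding]) -> [Embedding] {
    var sequence = prefix
    sequence.append(start)
    sequence.append(contentsOf: features)
    sequence.append(end)
    sequence.append(contentsOf: suffix)
    return sequence
}

/// Last-token pooling followed by L2 normalization.
func poolLastTokenL2(_ hidden: [Embedding]) -> Embedding {
    guard let last = hidden.last else { return [] }
    let norm = last.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    return norm > 0 ? last.map { $0 / norm } : last
}

let inputsEmbeds = buildInputsEmbeds(
    prefix: [[1, 0]],                          // embedded "Document: " tokens
    start: [0, 1],                             // <|vision_start|>
    features: [[0.5, 0.5], [0.25, 0.75]],      // N = 2 tower feature rows
    end: [1, 1],                               // <|vision_end|>
    suffix: [[3, 4]]                           // trailing special token(s)
)
print(inputsEmbeds.count)  // 6

// Stand-in for the backbone: pool the inputs directly to show the last-token rule.
let pooled = poolLastTokenL2(inputsEmbeds)
print(pooled)
```

The sequence length is always `prefix + 1 + N + 1 + suffix`, and the pooled position is the final suffix token regardless of how many feature rows the tower produced.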

Sources: [Sources/OmniKit/OmniEngine.swift:93-100,124-132](Sources/OmniKit/OmniEngine.swift), [Sources/OmniKit/OmniTextEncoder.swift:18-34](Sources/OmniKit/OmniTextEncoder.swift), [Sources/OmniKit/OmniImageEncoder.swift:122-136](Sources/OmniKit/OmniImageEncoder.swift), [Sources/OmniKit/OmniAudioEncoder.swift:38-58](Sources/OmniKit/OmniAudioEncoder.swift)

### The towers, briefly

- **Text** — `OmniTextEncoder` runs `prefix -> Qwen2 BPE tokenize -> Qwen3 backbone -> last-token pool -> L2`, verified at cosine 1.00000 against the Python reference.
- **Image / video** — `OmniImageEncoder` runs the Qwen3-VL ViT plus a merger to produce per-image features, then injects them into the backbone. Video reuses the exact same path with `grid_t > 1` temporal features.
- **Audio** — `OmniAudioEncoder` runs the Qwen2.5-Omni audio encoder (a Whisper-style mel + conv stem with sinusoidal positions) plus a fused `audio_projector` (Linear 1280 -> 1024), then injects.

Each of the three encoders constructs its **own** `Qwen3Backbone` over the same shared `WeightStore`. Because the backbone object is a thin, stateless wrapper around the weight dictionary, the language weights themselves are not duplicated.

Sources: [Sources/OmniKit/OmniTextEncoder.swift:6-9](Sources/OmniKit/OmniTextEncoder.swift), [Sources/OmniKit/OmniImageEncoder.swift:5-40](Sources/OmniKit/OmniImageEncoder.swift), [Sources/OmniKit/OmniAudioTower.swift:6-20](Sources/OmniKit/OmniAudioTower.swift)

### Inside the backbone

`Qwen3Backbone` is 28 layers of grouped-query attention with RoPE (theta 3.5M) and last-token pooling. Two model-shape switches matter, both read from config rather than hardcoded:

- **Per-head q/k RMSNorm** is a Qwen3 feature (Small). Nano is Qwen2-style and omits those weights, so the norm is applied only when the weight is present.
- **Attention mask**: Small is causal (`isCausal`); Nano is bidirectional and, when batched, uses a padding-only additive mask (`-1e9` on pad columns) matching the reference `_bidi_mask`.
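A padding-only additive mask of the kind described for batched Nano can be sketched as follows. This assumes right-padded sequences and returns one row per batch item (the real mask is presumably broadcast over heads and query positions); the function name is hypothetical.

```swift
/// Additive mask for a bidirectional batch: 0 on real columns, -1e9 on pad columns.
/// Assumes right padding: item i has `lengths[i]` real tokens followed by pads.
func paddingMask(lengths: [Int], maxLength: Int) -> [[Float]] {
    lengths.map { length in
        (0..<maxLength).map { column -> Float in
            column < length ? 0 : -1e9
        }
    }
}

// Two sequences of length 3 and 1, padded to 3 columns.
let mask = paddingMask(lengths: [3, 1], maxLength: 3)
print(mask)
```

Adding this mask to the attention scores before softmax drives the weight on pad columns to effectively zero while leaving every real token visible to every other, which is what distinguishes it from a causal mask.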

Compute precision is bf16 by default (faster, half the memory), with RMSNorm variance and the pooled output kept in fp32; `OMNI_BF16_COMPUTE=0` forces the exact fp32 path used by parity fixtures. Two opt-in performance levers exist: `OMNI_ASYNC_EVAL` overlaps one batch's GPU forward with the previous batch's host readout, and `OMNI_COMPILE_BLOCK` fuses each transformer layer into one compiled kernel. Both are documented as bit-identical to the eager path, or within cosine 0.99995 of it.

Sources: [Sources/OmniKit/Qwen3Backbone.swift:6-42,99-112,237-264](Sources/OmniKit/Qwen3Backbone.swift), [Sources/OmniKit/OmniTextEncoder.swift:101-149](Sources/OmniKit/OmniTextEncoder.swift)

## The priority gate: keeping search responsive during indexing

MLX evaluation is not safe to run concurrently from multiple threads, so every embed must be serialized. The naive serialization — one global lock — would make an interactive search wait behind the entire indexing queue. The gate solves this with a small `NSCondition` state machine that lets a **high-priority** query jump ahead of pending **low-priority** indexing work.

```swift
// Sources/OmniKit/OmniEngine.swift
private func run<T>(highPriority: Bool, _ work: () -> T) -> T {
    cond.lock()
    if highPriority { highWaiting += 1 }
    while busy || (!highPriority && highWaiting > 0) { cond.wait() }
    busy = true
    if highPriority { highWaiting -= 1 }
    cond.unlock()
    let result = work()
    cond.lock(); busy = false; cond.broadcast(); cond.unlock()
    return result
}
```

The invariant is one line of logic: a low-priority call blocks whenever `highWaiting > 0`, so the instant a query registers itself it leapfrogs every waiting indexing embed. A query still cannot preempt the *one* embed already running (MLX has no mid-eval cancellation), so a search waits behind **at most one in-flight indexing embed**: the delay is bounded by a single embed, not by the length of the indexing queue.
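To observe that ordering outside the engine, the gate can be copied into a standalone class and driven from three threads. This is a demo harness under stated assumptions, not code from the repository: `PriorityGate`, `OrderLog`, `pendingHigh`, and `demoGateOrder` are hypothetical names, and only the body of `run` mirrors the excerpt above.

```swift
import Foundation

final class PriorityGate: @unchecked Sendable {
    private let cond = NSCondition()
    private var busy = false
    private var highWaiting = 0

    /// Demo-only accessor: high-priority callers currently parked at the gate.
    var pendingHigh: Int {
        cond.lock(); defer { cond.unlock() }
        return highWaiting
    }

    func run<T>(highPriority: Bool, _ work: () -> T) -> T {
        cond.lock()
        if highPriority { highWaiting += 1 }
        while busy || (!highPriority && highWaiting > 0) { cond.wait() }
        busy = true
        if highPriority { highWaiting -= 1 }
        cond.unlock()
        let result = work()
        cond.lock(); busy = false; cond.broadcast(); cond.unlock()
        return result
    }
}

/// Thread-safe record of the order in which jobs start running.
final class OrderLog: @unchecked Sendable {
    private let lock = NSLock()
    private var names: [String] = []
    func append(_ name: String) { lock.lock(); names.append(name); lock.unlock() }
    var snapshot: [String] { lock.lock(); defer { lock.unlock() }; return names }
}

func demoGateOrder() -> [String] {
    let gate = PriorityGate()
    let log = OrderLog()
    let firstStarted = DispatchSemaphore(value: 0)
    let releaseFirst = DispatchSemaphore(value: 0)
    let done = DispatchGroup()

    // low-1 takes the gate and holds it until the main thread releases it.
    done.enter()
    Thread.detachNewThread {
        gate.run(highPriority: false) {
            log.append("low-1")
            firstStarted.signal()
            releaseFirst.wait()
        }
        done.leave()
    }
    firstStarted.wait()

    // A second indexing embed and a query both arrive while low-1 is running.
    done.enter()
    Thread.detachNewThread {
        gate.run(highPriority: false) { log.append("low-2") }
        done.leave()
    }
    done.enter()
    Thread.detachNewThread {
        gate.run(highPriority: true) { log.append("high") }
        done.leave()
    }

    // Release low-1 only after the query has registered itself at the gate.
    while gate.pendingHigh == 0 { Thread.sleep(forTimeInterval: 0.001) }
    releaseFirst.signal()
    done.wait()
    return log.snapshot
}

print(demoGateOrder())  // ["low-1", "high", "low-2"]
```

The order holds whether `low-2` reaches the gate before or after the query: once `highWaiting` is non-zero, a low-priority caller cannot pass until the query has taken and released the gate.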

```mermaid
sequenceDiagram
    participant Idx as Indexer (low)
    participant Gate as run() / NSCondition
    participant Q as Search (high)
    Idx->>Gate: embedText (low) -> busy=true, runs
    Idx->>Gate: embedText (low) -> waits (busy)
    Q->>Gate: embedQuery (high) -> highWaiting=1, waits (busy)
    Note over Gate: in-flight embed finishes -> broadcast
    Gate-->>Q: highWaiting>0 wins -> runs first
    Gate-->>Idx: low embed resumes only after query clears
```

Which calls are high vs low is decided by intent, not modality:

| Call | Priority | Notes |
| --- | --- | --- |
| `embedQuery`, `embedImageQuery`, `embedVideoQuery`, `embedAudioQuery`, `embedFileQuery` | high | interactive search; excluded from the tok/s counter |
| `embedText`/`embedTextBatch` with `.query` | high | `highPriority: type == .query` |
| `embedText`/`embedTextBatch` with `.passage` | low | indexing |
| `embedImage(s)`, `embedVideoFrames`, `embedAudio*` | low | indexing |

Indexing calls also feed a separate, lock-guarded `tokensProcessed` counter (queries are deliberately excluded) that the UI samples for live tok/s. The `embedFileQuery` entry point closes the loop on cross-modal search: it detects a dropped file's modality, reuses the *indexing-path* decoders at high priority, and picks `queryPrefix` vs `docPrefix` based on whether the search is asymmetric ("search by this file") or symmetric ("find similar").
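The priority rule and the query-excluding counter can be sketched together. Only `tokensProcessed` and the expression `type == .query` come from the text above; `EmbedType`, `ThroughputCounter`, and its methods are hypothetical names for illustration.

```swift
import Foundation

enum EmbedType { case query, passage }

/// Lock-guarded token counter that ignores high-priority (query) work.
final class ThroughputCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var tokensProcessed = 0

    func record(tokens: Int, type: EmbedType) {
        let highPriority = type == .query
        guard !highPriority else { return }  // queries are excluded from tok/s
        lock.lock(); tokensProcessed += tokens; lock.unlock()
    }

    /// Value the UI would sample periodically; tok/s is the delta over the interval.
    func sample() -> Int {
        lock.lock(); defer { lock.unlock() }
        return tokensProcessed
    }
}

let counter = ThroughputCounter()
counter.record(tokens: 100, type: .passage)  // indexing: counted
counter.record(tokens: 7, type: .query)      // interactive search: ignored
print(counter.sample())  // 100
```

Keeping queries out of the counter means a burst of searches does not inflate the indexing throughput shown in the UI.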

Sources: [Sources/OmniKit/OmniEngine.swift:118-201,273-317](Sources/OmniKit/OmniEngine.swift)

## Memory and batching discipline

Because this runs in-process on unified memory alongside the OS and the app UI, the engine bounds peak memory deliberately. `omniSetMemoryLimit` caps MLX memory globally (cache set to half the limit). The image encoder caps each block-diagonal vision-tower forward to a `patchBudget` (default 8192 packed patches, `OMNI_IMAGE_PATCH_BUDGET`), splitting larger image sets into successive bounded forwards, and evaluates the packed tower features before the backbone allocates, so the tower's large activations are freed first. The audio batch path bounds attention to per-chunk windows rather than an `O(L_total^2)` packed matrix. Crucially, batched media keeps the **backbone** pass at `B=1` per image for numerical stability (a batched bidirectional Nano forward over packed vision features was measured to be unstable), so every batched vector is bit-identical to its single-item counterpart.
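One plausible way to split an image set under a patch budget is a greedy, order-preserving grouping. This is a sketch, not the logic in `OmniImageEncoder.swift`: the function name is hypothetical, and it assumes an image that alone exceeds the budget still gets a forward of its own.

```swift
/// Group image indices so each group's total patch count stays within `budget`.
/// Assumption: a single oversized image forms its own group rather than failing.
func splitByPatchBudget(_ patchCounts: [Int], budget: Int = 8192) -> [[Int]] {
    var groups: [[Int]] = []
    var current: [Int] = []
    var used = 0
    for (index, count) in patchCounts.enumerated() {
        // Close the current group when adding this image would exceed the budget.
        if !current.isEmpty && used + count > budget {
            groups.append(current)
            current = []
            used = 0
        }
        current.append(index)
        used += count
    }
    if !current.isEmpty { groups.append(current) }
    return groups
}

// Five images with made-up patch counts; each inner array is one tower forward.
let forwards = splitByPatchBudget([4000, 4000, 1000, 9000, 100])
print(forwards)  // [[0, 1], [2], [3], [4]]
```

Each group would then be one block-diagonal vision-tower forward, with its features evaluated before the per-image backbone passes begin.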

Sources: [Sources/OmniKit/OmniEngine.swift:103-109,235-245](Sources/OmniKit/OmniEngine.swift), [Sources/OmniKit/OmniImageEncoder.swift:20-119](Sources/OmniKit/OmniImageEncoder.swift), [Sources/OmniKit/OmniAudioEncoder.swift:60-132](Sources/OmniKit/OmniAudioEncoder.swift)

## Summary

`OmniEngine` is a single facade over a single weight store, and three design choices give it its character. **The LoRA is baked at load time** — `WeightStore` merges the retrieval adapter into the backbone in fp32 and pays that precision cost only on the linears the adapter touches, leaving towers lazily mapped. **One backbone, one space** — text, vision, and audio towers all inject into the same `Qwen3Backbone` and pool at the same trailing token via a shared prefix and media suffix, which is exactly what makes text-to-PDF and file-to-file search possible. And **the priority gate** turns mandatory MLX serialization from a latency liability into a feature: an interactive query waits at most one in-flight embed, never the indexing backlog. To extend the engine, the load path is `WeightStore.init`, the compute core is `Qwen3Backbone.forward`/`pool`, and the concurrency contract lives entirely in `OmniEngine.run(highPriority:)`.
