# Local Embedding Server: OpenAI / Cohere / Gemini-Compatible APIs

> A subsystem the README never mentions: an in-app HTTP server that exposes the engine as drop-in embedding APIs. Covers the Router and auth gate, the single ServingBackend seam onto OmniEngine + VectorStore, the per-provider SchemaAdapters (/v1/embeddings, /v1/embed, /v2/embed, Gemini :embedContent, /v1/search), and the controller/tab/log that manage it.

- Repository: hanxiao/omni-macos
- GitHub: https://github.com/hanxiao/omni-macos
- Human wiki: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05
- Complete Markdown: https://grok-wiki.com/public/wiki/hanxiao-omni-macos-7817a5cffe05/llms-full.txt

## Source Files

- `App/Serving/Router.swift`
- `App/Serving/ServingBackend.swift`
- `App/Serving/SchemaAdapters.swift`
- `App/Serving/HTTPServer.swift`
- `App/Serving/ServingController.swift`
- `App/Serving/ServingTab.swift`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:
- [App/Serving/Router.swift](App/Serving/Router.swift)
- [App/Serving/ServingBackend.swift](App/Serving/ServingBackend.swift)
- [App/Serving/SchemaAdapters.swift](App/Serving/SchemaAdapters.swift)
- [App/Serving/HTTPServer.swift](App/Serving/HTTPServer.swift)
- [App/Serving/HTTPMessage.swift](App/Serving/HTTPMessage.swift)
- [App/Serving/ServingController.swift](App/Serving/ServingController.swift)
- [App/Serving/ServingTab.swift](App/Serving/ServingTab.swift)
- [App/Serving/ServingLog.swift](App/Serving/ServingLog.swift)
- [App/AppModel.swift](App/AppModel.swift)
- [Sources/OmniKit/VectorStore.swift](Sources/OmniKit/VectorStore.swift)
</details>

Omni ships an HTTP server inside the app. The README pitches the project as "no server, no cloud", a line that refers to the *embedding runtime* never being a Python service. A first-time reader therefore will not learn from the README that the app can also act as a drop-in embeddings endpoint for OpenAI, Jina, Cohere, and Gemini SDKs, with a native search route alongside. Everything under `App/Serving/` implements that subsystem, and `AppModel` wires it in through a single `attach()` call.

This page is the orientation map for that directory. It covers the layering (transport → routing → adapters → backend seam → engine), the auth/scope security model, the per-provider request and response shapes, and the SwiftUI controller and tab that start, persist, and observe the server. The design goal worth keeping in mind while reading: the serving layer adds **no new locks and no new vector math**. It is a thin, stateless translation membrane in front of the existing `OmniEngine` and `VectorStore`.

Sources: [README.md:19](README.md#L19), [App/AppModel.swift:351-355](App/AppModel.swift#L351-L355), [App/AppModel.swift:1013-1017](App/AppModel.swift#L1013-L1017)

## Where to start reading

Read the files in dependency order, bottom-up:

| Layer | File | Responsibility |
| --- | --- | --- |
| Transport | `HTTPServer.swift` | Network.framework listener, connection loop, keep-alive, body cap. Knows nothing about embeddings. |
| Wire types | `HTTPMessage.swift` | `HTTPRequest`/`HTTPResponse` value types + a hand-rolled HTTP/1.1 parser (`HTTPParse`). |
| Routing | `Router.swift` | Method+path → adapter dispatch, plus the auth gate and per-provider 401/404 envelopes. |
| Translation | `SchemaAdapters.swift` | Stateless per-provider enums that parse JSON and emit provider-shaped JSON. |
| Engine seam | `ServingBackend.swift` | The only protocol that touches `OmniKit`; `EngineServingBackend` wraps engine + store. |
| Control | `ServingController.swift` | `@MainActor @Observable` lifecycle, persistence, auth snapshot, log coalescing. |
| UI | `ServingTab.swift`, `ServingLog.swift` | Settings tab and the live request log row/model. |

The cleanest mental model: `HTTPServer` and `HTTPMessage` are pure transport; `Router` + `SchemaAdapters` are pure translation; `ServingBackend` is the *one* seam onto the engine; `ServingController` + `ServingTab` are the main-actor control plane.

## System architecture

The diagram below shows the ownership boundaries. Note the actor boundary: the controller and UI live on the `@MainActor`, while every request is serviced off it — on the server's own `DispatchQueue` and in a detached `Task` — with only `LogEntry` marshalled back.

```mermaid
flowchart TB
    subgraph MainActor["@MainActor control plane"]
        Tab["ServingTab (SwiftUI)"]
        Ctrl["ServingController\n(@Observable, persists omni.serving.*)"]
        AppModel["AppModel.attach(engine, store, modelName)"]
    end

    subgraph OffActor["Off-actor request path (DispatchQueue + detached Task)"]
        Srv["HTTPServer\n(NWListener, keep-alive, 8MB cap)"]
        Parse["HTTPParse / HTTPRequest / HTTPResponse"]
        Router["Router\n(auth gate + dispatch)"]
        subgraph Adapters["SchemaAdapters (stateless enums)"]
            OAI["OpenAIJinaAdapter"]
            Coh["CohereAdapter (v1/v2)"]
            Gem["GeminiAdapter"]
            Search["SearchAdapter"]
            Health["HealthAdapter"]
        end
        Backend["EngineServingBackend\n(ServingBackend seam, @unchecked Sendable)"]
    end

    subgraph Engine["OmniKit (thread-safe)"]
        OmniEngine["OmniEngine\nembedTextBatch / embedQuery"]
        Store["VectorStore\nsearch()"]
    end

    AppModel -->|attach| Ctrl
    Tab -->|binds enabled/scope/port/token| Ctrl
    Ctrl -->|builds Router + auth closure, starts| Srv
    Srv --> Parse --> Router
    Router --> OAI & Coh & Gem & Search & Health
    OAI & Coh & Gem & Search --> Backend
    Backend --> OmniEngine
    Backend --> Store
    Srv -. "LogEntry (coalesced)" .-> Ctrl
```

Sources: [App/Serving/HTTPServer.swift:11-35](App/Serving/HTTPServer.swift#L11-L35), [App/Serving/Router.swift:5-54](App/Serving/Router.swift#L5-L54), [App/Serving/ServingController.swift:95-141](App/Serving/ServingController.swift#L95-L141), [App/Serving/ServingBackend.swift:6-52](App/Serving/ServingBackend.swift#L6-L52)

## The transport: HTTPServer + HTTPParse

`HTTPServer` is a hand-rolled HTTP/1.1 server on `Network.framework`. It always binds an `NWListener` on all interfaces, which is the reliable path because `requiredLocalEndpoint` with a port throws `EINVAL`. For `local` scope it therefore enforces loopback **per connection**, cancelling any peer whose endpoint is not `127.0.0.1`/`::1`/`localhost`. Bind failures such as a busy port arrive asynchronously through `onFailure`, which the controller maps to a `portInUse`/`failed` state.

Connection callbacks all fire on a single serial `DispatchQueue` (`omni.serving.http`), so there is no actor hop on the hot path. Per request, the server hands the parsed `HTTPRequest` to a `Task` created from the queue callback (so it runs off the main actor), awaits the async handler there, then hops back to the queue to write the response and continue the keep-alive loop. The parser (`HTTPParse.tryParse`) handles pipelining by draining residual bytes already buffered, enforces an 8 MB body cap (`413`), and explicitly rejects chunked transfer encoding; only `Content-Length` framing is supported.

```swift
// App/Serving/HTTPServer.swift:176-188 — service off the main actor, then log
Task { [weak self] in
    guard let self else { return }
    let resp = await self.handler(req)
    let ms = Double(DispatchTime.now().uptimeNanoseconds - started.uptimeNanoseconds) / 1_000_000.0
    let entry = LogEntry(time: Date(), method: req.method, path: req.routePath,
                         status: resp.status, ms: ms, client: client)
    self.onLog(entry)
    // ... hop back to queue to write + continue ...
}
```
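
To make the framing rules above concrete, here is a minimal sketch of a `Content-Length`-only parser with a body cap. It is not the app's `HTTPParse`: every name in it (`ParseOutcomeSketch`, `tryParseSketch`, `maxBodyBytesSketch`) is hypothetical, and the `501` returned for chunked bodies is an assumption, since the source only says chunked encoding is rejected.

```swift
import Foundation

// Hypothetical names; a sketch of Content-Length framing, not the app's HTTPParse.
enum ParseOutcomeSketch: Equatable {
    case needMoreBytes
    case reject(status: Int)
    case request(method: String, path: String, body: [UInt8], consumed: Int)
}

let maxBodyBytesSketch = 8 * 1024 * 1024  // 8 MB cap, as described above

func tryParseSketch(_ buffer: [UInt8]) -> ParseOutcomeSketch {
    // The header block ends at the first CRLF CRLF.
    let terminator: [UInt8] = [13, 10, 13, 10]
    guard buffer.count >= terminator.count,
          let headerEnd = (0...(buffer.count - terminator.count)).first(where: {
              buffer[$0..<($0 + terminator.count)].elementsEqual(terminator)
          })
    else { return .needMoreBytes }

    // Split header bytes on LF and strip the trailing CR of each line.
    let lines = buffer[..<headerEnd]
        .split(separator: 10, omittingEmptySubsequences: false)
        .map { (line: ArraySlice<UInt8>) -> String in
            let trimmed = line.last == 13 ? line.dropLast() : line
            return String(decoding: trimmed, as: UTF8.self)
        }

    let requestLine = lines[0].split(separator: " ")
    guard requestLine.count == 3 else { return .reject(status: 400) }

    var headers: [String: String] = [:]
    for line in lines.dropFirst() {
        guard let colon = line.firstIndex(of: ":") else { return .reject(status: 400) }
        let name = line[..<colon].lowercased()
        let value = line[line.index(after: colon)...].trimmingCharacters(in: .whitespaces)
        headers[name] = value
    }

    // Assumption: 501 for chunked bodies; the source only says they are rejected.
    if let encoding = headers["transfer-encoding"], encoding.lowercased().contains("chunked") {
        return .reject(status: 501)
    }

    var length = 0
    if let raw = headers["content-length"] {
        guard let parsed = Int(raw), parsed >= 0 else { return .reject(status: 400) }
        length = parsed
    }
    // Reject oversized bodies before waiting for them to arrive.
    guard length <= maxBodyBytesSketch else { return .reject(status: 413) }

    let bodyStart = headerEnd + terminator.count
    guard buffer.count >= bodyStart + length else { return .needMoreBytes }
    return .request(method: String(requestLine[0]), path: String(requestLine[1]),
                    body: Array(buffer[bodyStart..<(bodyStart + length)]),
                    consumed: bodyStart + length)
}
```

A caller handles pipelining by dropping `consumed` bytes from the front of its buffer and calling the function again on whatever remains.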

`HTTPResponse.json(_:status:)` is the single response constructor used everywhere; `serialize(keepAlive:)` adds the framing headers (`Content-Length`, `Connection`, `Date`) and looks up the reason phrase in a small table.

Sources: [App/Serving/HTTPServer.swift:39-112](App/Serving/HTTPServer.swift#L39-L112), [App/Serving/HTTPServer.swift:116-209](App/Serving/HTTPServer.swift#L116-L209), [App/Serving/HTTPMessage.swift:45-105](App/Serving/HTTPMessage.swift#L45-L105), [App/Serving/HTTPMessage.swift:132-199](App/Serving/HTTPMessage.swift#L132-L199)

## The Router and the auth gate

`Router` is a `Sendable` value type so it can be captured by the `@Sendable` handler closure. It carries the backend and a pre-snapshotted `auth` closure (no main-actor state). `handle()` does three things in order:

1. **Liveness first, pre-auth.** `GET /health` and `GET /v1/models` are always open.
2. **Auth gate.** If `auth(req)` fails, it returns a 401 whose JSON envelope is shaped by path prefix — Gemini gets `{error:{status:"UNAUTHENTICATED"}}`, Cohere gets `{message:"invalid api token"}`, everything else gets the OpenAI `{error:{code:"invalid_api_key"}}` shape.
3. **Dispatch.** A `(method, route)` switch covers the fixed paths; Gemini is matched separately because its action lives after a `:` in `/v1beta/models/{model}:embedContent`.

```swift
// App/Serving/Router.swift:24-51 — dispatch table
switch (req.method, route) {
case ("POST", "/v1/embeddings"): return OpenAIJinaAdapter.handle(req, backend)
case ("POST", "/v1/embed"):      return CohereAdapter.handle(req, backend, v2: false)
case ("POST", "/v2/embed"):      return CohereAdapter.handle(req, backend, v2: true)
case ("POST", "/v1/search"):     return SearchAdapter.handle(req, backend)
default: break
}
// Gemini: /v1beta/models/{model}:embedContent | :batchEmbedContents
```
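
The two pieces of the router that the excerpt elides, the provider-shaped 401 body and the Gemini `:action` match, can be sketched as below. All function names are hypothetical, and the exact path prefixes used to pick an envelope are an assumption; the source states only that the envelope is chosen by path prefix and gives the three shapes.

```swift
import Foundation

// Hypothetical helper types; not the app's Router.
enum ProviderSketch { case gemini, cohere, openAI }

// Assumption: Gemini lives under /v1beta/, Cohere on the two embed paths.
func providerSketch(forPath path: String) -> ProviderSketch {
    if path.hasPrefix("/v1beta/") { return .gemini }
    if path == "/v1/embed" || path == "/v2/embed" { return .cohere }
    return .openAI
}

// Builds the JSON body of a 401 in the shape each provider's SDK expects.
func unauthorizedBodySketch(forPath path: String) -> [String: Any] {
    switch providerSketch(forPath: path) {
    case .gemini: return ["error": ["status": "UNAUTHENTICATED"]]
    case .cohere: return ["message": "invalid api token"]
    case .openAI: return ["error": ["code": "invalid_api_key"]]
    }
}

// Splits /v1beta/models/{model}:{action} at the last ":".
func geminiRouteSketch(_ path: String) -> (model: String, action: String)? {
    let prefix = "/v1beta/models/"
    guard path.hasPrefix(prefix) else { return nil }
    let rest = path.dropFirst(prefix.count)
    guard let colon = rest.lastIndex(of: ":") else { return nil }
    let model = String(rest[..<colon])
    let action = String(rest[rest.index(after: colon)...])
    guard !model.isEmpty,
          action == "embedContent" || action == "batchEmbedContents"
    else { return nil }
    return (model, action)
}
```

Matching Gemini outside the fixed-path switch keeps the switch a plain `(method, path)` table, since the model name in the Gemini path is variable.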

The auth closure itself is built in `ServingController.startServer()` as a snapshot, so a token rotation requires a restart. It accepts the token via three transports to match each provider's SDK convention: `Authorization: Bearer`, the Gemini `x-goog-api-key` header, or a `?key=` query parameter.

```swift
// App/Serving/ServingController.swift:97-109 — token only when public AND set
let requireToken = isPublic && !token.isEmpty
let auth: @Sendable (HTTPRequest) -> Bool = { req in
    guard requireToken else { return true }
    if req.bearer == token { return true }
    if req.googApiKey == token { return true }
    if req.query["key"] == token { return true }
    return false
}
```
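
The closure above relies on three request accessors (`bearer`, `googApiKey`, `query`). A minimal, self-contained sketch of how such accessors could be derived is shown here, assuming header names are stored lowercased and the request target still carries its query string. `RequestSketch` and `makeAuthSketch` are hypothetical stand-ins, not the app's `HTTPRequest` or controller code.

```swift
import Foundation

// Hypothetical stand-in for the app's HTTPRequest.
struct RequestSketch {
    var headers: [String: String]  // assumption: keys are lowercased
    var target: String             // e.g. "/v1/embeddings?key=abc"

    // Token from "Authorization: Bearer <token>".
    var bearer: String? {
        let prefix = "Bearer "
        guard let value = headers["authorization"], value.hasPrefix(prefix) else { return nil }
        return String(value.dropFirst(prefix.count))
    }

    // Token from the Gemini-style header.
    var googApiKey: String? { headers["x-goog-api-key"] }

    // Decoded query parameters, e.g. ?key=<token>.
    var query: [String: String] {
        guard let raw = target.split(separator: "?", maxSplits: 1).dropFirst().first else {
            return [:]
        }
        var out: [String: String] = [:]
        for pair in raw.split(separator: "&") {
            let kv = pair.split(separator: "=", maxSplits: 1, omittingEmptySubsequences: false)
            let key = String(kv[0]).removingPercentEncoding ?? String(kv[0])
            let rawValue = kv.count > 1 ? String(kv[1]) : ""
            out[key] = rawValue.removingPercentEncoding ?? rawValue
        }
        return out
    }
}

// Snapshot the settings once; the returned closure holds no mutable state.
func makeAuthSketch(isPublic: Bool, token: String) -> @Sendable (RequestSketch) -> Bool {
    let requireToken = isPublic && !token.isEmpty
    return { req in
        guard requireToken else { return true }
        return req.bearer == token || req.googApiKey == token || req.query["key"] == token
    }
}
```

Because the closure captures plain values, changing the token later has no effect on a closure that was already built, which is why rotation takes effect only on the next start.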

Sources: [App/Serving/Router.swift:11-79](App/Serving/Router.swift#L11-L79), [App/Serving/ServingController.swift:95-109](App/Serving/ServingController.swift#L95-L109), [App/Serving/HTTPMessage.swift:30-43](App/Serving/HTTPMessage.swift#L30-L43)

## The ServingBackend seam

`ServingBackend` is the only place the serving layer touches `OmniKit`. It is deliberately tiny — `dim`, `modelName`, `embedBatch(_:query:)`, `search(_:topK:filter:)` — so adapters never reach into the engine directly. `EngineServingBackend` is the production conformer; its `@unchecked Sendable` is justified in-source because it adds no mutable state and every member it calls is documented thread-safe (the engine's `NSCondition` run-gate, the store's serial queue), letting it be called straight from the connection's detached `Task`.

Two behaviors matter for callers:

- **Priority routing.** `query == true` routes through the engine's high-priority query path (`OmniInputType.query`); otherwise the low-priority passage/indexing path. Each adapter decides this from its provider's own field (see table below).
- **Batch splitting.** Large client batches are chunked into groups of `groupCap = 48` to match the indexer's forward-pass width, so serving never exceeds the engine's batch expectations. Output order matches input order.

```swift
// App/Serving/ServingBackend.swift:33-46 — split, embed, preserve order
let type: OmniInputType = query ? .query : .passage
while i < texts.count {
    let end = min(i + groupCap, texts.count)
    out.append(contentsOf: engine.embedTextBatch(Array(texts[i..<end]), as: type))
    i = end
}
```
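
The excerpt omits its surrounding declarations, so here is a self-contained sketch of the same split-and-preserve-order loop. The `embed` closure stands in for the engine call and `embedInGroupsSketch` is a hypothetical name; only the cap of 48 and the order guarantee come from the text above.

```swift
// Splits `texts` into groups of at most `groupCap`, embeds each group,
// and concatenates the results so output order matches input order.
func embedInGroupsSketch(
    _ texts: [String],
    groupCap: Int = 48,
    embed: ([String]) -> [[Float]]  // hypothetical stand-in for the engine call
) -> [[Float]] {
    var out: [[Float]] = []
    out.reserveCapacity(texts.count)
    var i = 0
    while i < texts.count {
        let end = min(i + groupCap, texts.count)
        out.append(contentsOf: embed(Array(texts[i..<end])))
        i = end
    }
    return out
}
```

With 100 inputs this makes three engine calls of 48, 48, and 4 texts, and the caller still receives 100 vectors in the original order.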

Sources: [App/Serving/ServingBackend.swift:6-52](App/Serving/ServingBackend.swift#L6-L52)

## The per-provider SchemaAdapters

All adapters are stateless `enum`s. They parse with `JSONSerialization`, call the backend, and emit provider-shaped JSON. A shared invariant is stated at the top of the file: the engine emits fixed **1024-d, L2-normalized** float vectors; adapters **never truncate, requantize, or fabricate** vectors. The one approximation is the `usage`/`billed_units` token counts, which come from an acknowledged whitespace heuristic (`tokenEstimate`).

| Endpoint | Method | Adapter | Provider shape | Query-path signal |
| --- | --- | --- | --- | --- |
| `/v1/embeddings` | POST | `OpenAIJinaAdapter` | OpenAI `{object:"list", data:[{embedding}]}` + Jina | `task == "query"` or `task` ends with `.query` |
| `/v1/embed` | POST | `CohereAdapter(v2:false)` | Cohere v1; bare list unless `embedding_types` given | `input_type == "search_query"` |
| `/v2/embed` | POST | `CohereAdapter(v2:true)` | Cohere v2; always `{float:[...]}`; `input_type` required | `input_type == "search_query"` |
| `/v1beta/models/{m}:embedContent` | POST | `GeminiAdapter(batch:false)` | `{embedding:{values}}` | `taskType` ∈ {`RETRIEVAL_QUERY`,`QUESTION_ANSWERING`,`CODE_RETRIEVAL_QUERY`} |
| `/v1beta/models/{m}:batchEmbedContents` | POST | `GeminiAdapter(batch:true)` | `{embeddings:[{values}]}` | any request's `taskType` is a query type |
| `/v1/search` | POST | `SearchAdapter` | `{query, results:[{path,score,snippet,kind,modified}]}` | n/a (always high-priority query embed) |
| `/health`, `/v1/models` | GET | `HealthAdapter` | status / OpenAI model list | open, pre-auth |
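
The "Query-path signal" column can be read as three small predicates. The sketch below restates the table as code; the function names are hypothetical and the adapters' real parsing is not shown.

```swift
// OpenAI/Jina: `task` is "query" or ends with ".query" (e.g. "retrieval.query").
func isQueryOpenAIJinaSketch(task: String?) -> Bool {
    guard let task = task else { return false }
    return task == "query" || task.hasSuffix(".query")
}

// Cohere v1/v2: only "search_query" selects the query path.
func isQueryCohereSketch(inputType: String?) -> Bool {
    inputType == "search_query"
}

// Gemini: the task types listed in the table above.
let geminiQueryTaskTypesSketch: Set<String> = [
    "RETRIEVAL_QUERY", "QUESTION_ANSWERING", "CODE_RETRIEVAL_QUERY",
]

// Batch form: the query path is chosen if any request carries a query task type.
func isQueryGeminiSketch(taskTypes: [String?]) -> Bool {
    taskTypes.contains { type in
        guard let type = type else { return false }
        return geminiQueryTaskTypesSketch.contains(type)
    }
}
```

Anything that does not match falls through to the low-priority passage path, so a client that omits the field is treated as indexing traffic.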

Provider-specific quirks worth knowing before you debug a 400:

- **OpenAI/Jina** accept `input` as a string, `[String]`, or `[{text:...}]`, and emit base64 little-endian Float32 when either OpenAI's `encoding_format` or Jina's `embedding_type` requests `base64`.
- **Cohere** only produces `float`; any other `embedding_types` value is a 400 (`unsupported embedding_type`). v2 requires `input_type`. Texts come from `texts` or the v4 multimodal `inputs[].text`.
- **Gemini** accepts `outputDimensionality` only if it equals the engine's `dim` (the engine never truncates); otherwise `INVALID_ARGUMENT`. Text is the joined `content.parts[].text`.
- **Search** clamps `top_k` to `[1, 200]`, builds a `SearchFilter` from `filters` (`kinds`, `folder`, `ext`, `since`), clamps scores to `[0,1]`, and maps `SearchHit` fields straight through.
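
The numeric rules in the Gemini and Search bullets are simple enough to sketch directly. The helper names are hypothetical; the bounds (`1...200`, `0...1`) and the `INVALID_ARGUMENT` status come from the bullets above, and the default used when `top_k` is absent is deliberately left out because the source does not state it.

```swift
// Search: clamp a requested top_k into 1...200.
func clampTopKSketch(_ requested: Int) -> Int {
    min(max(requested, 1), 200)
}

// Search: clamp a similarity score into 0...1.
func clampScoreSketch(_ score: Float) -> Float {
    min(max(score, 0), 1)
}

// Gemini: outputDimensionality is accepted only when it equals the engine's
// dimension. Returns the error status to report, or nil when the request is fine.
func outputDimensionalityErrorSketch(requested: Int?, engineDim: Int) -> String? {
    guard let requested = requested else { return nil }  // field omitted: OK
    return requested == engineDim ? nil : "INVALID_ARGUMENT"
}
```

Rejecting a mismatched dimension, rather than slicing the vector, keeps the adapter consistent with the rule that vectors are never truncated.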

```swift
// App/Serving/SchemaAdapters.swift:119-136 — Cohere refuses to fake quantized vectors
if let bad = requestedTypes.first(where: { $0 != "float" }) {
    return cohereError("unsupported embedding_type: \(bad)")
}
// v2 always emits the object form; v1 emits bare list unless embedding_types given.
let embeddings: Any = (v2 || !requestedTypes.isEmpty) ? ["float": floatRows] : floatRows
```
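
For the OpenAI/Jina `base64` option mentioned in the list above, the encoding is each `Float32` written as four little-endian bytes, concatenated, then base64-encoded. A minimal sketch with hypothetical function names, including the inverse a client would apply:

```swift
import Foundation

// Encodes a vector as base64 over little-endian IEEE 754 Float32 bytes.
func base64LittleEndianSketch(_ vector: [Float]) -> String {
    var data = Data(capacity: vector.count * 4)
    for value in vector {
        // Convert to a little-endian bit pattern so the result is host-independent.
        withUnsafeBytes(of: value.bitPattern.littleEndian) { data.append(contentsOf: $0) }
    }
    return data.base64EncodedString()
}

// Inverse: what a client does with the `embedding` string it receives.
func decodeBase64LittleEndianSketch(_ encoded: String) -> [Float]? {
    guard let data = Data(base64Encoded: encoded), data.count % 4 == 0 else { return nil }
    let bytes = [UInt8](data)
    return stride(from: 0, to: bytes.count, by: 4).map { i -> Float in
        let bits = UInt32(bytes[i])
            | UInt32(bytes[i + 1]) << 8
            | UInt32(bytes[i + 2]) << 16
            | UInt32(bytes[i + 3]) << 24
        return Float(bitPattern: bits)
    }
}
```

As a check, `base64LittleEndianSketch([1.0])` yields `AACAPw==`, the bytes `00 00 80 3F`.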

Sources: [App/Serving/SchemaAdapters.swift:1-104](App/Serving/SchemaAdapters.swift#L1-L104), [App/Serving/SchemaAdapters.swift:106-234](App/Serving/SchemaAdapters.swift#L106-L234), [App/Serving/SchemaAdapters.swift:236-289](App/Serving/SchemaAdapters.swift#L236-L289), [Sources/OmniKit/VectorStore.swift:36-71](Sources/OmniKit/VectorStore.swift#L36-L71)

## Request lifecycle

The sequence below traces one POST through every boundary, including the off-actor service and the coalesced log hop back to the main actor.

```mermaid
sequenceDiagram
    participant C as Client (SDK/curl)
    participant S as HTTPServer (queue)
    participant P as HTTPParse
    participant R as Router
    participant A as SchemaAdapter
    participant B as EngineServingBackend
    participant E as OmniEngine / VectorStore
    participant Ctrl as ServingController (@MainActor)

    C->>S: TCP + HTTP/1.1 request
    S->>P: tryParse(buffer)
    P-->>S: HTTPRequest (or "need more bytes")
    S->>R: await handler(req)  (detached Task, off main actor)
    alt /health or /v1/models (GET)
        R-->>S: open response, pre-auth
    else authorized
        R->>A: dispatch by (method, path)
        A->>B: embedBatch(query:) / search(...)
        B->>E: embedTextBatch / embedQuery / store.search
        E-->>A: 1024-d L2 vectors / hits
        A-->>R: provider-shaped JSON
    else unauthorized
        R-->>S: 401 (provider-shaped envelope)
    end
    R-->>S: HTTPResponse
    S->>Ctrl: onLog(LogEntry)  (Task @MainActor, coalesced)
    S->>C: serialize + write; keep-alive loop
```

Sources: [App/Serving/HTTPServer.swift:116-209](App/Serving/HTTPServer.swift#L116-L209), [App/Serving/Router.swift:11-54](App/Serving/Router.swift#L11-L54), [App/Serving/ServingController.swift:112-116](App/Serving/ServingController.swift#L112-L116)

## ServingController: lifecycle and state

`ServingController` is the single instance `AppModel` owns. It is `@MainActor @Observable`; the SwiftUI tab binds only to its documented surface. Persisted settings (`enabled`, `scope`, `port`, `bearerToken`) live in `UserDefaults` under `omni.serving.*`. Each `didSet` calls `persist()` plus a reconcile/restart, gated by an `isLoading` flag so that restoring saved values does not write a half-loaded snapshot back over them.
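
The `isLoading` gate is easier to see in isolation. The sketch below is a plain class with an in-memory dictionary standing in for `UserDefaults`; the class name, the two settings shown, and the exact key strings are assumptions for illustration, not the controller's code.

```swift
// Hypothetical, simplified model of didSet-driven persistence with a load gate.
final class ServingSettingsSketch {
    private var isLoading = false
    private(set) var persistCount = 0
    private(set) var store: [String: Any]  // stand-in for UserDefaults

    var enabled = false { didSet { settingChanged() } }
    var port = 0 { didSet { settingChanged() } }

    init(store: [String: Any]) {
        self.store = store
        load()
    }

    // Observers fire for assignments made here, so the gate must be up.
    private func load() {
        isLoading = true
        defer { isLoading = false }
        if let saved = store["omni.serving.enabled"] as? Bool { enabled = saved }
        if let saved = store["omni.serving.port"] as? Int { port = saved }
    }

    private func settingChanged() {
        guard !isLoading else { return }
        persist()
        // The real controller would also reconcile or restart the server here.
    }

    private func persist() {
        store["omni.serving.enabled"] = enabled
        store["omni.serving.port"] = port
        persistCount += 1
    }
}
```

Without the gate, assigning `enabled` during `load()` would persist immediately while `port` still held its default, overwriting the saved port before it was read.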

The engine and store are never owned by the controller; `AppModel` hands them in via `attach(engine:store:modelName:)`, which builds an `EngineServingBackend`, then either restarts a running server against the new backend (model swap) or reconciles (auto-start if previously enabled). Changing `scope` or `port` while running triggers `restart()`; rotating the token applies only on the next start.

```mermaid
stateDiagram-v2
    [*] --> stopped
    stopped --> running: enabled && backend attached (startServer)
    running --> stopped: disabled / detach (stopServer)
    running --> running: scope/port change (restart)
    stopped --> portInUse: bind EADDRINUSE
    stopped --> failed: listener error (onFailure)
    portInUse --> running: retry after enable
    failed --> running: retry after enable
    running --> portInUse: runtime bind failure
    running --> failed: runtime listener failure
```

State transitions are driven by `reconcile()` (start/stop based on `enabled` + backend presence) and by `HTTPServer.onFailure`, which flips the controller to `portInUse`/`failed`, sets `isRunning = false`, and disables serving. The live log is capped at 200 entries, newest-first, with request/error counters incremented in `ingest()` — coalesced to one main-actor invalidation per runloop tick.
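
A capped, newest-first log with counters can be sketched as a small value type. The names below are hypothetical, and treating any status of 400 or above as an error is an assumption; the cap of 200 and the newest-first order come from the paragraph above.

```swift
// Hypothetical stand-ins for LogEntry and the controller's log state.
struct LogEntrySketch {
    let path: String
    let status: Int
}

struct LogBufferSketch {
    let cap = 200
    private(set) var entries: [LogEntrySketch] = []  // newest first
    private(set) var requestCount = 0
    private(set) var errorCount = 0

    init() {}

    // Ingest a coalesced batch given in arrival order (oldest first).
    mutating func ingest(_ batch: [LogEntrySketch]) {
        for entry in batch {
            requestCount += 1
            if entry.status >= 400 { errorCount += 1 }  // assumption: 4xx/5xx count as errors
        }
        entries.insert(contentsOf: batch.reversed(), at: 0)
        if entries.count > cap {
            entries.removeLast(entries.count - cap)
        }
    }
}
```

Taking a whole batch per call mirrors the coalescing described above: many requests can arrive between runloop ticks, but observers see a single mutation.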

Sources: [App/Serving/ServingController.swift:14-93](App/Serving/ServingController.swift#L14-L93), [App/Serving/ServingController.swift:143-185](App/Serving/ServingController.swift#L143-L185), [App/Serving/ServingLog.swift:5-19](App/Serving/ServingLog.swift#L5-L19)

## ServingTab: the control surface

`ServingTab` is the `Settings > Serving` form. It exposes the on/off toggle, a scope picker ("This Mac only" vs "Local network"), a port field clamped to `1...65535`, and a bearer-token field with show/hide, copy, and a "Generate New" button that produces a URL-safe base64 token from 24 random bytes obtained via `SecRandomCopyBytes`. A status dot reflects the controller's `State`, and the bound address is shown and selectable when running.
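
A token of the kind described above can be sketched as follows. To keep the example portable, it draws bytes from Swift's default `SystemRandomNumberGenerator` instead of calling `SecRandomCopyBytes`, which is a swapped-in technique and not what the app does; the function names are hypothetical.

```swift
import Foundation

// URL-safe base64: "+" becomes "-", "/" becomes "_", padding is dropped.
func urlSafeBase64Sketch(_ bytes: [UInt8]) -> String {
    Data(bytes).base64EncodedString()
        .replacingOccurrences(of: "+", with: "-")
        .replacingOccurrences(of: "/", with: "_")
        .replacingOccurrences(of: "=", with: "")
}

// Assumption: system RNG as a stand-in for SecRandomCopyBytes.
func generateTokenSketch(byteCount: Int = 24) -> String {
    let bytes = (0..<byteCount).map { _ in UInt8.random(in: UInt8.min...UInt8.max) }
    return urlSafeBase64Sketch(bytes)
}
```

Because 24 is a multiple of 3, the encoded token is always 32 characters with no padding to strip, which makes it safe to paste into a header or a `?key=` query parameter.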

The tab also renders **ready-to-run `curl` examples** for the selected schema — including the right auth header (`Authorization: Bearer` for most, `x-goog-api-key` for Gemini) — and a live request log (`LogRow`: time, method, path, color-coded status, latency). The examples are the most useful piece of in-app documentation for this otherwise-undocumented subsystem:

```swift
// App/Serving/ServingTab.swift:208-217 — schema-specific example bodies
case .openai: "curl \(base)/v1/embeddings ... -d '{\"model\":\"omni\",\"input\":[\"your text\"]}'"
case .jina:   "... -d '{...\"input\":\"your text\",\"task\":\"retrieval.query\"}'"
case .cohere: "curl \(base)/v2/embed ... -d '{...\"input_type\":\"search_document\",\"embedding_types\":[\"float\"]}'"
case .gemini: "curl \(base)/v1beta/models/omni:embedContent ... -d '{\"content\":{\"parts\":[{\"text\":\"your text\"}]}}'"
```

Sources: [App/Serving/ServingTab.swift:26-149](App/Serving/ServingTab.swift#L26-L149), [App/Serving/ServingTab.swift:193-291](App/Serving/ServingTab.swift#L193-L291)

## Security model and provider neutrality

The security posture is intentionally minimal and local-first: on `local` scope the server is loopback-only (enforced per connection) and **never requires a token**; a token is enforced only on `public` (LAN) scope and only when non-empty. The UI footer makes this explicit ("Local network reaches other devices, so set a token"). There is no TLS, no rate limiting, and the token check is a plain equality compare — appropriate for a single-user desktop tool on a trusted LAN, and worth flagging if anyone proposes exposing it more widely.

A useful framing for a Grok-Wiki or BYOC/BYOK integration: this subsystem is *provider-neutral by construction*. It speaks four vendors' wire formats (OpenAI, Jina, Cohere, Gemini) over the same local engine, so a client SDK that lets you override its base URL can point at `http://127.0.0.1:<port>` without other code changes and without depending on a hosted model service. The neutrality lives entirely in `SchemaAdapters`; adding a fifth provider means one more stateless enum behind the `ServingBackend` seam plus its route in `Router`, not a change to the engine or transport. (This page was synthesized directly from repository source: no `docs/solutions/` notes or `STRATEGY.md` were present to draw on, and there are currently no automated tests under `App/Serving/`, so the curl examples in `ServingTab` are the practical contract.)

Sources: [App/Serving/HTTPServer.swift:43-86](App/Serving/HTTPServer.swift#L43-L86), [App/Serving/HTTPServer.swift:95-100](App/Serving/HTTPServer.swift#L95-L100), [App/Serving/ServingController.swift:97-133](App/Serving/ServingController.swift#L97-L133), [App/Serving/ServingTab.swift:134-140](App/Serving/ServingTab.swift#L134-L140)
