# Voice Engines — Realtime, ConvAI & Pipeline Mode

> The three voice architectures: speech-to-speech engines (OpenAI Realtime, ElevenLabs ConvAI) vs. the STT→LLM→TTS pipeline, how engine adapters implement the shared engine interface, barge-in and VAD semantics, and latency trade-offs for each mode.

- Repository: PatterAI/Patter
- GitHub: https://github.com/PatterAI/Patter
- Human wiki: https://grok-wiki.com/public/wiki/patterai-patter-57d14e233afc
- Complete Markdown: https://grok-wiki.com/public/wiki/patterai-patter-57d14e233afc/llms-full.txt

## Source Files

- `libraries/python/getpatter/engines/openai.py`
- `libraries/python/getpatter/engines/openai_realtime_2.py`
- `libraries/python/getpatter/engines/elevenlabs.py`
- `libraries/python/getpatter/stream_handler.py`
- `libraries/typescript/src/engines/`
- `libraries/typescript/src/stream-handler.ts`
- `libraries/typescript/src/pipeline-hooks.ts`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [libraries/python/getpatter/engines/openai.py](libraries/python/getpatter/engines/openai.py)
- [libraries/python/getpatter/engines/openai_realtime_2.py](libraries/python/getpatter/engines/openai_realtime_2.py)
- [libraries/python/getpatter/engines/elevenlabs.py](libraries/python/getpatter/engines/elevenlabs.py)
- [libraries/python/getpatter/stream_handler.py](libraries/python/getpatter/stream_handler.py)
- [libraries/typescript/src/engines/openai.ts](libraries/typescript/src/engines/openai.ts)
- [libraries/typescript/src/engines/openai-2.ts](libraries/typescript/src/engines/openai-2.ts)
- [libraries/typescript/src/engines/elevenlabs.ts](libraries/typescript/src/engines/elevenlabs.ts)
- [libraries/typescript/src/stream-handler.ts](libraries/typescript/src/stream-handler.ts)
- [libraries/typescript/src/pipeline-hooks.ts](libraries/typescript/src/pipeline-hooks.ts)
</details>

Patter supports three voice processing architectures. The two speech-to-speech engines, **OpenAI Realtime** and **ElevenLabs ConvAI**, are each selected by an engine marker class and hand the full STT→LLM→TTS loop to a single hosted service. **Pipeline mode** needs no marker and composes independent STT, LLM, and TTS providers under local control. The choice of architecture determines latency characteristics, barge-in semantics, the per-turn hook surface, and how freely each component can be swapped.

This page covers how engine marker classes are structured, how the `kind` discriminator drives per-call adapter selection, how each `StreamHandler` subclass handles audio, VAD-based barge-in, and turn completion, and what latency trade-offs each architecture entails.

---

## Engine Marker Classes

Every engine starts as a small, immutable configuration object — a **frozen dataclass** in Python or a **`readonly`-field class** in TypeScript — that carries credentials and tuning knobs. Its only behavioral method is the `kind` property, which serves as the stable discriminator used by the Patter server at call time to instantiate the correct adapter.
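
As a rough illustration of the marker pattern described above, the sketch below shows a frozen dataclass whose only behavior is a `kind` property. The class name `RealtimeMarkerSketch` and the `api_key` field are hypothetical; the `model` and `voice` defaults and the `kind` string are taken from this page, but this is not Patter's actual class.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class RealtimeMarkerSketch:
    """Hypothetical engine marker: configuration only, no call logic."""

    api_key: str = ""                      # assumed field name
    model: str = "gpt-realtime-mini"       # default model listed on this page
    voice: str = "alloy"                   # default voice listed on this page
    reasoning_effort: Optional[str] = None  # None leaves the server default

    @property
    def kind(self) -> str:
        # Stable discriminator read by the server when a call starts.
        return "openai_realtime"


marker = RealtimeMarkerSketch(reasoning_effort="low")
```

Because the dataclass is frozen, assigning to `marker.voice` after construction raises `FrozenInstanceError`, which keeps a marker safe to share across calls.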

### OpenAI Realtime (`openai.Realtime` / `openai_realtime_2.Realtime2`)

Two markers exist for the two generations of the OpenAI Realtime API:

| Marker | `kind` | Default model |
|---|---|---|
| `openai.Realtime` | `"openai_realtime"` | `gpt-realtime-mini` |
| `openai_realtime_2.Realtime2` | `"openai_realtime_2"` | `gpt-realtime-2` |

Both markers expose the same tuneable fields:

- **`voice`** — voice preset (default: `alloy`).
- **`reasoning_effort`** / **`reasoningEffort`** — `"minimal" | "low" | "medium" | "high"`. OpenAI recommends `"low"` for production voice flows; higher tiers add measurable per-turn latency. Omitting the field leaves the server default.
- **`input_audio_transcription_model`** / **`inputAudioTranscriptionModel`** — override the model used for input transcription (e.g. `"gpt-realtime-whisper"` for low-latency partials, `"gpt-4o-transcribe"` for higher accuracy).

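```python
# libraries/python/getpatter/engines/openai.py
# Import path inferred from the file location above (an assumption).
from getpatter.engines import openai

engine = openai.Realtime(
    model="gpt-realtime-2",
    reasoning_effort="low",
    input_audio_transcription_model="gpt-realtime-whisper",
)
```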

```typescript
// libraries/typescript/src/engines/openai-2.ts
const engine = new Realtime2({ reasoningEffort: "low" });
```

> **Implementation note (2026-05):** Although two marker classes exist, both `"openai_realtime"` and `"openai_realtime_2"` route through the same `OpenAIRealtime2Adapter` at call start, so only the default model string differs between them. OpenAI deprecated the Beta Realtime API in 2026-05, after which requests using the legacy `session.update` shape and the `OpenAI-Beta: realtime=v1` header fail with `invalid_model`.

Sources: [libraries/python/getpatter/engines/openai.py:10-59](libraries/python/getpatter/engines/openai.py), [libraries/python/getpatter/stream_handler.py:598-606](libraries/python/getpatter/stream_handler.py)

### ElevenLabs ConvAI (`elevenlabs.ConvAI`)

```python
# libraries/python/getpatter/engines/elevenlabs.py
engine = elevenlabs.ConvAI(api_key="...", agent_id="ag_...", voice="...")
```

The `ConvAI` marker requires both an API key (`ELEVENLABS_API_KEY`) and an **`agent_id`** (`ELEVENLABS_AGENT_ID`). The agent ID identifies a pre-configured ElevenLabs Conversational AI agent whose prompts, persona, and voice configuration are managed in the ElevenLabs dashboard. The `kind` discriminator is `"elevenlabs_convai"`.

Sources: [libraries/python/getpatter/engines/elevenlabs.py:1-59](libraries/python/getpatter/engines/elevenlabs.py)

### Pipeline Mode (no marker)

Pipeline mode is selected when `engine=` is not one of the speech-to-speech markers (`openai.Realtime`, `openai_realtime_2.Realtime2`, or `elevenlabs.ConvAI`). It uses the STT, LLM, and TTS configs on the `Agent` object directly. No marker class is required.
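
The selection rule can be sketched as a small function that reads `kind` and falls back to pipeline mode. The function name `resolve_mode` and the use of `SimpleNamespace` as a stand-in marker are assumptions for illustration; only the three `kind` strings come from this page.

```python
from types import SimpleNamespace
from typing import Any

# The kind strings documented on this page.
SPEECH_TO_SPEECH_KINDS = frozenset(
    {"openai_realtime", "openai_realtime_2", "elevenlabs_convai"}
)


def resolve_mode(engine: Any) -> str:
    """Return the engine kind, or "pipeline" when no known marker is given."""
    kind = getattr(engine, "kind", None)
    return kind if kind in SPEECH_TO_SPEECH_KINDS else "pipeline"


# Stand-in objects; real markers would expose `kind` as a property.
convai_marker = SimpleNamespace(kind="elevenlabs_convai")
mode_for_convai = resolve_mode(convai_marker)
mode_for_none = resolve_mode(None)
```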

---

## StreamHandler Architecture

The `StreamHandler` abstract base class is the per-call controller that owns the AI adapter, audio routing, transcript history, metrics, guardrails, tool calling, and call control. The telephony handler (Twilio or Telnyx) creates the appropriate subclass after reading the engine `kind`.

```text
┌─────────────────────────────────────────────────────┐
│                  StreamHandler (ABC)                │
│  start()  on_audio_received()  cleanup()  …         │
└──────┬──────────────────┬─────────────────┬─────────┘
       │                  │                 │
OpenAIRealtime        ElevenLabs       Pipeline
StreamHandler         ConvAI           StreamHandler
                      StreamHandler
  (websocket to         (websocket       (local VAD +
   OpenAI Realtime)      to ConvAI)       STT + LLM
                                          + TTS)
```

The three abstract methods every subclass must implement are `start()`, `on_audio_received()`, and `cleanup()`.

Sources: [libraries/python/getpatter/stream_handler.py:392-430](libraries/python/getpatter/stream_handler.py), [libraries/typescript/src/stream-handler.ts:238-260](libraries/typescript/src/stream-handler.ts)
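
To make the three-method contract concrete, here is a minimal sketch of an abstract base class and a trivial subclass. The names `StreamHandlerSketch`, `RecordingHandler`, and `run_call` are hypothetical, and treating all three methods as coroutines is an assumption; the real `StreamHandler` owns far more state than is shown here.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Iterable, List


class StreamHandlerSketch(ABC):
    """Hypothetical stand-in for the per-call controller contract."""

    @abstractmethod
    async def start(self) -> None: ...

    @abstractmethod
    async def on_audio_received(self, audio: bytes) -> None: ...

    @abstractmethod
    async def cleanup(self) -> None: ...


class RecordingHandler(StreamHandlerSketch):
    """Toy subclass that only records which lifecycle methods ran."""

    def __init__(self) -> None:
        self.events: List[str] = []

    async def start(self) -> None:
        self.events.append("start")

    async def on_audio_received(self, audio: bytes) -> None:
        self.events.append(f"audio:{len(audio)}")

    async def cleanup(self) -> None:
        self.events.append("cleanup")


async def run_call(handler: StreamHandlerSketch, frames: Iterable[bytes]) -> None:
    """Drive one call: start, feed frames, and always clean up."""
    await handler.start()
    try:
        for frame in frames:
            await handler.on_audio_received(frame)
    finally:
        await handler.cleanup()
```

Running `asyncio.run(run_call(RecordingHandler(), frames))` exercises the full lifecycle; the `finally` clause reflects the expectation that cleanup runs even when audio handling fails.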

---

## OpenAI Realtime Mode

### How it works

The `OpenAIRealtimeStreamHandler` opens a persistent WebSocket to the OpenAI Realtime API. All audio (inbound from telephony, outbound to telephony) flows through this single socket. OpenAI handles STT (Whisper), LLM response generation, and TTS in one end-to-end session — Patter never sees raw tokens or synthesized audio bytes from separate providers.

**Prewarm optimization:** At `start()`, the handler attempts to adopt a pre-opened `OpenAIRealtime2Adapter` WebSocket parked during the ringing window by `Patter._park_provider_connections`. A live parked socket skips the cold TCP+TLS+HTTP-101 handshake + `session.update` acknowledgment round-trip (~300–600 ms saved on the first audible word). If the parked socket is dead or absent the handler falls back to a fresh `connect()`.

Sources: [libraries/python/getpatter/stream_handler.py:662-720](libraries/python/getpatter/stream_handler.py)

### Event loop

The `_forward_events` coroutine runs as a background task consuming events from the adapter. Key events and their handling:

| Event | Handler action |
|---|---|
| `audio` | Forward bytes to `audio_sender.send_audio()`; record first-byte TTFB |
| `speech_started` | Send clear + cancel response (barge-in); emit user speech started |
| `speech_stopped` | Start turn latency timer; mark user transcript pending |
| `transcript_input` | Record STT complete; push `user` entry to history; request response |
| `transcript_output` | Accumulate agent text delta; check guardrails |
| `response_done` | Flush assistant turn (possibly buffered behind user transcript); record usage |
| `function_call` | Dispatch to tool executor or built-in `transfer_call`/`end_call` |

**Transcript ordering:** Because OpenAI Realtime emits the user's Whisper transcription *after* `response_done` (transcription runs in parallel with response generation), Patter buffers the assistant turn and flushes it only once the user transcript arrives. A timeout of 3.0 seconds (`_REALTIME_USER_TRANSCRIPT_WAIT_S`) ensures the assistant turn is still surfaced if the transcript never arrives.

Sources: [libraries/python/getpatter/stream_handler.py:703-740](libraries/python/getpatter/stream_handler.py), [libraries/python/getpatter/stream_handler.py:557-562](libraries/python/getpatter/stream_handler.py)
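
The buffering behavior described under "Transcript ordering" can be approximated with an `asyncio.Event` and a bounded wait. This is a sketch under stated assumptions: the class `TurnOrderer` and its method names are hypothetical, and only the 3.0 s wait value comes from this page.

```python
import asyncio
from typing import List, Tuple

USER_TRANSCRIPT_WAIT_S = 3.0  # mirrors _REALTIME_USER_TRANSCRIPT_WAIT_S


class TurnOrderer:
    """Hold the assistant turn until the user transcript lands or a timeout hits."""

    def __init__(self, wait_s: float = USER_TRANSCRIPT_WAIT_S) -> None:
        self.history: List[Tuple[str, str]] = []
        self._wait_s = wait_s
        self._user_seen = asyncio.Event()

    def on_speech_stopped(self) -> None:
        self._user_seen.clear()          # a user transcript is now pending

    def on_user_transcript(self, text: str) -> None:
        self.history.append(("user", text))
        self._user_seen.set()

    async def on_response_done(self, assistant_text: str) -> None:
        try:
            await asyncio.wait_for(self._user_seen.wait(), self._wait_s)
        except asyncio.TimeoutError:
            pass                          # surface the turn anyway
        self.history.append(("assistant", assistant_text))


async def demo(deliver_transcript: bool) -> List[Tuple[str, str]]:
    orderer = TurnOrderer(wait_s=0.2)    # short wait so the demo is quick
    orderer.on_speech_stopped()
    task = asyncio.create_task(orderer.on_response_done("How can I help?"))
    await asyncio.sleep(0.01)
    if deliver_transcript:
        orderer.on_user_transcript("hello")
    await task
    return orderer.history
```

With the transcript delivered, the user entry is recorded before the assistant entry even though `response_done` fired first; without it, the assistant entry appears alone once the wait expires.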

### Barge-in and VAD semantics

OpenAI's server-side VAD fires a `speech_started` event when it detects user speech during the agent's turn, and the handler responds with `send_clear()` + `cancel_response()`. On PSTN lines without acoustic echo cancellation (AEC), however, TTS bleed into the microphone can trigger phantom `speech_started` events.

To suppress early self-cancellation, the handler enforces a minimum elapsed time between the agent's first audio chunk and an allowed barge-in:

```python
# libraries/python/getpatter/stream_handler.py
MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_AEC = 1.0      # AEC warmup window
MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_NO_AEC = 0.5   # raised from 0.1 s in 0.6.2
```

The gate is anchored to `_current_response_first_audio_at` on the adapter, so the window runs from the first wire-time audio chunk rather than from the `beginSpeaking` timestamp (which precedes TTS TTFB by 200–700 ms for cloud TTS providers).

Sources: [libraries/python/getpatter/stream_handler.py:56-68](libraries/python/getpatter/stream_handler.py), [libraries/python/getpatter/stream_handler.py:1348-1370](libraries/python/getpatter/stream_handler.py)
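
A minimal sketch of the gate check, assuming the handler compares the time since the first audio chunk against the constant matching the AEC setting. The function `barge_in_allowed` and its parameters are hypothetical, and rejecting barge-in when no audio has been sent yet is an assumption, not confirmed behavior.

```python
from typing import Optional

MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_AEC = 1.0
MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_NO_AEC = 0.5


def barge_in_allowed(
    first_audio_at: Optional[float], now: float, aec_enabled: bool
) -> bool:
    """Return True once the agent has been audible for the minimum window."""
    if first_audio_at is None:
        # Assumption: with no agent audio on the wire there is nothing to cut off.
        return False
    minimum = (
        MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_AEC
        if aec_enabled
        else MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_NO_AEC
    )
    return (now - first_audio_at) >= minimum
```

In this sketch a `speech_started` event arriving 0.3 s after the first audio chunk is ignored in both configurations, while one arriving 0.75 s after it passes only when AEC is off.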

---

## ElevenLabs ConvAI Mode

### How it works

`ElevenLabsConvAIStreamHandler` opens a WebSocket to the ElevenLabs ConvAI endpoint identified by `agent_id`. Like OpenAI Realtime, this is a fully-baked speech-to-speech path where ElevenLabs internally manages STT, LLM inference, and TTS. Patter sees streamed audio chunks and transcript events, but has no token-level visibility into the LLM response.

**Audio transcoding:** Twilio delivers μ-law 8 kHz audio. By default, the handler decodes it to PCM16 and resamples to 16 kHz before forwarding to ConvAI. When ConvAI negotiates `ulaw_8000` on its input side, a native μ-law fast-path (`_native_mulaw_8k`) bypasses the decode+resample entirely.
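
For readers unfamiliar with the default transcoding path, the sketch below decodes G.711 μ-law bytes to PCM16 sample values and doubles the sample rate by linear interpolation. Both function names are hypothetical, and the interpolation is a deliberately naive stand-in: the resampler Patter actually uses is not shown on this page and is likely more careful about filtering.

```python
from typing import List


def ulaw_to_pcm16(data: bytes) -> List[int]:
    """Decode G.711 mu-law bytes into signed 16-bit PCM sample values."""
    samples = []
    for byte in data:
        byte = ~byte & 0xFF                 # mu-law bytes are stored inverted
        sign = byte & 0x80
        exponent = (byte >> 4) & 0x07
        mantissa = byte & 0x0F
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        samples.append(-magnitude if sign else magnitude)
    return samples


def upsample_2x_linear(samples: List[int]) -> List[int]:
    """Naive 8 kHz to 16 kHz upsampling: insert the midpoint after each sample."""
    out = []
    for i, sample in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else sample
        out.append(sample)
        out.append((sample + nxt) // 2)
    return out


pcm_8k = ulaw_to_pcm16(bytes([0xFF, 0x80, 0x00]))
pcm_16k = upsample_2x_linear(pcm_8k)
```

The native `ulaw_8000` fast-path mentioned above would skip both steps and forward the telephony bytes unchanged.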

### Event loop and barge-in

The ConvAI event loop is simpler than the Realtime path because ElevenLabs manages all turn-taking internally. The `interruption` event from the ConvAI adapter is the canonical barge-in signal:

```python
elif ev_type == "interruption":
    await self.audio_sender.send_clear()
    if self.metrics is not None:
        self.metrics.record_turn_interrupted()
    waiting_first_audio = False
    current_agent_text = ""
```

Unlike the Realtime path, there is no Patter-side barge-in gate — barge-in detection and suppression happen entirely inside ElevenLabs. The SDK cannot configure the VAD sensitivity or gate durations for ConvAI.

Sources: [libraries/python/getpatter/stream_handler.py:1780-1800](libraries/python/getpatter/stream_handler.py)

---

## Pipeline Mode (STT → LLM → TTS)

### How it works

`PipelineStreamHandler` composes three independently-configured providers. The telephony WebSocket delivers audio → a local VAD and STT adapter transcribe it → an LLM loop generates a text response → a TTS adapter synthesizes audio → it is sent back.

```text
Telephony audio (mulaw 8kHz)
      │
      ▼
[Decode + resample → PCM16 16kHz]
      │
      ▼
[VAD]  ──speech_start──►  [STT (Deepgram / Whisper / Cartesia / …)]
                                   │ transcript
                                   ▼
                          [LLM (OpenAI / Anthropic / Groq / …)]
                                   │ token stream
                                   ▼
                          [SentenceChunker]
                                   │ sentence
                                   ▼
                          [TTS (ElevenLabs / OpenAI / Cartesia / …)]
                                   │ audio chunks
                                   ▼
                          [Encode → mulaw 8kHz]
                                   │
                                   ▼
                          Telephony audio out
```

At `start()`, the handler initializes STT and TTS from `agent.stt` / `agent.tts` config objects, falling back to `deepgram_key` for STT and `elevenlabs_key` for TTS when explicit adapters are not provided. It also auto-loads SileroVAD (when `onnxruntime-node`/`onnxruntime` is available) unless `agent.vad` is set explicitly.

Sources: [libraries/python/getpatter/stream_handler.py:2050-2070](libraries/python/getpatter/stream_handler.py)
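
The sentence-chunking stage in the diagram exists so that TTS can start on the first complete sentence instead of waiting for the whole LLM response. A minimal sketch of that idea follows; `chunk_sentences` is a hypothetical function, not Patter's `SentenceChunker`, and its boundary rule (terminal punctuation followed by whitespace) is a simplification that would split on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? when followed by whitespace.
_BOUNDARY = re.compile(r"[.!?](?=\s)")


def chunk_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream contains them."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()              # flush the trailing partial sentence


sentences = list(
    chunk_sentences(["Hel", "lo there", ". How", " are you", "? Fine"])
)
```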

### Pipeline hooks

Pipeline mode exposes a rich hook surface via `PipelineHookExecutor`. Every hook is **fail-open**: exceptions are logged and the original value passes through unchanged so a broken hook never kills a call.

| Hook | Stage | Behavior |
|---|---|---|
| `beforeSendToStt` | Pre-STT audio | Drop (return `null`) or pass through |
| `afterTranscribe` | Post-STT transcript | Modify or veto the transcript |
| `beforeLlm` | Pre-LLM messages | Modify the messages array |
| `afterLlm.onChunk` | Per-LLM-token | Synchronous, ~0 ms budget |
| `afterLlm.onSentence` | Per-sentence | Async rewrite or drop |
| `afterLlm.onResponse` | Full response | Async rewrite (requires buffering) |
| `beforeSynthesize` | Pre-TTS text | Modify or veto the sentence |
| `afterSynthesize` | Post-TTS audio | Modify or drop the audio chunk |

The `afterLlm` hook is normalized from a legacy `(text, ctx) => string` callable or a new three-tier `AfterLLMHook` object. Only `onResponse` requires the LLM loop to buffer the full stream before proceeding.

Sources: [libraries/typescript/src/pipeline-hooks.ts:1-50](libraries/typescript/src/pipeline-hooks.ts), [libraries/typescript/src/pipeline-hooks.ts:101-222](libraries/typescript/src/pipeline-hooks.ts)
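
The fail-open contract can be illustrated in a few lines. `run_fail_open` is a hypothetical helper, not part of `PipelineHookExecutor`; it only shows the rule that an exception is logged and the original value is passed through.

```python
import logging
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
logger = logging.getLogger("pipeline_hooks_sketch")


def run_fail_open(hook: Optional[Callable[[T], T]], value: T) -> T:
    """Apply a hook; on any exception, log it and return the input unchanged."""
    if hook is None:
        return value
    try:
        return hook(value)
    except Exception:
        logger.exception("hook failed; passing original value through")
        return value


def broken_hook(text: str) -> str:
    raise RuntimeError("boom")


rewritten = run_fail_open(str.upper, "hello")
survived = run_fail_open(broken_hook, "hello")
```

A hook that deliberately returns `None` (the drop or veto case in the table above) is still honored in this sketch, because only exceptions trigger the pass-through.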

### Barge-in and VAD semantics

Pipeline mode implements barge-in entirely in Patter. When the local VAD fires `speech_start` while the agent is speaking, the handler consults optional **barge-in confirmation strategies** before canceling:

- With no strategies configured (default), the first `speech_start` triggers immediate cancel of STT streaming, LLM consumption (`_llm_cancel_event` / `llmAbort`), and TTS synthesis.
- With one or more strategies, barge-in enters a **pending** state — TTS continues streaming naturally — and each incoming STT transcript is passed to the strategies. The first strategy that approves confirms the barge-in; if none confirm within `barge_in_confirm_ms` (default 1500 ms) the pending state is dropped.
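
The two cases above can be sketched as a small state machine. The class `BargeInController`, its method names, and the word-count strategy are hypothetical; only the immediate-cancel default, the pending state, and the 1500 ms default come from this page.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Strategy = Callable[[str], bool]


@dataclass
class BargeInController:
    strategies: List[Strategy] = field(default_factory=list)
    confirm_ms: int = 1500                      # default barge_in_confirm_ms
    _pending_since: Optional[float] = None

    def on_speech_start(self, now: float) -> bool:
        """Return True when the barge-in should cancel immediately."""
        if not self.strategies:
            return True                          # default: cancel right away
        self._pending_since = now                # TTS keeps playing meanwhile
        return False

    def on_transcript(self, text: str, now: float) -> bool:
        """Return True when a pending barge-in is confirmed by a strategy."""
        if self._pending_since is None:
            return False
        if (now - self._pending_since) * 1000 > self.confirm_ms:
            self._pending_since = None           # window expired, drop pending
            return False
        if any(strategy(text) for strategy in self.strategies):
            self._pending_since = None
            return True
        return False


def at_least_two_words(text: str) -> bool:
    """Example strategy: ignore one-word backchannels such as "uh"."""
    return len(text.split()) >= 2
```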

The same AEC-vs-no-AEC gate constants apply as in the Realtime path:

```python
# stream_handler.py (Python) / stream-handler.ts (TypeScript)
MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_AEC    = 1.0   # covers AEC convergence window
MIN_AGENT_SPEAKING_S_BEFORE_BARGE_IN_NO_AEC = 0.5   # anti-phantom-VAD on PSTN
```

The gate is anchored to `_first_audio_sent_at` (the instant the first audio chunk actually reached the carrier wire), not to `_speaking_started_at`, so slow-TTFB TTS providers do not leave the gate expired before audio goes out.

**Inbound audio ring buffer:** While the agent is speaking and the self-hearing guard is dropping inbound audio, up to ~250–600 ms of PCM16 16 kHz frames are kept in a ring buffer (`_inbound_audio_ring`). On confirmed barge-in, this buffer is flushed to STT so the user's leading speech — missed while the VAD's `minSpeechDuration` window was accumulating — is recovered and transcribed.

Sources: [libraries/python/getpatter/stream_handler.py:56-68](libraries/python/getpatter/stream_handler.py), [libraries/typescript/src/stream-handler.ts:303-310](libraries/typescript/src/stream-handler.ts), [libraries/typescript/src/stream-handler.ts:330-360](libraries/typescript/src/stream-handler.ts)
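
The inbound ring buffer can be approximated with a bounded `deque`. `InboundAudioRing` is a hypothetical class, and the fixed 20 ms frame size and 500 ms capacity are assumptions chosen to sit inside the ~250–600 ms range mentioned above.

```python
from collections import deque
from typing import Deque


class InboundAudioRing:
    """Keep only the most recent audio frames, discarding the oldest."""

    def __init__(self, max_ms: int = 500, frame_ms: int = 20) -> None:
        self._frames: Deque[bytes] = deque(maxlen=max(1, max_ms // frame_ms))

    def push(self, frame: bytes) -> None:
        self._frames.append(frame)               # oldest frame falls off when full

    def flush(self) -> bytes:
        """Return the buffered audio in arrival order and empty the ring."""
        data = b"".join(self._frames)
        self._frames.clear()
        return data


ring = InboundAudioRing(max_ms=60, frame_ms=20)  # room for three frames
for frame in (b"a", b"b", b"c", b"d"):
    ring.push(frame)
recovered = ring.flush()
```

On a confirmed barge-in, the flushed bytes would be sent to STT ahead of the live audio so the start of the user's utterance is not lost.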

---

## STT Hallucination Filtering

All three modes share a filter against known Whisper and Deepgram hallucinations on silence or TTS echo. When an STT transcript matches any entry in `_STT_HALLUCINATIONS` (Python) / `HALLUCINATIONS` (TypeScript) after lower-casing and stripping punctuation, the turn is dropped entirely rather than passed to the LLM. This prevents PSTN echo loopback from producing phantom "thank you for watching" user turns that trigger spurious LLM responses.

```python
# libraries/python/getpatter/stream_handler.py
_STT_HALLUCINATIONS: frozenset[str] = frozenset({
    "you", "thank you", "thanks", "yeah", "yes", "no", "okay", "ok",
    "uh", "um", "mmm", "hmm", ".", "bye", "right", "cool",
    "thank you for watching", "thanks for watching", "[music]", "[silence]", ...
})
```

Sources: [libraries/python/getpatter/stream_handler.py:74-110](libraries/python/getpatter/stream_handler.py), [libraries/typescript/src/stream-handler.ts:175-200](libraries/typescript/src/stream-handler.ts)
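
The matching step can be sketched as below, using a subset of the entries listed above. The function `is_hallucination` is hypothetical, and checking both the lower-cased form and the punctuation-stripped form is an assumption made so that entries such as `"[music]"` and `"."` can still match; the real normalization may differ.

```python
import string

# Subset of the entries shown on this page, for illustration only.
HALLUCINATIONS_SKETCH = frozenset({
    "you", "thank you", "thanks", ".",
    "thank you for watching", "thanks for watching", "[music]", "[silence]",
})

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def is_hallucination(transcript: str) -> bool:
    """Return True when the whole transcript is a known phantom phrase."""
    lowered = transcript.strip().lower()
    stripped = lowered.translate(_STRIP_PUNCT).strip()
    return lowered in HALLUCINATIONS_SKETCH or stripped in HALLUCINATIONS_SKETCH
```

Only whole-transcript matches are dropped in this sketch, so "Thank you." is filtered while "Thank you for calling" passes through to the LLM.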

---

## Built-in Tool Injection

Both Realtime and Pipeline modes inject the `transfer_call` and `end_call` built-in tools into every session so the LLM can initiate a call transfer or hang up regardless of system-prompt instructions. In Realtime mode the tools appear in the `session.update` sent by `OpenAIRealtimeStreamHandler.start()`; in Pipeline mode they are appended by `_augment_with_builtin_handoff_tools()` / `augmentWithBuiltinHandoffTools()`, wiring handler closures that call the telephony-level `_transfer_fn` / `_hangup_fn`. This ensures parity between modes.

Sources: [libraries/python/getpatter/stream_handler.py:147-190](libraries/python/getpatter/stream_handler.py), [libraries/typescript/src/stream-handler.ts:120-158](libraries/typescript/src/stream-handler.ts)
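
A rough sketch of the pipeline-side augmentation follows. The function `augment_with_builtin_tools` and the dictionary shape (`name`, `handler`) are assumptions for illustration and not the signature of `_augment_with_builtin_handoff_tools()`; skipping a built-in when a user tool already has that name is likewise an assumption.

```python
from typing import Any, Callable, Dict, List

Tool = Dict[str, Any]


def augment_with_builtin_tools(
    tools: List[Tool],
    transfer_fn: Callable[[str], None],
    hangup_fn: Callable[[], None],
) -> List[Tool]:
    """Return a new tool list that always includes transfer_call and end_call."""
    existing = {tool["name"] for tool in tools}
    augmented = list(tools)                       # never mutate the caller's list
    if "transfer_call" not in existing:
        augmented.append({"name": "transfer_call", "handler": transfer_fn})
    if "end_call" not in existing:
        augmented.append({"name": "end_call", "handler": hangup_fn})
    return augmented


calls: List[str] = []
user_tools: List[Tool] = [{"name": "lookup_order", "handler": lambda order_id: None}]
session_tools = augment_with_builtin_tools(
    user_tools,
    transfer_fn=lambda number: calls.append(f"transfer:{number}"),
    hangup_fn=lambda: calls.append("hangup"),
)
```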

---

## Latency Trade-offs

| | OpenAI Realtime | ElevenLabs ConvAI | Pipeline |
|---|---|---|---|
| **Architecture** | Speech-to-speech (single WS) | Speech-to-speech (single WS) | STT + LLM + TTS (3 services) |
| **Latency** | Lowest — one model, one hop | Low — ElevenLabs managed | Higher — three sequential hops |
| **Prewarm** | WS parked during ringing (~300–600 ms saved) | WS parked during ringing | STT WS + first-message audio prewarmed |
| **Barge-in control** | Patter gate + server VAD | ElevenLabs managed | Full local control via strategies |
| **LLM provider** | OpenAI only | ElevenLabs-configured | Any (OpenAI, Anthropic, Groq, …) |
| **Per-turn hooks** | Guardrails only | Guardrails only | Full hook surface (8 hook points) |
| **Reasoning effort** | `reasoning_effort` knob | N/A | Per-provider model selection |
| **Transparency** | Transcript + tool events | Transcript events | Full token stream, audio chunks |

The `reasoning_effort` field on the OpenAI Realtime markers directly affects per-turn latency on `gpt-realtime-2`. OpenAI recommends `"low"` for production voice flows because `"medium"` and `"high"` add measurable latency per turn with diminishing returns on voice tasks.

---

## Summary

Patter's three voice modes are selected at construction time. The two speech-to-speech engines are chosen by frozen engine marker objects whose `kind` property picks the `StreamHandler` subclass at call start, and Pipeline mode is used when no such marker is supplied. The speech-to-speech engines (OpenAI Realtime and ElevenLabs ConvAI) minimize latency by collapsing STT, LLM, and TTS into a single vendor-managed WebSocket session, while Pipeline mode pays extra round-trip latency in exchange for full provider choice, rich per-stage hooks, and local VAD/barge-in strategy control. All three modes share the STT hallucination filter and the speech-event observability surface. OpenAI Realtime and Pipeline additionally share the barge-in gate constants and built-in tool injection, whereas ConvAI leaves barge-in entirely to ElevenLabs. This common ground keeps most application logic unchanged when switching architectures.

Sources: [libraries/python/getpatter/stream_handler.py:380-430](libraries/python/getpatter/stream_handler.py)
