# STT & TTS Provider Catalog

> All supported speech-to-text (Deepgram, Whisper, OpenAI Transcribe, AssemblyAI, Cartesia, Soniox, Speechmatics) and text-to-speech (ElevenLabs, Cartesia, OpenAI, LMNT, Rime, Inworld) adapters — their configuration, streaming contracts, known limitations, and how to swap providers in pipeline mode.

- Repository: PatterAI/Patter
- GitHub: https://github.com/PatterAI/Patter
- Human wiki: https://grok-wiki.com/public/wiki/patterai-patter-57d14e233afc
- Complete Markdown: https://grok-wiki.com/public/wiki/patterai-patter-57d14e233afc/llms-full.txt

## Source Files

- `libraries/python/getpatter/stt/deepgram.py`
- `libraries/python/getpatter/stt/openai_transcribe.py`
- `libraries/python/getpatter/stt/whisper.py`
- `libraries/python/getpatter/tts/elevenlabs.py`
- `libraries/python/getpatter/tts/cartesia.py`
- `libraries/python/getpatter/tts/openai.py`
- `libraries/python/getpatter/providers/base.py`

---

<details>
<summary>Relevant source files</summary>
The following files were used as context for generating this wiki page:

- [libraries/python/getpatter/providers/base.py](libraries/python/getpatter/providers/base.py)
- [libraries/python/getpatter/stt/deepgram.py](libraries/python/getpatter/stt/deepgram.py)
- [libraries/python/getpatter/stt/assemblyai.py](libraries/python/getpatter/stt/assemblyai.py)
- [libraries/python/getpatter/stt/cartesia.py](libraries/python/getpatter/stt/cartesia.py)
- [libraries/python/getpatter/stt/openai_transcribe.py](libraries/python/getpatter/stt/openai_transcribe.py)
- [libraries/python/getpatter/stt/whisper.py](libraries/python/getpatter/stt/whisper.py)
- [libraries/python/getpatter/stt/soniox.py](libraries/python/getpatter/stt/soniox.py)
- [libraries/python/getpatter/stt/speechmatics.py](libraries/python/getpatter/stt/speechmatics.py)
- [libraries/python/getpatter/tts/elevenlabs.py](libraries/python/getpatter/tts/elevenlabs.py)
- [libraries/python/getpatter/tts/cartesia.py](libraries/python/getpatter/tts/cartesia.py)
- [libraries/python/getpatter/tts/openai.py](libraries/python/getpatter/tts/openai.py)
- [libraries/python/getpatter/tts/lmnt.py](libraries/python/getpatter/tts/lmnt.py)
- [libraries/python/getpatter/tts/rime.py](libraries/python/getpatter/tts/rime.py)
- [libraries/python/getpatter/tts/inworld.py](libraries/python/getpatter/tts/inworld.py)
- [libraries/python/getpatter/providers/deepgram_stt.py](libraries/python/getpatter/providers/deepgram_stt.py)
- [libraries/python/getpatter/providers/assemblyai_stt.py](libraries/python/getpatter/providers/assemblyai_stt.py)
- [libraries/python/getpatter/providers/cartesia_stt.py](libraries/python/getpatter/providers/cartesia_stt.py)
- [libraries/python/getpatter/providers/openai_transcribe_stt.py](libraries/python/getpatter/providers/openai_transcribe_stt.py)
- [libraries/python/getpatter/providers/soniox_stt.py](libraries/python/getpatter/providers/soniox_stt.py)
- [libraries/python/getpatter/providers/speechmatics_stt.py](libraries/python/getpatter/providers/speechmatics_stt.py)
- [libraries/python/getpatter/providers/elevenlabs_ws_tts.py](libraries/python/getpatter/providers/elevenlabs_ws_tts.py)
</details>

Patter's voice pipeline is built around a pair of abstract interfaces — `STTProvider` and `TTSProvider` — that standardize the contract for any speech-to-text or text-to-speech backend. Concrete adapters wrap each vendor API behind this common interface, so swapping providers is a one-line constructor change at the call-site. All adapters live under `libraries/python/getpatter/stt/` and `libraries/python/getpatter/tts/`; the underlying implementations are in `libraries/python/getpatter/providers/`.

This page catalogs every supported STT and TTS adapter: its configuration parameters, transport model, streaming contract, telephony shortcuts, and known constraints. It also explains how the base interfaces are structured and how to swap providers inside pipeline mode.

---

## Base Interface Contract

Both families share a common lifecycle defined in `providers/base.py`:

```text
STTProvider                          TTSProvider
────────────────────────────────     ─────────────────────────────
connect() → None                     synthesize(text) → AsyncIterator[bytes]
send_audio(chunk: bytes) → None      close() → None
receive_transcripts() → AsyncIterator[Transcript]
close() → None
warmup() → None  (optional, best-effort)
```

`warmup()` is a no-op by default. When `prewarm=True` (the agent default), Patter calls `warmup()` once per outbound call before the carrier reports `answered`, pre-heating DNS, TLS, and provider-edge state to save 200–500 ms of first-turn latency. Failures are swallowed and logged at DEBUG — the live call always proceeds.
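
The swallow-and-log behavior can be sketched as follows. This is a minimal illustration rather than Patter's own code: `best_effort_warmup`, `FlakyProvider`, and the 2-second timeout are hypothetical names and values chosen for the example.

```python
import asyncio
import logging

logger = logging.getLogger("patter.sketch")  # hypothetical logger name


class FlakyProvider:
    """Hypothetical provider whose warmup always fails."""

    async def warmup(self) -> None:
        raise ConnectionError("edge unreachable")


async def best_effort_warmup(provider, timeout_s: float = 2.0) -> bool:
    """Run provider.warmup(), swallowing any failure and logging it at DEBUG.

    Returns True if warmup completed and False otherwise. Call setup
    continues in both cases. The timeout is an assumption of this sketch.
    """
    try:
        await asyncio.wait_for(provider.warmup(), timeout=timeout_s)
        return True
    except Exception as exc:  # deliberately broad: warmup must never block a call
        logger.debug("warmup failed for %s: %r", type(provider).__name__, exc)
        return False


if __name__ == "__main__":
    print(asyncio.run(best_effort_warmup(FlakyProvider())))
```

Catching `Exception` rather than `BaseException` leaves task cancellation untouched, so a call that is torn down during warmup still unwinds normally.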

`Transcript` is the normalized STT output type:

| Field | Type | Meaning |
|---|---|---|
| `text` | `str` | Transcribed text (stripped) |
| `is_final` | `bool` | Stable utterance, not a partial |
| `confidence` | `float` | Per-utterance confidence in `[0.0, 1.0]` |
| `speech_final` | `bool` | Faster VAD end-of-utterance hint (Deepgram) |
| `from_finalize` | `bool` | Result was triggered by a `Finalize` control frame |
| `event_type` | `Literal["Results", "UtteranceEnd", "SpeechStarted"]` | Event kind |
| `words` | `list[dict]` | Optional per-word timings/metadata |
| `request_id` | `str \| None` | Provider-side trace ID for cost reconciliation |

Sources: [libraries/python/getpatter/providers/base.py:18-61](libraries/python/getpatter/providers/base.py)
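
As a rough illustration of how a caller might consume the `Transcript` type, the sketch below mirrors the documented fields in a stand-in dataclass and gates on either finality signal. The default values and the `is_end_of_turn` helper are assumptions made for the example; the real class is defined in `providers/base.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class Transcript:
    """Stand-in with the fields from the table above; defaults are assumed."""

    text: str
    is_final: bool = False
    confidence: float = 0.0
    speech_final: bool = False
    from_finalize: bool = False
    event_type: Literal["Results", "UtteranceEnd", "SpeechStarted"] = "Results"
    words: list[dict] = field(default_factory=list)
    request_id: Optional[str] = None


def is_end_of_turn(t: Transcript) -> bool:
    """Hypothetical gate: a non-empty result with either finality flag set."""
    has_text = t.event_type == "Results" and bool(t.text)
    return has_text and (t.is_final or t.speech_final)


if __name__ == "__main__":
    print(is_end_of_turn(Transcript(text="book a table", speech_final=True)))
    print(is_end_of_turn(Transcript(text="book a")))
```

Gating on `speech_final` as well as `is_final` lets a pipeline react to the faster VAD hint when a provider supplies it, while still working with providers that only set `is_final`.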

---

## STT Providers

### Provider Summary

| Adapter module | `provider_key` | Transport | Default model | Default sample rate | Env var |
|---|---|---|---|---|---|
| `stt/deepgram.py` | `deepgram` | WebSocket (persistent) | `nova-3` | 16 kHz | `DEEPGRAM_API_KEY` |
| `stt/assemblyai.py` | `assemblyai` | WebSocket (persistent) | `universal-streaming-english` | 16 kHz | `ASSEMBLYAI_API_KEY` |
| `stt/cartesia.py` | `cartesia_stt` | WebSocket (persistent) | `ink-whisper` | 16 kHz | `CARTESIA_API_KEY` |
| `stt/openai_transcribe.py` | `openai_transcribe` | HTTP POST (buffered) | `gpt-4o-transcribe` | 16 kHz | `OPENAI_API_KEY` |
| `stt/whisper.py` | `whisper` | HTTP POST (buffered) | `whisper-1` | 16 kHz | `OPENAI_API_KEY` |
| `stt/soniox.py` | `soniox` | WebSocket (persistent) | `stt-rt-v4` | 16 kHz | `SONIOX_API_KEY` |
| `stt/speechmatics.py` | _(no provider_key)_ | SDK WebSocket | _(n/a; `ADAPTIVE` turn detection)_ | 16 kHz | `SPEECHMATICS_API_KEY` |

---

### Deepgram

**Transport:** Persistent WebSocket to `wss://api.deepgram.com/v1/listen`.

Deepgram is the most feature-complete streaming adapter. It runs a KeepAlive pump (JSON `{"type":"KeepAlive"}` every 4 s) to stop the server from closing idle sessions after ~10 s. When speech ends, the SDK sends a `Finalize` control frame to force an immediate final transcript rather than waiting for Deepgram's own `utterance_end_ms` heuristic (~1 s). Graceful teardown uses the `Finalize → drain 100 ms → CloseStream` sequence.

The adapter emits three `event_type` values: `Results` (normal transcripts), `SpeechStarted` (VAD), and `UtteranceEnd` (VAD). Both `is_final` and `speech_final` are surfaced so callers can gate on either signal independently.

`smart_format` (punctuation and numeral normalization) defaults to `False` in the provider implementation because it adds 50–150 ms to TTFT per transcript and is rarely useful for LLM pipelines. Pass `smart_format=True` to opt in for human-visible transcripts.

```python
from getpatter.stt import deepgram

stt = deepgram.STT()                                          # reads DEEPGRAM_API_KEY
stt = deepgram.STT(api_key="dg_...", endpointing_ms=80)
stt_twilio = deepgram.STT.for_twilio(api_key="dg_...")        # mulaw, 8 kHz
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `nova-3` | Also `nova-2`, `nova-2-phonecall`, `enhanced`, `base` |
| `encoding` | `linear16` | Also `mulaw`, `alaw`, `opus`, `flac` |
| `sample_rate` | `16000` | 8000, 16000, 24000, 44100, 48000 |
| `endpointing_ms` | `150` | Silence wait before endpoint decision |
| `utterance_end_ms` | `1000` | Deepgram enforces a 1000 ms minimum |
| `smart_format` | `True` (pipeline wrapper) / `False` (provider) | Punctuation/numeral formatting |
| `interim_results` | `True` | Emit partial transcripts |
| `vad_events` | `True` | Emit `SpeechStarted` / `UtteranceEnd` frames |

Sources: [libraries/python/getpatter/providers/deepgram_stt.py:54-102](libraries/python/getpatter/providers/deepgram_stt.py), [libraries/python/getpatter/stt/deepgram.py:24-59](libraries/python/getpatter/stt/deepgram.py)
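
The KeepAlive pump described for Deepgram can be sketched with plain `asyncio`. The frame payload and 4-second interval come from the description above; `keepalive_pump`, `send_text`, and `demo` are hypothetical names, and the adapter's actual implementation may differ.

```python
import asyncio
import json
from typing import Awaitable, Callable, List

# Compact separators reproduce the frame exactly as shown above.
KEEPALIVE_FRAME = json.dumps({"type": "KeepAlive"}, separators=(",", ":"))


async def keepalive_pump(
    send_text: Callable[[str], Awaitable[None]],
    interval_s: float = 4.0,
) -> None:
    """Send a KeepAlive text frame every `interval_s` seconds until cancelled."""
    while True:
        await asyncio.sleep(interval_s)
        await send_text(KEEPALIVE_FRAME)


async def demo(interval_s: float = 0.01, run_for_s: float = 0.1) -> List[str]:
    """Run the pump against an in-memory sink and return the frames it sent."""
    sent: List[str] = []

    async def fake_send(frame: str) -> None:
        sent.append(frame)

    task = asyncio.create_task(keepalive_pump(fake_send, interval_s))
    await asyncio.sleep(run_for_s)
    task.cancel()  # stop the pump before the socket is closed
    try:
        await task
    except asyncio.CancelledError:
        pass
    return sent


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Running the pump as its own task keeps it independent of audio flow, so silence on the caller's side does not let the idle timer expire.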

---

### AssemblyAI

**Transport:** Persistent WebSocket to `wss://streaming.assemblyai.com/v3/ws` (pure `aiohttp`, no vendor SDK).

AssemblyAI's adapter implements the v3 streaming protocol with frame coalescing: because Twilio emits 20 ms frames (below AssemblyAI's 50 ms floor), frames are accumulated into a ~60 ms buffer before being forwarded. Sending frames shorter than 50 ms triggers server error 3007 and closes the stream.

Reconnect logic handles transient close codes 3005 and 3008 with one automatic retry. The adapter supports mid-session `UpdateConfiguration` (raise `min_turn_silence` while collecting digit strings) and `ForceEndpoint` (barge-in).

The `language` constructor argument is currently ignored — language behavior is driven by the `model` kwarg; a warning is emitted when a non-default value is supplied.

```python
from getpatter.stt import assemblyai

stt = assemblyai.STT()
stt = assemblyai.STT(api_key="...", model="universal-streaming-multilingual")
stt_twilio = assemblyai.STT.for_twilio(api_key="...")   # pcm_mulaw, 8 kHz
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `universal-streaming-english` | Also `universal-streaming-multilingual`, `u3-rt-pro`, `whisper-rt` |
| `encoding` | `pcm_s16le` | Also `pcm_mulaw` |
| `sample_rate` | `16000` | 8000 or 16000 |
| `language` | `"en"` | Ignored — drives a warning if non-default |

Sources: [libraries/python/getpatter/providers/assemblyai_stt.py:58-110](libraries/python/getpatter/providers/assemblyai_stt.py), [libraries/python/getpatter/providers/assemblyai_stt.py:270-330](libraries/python/getpatter/providers/assemblyai_stt.py)
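
The coalescing step can be illustrated with a small buffer that releases audio only once roughly 60 ms has accumulated. `FrameCoalescer` and its parameters are hypothetical names for this sketch, and the byte arithmetic assumes single-channel audio.

```python
from typing import Optional


class FrameCoalescer:
    """Accumulate small audio frames and release chunks of at least `target_ms`."""

    def __init__(
        self,
        sample_rate: int = 8000,
        bytes_per_sample: int = 1,  # mu-law is 1 byte per sample; pcm_s16le is 2
        target_ms: int = 60,
    ) -> None:
        self._target_bytes = sample_rate * bytes_per_sample * target_ms // 1000
        self._buf = bytearray()

    def push(self, frame: bytes) -> Optional[bytes]:
        """Add one frame; return a chunk once the buffer reaches the target size."""
        self._buf.extend(frame)
        if len(self._buf) < self._target_bytes:
            return None
        chunk = bytes(self._buf)
        self._buf.clear()
        return chunk


if __name__ == "__main__":
    coalescer = FrameCoalescer()
    twilio_frame = b"\xff" * 160  # 20 ms of 8 kHz mu-law
    for i in range(3):
        out = coalescer.push(twilio_frame)
        print(i, None if out is None else len(out))
```

With 8 kHz mu-law input, a 20 ms frame is 160 bytes, so every third frame produces a 480-byte (60 ms) chunk that clears the 50 ms floor.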

---

### Cartesia (STT)

**Transport:** Persistent WebSocket to `wss://api.cartesia.ai/stt/websocket` (pure `aiohttp`).

Cartesia's `ink-whisper` STT adapter emits interim and final transcripts via the `transcript` event type. A `finalize` text frame forces immediate utterance finalization — wired to the SDK's VAD `speech_end` event to convert Cartesia's otherwise conservative silence-based heuristic (2–7 s on PSTN audio) into a fast VAD-driven one. The keepalive loop sends WebSocket pings every 30 s.

The adapter supports connection parking: `open_parked_connection()` pre-opens a WS during the carrier ringing window; `adopt_websocket()` adopts it at call pickup, eliminating the TLS+WS-upgrade round-trip (~150–400 ms) on the first turn.

```python
from getpatter.stt import cartesia

stt = cartesia.STT()                          # reads CARTESIA_API_KEY
stt = cartesia.STT(api_key="...", language="es")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `ink-whisper` | Only currently supported model |
| `encoding` | `pcm_s16le` | Only supported encoding |
| `sample_rate` | `16000` | 8000, 16000, 24000, 44100, 48000 |
| `language` | `"en"` | BCP-47 code |

Sources: [libraries/python/getpatter/providers/cartesia_stt.py:50-80](libraries/python/getpatter/providers/cartesia_stt.py), [libraries/python/getpatter/providers/cartesia_stt.py:200-240](libraries/python/getpatter/providers/cartesia_stt.py)

---

### OpenAI Transcribe (GPT-4o)

**Transport:** Buffered HTTP POST to OpenAI's `/v1/audio/transcriptions` endpoint.

`OpenAITranscribeSTT` subclasses `WhisperSTT` and reuses its buffering and transcription logic; the only differences are the default model (`gpt-4o-transcribe`) and an accepted-model whitelist that rejects `whisper-1`. The source describes it as "~10x faster than Whisper-1" for latency-sensitive pipelines.

```python
from getpatter.stt import openai_transcribe

stt = openai_transcribe.STT()                    # reads OPENAI_API_KEY
stt = openai_transcribe.STT(language="it")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `gpt-4o-transcribe` | Also `gpt-4o-mini-transcribe`; `whisper-1` rejected |
| `language` | `"en"` | BCP-47 language hint |

Sources: [libraries/python/getpatter/providers/openai_transcribe_stt.py:1-60](libraries/python/getpatter/providers/openai_transcribe_stt.py), [libraries/python/getpatter/stt/openai_transcribe.py:18-40](libraries/python/getpatter/stt/openai_transcribe.py)

---

### Whisper (whisper-1)

**Transport:** Buffered HTTP POST (same endpoint as OpenAI Transcribe).

The original Whisper adapter buffers incoming PCM audio across the call turn and submits it as a single POST request when the utterance ends. Latency is higher than with streaming WebSocket providers; use `openai_transcribe.STT` for production pipelines unless you specifically need `whisper-1`.

```python
from getpatter.stt import whisper

stt = whisper.STT()                # reads OPENAI_API_KEY
stt = whisper.STT(language="it")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `whisper-1` | OpenAI Whisper v1 |
| `language` | `"en"` | BCP-47 hint |

Sources: [libraries/python/getpatter/stt/whisper.py:1-42](libraries/python/getpatter/stt/whisper.py)
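
One plausible way to implement the buffering step is to collect PCM frames for the turn and wrap them in a WAV container before upload. The sketch below uses only the standard library. `UtteranceBuffer` is a hypothetical name, and the choice of WAV as the upload format is an assumption of this example rather than something the source confirms.

```python
import io
import wave


class UtteranceBuffer:
    """Collect 16-bit mono PCM for one turn and export it as WAV bytes."""

    def __init__(self, sample_rate: int = 16000) -> None:
        self.sample_rate = sample_rate
        self._pcm = bytearray()

    def append(self, pcm: bytes) -> None:
        """Add one chunk of raw little-endian 16-bit PCM."""
        self._pcm.extend(pcm)

    def to_wav(self) -> bytes:
        """Wrap the buffered PCM in a WAV header, ready to attach to a POST."""
        out = io.BytesIO()
        with wave.open(out, "wb") as w:
            w.setnchannels(1)   # mono
            w.setsampwidth(2)   # 16-bit samples
            w.setframerate(self.sample_rate)
            w.writeframes(bytes(self._pcm))
        return out.getvalue()

    def clear(self) -> None:
        """Reset the buffer for the next turn."""
        self._pcm.clear()


if __name__ == "__main__":
    buf = UtteranceBuffer()
    buf.append(b"\x00\x00" * 320)  # 20 ms of silence at 16 kHz
    print(len(buf.to_wav()))
```

Because nothing is sent until the turn ends, the transcription round-trip starts only after the caller stops speaking, which is why these adapters cannot produce interim results.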

---

### Soniox

**Transport:** Persistent WebSocket to `wss://stt-rt.soniox.com/transcribe-websocket`.

Soniox operates on a token-level streaming protocol: `is_final` tokens are accumulated into segments and flushed when an `<end>` / `<fin>` endpoint token arrives. The adapter supports automatic language identification alongside language hints and optional speaker diarization. Model `stt-rt-v4` is the current default.

```python
from getpatter.stt import soniox

stt = soniox.STT()                              # reads SONIOX_API_KEY
stt = soniox.STT(language_hints=["en", "it"])
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `stt-rt-v4` | Also `stt-rt-v3`, `stt-rt-v2` |
| `language_hints` | `None` | List of BCP-47 hints for auto-detection |
| `language_hints_strict` | `False` | Restrict detection to hints |
| `sample_rate` | `16000` | 8000, 16000, 24000 |
| `enable_speaker_diarization` | `False` | |
| `enable_language_identification` | `True` | |
| `max_endpoint_delay_ms` | `500` | |

Sources: [libraries/python/getpatter/stt/soniox.py:1-56](libraries/python/getpatter/stt/soniox.py), [libraries/python/getpatter/providers/soniox_stt.py:1-55](libraries/python/getpatter/providers/soniox_stt.py)
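
The token accumulation described for Soniox can be sketched as a small state machine. The token shape used here (dicts with `text` and `is_final` keys) and the `TokenSegmenter` name are assumptions for illustration; the adapter's internal representation may differ.

```python
from typing import Dict, List

ENDPOINT_TOKENS = {"<end>", "<fin>"}


class TokenSegmenter:
    """Accumulate final tokens and flush a segment on an endpoint token."""

    def __init__(self) -> None:
        self._parts: List[str] = []

    def feed(self, tokens: List[Dict]) -> List[str]:
        """Process one batch of tokens; return any segments completed by it."""
        segments: List[str] = []
        for tok in tokens:
            text = tok.get("text", "")
            if text in ENDPOINT_TOKENS:
                segment = "".join(self._parts).strip()
                self._parts.clear()
                if segment:  # an endpoint with nothing buffered yields no segment
                    segments.append(segment)
            elif tok.get("is_final"):
                self._parts.append(text)
            # Non-final tokens are provisional and are skipped in this sketch.
        return segments


if __name__ == "__main__":
    seg = TokenSegmenter()
    seg.feed([{"text": "Hello", "is_final": True}])
    print(seg.feed([{"text": " world", "is_final": True},
                    {"text": "<end>", "is_final": True}]))
```

Keeping only final tokens in the buffer means a flushed segment never contains text the provider could still revise.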

---

### Speechmatics

**Transport:** SDK WebSocket via `speechmatics-voice[smart]` (optional dependency; lazy import).

Speechmatics is the only STT adapter that depends on a vendor SDK rather than a bare `aiohttp`/`websockets` transport. The dependency is imported lazily so users who don't install the `speechmatics` extra can still import other Patter components.

Turn detection is configurable with four modes: `ADAPTIVE` (default), `FIXED`, `EXTERNAL`, and `SMART_TURN`. The adapter supports speaker diarization and partial transcripts.

```python
from getpatter.stt import speechmatics
from getpatter.stt.speechmatics import TurnDetectionMode

stt = speechmatics.STT()                             # reads SPEECHMATICS_API_KEY
stt = speechmatics.STT(turn_detection_mode=TurnDetectionMode.SMART_TURN)
```

**Install the optional dependency:**
```
pip install 'getpatter[speechmatics]'
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `language` | `"en"` | BCP-47 |
| `turn_detection_mode` | `ADAPTIVE` | `EXTERNAL`, `FIXED`, `ADAPTIVE`, `SMART_TURN` |
| `sample_rate` | `16000` | 8000, 16000, 44100 |
| `enable_diarization` | `False` | |
| `include_partials` | `True` | Emit interim transcripts |

Sources: [libraries/python/getpatter/stt/speechmatics.py:1-52](libraries/python/getpatter/stt/speechmatics.py), [libraries/python/getpatter/providers/speechmatics_stt.py:1-80](libraries/python/getpatter/providers/speechmatics_stt.py)

---

## TTS Providers

### Provider Summary

| Adapter module | `provider_key` | Transport | Default model | Default sample rate | Env var |
|---|---|---|---|---|---|
| `tts/elevenlabs.py` | `elevenlabs_ws` | WebSocket streaming per utterance | `eleven_flash_v2_5` | PCM 16 kHz | `ELEVENLABS_API_KEY` |
| `tts/cartesia.py` | `cartesia_tts` | HTTP streaming | `sonic-3` | PCM 16 kHz | `CARTESIA_API_KEY` |
| `tts/openai.py` | `openai_tts` | HTTP streaming | `tts-1` | PCM 16 kHz | `OPENAI_API_KEY` |
| `tts/lmnt.py` | `lmnt` | HTTP streaming | `blizzard` | PCM 16 kHz | `LMNT_API_KEY` |
| `tts/rime.py` | `rime` | HTTP streaming | `arcana` | PCM 16 kHz | `RIME_API_KEY` |
| `tts/inworld.py` | `inworld` | HTTP NDJSON streaming | `inworld-tts-2` | PCM 16 kHz | `INWORLD_API_KEY` |

---

### ElevenLabs

**Transport:** WebSocket streaming-input endpoint (`/v1/text-to-speech/{voice_id}/stream-input`), one WS per utterance. Saves ~50 ms/request vs the legacy HTTP REST endpoint by removing per-request setup time.

`auto_mode=True` (default) delegates chunk scheduling to ElevenLabs. `chunk_length_schedule` is accepted on the constructor but only takes effect when `auto_mode=False`. The `eleven_v3` model is **not** supported by the WS endpoint — use the HTTP REST variant (`ElevenLabsRestTTS`) for v3.

The `output_format` field is intentionally omitted from the default constructor path so that the internal `_output_format_explicit` flag remains `False`. This lets `set_telephony_carrier()` flip the format automatically from `pcm_16000` to `ulaw_8000` at call time when Twilio is detected, which avoids client-side resampling and the audio quality loss that comes with it.

```python
from getpatter.tts import elevenlabs

tts = elevenlabs.TTS()                                     # reads ELEVENLABS_API_KEY
tts = elevenlabs.TTS(voice_id="EXAVITQu4vr4xnSDxMaL", model_id="eleven_flash_v2_5")
tts_twilio = elevenlabs.TTS.for_twilio(api_key="...")      # ulaw_8000
tts_telnyx = elevenlabs.TTS.for_telnyx(api_key="...")      # pcm_16000
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `voice_id` | `EXAVITQu4vr4xnSDxMaL` | Default voice; accepts any ElevenLabs voice ID |
| `model_id` | `eleven_flash_v2_5` | Also `eleven_multilingual_v2`; NOT `eleven_v3` |
| `output_format` | _(carrier-derived)_ | `pcm_16000`, `ulaw_8000`, etc. |
| `language_code` | `None` | For multilingual models |
| `auto_mode` | `True` | ElevenLabs-managed chunking |
| `chunk_length_schedule` | `None` | Manual chunk scheduling (requires `auto_mode=False`) |

Sources: [libraries/python/getpatter/tts/elevenlabs.py:1-130](libraries/python/getpatter/tts/elevenlabs.py), [libraries/python/getpatter/providers/elevenlabs_ws_tts.py:1-80](libraries/python/getpatter/providers/elevenlabs_ws_tts.py)
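
The interaction between the `_output_format_explicit` flag and the carrier hook can be modeled in a few lines. This is a simplified stand-in: `FormatPolicy` is a hypothetical class, and the string argument accepted by `set_telephony_carrier()` here is an assumption about its signature.

```python
from typing import Optional


class FormatPolicy:
    """Stand-in for the output-format handling described above."""

    def __init__(self, output_format: Optional[str] = None) -> None:
        # The flag records whether the caller chose a format themselves.
        self._output_format_explicit = output_format is not None
        self.output_format = output_format or "pcm_16000"

    def set_telephony_carrier(self, carrier: str) -> None:
        """Switch to the carrier's native format unless the caller chose one."""
        if self._output_format_explicit:
            return  # never override an explicit choice
        if carrier == "twilio":
            self.output_format = "ulaw_8000"


if __name__ == "__main__":
    auto = FormatPolicy()
    auto.set_telephony_carrier("twilio")
    print(auto.output_format)

    pinned = FormatPolicy(output_format="pcm_16000")
    pinned.set_telephony_carrier("twilio")
    print(pinned.output_format)
```

Tracking explicitness separately from the value is what allows a default of `pcm_16000` and a caller-supplied `pcm_16000` to behave differently when the carrier is detected.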

---

### Cartesia (TTS)

**Transport:** HTTP streaming, ~90 ms TTFB quoted in the source docstring.

Default model is `sonic-3` (Cartesia's current GA model). Voice IDs from the prior `sonic-2` family remain compatible. Audio is returned as raw PCM bytes; `sample_rate` controls the output rate directly, making telephony configuration trivial: pass `sample_rate=8000` for Twilio (PCM 8 kHz) or leave `sample_rate=16000` for Telnyx.

```python
from getpatter.tts import cartesia

tts = cartesia.TTS()                        # reads CARTESIA_API_KEY
tts = cartesia.TTS(voice="f786b574-...", speed=1.2)
tts_twilio = cartesia.TTS.for_twilio()      # PCM 8 kHz
tts_telnyx = cartesia.TTS.for_telnyx()      # PCM 16 kHz
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `sonic-3` | GA model; `sonic-2` voice IDs still work |
| `voice` | `f786b574-daa5-4673-aa0c-cbe3e8534c02` | Katie (default) |
| `language` | `"en"` | BCP-47 |
| `sample_rate` | `16000` | 8000 or 16000 |
| `speed` | `None` | Optional string or float |

Sources: [libraries/python/getpatter/tts/cartesia.py:1-105](libraries/python/getpatter/tts/cartesia.py)

---

### OpenAI TTS

**Transport:** HTTP streaming via OpenAI's TTS endpoint.

Two model tiers: `tts-1` (default, lower latency) and `tts-1-hd` (higher quality). Six built-in voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`.

```python
from getpatter.tts import openai

tts = openai.TTS()                               # reads OPENAI_API_KEY
tts = openai.TTS(voice="nova", model="tts-1-hd")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `tts-1` | Also `tts-1-hd` |
| `voice` | `alloy` | `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer` |

Sources: [libraries/python/getpatter/tts/openai.py:1-38](libraries/python/getpatter/tts/openai.py)

---

### LMNT

**Transport:** HTTP streaming.

Uses the `blizzard` model by default and the `leah` voice. Raw PCM is returned via the `format="raw"` default. Language can be explicitly set or left `None` for auto-detection.

```python
from getpatter.tts import lmnt

tts = lmnt.TTS()                              # reads LMNT_API_KEY
tts = lmnt.TTS(voice="leah", model="blizzard")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `blizzard` | |
| `voice` | `leah` | |
| `language` | `None` | Optional BCP-47 |
| `format` | `raw` | Raw PCM output |
| `sample_rate` | `16000` | |

Sources: [libraries/python/getpatter/tts/lmnt.py:1-46](libraries/python/getpatter/tts/lmnt.py)

---

### Rime

**Transport:** HTTP streaming (Arcana / Mist model family).

```python
from getpatter.tts import rime

tts = rime.TTS()                              # reads RIME_API_KEY
tts = rime.TTS(speaker="astra", model="arcana")
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `arcana` | Also `mist` |
| `speaker` | `None` | Optional speaker ID |
| `lang` | `"eng"` | ISO 639-3 language code |
| `sample_rate` | `16000` | |

Sources: [libraries/python/getpatter/tts/rime.py:1-42](libraries/python/getpatter/tts/rime.py)

---

### Inworld

**Transport:** HTTP NDJSON streaming (`inworld-tts-2` model).

Inworld's adapter accepts a richer set of generation controls than other providers, including `temperature`, `speaking_rate`, and `delivery_mode`. Authentication uses `auth_token` (mapped from the `api_key` kwarg). The `audio_encoding` defaults to `PCM`.

```python
from getpatter.tts import inworld

tts = inworld.TTS()                                        # reads INWORLD_API_KEY
tts = inworld.TTS(voice="Olivia", temperature=0.8, speaking_rate=1.1)
```

**Key parameters:**

| Parameter | Default | Notes |
|---|---|---|
| `model` | `inworld-tts-2` | |
| `voice` | `Ashley` | |
| `language` | `None` | Optional BCP-47 |
| `audio_encoding` | `PCM` | |
| `sample_rate` | `16000` | |
| `bitrate` | `64000` | |
| `temperature` | `None` | Generation temperature |
| `speaking_rate` | `1.0` | |
| `delivery_mode` | `None` | |

Sources: [libraries/python/getpatter/tts/inworld.py:1-55](libraries/python/getpatter/tts/inworld.py)
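
For readers unfamiliar with NDJSON streaming, the sketch below shows the general technique: buffer incoming bytes, split on newlines, and decode one JSON object per line. The `audio` field name and the base64 encoding of the payload are assumptions for illustration only and are not taken from the Inworld adapter.

```python
import base64
import json
from typing import Iterable, Iterator


def iter_ndjson_audio(
    chunks: Iterable[bytes],
    audio_field: str = "audio",  # hypothetical field name
) -> Iterator[bytes]:
    """Yield decoded audio from an NDJSON byte stream split at arbitrary points."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        # A network chunk may contain zero, one, or several complete lines.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line.strip():
                yield base64.b64decode(json.loads(line)[audio_field])
    if buf.strip():  # final line without a trailing newline
        yield base64.b64decode(json.loads(buf)[audio_field])


if __name__ == "__main__":
    line = json.dumps({"audio": base64.b64encode(b"\x01\x02").decode()}) + "\n"
    raw = line.encode()
    print(list(iter_ndjson_audio([raw[:7], raw[7:]])))
```

Buffering until a newline arrives matters because HTTP chunk boundaries do not align with JSON object boundaries.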

---

## Streaming Contracts and Transport Comparison

```text
STT Adapters: transport at a glance
─────────────────────────────────────────────────────────────────────────────────────────
Provider           Transport         Interim results    Finalize control           Warmup
─────────────────────────────────────────────────────────────────────────────────────────
Deepgram           WS (websockets)   Yes                Finalize + CloseStream     Yes
AssemblyAI         WS (aiohttp)      Yes                ForceEndpoint + Terminate  Yes
Cartesia           WS (aiohttp)      Yes                finalize text frame        Yes
Soniox             WS (aiohttp)      Yes (token-level)  <end>/<fin> tokens         Yes
Speechmatics       SDK WS            Yes                External / adaptive        Yes
Whisper            HTTP POST         No (batch)         N/A                        No
OpenAI Transcribe  HTTP POST         No (batch)         N/A                        No
─────────────────────────────────────────────────────────────────────────────────────────
```

HTTP-backed adapters (Whisper, OpenAI Transcribe) buffer audio for the entire utterance and submit one POST request at end of turn. They do not produce interim results and have higher inherent latency. Streaming WebSocket adapters emit interim transcripts in real time, enabling faster barge-in detection.

All WebSocket adapters implement `warmup()` to pre-open a socket during the carrier ringing window. Only Cartesia STT additionally supports connection parking (`open_parked_connection` / `adopt_websocket`), which eliminates the full WS handshake on the first call turn rather than just pre-heating DNS/TLS.

Sources: [libraries/python/getpatter/providers/deepgram_stt.py:130-160](libraries/python/getpatter/providers/deepgram_stt.py), [libraries/python/getpatter/providers/assemblyai_stt.py:185-240](libraries/python/getpatter/providers/assemblyai_stt.py), [libraries/python/getpatter/providers/cartesia_stt.py:165-200](libraries/python/getpatter/providers/cartesia_stt.py)

---

## Telephony Shortcuts (`for_twilio` / `for_telnyx`)

Several adapters expose class-method shortcuts that pre-configure the right encoding and sample rate for telephony carriers, avoiding client-side resampling:

| Adapter | `for_twilio()` | `for_telnyx()` |
|---|---|---|
| `stt/deepgram` | mulaw, 8 kHz | — |
| `stt/assemblyai` | pcm_mulaw, 8 kHz | — |
| `tts/elevenlabs` | `ulaw_8000` output format | `pcm_16000` output format |
| `tts/cartesia` | PCM, 8 kHz | PCM, 16 kHz |

For ElevenLabs, `for_twilio()` explicitly sets `output_format="ulaw_8000"` and marks the format as caller-explicit so the carrier auto-flip hook does not override it at call time.

Sources: [libraries/python/getpatter/tts/elevenlabs.py:89-130](libraries/python/getpatter/tts/elevenlabs.py), [libraries/python/getpatter/tts/cartesia.py:64-105](libraries/python/getpatter/tts/cartesia.py)
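
The shortcut pattern itself is an alternate constructor that pins carrier-native audio settings. The sketch below shows the shape using the Deepgram values from the table. `SketchSTT` is a hypothetical stand-in, not the real adapter class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchSTT:
    """Hypothetical adapter config with a carrier-specific constructor."""

    api_key: Optional[str] = None
    encoding: str = "linear16"
    sample_rate: int = 16000

    @classmethod
    def for_twilio(cls, api_key: Optional[str] = None) -> "SketchSTT":
        # Twilio media streams carry 8 kHz mu-law, so request that format
        # directly instead of resampling on the client.
        return cls(api_key=api_key, encoding="mulaw", sample_rate=8000)


if __name__ == "__main__":
    print(SketchSTT())
    print(SketchSTT.for_twilio(api_key="dg_..."))
```

Putting the carrier defaults in a class method keeps them in one place and leaves the plain constructor free for non-telephony use.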

---

## Swapping Providers in Pipeline Mode

Because every STT adapter satisfies `STTProvider` and every TTS adapter satisfies `TTSProvider`, swapping is a constructor-level change only:

```python
# Before: Deepgram STT + ElevenLabs TTS
from getpatter.stt import deepgram
from getpatter.tts import elevenlabs

agent = Agent(
    stt=deepgram.STT(),
    tts=elevenlabs.TTS(),
    llm=...,
)

# After: Cartesia STT + Cartesia TTS (single-vendor, single API key)
from getpatter.stt import cartesia as cartesia_stt
from getpatter.tts import cartesia as cartesia_tts

agent = Agent(
    stt=cartesia_stt.STT(),
    tts=cartesia_tts.TTS(),
    llm=...,
)
```

The pipeline handler (`PipelineStreamHandler`) calls `connect()`, `send_audio()`, `receive_transcripts()`, and `close()` on whatever `STTProvider` instance is passed, and calls `synthesize()` / `close()` on the `TTSProvider` — no other code changes are required.

The `provider_key` class variable on each adapter that defines one (Speechmatics currently does not) is used for cost attribution and OTel metrics (`patter.stt.provider`, `patter.cost.stt_seconds`), so switching providers updates dashboards automatically with no code changes beyond the constructor.
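
To make the drop-in property concrete, the sketch below drives two unrelated fake TTS classes through the same consumer function. `TTSLike`, `FakeWordTTS`, `FakeSilenceTTS`, and `speak` are hypothetical names; the method set mirrors the `TTSProvider` contract described earlier, expressed here as a `typing.Protocol` purely for illustration.

```python
from __future__ import annotations

import asyncio
from typing import AsyncIterator, Protocol


class TTSLike(Protocol):
    """Structural view of the TTS contract: synthesize() and close()."""

    provider_key: str

    def synthesize(self, text: str) -> AsyncIterator[bytes]: ...

    async def close(self) -> None: ...


class FakeWordTTS:
    provider_key = "fake_word"

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        for word in text.split():
            yield word.encode()  # one chunk per word

    async def close(self) -> None:
        pass


class FakeSilenceTTS:
    provider_key = "fake_silence"

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        yield b"\x00" * len(text)  # a single chunk of silence

    async def close(self) -> None:
        pass


async def speak(tts: TTSLike, text: str) -> tuple[str, int]:
    """Stream audio from any conforming provider; return (key, byte count)."""
    total = 0
    try:
        async for chunk in tts.synthesize(text):
            total += len(chunk)
    finally:
        await tts.close()
    return tts.provider_key, total


if __name__ == "__main__":
    for provider in (FakeWordTTS(), FakeSilenceTTS()):
        print(asyncio.run(speak(provider, "hi there")))
```

The consumer never inspects the concrete class, so replacing one provider with another changes only the line that constructs it.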

---

## Known Limitations

| Provider | Limitation |
|---|---|
| **AssemblyAI** | `language` kwarg is ignored (a warning is emitted for non-default values); language behavior is controlled by `model`. Chunk coalescing is mandatory: raw 20 ms Twilio frames trigger server error 3007. |
| **Cartesia STT** | Only `pcm_s16le` encoding is accepted (no mulaw support). |
| **OpenAI Transcribe** | `whisper-1` is explicitly rejected; use `whisper.STT` instead. No interim results. |
| **Whisper** | Batch-only (no interim results); highest latency of all STT options. |
| **Speechmatics** | Requires optional `pip install 'getpatter[speechmatics]'`; vendor SDK dependency not shared with other adapters. |
| **ElevenLabs WS** | `eleven_v3` model is not supported on the WebSocket endpoint — use `ElevenLabsRestTTS` for v3. |
| **ElevenLabs WS** | `optimize_streaming_latency` is deprecated by ElevenLabs and not exposed. |

---

## Summary

Patter ships seven STT adapters and six TTS adapters behind a unified `STTProvider` / `TTSProvider` interface. Streaming WebSocket adapters (Deepgram, AssemblyAI, Cartesia, Soniox, Speechmatics) support real-time interim transcripts and explicit finalization hooks for low-latency barge-in. HTTP-buffered adapters (Whisper, OpenAI Transcribe) trade streaming for simplicity. The WebSocket adapters implement `warmup()` for pre-call latency optimization (the base-class default is a no-op), and telephony-specific `for_twilio()` / `for_telnyx()` shortcuts are available on the most common adapters to avoid client-side audio resampling. Because every adapter is a drop-in replacement at the constructor level, BYOK/BYOC provider selection requires no pipeline code changes beyond the instantiation line.
