Audio Models

Speech-to-text, text-to-speech, real-time audio, and the engineering behind voice AI

Audio is the modality where the gap between "demo" and "production" is widest. A Whisper transcription demo takes ten minutes to build; a reliable, low-latency speech pipeline that handles accents, background noise, speaker overlaps, and real-time interruption takes months. Here's what you actually need to know.

Speech-to-Text

Whisper (OpenAI) remains the default starting point. Key facts:

The open-source model (whisper-large-v3) is good enough for most use cases and runs locally.
OpenAI's hosted API spares you the GPU setup and returns formatted output with optional word-level timestamps (the open-source package can also produce word timestamps).
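For production at scale, faster-whisper (CTranslate2 backend) is reported by its maintainers to run up to about 4x faster than the reference implementation at the same accuracy, with lower memory use; quantization and batching can push throughput further.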
Deepgram, AssemblyAI, and Google Chirp are strong hosted alternatives with streaming support out of the box.

What to benchmark on your data: word error rate (WER) by language, by accent, by noise level. The aggregate numbers in blog posts don't predict your use case.
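
As a rough illustration of that kind of benchmark, the sketch below computes WER per slice of a test set. The function names and the (group, reference, hypothesis) tuple layout are assumptions made for this example, and the normalization is deliberately minimal (lowercase, whitespace split); a real evaluation would usually also strip punctuation and normalize numbers.

```python
from collections import defaultdict


def _words(text):
    # Minimal normalization (an assumption): lowercase, split on whitespace.
    return text.lower().split()


def word_edits(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            substitution))     # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = _words(reference)
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_edits(ref, _words(hypothesis)) / len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.

    Returns corpus-level WER per group: total edits / total reference words.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref = _words(reference)
        errors[group] += word_edits(ref, _words(hypothesis))
        words[group] += len(ref)
    return {group: errors[group] / words[group]
            for group in errors if words[group]}
```

Grouping by language, accent, or noise condition is just a matter of what goes in the first tuple field, for example `wer_by_group([("noisy", ref_text, hyp_text), ...])`.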

Scaling Speech-to-Text

The naive pattern — send the whole audio file, get the whole transcript back — breaks quickly:

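Long files: chunk audio into overlapping segments (30-60 s each), transcribe them in parallel, and stitch the results, deduplicating the words that fall in the overlap.
Streaming: use a streaming API, or run a VAD (voice activity detection) model to cut the incoming audio into utterances and transcribe each one as soon as it ends.
Cost: the Whisper API is cheap per minute, but volume adds up. At a list price on the order of $0.006 per minute (check current pricing), a million hours is 60 million minutes, or roughly $360,000. Self-hosting large-v3 on GPUs is often cheaper at that scale.
Language detection: if you handle multilingual audio, detect the language first and pass it explicitly. Whisper's auto-detection works, but it looks only at the first 30 s, so it can mislabel files that open with silence, music, or a different language.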
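
The long-file pattern can be sketched roughly as follows. This is a minimal illustration, not a production stitcher: `chunk_spans`, `stitch`, and `transcribe_long` are names invented here, `transcribe_span` is a hypothetical callable you would back with your own STT call, and the stitching heuristic (drop words repeated across the seam) assumes both chunks transcribed the overlap identically, which real models do not always do.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_spans(duration_s, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds covering the file with overlap."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    duration_s = float(duration_s)
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += chunk_s - overlap_s
    return spans


def stitch(transcripts, max_overlap_words=20):
    """Join chunk transcripts, dropping words duplicated at each seam."""
    merged = []
    for text in transcripts:
        words = text.split()
        limit = min(max_overlap_words, len(merged), len(words))
        drop = 0
        # Longest run where the tail of what we have equals the head of the
        # next chunk; those words were transcribed twice.
        for k in range(limit, 0, -1):
            if merged[-k:] == words[:k]:
                drop = k
                break
        merged.extend(words[drop:])
    return " ".join(merged)


def transcribe_long(duration_s, transcribe_span, chunk_s=30.0,
                    overlap_s=5.0, workers=4):
    """transcribe_span(start_s, end_s) -> str is supplied by the caller."""
    spans = chunk_spans(duration_s, chunk_s, overlap_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so transcripts stay in sequence.
        texts = list(pool.map(lambda span: transcribe_span(*span), spans))
    return stitch(texts)
```

Threads are used here on the assumption that each chunk is an I/O-bound API call; for a local GPU model, batching chunks through one process is usually the better fit.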

Text-to-Speech

The TTS landscape has exploded. Three tiers:

Frontier quality — ElevenLabs and OpenAI TTS produce speech that's nearly indistinguishable from human in short passages. Best for customer-facing voice.
Good and fast — Azure Neural TTS, Google Cloud TTS, Amazon Polly Neural — slightly less natural, but lower latency and cost. Good for notifications, IVR, accessibility.
Open-source — Bark (Suno), XTTS (Coqui), Parler-TTS — self-hostable, fine-tunable, improving fast. Quality is usable for internal tools but not yet at frontier level for production voice.

Key engineering concerns for TTS:

Latency. Time-to-first-byte matters for conversational UX. Streaming TTS APIs let you start playing audio before generation finishes.
Voice cloning. Most frontier TTS supports custom voices from a few minutes of reference audio. Legal and ethical implications are real — get consent, document provenance.
Prosody control. SSML or model-specific tags let you control emphasis, pauses, and speed. Critical for natural-sounding output.
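
To make the prosody point concrete, here is a small sketch that builds an SSML string using the standard `emphasis`, `prosody`, and `break` elements. The helper names are hypothetical, and vendor support is an assumption to verify: providers differ in which SSML elements and attribute values they accept, and some require extra attributes on the `speak` root.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_sentence(text, rate=None, emphasis=None, pause_after_ms=0):
    """Wrap one sentence in optional emphasis, rate, and trailing pause."""
    body = escape(text)  # escape &, <, > so the markup stays well-formed
    if emphasis:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    if rate:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if pause_after_ms:
        body += f'<break time="{int(pause_after_ms)}ms"/>'
    return body


def ssml_document(*parts):
    """Join sentence fragments under a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


doc = ssml_document(
    ssml_sentence("Your order has shipped.", pause_after_ms=400),
    ssml_sentence("It arrives Tuesday & Wednesday.",
                  rate="slow", emphasis="strong"),
)
```

The resulting string would be passed as the input to whichever TTS endpoint you use, provided that endpoint accepts SSML rather than plain text.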

Real-Time Audio

The big shift in 2024-2025: models that handle speech natively in real-time, skipping the STT→LLM→TTS cascade.

OpenAI Realtime API — WebSocket-based, speech-in speech-out, supports interruption and turn-taking. The first widely available end-to-end speech model API.
Gemini Live — similar real-time speech capability, integrated with Gemini's multimodal context.
Open-source alternatives — emerging but not yet production-ready for most use cases.

When to use cascaded (STT→LLM→TTS) vs end-to-end:

Cascaded wins when you need to inspect, log, or modify the text between stages. Better for compliance, debugging, and hybrid workflows.
End-to-end wins on latency and naturalness. Better for conversational agents, real-time translation, voice assistants.
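
The inspect-and-modify advantage of the cascade can be illustrated with a short sketch. Everything here is hypothetical scaffolding: `run_cascade`, `CascadeTurn`, and the `inspect` hook are names made up for the example, and the three stage functions are stand-ins for whatever STT, LLM, and TTS services you actually call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CascadeTurn:
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_cascade(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
    inspect: Optional[Callable[[str, str], str]] = None,
) -> CascadeTurn:
    """Run one STT -> LLM -> TTS turn.

    inspect(stage, text) is called on the text between stages and returns
    the (possibly modified) text, e.g. for logging, redaction, or blocking.
    """
    transcript = stt(audio)
    if inspect:
        transcript = inspect("transcript", transcript)
    reply = llm(transcript)
    if inspect:
        reply = inspect("reply", reply)
    return CascadeTurn(transcript, reply, tts(reply))


# Toy stand-ins so the sketch runs without any external service.
audit_log = []


def log_and_redact(stage, text):
    audit_log.append((stage, text))
    return text.replace("secret", "[redacted]")


turn = run_cascade(
    b"my secret code",
    stt=lambda audio: audio.decode(),
    llm=lambda text: f"You said: {text}",
    tts=lambda text: text.encode(),
    inspect=log_and_redact,
)
```

An end-to-end speech model has no equivalent seam: there is no intermediate text to hand to a hook like this unless the API separately exposes a transcript.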

Speaker Diarization

Knowing who said what is essential for meeting transcription, call analytics, and multi-party conversations.

pyannote.audio is the leading open-source diarization toolkit. Pair it with Whisper for transcription + speaker labels.
Hosted services (AssemblyAI, Deepgram, Google) bundle diarization into their transcription APIs.
The hard cases: overlapping speech, very short turns, speakers with similar voices. Expect diarization error rates of 5-15% even with good models.
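
For reference, diarization error rate is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. A minimal sketch of that arithmetic follows; the function name is made up here and the durations in the example are invented for illustration, not measurements.

```python
def diarization_error_rate(missed_s, false_alarm_s, confusion_s,
                           total_speech_s):
    """DER = (missed + false alarm + speaker confusion) / total speech."""
    if total_speech_s <= 0:
        raise ValueError("total_speech_s must be positive")
    return (missed_s + false_alarm_s + confusion_s) / total_speech_s


# Illustrative numbers only: a 10-minute call (600 s of speech) with
# 12 s missed, 6 s of false alarms, and 18 s attributed to the wrong speaker.
der = diarization_error_rate(12.0, 6.0, 18.0, 600.0)
```

With those illustrative inputs the rate is 36 s of error over 600 s of speech, or 6%, which sits at the low end of the range quoted above.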

Pipeline pattern: VAD → diarization → per-speaker segments → transcription → merge. Running diarization before transcription (rather than after) often gives better results because the transcription model receives cleaner, single-speaker segments.
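
The later stages of that pattern (per-speaker segments, transcription, merge) might look roughly like the sketch below. It assumes the diarizer has already produced speaker turns; `Turn`, `merge_turns`, and `transcribe_by_speaker` are names invented for this example, and `transcribe_span` is a hypothetical callable wrapping your STT model.

```python
from typing import Callable, List, NamedTuple, Tuple


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def merge_turns(turns: List[Turn], max_gap_s: float = 0.5) -> List[Turn]:
    """Merge consecutive turns by the same speaker separated by a short gap."""
    merged: List[Turn] = []
    for turn in sorted(turns, key=lambda t: t.start):
        last = merged[-1] if merged else None
        if (last and last.speaker == turn.speaker
                and turn.start - last.end <= max_gap_s):
            merged[-1] = Turn(last.speaker, last.start,
                              max(last.end, turn.end))
        else:
            merged.append(turn)
    return merged


def transcribe_by_speaker(
    turns: List[Turn],
    transcribe_span: Callable[[float, float], str],
    max_gap_s: float = 0.5,
) -> List[Tuple[str, float, float, str]]:
    """Transcribe each merged single-speaker segment, in time order."""
    lines = []
    for turn in merge_turns(turns, max_gap_s):
        text = transcribe_span(turn.start, turn.end).strip()
        if text:  # skip segments where the model returned nothing
            lines.append((turn.speaker, turn.start, turn.end, text))
    return lines
```

Merging short same-speaker gaps before transcription is a design choice in this sketch: it gives the transcription model longer segments with more context, at the cost of including brief silences.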

Audio Embeddings

Not everything is about transcription. Audio embeddings let you search, cluster, and classify audio by semantic content:

CLAP (Contrastive Language-Audio Pretraining) — joint text-audio embedding space. Search audio with text queries.
Speaker embeddings — represent a speaker's voice as a vector for identification and verification.
Music and environmental sound embeddings — for content tagging, similarity search, audio fingerprinting.

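Use case example: build a searchable archive of podcast episodes where you can type a text query and get timestamped results, even without a full transcript. Text-audio models such as CLAP are strongest on acoustic queries ("audience laughter", "phone ringing"); for topical queries like "discussion about GPU pricing", expect to combine them with transcript-based search.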
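
The retrieval step of such an archive reduces to nearest-neighbor search over embedding vectors. The sketch below shows that step only, in plain Python; the function names and the (episode, start_s, vector) index layout are assumptions for the example, and the three-dimensional vectors are toy placeholders, not real model output. In practice the vectors would come from an audio encoder for the index and a matching text encoder for the query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """index: list of (episode, start_s, vector) for each audio window.

    Returns the top_k (score, episode, start_s) tuples, best first.
    """
    scored = [(cosine(query_vec, vec), episode, start_s)
              for episode, start_s, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy placeholder vectors standing in for real embeddings.
index = [
    ("ep1", 0.0, [1.0, 0.0, 0.0]),
    ("ep1", 30.0, [0.0, 1.0, 0.0]),
    ("ep2", 0.0, [0.7, 0.7, 0.0]),
]
hits = search([1.0, 0.1, 0.0], index, top_k=2)
```

The same cosine comparison underlies speaker verification: embed two utterances and accept the match if the similarity clears a threshold tuned on your own data. A linear scan like this is fine for thousands of windows; larger archives typically move to an approximate nearest-neighbor index.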