Audio Models
Speech-to-text, text-to-speech, real-time audio, and the engineering behind voice AI
Audio is the modality where the gap between "demo" and "production" is widest. A Whisper transcription demo takes ten minutes to build; a reliable, low-latency speech pipeline that handles accents, background noise, speaker overlaps, and real-time interruption takes months. Here's what you actually need to know.
Speech-to-Text
Whisper (OpenAI) remains the default starting point. Key facts:
- The open-source model (whisper-large-v3) is good enough for most use cases and runs locally.
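- OpenAI's hosted API saves you from running the model yourself and can return word-level timestamps on request. The open-source package can produce word timestamps too (word_timestamps=True), so compare punctuation and formatting on your own audio rather than assuming the hosted output is better.
- For production at scale, Faster Whisper (faster-whisper, CTranslate2 backend) is reported by its maintainers to run up to about 4x faster than the reference implementation at the same accuracy while using less memory; quantization and batching can push throughput further.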
- Deepgram, AssemblyAI, and Google Chirp are strong hosted alternatives with streaming support out of the box.
What to benchmark on your data: word error rate (WER) by language, by accent, by noise level. The aggregate numbers in blog posts don't predict your use case.
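A minimal sketch of that benchmark, assuming you already have reference transcripts and model output tagged with a group label such as accent or noise level. The helper names (word_errors, wer_by_group) are made up for illustration, and the normalization here is only lowercasing and whitespace splitting; real evaluations usually also strip punctuation and normalize numbers.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis). Returns {group: WER}."""
    totals = defaultdict(lambda: [0, 0])  # group -> [errors, reference words]
    for group, reference, hypothesis in samples:
        errors, n_words = word_errors(reference, hypothesis)
        totals[group][0] += errors
        totals[group][1] += n_words
    # Pool errors and words per group so long utterances weigh more than short ones.
    return {g: e / n for g, (e, n) in totals.items() if n > 0}


samples = [
    ("quiet", "turn the lights off", "turn the lights off"),
    ("noisy", "turn the lights off", "turn the light of"),
]
print(wer_by_group(samples))
```

Reporting one number per group, instead of one number overall, is what exposes the accent or noise condition where the model falls apart.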
Scaling Speech-to-Text
The naive pattern — send the whole audio file, get the whole transcript back — breaks quickly:
- Long files — chunk audio into segments (30s-60s with overlap), transcribe in parallel, stitch results.
- Streaming — use a streaming API or a VAD (voice activity detection) model to transcribe in near-real-time as audio arrives.
- Cost — Whisper API is cheap per minute, but a million hours of audio adds up. Self-hosting large-v3 on GPUs is often cheaper at scale.
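- Language detection: if you handle multilingual audio, detect the language first and pass it explicitly. Whisper's auto-detection works, but it relies on the first 30 seconds of audio, so it can mislabel files that open with silence, music, or a different language.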
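The long-file pattern above can be sketched as follows. This is an illustration rather than a reference implementation: chunk_spans and stitch are hypothetical helpers, and stitch assumes each chunk's transcript comes back as words with timestamps already converted to absolute seconds. Each overlap is cut at its midpoint, and a word is kept only from the chunk that owns the word's midpoint, so duplicated words in the overlap appear once.

```python
def chunk_spans(duration_s, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) spans in seconds covering [0, duration_s] with overlap."""
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    duration_s = float(duration_s)
    step = chunk_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans


def stitch(chunks):
    """chunks: time-ordered list of ((start, end), words).

    words is a list of (text, word_start, word_end) in absolute seconds.
    """
    kept = []
    last = len(chunks) - 1
    for i, ((start, end), words) in enumerate(chunks):
        # Ownership boundaries: the midpoint of the overlap with each neighbour.
        lo = float("-inf") if i == 0 else (start + chunks[i - 1][0][1]) / 2
        hi = float("inf") if i == last else (end + chunks[i + 1][0][0]) / 2
        for text, w_start, w_end in words:
            mid = (w_start + w_end) / 2
            if lo <= mid < hi:
                kept.append(text)
    return " ".join(kept)


spans = chunk_spans(70, chunk_s=30.0, overlap_s=5.0)
print(spans)
```

The chunks themselves are independent, so the transcription calls between chunk_spans and stitch can run in parallel.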
Text-to-Speech
The TTS landscape has exploded. Three tiers:
- Frontier quality — ElevenLabs and OpenAI TTS produce speech that's nearly indistinguishable from human in short passages. Best for customer-facing voice.
- Good and fast — Azure Neural TTS, Google Cloud TTS, Amazon Polly Neural — slightly less natural, but lower latency and cost. Good for notifications, IVR, accessibility.
- Open-source — Bark (Suno), XTTS (Coqui), Parler-TTS — self-hostable, fine-tunable, improving fast. Quality is usable for internal tools but not yet at frontier level for production voice.
Key engineering concerns for TTS:
- Latency. Time-to-first-byte matters for conversational UX. Streaming TTS APIs let you start playing audio before generation finishes.
- Voice cloning. Most frontier TTS supports custom voices from a few minutes of reference audio. Legal and ethical implications are real — get consent, document provenance.
- Prosody control. SSML or model-specific tags let you control emphasis, pauses, and speed. Critical for natural-sounding output.
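As a rough illustration of prosody control, the sketch below builds a small SSML string with standard elements (speak, emphasis, prosody, break). The helper names are hypothetical, and which tags and attribute values a given voice honors varies by provider, so check the vendor's SSML support before relying on any of them.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_sentence(text, emphasis=None, rate=None, pause_after_ms=0):
    """Wrap one sentence in optional emphasis/prosody tags and add a trailing pause."""
    body = escape(text)  # escape &, <, > so user text cannot break the markup
    if emphasis:
        body = f"<emphasis level={quoteattr(emphasis)}>{body}</emphasis>"
    if rate:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    if pause_after_ms:
        body += f'<break time="{int(pause_after_ms)}ms"/>'
    return body


def ssml_document(parts):
    """Join sentence fragments under a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"


ssml = ssml_document([
    ssml_sentence("Your order has shipped.", pause_after_ms=300),
    ssml_sentence("It arrives Tuesday.", emphasis="strong", rate="slow"),
])
print(ssml)
```

Escaping the text before wrapping it matters in practice: LLM output often contains characters like & or <, which would otherwise produce invalid SSML and a failed synthesis request.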
Real-Time Audio
The big shift in 2024-2025: models that handle speech natively in real-time, skipping the STT→LLM→TTS cascade.
- OpenAI Realtime API: WebSocket-based, speech-in speech-out, supports interruption and turn-taking. Among the first widely available end-to-end speech model APIs.
- Gemini Live — similar real-time speech capability, integrated with Gemini's multimodal context.
- Open-source alternatives — emerging but not yet production-ready for most use cases.
When to use cascaded (STT→LLM→TTS) vs end-to-end:
- Cascaded wins when you need to inspect, log, or modify the text between stages. Better for compliance, debugging, and hybrid workflows.
- End-to-end wins on latency and naturalness. Better for conversational agents, real-time translation, voice assistants.
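The inspectability argument for the cascade can be sketched like this. The three stages are passed in as plain callables because the actual STT, LLM, and TTS clients depend on your vendors; run_cascaded_turn, redact_card_numbers, and the stub stages are all hypothetical names used only for illustration.

```python
import re


def redact_card_numbers(text):
    """Example text filter: mask 13-16 digit runs that look like card numbers."""
    return re.sub(r"\b\d{13,16}\b", "[REDACTED]", text)


def run_cascaded_turn(audio, stt, llm, tts, text_filter=lambda s: s, audit_log=None):
    """Run one STT -> LLM -> TTS turn, filtering and logging the text at each hop."""
    if audit_log is None:
        audit_log = []
    heard = text_filter(stt(audio))
    audit_log.append({"stage": "stt", "text": heard})
    reply = text_filter(llm(heard))
    audit_log.append({"stage": "llm", "text": reply})
    return tts(reply), audit_log


# Stub stages standing in for real model calls.
def fake_stt(audio):
    return "my card is 4111111111111111"


def fake_llm(prompt):
    return "You said: " + prompt


def fake_tts(text):
    return text.encode("utf-8")


speech, log = run_cascaded_turn(
    b"\x00\x01", fake_stt, fake_llm, fake_tts, text_filter=redact_card_numbers
)
print(log)
```

An end-to-end speech model offers no equivalent hook between hearing and speaking, which is why compliance-heavy workflows tend to stay cascaded.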
Speaker Diarization
Knowing who said what is essential for meeting transcription, call analytics, and multi-party conversations.
- pyannote.audio is the most widely used open-source diarization toolkit. Pair it with Whisper to get transcription plus speaker labels.
- Hosted services (AssemblyAI, Deepgram, Google) bundle diarization into their transcription APIs.
- The hard cases: overlapping speech, very short turns, speakers with similar voices. Expect a diarization error rate (DER) of roughly 5-15% even with good models, and measure it on your own recordings.
Pipeline pattern: VAD → diarization → per-speaker segments → transcription → merge. Running diarization before transcription (rather than after) often gives better results because the transcription model gets cleaner, single-speaker segments.
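A minimal sketch of the diarize-first pattern, assuming the diarization step has already produced (speaker, start, end) turns in seconds. The transcription call is injected as a callable, since it depends on your STT backend; merge_turns, transcribe_by_speaker, and the max_gap_s default are hypothetical choices for illustration.

```python
def merge_turns(turns, max_gap_s=0.5):
    """Merge consecutive turns by the same speaker separated by a short gap."""
    merged = []
    for speaker, start, end in sorted(turns, key=lambda t: t[1]):
        if merged and merged[-1][0] == speaker and start - merged[-1][2] <= max_gap_s:
            prev_speaker, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_speaker, prev_start, max(prev_end, end))
        else:
            merged.append((speaker, start, end))
    return merged


def transcribe_by_speaker(turns, transcribe_span, max_gap_s=0.5):
    """Transcribe each merged turn; returns (speaker, start, end, text) in time order."""
    return [
        (speaker, start, end, transcribe_span(start, end))
        for speaker, start, end in merge_turns(turns, max_gap_s)
    ]


# Stub transcriber: a real one would slice the audio to [start, end] and run STT.
def fake_transcribe(start, end):
    return f"<speech {start:.1f}-{end:.1f}s>"


turns = [("A", 0.0, 2.0), ("A", 2.2, 4.0), ("B", 4.1, 6.0), ("A", 6.5, 7.0)]
for row in transcribe_by_speaker(turns, fake_transcribe):
    print(row)
```

Merging short same-speaker gaps before transcription avoids feeding the STT model tiny fragments, which tend to transcribe poorly without surrounding context.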
Audio Embeddings
Not everything is about transcription. Audio embeddings let you search, cluster, and classify audio by semantic content:
- CLAP (Contrastive Language-Audio Pretraining) — joint text-audio embedding space. Search audio with text queries.
- Speaker embeddings — represent a speaker's voice as a vector for identification and verification.
- Music and environmental sound embeddings — for content tagging, similarity search, audio fingerprinting.
Use case example: build a searchable archive of podcast episodes where you can query "discussion about GPU pricing" and get timestamped results — even without a full transcript.
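The retrieval step of such an archive can be sketched as a cosine-similarity search over precomputed vectors. Everything here is illustrative: the three-dimensional vectors are toy values, and in a real system both the audio windows and the text query would be embedded by a joint text-audio model such as CLAP and stored in a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, index, top_k=3):
    """index: list of (episode, start_s, vector). Returns the best-scoring windows."""
    scored = [
        (cosine(query_vec, vec), episode, start_s) for episode, start_s, vec in index
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:top_k]


# Toy index: one vector per 30-second audio window (values are made up).
index = [
    ("ep1", 0, [1.0, 0.0, 0.0]),
    ("ep1", 30, [0.0, 1.0, 0.0]),
    ("ep2", 60, [0.9, 0.1, 0.0]),
]
query = [1.0, 0.0, 0.0]  # stands in for the embedded text query
for score, episode, start_s in search(query, index, top_k=2):
    print(f"{episode} @ {start_s}s  score={score:.3f}")
```

Storing the window start time next to each vector is what turns a similarity hit into a timestamped result.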