762MS MOUTH-TO-EAR: BREAKING DOWN VOICE AI LATENCY
762ms total. STT: 298ms. LLM first token: 370ms. TTS first audio: 75ms. Transport: 19ms. That's good voice AI latency in 2026. Anything under 800ms feels natural to callers. Above 1.2s, you start getting hang-ups. Here's how to measure every stage and which provider swaps actually move the needle.
THE PIPELINE STAGES
Latency in a cascaded voice pipeline (STT → LLM → TTS) is the sum of four components: STT, LLM time to first token, TTS time to first audio, and transport. Knowing which one is your bottleneck tells you which optimization to pursue.
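To make the arithmetic concrete, here is a minimal sketch that sums the four figures quoted at the top of this article and picks out the largest one. The dictionary keys and the function names are illustrative, not part of any SDK. The 800ms and 1.2s thresholds come from the intro; the "noticeable" label for the band in between is an assumption, since the article only names the two ends.

```python
# Stage figures quoted in the intro, in milliseconds.
# The key names are illustrative, not an official schema.
budget_ms = {
    "stt": 298,
    "llm_first_token": 370,
    "tts_first_audio": 75,
    "transport": 19,
}


def summarize(budget):
    """Return (total, bottleneck stage, percentage share per stage)."""
    total = sum(budget.values())
    bottleneck = max(budget, key=budget.get)
    shares = {stage: round(100 * ms / total, 1) for stage, ms in budget.items()}
    return total, bottleneck, shares


def classify(total_ms):
    """Map a mouth-to-ear total onto the thresholds from the intro."""
    if total_ms < 800:
        return "natural"
    if total_ms > 1200:
        return "hang-up risk"
    return "noticeable"  # assumed label for the 800-1200ms band


total, bottleneck, shares = summarize(budget_ms)
print(total, bottleneck, classify(total))
print(shares)
```

With these numbers the LLM stage comes out as the largest share, which is why the sections below spend the most time on it.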
[STT LATENCY]
Time from audio_received to transcript.final. Dominated by model inference and streaming chunking. Deepgram nova-2 streams partials, and the final transcript arrives ~300ms after utterance end. AssemblyAI Universal-3 is comparable, with better accuracy on accents.
[LLM LATENCY]
Time from transcript.final to llm.first_token. This is the dominant bottleneck. GPT-4o-mini: ~350ms. GPT-4o: ~500ms. Claude Haiku: ~280ms. Reducing prompt length and avoiding tool calls in the hot path are the highest-leverage optimizations.
[TTS LATENCY]
Time from first LLM token to first audio frame. ElevenLabs Flash v2.5: ~75ms. Cartesia: ~90ms. This is the smallest of the three model stages and the hardest to move. Don't optimize here first.
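A rough way to see why the LLM stage deserves attention first: the sketch below compares swaps using only the approximate figures quoted in the stages above. It assumes stages are additive and independent, which is a simplification, and the table keys and the swap_delta helper are hypothetical names made up for illustration.

```python
# Approximate figures quoted in the stage descriptions above (milliseconds).
LLM_FIRST_TOKEN_MS = {"gpt-4o-mini": 350, "gpt-4o": 500, "claude-haiku": 280}
TTS_FIRST_AUDIO_MS = {"elevenlabs-flash-v2.5": 75, "cartesia": 90}


def swap_delta(table, current, candidate):
    """Estimated change in end-to-end latency when swapping one provider.

    Negative means faster. Assumes the other stages are unaffected.
    """
    return table[candidate] - table[current]


llm_gain = swap_delta(LLM_FIRST_TOKEN_MS, "gpt-4o", "claude-haiku")
tts_cost = swap_delta(TTS_FIRST_AUDIO_MS, "elevenlabs-flash-v2.5", "cartesia")
print(f"LLM swap: {llm_gain:+d}ms, TTS swap: {tts_cost:+d}ms")
```

On these quoted figures, the spread between LLM options is more than ten times the spread between TTS options. Treat the result as an estimate to check against measured sessions, not a guarantee.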
MEASURING WITH HZRELAY
Every HZRelay session records millisecond timestamps at each stage transition. Call session.getMetrics() or hit the REST endpoint after a call.
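As a sketch of what to do with those timestamps, the function below turns a flat mapping of event name to millisecond timestamp into per-stage durations, following the stage definitions above. The shape of the mapping is an assumption, not the documented return value of session.getMetrics(). The event name tts.first_audio is also hypothetical; audio_received, transcript.final, and llm.first_token are the names used in this article. Transport is left out because it cannot be derived from these four events alone.

```python
# (stage, start event, end event), per the stage definitions in this article.
# "tts.first_audio" is a hypothetical event name used for illustration.
STAGES = [
    ("stt", "audio_received", "transcript.final"),
    ("llm", "transcript.final", "llm.first_token"),
    ("tts", "llm.first_token", "tts.first_audio"),
]


def stage_durations(timestamps):
    """Convert event timestamps (ms) into per-stage durations (ms)."""
    durations = {}
    for stage, start, end in STAGES:
        delta = timestamps[end] - timestamps[start]  # KeyError if an event is missing
        if delta < 0:
            raise ValueError(f"{end} precedes {start}; check clock sources")
        durations[stage] = delta
    return durations


# Example timestamps chosen to reproduce the figures from the intro.
example = {
    "audio_received": 0,
    "transcript.final": 298,
    "llm.first_token": 668,
    "tts.first_audio": 743,
}
durations = stage_durations(example)
print(durations)
```

Failing loudly on a negative duration is a deliberate choice: out-of-order timestamps usually mean events were stamped by different clocks, and silently clamping them to zero would hide that.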
PROVIDER SWAP IMPACT
Not every provider swap moves the needle equally. Based on real sessions through HZRelay, here's what actually changes end-to-end latency when you switch providers.