[ENGINEERING] 2026-04-12 · 11 min read

762MS MOUTH-TO-EAR: BREAKING DOWN VOICE AI LATENCY

[VOICE · OBSERVABILITY]

762ms total. STT: 298ms. LLM first token: 370ms. TTS first audio: 75ms. Transport: 19ms. That's a good mouth-to-ear latency for a voice AI call in 2026. Anything under 800ms feels natural to callers. Above 1.2s, you start getting hang-ups. Here's how to measure every stage and which provider swaps actually move the needle.

THE PIPELINE STAGES

Latency in a cascaded voice pipeline (STT → LLM → TTS) is the sum of four components: STT, LLM time to first token, TTS time to first audio, and transport. Knowing which one is your bottleneck tells you which optimization to pursue.
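As a rough sketch of that sum, the snippet below adds up the four stage figures quoted in the introduction and reports the slowest stage and its share. The names `STAGE_MS`, `total_latency`, `bottleneck`, and `rate_latency` are illustrative and not part of any SDK. The 800ms and 1.2s thresholds are the ones quoted in the introduction; the label for the range between them is an assumption.

```python
# Stage durations in milliseconds, taken from the example call in the introduction.
STAGE_MS = {"stt": 298, "llm": 370, "tts": 75, "transport": 19}


def total_latency(stages: dict[str, int]) -> int:
    """Mouth-to-ear latency is the plain sum of the stage durations."""
    return sum(stages.values())


def bottleneck(stages: dict[str, int]) -> tuple[str, float]:
    """Return the slowest stage and its share of the total (0.0 to 1.0)."""
    name = max(stages, key=stages.get)
    return name, stages[name] / total_latency(stages)


def rate_latency(total_ms: int) -> str:
    """Bucket a total using the thresholds quoted in this article."""
    if total_ms < 800:
        return "natural"
    if total_ms > 1200:
        return "hang-up risk"
    return "noticeable"  # assumed label; the article does not name this range


if __name__ == "__main__":
    total = total_latency(STAGE_MS)
    name, share = bottleneck(STAGE_MS)
    print(f"total={total}ms bottleneck={name} share={share:.1%} rating={rate_latency(total)}")
```

For the example call this reports a 762ms total with the LLM as the largest stage, at a little under half of the budget.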

[STT LATENCY]

Time from audio_received to transcript.final. Dominated by model inference and by how the audio stream is chunked. Deepgram nova-2 streams partial transcripts; the final transcript arrives ~300ms after the utterance ends. AssemblyAI Universal-3 is comparable, with better accuracy on accented speech.

[LLM LATENCY]

Time from transcript.final to llm.first_token. This is the dominant bottleneck: 370ms of the 762ms in the example call, roughly 49% of the total. Time to first token by model: GPT-4o-mini ~350ms, GPT-4o ~500ms, Claude Haiku ~280ms. Shortening the prompt and keeping tool calls out of the hot path are the highest-leverage optimizations.
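If you want to sanity-check time to first token outside HZRelay, one minimal approach is to time how long a streaming response takes to yield its first item. The sketch below is provider-agnostic: `time_to_first_token` is a hypothetical helper, and `fake_llm_stream` is a stand-in for a real streaming client, which is not shown. With a real client, the request should be sent inside the timed region so that network time is included.

```python
import time
from typing import Iterable, Iterator


def time_to_first_token(stream: Iterable[str]) -> tuple[float, Iterator[str]]:
    """Measure milliseconds until the stream yields its first item.

    Returns the elapsed time and an iterator that replays the first token
    followed by the rest, so the caller loses nothing.
    """
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first token arrives
    elapsed_ms = (time.perf_counter() - start) * 1000.0

    def replay() -> Iterator[str]:
        yield first
        yield from iterator

    return elapsed_ms, replay()


def fake_llm_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a provider's streaming response (hypothetical)."""
    time.sleep(delay_s)  # simulated time to first token
    yield from ["Hello", ",", " caller"]


if __name__ == "__main__":
    ttft_ms, tokens = time_to_first_token(fake_llm_stream())
    print(f"first token after {ttft_ms:.0f}ms: {''.join(tokens)!r}")
```

Because the generator does no work until `next()` is called, the simulated delay falls inside the timed region, which mirrors what you want from a real request.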

[TTS LATENCY]

Time from first LLM token to first audio frame. ElevenLabs Flash v2.5: ~75ms. Cartesia: ~90ms. This is the smallest of the three model stages and the hardest to move. Don't optimize here first.

MEASURING WITH HZRELAY

Every HZRelay session records millisecond timestamps at each stage transition. Call session.getMetrics() or hit the REST endpoint after a call. In the example response, each value is an offset in milliseconds from audio_received.

metrics — GET /voice/metrics?session_id=a3f9b2c1
{
  "audio_received_ms": 0,
  "stt_start_ms": 12,          // deepgram WS open
  "stt_final_ms": 310,         // final transcript
  "llm_start_ms": 312,         // openai request sent
  "llm_first_token_ms": 680,   // first token received
  "tts_start_ms": 685,         // elevenlabs text sent
  "tts_first_audio_ms": 760,   // first PCM frame
  "audio_sent_ms": 762         // caller hears it
}
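To turn those timestamps into the per-stage numbers from the introduction, subtract adjacent markers. The sketch below is one way to do it. The stage boundaries are an assumption chosen because they reproduce the headline figures (298 / 370 / 75 / 19); the article does not spell out the exact boundaries, and `stage_breakdown` is a hypothetical helper, not an HZRelay function. The inline comments are dropped from the payload so that it parses as strict JSON.

```python
import json

# Example payload from the article, without the inline comments.
RAW = """
{
  "audio_received_ms": 0,
  "stt_start_ms": 12,
  "stt_final_ms": 310,
  "llm_start_ms": 312,
  "llm_first_token_ms": 680,
  "tts_start_ms": 685,
  "tts_first_audio_ms": 760,
  "audio_sent_ms": 762
}
"""


def stage_breakdown(m: dict[str, int]) -> dict[str, int]:
    """Split a session's timestamps into per-stage durations (ms)."""
    stt = m["stt_final_ms"] - m["stt_start_ms"]
    llm = m["llm_first_token_ms"] - m["stt_final_ms"]
    tts = m["tts_first_audio_ms"] - m["tts_start_ms"]
    total = m["audio_sent_ms"] - m["audio_received_ms"]
    # Assumption: whatever the three model stages do not account for
    # (connection setup, hand-offs, final send) is counted as transport.
    transport = total - stt - llm - tts
    return {"stt": stt, "llm": llm, "tts": tts,
            "transport": transport, "total": total}


if __name__ == "__main__":
    print(stage_breakdown(json.loads(RAW)))
```

On this payload the breakdown is 298ms STT, 370ms LLM, 75ms TTS, and 19ms of transport and hand-off overhead, for 762ms in total.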

PROVIDER SWAP IMPACT

Not every provider swap moves the needle equally. Based on real sessions through HZRelay, here's what actually changes end-to-end latency when you switch providers.

[ SYS_LOG ]
BENCH  baseline: deepgram+gpt-4o-mini+elevenlabs_flash = 762ms
SWAP   gpt-4o instead of gpt-4o-mini: +145ms (llm bottleneck)
SWAP   cartesia instead of elevenlabs: -10ms (negligible)
SWAP   claude-haiku instead of gpt-4o-mini: -70ms (llm wins)
SWAP   assemblyai instead of deepgram: +40ms (accuracy tradeoff)
TIP    optimize LLM first: it owns >45% of total latency
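To see what each swap does to the end-to-end number, add its delta to the 762ms baseline. The sketch below does that arithmetic using only the figures from the log; `projected_totals` is an illustrative helper, and each swap is applied on its own, since the log gives no data on how swaps combine.

```python
BASELINE_MS = 762  # deepgram + gpt-4o-mini + elevenlabs_flash

# End-to-end deltas in milliseconds, copied from the log.
SWAP_DELTA_MS = {
    "gpt-4o instead of gpt-4o-mini": +145,
    "cartesia instead of elevenlabs": -10,
    "claude-haiku instead of gpt-4o-mini": -70,
    "assemblyai instead of deepgram": +40,
}


def projected_totals(baseline_ms: int, deltas: dict[str, int]) -> list[tuple[str, int]]:
    """Return (swap, projected total in ms) pairs, fastest first."""
    totals = [(swap, baseline_ms + delta) for swap, delta in deltas.items()]
    return sorted(totals, key=lambda pair: pair[1])


if __name__ == "__main__":
    for swap, total in projected_totals(BASELINE_MS, SWAP_DELTA_MS):
        print(f"{total:>5}ms  {swap}")
```

On these numbers the projected totals are 692ms, 752ms, 802ms, and 907ms. The AssemblyAI swap lands just past the 800ms mark quoted in the introduction, which is worth weighing against its accuracy gain.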