WHY MULAW 8KHZ SILENTLY DESTROYS YOUR AI VOICE AGENT
You build a voice agent. It works in your tests. You deploy it. Callers complain the AI sounds garbled — or they hear nothing at all. Your logs show no errors. Your Deepgram dashboard shows transcripts. Everything looks fine. The bug is a codec mismatch, and it fails silently.
THE MISMATCH
Twilio's Media Streams sends audio as G.711 µ-law (mulaw) encoded at 8kHz mono, packaged as base64 strings inside JSON envelopes over a WebSocket. G.711 is the standard PSTN codec: µ-law is the variant used in North America and Japan, and A-law is used in most other regions.
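To make the envelope concrete, here is a minimal sketch of pulling the raw mulaw bytes out of one Media Streams message. The function name extract_mulaw and the sample message are illustrative assumptions; the "event", "media", and "payload" fields follow Twilio's documented message shape.

```python
import base64
import json
from typing import Optional


def extract_mulaw(message: str) -> Optional[bytes]:
    """Return the raw mu-law bytes from one Media Streams message.

    Returns None for non-audio events such as "connected", "start",
    "mark", or "stop". The function name is hypothetical.
    """
    envelope = json.loads(message)
    if envelope.get("event") != "media":
        return None
    # The payload is base64 text; decoding yields one byte per 8kHz sample.
    return base64.b64decode(envelope["media"]["payload"])


# Illustrative message: 160 bytes of mu-law silence (0xFF), base64 encoded.
sample_message = json.dumps({
    "event": "media",
    "streamSid": "MZ-example",  # placeholder, not a real stream SID
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})

audio = extract_mulaw(sample_message)
```

At this point audio is still mulaw at 8kHz. Nothing has been transcoded yet.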
Deepgram, OpenAI, and ElevenLabs integrations are typically configured for linear PCM at 16kHz: linear16 encoding, 16-bit signed little-endian samples, 16,000 samples per second. When you forward mulaw bytes directly into a stream that expects linear16, without transcoding, the provider decodes noise. Sometimes it still produces output (hallucinating from garbage input). Sometimes it silently drops the audio. Either way, there is no error, just wrong behavior.
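A short worked example shows why the mismatch produces garbage rather than an error. It assumes a 160-byte frame (20 ms of mulaw at 8kHz, a common telephony frame size); the helper duration_ms is hypothetical.

```python
import struct


def duration_ms(num_bytes: int, bytes_per_sample: int, sample_rate: int) -> float:
    """Playback duration implied by a byte count under a given format."""
    return 1000.0 * num_bytes / bytes_per_sample / sample_rate


# 160 bytes of mu-law: alternating positive and negative full-scale samples.
frame = bytes([0x80, 0x00] * 80)

# What the bytes really are: 1 byte per sample at 8kHz.
actual_ms = duration_ms(len(frame), 1, 8000)

# What a receiver expecting linear16 at 16kHz thinks they are.
misread_ms = duration_ms(len(frame), 2, 16000)

# The same bytes reinterpreted as little-endian 16-bit samples.
misread_samples = struct.unpack("<80h", frame)

print(actual_ms, misread_ms, misread_samples[0])
```

The bytes are valid input under either interpretation, so nothing raises. The receiver sees a quarter of the real duration, and a full-scale signal is read as a near-silent constant.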
THE SILENT FAILURE
Deepgram will often return an empty transcript rather than an error when audio encoding is wrong. Your pipeline keeps running. The caller gets no response. You see no exception.
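Because no exception is raised, one way to surface the problem is to watch for it yourself. The sketch below is an assumption, not part of any provider SDK: it flags the case where audio with real energy keeps producing empty transcripts. The class name, the threshold, and the streak length are all illustrative.

```python
import math
import struct


def rms(pcm: bytes) -> float:
    """Root-mean-square level of 16-bit little-endian PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class EmptyTranscriptMonitor:
    """Hypothetical guard: loud audio plus empty transcripts suggests a mismatch."""

    def __init__(self, rms_threshold: float = 500.0, max_empty: int = 3):
        self.rms_threshold = rms_threshold  # illustrative value, tune per deployment
        self.max_empty = max_empty
        self.empty_streak = 0

    def observe(self, pcm: bytes, transcript: str) -> bool:
        """Return True once max_empty loud chunks in a row yield no text."""
        if transcript.strip():
            self.empty_streak = 0
            return False
        if rms(pcm) >= self.rms_threshold:
            self.empty_streak += 1
        # Quiet chunks with no text are expected and leave the streak unchanged.
        return self.empty_streak >= self.max_empty


monitor = EmptyTranscriptMonitor(max_empty=2)
loud = struct.pack("<4h", 10000, -10000, 10000, -10000)
first = monitor.observe(loud, "")
second = monitor.observe(loud, "")
```

Here second is True: two loud chunks in a row came back with no text, which is worth a log line or an alert.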
THE FIX: TRANSCODE AT THE BOUNDARY
The correct approach is to decode mulaw bytes to raw PCM samples, then upsample from 8kHz to 16kHz before sending to any AI provider. On the return path (TTS → Twilio), downsample 16kHz PCM back to 8kHz and re-encode to mulaw.
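Both directions can be sketched in pure Python. This is a minimal illustration under stated assumptions, not a production resampler: upsampling uses linear interpolation and downsampling averages sample pairs, where a real pipeline would normally use a proper low-pass filter. The function names are hypothetical; the mulaw arithmetic follows the standard G.711 reference algorithm.

```python
import struct

BIAS = 0x84   # G.711 mu-law bias
CLIP = 32635  # largest magnitude that still fits after adding the bias


def mulaw_decode_byte(b: int) -> int:
    """Decode one mu-law byte to a signed 16-bit linear sample."""
    b = ~b & 0xFF
    t = (((b & 0x0F) << 3) + BIAS) << ((b & 0x70) >> 4)
    return BIAS - t if b & 0x80 else t - BIAS


def mulaw_encode_sample(s: int) -> int:
    """Encode one signed 16-bit linear sample to a mu-law byte."""
    sign = 0x80 if s < 0 else 0x00
    s = min(abs(s), CLIP) + BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (s & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (s >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


_DECODE_TABLE = [mulaw_decode_byte(b) for b in range(256)]


def mulaw8k_to_pcm16k(mulaw: bytes) -> bytes:
    """Inbound path: mu-law 8kHz to linear16 16kHz (naive 2x interpolation)."""
    samples = [_DECODE_TABLE[b] for b in mulaw]
    out = []
    for i, cur in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else cur
        out.append(cur)
        out.append((cur + nxt) // 2)  # midpoint between neighbouring samples
    return struct.pack(f"<{len(out)}h", *out)


def pcm16k_to_mulaw8k(pcm: bytes) -> bytes:
    """Return path: linear16 16kHz to mu-law 8kHz (average each sample pair)."""
    n = len(pcm) // 2
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    out = bytearray()
    for i in range(0, n - 1, 2):
        out.append(mulaw_encode_sample((samples[i] + samples[i + 1]) // 2))
    return bytes(out)


# 160 bytes of mu-law silence in, 640 bytes of linear16 silence out.
pcm = mulaw8k_to_pcm16k(b"\xff" * 160)
back = pcm16k_to_mulaw8k(pcm)
```

Each call treats its chunk independently, so the last interpolated sample of a chunk repeats the final input sample. For streaming audio, carrying the previous chunk's last sample across calls would avoid small discontinuities at chunk boundaries.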
WHAT HZRELAY DOES
HZRelay transcodes at every adapter boundary automatically. Twilio sends mulaw — the Twilio adapter decodes and upsamples before any audio enters the routing pipeline. ElevenLabs returns PCM 16kHz — the Twilio outbound adapter downsamples and re-encodes before sending back to Twilio. You never specify an encoding or sample rate.