[ENGINEERING] 2026-05-12 · 9 min read

WHY MULAW 8KHZ SILENTLY DESTROYS YOUR AI VOICE AGENT

[VOICE]

You build a voice agent. It works in your tests. You deploy it. Callers complain the AI sounds garbled — or they hear nothing at all. Your logs show no errors. Your Deepgram dashboard shows transcripts. Everything looks fine. The bug is a codec mismatch, and it fails silently.

THE MISMATCH

Twilio's Media Streams sends audio as G.711 µ-law (mulaw) encoded at 8kHz mono, packaged as base64 strings inside JSON envelopes over a WebSocket. This is the narrowband telephony format of the PSTN: G.711 is its baseline codec, with µ-law used in North America and Japan and A-law in most other regions.

[ SYS_LOG ]

TWILIO  event: media, encoding: audio/x-mulaw, sampleRate: 8000
TWILIO  payload: /+MgxAAjY3IAACAAAAAAA... (base64 mulaw)
WARN    forwarding raw bytes to deepgram: encoding mismatch
ERR     deepgram: unexpected audio encoding, transcription degraded
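As a rough illustration of what arrives on the socket, the sketch below pulls the mulaw bytes out of one Media Streams text frame. The struct mirrors only the `event` and `media.payload` fields; the type and function names are hypothetical, and the sample frame is hand-built (four bytes of mulaw silence), not captured traffic.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// mediaMessage mirrors only the fields of a Media Streams "media" event
// that this sketch needs. The name is illustrative.
type mediaMessage struct {
	Event string `json:"event"`
	Media struct {
		Payload string `json:"payload"` // base64-encoded mulaw bytes
	} `json:"media"`
}

// extractMulaw returns the raw mulaw bytes from one WebSocket text frame.
// It returns (nil, nil) for non-media events such as "start" or "stop".
func extractMulaw(frame []byte) ([]byte, error) {
	var msg mediaMessage
	if err := json.Unmarshal(frame, &msg); err != nil {
		return nil, fmt.Errorf("decode envelope: %w", err)
	}
	if msg.Event != "media" {
		return nil, nil
	}
	return base64.StdEncoding.DecodeString(msg.Media.Payload)
}

func main() {
	// Hand-built frame: "/////w==" is base64 for four 0xFF bytes (mulaw silence).
	frame := []byte(`{"event":"media","media":{"payload":"/////w=="}}`)
	audio, err := extractMulaw(frame)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d mulaw bytes: % x\n", len(audio), audio)
}
```

The bytes returned here are still mulaw at 8kHz. Nothing in this step changes the encoding, which is why forwarding them as-is is the root of the problem.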

Pipelines built on Deepgram, OpenAI, and ElevenLabs are typically configured for PCM at 16kHz: linear16 encoding, meaning 16-bit signed little-endian samples at 16,000 samples per second. When you forward mulaw bytes directly without transcoding, the receiver interprets them as linear PCM and the AI hears noise. Sometimes it still produces output, hallucinating from garbage input. Sometimes it silently drops the audio. Either way there is no error, just wrong behavior.
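One way to see why the bytes turn into noise is that the same buffer describes a different amount of audio under each format. The sketch below does that arithmetic for a 160-byte chunk; the chunk size is an assumption chosen to represent 20 ms of mulaw, and the helper names are illustrative.

```go
package main

import "fmt"

// byteRate returns bytes per second for a mono stream.
func byteRate(sampleRate, bytesPerSample int) int {
	return sampleRate * bytesPerSample
}

// durationMs returns how many milliseconds of audio n bytes represent
// at the given byte rate.
func durationMs(n, rate int) float64 {
	return float64(n) * 1000 / float64(rate)
}

func main() {
	mulaw8k := byteRate(8000, 1)    // mulaw: 1 byte per sample
	linear16k := byteRate(16000, 2) // linear16: 2 bytes per sample

	const chunk = 160 // assumed chunk size: 20 ms of mulaw at 8kHz
	fmt.Printf("as mulaw 8kHz:     %.0f ms\n", durationMs(chunk, mulaw8k))
	fmt.Printf("as linear16 16kHz: %.0f ms\n", durationMs(chunk, linear16k))
}
```

Under these assumptions the chunk is 20 ms of audio in its true format but is read as only 5 ms of linear16, with every pair of companded bytes treated as one unrelated 16-bit sample.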

[THE SILENT FAILURE]

Deepgram will often return an empty transcript rather than an error when audio encoding is wrong. Your pipeline keeps running. The caller gets no response. You see no exception.

THE FIX: TRANSCODE AT THE BOUNDARY

The correct approach is to decode mulaw bytes to raw PCM samples, then upsample from 8kHz to 16kHz before sending to any AI provider. On the return path (TTS → Twilio), downsample 16kHz PCM back to 8kHz and re-encode to mulaw.
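The decode step on its own can be sketched with the standard library alone. The function below follows the commonly used G.711 µ-law expansion (invert the bits, split into sign, exponent, and mantissa, then add and remove the 0x84 bias); the function names are illustrative and this is a minimal sketch, not a tuned implementation.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// ulawToLinear expands one G.711 mu-law byte to a 16-bit linear sample.
func ulawToLinear(u byte) int16 {
	u = ^u // mu-law bytes are stored bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := u & 0x0F
	sample := ((int(mantissa) << 3) + 0x84) << exponent
	sample -= 0x84 // remove the encoder bias
	if sign != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// decodeUlaw converts mulaw bytes to 16-bit signed little-endian PCM
// at the same sample rate (two output bytes per input byte).
func decodeUlaw(mulaw []byte) []byte {
	pcm := make([]byte, 2*len(mulaw))
	for i, u := range mulaw {
		binary.LittleEndian.PutUint16(pcm[2*i:], uint16(ulawToLinear(u)))
	}
	return pcm
}

func main() {
	fmt.Println(ulawToLinear(0xFF), ulawToLinear(0x80), ulawToLinear(0x00))
	fmt.Printf("% x\n", decodeUlaw([]byte{0xFF, 0x80}))
}
```

Note that decoding doubles the byte count without changing the sample rate: the output is still 8kHz audio and still needs upsampling before it matches a 16kHz consumer.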

transcode.go — what HZRelay does internally

// mulaw 8kHz → PCM 16kHz (for Twilio inbound → STT)
func mulawToPCM16k(mulaw []byte) []byte {
	decoded := g711.DecodeUlaw(mulaw) // mulaw → PCM 8kHz
	return upsample2x(decoded)        // 8kHz → 16kHz linear interp
}

// PCM 16kHz → mulaw 8kHz (for TTS → Twilio outbound)
func pcm16kToMulaw(pcm []byte) []byte {
	downsampled := downsample2x(pcm)    // 16kHz → 8kHz
	return g711.EncodeUlaw(downsampled) // PCM 8kHz → mulaw
}
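The excerpt above calls `upsample2x` and `downsample2x` without showing them. The sketch below is one plausible way to write such helpers, assuming both take and return 16-bit signed little-endian PCM bytes. The bodies are an assumption for illustration, not HZRelay's code: upsampling inserts the midpoint between neighbouring samples, and downsampling averages each pair, which is only a crude low-pass filter.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// toSamples reads 16-bit little-endian PCM; a trailing odd byte is ignored.
func toSamples(pcm []byte) []int16 {
	s := make([]int16, len(pcm)/2)
	for i := range s {
		s[i] = int16(binary.LittleEndian.Uint16(pcm[2*i:]))
	}
	return s
}

// toBytes writes samples back out as 16-bit little-endian PCM.
func toBytes(s []int16) []byte {
	pcm := make([]byte, 2*len(s))
	for i, v := range s {
		binary.LittleEndian.PutUint16(pcm[2*i:], uint16(v))
	}
	return pcm
}

// upsample2x doubles the sample rate by linear interpolation: each input
// sample is followed by the midpoint between it and the next one.
func upsample2x(pcm []byte) []byte {
	in := toSamples(pcm)
	out := make([]int16, 0, 2*len(in))
	for i, v := range in {
		next := v // at the end of the chunk, hold the last sample
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, v, int16((int32(v)+int32(next))/2))
	}
	return toBytes(out)
}

// downsample2x halves the sample rate by averaging each pair of samples.
func downsample2x(pcm []byte) []byte {
	in := toSamples(pcm)
	out := make([]int16, 0, len(in)/2)
	for i := 0; i+1 < len(in); i += 2 {
		out = append(out, int16((int32(in[i])+int32(in[i+1]))/2))
	}
	return toBytes(out)
}

func main() {
	up := upsample2x(toBytes([]int16{0, 100, 200}))
	fmt.Println(toSamples(up), toSamples(downsample2x(up)))
}
```

Holding the last sample at the end of each chunk keeps the helper stateless, at the cost of a small discontinuity at chunk boundaries. A streaming implementation would more likely carry the previous sample across calls and use a proper anti-aliasing filter on the way down.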

WHAT HZRELAY DOES

HZRelay transcodes at every adapter boundary automatically. Twilio sends mulaw — the Twilio adapter decodes and upsamples before any audio enters the routing pipeline. ElevenLabs returns PCM 16kHz — the Twilio outbound adapter downsamples and re-encodes before sending back to Twilio. You never specify an encoding or sample rate.

[ SYS_LOG ]

CODEC   twilio inbound: mulaw 8kHz → PCM 16kHz [at adapter boundary]
STT     deepgram: PCM 16kHz ✓ (nova-2 streaming)
TTS     elevenlabs: PCM 16kHz output received
CODEC   twilio outbound: PCM 16kHz → mulaw 8kHz [at adapter boundary]
OK      caller hears clear audio, no transcoding in your code
