[TUTORIAL] 2026-05-05 · 7 min read

PHONE + WEB VOICE UNDER ONE SESSION MODEL

[VOICE · WEBRTC]

Most teams end up maintaining two completely separate voice pipelines. The Twilio pipeline handles inbound phone calls — mulaw codec, Media Streams WebSocket, telephony events. The WebRTC pipeline handles browser voice — Opus codec, ICE negotiation, browser APIs. Two codebases. Two sets of reconnect logic. Two dashboards. Same AI underneath.

THE PROBLEM

The divergence happens at the transport layer. Twilio and WebRTC speak completely different protocols. To unify them, you'd need to abstract both behind a common interface — handle codec differences at the boundary, normalize events, and expose the same API to your agent code regardless of which transport is active.
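A minimal sketch of what such a common interface could look like if you built it by hand. Every name here (`PhoneRaw`, `BrowserRaw`, `TransportAdapter`, `onVoiceEvent`) is hypothetical and the raw event shapes are invented for illustration; they are not Twilio's, WebRTC's, or HZRelay's actual types. The point is only that each transport gets its own `normalize` step, and the agent code sees a single event union.

```typescript
// Hypothetical types for illustration only; not taken from any real SDK.

// Transport-neutral events that agent code consumes.
type VoiceEvent =
  | { type: 'speech.start' }
  | { type: 'transcript.final'; text: string };

// Invented raw shapes standing in for what each transport might emit.
type PhoneRaw =
  | { event: 'talking' }
  | { event: 'transcribed'; payload: { text: string } };
type BrowserRaw =
  | { kind: 'vad-start' }
  | { kind: 'stt-final'; transcript: string };

// One adapter per transport; each maps its raw events onto VoiceEvent.
interface TransportAdapter<Raw> {
  readonly name: 'twilio' | 'webrtc';
  normalize(raw: Raw): VoiceEvent;
}

const phoneAdapter: TransportAdapter<PhoneRaw> = {
  name: 'twilio',
  normalize(raw) {
    return raw.event === 'talking'
      ? { type: 'speech.start' }
      : { type: 'transcript.final', text: raw.payload.text };
  },
};

const browserAdapter: TransportAdapter<BrowserRaw> = {
  name: 'webrtc',
  normalize(raw) {
    return raw.kind === 'vad-start'
      ? { type: 'speech.start' }
      : { type: 'transcript.final', text: raw.transcript };
  },
};

// Agent code depends only on VoiceEvent, never on the transport.
function onVoiceEvent(event: VoiceEvent, transcripts: string[]): void {
  if (event.type === 'transcript.final') {
    transcripts.push(event.text);
  }
}
```

Feeding either adapter's output into `onVoiceEvent` exercises the same code path, which is the property the rest of this tutorial relies on.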

[WITHOUT HZRELAY]

Two separate WebSocket handlers. Two codec paths. Two reconnect strategies. Two sets of events your agent must handle. Twice the code for one product feature.

THE SOLUTION: ONE SESSION CONFIG

HZRelay exposes a single session model. The inbound field selects the transport, and outbound is set to match. Everything downstream (STT, LLM, TTS, events) is identical whether the user is on a phone or in a browser.

phone_session.ts

    // Phone call (Twilio)
    const session = createSession({
      inbound: { type: 'twilio' }, // mulaw 8kHz (handled)
      stt: { provider: 'deepgram', apiKey: env.DG },
      llm: { provider: 'openai', apiKey: env.OAI },
      tts: { provider: 'elevenlabs', apiKey: env.EL },
      outbound: { type: 'twilio' },
      agent: { systemPrompt: '...' },
    });

web_session.ts

    // Web voice (browser WebRTC): only the inbound and outbound types differ
    const session = createSession({
      inbound: { type: 'webrtc' }, // Opus 48kHz (handled)
      stt: { provider: 'deepgram', apiKey: env.DG },
      llm: { provider: 'openai', apiKey: env.OAI },
      tts: { provider: 'elevenlabs', apiKey: env.EL },
      outbound: { type: 'webrtc' },
      agent: { systemPrompt: '...' }, // same prompt
    });

    // same event API: your agent code doesn't change
    session.on('transcript.final', (e) => console.log(e.text));
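Because the two configs differ only in the transport type, one way to avoid copy-paste drift is to build both from a single helper. The sketch below is an assumption, not part of HZRelay: `sessionConfigFor`, `ProviderKeys`, and the `SessionConfig` interface are hypothetical names, and the interface simply mirrors the object literals shown above rather than any published type.

```typescript
// Hypothetical helper; the shape mirrors the config literals in this tutorial.
type Transport = 'twilio' | 'webrtc';

interface ProviderKeys {
  deepgram: string;
  openai: string;
  elevenlabs: string;
}

interface SessionConfig {
  inbound: { type: Transport };
  stt: { provider: 'deepgram'; apiKey: string };
  llm: { provider: 'openai'; apiKey: string };
  tts: { provider: 'elevenlabs'; apiKey: string };
  outbound: { type: Transport };
  agent: { systemPrompt: string };
}

// Builds a config where the transport is the only thing that varies.
function sessionConfigFor(
  transport: Transport,
  keys: ProviderKeys,
  systemPrompt: string,
): SessionConfig {
  return {
    inbound: { type: transport },
    stt: { provider: 'deepgram', apiKey: keys.deepgram },
    llm: { provider: 'openai', apiKey: keys.openai },
    tts: { provider: 'elevenlabs', apiKey: keys.elevenlabs },
    outbound: { type: transport }, // kept in lockstep with inbound
    agent: { systemPrompt },
  };
}
```

Under these assumptions, `createSession(sessionConfigFor('twilio', keys, prompt))` and `createSession(sessionConfigFor('webrtc', keys, prompt))` would replace the two hand-written literals, so the STT, LLM, TTS, and prompt settings cannot drift between transports.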

SAME EVENTS, ALWAYS

Whether the call comes from Twilio or WebRTC, your agent sees identical events: speech.start, transcript.final, llm.response, tts.audio_start. The codec differences (mulaw vs Opus) are absorbed by the adapter layer — your code is transport-agnostic.
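To make "absorbed by the adapter layer" concrete on the phone side, here is a sketch of standard G.711 mu-law expansion, which turns each 8-bit mulaw byte into a 16-bit linear PCM sample. How HZRelay implements this internally is not stated in this tutorial; `muLawToPcm16` and `decodeMuLawFrame` are hypothetical names, and the Opus side is omitted because it needs a full decoder library rather than a lookup formula.

```typescript
// Standard G.711 mu-law expansion; function names are illustrative.

// Expands one mu-law byte into a signed 16-bit linear PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff;           // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;            // top bit: set means negative
  const exponent = (u >> 4) & 0x07; // 3-bit segment number
  const mantissa = u & 0x0f;        // 4-bit step within the segment
  // 0x84 (132) is the bias added at encode time; remove it after scaling.
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decodes a whole frame of mu-law bytes into PCM16 samples.
function decodeMuLawFrame(frame: Uint8Array): Int16Array {
  const pcm = new Int16Array(frame.length);
  frame.forEach((b, i) => {
    pcm[i] = muLawToPcm16(b);
  });
  return pcm;
}

// Silence and the two full-scale codes.
const samples = decodeMuLawFrame(new Uint8Array([0xff, 0x80, 0x00]));
// samples is Int16Array [0, 32124, -32124]
```

After a step like this, phone audio is ordinary linear PCM at 8 kHz. Resampling it to whatever rate the STT provider expects is a separate step not shown here.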

[ SYS_LOG ]

    INFO  inbound: twilio → session a3f9b2c1
    EVT   speech.start → transcript.final: 'book my appointment'
    OK    mouth→ear: 748ms
    INFO  inbound: webrtc → session b7d1e4f2 (same agent)
    EVT   speech.start → transcript.final: 'reschedule for friday'
    OK    mouth→ear: 731ms
