Voice Customer Service Agent

Last updated: April 15, 2026

Quick answer

The production stack connects Twilio or Vonage for telephony, Deepgram Nova-3 for STT, GPT-4o Realtime or Claude Sonnet 4 plus ElevenLabs Turbo for the LLM and TTS legs, and a barge-in detector for interruption handling. Target end-to-end latency is under 1.2s. Expect to pay $0.25 to $0.80 per minute at 2026 prices.

The problem

You need a voice agent that sounds natural, handles interruptions, integrates with your telephony system, and responds fast enough that callers do not hang up. Cost per minute is the critical metric — most voice agents lose money on long calls.

Architecture

Telephony (Twilio/Vonage)

Receives inbound PSTN calls and streams audio over WebSocket in 20ms frames.
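As a rough illustration of what arrives on that WebSocket, the sketch below decodes a single audio frame. It assumes the JSON message shape used by Twilio Media Streams (a `media` event whose `payload` is base64-encoded 8 kHz G.711 mu-law audio); other providers frame audio differently. The function names are illustrative, not part of any SDK.

```python
import base64
import json
from typing import List, Optional

SAMPLE_RATE_HZ = 8000   # telephony audio, assumed mu-law at 8 kHz
FRAME_MS = 20
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples per 20 ms frame


def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit PCM sample."""
    u = ~byte & 0xFF
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> Optional[List[int]]:
    """Return PCM samples for a 'media' event, or None for any other event."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # e.g. connection or start/stop control messages
    payload = base64.b64decode(msg["media"]["payload"])
    return [ulaw_to_pcm16(b) for b in payload]
```

The decoded samples can then be resampled or forwarded to whichever STT and VAD components the stack uses.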

Alternatives: Telnyx, Bandwidth, SIP trunk

Speech-to-Text (Deepgram)

Transcribes audio in real time with word-level timestamps and confidence scores.

Alternatives: OpenAI Whisper streaming, Google STT v2, AssemblyAI

Voice Activity Detector

Detects when the caller stops speaking to trigger the LLM turn. Also detects barge-in mid-response.

Alternatives: Silero VAD, Twilio native VAD, WebRTC VAD

LLM + Tool Use

Generates the spoken response turn. Calls CRM and order tools when needed.

Alternatives: Claude Sonnet 4 (pipelined), Gemini 2.5 Flash Live

Text-to-Speech

Converts LLM output to audio, streamed in chunks so playback starts before generation completes.
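One common way to get this overlap in a pipelined stack is to cut the LLM token stream at sentence boundaries and hand each sentence to the TTS engine as soon as it is complete. The generator below is a minimal sketch of that idea. The boundary rule (punctuation followed by whitespace) is an assumption and will split incorrectly on abbreviations such as "Dr. Smith".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (simplifying assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed LLM tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence would be sent to the TTS engine immediately, so the first audio chunk can play while later sentences are still being generated.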

Alternatives: ElevenLabs Turbo v2, Cartesia Sonic, OpenAI TTS native (realtime API)

CRM / Order Tools

Function calls to look up account status, open tickets, and process simple transactions.

Alternatives: MCP server over internal APIs

Call Recording + Transcription

Stores full call audio and final transcript for quality review and compliance.

Alternatives: Twilio recording, S3 + async Whisper, Deepgram async transcription

Call Analytics

Tracks resolution rate, call duration, escalation rate, and sentiment per call.
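A minimal sketch of how those four numbers might be aggregated from per-call records. The `CallRecord` fields and the sentiment scale are hypothetical, since the source does not define a schema.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class CallRecord:
    duration_s: float   # total call length in seconds
    resolved: bool      # closed without a human transfer
    escalated: bool     # handed off to a human agent
    sentiment: float    # hypothetical score in [-1, 1]


def summarize(calls: Iterable[CallRecord]) -> Dict[str, float]:
    """Aggregate per-call records into the dashboard metrics."""
    calls = list(calls)
    n = len(calls)
    if n == 0:
        return {}
    return {
        "resolution_rate": sum(c.resolved for c in calls) / n,
        "escalation_rate": sum(c.escalated for c in calls) / n,
        "avg_duration_s": sum(c.duration_s for c in calls) / n,
        "avg_sentiment": sum(c.sentiment for c in calls) / n,
    }
```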

Alternatives: Braintrust call evals, Custom dashboard, Observe.AI

The stack

Telephony: Twilio

Best documentation and WebSocket streaming API. Vonage and Telnyx are cheaper for high volume (over 1M minutes/month) but have higher integration complexity.

Alternatives: Vonage, Telnyx, Bandwidth

STT: Deepgram Nova-3

Lowest word error rate on telephony audio (noisy, 8kHz) in 2026 benchmarks. Sub-300ms latency on streaming transcription. 40% cheaper than Google STT at scale.

Alternatives: OpenAI Whisper streaming, Google STT v2, AssemblyAI

LLM: GPT-4o Realtime API

Audio-native end-to-end: no separate STT or TTS required. Cuts 200 to 400ms compared to a pipelined approach. Weaker tool-use than Claude but faster for conversational turns. Use Claude pipelined when tool-use accuracy matters more than raw latency.

Alternatives: Claude Sonnet 4 (pipelined), Gemini 2.5 Flash Live

TTS: ElevenLabs Turbo v2

120ms time-to-first-audio. Sounds natural on telephony-quality audio. Cartesia Sonic is slightly faster at 80ms but voice quality is slightly lower. Native GPT-4o Realtime TTS is competitive but locks you into OpenAI.

Alternatives: Cartesia Sonic, OpenAI TTS native

Barge-in handling: Silero VAD + interrupt signal

Silero VAD runs locally on CPU and detects speech onset in under 100ms. Send an interrupt signal to the LLM on detection to stop generation and start a new turn.

Alternatives: Twilio native VAD, WebRTC VAD

Cost at each scale

Prototype: 100 minutes/mo, $45/mo total

Twilio inbound calls: $3
Deepgram Nova-3 STT: $1
LLM (GPT-4o Realtime): $22
TTS (ElevenLabs Turbo): $15
Hosting: $4

Startup: 50,000 minutes/mo, $16,000/mo total

Twilio inbound calls: $750
Deepgram Nova-3 STT: $300
LLM (GPT-4o Realtime): $10,000
TTS (ElevenLabs Turbo): $3,500
Hosting + infra: $800
Recording + storage: $650

Scale: 1,000,000 minutes/mo, $280,000/mo total

Twilio inbound calls: $12,000
Deepgram Nova-3 STT: $5,000
LLM (GPT-4o Realtime): $180,000
TTS (ElevenLabs Turbo): $65,000
Infra: $10,000
Recording + analytics: $8,000
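Dividing each tier's monthly total by its minutes gives the blended cost per minute, and dividing the LLM line by the total shows how much of the bill the model accounts for. The short calculation below uses only the figures listed above.

```python
# (minutes per month, total USD per month, LLM USD per month) from the three tiers
TIERS = {
    "Prototype": (100, 45, 22),
    "Startup": (50_000, 16_000, 10_000),
    "Scale": (1_000_000, 280_000, 180_000),
}


def cost_per_minute(minutes: int, total_usd: float) -> float:
    """Blended cost of one call minute at a given tier."""
    return total_usd / minutes


def llm_share(total_usd: float, llm_usd: float) -> float:
    """Fraction of the monthly bill spent on the LLM line item."""
    return llm_usd / total_usd


for name, (minutes, total, llm) in TIERS.items():
    print(f"{name}: ${cost_per_minute(minutes, total):.2f}/min, "
          f"LLM share {llm_share(total, llm):.0%}")
```

On these numbers the per-minute cost falls from $0.45 to $0.32 to $0.28 as volume grows, while the LLM stays the largest single line item at every tier.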

Latency budget

Total P50: 830ms

Total P95: 1,550ms

STT end-of-utterance detection: 200ms median · 380ms P95
VAD turn detection: 80ms median · 140ms P95
LLM first token (pipelined) or audio onset (realtime): 350ms median · 650ms P95
TTS first audio chunk: 120ms median · 220ms P95
Network + audio buffer: 80ms median · 160ms P95
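The totals are the straight sum of the five stages, which the snippet below reproduces and compares against the 1.2s target from the quick answer. Percentiles are not additive in general, so the summed P95 is best read as a rough planning figure rather than a measured end-to-end P95.

```python
# Stage name -> (median ms, P95 ms), copied from the budget above
BUDGET_MS = {
    "STT end-of-utterance detection": (200, 380),
    "VAD turn detection": (80, 140),
    "LLM first token / audio onset": (350, 650),
    "TTS first audio chunk": (120, 220),
    "Network + audio buffer": (80, 160),
}
TARGET_MS = 1200  # end-to-end target stated in the quick answer

total_p50 = sum(p50 for p50, _ in BUDGET_MS.values())
total_p95 = sum(p95 for _, p95 in BUDGET_MS.values())

print(f"P50 total: {total_p50} ms (headroom {TARGET_MS - total_p50} ms)")
print(f"P95 total: {total_p95} ms (over target by {total_p95 - TARGET_MS} ms)")
```

The median fits the target with 370ms to spare, while the summed P95 overshoots it by 350ms, which is why the filler-phrase guardrail described under failure modes matters.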

Tradeoffs

Audio-native vs pipelined

GPT-4o Realtime API handles STT, LLM, and TTS in a single model call. End-to-end latency is 200 to 400ms faster than a pipelined approach. Downside: weaker tool-use and less control over individual components. For support flows where order lookups and CRM writes are common, a pipelined Claude + ElevenLabs setup gives better accuracy.

ElevenLabs vs Cartesia vs native TTS

ElevenLabs Turbo v2 has the best voice quality for English-language calls. Cartesia Sonic is 30% cheaper and 40ms faster but slightly robotic on long sentences. Native GPT-4o TTS is good but costs more per character at high volume.

Twilio vs Vonage vs Telnyx pricing

Twilio charges $0.0085 per minute for inbound calls. Vonage and Telnyx are both under $0.005 at volume. That gap of at least $0.0035 per minute matters at 1M minutes/month (roughly $3.5k/mo in savings). Twilio's developer experience and support justify the premium up to around 200k minutes per month.
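Using the two per-minute rates quoted above, the subtraction works out to $0.0035 per minute when the alternative carrier is priced at its $0.005 ceiling. The helper below is a sketch for plugging in other volumes. Real quotes vary with commitment and region, so treat the rates as placeholders.

```python
TWILIO_PER_MIN = 0.0085       # inbound rate quoted in the text
ALTERNATIVE_PER_MIN = 0.005   # upper bound quoted for Vonage and Telnyx at volume


def monthly_savings(minutes: int,
                    current: float = TWILIO_PER_MIN,
                    alternative: float = ALTERNATIVE_PER_MIN) -> float:
    """USD saved per month by moving all inbound minutes to the cheaper carrier."""
    return minutes * (current - alternative)


for volume in (200_000, 500_000, 1_000_000):
    print(f"{volume:>9,} min/mo -> ${monthly_savings(volume):,.0f} saved")
```

At 200k minutes the saving is about $700 a month, which is small next to the engineering cost of a migration. At 1M minutes it is about $3,500 a month.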

Failure modes & guardrails

Latency exceeds 1.5 seconds, caller hangs up

Mitigation: If total end-to-end P95 creeps above 1.2s in monitoring, add a filler phrase ('one moment') triggered after 800ms of silence. This buys 1 to 2 extra seconds without callers perceiving a failure.
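One way to wire up the filler phrase is a timeout around the response generator: if no reply is ready after 800ms, play the filler and keep waiting. The sketch below uses `asyncio`. The `generate` and `play` callables are hypothetical stand-ins for the LLM call and the TTS playback path.

```python
import asyncio
from typing import Awaitable, Callable

FILLER_AFTER_S = 0.8  # silence threshold from the mitigation above


async def respond_with_filler(
    generate: Callable[[], Awaitable[str]],
    play: Callable[[str], Awaitable[None]],
    filler: str = "One moment.",
    filler_after_s: float = FILLER_AFTER_S,
) -> str:
    """Play the reply, inserting a filler phrase if generation is slow."""
    task = asyncio.ensure_future(generate())
    try:
        # shield() keeps the generation task alive when the timeout fires.
        reply = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        await play(filler)
        reply = await task  # continue waiting for the real answer
    await play(reply)
    return reply
```

Shielding the task matters here: without it, the timeout would cancel the in-flight generation and the caller would hear the filler followed by nothing.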

Barge-in is not detected, caller talks over response

Mitigation: Run Silero VAD continuously on the inbound audio stream even while TTS is playing. On speech detection above the confidence threshold, send an interrupt signal and flush the TTS buffer.
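The detection logic can be sketched independently of the VAD library. The class below assumes a VAD that emits one speech probability per audio frame. The threshold and the consecutive-frame count are illustrative defaults, not tuned values, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BargeInDetector:
    threshold: float = 0.6       # assumed speech-probability cutoff
    min_speech_frames: int = 3   # consecutive speech frames needed to interrupt
    _run: int = 0                # current streak of speech frames

    def feed(self, speech_prob: float, tts_playing: bool) -> bool:
        """Return True exactly once when a barge-in should be signalled."""
        if not tts_playing or speech_prob < self.threshold:
            self._run = 0
            return False
        self._run += 1
        # Fire on the frame that completes the streak, not on every later frame.
        return self._run == self.min_speech_frames
```

When `feed` returns True, the caller would cancel LLM generation and flush any queued TTS audio. Requiring several consecutive frames trades a little latency for fewer false interrupts from coughs or line noise.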

LLM fabricates account or order details

Mitigation: Never let the LLM generate account data from memory. Every order number, account balance, or status must come from a tool call. If the tool fails, say 'I cannot look that up right now' — do not hallucinate a plausible answer.
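A minimal sketch of that rule as code: the spoken answer is built only from the tool result, and any failure or missing field falls back to the fixed phrase. The `lookup` callable and the `status` field are hypothetical, standing in for whatever the CRM or order API returns.

```python
from typing import Any, Callable, Dict

FALLBACK = "I cannot look that up right now."


def answer_order_status(order_id: str,
                        lookup: Callable[[str], Dict[str, Any]]) -> str:
    """Build the reply strictly from tool output, never from model memory."""
    try:
        record = lookup(order_id)
    except Exception:
        # Tool failure: admit it instead of guessing a plausible status.
        return FALLBACK
    status = record.get("status") if isinstance(record, dict) else None
    if not status:
        return FALLBACK
    return f"Order {order_id} is currently {status}."
```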

Wrong language detection

Mitigation: Detect the caller's language in the first 5 seconds using a language-identification model. Route non-target-language callers to a language-specific agent or a human. Do not attempt to respond in the wrong language.

Call recording violates 2-party consent laws

Mitigation: Play a required consent disclosure at the start of every call before recording begins. Log that the disclosure was played with a timestamp. In California, consent must be affirmative — a beep is not sufficient.

Frequently asked questions

How much does a voice agent cost per minute?

At scale (1M+ minutes/month), budget $0.25 to $0.35 per minute for a pipelined stack (Deepgram + Claude + ElevenLabs). The audio-native GPT-4o Realtime approach costs $0.40 to $0.60 per minute at similar scale. A 5-minute call costs $1.25 to $3.00 depending on your stack.

Which STT engine is best for voice agents?

Deepgram Nova-3 leads on telephony audio quality in 2026 — low word error rate on accented speech and noisy phone audio. Google STT v2 is competitive for clear audio. AssemblyAI is good for async transcription but adds latency in streaming mode.

How do I hit sub-500ms time-to-first-token?

Use GPT-4o Realtime API (audio-native, no STT/TTS overhead) or combine Deepgram streaming STT with speculative generation that starts before the utterance ends. On pipelined stacks, the biggest gain comes from starting LLM generation the moment VAD detects end-of-utterance rather than waiting for full transcription.

Twilio vs Vonage vs Telnyx?

Twilio is the default choice for teams under 200k minutes per month — best documentation, most integrations, and reliable WebSocket streaming. Telnyx and Vonage are 40 to 50% cheaper at high volume and worth evaluating above 500k minutes per month, but require more engineering to integrate reliably.

How do I handle PCI or HIPAA compliance on voice calls?

For PCI: pause recording and do not transcribe when the caller is entering card numbers. Use DTMF (keypad input) for sensitive numbers instead of voice. For HIPAA: use a HIPAA Business Associate Agreement with your telephony and STT providers — Twilio, Deepgram, and AssemblyAI all offer BAAs. Encrypt all recordings at rest.

How do I evaluate voice agent quality?

Track four metrics: call resolution rate (resolved without human transfer), average handle time, CSAT score from post-call SMS survey, and word error rate from transcription on a sampled 1% of calls. Run LLM-as-judge on transcripts daily against a rubric that scores empathy, accuracy, and escalation appropriateness.
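For the word-error-rate metric, the usual definition is the word-level edit distance between a human reference transcript and the STT output, divided by the number of reference words. The function below is a plain implementation of that definition. Lowercasing and whitespace tokenization are simplifying assumptions, and production scoring normally also normalizes punctuation and numerals.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "check my order" against the reference "check my order status" gives one deletion over four reference words, a WER of 0.25.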
