Voice Customer Service Agent
Last updated: April 15, 2026
Quick answer
The production stack connects Twilio or Vonage for telephony, Deepgram Nova-3 for STT, either GPT-4o Realtime (audio-native) or a pipelined Claude Sonnet 4 plus ElevenLabs Turbo for the LLM and TTS legs, and a barge-in detector for interruption handling. Target end-to-end latency under 1.2s. Expect $0.25 to $0.80 per minute at 2026 prices.
The problem
You need a voice agent that sounds natural, handles interruptions, integrates with your telephony system, and responds fast enough that callers do not hang up. Cost per minute is the critical metric: most voice agents lose money on long calls.
Architecture
Telephony (Twilio/Vonage)
Receives inbound PSTN calls and streams audio over WebSocket in 20ms frames.
Alternatives: Telnyx, Bandwidth, SIP trunk
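As an illustration of the 20ms framing, the sketch below decodes one inbound media message. It assumes the JSON shape used by Twilio Media Streams (an `event` field and a base64-encoded `media.payload`) and 8kHz, 8-bit mu-law audio; check both against your provider's documentation. The function names are hypothetical.

```python
import base64
import json
from typing import Optional

SAMPLE_RATE_HZ = 8000   # assumption: narrowband telephony audio
BYTES_PER_SAMPLE = 1    # assumption: 8-bit mu-law encoding


def decode_media_frame(message: str) -> Optional[bytes]:
    """Return raw audio bytes for a media event, or None for any other event."""
    event = json.loads(message)
    if event.get("event") != "media":
        return None
    return base64.b64decode(event["media"]["payload"])


def frame_duration_ms(audio: bytes) -> float:
    """Duration of a frame under the assumed sample rate and encoding."""
    samples = len(audio) / BYTES_PER_SAMPLE
    return 1000.0 * samples / SAMPLE_RATE_HZ
```

Under these assumptions a 20ms frame is 160 bytes, which is a useful sanity check when wiring the stream into STT.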
Speech-to-Text (Deepgram)
Transcribes audio in real time with word-level timestamps and confidence scores.
Alternatives: OpenAI Whisper streaming, Google STT v2, AssemblyAI
Voice Activity Detector
Detects when the caller stops speaking to trigger the LLM turn. Also detects barge-in mid-response.
Alternatives: Silero VAD, Twilio native VAD, WebRTC VAD
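End-of-utterance detection can be sketched as a small state machine over per-frame speech probabilities, such as those a VAD model emits. The threshold and the 500ms trailing-silence window below are illustrative assumptions, not tuned values, and `Endpointer` is a hypothetical name.

```python
from dataclasses import dataclass

FRAME_MS = 20  # matches the 20ms telephony frames


@dataclass
class Endpointer:
    speech_threshold: float = 0.5   # assumed probability cutoff
    silence_ms_to_end: int = 500    # assumed trailing silence before the turn ends
    _in_speech: bool = False
    _silence_ms: int = 0

    def update(self, speech_prob: float) -> bool:
        """Feed one frame; return True when end-of-utterance fires."""
        if speech_prob >= self.speech_threshold:
            self._in_speech = True
            self._silence_ms = 0
            return False
        if not self._in_speech:
            return False  # silence before any speech never ends a turn
        self._silence_ms += FRAME_MS
        if self._silence_ms >= self.silence_ms_to_end:
            self._in_speech = False
            self._silence_ms = 0
            return True
        return False
```

A shorter silence window lowers latency but cuts off callers who pause mid-sentence, so this value is usually tuned against real call audio.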
LLM + Tool Use
Generates the spoken response turn. Calls CRM and order tools when needed.
Alternatives: Claude Sonnet 4 (pipelined), Gemini 2.5 Flash Live
Text-to-Speech
Converts LLM output to audio, streamed in chunks so playback starts before generation completes.
Alternatives: ElevenLabs Turbo v2, Cartesia Sonic, OpenAI TTS native (realtime API)
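One common way to start playback before generation completes is to cut the LLM token stream at sentence boundaries and send each sentence to TTS as soon as it closes. The sketch below is a minimal version of that idea; the boundary rule is deliberately naive and the function name is hypothetical.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream.

    Naive rule: a sentence ends when the buffer ends in '.', '!' or '?'.
    Abbreviations and decimals split across tokens would break early.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends
```

For example, the tokens `["Your", " order", " shipped", ".", " It", " arrives", " Friday"]` produce `"Your order shipped."` first, so TTS can begin while the second sentence is still being generated.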
CRM / Order Tools
Function calls to look up account status, open tickets, and process simple transactions.
Alternatives: MCP server over internal APIs
Call Recording + Transcription
Stores full call audio and final transcript for quality review and compliance.
Alternatives: Twilio recording, S3 + async Whisper, Deepgram async transcription
Call Analytics
Tracks resolution rate, call duration, escalation rate, and sentiment per call.
Alternatives: Braintrust call evals, Custom dashboard, Observe.AI
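The four tracked values can be aggregated from per-call records along these lines. The `CallRecord` fields and the sentiment range are assumptions made for illustration; a real pipeline would read them from the recording and transcript store.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallRecord:
    duration_s: float
    resolved: bool     # resolved without human transfer
    escalated: bool
    sentiment: float   # assumed range: -1.0 (negative) to 1.0 (positive)


def summarize(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate per-call records into the headline analytics."""
    if not calls:
        raise ValueError("no calls to summarize")
    n = len(calls)
    return {
        "resolution_rate": sum(c.resolved for c in calls) / n,
        "escalation_rate": sum(c.escalated for c in calls) / n,
        "avg_duration_s": mean(c.duration_s for c in calls),
        "avg_sentiment": mean(c.sentiment for c in calls),
    }
```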
The stack
Twilio has the best documentation and WebSocket streaming API. Vonage and Telnyx are cheaper at high volume (over 1M minutes/month) but are more complex to integrate.
Alternatives: Vonage, Telnyx, Bandwidth
Deepgram Nova-3 has the lowest word error rate on telephony audio (noisy, 8kHz) in 2026 benchmarks, with sub-300ms latency on streaming transcription. It is 40% cheaper than Google STT at scale.
Alternatives: OpenAI Whisper streaming, Google STT v2, AssemblyAI
GPT-4o Realtime is audio-native end to end, so no separate STT or TTS is required. That cuts 200 to 400ms compared to a pipelined approach. Its tool use is weaker than Claude's, but it is faster for conversational turns. Use pipelined Claude when tool-use accuracy matters more than raw latency.
Alternatives: Claude Sonnet 4 (pipelined), Gemini 2.5 Flash Live
ElevenLabs Turbo v2 delivers 120ms time-to-first-audio and sounds natural on telephony-quality audio. Cartesia Sonic is faster at 80ms, but its voice quality is slightly lower. Native GPT-4o Realtime TTS is competitive but locks you into OpenAI.
Alternatives: Cartesia Sonic, OpenAI TTS native
Silero VAD runs locally on CPU and detects speech onset in under 100ms. Send an interrupt signal to the LLM on detection to stop generation and start a new turn.
Alternatives: Twilio native VAD, WebRTC VAD
Cost at each scale
Prototype: 100 minutes/mo, $45/mo
Startup: 50,000 minutes/mo, $16,000/mo
Scale: 1,000,000 minutes/mo, $280,000/mo
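Dividing each tier's monthly cost by its volume gives the effective per-minute rate, which makes the tiers directly comparable. The snippet below only restates the figures above; the names are illustrative.

```python
# (minutes per month, monthly cost in USD), taken from the tiers above
TIERS = {
    "Prototype": (100, 45),
    "Startup": (50_000, 16_000),
    "Scale": (1_000_000, 280_000),
}


def cost_per_minute(minutes: int, monthly_cost: int) -> float:
    """Effective USD per minute, rounded to the cent."""
    return round(monthly_cost / minutes, 2)


for name, (minutes, cost) in TIERS.items():
    print(f"{name}: ${cost_per_minute(minutes, cost):.2f}/min")
```

This works out to $0.45, $0.32, and $0.28 per minute for the three tiers.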
Latency budget
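One way to reason about the budget uses only figures quoted elsewhere in this document: the 1.2s end-to-end target, sub-300ms streaming STT, and 120ms TTS time-to-first-audio. The sketch treats the component figures as upper bounds and reports what remains for endpointing, LLM time-to-first-token, and network hops. How that remainder is split is not specified here and would need to be measured.

```python
TARGET_MS = 1200  # end-to-end target from the quick answer

# Upper bounds quoted for the pipelined components
KNOWN_MS = {
    "stt_streaming": 300,
    "tts_first_audio": 120,
}


def remaining_budget_ms(target_ms: int, known: dict[str, int]) -> int:
    """Milliseconds left for everything not listed in `known`."""
    spent = sum(known.values())
    if spent > target_ms:
        raise ValueError("known components already exceed the target")
    return target_ms - spent


print(remaining_budget_ms(TARGET_MS, KNOWN_MS))  # 780
```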
Tradeoffs
Audio-native vs pipelined
GPT-4o Realtime API handles STT, LLM, and TTS in a single model call. End-to-end latency is 200 to 400ms faster than a pipelined approach. Downside: weaker tool-use and less control over individual components. For support flows where order lookups and CRM writes are common, a pipelined Claude + ElevenLabs setup gives better accuracy.
ElevenLabs vs Cartesia vs native TTS
ElevenLabs Turbo v2 has the best voice quality for English-language calls. Cartesia Sonic is 30% cheaper and 40ms faster but slightly robotic on long sentences. Native GPT-4o TTS is good but costs more per character at high volume.
Twilio vs Vonage vs Telnyx pricing
Twilio charges $0.0085 per minute for inbound calls. Vonage and Telnyx are both under $0.005 at volume. The difference of at least $0.0035 per minute matters at 1M minutes/month (at least $3,500/mo in savings). Twilio's developer experience and support justify the premium up to around 200k minutes per month.
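The carrier comparison is simple arithmetic, sketched here with `Decimal` to avoid floating-point noise on sub-cent rates. It takes $0.005 as the alternative rate, which is the quoted upper bound, so actual savings would be somewhat higher. The function name is illustrative.

```python
from decimal import Decimal


def monthly_savings(minutes: int, current_rate: str, alternative_rate: str) -> Decimal:
    """USD saved per month by moving `minutes` to the cheaper per-minute rate."""
    return (Decimal(current_rate) - Decimal(alternative_rate)) * minutes


print(monthly_savings(1_000_000, "0.0085", "0.005"))  # 3500.0000
print(monthly_savings(200_000, "0.0085", "0.005"))    # 700.0000
```

At 200k minutes per month the gap is about $700, which is the scale at which the integration effort is unlikely to pay for itself.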
Failure modes & guardrails
Latency exceeds 1.5 seconds, caller hangs up
Mitigation: If total end-to-end P95 creeps above 1.2s in monitoring, add a filler phrase ('one moment') triggered after 800ms of silence. This buys 1 to 2 extra seconds without callers perceiving a failure.
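The filler-phrase mitigation can be sketched with `asyncio`: wait up to 800ms for the reply, play the filler if it has not arrived, then keep waiting for the real reply. `generate` and `play` are hypothetical async callables standing in for the LLM and TTS legs.

```python
import asyncio
from typing import Awaitable, Callable


async def respond_with_filler(
    generate: Callable[[], Awaitable[str]],
    play: Callable[[str], Awaitable[None]],
    filler: str = "One moment.",
    filler_after_s: float = 0.8,
) -> str:
    """Play a filler phrase if the reply is not ready within the deadline."""
    task = asyncio.create_task(generate())
    # asyncio.wait does not cancel the task on timeout, so generation continues.
    done, _pending = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        await play(filler)
    reply = await task
    await play(reply)
    return reply
```

The filler is played at most once per turn here; repeating it on every further 800ms of silence would likely sound worse than the pause it covers.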
Barge-in is not detected, caller talks over response
Mitigation: Run Silero VAD continuously on the inbound audio stream even while TTS is playing. On speech detection above the confidence threshold, send an interrupt signal and flush the TTS buffer.
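A minimal sketch of that barge-in path, assuming the VAD yields one speech probability per inbound frame. The threshold and the three-frame debounce are assumptions added to avoid triggering on clicks or line noise, and the class is hypothetical rather than part of any vendor SDK.

```python
from collections import deque


class BargeInController:
    def __init__(self, threshold: float = 0.6, min_speech_frames: int = 3):
        self.threshold = threshold                   # assumed confidence threshold
        self.min_speech_frames = min_speech_frames   # assumed debounce length
        self.tts_buffer: deque = deque()             # queued outbound audio chunks
        self._speech_run = 0

    def queue_tts(self, chunk: bytes) -> None:
        self.tts_buffer.append(chunk)

    def on_inbound_frame(self, speech_prob: float, tts_playing: bool) -> bool:
        """Return True when the caller has interrupted and playback must stop."""
        if speech_prob >= self.threshold:
            self._speech_run += 1
        else:
            self._speech_run = 0
        if tts_playing and self._speech_run >= self.min_speech_frames:
            self.tts_buffer.clear()   # flush audio that has not been played yet
            self._speech_run = 0
            return True               # caller should also interrupt the LLM
        return False
```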
LLM fabricates account or order details
Mitigation: Never let the LLM generate account data from memory. Every order number, account balance, or status must come from a tool call. If the tool fails, say 'I cannot look that up right now' — do not hallucinate a plausible answer.
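In code, this guardrail amounts to building the spoken answer only from tool output and falling back to a fixed phrase on any failure. The sketch below assumes a hypothetical `lookup` callable that returns a dict with a `status` key.

```python
from typing import Callable

FALLBACK = "I cannot look that up right now."


def answer_order_status(order_id: str, lookup: Callable[[str], dict]) -> str:
    """Speak only what the tool returned; never guess when the tool fails."""
    try:
        record = lookup(order_id)
        status = record["status"]
    except Exception:
        # Deliberately broad: timeouts, missing fields, and API errors
        # all lead to the same honest fallback.
        return FALLBACK
    return f"Order {order_id} is {status}."
```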
Wrong language detection
Mitigation: Detect the caller's language in the first 5 seconds using a language-identification model. Route non-target-language callers to a language-specific agent or a human. Do not attempt to respond in the wrong language.
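The routing decision itself is small once a language-identification model has produced a language code and a confidence. The agent table and the 0.8 confidence floor below are assumptions for illustration.

```python
# Hypothetical mapping from language code to a language-specific agent
AGENTS = {"en": "agent-en", "es": "agent-es"}
HUMAN = "human"


def route_call(detected_lang: str, confidence: float, min_confidence: float = 0.8) -> str:
    """Pick an agent for the caller, or hand off to a human when unsure."""
    if confidence < min_confidence:
        return HUMAN  # uncertain detection: do not risk the wrong language
    return AGENTS.get(detected_lang, HUMAN)
```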
Call recording violates 2-party consent laws
Mitigation: Play a required consent disclosure at the start of every call before recording begins. Log that the disclosure was played with a timestamp. In California, consent must be affirmative — a beep is not sufficient.
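One way to enforce the ordering is a gate that refuses to start recording until the disclosure has been logged. This is a sketch with hypothetical names. It does not capture affirmative consent, which would need its own step where required.

```python
from datetime import datetime, timezone
from typing import Optional


class RecordingGate:
    def __init__(self) -> None:
        self.disclosure_played_at: Optional[datetime] = None
        self.recording = False

    def mark_disclosure_played(self) -> datetime:
        """Log, with a UTC timestamp, that the disclosure was played."""
        self.disclosure_played_at = datetime.now(timezone.utc)
        return self.disclosure_played_at

    def start_recording(self) -> None:
        if self.disclosure_played_at is None:
            raise RuntimeError("consent disclosure has not been played")
        self.recording = True
```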
Frequently asked questions
How much does a voice agent cost per minute?
At scale (1M+ minutes/month), budget $0.25 to $0.35 per minute for a pipelined stack (Deepgram + Claude + ElevenLabs). The audio-native GPT-4o Realtime approach costs $0.40 to $0.60 per minute at similar scale. A 5-minute call costs $1.25 to $3.00 depending on your stack.
Which STT engine is best for voice agents?
Deepgram Nova-3 leads on telephony audio quality in 2026 — low word error rate on accented speech and noisy phone audio. Google STT v2 is competitive for clear audio. AssemblyAI is good for async transcription but adds latency in streaming mode.
How do I hit sub-500ms time-to-first-token?
Use GPT-4o Realtime API (audio-native, no STT/TTS overhead) or combine Deepgram streaming STT with speculative generation that starts before the utterance ends. On pipelined stacks, the biggest gain comes from starting LLM generation the moment VAD detects end-of-utterance rather than waiting for full transcription.
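Speculative generation can be sketched as follows: start an LLM call on each interim transcript, cancel it if the transcript changes, and reuse it if the final transcript matches. `generate` is a hypothetical async callable, and the methods must be called from inside a running event loop.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SpeculativeTurn:
    def __init__(self, generate: Callable[[str], Awaitable[str]]):
        self._generate = generate
        self._task: Optional[asyncio.Task] = None
        self._transcript: Optional[str] = None

    def on_interim(self, transcript: str) -> None:
        """Start (or restart) generation for the latest interim transcript."""
        if transcript == self._transcript:
            return  # speculation already running for this exact text
        if self._task is not None:
            self._task.cancel()  # stale guess: the caller kept talking
        self._transcript = transcript
        self._task = asyncio.create_task(self._generate(transcript))

    async def on_final(self, transcript: str) -> str:
        """Reuse the speculative result if it matches, else regenerate."""
        self.on_interim(transcript)
        assert self._task is not None
        return await self._task
```

The tradeoff is cost: every cancelled guess still consumes some LLM tokens, so this is worth enabling only where the latency gain matters.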
Twilio vs Vonage vs Telnyx?
Twilio is the default choice for teams under 200k minutes per month — best documentation, most integrations, and reliable WebSocket streaming. Telnyx and Vonage are 40 to 50% cheaper at high volume and worth evaluating above 500k minutes per month, but require more engineering to integrate reliably.
How do I handle PCI or HIPAA compliance on voice calls?
For PCI: pause recording and do not transcribe while the caller is entering card numbers. Use DTMF (keypad input) for sensitive numbers instead of voice. For HIPAA: sign a Business Associate Agreement (BAA) with your telephony and STT providers. Twilio, Deepgram, and AssemblyAI all offer BAAs. Encrypt all recordings at rest.
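The pause-and-resume step is easy to get wrong if an error occurs mid-entry, so a context manager is one reasonable shape for it. `Recorder` below is a hypothetical stand-in for whatever recording control your telephony provider exposes.

```python
from contextlib import contextmanager
from typing import Iterator


class Recorder:
    """Hypothetical stand-in for a provider's recording control."""

    def __init__(self) -> None:
        self.events: list[str] = []

    def pause(self) -> None:
        self.events.append("pause")

    def resume(self) -> None:
        self.events.append("resume")


@contextmanager
def sensitive_input(recorder: Recorder) -> Iterator[None]:
    """Pause recording for the duration of the block, even if it raises."""
    recorder.pause()
    try:
        yield
    finally:
        recorder.resume()
```

Card collection over DTMF would then run inside `with sensitive_input(recorder):`, with transcription disabled for the same window.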
How do I evaluate voice agent quality?
Track four metrics: call resolution rate (resolved without human transfer), average handle time, CSAT score from post-call SMS survey, and word error rate from transcription on a sampled 1% of calls. Run LLM-as-judge on transcripts daily against a rubric that scores empathy, accuracy, and escalation appropriateness.
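Word error rate on the sampled calls is the word-level edit distance between a human reference transcript and the STT output, divided by the reference length. A minimal sketch, assuming whitespace tokenization and case-insensitive matching:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Punctuation and number formatting ("5" versus "five") should be normalized before scoring, otherwise they inflate the error rate.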