
AI for Voice Agents

AI voice agents for customer service, IVR replacement, and outbound sales. Achieve sub-500ms end-to-end latency with natural interruption handling and seamless human handoff for complex cases.

Updated Apr 16, 2026 · 5 workflows · ~$0.4–$4 per 1,000 requests

Quick answer

The best voice agent stack for production combines a fast STT service (Deepgram Nova-3 or AssemblyAI Streaming at ~150ms), a low-latency LLM (claude-haiku-3-5 or GPT-4o mini at ~200ms), and a natural TTS voice (ElevenLabs or Cartesia at ~100ms), achieving end-to-end latency of 400–600ms. Total cost is $0.50–$2.50 per call-minute. Use LiveKit or Twilio Media Streams for telephony infrastructure.
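A quick sanity check on that budget, treating the per-stage figures above as illustrative midpoints rather than measured benchmarks:

```python
# Worst-case (non-overlapped) latency budget for the STT -> LLM -> TTS
# pipeline, using midpoints of the per-stage figures quoted above.
STAGES_MS = {
    "stt_final_transcript": 150,  # Deepgram Nova-3 / AssemblyAI streaming
    "llm_first_token": 200,       # claude-haiku-3-5 / GPT-4o mini
    "tts_first_audio": 100,       # ElevenLabs / Cartesia
}

def end_to_end_ms(stages: dict) -> int:
    """Time from end of user speech to first AI audio, stages in series."""
    return sum(stages.values())

print(end_to_end_ms(STAGES_MS))  # 450, inside the 400-600ms target
```

Overlapping the stages (covered in the FAQs below) pushes the realistic number toward the lower end of the range.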

The problem

Traditional IVR systems have a 67% customer abandonment rate and resolve fewer than 20% of calls without human transfer, according to Forrester. Contact centers pay $25–$45 per handled call for human agents, while a fully automated AI voice interaction costs $0.50–$3.00. For a contact center handling 100,000 calls per month, automating even 40% of call volume saves roughly $0.9M–$1.8M per month. Meanwhile, staffing shortages have pushed average hold times to 8.3 minutes in 2025, directly driving customer churn.
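A back-of-envelope check of those savings, pairing the cost ranges above conservatively and optimistically:

```python
# Back-of-envelope check of the savings estimate above, pairing the
# cost ranges conservatively (worst case) and optimistically (best case).
calls_per_month = 100_000
automated = calls_per_month * 0.40          # 40% of volume automated

human_cost = (25.0, 45.0)                   # $ per human-handled call
ai_cost = (0.50, 3.00)                      # $ per automated AI call

low = automated * (human_cost[0] - ai_cost[1])    # cheap humans, pricey AI
high = automated * (human_cost[1] - ai_cost[0])   # pricey humans, cheap AI
print(f"${low:,.0f}-${high:,.0f} saved per month")  # $880,000-$1,780,000
```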

Core workflows

Inbound IVR Replacement

Replace touch-tone IVR menus with a natural conversation that understands intent from free-form speech. Increases first-call resolution from 20% (IVR) to 55–70% (AI agent). Reduces average handle time by 35%.

claude-haiku-3-5 · Twilio Voice · Architecture →

Outbound Sales and Lead Qualification

Automate outbound calls for lead qualification, appointment setting, and follow-up. AI agents qualify 200+ leads per hour versus 30–40 for a human SDR. Integrate with Salesforce to log call outcomes in real time.

gpt-4o-mini · Bland AI · Architecture →

Appointment Scheduling and Confirmation

Handle inbound appointment requests and outbound confirmation calls. Read from and write to Google Calendar or Calendly. Cuts no-show rates by 25–40% with automated reminder calls 24 hours before appointments.

claude-haiku-3-5 · Retell AI · Architecture →

Human Handoff with Context Transfer

Detect when a conversation exceeds AI capability (anger, complex policy exceptions, security verification failure) and transfer to a human agent with a real-time transcript and AI-generated call summary. Reduces human agent ramp-up time per transferred call from 3 minutes to 30 seconds.

claude-sonnet-4 · LiveKit · Architecture →

Post-Call Analytics and QA

Transcribe, summarize, and score every call for sentiment, resolution status, compliance adherence, and CSAT prediction. Automate 100% call QA versus the typical 3–5% human-reviewed sample.

claude-sonnet-4 · Deepgram · Architecture →

Top tools

  • Deepgram
  • ElevenLabs
  • Twilio Voice
  • LiveKit
  • Retell AI
  • Bland AI

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gemini-2.0-flash

FAQs

What end-to-end latency is achievable for AI voice agents, and why does it matter?

End-to-end latency (time from end of user speech to start of AI speech) should be under 600ms for a natural-feeling conversation; humans perceive pauses above 700ms as awkward. The latency breakdown: STT transcription 100–200ms (Deepgram Nova-3 streaming), LLM first-token generation 150–300ms (claude-haiku-3-5 or GPT-4o mini with streaming), TTS audio start 80–150ms (ElevenLabs Turbo v2 or Cartesia). Achieve this by streaming audio in 100ms chunks, starting TTS as soon as the LLM produces its first complete sentence, and overlapping the stages rather than running them strictly in sequence.
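The "start TTS at the first sentence" step can be sketched as a scan over the LLM token stream. This is illustrative only: production pipelines do this inside an async streaming loop, and the sentence-boundary heuristic here is deliberately naive (it will split on abbreviations like "Dr."):

```python
# Emit complete sentences from an LLM token stream so TTS can begin
# speaking while later tokens are still being generated.
def sentences(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        # naive boundary check; real pipelines use a smarter segmenter
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():            # flush any trailing partial sentence
        yield buffer.strip()

tokens = ["Sure", ", I can", " help.", " What", "'s your", " order number?"]
for s in sentences(tokens):
    print(s)  # first sentence is ready for TTS after only 3 tokens
```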

How do I handle interruptions in AI voice conversations?

Barge-in (user talking over the agent) must be handled gracefully to avoid the robotic experience of legacy IVR. Implement voice activity detection (VAD) with a 100–150ms buffer: when user speech is detected, immediately stop TTS playback and re-process with the new user input as context. LiveKit, Twilio, and Retell AI handle this at the infrastructure layer. The LLM prompt should instruct the agent to acknowledge interruptions naturally ('Of course, let me address that') rather than ignoring them or repeating itself.
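A minimal sketch of that barge-in logic as a state machine. The class and method names are hypothetical, not a real LiveKit, Twilio, or Retell API; those stacks implement this at the infrastructure layer:

```python
# Minimal barge-in state machine (illustrative; names are hypothetical).
class PlaybackController:
    def __init__(self):
        self.speaking = False      # agent audio currently playing
        self.interrupted = False   # last turn was cut off by the user

    def start_tts(self, text: str) -> None:
        self.speaking = True
        self.interrupted = False

    def on_vad_speech(self) -> None:
        # VAD detected user speech: stop playback immediately and flag
        # the turn so the next LLM prompt can acknowledge the
        # interruption instead of repeating itself.
        if self.speaking:
            self.speaking = False
            self.interrupted = True

ctrl = PlaybackController()
ctrl.start_tts("Your order shipped on Tuesday and should arrive by...")
ctrl.on_vad_speech()                    # user barges in mid-sentence
print(ctrl.speaking, ctrl.interrupted)  # False True
```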

What STT accuracy is needed, and how do accents affect it?

Production voice agents need STT word error rate (WER) below 8% for general English and below 15% for accented English, to avoid comprehension failures. Deepgram Nova-3 achieves 5–7% WER on standard US/UK English and 9–14% on heavy accents. AssemblyAI Universal-2 performs similarly. For Spanish, French, German, and Portuguese, both services are within 1–2% WER of their English performance. For low-resource languages, Whisper large-v3 covers 90+ languages but adds 200–400ms latency. Always test your specific customer demographics — accent distribution varies significantly by industry and geography.
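When benchmarking STT vendors on your own recordings, WER is simple to compute yourself: word-level edit distance between reference and hypothesis, divided by reference length:

```python
# Word error rate: word-level Levenshtein distance between the reference
# transcript and the STT hypothesis, divided by reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please confirm my appointment time",
          "please confirm my apartment time"))  # 0.2 (1 error / 5 words)
```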

How do I prevent the AI voice agent from saying something harmful or off-script?

Implement a three-layer safety system: (1) system prompt guardrails specifying exact permitted topics, prohibited phrases, and required disclaimers, (2) real-time output filtering that interrupts TTS if the LLM produces a disallowed pattern (pricing not in approved list, legal advice, health claims), and (3) post-call audit flagging for compliance review. For regulated industries (insurance, healthcare, financial services), maintain a human review queue for any call where the safety filter activates. Claude models are particularly strong at following system prompt constraints even under adversarial user pressure.
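Layer 2 can start as a simple pattern check on each sentence before it reaches TTS. The prohibited patterns and approved price list below are hypothetical placeholders, not a complete policy:

```python
import re

# Hypothetical layer-2 output filter: block a sentence from reaching TTS
# if it quotes a price not on the approved list or matches a prohibited
# pattern (legal/health claims). Patterns and prices are illustrative.
APPROVED_PRICES = {"$29.99", "$49.99", "$99.99"}
PROHIBITED = [
    re.compile(r"\blegal advice\b", re.I),
    re.compile(r"\b(cure|diagnos\w*)\b", re.I),
]

def allow_utterance(sentence: str) -> bool:
    for price in re.findall(r"\$\d+(?:\.\d{2})?", sentence):
        if price not in APPROVED_PRICES:
            return False          # price not on the approved list
    return not any(p.search(sentence) for p in PROHIBITED)

print(allow_utterance("The premium plan is $49.99 per month."))  # True
print(allow_utterance("I can offer it for $19.99 today."))       # False
```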

What regulations apply to AI voice agents making outbound calls?

In the US, the TCPA requires prior express written consent for auto-dialed or pre-recorded calls to cell phones, and a 2024 FCC declaratory ruling classifies AI-generated voices as "artificial" under the TCPA, so the same consent rules apply. Do-Not-Call registry compliance is mandatory. Many states (California, Illinois, others) require disclosing at the start of the conversation that the caller is an AI. In the EU, GDPR requires consent and the right to speak with a human on request. For debt collection, FDCPA rules apply to AI agents identically to human agents. Consult telecom counsel before launching outbound AI calling at scale.

What does it cost to build vs buy a voice agent platform?

Build-your-own using LiveKit (open source) + Deepgram + ElevenLabs + claude-haiku-3-5 costs roughly $0.40–$1.20 per call-minute in API fees, plus $50,000–$150,000 in engineering time to build a production-ready system with IVR replacement, handoff, and analytics. Managed platforms like Retell AI and Bland AI cost $0.07–$0.15 per minute (all-in, on their infrastructure) and can go live in 1–2 weeks. Build your own when you need deep customization of the telephony layer, or when volume is high enough (roughly 500,000+ minutes/month) that negotiated API rates bring your per-minute cost below platform pricing and amortize the engineering investment.
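The build-vs-buy break-even depends entirely on your negotiated rates; a sketch of the calculation, with all figures as hypothetical placeholders to substitute with your own:

```python
# Break-even calculator for build vs buy. All figures are hypothetical
# placeholders; substitute your own negotiated rates.
def breakeven_minutes(engineering_cost: float,
                      build_per_min: float,
                      managed_per_min: float) -> float:
    """Total call minutes at which in-house per-minute savings have
    paid back the engineering cost. Infinite if building never wins."""
    if build_per_min >= managed_per_min:
        return float("inf")       # managed stays cheaper at any volume
    return engineering_cost / (managed_per_min - build_per_min)

# e.g. $100k build-out, $0.08/min in-house vs $0.12/min managed
print(breakeven_minutes(100_000, 0.08, 0.12))  # ~2.5M total minutes
```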

Related architectures