
AI for Voice Agents

AI voice agents for customer service, IVR replacement, and outbound sales. Achieve sub-500ms end-to-end latency with natural interruption handling and seamless human handoff for complex cases.

Updated Apr 16, 2026 · 5 workflows · ~$0.4–$4 per 1,000 requests

Quick answer

The best voice agent stack for production combines a fast STT service (Deepgram Nova-3 or AssemblyAI Streaming at ~150ms), a low-latency LLM (claude-haiku-3-5 or GPT-4o mini at ~200ms), and a natural TTS voice (ElevenLabs or Cartesia at ~100ms), achieving end-to-end latency of 400–600ms. Total cost is $0.50–$2.50 per call-minute. Use LiveKit or Twilio Media Streams for telephony infrastructure.
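A quick sanity check on that budget, treating the per-stage figures above as illustrative midpoints rather than measured benchmarks:

```python
# Worst-case (non-overlapped) latency budget for the STT -> LLM -> TTS
# pipeline, using midpoints of the per-stage figures quoted above.
STAGES_MS = {
    "stt_final_transcript": 150,  # Deepgram Nova-3 / AssemblyAI streaming
    "llm_first_token": 200,       # claude-haiku-3-5 / GPT-4o mini
    "tts_first_audio": 100,       # ElevenLabs / Cartesia
}

def end_to_end_ms(stages: dict) -> int:
    """Time from end of user speech to first AI audio, stages in series."""
    return sum(stages.values())

print(end_to_end_ms(STAGES_MS))  # 450, inside the 400-600ms target
```

Overlapping the stages (covered in the FAQs below) pushes the realistic number toward the lower end of the range.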

The problem

Traditional IVR systems have a 67% customer abandonment rate and resolve fewer than 20% of calls without human transfer, according to Forrester. Contact centers pay $25–$45 per handled call for human agents, while a fully automated AI voice interaction costs $0.50–$3.00. For a contact center handling 100,000 calls per month, automating even 40% of call volume saves roughly $0.9M–$1.8M per month. Meanwhile, staffing shortages have pushed average hold times to 8.3 minutes in 2025, directly driving customer churn.
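A back-of-envelope check of those savings, pairing the cost ranges above conservatively and optimistically:

```python
# Back-of-envelope check of the savings estimate above, pairing the
# cost ranges conservatively (worst case) and optimistically (best case).
calls_per_month = 100_000
automated = calls_per_month * 0.40          # 40% of volume automated

human_cost = (25.0, 45.0)                   # $ per human-handled call
ai_cost = (0.50, 3.00)                      # $ per automated AI call

low = automated * (human_cost[0] - ai_cost[1])    # cheap humans, pricey AI
high = automated * (human_cost[1] - ai_cost[0])   # pricey humans, cheap AI
print(f"${low:,.0f}-${high:,.0f} saved per month")  # $880,000-$1,780,000
```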

Core workflows

Inbound IVR Replacement

Replace touch-tone IVR menus with a natural conversation that understands intent from free-form speech. Increases first-call resolution from 20% (IVR) to 55–70% (AI agent). Reduces average handle time by 35%.

claude-haiku-3-5 · Twilio Voice · Architecture →

Outbound Sales and Lead Qualification

Automate outbound calls for lead qualification, appointment setting, and follow-up. AI agents qualify 200+ leads per hour versus 30–40 for a human SDR. Integrate with Salesforce to log call outcomes in real time.

gpt-4o-mini · Bland AI · Architecture →

Appointment Scheduling and Confirmation

Handle inbound appointment requests and outbound confirmation calls. Read from and write to Google Calendar or Calendly. Cuts no-show rates by 25–40% with automated reminder calls 24 hours before appointments.

claude-haiku-3-5 · Retell AI · Architecture →

Human Handoff with Context Transfer

Detect when a conversation exceeds AI capability (anger, complex policy exceptions, security verification failure) and transfer to a human agent with a real-time transcript and AI-generated call summary. Reduces human agent ramp-up time per transferred call from 3 minutes to 30 seconds.

claude-sonnet-4 · LiveKit · Architecture →

Post-Call Analytics and QA

Transcribe, summarize, and score every call for sentiment, resolution status, compliance adherence, and CSAT prediction. Automate 100% call QA versus the typical 3–5% human-reviewed sample.

claude-sonnet-4 · Deepgram · Architecture →

Top tools

  • Deepgram
  • ElevenLabs
  • Twilio Voice
  • LiveKit
  • Retell AI
  • Bland AI

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gemini-2.0-flash

FAQs

What end-to-end latency is achievable for AI voice agents, and why does it matter?

End-to-end latency (time from end of user speech to start of AI speech) should be under 600ms for a natural-feeling conversation; humans perceive pauses above 700ms as awkward. The latency breakdown: STT transcription 100–200ms (Deepgram Nova-3 streaming), LLM first-token generation 150–300ms (claude-haiku-3-5 or GPT-4o mini with streaming), TTS audio start 80–150ms (ElevenLabs Turbo v2 or Cartesia). Achieve this by streaming audio in 100ms chunks, starting TTS as soon as the LLM produces its first complete sentence, and overlapping the stages rather than running them strictly in sequence.
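The "start TTS at the first sentence" step can be sketched as a scan over the LLM token stream. This is illustrative only: production pipelines do this inside an async streaming loop, and the sentence-boundary heuristic here is deliberately naive (it will split on abbreviations like "Dr."):

```python
# Emit complete sentences from an LLM token stream so TTS can begin
# speaking while later tokens are still being generated.
def sentences(token_stream):
    buffer = ""
    for token in token_stream:
        buffer += token
        # naive boundary check; real pipelines use a smarter segmenter
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():            # flush any trailing partial sentence
        yield buffer.strip()

tokens = ["Sure", ", I can", " help.", " What", "'s your", " order number?"]
for s in sentences(tokens):
    print(s)  # first sentence is ready for TTS after only 3 tokens
```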

How do I handle interruptions in AI voice conversations?

Barge-in (user talking over the agent) must be handled gracefully to avoid the robotic experience of legacy IVR. Implement voice activity detection (VAD) with a 100–150ms buffer: when user speech is detected, immediately stop TTS playback and re-process with the new user input as context. LiveKit, Twilio, and Retell AI handle this at the infrastructure layer. The LLM prompt should instruct the agent to acknowledge interruptions naturally ('Of course, let me address that') rather than ignoring them or repeating itself.
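A minimal sketch of that barge-in logic as a state machine. The class and method names are hypothetical, not a real LiveKit, Twilio, or Retell API; those stacks implement this at the infrastructure layer:

```python
# Minimal barge-in state machine (illustrative; names are hypothetical).
class PlaybackController:
    def __init__(self):
        self.speaking = False      # agent audio currently playing
        self.interrupted = False   # last turn was cut off by the user

    def start_tts(self, text: str) -> None:
        self.speaking = True
        self.interrupted = False

    def on_vad_speech(self) -> None:
        # VAD detected user speech: stop playback immediately and flag
        # the turn so the next LLM prompt can acknowledge the
        # interruption instead of repeating itself.
        if self.speaking:
            self.speaking = False
            self.interrupted = True

ctrl = PlaybackController()
ctrl.start_tts("Your order shipped on Tuesday and should arrive by...")
ctrl.on_vad_speech()                    # user barges in mid-sentence
print(ctrl.speaking, ctrl.interrupted)  # False True
```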

What STT accuracy is needed, and how do accents affect it?

Production voice agents need STT word error rate (WER) below 8% for general English and below 15% for accented English, to avoid comprehension failures. Deepgram Nova-3 achieves 5–7% WER on standard US/UK English and 9–14% on heavy accents. AssemblyAI Universal-2 performs similarly. For Spanish, French, German, and Portuguese, both services are within 1–2% WER of their English performance. For low-resource languages, Whisper large-v3 covers 90+ languages but adds 200–400ms latency. Always test your specific customer demographics — accent distribution varies significantly by industry and geography.
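When benchmarking STT vendors on your own recordings, WER is simple to compute yourself: word-level edit distance between reference and hypothesis, divided by reference length:

```python
# Word error rate: word-level Levenshtein distance between the reference
# transcript and the STT hypothesis, divided by reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("please confirm my appointment time",
          "please confirm my apartment time"))  # 0.2 (1 error / 5 words)
```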

How do I prevent the AI voice agent from saying something harmful or off-script?

Implement a three-layer safety system: (1) system prompt guardrails specifying exact permitted topics, prohibited phrases, and required disclaimers, (2) real-time output filtering that interrupts TTS if the LLM produces a disallowed pattern (pricing not in approved list, legal advice, health claims), and (3) post-call audit flagging for compliance review. For regulated industries (insurance, healthcare, financial services), maintain a human review queue for any call where the safety filter activates. Claude models are particularly strong at following system prompt constraints even under adversarial user pressure.
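Layer 2 can start as a simple pattern check on each sentence before it reaches TTS. The prohibited patterns and approved price list below are hypothetical placeholders, not a complete policy:

```python
import re

# Hypothetical layer-2 output filter: block a sentence from reaching TTS
# if it quotes a price not on the approved list or matches a prohibited
# pattern (legal/health claims). Patterns and prices are illustrative.
APPROVED_PRICES = {"$29.99", "$49.99", "$99.99"}
PROHIBITED = [
    re.compile(r"\blegal advice\b", re.I),
    re.compile(r"\b(cure|diagnos\w*)\b", re.I),
]

def allow_utterance(sentence: str) -> bool:
    for price in re.findall(r"\$\d+(?:\.\d{2})?", sentence):
        if price not in APPROVED_PRICES:
            return False          # price not on the approved list
    return not any(p.search(sentence) for p in PROHIBITED)

print(allow_utterance("The premium plan is $49.99 per month."))  # True
print(allow_utterance("I can offer it for $19.99 today."))       # False
```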

What regulations apply to AI voice agents making outbound calls?

In the US, the TCPA requires prior express written consent for auto-dialed or pre-recorded calls to cell phones, and a 2024 FCC declaratory ruling classifies AI-generated voices as "artificial" under the TCPA, so the same consent rules apply. Do-Not-Call registry compliance is mandatory. Many states (California, Illinois, others) require disclosing at the start of the conversation that the caller is an AI. In the EU, GDPR requires consent and the right to speak with a human on request. For debt collection, FDCPA rules apply to AI agents identically to human agents. Consult telecom counsel before launching outbound AI calling at scale.

What does it cost to build vs buy a voice agent platform?

Build-your-own using LiveKit (open source) + Deepgram + ElevenLabs + claude-haiku-3-5 costs roughly $0.40–$1.20 per call-minute in API fees, plus $50,000–$150,000 in engineering time to build a production-ready system with IVR replacement, handoff, and analytics. Managed platforms like Retell AI and Bland AI cost $0.07–$0.15 per minute (all-in, on their infrastructure) and can go live in 1–2 weeks. Build your own when you need deep customization of the telephony layer, or when volume is high enough (roughly 500,000+ minutes/month) that negotiated API rates bring your per-minute cost below platform pricing and amortize the engineering investment.
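The build-vs-buy break-even depends entirely on your negotiated rates; a sketch of the calculation, with all figures as hypothetical placeholders to substitute with your own:

```python
# Break-even calculator for build vs buy. All figures are hypothetical
# placeholders; substitute your own negotiated rates.
def breakeven_minutes(engineering_cost: float,
                      build_per_min: float,
                      managed_per_min: float) -> float:
    """Total call minutes at which in-house per-minute savings have
    paid back the engineering cost. Infinite if building never wins."""
    if build_per_min >= managed_per_min:
        return float("inf")       # managed stays cheaper at any volume
    return engineering_cost / (managed_per_min - build_per_min)

# e.g. $100k build-out, $0.08/min in-house vs $0.12/min managed
print(breakeven_minutes(100_000, 0.08, 0.12))  # ~2.5M total minutes
```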

Related architectures