Reference Architecture · agent
Customer Support Agent
Last updated: April 15, 2026
Quick answer
The production-ready stack is Claude Sonnet 4 as the reasoning model, Pinecone or pgvector for knowledge base retrieval, a tool-use pattern for order lookups, and a human handoff trigger when confidence is low. Expect $0.02–$0.08 per conversation at scale, with P95 latency around 2.5s end-to-end.
The problem
You need to handle high-volume customer support with a conversational agent that can answer from a knowledge base, look up order status, and escalate to humans when needed. The system must respond in under 3 seconds, cost less than $0.05 per conversation, and fail gracefully when it doesn't know the answer.
Architecture
Customer (Web/Chat)
User types question in chat widget or messaging channel.
Alternatives: Voice input via STT, Email inbox ingestion
Intent Router
Small fast model classifies intent (FAQ, order lookup, refund, escalation).
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
Knowledge Retrieval
Embedding search over help center + product docs.
Alternatives: pgvector, Qdrant, Weaviate
Order/Account Tools
Function calls to CRM, order DB, auth service.
Alternatives: MCP server over internal APIs
Response Generator
Main model with retrieved context and tool results.
Alternatives: GPT-4o, Gemini 2.5 Pro
Confidence Gate
If confidence < threshold or sentiment negative, hand off to human.
Alternatives: LLM-as-judge, Rule-based fallback
Human Agent (Escalation)
Live agent receives conversation + AI-drafted summary.
Alternatives: Zendesk, Intercom, Front
The stack
Claude Sonnet 4 leads production support deployments on tool-use reliability — it correctly calls CRM lookup + ticket-create in the right order on >97% of turns in our internal evals vs ~91% for GPT-4o. Streaming TTFT of ~400ms keeps the conversation feeling responsive.
Alternatives: GPT-4o, Gemini 2.5 Pro
Haiku 4 costs ~$0.001 per classification call, cheap enough to classify every inbound message before routing. At 100k messages/month that's $100 in routing overhead vs $800+ if you route everything to Sonnet. The 15–30ms P50 latency adds negligible delay.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
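The routing arithmetic above can be sketched as a back-of-envelope model. The per-call prices are the figures quoted in this section and should be treated as assumptions, not published rate cards:

```python
# Back-of-envelope routing cost model. Per-call prices are assumptions
# taken from the figures quoted above, not vendor rate cards.
ROUTER_PRICE = 0.001   # small router model, per classification call
MAIN_PRICE = 0.008     # routing every message straight to the main model

def monthly_cost(messages_per_month: int, price_per_call: float) -> float:
    """Total monthly spend for one routing strategy."""
    return messages_per_month * price_per_call

print(monthly_cost(100_000, ROUTER_PRICE))  # 100.0
print(monthly_cost(100_000, MAIN_PRICE))    # 800.0
```

At 100k messages/month this reproduces the $100-vs-$800 comparison above; the gap widens linearly with volume.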
Pinecone is managed, low-latency, and proven at scale. Use pgvector if you're already on Postgres.
Alternatives: pgvector, Qdrant
Voyage-3-large tops the MTEB retrieval leaderboard in 2026 and is HIPAA-eligible via Voyage AI's BAA. Use text-embedding-3-large if you're already on OpenAI's platform — it's 10% lower accuracy but avoids an extra vendor.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Skip the agent framework: frameworks add latency and bugs, and a 50-line custom loop is more reliable for support.
Alternatives: LangGraph, Mastra
For evals and observability, support quality degrades silently without continuous evals: a prompt tweak that improves one intent often breaks another. Braintrust and Langfuse both offer session replay and intent-level dashboards. At 10k sessions/month the free tier covers you; paid plans kick in at ~$50/mo.
Alternatives: Self-hosted, Helicone
Cost at each scale
Prototype: 1,000 conversations/mo, ~$35/mo
Startup: 50,000 conversations/mo, ~$1,650/mo
Scale: 1,000,000 conversations/mo, ~$23,500/mo
Latency budget
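One illustrative way to apportion the ~2.5s P95 target across the pipeline stages above. Only the router latency (15–30ms P50) and the ~400ms TTFT are quoted elsewhere in this document; every other allocation is an assumption:

```python
# Illustrative P95 latency budget for the pipeline. Only the router
# latency and ~400ms TTFT come from figures quoted in this document;
# the remaining allocations are assumptions for illustration.
BUDGET_MS = {
    "intent_router": 50,        # small-model classification
    "retrieval": 250,           # vector search + rerank (assumed)
    "tool_calls": 600,          # CRM / order lookups (assumed)
    "generation_ttft": 400,     # time to first streamed token
    "generation_stream": 1100,  # remaining streamed tokens (assumed)
    "overhead": 100,            # network, queueing (assumed)
}

total = sum(BUDGET_MS.values())
assert total <= 3000, f"budget {total}ms blows the 3s SLO"
print(total)  # 2500
```

Budgeting per stage like this makes regressions attributable: when P95 slips, you can see which stage overspent its allocation instead of debugging the whole pipeline.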
Tradeoffs
Latency vs quality
Using Claude Opus 4 instead of Sonnet 4 improves quality on complex queries by ~8% on our evals but adds 900ms P95 latency. Not worth it for most support flows — reserve Opus for escalation-level queries only.
Framework vs custom loop
LangGraph adds 200–400ms overhead per step and introduces debugging friction. A 50-line custom orchestration loop is usually more reliable at production scale.
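A custom loop of the kind described above can be sketched in a few lines. `llm` and `tools` are stand-ins for your model client and tool registry (the interfaces are assumptions, not a real SDK), and the 6-round cap matches the guardrail described later in this document:

```python
# Minimal custom orchestration loop, in the spirit of the "50-line loop"
# described above. `llm` returns either {"tool": ..., "args": ...} or
# {"text": ...}; both interfaces are assumptions for illustration.
from dataclasses import dataclass, field

MAX_TOOL_ROUNDS = 6  # hard cap; prevents infinite tool loops

@dataclass
class Turn:
    messages: list = field(default_factory=list)

def run_turn(llm, tools, turn: Turn, user_msg: str) -> str:
    turn.messages.append({"role": "user", "content": user_msg})
    for _ in range(MAX_TOOL_ROUNDS):
        reply = llm(turn.messages)
        if "text" in reply:  # model produced a final answer
            turn.messages.append({"role": "assistant", "content": reply["text"]})
            return reply["text"]
        # Otherwise execute the requested tool and feed the result back.
        result = tools[reply["tool"]](**reply.get("args", {}))
        turn.messages.append({"role": "tool", "content": str(result)})
    return "ESCALATE"  # cap hit: likely bad tool data or ambiguous intent
```

The whole control flow is visible in one function, which is the debugging advantage over a framework: there is no hidden graph state to inspect when a conversation misbehaves.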
Single vs multi-model
Using Claude for everything is simpler but costs more. Routing simple FAQs to Haiku cuts cost by ~40% with minimal quality drop.
Failure modes & guardrails
Model hallucinates policy details
Mitigation: Constrain output with structured tool calls that reference exact policy IDs. Reject freeform answers about refunds and returns; always route them through the policy lookup tool.
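The "reject freeform answers" check can be a simple validator run before the reply is sent. The `POL-123` ID format and the function name are assumptions for illustration:

```python
# Guardrail sketch: only allow refund/return answers that cite a policy ID
# the policy-lookup tool actually returned this turn. The POL-123 format
# is an assumption for illustration.
import re

POLICY_ID = re.compile(r"POL-\d+")

def validate_policy_answer(answer: str, retrieved_policy_ids: set[str]) -> bool:
    cited = set(POLICY_ID.findall(answer))
    # Reject answers with no citation, and citations the tool never returned
    # (a likely hallucination).
    return bool(cited) and cited <= retrieved_policy_ids
```

A failed validation can trigger a retry with a stricter prompt, or fall through to the human handoff path.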
Retrieval returns stale or wrong docs
Mitigation: Re-index the knowledge base weekly. Use reranking (Cohere Rerank or Voyage Rerank) as a second pass. Log retrieval hit/miss rates to a dashboard.
Agent loops forever
Mitigation: Hard cap at 6 tool-call rounds per conversation. Emit a metric when the cap is hit; it usually indicates a tool returning bad data or ambiguous user intent.
Customer gets angry, LLM doesn't notice
Mitigation: Run a sentiment classifier (small model, <100ms) on every user message. Auto-escalate on negative sentiment trend.
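The "negative sentiment trend" trigger can be a rolling average over per-message scores. The class name, window size, and threshold are assumptions; the per-message score (-1.0 to 1.0) stands in for the small classifier described above:

```python
# Sketch of the sentiment-trend escalation trigger: escalate when the
# rolling average of per-message sentiment scores turns negative.
# Window and threshold values are assumptions for illustration.
from collections import deque

class SentimentGate:
    def __init__(self, window: int = 3, threshold: float = -0.2):
        self.scores = deque(maxlen=window)  # keep only the last N scores
        self.threshold = threshold

    def should_escalate(self, score: float) -> bool:
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.threshold
```

Averaging over a window rather than reacting to a single message avoids escalating on one sarcastic or ambiguous turn while still catching a genuinely deteriorating conversation.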
PII leaks into LLM logs
Mitigation: Redact emails, phone numbers, card numbers before sending to provider. Use a pre-processing pass with regex + a small classifier for edge cases.
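A minimal version of the regex pass might look like the following. The patterns are illustrative and deliberately loose; as noted above, a real deployment pairs them with a small classifier for edge cases:

```python
# Pre-processing pass that redacts obvious PII before messages reach the
# model provider. Patterns are illustrative, not exhaustive. CARD runs
# before PHONE so long digit runs aren't misclassified as phone numbers.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Run this on every user message before it is logged or sent upstream, and keep the unredacted original only in systems already cleared to hold PII.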
Frequently asked questions
How much does an LLM customer support agent cost at scale?
Expect $0.02–$0.08 per conversation at 1M+ conversations/month with proper prompt caching and model routing. A 1M/month operation typically costs $20k–$25k in AI spend plus $2–3k for supporting infrastructure.
Can a customer support agent replace human agents entirely?
Not in 2026. The production pattern is AI handles 60–80% of volume (FAQs, order lookups, policy questions) and escalates the rest. Full replacement leads to bad CSAT and regulatory risk in EU markets.
Which LLM is best for customer support in 2026?
Claude Sonnet 4 leads on tool-use reliability and multi-turn coherence, which matter most for support. GPT-4o is close. Gemini 2.5 Pro is cheaper but weaker on function calling.
Do I need a vector database or is keyword search enough?
Hybrid (dense + sparse BM25) beats either alone on real support queries. If you're small, pgvector + PostgreSQL full-text search is free and sufficient up to ~500k docs.
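One common way to merge the dense and sparse result sets is reciprocal rank fusion, which needs only the two ranked ID lists and no score calibration between the systems. This is a generic sketch; the ranked lists stand in for real pgvector and full-text query results:

```python
# Reciprocal rank fusion (RRF) over two ranked lists of doc IDs: one from
# dense (vector) search, one from sparse (BM25 / full-text) search.
# k=60 is the conventional smoothing constant.
from collections import defaultdict

def rrf(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # reward high ranks in either list
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists get boosted to the top, which is exactly the behavior that makes hybrid retrieval beat either method alone on real support queries.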
How do I evaluate the agent's quality in production?
Track deflection rate (% of conversations resolved without human), CSAT on AI-only conversations, and run LLM-as-judge on a daily sample of 100–500 conversations against a rubric.
What's the #1 failure mode in production?
Stale or wrong information from the knowledge base. Reindex weekly, use reranking, and log retrieval quality. This is more impactful than changing the main LLM.