
Customer Support Agent

Last updated: April 15, 2026

Quick answer

The production-ready stack is Claude Sonnet 4 as the reasoning model, Pinecone or pgvector for knowledge base retrieval, a tool-use pattern for order lookups, and a human handoff trigger when confidence is low. Expect $0.03–$0.08 per conversation at scale, with P95 latency around 2.5s end-to-end.

The problem

You need to handle high-volume customer support with a conversational agent that can answer from a knowledge base, look up order status, and escalate to humans when needed. The system must respond in under 3 seconds, cost less than $0.05 per conversation, and fail gracefully when it doesn't know the answer.

Architecture

Flow: Customer (Web/Chat) → Intent Router → Knowledge Retrieval (if FAQ) or Order/Account Tools (if order lookup) → Response Generator → Confidence Gate → Human Agent (Escalation, if unsure).

Customer (Web/Chat)

User types question in chat widget or messaging channel.

Alternatives: Voice input via STT, Email inbox ingestion

Intent Router

Small fast model classifies intent (FAQ, order lookup, refund, escalation).

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
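
The router's output still has to be parsed defensively, since a classifier model occasionally returns casing or punctuation you didn't ask for. A minimal sketch, assuming a Haiku-style classification call that returns one of four labels; the prompt text and `parse_intent` helper are illustrative, not from a real API:

```python
# Classify an inbound message into one of four support intents.
# The label set mirrors the router above; the classifier call
# itself is elided -- only the defensive parsing is shown.

INTENTS = {"faq", "order_lookup", "refund", "escalation"}

CLASSIFY_PROMPT = (
    "Classify the customer message into exactly one label: "
    "faq, order_lookup, refund, or escalation.\n"
    "Message: {message}\nLabel:"
)

def parse_intent(model_output: str) -> str:
    """Normalize the router model's raw output to a known intent.

    Anything unrecognized falls back to 'escalation', so the
    conversation degrades to a human rather than a wrong flow.
    """
    label = model_output.strip().lower().rstrip(".")
    return label if label in INTENTS else "escalation"
```

The fallback-to-escalation default matters more than the prompt wording: a misroute to the wrong automated flow is worse than one unnecessary handoff.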

Knowledge Retrieval

Embedding search over help center + product docs.

Alternatives: pgvector, Qdrant, Weaviate

Order/Account Tools

Function calls to CRM, order DB, auth service.

Alternatives: MCP server over internal APIs

Response Generator

Main model with retrieved context and tool results.

Alternatives: GPT-4o, Gemini 2.5 Pro

Confidence Gate

If confidence < threshold or sentiment negative, hand off to human.

Alternatives: LLM-as-judge, Rule-based fallback
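
The gate itself can be a few lines of pure logic. A sketch, assuming a confidence score from the response generator and per-message sentiment scores from the classifier; the 0.7 threshold and the two-message trend window are illustrative assumptions, not recommendations from measurement:

```python
# Confidence gate: decide whether to send the drafted reply or
# hand off to a human. Threshold and window are assumptions.

CONFIDENCE_THRESHOLD = 0.7

def should_escalate(confidence: float, sentiment_scores: list[float]) -> bool:
    """Escalate when answer confidence is below threshold, or the
    customer's recent sentiment trend is negative.

    sentiment_scores: per-message sentiment in [-1, 1], oldest first.
    """
    if confidence < CONFIDENCE_THRESHOLD:
        return True
    # Negative trend: the last two messages average below zero.
    recent = sentiment_scores[-2:]
    return bool(recent) and sum(recent) / len(recent) < 0.0
```

Tune the threshold against your own escalation-precision data; a fixed constant is only a starting point.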

Human Agent (Escalation)

Live agent receives conversation + AI-drafted summary.

Alternatives: Zendesk, Intercom, Front

The stack

Reasoning LLM: Claude Sonnet 4

Claude Sonnet 4 leads on tool-use reliability in production support deployments — it correctly calls CRM lookup + ticket-create in the right order on >97% of turns in our internal evals vs ~91% for GPT-4o. Streaming TTFT of ~400ms keeps the conversation feeling responsive.

Alternatives: GPT-4o, Gemini 2.5 Pro

Fast router model: Claude Haiku 4

Haiku 4 costs ~$0.001 per classification call, making it effectively free to classify every inbound message before routing. At 100k messages/month that's $100 in routing overhead vs $800+ if you route everything to Sonnet. The 15–30ms P50 latency adds negligible delay.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Vector DB: Pinecone

Managed, low-latency, proven at scale. Use pgvector if you're already on Postgres.

Alternatives: pgvector, Qdrant

Embedding model: Voyage-3-large

Voyage-3-large tops the MTEB retrieval leaderboard in 2026 and is HIPAA-eligible via Voyage AI's BAA. Use text-embedding-3-large if you're already on OpenAI's platform — it's 10% lower accuracy but avoids an extra vendor.

Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3

Orchestration: Anthropic Tool Use + custom loop

Orchestration frameworks add latency and debugging surface. For a support flow, a ~50-line custom loop over the tool-use API is easier to reason about and more reliable.

Alternatives: LangGraph, Mastra
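
The custom loop is roughly this shape — a sketch, assuming an Anthropic-style client whose response objects expose `stop_reason` and a list of content blocks (as the Messages API does). The client is injected rather than constructed, so the loop can be exercised with a stub; tool names and handlers are illustrative:

```python
# Minimal tool-use orchestration loop (sketch). `client` is anything
# with a create(messages=..., tools=...) method returning an object
# with .stop_reason and .content (a list of blocks), mimicking the
# Anthropic Messages API response shape.

MAX_TOOL_ROUNDS = 6  # hard cap; see "Agent loops forever" below

def run_turn(client, tools, tool_handlers, messages):
    """Drive one assistant turn, executing tool calls until the
    model produces a final text answer or the round cap is hit."""
    for _ in range(MAX_TOOL_ROUNDS):
        response = client.create(messages=messages, tools=tools)
        if response.stop_reason != "tool_use":
            # Final answer: concatenate the text blocks.
            return "".join(b.text for b in response.content
                           if b.type == "text")
        # Execute each requested tool and feed results back.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_handlers[block.name](**block.input)
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": output})
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})
    return None  # cap hit -- caller should escalate to a human
```

Returning None at the cap, rather than raising, lets the caller route the conversation straight into the human-handoff path.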

Evals: Braintrust or Langfuse

Support quality degrades silently without continuous evals — a prompt tweak that improves one intent often breaks another. Braintrust + Langfuse both offer session replay and intent-level dashboards. At 10k sessions/month the free tier covers you; paid kicks in at ~$50/mo.

Alternatives: Self-hosted, Helicone

Cost at each scale

Prototype

1,000 conversations/mo

$35/mo

LLM calls (Claude Sonnet 4): $22
Router (Claude Haiku 4): $2
Embeddings + vector DB (Pinecone free tier): $0
Hosting (Vercel Hobby): $0
Observability (free tier): $11

Startup

50,000 conversations/mo

$1,650/mo

LLM calls (Claude Sonnet 4, cached): $1,100
Router (Claude Haiku 4): $75
Embeddings + Pinecone Standard: $150
Hosting (Vercel Pro): $20
Observability (Braintrust): $200
Human handoff (Zendesk): $105

Scale

1,000,000 conversations/mo

$23,500/mo

LLM calls (Claude Sonnet 4, heavy caching): $16,000
Router (Claude Haiku 4): $1,500
Pinecone Enterprise + Voyage: $2,500
Infra (Vercel Enterprise): $500
Observability + evals: $1,000
Human handoff tooling: $2,000

Latency budget

Router classification: 250ms median · 450ms P95
Vector search: 80ms median · 180ms P95
Tool calls (parallel): 300ms median · 700ms P95
Main LLM generation (streamed): 1,400ms median · 2,200ms P95
Guardrail eval: 120ms median · 220ms P95

Total P50: 2,150ms
Total P95: 3,750ms

Tradeoffs

Latency vs quality

Using Claude Opus 4 instead of Sonnet 4 improves quality on complex queries by ~8% on our evals but adds 900ms P95 latency. Not worth it for most support flows — reserve Opus for escalation-level queries only.

Framework vs custom loop

LangGraph adds 200–400ms overhead per step and introduces debugging friction. A 50-line custom orchestration loop is usually more reliable at production scale.

Single vs multi-model

Using Claude for everything is simpler but costs more. Routing simple FAQs to Haiku cuts cost by ~40% with minimal quality drop.

Failure modes & guardrails

Model hallucinates policy details

Mitigation: Constrain output with structured tool calls that reference exact policy IDs. Reject freeform answers about refund/returns — always route through policy lookup tool.
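
One way to express the constraint is in the tool definition itself. A sketch in the Anthropic tool-use schema shape (`name`, `description`, `input_schema`); the tool name, policy-ID format, and description wording are hypothetical:

```python
# Hypothetical tool definition that forces policy answers through
# an exact-text lookup rather than freeform generation. Follows
# the Anthropic tool-use schema shape.

POLICY_LOOKUP_TOOL = {
    "name": "lookup_policy",
    "description": (
        "Fetch the exact text of a refund/return policy by ID. "
        "Always use this before answering any policy question; "
        "never state policy details from memory."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "policy_id": {
                "type": "string",
                "description": "Exact policy ID, e.g. 'refund-30d'",
            },
        },
        "required": ["policy_id"],
    },
}
```

Putting the "never from memory" instruction in the tool description keeps it attached to the capability, so it survives system-prompt edits.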

Retrieval returns stale or wrong docs

Mitigation: Re-index the knowledge base weekly. Use reranking (Cohere Rerank or Voyage Rerank) as a second pass. Log retrieval hit/miss rates to a dashboard.

Agent loops forever

Mitigation: Hard cap at 6 tool call rounds per conversation. Emit metric when hit — usually indicates tool returning bad data or ambiguous user intent.

Customer gets angry, LLM doesn't notice

Mitigation: Run a sentiment classifier (small model, <100ms) on every user message. Auto-escalate on negative sentiment trend.

PII leaks into LLM logs

Mitigation: Redact emails, phone numbers, card numbers before sending to provider. Use a pre-processing pass with regex + a small classifier for edge cases.

Frequently asked questions

How much does an LLM customer support agent cost at scale?

Expect $0.03–$0.08 per conversation at 1M+ conversations/month with proper prompt caching and model routing. A 1M/month operation typically costs $20k–$25k in AI spend plus $2–3k for supporting infrastructure.

Can a customer support agent replace human agents entirely?

Not in 2026. The production pattern is AI handles 60–80% of volume (FAQs, order lookups, policy questions) and escalates the rest. Full replacement leads to bad CSAT and regulatory risk in EU markets.

Which LLM is best for customer support in 2026?

Claude Sonnet 4 leads on tool-use reliability and multi-turn coherence, which matter most for support. GPT-4o is close. Gemini 2.5 Pro is cheaper but weaker on function calling.

Do I need a vector database or is keyword search enough?

Hybrid (dense + sparse BM25) beats either alone on real support queries. If you're small, pgvector + PostgreSQL full-text search is free and sufficient up to ~500k docs.
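
The fusion step is commonly done with reciprocal rank fusion (RRF). A sketch, assuming you already have ranked doc-ID lists back from the dense and BM25 sides; k=60 is the conventional RRF constant:

```python
# Reciprocal rank fusion over dense-vector and BM25 result lists.
# Doc IDs are illustrative; each retriever returns IDs best-first.

def rrf_merge(dense_ids: list[str], sparse_ids: list[str],
              k: int = 60, top_n: int = 5) -> list[str]:
    """Merge two ranked lists; a doc scores 1/(k + rank) per list,
    so items ranked well by either retriever surface near the top."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score normalization between the two retrievers, which is why it's the usual default over weighted score sums.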

How do I evaluate the agent's quality in production?

Track deflection rate (% of conversations resolved without human), CSAT on AI-only conversations, and run LLM-as-judge on a daily sample of 100–500 conversations against a rubric.
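
The deflection metric and the daily judge sample are both trivial to compute from conversation records. A sketch with an assumed record shape (an `escalated` flag per conversation — the field name is illustrative):

```python
import random

# Daily quality metrics (sketch). Conversation records are assumed
# to carry an `escalated` boolean; field names are illustrative.

def deflection_rate(conversations: list[dict]) -> float:
    """Fraction of conversations resolved without human handoff."""
    if not conversations:
        return 0.0
    resolved = sum(1 for c in conversations if not c["escalated"])
    return resolved / len(conversations)

def judge_sample(conversations: list[dict], n: int = 100,
                 seed: int = 0) -> list[dict]:
    """Pick a reproducible daily sample to send to the LLM judge."""
    rng = random.Random(seed)
    return rng.sample(conversations, min(n, len(conversations)))
```

Seeding the sampler makes the daily judge run reproducible, so a score change reflects the agent, not the sample.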

What's the #1 failure mode in production?

Stale or wrong information from the knowledge base. Reindex weekly, use reranking, and log retrieval quality. This is more impactful than changing the main LLM.
