
Customer Support Agent

Last updated: April 15, 2026

Quick answer

The production-ready stack is Claude Sonnet 4 as the reasoning model, Pinecone or pgvector for knowledge base retrieval, a tool-use pattern for order lookups, and a human handoff trigger when confidence is low. Expect $0.03–$0.08 per conversation at scale, with P95 latency around 2.5s end-to-end.

The problem

You need to handle high-volume customer support with a conversational agent that can answer from a knowledge base, look up order status, and escalate to humans when needed. The system must respond in under 3 seconds, cost less than $0.05 per conversation, and fail gracefully when it doesn't know the answer.

Architecture

Flow: Customer (Web/Chat) → Intent Router → Knowledge Retrieval (if FAQ) or Order/Account Tools (if order lookup) → Response Generator → Confidence Gate → Human Agent (Escalation, if unsure).

Customer (Web/Chat)

User types question in chat widget or messaging channel.

Alternatives: Voice input via STT, Email inbox ingestion

Intent Router

Small fast model classifies intent (FAQ, order lookup, refund, escalation).

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
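
The router's output still has to be parsed defensively, since a classifier model occasionally returns casing or punctuation you didn't ask for. A minimal sketch, assuming a Haiku-style classification call that returns one of four labels; the prompt text and `parse_intent` helper are illustrative, not from a real API:

```python
# Classify an inbound message into one of four support intents.
# The label set mirrors the router above; the classifier call
# itself is elided -- only the defensive parsing is shown.

INTENTS = {"faq", "order_lookup", "refund", "escalation"}

CLASSIFY_PROMPT = (
    "Classify the customer message into exactly one label: "
    "faq, order_lookup, refund, or escalation.\n"
    "Message: {message}\nLabel:"
)

def parse_intent(model_output: str) -> str:
    """Normalize the router model's raw output to a known intent.

    Anything unrecognized falls back to 'escalation', so the
    conversation degrades to a human rather than a wrong flow.
    """
    label = model_output.strip().lower().rstrip(".")
    return label if label in INTENTS else "escalation"
```

The fallback-to-escalation default matters more than the prompt wording: a misroute to the wrong automated flow is worse than one unnecessary handoff.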

Knowledge Retrieval

Embedding search over help center + product docs.

Alternatives: pgvector, Qdrant, Weaviate

Order/Account Tools

Function calls to CRM, order DB, auth service.

Alternatives: MCP server over internal APIs

Response Generator

Main model with retrieved context and tool results.

Alternatives: GPT-4o, Gemini 2.5 Pro

Confidence Gate

If confidence < threshold or sentiment negative, hand off to human.

Alternatives: LLM-as-judge, Rule-based fallback
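
The gate itself can be a few lines of pure logic. A sketch, assuming a confidence score from the response generator and per-message sentiment scores from the classifier; the 0.7 threshold and the two-message trend window are illustrative assumptions, not recommendations from measurement:

```python
# Confidence gate: decide whether to send the drafted reply or
# hand off to a human. Threshold and window are assumptions.

CONFIDENCE_THRESHOLD = 0.7

def should_escalate(confidence: float, sentiment_scores: list[float]) -> bool:
    """Escalate when answer confidence is below threshold, or the
    customer's recent sentiment trend is negative.

    sentiment_scores: per-message sentiment in [-1, 1], oldest first.
    """
    if confidence < CONFIDENCE_THRESHOLD:
        return True
    # Negative trend: the last two messages average below zero.
    recent = sentiment_scores[-2:]
    return bool(recent) and sum(recent) / len(recent) < 0.0
```

Tune the threshold against your own escalation-precision data; a fixed constant is only a starting point.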

Human Agent (Escalation)

Live agent receives conversation + AI-drafted summary.

Alternatives: Zendesk, Intercom, Front

The stack

Reasoning LLM: Claude Sonnet 4

Claude Sonnet 4 leads on tool-use reliability in production support deployments — it correctly calls CRM lookup + ticket-create in the right order on >97% of turns in our internal evals vs ~91% for GPT-4o. Streaming TTFT of ~400ms keeps the conversation feeling responsive.

Alternatives: GPT-4o, Gemini 2.5 Pro

Fast router model: Claude Haiku 4

Haiku 4 costs ~$0.001 per classification call, making it effectively free to classify every inbound message before routing. At 100k messages/month that's $100 in routing overhead vs $800+ if you route everything to Sonnet. The 15–30ms P50 latency adds negligible delay.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Vector DB: Pinecone

Managed, low-latency, proven at scale. Use pgvector if you're already on Postgres.

Alternatives: pgvector, Qdrant

Embedding model: Voyage-3-large

Voyage-3-large tops the MTEB retrieval leaderboard in 2026 and is HIPAA-eligible via Voyage AI's BAA. Use text-embedding-3-large if you're already on OpenAI's platform — it's 10% lower accuracy but avoids an extra vendor.

Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3

Orchestration: Anthropic Tool Use + custom loop

Orchestration frameworks add latency and debugging surface. For a support flow, a ~50-line custom loop over the tool-use API is easier to reason about and more reliable.

Alternatives: LangGraph, Mastra
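
The custom loop is roughly this shape — a sketch, assuming an Anthropic-style client whose response objects expose `stop_reason` and a list of content blocks (as the Messages API does). The client is injected rather than constructed, so the loop can be exercised with a stub; tool names and handlers are illustrative:

```python
# Minimal tool-use orchestration loop (sketch). `client` is anything
# with a create(messages=..., tools=...) method returning an object
# with .stop_reason and .content (a list of blocks), mimicking the
# Anthropic Messages API response shape.

MAX_TOOL_ROUNDS = 6  # hard cap; see "Agent loops forever" below

def run_turn(client, tools, tool_handlers, messages):
    """Drive one assistant turn, executing tool calls until the
    model produces a final text answer or the round cap is hit."""
    for _ in range(MAX_TOOL_ROUNDS):
        response = client.create(messages=messages, tools=tools)
        if response.stop_reason != "tool_use":
            # Final answer: concatenate the text blocks.
            return "".join(b.text for b in response.content
                           if b.type == "text")
        # Execute each requested tool and feed results back.
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_handlers[block.name](**block.input)
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": output})
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})
    return None  # cap hit -- caller should escalate to a human
```

Returning None at the cap, rather than raising, lets the caller route the conversation straight into the human-handoff path.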

Evals: Braintrust or Langfuse

Support quality degrades silently without continuous evals — a prompt tweak that improves one intent often breaks another. Braintrust + Langfuse both offer session replay and intent-level dashboards. At 10k sessions/month the free tier covers you; paid kicks in at ~$50/mo.

Alternatives: Self-hosted, Helicone

Cost at each scale

Prototype

1,000 conversations/mo

$35/mo

LLM calls (Claude Sonnet 4): $22
Router (Claude Haiku 4): $2
Embeddings + vector DB (Pinecone free tier): $0
Hosting (Vercel Hobby): $0
Observability (free tier): $11

Startup

50,000 conversations/mo

$1,650/mo

LLM calls (Claude Sonnet 4, cached): $1,100
Router (Claude Haiku 4): $75
Embeddings + Pinecone Standard: $150
Hosting (Vercel Pro): $20
Observability (Braintrust): $200
Human handoff (Zendesk): $105

Scale

1,000,000 conversations/mo

$23,500/mo

LLM calls (Claude Sonnet 4, heavy caching): $16,000
Router (Claude Haiku 4): $1,500
Pinecone Enterprise + Voyage: $2,500
Infra (Vercel Enterprise): $500
Observability + evals: $1,000
Human handoff tooling: $2,000

Latency budget

Router classification: 250ms median · 450ms P95
Vector search: 80ms median · 180ms P95
Tool calls (parallel): 300ms median · 700ms P95
Main LLM generation (streamed): 1,400ms median · 2,200ms P95
Guardrail eval: 120ms median · 220ms P95

Total P50: 2,150ms
Total P95: 3,750ms

Tradeoffs

Latency vs quality

Using Claude Opus 4 instead of Sonnet 4 improves quality on complex queries by ~8% on our evals but adds 900ms P95 latency. Not worth it for most support flows — reserve Opus for escalation-level queries only.

Framework vs custom loop

LangGraph adds 200–400ms overhead per step and introduces debugging friction. A 50-line custom orchestration loop is usually more reliable at production scale.

Single vs multi-model

Using Claude for everything is simpler but costs more. Routing simple FAQs to Haiku cuts cost by ~40% with minimal quality drop.

Failure modes & guardrails

Model hallucinates policy details

Mitigation: Constrain output with structured tool calls that reference exact policy IDs. Reject freeform answers about refund/returns — always route through policy lookup tool.
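
One way to express the constraint is in the tool definition itself. A sketch in the Anthropic tool-use schema shape (`name`, `description`, `input_schema`); the tool name, policy-ID format, and description wording are hypothetical:

```python
# Hypothetical tool definition that forces policy answers through
# an exact-text lookup rather than freeform generation. Follows
# the Anthropic tool-use schema shape.

POLICY_LOOKUP_TOOL = {
    "name": "lookup_policy",
    "description": (
        "Fetch the exact text of a refund/return policy by ID. "
        "Always use this before answering any policy question; "
        "never state policy details from memory."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "policy_id": {
                "type": "string",
                "description": "Exact policy ID, e.g. 'refund-30d'",
            },
        },
        "required": ["policy_id"],
    },
}
```

Putting the "never from memory" instruction in the tool description keeps it attached to the capability, so it survives system-prompt edits.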

Retrieval returns stale or wrong docs

Mitigation: Re-index the knowledge base weekly. Use reranking (Cohere Rerank or Voyage Rerank) as a second pass. Log retrieval hit/miss rates to a dashboard.

Agent loops forever

Mitigation: Hard cap at 6 tool call rounds per conversation. Emit metric when hit — usually indicates tool returning bad data or ambiguous user intent.

Customer gets angry, LLM doesn't notice

Mitigation: Run a sentiment classifier (small model, <100ms) on every user message. Auto-escalate on negative sentiment trend.

PII leaks into LLM logs

Mitigation: Redact emails, phone numbers, card numbers before sending to provider. Use a pre-processing pass with regex + a small classifier for edge cases.

Frequently asked questions

How much does an LLM customer support agent cost at scale?

Expect $0.03–$0.08 per conversation at 1M+ conversations/month with proper prompt caching and model routing. A 1M/month operation typically costs $20k–$25k in AI spend plus $2–3k for supporting infrastructure.

Can a customer support agent replace human agents entirely?

Not in 2026. The production pattern is AI handles 60–80% of volume (FAQs, order lookups, policy questions) and escalates the rest. Full replacement leads to bad CSAT and regulatory risk in EU markets.

Which LLM is best for customer support in 2026?

Claude Sonnet 4 leads on tool-use reliability and multi-turn coherence, which matter most for support. GPT-4o is close. Gemini 2.5 Pro is cheaper but weaker on function calling.

Do I need a vector database or is keyword search enough?

Hybrid (dense + sparse BM25) beats either alone on real support queries. If you're small, pgvector + PostgreSQL full-text search is free and sufficient up to ~500k docs.
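
The fusion step is commonly done with reciprocal rank fusion (RRF). A sketch, assuming you already have ranked doc-ID lists back from the dense and BM25 sides; k=60 is the conventional RRF constant:

```python
# Reciprocal rank fusion over dense-vector and BM25 result lists.
# Doc IDs are illustrative; each retriever returns IDs best-first.

def rrf_merge(dense_ids: list[str], sparse_ids: list[str],
              k: int = 60, top_n: int = 5) -> list[str]:
    """Merge two ranked lists; a doc scores 1/(k + rank) per list,
    so items ranked well by either retriever surface near the top."""
    scores: dict[str, float] = {}
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

RRF needs no score normalization between the two retrievers, which is why it's the usual default over weighted score sums.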

How do I evaluate the agent's quality in production?

Track deflection rate (% of conversations resolved without human), CSAT on AI-only conversations, and run LLM-as-judge on a daily sample of 100–500 conversations against a rubric.
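
The deflection metric and the daily judge sample are both trivial to compute from conversation records. A sketch with an assumed record shape (an `escalated` flag per conversation — the field name is illustrative):

```python
import random

# Daily quality metrics (sketch). Conversation records are assumed
# to carry an `escalated` boolean; field names are illustrative.

def deflection_rate(conversations: list[dict]) -> float:
    """Fraction of conversations resolved without human handoff."""
    if not conversations:
        return 0.0
    resolved = sum(1 for c in conversations if not c["escalated"])
    return resolved / len(conversations)

def judge_sample(conversations: list[dict], n: int = 100,
                 seed: int = 0) -> list[dict]:
    """Pick a reproducible daily sample to send to the LLM judge."""
    rng = random.Random(seed)
    return rng.sample(conversations, min(n, len(conversations)))
```

Seeding the sampler makes the daily judge run reproducible, so a score change reflects the agent, not the sample.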

What's the #1 failure mode in production?

Stale or wrong information from the knowledge base. Reindex weekly, use reranking, and log retrieval quality. This is more impactful than changing the main LLM.
