Reference Architecture · RAG
Customer Knowledge Base Chatbot
Last updated: April 16, 2026
Quick answer
Use GPT-4o-mini or Claude Haiku 4 as the answer model — both are under $1 per 1000 conversations. Embed articles with OpenAI text-embedding-3-large, store in pgvector or Pinecone, rerank the top 20 with Cohere Rerank v3, and stream an answer with citations. Expect $0.003 to $0.012 per conversation at scale. The real ROI lever is deflection tracking plus weekly content gap analysis — not a more expensive model.
The problem
You run an SMB help center with ~10,000 articles in Zendesk, Intercom, or HelpScout. Customers ask the same 200 questions in a hundred ways. You need a chatbot that answers in 2 seconds, deflects 40-60% of tickets, costs pennies per conversation, and hands off cleanly to a human when it is not confident. It has to work at 50k-500k conversations per month without burning your margin.
Architecture
Help Center Ingester
Pulls articles from Zendesk, Intercom, HelpScout, or a CMS. Handles HTML cleanup, removes boilerplate, and extracts metadata like category, product, and last-updated date.
Alternatives: Zendesk Help Center API, Intercom Articles API, Ragie, Custom scraper
Section-aware Chunker
Splits each article at H2/H3 headings into 300-600 token chunks. Preserves the article title and section path as metadata so citations look natural in the UI.
Alternatives: Recursive character splitter, Semantic chunking, Fixed-size 512-token chunks
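The heading split above can be sketched in a few lines, assuming articles have already been converted from HTML to Markdown; the `chunk_article` helper and its metadata fields are illustrative, not a fixed schema:

```python
import re

def chunk_article(title: str, markdown: str) -> list[dict]:
    """Split a Markdown article at H2/H3 headings; keep the title and
    section path as metadata so citations read naturally in the UI.
    (Further splitting of oversized sections into 300-600 token
    windows is omitted for brevity.)"""
    # The capture group makes re.split keep the headings:
    # [preamble, heading, body, heading, body, ...]
    parts = re.split(r"(?m)^(#{2,3} .+)$", markdown)
    chunks, path = [], []

    def emit(text: str, path: list[str]) -> None:
        text = text.strip()
        if text:
            chunks.append({
                "text": text,
                "title": title,
                "section_path": " → ".join(path) if path else title,
            })

    emit(parts[0], [])  # any intro text before the first heading
    for heading, body in zip(parts[1::2], parts[2::2]):
        level = heading.count("#", 0, 3)   # 2 for H2, 3 for H3
        name = heading.lstrip("# ").strip()
        path = [name] if level == 2 else path[:1] + [name]
        emit(body, path)
    return chunks
```

An H3 chunk ends up with a section path like "Step 1 → Details", which is exactly what the citation UI wants to render.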
Embedding Model
Embeds chunks and user questions into a shared 3072-dim vector space. Cheap at scale — embedding the entire 10k-article corpus is a one-time cost of a few dollars.
Alternatives: Voyage-3, Cohere Embed v3
Vector Store
Stores ~30k chunks with metadata for filtering by product, locale, and publish state. Sized small enough that pgvector on a shared Postgres works.
Alternatives: Pinecone Serverless, Qdrant Cloud, Turbopuffer
Hybrid Retriever
Runs dense retrieval plus BM25 on the same chunk store. Merges with RRF. Filters by locale so a German customer never sees English articles first.
Alternatives: Dense-only, BM25 + RRF, Tantivy + pgvector
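The RRF merge itself is tiny; a sketch using the standard k = 60 constant (the function name and lists-of-ids interface are assumptions):

```python
def rrf_merge(dense_ids: list[str], bm25_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).
    k = 60 is the constant from the original RRF paper; documents that
    appear in both rankings accumulate score from each."""
    scores: dict[str, float] = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The locale filter should be applied inside each retriever before fusion, so both input rankings are already locale-correct.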
Reranker
Cohere Rerank v3 takes the top 20 retrieved chunks and ranks them down to 4. Adds ~150ms but pushes answer accuracy from ~70% to ~90%.
Alternatives: Voyage Rerank 2.5, Jina Reranker v2
Answer Model
Small, fast, cheap. Streams an answer with inline citations, a confidence signal, and a suggested next step. Prompt-cached so the system instruction costs a fraction of full price on every request after the first.
Alternatives: Claude Haiku 4, Gemini 2.0 Flash, Llama 3.3 on Groq
Escalation Router
Reads a confidence score and a topic classifier. If confidence is below threshold or topic is billing/account, routes the conversation to a human in Zendesk with full transcript and retrieved docs attached.
Alternatives: Zendesk Flow Builder, Intercom Fin handoff, Custom webhook
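A minimal sketch of that routing decision; the 0.7 threshold and topic labels are placeholders to tune against your own escalation data:

```python
# Topics that always go to a human, regardless of model confidence.
SENSITIVE_TOPICS = {"billing", "account"}

def route(confidence: float, topic: str, threshold: float = 0.7) -> tuple[str, str]:
    """Return (destination, reason). Sensitive topics bypass the
    confidence check entirely; everything else escalates only when
    the model's confidence falls below the threshold."""
    if topic in SENSITIVE_TOPICS:
        return ("human", "sensitive_topic")
    if confidence < threshold:
        return ("human", "low_confidence")
    return ("bot", "answered")
```

The reason string travels with the handoff so agents in Zendesk see why the bot escalated.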
Chat Widget
In-app chat widget that streams the answer token by token, renders inline citations, shows a thumbs up/down, and offers 'talk to a human' at any turn.
Alternatives: Intercom Messenger, Crisp, Custom React widget
The stack
Ingester: Zendesk Help Center API. The native API exposes publish state, locale, and labels. Syncing every 15 minutes is more than enough — help center content does not change minute to minute.
Alternatives: Intercom API, HelpScout API, Ragie
Chunker: heading-based splitting. Help articles are already structured with H2/H3 sections. Splitting on headings gives chunks that map to a specific task, which makes citations look like ‘see How to reset your password → Step 2’ instead of fragments.
Alternatives: Semantic chunking, Fixed-size 512
Embeddings: OpenAI text-embedding-3-large. At 10k articles you are not volume-bound. OpenAI embeddings are fine here and integrate cleanly with gpt-4o-mini. If you are already on Anthropic, Voyage-3 is the equivalent choice.
Alternatives: Voyage-3, Cohere Embed v3
Vector store: pgvector. At ~30k chunks you do not need a dedicated vector DB. pgvector with an HNSW index queries in under 40ms and avoids a new infra bill. Move to Pinecone Serverless only past ~500k-1M chunks or when you need geo-replication.
Alternatives: Pinecone Serverless, Qdrant, Turbopuffer
Reranker: Cohere Rerank v3. $1 per 1k searches and ~150ms latency. Lifts answer accuracy from the mid 70s to the low 90s for help-center queries. The cheapest accuracy boost available.
Alternatives: Voyage Rerank 2.5, Jina Reranker v2
Answer model: GPT-4o-mini. $0.15 input / $0.60 output per 1M tokens. With a cached 2k-token system prompt, a full conversation costs under $0.005. Claude Haiku 4 is a drop-in alternative with slightly better citation accuracy.
Alternatives: Claude Haiku 4, Gemini 2.0 Flash
Observability. You must measure thumbs-up rate, escalation rate, and cost per resolved conversation weekly. Without tracking these, content gaps stay invisible and the deflection rate drops over time.
Alternatives: Langfuse, LangSmith, Arize Phoenix
Cost at each scale
Prototype
2k articles · 5k conversations/mo
$85/mo
Startup
10k articles · 100k conversations/mo
$780/mo
Scale
30k articles · 1M conversations/mo
$5,200/mo
Latency budget
Target under 2.5 seconds to first token for chat (under 1.2s for voice): roughly 40ms for the pgvector query, ~150ms for the rerank, and the remainder for LLM time-to-first-token and streaming.
Tradeoffs
Cheap model vs big model
GPT-4o-mini or Haiku 4 handles 90% of help-center queries just as well as GPT-4o or Sonnet 4. The 10% where a bigger model helps are the same queries you should be routing to a human anyway. Save the money — don't pay $10/1M output tokens to answer ‘how do I reset my password’.
Rerank every query vs cache popular answers
For a help center, 60% of queries hit the top 50 questions. You can cache the retrieved-chunks-plus-answer keyed on a normalized question hash and skip rerank + LLM entirely for cache hits. Cuts cost by roughly 40% at the price of a smarter cache invalidation story when articles change.
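A sketch of the cache key plus the invalidation hook; the normalization scheme and the per-answer chunk-id bookkeeping are assumptions about how you store answers:

```python
import hashlib
import re

def cache_key(question: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace so
    'Reset my password?' and 'reset  my password' share one key."""
    norm = " ".join(re.sub(r"[^a-z0-9 ]", "", question.lower()).split())
    return hashlib.sha256(norm.encode()).hexdigest()

# key -> {"answer": str, "chunk_ids": list[str]}
answer_cache: dict[str, dict] = {}

def invalidate_for_article(updated_chunk_ids: set[str]) -> None:
    """On article publish/update, drop every cached answer that was
    built from one of the changed chunks."""
    stale = [k for k, v in answer_cache.items()
             if updated_chunk_ids & set(v["chunk_ids"])]
    for k in stale:
        del answer_cache[k]
```

Wiring `invalidate_for_article` to the same webhook that triggers reindexing keeps the cache honest without a TTL guessing game.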
Build vs buy (Intercom Fin, Ada, Forethought)
Intercom Fin is $0.99 per resolved conversation. At 100k conversations per month with 40% deflection, that is ~$40k/mo. A custom stack runs ~$800/mo at the same volume — but you now own the retrieval, evals, and content ops. The break-even is roughly 20k conversations per month.
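The arithmetic is worth writing down. In this sketch, the $0.005 per-conversation and $300/mo fixed-infra figures are assumptions consistent with the numbers above, and it deliberately omits the engineering time that drives the real break-even:

```python
def fin_monthly_cost(conversations: int, deflection: float,
                     per_resolved: float = 0.99) -> float:
    """Intercom Fin bills per resolved (deflected) conversation."""
    return conversations * deflection * per_resolved

def custom_monthly_cost(conversations: int, per_conversation: float = 0.005,
                        fixed_infra: float = 300.0) -> float:
    """Custom stack: fixed infra (Postgres, hosting, observability)
    plus marginal LLM + rerank cost per handled conversation."""
    return fixed_infra + conversations * per_conversation
```

At 100k conversations and 40% deflection this gives ~$39.6k for Fin versus ~$800 for the custom stack, matching the figures above.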
Failure modes & guardrails
Bot answers confidently from outdated articles
Mitigation: Store article updated_at in chunk metadata. In the system prompt, instruct the model to mention when a cited article was last updated if it is older than 180 days. Decay retrieval score by 10% for articles older than 12 months.
Customer phrases differ wildly from article wording
Mitigation: Maintain a weekly ‘unresolved questions’ report from low-thumbs-up conversations and low-confidence escalations. Feed top clusters to a content writer to create new articles or edit existing ones. This is the single biggest deflection-rate lever.
Model hallucinates policy details (refunds, SLAs, pricing)
Mitigation: Hard-code a deny list of regex topics (‘refund’, ‘cancel subscription’, ‘SLA’, ‘legal’, ‘guarantee’). For any match, skip the LLM and route straight to a human. Policy answers must come from a human or a structured policy doc, never an article.
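A sketch of that deny list; the exact patterns here are illustrative and should be tuned to your product's vocabulary:

```python
import re

# Policy-sensitive topics that must never reach the LLM.
DENY_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"\brefunds?\b",
    r"\bcancel(?:lation)?\b",
    r"\bSLAs?\b",
    r"\blegal\b",
    r"\bguarantees?\b",
)]

def must_escalate(message: str) -> bool:
    """True when any deny pattern matches: skip retrieval and the
    LLM entirely and route straight to a human."""
    return any(p.search(message) for p in DENY_PATTERNS)
```

Over-blocking here is the right failure direction: a false positive costs one human ticket, a false negative costs an invented refund policy.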
Deflection rate decays silently over time
Mitigation: Run a weekly eval of the top 200 question variants against a golden answer set. Alert on any regression above 3%. Also track ‘escalation after N turns’ — if it climbs, your retrieval or content is drifting.
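The weekly regression check reduces to a dictionary diff; per-cluster pass-rates against the golden set are assumed to be computed elsewhere:

```python
def regressions(current: dict[str, float], baseline: dict[str, float],
                max_drop: float = 0.03) -> list[str]:
    """Return the question clusters whose golden-set pass-rate dropped
    more than 3 points versus baseline (missing clusters count as 0,
    so a cluster that vanished from the eval also alerts)."""
    return [q for q, base in baseline.items()
            if base - current.get(q, 0.0) > max_drop]
```

A non-empty return value is the alert condition; the cluster names tell you which articles or retrieval paths to inspect first.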
Multilingual customers get English answers
Mitigation: Detect the user’s locale from the help center context or the first message. Filter retrieval to articles with matching locale metadata. If no localized article exists, fall back to English and append a note: ‘This answer is from our English help center’.
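The fallback logic fits in one function; here `search(query, locale)` is a stand-in for the hybrid retriever with a locale metadata filter:

```python
def retrieve_with_locale(query: str, locale: str, search):
    """Try the user's locale first; if nothing localized exists, fall
    back to English and return the notice the UI should append."""
    hits = search(query, locale)
    if hits:
        return hits, None
    return search(query, "en"), "This answer is from our English help center."
```

Logging the fallback events doubles as a prioritized translation backlog for the content team.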
Frequently asked questions
How much does a customer support chatbot cost per conversation?
With GPT-4o-mini or Claude Haiku 4 and prompt caching, budget $0.003 to $0.012 per conversation at 100k+ monthly volume. That includes embeddings, reranking, and LLM synthesis. Cost climbs to $0.05-0.10 if you use GPT-4o or Claude Sonnet 4 as the answer model, which is rarely worth it for help-center queries.
Which model should I use for a help center chatbot?
GPT-4o-mini is the default in 2026 — cheapest per token, fast, and handles simple Q&A well. Claude Haiku 4 is a strong alternative with slightly better citation fidelity. Save Claude Sonnet 4 or GPT-4o for escalated or multi-turn troubleshooting conversations, not first-line answers.
Is pgvector enough or do I need Pinecone?
For 10k articles (~30k chunks) pgvector on your existing Postgres is fine and queries in under 40ms. Move to Pinecone Serverless or Turbopuffer past 500k chunks, when you need multi-region reads, or when your Postgres is already at 70% CPU.
Should I use Intercom Fin, Ada, or build this myself?
Intercom Fin charges ~$0.99 per resolved conversation. At 100k conversations/month and 40% deflection that is around $40k/month. A custom stack is closer to $800-1500/month but you now own evals, content ops, and retrieval quality. Break-even is around 20k monthly conversations.
How do I measure deflection rate?
Track: (1) thumbs-up rate on answers, (2) percent of conversations that end without human handoff, (3) percent of handoffs that the human marks ‘would have been resolved with better docs’. Report weekly. A healthy mature deployment sits at 40-60% true deflection with a thumbs-up rate above 70%.
How do I prevent the bot from making up policy details?
Hard-block a list of sensitive topics — refunds, cancellations, SLAs, legal, anything with dollar amounts. For those, route straight to a human. Additionally, force the LLM to quote the exact article sentence it is relying on, and reject answers that do not include a valid citation to a retrieved chunk.
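The citation check can be mechanical; the `[chunk:ID]` marker format below is an assumption about your prompt contract, not a standard:

```python
import re

def citations_valid(answer: str, retrieved_ids: set[str]) -> bool:
    """An answer passes only if it cites at least one chunk, and every
    cited chunk id was actually retrieved on this turn."""
    cited = set(re.findall(r"\[chunk:([\w-]+)\]", answer))
    return bool(cited) and cited <= retrieved_ids
```

A failing answer should trigger either a retry with a stricter instruction or an escalation, never silent delivery.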
How often should I reindex articles?
Sync every 15 minutes is fine. Help center content changes on a human cadence, not a machine one. Use Zendesk or Intercom webhooks to trigger an incremental reindex on article publish/update events, and run a full nightly reindex as a safety net.
Can I use this with voice support?
Yes — swap the chat UI for a voice loop using Deepgram for STT and ElevenLabs or OpenAI tts-1 for TTS. The retrieval + rerank + LLM core stays identical. Latency budget gets tighter: target under 1.2s first-token for voice vs 2.5s for chat.