Reference Architecture · RAG

Customer Knowledge Base Chatbot

Last updated: April 16, 2026

Quick answer

Use GPT-4o-mini or Claude Haiku 4 as the answer model; both cost a few dollars per 1,000 conversations. Embed articles with OpenAI text-embedding-3-large, store in pgvector or Pinecone, rerank the top 20 with Cohere Rerank v3, and stream an answer with citations. Expect $0.003 to $0.012 per conversation at scale. The real ROI lever is deflection tracking plus weekly content gap analysis, not a more expensive model.

The problem

You run an SMB help center with ~10,000 articles in Zendesk, Intercom, or HelpScout. Customers ask the same 200 questions in a hundred different ways. You need a chatbot that answers in 2 seconds, deflects 40-60% of tickets, costs pennies per conversation, and hands off cleanly to a human when it is not confident. It has to work at 50k-500k conversations per month without burning your margin.

Architecture

[Architecture diagram] Help Center Ingester → Section-aware Chunker → Embedding Model → Vector Store → Hybrid Retriever → Reranker → Answer Model → Escalation Router → Chat Widget. The router reads a confidence score and topic, then either streams the answer or hands off to a human.

Help Center Ingester

Pulls articles from Zendesk, Intercom, HelpScout, or a CMS. Handles HTML cleanup, removes boilerplate, and extracts metadata like category, product, and last-updated date.

Alternatives: Zendesk Help Center API, Intercom Articles API, Ragie, Custom scraper

Section-aware Chunker

Splits each article at H2/H3 headings into 300-600 token chunks. Preserves the article title and section path as metadata so citations look natural in the UI.

Alternatives: Recursive character splitter, Semantic chunking, Fixed-size 512-token chunks
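
A minimal sketch of the heading-aware split, assuming markdown-style H2/H3 headings after HTML cleanup; the token-budget handling for oversized sections is elided here:

```python
import re

def chunk_article(title: str, body: str) -> list[dict]:
    """Split a markdown-ish article at H2/H3 headings, carrying the article
    title and section name as metadata so citations read naturally.
    (A real pipeline would also split sections that exceed ~600 tokens.)"""
    chunks: list[dict] = []
    section = None
    buf: list[str] = []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            chunks.append({
                "text": text,
                "title": title,
                "section": section,
                # e.g. 'How to reset your password → Step 2'
                "citation": f"{title} → {section}" if section else title,
            })
        buf.clear()

    for line in body.splitlines():
        m = re.match(r"^(#{2,3})\s+(.*)", line)  # H2/H3 headings
        if m:
            flush()
            section = m.group(2).strip()
        else:
            buf.append(line)
    flush()
    return chunks
```

Keeping the `title → section` path in each chunk is what makes the inline citations in the widget look like navigation, not fragments.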

Embedding Model

Embeds chunks and user questions into a shared 3072-dim vector space. Cheap at scale: embedding the entire corpus costs a few dollars, and per-query embeddings are a rounding error.

Alternatives: Voyage-3, Cohere Embed v3

Vector Store

Stores ~30k chunks with metadata for filtering by product, locale, and publish state. Sized small enough that pgvector on a shared Postgres works.

Alternatives: Pinecone Serverless, Qdrant Cloud, Turbopuffer

Hybrid Retriever

Runs dense retrieval plus BM25 on the same chunk store. Merges with RRF. Filters by locale so a German customer never sees English articles first.

Alternatives: Dense-only, BM25 + RRF, Tantivy + pgvector
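
The RRF merge mentioned above is a few lines. A sketch, assuming dense and BM25 each return a ranked list of chunk IDs:

```python
def rrf_merge(dense: list[str], bm25: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranking contributes 1 / (k + rank)
    per document; k=60 is the commonly used default."""
    scores: dict[str, float] = {}
    for ranking in (dense, bm25):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both rankings float to the top; a document seen by only one retriever still survives, just lower down. Apply the locale filter before fusion so both rankings draw from the same candidate pool.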

Reranker

Cohere Rerank v3 takes the top 20 retrieved chunks and ranks them down to 4. Adds ~150ms but pushes answer accuracy from ~70% to ~90%.

Alternatives: Voyage Rerank 2.5, Jina Reranker v2
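
The rerank step is just "score 20 candidates against the query, keep 4". A sketch with the scorer injected so a real reranker API (Cohere, Voyage, Jina) can be swapped in; the word-overlap scorer is a toy stand-in for testing the plumbing, not a real cross-encoder:

```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 4) -> list[str]:
    """Keep the top_n chunks by relevance; `score` stands in for a
    reranker API call (e.g. Cohere Rerank v3)."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

def overlap_score(query: str, chunk: str) -> float:
    """Toy lexical scorer: fraction of query words present in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / max(len(q), 1)
```

In production the scorer is one batched API call over all 20 candidates, which is where the ~150ms goes.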

Answer Model

Small, fast, cheap. Streams an answer with inline citations, a confidence signal, and a suggested next step. The system prompt is cached, so repeat conversations pay a steep discount on those input tokens.

Alternatives: Claude Haiku 4, Gemini 2.0 Flash, Llama 3.3 on Groq

Escalation Router

Reads a confidence score and a topic classifier. If confidence is below threshold or topic is billing/account, routes the conversation to a human in Zendesk with full transcript and retrieved docs attached.

Alternatives: Zendesk Flow Builder, Intercom Fin handoff, Custom webhook
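
The routing decision is deliberately simple. A sketch; the 0.7 threshold and the topic labels are illustrative values, not recommendations:

```python
SENSITIVE_TOPICS = {"billing", "account"}

def route(confidence: float, topic: str, threshold: float = 0.7) -> str:
    """Return 'human' or 'bot'. Sensitive topics always escalate,
    regardless of how confident the model claims to be."""
    if topic in SENSITIVE_TOPICS or confidence < threshold:
        return "human"  # hand off with transcript + retrieved docs attached
    return "bot"
```

The order matters: topic checks run even at high confidence, because a confidently wrong billing answer is the expensive failure mode.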

Chat Widget

In-app chat widget that streams the answer token by token, renders inline citations, shows a thumbs up/down, and offers 'talk to a human' at any turn.

Alternatives: Intercom Messenger, Crisp, Custom React widget

The stack

Article ingestion · Zendesk API + scheduled sync

Native API exposes publish state, locale, and labels. Sync every 15 minutes is more than enough — help center content does not change minute to minute.

Alternatives: Intercom API, HelpScout API, Ragie

Chunking · Heading-aware 400-token chunks

Help articles are already structured with H2/H3 sections. Splitting on headings gives chunks that map to a specific task, which makes citations look like ‘see How to reset your password → Step 2’ instead of fragments.

Alternatives: Semantic chunking, Fixed-size 512

Embeddings · OpenAI text-embedding-3-large

At 10k articles you are not volume-bound. OpenAI embeddings are fine here and integrate cleanly with gpt-4o-mini. If you are already on Anthropic, Voyage-3 is the equivalent choice.

Alternatives: Voyage-3, Cohere Embed v3

Vector store · pgvector on existing Postgres

At ~30k chunks you do not need a dedicated vector DB. pgvector with an HNSW index queries in under 40ms and avoids a new infra bill; note that HNSW caps plain vector columns at 2,000 dimensions, so 3,072-dim embeddings need a halfvec column (pgvector 0.7+). Move to Pinecone Serverless only past ~1M chunks or when you need geo-replication.

Alternatives: Pinecone Serverless, Qdrant, Turbopuffer
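
A schema sketch for the pgvector setup, with the SQL held as strings for a driver like psycopg. Column names are illustrative; the halfvec type is an assumption driven by the 3,072-dim embeddings, since pgvector's HNSW index caps plain vector columns at 2,000 dims:

```python
def to_pgvector(v: list[float]) -> str:
    """Format a Python float list as a vector literal for a parameterized query."""
    return "[" + ",".join(f"{x:g}" for x in v) + "]"

# halfvec (pgvector >= 0.7) because 3072 dims exceed HNSW's limit
# of 2000 dims on plain `vector` columns.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunks (
    id          bigserial PRIMARY KEY,
    article_id  text,
    section     text,
    locale      text,
    updated_at  timestamptz,
    embedding   halfvec(3072)
);
CREATE INDEX ON chunks USING hnsw (embedding halfvec_cosine_ops);
"""

# Top-20 nearest chunks for one locale, by cosine distance (`<=>`).
QUERY = """
SELECT id, article_id, section
FROM chunks
WHERE locale = %s
ORDER BY embedding <=> %s::halfvec
LIMIT 20;
"""
```

The locale filter in the WHERE clause is what keeps a German customer from seeing English articles first, per the retriever section above.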

Reranker · Cohere Rerank v3

$1 per 1k searches and 150ms latency. Lifts answer accuracy from the mid 70s to low 90s for help-center queries. Cheapest accuracy boost available.

Alternatives: Voyage Rerank 2.5, Jina Reranker v2

Answer LLM · GPT-4o-mini with prompt caching

$0.15 input / $0.60 output per 1M tokens. With a cached 2k-token system prompt, a full conversation costs under $0.005. Claude Haiku 4 is a drop-in alternative with slightly better citation accuracy.

Alternatives: Claude Haiku 4, Gemini 2.0 Flash
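
To see where the "under $0.005" figure comes from, a back-of-envelope cost function. Rates are the $/1M list prices quoted above; the cached rate is an assumption (cached input billed at half the normal input rate), and the token counts are illustrative:

```python
def conversation_cost(system_tokens: int = 2000, context_tokens: int = 1200,
                      output_tokens: int = 300,
                      input_rate: float = 0.15, cached_rate: float = 0.075,
                      output_rate: float = 0.60) -> float:
    """Dollars per conversation: cached system prompt + fresh context
    (retrieved chunks, user turns) + generated answer."""
    return (system_tokens * cached_rate
            + context_tokens * input_rate
            + output_tokens * output_rate) / 1_000_000
```

With these defaults a single-turn conversation lands around $0.0005; even a long multi-turn exchange stays comfortably under the $0.005 ceiling.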

Deflection analytics · Helicone + custom dashboard

You must measure thumbs up rate, escalation rate, and cost per resolved conversation weekly. Without this, content gaps go invisible and deflection rate drops over time.

Alternatives: Langfuse, LangSmith, Arize Phoenix

Cost at each scale

Prototype

2k articles · 5k conversations/mo

$85/mo

One-time embedding (2k articles, ~6k chunks) · $1
Query embeddings (5k) · $1
Cohere Rerank v3 (5k) · $5
GPT-4o-mini answers (5k × ~1.5k tokens) · $12
pgvector on existing Postgres · $0
Hosting + chat widget + logging · $66

Startup

10k articles · 100k conversations/mo

$780/mo

Incremental reindex (20% churn) · $2
Query embeddings (100k) · $8
Cohere Rerank v3 (100k) · $100
GPT-4o-mini answers with prompt caching · $240
pgvector on dedicated Postgres · $60
Helicone observability · $100
Ingestion + eval runs + hosting · $270

Scale

30k articles · 1M conversations/mo

$5,200/mo

Ongoing reindex · $20
Query embeddings (1M) · $80
Cohere Rerank v3 (1M) · $1,000
GPT-4o-mini answers with caching · $2,400
Pinecone Serverless (high-read tier) · $400
Observability + eval harness · $500
Ingestion + escalation infra + hosting · $800

Latency budget

Total P50: 2,105ms · Total P95: 4,030ms

Query embedding: 60ms median · 140ms p95
Hybrid retrieval (top-20): 55ms median · 130ms p95
Rerank to top-4: 140ms median · 260ms p95
LLM answer (first token): 450ms median · 900ms p95
LLM answer (full stream): 1,400ms median · 2,600ms p95

Tradeoffs

Cheap model vs big model

GPT-4o-mini or Haiku 4 handles 90% of help-center queries just as well as GPT-4o or Sonnet 4. The 10% where a bigger model helps are the same queries you should be routing to a human anyway. Save the money — don't pay $10/1M output tokens to answer ‘how do I reset my password’.

Rerank every query vs cache popular answers

For a help center, 60% of queries hit the top 50 questions. You can cache the retrieved-chunks-plus-answer keyed on a normalized question hash and skip rerank + LLM entirely for cache hits. Cuts cost by roughly 40% at the price of a smarter cache invalidation story when articles change.
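
A sketch of that cache, assuming answers are keyed on a hash of the normalized question and tagged with the article IDs they cite, so entries can be dropped when an article changes:

```python
import hashlib
import re

def cache_key(question: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, then hash,
    so trivially different phrasings share one cache entry."""
    norm = re.sub(r"[^a-z0-9 ]", "", question.lower())
    norm = re.sub(r"\s+", " ", norm).strip()
    return hashlib.sha256(norm.encode()).hexdigest()

class AnswerCache:
    def __init__(self):
        self.store: dict[str, dict] = {}

    def get(self, question: str):
        return self.store.get(cache_key(question))

    def put(self, question: str, answer: str, article_ids: list[str]):
        # keep the cited article IDs so updates can invalidate entries
        self.store[cache_key(question)] = {
            "answer": answer, "articles": set(article_ids),
        }

    def invalidate_article(self, article_id: str):
        """Drop every cached answer that cited the changed article."""
        self.store = {k: v for k, v in self.store.items()
                      if article_id not in v["articles"]}
```

Wire `invalidate_article` to the same publish/update webhook that triggers reindexing, and the staleness risk mostly disappears.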

Build vs buy (Intercom Fin, Ada, Forethought)

Intercom Fin is $0.99 per resolved conversation. At 100k conversations per month with 40% deflection, that is ~$40k/mo. A custom stack runs ~$800/mo at the same volume, but you now own the retrieval, evals, and content ops. With engineering time priced in, break-even lands at roughly 20k conversations per month.

Failure modes & guardrails

Bot answers confidently from outdated articles

Mitigation: Store article updated_at in chunk metadata. In the system prompt, instruct the model to mention when a cited article was last updated if it is older than 180 days. Decay retrieval score by 10% for articles older than 12 months.
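
Both the score decay and the staleness note reduce to a couple of date comparisons. A sketch of the mitigation exactly as described (180-day note, 10% penalty past 12 months):

```python
from datetime import datetime

def decayed_score(score: float, updated_at: datetime, now: datetime) -> float:
    """Penalize retrieval scores of articles older than 12 months by 10%."""
    return score * 0.9 if (now - updated_at).days > 365 else score

def staleness_note(updated_at: datetime, now: datetime):
    """Return a note for the prompt when the cited article is older
    than 180 days, else None."""
    if (now - updated_at).days > 180:
        return f"Note: cited article last updated {updated_at.date()}."
    return None
```

Apply the decay after reranking, not before, so a stale but uniquely relevant article can still win when nothing fresher exists.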

Customer phrases differ wildly from article wording

Mitigation: Maintain a weekly ‘unresolved questions’ report from low-thumbs-up conversations and low-confidence escalations. Feed top clusters to a content writer to create new articles or edit existing ones. This is the single biggest deflection-rate lever.

Model hallucinates policy details (refunds, SLAs, pricing)

Mitigation: Hard-code a deny list of regex topics (‘refund’, ‘cancel subscription’, ‘SLA’, ‘legal’, ‘guarantee’). For any match, skip the LLM and route straight to a human. Policy answers must come from a human or a structured policy doc, never an article.
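
A sketch of that deny list; the patterns below are starting points, not an exhaustive policy vocabulary:

```python
import re

# Topics the bot must never answer from retrieved articles.
DENY_PATTERNS = [
    r"\brefund",
    r"\bcancel(ing|lation)?\b.*subscription",
    r"\bsla\b",
    r"\blegal\b",
    r"\bguarantee",
]
DENY_RE = re.compile("|".join(DENY_PATTERNS), re.IGNORECASE)

def must_escalate(message: str) -> bool:
    """True when the message touches a policy topic: skip the LLM entirely."""
    return bool(DENY_RE.search(message))
```

Run this check before retrieval, not after generation; the point is that the model never gets a chance to improvise a refund policy.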

Deflection rate decays silently over time

Mitigation: Run a weekly eval of the top 200 question variants against a golden answer set. Alert on any regression above 3%. Also track ‘escalation after N turns’ — if it climbs, your retrieval or content is drifting.

Multilingual customers get English answers

Mitigation: Detect the user’s locale from the help center context or the first message. Filter retrieval to articles with matching locale metadata. If no localized article exists, fall back to English and append a note: ‘This answer is from our English help center’.

Frequently asked questions

How much does a customer support chatbot cost per conversation?

With GPT-4o-mini or Claude Haiku 4 and prompt caching, budget $0.003 to $0.012 per conversation at 100k+ monthly volume. That includes embeddings, reranking, and LLM synthesis. Cost climbs to $0.05-0.10 if you use GPT-4o or Claude Sonnet 4 as the answer model, which is rarely worth it for help-center queries.

Which model should I use for a help center chatbot?

GPT-4o-mini is the default in 2026 — cheapest per token, fast, and handles simple Q&A well. Claude Haiku 4 is a strong alternative with slightly better citation fidelity. Save Claude Sonnet 4 or GPT-4o for escalated or multi-turn troubleshooting conversations, not first-line answers.

Is pgvector enough or do I need Pinecone?

For 10k articles (~30k chunks) pgvector on your existing Postgres is fine and queries in under 40ms. Move to Pinecone Serverless or Turbopuffer past 500k chunks, when you need multi-region reads, or when your Postgres is already at 70% CPU.

Should I use Intercom Fin, Ada, or build this myself?

Intercom Fin charges ~$0.99 per resolved conversation. At 100k conversations/month and 40% deflection that is around $40k/month. A custom stack is closer to $800-1500/month but you now own evals, content ops, and retrieval quality. Break-even is around 20k monthly conversations.

How do I measure deflection rate?

Track: (1) thumbs-up rate on answers, (2) percent of conversations that end without human handoff, (3) percent of handoffs that the human marks ‘would have been resolved with better docs’. Report weekly. A healthy mature deployment sits at 40-60% true deflection with thumbs up rate above 70%.
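
The three numbers reduce to a small aggregation over conversation records. A sketch, assuming each record carries a `thumbs` rating (1/0/None), a `handoff` flag, and the agent's `docs_would_resolve` verdict on handoffs; field names are hypothetical:

```python
def deflection_metrics(conversations: list[dict]) -> dict:
    """Weekly report: thumbs-up rate, true deflection rate, and the share
    of handoffs a human marked as resolvable with better docs."""
    total = len(conversations)
    rated = [c for c in conversations if c.get("thumbs") is not None]
    handoffs = [c for c in conversations if c["handoff"]]
    return {
        "thumbs_up_rate": sum(c["thumbs"] for c in rated) / len(rated) if rated else 0.0,
        "deflection_rate": (total - len(handoffs)) / total if total else 0.0,
        "docs_gap_rate": (sum(bool(c.get("docs_would_resolve")) for c in handoffs)
                          / len(handoffs)) if handoffs else 0.0,
    }
```

The `docs_gap_rate` is the one that feeds the weekly content gap analysis: each flagged handoff is a missing or unclear article.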

How do I prevent the bot from making up policy details?

Hard-block a list of sensitive topics — refunds, cancellations, SLAs, legal, anything with dollar amounts. For those, route straight to a human. Additionally, force the LLM to quote the exact article sentence it is relying on, and reject answers that do not include a valid citation to a retrieved chunk.

How often should I reindex articles?

Sync every 15 minutes is fine. Help center content changes on a human cadence, not a machine one. Use Zendesk or Intercom webhooks to trigger an incremental reindex on article publish/update events, and run a full nightly reindex as a safety net.

Can I use this with voice support?

Yes — swap the chat UI for a voice loop using Deepgram for STT and ElevenLabs or OpenAI tts-1 for TTS. The retrieval + rerank + LLM core stays identical. Latency budget gets tighter: target under 1.2s first-token for voice vs 2.5s for chat.
