Reference Architecture · RAG
Enterprise Document Search
Last updated: April 15, 2026
Quick answer
The production-ready stack is a hybrid retrieval pipeline: BM25 plus dense embeddings stored in Qdrant or pgvector, a Cohere or Voyage reranker, and Claude Sonnet 4 as the answer synthesizer. Expect $0.02 to $0.03 per query at scale. The hardest problems are not retrieval; they are access control and document freshness.
The problem
You need employees to search across every internal document — PDFs, Notion pages, Confluence, Google Drive, Slack — with natural-language questions that return accurate answers with citations. The system must respect per-user access permissions, handle 1M+ documents, and answer in under 4 seconds.
Architecture
Source Connectors
Ingests from Notion, Confluence, Google Drive, Slack, SharePoint, and PDFs.
Alternatives: Nuclia, Ragie, Airbyte, Unstructured.io
Semantic Chunker
Splits documents into 400 to 800 token chunks that preserve semantic boundaries.
Alternatives: Late chunking, Contextual retrieval (Anthropic), Recursive character splitter
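A minimal sketch of the packing step, assuming blank-line paragraph boundaries and a rough 4-characters-per-token estimate; a real pipeline would use the embedding model's own tokenizer and an embedding-similarity or LLM-based boundary detector:

```python
def chunk_document(text: str, max_tokens: int = 800) -> list[str]:
    """Greedily pack paragraphs into chunks of up to max_tokens.

    Tokens are estimated at ~4 characters each; swap in the embedding
    model's real tokenizer before production use.
    """
    est = lambda s: max(1, len(s) // 4)  # rough token estimate
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        # Flush when adding this paragraph would exceed the budget.
        if current and current_tokens + est(para) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += est(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks only break on paragraph boundaries, no sentence is ever split mid-thought, which is the property the reranker and citation step depend on.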
Embedding Model
Converts chunks and queries into dense vectors.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Vector DB (Hybrid)
Stores embeddings, BM25 index, and metadata including doc ID and ACL.
Alternatives: pgvector, Pinecone, Weaviate
Hybrid Retriever
Runs dense and BM25 retrieval in parallel, then merges results with reciprocal rank fusion.
Alternatives: Weighted merging, ColBERT late-interaction
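The reciprocal rank fusion merge is only a few lines; k = 60 is the conventional smoothing constant from the original RRF paper:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.

    Each document scores sum(1 / (k + rank)) over every list that
    contains it, so items ranked well by both dense and BM25 rise
    to the top without any score normalization.
    """
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, the dense and BM25 retrievers never need their scores put on a common scale.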
Reranker
Cross-encoder reranks the top 50 results down to 5.
Alternatives: Cohere Rerank 3, Voyage Rerank-2, Jina reranker
ACL Filter
Removes documents the requesting user cannot access before ranking.
Alternatives: SpiceDB, OpenFGA, Custom middleware
Answer Synthesizer
Generates the answer with inline citations to source chunks.
Alternatives: GPT-4o, Gemini 2.5 Pro
Search UI
Displays the answer plus ranked source documents with citations.
Alternatives: Custom React, Glean-style, Slack bot
The stack
Unstructured.io
Open-source, handles 60+ document formats, and integrates with most enterprise systems without an additional SaaS contract.
Alternatives: Nuclia, Ragie
Contextual retrieval (Anthropic)
Prepends a ~50-token context header to each chunk before embedding. Anthropic's benchmarks show a 35% reduction in retrieval failure rate on document Q&A tasks.
Alternatives: Semantic chunking, Late chunking
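A sketch of the contextualization step. The prompt wording below is an assumption modeled on Anthropic's published approach, and `llm` is an injected model call; in production you would use prompt caching so the full document is processed once per document, not once per chunk:

```python
CONTEXT_PROMPT = (
    "<document>{document}</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>{chunk}</chunk>\n"
    "Give a short (about 50 tokens) context that situates this chunk "
    "within the overall document, and nothing else."
)

def contextualize_chunk(chunk: str, document: str, llm) -> str:
    """Prepend an LLM-generated context header to a chunk before embedding.

    `llm(prompt)` stands in for a model API call returning a short
    situating sentence; the header travels with the chunk into both
    the embedding and the BM25 index.
    """
    header = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk)).strip()
    return f"{header}\n\n{chunk}"
```

The header gives the retriever the surrounding context ("this is Q3 revenue from the 2025 annual report") that a bare chunk loses when it is cut out of its document.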
Voyage-3-large
Leads MTEB in 2026. The 1024-dimension variant balances quality and storage cost across million-document corpora.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Qdrant
Native hybrid search with built-in BM25. pgvector works well for under 5M vectors if you want to avoid an additional managed service.
Alternatives: pgvector, Pinecone
Voyage Rerank-2
Adds $0.002 per query and 200ms latency. In return you get a 40 to 60% improvement in answer quality. No reason to skip this step.
Alternatives: Cohere Rerank 3, Jina Reranker
OpenFGA
Relationship-based access control (ReBAC) maps onto real enterprise permission hierarchies. Filter at retrieval time, not after the LLM has already seen the content.
Alternatives: SpiceDB, Custom middleware
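A sketch of the retrieval-time check, assuming the user's group memberships have already been resolved (via the authorization service, e.g. OpenFGA's Check/ListObjects APIs, in practice). The filter dict shape is an assumption modeled on Qdrant-style payload filters:

```python
def acl_filter(user_groups: list[str]) -> dict:
    """Build a metadata filter evaluated inside the vector DB query.

    The dict shape mirrors Qdrant-style filter JSON (an assumption);
    what matters is that the condition runs at retrieval time, so
    forbidden chunks never reach the reranker or the LLM.
    """
    return {"must": [{"key": "acl", "match": {"any": user_groups}}]}

def can_read(chunk_acl: list[str], user_groups: list[str]) -> bool:
    """A chunk is visible if it shares at least one group with the user."""
    return bool(set(chunk_acl) & set(user_groups))
```

The same `can_read` predicate can run as a defense-in-depth recheck after retrieval, but it must never be the only enforcement point.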
Claude Sonnet 4
Best citation accuracy in 2026 benchmarks. Rarely invents source references.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype
100k docs · 5k queries/mo
$150/mo
Startup
1M docs · 100k queries/mo
$2,900/mo
Scale
10M docs · 2M queries/mo
$38,000/mo
Latency budget
Tradeoffs
Dense-only vs hybrid retrieval
Dense embeddings alone miss exact keyword matches — proper nouns, error codes, version numbers, and acronyms. Hybrid adds 20% recall with minimal cost overhead. Use hybrid for any enterprise deployment.
Reranking cost vs quality
Voyage Rerank-2 adds $0.002 per query and 200ms but lifts answer quality 40 to 60% versus no reranking. The cost-quality ratio is hard to beat at enterprise scale.
Framework vs custom pipeline
LlamaIndex and LangChain RAG pipelines work but add latency and debugging surface. At 1M+ queries per month, custom Python with direct provider SDKs runs 30% faster and is easier to trace through.
Failure modes & guardrails
Answers cite wrong documents
Mitigation: Force the LLM to cite chunk IDs, not doc titles. Validate citations post-generation against the retrieved chunks. Reject answers that reference chunks not in the context window.
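A minimal sketch of that post-generation check. The `[chunk:ID]` citation format is an assumption; any unambiguous ID-based scheme works, as long as IDs rather than titles are validated:

```python
import re

CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Reject answers that cite chunks not in the context window.

    Returns (ok, invalid_ids). An answer with zero citations also
    fails, since every claim should be traceable to a chunk.
    """
    cited = set(CITATION_RE.findall(answer))
    invalid = cited - retrieved_ids
    return (len(invalid) == 0 and len(cited) > 0, invalid)
```

On failure, either regenerate with a stricter prompt or fall back to showing raw retrieved passages instead of a synthesized answer.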
Stale documents appear current
Mitigation: Store last-modified date in vector DB metadata. Boost recency in ranking. Surface an 'updated more than 6 months ago' warning in the UI for older docs.
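One way to implement the recency boost is an exponential decay blended into the retrieval score; the 6-month half-life and 0.2 blend weight below are illustrative defaults to tune against your golden test set:

```python
import math

def recency_boosted(score: float, age_days: float,
                    half_life_days: float = 180.0, weight: float = 0.2) -> float:
    """Blend a retrieval score with an exponential recency decay.

    freshness halves every half_life_days; weight controls how much
    recency can override pure relevance.
    """
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - weight) * score + weight * freshness
```

Keeping the weight small means recency breaks ties between similar chunks without letting a fresh-but-irrelevant document outrank the right answer.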
ACL bypass returns forbidden content
Mitigation: Filter by permission at retrieval time, not at the presentation layer. An LLM that sees a forbidden document can still reference or paraphrase it even if you hide the citation.
Retrieval quality degrades silently
Mitigation: Maintain a golden test set of 200 Q&A pairs with known correct source documents. Run evals weekly and alert on any regression above 5%.
PII in document corpus leaks to LLM provider
Mitigation: Route queries over sensitive document categories to a self-hosted model (Llama 4 70B or equivalent). Tag documents at ingest time with sensitivity level and use that tag for routing.
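A sketch of that routing decision; the category names and model identifiers are illustrative, and the sensitivity tag is assumed to have been assigned once at ingest time and stored in chunk metadata:

```python
SENSITIVE = {"hr", "legal", "medical"}  # illustrative category tags

def pick_model(doc_categories: set[str]) -> str:
    """Route queries touching sensitive categories to a self-hosted model.

    If any retrieved chunk carries a sensitive tag, the whole query
    stays on infrastructure you control; everything else can use the
    hosted API.
    """
    if doc_categories & SENSITIVE:
        return "self-hosted-llm"  # never leaves your VPC
    return "hosted-api-llm"
```

Routing on the retrieved chunks' tags, rather than on the query text, means a seemingly innocent question that pulls in an HR document still gets the private path.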
Frequently asked questions
How much does enterprise document search cost?
Budget $0.02 to $0.03 per query at scale, plus $0.50 to $2.00 per thousand documents indexed. For 1M docs and 100k queries per month, total AI and infra cost runs $2,500 to $3,500 per month.
Do I need a reranker, or is vector search enough?
Always add a reranker. Cross-encoders like Voyage Rerank-2 and Cohere Rerank 3 improve answer quality by 40 to 60% for $0.001 to $0.002 per query. The quality gain far exceeds the cost.
Which embedding model is best for enterprise search?
Voyage-3-large leads MTEB as of April 2026. Cohere Embed v3 is close behind. OpenAI text-embedding-3-large performs about 5% below on retrieval-heavy benchmarks. For enterprise deployments where quality matters, Voyage is the default choice.
How do I handle access control in RAG?
Store doc-level ACLs in vector DB metadata and filter at retrieval time. Use OpenFGA or SpiceDB for relationship-based access control. Filtering after the LLM has seen the documents is too late — the model can still reference content it was shown.
Is pgvector sufficient or should I use Qdrant?
pgvector handles 1 to 5M vectors comfortably. For 5M+ vectors or when you need native hybrid search (dense plus BM25), Qdrant is the better option. Pinecone is the safest managed choice but costs more per vector stored.
How often should I reindex?
For slow-changing docs like policy PDFs and wikis, weekly reindexing is fine. For Slack, Notion, and other fast-moving sources, stream updates in real time using source webhooks. Track document freshness in metadata and expose it to users.