Reference Architecture · RAG
Enterprise Document Search
Last updated: April 15, 2026
Quick answer
The production-ready stack is a hybrid retrieval pipeline: BM25 plus dense embeddings stored in Qdrant or pgvector, a Cohere or Voyage reranker, and Claude Sonnet 4 as the answer synthesizer. Expect $0.02 to $0.03 per query at scale. The hardest problems are not retrieval; they are access control and document freshness.
The problem
You need employees to search across every internal document — PDFs, Notion pages, Confluence, Google Drive, Slack — with natural-language questions that return accurate answers with citations. The system must respect per-user access permissions, handle 1M+ documents, and answer in under 4 seconds.
Architecture
Source Connectors
Ingests from Notion, Confluence, Google Drive, Slack, SharePoint, and PDFs.
Alternatives: Nuclia, Ragie, Airbyte, Unstructured.io
Semantic Chunker
Splits documents into 400 to 800 token chunks that preserve semantic boundaries.
Alternatives: Late chunking, Contextual retrieval (Anthropic), Recursive character splitter
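A minimal sketch of the packing step, assuming blank-line paragraph boundaries and a rough 4-characters-per-token estimate; a real pipeline would use the embedding model's own tokenizer and an embedding-similarity or LLM-based boundary detector:

```python
def chunk_document(text: str, max_tokens: int = 800) -> list[str]:
    """Greedily pack paragraphs into chunks of up to max_tokens.

    Tokens are estimated at ~4 characters each; swap in the embedding
    model's real tokenizer before production use.
    """
    est = lambda s: max(1, len(s) // 4)  # rough token estimate
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        # Flush when adding this paragraph would exceed the budget.
        if current and current_tokens + est(para) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += est(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks only break on paragraph boundaries, no sentence is ever split mid-thought, which is the property the reranker and citation step depend on.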
Embedding Model
Converts chunks and queries into dense vectors.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Vector DB (Hybrid)
Stores embeddings, BM25 index, and metadata including doc ID and ACL.
Alternatives: pgvector, Pinecone, Weaviate
Hybrid Retriever
Runs dense and BM25 retrieval in parallel, then merges results with reciprocal rank fusion.
Alternatives: Weighted merging, ColBERT late-interaction
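The reciprocal rank fusion merge is only a few lines; k = 60 is the conventional smoothing constant from the original RRF paper:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.

    Each document scores sum(1 / (k + rank)) over every list that
    contains it, so items ranked well by both dense and BM25 rise
    to the top without any score normalization.
    """
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, the dense and BM25 retrievers never need their scores put on a common scale.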
Reranker
Cross-encoder reranks the top 50 results down to 5.
Alternatives: Cohere Rerank 3, Voyage Rerank-2, Jina reranker
ACL Filter
Removes documents the requesting user cannot access before ranking.
Alternatives: SpiceDB, OpenFGA, Custom middleware
Answer Synthesizer
Generates the answer with inline citations to source chunks.
Alternatives: GPT-4o, Gemini 2.5 Pro
Search UI
Displays the answer plus ranked source documents with citations.
Alternatives: Custom React, Glean-style, Slack bot
The stack
Unstructured.io
Open-source, handles 60+ document formats, and integrates with most enterprise systems without an additional SaaS contract.
Alternatives: Nuclia, Ragie
Contextual retrieval (Anthropic)
Prepends a ~50-token context header to each chunk before embedding. Anthropic's benchmarks show a 35% reduction in retrieval failure rate on document Q&A tasks.
Alternatives: Semantic chunking, Late chunking
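A sketch of the contextualization step. The prompt wording below is an assumption modeled on Anthropic's published approach, and `llm` is an injected model call; in production you would use prompt caching so the full document is processed once per document, not once per chunk:

```python
CONTEXT_PROMPT = (
    "<document>{document}</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>{chunk}</chunk>\n"
    "Give a short (about 50 tokens) context that situates this chunk "
    "within the overall document, and nothing else."
)

def contextualize_chunk(chunk: str, document: str, llm) -> str:
    """Prepend an LLM-generated context header to a chunk before embedding.

    `llm(prompt)` stands in for a model API call returning a short
    situating sentence; the header travels with the chunk into both
    the embedding and the BM25 index.
    """
    header = llm(CONTEXT_PROMPT.format(document=document, chunk=chunk)).strip()
    return f"{header}\n\n{chunk}"
```

The header gives the retriever the surrounding context ("this is Q3 revenue from the 2025 annual report") that a bare chunk loses when it is cut out of its document.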
Voyage-3-large
Leads MTEB in 2026. The 1024-dimension variant balances quality and storage cost across million-document corpora.
Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
Qdrant
Native hybrid search with built-in BM25. pgvector works well for under 5M vectors if you want to avoid an additional managed service.
Alternatives: pgvector, Pinecone
Voyage Rerank-2
Adds $0.002 per query and 200ms latency. In return you get a 40 to 60% improvement in answer quality. No reason to skip this step.
Alternatives: Cohere Rerank 3, Jina Reranker
OpenFGA
Relationship-based access control (ReBAC) maps onto real enterprise permission hierarchies. Filter at retrieval time, not after the LLM has already seen the content.
Alternatives: SpiceDB, Custom middleware
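A sketch of the retrieval-time check, assuming the user's group memberships have already been resolved (via the authorization service, e.g. OpenFGA's Check/ListObjects APIs, in practice). The filter dict shape is an assumption modeled on Qdrant-style payload filters:

```python
def acl_filter(user_groups: list[str]) -> dict:
    """Build a metadata filter evaluated inside the vector DB query.

    The dict shape mirrors Qdrant-style filter JSON (an assumption);
    what matters is that the condition runs at retrieval time, so
    forbidden chunks never reach the reranker or the LLM.
    """
    return {"must": [{"key": "acl", "match": {"any": user_groups}}]}

def can_read(chunk_acl: list[str], user_groups: list[str]) -> bool:
    """A chunk is visible if it shares at least one group with the user."""
    return bool(set(chunk_acl) & set(user_groups))
```

The same `can_read` predicate can run as a defense-in-depth recheck after retrieval, but it must never be the only enforcement point.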
Claude Sonnet 4
Best citation accuracy in 2026 benchmarks. Rarely invents source references.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype
100k docs · 5k queries/mo
$150/mo
Startup
1M docs · 100k queries/mo
$2,900/mo
Scale
10M docs · 2M queries/mo
$38,000/mo
Latency budget
Tradeoffs
Dense-only vs hybrid retrieval
Dense embeddings alone miss exact keyword matches — proper nouns, error codes, version numbers, and acronyms. Hybrid adds 20% recall with minimal cost overhead. Use hybrid for any enterprise deployment.
Reranking cost vs quality
Voyage Rerank-2 adds $0.002 per query and 200ms but lifts answer quality 40 to 60% versus no reranking. The cost-quality ratio is hard to beat at enterprise scale.
Framework vs custom pipeline
LlamaIndex and LangChain RAG pipelines work but add latency and debugging surface. At 1M+ queries per month, custom Python with direct provider SDKs runs 30% faster and is easier to trace through.
Failure modes & guardrails
Answers cite wrong documents
Mitigation: Force the LLM to cite chunk IDs, not doc titles. Validate citations post-generation against the retrieved chunks. Reject answers that reference chunks not in the context window.
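A minimal sketch of that post-generation check. The `[chunk:ID]` citation format is an assumption; any unambiguous ID-based scheme works, as long as IDs rather than titles are validated:

```python
import re

CITATION_RE = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")

def validate_citations(answer: str, retrieved_ids: set[str]) -> tuple[bool, set[str]]:
    """Reject answers that cite chunks not in the context window.

    Returns (ok, invalid_ids). An answer with zero citations also
    fails, since every claim should be traceable to a chunk.
    """
    cited = set(CITATION_RE.findall(answer))
    invalid = cited - retrieved_ids
    return (len(invalid) == 0 and len(cited) > 0, invalid)
```

On failure, either regenerate with a stricter prompt or fall back to showing raw retrieved passages instead of a synthesized answer.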
Stale documents appear current
Mitigation: Store last-modified date in vector DB metadata. Boost recency in ranking. Surface an 'updated more than 6 months ago' warning in the UI for older docs.
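One way to implement the recency boost is an exponential decay blended into the retrieval score; the 6-month half-life and 0.2 blend weight below are illustrative defaults to tune against your golden test set:

```python
import math

def recency_boosted(score: float, age_days: float,
                    half_life_days: float = 180.0, weight: float = 0.2) -> float:
    """Blend a retrieval score with an exponential recency decay.

    freshness halves every half_life_days; weight controls how much
    recency can override pure relevance.
    """
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - weight) * score + weight * freshness
```

Keeping the weight small means recency breaks ties between similar chunks without letting a fresh-but-irrelevant document outrank the right answer.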
ACL bypass returns forbidden content
Mitigation: Filter by permission at retrieval time, not at the presentation layer. An LLM that sees a forbidden document can still reference or paraphrase it even if you hide the citation.
Retrieval quality degrades silently
Mitigation: Maintain a golden test set of 200 Q&A pairs with known correct source documents. Run evals weekly and alert on any regression above 5%.
PII in document corpus leaks to LLM provider
Mitigation: Route queries over sensitive document categories to a self-hosted model (Llama 4 70B or equivalent). Tag documents at ingest time with sensitivity level and use that tag for routing.
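A sketch of that routing decision; the category names and model identifiers are illustrative, and the sensitivity tag is assumed to have been assigned once at ingest time and stored in chunk metadata:

```python
SENSITIVE = {"hr", "legal", "medical"}  # illustrative category tags

def pick_model(doc_categories: set[str]) -> str:
    """Route queries touching sensitive categories to a self-hosted model.

    If any retrieved chunk carries a sensitive tag, the whole query
    stays on infrastructure you control; everything else can use the
    hosted API.
    """
    if doc_categories & SENSITIVE:
        return "self-hosted-llm"  # never leaves your VPC
    return "hosted-api-llm"
```

Routing on the retrieved chunks' tags, rather than on the query text, means a seemingly innocent question that pulls in an HR document still gets the private path.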
Frequently asked questions
How much does enterprise document search cost?
Budget $0.02 to $0.03 per query at scale, plus $0.50 to $2.00 per thousand documents indexed. For 1M docs and 100k queries per month, total AI and infra cost runs $2,500 to $3,500 per month.
Do I need a reranker, or is vector search enough?
Always add a reranker. Cross-encoders like Voyage Rerank-2 and Cohere Rerank 3 improve answer quality by 40 to 60% for $0.001 to $0.002 per query. The quality gain far exceeds the cost.
Which embedding model is best for enterprise search?
Voyage-3-large leads MTEB as of April 2026. Cohere Embed v3 is close behind. OpenAI text-embedding-3-large performs about 5% below on retrieval-heavy benchmarks. For enterprise deployments where quality matters, Voyage is the default choice.
How do I handle access control in RAG?
Store doc-level ACLs in vector DB metadata and filter at retrieval time. Use OpenFGA or SpiceDB for relationship-based access control. Filtering after the LLM has seen the documents is too late — the model can still reference content it was shown.
Is pgvector sufficient or should I use Qdrant?
pgvector handles 1 to 5M vectors comfortably. For 5M+ vectors or when you need native hybrid search (dense plus BM25), Qdrant is the better option. Pinecone is the safest managed choice but costs more per vector stored.
How often should I reindex?
For slow-changing docs like policy PDFs and wikis, weekly reindexing is fine. For Slack, Notion, and other fast-moving sources, stream updates in real time using source webhooks. Track document freshness in metadata and expose it to users.