Advanced RAG with Reranking: Two-Stage Retrieval for Production
Last updated: April 16, 2026
Quick answer
Use hybrid search (BM25 + dense embeddings, fused with RRF) for broad recall of top-50 candidates, then rerank with Cohere Rerank or cross-encoder to get top-5 precision results, then compress those 5 chunks with a small LLM before the final generation call. This pattern improves answer accuracy by 15-30% and reduces LLM input costs by 40-60% vs stuffing all 50 candidates.
The problem
Basic single-stage RAG (embed query → retrieve top-K by cosine similarity → stuff into context) yields poor accuracy on real-world corpora because: (1) embedding models optimize for recall, not precision: the top-10 chunks usually contain the relevant content, but irrelevant chunks rank above relevant ones 30-40% of the time; (2) stuffing 10 chunks directly into the LLM context costs $0.01-0.05 per query and forces the LLM to filter the noise itself; (3) sparse and dense retrieval each miss 20-35% of relevant documents when used alone. Teams see 60-70% answer accuracy with naive RAG, vs 85-92% with two-stage retrieval + reranking.
Architecture
User Query
The raw user question. May need preprocessing: query expansion (generate 2-3 alternative phrasings), HyDE (generate a hypothetical answer and embed that), or query classification (factual vs analytical) to select retrieval strategy.
Alternatives: Raw query, Query + conversation history (for multi-turn), Structured API query
Query Expander / HyDE
Optional but effective: generate 2-3 alternative phrasings of the query to improve recall, or use HyDE (Hypothetical Document Embeddings) — generate a fake answer and embed it to find documents similar to the answer rather than the question. HyDE improves recall by 10-20% on factual questions.
Alternatives: GPT-4o mini, Gemini Flash, Simple synonym expansion
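A minimal sketch of this step, assuming a `generate(prompt)` helper that wraps whichever LLM client you use (the helper name and prompt wording are illustrative, not a specific API):

```python
def build_retrieval_queries(question, generate, n_paraphrases=2):
    """Produce the strings to embed: the raw question, alternative phrasings,
    and a HyDE pseudo-answer. Each string is embedded and retrieved against
    separately; the candidate lists are merged downstream (e.g. with RRF)."""
    paraphrases = [
        generate(f"Rewrite this question using different wording: {question}")
        for _ in range(n_paraphrases)
    ]
    # HyDE: embed a plausible *answer*, so retrieval matches answer-shaped text.
    hypothetical = generate(f"Write a short, plausible answer to: {question}")
    return [question, *paraphrases, hypothetical]
```

Running retrieval once per string multiplies retrieval cost, so most teams cap expansion at 2-3 paraphrases and reserve HyDE for factual-question traffic.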
Sparse Retriever (BM25)
Keyword-based retrieval using BM25 (Best Match 25). Excellent for exact term matching — catches documents that semantic search misses (product codes, names, technical terms). Returns top-100 candidates by BM25 score.
Alternatives: Typesense BM25, Meilisearch, PostgreSQL full-text search (ts_rank), tantivy (Rust BM25)
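For intuition about what the sparse stage computes (production systems use Elasticsearch, tantivy, or similar rather than this), Okapi BM25 fits in a few lines; tokenization here is naive pre-split tokens:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against the query.
    corpus_tokens: list of documents, each a list of tokens."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    # Document frequency of each term across the corpus.
    df = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores
```

Rare exact terms (a SKU, an error code) get a high IDF, which is exactly why BM25 catches queries that embeddings miss.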
Dense Retriever (Vector Search)
Semantic embedding-based retrieval. Converts query to a vector and finds nearest neighbors by cosine similarity. Catches semantically related documents even without keyword overlap. Returns top-100 candidates.
Alternatives: Pinecone, Weaviate, Chroma, Milvus, OpenSearch kNN
Embedding Model
Converts queries and documents to dense vectors. Choice of model significantly impacts retrieval quality. Domain-specific models (code, legal, medical) outperform general-purpose by 10-25% on domain tasks.
Alternatives: text-embedding-3-small (1536 dims, 5x cheaper), Cohere embed-english-v3.0, voyage-3 (recommended by Anthropic), nomic-embed-text-v1.5 (open-source)
Reciprocal Rank Fusion (RRF)
Merges ranked results from BM25 and dense retrieval into a single unified ranking without needing to normalize scores across different scales. RRF score for a document d: sum over retrievers i of 1/(k + rank_i(d)), with k = 60. Consistently outperforms score-based fusion in benchmarks.
Alternatives: Linear score interpolation (alpha * dense + (1-alpha) * sparse), CombMNZ, Weaviate hybrid search (built-in RRF)
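The fusion step itself is short; a plain-Python sketch over ranked lists of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc IDs into one ranking.
    Each list contributes 1/(k + rank) per document; no score normalization
    needed, since only ranks are used."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7", "d2"]    # from the sparse retriever
dense_top = ["d1", "d5", "d3", "d9"]   # from the vector retriever
fused = rrf_fuse([bm25_top, dense_top])
```

Documents appearing in both lists ("d1", "d3") accumulate two reciprocal-rank terms and rise to the top, which is the behavior hybrid search relies on.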
Reranker (Cross-Encoder)
Takes the fused top-50 candidates and the query, scores each (query, document) pair jointly using a cross-encoder model. Unlike bi-encoders (which score query and document independently), cross-encoders see both together — dramatically more accurate but 10-50x slower, making them unsuitable for first-stage retrieval.
Alternatives: Jina Reranker v2, BGE-Reranker-Large (open-source), ms-marco-MiniLM-L-6-v2 (self-hosted), LLM-as-reranker (expensive but high quality)
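The reranking step reduces to scoring pairs and sorting; here the cross-encoder is abstracted behind a `score_pairs` callable (for example the `predict` method of a sentence-transformers `CrossEncoder`, or a wrapper around a rerank API; the callable-based shape is an assumption of this sketch):

```python
def rerank(query, candidates, score_pairs, top_n=5):
    """Jointly score every (query, document) pair and keep the top_n documents.
    score_pairs: callable mapping a list of (query, doc) tuples to floats."""
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Example wiring with sentence-transformers (not run here):
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top5 = rerank(query, fused_top50, model.predict, top_n=5)
```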
Contextual Compressor
After reranking, the top-5 chunks may still contain sections irrelevant to the specific query. A small LLM extracts only the sentences from each chunk that are relevant to the query — reducing average chunk size from 500 to 150 tokens while preserving the key information.
Alternatives: GPT-4o mini, Gemini Flash 2.0, Extractive summarization (no LLM, faster)
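A sketch of the compression call, again assuming a generic `generate(prompt)` helper; the prompt wording and the NONE sentinel are illustrative:

```python
COMPRESS_PROMPT = """Extract only the sentences from the passage that help answer the question.
Return them verbatim. If nothing is relevant, return the single word NONE.

Question: {question}
Passage: {passage}"""

def compress_chunks(question, chunks, generate):
    """Keep only the query-relevant sentences from each reranked chunk,
    dropping chunks the compressor judges entirely irrelevant."""
    kept = []
    for chunk in chunks:
        extracted = generate(COMPRESS_PROMPT.format(question=question, passage=chunk))
        if extracted.strip() != "NONE":
            kept.append(extracted)
    return kept
```

Asking for verbatim extraction (rather than summarization) keeps citations traceable back to the source chunk.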
Generator LLM
The final LLM that generates the answer given the compressed, reranked context. Receives 3-5 compressed chunks (vs 10-50 raw chunks in naive RAG), resulting in 40-60% lower input token costs and better answer quality (less distraction from irrelevant context).
Alternatives: gpt-4o, gemini-2-flash, Llama 3.1 70B (self-hosted)
Citation & Source Tracker
Records which chunks were used in the final context, maps them to source documents, and enables citation generation. Passes source metadata to the LLM to include in its response (document title, section, page number).
Alternatives: LangChain LCEL with source tracking, LlamaIndex source nodes
Cited Answer
The final LLM response with inline citations to source documents. Accuracy improves to 85-92% from 60-70% with naive RAG. Include confidence metadata: which sources were used, their relevance scores, and whether the model expressed uncertainty.
Alternatives: Plain text answer, Markdown with footnotes, Structured JSON with citations
The stack
pgvector with HNSW indexing handles 10M+ vectors at 20-50ms query latency and integrates with your existing Postgres DB (no new infra). Qdrant is the best self-hosted option at scale: 40ms p99 at 100M vectors, built-in sparse vector support for hybrid search, and Rust performance. Pinecone's managed service costs $70/mo for 1M vectors — 3-5x more than self-hosted but eliminates all maintenance.
Alternatives: Pinecone (managed, easiest), Weaviate (built-in hybrid), Milvus (self-hosted, large scale), Chroma (dev/prototype)
Elasticsearch BM25 is battle-tested for production-scale corpora (billions of documents). For smaller corpora (<5M docs), Typesense is simpler to operate and 3x cheaper to host. Postgres full-text search works for <500K documents and eliminates the need for a separate search service — pgvector + tsvector in the same DB covers both retrieval modes.
Alternatives: OpenSearch (AWS-managed), Typesense (simpler API), Postgres full-text search (tsvector), Weaviate hybrid (built-in BM25)
voyage-3 ranks #1-2 on MTEB leaderboard for most retrieval tasks at $0.06/M tokens — 2x cheaper than text-embedding-3-large. text-embedding-3-small at $0.02/M tokens is the cost-optimization choice with 85-90% of large's quality. For code corpora, voyage-code-3 significantly outperforms general-purpose models.
Alternatives: text-embedding-3-large ($0.13/M tokens, highest quality), Cohere embed-english-v3.0, nomic-embed-text-v1.5 (free, open-source)
Cohere Rerank v3.5 at roughly $0.0001 per document scored is the cheapest high-quality managed reranker; reranking the top-50 candidates costs about $0.005 per query. BGE-Reranker-Large on GPU (Lambda A10G) costs $0.0001/query at scale but requires GPU infrastructure. Reranking typically improves NDCG@5 by 15-30% over embedding-only retrieval.
Alternatives: Jina Reranker v2 (open API), BGE-Reranker-Large (self-hosted, free), ms-marco-MiniLM-L-6-v2 (fastest, lower quality), LLM reranking with GPT-4o mini (highest quality, 3x cost)
Claude Haiku 4 at $0.80/M input tokens compresses 5 chunks (avg 500 tokens each = 2,500 tokens) to ~750 relevant tokens for $0.002 per query. This saves $0.005-0.015 per query on the generator call (fewer input tokens to Claude Sonnet 4) — net positive ROI at >200 queries/day. LLMLingua (token-level compression) is 10x cheaper but lower quality.
Alternatives: GPT-4o mini, Gemini Flash 2.0, Custom extractive summarizer (no LLM), LLMLingua (token compression)
RRF requires no score normalization (BM25 scores are not comparable to cosine similarity scores) and consistently outperforms linear interpolation by 2-5% NDCG in benchmarks. Implementation is 10 lines of code. Weaviate's built-in hybrid search uses RRF internally — use it if you're already on Weaviate.
Alternatives: Linear interpolation (α=0.7 dense + 0.3 sparse), Weaviate built-in hybrid, Learned fusion weights (requires training data)
With contextual compression reducing context to 750-1500 tokens, Claude Sonnet 4's answer quality on compressed context matches naive RAG with 5K tokens at $0.06 vs $0.015 per query — a 4x cost saving. Enable prompt caching on the system prompt + document instructions (typically 2K tokens, static) to cut cached-portion cost by 90%.
Alternatives: gpt-4o, Gemini 2.0 Flash (cheapest high-quality option), Claude Haiku 4 (for latency-critical, lower quality)
Cost at each scale
Prototype · 1,000 queries/mo · $35/mo
Growth · 50,000 queries/mo · $1,100/mo
Scale · 1M queries/mo · $16,000/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Stale index. The vector index falls behind new documents, so users get outdated information. Mitigation: build a document ingestion pipeline with change detection (webhooks on document update, hash-based change detection for periodic crawls). For critical documents, add a freshness score in retrieval that boosts recently indexed documents, and always surface the document date in retrieved chunks so the LLM can reason about recency.
Vocabulary mismatch. Users use different terms than the corpus (e.g., 'MI' vs 'myocardial infarction'). BM25 handles this poorly; dense retrieval handles it better but not perfectly. Mitigation: (1) add query expansion with domain synonyms, (2) use HyDE to embed a hypothetical answer instead of the raw query, (3) maintain a synonym dictionary for domain-critical terms and expand queries pre-retrieval.
Reranker latency spikes. Cohere Rerank API latency occasionally spikes to 2-3s due to cold starts or high load. Mitigation: set a timeout (max 1.5s for reranking) and fall back to the RRF-fused results if reranking times out. Log the timeout rate; if more than 5% of requests time out, switch to a self-hosted reranker or shrink the candidate pool.
Hallucination on out-of-corpus questions. When no retrieved chunk is relevant (e.g., the user asks about something not in the corpus), the LLM generates a plausible-sounding answer from its parametric knowledge, which may be wrong. Mitigation: add a retrieval confidence check. If the top reranker score is below a threshold (e.g., 0.3), respond with 'I don't have information about this in my knowledge base' instead of generating.
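The confidence gate is a few lines; the 0.3 threshold should be calibrated per reranker (managed rerankers typically return 0-1 relevance scores, while self-hosted cross-encoder logits need their own cutoff):

```python
NO_ANSWER = "I don't have information about this in my knowledge base."

def answer_or_refuse(reranked, generate_answer, min_score=0.3):
    """reranked: list of (chunk, score) pairs from the reranker, best first.
    Refuse instead of generating when even the best chunk scores below threshold;
    otherwise generate from the chunks that clear it."""
    if not reranked or reranked[0][1] < min_score:
        return NO_ANSWER
    return generate_answer([chunk for chunk, score in reranked if score >= min_score])
```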
Frequently asked questions
How much does reranking actually improve accuracy?
In controlled experiments, adding a cross-encoder reranker to a dense-only retrieval pipeline improves NDCG@5 (normalized discounted cumulative gain at 5 results) by 15-30% depending on the corpus and query difficulty. In practical terms: if naive RAG answers 65% of queries correctly, adding reranking typically brings this to 80-90%. The gain is larger for questions with specific factual answers than for open-ended analytical questions.
What's the difference between dense retrieval, sparse retrieval, and hybrid?
Dense retrieval (vector search) converts text to semantic vectors and finds similar meanings — good for paraphrased questions. Sparse retrieval (BM25/TF-IDF) matches on exact keyword overlap — better for product codes, names, and technical terms. Hybrid combines both: BM25 catches what dense misses and vice versa. In benchmarks, hybrid consistently outperforms either alone by 5-15% NDCG, especially on out-of-domain queries.
How should I chunk my documents?
Start with 512-token chunks with 64-token overlap. Measure retrieval accuracy on a sample of real queries. If accuracy is low due to answers spanning chunks, increase overlap to 128 tokens. If the retrieved chunks contain too much irrelevant content, reduce chunk size to 256 tokens. Advanced: use parent-child chunking — index 256-token child chunks for retrieval, but inject the 1024-token parent chunk into the generator for richer context.
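The baseline splitter, sketched over a pre-tokenized document (swap in your tokenizer's token IDs; whitespace words work for a first pass):

```python
def chunk_with_overlap(tokens, size=512, overlap=64):
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens
    of the previous one, so an answer spanning a boundary survives whole
    in at least one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Tuning then becomes two knobs: raise `overlap` when answers straddle boundaries, lower `size` when retrieved chunks carry too much irrelevant text.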
How do I evaluate my RAG pipeline quality?
Use three metrics: (1) Retrieval recall@K — what fraction of queries have the gold-standard document in the top-K retrieved? Measure with 200+ labeled (query, relevant_doc) pairs. (2) Answer accuracy — LLM-as-judge comparing generated answers vs gold-standard answers on 100+ labeled examples. (3) Faithfulness — does the answer contradict any retrieved document? RAGAS library automates all three metrics with LLM-as-judge.
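Retrieval recall@K is the cheapest of the three to compute yourself; `retrieve` here is any function returning ranked doc IDs for a query (the signature is an assumption of this sketch):

```python
def recall_at_k(labeled_pairs, retrieve, k=10):
    """labeled_pairs: list of (query, gold_doc_id) tuples.
    retrieve(query, k) -> list of doc IDs, best first.
    Returns the fraction of queries whose gold document appears in the top-k."""
    hits = sum(1 for query, gold in labeled_pairs if gold in retrieve(query, k))
    return hits / len(labeled_pairs)
```

Run it separately against the pre-rerank and post-rerank stages to see where losses occur: low recall@50 means a retrieval problem, low recall@5 after reranking means a reranker problem.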
Related
Architectures
Enterprise Document Search
Reference architecture for semantic search across 1M+ enterprise documents (PDFs, Confluence, Notion, Google D...
Log Analysis RAG
Reference architecture for natural-language queries over 1TB/day of observability logs. Combines log ingestion...
Real-time News RAG
Reference architecture for RAG over minute-fresh news, RSS, and social feeds. Streaming ingestion, freshness-w...
Legal Document Search
Reference architecture for natural-language search across contract repositories and case law with strict citat...
RAG for Codebase Search
Reference architecture for natural-language Q&A over a 1M+ line codebase. Code-aware embeddings, tree-sitter A...
Slack + Notion Internal Search
Reference architecture for unified, permissions-aware search across Slack, Notion, Linear, Google Drive, and G...
Multimodal RAG (Text + Images + PDFs)
Reference architecture for layout-aware RAG over documents that are 30-70% images, diagrams, tables, and chart...