
Reranking Retrieved Documents for Better RAG (2026)

Quick Answer

Reranking takes the top-K chunks from vector search (e.g., top-50) and re-scores them with a cross-encoder model that considers both query and chunk together. The top-N (e.g., top-5) after reranking are sent to the LLM. Cross-encoders are more accurate than bi-encoders (used in vector search) but too slow to run on the full corpus — hence the two-stage pattern.

When to Use

  • Vector search is retrieving chunks that are topically related but not specifically relevant to the query
  • Precision@5 is below 0.7 on your eval set and you've already tuned chunking and embedding
  • Queries require understanding nuanced differences between similar chunks (e.g., different sections of the same policy document)
  • Building a high-quality RAG system where answer accuracy is more important than minimizing latency
  • After retrieval, before passing context to an expensive LLM — reranking is cheap insurance

How It Works

  1. Two-stage retrieval: Stage 1 uses fast approximate vector search to retrieve top-K candidates (K=20–100). Stage 2 uses a reranker to score each (query, chunk) pair and return the top-N (N=3–10).
  2. Cross-encoders (used in rerankers) process query and document together in a single forward pass, capturing interaction between them. This is what makes them more accurate than bi-encoders — but also why they're too slow to run on millions of documents.
  3. Popular reranker models in 2026: Cohere rerank-3, Voyage rerank-2, BGE-reranker-v2 (open source), Jina reranker-v2. All use a similar cross-encoder architecture; the differences are in training data and domain coverage.
  4. Integration: after vector search, take the chunk IDs and texts, call the reranker API with the query and all candidates, sort by reranker score, and return the top-N to the LLM.
  5. Calibration: reranker scores are not probabilities. Scores above ~0.5 from Cohere rerank-3 are generally relevant; below ~0.2 are irrelevant. Use a minimum score threshold to filter out irrelevant chunks even if they're in the top-N.
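The steps above can be sketched end-to-end. This is a toy illustration: `embedding_similarity` and `cross_encoder_score` are word-overlap stand-ins for a real bi-encoder and cross-encoder, so only the two-stage control flow carries over to production.

```python
def embedding_similarity(query: str, doc: str) -> float:
    # Stage-1 stand-in for a bi-encoder: fraction of query words in the doc.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def cross_encoder_score(query: str, doc: str) -> float:
    # Stage-2 stand-in for a cross-encoder: rewards docs containing the query
    # as a phrase, mimicking the query-document interaction a real model learns.
    base = embedding_similarity(query, doc)
    return base + (0.5 if query.lower() in doc.lower() else 0.0)

def two_stage_retrieve(query: str, corpus: list[str],
                       top_k: int = 4, top_n: int = 2) -> list[str]:
    # Stage 1: fast, approximate — narrow the corpus to K candidates.
    candidates = sorted(corpus, key=lambda d: embedding_similarity(query, d),
                        reverse=True)[:top_k]
    # Stage 2: slow, accurate — rerank only those K, keep the top-N.
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d),
                  reverse=True)[:top_n]

corpus = [
    "refund policy returns accepted within 30 days",
    "shipping policy details",
    "the refund policy covers digital goods too",
    "policy on refund timing",
    "unrelated document about cats",
]
top = two_stage_retrieve("refund policy", corpus)
```

The shape is the same in production: swap Stage 1 for your vector DB query and Stage 2 for a reranker API call.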

Examples

Cohere reranker integration
import cohere
import os

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank_chunks(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    response = co.rerank(
        model='rerank-english-v3.0',
        query=query,
        documents=chunks,
        top_n=top_n,
        return_documents=True  # include chunk text in each result
    )
    # Drop low-confidence chunks even if they made the top-N.
    return [r.document.text for r in response.results if r.relevance_score > 0.1]

# Usage: retrieve top-50 from vector DB, rerank to top-5
raw_chunks = vector_db.search(query, top_k=50)
reranked = rerank_chunks(query, raw_chunks, top_n=5)
Output: Reduces context from 50 chunks to the 5 most relevant. Adds ~100–200ms of latency for 50 chunks. Cohere rerank-3 costs $2 per 1,000 queries — negligible vs. LLM generation cost.
Open-source BGE reranker
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-v2-m3', use_fp16=True)

# Score each (query, chunk) pair; normalize=True maps scores to [0, 1].
pairs = [[query, chunk] for chunk in retrieved_chunks]
scores = reranker.compute_score(pairs, normalize=True)

# Sort by score alone (a plain tuple sort would fall back to comparing chunk text on ties).
ranked = sorted(zip(scores, retrieved_chunks), key=lambda p: p[0], reverse=True)
top_chunks = [chunk for score, chunk in ranked[:5] if score > 0.1]
Output: Free local reranker with quality comparable to Cohere for English. BAAI/bge-reranker-v2-m3 is the multilingual variant. Requires a GPU for reasonable latency (50 chunks in ~50ms on an A10G).

Common Mistakes

  • Retrieving too few candidates before reranking — if you retrieve top-10 and rerank to top-5, you're barely doing anything. Retrieve top-20 to top-50 to give the reranker enough to work with.
  • Not setting a minimum relevance score threshold — even the top-5 after reranking may all be irrelevant if the query is outside the corpus. Filter results below 0.1 relevance score and return 'no relevant documents found' rather than hallucinating an answer.
  • Running reranking on very long chunks — cross-encoders have max token limits (typically 512 tokens for BGE, 4096 for Cohere v3). Chunks longer than the limit are truncated, degrading relevance scores. Chunk to under 512 tokens if using a constrained reranker.
  • Ignoring reranker latency in production — for latency-sensitive applications, measure P95 reranker latency. Async reranking (return initial results immediately, update with reranked results) can hide latency from users.

FAQ

How much does reranking actually improve accuracy?

On BEIR benchmarks, reranking improves nDCG@10 by 5–15 percentage points over pure vector search. In production RAG pipelines, teams typically report 10–25% improvement in answer accuracy. The improvement is larger when your queries are specific and your corpus has many similar-but-distinct chunks.

Can I use an LLM as a reranker?

Yes — LLM-as-reranker prompts a language model to rate each (query, chunk) pair for relevance. It can outperform cross-encoder rerankers on nuanced relevance judgments but is 10-50x slower and 100x more expensive. Use LLM-as-reranker for high-stakes offline evaluation, not production retrieval.

Is reranking necessary if I use hybrid search?

Hybrid search (combining dense and sparse retrieval) improves recall, but reranking is still needed for precision. They're complementary: hybrid search finds more candidate chunks, reranking selects the best ones from that expanded candidate set.
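As a sketch of how the two compose: reciprocal rank fusion (RRF) is one common way to merge dense and sparse rankings into a single candidate list, which is then handed to the reranker. The doc IDs below are hypothetical, and k=60 is the conventional RRF constant.

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each list is a ranking of doc IDs (best first); a doc's fused score is
    # the sum of 1/(k + rank) over every list it appears in.
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # hypothetical vector-search ranking
sparse = ["d3", "d4", "d1"]  # hypothetical BM25 ranking
candidates = rrf_merge([dense, sparse])  # expanded set -> send to reranker
```

Docs that rank well in both lists (here d1 and d3) float to the top of the fused candidate set; the reranker then does the fine-grained precision work.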

What's the latency impact of reranking?

Cohere rerank-3 API: 100–300ms for 50 chunks. BGE-reranker on GPU: 20–80ms for 50 chunks. This is usually acceptable since LLM generation takes 500ms–5s. If reranking latency is a problem, reduce the candidate set (top-20 instead of top-50) or run reranking async.
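The async option can be sketched with asyncio: push the raw vector-search order to the UI immediately, then push the reranked order once the slower reranker returns. `slow_rerank` here is a sleep-plus-sort stand-in for a real reranker call.

```python
import asyncio

async def slow_rerank(query: str, chunks: list[str]) -> list[str]:
    await asyncio.sleep(0.05)       # simulated reranker latency
    return sorted(chunks, key=len)  # stand-in scoring: shorter = better

async def retrieve_with_async_rerank(query, raw_chunks, on_update):
    on_update(raw_chunks[:5])                        # instant first paint
    reranked = await slow_rerank(query, raw_chunks)  # slower, more accurate
    on_update(reranked[:5])                          # swap in refined order

updates = []
asyncio.run(retrieve_with_async_rerank("q", ["bbb", "a", "cc"], updates.append))
```

The user sees results at vector-search latency; the reranked ordering arrives one update later.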

Should I rerank when using long-context models?

Even with 200K token context windows, reranking is valuable — it ensures the most relevant content appears at the beginning of the context (avoiding lost-in-the-middle issues). Retrieve top-50, rerank, and include the top-10 in order of relevance score.
