Contextual Retrieval: Anthropic's Method That Cuts RAG Retrieval Failures by 49%
In September 2024, Anthropic published a technique called Contextual Retrieval that reduced retrieval failure rates by 49% on their internal benchmarks. The idea is disarmingly simple, but it's one of the most impactful improvements you can make to a RAG pipeline without changing your vector database or LLM.
This post explains how it works, when to use it, what it costs, and how to implement it.
The Problem with Standard RAG Chunking
In a standard RAG pipeline, you split documents into chunks, embed each chunk, and store them. When a user queries, you find the most similar chunks and feed them to your LLM.
The problem: chunks lose context when extracted from their source.
Imagine a financial report with this sentence:
> "The company's revenue grew 23% year-over-year."
Embedded as a standalone chunk, this is almost meaningless. Which company? What year? What revenue line?
The embedding captures the semantic content, but the embedding space has no way to encode "this refers to Acme Corp's Q3 2024 product revenue" unless that context is in the chunk.
When a user asks "What was Acme Corp's revenue growth in 2024?", the chunk might not be retrieved even though it contains the answer, because the embedding distance is high due to missing context.
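To see the effect concretely, here is a toy illustration using bag-of-words vectors as a crude stand-in for a real embedding model. The query shares almost no vocabulary with the bare chunk, but does share "Acme", "Corp", "2024", and "revenue" with a contextualized version, so the contextualized chunk scores higher. (Real embeddings capture semantics rather than raw token overlap, so the numbers differ, but the direction of the effect is the same.)

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Lowercase bag-of-words counts as a crude embedding proxy."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "What was Acme Corp's revenue growth in 2024?"
raw_chunk = "The company's revenue grew 23% year-over-year."
contextual_chunk = (
    "Context: This chunk is from the Acme Corp Q3 2024 annual report. "
    + raw_chunk
)

sim_raw = cosine(bow_vector(query), bow_vector(raw_chunk))
sim_ctx = cosine(bow_vector(query), bow_vector(contextual_chunk))
# The contextualized chunk shares "acme", "corp", "2024", "revenue"
# with the query, so it scores strictly higher than the bare chunk.
print(sim_raw < sim_ctx)  # → True
```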
What Contextual Retrieval Does
Contextual Retrieval fixes this by prepending a short, AI-generated context description to each chunk before embedding it:
Before (standard chunking):

```
Chunk text: "The company's revenue grew 23% year-over-year."
Embedding: [0.12, -0.44, 0.89, ...]
```

After (contextual retrieval):

```
Chunk text: "Context: This chunk is from the Acme Corp Q3 2024 annual report,
fiscal year ending September 2024. The document covers product revenue,
cost structure, and guidance for 2025.

Chunk content: The company's revenue grew 23% year-over-year."
Embedding: [0.31, -0.12, 0.95, ...]
```
The context is generated by an LLM (Claude Haiku, in Anthropic's original implementation) using a prompt that asks: "Given the full document, what context would be most useful for someone who sees only this chunk?"
The 49% Improvement
In Anthropic's evaluation:
- Standard RAG: baseline top-20 retrieval failure rate of 5.7%
- Contextual Embeddings alone: failure rate reduced by 35% (5.7% → 3.7%)
- Contextual Embeddings + contextual BM25 hybrid: reduced by 49% (5.7% → 2.9%)
- Contextual Embeddings + contextual BM25 + reranking: reduced by 67% (5.7% → 1.9%)
The improvement is larger for:
- Long documents (legal, medical, financial reports)
- Domain-specific content with lots of references ("as noted in Section 3...")
- Documents with many similar sections (quarterly reports, product specs)
The improvement is smaller for:
- Short, self-contained documents
- Documents where each chunk is already context-complete
- FAQ-style content where question = context
Implementation
Here's a complete implementation using Claude Haiku for context generation:
```python
import anthropic
from typing import List

client = anthropic.Anthropic()

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Simple sliding-window chunker (word-based)."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks

def generate_context(document: str, chunk: str) -> str:
    """Generate context for a chunk using Claude Haiku with prompt caching."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        system="You are a helpful assistant that generates concise contextual descriptions.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},  # Cache the full document!
                    },
                    {
                        "type": "text",
                        "text": f"""Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Provide a short, specific context (2-3 sentences) that describes where this chunk fits
within the document and what key information it contains. This context will be prepended
to the chunk to improve retrieval. Output only the context, no preamble.""",
                    },
                ],
            }
        ],
    )
    return response.content[0].text

def create_contextual_chunks(document: str, chunk_size: int = 800) -> List[dict]:
    """Create contextual chunks for a full document."""
    chunks = chunk_document(document, chunk_size)
    contextual_chunks = []
    for i, chunk in enumerate(chunks):
        context = generate_context(document, chunk)
        contextual_chunks.append({
            "chunk_index": i,
            "original_chunk": chunk,
            "context": context,
            "contextual_text": f"{context}\n\n{chunk}",  # This is what you embed
        })
    return contextual_chunks

# Usage
with open("annual_report.txt") as f:
    document = f.read()

contextual_chunks = create_contextual_chunks(document)

# Now embed contextual_text (not original_chunk) for each chunk:
for chunk in contextual_chunks:
    embedding = get_embedding(chunk["contextual_text"])  # your embedding function
    # Store embedding + original_chunk in your vector DB
```
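The chunker is pure Python and easy to sanity-check in isolation before spending API calls. Here it is redefined so the snippet runs standalone, applied to a synthetic 2,000-word document:

```python
from typing import List

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Sliding-window chunker over words, same logic as above."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(2000))  # 2,000-word synthetic document
chunks = chunk_document(doc, chunk_size=800, overlap=100)

print(len(chunks))           # → 3 chunks: words 0-799, 700-1499, 1400-1999
print(chunks[1].split()[0])  # → w700: each chunk repeats the previous 100 words
```

The overlap means boundary sentences appear in two chunks, so a fact split across a chunk boundary is still retrievable from at least one of them.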
Prompt Caching: The Key to Making This Affordable
The naive implementation calls Claude once per chunk with the full document. For a 100-page document split into 200 chunks, that's 200 API calls, each sending the full document.
Without caching: 200 chunks * 50K tokens (full doc) = 10M tokens at $0.25/1M (Haiku) = $2.50 per document. Expensive.
With prompt caching: the document is cached after the first call, and subsequent calls pay only the cache-read price. (Cache writes cost about 25% more than regular input, which changes the totals below only marginally.)
Anthropic's cache-read pricing for Haiku: $0.03/1M tokens (vs. $0.25/1M for regular input).

```
First chunk:          50K tokens * $0.25/1M  = $0.0125
Remaining 199 chunks: 199 * 50K * $0.03/1M  = $0.299
Total:                ~$0.31 per document
```
With caching, contextual retrieval costs roughly 8x less than without. The `cache_control: {"type": "ephemeral"}` parameter in the implementation above enables this.
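The arithmetic above generalizes to a quick estimator. This sketch uses the Claude 3 Haiku rates quoted in this post and, like the post, charges the first call at the regular input price and ignores output tokens; swap in your own model's rates:

```python
def context_generation_cost(doc_tokens: int, num_chunks: int,
                            input_price: float = 0.25,      # $/1M tokens, regular input
                            cache_read_price: float = 0.03  # $/1M tokens, cache read
                            ) -> dict:
    """Estimate the cost of generating contexts for every chunk of one document."""
    # Naive: every call resends the full document at the regular input price.
    uncached = num_chunks * doc_tokens * input_price / 1e6
    # Cached: first call pays full price, the rest pay only cache reads.
    cached = (doc_tokens * input_price / 1e6
              + (num_chunks - 1) * doc_tokens * cache_read_price / 1e6)
    return {"uncached": round(uncached, 4), "cached": round(cached, 4)}

print(context_generation_cost(doc_tokens=50_000, num_chunks=200))
# → {'uncached': 2.5, 'cached': 0.311}
```

This reproduces the worked example: $2.50 naive versus ~$0.31 with caching for a 200-chunk, 50K-token document.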
Cost Per Document at Scale
| Document Size | Chunks | Cost without Caching | Cost with Caching |
|---|---|---|---|
| 10 pages (~5K tokens) | 10 | $0.013 | $0.003 |
| 50 pages (~25K tokens) | 50 | $0.31 | $0.043 |
| 100 pages (~50K tokens) | 100 | $1.25 | $0.16 |
| 500 pages (~250K tokens) | 500 | $31.25 | $3.81 |

(Cached costs assume one full-price call plus cache reads for every remaining chunk, per the formula above. A 500-page document also exceeds Haiku's 200K-token context window, so in practice it would be processed in sections.)
For a corpus of 10,000 100-page documents, contextual retrieval costs about $1,600 with caching, a one-time cost for meaningfully better retrieval.
When to Use Contextual Retrieval
High-value scenarios:
- Legal documents (contracts, case law): lots of cross-references and defined terms
- Financial reports: company names, time periods, and specific metrics that context clarifies
- Technical documentation: "the following API" needs context for which API
- Medical records: patient context, procedure context, references to earlier findings
- Books and long-form content: chapter context matters for recall
Lower-value scenarios:
- FAQ documents (already self-contained Q&A pairs)
- Short product descriptions
- Standalone articles where each section is context-complete
- Code snippets that are self-explanatory
Contextual Retrieval + BM25 = Even Better
Combine contextual embeddings with BM25 keyword search and merge with Reciprocal Rank Fusion (RRF) for the best results:
```python
from typing import List
from rank_bm25 import BM25Okapi

def reciprocal_rank_fusion(results_list: List[List[int]], k: int = 60) -> List[int]:
    """Merge multiple ranked lists using RRF."""
    scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, bm25_index: BM25Okapi,
                  vector_results: List[int], top_k: int = 10) -> List[int]:
    """Merge precomputed vector-search results with BM25 keyword results."""
    # BM25 results (rank_bm25 expects a tokenized query)
    tokenized_query = query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    bm25_results = sorted(range(len(bm25_scores)),
                          key=lambda i: bm25_scores[i], reverse=True)[:top_k * 2]
    # Merge with RRF
    return reciprocal_rank_fusion([vector_results[:top_k * 2], bm25_results])[:top_k]
```
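The RRF merge is easy to verify on two hand-made rankings (doc IDs here are arbitrary; the function is repeated so the snippet runs standalone):

```python
from typing import List

def reciprocal_rank_fusion(results_list: List[List[int]], k: int = 60) -> List[int]:
    """Merge multiple ranked lists using RRF, same logic as above."""
    scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = [1, 2, 3]  # doc 1 best by embedding similarity
bm25_ranking = [3, 1, 2]    # doc 3 best by keyword match

merged = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(merged)  # → [1, 3, 2]: doc 1 (ranks 1 and 2) edges out doc 3 (ranks 3 and 1)
```

Note that a document ranked first in one list and third in the other can beat one ranked second in both; the constant `k = 60` dampens how much the very top ranks dominate.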
Summary
Contextual Retrieval is one of the highest-ROI RAG improvements available:
- 35% fewer retrieval failures from contextual embeddings alone on Anthropic's benchmarks
- 49% fewer when combined with contextual BM25 hybrid search; 67% fewer with reranking added
- Cost: ~$0.16 per 100-page document with prompt caching
- Implementation time: 1-2 days for a working pipeline
If your RAG system handles long, structured documents (legal, financial, technical), contextual retrieval is likely the single best improvement you can make this quarter. The cost is low, the improvement is large, and the implementation is straightforward.