Contextual Retrieval: Anthropic's Method That Cuts RAG Retrieval Failures by 49%
In September 2024, Anthropic published a technique called Contextual Retrieval that reduced retrieval failure rates by 49% on their internal benchmarks. The idea is disarmingly simple, but it's one of the most impactful improvements you can make to a RAG pipeline without changing your vector database or LLM.
This post explains how it works, when to use it, what it costs, and how to implement it.
The Problem with Standard RAG Chunking
In a standard RAG pipeline, you split documents into chunks, embed each chunk, and store them. When a user queries, you find the most similar chunks and feed them to your LLM.
The problem: chunks lose context when extracted from their source.
Imagine a financial report with this sentence:
> "The company's revenue grew 23% year-over-year."
Embedded as a standalone chunk, this is almost meaningless. Which company? What year? What revenue line?
The embedding captures the semantic content, but the embedding space has no way to encode "this refers to Acme Corp's Q3 2024 product revenue" unless that context is in the chunk.
When a user asks "What was Acme Corp's revenue growth in 2024?", the chunk might not be retrieved even though it contains the answer, because the embedding distance is high due to missing context.
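To see the effect concretely, here is a toy illustration using bag-of-words vectors as a crude stand-in for a real embedding model. The query shares almost no vocabulary with the bare chunk, but does share "Acme", "Corp", "2024", and "revenue" with a contextualized version, so the contextualized chunk scores higher. (Real embeddings capture semantics rather than raw token overlap, so the numbers differ, but the direction of the effect is the same.)

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Lowercase bag-of-words counts as a crude embedding proxy."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = "What was Acme Corp's revenue growth in 2024?"
raw_chunk = "The company's revenue grew 23% year-over-year."
contextual_chunk = (
    "Context: This chunk is from the Acme Corp Q3 2024 annual report. "
    + raw_chunk
)

sim_raw = cosine(bow_vector(query), bow_vector(raw_chunk))
sim_ctx = cosine(bow_vector(query), bow_vector(contextual_chunk))
# The contextualized chunk shares "acme", "corp", "2024", "revenue"
# with the query, so it scores strictly higher than the bare chunk.
print(sim_raw < sim_ctx)  # → True
```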
What Contextual Retrieval Does
Contextual Retrieval fixes this by prepending a short, AI-generated context description to each chunk before embedding it:
Before (standard chunking):

```
Chunk text: "The company's revenue grew 23% year-over-year."
Embedding: [0.12, -0.44, 0.89, ...]
```

After (contextual retrieval):

```
Chunk text: "Context: This chunk is from the Acme Corp Q3 2024 annual report,
fiscal year ending September 2024. The document covers product revenue,
cost structure, and guidance for 2025.

Chunk content: The company's revenue grew 23% year-over-year."
Embedding: [0.31, -0.12, 0.95, ...]
```
The context is generated by an LLM (Claude Haiku, in Anthropic's original implementation) using a prompt that asks: "Given the full document, what context would be most useful for someone who sees only this chunk?"
The 49% Improvement
In Anthropic's evaluation:
- Standard RAG: baseline top-20 retrieval failure rate of 5.7%
- Contextual Embeddings alone: failure rate reduced by 35% (5.7% → 3.7%)
- Contextual Embeddings + contextual BM25 hybrid: reduced by 49% (5.7% → 2.9%)
- Contextual Embeddings + contextual BM25 + reranking: reduced by 67% (5.7% → 1.9%)
The improvement is larger for:
- Long documents (legal, medical, financial reports)
- Domain-specific content with lots of references ("as noted in Section 3...")
- Documents with many similar sections (quarterly reports, product specs)
The improvement is smaller for:
- Short, self-contained documents
- Documents where each chunk is already context-complete
- FAQ-style content where question = context
Implementation
Here's a complete implementation using Claude Haiku for context generation:
```python
import anthropic
from typing import List

client = anthropic.Anthropic()

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Simple sliding-window chunker (word-based)."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start += chunk_size - overlap
    return chunks

def generate_context(document: str, chunk: str) -> str:
    """Generate context for a chunk using Claude Haiku with prompt caching."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        system="You are a helpful assistant that generates concise contextual descriptions.",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"<document>\n{document}\n</document>",
                        "cache_control": {"type": "ephemeral"},  # Cache the full document!
                    },
                    {
                        "type": "text",
                        "text": f"""Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Provide a short, specific context (2-3 sentences) that describes where this chunk fits
within the document and what key information it contains. This context will be prepended
to the chunk to improve retrieval. Output only the context, no preamble.""",
                    },
                ],
            }
        ],
    )
    return response.content[0].text

def create_contextual_chunks(document: str, chunk_size: int = 800) -> List[dict]:
    """Create contextual chunks for a full document."""
    chunks = chunk_document(document, chunk_size)
    contextual_chunks = []
    for i, chunk in enumerate(chunks):
        context = generate_context(document, chunk)
        contextual_chunks.append({
            "chunk_index": i,
            "original_chunk": chunk,
            "context": context,
            "contextual_text": f"{context}\n\n{chunk}",  # This is what you embed
        })
    return contextual_chunks

# Usage
with open("annual_report.txt") as f:
    document = f.read()

contextual_chunks = create_contextual_chunks(document)

# Now embed contextual_text (not original_chunk) for each chunk:
for chunk in contextual_chunks:
    embedding = get_embedding(chunk["contextual_text"])  # your embedding function
    # Store embedding + original_chunk in your vector DB
```
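The chunker is pure Python and easy to sanity-check in isolation before spending API calls. Here it is redefined so the snippet runs standalone, applied to a synthetic 2,000-word document:

```python
from typing import List

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> List[str]:
    """Sliding-window chunker over words, same logic as above."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(2000))  # 2,000-word synthetic document
chunks = chunk_document(doc, chunk_size=800, overlap=100)

print(len(chunks))           # → 3 chunks: words 0-799, 700-1499, 1400-1999
print(chunks[1].split()[0])  # → w700: each chunk repeats the previous 100 words
```

The overlap means boundary sentences appear in two chunks, so a fact split across a chunk boundary is still retrievable from at least one of them.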
Prompt Caching: The Key to Making This Affordable
The naive implementation calls Claude once per chunk with the full document. For a 100-page document split into 200 chunks, that's 200 API calls, each sending the full document.
Without caching: 200 chunks * 50K tokens (full doc) = 10M tokens at $0.25/1M (Haiku) = $2.50 per document. Expensive.
With prompt caching: the document is cached after the first call, and subsequent calls pay only the cache-read price. (Cache writes cost about 25% more than regular input, which changes the totals below only marginally.)
Anthropic's cache-read pricing for Haiku: $0.03/1M tokens (vs. $0.25/1M for regular input).

```
First chunk:          50K tokens * $0.25/1M  = $0.0125
Remaining 199 chunks: 199 * 50K * $0.03/1M  = $0.299
Total:                ~$0.31 per document
```
With caching, contextual retrieval costs roughly 8x less than without. The `cache_control: {"type": "ephemeral"}` parameter in the implementation above enables this.
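The arithmetic above generalizes to a quick estimator. This sketch uses the Claude 3 Haiku rates quoted in this post and, like the post, charges the first call at the regular input price and ignores output tokens; swap in your own model's rates:

```python
def context_generation_cost(doc_tokens: int, num_chunks: int,
                            input_price: float = 0.25,      # $/1M tokens, regular input
                            cache_read_price: float = 0.03  # $/1M tokens, cache read
                            ) -> dict:
    """Estimate the cost of generating contexts for every chunk of one document."""
    # Naive: every call resends the full document at the regular input price.
    uncached = num_chunks * doc_tokens * input_price / 1e6
    # Cached: first call pays full price, the rest pay only cache reads.
    cached = (doc_tokens * input_price / 1e6
              + (num_chunks - 1) * doc_tokens * cache_read_price / 1e6)
    return {"uncached": round(uncached, 4), "cached": round(cached, 4)}

print(context_generation_cost(doc_tokens=50_000, num_chunks=200))
# → {'uncached': 2.5, 'cached': 0.311}
```

This reproduces the worked example: $2.50 naive versus ~$0.31 with caching for a 200-chunk, 50K-token document.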
Cost Per Document at Scale
| Document Size | Chunks | Cost without Caching | Cost with Caching |
|---|---|---|---|
| 10 pages (~5K tokens) | 10 | $0.013 | $0.003 |
| 50 pages (~25K tokens) | 50 | $0.31 | $0.043 |
| 100 pages (~50K tokens) | 100 | $1.25 | $0.16 |
| 500 pages (~250K tokens) | 500 | $31.25 | $3.81 |

(Cached costs assume one full-price call plus cache reads for every remaining chunk, per the formula above. A 500-page document also exceeds Haiku's 200K-token context window, so in practice it would be processed in sections.)
For a corpus of 10,000 100-page documents, contextual retrieval costs about $1,600 with caching, a one-time cost for meaningfully better retrieval.
When to Use Contextual Retrieval
High-value scenarios:
- Legal documents (contracts, case law): lots of cross-references and defined terms
- Financial reports: company names, time periods, and specific metrics that context clarifies
- Technical documentation: "the following API" needs context for which API
- Medical records: patient context, procedure context, references to earlier findings
- Books and long-form content: chapter context matters for recall
Lower-value scenarios:
- FAQ documents (already self-contained Q&A pairs)
- Short product descriptions
- Standalone articles where each section is context-complete
- Code snippets that are self-explanatory
Contextual Retrieval + BM25 = Even Better
Combine contextual embeddings with BM25 keyword search and merge with Reciprocal Rank Fusion (RRF) for the best results:
```python
from typing import List
from rank_bm25 import BM25Okapi

def reciprocal_rank_fusion(results_list: List[List[int]], k: int = 60) -> List[int]:
    """Merge multiple ranked lists using RRF."""
    scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(query: str, bm25_index: BM25Okapi,
                  vector_results: List[int], top_k: int = 10) -> List[int]:
    """Merge precomputed vector-search results with BM25 keyword results."""
    # BM25 results (rank_bm25 expects a tokenized query)
    tokenized_query = query.split()
    bm25_scores = bm25_index.get_scores(tokenized_query)
    bm25_results = sorted(range(len(bm25_scores)),
                          key=lambda i: bm25_scores[i], reverse=True)[:top_k * 2]
    # Merge with RRF
    return reciprocal_rank_fusion([vector_results[:top_k * 2], bm25_results])[:top_k]
```
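The RRF merge is easy to verify on two hand-made rankings (doc IDs here are arbitrary; the function is repeated so the snippet runs standalone):

```python
from typing import List

def reciprocal_rank_fusion(results_list: List[List[int]], k: int = 60) -> List[int]:
    """Merge multiple ranked lists using RRF, same logic as above."""
    scores = {}
    for results in results_list:
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = [1, 2, 3]  # doc 1 best by embedding similarity
bm25_ranking = [3, 1, 2]    # doc 3 best by keyword match

merged = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(merged)  # → [1, 3, 2]: doc 1 (ranks 1 and 2) edges out doc 3 (ranks 3 and 1)
```

Note that a document ranked first in one list and third in the other can beat one ranked second in both; the constant `k = 60` dampens how much the very top ranks dominate.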
Summary
Contextual Retrieval is one of the highest-ROI RAG improvements available:
- 35% fewer retrieval failures from contextual embeddings alone on Anthropic's benchmarks
- 49% fewer when combined with contextual BM25 hybrid search; 67% fewer with reranking added
- Cost: ~$0.16 per 100-page document with prompt caching
- Implementation time: 1-2 days for a working pipeline
If your RAG system handles long, structured documents (legal, financial, technical), contextual retrieval is likely the single best improvement you can make this quarter. The cost is low, the improvement is large, and the implementation is straightforward.