Contextual Retrieval: Anthropic's Chunk Context Technique (2026)
Contextual retrieval fixes a core RAG problem: chunks lose their context when split from the original document. A chunk saying 'The policy covers all full-time employees' is ambiguous without knowing which policy. Contextual retrieval uses Claude to prepend 'This chunk is from the 2026 Benefits Policy, Section 3 (Eligibility)...' to each chunk before embedding. This context travels with the chunk through retrieval, reducing retrieval failure rates by 35–49%.
When to Use
- ✓ Your RAG system retrieves chunks that are grammatically complete but semantically ambiguous without their surrounding document context
- ✓ Document corpus has many similar-sounding sections across different documents (e.g., 'Overview' sections in every policy)
- ✓ Chunks frequently contain pronouns or references ('it', 'the aforementioned') that require document context to resolve
- ✓ Improving an existing RAG system without changing your embedding model or chunking strategy
- ✓ High-precision use cases (legal, medical, financial) where ambiguous context causes critical errors
How It Works
1. For each chunk in your corpus, call Claude (or another fast LLM) with the full document and the specific chunk, asking it to generate a 1-3 sentence context summary: 'Given this document, explain what this chunk is about in context.'
2. Prepend the generated context to the chunk text before embedding: '[CONTEXT: This chunk describes the eligibility criteria from the 2026 Benefits Policy, specifically covering full-time employee thresholds.] The policy covers all full-time employees working 30+ hours per week...'
3. Store both the contextualized embedding and the original chunk text. Use the contextualized embedding for retrieval but return the original chunk text (without the prepended context) to the LLM.
4. Batch the context generation using Claude's batch API to reduce cost. With batching and prompt caching, at roughly $0.25 per 1,000 chunks using Haiku for context generation, contextualizing 100K chunks costs about $25.
5. Combine with BM25 hybrid search: contextual retrieval improves both dense and sparse recall. Anthropic reports a 67% reduction in retrieval failures when contextual retrieval is combined with hybrid search and reranking.
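Steps 2-3 above can be sketched as a small round-trip: embed the contextualized text, but keep the original chunk for the LLM. The `embed` function and in-memory `StoredChunk` record below are placeholders (not from the original), standing in for your embedding model and vector database.

```python
from dataclasses import dataclass

@dataclass
class StoredChunk:
    original_text: str      # returned to the LLM after retrieval
    embedding: list[float]  # computed from context + chunk, used only for search

def embed(text: str) -> list[float]:
    # Placeholder: call your real embedding model here.
    return [float(len(text))]

def index_chunk(context: str, chunk: str) -> StoredChunk:
    # The context prefix influences the embedding but is never shown to the LLM.
    contextualized = f"{context}\n\n{chunk}"
    return StoredChunk(original_text=chunk, embedding=embed(contextualized))

stored = index_chunk(
    context="This chunk is from the 2026 Benefits Policy, Section 3 (Eligibility).",
    chunk="The policy covers all full-time employees working 30+ hours per week.",
)
```

After retrieval, `stored.original_text` is what reaches the LLM; the context prefix has already done its job inside the embedding.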
Examples
The context-generation prompt and a minimal implementation (the prompt template is embedded as `CONTEXT_PROMPT` so the code is self-contained; placeholder names match the `.format()` call):

```python
import anthropic

CONTEXT_PROMPT = """Here is a document:
<document>
{document}
</document>
Here is a chunk from that document:
<chunk>
{chunk}
</chunk>
Please give a short, succinct context (1-3 sentences) that explains what this chunk is about within the overall document. This context will be prepended to the chunk to improve retrieval. Answer only with the context, no preamble."""

def contextualize_chunks(
    document: str,
    chunks: list[str],
    model: str = 'claude-3-5-haiku-20241022',
) -> list[str]:
    """Generate a short context for each chunk and prepend it to the chunk text.

    Sequential for clarity; in production, submit these requests through
    Claude's batch API (see step 4 above) to reduce cost.
    """
    client = anthropic.Anthropic()
    contextualized = []
    for chunk in chunks:
        response = client.messages.create(
            model=model,
            max_tokens=200,
            messages=[{
                'role': 'user',
                'content': CONTEXT_PROMPT.format(
                    document=document, chunk=chunk
                ),
            }],
        )
        context = response.content[0].text
        contextualized.append(f'{context}\n\n{chunk}')
    return contextualized
```

Common Mistakes
- ✗ Not using prompt caching for the document prefix — the full document is sent with each chunk but doesn't change. Without caching, you pay full price for the document tokens on every chunk. With caching, you pay 10% after the first call.
- ✗ Generating context that's too long — a 5-sentence context is as effective as 2 sentences but costs 2.5x more tokens. Keep context to 1–3 sentences focusing on: document name, section, and the key point the chunk makes.
- ✗ Including the contextualized chunk in the LLM response context — the prepended context text is for embeddings, not for the LLM. When passing retrieved chunks to the LLM, strip the context prefix and pass only the original chunk text.
- ✗ Using contextual retrieval for all document types without testing — boilerplate-heavy documents (templated contracts, form letters) benefit less because the context is obvious. Test on your corpus before committing to the additional cost.
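The prompt-caching mistake above has a structural fix: split the request content into two text blocks and mark the document block with `cache_control`, so the unchanging document prefix is cached across the many per-chunk calls. A sketch that only builds the message payload (wiring it into `client.messages.create(...)` is left to your pipeline):

```python
def build_context_request(document: str, chunk: str) -> list[dict]:
    """Build a Messages API payload with the document prefix marked cacheable."""
    return [{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"Here is a document:\n<document>\n{document}\n</document>",
                # Cached after the first chunk's call; subsequent chunks pay
                # the reduced cache-read rate for these document tokens.
                "cache_control": {"type": "ephemeral"},
            },
            {
                "type": "text",
                # Only this part varies per chunk, so it stays outside the cache prefix.
                "text": (
                    f"Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n"
                    "Please give a short, succinct context (1-3 sentences) that explains "
                    "what this chunk is about within the overall document. "
                    "Answer only with the context, no preamble."
                ),
            },
        ],
    }]

messages = build_context_request("...full document text...", "...one chunk...")
```

The cacheable document block must come first: prompt caching matches on the longest shared prefix, so anything variable placed before it would defeat the cache.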
FAQ
How much does contextual retrieval cost?
Using Claude Haiku 3.5 at $0.80/1M input tokens and $4/1M output tokens, with a 2000-token document and 100-token context output, each chunk costs approximately $0.002. For 100K chunks, that's $200 total — a one-time cost. With prompt caching on the document (which you should always use), cost drops by 80-90% for documents with many chunks.
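The per-chunk arithmetic above can be checked directly (prices as stated in this FAQ; verify current rates against Anthropic's pricing page):

```python
# Stated Haiku 3.5 rates, converted to dollars per token.
input_price = 0.80 / 1_000_000   # $ per input token
output_price = 4.00 / 1_000_000  # $ per output token

doc_tokens, context_tokens = 2000, 100  # per-chunk input and output sizes

per_chunk = doc_tokens * input_price + context_tokens * output_price
total = per_chunk * 100_000  # one-time cost for 100K chunks, before caching

print(f"${per_chunk:.4f} per chunk, ${total:.0f} for 100K chunks")
```

Prompt caching then cuts the dominant term (the repeated document tokens), which is where the quoted 80-90% reduction comes from.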
Can I use a smaller/local model for context generation?
Yes, though quality varies. Llama 3.1 8B and Mistral 7B can generate reasonable chunk contexts. For most document types, a 7B model produces contexts that are 80-90% as good as Claude Haiku at zero API cost. Test on your specific document types before switching to a local model.
How does contextual retrieval compare to parent-child chunking?
They're complementary. Parent-child chunking retrieves small chunks but returns larger parent chunks for context. Contextual retrieval improves the embedding quality of small chunks so they're retrieved more accurately in the first place. Use both together: contextualized embeddings for small child chunks, parent chunk text for LLM context.
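The combination described above can be sketched as a record per child chunk (names hypothetical, not from the original): retrieval matches on the contextualized child embedding, while the LLM receives the larger parent section.

```python
from dataclasses import dataclass

@dataclass
class ChildChunk:
    contextualized_text: str  # context prefix + child text; this is what gets embedded
    parent_text: str          # larger enclosing section, returned to the LLM

def build_child(context: str, child: str, parent: str) -> ChildChunk:
    return ChildChunk(
        contextualized_text=f"{context}\n\n{child}",
        parent_text=parent,
    )

c = build_child(
    context="From the 2026 Benefits Policy, Section 3 (Eligibility).",
    child="Covers full-time employees working 30+ hours per week.",
    parent="Section 3: Eligibility. Covers full-time employees working 30+ hours per week, including remote staff.",
)
```

Each technique addresses a different stage: contextualization sharpens the search index, parent expansion enriches what the generator sees.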
Does contextual retrieval work for code documentation?
Very well. Code documentation chunks lose critical context ('this function is part of the authentication module'). Contextual retrieval adds class/module context to each docstring chunk, significantly improving retrieval for technical queries. It's particularly effective for API reference documentation.
When should I NOT use contextual retrieval?
For very short documents (under 500 tokens) where chunks already contain full context. For real-time ingestion pipelines where the context generation latency is unacceptable. For documents with boilerplate that doesn't meaningfully distinguish sections. In these cases, standard chunking with good metadata is sufficient.