Query Expansion for Better RAG Retrieval (2026)
Query expansion transforms the user's query into multiple retrieval queries before hitting the vector index. Techniques include HyDE (generate a hypothetical answer document, embed it for retrieval), multi-query (generate 3–5 query variations), and sub-query decomposition (break complex queries into atomic sub-questions). These techniques improve recall by 10–25% on complex queries with minimal added latency.
When to Use
- Users phrase queries differently from how documents are written (vocabulary mismatch)
- Complex queries that span multiple topics or require combining information from different sections
- Domain-specific queries where an LLM can expand abbreviations and jargon to improve recall
- Conversational RAG, where user queries rely on conversation history and are underspecified
- Context recall below 0.70 on your eval set, where retrieval misses are due to phrasing rather than missing documents
How It Works
1. HyDE (Hypothetical Document Embeddings): ask the LLM to generate what a good answer document would look like, then embed the generated document (not the query) for retrieval. The hypothetical answer is in the same register as the actual documents, reducing vocabulary mismatch.
2. Multi-query expansion: ask the LLM to generate N (typically 3–5) alternative phrasings of the query. Retrieve top-K for each, deduplicate, and merge. Simple and effective for queries with ambiguous phrasing.
3. Sub-query decomposition: for complex multi-part questions, break the query into atomic sub-questions. Retrieve separately for each, then provide all retrieved context to the LLM for final synthesis. Handles multi-hop questions that single-query retrieval misses.
4. Step-back prompting: before searching for the specific answer, retrieve general context about the topic. 'What are the rules for insider trading?' → first retrieve 'insider trading overview', then retrieve for the specific rule query. Improves answer completeness.
5. Combine with RRF: when merging results from multiple query variants, use reciprocal rank fusion to score chunks that appear in multiple retrieval results higher — appearing in multiple searches is strong evidence of relevance.
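The RRF merge in the last step fits in a few lines. This is a minimal sketch: the `rrf_merge` helper and the chunk-id lists are illustrative assumptions, not a specific library API.

```python
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of chunk ids with reciprocal rank fusion.

    A chunk at rank r in one list contributes 1 / (k + r) to its score;
    chunks that appear in several variants accumulate score and rise
    to the top of the merged ranking.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# 'c2' appears in all three variants' results, so it ranks first.
merged = rrf_merge([
    ['c1', 'c2', 'c3'],
    ['c2', 'c4'],
    ['c5', 'c2'],
])
```

k = 60 is the constant from the original RRF formulation; it damps the influence of top ranks so no single variant's ordering dominates the merge.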
Examples
```python
# Generate a hypothetical answer document for retrieval (HyDE)
from anthropic import Anthropic

client = Anthropic()

def hyde_query(query: str) -> str:
    response = client.messages.create(
        model='claude-3-5-haiku-20241022',
        max_tokens=300,
        messages=[{
            'role': 'user',
            'content': f'Write a 2-paragraph answer to the following question. Write as if you are an expert. Focus on the key facts.\n\nQuestion: {query}'
        }]
    )
    return response.content[0].text

# Usage
hypothetical_doc = hyde_query('How does prompt caching reduce costs?')
hyde_embedding = embed_model.embed(hypothetical_doc)
results = vector_db.search(hyde_embedding, top_k=10)
```

```python
# Multi-query expansion with LangChain's MultiQueryRetriever
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(model='claude-3-5-haiku-20241022')

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5}),
    llm=llm,
    prompt=QUERY_PROMPT,  # prompt to generate 3 query variations
    include_original=True  # also search with original query
)

docs = retriever.get_relevant_documents(
    'What is the refund policy for cancelled subscriptions?'
)
```

Common Mistakes
- Running too many query variants without deduplication — 5 variants × top-10 retrieval = 50 chunks. Without deduplication, you send duplicate context to the LLM and waste tokens. Always deduplicate by chunk_id.
- Using HyDE for time-sensitive or factual queries — if the LLM's hypothetical answer contains factual errors, its embedding will retrieve the wrong documents. HyDE works best for conceptual queries, not queries hinging on specific facts (dates, numbers, names).
- Not including the original query in multi-query — always search with the original query alongside the variants. The original query often retrieves the most relevant results; variants supplement it.
- Applying query expansion to every query regardless of complexity — simple, well-phrased queries don't benefit from expansion. Classify queries by complexity and expand only complex, multi-part, or low-confidence queries.
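A complexity gate for the last point can start as a small heuristic function. The word-count thresholds and keyword list below are illustrative assumptions; production systems often use a lightweight classifier or a cheap LLM call instead.

```python
def should_expand(query: str) -> bool:
    """Heuristic gate: expand only queries that look complex."""
    words = query.split()
    if len(words) <= 4:
        # Short, specific queries rarely benefit from expansion.
        return False
    multi_part = {'and', 'or', 'versus', 'vs', 'compare', 'difference'}
    if any(w.lower().strip('?,.') in multi_part for w in words):
        # Conjunctions and comparison words suggest a multi-part question.
        return True
    # Long queries tend to span multiple topics.
    return len(words) >= 10

# Gate the expansion step, e.g.:
#   search_text = hyde_query(q) if should_expand(q) else q
```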
FAQ
Does query expansion add too much latency?
HyDE adds one LLM call (~100 ms for Haiku). Multi-query adds one LLM call plus N parallel vector searches. Total latency overhead is typically 100–300 ms. Run the expansion LLM call and the original-query search in parallel to minimize perceived latency — by the time expansion finishes, you already have the original query's results.
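The parallel pattern can be sketched with asyncio. The `expand` and `search` coroutines below are placeholders for your LLM client and vector store, with sleeps standing in for network latency.

```python
import asyncio

async def expand(query: str) -> str:
    # Placeholder for the HyDE / multi-query LLM call (~100-300 ms).
    await asyncio.sleep(0.1)
    return f'expanded: {query}'

async def search(query: str) -> list[str]:
    # Placeholder for a vector-store search.
    await asyncio.sleep(0.05)
    return [f'chunk-for-{query}']

async def retrieve(query: str) -> list[str]:
    # Start the original-query search and the expansion concurrently;
    # by the time expansion returns, the original results are ready.
    original_task = asyncio.create_task(search(query))
    expanded = await expand(query)
    expanded_results = await search(expanded)
    original_results = await original_task
    return original_results + expanded_results

results = asyncio.run(retrieve('prompt caching'))
```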
Is HyDE or multi-query better?
HyDE outperforms multi-query on technical domains where vocabulary mismatch is the primary problem. Multi-query outperforms HyDE on ambiguous or conversational queries where different phrasings yield different relevant results. Many production systems combine both.
How do I handle conversational context in query expansion?
Include the conversation history when expanding: 'Given this conversation: [history], what is the full expanded query for retrieval?' The LLM resolves pronouns and references from history. This is called 'contextual query rewriting' and is essential for conversational RAG.
Can query expansion make retrieval worse?
Yes — for short, specific queries with exact keyword matches. Expanding 'ERROR_CODE_42' to synonyms may retrieve tangentially related documents and miss the exact error. Detect queries with specific identifiers (codes, IDs, model names) and skip expansion for them.
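One way to detect such queries is a simple pattern check. The regex below is an illustrative assumption; tune it to the identifier formats that actually occur in your corpus.

```python
import re

# Tokens that look like exact identifiers: UPPER_SNAKE error codes,
# long hex runs, and hyphenated version/model strings like 'claude-3-5'.
IDENTIFIER_RE = re.compile(
    r'\b(?:[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+|[a-f0-9]{8,}|\w+-\d[\w.-]*)\b'
)

def has_exact_identifier(query: str) -> bool:
    """True when the query contains a code-like token; skip expansion
    for these and search with the raw query instead."""
    return bool(IDENTIFIER_RE.search(query))
```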
What model should I use for query expansion?
Use a fast, cheap model — Claude Haiku, GPT-4o-mini, or Gemini Flash. Query expansion doesn't require deep reasoning, just paraphrasing ability. The cost is per query (not per document), so it matters at scale. Using a $3/M token model for expansion when the retrieval model costs $0.10/M input tokens defeats the purpose.