Chunking Strategies for RAG (2026)
Chunking is how you split documents before embedding them for retrieval. Fixed-size chunking (512 tokens with 10% overlap) is the default but underperforms on structured documents. Semantic chunking groups sentences by topic, producing more coherent chunks. For best results in 2026, use semantic chunking with parent-child hierarchies — retrieve child chunks but include parent context when sending to the LLM.
When to Use
- ✓ Building any RAG pipeline where documents are longer than your context window
- ✓ Optimizing an existing RAG system where retrieval precision is low (many irrelevant chunks retrieved)
- ✓ Indexing heterogeneous documents (a mix of PDFs, Markdown, and HTML) that have different natural structures
- ✓ When answers span multiple sections of a document and single-chunk retrieval keeps missing them
- ✓ Comparing chunking approaches on a new document corpus before committing to an architecture
How It Works
1. Fixed-size chunking: split text every N tokens (typically 256–512) with overlap (typically 10–20%). Fast and simple. Overlap prevents cutting sentences mid-thought. Fails for tabular data, code, and hierarchical documents.
2. Sentence/paragraph chunking: split at natural boundaries (sentence endings, paragraph breaks). Produces semantically coherent chunks but variable sizes — some chunks may be 10 tokens, others 800. Requires downstream filtering by min/max size.
3. Semantic chunking: embed each sentence, then group consecutive sentences with high cosine similarity into chunks. LangChain and LlamaIndex both implement this. Produces more topically coherent chunks, improving recall by 10–20% on most benchmarks.
4. Hierarchical (parent-child) chunking: index small child chunks (128 tokens) for precise retrieval, but store and return larger parent chunks (512–1024 tokens) to the LLM for context. This gives the retriever precision and the LLM context.
5. Document-aware chunking: respect document structure — split Markdown at headers, HTML at section tags, PDFs at detected section boundaries. Requires document-specific parsing but dramatically improves retrieval on structured documents.
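As a concrete baseline, step 1 above can be sketched in a few lines. This is a minimal sketch, not a library API: `fixed_size_chunks` is a hypothetical helper, and a list of placeholder strings stands in for real tokenizer output.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size windows that overlap, so a
    sentence cut at one boundary keeps its context in the next chunk."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [f't{i}' for i in range(1200)]   # stand-in for tokenizer output
chunks = fixed_size_chunks(tokens)
# each chunk after the first begins with the previous chunk's last 64 tokens
```

The 64-token overlap on a 512-token window is the ~12% overlap the text recommends; in a real pipeline you would count tokens with the same tokenizer your embedding model uses.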
Examples
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=95,
)
chunks = text_splitter.split_text(document_text)
print(f'Created {len(chunks)} semantic chunks')
```

```python
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, intermediate, child
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)
# Index leaf_nodes for retrieval
# Return parent node content to the LLM
```
Common Mistakes
- ✗ Using the same chunk size for all document types — a 512-token chunk is appropriate for prose but too large for code (functions vary from 5 to 200 tokens) and too small for tables (a single table can be 2000 tokens).
- ✗ No overlap in fixed-size chunking — without overlap, sentences split across chunk boundaries lose context in both chunks. Use 10–15% overlap (50 tokens for 512-token chunks) as the minimum.
- ✗ Not filtering chunk size after splitting — semantic and paragraph splitting produce variable-size chunks. Chunks under 50 tokens are often incomplete sentences; chunks over 1500 tokens contain multiple topics. Filter both extremes.
- ✗ Embedding chunks without cleaning — PDFs often contain headers/footers repeated on every page, page numbers, and column artifacts. Clean text before chunking or you'll pollute every nearby chunk with irrelevant tokens.
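A post-split size filter of the kind described above can be sketched as follows. The 50/1500 bounds come from the text; whitespace word count is used here as a rough token-count proxy (swap in your tokenizer's count in production).

```python
def filter_by_size(chunks, min_tokens=50, max_tokens=1500):
    """Drop fragments below min_tokens and multi-topic giants above
    max_tokens. Counts are whitespace words, a rough token proxy."""
    return [c for c in chunks if min_tokens <= len(c.split()) <= max_tokens]

chunks = ['stray heading', 'word ' * 200, 'word ' * 2000]
kept = filter_by_size(chunks)   # only the 200-word chunk survives
```

In practice you may want to merge undersized chunks with a neighbor rather than drop them, so their content is not lost from the index.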
FAQ
What chunk size works best?
For most retrieval tasks, 256–512 tokens is the sweet spot. Smaller chunks (128 tokens) have higher precision but less context. Larger chunks (1024+ tokens) have better context but lower precision — the embedding has to represent too many topics. Test on your specific corpus; the right answer depends heavily on document type.
Should I use the same model for chunking and retrieval embeddings?
For semantic chunking (embedding-based boundary detection), yes — use the same embedding model you'll use for retrieval so the similarity scores are consistent. Mixing embedding models for chunking and retrieval adds a mismatch that can degrade quality.
How does chunking interact with metadata filtering?
Metadata filtering restricts retrieval to chunks from specific documents, time ranges, or categories before vector similarity search. Your chunking strategy should propagate document-level metadata (source, date, section) to each chunk. This allows precise filtered retrieval without scanning the whole index.
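Propagating document-level metadata to every chunk can be sketched like this; the record shape is an assumption, so match whatever schema your vector store expects.

```python
def attach_metadata(chunks, doc_metadata):
    """Copy source/date/section metadata onto each chunk record so the
    vector store can filter on it before similarity search."""
    return [
        {'text': text, 'metadata': {**doc_metadata, 'chunk_index': i}}
        for i, text in enumerate(chunks)
    ]

records = attach_metadata(
    ['chunk one', 'chunk two'],
    {'source': 'handbook.pdf', 'date': '2026-01-15', 'section': 'benefits'},
)
```

The per-chunk `chunk_index` also lets you reassemble neighboring chunks at answer time when a retrieved chunk sits mid-section.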
What's late chunking?
Late chunking (introduced in 2024 by Jina AI) embeds the full document first to capture global context, then chunks the resulting token embeddings. This preserves cross-sentence context in each chunk's embedding. It requires a model that can embed long documents (up to 8K tokens) and is more expensive but produces better embeddings for context-heavy content.
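The pooling step of late chunking can be sketched with NumPy, assuming you already have per-token embeddings from one forward pass of a long-context model and chunk boundaries expressed as token offsets — both assumptions; Jina's actual implementation differs in detail.

```python
import numpy as np

def late_chunk(token_embeddings, boundaries):
    """Mean-pool contextualized token embeddings into one vector per
    chunk. token_embeddings is (n_tokens, dim) from a single pass over
    the whole document; boundaries is a list of (start, end) offsets."""
    return np.stack([
        token_embeddings[start:end].mean(axis=0)
        for start, end in boundaries
    ])

emb = np.arange(12, dtype=float).reshape(6, 2)   # toy (6 tokens, dim 2)
chunk_vecs = late_chunk(emb, [(0, 3), (3, 6)])   # two 3-token chunks
```

Because each token embedding was computed with the full document in its attention window, the pooled chunk vectors carry cross-sentence context that chunk-then-embed pipelines lose.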
How many chunks should I retrieve per query?
Start with top-5 and measure. Retrieving more chunks increases recall but also increases noise sent to the LLM. If you have a reranker, retrieve top-20 and rerank to top-5. Without a reranker, top-5 from pure vector search is typically better than top-20.
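The retrieve-then-rerank pattern can be sketched as two stages. Here `rerank_score` is a stand-in for a cross-encoder call (an assumption — real rerankers score (query text, chunk text) pairs, not indices).

```python
import numpy as np

def retrieve_then_rerank(query_vec, chunk_vecs, rerank_score,
                         k_retrieve=20, k_final=5):
    """Stage 1: cheap cosine-similarity top-k over all chunks.
    Stage 2: rerank only the k_retrieve survivors with the
    expensive scorer, keeping k_final."""
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    candidates = np.argsort(-sims)[:k_retrieve]
    reranked = sorted(candidates, key=rerank_score, reverse=True)
    return [int(i) for i in reranked[:k_final]]

query = np.array([1.0, 0.0])
chunk_vecs = np.array([[1, 0], [0.9, 0.1], [0.5, 0.5],
                       [0, 1], [0.8, 0.2], [0.1, 0.9]], dtype=float)
# toy reranker: prefer lower chunk indices
top = retrieve_then_rerank(query, chunk_vecs, rerank_score=lambda i: -i,
                           k_retrieve=4, k_final=2)
```

The key property is cost: the vector similarity runs over the whole index, while the quadratic-cost reranker only ever sees k_retrieve candidates.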