
Chunking Strategies for RAG (2026)

Quick Answer

Chunking is how you split documents before embedding them for retrieval. Fixed-size chunking (512 tokens with 10% overlap) is the default but underperforms on structured documents. Semantic chunking groups sentences by topic, producing more coherent chunks. For best results in 2026, use semantic chunking with parent-child hierarchies — retrieve child chunks but include parent context when sending to the LLM.

When to Use

  • Building any RAG pipeline where documents are longer than your context window
  • Optimizing an existing RAG system where retrieval precision is low (many irrelevant chunks retrieved)
  • Indexing heterogeneous documents (mix of PDFs, Markdown, HTML) that have different natural structures
  • When answers span multiple sections of a document and single-chunk retrieval keeps missing them
  • Comparing chunking approaches on a new document corpus before committing to an architecture

How It Works

  1. Fixed-size chunking: split text every N tokens (typically 256–512) with overlap (typically 10–20%). Fast and simple. Overlap prevents cutting sentences mid-thought. Fails for tabular data, code, and hierarchical documents.
  2. Sentence/paragraph chunking: split at natural boundaries (sentence endings, paragraph breaks). Produces semantically coherent chunks but variable size — some chunks may be 10 tokens, others 800. Requires downstream filtering by min/max size.
  3. Semantic chunking: embed each sentence, then group consecutive sentences with high cosine similarity into chunks. LangChain and LlamaIndex both implement this. Produces more topically coherent chunks, improving recall by 10–20% on most benchmarks.
  4. Hierarchical (parent-child) chunking: index small child chunks (128 tokens) for precise retrieval, but store and return larger parent chunks (512–1024 tokens) to the LLM for context. This gives the retriever precision and the LLM context.
  5. Document-aware chunking: respect document structure — split Markdown at headers, HTML at section tags, PDFs at detected section boundaries. Requires document-specific parsing but dramatically improves retrieval on structured documents.
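
As an illustration of step 1, fixed-size chunking with overlap reduces to a sliding window over a pre-tokenized document. This is a minimal sketch, not any library's implementation; the function name and parameters are hypothetical, and a real tokenizer (e.g. one matching your embedding model) is assumed to have produced the token list.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks with overlap.

    Each chunk starts (chunk_size - overlap) tokens after the previous
    one, so consecutive chunks share `overlap` tokens at the boundary.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With 1000 tokens, chunk_size=512, and overlap=64, this yields three chunks, the second starting at token 448 so it repeats the last 64 tokens of the first.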

Examples

Semantic chunking with LangChain
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=95
)

chunks = text_splitter.split_text(document_text)
print(f'Created {len(chunks)} semantic chunks')
Notes: creates semantically coherent chunks by detecting topic shifts. With breakpoint_threshold_type='percentile', a threshold of 95 splits only where the embedding distance between adjacent sentences exceeds the 95th percentile of all such distances, producing fewer, larger chunks. Lower values (e.g. 85) split more often, producing more, smaller chunks.
Parent-child retrieval setup
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # parent, intermediate, child
)
nodes = parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Index leaf_nodes for retrieval
# Return parent node content to LLM
Notes: builds a 3-level hierarchy: 128-token leaves for retrieval precision, 512-token intermediate nodes, and 2048-token parents for LLM context. Retrieval precision improves because 128-token chunks are more topically specific.

Common Mistakes

  • Using the same chunk size for all document types — a 512-token chunk is appropriate for prose but too large for code (functions vary from 5 to 200 tokens) and too small for tables (a single table can be 2000 tokens).
  • No overlap in fixed-size chunking — without overlap, sentences split across chunk boundaries lose context in both chunks. Use 10–15% overlap (50 tokens for 512-token chunks) as the minimum.
  • Not filtering chunk size after splitting — semantic and paragraph splitting produce variable-size chunks. Chunks under 50 tokens are often incomplete sentences; chunks over 1500 tokens contain multiple topics. Filter both extremes.
  • Embedding chunks without cleaning — PDFs often contain headers/footers repeated on every page, page numbers, and column artifacts. Clean text before chunking or you'll pollute every nearby chunk with irrelevant tokens.
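
The size-filtering mistake above can be avoided with a simple post-splitting pass. This is a hedged sketch with hypothetical names; it uses word count as a stand-in for token count, which a real pipeline would replace with its tokenizer.

```python
def filter_by_size(chunks, count_tokens=lambda c: len(c.split()),
                   min_tokens=50, max_tokens=1500):
    """Separate chunks that fall outside the usable size range.

    Undersized chunks are usually sentence fragments (merge into a
    neighbor or drop); oversized chunks usually span multiple topics
    (re-split with a smaller splitter).
    """
    kept, too_small, too_large = [], [], []
    for chunk in chunks:
        n = count_tokens(chunk)
        if n < min_tokens:
            too_small.append(chunk)
        elif n > max_tokens:
            too_large.append(chunk)
        else:
            kept.append(chunk)
    return kept, too_small, too_large
```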

FAQ

What chunk size works best?

For most retrieval tasks, 256–512 tokens is the sweet spot. Smaller chunks (128 tokens) have higher precision but less context. Larger chunks (1024+ tokens) have better context but lower precision — the embedding has to represent too many topics. Test on your specific corpus; the right answer depends heavily on document type.

Should I use the same model for chunking and retrieval embeddings?

For semantic chunking (embedding-based boundary detection), yes — use the same embedding model you'll use for retrieval so the similarity scores are consistent. Mixing embedding models for chunking and retrieval adds a mismatch that can degrade quality.

How does chunking interact with metadata filtering?

Metadata filtering restricts retrieval to chunks from specific documents, time ranges, or categories before vector similarity search. Your chunking strategy should propagate document-level metadata (source, date, section) to each chunk. This allows precise filtered retrieval without scanning the whole index.
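
Propagating document metadata down to chunks can be as simple as copying a metadata dict onto each record at split time. A minimal sketch, assuming a splitter callable and hypothetical field names; real frameworks (LangChain, LlamaIndex) handle this via their document/node abstractions.

```python
def chunk_with_metadata(doc_text, doc_meta, splitter):
    """Attach document-level metadata to every chunk.

    Each chunk record carries the parent document's fields (source,
    date, section, ...) plus its own index, so the vector store can
    filter on them before similarity search.
    """
    return [
        {"text": chunk, "metadata": {**doc_meta, "chunk_index": i}}
        for i, chunk in enumerate(splitter(doc_text))
    ]
```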

What's late chunking?

Late chunking (introduced in 2024 by Jina AI) embeds the full document first to capture global context, then chunks the resulting token embeddings. This preserves cross-sentence context in each chunk's embedding. It requires a model that can embed long documents (up to 8K tokens) and is more expensive but produces better embeddings for context-heavy content.
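
The pooling step of late chunking can be sketched in isolation: given contextual per-token embeddings from one full-document pass, each chunk's vector is a mean over its token span. This is an illustrative fragment only; the long-context embedding model that produces token_embeddings is assumed, and the function name is hypothetical.

```python
def late_chunk_pool(token_embeddings, boundaries):
    """Mean-pool contextual token embeddings into one vector per chunk.

    token_embeddings: list of per-token vectors from embedding the FULL
    document in one pass (each vector already carries global context).
    boundaries: [(start, end), ...] token spans, one per chunk.
    """
    pooled = []
    for start, end in boundaries:
        span = token_embeddings[start:end]
        dim = len(span[0])
        pooled.append([sum(vec[d] for vec in span) / len(span) for d in range(dim)])
    return pooled
```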

How many chunks should I retrieve per query?

Start with top-5 and measure. Retrieving more chunks increases recall but also increases noise sent to the LLM. If you have a reranker, retrieve top-20 and rerank to top-5. Without a reranker, top-5 from pure vector search is typically better than top-20.
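
The retrieve-then-rerank pattern above takes only a few lines once the two stages are abstracted. This sketch assumes hypothetical callables: index_search for first-stage vector search and rerank_score for the second-stage (e.g. cross-encoder) scorer.

```python
def retrieve_then_rerank(query, index_search, rerank_score,
                         k_retrieve=20, k_final=5):
    """Two-stage retrieval: broad vector search, then precise reranking."""
    candidates = index_search(query, k_retrieve)   # fast, approximate recall
    ranked = sorted(candidates,
                    key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:k_final]                        # precise top-k for the LLM
```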
