Late Chunking: Long-Context Embeddings for RAG (2026)
Traditional chunking splits text first and then embeds each chunk independently, so every chunk loses the context around it. Late chunking runs a long-context embedding model over the full document, then splits the resulting token embeddings into chunks at the boundary positions — so each chunk embedding reflects its place in the document. Jina reports 10–15% better retrieval on context-dependent corpora vs. traditional chunking.
When to Use
- ✓ Documents where chunks frequently contain pronouns and references that require document context to understand ('this regulation', 'the aforementioned clause')
- ✓ Legal, medical, or scientific documents with dense cross-references between sections
- ✓ Narratives or sequential documents where context from earlier sections is essential for understanding later chunks
- ✓ When contextual retrieval is too expensive (per-chunk LLM calls) but you still need context-aware embeddings
- ✓ Documents under 8K tokens, where a long-context embedding model can process the full document in one pass
How It Works
1. Process the full document through a long-context transformer (Jina v3, nomic-embed-text v2, or similar with 8K+ token capacity) to get token-level embeddings.
2. Record your chunk boundary positions in terms of token indices from the tokenizer.
3. For each chunk, mean-pool the token embeddings that fall within that chunk's token range. This produces one embedding vector per chunk, just like traditional chunking — but each embedding was computed with full document context.
4. Store chunk embeddings in your vector index as normal. At query time, embed the query with the same model and retrieve as usual. The improvement is entirely in the index quality.
5. Note that late chunking requires a model that can return token-level embeddings, not just a single document embedding. Jina Embeddings v3 and BGE-M3 support this; OpenAI's API currently does not expose token embeddings.
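The pooling in step 3 can be sketched with plain NumPy, independent of any particular model — the token embeddings and chunk ranges below are made-up stand-ins, not output from a real encoder:

```python
import numpy as np

# Toy stand-in for a long-context model's output: 10 tokens,
# 4-dim embeddings, each computed with full-document attention.
token_embs = np.arange(40, dtype=np.float32).reshape(10, 4)

# Chunk boundaries as (start_token, end_token) index pairs (step 2).
chunk_token_ranges = [(0, 4), (4, 10)]

# Step 3: mean-pool each chunk's token range into one vector per chunk.
chunk_embeddings = [token_embs[s:e].mean(axis=0) for s, e in chunk_token_ranges]

print(len(chunk_embeddings))   # 2 chunk vectors
print(chunk_embeddings[0])     # → [6. 7. 8. 9.] (mean of tokens 0..3)
```

Each resulting vector has the same shape as a traditionally chunked embedding, so the rest of the indexing pipeline (step 4) is unchanged.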
Examples
```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'jinaai/jina-embeddings-v3'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

def late_chunk_embed(document: str, chunk_boundaries: list[tuple[int, int]]):
    # Tokenize the full document, keeping character offsets per token
    inputs = tokenizer(document, return_tensors='pt', return_offsets_mapping=True)
    offsets = inputs['offset_mapping'][0]
    # Single forward pass over the whole document for token embeddings
    with torch.no_grad():
        outputs = model(**{k: v for k, v in inputs.items() if k != 'offset_mapping'})
    token_embs = outputs.last_hidden_state[0]  # [seq_len, hidden_dim]
    # Mean-pool the tokens that fall inside each chunk's character range
    chunk_embeddings = []
    for start_char, end_char in chunk_boundaries:
        mask = (offsets[:, 0] >= start_char) & (offsets[:, 1] <= end_char)
        chunk_emb = token_embs[mask].mean(dim=0).numpy()
        chunk_embeddings.append(chunk_emb)
    return chunk_embeddings
```
Common Mistakes
- ✗ Using late chunking on documents longer than the model's context window — if the document is 20K tokens and the model supports 8K, you can't process it in one pass. You'll need to use overlapping windows or fall back to traditional chunking for long documents.
- ✗ Confusing late chunking with parent-child chunking — they solve different problems. Parent-child gives the LLM more context at generation time; late chunking gives the embedding more context at index time.
- ✗ Not adapting the tokenizer to your chunk boundaries — character-level chunk boundaries don't map perfectly to token boundaries. You need the model tokenizer's offset_mapping to find which tokens belong to each chunk.
- ✗ Applying late chunking to structured data (tables, code) — late chunking's benefit comes from cross-sentence narrative context. For structured data where each row/function is self-contained, traditional chunking performs identically.
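The offset-mapping point above can be illustrated without a model. A HuggingFace fast tokenizer called with return_offsets_mapping=True returns one (start_char, end_char) pair per token; selecting a chunk's tokens is then a containment check. The offsets below are hand-made stand-ins, not output from a real tokenizer:

```python
# Hypothetical per-token character offsets, in the shape a fast
# tokenizer returns via return_offsets_mapping=True.
offsets = [(0, 4), (5, 9), (10, 14), (15, 22), (23, 30)]

def tokens_in_chunk(offsets, start_char, end_char):
    # A token belongs to the chunk if its span lies fully inside
    # the chunk's character range.
    return [i for i, (s, e) in enumerate(offsets)
            if s >= start_char and e <= end_char]

print(tokens_in_chunk(offsets, 0, 14))   # → [0, 1, 2]
print(tokens_in_chunk(offsets, 15, 30))  # → [3, 4]
```

A chunk boundary that falls mid-token (say, end_char=12 above) simply excludes that token, which is why boundaries should be snapped to token edges before pooling.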
FAQ
What's the computational cost of late chunking vs. traditional chunking?
Late chunking processes the full document in a single forward pass regardless of how many chunks it produces, while traditional chunking requires one forward pass per chunk. For a 4K-token document split into 8 chunks of 512 tokens, that is 1 forward pass vs. 8. Fewer passes does not automatically mean less compute, though: self-attention scales quadratically with sequence length, so the single long pass can cost more FLOPs than several short ones. In practice the two are usually comparable, with late chunking saving per-call overhead when a document yields many chunks.
Does late chunking work with OpenAI embeddings?
Not currently. OpenAI's API returns a single document embedding, not token-level embeddings, and late chunking requires access to token-level hidden states. You need to either use the Jina API or run an open-source model such as jina-embeddings-v3 or bge-m3 locally (e.g. via Hugging Face transformers).
How does late chunking compare to contextual retrieval?
Both add context to chunk embeddings but through different mechanisms. Contextual retrieval uses an LLM to generate a text context summary prepended before embedding. Late chunking uses the embedding model itself to encode context through full-document attention. Late chunking is more elegant and cheaper (no extra LLM calls), but requires a specific type of embedding model. Both improve retrieval quality similarly on most benchmarks.
What models support late chunking?
Models that expose token-level embeddings: Jina Embeddings v3 (8K context, state-of-the-art quality), BGE-M3 (multilingual, 8K context), nomic-embed-text v1.5. All are available as open-source models. For API access, Jina's API supports late chunking natively.
Should I use late chunking in production today?
Late chunking is mature enough for production in 2026 if your documents are under 8K tokens and you can run Jina v3 locally or via API. For longer documents, combine late chunking for sections with overlapping-window processing for the full document. If you prefer API-only embeddings, contextual retrieval is the practical alternative.
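For documents that exceed the model's context window, the overlapping-window fallback mentioned above reduces to boundary arithmetic. A minimal sketch, where the window and overlap sizes are illustrative choices rather than fixed recommendations:

```python
def window_spans(n_tokens: int, window: int = 8192, overlap: int = 512):
    # Split a long document into overlapping token spans, each of
    # which fits an 8K-context model and can be late-chunked on its own.
    stride = window - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return spans

print(window_spans(20_000))  # → [(0, 8192), (7680, 15872), (15360, 20000)]
```

Chunks that fall inside the overlap region appear in two windows; a simple policy is to keep the embedding from the window where the chunk sits farther from the edge, since that version saw more surrounding context.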