
Prompt Compression: Reducing Token Count Without Losing Quality (2026)

Quick Answer

Prompt compression reduces the token count of prompts before sending them to an expensive frontier model. Techniques range from simple text truncation and deduplication to learned compression with LLMLingua, which can remove 80% of tokens from retrieved context with under 5% quality drop on many QA tasks. The ROI is highest in high-volume RAG pipelines where context is large and query volume is high.

When to Use

  • High-volume RAG pipelines where retrieved context is long (2K+ tokens) and you run thousands of queries per day
  • Fitting long documents into smaller, cheaper models to reduce cost while maintaining quality
  • Reducing latency in latency-sensitive applications where time-to-first-token matters
  • Extracting only the information-dense portions of boilerplate-heavy documents (legal contracts, terms of service)
  • A/B testing whether a compressed prompt maintains quality before permanently switching to a cheaper model

How It Works

  1. Manual compression: remove verbose instructions, collapse repeated whitespace, shorten few-shot examples, use abbreviations. Often reduces prompts by 20–40% with minimal quality impact on well-written prompts.
  2. Selective truncation: rank sentences or chunks by relevance to the query (BM25 or embedding similarity), keep only the top-k. Simple but effective — the query-relevant sentences typically make up only ~30% of a retrieved document.
  3. LLMLingua (Microsoft, 2023–2025): a small LM (125M parameters) scores each token's importance given the query and removes low-importance tokens. Achieves 5–20x compression with a 5–10% quality drop on reading comprehension tasks.
  4. Semantic summarization: use a cheap model (GPT-4o-mini, Claude Haiku) to summarize retrieved chunks before passing them to the expensive model. Effective for narrative documents; less effective for technical content where exact wording matters.
  5. Caching compressed versions: pre-compress static documents in your corpus and store the compressed form in the vector store alongside the original text. This amortizes compression cost across all queries.
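As a minimal sketch of step 1, the cheapest compressions (whitespace collapsing, duplicate-line removal) need no model at all. The function name and rule set here are illustrative choices, not a standard API:

```python
import re

def manually_compress(prompt: str) -> str:
    """Apply cheap manual compressions: collapse runs of whitespace
    and drop blank or exactly duplicated lines."""
    seen = set()
    kept = []
    for line in prompt.splitlines():
        line = re.sub(r'\s+', ' ', line).strip()
        if line and line not in seen:
            seen.add(line)
            kept.append(line)
    return '\n'.join(kept)
```

Passes like this are near-lossless, so they are safe to run before any learned compression step.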

Examples

Manual instruction compression
BEFORE:
You are a helpful assistant. Your task is to carefully read the user's question and provide a comprehensive, accurate answer. Please make sure to consider all aspects of the question before responding. If you are not certain about something, please indicate that clearly.

AFTER:
Answer accurately. Flag uncertainty explicitly.
Output: Token reduction: ~70 tokens → 7 tokens (90% compression). Quality impact: minimal for factual Q&A. Do not compress if the verbose instructions are load-bearing (e.g., if 'consider all aspects' actually changes behavior you measured).
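To sanity-check a rewrite like the one above, a rough token count is enough. This sketch uses whitespace splitting as a proxy for model tokens (real tokenizers such as tiktoken count differently, so treat the ratio as approximate):

```python
def approx_compression_ratio(before: str, after: str) -> float:
    """Fraction of tokens removed, using whitespace tokens as a
    cheap proxy for model tokenizer counts."""
    n_before, n_after = len(before.split()), len(after.split())
    return 1 - n_after / n_before

before = ("You are a helpful assistant. Your task is to carefully read the "
          "user's question and provide a comprehensive, accurate answer.")
after = "Answer accurately. Flag uncertainty explicitly."
ratio = approx_compression_ratio(before, after)
```

For precise numbers, swap the whitespace split for your target model's tokenizer.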
Chunk relevance filtering before LLM call
# Python: filter retrieved chunks to the top-k most relevant before the LLM call
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def compress_context(query_emb, chunks, chunk_embs, top_k=3):
    # Score each chunk by cosine similarity to the query embedding.
    scores = cosine_similarity([query_emb], chunk_embs)[0]
    # Indices of the top-k scores, most relevant first.
    top_idx = np.argsort(scores)[-top_k:][::-1]
    return ' '.join(chunks[i] for i in top_idx)
Output: Returns the top-3 most semantically relevant sentences from the retrieved chunks. Reduces context by 60–80% for typical 10-chunk retrievals. Combine with a reranker for better precision.
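The filter above can be exercised end to end with toy data. In this usage sketch the 2-D vectors stand in for real embedding-model output, and the function is repeated so the snippet runs on its own:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compress_context(query_emb, chunks, chunk_embs, top_k=3):
    scores = cosine_similarity([query_emb], chunk_embs)[0]
    top_idx = np.argsort(scores)[-top_k:][::-1]
    return ' '.join(chunks[i] for i in top_idx)

# Toy 2-D "embeddings"; a real pipeline would use an embedding model.
query = np.array([1.0, 0.0])
chunks = ["relevant A", "off-topic", "relevant B"]
embs = np.array([[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
kept = compress_context(query, chunks, embs, top_k=2)
```

Here the two chunks pointing roughly along the query vector survive and the orthogonal one is dropped.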

Common Mistakes

  • Compressing without measuring quality — always compare compressed vs. full-context performance on a held-out eval set. Some tasks (exact quote extraction, legal analysis) degrade severely with compression.
  • Applying the same compression ratio to all document types — code and structured data should not be compressed (removing tokens breaks syntax); narrative text can handle 70%+ compression.
  • Using compression instead of prompt caching — if your context is static or semi-static, prompt caching (90% cost reduction) is better than compression (lossy). Try caching first.
  • Over-compressing system prompts — system prompts are read-heavy; the model relies on every instruction. Compressing them aggressively causes instruction-following failures.

FAQ

What is LLMLingua and how does it work?

LLMLingua (Microsoft Research) uses a small language model (GPT-2 or similar) to score the perplexity of each token in the prompt given the query. Low-perplexity tokens (predictable, low-information) are pruned. LLMLingua-2 (2024) improves on this with a trained compressor that achieves better compression-quality tradeoffs. It's available as an open-source Python library.
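The core idea can be illustrated without the library itself. This toy sketch is not the LLMLingua API: the importance scores are supplied by hand here, where the real system derives them from a small LM's per-token perplexity before pruning:

```python
def prune_tokens(tokens, importances, keep_ratio=0.5):
    """Keep the highest-importance tokens, preserving original order.
    In LLMLingua the scores come from a small LM; here they are
    hand-supplied for illustration."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: importances[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore reading order
    return [tokens[i] for i in keep]
```

Note that order must be restored after ranking; emitting tokens in score order would scramble the compressed prompt.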

How much does prompt compression save in practice?

For a typical RAG pipeline with 3K token contexts at Claude Sonnet pricing ($3/M input tokens), compressing to 1K tokens saves $0.006 per query. At 100K queries/day, that's $600/day or $219K/year. The ROI depends heavily on query volume.
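The arithmetic behind that estimate, as a quick sketch (function name is illustrative):

```python
def compression_savings(tokens_saved_per_query, price_per_million, queries_per_day):
    """Dollars saved per day and per year from sending fewer input tokens."""
    per_day = tokens_saved_per_query / 1_000_000 * price_per_million * queries_per_day
    return per_day, per_day * 365

# 3K -> 1K context (2K tokens saved) at $3/M input, 100K queries/day,
# as in the figures quoted above.
per_day, per_year = compression_savings(2_000, 3.00, 100_000)
```

Run the same formula with your own volume; below a few thousand queries per day the savings rarely justify the added pipeline complexity.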

Does prompt compression work for code?

Poorly. Code depends on precise syntax — removing tokens like parentheses or indentation breaks it. For code contexts, use selective file inclusion (only include relevant files), comment stripping, or whitespace normalization instead of learned compression.
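A minimal sketch of the comment-stripping alternative for Python source. It is naive by design: a robust version should use the tokenize module so a '#' inside a string literal is never mistaken for a comment:

```python
def strip_comments(source: str) -> str:
    """Drop lines that are entirely a '#' comment; keep everything else.
    Naive: does not touch trailing inline comments, and would wrongly
    drop a line whose code happens to start with '#' inside a string."""
    return '\n'.join(
        line for line in source.splitlines()
        if not line.lstrip().startswith('#')
    )
```

Unlike learned compression, this preserves syntax exactly, so the remaining code still parses.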

Is compression better than using a smaller model?

They're complementary. Compression + small model can outperform a large model on many tasks at 1/10th the cost. The pattern: use LLMLingua to compress the context by 5x, then route to a smaller model. This works best when the quality bottleneck is context length, not model capability.

What about the selective context technique from Anthropic?

Anthropic's contextual retrieval (2024) addresses a related problem by prepending chunk-specific context before embedding, improving retrieval precision. This is complementary to compression — better retrieval means you retrieve fewer but more relevant chunks, reducing context size before the compression step.
