Prompt Caching: 90% Cost Reduction on Repeated Contexts (2026)
Prompt caching works by storing the KV cache of your prompt prefix on Anthropic/OpenAI servers. Requests that hit the same prefix pay 10 cents on the dollar for those tokens. For any agent that sends the same large system prompt, tool definitions, or reference documents with every request, caching cuts costs 80–90%. You enable it by adding a cache_control breakpoint to the content you want cached.
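The arithmetic behind that 80–90% figure is easy to sketch. The helper below is illustrative only (the $3/MTok base price is a placeholder); it uses the 1.25x write and 0.10x read multipliers described later in this guide:

```python
def cached_cost(n_requests: int, prompt_tokens: int, base_per_mtok: float = 3.00) -> float:
    """Dollar cost of n requests sharing one cached prefix: one cache
    write at 1.25x the base input price, then reads at 0.10x."""
    per_tok = base_per_mtok / 1_000_000
    write = prompt_tokens * per_tok * 1.25                    # first request (miss)
    reads = (n_requests - 1) * prompt_tokens * per_tok * 0.10  # later requests (hits)
    return write + reads

def savings(n_requests: int, prompt_tokens: int, base_per_mtok: float = 3.00) -> float:
    """Fractional saving versus sending the prefix uncached every time."""
    uncached = n_requests * prompt_tokens * base_per_mtok / 1_000_000
    return 1 - cached_cost(n_requests, prompt_tokens, base_per_mtok) / uncached

# Savings rise with traffic: about 78.5% over 10 requests,
# about 89% over 100, approaching 90% in the limit.
print(savings(10, 10_000), savings(100, 10_000))
```

Note that a prefix used only once costs more than an uncached request, because you pay the write premium with no reads to amortize it.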
When to Use
- ✓ Large system prompts (500+ tokens) sent with every agent request
- ✓ Tool definitions for agents with many tools — definitions can be 2K–5K tokens
- ✓ RAG systems where the same reference documents are retrieved and sent frequently
- ✓ Contextual retrieval pipelines where you process the same source document with many chunk prompts
- ✓ Any application where users have multiple turns and the conversation prefix grows large
How It Works
1. Place a cache_control breakpoint on the last content block you want cached: {'type': 'text', 'text': '...your 5K token context...', 'cache_control': {'type': 'ephemeral'}}. Everything up to and including the block carrying the breakpoint is cached.
2. Cache hits require the prefix to match exactly (same text, same position). Changing any character before the breakpoint invalidates the cache, so put frequently changing content (like the user message) after the breakpoint.
3. Cache TTL is 5 minutes on Anthropic (ephemeral), and the TTL resets on each cache read, so steady traffic keeps the cache warm. The first request after an idle 5-minute window is a cache miss (full cost plus the write premium); subsequent requests within the window pay 10% of the normal input token cost.
4. Cache pricing on Claude: a cache write costs 25% more than regular input (a one-time premium), and cache reads cost 10% of the regular price. Break-even arrives on the second request; savings are roughly 78% amortized over 10 requests and approach 90% as request count grows.
5. Structure prompts for caching: tool definitions (static, cache) → system prompt (static, cache) → retrieved documents (semi-static, optionally cache) → conversation history (dynamic, don't cache) → user message (dynamic). This matches the order in which Anthropic builds the cached prefix.
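The ordering in step 5 can be sketched as a request builder. This is a hypothetical helper, not part of the SDK; the block shapes follow the Messages API, with a breakpoint on the last block of each static tier:

```python
def build_request(system_prompt, tool_defs, retrieved_docs, history, user_message):
    """Assemble a Messages API payload with cache breakpoints on the
    static tiers and the dynamic turn left uncached."""
    tools = [dict(t) for t in tool_defs]                      # copy before mutating
    if tools:
        tools[-1]['cache_control'] = {'type': 'ephemeral'}    # caches all tool defs
    system = [
        {'type': 'text', 'text': system_prompt,
         'cache_control': {'type': 'ephemeral'}},             # caches system prompt
        {'type': 'text', 'text': retrieved_docs,
         'cache_control': {'type': 'ephemeral'}},             # caches docs (semi-static)
    ]
    messages = [*history, {'role': 'user', 'content': user_message}]  # dynamic tail
    return {'tools': tools, 'system': system, 'messages': messages}
```

Keeping the dynamic turn out of every cached tier means a new user message never invalidates the tool, system, or document caches.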
Examples
import anthropic

client = anthropic.Anthropic()

# Large static context you want cached
SYSTEM_PROMPT = 'You are an expert AI pricing analyst...'
REFERENCE_DOCS = '... 10,000 tokens of pricing documentation ...'
user_query = 'How is Sonnet input priced?'  # example query, varies per request

response = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    system=[
        {
            'type': 'text',
            'text': SYSTEM_PROMPT + '\n\n' + REFERENCE_DOCS,
            'cache_control': {'type': 'ephemeral'}  # Cache this prefix
        }
    ],
    messages=[
        {'role': 'user', 'content': user_query}  # Changes per request — not cached
    ]
)

# Check cache performance
print(response.usage.cache_read_input_tokens)      # Tokens read from cache
print(response.usage.cache_creation_input_tokens)  # Tokens written to cache on this request

# Cache tool definitions separately from conversation: a breakpoint on the
# last tool caches the entire tool block
tools[-1]['cache_control'] = {'type': 'ephemeral'}

response = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=1024,
    tools=tools,  # 20 tools with detailed schemas = ~4K tokens
    system=system_prompt,
    messages=[
        # Cache everything up to the current turn
        *conversation_history,
        {
            'role': 'user',
            'content': [
                {
                    'type': 'text',
                    'text': previous_turns_text,
                    'cache_control': {'type': 'ephemeral'}  # Cache previous context
                },
                {'type': 'text', 'text': current_user_message}  # Don't cache current
            ]
        }
    ]
)
Common Mistakes
- ✗ Caching content that changes frequently — if you cache user-specific context that changes every request, you get cache misses every time and pay the write premium without benefit. Only cache truly static or semi-static content.
- ✗ Placing dynamic content before the cache breakpoint — the cache key includes all content before the breakpoint. Inserting the timestamp, user ID, or any variable before the breakpoint busts the cache on every request.
- ✗ Not measuring cache hit rate — Anthropic returns cache_read_input_tokens and cache_creation_input_tokens in usage. Monitor these. A hit rate below 70% suggests your cache prefix is changing too often.
- ✗ Using prompt caching as a substitute for prompt compression — caching reduces cost per token but doesn't reduce latency for cache misses. For latency-sensitive applications, compress the prompt too.
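The hit-rate check from the third mistake needs only the two usage counters. A minimal sketch, where `cache_hit_rate` is a hypothetical helper that takes usage records as plain dicts:

```python
def cache_hit_rate(usages):
    """Fraction of cacheable input tokens served from cache, across a
    batch of usage records carrying the two Messages API cache counters."""
    read = sum(u['cache_read_input_tokens'] for u in usages)
    written = sum(u['cache_creation_input_tokens'] for u in usages)
    total = read + written
    return read / total if total else 0.0

# One miss (a 5K-token cache write) followed by nine hits:
usages = [{'cache_read_input_tokens': 0, 'cache_creation_input_tokens': 5000}]
usages += [{'cache_read_input_tokens': 5000, 'cache_creation_input_tokens': 0}] * 9
print(cache_hit_rate(usages))  # 0.9, above the 70% alarm threshold
```

Logging this per endpoint makes prefix churn visible before the bill does.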
FAQ
How long does the prompt cache last?
Anthropic's ephemeral cache has a 5-minute TTL that resets each time the cached prefix is read, so an active session keeps the cache alive. After 5 minutes without a hit, the entry expires and the next request pays the write premium again. Anthropic also offers an extended 1-hour TTL at a higher write cost for low-traffic workloads. OpenAI's prompt caching has a longer TTL but is applied automatically without explicit control.
Does prompt caching work for all Claude models?
Prompt caching is supported on Claude 3 Haiku, Claude 3.5 Haiku, Claude 3 Opus, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. It's not available on older Claude 2 models. OpenAI offers automatic prompt caching on GPT-4o and GPT-4o-mini for prompts over 1024 tokens.
Can I cache multiple sections independently?
Anthropic supports up to 4 cache breakpoints per request, enabling multi-level caching. For example: [tool_definitions, cache_breakpoint] + [system_prompt, cache_breakpoint] + [retrieved_docs, cache_breakpoint] + [conversation_history, cache_breakpoint] + [user_message]. Each level caches independently, so swapping the retrieved docs still reuses the tool and system caches.
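Because the 4-breakpoint limit counts markers across tools, system, and messages together, a quick shape check before sending can catch overruns. A sketch using a hypothetical `count_breakpoints` helper over plain request dicts:

```python
def count_breakpoints(request):
    """Count cache_control markers anywhere in a request payload."""
    blocks = list(request.get('tools', [])) + list(request.get('system', []))
    for msg in request.get('messages', []):
        content = msg.get('content')
        if isinstance(content, list):  # plain-string content carries no markers
            blocks.extend(content)
    return sum(1 for b in blocks if isinstance(b, dict) and 'cache_control' in b)

request = {
    'tools': [{'name': 'search'},
              {'name': 'calc', 'cache_control': {'type': 'ephemeral'}}],
    'system': [{'type': 'text', 'text': 'sys',
                'cache_control': {'type': 'ephemeral'}}],
    'messages': [{'role': 'user', 'content': [
        {'type': 'text', 'text': 'docs', 'cache_control': {'type': 'ephemeral'}},
        {'type': 'text', 'text': 'history', 'cache_control': {'type': 'ephemeral'}},
        {'type': 'text', 'text': 'question'},
    ]}],
}
assert count_breakpoints(request) <= 4  # within Anthropic's per-request limit
```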
Is prompt caching the same as context window management?
No. Context window management decides what to include in the prompt. Prompt caching decides how to pay for what you include. Context management reduces what you send; prompt caching reduces the cost of sending it. Use both: context management to fit within the window, prompt caching to reduce the cost of the static portion.
Does caching work with streaming?
Yes — prompt caching is fully compatible with streaming. The cache hit/miss determination happens before streaming begins. You'll see cache_read_input_tokens and cache_creation_input_tokens in the final usage event of the stream.