Reference Architecture · generation
Prompt Caching & Cost Optimization: 90% Savings on Repetitive Prompts
Last updated: April 16, 2026
Quick answer
Use Anthropic's prompt caching (cache_control: ephemeral) for system prompts above the minimum cacheable length (1,024 tokens on Claude Sonnet and Opus models; 2,048 on Haiku), RAG document context, and tool schema blocks. Cache hit rate depends on cache lifetime (5 minutes by default for Anthropic, extendable to 1 hour via the `ttl` option on cache_control) and traffic patterns. At 1,000 requests/day with a 10K-token system prompt, caching saves $27/day, or $810/month, on Claude Sonnet 4. OpenAI caches automatically for prompts >1,024 tokens with no code changes required.
The problem
AI applications with long system prompts (RAG context, tool schemas, few-shot examples, persona instructions) pay full input token costs on every request — even when 80-90% of the prompt is identical across requests. At scale, a 10K-token system prompt on Claude Sonnet 4 costs $0.03 per request. With 100K requests/month, that's $3,000/month just for the static portion of prompts. Prompt caching converts this to a one-time cache write cost ($0.00375 per 1K tokens, a 25% premium over the standard rate) plus a 90% cheaper cache read cost ($0.0003 per 1K cached vs $0.003 per 1K at the standard input rate).
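The arithmetic above can be sketched as a quick back-of-envelope helper (a simplified model that assumes a 100% cache hit rate and ignores the one-time write premium; the `monthlyCacheSavings` function is illustrative, not part of any SDK):

```typescript
// Per-million-token input rates for Claude Sonnet 4 (USD), from the figures above.
const STANDARD_INPUT_PER_M = 3.0;
const CACHED_READ_PER_M = 0.3;

// Monthly saving on the static prefix, assuming every request is a cache hit
// (the one-time write premium is ignored for simplicity).
function monthlyCacheSavings(staticTokens: number, requestsPerMonth: number): number {
  const savedPerRequest =
    (staticTokens / 1_000_000) * (STANDARD_INPUT_PER_M - CACHED_READ_PER_M);
  return savedPerRequest * requestsPerMonth;
}

// 10K-token prompt, 100K requests/month: saves 90% of the $3,000 static cost.
console.log(monthlyCacheSavings(10_000, 100_000)); // ≈ 2700
```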
Architecture
Incoming API Request
Client request containing the variable portion of the prompt (user message, dynamic context) and a reference to the cacheable static portions (system prompt ID, document set ID). The architecture separates static from dynamic prompt components before the API call.
Alternatives: Direct API call (no cache optimization), Batch request (Anthropic Batch API), Pre-computed response cache (Redis)
Prompt Assembler
Builds the final prompt from static (cacheable) and dynamic (non-cacheable) components. Places cacheable content at the BEGINNING of the prompt — Anthropic caches from the start, so any dynamic content before cached content breaks the cache. Outputs the structured message array with cache_control markers.
Alternatives: Vercel AI SDK prompt building, LangChain PromptTemplate, LiteLLM with cache headers
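A minimal assembler sketch of the static-first ordering described above (content-block shapes follow Anthropic's Messages API; the `assemblePrompt` helper and its inputs are hypothetical):

```typescript
type ContentBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

// Static parts go first and carry the cache marker; dynamic parts follow.
function assemblePrompt(staticPrefix: string, dynamicContext: string) {
  const system: ContentBlock[] = [
    { type: 'text', text: staticPrefix, cache_control: { type: 'ephemeral' } },
  ];
  const messages = [
    { role: 'user' as const, content: [{ type: 'text' as const, text: dynamicContext }] },
  ];
  return { system, messages };
}

const { system, messages } = assemblePrompt('<long system prompt>', 'What is our refund policy?');
// system[0] is cached; the user message is charged at the standard input rate.
```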
Static Cache Prefix
The unchanging portion of the prompt marked with cache_control. Includes: system instructions (persona, rules, output format), tool schemas (often 2-8K tokens for complex agent setups), static few-shot examples, and boilerplate context. Must meet the minimum cacheable length: 1,024 tokens on Claude Sonnet and Opus models (2,048 on Haiku) for Anthropic, and >1,024 tokens for OpenAI auto-caching.
Alternatives: System prompt only, System + tool schemas, System + tools + few-shot examples + static RAG docs
Semi-Static RAG Context Cache
For RAG applications: the retrieved documents for a given document set or topic can be cached as a secondary cache prefix. If the same user asks multiple questions about the same document (e.g., a 50-page contract), cache the document content (marked with cache_control) and only send the changing question. This is the highest-ROI caching pattern for RAG.
Alternatives: Per-session cache prefix (Anthropic supports 4 cache breakpoints), Pre-indexed document embeddings only (no caching)
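One way to sketch the two-breakpoint pattern (system instructions as one cached block, the contract text as a second; names and shapes are illustrative):

```typescript
type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };

// Breakpoint 1: static system instructions. Breakpoint 2: the semi-static
// document. Only the question after them is charged at the standard rate.
function ragPrompt(instructions: string, documentText: string, question: string) {
  const system: Block[] = [
    { type: 'text', text: instructions, cache_control: { type: 'ephemeral' } },
    { type: 'text', text: documentText, cache_control: { type: 'ephemeral' } },
  ];
  return { system, messages: [{ role: 'user', content: question }] };
}

const p = ragPrompt('<rules>', '<50-page contract>', 'What is the termination clause?');
// Follow-up questions about the same contract reuse both cached blocks.
```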
Dynamic Suffix (Not Cached)
The variable part: the user's current message, retrieved chunks for this specific query, conversation history for multi-turn. This goes AFTER all cache_control blocks. Any tokens here are charged at the standard input rate ($3/M for Claude Sonnet 4).
Alternatives: User message only, User message + dynamic retrieved context, User message + conversation history
Anthropic Prompt Cache
Anthropic's server-side KV cache for prefix tokens. Stores up to 4 cache breakpoints per request. Cache lifetime: 5 minutes by default, extendable to 1 hour by setting a `ttl` on the cache_control block in newer API versions. Cache writes cost $3.75/M tokens (1.25x write premium); reads cost $0.30/M tokens (90% discount from $3/M standard). Read/write costs apply only to cached tokens.
Alternatives: OpenAI automatic prompt caching (>1K token prefix, no code changes), Custom semantic cache (Redis + similarity search), CDN-level response caching (for identical queries only)
Cache Hit Rate Monitor
Tracks the ratio of cache_read_input_tokens to total input tokens from Anthropic's usage response headers. Low hit rates (<50%) indicate cache misses due to short TTL, low request frequency, or dynamic content placed before cache markers. Alerts on hit rate drops.
Alternatives: Langfuse usage tracking, Anthropic API usage dashboard, DataDog LLM Observability
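A hit-rate calculation over Anthropic's usage fields might look like this (the field names match the usage object described later in this document; the 50% alert threshold mirrors the guidance above and is otherwise arbitrary):

```typescript
interface Usage {
  input_tokens: number;                // uncached input tokens
  cache_read_input_tokens: number;     // tokens served from cache
  cache_creation_input_tokens: number; // tokens written to cache this request
}

// Fraction of total input tokens that were served from cache.
function cacheHitRate(u: Usage): number {
  const total = u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}

const rate = cacheHitRate({
  input_tokens: 500,
  cache_read_input_tokens: 9_500,
  cache_creation_input_tokens: 0,
});
if (rate < 0.5) console.warn(`Low cache hit rate: ${rate.toFixed(2)}`); // here rate is 0.95
```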
LLM Response
The model's output. Response tokens are not cached — only input tokens. Output costs remain at standard rates ($15/M for Claude Sonnet 4). The usage metadata in the response distinguishes cache_read_input_tokens vs cache_creation_input_tokens vs standard input_tokens.
Alternatives: claude-haiku-4 (lower quality, but caching savings are proportionally smaller), gpt-4o (OpenAI auto-caches at >1K tokens), gemini-2-flash (context caching at different pricing)
Cost Analytics Dashboard
Displays per-endpoint cost breakdown, cache hit rates, token distribution (cached vs non-cached), and cost savings vs baseline (no caching). Essential for demonstrating ROI of caching investment and identifying new caching opportunities.
Alternatives: Custom Grafana + Prometheus, Langfuse cost tracking, Braintrust token analytics
The stack
Claude Sonnet 4 has the highest absolute savings from caching: standard input rate $3/M tokens, cached read rate $0.30/M (90% off). A 10K-token system prompt saves $0.027 per request on a cache hit. Compare: Claude Haiku 4 saves $0.0072 per request from the same cache (base rate $0.80/M, cached $0.08/M) — caching is 3.75x more valuable on Sonnet than on Haiku in absolute terms.
Alternatives: Claude Haiku 4 (caching saves proportionally less due to lower base cost), GPT-4o (auto-caching, no code changes), Gemini 2.0 Flash (context caching with explicit TTL API)
Anthropic's cache is prefix-based: the prompt must start with identical content for a cache hit. Design your prompt structure as: [system prompt (static)] → [tool schemas (static)] → [few-shot examples (static)] → [RAG docs if semi-static] → [user messages (dynamic)]. Any deviation in the static prefix causes a cache miss and a cache write charge.
Alternatives: Hash-based (for reordered content), Semantic cache (Redis + embeddings, for paraphrased queries)
Anthropic's default cache TTL is 5 minutes, and the TTL refreshes on every cache read — so any route receiving at least one request per 5-minute window keeps the cache warm indefinitely. For lower-traffic workloads, calculate your break-even: cache_write_cost / (standard_rate − cache_read_rate). With a 1.25x write premium and 0.10x read rate, the extra write cost (0.25x) divided by the per-hit saving (0.90x) is ~0.28 — a single cache hit within the TTL window already yields net savings.
Alternatives: Anthropic extended TTL (available on higher tiers), Application-level keep-alive (periodic no-op requests to refresh TTL), OpenAI auto-caching (1-hour TTL, simpler)
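The break-even arithmetic above, written out (multipliers are relative to the standard input rate, per the pricing in this document; `breakEvenHits` is an illustrative helper):

```typescript
// Cache writes cost 1.25x the standard rate; reads cost 0.10x.
const WRITE_MULT = 1.25;
const READ_MULT = 0.10;

// Hits N needed before caching beats no caching:
//   1.25 + 0.10*N < 1 + N   =>   N > 0.25 / 0.90 ≈ 0.278
function breakEvenHits(): number {
  return (WRITE_MULT - 1) / (1 - READ_MULT);
}

console.log(breakEvenHits()); // ≈ 0.278 — under one hit, so a single hit nets savings
```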
Helicone captures `cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens` from every Anthropic response and displays a live cost dashboard. Free tier covers up to 100K requests/month. Without monitoring, teams don't know if caching is actually working — misplaced cache_control blocks or dynamic content before the cache marker are silent failures.
Alternatives: Custom Prometheus metrics (parse usage from Anthropic API response), Langfuse cost tracking, DataDog LLM Observability
Prompt caching saves on the per-token LLM computation cost. A semantic cache (store query embedding → LLM response, check similarity at query time) saves the entire LLM call cost for repeated/similar queries. At 20-30% query repeat rate, a semantic cache on top of prompt caching reduces total LLM costs by another 20-30%. Combined savings: 60-80%.
Alternatives: Momento (managed semantic cache), Upstash Vector + Redis, LangChain SemanticCache
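A toy in-memory semantic cache to illustrate the idea (cosine similarity over precomputed query embeddings; the `SemanticCache` class and 0.95 threshold are illustrative — production setups would back this with Redis or a vector store):

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];
  constructor(private threshold = 0.95) {}

  // Return a cached response if a stored query embedding is similar enough.
  get(embedding: number[]): string | undefined {
    const hit = this.entries.find(e => cosine(e.embedding, embedding) >= this.threshold);
    return hit?.response;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}

const cache = new SemanticCache();
cache.set([0.1, 0.9, 0.2], 'Refunds are processed within 5 business days.');
cache.get([0.1, 0.9, 0.2]); // similar query → cached answer, no LLM call at all
```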
For workloads where real-time response is not required (nightly document processing, bulk evaluations, data enrichment), Anthropic Batch API provides 50% cost reduction on top of prompt caching. Combine both: use prompt caching for the static system prompt + batch API for the processing run. A 10K-token system prompt across 100K batch requests is 1B static input tokens: roughly $300 at the cached-read rate (about $150 if the 50% batch discount also applies to cache reads) vs $3,000 at the standard rate — a 10-20x saving on the static portion alone.
Alternatives: OpenAI Batch API (50% discount), Scheduled jobs with Inngest or Temporal
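A quick check on combining the two discounts (assumes the 50% batch discount stacks on the cached-read rate, which should be verified against current Anthropic pricing; rates are from the stack section above):

```typescript
// Static-prefix cost of a 100K-request batch run with a 10K-token prompt.
const TOKENS = 10_000 * 100_000;     // 1B static input tokens
const STANDARD = 3.0 / 1_000_000;    // $/token, standard input
const CACHED_READ = 0.3 / 1_000_000; // $/token, cache read
const BATCH_DISCOUNT = 0.5;

const standardCost = TOKENS * STANDARD;                        // ≈ $3,000
const cachedBatchCost = TOKENS * CACHED_READ * BATCH_DISCOUNT; // ≈ $150

console.log(standardCost / cachedBatchCost); // ≈ 20x cheaper
```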
Cost at each scale
Prototype
5,000 requests/mo, 8K token system prompt
$55/mo
Growth
100,000 requests/mo, 12K token system prompt + 8K RAG context
$970/mo
Scale
2M requests/mo, 15K token system prompt
$14,000/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Dynamic content before the cache marker. The most common mistake is placing a timestamp, request ID, or personalized content BEFORE the cache_control block. Anthropic's cache is prefix-based — any change before the marker causes a full cache miss and a write charge. Mitigation: validate your prompt structure by checking the API response — `usage.cache_read_input_tokens` should be >0. If it's always 0, your cache_control placement is wrong.
Caching on low-traffic routes. Routes with too few requests per 5-minute TTL window write more cache entries than they read — a net cost increase. Mitigation: identify these routes with your monitoring dashboard and disable caching for them (simply omit the cache_control block). Apply caching only to routes where (requests_per_5min > 2) AND (cached_tokens > 2048).
Stale prompt versions during rollouts. You deploy a system prompt update, but instances still sending the old prompt keep hitting (and refreshing) the old cache entry for up to 5 minutes. For most use cases this is acceptable. Mitigation: for critical updates (safety rule changes, pricing changes), use a cache-busting strategy — include a version token in the prompt that changes on deployment (`# System v2.1.4`). The changed prefix guarantees no request can match the stale cache entry.
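The version-token pattern can be as simple as prepending a deploy identifier to the static prefix (the `PROMPT_VERSION` constant and helper are illustrative — bump the version on any change that must take effect immediately):

```typescript
// Bumping this on deploy changes the cache prefix, forcing a fresh cache write.
const PROMPT_VERSION = '2.1.4';

function versionedSystemPrompt(body: string): string {
  return `# System v${PROMPT_VERSION}\n${body}`;
}

const a = versionedSystemPrompt('Always cite sources.');
// A different version string yields a different prefix → guaranteed miss on old entries.
```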
OpenAI cache expiry during low traffic. OpenAI auto-caches prompts >1,024 tokens, but cached prefixes expire after minutes of inactivity, so low-traffic hours cause cache expiry and the next request pays the full rate. Mitigation: monitor `usage.prompt_tokens_details.cached_tokens` per response and total input token costs hourly — a 2-3x cost spike during low-traffic hours indicates cache expiry is a significant cost driver. For critical routes, schedule a warm-up request at the start of each hour.
Frequently asked questions
How do I implement Anthropic prompt caching in TypeScript?
Add `cache_control: { type: 'ephemeral' }` to the content block(s) you want to cache. The cache marker applies to all tokens up to and including that block. Example: in your messages array, structure the system message as an array of content blocks: `[{ type: 'text', text: '<your long system prompt>', cache_control: { type: 'ephemeral' } }]`. All tokens in this block get cached. The next content block (user message) is not cached. Check `response.usage.cache_read_input_tokens` to verify cache hits.
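Putting the answer above together as request parameters for the Messages API (a sketch: the model id is illustrative, and the actual call — e.g. `client.messages.create(params)` with `@anthropic-ai/sdk` — is omitted so the snippet stays self-contained):

```typescript
// Request body with the long system prompt marked for caching.
const params = {
  model: 'claude-sonnet-4', // illustrative model id; use your deployment's exact id
  max_tokens: 1024,
  system: [
    {
      type: 'text' as const,
      text: '<your long system prompt>', // must exceed the minimum cacheable length
      cache_control: { type: 'ephemeral' as const },
    },
  ],
  messages: [
    { role: 'user' as const, content: 'What is the termination clause?' }, // not cached
  ],
};

// After the call, verify caching works:
//   response.usage.cache_read_input_tokens > 0 on the second request.
```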
What's the minimum prompt size where caching starts saving money?
The math: cache writes cost 1.25x standard rate; cache reads cost 0.10x. Break-even: 1 write + N reads must total less than (N+1) × standard cost. Solve for N: 1.25 + 0.10N < N + 1 → N > 0.278. So a single cache read after the write already breaks even on cost. BUT Anthropic enforces a minimum cacheable length (1,024 tokens on Sonnet and Opus models; 2,048 on Haiku). The practical question is traffic volume: at 1 request/5min (exactly the TTL), you get 1 hit per write — already net positive. At 2+ requests/5min, caching clearly pays off.
Does OpenAI automatically cache my prompts?
Yes, OpenAI automatically caches prompts longer than 1,024 tokens with no code changes required. Cache hits are charged at 50% of the standard input rate (a smaller discount than Anthropic's 90%). Cached prefixes expire after roughly 5-60 minutes of inactivity depending on load. The caveat: OpenAI offers no explicit cache-control API — you can check per-request cache hits via `usage.prompt_tokens_details.cached_tokens`, but you can't pin or extend a cache entry. As of 2025, cached token rates: GPT-4o input $2.50/M → $1.25/M cached; Claude Sonnet 4 $3/M → $0.30/M cached — Claude's cache discount is significantly larger.
Can I cache tool schemas to reduce costs in function-calling use cases?
Yes, and this is one of the highest-value caching targets. Complex agent setups with 10-20 tools can have 4-8K tokens of tool schema definitions. These are completely static and change only on deployments. Place tool schemas in a cached content block. For Claude: add `cache_control` to the tools array or include schemas in the system message. For a customer support agent with 15 tools (6K tokens), caching saves about $0.016 per request on Claude Sonnet 4. At 10K requests/day, that's roughly $160/day in savings from tool schema caching alone.
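For Claude, a cached tools array can be sketched like this (a common pattern is to place `cache_control` on the last tool so all preceding definitions fall inside the cached prefix — verify this against current Anthropic docs; the schemas themselves are toy examples):

```typescript
type Tool = {
  name: string;
  description: string;
  input_schema: { type: 'object'; properties: Record<string, unknown> };
  cache_control?: { type: 'ephemeral' };
};

const tools: Tool[] = [
  {
    name: 'lookup_order',
    description: 'Fetch an order by ID',
    input_schema: { type: 'object', properties: { order_id: { type: 'string' } } },
  },
  {
    name: 'issue_refund',
    description: 'Refund an order',
    input_schema: { type: 'object', properties: { order_id: { type: 'string' } } },
    // Marking the LAST tool puts the whole tools array inside the cached prefix.
    cache_control: { type: 'ephemeral' },
  },
];
```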
Related
Architectures
End-to-End Fine-Tuning Pipeline: From Data to Deployment
A complete fine-tuning pipeline covering data collection, cleaning, formatting, LoRA training, evaluation, and...
Automated LLM Evaluation Harness: CI/CD for AI Quality
A production evaluation system for LLMs covering test dataset management, LLM-as-judge scoring, regression tes...
Token Streaming Pipeline: LLM to UI at Scale
Production architecture for streaming LLM tokens to web and mobile clients using SSE and WebSocket. Covers bac...
LLM Function Calling & Tool Use: Production Architecture
Production patterns for LLM tool use: schema design, parallel tool calls, error handling when tools fail, resu...
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day. Models, stac...
Customer Knowledge Base Chatbot
Reference architecture for a high-volume help-center chatbot over 10k support articles. Zendesk-style, cheap p...
Advanced RAG with Reranking: Two-Stage Retrieval for Production
Production RAG pipeline with two-stage retrieval: broad recall via hybrid dense+sparse search followed by prec...