
Prompt Caching & Cost Optimization: 90% Savings on Repetitive Prompts

Last updated: April 16, 2026

Quick answer

Use Anthropic's prompt caching (cache_control: ephemeral) for system prompts over 1,024 tokens (Anthropic's minimum cacheable size on Sonnet models; 2,048 on Haiku), RAG document context, and tool schema blocks. Cache hit rate depends on cache lifetime (5 minutes by default for Anthropic, extendable to 1 hour) and traffic patterns. At 1,000 requests/day with a 10K-token system prompt, caching saves $27/day, or $810/month, on Claude Sonnet 4. OpenAI caches automatically for prompts >1,024 tokens with no code changes required.

The problem

AI applications with long system prompts (RAG context, tool schemas, few-shot examples, persona instructions) pay full input token costs on every request — even when 80-90% of the prompt is identical across requests. At scale, a 10K-token system prompt on Claude Sonnet 4 costs $0.03 per request. With 100K requests/month, that's $3,000/month just for the static portion of prompts. Prompt caching converts this to a one-time cache write cost ($0.00375 per 1K tokens, a 1.25x premium) plus a 90%-discounted cache read cost ($0.0003 per 1K tokens vs the $0.003 per 1K standard input rate).
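The arithmetic above can be sketched as a small calculator. This is an illustrative helper (the function name and the 1K cold-write assumption are mine); the rates are the Claude Sonnet 4 list prices quoted in this article.

```typescript
// Claude Sonnet 4 list rates, $ per million input tokens (from this article).
const STANDARD_PER_M = 3.0;     // standard input
const CACHE_WRITE_PER_M = 3.75; // 1.25x write premium
const CACHE_READ_PER_M = 0.3;   // 90% discount

// Monthly cost of the STATIC prompt portion, with and without caching.
// cacheWritesPerMonth = cold starts (first request per expired TTL window).
function monthlyStaticPromptCost(
  requestsPerMonth: number,
  staticTokens: number,
  cacheWritesPerMonth: number,
): { withoutCaching: number; withCaching: number } {
  const M = 1_000_000;
  const withoutCaching = (requestsPerMonth * staticTokens * STANDARD_PER_M) / M;
  const writes = (cacheWritesPerMonth * staticTokens * CACHE_WRITE_PER_M) / M;
  const reads =
    ((requestsPerMonth - cacheWritesPerMonth) * staticTokens * CACHE_READ_PER_M) / M;
  return { withoutCaching, withCaching: writes + reads };
}

// 100K requests/mo, 10K-token static prompt, assuming 1K cold writes:
const c = monthlyStaticPromptCost(100_000, 10_000, 1_000);
console.log(c.withoutCaching); // 3000
console.log(c.withCaching);    // 334.5  ($37.50 writes + $297 reads)
```

At a warm cache, the static portion drops from $3,000/month to roughly $335/month — the ~90% input savings the headline refers to.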

Architecture

[Architecture diagram: an Incoming API Request (input) feeds a Prompt Assembler, which splits the request context into a Static Cache Prefix and a Semi-Static RAG Context Cache (both marked cache_control: ephemeral) plus a Dynamic Suffix (not cached). The assembled prompt (cached + live) goes through the Anthropic Prompt Cache to the LLM Response; a Cache Hit Rate Monitor consumes cache metrics and a Cost Analytics Dashboard (output) consumes token costs.]

Incoming API Request

Client request containing the variable portion of the prompt (user message, dynamic context) and a reference to the cacheable static portions (system prompt ID, document set ID). The architecture separates static from dynamic prompt components before the API call.

Alternatives: Direct API call (no cache optimization), Batch request (Anthropic Batch API), Pre-computed response cache (Redis)

Prompt Assembler

Builds the final prompt from static (cacheable) and dynamic (non-cacheable) components. Places cacheable content at the BEGINNING of the prompt — Anthropic caches from the start, so any dynamic content before cached content breaks the cache. Outputs the structured message array with cache_control markers.

Alternatives: Vercel AI SDK prompt building, LangChain PromptTemplate, LiteLLM with cache headers
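A minimal sketch of the assembler's output shape for Anthropic's Messages API, following the ordering rule above (static blocks first, each tagged with cache_control; dynamic user content last). The function and parameter names are illustrative, not a library API.

```typescript
// Content block shape used by Anthropic's Messages API.
type ContentBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Assemble: [static system prompt] -> [semi-static RAG docs] -> [dynamic user msg].
// Anything BEFORE a cache_control marker must be byte-identical across requests.
function assemblePrompt(
  systemPrompt: string, // static: persona, rules, output format
  ragContext: string,   // semi-static: cached per document set
  userMessage: string,  // dynamic: never cached
) {
  return {
    system: [
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
      { type: "text", text: ragContext, cache_control: { type: "ephemeral" } },
    ] as ContentBlock[],
    messages: [
      {
        role: "user",
        content: [{ type: "text", text: userMessage } as ContentBlock],
      },
    ],
  };
}

const req = assemblePrompt("You are...", "<contract text>", "What is clause 4?");
// Only the trailing user block lacks cache_control, so only it bills at $3/M.
```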

Static Cache Prefix

The unchanging portion of the prompt marked with cache_control. Includes: system instructions (persona, rules, output format), tool schemas (often 2-8K tokens for complex agent setups), static few-shot examples, and boilerplate context. Must be at least 1,024 tokens for Anthropic caching on Sonnet models (2,048 on Haiku) and at least 1,024 tokens for OpenAI auto-caching.

Alternatives: System prompt only, System + tool schemas, System + tools + few-shot examples + static RAG docs

Semi-Static RAG Context Cache

For RAG applications: the retrieved documents for a given document set or topic can be cached as a secondary cache prefix. If the same user asks multiple questions about the same document (e.g., a 50-page contract), cache the document content (marked with cache_control) and only send the changing question. This is the highest-ROI caching pattern for RAG.

Alternatives: Per-session cache prefix (Anthropic supports 4 cache breakpoints), Pre-indexed document embeddings only (no caching)

Dynamic Suffix (Not Cached)

The variable part: the user's current message, retrieved chunks for this specific query, conversation history for multi-turn. This goes AFTER all cache_control blocks. Any tokens here are charged at the standard input rate ($3/M for Claude Sonnet 4).

Alternatives: User message only, User message + dynamic retrieved context, User message + conversation history

Anthropic Prompt Cache

Anthropic's server-side KV cache for prefix tokens. Stores up to 4 cache breakpoints per request. Cache lifetime: 5 minutes by default, extendable to 1 hour via the ttl: "1h" option on cache_control (at a higher write premium). Cache writes cost $3.75/M tokens (a 1.25x write premium); reads cost $0.30/M tokens (a 90% discount from the $3/M standard rate). The write and read rates apply only to cached tokens.

Alternatives: OpenAI automatic prompt caching (>1K token prefix, no code changes), Custom semantic cache (Redis + similarity search), CDN-level response caching (for identical queries only)

Cache Hit Rate Monitor

Tracks the ratio of cache_read_input_tokens to total input tokens from Anthropic's usage response headers. Low hit rates (<50%) indicate cache misses due to short TTL, low request frequency, or dynamic content placed before cache markers. Alerts on hit rate drops.

Alternatives: Langfuse usage tracking, Anthropic API usage dashboard, DataDog LLM Observability
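The monitor's core metric can be sketched in a few lines. The field names match Anthropic's usage object as described in this article; the 50% alert threshold is the one suggested above.

```typescript
// Usage metadata returned by the Anthropic Messages API.
interface Usage {
  input_tokens: number;                // uncached dynamic tokens
  cache_read_input_tokens: number;     // tokens served from cache
  cache_creation_input_tokens: number; // tokens written to cache
}

// Fraction of all input tokens that were served from cache.
function cacheHitRate(u: Usage): number {
  const total =
    u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}

// Alert below the 50% threshold suggested above.
const shouldAlert = (u: Usage) => cacheHitRate(u) < 0.5;

// Healthy request: 10K-token prefix read from cache, 500 dynamic tokens.
const healthy = cacheHitRate({
  input_tokens: 500,
  cache_read_input_tokens: 10_000,
  cache_creation_input_tokens: 0,
}); // ~0.95 — no alert
console.log(healthy);
```

A persistent hit rate of exactly 0 is the signature of misplaced cache_control blocks rather than cold traffic.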

LLM Response

The model's output. Response tokens are not cached — only input tokens. Output costs remain at standard rates ($15/M for Claude Sonnet 4). The usage metadata in the response distinguishes cache_read_input_tokens vs cache_creation_input_tokens vs standard input_tokens.

Alternatives: claude-haiku-4 (lower quality, but caching savings are proportionally smaller), gpt-4o (OpenAI auto-caches at >1K tokens), gemini-2-flash (context caching at different pricing)

Cost Analytics Dashboard

Displays per-endpoint cost breakdown, cache hit rates, token distribution (cached vs non-cached), and cost savings vs baseline (no caching). Essential for demonstrating ROI of caching investment and identifying new caching opportunities.

Alternatives: Custom Grafana + Prometheus, Langfuse cost tracking, Braintrust token analytics

The stack

Primary Model: Claude Sonnet 4 (claude-sonnet-4-5) with prompt caching

Claude Sonnet 4 has the highest absolute savings from caching: standard input rate $3/M tokens, cached read rate $0.30/M (90% off). A 10K-token system prompt saves $0.027 per request on a cache hit. Compare: Claude Haiku 4 saves $0.0072 per request from the same cache (base rate $0.80/M, cached $0.08/M) — caching is 3.75x more valuable on Sonnet than Haiku in absolute terms.

Alternatives: Claude Haiku 4 (caching saves proportionally less due to lower base cost), GPT-4o (auto-caching, no code changes), Gemini 2.0 Flash (context caching with explicit TTL API)

Cache Key Design: Prefix-based (static content always at position 0 of the messages array)

Anthropic's cache is prefix-based: the prompt must start with identical content for a cache hit. Design your prompt structure as: [system prompt (static)] → [tool schemas (static)] → [few-shot examples (static)] → [RAG docs if semi-static] → [user messages (dynamic)]. Any deviation in the static prefix causes a cache miss and a cache write charge.

Alternatives: Hash-based (for reordered content), Semantic cache (Redis + embeddings, for paraphrased queries)

Cache Lifetime Management: Maximize request frequency to stay within the 5-minute TTL window

Anthropic's default cache TTL is 5 minutes, refreshed each time the cached prefix is read. For workloads sustaining at least ~1 request per 5 minutes, the cache stays warm indefinitely. For lower-traffic workloads, calculate the break-even: with a 1.25x write premium and a 0.10x read rate, caching wins when 1.25 + 0.10N < 1 + N, i.e. N > 0.28 — so a single cache hit per write already yields net savings. Only cache writes that expire unread cost you money.

Alternatives: Anthropic extended TTL (available on higher tiers), Application-level keep-alive (periodic no-op requests to refresh TTL), OpenAI auto-caching (1-hour TTL, simpler)
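The break-even inequality can be checked directly. This is a sketch in normalized units (standard input rate = 1.0), using the 1.25x write and 0.10x read ratios quoted above; the function name is mine.

```typescript
// Net savings fraction vs no caching, in normalized cost units where the
// standard input rate is 1.0 per request's worth of static tokens.
// One cache write (1.25x) is followed by `hitsPerWrite` cache reads (0.10x each).
function netSavingsRatio(hitsPerWrite: number): number {
  const withoutCache = 1 + hitsPerWrite;       // every request at 1.0x
  const withCache = 1.25 + 0.1 * hitsPerWrite; // one write + N reads
  return 1 - withCache / withoutCache;
}

console.log(netSavingsRatio(0)); // -0.25: a write with no hits loses 25%
console.log(netSavingsRatio(1)); //  0.325: one hit per write already nets ~32%
console.log(netSavingsRatio(9)); //  0.785: ten requests per window nets ~79%
```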

Cost Monitoring: Helicone (real-time LLM cost analytics with cache hit tracking)

Helicone captures `cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens` from every Anthropic response and displays a live cost dashboard. Free tier covers up to 100K requests/month. Without monitoring, teams don't know if caching is actually working — misplaced cache_control blocks or dynamic content before the cache marker are silent failures.

Alternatives: Custom Prometheus metrics (parse usage from Anthropic API response), Langfuse cost tracking, DataDog LLM Observability

Semantic Cache (Complementary): GPTCache or custom Redis + embedding cache for identical/near-identical queries

Prompt caching saves on the per-token LLM computation cost. A semantic cache (store query embedding → LLM response, check similarity at query time) saves the entire LLM call cost for repeated/similar queries. At 20-30% query repeat rate, a semantic cache on top of prompt caching reduces total LLM costs by another 20-30%. Combined savings: 60-80%.

Alternatives: Momento (managed semantic cache), Upstash Vector + Redis, LangChain SemanticCache

Batch API (Complementary): Anthropic Batch API for non-real-time workloads

For workloads where real-time response is not required (nightly document processing, bulk evaluations, data enrichment), the Anthropic Batch API provides a 50% cost reduction that stacks with prompt caching: use prompt caching for the static system prompt and the Batch API for the processing run. The static portion of a 10K-token system prompt across 100K batch requests (1B tokens) costs $300 at the cached read rate vs $3,000 at the standard rate — a 10x saving before the batch discount is applied.

Alternatives: OpenAI Batch API (50% discount), Scheduled jobs with Inngest or Temporal
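The stacking math for the static prefix, as a sketch. It assumes (per Anthropic's pricing model as described here) that the 50% batch discount applies to cached-read tokens as well; the helper name is illustrative.

```typescript
// Cost in USD for `tokens` input tokens at `ratePerM` dollars per million,
// optionally halved by the 50% batch discount.
const inputCostUSD = (tokens: number, ratePerM: number, batch: boolean): number =>
  (tokens * ratePerM * (batch ? 0.5 : 1)) / 1_000_000;

// 100K requests x 10K-token static prompt = 1B static tokens.
const prefixTokens = 100_000 * 10_000;

console.log(inputCostUSD(prefixTokens, 3.0, false)); // 3000 — real-time, uncached
console.log(inputCostUSD(prefixTokens, 0.3, false)); // 300  — real-time, cached read
console.log(inputCostUSD(prefixTokens, 0.3, true));  // 150  — batch + cached read
```

Combining both discounts takes the static portion from $3,000 to $150 — a 20x reduction on that slice of the bill.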

Cost at each scale

Prototype

5,000 requests/mo, 8K token system prompt

$85/mo

Cache writes (first request per 5-minute window): 1K writes × 8K tokens × $3.75/M = $30
Cache reads: 4K hits × 8K tokens × $0.30/M = $9.60
Dynamic tokens: 5K requests × 500-token user messages × $3/M = $7.50
Output tokens: 5K × 500 tokens × $15/M = $37.50
Helicone monitoring (free tier): $0
Note: vs ~$165/mo without caching ($127.50 input + $37.50 output) — ~63% savings on input tokens

Growth

100,000 requests/mo, 12K token system prompt + 8K RAG context

$1,820/mo

Cache writes (system prompt): 1K writes × 12K tokens × $3.75/M = $45
Cache reads (system prompt): 99K hits × 12K tokens × $0.30/M = $356
Cache writes (RAG context): 5K unique doc sets × 8K tokens × $3.75/M = $150
Cache reads (RAG context): 95K hits × 8K tokens × $0.30/M = $228
Dynamic tokens: 100K requests × 300-token user messages × $3/M = $90
Output tokens: 100K × 600 tokens × $15/M = $900
Helicone Pro ($20/mo) + infra: $50
Note: vs ~$6,990/mo without caching ($6,090 input + $900 output) — ~86% savings on input tokens

Scale

2M requests/mo, 15K token system prompt

$21,760/mo

Cache writes: 5K writes × 15K tokens × $3.75/M = $281
Cache reads: 1.995M hits × 15K tokens × $0.30/M = $8,978
Dynamic tokens: 2M requests × 400 avg tokens × $3/M = $2,400
Output tokens: 2M × 400 avg × $15/M, partially routed to Haiku = $9,600
Monitoring + infra overhead: $500
Note: vs ~$102K/mo without caching ($92,400 input + $9,600 output) — 87% savings on input tokens

Latency budget

Total P50: 800ms
Total P95: 2,000ms

Tradeoffs

Failure modes & guardrails

Dynamic content before the cache marker. The most common mistake: placing a timestamp, request ID, or personalized content before the cache_control block. Anthropic's cache is prefix-based — any change before the marker causes a full cache miss and a write charge. Mitigation: validate your prompt structure by checking the API response — `usage.cache_read_input_tokens` should be >0. If it is always 0, your cache_control placement is wrong.

Low-traffic routes that never hit the cache. Routes averaging fewer than ~1 request per 5-minute TTL window pay the write premium repeatedly without ever reading the cache — a net cost increase. Mitigation: identify these routes with your monitoring dashboard and disable caching for them (simply omit the cache_control block). Apply caching only to routes that sustain at least 1-2 requests per 5-minute window with a cacheable prefix above the minimum token threshold.

Stale prompts after deployment. You deploy a system prompt update, but the old cached version can serve requests for up to 5 minutes. For most use cases this is acceptable. Mitigation: for critical updates (safety rule changes, pricing changes), use a cache-busting strategy — include a version token in the prompt that changes on deployment (`# System v2.1.4`). New requests then no longer match the old prefix, so they bypass the stale entry immediately.
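The version-token cache-busting idea above fits in a few lines. The constant and function names are illustrative; the point is that bumping the version changes the first bytes of the prefix.

```typescript
// Bumped on every deploy (e.g. injected from CI); changing it alters the
// cached prefix, so stale cache entries can no longer be hit.
const PROMPT_VERSION = "2.1.4";

function versionedSystemPrompt(body: string): string {
  return `# System v${PROMPT_VERSION}\n${body}`;
}

// Identical prompt bodies under different versions produce different prefixes,
// which is exactly what forces a fresh cache write after deployment.
console.log(versionedSystemPrompt("Always cite sources."));
```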

OpenAI cache expiry during low-traffic hours. OpenAI auto-caches prompts >1,024 tokens, but cached prefixes are evicted after minutes of inactivity, so low-traffic hours cause cache expiry and the next request pays the full rate. Mitigation: monitor total input token costs hourly — a 2-3x spike during low-traffic hours indicates cache expiry is a significant cost driver. Schedule a warm-up request at the start of each hour for critical routes.


Frequently asked questions

How do I implement Anthropic prompt caching in TypeScript?

Add `cache_control: { type: 'ephemeral' }` to the content block(s) you want to cache. The cache marker applies to all tokens up to and including that block. Example: in your messages array, structure the system message as an array of content blocks: `[{ type: 'text', text: '<your long system prompt>', cache_control: { type: 'ephemeral' } }]`. All tokens in this block get cached. The next content block (user message) is not cached. Check `response.usage.cache_read_input_tokens` to verify cache hits.
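A sketch of the request body the answer describes. The object below is plain data you would pass to `anthropic.messages.create(...)` from `@anthropic-ai/sdk`; the SDK call itself is omitted, and the model ID and prompt text are placeholders.

```typescript
// Messages API request body: system prompt as a content-block array with
// cache_control; the user turn carries no marker and is never cached.
const body = {
  model: "claude-sonnet-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text" as const,
      // Must exceed the model's minimum cacheable size to be cached at all.
      text: "<your long system prompt>",
      cache_control: { type: "ephemeral" as const },
    },
  ],
  messages: [{ role: "user" as const, content: "What does clause 4 mean?" }],
};

// After calling anthropic.messages.create(body), verify caching worked:
// on the second request within the TTL, response.usage.cache_read_input_tokens > 0.
```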

What's the minimum prompt size where caching starts saving money?

The math: cache writes cost 1.25x the standard rate; cache reads cost 0.10x. Break-even: 1 write + N reads must total less than (N+1) × standard cost. Solve for N: 1.25 + 0.10N < N + 1 → N > 0.28. So a single cache read after one write already yields net savings. BUT Anthropic enforces a minimum cacheable size (1,024 tokens on Sonnet models, 2,048 on Haiku). The practical question is traffic volume: each read refreshes the 5-minute TTL, so even ~1 request per 5 minutes keeps the cache warm and profitable; only routes with gaps longer than the TTL keep re-paying the write premium.

Does OpenAI automatically cache my prompts?

Yes, OpenAI automatically caches prompts longer than 1,024 tokens with no code changes required. Cache hits are charged at 50% of the standard input rate (vs Anthropic's 90% discount). Cached prefixes expire after roughly 5-60 minutes depending on traffic. The caveat: OpenAI's caching is less controllable — there is no explicit cache_control marker, and you cannot choose what gets cached or for how long. As of 2025, cached token rates: GPT-4o input $2.50/M → $1.25/M cached; Claude Sonnet 4 $3/M → $0.30/M cached — Claude's cache discount is significantly larger.

Can I cache tool schemas to reduce costs in function-calling use cases?

Yes, and this is one of the highest-value caching targets. Complex agent setups with 10-20 tools can carry 4-8K tokens of tool schema definitions. These are completely static and change only on deployments. Place tool schemas in a cached content block. For Claude: add `cache_control` to the last tool in the tools array (which caches the entire tools prefix) or include the schemas in the system message. For a customer support agent with 15 tools (6K tokens), caching saves about $0.016 per request on Claude Sonnet 4 (6K × $2.70/M). At 10K requests/day, that's roughly $160/day in savings from tool schema caching alone.
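A sketch of the tools-array shape, following Anthropic's convention that a cache_control marker on the last tool caches everything up to and including it. The tool names and schemas are invented for illustration.

```typescript
// Tool definition shape for Anthropic's Messages API.
type Tool = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
  cache_control?: { type: "ephemeral" };
};

const tools: Tool[] = [
  {
    name: "lookup_order",
    description: "Fetch an order by ID",
    input_schema: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
  },
  // ...more static tools here...
  {
    name: "escalate_to_human",
    description: "Hand off to a support agent",
    input_schema: { type: "object", properties: {} },
    // Marking only the LAST tool caches the entire tools array prefix.
    cache_control: { type: "ephemeral" },
  },
];
```

Because schemas only change on deployment, this prefix stays byte-identical across every request and hits the cache on all but the first call per TTL window.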
