Reference Architecture · generation
Prompt Caching & Cost Optimization: 90% Savings on Repetitive Prompts
Last updated: April 16, 2026
Quick answer
Use Anthropic's prompt caching (cache_control: ephemeral) for system prompts above the minimum cacheable length (1,024 tokens on Claude Sonnet and Opus models; 2,048 on Haiku), RAG document context, and tool schema blocks. Cache hit rate depends on cache lifetime (5 minutes by default for Anthropic, extendable to 1 hour via the `ttl` option on cache_control) and traffic patterns. At 1,000 requests/day with a 10K-token system prompt, caching saves $27/day, or $810/month, on Claude Sonnet 4. OpenAI caches automatically for prompts >1,024 tokens with no code changes required.
The problem
AI applications with long system prompts (RAG context, tool schemas, few-shot examples, persona instructions) pay full input token costs on every request — even when 80-90% of the prompt is identical across requests. At scale, a 10K-token system prompt on Claude Sonnet 4 costs $0.03 per request. With 100K requests/month, that's $3,000/month just for the static portion of prompts. Prompt caching converts this to a one-time cache write cost ($0.00375 per 1K tokens, a 25% premium over the standard rate) plus a 90% cheaper cache read cost ($0.0003 per 1K cached vs $0.003 per 1K at the standard input rate).
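The arithmetic above can be sketched as a quick back-of-envelope helper (a simplified model that assumes a 100% cache hit rate and ignores the one-time write premium; the `monthlyCacheSavings` function is illustrative, not part of any SDK):

```typescript
// Per-million-token input rates for Claude Sonnet 4 (USD), from the figures above.
const STANDARD_INPUT_PER_M = 3.0;
const CACHED_READ_PER_M = 0.3;

// Monthly saving on the static prefix, assuming every request is a cache hit
// (the one-time write premium is ignored for simplicity).
function monthlyCacheSavings(staticTokens: number, requestsPerMonth: number): number {
  const savedPerRequest =
    (staticTokens / 1_000_000) * (STANDARD_INPUT_PER_M - CACHED_READ_PER_M);
  return savedPerRequest * requestsPerMonth;
}

// 10K-token prompt, 100K requests/month: saves 90% of the $3,000 static cost.
console.log(monthlyCacheSavings(10_000, 100_000)); // ≈ 2700
```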
Architecture
Incoming API Request
Client request containing the variable portion of the prompt (user message, dynamic context) and a reference to the cacheable static portions (system prompt ID, document set ID). The architecture separates static from dynamic prompt components before the API call.
Alternatives: Direct API call (no cache optimization), Batch request (Anthropic Batch API), Pre-computed response cache (Redis)
Prompt Assembler
Builds the final prompt from static (cacheable) and dynamic (non-cacheable) components. Places cacheable content at the BEGINNING of the prompt — Anthropic caches from the start, so any dynamic content before cached content breaks the cache. Outputs the structured message array with cache_control markers.
Alternatives: Vercel AI SDK prompt building, LangChain PromptTemplate, LiteLLM with cache headers
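A minimal assembler sketch of the static-first ordering described above (content-block shapes follow Anthropic's Messages API; the `assemblePrompt` helper and its inputs are hypothetical):

```typescript
type ContentBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

// Static parts go first and carry the cache marker; dynamic parts follow.
function assemblePrompt(staticPrefix: string, dynamicContext: string) {
  const system: ContentBlock[] = [
    { type: 'text', text: staticPrefix, cache_control: { type: 'ephemeral' } },
  ];
  const messages = [
    { role: 'user' as const, content: [{ type: 'text' as const, text: dynamicContext }] },
  ];
  return { system, messages };
}

const { system, messages } = assemblePrompt('<long system prompt>', 'What is our refund policy?');
// system[0] is cached; the user message is charged at the standard input rate.
```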
Static Cache Prefix
The unchanging portion of the prompt marked with cache_control. Includes: system instructions (persona, rules, output format), tool schemas (often 2-8K tokens for complex agent setups), static few-shot examples, and boilerplate context. Must meet the minimum cacheable length: 1,024 tokens on Claude Sonnet and Opus models (2,048 on Haiku) for Anthropic, and >1,024 tokens for OpenAI auto-caching.
Alternatives: System prompt only, System + tool schemas, System + tools + few-shot examples + static RAG docs
Semi-Static RAG Context Cache
For RAG applications: the retrieved documents for a given document set or topic can be cached as a secondary cache prefix. If the same user asks multiple questions about the same document (e.g., a 50-page contract), cache the document content (marked with cache_control) and only send the changing question. This is the highest-ROI caching pattern for RAG.
Alternatives: Per-session cache prefix (Anthropic supports 4 cache breakpoints), Pre-indexed document embeddings only (no caching)
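One way to sketch the two-breakpoint pattern (system instructions as one cached block, the contract text as a second; names and shapes are illustrative):

```typescript
type Block = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };

// Breakpoint 1: static system instructions. Breakpoint 2: the semi-static
// document. Only the question after them is charged at the standard rate.
function ragPrompt(instructions: string, documentText: string, question: string) {
  const system: Block[] = [
    { type: 'text', text: instructions, cache_control: { type: 'ephemeral' } },
    { type: 'text', text: documentText, cache_control: { type: 'ephemeral' } },
  ];
  return { system, messages: [{ role: 'user', content: question }] };
}

const p = ragPrompt('<rules>', '<50-page contract>', 'What is the termination clause?');
// Follow-up questions about the same contract reuse both cached blocks.
```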
Dynamic Suffix (Not Cached)
The variable part: the user's current message, retrieved chunks for this specific query, conversation history for multi-turn. This goes AFTER all cache_control blocks. Any tokens here are charged at the standard input rate ($3/M for Claude Sonnet 4).
Alternatives: User message only, User message + dynamic retrieved context, User message + conversation history
Anthropic Prompt Cache
Anthropic's server-side KV cache for prefix tokens. Stores up to 4 cache breakpoints per request. Cache lifetime: 5 minutes by default, extendable to 1 hour by setting a `ttl` on the cache_control block in newer API versions. Cache writes cost $3.75/M tokens (1.25x write premium); reads cost $0.30/M tokens (90% discount from $3/M standard). Read/write costs apply only to cached tokens.
Alternatives: OpenAI automatic prompt caching (>1K token prefix, no code changes), Custom semantic cache (Redis + similarity search), CDN-level response caching (for identical queries only)
Cache Hit Rate Monitor
Tracks the ratio of cache_read_input_tokens to total input tokens from Anthropic's usage response headers. Low hit rates (<50%) indicate cache misses due to short TTL, low request frequency, or dynamic content placed before cache markers. Alerts on hit rate drops.
Alternatives: Langfuse usage tracking, Anthropic API usage dashboard, DataDog LLM Observability
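A hit-rate calculation over Anthropic's usage fields might look like this (the field names match the usage object described later in this document; the 50% alert threshold mirrors the guidance above and is otherwise arbitrary):

```typescript
interface Usage {
  input_tokens: number;                // uncached input tokens
  cache_read_input_tokens: number;     // tokens served from cache
  cache_creation_input_tokens: number; // tokens written to cache this request
}

// Fraction of total input tokens that were served from cache.
function cacheHitRate(u: Usage): number {
  const total = u.input_tokens + u.cache_read_input_tokens + u.cache_creation_input_tokens;
  return total === 0 ? 0 : u.cache_read_input_tokens / total;
}

const rate = cacheHitRate({
  input_tokens: 500,
  cache_read_input_tokens: 9_500,
  cache_creation_input_tokens: 0,
});
if (rate < 0.5) console.warn(`Low cache hit rate: ${rate.toFixed(2)}`); // here rate is 0.95
```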
LLM Response
The model's output. Response tokens are not cached — only input tokens. Output costs remain at standard rates ($15/M for Claude Sonnet 4). The usage metadata in the response distinguishes cache_read_input_tokens vs cache_creation_input_tokens vs standard input_tokens.
Alternatives: claude-haiku-4 (lower quality, but caching savings are proportionally smaller), gpt-4o (OpenAI auto-caches at >1K tokens), gemini-2-flash (context caching at different pricing)
Cost Analytics Dashboard
Displays per-endpoint cost breakdown, cache hit rates, token distribution (cached vs non-cached), and cost savings vs baseline (no caching). Essential for demonstrating ROI of caching investment and identifying new caching opportunities.
Alternatives: Custom Grafana + Prometheus, Langfuse cost tracking, Braintrust token analytics
The stack
Claude Sonnet 4 has the highest absolute savings from caching: standard input rate $3/M tokens, cached read rate $0.30/M (90% off). A 10K-token system prompt saves $0.027 per request on a cache hit. Compare: Claude Haiku 4 saves $0.0072 per request from the same cache (base rate $0.80/M, cached $0.08/M) — caching is 3.75x more valuable on Sonnet than on Haiku in absolute terms.
Alternatives: Claude Haiku 4 (caching saves proportionally less due to lower base cost), GPT-4o (auto-caching, no code changes), Gemini 2.0 Flash (context caching with explicit TTL API)
Anthropic's cache is prefix-based: the prompt must start with identical content for a cache hit. Design your prompt structure as: [system prompt (static)] → [tool schemas (static)] → [few-shot examples (static)] → [RAG docs if semi-static] → [user messages (dynamic)]. Any deviation in the static prefix causes a cache miss and a cache write charge.
Alternatives: Hash-based (for reordered content), Semantic cache (Redis + embeddings, for paraphrased queries)
Anthropic's default cache TTL is 5 minutes, and the TTL refreshes on every cache read — so any route receiving at least one request per 5-minute window keeps the cache warm indefinitely. For lower-traffic workloads, calculate your break-even: cache_write_cost / (standard_rate − cache_read_rate). With a 1.25x write premium and 0.10x read rate, the extra write cost (0.25x) divided by the per-hit saving (0.90x) is ~0.28 — a single cache hit within the TTL window already yields net savings.
Alternatives: Anthropic extended TTL (available on higher tiers), Application-level keep-alive (periodic no-op requests to refresh TTL), OpenAI auto-caching (1-hour TTL, simpler)
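The break-even arithmetic above, written out (multipliers are relative to the standard input rate, per the pricing in this document; `breakEvenHits` is an illustrative helper):

```typescript
// Cache writes cost 1.25x the standard rate; reads cost 0.10x.
const WRITE_MULT = 1.25;
const READ_MULT = 0.10;

// Hits N needed before caching beats no caching:
//   1.25 + 0.10*N < 1 + N   =>   N > 0.25 / 0.90 ≈ 0.278
function breakEvenHits(): number {
  return (WRITE_MULT - 1) / (1 - READ_MULT);
}

console.log(breakEvenHits()); // ≈ 0.278 — under one hit, so a single hit nets savings
```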
Helicone captures `cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens` from every Anthropic response and displays a live cost dashboard. Free tier covers up to 100K requests/month. Without monitoring, teams don't know if caching is actually working — misplaced cache_control blocks or dynamic content before the cache marker are silent failures.
Alternatives: Custom Prometheus metrics (parse usage from Anthropic API response), Langfuse cost tracking, DataDog LLM Observability
Prompt caching saves on the per-token LLM computation cost. A semantic cache (store query embedding → LLM response, check similarity at query time) saves the entire LLM call cost for repeated/similar queries. At 20-30% query repeat rate, a semantic cache on top of prompt caching reduces total LLM costs by another 20-30%. Combined savings: 60-80%.
Alternatives: Momento (managed semantic cache), Upstash Vector + Redis, LangChain SemanticCache
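A toy in-memory semantic cache to illustrate the idea (cosine similarity over precomputed query embeddings; the `SemanticCache` class and 0.95 threshold are illustrative — production setups would back this with Redis or a vector store):

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];
  constructor(private threshold = 0.95) {}

  // Return a cached response if a stored query embedding is similar enough.
  get(embedding: number[]): string | undefined {
    const hit = this.entries.find(e => cosine(e.embedding, embedding) >= this.threshold);
    return hit?.response;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}

const cache = new SemanticCache();
cache.set([0.1, 0.9, 0.2], 'Refunds are processed within 5 business days.');
cache.get([0.1, 0.9, 0.2]); // similar query → cached answer, no LLM call at all
```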
For workloads where real-time response is not required (nightly document processing, bulk evaluations, data enrichment), Anthropic Batch API provides 50% cost reduction on top of prompt caching. Combine both: use prompt caching for the static system prompt + batch API for the processing run. A 10K-token system prompt across 100K batch requests is 1B static input tokens: roughly $300 at the cached-read rate (about $150 if the 50% batch discount also applies to cache reads) vs $3,000 at the standard rate — a 10-20x saving on the static portion alone.
Alternatives: OpenAI Batch API (50% discount), Scheduled jobs with Inngest or Temporal
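A quick check on combining the two discounts (assumes the 50% batch discount stacks on the cached-read rate, which should be verified against current Anthropic pricing; rates are from the stack section above):

```typescript
// Static-prefix cost of a 100K-request batch run with a 10K-token prompt.
const TOKENS = 10_000 * 100_000;     // 1B static input tokens
const STANDARD = 3.0 / 1_000_000;    // $/token, standard input
const CACHED_READ = 0.3 / 1_000_000; // $/token, cache read
const BATCH_DISCOUNT = 0.5;

const standardCost = TOKENS * STANDARD;                        // ≈ $3,000
const cachedBatchCost = TOKENS * CACHED_READ * BATCH_DISCOUNT; // ≈ $150

console.log(standardCost / cachedBatchCost); // ≈ 20x cheaper
```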
Cost at each scale
Prototype
5,000 requests/mo, 8K token system prompt
$55/mo
Growth
100,000 requests/mo, 12K token system prompt + 8K RAG context
$970/mo
Scale
2M requests/mo, 15K token system prompt
$14,000/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Dynamic content before the cache marker. The most common mistake is placing a timestamp, request ID, or personalized content BEFORE the cache_control block. Anthropic's cache is prefix-based — any change before the marker causes a full cache miss and a write charge. Mitigation: validate your prompt structure by checking the API response — `usage.cache_read_input_tokens` should be >0. If it's always 0, your cache_control placement is wrong.
Caching on low-traffic routes. Routes with too few requests per 5-minute TTL window write more cache entries than they read — a net cost increase. Mitigation: identify these routes with your monitoring dashboard and disable caching for them (simply omit the cache_control block). Apply caching only to routes where (requests_per_5min > 2) AND (cached_tokens > 2048).
Stale prompt versions during rollouts. You deploy a system prompt update, but instances still sending the old prompt keep hitting (and refreshing) the old cache entry for up to 5 minutes. For most use cases this is acceptable. Mitigation: for critical updates (safety rule changes, pricing changes), use a cache-busting strategy — include a version token in the prompt that changes on deployment (`# System v2.1.4`). The changed prefix guarantees no request can match the stale cache entry.
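The version-token pattern can be as simple as prepending a deploy identifier to the static prefix (the `PROMPT_VERSION` constant and helper are illustrative — bump the version on any change that must take effect immediately):

```typescript
// Bumping this on deploy changes the cache prefix, forcing a fresh cache write.
const PROMPT_VERSION = '2.1.4';

function versionedSystemPrompt(body: string): string {
  return `# System v${PROMPT_VERSION}\n${body}`;
}

const a = versionedSystemPrompt('Always cite sources.');
// A different version string yields a different prefix → guaranteed miss on old entries.
```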
OpenAI cache expiry during low traffic. OpenAI auto-caches prompts >1,024 tokens, but cached prefixes expire after minutes of inactivity, so low-traffic hours cause cache expiry and the next request pays the full rate. Mitigation: monitor `usage.prompt_tokens_details.cached_tokens` per response and total input token costs hourly — a 2-3x cost spike during low-traffic hours indicates cache expiry is a significant cost driver. For critical routes, schedule a warm-up request at the start of each hour.
Frequently asked questions
How do I implement Anthropic prompt caching in TypeScript?
Add `cache_control: { type: 'ephemeral' }` to the content block(s) you want to cache. The cache marker applies to all tokens up to and including that block. Example: in your messages array, structure the system message as an array of content blocks: `[{ type: 'text', text: '<your long system prompt>', cache_control: { type: 'ephemeral' } }]`. All tokens in this block get cached. The next content block (user message) is not cached. Check `response.usage.cache_read_input_tokens` to verify cache hits.
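Putting the answer above together as request parameters for the Messages API (a sketch: the model id is illustrative, and the actual call — e.g. `client.messages.create(params)` with `@anthropic-ai/sdk` — is omitted so the snippet stays self-contained):

```typescript
// Request body with the long system prompt marked for caching.
const params = {
  model: 'claude-sonnet-4', // illustrative model id; use your deployment's exact id
  max_tokens: 1024,
  system: [
    {
      type: 'text' as const,
      text: '<your long system prompt>', // must exceed the minimum cacheable length
      cache_control: { type: 'ephemeral' as const },
    },
  ],
  messages: [
    { role: 'user' as const, content: 'What is the termination clause?' }, // not cached
  ],
};

// After the call, verify caching works:
//   response.usage.cache_read_input_tokens > 0 on the second request.
```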
What's the minimum prompt size where caching starts saving money?
The math: cache writes cost 1.25x standard rate; cache reads cost 0.10x. Break-even: 1 write + N reads must total less than (N+1) × standard cost. Solve for N: 1.25 + 0.10N < N + 1 → N > 0.278. So a single cache read after the write already breaks even on cost. BUT Anthropic enforces a minimum cacheable length (1,024 tokens on Sonnet and Opus models; 2,048 on Haiku). The practical question is traffic volume: at 1 request/5min (exactly the TTL), you get 1 hit per write — already net positive. At 2+ requests/5min, caching clearly pays off.
Does OpenAI automatically cache my prompts?
Yes, OpenAI automatically caches prompts longer than 1,024 tokens with no code changes required. Cache hits are charged at 50% of the standard input rate (a smaller discount than Anthropic's 90%). Cached prefixes expire after roughly 5-60 minutes of inactivity depending on load. The caveat: OpenAI offers no explicit cache-control API — you can check per-request cache hits via `usage.prompt_tokens_details.cached_tokens`, but you can't pin or extend a cache entry. As of 2025, cached token rates: GPT-4o input $2.50/M → $1.25/M cached; Claude Sonnet 4 $3/M → $0.30/M cached — Claude's cache discount is significantly larger.
Can I cache tool schemas to reduce costs in function-calling use cases?
Yes, and this is one of the highest-value caching targets. Complex agent setups with 10-20 tools can have 4-8K tokens of tool schema definitions. These are completely static and change only on deployments. Place tool schemas in a cached content block. For Claude: add `cache_control` to the tools array or include schemas in the system message. For a customer support agent with 15 tools (6K tokens), caching saves about $0.016 per request on Claude Sonnet 4. At 10K requests/day, that's roughly $160/day in savings from tool schema caching alone.
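For Claude, a cached tools array can be sketched like this (a common pattern is to place `cache_control` on the last tool so all preceding definitions fall inside the cached prefix — verify this against current Anthropic docs; the schemas themselves are toy examples):

```typescript
type Tool = {
  name: string;
  description: string;
  input_schema: { type: 'object'; properties: Record<string, unknown> };
  cache_control?: { type: 'ephemeral' };
};

const tools: Tool[] = [
  {
    name: 'lookup_order',
    description: 'Fetch an order by ID',
    input_schema: { type: 'object', properties: { order_id: { type: 'string' } } },
  },
  {
    name: 'issue_refund',
    description: 'Refund an order',
    input_schema: { type: 'object', properties: { order_id: { type: 'string' } } },
    // Marking the LAST tool puts the whole tools array inside the cached prefix.
    cache_control: { type: 'ephemeral' },
  },
];
```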
Related
Architectures
End-to-End Fine-Tuning Pipeline: From Data to Deployment
A complete fine-tuning pipeline covering data collection, cleaning, formatting, LoRA training, evaluation, and...
Automated LLM Evaluation Harness: CI/CD for AI Quality
A production evaluation system for LLMs covering test dataset management, LLM-as-judge scoring, regression tes...
Token Streaming Pipeline: LLM to UI at Scale
Production architecture for streaming LLM tokens to web and mobile clients using SSE and WebSocket. Covers bac...
LLM Function Calling & Tool Use: Production Architecture
Production patterns for LLM tool use: schema design, parallel tool calls, error handling when tools fail, resu...
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day. Models, stac...
Customer Knowledge Base Chatbot
Reference architecture for a high-volume help-center chatbot over 10k support articles. Zendesk-style, cheap p...
Advanced RAG with Reranking: Two-Stage Retrieval for Production
Production RAG pipeline with two-stage retrieval: broad recall via hybrid dense+sparse search followed by prec...