Tags: prompt-caching, anthropic, openai, cost-optimization

Hidden Cost of LLM Caching: Anthropic vs OpenAI 2026


By Aniket Nigam. Published 2026-04-15.

Quick answer

Anthropic prompt caching charges $0.30 per million cached input tokens versus $3.00 per million for fresh input on Claude Sonnet 4.5. That is a 90% discount on cache reads. OpenAI caches GPT-4o input at $1.25 per million versus $2.50 fresh, a 50% discount. Anthropic wins on cache-heavy workloads. OpenAI wins when your prompts change too often to cache.

Why this post exists

Provider blog posts about prompt caching pitch "up to 90% savings" without showing the math. The 90% is real but only hits if your workload fits a narrow shape. For about half the workloads I have audited, caching actually costs more once you add the write premium.

I pulled invoices from 6 production apps running on both Anthropic and OpenAI. Three of those six would save money moving everything to Anthropic cache; two would save on OpenAI; one would not benefit from caching at all and was better off with raw input billing.

Table of contents

  1. How prompt caching actually works
  2. Anthropic cache mechanics and pricing
  3. OpenAI cache mechanics and pricing
  4. The write premium that catches teams off guard
  5. Workload shape 1: chatbot with 5K system prompt
  6. Workload shape 2: code review with 30K context
  7. Workload shape 3: document Q&A with 100K context
  8. When caching makes things worse
  9. Combining caching with batch API

1. How prompt caching actually works

The provider stores the KV cache for a prefix of your prompt and reuses it across subsequent calls. On a hit, they skip the forward pass for the cached tokens. That is why the discount is real: the inference cost of those tokens was zero on the second call.

Caching is not free to set up. Anthropic charges a write premium on the first call that creates the cache: 1.25x the base input rate. OpenAI charges no write premium, but requires the prompt prefix to match a previous request exactly before you get a hit.

The hit rules differ. Anthropic caches on explicit cache_control breakpoints you place in the prompt. OpenAI caches automatically on prompts over 1,024 tokens that match a previous prompt byte-for-byte at the prefix.

TTL also differs. Anthropic defaults to 5 minutes and offers a 1-hour tier at 2x the write premium. OpenAI keeps the cache for 5-10 minutes and has no paid extension.

2. Anthropic cache mechanics

Here is the shape of a cache-enabled request:

import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": BIG_SYSTEM_PROMPT,  # must be >= 1,024 tokens on Sonnet 4.5 to cache
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": user_query}],
)

The cache_control breakpoint tells Anthropic where to checkpoint: everything before the breakpoint is cacheable, and you can place up to 4 breakpoints per request.

Per million tokens on Sonnet 4.5 in April 2026:

Category              Price
--------------------  -------------
Fresh input           $3.00
Cache write (5 min)   $3.75 (1.25x)
Cache write (1 hour)  $6.00 (2x)
Cache read            $0.30 (0.1x)
Output                $15.00

The minimum cacheable prefix is 1,024 tokens on Sonnet 4.5 and 2,048 on Opus 4. Prompts below those thresholds cannot cache at all.

3. OpenAI cache mechanics

OpenAI caching is implicit. You send the same prefix, they hit the cache:

from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-2024-11-20",
    messages=[
        # Static prefix first: any byte change here breaks the prefix match.
        {"role": "system", "content": BIG_SYSTEM_PROMPT},
        {"role": "user", "content": user_query},
    ],
)
# Check resp.usage.prompt_tokens_details.cached_tokens for cache hits

OpenAI does not charge a write premium. The first call bills at the fresh rate, later hits get the cache rate. The hit rule is byte-identical prefix matching in 1,024-token chunks.

Per million tokens on GPT-4o in April 2026:

Category     Price
-----------  ------
Fresh input  $2.50
Cache read   $1.25
Output       $10.00

The minimum prompt length to be eligible is 1,024 tokens. Anything under that never caches.

4. The write premium

Anthropic's 1.25x write premium matters more than teams expect.

If your system prompt is 10,000 tokens and gets read 10 times before the 5-minute TTL expires, you pay:

  • 1 write at $3.75/M: $0.0375
  • 10 reads at $0.30/M: $0.030
  • Total: $0.0675

Compare to no caching: 11 fresh calls at $3.00/M = $0.330. Caching saves 80%.

Now say your cache only gets hit 2 times before it expires:

  • 1 write at $3.75/M: $0.0375
  • 2 reads at $0.30/M: $0.006
  • Total: $0.0435

Without caching: 3 fresh calls at $3.00/M = $0.090. Caching still wins by 52%.

Break-even is under one hit: the first cache read already more than covers the 25% write premium. If the cache expires with zero hits, you paid 25% extra for nothing. In production I target a hit-to-write ratio of 5x or better and alert when it drops below 2x.
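That arithmetic generalizes to a two-line cost model. Here is a sketch using the Sonnet 4.5 rates from the table above; the function names are mine, not part of any SDK:

```python
def cached_cost(prefix_tokens, hits, write_rate=3.75, read_rate=0.30):
    """Dollars for one cache write plus `hits` cache reads of the prefix."""
    per_m = prefix_tokens / 1_000_000
    return per_m * write_rate + hits * per_m * read_rate

def uncached_cost(prefix_tokens, calls, input_rate=3.00):
    """Dollars for sending the same prefix fresh on every call."""
    return calls * prefix_tokens / 1_000_000 * input_rate

# 10K-token prompt, 10 hits inside the TTL:
# cached_cost(10_000, 10) -> 0.0675 vs uncached_cost(10_000, 11) -> 0.33
```

Plug in your own traffic numbers before enabling caching; the answer flips as soon as hits per write approaches zero.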

5. Workload shape 1: chatbot with 5K system prompt

Assume 500,000 messages per month, each with a 5K token system prompt and 1K of user input. Responses average 500 tokens.

No caching, fresh every time:

  • Input: 500K x 6K = 3B tokens
  • Output: 500K x 500 = 250M tokens
  • OpenAI cost: 3B x $2.50 + 250M x $10.00 = $10,000
  • Anthropic cost: 3B x $3.00 + 250M x $15.00 = $12,750

Anthropic with caching (assume 85% of users send within 5 min of the last message on that shard, giving 85% hit rate on the 5K system prompt):

  • System prompt writes: 75K x 5K x $3.75/M = $1,406
  • System prompt reads: 425K x 5K x $0.30/M = $638
  • User input: 500K x 1K x $3.00/M = $1,500
  • Output: 250M x $15.00/M = $3,750
  • Total: $7,294

OpenAI with caching (same 85% hit assumption):

  • System prompt fresh: 75K x 5K x $2.50/M = $938
  • System prompt cached: 425K x 5K x $1.25/M = $2,656
  • User input: 500K x 1K x $2.50/M = $1,250
  • Output: 250M x $10.00/M = $2,500
  • Total: $7,344

Dead heat on total cost: Anthropic wins by $50. Against its own uncached bill, Anthropic saves 43% and OpenAI saves 27%. At this shape the choice comes down to model preference, not caching economics.
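Both cached totals fall out of one formula with different rates plugged in. A sketch that reproduces them (the helper is mine; OpenAI's "write" rate is just its fresh rate, since there is no premium):

```python
def chatbot_monthly(msgs, sys_tok, user_tok, out_tok, hit_rate,
                    fresh, write, read, out_rate):
    """Monthly dollars for a chatbot that caches only the system prompt."""
    hits = msgs * hit_rate
    misses = msgs - hits
    return (misses * sys_tok * write      # cache writes (fresh rate on OpenAI)
            + hits * sys_tok * read       # cache reads
            + msgs * user_tok * fresh     # user input is never cached
            + msgs * out_tok * out_rate) / 1_000_000

anthropic = chatbot_monthly(500_000, 5_000, 1_000, 500, 0.85,
                            fresh=3.00, write=3.75, read=0.30, out_rate=15.00)
openai = chatbot_monthly(500_000, 5_000, 1_000, 500, 0.85,
                         fresh=2.50, write=2.50, read=1.25, out_rate=10.00)
# anthropic -> about 7294, openai -> about 7344
```

Swap in your own hit rate; the $50 gap widens fast as the hit rate climbs past 85%.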

6. Workload shape 2: code review with 30K context

Assume 10,000 code review runs per month. Each run loads a 30K repository context and asks 5 questions in sequence (within the 5-minute TTL). Output is 2K per question.

Anthropic with caching:

  • Context writes: 10K x 30K x $3.75/M = $1,125
  • Context reads: 10K x 4 questions x 30K x $0.30/M = $360
  • Output: 10K x 5 x 2K = 100M tokens x $15.00/M = $1,500
  • Total: $2,985

OpenAI with caching:

  • Context fresh: 10K x 30K x $2.50/M = $750
  • Context cached: 10K x 4 x 30K x $1.25/M = $1,500
  • Output: 100M x $10.00/M = $1,000
  • Total: $3,250

Uncached on Anthropic:

  • Context: 10K x 5 x 30K x $3.00/M = $4,500
  • Output: $1,500
  • Total: $6,000

Anthropic with caching beats OpenAI with caching by 8% here. The 90% cache discount compounds on the large context. If your agentic workload re-reads the same big context 5+ times, Anthropic is the right choice.
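The per-run math here is one write plus four reads. A sketch reproducing both totals (the helper name is mine):

```python
def review_run_cost(ctx_tok, questions, out_tok, write, read, out_rate):
    """Dollars per run: one context write, then cache reads for each follow-up."""
    return (ctx_tok * write
            + (questions - 1) * ctx_tok * read
            + questions * out_tok * out_rate) / 1_000_000

# 10,000 runs/month, 30K context, 5 questions, 2K output each
anthropic = 10_000 * review_run_cost(30_000, 5, 2_000, 3.75, 0.30, 15.00)
openai = 10_000 * review_run_cost(30_000, 5, 2_000, 2.50, 1.25, 10.00)
# anthropic -> about 2985, openai -> about 3250
```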

7. Workload shape 3: document Q&A with 100K context

Assume 2,000 document sessions per month. Each session loads a 100K document and asks 10 questions across a 45-minute span. Output 1K per answer.

With 10 questions spread across 45 minutes, each follow-up arrives roughly 5 minutes after the last, so the default 5-minute cache expires before each of the 9 follow-ups. To keep the cache warm, use the 1-hour Anthropic tier at 2x the write rate.

Anthropic with 1-hour cache:

  • Context writes: 2K x 100K x $6.00/M = $1,200
  • Context reads: 2K x 9 x 100K x $0.30/M = $540
  • Output: 2K x 10 x 1K = 20M tokens x $15.00/M = $300
  • Total: $2,040

OpenAI with 5-min cache (expires 9 times):

  • Context fresh (no hit survives to the next question): 2K x 10 x 100K x $2.50/M = $5,000
  • Output: 20M x $10.00/M = $200
  • Total: $5,200

Uncached on Anthropic:

  • Context: 2K x 10 x 100K x $3.00/M = $6,000
  • Output: $300
  • Total: $6,300

Anthropic's 1-hour cache beats everything else by 2.5x or more. This is the workload where caching economics dominate.
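You can sanity-check the tier choice on context cost alone. A sketch, assuming the 5-minute tier re-writes on every question while the 1-hour tier pays one 2x write and then cheap reads (rates from the section 2 table):

```python
def five_min_tier(ctx_tok, questions, write=3.75):
    """Context dollars per session when every question re-writes the cache."""
    return questions * ctx_tok * write / 1_000_000

def one_hour_tier(ctx_tok, questions, write=6.00, read=0.30):
    """Context dollars per session: one 2x write, then cache reads."""
    return (ctx_tok * write + (questions - 1) * ctx_tok * read) / 1_000_000

# 100K context, 10 questions spaced ~5 minutes apart:
# five_min_tier(100_000, 10) -> $3.75/session
# one_hour_tier(100_000, 10) -> $0.87/session
```

The 1-hour tier wins whenever the session outlives the default TTL by more than one question.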

8. When caching makes things worse

Three patterns I have seen where caching loses money:

Unique prompts per call. If every user query starts with a slightly different system prompt (user name, locale, timestamp), the cache never hits. You pay the write premium every call. Fix: move the varying parts out of the cached prefix.

Low call volume per window. If your app gets one call every 30 minutes, the cache always expires before the next hit. Pure write cost with no read savings. Fix: skip caching below 5 calls per 5-minute window.

Small prompts. Anything under 1,024 tokens cannot cache on either provider. Fix: pad the prefix with examples or skip caching on short prompts.
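The last two fixes collapse into one guard. A sketch of the rule of thumb from this section; the thresholds are my defaults, tune them per workload:

```python
def should_cache(prefix_tokens, calls_per_window,
                 min_tokens=1024, min_calls=5):
    """Cache only when the prefix is big enough to be eligible and the
    call rate gives the write a realistic chance of being read back."""
    return prefix_tokens >= min_tokens and calls_per_window >= min_calls
```

For example, should_cache(5000, 12) passes, while an 800-token prompt or a once-per-half-hour endpoint fails the guard and bills as plain input.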

9. Combining caching with batch API

Anthropic Message Batches applies its 50% batch discount on top of cache reads. The discounts stack: a cached read in batch mode on Sonnet 4.5 costs $0.15 per million tokens.

OpenAI Batch applies the 50% discount but does not layer on cache reads (batch prompts go through a separate pipeline that does not share the same cache pool).

If you run a nightly scoring job over 50M tokens of shared context, Anthropic batch with caching is the cheapest frontier option in 2026.
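The stacking is plain multiplication. A sanity check on the $0.15 figure, assuming the batch discount applies on top of the cache-read rate as described above:

```python
base_input = 3.00                 # Sonnet 4.5 fresh input, $/M tokens
cache_read = base_input * 0.10    # 90% cache discount -> $0.30/M
batch_cached = cache_read * 0.50  # 50% batch discount -> $0.15/M
```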

FAQ

What is the smallest prompt that can benefit from caching?

1,024 tokens on both Anthropic Sonnet 4.5 and OpenAI GPT-4o. Opus 4 needs 2,048 tokens. Below those, the cache never engages.

Does output count toward caching?

No. Only input (system prompt, tool definitions, few-shot examples, prior messages) can be cached. Outputs bill at the full rate.

How do I monitor cache hit rate?

OpenAI exposes usage.prompt_tokens_details.cached_tokens on every response. Anthropic exposes usage.cache_read_input_tokens and usage.cache_creation_input_tokens. Log both and chart hit/write ratio in Datadog.
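Here is a sketch of the ratio computation over Anthropic usage payloads, shown as plain dicts (the SDK returns usage objects carrying these same field names):

```python
def cache_hit_ratio(usages):
    """Cache-read tokens per cache-write token across a batch of responses."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    return reads / writes if writes else float("inf")

samples = [
    {"cache_creation_input_tokens": 5_000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5_000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5_000},
]
# ratio is 2.0 here: below the 5x target, right at the alert line
```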

Can I cache tool definitions?

Yes. Tool definitions count as system context and cache on the same rules as system prompts. This is one of the highest-ROI caching wins for tool-heavy agents.
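With the Anthropic API you do this by placing a cache_control marker on the last tool, which caches everything up to and including it. A minimal sketch; the tool itself is invented for illustration:

```python
tools = [
    {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Look up current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
]
# Marking the final tool caches the entire tool block as one prefix.
tools[-1]["cache_control"] = {"type": "ephemeral"}
```

Pass this tools list to client.messages.create as usual; since tool definitions rarely change, the prefix stays stable and the hit rate stays high.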

Does Gemini support caching?

Yes. Gemini context caching on 2.5 Pro charges $0.31 per million cached input versus $1.25 fresh, a 75% discount that sits between OpenAI's 50% and Anthropic's 90%. Storage also carries a separate fee of $4.50 per million tokens per hour, which bites on long-lived caches.

Actionable takeaways

  1. Audit your cache hit rate every week; target 5x hits per write or better
  2. Move user-specific strings out of the cached prefix to stabilize the hit pattern
  3. Use Anthropic's 1-hour cache tier for workloads with 5-45 minute session gaps
  4. Skip caching on prompts under 1,024 tokens or with fewer than 5 calls per TTL window
  5. Combine Anthropic batch with caching for nightly scoring jobs; it is the cheapest frontier option
  6. Cache tool definitions on agent workloads; they are usually static and high volume

Sources

  • Anthropic prompt caching documentation, docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed 2026-04-14
  • OpenAI prompt caching documentation, platform.openai.com/docs/guides/prompt-caching, accessed 2026-04-14
  • Google Gemini context caching, ai.google.dev/gemini-api/docs/caching, accessed 2026-04-14
  • Internal invoice analysis across 6 production apps, March 2026

