
LLM Cost Optimization: Cutting API Bills 50–90% (2026)

Quick Answer

The biggest LLM cost levers in order of impact: (1) model routing — use cheap models for simple tasks (saves 80–95%), (2) prompt caching — cache static prefixes (saves 80–90% on cached tokens), (3) output constraints — limit max_tokens to what's needed (saves 20–40%), (4) batching — batch API for 50% cost reduction at the cost of latency, (5) prompt compression — reduce input tokens by 30–70%. Most applications can cut costs 50–80% with these techniques.

When to Use

  • Monthly LLM API bills exceeding $1,000 — at this scale, optimization pays for itself in days
  • Before scaling a feature from prototype to production — optimization at design time is 10x easier than retrofitting
  • When a model upgrade would improve quality but the cost increase is unacceptable
  • Building multi-tenant SaaS where LLM cost per user directly affects unit economics
  • Optimizing developer tooling or internal tools that run thousands of queries per day

How It Works

  1. Model routing: classify queries by complexity and route each to the cheapest model that can handle it. Claude Haiku ($0.80/M input) is 3.75x cheaper than Sonnet ($3/M). GPT-4o-mini is roughly 16x cheaper than GPT-4o. For most applications, 60–80% of queries are simple enough for a cheap model.
  2. Prompt caching: cache your system prompt and static context. Anthropic prompt caching cuts cached token cost by 90% (cached reads bill at 10% of the base input price). For an agent with a 2K-token system prompt making 1M requests/month on Claude Sonnet, that prompt costs ~$6,000/month uncached; at a 90% cache hit rate it drops to roughly $1,100/month.
  3. Output constraints: set max_tokens to 1.5x your expected output length, not the model maximum. A generous max_tokens doesn't guarantee longer output — it just keeps the door open for unexpected verbosity that costs money.
  4. Batch processing: use the Batch API (Anthropic, OpenAI) for non-real-time workloads: 50% cost discount, responses within 24 hours. Perfect for nightly data processing, bulk classification, and report generation.
  5. Response caching: cache LLM responses for repeated queries. A semantic cache (embed the query, check for similar past queries) can achieve 20–40% cache hit rates on high-volume applications, with near-zero marginal cost on hits.
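Step 2 maps directly onto Anthropic's cache_control content blocks. A minimal sketch: the helper below only assembles the keyword arguments for client.messages.create() so the cache marker is easy to see; the model name and token budget are placeholders, not recommendations.

```python
def build_cached_request(system_prompt: str, user_msg: str,
                         model: str = 'claude-3-5-sonnet-20241022') -> dict:
    """Assemble kwargs for client.messages.create() with the static
    system prompt marked as cacheable."""
    return {
        'model': model,
        'max_tokens': 1024,
        # system can be a list of content blocks; the ephemeral
        # cache_control marker tells the API to cache the prefix up to
        # and including this block.
        'system': [{
            'type': 'text',
            'text': system_prompt,
            'cache_control': {'type': 'ephemeral'},
        }],
        'messages': [{'role': 'user', 'content': user_msg}],
    }

kwargs = build_cached_request('You are a support agent. <static context here>',
                              'Where is my order?')
# client.messages.create(**kwargs)
# Subsequent calls inside the cache window bill the system prompt
# at ~10% of the base input price.
```

Only the user message changes between calls; everything before the cache marker stays byte-identical, which is what makes the cache hit.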

Examples

Multi-tier model routing
from anthropic import Anthropic

client = Anthropic()

# Complexity classifier (uses cheap model)
def classify_complexity(query: str) -> str:
    response = client.messages.create(
        model='claude-3-5-haiku-20241022',  # Cheapest model for classification
        max_tokens=10,
        messages=[{
            'role': 'user',
            'content': f'Classify query complexity: simple (factual lookup, short answer) or complex (analysis, reasoning, long output). Query: {query}. Reply: simple or complex'
        }]
    )
    return response.content[0].text.strip().lower()

def cost_optimized_call(query: str, system: str) -> str:
    complexity = classify_complexity(query)
    model = 'claude-3-5-haiku-20241022' if complexity == 'simple' else 'claude-3-5-sonnet-20241022'
    
    response = client.messages.create(
        model=model, max_tokens=1024,
        system=system,
        messages=[{'role': 'user', 'content': query}]
    )
    return response.content[0].text
Output: Classification costs roughly $0.0001/query on Haiku (≈$0.10 per 1K queries). Assuming ~1,000 input tokens per query, all-Sonnet input cost is $3.00 per 1K queries; a 70/30 Haiku/Sonnet split costs 0.7 × $0.80 + 0.3 × $3.00 = $1.46 per 1K, plus ~$0.10 for classification. Net saving: ≈ $1.44 per 1K queries, close to a 50% reduction in input cost.
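The routing arithmetic generalizes to other price points and mixes. A small cost model; the prices, per-query token count, and classification overhead are illustrative defaults, not live pricing:

```python
def routing_cost_per_1k(haiku_fraction: float,
                        tokens_per_query: int = 1_000,
                        haiku_price: float = 0.80,    # $/M input tokens
                        sonnet_price: float = 3.00,   # $/M input tokens
                        classify_cost_per_1k: float = 0.10) -> float:
    """Input cost in dollars per 1,000 queries under a Haiku/Sonnet
    split, including the overhead of the classification calls."""
    m_tokens = tokens_per_query * 1_000 / 1_000_000  # M tokens per 1K queries
    blended = haiku_fraction * haiku_price + (1 - haiku_fraction) * sonnet_price
    return m_tokens * blended + classify_cost_per_1k

all_sonnet = routing_cost_per_1k(0.0, classify_cost_per_1k=0.0)  # 3.00
routed = routing_cost_per_1k(0.7)                                # 1.56
```

Re-run with your own token counts and routing fraction before committing to a routing threshold.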
Output token limiting
# Bad: sets max_tokens to the model maximum "just in case"
response = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=4096,  # You pay for every token actually generated, up to 4096
    messages=[{'role': 'user', 'content': 'Summarize this article in 2-3 sentences.'}]
)

# Good: constrain to task requirements
response = client.messages.create(
    model='claude-3-5-sonnet-20241022',
    max_tokens=150,  # 2-3 sentences ≈ 50-100 tokens, buffer for safety
    messages=[{
        'role': 'user',
        'content': 'Summarize this article in 2-3 sentences (maximum 100 words). Article: ...'
    }]
)
# Also constrain in the prompt itself for double enforcement
Output: For a summarization task averaging 80 output tokens, max_tokens=4096 costs no more if the model stops at 80 tokens on its own. But without both a max_tokens cap AND a prompt-level constraint, models sometimes pad output unnecessarily. Together, the two constraints reduce average output length by 15–20%.
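The sizing rule from the "How It Works" section (1.5x expected output) can be captured in a helper; the function name and buffer default are my own shorthand for that rule of thumb:

```python
import math

def suggest_max_tokens(expected_output_tokens: int, buffer: float = 1.5) -> int:
    """max_tokens = expected output length times a safety buffer,
    per the 1.5x rule of thumb, rather than the model maximum."""
    return math.ceil(expected_output_tokens * buffer)

suggest_max_tokens(100)  # 150: enough headroom for a 2-3 sentence summary
```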

Common Mistakes

  • Optimizing prematurely before measuring — don't guess which queries are expensive. Instrument your application first: log (model, input_tokens, output_tokens, cost) per request. You'll discover the actual cost drivers are often not what you expected.
  • Routing all queries to cheap models — some tasks genuinely require frontier model quality. Measure quality degradation when routing to a cheaper model; don't just assume the cost savings are worth it.
  • Forgetting tool definition tokens — tool definitions sent with every request can be 1,000–4,000 tokens. For agents making 1M requests/month, 2,000 extra tokens = $6,000/month on Sonnet. Cache tool definitions and use prompt caching on them.
  • Not monitoring costs after optimization — cost optimization can be undone by new features adding tokens, increased query volume, or provider price changes. Set up daily cost tracking with alerts for 2x budget overruns.
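The tool-definition figure above is plain multiplication, but it's worth scripting so it can be re-run with your own volumes. The default price is the Sonnet input rate quoted earlier:

```python
def monthly_overhead(extra_tokens_per_request: int,
                     requests_per_month: int,
                     input_price_per_m: float = 3.00) -> float:
    """Monthly dollar cost of tokens resent with every request,
    e.g. uncached tool definitions."""
    total_m_tokens = extra_tokens_per_request * requests_per_month / 1_000_000
    return total_m_tokens * input_price_per_m

monthly_overhead(2_000, 1_000_000)  # 6000.0, the $6K/month figure above
```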

FAQ

What's the single highest-impact cost optimization?

Model routing, but only if your application has a mix of simple and complex queries (most do). Routing 70% of queries to a model that's 5x cheaper while maintaining quality cuts your bill by 56%. If you have just one query type, prompt caching (if queries share a long static prefix) is often the highest-impact change.
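The 56% figure comes from a one-line formula: savings fraction = routed_fraction × (1 − 1/price_ratio), assuming quality holds. As a sketch:

```python
def routing_savings(routed_fraction: float, price_ratio: float) -> float:
    """Fraction of the bill saved when `routed_fraction` of queries
    move to a model `price_ratio` times cheaper (quality assumed equal)."""
    return routed_fraction * (1 - 1 / price_ratio)

routing_savings(0.7, 5)  # ≈ 0.56: route 70% of traffic to a 5x-cheaper model
```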

How do I measure actual cost savings from optimization?

Log token usage before and after in production. Compare: (input_tokens × input_price + output_tokens × output_price) per request over a 1-week baseline. Apply optimization. Compare the next 1-week cost-per-request. Use the same query distribution (no seasonality bias). Report actual cost savings, not estimated savings from token count reduction.
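A minimal before/after measurement, assuming you already log model and token counts per request; the price table is illustrative and should be refreshed from your provider's price sheet:

```python
PRICES = {  # $ per 1M tokens; update from the provider's current price sheet
    'claude-3-5-sonnet-20241022': {'input': 3.00, 'output': 15.00},
    'claude-3-5-haiku-20241022': {'input': 0.80, 'output': 4.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request from logged token counts."""
    p = PRICES[model]
    return (input_tokens * p['input'] + output_tokens * p['output']) / 1_000_000

def mean_cost(log: list) -> float:
    """Average cost per request over a logged window of
    {'model', 'input_tokens', 'output_tokens'} records."""
    return sum(request_cost(r['model'], r['input_tokens'], r['output_tokens'])
               for r in log) / len(log)

# savings_fraction = 1 - mean_cost(week_after) / mean_cost(week_before)
```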

Is it worth switching providers to save money?

Potentially. As of 2026: DeepSeek V3 at $0.27/M input is 10x cheaper than Claude Sonnet with comparable quality on many tasks. Gemini 1.5 Flash is extremely competitive on price-performance. Evaluate on your specific task distribution — provider switching has integration costs but can save 50-80% on compatible tasks.

How much does prompt caching actually save?

Example: Claude Sonnet system prompt of 2,000 tokens, 100,000 requests/month. Without caching: 2,000 × 100,000 × $3/1M = $600/month. With caching at a 90% hit rate, cached reads bill at 10% of the base input price: (0.9 × $600 × 0.10) + (0.1 × $600) = $54 + $60 ≈ $114/month. Savings: ~$486/month. Cache writes cost 25% more than base input but only occur on misses, so the overhead is negligible. The savings are real and significant.
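The same arithmetic as a function. Note that cached reads still cost ~10% of the base input price on Anthropic, so the realistic figure is somewhat higher than a pure 90% cut; the small cache-write surcharge is ignored here:

```python
def caching_monthly_cost(prompt_tokens: int, requests: int,
                         input_price_per_m: float = 3.00,
                         hit_rate: float = 0.9,
                         cached_read_ratio: float = 0.10) -> float:
    """Monthly dollar cost of a cached static prompt: misses pay full
    price, hits pay the discounted cached-read price."""
    base = prompt_tokens * requests / 1_000_000 * input_price_per_m
    return base * (hit_rate * cached_read_ratio + (1 - hit_rate))

caching_monthly_cost(2_000, 100_000, hit_rate=0.0)  # 600.0 without caching
caching_monthly_cost(2_000, 100_000)                # ≈ 114.0 at 90% hits
```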

What's the Batch API and when should I use it?

The Batch API (Anthropic and OpenAI) processes requests asynchronously (results in 24 hours) at 50% of the regular price. Use it for: nightly data enrichment, bulk document processing, offline evaluation runs, report generation. Don't use for: anything requiring real-time response, interactive features. The 50% discount is significant — a $10K/month real-time workload becomes $5K with batching if it can tolerate async delivery.
