
How to Calculate LLM API Costs Before You Build: The Complete Formula

One of the most common mistakes in building LLM-powered products: discovering the real cost only after you've built it. A feature that costs $0.01 per use in development can cost $500/day in production at scale. Here's how to calculate costs accurately before you build.

The Core Cost Formula

LLM API cost has three components:

Total Cost = (Input Tokens × Input Price)
           + (Output Tokens × Output Price)
           - (Cached Tokens × Cache Savings)

All prices are per 1 million tokens.
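As a quick sanity check, here is the formula applied by hand for a single uncached request, using illustrative rates of $3/M input and $15/M output (substitute your model's actual prices):

```python
# Worked example: 10K input tokens + 1K output tokens, no caching
input_tokens, output_tokens = 10_000, 1_000
input_price, output_price = 3.00, 15.00  # $ per 1M tokens (illustrative)

cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
print(f"${cost:.4f}")  # $0.0450
```

Note the asymmetry: output tokens cost 5x more here, so output length often dominates the bill even when inputs are larger.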

Step 1: Count Your Tokens

Before you can calculate costs, you need to know how many tokens your prompts use.

What Is a Token?

A token is roughly 4 characters of English text, or 0.75 words. But it varies:

  • Common English words: ~1 token each
  • Technical terms, proper nouns: often 2-3 tokens
  • Code: typically more tokens per character (special syntax)
  • Non-English text: often 2-4x more tokens than equivalent English

Counting Tokens Programmatically

# Count tokens before making API calls
import anthropic
import tiktoken

# For Claude (Anthropic provides a count_tokens endpoint)
client = anthropic.Anthropic()

def count_claude_tokens(messages: list, system: str = "") -> dict:
    """Count tokens for a Claude request before making it."""
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        system=system,
        messages=messages
    )
    return {
        "input_tokens": response.input_tokens,
        "estimated_cost_input": response.input_tokens * 3.00 / 1_000_000  # at $3/M input
    }

# Example
tokens = count_claude_tokens(
    messages=[{"role": "user", "content": "Summarize this document: " + long_document}],
    system="You are a helpful assistant."
)
print(f"Input tokens: {tokens['input_tokens']}")
print(f"Input cost: ${tokens['estimated_cost_input']:.4f}")

# For OpenAI models, use tiktoken (imported above)

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens for an OpenAI request."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

# Count tokens for a full message array
def count_messages_tokens(messages: list, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    tokens = 0
    for message in messages:
        tokens += 4  # ~4 tokens of per-message overhead (chat-format heuristic; varies by model)
        for key, value in message.items():
            tokens += len(encoding.encode(str(value)))
    tokens += 2  # Reply overhead
    return tokens

# Example
document_tokens = count_openai_tokens(my_document, "gpt-4o")
print(f"Document tokens: {document_tokens}")
print(f"Input cost: ${document_tokens * 2.50 / 1_000_000:.4f}")

# Rough estimate without API call (works for any model)
def estimate_tokens(text: str) -> int:
    """Rough estimation: 1 token ≈ 4 characters for English."""
    return len(text) // 4

# More accurate: count words
def estimate_tokens_by_words(text: str) -> int:
    """1.3 tokens per word is a reasonable estimate."""
    return int(len(text.split()) * 1.3)
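Both heuristics land in the same ballpark on ordinary English. A quick check (functions repeated so the snippet runs on its own; the sample sentence is mine):

```python
def estimate_tokens(text: str) -> int:
    """Rough estimation: 1 token ≈ 4 characters for English."""
    return len(text) // 4

def estimate_tokens_by_words(text: str) -> int:
    """1.3 tokens per word is a reasonable estimate."""
    return int(len(text.split()) * 1.3)

sample = "The quick brown fox jumps over the lazy dog in the park today."
print(estimate_tokens(sample))           # 62 chars // 4 = 15
print(estimate_tokens_by_words(sample))  # int(13 words * 1.3) = 16
```

Expect these rough estimates to drift further from the true count on code or non-English text, for the reasons listed above.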

Step 2: Estimate Output Tokens

Output tokens are harder to predict. Strategies:

1. Sample and measure: Run your prompt 50-100 times in development and record the distribution of output tokens.

results = []
for _ in range(50):
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": sample_prompt}]
    )
    results.append(response.usage.output_tokens)

import statistics
print(f"Mean output tokens: {statistics.mean(results):.0f}")
print(f"P95 output tokens: {sorted(results)[int(len(results)*0.95)]:.0f}")
print(f"Max output tokens: {max(results)}")

2. Output type heuristics:

Output Type                  Typical Token Count
Yes/No answer                1-5
Single-sentence answer       15-30
Short paragraph              80-150
Bullet list (5 items)        100-200
Code function (~20 lines)    200-400
Short essay (500 words)      700-800
Code file (~100 lines)       1,000-2,000
Long-form report             2,000-5,000

3. Set max_tokens strategically: Don't set max_tokens to the model's maximum. Set it to roughly 1.5× your expected maximum output (e.g., your measured P95). You only pay for tokens actually generated, but a tight cap bounds the worst case when a model runs long.
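Combining points 1 and 3, one workable recipe is to derive max_tokens directly from your measured distribution (the sample values here are hypothetical):

```python
# Derive max_tokens from measured output lengths (hypothetical measurements)
measured_outputs = [180, 210, 240, 255, 300, 320, 350, 400, 480, 600]

p95 = sorted(measured_outputs)[int(len(measured_outputs) * 0.95)]
max_tokens = int(p95 * 1.5)  # headroom over P95 without allowing unbounded runaways
print(p95, max_tokens)  # 600 900
```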

Step 3: The Full Cost Calculation

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelPricing:
    name: str
    input_per_1m: float
    output_per_1m: float
    cached_input_per_1m: Optional[float] = None

PRICING = {
    "claude-sonnet-4-5": ModelPricing("Claude Sonnet 4.5", 3.00, 15.00, 0.30),
    "claude-haiku-3-5": ModelPricing("Claude Haiku 3.5", 0.80, 4.00, 0.08),
    "gpt-4o": ModelPricing("GPT-4o", 2.50, 10.00, 1.25),
    "gpt-4o-mini": ModelPricing("GPT-4o Mini", 0.15, 0.60, 0.075),
    "gemini-2.5-pro": ModelPricing("Gemini 2.5 Pro", 1.25, 10.00, None),
    "gemini-2.0-flash": ModelPricing("Gemini Flash", 0.10, 0.40, None),
    "deepseek-r1": ModelPricing("DeepSeek R1", 0.55, 2.19, None),
    "llama-3.3-70b-groq": ModelPricing("Llama 3.3 70B (Groq)", 0.59, 0.79, None),
}

def calculate_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cached_tokens: int = 0
) -> dict:
    pricing = PRICING[model]
    
    uncached_input = input_tokens - cached_tokens
    
    input_cost = uncached_input * pricing.input_per_1m / 1_000_000
    output_cost = output_tokens * pricing.output_per_1m / 1_000_000
    cache_cost = 0
    
    if cached_tokens > 0 and pricing.cached_input_per_1m:
        cache_cost = cached_tokens * pricing.cached_input_per_1m / 1_000_000
    
    total = input_cost + output_cost + cache_cost
    
    return {
        "model": pricing.name,
        "input_cost": input_cost,
        "output_cost": output_cost,
        "cache_cost": cache_cost,
        "total": total,
        "formatted": f"${total:.6f}"
    }

# Example: Document summarization
cost = calculate_cost(
    model="claude-sonnet-4-5",
    input_tokens=5000,   # 5K input (a medium document)
    output_tokens=500,   # 500 output (summary)
    cached_tokens=1000   # 1K cached system prompt tokens
)
print(f"Per request: {cost['formatted']}")
# Per request: $0.019800

Step 4: Estimate Production Costs From Prototype Usage

The hardest part: your prototype with 10 test cases doesn't tell you what 100K real users will cost.

The Prototype-to-Production Formula

def project_production_cost(
    prototype_cost_per_call: float,
    daily_active_users: int,
    calls_per_user_per_day: float,
    overhead_multiplier: float = 1.4  # Account for retries, dev calls, etc.
) -> dict:
    """
    Project production costs from prototype data.
    overhead_multiplier: production typically runs 30-50% more calls than you expect
    """
    daily_calls = daily_active_users * calls_per_user_per_day
    daily_cost = daily_calls * prototype_cost_per_call * overhead_multiplier
    
    return {
        "daily_calls": int(daily_calls),
        "daily_cost": f"${daily_cost:.2f}",
        "monthly_cost": f"${daily_cost * 30:.2f}",
        "annual_cost": f"${daily_cost * 365:.2f}",
        "cost_per_user_per_month": f"${daily_cost * 30 / daily_active_users:.2f}"
    }

# Example: AI writing assistant
prototype_avg_cost = 0.045  # $0.045 per request (measured in dev)

costs = project_production_cost(
    prototype_cost_per_call=prototype_avg_cost,
    daily_active_users=1000,
    calls_per_user_per_day=5.0,
    overhead_multiplier=1.3
)

print(f"Daily calls: {costs['daily_calls']:,}")
print(f"Daily cost: {costs['daily_cost']}")
print(f"Monthly cost: {costs['monthly_cost']}")
print(f"Cost per active user/month: {costs['cost_per_user_per_month']}")
# Daily calls: 5,000
# Daily cost: $292.50
# Monthly cost: $8775.00
# Cost per active user/month: $8.78

At $8.78/user/month AI cost, you need to charge users at least $15-20/month to maintain reasonable margins.
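That claim is easy to check with a gross-margin table (the price points below are hypothetical, and this ignores infrastructure and support costs):

```python
# Gross margin on AI cost alone at various price points (hypothetical prices)
ai_cost_per_user = 8.78  # monthly AI cost per active user, from the projection above

for price in (10, 15, 20, 30):
    margin = 1 - ai_cost_per_user / price
    print(f"${price}/mo -> {margin:.0%} gross margin before infra and support")
```

At $10/month you keep about 12% before any other costs; $20-30/month is where the margin becomes healthy.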

Step 5: Build a Cost Estimator for Your Specific Workload

def estimate_workload_cost(
    # Input description
    avg_system_prompt_tokens: int,
    avg_user_message_tokens: int,
    avg_conversation_turns: int,  # Messages before task completion
    pct_system_prompt_cached: float,  # 0.0-1.0
    
    # Output description
    avg_output_tokens_per_turn: int,
    
    # Scale
    requests_per_day: int,
    
    # Model
    model: str = "claude-sonnet-4-5"
) -> dict:
    # Per-conversation totals: every turn resends the full system prompt
    cached_system = avg_system_prompt_tokens * pct_system_prompt_cached

    # Context accumulates: turn i carries (i - 1) previous exchanges, so the
    # total carried context is exchange_tokens * turns * (turns - 1) / 2
    exchange_tokens = avg_user_message_tokens + avg_output_tokens_per_turn
    carried_context = exchange_tokens * avg_conversation_turns * (avg_conversation_turns - 1) / 2

    # input_tokens is the total including the cached portion;
    # calculate_cost subtracts cached_tokens from it internally
    input_tokens = (avg_system_prompt_tokens + avg_user_message_tokens) * avg_conversation_turns + carried_context
    cached_input_tokens = cached_system * avg_conversation_turns
    output_tokens = avg_output_tokens_per_turn * avg_conversation_turns

    cost_per_request = calculate_cost(
        model, int(input_tokens), int(output_tokens), int(cached_input_tokens)
    )["total"]

    daily_cost = cost_per_request * requests_per_day

    return {
        "tokens_per_request": {
            "input": int(input_tokens),
            "cached": int(cached_input_tokens),
            "output": int(output_tokens)
        },
        "cost_per_request": f"${cost_per_request:.4f}",
        "daily_cost": f"${daily_cost:.2f}",
        "monthly_cost": f"${daily_cost * 30:,.2f}"
    }

# Example: Customer support chatbot
estimate = estimate_workload_cost(
    avg_system_prompt_tokens=2000,      # Company knowledge base in system prompt
    avg_user_message_tokens=150,        # Short customer questions
    avg_conversation_turns=3,           # 3 exchanges per ticket
    pct_system_prompt_cached=0.95,      # 95% cache hit rate on system prompt
    avg_output_tokens_per_turn=250,     # Moderate response length
    requests_per_day=2000,              # 2K tickets/day
    model="claude-sonnet-4-5"
)

print("Customer Support Bot Estimate:")
print(f"Tokens per request: {estimate['tokens_per_request']}")
print(f"Cost per ticket: {estimate['cost_per_request']}")
print(f"Daily cost: {estimate['daily_cost']}")
print(f"Monthly cost: {estimate['monthly_cost']}")

The Prompt Caching Impact

This is the largest cost lever many teams miss:

# Same workload, with vs without prompt caching
without_cache = calculate_cost(
    "claude-sonnet-4-5",
    input_tokens=5000,
    output_tokens=500,
    cached_tokens=0
)

with_cache = calculate_cost(
    "claude-sonnet-4-5",
    input_tokens=5000,
    output_tokens=500,
    cached_tokens=3000  # 3K of 5K input is cached system prompt
)

print(f"Without caching: {without_cache['formatted']}")
print(f"With caching: {with_cache['formatted']}")
print(f"Savings: {(1 - with_cache['total']/without_cache['total'])*100:.0f}%")
# Without caching: $0.022500
# With caching: $0.014400
# Savings: 36%

# At 100K requests/month:
monthly_savings = (without_cache['total'] - with_cache['total']) * 100_000
print(f"Monthly savings at 100K req/month: ${monthly_savings:,.2f}")
# Monthly savings: $810.00
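One caveat the comparison above omits: providers usually charge a premium to write the cache in the first place. Anthropic, for example, bills 5-minute cache writes at 1.25x the base input price and cache reads at 0.1x, so caching only wins once a prefix is actually reused. A rough breakeven sketch under those assumed rates:

```python
# Cost of serving n requests that share a p-token cacheable prefix
# Rates assumed: $3/M base input, 1.25x for cache writes, 0.1x for cache reads
INPUT, WRITE, READ = 3.00, 3.75, 0.30  # $ per 1M tokens

def with_cache(n: int, p: int) -> float:
    # first request writes the cache, the remaining n-1 read it
    return (WRITE * p + READ * p * (n - 1)) / 1_000_000

def without_cache(n: int, p: int) -> float:
    return INPUT * p * n / 1_000_000

for n in (1, 2, 10, 100):
    print(n, f"${with_cache(n, 3000):.4f}", f"${without_cache(n, 3000):.4f}")
```

At these rates a single-use prefix costs 25% extra, but caching already breaks even on the second request, so any prefix reused within the cache TTL is worth caching.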

When to Switch Providers

def cost_comparison_table(
    input_tokens: int,
    output_tokens: int,
    requests_per_month: int
):
    """Compare monthly costs across all major providers."""
    results = []
    for model_key, pricing in PRICING.items():
        cost_per_req = calculate_cost(model_key, input_tokens, output_tokens)["total"]
        monthly_cost = cost_per_req * requests_per_month
        results.append({
            "model": pricing.name,
            "per_request": f"${cost_per_req:.4f}",
            "monthly": f"${monthly_cost:,.2f}"
        })
    
    results.sort(key=lambda x: float(x["monthly"].replace("$", "").replace(",", "")))
    
    print(f"\nCost comparison: {input_tokens} input + {output_tokens} output, {requests_per_month:,} req/month")
    print(f"{'Model':<25} {'Per Request':<15} {'Monthly'}")
    print("-" * 55)
    for r in results:
        print(f"{r['model']:<25} {r['per_request']:<15} {r['monthly']}")

cost_comparison_table(2000, 500, 100_000)

Output:

Cost comparison: 2000 input + 500 output, 100,000 req/month
Model                     Per Request     Monthly
-------------------------------------------------------
Gemini Flash              $0.0004         $40.00
GPT-4o Mini               $0.0006         $60.00
Llama 3.3 70B (Groq)      $0.0016         $157.50
DeepSeek R1               $0.0022         $219.50
Claude Haiku 3.5          $0.0036         $360.00
Gemini 2.5 Pro            $0.0075         $750.00
GPT-4o                    $0.0100         $1,000.00
Claude Sonnet 4.5         $0.0135         $1,350.00

This makes the decision obvious: for a task where Gemini Flash or GPT-4o Mini is good enough, using Claude Sonnet 4.5 costs roughly 34x more per month.

The Bottom Line

Cost estimation before building requires:

  1. Count input tokens: Use the provider's token counter or tiktoken on sample prompts
  2. Measure output tokens: Run 50+ test cases, record P50 and P95 output lengths
  3. Apply the formula: cost = (input × input_price) + (output × output_price) - (cached × cache_savings)
  4. Project to scale: monthly_cost = cost_per_request × requests_per_day × 30 × 1.4
  5. Check unit economics: cost_per_user = monthly_cost / MAU — must be well below your monetization
  6. Model the caching benefit: Large, reused system prompts can cut costs 30-60%
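Chained together, the checklist fits in a few lines (token counts and rates below are the illustrative numbers used throughout this post, not current prices):

```python
# End-to-end estimate: cost formula + scale projection (illustrative inputs)
input_price, output_price, cache_price = 3.00, 15.00, 0.30  # $/M tokens
input_tok, output_tok, cached_tok = 5000, 500, 3000         # measured per request
requests_per_day, overhead = 2000, 1.4                      # overhead: retries, dev calls

per_request = ((input_tok - cached_tok) * input_price
               + output_tok * output_price
               + cached_tok * cache_price) / 1_000_000
monthly = per_request * requests_per_day * 30 * overhead
print(f"${per_request:.4f}/request, ${monthly:,.2f}/month")
```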

Verify current pricing at llmversus.com/calculator — prices change frequently enough that hardcoded numbers go stale fast.
