Token Counting: Understanding and Measuring LLM Tokens (2026)
Tokens are not words — they're subword units defined by the model's tokenizer. English prose averages about 0.75 words per token (roughly 4 characters per token). Code is more token-dense: about 2–4 characters per token. Use tiktoken (OpenAI) or the provider's official count_tokens API (Anthropic, Google) to get exact counts before sending prompts. Token counting is free — count before you send to avoid surprises.
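When the exact tokenizer isn't at hand, the character ratios above give a quick ballpark. This is a rough heuristic sketch, not an exact count — the 4- and 3-character divisors are rule-of-thumb values, and the function name is illustrative:

```python
def rough_token_estimate(text: str, kind: str = "prose") -> int:
    """Rough token estimate from character count.

    Rule of thumb: ~4 chars/token for English prose, ~3 for code.
    Always verify with the real tokenizer before relying on it.
    """
    chars_per_token = 4 if kind == "prose" else 3
    return max(1, len(text) // chars_per_token)

# 44 characters of prose // 4 → 11 estimated tokens
print(rough_token_estimate("The quick brown fox jumps over the lazy dog."))  # → 11
```

Treat this only as a sanity check; real counts can differ by 20–30% either way.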
When to Use
- ✓ Before building any LLM application — understand the pricing unit
- ✓ When a prompt is near the context window limit and you need to know if it fits
- ✓ Estimating monthly API costs before launching a product
- ✓ Debugging unexpected billing spikes — identify which prompts are consuming the most tokens
- ✓ Optimizing prompts for cost — measure before and after compression to validate savings
How It Works
1. Tokenizers split text into subwords using Byte Pair Encoding (BPE) or similar algorithms. Common words become single tokens; rare words split into multiple tokens. Numbers often split into several small chunks.
2. GPT-4o and GPT-4o-mini use tiktoken (cl100k_base or o200k_base encoding). Claude uses a different tokenizer. Gemini uses SentencePiece. Counts differ slightly between models — always use the right tokenizer.
3. Count tokens before sending: tiktoken for OpenAI (pip install tiktoken, free, local), or the API's count_tokens endpoint for Claude (a free API call). Always count the full prompt: system + messages + tool definitions.
4. Token counts for common content types: 1K English words ≈ 1,300 tokens; 1K Python lines ≈ 1,500–3,000 tokens; a 1-page PDF ≈ 500–800 tokens; a 10KB JSON ≈ 2,000–4,000 tokens.
5. Output tokens are typically priced 3–5x higher than input tokens. Always account for both in cost estimates. The ratio matters: a task generating 500 output tokens from a 100-token input has a very different cost profile than a task generating 50 output tokens from a 5,000-token input.
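The input/output ratio in step 5 can be made concrete. This sketch uses the $3-in/$15-out per-million rates quoted later in the article; the function name and defaults are illustrative:

```python
def cost_usd(input_tokens: int, output_tokens: int,
             in_per_m: float = 3.00, out_per_m: float = 15.00) -> float:
    """Cost in USD at per-million-token rates."""
    return input_tokens / 1e6 * in_per_m + output_tokens / 1e6 * out_per_m

# Output-heavy task: 100 input tokens, 500 output tokens
print(f"{cost_usd(100, 500):.6f}")   # → 0.007800 (output dominates)

# Input-heavy task: 5,000 input tokens, 50 output tokens
print(f"{cost_usd(5000, 50):.6f}")   # → 0.015750 (input dominates)
```

Despite generating 10x fewer output tokens, the second task costs about twice as much — the input side drives it.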
Examples
```python
import tiktoken

# Count tokens for OpenAI models
enc = tiktoken.encoding_for_model('gpt-4o')

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Count a full conversation (overheads are approximate and vary by model)
def count_conversation_tokens(messages: list[dict]) -> int:
    total = 0
    for msg in messages:
        # ~4 tokens overhead per message (role, content delimiters)
        total += 4 + count_tokens(msg['content'])
    total += 2  # assistant response priming tokens
    return total

# Examples
print(count_tokens('Hello, world!'))      # → 4 tokens
print(count_tokens('def fibonacci(n):'))  # → 7 tokens
print(count_tokens('2024-04-16'))         # → 5 tokens (dates split into several pieces)
```

```python
# Estimate cost before sending to avoid surprises
PRICING = {  # $/M tokens
    'claude-3-5-sonnet-20241022': {'input': 3.00, 'output': 15.00},
    'claude-3-5-haiku-20241022': {'input': 0.80, 'output': 4.00},
    'gpt-4o': {'input': 2.50, 'output': 10.00},
    'gpt-4o-mini': {'input': 0.15, 'output': 0.60},
}

def estimate_cost(prompt_tokens: int, estimated_output_tokens: int, model: str) -> float:
    prices = PRICING[model]
    input_cost = (prompt_tokens / 1_000_000) * prices['input']
    output_cost = (estimated_output_tokens / 1_000_000) * prices['output']
    return input_cost + output_cost

# Example: 5K input, 500 output on Claude Sonnet
cost = estimate_cost(5000, 500, 'claude-3-5-sonnet-20241022')
print(f'Estimated cost: ${cost:.4f}')  # → $0.0225
```

Common Mistakes
- ✗ Estimating tokens as word count — 'I have 1,000 words so about 1,000 tokens' is wrong. English prose is 1,300–1,500 tokens per 1,000 words. Code can be 2,000–3,000 tokens per 1,000 words. Always count with the actual tokenizer.
- ✗ Forgetting to count system prompts and tool definitions — many developers track user message tokens but forget that system prompts (often 500–2,000 tokens) and tool definitions (often 1,000–4,000 tokens) are billed as input tokens on every request.
- ✗ Using tiktoken for Claude — tiktoken is OpenAI's tokenizer. Claude uses a different tokenizer, and tiktoken can undercount Claude tokens by 10–20%. Use Anthropic's count_tokens endpoint or build in a 20% buffer for estimates.
- ✗ Not accounting for output token variability — output token counts vary by request. Budget for the P95 output length, not the median. If P50 output is 300 tokens but P95 is 800 tokens, use 800 for cost ceiling estimates.
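Budgeting to the P95 output length from logged request data can be done with the standard library alone. A sketch — the sample values are hypothetical logged output-token counts, and the percentile method is one of several reasonable choices:

```python
import statistics

def p95(values: list[int]) -> float:
    """95th percentile via statistics.quantiles (inclusive method)."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[94]  # the 95th of the 99 cut points

# Hypothetical logged output-token counts for one feature
output_lengths = [280, 300, 310, 320, 330, 350, 400, 450, 600, 800]
print(p95(output_lengths))  # → 710.0 — budget this, not the ~330 median
```

In production you would compute this over thousands of logged requests per feature, not ten.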
FAQ
How many tokens is a typical system prompt?
A minimal system prompt ('You are a helpful assistant.') is 7 tokens. A detailed system prompt with persona, constraints, and examples typically runs 300–1,500 tokens. A very detailed agent system prompt with tool documentation can reach 3,000–8,000 tokens. System prompts are billed every request — long system prompts are a major cost driver at scale.
What's the token limit for major models in 2026?
Claude 3.7 Sonnet: 200K input tokens. GPT-4o: 128K input tokens. Gemini 2.5 Pro: 1M input tokens. Llama 3.1: 128K tokens. Output limits are much smaller than context windows — often 4K–8K tokens by default, though some models allow more via configuration. Plan your context budget to leave room for the full expected output.
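The pre-flight check this answer describes can be a one-liner. A sketch — the window sizes in the example are the ones quoted above; verify against current provider docs:

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int) -> bool:
    """True if the prompt plus the full expected output fits the window."""
    return prompt_tokens + max_output_tokens <= context_window

# A 195K-token prompt leaves no room for an 8K output in a 200K window
print(fits_in_context(195_000, 8_000, 200_000))  # → False
print(fits_in_context(190_000, 8_000, 200_000))  # → True
```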
Why do numbers and special characters use more tokens than regular words?
Tokenizers are trained on natural language text, where sequences like '2024-04-16' or '$1,234,567' are uncommon. These get split into many small pieces — '2024-04-16' can tokenize as '2024', '-', '04', '-', '16' = 5 tokens for a single date, versus 1 token for a common word of similar length. This is why JSON with many numeric IDs tokenizes poorly. For numeric-heavy inputs, encode numbers as words when possible.
Does token count affect latency?
Yes — time-to-first-token increases with input token count (the model must process all input before generating the first output token). Output token count increases time-to-complete linearly. For latency-sensitive applications, minimize input tokens and use early stopping or streaming to get partial results quickly.
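The latency relationship above can be sketched as a simple two-term model. The throughput constants here are purely illustrative assumptions, not measured figures for any real model:

```python
def rough_latency_s(input_tokens: int, output_tokens: int,
                    prefill_tok_per_s: float = 2000.0,
                    decode_tok_per_s: float = 50.0) -> float:
    """Time-to-first-token grows with input size; decode time grows
    linearly with output length. Constants are illustrative only."""
    ttft = input_tokens / prefill_tok_per_s
    decode_time = output_tokens / decode_tok_per_s
    return ttft + decode_time

# 10K-token prompt, 500-token response at the illustrative rates
print(f"{rough_latency_s(10_000, 500):.1f}s")  # → 15.0s
```

The useful takeaway is the shape, not the numbers: trimming input tokens cuts time-to-first-token, while capping output length cuts total time.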
How do I track token usage in production?
Every API response includes usage metadata: {input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens}. Log these with every request. Aggregate by: user_id, feature_name, model, date. Set up daily cost dashboards and alerts for unusual spikes. This is the foundation of AI cost management.
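A minimal sketch of the aggregation step described above, using only the standard library. The record shape and field names follow the usage metadata quoted in the answer; the sample data is hypothetical:

```python
from collections import defaultdict

def aggregate_usage(records: list[dict]) -> dict:
    """Sum input/output tokens per (user_id, model, date) key."""
    totals: dict = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for r in records:
        key = (r["user_id"], r["model"], r["date"])
        totals[key]["input_tokens"] += r["usage"]["input_tokens"]
        totals[key]["output_tokens"] += r["usage"]["output_tokens"]
    return dict(totals)

# Hypothetical per-request log entries
records = [
    {"user_id": "u1", "model": "gpt-4o", "date": "2026-01-05",
     "usage": {"input_tokens": 1200, "output_tokens": 300}},
    {"user_id": "u1", "model": "gpt-4o", "date": "2026-01-05",
     "usage": {"input_tokens": 800, "output_tokens": 200}},
]
print(aggregate_usage(records))
```

Feed the aggregated totals into the pricing table from the Examples section to turn token counts into daily dollar figures.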