
Claude Extended Thinking: Complete Guide (Cost, When It Helps, When It Doesn't)

Claude's Extended Thinking feature lets the model "think" before responding — generating internal reasoning tokens that aren't shown in the final answer but meaningfully improve quality on hard problems. This guide covers what it actually does, when it's worth the extra cost, and how to use it.

What Is Extended Thinking?

When Extended Thinking is enabled, Claude generates a chain of reasoning before producing its final response. You can see this reasoning in the thinking blocks of the response. The model uses these tokens to:

  • Break down complex problems into steps
  • Consider multiple approaches before committing
  • Check its own reasoning for errors
  • Work through mathematical derivations carefully

Thinking tokens are not free: they are billed at their own rate (see the pricing table below). They also count toward max_tokens, so set max_tokens high enough to cover both the thinking budget and the visible response.

Pricing

Token Type         Price per 1M
Input tokens       $3.00
Output tokens      $15.00
Thinking tokens    $3.75

Thinking tokens are priced between input and output tokens. For a problem requiring 10,000 thinking tokens + 500 output tokens:

  • Without extended thinking: 500 tokens × $15/1M = $0.0075
  • With extended thinking: 10,000 × $3.75/1M + 500 × $15/1M = $0.0375 + $0.0075 = $0.045

A 6x cost increase for that request. Whether it's worth it depends entirely on the task.
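The arithmetic above can be checked with a small helper. This is a sketch: the function name request_cost is made up for illustration, and the rates are the per-1M figures from the pricing table (input tokens are omitted, matching the example).

```python
def request_cost(thinking_tokens: int, output_tokens: int,
                 thinking_rate: float = 3.75, output_rate: float = 15.00) -> float:
    """Estimate per-request cost in dollars from per-1M-token rates.

    Input-token cost is ignored, as in the worked example above.
    """
    return (thinking_tokens * thinking_rate + output_tokens * output_rate) / 1_000_000

baseline = request_cost(0, 500)            # output only: $0.0075
with_thinking = request_cost(10_000, 500)  # thinking + output: $0.045
print(f"{with_thinking / baseline:.1f}x")  # 6.0x
```

Running the same comparison with your own token counts makes it easy to see whether a given workload crosses your cost threshold before you enable thinking in production.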

How to Enable Extended Thinking

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max thinking tokens allowed
    },
    messages=[{
        "role": "user",
        "content": "Prove that there are infinitely many prime numbers."
    }]
)

# The response has thinking blocks + text blocks
for block in response.content:
    if block.type == "thinking":
        print("=== THINKING ===")
        print(block.thinking[:500])  # Show first 500 chars
        print()
    elif block.type == "text":
        print("=== RESPONSE ===")
        print(block.text)

# Check thinking tokens used
print(f"\nInput tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
# Note: usage.output_tokens includes the thinking tokens used

Setting Budget Tokens

The budget_tokens parameter controls how much the model can think. Guidelines:

Task Complexity         Suggested Budget
Simple logic            1,000-2,000
Medium coding           3,000-5,000
Hard math/proofs        8,000-15,000
Very complex problems   15,000-32,000

Claude won't always use the full budget — it uses what it needs.
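The tiers above can be encoded as a small lookup helper. A sketch, not a prescription: the tier names and the (low, high) bounds come from the table, and using the midpoint of each band as a default is an assumption of this example.

```python
# Suggested thinking-token budgets per task tier, from the table above.
BUDGET_TIERS = {
    "simple_logic":  (1_000, 2_000),
    "medium_coding": (3_000, 5_000),
    "hard_math":     (8_000, 15_000),
    "very_complex":  (15_000, 32_000),
}

def suggested_budget(tier: str) -> int:
    """Return the midpoint of a tier's suggested budget range."""
    low, high = BUDGET_TIERS[tier]
    return (low + high) // 2

print(suggested_budget("hard_math"))  # 11500
```

Since Claude stops thinking once it has what it needs, erring toward the top of a band mostly raises the worst-case cost, not the typical cost.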

When Extended Thinking Helps

1. Mathematics and Formal Proofs

This is where extended thinking shines most. Problems requiring:

  • Multi-step algebraic manipulation
  • Mathematical proofs by induction or contradiction
  • Calculus problems with multiple integration steps
  • Combinatorics and probability problems

On the MATH-500 benchmark, extended thinking yields roughly a 14-point improvement on hard problems, which is a significant gain.

2. Complex Multi-Step Coding

Problems that benefit:

  • Algorithm design with non-obvious approaches
  • Debugging complex race conditions or state management issues
  • Designing data structures for specific performance characteristics
  • Problems with subtle edge cases (off-by-one, overflow, etc.)

For implementing a simple CRUD endpoint? Extended thinking adds cost with no benefit. For implementing a lock-free concurrent queue? It helps.

3. Logic Puzzles and Reasoning

Extended thinking significantly improves performance on:

  • Constraint satisfaction problems
  • Multi-step logical deductions
  • Problems requiring eliminating possibilities systematically

Example where it helps:

Alice, Bob, Carol, and Dan each own exactly one pet (cat, dog, fish, rabbit).
- Alice doesn't own a dog or fish.
- Bob's pet is not a cat or a dog.
- The person with a rabbit lives next to Carol.
- Dan has a fish.
Who owns the rabbit?

Without thinking, Claude may get this wrong or give a poorly explained answer. With thinking, it systematically works through the constraints.

4. Complex Code Architecture Decisions

When asked to design a system architecture, extended thinking helps Claude consider:

  • Trade-offs between approaches
  • Edge cases in the design
  • Potential failure modes
  • Whether the simpler or more complex approach is actually warranted

When Extended Thinking Is Wasteful

Don't enable extended thinking for these use cases:

Simple Q&A

# Wasteful — extended thinking adds nothing
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=8000,  # must exceed budget_tokens when thinking is enabled
    thinking={"type": "enabled", "budget_tokens": 5000},  # Unnecessary
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

Creative Writing

Extended thinking doesn't improve creative writing quality. Creativity is not a reasoning task — it's a generation task. Use the tokens for actual output.

Summarization

Summarizing a document doesn't require deep reasoning. Save your budget.

Translation

Language translation is a pattern-matching task, not a reasoning task.

Conversation and Chat

For real-time chat applications, extended thinking adds latency (seconds to minutes) for no meaningful improvement on typical conversational turns.

Extraction Tasks

Extracting structured data from text (names, dates, addresses) doesn't benefit from extended thinking.

Cost Control in Practice

import anthropic

class TaskClassifier:
    """Decide whether to use extended thinking based on task type."""

    MATH_KEYWORDS = ["prove", "proof", "derive", "calculate", "solve",
                     "integral", "sum of"]
    CODING_KEYWORDS = ["race condition", "concurrent", "deadlock", "algorithm",
                       "optimize", "complexity", "debug",
                       "why does", "what's wrong"]

    def needs_thinking(self, prompt: str) -> tuple[bool, int]:
        """Returns (use_thinking, budget_tokens)"""
        prompt_lower = prompt.lower()

        # Math and proofs get the largest budget
        if any(k in prompt_lower for k in self.MATH_KEYWORDS):
            return True, 15000

        # Complex coding and debugging problems
        if any(k in prompt_lower for k in self.CODING_KEYWORDS):
            return True, 8000

        # Rough heuristic for logic puzzles: long prompts with conditionals
        if "if" in prompt_lower and "then" in prompt_lower and len(prompt) > 200:
            return True, 5000

        return False, 0

client = anthropic.Anthropic()
classifier = TaskClassifier()

def smart_complete(prompt: str) -> str:
    use_thinking, budget = classifier.needs_thinking(prompt)
    
    kwargs = {
        "model": "claude-sonnet-4-5",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}]
    }
    
    if use_thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": budget}
        print(f"Using extended thinking (budget: {budget} tokens)")
    
    response = client.messages.create(**kwargs)
    
    # Return only the text block, skipping any thinking blocks
    return next(b.text for b in response.content if b.type == "text")

Streaming with Extended Thinking

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Implement a binary search tree with AVL balancing in Python"}]
) as stream:
    for event in stream:
        if event.type == 'content_block_start':
            if event.content_block.type == 'thinking':
                print("[Thinking...]")
            elif event.content_block.type == 'text':
                print("\n[Response]")
        elif event.type == 'content_block_delta':
            # Only text deltas have a .text attribute; thinking deltas don't
            if hasattr(event.delta, 'text'):
                print(event.delta.text, end='', flush=True)

Impact on Latency

Extended thinking adds meaningful latency:

Budget Tokens   Added Latency (approx)
1,000           +2-5 seconds
5,000           +10-20 seconds
10,000          +20-40 seconds
32,000          +60-120 seconds

For interactive applications, only use extended thinking when the quality improvement justifies the wait. For batch processing, it's usually worth enabling.
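For capacity planning, the table above can be turned into a rough estimator. A sketch only: the interpolation points below are midpoints of the table's ranges, not measured values, and the function name estimated_latency is invented for this example.

```python
# Approximate added latency (seconds) per thinking budget, using the
# midpoint of each range from the table above. Illustrative, not measured.
LATENCY_POINTS = [(1_000, 3.5), (5_000, 15.0), (10_000, 30.0), (32_000, 90.0)]

def estimated_latency(budget_tokens: int) -> float:
    """Linearly interpolate the approximate added latency for a budget."""
    pts = LATENCY_POINTS
    if budget_tokens <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if budget_tokens <= x1:
            return y0 + (y1 - y0) * (budget_tokens - x0) / (x1 - x0)
    return pts[-1][1]  # beyond the table, clamp to the last point

print(estimated_latency(8_000))  # 24.0 (between the 5k and 10k rows)
```

An estimator like this is mainly useful for deciding up front whether a request belongs in an interactive path or a batch queue.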

The Key Point

Extended thinking is a targeted tool, not a universal upgrade.

Enable it for: Hard math, formal proofs, complex algorithm design, debugging tricky multi-step logic, problems where being wrong has a high cost.

Skip it for: Chat, creative tasks, summarization, translation, simple Q&A, anything where latency matters and the task doesn't require deep reasoning.

A reasonable default: disable extended thinking globally, enable it in specific call sites for known hard problems. The 6x cost increase is only justified when the quality improvement is measurable and matters.

Methodology

All performance figures in this article are sourced from publicly available benchmarks (e.g., MATH-500) and provider pricing pages verified on 2026-04-16. Pricing is listed per million tokens unless noted otherwise. Figures reflect the date of publication and will change as models are updated.
