
Chain-of-Thought Prompting (2026)

Quick Answer

Chain-of-thought (CoT) prompting improves LLM accuracy on complex reasoning tasks by instructing the model to show its work before giving a final answer. Simply adding 'Think step by step' to a prompt (zero-shot CoT) often boosts performance by 20–40% on math and logic problems. For the best results, combine CoT with few-shot examples showing good reasoning traces.

When to Use

  • Math word problems or any task requiring arithmetic across multiple steps
  • Logic puzzles, deductive reasoning, or constraint-satisfaction tasks
  • Code debugging where you want the model to trace through the logic before proposing a fix
  • Multi-condition decisions (e.g., eligibility checks with 5+ criteria)
  • Tasks where the model consistently gets wrong answers on zero-shot but you suspect it has the underlying knowledge

How It Works

  1. CoT works by forcing the model to generate intermediate reasoning tokens before the final answer. This leverages the autoregressive nature of LLMs — the reasoning tokens become context that constrains the final answer.
  2. Zero-shot CoT: append 'Let's think step by step.' to your prompt. The model generates a reasoning chain, then produces an answer. This alone often closes 50% of the gap with few-shot CoT.
  3. Few-shot CoT: provide 2–4 examples where each example shows the full reasoning chain, not just the answer. The model learns to mimic the depth and structure of your reasoning examples.
  4. To extract just the final answer reliably, use a two-turn pattern: the first prompt generates the reasoning, and a second prompt asks 'Based on the above reasoning, what is the final answer? Reply with only the answer.'
  5. In 2025–2026, frontier models (o3, Claude 3.7 Sonnet with extended thinking, Gemini 2.0 thinking) perform CoT internally using 'thinking tokens' — you may not need to elicit it manually, but explicit CoT still helps smaller models.
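The zero-shot trigger and the two-turn extraction pattern above can be sketched as small prompt-construction helpers. This is a minimal sketch: `call_model` is a hypothetical stand-in for whatever LLM client you actually use, not a real API.

```python
# Sketch of zero-shot CoT plus the two-turn answer-extraction pattern.
# `call_model` is a hypothetical placeholder: a callable taking a prompt
# string and returning the model's completion. Swap in your real client.

def zero_shot_cot_prompt(question: str) -> str:
    """Append the canonical zero-shot CoT trigger phrase."""
    return f"{question}\nLet's think step by step."


def extraction_prompt(question: str, reasoning: str) -> str:
    """Second turn: ask for only the final answer, given the reasoning."""
    return (
        f"{question}\n\n{reasoning}\n\n"
        "Based on the above reasoning, what is the final answer? "
        "Reply with only the answer."
    )


def answer_with_cot(question: str, call_model) -> str:
    reasoning = call_model(zero_shot_cot_prompt(question))      # turn 1: reason
    return call_model(extraction_prompt(question, reasoning))   # turn 2: extract
```

Keeping the two turns separate means the verbose reasoning never reaches downstream parsing code — only the short second-turn reply does.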

Examples

Zero-shot CoT on a math word problem
A store sells notebooks for $2.50 each and pens for $0.75 each. Maria buys 4 notebooks and 6 pens and pays with a $20 bill. How much change does she receive? Let's think step by step.
Output:
Step 1: Cost of notebooks = 4 × $2.50 = $10.00
Step 2: Cost of pens = 6 × $0.75 = $4.50
Step 3: Total cost = $10.00 + $4.50 = $14.50
Step 4: Change = $20.00 − $14.50 = $5.50
Maria receives $5.50 in change.
Few-shot CoT for eligibility logic
Determine if each applicant qualifies for the senior discount (age 65+, annual income under $40,000, resident for 2+ years).

Applicant: Jane, age 67, income $35,000, resident 3 years.
Reasoning: Age 67 ≥ 65 ✓. Income $35,000 < $40,000 ✓. Resident 3 years ≥ 2 ✓. All criteria met.
Decision: QUALIFIES

Applicant: Tom, age 63, income $28,000, resident 5 years.
Reasoning: Age 63 < 65 ✗. Age criterion not met regardless of other factors.
Decision: DOES NOT QUALIFY

Applicant: Rita, age 70, income $42,000, resident 1 year.
Reasoning:
Output:
Age 70 ≥ 65 ✓. Income $42,000 ≥ $40,000 ✗. Income criterion not met.
Decision: DOES NOT QUALIFY
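A few-shot CoT prompt like the eligibility example above can be assembled programmatically from (applicant, reasoning, decision) records. This sketch reuses the worked examples from the transcript; the function name is illustrative, not from any library.

```python
# Minimal sketch: build a few-shot CoT prompt from worked examples.
TASK = (
    "Determine if each applicant qualifies for the senior discount "
    "(age 65+, annual income under $40,000, resident for 2+ years)."
)

EXAMPLES = [
    ("Jane, age 67, income $35,000, resident 3 years.",
     "Age 67 ≥ 65 ✓. Income $35,000 < $40,000 ✓. Resident 3 years ≥ 2 ✓. "
     "All criteria met.",
     "QUALIFIES"),
    ("Tom, age 63, income $28,000, resident 5 years.",
     "Age 63 < 65 ✗. Age criterion not met regardless of other factors.",
     "DOES NOT QUALIFY"),
]


def few_shot_cot_prompt(new_applicant: str) -> str:
    parts = [TASK]
    for applicant, reasoning, decision in EXAMPLES:
        parts.append(
            f"Applicant: {applicant}\nReasoning: {reasoning}\nDecision: {decision}"
        )
    # End with a bare "Reasoning:" so the model continues with its own chain
    # in the same format as the examples.
    parts.append(f"Applicant: {new_applicant}\nReasoning:")
    return "\n\n".join(parts)
```

Ending the prompt at `Reasoning:` is what makes the model emit a chain before the decision, mirroring the examples' structure.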

Common Mistakes

  • Using CoT for simple tasks: Adding 'think step by step' to a basic lookup or trivial classification adds latency and tokens with no quality benefit. Reserve CoT for tasks that genuinely require multi-step reasoning.
  • Accepting reasoning without validating the answer: CoT can produce confident-sounding but incorrect reasoning chains ('hallucinated steps'). Always validate the final answer, especially for math. Use self-consistency (multiple samples + majority vote) for high-stakes outputs.
  • Not extracting the final answer cleanly: CoT outputs mix reasoning and answer. If parsing programmatically, always instruct the model to output the final answer in a structured format at the end, e.g., 'Final answer: X'.
  • Too few reasoning steps in few-shot examples: If your CoT examples skip obvious steps, the model learns to skip them too. Write out every logical step, even the obvious ones.
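The clean-extraction advice above can be enforced with a trailing marker plus a small parser. This is a sketch assuming you have already instructed the model to end its response with 'Final answer: X'.

```python
import re


def extract_final_answer(cot_output: str):
    """Pull the answer from a CoT response ending with 'Final answer: X'.

    Returns None when the marker is missing, so callers can retry or
    fall back instead of trying to parse free-form reasoning text.
    """
    match = re.search(r"Final answer:\s*(.+)", cot_output, re.IGNORECASE)
    return match.group(1).strip() if match else None
```

Returning `None` rather than raising makes the missing-marker case easy to handle with a retry loop in a production pipeline.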

FAQ

Does chain-of-thought work with small models?

CoT is most effective on models with 70B+ parameters. Below ~7B parameters, the reasoning chains are often incoherent or lead to wrong answers. For small models, few-shot CoT with very structured examples works better than zero-shot CoT.

What's the difference between CoT and extended thinking (Claude) or reasoning (o-series)?

Extended thinking and o-series reasoning are built-in CoT — the model generates reasoning tokens internally, often not visible in the final output. Explicit CoT prompting is manual elicitation of the same behavior. Built-in reasoning typically outperforms prompt-elicited CoT, but costs more per token.

How does self-consistency improve CoT?

Self-consistency (Wang et al., 2022) runs the same CoT prompt multiple times (typically 10–40 times with temperature > 0) and takes a majority vote on the final answers. It consistently outperforms single-sample CoT by 5–15% on math benchmarks at the cost of higher inference spend.
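The majority-vote step can be sketched in a few lines. Here `sample_answer` is a hypothetical callable wrapping one temperature > 0 CoT run plus final-answer extraction; it is an assumption, not a real API.

```python
from collections import Counter


def self_consistent_answer(sample_answer, n_samples: int = 10):
    """Self-consistency: sample several CoT chains, majority-vote the answers.

    `sample_answer` is a hypothetical zero-argument callable that runs one
    CoT sample (temperature > 0) and returns its extracted final answer.
    """
    answers = [sample_answer() for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

Because the vote is over extracted final answers, not full reasoning chains, two samples that reason differently but agree on the answer still count as the same vote.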

Should I always show the reasoning in the final output?

Not necessarily. In production pipelines, verbose reasoning chains increase output tokens (cost) and may confuse end users. Use two-turn prompting to generate reasoning privately, then extract the answer. Or use a model with internal reasoning that doesn't stream the thinking tokens.

Can CoT help with hallucination?

Partially. CoT forces the model to generate a reasoning path that can be inspected and validated. However, models can hallucinate entire reasoning chains ('galaxy-brained' reasoning). CoT reduces hallucination on tasks where the model has the relevant knowledge, but the benefit is limited on tasks requiring up-to-date factual recall.
