Chain-of-Thought Prompting (2026)
Chain-of-thought (CoT) prompting improves LLM accuracy on complex reasoning tasks by instructing the model to show its work before giving a final answer. Simply appending 'Let's think step by step' to a prompt (zero-shot CoT) often boosts performance by 20–40% on math and logic problems. For the best results, combine CoT with few-shot examples that show good reasoning traces.
When to Use
- ✓ Math word problems or any task requiring arithmetic across multiple steps
- ✓ Logic puzzles, deductive reasoning, or constraint-satisfaction tasks
- ✓ Code debugging where you want the model to trace through the logic before proposing a fix
- ✓ Multi-condition decisions (e.g., eligibility checks with 5+ criteria)
- ✓ Tasks where the model consistently gets wrong answers zero-shot but you suspect it has the underlying knowledge
How It Works
1. CoT works by forcing the model to generate intermediate reasoning tokens before the final answer. This leverages the autoregressive nature of LLMs — the reasoning tokens become context that constrains the final answer.
2. Zero-shot CoT: append 'Let's think step by step.' to your prompt. The model generates a reasoning chain, then produces an answer. This alone often closes about half of the gap to few-shot CoT.
3. Few-shot CoT: provide 2–4 examples, each showing the full reasoning chain, not just the answer. The model learns to mimic the depth and structure of your reasoning examples.
4. To extract just the final answer reliably, use a two-turn pattern: the first prompt generates the reasoning; a second prompt asks 'Based on the above reasoning, what is the final answer? Reply with only the answer.'
5. In 2025–2026, frontier models (o3, Claude 3.7 Sonnet with extended thinking, Gemini 2.0 thinking) perform CoT internally using 'thinking tokens' — you may not need to elicit it manually, but explicit CoT still helps smaller models.
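The zero-shot trigger and the two-turn extraction pattern above can be sketched as plain prompt-building helpers. This is a minimal sketch: `call_llm` is a hypothetical stand-in for whatever model client you use, not a real API.

```python
# Sketch of zero-shot CoT plus the two-turn extraction pattern.
# `call_llm(prompt) -> str` is a hypothetical stand-in for your client.

def zero_shot_cot(question: str) -> str:
    """Append the zero-shot CoT trigger phrase to a prompt."""
    return f"{question}\n\nLet's think step by step."

def extraction_prompt(reasoning: str) -> str:
    """Second turn: ask for only the final answer, given the reasoning."""
    return (
        f"{reasoning}\n\n"
        "Based on the above reasoning, what is the final answer? "
        "Reply with only the answer."
    )

def answer_with_cot(question: str, call_llm) -> str:
    """Two-turn pattern: elicit the reasoning, then extract a clean answer."""
    reasoning = call_llm(zero_shot_cot(question))
    return call_llm(extraction_prompt(reasoning)).strip()
```

Keeping the two turns separate means the verbose reasoning never reaches whatever parses the answer downstream.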
Examples
A store sells notebooks for $2.50 each and pens for $0.75 each. Maria buys 4 notebooks and 6 pens and pays with a $20 bill. How much change does she receive? Let's think step by step.

Determine if each applicant qualifies for the senior discount (age 65+, annual income under $40,000, resident for 2+ years).
Applicant: Jane, age 67, income $35,000, resident 3 years.
Reasoning: Age 67 ≥ 65 ✓. Income $35,000 < $40,000 ✓. Resident 3 years ≥ 2 ✓. All criteria met.
Decision: QUALIFIES
Applicant: Tom, age 63, income $28,000, resident 5 years.
Reasoning: Age 63 < 65 ✗. Age criterion not met regardless of other factors.
Decision: DOES NOT QUALIFY
Applicant: Rita, age 70, income $42,000, resident 1 year.
Reasoning:

Common Mistakes
- ✗ Using CoT for simple tasks: Adding 'think step by step' to a basic lookup or trivial classification adds latency and tokens with no quality benefit. Reserve CoT for tasks that genuinely require multi-step reasoning.
- ✗ Accepting reasoning without validating the answer: CoT can produce confident-sounding but incorrect reasoning chains ('hallucinated steps'). Always validate the final answer, especially for math. Use self-consistency (multiple samples + majority vote) for high-stakes outputs.
- ✗ Not extracting the final answer cleanly: CoT outputs mix reasoning and answer. If parsing programmatically, always instruct the model to output the final answer in a structured format at the end, e.g., 'Final answer: X'.
- ✗ Too few reasoning steps in few-shot examples: If your CoT examples skip obvious steps, the model learns to skip them too. Write out every logical step, even the obvious ones.
FAQ
Does chain-of-thought work with small models?
CoT is most effective on models with 70B+ parameters. Below ~7B parameters, the reasoning chains are often incoherent or lead to wrong answers. For small models, few-shot CoT with very structured examples works better than zero-shot CoT.
What's the difference between CoT and extended thinking (Claude) or reasoning (o-series)?
Extended thinking and o-series reasoning are built-in CoT — the model generates reasoning tokens internally, often not visible in the final output. Explicit CoT prompting is manual elicitation of the same behavior. Built-in reasoning typically outperforms prompt-elicited CoT, but costs more per token.
How does self-consistency improve CoT?
Self-consistency (Wang et al., 2022) runs the same CoT prompt multiple times (typically 10–40 times with temperature > 0) and takes a majority vote on the final answers. It consistently outperforms single-sample CoT by 5–15% on math benchmarks at the cost of higher inference spend.
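A minimal sketch of that majority-vote loop, assuming a hypothetical `call_llm` sampler that runs at temperature > 0 and returns only the final answer for each sample:

```python
from collections import Counter

def self_consistent_answer(question: str, call_llm, n: int = 10) -> str:
    """Self-consistency: sample n CoT completions, majority-vote the answers.

    `call_llm(prompt) -> str` is a hypothetical stand-in for a sampling
    client (temperature > 0 so the reasoning paths differ between calls);
    here it is assumed to return just the final answer string.
    """
    prompt = f"{question}\n\nLet's think step by step."
    answers = [call_llm(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In practice you would normalize the answers (strip whitespace, canonicalize numbers) before voting, or the vote fragments across formatting variants of the same answer.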
Should I always show the reasoning in the final output?
Not necessarily. In production pipelines, verbose reasoning chains increase output tokens (cost) and may confuse end users. Use two-turn prompting to generate reasoning privately, then extract the answer. Or use a model with internal reasoning that doesn't stream the thinking tokens.
Can CoT help with hallucination?
Partially. CoT forces the model to generate a reasoning path that can be inspected and validated. However, models can hallucinate entire reasoning chains ('galaxy-brained' reasoning). CoT reduces hallucination on tasks where the model has the relevant knowledge, but its benefit is limited on tasks requiring up-to-date factual recall.