
Self-Consistency Prompting (2026)

Quick Answer

Self-consistency (Wang et al., 2022) improves reasoning accuracy by generating multiple independent reasoning paths for the same question and aggregating the final answers by majority vote. It consistently outperforms single-sample chain-of-thought by 5–20% on math, logic, and commonsense reasoning benchmarks — at the cost of 5–40x more inference calls.

When to Use

  • Math word problems or numerical reasoning where errors in a single reasoning path are common
  • High-stakes decisions where you want confidence estimates alongside the answer
  • When a single CoT run produces different answers across multiple attempts (indicating the model is uncertain)
  • Benchmarking model quality — self-consistency gives a ceiling estimate of what the model can achieve on a task
  • When accuracy matters more than latency or cost (e.g., batch offline processing)

How It Works

  1. Set temperature to 0.5–1.0 (you need diversity — not the same greedy output every time).
  2. Run the same chain-of-thought prompt N times (typically 10–40 samples). Each run explores a different reasoning path.
  3. Extract the final answer from each run. For math, this is the numeric result. For classification, the class label.
  4. Take the majority vote: the most frequently occurring answer is selected as the final answer.
  5. Optionally, use the vote distribution as a confidence score: if 9/10 runs agree, confidence is high. If 5/10 agree, treat the output with more suspicion.

Examples

Self-consistency with majority vote in Python
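A minimal sketch of the loop described above. `call_model` is a hypothetical stand-in for your LLM client; the only requirements are that it accepts a prompt and a temperature, and that your prompt instructs the model to end with a line like "Final answer: X".

```python
import re
from collections import Counter


def extract_answer(text):
    # Strict extraction: parse only the "Final answer: X" line,
    # never intermediate reasoning steps.
    m = re.search(r"Final answer:\s*(.+)", text)
    return m.group(1).strip() if m else None


def self_consistency(prompt, call_model, n=10):
    """Run the same CoT prompt n times and majority-vote the parsed answers."""
    answers = []
    for _ in range(n):
        completion = call_model(prompt, temperature=0.7)  # diversity is required
        ans = extract_answer(completion)
        if ans is not None:
            answers.append(ans)
    if not answers:
        raise ValueError("no parseable answers")
    winner, count = Counter(answers).most_common(1)[0]
    confidence = count / len(answers)  # vote share doubles as a confidence score
    return winner, confidence
```

The returned vote share is what step 5 above uses as a confidence signal.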
Confidence-gated self-consistency (only retry if uncertain)
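One way to implement the confidence-gated pattern: start with a small sample budget and spend more calls only when the vote is close. `sample_answers` is a hypothetical helper that runs the prompt n times and returns the parsed final answers.

```python
from collections import Counter


def gated_self_consistency(prompt, sample_answers, n_initial=5,
                           n_retry=15, threshold=0.8):
    """Escalate from a cheap vote to a larger one only when confidence is low."""
    answers = list(sample_answers(prompt, n_initial))
    winner, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= threshold:
        return winner, count / len(answers)       # confident: stop early, save cost
    answers += list(sample_answers(prompt, n_retry))  # uncertain: buy more samples
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

On easy questions this costs only `n_initial` calls; the full budget is spent only on the inputs where the model disagrees with itself.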

Common Mistakes

  • Using temperature=0 (greedy): Self-consistency requires diverse reasoning paths. With temperature=0, every run produces the same output — you're paying for N calls but getting no benefit. Use temperature 0.5–1.0.
  • Running too few samples: 3–5 samples often don't produce enough diversity for a reliable majority vote. Use at least 10 for meaningful results; research shows gains plateau around 40.
  • Not parsing the final answer reliably: If you extract answers inconsistently (sometimes grabbing intermediate steps), the vote becomes unreliable. Use a strict extraction pattern like 'Final answer: X' and parse only that.
  • Applying self-consistency to factual recall tasks: If the model doesn't know a fact, running it 20 times won't help — it will just hallucinate the same wrong fact consistently. Self-consistency helps reasoning, not knowledge retrieval.

FAQ

How much does self-consistency improve accuracy?

Wang et al. (2022) showed 17.9% improvement over standard CoT on GSM8K (math), and similar gains on reasoning benchmarks. In practice, 10 samples gives most of the benefit — gains taper significantly after 20–40 samples.

Is self-consistency cost-effective?

For batch offline workloads where accuracy matters, yes. For real-time user-facing features, running 10 sequential calls plus aggregation is usually prohibitive; if you fire the calls in parallel instead, the wall-clock overhead shrinks to roughly that of a single call plus aggregation, though token cost still scales with N.

Can I run the N calls in parallel?

Yes — this is how you make self-consistency viable for near-real-time use. Use asyncio or batch API calls to fire all N requests simultaneously. The effective wall-clock latency becomes roughly that of a single call (bounded by the slowest of the N requests), not N×.
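A sketch of the parallel pattern, assuming a hypothetical async client function `acall_model` that returns one completion per call:

```python
import asyncio
from collections import Counter


async def parallel_self_consistency(prompt, acall_model, n=10):
    """Fire all n samples concurrently; wall-clock time ~= one call, not n."""
    completions = await asyncio.gather(*(acall_model(prompt) for _ in range(n)))
    answers = [c.rsplit("Final answer:", 1)[-1].strip()
               for c in completions if "Final answer:" in c]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

In practice you would also bound concurrency (e.g. with `asyncio.Semaphore`) to stay under your provider's rate limits.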

What's the difference between self-consistency and ensemble prompting?

Self-consistency uses the same prompt multiple times with temperature > 0. Ensemble prompting uses different prompt phrasings or different models and aggregates results. Ensemble approaches are more expensive but can be more effective when individual prompts have systematic biases.

Does self-consistency work for open-ended generation?

Not with simple majority vote, since two generated essays will never be identical. For open-ended tasks, use LLM-as-judge to score multiple outputs and select the best, or use nucleus sampling + reranking instead of majority vote.
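A hedged sketch of the judge-based variant for open-ended tasks, with `call_model` and `judge_score` as hypothetical stand-ins for your generator and a scoring call (e.g. an LLM-as-judge that returns a numeric rating):

```python
def best_of_n(prompt, call_model, judge_score, n=5):
    """Generate n candidates and keep the one the judge scores highest."""
    candidates = [call_model(prompt, temperature=0.9) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]
```

This replaces exact-match voting with selection, so it works even when no two outputs are identical.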
