
Self-Consistency Prompting (2026)

Quick Answer

Self-consistency (Wang et al., 2022) improves reasoning accuracy by generating multiple independent reasoning paths for the same question and aggregating the final answers by majority vote. It consistently outperforms single-sample chain-of-thought by 5–20% on math, logic, and commonsense reasoning benchmarks — at the cost of 5–40x more inference calls.

When to Use

  • Math word problems or numerical reasoning where errors in a single reasoning path are common
  • High-stakes decisions where you want confidence estimates alongside the answer
  • When a single CoT run produces different answers across multiple attempts (indicating the model is uncertain)
  • Benchmarking model quality — self-consistency gives a ceiling estimate of what the model can achieve on a task
  • When accuracy matters more than latency or cost (e.g., batch offline processing)

How It Works

  1. Set temperature to 0.5–1.0 (you need diversity — not the same greedy output every time).
  2. Run the same chain-of-thought prompt N times (typically 10–40 samples). Each run explores a different reasoning path.
  3. Extract the final answer from each run. For math, this is the numeric result. For classification, the class label.
  4. Take the majority vote: the most frequently occurring answer is selected as the final answer.
  5. Optionally, use the vote distribution as a confidence score: if 9/10 runs agree, confidence is high. If 5/10 agree, treat the output with more suspicion.

Examples

Self-consistency with majority vote in Python
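A minimal sketch of the loop described above. `call_model` is a hypothetical stand-in for your LLM client; the only requirements are that it accepts a prompt and a temperature, and that your prompt instructs the model to end with a line like "Final answer: X".

```python
import re
from collections import Counter


def extract_answer(text):
    # Strict extraction: parse only the "Final answer: X" line,
    # never intermediate reasoning steps.
    m = re.search(r"Final answer:\s*(.+)", text)
    return m.group(1).strip() if m else None


def self_consistency(prompt, call_model, n=10):
    """Run the same CoT prompt n times and majority-vote the parsed answers."""
    answers = []
    for _ in range(n):
        completion = call_model(prompt, temperature=0.7)  # diversity is required
        ans = extract_answer(completion)
        if ans is not None:
            answers.append(ans)
    if not answers:
        raise ValueError("no parseable answers")
    winner, count = Counter(answers).most_common(1)[0]
    confidence = count / len(answers)  # vote share doubles as a confidence score
    return winner, confidence
```

The returned vote share is what step 5 above uses as a confidence signal.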
Confidence-gated self-consistency (only retry if uncertain)
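One way to implement the confidence-gated pattern: start with a small sample budget and spend more calls only when the vote is close. `sample_answers` is a hypothetical helper that runs the prompt n times and returns the parsed final answers.

```python
from collections import Counter


def gated_self_consistency(prompt, sample_answers, n_initial=5,
                           n_retry=15, threshold=0.8):
    """Escalate from a cheap vote to a larger one only when confidence is low."""
    answers = list(sample_answers(prompt, n_initial))
    winner, count = Counter(answers).most_common(1)[0]
    if count / len(answers) >= threshold:
        return winner, count / len(answers)       # confident: stop early, save cost
    answers += list(sample_answers(prompt, n_retry))  # uncertain: buy more samples
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

On easy questions this costs only `n_initial` calls; the full budget is spent only on the inputs where the model disagrees with itself.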

Common Mistakes

  • Using temperature=0 (greedy): Self-consistency requires diverse reasoning paths. With temperature=0, every run produces the same output — you're paying for N calls but getting no benefit. Use temperature 0.5–1.0.
  • Running too few samples: 3–5 samples often don't produce enough diversity for a reliable majority vote. Use at least 10 for meaningful results; research shows gains plateau around 40.
  • Not parsing the final answer reliably: If you extract answers inconsistently (sometimes grabbing intermediate steps), the vote becomes unreliable. Use a strict extraction pattern like 'Final answer: X' and parse only that.
  • Applying self-consistency to factual recall tasks: If the model doesn't know a fact, running it 20 times won't help — it will just hallucinate the same wrong fact consistently. Self-consistency helps reasoning, not knowledge retrieval.

FAQ

How much does self-consistency improve accuracy?

Wang et al. (2022) showed 17.9% improvement over standard CoT on GSM8K (math), and similar gains on reasoning benchmarks. In practice, 10 samples gives most of the benefit — gains taper significantly after 20–40 samples.

Is self-consistency cost-effective?

For batch offline workloads where accuracy matters, yes. For real-time user-facing features, running 10 sequential calls plus aggregation is usually prohibitive; if you fire the calls in parallel instead, the wall-clock overhead shrinks to roughly that of a single call plus aggregation, though token cost still scales with N.

Can I run the N calls in parallel?

Yes — this is how you make self-consistency viable for near-real-time use. Use asyncio or batch API calls to fire all N requests simultaneously. The effective wall-clock latency becomes roughly that of a single call (bounded by the slowest of the N requests), not N×.
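A sketch of the parallel pattern, assuming a hypothetical async client function `acall_model` that returns one completion per call:

```python
import asyncio
from collections import Counter


async def parallel_self_consistency(prompt, acall_model, n=10):
    """Fire all n samples concurrently; wall-clock time ~= one call, not n."""
    completions = await asyncio.gather(*(acall_model(prompt) for _ in range(n)))
    answers = [c.rsplit("Final answer:", 1)[-1].strip()
               for c in completions if "Final answer:" in c]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)
```

In practice you would also bound concurrency (e.g. with `asyncio.Semaphore`) to stay under your provider's rate limits.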

What's the difference between self-consistency and ensemble prompting?

Self-consistency uses the same prompt multiple times with temperature > 0. Ensemble prompting uses different prompt phrasings or different models and aggregates results. Ensemble approaches are more expensive but can be more effective when individual prompts have systematic biases.

Does self-consistency work for open-ended generation?

Not with simple majority vote, since two generated essays will never be identical. For open-ended tasks, use LLM-as-judge to score multiple outputs and select the best, or use nucleus sampling + reranking instead of majority vote.
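A hedged sketch of the judge-based variant for open-ended tasks, with `call_model` and `judge_score` as hypothetical stand-ins for your generator and a scoring call (e.g. an LLM-as-judge that returns a numeric rating):

```python
def best_of_n(prompt, call_model, judge_score, n=5):
    """Generate n candidates and keep the one the judge scores highest."""
    candidates = [call_model(prompt, temperature=0.9) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]
```

This replaces exact-match voting with selection, so it works even when no two outputs are identical.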
