
Tree of Thought Prompting (2026)

Quick Answer

Tree of Thought (Yao et al., 2023) prompts the model to generate multiple candidate 'thoughts' at each step, evaluate which are most promising, and explore only the best branches — like a search tree. It dramatically outperforms linear chain-of-thought on planning, puzzles, and tasks requiring lookahead, but is expensive (many LLM calls) and overkill for most production tasks.

When to Use

  • Creative writing where you want to explore multiple plot directions and pick the best
  • Code generation for complex algorithms where multiple approaches should be compared
  • Math olympiad-style problems requiring backtracking when a path fails
  • Multi-step planning tasks where early decisions significantly affect downstream options
  • Any task where chain-of-thought fails because the model commits to a wrong path and can't recover

How It Works

  1. At each reasoning step, instead of generating one next thought, generate K candidate thoughts (typically 3–5).
  2. Use a separate evaluation prompt to score each candidate thought: 'Given the problem and the current state, rate each candidate thought as: sure/likely/impossible.'
  3. Select the top-rated thoughts and expand only those — pruning unpromising branches early.
  4. Repeat until a terminal state (solution or max depth) is reached. Use BFS for broad exploration or DFS for depth.
  5. The key insight: the LLM evaluates intermediate steps, not just final answers. This lets it backtrack from dead ends — something standard CoT cannot do.
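The loop above can be sketched as a beam search over thoughts. This is a minimal illustrative skeleton, not the paper's implementation: `generate` and `evaluate` stand in for the two LLM prompts (candidate generation and sure/likely/impossible scoring), and here they are replaced by toy functions on a number-building problem so the sketch runs standalone.

```python
# BFS-style Tree of Thought skeleton: expand K candidates per state,
# score every candidate, keep only the best `beam` states, repeat.
def tree_of_thought(root, generate, evaluate, k=3, beam=2, depth=3):
    frontier = [root]
    for _ in range(depth):
        candidates = []
        for state in frontier:
            candidates.extend(generate(state, k))  # K candidate thoughts
        # Evaluation-based pruning: sort by score, keep the top `beam`.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=evaluate)

# Toy stand-ins for the LLM calls. Real versions would prompt a model
# to propose next thoughts and to rate each one.
def generate(state, k):
    return [state + [n] for n in range(1, k + 1)]  # append 1..k

def evaluate(state):
    return -abs(10 - sum(state))  # closer to a target sum of 10 is better

best = tree_of_thought([], generate, evaluate, k=3, beam=2, depth=4)
print(best, sum(best))  # a 4-step path summing to exactly 10
```

Swapping in real LLM calls for `generate` and `evaluate` (and a real scoring scheme, e.g. mapping sure/likely/impossible to numbers) turns this skeleton into the full method.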

Examples

ToT for creative writing direction selection
I'm writing a thriller novel. The protagonist just discovered their mentor is the villain.

Generate 3 different directions the story could go next (each in 2-3 sentences):

Direction 1:
Direction 2:
Direction 3:

---
Now evaluate each direction: which creates the most narrative tension and gives the protagonist meaningful agency? Score each as Strong/Moderate/Weak and explain why.

Finally, write the first paragraph of the best direction.
Output:
Direction 1: The protagonist confronts the mentor directly, triggering an immediate chase. (Strong — high tension, immediate conflict, tests protagonist's courage)
Direction 2: The protagonist pretends not to know and tries to gather more evidence. (Strong — psychological tension, strategic depth, delays confrontation interestingly)
Direction 3: The protagonist flees and calls the police. (Weak — removes protagonist agency, resolves too quickly)

Best direction: Direction 2

She kept her face neutral as he poured the tea, her hands steady despite the hammering in her chest. Twenty-three years she had trusted him — and now every memory felt like a wound reopening. She smiled, asked about the new exhibit opening, and began counting the exits.

Common Mistakes

  • Using ToT for simple tasks: ToT requires many LLM calls (branching factor × depth). For a task that chain-of-thought handles well, ToT is pure overhead. Reserve it for tasks where CoT consistently fails.
  • Not pruning aggressively: Without evaluation-based pruning, ToT degenerates into an expensive brute-force search. The evaluator prompt is as important as the generation prompt.
  • Evaluating branches with the same model that generated them: The model tends to rate its own outputs highly regardless of quality. Consider using a separate, more powerful model as the evaluator.
  • Setting branching factor too high: K=10 at each step with depth 4 means 10,000 possible paths. Start with K=3 and depth ≤ 4. Most of the benefit comes from the early pruning steps.
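The arithmetic behind that last point is worth making concrete, since path count grows exponentially in depth:

```python
# Number of complete root-to-leaf paths with branching factor k and depth d.
def total_paths(k, depth):
    return k ** depth

print(total_paths(10, 4))  # 10000 paths: intractable without heavy pruning
print(total_paths(3, 4))   # 81 paths: manageable with a small beam
```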

FAQ

When does ToT outperform chain-of-thought?

The original paper showed ToT solved 74% of Game of 24 problems vs 4% for standard CoT. The gap is largest on tasks requiring lookahead and backtracking — puzzles, planning, and creative tasks with many valid directions. For linear reasoning tasks (math word problems), the improvement over self-consistency CoT is marginal.

How expensive is ToT?

With branching factor K=3 and depth D=4, you need roughly K^D = 3^4 = 81 generation calls plus 81 evaluation calls ≈ 160 LLM calls per query. At $0.003/call this is about $0.48 per query — feasible for high-value tasks, not for high-volume production. Use cheaper models for pruning steps.
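That estimate is easy to parameterize for your own K, depth, and pricing. The `tot_cost` helper below is an illustrative back-of-envelope calculation under the assumptions above (one evaluation call per generated thought), not a real pricing API:

```python
# Rough per-query cost of full ToT: generation calls plus one
# evaluation call per generated thought.
def tot_cost(k=3, depth=4, price_per_call=0.003):
    generation_calls = k ** depth          # 3^4 = 81
    evaluation_calls = generation_calls    # one score per thought
    return (generation_calls + evaluation_calls) * price_per_call

print(round(tot_cost(), 2))  # ~0.49 dollars per query with the defaults
```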

Is there a simpler approximation of ToT?

Yes. A simple ToT approximation is: generate 3 options, have the model vote on which is best, then continue from the winner. This two-prompt version captures most of the benefit at 3–4x the cost of single CoT, vs 100x for full ToT. It's often the right trade-off.
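One way to wire up that two-prompt version. This is a sketch: `call_llm` is a hypothetical placeholder for your model API, replaced here by a deterministic stub so the code runs standalone.

```python
# Two-prompt ToT approximation: sample N options, one voting call
# picks the winner, then continue from it.
def call_llm(prompt):
    # Stub: a real implementation would call your model API here.
    if prompt.startswith("VOTE"):
        return "2"  # pretend the model voted for option 2
    return f"option for: {prompt}"

def simple_tot(task, n=3):
    # Step 1: n cheap generation calls for candidate continuations.
    options = [call_llm(f"{task} (variant {i})") for i in range(1, n + 1)]
    # Step 2: a single voting call selects the most promising one.
    numbered = "\n".join(f"{i + 1}. {opt}" for i, opt in enumerate(options))
    winner = int(call_llm(f"VOTE on the best option:\n{numbered}"))
    return options[winner - 1]

print(simple_tot("Plan the next chapter"))
```

This costs n + 1 calls per step (4 with n=3), which is where the 3–4x figure comes from.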

Are there open-source implementations of ToT?

Yes. The original paper's code is at github.com/princeton-nlp/tree-of-thought-llm. LangGraph also supports graph-based workflows that implement ToT-style exploration natively, and LangChain provides a ToT chain implementation as well.

How does ToT relate to MCTS (Monte Carlo Tree Search)?

MCTS, used in AlphaGo, is a more sophisticated tree search that uses rollouts to estimate state value. ToT is simpler — it uses the LLM as both the generator and evaluator, with explicit pruning rather than statistical rollouts. MCTS-inspired LLM methods like AlphaCode 2 combine both approaches.
