# Temperature and Sampling Parameters (2026)
Temperature controls randomness: 0 makes outputs deterministic (always picks the highest-probability token), 1 samples proportionally to the model's probability distribution, and values above 1 increase diversity at the cost of coherence. For factual extraction and classification, use temperature 0. For creative tasks, start at 0.7–1.0 and tune from there.
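To make the logit-rescaling concrete, here is a minimal sketch in plain Python. The toy three-token vocabulary is illustrative only; real models apply the same arithmetic over a 50k+ token vocabulary.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, rescaling by temperature first.

    temperature == 0 is treated as greedy (argmax) decoding;
    temperature > 1 flattens the distribution toward uniform.
    """
    if temperature == 0:
        # Greedy: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))  # the model's own distribution
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more diversity
```

Lowering the temperature concentrates mass on the already-likely token; raising it spreads mass toward the tail.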
## When to Use
- Set temperature to 0 for deterministic tasks: classification, structured extraction, and code generation where you want reproducible results
- Use temperature 0.7–1.0 when generating marketing copy, stories, or brainstorming lists where variety improves quality
- Lower top-p (0.9 → 0.7) when the model produces incoherent or off-topic text despite a reasonable temperature
- Use frequency_penalty when the model keeps repeating the same phrases in long-form outputs
- Tune temperature as part of a systematic eval: don't guess; measure quality across a held-out set
## How It Works
1. At each token step, the model produces a probability distribution over its full vocabulary (50k+ tokens). Temperature T rescales the logits before softmax: divide each logit by T. T=0 is treated as greedy decoding (always pick the argmax); T>1 flattens the distribution.
2. Top-p (nucleus sampling) restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Top-p=0.95 means that at each step only tokens which together account for 95% of the probability mass are considered; the long tail is excluded.
3. Top-k limits sampling to the k highest-probability tokens regardless of their probability values. Top-k=50 restricts each step to the 50 most likely tokens, sampled by their renormalized probabilities. It's less adaptive than top-p and rarely used alone on modern models.
4. Frequency penalty reduces the logit of a token in proportion to how often it has already appeared in the output, discouraging repetition. Presence penalty applies a flat penalty to any token that has appeared at all.
5. Temperature and top-p interact: with temperature=1 and top-p=1 you sample directly from the model's unmodified distribution. Most practitioners set temperature and leave top-p at 0.95 or 1.0, reducing top-p only if the model generates incoherent text.
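The five mechanics above can be combined into a single illustrative decoding step. This is a reference sketch, not any provider's actual implementation; the additive penalty formulas follow OpenAI's documented scheme, and the toy logits are assumptions for concreteness.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_p=1.0, top_k=0,
                      frequency_penalty=0.0, presence_penalty=0.0,
                      counts=None, rng=random):
    """One decoding step combining temperature, top-k, top-p, and penalties.

    `counts` maps token id -> occurrences so far in the output. Penalties
    are applied additively to the logits (OpenAI-style).
    """
    counts = counts or {}
    # 1. Apply frequency/presence penalties to the raw logits.
    adjusted = []
    for tok, logit in enumerate(logits):
        n = counts.get(tok, 0)
        logit -= frequency_penalty * n               # grows with each repeat
        logit -= presence_penalty * (1 if n else 0)  # flat, once per token
        adjusted.append(logit)
    # 2. Temperature: T=0 means greedy decoding (argmax).
    if temperature == 0:
        return max(range(len(adjusted)), key=lambda t: adjusted[t])
    m = max(adjusted)
    probs = [math.exp((l - m) / temperature) for l in adjusted]
    total = sum(probs)
    probs = [p / total for p in probs]
    # 3. Rank tokens, then apply the top-k and top-p (nucleus) filters.
    order = sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cum = [], 0.0
    for tok in order:
        kept.append(tok)
        cum += probs[tok]
        if cum >= top_p:
            break  # smallest prefix whose cumulative mass reaches top_p
    # 4. Renormalize over the surviving tokens and sample.
    mass = sum(probs[t] for t in kept)
    r = rng.random() * mass
    for tok in kept:
        r -= probs[tok]
        if r <= 0:
            return tok
    return kept[-1]
```

Note the ordering: penalties adjust logits first, temperature rescales them, and only then do top-k/top-p prune the candidate set; real inference stacks vary in the exact order they apply these filters.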
## Examples
**Example 1: deterministic extraction (temperature 0)**

> Extract the invoice number, date, and total from the following invoice text. Return JSON only.
>
> Invoice #INV-2024-0847
> Date: March 15, 2026
> Services: API consulting (40 hrs @ $150) = $6,000
> Total due: $6,000

**Example 2: creative generation (temperature 0.7–1.0)**

> Generate 5 punchy product taglines for an AI cost-tracking tool called LLMversus. Each should be under 10 words. Be creative and unexpected.

## Common Mistakes
- Using high temperature for structured output: temperature above ~0.3 on JSON extraction raises the risk of malformed JSON, because sampling can select tokens that break schema constraints.
- Setting both temperature and top-p low simultaneously: temperature=0.3 with top-p=0.5 over-constrains the model and produces repetitive, bland output. Pick one primary control.
- Assuming temperature=0 means 100% deterministic: some model serving infrastructure uses batching that introduces tiny floating-point differences. For true determinism, also fix the random seed at the API level if available.
- Not measuring the effect: changing temperature without an eval set means you can't tell whether quality improved. Always test on 20+ samples from your actual distribution.
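For structured-extraction tasks, "measure, don't guess" can start as simply as scoring the valid-JSON rate across repeated runs per temperature setting. The `outputs_t0` / `outputs_t09` lists below are hypothetical stand-ins for real model outputs collected at temperature 0 and 0.9.

```python
import json

def valid_json_rate(outputs):
    """Fraction of model outputs that parse as JSON: a crude but useful
    first metric when tuning temperature for structured extraction."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

# In practice, collect N outputs per temperature setting from your
# held-out inputs. Hypothetical results for illustration:
outputs_t0 = ['{"invoice": "INV-2024-0847"}'] * 20
outputs_t09 = ['{"invoice": "INV-2024-0847"}'] * 15 + ['Sure! Here is'] * 5
print(valid_json_rate(outputs_t0))   # 1.0
print(valid_json_rate(outputs_t09))  # 0.75
```

A fuller eval would also check field-level accuracy against gold labels, but parse rate alone often exposes a temperature that is too high.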
## FAQ
**What temperature should I use for coding tasks?**

For correctness-critical code generation, use temperature 0–0.2. For exploratory coding or generating multiple solution candidates, use 0.7–1.0 and pick the best. For important code, many practitioners combine a moderate temperature (so candidates actually differ) with self-consistency: generate several solutions and keep the majority answer.
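The self-consistency step reduces to a majority vote over candidate answers; how you canonicalize "the same answer" is up to you. The candidate strings below are hypothetical.

```python
from collections import Counter

def majority_vote(candidates):
    """Self-consistency: sample several candidates at moderate temperature
    and keep the most frequent answer."""
    return Counter(candidates).most_common(1)[0][0]

# e.g. three generated solutions to the same problem, keyed by some
# canonical form (here, just the returned expression):
winner = majority_vote(["n*(n+1)//2", "sum(range(n+1))", "n*(n+1)//2"])
print(winner)  # n*(n+1)//2
```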
**Does temperature affect reasoning quality in thinking models?**

For models with extended thinking (Claude, o-series), the thinking tokens are generated with their own internal temperature setting, so adjusting the output temperature has less impact on reasoning quality than with standard models. The usual recommendation is to leave temperature at 1.0 for thinking models and let the internal process drive quality.
**What's the practical difference between top-p 0.95 and 1.0?**

Minimal for most tasks. Top-p=1.0 allows the full probability distribution; top-p=0.95 excludes the bottom 5% of probability mass, which typically consists of unlikely or incoherent tokens. The difference is most noticeable in very open-ended creative tasks where the model might otherwise sample rare tokens.
**Is there a temperature equivalent in Claude, Gemini, and OpenAI?**

Yes, all major APIs support temperature. Claude uses a 0–1 scale (default 1); OpenAI and Gemini use a 0–2 scale. The semantics are similar but not identical: temperature=1 in Claude is not the same as temperature=1 in OpenAI. Always benchmark on your specific model.
**When should I use repetition_penalty vs frequency_penalty?**

frequency_penalty (OpenAI-style APIs) scales with how many times a token has appeared: the more often, the bigger the penalty. repetition_penalty (Hugging Face/local models) is a multiplicative factor applied uniformly to every token that has already appeared. Frequency penalty suits long documents where some repetition is natural; repetition_penalty is harsher and better for preventing loops.
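A side-by-side sketch of the two penalty shapes. The additive form follows OpenAI's documented scheme; the multiplicative form mirrors Hugging Face's CTRL-style implementation (positive logits are divided by the penalty, negative logits multiplied, so seen tokens always become less likely).

```python
def apply_frequency_penalty(logits, counts, alpha):
    """Additive, count-scaled (OpenAI-style): penalty grows per repeat."""
    return [l - alpha * counts.get(i, 0) for i, l in enumerate(logits)]

def apply_repetition_penalty(logits, seen, penalty):
    """Multiplicative, flat (Hugging Face-style): applied once to any
    token that has appeared at all, regardless of how many times."""
    out = []
    for i, l in enumerate(logits):
        if i in seen:
            out.append(l / penalty if l > 0 else l * penalty)
        else:
            out.append(l)
    return out

logits = [2.0, -1.0, 0.5]
print(apply_frequency_penalty(logits, {0: 3}, 0.5))  # [0.5, -1.0, 0.5]
print(apply_repetition_penalty(logits, {0, 1}, 2.0)) # [1.0, -2.0, 0.5]
```

The contrast is visible in the outputs: the frequency penalty hits token 0 three times as hard as a single occurrence would, while the repetition penalty treats "seen once" and "seen ten times" identically.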