
Few-Shot Prompting (2026)

Quick Answer

Few-shot prompting provides 2–8 input-output examples in the prompt before your actual query. It consistently outperforms zero-shot on extraction, classification, and transformation tasks by showing the model exactly what format and quality you expect. The key is selecting diverse, representative examples — not just easy ones.

When to Use

  • You need consistent output format across many runs (JSON extraction, structured reports)
  • The task has a non-standard output style the model hasn't been explicitly instruction-tuned on
  • Zero-shot produces correct answers but inconsistent formatting
  • You have a small set of labeled examples (even 3–5 is enough to start)
  • The task involves domain-specific jargon or an unusual classification taxonomy

How It Works

  1. You prepend 2–8 demonstration pairs (input → expected output) to the actual query. Each pair is formatted identically.
  2. The model uses in-context learning — it identifies the pattern across the demonstrations and extends it to the new input without any weight update.
  3. Example diversity matters more than count: 3 diverse examples often outperform 8 similar ones because the model needs to see the edges of the task, not just the center.
  4. Example order can affect results: research shows the examples closest to the actual query have the most influence. Put your highest-quality example last.
  5. Token cost scales linearly with demonstration length. For production scale, switch to fine-tuning or retrieval-augmented few-shot prompting (pulling examples from a vector store at runtime).
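The steps above can be sketched as a small prompt builder. This is a minimal illustration, not a provider API: the `build_few_shot_prompt` helper and the `Input:`/`Output:` labels are assumptions chosen to match the examples below.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Prepend identically formatted input/output pairs to the actual query."""
    blocks = [instruction, ""]
    for inp, out in examples:
        # Every demonstration uses the same labels, in the same order.
        blocks.append(f"Input: {inp}")
        blocks.append(f"Output: {out}")
        blocks.append("")
    # The real query reuses the labels, leaving Output open for the model.
    blocks.append(f"Input: {query}")
    blocks.append("Output:")
    return "\n".join(blocks)

examples = [
    ("hey can we push the meeting? i'm slammed",
     "Could we please reschedule our meeting?"),
    ("the numbers look kinda off tbh",
     "I've noticed some potential discrepancies in the figures."),
]
prompt = build_few_shot_prompt(
    "Rewrite each message in a formal business tone.",
    examples,
    "just ping me whenever, i'll be around",
)
print(prompt)
```

Because the pattern, not the instruction, carries most of the signal, keeping every pair structurally identical is the part worth automating.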

Examples

Named entity extraction with consistent JSON output
Extract company name, person name, and dollar amount from each sentence. Output as JSON.

Input: Satya Nadella announced that Microsoft would invest $13 billion in OpenAI.
Output: {"company": "Microsoft", "person": "Satya Nadella", "amount": "$13 billion"}

Input: Sam Altman confirmed OpenAI raised $40 billion from SoftBank.
Output: {"company": "OpenAI", "person": "Sam Altman", "amount": "$40 billion"}

Input: Jensen Huang said Nvidia's data center revenue hit $47.5 billion last quarter.
Output: {"company": "Nvidia", "person": "Jensen Huang", "amount": "$47.5 billion"}

Tone transformation (casual → professional)
Rewrite each message in a formal business tone.

Input: hey can we push the meeting? i'm slammed
Output: Could we please reschedule our meeting? I have a conflicting commitment at that time.

Input: the numbers look kinda off tbh
Output: I've noticed some potential discrepancies in the figures that may warrant a closer review.

Input: just ping me whenever, i'll be around
Output: Please feel free to reach out at your convenience — I will be available.

Common Mistakes

  • Using only easy or central examples: If all your examples are straightforward, the model won't know how to handle edge cases. Always include at least one example near the boundary of the task.
  • Inconsistent formatting across examples: If example 1 uses 'Output:' and example 2 uses 'Answer:', the model gets confused about the pattern. Keep all examples structurally identical.
  • Too many examples for simple tasks: Adding 8 examples to a basic sentiment task wastes tokens without improving accuracy. Start with 2–3 and measure.
  • Not separating the system instruction from the examples: Mixing 'You are an extractor...' with examples in the same paragraph creates ambiguity. Put the instruction in the system prompt and the examples in the user turn.
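The last point can be made concrete. This is a sketch assuming a chat-style message format (`role`/`content` dicts, as in common chat-completion APIs); `make_messages` is a hypothetical helper, not any library's function.

```python
def make_messages(instruction, examples, query):
    """Keep the instruction in the system turn, demonstrations in the user turn."""
    demos = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in examples)
    return [
        # The task description lives only in the system prompt.
        {"role": "system", "content": instruction},
        # The user turn carries the pattern: demos, then the open query.
        {"role": "user", "content": f"{demos}\n\nInput: {query}\nOutput:"},
    ]
```

Separating the two means you can swap example sets without touching the instruction, and vice versa.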

FAQ

How many examples should I use?

Start with 3–5. Research by Min et al. (2022) shows diminishing returns after 8 examples for most tasks. For classification with many classes, aim for at least one example per class. Monitor token cost — each example adds context length.

Does example order matter?

Yes, especially for smaller models. Recency bias means the last 1–2 examples have the most influence. For critical tasks, place your highest-quality example last. For robust results, run multiple orderings and take the majority vote.
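The majority-vote idea can be sketched as follows. `classify` here is a placeholder for whatever function calls your model with an ordered example list and returns a label; the helper name is illustrative.

```python
import itertools
from collections import Counter

def majority_vote(examples, query, classify, max_orderings=6):
    """Run the same few-shot query under several example orderings
    and return the most common label."""
    votes = []
    for perm in itertools.islice(itertools.permutations(examples), max_orderings):
        votes.append(classify(list(perm), query))
    label, _ = Counter(votes).most_common(1)[0]
    return label
```

Capping `max_orderings` matters: with n examples there are n! permutations, and each ordering is a full model call.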

Can I select examples dynamically at runtime?

Yes — retrieval-augmented few-shot prompting embeds your example library and retrieves the k most similar examples to each incoming query. Tools like LlamaIndex and LangChain support this. It typically outperforms static examples on diverse inputs.
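A minimal sketch of the retrieval idea, using a toy bag-of-words cosine similarity in place of real embeddings so it runs with the standard library alone. In practice you would embed with a model and query a vector store; every name here is illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(library, query, k=3):
    """Return the k (input, output) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(library, key=lambda ex: cosine(embed(ex[0]), q), reverse=True)
    return ranked[:k]
```

The selected pairs then get formatted into the prompt exactly like static examples — only the selection step changes.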

When should I fine-tune instead of using few-shot prompting?

Consider fine-tuning when: (1) you have 100+ labeled examples, (2) the task runs at high volume and token cost matters, (3) few-shot with 8 examples still doesn't hit your quality bar. Fine-tuning amortizes the demonstration cost into the model weights.

Do the labels in examples need to be correct?

Surprisingly, Min et al. showed that random labels in few-shot examples barely hurt performance on classification tasks — the format matters more than the label accuracy. However, for extraction and generation tasks, correct outputs in examples matter significantly.
