Golden Datasets: Building Ground Truth for LLM Evaluation (2026)
A golden dataset is your ground truth: carefully labeled input/output pairs that define correct behavior. Reliable evaluation requires at least 100 examples; production monitoring needs 500+. Build it by combining expert annotation, production sampling, and synthetic generation with validation. The golden dataset must be version-controlled, append-only, and regularly audited for label quality.
When to Use
- ✓ Before building any eval pipeline — the eval is only as good as its ground truth
- ✓ Setting up regression tests that will block deploys on quality degradation
- ✓ Training or fine-tuning models where you need high-quality labeled examples
- ✓ Validating that a new model or prompt is better than the baseline on your specific tasks
- ✓ Documenting expected behavior for stakeholder alignment and team onboarding
How It Works
1. Source 1 — Expert annotation: domain experts manually label inputs with correct outputs. Highest quality, most expensive. Use for your most important and complex cases (15–20% of dataset).
2. Source 2 — Production sampling: randomly sample real user queries from production logs, generate outputs with your current best system, then have experts validate and correct them. This captures the real input distribution.
3. Source 3 — Synthetic generation: use an LLM to generate diverse input variations from seed examples, then validate a subset. Best for augmenting coverage of rare cases.
4. Annotation guidelines: write explicit criteria for what makes a good output. Include positive and negative examples. Without clear guidelines, different annotators will label inconsistently, and label quality degrades rapidly.
5. Maintenance: never modify existing examples (append-only). Add 10–20 new examples per week from production failures. Conduct quarterly audits: review 10% of examples for label accuracy, retire stale examples that no longer reflect desired behavior.
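The append-only rule can be enforced mechanically rather than by convention. A minimal sketch, assuming the dataset is stored as JSONL (one example per line) with stable `id` fields; the function name `append_examples` is illustrative:

```python
import json
from datetime import date
from pathlib import Path


def append_examples(path: str, new_examples: list[dict]) -> int:
    """Append new examples without touching existing rows; returns count added."""
    dataset = Path(path)
    existing_ids: set[str] = set()
    if dataset.exists():
        with dataset.open() as f:
            existing_ids = {json.loads(line)["id"] for line in f if line.strip()}

    added = 0
    with dataset.open("a") as f:  # append-only: earlier lines are never rewritten
        for ex in new_examples:
            if ex["id"] in existing_ids:
                continue  # stable ids make repeated appends idempotent
            ex.setdefault("created_at", date.today().isoformat())
            f.write(json.dumps(ex) + "\n")
            added += 1
    return added
```

Because appends are idempotent on `id`, the same weekly batch can be re-run safely, and git diffs on the JSONL file show exactly which examples were added.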
Examples
```python
import json
import random
from typing import Optional

from pydantic import BaseModel


class GoldenExample(BaseModel):
    id: str                                 # Unique, stable identifier
    input: dict                             # The input to the LLM system
    expected_output: str                    # Ground truth output
    output_type: str                        # 'exact', 'semantic', 'schema', 'judge'
    source: str                             # 'expert', 'production', 'synthetic'
    difficulty: str                         # 'easy', 'medium', 'hard', 'edge_case'
    tags: list[str]                         # Task categories
    created_at: str                         # ISO date
    annotator: str                          # Who labeled this
    validation_notes: Optional[str] = None  # Why this expected output is correct


# Load and validate dataset
def load_golden_dataset(path: str) -> list[GoldenExample]:
    with open(path) as f:
        examples = json.load(f)
    return [GoldenExample(**ex) for ex in examples]  # Validates schema


# llm_call and format_examples are application-specific helpers
# (your LLM client wrapper and a function that renders seed examples as text).
def generate_synthetic_examples(seed_examples: list, n: int,
                                task_description: str) -> list:
    prompt = f'''
Here are {len(seed_examples)} examples of {task_description}:
{format_examples(seed_examples)}
Generate {n} diverse new examples following the same format.
Vary: phrasing, complexity, edge cases, domains.
Return as a JSON array with 'input' and 'expected_output' fields.
'''
    generated = llm_call(prompt, json_output=True)
    # Validate a sample (20%) manually before adding to golden set
    sample_for_review = random.sample(generated, max(1, len(generated) // 5))
    print(f'Please review {len(sample_for_review)} examples before adding to dataset:')
    for ex in sample_for_review:
        print(json.dumps(ex, indent=2))
    return generated  # Only add after human review
```
Common Mistakes
- ✗ Training on your golden dataset — the golden dataset must be held-out evaluation data. If you use it to develop or tune your prompts, you overfit to the eval and lose its value as a quality signal.
- ✗ Insufficient coverage of edge cases — golden datasets dominated by easy examples give a false sense of quality. Explicitly target hard cases: ambiguous inputs, incomplete information, adversarial phrasings, conflicting requirements.
- ✗ No inter-annotator agreement checking — when multiple annotators label the same examples, measure their agreement (Cohen's kappa > 0.7 is acceptable). Low agreement reveals ambiguous annotation guidelines that will produce noisy labels.
- ✗ Static datasets that never grow — production LLM systems encounter new input types over time. A golden dataset that never grows will miss emerging failure modes. Budget 30 minutes per week to add new examples from production failures.
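The agreement check above is cheap to run whenever two annotators label the same batch. A minimal, dependency-free sketch of Cohen's kappa (the function name is illustrative; scikit-learn's `cohen_kappa_score` is a production-ready alternative):

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired labels"
    n = len(labels_a)
    # Observed agreement: fraction of examples where the annotators match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa below ~0.7 is a signal to tighten the annotation guidelines before labeling more data, not to average the disagreeing labels away.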
FAQ
How many examples do I need in a golden dataset?
For initial development: 50–100 examples per task category (minimum viable eval). For production regression testing: 200–500 examples total with good difficulty distribution. For fine-tuning ground truth: 1,000+ validated examples. Below 50 examples, variance in evaluation scores makes it hard to detect meaningful regressions.
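The variance point can be made concrete: the standard error of an observed pass rate shrinks as 1/sqrt(n). A small sketch (`pass_rate_stderr` is an illustrative name):

```python
import math


def pass_rate_stderr(pass_rate: float, n: int) -> float:
    """Standard error of an observed pass rate over n examples (binomial)."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n)
```

At n = 50 and an 80% pass rate, the standard error is about 5.7 points, so a 95% interval spans roughly ±11 points; a real 5-point regression is indistinguishable from noise at that size.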
How do I balance the golden dataset across difficulty levels?
Target distribution: 40% easy (should always pass — validates basic functionality), 40% medium (target 70–80% pass rate), 20% hard/edge cases (expected lower pass rates — tracks capability ceiling). If your dataset is all-easy, you'll miss regressions on the hard cases that users care about most.
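One way to keep that balance honest is to check it in CI. A minimal sketch, assuming each example carries the `difficulty` field from the schema earlier (`difficulty_mix` is an illustrative name):

```python
from collections import Counter


def difficulty_mix(examples: list[dict]) -> dict[str, float]:
    """Fraction of the dataset at each difficulty level."""
    counts = Counter(ex["difficulty"] for ex in examples)
    total = len(examples)
    return {
        level: counts.get(level, 0) / total
        for level in ("easy", "medium", "hard", "edge_case")
    }
```

Failing the build when `easy` drifts well above 0.4 keeps an all-easy dataset from silently hiding regressions.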
What do I do when the 'expected output' is subjective?
Use a rubric instead of a specific expected output. For example, instead of 'The capital of France is Paris' as expected output, use criteria: [is_factually_accurate: true, mentions_Paris: true, appropriate_length: true]. Rubric-based evals are less brittle and better for subjective tasks. LLM-as-judge can apply the rubric at scale.
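A rubric can be represented as named predicate checks over the output. A minimal sketch using the Paris example above (`score_against_rubric` and `capital_rubric` are illustrative names; in practice some criteria, like factual accuracy, would be delegated to an LLM judge rather than a lambda):

```python
def score_against_rubric(output: str, rubric: dict) -> dict:
    """Run each named check against the output; returns per-criterion results plus a score."""
    results = {name: check(output) for name, check in rubric.items()}
    results["score"] = sum(results.values()) / len(rubric)  # fraction of criteria met
    return results


capital_rubric = {
    "mentions_paris": lambda out: "paris" in out.lower(),
    "appropriate_length": lambda out: 5 <= len(out.split()) <= 50,
}
```

Because each criterion is scored independently, a failing output tells you which property broke, instead of a single opaque pass/fail.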
How do I handle private data in golden datasets?
Anonymize before adding to the golden dataset: replace names with synthetic names, replace company names with generic identifiers, remove email addresses and phone numbers. Use a data classification step that identifies PII before examples enter the golden set. Store golden datasets with the same access controls as your production data.
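A regex pass catches only the obvious patterns, but it makes a useful first gate before the classification step. A sketch (the patterns and `scrub_pii` name are illustrative; names and company identifiers need NER-based detection on top of this):

```python
import re

# Order matters: scrub emails first so the phone pattern never eats their digits.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "<PHONE>"),
]


def scrub_pii(text: str) -> str:
    """Replace obvious email addresses and phone numbers with placeholders."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Run the scrub before human review, so annotators confirm nothing slipped through rather than doing the redaction themselves.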
Should golden datasets be in code repos or a database?
Both have trade-offs. Code repos (JSON/JSONL files in git): version control, diff visibility, easy PR review for new examples. Eval platforms (Braintrust, LangSmith): collaborative annotation, richer querying, better tooling for large datasets. For under 1,000 examples, git works well. For larger datasets or collaborative teams, a dedicated eval platform is worth the setup.