Evaluation
Benchmark
Quick Answer
A standardized test dataset used to compare the performance of different models.
A benchmark is a standard dataset and evaluation protocol for comparing models fairly. Good benchmarks are diverse, challenging, and reproducible. Well-known examples include MMLU (knowledge), HumanEval (coding), and GSM8K (math), among many others. Benchmarks drive progress but have limitations: models can overfit to published benchmarks, and each benchmark measures specific capabilities, so no single one captures overall quality. Multiple benchmarks give a more complete picture than any single one. Creating representative benchmarks is challenging.
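As a rough illustration of "same dataset, same protocol, several models", the sketch below scores two stand-in models on two toy benchmarks using exact-match accuracy. Everything in it is hypothetical: the benchmark names (`toy_math`, `toy_knowledge`), the example items, and the lookup-table "models" are invented for illustration and are not drawn from MMLU, HumanEval, or GSM8K, whose real protocols differ.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (prompt, expected answer)
Model = Callable[[str], str]       # a model maps a prompt to an answer string

# Hypothetical toy benchmarks; real benchmarks are far larger and more varied.
BENCHMARKS: Dict[str, List[Example]] = {
    "toy_math": [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")],
    "toy_knowledge": [("capital of France", "Paris"), ("largest planet", "Jupiter")],
}


def exact_match_accuracy(model: Model, examples: List[Example]) -> float:
    """Fraction of examples where the model's answer equals the expected one."""
    correct = sum(1 for prompt, expected in examples if model(prompt).strip() == expected)
    return correct / len(examples)


def evaluate(models: Dict[str, Model],
             benchmarks: Dict[str, List[Example]]) -> Dict[str, Dict[str, float]]:
    """Apply the same protocol to every model on every benchmark."""
    return {
        model_name: {
            bench_name: exact_match_accuracy(model, examples)
            for bench_name, examples in benchmarks.items()
        }
        for model_name, model in models.items()
    }


# Stand-in "models": simple lookup tables, used only to make the sketch runnable.
def model_a(prompt: str) -> str:
    answers = {"2+2": "4", "3*3": "9", "10-7": "3", "8/2": "4",
               "capital of France": "Lyon"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    answers = {"2+2": "4", "3*3": "6",
               "capital of France": "Paris", "largest planet": "Jupiter"}
    return answers.get(prompt, "")


scores = evaluate({"model_a": model_a, "model_b": model_b}, BENCHMARKS)
for model_name, per_benchmark in scores.items():
    print(model_name, per_benchmark)
```

In this toy setup, `model_a` wins on `toy_math` and `model_b` wins on `toy_knowledge`, so ranking the two on either benchmark alone would give opposite conclusions. That is the practical reason to report several benchmarks side by side.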
Last verified: 2026-04-08