Evaluation
HumanEval
Quick Answer
A benchmark that measures code-generation ability by checking the functional correctness of model-written solutions to programming tasks.
HumanEval, introduced alongside OpenAI's Codex in 2021, is the de facto standard coding benchmark. It consists of 164 hand-written Python problems, each with a function signature, docstring, and unit tests. Evaluation is strictly pass/fail: a completion passes only if it executes and satisfies all of the task's tests, so the benchmark measures actual functional correctness rather than text similarity. This objectivity has made HumanEval the main arena for competition among coding models, with reported pass rates ranging from near 0% for non-coding models to above 92% for top models. It has well-known limitations: the problems are short and self-contained, far from production code. It nonetheless remains the standard yardstick for comparing models, and harder variants such as HumanEval+ extend it with more rigorous test cases.
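Since evaluation is execution-based, the standard metric is the unbiased pass@k estimator from the original HumanEval paper: draw n samples per task, count the c that pass, and estimate the chance that at least one of k random samples passes. A minimal sketch follows; the pass@k formula is the published one, while check_candidate and the tiny example task are illustrative only (a real harness sandboxes execution in a subprocess with timeouts rather than calling bare exec()):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples drawn per task,
    c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def check_candidate(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Toy pass/fail check: exec the completion, then run the task's tests.
    HumanEval test code defines a check(candidate) function; assumption here
    is that failures surface as exceptions (e.g. AssertionError)."""
    env: dict = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        exec(test_src, env)             # define check(candidate)
        env["check"](env[entry_point])  # raises on any failing assertion
        return True
    except Exception:
        return False

# Tiny made-up task in the HumanEval prompt/test style (illustrative).
candidate = "def add(a, b):\n    return a + b\n"
tests = "def check(candidate):\n    assert candidate(2, 3) == 5\n"
print(check_candidate(candidate, tests, "add"))  # True
print(pass_at_k(n=10, c=4, k=1))                 # 0.4
```

pass@1 with one sample per task reduces to the plain pass rate; the estimator matters when models sample many candidates per problem.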
Last verified: 2026-04-08