87 models ranked by accuracy (%)
HumanEval Leaderboard 2026
HumanEval is OpenAI's code generation benchmark. Models are given Python function signatures and docstrings and must produce correct implementations. The score is pass@1: the percentage of the 164 problems solved on the first attempt.
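As a rough illustration of how a pass@1 percentage can be computed, the sketch below uses the unbiased pass@k estimator described in the original HumanEval paper. With one sample per problem and k = 1 it reduces to "solved / total". The function names and the input layout (a list of `(n, c)` pairs) are assumptions made for this example. This page does not state which harness or sampling setup produced the leaderboard numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: attempt budget.
    """
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k across problems, as a percentage.

    results holds one (n, c) pair per problem (hypothetical layout).
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: one sample per problem, 131 of 164 problems solved.
example = [(1, 1)] * 131 + [(1, 0)] * 33
score = benchmark_score(example)
```

When several samples are drawn per problem, the same function gives the expected first-attempt success rate, for example `pass_at_k(10, 3, 1)` for a problem where 3 of 10 samples pass.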
What HumanEval Tests
Code correctness: given a function signature and a description, the model must write working Python code. Problems cover data structures, algorithms, string manipulation, and math. A score of 80% corresponds to about 131 of 164 problems (0.80 × 164 = 131.2) solved correctly on the first attempt.
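To make the task format concrete, here is a minimal sketch of a problem in the HumanEval style together with a toy checker. The problem `running_max`, its tests, and the `solves` helper are all hypothetical and are not items from the benchmark. The prompt / completion / `check(candidate)` layout mirrors how the public dataset is organized. A real harness would also run untrusted code in a sandbox with a time limit, which this sketch omits.

```python
# Hypothetical problem in the HumanEval style (not an actual benchmark item).
PROMPT = '''def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i + 1].
    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

# What a model might return: only the function body that follows the prompt.
COMPLETION = '''    out, best = [], None
    for x in xs:
        best = x if best is None or x > best else best
        out.append(best)
    return out
'''

# Hidden unit tests, wrapped in a check(candidate) function.
TESTS = '''def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5, -1]) == [-2, -2, -1]
'''


def solves(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Return True if prompt + completion passes every assertion in tests.

    No sandboxing or timeout here: only run this on code you trust.
    """
    namespace: dict = {}
    try:
        exec(prompt + completion + "\n" + tests, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as unsolved.
        return False


solved = solves(PROMPT, COMPLETION, TESTS, "running_max")
```

A problem counts toward the score only if every assertion passes, so a completion that is almost right (for instance one that returns `xs` unchanged) contributes nothing.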
Score Range
0–100% (human baseline ~95%)
Source
OpenAI HumanEval