Accuracy (%), 87 models ranked

HumanEval Leaderboard 2026

HumanEval is OpenAI's code generation benchmark. Each problem gives the model a Python function signature and docstring, and the model must produce an implementation that passes the problem's unit tests. The score is the percentage of the 164 problems solved on the first attempt (pass@1).
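To make the task format concrete, here is a minimal sketch of how one problem could be scored. The problem (`running_max`), the helper names `passes` and `pass_at_1`, and the test cases are all hypothetical illustrations, not items from the benchmark or its official harness. A real harness would also sandbox execution and enforce timeouts, which this sketch omits.

```python
# Hypothetical HumanEval-style task (illustrative, not taken from the benchmark).
PROMPT = '''def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i + 1].

    >>> running_max([1, 3, 2, 5])
    [1, 3, 3, 5]
    """
'''

# A candidate completion as a model might produce it (function body only).
COMPLETION = '''    result, best = [], None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result
'''

# Unit tests for the problem; it counts as solved only if all of them pass.
TEST = '''def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([]) == []
    assert candidate([4, 4, 1]) == [4, 4, 4]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Run prompt + completion against the tests; True if nothing raises."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)  # defines the candidate function
        exec(test, namespace)                 # defines check()
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False


def pass_at_1(results: list[bool]) -> float:
    """Percentage of problems whose single sampled completion passed."""
    return 100.0 * sum(results) / len(results)


solved = passes(PROMPT, COMPLETION, TEST, "running_max")
```

With one completion sampled per problem, the leaderboard-style score is just `pass_at_1` applied to the list of per-problem results.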

Quick Answer

The best model on HumanEval in 2026 is o3 by OpenAI, scoring 98%. Runner-up: o1 (96%).

# | Model | Score
🥇 | o3 | 98%
🥈 | o1 | 96%
🥉 | Claude Opus 4 | 95.2%
4 | Gemini 2.5 Pro | 94%
5 | Qwen 2.5 Coder 32B | 94%
6 | DeepSeek R1 | 93%
7 | GPT-4.5 | 93%
8 | o3-mini | 93%
9 | Qwen 3 235B MoE | 92%
10 | Grok 3 | 92%
11 | Claude Sonnet 4 | 92%
12 | DeepSeek V3 | 92%
13 | Claude 3.5 Sonnet | 92%
14 | o1-mini | 92%
15 | Codestral 22B | 92%
16 | Llama 4 Maverick | 91.5%
17 | DeepSeek R1 (Groq) | 91%
18 | DeepSeek R1 (Together) | 91%
19 | ChatGPT-4o Latest | 91%
20 | DeepSeek R1 Distill Llama 70B | 91%
21 | GPT-4o | 90.2%
22 | GPT-4o (Aug 2024) | 90.2%
23 | DeepSeek R1 Distill Qwen 32B | 90%
24 | Llama 3.1 405B (Fireworks) | 89.5%
25 | Llama 3.1 405B | 89.5%
26 | Llama 3.1 405B (Together) | 89.5%
27 | Mistral Large | 89%
28 | Gemini 1.5 Pro | 89%
29 | Qwen 2.5 72B | 89%
30 | Qwen 2.5 72B (Together) | 89%
31 | Qwen 2.5 Max | 88.5%
32 | Gemini 2.5 Flash | 88%
33 | Gemini 2.0 Flash | 88%
34 | o4-mini | 88%
35 | Llama 4 Scout | 88%
36 | GPT-4 Turbo | 88%
37 | Sonar Reasoning | 88%
38 | Llama 3.3 70B (Fireworks) | 88%
39 | Llama 3.3 70B (Groq) | 88%
40 | Llama 3.3 70B | 88%
41 | Llama 3.3 70B (Together) | 88%
42 | Llama 3.2 90B Vision | 88%
43 | GPT-4o Mini | 87.2%
44 | Command A | 87%
45 | Grok 2 | 87%
46 | Amazon Nova Pro | 87%
47 | Claude Haiku 4 | 86.5%
48 | Sonar Pro | 86%
49 | Llama 3.1 70B | 85.9%
50 | Claude 3.5 Haiku | 85%
51 | Mistral Medium 3 | 85%
52 | DeepSeek V2.5 | 85%
53 | Mixtral 8x22B (Fireworks) | 85%
54 | GPT-4 1 | 85%
55 | WizardLM-2 8x22B | 85%
56 | Phi-3.5 MoE | 84%
57 | GPT-4 1.5-mini | 84%
58 | Grok 3-mini | 84%
59 | Yi-Large | 83%
60 | Mistral Small | 82.5%
61 | Gemini 2.0 Flash Lite | 82%
62 | Gemini 1.5 Flash | 82%
63 | Gemma 2 27B | 82%
64 | Phi-3 Medium | 81%
65 | Command R+ | 80.5%
66 | Mixtral 8x7B (Groq) | 80%
67 | Sonar | 80%
68 | Phi-4 | 80%
69 | Yi-Lightning | 79%
70 | Phi-3.5 Mini | 79%
71 | Gemma 2 9B (Groq) | 78%
72 | Gemma 2 9B | 78%
73 | Llama 3.2 11B Vision | 78%
74 | Qwen 2.5 7B | 78%
75 | GPT-4 1.5-nano | 78%
76 | Mistral Nemo 12B | 78%
77 | Amazon Nova Lite | 77%
78 | InternLM 2.5 20B | 77%
79 | Gemini 1.5 Flash 8B | 75%
80 | Command R | 75%
81 | Llama 3.1 8B (Groq) | 72%
82 | Llama 3.1 8B | 72%
83 | Amazon Nova Micro | 71%
84 | Command R7B | 71%
85 | Mistral 7B | 71%
86 | Mistral 7B (Together) | 71%
87 | GPT-3.5 Turbo | 65%

What HumanEval Tests

Code correctness: given a function signature and description, write working Python code. Problems cover data structures, algorithms, string manipulation, and math. A score of 80% corresponds to about 131 of 164 problems solved correctly on the first attempt (0.80 × 164 = 131.2).
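The conversion between a percentage score and a problem count can be sketched as follows. The function names are hypothetical, and the counts are estimates: published percentages may be rounded or come from slightly different evaluation setups, so a reported score does not always map to an exact whole number of problems.

```python
TOTAL_PROBLEMS = 164  # size of the HumanEval problem set


def problems_solved(score_percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of problems matching a reported percentage."""
    return round(score_percent / 100 * total)


def percent_from_count(solved: int, total: int = TOTAL_PROBLEMS) -> float:
    """Exact percentage for a given number of solved problems."""
    return 100 * solved / total


# Each problem is worth 100 / 164 of a percentage point (about 0.61),
# so scores can only move in steps of roughly that size.
step = percent_from_count(1)

count_at_80 = problems_solved(80)             # 131
exact_at_131 = percent_from_count(131)        # slightly under 80
```

Because one problem is worth about 0.61 percentage points, a gap of less than one point between two models amounts to a difference of at most one or two problems.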

Score Range

0–100% (human baseline ~95%)
