Accuracy (%), 87 models ranked

HumanEval Leaderboard 2026

HumanEval is OpenAI's code generation benchmark. Models are given Python function signatures and docstrings and must produce implementations that pass the benchmark's unit tests. The score is the percentage of the 164 problems solved on the first attempt (pass@1).
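As a rough illustration of how a pass@1 percentage is formed, the sketch below assumes one generated solution per problem and a list of booleans recording whether each solution passed its tests. The function name `pass_at_1` and the sample data are hypothetical and are not taken from the official harness.

```python
def pass_at_1(results: list[bool]) -> float:
    """Return the percentage of problems whose single sampled solution passed.

    Assumes exactly one attempt per problem, so pass@1 is just the pass rate.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 150 of the 164 problems pass on the first attempt.
results = [True] * 150 + [False] * 14
score = pass_at_1(results)
print(f"{score:.1f}%")  # prints 91.5%
```

When several samples are drawn per problem, pass@1 is usually estimated differently, so this sketch only covers the single-sample case.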

Quick Answer

The best model on HumanEval in 2026 is o3 by OpenAI, scoring 98%. Runner-up: o1 (96%).

#	Model	Provider	Score	Percentile
🥇	o3	OpenAI	98%	99th
🥈	o1	OpenAI	96%	98th
🥉	Claude Opus 4	Anthropic	95.2%	97th
4	Gemini 2.5 Pro	Google	94%	95th
5	Qwen 2.5 Coder 32B	Alibaba	94%	94th
6	DeepSeek R1	DeepSeek	93%	93rd
7	GPT-4.5	OpenAI	93%	92nd
8	o3-mini	OpenAI	93%	91st
9	Qwen 3 235B MoE	Alibaba	92%	90th
10	Grok 3	xAI	92%	89th
11	Claude Sonnet 4	Anthropic	92%	87th
12	DeepSeek V3	DeepSeek	92%	86th
13	Claude 3.5 Sonnet	Anthropic	92%	85th
14	o1-mini	OpenAI	92%	84th
15	Codestral 22B	Mistral AI	92%	83rd
16	Llama 4 Maverick	Meta	91.5%	82nd
17	DeepSeek R1 (Groq)	Groq	91%	80th
18	DeepSeek R1 (Together)	Together AI	91%	79th
19	ChatGPT-4o Latest	OpenAI	91%	78th
20	DeepSeek R1 Distill Llama 70B	DeepSeek	91%	77th
21	GPT-4o	OpenAI	90.2%	76th
22	GPT-4o (Aug 2024)	OpenAI	90.2%	75th
23	DeepSeek R1 Distill Qwen 32B	DeepSeek	90%	74th
24	Llama 3.1 405B (Fireworks)	Fireworks AI	89.5%	72nd
25	Llama 3.1 405B	Meta	89.5%	71st
26	Llama 3.1 405B (Together)	Together AI	89.5%	70th
27	Mistral Large	Mistral AI	89%	69th
28	Gemini 1.5 Pro	Google	89%	68th
29	Qwen 2.5 72B	Alibaba	89%	67th
30	Qwen 2.5 72B (Together)	Together AI	89%	66th
31	Qwen 2.5 Max	Alibaba	88.5%	64th
32	Gemini 2.5 Flash	Google	88%	63rd
33	Gemini 2.0 Flash	Google	88%	62nd
34	o4-mini	OpenAI	88%	61st
35	Llama 4 Scout	Meta	88%	60th
36	GPT-4 Turbo	OpenAI	88%	59th
37	Sonar Reasoning	Perplexity	88%	57th
38	Llama 3.3 70B (Fireworks)	Fireworks AI	88%	56th
39	Llama 3.3 70B (Groq)	Groq	88%	55th
40	Llama 3.3 70B	Meta	88%	54th
41	Llama 3.3 70B (Together)	Together AI	88%	53rd
42	Llama 3.2 90B Vision	Meta	88%	52nd
43	GPT-4o Mini	OpenAI	87.2%	51st
44	Command A	Cohere	87%	49th
45	Grok 2	xAI	87%	48th
46	Amazon Nova Pro	Amazon	87%	47th
47	Claude Haiku 4	Anthropic	86.5%	46th
48	Sonar Pro	Perplexity	86%	45th
49	Llama 3.1 70B	Meta	85.9%	44th
50	Claude 3.5 Haiku	Anthropic	85%	43rd
51	Mistral Medium 3	Mistral AI	85%	41st
52	DeepSeek V2.5	DeepSeek	85%	40th
53	Mixtral 8x22B (Fireworks)	Fireworks AI	85%	39th
54	GPT-4.1	OpenAI	85%	38th
55	WizardLM-2 8x22B	Microsoft	85%	37th
56	Phi-3.5 MoE	Microsoft	84%	36th
57	GPT-4.1 mini	OpenAI	84%	34th
58	Grok 3-mini	xAI	84%	33rd
59	Yi-Large	01.AI	83%	32nd
60	Mistral Small	Mistral AI	82.5%	31st
61	Gemini 2.0 Flash Lite	Google	82%	30th
62	Gemini 1.5 Flash	Google	82%	29th
63	Gemma 2 27B	Google	82%	28th
64	Phi-3 Medium	Microsoft	81%	26th
65	Command R+	Cohere	80.5%	25th
66	Mixtral 8x7B (Groq)	Groq	80%	24th
67	Sonar	Perplexity	80%	23rd
68	Phi-4	Microsoft	80%	22nd
69	Yi-Lightning	01.AI	79%	21st
70	Phi-3.5 Mini	Microsoft	79%	20th
71	Gemma 2 9B (Groq)	Groq	78%	18th
72	Gemma 2 9B	Google	78%	17th
73	Llama 3.2 11B Vision	Meta	78%	16th
74	Qwen 2.5 7B	Alibaba	78%	15th
75	GPT-4.1 nano	OpenAI	78%	14th
76	Mistral Nemo 12B	Mistral AI	78%	13th
77	Amazon Nova Lite	Amazon	77%	11th
78	InternLM 2.5 20B	Shanghai AI Lab	77%	10th
79	Gemini 1.5 Flash 8B	Google	75%	9th
80	Command R	Cohere	75%	8th
81	Llama 3.1 8B (Groq)	Groq	72%	7th
82	Llama 3.1 8B	Meta	72%	6th
83	Amazon Nova Micro	Amazon	71%	5th
84	Command R7B	Cohere	71%	3rd
85	Mistral 7B	Mistral AI	71%	2nd
86	Mistral 7B (Together)	Together AI	71%	1st
87	GPT-3.5 Turbo	OpenAI	65%	0th
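
The Percentile column above appears to depend only on rank: the listed values match the share of the 87 models ranked below each entry, rounded to the nearest whole number. The page does not state this formula, so the sketch below is an inference from the table, and the function name `percentile_from_rank` is hypothetical.

```python
def percentile_from_rank(rank: int, total: int) -> int:
    """Share of models ranked strictly below `rank`, as a rounded percentage.

    Assumed formula (inferred from the table, not documented by the source):
    round(100 * (total - rank) / total).
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank) / total)


TOTAL_MODELS = 87  # number of models on this leaderboard

for rank in (1, 17, 44, 87):
    print(rank, percentile_from_rank(rank, TOTAL_MODELS))
# prints:
# 1 99
# 17 80
# 44 49
# 87 0
```

Because the value is derived from rank, models with identical scores (for example the group at 92%) still receive different percentiles.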

What HumanEval Tests

Code correctness: given a function signature and description, write working Python code. Tests cover data structures, algorithms, string manipulation, and math. A score of 80% corresponds to roughly 131 of 164 problems (0.80 × 164 ≈ 131.2) solved correctly on the first attempt.
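
To make the arithmetic concrete, here is a small sketch that converts between a reported percentage and a problem count. It assumes the full 164-problem set was evaluated; the helper names `problems_solved` and `exact_score` are hypothetical.

```python
TOTAL_PROBLEMS = 164  # size of the HumanEval problem set


def problems_solved(score_percent: float, total: int = TOTAL_PROBLEMS) -> int:
    """Nearest whole number of problems implied by a percentage score."""
    return round(score_percent / 100 * total)


def exact_score(solved: int, total: int = TOTAL_PROBLEMS) -> float:
    """Exact percentage for a given number of solved problems."""
    return 100 * solved / total


print(problems_solved(80))           # 131
print(problems_solved(98))           # 161
print(f"{exact_score(131):.2f}%")    # 79.88%
print(f"{exact_score(1):.2f}")       # 0.61 (percentage points per problem)
```

Each problem is worth about 0.61 percentage points, so score differences smaller than that on this page come from rounding or from differences in how results were reported, not from an extra problem solved.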

Score Range

0–100% (human baseline ~95%)

Source

OpenAI HumanEval
