Accuracy (%), 90 models ranked

GPQA Diamond Leaderboard 2026

Graduate-Level Google-Proof Q&A (GPQA) Diamond is a set of 198 expert-level science questions written by domain specialists. The 'Google-proof' design means the answers cannot be found by simple web search — they require genuine understanding.

Quick Answer

The best model on GPQA Diamond in 2026 is o3 by OpenAI, scoring 94%. Runner-up: o1 (92%).

#	Model	Provider	Score	Percentile
🥇	o3	OpenAI	94%	99th
🥈	o1	OpenAI	92%	98th
🥉	o3-mini	OpenAI	87%	97th
4	Qwen 3 235B MoE	Alibaba	80%	96th
5	Gemini Experimental 1206	Google	75%	94th
6	Claude Opus 4	Anthropic	74.8%	93rd
7	Gemini 2.5 Pro	Google	74%	92nd
8	DeepSeek R1	DeepSeek	72%	91st
9	GPT-4.5	OpenAI	70%	90th
10	DeepSeek R1 (Groq)	Groq	70%	89th
11	DeepSeek R1 (Together)	Together AI	70%	88th
12	Claude Sonnet 4	Anthropic	65.2%	87th
13	Gemini 2.0 Flash Thinking	Google	65%	86th
14	Llama 4 Maverick	Meta	60.5%	84th
15	Claude 3.5 Sonnet	Anthropic	60%	83rd
16	Gemini 2.5 Flash	Google	60%	82nd
17	o1-mini	OpenAI	60%	81st
18	o4-mini	OpenAI	60%	80th
19	DeepSeek V3	DeepSeek	59%	79th
20	ChatGPT-4o Latest	OpenAI	57.5%	78th
21	Grok 3	xAI	55%	77th
22	Qwen 2.5 Max	Alibaba	55%	76th
23	QwQ 32B	Alibaba	55%	74th
24	Mistral Large	Mistral AI	55%	73rd
25	GPT-4o	OpenAI	53.6%	72nd
26	Llama 4 Scout	Meta	53%	71st
27	Gemini 2.0 Flash	Google	52.8%	70th
28	GPT-4o (Aug 2024)	OpenAI	52.5%	69th
29	DeepSeek R1 Distill Llama 70B	DeepSeek	50%	68th
30	GPT-4.1	OpenAI	49%	67th
31	Claude Haiku 4	Anthropic	48.5%	66th
32	Command A	Cohere	45%	64th
33	DeepSeek R1 Distill Qwen 32B	DeepSeek	45%	63rd
34	Llama 3.1 405B (Fireworks)	Fireworks AI	45%	62nd
35	GPT-4 Turbo	OpenAI	45%	61st
36	Grok 2	xAI	45%	60th
37	Llama 3.1 405B	Meta	45%	59th
38	Sonar Reasoning	Perplexity	45%	58th
39	Llama 3.1 405B (Together)	Together AI	45%	57th
40	Phi-4	Microsoft	45%	56th
41	GPT-4o Mini	OpenAI	43.9%	54th
42	Command R+	Cohere	42%	53rd
43	Gemini 2.0 Flash Lite	Google	42%	52nd
44	Gemini 1.5 Pro	Google	40%	51st
45	Grok 2 Vision	xAI	40%	50th
46	Pixtral Large	Mistral AI	40%	49th
47	Qwen 2.5 72B	Alibaba	40%	48th
48	Qwen 2.5 72B (Together)	Together AI	40%	47th
49	Amazon Nova Pro	Amazon	40%	46th
50	Claude 3.5 Haiku	Anthropic	40%	44th
51	Llama 3.3 70B (Fireworks)	Fireworks AI	40%	43rd
52	Llama 3.3 70B (Groq)	Groq	40%	42nd
53	Llama 3.3 70B	Meta	40%	41st
54	Mistral Medium 3	Mistral AI	40%	40th
55	Llama 3.3 70B (Together)	Together AI	40%	39th
56	Llama 3.2 90B Vision	Meta	40%	38th
57	DeepSeek V2.5	DeepSeek	40%	37th
58	Mixtral 8x22B (Fireworks)	Fireworks AI	40%	36th
59	Sonar Pro	Perplexity	40%	34th
60	WizardLM-2 8x22B	Microsoft	40%	33rd
61	Llama 3.1 70B	Meta	40%	32nd
62	Phi-3.5 MoE	Microsoft	40%	31st
63	Gemini 1.5 Flash	Google	40%	30th
64	Gemma 2 27B	Google	40%	29th
65	Mistral Small	Mistral AI	40%	28th
66	Yi-Large	01.AI	40%	27th
67	GPT-4 1.5-mini	OpenAI	40%	26th
68	Grok 3-mini	xAI	40%	24th
69	Amazon Nova Lite	Amazon	40%	23rd
70	Gemma 2 9B (Groq)	Groq	40%	22nd
71	Phi-3 Medium	Microsoft	40%	21st
72	Yi-Lightning	01.AI	40%	20th
73	Gemma 2 9B	Google	40%	19th
74	Mixtral 8x7B (Groq)	Groq	40%	18th
75	Llama 3.2 11B Vision	Meta	40%	17th
76	Phi-3.5 Mini	Microsoft	40%	16th
77	Qwen 2.5 7B	Alibaba	40%	14th
78	Sonar	Perplexity	40%	13th
79	InternLM 2.5 20B	Shanghai AI Lab	40%	12th
80	Gemini 1.5 Flash 8B	Google	40%	11th
81	GPT-4 1.5-nano	OpenAI	40%	10th
82	Mistral Nemo 12B	Mistral AI	40%	9th
83	Amazon Nova Micro	Amazon	40%	8th
84	Command R7B	Cohere	40%	7th
85	GPT-3.5 Turbo	OpenAI	40%	6th
86	Llama 3.1 8B (Groq)	Groq	40%	4th
87	Llama 3.1 8B	Meta	40%	3rd
88	Mistral 7B	Mistral AI	40%	2nd
89	Mistral 7B (Together)	Together AI	40%	1st
90	Command R	Cohere	35%	0th
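
The Percentile column appears consistent with the share of ranked models placed below each entry, rounded to the nearest whole percent. That formula is inferred from the table values and is not stated by the source. A minimal Python sketch under that assumption follows; the helper names `percentile` and `ordinal` are hypothetical.

```python
def percentile(rank: int, total: int) -> int:
    """Share of ranked models placed below `rank`, rounded half up to a whole percent.

    Assumption: this mirrors how the Percentile column seems to be derived;
    the source does not document its formula.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    below = total - rank
    # Integer round-half-up of 100 * below / total, avoiding float rounding surprises.
    return (200 * below + total) // (2 * total)


def ordinal(n: int) -> str:
    """Format a non-negative integer with its English ordinal suffix."""
    if 11 <= n % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


TOTAL_MODELS = 90  # number of ranked models in the table

for rank in (1, 6, 45, 90):
    print(rank, ordinal(percentile(rank, TOTAL_MODELS)))
```

Note that many rows share the same score (for example, the long run at 40%), so percentiles computed from rank alone separate models that the benchmark itself does not distinguish.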

What GPQA Diamond Tests

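GPQA Diamond consists of expert-level multiple-choice questions in biology, chemistry, and physics, written by PhD-level researchers and designed to be hard to answer with a web search. Each question has four answer options, so random guessing yields 25%. Non-expert human accuracy is ~22% and PhD expert accuracy is ~65%, so a score above 50% is well clear of the non-expert baseline and indicates strong scientific reasoning.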

Score Range

0–100% (PhD expert ~65%, non-expert ~22%)
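
With only 198 questions, each question is worth about half a percentage point, so small gaps between scores correspond to a handful of answers. The sketch below converts a reported accuracy into an approximate question count and a binomial standard error. It assumes each question is an independent trial and that a score comes from a single evaluation run, neither of which the source states; the function names are illustrative.

```python
import math

N_QUESTIONS = 198  # size of GPQA Diamond


def questions_correct(accuracy: float, n: int = N_QUESTIONS) -> int:
    """Approximate number of correct answers implied by an accuracy in [0, 1]."""
    return round(accuracy * n)


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Top two scores from the leaderboard, expressed as fractions.
for name, acc in (("o3", 0.94), ("o1", 0.92)):
    se_points = 100 * standard_error(acc)
    print(f"{name}: ~{questions_correct(acc)} correct, standard error ~{se_points:.1f} points")
```

Under these assumptions the 2-point gap between o3 and o1 amounts to roughly four questions, which is of the same order as the standard error of either score, so closely spaced ranks are best read as approximate.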

Source

Rein et al., "GPQA: A Graduate-Level Google-Proof Q&A Benchmark" (2023)