Metric: accuracy (%) · 90 models ranked

GPQA Diamond Leaderboard 2026

Graduate-Level Google-Proof Q&A (GPQA) Diamond is a set of 198 expert-level science questions written by domain specialists. The 'Google-proof' design means the answers cannot be found with a simple web search; they require genuine understanding.

Quick Answer

The best model on GPQA Diamond in 2026 is o3 by OpenAI, scoring 94%. Runner-up: o1 (92%).

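| # | Model | Score |
|---|-------|-------|
| 1 | o3 | 94% |
| 2 | o1 | 92% |
| 3 | o3-mini | 87% |
| 4 | Qwen 3 235B MoE | 80% |
| 5 | Gemini Experimental 1206 | 75% |
| 6 | Claude Opus 4 | 74.8% |
| 7 | Gemini 2.5 Pro | 74% |
| 8 | DeepSeek R1 | 72% |
| 9 | GPT-4.5 | 70% |
| 10 | DeepSeek R1 (Groq) | 70% |
| 11 | DeepSeek R1 (Together) | 70% |
| 12 | Claude Sonnet 4 | 65.2% |
| 13 | Gemini 2.0 Flash Thinking | 65% |
| 14 | Llama 4 Maverick | 60.5% |
| 15 | Claude 3.5 Sonnet | 60% |
| 16 | Gemini 2.5 Flash | 60% |
| 17 | o1-mini | 60% |
| 18 | o4-mini | 60% |
| 19 | DeepSeek V3 | 59% |
| 20 | ChatGPT-4o Latest | 57.5% |
| 21 | Grok 3 | 55% |
| 22 | Qwen 2.5 Max | 55% |
| 23 | QwQ 32B | 55% |
| 24 | Mistral Large | 55% |
| 25 | GPT-4o | 53.6% |
| 26 | Llama 4 Scout | 53% |
| 27 | Gemini 2.0 Flash | 52.8% |
| 28 | GPT-4o (Aug 2024) | 52.5% |
| 29 | DeepSeek R1 Distill Llama 70B | 50% |
| 30 | GPT-4 1 | 49% |
| 31 | Claude Haiku 4 | 48.5% |
| 32 | Command A | 45% |
| 33 | DeepSeek R1 Distill Qwen 32B | 45% |
| 34 | Llama 3.1 405B (Fireworks) | 45% |
| 35 | GPT-4 Turbo | 45% |
| 36 | Grok 2 | 45% |
| 37 | Llama 3.1 405B | 45% |
| 38 | Sonar Reasoning | 45% |
| 39 | Llama 3.1 405B (Together) | 45% |
| 40 | Phi-4 | 45% |
| 41 | GPT-4o Mini | 43.9% |
| 42 | Command R+ | 42% |
| 43 | Gemini 2.0 Flash Lite | 42% |
| 44 | Gemini 1.5 Pro | 40% |
| 45 | Grok 2 Vision | 40% |
| 46 | Pixtral Large | 40% |
| 47 | Qwen 2.5 72B | 40% |
| 48 | Qwen 2.5 72B (Together) | 40% |
| 49 | Amazon Nova Pro | 40% |
| 50 | Claude 3.5 Haiku | 40% |
| 51 | Llama 3.3 70B (Fireworks) | 40% |
| 52 | Llama 3.3 70B (Groq) | 40% |
| 53 | Llama 3.3 70B | 40% |
| 54 | Mistral Medium 3 | 40% |
| 55 | Llama 3.3 70B (Together) | 40% |
| 56 | Llama 3.2 90B Vision | 40% |
| 57 | DeepSeek V2.5 | 40% |
| 58 | Mixtral 8x22B (Fireworks) | 40% |
| 59 | Sonar Pro | 40% |
| 60 | WizardLM-2 8x22B | 40% |
| 61 | Llama 3.1 70B | 40% |
| 62 | Phi-3.5 MoE | 40% |
| 63 | Gemini 1.5 Flash | 40% |
| 64 | Gemma 2 27B | 40% |
| 65 | Mistral Small | 40% |
| 66 | Yi-Large | 40% |
| 67 | GPT-4 1.5-mini | 40% |
| 68 | Grok 3-mini | 40% |
| 69 | Amazon Nova Lite | 40% |
| 70 | Gemma 2 9B (Groq) | 40% |
| 71 | Phi-3 Medium | 40% |
| 72 | Yi-Lightning | 40% |
| 73 | Gemma 2 9B | 40% |
| 74 | Mixtral 8x7B (Groq) | 40% |
| 75 | Llama 3.2 11B Vision | 40% |
| 76 | Phi-3.5 Mini | 40% |
| 77 | Qwen 2.5 7B | 40% |
| 78 | Sonar | 40% |
| 79 | InternLM 2.5 20B | 40% |
| 80 | Gemini 1.5 Flash 8B | 40% |
| 81 | GPT-4 1.5-nano | 40% |
| 82 | Mistral Nemo 12B | 40% |
| 83 | Amazon Nova Micro | 40% |
| 84 | Command R7B | 40% |
| 85 | GPT-3.5 Turbo | 40% |
| 86 | Llama 3.1 8B (Groq) | 40% |
| 87 | Llama 3.1 8B | 40% |
| 88 | Mistral 7B | 40% |
| 89 | Mistral 7B (Together) | 40% |
| 90 | Command R | 35% |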

What GPQA Diamond Tests

Expert-level multiple-choice questions in biology, chemistry, and physics written by PhD researchers. Questions are intentionally hard to Google. Human non-expert accuracy is ~22%; PhD expert accuracy is ~65%. Scores above 50% indicate strong scientific reasoning.
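Because the benchmark is multiple-choice, a score is simply the share of questions answered correctly. The sketch below illustrates that arithmetic under stated assumptions: the function name `accuracy_percent` and the toy answers are hypothetical, the data is not taken from GPQA, and four answer options per question (a 25% random-guess baseline) is assumed. With 198 questions, one question is worth roughly half a percentage point.

```python
from typing import Sequence

NUM_OPTIONS = 4  # assumption: four answer options (A-D) per question
CHANCE_BASELINE = 100.0 / NUM_OPTIONS  # expected score from random guessing
DIAMOND_QUESTIONS = 198  # size of the Diamond set, as stated on this page


def accuracy_percent(predictions: Sequence[str], answer_key: Sequence[str]) -> float:
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(p == a for p, a in zip(predictions, answer_key))
    return 100.0 * correct / len(answer_key)


# Toy data, not real GPQA items: five hypothetical questions.
answer_key = ["A", "C", "B", "D", "A"]
predictions = ["A", "C", "D", "D", "B"]

score = accuracy_percent(predictions, answer_key)
points_per_question = 100.0 / DIAMOND_QUESTIONS

print(f"toy score: {score:.1f}%")
print(f"chance baseline: {CHANCE_BASELINE:.1f}%")
print(f"one Diamond question is worth about {points_per_question:.2f} points")
```

Leaderboards may also differ in prompting, sampling, and number of attempts, none of which this sketch models.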

Score Range

0–100% (PhD expert ~65%, non-expert ~22%)
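One way to read a score against this range is to place it relative to the two human reference points. The sketch below is illustrative only: the thresholds are the approximate figures quoted on this page, the band labels and the `band` function are hypothetical, and the sample scores are a few rows copied from the table above.

```python
NON_EXPERT_REF = 22.0  # approximate non-expert accuracy quoted on this page
PHD_EXPERT_REF = 65.0  # approximate PhD expert accuracy quoted on this page


def band(score: float) -> str:
    """Place a 0-100% score relative to the human reference points."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score >= PHD_EXPERT_REF:
        return "at or above PhD expert reference"
    if score > NON_EXPERT_REF:
        return "between non-expert and PhD expert reference"
    return "at or below non-expert reference"


# A few rows from the leaderboard, used purely as sample input.
sample_scores = {
    "o3": 94.0,
    "Claude Opus 4": 74.8,
    "GPT-4o": 53.6,
    "Command R": 35.0,
}

for model, score in sample_scores.items():
    print(f"{model}: {score}% -> {band(score)}")
```

The bands are a reading aid, not part of the benchmark; small gaps between models near a threshold should not be over-interpreted.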
