Metric: accuracy (%). 85 models ranked.

MMLU Leaderboard 2026

Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including STEM, humanities, social sciences, and professional domains. It is the most widely reported academic benchmark for general knowledge.

Quick Answer

The top-scoring model on MMLU in 2026 is o3 by OpenAI, at 95%. Gemini 2.5 Pro and o1 tie for second place at 92%.

#   Model                          Score
1   o3                             95%
2   Gemini 2.5 Pro                 92%
3   o1                             92%
4   Claude Opus 4                  91.5%
5   DeepSeek R1                    89%
6   GPT-4.5                        89%
7   ChatGPT-4o Latest              89%
8   Claude Sonnet 4                88.7%
9   GPT-4o                         88.7%
10  GPT-4o (Aug 2024)              88.7%
11  Claude 3.5 Sonnet              88.3%
12  Qwen 3 235B MoE                88%
13  Llama 4 Maverick               88%
14  DeepSeek V3                    87.5%
15  Grok 3                         87%
16  Gemini 1.5 Pro                 87%
17  Mistral Large                  86.5%
18  Llama 3.3 70B (Fireworks)      86.2%
19  Llama 3.3 70B (Groq)           86.2%
20  Llama 3.3 70B                  86.2%
21  Llama 3.3 70B (Together)       86.2%
22  o3-mini                        86%
23  Gemini 2.5 Flash               86%
24  Qwen 2.5 Max                   86%
25  DeepSeek R1 Distill Llama 70B  86%
26  GPT-4 Turbo                    86%
27  GPT-4 1                        86%
28  Llama 3.1 405B (Fireworks)     85.9%
29  Llama 3.1 405B                 85.9%
30  Llama 3.1 405B (Together)      85.9%
31  Gemini 2.0 Flash               85.5%
32  Llama 3.1 70B                  85.2%
33  DeepSeek R1 (Groq)             85%
34  DeepSeek R1 (Together)         85%
35  o1-mini                        85%
36  o4-mini                        85%
37  Llama 4 Scout                  85%
38  Command A                      85%
39  DeepSeek R1 Distill Qwen 32B   85%
40  Grok 2                         85%
41  Sonar Reasoning                85%
42  Qwen 2.5 72B                   85%
43  Qwen 2.5 72B (Together)        85%
44  Amazon Nova Pro                85%
45  Mistral Medium 3               84%
46  Llama 3.2 90B Vision           84%
47  DeepSeek V2.5                  84%
48  Sonar Pro                      84%
49  Claude 3.5 Haiku               83%
50  Claude Haiku 4                 83%
51  Mixtral 8x22B (Fireworks)      83%
52  WizardLM-2 8x22B               83%
53  Phi-3 Medium                   83%
54  GPT-4o Mini                    82%
55  Command R+                     82%
56  Phi-3.5 MoE                    82%
57  Gemini 1.5 Flash               82%
58  Yi-Large                       82%
59  GPT-4 1.5-mini                 82%
60  Grok 3-mini                    82%
61  Gemma 2 27B                    81%
62  Phi-3.5 Mini                   81%
63  Phi-4                          80.5%
64  Gemini 2.0 Flash Lite          80%
65  Amazon Nova Lite               80%
66  Yi-Lightning                   80%
67  Llama 3.2 11B Vision           80%
68  Sonar                          80%
69  Mistral Small                  79%
70  Qwen 2.5 7B                    79%
71  Llama 3.1 8B (Groq)            79%
72  Llama 3.1 8B                   79%
73  InternLM 2.5 20B               78%
74  Gemini 1.5 Flash 8B            78%
75  GPT-4 1.5-nano                 78%
76  Gemma 2 9B (Groq)              77%
77  Gemma 2 9B                     77%
78  Amazon Nova Micro              76%
79  Command R                      75.5%
80  Mixtral 8x7B (Groq)            74%
81  Mistral Nemo 12B               72%
82  Command R7B                    72%
83  GPT-3.5 Turbo                  70%
84  Mistral 7B                     62%
85  Mistral 7B (Together)          62%

What MMLU Tests

MMLU consists of multiple-choice questions, each with four answer options, across 57 subjects: mathematics, history, law, medicine, computer science, and more. It tests both breadth of knowledge and reasoning within those domains. Score = percentage of questions answered correctly.
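
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation harness behind this leaderboard: the function name mmlu_accuracy, the letter-based answer format, and the sample answers are all assumptions made for the example.

```python
def mmlu_accuracy(predictions, answers):
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # A prediction counts only if it matches the gold choice exactly.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical data: five questions, answers given as choice letters.
gold = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "D", "D", "A"]
print(f"{mmlu_accuracy(model_output, gold):.1f}%")  # 4 of 5 correct -> 80.0%
```

Published MMLU numbers can differ depending on prompt format and on whether scores are averaged over all questions or per subject, so a sketch like this only shows the basic arithmetic.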

Score Range

0–100% (human expert ~89%)
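
As a rough illustration of how the human expert figure can serve as a reference point, the sketch below filters a handful of scores copied from the leaderboard rows. The 89.0 threshold treats the approximate "~89%" as exact, which is an assumption, and the helper name at_or_above is hypothetical.

```python
HUMAN_EXPERT = 89.0  # assumption: the approximate ~89% figure treated as exact

# A few rows copied from the leaderboard (model name -> MMLU score in %).
scores = {
    "o3": 95.0,
    "Gemini 2.5 Pro": 92.0,
    "o1": 92.0,
    "Claude Opus 4": 91.5,
    "DeepSeek R1": 89.0,
    "Claude Sonnet 4": 88.7,
    "GPT-3.5 Turbo": 70.0,
}


def at_or_above(scores, threshold):
    """Return the models whose score meets or exceeds the threshold."""
    return [name for name, score in scores.items() if score >= threshold]


print(at_or_above(scores, HUMAN_EXPERT))
```

Because the baseline is an estimate, models within a point or so of it are better read as roughly on par than as clearly above or below.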
