Accuracy % · 85 models ranked

MMLU Leaderboard 2026

Massive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects spanning STEM, the humanities, the social sciences, and professional domains. It is one of the most widely reported academic benchmarks for general knowledge.

Quick Answer

The top-scoring model on MMLU in 2026 is o3 by OpenAI, at 95%. The runner-up is Gemini 2.5 Pro (92%), tied on score with o1.

#	Model	Provider	Score	Percentile
🥇	o3	OpenAI	95%	99th
🥈	Gemini 2.5 Pro	Google	92%	98th
🥉	o1	OpenAI	92%	96th
4	Claude Opus 4	Anthropic	91.5%	95th
5	DeepSeek R1	DeepSeek	89%	94th
6	GPT-4.5	OpenAI	89%	93rd
7	ChatGPT-4o Latest	OpenAI	89%	92nd
8	Claude Sonnet 4	Anthropic	88.7%	91st
9	GPT-4o	OpenAI	88.7%	89th
10	GPT-4o (Aug 2024)	OpenAI	88.7%	88th
11	Claude 3.5 Sonnet	Anthropic	88.3%	87th
12	Qwen 3 235B MoE	Alibaba	88%	86th
13	Llama 4 Maverick	Meta	88%	85th
14	DeepSeek V3	DeepSeek	87.5%	84th
15	Grok 3	xAI	87%	82nd
16	Gemini 1.5 Pro	Google	87%	81st
17	Mistral Large	Mistral AI	86.5%	80th
18	Llama 3.3 70B (Fireworks)	Fireworks AI	86.2%	79th
19	Llama 3.3 70B (Groq)	Groq	86.2%	78th
20	Llama 3.3 70B	Meta	86.2%	76th
21	Llama 3.3 70B (Together)	Together AI	86.2%	75th
22	o3-mini	OpenAI	86%	74th
23	Gemini 2.5 Flash	Google	86%	73rd
24	Qwen 2.5 Max	Alibaba	86%	72nd
25	DeepSeek R1 Distill Llama 70B	DeepSeek	86%	71st
26	GPT-4 Turbo	OpenAI	86%	69th
27	GPT-4.1	OpenAI	86%	68th
28	Llama 3.1 405B (Fireworks)	Fireworks AI	85.9%	67th
29	Llama 3.1 405B	Meta	85.9%	66th
30	Llama 3.1 405B (Together)	Together AI	85.9%	65th
31	Gemini 2.0 Flash	Google	85.5%	64th
32	Llama 3.1 70B	Meta	85.2%	62nd
33	DeepSeek R1 (Groq)	Groq	85%	61st
34	DeepSeek R1 (Together)	Together AI	85%	60th
35	o1-mini	OpenAI	85%	59th
36	o4-mini	OpenAI	85%	58th
37	Llama 4 Scout	Meta	85%	56th
38	Command A	Cohere	85%	55th
39	DeepSeek R1 Distill Qwen 32B	DeepSeek	85%	54th
40	Grok 2	xAI	85%	53rd
41	Sonar Reasoning	Perplexity	85%	52nd
42	Qwen 2.5 72B	Alibaba	85%	51st
43	Qwen 2.5 72B (Together)	Together AI	85%	49th
44	Amazon Nova Pro	Amazon	85%	48th
45	Mistral Medium 3	Mistral AI	84%	47th
46	Llama 3.2 90B Vision	Meta	84%	46th
47	DeepSeek V2.5	DeepSeek	84%	45th
48	Sonar Pro	Perplexity	84%	44th
49	Claude 3.5 Haiku	Anthropic	83%	42nd
50	Claude Haiku 4	Anthropic	83%	41st
51	Mixtral 8x22B (Fireworks)	Fireworks AI	83%	40th
52	WizardLM-2 8x22B	Microsoft	83%	39th
53	Phi-3 Medium	Microsoft	83%	38th
54	GPT-4o Mini	OpenAI	82%	36th
55	Command R+	Cohere	82%	35th
56	Phi-3.5 MoE	Microsoft	82%	34th
57	Gemini 1.5 Flash	Google	82%	33rd
58	Yi-Large	01.AI	82%	32nd
59	GPT-4 1.5-mini	OpenAI	82%	31st
60	Grok 3-mini	xAI	82%	29th
61	Gemma 2 27B	Google	81%	28th
62	Phi-3.5 Mini	Microsoft	81%	27th
63	Phi-4	Microsoft	80.5%	26th
64	Gemini 2.0 Flash Lite	Google	80%	25th
65	Amazon Nova Lite	Amazon	80%	24th
66	Yi-Lightning	01.AI	80%	22nd
67	Llama 3.2 11B Vision	Meta	80%	21st
68	Sonar	Perplexity	80%	20th
69	Mistral Small	Mistral AI	79%	19th
70	Qwen 2.5 7B	Alibaba	79%	18th
71	Llama 3.1 8B (Groq)	Groq	79%	16th
72	Llama 3.1 8B	Meta	79%	15th
73	InternLM 2.5 20B	Shanghai AI Lab	78%	14th
74	Gemini 1.5 Flash 8B	Google	78%	13th
75	GPT-4 1.5-nano	OpenAI	78%	12th
76	Gemma 2 9B (Groq)	Groq	77%	11th
77	Gemma 2 9B	Google	77%	9th
78	Amazon Nova Micro	Amazon	76%	8th
79	Command R	Cohere	75.5%	7th
80	Mixtral 8x7B (Groq)	Groq	74%	6th
81	Mistral Nemo 12B	Mistral AI	72%	5th
82	Command R7B	Cohere	72%	4th
83	GPT-3.5 Turbo	OpenAI	70%	2nd
84	Mistral 7B	Mistral AI	62%	1st
85	Mistral 7B (Together)	Together AI	62%	0th
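
The page does not define the Percentile column. A minimal sketch, assuming it depends on rank alone as round((N - rank) / N × 100) with N = 85 ranked models. This formula is inferred from the table values and is not stated by the source; `percentile_from_rank` and `ordinal` are hypothetical helper names.

```python
def percentile_from_rank(rank: int, total: int) -> int:
    """Share of ranked models placed below `rank`, as a whole percent.

    Assumption: formula inferred from the leaderboard table, not
    documented by the source.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round((total - rank) / total * 100)


def ordinal(n: int) -> str:
    """Format an integer with its English ordinal suffix (1st, 2nd, 3rd, 11th)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th through 20th always take "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


if __name__ == "__main__":
    # Ranks taken from the first, fourth, and last rows of the table.
    for rank in (1, 4, 85):
        print(rank, ordinal(percentile_from_rank(rank, 85)))
```

Under this assumption, rank 1 maps to 99 and rank 85 maps to 0, which matches the first and last rows. Because the value is computed from rank, models with identical scores still receive different percentiles.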

What MMLU Tests

MMLU consists of four-option multiple-choice questions across 57 subjects, including mathematics, history, law, medicine, and computer science. It tests both breadth of knowledge and reasoning within those domains. The score is the percentage of questions answered correctly.
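
The scoring rule above can be sketched as follows. This is a minimal illustration, assuming each question has exactly one gold answer letter. The function name `mmlu_accuracy` and the sample answers are hypothetical and do not come from the benchmark's own evaluation code.

```python
from typing import Sequence


def mmlu_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare answer letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    # Hypothetical answers for five questions; four match the gold labels.
    preds = ["A", "c", "B", "D", "A"]
    gold = ["A", "C", "B", "D", "B"]
    print(f"{mmlu_accuracy(preds, gold):.1f}%")  # prints 80.0%
```

This pools all questions into one accuracy figure. Some reports instead average per-subject accuracy, which can give a slightly different number.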

Score Range

0–100% (estimated human expert accuracy: ~89%; random guessing on four options: 25%)

Source

Hendrycks et al. (UC Berkeley)
