Accuracy (%) · 90 models ranked

MATH Leaderboard 2026

The MATH benchmark tests mathematical problem-solving across five difficulty levels, with problems ranging from pre-algebra to calculus and competition mathematics. It remains one of the hardest benchmarks: early GPT-4 scored only 42%.

Quick Answer

The best model on MATH in 2026 is Qwen 3 235B MoE by Alibaba, scoring 168%. Runner-up: Gemini Experimental 1206 (160%).


#	Model	Provider	Score	Percentile
🥇	Qwen 3 235B MoE	Alibaba	168%	99th
🥈	Gemini Experimental 1206	Google	160%	98th
🥉	GPT-4.5	OpenAI	152%	97th
4	DeepSeek R1 (Groq)	Groq	152%	96th
5	DeepSeek R1 (Together)	Together AI	152%	94th
6	Gemini 2.0 Flash Thinking	Google	144%	93rd
7	Claude 3.5 Sonnet	Anthropic	136%	92nd
8	Gemini 2.5 Flash	Google	136%	91st
9	o1-mini	OpenAI	136%	90th
10	ChatGPT-4o Latest	OpenAI	132%	89th
11	QwQ 32B	Alibaba	128%	88th
12	GPT-4o (Aug 2024)	OpenAI	124%	87th
13	DeepSeek R1 Distill Llama 70B	DeepSeek	120%	86th
14	Command A	Cohere	112%	84th
15	DeepSeek R1 Distill Qwen 32B	DeepSeek	112%	83rd
16	Llama 3.1 405B (Fireworks)	Fireworks AI	112%	82nd
17	GPT-4 Turbo	OpenAI	112%	81st
18	Grok 2	xAI	112%	80th
19	Llama 3.1 405B	Meta	112%	79th
20	Sonar Reasoning	Perplexity	112%	78th
21	Llama 3.1 405B (Together)	Together AI	112%	77th
22	Gemini 1.5 Pro	Google	104%	76th
23	Grok 2 Vision	xAI	104%	74th
24	Pixtral Large	Mistral AI	104%	73rd
25	Qwen 2.5 72B	Alibaba	104%	72nd
26	Qwen 2.5 72B (Together)	Together AI	104%	71st
27	o3	OpenAI	96%	70th
28	Amazon Nova Pro	Amazon	96%	69th
29	Claude 3.5 Haiku	Anthropic	96%	68th
30	Llama 3.3 70B (Fireworks)	Fireworks AI	96%	67th
31	Llama 3.3 70B (Groq)	Groq	96%	66th
32	Llama 3.3 70B	Meta	96%	64th
33	Mistral Medium 3	Mistral AI	96%	63rd
34	Llama 3.3 70B (Together)	Together AI	96%	62nd
35	o1	OpenAI	94%	61st
36	DeepSeek R1	DeepSeek	91%	60th
37	Gemini 2.5 Pro	Google	90.5%	59th
38	Llama 3.2 90B Vision	Meta	88%	58th
39	Claude Opus 4	Anthropic	85.4%	57th
40	o3-mini	OpenAI	84%	56th
41	DeepSeek V2.5	DeepSeek	80%	54th
42	Mixtral 8x22B (Fireworks)	Fireworks AI	80%	53rd
43	Sonar Pro	Perplexity	80%	52nd
44	WizardLM-2 8x22B	Microsoft	80%	51st
45	DeepSeek V3	DeepSeek	78.5%	50th
46	Claude Sonnet 4	Anthropic	78.3%	49th
47	Llama 4 Maverick	Meta	78%	48th
48	GPT-4o	OpenAI	76.6%	47th
49	Qwen 2.5 Max	Alibaba	76%	46th
50	Llama 3.1 70B	Meta	76%	44th
51	Phi-3.5 MoE	Microsoft	76%	43rd
52	o4-mini	OpenAI	75%	42nd
53	Grok 3	xAI	74%	41st
54	Mistral Large	Mistral AI	74%	40th
55	Gemini 2.0 Flash	Google	73.5%	39th
56	Llama 4 Scout	Meta	72.5%	38th
57	Gemini 1.5 Flash	Google	72%	37th
58	Gemma 2 27B	Google	72%	36th
59	Phi-4	Microsoft	72%	34th
60	GPT-4o Mini	OpenAI	70.2%	33rd
61	GPT-4.1	OpenAI	70%	32nd
62	Claude Haiku 4	Anthropic	68.2%	31st
63	Yi-Large	01.AI	68%	30th
64	Gemini 2.0 Flash Lite	Google	65%	29th
65	Mistral Small	Mistral AI	62.5%	28th
66	Grok 3-mini	xAI	62%	27th
67	Command R+	Cohere	60%	26th
68	GPT-4.1 mini	OpenAI	60%	24th
69	Amazon Nova Lite	Amazon	56%	23rd
70	Gemma 2 9B (Groq)	Groq	56%	22nd
71	Phi-3 Medium	Microsoft	56%	21st
72	Yi-Lightning	01.AI	52%	20th
73	Command R	Cohere	52%	19th
74	GPT-4.1 nano	OpenAI	50%	18th
75	Gemma 2 9B	Google	48%	17th
76	Mixtral 8x7B (Groq)	Groq	48%	16th
77	Llama 3.2 11B Vision	Meta	48%	14th
78	Phi-3.5 Mini	Microsoft	48%	13th
79	Qwen 2.5 7B	Alibaba	48%	12th
80	Sonar	Perplexity	48%	11th
81	InternLM 2.5 20B	Shanghai AI Lab	44%	10th
82	Gemini 1.5 Flash 8B	Google	40%	9th
83	Mistral Nemo 12B	Mistral AI	40%	8th
84	Amazon Nova Micro	Amazon	40%	7th
85	Command R7B	Cohere	40%	6th
86	GPT-3.5 Turbo	OpenAI	40%	4th
87	Llama 3.1 8B (Groq)	Groq	40%	3rd
88	Llama 3.1 8B	Meta	40%	2nd
89	Mistral 7B	Mistral AI	40%	1st
90	Mistral 7B (Together)	Together AI	40%	0th
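
The Percentile column appears to depend on rank alone: with 90 models, rank r maps to 100 × (90 − r) / 90 rounded to the nearest integer, which is why models with identical scores still get different percentiles. The Python sketch below reproduces that apparent rule. It is inferred from the table values, not documented by the site, and the function names are hypothetical.

```python
def rank_percentile(rank: int, total: int) -> int:
    """Percentage of ranked models that sit below `rank`, as an integer.

    Assumption: this rule is inferred from the leaderboard table,
    not taken from any published methodology.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    # Integer round-half-up of 100 * (total - rank) / total.
    return (200 * (total - rank) + total) // (2 * total)


def ordinal(n: int) -> str:
    """Format a non-negative integer with its English ordinal suffix."""
    if 11 <= n % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


if __name__ == "__main__":
    for rank in (1, 5, 6, 90):
        print(rank, ordinal(rank_percentile(rank, 90)))
```

Under this rule, rank 1 of 90 gives the 99th percentile, rank 5 the 94th, and rank 90 the 0th, matching the table.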

What MATH Tests

MATH tests multi-step problem solving in algebra, geometry, number theory, counting, probability, and calculus. Models must show their reasoning and produce exact answers. The problems are harder than the math questions in MMLU.
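
Because answers must be exact, a score is essentially the share of problems whose final answer matches the reference. The Python sketch below illustrates one way that could be computed, assuming the model marks its final answer with `\boxed{...}` as the MATH reference solutions do. The helper names and the whitespace-only normalization are assumptions for illustration, not the leaderboard's actual grading code; real graders usually normalize LaTeX more aggressively.

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward, tracking nested braces such as \frac{1}{2}.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalize(answer: Optional[str]) -> Optional[str]:
    """Hypothetical normalization: drop all whitespace only."""
    return None if answer is None else "".join(answer.split())


def accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy in percent over paired outputs and references."""
    if not references or len(outputs) != len(references):
        raise ValueError("need equal-length, non-empty inputs")
    correct = sum(
        normalize(extract_boxed(out)) == normalize(ref)
        for out, ref in zip(outputs, references)
    )
    return 100.0 * correct / len(references)
```

For example, two correct answers out of three problems would give an accuracy of about 66.7%.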

Score Range

0–100% (human expert ~90%)

Source

UC Berkeley (Hendrycks et al.)
