Accuracy % · 90 models ranked

MATH Leaderboard 2026

The MATH benchmark tests mathematical problem-solving across five difficulty levels. Problems range from prealgebra to precalculus and are drawn from mathematics competitions. It remains one of the hardest benchmarks: early GPT-4 scored only 42%.

Quick Answer

The best model on MATH in 2026 is Qwen 3 235B MoE by Alibaba, scoring 168%. Runner-up: Gemini Experimental 1206 (160%).

# | Model | Score
🥇 | Qwen 3 235B MoE | 168%
🥈 | Gemini Experimental 1206 | 160%
🥉 | GPT-4.5 | 152%
4 | DeepSeek R1 (Groq) | 152%
5 | DeepSeek R1 (Together) | 152%
6 | Gemini 2.0 Flash Thinking | 144%
7 | Claude 3.5 Sonnet | 136%
8 | Gemini 2.5 Flash | 136%
9 | o1-mini | 136%
10 | ChatGPT-4o Latest | 132%
11 | QwQ 32B | 128%
12 | GPT-4o (Aug 2024) | 124%
13 | DeepSeek R1 Distill Llama 70B | 120%
14 | Command A | 112%
15 | DeepSeek R1 Distill Qwen 32B | 112%
16 | Llama 3.1 405B (Fireworks) | 112%
17 | GPT-4 Turbo | 112%
18 | Grok 2 | 112%
19 | Llama 3.1 405B | 112%
20 | Sonar Reasoning | 112%
21 | Llama 3.1 405B (Together) | 112%
22 | Gemini 1.5 Pro | 104%
23 | Grok 2 Vision | 104%
24 | Pixtral Large | 104%
25 | Qwen 2.5 72B | 104%
26 | Qwen 2.5 72B (Together) | 104%
27 | o3 | 96%
28 | Amazon Nova Pro | 96%
29 | Claude 3.5 Haiku | 96%
30 | Llama 3.3 70B (Fireworks) | 96%
31 | Llama 3.3 70B (Groq) | 96%
32 | Llama 3.3 70B | 96%
33 | Mistral Medium 3 | 96%
34 | Llama 3.3 70B (Together) | 96%
35 | o1 | 94%
36 | DeepSeek R1 | 91%
37 | Gemini 2.5 Pro | 90.5%
38 | Llama 3.2 90B Vision | 88%
39 | Claude Opus 4 | 85.4%
40 | o3-mini | 84%
41 | DeepSeek V2.5 | 80%
42 | Mixtral 8x22B (Fireworks) | 80%
43 | Sonar Pro | 80%
44 | WizardLM-2 8x22B | 80%
45 | DeepSeek V3 | 78.5%
46 | Claude Sonnet 4 | 78.3%
47 | Llama 4 Maverick | 78%
48 | GPT-4o | 76.6%
49 | Qwen 2.5 Max | 76%
50 | Llama 3.1 70B | 76%
51 | Phi-3.5 MoE | 76%
52 | o4-mini | 75%
53 | Grok 3 | 74%
54 | Mistral Large | 74%
55 | Gemini 2.0 Flash | 73.5%
56 | Llama 4 Scout | 72.5%
57 | Gemini 1.5 Flash | 72%
58 | Gemma 2 27B | 72%
59 | Phi-4 | 72%
60 | GPT-4o Mini | 70.2%
61 | GPT-4 1 | 70%
62 | Claude Haiku 4 | 68.2%
63 | Yi-Large | 68%
64 | Gemini 2.0 Flash Lite | 65%
65 | Mistral Small | 62.5%
66 | Grok 3-mini | 62%
67 | Command R+ | 60%
68 | GPT-4 1.5-mini | 60%
69 | Amazon Nova Lite | 56%
70 | Gemma 2 9B (Groq) | 56%
71 | Phi-3 Medium | 56%
72 | Yi-Lightning | 52%
73 | Command R | 52%
74 | GPT-4 1.5-nano | 50%
75 | Gemma 2 9B | 48%
76 | Mixtral 8x7B (Groq) | 48%
77 | Llama 3.2 11B Vision | 48%
78 | Phi-3.5 Mini | 48%
79 | Qwen 2.5 7B | 48%
80 | Sonar | 48%
81 | InternLM 2.5 20B | 44%
82 | Gemini 1.5 Flash 8B | 40%
83 | Mistral Nemo 12B | 40%
84 | Amazon Nova Micro | 40%
85 | Command R7B | 40%
86 | GPT-3.5 Turbo | 40%
87 | Llama 3.1 8B (Groq) | 40%
88 | Llama 3.1 8B | 40%
89 | Mistral 7B | 40%
90 | Mistral 7B (Together) | 40%

What MATH Tests

MATH tests multi-step problem solving in algebra, geometry, number theory, counting and probability, and precalculus. Models must show their reasoning and produce exact answers. The problems are harder than the math questions in MMLU.
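
To make "exact answers" concrete, here is a minimal sketch of exact-match grading. It assumes the final answer is wrapped in \boxed{...}, the convention used in the MATH reference solutions. The function names and the whitespace-only normalization are illustrative assumptions, not this leaderboard's actual grader; real evaluation harnesses usually apply more elaborate equivalence checks.

```python
from typing import Optional


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Hypothetical, deliberately simple normalization: drop spaces only."""
    return answer.replace(" ", "")


def is_correct(model_solution: str, reference_answer: str) -> bool:
    """Exact match between the model's boxed answer and the reference."""
    predicted = extract_boxed(model_solution)
    return predicted is not None and normalize(predicted) == normalize(reference_answer)
```

Under this sketch, a solution ending in \boxed{\frac{1}{2}} matches the reference answer \frac{1}{2}, while a solution with no boxed answer is graded incorrect regardless of its reasoning.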

Score Range

0–100% (human expert ~90%)
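
As a rough illustration of how a score on this scale can be derived, the sketch below computes accuracy as the percentage of problems graded correct and checks that a reported value falls inside 0–100%. The function names and sample data are hypothetical; the source does not describe how its scores were computed.

```python
from typing import Sequence


def accuracy_percent(graded: Sequence[bool]) -> float:
    """Percentage of problems graded correct (one boolean per problem)."""
    if not graded:
        raise ValueError("no graded problems")
    return 100.0 * sum(graded) / len(graded)


def in_valid_range(score: float) -> bool:
    """An accuracy percentage must lie between 0 and 100 inclusive."""
    return 0.0 <= score <= 100.0


# Hypothetical run: 3 of 4 problems solved.
score = accuracy_percent([True, True, False, True])
print(score, in_valid_range(score))
```

A value outside this range cannot be an accuracy and usually points to a scaling or data-entry problem upstream.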
