ELO Rating · 90 models ranked

Reasoning ELO Leaderboard 2026

Reasoning ELO is the Chatbot Arena leaderboard filtered to hard reasoning and math problems. It measures how well models solve multi-step logic, quantitative reasoning, and complex problem-solving.

Quick Answer

The best model on Reasoning ELO in 2026 is Claude Opus 4 by Anthropic, scoring 1503 ELO. Runner-up: Gemini 2.5 Pro (1430).

#	Model	Provider	Score	Percentile
🥇	Claude Opus 4	Anthropic	1503 ELO	99th
🥈	Gemini 2.5 Pro	Google	1430 ELO	98th
🥉	o3	OpenAI	1350 ELO	97th
4	DeepSeek R1	DeepSeek	1350 ELO	96th
5	o1	OpenAI	1330 ELO	94th
6	Qwen 3 235B MoE	Alibaba	1320 ELO	93rd
7	Gemini Experimental 1206	Google	1310 ELO	92nd
8	DeepSeek R1 (Groq)	Groq	1300 ELO	91st
9	DeepSeek R1 (Together)	Together AI	1300 ELO	90th
10	Grok 3	xAI	1295 ELO	89th
11	o3-mini	OpenAI	1295 ELO	88th
12	GPT-4.5	OpenAI	1290 ELO	87th
13	Gemini 2.0 Flash Thinking	Google	1290 ELO	86th
14	o1-mini	OpenAI	1280 ELO	84th
15	Llama 4 Maverick	Meta	1275 ELO	83rd
16	Claude Sonnet 4	Anthropic	1275 ELO	82nd
17	o4-mini	OpenAI	1275 ELO	81st
18	Claude 3.5 Sonnet	Anthropic	1270 ELO	80th
19	Gemini 2.5 Flash	Google	1270 ELO	79th
20	QwQ 32B	Alibaba	1270 ELO	78th
21	ChatGPT-4o Latest	OpenAI	1265 ELO	77th
22	DeepSeek V3	DeepSeek	1260 ELO	76th
23	DeepSeek R1 Distill Llama 70B	DeepSeek	1260 ELO	74th
24	GPT-4o (Aug 2024)	OpenAI	1255 ELO	73rd
25	GPT-4o	OpenAI	1250 ELO	72nd
26	DeepSeek R1 Distill Qwen 32B	DeepSeek	1250 ELO	71st
27	Llama 3.1 405B (Fireworks)	Fireworks AI	1250 ELO	70th
28	Llama 3.1 405B	Meta	1250 ELO	69th
29	Sonar Reasoning	Perplexity	1250 ELO	68th
30	Llama 3.1 405B (Together)	Together AI	1250 ELO	67th
31	Qwen 2.5 Max	Alibaba	1240 ELO	66th
32	Command A	Cohere	1240 ELO	64th
33	GPT-4 Turbo	OpenAI	1240 ELO	63rd
34	Grok 2	xAI	1240 ELO	62nd
35	Gemini 2.0 Flash	Google	1230 ELO	61st
36	Mistral Large	Mistral AI	1230 ELO	60th
37	Gemini 1.5 Pro	Google	1230 ELO	59th
38	Grok 2 Vision	xAI	1230 ELO	58th
39	Pixtral Large	Mistral AI	1230 ELO	57th
40	Qwen 2.5 72B	Alibaba	1230 ELO	56th
41	Qwen 2.5 72B (Together)	Together AI	1230 ELO	54th
42	Llama 4 Scout	Meta	1220 ELO	53rd
43	Amazon Nova Pro	Amazon	1220 ELO	52nd
44	Claude 3.5 Haiku	Anthropic	1220 ELO	51st
45	Llama 3.3 70B (Fireworks)	Fireworks AI	1220 ELO	50th
46	Llama 3.3 70B (Groq)	Groq	1220 ELO	49th
47	Llama 3.3 70B	Meta	1220 ELO	48th
48	Mistral Medium 3	Mistral AI	1220 ELO	47th
49	Llama 3.3 70B (Together)	Together AI	1220 ELO	46th
50	Llama 3.2 90B Vision	Meta	1210 ELO	44th
51	Sonar Pro	Perplexity	1210 ELO	43rd
52	DeepSeek V2.5	DeepSeek	1200 ELO	42nd
53	Mixtral 8x22B (Fireworks)	Fireworks AI	1200 ELO	41st
54	GPT-4 1	OpenAI	1200 ELO	40th
55	WizardLM-2 8x22B	Microsoft	1200 ELO	39th
56	Llama 3.1 70B	Meta	1195 ELO	38th
57	Phi-3.5 MoE	Microsoft	1195 ELO	37th
58	Gemini 1.5 Flash	Google	1190 ELO	36th
59	Gemma 2 27B	Google	1190 ELO	34th
60	Claude Haiku 4	Anthropic	1185 ELO	33rd
61	Yi-Large	01.AI	1185 ELO	32nd
62	GPT-4o Mini	OpenAI	1180 ELO	31st
63	GPT-4 1.5-mini	OpenAI	1180 ELO	30th
64	Grok 3-mini	xAI	1175 ELO	29th
65	Command R+	Cohere	1170 ELO	28th
66	Amazon Nova Lite	Amazon	1170 ELO	27th
67	Gemma 2 9B (Groq)	Groq	1170 ELO	26th
68	Phi-3 Medium	Microsoft	1170 ELO	24th
69	Yi-Lightning	01.AI	1165 ELO	23rd
70	Gemini 2.0 Flash Lite	Google	1160 ELO	22nd
71	Gemma 2 9B	Google	1160 ELO	21st
72	Mixtral 8x7B (Groq)	Groq	1160 ELO	20th
73	Llama 3.2 11B Vision	Meta	1160 ELO	19th
74	Phi-3.5 Mini	Microsoft	1160 ELO	18th
75	Qwen 2.5 7B	Alibaba	1160 ELO	17th
76	Sonar	Perplexity	1160 ELO	16th
77	InternLM 2.5 20B	Shanghai AI Lab	1155 ELO	14th
78	Mistral Small	Mistral AI	1150 ELO	13th
79	Gemini 1.5 Flash 8B	Google	1150 ELO	12th
80	GPT-4 1.5-nano	OpenAI	1150 ELO	11th
81	Phi-4	Microsoft	1140 ELO	10th
82	Mistral Nemo 12B	Mistral AI	1140 ELO	9th
83	Amazon Nova Micro	Amazon	1130 ELO	8th
84	Command R7B	Cohere	1120 ELO	7th
85	GPT-3.5 Turbo	OpenAI	1120 ELO	6th
86	Llama 3.1 8B (Groq)	Groq	1120 ELO	4th
87	Llama 3.1 8B	Meta	1120 ELO	3rd
88	Command R	Cohere	1110 ELO	2nd
89	Mistral 7B	Mistral AI	1100 ELO	1st
90	Mistral 7B (Together)	Together AI	1100 ELO	0th
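
The page does not document how the Percentile column is derived. The values in the table are consistent with the share of the 90 models ranked below a given row, rounded to the nearest integer. The sketch below reproduces that inferred rule; the function names `percentile_from_rank` and `ordinal` are hypothetical, and the formula is an assumption read off the table, not a stated method.

```python
def percentile_from_rank(rank: int, total: int = 90) -> int:
    """Inferred rule (an assumption): percent of models ranked below `rank`."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank) / total)


def ordinal(n: int) -> str:
    """Format a non-negative integer with its English ordinal suffix."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


if __name__ == "__main__":
    # Spot-check a few ranks against the table's Percentile column.
    for rank in (1, 5, 6, 45, 90):
        print(rank, ordinal(percentile_from_rank(rank)))
```

Because the rule depends only on rank, models with identical scores (for example o3 and DeepSeek R1 at 1350) still receive different percentiles.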

What Reasoning ELO Tests

Human preference on reasoning-heavy tasks: math word problems, logic puzzles, structured analysis. A higher score means humans find the model's reasoning more sound and useful.
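
To give a rating gap some intuition, the sketch below applies the standard Elo expected-score formula, under which a 400-point lead corresponds to 10:1 odds. Both the 400-point scale and the use of this plain formula are assumptions here; the leaderboard's actual rating fit may differ, and `expected_win_rate` is a hypothetical helper name.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expected score: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Scores taken from the table above.
    claude_opus_4 = 1503
    gemini_2_5_pro = 1430
    p = expected_win_rate(claude_opus_4, gemini_2_5_pro)
    print(f"Expected preference rate for the top model: {p:.2f}")
```

Under these assumptions, the 73-point gap between the top two models works out to roughly a 60% expected preference rate for the leader in a head-to-head comparison, so even the largest gap on the board is far from a guaranteed win.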

Score Range

1100–1503 (average ~1210)

Source

LMSYS Chatbot Arena, Hard Prompts category
