Methodology · Last reviewed 2026-04-15
How LLMversus ranks models
Every number on this site comes from one of four places: a provider pricing page, an official benchmark leaderboard, an independent measurement service, or a test I ran myself. Here is exactly how each column is sourced and how often it is refreshed.
1. Pricing data
Price per million input and output tokens is the single number most people come here to check. I pull pricing from two places, in this order of precedence:
- Provider pricing pages - the canonical source. Examples: OpenAI API pricing, Anthropic pricing, Google AI pricing, Mistral pricing.
- OpenRouter model list - a single JSON endpoint that aggregates current rates across providers. Useful for cross-checking and for open-weight models hosted on third-party inference platforms.
The scrape runs weekly, and again whenever a major release is announced. When the two sources disagree by more than a tenth of a cent per million tokens, I flag the row and re-check it by hand before publishing. Prompt-caching discounts and batch pricing are shown separately on the provider page and are not mixed into the headline cost column.
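The cross-check above can be sketched in a few lines. This is an illustrative sketch, not the production scraper; it assumes the shape of OpenRouter's models payload (each entry carries a `pricing` object with `prompt` and `completion` as USD-per-token strings), and the tolerance constant mirrors the tenth-of-a-cent rule stated above.

```python
# Sketch of the weekly pricing cross-check, not the production scraper.
# Assumed payload shape: each OpenRouter model entry has
# "pricing": {"prompt": "...", "completion": "..."} as USD-per-token strings.

TOLERANCE_USD = 0.001  # a tenth of a cent, per million tokens

def per_million(usd_per_token: str) -> float:
    """Convert a USD-per-token string to USD per million tokens."""
    return float(usd_per_token) * 1_000_000

def flag_discrepancies(provider_prices, openrouter_models):
    """Return model IDs whose input price differs beyond the tolerance.

    provider_prices: {model_id: usd_per_million_input} scraped from the
    canonical provider pricing page (the source of precedence).
    openrouter_models: the "data" list from the OpenRouter models endpoint.
    """
    flagged = []
    for entry in openrouter_models:
        model_id = entry["id"]
        if model_id not in provider_prices:
            continue  # not a tracked model
        or_price = per_million(entry["pricing"]["prompt"])
        if abs(or_price - provider_prices[model_id]) > TOLERANCE_USD:
            flagged.append(model_id)
    return flagged
```

Flagged rows are then re-checked by hand against the provider page before anything is published.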
2. Benchmark data
Quality scores come from three independent leaderboards. I do not publish self-reported scores from model cards unless they are also verified by one of the sources below.
- Arena ELO - pulled from the LMSYS Chatbot Arena leaderboard. This is blind pairwise human preference, aggregated across millions of votes. I treat Arena ELO as the best single proxy for “which model do users actually prefer on open-ended tasks.”
- MMLU, GPQA, HumanEval, MATH - pulled from the official benchmark sites and the model card of record. MMLU-Pro scores come from the TIGER-Lab MMLU-Pro leaderboard. GPQA results come from the original paper and the Papers With Code leaderboard.
- Independent aggregation - I cross-check against Artificial Analysis, which runs its own evaluations on hosted endpoints. When their number and the official number differ by more than two points, I note it on the model page.
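The two-point rule from the last bullet reduces to a simple comparison. A minimal sketch, with illustrative names; the real pipeline runs this per benchmark, per model:

```python
# Sketch of the two-point cross-check between an official leaderboard
# score and the independent Artificial Analysis measurement.

DISAGREEMENT_POINTS = 2.0

def needs_note(official_score: float, independent_score: float) -> bool:
    """True when the two sources differ by more than two points,
    which triggers a note on the model page."""
    return abs(official_score - independent_score) > DISAGREEMENT_POINTS

def benchmarks_to_note(scores):
    """scores: {benchmark_name: (official, independent)} -> names needing a note."""
    return [name for name, (off, ind) in scores.items() if needs_note(off, ind)]
```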
3. Speed: TTFT and tokens per second
Time to first token (TTFT) and sustained throughput are measured against the primary hosted endpoint for each model: OpenAI for GPT, the Anthropic API for Claude, the Gemini API for Google models, and the reference provider listed on the model page for open-weight models. Each measurement uses a rolling window of requests, each sent over a fresh HTTPS connection from a US-East region with a 1,024-token prompt and a 256-token completion, averaged across at least 20 requests.
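The reduction from raw timestamps to the published numbers looks roughly like this. The timestamps would come from a streaming API call; here they are plain inputs so the arithmetic is the focus (a sketch of the calculation, not the measurement harness):

```python
# Sketch: reduce per-request timestamps to TTFT and tokens/sec,
# then average at least 20 readings into one published measurement.
from statistics import mean

MIN_SAMPLES = 20

def reading(t_sent: float, t_first: float, t_done: float, completion_tokens: int):
    """One request -> (TTFT in seconds, sustained tokens/sec after first token)."""
    ttft = t_first - t_sent
    tps = completion_tokens / (t_done - t_first)
    return ttft, tps

def aggregate(samples):
    """Average a batch of (ttft, tps) readings; refuse undersized batches."""
    if len(samples) < MIN_SAMPLES:
        raise ValueError(f"need at least {MIN_SAMPLES} requests per measurement")
    ttfts, tpss = zip(*samples)
    return mean(ttfts), mean(tpss)
```

Throughput is computed from the span after the first token arrives, so a slow TTFT does not drag down the tokens-per-second figure.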
Where the public measurement from Artificial Analysis disagrees with mine by more than 15%, I defer to their number - they run a continuous measurement fleet and have a far larger sample size. My own readings exist mainly to catch cases where a provider has silently changed routing or region.
4. Context windows and capability flags
Context window, function-calling support, structured-output support, and vision capability come straight from the provider API reference. Where a provider advertises a window but silently caps it lower in practice (this happens - Claude long-context and Gemini 2M are the usual suspects), I note the practical limit alongside the advertised one.
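Locating a silent cap is a monotone search problem: if an n-token prompt fails, every longer prompt fails too. A sketch of how the practical limit could be found, where `accepts(n)` is a hypothetical stand-in for sending an n-token prompt and seeing whether the endpoint serves it without a context-length error:

```python
def practical_window(advertised: int, accepts, floor: int = 1024) -> int:
    """Binary-search the largest prompt length the endpoint actually accepts.

    `accepts(n)` is a hypothetical stand-in for a real API probe; it must be
    monotone (if n tokens fail, all larger prompts fail).
    """
    if accepts(advertised):
        return advertised  # no silent cap observed
    lo, hi = floor, advertised  # invariant: accepts(lo) holds, accepts(hi) fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if accepts(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

Each probe is one real (and possibly expensive) long-context request, so in practice the search starts from the advertised figure and a handful of round numbers below it.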
5. Update cadence
- Weekly: pricing scrape, OpenRouter sync, and a spot-check of the top ten models by Arena ELO.
- On release: full model page within 48 hours of any new model from OpenAI, Anthropic, Google, Mistral, Meta, xAI, DeepSeek, Alibaba Qwen, or Cohere.
- Monthly: benchmark refresh against the official leaderboards and a full speed remeasurement for every tracked model.
- Continuous: reader-submitted corrections sent to hello@llmversus.com - usually fixed the same day.
Every comparison page shows the last-verified date at the top. If you are reading a page older than two weeks, assume prices may have moved - click through to the provider link to confirm before you commit to a contract.
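The two-week freshness rule is the same check the site runs when rendering the banner. A minimal sketch of that rule:

```python
# Sketch of the staleness rule: more than two weeks without
# re-verification means "confirm against the provider page yourself".
from datetime import date

STALE_AFTER_DAYS = 14

def is_stale(last_verified: date, today: date) -> bool:
    """True when a page's last-verified date is over two weeks old."""
    return (today - last_verified).days > STALE_AFTER_DAYS
```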
6. Known limitations
I run LLMversus as a solo operator, so there are things I cannot do. I do not measure 99th-percentile latency under production load. I do not run sovereign-region endpoints. I do not benchmark fine-tuned variants. Where a leaderboard is gamed or contamination-prone (parts of MMLU and HumanEval fall into this category), I say so on the page rather than quietly trusting the number.
Who writes this
Every comparison and benchmark writeup on LLMversus is written by Aniket Nigam, the founder. Corrections to hello@llmversus.com.