LLM Rate Limits in 2026: GPT-4o, Claude, Groq, Gemini
By Aniket Nigam. Published 2026-04-15.
Quick answer
OpenAI runs a 5-tier system that auto-upgrades on spend. Anthropic runs Build, Scale, and Custom tiers with manual review above Scale. Groq keeps a hard split between dev and prod with a 30-day qualification window. Gemini ties quota to Google Cloud projects and does not auto-upgrade. Across all four, TPM is the limit you hit in production, not RPM.
Why this post exists
Official rate-limit pages go stale within a quarter. OpenAI quietly raised Tier 4 TPM on GPT-4o from 1.2M to 2M in January 2026. Anthropic added a new "Scale Plus" tier for Sonnet 4.5 that the public docs did not show until March. Gemini 2.5 Pro has different quotas than 2.5 Flash on the same project.
I rebuilt this table every quarter for a client work deck. The version below reflects what the provider dashboards actually show on April 14, 2026, cross-checked against three paying accounts I have access to.
Table of contents
- RPM vs TPM vs RPD: what each one measures
- OpenAI tiers for GPT-4o and o3 (April 2026)
- Anthropic tiers for Claude Sonnet 4.5 and Opus 4
- Groq tiers for Llama 4, Kimi K2, and GPT-OSS 120B
- Gemini quotas for 2.5 Pro and 2.5 Flash
- Reading the 429 error payload from each provider
- Retry code in Python and TypeScript
- Multi-provider fallback strategy
- How to actually get a tier increase
1. RPM vs TPM vs RPD
Three numbers, easy to confuse.
RPM is requests per minute. Each API call counts as one request regardless of size. A 50-token ping and a 50,000-token document summary both count as one.
TPM is tokens per minute, summed across input and output. A single 40,000-token RAG request with a 2,000-token answer burns 42,000 against your TPM. Three of those in a minute will trip a 120K TPM limit even though you are at 3 RPM.
RPD is requests per day. Free tiers use it. Paid tiers mostly do not.
TPM is the limit production systems hit first. I track both separately in Datadog and alert at 70% of the provisioned TPM because bursty traffic can double the rolling average in 15 seconds.
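That 70% alert can also be approximated client-side with a rolling one-minute window, so you catch bursts before the provider does. A minimal sketch; the class name and threshold default are mine, not from any provider SDK:

```python
import time
from collections import deque

class TpmTracker:
    """Rolling one-minute token counter with an alert threshold."""

    def __init__(self, provisioned_tpm: int, alert_fraction: float = 0.7):
        self.provisioned_tpm = provisioned_tpm
        self.alert_fraction = alert_fraction
        self.events: deque = deque()  # (timestamp, tokens) pairs

    def record(self, tokens: int, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))

    def tokens_last_minute(self, now: float = None) -> int:
        now = time.monotonic() if now is None else now
        # Drop events older than 60 seconds from the left of the window.
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def should_alert(self, now: float = None) -> bool:
        return self.tokens_last_minute(now) >= self.alert_fraction * self.provisioned_tpm
```

Feeding it the example above: three 42,000-token requests inside one minute put you at 126,000 against a 120K cap, so the tracker fires well before the third request lands.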
2. OpenAI: GPT-4o and o3 tiers, April 2026
OpenAI runs 5 tiers. Tier upgrades trigger on cumulative spend and paid time since account creation. The thresholds shifted in late 2025 and the current map is:
| Tier | Qualifier | GPT-4o RPM | GPT-4o TPM | o3 RPM | o3 TPM |
|---|---|---|---|---|---|
| Free | None | 3 | 40,000 | 0 | 0 |
| Tier 1 | $5 paid | 500 | 30,000 | 60 | 30,000 |
| Tier 2 | $50 + 7 days | 5,000 | 450,000 | 500 | 150,000 |
| Tier 3 | $100 + 7 days | 10,000 | 800,000 | 1,000 | 300,000 |
| Tier 4 | $250 + 14 days | 20,000 | 2,000,000 | 3,000 | 600,000 |
| Tier 5 | $1,000 + 30 days | 30,000 | 4,000,000 | 5,000 | 1,500,000 |
The big change in 2026: o3 now has separate quotas from GPT-4o. Before March, one shared tier covered both. If you migrated reasoning traffic to o3 recently, check your dashboard.
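You do not have to guess which quota you are on: OpenAI echoes live quota state in response headers (`x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, and the `-requests` counterparts). A sketch that pulls them out of any headers mapping; the function name and the -1 sentinel are my choices:

```python
def read_openai_quota(headers: dict) -> dict:
    """Extract RPM/TPM limits and remaining budget from OpenAI response headers."""
    keys = {
        "rpm_limit": "x-ratelimit-limit-requests",
        "rpm_remaining": "x-ratelimit-remaining-requests",
        "tpm_limit": "x-ratelimit-limit-tokens",
        "tpm_remaining": "x-ratelimit-remaining-tokens",
    }
    # Missing headers map to -1 so callers can tell "absent" from "zero budget".
    return {name: int(headers.get(header, -1)) for name, header in keys.items()}
```

With the official Python SDK, the raw headers are reachable via the `with_raw_response` wrapper on the call you are already making.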
3. Anthropic: Claude Sonnet 4.5 and Opus 4
Anthropic uses three public tiers plus a custom option.
| Tier | Qualifier | Sonnet 4.5 RPM | Sonnet 4.5 TPM | Opus 4 RPM | Opus 4 TPM |
|---|---|---|---|---|---|
| Build | Default | 50 | 50,000 | 20 | 20,000 |
| Scale | $100 + 30 days | 1,000 | 400,000 | 200 | 80,000 |
| Scale Plus | $1,000 + 60 days | 4,000 | 1,500,000 | 800 | 300,000 |
| Custom | Sales review | Negotiated | Negotiated | Negotiated | Negotiated |
Scale Plus rolled out in March 2026 and you have to ask for it in the dashboard. Anthropic also imposes an output token per minute cap that is half the TPM. It bites on long-generation workloads like code synthesis.
Prompt caching reads count at 10% of full input TPM. If your cache hit rate is above 60%, you can run effective input volume 3-4x higher than your stated TPM cap.
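That 10% accounting rule reduces to a single multiplier. A sketch of the arithmetic; the function name is mine:

```python
def effective_input_multiplier(cache_hit_rate: float, cache_read_weight: float = 0.1) -> float:
    """Multiplier on input throughput under a fixed TPM cap when cache reads
    count at a fraction of full input tokens (Anthropic: 10%)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    # Each logical input token costs either full weight (miss) or the
    # discounted weight (hit) against TPM accounting.
    cost_per_token = cache_hit_rate * cache_read_weight + (1 - cache_hit_rate)
    return 1 / cost_per_token
```

At an 80% hit rate the multiplier is 1 / (0.8 × 0.1 + 0.2) ≈ 3.6x, which is where the 3-4x figure above comes from.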
4. Groq: Llama 4, Kimi K2, GPT-OSS 120B
Groq is the fastest of the four by a wide margin (580 output TPS on Llama 4 70B) but has the tightest absolute quotas. The platform runs two modes:
| Tier | Qualifier | Llama 4 70B RPM | Llama 4 70B TPM | Kimi K2 RPM | GPT-OSS 120B RPM |
|---|---|---|---|---|---|
| Dev | Default | 30 | 14,400 | 30 | 30 |
| Prod | 30-day qualification | 600 | 300,000 | 300 | 300 |
The dev-to-prod move is not automatic. You file a request, Groq runs a review (usually 2-5 business days in 2026), and you get a flat RPM bump. No tier ladder. No auto-upgrade.
Groq also has a daily TPD cap on dev (1M total tokens per day). Prod removes it.
5. Gemini: 2.5 Pro, 2.5 Flash, 2.0 Flash Lite
Gemini quotas are per Google Cloud project, not per API key. If you are hitting limits, check whether another service in the same project is eating your quota.
| Tier | 2.5 Pro RPM | 2.5 Pro TPM | 2.5 Flash RPM | 2.5 Flash TPM |
|---|---|---|---|---|
| Free | 2 | 32,000 | 10 | 250,000 |
| Paid | 360 | 4,000,000 | 2,000 | 4,000,000 |
| Enterprise | Via Cloud sales | Custom | Via Cloud sales | Custom |
Gemini does not publish a tier ladder. You request an increase via the Google Cloud quota console, it goes to a human reviewer, and the SLA is 1-3 business days in my experience.
6. Reading the 429 payload
Each provider returns different error bodies on rate-limit. If you log them, you can route retries intelligently.
OpenAI returns error.code = "rate_limit_exceeded" with error.message naming RPM, TPM, or RPD. Header retry-after gives seconds to wait.
Anthropic returns error.type = "rate_limit_error" with error.message specifying input vs output tokens. Header retry-after is populated.
Groq returns HTTP 429 with error.type = "rate_limit_exceeded". Header retry-after-ms is in milliseconds, not seconds.
Gemini returns HTTP 429 with a code: 8 (RESOURCE_EXHAUSTED) in the body. No retry-after header. You guess.
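Those four shapes normalize into one "how long to wait" function. A sketch under the header conventions above; the Gemini fallback backoff is my choice, since there is no documented delay to read:

```python
import random

def retry_delay_seconds(provider: str, headers: dict, attempt: int = 0) -> float:
    """Map a 429 response to a wait time in seconds, per provider conventions."""
    if provider in ("openai", "anthropic"):
        ra = headers.get("retry-after")
        if ra is not None:
            return float(ra)
    elif provider == "groq":
        ra_ms = headers.get("retry-after-ms")  # milliseconds, not seconds
        if ra_ms is not None:
            return float(ra_ms) / 1000
    # Gemini, or any missing header: exponential backoff with jitter.
    return min(60.0, float(2 ** attempt)) + random.uniform(0, 1)
```

The Groq branch is the one people get wrong: treat `retry-after-ms` as seconds and you wait 1,500 seconds instead of 1.5.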
7. Retry code with jitter
Here is the Python retry I use with tenacity. It reads retry-after when available and adds jitter so clustered pods do not stampede.
```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)
from anthropic import RateLimitError as AnthropicLimit
from openai import RateLimitError as OpenAILimit

@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential_jitter(initial=1, max=60, jitter=2),
    retry=retry_if_exception_type((AnthropicLimit, OpenAILimit)),
)
def call_with_backoff(client, **kwargs):
    return client.messages.create(**kwargs)
```
The TypeScript version uses a simple token bucket. I keep this in a shared module and wrap every provider call:
```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
  ) {
    this.tokens = capacity;
  }

  async take(cost = 1) {
    while (true) {
      const now = Date.now();
      const elapsed = (now - this.lastRefill) / 1000;
      this.tokens = Math.min(
        this.capacity,
        this.tokens + elapsed * this.refillPerSec,
      );
      this.lastRefill = now;
      if (this.tokens >= cost) {
        this.tokens -= cost;
        return;
      }
      const wait = ((cost - this.tokens) / this.refillPerSec) * 1000;
      await new Promise((r) => setTimeout(r, wait + Math.random() * 150));
    }
  }
}

const openaiTpm = new TokenBucket(800_000, 800_000 / 60);
await openaiTpm.take(estimatedTokens);
```
Size the bucket at 85% of your provisioned TPM; that leaves headroom for the roughly 15% gap between client-side token estimation and server-side counting.
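If you do not want a tokenizer dependency on the hot path, a crude characters-per-token estimate usually stays inside that 15% band for English prose. The 4-chars-per-token ratio is a rule of thumb, not a provider spec:

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for bucket accounting; rounds up to err safe."""
    return max(1, math.ceil(len(text) / chars_per_token))
```

Swap in tiktoken (OpenAI) or the provider's count endpoint when the estimate needs to be tight; the rule of thumb drifts badly on code and non-English text.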
8. Multi-provider fallback
Three patterns I have shipped that actually work:
- Primary to Anthropic Sonnet 4.5, fallback to GPT-4o on 429, last resort Gemini 2.5 Flash. Tune prompt per provider because formatting differs.
- Groq for user-facing latency-critical calls, OpenAI for everything else. When Groq 429s, queue to OpenAI with an "I am thinking" affordance in the UI.
- OpenRouter as the provider layer. One API key, automatic failover across providers, minor latency cost. Worth it for teams that do not want to build this plumbing.
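The first pattern reduces to a loop over callables in priority order. A provider-agnostic sketch; the exception class and call signatures are placeholders for whatever your SDK wrappers raise and accept:

```python
class RateLimited(Exception):
    """Placeholder for your SDK's 429 exception."""

def call_with_fallback(providers, prompt):
    """Try (name, callable) pairs in priority order; advance only on 429s."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            errors.append((name, exc))  # rate limits only; other errors propagate
    raise RuntimeError(f"all providers rate-limited: {errors}")
```

The deliberate choice here is that only rate-limit errors trigger fallback: a 400 from a malformed prompt will fail identically on every provider, so retrying it elsewhere just burns quota.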
9. How to get a tier increase
Five things that worked for me or for clients:
- Spend the threshold. Most tier bumps on OpenAI and Anthropic are automatic once you cross the dollar floor and the time window
- File a support ticket with a concrete workload description (QPS, avg prompt size, growth curve)
- Show compliance paperwork. SOC2 report plus a signed DPA moves you up the reviewer's queue
- Ask your account executive directly above $10K/month of spend
- If you are on Anthropic, consider an AWS or Vertex deployment for higher quotas via hyperscaler capacity pools
Groq in particular responds well to a specific latency need plus a projected monthly spend. Vague "we need more" requests get sat on for two weeks.
FAQ
Do cached input tokens count against TPM?
On Anthropic, cache reads bill at 10% of standard input tokens for TPM accounting. On OpenAI, cached tokens count normally against TPM but are discounted on the invoice. On Groq, prompt caching is not billed separately as of April 2026.
What is the absolute highest tier on OpenAI?
Tier 5 publicly. Enterprise accounts get custom quotas negotiated through sales, typically 10-30x Tier 5 numbers.
Can I combine multiple API keys to scale past a tier?
Rate limits are per organization on OpenAI and Anthropic, not per key. Creating additional keys inside the same org does nothing. You would need separate organizations, which violates the terms on both platforms.
Do batch API calls share the synchronous rate limit?
No. OpenAI Batch, Anthropic Message Batches, and Gemini Batch all have their own separate quotas. Moving offline work to batch frees real-time capacity.
How accurate is client-side token counting?
Tiktoken on OpenAI lands within 1% of the server count. Anthropic's SDK token counter is within 2%. Gemini's countTokens endpoint is the server-side source of truth but adds a network round trip.
Actionable takeaways
- Alert on TPM at 70% of the provisioned cap, not RPM
- Log the 429 payload before retrying so you can route around the right limit
- Use exponential backoff with jitter, never flat sleep-and-retry loops
- Move batch-eligible workloads off the synchronous API to free headroom
- Keep one multi-provider fallback wired up before you hit a real incident
- Re-check your tier table every quarter; providers change quotas without announcement
Sources
- OpenAI rate limits documentation, platform.openai.com/docs/guides/rate-limits, accessed 2026-04-14
- Anthropic rate limits documentation, docs.anthropic.com/en/api/rate-limits, accessed 2026-04-14
- Groq API reference, console.groq.com/docs/rate-limits, accessed 2026-04-14
- Google AI Studio quota page, ai.google.dev/gemini-api/docs/rate-limits, accessed 2026-04-14
- Live dashboards on three paying production accounts (one per major provider, plus a Groq Prod tier account)
Related: LLM API Rate Limits Explained, How to Choose an LLM API Provider, How to Reduce LLM API Costs.