LLM Rate Limits in 2026: GPT-4o, Claude, Groq, Gemini
By Aniket Nigam. Published 2026-04-15.
Quick answer
OpenAI runs a 5-tier system that auto-upgrades on spend. Anthropic runs Build, Scale, and Custom tiers with manual review above Scale. Groq keeps a hard split between dev and prod with a 30-day qualification window. Gemini ties quota to Google Cloud projects and does not auto-upgrade. Across all four, TPM is the limit you hit in production, not RPM.
Why this post exists
Official rate-limit pages go stale within a quarter. OpenAI quietly raised Tier 4 TPM on GPT-4o from 1.2M to 2M in January 2026. Anthropic added a new "Scale Plus" tier for Sonnet 4.5 that the public docs did not show until March. Gemini 2.5 Pro has different quotas than 2.5 Flash on the same project.
I rebuilt this table every quarter for a client work deck. The version below reflects what the provider dashboards actually show on April 14, 2026, cross-checked against three paying accounts I have access to.
Table of contents
- RPM vs TPM vs RPD: what each one measures
- OpenAI tiers for GPT-4o and o3 (April 2026)
- Anthropic tiers for Claude Sonnet 4.5 and Opus 4
- Groq tiers for Llama 4, Kimi K2, and GPT-OSS 120B
- Gemini quotas for 2.5 Pro and 2.5 Flash
- Reading the 429 error payload from each provider
- Retry code in Python and TypeScript
- Multi-provider fallback strategy
- How to actually get a tier increase
1. RPM vs TPM vs RPD
Three numbers, easy to confuse.
RPM is requests per minute. Each API call counts as one request regardless of size. A 50-token ping and a 50,000-token document summary both count as one.
TPM is tokens per minute, summed across input and output. A single 40,000-token RAG request with a 2,000-token answer burns 42,000 against your TPM. Three of those in a minute will trip a 120K TPM limit even though you are at 3 RPM.
RPD is requests per day. Free tiers use it. Paid tiers mostly do not.
TPM is the limit production systems hit first. I track both separately in Datadog and alert at 70% of the provisioned TPM because bursty traffic can double the rolling average in 15 seconds.
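That 70% alert can also be approximated client-side with a rolling one-minute window, so you catch bursts before the provider does. A minimal sketch; the class name and threshold default are mine, not from any provider SDK:

```python
import time
from collections import deque

class TpmTracker:
    """Rolling one-minute token counter with an alert threshold."""

    def __init__(self, provisioned_tpm: int, alert_fraction: float = 0.7):
        self.provisioned_tpm = provisioned_tpm
        self.alert_fraction = alert_fraction
        self.events: deque = deque()  # (timestamp, tokens) pairs

    def record(self, tokens: int, now: float = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))

    def tokens_last_minute(self, now: float = None) -> int:
        now = time.monotonic() if now is None else now
        # Drop events older than 60 seconds from the left of the window.
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()
        return sum(tokens for _, tokens in self.events)

    def should_alert(self, now: float = None) -> bool:
        return self.tokens_last_minute(now) >= self.alert_fraction * self.provisioned_tpm
```

Feeding it the example above: three 42,000-token requests inside one minute put you at 126,000 against a 120K cap, so the tracker fires well before the third request lands.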
2. OpenAI: GPT-4o and o3 tiers, April 2026
OpenAI runs 5 tiers. Tier upgrades trigger on cumulative spend and paid time since account creation. The thresholds shifted in late 2025 and the current map is:
| Tier | Qualifier | GPT-4o RPM | GPT-4o TPM | o3 RPM | o3 TPM |
|---|---|---|---|---|---|
| Free | None | 3 | 40,000 | 0 | 0 |
| Tier 1 | $5 paid | 500 | 30,000 | 60 | 30,000 |
| Tier 2 | $50 + 7 days | 5,000 | 450,000 | 500 | 150,000 |
| Tier 3 | $100 + 7 days | 10,000 | 800,000 | 1,000 | 300,000 |
| Tier 4 | $250 + 14 days | 20,000 | 2,000,000 | 3,000 | 600,000 |
| Tier 5 | $1,000 + 30 days | 30,000 | 4,000,000 | 5,000 | 1,500,000 |
The big change in 2026: o3 now has separate quotas from GPT-4o. Before March, one shared tier covered both. If you migrated reasoning traffic to o3 recently, check your dashboard.
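You do not have to guess which quota you are on: OpenAI echoes live quota state in response headers (`x-ratelimit-limit-tokens`, `x-ratelimit-remaining-tokens`, and the `-requests` counterparts). A sketch that pulls them out of any headers mapping; the function name and the -1 sentinel are my choices:

```python
def read_openai_quota(headers: dict) -> dict:
    """Extract RPM/TPM limits and remaining budget from OpenAI response headers."""
    keys = {
        "rpm_limit": "x-ratelimit-limit-requests",
        "rpm_remaining": "x-ratelimit-remaining-requests",
        "tpm_limit": "x-ratelimit-limit-tokens",
        "tpm_remaining": "x-ratelimit-remaining-tokens",
    }
    # Missing headers map to -1 so callers can tell "absent" from "zero budget".
    return {name: int(headers.get(header, -1)) for name, header in keys.items()}
```

With the official Python SDK, the raw headers are reachable via the `with_raw_response` wrapper on the call you are already making.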
3. Anthropic: Claude Sonnet 4.5 and Opus 4
Anthropic uses three public tiers plus a custom option.
| Tier | Qualifier | Sonnet 4.5 RPM | Sonnet 4.5 TPM | Opus 4 RPM | Opus 4 TPM |
|---|---|---|---|---|---|
| Build | Default | 50 | 50,000 | 20 | 20,000 |
| Scale | $100 + 30 days | 1,000 | 400,000 | 200 | 80,000 |
| Scale Plus | $1,000 + 60 days | 4,000 | 1,500,000 | 800 | 300,000 |
| Custom | Sales review | Negotiated | Negotiated | Negotiated | Negotiated |
Scale Plus rolled out in March 2026 and you have to ask for it in the dashboard. Anthropic also imposes an output token per minute cap that is half the TPM. It bites on long-generation workloads like code synthesis.
Prompt caching reads count at 10% of full input TPM. If your cache hit rate is above 60%, you can run effective input volume 3-4x higher than your stated TPM cap.
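That 10% accounting rule reduces to a single multiplier. A sketch of the arithmetic; the function name is mine:

```python
def effective_input_multiplier(cache_hit_rate: float, cache_read_weight: float = 0.1) -> float:
    """Multiplier on input throughput under a fixed TPM cap when cache reads
    count at a fraction of full input tokens (Anthropic: 10%)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    # Each logical input token costs either full weight (miss) or the
    # discounted weight (hit) against TPM accounting.
    cost_per_token = cache_hit_rate * cache_read_weight + (1 - cache_hit_rate)
    return 1 / cost_per_token
```

At an 80% hit rate the multiplier is 1 / (0.8 × 0.1 + 0.2) ≈ 3.6x, which is where the 3-4x figure above comes from.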
4. Groq: Llama 4, Kimi K2, GPT-OSS 120B
Groq is the fastest of the four by a wide margin (580 output TPS on Llama 4 70B) but has the tightest absolute quotas. The platform runs two modes:
| Tier | Qualifier | Llama 4 70B RPM | Llama 4 70B TPM | Kimi K2 RPM | GPT-OSS 120B RPM |
|---|---|---|---|---|---|
| Dev | Default | 30 | 14,400 | 30 | 30 |
| Prod | 30-day qualification | 600 | 300,000 | 300 | 300 |
The dev-to-prod move is not automatic. You file a request, Groq runs a review (usually 2-5 business days in 2026), and you get a flat RPM bump. No tier ladder. No auto-upgrade.
Groq also has a daily TPD cap on dev (1M total tokens per day). Prod removes it.
5. Gemini: 2.5 Pro, 2.5 Flash, 2.0 Flash Lite
Gemini quotas are per Google Cloud project, not per API key. If you are hitting limits, check whether another service in the same project is eating your quota.
| Tier | 2.5 Pro RPM | 2.5 Pro TPM | 2.5 Flash RPM | 2.5 Flash TPM |
|---|---|---|---|---|
| Free | 2 | 32,000 | 10 | 250,000 |
| Paid | 360 | 4,000,000 | 2,000 | 4,000,000 |
| Enterprise | Via Cloud sales | Custom | Via Cloud sales | Custom |
Gemini does not publish a tier ladder. You request an increase via the Google Cloud quota console, it goes to a human reviewer, and the SLA is 1-3 business days in my experience.
6. Reading the 429 payload
Each provider returns different error bodies on rate-limit. If you log them, you can route retries intelligently.
OpenAI returns error.code = "rate_limit_exceeded" with error.message naming RPM, TPM, or RPD. Header retry-after gives seconds to wait.
Anthropic returns error.type = "rate_limit_error" with error.message specifying input vs output tokens. Header retry-after is populated.
Groq returns HTTP 429 with error.type = "rate_limit_exceeded". Header retry-after-ms is in milliseconds, not seconds.
Gemini returns HTTP 429 with a code: 8 (RESOURCE_EXHAUSTED) in the body. No retry-after header. You guess.
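Those four shapes normalize into one "how long to wait" function. A sketch under the header conventions above; the Gemini fallback backoff is my choice, since there is no documented delay to read:

```python
import random

def retry_delay_seconds(provider: str, headers: dict, attempt: int = 0) -> float:
    """Map a 429 response to a wait time in seconds, per provider conventions."""
    if provider in ("openai", "anthropic"):
        ra = headers.get("retry-after")
        if ra is not None:
            return float(ra)
    elif provider == "groq":
        ra_ms = headers.get("retry-after-ms")  # milliseconds, not seconds
        if ra_ms is not None:
            return float(ra_ms) / 1000
    # Gemini, or any missing header: exponential backoff with jitter.
    return min(60.0, float(2 ** attempt)) + random.uniform(0, 1)
```

The Groq branch is the one people get wrong: treat `retry-after-ms` as seconds and you wait 1,500 seconds instead of 1.5.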
7. Retry code with jitter
Here is the Python retry I use with tenacity. It reads retry-after when available and adds jitter so clustered pods do not stampede.
```python
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential_jitter,
)
from anthropic import RateLimitError as AnthropicLimit
from openai import RateLimitError as OpenAILimit

@retry(
    reraise=True,
    stop=stop_after_attempt(6),
    wait=wait_exponential_jitter(initial=1, max=60, jitter=2),
    retry=retry_if_exception_type((AnthropicLimit, OpenAILimit)),
)
def call_with_backoff(client, **kwargs):
    return client.messages.create(**kwargs)
```
The TypeScript version uses a simple token bucket. I keep this in a shared module and wrap every provider call:
```typescript
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(
    private readonly capacity: number,
    private readonly refillPerSec: number,
  ) {
    this.tokens = capacity;
  }

  async take(cost = 1) {
    while (true) {
      const now = Date.now();
      const elapsed = (now - this.lastRefill) / 1000;
      this.tokens = Math.min(
        this.capacity,
        this.tokens + elapsed * this.refillPerSec,
      );
      this.lastRefill = now;
      if (this.tokens >= cost) {
        this.tokens -= cost;
        return;
      }
      const wait = ((cost - this.tokens) / this.refillPerSec) * 1000;
      await new Promise((r) => setTimeout(r, wait + Math.random() * 150));
    }
  }
}

const openaiTpm = new TokenBucket(800_000, 800_000 / 60);
await openaiTpm.take(estimatedTokens);
```
Size the bucket at 85% of your provisioned TPM; that leaves headroom for the roughly 15% gap between client-side token estimation and server-side counting.
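If you do not want a tokenizer dependency on the hot path, a crude characters-per-token estimate usually stays inside that 15% band for English prose. The 4-chars-per-token ratio is a rule of thumb, not a provider spec:

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for bucket accounting; rounds up to err safe."""
    return max(1, math.ceil(len(text) / chars_per_token))
```

Swap in tiktoken (OpenAI) or the provider's count endpoint when the estimate needs to be tight; the rule of thumb drifts badly on code and non-English text.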
8. Multi-provider fallback
Three patterns I have shipped that actually work:
- Primary to Anthropic Sonnet 4.5, fallback to GPT-4o on 429, last resort Gemini 2.5 Flash. Tune prompt per provider because formatting differs.
- Groq for user-facing latency-critical calls, OpenAI for everything else. When Groq 429s, queue to OpenAI with an "I am thinking" affordance in the UI.
- OpenRouter as the provider layer. One API key, automatic failover across providers, minor latency cost. Worth it for teams that do not want to build this plumbing.
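The first pattern reduces to a loop over callables in priority order. A provider-agnostic sketch; the exception class and call signatures are placeholders for whatever your SDK wrappers raise and accept:

```python
class RateLimited(Exception):
    """Placeholder for your SDK's 429 exception."""

def call_with_fallback(providers, prompt):
    """Try (name, callable) pairs in priority order; advance only on 429s."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            errors.append((name, exc))  # rate limits only; other errors propagate
    raise RuntimeError(f"all providers rate-limited: {errors}")
```

The deliberate choice here is that only rate-limit errors trigger fallback: a 400 from a malformed prompt will fail identically on every provider, so retrying it elsewhere just burns quota.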
9. How to get a tier increase
Five things that worked for me or for clients:
- Spend the threshold. Most tier bumps on OpenAI and Anthropic are automatic once you cross the dollar floor and the time window
- File a support ticket with a concrete workload description (QPS, avg prompt size, growth curve)
- Show compliance paperwork. SOC2 report plus a signed DPA moves you up the reviewer's queue
- Ask your account executive directly above $10K/month of spend
- If you are on Anthropic, consider an AWS or Vertex deployment for higher quotas via hyperscaler capacity pools
Groq in particular responds well to a specific latency need plus a projected monthly spend. Vague "we need more" requests get sat on for two weeks.
FAQ
Do cached input tokens count against TPM?
On Anthropic, cache reads bill at 10% of standard input tokens for TPM accounting. On OpenAI, cached tokens count normally against TPM but are discounted on the invoice. On Groq, prompt caching is not billed separately as of April 2026.
What is the absolute highest tier on OpenAI?
Tier 5 publicly. Enterprise accounts get custom quotas negotiated through sales, typically 10-30x Tier 5 numbers.
Can I combine multiple API keys to scale past a tier?
Rate limits are per organization on OpenAI and Anthropic, not per key. Creating additional keys inside the same org does nothing. You would need separate organizations, which violates the terms on both platforms.
Do batch API calls share the synchronous rate limit?
No. OpenAI Batch, Anthropic Message Batches, and Gemini Batch all have their own separate quotas. Moving offline work to batch frees real-time capacity.
How accurate is client-side token counting?
Tiktoken on OpenAI lands within 1% of the server count. Anthropic's SDK token counter is within 2%. Gemini's countTokens endpoint is the server-side source of truth but adds a network round trip.
Actionable takeaways
- Alert on TPM at 70% of the provisioned cap, not RPM
- Log the 429 payload before retrying so you can route around the right limit
- Use exponential backoff with jitter, never flat sleep-and-retry loops
- Move batch-eligible workloads off the synchronous API to free headroom
- Keep one multi-provider fallback wired up before you hit a real incident
- Re-check your tier table every quarter; providers change quotas without announcement
Sources
- OpenAI rate limits documentation, platform.openai.com/docs/guides/rate-limits, accessed 2026-04-14
- Anthropic rate limits documentation, docs.anthropic.com/en/api/rate-limits, accessed 2026-04-14
- Groq API reference, console.groq.com/docs/rate-limits, accessed 2026-04-14
- Google AI Studio quota page, ai.google.dev/gemini-api/docs/rate-limits, accessed 2026-04-14
- Live dashboards on three paying production accounts (one per major provider, plus a Groq Prod tier account)
Related: LLM API Rate Limits Explained, How to Choose an LLM API Provider, How to Reduce LLM API Costs.