Inference

Tokens Per Second (TPS)

Quick Answer

A throughput metric measuring how many tokens the model generates per second.

Tokens per second (TPS) quantifies inference speed: how many output tokens a model generates each second. A mid-sized model might generate 50 tokens/second on a single GPU. TPS depends on model size, hardware, and implementation: larger models produce fewer tokens per second on the same hardware, while optimized inference stacks (using techniques such as quantization, batching, or speculative decoding) can deliver 2-3x the baseline TPS. TPS makes latency easy to predict: at 50 TPS, generating 100 output tokens takes roughly 2 seconds. Because it normalizes for output length, TPS is easier to compare across models and workloads than end-to-end latency, which varies with input and output size. Real-world TPS measurements should include all overhead, such as tokenization and batching delays.
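The latency estimate above can be sketched in a few lines. This is a minimal illustration, not a production benchmark: `estimate_latency` and `measure_tps` are hypothetical helper names, and the optional `ttft` (time-to-first-token) term is an assumption added to account for prompt-processing overhead that a pure TPS figure omits.

```python
def estimate_latency(output_tokens: int, tps: float, ttft: float = 0.0) -> float:
    """Estimate end-to-end generation time in seconds.

    ttft (time to first token) is an optional term covering prompt
    processing; leave it at 0 for a pure decode-speed estimate.
    """
    return ttft + output_tokens / tps


def measure_tps(token_timestamps: list[float]) -> float:
    """Compute observed TPS from per-token arrival timestamps (seconds)."""
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens to measure a rate")
    elapsed = token_timestamps[-1] - token_timestamps[0]
    return (len(token_timestamps) - 1) / elapsed


# 100 output tokens at 50 TPS -> 2.0 seconds, matching the example above.
print(estimate_latency(100, 50.0))  # → 2.0
```

In practice, instrument the inference loop to record when each token arrives and feed those timestamps to a function like `measure_tps`; the estimate and the measurement will diverge by whatever overhead (tokenization, batching, network) the simple formula ignores.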

Last verified: 2026-04-08
