Inference
Tokens Per Second (TPS)
Quick Answer
A throughput metric measuring how many tokens the model generates per second.
Tokens per second (TPS) quantifies inference speed: the rate at which a model generates output tokens. For example, a model might generate 50 tokens/second on a given GPU. TPS depends on model size, hardware, and implementation: larger models generally achieve lower TPS, while efficient implementations (optimized kernels, batching, quantization) can deliver 2-3x the baseline TPS on the same hardware. TPS makes latency easy to predict: at 50 TPS, 100 output tokens take roughly 2 seconds of generation time. Because input sizes vary across requests, TPS is easier to compare across systems than end-to-end latency. Real-world TPS measurements should include all overhead, such as tokenization and batching.
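The latency prediction above can be sketched as a small helper. This is an illustrative function, not from any specific library; the optional `time_to_first_token` parameter is an assumption added to reflect that real deployments also pay a prompt-processing cost before the first token appears.

```python
def estimate_latency(output_tokens: int, tps: float,
                     time_to_first_token: float = 0.0) -> float:
    """Rough latency estimate: prompt-processing delay plus
    steady-state generation time at the measured TPS."""
    return time_to_first_token + output_tokens / tps

# 100 output tokens at 50 TPS -> 2.0 seconds of generation time
print(estimate_latency(100, 50))
```

In practice, measured TPS fluctuates with batch size and sequence length, so such estimates are best treated as ballpark figures rather than guarantees.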
Last verified: 2026-04-08