Inference
Latency
Quick Answer
The delay from sending input to receiving either the first output token (time-to-first-token) or the complete output.
Latency is the time from sending input to receiving output. Two metrics matter: time-to-first-token (TTFT), which governs perceived interactivity, and end-to-end latency, the total time until the complete response arrives. Low latency is critical for interactive applications such as chat and real-time translation. Latency depends on model size, hardware, and implementation, and inference optimization techniques reduce it. Streaming outputs make a response feel faster even though total latency is unchanged, because the user sees the first tokens immediately. Latency and throughput are often in tension: batching requests raises throughput but can increase per-request latency.
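The distinction between TTFT and end-to-end latency can be made concrete with a timing harness. Below is a minimal sketch: `generate_tokens` is a hypothetical stand-in for a streaming model API (not a real library call), simulating per-token compute with a fixed delay, while `measure_latency` records when the first token arrives versus when the stream completes.

```python
import time

def generate_tokens(prompt, n_tokens=5, delay=0.05):
    """Hypothetical stand-in for a streaming model API:
    yields tokens one at a time with a fixed per-token delay."""
    for i in range(n_tokens):
        time.sleep(delay)  # simulated per-token compute
        yield f"token{i}"

def measure_latency(prompt):
    """Return (time-to-first-token, end-to-end latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    for _token in generate_tokens(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
    total = time.perf_counter() - start  # full response complete
    return ttft, total

ttft, total = measure_latency("hello")
print(f"TTFT: {ttft:.3f}s, end-to-end: {total:.3f}s")
```

With streaming, the user starts reading after roughly one token delay (TTFT) rather than waiting the full end-to-end time, which is why streamed output feels faster despite identical total latency.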
Last verified: 2026-04-08