Inference

Cold Start

Quick Answer

The delay incurred while a model is loaded into memory before it can serve requests.

Cold start is the time it takes to load a model from storage into inference infrastructure before it can process requests. For large models (70B parameters and up), this can take minutes, which makes cold starts a central problem for serverless and dynamically scaled deployments: a new replica cannot serve traffic until its weights are loaded. Warm pools, which keep preloaded model instances ready, eliminate cold starts at the cost of idle compute and memory. Balancing cold-start latency against warm-pool cost is a key consideration in deployment design, and some providers optimize load times specifically to make serverless LLM serving viable.
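A minimal sketch of the tradeoff, with a hypothetical load_model standing in for pulling real weights into memory: a cold path pays the full load time on each fresh invocation, while a warm pool loads once ahead of traffic and serves later requests with no load latency.

import time
from functools import lru_cache

# Hypothetical loader: in practice this pulls weights from disk or
# object storage into accelerator memory (minutes for 70B+ models).
def load_model(name: str):
    time.sleep(2.0)  # stand-in for the real load time
    return f"<model {name}>"

# Cold path: the model is loaded as part of handling the request,
# as happens when a serverless replica spins up from zero.
def cold_serve(prompt: str) -> str:
    model = load_model("example-70b")  # paid on every cold invocation
    return f"{model} -> {prompt}"

# Warm pool: load once and keep the instance resident, trading
# idle memory for zero load latency on subsequent requests.
@lru_cache(maxsize=1)
def get_warm_model():
    return load_model("example-70b")

def warm_serve(prompt: str) -> str:
    return f"{get_warm_model()} -> {prompt}"

if __name__ == "__main__":
    start = time.perf_counter()
    cold_serve("hello")
    print(f"cold request: {time.perf_counter() - start:.2f}s")

    get_warm_model()  # preload during deployment, before traffic arrives
    start = time.perf_counter()
    warm_serve("hello")
    print(f"warm request: {time.perf_counter() - start:.2f}s")

The preload call is the essence of a warm pool: the loading cost is shifted from the request path to deployment time, which is why warm instances waste resources while idle but serve immediately under load.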

Last verified: 2026-04-08
