Inference
GPU Memory
Quick Answer
The video RAM (VRAM) available on a GPU, a key constraint on model loading and inference.
GPU memory (VRAM) determines which models a device can run. The working footprint of inference has three main parts: the model weights, the activations for the current batch, and the KV cache, which grows with context length. Weight memory is roughly parameters × bytes per parameter, so a 7B-parameter model at FP16 (2 bytes per parameter) needs about 14GB and fits on a 40GB GPU, while a 70B model needs about 140GB and does not. Quantization reduces bytes per parameter (for example, 4-bit weights cut the footprint roughly 4× versus FP16), and tensor parallelism splits a model's weights across multiple GPUs. Because memory is usually the binding constraint in practice, most inference optimization focuses on shrinking this footprint.
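As a rough sketch of the arithmetic above, the Python snippet below estimates weight and KV-cache memory. The model configuration used in the example (13B parameters, 40 layers, hidden size 5120, 4K context) is an illustrative assumption, not a reference to any particular model, and the formulas ignore activation memory and framework overhead.

```python
# Rough VRAM estimator for transformer inference.
# Hypothetical configuration; substitute your model's actual numbers.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights: parameters x bytes per parameter."""
    return num_params * bytes_per_param / 1e9

def kv_cache_memory_gb(num_layers: int, hidden_size: int, seq_len: int,
                       batch_size: int, bytes_per_value: float) -> float:
    """KV cache: 2 (K and V) x layers x hidden size x tokens x batch x bytes."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value / 1e9

# Example: a hypothetical 13B-parameter model with 40 layers and hidden size 5120.
weights_fp16 = weight_memory_gb(13e9, 2)      # ~26 GB at FP16
weights_int4 = weight_memory_gb(13e9, 0.5)    # ~6.5 GB at 4-bit quantization
kv = kv_cache_memory_gb(40, 5120, seq_len=4096, batch_size=1, bytes_per_value=2)

print(f"FP16 weights: {weights_fp16:.1f} GB")   # exceeds a 24GB GPU, fits in 40GB
print(f"INT4 weights: {weights_int4:.1f} GB")   # fits comfortably on consumer GPUs
print(f"KV cache (4K context, batch 1): {kv:.1f} GB")
```

The same arithmetic explains why quantization and tensor parallelism help: quantization lowers bytes per parameter, while tensor parallelism divides the weight and KV-cache terms across the GPUs in the group.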
Last verified: 2026-04-08