Inference
GPU Memory
Quick Answer
The video RAM (VRAM) available on a GPU, a key constraint on model loading and inference.
GPU memory (VRAM) determines which models a device can run. The working footprint of inference has three main parts: the model weights, the activations for the current batch, and the KV cache, which grows with context length. Weight memory is roughly parameters × bytes per parameter, so a 7B-parameter model at FP16 (2 bytes per parameter) needs about 14GB and fits on a 40GB GPU, while a 70B model needs about 140GB and does not. Quantization reduces bytes per parameter (for example, 4-bit weights cut the footprint roughly 4× versus FP16), and tensor parallelism splits a model's weights across multiple GPUs. Because memory is usually the binding constraint in practice, most inference optimization focuses on shrinking this footprint.
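As a rough sketch of the arithmetic above, the Python snippet below estimates weight and KV-cache memory. The model configuration used in the example (13B parameters, 40 layers, hidden size 5120, 4K context) is an illustrative assumption, not a reference to any particular model, and the formulas ignore activation memory and framework overhead.

```python
# Rough VRAM estimator for transformer inference.
# Hypothetical configuration; substitute your model's actual numbers.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for model weights: parameters x bytes per parameter."""
    return num_params * bytes_per_param / 1e9

def kv_cache_memory_gb(num_layers: int, hidden_size: int, seq_len: int,
                       batch_size: int, bytes_per_value: float) -> float:
    """KV cache: 2 (K and V) x layers x hidden size x tokens x batch x bytes."""
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value / 1e9

# Example: a hypothetical 13B-parameter model with 40 layers and hidden size 5120.
weights_fp16 = weight_memory_gb(13e9, 2)      # ~26 GB at FP16
weights_int4 = weight_memory_gb(13e9, 0.5)    # ~6.5 GB at 4-bit quantization
kv = kv_cache_memory_gb(40, 5120, seq_len=4096, batch_size=1, bytes_per_value=2)

print(f"FP16 weights: {weights_fp16:.1f} GB")   # exceeds a 24GB GPU, fits in 40GB
print(f"INT4 weights: {weights_int4:.1f} GB")   # fits comfortably on consumer GPUs
print(f"KV cache (4K context, batch 1): {kv:.1f} GB")
```

The same arithmetic explains why quantization and tensor parallelism help: quantization lowers bytes per parameter, while tensor parallelism divides the weight and KV-cache terms across the GPUs in the group.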
Last verified: 2026-04-08