INT4 Quantization

Quick Answer

Quantizing model weights to 4-bit integers, achieving 8x compression with some quality trade-off.

INT4 quantization stores model weights as 4-bit integers, an 8x compression from float32 (a 70B-parameter model shrinks from roughly 280 GB of weights to roughly 35 GB). This is aggressive: quality loss is more noticeable than with INT8, so careful calibration of the quantization scales is needed to mitigate it. The payoff is fitting large models on consumer hardware: at 4 bits, 30B-class models fit on a single 24 GB GPU, and 70B models come within reach of dual-GPU consumer setups. Quality is usually acceptable for many tasks, which makes INT4 practical for cost-sensitive deployment.
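To make the mechanics concrete, here is a minimal sketch of symmetric per-group INT4 quantization in numpy. Each group of weights shares one float scale, codes live in [-8, 7], and two 4-bit codes are packed per byte to realize the 8x storage saving. The function names and the group size of 64 are illustrative choices, not a reference to any particular library's API.

```python
import numpy as np

def quantize_int4(weights, group_size=64):
    """Symmetric per-group INT4: each group of `group_size` values shares
    one float scale; quantized codes are integers in [-8, 7]."""
    w = weights.reshape(-1, group_size)
    # Map the largest |w| in each group to code 7; guard against all-zero groups.
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-12)
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales, shape):
    """Recover approximate float weights from codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(128, 256)).astype(np.float32)

q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)

# Pack two 4-bit codes per byte (offset by 8 into [0, 15] first):
# this halves storage vs. int8 and gives the full 8x saving vs. float32.
packed = ((q[:, ::2] + 8).astype(np.uint8) << 4) | (q[:, 1::2] + 8).astype(np.uint8)

print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.6f}")
print(f"packed bytes: {packed.nbytes}, float32 bytes: {w.nbytes}")
```

The rounding error per weight is bounded by half a quantization step (scale / 2), which is why smaller groups, and therefore more scales, trade a little storage overhead for better accuracy.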

Last verified: 2026-04-08
