Inference
INT4 Quantization
Quick Answer
Quantizing model weights to 4-bit integers, achieving 8x compression with some quality trade-off.
INT4 quantization stores each weight in 4 bits, an 8x compression over float32 (only 16 representable levels per value). This is aggressive: quality loss is more noticeable than with INT8, so careful calibration is needed to mitigate it, typically by splitting weights into small groups and giving each group its own scale and zero-point. The payoff is fitting large models on consumer hardware; INT4 makes it feasible to run 30B-class models on a single 24 GB GPU (a 70B model at 4 bits per weight still needs roughly 35 GB before overhead). Quality is usually acceptable for many tasks, which makes INT4 practical for cost-sensitive deployment.
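The group-wise scheme described above can be sketched as follows. This is a minimal illustration assuming NumPy and simple asymmetric round-to-nearest quantization; the function names (`quantize_int4`, `dequantize_int4`) and the group size are illustrative choices, and production methods (e.g. GPTQ, AWQ) add data-driven calibration on top of this basic idea.

```python
import numpy as np

def quantize_int4(weights, group_size=32):
    """Quantize a flat float32 weight array to 4-bit codes (0..15).

    Each group of `group_size` weights gets its own scale and
    zero-point (asymmetric quantization), which limits the damage
    from outliers to a single group.
    """
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0               # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)     # guard constant groups
    zero = np.round(-w_min / scale)              # integer zero-point
    q = np.clip(np.round(w / scale + zero), 0, 15).astype(np.uint8)
    return q, scale, zero

def dequantize_int4(q, scale, zero):
    """Reconstruct approximate float32 weights from 4-bit codes."""
    return (q.astype(np.float32) - zero) * scale

# Round-trip a random weight vector; reconstruction error is bounded
# by half the per-group quantization step.
w = np.random.randn(1024).astype(np.float32)
q, scale, zero = quantize_int4(w)
w_hat = dequantize_int4(q, scale, zero).reshape(-1)
```

Note that the stated 8x compression applies to the codes themselves (4 bits vs. 32); storing one scale and zero-point per group adds a small overhead, which is one reason group sizes like 32 or 128 are common.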
Last verified: 2026-04-08