Knowledge Distillation

Quick Answer

Training a smaller student model to mimic a larger teacher model's behavior.

Knowledge distillation trains a smaller student model to mimic a larger teacher model by matching the teacher's output distributions rather than only the hard labels. During distillation, the softmax temperature is raised to produce softer targets, which expose the teacher's relative confidence across classes and give the student a richer training signal than one-hot labels alone. The resulting student does not match the teacher exactly, but it typically comes close at a fraction of the inference cost, making distillation a widely used optimization for deployments where speed and model size matter.
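The idea above can be sketched as a loss function. This is a minimal NumPy illustration of the standard distillation objective (a temperature-softened KL term between teacher and student, blended with ordinary cross-entropy on the true label); the `temperature` and `alpha` values are illustrative choices, not prescribed constants.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature > 1 flattens the distribution, producing "softer" targets.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    # Soft targets: teacher and student distributions at high temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence from student to teacher, scaled by T^2 so the gradient
    # magnitude stays comparable as the temperature changes.
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12)
                             - np.log(p_student + 1e-12)))
    soft_loss = (temperature ** 2) * kl
    # Ordinary cross-entropy on the ground-truth label at temperature 1.
    hard_loss = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    # alpha blends imitation of the teacher with fitting the true labels.
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

When the student's logits equal the teacher's, the KL term is zero and only the hard-label cross-entropy remains, which is why training drives the student toward the teacher's behavior.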

Last verified: 2026-04-08
