Training
Knowledge Distillation
Quick Answer
Training a smaller student model to mimic a larger teacher model's behavior.
Knowledge distillation trains a smaller student model to reproduce the behavior of a larger teacher model: instead of (or in addition to) learning from hard labels, the student matches the teacher's output distribution. The softmax temperature is raised during distillation to produce softer targets, which expose the teacher's relative confidence across incorrect classes and therefore carry more signal per example than one-hot labels. The student does not match the teacher exactly, but it typically comes close at a fraction of the inference cost, which makes distillation a standard optimization wherever deployment speed and model size matter.
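A minimal sketch of the core mechanism, using NumPy rather than a training framework: a temperature-scaled softmax produces the soft targets, and the student is penalized by the KL divergence between the softened teacher and student distributions. The function names, the example logits, and the default temperature of 4.0 are illustrative choices, not a fixed recipe; the T² scaling follows the common convention of keeping gradient magnitudes comparable across temperatures.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax. T > 1 flattens the distribution,
    exposing the teacher's relative confidence across wrong classes."""
    z = logits / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between softened teacher and student outputs.
    Scaled by T**2 so the gradient magnitude is roughly independent
    of the chosen temperature."""
    p = softmax(teacher_logits, temperature)  # soft targets (fixed)
    q = softmax(student_logits, temperature)  # student prediction
    return (temperature ** 2) * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([8.0, 2.0, 1.0])  # confident teacher logits
student = np.array([5.0, 3.0, 2.0])  # less certain student logits

print(distillation_loss(student, teacher))
```

In practice this term is combined with the ordinary cross-entropy on the true labels (weighted by a mixing coefficient), and the loss is minimized by gradient descent on the student's parameters while the teacher stays frozen.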
Last verified: 2026-04-08