Training
Direct Preference Optimization (DPO)
Quick Answer
A training method that directly optimizes for human preferences without training a separate reward model.
DPO is a simpler, more efficient alternative to RLHF. Instead of training a separate reward model and then running reinforcement learning, DPO trains the policy directly on preference data: it maximizes the margin between the log-probabilities of preferred (chosen) and dispreferred (rejected) responses, measured relative to a frozen reference model. This gives DPO several practical advantages: it is simpler to implement, more stable to train, and cheaper in compute, while empirically achieving results comparable to RLHF. Because it requires only preference pairs (chosen/rejected) rather than scalar rewards, DPO has made alignment training considerably more accessible.
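The per-pair loss behind this description can be sketched in plain Python. This is a minimal illustration, not a training loop: the log-probability values and the β temperature are toy numbers, and in practice each log-probability would come from summing token log-probs of a response under the policy or the frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy-vs-reference log-prob gap on the chosen response minus the
    same gap on the rejected response."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen_logps, policy_rejected_logps,
                              ref_chosen_logps, ref_rejected_logps):
        logits = beta * ((pc - rc) - (pr - rr))
        # -log(sigmoid(x)) == log(1 + exp(-x)), numerically simple here
        losses.append(math.log(1.0 + math.exp(-logits)))
    return sum(losses) / len(losses)

# Toy batch of two preference pairs: the policy already favors the
# chosen responses slightly more than the reference does, so the loss
# falls below the chance value -log(0.5) ≈ 0.693.
loss = dpo_loss(
    policy_chosen_logps=[-12.0, -10.5],
    policy_rejected_logps=[-15.0, -14.0],
    ref_chosen_logps=[-13.0, -11.0],
    ref_rejected_logps=[-14.5, -13.5],
)
```

Note the role of the reference model: the policy is rewarded only for preferring the chosen response *more than the reference already does*, which is what keeps the trained model from drifting arbitrarily far from its starting point.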
Last verified: 2026-04-08