Training
DPO
Quick Answer
Abbreviation for Direct Preference Optimization.
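Direct Preference Optimization (DPO) is an alignment training method that fine-tunes a language model directly on human preference data. RLHF first fits a separate reward model and then optimizes against it with reinforcement learning; DPO skips that intermediate step and trains the model with a single classification-style loss over pairs of preferred and rejected responses. This makes DPO simpler to implement and tune than RLHF, and it often performs as well or better. DPO requires preference data (which of two responses is better for a given prompt) but no explicit reward scores. It is increasingly used in state-of-the-art model training and reflects a broader move toward simpler, more efficient alignment methods.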
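To make the idea concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of the preferred ("chosen") and rejected responses under both the model being trained and a frozen reference model. The function name, argument names, the example record, and the beta value of 0.1 are illustrative assumptions, not part of any particular library.

```python
import math

# Hypothetical preference record: a prompt plus a preferred and a rejected
# response. Note there is no numeric reward score anywhere in the data.
example_pair = {
    "prompt": "Explain what a hash table is.",
    "chosen": "A hash table maps keys to values using a hash function...",
    "rejected": "It is a kind of table.",
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Each *_logp argument is the total log-probability a model assigns to a
    response given the prompt. beta (assumed 0.1 here) controls how strongly
    the policy is pushed away from the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is zero.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# When the policy favors the chosen response more than the reference does,
# the loss drops.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

The loss falls as the model raises the likelihood of the chosen response relative to the rejected one, measured against the reference model. In practice the log-probabilities would come from a forward pass over batches of tokenized pairs, and the loss would be averaged and backpropagated through the model being trained.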
Last verified: 2026-04-08