ORPO
ORPO (Odds Ratio Preference Optimization) is a monolithic, single-stage fine-tuning method that aligns Large Language Models (LLMs) with human preferences without requiring a separate reference or reward model.
ORPO eliminates the multi-stage pipeline of methods like RLHF and DPO by integrating preference alignment directly into Supervised Fine-Tuning (SFT). It modifies the SFT loss with an odds ratio (OR) term that simultaneously rewards chosen responses and penalizes rejected ones, so a single training stage produces an aligned model and computational overhead drops substantially. In evaluations on models like Mistral 7B, ORPO reported results of up to 12.20% on the AlpacaEval 2.0 benchmark, surpassing existing methods. The core mechanism uses the odds ratio to contrast preferred and dispreferred generation styles efficiently, making preference alignment more accessible and resource-friendly for models of varying sizes.
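The combined objective can be written as L = L_SFT + λ·L_OR, where L_OR = -log σ(log(odds(y_chosen|x) / odds(y_rejected|x))) and odds(y|x) = p/(1-p). Below is a minimal scalar sketch of this loss in plain Python. It assumes the inputs are average per-token log-probabilities of the chosen and rejected responses under the model; the weight λ=0.1 and the function names are illustrative, not taken from any particular library.

```python
import math

def log_odds(avg_logp: float) -> float:
    """Log-odds of a response: log(p / (1 - p)), with p = exp(avg token log-prob)."""
    p = math.exp(avg_logp)
    return math.log(p / (1.0 - p))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO objective on one preference pair (scalar sketch).

    avg_logp_chosen / avg_logp_rejected: average per-token log-probabilities
    the model assigns to the preferred and dispreferred responses.
    """
    # Standard SFT term: negative log-likelihood of the chosen response.
    sft_loss = -avg_logp_chosen
    # Odds-ratio term: -log sigmoid of the log-odds gap between chosen and rejected.
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    or_loss = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    # Single-stage objective: SFT plus a weighted preference penalty.
    return sft_loss + lam * or_loss
```

Because the odds ratio is computed from the same forward pass used for the SFT loss, no frozen reference model or separate reward model is needed; widening the log-odds gap between chosen and rejected responses directly lowers the loss.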