RLHF & Preference Tuning

The pipeline from human preferences to model behavior — RLHF, DPO, RLAIF, KTO, and when each one fits

RLHF (reinforcement learning from human feedback) is the technique behind the step from GPT-3 to InstructGPT and ChatGPT. The idea is simple: train a reward model on human preferences, then use RL to steer the language model toward outputs that the reward model scores highly. The execution is anything but simple.

The Classic RLHF Pipeline

The standard pipeline has three stages, and each one can go wrong in its own way:

Supervised fine-tuning (SFT). Start with a pretrained model and fine-tune it on high-quality instruction/response pairs. This gives you a reasonable starting point — a model that can follow instructions but hasn't been preference-tuned yet.
Reward model training. Collect pairwise comparisons — "given this prompt, response A is better than response B" — and train a separate model to predict those preferences. This reward model becomes your proxy for human judgment.
RL optimization (PPO). Use Proximal Policy Optimization to update the language model, generating responses and using the reward model's scores as the reward signal. A KL divergence penalty keeps the model from drifting too far from the SFT checkpoint.
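Stages two and three each reduce to a short formula, sketched below under some assumptions. The reward model is commonly trained with a Bradley-Terry style pairwise loss, and the RL stage commonly subtracts a KL-style penalty from the reward model's score. The function names, the per-sequence log-probabilities, and the default beta are illustrative choices, not values from any particular implementation.

```python
import math


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the preferred response
    well above the rejected one, and large when it gets the order wrong.
    """
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference model.

    logp_policy and logp_ref are the log-probabilities that the current policy
    and the frozen SFT reference assign to the same sampled response.
    beta (an assumed default) controls how strongly drift is penalized.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    print(pairwise_reward_loss(2.0, 0.0))   # correct ordering: low loss
    print(pairwise_reward_loss(0.0, 2.0))   # wrong ordering: high loss
    print(kl_shaped_reward(1.0, logp_policy=-4.0, logp_ref=-5.0))
```

In a real pipeline both quantities are computed over batches of tensors, and the KL term is often applied per token rather than per sequence.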

The reason this works: humans are much better at comparing two outputs than writing a perfect one from scratch. RLHF exploits that asymmetry.

Why PPO Is Hard in Practice

PPO sounds clean on paper. In practice:

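Four models in memory. You need the policy model, a reference model (for KL), the reward model, and a value model. The policy and value model are typically trained, so they also carry gradients and optimizer state; the reference and reward models are frozen but still occupy GPU memory.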
Training instability. RL on language models is notoriously brittle. Hyperparameter sensitivity is high; reward hacking is common; training runs can collapse.
Slow iteration. Each RLHF run involves generation, scoring, and gradient updates in a loop. It's much slower than supervised training.
Infrastructure complexity. You need generation infrastructure, reward inference, and training all coordinated in a single loop.
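To put a rough number on the memory point above, here is a back-of-envelope estimate. It rests on several assumptions that will not hold everywhere: all four models are the same size, frozen models are stored at 2 bytes per parameter (bf16), and trained models cost about 16 bytes per parameter (a common rule of thumb for mixed-precision Adam). Activations, generation caches, and any sharding or offloading are ignored.

```python
def ppo_memory_gb(params_billion: float,
                  bytes_frozen: int = 2,
                  bytes_trained: int = 16) -> dict:
    """Rough memory for weights, gradients, and optimizer state in PPO-based RLHF.

    Assumptions (all illustrative):
      - policy and value model are trained; reference and reward are frozen
      - all four models have the same parameter count
      - activations and generation caches are not counted
    """
    params = params_billion * 1e9
    trained_bytes = 2 * params * bytes_trained   # policy + value model
    frozen_bytes = 2 * params * bytes_frozen     # reference + reward model
    return {
        "trained_gb": trained_bytes / 1e9,
        "frozen_gb": frozen_bytes / 1e9,
        "total_gb": (trained_bytes + frozen_bytes) / 1e9,
    }


if __name__ == "__main__":
    # Hypothetical 7B-parameter setup.
    print(ppo_memory_gb(7))
```

Under these assumptions a 7B setup needs roughly 250 GB before activations, which is why PPO-based RLHF rarely fits on a single accelerator.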

These are engineering problems, not research curiosities. They're why simpler alternatives have gained so much traction.

DPO: Skip the Reward Model

Direct Preference Optimization (DPO) reformulates the RLHF objective into a supervised loss. Instead of training a reward model and then doing RL against it, DPO directly optimizes the language model on preference pairs using a classification-style loss.

Key properties:

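No separate reward model. The language model itself implicitly defines the reward, measured as its log-probability ratio against a frozen reference model (which you still need to keep around).
No RL loop. Training is ordinary supervised optimization over a fixed preference dataset, with no generation step.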
Much simpler infrastructure. Looks like normal fine-tuning.
Comparable results to PPO-based RLHF on many benchmarks.
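The classification-style loss can be written in a few lines. The sketch below computes the DPO loss for a single preference pair from sequence-level log-probabilities; the argument names and the default beta are assumptions for illustration, and real implementations work on batched tensors.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    # Implicit rewards: how much more likely each response became vs. the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss starts at log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that the reference log-probabilities still have to come from somewhere, so a frozen copy of the starting model is needed either at training time or for a precomputation pass.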

The catch: DPO is more sensitive to data quality and coverage than PPO-based RLHF. PPO can explore: it generates new completions during training and gets them scored. DPO only learns from the completions already in the dataset. If your preference data doesn't cover important regions of the output space, DPO can underperform.

RLAIF: Replace Human Annotators with AI

Reinforcement Learning from AI Feedback (RLAIF) replaces human annotators with an LLM. Instead of showing response pairs to humans, you ask a strong model (often with a set of principles) to judge which is better.

When it works:

Scale. You can generate millions of preference labels cheaply.
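Consistency. A single judge with a fixed prompt removes inter-annotator disagreement and annotator calibration drift, although LLM judges bring their own inconsistencies, such as sensitivity to the order in which responses are shown.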
Speed. Hours instead of weeks to build a preference dataset.

When it doesn't:

Circular reasoning. An LLM judging LLM outputs can reinforce existing biases.
Blind spots. The judge model can't catch errors it would also make.
Ceiling effect. You can't reliably improve a model beyond the judge's capability.

RLAIF is most useful when the judge model is significantly stronger than the model being trained, or when human annotation simply can't scale to the volume you need.
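A minimal sketch of AI preference labeling follows. The `judge` callable is hypothetical and stands in for a call to a strong model; no real API is used here. The sketch asks the judge twice with the responses swapped and keeps the label only when both verdicts agree, which is one simple guard against order-dependent judgments.

```python
from typing import Callable, Optional

# Hypothetical judge interface: (prompt, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def label_pair(prompt: str, response_a: str, response_b: str,
               judge: Judge) -> Optional[dict]:
    """Return a preference record, or None if the judge is order-sensitive."""
    forward = judge(prompt, response_a, response_b)
    backward = judge(prompt, response_b, response_a)
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward != a_wins_backward:
        return None  # verdict flipped when the order changed; discard the pair
    if a_wins_forward:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_length_judge(prompt: str, first: str, second: str) -> str:
    """Stand-in judge that prefers the longer response (for demonstration only)."""
    return "first" if len(first) >= len(second) else "second"


if __name__ == "__main__":
    print(label_pair("What is RLHF?", "A long, detailed answer.", "Short.",
                     toy_length_judge))
```

The output records use the same prompt/chosen/rejected shape as human pairwise data, so they can feed reward model training or DPO without further changes.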

KTO: No Pairs Required

Kahneman-Tversky Optimization (KTO) goes further than DPO by not requiring paired comparisons at all. Instead, each response is independently labeled as "good" or "bad," and the loss function borrows from prospect theory, in which humans weigh losses more heavily than equivalent gains.
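The per-example loss can be sketched as follows. This is a simplified version in the general shape of the KTO objective: the reference point is passed in as a plain number here, whereas the published method estimates it from a batch-level KL term. The parameter names and defaults are assumptions for illustration.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(policy_logp: float, ref_logp: float, desirable: bool,
             reference_point: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Simplified KTO-style loss for one independently labeled response.

    policy_logp / ref_logp: log-probability of the response under the model
    being trained and under the frozen reference model.
    reference_point: simplification; a fixed number instead of a KL estimate.
    lambda_d / lambda_u: weights for desirable and undesirable examples.
    """
    log_ratio = policy_logp - ref_logp
    if desirable:
        # Loss falls as the model makes a good response more likely.
        return lambda_d * (1.0 - _sigmoid(beta * (log_ratio - reference_point)))
    # Loss falls as the model makes a bad response less likely.
    return lambda_u * (1.0 - _sigmoid(beta * (reference_point - log_ratio)))


if __name__ == "__main__":
    print(kto_loss(-8.0, -10.0, desirable=True))    # good response, more likely
    print(kto_loss(-8.0, -10.0, desirable=False))   # bad response, more likely
```

Setting `lambda_u` above `lambda_d` is one way to make undesirable examples count for more, either to reflect loss aversion or to offset an imbalance between good and bad labels.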

Why this matters:

Easier data collection. Thumbs-up/thumbs-down is simpler than pairwise ranking.
Works with production signals. User upvotes, completions, and rejection signals map directly.
Competitive results with DPO on standard benchmarks.
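As an illustration of how production signals might map onto binary labels, the sketch below converts a hypothetical event log into per-response training records. The event names, field names, and output schema are all invented for this example; a real system would substitute its own.

```python
from typing import Iterable, List

# Hypothetical signal names; adjust to whatever your product actually logs.
DESIRABLE_SIGNALS = {"thumbs_up", "accepted"}
UNDESIRABLE_SIGNALS = {"thumbs_down", "regenerated"}


def to_binary_records(events: Iterable[dict]) -> List[dict]:
    """Turn logged interactions into independently labeled examples.

    Each event is assumed to have "prompt", "response", and "signal" keys.
    Events with an unrecognized signal are skipped rather than guessed at.
    """
    records = []
    for event in events:
        signal = event["signal"]
        if signal in DESIRABLE_SIGNALS:
            label = True
        elif signal in UNDESIRABLE_SIGNALS:
            label = False
        else:
            continue
        records.append({
            "prompt": event["prompt"],
            "completion": event["response"],
            "label": label,
        })
    return records


if __name__ == "__main__":
    log = [
        {"prompt": "p1", "response": "r1", "signal": "thumbs_up"},
        {"prompt": "p2", "response": "r2", "signal": "regenerated"},
        {"prompt": "p3", "response": "r3", "signal": "viewed"},
    ]
    for record in to_binary_records(log):
        print(record)
```

No pairing step is needed: each response carries its own label, even if no other response to the same prompt was ever logged.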

The main limitation: without pairwise comparisons, you lose some granularity in what "better" means.

When to Use What

Method	Best for	Infrastructure	Data requirement
RLHF (PPO)	Maximum control, frontier labs with infra	Heavy — 4 models, RL loop	Pairwise preferences
DPO	Most production teams	Light — supervised training	Pairwise preferences
RLAIF	Scale + cost sensitive	Medium — needs judge model	AI-generated preferences
KTO	Leveraging production signals	Light — supervised training	Binary labels per response

For most teams building products today: start with DPO. It's the best ratio of result quality to implementation complexity. Move to RLHF only if you're pushing the frontier and have the engineering team to support it. Use KTO when you have abundant binary signals but no paired comparisons.
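The guidance above can be condensed into a small decision helper. This is only one reading of the recommendations; the flag names are invented, and suggesting RLAIF when no preference data exists yet is an assumption based on its role as a way to generate labels.

```python
def suggest_method(has_pairwise_data: bool,
                   has_binary_signals: bool,
                   has_rl_infra: bool,
                   pushing_frontier: bool) -> str:
    """Return a starting-point method based on data and infrastructure."""
    if pushing_frontier and has_rl_infra and has_pairwise_data:
        return "RLHF (PPO)"
    if has_pairwise_data:
        return "DPO"
    if has_binary_signals:
        return "KTO"
    # Assumption: with no preference data at all, AI-generated labels
    # are the cheapest way to get started.
    return "RLAIF"


if __name__ == "__main__":
    print(suggest_method(has_pairwise_data=True, has_binary_signals=False,
                         has_rl_infra=False, pushing_frontier=False))
```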

The Alignment Tax

Alignment methods usually trade some capability for a behavioral property. The model after RLHF can be worse at certain tasks than the SFT model: it is often more cautious, more verbose, and more likely to refuse edge cases. Part of that shift is intentional, but the cost is real.

The practical question is always: how much capability are you willing to trade for how much alignment? The answer depends on your product, your users, and your risk tolerance. There's no universal right answer — only tradeoffs you can measure.
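One simple way to make the tradeoff measurable is to run the same evaluation suite on the SFT checkpoint and the preference-tuned checkpoint, then compare per-task scores. The sketch below assumes you already have such scores as dictionaries; the task names and numbers in the usage example are made up for illustration and are not measurements.

```python
from typing import Dict


def alignment_tax(sft_scores: Dict[str, float],
                  tuned_scores: Dict[str, float]) -> Dict[str, float]:
    """Per-task score change from the SFT model to the tuned model.

    Negative values mean the tuned model got worse on that task.
    Only tasks present in both score sets are compared.
    """
    shared_tasks = sft_scores.keys() & tuned_scores.keys()
    return {task: tuned_scores[task] - sft_scores[task]
            for task in sorted(shared_tasks)}


if __name__ == "__main__":
    # Made-up numbers, purely illustrative.
    sft = {"math": 0.62, "coding": 0.55, "helpfulness": 0.70}
    tuned = {"math": 0.58, "coding": 0.54, "helpfulness": 0.81}
    for task, delta in alignment_tax(sft, tuned).items():
        print(f"{task}: {delta:+.2f}")
```

Tracking these deltas across runs turns the question of how much capability to give up into a comparison of concrete numbers rather than impressions.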