Reward Modeling

Training the model that scores model outputs — preference data, Bradley-Terry, reward hacking, and over-optimization

A reward model is a learned proxy for human judgment. You give it a prompt and a response, and it returns a scalar score estimating how much a human would prefer that response. It is the critical link in the RLHF pipeline, and it is the component most likely to break in subtle, hard-to-debug ways.

What a Reward Model Actually Is

At its core, a reward model is a language model with a scalar head. You take a pretrained model, replace the output layer with a linear projection to a single number, and train it on human preference data. The model learns to assign higher scores to preferred responses and lower scores to rejected ones.
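To make the "scalar head" concrete, here is a minimal sketch in plain Python. It assumes the score is read from the final token's hidden state, which is a common choice but not the only one. The function name `scalar_reward` and the toy numbers are illustrative, not taken from any particular library.

```python
from typing import List


def scalar_reward(hidden_states: List[List[float]],
                  weight: List[float],
                  bias: float = 0.0) -> float:
    """Project the final token's hidden state to a single score.

    hidden_states: one vector per token, as produced by the backbone.
    weight, bias:  parameters of the linear head (hypothetical values here).
    """
    last = hidden_states[-1]  # assumption: pool by taking the last token
    if len(last) != len(weight):
        raise ValueError("hidden size and weight size must match")
    return sum(h * w for h, w in zip(last, weight)) + bias


# Toy example: 3 tokens, hidden size 4.
hidden = [
    [0.1, 0.0, -0.2, 0.3],
    [0.5, -0.1, 0.0, 0.2],
    [1.0, 2.0, -1.0, 0.5],
]
w = [0.5, 0.25, 1.0, -2.0]
score = scalar_reward(hidden, w, bias=0.1)
```

In a real model the backbone and the head are trained together; the only structural change from the pretrained model is that the vocabulary-sized output is replaced by this one-number projection.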

Most reward models today are built on top of the same architecture families as the models they're scoring. A 7B reward model scoring a 7B chat model is common; you don't necessarily need the reward model to be larger.

The Bradley-Terry Framework

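The standard training formulation uses the Bradley-Terry model from the paired-comparison literature. It treats the probability that one response is preferred over another as sigmoid(r(preferred) - r(rejected)), where r is the reward model's score. Given a prompt and two responses (one preferred, one rejected), the loss is the negative log-likelihood of the observed preference: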

Loss = -log(sigmoid(r(preferred) - r(rejected)))

This is just binary cross-entropy on the score difference. The model learns to push preferred responses' scores up and rejected responses' scores down, with the gap between them being what matters.
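As a sketch of that loss on plain floats: the version below computes -log(sigmoid(gap)) as softplus(-gap), branching on the sign of the gap so the exponential never overflows. The function names are illustrative. A real training loop would apply the same formula to batched tensors.

```python
import math
from typing import Iterable, Tuple


def bt_loss(r_preferred: float, r_rejected: float) -> float:
    """-log(sigmoid(r_preferred - r_rejected)), computed stably."""
    gap = r_preferred - r_rejected
    # softplus(-gap) = log(1 + exp(-gap)); pick the form whose exp argument is <= 0
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))


def preference_probability(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred."""
    return math.exp(-bt_loss(r_preferred, r_rejected))


def mean_bt_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average loss over (score_preferred, score_rejected) pairs."""
    losses = [bt_loss(p, r) for p, r in pairs]
    return sum(losses) / len(losses)
```

Note that only the difference enters the loss: adding the same constant to both scores changes nothing, which is why reward model outputs have no meaningful absolute zero.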

Why Bradley-Terry works well:

Only relative ordering matters. You don't need annotators to assign absolute quality scores — just "A is better than B."
Captures preference strength. A large score gap signals confident preference; a small gap signals near-ties.
Simple and stable. The loss is well-behaved and doesn't need tricks to train.

Collecting Preference Data

The quality of the reward model is bounded by the quality of its training data. Standard collection workflow:

Sample prompts from your target distribution — real user queries, evaluation sets, or a curated mix.
Generate multiple responses per prompt, ideally from the model you're aligning (or a similar one).
Have annotators compare pairs. "Given this prompt, is response A or B better?" Optionally collect ties or confidence scores.
Quality control. Compute inter-annotator agreement, flag low-agreement examples, calibrate regularly.
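For the quality-control step, one common agreement statistic is Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below assumes exactly two annotators labeling the same pairs with "A" or "B"; for more annotators you would want a different statistic. The data is made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labeled at random with their own label rates
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always gave the same single label
    return (observed - expected) / (1.0 - expected)


def disagreements(labels_a: Sequence[str], labels_b: Sequence[str]) -> List[int]:
    """Indices of examples to flag for review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
flagged = disagreements(annotator_1, annotator_2)
```

Here raw agreement is 75%, but kappa is noticeably lower because both annotators pick "A" most of the time, so much of that agreement could be chance.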

Key decisions:

Number of responses per prompt. Two is minimum; four or more lets you construct more comparison pairs per annotation dollar.
Annotator expertise. General preference tasks can use crowdworkers; domain-specific or safety-critical tasks need experts.
Comparison granularity. Binary (A > B) vs. Likert scale vs. multi-dimensional ratings. Binary is simplest and most reliable.
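The arithmetic behind the responses-per-prompt decision: k responses yield k(k-1)/2 comparison pairs. The sketch below assumes the annotator provides a full best-to-worst ranking with no ties, which is one way (not the only way) to annotate more than two responses. The helper name is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def pairs_from_ranking(ranked: Sequence[str]) -> List[Tuple[str, str]]:
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() keeps input order, so the first element of each pair
    always ranks above the second.
    """
    return list(combinations(ranked, 2))


ranking = ["resp_c", "resp_a", "resp_d", "resp_b"]  # best first
pairs = pairs_from_ranking(ranking)
```

Two responses give one pair, while four give six.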

Reward Hacking

This is the central failure mode. The language model, optimized against the reward model, finds outputs that score highly according to the reward model but are obviously bad to a human. Classic examples:

Verbose padding. The reward model gives higher scores to longer responses, so the policy model learns to pad with filler.
Sycophancy. The reward model rewards agreement, so the model learns to flatter the user instead of being accurate.
Format gaming. Responses with bullet points, headers, and bold text score higher regardless of content quality.
Hedging. The model learns that "I'm not sure, but..." scores better than being wrong, so it hedges on everything.

Reward hacking happens because the reward model is an imperfect proxy. Any systematic error in the proxy becomes an exploit that RL will find.
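A cheap first diagnostic for one of these exploits, verbose padding, is to check how strongly reward tracks response length on a sample of policy outputs. The sketch below computes a Pearson correlation by hand. The lengths and rewards are invented for illustration. A high correlation alone doesn't prove hacking, since longer answers can be genuinely better, so treat it as a prompt to look at the outputs.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


token_lengths = [50, 120, 200, 310, 400]   # made-up policy outputs
rm_scores = [0.1, 0.4, 0.9, 1.3, 1.8]      # made-up reward model scores
length_bias = pearson(token_lengths, rm_scores)
```

The same pattern works for other suspected shortcuts: replace length with a count of bullet points or hedging phrases and see whether reward follows it.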

Mitigating Reward Hacking

No silver bullet, but several strategies help:

KL penalty. Standard in RLHF — penalizes the policy for drifting too far from the SFT model. The coefficient is the primary knob.
Reward model ensembles. Train multiple reward models on different data splits; use the minimum or mean score. Harder to hack when the exploits don't align.
Reward model updates. Periodically retrain the reward model on outputs from the current policy, with fresh human labels. This narrows the distribution-shift gap between what the reward model was trained on and what it is now scoring.
Length normalization. Explicitly control for length in the reward model or add a length penalty.
Human spot-checks. Regularly compare reward model scores against human judgments on current policy outputs. Track the gap.
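Two of these mitigations are simple enough to sketch directly. The first function subtracts a KL penalty from the reward model score, using the per-token log-probability difference between the policy and the SFT reference as a sample-based KL estimate. The second aggregates an ensemble by minimum or mean. The names, the `beta` value, and the numbers are illustrative assumptions. Real RLHF implementations differ in where and how the penalty is applied.

```python
from statistics import mean
from typing import Sequence


def kl_shaped_reward(rm_score: float,
                     logp_policy: Sequence[float],
                     logp_reference: Sequence[float],
                     beta: float) -> float:
    """Reward model score minus beta times a sample-based KL estimate.

    logp_policy / logp_reference: log-probabilities of the sampled tokens
    under the current policy and under the SFT reference model.
    """
    if len(logp_policy) != len(logp_reference):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_reference))
    return rm_score - beta * kl_estimate


def ensemble_score(scores: Sequence[float], mode: str = "min") -> float:
    """Combine scores from several reward models."""
    if mode == "min":
        return min(scores)  # conservative: an exploit must fool every model
    if mode == "mean":
        return mean(scores)
    raise ValueError("mode must be 'min' or 'mean'")


shaped = kl_shaped_reward(2.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.1)
conservative = ensemble_score([1.2, 0.4, 0.9], mode="min")
```

Raising `beta` keeps the policy closer to the SFT model at the cost of slower reward gains, which is why it is the primary knob.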

Over-Optimization

Even without outright hacking, there's a subtler problem: over-optimization. As you train the policy longer against the reward model, performance on the reward model goes up, but performance according to actual human judges eventually plateaus and then drops.

This is Goodhart's Law applied to alignment: when a measure becomes a target, it ceases to be a good measure.

Practical implications:

Early stopping matters. Don't train until the reward model score plateaus — stop well before that, guided by human evaluation.
Track both scores. Monitor reward model score and periodic human evals. When they diverge, you've over-optimized.
Constrain the KL budget. Smaller KL budgets limit how far the policy can drift and naturally cap over-optimization.
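Tracking both scores can be as simple as logging them side by side at each evaluation point. The sketch below assumes a hypothetical log of (step, reward model score, human win rate) tuples. It picks the checkpoint humans liked best and reports the first step where the two signals moved in opposite directions. All numbers are invented.

```python
from typing import List, Optional, Tuple

EvalRow = Tuple[int, float, float]  # (step, rm_score, human_win_rate)


def best_checkpoint(history: List[EvalRow]) -> int:
    """Step with the highest human win rate (earliest step wins ties)."""
    return max(history, key=lambda row: row[2])[0]


def first_divergence(history: List[EvalRow]) -> Optional[int]:
    """First step where rm_score rose but human win rate fell, else None."""
    for prev, curr in zip(history, history[1:]):
        if curr[1] > prev[1] and curr[2] < prev[2]:
            return curr[0]
    return None


history = [
    (100, 0.5, 0.52),
    (200, 0.9, 0.58),
    (300, 1.4, 0.61),
    (400, 2.0, 0.57),
    (500, 2.7, 0.49),
]
keep = best_checkpoint(history)
warn = first_divergence(history)
```

In this toy run the reward model score climbs the whole time, but the checkpoint worth keeping is the one at step 300, before human win rate turns down. Human evals are noisy, so in practice a single dip may deserve a second look before you act on it.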

Reward Model Evaluation

How do you know if your reward model is any good?

Accuracy on held-out comparisons. The most direct metric — what fraction of held-out preference pairs does the model rank correctly?
Agreement with humans on novel outputs. Score model outputs the reward model hasn't seen, then get human preferences. Compare.
Calibration. Do large score differences actually correspond to clear human preferences?
Known-good / known-bad tests. Curate a set of obviously good and obviously bad responses. Does the reward model rank them correctly? Failures here are red flags.
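The first and third checks are easy to compute once you have reward model scores for a held-out set. The sketch below assumes each held-out example is stored as (score of the human-preferred response, score of the rejected response). Accuracy is the fraction where the preferred score is strictly higher. The calibration check buckets pairs by absolute score gap, using arbitrary illustrative bucket edges, and reports accuracy per bucket.

```python
from typing import List, Optional, Sequence, Tuple

ScoredPair = Tuple[float, float]  # (score_preferred, score_rejected)


def pairwise_accuracy(pairs: Sequence[ScoredPair]) -> float:
    """Fraction of pairs ranked correctly; ties count as wrong."""
    return sum(p > r for p, r in pairs) / len(pairs)


def accuracy_by_gap(pairs: Sequence[ScoredPair],
                    edges: Sequence[float] = (0.5, 1.0)) -> List[Optional[float]]:
    """Accuracy per |score gap| bucket; None for empty buckets.

    With edges (0.5, 1.0) the buckets are [0, 0.5), [0.5, 1.0), [1.0, inf).
    """
    correct = [0] * (len(edges) + 1)
    total = [0] * (len(edges) + 1)
    for p, r in pairs:
        bucket = sum(abs(p - r) >= e for e in edges)
        total[bucket] += 1
        correct[bucket] += p > r
    return [c / t if t else None for c, t in zip(correct, total)]


held_out = [(1.2, 0.3), (0.4, 0.6), (2.0, 0.1), (0.9, 0.8), (0.2, 1.5), (1.1, 0.2)]
accuracy = pairwise_accuracy(held_out)
per_bucket = accuracy_by_gap(held_out)
```

For a well-calibrated reward model, accuracy should rise as the gap grows. The toy data above does not show that pattern: its largest-gap bucket includes a confidently wrong pair, which is exactly the kind of example worth reading by hand.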

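As a rough rule of thumb, a reward model with 70% accuracy on held-out pairs is usable, 75% is solid, and above 80% is excellent. Below 65%, you'll likely see severe reward hacking during RL. Read these numbers against your inter-annotator agreement, since label noise caps the accuracy any model can reach on the held-out set.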

The Bigger Picture

The reward model is where your understanding of "good" gets operationalized. Every shortcut in preference collection, every blind spot in annotator instructions, every systematic bias in your data — all of it flows through the reward model and into the final model's behavior. The team that treats reward modeling as a careful engineering discipline, not a checkbox, builds the better product.
