Preference Data

Building preference datasets that actually work — annotation guidelines, agreement metrics, synthetic data, and the quality-vs-quantity tradeoff

Preference data is the raw material of alignment. Every RLHF run, every DPO training run, and every reward model starts with a dataset of human (or AI) judgments saying "this output is better than that one." The quality of that dataset caps everything downstream, yet most teams underinvest here relative to its impact.

What Preference Data Looks Like

The simplest format: a prompt, a chosen response, and a rejected response. The chosen response is the one a human (or AI) judged to be better.
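A minimal sketch of this format in Python. The field names mirror the terms used above; the dataclass and the one-JSON-object-per-line serialization are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One binary preference judgment: `chosen` was judged better than `rejected`."""
    prompt: str
    chosen: str
    rejected: str


pair = PreferencePair(
    prompt="What is the capital of Australia?",
    chosen="Canberra is the capital of Australia.",
    rejected="Sydney is the capital of Australia.",
)

# Serialize as one JSON object per line (JSONL), an assumed on-disk layout.
line = json.dumps(asdict(pair))
print(line)
```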

Richer formats exist:

Ranked lists. Instead of pairs, rank four or more responses from best to worst. A ranking of k responses decomposes into k(k-1)/2 pairwise comparisons.
Likert ratings. Rate each response on a 1–5 or 1–7 scale. Pairs are derived from score differences.
Multi-dimensional ratings. Rate helpfulness, harmlessness, and honesty separately. Enables more nuanced reward modeling.
Margin annotations. "A is much better" vs. "A is slightly better" vs. "tie." Tells the reward model how confident to be about each pair.
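The richer formats above can all be reduced to pairs. The sketch below shows one way to do that for ranked lists and Likert ratings; the function names and the `min_gap` threshold are hypothetical choices for illustration.

```python
from itertools import combinations


def pairs_from_ranking(ranked):
    """Turn a best-to-worst ranking into (chosen, rejected) pairs.

    A ranking of k responses yields k * (k - 1) / 2 pairs.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always ranked higher than the second.
    return list(combinations(ranked, 2))


def pairs_from_ratings(ratings, min_gap=1):
    """Turn {response: score} ratings into (chosen, rejected, margin) triples.

    Pairs whose score gap is below `min_gap` are skipped as too close to call.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(ratings.items(), 2):
        gap = score_a - score_b
        if abs(gap) < min_gap:
            continue
        if gap > 0:
            pairs.append((resp_a, resp_b, gap))
        else:
            pairs.append((resp_b, resp_a, -gap))
    return pairs


ranking_pairs = pairs_from_ranking(["r1", "r2", "r3", "r4"])
rating_pairs = pairs_from_ratings({"r1": 5, "r2": 3, "r3": 3})
```

The margin kept in each ratings triple plays the same role as a margin annotation: a larger score gap marks a more confident pair.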

For most teams: start with binary pairwise comparisons. They're the simplest to collect, the hardest to mess up, and sufficient for DPO and standard reward model training.

Annotation Guidelines That Actually Work

The annotation spec is the single most important document in your preference data pipeline. Bad guidelines produce noisy data that no algorithm can fix.

Principles for effective guidelines:

Define "better" concretely. Not "which response is better?" but "which response more accurately answers the question while being concise and not including false claims?" Your definition of "better" is a product decision.
Prioritize dimensions explicitly. When accuracy and helpfulness conflict, which wins? When brevity and completeness conflict? Annotators need to know.
Cover edge cases upfront. Both responses are wrong — which do you pick? Both are good — do you mark a tie? One is longer but equally correct — does length matter?
Provide worked examples. At least 10–15 real examples with the correct label and the reasoning behind it. Annotators learn more from examples than from abstract rules.
Include "when in doubt" defaults. "If you genuinely cannot decide, select tie" or "If you genuinely cannot decide, choose the shorter response."
Version the guidelines. As you discover new edge cases, update the document. Track which data was labeled under which version.
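Tracking the guideline version can be as simple as storing it on every label. This is a sketch under assumed names: `LabeledPair`, its fields, and the version strings are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledPair:
    pair_id: str
    label: str               # "A", "B", or "tie"
    annotator_id: str
    guideline_version: str   # hypothetical version tag, e.g. "v1.1"


labels = [
    LabeledPair("p1", "A", "ann_01", "v1.0"),
    LabeledPair("p2", "tie", "ann_02", "v1.0"),
    LabeledPair("p3", "B", "ann_01", "v1.1"),
]

# How much data was labeled under each version of the spec.
per_version = Counter(item.guideline_version for item in labels)

# Pairs labeled under an older spec are candidates for review or relabeling.
CURRENT_VERSION = "v1.1"
stale = [item.pair_id for item in labels if item.guideline_version != CURRENT_VERSION]
```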

Inter-Annotator Agreement

If two annotators can't agree on which response is better, a reward model trained on their labels has no consistent signal to learn from. Agreement metrics measure how well the guidelines and the task definition hold up in practice.

Key metrics:

Raw agreement rate. Percentage of pairs where annotators chose the same label. Simple but inflated by easy examples and class imbalance.
Cohen's kappa. Adjusts for chance agreement. The standard for two-annotator setups.
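Fleiss' kappa. Extends chance-corrected agreement to more than two annotators (strictly, it generalizes Scott's pi rather than Cohen's kappa).
Krippendorff's alpha. Handles missing labels, any number of annotators, and nominal, ordinal, or interval scales. The most general of the four.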
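To make the difference between raw agreement and Cohen's kappa concrete, here is a small pure-Python sketch. The label lists are made-up illustration data, and returning 1.0 in the degenerate single-label case is a convention chosen here, not part of the definition.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator labeled at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        return 1.0  # convention: both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


ann_1 = ["A", "A", "B", "B", "A", "tie", "B", "A", "A", "B"]
ann_2 = ["A", "B", "B", "B", "A", "tie", "A", "A", "A", "B"]
raw_agreement = sum(a == b for a, b in zip(ann_1, ann_2)) / len(ann_1)
kappa = cohens_kappa(ann_1, ann_2)
print(f"raw={raw_agreement:.2f} kappa={kappa:.2f}")
```

On this toy data the annotators agree on 8 of 10 pairs (raw agreement 0.80), but chance agreement is 0.42, so kappa comes out at about 0.66.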

Benchmarks (rules of thumb; published scales differ on the exact cutoffs):

Above 0.75 kappa: strong agreement. Your guidelines are clear and the task is well-defined.
0.60–0.75: moderate. Workable, but investigate where disagreements cluster.
0.40–0.60: fair. The task is ambiguous or the guidelines need significant work.
Below 0.40: poor. Don't use this data for training until you fix the guidelines.

What to do with disagreements:

Adjudicate. Send to a senior annotator or expert. Their label becomes the gold standard.
Analyze. Cluster disagreements by type. Are they always about the same kind of prompt? The same dimension of quality?
Update guidelines. Disagreement patterns reveal ambiguity in the spec. Fix the spec.
Use as signal. High-disagreement examples are genuinely hard. Consider weighting them differently in training.
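One way to operationalize these steps is to aggregate votes per pair, route tied votes to adjudication, and carry the agreement share forward as a possible training weight. The sketch below assumes votes arrive as (pair_id, label) tuples; the function name and the use of agreement as a weight are illustrative assumptions.

```python
from collections import Counter, defaultdict


def aggregate_votes(votes):
    """Aggregate per-annotator votes into one label and an agreement score per pair.

    `votes` is an iterable of (pair_id, label) tuples.
    Returns {pair_id: (majority_label or None, agreement)}, where agreement is
    the share of annotators who chose the most common label. A tied vote has
    no majority label and should go to adjudication.
    """
    by_pair = defaultdict(list)
    for pair_id, label in votes:
        by_pair[pair_id].append(label)

    results = {}
    for pair_id, labels in by_pair.items():
        ranked = Counter(labels).most_common()
        top_label, top_count = ranked[0]
        is_tied_vote = len(ranked) > 1 and ranked[1][1] == top_count
        results[pair_id] = (None if is_tied_vote else top_label, top_count / len(labels))
    return results


votes = [
    ("p1", "A"), ("p1", "A"), ("p1", "A"),   # unanimous
    ("p2", "A"), ("p2", "B"), ("p2", "A"),   # majority, some disagreement
    ("p3", "A"), ("p3", "B"),                # split vote
]
results = aggregate_votes(votes)
needs_adjudication = [pid for pid, (label, _) in results.items() if label is None]
```

Grouping the low-agreement pairs by prompt type afterwards is a natural next step for the "analyze" item above.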

Synthetic Preferences

Using AI to generate preference labels is now standard practice. It is the basis of RLAIF, and many large-scale alignment datasets are built this way.

Approaches:

AI judge with principles. Give a strong model a prompt, two responses, and a set of principles. Ask it to pick the better one.
AI judge with rubric. More structured: provide specific scoring criteria, have the model score each dimension, then aggregate.
Self-play. Generate responses from the model being trained, then use a stronger model to rank them.
Constitutional critique. Generate, critique, revise, then prefer the revised version over the original.
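A sketch of the first approach, an AI judge with principles. Everything here is an assumption for illustration: the principle wording, the prompt template, and the `judge` callable, which stands in for whatever model API you use. The sketch also queries the judge twice with the response order swapped and keeps a label only when both verdicts agree, which is one common way to reduce position effects in the judge.

```python
from typing import Callable, Optional, Tuple

# Illustrative principles, not a recommended set.
PRINCIPLES = [
    "Prefer the response that answers the question accurately.",
    "Prefer the response that contains no false claims.",
    "Prefer the more concise response when both are equally correct.",
]


def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    principles = "\n".join(f"- {p}" for p in PRINCIPLES)
    return (
        f"Principles:\n{principles}\n\n"
        f"Prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response better follows the principles? Answer with exactly 'A' or 'B'."
    )


def judge_pair(
    judge: Callable[[str], str], prompt: str, resp_1: str, resp_2: str
) -> Optional[Tuple[str, str]]:
    """Ask the judge twice with the order swapped; keep the label only if both runs agree.

    `judge` is any function mapping a prompt string to the model's text reply.
    Returns (chosen, rejected), or None when the verdict flips with the order.
    """
    first = judge(build_judge_prompt(prompt, resp_1, resp_2)).strip().upper()
    second = judge(build_judge_prompt(prompt, resp_2, resp_1)).strip().upper()
    if first == "A" and second == "B":
        return resp_1, resp_2
    if first == "B" and second == "A":
        return resp_2, resp_1
    return None  # inconsistent or unparseable verdict: discard or route to a human


def stub_judge(text: str) -> str:
    """Stand-in for a real model call: prefers whichever response mentions Canberra."""
    response_a = text.split("Response A:\n")[1].split("Response B:\n")[0]
    return "A" if "Canberra" in response_a else "B"


result = judge_pair(stub_judge, "What is the capital of Australia?", "Sydney.", "Canberra.")
```

A judge that always answers "A" regardless of content produces no labels at all under this scheme, because its two verdicts contradict each other.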

When synthetic preferences work well:

The judge model is significantly stronger than the model being aligned.
The task has relatively clear right/wrong answers (factuality, instruction following).
You need high volume and can tolerate some noise.

When they don't:

Subjective tasks where "better" depends on cultural context or personal taste.
Tasks that require real-world experience the AI doesn't have.
Cases where the judge model shares the blind spots of the model being trained.

Best practice: use synthetic preferences for the bulk of your data, but always maintain a human-labeled gold set for evaluation and calibration.
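Calibrating against the gold set can start with a per-category agreement check between the judge and human labels. The data layout, category names, and function name below are hypothetical.

```python
from collections import defaultdict


def agreement_by_category(gold, judge_labels):
    """Share of gold pairs where the judge matches the human label, per category.

    gold:         {pair_id: (category, human_label)}
    judge_labels: {pair_id: judge_label}
    Pairs missing from `judge_labels` are ignored.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair_id, (category, human_label) in gold.items():
        if pair_id not in judge_labels:
            continue
        totals[category] += 1
        hits[category] += int(judge_labels[pair_id] == human_label)
    return {category: hits[category] / totals[category] for category in totals}


gold = {
    "p1": ("factual", "A"),
    "p2": ("factual", "B"),
    "p3": ("style", "A"),
    "p4": ("style", "B"),
}
judge_labels = {"p1": "A", "p2": "B", "p3": "B", "p4": "B"}
report = agreement_by_category(gold, judge_labels)
```

In this toy example the judge matches humans on every factual pair but only half of the style pairs, the kind of gap that suggests keeping humans in the loop for the subjective slice.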

Data Quality vs. Quantity

Whether to invest in quality or quantity is one of the most common questions teams ask. The answer depends on the training method and the task:

Quality wins when:

You're training a reward model (garbage in, garbage out is literal here).
Your task is complex or subjective (many ways to be "good," subtle distinctions).
You're using DPO (which can't explore beyond the data it sees).

Quantity wins when:

Your task is straightforward with clear right/wrong answers.
You're covering a wide distribution of prompts (breadth matters more than depth).
You're doing PPO-based RLHF (on-policy exploration in the RL loop can compensate for some data noise).

The realistic answer for most teams: a few thousand high-quality, well-annotated preference pairs beat tens of thousands of noisy ones. Invest in guidelines and annotator calibration first, then scale volume.

Concrete numbers to orient around:

500–1,000 pairs: Minimum viable for DPO on a narrow task.
5,000–10,000 pairs: Solid for general-purpose DPO or reward model training.
50,000+ pairs: The scale frontier labs reportedly work at, though much of that data is synthetic.
Human gold set for eval: 500–1,000 high-quality pairs, separate from training data.
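Keeping the gold set separate is easier when the split is deterministic and keyed on the prompt, so that every pair sharing a prompt lands on the same side. The sketch below is one possible approach; the function name and the 10% default are assumptions.

```python
import hashlib


def assign_split(prompt: str, eval_fraction: float = 0.1) -> str:
    """Deterministically assign a prompt to 'train' or 'eval' by hashing its text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "eval" if bucket < eval_fraction else "train"


prompts = ["Summarize this email.", "Explain recursion.", "Summarize this email."]
splits = [assign_split(p) for p in prompts]
```

Because the assignment depends only on the prompt text, rerunning the pipeline or adding new data never moves an existing prompt across the boundary.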

Common Pitfalls

Position bias. Annotators tend to prefer the first response shown. Randomize presentation order.
Length bias. Longer responses look more "thorough." Control for length or explicitly instruct annotators to ignore it.
Anchoring. Once an annotator decides on a pattern, they apply it too broadly. Regular calibration sessions counteract this.
Contamination. Evaluating the reward model on the same data it was trained on inflates its measured accuracy. Always hold out a clean eval set.
Distribution mismatch. Collecting preferences on one prompt distribution but deploying on another. Your preference data should cover the prompts you see in production.
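Two of these pitfalls are easy to guard against in code. The sketch below randomizes presentation order while recording how to undo it, and computes a rough length-bias diagnostic; both function names are hypothetical.

```python
import random


def present_pair(resp_1, resp_2, rng):
    """Shuffle which response appears first; return the shown order plus a flag to undo it."""
    swapped = rng.random() < 0.5
    shown = (resp_2, resp_1) if swapped else (resp_1, resp_2)
    return shown, swapped


def longer_chosen_rate(pairs):
    """Share of (chosen, rejected) pairs in which the chosen response is strictly longer.

    A rate far above 0.5 hints at length bias, though longer answers can
    also be genuinely better, so treat it as a prompt to look closer.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(len(chosen) > len(rejected) for chosen, rejected in pairs) / len(pairs)


rng = random.Random(0)  # seeded so the presentation order is reproducible
shown, swapped = present_pair("response from model X", "response from model Y", rng)

labeled = [("aaaa", "a"), ("bb", "bbbb"), ("ccc", "c"), ("dddd", "dd")]
rate = longer_chosen_rate(labeled)
```

Storing the `swapped` flag with each label also lets you check afterwards how often annotators picked the first position shown.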

The Data Flywheel

The best preference datasets aren't built once — they grow with the product:

Deploy the aligned model.
Collect implicit signals from users (thumbs up/down, edits, regenerations).
Sample hard or interesting cases for human annotation.
Update the preference dataset and retrain.
Repeat.
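The signal-collection and sampling steps above could look roughly like this. The `Interaction` fields and the scoring weights are made-up heuristics for illustration; real products will have different signals and will need to tune how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    prompt: str
    response: str
    thumbs_down: bool = False
    regenerations: int = 0
    user_edited: bool = False


def hardness_score(item):
    """Illustrative heuristic: more implicit negative signals mean higher priority."""
    return 2 * int(item.thumbs_down) + item.regenerations + int(item.user_edited)


def select_for_annotation(log, budget):
    """Pick up to `budget` interactions with the strongest negative signals."""
    flagged = [item for item in log if hardness_score(item) > 0]
    return sorted(flagged, key=hardness_score, reverse=True)[:budget]


log = [
    Interaction("q1", "r1", thumbs_down=True),
    Interaction("q2", "r2"),
    Interaction("q3", "r3", regenerations=3),
    Interaction("q4", "r4", user_edited=True),
]
queue = select_for_annotation(log, budget=2)
```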

This flywheel is how alignment improves over time in production. The teams that set it up early compound their advantage.