Golden Test Sets

Building and maintaining the ground-truth datasets that anchor your LLM evaluation

Every LLM evaluation system bottoms out at the same question: compared to what? Golden test sets are the "compared to what" — curated input-output pairs that define what good looks like. Get them wrong and every metric you build on top is measuring the wrong thing. Get them right and they become the single most valuable artifact in your LLM project.

What a Golden Set Actually Is

A golden set is a collection of (input, expected output, metadata) tuples where the expected output represents ground truth: either the objectively correct answer or a human-approved reference response.

The key distinction from general test data: every example in a golden set has been individually reviewed and approved. This is expensive, which is why golden sets are small and carefully maintained rather than auto-generated at scale.
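
One way to represent such a tuple is sketched below. The class name `GoldenExample`, its field names, and the sample values are all assumptions made for illustration; the only requirement taken from the text is that each record carries an input, an expected output, metadata, and evidence of individual review.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class GoldenExample:
    """One individually reviewed (input, expected output, metadata) tuple."""
    example_id: str
    input_text: str
    expected_output: str
    reviewed_by: str  # hypothetical field: who approved this example
    metadata: dict = field(default_factory=dict)


# Illustrative record only; the values are made up.
example = GoldenExample(
    example_id="ticket-0001",
    input_text="My invoice shows two charges for the same order.",
    expected_output="Customer reports a duplicate charge on a single order.",
    reviewed_by="support-lead",
    metadata={
        "category": "billing",
        "difficulty": "typical",
        "date_added": date(2024, 1, 15).isoformat(),
        "source": "production_log",
    },
)
```

Making `reviewed_by` a required field is a small design choice: a record cannot be constructed without naming a reviewer, which keeps unreviewed data out of the set by construction.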

When to Build One

Build your golden set before you start optimizing prompts, not after. A common mistake is to iterate on prompts by vibes for weeks and only then build a golden set to validate what has already shipped. By that point the golden set gets reverse-engineered to match existing behavior instead of defining the desired behavior.

The right sequence:

1. Define what "good" means for your use case.
2. Curate 50-200 examples that represent that definition.
3. Use the golden set to evaluate prompt candidates.
4. Ship the prompt that scores best.
5. Maintain the golden set as requirements evolve.
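
The evaluate-and-pick steps in this sequence can be sketched as a small harness. Everything named here is hypothetical: `score_candidate`, `pick_best`, and the stub candidates stand in for real prompt-plus-model calls, and `is_acceptable` is whatever pass/fail check fits your task.

```python
from typing import Callable, Sequence


def score_candidate(
    generate: Callable[[str], str],
    golden: Sequence[tuple[str, str]],
    is_acceptable: Callable[[str, str], bool],
) -> float:
    """Fraction of golden examples for which the candidate's output passes."""
    if not golden:
        raise ValueError("golden set is empty")
    passed = sum(is_acceptable(generate(inp), expected) for inp, expected in golden)
    return passed / len(golden)


def pick_best(
    candidates: dict[str, Callable[[str], str]],
    golden: Sequence[tuple[str, str]],
    is_acceptable: Callable[[str, str], bool],
) -> tuple[str, dict[str, float]]:
    """Score every candidate on the same golden set and return the top scorer."""
    scores = {
        name: score_candidate(fn, golden, is_acceptable)
        for name, fn in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Stub candidates: in practice each would call a model with a different prompt.
golden = [("2+2", "4"), ("capital of France", "Paris")]
candidates = {
    "prompt_a": lambda text: "4",
    "prompt_b": lambda text: {"2+2": "4", "capital of France": "Paris"}[text],
}
best, scores = pick_best(candidates, golden, lambda out, exp: out.strip() == exp)
```

The important property is that every candidate is scored against the same fixed examples, so differences in score reflect the prompt rather than the test data.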

Curating Representative Examples

A golden set that only covers the easy cases is useless. You need coverage across the dimensions that matter:

Input variety — different phrasings, lengths, languages, tones.
Difficulty spectrum — trivial cases, typical cases, hard cases, adversarial cases.
Domain coverage — if your app handles multiple topics or categories, each one needs representation.
Edge cases — empty inputs, extremely long inputs, inputs with special characters, ambiguous queries.
Failure modes — inputs where the model is known to struggle.

A useful heuristic: if you can describe a category of inputs your app might receive, it should have at least 3-5 examples in the golden set.
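
That heuristic is easy to check mechanically. The sketch below assumes each example is a dict with a `category` key and uses a minimum of 3; both are illustrative choices, and `undercovered` is a hypothetical helper name.

```python
from collections import Counter


def undercovered(examples, key="category", minimum=3, expected=()):
    """Return {category: count} for categories with fewer than `minimum` examples.

    `expected` lists categories the app is known to receive, so a category
    with zero examples still shows up in the report.
    """
    counts = Counter(ex[key] for ex in examples)
    names = set(counts) | set(expected)
    return {
        name: counts.get(name, 0)
        for name in sorted(names)
        if counts.get(name, 0) < minimum
    }


examples = [{"category": "billing"}] * 3 + [{"category": "shipping"}]
report = undercovered(examples, expected=["refunds"])
```

Here `report` flags `shipping` (one example) and `refunds` (none), while `billing` meets the minimum and is omitted.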

Size Guidelines

How big should your golden set be? It depends on what you're measuring:

Minimum viable: 30-50 examples. Enough to catch gross regressions, not enough for statistical confidence on subtle changes.
Working set: 100-200 examples. The sweet spot for most teams. Enough to stratify across categories and detect moderate regressions.
Comprehensive: 500-2000 examples. For mature products with many prompt variants and strict quality requirements.
Research-grade: 5000+ examples. Only if you're publishing benchmarks or training evaluator models.

More examples cost more to curate and more to run. A well-curated set of 150 beats a sloppy set of 1500.
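
To see why small sets cannot resolve subtle changes, it helps to look at the uncertainty around a measured pass rate. The sketch below uses the Wilson score interval, which is one common choice for a binomial proportion and is an assumption here, not something the tiers above prescribe. It also treats examples as independent pass/fail trials, which is a simplification.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Same observed 80% pass rate at three set sizes.
for n in (50, 150, 1500):
    low, high = wilson_interval(round(0.8 * n), n)
    print(f"n={n:>4}: {low:.3f} to {high:.3f}")
```

At 50 examples and an 80% pass rate the interval spans roughly 67% to 89%, so a two-point change between prompt versions is well inside the noise. The interval narrows as the set grows, which is the trade-off against curation and run cost.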

Stratified Sampling for Coverage

Don't just grab random production examples. Use stratified sampling:

Define your strata — the dimensions across which you need coverage (category, difficulty, input length, user segment).
Set quotas — decide how many examples per stratum.
Sample from production logs — pull real examples that match each stratum.
Fill gaps — for rare strata (edge cases, adversarial inputs), write examples by hand.
Balance — over-represent hard cases and edge cases relative to their production frequency. A golden set is not a traffic distribution; it's a capability map.

A typical stratification for a customer support summarizer:

Stratum	Count	Notes
Simple ticket, single issue	20	Baseline capability
Multi-issue ticket	25	Tests decomposition
Angry customer tone	15	Tests tone handling
Technical jargon heavy	15	Tests domain understanding
Very short ticket (<50 words)	10	Edge case
Very long ticket (>2000 words)	10	Edge case
Non-English or mixed language	10	Coverage
Ambiguous or unclear ticket	15	Hard case
Total	120
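
The quota-driven sampling described above can be sketched as follows. The function name `stratified_sample`, the record shape, and the `stratum_of` callback are assumptions; how a production log record gets assigned to a stratum is left to you.

```python
import random
from collections import defaultdict


def stratified_sample(logs, quotas, stratum_of, seed=0):
    """Sample up to quotas[s] records per stratum; report unfilled quotas.

    Returns (sample, gaps) where gaps maps stratum -> number of examples
    still missing, i.e. the ones to write by hand.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_stratum = defaultdict(list)
    for record in logs:
        by_stratum[stratum_of(record)].append(record)

    sample, gaps = [], {}
    for stratum, quota in quotas.items():
        pool = by_stratum.get(stratum, [])
        picked = rng.sample(pool, min(quota, len(pool)))
        sample.extend(picked)
        if len(picked) < quota:
            gaps[stratum] = quota - len(picked)
    return sample, gaps


# Toy logs: plenty of simple tickets, few multi-issue ones, no non-English ones.
logs = [{"stratum": "simple"}] * 30 + [{"stratum": "multi_issue"}] * 5
quotas = {"simple": 20, "multi_issue": 25, "non_english": 10}
sample, gaps = stratified_sample(logs, quotas, lambda r: r["stratum"])
```

Because quotas are set per stratum rather than proportional to traffic, rare strata are over-represented by design, and `gaps` tells you where hand-written examples are needed.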

Ground Truth Management

Who decides what the "right" answer is?

Factual tasks — the answer is objectively verifiable. Ground truth comes from the source data.
Generative tasks — there's no single right answer. Ground truth is a reference response plus evaluation criteria. The criteria matter more than the reference.
Classification tasks — labels come from domain experts, ideally with inter-annotator agreement measured.
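
For classification labels, one common agreement measure between two annotators is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The text does not name a specific statistic, so treat this choice and the helper below as an illustrative sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["bug", "bug", "billing", "billing"]
annotator_2 = ["bug", "billing", "billing", "billing"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 3 of 4 items, chance agreement is 0.5, and kappa comes out at 0.5. A low value is a signal to tighten the labelling guidelines before the labels go into the golden set.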

For generative tasks, store both the reference response and a rubric. When evaluating, score against the rubric, not against exact match with the reference.
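
A minimal sketch of rubric-based scoring follows. The `Criterion` class, the weights, and the string checks are hypothetical stand-ins; a real rubric criterion might need a human or a more careful check than a substring test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output satisfies it
    weight: float = 1.0


def rubric_score(output: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the output satisfies."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if c.check(output))
    return earned / total


rubric = [
    Criterion("mentions_refund", lambda o: "refund" in o.lower(), weight=2.0),
    Criterion("concise", lambda o: len(o.split()) <= 30),
]
reference = "The customer wants a refund because they were charged twice."
output = "Customer requests a refund for a duplicate charge."
score = rubric_score(output, rubric)
```

The output differs from the reference word for word, so exact match would fail it, yet it satisfies both criteria and scores 1.0. That is the behavior you want for generative tasks.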

When Golden Sets Drift

Your golden set will become stale. Signs it needs updating:

Product changes — you added a new feature or changed the output format. Old golden examples no longer reflect the desired behavior.
Domain shift — your users started asking about topics that weren't in the original set.
Quality bar shift — what counted as "good enough" six months ago isn't anymore.
Model capability shift — a new model can do things the old golden set didn't test for.

Maintenance cadence: review your golden set quarterly. Add new examples for new capabilities. Retire examples for deprecated features. Update reference outputs when the quality bar changes.
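
A quarterly review is easier when a script builds the queue. The sketch below assumes each example records a `last_reviewed` ISO date and an optional `feature` tag; those field names, the 90-day window, and `review_queue` itself are illustrative assumptions.

```python
from datetime import date, timedelta


def review_queue(examples, today, max_age_days=90, deprecated_features=frozenset()):
    """List (example_id, reason) pairs that need attention at the next review."""
    cutoff = today - timedelta(days=max_age_days)
    queue = []
    for ex in examples:
        if ex.get("feature") in deprecated_features:
            queue.append((ex["id"], "retire: deprecated feature"))
        elif date.fromisoformat(ex["last_reviewed"]) < cutoff:
            queue.append((ex["id"], "re-review: stale"))
    return queue


examples = [
    {"id": "ex-1", "last_reviewed": "2024-01-10", "feature": "legacy_export"},
    {"id": "ex-2", "last_reviewed": "2024-01-10"},
    {"id": "ex-3", "last_reviewed": "2024-05-20"},
]
queue = review_queue(
    examples,
    today=date(2024, 6, 1),
    deprecated_features={"legacy_export"},
)
```

This only surfaces candidates. Deciding whether a reference output still meets the current quality bar remains a human judgment.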

Practical Tips

Version your golden set in git alongside your prompts. Treat it as code.
Tag each example with metadata: category, difficulty, date added, source (production log vs. hand-written).
Track per-example scores over time — an example that was passing and now fails is a stronger regression signal than an aggregate score drop.
Don't auto-generate golden sets with LLMs — you end up testing whether the model can match its own output. Human curation is the whole point.
Share golden sets across teams working on the same domain to avoid redundant curation work.
Keep a "parking lot" of interesting production examples that might be worth adding. Review it monthly.
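
Per-example tracking, as suggested in the tips above, can be as simple as diffing two runs. The sketch assumes each run is stored as a dict from example id to a pass/fail boolean; `find_regressions` is a hypothetical helper.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Ids of examples that passed in the previous run and fail in the current one."""
    return sorted(
        ex_id
        for ex_id, passed in previous.items()
        if passed and current.get(ex_id) is False
    )


def pass_rate(run: dict[str, bool]) -> float:
    return sum(run.values()) / len(run)


# The aggregate pass rate is identical across the two runs,
# but one example has flipped from passing to failing.
previous = {"ex-1": True, "ex-2": False}
current = {"ex-1": False, "ex-2": True}
regressions = find_regressions(previous, current)
```

Both runs have the same aggregate pass rate, so an aggregate-only dashboard would show no change, while the per-example diff surfaces `ex-1` as a regression.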
