Steven's Knowledge

Regression Testing

Catching when your LLM app gets worse — across prompt changes, model updates, and config drift

In traditional software, a regression means something that used to work now doesn't. In LLM apps, regressions are sneakier: the output is still plausible, still well-formatted, but subtly worse. Maybe the tone shifted. Maybe it stopped citing sources. Maybe it handles edge cases differently. Catching these requires a different testing philosophy than assert-equals.

Why LLM Regressions Are Hard

Three properties make LLM regression testing uniquely challenging:

  • Non-determinism — the same input can produce different-but-valid outputs.
  • No single correct answer — there's a range of acceptable responses, not one gold string.
  • Silent degradation — the output "looks fine" to a casual glance but fails on dimensions you care about.

The consequence: you can't just diff old output vs. new output and call it a regression. You need behavioral specifications — properties the output should always satisfy, regardless of exact wording.

Behavioral Testing (CheckList-Style)

Inspired by the CheckList paper (Ribeiro et al., 2020), behavioral tests define capabilities and test them with perturbation sets:

  • Invariance tests — changing irrelevant details shouldn't change the answer. Swap "John" for "Maria" in a question; the factual answer should be identical.
  • Directional tests — adding a stronger signal should move the output in a predictable direction. Adding "urgent" to a support ticket should increase the priority classification.
  • Minimum functionality tests — basic capabilities that must always work. "What is 2+2?" must return 4, not a hallucinated essay.

Build a matrix: capabilities (factual accuracy, format compliance, tone, safety) vs. test types (invariance, directional, minimum). Fill the cells with concrete test cases.
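The three test types can be sketched as small helper functions. This is a minimal illustration, not a real harness: `classify_priority` is a hypothetical keyword-based stand-in for an LLM-backed classifier, and the ticket texts are invented for the example.

```python
# Hypothetical stand-in for an LLM-backed classifier; swap in a real model call.
def classify_priority(ticket: str) -> int:
    """Return a priority from 1 (low) to 3 (high) using toy keyword rules."""
    text = ticket.lower()
    score = 1
    if "down" in text or "outage" in text:
        score = 2
    if "urgent" in text:
        score = min(score + 1, 3)
    return score


def invariance_test(model, template: str, names: list[str]) -> bool:
    """An irrelevant detail (the name) must not change the output."""
    outputs = {model(template.format(name=name)) for name in names}
    return len(outputs) == 1


def directional_test(model, base: str, stronger: str) -> bool:
    """A stronger signal must not lower the priority.

    Uses >= so that a base case already at the ceiling still passes.
    """
    return model(stronger) >= model(base)


def minimum_functionality_test(model, cases: list[tuple[str, int]]) -> bool:
    """Basic cases that must always produce the expected label."""
    return all(model(text) == expected for text, expected in cases)


results = {
    "invariance": invariance_test(
        classify_priority, "{name} reports the site is down.", ["John", "Maria"]
    ),
    "directional": directional_test(
        classify_priority, "The site is down.", "Urgent: the site is down."
    ),
    "minimum": minimum_functionality_test(
        classify_priority, [("How do I change my avatar?", 1)]
    ),
}
print(results)
```

Each cell of the capability matrix then becomes one call to one of these helpers with its own inputs.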

Snapshot Testing for LLM Outputs

Borrow the pattern from UI snapshot testing, adapted for non-determinism:

  1. Run your prompt against a fixed set of inputs.
  2. Store the outputs as snapshots.
  3. On next run, compare new outputs to snapshots.
  4. Flag changes for human review.

The key adaptation: don't do exact-match comparison. Instead:

  • Compare structural features — did the output have the same sections? Same JSON keys? Same number of bullet points?
  • Compare semantic similarity — embed both outputs and compute their cosine similarity. Flag the pair if similarity falls below a threshold.
  • Compare extracted facts — parse out key claims and check if they match.
  • Compare classification labels — if the output includes a category or rating, check if it changed.
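A rough sketch of a non-exact snapshot comparison follows. It assumes each output is a JSON object with hypothetical `label` and `summary` fields, and it uses word counts as a toy substitute for a real embedding model; the 0.8 threshold is an arbitrary placeholder that would need tuning on real data.

```python
import json
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Assumption: word counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare_to_snapshot(snapshot: str, new: str, threshold: float = 0.8) -> list[str]:
    """Return flags for human review; an empty list means nothing notable changed."""
    old_doc, new_doc = json.loads(snapshot), json.loads(new)
    flags = []
    # Structural feature: same top-level JSON keys?
    if set(old_doc) != set(new_doc):
        flags.append("structure: JSON keys changed")
    # Classification label: did the category change?
    if old_doc.get("label") != new_doc.get("label"):
        flags.append("label changed")
    # Semantic similarity of the free-text field.
    sim = cosine_similarity(
        toy_embed(old_doc.get("summary", "")), toy_embed(new_doc.get("summary", ""))
    )
    if sim < threshold:
        flags.append(f"semantic: similarity {sim:.2f} below {threshold}")
    return flags


snapshot = '{"label": "billing", "summary": "Customer was charged twice for one order"}'
reworded = '{"label": "billing", "summary": "Customer was charged twice for one order today"}'
drifted = '{"label": "shipping", "summary": "Package arrived late"}'

print(compare_to_snapshot(snapshot, reworded))
print(compare_to_snapshot(snapshot, drifted))
```

The flags are meant to feed the human-review step, not to act as a pass/fail verdict on their own.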

Detecting Regressions Across Model Updates

Model updates are the most dangerous regression vector. The provider ships a new version of gpt-4o or claude-sonnet, and your prompts that were tuned for the old model may behave differently.

A practical defense:

  1. Pin model versions in production when available (e.g., gpt-4o-2024-08-06).
  2. Run your full eval suite against the new version before switching.
  3. Keep a model migration checklist: test each prompt family, check output format stability, verify tool-calling behavior, confirm safety filters still work.
  4. A/B test in production — route a small percentage of traffic to the new model and compare metrics.

Don't trust the provider's release notes alone. A "minor update" can change your specific use case significantly.
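One way to combine pinning with a small production A/B split is deterministic bucketing by user ID, sketched below. The model identifiers are examples only (check your provider's current list), and `choose_model` and the 5% share are hypothetical choices for illustration.

```python
import hashlib

# Example identifiers; confirm the exact snapshot names with your provider.
PINNED_MODEL = "gpt-4o-2024-08-06"      # current production pin
CANDIDATE_MODEL = "gpt-4o-2024-11-20"   # version under evaluation


def choose_model(user_id: str, candidate_share: float = 0.05) -> str:
    """Deterministically route a fixed share of users to the candidate model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return CANDIDATE_MODEL if bucket < candidate_share else PINNED_MODEL


# The same user always lands on the same model, so sessions stay consistent.
print(choose_model("user-42") == choose_model("user-42"))
```

Hashing the user ID rather than drawing a random number per request keeps each user on one model, which makes the per-model metrics easier to compare.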

Version-Locked vs. Version-Flexible Tests

Two categories of regression tests:

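Version-locked tests — pinned to a specific model version. These validate that this exact model produces acceptable outputs for these exact prompts. Because the model is held fixed, a failure points to a prompt change; they say nothing about how a different model version will behave until you re-baseline them.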

  • Run on every PR that changes prompts.
  • Fast feedback, deterministic (as much as possible).
  • Break when you upgrade the model — and that's the point.

Version-flexible tests — test behavioral properties that should hold across any reasonable model. "The output must be valid JSON." "The summary must be shorter than the input." "The response must not contain PII."

  • Run on both prompt changes and model upgrades.
  • More stable, less specific.
  • The backbone of your long-term test suite.

A healthy test suite has both. Version-locked tests are your precise regression net; version-flexible tests are your safety floor.
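The three example properties above can be written as model-agnostic checks. This is a sketch under assumptions: the response is expected to be a JSON object with a hypothetical `summary` field, and the two regular expressions are deliberately simple PII patterns that would miss many real cases.

```python
import json
import re

# Simplistic patterns for illustration; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def contains_no_pii(text: str) -> bool:
    return not (EMAIL_RE.search(text) or PHONE_RE.search(text))


def check_response(source: str, raw_output: str) -> dict[str, bool]:
    """Check properties that should hold for any reasonable model."""
    results = {
        "valid_json": False,
        "summary_shorter": False,
        "no_pii": contains_no_pii(raw_output),
    }
    try:
        doc = json.loads(raw_output)
    except json.JSONDecodeError:
        return results
    results["valid_json"] = True
    summary = doc.get("summary") if isinstance(doc, dict) else None
    # The summary must exist, be non-empty, and be shorter than the input.
    results["summary_shorter"] = isinstance(summary, str) and 0 < len(summary) < len(source)
    return results


source = (
    "The customer wrote a long message about a delayed refund "
    "and asked for a status update."
)
good = '{"summary": "Customer asks about a delayed refund."}'
leaky = '{"summary": "Contact jane@example.com about the refund."}'

print(check_response(source, good))
print(check_response(source, leaky))
```

Because none of these checks depend on exact wording, the same suite can run unchanged against the old and the new model.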

Building a Regression Test Workflow

A practical workflow:

  1. Capture baselines — when a prompt is working well in production, snapshot its outputs on your golden set.
  2. Run on change — when a prompt or model changes, re-run and compare.
  3. Triage diffs — not every diff is a regression. Categorize as: improvement, neutral, regression, unclear.
  4. Update baselines — after review, accept the new outputs as the baseline or fix the prompt.
  5. Track trends — log quality scores over time. A slow downward trend across releases is a regression even if no single change was flagged.
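Step 5 can be illustrated with a small trend check over logged scores. The function name, the score scale, and both tolerances are assumptions made for this sketch; it flags single-release drops and, separately, cumulative drift from the best score seen.

```python
def find_regressions(
    scores: list[float],
    step_tolerance: float = 0.03,
    drift_tolerance: float = 0.05,
) -> dict:
    """Inspect quality scores per release, oldest first.

    step_regressions: indices of releases whose score dropped by more than
    step_tolerance relative to the previous release.
    drifted: True when the latest score is more than drift_tolerance below
    the best score on record, even if no single step was flagged.
    """
    step_regressions = [
        i for i in range(1, len(scores))
        if scores[i - 1] - scores[i] > step_tolerance
    ]
    drifted = bool(scores) and (max(scores) - scores[-1]) > drift_tolerance
    return {"step_regressions": step_regressions, "drifted": drifted}


# Each release loses only 0.01 to 0.02, so no single step is flagged,
# but the cumulative drop from 0.90 to 0.83 exceeds the drift tolerance.
history = [0.90, 0.88, 0.87, 0.85, 0.83]
print(find_regressions(history))
```

Comparing against the best score on record, rather than only the previous release, is what surfaces the slow decline that per-change review misses.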

Common Anti-Patterns

  • Testing only the happy path — your golden set is all clean, well-formed inputs. Real users send garbage.
  • Exact-match assertions on free-text outputs — brittle and noisy.
  • Ignoring model version in test results — record the model version alongside every result; otherwise you can't tell a prompt regression from a change in the underlying model.
  • Running full eval suites on every commit — expensive and slow. Use the tiered approach from the CI/CD page.
  • No human in the loop — fully automated pass/fail on LLM output quality leads to either too many false alarms or missed regressions.
