A/B Testing for ML

Your offline eval says the new model is 3% better. You deploy it. Revenue goes down. What happened? Offline metrics and online metrics measure different things, and the gap between them is where A/B testing for ML lives. If you ship models without online experiments, you're flying blind — optimizing for a proxy that may not correlate with the business outcome you actually care about.

Why Offline Eval Isn't Enough

Offline evaluation has systematic blind spots:

Distribution shift — your eval set is a frozen snapshot; production traffic is alive and changing.
Feedback loops — the current model shapes what data you collect, which shapes what the next model sees.
Proxy metric mismatch — you optimize NDCG offline, but the business cares about purchase rate. These don't always move together.
Interaction effects — the model doesn't exist in isolation. It interacts with UI, ranking logic, caching layers, and user behavior.

Offline eval is a gate — it filters out clearly bad models. Online testing is the verdict.

Classic A/B Testing for Models

The basic setup:

Control group — sees the current production model.
Treatment group — sees the new candidate model.
Random assignment — users are randomly split, typically by user ID hash.
Business metrics — measure what actually matters: clicks, conversions, revenue, retention, satisfaction.
Run long enough — collect enough data for statistical significance. For ML changes, this often means 1-4 weeks.
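
A minimal sketch of the random assignment step, assuming string user IDs and a per-experiment name used as a salt. The function name `assign_variant` and its parameters are hypothetical, not from any particular experimentation platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps a user's bucket in one
    # experiment independent of their bucket in another.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 32 bits of the hash as a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"
```

Because the assignment is a pure function of the user ID and the experiment name, the same user lands in the same group on every request without any stored state.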

The key difference from feature A/B tests: model changes often have subtle, slow-moving effects. A button color change shows up in a day. A ranking model change might take two weeks to manifest in retention.

Shadow Mode Deployment

Before you split real traffic, consider shadow mode:

The new model runs on production traffic but its predictions are logged, not served.
You compare its predictions to the current model's predictions offline.
No risk to users; you catch obvious failures (latency spikes, error rates, degenerate outputs) before they matter.

Shadow mode is the staging environment for ML models. Run it for a few days before any A/B test.
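
One way to picture shadow mode is a thin wrapper around the serving call. This is a sketch under simplifying assumptions: both models are plain callables, the log is an in-memory list, and the shadow call runs inline. In a real system the shadow call would normally run off the request path so it cannot add user-facing latency. All names here are hypothetical.

```python
import time
from typing import Any, Callable


def serve_with_shadow(features: Any,
                      production_model: Callable[[Any], Any],
                      shadow_model: Callable[[Any], Any],
                      log: list) -> Any:
    """Serve the production prediction; log the shadow prediction only."""
    served = production_model(features)
    record = {"served": served, "shadow": None, "shadow_error": None}
    start = time.perf_counter()
    try:
        record["shadow"] = shadow_model(features)
    except Exception as exc:
        # A failing shadow model must never affect what the user sees.
        record["shadow_error"] = repr(exc)
    record["shadow_latency_s"] = time.perf_counter() - start
    record["agree"] = record["shadow_error"] is None and record["shadow"] == served
    log.append(record)
    return served
```

The logged records give you the shadow model's error rate, latency, and agreement with production before any user is exposed to it.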

Interleaving Experiments

For ranking and recommendation systems, interleaving is more statistically efficient than A/B testing:

Each user sees results from both models, interleaved into a single list.
You measure which model's results the user prefers (clicks, engages with).
Can require far fewer users than a standard A/B test to detect the same preference, often cited as roughly 10x fewer; the actual gain depends on the system and the metric.

The trade-off: interleaving measures preference between two models, not absolute impact on business metrics. Use it as a fast signal, then confirm with a proper A/B test.
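
The text does not name a specific interleaving method. As an illustration, here is a sketch of team-draft interleaving, one common scheme: the two models take turns picking their best not-yet-shown result, and each click is credited to the model that contributed the clicked item. Function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings; return the merged list and each item's team."""
    if rng is None:
        rng = random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team_of = [], {}
    picks = {"A": 0, "B": 0}

    def next_unused(ranking):
        return next((doc for doc in ranking if doc not in team_of), None)

    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            doc = next_unused(rankings[team])
            if doc is not None:
                interleaved.append(doc)
                team_of[doc] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team_of


def score_clicks(clicked_docs, team_of):
    """Credit each click to the model that contributed the clicked item."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team_of:
            wins[team_of[doc]] += 1
    return wins
```

Aggregated over many sessions, the share of sessions in which one model collects more clicks than the other is the preference signal.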

Statistical Significance

ML A/B tests have specific statistical challenges:

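Variance is high: model predictions affect different users differently, so you need larger sample sizes than you think. Variance reduction techniques such as CUPED, which adjusts each user's metric using pre-experiment data, can shrink the sample you need.
Multiple comparisons: if you test 5 independent metrics at a 0.05 significance level, there is roughly a 23% chance that at least one shows a false positive. Apply a correction (Bonferroni, Benjamini-Hochberg) or pre-register your primary metric.
Sequential testing: don't peek at results daily and stop when it "looks significant"; repeated peeking inflates the false positive rate. Use sequential testing methods (group sequential designs, always-valid p-values) if you need to monitor continuously.
Novelty and primacy effects: users react to change itself. A new recommendation algorithm might get more clicks initially just because it's different (novelty), or fewer because users are accustomed to the old behavior (primacy). Wait for the effect to stabilize.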
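
The multiple-comparisons point can be made concrete with a little arithmetic and a hand-rolled Benjamini-Hochberg procedure. This is a sketch for illustration, assuming independent tests; in practice a statistics library would usually handle this. The p-values in the usage note are made up.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected
```

With five metrics, `family_wise_error_rate(5)` is about 0.23. For the illustrative p-values `[0.001, 0.008, 0.039, 0.041, 0.30]`, only the first two survive the correction, even though four of them are below 0.05 on their own.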

The practical rule: pick one primary metric, pre-register your hypothesis, run for the pre-calculated duration, then call it.
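
To pre-calculate the duration, you first need a sample size. A rough sketch for a conversion-rate metric, using the standard normal-approximation formula for comparing two proportions. The baseline rate, lift, and traffic numbers you feed in are your own assumptions, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def users_per_group(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(n_per_group, daily_new_users, treatment_share=0.5):
    """Days until the smaller group reaches n_per_group.

    Assumes every counted user is new to the experiment; returning
    users do not add to the sample, so real tests often take longer.
    """
    smaller_group_per_day = daily_new_users * min(treatment_share, 1 - treatment_share)
    return math.ceil(n_per_group / smaller_group_per_day)
```

Under these assumptions, detecting a 3% relative lift on a 5% baseline conversion rate takes several hundred thousand users per group, which is why small model improvements need long tests.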

Closing the Offline-Online Gap

The gap between offline and online metrics is a signal, not just noise. Investigate it:

Log everything — for every prediction served in the A/B test, log the model version, input features, prediction, and user outcome.
Build counterfactual datasets — use logged predictions from the A/B test to build better offline eval sets for next time.
Calibrate your offline metrics — track which offline improvements actually translate to online wins. Over time, you learn which offline metrics to trust.
Replay analysis — use logged data to simulate what would have happened under different model decisions.
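
Replay analysis is often done with inverse propensity scoring (IPS). The sketch below assumes something the list above does not spell out: that each log row also records the probability (propensity) with which the logging system chose the action it served, which requires some randomization at serving time. The row format and function name are hypothetical.

```python
def ips_estimate(logs, new_policy):
    """Estimate the average reward a deterministic policy would have earned.

    Each row is a dict with keys: context, action, propensity, reward.
    """
    total, n = 0.0, 0
    for row in logs:
        n += 1
        # Only rows where the new policy agrees with the logged action
        # contribute, reweighted by how unlikely that action was.
        if new_policy(row["context"]) == row["action"]:
            total += row["reward"] / row["propensity"]
    return total / n if n else float("nan")
```

The estimate is unbiased only if every action the new policy might take had a nonzero logged propensity. Small propensities make it noisy, so treat the result as a screening signal rather than a replacement for the online test.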

The teams that systematically close this gap iterate faster because they can trust their offline eval to predict online outcomes.

Common Mistakes

Shipping without online testing — "offline metrics improved" is not enough evidence to ship.
Under-powering the test — running for too short or with too few users. You get inconclusive results and make a gut call anyway, defeating the purpose.
Testing too many things at once — changing the model, the features, and the serving logic simultaneously. When results are ambiguous, you can't attribute.
Ignoring guardrail metrics — your primary metric improves but latency doubles or error rate spikes. Always monitor guardrails.
Not having a rollback plan — if the A/B test goes badly, you need to kill the treatment in minutes, not hours.
Optimizing for the A/B test — the test measures a real-world outcome. If you start gaming the test setup instead of building better models, you've lost the plot.
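
For the guardrail and rollback points, a simple automated check helps. This is a sketch assuming guardrails are metrics where an increase is bad (latency, error rate) and that you have agreed on a maximum tolerated relative increase for each. The metric names and limits in the usage note are hypothetical.

```python
def guardrail_violations(control, treatment, max_relative_increase):
    """Return the guardrail metrics where treatment exceeds its allowed increase."""
    violations = []
    for metric, limit in max_relative_increase.items():
        base, new = control[metric], treatment[metric]
        # Skip metrics with a zero baseline; a relative change is undefined.
        if base > 0 and (new - base) / base > limit:
            violations.append(metric)
    return violations
```

For example, with limits of `{"p95_latency_ms": 0.10, "error_rate": 0.25}`, a treatment whose p95 latency doubles is flagged, and a non-empty result can be wired to whatever switch turns the treatment off.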