Experiment Tracking
Recording every run so you can reproduce results and compare fairly
If you cannot reproduce last week's best run, you don't have a result — you have an anecdote. Experiment tracking is the boring discipline that turns ad-hoc notebook cells into a searchable, comparable history of everything you've tried. Skip it and you'll be re-running "that one good config" from memory within a month.
What to Track
At minimum, every run should record:
- Hyperparameters — learning rate, batch size, architecture choices, data splits. Everything that changes between runs.
- Metrics — loss curves, validation metrics, evaluation scores. Both final values and curves over time.
- Artifacts — model checkpoints, confusion matrices, sample predictions, generated outputs.
- Environment — library versions, GPU type, random seeds. The stuff you forget until you can't reproduce.
- Code version — git commit hash or a snapshot. Non-negotiable.
- Data version — which dataset, which split, which preprocessing version. This is the one most teams skip and most regret.
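The checklist above can be collapsed into one record per run. What follows is a minimal sketch using only the standard library, assuming the code lives in a git repository and the dataset is a single file. The function names (`capture_run_record`, `file_sha256`, `package_versions`) and the JSON layout are illustrative and do not match any particular tool's schema. Artifacts such as checkpoints are left out for brevity.

```python
import hashlib
import json
import platform
import subprocess
import sys
import time
from importlib import metadata
from pathlib import Path


def git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git repo."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def file_sha256(path: Path) -> str:
    """Hash a data file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names):
    """Look up installed versions; packages that are not installed map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def capture_run_record(params, metrics, data_path, seed, packages=()):
    """Bundle hyperparameters, metrics, environment, code and data versions."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "params": params,
        "metrics": metrics,
        "seed": seed,
        "code_version": git_commit(),
        "data_sha256": file_sha256(Path(data_path)),
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
            "packages": package_versions(packages),
        },
    }
```

A record like this can be written with `json.dumps` next to the checkpoint, or passed to whichever tracking tool you use as its parameter and tag payload.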
The Tools
The space has consolidated around a few serious options:
- Weights & Biases (W&B) — the most polished UX. Great dashboards, easy team sharing, good integrations. The default recommendation for most teams. Paid at scale.
- MLflow — open-source, self-hostable. The right choice when you need full data control or are already in a Databricks shop. UI is functional, not beautiful.
- Neptune — strong on experiment comparison and metadata flexibility. Worth evaluating if W&B pricing is a blocker.
- Comet — similar tier to Neptune. Good diffing between runs.
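- TensorBoard — free, installed alongside TensorFlow and usable from PyTorch through torch.utils.tensorboard (which needs the tensorboard package installed separately). Fine for individual researchers; breaks down for team workflows.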
Pick one and commit. The value is in consistency, not in the tool itself.
Organizing Experiments
Flat lists of runs become useless fast. Structure matters:
- Projects group related experiments (e.g., "fraud-detection-v2").
- Tags mark intent: "baseline", "architecture-search", "data-ablation".
- Groups tie together runs that belong to the same sweep or ablation.
- Notes capture why you ran this, not just what. "Testing whether dropout helps on small dataset" is more useful than "run_047".
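To make the structure above concrete, here is a small tool-agnostic sketch. The `Run` dataclass and `select` helper are hypothetical; hosted trackers expose the same ideas (project, group, tags, notes) through their own APIs. The run IDs and notes below are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str
    project: str
    group: str = ""
    tags: frozenset = frozenset()
    notes: str = ""


def select(runs, project=None, group=None, tags=()):
    """Return runs matching the project and group and carrying all given tags."""
    wanted = set(tags)
    return [
        r for r in runs
        if (project is None or r.project == project)
        and (group is None or r.group == group)
        and wanted <= set(r.tags)
    ]


# Placeholder runs; IDs, groups and notes are invented for illustration.
runs = [
    Run("run_045", "fraud-detection-v2", tags=frozenset({"baseline"}),
        notes="Reference config for all later comparisons"),
    Run("run_046", "fraud-detection-v2", group="dropout-sweep",
        tags=frozenset({"data-ablation"}),
        notes="Testing whether dropout helps on small dataset"),
    Run("run_047", "fraud-detection-v2", group="dropout-sweep",
        tags=frozenset({"data-ablation"}),
        notes="Same as run_046 with higher dropout"),
]

sweep = select(runs, project="fraud-detection-v2", group="dropout-sweep")
baselines = select(runs, tags=["baseline"])
```

With this structure, "show me the dropout sweep" or "show me every baseline" is a single filter instead of a scroll through a flat list.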
Comparing Runs
The whole point is comparison. Good practices:
- Fix a baseline and always compare against it. Relative improvement matters more than absolute numbers.
- Use parallel coordinates plots to see which hyperparameters correlate with good metrics.
- Compare on the same eval set — sounds obvious, but dataset drift between runs is a real and silent failure mode.
- Log statistical significance when differences are small. A 0.2% improvement on one eval set, from one seed, is usually indistinguishable from noise until a confidence interval or repeated runs back it up.
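One way to check whether a small gap over the baseline is real is a paired bootstrap on per-example results. The sketch below assumes both runs were scored on the same eval examples in the same order, and that each result is a 0/1 correctness flag. `paired_bootstrap` is a hypothetical helper, and the percentile interval is one of several reasonable choices.

```python
import random


def paired_bootstrap(baseline_correct, candidate_correct, n_resamples=2000, seed=0):
    """Bootstrap a 95% percentile interval for the accuracy difference.

    Both inputs are per-example 0/1 correctness lists over the same eval set,
    in the same order. A positive difference favours the candidate.
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("both runs must be scored on the same eval examples")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("eval set is empty")

    # Per-example difference: +1 candidate fixed it, -1 candidate broke it, 0 tie.
    diffs = [c - b for b, c in zip(baseline_correct, candidate_correct)]

    rng = random.Random(seed)
    resampled = []
    for _ in range(n_resamples):
        sample = rng.choices(diffs, k=n)  # resample examples with replacement
        resampled.append(sum(sample) / n)
    resampled.sort()

    lo = resampled[int(0.025 * n_resamples)]
    hi = resampled[int(0.975 * n_resamples) - 1]
    return {"observed_diff": sum(diffs) / n, "ci95": (lo, hi)}
```

If the interval straddles zero, the improvement cannot be distinguished from eval-set sampling noise. This captures only that source of variance; variation across training seeds needs repeated runs.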
Reproducibility
Tracking is necessary but not sufficient for reproducibility. You also need:
- Deterministic training — set all random seeds, use deterministic CUDA ops where possible. Accept that perfect reproducibility across hardware is often impossible.
- Data snapshots — hash your training data or use a data versioning tool (DVC, lakeFS). The model is only as reproducible as the data that trained it.
- Environment locking — pin your dependencies. A requirements.txt with unpinned versions is a time bomb.
- One-command replay — if you can't re-run a tracked experiment with a single command, your tracking is incomplete.
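Two of the items above lend themselves to a short sketch: seeding and dependency pinning. Both helpers are hypothetical. `seed_everything` seeds NumPy and PyTorch only if they are installed, and its `warn_only=True` argument assumes PyTorch 1.11 or newer. `unpinned_requirements` is a rough heuristic for simple requirements files, not a full parser of the requirements format.

```python
import os
import random
import re


def seed_everything(seed: int) -> list:
    """Seed the RNGs that are available; return the names of those seeded."""
    seeded = ["random"]
    random.seed(seed)
    # Affects only subprocesses started after this point, not this interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        # Ops without a deterministic implementation warn instead of raising.
        torch.use_deterministic_algorithms(True, warn_only=True)
        seeded.append("torch")
    except ImportError:
        pass
    return seeded


# A name (optionally with extras) followed by an exact '==' or '===' version.
_PINNED = re.compile(r"^[A-Za-z0-9._\-\[\],]+\s*===?\s*[^\s;]+")


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not pinned to one exact version."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _PINNED.match(line) or "*" in line:
            flagged.append(line)
    return flagged
```

Calling `seed_everything` at the top of the training entry point and failing CI when `unpinned_requirements` returns a non-empty list are two cheap guards. Neither guarantees bit-identical results across different hardware.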
Common Mistakes
- Tracking too late — retrofitting tracking onto an existing project is painful. Start from the first run.
- Logging everything, labeling nothing — 500 unlabeled runs are worse than 50 well-tagged ones.
- Ignoring failed runs — failures carry information. Track them too; you want to know what didn't work.
- Local-only tracking — if it's on your laptop, it doesn't exist for your team. Use a shared server or hosted service from day one.
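Recording failed runs is easiest when the tracking wrapper writes its record whether or not training succeeds. The sketch below assumes a plain JSON file per run. `tracked_run` and its field names are hypothetical, and in practice `log_dir` would point at shared storage or be replaced by calls to a hosted tracker.

```python
import json
import time
import traceback
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tracked_run(log_dir, run_id, params):
    """Write a JSON record for the run whether it finishes or raises."""
    record = {
        "run_id": run_id,
        "params": params,
        "metrics": {},
        "status": "running",
        "started": time.time(),
    }
    try:
        yield record  # the training code fills record["metrics"]
        record["status"] = "finished"
    except BaseException as exc:  # includes KeyboardInterrupt
        record["status"] = "failed"
        record["error"] = f"{type(exc).__name__}: {exc}"
        record["traceback"] = traceback.format_exc()
        raise  # re-raise so the failure stays visible to the caller
    finally:
        record["ended"] = time.time()
        path = Path(log_dir)
        path.mkdir(parents=True, exist_ok=True)
        (path / f"{run_id}.json").write_text(json.dumps(record, indent=2))
```

Wrapping the training loop in `with tracked_run(...) as record:` from the first run means a diverged or crashed experiment leaves the same kind of searchable record as a successful one.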
The bar is low: log params, log metrics, tag your runs, version your data. Teams that do this consistently ship better models faster than teams that don't, regardless of which tool they pick.