Experiment Tracking

If you cannot reproduce last week's best run, you don't have a result — you have an anecdote. Experiment tracking is the boring discipline that turns ad-hoc notebook cells into a searchable, comparable history of everything you've tried. Skip it and you'll be re-running "that one good config" from memory within a month.

What to Track

At minimum, every run should record:

Hyperparameters — learning rate, batch size, architecture choices, data splits. Everything that changes between runs.
Metrics — loss curves, validation metrics, evaluation scores. Both final values and curves over time.
Artifacts — model checkpoints, confusion matrices, sample predictions, generated outputs.
Environment — library versions, GPU type, random seeds. The stuff you forget until you can't reproduce.
Code version — git commit hash or a snapshot. Non-negotiable.
Data version — which dataset, which split, which preprocessing version. This is the one most teams skip and most regret.
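
As a rough illustration of the list above, here is a minimal stdlib-only sketch of a run record written to JSON. The function names (build_run_record, log_metric, save_run_record), the field names, and the run.json filename are assumptions made for this example, not any tracking tool's API; a real tool stores the same information with more structure.

```python
import json
import platform
import subprocess
import sys
import time
from pathlib import Path


def current_git_commit() -> str:
    """Return the current commit hash, or "unknown" outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip() or "unknown"
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_run_record(params: dict, seed: int, data_version: str) -> dict:
    """Collect hyperparameters, environment, code and data version in one dict."""
    return {
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "seed": seed,
        "data_version": data_version,
        "code_version": current_git_commit(),
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "metrics": [],    # one entry per logged point, so curves can be rebuilt
        "artifacts": [],  # paths to checkpoints, plots, sample predictions
    }


def log_metric(record: dict, step: int, name: str, value: float) -> None:
    """Append one point to the metric history."""
    record["metrics"].append({"step": step, "name": name, "value": value})


def save_run_record(record: dict, run_dir: Path) -> Path:
    """Write the record to <run_dir>/run.json and return the path."""
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "run.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Calling build_run_record at the start of training and save_run_record at the end (or after each epoch) is enough to answer "what exactly produced this number?" later. GPU type and library versions are left out here to keep the sketch dependency-free; they belong in the environment entry.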

The Tools

The space has consolidated around a few serious options:

Weights & Biases (W&B) — the most polished UX. Great dashboards, easy team sharing, good integrations. The default recommendation for most teams. Paid at scale.
MLflow — open-source, self-hostable. The right choice when you need full data control or are already in a Databricks shop. UI is functional, not beautiful.
Neptune — strong on experiment comparison and metadata flexibility. Worth evaluating if W&B pricing is a blocker.
Comet — similar tier to Neptune. Good diffing between runs.
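TensorBoard — free, installed alongside TensorFlow and usable from PyTorch through torch.utils.tensorboard (with the tensorboard package installed separately). Fine for individual researchers; breaks down for team workflows.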

Pick one and commit. The value is in consistency, not in the tool itself.

Organizing Experiments

Flat lists of runs become useless fast. Structure matters:

Projects group related experiments (e.g., "fraud-detection-v2").
Tags mark intent: "baseline", "architecture-search", "data-ablation".
Groups tie together runs that belong to the same sweep or ablation.
Notes capture why you ran this, not just what. "Testing whether dropout helps on small dataset" is more useful than "run_047".
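
The structure above can be sketched in a few lines. The Run dataclass and find_runs helper below are hypothetical names for illustration only; hosted tools expose the same idea through their own project, tag, and group fields.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Set


@dataclass
class Run:
    run_id: str
    project: str                      # e.g. "fraud-detection-v2"
    tags: Set[str] = field(default_factory=set)
    group: Optional[str] = None       # shared by runs in one sweep or ablation
    notes: str = ""                   # why the run exists, in plain words


def find_runs(
    runs: Iterable[Run],
    project: Optional[str] = None,
    tag: Optional[str] = None,
    group: Optional[str] = None,
) -> List[Run]:
    """Return runs matching every filter that was provided."""
    matches = []
    for run in runs:
        if project is not None and run.project != project:
            continue
        if tag is not None and tag not in run.tags:
            continue
        if group is not None and run.group != group:
            continue
        matches.append(run)
    return matches


runs = [
    Run("run_046", "fraud-detection-v2", {"baseline"},
        notes="Reference config before any changes"),
    Run("run_047", "fraud-detection-v2", {"data-ablation"}, group="dropout-sweep",
        notes="Testing whether dropout helps on small dataset"),
    Run("run_048", "fraud-detection-v2", {"data-ablation"}, group="dropout-sweep",
        notes="Same as run_047 with higher dropout"),
]

sweep = find_runs(runs, project="fraud-detection-v2", group="dropout-sweep")
```

The point of the sketch is that a question like "show me the dropout sweep" becomes a filter rather than a scroll through run IDs.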

Comparing Runs

The whole point is comparison. Good practices:

Fix a baseline and always compare against it. Relative improvement matters more than absolute numbers.
Use parallel coordinates plots to see which hyperparameters correlate with good metrics.
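Compare on the same eval set. It sounds obvious, but an eval set that silently changes between runs (new rows, different filtering, a re-split) makes the numbers incomparable, and nothing will warn you.
Log statistical significance when differences are small. A 0.2% improvement from a single run on one eval set is usually within seed-to-seed noise; repeat the comparison across several seeds and report the spread before calling it a win.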
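
One way to put a number on "noise, not signal" is a percentile bootstrap over per-seed scores. The sketch below is one reasonable choice among several (a paired test or a permutation test would also work); the function name bootstrap_diff_ci and the example scores are made up for illustration.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_diff_ci(
    baseline: Sequence[float],
    candidate: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, CI lower bound, CI upper bound).

    Each argument holds one score per training seed, all measured on the
    same eval set. The interval is a percentile bootstrap on
    mean(candidate) - mean(baseline).
    """
    rng = random.Random(seed)  # fixed seed so the analysis itself is repeatable
    diffs = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))    # resample with replacement
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.fmean(c) - statistics.fmean(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = statistics.fmean(candidate) - statistics.fmean(baseline)
    return observed, lower, upper


# Hypothetical accuracies from five seeds per configuration.
baseline_scores = [0.910, 0.914, 0.908, 0.912, 0.911]
candidate_scores = [0.912, 0.909, 0.915, 0.910, 0.913]
observed, lower, upper = bootstrap_diff_ci(baseline_scores, candidate_scores)
```

If the interval includes zero, the runs do not support claiming an improvement. With only a handful of seeds the interval is rough, so treat it as a sanity check rather than a formal test.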

Reproducibility

Tracking is necessary but not sufficient for reproducibility. You also need:

Deterministic training — set all random seeds and use deterministic CUDA ops where possible. Accept that bit-exact reproducibility across different hardware is often impossible.
Data snapshots — hash your training data or use a data versioning tool (DVC, lakeFS). The model is only as reproducible as the data that trained it.
Environment locking — pin your dependencies. A requirements.txt with unpinned versions is a time bomb.
One-command replay — if you can't re-run a tracked experiment with a single command, your tracking is incomplete.
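
Two of the items above are cheap to sketch with the standard library: fingerprinting the training data and seeding random number generators. The helper names fingerprint_files and seed_everything are assumptions for this example. NumPy and PyTorch are seeded only if they happen to be installed, and seeding alone does not make GPU kernels deterministic.

```python
import hashlib
import random
from pathlib import Path
from typing import Iterable, Union


def fingerprint_files(paths: Iterable[Union[str, Path]], chunk_size: int = 1 << 20) -> str:
    """SHA-256 over file names and contents, visited in sorted path order.

    Log the returned hex digest with every run; if it differs between two
    runs, they did not train on the same bytes.
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        with path.open("rb") as f:
            # Read in chunks so large files do not have to fit in memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()


def seed_everything(seed: int) -> None:
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

A data versioning tool does far more than this (storage, lineage, remote sync), but even a logged hash is enough to detect that the data changed underneath you.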

Common Mistakes

Tracking too late — retrofitting tracking onto an existing project is painful. Start from the first run.
Logging everything, labeling nothing — 500 unlabeled runs are worse than 50 well-tagged ones.
Ignoring failed runs — failures carry information. Track them too; you want to know what didn't work.
Local-only tracking — if it's on your laptop, it doesn't exist for your team. Use a shared server or hosted service from day one.

The bar is low: log params, log metrics, tag your runs, version your data. Teams that do this consistently ship better models faster than teams that don't, regardless of which tool they pick.