ML Pipelines
Orchestrating the data-to-deploy flow so retraining doesn't require a human
An ML pipeline is the sequence of steps that turns raw data into a deployed model — and does it repeatably, without someone manually running notebooks in order. If you've ever heard "just re-run cells 3 through 17, but skip 12," you understand the problem pipelines solve.
The Canonical Flow
Almost every ML pipeline is a variation of:
- Data ingestion — pull raw data from sources, validate schema and freshness.
- Data validation — check for distribution drift, missing values, schema changes. Catch data problems before they become model problems.
- Feature engineering — transform raw data into model-ready features.
- Training — fit the model on the processed features.
- Evaluation — score the model against held-out data and compare to the current production model.
- Validation gates — automated checks: is the new model better? Does it pass fairness criteria? Is latency acceptable?
- Registration — push the model to the registry with full lineage.
- Deployment — promote to staging or production if gates pass.
Not every pipeline needs every step. But the ordering is almost always the same, and skipping validation steps is how you ship regressions.
Orchestration Tools
The job of the orchestrator is to run these steps in the right order, handle failures, and give you visibility into what's happening.
- Kubeflow Pipelines — Kubernetes-native. Good for teams already on K8s. Pipeline steps run as containers. Strong ML-specific features (experiment tracking integration, model serving). Complex to set up.
- Airflow — the general-purpose orchestrator. Battle-tested, huge ecosystem. Not ML-specific, but you can build ML pipelines on it. DAG authoring can be verbose.
- Prefect — modern Airflow alternative. Better developer experience, native Python, good error handling. Growing ML adoption.
- Dagster — asset-oriented orchestrator. You define what you want to produce (datasets, models), not just what to run. Strong data lineage and testing story.
- Vertex AI Pipelines / SageMaker Pipelines — cloud-managed. Least infrastructure burden, most vendor lock-in.
- ZenML — ML-specific, pluggable. Sits on top of other orchestrators and adds ML abstractions.
For pure ML: Kubeflow if you're on K8s, Dagster if you want asset-oriented thinking. For general orchestration that happens to include ML: Airflow or Prefect. For zero-ops: your cloud provider.
DAG Design for ML
ML pipelines are DAGs (directed acyclic graphs). Design principles:
- Each step should be independently testable and retriable. If evaluation fails, you shouldn't need to retrain.
- Make data the edges. Steps communicate through versioned datasets and artifacts, not in-memory objects.
- Parameterize everything. The same pipeline should run with different hyperparameters, data versions, or target environments via config.
- Separate fast steps from slow steps. Data validation (seconds) should run before training (hours). Fail fast.
- Cache aggressively. If the data hasn't changed, don't reprocess features. If features haven't changed, don't retrain. Content-addressed caching saves enormous time.
Pipeline Testing
Pipelines are code. Test them like code:
- Unit tests — test individual transform functions with small synthetic data.
- Integration tests — run the full pipeline on a small dataset. Verify that outputs have the right schema, shapes, and value ranges.
- Data contract tests — assert that input data matches expected schema and distributions before the pipeline runs.
- Smoke tests — run a minimal end-to-end pass after every pipeline code change.
The most common failure mode is a pipeline that works on today's data and silently breaks when data distribution shifts. Data validation steps inside the pipeline are your defense.
Scheduling and Triggers
Pipelines can run on:
- Schedule — daily, weekly, after each data refresh. The simplest and most common.
- Data trigger — new data lands, pipeline fires. Requires event infrastructure but gives you faster retraining.
- Performance trigger — monitoring detects model degradation, triggers retraining. The most sophisticated; requires reliable monitoring.
- Manual — someone clicks a button. Fine for the first version; you want to graduate from this.
Start with scheduled, add triggers as your monitoring matures.
Common Mistakes
- Monolithic notebooks as pipelines — a notebook is not a pipeline. It can't be tested, versioned, or retried at a step level.
- No idempotency — rerunning a step should produce the same output. Side effects (appending to tables, overwriting without versioning) break this.
- Ignoring data validation — the pipeline succeeds, the model trains, and nobody notices the input data was garbage until production metrics tank.
- Over-engineering early — you don't need Kubeflow on day one. A well-structured Python script with DVC is a pipeline too. Scale the tooling to the complexity.