Parallelism Strategies
Data, tensor, pipeline, sequence, and expert parallelism — when to use each and how to combine them
Every large-scale training run is a parallelism problem. The model is too big, the data is too much, or the sequences are too long for a single device. The art is picking the right combination of parallelism strategies so you maximize GPU utilization and minimize idle time.
Data Parallelism
The simplest and most widely used strategy. Every GPU holds a full copy of the model and processes a different chunk of the batch. After the backward pass, gradients are synchronized across GPUs.
DDP (Distributed Data Parallel) — PyTorch's standard approach. All-reduces gradients in the backward pass, overlapping communication with computation. Works beautifully when the model fits on one GPU.
FSDP (Fully Sharded Data Parallel) — an evolution of DDP. Instead of replicating the full model on every GPU, FSDP shards parameters, gradients, and optimizer states. Each GPU holds 1/N of the model and gathers what it needs on demand. This is data parallelism and memory-efficient model sharding combined.
When to use data parallelism:
- The model fits on a single GPU (DDP)
- The model fits with sharding across your node (FSDP)
- You want linear scaling of throughput with GPU count
- Your interconnect doesn't need to be the fastest (data parallelism is communication-friendly)
Tensor Parallelism
Split individual operations — typically large matrix multiplies — across GPUs. In a transformer, this usually means splitting the attention heads and MLP projections across devices so each GPU computes a slice of every layer.
How it works in practice:
- A linear layer with weight matrix (d x 4d) gets split column-wise across 4 GPUs, each holding (d x d).
- Each GPU computes its portion, then results are combined with an all-reduce or all-gather.
- Every layer requires communication, so fast interconnect (NVLink) is non-negotiable.
Typical tensor parallelism degrees: 2, 4, or 8 — usually within a single node where NVLink connects the GPUs. Going beyond one node for tensor parallelism is painful because of inter-node bandwidth.
When to use:
- Single layers are too large or you need to reduce per-GPU memory below what FSDP alone provides
- You have NVLink-connected GPUs within a node
- You're training frontier-scale models and need to combine with other strategies
Pipeline Parallelism
Split the model into sequential stages. GPU 0 runs layers 0–11, GPU 1 runs layers 12–23, and so on. Data flows through the pipeline like an assembly line.
The bubble problem: naive pipeline parallelism wastes GPU time. While GPU 0 processes micro-batch 2, GPU 1 is idle waiting for micro-batch 1 to finish. These idle slots are "pipeline bubbles."
Solutions to reduce bubbles:
- Micro-batching — split the mini-batch into many micro-batches and feed them through the pipeline in sequence. More micro-batches = smaller bubbles relative to useful compute.
- 1F1B schedule (one forward, one backward) — interleave forward and backward passes of different micro-batches. The standard schedule in Megatron-LM and DeepSpeed.
- Interleaved stages — assign non-consecutive layers to each GPU (e.g., GPU 0 gets layers 0–5 and 24–29). Reduces bubble size at the cost of more communication.
When to use:
- You have more GPUs than NVLink connections (i.e., cross-node training)
- The model is deep enough to split meaningfully into stages
- You want to combine with data parallelism across pipeline replicas
Sequence Parallelism
Long sequences create enormous activation memory. Sequence parallelism splits the sequence dimension across GPUs so each device handles a portion of the sequence length.
Two flavors:
- Megatron-style sequence parallelism — specifically targets the LayerNorm and dropout operations that aren't covered by tensor parallelism. Splits activations along the sequence dimension for those operations. Almost always used alongside tensor parallelism.
- Ring attention / context parallelism — splits the full attention computation across GPUs along the sequence dimension. Each GPU computes attention for its chunk of queries against all keys/values, with keys/values passed in a ring between devices. Enables training on sequences far longer than a single GPU's memory allows.
When to use:
- Training with very long sequences (32K+ tokens)
- Activation memory is the bottleneck, not parameter memory
- You're already using tensor parallelism and need further memory savings
Expert Parallelism (MoE)
Mixture of Experts (MoE) models have multiple "expert" sub-networks per layer. A router selects which experts process each token. Expert parallelism places different experts on different GPUs.
How it works:
- Each layer has N experts (e.g., 8 or 16). A gating function routes each token to the top-k experts (typically k = 1 or 2).
- Different experts live on different GPUs. Tokens are dispatched to the right GPU via all-to-all communication.
- Each GPU processes only the tokens routed to its experts, then results are gathered back.
The load-balancing challenge: if the router sends all tokens to the same expert, most GPUs sit idle. Solutions include auxiliary load-balancing losses, expert capacity factors (dropping tokens that exceed an expert's budget), and carefully tuned routing.
When to use:
- Training MoE architectures (Mixtral-style, Switch Transformer, etc.)
- You want to scale parameter count without proportionally scaling compute
- You have enough GPUs to dedicate one or more per expert
Communication Overhead
Every parallelism strategy introduces communication. The cost depends on volume, frequency, and network bandwidth:
| Strategy | Primary Operation | Frequency | Bandwidth Sensitivity |
|---|---|---|---|
| Data parallel | All-reduce gradients | Once per step | Moderate |
| Tensor parallel | All-reduce / all-gather activations | Every layer, both passes | Very high |
| Pipeline parallel | Point-to-point activations | Every micro-batch boundary | Low |
| Sequence parallel | All-gather / reduce-scatter activations | Every layer | High |
| Expert parallel | All-to-all token dispatch | Every MoE layer | High |
The hierarchy rule: put communication-heavy strategies (tensor parallelism) on fast interconnects (NVLink within a node). Put communication-light strategies (data parallelism, pipeline parallelism) across nodes where bandwidth is lower.
Combining Strategies: 3D+ Parallelism
Real frontier-scale training never uses just one strategy. The standard recipe at scale:
- Tensor parallelism within a node (TP = 4 or 8, over NVLink)
- Pipeline parallelism across nodes in a rack (PP = 4–8, over high-bandwidth Ethernet or InfiniBand)
- Data parallelism across pipeline replicas (DP = remaining GPUs)
This is "3D parallelism." Add sequence parallelism for long contexts and expert parallelism for MoE models and you get "5D parallelism" — which is what frontier labs actually run.
The decision order:
- Set tensor parallelism degree based on your intra-node interconnect and per-GPU memory needs.
- Set pipeline parallelism degree based on model depth and inter-node bandwidth.
- Use all remaining GPUs for data parallelism.
- Add sequence parallelism if activation memory is still a problem.
- Add expert parallelism if training a MoE model.
Practical Advice
- Profile before you commit. A 10-minute profiling run with PyTorch Profiler or Nsight Systems reveals whether you're compute-bound or communication-bound.
- Start simple. DDP or FSDP handles most fine-tuning and moderate-scale pre-training. Don't add tensor or pipeline parallelism until you actually need it.
- Match parallelism to hardware topology. The fastest link should carry the most communication. Ignoring network topology is the #1 cause of poor scaling.
- Watch for load imbalance. Uneven pipeline stages, uneven expert routing, or uneven sequence lengths all create idle GPUs that tank your effective throughput.