Distributed Snapshots
Chandy-Lamport and friends — capturing a consistent global state of a running system without stopping it, and what real systems do with the snapshot
Distributed Snapshots
A distributed snapshot is a recording of a running distributed system's global state that is consistent in the happens-before sense — the recording could plausibly have been the system's state at some moment, even though no actual moment captured exactly those values across all nodes. The algorithm that makes this possible without stopping the system is Chandy and Lamport's 1985 paper, one of the most-cited results in distributed systems and one of the most under-appreciated by application engineers.
This page is about how the algorithm works, what "consistent" means here, and where snapshots show up in real systems — most visibly in stream processors like Flink that depend on them for exactly-once processing.
Why "Consistent" Is Tricky
In a distributed system, no global wall-clock instant exists at which to read every node's state. By the time you finish reading node A and start reading node B, A has moved on. A naive "snapshot" — read each node when convenient and concatenate — produces an inconsistent picture:
Naive snapshot:
Read A at t=100: A has sent message m to B, A.balance = 90
Read B at t=110: B has not yet received m, B.balance = 0
Total balance in snapshot: 90 + 0 = 90
But m is "in flight" — the real total is 100.The recording is missing a message. If you used this snapshot to restore the system, the message would be lost; if you used it for analysis ("did our system ever violate the invariant total == 100?"), it would falsely flag a violation.
A consistent snapshot is one where:
- The state of every process is recorded, and
- The state of every channel between processes is recorded (the in-flight messages), and
- Together they form a "cut" through the happens-before graph — for every recorded receive of message
m, the corresponding send ofmis also in the snapshot.
Chandy-Lamport is the algorithm that produces such a snapshot without freezing the system.
The Chandy-Lamport Algorithm
The setting: processes communicate via FIFO channels. The algorithm uses special marker messages to delimit the snapshot.
Initiator
Any process can start a snapshot. The initiator:
- Records its own local state.
- Sends a
MARKERon every outgoing channel. - Starts recording messages arriving on every incoming channel (until a marker arrives on that channel).
Receiving a Marker
When process P receives a MARKER on incoming channel C:
-
If
Phas not yet recorded its state:- Record
P's own state. - Mark channel
C's state as empty (no messages received onCbetweenP's state recording and this marker, becausePstarted recording at the same instant it recorded its own state). - Send
MARKERon every outgoing channel. - Start recording messages on every other incoming channel.
- Record
-
If
Phas already recorded its state:- Mark channel
C's state as the sequence of messages received onCsincePrecorded its state, up to (but not including) this marker.
- Mark channel
The algorithm terminates when every process has received a marker on every incoming channel. The combined recording — local states plus channel states — is a consistent snapshot.
Why It Works
The marker acts as a virtual instant per channel. By construction:
- Messages sent before the sender's local snapshot arrive before the marker on the receiver's channel — these are in the channel state.
- Messages sent after the sender's local snapshot arrive after the marker — these are not in the snapshot (they belong to "after").
The marker partitions every channel cleanly. The result is a consistent cut, even though different processes recorded their states at different real-time instants.
Process A: ── send m1 ──▶ ── record state ── send MARKER ──▶ ── send m2 ──▶
Process B receives: m1 MARKER m2
│ │
│ └── record state, channel A→B is empty
│
└── this message is in A→B channel state in snapshotAssumptions
The algorithm assumes:
- Reliable, FIFO channels. Messages are delivered, and in order. The marker depends on FIFO to partition messages cleanly.
- No process failures during snapshot. If a process crashes mid-snapshot, the algorithm hangs waiting for markers that will never arrive.
- The system's graph is known. Each process knows its incoming and outgoing channels.
Variants exist for less restrictive models (Lai-Yang, Mattern, etc.) but the FIFO version is the most common practical implementation.
What Snapshots Are For
Distributed snapshots show up in surprisingly many places:
Checkpointing for Fault Tolerance
The dominant practical use today. A stream processor periodically takes a snapshot of every operator's state. On failure, the system restores from the most recent snapshot and replays input from that point.
Apache Flink uses distributed snapshots inspired by Chandy-Lamport for its checkpointing mechanism. Flink's variant (described in Carbone et al., 2015) is sometimes called asynchronous barrier snapshotting: barriers (Flink's markers) flow through the dataflow graph alongside data, and each operator snapshots its state when it has received barriers from all incoming streams. This is the mechanism behind Flink's exactly-once processing guarantee — combined with transactional sinks, the entire end-to-end pipeline becomes recoverable.
Distributed Debugging
If you want to inspect the global state of a running distributed system — "what is every node doing right now?" — taking a snapshot gives you a coherent picture. The standard alternative, ad-hoc inspection of each node, suffers from the inconsistency problem above.
Deadlock Detection
A snapshot can reveal cycles in the wait-for graph that would be hard to detect from any single node's view. Classical use case; less common today as application-level deadlocks are mostly handled by timeouts.
Distributed Garbage Collection
When references to objects can be held across process boundaries, deciding what is garbage requires a global view. Snapshots support reachability analysis across the whole system. Erlang and a few other distributed runtimes use snapshot-based GC.
Termination Detection
Determining that a distributed computation has finished (no more messages in flight, every process idle) is non-trivial. Snapshots provide one approach: snapshot the system; if every process is idle and every channel is empty, the computation has terminated.
Relation to Other Concepts
- Logical Clocks. A consistent snapshot is a cut through the happens-before graph. Logical clocks give you the partial order; Chandy-Lamport gives you the cut.
- Exactly-Once Semantics. Flink's checkpoint-based exactly-once processing depends on distributed snapshots. The snapshot is what makes "rewind and replay" produce a consistent result.
- Consistency Models. Linearizability is about ordering of operations on shared state; a snapshot is about the global state itself. Different problems, sometimes confused. A linearizable system does not automatically give you snapshot semantics.
- Failure Models. Chandy-Lamport assumes no process failures during the snapshot. Real implementations either retry on failure or use variants designed for crash-recovery.
Practical Lessons
- Most application engineers do not implement Chandy-Lamport. They use a system (Flink, Spark Structured Streaming, Akka Persistence, Camunda) that does, and benefit from the consistency property without writing the algorithm. The value of understanding it is in debugging and tuning, not implementation.
- Snapshots are expensive. Capturing the state of every operator and every in-flight message has real cost. Flink users tune checkpoint intervals carefully — too frequent and throughput suffers; too rare and recovery rewinds a long way.
- Channel state can be huge. A snapshot must capture all in-flight messages, which during high-throughput operation can be gigabytes. Some variants reduce this by snapshotting only "logical" channel content; others rely on the input source being replayable (Kafka, in Flink's case).
- Coordinated vs uncoordinated snapshots. The classical algorithm coordinates a global snapshot. Some systems use uncoordinated snapshots (each operator independently) and pay reconciliation cost on recovery instead. Both have their place.
Further Reading
- Chandy & Lamport, Distributed Snapshots: Determining Global States of Distributed Systems (TOCS 1985) — the paper. Short, readable, foundational.
- Mattern, Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation (1993) — variant without FIFO assumption.
- Carbone et al., Lightweight Asynchronous Snapshots for Distributed Dataflows (2015) — Flink's barrier snapshot algorithm. The most production-impactful adaptation.
- Lai & Yang, On Distributed Snapshots (1987) — non-FIFO variant.
- Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms — Chapter 6 has the cleanest textbook treatment.
Pre-commit Checklist
- For any "global state" question I am answering (deadlock, termination, invariant check), am I building on a consistent snapshot, or stitching together per-node reads (which can produce inconsistent results)?
- For my stream processor's checkpoint interval, have I tuned the trade-off between throughput cost and recovery latency?
- For checkpointed state, is the input source replayable to the checkpoint position? If not, exactly-once is not achievable across failures.
- For implementations of Chandy-Lamport from scratch (rare), have I verified FIFO channels and handled process-failure-during-snapshot explicitly?
- For "I need to inspect the running cluster" debugging tasks, am I using a snapshot-aware tool, or just polling endpoints (which can produce a torn view)?