Chandy-Lamport and friends — capturing a consistent global state of a running system without stopping it, and what real systems do with the snapshot

Distributed Snapshots

A distributed snapshot is a recording of a running distributed system's global state that is consistent in the happens-before sense — the recording could plausibly have been the system's state at some moment, even though no actual moment captured exactly those values across all nodes. The algorithm that makes this possible without stopping the system is Chandy and Lamport's 1985 paper, one of the most-cited results in distributed systems and one of the most under-appreciated by application engineers.

This page is about how the algorithm works, what "consistent" means here, and where snapshots show up in real systems — most visibly in stream processors like Flink that depend on them for exactly-once processing.

Why "Consistent" Is Tricky

In a distributed system, no global wall-clock instant exists at which to read every node's state. By the time you finish reading node A and start reading node B, A has moved on. A naive "snapshot" — read each node when convenient and concatenate — produces an inconsistent picture:

Naive snapshot:
  Read A at t=100: A has sent message m to B, A.balance = 90
  Read B at t=110: B has not yet received m, B.balance = 0
  
  Total balance in snapshot: 90 + 0 = 90
  But m is "in flight" — the real total is 100.

The recording is missing a message. If you used this snapshot to restore the system, the message would be lost; if you used it for analysis ("did our system ever violate the invariant total == 100?"), it would falsely flag a violation.

A consistent snapshot is one where:

The state of every process is recorded, and
The state of every channel between processes is recorded (the in-flight messages), and
Together they form a "cut" through the happens-before graph — for every recorded receive of message m, the corresponding send of m is also in the snapshot.

Chandy-Lamport is the algorithm that produces such a snapshot without freezing the system.

The Chandy-Lamport Algorithm

The setting: processes communicate via FIFO channels. The algorithm uses special marker messages to delimit the snapshot.

Initiator

Any process can start a snapshot. The initiator:

Records its own local state.
Sends a MARKER on every outgoing channel.
Starts recording messages arriving on every incoming channel (until a marker arrives on that channel).

Receiving a Marker

When process P receives a MARKER on incoming channel C:

If P has not yet recorded its state:
- Record P's own state.
- Mark channel C's state as empty (no messages received on C between P's state recording and this marker, because P started recording at the same instant it recorded its own state).
- Send MARKER on every outgoing channel.
- Start recording messages on every other incoming channel.
If P has already recorded its state:
- Mark channel C's state as the sequence of messages received on C since P recorded its state, up to (but not including) this marker.

The algorithm terminates when every process has received a marker on every incoming channel. The combined recording — local states plus channel states — is a consistent snapshot.

Why It Works

The marker acts as a virtual instant per channel. By construction:

Messages sent before the sender's local snapshot arrive before the marker on the receiver's channel — these are in the channel state.
Messages sent after the sender's local snapshot arrive after the marker — these are not in the snapshot (they belong to "after").

The marker partitions every channel cleanly. The result is a consistent cut, even though different processes recorded their states at different real-time instants.

Process A: ── send m1 ──▶ ── record state ── send MARKER ──▶ ── send m2 ──▶
                                                                          
Process B receives:           m1                  MARKER             m2
                              │                  │
                              │                  └── record state, channel A→B is empty
                              │
                              └── this message is in A→B channel state in snapshot

Assumptions

The algorithm assumes:

Reliable, FIFO channels. Messages are delivered, and in order. The marker depends on FIFO to partition messages cleanly.
No process failures during snapshot. If a process crashes mid-snapshot, the algorithm hangs waiting for markers that will never arrive.
The system's graph is known. Each process knows its incoming and outgoing channels.

Variants exist for less restrictive models (Lai-Yang, Mattern, etc.) but the FIFO version is the most common practical implementation.

What Snapshots Are For

Distributed snapshots show up in surprisingly many places:

Checkpointing for Fault Tolerance

The dominant practical use today. A stream processor periodically takes a snapshot of every operator's state. On failure, the system restores from the most recent snapshot and replays input from that point.

Apache Flink uses distributed snapshots inspired by Chandy-Lamport for its checkpointing mechanism. Flink's variant (described in Carbone et al., 2015) is sometimes called asynchronous barrier snapshotting: barriers (Flink's markers) flow through the dataflow graph alongside data, and each operator snapshots its state when it has received barriers from all incoming streams. This is the mechanism behind Flink's exactly-once processing guarantee — combined with transactional sinks, the entire end-to-end pipeline becomes recoverable.

Distributed Debugging

If you want to inspect the global state of a running distributed system — "what is every node doing right now?" — taking a snapshot gives you a coherent picture. The standard alternative, ad-hoc inspection of each node, suffers from the inconsistency problem above.

Deadlock Detection

A snapshot can reveal cycles in the wait-for graph that would be hard to detect from any single node's view. Classical use case; less common today as application-level deadlocks are mostly handled by timeouts.

Distributed Garbage Collection

When references to objects can be held across process boundaries, deciding what is garbage requires a global view. Snapshots support reachability analysis across the whole system. Erlang and a few other distributed runtimes use snapshot-based GC.

Termination Detection

Determining that a distributed computation has finished (no more messages in flight, every process idle) is non-trivial. Snapshots provide one approach: snapshot the system; if every process is idle and every channel is empty, the computation has terminated.

Relation to Other Concepts

Logical Clocks. A consistent snapshot is a cut through the happens-before graph. Logical clocks give you the partial order; Chandy-Lamport gives you the cut.
Exactly-Once Semantics. Flink's checkpoint-based exactly-once processing depends on distributed snapshots. The snapshot is what makes "rewind and replay" produce a consistent result.
Consistency Models. Linearizability is about ordering of operations on shared state; a snapshot is about the global state itself. Different problems, sometimes confused. A linearizable system does not automatically give you snapshot semantics.
Failure Models. Chandy-Lamport assumes no process failures during the snapshot. Real implementations either retry on failure or use variants designed for crash-recovery.

Practical Lessons

Most application engineers do not implement Chandy-Lamport. They use a system (Flink, Spark Structured Streaming, Akka Persistence, Camunda) that does, and benefit from the consistency property without writing the algorithm. The value of understanding it is in debugging and tuning, not implementation.
Snapshots are expensive. Capturing the state of every operator and every in-flight message has real cost. Flink users tune checkpoint intervals carefully — too frequent and throughput suffers; too rare and recovery rewinds a long way.
Channel state can be huge. A snapshot must capture all in-flight messages, which during high-throughput operation can be gigabytes. Some variants reduce this by snapshotting only "logical" channel content; others rely on the input source being replayable (Kafka, in Flink's case).
Coordinated vs uncoordinated snapshots. The classical algorithm coordinates a global snapshot. Some systems use uncoordinated snapshots (each operator independently) and pay reconciliation cost on recovery instead. Both have their place.

Pre-commit Checklist

For any "global state" question I am answering (deadlock, termination, invariant check), am I building on a consistent snapshot, or stitching together per-node reads (which can produce inconsistent results)?
For my stream processor's checkpoint interval, have I tuned the trade-off between throughput cost and recovery latency?
For checkpointed state, is the input source replayable to the checkpoint position? If not, exactly-once is not achievable across failures.
For implementations of Chandy-Lamport from scratch (rare), have I verified FIFO channels and handled process-failure-during-snapshot explicitly?
For "I need to inspect the running cluster" debugging tasks, am I using a snapshot-aware tool, or just polling endpoints (which can produce a torn view)?

Distributed Snapshots

Distributed Snapshots

Why "Consistent" Is Tricky

The Chandy-Lamport Algorithm

Initiator

Receiving a Marker

Why It Works

Assumptions

What Snapshots Are For

Checkpointing for Fault Tolerance

Distributed Debugging

Deadlock Detection

Distributed Garbage Collection

Termination Detection

Relation to Other Concepts

Practical Lessons

Further Reading

Pre-commit Checklist

On this page