Membership & Failure Detection
Who is in the cluster right now, who has failed, and the gossip / phi-accrual / SWIM protocols that answer those questions
Membership & Failure Detection
Before a cluster can do anything interesting it must answer two questions: who is in this cluster right now? and which of us has failed? The first is group membership; the second is failure detection. They are intertwined — you cannot say "node 7 failed" without first agreeing that node 7 was a member — but the algorithms that solve them have different shapes and different cost models.
This page is about the building blocks: failure detector properties, heartbeats and their problems, the phi-accrual and SWIM protocols that fix those problems, and where membership fits relative to consensus. For acting on failures, see Leader Election and Failure Models; for the impossibility result that bounds what detectors can do, see FLP.
Why It Is Hard
In an asynchronous network, you cannot distinguish "node 7 crashed" from "node 7 is slow" from "the network between us is broken." A failure detector necessarily makes a guess; the only choice is whether to guess too quickly (false positives) or too slowly (false negatives). This is the dual of the problem FLP describes: without bounded message delay, no algorithm can correctly classify every node every time.
Chandra and Toueg's 1996 paper formalized the failure-detector design space with two properties:
- Completeness — every failed process is eventually suspected. Strong completeness: by every correct process. Weak: by at least one.
- Accuracy — no correct process is suspected. Strong accuracy: never. Weak: by at least one correct process. Eventual: eventually no correct process is suspected.
Real failure detectors are eventually strong (◇S in the Chandra-Toueg classification): they can be wrong temporarily but eventually agree on a correct picture. This is exactly enough to enable consensus under partial synchrony, which is why every production consensus algorithm assumes a detector roughly this strong.
Heartbeats: The Simple Approach
Every node periodically sends "I am alive" messages. A node that has not heard from a peer within a timeout is declared failed.
Node A ── heartbeat ──▶ Node B every 1s
Node A misses 3 heartbeats from B → declare B failedTwo parameters to choose: heartbeat interval and timeout. The trade-off is the standard detection-speed-versus-false-positive curve:
- Short timeout — fast detection but many false positives during normal GC, congestion, or bursty load.
- Long timeout — fewer false positives but slow failover when things really do break.
Production systems usually settle on intervals of 1-10 seconds and timeouts of 3-30 seconds, but the right number depends on your worst-case GC pause, network jitter, and how expensive a false positive is.
The Problems with Plain Heartbeats
Heartbeats are simple and almost-good-enough. The places they break:
- All-or-nothing classification. A node is either alive or dead. Reality is a continuum: a node can be slow without being dead. The detector should output a level of suspicion, not a binary.
- Static timeouts. A timeout set for a quiet datacenter fires constantly during a noisy hour. A timeout set for the noisy hour is too slow during quiet hours.
- Asymmetric reachability. A can hear B's heartbeats but B cannot hear A's replies. A says "B is alive"; B says "A is dead." Both halves of the system disagree.
- Centralized monitoring. A single failure-detector node is itself a single point of failure. Distributed detection is harder but more robust.
Phi-Accrual Failure Detector
Hayashibara et al. (2004) proposed a detector that turns heartbeat history into a continuous suspicion value phi:
- Track the recent inter-arrival times of heartbeats from a peer (rolling window).
- Fit a distribution to those times — typically normal or exponential.
- On each request for "is this peer healthy?", compute
phi = -log10(P(no heartbeat by now))from the distribution. - Higher
phi= higher suspicion. Applications act whenphiexceeds an application-specific threshold (e.g., 8 for "probably dead," 12 for "definitely dead").
Steady state: heartbeats arrive every ~1s.
Last heartbeat at t=10s. Current time t=10.5s.
P(no heartbeat by t=10.5s) ≈ 0.5 → phi ≈ 0.3 (no suspicion)
Last heartbeat at t=10s. Current time t=15s.
P(no heartbeat by t=15s) ≈ 0.001 → phi ≈ 3 (some suspicion)
Last heartbeat at t=10s. Current time t=30s.
P(no heartbeat by t=30s) ≈ 1e-12 → phi ≈ 12 (high suspicion)The win: the detector adapts to actual network behavior. A noisier network shifts the distribution; the same phi threshold then takes longer to trigger, automatically. Used in Cassandra, Akka, and several other Erlang/Scala/JVM systems.
SWIM Protocol
Das, Gupta, and Motivala (2002) introduced SWIM (Scalable Weakly-consistent Infection-style Membership) to solve the centralized-detector and scalability problems together. The protocol has two parts:
Failure Detection via Random Probing
Periodically, each node:
- Picks a random peer
Pand sends aping. - If
Preplies within timeout, done. - If not, pick
Kother random peers (typically 3) and ask each toping-req(P). If any of them gets a reply,Pis alive. - If still no reply,
Pis suspected.
The indirect probing step is the clever part: it filters out one-off network glitches between this node and P. If P is reachable from anyone else in the cluster, P is alive.
Membership Dissemination via Gossip
When a node's status changes (joined, suspected, confirmed dead, alive again), the news is gossiped to randomly chosen peers, who in turn gossip it forward. After O(log N) rounds, the entire cluster knows.
Node A learns "C suspected"
A gossips to {D, F}
D gossips to {B, E}, F gossips to {G, H}
... convergence in roughly log2(N) gossip roundsGossip is robust: no single failure point, scales to thousands of nodes, message load per node stays constant as cluster grows.
SWIM (and its variants like Lifeguard, which adds local-health awareness) is the protocol behind HashiCorp's Serf and Consul, Uber's Ringpop, and many cluster-membership systems built since 2010.
Membership vs Failure Detection
The two are different problems:
- Membership answers "who is supposed to be in this cluster?" — usually persistent, changed by deliberate operations (add node, remove node).
- Failure detection answers "who is currently reachable?" — transient, changes constantly.
Conflating them produces bugs. A node that is briefly unreachable should not be removed from membership; it should be marked unreachable until it returns. A node that is intentionally removed should not be re-added just because its heartbeats resume.
Practical systems separate these:
- Membership changes go through consensus or an admin API. They are durable, slow, and rare.
- Reachability changes are detected by gossip / heartbeats. They are transient, fast, and frequent.
Consensus-Based vs Gossip-Based Membership
Two implementation styles, with different trade-offs:
| Style | Consensus-based (etcd, ZooKeeper, Raft) | Gossip-based (SWIM, Serf) |
|---|---|---|
| Consistency | Strongly consistent membership view | Eventually consistent |
| Scale | Hundreds of nodes (limited by consensus quorum cost) | Thousands of nodes |
| Reachability | Cluster halts on quorum loss | Continues with partial view |
| Use case | When you need strict agreement on who is in | When you need broad scale and AP behavior |
Cassandra uses gossip-based membership and pays the cost in occasional schema-disagreement bugs. Kubernetes uses etcd (consensus) for membership of services, and gossip-style health checks for pod reachability. Different layers, different choices.
Common Failure Modes
- Asymmetric partitions. A → B works, B → A does not. Membership systems that depend on bidirectional reachability declare both nodes alive or both dead depending on the direction of probes. SWIM's
ping-reqindirection helps; pure heartbeats often do not. - Flapping. A node oscillates between "alive" and "dead" due to marginal network. Each flap triggers expensive recovery (rebalancing, re-election). The fix is hysteresis: require a node to be stable in its new state for some time before acting.
- Slow membership convergence under churn. Adding 100 nodes at once to a gossip cluster takes time to propagate. During convergence, different observers have different views — operations that require global agreement (Cassandra schema changes, distributed locks) can race.
- Cascading failures from over-aggressive detection. A single slow node triggers timeouts everywhere, gets marked dead, work redistributes to other nodes, they become slow, more timeouts. Conservative thresholds plus circuit breakers.
- Membership state lost. A node restarts and forgets it was ever in the cluster, joins as a new node. Persistent membership IDs (not derived from IP) avoid this.
Real-World Choices
| System | Membership | Failure Detection |
|---|---|---|
| Cassandra | Gossip | Phi-accrual |
| Consul / Nomad | Serf (SWIM variant) | SWIM |
| Kubernetes | etcd (Raft) for services; lease-based for node health | Kubelet heartbeats; node-lifecycle controller |
| etcd / Raft clusters | Raft membership-change protocol | Raft heartbeats |
| ZooKeeper | Ensemble membership is static, configured | Session-based heartbeats |
| Akka | Custom (PhiAccrual-based) | Phi-accrual |
The pattern: high-scale systems prefer gossip-based; systems requiring strict consistency prefer consensus-based; almost all use phi-accrual or SWIM-style indirect probing rather than raw timeouts.
Further Reading
- Chandra & Toueg, Unreliable Failure Detectors for Reliable Distributed Systems (JACM 1996) — the foundational paper. The vocabulary every subsequent paper uses.
- Hayashibara et al., The phi-accrual failure detector (SRDS 2004) — the adaptive detector now standard in many JVM systems.
- Das, Gupta, Motivala, SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol (DSN 2002) — the SWIM paper.
- Dadgar et al., Lifeguard: Local Health Awareness for More Accurate Failure Detection (DSN-W 2018) — HashiCorp's refinement of SWIM.
- Cassandra documentation on gossip and the failure detector — the most readable practitioner treatment.
- Bailis & Kingsbury, The Network is Reliable (Communications of the ACM 2014) — what real partitions look like in production.
Pre-commit Checklist
- Are membership and failure detection separate concepts in my system, or am I conflating them?
- Does my failure detector output a level of suspicion or just alive/dead? Continuous is almost always better.
- Are my timeouts adaptive to actual network conditions, or set to a single static value?
- Do I have hysteresis on state transitions to prevent flapping?
- For asymmetric partitions, does my detector handle "A can hear B but not vice versa"? Test with iptables.
- Does my system survive a cascading-detection scenario (one slow node taking down its neighbors)?
- For membership changes, is there a deliberate admin path separate from the transient reachability signal?