Who is in the cluster right now, who has failed, and the gossip / phi-accrual / SWIM protocols that answer those questions

Membership & Failure Detection

Before a cluster can do anything interesting it must answer two questions: who is in this cluster right now? and which of us has failed? The first is group membership; the second is failure detection. They are intertwined — you cannot say "node 7 failed" without first agreeing that node 7 was a member — but the algorithms that solve them have different shapes and different cost models.

This page is about the building blocks: failure detector properties, heartbeats and their problems, the phi-accrual and SWIM protocols that fix those problems, and where membership fits relative to consensus. For acting on failures, see Leader Election and Failure Models; for the impossibility result that bounds what detectors can do, see FLP.

Why It Is Hard

In an asynchronous network, you cannot distinguish "node 7 crashed" from "node 7 is slow" from "the network between us is broken." A failure detector necessarily makes a guess; the only choice is whether to guess too quickly (false positives) or too slowly (false negatives). This is the dual of the problem FLP describes: without bounded message delay, no algorithm can correctly classify every node every time.

Chandra and Toueg's 1996 paper formalized the failure-detector design space with two properties:

Completeness — every failed process is eventually suspected. Strong completeness: by every correct process. Weak: by at least one.
Accuracy — no correct process is suspected. Strong accuracy: never. Weak: by at least one correct process. Eventual: eventually no correct process is suspected.

Real failure detectors are eventually strong (◇S in the Chandra-Toueg classification): they can be wrong temporarily but eventually agree on a correct picture. This is exactly enough to enable consensus under partial synchrony, which is why every production consensus algorithm assumes a detector roughly this strong.

Heartbeats: The Simple Approach

Every node periodically sends "I am alive" messages. A node that has not heard from a peer within a timeout is declared failed.

Node A ── heartbeat ──▶ Node B   every 1s
Node A misses 3 heartbeats from B → declare B failed

Two parameters to choose: heartbeat interval and timeout. The trade-off is the standard detection-speed-versus-false-positive curve:

Short timeout — fast detection but many false positives during normal GC, congestion, or bursty load.
Long timeout — fewer false positives but slow failover when things really do break.

Production systems usually settle on intervals of 1-10 seconds and timeouts of 3-30 seconds, but the right number depends on your worst-case GC pause, network jitter, and how expensive a false positive is.

The Problems with Plain Heartbeats

Heartbeats are simple and almost-good-enough. The places they break:

All-or-nothing classification. A node is either alive or dead. Reality is a continuum: a node can be slow without being dead. The detector should output a level of suspicion, not a binary.
Static timeouts. A timeout set for a quiet datacenter fires constantly during a noisy hour. A timeout set for the noisy hour is too slow during quiet hours.
Asymmetric reachability. A can hear B's heartbeats but B cannot hear A's replies. A says "B is alive"; B says "A is dead." Both halves of the system disagree.
Centralized monitoring. A single failure-detector node is itself a single point of failure. Distributed detection is harder but more robust.

Phi-Accrual Failure Detector

Hayashibara et al. (2004) proposed a detector that turns heartbeat history into a continuous suspicion value phi:

Track the recent inter-arrival times of heartbeats from a peer (rolling window).
Fit a distribution to those times — typically normal or exponential.
On each request for "is this peer healthy?", compute phi = -log10(P(no heartbeat by now)) from the distribution.
Higher phi = higher suspicion. Applications act when phi exceeds an application-specific threshold (e.g., 8 for "probably dead," 12 for "definitely dead").

Steady state: heartbeats arrive every ~1s.
Last heartbeat at t=10s. Current time t=10.5s.
P(no heartbeat by t=10.5s) ≈ 0.5  → phi ≈ 0.3   (no suspicion)

Last heartbeat at t=10s. Current time t=15s.
P(no heartbeat by t=15s) ≈ 0.001  → phi ≈ 3     (some suspicion)

Last heartbeat at t=10s. Current time t=30s.
P(no heartbeat by t=30s) ≈ 1e-12  → phi ≈ 12    (high suspicion)

The win: the detector adapts to actual network behavior. A noisier network shifts the distribution; the same phi threshold then takes longer to trigger, automatically. Used in Cassandra, Akka, and several other Erlang/Scala/JVM systems.

SWIM Protocol

Das, Gupta, and Motivala (2002) introduced SWIM (Scalable Weakly-consistent Infection-style Membership) to solve the centralized-detector and scalability problems together. The protocol has two parts:

Failure Detection via Random Probing

Periodically, each node:

Picks a random peer P and sends a ping.
If P replies within timeout, done.
If not, pick K other random peers (typically 3) and ask each to ping-req(P). If any of them gets a reply, P is alive.
If still no reply, P is suspected.

The indirect probing step is the clever part: it filters out one-off network glitches between this node and P. If P is reachable from anyone else in the cluster, P is alive.

Membership Dissemination via Gossip

When a node's status changes (joined, suspected, confirmed dead, alive again), the news is gossiped to randomly chosen peers, who in turn gossip it forward. After O(log N) rounds, the entire cluster knows.

Node A learns "C suspected"
A gossips to {D, F}
D gossips to {B, E}, F gossips to {G, H}
... convergence in roughly log2(N) gossip rounds

Gossip is robust: no single failure point, scales to thousands of nodes, message load per node stays constant as cluster grows.

SWIM (and its variants like Lifeguard, which adds local-health awareness) is the protocol behind HashiCorp's Serf and Consul, Uber's Ringpop, and many cluster-membership systems built since 2010.

Membership vs Failure Detection

The two are different problems:

Membership answers "who is supposed to be in this cluster?" — usually persistent, changed by deliberate operations (add node, remove node).
Failure detection answers "who is currently reachable?" — transient, changes constantly.

Conflating them produces bugs. A node that is briefly unreachable should not be removed from membership; it should be marked unreachable until it returns. A node that is intentionally removed should not be re-added just because its heartbeats resume.

Practical systems separate these:

Membership changes go through consensus or an admin API. They are durable, slow, and rare.
Reachability changes are detected by gossip / heartbeats. They are transient, fast, and frequent.

Consensus-Based vs Gossip-Based Membership

Two implementation styles, with different trade-offs:

Style	Consensus-based (etcd, ZooKeeper, Raft)	Gossip-based (SWIM, Serf)
Consistency	Strongly consistent membership view	Eventually consistent
Scale	Hundreds of nodes (limited by consensus quorum cost)	Thousands of nodes
Reachability	Cluster halts on quorum loss	Continues with partial view
Use case	When you need strict agreement on who is in	When you need broad scale and AP behavior

Cassandra uses gossip-based membership and pays the cost in occasional schema-disagreement bugs. Kubernetes uses etcd (consensus) for membership of services, and gossip-style health checks for pod reachability. Different layers, different choices.

Common Failure Modes

Asymmetric partitions. A → B works, B → A does not. Membership systems that depend on bidirectional reachability declare both nodes alive or both dead depending on the direction of probes. SWIM's ping-req indirection helps; pure heartbeats often do not.
Flapping. A node oscillates between "alive" and "dead" due to marginal network. Each flap triggers expensive recovery (rebalancing, re-election). The fix is hysteresis: require a node to be stable in its new state for some time before acting.
Slow membership convergence under churn. Adding 100 nodes at once to a gossip cluster takes time to propagate. During convergence, different observers have different views — operations that require global agreement (Cassandra schema changes, distributed locks) can race.
Cascading failures from over-aggressive detection. A single slow node triggers timeouts everywhere, gets marked dead, work redistributes to other nodes, they become slow, more timeouts. Conservative thresholds plus circuit breakers.
Membership state lost. A node restarts and forgets it was ever in the cluster, joins as a new node. Persistent membership IDs (not derived from IP) avoid this.

Real-World Choices

System	Membership	Failure Detection
Cassandra	Gossip	Phi-accrual
Consul / Nomad	Serf (SWIM variant)	SWIM
Kubernetes	etcd (Raft) for services; lease-based for node health	Kubelet heartbeats; node-lifecycle controller
etcd / Raft clusters	Raft membership-change protocol	Raft heartbeats
ZooKeeper	Ensemble membership is static, configured	Session-based heartbeats
Akka	Custom (PhiAccrual-based)	Phi-accrual

The pattern: high-scale systems prefer gossip-based; systems requiring strict consistency prefer consensus-based; almost all use phi-accrual or SWIM-style indirect probing rather than raw timeouts.

Pre-commit Checklist

Are membership and failure detection separate concepts in my system, or am I conflating them?
Does my failure detector output a level of suspicion or just alive/dead? Continuous is almost always better.
Are my timeouts adaptive to actual network conditions, or set to a single static value?
Do I have hysteresis on state transitions to prevent flapping?
For asymmetric partitions, does my detector handle "A can hear B but not vice versa"? Test with iptables.
Does my system survive a cascading-detection scenario (one slow node taking down its neighbors)?
For membership changes, is there a deliberate admin path separate from the transient reachability signal?

Membership & Failure Detection

Membership & Failure Detection

Why It Is Hard

Heartbeats: The Simple Approach

The Problems with Plain Heartbeats

Phi-Accrual Failure Detector

SWIM Protocol

Failure Detection via Random Probing

Membership Dissemination via Gossip

Membership vs Failure Detection

Consensus-Based vs Gossip-Based Membership

Common Failure Modes

Real-World Choices

Further Reading

Pre-commit Checklist

On this page