Steven's Knowledge

Leader Election

Patterns for electing a single coordinator — lease vs consensus, the split-brain problem, fencing tokens, and the zombie-leader failure mode

Leader Election

A leader is a process that other processes have agreed will play a designated role: write coordinator, cron runner, partition primary, lock holder, schema migrator. Many distributed protocols simplify dramatically when one node has authority — until that node fails, at which point the system needs to elect a new one without producing two leaders at the same time. That election is harder than it looks.

This page is about the patterns that work, the failures that recur in production, and where leader election sits in the wider distributed-systems toolbox. For the consensus algorithms that have leader election baked in, see Raft and Paxos; for the resource-side protection without which an election is unsafe, see Distributed Locks.

Two Contexts to Keep Distinct

The phrase "leader election" gets used for two related but distinct problems:

  • Internal leader election inside a consensus algorithm. Raft's randomized-timeout election, Paxos's "proposer with highest number wins." The leader exists to make consensus fast; the consensus algorithm itself proves the election is safe.
  • Application-level leader election for a singleton role outside consensus. "Only one worker should run this cron job" or "only one instance should be primary." This is what most people mean when they ask "how do I do leader election."

The two need the same correctness properties — at most one leader, eventually some leader — but the implementation cost differs by an order of magnitude. Application-level election usually delegates to a coordination service (etcd, ZooKeeper, Consul) rather than implementing consensus directly.

What "Leader" Has to Mean

A correct leader election must satisfy:

  • Safety: at most one process believes it is the leader, or the resource it leads is protected by a fencing mechanism that rejects stale-leader writes.
  • Liveness: if the current leader fails, eventually a new leader is elected.
  • Stability: under steady state, the leader does not change. Election churn is a liveness failure in practice even if every individual election terminates.

The first property is the hardest. As with distributed locks, no amount of cleverness in the election protocol prevents a paused or slow node from later believing it is still leader. The fix is the same: fencing tokens. The election service emits a monotonically increasing token (epoch number, term, generation ID); the resource records the highest token it has seen and rejects operations from lower-numbered leaders.

Common Patterns

Lock-Based with TTL

The simplest approach. Use a coordination service (etcd, ZooKeeper, Consul) to acquire a session-bound lock; the holder is leader. The session has a TTL; if it expires (heartbeat lost, process crashed), another node can acquire it.

Worker A: acquire(/leader, ttl=10s) → success, epoch=42
          start serving as leader
          renew lease every 3s

Worker A: pauses for GC at t=20s
Coordination svc: session expires at t=30s

Worker B: acquire(/leader, ttl=10s) → success, epoch=43
          starts serving as leader

Worker A: GC ends at t=60s
          still believes it is leader (epoch=42)
          tries to write to resource — REJECTED if resource checks epoch

This pattern is the workhorse for application singletons. It is correct if and only if the resource being led enforces fencing. Without fencing, you have a distributed lock problem in disguise.

Consensus-Based

The consensus algorithm itself elects a leader. Raft's randomized timeouts and term numbers, Multi-Paxos's stable-leader optimization. The election produces a leader plus a monotonic term/epoch that doubles as a fencing token.

If your system already needs consensus for some state (etcd, CockroachDB internals), leader election comes free. If you do not need consensus, building it just for the election is overkill.

Bully Algorithm (Garcia-Molina, 1982)

Classical algorithm. Each node has an ID. When a node detects the leader has failed, it sends an "election" message to all nodes with higher IDs:

  • If any higher-ID node replies, this node defers.
  • If no higher-ID node replies, this node becomes leader and announces it.

Simple to implement, but assumes synchronous communication and produces O(N²) messages in the worst case. Used in textbooks and small clusters; rarely in modern production.

Ring-Based

Nodes arranged in a logical ring. An election message circulates with the IDs of all candidates; the node with the highest ID wins. Variants include Chang-Roberts (lower message count) and Hirschberg-Sinclair.

Theoretically elegant. Rarely chosen in practice because the ring is brittle — a single broken link blocks elections — and ring membership is itself a hard problem (Membership & Failure Detection).

ZooKeeper "Recipe"

The canonical ZooKeeper leader election:

  1. Each candidate creates an ephemeral sequential znode under /election — e.g., /election/node-0000000005.
  2. The node with the lowest sequence number is leader.
  3. Each non-leader watches the next-lowest node; when that node disappears (ephemeral, session ended), the watcher re-evaluates.

The czxid of the leader's znode acts as a fencing token. This recipe is used by HBase, Kafka (pre-KRaft), Solr, and many other systems. It is the operational gold standard for application-level leader election when ZooKeeper is already in your stack.

Split-Brain

Split-brain is the failure mode where the system has two leaders simultaneously, each accepting work as if it were the sole leader. Causes:

  • Network partition. A and B can both reach a quorum of coordination nodes in their respective network slices. (In practice, a properly configured coordination service of 2f+1 nodes survives f failures; only one side of the partition can have quorum, so this is the rare case.)
  • Paused old leader. The classic GC scenario: lease expires, new leader elected, old leader resumes and is unaware.
  • Asymmetric partition. Old leader can send writes to the resource but not heartbeats to the coordinator. Coordinator declares the old leader dead and elects a new one; old leader keeps writing.
  • Clock skew on leases. Coordinator and leader disagree on elapsed time.

Prevention is the same in every case: fence the resource with a monotonic token. Detection without fencing is too late — by the time you notice two leaders are writing, the damage is done.

The Zombie Leader Failure Mode

A zombie leader is a process that believes it is leader but is not. This is the dominant cause of split-brain in real systems and the reason fencing is non-optional. Causes:

  • The leader's process paused (GC, swap, slow disk) past the lease TTL.
  • The leader lost its connection to the coordination service but the resource is still reachable.
  • A network partition isolates the leader from the coordinator but not from clients.

In every case, the leader's belief about its role is stale. The only defense is for the resource (database, queue, file system, downstream service) to verify the leader's epoch on every write. Anything the leader does without that check is at risk.

Practical Recipes

ScenarioApproach
Already running etcd / ZooKeeper / ConsulUse their built-in leader-election recipe
Already running a Raft / Paxos clusterUse its internal leader
Just need "one worker runs cron"Postgres advisory lock + epoch token (cheapest)
Need leader for a database singletonLease in the DB itself (the row contains the epoch)
Greenfield distributed systemUse a coordination service; do not implement consensus yourself
The leader must coordinate writes to many resourcesPass the epoch with every write; each resource checks

Common Mistakes

  • Electing without fencing. "We use ZooKeeper for elections" with a resource that does not check epochs. The election is correct; the resource is unsafe.
  • TTLs too aggressive. Lease of 5 seconds with 3-second renewal looks safe until a GC pause or a slow disk produces a zombie. Lease should be at least 10x the worst-case pause.
  • Single coordinator node. "We use Redis for elections." Single-node coordination services do not survive their own failure, and Redlock is contested (Distributed Locks).
  • Conflating leadership with mastership. A "leader" that serves reads from a stale replica is not really leading any consistency property. Be explicit about what the leader is responsible for.
  • No drain on demotion. When a leader loses its lease, it should stop accepting writes immediately, even before the new leader is elected. Many implementations only stop on a new-leader signal, leaving a gap.
  • Treating leader election as a performance optimization. If the system is correct only when there is exactly one leader, election is a safety mechanism, not a perf trick.

Relation to Other Pages

  • Distributed Locks — leader election is a special case of mutual exclusion. Same fencing argument applies.
  • Raft and Paxos — consensus algorithms with leader election built in.
  • FLP Impossibility — leader election is a form of consensus on "who is leader," so FLP applies. No timeout-free, randomization-free, asynchronous, deterministic leader election can guarantee termination.
  • Failure Models — what "the leader has failed" means depends on the failure model you assume.
  • Membership & Failure Detection — failure detection is the input to elections; bad detection means churning elections.

Further Reading

  • Garcia-Molina, Elections in a Distributed Computing System (IEEE TC 1982) — the original Bully algorithm paper.
  • Chang & Roberts, An improved algorithm for decentralized extrema-finding in circular configurations of processes (CACM 1979) — ring-based election.
  • Burrows, The Chubby Lock Service (OSDI 2006) — Google's production leader-election service; the model most modern systems follow.
  • Hunt et al., ZooKeeper: Wait-free coordination for Internet-scale systems (USENIX ATC 2010) — ZooKeeper's design; election recipes built on it are the de facto standard.
  • Kleppmann, How to do distributed locking (2016) — the fencing-token argument, equally applicable to leader election.

Pre-commit Checklist

  • Does the resource led by my elected leader check a fencing token on every write?
  • Is my lease TTL at least 10x my worst-case GC or pause?
  • On lease loss, does my leader stop accepting writes immediately, before the new leader is signaled?
  • Is my coordination service itself fault-tolerant (consensus-backed, not a single node)?
  • Have I tested zombie-leader scenarios (kill -STOP the leader, observe the resource's behavior)?
  • For "the leader does X" guarantees, is X protected by fencing or by structural single-writer constraints?

On this page