Steven's Knowledge

Raft

The understandable consensus algorithm — leader election, log replication, and the safety property that ties them together

Raft

Raft is a consensus algorithm: a protocol that lets a group of nodes agree on a sequence of values despite crashes, network delays, and message loss. It does the same job as Paxos, but Ongaro and Ousterhout's 2014 paper made an explicit design goal of understandability. That goal is what made Raft the default consensus algorithm in modern systems: etcd, Consul, CockroachDB, TiKV, MongoDB (replica sets), and Kafka's KRaft mode all run variants of it.

This page is the protocol view: how Raft works, why it is safe, and the failure modes you should know about before you trust it. For the broader why of consensus, read CAP & PACELC and Consistency Models first — Raft is how systems pay the bill for linearizability.

What Raft Provides

A Raft cluster of N nodes (typically 3 or 5) presents a replicated log to clients. The log is identical on a majority of nodes, and once an entry is committed it never disappears, even if up to (N-1)/2 nodes fail.

If you put a state machine on top of the log — applying each entry in order — every node ends up in the same state. This is the standard recipe for a linearizable distributed service.

        Client write


        ┌─────────┐
        │ Leader  │   1. append to own log
        └────┬────┘   2. send AppendEntries to followers

    ┌────────┼────────┐
    ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐
│Follow│ │Follow│ │Follow│  3. each appends, replies success
└──────┘ └──────┘ └──────┘


    4. Leader sees majority replied → entry is COMMITTED
    5. Leader applies entry to its state machine, replies to client
    6. Followers learn of commit on next AppendEntries, apply too

The Three Roles

At any moment, every node is in exactly one of three states:

  • Leader — handles all client requests. There is at most one leader per term.
  • Follower — passive. Responds to RPCs from the leader (and candidates during elections).
  • Candidate — a follower that has not heard from a leader for a while and is trying to become one.

Time is divided into terms, monotonically increasing integers. Each term begins with an election; if a leader is elected, it runs until it fails or is partitioned away.

Leader Election

Followers expect to hear AppendEntries heartbeats from the leader regularly. If a follower's election timeout (typically randomized between 150–300 ms) elapses without one, it assumes the leader is dead and converts to candidate:

  1. Increment its current term.
  2. Vote for itself.
  3. Send RequestVote RPCs to all other nodes.

A node that receives RequestVote grants its vote if:

  • It has not already voted in this term, and
  • The candidate's log is at least as up-to-date as its own (defined precisely as: higher last term, or same last term and longer log).

A candidate wins when it has votes from a majority. It immediately sends heartbeats to assert leadership. If two candidates run simultaneously and split the vote, both time out and try again with new randomized timeouts — the randomization makes a repeat split unlikely.

Randomized timeouts prevent collisions:

Node A: timeout=180ms ─────▶ starts election first
Node B: timeout=240ms       
Node C: timeout=270ms       both vote for A
                            
Result: A becomes leader without contention.

Log Replication

Once elected, the leader is the sole entry point for client writes. For each command:

  1. Leader appends {term, command} to its own log.
  2. Leader sends AppendEntries(prevLogIndex, prevLogTerm, entries[], leaderCommit) to all followers in parallel.
  3. A follower accepts the new entries only if its own log matches at prevLogIndex (the log matching property). If not, it rejects, and the leader retries with an earlier prevLogIndex until it finds the point where logs agree, then overwrites everything after that point.
  4. Once a majority of nodes (including the leader) hold the entry, the leader marks it committed.
  5. The leader piggybacks the new commitIndex on the next AppendEntries. Followers then apply the entry to their state machines.

The leader never overwrites or deletes entries from its own log; only follower logs get truncated when they diverge.

The Safety Property

The heart of Raft's correctness is the State Machine Safety property:

If a node has applied a log entry at a given index, no other node will ever apply a different entry at the same index.

Raft achieves this with three rules:

  1. Election restriction. A candidate cannot win unless its log contains all committed entries. (This is what "at least as up-to-date" enforces during voting.)
  2. Leaders only append. A leader never modifies entries in its own log — only adds to the end.
  3. Commit only entries from current term. A leader does not consider an entry committed by counting replicas alone if the entry is from a prior term; it must replicate an entry from its own term and let the log-matching property carry the older entries along. This rule plugs a subtle hole called the "phantom commit" problem.

These three rules together guarantee that committed entries are durable across any sequence of failures involving up to (N-1)/2 nodes.

What Else You Need to Know

The 2014 paper covers more than the core protocol. Briefly:

  • Log compaction (snapshots). Logs grow without bound. Periodically, a node snapshots its state machine and discards the prefix. Followers that fall too far behind catch up by receiving a snapshot via InstallSnapshot, then resuming normal replication.
  • Membership changes. Adding or removing nodes naively can split the cluster into two majorities. Raft handles this with joint consensus: during the transition, agreement requires majorities in both old and new configurations. Most implementations use the simpler single-server change variant (add/remove one node at a time), which is safe under similar reasoning.
  • Linearizable reads. Naively serving reads from the leader is not linearizable — the leader might have been deposed without knowing it. The two fixes: route reads through the log (slow but obviously correct), or use leader leases / ReadIndex (faster, requires confirming leadership via a no-op heartbeat round before responding).

When Raft Is Not the Answer

  • Low-latency wide-area writes. Every write requires a quorum round trip. Across regions, that is at least 100 ms. If you can tolerate causal or eventual consistency, you should (Consistency Models).
  • Geo-distributed multi-leader writes. Raft has exactly one leader. Multi-leader systems (Cassandra, Riak, DynamoDB Global Tables) use CRDTs or last-writer-wins, not Raft.
  • Byzantine failures. Raft assumes crash-stop / crash-recovery. A node that lies about its log breaks it. BFT consensus (PBFT, HotStuff) is the relevant family there.
  • Anything where you do not need consensus. Most application state does not need a linearizable log. Putting Raft under it is over-engineering with a real latency cost.

Failure Modes in Production

  • Liveness during partition. Raft prioritizes safety: during a partition that leaves no majority, no new entries are committed and the cluster is unavailable (CP, in CAP terms). This is correct behavior, but is the most common "Raft cluster is stuck" symptom in practice.
  • Disk fsync cost. Raft requires every log append to be durable before responding. A misconfigured node that returns success before fsync can lose committed entries on crash. Treat fsync correctness as the most important config in your Raft library.
  • Leader churn under load. If election timeouts are too short relative to actual network jitter or GC pauses, leaders are repeatedly demoted. Aim for election timeouts at least 10x your worst-case heartbeat round trip.
  • Slow disks make the whole cluster slow. A leader cannot commit faster than its slowest disk in the majority. One degraded node out of three is enough to halve throughput.

Further Reading

  • Ongaro & Ousterhout, In Search of an Understandable Consensus Algorithm (USENIX ATC 2014) — the paper. Short and worth reading directly.
  • Ongaro, Consensus: Bridging Theory and Practice (Stanford PhD thesis, 2014) — the full version, including membership changes, snapshots, and the ReadIndex protocol.
  • raft.github.io — the canonical implementation list and the visualization that has taught most people the protocol.
  • Lamport, Paxos Made Simple (2001) — read after Raft. The contrast makes both clearer.
  • The TLA+ spec of Raft, published by Ongaro — useful if you ever doubt the safety argument.

Pre-commit Checklist

  • Did I confirm fsync semantics of my Raft library against my actual disk / OS / filesystem?
  • Are my election timeouts safely larger than GC pauses and tail-latency network jitter?
  • For reads, am I using log-reads or ReadIndex / leader-lease? Plain "read from leader" is not linearizable.
  • If I have wide-area replicas, do my writes actually need Raft, or am I paying the WAN latency for nothing?
  • For membership changes, am I using joint consensus or single-server change? Did I test it?

On this page