Paxos
Classic Paxos, Multi-Paxos, and what the variants actually solve — the algorithm Raft was designed to replace and still cannot fully retire
Paxos
Paxos is the consensus algorithm that everyone cites and almost no one implements from scratch. Lamport published the original paper in 1989 (it took eight years to be accepted), then wrote a follow-up titled Paxos Made Simple in 2001 because no one understood the first one. Raft was designed in 2014 specifically because Paxos Made Simple still was not simple enough.
So why a page on Paxos? Because it is the algorithm Google, Microsoft, and Amazon actually use in their core infrastructure (Chubby, Megastore, Spanner, the original Dynamo paper); because every distributed-systems paper from the last 30 years assumes you have read it; and because the failure modes that show up in production consensus systems are easier to understand once you have seen the original.
This page is the protocol view. If you only intend to use consensus, Raft is the better page to start with.
What Paxos Solves
The problem statement is identical to Raft's: a group of nodes must agree on a single value despite crashes, message loss, and arbitrary delays. The differences are stylistic and pedagogical, not semantic — anything Paxos can decide, Raft can decide, and vice versa. Both work around the FLP impossibility the same way, by assuming partial synchrony.
Paxos works for one decision. To decide a sequence (i.e., build a replicated log), you run Paxos repeatedly — see Multi-Paxos below.
The Three Roles
Every node may play one or more of:
- Proposer — proposes a value to be decided.
- Acceptor — votes on proposals. The quorum lives here.
- Learner — learns the decided value once acceptors agree.
In practice all three roles run on every node. The separation matters because the safety argument only needs to constrain acceptors.
Single-Decree Paxos
The protocol has two phases. Each proposal carries a globally unique, monotonically increasing proposal number n (typically a tuple of (counter, node_id) so different proposers cannot collide).
Phase 1: Prepare / Promise
Proposer ──Prepare(n)─▶ Acceptors
Acceptor's reply:
IF n > highest n it has promised:
promise to reject any proposal with number < n
reply Promise(n, prev_n, prev_v) where prev_n/prev_v is the
highest-numbered proposal it
previously *accepted*, if any
ELSE:
reject (or ignore)A proposer needs Promise from a majority of acceptors to continue.
Phase 2: Accept / Accepted
Proposer picks the value v:
IF any Promise included a (prev_n, prev_v):
v = the prev_v with the highest prev_n ← MUST reuse it
ELSE:
v = whatever the proposer wanted to propose
Proposer ──Accept(n, v)─▶ Acceptors
Acceptor's reply:
IF it has not promised a higher number:
accept (n, v), reply Accepted(n, v)
ELSE:
rejectWhen a majority Accepted(n, v), the value v is chosen. Once a value is chosen, no other value can ever be chosen — that is the safety property.
Why It Is Safe
The non-obvious clause is the rule in Phase 2 that says the proposer must reuse the highest previously-accepted value. This is the rule that prevents two competing proposers from each getting a majority for a different value: any majority intersects the previous majority in at least one acceptor, so the new proposer will see the previously chosen value in at least one Promise reply and is forced to re-propose it.
The proof is short but extremely careful; it is the canonical "I think I get it… wait, no" experience for new distributed-systems practitioners. Paxos Made Simple walks through it; do not skip the section called "The Algorithm."
Multi-Paxos
Single-decree Paxos decides one value, in two RPC rounds, with no leader. Running it independently for every log entry would be unbearably slow (every entry pays two round trips) and produces fights among proposers (livelock during contention).
Multi-Paxos optimizes this:
- Elect a stable leader by running one Paxos round on "who is the leader for the next many entries?" Once elected, the leader skips Phase 1 for subsequent entries — Phase 1 only needs to run once per leadership term.
- Each new log entry takes one round (Phase 2 only) under the stable leader.
- A new leader runs Phase 1 for all unfilled log positions to learn what (if anything) was previously chosen there.
If this description sounds suspiciously like Raft, that is because Raft is, in effect, Multi-Paxos with the algorithm structure rearranged for clarity. Diego Ongaro has said as much.
Variants Worth Knowing
The Paxos family has accumulated dozens of variants. The ones that show up in real systems:
- Fast Paxos (Lamport, 2005) — saves one round trip in the uncontested case, at the cost of needing a larger quorum (3/4 instead of majority). Used in some Kubernetes-adjacent systems.
- Cheap Paxos — uses auxiliary acceptors only when needed, reducing the steady-state cost.
- EPaxos (Egalitarian Paxos, 2013) — leaderless. Any node can propose; commands that do not interfere commit in one round trip. Used in some Cassandra-adjacent systems.
- Flexible Paxos (Howard et al., 2016) — proves that Phase 1 and Phase 2 quorums need not be the same size, only that they intersect. Opens up new latency / fault-tolerance trade-offs (see Quorum Systems).
- Generalized Paxos — exploits commutative operations to commit non-conflicting ones concurrently.
Most production systems use Multi-Paxos with stable leader. The fancier variants are usually overkill.
Production Lessons
The Google paper Paxos Made Live (2007) is the canonical document on what you actually have to deal with when running Paxos. Key takeaways:
- Disk corruption matters. Acceptors must remember their promised numbers across crashes. A misbehaving disk that loses or rewrites this state breaks safety. Real implementations checksum, replicate, and treat disk corruption as a first-class fault mode.
- Master leases need careful handling. A leader that thinks it is still the master after losing quorum can serve stale reads. Most production systems use bounded clock leases tied to the master's term.
- Snapshotting and log compaction are most of the code. The basic protocol is small; making it run forever is large.
- Configuration changes (cluster membership) are the bug hotspot. The original Paxos paper barely covers this; production systems use joint consensus (membership-change rounds in both old and new configs) or the Stoppable Paxos / SMART approach.
These lessons apply equally to Raft.
Paxos vs Raft
Identical safety guarantees. Same fault tolerance ((N-1)/2 failures with N nodes). The differences:
| Aspect | Paxos | Raft |
|---|---|---|
| Pedagogy | Famously confusing | Designed for understandability |
| Log structure | Implicit (entries can have holes) | Explicit, contiguous |
| Leader election | Implicit (proposer with highest n wins) | Explicit, with terms and votes |
| Membership changes | Underspecified; many variants | Joint consensus, well-defined |
| Default production choice | Older systems (Chubby, Spanner) | Newer systems (etcd, Consul, CockroachDB) |
If you are choosing today: pick Raft. If you are reading code from Google, Amazon, or pre-2014 academia: you need to be able to read Paxos.
Further Reading
- Lamport, The Part-Time Parliament (1998) — the original, written as a parable about a fictional Greek island. Famously inscrutable; read for cultural reasons.
- Lamport, Paxos Made Simple (2001) — the version everyone actually reads.
- Chandra, Griesemer, Redstone, Paxos Made Live: An Engineering Perspective (PODC 2007) — Google's Chubby team on what the papers do not tell you.
- van Renesse & Altinbuken, Paxos Made Moderately Complex (2015) — a more rigorous middle ground; the best modern teaching reference.
- Howard et al., Flexible Paxos: Quorum Intersection Revisited (2016) — the recent insight on quorum design.
Pre-commit Checklist
- Did I confirm my Paxos library checksums acceptor state and treats corruption as a fault?
- Are proposal numbers globally unique and monotonically increasing across restarts? (Tuple of
(persistent_counter, node_id)is the standard recipe.) - For Multi-Paxos, does the new leader's Phase 1 cover all unfilled log positions, not just the next one?
- If I am using a Paxos variant (Fast / EPaxos / Flexible), do I understand the quorum size implication?
- Do I have an answer for "what does this cluster do during a partition?" — the answer should be "stops committing," not "diverges silently."