Quorum Systems
Read/write quorums, sloppy quorums, hinted handoff — the cheap mechanism behind most replicated storage
Quorum Systems
A quorum is a subset of nodes large enough to make a decision binding on the whole. The classical use is voting: a proposal passes if a majority approve. In distributed storage, quorums let a replicated system make progress while a minority of nodes are unreachable, and let two operations conflict if and only if their quorums overlap.
Quorums are how systems like Dynamo, Cassandra, and Riak get tunable consistency without the full cost of consensus. They are also the mechanism inside Raft and Paxos — every "majority" in those algorithms is a strict read/write quorum. This page covers the patterns, the math, and the trade-offs you can actually tune.
The Strict Quorum: R + W > N
Given N replicas, configure:
W= number of replicas a write must acknowledge before returning success.R= number of replicas a read must contact before returning a result.
If R + W > N, every read and every write overlap on at least one replica. That replica has the latest write, and the read will see it.
N = 5 replicas
W = 3, R = 3
R + W = 6 > 5
Write goes to replicas {1, 2, 3}.
Read goes to replicas {3, 4, 5}.
─────────────────────────────
Overlap: {3}. Read sees the write.The reader contacts R replicas, collects their values along with versions (typically a vector clock or timestamp), and returns the latest. If the values disagree, the reader can repair the stale replicas (read repair) before returning.
R + W > N is the cheapest mechanism for strong consistency in the sense most applications need. Variants:
| Config | Property |
|---|---|
W = N, R = 1 | Cheap reads, expensive writes. Cannot tolerate any write-side failure. |
W = 1, R = N | Cheap writes, expensive reads. Cannot tolerate any read-side failure. |
W = R = ⌈(N+1)/2⌉ | Both quorums are majorities. Balanced — usually the default. |
R + W ≤ N | No overlap guarantee. Eventually consistent reads — you might miss the latest write. |
What R + W > N Does Not Give You
Several common misunderstandings:
- It is not linearizability. A reader can race with a writer: both happen "at the same time" (no real-time ordering), and the reader may see either the old or new value. To get linearizable reads you need additional machinery (read leases, version-fence-tokens, or a single-leader scheme).
- It does not handle concurrent writes correctly. Two writers update the same key concurrently to different values. Both succeed on their write quorums. A reader sees both values and has to pick one — typically last-writer-wins (broken under clock skew) or by application-level reconciliation. See Logical Clocks.
- It does not protect against multi-key invariants. Each key has its own quorum. A transaction across two keys gets you nothing.
Sloppy Quorums and Hinted Handoff
Dynamo's tweak: if W of the preferred replicas are unreachable, write to some other W nodes — any W that respond — and record a hint that these are temporary replicas of the offline preferred ones.
Preferred replicas for key K: {1, 2, 3}
Replica 3 is down.
Sloppy quorum: write to {1, 2, 7}. Node 7 stores a hint
"I'm holding data for replica 3."
When replica 3 comes back, node 7 forwards the hint (hinted
handoff) and the system converges.Trade-off: higher write availability during partial failure, but read-after-write consistency is no longer guaranteed. A reader contacting {1, 2, 3} after the sloppy write will miss the value — because the write was on {1, 2, 7}, not the preferred set. Dynamo / Cassandra accept this; the design assumes eventual convergence and read-side repair.
Sloppy quorums are AP-flavored (CAP). Strict quorums where reads contact the preferred set are CP-flavored.
Read Repair and Anti-Entropy
Quorum systems converge through three mechanisms:
- Read repair. When a read sees divergent replicas, it writes back the latest value to the stale ones. Lightweight and free for any key being read; useless for cold keys.
- Anti-entropy. Background process that compares replicas (typically using Merkle trees) and reconciles differences. Catches cold keys that read repair misses.
- Hinted handoff. Already described — temporary replicas push their hints back when the original replicas recover.
A production AP store needs all three. Skip one and you get convergence anomalies that surface months later.
Flexible Quorums
Howard et al. (2016) generalized the quorum rule for consensus protocols: Phase 1 and Phase 2 quorums in Paxos do not have to be the same size. They only need to intersect.
The original Paxos rule was effectively Q1 + Q2 > N where both Q1 and Q2 are majorities (so Q1 = Q2 = ⌈(N+1)/2⌉). Flexible Paxos lets you trade them off:
Q1 (Prepare) | Q2 (Accept) | Property |
|---|---|---|
| 4 | 2 | Faster commit, slower leader election. Good for stable-leader workloads. |
| 2 | 4 | Slower commit, faster election. Good for failover-heavy workloads. |
| 3 | 3 | Standard majority. |
The insight applies to read/write quorums too: if reads are 100x more frequent than writes, use W = 4, R = 2 (with N = 5) — pay write latency once to make reads cheap.
Witness Replicas and Learners
Not every replica needs to hold full data:
- Witnesses (Spanner, MongoDB, ZooKeeper) participate in voting but do not store data. They cost less but let you build odd-numbered quorums from
N = 4(3 voters + 2 witnesses, say). Used to break ties cheaply. - Learners (Paxos terminology, used in some Raft variants) receive committed entries but do not vote. Useful for read replicas in different geographies — they get linearizable data without affecting the write quorum.
These are pragmatic optimizations on the basic quorum model; the safety reasoning is unchanged.
Choosing N, R, W
Heuristics that hold up:
- Start with
N = 3for non-critical data,N = 5for critical. LargerNincreases write latency and storage cost; benefits diminish past 5. Nshould be odd. Avoids even-split partitions and keeps the math clean.- Default to majority quorums (
R = W = ⌈(N+1)/2⌉) unless you have a measured reason to deviate. - For read-heavy workloads, increase
Wand decreaseR(subject toR + W > Nif you want consistency). - Replicas should be in different failure domains. Three replicas in the same rack is not
N = 3; it isN = 1with two copies.
Common Pitfalls
- Treating sloppy quorums as strict. Cassandra's
QUORUMconsistency level is a strict quorum within the preferred replicas;ANYis sloppy. Mixing them up gets you stale reads in production. - Cross-DC quorums without thinking about latency. A quorum that requires majority across regions adds 50–100 ms per write. Often the right answer is
LOCAL_QUORUM(majority within DC) with cross-DC replication async. - Assuming clock-based last-writer-wins works. It does not, except under bounded clock skew — see Logical Clocks.
- Forgetting anti-entropy. Read repair alone leaves cold-key divergence forever.
- Sizing for capacity, not failure tolerance.
N = 5, W = 3survives 2 failures only if your nodes are independent. Verify the placement strategy.
Further Reading
- DeCandia et al., Dynamo: Amazon's Highly Available Key-Value Store (SOSP 2007) — the paper that mainstreamed tunable quorums and sloppy quorums.
- Gifford, Weighted Voting for Replicated Data (SOSP 1979) — the foundational paper on read/write quorums.
- Howard et al., Flexible Paxos: Quorum Intersection Revisited (OPODIS 2016) — the modern insight on quorum sizing.
- Jepsen analyses of Cassandra, Riak, MongoDB — read for what goes wrong, not just what is supposed to work.
- Kleppmann, Designing Data-Intensive Applications, Chapter 5 — the most readable practitioner treatment.
Pre-commit Checklist
- For each key class, do I know
N,R,W— and isR + W > N(if I want consistency)? - Are my replicas in independent failure domains? (Different racks, AZs, ideally regions.)
- For multi-DC deployments, do my quorums span DCs (high latency, real cross-region durability) or stay within (lower latency, weaker durability)?
- Is read repair plus anti-entropy both running? Without both, cold keys diverge forever.
- For "strong consistency" claims, did I check the system is using strict (preferred-replica) quorums, not sloppy?
- For last-writer-wins reconciliation, what is my clock-skew assumption, and is it monitored?