Failure Models
Crash-stop, crash-recovery, omission, timing, Byzantine — the taxonomy of what can go wrong, and what each model lets you build
You cannot reason about failures you have not named. A protocol that "tolerates failures" without saying which kind is making a promise it cannot keep — the failure modes that actually matter in production are often the ones outside the model the protocol assumes. Naming failure types is the first step to picking a model whose assumptions match your environment.
This page is the taxonomy: what each failure model allows, what it forbids, and what real-world phenomena map to it.
The Hierarchy
Failure models form a containment hierarchy: weaker models are subsets of stronger ones, and tolerating a stronger model also tolerates the weaker.
Byzantine ← strongest adversary, fewest assumptions about faulty nodes
│ (anything can happen)
▼
Timing / Performance
│
▼
Omission
│
▼
Crash-recovery
│
▼
Crash-stop / Fail-stop ← weakest adversary, most assumptions about faulty nodes
(nodes just die cleanly)
A protocol that tolerates Byzantine failures also tolerates crash-stop. The reverse (a Raft cluster surviving a malicious node) is not true.
Crash-Stop (Fail-Stop)
Behavior allowed: a node executes the protocol correctly until some moment, then halts and never recovers. From that moment, no further messages, no further state. Other nodes can eventually detect the absence (via timeouts) but cannot distinguish "crashed" from "very slow."
Real-world fit: rare in pure form. A process that segfaults is close, but the disk persists, and a process supervisor will restart it. Crash-stop is a useful theoretical model and a common assumption for proofs, less common as the actual operational reality.
What you can build: consensus algorithms targeting crash-stop (early Paxos formulations) tolerate ⌊(N-1)/2⌋ failures in a cluster of N nodes, so surviving f crashes takes N ≥ 2f + 1. The reasoning is simpler, but you must verify that your environment actually fits the model.
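The arithmetic behind that bound can be checked with a few lines of Python. This is a minimal sketch; the function names `max_crash_faults` and `majority_quorum` are illustrative and not taken from any library.

```python
def max_crash_faults(n: int) -> int:
    """Largest f such that a majority of n nodes is still alive after f crashes."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return (n - 1) // 2


def majority_quorum(n: int) -> int:
    """Smallest group size for which any two groups share at least one node."""
    return n // 2 + 1


for n in (3, 4, 5):
    # Going from 3 to 4 nodes raises the quorum size but not the fault tolerance.
    print(n, max_crash_faults(n), majority_quorum(n))
```

Note that an even-sized cluster tolerates no more crashes than the odd-sized cluster one node smaller, which is why odd sizes are the usual choice.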
Crash-Recovery
Behavior allowed: a node crashes (stops executing, loses in-memory state), then later restarts and resumes — possibly losing some non-persistent data, possibly with stale state on disk.
Real-world fit: most modern systems. A node has a process supervisor; if the process dies, it restarts. Disks persist most state. Network links recover. This is the default model for Raft, Paxos, and most production consensus algorithms.
Key technical concern: what is durable across a crash? A protocol that requires writes to be durable before responding (Raft's AppendEntries rule, Paxos acceptor state) breaks silently if a misconfigured disk or filesystem returns success before fsync. Disk corruption, partial writes, and torn writes are all in scope for this model.
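As an illustration of what "durable before responding" involves, here is a minimal sketch of a write-then-rename pattern on a POSIX filesystem. The helper name `durable_write` is hypothetical, and the sketch assumes the device honors `fsync`; a disk or filesystem that acknowledges early defeats it, which is exactly the failure described above.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` and ask the OS to flush it to disk."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()               # Python buffer -> OS page cache
            os.fsync(tmp.fileno())    # ask the OS to push the page cache to the device
        os.replace(tmp_path, path)    # atomic rename: readers see old or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Flush the directory entry too, so the rename itself survives a crash.
    # O_DIRECTORY is not available on every platform, hence the guard.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A node would call something like `durable_write("state.bin", encoded_state)` and only then send its acknowledgement. Replying first and persisting later is the ordering bug this pattern is meant to prevent.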
What you can build: the same (N-1)/2-tolerant consensus as crash-stop, with the additional requirement of correctly handling restart. Most production bugs in this category come from the durable-state assumption being violated.
Omission
Behavior allowed: a node executes correctly when it executes, but messages can be lost: either the node fails to send a message it should have sent (send-omission), or it fails to receive or process a message that reached it (receive-omission).
Real-world fit: networks. Packet loss, dropped UDP, full receive buffers, NAT timeouts, asymmetric partitions ("A can reach B but B cannot reach A"). Asymmetric partitions are surprisingly common in cloud environments and are the source of many subtle bugs.
What you can build: most consensus algorithms work under crash-recovery + omission with no protocol change, because retries handle dropped messages. The real cost is liveness: a partial omission failure (e.g., one node can hear the leader but cannot reply) can keep a cluster stuck without an obvious symptom.
Common production form: a node whose outbound bandwidth is saturated. It can receive heartbeats (looking alive to others) but cannot send replies. The cluster believes the node is healthy but is making no progress.
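The claim that retries absorb dropped messages holds only if the receiver tolerates duplicates, because a sender cannot tell a lost request from a lost acknowledgement. The sketch below simulates that with an in-memory lossy channel. `Receiver`, `send_with_retry`, and the 50% loss rate are all assumptions made for illustration.

```python
import random


class Receiver:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()
        self.applied = []

    def deliver(self, msg_id, payload):
        if msg_id not in self.seen:       # a duplicate from a retry is acked, not re-applied
            self.seen.add(msg_id)
            self.applied.append(payload)
        return True                       # acknowledgement


def send_with_retry(receiver, msg_id, payload, rng, loss=0.5, max_attempts=100):
    """Resend until an ack arrives; the request or the ack may each be dropped."""
    for attempt in range(1, max_attempts + 1):
        if rng.random() < loss:
            continue                      # request lost on the way out
        receiver.deliver(msg_id, payload)
        if rng.random() < loss:
            continue                      # ack lost: the sender must retry anyway
        return attempt
    raise TimeoutError(f"no ack after {max_attempts} attempts")


rng = random.Random(0)
receiver = Receiver()
for i in range(20):
    send_with_retry(receiver, i, f"m{i}", rng)
assert receiver.applied == [f"m{i}" for i in range(20)]   # each applied exactly once
```

The retry loop restores delivery, not liveness: if one direction of the link drops everything, `send_with_retry` simply exhausts its attempts, which is the stuck-cluster case described above.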
Timing / Performance
Behavior allowed: a node executes correctly but slowly. Messages arrive eventually but late. Disk writes take 10x longer than expected. GC pauses for 30 seconds.
Real-world fit: the most common failure in modern systems. Not a binary crash but a brownout: stragglers, slow disks, GC, noisy neighbors, full memory bandwidth, JIT recompilation.
Why it is its own category: systems that assume crash-stop or omission often handle this badly. A node that responds in 30 seconds when the timeout was 5 seconds is treated as crashed, demoted, fenced — and then comes back and confuses the system. "Slow is the new down" is the operational mantra.
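One way to avoid the misclassification described above is to give "slow" its own state instead of folding it into "dead". The sketch below is a minimal illustration; the `Health` enum, the `classify` function, and both thresholds are hypothetical and would need tuning against real latency data.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    SLOW = "slow"                      # route around it, but do not fence or demote
    SUSPECTED_DEAD = "suspected_dead"  # only ever a suspicion: a crash and extreme
                                       # slowness look the same from the outside


def classify(seconds_since_last_reply: float,
             slow_after: float = 5.0,
             dead_after: float = 60.0) -> Health:
    """Map time since the last reply to a three-way health state."""
    if seconds_since_last_reply < slow_after:
        return Health.HEALTHY
    if seconds_since_last_reply < dead_after:
        return Health.SLOW
    return Health.SUSPECTED_DEAD


# The node from the example above: a 30-second reply against a 5-second threshold.
print(classify(30.0))
```

With a single threshold, the 30-second node would have been declared dead and fenced. With two thresholds, traffic is steered away from it while its membership is left alone.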
What you can build: systems that handle slow nodes without misclassifying them. Patterns:
- Hedged requests (send to multiple, take the first response).
- Tail-tolerant load balancing.
- Conservative timeouts (>= 10x worst-case latency).
- Explicit "this node is slow, route around it" health signals, separate from "this node is dead."
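The first pattern in the list, hedged requests, can be sketched with `asyncio`. This is the simplest form, matching the description above: send to every replica at once and keep the first successful answer. The `hedged` and `replica` names are illustrative, and `replica` only simulates latency with a sleep.

```python
import asyncio


async def hedged(*calls):
    """Run all coroutines concurrently and return the first successful result."""
    tasks = [asyncio.ensure_future(call) for call in calls]
    pending = set(tasks)
    try:
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all replicas failed")
    finally:
        for task in tasks:
            task.cancel()             # stop the stragglers; no-op for finished tasks


async def replica(name, delay):
    """Stand-in for a network call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def main():
    return await hedged(replica("fast", 0.01), replica("slow", 1.0))


print(asyncio.run(main()))
```

Sending every request to every replica multiplies load, so production variants commonly delay the backup request until the primary has been outstanding for a while. This sketch does not do that.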
Byzantine
Behavior allowed: anything. A node may send conflicting messages to different peers, lie about its state, forge identities, withhold messages selectively, collude with other faulty nodes.
Real-world fit: adversarial environments (public blockchains, multi-organization systems where parties do not fully trust each other). It also fits cosmic-ray bit flips, memory corruption, and disk corruption that returns wrong data which still passes its checksum. Even non-malicious systems occasionally exhibit Byzantine-looking failures.
What you can build: BFT consensus algorithms (PBFT, HotStuff, Tendermint) tolerate ⌊(N-1)/3⌋ Byzantine failures, so surviving f faults takes N ≥ 3f + 1 nodes rather than the 2f + 1 that crash-tolerant consensus needs. Costs are higher: more messages per round, more state, more cryptography. Practical Byzantine fault tolerance has shipped in production (Hyperledger Fabric, Diem, several blockchain platforms), but the cost-benefit only pencils out when the threat is real.
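The difference in cluster size is easy to see side by side. The helpers below are a small illustrative sketch (the names are not from any library) comparing the node counts needed for crash tolerance and Byzantine tolerance.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f with n >= 3f + 1."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return (n - 1) // 3


def min_nodes_for(f: int, byzantine: bool) -> int:
    """Smallest cluster that survives f faulty nodes under each model."""
    return 3 * f + 1 if byzantine else 2 * f + 1


for f in (1, 2, 3):
    # columns: faults tolerated, nodes needed (crash), nodes needed (Byzantine)
    print(f, min_nodes_for(f, byzantine=False), min_nodes_for(f, byzantine=True))
```

Tolerating a single faulty node takes 3 nodes under crash faults and 4 under Byzantine faults, and the gap widens by one node for every additional fault.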
The cost question: "do I need BFT?" The honest answer for most internal systems: no. Crash-recovery is enough, and the operational complexity of a BFT system is large. The honest answer for public-internet systems with untrusted participants: yes, and you should pay the cost.
The Synchrony Dimension
Orthogonal to the failure model is the timing model:
| Model | Assumption |
|---|---|
| Synchronous | Message delays bounded by a known constant. Processing time bounded. Clocks accurate. |
| Asynchronous | No bound on message delay or processing time. No clocks. |
| Partial synchrony | Eventually synchronous: bounds exist but are unknown, or hold only after some unknown stabilization time. |
The same failure model has different solvability under different synchrony assumptions. The FLP result shows that no deterministic protocol can guarantee consensus under pure asynchrony if even one node may crash. Under partial synchrony, crash failures are tolerable (Raft, Paxos): safety always holds, and progress resumes once the network behaves. Under full synchrony, even Byzantine consensus has straightforward solutions.
Real networks are partially synchronous. Designing as if they were synchronous is the source of many production bugs ("our timeout was 100ms because we measured 5ms in steady state"). Designing as if they were asynchronous is theoretically pure but operationally wasteful (you can never use a timeout).
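A common way to live with bounds that exist but are unknown is to let the timeout grow each time it expires, so that it eventually exceeds the real delay without anyone having to know that delay in advance. The sketch below illustrates the idea; `AdaptiveTimeout`, its defaults, and `expiries_until_covered` are assumptions for illustration, not part of any particular protocol.

```python
class AdaptiveTimeout:
    """Timeout that grows on each expiry, capped at `ceiling` seconds."""

    def __init__(self, initial: float = 0.1, factor: float = 2.0, ceiling: float = 30.0):
        self.current = initial
        self.factor = factor
        self.ceiling = ceiling

    def on_timeout(self) -> float:
        """Call when a wait expired without a reply; returns the new timeout."""
        self.current = min(self.current * self.factor, self.ceiling)
        return self.current


def expiries_until_covered(timeout: AdaptiveTimeout, true_delay: float) -> int:
    """Count expiries before the timeout exceeds a delay the caller cannot know."""
    expiries = 0
    while timeout.current <= true_delay:
        if timeout.current >= timeout.ceiling:
            raise RuntimeError("delay exceeds the configured ceiling")
        timeout.on_timeout()
        expiries += 1
    return expiries


# Starting at 100 ms against a real delay of 500 ms: 0.1 -> 0.2 -> 0.4 -> 0.8
print(expiries_until_covered(AdaptiveTimeout(initial=0.1), true_delay=0.5))  # prints 3
```

A fixed 100 ms timeout would expire forever in this scenario. The growing timeout gives up some detection speed in exchange for eventually making progress.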
Choosing a Model
A practical sequence:
- What do my nodes actually do when they fail? Get operational data: crash logs, post-mortems, on-call experience. Pure crashes? Slow nodes? Asymmetric partitions?
- What is the trust boundary? Internal services with the same operator: crash-recovery is enough. Multi-tenant or multi-organization with adversaries: Byzantine.
- What synchrony can I assume? Same datacenter: partial synchrony is safe to assume. Cross-region: partial synchrony with longer bounds. Internet-facing: closer to asynchronous.
- What is the cost of picking the wrong model? Designing for Byzantine when crash-recovery would suffice substantially increases the message count and the latency for no safety gain. Designing for crash-stop when omission failures happen costs you stuck clusters in production.
Common Mistakes
- Assuming crash-stop in a slow-node world. The protocol works in tests, fails in production when one disk gets slow.
- Ignoring asymmetric partitions. Many implementations of failure detectors assume symmetric communication. Test with iptables rules that drop one direction. See Membership & Failure Detection for the protocols that handle this correctly.
- "It is fine because we have monitoring." Monitoring tells you after the failure mode bit you. A correct failure model design tolerates the failure before the alert fires.
- Treating disk corruption as out-of-scope. Modern filesystems and disks fail silently more often than people expect. ZFS-style checksums, application-level checksums, and end-to-end verification are not paranoia.
- Assuming Byzantine when crash-recovery would do. Adds operational complexity and latency for no safety gain in single-operator systems.
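The asymmetric-partition mistake in the list above is easier to see in a toy model before reaching for real packet filters. The sketch below treats the network as a set of directed links and computes which peers each node still hears from. `who_sees_whom` and the three-node topology are illustrative assumptions.

```python
def who_sees_whom(nodes, dropped):
    """Map each node to the set of peers whose heartbeats reach it.

    `dropped` holds directed (sender, receiver) links that lose every packet.
    """
    return {
        receiver: {
            sender
            for sender in nodes
            if sender != receiver and (sender, receiver) not in dropped
        }
        for receiver in nodes
    }


nodes = ["A", "B", "C"]
# One-way failure: packets from B to A are dropped, A to B still works.
views = who_sees_whom(nodes, dropped={("B", "A")})

assert "B" not in views["A"]   # A stops hearing B and may declare it dead
assert "A" in views["B"]       # B still hears A and considers the link fine
assert views["C"] == {"A", "B"}
```

A failure detector that assumes "if I can hear you, you can hear me" has no way to represent this state, which is why the one-way drop test is worth running against the real system.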
Further Reading
- Hadzilacos & Toueg, A Modular Approach to Fault-Tolerant Broadcasts and Related Problems (1994) — foundational taxonomy of failure types.
- Cristian, Understanding Fault-Tolerant Distributed Systems (CACM 1991) — the canonical practitioner introduction.
- Lamport, Shostak, Pease, The Byzantine Generals Problem (TOPLAS 1982) — the founding Byzantine paper.
- Castro & Liskov, Practical Byzantine Fault Tolerance (OSDI 1999) — the algorithm that made BFT shippable.
- Bailis & Kingsbury, The Network is Reliable (ACM Queue 2014) — operational data on what failures actually look like in production.
Pre-commit Checklist
- For every "this protocol tolerates N failures" claim in my system, do I know which kind of failure?
- Have I tested the slow-node case (sleep injection, slow disk simulation), not just the crash case?
- Have I tested asymmetric partitions (one-way packet drop), not just symmetric ones?
- Does my failure detection emit separate signals for "slow" and "dead"? They should not collapse into one.
- If my model assumes "crash-stop," is the actual operational reality crash-recovery? (Almost certainly yes — recheck.)
- For untrusted-participant systems, have I weighed BFT vs trusted-coordinator vs cryptographic-attestation approaches?