Error Handling
Defensive boundaries, assertions, exceptions, and failure strategy
Error handling is one of the few topics where consistent application across a codebase matters more than the absolute choice. A team that handles errors uniformly, even imperfectly, produces software that is easier to operate than a team that mixes idioms.
Distinguish Two Kinds of Errors
The first decision in any error-handling design is to separate two categories that are often confused:
| Category | Source | Characteristic |
|---|---|---|
| Expected failure | External world: user input, network, files, peers | Will happen; must be handled |
| Programming error | Internal: invariant violations, impossible states | Should not happen; indicates a bug |
The two demand different mechanisms:
- Expected failures belong in the type signature (`Result`, `Either`, checked exceptions, return codes). They are part of the API.
- Programming errors should fail loudly via assertions, panics, or unchecked exceptions. There is no recovery; the goal is rapid diagnosis.
Conflating them produces both fragile defensive code (catching Exception to hide bugs) and silent data corruption (treating bugs as recoverable).
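A minimal sketch of the split, assuming a hypothetical `parsePort` helper and a plain `{ ok, value | error }` object as a stand-in for a `Result` type (JavaScript has no built-in one). All names here are illustrative, not part of any library.

```javascript
// Expected failure: user input may be bad, so the outcome is part of the
// return value. The { ok, value | error } shape is an illustrative convention.
function parsePort(text) {
  const n = Number(text);
  if (!Number.isInteger(n) || n < 1 || n > 65535) {
    return { ok: false, error: `invalid port: ${JSON.stringify(text)}` };
  }
  return { ok: true, value: n };
}

// Programming error: callers must pass an already-parsed port. A violation
// is a bug, so fail loudly instead of returning an error value.
function formatAddress(host, port) {
  if (!Number.isInteger(port)) {
    throw new TypeError(`formatAddress: port must be an integer, got ${typeof port}`);
  }
  return `${host}:${port}`;
}

const parsed = parsePort('8080');
const address = parsed.ok ? formatAddress('localhost', parsed.value) : null;
```

The caller has to inspect `parsed.ok` before it can reach `parsed.value`, which keeps the expected failure visible at the call site, while a misuse of `formatAddress` surfaces immediately as a thrown `TypeError`.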
Trust Boundaries
Validate aggressively at the edges; trust the interior.
┌──────────────────────────────────────────┐
│ External (untrusted) │
│ – HTTP requests, file I/O, RPC, queues │
│ ─────────────── validate here ────── │
│ Internal (trusted contracts) │
│ – module-to-module, function-to-function│
│ – assert, do not re-validate │
└──────────────────────────────────────────┘
Re-validating internal calls everywhere produces noise without improving safety. Failing to validate at the boundary produces vulnerabilities and corruption that surface deep inside the system.
Boundary checks include:
- Type and shape (schemas, parsers).
- Semantic constraints (ranges, regex, business rules).
- Authorization and authentication.
- Size and rate limits.
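The first, second, and fourth kinds of check can be sketched as a single parse step at the edge. The payload fields (`name`, `age`) and the limits are assumptions chosen for illustration; authentication, authorization, and rate limiting are omitted because they depend on the surrounding framework.

```javascript
// Boundary: turn an untrusted payload into a trusted value, or reject it.
// Field names and limits here are illustrative assumptions.
const MAX_NAME_LENGTH = 64;

function parseCreateUser(payload) {
  if (typeof payload !== 'object' || payload === null) {
    return { ok: false, errors: ['payload must be an object'] };
  }
  const errors = [];
  // Type and shape
  if (typeof payload.name !== 'string') errors.push('name must be a string');
  if (!Number.isInteger(payload.age)) errors.push('age must be an integer');
  // Semantic constraints and size limits
  if (typeof payload.name === 'string' && payload.name.length > MAX_NAME_LENGTH) {
    errors.push('name is too long');
  }
  if (Number.isInteger(payload.age) && (payload.age < 0 || payload.age > 150)) {
    errors.push('age is out of range');
  }
  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { name: payload.name, age: payload.age } };
}

// Interior: receives only values that passed the boundary, so it does not
// re-validate them.
function greeting(user) {
  return `Hello, ${user.name} (${user.age})`;
}
```

Copying only the validated fields into the returned value means unexpected extra properties in the payload never reach the interior.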
Assertions
Use assertions to state invariants — facts that must be true if the code is correct.
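
    // `assert` is assumed to be in scope, e.g. Node's built-in assert module
    // or an equivalent helper that throws when the condition is false.
    function popLast(stack) {
      assert(stack.length > 0, 'pop from empty stack');
      // ...
    }

When to assert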
- Preconditions on internal helpers.
- Postconditions when the caller relies on a specific shape.
- Loop and class invariants whose violation indicates a bug.
- Unreachable branches (`switch` defaults, exhaustive matches).
When not to assert
- For input from outside the trust boundary — that is validation, not assertion.
- For conditions that can legitimately be false (file missing, network down).
- With side effects in the assertion expression — assertions may be compiled out in production.
The dividing line is intent: an assertion documents "if this fails, the program is wrong"; validation documents "if this fails, the input is wrong."
Exception Strategy
Throw at the layer that detects; catch at the layer that can act
A function that detects a failure but cannot meaningfully respond should propagate the error. The handler belongs at the level that can do something useful — retry, fall back, surface the error to the user, give up.
The two failure modes to avoid:
- Catch and ignore. Swallowing an exception destroys evidence and converts a loud failure into a silent one.
- Catch too broadly. `catch (Exception)` at the top of every function obscures specific recovery logic and turns programming errors into operational noise.
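One way to avoid both failure modes is to handle only the error type the current layer can act on and rethrow everything else. The sketch below assumes a hypothetical `TimeoutError` class and a caller-supplied `fetchPrice` function; neither comes from a real library.

```javascript
// Hypothetical error type for an expected, recoverable failure.
class TimeoutError extends Error {}

// `fetchPrice` is a caller-supplied function that may throw.
function priceWithFallback(fetchPrice, cachedPrice) {
  try {
    return fetchPrice();
  } catch (err) {
    // Handle only the failure this layer knows how to recover from.
    if (err instanceof TimeoutError) return cachedPrice;
    // Anything else is unexpected here: rethrow rather than swallow it.
    throw err;
  }
}
```

A `TypeError` raised by a bug inside `fetchPrice` still propagates, so it is not mistaken for a slow upstream service.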
Preserve the cause when re-raising
When wrapping an exception in a higher-level type, attach the original as the cause:

    try {
      return await chargeCard(order);
    } catch (err) {
      throw new OrderProcessingError(`order ${order.id} failed`, { cause: err });
    }

The original stack and message remain available for debugging, while the higher-level type carries domain context for callers.
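For the `cause` to survive, the wrapping class has to forward its options to the `Error` constructor. The sketch below shows one possible definition of `OrderProcessingError` (the source does not show its implementation, so this is an assumption) along with a hypothetical `causeChain` helper that walks the chain. It relies on the standard `cause` option from ES2022.

```javascript
// One possible definition of the domain error. Passing `options` through to
// Error is what sets the standard `cause` property.
class OrderProcessingError extends Error {
  constructor(message, options) {
    super(message, options);
    this.name = 'OrderProcessingError';
  }
}

// Collect "name: message" for each error along the cause chain, outermost first.
function causeChain(err) {
  const messages = [];
  for (let e = err; e instanceof Error; e = e.cause) {
    messages.push(`${e.name}: ${e.message}`);
  }
  return messages;
}

const low = new Error('card declined');
const high = new OrderProcessingError('order 42 failed', { cause: low });
console.log(causeChain(high));
```

Here `causeChain(high)` yields `['OrderProcessingError: order 42 failed', 'Error: card declined']`, so a log line can carry both the domain context and the underlying failure.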
Use exceptions for exceptional cases
Do not use exceptions for ordinary control flow — end-of-iterator, missing-key, validation failure. The cost is partly performance (in some runtimes), but mostly readability: exception-as-control-flow obscures the normal path.
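The readability cost can be seen in a small contrast. Both functions below are hypothetical and return the same results; `prices` is an illustrative lookup table.

```javascript
const prices = new Map([['apple', 3]]);

// Exception as control flow: the ordinary "key is absent" case is routed
// through throw/catch, which hides the normal path.
function priceOrZeroWithThrow(item) {
  try {
    if (!prices.has(item)) throw new Error('missing');
    return prices.get(item);
  } catch {
    return 0;
  }
}

// Ordinary control flow: the absent-key case is a visible branch.
function priceOrZero(item) {
  return prices.has(item) ? prices.get(item) : 0;
}
```

The first version also has a hidden hazard: its `catch` would swallow any unrelated error raised inside the `try` block.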
Prefer checked failures in the type system, where supported
Languages with Result, Either, or sum types let the compiler force callers to consider the failure case. Use them. They eliminate an entire class of "forgot to handle" defects that APIs based on unchecked exceptions cannot prevent.
Choosing a Recovery Strategy
When a failure does occur, the response sits on a spectrum from "tolerate" to "halt":
| Strategy | When appropriate |
|---|---|
| Return a neutral value (empty list, zero) | Best-effort reads; failure is non-critical |
| Substitute the next valid value | Streaming pipelines; one bad record |
| Return the last known good value | Sensors, caches |
| Clamp to a permissible range | Bounded numeric inputs |
| Log and continue | Telemetry, non-essential side jobs |
| Return an error to the caller | Most domain operations |
| Throw / panic | Programming errors; corrupted state |
| Shut down | Safety-critical violations |
The right point on the spectrum is determined by the correctness vs. robustness trade-off:
- Correctness systems — finance, medical, control — prefer to halt rather than risk wrong output.
- Robustness systems — media, gaming, observability — prefer to degrade rather than stop.
A single product usually has both kinds of code. The strategy should be explicit per module, not implicit per developer.
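Three rows of the table, sketched side by side to show how the choice can be made explicit in code. The function names and thresholds are illustrative assumptions, not recommendations for any particular system.

```javascript
// Clamp to a permissible range.
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Return the last known good value when a read fails or yields garbage.
// `readRaw` is a caller-supplied function that may throw.
function makeSensorReader(readRaw, initial) {
  let lastGood = initial;
  return function read() {
    try {
      const value = readRaw();
      if (Number.isFinite(value)) lastGood = value;
    } catch {
      // Degrade: keep serving lastGood. A real system would also count or
      // log this failure so that it stays visible.
    }
    return lastGood;
  };
}

// Halt instead of degrading: a correctness-oriented module refuses to guess.
function applyTransfer(balance, amount) {
  if (!Number.isFinite(amount) || amount <= 0 || amount > balance) {
    throw new RangeError(`invalid transfer amount: ${amount}`);
  }
  return balance - amount;
}
```

The sensor reader and the transfer function could live in the same product; what matters is that each module's position on the spectrum is visible in its code rather than left to individual habit.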
Fail Fast, Fail Loudly
The further a failure travels from its origin, the harder it becomes to diagnose. Two corollaries:
- Detect close to the cause. Validate at the boundary, assert invariants where they hold, check returned values at the call site.
- Make failures visible. A logged warning with no actionable signal is barely better than a silent failure. Use error rates, alerts, and structured logs that route to humans who can respond.
Silent recovery in the wrong place is the most expensive form of error handling, because the failure surfaces somewhere with no remaining context.
Resource Cleanup
Errors complicate resource management. Use the language's structured cleanup primitives, not manual try/finally ladders, when available:
- `with` (Python), `using` (C#), `try-with-resources` (Java).
- RAII destructors (C++, Rust).
- `defer` (Go, Swift).
- `await using` (modern JavaScript).
The goal is that cleanup happens regardless of whether the body succeeded, returned early, or threw. Manual cleanup ladders are error-prone and routinely forgotten on the failure path.
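Where a runtime does not yet support a structured primitive, the same guarantee can be approximated by writing the `try`/`finally` once in a helper rather than at every call site. The `withResource` helper below is a hypothetical, synchronous sketch; `acquire`, `release`, and `use` are caller-supplied callbacks.

```javascript
// Run `use(resource)` and always release the resource afterwards, whether
// `use` returns normally or throws.
function withResource(acquire, release, use) {
  const resource = acquire();
  try {
    return use(resource);
  } finally {
    release(resource);
  }
}

// Usage: record the order of events for a fake resource.
const events = [];
const result = withResource(
  () => { events.push('open'); return { id: 1 }; },
  () => { events.push('close'); },
  (resource) => resource.id,
);
```

After this runs, `result` is `1` and `events` is `['open', 'close']`. An asynchronous variant would need to `await` the `use` call inside the `try` so that cleanup waits for completion.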
Logging
Logs are part of the error-handling strategy, not a separate concern.
- Log at the layer that handles the error, not at every layer that propagates it. Duplicate logs make incident triage harder.
- Include enough structured context (IDs, inputs, request metadata) to reconstruct the failure without reproducing it.
- Distinguish severity meaningfully: `error` should be actionable; `warn` should be unusual but tolerable; `info` should describe normal operation.
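The first two points can be sketched with a propagating layer that stays silent and a handling layer that logs once with structured context. `logEvent`, `loadInvoice`, `handleInvoiceRequest`, and the request fields are all hypothetical names for illustration.

```javascript
// Hypothetical structured logger: one JSON object per line.
function logEvent(level, message, context) {
  const line = JSON.stringify({ level, message, ...context });
  console.log(line);
  return line;
}

// Propagating layer: adds no log entry, just lets the error travel upward.
// `readRow` is a caller-supplied function that may throw.
function loadInvoice(readRow, invoiceId) {
  const row = readRow(invoiceId);
  return { id: invoiceId, total: row.total };
}

// Handling layer: logs once, with the IDs needed to reconstruct the failure.
function handleInvoiceRequest(readRow, request) {
  try {
    return { status: 200, body: loadInvoice(readRow, request.invoiceId) };
  } catch (err) {
    logEvent('error', err.message, {
      requestId: request.id,
      invoiceId: request.invoiceId,
    });
    return { status: 500, body: null };
  }
}
```

A storage failure produces exactly one log line, and that line carries the request and invoice IDs, so triage does not require correlating several partial entries.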
Pre-Commit Checklist
- Are expected failures separated from programming errors in the API?
- Is validation done at the trust boundary, not scattered throughout?
- Are assertions used for invariants, not for input validation?
- Is every catch block doing something meaningful — recover, transform, log with context, or rethrow?
- Is the recovery strategy (halt vs. degrade) explicit and appropriate for this module?
- Will the failure be visible to someone who can act on it?
- Are resources cleaned up on every path, including the failure path?