Steven's Knowledge

Error Handling

Defensive boundaries, assertions, exceptions, and failure strategy

Error handling is one of the few topics where applying one approach consistently across a codebase matters more than which approach is chosen. A team that handles errors uniformly, even imperfectly, produces software that is easier to operate than a team that mixes idioms.

Distinguish Two Kinds of Errors

The first decision in any error-handling design is to separate two categories that are often confused:

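| Category          | Source                                            | Characteristic                     |
|-------------------|---------------------------------------------------|------------------------------------|
| Expected failure  | External world: user input, network, files, peers | Will happen; must be handled       |
| Programming error | Internal: invariant violations, impossible states | Should not happen; indicates a bug |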

The two demand different mechanisms:

  • Expected failures belong in the type signature (Result, Either, checked exceptions, return codes). They are part of the API.
  • Programming errors should fail loudly via assertions, panics, or unchecked exceptions. There is no recovery; the goal is rapid diagnosis.

Conflating them produces both fragile defensive code (catching Exception to hide bugs) and silent data corruption (treating bugs as recoverable).
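The split can be sketched in JavaScript as follows. This is a minimal illustration, not a prescribed API: `parsePort`, `formatAddress`, and the `invariant` helper are hypothetical names, and the `{ ok, value }` / `{ ok, error }` shape is just one possible convention for returning an expected failure as a value.

```javascript
// Hypothetical helper (not from a library): throws when an invariant is broken.
function invariant(condition, message) {
  if (!condition) throw new Error(`invariant violated: ${message}`);
}

// Expected failure: user input may be malformed, so the outcome is
// returned as a value the caller has to inspect, not thrown.
function parsePort(text) {
  const port = Number(text);
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    return { ok: false, error: `invalid port: ${JSON.stringify(text)}` };
  }
  return { ok: true, value: port };
}

// Programming error: callers are expected to pass an already-parsed port,
// so a bad value here means a bug upstream, not bad input.
function formatAddress(host, port) {
  invariant(Number.isInteger(port), 'port must be parsed before formatting');
  return `${host}:${port}`;
}

const parsed = parsePort('8080');
if (parsed.ok) {
  console.log(formatAddress('localhost', parsed.value)); // localhost:8080
} else {
  console.log(parsed.error);
}
```

The caller of `parsePort` decides what to do with bad input; nobody is expected to catch the error from `formatAddress`, because the only fix is to correct the calling code.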

Trust Boundaries

Validate aggressively at the edges; trust the interior.

┌──────────────────────────────────────────┐
│  External (untrusted)                    │
│  – HTTP requests, file I/O, RPC, queues  │
│  ───────────────  validate here  ──────  │
│  Internal (trusted contracts)            │
│  – module-to-module, function-to-function│
│  – assert, do not re-validate            │
└──────────────────────────────────────────┘

Re-validating internal calls everywhere produces noise without improving safety. Failing to validate at the boundary produces vulnerabilities and corruption that surface deep inside the system.

Boundary checks include:

  • Type and shape (schemas, parsers).
  • Semantic constraints (ranges, regex, business rules).
  • Authorization and authentication.
  • Size and rate limits.
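A boundary check along these lines might look like the sketch below. The request shape (`name`, `age`), the `MAX_NAME_LENGTH` limit, and the function names are assumptions made for illustration; authorization and rate limiting are left out to keep it short.

```javascript
const MAX_NAME_LENGTH = 64; // illustrative size limit, not from the source

// Boundary: takes untrusted JSON text and returns either a validated value
// or a list of problems. Code past this point does not re-check the data.
function parseCreateUser(rawBody) {
  let data;
  try {
    data = JSON.parse(rawBody);
  } catch {
    return { ok: false, errors: ['body is not valid JSON'] };
  }
  if (data === null || typeof data !== 'object' || Array.isArray(data)) {
    return { ok: false, errors: ['body must be a JSON object'] };
  }

  const errors = [];
  // Type and shape, then size limit.
  if (typeof data.name !== 'string') {
    errors.push('name must be a string');
  } else if (data.name.length === 0 || data.name.length > MAX_NAME_LENGTH) {
    errors.push(`name must be 1 to ${MAX_NAME_LENGTH} characters`);
  }
  // Semantic constraint: a plausible range.
  if (!Number.isInteger(data.age) || data.age < 0 || data.age > 150) {
    errors.push('age must be an integer between 0 and 150');
  }

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, value: { name: data.name, age: data.age } };
}

// Interior: trusts the contract established at the boundary.
function greeting(user) {
  return `Hello, ${user.name} (${user.age})`;
}

const result = parseCreateUser('{"name":"Ada","age":36}');
console.log(result.ok ? greeting(result.value) : result.errors);
// Hello, Ada (36)
```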

Assertions

Use assertions to state invariants — facts that must be true if the code is correct.

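// Assumes Node.js, where assert is a built-in module rather than a global.
const assert = require('node:assert');

function popLast(stack) {
  // Invariant: internal callers never pop from an empty stack.
  assert(stack.length > 0, 'pop from empty stack');
  // ...
}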

When to assert

  • Preconditions on internal helpers.
  • Postconditions when the caller relies on a specific shape.
  • Loop and class invariants whose violation indicates a bug.
  • Unreachable branches (switch defaults, exhaustive matches).
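The cases above can be combined in one small sketch. The `invariant` helper and the `area` function with its shape kinds are hypothetical and exist only to show where each kind of assertion sits.

```javascript
// Hypothetical helper (not from a library): throws when an invariant is broken.
function invariant(condition, message) {
  if (!condition) throw new Error(`invariant violated: ${message}`);
}

// Internal helper: callers are expected to pass a shape built by our own code.
function area(shape) {
  // Precondition on an internal helper.
  invariant(shape !== null && typeof shape === 'object', 'shape must be an object');

  let result;
  switch (shape.kind) {
    case 'circle':
      result = Math.PI * shape.radius ** 2;
      break;
    case 'rect':
      result = shape.width * shape.height;
      break;
    default:
      // Unreachable branch: reached only if a new kind was added
      // without updating this switch.
      invariant(false, `unhandled shape kind: ${shape.kind}`);
  }

  // Postcondition: callers rely on a finite, non-negative number.
  invariant(Number.isFinite(result) && result >= 0, 'area must be finite and non-negative');
  return result;
}

console.log(area({ kind: 'rect', width: 2, height: 3 })); // 6
```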

When not to assert

  • For input from outside the trust boundary — that is validation, not assertion.
  • For conditions that can legitimately be false (file missing, network down).
  • With side effects in the assertion expression — assertions may be compiled out in production.
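The side-effect hazard is easiest to see with an assertion that can be switched off. The sketch below assumes a hypothetical `makeAssert` factory that skips the check entirely when disabled, which mimics a build that strips assertions; it is not how any particular runtime implements them.

```javascript
// Hypothetical assertion factory: when `enabled` is false the condition
// function is never called, mimicking assertions removed from a build.
function makeAssert(enabled) {
  return function devAssert(conditionFn, message) {
    if (enabled && !conditionFn()) throw new Error(message);
  };
}

// Wrong: the assertion expression performs the work (it removes the item).
function takeFirstBuggy(queue, devAssert) {
  let item;
  devAssert(() => (item = queue.shift()) !== undefined, 'queue empty');
  return item;
}

// Right: do the work unconditionally, then assert on the result.
function takeFirst(queue, devAssert) {
  const item = queue.shift();
  devAssert(() => item !== undefined, 'queue empty');
  return item;
}

const off = makeAssert(false);
console.log(takeFirstBuggy(['a'], off)); // undefined: the shift never ran
console.log(takeFirst(['a'], off));      // a
```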

The dividing line is intent: an assertion documents "if this fails, the program is wrong"; validation documents "if this fails, the input is wrong."

Exception Strategy

Throw at the layer that detects; catch at the layer that can act

A function that detects a failure but cannot meaningfully respond should propagate the error. The handler belongs at the level that can do something useful: retry, fall back, surface the error to the user, or give up.

The two failure modes to avoid:

  • Catch and ignore. Swallowing an exception destroys evidence and converts a loud failure into a silent one.
  • Catch too broadly. catch (Exception) at the top of every function obscures specific recovery logic and turns programming errors into operational noise.
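One way to avoid both failure modes is to handle only the error type this layer can act on and rethrow everything else. In the sketch below, `TimeoutError`, `priceWithFallback`, and the cached-price fallback are hypothetical names chosen for the example.

```javascript
// Hypothetical error type for a failure this layer knows how to handle.
class TimeoutError extends Error {}

// Tries the live source; falls back to the cached price only on a timeout.
function priceWithFallback(fetchLive, cachedPrice) {
  try {
    return fetchLive();
  } catch (err) {
    // Handle the one failure this layer can act on...
    if (err instanceof TimeoutError) return cachedPrice;
    // ...and let everything else, including bugs, keep propagating.
    throw err;
  }
}

const slowSource = () => { throw new TimeoutError('upstream timed out'); };
console.log(priceWithFallback(slowSource, 9.5)); // 9.5
```

A `TypeError` thrown inside `fetchLive` would pass straight through, so a programming error stays loud instead of being masked by the fallback.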

Preserve the cause when re-raising

When wrapping an exception in a higher-level type, attach the original as the cause:

try {
  return await chargeCard(order);
} catch (err) {
  throw new OrderProcessingError(`order ${order.id} failed`, { cause: err });
}

The original stack and message remain available for debugging, while the higher-level type carries domain context for callers.
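For the `cause` option to reach the built-in `Error`, a custom error class has to forward its options to `super`. The sketch below fills in the surrounding pieces under stated assumptions: this `OrderProcessingError` definition, the always-failing `chargeCard` stub, and the `processOrder` wrapper are illustrative, and the `cause` option requires a runtime with ES2022 error causes.

```javascript
class OrderProcessingError extends Error {
  constructor(message, options) {
    super(message, options); // forwards { cause } to the built-in Error
    this.name = 'OrderProcessingError';
  }
}

// Hypothetical stand-in for a payment call; it always fails here.
async function chargeCard(order) {
  throw new Error('card declined');
}

async function processOrder(order) {
  try {
    return await chargeCard(order);
  } catch (err) {
    throw new OrderProcessingError(`order ${order.id} failed`, { cause: err });
  }
}

processOrder({ id: 42 }).catch((err) => {
  console.log(err.message);       // order 42 failed
  console.log(err.cause.message); // card declined
});
```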

Use exceptions for exceptional cases

Do not use exceptions for ordinary control flow — end-of-iterator, missing-key, validation failure. The cost is partly performance (in some runtimes), but mostly readability: exception-as-control-flow obscures the normal path.
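The readability cost shows up even in a small lookup. Both versions below are hypothetical; the second treats a missing key as an ordinary value, so the normal path reads top to bottom.

```javascript
const users = new Map([[1, { name: 'Ada' }]]);

// Exception as control flow: an ordinary "not found" is signalled by throwing...
function getUserOrThrow(userMap, id) {
  const user = userMap.get(id);
  if (user === undefined) throw new Error(`no user ${id}`);
  return user;
}

// ...so the caller needs a catch, which also swallows unrelated bugs.
function labelViaCatch(userMap, id) {
  try {
    return getUserOrThrow(userMap, id).name;
  } catch {
    return 'unknown';
  }
}

// Ordinary control flow: absence is a value the caller checks.
function label(userMap, id) {
  const user = userMap.get(id);
  return user === undefined ? 'unknown' : user.name;
}

console.log(label(users, 1)); // Ada
console.log(label(users, 2)); // unknown
```

Passing `undefined` instead of a map makes the difference concrete: `labelViaCatch` quietly returns `'unknown'`, while `label` throws a `TypeError` that points at the bug.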

Prefer checked failures in the type system, where supported

Languages with Result, Either, or sum types let the compiler force callers to consider the failure case. Use them. They eliminate an entire class of "forgot to handle" defects that APIs built on unchecked exceptions cannot prevent.

Choosing a Recovery Strategy

When a failure does occur, the response sits on a spectrum from "tolerate" to "halt":

| Strategy                                  | When appropriate                           |
|-------------------------------------------|--------------------------------------------|
| Return a neutral value (empty list, zero) | Best-effort reads; failure is non-critical |
| Substitute the next valid value           | Streaming pipelines; one bad record        |
| Return the last known good value          | Sensors, caches                            |
| Clamp to a permissible range              | Bounded numeric inputs                     |
| Log and continue                          | Telemetry, non-essential side jobs         |
| Return an error to the caller             | Most domain operations                     |
| Throw / panic                             | Programming errors; corrupted state        |
| Shut down                                 | Safety-critical violations                 |

The right point on the spectrum is determined by the correctness vs. robustness trade-off:

  • Correctness systems — finance, medical, control — prefer to halt rather than risk wrong output.
  • Robustness systems — media, gaming, observability — prefer to degrade rather than stop.

A single product usually has both kinds of code. The strategy should be explicit per module, not implicit per developer.
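Three of the tolerant strategies listed above can be sketched as follows. The function names, the 0 to 100 volume range, and the line-per-reading input format are assumptions for illustration; code on the correctness side of the trade-off would throw or return an error instead.

```javascript
// Clamp to a permissible range: a bounded numeric input.
function clampVolume(value) {
  return Math.min(100, Math.max(0, value));
}

// Return the last known good value: a sensor that sometimes reads garbage.
function makeSensorReader(readRaw, initialValue) {
  let lastGood = initialValue;
  return function read() {
    const value = readRaw();
    if (Number.isFinite(value)) lastGood = value;
    return lastGood;
  };
}

// Skip one bad record and carry on with the next valid one,
// counting what was dropped so the loss stays visible.
function sumValidReadings(lines) {
  let total = 0;
  let skipped = 0;
  for (const line of lines) {
    const value = Number.parseFloat(line);
    if (Number.isNaN(value)) {
      skipped += 1;
      continue;
    }
    total += value;
  }
  return { total, skipped };
}

console.log(clampVolume(150));                        // 100
console.log(sumValidReadings(['1.5', 'oops', '2']));  // { total: 3.5, skipped: 1 }
```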

Fail Fast, Fail Loudly

The further a failure travels from its origin, the harder it becomes to diagnose. Two corollaries:

  • Detect close to the cause. Validate at the boundary, assert invariants where they hold, check returned values at the call site.
  • Make failures visible. A logged warning with no actionable signal is barely better than a silent failure. Use error rates, alerts, and structured logs that route to humans who can respond.

Silent recovery in the wrong place is the most expensive form of error handling, because the failure surfaces somewhere with no remaining context.
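Checking a returned value at the call site is a small instance of detecting close to the cause. In this hypothetical sketch, a missing column either leaks out as `undefined` or fails immediately with the context still at hand.

```javascript
const header = ['id', 'name', 'email'];
const row = ['7', 'Ada', 'ada@example.com'];

// Late failure: indexOf returns -1 for a missing column, row[-1] is
// undefined, and the problem surfaces wherever that undefined is used.
function cellUnchecked(columnName) {
  return row[header.indexOf(columnName)];
}

// Fail fast: check the returned index at the call site, while the column
// name and the header are still available for the error message.
function cellChecked(columnName) {
  const index = header.indexOf(columnName);
  if (index === -1) {
    throw new Error(`unknown column "${columnName}"; header is [${header.join(', ')}]`);
  }
  return row[index];
}

console.log(cellUnchecked('phone')); // undefined
console.log(cellChecked('name'));    // Ada
```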

Resource Cleanup

Errors complicate resource management. Use the language's structured cleanup primitives, not manual try/finally ladders, when available:

  • with (Python), using (C#), try-with-resources (Java).
  • RAII destructors (C++, Rust).
  • defer (Go, Swift).
  • await using (JavaScript and TypeScript, in runtimes that support explicit resource management).

The goal is that cleanup happens regardless of whether the body succeeded, returned early, or threw. Manual cleanup ladders are error-prone and routinely forgotten on the failure path.
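Where a runtime has no structured primitive available, one option is to write the `try`/`finally` once in a helper and route every use through it. The `withResource` helper and the fake connection below are hypothetical; the point is only that `close` runs on the failure path too.

```javascript
// Hypothetical helper (not a library API): acquire, run the body, and
// release in `finally` so cleanup happens on success, early return, or throw.
async function withResource(acquire, body) {
  const resource = await acquire();
  try {
    return await body(resource);
  } finally {
    await resource.close();
  }
}

// Hypothetical fake connection that records what happened to it.
function openFakeConnection(events) {
  events.push('open');
  return { close: async () => { events.push('close'); } };
}

async function demo() {
  const events = [];
  try {
    await withResource(() => openFakeConnection(events), async () => {
      events.push('work');
      throw new Error('query failed');
    });
  } catch (err) {
    events.push(`caught: ${err.message}`);
  }
  return events;
}

demo().then((events) => console.log(events.join(' -> ')));
// open -> work -> close -> caught: query failed
```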

Logging

Logs are part of the error-handling strategy, not a separate concern.

  • Log at the layer that handles the error, not at every layer that propagates it. Duplicate logs make incident triage harder.
  • Include enough structured context (IDs, inputs, request metadata) to reconstruct the failure without reproducing it.
  • Distinguish severity meaningfully: error should be actionable; warn should be unusual but tolerable; info should describe normal operation.
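The first two points can be sketched together: lower layers propagate without logging, and the layer that handles the error logs once with structured context. The invoice functions, the field names in the log entry, and the injectable `log` parameter are assumptions for the example.

```javascript
// Lower layers propagate failures without logging them.
function readInvoice(store, invoiceId) {
  return store.read(invoiceId); // may throw, for example on a storage outage
}

function invoiceTotal(store, invoiceId) {
  const invoice = readInvoice(store, invoiceId);
  return invoice.lines.reduce((sum, amount) => sum + amount, 0);
}

// Handling layer: the single place that logs, with structured context.
function handleTotalRequest(request, store, log = console.error) {
  try {
    return { status: 200, total: invoiceTotal(store, request.invoiceId) };
  } catch (err) {
    log(JSON.stringify({
      level: 'error',
      event: 'invoice_total_failed',
      requestId: request.id,
      invoiceId: request.invoiceId,
      message: err.message,
    }));
    return { status: 500 };
  }
}

// Hypothetical store whose backend is down.
const failingStore = { read() { throw new Error('storage unavailable'); } };
console.log(handleTotalRequest({ id: 'req-1', invoiceId: 'inv-9' }, failingStore));
```

One failed request produces one log entry, and that entry carries the request and invoice IDs needed to trace the failure without reproducing it.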

Pre-Commit Checklist

  • Are expected failures separated from programming errors in the API?
  • Is validation done at the trust boundary, not scattered throughout?
  • Are assertions used for invariants, not for input validation?
  • Is every catch block doing something meaningful — recover, transform, log with context, or rethrow?
  • Is the recovery strategy (halt vs. degrade) explicit and appropriate for this module?
  • Will the failure be visible to someone who can act on it?
  • Are resources cleaned up on every path, including the failure path?
