Resilience
Circuit breakers, retries, rate limiting, graceful degradation — patterns that keep your service alive when things fail
In production, things fail. Databases go down, downstream services time out, networks partition, disks fill up. A resilient system does not prevent failures — it handles them gracefully. The user gets a degraded experience instead of an error page. The on-call engineer gets an alert instead of a wake-up call.
This page covers the patterns that make backend services survive the failures that production will throw at them.
Circuit Breaker
A circuit breaker prevents your service from repeatedly calling a failing downstream. Like an electrical circuit breaker, it "trips" after too many failures and stops sending requests — giving the downstream time to recover.
States
┌──────────┐
│  CLOSED  │  (normal: requests pass through)
└────┬─────┘
     │ failure threshold exceeded
┌────▼─────┐
│   OPEN   │  (tripped: requests fail immediately)
└────┬─────┘
     │ timeout expires
┌────▼─────┐
│HALF-OPEN │  (testing: let one request through)
└────┬─────┘
     │ success → CLOSED
     │ failure → OPEN
Implementation
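// Thrown when the breaker rejects a call without invoking the downstream
class CircuitOpenError extends Error {}
class CircuitBreaker {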
private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
private failureCount = 0;
private lastFailureTime = 0;
constructor(
private readonly threshold: number = 5,
private readonly resetTimeout: number = 30_000,
) {}
async call<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'OPEN') {
if (Date.now() - this.lastFailureTime > this.resetTimeout) {
this.state = 'HALF_OPEN';
} else {
throw new CircuitOpenError('Circuit is open');
}
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
private onSuccess() {
this.failureCount = 0;
this.state = 'CLOSED';
}
private onFailure() {
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.failureCount >= this.threshold) {
this.state = 'OPEN';
}
}
}
// Usage
const paymentCircuit = new CircuitBreaker(5, 30_000);
async function chargeUser(userId: string, amount: number) {
try {
return await paymentCircuit.call(() =>
paymentService.charge(userId, amount)
);
} catch (err) {
if (err instanceof CircuitOpenError) {
// Fallback: queue for later processing
await paymentQueue.add({ userId, amount });
return { status: 'queued' };
}
throw err;
}
}
Key decisions: failure threshold (too low = false trips, too high = too many failed requests before tripping), reset timeout (too short = downstream still recovering, too long = unnecessary downtime).
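The state diagram says HALF-OPEN lets one request through, but the class above lets every concurrent caller through once the reset timeout has passed. The following is a minimal sketch of one way to enforce a single probe, not a definitive implementation. `SingleProbeCircuitBreaker`, `getState`, and the injectable `now` clock are hypothetical names introduced for illustration, and `CircuitOpenError` is declared here so the sketch is self-contained.

```typescript
// Declared here so the sketch is self-contained (hypothetical minimal error type)
class CircuitOpenError extends Error {
  constructor(message = 'Circuit is open') {
    super(message);
    this.name = 'CircuitOpenError';
  }
}

type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class SingleProbeCircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private failureCount = 0;
  private openedAt = 0;
  private probeInFlight = false;
  private readonly threshold: number;
  private readonly resetTimeout: number;
  private readonly now: () => number;

  // `now` is an assumption added for testability; it defaults to the wall clock
  constructor(threshold = 5, resetTimeout = 30_000, now: () => number = Date.now) {
    this.threshold = threshold;
    this.resetTimeout = resetTimeout;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new CircuitOpenError();
      }
      this.state = 'HALF_OPEN';
    }
    // In HALF_OPEN only one caller may probe; everyone else fails fast
    const isProbe = this.state === 'HALF_OPEN';
    if (isProbe) {
      if (this.probeInFlight) throw new CircuitOpenError('Probe already in flight');
      this.probeInFlight = true;
    }
    try {
      const result = await fn();
      this.failureCount = 0;
      this.state = 'CLOSED';
      return result;
    } catch (error) {
      this.failureCount++;
      // A failed probe reopens immediately and restarts the reset timer
      if (isProbe || this.failureCount >= this.threshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      throw error;
    } finally {
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

Passing a fake clock such as `() => fakeTime` makes it possible to test threshold and reset-timeout choices without waiting for real time to pass.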
Retry with Exponential Backoff + Jitter
Retries handle transient failures — network blips, temporary overloads. Without backoff, retries create a thundering herd that makes the problem worse.
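// Promise-based delay used between attempts
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));
async function retryWithBackoff<T>(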
fn: () => Promise<T>,
options: {
maxRetries?: number;
baseDelay?: number;
maxDelay?: number;
retryableErrors?: (error: Error) => boolean;
} = {}
): Promise<T> {
const {
maxRetries = 3,
baseDelay = 1000,
maxDelay = 30_000,
retryableErrors = isTransient,
} = options;
let lastError: Error;
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error) {
lastError = error as Error;
if (attempt === maxRetries || !retryableErrors(lastError)) {
throw lastError;
}
// Exponential backoff with full jitter
const exponentialDelay = baseDelay * Math.pow(2, attempt);
const delay = Math.random() * Math.min(exponentialDelay, maxDelay);
await sleep(delay);
}
}
throw lastError!;
}
function isTransient(error: Error): boolean {
if ('statusCode' in error) {
const code = (error as any).statusCode;
return code === 429 || code === 502 || code === 503 || code === 504;
}
return error.message.includes('ECONNRESET')
|| error.message.includes('ETIMEDOUT');
}
// Usage
const user = await retryWithBackoff(
() => userService.getById(userId),
{ maxRetries: 3, baseDelay: 500 }
);
Why Jitter Matters
Without jitter, all clients retry at the same time:
No jitter: [all retry at 1s] [all retry at 2s] [all retry at 4s]
Full jitter: [retries spread across 0-1s] [0-2s] [0-4s]
Full jitter (random between 0 and the calculated delay) spreads the load. Decorrelated jitter can work better for some workloads, but full jitter is good enough for most cases.
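To make the difference concrete, here is a small sketch of the three delay strategies as pure functions. The function names and the injectable `rng` parameter are assumptions for illustration. The decorrelated variant follows the commonly cited form `min(cap, random between base and 3 × previous delay)`; treat it as one possible formulation, not the only one.

```typescript
type Rng = () => number; // returns a value in [0, 1), like Math.random

// No jitter: every client computes the same delay for a given attempt
function noJitterDelay(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full jitter: uniform between 0 and the capped exponential delay
function fullJitterDelay(
  attempt: number,
  baseMs: number,
  capMs: number,
  rng: Rng = Math.random,
): number {
  return rng() * Math.min(capMs, baseMs * 2 ** attempt);
}

// Decorrelated jitter: uniform between base and 3x the previous delay, capped.
// Callers start with prevMs = baseMs and feed each result back in.
function decorrelatedJitterDelay(
  prevMs: number,
  baseMs: number,
  capMs: number,
  rng: Rng = Math.random,
): number {
  return Math.min(capMs, baseMs + rng() * (prevMs * 3 - baseMs));
}

// Example: delays for attempts 0..2 with base 1000ms and cap 30s
const fixedDelays = [0, 1, 2].map((a) => noJitterDelay(a, 1000, 30_000)); // [1000, 2000, 4000]
const jitteredDelays = [0, 1, 2].map((a) => fullJitterDelay(a, 1000, 30_000)); // varies per client
```

With no jitter, a thousand clients that failed together all come back at exactly 1000ms, 2000ms, and 4000ms. With full jitter, each retry wave is spread across the whole interval.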
Timeout Policies
Every outbound call needs a timeout. No exceptions. An unbounded call can hold a thread/connection forever, eventually exhausting your server's resources.
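// Error type used to distinguish timeouts from other fetch failures
class TimeoutError extends Error {}
// Per-request timeout with AbortController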
async function fetchWithTimeout(url: string, timeoutMs: number) {
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await fetch(url, { signal: controller.signal });
return response;
} catch (error) {
if (error instanceof Error && error.name === 'AbortError') {
throw new TimeoutError(`Request to ${url} timed out after ${timeoutMs}ms`);
}
throw error;
} finally {
clearTimeout(timeout);
}
}
// Cascading timeouts: each layer gets a budget
// HTTP handler: 5000ms total
//   → DB query: 2000ms
//   → API call: 3000ms
//       → Retry 1: 1500ms
//       → Retry 2: 1500ms
Cascading timeouts: if your handler has a 5-second budget, don't give a downstream call a 5-second timeout, because that leaves no time for fallback logic. Budget your time across all operations.
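One way to enforce such a budget in code is to track a single deadline per request and derive each call's timeout from the time that is left. The sketch below is an illustration under assumptions: the `Deadline` class, `timeoutFor`, and the `reserveMs` parameter are hypothetical names, not part of any library.

```typescript
class Deadline {
  private readonly expiresAt: number;
  private readonly now: () => number;

  // `now` is injectable so the behavior can be tested with a fake clock
  constructor(budgetMs: number, now: () => number = Date.now) {
    this.now = now;
    this.expiresAt = now() + budgetMs;
  }

  // Milliseconds left in the overall budget (never negative)
  remaining(): number {
    return Math.max(0, this.expiresAt - this.now());
  }

  // Timeout for the next call: at most capMs, and always leaving
  // reserveMs untouched for fallback logic and response writing
  timeoutFor(capMs: number, reserveMs = 0): number {
    const available = this.remaining() - reserveMs;
    if (available <= 0) {
      throw new Error('Deadline exceeded');
    }
    return Math.min(capMs, available);
  }
}

// Usage inside a handler with a 5000ms budget, keeping 500ms in reserve
const deadline = new Deadline(5000);
const dbTimeoutMs = deadline.timeoutFor(2000, 500); // 2000 or less
```

If the database query consumes its full 2000ms, a later `deadline.timeoutFor(3000, 500)` returns roughly 2500 instead of 3000, so slow early steps automatically shrink the budget of later ones.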
Bulkhead Pattern
Isolate failures so one slow dependency doesn't consume all your resources and bring down unrelated endpoints.
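// Thrown when both the concurrency slots and the wait queue are full
class BulkheadFullError extends Error {}
// Semaphore-based bulkhead: limit concurrent calls per dependency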
class Bulkhead {
private active = 0;
private queue: Array<() => void> = [];
constructor(
private readonly maxConcurrent: number,
private readonly maxQueue: number = 100,
) {}
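async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.active >= this.maxConcurrent) {
if (this.queue.length >= this.maxQueue) {
throw new BulkheadFullError('Bulkhead queue is full');
}
// Wait until a finishing call hands its slot over (see finally below)
await new Promise<void>((resolve) => this.queue.push(resolve));
} else {
this.active++;
}
try {
return await fn();
} finally {
const next = this.queue.shift();
if (next) {
// Transfer the slot directly; decrementing first would let a new caller
// take it before the waiter resumes and push active above maxConcurrent
next();
} else {
this.active--;
}
}
}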
}
// Separate bulkheads per dependency
const paymentBulkhead = new Bulkhead(10); // max 10 concurrent payment calls
const inventoryBulkhead = new Bulkhead(20); // max 20 concurrent inventory calls
// If payments are slow, inventory calls are unaffected
Without a bulkhead, a slow payment service can consume your entire connection pool, causing inventory checks and user lookups to fail too.
Rate Limiting
Rate limiting protects your service from abuse and overload. Three common algorithms:
Token Bucket
Allows bursts up to a limit, then enforces a steady rate:
class TokenBucket {
private tokens: number;
private lastRefill: number;
constructor(
private readonly capacity: number,
private readonly refillRate: number, // tokens per second
) {
this.tokens = capacity;
this.lastRefill = Date.now();
}
tryConsume(tokens: number = 1): boolean {
this.refill();
if (this.tokens >= tokens) {
this.tokens -= tokens;
return true;
}
return false;
}
private refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
}
// 100 requests capacity, refill 10 per second
const limiter = new TokenBucket(100, 10);
Sliding Window Log
Tracks exact timestamps — precise but memory-intensive for high-volume endpoints:
class SlidingWindowLog {
private timestamps: number[] = [];
constructor(
private readonly windowMs: number,
private readonly maxRequests: number,
) {}
tryConsume(): boolean {
const now = Date.now();
const windowStart = now - this.windowMs;
// Remove expired entries
this.timestamps = this.timestamps.filter(t => t > windowStart);
if (this.timestamps.length < this.maxRequests) {
this.timestamps.push(now);
return true;
}
return false;
}
}
Fixed Window Counter
Simple, memory-efficient, but allows bursts at window boundaries:
class FixedWindowCounter {
private count = 0;
private windowStart = Date.now();
constructor(
private readonly windowMs: number,
private readonly maxRequests: number,
) {}
tryConsume(): boolean {
const now = Date.now();
if (now - this.windowStart > this.windowMs) {
this.count = 0;
this.windowStart = now;
}
if (this.count < this.maxRequests) {
this.count++;
return true;
}
return false;
}
}
Choosing an Algorithm
| Algorithm | Burst handling | Memory | Precision | Best for |
|---|---|---|---|---|
| Token bucket | Allows controlled bursts | Low | Good | API rate limiting |
| Sliding window log | No bursts | High | Exact | Low-volume, strict limits |
| Fixed window counter | Boundary bursts possible | Very low | Approximate | High-volume, approximate limits |
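The "boundary bursts" entry in the table is easiest to see with a concrete timeline. The sketch below reimplements the fixed window counter and the sliding window log as closures with an injectable clock (an assumption added for the demo; the classes above read `Date.now()` directly) and sends 10 requests just before and 10 just after a window boundary. All function names here are hypothetical.

```typescript
type Clock = () => number;

// Same logic as the fixed window counter, with an injectable clock
function makeFixedWindow(windowMs: number, maxRequests: number, now: Clock): () => boolean {
  let count = 0;
  let windowStart = now();
  return () => {
    const t = now();
    if (t - windowStart > windowMs) {
      count = 0;
      windowStart = t;
    }
    if (count < maxRequests) {
      count++;
      return true;
    }
    return false;
  };
}

// Same logic as the sliding window log, with an injectable clock
function makeSlidingLog(windowMs: number, maxRequests: number, now: Clock): () => boolean {
  let timestamps: number[] = [];
  return () => {
    const t = now();
    timestamps = timestamps.filter((ts) => ts > t - windowMs);
    if (timestamps.length < maxRequests) {
      timestamps.push(t);
      return true;
    }
    return false;
  };
}

function countAllowed(tryConsume: () => boolean, attempts: number): number {
  let allowed = 0;
  for (let i = 0; i < attempts; i++) {
    if (tryConsume()) allowed++;
  }
  return allowed;
}

// Limit: 10 requests per 1000ms. Burst of 10 at t=999ms, another 10 at t=1001ms.
function boundaryBurstDemo(): { fixed: number; sliding: number } {
  let t = 0;
  const clock: Clock = () => t;
  const fixed = makeFixedWindow(1000, 10, clock);
  const sliding = makeSlidingLog(1000, 10, clock);

  t = 999;
  let fixedTotal = countAllowed(fixed, 10);
  let slidingTotal = countAllowed(sliding, 10);

  t = 1001; // the fixed window resets here, the sliding log does not
  fixedTotal += countAllowed(fixed, 10);
  slidingTotal += countAllowed(sliding, 10);

  return { fixed: fixedTotal, sliding: slidingTotal };
}

console.log(boundaryBurstDemo()); // { fixed: 20, sliding: 10 }
```

The fixed window admits 20 requests within 2ms, twice the nominal limit, because the counter resets at the boundary. The sliding log still sees the first 10 timestamps inside its window and rejects the second burst.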
Rate Limit Response
Always tell the client what is happening:
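// Assumes `limiter` is a per-key limiter exposing tryConsume(key), remaining(key), and resetTime(key).
// The single-instance classes above would need a wrapper that keeps one instance per key.
function rateLimitMiddleware(req, res, next) {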
const key = req.ip; // or req.user.id for authenticated endpoints
const allowed = limiter.tryConsume(key);
res.setHeader('X-RateLimit-Limit', '100');
res.setHeader('X-RateLimit-Remaining', limiter.remaining(key).toString());
res.setHeader('X-RateLimit-Reset', limiter.resetTime(key).toString());
if (!allowed) {
res.setHeader('Retry-After', '60');
return res.status(429).json({
error: { code: 'RATE_LIMITED', message: 'Too many requests' },
});
}
next();
}
Graceful Degradation
When a dependency fails, serve a reduced experience instead of an error:
async function getProductPage(productId: string) {
const product = await productService.getById(productId); // Required — fail if this fails
// Non-critical: degrade gracefully
const [reviews, recommendations, inventory] = await Promise.allSettled([
reviewService.getForProduct(productId),
recommendationService.getForProduct(productId),
inventoryService.getStock(productId),
]);
return {
product,
reviews: reviews.status === 'fulfilled' ? reviews.value : [],
recommendations: recommendations.status === 'fulfilled' ? recommendations.value : [],
inStock: inventory.status === 'fulfilled' ? inventory.value > 0 : null, // null = unknown
};
}
Promise.allSettled is your friend. Unlike Promise.all, which rejects as soon as any input rejects, it waits for every promise to settle. Each result has a status of 'fulfilled' or 'rejected', letting you handle each dependency independently.
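The repeated `status === 'fulfilled' ? value : fallback` checks can be folded into a small helper that also reports the rejection, so degraded responses do not fail silently. This is a sketch under assumptions: `valueOr`, `loadOptionalData`, and the simulated services are hypothetical and stand in for real dependencies.

```typescript
// Hypothetical helper: unwrap a settled result or fall back, reporting the reason
function valueOr<T>(
  result: PromiseSettledResult<T>,
  fallback: T,
  onRejected: (reason: unknown) => void = () => {},
): T {
  if (result.status === 'fulfilled') return result.value;
  onRejected(result.reason);
  return fallback;
}

interface OptionalData {
  reviews: string[];
  stock: number | null;
  failures: string[];
}

async function loadOptionalData(): Promise<OptionalData> {
  const failures: string[] = [];
  // Collects failures; a real service would log or emit a metric here
  const record = (name: string) => (reason: unknown) => {
    failures.push(`${name}: ${String(reason)}`);
  };

  const [reviews, stock] = await Promise.allSettled([
    Promise.reject<string[]>(new Error('reviews service down')), // simulated outage
    Promise.resolve<number | null>(3), // simulated healthy dependency
  ]);

  return {
    reviews: valueOr(reviews, [], record('reviews')),
    stock: valueOr(stock, null, record('stock')),
    failures,
  };
}

loadOptionalData().then((data) => console.log(data));
```

The page still renders with an empty review list, and the recorded failure gives operators a signal that the response was degraded.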
Health Checks
Two types of health check serve different purposes:
Liveness
"Is the process alive?" If this fails, the orchestrator (Kubernetes) should restart the container.
app.get('/healthz', (req, res) => {
// Only check that the process can respond
res.status(200).json({ status: 'ok' });
});
Keep liveness checks trivial. Do not check database connectivity here. If the database is down, restarting your container will not fix it, and you will create a restart loop.
Readiness
"Can this instance serve traffic?" If this fails, the load balancer should stop sending requests to this instance.
app.get('/readyz', async (req, res) => {
const checks = {
database: false,
cache: false,
};
try {
await db.query('SELECT 1');
checks.database = true;
} catch {}
try {
await redis.ping();
checks.cache = true;
} catch {}
const ready = checks.database; // Cache is optional, DB is required
res.status(ready ? 200 : 503).json({ status: ready ? 'ready' : 'not ready', checks });
});
Putting It Together
A resilient request handler combines multiple patterns:
const paymentCircuit = new CircuitBreaker(5, 30_000);
const paymentBulkhead = new Bulkhead(10);
async function processPayment(orderId: string, amount: number) {
// Bulkhead: limit concurrency
return paymentBulkhead.execute(async () => {
// Circuit breaker: stop calling if downstream is down
return paymentCircuit.call(async () => {
// Retry: handle transient failures
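// Retrying a charge assumes the endpoint is idempotent; request body omitted for brevity
return retryWithBackoff(
async () => {
const response = await fetchWithTimeout(
`${PAYMENT_URL}/charge`,
3000, // Timeout: 3 seconds per attempt
);
// fetch resolves on HTTP error statuses, so surface them for isTransient to classify
if (response.status === 429 || response.status >= 500) {
throw Object.assign(new Error(`HTTP ${response.status}`), { statusCode: response.status });
}
return response;
},
{
maxRetries: 2,
baseDelay: 500,
// TimeoutError carries no statusCode, so mark it retryable explicitly
retryableErrors: (err) => err instanceof TimeoutError || isTransient(err),
}
);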
});
});
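}
The order matters: bulkhead (limit how many attempts are in flight), then circuit breaker (fail fast if downstream is gone), then retry (handle transient errors), then timeout (bound each individual call). Because the retry sits inside the breaker, the breaker records one failure per exhausted retry sequence, not one per attempt.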
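When layering retries and timeouts like this, it helps to compute the worst-case latency of the whole stack and compare it with the handler's budget. The sketch below is an illustrative calculation, assuming full-jitter backoff bounded by `min(baseDelay * 2^attempt, maxDelay)` as in `retryWithBackoff`. The `RetryPolicy` interface and `worstCaseLatencyMs` are hypothetical names.

```typescript
interface RetryPolicy {
  maxRetries: number; // retries after the first attempt
  baseDelay: number;  // ms
  maxDelay: number;   // ms, cap on any single backoff delay
  timeoutMs: number;  // per-attempt timeout
}

// Upper bound: every attempt runs into its timeout, and every
// jittered delay lands at the top of its range.
function worstCaseLatencyMs(policy: RetryPolicy): number {
  const attempts = policy.maxRetries + 1;
  let backoff = 0;
  // A delay is slept after each failed attempt except the last one
  for (let attempt = 0; attempt < policy.maxRetries; attempt++) {
    backoff += Math.min(policy.baseDelay * 2 ** attempt, policy.maxDelay);
  }
  return attempts * policy.timeoutMs + backoff;
}

// The settings used in processPayment: 2 retries, 500ms base delay, 3s timeout
const worstCase = worstCaseLatencyMs({
  maxRetries: 2,
  baseDelay: 500,
  maxDelay: 30_000,
  timeoutMs: 3000,
});
console.log(worstCase); // 10500
```

Three attempts at 3000ms each plus up to 500ms and 1000ms of backoff add up to 10.5 seconds in the worst case, which is worth checking against whatever overall budget the calling handler has.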
Checklist
Before shipping to production:
- Every outbound HTTP call has a timeout.
- Retries use exponential backoff with jitter.
- Critical dependencies have a circuit breaker.
- Non-critical dependencies degrade gracefully (no error page because the recommendation engine is down).
- Rate limiting is in place for public endpoints.
- Health checks distinguish liveness from readiness.
- Failed requests return useful error messages and appropriate status codes.
- Resilience behavior is observable: circuit state, retry counts, and rate limit hits are logged or exported as metrics.