Steven's Knowledge

Performance

When to optimize, how to measure, and which trade-offs are worth making

Performance work is easy to do badly. Code that looks "obviously fast" is often slower than the straightforward version, and code optimized for the wrong bottleneck ends up less readable without being any faster. The discipline is to make performance work boring: measure, change one thing, measure again, document.

Most code in most systems does not need optimization. The portion that does is usually small, and the surrounding clarity matters more than the local cleverness.

The Misquoted Quote

The line everyone knows is Knuth's:

Premature optimization is the root of all evil.

The full quotation is more useful:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

The principle is not "never optimize." It is:

  1. Start with code that is clear and correct.
  2. Identify the small fraction of code that genuinely matters.
  3. Optimize that code, deliberately, with measurements.
  4. Leave the rest alone.

The failure mode the quotation warns against is optimizing speculatively — twisting the shape of code for performance gains that may not exist, in places that may not matter.

Decide Whether It Is a Problem

Before optimizing anything, answer:

  • What is the target? "Faster" is not a goal. "P99 latency under 200ms" is. "Process the batch in under five minutes" is. Without a target, optimization has no stopping condition.
  • What is the current measurement? A baseline number, in conditions resembling production. Without a baseline, "improvement" is unverifiable.
  • What is the cost of being slow? A user-facing latency, a wasted budget, a missed deadline, an SLA breach. If the cost is small, the budget for optimization is small.

If the target is met, the work is done. Continuing to optimize past the target trades clarity for nothing.

Measure, Don't Guess

Intuition about performance is unreliable. Measurement is not.

Profile before changing

A profiler shows where time is actually spent. The result is almost always different from the prediction. Common surprises:

  • The "expensive" function turns out to be cheap; the cost is in a function nobody suspected.
  • A small allocation in a hot loop dominates the runtime.
  • A "fast" data structure is slow because of cache misses or pointer chasing.
  • Most of the latency is I/O, not CPU — and the CPU optimization is irrelevant.

Profile in conditions as close to production as possible. Microbenchmarks of single functions are a useful supplement but a poor substitute.
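Where a full profiler is not available, coarse instrumentation can at least confirm which phase of a job dominates. The sketch below is a minimal illustration, not a replacement for a real profiler; `createPhaseTimer` and the "parse" and "summarize" phases are hypothetical names chosen for the example, and it assumes a runtime with a global `performance` object.

```javascript
// Minimal phase timer: accumulates wall-clock time per label.
// The clock is injectable so the timer can be checked with a fake clock.
function createPhaseTimer(now = () => performance.now()) {
  const totals = new Map();
  return {
    // Run fn, add its elapsed time to the label's total, and return fn's result.
    time(label, fn) {
      const start = now();
      try {
        return fn();
      } finally {
        totals.set(label, (totals.get(label) ?? 0) + (now() - start));
      }
    },
    // [label, totalMs] pairs, most expensive first.
    report() {
      return [...totals.entries()].sort((a, b) => b[1] - a[1]);
    },
  };
}

// Hypothetical two-phase job.
const timer = createPhaseTimer();
const raw = Array.from({ length: 50_000 }, (_, i) => String(i));
const parsed = timer.time("parse", () => raw.map(Number));
const total = timer.time("summarize", () => parsed.reduce((a, b) => a + b, 0));

console.log(total, timer.report());
```

The report ranks phases by accumulated time, which is the question a profiler answers in finer detail: where the time goes, as opposed to where it was expected to go.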

Measure the right thing

Pick the metric that matches the goal:

  • Latency — how long one operation takes. Report distributions (P50/P95/P99), not averages.
  • Throughput — how many operations per second.
  • Resource usage — CPU, memory, I/O bandwidth, network bytes.
  • End-to-end vs isolated — a microbenchmark of the function may improve while the user-visible time gets worse.

Mismatched metrics produce optimizations that look good in benchmarks and bad in production.
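A small illustration of why distributions matter more than averages. The latencies below are made up for the example, and `percentile` uses the nearest-rank definition, which is one of several reasonable conventions.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing so integer inputs give an exact rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  return samples.reduce((a, b) => a + b, 0) / samples.length;
}

// Made-up latencies in milliseconds: 98 fast requests and 2 slow ones.
const latenciesMs = [...Array(98).fill(20), 900, 1100];

console.log("mean", mean(latenciesMs));
console.log("P50", percentile(latenciesMs, 50));
console.log("P95", percentile(latenciesMs, 95));
console.log("P99", percentile(latenciesMs, 99));
```

On this data the mean is 39.6 ms and P50 and P95 are both 20 ms, yet P99 is 900 ms. A target such as "P99 latency under 200ms" fails here even though the average looks healthy.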

Measure after changing

Every change is a hypothesis. Verify it. A "speedup" that has not been measured often turns out to be:

  • A regression (the change made it slower).
  • A no-op (the bottleneck was elsewhere).
  • A trade-off (faster on one metric, worse on another).
  • A microbenchmark artifact that does not reproduce in production.
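One way to check a change is to time both versions the same way and compare medians. The harness below is a rough sketch under stated assumptions: `medianRunMs`, `before`, and `after` are hypothetical names, the workload is invented, and the numbers it prints are microbenchmark numbers, so they carry the same caveat as the last bullet above.

```javascript
// Median of several timed runs, after warm-up runs that are discarded.
function medianRunMs(fn, { warmup = 3, runs = 9 } = {}) {
  for (let i = 0; i < warmup; i++) fn();
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}

// Hypothetical "before" and "after" versions of the same computation.
const input = Array.from({ length: 200_000 }, (_, i) => i % 1000);
const before = () =>
  input.filter((x) => x % 2 === 0).map((x) => x * 2).reduce((a, b) => a + b, 0);
const after = () => {
  let sum = 0;
  for (const x of input) if (x % 2 === 0) sum += x * 2;
  return sum;
};

// Check equal results first: a faster wrong answer is not a speedup.
if (before() !== after()) throw new Error("results differ");

console.log({ beforeMs: medianRunMs(before), afterMs: medianRunMs(after) });
```

The median is used because a single run is easily distorted by garbage collection or JIT warm-up. Whether the difference survives in production still has to be measured there.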

Where Performance Lives

Performance is decided at four levels, from most to least impactful:

1. Architecture

Choices like "cache the result," "shard the data," "compute incrementally," "do it asynchronously," "do it on the client" produce order-of-magnitude differences. Choices that have to be undone later are expensive — architecture-level decisions reward foresight.

2. Algorithms and data structures

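A linear search through a million records can take up to a million comparisons; a hash lookup takes about one. That is a gap of roughly six orders of magnitude in work done. The cost of choosing the right structure is small (often just a different import); the cost of choosing the wrong one is enormous and hard to retrofit.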

Big-O analysis is a coarse but reliable filter:

  • O(1) — constant; ideal where applicable. Hash lookups are O(1) on average.
  • O(log n) — almost as good; binary search and balanced-tree operations.
  • O(n) — usually fine for small to moderate n.
  • O(n log n) — good for sorting; rarely a bottleneck.
  • O(n²) — danger zone; nested loops over moderate input.
  • O(2ⁿ), O(n!) — only acceptable for tiny n.

A good Big-O does not guarantee fast code, because constant factors matter, but a bad Big-O is rarely fast once the input grows.
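The linear-search versus hash-lookup difference can be sketched as follows. The record shape and the names `findByScan` and `findByIndex` are assumptions made for the example.

```javascript
// One million hypothetical records keyed by id.
const records = Array.from({ length: 1_000_000 }, (_, id) => ({ id, name: `user-${id}` }));

// O(n): scans until it finds a match; the worst case touches every record.
function findByScan(id) {
  return records.find((r) => r.id === id);
}

// Build the index once in O(n); each later lookup is O(1) on average.
const byId = new Map();
for (const r of records) byId.set(r.id, r);

function findByIndex(id) {
  return byId.get(id);
}

// Both approaches return the same record; only the work per lookup differs.
console.log(findByScan(999_999) === findByIndex(999_999));
```

The index costs memory and a one-time build, so it pays off when lookups are repeated. For a single lookup, the scan is the cheaper option.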

3. Implementation

Within the chosen algorithm, the implementation can often yield another order of magnitude. Common wins:

  • Avoid repeated work. Hoist constants out of loops, memoize results, cache expensive lookups.
  • Reduce allocations. Object creation has a cost; reusing buffers and pooling objects are cheap optimizations once measurement shows that allocation is the bottleneck.
  • Batch I/O. Network round-trips and disk seeks dominate I/O time; batching cuts them dramatically. The classic example is N+1 queries: a query that runs once per row in a result set, when one query with a join would do.
  • Stream when you can. Holding everything in memory and then iterating uses more memory and scales worse than processing as you go, and it is often slower.
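As one instance of avoiding repeated work, here is a minimal memoization sketch. `memoize` handles only pure single-argument functions, and `shippingZone` is a hypothetical stand-in for an expensive lookup; the counter exists only to make the saved work visible.

```javascript
// Wrap a pure single-argument function so each distinct argument is computed once.
function memoize(fn) {
  const cache = new Map();
  return (arg) => {
    if (!cache.has(arg)) cache.set(arg, fn(arg));
    return cache.get(arg);
  };
}

// Hypothetical expensive lookup; the counter stands in for real cost.
let calls = 0;
function shippingZone(postcode) {
  calls += 1;
  return postcode.startsWith("9") ? "west" : "east";
}

const zoneFor = memoize(shippingZone);
const postcodes = ["94107", "10001", "94107", "94107", "10001"];
const zones = postcodes.map(zoneFor);

console.log(zones, calls);
```

Five lookups trigger two underlying calls, one per distinct postcode. The cache here is unbounded, so this trades memory for speed; a long-running process would likely need an eviction policy.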

4. Hardware sympathy

The lowest level of optimization, and the easiest to misuse. Modern CPUs reward:

  • Sequential memory access (cache-friendly).
  • Predictable branches (branch prediction).
  • Compact data layouts (more values per cache line).

Most code does not need to think at this level. When it does — tight inner loops in performance-critical infrastructure — the gains can be substantial, and the readability cost is real. Confine such work to the small modules where it pays off.
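A compact data layout can be sketched even in a high-level language. The example below contrasts an array of objects with typed arrays holding one field each. Whether the packed version is actually faster depends on the engine and the workload, so treat it as a shape to measure, not a guaranteed win; the point data is invented.

```javascript
const n = 100_000;

// Array of objects: each point is typically a separate heap object.
const points = Array.from({ length: n }, (_, i) => ({ x: i, y: i * 2 }));

// Struct of arrays: each field lives in one contiguous, densely packed buffer.
const xs = new Float64Array(n);
const ys = new Float64Array(n);
for (let i = 0; i < n; i++) {
  xs[i] = i;
  ys[i] = i * 2;
}

// Sum one field by walking the objects.
function sumXObjects(ps) {
  let s = 0;
  for (const p of ps) s += p.x;
  return s;
}

// Sum the same field by walking one packed array sequentially.
function sumXPacked(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++) s += arr[i];
  return s;
}

console.log(sumXObjects(points) === sumXPacked(xs));
```

The packed loop reads memory sequentially and touches only the field it needs. The cost is readability: a point is no longer one value that can be passed around, which is why such layouts belong in small, isolated modules.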

Common Pitfalls

N+1 queries

A loop over results, where each iteration triggers another query.

```javascript
const orders = await db.orders.all();
for (const o of orders) {
  o.customer = await db.customers.find(o.customerId);   // one query per order
}
```

Replace with a single join, a batched lookup, or eager loading.
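The batched-lookup variant might look like the sketch below. The data layer is an in-memory stand-in that counts queries, and `findMany` is a hypothetical batched method, not the API of any particular library; real ORMs expose this under different names.

```javascript
// In-memory stand-in for a data layer; it counts queries so the difference is visible.
function createFakeDb() {
  const customers = new Map([
    [1, { id: 1, name: "Ada" }],
    [2, { id: 2, name: "Lin" }],
  ]);
  const orders = [
    { id: 10, customerId: 1 },
    { id: 11, customerId: 2 },
    { id: 12, customerId: 1 },
  ];
  const db = {
    queries: 0,
    orders: {
      all: async () => { db.queries += 1; return orders.map((o) => ({ ...o })); },
    },
    customers: {
      // Hypothetical batched lookup: one query for many ids.
      findMany: async (ids) => { db.queries += 1; return ids.map((id) => customers.get(id)); },
    },
  };
  return db;
}

async function loadOrdersWithCustomers(db) {
  const orders = await db.orders.all();
  // One batched query for all distinct customer ids instead of one per order.
  const ids = [...new Set(orders.map((o) => o.customerId))];
  const found = await db.customers.findMany(ids);
  const byId = new Map(found.map((c) => [c.id, c]));
  for (const o of orders) o.customer = byId.get(o.customerId);
  return orders;
}

const demoDb = createFakeDb();
loadOrdersWithCustomers(demoDb).then((orders) => {
  console.log(orders.length, "orders loaded in", demoDb.queries, "queries");
});
```

Three orders load in two queries, and the count stays at two however many orders there are. The per-order loop would issue one query for the orders plus one per order.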

String concatenation in loops

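In languages where strings are immutable, repeated concatenation can allocate and copy a new string on each iteration, which makes the loop quadratic in the length of the output. Use a builder, or collect the parts in an array and join once.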
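A minimal sketch of the collect-then-join form; `renderRows` and the row shape are hypothetical. Some JavaScript engines optimize repeated `+=` internally, so this is a pattern to measure, not an automatic win in every runtime.

```javascript
// Collect the parts, then build the final string once.
function renderRows(rows) {
  const lines = [];
  for (const row of rows) {
    lines.push(`${row.id},${row.name}`);
  }
  return lines.join("\n");
}

console.log(renderRows([{ id: 1, name: "a" }, { id: 2, name: "b" }]));
```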

Hidden quadratic complexity

array.includes(x) inside a loop over the same array is O(n²). For larger inputs, build a Set once, then check membership in O(1).
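De-duplication is a common place where this shows up. Both functions below are illustrative sketches with hypothetical names; they return the same result and differ only in how membership is checked.

```javascript
// O(n²): for each element, includes() rescans the output built so far.
function uniqueQuadratic(items) {
  const out = [];
  for (const item of items) {
    if (!out.includes(item)) out.push(item);
  }
  return out;
}

// O(n) on average: a Set tracks what has already been seen.
function uniqueLinear(items) {
  const seen = new Set();
  const out = [];
  for (const item of items) {
    if (!seen.has(item)) {
      seen.add(item);
      out.push(item);
    }
  }
  return out;
}

console.log(uniqueQuadratic([3, 1, 3, 2, 1]), uniqueLinear([3, 1, 3, 2, 1]));
```

For a handful of items the quadratic version is fine and arguably clearer. The Set version earns its extra lines only when the input can grow.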

Loading more than you need

Running SELECT * against a wide table when three columns suffice; deserializing a multi-megabyte JSON document to read one field; downloading a whole image to display a thumbnail. Pay only for the bytes you use.

Synchronous I/O on the hot path

Network calls, disk reads, and downstream service calls block the calling thread or task. Move them off the hot path (cache, async, background) when latency matters.

Logging in tight loops

Logging is cheap until it isn't. Inside a loop that runs millions of times, formatting and writing each log line can dominate the runtime. Aggregate or sample.
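Aggregation can be as simple as counting inside the loop and logging once afterwards. In this sketch `processAll` and `handle` are hypothetical, and the logger is passed in so the example does not assume a particular logging library.

```javascript
// Process items, counting failures instead of logging each one.
function processAll(items, handle, log = console.log) {
  let failed = 0;
  let firstError = null;
  for (const item of items) {
    try {
      handle(item);
    } catch (err) {
      failed += 1;
      if (firstError === null) firstError = err; // keep one example for diagnosis
    }
  }
  // One summary line, however many items were processed.
  const detail = firstError ? ` firstError="${firstError.message}"` : "";
  log(`processed=${items.length} failed=${failed}${detail}`);
  return failed;
}

// Hypothetical handler that rejects odd numbers.
processAll([0, 1, 2, 3], (n) => {
  if (n % 2 === 1) throw new Error(`odd ${n}`);
});
```

The summary keeps the count and one representative error, which is usually enough to start diagnosing. If individual failures matter, sampling (for example, logging every thousandth) is the other option named above.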

Premature parallelism

Concurrency adds overhead. For small workloads, single-threaded code is often faster than the same work split across cores. Parallelism pays off when the per-task cost dominates the coordination and synchronization cost, which is often not where intuition suggests.

Trade-Offs

Performance work always trades against something:

Gain          Common cost
Speed         Memory (caches, indexes, pre-computation)
Memory        Speed (streaming instead of holding)
Latency       Throughput (per-operation overhead)
Throughput    Latency (batching introduces delay)
Either        Code clarity
Either        Maintainability

Make the trade-off deliberate. A 10% gain that doubles the complexity of the code is rarely worth it; a 10× gain in a measured hot spot usually is.

When Optimized Code Survives

Optimized code that obscures intent should:

  • Be confined to a small, well-named module.
  • Be accompanied by a comment explaining why the obvious version is too slow, with the measurement.
  • Have a benchmark that captures the gain so future readers know the constraint.
  • Be tested against the obvious version on a representative input set.

Without these, optimized code rots: a future reader "simplifies" it back to the slow version, the regression goes unnoticed for months, and the original measurement is lost.
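Testing against the obvious version can be sketched with a bit-counting function. The example is chosen only for illustration and is not from any particular codebase: the optimized form uses a well-known parallel bit-counting trick, and the check at the bottom compares it with the straightforward loop on edge cases plus reproducible pseudo-random inputs.

```javascript
// Obvious version: count set bits one at a time.
function popcountObvious(n) {
  let count = 0;
  for (let v = n >>> 0; v !== 0; v >>>= 1) count += v & 1;
  return count;
}

// Optimized version: parallel bit counting on a 32-bit value.
// Harder to read, which is why it is checked against the obvious version below.
function popcountFast(n) {
  let v = n >>> 0;
  v = v - ((v >>> 1) & 0x55555555);                // bit counts per 2-bit group
  v = (v & 0x33333333) + ((v >>> 2) & 0x33333333); // per 4-bit group
  v = (v + (v >>> 4)) & 0x0f0f0f0f;                // per byte
  return Math.imul(v, 0x01010101) >>> 24;          // sum of the four bytes
}

// Representative inputs: edge cases plus values from a fixed-seed generator.
const inputs = [0, 1, 2, 0x80000000, 0xffffffff, 0x55555555];
let seed = 12345;
for (let i = 0; i < 1000; i++) {
  seed = (Math.imul(seed, 1664525) + 1013904223) >>> 0; // simple LCG, reproducible
  inputs.push(seed);
}

for (const n of inputs) {
  if (popcountFast(n) !== popcountObvious(n)) {
    throw new Error(`popcount mismatch for ${n}`);
  }
}
console.log("optimized version matches the obvious one on", inputs.length, "inputs");
```

Keeping the obvious version in the test suite gives future readers both a specification and a safe fallback if the constraint that justified the optimization goes away.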

Performance Culture

Two failure modes are common at the team level:

  • Performance theater. Optimizing for visible-but-irrelevant metrics ("we shaved a millisecond") while the user-visible flow stays slow. The remedy is measuring user-facing outcomes, not internal proxies.
  • Performance neglect. Treating performance as someone else's problem — usually "ops" or "the database team" — until production breaks. The remedy is cheap, continuous monitoring on the metrics that matter, so regressions are caught when the cause is one commit, not a quarter.

The goal is performance that is treated like correctness: measured, tested, and protected against regression — not heroically rescued under pressure.

Pre-Commit Checklist

  • Is there a concrete performance target, and a baseline measurement?
  • Has a profiler identified the actual bottleneck — not a guess?
  • Will the change be verified by re-measuring under realistic conditions?
  • Are the algorithm and data structure right? (No optimization can save an O(n²) inner loop.)
  • Have you avoided N+1 queries, hidden quadratics, and unnecessary loads?
  • Is the trade-off (clarity, memory, throughput) explicit and acceptable?
  • If the optimized form is harder to read, is there a comment explaining why and a benchmark protecting it?
