Load, stress, soak, and benchmark testing — separate disciplines with different questions, different setups, and different definitions of success

Performance Testing

Performance testing is what you do when "correct" isn't enough — when the system has to be correct under load, over time, or within a budget. The mistake most teams make is treating it as one thing. It's at least four different activities, each answering a different question, each requiring different setup.

A team that runs "a load test" without knowing which question it's answering usually gets a graph that proves nothing.

This page is about the discipline itself: which test answers which question, how each is set up, and what makes the results trustworthy.

For code-level optimization (profiling, hot paths, algorithmic improvements), see Performance under code-craft. That's about making code faster. This page is about proving a system stays fast under conditions that matter.

Four Disciplines

The same tooling can drive all of these; the question and acceptance criteria are different.

Type	Question it answers	Typical duration
Benchmark	"How fast is this specific operation?"	Seconds to minutes
Load test	"Can the system handle expected traffic?"	10 minutes to 1 hour
Stress test	"Where does the system break, and how?"	30 minutes to several hours
Soak / endurance	"Does it stay healthy over long periods?"	Hours to days

Conflating them is the first failure mode. A test that runs "high load for 4 hours" is doing two things at once and the result tells you neither cleanly.

Benchmarks

A benchmark measures the cost of one operation in isolation: a function call, a query, an API endpoint. Output is a number you can compare against a baseline.

Properties of a useful benchmark:

Single operation, run many times, statistics reported (p50, p95, p99, max).
Same machine, same warmup, same input each time — the comparison only works if conditions don't drift.
Repeatable. A benchmark you can't re-run weeks later isn't a benchmark.
Tracked against a baseline. A number with no comparison is decoration.
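
The properties above can be sketched as a minimal harness. This is an illustration only, assuming a zero-argument callable as the operation under test; the names benchmark and percentile are hypothetical, and a dedicated benchmark runner remains the better choice in practice.

```python
import math
import time
from typing import Callable


def percentile(sorted_samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of an already-sorted list (pct in 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def benchmark(op: Callable[[], object], warmup: int = 100, runs: int = 1000) -> dict[str, float]:
    """Time `op` repeatedly and report latency statistics in seconds."""
    for _ in range(warmup):  # warm caches and lazy initialisation; results discarded
        op()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": samples[-1],
    }
```

A call such as benchmark(lambda: sorted(range(1000))) returns the four statistics. Storing that dictionary next to the commit hash is one simple way to keep a baseline to compare against.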

Tools: go test -bench, cargo bench, and JMH (Java) for code-level benchmarks; hyperfine for command-line programs; k6 and wrk2 for HTTP-level measurements. Prefer the language's idiomatic benchmark runner where one exists, since it handles warmup and statistics correctly.

What benchmarks miss:

They run one thing at a time. Real systems do many things concurrently.
They run on idle hardware. Real systems share resources.
They typically have small, hot working sets. Real systems have cold caches.

Benchmarks tell you about the operation, not the system. Pair with load tests, don't substitute.

Load Tests

A load test runs the system at expected traffic and asks whether it holds.

The defining inputs:

Workload model. What requests, in what ratio, with what payloads. Drawn from production traffic if you can; synthetic profiles if you can't.
Concurrency or rate. How many users / requests per second. The relationship matters: 100 concurrent users at 1 req/sec each and 10 users at 10 req/sec each both add up to 100 req/sec, but they load connection handling and per-user state very differently.
Duration. Long enough for steady state. Usually 10–30 minutes. Too short and you're measuring startup; too long and it becomes a soak test.

Acceptance criteria (set before running):

Latency budget. p95 < 200ms; p99 < 1s.
Error rate. < 0.1%.
Throughput. Sustained X requests per second.
Resource ceiling. CPU < 70% per node; memory stable.

A load test without acceptance criteria is exploratory. That's a useful activity, but don't call it a pass/fail test.
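
One way to make "set before running" concrete is to write the criteria down as data and evaluate the run against them. The sketch below uses the example thresholds from the list above; the class and field names are hypothetical, and min_throughput_rps is a placeholder for the "X" that each system has to supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    p95_ms: float = 200.0              # latency budget: p95 < 200ms
    p99_ms: float = 1000.0             # latency budget: p99 < 1s
    max_error_rate: float = 0.001      # error rate < 0.1%
    min_throughput_rps: float = 100.0  # placeholder for "X requests per second"
    max_cpu_fraction: float = 0.70     # CPU < 70% per node


@dataclass(frozen=True)
class RunResult:
    p95_ms: float
    p99_ms: float
    error_rate: float
    throughput_rps: float
    peak_cpu_fraction: float


def evaluate(result: RunResult, criteria: Criteria) -> list[str]:
    """Return a description of every criterion the run failed (empty means pass)."""
    failures = []
    if result.p95_ms >= criteria.p95_ms:
        failures.append(f"p95 {result.p95_ms}ms >= {criteria.p95_ms}ms")
    if result.p99_ms >= criteria.p99_ms:
        failures.append(f"p99 {result.p99_ms}ms >= {criteria.p99_ms}ms")
    if result.error_rate >= criteria.max_error_rate:
        failures.append(f"error rate {result.error_rate} >= {criteria.max_error_rate}")
    if result.throughput_rps < criteria.min_throughput_rps:
        failures.append(f"throughput {result.throughput_rps} < {criteria.min_throughput_rps} req/s")
    if result.peak_cpu_fraction >= criteria.max_cpu_fraction:
        failures.append(f"CPU {result.peak_cpu_fraction} >= {criteria.max_cpu_fraction}")
    return failures
```

Committing the Criteria values to version control before the run makes it harder to adjust the bar after seeing the results.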

Tools: k6, Locust, Gatling, JMeter, Vegeta, Artillery. All can drive HTTP load; k6 and Locust have the friendliest scripting models.

Open vs. Closed Workload Models

The mistake that ruins load tests: confusing two ways of generating load.

Closed model: N virtual users, each waits for a response, then makes the next request. Throughput is an output of the test, limited by latency.

Open model: Requests arrive at rate R, independent of how fast the system responds. Throughput is an input: if the system slows down, requests queue up.

Production traffic almost always behaves like an open model — users don't wait for slowness to settle down before clicking again. A closed-model load test that "passes" can be hiding a system that, in production, would tip over the moment latency rose. Use open-model load generators (k6's constant-arrival-rate, Vegeta's rate-based mode) for realistic results.
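
The difference can be shown with a little arithmetic. The sketch below is a rough fluid approximation, not a queueing simulation, and both function names are hypothetical. It assumes each closed-model user issues requests back to back, and that an open-model system drains at a fixed capacity.

```python
def closed_model_throughput(users: int, latency_s: float, think_time_s: float = 0.0) -> float:
    """Requests per second a closed-model generator produces.

    Each user completes one request every (latency + think time) seconds,
    so the offered load falls automatically when the system slows down.
    """
    return users / (latency_s + think_time_s)


def open_model_backlog(arrival_rps: float, capacity_rps: float, seconds: float) -> float:
    """Requests waiting after `seconds` when arrivals ignore response time."""
    return max(0.0, (arrival_rps - capacity_rps) * seconds)
```

With 100 users and 0.1 s latency, the closed generator offers about 1000 req/s. If latency rises to 1 s, the same generator quietly backs off to 100 req/s, which is the hiding effect described above. An open generator holding 1000 req/s against a system that can only serve 100 req/s leaves 54,000 requests queued after one minute.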

Stress Tests

A stress test pushes the system past expected limits and observes what breaks. The question isn't "does it hold at 100 req/s," it's "what happens at 500, at 1000, at 5000."

What you're learning:

The knee. At what load does latency start rising sharply, and what's the slope?
The failure mode. When it does break, does it return errors gracefully, time out, OOM, or take down the database?
The recovery. After the stress drops, does the system recover, or stay degraded?

Stress tests are how you find the difference between "designed for 1000 req/s" and "tested to 1000 req/s without falling over."

Patterns to run:

Ramp. Load grows from 0 to N over time. The graph shows the knee.
Spike. Sudden jump from idle to peak. Tests provisioning and autoscaling response time.
Sustained overload. Hold at, say, 150% of expected capacity for 10 minutes. Watch what fails first.

The result of a stress test isn't usually pass/fail; it's a chart and a writeup. "We start dropping requests at 800 req/s; the bottleneck is the database connection pool; recovery takes 2 minutes after load returns to normal."
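
Locating the knee from ramp data can be roughed out in code. The heuristic below is an assumption rather than a standard definition: it reports the first load level at which latency exceeds a chosen multiple of the latency at the lowest load. The function name and the default factor of 2 are hypothetical.

```python
from typing import Optional


def find_knee(samples: list[tuple[float, float]], factor: float = 2.0) -> Optional[float]:
    """Return the first load (req/s) whose latency exceeds factor x the baseline.

    `samples` holds (load_rps, latency_ms) pairs from a ramp test. The baseline
    is the latency at the lowest load. Returns None if no knee was reached.
    """
    if not samples:
        return None
    ordered = sorted(samples)  # ascending by load
    baseline_latency = ordered[0][1]
    for load, latency in ordered:
        if latency > factor * baseline_latency:
            return load
    return None
```

A result of None is informative too: the ramp did not go high enough to find the limit.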

Soak / Endurance Tests

A soak test runs moderate load for a long time to find issues that only appear over hours or days:

Memory leaks (small allocations that aren't freed).
Connection leaks (sockets, DB connections, file handles).
Disk fills (logs, temp files, growing tables).
Cache thrash (working set exceeds cache; performance degrades slowly).
Background job queues that grow unbounded.

A 12-hour run at production-typical load is more useful than a 30-minute run at 10x load. Different test, different goal.

Acceptance criteria: at the end of the run, are key metrics (memory, latency, error rate) at the same level as the beginning? If they've drifted, something is leaking.
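
That comparison is simple enough to automate. The sketch below compares the mean of a metric over the first and last windows of the run; the function names, window size, and 5% tolerance are all assumptions to be tuned per metric, and the first window should start after warmup.

```python
from statistics import mean


def drift_ratio(series: list[float], window: int) -> float:
    """Relative change between the mean of the first and last `window` samples."""
    if window <= 0 or len(series) < 2 * window:
        raise ValueError("need at least two non-overlapping windows")
    start = mean(series[:window])
    end = mean(series[-window:])
    if start == 0:
        raise ValueError("starting mean is zero; relative drift is undefined")
    return (end - start) / start


def has_drifted(series: list[float], window: int, tolerance: float = 0.05) -> bool:
    """True if the metric ended more than `tolerance` away from where it began."""
    return abs(drift_ratio(series, window)) > tolerance
```

Run it separately for memory, latency, and error rate. A metric that ends 20% above where it started is worth investigating even if every individual sample looked acceptable.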

Run soaks pre-release, not per PR. They're expensive.

What "Production-Like" Means

A load test on dev hardware against a 10-row database tells you very little. The dimensions that matter:

Dimension	Default mistake	Better
Hardware	Smaller dev instances	Same instance types as prod, at scale
Data volume	Empty or seeded with 100 rows	Realistic table sizes (millions of rows)
Data distribution	Sequential IDs, uniform values	Skewed like prod (hot keys, long-tail)
Network	Same datacenter as load generator	Realistic latency between generator, app, DB
Caches	Empty (cold)	Pre-warmed if that reflects prod
Concurrency context	Just the test traffic	Background jobs, cron, real-traffic noise

Each dimension you simplify makes the result less applicable to production. A test in a fully production-like environment is expensive; one in a test environment with two production-like dimensions is much better than one with none.

Where to Run

Environment	Use for
Local laptop	Benchmark development, tuning a single function
Dev environment	Smoke tests, "is this endpoint slower than yesterday"
Dedicated perf environment	Load / stress / soak tests with realistic conditions
Staging	Production-like dry runs before release
Production (read-only synthetic)	Continuous load monitoring, not testing

The dedicated perf environment is the load-bearing one. Sharing it with functional tests pollutes the results; running on production risks the users.

Baselines and Regression Detection

Performance numbers in isolation are meaningless. The shape of a useful perf-testing practice:

Establish a baseline. Run the suite on a known-good build; record results.
Compare on each significant change. Per-PR for small benchmarks; pre-release for full load tests.
Alert on regression past a threshold. "Endpoint X p95 increased by 20% vs. baseline" is a question worth asking.
Re-baseline deliberately. When an intentional performance change deploys, the new performance becomes the new baseline. Don't compare against a 6-month-old number.
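
The compare-and-alert steps can be sketched as a single function. This assumes metrics where higher is worse (such as latency) stored as a name-to-value mapping; the function name, metric names, and the 20% default threshold are illustrative.

```python
def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.20,
) -> dict[str, float]:
    """Return metrics whose value grew by more than `threshold` versus baseline.

    Values are relative increases, so 0.25 means 25% slower than baseline.
    Metrics missing from `current` or with a non-positive baseline are skipped.
    """
    flagged = {}
    for name, base in baseline.items():
        if name not in current or base <= 0:
            continue
        change = (current[name] - base) / base
        if change > threshold:
            flagged[name] = change
    return flagged
```

For example, a baseline p95 of 180 ms against a current p95 of 225 ms is a 25% increase and would be flagged at the default threshold.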

Tools that automate this: Bencher, Codspeed, custom dashboards over Prometheus.

What kills baselines: noisy environments (shared CI runners with variable CPU), insufficient repetition (running once and comparing), and no alerting (results are recorded but no one looks).

Common Failure Modes

Closed-model load test claiming to test open-model traffic

Already mentioned; the most common mistake. A closed test with 100 virtual users gives radically different results from an open test at 100 req/s when latency rises. Production behaves like the open one.

Single-machine load generator

The generator runs on one box; the box's NIC, CPU, or open-file-descriptor limit becomes the bottleneck before the system under test does. Result: "the system caps at 500 req/s" when actually the generator capped at 500 req/s. Always confirm the generator isn't the bottleneck.

Tests against an empty database

Query performance changes dramatically with data volume. A test against a 100-row table tells you nothing about a system that runs against 100 million rows in production. Most index issues, plan changes, and cache misses only appear at scale.

No warmup

JVM JIT, connection pools, OS page cache, CDN caches — all need warmup. The first 30 seconds of any load test are setup, not signal. Either discard them explicitly or run the test long enough that they don't dominate.
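
Discarding warmup explicitly can be as small as a filter over timestamped samples. The sketch assumes samples arrive as (seconds since test start, latency in ms) pairs; the function name is hypothetical, and the 30-second default simply mirrors the figure above.

```python
def steady_state(samples: list[tuple[float, float]], warmup_s: float = 30.0) -> list[float]:
    """Keep only latencies recorded at or after `warmup_s` seconds into the test."""
    return [latency for elapsed, latency in samples if elapsed >= warmup_s]
```

Compute percentiles only on the returned list, and record the warmup cutoff alongside the results so runs stay comparable.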

Variance ignored

A single run gives a single number. Two runs give a range. Ten runs give a distribution. Decisions based on one run are decisions based on noise — especially in shared environments. Repeat and report variance.
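
Reporting variance takes only a few lines. The sketch below summarizes one metric (p95 latency is used as the example) across repeated runs; the function name and output keys are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(values: list[float]) -> dict[str, float]:
    """Summarize one metric across repeated runs of the same test."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate variance")
    avg = mean(values)
    spread = stdev(values)  # sample standard deviation
    return {
        "runs": len(values),
        "mean": avg,
        "stdev": spread,
        "min": min(values),
        "max": max(values),
        "cv": spread / avg if avg else float("nan"),  # coefficient of variation
    }
```

If the difference between two builds is smaller than the spread between runs of the same build, the measurement cannot distinguish them.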

Performance regression caught the day before launch

The test exists; the team runs it once, before release; it shows a 40% regression introduced eight weeks ago. Now the launch slips while you bisect. Run perf tests on a regular cadence, not only as a release gate.

Optimizing for the test, not the user

The load test asserts "p95 < 200ms." Code gets restructured to satisfy the test — by caching aggressively, by returning stale data, by skipping work the test doesn't notice. Real users see worse behavior because the test was measuring the wrong thing. Acceptance criteria should be derived from real user expectations, not from what's easy to measure.

Pre-release Checklist

Before declaring performance testing healthy:

Is each test in the suite clearly one of {benchmark, load, stress, soak}, not a mix?
Are acceptance criteria written down before the run?
Is the load generator known not to be the bottleneck?
Is the environment production-like in the dimensions that matter for the test?
Are baselines tracked over time, and is regression alerted on?
Does the team look at the results, or do they pile up in a dashboard nobody opens?

If the answer to the last one is honest and unflattering, the performance tests aren't pulling weight — they're a record-keeping exercise.
