Performance Testing
Load, stress, soak, and benchmark testing — separate disciplines with different questions, different setups, and different definitions of success
Performance testing is what you do when "correct" isn't enough — when the system has to be correct under load, over time, or within a budget. The mistake most teams make is treating it as one thing. It's at least four different activities, each answering a different question, each requiring different setup.
A team that runs "a load test" without knowing which question it's answering usually gets a graph that proves nothing.
This page is about the discipline itself: which test answers which question, how each is set up, and what makes the results trustworthy.
For code-level optimization (profiling, hot paths, algorithmic improvements), see Performance under code-craft. That's about making code faster. This page is about proving a system stays fast under conditions that matter.
Four Disciplines
The same tooling can drive all of these; the question and acceptance criteria are different.
| Type | Question it answers | Typical duration |
|---|---|---|
| Benchmark | "How fast is this specific operation?" | Seconds to minutes |
| Load test | "Can the system handle expected traffic?" | 10 minutes to 1 hour |
| Stress test | "Where does the system break, and how?" | 30 minutes to several hours |
| Soak / endurance | "Does it stay healthy over long periods?" | Hours to days |
Conflating them is the first failure mode. A test that runs "high load for 4 hours" is doing two things at once and the result tells you neither cleanly.
Benchmarks
A benchmark measures the cost of one operation in isolation: a function call, a query, an API endpoint. Output is a number you can compare against a baseline.
Properties of a useful benchmark:
- Single operation, run many times, statistics reported (p50, p95, p99, max).
- Same machine, same warmup, same input each time — the comparison only works if conditions don't drift.
- Repeatable. A benchmark you can't re-run weeks later isn't a benchmark.
- Tracked against a baseline. A number with no comparison is decoration.
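As a rough illustration of the properties above, here is a minimal hand-rolled harness in Python. It is a sketch only: the names `bench` and `percentile`, the warmup count, and the iteration count are all assumptions made for this example, and a language's own benchmark runner will normally do this job better.

```python
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of a sorted list; pct is an integer from 1 to 100."""
    if not sorted_samples:
        raise ValueError("no samples")
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(sorted_samples) + 99) // 100)
    return sorted_samples[rank - 1]


def bench(operation, warmup=100, iterations=1000):
    """Run `operation` repeatedly and summarize its latency in seconds."""
    for _ in range(warmup):  # warmup calls are executed but not timed
        operation()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": samples[-1],
    }
```

Calling `bench(lambda: sorted(range(1000)))` returns the four statistics for that one operation. The numbers only become useful once they are stored and compared against an earlier run on the same machine.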
Tools: go test -bench, cargo bench, hyperfine, k6 for HTTP-level, wrk2, JMH (Java). Use the language's idiomatic benchmark runner — it handles warmup and statistics correctly.
What benchmarks miss:
- They run one thing at a time. Real systems do many things concurrently.
- They run on idle hardware. Real systems share resources.
- They typically have small, hot working sets. Real systems have cold caches.
Benchmarks tell you about the operation, not the system. Pair with load tests, don't substitute.
Load Tests
A load test runs the system at expected traffic and asks whether it holds.
The defining inputs:
- Workload model. What requests, in what ratio, with what payloads. Drawn from production traffic if you can; synthetic profiles if you can't.
- Concurrency or rate. How many users / requests per second. The relationship matters: 100 concurrent users at 1 req/sec each and 10 users at 10 req/sec each both total 100 req/s, but the first keeps ten times as many users active at once.
- Duration. Long enough for steady state. Usually 10–30 minutes. Too short and you're measuring startup; too long and it becomes a soak test.
Acceptance criteria (set before running):
- Latency budget. p95 < 200ms; p99 < 1s.
- Error rate. < 0.1%.
- Throughput. Sustained X requests per second.
- Resource ceiling. CPU < 70% per node; memory stable.
A load test without acceptance criteria is exploratory. That's a useful activity, but don't call it a pass/fail test.
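One way to make "set before running" concrete is to encode the criteria as data and evaluate the run against them afterwards. The sketch below assumes a hypothetical `result` dictionary produced by whatever tool ran the test; the field names and the 500 req/s throughput floor (standing in for "X") are invented for illustration, and the "memory stable" check is left out because it needs a time series rather than a single number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    p95_ms: float = 200.0
    p99_ms: float = 1000.0
    max_error_rate: float = 0.001  # 0.1%
    min_rps: float = 500.0  # hypothetical value standing in for "X"
    max_cpu: float = 0.70  # fraction of one node's CPU


def evaluate(result, criteria=Criteria()):
    """Return the list of violated criteria; an empty list means the run passed."""
    checks = [
        ("p95 latency", result["p95_ms"] < criteria.p95_ms),
        ("p99 latency", result["p99_ms"] < criteria.p99_ms),
        ("error rate", result["error_rate"] < criteria.max_error_rate),
        ("throughput", result["rps"] >= criteria.min_rps),
        ("cpu ceiling", result["cpu"] < criteria.max_cpu),
    ]
    return [name for name, ok in checks if not ok]
```

Because the thresholds live in one frozen object, they can be committed before the run, which makes it harder to adjust them after seeing the results.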
Tools: k6, Locust, Gatling, JMeter, Vegeta, Artillery. All can drive HTTP load; k6 and Locust have the friendliest scripting models.
Open vs. Closed Workload Models
The mistake that ruins load tests: confusing two ways of generating load.
Closed model: N virtual users, each waits for a response, then makes the next request. Throughput is an output of the test, limited by latency.
Open model: Requests arrive at rate R, independent of how fast the system responds. Throughput is an input: if the system slows down, requests queue up.
Production traffic almost always behaves like an open model — users don't wait for slowness to settle down before clicking again. A closed-model load test that "passes" can be hiding a system that, in production, would tip over the moment latency rose. Use open-model load generators (k6's constant-arrival-rate, Vegeta's rate-based mode) for realistic results.
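The difference can be shown with back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model, not a simulator: it assumes constant latency, a fixed service capacity, and no request timeouts, and both function names are made up for this example.

```python
def closed_model_throughput(virtual_users, latency_s, think_time_s=0.0):
    """Requests/sec a closed test generates: each user sends one request per cycle."""
    return virtual_users / (latency_s + think_time_s)


def open_model_backlog(arrival_rate, capacity, duration_s):
    """Requests still waiting after duration_s when arrivals ignore response time."""
    return max(0.0, (arrival_rate - capacity) * duration_s)


# Closed model: when latency rises from 0.25 s to 2 s, the same 100 users
# quietly send less traffic, so the test backs off exactly when the system struggles.
fast = closed_model_throughput(100, 0.25)
slow = closed_model_throughput(100, 2.0)

# Open model: 100 req/s keeps arriving at a system that can serve 80 req/s,
# so the backlog keeps growing for as long as the overload lasts.
backlog = open_model_backlog(arrival_rate=100, capacity=80, duration_s=60)
```

In this toy model the closed test drops from 400 req/s to 50 req/s as latency rises, while the open test accumulates 1200 queued requests in one minute. That self-throttling is what lets a closed test pass against a system that would tip over under real arrivals.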
Stress Tests
A stress test pushes the system past expected limits and observes what breaks. The question isn't "does it hold at 100 req/s," it's "what happens at 500, at 1000, at 5000."
What you're learning:
- The knee. At what load does latency start rising sharply, and what's the slope?
- The failure mode. When it does break, does it return errors gracefully, time out, OOM, or take down the database?
- The recovery. After the stress drops, does the system recover, or stay degraded?
Stress tests are how you find the difference between "designed for 1000 req/s" and "tested to 1000 req/s without falling over."
Patterns to run:
- Ramp. Load grows from 0 to N over time. The graph shows the knee.
- Spike. Sudden jump from idle to peak. Tests provisioning and autoscaling response time.
- Sustained overload. Hold at, say, 150% of expected capacity for 10 minutes. Watch what fails first.
The result of a stress test usually isn't pass/fail; it's a chart and a writeup. "We start dropping requests at 800 req/s; the bottleneck is the database connection pool; recovery takes 2 minutes after load returns to normal."
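For the ramp pattern, the knee can be estimated from the recorded (load, latency) points. The heuristic below, which flags the first load where latency exceeds a multiple of the latency at the lowest load, is an assumption chosen for simplicity; the function name, the default factor of 2, and the sample data are all illustrative.

```python
def find_knee(points, factor=2.0):
    """Return the first load whose latency exceeds factor x the lowest-load latency.

    points: list of (load_rps, p95_latency_ms) pairs from a ramp, in any order.
    Returns None if latency never crosses the threshold.
    """
    ordered = sorted(points)
    if not ordered:
        return None
    baseline = ordered[0][1]  # latency at the lowest load tested
    for load, latency in ordered:
        if latency > factor * baseline:
            return load
    return None


# Hypothetical ramp results, invented for this example.
ramp = [(100, 50.0), (200, 52.0), (400, 55.0), (800, 180.0), (1000, 900.0)]
knee = find_knee(ramp)
```

With this made-up data the function reports 800 req/s. A real writeup would still look at the chart, since a single threshold says nothing about the slope after the knee.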
Soak / Endurance Tests
A soak test runs moderate load for a long time to find issues that only appear over hours or days:
- Memory leaks (small allocations that aren't freed).
- Connection leaks (sockets, DB connections, file handles).
- Disk fills (logs, temp files, growing tables).
- Cache thrash (working set exceeds cache; performance degrades slowly).
- Background job queues that grow unbounded.
A 12-hour run at production-typical load is more useful than a 30-minute run at 10x load. Different test, different goal.
Acceptance criteria: at the end of the run, are key metrics (memory, latency, error rate) at the same level as the beginning? If they've drifted, something is leaking.
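A minimal sketch of that start-versus-end comparison, assuming the metric has been sampled at regular intervals into a list of positive numbers. The window size (first and last 10% of the run) and the 10% tolerance are arbitrary choices for illustration.

```python
import statistics


def drift_ratio(samples, window=0.1):
    """Mean of the last window divided by the mean of the first window."""
    if not samples:
        raise ValueError("no samples")
    n = max(1, int(len(samples) * window))
    start = statistics.fmean(samples[:n])
    end = statistics.fmean(samples[-n:])
    return end / start  # assumes a positive metric such as memory in MB


def has_drifted(samples, tolerance=0.10, window=0.1):
    """True if the end of the run differs from the start by more than tolerance."""
    return abs(drift_ratio(samples, window) - 1.0) > tolerance
```

A steadily climbing memory series fails this check, while a flat one passes. Averaging over a window instead of comparing two single samples keeps one garbage-collection spike from deciding the outcome.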
Run soaks pre-release, not per PR. They're expensive.
What "Production-Like" Means
A load test on dev hardware against a 10-row database tells you very little. The dimensions that matter:
| Dimension | Default mistake | Better |
|---|---|---|
| Hardware | Smaller dev instances | Same instance types as prod, at scale |
| Data volume | Empty or seeded with 100 rows | Realistic table sizes (millions of rows) |
| Data distribution | Sequential IDs, uniform values | Skewed like prod (hot keys, long-tail) |
| Network | Same datacenter as load generator | Realistic latency between generator, app, DB |
| Caches | Empty (cold) | Pre-warmed if that reflects prod |
| Concurrency context | Just the test traffic | Background jobs, cron, real-traffic noise |
Each dimension you simplify makes the result less applicable to production. A test in a fully production-like environment is expensive; one in a test environment with two production-like dimensions is much better than one with none.
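For the data-distribution row, skewed access can be approximated by weighting keys by rank, so a few keys are hot and the rest form a long tail. The sketch below uses a Zipf-like weighting as an assumption; real production skew should be measured, not guessed, and the function name and parameters are invented for this example.

```python
import random


def skewed_keys(num_keys, count, exponent=1.0, seed=0):
    """Draw `count` key IDs where key 1 is hottest and popularity decays by rank."""
    rng = random.Random(seed)  # seeded so a test run can be reproduced
    keys = list(range(1, num_keys + 1))
    weights = [1.0 / (rank ** exponent) for rank in keys]
    return rng.choices(keys, weights=weights, k=count)
```

Feeding these IDs into test requests, instead of sequential or uniformly random ones, exercises hot-row contention and cache behavior that uniform data hides.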
Where to Run
| Environment | Use for |
|---|---|
| Local laptop | Benchmark development, tuning a single function |
| Dev environment | Smoke tests, "is this endpoint slower than yesterday" |
| Dedicated perf environment | Load / stress / soak tests with realistic conditions |
| Staging | Production-like dry runs before release |
| Production (read-only synthetic) | Continuous load monitoring, not testing |
The dedicated perf environment is the load-bearing one. Sharing it with functional tests pollutes the results; running these tests on production puts real users at risk.
Baselines and Regression Detection
Performance numbers in isolation are meaningless. The shape of a useful perf-testing practice:
- Establish a baseline. Run the suite on a known-good build; record results.
- Compare on each significant change. Per-PR for small benchmarks; pre-release for full load tests.
- Alert on regression past a threshold. "Endpoint X p95 increased by 20% vs. baseline" is a signal worth investigating.
- Re-baseline deliberately. When a faster build deploys, the new performance becomes the new baseline. Don't compare against a 6-month-old number.
Tools that automate this: Bencher, Codspeed, custom dashboards over Prometheus.
What kills baselines: noisy environments (shared CI runners with variable CPU), insufficient repetition (running once and comparing), and no alerting (results are recorded but no one looks).
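A threshold comparison of this kind can be sketched in a few lines. The version below compares the median of several candidate runs against the median of several baseline runs, which is one simple way to blunt run-to-run noise; the function names and the 20% default are assumptions for illustration, not the behavior of any particular tool.

```python
import statistics


def regression_pct(baseline_runs, candidate_runs):
    """Percent change of the candidate median relative to the baseline median."""
    base = statistics.median(baseline_runs)
    cand = statistics.median(candidate_runs)
    return (cand - base) / base * 100.0


def is_regression(baseline_runs, candidate_runs, threshold_pct=20.0):
    """True if the candidate is slower than baseline by more than the threshold."""
    return regression_pct(baseline_runs, candidate_runs) > threshold_pct
```

Each argument is a list of the same metric (for example p95 latency in ms) from repeated runs. Passing single-element lists works, but it reintroduces the "running once and comparing" problem described above.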
Common Failure Modes
Closed-model load test claiming to test open-model traffic
Already mentioned; the most common mistake. A closed test with 100 virtual users gives radically different results from an open test at 100 req/s when latency rises. Production behaves like the open one.
Single-machine load generator
The generator runs on one box; the box's NIC, CPU, or open-file-descriptor limit becomes the bottleneck before the system under test does. Result: "the system caps at 500 req/s" when actually the generator capped at 500 req/s. Always confirm the generator isn't the bottleneck.
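One cheap check, applicable when the test sets a target request rate, is to compare the rate the generator actually sent against the rate it was asked to send. The sketch below is a heuristic under that assumption; the function name and the 5% tolerance are invented here, and it does not replace looking at the generator's own CPU, network, and file-descriptor usage.

```python
def generator_saturated(target_rps, sent_requests, duration_s, tolerance=0.05):
    """True if the generator sent noticeably fewer requests than it was asked to."""
    achieved_rps = sent_requests / duration_s
    return achieved_rps < target_rps * (1.0 - tolerance)
```

If this returns `True`, the measured ceiling may belong to the generator rather than to the system under test, and the run is worth repeating with more generator capacity.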
Tests against an empty database
Query performance changes dramatically with data volume. A test against a 100-row table tells you nothing about a system that runs against 100 million rows in production. Most index issues, plan changes, and cache misses only appear at scale.
No warmup
JVM JIT, connection pools, OS page cache, CDN caches — all need warmup. The first 30 seconds of any load test are setup, not signal. Either discard them explicitly or run the test long enough that they don't dominate.
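Discarding the warmup window explicitly can be as simple as filtering samples by timestamp before computing any statistics. The sketch below assumes each sample is a `(seconds_since_test_start, latency_ms)` tuple, which is a format chosen for this example.

```python
def drop_warmup(samples, warmup_s=30.0):
    """Keep only samples recorded at or after the end of the warmup window.

    samples: list of (seconds_since_test_start, latency_ms) tuples.
    """
    return [(t, latency) for t, latency in samples if t >= warmup_s]
```

The 30-second default mirrors the figure above; the right window for a given system is whatever it takes for latency to level off.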
Variance ignored
A single run gives a single number. Two runs give a range. Ten runs give a distribution. Decisions based on one run are decisions based on noise — especially in shared environments. Repeat and report variance.
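Reporting variance can start with a small summary over repeated runs. This sketch assumes one headline number per run (for example p95 latency in ms) and refuses to summarize fewer than two runs; the function name and output fields are illustrative.

```python
import statistics


def summarize_runs(values):
    """Summarize one metric across repeated runs of the same test."""
    if len(values) < 2:
        raise ValueError("need at least two runs to report variance")
    return {
        "runs": len(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }
```

If the spread between `min` and `max` is as large as the change being evaluated, the comparison is not yet telling you anything.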
Performance regression caught the day before launch
The test exists; the team runs it once, before release; it shows a 40% regression introduced eight weeks ago. Now the launch slips while you bisect. Run perf tests on a regular cadence, not just once as a release gate.
Optimizing for the test, not the user
The load test asserts "p95 < 200ms." Code gets restructured to satisfy the test — by caching aggressively, by returning stale data, by skipping work the test doesn't notice. Real users see worse behavior because the test was measuring the wrong thing. Acceptance criteria should be derived from real user expectations, not from what's easy to measure.
Pre-release Checklist
Before declaring performance testing healthy:
- Is each test in the suite clearly one of {benchmark, load, stress, soak}, not a mix?
- Are acceptance criteria written down before the run?
- Is the load generator known not to be the bottleneck?
- Is the environment production-like in the dimensions that matter for the test?
- Are baselines tracked over time, and is regression alerted on?
- Does the team look at the results, or do they pile up in a dashboard nobody opens?
If the answer to the last one is honest and unflattering, the performance tests aren't pulling weight — they're a record-keeping exercise.