Detecting flaky tests systematically, quarantining without burying them, and treating flakes as bugs instead of weather

Flake Management

A flaky test is a test that fails or passes on the same code without anyone changing anything. The first time it happens, it's a curiosity. The tenth time, it's a culture. By the hundredth time, the team has learned to ignore red CI — and the suite stops protecting anything.

Flake management is the discipline of keeping that from happening. The two failure modes it prevents:

Death by ignorance. Flakes accumulate, no one fixes them, eventually a real regression is dismissed as "probably the flake again."
Death by retry. Auto-retry-until-green hides the flake rate; everything looks healthy until the underlying cause produces a real outage.

A suite without flake management isn't a suite. It's a slot machine.

The Math Nobody Wants to Hear

A single test with 99% reliability seems fine. Run 200 of them in a suite, and the probability that all 200 pass is 0.99^200 ≈ 0.13. An 87% suite failure rate, with zero real bugs.

Push that to 1000 tests and the suite essentially never passes. This is why "just a tiny flake rate per test" is not survivable. The economics of flake compound multiplicatively.

The corollary: a healthy suite needs per-test reliability above 99.9%, not 99%.

Sources, by Frequency

Most flakes trace back to a small set of causes. Knowing the catalog speeds triage:

Cause	Symptom	Fix direction
Timing / async race	"Element not found"; intermittent timeouts	Replace `sleep(n)` with explicit wait-for-condition; await every async boundary
Order dependence	Passes alone, fails in suite	Isolate state in `beforeEach`; per-worker DB
Shared state across tests	Last test of the day fails first thing the next day	Reset module caches, singletons, env vars per test
External service / network	Fails when third-party is slow or rate-limited	Replace with stub; if real, retry with backoff at the boundary, not the test
Clock / timezone	Fails on certain dates, around midnight, in CI's timezone	Inject clock; pin TZ in the runner
Resource exhaustion	Fails under high parallelism, fine alone	Cap connection pools per worker; check FD/port limits
Memory / state pollution	Passes locally, fails on shared CI runners	Reset globals; check for module-level mutable state
CI infrastructure	Fails only on one runner type / image	Pin images; debug on the bad runner

A genuinely random flake (cosmic-ray-level) is rare. If a test feels random, the more likely answer is you haven't found the dependency yet.

Detection: Make Flake Rate Visible

A team can't act on what it can't see. The minimum metric:

Flake rate per test = (passed-after-retry runs + failed-after-retry runs) / total runs

If your CI doesn't track this, you don't have flake management; you have flake denial.

Common tools that surface this:

Buildkite Test Analytics, CircleCI Test Insights, Datadog CI Visibility, Trunk Flaky Tests, GitHub Actions + custom job — all do flake detection by re-running failed tests automatically and recording whether the retry passed.
Roll-your-own: a daily job runs the full suite N times, diffs results, posts the flaky set to a dashboard.

The dashboard is the artifact. Without it, "we have a few flakes" stays a vibe; with it, "tests A, B, C had a 12% flake rate this week, owned by team X" is a backlog item.

The Quarantine Pattern

When a flake is detected, three options exist:

Fix immediately. Best, when feasible.
Quarantine. Mark the test as known-flaky; it continues running but does not fail the build. Failures go to a separate report.
Delete. When the test was never valuable.

Quarantine is the load-bearing option. The rules that make it work:

Quarantine is bounded. Every quarantined test has an owner and a deadline. Past the deadline, the test is fixed, re-enabled, or deleted.
Quarantine is visible. A weekly report shows the quarantine list growing or shrinking. A growing list past N is a triage signal.
Quarantine is the exception. First-line response is "fix the flake." Quarantine is a tool for when the fix takes longer than the cost of blocking CI.
Quarantined tests still run. They produce data for fixing. A test you skip is a test you've already given up on.

What kills quarantine: no owner, no deadline, no review. Items rot for months. The "quarantine" becomes a permanent skip list. Eventually a real bug hides in there.

Retry Budgets

Auto-retry is a tool, not a strategy. The bounds:

Retry once at most, at the test level. Not at the job level. Job-level retries make flake rate invisible.
Record retries. A test that passed on retry is data — push it to the flake dashboard.
Fail fast on patterns. Three consecutive flake-passes from the same test in a week = auto-quarantine + ticket.
Never retry an integration- or contract-test failure silently. Those are usually real.

A useful contract with the team: a green build that required a retry is not the same as a green build. Treat it as "passed with a flag."

Triage Workflow

When a flake is reported, the steps:

Reproduce. Get the failure to happen at least once. If you can't, you can't fix it.
- Run the test alone, then in its file, then in the full suite.
- Run with the same seed, same worker count, same data.
- Run in CI's image locally (containers help here).
Reduce. Find the smallest setup that reproduces.
- Disable other tests in the file.
- Replace dependencies with fakes one at a time.
- Strip assertions until only the flaky one is left.
Hypothesize. What is non-deterministic about this code path?
- Anything in the sources table above.
- Logging timestamps, IDs, anything that looks like it shouldn't matter.
Verify. Run N times after the fix. A "fix" that hasn't been re-run 100 times isn't verified — it might just have moved the failure mode.
Document. What the cause was, what the fix was. Future-you will thank you when a similar flake surfaces.

This is the same loop as production debugging. See Debugging for the underlying discipline.

The "Just Retry" Trap

The most common request when a CI is flaky: "can we just retry failures three times?" The math:

A test with 5% flake rate, retried 3 times, has effective flake rate 0.05^3 = 0.0125%. Suite-level reliability looks great.

What you've actually done:

Hidden the 5% flake rate. It's still there; the underlying cause hasn't moved.
Made CI 3x slower in the worst case.
Made debugging real failures harder — was it a real bug, or did the 3rd retry catch a flake?
Trained the team that red means "wait for retry," not "investigate."

Retries are an analgesic, not a cure. Use them for genuinely unfixable infrastructure flake (network blips to external services), not as a default policy.

Cultural Failure Modes

"We have a flake culture." Statement of identity, not a problem to solve. Read: "we have given up."

"It's flaky, just rerun it." Said often enough, this becomes the protocol. Real bugs get dismissed the same way.

"That test is too important to delete." A test that doesn't reliably pass when correct is providing zero signal. It is worse than deleted — it adds noise.

"We'll fix it next sprint." Across enough sprints, this becomes the new normal. The fix never lands. Set a deadline; if it slips, delete the test.

Hidden retries. CI silently re-runs failures; flake rate is invisible; one day production breaks for a reason CI saw three times but never reported.

Pre-merge Checklist

Before declaring flake management healthy:

Can the team name the top 5 flakiest tests this week? (If not: no visibility.)
Is per-test flake rate tracked over time, and trending toward zero?
Does every quarantined test have an owner and a deadline?
Is the quarantine list shrinking month over month?
Are retries recorded and reported, not silent?
When CI goes red, is the first team reaction "investigate," not "rerun"?

The last one is the leading indicator. If the team reaches for "rerun" by reflex, no amount of dashboards will save the suite.

Flake Management

On this page