Steven's Knowledge

Parallelization & Sharding

Cutting wall-clock time without losing determinism — splitting the suite, isolating shared state, and balancing shards as the suite grows

The most reliable way to make a slow test suite fast is to make it run on more machines at once. The unreliable part is doing that without breaking determinism. Most teams who try parallelization for the first time discover their tests have been quietly relying on shared state, ordering, or implicit singletons — and the suite starts flaking the moment it's split.

The framing: parallelization is a distribution problem, but it's mostly an isolation problem in disguise.

Two Levels of Parallelism

Parallelism happens at two levels, and conflating them causes most of the pain:

Level | What's parallel | Controlled by
Process / worker | Multiple test files running in separate processes on one machine | Test runner (Jest, Vitest, Pytest-xdist, RSpec parallel)
Shard / job | The suite split across multiple CI jobs / machines | CI system (matrix strategy, sharding plugin)

Both compound: 4 shards × 4 workers per shard = 16x parallelism. A 32-minute suite drops to ~2 minutes in the ideal case: tests isolated cleanly, work balanced evenly, and negligible per-shard setup.

The order to enable them: process-level first, since it surfaces shared-state bugs without the cost of extra CI minutes. Then shards.

What Breaks When You Split

Tests that pass serially and fail under parallel almost always touch one of these:

  • Shared databases. Two tests in two processes write to the same row. One reads a value the other just deleted.
  • Shared filesystems. Tests writing to /tmp/test.log step on each other.
  • Shared ports. Two integration tests bind localhost:8080. Second one fails.
  • Module-level singletons. A cache populated by the first test in a process leaks into the second.
  • Global mocks. Date.now patched globally affects every test that runs in that worker.
  • External services. A real third-party sandbox throttles when N workers all hit it at once.

The fix template, in order of preference:

  1. Isolate at the resource level. Each worker gets its own database/schema/port/temp dir, named by process.env.JEST_WORKER_ID or equivalent.
  2. Reset between tests. beforeEach clears the state. Works for cheap state, expensive for databases.
  3. Serialize the dangerous subset. Some tests fundamentally can't run in parallel (e.g. they test global config). Mark them; run them in a single worker.
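As a minimal sketch of option 1, the helper below derives per-worker resource names from the worker ID. The function name, the `app_test_w` prefix, the base port, and the temp-dir layout are all illustrative assumptions; only the JEST_WORKER_ID variable comes from the text above.

```typescript
// Hypothetical helper: map a worker ID to resources that no other worker uses.
interface WorkerResources {
  workerId: number;
  databaseName: string;
  port: number;
  tempDir: string;
}

function workerResources(
  env: Record<string, string | undefined>,
  basePort = 8080, // assumed base; pick a range that is free on your machines
): WorkerResources {
  // Jest numbers workers "1", "2", ...; fall back to 1 when the variable is unset.
  const parsed = Number.parseInt(env["JEST_WORKER_ID"] ?? "1", 10);
  const workerId = Number.isNaN(parsed) ? 1 : parsed;
  return {
    workerId,
    databaseName: `app_test_w${workerId}`,
    port: basePort + workerId,
    tempDir: `/tmp/test-w${workerId}`,
  };
}
```

In a setup file this would typically be called as workerResources(process.env); taking the environment as a parameter keeps the helper easy to test.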

Mocks and patches need scoping as well: a global vi.spyOn(Date, 'now') set in test A leaks to test B if they share a process and the spy is never restored. Install patches per test and restore them in teardown.
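One way to keep a patch from leaking is to tie its lifetime to a single call. The sketch below is framework-independent and uses a hypothetical helper name, withFixedNow; in Vitest or Jest the usual equivalent is restoring mocks in an afterEach hook.

```typescript
// Run `body` with Date.now pinned to `fixedMs`, then restore the original.
// Synchronous bodies only: an async body would outlive the patch.
function withFixedNow<T>(fixedMs: number, body: () => T): T {
  const originalNow = Date.now;
  Date.now = () => fixedMs;
  try {
    return body();
  } finally {
    // Restore even when the body throws, so the patch cannot leak
    // into the next test that runs in this process.
    Date.now = originalNow;
  }
}

// Usage: the patched clock is visible only inside the callback.
const stamped = withFixedNow(1_700_000_000_000, () => Date.now());
```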

Sharding Strategies

Once tests are isolation-safe, the question becomes: how do we split work across shards so they finish at the same time?

Bad shards mean one shard takes 12 minutes while three finish in 3. Wall-clock = max(shards), not avg.

File-count splitting

Round-robin tests across shards by file count. Trivial. Wrong almost always: file count correlates poorly with runtime. The 5 longest tests can land in the same shard.

Hash-based splitting

hash(filename) % N. Stable across runs (good for caching). Still ignores runtime.
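A minimal sketch of hash(filename) % N, assuming an FNV-1a style hash over the filename's UTF-16 code units. Any deterministic hash works; the function names here are illustrative.

```typescript
// FNV-1a style 32-bit hash: same input gives the same output on every machine.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force unsigned so the modulo is never negative
}

function shardForFile(filename: string, shardCount: number): number {
  return fnv1a(filename) % shardCount;
}

// Each CI job keeps only the files that hash to its own index.
function filesForShard(files: string[], shardIndex: number, shardCount: number): string[] {
  return files.filter((file) => shardForFile(file, shardCount) === shardIndex);
}
```

Because the assignment depends only on the filename and N, adding a new test file does not move existing files between shards. Changing N reshuffles everything.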

Timing-based splitting

Record per-test (or per-file) duration from the previous successful run; pack shards using a bin-packing heuristic to balance total time.

  • CircleCI, GitHub Actions with actions/upload-artifact, Buildkite test analytics, Knapsack (for Ruby), Nx, Turborepo — all support some flavor of this.
  • The first run is unbalanced. Subsequent runs converge.
  • Stable timings depend on stable infrastructure. If your CI runners vary in CPU, timings drift.
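The bin-packing heuristic can be as simple as "longest first, into the currently lightest shard". The sketch below assumes per-file durations are already available from a previous run; the type and function names are illustrative, not any CI vendor's API.

```typescript
interface FileTiming {
  file: string;
  seconds: number;
}

interface Shard {
  files: string[];
  totalSeconds: number;
}

// Greedy longest-processing-time-first packing.
function packShards(timings: FileTiming[], shardCount: number): Shard[] {
  if (shardCount < 1) throw new RangeError("shardCount must be at least 1");
  const shards: Shard[] = [];
  for (let i = 0; i < shardCount; i++) shards.push({ files: [], totalSeconds: 0 });

  // Longest first; tie-break on filename so the result is deterministic.
  const sorted = [...timings].sort(
    (a, b) => b.seconds - a.seconds || a.file.localeCompare(b.file),
  );

  for (const timing of sorted) {
    // Pick the shard with the smallest total so far (first one wins ties).
    const lightest = shards.reduce((best, s) => (s.totalSeconds < best.totalSeconds ? s : best));
    lightest.files.push(timing.file);
    lightest.totalSeconds += timing.seconds;
  }
  return shards;
}
```

The greedy pass is not optimal in general, but it is usually close enough that the remaining imbalance is dominated by timing noise rather than by the packing.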

Test-aware splitting

The runner inspects the call graph and packs by estimated cost. Bazel and Buck do this. Most teams don't need it.

The right default for most teams: timing-based with a recent baseline. Start with file-count if no infrastructure exists, switch when shards become unbalanced.

Isolating Databases

The single most common parallelization pain point. Patterns, ordered by speed:

Per-worker schema

Each worker gets its own schema inside a single database. Schema name includes worker ID. Migrations run once per schema at setup.

  • Fast (single DB process).
  • Cheap (no extra containers).
  • Doesn't isolate connection state or DB-wide config.

Per-worker database

Each worker gets its own database, in the same engine. More isolated, slightly more overhead.

Per-worker container

Each worker spins up its own database container (Testcontainers, ephemeralPostgres). Maximum isolation, maximum cost — startup latency multiplies.

Per-test transaction

Wrap each test in a transaction; roll back at the end. Resets between tests within a worker — but doesn't help across workers, which still need their own database or schema.

The combination that works for most teams: per-worker database + per-test transactional rollback. Workers don't see each other; tests within a worker don't see each other either.
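A sketch of that combination, written against a deliberately tiny client interface. SqlClient is an assumption made for illustration and not any particular driver's API; the database naming scheme is likewise hypothetical.

```typescript
// Minimal client shape assumed for this sketch; real drivers differ.
interface SqlClient {
  query(sql: string): Promise<void>;
}

// Per-worker database: one name per worker, restricted to identifier-safe characters.
function workerDatabaseName(workerId: string): string {
  return `app_test_${workerId.replace(/[^a-zA-Z0-9_]/g, "_")}`;
}

// Per-test transaction: everything the test writes is rolled back afterwards,
// whether the test body resolves or throws.
async function inRolledBackTransaction<T>(
  client: SqlClient,
  testBody: (client: SqlClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    return await testBody(client);
  } finally {
    await client.query("ROLLBACK");
  }
}
```

The rollback only isolates tests that share the connection passed in. Code under test that opens its own connection, or commits explicitly, escapes the transaction and needs the reset-between-tests approach instead.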

Browser / E2E Parallelization

E2E is where parallelization pays the most and breaks the most. Things to control:

  • Browser instances per worker. One per worker, not shared.
  • Ports / hostnames. Each worker gets its own app instance or its own subdomain.
  • Auth state. Pre-seeded user accounts per worker, not a shared "test user."
  • Data. Each worker creates its own fixtures. Tests don't read data they didn't write.
  • Visual artifacts. Screenshots, videos, traces — name them with the worker/shard ID so collisions don't overwrite.

Playwright's workers config and the --shard flag are the standard pattern. The rule: a test should not care which worker it runs on, and should never see another worker's data.
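Two of the items above, artifacts and auth state, come down to naming things by shard and worker. The sketch below uses hypothetical helpers and an assumed directory layout; in Playwright the worker index would typically come from testInfo.workerIndex, and the shard index from whatever the CI job passes in.

```typescript
type ArtifactKind = "screenshot" | "video" | "trace";

interface WorkerContext {
  shardIndex: number;
  workerIndex: number;
}

// Build a collision-free artifact path: same test title, different worker, different file.
function artifactPath(kind: ArtifactKind, testTitle: string, ctx: WorkerContext): string {
  const extensions: Record<ArtifactKind, string> = {
    screenshot: "png",
    video: "webm",
    trace: "zip",
  };
  const slug = testTitle
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything unsafe into a dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return `artifacts/shard-${ctx.shardIndex}/worker-${ctx.workerIndex}/${slug}.${extensions[kind]}`;
}

// One pre-seeded account per worker instead of a shared "test user".
function workerAccount(ctx: WorkerContext): string {
  return `e2e-s${ctx.shardIndex}-w${ctx.workerIndex}@example.test`;
}
```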

Diminishing Returns

More shards isn't always better. Costs that scale linearly with shard count:

  • Setup time. Each shard re-installs dependencies, re-builds, re-starts services. A 30-second setup × 16 shards = 8 minutes of billed overhead in total, even though each shard's wall-clock only sees 30 seconds of it.
  • CI cost. Each shard is a billed job.
  • Result aggregation complexity. Coverage merging, test report combination, artifact uploads.

The math: a 20-minute suite with 30-second setup, split into 10 shards, runs in ~2.5 minutes (2 minutes of tests plus setup). Splitting into 20 shards runs in ~1.5 minutes: one minute saved for twice as many jobs and about 20% more billed time (30 job-minutes instead of 25). Past a point, you're paying linearly for sublinear gains.

Rule of thumb: shard until the longest shard is roughly 2–3x the setup time. Past that, optimize tests, don't add shards.
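The trade-off is easy to tabulate. This sketch assumes perfectly balanced shards and a fixed setup time per shard, which is the best case; the function name is illustrative.

```typescript
interface ShardEstimate {
  shards: number;
  wallClockMinutes: number; // what a developer waits for
  billedJobMinutes: number; // what CI charges for, summed over shards
}

// Best-case model: every shard pays the full setup, then an equal slice of the suite.
function estimateSharding(suiteMinutes: number, setupMinutes: number, shards: number): ShardEstimate {
  const wallClockMinutes = setupMinutes + suiteMinutes / shards;
  return { shards, wallClockMinutes, billedJobMinutes: wallClockMinutes * shards };
}

// The 20-minute suite with 30-second setup from the text:
const tenShards = estimateSharding(20, 0.5, 10); // wall-clock 2.5, billed 25
const twentyShards = estimateSharding(20, 0.5, 20); // wall-clock 1.5, billed 30
```

Wall-clock time approaches the setup time as shards are added and never drops below it, while billed minutes grow by one setup per extra shard.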

Determinism Audits

A parallel suite that passes most of the time but flakes occasionally is harder to debug than one that's deterministically broken. Periodic audits to keep it honest:

  • Run the suite N times in a row. Any test that fails on a subset of runs is suspect.
  • Reverse the test order. Tests that depend on order will fail.
  • Run with a single worker (for example --workers=1) and with the maximum worker count, then compare. Differences are isolation bugs.
  • Run on cold and warm caches. Tests that depend on cache warmth will diverge.

A team that does this monthly catches flakes before they become normalized.
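The "run it N times" audit only helps if the results are compared mechanically. A minimal sketch, assuming each run has been reduced to a map of test name to pass/fail; how that map is extracted from a runner's report is left out.

```typescript
// One run of the suite: test name -> did it pass?
type RunResults = Record<string, boolean>;

// A test is suspect when it both passed and failed across otherwise identical runs.
function findFlakyTests(runs: RunResults[]): string[] {
  const outcomes = new Map<string, Set<boolean>>();
  for (const run of runs) {
    for (const [test, passed] of Object.entries(run)) {
      const seen = outcomes.get(test) ?? new Set<boolean>();
      seen.add(passed);
      outcomes.set(test, seen);
    }
  }
  return [...outcomes]
    .filter(([, seen]) => seen.size > 1)
    .map(([test]) => test)
    .sort();
}
```

The same comparison works for the single-worker versus many-worker audit: pass the two result sets as two runs and anything returned is a candidate isolation bug.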

Anti-Patterns

Splitting before isolating. Enable parallelism on a suite that's never been audited. Spend the next week debugging "intermittent" failures that were silently order-dependent.

Shared "test user" across workers. Worker 1 logs in, worker 2 logs in with the same account and invalidates the session.

Database truncation between tests with parallel access. Worker 1 truncates the users table mid-test, worker 2's test (already running) starts seeing missing rows.

Sharding by alphabetical filename. Contiguous alphabetical ranges keep related files together, so a directory of slow tests lands in one shard (the early ones, if the slow names happen to start with A).

One shard = one container = one app instance, but every instance reads and writes the same config file on a shared disk. Parallel writes corrupt the config file.

Pre-merge Checklist

Before declaring the suite parallel-safe:

  • Does the suite pass with a single worker (for example --workers=1) and with the maximum worker count?
  • Does the suite pass when test files are shuffled?
  • Does a single test, run in isolation, pass with the same output as run inside the full suite?
  • Are databases / temp dirs / ports / mocks scoped to a worker?
  • Are shards balanced within ~20% of each other, and is the longest shard still at least 2–3x the setup overhead?
  • Do test artifacts (screenshots, logs, coverage) survive aggregation without collisions?

If any of these aren't true, the suite isn't parallel-safe yet — it's parallel-lucky.
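The balance item on the checklist can be automated from per-shard durations. The sketch below takes "within ~20%" to mean that the gap between the slowest and fastest shard is at most 20% of the slowest; that reading, and the function names, are assumptions.

```typescript
// Relative gap between the slowest and the fastest shard, in the range 0..1.
function shardImbalance(shardSeconds: number[]): number {
  if (shardSeconds.length === 0) return 0;
  const slowest = Math.max(...shardSeconds);
  const fastest = Math.min(...shardSeconds);
  return slowest === 0 ? 0 : (slowest - fastest) / slowest;
}

function isBalanced(shardSeconds: number[], tolerance = 0.2): boolean {
  return shardImbalance(shardSeconds) <= tolerance;
}

// One 12-minute shard next to three 3-minute shards is far outside the tolerance.
const lopsided = isBalanced([720, 180, 180, 180]); // false
```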
