Steven's Knowledge

E2E Testing

A thin layer of tests that prove the whole stack works — chosen carefully, structured to stay stable, and never the place for business logic

End-to-end tests run the whole system the way a user does — browser to backend to database and back. They're the only layer that proves your deployed product, in its actual configuration, can complete a real task.

They're also the layer that fails most teams. Too many of them, and the suite is slow, flaky, and expensive to maintain; merges back up behind random failures and no one trusts a green run anyway. Too few, and you discover at 3am that login has been broken for a day because every layer below the browser passes its own tests.

The framing this page uses: E2E is a smoke detector, not a microscope. A small, well-chosen set proves the system is alive. A large, indiscriminate set proves nothing and rots the team's relationship with the test suite.

For where E2E sits in the larger picture, see Testing Strategy. This page is the practice.

What E2E Tests Should Cover

The right answer is short and finite:

  • Critical user journeys. Sign up. Log in. Place an order. Cancel a subscription. Pay an invoice. The actions that, if broken, mean the product isn't working.
  • Cross-system glue. The handful of paths where browser, frontend, API, database, and a third party (auth, payments) all have to agree.
  • Deployment smoke. A few seconds of "the app boots, the homepage loads, login works" run against every environment after deploy.

What E2E should not cover:

  • Business rule combinatorics. Twenty discount-tier permutations? Unit tests.
  • Validation messages. Component or integration tests.
  • Every API field. Contract tests.
  • Every page renders correctly. Component snapshot or visual regression tests.
  • Edge cases of error handling. Integration tests at the boundary they happen.

A rule of thumb: if you can write the assertion as a unit or integration test, do that. E2E is for the things that can only be tested by exercising the full stack.
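
As a sketch of what "do that" can look like for business rule combinatorics: the permutations go into a table-driven unit test that runs in milliseconds and never opens a browser. The `discountFor` function, its tiers, and the minimum-order rule are hypothetical, invented only to have something to test.

```typescript
// Hypothetical pricing rule, invented for illustration only.
type Tier = "none" | "silver" | "gold";

function discountFor(tier: Tier, orderTotal: number): number {
  const percent = tier === "gold" ? 20 : tier === "silver" ? 10 : 0;
  // Assumed rule: discounts only apply to orders of 50 or more.
  return orderTotal >= 50 ? (orderTotal * percent) / 100 : 0;
}

// Each row is one permutation: [tier, order total, expected discount].
const cases: Array<[Tier, number, number]> = [
  ["none", 100, 0],
  ["silver", 100, 10],
  ["gold", 100, 20],
  ["gold", 49, 0],
  ["silver", 50, 5],
];

// Runs every row and collects a readable message for each mismatch.
function runDiscountCases(): string[] {
  const failures: string[] = [];
  for (const [tier, total, expected] of cases) {
    const actual = discountFor(tier, total);
    if (actual !== expected) {
      failures.push(`${tier}/${total}: expected ${expected}, got ${actual}`);
    }
  }
  return failures;
}
```

Adding a twenty-first permutation here is one more row. In an E2E suite it would be one more browser session.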

How Many

There is no universal number, but there are useful brackets:

  • 5–20 E2E tests covering critical journeys and smoke is healthy for most products.
  • 50+ E2E tests is suspicious — almost always business logic that should be lower.
  • 200+ E2E tests is broken — the team has stopped writing real unit tests because "we have E2E."

When the count grows, the right question isn't "how do we make E2E faster?" — it's "which of these tests are doing a job that belongs lower in the pyramid?"

Smoke vs. Full E2E

Two distinct usages, run on different cadences:

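Type     | What runs                                             | When                   | Time budget
---------|-------------------------------------------------------|------------------------|------------
Smoke    | The 3–5 most critical journeys                        | Every deploy; every PR | 1–2 min
Full E2E | All critical journeys, all browsers, all key personas | Nightly; pre-release   | 15–60 min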

The mistake to avoid: running full E2E on every PR. The 30-minute wait kills the loop; flakes block merges; trust evaporates. Run a smoke subset on PRs, full coverage nightly. See Pipeline Shape for the staging.
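
One way to express the split, sketched as the body of a Playwright config: a "smoke" project that only picks up tests whose title carries a tag, and full projects that run everything. The `@smoke` tag, the project names, and the `./e2e` directory are assumptions; `projects`, `grep`, and `use.browserName` are Playwright config options.

```typescript
// Tag that marks a test as part of the smoke subset (an assumed convention).
const smokeTag = /@smoke/;

// Sketch of a playwright.config.ts body.
const config = {
  testDir: "./e2e", // hypothetical location of the E2E tests
  projects: [
    {
      // PRs and deploys: only tests whose title matches the smoke tag.
      name: "smoke",
      grep: smokeTag,
      use: { browserName: "chromium" },
    },
    // Nightly and pre-release: every test, on each browser engine.
    { name: "full-chromium", use: { browserName: "chromium" } },
    { name: "full-firefox", use: { browserName: "firefox" } },
    { name: "full-webkit", use: { browserName: "webkit" } },
  ],
};

// In a real playwright.config.ts this object would be the default export:
//   export default config;
```

With this shape, a PR job could run `npx playwright test --project=smoke`, and a test opts in by putting the tag in its title, for example `test('user can log in @smoke', ...)`.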

Locators: The Source of Most Flake

If E2E tests are flaky, the most common cause is locators — the selectors used to find elements on the page. Ranked from worst to best:

Locator                                 | Why
----------------------------------------|------------------------------------------------
xpath('//div[3]/span[2]/button')        | Breaks the moment any DOM around it changes
.btn.btn-primary.btn-large              | Breaks the moment styling is reorganized
text("Submit Order")                    | Breaks on i18n, copy changes, A/B tests
[aria-label="Submit order"]             | Breaks if accessibility labels are restructured
getByRole("button", { name: "Submit" }) | Breaks less; reflects user intent
data-testid="submit-order"              | Stable; explicitly added for tests

The principle: the locator should reflect intent that doesn't change when implementation does. A test for "submit the order" should not break because someone changed the button color.

Test IDs (data-testid, or a framework's equivalent) are the most reliable option. The complaint that "test IDs pollute production code" is usually wrong — they're a few bytes per element and they're the difference between a stable suite and a flaky one.
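
A minimal sketch of keeping test IDs in one place, so a renamed ID is a one-line change and a typo is a compile error. The ID names are hypothetical, and the helper is plain string building rather than anything a framework provides.

```typescript
// Hypothetical test IDs for a checkout page; the names are assumptions.
const testIds = {
  submitOrder: "submit-order",
  cartCount: "cart-count",
  orderSummary: "order-summary",
} as const;

type TestIdKey = keyof typeof testIds;

// Builds the CSS attribute selector for a known test ID.
function byTestId(key: TestIdKey): string {
  return `[data-testid="${testIds[key]}"]`;
}
```

In a Playwright test this could be used as `page.locator(byTestId("submitOrder"))`. Playwright also ships `page.getByTestId("submit-order")`, which makes the helper unnecessary if the central map is all that's wanted.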

Page Objects (and When to Skip Them)

A Page Object wraps a page's selectors and interactions behind a class, so tests are written in business terms.

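import { test, expect } from '@playwright/test';

class CheckoutPage {
  constructor(page) { this.page = page; }  // the Playwright page under test
  async addToCart(item) { ... }
  async proceedToPayment() { ... }
  async enterCard(card) { ... }
  async confirmOrder() { ... }
  confirmation() { ... }  // returns the locator for the confirmation message
}

test('user can complete checkout', async ({ page }) => {
  const checkout = new CheckoutPage(page);
  await checkout.addToCart(item);  // item and testCard: test data defined elsewhere
  await checkout.proceedToPayment();
  await checkout.enterCard(testCard);
  await checkout.confirmOrder();
  await expect(checkout.confirmation()).toBeVisible();
});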

The good: tests read like flows. Selectors are centralized. UI changes update one place.

The bad: when overdone, you end up with a custom framework on top of your test framework. Page Objects with 40 methods, half of which are used once.

Use them when:

  • A flow is reused across tests.
  • Selectors are non-obvious and benefit from a name.
  • The team has more than one E2E author and needs consistency.

Skip them when:

  • A test is a one-off journey.
  • The page is a single form with three fields.

Playwright's fixtures offer a lighter alternative: a fixture can hand each test a small, preconfigured helper instead of a heavy Page Object hierarchy, and that is often the better default. The point is centralizing selectors, not building an OO hierarchy.
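
A sketch of what "thin" can mean in practice: a function that returns a few named flows over a page, with no inheritance and no state of its own. `PageLike` and `LocatorLike` are minimal local stand-ins for the parts of Playwright's `Page` and `Locator` used here, so the sketch stays self-contained; the `Card` shape and the test IDs are hypothetical.

```typescript
// Minimal stand-ins for the Playwright methods this helper relies on.
interface LocatorLike {
  click(): Promise<void>;
  fill(value: string): Promise<void>;
}
interface PageLike {
  getByTestId(testId: string): LocatorLike;
}

// Hypothetical card shape.
interface Card {
  number: string;
  expiry: string;
  cvc: string;
}

// A thin helper: named flows, selectors in one place, no class hierarchy.
function checkoutFlow(page: PageLike) {
  return {
    async enterCard(card: Card): Promise<void> {
      await page.getByTestId("card-number").fill(card.number);
      await page.getByTestId("card-expiry").fill(card.expiry);
      await page.getByTestId("card-cvc").fill(card.cvc);
    },
    async confirmOrder(): Promise<void> {
      await page.getByTestId("submit-order").click();
    },
  };
}
```

In a Playwright suite, a helper like this could be handed to tests through `test.extend`, so each test receives `checkout` already bound to its own page.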

Waits: The Source of the Other Flakes

The second most common flake source is bad waits. Categories:

Fixed sleeps (always wrong)

await page.click('text=Submit');
await sleep(2000);  // fixed-delay helper; hopes the page has loaded by now
expect(await page.textContent('body')).toContain('Thank you');

sleep(2000) is too long on a fast machine and too short on a slow one. It's flake by design.

Auto-waits (Playwright, modern Cypress)

await page.click('text=Submit');
await expect(page.locator('text=Thank you')).toBeVisible({ timeout: 5000 });

The driver polls until the condition holds or the timeout expires. This is what you want — wait for the condition you actually care about, not a fixed duration.

Explicit waits for specific signals

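// Start listening before the click so a fast response is not missed.
const orderResponse = page.waitForResponse(r => r.url().endsWith('/orders'));
await page.click('text=Submit');
await orderResponse;
await expect(page.locator('text=Thank you')).toBeVisible();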

For situations where you need to wait for a specific network call to complete, not just for DOM to settle.

The rule: never sleep. Always wait on a condition. If you find yourself unable to identify a condition, that's a sign the app's loading states aren't observable — fix the app, not the test.
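
To make the difference from a fixed sleep concrete, here is a simplified model of condition-based waiting: poll, return as soon as the condition holds, fail loudly at the deadline. It is an illustration of the idea, not how Playwright or Cypress implement it, and the function name and defaults are assumptions.

```typescript
// Polls `condition` until it returns true or `timeoutMs` elapses.
async function waitForCondition(
  condition: () => boolean | Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    // Return immediately once the condition holds: no wasted time on fast machines.
    if (await condition()) return;
    // Fail with a clear message instead of letting a later assertion fail confusingly.
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeoutMs} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The timeout is an upper bound, not a duration: a fast run pays for one poll, a slow run gets the full budget, and neither outcome depends on guessing the right number of milliseconds.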

Test Data for E2E

Three patterns, with sharp differences:

Created per test via API

The test makes an authenticated API call to create what it needs (user, product, order) before exercising the UI.

  • Fastest.
  • Most isolated.
  • Requires test-only API endpoints or admin credentials.

This is the default for serious E2E suites.
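
A minimal sketch of the pattern, assuming a test-only endpoint at `/test-api/users` that returns the new user's `id`. The endpoint, the payload fields, and the `PostJson` transport type are all hypothetical; the transport is injected so the same helper works with whatever HTTP client the suite uses.

```typescript
// Hypothetical payload and transport; real endpoints and fields will differ.
interface NewUser {
  email: string;
  password: string;
}
type PostJson = (path: string, body: unknown) => Promise<{ id: string }>;

let sequence = 0;

// Builds a user no other test shares, so parallel tests do not step on each other.
function uniqueUser(testName: string): NewUser {
  sequence += 1;
  const slug = testName.toLowerCase().replace(/[^a-z0-9]+/g, "-");
  const random = Math.random().toString(36).slice(2, 8);
  const suffix = `${Date.now()}-${sequence}-${random}`;
  return { email: `${slug}-${suffix}@example.com`, password: `pw-${suffix}` };
}

// Creates the user through the assumed test-only endpoint before the UI is touched.
async function createTestUser(post: PostJson, testName: string) {
  const user = uniqueUser(testName);
  const { id } = await post("/test-api/users", user);
  return { id, ...user };
}
```

In a Playwright test, `post` could be a small wrapper around the built-in `request` fixture, for example `request.post(path, { data: body })` followed by parsing the JSON response.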

Pre-seeded fixtures

The test environment has a known set of users and data; tests use these.

  • Simple to start.
  • Tests step on each other under parallelism.
  • Fixture rot: data that nobody remembers the purpose of.

Acceptable for read-only smoke tests; problematic for anything that mutates.

Built through the UI

The test logs in, navigates, fills forms — all to get the system into the state it actually wants to test.

  • Slowest by orders of magnitude.
  • Highest flake surface area.
  • A test for "checkout" runs through 5 minutes of setup before the actual checkout starts.

Avoid except for the rare case where the UI flow is what's being tested.

Browsers, Devices, Viewports

Coverage decisions cost real time:

  • One browser, one viewport: runs in minutes. Catches functional bugs.
  • Chrome + Firefox + Safari: triples the time. Catches a small number of browser-specific issues.
  • Plus mobile viewports: doubles again. Catches responsive bugs.
  • Plus real devices (BrowserStack, Sauce Labs): another order of magnitude. Catches mobile-specific bugs (touch, IME, real iOS Safari).

Practical defaults:

  • PR smoke: one browser, one viewport (Chrome desktop). Minutes.
  • Nightly: 3 browsers × 2 viewports. Tens of minutes.
  • Pre-release: add real devices for the top 3 mobile platforms.

What kills teams: running the full matrix on every PR. That's an entire afternoon for one merge.
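
The multiplication behind those defaults, as a rough cost model. The suite durations are illustrative numbers, not measurements, and the model assumes every browser/viewport pair reruns the whole suite and that work divides evenly across parallel jobs.

```typescript
// Estimated wall-clock minutes for running a suite across a browser/viewport matrix.
// Ignores startup overhead; a back-of-the-envelope figure only.
function matrixMinutes(
  suiteMinutes: number,
  browsers: number,
  viewports: number,
  parallelJobs = 1,
): number {
  const combinations = browsers * viewports;
  return (suiteMinutes * combinations) / parallelJobs;
}

// Illustrative inputs: a 2-minute smoke subset and a 5-minute full suite.
const prSmoke = matrixMinutes(2, 1, 1); // 2 minutes: one browser, one viewport
const nightly = matrixMinutes(5, 3, 2); // 30 minutes: 3 browsers x 2 viewports
const nightlySharded = matrixMinutes(5, 3, 2, 3); // 10 minutes across 3 parallel jobs
```

Parallel jobs reduce the wait, not the bill: the sharded nightly run still consumes 30 minutes of CI compute.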

Where E2E Tests Run

The environment matters more than for unit/integration:

  • Local dev: developers should be able to run individual E2E tests against their dev environment with one command. If they can't, they won't.
  • Ephemeral PR environment: the gold standard. A fresh app stack per PR. See Test Data & Environments.
  • Shared staging: acceptable as long as tests isolate their data (per-test users) and tolerate other tests running.
  • Production: never — full E2E in prod corrupts user data. Synthetic monitoring (read-only checks running continuously) is a different thing; it lives in observability, not the test suite.
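
A small guard can enforce the "never production" rule rather than leaving it to convention. This sketch assumes the target is passed in an `E2E_BASE_URL` environment variable and that the production hostnames are known up front; both the variable name and the hostnames are hypothetical.

```typescript
// Hypothetical production hostnames; replace with the real ones.
const PRODUCTION_HOSTS = new Set(["www.example.com", "app.example.com"]);

// Resolves the E2E target and refuses to point a mutating suite at production.
function resolveBaseUrl(env: Record<string, string | undefined>): string {
  // Default to a local dev server so running one test locally needs no setup.
  const raw = env.E2E_BASE_URL ?? "http://localhost:3000";
  const url = new URL(raw);
  if (PRODUCTION_HOSTS.has(url.hostname)) {
    throw new Error(`Refusing to run E2E tests against production: ${url.hostname}`);
  }
  return url.origin;
}
```

Called as `resolveBaseUrl(process.env)` from the test config, the result can feed the runner's base URL setting (`use.baseURL` in Playwright), so the same tests run unchanged against local, ephemeral, and staging environments.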

Common Failure Modes

"Login through the UI 100 times per test run"

Every test starts with await page.goto('/login'); fill; fill; click. Cumulatively, login becomes the slowest thing in the suite. Save auth state once, reuse it:

// once, before all tests
await loginAsUser(page, 'test@example.com');  // hypothetical helper that performs the login
await page.context().storageState({ path: 'auth.json' });

// per test
const context = await browser.newContext({ storageState: 'auth.json' });

Tests that assert on the wrong layer

await page.click('text=Add to cart');
await expect(page.locator('[data-cart-count]')).toHaveText('1');

What you actually care about: is the item in the cart, so the user can check out with it? Asserting on the cart-count badge tests the badge, not the cart. Better: try to check out, see the item in the order summary.

One giant E2E that tests "everything"

test('user can do all the things', async () => {
  // 200 lines: sign up, browse, search, add to cart,
  // remove, add again, checkout, view order, contact support, cancel
});

When this test fails — and it will, often — you don't know what broke. Five tests covering five flows fail independently and tell you what's wrong.

Conditional logic in tests

if (await page.locator('.promo-banner').isVisible()) {
  await page.click('.dismiss');
}
await page.click('text=Submit');

The test is now non-deterministic — it does different things based on what the app shows. Real symptoms (a banner that shouldn't be there) get swallowed. Make the test setup deterministic so the banner isn't there in the first place.

Retries hiding real bugs

The CI is configured to retry failed E2E up to 3 times. A genuine regression that's intermittent (a race condition that's 60% reproducible) passes on the third try. Production breaks because, surprise, it's 60% reproducible there too.

See Flake Management for the discipline; the short version is: log retries, don't hide them.
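
To put a number on the retry example, assuming each attempt fails independently with the same probability: a bug that fails 60% of the time still goes green within three attempts roughly 78% of the time. The sketch below shows that arithmetic plus one way to keep retries visible, by reporting "passed only after a retry" as its own outcome. The function names and outcome labels are assumptions.

```typescript
// Chance that a test failing with probability `failRate` per attempt
// goes green at least once within `attempts` independent tries.
function chanceOfGreen(failRate: number, attempts: number): number {
  return 1 - Math.pow(failRate, attempts);
}

type Outcome = "passed" | "flaky" | "failed";

// Classifies a test from its attempt results, in order (true = attempt passed).
// "flaky" means it only went green after at least one failed attempt.
function classify(attemptResults: boolean[]): Outcome {
  if (!attemptResults.some((passed) => passed)) return "failed";
  return attemptResults[0] ? "passed" : "flaky";
}

const threeTries = chanceOfGreen(0.6, 3); // about 0.78
const fourTries = chanceOfGreen(0.6, 4); // about 0.87
```

A report that counts "flaky" separately from "passed" turns a silently retried regression into a visible trend instead of a green checkmark.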

Pre-commit Checklist

Before an E2E test goes in:

  • Could this be a unit or integration test instead?
  • Does it cover something the team would consider a production incident if it broke?
  • Are the locators stable (test IDs or role-based selectors), not implementation-coupled?
  • Does it wait on conditions, never on sleep?
  • Does it set up its own data via API, not by clicking through the UI?
  • Does it isolate from other tests (own user, own data)?
  • If it fails in CI, do the artifacts (screenshot, video, trace) explain why?

If most answers are "kind of," the test will rot. E2E coverage that nobody trusts is negative coverage — it costs CI time and signals nothing.
