Reporting
Surfacing failures so they're diagnosable in 30 seconds — formats, artifacts, retention, and where the message lands
A test suite that runs in CI and a test suite that communicates are different things. The first one passes or fails. The second one tells whoever needs to know — within 30 seconds of opening the result — what broke, where, and how to reproduce it.
Most reporting problems aren't about which tool to pick. They're about answering four questions consistently:
- What format? Standardized output the rest of the toolchain can read.
- What artifacts survive the run? Logs, screenshots, traces — kept long enough to be useful.
- Where does the message land? PR comment, dashboard, on-call channel.
- What does the message say? Enough context to act without re-running.
A team that gets these right spends minutes on failures. A team that doesn't spends afternoons.
Formats
The default standard, in every language, is JUnit XML. It's old and awkward, but every CI and analytics platform reads it. Use it.
| Format | Used by | Notes |
|---|---|---|
| JUnit XML | CI dashboards, test analytics, sharding tools | Lingua franca; emit it from every test runner |
| TAP | Older tools, some CLI runners | Mostly legacy; convert to JUnit if you have a choice |
| Allure | Allure report viewer, dashboards | Richer per-test metadata; pair with JUnit |
| Custom JSON | In-house dashboards, ML-based flake detection | Use when standard formats can't carry the metadata you need |
| HTML reports | Human consumption | Output as an artifact; don't parse |
The pattern: emit JUnit unconditionally; layer richer formats on top when you have a consumer for them.
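As a rough illustration of what "emit JUnit" means on the wire, here is a minimal sketch that serializes results into a JUnit-style `<testsuite>` document. The `CaseResult` type and `to_junit_xml` function are hypothetical names for this example, and the attribute set is only the small subset most consumers commonly accept. In practice, prefer your test runner's built-in JUnit reporter where one exists.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    # Hypothetical in-memory record; real runners have their own result types.
    classname: str
    name: str
    seconds: float
    failure_message: Optional[str] = None  # None means the test passed


def to_junit_xml(suite_name: str, results: list[CaseResult]) -> str:
    """Serialize results into a minimal JUnit-style <testsuite> document."""
    failures = [r for r in results if r.failure_message is not None]
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(len(failures)),
        time=f"{sum(r.seconds for r in results):.3f}",
    )
    for r in results:
        case = ET.SubElement(
            suite,
            "testcase",
            classname=r.classname,
            name=r.name,
            time=f"{r.seconds:.3f}",
        )
        if r.failure_message is not None:
            # Consumers typically show `message` in summaries and the
            # element text in the expanded view.
            failure = ET.SubElement(case, "failure", message=r.failure_message)
            failure.text = r.failure_message
    return ET.tostring(suite, encoding="unicode")
```

Writing the returned string to a file and uploading it as a build artifact is usually all a CI dashboard needs to pick it up.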
Artifacts
A failed test that produces only a stack trace gives you 5% of what you need. The artifacts that close the gap, in rough order of importance for each test type:
Unit / integration
- Stack trace with full context lines.
- Diff output for expected vs actual when assertions fail.
- Stdout / stderr from the failing test.
- Logger output captured during the test (with the test's name in each line).
- Coverage data — separately, for the coverage gate.
Browser / E2E
- Screenshot at point of failure.
- Video of the test run (often only on failure, to save space).
- Browser console logs.
- Network HAR file for the test session.
- Playwright trace / Cypress snapshot — single artifact bundling DOM, network, console.
- Server-side logs for the time window of the test.
Mobile
- Screenshot.
- Device logs (logcat for Android, syslog for iOS).
- Crash dumps if the app crashed.
- Test report with steps that ran before the failure.
Performance / load
- Baseline comparison. "P95 was 320ms, baseline is 280ms."
- Trace files for slow requests.
- Resource graphs (CPU, memory, GC) over the test run.
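The baseline comparison in the list above can be sketched as a small function that turns raw latency samples into the one-line summary a reader needs. This is a sketch under assumptions: the nearest-rank percentile method, the 10% default tolerance, and the function names are choices made for this example, not something the text prescribes.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected in the range (0, 100]."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compare_to_baseline(
    samples_ms: list[float], baseline_p95_ms: float, tolerance: float = 0.10
) -> tuple[bool, str]:
    """Return (regressed, summary) for a run against a stored P95 baseline.

    `tolerance` is the allowed relative slowdown; 0.10 is an assumed default.
    """
    p95 = percentile(samples_ms, 95)
    change = (p95 - baseline_p95_ms) / baseline_p95_ms
    summary = (
        f"P95 was {p95:.0f}ms, baseline is {baseline_p95_ms:.0f}ms ({change:+.1%})"
    )
    return change > tolerance, summary
```

Attaching the summary string to the report, rather than only a pass/fail flag, keeps the number and its baseline visible to someone who did not run the test.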
The rule: anyone who didn't run the test must be able to understand why it failed from the artifacts alone. If they need to re-run it, your artifacts are insufficient.
Retention
Long enough to debug; short enough to not pay forever.
A sane default:
- PR-check artifacts: 30 days. Long enough to debug after a merge.
- Main branch artifacts: 90 days. Covers most regression triage windows.
- Release artifacts: 1 year or longer. Audit and rollback evidence.
- Coverage / test-result history: indefinite, but compressed (per-PR summary, not per-test).
- Video / trace files: 7–30 days. They're expensive; archive critical ones manually.
A common mistake: artifacts retained indefinitely. After 6 months you're paying for terabytes of screenshots no one will ever look at. A retention policy is part of the reporting design, not an afterthought.
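One way to keep the policy from becoming an afterthought is to write it down as data that a cleanup job can evaluate. The sketch below encodes the defaults listed above. The category keys and the `is_expired` helper are hypothetical, the video window uses the upper end of the 7–30 day range, and the release window uses the one-year minimum.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention windows per artifact category; None means "keep indefinitely".
# The keys are illustrative names, not identifiers from any CI system.
RETENTION: dict[str, Optional[timedelta]] = {
    "pr": timedelta(days=30),
    "main": timedelta(days=90),
    "release": timedelta(days=365),
    "video": timedelta(days=30),
    "history": None,
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """Return True when an artifact of this kind is older than its window."""
    ttl = RETENTION[kind]
    if ttl is None:
        return False
    return now - created_at > ttl


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_expired("pr", now - timedelta(days=45), now))
```

Keeping the table in version control also answers the "does someone know what the policy is?" question: the policy is whatever the file says.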
Where Messages Land
The hierarchy of places to put a test result, from most to least urgent:
On the PR / commit itself
The most important surface. The developer who pushed is looking at this page. What they need:
- Pass / fail summary with counts and elapsed time.
- Names of failing tests — clickable to expand.
- For each failure: the assertion that failed, the diff, a link to artifacts.
- Diff coverage if there's a gate.
What kills it: a bot comment that's 400 lines long with every passing test enumerated. Hide passes by default; show failures.
On the merged commit / build dashboard
Post-merge runs need a home where the team can see "is main healthy?" at a glance:
- A status board (green/red per recent commit).
- A drill-down to test results per build.
- Trend graphs for test count, runtime, flake rate.
In the on-call channel
For things that need a human now:
- main is red. Page the team or rotation.
- Nightly perf regression past threshold. Ticket + alert.
- Critical-path test broken. Stronger signal than a routine failure.
What does not go here: every PR failure. The on-call channel becoming noisy is how teams learn to mute it.
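The routing rule can be made explicit in code so that "what pages a human" is a reviewed decision rather than a side effect of bot configuration. The following is a minimal sketch. The `FailureEvent` fields and the destination names (including `oncall_escalation` for critical-path failures) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    # Hypothetical event shape; field names are illustrative only.
    branch: str                            # "main" or a PR branch name
    critical_path: bool = False            # failing test guards a critical flow
    nightly_perf_regression: bool = False  # perf threshold exceeded overnight


def destinations(event: FailureEvent) -> list[str]:
    """Return the surfaces a failure should be sent to."""
    if event.nightly_perf_regression:
        return ["ticket", "oncall_alert"]
    if event.branch != "main":
        # PR failures stay on the PR so the on-call channel stays quiet.
        return ["pr_comment"]
    targets = ["build_dashboard", "oncall_page"]
    if event.critical_path:
        # Assumed convention: critical-path breakage gets a stronger signal.
        targets.append("oncall_escalation")
    return targets
```

Because PR failures return early, no flag on a PR-branch event can leak into the on-call channel by accident.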
In an analytics tool
Aggregated data for trend questions:
- "What's our flake rate this week?"
- "Which tests are slowest?"
- "Which files churn coverage the most?"
Buildkite Test Analytics, Datadog CI Visibility, CircleCI Insights, Trunk — all answer these. The tool matters less than the team actually using the output.
What the Message Says
A failure message is good if a developer reading it cold can:
- Name the test that failed.
- Understand what it expected vs. what it got.
- Form a hypothesis without re-running.
- Find the relevant artifact in one click.
Anti-pattern: "Test failed: timeout after 30s." The developer now has to clone the branch, run the test locally, instrument it, and reproduce. A five-minute debug becomes an hour.
Pattern: "assertion failed at OrderService.test.ts:42: expected order.status === 'paid', got 'pending'. See screenshot: <link>. Full trace: <link>." The five-minute debug stays five minutes.
In CI tooling terms: configure your test reporter to emit context, not just status. Most tools support custom reporters; even a 50-line wrapper pays off across thousands of failures.
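A wrapper of that kind mostly comes down to a formatting function that refuses to emit status without context. Here is a minimal sketch of one. The `Failure` type, its fields, and the output layout are hypothetical, and wiring it into a specific runner's reporter hook is left out because that API differs per tool.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    # Hypothetical record a custom reporter might assemble per failed test.
    test_name: str
    location: str   # file and line of the failed assertion
    expected: str
    actual: str
    artifacts: dict[str, str] = field(default_factory=dict)  # label -> URL


def format_failure(f: Failure) -> str:
    """Render one failure with enough context to form a hypothesis."""
    lines = [
        f"FAILED {f.test_name}",
        f"  assertion failed at {f.location}: "
        f"expected {f.expected}, got {f.actual}",
    ]
    # One line per artifact so each is reachable in a single click.
    for label, url in f.artifacts.items():
        lines.append(f"  {label}: {url}")
    return "\n".join(lines)
```

The output answers the four checks above in order: test name first, expected versus actual second, artifact links last.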
PR Comment Patterns
The bot that posts to a PR is the highest-leverage surface. Conventions that work:
- Update the same comment, don't post new ones. Five "tests failed" comments on a PR drown the conversation.
- Collapse by default. Show only what's failing. Pass count + runtime in the header is enough.
- Group by file or by failure pattern. Twenty failures all caused by one teardown bug should appear as one group.
- Link, don't paste. Long logs go to artifacts; the comment links to them.
- Include "what to do next." A failure caused by a known flake should say "this is in the flake list, see <dashboard>." A real failure should say "see Flake Management if you suspect non-determinism, otherwise debug locally with: <command>."
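Several of these conventions (update in place, collapse by default, group by cause) can be sketched in a few lines. The example below assumes the PR host renders HTML `<details>` blocks in comments and that existing comments are available as dictionaries with `id` and `body` keys. The marker string and both function names are hypothetical.

```python
import html
from typing import Optional

# Hidden marker used to recognize the bot's own comment on later runs.
MARKER = "<!-- test-report-bot -->"


def render_comment(
    passed: int, runtime_s: float, groups: dict[str, list[str]]
) -> str:
    """Render a comment with a one-line header and one collapsed group per cause."""
    failed = sum(len(tests) for tests in groups.values())
    lines = [MARKER, f"**{failed} failed, {passed} passed in {runtime_s:.0f}s**"]
    for cause, tests in sorted(groups.items()):
        summary = f"{html.escape(cause)} ({len(tests)})"
        lines.append(f"<details><summary>{summary}</summary>")
        lines.append("")
        lines.extend(f"- {name}" for name in tests)
        lines.append("")
        lines.append("</details>")
    return "\n".join(lines)


def find_existing(comments: list[dict]) -> Optional[int]:
    """Return the id of the bot's earlier comment, or None if there is none."""
    for comment in comments:
        if MARKER in comment["body"]:
            return comment["id"]
    return None
```

When `find_existing` returns an id, the bot edits that comment instead of posting a new one; passing tests appear only as a count in the header.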
Anti-Patterns
Logs by default. Every test pipes all output to CI logs. The failure message is buried in 10,000 lines of debug noise.
No artifact on success. Tests that "passed" but exercised slow paths leave no trace; later, you can't tell whether a flake is new or always existed.
Two reporting systems. One in CI, one in a third-party dashboard, with different definitions of "passed." Reconciling them costs more than either provides.
Notification fatigue. Every PR build status goes to Slack. The team mutes the channel within a week.
Reports that require login to view. A PR reviewer who has to authenticate to a separate dashboard to see a failure will assume it's fine.
Screenshots only on the last failure. When 5 tests fail in sequence, only the last has visual evidence — but the first one was the cause.
Pre-merge Checklist
Before declaring reporting healthy:
- Can a developer who didn't write the failing test understand the failure from the PR comment alone, without re-running?
- Are screenshots / traces / logs attached to failed runs automatically?
- Is there a retention policy, and does someone know what it is?
- Does the PR comment update in place, not append?
- Does the on-call channel only get messages a human should act on right now?
- When the team asks "is main healthy," is there one place to look?
If the answer to the first question is "they'll have to check it out and run it," your reporting is a status code, not a report.