Outbox Pattern
The reliable bridge between a database and a message bus — solving the dual-write problem with a transactional outbox and a relay worker
A service that owns its own database and publishes events runs into a question on the first day: if I update my database and publish an event, what happens when one succeeds and the other fails? The naive answer (write to the database, then publish the event) produces a system that loses events on crash. The other naive answer (publish the event, then write to the database) produces a system that emits events for things that did not happen.
The outbox pattern is the standard solution. It is the most operationally important pattern in event-driven systems: every multi-service architecture eventually needs it, and the architectures that skip it eventually have a "we lost an event somewhere" incident that motivates retrofitting it.
The Dual-Write Problem
A service receives a command, updates its state, and needs other services to know. The temptation:
function placeOrder(req) {
  db.insert("orders", req)            // [1] write to database
  broker.publish("OrderPlaced", req)  // [2] publish event
}
Failure modes:
- Crash between [1] and [2]: order is in the database; no event published. Downstream services do not know. Inventory is not reserved, payment never charged, shipping not scheduled.
- Network failure on [2]: order is in the database; event publish failed. Same outcome.
- Reorder for "safety": publish first, then write. Now if [2] succeeds and [1] fails, downstream services react to a phantom order. Worse outcome.
- Wrap in a transaction: there is no transaction that spans a database and a message broker. Distributed transactions (2PC) across these two are possible in principle, awful in practice.
This is the dual-write problem: two systems, each of which can fail independently, must agree on whether something happened. The outbox pattern is the standard way out.
The Outbox Pattern
Add a table to the same database as the application state. When the application writes business state, it writes a corresponding row to the outbox in the same transaction. A separate worker reads outbox rows and publishes them to the broker.
function placeOrder(req) {
  begin transaction
    db.insert("orders", req)
    db.insert("outbox", {
      eventType: "OrderPlaced",
      payload: req,
      published: false
    })
  commit
}

// separate worker, runs continuously
function outboxRelay() {
  while true {
    unpublished = db.query(
      "SELECT * FROM outbox WHERE published = false LIMIT 100"
    )
    for row in unpublished {
      broker.publish(row.eventType, row.payload)
      db.update("outbox", { published: true }, where id = row.id)
    }
    sleep(100ms)
  }
}
The key insight: the business write and the outbox write are atomic (one database transaction). If either fails, both fail. Once committed, the event is guaranteed to be in the outbox, and the worker will eventually publish it.
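The pseudocode above can be made concrete with a small runnable sketch. This one uses Python's built-in sqlite3 module as a stand-in for the service database and a plain callable as a stand-in for the broker; the table layout, column names, and the place_order / relay_once helpers are illustrative assumptions, not a prescribed implementation.

```python
import json
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL);
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0
);
"""


def place_order(conn, sku, qty):
    """Insert the order and its outbox row in one transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        cur = conn.execute("INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty))
        order_id = cur.lastrowid
        payload = json.dumps({"order_id": order_id, "sku": sku, "qty": qty})
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", payload),
        )
    return order_id


def relay_once(conn, publish, batch_size=100):
    """Publish one batch of unpublished rows; return how many were fetched."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for row_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried later.
        publish(event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
sent = []  # stands in for the broker
place_order(conn, "widget", 2)
relay_once(conn, lambda event_type, payload: sent.append((event_type, payload)))
```

A real worker would call relay_once in a loop with a sleep between batches, as in the pseudocode.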
The worker can crash between broker.publish and db.update, which means the same event may be published twice. This is acceptable — downstream consumers must be idempotent anyway. At-least-once delivery plus idempotent consumers equals effectively-once processing. See Exactly-Once Semantics for the full pattern.
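One common way to make a consumer idempotent is to record each processed event identifier in the same transaction as the side effect. The sketch below assumes every event carries a unique identifier (for example, the outbox row id); the processed_events and reservations tables and the handle_order_placed function are hypothetical names used only for illustration.

```python
import sqlite3

consumer_db = sqlite3.connect(":memory:")
consumer_db.executescript("""
CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
CREATE TABLE reservations (order_id INTEGER NOT NULL, qty INTEGER NOT NULL);
""")


def handle_order_placed(db, event_id, order_id, qty):
    """Apply the event once; return False if it was already processed."""
    try:
        with db:  # dedup marker and side effect commit or roll back together
            db.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
            db.execute(
                "INSERT INTO reservations (order_id, qty) VALUES (?, ?)",
                (order_id, qty),
            )
    except sqlite3.IntegrityError:
        # Primary-key conflict: this event id was seen before, so skip it.
        return False
    return True


first = handle_order_placed(consumer_db, "evt-1", order_id=1, qty=2)
duplicate = handle_order_placed(consumer_db, "evt-1", order_id=1, qty=2)
```

The duplicate delivery is absorbed by the primary-key conflict, so the reservation is written only once.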
What the Outbox Pattern Solves
- Atomicity between business state and event emission. The two cannot diverge.
- Reliability of event publishing. The broker can be down for hours; events accumulate in the outbox, and the worker publishes them once the broker is reachable again.
- Decoupling of write path from broker. The user-facing request returns as soon as the database transaction commits. The broker publish happens asynchronously.
- A natural audit log. The outbox table records every event the service ever emitted; useful for debugging and replay.
Implementation Options
Polling Worker
The simplest implementation: a worker polls the outbox table for unpublished rows and publishes them. The code above.
Pros:
- Simple. The worker can be a single process or a Kubernetes deployment with one replica.
- No infrastructure beyond the database and broker that already exist.
- Easy to reason about.
Cons:
- Database load — frequent SELECTs.
- Latency — events wait for the next poll cycle.
Tuning: poll interval (100ms - 1s common), batch size (50-500 typical). Index the outbox on published (filtered/partial index) and use FOR UPDATE SKIP LOCKED to allow multiple worker replicas if needed.
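The indexing and batching advice can be sketched as follows. The partial index is demonstrated with sqlite3, which supports the same idea; the FOR UPDATE SKIP LOCKED query is Postgres syntax and is included only as text, since it cannot run on SQLite. The schema and the fetch_batch helper are assumptions for illustration.

```python
import sqlite3

# Postgres flavour, shown as text only (not executed here). Run it inside a
# transaction: locked rows stay claimed until that transaction ends, and
# other replicas skip them instead of waiting.
CLAIM_BATCH_POSTGRES = """
SELECT id, event_type, payload
FROM outbox
WHERE published = false
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0
);
-- Partial index: only unpublished rows are indexed, so it stays small
-- even when the table holds months of published history.
CREATE INDEX outbox_unpublished ON outbox (id) WHERE published = 0;
""")


def fetch_batch(conn, batch_size):
    """Return the ids of the oldest unpublished rows, up to batch_size."""
    return conn.execute(
        "SELECT id FROM outbox WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()


with conn:
    conn.executemany(
        "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
        [("OrderPlaced", "{}")] * 5,
    )
    conn.execute("UPDATE outbox SET published = 1 WHERE id IN (1, 3)")

batch = fetch_batch(conn, 2)
```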
Change Data Capture (CDC)
Instead of polling, attach a CDC tool to the database's write-ahead log. Tools like Debezium watch the WAL and emit a stream for every committed row change. Configure CDC to follow the outbox table and forward events to the broker.
Pros:
- Low latency — events flow as fast as the WAL is read.
- No polling queries against the outbox table (the CDC tool reads the same log the database already writes for recovery and replication).
- Scales naturally to high event volume.
Cons:
- More infrastructure — CDC is a real operational component.
- Schema-coupling — CDC sees the table shape; renaming the outbox table breaks it.
- Requires database WAL access — not always available (some managed databases restrict it).
Many high-scale systems (notably in the Confluent ecosystem) use this variant where polling load would be problematic.
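To make the CDC flow more concrete, here is a rough sketch of the routing step that sits between the change stream and the broker. The event shape is loosely modelled on the change envelope Debezium emits (an op code, an after image of the row, and source metadata), but the exact field layout, the outbox column names, and the route_change_event function are assumptions; real change events carry more metadata, and Debezium also ships an outbox event router that performs this step for you.

```python
import json


def route_change_event(change, outbox_table="outbox"):
    """Return (topic, message) for an outbox insert, or None to skip it."""
    if change.get("source", {}).get("table") != outbox_table:
        return None  # change to some other table
    if change.get("op") != "c":
        # Only inserts ("c" for create) represent new events; updates and
        # deletes on the outbox are housekeeping and are ignored here.
        return None
    row = change["after"]
    return row["event_type"], json.loads(row["payload"])


# Hypothetical change event for a freshly inserted outbox row.
sample_insert = {
    "op": "c",
    "source": {"table": "outbox"},
    "before": None,
    "after": {
        "id": 42,
        "event_type": "OrderPlaced",
        "payload": json.dumps({"order_id": 7}),
    },
}
sample_delete = {
    "op": "d",
    "source": {"table": "outbox"},
    "before": {"id": 42},
    "after": None,
}

routed = route_change_event(sample_insert)
skipped = route_change_event(sample_delete)
```

Because the CDC stream replaces the published flag, the application in this variant only ever inserts into the outbox; cleanup deletes are simply ignored by the router.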
Listen / Notify
Some databases (Postgres) support pub/sub-style notifications through LISTEN / NOTIFY. A trigger on the outbox table fires NOTIFY on insert; the worker LISTENs and publishes. Lower latency than polling, fewer moving parts than CDC.
Pros:
- Low latency.
- Lighter infrastructure than CDC.
Cons:
- Database-specific (Postgres has it; others vary).
- Notifications can be lost if no listener is connected — must still poll as a backstop.
- Trigger logic in the database is a thing to maintain.
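The "notify, but still poll as a backstop" behaviour can be sketched without a database. Here a threading.Event stands in for an incoming notification; in a real deployment a listener on the database connection would set it. The relay_loop function and its parameters are hypothetical. The point of the sketch is that a notification is only a hint to wake up early, and the outbox table remains the source of truth.

```python
import threading


def relay_loop(wakeup, drain_outbox, stop, poll_interval=1.0):
    """Drain the outbox on every notification, or every poll_interval at the latest."""
    while not stop.is_set():
        # Returns early when a notification arrives; otherwise times out,
        # which is the polling backstop for lost notifications.
        wakeup.wait(timeout=poll_interval)
        wakeup.clear()
        drain_outbox()  # always queries the table, never trusts the hint alone


calls = []
stop = threading.Event()
wakeup = threading.Event()


def drain():
    # Stand-in for "select unpublished rows and publish them".
    calls.append(len(calls))
    if len(calls) == 3:
        stop.set()  # end the demo after three drains


wakeup.set()  # simulate a single notification
relay_loop(wakeup, drain, stop, poll_interval=0.01)
```

Only the first drain is triggered by the simulated notification; the other two happen on the timeout, which is what keeps events flowing if a notification is missed.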
Variants
Outbox per Aggregate vs Shared Outbox
A shared outbox table is simplest. Per-aggregate outboxes (one per major entity type) allow independent processing and isolation, at the cost of more tables. Most systems start with a shared outbox.
Inline Publish with Outbox Fallback
For latency-sensitive paths, some implementations always write the outbox row inside the transaction, then publish synchronously right after the commit and mark the row as published on success. The worker publishes from the outbox only if the inline publish failed or the process crashed before it finished. This reduces normal-case latency; the outbox is the safety net.
This adds complexity but is appropriate where the latency budget is tight.
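A minimal sketch of this variant, again with sqlite3 as a stand-in database and a callable as the broker. The schema, place_order_inline, and broken_broker are illustrative assumptions. Note that the relay may pick up a row before the inline path marks it as published, so duplicates remain possible and consumers still need to be idempotent.

```python
import json
import sqlite3

# Hypothetical schema; names are illustrative only.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT NOT NULL, qty INTEGER NOT NULL);
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0
);
"""


def place_order_inline(conn, publish, sku, qty):
    """Commit order + outbox row, then try to publish immediately."""
    with conn:
        order_id = conn.execute(
            "INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty)
        ).lastrowid
        event = {"order_id": order_id, "sku": sku, "qty": qty}
        outbox_id = conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps(event)),
        ).lastrowid
    # The transaction is committed at this point; the publish is best effort.
    try:
        publish("OrderPlaced", event)
    except Exception:
        return order_id  # row stays unpublished; the relay worker will retry
    with conn:
        conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (outbox_id,))
    return order_id


def broken_broker(event_type, payload):
    raise ConnectionError("broker unavailable")


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
delivered = []
place_order_inline(conn, lambda t, p: delivered.append(p), "widget", 1)
place_order_inline(conn, broken_broker, "gadget", 3)
pending = conn.execute("SELECT payload FROM outbox WHERE published = 0").fetchall()
```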
Cleanup
Outbox rows are kept after publishing for some retention period (for audit, replay, debugging). Old rows are deleted by a periodic cleanup job. Without cleanup, the table grows monotonically.
Retention is a policy: 7-90 days is common. Use a partial index on published = false so that already-published rows waiting for cleanup do not slow the worker's query.
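A cleanup job along these lines might look like the sketch below. It assumes a published_at column holding Unix epoch seconds, which the outbox table shown earlier does not have; that column, the 30-day default, and the purge_published function are all illustrative. Deleting in small batches keeps each transaction short.

```python
import sqlite3

SECONDS_PER_DAY = 86_400


def purge_published(conn, now, retention_days=30, batch_size=1000):
    """Delete published rows older than the retention window, in batches."""
    cutoff = now - retention_days * SECONDS_PER_DAY
    deleted = 0
    while True:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM outbox WHERE id IN ("
                "  SELECT id FROM outbox"
                "  WHERE published = 1 AND published_at < ? LIMIT ?)",
                (cutoff, batch_size),
            )
        if cur.rowcount == 0:
            break
        deleted += cur.rowcount
    return deleted


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type   TEXT NOT NULL,
    published    INTEGER NOT NULL DEFAULT 0,
    published_at INTEGER  -- epoch seconds; NULL until published (assumed column)
);
""")
now = 100 * SECONDS_PER_DAY
with conn:
    conn.executemany(
        "INSERT INTO outbox (event_type, published, published_at) VALUES (?, ?, ?)",
        [
            ("OrderPlaced", 1, 10 * SECONDS_PER_DAY),  # published long ago
            ("OrderPlaced", 1, 95 * SECONDS_PER_DAY),  # published recently
            ("OrderPlaced", 0, None),                  # never published
        ],
    )

removed = purge_published(conn, now, retention_days=30)
```

Unpublished rows are never deleted regardless of age, so a long broker outage cannot cause the cleanup job to drop events.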
When You Need It
Almost always, if your system:
- Owns its own database and emits events to other services.
- Cannot tolerate "wrote to DB but did not publish" or "published but did not write" outcomes.
- Operates in a microservices or event-driven architecture.
Concretely: any service that does dual writes. The bar to introducing the pattern is the moment you have two systems to update from one operation.
When You Do Not Need It
- Single-writer monolith. If everything is in one database, regular transactions suffice.
- Read-only services. Nothing to dual-write.
- Logging / metrics / non-critical events. If losing events is acceptable, you can skip the outbox. ("Acceptable" usually means analytics or telemetry, never business state.)
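- You are using a system that builds it in. Kafka transactional producers plus idempotent consumers, certain stream processors, and some workflow engines provide equivalent guarantees through different mechanisms. Note that Kafka transactions are atomic only across Kafka topics and offsets, so they replace the outbox only when the state being updated also lives in Kafka.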
Common Mistakes
- Skipping the pattern entirely. "We'll just be careful." This is the failure mode that motivates introducing the outbox eventually.
- Outbox in a different database. If the outbox is in a different database than the business state, the dual-write problem returns. They must share a transaction.
- Non-idempotent consumers. The worker can publish the same event twice. Consumers that do not detect duplicates produce duplicate side effects.
- No retention policy. The outbox grows forever, performance degrades over months.
- Ignoring ordering. Within a single aggregate, events should typically be published in order. A worker with multiple replicas reading the outbox concurrently can publish out of order. Use a processed_at column ordered by aggregate-id, or single-replica processing per aggregate.
- Schema coupling for CDC variants. A schema migration on the outbox breaks the CDC pipeline. Coordinate migrations carefully or use schemaless payloads (JSON column).
- Treating the outbox as an event store. It is a staging table for outgoing events, not a permanent event log. If you want event sourcing, see Event Sourcing.
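The ordering mistake is the subtlest of these, so here is one possible sketch of a single-replica relay that preserves order within an aggregate. It assumes an aggregate_id column on the outbox and uses the auto-incrementing id as the publish order; both are assumptions, as is the relay_in_order function. When a publish fails, later events for the same aggregate are held back so they cannot overtake the failed one, while other aggregates keep flowing.

```python
import sqlite3


def relay_in_order(conn, publish, batch_size=100):
    """Publish unpublished rows in id order; return the ids that were published."""
    rows = conn.execute(
        "SELECT id, aggregate_id, event_type FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    blocked = set()  # aggregates with a failed publish in this batch
    published = []
    for row_id, aggregate_id, event_type in rows:
        if aggregate_id in blocked:
            continue  # an earlier event for this aggregate is still pending
        try:
            publish(aggregate_id, event_type)
        except Exception:
            blocked.add(aggregate_id)
            continue
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        published.append(row_id)
    return published


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    aggregate_id TEXT NOT NULL,
    event_type   TEXT NOT NULL,
    published    INTEGER NOT NULL DEFAULT 0
);
""")
with conn:
    conn.executemany(
        "INSERT INTO outbox (aggregate_id, event_type) VALUES (?, ?)",
        [("order-1", "OrderPlaced"), ("order-2", "OrderPlaced"), ("order-1", "OrderPaid")],
    )

log = []
failures_left = {"order-1": 1}  # simulate one transient failure for order-1


def flaky_publish(aggregate_id, event_type):
    if failures_left.get(aggregate_id, 0) > 0:
        failures_left[aggregate_id] -= 1
        raise ConnectionError("broker unavailable")
    log.append((aggregate_id, event_type))


first_pass = relay_in_order(conn, flaky_publish)
second_pass = relay_in_order(conn, flaky_publish)
```

With several relay replicas, the same guarantee needs each aggregate to be handled by exactly one replica at a time, for example by partitioning on the aggregate id.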
Relation to Other Patterns
- Idempotency — the consumer-side requirement that makes outbox work.
- Exactly-Once Semantics — outbox is the standard implementation pattern.
- Saga Pattern — every saga step that emits events uses outbox.
- Event-Driven Architecture — outbox is the infrastructure that makes event-driven publishing reliable.
- CQRS — outbox can drive read-model projections in CQRS systems.
Further Reading
- Chris Richardson, Microservices Patterns (2018), Section 3.2 — the canonical write-up.
- Gunnar Morling, Reliable Microservices Data Exchange With the Outbox Pattern (Debezium blog, 2019) — the CDC variant explained thoroughly.
- Pat Helland, Life Beyond Distributed Transactions (2007) — the philosophical underpinning of "use outboxes, not 2PC."
- Sam Newman, Building Microservices (2nd ed.), Chapter 4 — covers outbox in the event-driven context.
- Microservices.io, Pattern: Transactional Outbox — short reference.
Pre-commit Checklist
- For every service that writes to its database and publishes events, is the outbox pattern in place — or is dual-write inconsistency tolerated (and how)?
- Is the outbox table in the same database as the business state it accompanies?
- Are downstream consumers idempotent, so duplicate publishes do not produce duplicate side effects?
- Is there a retention / cleanup job for old outbox rows?
- For ordering-sensitive events, is the worker preserving order within an aggregate?
- For CDC variants, are schema migrations coordinated with the CDC pipeline?
- Is the worker monitored — backlog depth, processing rate, failure count?
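For the monitoring item, backlog depth and the age of the oldest unpublished row are usually the two most telling numbers. The sketch below assumes a created_at column holding Unix epoch seconds; that column and the outbox_backlog function are hypothetical, and the result would be exported to whatever metrics system is already in use.

```python
import sqlite3


def outbox_backlog(conn, now):
    """Return backlog depth and the age in seconds of the oldest unpublished row."""
    depth, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE published = 0"
    ).fetchone()
    oldest_age = 0 if oldest is None else now - oldest
    return {"depth": depth, "oldest_age_seconds": oldest_age}


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE outbox (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    published  INTEGER NOT NULL DEFAULT 0,
    created_at INTEGER NOT NULL  -- epoch seconds (assumed column)
);
""")
with conn:
    conn.executemany(
        "INSERT INTO outbox (event_type, published, created_at) VALUES (?, ?, ?)",
        [("OrderPlaced", 1, 100), ("OrderPlaced", 0, 200), ("OrderPlaced", 0, 250)],
    )

metrics = outbox_backlog(conn, now=300)
```

A steadily growing oldest_age_seconds is typically the earliest sign that the worker is stuck or the broker is unreachable, even when the depth still looks small.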