Steven's Knowledge

Exactly-Once Semantics

Why it does not exist over a lossy channel, what "effectively-once" actually delivers, and how the major systems claim it without lying

"Exactly-once delivery" is the most marketed feature in distributed messaging that does not exist. Kafka offers "exactly-once semantics." AWS SQS has "exactly-once FIFO." RabbitMQ has "publisher confirms." Each of these provides something useful, but none of them is the literal-meaning thing that the name suggests. The mismatch between the marketing and the math is the source of an entire category of production bugs.

This page is about what is and is not provable, what real systems actually deliver, and the pattern that lets you build correct systems anyway: at-least-once delivery + idempotent processing = effectively-once.

The Impossibility

Strict exactly-once delivery — every message arrives at the receiver and is processed exactly one time, no more and no less — is impossible over a lossy channel between independent processes. This is essentially the Two Generals Problem: if either side can fail or messages can be lost, neither side can be certain a particular message was processed unless an acknowledgement was received, and the acknowledgement itself can be lost.

Sender                                    Receiver
   │── message ──▶                            │
                                              │  processes message
   │              ◀── ack ──────             │
   │                                          │
   ?  Did the ack get lost? Did the message get lost?
   ?  Did the receiver process and crash before acking?
   ?  Cannot distinguish these from the sender's perspective.

The sender has three options:

  • Retry on no ack → may deliver the same message twice (at-least-once).
  • Do not retry → may lose the message if the original delivery failed (at-most-once).
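  • Both retry and somehow guarantee single processing → impossible over a lossy channel, by the Two Generals argument above. (FLP is a related but distinct impossibility result about consensus in asynchronous systems.)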

This is the formal result. Marketing aside, no system can do better than this on a real network.
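
The first two options can be illustrated with a small simulation. This is a minimal sketch, not a model of any real transport: the function names, the 30% loss rate, and the retry cap are all assumptions chosen for illustration. It counts how often the receiver processes one logical message when the sender does or does not retry.

```python
import random

def send_over_lossy_channel(retry: bool, loss_rate: float, rng: random.Random,
                            max_attempts: int = 20) -> int:
    """Return how many times the receiver processed one logical message."""
    processed = 0
    attempts = max_attempts if retry else 1
    for _ in range(attempts):
        if rng.random() >= loss_rate:      # the message survived the channel
            processed += 1                 # the receiver processes it
            if rng.random() >= loss_rate:  # the ack survived as well
                break                      # only now may the sender stop
        # No ack seen: the sender cannot tell whether the message or the ack was lost.
    return processed

def run_trials(retry: bool, trials: int = 10_000, seed: int = 0) -> dict:
    rng = random.Random(seed)
    counts = [send_over_lossy_channel(retry, 0.3, rng) for _ in range(trials)]
    return {
        "lost": sum(c == 0 for c in counts),
        "once": sum(c == 1 for c in counts),
        "duplicated": sum(c > 1 for c in counts),
    }

at_most_once = run_trials(retry=False)   # some messages lost, none duplicated
at_least_once = run_trials(retry=True)   # none lost in practice, some duplicated
```

Without retries the "duplicated" bucket stays empty and the "lost" bucket does not; with retries the situation reverses. Neither configuration keeps both buckets empty.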

The Practical Definitions

There are three delivery guarantees real systems offer:

Guarantee       Promise                                     Risk
At-most-once    Each message delivered zero or one times.   Messages can be lost.
At-least-once   Each message delivered one or more times.   Duplicates.
Exactly-once    Each message processed exactly once.        Impossible in delivery; achievable in processing.

When systems claim "exactly-once," they almost always mean effectively-once processing: at-least-once delivery from the broker, combined with idempotent or transactional handling at the consumer, so the observable effect is as if each message were processed once.

Whether that meets your need depends entirely on the consumer side. Without an idempotent or transactional handler, at-least-once delivery produces duplicate effects — and "exactly-once" claims from the broker do not save you.
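
To make the consumer-side difference concrete, here is a minimal sketch of a naive handler next to an idempotent one. The class names and message IDs are hypothetical, and the in-memory set stands in for what would need to be a durable store updated atomically with the effect in a real system.

```python
class NaiveConsumer:
    """Applies every delivery, so a redelivered message is applied twice."""
    def __init__(self) -> None:
        self.balance = 0

    def handle(self, message_id: str, amount: int) -> None:
        self.balance += amount


class IdempotentConsumer:
    """Remembers processed message IDs and skips any it has seen before."""
    def __init__(self) -> None:
        self.balance = 0
        self.seen_ids = set()  # assumption: a durable table in a real system

    def handle(self, message_id: str, amount: int) -> bool:
        if message_id in self.seen_ids:
            return False           # duplicate delivery: no effect
        self.balance += amount
        self.seen_ids.add(message_id)
        return True


# At-least-once delivery: message "m1" arrives twice.
deliveries = [("m1", 100), ("m2", 50), ("m1", 100)]
naive, idempotent = NaiveConsumer(), IdempotentConsumer()
for message_id, amount in deliveries:
    naive.handle(message_id, amount)
    idempotent.handle(message_id, amount)
```

After the loop the naive consumer has applied 250 and the idempotent one 150, which is the observable difference between at-least-once and effectively-once.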

How Real Systems Achieve It

Kafka

Kafka's "exactly-once semantics" (EOS), introduced in 0.11, works inside a Kafka-to-Kafka pipeline:

  • Idempotent producer — each producer is assigned a unique ID; each message gets a sequence number. The broker rejects duplicate sequence numbers, so producer retries do not create duplicates.
  • Transactional producer — produces to multiple partitions and updates consumer offsets in a single atomic transaction. A consumer reading "committed" messages only sees fully-committed transactions.
  • read_committed isolation — consumers ignore messages from open or aborted transactions.

This is genuinely exactly-once within Kafka: messages produced as part of a transaction are either all visible or none. But the moment you cross out of Kafka — to a database, an external API, an email — you are back to at-least-once and need idempotency.
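
The idempotent-producer idea can be sketched as a toy partition log. This is an illustration of the sequence-number check only, under simplifying assumptions: it is not Kafka's implementation, and the class and method names are invented for the example. Real brokers also deal with batches, producer epochs, and persistence.

```python
class ToyPartition:
    """Toy append-only log that deduplicates producer retries by sequence number."""

    def __init__(self) -> None:
        self.log = []
        self.last_seq = {}  # producer_id -> last sequence number appended

    def append(self, producer_id: int, seq: int, value: str) -> str:
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"  # a retry of something already in the log
        if seq != last + 1:
            raise ValueError("out-of-order sequence number")
        self.log.append(value)
        self.last_seq[producer_id] = seq
        return "appended"


partition = ToyPartition()
results = [
    partition.append(producer_id=7, seq=0, value="a"),
    partition.append(producer_id=7, seq=0, value="a"),  # retry after a lost ack
    partition.append(producer_id=7, seq=1, value="b"),
]
```

The retry is acknowledged but not appended, so the log holds each value once even though the producer sent "a" twice.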

AWS SQS FIFO

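FIFO queues use deduplication IDs: messages with the same ID within a 5-minute window are silently deduplicated. The producer is responsible for choosing IDs (content-hash or explicit). Within that window, a retried send does not add a second copy to the queue. That is deduplication on the producer side, not literal exactly-once delivery: a consumer that fails to delete a message before its visibility timeout expires receives it again.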

Limitations: the 5-minute window means duplicates outside that range still get through; the consumer still needs idempotent processing for ack failures; FIFO throughput is much lower than standard SQS.
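
The window limitation is easy to see in a toy model. The sketch below is not the SQS API; the class, the helper, and the explicit `now` parameter are assumptions made so the behaviour can be shown without a real queue or a real clock. The content-hash ID uses SHA-256 of the body.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60

def content_dedup_id(body: str) -> str:
    """Derive a deduplication ID from the message body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ToyFifoQueue:
    """Toy queue that drops sends whose dedup ID was accepted within the window."""

    def __init__(self) -> None:
        self.messages = []
        self.accepted_at = {}  # dedup_id -> time the ID was last accepted

    def send(self, dedup_id: str, body: str, now: float) -> bool:
        seen = self.accepted_at.get(dedup_id)
        if seen is not None and now - seen < DEDUP_WINDOW_SECONDS:
            return False                 # inside the window: silently dropped
        self.accepted_at[dedup_id] = now
        self.messages.append(body)
        return True


queue = ToyFifoQueue()
body = "debit:7:100"
dedup_id = content_dedup_id(body)
results = [
    queue.send(dedup_id, body, now=0),    # first send is accepted
    queue.send(dedup_id, body, now=60),   # retry inside the window is dropped
    queue.send(dedup_id, body, now=301),  # retry after the window gets through
]
```

The third send lands as a second copy of the same logical message, which is why the consumer still needs its own idempotency check.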

RabbitMQ

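Publisher confirms give you "the broker has durably received this message." Combined with the mandatory flag, durable queues, and persistent messages, this provides at-least-once durability. Add idempotent consumers and you get effectively-once. RabbitMQ offers some help with deduplication: a publisher-set message-id property, a redelivered flag on deliveries, and a delivery count (the x-delivery-count header) on quorum queues.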

RabbitMQ does not market "exactly-once," which is more honest than most.

Stream Processors

Stream processors achieve end-to-end exactly-once via transactional sinks plus checkpointing:

  • The processor periodically checkpoints its state (Chandy-Lamport-style distributed snapshots).
  • On failure, it rewinds to the last checkpoint and replays input.
  • For the output to be exactly-once, the sink must either be transactional and commit as part of the checkpoint (Kafka-to-Kafka via transactions) or be idempotent, so that replayed writes leave the same result.

If the sink is a database with idempotent upserts, this works. If the sink is "send an email," it does not — the email gets sent on every replay, and the system can only achieve effectively-once for the database write, not the email.
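
The difference between the two sinks shows up in a small replay simulation. This is a sketch under stated assumptions, not how any particular stream processor is built: the function, the event data, and the single injected crash are all hypothetical.

```python
def run_with_one_crash(events, sink, checkpoint_every=2, crash_after=3):
    """Feed events to sink, crash once, rewind to the last checkpoint, replay."""
    checkpoint = 0   # input offset saved by the last completed checkpoint
    position = 0     # next input offset to process
    crashed = False
    while position < len(events):
        sink(events[position])
        position += 1
        if position % checkpoint_every == 0:
            checkpoint = position
        if not crashed and position == crash_after:
            crashed = True
            position = checkpoint  # recovery: replay input after the checkpoint


events = [(f"order-{n}", "paid") for n in range(1, 6)]

# Idempotent sink: an upsert keyed by order ID.
table = {}
def upsert_sink(event):
    key, value = event
    table[key] = value

# Non-idempotent sink: every call has an external effect.
emails = []
def email_sink(event):
    emails.append(event[0])

run_with_one_crash(events, upsert_sink)
run_with_one_crash(events, email_sink)
```

The crash happens after the third event but the last checkpoint covers only two, so the third event is replayed. The table still ends with five rows, while the email list contains "order-3" twice.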

The Effectively-Once Pattern

The pattern that works, everywhere:

  1. At-least-once delivery at the transport layer. Producers retry, brokers persist, consumers can fail and reprocess. This is the cheap and correct foundation.
  2. Idempotent processing at the consumer. See Idempotency. The consumer can safely process the same message twice and produce the same external effect once.
  3. Transactional state updates where both the message offset and the application state must agree. The classic pattern: process the message, write the result and the new offset in the same database transaction. If the transaction commits, both happen; if not, neither happens and the next attempt sees the old offset.

BEGIN TRANSACTION;
  -- record the message first; processed_messages.id must be a PRIMARY KEY or
  -- UNIQUE column, so a redelivered message fails here before any effect is applied
  INSERT INTO processed_messages (id) VALUES ('msg-9c2f...');
  -- apply the effect of message M
  UPDATE accounts SET balance = balance - 100 WHERE id = 7;
COMMIT;  -- on any error the client must issue ROLLBACK instead

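The id column of processed_messages, which must carry a primary-key or unique constraint, acts as the idempotency key. On retry, the INSERT fails with a unique-constraint violation. The handler must then roll back rather than commit, so the side effect is not re-applied. Some databases abort the whole transaction on such an error; others fail only the statement and leave the rollback to the client.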
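
The transaction above can be sketched in runnable form with Python's built-in sqlite3 module. The table layout mirrors the SQL; the starting balance, the message ID "msg-001", and the function name are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE processed_messages (id TEXT PRIMARY KEY);
    INSERT INTO accounts (id, balance) VALUES (7, 500);
""")

def apply_debit(conn, message_id: str, account_id: int, amount: int) -> bool:
    """Apply the debit once per message_id; return False for a duplicate."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO processed_messages (id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # primary-key violation: this message was already processed


first = apply_debit(conn, "msg-001", 7, 100)   # applied
second = apply_debit(conn, "msg-001", 7, 100)  # redelivery, rejected
balance = conn.execute("SELECT balance FROM accounts WHERE id = 7").fetchone()[0]
```

Inserting the message ID before the update means a duplicate fails before any effect is attempted, and the `with conn:` block rolls back whatever did run.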

The Outbox Pattern (Cross-Service)

When the consumer must produce a downstream message and update local state atomically, the outbox pattern is the standard answer:

BEGIN TRANSACTION;
  UPDATE accounts SET balance = balance - 100 WHERE id = 7;
  INSERT INTO outbox (event_type, payload) VALUES ('debit', '...');
COMMIT;

[separately] A worker reads outbox rows and publishes them to a broker,
            then marks the row as published. The publish is at-least-once
            (the worker can crash after publishing but before marking),
            so downstream consumers must be idempotent.

This guarantees that the downstream event is recorded if and only if the local state change committed, because both rows are written in one transaction. The worker then publishes the event at least once, and the downstream system can use the event's ID as an idempotency key.
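
A minimal end-to-end sketch of the outbox and its relay worker, again using sqlite3. The `event_id` and `published` columns, the function names, and the simulated crash flag are assumptions added for the example; the broker is replaced by a plain list.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
    CREATE TABLE outbox (
        event_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        published  INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO accounts (id, balance) VALUES (7, 500);
""")

def debit(conn, account_id: int, amount: int) -> None:
    """Change local state and record the event in one transaction."""
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("debit", json.dumps({"account": account_id, "amount": amount})),
        )

def relay_once(conn, publish, crash_before_mark: bool = False) -> None:
    """Publish unpublished outbox rows, then mark each one as published."""
    rows = conn.execute(
        "SELECT event_id, event_type, payload FROM outbox "
        "WHERE published = 0 ORDER BY event_id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, payload)
        if crash_before_mark:
            return  # simulated crash: published but never marked
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )

published = []  # stands in for the broker
debit(conn, 7, 100)
relay_once(conn, lambda *event: published.append(event), crash_before_mark=True)
relay_once(conn, lambda *event: published.append(event))  # restart republishes

delivered_ids = [event[0] for event in published]
pending = conn.execute("SELECT COUNT(*) FROM outbox WHERE published = 0").fetchone()[0]
```

The same event_id reaches the broker twice because the first worker run crashed between publishing and marking. A downstream consumer that deduplicates on event_id sees one effect.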

What People Actually Mean

When you read "exactly-once" in a system's documentation, decode it:

  • "Exactly-once delivery" — almost always means at-least-once delivery + deduplication, sometimes only within a time window.
  • "Exactly-once semantics" — Kafka's term; means idempotent, transactional production plus read_committed consumption within Kafka. Crossing out of Kafka requires more work.
  • "Exactly-once processing" — the only honest version; means effectively-once via idempotency or transactions.
  • "At-most-once" — increasingly rare; usually fire-and-forget metrics or analytics.

Treat all "exactly-once" claims as "effectively-once if you build the consumer correctly." That is what is on offer.

Common Mistakes

  • Trusting "exactly-once" at the broker without idempotent consumers. The broker can deduplicate as much as it wants; if your consumer's side effect is at-least-once, you have duplicate side effects.
  • Putting external side effects (email, payment, webhook) inside a "transactional" stream operator. The framework cannot transact across external systems. Either move the side effect into a downstream consumer that uses idempotency keys, or accept at-least-once for that effect.
  • Using SQS FIFO as if its dedup window were unlimited. Duplicates beyond 5 minutes get through; the consumer still needs idempotency.
  • Ignoring the producer side. A retrying producer that does not have a deduplication ID can publish the same logical message twice, even if the broker is perfect. Producer-side idempotency keys are part of the system.
  • Outbox pattern without consumer idempotency. The outbox guarantees the event is produced at least once, and the worker can publish the same row twice; the consumer must still handle duplicates.

Further Reading

  • Akkoyunlu, Ekanadham & Huber, Some Constraints and Tradeoffs in the Design of Network Communications (1975) — the first published description of the problem that Jim Gray later named the Two Generals Problem (1978).
  • Confluent, Exactly-Once Semantics Are Possible: Here's How Kafka Does It (2017) — the engineering writeup. Read with the understanding that it scopes to Kafka-to-Kafka.
  • Kleppmann, Designing Data-Intensive Applications, Chapter 11 — the most accessible practitioner treatment.
  • Carbone et al., State Management in Apache Flink (VLDB 2017) — how a stream processor implements end-to-end exactly-once.
  • Fowler, Patterns of Distributed Systems: Outbox Pattern — the canonical writeup of the outbox.

Pre-commit Checklist

  • For each "exactly-once" claim in my system docs, can I name what is actually guaranteed (delivery dedup, processing idempotency, transactional state)?
  • For every external side effect (email, payment, webhook, downstream service), is the operation idempotent or does it have a deduplication key?
  • For multi-step pipelines, am I using transactional state + offset commits, or the outbox pattern, where needed?
  • For SQS FIFO or any system with a dedup window, is the window longer than my worst-case retry latency?
  • Have I tested duplicate delivery (replay a message after a crash) end-to-end, not just unit tests?
