Designing operations that are safe to retry — keys, scopes, storage, and the everyday tool that buys you effectively-once

Idempotency

In a distributed system, every network call has three possible outcomes: succeeded, failed, and the response was lost on the way back to you. The third case is not rare — it is the one you must design for. The only honest response to it is to retry. And the only honest response to retries is to make the operation safe to repeat.

That property has a name: idempotency. An operation is idempotent if performing it twice has the same effect as performing it once. Idempotency is the cheapest, most reusable tool for turning at-least-once delivery into something users can trust.

What "Same Effect" Means

Be precise about what is preserved:

State — the system ends in the same state regardless of how many times the operation runs.
Side effects — external effects (emails sent, money charged, webhooks delivered) happen at most once, not "at most twice if you are lucky."
Response — the client sees the same result every time, including the same generated IDs and timestamps. This last point is what people forget; without it, a retry that "succeeds" still confuses the client about what actually happened.

An operation that converges on the same final state but returns a different response on each attempt is not idempotent enough for production, and one that changes state again on retry is worse. A duplicate POST /payments that creates a second payment row and returns the second row's ID has done damage; one that returns the original row's ID has not.

Natural vs Designed Idempotency

Some operations are naturally idempotent:

SET x = 5 — repeating it leaves x = 5.
DELETE /users/42 — after the first call, the second is a no-op.
PUT /config { theme: "dark" } — the resource is replaced with the same value.

Most interesting operations are not:

POST /payments — each call creates a new payment.
INSERT of a row with a generated key.
increment counter by 1.
send email.

For these, idempotency must be designed in. The standard mechanism is an idempotency key.
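The difference between the two lists can be checked mechanically: apply an operation once, apply it twice, and compare the resulting state. The sketch below does that for two of the examples above; `apply_n`, `set_x`, and `increment` are illustrative names, not part of any library.

```python
def apply_n(op, state, n):
    """Return a copy of state after applying op to it n times."""
    result = dict(state)
    for _ in range(n):
        op(result)
    return result

def set_x(state):
    state["x"] = 5            # SET x = 5

def increment(state):
    state["counter"] += 1     # increment counter by 1

start = {"x": 0, "counter": 0}

set_once, set_twice = apply_n(set_x, start, 1), apply_n(set_x, start, 2)
inc_once, inc_twice = apply_n(increment, start, 1), apply_n(increment, start, 2)

print(set_once == set_twice)  # True: naturally idempotent
print(inc_once == inc_twice)  # False: idempotency has to be designed in
```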

Idempotency Keys

The pattern: the client generates a unique key for each logical operation (typically a UUID) and sends it with the request. The server stores the key alongside the result; on retry, it returns the stored result instead of executing again.

First request:
   Client → POST /payments
            Idempotency-Key: 9c2f...e1
            { amount: 100 }
   
   Server: key not seen → execute, store {key → response}, return 201
   Client: connection times out, no response received
   
Retry:
   Client → POST /payments
            Idempotency-Key: 9c2f...e1   ← same key
            { amount: 100 }
   
   Server: key seen → return stored response, do NOT charge again

Stripe's idempotency API is the canonical reference for this pattern; if you are designing one, copy its semantics.
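A minimal sketch of the server side of that exchange, assuming an in-memory dictionary in place of durable storage and a single-threaded server. `PaymentService`, `create_payment`, and the `pay_N` ID format are hypothetical names chosen for illustration; this is not Stripe's implementation.

```python
import uuid

class PaymentService:
    """In-memory stand-in for a payments endpoint (hypothetical)."""

    def __init__(self):
        self.responses = {}   # idempotency key -> stored response
        self.charges = []     # side effects that were actually performed

    def create_payment(self, idempotency_key, amount):
        # Key seen: return the stored response and do NOT charge again.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        # Key not seen: execute, then store {key -> response}.
        payment_id = f"pay_{len(self.charges) + 1}"
        self.charges.append((payment_id, amount))
        response = {"status": 201, "id": payment_id, "amount": amount}
        self.responses[idempotency_key] = response
        return response

service = PaymentService()
key = str(uuid.uuid4())                    # generated once per logical operation

first = service.create_payment(key, 100)   # imagine this response is lost
retry = service.create_payment(key, 100)   # the client retries with the same key

print(first == retry)         # True: same ID, same body
print(len(service.charges))   # 1: the charge happened once
```

The important client-side detail is that the key is generated once per logical operation and reused across retries; generating a fresh key per attempt defeats the mechanism.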

Key Scope

A subtle but important question: what is the key namespaced under? Three reasonable choices:

Per endpoint — the same key on POST /payments and POST /refunds are unrelated. Simple.
Per resource type — keys are unique within "payments," regardless of which endpoint creates one. Useful if multiple endpoints converge on the same resource.
Globally per account — keys are unique within a customer. Catches accidental cross-endpoint reuse.

Stripe uses per-account global scope. The cost is a slightly larger key space; the benefit is that a client cannot accidentally collide by reusing a key across unrelated calls.
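In storage terms, the scope decides which fields go into the lookup key. The sketch below is one way to express the three choices; `storage_key` and its parameters are hypothetical, and it assumes the account ID is always part of the key so that different customers never collide.

```python
def storage_key(scope, account_id, endpoint, resource_type, idempotency_key):
    """Build the lookup key for one of the three scoping choices."""
    if scope == "endpoint":
        return (account_id, endpoint, idempotency_key)
    if scope == "resource":
        return (account_id, resource_type, idempotency_key)
    if scope == "account":
        return (account_id, idempotency_key)
    raise ValueError(f"unknown scope: {scope}")

# The same client key reused on two unrelated endpoints.
payment = ("acct_1", "POST /payments", "payments", "key-123")
refund = ("acct_1", "POST /refunds", "refunds", "key-123")

per_endpoint_collides = storage_key("endpoint", *payment) == storage_key("endpoint", *refund)
per_account_collides = storage_key("account", *payment) == storage_key("account", *refund)

print(per_endpoint_collides)  # False: the two uses are treated as unrelated
print(per_account_collides)   # True: the reuse is visible and can be rejected
```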

Storage and TTL

Idempotency keys cannot live forever. Three rules:

Persist before executing. Write {key → in_progress} to durable storage before doing the work. If the work succeeds, update the record to {key → response}. If the process dies in between, the next retry sees in_progress and can either wait or re-execute under a lock.
Give them a TTL. 24 hours is the common default. Long enough to cover any reasonable retry window (including client-side queued retries), short enough that you are not storing them forever.
Store the response, not just a flag. A "yes I already did this" answer is insufficient — the client needs the same response it would have gotten the first time. Returning a fresh 200 without the original IDs is a quiet bug.
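The three rules can be sketched together as follows. This is an illustration under stated assumptions: a dictionary stands in for durable storage, the clock is injectable so expiry can be exercised, and answering an in-progress key with a 409 is just one of the options named above. `IdempotencyStore`, `begin`, `complete`, and `handle` are hypothetical names.

```python
import time

TTL_SECONDS = 24 * 60 * 60   # the 24-hour default mentioned above

class IdempotencyStore:
    """Dict-backed stand-in for durable storage (hypothetical)."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.records = {}    # key -> {"state", "response", "expires_at"}

    def begin(self, key):
        """Write in_progress before any work; return the existing record if there is one."""
        record = self.records.get(key)
        if record is not None and record["expires_at"] <= self.clock():
            del self.records[key]          # expired: treat the key as unseen
            record = None
        if record is not None:
            return record
        self.records[key] = {
            "state": "in_progress",
            "response": None,
            "expires_at": self.clock() + TTL_SECONDS,
        }
        return None

    def complete(self, key, response):
        """Store the full response, not just a 'seen' flag."""
        self.records[key]["state"] = "done"
        self.records[key]["response"] = response

def handle(store, key, do_work):
    existing = store.begin(key)
    if existing is None:
        response = do_work()
        store.complete(key, response)
        return response
    if existing["state"] == "done":
        return existing["response"]
    # An earlier attempt is still running or died midway (assumed policy: tell the client).
    return {"status": 409, "error": "request in progress"}

now = [0.0]                                   # fake clock for the demonstration
store = IdempotencyStore(clock=lambda: now[0])
calls = []

def work():
    calls.append(1)
    return {"status": 201, "id": f"pay_{len(calls)}"}

a = handle(store, "k1", work)   # executes
b = handle(store, "k1", work)   # replays the stored response
now[0] += TTL_SECONDS + 1
c = handle(store, "k1", work)   # key has expired, so the work runs again
```

The last call shows why the TTL must exceed the worst retry window: once the record expires, a late retry is indistinguishable from a new request.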

What Counts as a "Match"?

If a client sends the same key but a different body, that is almost always a bug — either a client error or an attacker. The server should reject with 409 (or 422), not silently return the stored response. Stripe verifies the request fingerprint matches; copy that.
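One way to implement the match check is to store a hash of the request body next to the response and compare it on every replay. The sketch below assumes JSON bodies and uses 422 for the mismatch; `fingerprint`, `handle_request`, and the in-memory `stored` dictionary are hypothetical, and nothing here is meant to describe how Stripe computes its fingerprint.

```python
import hashlib
import json

def fingerprint(body):
    """Hash a canonical JSON encoding so that key order does not matter."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

stored = {}   # idempotency key -> (fingerprint, response)

def handle_request(key, body, execute):
    fp = fingerprint(body)
    if key in stored:
        saved_fp, saved_response = stored[key]
        if saved_fp != fp:
            # Same key, different body: reject loudly instead of replaying.
            return {"status": 422, "error": "idempotency key reused with a different body"}
        return saved_response
    response = execute(body)
    stored[key] = (fp, response)
    return response

def charge(body):
    return {"status": 201, "id": "pay_1", "amount": body["amount"]}

ok = handle_request("k1", {"amount": 100, "currency": "usd"}, charge)
replay = handle_request("k1", {"currency": "usd", "amount": 100}, charge)
conflict = handle_request("k1", {"amount": 250, "currency": "usd"}, charge)

print(replay == ok)          # True: same body, key order ignored
print(conflict["status"])    # 422
```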

At-Least-Once + Idempotency = Effectively-Once

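Exactly-once delivery does not exist over a lossy channel: a sender cannot distinguish a lost message from a lost acknowledgement, so it must either risk a duplicate or risk a loss (the Two Generals problem is the classic formal statement). What is achievable is effectively-once processing: at-least-once delivery from the producer, combined with idempotent processing at the consumer.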

At-most-once:    you may miss messages           (acceptable for metrics, lossy)
At-least-once:   you may process duplicates      (the default in real systems)
Exactly-once:    impossible over lossy channels  (in the formal sense)
Effectively-once = at-least-once + idempotent handler

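Every "exactly-once" feature in real systems is implemented this way under the hood. Kafka's idempotent producer and transactions deduplicate using producer sequence numbers, and SQS FIFO queues deduplicate on a message deduplication ID within a limited window. RabbitMQ publisher confirms, often mentioned alongside them, are an at-least-once mechanism and leave deduplication to the consumer. These features give you better tools for the idempotent side; they cannot give you exactly-once delivery.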
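The equation can be seen in a few lines. In the sketch below, a hypothetical transport delivers every message more than once and a hypothetical `Ledger` consumer deduplicates on the message ID. It assumes each message carries a stable unique `id`; in a real system the balance update and the record of seen IDs would need to be committed in the same transaction.

```python
class Ledger:
    """Consumer whose handler is idempotent per message ID (hypothetical)."""

    def __init__(self):
        self.balance = 0
        self.seen = set()

    def handle(self, message):
        if message["id"] in self.seen:
            return False                  # duplicate: already applied
        self.balance += message["amount"]
        self.seen.add(message["id"])
        return True

def deliver_at_least_once(messages, handler, copies=2):
    """Simulate a transport that redelivers every message."""
    for message in messages:
        for _ in range(copies):
            handler(message)

messages = [{"id": "m1", "amount": 10}, {"id": "m2", "amount": 5}]

ledger = Ledger()
deliver_at_least_once(messages, ledger.handle)

print(ledger.balance)   # 15: each message took effect once despite two deliveries
```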

Idempotency Patterns Beyond Keys

The key pattern is the most general, but several specialized patterns are worth knowing:

Idempotent inserts via natural unique constraints. If a (user_id, external_payment_id) pair must be unique, the database itself rejects duplicates. You no longer need an idempotency key for this operation; the constraint is the key.
CAS / conditional updates. UPDATE x SET v = 5 WHERE v = 4 is idempotent: the second call matches no rows and does nothing. ETags and If-Match headers are the HTTP version.
Append-only logs with sequence numbers. A consumer that tracks "last applied offset = N" can ignore any message with offset ≤ N. The offset is the implicit idempotency key.
Token-based delegation. Pass a one-time token to an external service; the service rejects reuse. Useful when you cannot control the consumer's idempotency layer.
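The first three patterns are small enough to sketch with the standard library. The example below uses an in-memory SQLite database for the unique constraint and the conditional update, and a plain class for offset tracking. The table layouts and the names `record_payment`, `cas_update`, and `OffsetConsumer` are assumptions made for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE payments (
        user_id INTEGER NOT NULL,
        external_payment_id TEXT NOT NULL,
        amount INTEGER NOT NULL,
        UNIQUE (user_id, external_payment_id)
    )
""")
db.execute("CREATE TABLE x (v INTEGER NOT NULL)")
db.execute("INSERT INTO x (v) VALUES (4)")
db.commit()

def record_payment(user_id, external_payment_id, amount):
    """Insert once; the unique constraint rejects the duplicate."""
    try:
        with db:   # commits on success, rolls back on error
            db.execute(
                "INSERT INTO payments (user_id, external_payment_id, amount) VALUES (?, ?, ?)",
                (user_id, external_payment_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def cas_update():
    """Conditional update: only applies while v is still 4."""
    with db:
        return db.execute("UPDATE x SET v = 5 WHERE v = 4").rowcount

class OffsetConsumer:
    """Ignores any message whose offset is at or below the last applied one."""

    def __init__(self):
        self.last_applied = -1
        self.applied = []

    def apply(self, offset, payload):
        if offset <= self.last_applied:
            return False
        self.applied.append(payload)
        self.last_applied = offset
        return True

print(record_payment(42, "ext_1", 100), record_payment(42, "ext_1", 100))  # True False
print(cas_update(), cas_update())                                          # 1 0

consumer = OffsetConsumer()
for offset, payload in [(0, "a"), (1, "b"), (1, "b"), (0, "a"), (2, "c")]:
    consumer.apply(offset, payload)
print(consumer.applied)   # ['a', 'b', 'c']
```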

Where Idempotency Breaks Down

Non-deterministic side effects. "Send a welcome email" with a timestamp in the body — the second send produces a different email, even if the operation "should" be idempotent. Either freeze the timestamp into the request, or accept that the email side effect is at-least-once.
External systems without idempotency. Calling a third-party API that itself is not idempotent shifts the problem outward. The fix is to checkpoint before the external call — record "we attempted this with payload P" — so the retry can recognize partial state.
Long-running operations. If the operation takes 30 seconds and the client retries at 10, two executions can be in flight simultaneously. Take an exclusive lock on the idempotency key for the duration of work (with all the caveats in Distributed Locks), or accept the race.
Cross-service operations. Idempotency on service A does not help service B. The saga pattern and the outbox pattern (covered in Exactly-Once Semantics) are the tools for composing per-service idempotency into end-to-end correctness.
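For the long-running case, the lock-per-key idea can be sketched inside a single process with threads. This is only an illustration: `KeyedExecutor` is a hypothetical name, the locks are never cleaned up, and a deployment with several server instances would need a distributed lock with the caveats mentioned above.

```python
import threading
import time

class KeyedExecutor:
    """Runs the work for a given idempotency key at most once per process."""

    def __init__(self):
        self._guard = threading.Lock()   # protects the table of per-key locks
        self._locks = {}
        self._results = {}

    def run(self, key, work):
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:                        # a concurrent retry waits here
            if key in self._results:
                return self._results[key]
            result = work()
            self._results[key] = result
            return result

executions = []

def slow_charge():
    time.sleep(0.05)                      # stands in for the slow operation
    executions.append("charged")
    return {"id": "pay_1"}

executor = KeyedExecutor()
results = []

def worker():
    results.append(executor.run("k1", slow_charge))

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(executions))   # 1: the retry waited instead of charging again
```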

Designing for Retries from Day One

The retroactive cost of adding idempotency to an API that did not have it is enormous; clients who issued requests without keys cannot be helped. Build it in from the start:

Every mutating endpoint accepts an idempotency key. Even reads, if they have side effects (logging, analytics).
Document the TTL and matching rules. Clients need to know how long they can safely retry and what counts as "the same request."
Return the same response on retry. Including all generated values.
Reject conflicting reuse loudly. Same key + different body should fail visibly.
Test with chaos. Inject duplicate deliveries in CI; if your service quietly handles them, you have built it right.

Pre-commit Checklist

Is every mutating endpoint either naturally idempotent or accepting an idempotency key?
Does my server store the response, not just a "seen" flag?
Do I reject conflicting reuse of the same key with a different body?
For each external side effect (email, payment, webhook), have I named where the duplicate-suppression lives?
Is my idempotency key TTL longer than my worst retry window, and shorter than "forever"?
Have I tested the duplicate-delivery case, not just the happy path?
