Compensation-based long-running transactions — what the protocol guarantees, what compensation actually means, and where the failures hide

Saga Protocol

A saga is a sequence of local transactions across multiple services, where each step has a corresponding compensating action that semantically undoes it. If any step fails, the saga runs the compensations for the steps that already committed, leaving the system in a consistent — though not necessarily identical — state.

Sagas exist because the alternative does not work. Two-phase commit (2PC) across services is technically possible but operationally terrible: every participant holds its locks until all participants have voted and the coordinator's decision arrives, so a single slow service stalls the entire transaction, and a coordinator crash mid-protocol blocks every prepared participant until the coordinator recovers. Long-running business workflows (place order → reserve inventory → charge card → schedule shipment) cannot afford to hold cross-service locks for the seconds-to-minutes-to-days these operations take. Try-Confirm/Cancel (TCC) is the middle-ground pattern that combines reservation with explicit confirm/cancel methods; see the 2PC and TCC pages for the full comparison.

This page is the protocol view: what guarantees a saga can and cannot offer, how compensation is supposed to work, and which failure modes will get you in production. The pattern view — when to choose orchestration vs choreography, how to structure code, common pitfalls — lives in Software Architecture.

What a Saga Is and Is Not

Garcia-Molina and Salem coined the term in their 1987 paper. Their original definition:

A long-lived transaction (LLT) is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions.

The key word is interleaved. A saga gives up the I in ACID — isolation. Other transactions can read the saga's partial state. This is the price of avoiding cross-service locks, and it has consequences you must design for.

A saga is not:

A distributed transaction. It does not give you atomicity across services.
A rollback. Compensation is semantic undo, not "restore the previous bytes."
A solution that hides distribution from the caller. Callers must understand that a saga can leave the system in any of 2n+1 states (each step committed-or-compensated, plus the in-progress state).

The Two Coordination Styles

Both produce the same correctness guarantees; the choice is architectural.

Orchestration

A central coordinator (often called the orchestrator or saga manager) tells each service what to do next. The orchestrator owns the saga's state.

                 ┌──────────────┐
                 │ Orchestrator │
                 └──────┬───────┘
                        │
   ┌─────────────┬──────┴───────┬─────────────┐
   ▼             ▼              ▼             ▼
[Order]     [Inventory]     [Payment]     [Shipping]
   ▲             ▲              ▲             ▲
   │             │              │             │
   └──── Each step's result reported back ────┘

Pros: explicit state machine; failure handling is centralized; easy to monitor.
Cons: orchestrator becomes a critical service; risk of coupling.

Temporal, Camunda, AWS Step Functions, Cadence, and Azure Durable Functions are all orchestration platforms in this style.
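The orchestration style can be sketched in a few lines. This is a minimal in-process illustration, not how any of the platforms above are implemented: Step, Orchestrator, and the in-memory log are hypothetical names, and a real orchestrator would persist its state and call remote services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]        # forward local transaction
    compensation: Callable[[], None]  # semantic undo for that transaction


@dataclass
class Orchestrator:
    steps: List[Step]
    log: List[str] = field(default_factory=list)  # stands in for durable saga state

    def run(self) -> bool:
        committed: List[Step] = []
        for step in self.steps:
            try:
                step.action()
            except Exception as exc:
                self.log.append(f"failed:{step.name}:{exc}")
                # Undo only the steps that committed, newest first.
                for done in reversed(committed):
                    done.compensation()
                    self.log.append(f"compensated:{done.name}")
                return False
            committed.append(step)
            self.log.append(f"committed:{step.name}")
        return True


def _fail_payment() -> None:
    raise RuntimeError("card declined")


events: List[str] = []
saga = Orchestrator([
    Step("order", lambda: events.append("create_order"), lambda: events.append("cancel_order")),
    Step("inventory", lambda: events.append("reserve"), lambda: events.append("release")),
    Step("payment", _fail_payment, lambda: events.append("refund")),
    Step("shipping", lambda: events.append("schedule"), lambda: events.append("unschedule")),
])
ok = saga.run()
```

Because the payment step raises before committing, only the order and inventory steps are compensated, and the shipping step never runs.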

Choreography

Each service listens for events and decides what to do. No central coordinator; the saga "happens" through the event chain.

[Order] ── OrderPlaced ──▶ [Inventory] ── ItemReserved ──▶ [Payment] ── Paid ──▶ [Shipping]
                                  │                              │
                                  ▼                              ▼
                            (on failure)                   (on failure)
                       publish ItemReservationFailed   publish PaymentFailed
                                                              │
                              ◀─── compensations propagate ───┘

Pros: loosely coupled; no single point of failure; services can evolve independently.
Cons: the saga's state is implicit across the event log; debugging "where did this saga get stuck?" is genuinely hard.
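A choreographed version of the same flow might look like the sketch below. The EventBus class is a hypothetical in-memory stand-in for a message broker, and the payload fields (order_id, item, qty, amount, limit) are assumptions made for illustration. Note that no component holds the saga's overall state; it can only be reconstructed from the event history.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a broker; delivers events in FIFO order."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, payload: dict) -> None:
        self.queue.append((event, payload))

    def drain(self) -> None:
        while self.queue:
            event, payload = self.queue.popleft()
            self.history.append(event)
            for handler in self.handlers[event]:
                handler(payload)


bus = EventBus()
inventory = {"widget": 100}
orders: Dict[str, str] = {"o1": "pending"}


def reserve_item(e: dict) -> None:            # Inventory service, forward step
    inventory[e["item"]] -= e["qty"]
    bus.publish("ItemReserved", e)


def take_payment(e: dict) -> None:            # Payment service, forward step
    bus.publish("PaymentFailed" if e["amount"] > e["limit"] else "Paid", e)


def release_item(e: dict) -> None:            # Inventory service, compensation
    inventory[e["item"]] += e["qty"]


def cancel_order(e: dict) -> None:            # Order service, compensation
    orders[e["order_id"]] = "cancelled"


bus.subscribe("OrderPlaced", reserve_item)
bus.subscribe("ItemReserved", take_payment)
bus.subscribe("PaymentFailed", release_item)
bus.subscribe("PaymentFailed", cancel_order)

bus.publish("OrderPlaced", {"order_id": "o1", "item": "widget", "qty": 5,
                            "amount": 500, "limit": 100})
bus.drain()
```

Each service reacts only to the events it subscribes to, which is where both the loose coupling and the debugging difficulty come from.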

Most production systems mix the two: orchestration for the parts where business rules are complex (payments, fulfillment), choreography for "send a notification when X happens"-style fan-out.

Compensation: Semantic Undo

The word "compensation" hides a lot. A compensation is a new forward transaction that semantically undoes a prior one — not a database rollback. Three categories:

Reversible. A pure inverse exists. reserve_inventory(item, qty) → release_inventory(item, qty). Easy.
Compensatable but visible. The undo is itself a business event with its own audit trail. charge_card($100) → refund_card($100). The customer sees both transactions on their statement; they cannot be erased.
Hard to compensate. send_email("Your order is confirmed!") cannot be unsent. The compensation is send_email("Apology, your order failed."), which is a new business event with its own correctness considerations.

The lesson: order saga steps so that the hardest-to-compensate ones come last. If sending the confirmation email is irreversible, do it after the payment is irreversibly captured.
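One way to make the ordering rule checkable is to tag each step with its category and flag any step that runs after a harder-to-compensate one. The Compensability enum and misordered_steps helper below are hypothetical names, offered as a sketch of the idea rather than an established technique.

```python
from enum import IntEnum
from typing import List, Tuple


class Compensability(IntEnum):
    REVERSIBLE = 1  # pure inverse exists (release_inventory)
    VISIBLE = 2     # undo is its own business event (refund_card)
    HARD = 3        # cannot be undone, only followed up (send_email)


def misordered_steps(plan: List[Tuple[str, Compensability]]) -> List[str]:
    """Return names of steps that run after a harder-to-compensate step."""
    offenders: List[str] = []
    hardest_so_far = Compensability.REVERSIBLE
    for name, level in plan:
        if level < hardest_so_far:
            offenders.append(name)
        hardest_so_far = max(hardest_so_far, level)
    return offenders


good_plan = [
    ("reserve_inventory", Compensability.REVERSIBLE),
    ("charge_card", Compensability.VISIBLE),
    ("send_email", Compensability.HARD),
]
bad_plan = [
    ("send_email", Compensability.HARD),
    ("reserve_inventory", Compensability.REVERSIBLE),
    ("charge_card", Compensability.VISIBLE),
]
```

A check like this could run in a unit test over each saga definition, so a reordering that moves the email ahead of the payment is caught before deployment.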

The Required Properties

For a saga to be correct, each step must satisfy three properties:

1. The forward action is idempotent

Retries are inevitable (network failure between coordinator and service, coordinator crash before recording success). The forward action must be safe to execute twice. See Idempotency for the mechanics.
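A common way to get this property is an idempotency key supplied by the coordinator. The sketch below assumes a hypothetical PaymentService with an in-memory store; in a real service the key lookup and the charge would need to happen in the same local transaction.

```python
from typing import Dict


class PaymentService:
    """Hypothetical participant that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.charges: Dict[str, int] = {}  # idempotency key -> amount charged
        self.total_charged = 0

    def charge_card(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self.charges:
            # A retry of a request that already committed: report it, do nothing.
            return "already_charged"
        self.charges[idempotency_key] = amount
        self.total_charged += amount
        return "charged"


svc = PaymentService()
first = svc.charge_card("saga-42:charge", 100)
second = svc.charge_card("saga-42:charge", 100)  # coordinator retry
```

Deriving the key from the saga ID and step name (here the assumed format "saga-42:charge") means a restarted coordinator regenerates the same key without having stored it.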

2. The compensation is idempotent

Same reason. If the service applies the compensation but its response is lost before the coordinator records "compensated," the coordinator will retry the compensation.
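For compensations, a status check on the record being undone is often enough. The InventoryService below is a hypothetical in-memory sketch: release does nothing if the reservation is unknown or already released, so a retried compensation cannot return stock twice.

```python
from typing import Dict


class InventoryService:
    def __init__(self, stock: int) -> None:
        self.stock = stock
        self.reservations: Dict[str, dict] = {}  # reservation id -> qty and status

    def reserve(self, reservation_id: str, qty: int) -> None:
        if reservation_id in self.reservations:
            return  # duplicate forward request
        self.stock -= qty
        self.reservations[reservation_id] = {"qty": qty, "status": "reserved"}

    def release(self, reservation_id: str) -> str:
        record = self.reservations.get(reservation_id)
        if record is None or record["status"] == "released":
            return "noop"  # nothing to undo, or already undone
        self.stock += record["qty"]
        record["status"] = "released"
        return "released"


inv = InventoryService(stock=100)
inv.reserve("saga-42:reserve", 5)
first_release = inv.release("saga-42:reserve")
retry_release = inv.release("saga-42:reserve")  # response was lost, coordinator retries
```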

3. The compensation is commutative with concurrent forward steps

This one is subtle and frequently violated. If step 3 of a saga is committing while step 1 is being compensated (because a parallel branch failed), the system must end up in the same state regardless of the interleaving. Garcia-Molina and Salem's original paper handles this by requiring the saga's steps to be ordered, with no parallel compensation; many practical systems run steps in parallel and must verify commutativity per step.
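A small experiment makes the property concrete. The commutes helper below is a hypothetical test utility that applies a set of operations in every order and compares the results; it checks one starting state, so it is evidence rather than proof. It shows why a compensation written as a delta commutes with a concurrent reservation while one written as "restore the old value" does not.

```python
from itertools import permutations
from typing import Callable, Iterable

Op = Callable[[int], int]  # maps old inventory to new inventory


def commutes(ops: Iterable[Op], start: int) -> bool:
    """True if every ordering of ops yields the same final state from start."""
    finals = set()
    for order in permutations(list(ops)):
        state = start
        for op in order:
            state = op(state)
        finals.add(state)
    return len(finals) == 1


# One saga reserved 5 units earlier (100 -> 95) and is now compensating,
# while another forward step reserves 3 units concurrently.
def reserve_three(inv: int) -> int:
    return inv - 3


def release_as_delta(inv: int) -> int:
    return inv + 5   # compensation expressed as "give back 5"


def release_as_snapshot(inv: int) -> int:
    return 100       # compensation expressed as "restore the value I saw"


delta_ok = commutes([reserve_three, release_as_delta], start=95)
snapshot_ok = commutes([reserve_three, release_as_snapshot], start=95)
```

The snapshot variant ends at 100 or 97 depending on the interleaving, silently discarding the concurrent reservation in one of them.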

Failure Modes That Will Bite You

The saga literature is optimistic. The production reality:

The compensation fails

You attempted to compensate step 3, but the service is down or returned an error. Now you have a saga that committed steps 1–3, failed step 4, started compensation for step 3, and is stuck. Sagas must have a retry-with-backoff-then-escalate model for compensation failures, ending in a human-investigated state. There is no automatic answer; an irrecoverable saga is an operational incident.
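The retry-with-backoff-then-escalate model can be sketched as follows. The function name, the default attempt count and delays, and the escalate callback are all assumptions; in production the escalation would open a ticket or page an operator rather than append to a list.

```python
import time
from typing import Callable, List, Optional


def compensate_with_retry(
    compensation: Callable[[], None],
    escalate: Callable[[str], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            compensation()
            return "compensated"
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    # Out of retries: hand the saga to a human instead of looping forever.
    escalate(f"compensation failed after {max_attempts} attempts: {last_error}")
    return "needs_human"


delays: List[float] = []   # recorded instead of actually sleeping
tickets: List[str] = []
calls = {"n": 0}


def flaky_refund() -> None:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("payment service down")


def broken_refund() -> None:
    raise ConnectionError("payment service down")


recovered = compensate_with_retry(flaky_refund, tickets.append, sleep=delays.append)
stuck = compensate_with_retry(broken_refund, tickets.append, max_attempts=3,
                              sleep=lambda _delay: None)
```

The compensation passed in must itself be idempotent, since a failed attempt may have partially or fully taken effect before the error surfaced.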

A compensated step's effect leaked

Between the forward step and the compensation, another transaction read the value and acted on it. The compensation does not retract those downstream effects.

Saga A:  reserve 5 widgets (inventory = 95)
         payment fails
         compensate: release 5 widgets (inventory = 100)

Meanwhile:
Saga B:  read inventory (saw 95), made a marketing decision
         "We're running low — start promoting the alternative widget."

Saga B's decision is now based on a transient state that no longer exists. This is the I (isolation) we gave up. The fix is to design business processes that are tolerant of this — never to pretend it cannot happen.

Out-of-order compensation

If you compensate steps (1,2,3,4) in arbitrary order rather than (4,3,2,1), you can violate invariants between steps. Compensate in reverse order of execution unless you can prove commutativity.
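To illustrate, assume a hypothetical two-step saga that creates an order and then adds a line item referencing it, with the invariant that every line item points at an existing order. The sketch below undoes the steps in a given order and records any moment at which the invariant is broken.

```python
from typing import Dict, List, Set


def undo_in_order(compensation_order: List[str]) -> List[str]:
    """Compensate a two-step saga and report invariant violations along the way."""
    orders: Set[str] = {"o1"}              # result of step 1: create_order
    lines: Dict[str, str] = {"l1": "o1"}   # result of step 2: add_line (line -> order)
    violations: List[str] = []
    undo = {
        "create_order": lambda: orders.discard("o1"),
        "add_line": lambda: lines.pop("l1", None),
    }
    for step in compensation_order:
        undo[step]()
        # Invariant: every line item references an order that exists.
        if any(order_id not in orders for order_id in lines.values()):
            violations.append(f"orphan line after undoing {step}")
    return violations


executed = ["create_order", "add_line"]
reverse_result = undo_in_order(list(reversed(executed)))
forward_result = undo_in_order(executed)
```

Undoing in execution order deletes the order while the line item still references it; any reader (or foreign-key constraint) that observes that intermediate state sees a broken invariant.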

Saga state lost

If the coordinator's own storage of saga state is lost, you have committed steps with no record of needing compensation. The coordinator's state must be at least as durable as the participants' state — typically a database, often the same database the orchestrator service uses for its own data.
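A minimal sketch of a durable saga log, using SQLite from the Python standard library, is shown below. The saga_log table, its columns, and the status values are assumptions for illustration. The key idea is to record "started" before invoking a step, so that after a crash the coordinator knows which steps may have taken effect.

```python
import sqlite3
from typing import Dict, List


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)  # use a file path for real durability
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_log ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " saga_id TEXT NOT NULL, step TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def record(conn: sqlite3.Connection, saga_id: str, step: str, status: str) -> None:
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO saga_log (saga_id, step, status) VALUES (?, ?, ?)",
            (saga_id, step, status),
        )


def steps_needing_compensation(conn: sqlite3.Connection, saga_id: str) -> List[str]:
    """Steps that started or committed but were never compensated, newest first."""
    rows = conn.execute(
        "SELECT step, status FROM saga_log WHERE saga_id = ? ORDER BY seq",
        (saga_id,),
    ).fetchall()
    latest: Dict[str, str] = {}
    first_seen: List[str] = []
    for step, status in rows:
        if step not in latest:
            first_seen.append(step)
        latest[step] = status
    # A "started" step may or may not have committed, so it is compensated too;
    # this relies on compensations being idempotent and safe on a no-op.
    return [s for s in reversed(first_seen) if latest[s] in ("started", "committed")]


conn = open_log()
for step, status in [("order", "started"), ("order", "committed"),
                     ("inventory", "started"), ("inventory", "committed"),
                     ("payment", "started")]:   # coordinator crashes here
    record(conn, "saga-42", step, status)

pending = steps_needing_compensation(conn, "saga-42")
record(conn, "saga-42", "payment", "compensated")
pending_after = steps_needing_compensation(conn, "saga-42")
```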

Lifecycle ambiguity

A user-facing question: "Is this order confirmed yet?" The honest answer during a saga is "in progress." UIs that show "confirmed" the moment the order is placed (before payment commits) will mislead users when the saga later compensates.
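One way to keep the UI honest is to derive the user-facing label from the saga's state in a single place. The SagaState values and label strings below are hypothetical; the point of the sketch is that only a completed saga maps to "confirmed".

```python
from enum import Enum


class SagaState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"
    STUCK = "stuck"  # compensation failed, awaiting a human


USER_LABELS = {
    SagaState.RUNNING: "in progress",
    SagaState.COMPLETED: "confirmed",
    SagaState.COMPENSATING: "cancelling",
    SagaState.COMPENSATED: "cancelled",
    SagaState.STUCK: "under review",
}


def user_facing_status(state: SagaState) -> str:
    return USER_LABELS[state]


confirmed_states = [s for s in SagaState if user_facing_status(s) == "confirmed"]
```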

Relation to Distributed Consensus

Sagas are not a substitute for consensus. They give up isolation; consensus algorithms give up availability during partitions. The two solve different problems:

Use consensus when you need a globally agreed-upon order of operations on a single piece of state.
Use sagas when you need a long-running cross-service workflow with semantic compensation.

A real system often has both: consensus at the storage layer (etcd, Raft-based DBs) and sagas at the application layer (Temporal, custom orchestrators).

Pre-commit Checklist

For each saga step, is the forward action idempotent and the compensation idempotent?
For each compensation, what business-visible effect does it produce? (Audit trail, customer-facing artifact, downstream events.)
Have I ordered the steps so the hardest-to-compensate are last?
What happens if a compensation itself fails? Is there a retry policy and a human-escalation path?
What does the user-facing UI show during the in-progress state? Is "confirmed" reserved for after the saga commits its final irreversible step?
Is the orchestrator's state at least as durable as the participants' state?
Have I tested the partial-failure case — not just the happy path — in integration tests?
