Long-running business workflows over an event bus — the architectural pattern side of saga, complementing the protocol view in distributed-systems

Saga Pattern

A saga is a multi-step business workflow that spans services, where each step is a local transaction and the system tolerates seeing intermediate states. If any step fails, earlier steps are compensated — semantically undone via new forward business actions, not by database rollback.

Saga Protocol covers the correctness side of this: what guarantees the protocol provides, where the failures hide. This page is the architectural pattern side: how you structure code, the choice between orchestration and choreography, what state-machine modeling looks like in practice, and which tools the industry has converged on. The two pages complement each other; the protocol page tells you whether your saga is correct, this page tells you how to design and implement one.

What Makes a Workflow a Saga

A saga has the following shape:

Multiple steps. Two or more business actions that together form a meaningful workflow.
Distributed. Each step is owned by a different service (or by a different aggregate within the same service).
Long-running. The whole thing takes seconds to days, not milliseconds. Holding database transactions across the workflow is not feasible.
Compensable. If a later step fails, earlier steps have a defined business undo.
Tolerant of intermediate visibility. Other transactions can see the saga's partial state. Some users may notice "in-progress" states.

Examples that fit:

Place order → reserve inventory → charge payment → schedule shipping.
Submit loan application → run credit check → human review → fund disbursement.
User uploads document → virus scan → content extraction → indexing → notification.

Examples that do not fit:

Operations that need to be atomic across services with no intermediate visibility — use TCC or 2PC instead.
Operations short enough that synchronous calls are fine.
Operations with no business undo — sending a one-way notification, an irreversible payment to an external party.

Orchestration vs Choreography

Sagas come in two architectural styles. Both are correct; the choice is about coupling, observability, and team boundaries.

Orchestration

A central orchestrator service knows the workflow and tells each participant what to do next.

[Orchestrator]
     │
     ├─▶ Inventory.reserve(orderId)        ──ack──▶
     │
     ├─▶ Payment.charge(orderId, amount)   ──ack──▶
     │
     └─▶ Shipping.schedule(orderId)        ──ack──▶

The orchestrator owns the workflow's state machine. If a step fails, the orchestrator runs the compensation chain.
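As a minimal sketch of the orchestrator described above (not any particular engine's API), the following runs steps in order and, on failure, compensates the completed steps in reverse. The `Step` and `Orchestrator` names, the `noop` and `decline` stand-ins, and the log format are all assumptions made for illustration; a real orchestrator would also persist its state and send commands over a bus rather than call functions directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[str], None]       # forward business action
    compensate: Callable[[str], None]   # forward action that semantically undoes it


@dataclass
class Orchestrator:
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, order_id: str) -> bool:
        completed: List[Step] = []
        for step in self.steps:
            try:
                step.action(order_id)
                self.log.append(f"{step.name}:ok")
                completed.append(step)
            except Exception as exc:
                self.log.append(f"{step.name}:failed({exc})")
                # Compensate already-completed steps, most recent first.
                for done in reversed(completed):
                    done.compensate(order_id)
                    self.log.append(f"{done.name}:compensated")
                return False
        return True


def noop(order_id: str) -> None:
    """Hypothetical participant call that always succeeds."""


def decline(order_id: str) -> None:
    """Hypothetical payment call that always fails."""
    raise RuntimeError("card declined")


saga = Orchestrator([
    Step("reserve_inventory", noop, noop),
    Step("charge_payment", decline, noop),
    Step("schedule_shipping", noop, noop),
])
succeeded = saga.run("order-1")
```

Here the payment step fails, so only the inventory reservation is compensated and shipping is never attempted; `saga.log` holds the whole lifecycle in one place, which is the observability benefit listed below.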

Pros:

The workflow is explicit, in one place, readable.
Failure handling is centralized.
Easy to observe: the orchestrator's log alone shows the full saga lifecycle.
New steps are added in the orchestrator without touching all participants.

Cons:

The orchestrator becomes a critical service (uptime, correctness).
Risks coupling — participants must expose APIs shaped to the orchestrator's needs.
Conceptually centralized; conflicts with strict service autonomy.

Orchestration is used heavily in modern systems through workflow engines such as Temporal, Cadence, AWS Step Functions, Camunda, Inngest, and Azure Durable Functions. The "workflow engine" category is essentially "tooling for saga orchestration."

Choreography

There is no central coordinator. Each service publishes events when its step completes; other services subscribe to the events they care about and act.

OrderService    → publishes OrderPlaced
                                ↓
InventoryService → receives OrderPlaced, reserves, publishes ItemReserved
                                                          ↓
PaymentService  → receives ItemReserved, charges, publishes Paid
                                                          ↓
ShippingService → receives Paid, schedules, publishes Scheduled

The "saga" exists as a property of the event chain, not as code in any one place.
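The event chain above can be sketched with an in-memory bus standing in for a real broker. The `EventBus` class and its `subscribe`, `publish`, and `drain` methods are hypothetical names invented for this example, and each lambda is a placeholder for a whole service; the point is only that no single piece of code describes the full flow.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[[dict], None]


class EventBus:
    """In-memory stand-in for a broker; delivers events in publish order."""

    def __init__(self) -> None:
        self.subscribers: Dict[str, List[Handler]] = defaultdict(list)
        self.queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.queue.append((event_type, payload))

    def drain(self) -> None:
        # Deliver queued events until no handler publishes anything new.
        while self.queue:
            event_type, payload = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.subscribers[event_type]:
                handler(payload)


bus = EventBus()

# Each service knows only the event it consumes and the event it emits.
bus.subscribe("OrderPlaced", lambda e: bus.publish("ItemReserved", e))  # InventoryService
bus.subscribe("ItemReserved", lambda e: bus.publish("Paid", e))         # PaymentService
bus.subscribe("Paid", lambda e: bus.publish("Scheduled", e))            # ShippingService

bus.publish("OrderPlaced", {"order_id": "order-1"})                     # OrderService
bus.drain()
```

The only place the whole saga is visible is `bus.history`, which in a real system corresponds to a distributed trace rather than anything in a service's own code.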

Pros:

No central service to maintain or fail.
Loose coupling — each service knows only its own inputs and outputs.
Aligns with strict service autonomy and event-driven thinking.

Cons:

The workflow is implicit — there is no place in the code that describes the full flow. Reading "what happens when an order is placed?" requires tracing event subscriptions across services.
Failure handling is distributed. If step 3 fails, the compensation chain runs by emitting cancellation events, which other services must subscribe to.
Observability is harder — to see the full saga, distributed tracing across services is essentially mandatory.
Cyclic dependencies are easy to create accidentally ("service A reacts to B's event by emitting an event that triggers B again").

Which to Choose

The honest framing:

Workflow property	Use orchestration	Use choreography
Workflow is complex (many steps, branches, retries)	✓
Workflow rarely changes		✓
Workflow is owned by one team	✓
Workflow spans many teams		✓
You want to read the workflow in one place	✓
You want maximum decoupling		✓
Operational maturity is high	either	✓ requires tracing
Team is new to distributed systems	✓

A mix is common in real systems: orchestration for the parts where the workflow is business-critical and complex (payments, fulfillment), choreography for the broadcast-style additions (analytics, notifications).

State Machine Modeling

A saga's behavior is naturally a state machine. The states are the workflow's stages; the transitions are the events or commands that drive it forward; the failure transitions go to compensation states.

[Started]
   │ OrderPlaced
   ▼
[Reserving Inventory]
   │ ItemReserved            │ ReservationFailed
   ▼                          ▼
[Charging Payment]         [Failed: Out of Stock]
   │ Paid          │ PaymentFailed
   ▼               ▼
[Scheduling]    [Compensating: Release Inventory]
   │ Scheduled       │ done
   ▼                 ▼
[Completed]     [Failed: Payment Declined]

This explicit model surfaces:

What states the saga can be in (useful for queries: "show me sagas stuck in Scheduling").
What can fail and what the compensation path is.
What can happen in parallel (some sagas split into multiple branches).
What human-escalation states exist (sagas that need manual intervention).

Workflow engines (Temporal, Step Functions) provide this state machine as code; choreography-based sagas leave it implicit, so you should still document it explicitly.
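One lightweight way to make the model explicit without an engine is a transition table. The sketch below encodes the diagram above; the state strings, the `advance` and `replay` helpers, and the `TERMINAL` set are illustrative assumptions, and the event names are taken from the diagram as drawn.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state, mirroring the diagram.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Started", "OrderPlaced"): "ReservingInventory",
    ("ReservingInventory", "ItemReserved"): "ChargingPayment",
    ("ReservingInventory", "ReservationFailed"): "Failed:OutOfStock",
    ("ChargingPayment", "Paid"): "Scheduling",
    ("ChargingPayment", "PaymentFailed"): "Compensating:ReleaseInventory",
    ("Compensating:ReleaseInventory", "done"): "Failed:PaymentDeclined",
    ("Scheduling", "Scheduled"): "Completed",
}

TERMINAL = {"Completed", "Failed:OutOfStock", "Failed:PaymentDeclined"}


def advance(state: str, event: str) -> str:
    """Return the next state, rejecting events the model does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}") from None


def replay(events: Iterable[str]) -> str:
    """Fold a saga's event history into its current state."""
    state = "Started"
    for event in events:
        state = advance(state, event)
    return state


happy = replay(["OrderPlaced", "ItemReserved", "Paid", "Scheduled"])
declined = replay(["OrderPlaced", "ItemReserved", "PaymentFailed", "done"])
in_flight = replay(["OrderPlaced", "ItemReserved", "Paid"])
```

Storing the current state per saga makes queries such as "which sagas are stuck in Scheduling" a simple filter, and a state that is not in `TERMINAL` is by definition still in flight.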

Compensation as Forward Business Action

This is worth restating from the protocol view: compensation is not a database rollback. It is a new forward business action that semantically undoes the previous one.

Reserve inventory → Release inventory (visible, but no business consequence).
Charge payment → Refund payment (visible on customer's statement).
Send confirmation email → Send apology email (a new event; the first email cannot be unsent).
Allocate a resource → Mark allocation as cancelled (audit trail of both).

Designing the compensation is a domain modeling exercise, not a technical one. Each step's compensation should be:

Idempotent — the orchestrator or downstream consumer may retry.
Eventually successful — if it cannot succeed, escalate to human review; do not silently fail.
Auditable — both the original action and the compensation should appear in the business record.

Order saga steps so that the least-compensable actions come last. The confirmation email (impossible to truly undo) goes after the payment (refundable) which goes after the inventory reservation (easy to release).
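The three properties above can be illustrated with a refund used as the compensation for a charge. This is a sketch under stated assumptions: `PaymentLedger` is a hypothetical in-memory stand-in for a payments table, and the `refund-<charge_id>` identifier scheme is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PaymentLedger:
    """Hypothetical in-memory ledger; a real one would be a database table."""
    entries: List[dict] = field(default_factory=list)       # append-only business record
    refunded: Dict[str, str] = field(default_factory=dict)  # charge_id -> refund_id

    def charge(self, charge_id: str, amount_cents: int) -> None:
        self.entries.append(
            {"type": "charge", "id": charge_id, "amount_cents": amount_cents}
        )

    def refund(self, charge_id: str) -> str:
        # Idempotent: a retried compensation returns the existing refund.
        if charge_id in self.refunded:
            return self.refunded[charge_id]
        original = next(
            e for e in self.entries
            if e["type"] == "charge" and e["id"] == charge_id
        )
        refund_id = f"refund-{charge_id}"
        # Auditable: the refund is a new entry; the charge is never deleted.
        self.entries.append({
            "type": "refund",
            "id": refund_id,
            "of": charge_id,
            "amount_cents": original["amount_cents"],
        })
        self.refunded[charge_id] = refund_id
        return refund_id

    def balance_cents(self) -> int:
        return sum(
            e["amount_cents"] if e["type"] == "charge" else -e["amount_cents"]
            for e in self.entries
        )


ledger = PaymentLedger()
ledger.charge("c1", 500)
first = ledger.refund("c1")
second = ledger.refund("c1")  # simulated retry from the orchestrator
```

After the retry the ledger still holds exactly one charge and one refund, so the net balance is zero while both actions remain visible in the record.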

Workflow Engines: When to Use One

If your saga is more than 3-4 steps, has branches, has retries with backoff, or needs human-in-the-loop, a workflow engine is usually the right tool. The category includes:

Temporal / Cadence — code-first workflows; you write the saga as ordinary code, the engine handles persistence, retries, timeouts, and replays.
AWS Step Functions — JSON/YAML state machine definition; tight AWS integration.
Camunda / Zeebe — BPMN-based (visual diagrams); strong in enterprise environments.
Inngest / Trigger.dev — TypeScript-first workflow services for modern stacks.
Azure Durable Functions — Microsoft's code-first orchestration.

What they give you:

Persistence of workflow state (the saga survives orchestrator restarts).
Automatic retry with backoff.
Long-duration support (sagas that span days or months).
Visibility — UIs to inspect running sagas, find stuck ones.
Versioning of workflow definitions.
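To make "automatic retry with backoff" concrete, here is a hand-rolled sketch of the behavior. It is not how any particular engine implements it; `retry_with_backoff`, `NeedsHumanReview`, and the default delays are assumptions for illustration, and the `sleep` parameter exists only so the example can run without waiting.

```python
import time
from typing import Callable, List


class NeedsHumanReview(Exception):
    """Raised when retries are exhausted; the saga should enter an escalation state."""


def retry_with_backoff(
    action: Callable[[], None],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run action until it succeeds; return the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except Exception as exc:
            if attempt == max_attempts:
                raise NeedsHumanReview(f"gave up after {attempt} attempts") from exc
            # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


delays: List[float] = []
calls = {"n": 0}


def flaky() -> None:
    """Hypothetical step that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("dependency unavailable")


attempts = retry_with_backoff(flaky, sleep=delays.append)
```

A workflow engine additionally persists the attempt count and the pending timer, so the retry schedule survives a process restart; this sketch does not.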

What they cost:

A new service in your operational footprint.
Workflow code is not entirely ordinary code — there are constraints (determinism, idempotency) the engine requires.
Vendor lock-in if you choose a managed service.

For a 2-step saga, rolling your own with an outbox + idempotent handlers may be simpler. For a 7-step saga with retries and compensation, do not write your own orchestrator; the workflow engines are battle-tested at exactly this problem.
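For the small case, "outbox + idempotent handlers" might look roughly like the following. This is a sketch using an in-memory SQLite database; the table names, the `place_order`, `relay`, and `on_order_placed` functions, and the dedup key format are all hypothetical, and a real relay would publish to a broker instead of calling a Python function.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE processed (event_key TEXT PRIMARY KEY);
""")


def place_order(order_id: str) -> None:
    # The state change and the event are written in ONE local transaction.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'PLACED')", (order_id,))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay(publish) -> int:
    """Publish unpublished rows in order; a crash between publish and the
    UPDATE would cause a redelivery, hence the idempotent consumer below."""
    rows = db.execute(
        "SELECT seq, event_type, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)


reserved = []


def on_order_placed(event_type: str, payload: dict) -> None:
    # Idempotent consumer: the primary key rejects duplicate deliveries.
    key = f"{event_type}:{payload['order_id']}"
    with db:
        cur = db.execute("INSERT OR IGNORE INTO processed VALUES (?)", (key,))
        if cur.rowcount == 1:
            reserved.append(payload["order_id"])


place_order("order-1")
first_pass = relay(on_order_placed)
on_order_placed("OrderPlaced", {"order_id": "order-1"})  # simulated duplicate delivery
second_pass = relay(on_order_placed)
```

The duplicate delivery is absorbed by the `processed` table, so the reservation happens once even though the event arrived twice.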

Common Mistakes

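Implementing a saga as a synchronous chain. Service A calls B calls C calls D, each waiting on the next. End-to-end latency is the sum of the individual latencies; end-to-end availability is the product of the individual availabilities, so it is lower than that of any single service. Use async messaging.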
Compensation as rollback. Trying to "undo" with reverse SQL writes. The point of saga is that local transactions are committed; you cannot un-commit. Use forward compensating actions.
No saga state. "We just have events; the saga emerges." Without an explicit state somewhere, you cannot answer "is saga X in flight or done?" Add at least an orchestrator's state table or a correlation-id-indexed saga record.
No idempotency at participants. Saga steps retry, so participants must handle duplicate requests. See Idempotency.
Order of steps ignores compensability. Sending the email first then charging the card. Now compensation requires the impossible "unsend an email" plus the trivial "refund."
No human-escalation state. Some sagas get stuck (compensation step keeps failing, external dependency is down). Without an explicit "this needs a human" state, stuck sagas hide.
Choreography for the wrong reasons. Adopting choreography because it sounds elegant, then needing tools (tracing, saga-state explorers) that you would have gotten for free with orchestration.
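The arithmetic behind the synchronous-chain mistake above is easy to check. The latency and availability figures below are made-up illustrative numbers, not measurements, and the calculation assumes the services fail independently.

```python
import math

# Hypothetical per-service figures for a chain A -> B -> C -> D.
latencies_ms = [40, 120, 80, 60]
availabilities = [0.999, 0.999, 0.999, 0.999]

# Each call waits for the next, so latencies add up.
end_to_end_latency_ms = sum(latencies_ms)

# Every service must be up at once, so availabilities multiply.
end_to_end_availability = math.prod(availabilities)
```

With four services at 99.9% each, the chain is available roughly 99.6% of the time, which is about four times the downtime of any one service.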

Relation to Other Patterns

Saga Protocol — the correctness side. Read both.
Event-Driven Architecture — choreographed sagas are an event-driven pattern.
Outbox Pattern — the standard way to publish saga events reliably from a service with its own database.
Idempotency — every saga step needs it.
TCC — sometimes the right alternative when atomicity matters more than long duration.
2PC — the heavy hammer that sagas avoid.

Pre-commit Checklist

For each saga in my system, do I know whether it is orchestrated or choreographed, and why?
Is the workflow's state machine explicit somewhere — code, diagram, or workflow engine?
Is every step idempotent? Can each step be retried without harm?
Is every step's compensation a forward business action, not a database rollback?
Are steps ordered so that the least-compensable actions come last?
Is there a human-escalation path for sagas that get stuck?
For multi-step workflows with retries and timeouts, am I using a workflow engine, or am I about to reinvent one badly?
