Steven's Knowledge

TCC (Try-Confirm-Cancel)

Application-level distributed transactions — resource reservation as the building block, the explicit three-method protocol, and the production failure modes

TCC is the distributed-transaction pattern that lives between 2PC and Saga. Where 2PC holds database locks across the whole protocol and Saga gives up isolation entirely, TCC asks each service to model a reservation of business-level resources: I do not commit yet, but I promise the resource is yours if you confirm. Resources are held by the application logic, not by the database, so locks can be released in milliseconds while the global decision plays out over seconds or minutes.

TCC gained most of its traction in the Chinese e-commerce world, where Alibaba's Seata and Ant Financial's DTP both popularized it, but the idea generalizes far beyond that setting. This page is about the protocol shape, what makes a resource "TCC-able," and the production realities that the academic descriptions tend to skip.

What TCC Solves

The problem space is the same as 2PC and Saga: a transaction spans multiple independent services or resources, and you need an "all or nothing" outcome.

The gap TCC fills:

  • 2PC requires participants to hold locks until the global decision arrives. Across services or across regions, this is operationally unviable.
  • Saga has no isolation — intermediate states are visible to other transactions, and the application must tolerate that.
  • TCC says: each service holds the resource at the application layer for a bounded time, with a clear three-method contract. The intermediate state is hidden from other transactions (the resource is "reserved"), but database locks are not held.

This is the right pattern when you can model a business resource that supports temporary reservation — inventory, account balance, hotel rooms, concurrent connection slots, rate limits. It is the wrong pattern when the resource cannot be reserved (sending an email, calling an external API that does not understand reservations).

The Three Methods

Each participant service implements three operations:

Try

Reserve the resource. Validate that the operation is possible (the user has enough balance, the inventory has enough units), and mark the resource as reserved without making the change visible to other transactions.

Try("debit account A, $100")
  → check balance(A) >= 100
  → write reservation row: {tx_id, account_a, -100, status=RESERVED}
  → balance is now: available = current - sum(active reservations)
  → return SUCCESS

After Try succeeds, the resource is held for this transaction. Other transactions see less available balance / inventory / capacity. Critically, the reservation lives at the application layer: once Try's local transaction commits, no database row stays locked, and the table can serve any other query.

Confirm

The global decision is commit. Apply the reservation, making it visible.

Confirm("tx_id")
  → find reservation row
  → apply: balance(A) -= 100
  → mark reservation: status=CONFIRMED (or delete)
  → return SUCCESS

Cancel

The global decision is abort. Release the reservation without applying it.

Cancel("tx_id")
  → find reservation row
  → mark reservation: status=CANCELLED (or delete)
  → return SUCCESS
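
The three methods above can be sketched for a single account as follows. This is a minimal in-memory illustration, not a production participant: the `Account` class, its method names, and the use of integer cents are assumptions made for the example, and a real service would run each method as a local database transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    balance: int                                       # committed balance, in cents
    reservations: dict = field(default_factory=dict)   # tx_id -> [amount, status]

    def available(self) -> int:
        # What other transactions may still reserve:
        # committed balance minus every still-active reservation.
        held = sum(amount for amount, status in self.reservations.values()
                   if status == "RESERVED")
        return self.balance - held

    def try_debit(self, tx_id: str, amount: int) -> bool:
        if tx_id in self.reservations:
            # Duplicate Try: never reserve twice for the same transaction.
            return self.reservations[tx_id][1] == "RESERVED"
        if self.available() < amount:
            return False                               # nothing is reserved on failure
        self.reservations[tx_id] = [amount, "RESERVED"]
        return True

    def confirm(self, tx_id: str) -> None:
        entry = self.reservations.get(tx_id)
        if entry and entry[1] == "RESERVED":
            self.balance -= entry[0]                   # apply the debit
            entry[1] = "CONFIRMED"

    def cancel(self, tx_id: str) -> None:
        entry = self.reservations.get(tx_id)
        if entry and entry[1] == "RESERVED":
            entry[1] = "CANCELLED"                     # release without applying
```

Note that `balance` only changes in `confirm`; `try_debit` and `cancel` only change what `available()` reports.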

The Coordinator Flow

A TCC coordinator drives the protocol in two phases: Try on every participant, then Confirm or Cancel depending on the outcome:

For each participant Pi:
  result = Pi.Try()
  if result == FAILURE:
    for each Pj already Try-d:
      Pj.Cancel()
    return ABORT

For each participant Pi:
  Pi.Confirm()
return COMMIT

If any Try fails, the coordinator calls Cancel on every participant whose Try already succeeded. If every Try succeeds, the coordinator calls Confirm on every participant. Notice the asymmetry with 2PC: there is no database-level "prepared, waiting for decision" state. Each participant either holds a reservation (resource held at the application layer) or does not.
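
A minimal sketch of that flow, with no persistence or retries. The `Participant` protocol and the method name `try_` (chosen because `try` is a Python keyword) are hypothetical. One deliberate deviation from the pseudocode: the sketch also sends Cancel to the participant whose Try failed, which is harmless as long as Cancel is a no-op when nothing was reserved.

```python
from typing import Protocol, Sequence


class Participant(Protocol):
    def try_(self, tx_id: str) -> bool: ...
    def confirm(self, tx_id: str) -> None: ...
    def cancel(self, tx_id: str) -> None: ...


def run_tcc(tx_id: str, participants: Sequence[Participant]) -> str:
    attempted = []
    for p in participants:
        attempted.append(p)            # remember it before calling Try
        if not p.try_(tx_id):
            # Abort: release every reservation that may exist.
            for q in attempted:
                q.cancel(tx_id)
            return "ABORT"

    # Every Try succeeded, so the global decision is commit.
    for p in participants:
        p.confirm(tx_id)
    return "COMMIT"
```

Participants that were never reached (those after the failing one) receive no call at all.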

Required Properties

For TCC to be correct, the three methods must satisfy:

Try is Atomic, Local

Within a single service, the Try step is a normal local transaction. It either reserves the resource and returns success, or fails and reserves nothing. No coordination with other services.

Confirm and Cancel Are Idempotent

The coordinator can crash mid-Confirm or mid-Cancel and retry, so both methods must be safe to call multiple times. See Idempotency for the mechanics; the standard implementation is a status column that the operation transitions through: RESERVED → CONFIRMED is applied at most once, and calling Confirm again on a CONFIRMED row is a no-op.

Confirm and Cancel Are Eventually Successful

Unlike Try, which can fail, Confirm and Cancel must succeed eventually. They run after the global decision; there is no rolling back the rolling-back. If they fail (database down, network partition), the coordinator retries with backoff until they succeed. Operations that cannot guarantee eventual success (external API calls, third-party services) are poor TCC participants.
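
A sketch of the retry loop, assuming exponential backoff with full jitter. The function name, the default delays, and the optional `max_attempts` cut-off are illustrative choices, not part of the protocol; the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import random
import time
from typing import Callable, Optional


def retry_until_success(
    op: Callable[[], None],
    base: float = 0.1,                  # first backoff ceiling, in seconds
    cap: float = 30.0,                  # upper bound for any single delay
    max_attempts: Optional[int] = None, # None means retry forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call op until it stops raising. Returns the number of attempts used."""
    attempt = 0
    while True:
        attempt += 1
        try:
            op()
            return attempt
        except Exception:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and let the caller escalate
            ceiling = min(cap, base * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # full jitter
```

Because the operation may have taken effect before the failure was reported, this loop is only safe around idempotent Confirm and Cancel calls.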

Try Reservations Have a Timeout

If the coordinator crashes after Try but before Confirm/Cancel, the reservation would otherwise be held forever. Practical implementations attach a TTL to each Try reservation; expired reservations are released by a background process.
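
The background release can be sketched as a periodic sweep over reservations. The `Reservation` fields and the `release_expired` function are hypothetical names; a real implementation would run the equivalent query against the reservation table.

```python
import time
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Reservation:
    tx_id: str
    amount: int
    expires_at: float            # absolute deadline, seconds since the epoch
    status: str = "RESERVED"


def release_expired(reservations: Iterable[Reservation],
                    now: Optional[float] = None) -> List[str]:
    """Cancel every active reservation whose deadline has passed."""
    now = time.time() if now is None else now
    released = []
    for r in reservations:
        if r.status == "RESERVED" and r.expires_at <= now:
            r.status = "CANCELLED"       # same effect as an explicit Cancel
            released.append(r.tx_id)
    return released
```

Confirmed and already-cancelled rows are skipped, so running the sweep repeatedly is safe.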

TCC vs 2PC vs Saga

The three families are different cells in the design matrix, not interchangeable:

| Property | 2PC | TCC | Saga |
|---|---|---|---|
| Lock holder | Database (storage layer) | Application (business layer) | Nobody (compensation only) |
| Lock duration | Whole protocol | Try → Confirm/Cancel | None |
| Isolation | Strong | Weak (reserved-not-committed) | None |
| Recovery on coord failure | Blocks | Retries Confirm/Cancel | Retries forward or compensates |
| Operational complexity | High (in-doubt transactions) | Medium (three methods per service) | Low protocol, high compensation logic |
| Best for | Same-DC ACID | Cross-service with reservable resources | Long-running multi-service workflows |
| Worst for | Long transactions, cross-region | Non-reservable resources | Operations needing isolation |

The honest framing:

  • 2PC: when participants are databases in the same datacenter and the transaction is short.
  • TCC: when participants are services holding business resources you can model as reservable, and you need atomicity in seconds-to-minutes.
  • Saga: when participants are services and the workflow is long-running, with compensation as semantic undo.

A Worked Example

E-commerce order placement involving three services: inventory, payment, and shipping.

Try phase:
  Inventory.Try(order_id, sku, qty=2)
    → check available >= 2, reserve 2 units, available -= 2
  Payment.Try(order_id, user, $100)
    → check balance(user) >= $100, reserve $100
  Shipping.Try(order_id, address, slot)
    → check slot available, reserve it

[If all three succeed]
Confirm phase:
  Inventory.Confirm(order_id) → decrement inventory permanently
  Payment.Confirm(order_id)   → debit account permanently
  Shipping.Confirm(order_id)  → assign slot permanently

[If any Try fails, e.g., Payment fails because of insufficient balance]
Cancel phase:
  Inventory.Cancel(order_id) → release the 2 reserved units
  Payment.Cancel(order_id)   → no-op, nothing was reserved
  Shipping.Cancel(order_id)  → release the slot if its Try ran (parallel Trys); otherwise a no-op

Each Try is a fast local transaction (milliseconds). The reservation lives in the application — other shoppers see less available inventory, but no database row is locked.

Production Failure Modes

The TCC literature is optimistic. The realities:

Coordinator Crashes Mid-Try

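Some participants completed Try, others did not get the call. On recovery, the coordinator needs to know which participants were already tried. The fix: persist the transaction state before issuing each Try. On recovery, the coordinator reads that record and either issues the remaining Try calls or, more conservatively, cancels every participant that may have been tried.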
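
A sketch of recovery from such a persisted record, assuming a conservative rule: abort unless a commit decision was logged. The JSON log format and the event names `TRY_SENT` and `COMMIT_DECIDED` are invented for this example, and the participant objects are assumed to expose idempotent `confirm` and `cancel` methods.

```python
import json


def recover(tx_id, log_lines, participants):
    """Finish one interrupted transaction from its persisted log.

    log_lines: JSON strings the coordinator appended *before* acting.
    participants: name -> object exposing confirm(tx_id) and cancel(tx_id).
    """
    records = [r for r in map(json.loads, log_lines) if r["tx"] == tx_id]
    attempted = [r["participant"] for r in records if r["event"] == "TRY_SENT"]
    decided_commit = any(r["event"] == "COMMIT_DECIDED" for r in records)

    if decided_commit:
        # Every Try succeeded before the crash, so drive Confirm everywhere.
        for p in participants.values():
            p.confirm(tx_id)
        return "COMMIT"

    # No commit decision on record: abort. Cancel only the participants whose
    # Try may have run; Cancel is idempotent, so an unneeded call is harmless.
    for name in attempted:
        participants[name].cancel(tx_id)
    return "ABORT"
```

Writing `TRY_SENT` before the call means the log can over-report attempts but never under-report them, which is the safe direction for Cancel.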

Confirm Fails Persistently

Confirm is supposed to be guaranteed-successful, but a participant might be permanently broken (corrupted state, removed account). The system needs a human-escalation path; you cannot retry forever. Track stuck transactions, alert on long-running ones.

Network Partition Between Try and Confirm

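Two different losses look the same to the coordinator, which only sees a Try that timed out. If the Try reply was lost, the participant holds a reservation the coordinator does not know about; the coordinator's Cancel finds the row and releases it normally. If the Try request itself was delayed or lost, the Cancel reaches a participant that has no reservation for that transaction. Idempotency must handle this "Cancel before known Try" case, typically as a no-op that records the cancellation, so that a Try arriving later is rejected rather than creating a reservation nobody will ever confirm or cancel.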
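
A sketch of a participant that tolerates Cancel arriving before Try. The `ReservationTable` class is hypothetical and in-memory; the point is the cancellation marker that outlives the Cancel call.

```python
class ReservationTable:
    def __init__(self, available: int):
        self.available = available
        self.status = {}     # tx_id -> "RESERVED" | "CANCELLED"
        self.amount = {}     # tx_id -> reserved quantity

    def try_reserve(self, tx_id: str, qty: int) -> bool:
        if self.status.get(tx_id) == "CANCELLED":
            return False     # Cancel got here first: refuse the late Try
        if tx_id in self.status:
            return True      # duplicate Try, reservation already held
        if self.available < qty:
            return False
        self.available -= qty
        self.status[tx_id] = "RESERVED"
        self.amount[tx_id] = qty
        return True

    def cancel(self, tx_id: str) -> None:
        if self.status.get(tx_id) == "RESERVED":
            self.available += self.amount[tx_id]   # release what was held
        # Record the cancellation even if no Try was ever seen, so a Try
        # that shows up later cannot create an orphaned reservation.
        self.status[tx_id] = "CANCELLED"
```

The markers have to be kept at least as long as a delayed Try could still be in flight; how long that is depends on the transport's retry and timeout settings.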

Try Reservation TTL Expires Before Confirm

The coordinator was slow; the participant timed out and released the reservation. Now the coordinator's Confirm arrives, and the reservation is gone. Two reasonable resolutions: extend the TTL based on coordinator heartbeats, or treat an expired reservation as cancelled and have the coordinator handle Confirm-after-Cancel as a transaction failure. The second option is only clean if no other participant has been confirmed yet; otherwise the transaction is partially applied and needs an escalation path.
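
Both resolutions can be sketched on the participant side: a heartbeat that pushes the deadline out while the reservation is still live, and a Confirm that reports expiry instead of silently succeeding. The names and the `"EXPIRED"` return value are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    expires_at: float            # absolute deadline on the participant's clock
    status: str = "RESERVED"


def heartbeat(res: Reservation, now: float, ttl: float) -> bool:
    """Extend a live reservation. Returns False if it is no longer live."""
    if res.status == "RESERVED" and res.expires_at > now:
        res.expires_at = now + ttl
        return True
    return False


def confirm(res: Reservation, now: float) -> str:
    if res.status == "CONFIRMED":
        return "CONFIRMED"                   # repeated Confirm is a no-op
    if res.status == "RESERVED" and res.expires_at > now:
        res.status = "CONFIRMED"
        return "CONFIRMED"
    res.status = "CANCELLED"                 # expired or already released
    return "EXPIRED"                         # coordinator must treat as failure
```

A coordinator receiving `"EXPIRED"` knows the resource may already belong to someone else and must not report the transaction as committed.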

Non-Idempotent Implementations

The most common production bug: a Confirm that adds to a counter or sends a notification, called twice during retry, produces double counts or double notifications. Verify idempotency rigorously in tests.
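
A double-call check is cheap to write. The helper below runs an operation twice and compares the observable state after each call; the `Counter` class, with one naive and one guarded Confirm, is a made-up example of the bug and its fix.

```python
import copy
from typing import Any, Callable


def is_idempotent(op: Callable[[], None], get_state: Callable[[], Any]) -> bool:
    """True if calling op a second time leaves the observable state unchanged."""
    op()
    after_first = copy.deepcopy(get_state())
    op()
    return get_state() == after_first


class Counter:
    def __init__(self):
        self.sold = 0
        self.confirmed = set()

    def naive_confirm(self, tx_id: str) -> None:
        self.sold += 1                    # double-counts on retry

    def guarded_confirm(self, tx_id: str) -> None:
        if tx_id in self.confirmed:
            return                        # already applied for this transaction
        self.confirmed.add(tx_id)
        self.sold += 1
```

For example, `is_idempotent(lambda: c.naive_confirm("t1"), lambda: c.sold)` is false for a fresh `Counter` `c`, while the same check against `guarded_confirm` is true. Side effects outside the service, such as notifications, need the same guard keyed by transaction ID.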

When TCC Is Wrong

  • The resource cannot be reserved. Sending an email, calling an external API that does not understand reservations, mutating a shared file. The "reservation" is the central mechanism; without it, TCC degenerates into Saga without the compensation discipline.
  • Many participants. Each participant pays the cost of implementing three methods carefully. With more than 3-4 participants per transaction, the implementation burden grows quickly.
  • Workflows longer than the comfortable reservation window. A Try that needs to be held for hours puts pressure on the resource (available counters shrink). Sagas handle long workflows better.
  • You do not own all participants. External services are unlikely to implement your TCC contract.

Frameworks

If you decide to use TCC, several open-source frameworks handle the coordinator, retries, and state persistence:

  • Seata (Alibaba) — Java, supports TCC and several other patterns. Production-proven at Alibaba scale.
  • DTM — Go-based, multi-language SDKs, supports TCC, Saga, and 2PC.
  • Hmily — another Java framework, TCC-focused.

These reduce the boilerplate substantially. If you implement TCC by hand, the coordinator's state persistence and retry logic are where most bugs live.

Further Reading

  • Pat Helland, Life Beyond Distributed Transactions: An Apostate's Opinion (CIDR 2007) — the classic argument for application-level transactions over 2PC. TCC is one realization of this view.
  • Seata documentation (seata.io) — production-quality TCC reference, in both Chinese and English.
  • Microservices Patterns by Chris Richardson, Chapter 4 — TCC sits adjacent to the saga material; useful as comparative reading.
  • Garcia-Molina & Salem, Sagas (SIGMOD 1987) — the predecessor; TCC is essentially saga + explicit reservation phase.

Pre-commit Checklist

  • For each participant in my TCC transaction, are Try / Confirm / Cancel implemented as separate methods with idempotency guaranteed?
  • Does my Try reserve resources at the application layer without holding database locks?
  • Is my Confirm / Cancel retried with backoff until success? Is there an escalation path for permanent failures?
  • Do my Try reservations have a TTL? Is there a background process releasing expired reservations?
  • Have I tested coordinator-crash, network-partition, and double-call scenarios — not just the happy path?
  • For each business resource: can it actually be modeled as reservable? If not, TCC is the wrong tool.
