Steven's Knowledge

Logical Clocks

Lamport timestamps, vector clocks, version vectors, and hybrid logical clocks — how to order events when wall-clock time cannot be trusted

Logical Clocks

The deepest mistake new distributed-systems engineers make is trusting timestamps. A row written at 2024-01-15T10:00:00.123Z and another at 2024-01-15T10:00:00.124Z may well have been written in the opposite order from what those numbers suggest — the two nodes' clocks differ by more than a millisecond, NTP just stepped one of them, a leap second is being smeared, a VM was paused. Wall-clock time is not a global order, and using it as one is a recipe for silent corruption.

The fix is to define an ordering that only uses information the system already has: who sent a message to whom, who saw whose write. That ordering is what logical clocks give you.

Why Physical Clocks Are Broken

Physical clocks suffer three problems at once:

  • Skew. Two nodes' clocks differ. NTP corrects to within milliseconds on a good day, tens of milliseconds in the cloud, hundreds during incidents. PTP can do microseconds but requires specialized hardware.
  • Drift. Even a perfect clock drifts (a few parts per million is typical). Between syncs, the error grows.
  • Jumps. When NTP corrects a large error, the clock can jump backward. Leap seconds, VM live migrations, and Docker time skews add more discontinuities.

For application timestamps in logs, "approximately when did this happen?" works fine. For ordering — "did event A happen before event B?" — it does not. You need a different abstraction.

The Happens-Before Relation

Lamport's 1978 paper begins by defining a relation (read "happens before") on events:

  1. If a and b are events in the same process, and a comes first in that process, then a → b.
  2. If a is the send of a message and b is the receive of that same message, then a → b.
  3. The relation is transitive: if a → b and b → c, then a → c.

Two events are concurrent (a ∥ b) if neither a → b nor b → a. Concurrency is not "happened at the same time" — it means there is no causal chain between them. Concurrent events can happen in any order, including simultaneously or years apart.

This is the relation that underlies causal consistency: a model where writes that are causally related must be observed in order, while concurrent writes can be observed in any order.

This is the only ordering a distributed system can determine without external clocks. Everything that follows is a way to make it cheap to compute.

Lamport Timestamps

The simplest logical clock. Each process keeps an integer counter L. Rules:

  • On a local event: L = L + 1.
  • On send: increment L, attach it to the message.
  • On receive of message with timestamp t: L = max(L, t) + 1.
Process P1:  L=1 ──send msg(1)──▶
                                  \
Process P2:  L=4                   \
              \                     \
               local event (L=5)     ▶ recv → L = max(5, 1)+1 = 6

Property: if a → b, then L(a) < L(b). The converse is not true — L(a) < L(b) does not imply a → b. Lamport timestamps give you a total order consistent with causality, but they cannot tell you whether two events are concurrent or causally related.

Use: assigning a coarse "logical clock" to log lines for debugging, breaking ties in deterministic algorithms, defining a serial order in DBs that have no other natural one.

Don't use when you need to detect concurrency. Lamport timestamps will happily assign a strict order to two events that have no causal relation, and the application cannot tell.

Vector Clocks

The fix to Lamport's limitation. Each process i keeps a vector V[1..N] indexed by process. Rules:

  • On a local event in process i: V[i] = V[i] + 1.
  • On send from process i: increment V[i], attach the whole vector.
  • On receive at process j of vector W: V[k] = max(V[k], W[k]) for all k, then V[j] = V[j] + 1.

Comparison rules between two vectors V and W:

  • V < W if V[k] ≤ W[k] for all k and V ≠ W. Then the event with V happened-before the one with W.
  • V > W symmetrically.
  • Otherwise (some entries higher in V, others higher in W), the two events are concurrent.
P1: [0,0,0] ── e1 ─▶ [1,0,0] ──send─▶
                                       \
P2: [0,0,0] ── e2 ─▶ [0,1,0]            \
                       \                  \
                        recv ─▶ [1,2,0]    
                          
Comparison of e1=[1,0,0] vs e2=[0,1,0]:
  e1[1]=1 > e2[1]=0  but  e1[2]=0 < e2[2]=1
  → concurrent (e1 ∥ e2)

Vector clocks are the canonical mechanism for detecting concurrent updates. Dynamo, Riak, and many CRDT systems use them to recognize conflicts.

Cost: the vector grows linearly with the number of processes. For 1000 nodes, every message carries 1000 counters. This is why bare vector clocks rarely scale beyond a few dozen actors.

Version Vectors and Dotted Version Vectors

For replicated data (not just events), the relevant variant is the version vector: one entry per replica, attached to a data item. Two replicas can compare an item's version vector to decide which is newer, or detect that both diverged.

Naive version vectors have a subtle problem in systems with many transient clients: every client that writes is an "actor" and gets its own entry, blowing up the vector. Dotted Version Vectors (DVVs), introduced by Preguiça et al. (2010), separate the server replica entries from individual client write events. They are the modern fix and what Riak uses today.

Hybrid Logical Clocks

Logical clocks order events but discard real time. Wall-clock timestamps approximate real time but cannot be trusted for ordering. Hybrid Logical Clocks (Kulkarni et al., 2014) combine them: a 64-bit timestamp where the high bits are physical time and the low bits are a Lamport counter that increments when physical time has not advanced.

HLC = (physical_time, logical_counter)

Tick rules:
  on local event:
    pt = max(physical_now(), current.physical_time)
    if pt == current.physical_time:
      logical = current.logical + 1
    else:
      logical = 0
  
  on receive of HLC' = (pt', l'):
    pt = max(physical_now(), current.physical_time, pt')
    logical = appropriate increment to maintain monotonicity

Properties:

  • Within bounded clock skew, an HLC is close to wall-clock time, so it is readable and useful for debugging.
  • It is monotonic and respects happens-before, like a Lamport timestamp.
  • It is a single 64-bit number, so cheap to store and compare.

HLCs are what CockroachDB and YugabyteDB use to order transactions across nodes without Spanner's TrueTime hardware. If you are building a new system that needs a usable timestamp, this is the modern default.

Where Each One Fits

NeedUse
Approximate "when did this happen" for logs and tracesWall clock (NTP)
Total order across nodes, no concurrency detectionLamport timestamp
Detect concurrent updates / conflicts on shared dataVector clock / version vector
Many transient writers against few replicasDotted Version Vector
Transaction timestamps with readability + orderingHybrid Logical Clock
Global linearizable timestamps without coordinationSpanner-style TrueTime (requires GPS / atomic clocks)

Practical Notes

  • Distributed traces (OpenTelemetry et al.) are not ordered by their timestamps. They are ordered by parent-child span relations, which is a happens-before in disguise. When traces look out of order, the timestamps are lying and the span graph is right.
  • Last-Writer-Wins (LWW) using wall-clock time is broken. A node with a clock 2 seconds fast will silently overwrite newer writes. Use an HLC, or define LWW over a logical clock, or do not use LWW.
  • Causality metadata costs real bytes. Vector clocks on every message in a chatty service can be expensive. Many systems compress, truncate, or rely on session-level metadata instead of per-item vectors.
  • "Bounded skew" assumptions matter. Spanner's TrueTime assumes a known maximum clock uncertainty (the ε); Spanner waits before committing. If your assumption is wrong, your guarantees evaporate.

Further Reading

  • Lamport, Time, Clocks, and the Ordering of Events in a Distributed System (CACM 1978) — the founding paper. Eight pages, still the best introduction.
  • Fidge, Timestamps in Message-Passing Systems That Preserve the Partial Ordering (1988) — vector clocks.
  • Preguiça et al., Dotted Version Vectors: Logical Clocks for Optimistic Replication (2010) — the modern fix for version vectors with many clients.
  • Kulkarni et al., Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases (2014) — HLCs.
  • Corbett et al., Spanner: Google's Globally-Distributed Database (OSDI 2012) — TrueTime and the bounded-uncertainty approach.

Pre-commit Checklist

  • For each "is A before B?" decision in my code, am I using wall-clock time? If so, can I justify it (audit log only, no correctness depends on it)?
  • For conflict detection, am I using a vector clock or version vector, not a single timestamp?
  • If I am using LWW, what happens when one of my nodes' clocks is 5 seconds off? Have I tested that?
  • For trace / event ordering in observability, am I sorting by span parent chain, not by wall-clock?
  • If I have a "bounded skew" assumption, is it written down and monitored?

On this page