Steven's Knowledge

KV Cache Internals

How transformers avoid redundant attention computation and why it dominates GPU memory

If you're running inference at any scale, the KV cache is the single biggest consumer of your GPU memory and the primary lever for throughput. Every optimization in modern inference serving — PagedAttention, continuous batching, prefix caching — is fundamentally about managing this one data structure more intelligently.

What the KV Cache Actually Is

During autoregressive generation, a transformer produces one token at a time. To generate token N, the model needs to attend over all previous tokens 0 through N-1. Without caching, this means recomputing the key and value projections for every previous token at every generation step — O(N^2) total computation for a sequence of length N.

The KV cache stores the key and value tensors for all previously processed tokens, so each new token only needs to compute its own K and V, then attend over the cached history. This brings per-step cost down to O(N) — a massive saving for long sequences.

Memory Math

The KV cache size scales with: 2 × num_layers × num_heads × head_dim × sequence_length × bytes_per_element.

For a 70B-parameter model with 80 layers, 64 heads, 128 head_dim, in FP16:

  • Per token: 2 × 80 × 64 × 128 × 2 bytes = ~2.6 MB
  • For a 4096-token sequence: ~10.5 GB
  • For a 128K-token sequence: ~330 GB

That last number exceeds the memory of most single GPUs. This is why KV cache management is the bottleneck — not model weights, not compute, but storing all the cached attention state for concurrent requests.

PagedAttention and vLLM

The breakthrough insight from vLLM (2023): treat KV cache memory like an operating system treats virtual memory.

Before PagedAttention, inference engines pre-allocated contiguous memory for each request's full sequence length. A request that might generate 2048 tokens had 2048 tokens of KV cache reserved upfront, even if it only used 200. This internal fragmentation wasted 60–80% of KV cache memory.

PagedAttention fixes this:

  1. KV cache is divided into fixed-size blocks (pages), typically 16 tokens each.
  2. Blocks are allocated on demand as the sequence grows.
  3. A block table maps logical positions to physical memory locations.
  4. Blocks don't need to be contiguous in physical memory.

The result: near-zero memory waste, 2–4x more concurrent requests on the same GPU.

Cache Eviction Strategies

When GPU memory fills up, something has to give. Common eviction strategies:

  • FIFO (First In, First Out) — evict the oldest request's cache. Simple, but penalizes long-running generations.
  • LRU (Least Recently Used) — evict the cache that hasn't been accessed longest. Better for interactive workloads.
  • Priority-based — assign priorities to requests; evict lowest priority first. Useful when some users or tasks matter more.
  • Preemption with swap — write evicted KV cache to CPU memory or disk, reload later. Adds latency but avoids recomputation.

vLLM uses a preemption approach: when memory pressure is high, lower-priority sequences have their KV blocks swapped to CPU memory. When space frees up, they're swapped back. This is slower than keeping everything on GPU but faster than recomputing from scratch.

Prefix Caching at the Inference Level

This is the inference-engine equivalent of API-level prompt caching, but happening at the GPU level:

  • The engine detects that multiple requests share the same prompt prefix.
  • It computes the KV cache for that prefix once.
  • Subsequent requests with the same prefix share the cached KV blocks via copy-on-write.

vLLM, TensorRT-LLM, and SGLang all support some form of this. The savings compound with:

  • System prompts reused across thousands of requests.
  • Few-shot examples that are identical for every call.
  • RAG prefixes where the same retrieved documents appear in multiple requests.

This is also called automatic prefix caching or RadixAttention (SGLang's term). It builds a radix tree of token sequences, and any request that shares a prefix gets free KV reuse.

Multi-Tenant Cache Sharing

In serving platforms that handle traffic for multiple models or users, cache sharing becomes a scheduling problem:

  • Cross-request sharing — requests to the same model with overlapping prefixes share KV blocks. This is the prefix caching case above.
  • Cross-user isolation — even with shared prefixes, you may need to ensure one user's cache doesn't leak information to another. The KV cache itself doesn't contain anything not derivable from the input, but access patterns can leak.
  • Fair scheduling — a user sending large-batch requests shouldn't starve other users' KV cache allocations.

Production inference platforms typically run a memory manager that tracks block ownership, reference counts, and access patterns — conceptually very similar to an OS memory manager.

Practical Implications for Application Developers

Even if you're not running your own inference stack, KV cache mechanics affect you:

  • Longer prompts = slower first token because prefill has to process more tokens before generation starts.
  • Long output sequences get slower per token as the attention computation grows linearly with the KV cache length.
  • Concurrent requests compete for KV memory — high concurrency with long contexts can cause queuing or OOM.
  • Prompt structure matters — identical prefixes across requests enable sharing; randomized prefixes prevent it.

Understanding the KV cache turns vague intuitions about "inference is slow" into actionable engineering decisions.

On this page