Prompt Caching
Reusing computation on shared prompt prefixes to cut cost and latency at the API level
Every time you send a long system prompt to an LLM API, the provider recomputes the same KV cache from scratch. Prompt caching lets you pay for that prefix computation once and reuse it across requests. Anthropic and OpenAI both support this now, and for applications with large, stable system prompts, the savings are dramatic — 90% cost reduction on cached tokens and noticeably faster time-to-first-token.
How Prompt Caching Works
The core idea: the provider stores the KV cache computed from a prefix of your prompt. On the next request, if the prefix matches exactly, the provider loads the cached KV state instead of recomputing it.
- You send a request with a long prompt (system instructions, few-shot examples, documents).
- The provider computes the KV cache and stores it, keyed by the exact token sequence.
- On your next request, if the prefix matches, the provider skips recomputation and starts generating from the cached state.
- You pay a reduced rate for the cached tokens.
The match is exact and prefix-based — one token difference at the start invalidates the cache. Everything after the cached prefix is computed normally.
Anthropic Prompt Caching
Anthropic's implementation is explicit — you mark which parts of the prompt to cache:
- Add
cache_controlbreakpoints in your message list to tell the API where to cache. - Cached tokens cost 90% less than regular input tokens.
- There's a small write cost on the first request (10% premium), amortized over subsequent cache hits.
- Cache has a 5-minute TTL — refreshed on each hit. Active conversations keep the cache warm naturally.
- Minimum cacheable prefix is 1024 tokens (Claude 3.5 Sonnet) or 2048 tokens (Claude 3.5 Haiku).
The explicit control is the advantage here. You decide exactly what to cache, which makes behavior predictable.
OpenAI Prefix Caching
OpenAI's approach is automatic — no code changes needed:
- The API automatically caches prompts longer than 1024 tokens.
- Cached tokens are billed at 50% off the regular input price.
- Caching happens automatically when the same prefix appears across requests.
- No TTL is publicly documented; caching behavior is best-effort.
The automatic approach is simpler to adopt but gives you less control over what gets cached and when.
KV Cache Reuse Under the Hood
What's actually being cached is the key-value pairs from the transformer's attention mechanism, computed during the prefill phase. This is the expensive part of processing a long prompt — the model has to attend over every token in the prefix. By storing these KV pairs, the provider skips all that matrix multiplication on subsequent requests.
This is why the match must be prefix-based and exact. The KV cache at position N depends on every token from position 0 to N-1. Change one early token and every subsequent KV pair changes.
Cache-Friendly Prompt Design
To maximize cache hits, structure your prompts deliberately:
- Put stable content first. System instructions, persona definitions, and tool schemas should be at the top. These rarely change between requests.
- Put variable content last. User messages, retrieved documents, and conversation history go after the stable prefix.
- Standardize formatting. Whitespace differences break the cache. Use a prompt template that produces identical prefixes.
- Batch similar requests. If you're processing many items with the same instructions, send them sequentially to keep the cache warm.
- Don't randomize few-shot examples. Pick a fixed set and always include them in the same order.
The ideal prompt structure: [system instructions] → [tool definitions] → [few-shot examples] → [retrieved context] → [conversation history] → [current user message].
Cost Savings Math
The economics are compelling. Consider an application with a 4000-token system prompt and 500-token average user messages:
- Without caching: 4500 input tokens billed at full rate every request.
- With Anthropic caching: 4000 tokens at 10% of input rate + 500 tokens at full rate. That's roughly 80% savings on input costs.
- With OpenAI caching: 4000 tokens at 50% of input rate + 500 tokens at full rate. That's roughly 40% savings on input costs.
For RAG applications stuffing 20K tokens of retrieved context into every call, the savings are even larger. The longer your stable prefix relative to the variable suffix, the more you save.
When to Use Prompt Caching
Use it when:
- Your system prompt is large (1K+ tokens) and stable across requests.
- You're running high-volume production traffic with similar prompts.
- Latency matters — cached prefill is meaningfully faster.
- You're doing RAG with a stable retrieval set for the same user session.
Skip it when:
- Every request has a completely different prompt (no shared prefix).
- Your system prompt is short (under 1024 tokens) — not enough to cache.
- Request volume is too low to hit the cache before TTL expires.
- You're already optimizing at a different layer (e.g., semantic caching at the application level).
Prompt caching is the lowest-effort, highest-reward caching strategy for most production LLM applications. If you're spending real money on a long system prompt, turn it on before you try anything fancier.