Cache Invalidation

The hardest problem in LLM caching — knowing when your cached answers are wrong

There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors. For LLM applications, cache invalidation is genuinely the hard one. You cached an answer that was correct yesterday, but today the model changed, the prompt changed, or the knowledge base changed — and now you're confidently serving stale garbage.

Why LLM Cache Invalidation Is Harder

Traditional cache invalidation is (conceptually) simple: the source of truth changes, you bust the cache. But LLM caches have multiple sources of truth that can change independently:

The model — a provider-side model update silently changes outputs for the same inputs.
The prompt — you edit system instructions, and every cached response from the old prompt is now wrong.
The knowledge base — your RAG documents get updated, but cached responses still reflect the old data.
The world — facts change. "Who is the president?" has a different answer than it did last year.

A change to any one of these should invalidate the affected cache entries. Most teams handle none of them well.

TTL Policies

The simplest invalidation strategy: every cache entry expires after a fixed time-to-live.

Aggressive TTL (minutes to hours) — safe but low hit rates. Use when freshness matters more than cost savings.
Moderate TTL (hours to days) — the default for most applications. Good balance for knowledge that changes occasionally.
Long TTL (days to weeks) — only for truly static content. Product descriptions that never change, foundational reference material.

TTL is a blunt instrument. It doesn't know whether the underlying data actually changed — it just assumes it might have. But blunt instruments are reliable, and reliability beats cleverness in production.
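As a rough illustration of the fixed-TTL approach, here is a minimal in-memory sketch. The class name `TTLCache` and its `clock` parameter are hypothetical choices made for this example; a production system would more likely rely on the expiry support of whatever cache store it already uses.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory cache whose entries expire after a fixed time-to-live."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.ttl_seconds = ttl_seconds
        # The clock is injectable so expiry can be tested without sleeping.
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the expiry time alongside the value.
        self._store[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry and report a miss.
            del self._store[key]
            return None
        return value
```

Note that the cache never inspects the underlying data. An entry is served until its time runs out and dropped afterwards, whether or not anything actually changed.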

Content-Hash Keying

A smarter approach: include a hash of the inputs in the cache key so the cache automatically invalidates when inputs change.

Build your cache key from:

Model identifier + version — when the model changes, the key changes.
Hash of the full prompt — system prompt + user message + any injected context.
Hash of retrieved documents — if you're doing RAG, include the document content or version.
Any parameters that affect output — temperature, top_p, max_tokens.

import hashlib
cache_key = hashlib.sha256(repr((model_id, system_prompt, user_message, doc_hashes, temperature, top_p, max_tokens)).encode("utf-8")).hexdigest()

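This handles prompt changes and retrieval changes automatically. It handles model updates only if the model identifier in the key is a pinned version: an auto-updating alias keeps the same name when the model behind it changes, so track model versions explicitly (which you should).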
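One possible way to build such a key in Python is sketched below. The function names `make_cache_key` and `hash_document`, the default parameter values, and the choice of SHA-256 over canonical JSON are assumptions made for illustration, not a prescribed scheme.

```python
import hashlib
import json
from typing import Sequence


def hash_document(text: str) -> str:
    """Content hash for one retrieved document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_cache_key(
    model_id: str,
    system_prompt: str,
    user_message: str,
    doc_hashes: Sequence[str] = (),
    temperature: float = 0.0,
    top_p: float = 1.0,
    max_tokens: int = 1024,
) -> str:
    """Return a SHA-256 hex digest covering every input that affects the output."""
    payload = {
        "model_id": model_id,
        "system_prompt": system_prompt,
        "user_message": user_message,
        # Sorted so the same document set retrieved in a different order maps
        # to the same key. Drop the sort if document order matters in your prompt.
        "doc_hashes": sorted(doc_hashes),
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    # sort_keys plus fixed separators give a canonical serialization, so
    # identical inputs always produce the same key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A cryptographic digest is used here instead of Python's built-in `hash()`, which accepts a single object and is randomized per process for strings by default, so its values would not be stable across restarts or between workers.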

When to Bust Cache

Specific triggers that should invalidate cache entries:

Model update — the provider ships a new model version. If you call an auto-updating alias (e.g., claude-sonnet-4) rather than a dated version (e.g., claude-sonnet-4-20250514), the model behind the name can change and your cached outputs may no longer match what it would produce. Pin model versions and include them in cache keys.
Prompt change — you edited the system prompt. Every cached entry generated under the old prompt is suspect. If you use content-hash keying, this is handled automatically.
Knowledge base update — new documents ingested, old ones updated or deleted. Cached RAG responses may reference outdated information. Hash the retrieved document set as part of the cache key.
Tool or schema change — you added, removed, or modified tools available to the model. Cached tool-calling responses may reference tools that no longer exist.
Business logic change — pricing changed, policies changed, product information changed. Anything the model references from dynamic context.
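When a trigger cannot be folded into the cache key, one alternative is to record what each entry depends on and delete by dependency when the trigger fires. The sketch below uses dependency tagging, a technique swapped in for illustration rather than one named above. The class `TaggedCache` and tag strings such as `doc:refund` are hypothetical.

```python
from typing import Any, Dict, Iterable, Optional, Set


class TaggedCache:
    """Cache whose entries record their dependencies so a trigger can bust them."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._tags_by_key: Dict[str, Set[str]] = {}
        self._keys_by_tag: Dict[str, Set[str]] = {}

    def set(self, key: str, value: Any, tags: Iterable[str]) -> None:
        self.delete(key)  # clear old tag links before overwriting
        tag_set = set(tags)
        self._values[key] = value
        self._tags_by_key[key] = tag_set
        for tag in tag_set:
            self._keys_by_tag.setdefault(tag, set()).add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._values.get(key)

    def delete(self, key: str) -> None:
        self._values.pop(key, None)
        for tag in self._tags_by_key.pop(key, set()):
            keys = self._keys_by_tag.get(tag)
            if keys is not None:
                keys.discard(key)

    def invalidate_tag(self, tag: str) -> int:
        """Remove every entry that depends on `tag`; return how many were removed."""
        keys = list(self._keys_by_tag.pop(tag, set()))
        for key in keys:
            self.delete(key)
        return len(keys)
```

With this shape, a knowledge base update might call `invalidate_tag("doc:refund")` and a tool change might call `invalidate_tag("tools")`, leaving unrelated entries in place.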

Versioned Caching

The most robust strategy for production systems: version everything and include versions in cache keys.

Prompt version — every edit to the system prompt increments a version counter, e.g. prompt_v23.
Knowledge version — every knowledge base update gets a version, e.g. kb_v147.
Model version — pin to a dated model version, e.g. claude-sonnet-4-20250514.
Schema version — tool definitions get their own version, e.g. tools_v8.

Your cache key becomes: f"{model_version}:{prompt_version}:{kb_version}:{tools_version}:{query_hash}".

When any version bumps, all old cache entries naturally miss. You don't need to actively delete anything — they just stop matching. Old entries can be garbage-collected on their normal TTL.

This is more bookkeeping upfront but eliminates the entire class of "we changed something and forgot to invalidate the cache" bugs.
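The versioned key format can be sketched as follows. The `CacheVersions` dataclass and the `versioned_key` helper are hypothetical names, and a plain dict stands in for the real cache store.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CacheVersions:
    model_version: str
    prompt_version: str
    kb_version: str
    tools_version: str


def versioned_key(versions: CacheVersions, query: str) -> str:
    """Build the version-prefixed cache key for a query."""
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return (
        f"{versions.model_version}:{versions.prompt_version}:"
        f"{versions.kb_version}:{versions.tools_version}:{query_hash}"
    )


current = CacheVersions(
    "claude-sonnet-4-20250514", "prompt_v23", "kb_v147", "tools_v8"
)
query = "What is our refund policy?"
cache = {versioned_key(current, query): "cached answer"}

# A knowledge base update bumps one version. Nothing is deleted.
bumped = replace(current, kb_version="kb_v148")

hit_before = versioned_key(current, query) in cache  # True
hit_after = versioned_key(bumped, query) in cache    # False: the old entry no longer matches
```

The old entry is still physically present after the bump. It is simply unreachable under the new key and can be left to expire on its TTL.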

Layered Invalidation

In practice, different parts of your cache need different invalidation strategies:

Semantic cache (application level) — TTL + content-hash keying. Check false-positive rate regularly.
Prompt cache (API level) — handled by the provider. You control it by structuring prompts for maximum prefix stability.
RAG result cache — invalidate on knowledge base update. Version your document corpus.
Embedding cache — almost never needs invalidation unless you change embedding models.

Each layer has different freshness requirements. A cached embedding stays correct for as long as the embedding model is unchanged; a stale answer to "what's our refund policy?" could be a liability.
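One way to make the per-layer differences explicit is a small policy table that states, for each application-side cache, its TTL and which versions feed its key. The names `LayerPolicy`, `POLICIES`, and `layer_key` and every TTL value below are illustrative assumptions. The provider-managed prompt cache is left out because it is not controlled from application code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LayerPolicy:
    ttl_seconds: Optional[int]     # None means no time-based expiry
    key_versions: Tuple[str, ...]  # version names included in the cache key


# Illustrative values only; tune them to your own freshness requirements.
POLICIES: Dict[str, LayerPolicy] = {
    "semantic": LayerPolicy(6 * 3600, ("model", "prompt", "kb", "tools")),
    "rag_results": LayerPolicy(24 * 3600, ("kb",)),
    "embeddings": LayerPolicy(None, ("embedding_model",)),
}


def layer_key(layer: str, versions: Dict[str, str], content_hash: str) -> str:
    """Build a cache key that depends only on the versions this layer cares about."""
    policy = POLICIES[layer]
    parts = [layer]
    parts += [versions[name] for name in policy.key_versions]
    parts.append(content_hash)
    return ":".join(parts)
```

Under this sketch, bumping the knowledge base version changes the keys of the semantic and RAG result layers while embedding keys stay the same.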

The Monitoring Imperative

No invalidation strategy is complete without monitoring. Track:

Staleness rate — sample cached responses and check them against fresh model outputs. How often do they disagree?
Invalidation frequency — how often are cache entries being busted? Too often means your cache is providing little value. Too rarely means you're probably serving stale content.
Cache age distribution — how old are the entries being served? A healthy cache has a mix; all-old entries suggest inadequate invalidation.
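The staleness rate can be estimated by sampling, roughly as sketched below. The function `estimate_staleness_rate` and its parameters are hypothetical: `regenerate` stands for whatever produces a fresh response for a cached key, and `agree` for whatever comparison you trust (exact match, a similarity threshold, or a judge model).

```python
import random
from typing import Callable, Dict, Optional


def estimate_staleness_rate(
    cache: Dict[str, str],
    regenerate: Callable[[str], str],
    agree: Callable[[str, str], bool],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Fraction of sampled cached responses that disagree with a fresh response."""
    rng = rng or random.Random()
    # Sample without replacement, capped at the number of cached entries.
    count = max(0, min(sample_size, len(cache)))
    keys = rng.sample(sorted(cache), count)
    if not keys:
        return 0.0  # nothing sampled, so nothing measured
    stale = sum(1 for key in keys if not agree(cache[key], regenerate(key)))
    return stale / len(keys)
```

Because model outputs vary between calls, an exact string comparison will usually overstate staleness. The choice of `agree` matters more than the sampling logic.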

The uncomfortable truth: if you can't measure your staleness rate, you don't know whether your cache is helping or hurting. And "we haven't gotten any complaints" is not a staleness measurement.