Semantic Caching
Using embedding similarity to serve cached LLM responses for "close enough" queries
The idea is simple: if someone already asked roughly the same question, return the same answer without calling the model again. Semantic caching embeds incoming queries, compares them against cached query embeddings, and returns a cached response when the similarity is above a threshold. In theory, it's free latency and cost savings. In practice, it's a minefield of false positives.
How It Works
- Embed the incoming user query with a fast embedding model.
- Search a vector store of previously cached query embeddings.
- If the top match exceeds a similarity threshold (typically 0.95+), return the cached response.
- Otherwise, call the LLM, cache the new query-response pair, return the fresh response.
The embedding model, the distance metric, and the threshold are the three knobs that determine whether this helps or hurts.
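The four steps above can be sketched in a few lines. This is a minimal illustration, not a production design: `SemanticCache`, `answer`, `embed`, and `call_llm` are hypothetical names, the linear scan stands in for a real vector store, and cosine similarity is assumed as the metric.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Brute-force semantic cache (hypothetical); swap the list for a vector store."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # assumed: any function mapping text to a vector
        self.threshold = threshold  # minimum similarity that counts as a hit
        self.entries: List[Tuple[Vector, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        """Return the best match's cached response, or None on a miss."""
        query_vec = self.embed(query)
        best_score, best_response = -1.0, None
        for vec, response in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        # Only the top match is considered, and only if it clears the threshold.
        return best_response if best_score >= self.threshold else None

    def store(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


def answer(query: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise call the model and cache the result."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    response = call_llm(query)
    cache.store(query, response)
    return response
```

All three knobs are visible here: `embed` is the embedding model, `cosine_similarity` is the metric, and `threshold` decides what counts as "close enough".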
Choosing a Similarity Threshold
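This is the single most important decision, and there's no universal right answer. Treat the ranges below as rough guides for cosine similarity; scores are not comparable across embedding models, so calibrate on your own data: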
- 0.98+ — very conservative. Only near-exact rephrasings match. Low cache hit rate but almost no false positives.
- 0.95–0.97 — the sweet spot for many applications. Catches paraphrases, misses genuinely different questions most of the time.
- Below 0.93 — dangerous territory. "How do I reset my password?" starts matching "How do I change my email?" You will serve wrong answers.
Start conservative and lower the threshold only after measuring false-positive rates on your actual traffic.
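One way to measure before lowering the threshold is to replay logged traffic, have a human label whether each top match's cached answer would have been correct, and sweep candidate thresholds over those labels. The sketch below assumes that labeled data already exists; `sweep_thresholds` and the `(score, is_correct)` tuple shape are illustrative choices, not part of any library.

```python
from typing import Dict, List, Sequence, Tuple

# (similarity of the top cached match, whether a human judged its answer correct)
LabeledMatch = Tuple[float, bool]


def sweep_thresholds(
    matches: Sequence[LabeledMatch], thresholds: Sequence[float]
) -> List[Dict[str, float]]:
    """For each threshold, report the hit rate and the false-positive rate among hits."""
    total = len(matches)
    results = []
    for t in thresholds:
        # A query would be served from cache if its top match scores at or above t.
        hits = [is_correct for score, is_correct in matches if score >= t]
        wrong = sum(1 for is_correct in hits if not is_correct)
        results.append({
            "threshold": t,
            "hit_rate": len(hits) / total if total else 0.0,
            "false_positive_rate": wrong / len(hits) if hits else 0.0,
        })
    return results
```

Reading the output from the highest threshold down shows how much hit rate each step buys and how many wrong answers it costs.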
Cache Hit Rate Optimization
A semantic cache with a 2% hit rate isn't worth the complexity. To improve hit rates without giving up correctness:
- Normalize queries before embedding — lowercase, strip punctuation, remove filler words.
- Use the system prompt + user message as the cache key, not just the user message. Two identical questions in different contexts need different answers.
- Cluster your traffic first. If your top 100 queries cover 40% of traffic, semantic caching will shine. If every query is unique, it won't.
- Cache at the right granularity — sometimes caching sub-steps (tool calls, retrieval results) is more effective than caching final responses.
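The normalization and traffic-clustering advice can be checked with a small script before building anything. The sketch below is an assumption-laden starting point: `FILLER_WORDS` is an illustrative list to tune per application, and `top_n_coverage` only counts exact matches after normalization, which gives a lower bound on how repetitive the traffic is.

```python
import string
from collections import Counter
from typing import Iterable

# Illustrative filler list; negations such as "not" are deliberately excluded
# because removing them would change the meaning of the query.
FILLER_WORDS = {"please", "um", "uh", "hey", "hi", "just", "kindly"}

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_query(query: str) -> str:
    """Lowercase, strip ASCII punctuation, and drop filler words."""
    words = query.lower().translate(_PUNCT_TABLE).split()
    return " ".join(w for w in words if w not in FILLER_WORDS)


def top_n_coverage(queries: Iterable[str], n: int) -> float:
    """Fraction of traffic covered by the n most frequent normalized queries."""
    counts = Counter(normalize_query(q) for q in queries)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(n)) / total
```

For example, `top_n_coverage(logged_queries, 100)` on a sample of real traffic (where `logged_queries` is whatever query log is available) gives a quick read on whether the "top 100 queries" case applies.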
GPTCache and Similar Tools
GPTCache (open source) was the first popular library for this. It bundles embedding, vector search, and cache management. Other options:
- Redis with vector search — if you already run Redis, adding semantic caching is straightforward.
- Custom pipeline — an embedding model + pgvector or Qdrant + a thin wrapper. More control, more maintenance.
- Managed offerings — some LLM gateways (Portkey, Helicone) offer built-in semantic caching.
The tool matters less than the tuning. An off-the-shelf library with a bad threshold will serve wrong answers; a hand-rolled solution with careful threshold management will work fine.
When Semantic Caching Works
It works best when:
- Traffic is repetitive — customer support, FAQ bots, internal knowledge bases.
- Answers are stable — the same question should return the same answer regardless of when it's asked.
- Latency matters more than freshness — users want instant responses and can tolerate slightly stale content.
- The query space is bounded — thousands of common questions, not millions of unique ones.
When Semantic Caching Fails
It fails — sometimes dangerously — when:
- Answers depend on context not captured in the query. "What's my balance?" means different things for different users.
- Small wording changes flip the intent. "Can I cancel?" vs "Can I not cancel?" look similar in embedding space.
- The underlying data changes often. Cached responses go stale and you serve outdated information.
- You're caching chain-of-thought or reasoning. The same question might need different reasoning paths.
The fundamental tension: embeddings capture semantic similarity, but similar questions don't always have similar answers. Every semantic cache deployment needs monitoring for false positives — not just at launch, but continuously.
Measuring What Matters
Track these metrics from day one:
- Cache hit rate — what fraction of requests are served from cache.
- False positive rate — sample cached responses and have humans judge correctness. This is the metric that kills you if you ignore it.
- Latency savings — the whole point. Measure P50 and P99 for cached vs uncached.
- Cost savings are straightforward: a cache hit costs embedding inference plus a vector lookup instead of LLM inference. Misses pay for both, so weigh the savings against the hit rate.
If your false-positive rate is above 1–2%, tighten the threshold or reconsider whether semantic caching is right for your use case.
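As a rough sketch of how these metrics might be computed from request logs: the `RequestRecord` fields and the `summarize` function below are hypothetical, and `judged_correct` is assumed to be filled in only for the cache hits that were sampled for human review.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class RequestRecord:
    cache_hit: bool
    latency_ms: float
    judged_correct: Optional[bool] = None  # set only for hits sampled for review


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[RequestRecord]) -> Dict[str, Optional[float]]:
    """Hit rate, sampled false-positive rate, and P50/P99 latency for hits vs misses."""
    hits = [r for r in records if r.cache_hit]
    misses = [r for r in records if not r.cache_hit]
    reviewed = [r for r in hits if r.judged_correct is not None]
    wrong = sum(1 for r in reviewed if not r.judged_correct)

    def lat(rs: List[RequestRecord], pct: float) -> Optional[float]:
        return percentile([r.latency_ms for r in rs], pct) if rs else None

    return {
        "hit_rate": len(hits) / len(records) if records else 0.0,
        # None means no hits have been reviewed yet, which is different from 0%.
        "false_positive_rate": wrong / len(reviewed) if reviewed else None,
        "hit_p50_ms": lat(hits, 50),
        "hit_p99_ms": lat(hits, 99),
        "miss_p50_ms": lat(misses, 50),
        "miss_p99_ms": lat(misses, 99),
    }
```

Returning `None` for an unreviewed false-positive rate keeps a dashboard from showing a reassuring 0% when nobody has actually checked the cached answers.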