The caching opportunity
LLM inference is expensive, and a surprising amount of it is redundant. In most production applications, many requests share common prefixes (system prompts, few-shot examples), similar queries repeat frequently, and deterministic tasks produce the same output for the same input.
Caching exploits these patterns to reduce both cost and latency. The challenge is that LLM caching is more nuanced than traditional web caching because the inputs and outputs are high-dimensional, similarity is semantic rather than exact, and stale caches can serve subtly wrong results.
Strategy 1 — Exact response caching
The simplest form: hash the complete input (system prompt + user message + all parameters) and cache the output. If an identical request arrives, serve the cached response.
This works for deterministic tasks: classification, extraction, translation of fixed inputs. The cache hit rate depends on how many of your requests are exact duplicates — which is higher than most teams expect. Customer support systems, FAQ bots, and data processing pipelines often see 20–30% exact-match rates.
Implementation is straightforward: a key-value store (Redis, DynamoDB) with the input hash as the key and the response as the value. Set appropriate TTLs based on how quickly your data changes.
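As a minimal sketch, the exact-match tier is a hash of the serialized request mapped to a response. Here an in-memory dict with a TTL stands in for Redis or DynamoDB; the class and parameter names are illustrative:

```python
import hashlib
import json
import time

class ExactCache:
    """In-memory exact-match cache keyed by a hash of the full request."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (response, expiry timestamp)

    def _key(self, system_prompt, user_message, **params):
        # Serialize the complete input deterministically, then hash it.
        # Any parameter that affects the output (model, temperature, ...)
        # must be part of the key.
        payload = json.dumps(
            {"system": system_prompt, "user": user_message, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get(self, system_prompt, user_message, **params):
        key = self._key(system_prompt, user_message, **params)
        entry = self.store.get(key)
        if entry and entry[1] > time.time():
            return entry[0]  # cache hit
        return None  # miss or expired

    def put(self, system_prompt, user_message, response, **params):
        key = self._key(system_prompt, user_message, **params)
        self.store[key] = (response, time.time() + self.ttl)
```

Swapping the dict for a real key-value store changes only `get` and `put`; the hashing logic stays the same.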
Strategy 2 — Semantic caching
Instead of requiring exact matches, embed the query and check if a semantically similar query has been answered recently. If the similarity exceeds a threshold, serve the cached response.
This dramatically increases cache hit rates — often 40–60% for customer-facing applications — because users frequently ask the same question in different words.
The implementation requires a vector store for cached query embeddings and a similarity search for each incoming query. The embedding cost is small compared to the LLM inference cost it saves, but you need to monitor false positive rates carefully: a false positive serves a cached answer to a question that only looks similar, which is worse than a cache miss.
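The mechanism can be sketched without any external services. Here a toy bag-of-words "embedding" and a brute-force scan stand in for a real embedding model and vector store, which a production system would use instead; the threshold value is illustrative:

```python
import math
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: word-count vectors.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # (embedding, response)

    def get(self, query):
        emb = toy_embed(query)
        best, best_sim = None, 0.0
        for cached_emb, response in self.entries:
            sim = cosine(emb, cached_emb)
            if sim > best_sim:
                best, best_sim = response, sim
        # Serve the nearest cached response only above the threshold.
        return best if best_sim >= self.threshold else None

    def put(self, query, response):
        self.entries.append((toy_embed(query), response))
```

Raising the threshold trades hit rate for precision; tuning it against a labeled sample of query pairs is what keeps the false positive rate in check.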
Strategy 3 — Prefix caching
Most LLM providers now support prompt prefix caching: if multiple requests share a common prefix (typically a system prompt), the provider caches the KV states for that prefix and reuses them across requests.
This is the lowest-effort optimization because it requires no code changes on your end — you just benefit from it automatically if your requests share common prefixes. The savings are typically 50–80% of the prefill cost for the shared portion.
To maximize prefix caching benefits: keep your system prompt stable (don't include timestamps or request-specific data at the beginning), put variable content at the end of the prompt rather than the beginning, and use consistent formatting across requests.
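The structuring advice above amounts to keeping the shared portion byte-identical across requests. A sketch, with hypothetical prompt content:

```python
# Stable prefix: identical bytes on every request, so the provider can
# reuse its cached KV states for this portion.
SYSTEM_PROMPT = (
    "You are a support assistant for Acme Corp.\n"
    "Answer concisely and cite the relevant help article.\n"
)
FEW_SHOT_EXAMPLES = (
    "Q: How do I reset my password?\n"
    "A: Use the 'Forgot password' link on the login page. [Article 12]\n"
)

def build_messages(user_query, request_time):
    # Variable content (the query, the timestamp) goes at the END,
    # after the stable prefix, so the shared portion stays cacheable.
    return [
        {"role": "system", "content": SYSTEM_PROMPT + FEW_SHOT_EXAMPLES},
        {"role": "user", "content": f"[{request_time}] {user_query}"},
    ]
```

Putting the timestamp at the start of the system prompt instead would make every request's prefix unique and forfeit the cache entirely.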
Strategy 4 — Multi-tier caching
The most effective production systems combine multiple caching layers:
- Exact cache — Check for exact input match first (fastest, highest precision)
- Semantic cache — If no exact match, check for semantically similar queries
- Prefix cache — For requests that miss both caches, benefit from provider-level prefix caching
- LLM inference — Only for genuinely novel queries
This cascading approach maximizes cost savings while minimizing the risk of serving incorrect cached responses.
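The cascade can be sketched as a single lookup function. The tiers here are deliberately simple stand-ins (a dict for exact matches, a crude word-overlap check for "semantic"); a real system would plug in the stores described above, and provider-side prefix caching applies transparently inside the inference call:

```python
exact_cache = {}       # exact query string -> response
semantic_entries = []  # (word set, response)

def semantic_lookup(query, threshold=0.8):
    # Crude Jaccard overlap standing in for embedding similarity.
    words = set(query.lower().split())
    for cached_words, response in semantic_entries:
        overlap = len(words & cached_words) / max(len(words | cached_words), 1)
        if overlap >= threshold:
            return response
    return None

def answer(query, call_llm):
    """Cascade: exact tier, then semantic tier, then LLM inference."""
    if query in exact_cache:
        return exact_cache[query], "exact"
    cached = semantic_lookup(query)
    if cached is not None:
        return cached, "semantic"
    response = call_llm(query)        # genuinely novel query
    exact_cache[query] = response     # populate both tiers for next time
    semantic_entries.append((set(query.lower().split()), response))
    return response, "llm"
```

Returning the tier alongside the response makes it easy to attribute cost savings per layer later.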
Cache invalidation
The hardest problem in caching is knowing when cached responses are stale. For LLM applications, staleness can come from several sources:
- Source data changes — If your RAG system's documents are updated, cached responses based on old documents are stale
- Model updates — A new model version might produce better responses, making old cached responses suboptimal
- Time-sensitive content — Responses about current events, stock prices, or schedules become stale quickly
Practical invalidation strategies:
- TTL-based — Set time-to-live values appropriate to your content freshness requirements
- Event-based — Invalidate caches when source documents are updated
- Version-based — Include the model version in the cache key so model updates automatically create a new cache space
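Version-based invalidation reduces to folding every version identifier into the cache key, so an upgrade starts a fresh cache space with no explicit purging. A sketch, with illustrative version names:

```python
import hashlib

def cache_key(model_version, doc_index_version, prompt):
    # Any change to the model or the source-document index changes the
    # key, so stale entries are simply never looked up again.
    payload = f"{model_version}|{doc_index_version}|{prompt}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Old entries still occupy storage until their TTL expires, which is why version-based keys are usually combined with a TTL rather than used alone.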
Measuring cache effectiveness
Track these metrics to evaluate your caching strategy:
- Hit rate — Percentage of requests served from cache. Target: 20–50% depending on application.
- Cost savings — Actual reduction in inference spend attributable to caching.
- Latency improvement — Cached responses should be 10–100× faster than live inference.
- Quality impact — Compare the quality of cached responses with fresh responses. Any degradation means your similarity threshold or TTL needs adjustment.
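The first two metrics fall out of a small counter keyed by which tier served each request. A sketch, assuming a flat per-call cost for simplicity (real savings depend on token counts):

```python
class CacheMetrics:
    """Track hit rate and estimated savings across cache tiers."""

    def __init__(self, cost_per_llm_call=0.01):  # illustrative cost
        self.cost = cost_per_llm_call
        self.counts = {"exact": 0, "semantic": 0, "llm": 0}

    def record(self, tier):
        self.counts[tier] += 1

    def hit_rate(self):
        total = sum(self.counts.values())
        hits = self.counts["exact"] + self.counts["semantic"]
        return hits / total if total else 0.0

    def estimated_savings(self):
        # Each cache hit avoided one LLM call.
        return (self.counts["exact"] + self.counts["semantic"]) * self.cost
```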
When not to cache
Caching is not appropriate for all use cases. Skip it when:
- The task requires creativity or variety (e.g., brainstorming, creative writing)
- Every response needs to incorporate the very latest information
- The input space is so diverse that hit rates would be negligible
- The cost of a wrong cached response exceeds the savings from caching
For everything else, caching is one of the highest-ROI optimizations in the production AI toolkit — straightforward to implement, easy to measure, and often the difference between a sustainable inference budget and an alarming one.