Picture a support bot. It has a 6,000-token system prompt: product documentation, tone guidelines, a list of tool schemas, a handful of examples. A user asks one question. The model answers. A second user asks a different question. The model answers again - and silently re-reads all 6,000 tokens of context it already processed moments ago, as if it had never seen them.
Do that ten thousand times a day and you are paying for 60 million tokens of computation that produces nothing new.
Prompt caching is the fix. But "caching" is doing a lot of work in that sentence, because it is not caching in the web-server sense - no response is stored, no lookup table is checked against query strings. What gets stored is the intermediate computation itself: the key-value attention states the model builds while processing input tokens. Understanding why that matters requires a quick look at how a transformer actually reads your prompt.
What the model is doing during prefill
When you send a request to an LLM, inference happens in two phases. The first is called prefill: the model reads every token in your input and, for each attention layer, computes a set of key and value vectors that represent that token's relationship to everything before it. These key-value entries are stored in a structure called the KV cache. Normally, the model recomputes this KV cache on every request.
For autoregressive generation, the K and V vectors for token position i depend only on tokens 0 through i - once computed, they never change. Without caching, every newly generated token requires recomputing Q, K, and V for all previous tokens. That is O(n²) complexity per token, where n is your sequence length.
With a KV cache, you store the K and V vectors and only compute for new tokens, dropping complexity to O(n) per token.
The second phase is decode: the model generates output tokens one at a time, attending back to the KV cache it built during prefill. This phase is not affected by prompt caching - the model still generates a fresh response every time; it is the redundant prefill work that gets cut.
The KV cache is not a feature you opt into. It lives inside every transformer inference engine. Prompt caching is the art of sharing it across requests.
How prefix matching works
Prompt caching works by comparing the beginning of your current prompt against what is already cached. If the cached prefix and your new prompt are exactly identical - token for token - up to a certain point, the model reuses the cached computation for that portion and only processes new tokens from where the match ends.
The word "exactly" is load-bearing. A single character difference, an extra space, a timestamp injected into the system prompt - any of these breaks the match and forces a full prefill. The key to effective prompt caching is putting static content at the beginning and dynamic content at the end. The system prompt, the tool definitions, the few-shot examples: those go first. The user's message goes last. It sounds obvious, but many applications are built the other way around - user context injected into the system block, metadata stamped at the top - and they get zero cache hits as a result.
Under the hood, serving frameworks like vLLM implement this with Automatic Prefix Caching (APC): the KV cache of each request is partitioned into blocks, where each block contains the attention keys and values for a fixed number of tokens.
Each prompt is hashed and compared against a hash table to check for an existing cache entry. If the prompt is not cached, KV pairs are computed and stored for future use. The longest matching prefix wins; only the tail of the prompt gets recomputed.
What the providers actually give you
The three major API providers have taken different approaches.
OpenAI makes it invisible. Caching requires no code changes and is enabled automatically for prompts of 1,024 tokens or longer, with cache hits occurring in increments of 128 tokens.
OpenAI recently extended cache retention to 24 hours for the GPT-4.1 and GPT-5.1 series - the default is 5-10 minutes of inactivity, but the extended policy offloads KV tensors to GPU-local storage when idle and loads them back on a cache hit.
The discount is 50% on cached input tokens. You can verify hits by checking prompt_tokens_details.cached_tokens in the response.
Anthropic makes it explicit.
With Claude, prompt caching is controlled by the developer - sections of the prompt are marked cacheable using the cache_control parameter.
That adds friction, but also precision: you decide exactly what to cache.
Cache reads are billed at $0.30 per million tokens versus $3.00 per million for fresh processing
- a 90% discount. The catch is that cache write tokens with a five-minute lifetime are 25% more expensive than base input tokens, so you need at least two hits per cached prefix to come out ahead.
Google (Gemini) calls it "context caching" and takes yet another approach: you upload stable context once and reference it by ID, rather than marking inline cache points per request. Gemini 2.5 Pro and 2.5 Flash now also support implicit caching, providing automatic caching functionality similar to OpenAI's.
Here is a simplified comparison of the key numbers, as of mid-2026:
| OpenAI | Anthropic | ||
|---|---|---|---|
| Input discount | 50% | 90% | 75% |
| Cache write cost | Free | +25% base | Storage-based |
| Default TTL | ~5-10 min | 5 min or 1 hr | Until deleted |
| Minimum tokens | 1,024 | 1,024 | 32,768 |
A concrete example
Back to the support bot with a 6,000-token system prompt, serving 10,000 conversations a day.
Without caching: 60 million prompt tokens daily, billed at full price every time.
With caching and a high hit rate, those 60 million tokens mostly become cache reads. At 10,000 conversations a day with a 5,000-token system prompt, you are paying for those tokens 10,000 times - 50 million tokens a day in system prompt costs alone before a single user message is processed. Prompt caching makes those repeated tokens near-free.
The hit rate depends on one thing: how stable your prefix is. Do not put timestamps, request IDs, or user names in the system prompt. Anything that changes per-request belongs at the end of the message array. A teammate like Beagle - which sends a context-heavy system prompt on every message - benefits enormously from this discipline; the tool definitions and workspace instructions are identical across thousands of requests, and they should always appear before the dynamic conversation history.
The memory trade-off providers don't advertise
Storing key-value tensors in GPU VRAM has a cost, which explains why Anthropic charges a premium for cache writes.
GPU memory requirements are substantial: for a Llama 3 70B model, the KV cache for a single 4K context request requires roughly 1.3 GB of GPU memory. This is why cached prefixes expire: providers cannot hold every customer's context in VRAM forever. The TTL is a business decision about GPU memory allocation as much as it is a product decision.
It is also why APC in general does not reduce the performance of vLLM - but it only reduces the time of processing queries in the prefill phase, and does not reduce the time of generating new tokens in the decoding phase. If you are running a workload that generates thousands of output tokens per request, the input savings look smaller in proportion.
None of this requires deep infrastructure knowledge to act on. The rule is simple: freeze your prefix, put dynamic content last, and read the cache metrics back. The savings come from how you structure the prompt, not from anything the model does differently.