Most developers encounter prompt caching the same way: a line in a billing dashboard that says "cached tokens: X" and a vague sense that this is good. The discount is real. The mechanism behind it is worth understanding, because the same mechanism that saves you money is exactly what causes mysterious cache misses when you move things around.
Start with what is actually being stored. When a model processes your prompt, it runs every token through a series of attention layers. At each layer, the model computes two matrices for each token - a Key matrix and a Value matrix - that encode how that token relates to everything around it. The data that gets cached is the result of embeddings multiplied by the weight matrices WK and WV, so K and V. As a result, prompt caching tends to be called "KV caching."
Providers hold on to these matrices for each prompt for roughly 5-10 minutes after the request is made, and if you send a new request that starts with the same prompt, they reuse the cached K and V rather than recalculating them.
This is why "prompt caching" is a slightly misleading name. The model is not storing your text. It is storing the result of having processed your text - the intermediate state that would otherwise take compute to regenerate. Prompt caching stores and reuses the initial, unchanging part of a prompt (the prompt prefix) so that large language models don't have to process it again on every request. More specifically, it caches the internal state of the model for that prefix.
The cache does not store your words. It stores the model's understanding of your words, up to a fixed point.
The prefix rule and why it is strict
If the cached prefix and your new prompt are exactly identical (token-for-token) up to a certain point, the model reuses the cached computation for that portion and only processes new tokens from where the match ends.
A single token change anywhere in the prefix breaks the match from that point forward.
This is not a quirk; it is a structural requirement. Each token's KV vectors depend on every token that came before it and that token's absolute position in the sequence. Change anything before position N, and every KV vector from N onward is wrong. Each token's KV vector depends on all preceding tokens and their absolute position IDs in the prompt. Consequently, even minor differences in the prefix invalidate the KV vectors of otherwise immutable chunks, requiring full recomputation.
This is why prompt order matters more than most teams realize. The stable parts of your prompt - system instructions, tool schemas, large retrieved documents - need to sit physically before the volatile parts. Stable content must physically precede volatile content. If a timestamp is interpolated into the system prompt header, everything after it is uncacheable regardless of markers.
A concrete example
Say you are building a support bot. Each request contains: a 2,000-token system prompt describing the product, a 3,000-token knowledge base excerpt, and a 50-token user question. Without caching, every request processes all 5,050 tokens. With a cache hit on the first 5,000, only 50 tokens need fresh computation.
Prompt caching tends to work well in retrieval-augmented generation (RAG) setups where multiple users query the same knowledge base. Caching the system instructions and retrieved document chunks means the model skips prefill on the shared context for each new question. The payoff is highest when users ask several questions about the same document.
When retrieved chunks change with every query, though, the prefix changes too, and cache reuse drops.
The common mistake is fetching slightly different document snippets for each user, maybe with a different sort order, or with a user-specific header prepended. Each variation resets the clock. The retrieval logic and the caching logic need to be designed together, not independently.
How providers differ
OpenAI routes API requests to servers that recently processed the same prompt, making it cheaper and faster than processing a prompt from scratch. Prompt caching can reduce latency by up to 80% and input token costs by up to 90%. Prompt caching works automatically on all API requests with no code changes required. The minimum eligible prompt length is 1,024 tokens; shorter prompts are never cached.
Anthropic's approach gives developers explicit control. You mark a content block as a cache breakpoint, and Anthropic stores the encoded state of everything up to that point. The next request that starts with the same exact bytes reads from the cache instead of recomputing.
There are at most four cache_control breakpoints per request.
Cache reads cost roughly 0.1× the base input price. Cache writes cost 1.25× for a five-minute TTL, or 2× for a one-hour TTL.
The render order matters here too: a single byte difference at position N - a timestamp, a reordered JSON key, a different tool in the list - invalidates the cache for all breakpoints at positions greater than or equal to N. Render order is: tools → system → messages. A breakpoint on the last system block caches both tools and system together.
Google's Gemini 2.5 models use implicit caching by default - Google supports context caching through both the Gemini Developer API and Vertex AI, with implicit caching enabled by default on Gemini 2.5 models. You opt in by writing stable prompts, not by marking breakpoints.
What silently breaks it
A few failure modes appear consistently across teams building on these APIs:
JSON key ordering. Some languages serialize JSON dictionaries with random key order. If your tool schemas are serialized differently on each request, the cache is invalidated every time even though the content is semantically identical.
Verify that the keys in tool_use content blocks have stable ordering, as some languages (for example, Swift, Go) randomize key order during JSON conversion, breaking caches.
Dynamic system prompt injection. Inserting a timestamp, a user ID, or a session token early in the system prompt corrupts every downstream cache boundary. Move dynamic data to the end of the prompt, after the last stable block.
Short prompts.
The minimum cacheable prefix is model-dependent. Shorter prefixes silently won't cache even with a marker - no error, just cache_creation_input_tokens: 0.
If you add a cache breakpoint to a 500-token system prompt and wonder why it never hits, this is why.
Routing.
Caching only works if two requests share the same prefix and land on the same machine. Requests are routed to inference engines based on a hash of the first ~256 tokens of the prompt.
At high request rates, overflow to additional machines reduces hit rates. Some providers expose a prompt_cache_key parameter that influences routing affinity.
The underlying constraint
The strict prefix requirement has a deeper consequence for agent workflows. Agents that reorder retrieved documents, shuffle tool lists, or vary system messages between steps cannot share a cache efficiently. A teammate using Beagle might benefit from a stable channel or workspace context that stays fixed across requests - that is the kind of shared prefix that caches well.
The token bill for an LLM application is rarely dominated by the cleverness of the prompt. It is dominated by repetition and waste: the same long system prompt re-sent on every turn, the same retrieved documents re-read on every question, a 200-message conversation carried verbatim into turn 201. Prompt caching is the structural fix for that waste. But it rewards teams who think about prompt assembly as an engineering discipline - not an afterthought.
The KV matrices sitting on a server for the next five minutes are not magic. They are just saved arithmetic. The question is whether your prompt is stable enough to use them.