Every time you send a message to an LLM, the model doesn't just read your new question. It re-reads everything: the system instructions, the full conversation history, any documents you attached. All of it, every turn. That gets expensive fast, and it gets slow.
Prompt caching is how providers cut that redundant work. Understanding it changes how you structure requests - and explains some billing surprises.
What the model is actually doing
When an LLM receives a prompt, it goes through a prefill phase: the model reads and deeply understands your entire input by calculating attention across every token. This is computationally expensive, especially with long documents.
The output of that phase is a set of tensors called the KV cache - key-value pairs computed for each token across each attention layer. For autoregressive generation, the K and V vectors for any token position depend only on the tokens before it. Once computed, they never change.
Without caching, every newly generated token requires recomputing Q, K, and V for all previous tokens - that's O(n²) complexity per token, where n is your sequence length. With a KV cache, you store the K and V vectors and only compute for new tokens. Complexity drops to O(n) per token.
Prompt caching takes that a step further. When an LLM processes your prompt, it generates KV cache entries in its attention layers. Normally, the model recomputes this KV cache on every request. Prompt caching stores it so the model can skip that computation on subsequent requests that share the same prefix. The model still generates a fresh response every time; it's the redundant prefill work that gets cut.
A concrete example
Say you're running a support bot. Your system prompt is 2,000 tokens: company policies, tone instructions, a dozen worked examples. A user sends one short question. The model still has to read all 2,000 tokens before it can do anything.
The next user sends a different question. Same 2,000-token system prompt. The model reads it again. And again for every user after that.
If you have a chatbot where users can upload documents and ask questions about them, it can be time consuming for the model to process the document every time the user provides input. With prompt caching, you can cache the document so that future queries don't need to reprocess it.
The savings are not theoretical. Every time you send a message in Claude Code, the entire conversation gets re-sent to the API. Your first message sends the system prompt, tool definitions, and your message. Your tenth message sends all of that plus the previous nine exchanges. Your fiftieth message sends everything from the beginning plus 49 rounds of conversation. Without caching, the model reprocesses every token from scratch each time.
Why it costs money to store the cache
A 100K-token prompt might produce a KV cache of 500MB-1GB per request. Anthropic is storing and retrieving this data in GPU memory for millions of concurrent users simultaneously. That's why there's a 25% surcharge on cache writes - you're paying for VRAM allocation, not just compute.
By default, cached prefixes stay in GPU VRAM for 5-10 minutes of inactivity. The extended 24-hour retention offloads KV tensors to GPU-local storage (SSDs attached to GPU nodes) when idle, loading them back into VRAM on a cache hit.
The pricing math: cache reads cost $0.30/M tokens versus $3.00/M fresh tokens on Anthropic. You pay a small premium on the first write; every subsequent hit is priced at a steep discount. Cache creation costs slightly more than standard input processing, but that cost is paid once and then amortized across every subsequent cache hit.
The one rule that breaks most implementations
Prompt caching works by comparing the beginning of your current prompt against what's already cached. If the cached prefix and your new prompt are exactly identical - token-for-token - up to a certain point, the model reuses the cached computation for that portion and only processes new tokens from where the match ends.
The critical word is identical. The cache is not semantic. It is not thinking "this is basically the same." It is much closer to: "have I already seen this same prefix in this shape?"
This means structure matters more than content. The key to effective prompt caching is putting static content at the beginning and dynamic content at the end. If your system prompt injects a timestamp, or the current user's name, or anything that changes per-request - the cache misses every time.
The thing that actually breaks caches in production is usually something small and dynamic buried at the top of a system prompt.
Adding an MCP tool, putting a timestamp in your system prompt, switching models mid-session - each of these can invalidate the entire cache and 5× your costs for that turn.
How to actually mark a cache breakpoint
On the Anthropic API, you signal where the stable prefix ends using a cache_control field.
With automatic caching, you add a single cache_control field at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and moves it forward as conversations grow.
Prompt caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.
That ordering matters. If your tool definitions change, the cache for everything after them also breaks.
Anthropic explicitly documents that tools are part of the cached prefix. So if your tool list changes, your cache picture changes. If you have a stable agent harness, keep its tool surface stable when possible.
There's also a minimum token threshold before the provider will bother caching anything. The minimum is 1,024 tokens for Sonnet and Haiku, up to 2,048-4,096 for Opus. Below that threshold, the overhead of storing and looking up the cache isn't worth it.
Where it matters most in practice
The workloads that benefit most share a pattern: a large, stable context that gets reused across many short queries. This includes conversational agents with long instructions or uploaded documents, agentic search and tool use across multiple rounds of calls, and Q&A over books, papers, documentation, or other long-form content.
An AI teammate like Beagle - running inside Slack or Teams, with a system prompt that includes channel context, tool definitions, and behavioral instructions - hits this pattern on every single message. The system prompt doesn't change; the user question does. Every conversation turn is exactly the kind of workload caching was designed for.
The underlying mechanism is not exotic. It's the same KV cache that already runs inside the model during a single response, extended across requests. Once you see that, the pricing structure makes sense, the structure rules make sense, and the things that break it make sense too.