A developer building a YouTube analytics bot recently shared their terminal output: first API call, 81,262 tokens. Second call, same system prompt, same reference data - 11 tokens billed. Without caching, they were processing 81,251 tokens of context every single request, running up $720 a month. With prompt caching, that dropped to $72.
That gap - $648 a month from one afternoon of work - is the prompt caching story in miniature. But the reason it works is worth understanding, because knowing the mechanism tells you exactly when to use it, when it will fail you, and why it matters so much more for agents than for simple chatbots.
What the model is actually doing when it reads your prompt
Every time you send a request to a language model, the model reads your input token by token and builds up an internal representation before it writes a single word of response. This "reading" phase - called the prefill - is where most of the compute cost lives.
The machinery underneath is transformer self-attention. The transformer relies on self-attention, which allows each token in a sequence to attend to all other tokens. For each token, the model computes three projections: queries (Q), keys (K), and values (V). The keys and values are what gets expensive: while producing one token at a time, the model must compute attention between the current token's query and the keys of all previous tokens. Naively, this requires recomputing the key and value projections for every past token at each generation step - a very expensive operation.
The KV cache solves this by storing the key and value projections from previous tokens. At each new generation step, only the current token's K and V are computed and appended to the cache. Within a single conversation, this is already happening automatically at the GPU level. Prompt caching extends that same idea across requests.
Prompt caching refers to the productized, provider-managed features that reuse KV tensors across API requests when prompts share common prefixes. By caching the KV tensors from the prefill phase, providers can skip redundant computation when subsequent requests begin with the same content, reducing both latency and cost.
Think of it this way: your system prompt is a 4,000-word briefing document. On request one, the model reads every word. On request two, if that briefing hasn't changed, the provider hands the model a stack of pre-computed tensors instead of making it re-read. The model skips straight to your user's actual question.
It matches exact token prefixes at the GPU level - there's no accuracy loss, and outputs are identical to uncached requests.
How OpenAI and Anthropic implement it differently
Major LLM providers have implemented prompt caching with varying approaches. OpenAI offers automatic prompt caching on GPT-4o and newer models, where caching activates automatically for prompts exceeding a minimum token threshold, with cache hits occurring only for exact prefix matches. Anthropic provides developer-controlled caching through explicit cache breakpoints, allowing users to specify which portions of their prompt should be cached, with configurable TTL options. Google offers both implicit caching, which activates automatically with no guaranteed cost savings, and explicit context caching, where developers create and reference caches with guaranteed discounts.
On OpenAI's side, the setup is zero-effort: OpenAI's prompt caching is fully automatic. Since October 2024, every API call with 1,024 or more input tokens automatically benefits from caching. You don't opt in, you don't add headers, you don't change your code.
After the initial 1,024-token threshold, the cache matches in 128-token increments. The cache lives for 5-10 minutes during normal usage and can persist up to 24 hours during off-peak periods.
Anthropic's approach gives more control in exchange for a little setup.
Rather than automatically caching prompts, Anthropic defines up to four cache breakpoints, which allow users to have finer control over which sections of the prompt are cached. This can be adjusted using the cache_control parameter.
The cache has a 5-minute time-to-live, which refreshes each time the cached content is used. The pricing reflects the asymmetry: cache reads come in at $0.30 per million tokens versus $3.00 per million fresh on Anthropic's Sonnet-class models. There is one catch on the write side: Claude charges 25% more on the initial cache write , which means you need at least two cache hits before you break even - though in practice if you write a 4,000-token system prompt once and read it five times, you break even on the write cost and every subsequent read is pure savings.
The part of your prompt that gets cached is always the stable prefix - the cache applies to tokens at the beginning of your prompt, typically your system prompt, tool definitions, few-shot examples, or retrieved documents that don't change between requests. The user's actual message, which does change, still gets processed normally.
Why agents make this far more important than chatbots
A single chatbot call with a 2,000-token system prompt costs a few fractions of a cent. Cache or don't cache, the bill is roughly the same. Agents are different.
On a 40-step task, you're sending a large system prompt 40 times. As the conversation grows linearly, step N re-sends everything from steps 1 through N-1. The agentic tax: the cost of intelligence compounds quadratically with task complexity.
Caching is the only structural fix. One team found it saved 59% on LLM costs compared to the same token volume at full input rates; post-optimization that number reached 66%, and continued improving to 70% over the following ten days.
The math is equally stark on latency. In a research evaluation across models and providers, cost savings from prompt caching ranged from 41% to 80%, while time-to-first-token improvements ranged from 6% to 31%. For a multi-step agent where each tool call is waiting on the previous response, shaving 20-30% off time-to-first-token at every step adds up to a meaningfully faster experience.
A teammate like Beagle - living inside Slack and fielding repeated queries against the same workspace context - sits squarely in the "cache everything stable" category. The channel history summary, the team glossary, the project brief: those live at the top of every request and almost never change within a working session.
When caching won't help (and when it actively costs you)
Prompt caching has a failure mode worth being explicit about. If your prompt changes substantially between calls - different user, different document, different system message - caching has nothing stable to hold onto. Cache hits won't materialize.
There's also the one-shot trap: if you make a single API call and never follow up, caching costs you the cache-write premium with no chance to recoup. Don't bother.
And the TTL is genuinely short. Anthropic's ephemeral cache lasts 5 minutes from last hit. Each subsequent call with the same prefix resets the TTL. If you keep asking questions within 5 minutes, the cache stays warm. If you walk away for lunch, you pay the cache-write premium again on return. Anthropic does offer a 1-hour caching option for less frequently accessed content, though it carries a higher write rate.
There's one more structural limit worth noting. The KV cache is a foundational optimization in transformer-based LLMs, but its memory footprint scales linearly with context length, imposing real bottlenecks on GPU memory capacity and inference throughput as context windows grow from thousands to millions of tokens.
With the LLaMA-7B model, the KV cache consumes only 2% of memory during inference with a sequence length of 512 tokens - but that usage skyrockets to 84% when the sequence length increases to 128k. Providers manage this behind the scenes, but it's why cache hit rates tend to drop on very long, variable conversations even when you do everything right.
The rule of thumb: prompt caching earns its keep when you have a large, stable block of context and you're hitting the same model multiple times. That describes most production AI features - RAG pipelines, agents, assistant products with system personas - and almost none of the exploratory one-off prompts that fill a developer's afternoon.
Anthropic's API response includes cache_creation_input_tokens, cache_read_input_tokens, and input_tokens fields
so you can measure the split on every call, not just estimate it.
For teams building on top of LLM APIs, this is now table-stakes infrastructure, not a clever optimization. By organizing prompts into static cached prefixes and dynamic request components, developers can reduce token costs by roughly 70-90% in many real-world applications. At scale, those savings can amount to tens or hundreds of thousands of dollars per month. The model doesn't get smarter. Your bill just gets a lot smaller.