Prompt caching cuts your LLM bill before you write a line of new code

Picture a support bot. Every user message arrives wrapped in the same 2,000-token system prompt: persona, policies, tone rules, a knowledge-base excerpt. The user types five words. The model reads 2,000 tokens it has seen ten thousand times today. You pay for all of them, every single time.

That is not a pricing quirk. It is a structural inefficiency baked into how transformer inference works, and prompt caching is the fix.

What the model actually does with your input

LLM inference is divided into two distinct stages. The first - called the prefill stage - is where the model reads and processes your entire input, turning each token into a set of internal mathematical representations called key-value (KV) tensors. The second stage generates output, one token at a time, using those tensors to attend back over the input.

The expensive part is prefill. Intermediate key and value tensors for the input prompt and previously generated tokens are calculated once and then stored in the KV cache, instead of recomputing from scratch at each iteration. That per-request KV cache is what makes generation fast within a single conversation. But the moment that conversation ends, the cache is gone. The next user triggers a full prefill again.

This type of KV caching only works for a single prompt and for generating a single response. Prompt caching extends the principles of caching across different prompts, users, and sessions.

The prefix is the key insight

Prompt caching is a provider-native feature that stores and reuses the initial, unchanging part of a prompt - the prompt prefix - so that large language models don't have to process it again on every request. More specifically, it caches the internal state of the model for that prefix, reducing redundant computation.

The mechanism is straightforward. The KV cache is persisted on the inference servers, indexed by a cryptographic hash of the token sequence. When a new request comes in with the same prefix, the hash matches, the tensors are loaded from memory, and the prefill computation for those tokens is skipped entirely.

This drops computational complexity from O(n²) per generated token to O(n). For a 20,000-token prefix repeated across 50 turns, that's an enormous reduction.

The model isn't remembering your content. It's reusing the math it already did on identical bytes.

How Anthropic and OpenAI handle this differently

OpenAI and Anthropic do caching very differently. OpenAI does it automatically, attempting to route requests to cached entries when possible.

Automatic caching is enabled for prompts that are 1,024 tokens or longer, with cache hits occurring in increments of 128 tokens.

If a matching prefix is found, the system uses the cached result, decreasing latency by up to 80% and cutting costs by up to 50%. You don't touch a config file. The discount just appears in your usage response.

Anthropic's approach trades automation for control. Prompt caching with Anthropic models is explicitly controlled by the developer. Developers mark sections of the prompt as cacheable using the cache_control parameter.

Cache read tokens are 0.1 times the base input tokens price - a 90% discount - while cache write tokens cost 1.25 times the base.

Anthropic gives you more control, letting you decide when to cache and for how long. In practice, Anthropic routes you to cached entries 100% of the time when you ask them to cache a prompt. OpenAI's automatic routing, by contrast, can be inconsistent under load.

A concrete example: the document Q&A loop

Say you're building a bot that answers questions about a 3,000-token engineering runbook. A user session typically involves five or six follow-up questions.

A typical scenario: you're building a document analyzer that includes a 3,000-token document in every request. Five questions about that document means processing 15,000 tokens of identical content at full price. With caching, that document is processed once per session. The subsequent four or five turns read it from the KV store for a fraction of the cost.

The structure that makes this work is simple: stable content at the top, dynamic content at the bottom.

What the model actually does with your input

The prefix is the key insight

How Anthropic and OpenAI handle this differently

A concrete example: the document Q&A loop

Keep reading

LLM Prompt Caching: What the KV Cache Actually Stores

How LLM Evals Work, From Golden Dataset to Judge Score

AI Agent Sandboxing, From Container to MicroVM