Structure Your Prompts Around the KV Cache

Every time you call an LLM, the model runs a mountain of math over every token in your prompt - unless it already did that math. Here is what actually gets saved, and why the order of your tokens decides whether any of it sticks.

Picture a support agent that runs a thousand conversations a day. Each one starts with the same 4,000-token system prompt: persona, rules, tool definitions, example exchanges. Every single request, the model reads all of it again from scratch - unless something changes that equation. Something does, and it is worth understanding at the level of what the GPU is actually doing.

What the model does with every token you send

Under the hood, a context window is tied to the model's attention mechanism. Transformer-based LLMs use self-attention to let every token "attend to" every other token in the sequence. Before generating a single word of output, the model must run the full prefill pass - processing every input token through every layer.

Attention is the mechanism the model uses to figure out which tokens matter to which other tokens. Before generating each new token of its response, the model compares it against every other token currently in the context window. This gives LLMs their ability to connect ideas across long stretches of text, but it is also the source of their most important limitations.

The cost is not linear. The computational cost scales quadratically with sequence length - doubling the context window roughly quadruples the computation for the attention layers. So when you have a 4,000-token system prompt and you run a thousand conversations, you are paying to process roughly four million tokens that carry identical information every single time.

What the KV cache actually is

During that prefill pass, the model computes two matrices for every token at every layer: a Key and a Value. Together they encode what the model "knows" about each token in context. Transformer models build a "KV cache" - key-value tensors computed during the attention mechanism. Prefix caching persists these tensors across API calls.

Every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you are handling 1,000 requests per day, you are paying to process 10 million tokens daily just for the static part of your prompt - context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.

The analogy that holds up: imagine computing a spreadsheet formula from scratch every time you open the file, versus loading the pre-computed result. The inputs did not change. The math did not change. Running it again is pure waste.

The prefix rule

The mechanism has one hard constraint: prompt caching refers to provider-managed features that reuse KV tensors across API requests when prompts share common prefixes. By caching the KV tensors from the prefill phase, providers can skip redundant computation when subsequent requests begin with the same content, reducing both latency and cost for users.

"Same prefix" means exactly that - byte-for-byte identical, from the first token forward. The moment any token changes, the hash breaks and everything downstream gets recomputed. Do not inject timestamps into system prompts, do not shuffle tool definitions, do not switch models mid-session, and do not mutate anything upstream of the cache breakpoint. These are not stylistic preferences; they are the difference between a cache hit and a full prefill.

The prefix is sacred. Anything that moves inside it poisons the cache for every token that follows.

How providers have productized this

The underlying KV cache is an inference-time optimization every transformer uses internally. What the major providers sell on top of it is a cross-request version - your tensors survive between API calls, not just within a single generation.

OpenAI offers automatic prompt caching on GPT-4o and newer models, where caching activates automatically for prompts exceeding a minimum token threshold, with cache hits occurring only for exact prefix matches.

Anthropic provides developer-controlled caching through explicit cache breakpoints, allowing users to specify which portions of their prompt should be cached, with configurable time-to-live options.

Google offers both implicit caching, which activates automatically with no guaranteed cost savings, and explicit context caching, where developers create and reference caches with guaranteed discounts.

Implementation details such as minimum token thresholds (typically 1,024-4,096 tokens depending on model), TTL durations (ranging from 5 minutes to 24 hours), and pricing structures vary across providers. The economics are steep on the upside: cache reads run at $0.30/M tokens versus $3.00/M fresh on Anthropic's Claude. A 90% discount, but only if the prefix holds.

By default, cached prefixes stay in GPU VRAM for 5-10 minutes of inactivity. Extended 24-hour retention offloads KV tensors to GPU-local storage (SSDs attached to GPU nodes) when idle, loading them back into VRAM on a cache hit.

Your API request needs to get routed to the same machine to hit the cache. OpenAI routes based on a hash of the initial prefix (~256 tokens). This is why cache hit rates on short prompts are less predictable than on long ones.

A concrete example: the coding agent

If you use Codex, Claude Code, or Cursor and check the API usage, you will notice a lot of the tokens are "cached". Code is structured and multiple queries can attend to the same context and prefixes to answer queries, so there are lots of cache hits. This is what keeps the bills in control.

Claude Code demonstrates what this looks like at scale, with a 92% cache hit rate and an 81% cost reduction. The reason it works so well is architectural: the system prompt - with all its tool definitions, project context, and behavioral instructions - sits at the top and never moves. User messages append at the bottom. The static part is always the prefix; the dynamic part always grows downward. The KV cache gets to do its job.

As agentic workloads grow in complexity, conversations can span dozens of API calls with context windows accumulating tens of thousands of tokens, leading to significant costs and latency overhead. An agent that runs twenty tool calls in a single session is paying for twenty prefill passes on an ever-growing context - unless the static portions are anchored up top and the cache can warm early.

Something like Beagle, which maintains persistent context across Slack conversations, benefits directly from this: the parts of the prompt that describe your team's structure, integrations, and behavior rules stay fixed, while the conversation content appends below them.

The real lesson

The context window is not just a size limit. It is an ordered structure where position determines whether computation gets reused or repeated. Attention is not distributed evenly across the context window. Research has consistently shown that LLMs pay the most attention to tokens at the beginning and end of the input, with a significant drop-off in the middle. That means your most important instructions belong at the top anyway - which happens to also be where the cache needs them.

Most teams treat their system prompt as a text file to edit whenever the model behaves oddly. The better mental model is to treat it as a compiled artifact: stable, versioned, and ordered with the KV cache in mind. In the age of agentic AI - where models are performing multi-step reasoning, tool use, and long-term planning - maximizing KV cache hit rate is no longer optional; it is foundational.