Someone asks your support bot: "Does the enterprise plan include SSO?" The bot answers correctly, citing the right pricing page. That answer did not come from the model's training data. It came from a pipeline that ran in about 200 milliseconds before the model typed its first word. Here is what that pipeline actually did.
Step one: the knowledge base gets sliced up in advance
Before any question is asked, someone has to prepare the knowledge base. Indexing involves preprocessing the incoming data - splitting documents into smaller, overlapping chunks. The pricing page becomes ten or twenty fragments, each covering a coherent idea: one chunk for the starter plan, one for enterprise features, one for billing terms.
Chunk size is one of those decisions that looks trivial and is not. When chunks are too large, the data points become too general and fail to correspond directly to potential user queries. When chunks are too small, they lose semantic coherence. A chunk containing a single sentence like "SSO is available" is not very useful without the surrounding sentence that specifies which plan.
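To make the trade-off concrete, here is a minimal sketch of overlapping chunking. It assumes word-based windows; the function name chunk_words and the default sizes are made up for illustration, and a real pipeline would more likely split on tokens, sentences, or document structure.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words.

    Each window repeats the last `overlap` words of the previous one,
    so a sentence cut at a boundary still appears whole in one chunk
    more often than it would with hard cuts.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    page = "The enterprise plan includes SSO and audit logs. Billing is annual."
    for chunk in chunk_words(page, chunk_size=6, overlap=2):
        print(repr(chunk))
```

Raising overlap costs storage and some duplicate retrieval results, but it lowers the odds that "SSO" and the plan name end up in different chunks.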
Step two: every chunk gets turned into a vector
Each chunk is transformed into a dense vector - a numerical representation that captures the semantic meaning of the text. These vectors are then stored in a vector database.
The embedding model is doing something worth pausing on. Embedding models capture semantic and contextual relationships between words and concepts. For instance, in the multi-dimensional vector space, "mountain" and "hill" would have embeddings that are close to each other, despite being far apart in terms of letters. The same principle means "does enterprise include SSO" and "is single sign-on available on paid tiers" will produce vectors that land close together, even though they share almost no words.
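"Close to each other" has a precise meaning here. The sketch below computes cosine similarity, one common closeness measure, over three hand-written vectors. The numbers are invented for illustration and do not come from any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings", made up for this example only.
toy_embeddings = {
    "mountain": [0.9, 0.8, 0.1],
    "hill": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

if __name__ == "__main__":
    near = cosine_similarity(toy_embeddings["mountain"], toy_embeddings["hill"])
    far = cosine_similarity(toy_embeddings["mountain"], toy_embeddings["invoice"])
    # "mountain" scores higher against "hill" than against "invoice".
    print(f"mountain vs hill:    {near:.3f}")
    print(f"mountain vs invoice: {far:.3f}")
```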
This pre-computation step happens offline, ahead of any query, and has to be repeated whenever the source documents change. The vectors sit in the database waiting. The model is not involved yet.
Step three: the user's question becomes a vector too
When the user submits a query, it is first converted to an embedding vector using the same embedding model that was used for the indexing. That last clause matters - same model, same vector space. If you index with one embedding model and query with a different one, the coordinates are in different universes and the search produces garbage.
The vector database then searches for embeddings close to the query embedding using a distance metric, and returns the relevant data chunks. These data chunks and the user query are combined into a single prompt and passed to the LLM.
Typically the system retrieves the top three to five closest chunks by cosine similarity. Not the whole document - just the fragments that scored highest against the question.
The model never searches the database. The retriever does. The model only sees the result.
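The retriever's job can be sketched in a few lines. This version assumes an in-memory list of (chunk text, vector) pairs and scans all of them; the names retrieve_top_k and index are hypothetical. A production vector database would typically use an approximate nearest-neighbor index rather than a full scan.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best.

    Returns (score, chunk_text) pairs, highest score first.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Toy vectors, invented for illustration.
    index = [
        ("Enterprise plan includes SSO.", [0.9, 0.1]),
        ("Starter plan is billed monthly.", [0.1, 0.9]),
        ("SSO setup requires an admin account.", [0.8, 0.3]),
    ]
    for score, text in retrieve_top_k([1.0, 0.0], index, k=2):
        print(f"{score:.3f}  {text}")
```

Note that nothing in this function calls a language model. The output is just text, which is what gets handed to the next step.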
Step four: the retrieved chunks ride along in the prompt
The final prompt looks roughly like this: system instructions, then the retrieved chunks pasted in as context, then the user's question. The text chunks that sit nearest to the user query in vector space are injected into the LLM context, and the LLM can use them to generate a better and more factually grounded answer.
The model reads those chunks the same way it reads anything else in the context window. There is no special "retrieval mode." It just has better raw material than if you had sent the question cold.
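Since there is no special mode, assembling the prompt is plain string building. A minimal sketch, assuming a single text prompt; the function name build_prompt and the "Context:" and "Question:" labels are illustrative choices, not a required format.

```python
def build_prompt(system_instructions: str, chunks: list[str], question: str) -> str:
    """Concatenate instructions, retrieved chunks, and the question, in that order."""
    context = "\n\n".join(
        f"[chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    prompt = build_prompt(
        "Answer using only the context below.",
        ["Enterprise plan includes SSO.", "Starter plan is billed monthly."],
        "Does the enterprise plan include SSO?",
    )
    print(prompt)
```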
Where this breaks
The pipeline has four joints, and any one of them can fail quietly.
Bad chunking. If a chunk boundary falls mid-sentence and separates "SSO" from "enterprise plan," neither chunk alone is enough, and the model may hallucinate a connection it cannot actually see.
Weak embeddings. Indexing a large technical knowledge base with a small, older embedding model can produce vectors that don't separate concepts cleanly. The query "SSO" might retrieve chunks about "security certificates" instead.
Too many chunks retrieved. Stuffing five loosely relevant chunks into the context introduces noise. The model can be led astray by a chunk that is topically adjacent but factually irrelevant to the specific question.
Stale index. If you updated the pricing page two weeks ago but the index hasn't been rebuilt, the model confidently cites the old policy. The retrieval worked perfectly; the data was just wrong.
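Of the four, the stale index is the easiest to detect mechanically. One possible approach, sketched here as an assumption rather than a standard practice: store a content hash for each document at indexing time and compare it against the live content before trusting the index. The names find_stale_documents and indexed_hashes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_documents(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> list[str]:
    """Return ids of documents that are new or have changed since indexing.

    current_docs maps document id -> live text.
    indexed_hashes maps document id -> hash recorded when it was last indexed.
    """
    stale = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    return sorted(stale)


if __name__ == "__main__":
    indexed = {"pricing": content_hash("Enterprise: SSO included.")}
    live = {
        "pricing": "Enterprise: SSO is a paid add-on.",  # edited after indexing
        "faq": "How do I reset my password?",  # never indexed
    }
    print(find_stale_documents(live, indexed))
```

Only the documents this returns need to be re-chunked and re-embedded; the rest of the index can stay as it is.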
Where prompt caching layers on top
Once you understand the pipeline, the interaction with prompt caching becomes obvious. Prompt caching tends to work well in RAG setups where multiple users query the same knowledge base. Caching the system instructions and retrieved document chunks means the model skips prefill on the shared context.
The mechanism is specific: when an LLM processes a prompt, it generates key-value cache entries in its attention layers - mathematical representations of the relationships between tokens. Normally, the model recomputes this KV cache on every request. Prompt caching stores it so the model can skip that computation on subsequent requests that share the same prefix.
For a support bot serving hundreds of users against the same ten documents, the system prompt and the most-frequently-retrieved chunks can be cached at the provider level. Depending on the provider, this can reduce latency by up to 80% and input token costs by up to 90% for large, repetitive prompts.
The practical constraint is ordering. If the cached prefix and your new prompt are identical token-for-token up to a certain point, the model reuses the cached computation for that portion and only processes new tokens from where the match ends. This means the stable content - system instructions, fixed document chunks - must come first in the prompt, before the user's question. Retrieved chunks that change from one query to the next break the match at the first differing token, so only chunks that appear in the same position and order across requests benefit. If you put the question first, every request starts differently and you get zero cache hits.
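The ordering rule is easy to see with a toy prefix comparison. This sketch assumes whitespace-split words stand in for model tokens and that the cache matches any shared prefix; real providers match on actual tokens and may add their own minimum lengths or block sizes, so treat it as an illustration of the principle only.

```python
def shared_prefix_length(a: list[str], b: list[str]) -> int:
    """Count how many leading tokens two prompts have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Whitespace "tokens" as a stand-in for a real tokenizer.
stable = "SYSTEM You are a support bot . CONTEXT Enterprise plan includes SSO .".split()
question_1 = "Does enterprise include SSO ?".split()
question_2 = "Is there a free trial ?".split()

# Stable content first: the whole shared block is reusable.
stable_first = shared_prefix_length(stable + question_1, stable + question_2)

# Question first: the prompts diverge at the very first token.
question_first = shared_prefix_length(question_1 + stable, question_2 + stable)

if __name__ == "__main__":
    print("reusable tokens, stable content first:", stable_first)
    print("reusable tokens, question first:", question_first)
```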
A teammate like Beagle, which lives inside Slack and needs to answer questions against a shared knowledge base across hundreds of different people's conversations, benefits from exactly this pattern - shared system context cached, per-user question processed fresh.
The part that often surprises people
Prompt caching is not a replacement for RAG. The two solve different problems. RAG handles the question of which information to surface. Prompt caching handles the question of how cheaply to process that information once retrieved. The first is a retrieval problem; the second is a compute problem. Running them together is where the real efficiency shows up in production.
The support bot that answered the SSO question correctly was running both: a vector search that found the right pricing chunk, and a KV cache that meant it did not have to re-read the system prompt and common documents from scratch for the hundredth time that afternoon.