How RAG Works: From Your Question to Retrieved Text

Your support bot confidently told a customer the return window is 14 days. The actual policy says 30. The model did not hallucinate - it read the wrong chunk. That is the most common RAG failure in production, and understanding why it happens requires knowing what the pipeline actually does at each step.

RAG - retrieval-augmented generation - is the mechanism behind almost every AI system that answers questions from a private knowledge base: internal wikis, support docs, engineering runbooks, HR handbooks. It enables large language models to retrieve and incorporate new information from external data sources, rather than relying only on what they learned during training. The term itself was introduced in a 2020 paper that described combining a parametric language model with a non-parametric external memory accessed through retrieval at inference time.

There are four distinct stages: chunking, embedding, retrieval, and generation. Each one can go wrong independently. Here is what each stage actually does.

How your documents get turned into searchable chunks

Before any retrieval can happen, your documents have to be prepared. A 40-page employee handbook cannot be handed to a retrieval system whole - it needs to be broken into smaller units that can be individually indexed and compared.

Chunking is the act of splitting larger documents into smaller units. Each chunk can be individually indexed, embedded, and retrieved. The size of those chunks matters more than most people expect. Common practices suggest chunks between 128-512 tokens. Smaller chunks (128-256 tokens) work well for fact-based queries where precise keyword matching matters, while larger chunks (256-512 tokens) are better for tasks requiring broader context, like summarizing concepts.
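
To make the size trade-off concrete, here is a minimal sketch of fixed-size chunking. It makes two assumptions that are not from any particular library: whitespace-separated words stand in for real tokenizer tokens, and consecutive chunks share a small overlap (a hypothetical default of 32) so a sentence cut at a boundary still appears whole in one of them.

```python
def chunk_fixed(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into windows of at most chunk_size tokens.

    Consecutive windows share `overlap` tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    # Assumption: whitespace words approximate tokens. A real pipeline
    # would count tokens with the embedding model's own tokenizer.
    tokens = text.split()
    step = chunk_size - overlap

    chunks: list[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the document.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Calling chunk_fixed(handbook_text, chunk_size=128) versus chunk_size=512 on the same document is a quick way to see how many chunks each setting produces and how much unrelated text ends up sharing a chunk.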

The failure mode cuts both ways. Oversized chunks bundle many topics together, so the one sentence that answers a query is diluted by everything around it and the chunk matches ambiguously. Undersized chunks narrow what retrieval can see at once, separating facts from the context that makes them interpretable. A failure like the return-policy example from the opening is often a chunking problem: the correct sentence existed in the document, but got buried in a 1,500-token chunk that also matched a dozen other queries more strongly.

A peer-reviewed clinical decision support study (MDPI Bioengineering, November 2025) found that adaptive chunking aligned to logical topic boundaries hit 87% accuracy versus 13% for fixed-size baselines. Whether that result holds for your domain is an empirical question - but the gap is large enough to take seriously.
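
The study's adaptive method is not reproduced here. As a much simpler stand-in for the idea of respecting logical boundaries, the sketch below packs whole paragraphs into a chunk until a token budget is reached, so no chunk ends mid-paragraph. It assumes paragraphs are separated by blank lines and again uses whitespace words as a rough proxy for tokens.

```python
import re


def chunk_by_paragraph(text: str, max_tokens: int = 256) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_tokens tokens."""
    # Assumption: a blank line marks a paragraph (topic) boundary.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace words as a stand-in for tokens
        # Close the current chunk if this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single paragraph longer than max_tokens becomes its own chunk.
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Comparing this against a fixed-size splitter on your own documents and queries is the empirical test the paragraph above calls for.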

What an embedding actually is

Once you have chunks, each one gets converted into an embedding - a list of numbers that encodes its meaning. Concretely, an embedding model maps the text to a high-dimensional vector (1,536 dimensions is a common size) that captures its semantic meaning.

The useful property of embeddings is that meaning becomes distance. Similar concepts - like "car" and "vehicle" - end up closer to each other in the vector space than unrelated terms like "car" and "banana." This is what makes semantic search work. Consider a library of wine descriptions, one of which mentions the wine is "good with fish." A "wine for seafood" keyword search won't find that wine. But a meaning-based search understands that "fish" is similar to "seafood" - and finds it.

When a user asks a question, that question also gets embedded into the same vector space using the same model. The retrieval system then looks for chunks whose vectors are closest to the query vector. The system calculates the similarity between the query vector and all document vectors using metrics like cosine similarity or dot product. Documents with vectors closest to the query vector are ranked highest.
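
The ranking step described above can be sketched in a few lines. The vectors here are hand-written three-dimensional toys rather than real model output, and the chunk ids are hypothetical; the point is only to show cosine similarity and a top-k sort.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float],
          chunk_vecs: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption: a real system would get these from an embedding model).
chunk_vecs = {
    "returns": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # imagined embedding of "what is the return window?"
ranked = top_k(query_vec, chunk_vecs, k=2)
```

With these toy values the "returns" chunk ranks first and "shipping" second. When vectors are normalized to unit length, dot product and cosine similarity produce the same ordering.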

All of those embeddings live in a vector database - tools like Pinecone, Weaviate, Chroma, or FAISS. Vector databases are specialized repositories for semantic vectors, optimized for fast storage, indexing, and retrieval. The key goal is to quickly find the closest vectors by content - those with high semantic similarity to the query.
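
To show what such a store does at its simplest, here is a hypothetical in-memory index. It is not the API of Pinecone, Weaviate, Chroma, or FAISS; the class and method names are invented for illustration. It also scans every stored vector on each query, whereas real vector databases usually rely on approximate nearest-neighbor indexes to avoid that cost.

```python
import math


def _normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot index or query with a zero vector")
    return [x / norm for x in vec]


class TinyVectorIndex:
    """Illustrative brute-force store: chunk id -> (unit vector, chunk text)."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}
        self._texts: dict[str, str] = {}

    def __len__(self) -> int:
        return len(self._vectors)

    def upsert(self, chunk_id: str, vector: list[float], text: str) -> None:
        """Insert a chunk, or overwrite it if the id already exists."""
        self._vectors[chunk_id] = _normalize(vector)
        self._texts[chunk_id] = text

    def query(self, vector: list[float], k: int = 3) -> list[tuple[str, float, str]]:
        """Return up to k (id, similarity, text) tuples, most similar first."""
        q = _normalize(vector)
        scored = [(cid, sum(a * b for a, b in zip(q, vec)))
                  for cid, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [(cid, score, self._texts[cid]) for cid, score in scored[:k]]
```

The upsert behavior matters in practice: when a document changes, its chunks need to be overwritten in the index, or the old text stays retrievable.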

The embedding model is not a neutral pipe. Embedding model choice matters as much as chunking strategy. A general-purpose embedding model trained on web text may handle your internal product terminology poorly. Domain-specific models - or fine-tuned ones - exist for exactly this reason.

What retrieval actually returns (and when it goes wrong)

Retrieval-augmented generation is not the same as "vector search." Pure semantic search via embeddings is powerful but incomplete on its own. Dense retrieval is excellent at semantic similarity: ask "what are the revenue figures for Q3?" and it finds chunks about financial performance even if they don't contain that exact phrase. But it can miss chunks that use the exact terminology your users type, because exact-match signals get diluted by semantics.

The popular shortcut "RAG equals vector DB" is the single biggest source of expensive failures at scale. Production systems commonly use hybrid retrieval - semantic search running in parallel with keyword search (BM25 or similar) - then merge and rerank the results before passing them to the model.
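
The article does not say how the two result lists get merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method any particular product uses. Each retriever contributes 1 / (k + rank) for every chunk it returns, so chunks that rank well in both lists rise to the top. The constant k = 60 is a conventional default, and the chunk ids are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Merge several best-first id lists into one list of (id, fused score)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


semantic_hits = ["c1", "c2", "c3"]  # best-first ids from vector search
keyword_hits = ["c4", "c1", "c2"]   # best-first ids from BM25 or similar
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Here c1 and c2 come out ahead because both retrievers found them, while c4 (keyword only, rank 1) edges out c3 (semantic only, rank 3). RRF needs only ranks, not comparable scores, which is why it is convenient for combining BM25 with cosine similarity. A reranking model can then reorder the fused list.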

The top-K chunks that survive retrieval get assembled into what the model actually reads. The documents are concatenated as context with the original input prompt and fed to the text generator which produces the final output. This is sometimes called "prompt stuffing" - you are literally inserting retrieved text into the prompt before the model sees it.
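
A minimal sketch of that assembly step follows. The instruction wording, the numbered-context layout, and the function name are all illustrative assumptions; real systems vary widely in how they format the prompt.

```python
def build_prompt(question: str, chunks: list[str], max_chunks: int = 3) -> str:
    """Concatenate the top retrieved chunks with the user's question."""
    selected = chunks[:max_chunks]  # chunks are assumed to arrive best-first
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(selected, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the chunks makes it possible to ask the model to cite which one it used, which helps when tracing a wrong answer back to a wrong chunk.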

Stuffing more chunks into the prompt is not free. One evaluation tested 18 models, including GPT-4.1, Claude 4, and Gemini 2.5, and found that retrieval performance degrades as context length increases, even on straightforward tasks.

Where the model fits in - and the failure nobody debugs

The final stage is generation. The model receives the original question plus the retrieved chunks, and writes an answer. At its core, RAG operates by retrieving relevant information from a vast corpus of data and then generating coherent and contextually enriched responses based on this retrieved information.

The important thing to understand here is that two independent failures can produce the same wrong answer. The retriever can return the wrong chunks, or the model can mishandle the right ones. To the user, both look identical - a confidently incorrect response.

Most debugging effort goes to the model: teams swap models, rewrite prompts, and adjust temperature, sometimes for weeks, when the real problem is poor chunking, outdated documents, or weak embeddings. Prompt engineering cannot fix bad context. If your retrieval pipeline returns irrelevant chunks 30% of the time, no system prompt will save you.

This is the most important operational insight about RAG: retrieval quality and generation quality are separate problems that require separate measurement. An LLM that faithfully echoes retrieved content still gives wrong answers if retrieval returned the wrong chunks, so a faithfulness score alone tells you nothing about whether the right chunks were retrieved.
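
On the retrieval side, one simple measurement is recall@k over a small hand-labeled set of questions. The sketch below assumes you have, for each test question, the ids your retriever returned and the ids a human marked as relevant; both the data shape and the function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(eval_set: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved ids, relevant ids) pairs."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    return sum(recall_at_k(r, rel, k) for r, rel in eval_set) / len(eval_set)
```

This covers only the retriever. Generation quality needs its own check, for example comparing answers against the retrieved text, so that a drop in either number points at the right stage.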

An AI teammate like Beagle, which answers questions from Slack directly against your team's connected knowledge sources, runs this same pipeline every time someone asks it something. The answer quality depends on what was chunked, how it was embedded, and what was retrieved - not just which model runs at the end.

The pipeline as a whole

RAG's value is real: a key advantage over other approaches is that the LLM doesn't need to be retrained for task-specific applications. You can point it at new documents, update them, and the system reflects the change without touching a model weight. That is why most enterprise AI deployments rely on it rather than on fine-tuning.

But it is not a single thing you switch on. RAG is not a component you add - it's a pipeline you architect. The decisions you make at ingestion time (how you chunk, how you embed) determine the ceiling of what's possible at retrieval and generation time.

The return-policy bot told a customer the wrong date because a chunk boundary landed in the wrong place, or an older document had not been re-indexed after the policy changed. That is a data engineering problem wearing an AI hat. Knowing the four stages - chunk, embed, retrieve, generate - tells you exactly where to look when the answer comes back wrong.
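
For the stale-document case, one cheap safeguard is to store a content hash alongside each indexed document and compare it against the current source before trusting the index. This is a sketch under stated assumptions: the document ids, the in-memory dictionaries, and the function names are hypothetical, and a real pipeline would read hashes from wherever it keeps index metadata.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(documents: dict[str, str],
               indexed_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or changed since they were indexed."""
    stale = [
        doc_id
        for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
    return sorted(stale)


# The policy changed from 14 to 30 days, but the index still holds the old hash.
documents = {"returns": "Returns accepted within 30 days.",
             "shipping": "Shipping takes 5 days."}
indexed_hashes = {"returns": content_hash("Returns accepted within 14 days."),
                  "shipping": content_hash("Shipping takes 5 days.")}
needs_reindex = find_stale(documents, indexed_hashes)
```

In this example only "returns" is flagged, which is exactly the document whose outdated chunk would have produced the 14-day answer.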
