The Three Passes a RAG Pipeline Makes Before the Model Answers

Most people think RAG means "dump documents into a vector database." The real pipeline has three distinct stages, and the one most teams skip is where quality goes wrong.

Most people think RAG means "dump documents into a vector database and let the model look things up." That description is accurate the way "a car turns fuel into motion" is accurate. True, and nearly useless if something is misfiring.

A retrieval-augmented generation pipeline actually makes three separate passes before a single token of an answer is generated. Understanding each one on its own terms is what distinguishes a system that mostly works from one that reliably does.

Pass one: cutting the document into chunks

Before any document is embedded and stored, it has to be split into chunks, and those cut points define the smallest unit your system can ever retrieve. This sounds like a preprocessing detail. It is actually the highest-leverage decision in the whole pipeline.

Here is the concrete problem: imagine your company's security policy document runs to 40 pages. One paragraph on page 17 answers the question "what's our data retention rule for EU customers?" If that paragraph ends up in the middle of a 2,000-token chunk that's mostly about physical access controls, the chunk's embedding mostly reflects access controls, and a query about data retention is unlikely to match it. The same goes for a key fact split across two chunks. In either case, no embedding model or reranker can fully recover it.

Chunking configuration influences retrieval quality as much as or more than embedding model selection, according to Vectara's peer-reviewed NAACL 2025 study across 25 chunking configurations and 48 embedding models. That finding surprises most engineers, who spend far more time debating which embedding model to use.

The main approaches are fixed-size splitting (every N tokens, with some overlap), recursive splitting (try paragraph breaks, then sentence breaks, then character breaks), and semantic chunking (split where the topic actually changes, which costs more but tracks meaning). The choice matters: the gap in recall between the best and worst methods can reach 9% on the same corpus, with the same retriever.
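As a rough illustration of the simplest of these approaches, here is a minimal sketch of fixed-size splitting with overlap. The function name fixed_size_chunks and its parameters are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Assumption: whitespace splitting is a stand-in for a real tokenizer.
tokens = "a b c d e f g h i j".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, as long as it is shorter than the overlap. It does nothing for topic changes, which is the gap recursive and semantic splitting try to close.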

A 9% recall gap means roughly 1 in 11 questions that should return the right chunk won't - before the model even sees the context.

One underused technique: attach metadata to each chunk before embedding it. Metadata-aware chunking enriches each chunk with document-level context - title, section header, author, date, source URL - prepended to the chunk content before embedding. Microsoft Azure Architecture Center's 2025 guidance found it boosts QA accuracy by 15-25 points with no changes to retrieval architecture.
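A minimal sketch of the idea, assuming metadata arrives as a plain dictionary. The helper name with_metadata, the field names, and the sample text are all placeholders for illustration, not part of any particular library or of the guidance cited above.

```python
def with_metadata(chunk_text, metadata, fields=("title", "section", "date")):
    """Prepend selected document-level fields to a chunk before it is embedded."""
    # Skip fields that are missing or empty so the header stays compact.
    header = [f"{name}: {metadata[name]}" for name in fields if metadata.get(name)]
    return "\n".join(header + [chunk_text])


# Placeholder document metadata and chunk text, invented for this example.
doc_meta = {"title": "Security Policy", "section": "Data Retention", "date": ""}
text_to_embed = with_metadata("EU customer records are kept for a fixed period.", doc_meta)
print(text_to_embed)
```

The string returned here is what gets embedded. Whether the same header is also shown to the model at answer time is a separate design choice.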

Pass two: turning text into numbers

Once you have chunks, each one is fed through an embedding model that converts it into a dense vector: a list of floating-point numbers, typically 768 to 3,072 values long, that captures the semantic meaning of the text. These vectors are then stored in a vector database.

When a query arrives, it gets embedded by the same model, producing its own vector. The retrieval system then finds which stored vectors are closest to the query vector. Two sentences like "The cat sat on the mat" and "A feline rested upon the rug" should map to vectors pointing in nearly the same direction; even if the model produces vectors of slightly different magnitudes (norms), cosine similarity will correctly identify them as highly similar because the angle between them is small.
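To make the "direction, not magnitude" point concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are invented toy values, not real embeddings, and the variable names are only labels for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values): `feline` points the same way as `cat`
# but is twice as long; `unrelated` points somewhere else entirely.
cat = [1.0, 2.0, 3.0]
feline = [2.0, 4.0, 6.0]
unrelated = [3.0, -1.0, -0.5]

print(cosine_similarity(cat, feline))
print(cosine_similarity(cat, unrelated))
```

Doubling a vector's magnitude leaves its cosine similarity to any other vector unchanged, which is why the first pair scores as near-identical despite the different norms.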

Cosine similarity is well-suited to the query-document asymmetry that characterizes retrieval. Query vectors often represent short, terse expressions of information need, while document vectors represent long, rich passages. Normalizing out magnitude makes this comparison fair.

One thing that trips people up: the metric is not a free parameter you can tune independently. During training, the embedding model learns to arrange vectors in space so that a specific notion of similarity corresponds to semantic relatedness. Using a different distance metric than the one the model was trained with quietly breaks retrieval in ways that are hard to diagnose.
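One way to see how a metric swap changes results: on vectors that are not unit-normalized, raw dot product and cosine similarity can rank the same chunks in opposite order. The sketch below uses invented two-dimensional vectors and hypothetical chunk names purely to show the effect.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
# Toy vectors (assumed values), deliberately not unit-normalized.
chunks = {
    "aligned_short": [0.9, 0.1],  # nearly the query's direction, small norm
    "offtopic_long": [3.0, 4.0],  # different direction, large norm
}

best_by_cosine = max(chunks, key=lambda name: cosine(query, chunks[name]))
best_by_dot = max(chunks, key=lambda name: dot(query, chunks[name]))
print(best_by_cosine, best_by_dot)
```

When every vector is scaled to unit length, the two metrics produce the same ranking, so the mismatch only bites when the model's outputs, or the index configuration, do not match what the model was trained for.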

Pass three: reranking the candidates

This is the step most teams skip, and it is where the pipeline earns its keep.

Vector similarity is fast because it compares independently produced vectors. But independence is also its flaw: the query and each document chunk are embedded in isolation, without seeing each other, and only the resulting vectors are compared. That means the similarity score reflects whether the two pieces of text tend to appear in similar contexts - not whether the document actually answers the question.

A reranker fixes this. A reranker, typically implemented as a cross-encoder, is a model that takes a user's query and a single retrieved document and outputs a relevance score. Its only job is to determine how relevant that specific document is to that specific query.

The reason rerankers can't do the full retrieval job themselves is cost. A cross-encoder can't precompute anything. It needs to see the query and document together. So at query time, it runs a full transformer forward pass for every candidate. Run that against a million chunks per query and you'd wait minutes for an answer. So the two stages work together: retrieve cheaply and then rerank precisely.
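A back-of-the-envelope version of that cost argument, assuming candidates are scored one after another. The 2 ms per pair figure is an assumption chosen to make the scaling visible, not a measurement, and rerank_seconds is a name made up for this sketch.

```python
def rerank_seconds(num_candidates, ms_per_pair):
    """Sequential cross-encoder time: one forward pass per (query, chunk) pair."""
    return num_candidates * ms_per_pair / 1000


MS_PER_PAIR = 2  # assumed per-pair cost in milliseconds, not a benchmark

full_corpus = rerank_seconds(1_000_000, MS_PER_PAIR)  # every chunk in the index
shortlist = rerank_seconds(50, MS_PER_PAIR)           # only the retrieved candidates

print(f"full corpus: {full_corpus / 60:.1f} minutes")
print(f"shortlist:   {shortlist * 1000:.0f} ms")
```

Batching and GPUs change the constants, but the cost still grows linearly with the number of candidates, which is the reason to keep the shortlist small.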

In practice, the pipeline looks like this:

  • Vector search retrieves the top 50 or 100 candidate chunks in milliseconds.
  • The reranker scores each of those candidates against the query, reading both together, and re-orders them.
  • The top 5 or 10 reranked chunks go into the model's context.
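The three steps above can be sketched as a single function that takes the two scorers as arguments. Everything named here is hypothetical: retrieve_then_rerank is not a library call, and the two toy scorers are crude stand-ins (word overlap for vector search, overlap plus an exact-phrase bonus for a cross-encoder) so the example runs without any model.

```python
def retrieve_then_rerank(query, chunks, first_stage_score, rerank_score,
                         n_candidates=50, top_k=5):
    """Two-stage retrieval: cheap scoring over all chunks, costly scoring over a shortlist."""
    # Stage 1: rank every chunk with the cheap score and keep a shortlist.
    shortlist = sorted(chunks, key=lambda c: first_stage_score(query, c),
                       reverse=True)[:n_candidates]
    # Stage 2: re-score only the shortlist with the expensive scorer.
    reranked = sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)
    return reranked[:top_k]


# Toy stand-ins, NOT real models.
def word_overlap(query, chunk):
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def overlap_with_phrase_bonus(query, chunk):
    return word_overlap(query, chunk) + (10 if query.lower() in chunk.lower() else 0)


chunks = [
    "retention data policy for badges",
    "our data retention policy is described here",
    "visitor parking rules",
]
top = retrieve_then_rerank("data retention policy", chunks,
                           word_overlap, overlap_with_phrase_bonus,
                           n_candidates=2, top_k=1)
print(top)
```

The first stage cannot tell the first two chunks apart because it ignores word order; the second stage reads the query and chunk together and promotes the one that actually contains the phrase. A real cross-encoder makes a far richer version of the same distinction.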

When retrieving a smaller number of documents (Top K = 3), the reranker improves accuracy by over 10 percent. Sending fewer, higher-quality documents to the language model is more efficient and often yields better final answers.

On latency: on CPU with a MiniLM cross-encoder and 50 candidates, expect 100-250ms; with FlashRank, closer to 15-30ms; with Cohere's API, 150-400ms plus network. You need to decide whether your application's latency budget can absorb this. For most async or background retrieval tasks - the kind a Slack-based assistant runs when someone asks a question in a channel - the budget is there.

Where quality actually breaks

Most RAG debugging time goes to the model and the prompt. Most quality failures originate in pass one or two. A useful diagnostic split: if the right answer is never in the top-50 candidates, the problem is chunking or embedding. If it is in the top-50 but doesn't make the final cut, the problem is the reranker or the absence of one.

If a ranking metric such as NDCG@5 is low, run the same check across your evaluation set: count how often the gold answer is missing from the top 50 (a retrieval failure) versus present in the top 50 but scored poorly by the reranker (a reranking failure). That distinction tells you exactly where to spend the next hour.
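A minimal sketch of that diagnostic, assuming each evaluation query records the ID of the gold chunk, the IDs returned by vector search, and the IDs that survived reranking. The function name diagnose and the chunk IDs are invented for illustration.

```python
from collections import Counter


def diagnose(gold_id, candidate_ids, final_ids):
    """Classify one query's outcome by the stage where the gold chunk was lost."""
    if gold_id in final_ids:
        return "ok"
    if gold_id in candidate_ids:
        return "reranking"  # retrieved, but did not make the final cut
    return "retrieval"      # never retrieved: look at chunking or embedding


# Invented evaluation records: (gold chunk, first-stage candidates, final context).
runs = [
    ("c7", ["c1", "c7", "c9"], ["c7", "c1"]),
    ("c4", ["c4", "c2", "c8"], ["c2", "c8"]),
    ("c5", ["c3", "c6", "c0"], ["c3", "c6"]),
]
summary = Counter(diagnose(*run) for run in runs)
print(summary)
```

Run over a real evaluation set, the ratio of "retrieval" to "reranking" failures is the number that says which pass to work on first.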

A teammate like Beagle, operating inside Slack, runs this kind of pipeline every time someone asks a question against a connected knowledge base. The difference between a useful answer and a confidently wrong one usually traces back to a chunk boundary set three preprocessing steps ago.