RAG answers are only as good as the chunks you cut

Retrieval-augmented generation is sold as a way to give an LLM access to your documents. What actually determines quality isn't the model - it's how you slice up the source material before the query arrives.

Ask most engineers how RAG works and they'll say "you put your docs in a vector database and the model looks things up." That's not wrong, but it skips the part where most pipelines quietly fail: the chunking step that happens long before any query arrives.

Here is what actually happens.

The pipeline runs in two separate phases

The first phase - indexing - is entirely offline. You take your knowledge base and convert it into a searchable format. That means splitting documents into smaller segments called chunks, converting those chunks into vector embeddings using an embedding model, and storing them in a vector database.
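The indexing phase can be sketched in a few lines. This is a minimal illustration, not a production indexer: `toy_embed` is a hypothetical stand-in for a real embedding model (it only hashes words into buckets and captures no semantics), `split_into_chunks` is a deliberately naive fixed-size splitter, and a plain Python list plays the role of the vector database.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words,
    L2-normalised. A real model would be called here instead."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def split_into_chunks(doc: str, max_words: int = 50) -> list[str]:
    """Naive splitter: cut every max_words words, ignoring sentence boundaries."""
    words = doc.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(docs: dict[str, str], max_words: int = 50) -> list[dict]:
    """Offline step: chunk every document, embed every chunk, store the result."""
    index = []
    for doc_id, text in docs.items():
        for n, chunk in enumerate(split_into_chunks(text, max_words)):
            index.append(
                {"doc_id": doc_id, "chunk_no": n, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index


# A made-up 120-word document, indexed in 50-word chunks.
docs = {"runbook": " ".join(f"step{i}" for i in range(120))}
index = build_index(docs)
```

Everything here runs before any user shows up, which is why a poor choice in `split_into_chunks` stays baked into the index until you rebuild it.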

The second phase, retrieval, happens at query time. When a user sends a question, the system converts that query into a vector using the same embedding model and runs a similarity search against the index to find the most relevant chunks. The top-K results - commonly three to ten chunks - are returned as the context for generation.
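As a rough sketch of the retrieval step, the function below scores every stored chunk against the query vector and keeps the best K. The three-dimensional vectors are invented by hand for illustration; a real system would get them from the embedding model, and a vector database would normally use an approximate nearest-neighbour index rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 3):
    """Score every chunk against the query and return the k best (score, text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made toy vectors (assumption: a real embedding model would produce these).
index = [
    ("restart the worker pool", [0.9, 0.1, 0.0]),
    ("rotate the API keys", [0.1, 0.9, 0.1]),
    ("drain the queue before restart", [0.8, 0.2, 0.1]),
    ("office lunch menu", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.1, 0.0]

results = top_k(query_vec, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Note that the retriever only ever sees chunks. Whatever the splitter cut away is invisible at this stage, no matter how large K is.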

Generation is the final step. The retrieved context is combined with the user's original query into a prompt, and the LLM generates a response grounded in that context. The model can now reference specific facts and produce answers that reflect the current state of your knowledge base rather than whatever it memorized during training.
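Combining the retrieved context with the query is mostly string assembly. The sketch below shows one plausible prompt layout; the wording of the instructions and the `build_prompt` helper are assumptions for illustration, and the call to the LLM itself is left out.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Join retrieved chunks into a numbered context block and append the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up retrieved chunks, standing in for the retriever's top-K output.
prompt = build_prompt(
    "How do I restart the worker pool?",
    ["Drain the queue first.", "Then run the restart script."],
)
print(prompt)
```

Numbering the chunks is a design choice, not a requirement: it gives the model something to cite and makes it easier to trace an answer back to the chunk that produced it.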

So there are really three distinct machines: an indexer, a retriever, and a generator. Teams that struggle with RAG quality usually have a generator problem in their heads but a chunking problem in their code.

What a chunk actually is

At query time, the input query is converted into a vector embedding by an embedding model - a model that maps text into a numerical form suitable for similarity search. That vector is sent to the vector database, which holds the embeddings of your chunks and ranks them by vector similarity; cosine similarity is a common choice. A chunk, in other words, is the unit that gets embedded, stored, and compared against the query.
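For reference, cosine similarity between a query vector and a chunk vector is the standard definition below. The symbols are my own notation, not the article's: q is the query embedding, c is a chunk embedding, and d is the number of dimensions.

```latex
\[
\operatorname{sim}(\mathbf{q}, \mathbf{c})
  = \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}
  = \frac{\sum_{i=1}^{d} q_i c_i}
         {\sqrt{\sum_{i=1}^{d} q_i^2} \, \sqrt{\sum_{i=1}^{d} c_i^2}}
\]
```

The score depends only on the angle between the two vectors, not their lengths: 1 means they point the same way, 0 means they are orthogonal.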

The embedding model doesn't read text the way a human does. It compresses a passage into a fixed-size list of numbers - typically 768 to 3072 floating-point values, depending on the model - where the geometry of the space encodes semantic meaning. Two passages that mean similar things land near each other. Two passages that share keywords but mean different things may still land far apart.

This is why chunking strategy matters so much. If you split a 40-page engineering spec into 500-word blocks with hard cuts at arbitrary positions, you will regularly split a sentence at exactly the point where the meaning becomes specific. The retrieved chunk will then contain the context but not the conclusion, or the conclusion but not the variable it refers to.
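The effect is easy to reproduce on a small scale. The sketch below compares a fixed-size splitter with one that only cuts at sentence boundaries. Both helpers and the three-sentence sample text are invented for illustration, and the regex-based sentence splitting is a simplification that real documents will break.

```python
import re


def fixed_size_chunks(text: str, max_words: int) -> list[str]:
    """Cut every max_words words, wherever that happens to land."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def sentence_chunks(text: str, max_words: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_words words.
    A single sentence longer than max_words becomes its own oversized chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Invented sample text (assumption, not taken from any real spec).
text = (
    "The retry limit is configurable per service. "
    "For the billing service it must stay at three. "
    "Raising it causes duplicate charges."
)

fixed = fixed_size_chunks(text, 10)
by_sentence = sentence_chunks(text, 10)

print(fixed[0])        # ends mid-sentence, at "... For the billing"
print(by_sentence[1])  # the full billing sentence, intact
```

With the fixed-size splitter, the constraint "must stay at three" ends up in a chunk that no longer says which service it applies to. The sentence-aware version keeps each statement whole, at the cost of uneven chunk sizes.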

The embedding model compresses meaning into geometry. A badly cut chunk can preserve the words while destroying the meaning.

A concrete example. Suppose your runbook has a section like this: