RAG Pipeline Design: From Prototype to Production
Every RAG prototype works. You chunk some documents, embed them, throw them into a vector database, and the LLM produces answers that look impressive in a demo. Then you deploy it to real users and discover that retrieval quality is inconsistent, latency is unpredictable, and the system hallucinates confidently when the relevant document is sitting right there in the index. Here is what I have learned about closing the gap between prototype and production.
Chunking Is the Foundation
Most prototype pipelines use fixed-size character or token splits with some overlap. This works for homogeneous text, but real-world corpora include tables, headers, nested lists, code blocks, and metadata that carry structural meaning. Splitting a table across two chunks destroys the information in both.
Use document-aware chunking. Parse your source format properly — if it is HTML, split on semantic boundaries like headings and sections. If it is PDF, use a layout-aware parser that preserves table structure. For technical documentation, I chunk at the section level and keep the header hierarchy as metadata attached to each chunk. This metadata becomes critical later for filtering and re-ranking.
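A minimal sketch of section-level chunking with the header hierarchy kept as metadata. It assumes Markdown-style `#` headings as the source format purely to stay dependency-free; the `Chunk` type and `chunk_by_section` function are hypothetical names, and a real pipeline would swap in a proper HTML or PDF parser.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    headers: list[str] = field(default_factory=list)  # heading path, outermost first


def chunk_by_section(doc: str) -> list[Chunk]:
    """Split a Markdown-style document at headings, keeping the heading path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headings
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(text=text, headers=[title for _, title in path]))
        body.clear()

    for line in doc.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            flush()  # close the previous section before changing the path
            # Drop headings at the same or deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, title))
        else:
            body.append(line)
    flush()
    return chunks
```

Note that this naive heading check would also trigger on `#` comments inside code blocks, which is one more reason to use a real parser for anything beyond a sketch.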
Chunk size matters more than most teams realize. Smaller chunks (200-300 tokens) improve retrieval precision but lose surrounding context. Larger chunks (800-1000 tokens) preserve context but dilute the embedding signal. In production, I typically index at a smaller granularity and then expand the context window at retrieval time by fetching neighboring chunks — a pattern sometimes called “parent-child” or “windowed retrieval.”
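The neighbor expansion can be sketched as follows, assuming chunks are stored in document order and each hit is a `(doc_id, chunk_index)` pair. The `expand_hits` name and the `window` parameter are illustrative, not taken from any particular library.

```python
def expand_hits(
    hits: list[tuple[str, int]],
    chunks_by_doc: dict[str, list[str]],
    window: int = 1,
) -> list[str]:
    """Return each hit's chunk joined with up to `window` neighbors per side."""
    contexts: list[str] = []
    seen: set[tuple[str, int, int]] = set()
    for doc_id, idx in hits:
        chunks = chunks_by_doc[doc_id]
        start = max(0, idx - window)               # clamp at document start
        end = min(len(chunks), idx + window + 1)   # clamp at document end
        if (doc_id, start, end) in seen:           # skip exact duplicate windows
            continue
        seen.add((doc_id, start, end))
        contexts.append(" ".join(chunks[start:end]))
    return contexts
```

Windows from nearby hits can still overlap here; merging overlapping ranges before sending them to the LLM is a reasonable next step if token cost matters.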
Retrieval Is Not Just Vector Search
Pure semantic search has blind spots. It struggles with exact keyword matches, acronyms, product codes, and queries that depend on precise terminology rather than meaning. A hybrid approach combining dense vector search with sparse keyword search (BM25) consistently outperforms either method alone.
In practice, this means running two retrieval paths and merging results using reciprocal rank fusion or a learned score combination. Most vector databases now support hybrid search natively — Weaviate, Qdrant, and Pinecone all offer it. If yours does not, run an Elasticsearch BM25 query alongside your vector query and merge the results before re-ranking.
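Reciprocal rank fusion is simple enough to sketch in a few lines. This version assumes each retrieval path returns an ordered list of document IDs; the constant `k = 60` is the value commonly used in the RRF literature, and you may want to tune it for your own data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Because RRF only looks at ranks, it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.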
Re-ranking is the highest-leverage improvement you can make to retrieval quality. After your initial retrieval returns the top 20-50 candidates, pass them through a cross-encoder model that scores each document against the query with full attention. This is computationally expensive compared to bi-encoder similarity, but you are only scoring a small candidate set. I use Cohere’s rerank API or a self-hosted cross-encoder/ms-marco-MiniLM model depending on latency and cost requirements.
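The re-ranking step has the same shape regardless of which scorer you use. The sketch below takes the scorer as a parameter; `token_overlap` is a toy stand-in I made up so the example runs without a model, and in a real deployment `score_fn` would wrap a cross-encoder or a hosted rerank API.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Score every (query, candidate) pair and keep the top_n candidates."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def token_overlap(query: str, doc: str) -> float:
    """Toy scorer (NOT a cross-encoder): share of query tokens found in doc."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0
```

Keeping the scorer injectable also makes it easy to A/B a hosted API against a self-hosted model on the same candidate sets.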
Evaluation Before Optimization
You cannot improve what you do not measure. Before tuning anything, build an evaluation dataset: a set of questions paired with the expected source documents and ideal answers. I aim for at least 100 representative queries covering different question types, document sources, and difficulty levels.
Measure retrieval and generation separately. For retrieval, track recall@k (did the relevant document appear in the top k results) and MRR (mean reciprocal rank: how high the first relevant document ranked, averaged over all queries). For generation, combine LLM-as-judge evaluation for answer quality with factual grounding checks against the retrieved context. This separation tells you whether failures originate in retrieval or generation, which dictates completely different fixes.
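Both retrieval metrics are easy to compute by hand. This sketch assumes each evaluation item is a pair of (retrieved IDs in rank order, set of relevant IDs); the function names are mine. With a single relevant document per query, `recall_at_k` reduces to the hit-or-miss definition described above.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over all evaluation queries."""
    n = len(runs)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```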
Production Hardening
Three things that do not matter in a prototype but will break production:
Stale data handling. Documents change. Build an incremental ingestion pipeline that detects updates, re-chunks and re-embeds modified documents, and removes deleted ones. A nightly batch job works for most use cases. For high-frequency updates, use a change data capture pattern that triggers re-indexing on document modification events.
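One common way to detect updates is to compare content hashes. The sketch below assumes you persist a hash per document ID at indexing time; `content_hash` and `plan_sync` are hypothetical helpers, and the actual re-chunking, re-embedding, and deletion calls depend on your vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (ids to re-chunk and re-embed, ids to remove from the index)."""
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or modified
    )
    to_delete = sorted(set(indexed_hashes) - set(current_docs))  # gone at source
    return to_upsert, to_delete
```

Unchanged documents are skipped entirely, which is what keeps a nightly job cheap as the corpus grows.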
Latency budgets. A production RAG call involves embedding the query, searching the index, optionally re-ranking, and then running LLM generation. Each step adds latency. Set a latency budget for the full pipeline and work backward. If your total budget is 3 seconds and generation takes 2 seconds, retrieval and re-ranking need to complete in under a second combined. Cache frequent queries aggressively — I use a semantic cache that returns stored answers for queries with high cosine similarity to previously seen ones.
Guardrails. When retrieval returns low-confidence results, the system should say so rather than letting the LLM improvise. Set a minimum similarity threshold and, when no chunk meets it, return a structured “I don’t have enough information” response instead of generating from the LLM’s parametric knowledge. In my experience, this single change eliminates the majority of hallucination complaints in production.
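The threshold check can be sketched as a thin wrapper around generation. This assumes hits arrive as `(chunk_text, similarity)` pairs and that `generate` is whatever function calls your LLM; the `Answer` type, the `min_score` default of 0.75, and the fallback wording are placeholders to calibrate against your own evaluation set.

```python
from dataclasses import dataclass
from typing import Callable

FALLBACK = "I don't have enough information to answer that."


@dataclass
class Answer:
    text: str
    grounded: bool       # False when the fallback was returned
    sources: list[str]   # chunks actually passed to the LLM


def answer_with_guardrail(
    query: str,
    hits: list[tuple[str, float]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.75,
) -> Answer:
    """Generate only from chunks that clear the similarity threshold."""
    supported = [text for text, score in hits if score >= min_score]
    if not supported:
        # No chunk is confident enough: skip the LLM call entirely.
        return Answer(text=FALLBACK, grounded=False, sources=[])
    return Answer(text=generate(query, supported), grounded=True, sources=supported)
```

Returning a structured object rather than a bare string lets the calling application log ungrounded responses and render them differently in the UI.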
The gap between a RAG demo and a RAG product is mostly engineering discipline — measurement, iteration, and building for the failure modes that only appear at scale.
