# Building Production RAG Systems: Beyond the Tutorial
What actually matters when deploying RAG systems in production — chunking, hybrid search, reranking, and evaluation pipelines.
Every RAG tutorial follows the same pattern: chunk documents, embed them, store in a vector database, retrieve, generate. It works in demos. It fails in production.
After deploying RAG systems for enterprise clients processing millions of documents, I've identified the gaps between tutorial RAG and production RAG. This post covers what actually matters.
## The RAG Pipeline in Production
```
Documents  → Preprocessing   → Chunking  → Embedding → Indexing
                                                          │
User Query → Query Processing → Retrieval → Reranking → Generation
                                                            │
                                               Evaluation ← Feedback
```
Each stage has failure modes that tutorials never mention.
## Chunking: The Foundation
Bad chunking is the #1 cause of poor RAG quality. The naive approach — splitting by token count — destroys context.
### Semantic Chunking
Instead of fixed-size chunks, split on semantic boundaries:
```python
from langchain.text_splitter import (
    CharacterTextSplitter,
    MarkdownTextSplitter,
    RecursiveCharacterTextSplitter,
)

# Naive approach — DON'T DO THIS: fixed-size splits cut across sentences
bad_splitter = CharacterTextSplitter(chunk_size=500)

# Better: respect document structure
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", ". ", " "],
)

# Best: use document-specific splitters
md_splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=200)
```
### The Overlap Strategy
Overlap is critical but often misunderstood. Too little overlap (< 10%) loses context at chunk boundaries. Too much (> 30%) wastes tokens and creates redundancy. I've found 15-20% overlap is the sweet spot for most document types. With `chunk_size=1000`, that works out to a `chunk_overlap` of 150-200 tokens.
### Metadata Enrichment
Every chunk should carry metadata:
```python
chunk = {
    "text": "The quarterly revenue increased by 15%...",
    "metadata": {
        "source": "Q3_2024_earnings.pdf",
        "page": 12,
        "section": "Financial Results",
        "date": "2024-10-15",
        "document_type": "earnings_report",
    },
}
```
This metadata enables filtered retrieval — searching only within specific document types or date ranges.
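For illustration, a metadata pre-filter might look like the sketch below. `filter_chunks` is a hypothetical helper, not a library API; many vector stores expose equivalent filtering as a parameter on the search call itself, which is cheaper because it narrows the index before the similarity computation.

```python
from datetime import date

def filter_chunks(chunks, document_type=None, after=None):
    """Keep only chunks whose metadata matches the given filters."""
    results = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if document_type and meta["document_type"] != document_type:
            continue
        if after and date.fromisoformat(meta["date"]) < after:
            continue
        results.append(chunk)
    return results

chunks = [
    {"text": "Revenue grew 15%...",
     "metadata": {"document_type": "earnings_report", "date": "2024-10-15"}},
    {"text": "New hire policy...",
     "metadata": {"document_type": "hr_policy", "date": "2023-01-02"}},
]

# Restrict retrieval to earnings reports only
hits = filter_chunks(chunks, document_type="earnings_report")
```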
## Hybrid Search: Beyond Pure Vectors
Pure vector search has a fundamental weakness: it's great at semantic similarity but poor at exact matching. Ask "What was the revenue in Q3 2024?" and vector search might return chunks about Q2 2024 or Q3 2023 — semantically similar, factually wrong.
### BM25 + Vector Fusion
The solution is hybrid search combining BM25 (keyword) and vector (semantic):
```python
from rank_bm25 import BM25Okapi
import numpy as np

# Assumes three module-level objects built at index time:
#   vector_store (the vector index), bm25 (a BM25Okapi over the corpus),
#   and documents (the corpus, in the same order bm25 was built from).
def hybrid_search(query: str, k: int = 10, alpha: float = 0.7):
    # Vector search (over-retrieve so fusion has candidates to work with)
    vector_results = vector_store.similarity_search(query, k=k * 2)

    # BM25 keyword search
    tokenized_query = query.lower().split()
    bm25_scores = bm25.get_scores(tokenized_query)
    bm25_top_k = np.argsort(bm25_scores)[-k * 2:][::-1]

    # Weighted Reciprocal Rank Fusion (the constant 60 comes from the RRF paper)
    combined_scores = {}
    for rank, doc in enumerate(vector_results):
        combined_scores[doc.id] = alpha * (1 / (rank + 60))
    for rank, idx in enumerate(bm25_top_k):
        doc_id = documents[idx].id
        combined_scores[doc_id] = combined_scores.get(doc_id, 0.0) + (1 - alpha) * (1 / (rank + 60))

    # Sort by combined score
    return sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:k]
```
In my tests, hybrid search improves retrieval precision by 25-40% compared to vector-only search.
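The fusion step is easier to see in isolation. A self-contained sketch with made-up document IDs (`rrf_fuse` is an illustrative helper, not a library function):

```python
def rrf_fuse(rankings, weights, k_const=60):
    """Fuse ranked lists of doc IDs with weighted Reciprocal Rank Fusion."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * (1 / (rank + k_const))
    return sorted(scores, key=scores.get, reverse=True)

vector_ranked = ["d3", "d1", "d7"]  # semantic neighbours
bm25_ranked = ["d1", "d9", "d3"]    # keyword matches

# Documents that appear in both lists rise to the top
fused = rrf_fuse([vector_ranked, bm25_ranked], weights=[0.7, 0.3])
```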
## Reranking: The Quality Multiplier
After retrieval, a reranker model rescores results using cross-attention — it sees query and document together, not independently:
```python
from cohere import Client

co = Client(api_key="...")

def rerank(query: str, documents: list[str], top_k: int = 5):
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_k,
    )
    return [(r.index, r.relevance_score) for r in response.results]
```
Reranking typically adds 100-300ms of latency but improves answer quality by 30-50%. It's the highest ROI improvement you can make to a RAG pipeline.
## Evaluation: Measuring What Matters
You can't improve what you don't measure. I use the RAGAS framework for systematic evaluation:
```python
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_dataset,
    metrics=[
        answer_relevancy,   # Is the answer relevant to the question?
        faithfulness,       # Is the answer grounded in the context?
        context_precision,  # Are retrieved contexts relevant?
        context_recall,     # Are all relevant contexts retrieved?
    ],
)
```
### Key Metrics
| Metric | Target | What It Measures |
|---|---|---|
| Faithfulness | > 0.9 | Hallucination rate |
| Answer Relevancy | > 0.85 | Answer quality |
| Context Precision | > 0.8 | Retrieval accuracy |
| Context Recall | > 0.8 | Retrieval completeness |
If faithfulness drops below 0.9, your system is hallucinating — the most dangerous failure mode in production.
## Production Considerations

### Caching
Cache embeddings and frequent queries. A Redis cache in front of your vector store can reduce latency by 90% for repeated queries.
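A minimal sketch of an embedding cache keyed by a content hash. The in-process dict stands in for Redis, and `embed_fn` is a placeholder for your real embedding call:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by a hash of the text; swap the dict for Redis in production."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}  # in production: redis.Redis(...) with GET/SET
        self.hits = 0

    def get(self, text: str):
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        vec = self.embed_fn(text)  # only pay for the model call on a miss
        self.store[key] = vec
        return vec

# Stand-in embedder so the sketch runs without a model
cache = EmbeddingCache(embed_fn=lambda t: [float(len(t))])
cache.get("what was Q3 revenue?")
cache.get("what was Q3 revenue?")  # second call is served from cache
```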
### Monitoring
Log every query, every retrieval, every generation. When a user reports a wrong answer, you need to trace the full pipeline to identify the failure point.
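One way to make that trace possible is a shared trace ID stamped on one structured log line per stage. A minimal sketch; the field names and the `hypothetical-llm` model tag are placeholders:

```python
import json
import time
import uuid

def log_stage(trace_id: str, stage: str, payload: dict) -> dict:
    """Emit one structured log line per pipeline stage, keyed by trace_id."""
    record = {"trace_id": trace_id, "stage": stage, "ts": time.time(), **payload}
    print(json.dumps(record))  # in production: ship to your log aggregator
    return record

trace_id = str(uuid.uuid4())
log_stage(trace_id, "retrieval", {"query": "Q3 revenue?", "doc_ids": ["d1", "d3"]})
log_stage(trace_id, "generation", {"model": "hypothetical-llm", "answer_chars": 212})
```

Grepping the logs for one `trace_id` then reconstructs the full pipeline run for a single user query.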
### Guardrails
Add input/output validation:
- Input: Detect and reject prompt injection attempts
- Output: Verify the answer references retrieved context (not hallucinated)
- Fallback: When confidence is low, route to human agent
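The output check can start very simply, for example as a lexical-overlap grounding test. This is a crude sketch, not a substitute for a model-based faithfulness check, but it catches answers that share almost no vocabulary with the retrieved context:

```python
def grounded(answer: str, contexts: list[str], min_overlap: float = 0.5) -> bool:
    """Crude grounding check: fraction of answer words present in the retrieved context."""
    answer_words = set(answer.lower().split())
    context_words = set(" ".join(contexts).lower().split())
    if not answer_words:
        return False
    overlap = len(answer_words & context_words) / len(answer_words)
    return overlap >= min_overlap

ctx = ["the quarterly revenue increased by 15% to $2.1B."]
grounded("revenue increased by 15%", ctx)   # fully supported by the context
grounded("the CEO resigned yesterday", ctx) # fails the check
```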
## Conclusion
Production RAG is an engineering discipline, not a demo project. The gap between tutorial RAG and production RAG is enormous — but bridgeable with the right architecture, evaluation framework, and operational practices.