Overview
Retrieval-Augmented Generation (RAG) is the most practical way to give LLMs access to your private data. But most RAG systems fail in production because they're built for demos, not scale.
This guide covers:
- Vector database selection — 5 options compared with real benchmarks
- Chunking strategies — semantic vs fixed vs hybrid approaches
- Retrieval optimization — hybrid search, reranking, query expansion
- Evaluation framework — metrics that actually matter
- Production checklist — what to do before going live
RAG Architecture
A production RAG system has five core components: document ingestion and chunking, an embedding model, a vector store, a retriever, and an LLM for answer generation. The simplified pipeline below wires up the embedding, retrieval, and generation pieces with LangChain:
# Simplified RAG pipeline
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
# 1. Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = Pinecone.from_existing_index("my-index", embeddings)
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
# 2. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=vectorstore.as_retriever(
search_type="mmr", # Maximal Marginal Relevance
search_kwargs={"k": 5, "fetch_k": 20}
),
return_source_documents=True,
)
# 3. Query
result = qa_chain({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])Vector Database Comparison
Choosing the right vector database is critical. Here's our comparison based on real production deployments:
Pinecone
Managed. Best for: fast time-to-production, serverless.
Pros:
- Zero ops
- Excellent docs
- Serverless scaling
- Hybrid search
Cons:
- Vendor lock-in
- Expensive at scale
- Limited filtering
Weaviate
Open Source / Managed. Best for: flexibility, self-hosted control.
Pros:
- GraphQL API
- Multi-modal
- Great filtering
- OSS option
Cons:
- More ops overhead
- Learning curve
- Resource hungry
pgvector
PostgreSQL Extension. Best for: teams already on Postgres, small-to-medium scale.
Pros:
- Use existing Postgres
- ACID transactions
- Familiar SQL
- Free
Cons:
- Scale limits
- Manual optimization
- No native hybrid search
Qdrant
Open Source / Managed. Best for: performance-critical workloads, Rust lovers.
Pros:
- Fast (Rust)
- Rich filtering
- Good memory efficiency
- OSS
Cons:
- Smaller ecosystem
- Fewer integrations
- Newer
Milvus
Open Source / Managed. Best for: massive scale, enterprise.
Pros:
- Massive scale
- GPU support
- Mature
- Enterprise features
Cons:
- Complex setup
- Heavy resource usage
- Steep learning curve
Our Recommendation
Starting out? Use Pinecone — zero ops, fast setup, great docs.
Need control? Use Weaviate or Qdrant self-hosted.
Already on Postgres? Start with pgvector, migrate later if needed.
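If you do start on pgvector, retrieval is plain SQL. Here's a minimal sketch, assuming a chunks table with an embedding vector column and the pgvector Python helper; the table, column, and connection details are illustrative:
# Minimal pgvector retrieval sketch (assumed schema: chunks(id, content, embedding vector))
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")
register_vector(conn)  # lets psycopg2 send numpy arrays as vectors

cur = conn.cursor()
cur.execute(
    "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT 5",  # <=> is cosine distance
    (query_embedding,),  # query_embedding: numpy array from your embedding model
)
top_chunks = [row[0] for row in cur.fetchall()]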
Chunking Strategies
Chunking is where most RAG systems fail. The wrong chunk size = bad retrieval = hallucinations.
Fixed-Size Chunking
Simple, predictable. Good baseline. May split sentences mid-thought.
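A minimal fixed-size baseline with LangChain's CharacterTextSplitter; the settings here are illustrative, not prescriptive:
# Fixed-size baseline: split on paragraph breaks, cap chunks at ~1000 characters
from langchain.text_splitter import CharacterTextSplitter

fixed_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = fixed_splitter.split_text(document_text)  # document_text: your raw document string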
Semantic Chunking
Preserves meaning boundaries. More compute. Better retrieval quality.
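A sketch of semantic chunking with LangChain's experimental SemanticChunker (assumes the langchain_experimental package; the threshold choice is an assumption):
# Semantic chunking: split where embedding similarity between adjacent
# sentences drops, rather than at a fixed character count
from langchain_experimental.text_splitter import SemanticChunker
from langchain.embeddings import OpenAIEmbeddings

semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"),
    breakpoint_threshold_type="percentile",  # split at outlier drops in similarity
)
semantic_chunks = semantic_splitter.split_text(document_text)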
Recursive Chunking
Respects document structure. Best for complex docs with hierarchy.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Production-tested settings
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # Sweet spot for most use cases
chunk_overlap=200, # 20% overlap prevents context loss
length_function=len,
separators=[
"\n\n", # Paragraphs first
"\n", # Then lines
". ", # Then sentences
" ", # Then words
"" # Finally characters
],
is_separator_regex=False,
)
# For code: use language-aware splitter
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
code_splitter = RecursiveCharacterTextSplitter.from_language(
language=Language.PYTHON,
chunk_size=1500,
chunk_overlap=200,
)
Embedding Models
| Model | Provider | Dims | Price | MTEB | Best For |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | $0.13/1M tokens | 64.6 | General purpose, easy integration |
| voyage-large-2 | Voyage AI | 1536 | $0.12/1M tokens | 68.3 | Best quality, code & legal docs |
| embed-english-v3.0 | Cohere | 1024 | $0.10/1M tokens | 64.5 | Cost-effective, compression |
| bge-large-en-v1.5 | BAAI (Open Source) | 1024 | Free (self-host) | 64.2 | Privacy, no API costs |
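If you take the self-hosted route in the last row, sentence-transformers is the usual way to run it; a minimal sketch using the model name from the table:
# Self-hosted embeddings with sentence-transformers (no API costs)
from sentence_transformers import SentenceTransformer

embed_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = embed_model.encode(
    ["What is our refund policy?"],
    normalize_embeddings=True,  # unit-length vectors for cosine similarity
)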
Retrieval Optimization
Hybrid Search (Vector + Keyword)
Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid search combines both.
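The sparse half needs its own encoder. Here's a minimal sketch using the pinecone-text BM25 encoder to produce the embeddings queried below; the package choice and corpus variable are assumptions (SPLADE is an alternative):
# Build sparse (BM25) and dense embeddings for hybrid search
from pinecone_text.sparse import BM25Encoder

bm25 = BM25Encoder()
bm25.fit(corpus_texts)  # corpus_texts: list of your chunk strings

sparse_embedding = bm25.encode_queries("What is our refund policy?")
dense_embedding = embeddings.embed_query("What is our refund policy?")  # embeddings from the pipeline above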
# Pinecone hybrid search
from pinecone import Pinecone
pc = Pinecone(api_key="...")
index = pc.Index("my-index")
# Query with both vector and sparse (BM25) embeddings
results = index.query(
vector=dense_embedding, # From OpenAI/Cohere
sparse_vector=sparse_embedding, # BM25/SPLADE
top_k=10,
include_metadata=True,
filter={"category": "support"}, # Metadata filtering
)
# Weighted combination (alpha = vector weight)
# alpha=1.0 = pure vector, alpha=0.0 = pure keyword
# alpha=0.7 is a good starting point
Reranking
Retrieve more candidates (top-50), then use a cross-encoder to rerank to top-5. Dramatically improves precision.
import cohere
co = cohere.Client("...")
# First: retrieve top 50 with vector search
candidates = vectorstore.similarity_search(query, k=50)
# Then: rerank to top 5
reranked = co.rerank(
model="rerank-english-v3.0",
query=query,
documents=[doc.page_content for doc in candidates],
top_n=5,
)
# Use reranked results
final_docs = [candidates[r.index] for r in reranked.results]
Evaluation Metrics
Retrieval Metrics
- Recall@k: % of relevant docs in top-k
- MRR: Mean Reciprocal Rank of first relevant doc
- NDCG: Normalized Discounted Cumulative Gain
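Recall@k and MRR are simple to compute over an eval set; a minimal sketch, assuming each query has a ranked list of retrieved doc ids and a set of relevant doc ids:
# Retrieval metric helpers (assumed shapes: retrieved = ranked list of ids,
# relevant = set of ids judged relevant for the query)
def recall_at_k(retrieved, relevant, k=5):
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / max(len(relevant), 1)

def mrr(retrieved, relevant):
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0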
Generation Metrics
- Faithfulness: Does answer match retrieved context?
- Relevance: Does answer address the question?
- Groundedness: Are claims supported by sources?
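One lightweight way to check faithfulness is an LLM-as-judge pass over the retrieved context. A minimal sketch reusing the llm from the pipeline above; the prompt wording and YES/NO protocol are assumptions:
# Rough faithfulness check: ask the model whether every claim in the
# answer is supported by the retrieved context
FAITHFULNESS_PROMPT = (
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Is every claim in the answer supported by the context? Reply YES or NO."
)

def is_faithful(answer, source_documents):
    context = "\n\n".join(doc.page_content for doc in source_documents)
    verdict = llm.predict(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    return verdict.strip().upper().startswith("YES")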
Common Failure Modes
Chunks too small
Lost context, fragmented information
No metadata filtering
Retrieving irrelevant docs from wrong categories
Stale embeddings
Docs updated but embeddings not refreshed (see the sketch below)
Wrong embedding model
Using general model for domain-specific content
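A cheap guard against the stale-embeddings failure above is to store a content hash in each vector's metadata and re-embed only when it changes; a minimal sketch (the metadata field name is an assumption):
# Skip re-embedding unchanged chunks by comparing a stored content hash
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembedding(chunk_text, stored_metadata):
    return stored_metadata.get("content_hash") != content_hash(chunk_text)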
Production Checklist
Before Going Live
- Vector DB sized for 2x expected load
- Embedding API rate limits handled
- Caching layer for frequent queries (see the sketch after this checklist)
- Fallback for vector DB downtime
- Eval set with 100+ query-answer pairs
- Recall@5 > 90% on eval set
- Hallucination detection in place
- User feedback collection enabled
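For the caching item above, even a small in-process cache in front of the chain cuts cost and latency on repeated questions; a minimal sketch (in production you would likely back this with Redis):
# Tiny TTL cache keyed on the normalized query (illustrative; swap for Redis in production)
import hashlib
import time

_cache = {}

def cached_answer(query, ttl_seconds=3600):
    key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
    entry = _cache.get(key)
    if entry and time.time() - entry["ts"] < ttl_seconds:
        return entry["result"]
    result = qa_chain({"query": query})  # qa_chain from the pipeline above
    _cache[key] = {"result": result, "ts": time.time()}
    return result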
Cite This Page
Use these citation formats for academic papers, articles, and documentation.
@article{bhatia2026building,
author = {Randeep Bhatia},
title = {Building RAG Systems That Actually Work in Production},
journal = {Randeep Bhatia},
year = {2026},
month = {january},
url = {https://randeepbhatia.com/guides/rag-production},
note = {Accessed: 2026-01-13}
}