Ultimate Guide · AI Architecture · 30 min read

Building RAG Systems That Actually Work in Production

From prototype to 100M+ documents. Vector database comparison, chunking strategies, retrieval optimization, and evaluation metrics. The definitive guide to production RAG.

5 vector DBs compared · 95%+ retrieval accuracy · <200ms P99 latency · 100M+ documents at scale

Overview

Retrieval-Augmented Generation (RAG) is the most practical way to give LLMs access to your private data. But most RAG systems fail in production because they're built for demos, not scale.

This guide covers:

  • Vector database selection — 5 options compared with real benchmarks
  • Chunking strategies — semantic vs fixed vs hybrid approaches
  • Retrieval optimization — hybrid search, reranking, query expansion
  • Evaluation framework — metrics that actually matter
  • Production checklist — what to do before going live

RAG Architecture

A production RAG system has 5 core components:

Document Ingestion → Chunking → Embedding → Vector Store → Retrieval
Basic RAG Pipeline
# Simplified RAG pipeline (current split packages: langchain-openai, langchain-pinecone)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA

# 1. Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = PineconeVectorStore.from_existing_index("my-index", embeddings)
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

# 2. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" packs all retrieved chunks into one prompt
    retriever=vectorstore.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance
        search_kwargs={"k": 5, "fetch_k": 20}
    ),
    return_source_documents=True,
)

# 3. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])

Vector Database Comparison

Choosing the right vector database is critical. Here's our comparison based on real production deployments:

Pinecone

Managed
Pricing: $70/month starter, usage-based · Latency: <50ms P99 · Scale: 1B+ vectors

Best for: Fast time-to-production, serverless

✓ Pros
  • Zero ops
  • Excellent docs
  • Serverless scaling
  • Hybrid search
✗ Cons
  • Vendor lock-in
  • Expensive at scale
  • Limited filtering

Weaviate

Open Source / Managed
Pricing: Free OSS, $25/month managed · Latency: <100ms P99 · Scale: 100M+ vectors (self-hosted)

Best for: Flexibility, self-hosted control

✓ Pros
  • GraphQL API
  • Multi-modal
  • Great filtering
  • OSS option
✗ Cons
  • More ops overhead
  • Learning curve
  • Resource hungry

pgvector

PostgreSQL Extension
Pricing: Free (infra cost only) · Latency: <200ms P99 · Scale: ~10M vectors practical limit

Best for: Existing Postgres, small-medium scale

✓ Pros
  • Use existing Postgres
  • ACID transactions
  • SQL familiar
  • Free
✗ Cons
  • Scale limits
  • Manual optimization
  • No native hybrid search
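
If you are evaluating pgvector, the query path is plain SQL. Below is a minimal sketch, assuming Postgres with the vector extension enabled and a psycopg client; the docs table schema and query_embedding are placeholders you would supply yourself.

pgvector Query Sketch
import psycopg

# Assumes: CREATE EXTENSION vector; and a table
#   docs(id bigserial PRIMARY KEY, content text, embedding vector(1536))
conn = psycopg.connect("postgresql://localhost/mydb")
with conn.cursor() as cur:
    # <=> is cosine distance (<-> is L2, <#> is negative inner product)
    cur.execute(
        "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (str(query_embedding),),  # pgvector parses '[0.1, 0.2, ...]' text input
    )
    top_docs = [row[0] for row in cur.fetchall()]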

Qdrant

Open Source / Managed
Pricing: Free OSS, $25/month cloud · Latency: <30ms P99 · Scale: 100M+ vectors

Best for: Performance-critical, Rust lovers

✓ Pros
  • Fast (Rust)
  • Rich filtering
  • Good memory efficiency
  • OSS
✗ Cons
  • Smaller ecosystem
  • Fewer integrations
  • Newer
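
The "rich filtering" point above maps to Qdrant's payload filters. A minimal sketch with the official qdrant-client; the collection name, payload field, and query_embedding are assumptions for illustration.

Qdrant Filtered Search Sketch
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="docs",
    query_vector=query_embedding,  # placeholder: dense vector from your embedder
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="support"))]
    ),
    limit=5,
)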

Milvus

Open Source / Managed
Pricing: Free OSS, Zilliz Cloud managed · Latency: <100ms P99 · Scale: 10B+ vectors

Best for: Massive scale, enterprise

✓ Pros
  • Massive scale
  • GPU support
  • Mature
  • Enterprise features
✗ Cons
  • Complex setup
  • Heavy resource usage
  • Steep learning curve

Our Recommendation

Starting out? Use Pinecone — zero ops, fast setup, great docs.
Need control? Use Weaviate or Qdrant self-hosted.
Already on Postgres? Start with pgvector, migrate later if needed.

Chunking Strategies

Chunking is where most RAG systems fail. The wrong chunk size = bad retrieval = hallucinations.

Fixed-Size Chunking

chunk_size=512, overlap=50

Simple, predictable. Good baseline. May split sentences mid-thought.

Best for: Structured docs, tables

Semantic Chunking

Split by embedding similarity

Preserves meaning boundaries. More compute. Better retrieval quality.

Best for: Long-form content

Recursive Chunking

Hierarchical: doc → section → para

Respects document structure. Handles complex, hierarchical docs well.

Best for: Technical docs, manuals
Production Chunking Config
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Production-tested settings
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # Sweet spot for most use cases
    chunk_overlap=200,        # 20% overlap prevents context loss
    length_function=len,
    separators=[
        "\n\n",             # Paragraphs first
        "\n",                # Then lines
        ". ",                 # Then sentences
        " ",                  # Then words
        ""                    # Finally characters
    ],
    is_separator_regex=False,
)

# For code: use a language-aware splitter
from langchain.text_splitter import Language

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=200,
)
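
The config above covers the fixed-size and recursive strategies. Semantic chunking has no single canonical implementation; the usual approach is to embed adjacent sentences and start a new chunk where similarity drops. A minimal sketch, where embed is a placeholder for any sentence-embedding call and the 0.75 threshold is an assumption to tune:

Semantic Chunking Sketch
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75):
    # Merge consecutive sentences; start a new chunk when the cosine
    # similarity between adjacent sentence embeddings falls below threshold.
    if not sentences:
        return []
    vecs = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = (prev @ vec) / (np.linalg.norm(prev) * np.linalg.norm(vec))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks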

Embedding Models

Model                    Provider             Dims   Price              MTEB   Best For
text-embedding-3-large   OpenAI               3072   $0.13/1M tokens    64.6   General purpose, easy integration
voyage-large-2           Voyage AI            1536   $0.12/1M tokens    68.3   Best quality, code & legal docs
embed-english-v3.0       Cohere               1024   $0.10/1M tokens    64.5   Cost-effective, compression
bge-large-en-v1.5        BAAI (Open Source)   1024   Free (self-host)   64.2   Privacy, no API costs
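
The last row is the only one without an API behind it. A minimal self-hosting sketch, assuming the sentence-transformers package is installed:

Self-Hosted Embeddings Sketch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# normalize_embeddings=True yields unit vectors, so dot product = cosine similarity
vectors = model.encode(
    ["What is our refund policy?"],
    normalize_embeddings=True,
)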

Retrieval Optimization

Hybrid Search (Vector + Keyword)

Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid search combines both.

Hybrid Search Implementation
# Pinecone hybrid search
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

# Query with both vector and sparse (BM25) embeddings
results = index.query(
    vector=dense_embedding,      # From OpenAI/Cohere
    sparse_vector=sparse_embedding,  # BM25/SPLADE
    top_k=10,
    include_metadata=True,
    filter={"category": "support"},  # Metadata filtering
)

# Weighted combination (alpha = vector weight)
# alpha=1.0 = pure vector, alpha=0.0 = pure keyword
# alpha=0.7 is a good starting point
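
The alpha weighting in those comments is applied client-side: the query itself takes no weight parameter, so you scale the dense and sparse vectors before sending them. A minimal sketch of the convention used in Pinecone's own hybrid-search examples:

Alpha Weighting Sketch
def hybrid_scale(dense, sparse, alpha):
    # alpha=1.0 -> pure dense (vector); alpha=0.0 -> pure sparse (keyword)
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(dense_embedding, sparse_embedding, alpha=0.7)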

Reranking

Retrieve more candidates (top-50), then use a cross-encoder to rerank to top-5. Dramatically improves precision.

Cohere Reranker
import cohere

co = cohere.Client("...")

# First: retrieve top 50 with vector search
candidates = vectorstore.similarity_search(query, k=50)

# Then: rerank to top 5
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5,
)

# Use reranked results
final_docs = [candidates[r.index] for r in reranked.results]

Evaluation Metrics

Retrieval Metrics

  • Recall@k: % of relevant docs in top-k
  • MRR: Mean Reciprocal Rank of first relevant doc
  • NDCG: Normalized Discounted Cumulative Gain
Target: Recall@5 > 90%, MRR > 0.7
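
These are simple enough to compute by hand on a labeled eval set. A minimal sketch, where retrieved is a ranked list of doc IDs from your retriever and relevant is the labeled set for that query:

Retrieval Metrics Sketch
def recall_at_k(relevant, retrieved, k=5):
    # Fraction of labeled-relevant docs that appear in the top-k results
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def reciprocal_rank(relevant, retrieved):
    # 1/rank of the first relevant doc; average over queries to get MRR
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0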

Generation Metrics

  • Faithfulness: Does answer match retrieved context?
  • Relevance: Does answer address the question?
  • Groundedness: Are claims supported by sources?
Use RAGAS or DeepEval for automated eval
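
A minimal RAGAS sketch, assuming the ragas 0.1.x evaluate API (which has since evolved), an OpenAI key configured for the judge model, and placeholder rows:

RAGAS Eval Sketch
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is our refund policy?"],
    "answer": ["Refunds are available within 30 days of purchase."],
    "contexts": [["Customers may request a full refund within 30 days."]],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)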

Common Failure Modes

Chunks too small

Lost context, fragmented information

✓ Fix: Increase chunk size to 1000+ tokens with 20% overlap

No metadata filtering

Retrieving irrelevant docs from wrong categories

✓ Fix: Add category/date/source metadata, filter at query time

Stale embeddings

Docs updated but embeddings not refreshed

✓ Fix: Implement incremental re-indexing pipeline
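
A minimal sketch of the incremental idea: hash each document's content and re-embed only when the hash changes. Here hash_store stands in for any persistent key-value store, and embed_and_upsert for your embedding-plus-vector-DB write path:

Incremental Re-Indexing Sketch
import hashlib

def reindex_if_changed(doc_id, text, hash_store, embed_and_upsert):
    # Skip unchanged docs; re-embed and upsert only when the content hash differs
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if hash_store.get(doc_id) == digest:
        return False
    embed_and_upsert(doc_id, text)
    hash_store[doc_id] = digest
    return True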

Wrong embedding model

Using general model for domain-specific content

✓ Fix: Fine-tune embeddings or use domain-specific model

Production Checklist

Before Going Live

Infrastructure
  • Vector DB sized for 2x expected load
  • Embedding API rate limits handled
  • Caching layer for frequent queries (see the sketch after this checklist)
  • Fallback for vector DB downtime
Quality
  • Eval set with 100+ query-answer pairs
  • Recall@5 > 90% on eval set
  • Hallucination detection in place
  • User feedback collection enabled
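
For the caching item above, a minimal sketch keyed on a hash of the normalized query, assuming a local Redis; answer_fn is a placeholder for your RAG pipeline call:

Query Cache Sketch
import hashlib
import json

import redis

r = redis.Redis()

def cached_answer(query, answer_fn, ttl_seconds=3600):
    # Serve repeated questions from cache; entries expire after ttl_seconds
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = answer_fn(query)
    r.setex(key, ttl_seconds, json.dumps(result))
    return result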
