Ultimate Guide · AI Architecture · 30 min read

Building RAG Systems That Actually Work in Production

From prototype to 100M+ documents. Vector database comparison, chunking strategies, retrieval optimization, and evaluation metrics. The definitive guide to production RAG.

5 vector DBs compared · 95%+ retrieval accuracy · <200ms P99 latency · 100M+ documents at scale

Overview

Retrieval-Augmented Generation (RAG) is the most practical way to give LLMs access to your private data. But most RAG systems fail in production because they're built for demos, not scale.

This guide covers:

  • Vector database selection — 5 options compared with real benchmarks
  • Chunking strategies — semantic vs fixed vs hybrid approaches
  • Retrieval optimization — hybrid search, reranking, query expansion
  • Evaluation framework — metrics that actually matter
  • Production checklist — what to do before going live

RAG Architecture

A production RAG system has 5 core components:

Document Ingestion → Chunking → Embedding → Vector Store → Retrieval
Basic RAG Pipeline
# Simplified RAG pipeline (current split packages: langchain-openai, langchain-pinecone)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_pinecone import PineconeVectorStore
from langchain.chains import RetrievalQA

# 1. Initialize components
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectorstore = PineconeVectorStore.from_existing_index("my-index", embeddings)
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)

# 2. Create retrieval chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" packs all retrieved chunks into one prompt
    retriever=vectorstore.as_retriever(
        search_type="mmr",  # Maximal Marginal Relevance
        search_kwargs={"k": 5, "fetch_k": 20}
    ),
    return_source_documents=True,
)

# 3. Query
result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])

Vector Database Comparison

Choosing the right vector database is critical. Here's our comparison based on real production deployments:

Pinecone

Managed
Pricing: $70/month starter, usage-based · Latency: <50ms P99 · Scale: 1B+ vectors

Best for: Fast time-to-production, serverless

✓ Pros
  • Zero ops
  • Excellent docs
  • Serverless scaling
  • Hybrid search
✗ Cons
  • Vendor lock-in
  • Expensive at scale
  • Limited filtering

Weaviate

Open Source / Managed
Pricing: Free OSS, $25/month managed · Latency: <100ms P99 · Scale: 100M+ vectors (self-hosted)

Best for: Flexibility, self-hosted control

✓ Pros
  • GraphQL API
  • Multi-modal
  • Great filtering
  • OSS option
✗ Cons
  • More ops overhead
  • Learning curve
  • Resource hungry

pgvector

PostgreSQL Extension
Pricing: Free (infra cost only) · Latency: <200ms P99 · Scale: ~10M vectors practical limit

Best for: Existing Postgres, small-medium scale

✓ Pros
  • Use existing Postgres
  • ACID transactions
  • SQL familiar
  • Free
✗ Cons
  • Scale limits
  • Manual optimization
  • No native hybrid search
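
If you are evaluating pgvector, the query path is plain SQL. Below is a minimal sketch, assuming Postgres with the vector extension enabled and a psycopg client; the docs table schema and query_embedding are placeholders you would supply yourself.

pgvector Query Sketch
import psycopg

# Assumes: CREATE EXTENSION vector; and a table
#   docs(id bigserial PRIMARY KEY, content text, embedding vector(1536))
conn = psycopg.connect("postgresql://localhost/mydb")
with conn.cursor() as cur:
    # <=> is cosine distance (<-> is L2, <#> is negative inner product)
    cur.execute(
        "SELECT content FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (str(query_embedding),),  # pgvector parses '[0.1, 0.2, ...]' text input
    )
    top_docs = [row[0] for row in cur.fetchall()]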

Qdrant

Open Source / Managed
Pricing: Free OSS, $25/month cloud · Latency: <30ms P99 · Scale: 100M+ vectors

Best for: Performance-critical, Rust lovers

✓ Pros
  • Fast (Rust)
  • Rich filtering
  • Good memory efficiency
  • OSS
✗ Cons
  • Smaller ecosystem
  • Fewer integrations
  • Newer
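
The "rich filtering" point above maps to Qdrant's payload filters. A minimal sketch with the official qdrant-client; the collection name, payload field, and query_embedding are assumptions for illustration.

Qdrant Filtered Search Sketch
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
hits = client.search(
    collection_name="docs",
    query_vector=query_embedding,  # placeholder: dense vector from your embedder
    query_filter=Filter(
        must=[FieldCondition(key="category", match=MatchValue(value="support"))]
    ),
    limit=5,
)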

Milvus

Open Source / Managed
Pricing: Free OSS, Zilliz Cloud managed · Latency: <100ms P99 · Scale: 10B+ vectors

Best for: Massive scale, enterprise

✓ Pros
  • Massive scale
  • GPU support
  • Mature
  • Enterprise features
✗ Cons
  • Complex setup
  • Heavy resource usage
  • Steep learning curve

Our Recommendation

Starting out? Use Pinecone — zero ops, fast setup, great docs.
Need control? Use Weaviate or Qdrant self-hosted.
Already on Postgres? Start with pgvector, migrate later if needed.

Chunking Strategies

Chunking is where most RAG systems fail. The wrong chunk size = bad retrieval = hallucinations.

Fixed-Size Chunking

chunk_size=512, overlap=50

Simple, predictable. Good baseline. May split sentences mid-thought.

Best for: Structured docs, tables

Semantic Chunking

Split by embedding similarity

Preserves meaning boundaries. More compute. Better retrieval quality.

Best for: Long-form content

Recursive Chunking

Hierarchical: doc → section → para

Respects document structure. Handles complex, hierarchical docs well.

Best for: Technical docs, manuals
Production Chunking Config
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Production-tested settings
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,          # Sweet spot for most use cases
    chunk_overlap=200,        # 20% overlap prevents context loss
    length_function=len,
    separators=[
        "\n\n",             # Paragraphs first
        "\n",                # Then lines
        ". ",                 # Then sentences
        " ",                  # Then words
        ""                    # Finally characters
    ],
    is_separator_regex=False,
)

# For code: use a language-aware splitter
from langchain.text_splitter import Language

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1500,
    chunk_overlap=200,
)
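
The config above covers the fixed-size and recursive strategies. Semantic chunking has no single canonical implementation; the usual approach is to embed adjacent sentences and start a new chunk where similarity drops. A minimal sketch, where embed is a placeholder for any sentence-embedding call and the 0.75 threshold is an assumption to tune:

Semantic Chunking Sketch
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75):
    # Merge consecutive sentences; start a new chunk when the cosine
    # similarity between adjacent sentence embeddings falls below threshold.
    if not sentences:
        return []
    vecs = [np.asarray(embed(s)) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = (prev @ vec) / (np.linalg.norm(prev) * np.linalg.norm(vec))
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks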

Embedding Models

Model                    Provider             Dims   Price              MTEB   Best For
text-embedding-3-large   OpenAI               3072   $0.13/1M tokens    64.6   General purpose, easy integration
voyage-large-2           Voyage AI            1536   $0.12/1M tokens    68.3   Best quality, code & legal docs
embed-english-v3.0       Cohere               1024   $0.10/1M tokens    64.5   Cost-effective, compression
bge-large-en-v1.5        BAAI (Open Source)   1024   Free (self-host)   64.2   Privacy, no API costs
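
The last row is the only one without an API behind it. A minimal self-hosting sketch, assuming the sentence-transformers package is installed:

Self-Hosted Embeddings Sketch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# normalize_embeddings=True yields unit vectors, so dot product = cosine similarity
vectors = model.encode(
    ["What is our refund policy?"],
    normalize_embeddings=True,
)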

Retrieval Optimization

Hybrid Search (Vector + Keyword)

Pure vector search misses exact matches. Pure keyword search misses semantic similarity. Hybrid search combines both.

Hybrid Search Implementation
# Pinecone hybrid search
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("my-index")

# Query with both vector and sparse (BM25) embeddings
results = index.query(
    vector=dense_embedding,      # From OpenAI/Cohere
    sparse_vector=sparse_embedding,  # BM25/SPLADE
    top_k=10,
    include_metadata=True,
    filter={"category": "support"},  # Metadata filtering
)

# Weighted combination (alpha = vector weight)
# alpha=1.0 = pure vector, alpha=0.0 = pure keyword
# alpha=0.7 is a good starting point
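
The alpha weighting in those comments is applied client-side: the query itself takes no weight parameter, so you scale the dense and sparse vectors before sending them. A minimal sketch of the convention used in Pinecone's own hybrid-search examples:

Alpha Weighting Sketch
def hybrid_scale(dense, sparse, alpha):
    # alpha=1.0 -> pure dense (vector); alpha=0.0 -> pure sparse (keyword)
    scaled_sparse = {
        "indices": sparse["indices"],
        "values": [v * (1 - alpha) for v in sparse["values"]],
    }
    scaled_dense = [v * alpha for v in dense]
    return scaled_dense, scaled_sparse

dense_q, sparse_q = hybrid_scale(dense_embedding, sparse_embedding, alpha=0.7)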

Reranking

Retrieve more candidates (top-50), then use a cross-encoder to rerank to top-5. Dramatically improves precision.

Cohere Reranker
import cohere

co = cohere.Client("...")

# First: retrieve top 50 with vector search
candidates = vectorstore.similarity_search(query, k=50)

# Then: rerank to top 5
reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=[doc.page_content for doc in candidates],
    top_n=5,
)

# Use reranked results
final_docs = [candidates[r.index] for r in reranked.results]

Evaluation Metrics

Retrieval Metrics

  • Recall@k: % of relevant docs in top-k
  • MRR: Mean Reciprocal Rank of first relevant doc
  • NDCG: Normalized Discounted Cumulative Gain
Target: Recall@5 > 90%, MRR > 0.7
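
These are simple enough to compute by hand on a labeled eval set. A minimal sketch, where retrieved is a ranked list of doc IDs from your retriever and relevant is the labeled set for that query:

Retrieval Metrics Sketch
def recall_at_k(relevant, retrieved, k=5):
    # Fraction of labeled-relevant docs that appear in the top-k results
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def reciprocal_rank(relevant, retrieved):
    # 1/rank of the first relevant doc; average over queries to get MRR
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0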

Generation Metrics

  • Faithfulness: Does answer match retrieved context?
  • Relevance: Does answer address the question?
  • Groundedness: Are claims supported by sources?
Use RAGAS or DeepEval for automated eval
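
A minimal RAGAS sketch, assuming the ragas 0.1.x evaluate API (which has since evolved), an OpenAI key configured for the judge model, and placeholder rows:

RAGAS Eval Sketch
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is our refund policy?"],
    "answer": ["Refunds are available within 30 days of purchase."],
    "contexts": [["Customers may request a full refund within 30 days."]],
})
scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)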

Common Failure Modes

Chunks too small

Lost context, fragmented information

✓ Fix: Increase chunk size to 1000+ tokens with 20% overlap

No metadata filtering

Retrieving irrelevant docs from wrong categories

✓ Fix: Add category/date/source metadata, filter at query time

Stale embeddings

Docs updated but embeddings not refreshed

✓ Fix: Implement incremental re-indexing pipeline
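
A minimal sketch of the incremental idea: hash each document's content and re-embed only when the hash changes. Here hash_store stands in for any persistent key-value store, and embed_and_upsert for your embedding-plus-vector-DB write path:

Incremental Re-Indexing Sketch
import hashlib

def reindex_if_changed(doc_id, text, hash_store, embed_and_upsert):
    # Skip unchanged docs; re-embed and upsert only when the content hash differs
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if hash_store.get(doc_id) == digest:
        return False
    embed_and_upsert(doc_id, text)
    hash_store[doc_id] = digest
    return True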

Wrong embedding model

Using general model for domain-specific content

✓ Fix: Fine-tune embeddings or use domain-specific model

Production Checklist

Before Going Live

Infrastructure
  • Vector DB sized for 2x expected load
  • Embedding API rate limits handled
  • Caching layer for frequent queries (see the sketch after this checklist)
  • Fallback for vector DB downtime
Quality
  • Eval set with 100+ query-answer pairs
  • Recall@5 > 90% on eval set
  • Hallucination detection in place
  • User feedback collection enabled
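
For the caching item above, a minimal sketch keyed on a hash of the normalized query, assuming a local Redis; answer_fn is a placeholder for your RAG pipeline call:

Query Cache Sketch
import hashlib
import json

import redis

r = redis.Redis()

def cached_answer(query, answer_fn, ttl_seconds=3600):
    # Serve repeated questions from cache; entries expire after ttl_seconds
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = answer_fn(query)
    r.setex(key, ttl_seconds, json.dumps(result))
    return result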
