RAG Architecture Patterns
Build production RAG systems. Complete implementation guide with real code, architecture patterns, and lessons from processing 100k+ documents/day.
What is RAG?
Retrieval Augmented Generation explained in plain English.
RAG solves a fundamental problem: LLMs don't know your data.
Traditional LLMs are trained on public internet data up to a cutoff date. They can't:
- ✗ Access your private documents
- ✗ Know about events after training
- ✗ Reference your specific data
- ✗ Cite sources accurately
RAG Solution:
1. Retrieve relevant documents from your data
2. Augment the prompt with that context
3. Generate response based on your data
Production RAG Architectures
Simple RAG
Basic retrieval pipeline for MVPs and prototypes
The foundational RAG pattern: embed query → search vectors → augment prompt → generate. Simple to implement, works for straightforward Q&A on small document sets. No fancy techniques, just the basics.
Pros:
- Easy to implement and understand
- Low latency (200-500ms)
- Works well for FAQs and simple docs
- Minimal infrastructure needs
Cons:
- Limited accuracy (~70-75%)
- Struggles with complex queries
- No context awareness
- Breaks on edge cases
Advanced RAG
Production-ready with reranking and hybrid search
Adds critical production features: query expansion, hybrid search (vector + keyword), cross-encoder reranking, and metadata filtering. This is what 90% of production RAG should use. Proven to work at scale.
Pros:
- High accuracy (85-92%)
- Handles complex queries well
- Hybrid search catches edge cases
- Scales to 100k+ documents
Cons:
- More complex to implement
- Higher latency (500-1500ms)
- Requires more infrastructure
- Costs 2-3x simple RAG
Agentic RAG
Multi-step reasoning with tool use and self-correction
LLM acts as an agent: plans retrieval strategy, uses multiple tools, reasons over results, self-corrects. Can handle complex multi-hop queries like "Compare X to Y based on Z criteria." Highest quality but slowest and most expensive.
Pros:
- Highest accuracy (92-98%)
- Handles complex reasoning
- Self-correcting on errors
- Multi-document synthesis
Cons:
- Very high latency (2-10s)
- Expensive ($0.50-2.00/query)
- Complex to build and debug
- Needs careful prompt engineering
Iterative RAG
Multiple retrieval rounds for deep research
Performs multiple retrieval-generation cycles. First pass identifies knowledge gaps, subsequent passes fill them in. Great for research tasks requiring synthesis of multiple sources. Used by Perplexity and similar tools.
Pros:
- Comprehensive answers
- Finds connections across docs
- Good for research tasks
- Handles ambiguity well
Cons:
- High latency (3-8s)
- Expensive due to multiple LLM calls
- Can hallucinate connections
- Hard to control depth
Vector Database Comparison
Choosing the right vector DB is critical. Here's how they compare in production.
| Database | Type | Best For | Performance | Cost | Verdict |
|---|---|---|---|---|---|
| Pinecone | Managed service | Production apps that need reliability | Excellent | $0.096/1M vectors/month | Best for startups and mid-size companies. Just works. |
| Qdrant | Self-hosted/managed | High-performance filtering | Excellent | Free (self-hosted) | Best self-hosted option. Great filtering capabilities. |
| Weaviate | Self-hosted/managed | Multi-modal search | Very good | Free (self-hosted) | Good for images + text. GraphQL interface. |
| Milvus | Self-hosted | Massive scale (billions of vectors) | Excellent | Free (self-hosted) | Best for enterprise scale. Complex to operate. |
| Chroma | Self-hosted/embedded | Prototyping and small projects | Good | Free (self-hosted) | Great for development. Not for production scale. |
| pgvector | Postgres extension | Already using Postgres | Good | Free (with Postgres) | Simplest option if you have Postgres. Limited scale. |
Embedding Model Selection
Your embedding model determines retrieval quality. Choose wisely.
text-embedding-3-large
⭐ Top Pick. OpenAI's latest and best. Highest quality embeddings for RAG. Works across domains.
text-embedding-3-small
More affordable OpenAI model. Good quality, 6.5x cheaper. Great for high-volume applications.
Cohere Embed v3
Excellent multilingual support. Compression-aware. Good for international applications.
BGE-large-en-v1.5
Best open-source model. 90% of commercial quality. Great if you can self-host.
all-MiniLM-L6-v2
Fast and lightweight. Good for prototyping. Not production-quality.
Voyage AI
Domain-specific fine-tuning available. Best-in-class for code and legal documents.
Advanced RAG Techniques
These techniques can 2-5x your RAG accuracy. Production-tested.
Query Expansion
Generate multiple variants of the user query before searching. E.g., "How do I refund?" becomes ["refund process", "return policy", "get money back"]. Catches different phrasings of the same intent.
When to use:
User queries are short or ambiguous. E.g., customer support, product search.
Hypothetical Document Embeddings (HyDE)
Instead of embedding the query, have the LLM generate a hypothetical answer, then embed that. More similar to actual documents. Works surprisingly well.
When to use:
Documents are technical or formal. Query style differs from document style.
Cross-Encoder Reranking
After vector search, rerank results with a cross-encoder model that scores query-document pairs. Much more accurate than just vector similarity.
When to use:
Always. This is the highest ROI technique. Add it to every production RAG.
Parent Document Retrieval
Embed small chunks for precise search, but return larger parent documents for context. Best of both worlds: precision + context.
When to use:
Long documents where context matters. Legal, technical documentation.
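A minimal sketch of the pattern, assuming each chunk was indexed with a parentId in its metadata and that fetchParentDocument is your own document-store lookup (both are assumptions, not a fixed API):
import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone();

// Search small chunks for precision, then return their full parent documents.
async function parentDocumentRetrieval(queryEmbedding: number[], topK = 5) {
  const chunkMatches = await pc.index('docs').query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
  });

  // Deduplicate the parent IDs referenced by the matching chunks
  const parentIds = [
    ...new Set(chunkMatches.matches.map((m) => String(m.metadata?.parentId))),
  ];

  // Return the larger parent documents so the LLM gets full context
  // (fetchParentDocument is your own document-store lookup)
  return Promise.all(parentIds.map((id) => fetchParentDocument(id)));
}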
Semantic Caching
Cache responses for semantically similar queries. If someone asks "How do I return items?" and you have "What's your return policy?", reuse that response.
When to use:
High query volume with repeated intents. Customer support, FAQs.
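A minimal in-memory sketch, assuming an embedQuery call from your embedding client and a cosine-similarity threshold around 0.95; production systems usually back this with Redis or a dedicated cache:
type CacheEntry = { embedding: number[]; answer: string };
const cache: CacheEntry[] = [];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Reuse a previous answer if a semantically similar query was already served.
async function cachedAnswer(query: string, threshold = 0.95) {
  const embedding = await embedQuery(query); // your embedding call
  const hit = cache.find((e) => cosine(e.embedding, embedding) >= threshold);
  if (hit) return hit.answer;

  const answer = await answerWithRAG(query); // your full RAG pipeline
  cache.push({ embedding, answer });
  return answer;
}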
Metadata Filtering
Add metadata (date, category, access level) to chunks. Filter before or after vector search. Critical for multi-tenant and time-sensitive data.
When to use:
Multi-tenant apps, time-sensitive data, access control requirements.
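A sketch of pre-filtered vector search using Pinecone-style metadata filters, reusing the pc client and query embedding from the sketch above; tenantId and publishedAt are illustrative fields, not a schema you already have:
// Filter by tenant and recency before scoring vectors, so results
// never leak across tenants or return stale documents.
// (currentTenantId and cutoffTimestamp come from your request context.)
const filtered = await pc.index('docs').query({
  vector: queryEmbedding,
  topK: 10,
  includeMetadata: true,
  filter: {
    tenantId: { $eq: currentTenantId },
    publishedAt: { $gte: cutoffTimestamp },
  },
});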
Production Code Examples
Simple retrieval → generation pipeline
import { Pinecone } from '@pinecone-database/pinecone';
import { OpenAIEmbeddings, OpenAI } from '@langchain/openai';

// Clients read OPENAI_API_KEY / PINECONE_API_KEY from the environment
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const embeddings = new OpenAIEmbeddings();

async function simpleRAG(question: string) {
  // 1. Embed the query
  const queryEmbedding = await embeddings.embedQuery(question);

  // 2. Search vectors
  const index = pinecone.index('docs');
  const searchResults = await index.query({
    vector: queryEmbedding,
    topK: 3,
    includeMetadata: true,
  });

  // 3. Build context from the retrieved chunks
  const context = searchResults.matches
    .map((m) => m.metadata?.text)
    .join('\n\n');

  // 4. Generate an answer grounded in the context
  const llm = new OpenAI({ temperature: 0 });
  const prompt = `Context: ${context}\n\nQuestion: ${question}\n\nAnswer:`;
  return await llm.invoke(prompt);
}
Hybrid search + cross-encoder reranking for production
import { Pinecone } from '@pinecone-database/pinecone';
import Anthropic from '@anthropic-ai/sdk';
import { CohereClient } from 'cohere-ai';

const pinecone = new Pinecone(); // used inside vectorSearch()
const anthropic = new Anthropic();
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY! });

async function advancedRAG(question: string) {
  // 1. Query expansion
  const expandedQueries = await expandQuery(question);

  // 2. Hybrid search (vector + keyword); vectorSearch, keywordSearch and
  //    mergeResults are your own retrieval helpers
  const vectorResults = await Promise.all(
    expandedQueries.map((q) => vectorSearch(q))
  );
  const keywordResults = await keywordSearch(question);
  const combined = mergeResults(vectorResults.flat(), keywordResults);

  // 3. Rerank with Cohere's cross-encoder
  const reranked = await cohere.rerank({
    query: question,
    documents: combined.map((r) => r.text),
    topN: 5,
    model: 'rerank-english-v3.0',
  });

  // 4. Build context from the top reranked results
  const context = reranked.results
    .map((r) => combined[r.index].text)
    .join('\n\n---\n\n');

  // 5. Generate with Claude
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer based on the context above. Cite sources.`,
    }],
  });
  const answer = message.content[0];
  return answer.type === 'text' ? answer.text : '';
}

async function expandQuery(query: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `Generate 3 alternative phrasings of: "${query}"`,
    }],
  });
  // Parse the variants out of the reply (parseVariants is your own helper)
  const block = response.content[0];
  return parseVariants(block.type === 'text' ? block.text : '');
}
Smart document chunking for better retrieval
interface Chunk {
  text: string;
  metadata: {
    source: string;
    page?: number;
    chunkIndex: number;
    totalChunks: number;
  };
}

function chunkDocument(
  text: string,
  source: string,
  chunkSize: number = 1000,
  overlapSize: number = 200
): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  let chunkIndex = 0;

  while (start < text.length) {
    // Extract chunk
    const end = Math.min(start + chunkSize, text.length);
    let chunkText = text.slice(start, end);

    // Try to end at a sentence boundary
    if (end < text.length) {
      const lastPeriod = chunkText.lastIndexOf('.');
      if (lastPeriod > chunkSize * 0.7) {
        chunkText = chunkText.slice(0, lastPeriod + 1);
      }
    }

    chunks.push({
      text: chunkText.trim(),
      metadata: {
        source,
        chunkIndex,
        totalChunks: 0, // Will update later
      },
    });

    // Stop once the end of the document has been consumed
    if (end >= text.length) break;

    // Move start position forward (with overlap), always advancing at least
    // one character so short tail chunks can't cause an infinite loop
    start += Math.max(chunkText.length - overlapSize, 1);
    chunkIndex++;
  }

  // Update total chunks
  chunks.forEach((chunk) => {
    chunk.metadata.totalChunks = chunks.length;
  });

  return chunks;
}

// Usage (loadDocument and embedAndStore are your own ingestion helpers)
const document = await loadDocument('policy.pdf');
const chunks = chunkDocument(document.text, 'policy.pdf', 1000, 200);
await embedAndStore(chunks);
Generate hypothetical answer before retrieval
import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();
const pinecone = new Pinecone();

async function hydeSearch(question: string) {
  // 1. Generate a hypothetical answer
  const hypotheticalDoc = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{
      role: 'user',
      content: `Write a detailed answer to: ${question}`,
    }],
    temperature: 0.7,
    max_tokens: 300,
  });
  const hydeText = hypotheticalDoc.choices[0].message.content ?? question;

  // 2. Embed the hypothetical answer instead of the raw question
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: hydeText,
  });

  // 3. Search with the hypothetical-answer embedding
  const index = pinecone.index('documents');
  const results = await index.query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  // 4. Return the actual documents
  return results.matches.map((m) => ({
    text: m.metadata?.text,
    score: m.score,
    source: m.metadata?.source,
  }));
}

// HyDE works because the hypothetical answer is more similar
// to actual documents than the question itself
const results = await hydeSearch('How does photosynthesis work?');
Chunking Strategies That Work
How you chunk your data is 50% of RAG success. Here's what works in production.
Fixed-Size Chunking
Split into equal-sized chunks with overlap. Simple and reliable. This is what 80% of production RAG uses.
Best For:
General documents, articles, documentation. Works for most use cases.
Avoid When:
Code, tables, or structured data where boundaries matter.
Semantic Chunking
Use NLP to split at semantic boundaries (topic changes, sections). Preserves meaning but more complex.
Best For:
Long-form content, books, research papers where topics are distinct.
Avoid When:
Short documents, FAQs, or when you need consistent chunk sizes.
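A rough sketch of the idea, reusing the cosine helper from the semantic-caching sketch above: embed each sentence and start a new chunk wherever similarity to the previous sentence drops. The 0.75 threshold and embedSentences helper are illustrative assumptions.
// Split where the topic shifts: adjacent sentences that are no longer
// similar enough start a new chunk.
async function semanticChunk(sentences: string[], threshold = 0.75) {
  const embeddings = await embedSentences(sentences); // your batch embedding call
  const chunks: string[] = [];
  let current: string[] = [sentences[0]];

  for (let i = 1; i < sentences.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.join(' '));
      current = [];
    }
    current.push(sentences[i]);
  }
  chunks.push(current.join(' '));
  return chunks;
}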
Sentence Window
Embed single sentences but return surrounding context. Great for precision while maintaining context.
Best For:
Q&A systems where exact sentences matter. Customer support.
Avoid When:
Long-form generation or when storage cost is a concern.
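A sketch of the retrieval side, assuming each indexed sentence carries its source and sentenceIndex in metadata, and that getSentences is a hypothetical helper returning a document's sentences in order:
// Embed single sentences for precision, but return a window of
// neighbors around each hit so the LLM still gets context.
async function sentenceWindowRetrieve(queryEmbedding: number[], windowSize = 2) {
  const hits = await pc.index('sentences').query({
    vector: queryEmbedding,
    topK: 5,
    includeMetadata: true,
  });

  return Promise.all(
    hits.matches.map(async (m) => {
      const i = Number(m.metadata?.sentenceIndex);
      // getSentences: hypothetical helper returning the document's sentences in order
      const sentences = await getSentences(String(m.metadata?.source));
      // Return the matched sentence plus windowSize neighbors on each side
      return sentences
        .slice(Math.max(0, i - windowSize), i + windowSize + 1)
        .join(' ');
    })
  );
}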
Recursive Chunking
Split by structure (markdown headers, code blocks, etc.) recursively until target size. Preserves document structure.
Best For:
Technical docs, code repositories, structured content with clear hierarchy.
Avoid When:
Unstructured text, natural language documents.
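A hand-rolled sketch for markdown: split on headers first, then fall back to the fixed-size chunkDocument function above for any section that is still too large. (LangChain's RecursiveCharacterTextSplitter implements a more complete version of this idea.)
// Split by structure first (markdown headers), then by size.
function recursiveChunk(markdown: string, source: string, maxSize = 1000) {
  // Split at level-1 and level-2 headers, keeping each header with its section
  const sections = markdown.split(/\n(?=#{1,2} )/);

  return sections.flatMap((section) =>
    section.length <= maxSize
      ? [{ text: section.trim(), metadata: { source, chunkIndex: 0, totalChunks: 1 } }]
      : chunkDocument(section, source, maxSize, 100) // fall back to fixed-size chunking
  );
}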
Performance Optimization Guide
Make your RAG system faster and cheaper without sacrificing quality.
Embedding Cache
Cache query embeddings by hash. If someone asks the same question twice, reuse the embedding. Simple but effective.
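A minimal sketch, hashing the normalized query and keeping embeddings in an in-memory Map (swap in Redis or similar for production); it reuses the embeddings client from the simple RAG example:
import { createHash } from 'node:crypto';

const embeddingCache = new Map<string, number[]>();

// Return a cached embedding for repeated queries instead of re-calling the API.
async function cachedEmbedding(query: string): Promise<number[]> {
  const key = createHash('sha256')
    .update(query.trim().toLowerCase())
    .digest('hex');

  const cached = embeddingCache.get(key);
  if (cached) return cached;

  const embedding = await embeddings.embedQuery(query); // OpenAIEmbeddings client from above
  embeddingCache.set(key, embedding);
  return embedding;
}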
Approximate Nearest Neighbor (ANN) Tuning
Tune your vector DB's ANN parameters to trade a little recall for a lot of speed. For HNSW-based indexes (Qdrant, Weaviate, pgvector), tune the "ef"/"ef_search" and "M" parameters; for IVF-based indexes, tune "nprobe". Managed services like Pinecone handle this internally, so there focus on "topK" and payload size instead.
Batch Embedding Generation
Embed multiple documents at once instead of one at a time. Most embedding APIs support batching with huge speedups.
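A sketch using the OpenAI embeddings endpoint, which accepts an array of inputs per request; the batch size of 100 is an arbitrary choice, and the openai client is the one from the HyDE example:
// Embed documents in batches instead of one API call per chunk.
async function batchEmbed(texts: string[], batchSize = 100): Promise<number[][]> {
  const vectors: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-large',
      input: batch, // the API accepts an array of strings per request
    });
    vectors.push(...response.data.map((d) => d.embedding));
  }

  return vectors;
}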
Streaming Responses
Stream the LLM response while it's generating. Users see output immediately. Feels 2-3x faster even though total time is the same.
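A sketch with the Anthropic SDK's streaming mode, reusing the anthropic client from the advanced RAG example and forwarding text deltas as they arrive; the onToken callback just stands in for wherever your UI consumes tokens:
// Stream tokens to the user as they're generated instead of waiting
// for the full completion.
async function streamAnswer(prompt: string, onToken: (t: string) => void) {
  const stream = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    stream: true,
    messages: [{ role: 'user', content: prompt }],
  });

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      onToken(event.delta.text); // push each text delta to the caller immediately
    }
  }
}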
Parallel Retrieval
Query multiple vector DBs or indices in parallel. Aggregate results. Critical for multi-collection RAG.
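A sketch querying several indices concurrently and merging the hits by score, reusing the pc client from earlier; the index names are illustrative:
// Query multiple collections at once and merge by similarity score.
async function parallelRetrieve(queryEmbedding: number[]) {
  const indices = ['product-docs', 'support-tickets', 'changelog']; // illustrative names

  const resultSets = await Promise.all(
    indices.map((name) =>
      pc.index(name).query({
        vector: queryEmbedding,
        topK: 5,
        includeMetadata: true,
      })
    )
  );

  // Flatten and sort the combined matches, keeping the best overall
  return resultSets
    .flatMap((r) => r.matches)
    .sort((a, b) => (b.score ?? 0) - (a.score ?? 0))
    .slice(0, 10);
}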
Common Pitfalls & How to Avoid Them
Learn from mistakes so you don't have to make them yourself.
Chunks too small → Loss of context
If chunks are 100-200 tokens, they lack context for the LLM to generate good answers.
Solution:
Use 500-1000 token chunks. Or use "parent document retrieval" to retrieve small but return large.
No reranking → Poor relevance
Vector similarity alone is noisy. You get semantically similar but contextually irrelevant results.
Solution:
Always add a reranking step with a cross-encoder (Cohere, Jina AI). 20-40% accuracy boost.
Ignoring metadata → Generic results
Without metadata filtering, you mix old/new docs, different sources, access levels.
Solution:
Add metadata (date, source, category, access level) and filter during search.
Single retrieval pass → Missed information
Complex queries need multiple retrieval steps. One pass misses context or related information.
Solution:
Use iterative retrieval or query decomposition. Break complex queries into sub-queries.
Not handling "no results" → Poor UX
If vector search returns low-confidence results, RAG will hallucinate or give bad answers.
Solution:
Set a confidence threshold (e.g., 0.7). If below, tell user "I don't have information on that."
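A sketch of the guard, with 0.7 as the example threshold from above (tune it against your own embedding model and data); embeddings and pc are the clients from earlier, and generateAnswer stands in for your generation step:
const MIN_SCORE = 0.7;

// Refuse to answer instead of generating from weak matches.
async function answerOrDecline(question: string) {
  const queryEmbedding = await embeddings.embedQuery(question);
  const results = await pc.index('docs').query({
    vector: queryEmbedding,
    topK: 3,
    includeMetadata: true,
  });

  const bestScore = results.matches[0]?.score ?? 0;
  if (bestScore < MIN_SCORE) {
    return "I don't have information on that.";
  }
  return generateAnswer(question, results.matches); // your generation step
}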
Embedding stale data → Outdated answers
If you don't update embeddings when documents change, RAG works on old information.
Solution:
Implement incremental updates. Track document versions. Re-embed on changes.
Ignoring chunk overlap → Lost context at boundaries
Without overlap, important information spanning two chunks gets split.
Solution:
Use 10-20% overlap. If chunk size is 1000 tokens, overlap 100-200 tokens.
Wrong embedding model → Poor retrieval
Using a small or domain-mismatched embedding model tanks retrieval quality.
Solution:
Use text-embedding-3-large for production. Test on your specific domain.
Production Best Practices
Start simple, iterate based on data
Begin with basic RAG. Measure accuracy on real queries. Add complexity (reranking, HyDE, etc.) only where needed.
Monitor retrieval quality separately
Track retrieval accuracy independent of generation. Use metrics like MRR, NDCG. Most RAG failures are retrieval failures.
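A small sketch of Mean Reciprocal Rank over a labeled test set, assuming each test case stores the ID of the document a correct retrieval should surface (the shape of the test cases and the retrieve function are assumptions):
interface RetrievalTestCase {
  query: string;
  relevantDocId: string; // the document a correct retrieval should surface
}

// MRR = average of 1 / rank of the first relevant result (0 if never found).
async function meanReciprocalRank(testCases: RetrievalTestCase[]): Promise<number> {
  let total = 0;

  for (const tc of testCases) {
    const results = await retrieve(tc.query); // your retrieval pipeline, ordered by score
    const rank = results.findIndex((r) => r.id === tc.relevantDocId);
    total += rank === -1 ? 0 : 1 / (rank + 1);
  }

  return total / testCases.length;
}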
Version your embeddings
When you change chunking, embedding models, or preprocessing, create a new index version. Enables A/B testing and rollback.
Implement feedback loops
Collect user feedback (helpful/not helpful). Use it to improve retrieval, fine-tune embeddings, or adjust chunking.
Test on edge cases
Build a test set of hard queries: ambiguous, multi-hop, negations, temporal. These reveal weaknesses.
Budget for costs early
RAG is expensive: embeddings, vector DB, LLM calls. A 100k-query/day app costs $1k-5k/month. Plan accordingly.
Building a Production RAG System?
I've built RAG pipelines processing 100,000+ documents per day for Fortune 500 companies. Let me design yours.
Real example: Built a RAG system for insurance claims that processes 50k documents/day with 95%+ accuracy, saving $2M annually in manual review costs.