
RAG Architecture Patterns

Build production RAG systems. Complete implementation guide with real code, architecture patterns, and lessons from processing 100k+ documents/day.


What is RAG?

Retrieval Augmented Generation explained in plain English.

RAG solves a fundamental problem: LLMs don't know your data.

Traditional LLMs are trained on public internet data up to a cutoff date. They can't:

  • Access your private documents
  • Know about events after training
  • Reference your specific data
  • Cite sources accurately

RAG Solution:

1. Retrieve relevant documents from your data
2. Augment the prompt with that context
3. Generate response based on your data

Production RAG Architectures


Simple RAG

Basic retrieval pipeline for MVPs and prototypes

The foundational RAG pattern: embed query → search vectors → augment prompt → generate. Simple to implement, works for straightforward Q&A on small document sets. No fancy techniques, just the basics.

Pros
  • Easy to implement and understand
  • Low latency (200-500ms)
  • Works well for FAQs and simple docs
  • Minimal infrastructure needs
Cons
  • Limited accuracy (~70-75%)
  • Struggles with complex queries
  • No context awareness
  • Breaks on edge cases
Accuracy: 70-75%
Latency: 200-500ms
Cost: $0.01-0.03/query
Scale: <10k docs

Advanced RAG

Production-ready with reranking and hybrid search

Adds critical production features: query expansion, hybrid search (vector + keyword), cross-encoder reranking, and metadata filtering. This is what 90% of production RAG systems should use. Proven to work at scale.

Pros
  • High accuracy (85-92%)
  • Handles complex queries well
  • Hybrid search catches edge cases
  • Scales to 100k+ documents
Cons
  • More complex to implement
  • Higher latency (500-1500ms)
  • Requires more infrastructure
  • Costs 2-3x simple RAG
Accuracy: 85-92%
Latency: 500-1500ms
Cost: $0.05-0.15/query
Scale: 10k-500k docs

Agentic RAG

Multi-step reasoning with tool use and self-correction

LLM acts as an agent: plans retrieval strategy, uses multiple tools, reasons over results, self-corrects. Can handle complex multi-hop queries like "Compare X to Y based on Z criteria." Highest quality but slowest and most expensive.

Pros
  • Highest accuracy (92-98%)
  • Handles complex reasoning
  • Self-correcting on errors
  • Multi-document synthesis
Cons
  • Very high latency (2-10s)
  • Expensive ($0.50-2.00/query)
  • Complex to build and debug
  • Needs careful prompt engineering
Accuracy: 92-98%
Latency: 2-10s
Cost: $0.50-2.00/query
Scale: 100k-10M+ docs

Iterative RAG

Multiple retrieval rounds for deep research

Performs multiple retrieval-generation cycles. First pass identifies knowledge gaps, subsequent passes fill them in. Great for research tasks requiring synthesis of multiple sources. Used by Perplexity and similar tools.

Pros
  • Comprehensive answers
  • Finds connections across docs
  • Good for research tasks
  • Handles ambiguity well
Cons
  • High latency (3-8s)
  • Expensive due to multiple LLM calls
  • Can hallucinate connections
  • Hard to control depth
Accuracy: 88-94%
Latency: 3-8s
Cost: $0.30-1.00/query
Scale: 50k-1M docs

Vector Database Comparison

Choosing the right vector DB is critical. Here's how they compare in production.

Pinecone (Managed Service)
Best For: Production apps that need reliability
Performance: Excellent
Cost: $0.096/1M vectors/month
Verdict: Best for startups and mid-size companies. Just works.

Qdrant (Self-hosted/Managed)
Best For: High-performance filtering
Performance: Excellent
Cost: Free (self-hosted)
Verdict: Best self-hosted option. Great filtering capabilities.

Weaviate (Self-hosted/Managed)
Best For: Multi-modal search
Performance: Very Good
Cost: Free (self-hosted)
Verdict: Good for images + text. GraphQL interface.

Milvus (Self-hosted)
Best For: Massive scale (billions of vectors)
Performance: Excellent
Cost: Free (self-hosted)
Verdict: Best for enterprise scale. Complex to operate.

Chroma (Self-hosted/Embedded)
Best For: Prototyping and small projects
Performance: Good
Cost: Free (self-hosted)
Verdict: Great for development. Not for production scale.

pgvector (Postgres Extension)
Best For: Already using Postgres
Performance: Good
Cost: Free (with Postgres)
Verdict: Simplest option if you have Postgres. Limited scale.

Embedding Model Selection

Your embedding model determines retrieval quality. Choose wisely.

text-embedding-3-large

⭐ Top Pick
Dimensions: 3,072
Context: 8,191 tokens
Cost: $0.13/1M tokens

OpenAI's latest and best. Highest quality embeddings for RAG. Works across domains.

Best For: Production RAG for any domain

text-embedding-3-small

Dimensions: 1,536
Context: 8,191 tokens
Cost: $0.02/1M tokens

More affordable OpenAI model. Good quality, 6.5x cheaper. Great for high-volume applications.

Best For: High-volume, cost-sensitive apps

Cohere Embed v3

Dimensions: 1,024
Context: 512 tokens
Cost: $0.10/1M tokens

Excellent multilingual support. Compression-aware. Good for international applications.

Best For: Multilingual RAG systems

BGE-large-en-v1.5

Dimensions: 1,024
Context: 512 tokens
Cost: Free (self-hosted)

Best open-source model. 90% of commercial quality. Great if you can self-host.

Best For: Self-hosted, privacy-sensitive

all-MiniLM-L6-v2

Dimensions: 384
Context: 256 tokens
Cost: Free (self-hosted)

Fast and lightweight. Good for prototyping. Not production-quality.

Best For: Prototyping, development

Voyage AI

Dimensions: 1,024
Context: 16,000 tokens
Cost: $0.12/1M tokens

Domain-specific fine-tuning available. Best-in-class for code and legal documents.

Best For: Specialized domains (code, legal)

Advanced RAG Techniques

These production-tested techniques can each add double-digit percentage gains in accuracy, recall, or cost efficiency.

Query Expansion

Generate multiple variants of the user query before searching. E.g., "How do I refund?" becomes ["refund process", "return policy", "get money back"]. Catches different phrasings of the same intent.

Impact: +15-25% recall
Complexity: Low

When to use:

User queries are short or ambiguous. E.g., customer support, product search.

Hypothetical Document Embeddings (HyDE)

Instead of embedding the query, have the LLM generate a hypothetical answer, then embed that. More similar to actual documents. Works surprisingly well.

Impact: +10-20% accuracy
Complexity: Medium

When to use:

Documents are technical or formal. Query style differs from document style.

Cross-Encoder Reranking

After vector search, rerank results with a cross-encoder model that scores query-document pairs. Much more accurate than just vector similarity.

Impact: +20-40% accuracy
Complexity: Low

When to use:

Always. This is the highest ROI technique. Add it to every production RAG.

Parent Document Retrieval

Embed small chunks for precise search, but return larger parent documents for context. Best of both worlds: precision + context.

Impact: +15-30% quality
Complexity: Medium

When to use:

Long documents where context matters. Legal, technical documentation.
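
A minimal sketch of the pattern, under the assumption that each small chunk was stored with a parentId in its metadata; searchChildChunks and the parent lookup map stand in for your own vector search and document store:

interface ChildMatch {
  parentId: string; // metadata pointing back to the full parent document
  score: number;
}

async function parentDocumentRetrieval(
  searchChildChunks: (query: string, topK: number) => Promise<ChildMatch[]>,
  parentStore: Map<string, string>, // parentId -> full parent document text
  query: string,
  maxParents: number = 3
): Promise<{ parentId: string; score: number; text: string }[]> {
  // 1. Search the small, precise child chunks
  const matches = await searchChildChunks(query, 20);

  // 2. Keep the best child score per parent document
  const bestScore = new Map<string, number>();
  for (const m of matches) {
    bestScore.set(m.parentId, Math.max(bestScore.get(m.parentId) ?? -Infinity, m.score));
  }

  // 3. Return the top-scoring parents as the context the LLM actually sees
  return [...bestScore.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, maxParents)
    .map(([parentId, score]) => ({ parentId, score, text: parentStore.get(parentId) ?? '' }));
}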

Semantic Caching

Cache responses for semantically similar queries. If someone asks "How do I return items?" and you have "What's your return policy?", reuse that response.

Impact: 50-80% cost savings
Complexity: Medium

When to use:

High query volume with repeated intents. Customer support, FAQs.
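
A minimal in-memory sketch of the idea: embed the incoming query, compare it to cached query embeddings with cosine similarity, and reuse the stored answer above a threshold. The 0.92 cutoff and the plain array are illustrative; production systems usually back this with Redis or a small vector index.

interface CacheEntry {
  embedding: number[];
  answer: string;
}

const semanticCache: CacheEntry[] = []; // in-memory for illustration only

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function cachedAnswer(
  query: string,
  embed: (text: string) => Promise<number[]>,
  generate: (query: string) => Promise<string>,
  threshold: number = 0.92
): Promise<string> {
  const queryEmbedding = await embed(query);

  // Reuse the answer from a semantically similar cached query
  for (const entry of semanticCache) {
    if (cosineSimilarity(queryEmbedding, entry.embedding) >= threshold) {
      return entry.answer;
    }
  }

  // Cache miss: run the full RAG pipeline and store the result
  const answer = await generate(query);
  semanticCache.push({ embedding: queryEmbedding, answer });
  return answer;
}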

Metadata Filtering

Add metadata (date, category, access level) to chunks. Filter before or after vector search. Critical for multi-tenant and time-sensitive data.

Impact: +30-50% precision
Complexity: Low

When to use:

Multi-tenant apps, time-sensitive data, access control requirements.
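
For example, Pinecone accepts a metadata filter alongside the query vector, so only matching chunks are considered. The tenantId, category, and publishedAt fields below are assumptions about your own metadata schema:

import { Pinecone } from '@pinecone-database/pinecone';

const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment
const index = pinecone.index('docs');

// Assumes chunks were upserted with metadata like { tenantId, category, publishedAt }
async function filteredSearch(queryEmbedding: number[], tenantId: string) {
  const oneYearAgo = Math.floor(Date.now() / 1000) - 365 * 24 * 60 * 60;
  return index.query({
    vector: queryEmbedding,
    topK: 10,
    includeMetadata: true,
    filter: {
      tenantId: { $eq: tenantId },          // multi-tenant isolation
      category: { $in: ['policy', 'faq'] }, // restrict to relevant document types
      publishedAt: { $gte: oneYearAgo },    // drop stale documents
    },
  });
}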

Production Code Examples

Basic RAG with Pinecone

Simple retrieval → generation pipeline

import { PineconeClient } from '@pinecone-database/pinecone';
import { OpenAIEmbeddings } from 'langchain/embeddings/openai';
import { OpenAI } from 'langchain/llms/openai';

const pinecone = new PineconeClient();
await pinecone.init({
  apiKey: process.env.PINECONE_API_KEY!,
  environment: process.env.PINECONE_ENVIRONMENT!,
});

const embeddings = new OpenAIEmbeddings();

async function simpleRAG(question: string) {
  // 1. Embed the query
  const queryEmbedding = await embeddings.embedQuery(question);

  // 2. Search vectors
  const index = pinecone.Index('docs');
  const searchResults = await index.query({
    vector: queryEmbedding,
    topK: 3,
    includeMetadata: true,
  });

  // 3. Build context
  const context = searchResults.matches
    .map((m) => m.metadata.text)
    .join('\n\n');

  // 4. Generate answer
  const llm = new OpenAI({ temperature: 0 });
  const prompt = `Context: ${context}\n\nQuestion: ${question}\n\nAnswer:`;
  
  return await llm.call(prompt);
}
Advanced RAG with Reranking

Hybrid search + cross-encoder reranking for production

import { Pinecone } from '@pinecone-database/pinecone';
import Anthropic from '@anthropic-ai/sdk';
import { CohereClient } from 'cohere-ai';

const pinecone = new Pinecone();
const anthropic = new Anthropic();
const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function advancedRAG(question: string) {
  // 1. Query expansion
  const expandedQueries = await expandQuery(question);
  
  // 2. Hybrid search (vector + keyword); vectorSearch, keywordSearch, and mergeResults are app-specific helpers (not shown)
  const vectorResults = await Promise.all(
    expandedQueries.map(q => vectorSearch(q))
  );
  const keywordResults = await keywordSearch(question);
  const combined = mergeResults(vectorResults.flat(), keywordResults);
  
  // 3. Rerank with Cohere
  const reranked = await cohere.rerank({
    query: question,
    documents: combined.map(r => r.text),
    topN: 5,
    model: 'rerank-english-v3.0',
  });
  
  // 4. Build context from top results
  const context = reranked.results
    .map(r => combined[r.index].text)
    .join('\n\n---\n\n');
  
  // 5. Generate with Claude
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer based on the context above. Cite sources.`
    }]
  });
  
  return message.content[0].text;
}

async function expandQuery(query: string): Promise<string[]> {
  const response = await anthropic.messages.create({
    model: 'claude-3-haiku-20240307',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `Generate 3 alternative phrasings of: "${query}"`
    }]
  });
  // Parse and return the variants (parseVariants is an app-specific helper, not shown)
  return parseVariants(response.content[0].text);
}
Chunking with Overlap

Smart document chunking for better retrieval

interface Chunk {
  text: string;
  metadata: {
    source: string;
    page?: number;
    chunkIndex: number;
    totalChunks: number;
  };
}

function chunkDocument(
  text: string,
  source: string,
  chunkSize: number = 1000,
  overlapSize: number = 200
): Chunk[] {
  const chunks: Chunk[] = [];
  let start = 0;
  let chunkIndex = 0;

  while (start < text.length) {
    // Extract chunk
    const end = Math.min(start + chunkSize, text.length);
    let chunkText = text.slice(start, end);

    // Try to end at sentence boundary
    if (end < text.length) {
      const lastPeriod = chunkText.lastIndexOf('.');
      if (lastPeriod > chunkSize * 0.7) {
        chunkText = chunkText.slice(0, lastPeriod + 1);
      }
    }

    chunks.push({
      text: chunkText.trim(),
      metadata: {
        source,
        chunkIndex,
        totalChunks: 0, // Will update later
      },
    });

    // Stop after the final chunk; otherwise advance with overlap,
    // always moving forward so a short tail chunk can't cause an infinite loop
    if (end >= text.length) break;
    start += Math.max(chunkText.length - overlapSize, 1);
    chunkIndex++;
  }

  // Update total chunks
  chunks.forEach(chunk => {
    chunk.metadata.totalChunks = chunks.length;
  });

  return chunks;
}

// Usage (loadDocument and embedAndStore are your own ingestion helpers)
const document = await loadDocument('policy.pdf');
const chunks = chunkDocument(document.text, 'policy.pdf', 1000, 200);
await embedAndStore(chunks);
HyDE (Hypothetical Document Embeddings)

Generate hypothetical answer before retrieval

import OpenAI from 'openai';
import { Pinecone } from '@pinecone-database/pinecone';

const openai = new OpenAI();
const pinecone = new Pinecone(); // reads PINECONE_API_KEY from the environment

async function hydeSearch(question: string) {
  // 1. Generate hypothetical answer
  const hypotheticalDoc = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{
      role: 'user',
      content: `Write a detailed answer to: ${question}`
    }],
    temperature: 0.7,
    max_tokens: 300,
  });

  const hydeText = hypotheticalDoc.choices[0].message.content ?? question; // fall back to the raw query if generation returns nothing

  // 2. Embed the hypothetical answer
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: hydeText,
  });

  // 3. Search with hypothetical embedding
  const index = pinecone.index('documents');
  const results = await index.query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true,
  });

  // 4. Return actual documents
  return results.matches.map(m => ({
    text: m.metadata.text,
    score: m.score,
    source: m.metadata.source,
  }));
}

// HyDE works because the hypothetical answer is more similar
// to actual documents than the question itself
const results = await hydeSearch("How does photosynthesis work?");

Chunking Strategies That Work

How you chunk your data is 50% of RAG success. Here's what works in production.

Fixed-Size Chunking

Split into equal-sized chunks with overlap. Simple and reliable. This is what 80% of production RAG uses.

Chunk Size: 500-1000 tokens
Overlap: 10-20%
Complexity: Low

Best For:

General documents, articles, documentation. Works for most use cases.

Avoid When:

Code, tables, or structured data where boundaries matter.

⭐ Recommended for most use cases

Semantic Chunking

Use NLP to split at semantic boundaries (topic changes, sections). Preserves meaning but more complex.

Chunk Size: Variable (200-1500 tokens)
Overlap: None (semantic boundaries)
Complexity: High

Best For:

Long-form content, books, research papers where topics are distinct.

Avoid When:

Short documents, FAQs, or when you need consistent chunk sizes.

Sentence Window

Embed single sentences but return surrounding context. Great for precision while maintaining context.

Chunk Size: 1 sentence embedded, 3-5 returned
Overlap: 100% (overlapping windows)
Complexity: Medium

Best For:

Q&A systems where exact sentences matter. Customer support.

Avoid When:

Long-form generation or when storage cost is a concern.
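
A simplified sketch of the windowing step, assuming sentences were stored with their position in the source document; the regex splitter is a stand-in for a proper sentence tokenizer:

interface SentenceRecord {
  text: string;
  docId: string;
  position: number; // index of the sentence within its document
}

// Naive splitter for illustration only
function splitSentences(text: string): string[] {
  return text.split(/(?<=[.!?])\s+/).filter(s => s.trim().length > 0);
}

// Embed and search individual sentences, but return the matched sentence
// together with its surrounding window as the context for the LLM
function expandToWindow(
  docSentences: SentenceRecord[], // all sentences of the matched document, in order
  match: SentenceRecord,
  windowSize: number = 2
): string {
  const start = Math.max(0, match.position - windowSize);
  const end = Math.min(docSentences.length, match.position + windowSize + 1);
  return docSentences.slice(start, end).map(s => s.text).join(' ');
}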

Recursive Chunking

Split by structure (markdown headers, code blocks, etc.) recursively until target size. Preserves document structure.

Chunk Size: Variable (respects structure)
Overlap: 10-15%
Complexity: Medium

Best For:

Technical docs, code repositories, structured content with clear hierarchy.

Avoid When:

Unstructured text, natural language documents.
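
A simplified recursive splitter for markdown-style documents: split on the largest structural boundary first, recurse into finer separators, and fall back to fixed-size slices when no structure is left. The 4000-character ceiling (roughly 1000 tokens) and the separator list are assumptions to adjust for your content:

// Simplified recursive splitter. Splits on top-level headers first, then
// finer separators, and finally falls back to fixed-size slicing.
function recursiveChunk(
  text: string,
  maxChars: number = 4000, // ~1000 tokens
  separators: RegExp[] = [/\n(?=# )/, /\n(?=## )/, /\n\n/]
): string[] {
  if (text.length <= maxChars) return [text.trim()].filter(Boolean);

  const [separator, ...rest] = separators;
  if (!separator) {
    // No structure left: fall back to plain fixed-size slices
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      slices.push(text.slice(i, i + maxChars).trim());
    }
    return slices;
  }

  // Split at the current structural boundary and recurse into each piece
  return text
    .split(separator)
    .flatMap(piece => recursiveChunk(piece, maxChars, rest))
    .filter(chunk => chunk.length > 0);
}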

Performance Optimization Guide

Make your RAG system faster and cheaper without sacrificing quality.

1. Embedding Cache

Cache query embeddings by hash. If someone asks the same question twice, reuse the embedding. Simple but effective.

Speed Gain: 50-200ms
Cost Savings: 30-60%
Implementation: Easy (Redis/Upstash)
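
A minimal sketch using Redis via ioredis (any key-value store works): hash the normalized query, reuse a stored embedding on a hit, otherwise compute it and cache it with a TTL.

import { createHash } from 'crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

const redis = new Redis(); // assumes a local Redis; swap for Upstash etc.
const openai = new OpenAI();

async function getQueryEmbedding(query: string): Promise<number[]> {
  const key = 'emb:' + createHash('sha256').update(query.trim().toLowerCase()).digest('hex');

  // 1. Return the cached embedding if we've seen this query before
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // 2. Otherwise compute it and cache it for 24 hours
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: query,
  });
  const embedding = response.data[0].embedding;
  await redis.set(key, JSON.stringify(embedding), 'EX', 60 * 60 * 24);
  return embedding;
}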

2. Approximate Nearest Neighbor (ANN) Tuning

Tune your vector DB's ANN parameters to trade a little accuracy for speed. For HNSW-based indexes (Qdrant, Weaviate, pgvector), tune the "ef" search parameter; for IVF-based indexes, tune "nprobe". Lowering "topK" and skipping unneeded metadata also cuts latency.

Speed Gain: 100-500ms
Cost Savings: 10-30%
Implementation: Medium (DB-specific)

3. Batch Embedding Generation

Embed multiple documents at once instead of one at a time. Most embedding APIs support batching with huge speedups.

Speed Gain: 5-10x on ingestion
Cost Savings: 20-40%
Implementation: Easy (API-level)
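
For example, the OpenAI embeddings endpoint accepts an array of inputs, so ingestion can send one request per batch instead of one per chunk; the batch size of 100 is an arbitrary starting point.

import OpenAI from 'openai';

const openai = new OpenAI();

async function embedInBatches(texts: string[], batchSize: number = 100): Promise<number[][]> {
  const embeddings: number[][] = [];

  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    // One API call embeds the whole batch instead of batch.length separate calls
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-large',
      input: batch,
    });
    embeddings.push(...response.data.map(d => d.embedding));
  }

  return embeddings;
}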

4. Streaming Responses

Stream the LLM response while it's generating. Users see output immediately. Feels 2-3x faster even though total time is the same.

Speed Gain: 0ms actual (feels 2-3x faster)
Cost Savings: 0%
Implementation: Easy (SSE/WebSocket)
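
A minimal sketch with the Anthropic SDK: passing stream: true returns an event stream whose text deltas you forward to the client as they arrive (process.stdout.write stands in for your SSE or WebSocket transport).

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function streamAnswer(context: string, question: string) {
  const stream = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    stream: true,
    messages: [{
      role: 'user',
      content: `Context:\n${context}\n\nQuestion: ${question}`,
    }],
  });

  // Forward text deltas as they arrive instead of waiting for the full response
  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      process.stdout.write(event.delta.text);
    }
  }
}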

5. Parallel Retrieval

Query multiple vector DBs or indices in parallel. Aggregate results. Critical for multi-collection RAG.

Speed Gain: 200-1000ms
Cost Savings: 0%
Implementation: Medium (async/await)
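
A small sketch: fire per-collection searches concurrently with Promise.all and merge by score. searchCollection is a placeholder for whatever query function each index exposes.

interface RetrievedChunk {
  text: string;
  score: number;
  collection: string;
}

async function parallelRetrieve(
  query: string,
  collections: string[],
  searchCollection: (collection: string, query: string) => Promise<RetrievedChunk[]>,
  topK: number = 10
): Promise<RetrievedChunk[]> {
  // Query every collection concurrently instead of one after another
  const resultsPerCollection = await Promise.all(
    collections.map(collection => searchCollection(collection, query))
  );

  // Merge and keep the globally best-scoring chunks
  return resultsPerCollection
    .flat()
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}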

Common Pitfalls & How to Avoid Them

Learn from mistakes so you don't have to make them yourself.

Chunks too small → Loss of context

If chunks are 100-200 tokens, they lack context for the LLM to generate good answers.

Solution:

Use 500-1000 token chunks. Or use "parent document retrieval" to retrieve small but return large.

No reranking → Poor relevance

Vector similarity alone is noisy. You get semantically similar but contextually irrelevant results.

Solution:

Always add a reranking step with a cross-encoder (Cohere, Jina AI). 20-40% accuracy boost.

Ignoring metadata → Generic results

Without metadata filtering, you mix old/new docs, different sources, access levels.

Solution:

Add metadata (date, source, category, access level) and filter during search.

Single retrieval pass → Missed information

Complex queries need multiple retrieval steps. One pass misses context or related information.

Solution:

Use iterative retrieval or query decomposition. Break complex queries into sub-queries.

Not handling "no results" → Poor UX

If vector search returns low-confidence results, RAG will hallucinate or give bad answers.

Solution:

Set a confidence threshold (e.g., 0.7). If below, tell user "I don't have information on that."
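
A minimal guard, assuming similarity scores roughly in the 0-1 range (as with cosine similarity); the 0.7 threshold is a starting point to tune on your own evaluation queries:

const MIN_SCORE = 0.7; // tune on your own evaluation set

function buildContextOrRefuse(matches: { score: number; text: string }[]): string | null {
  const confident = matches.filter(m => m.score >= MIN_SCORE);

  // If nothing clears the threshold, refuse instead of letting the LLM guess
  if (confident.length === 0) return null;
  return confident.map(m => m.text).join('\n\n');
}

// In the pipeline:
// const context = buildContextOrRefuse(searchResults);
// if (context === null) return "I don't have information on that.";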

Embedding stale data → Outdated answers

If you don't update embeddings when documents change, RAG works on old information.

Solution:

Implement incremental updates. Track document versions. Re-embed on changes.

Ignoring chunk overlap → Lost context at boundaries

Without overlap, important information spanning two chunks gets split.

Solution:

Use 10-20% overlap. If chunk size is 1000 tokens, overlap 100-200 tokens.

Wrong embedding model → Poor retrieval

Using a small or domain-mismatched embedding model tanks retrieval quality.

Solution:

Use text-embedding-3-large for production. Test on your specific domain.

Production Best Practices

1. Start simple, iterate based on data

Begin with basic RAG. Measure accuracy on real queries. Add complexity (reranking, HyDE, etc.) only where needed.

2. Monitor retrieval quality separately

Track retrieval accuracy independent of generation. Use metrics like MRR, NDCG. Most RAG failures are retrieval failures.

3. Version your embeddings

When you change chunking, embedding models, or preprocessing, create a new index version. Enables A/B testing and rollback.

4. Implement feedback loops

Collect user feedback (helpful/not helpful). Use it to improve retrieval, fine-tune embeddings, or adjust chunking.

5. Test on edge cases

Build a test set of hard queries: ambiguous, multi-hop, negations, temporal. These reveal weaknesses.

6. Budget for costs early

RAG is expensive: embeddings, vector DB, LLM calls. A 100k-query/day app costs $1k-5k/month. Plan accordingly.

Building a Production RAG System?

I've built RAG pipelines processing 100,000+ documents per day for Fortune 500 companies. Let me design yours.

Real example: Built a RAG system for insurance claims that processes 50k documents/day with 95%+ accuracy, saving $2M annually in manual review costs.