RAG: The Bridge Between Static Models and Dynamic Knowledge
Retrieval-Augmented Generation represents the most significant architectural pattern in production LLM systems today, enabling models to access current, domain-specific information without the astronomical costs of fine-tuning. At its core, RAG solves a fundamental limitation: LLMs are frozen in time at their training cutoff, yet most valuable applications require up-to-date, proprietary, or specialized knowledge.
Key Insight
RAG Is Not Just 'Search Plus LLM'—It's a Complete Information Architecture
The most common misconception about RAG is treating it as simple keyword search bolted onto a language model. In reality, effective RAG requires careful orchestration of embedding models, vector similarity algorithms, re-ranking strategies, and context assembly—each component significantly impacting final output quality.
The RAG Pipeline: From Query to Answer
User Query → Query Embedding → Vector Search → Retrieved Chunks
91%
of enterprise AI applications use some form of RAG
This dominance reflects RAG's unique ability to combine the reasoning capabilities of large language models with access to proprietary, current, and verifiable information.
RAG vs. Fine-Tuning: Choosing Your Knowledge Integration Strategy
Retrieval-Augmented Generation
Knowledge updates instantly by modifying the document store—...
Provides source attribution and citations, enabling fact-che...
Scales to billions of documents with proper indexing, limite...
Costs $0.01-0.10 per query at scale, primarily embedding and...
Fine-Tuning
Knowledge baked into model weights—updates require full retr...
No inherent source attribution; model 'just knows' without e...
Limited by model's parameter count; cannot easily scale to a...
High upfront cost ($1,000-$100,000+) but lower per-query inf...
Notion
Building AI Search Across 100 Million Documents
Achieved 89% relevance scores in user studies, reduced average query latency fro...
Framework
The RAG Quality Triangle
Retrieval Precision
How relevant are the retrieved documents to the query? Low precision means the LLM receives noisy, i...
Retrieval Recall
Are you finding ALL the relevant documents? Low recall means missing critical information that could...
System Latency
How fast can you retrieve and generate? Users expect sub-second responses for conversational AI. Eve...
Cost Efficiency
What's the total cost per query including embeddings, vector search, re-ranking, and LLM tokens? At ...
The 80/20 Rule of RAG Failures
Approximately 80% of RAG system failures trace back to retrieval problems, not generation problems. Before optimizing your prompts or switching LLM providers, instrument your retrieval pipeline to measure what's actually being retrieved.
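A lightweight way to start instrumenting is to log every query together with the document IDs and similarity scores it returned, then review the misses by hand. The sketch below is a minimal illustration; the search callable and its (doc_id, score) return format are assumptions rather than part of this chapter's pipeline.
Minimal Retrieval Logging (Python)
import json
import time
from typing import Callable, List, Tuple

def instrumented_retrieve(search: Callable[[str, int], List[Tuple[str, float]]],
                          query: str, k: int = 5,
                          log_path: str = "retrieval_log.jsonl") -> List[Tuple[str, float]]:
    """Wrap any retriever and append (query, doc IDs, scores, latency) to a JSONL log."""
    start = time.time()
    results = search(query, k)  # assumed to return [(doc_id, score), ...]
    record = {
        "query": query,
        "doc_ids": [doc_id for doc_id, _ in results],
        "scores": [round(score, 4) for _, score in results],
        "latency_ms": round((time.time() - start) * 1000, 1),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return results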
Key Insight
Embedding Models Are Not Created Equal—Selection Matters Enormously
The choice of embedding model can swing your retrieval accuracy by 20-40 percentage points, yet many teams default to whatever's most convenient without evaluation. OpenAI's text-embedding-3-large currently leads most benchmarks with 3072 dimensions, but Cohere's embed-v3 often outperforms it on multilingual content, and open-source options like BGE-large-en-v1.5 achieve roughly 95% of the performance at zero API cost.
Basic RAG Pipeline Implementation (Python)
from openai import OpenAI
import numpy as np
from typing import List, Tuple

client = OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate embedding for a text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
Anti-Pattern: The 'Embed Everything Once' Trap
❌ Problem
Retrieval quality silently degrades over time as new documents use different emb...
✓ Solution
Build your embedding pipeline with re-indexing as a first-class operation. Store...
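One concrete way to treat re-indexing as a first-class operation is to record which embedding model produced each vector, so stale entries can be found and re-embedded after a model change. A minimal sketch using Chroma-style metadata; the field names are illustrative, not prescribed by this chapter.
Tagging Vectors with Their Embedding Model (Python)
from datetime import datetime, timezone

EMBEDDING_MODEL = "text-embedding-3-small"  # bump this constant when you change models

def index_chunk(collection, chunk_id: str, text: str, embedding: list[float]) -> None:
    # Tag every vector with the model and timestamp that produced it, so a later
    # migration can query for outdated entries and re-embed only those.
    collection.add(
        ids=[chunk_id],
        documents=[text],
        embeddings=[embedding],
        metadatas=[{
            "embedding_model": EMBEDDING_MODEL,
            "indexed_at": datetime.now(timezone.utc).isoformat(),
        }],
    )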
RAG System Foundation Checklist
Key Insight
Vector Databases Are Infrastructure, Not Magic—Choose Based on Your Scale
The vector database market has exploded with options—Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector—each with different tradeoffs that matter enormously at scale. For prototypes under 100K vectors, even a simple NumPy array with brute-force search works fine; don't over-engineer early.
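At that prototype scale, exact search is little more than a matrix multiplication. A minimal sketch, assuming your embeddings are already L2-normalized so the dot product equals cosine similarity:
Brute-Force Vector Search with NumPy (Python)
import numpy as np

def brute_force_search(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Exact nearest-neighbor search over a (num_docs, dim) matrix of unit vectors."""
    scores = doc_matrix @ query_vec           # cosine similarity for normalized vectors
    top_k = np.argsort(-scores)[:k]           # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top_k]

# Usage: stack your chunk embeddings into doc_matrix with np.vstack(...),
# then pass the normalized query embedding as query_vec.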
Stripe
Scaling Documentation Search to Handle Developer Queries
Documentation search accuracy improved to 94%, support ticket volume decreased b...
Building Your First Production RAG System
1. Assemble and Clean Your Document Corpus
2. Design Your Chunking Strategy
3. Select and Benchmark Embedding Models
4. Set Up Your Vector Database
5. Build the Retrieval Pipeline
Start With a Golden Test Set
Before writing any code, create a set of 100-200 query-answer pairs with human-verified correct retrievals. This becomes your regression test suite, preventing you from accidentally degrading quality while optimizing for other metrics.
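The golden set itself can be as simple as a list of records like the one sketched below (the fields and the retrieve callable are illustrative assumptions), plus a hit-rate check you run before every change ships.
Golden Test Set Regression Check (Python)
golden_set = [
    {"query": "How do I rotate an API key?",
     "relevant_doc_ids": ["docs/security/key-rotation.md"]},
    # ... 100-200 human-verified entries
]

def regression_check(retrieve, k: int = 5, min_hit_rate: float = 0.85) -> float:
    """Fraction of golden queries whose top-k results contain at least one relevant doc."""
    hits = sum(
        bool({doc_id for doc_id, _ in retrieve(case["query"], k)}
             & set(case["relevant_doc_ids"]))
        for case in golden_set
    )
    hit_rate = hits / len(golden_set)
    assert hit_rate >= min_hit_rate, f"Retrieval regression: hit rate fell to {hit_rate:.2f}"
    return hit_rate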
3-5x
improvement in retrieval accuracy from adding re-ranking
Re-ranking uses a more computationally expensive cross-encoder model to re-score the top candidates from initial retrieval.
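One common way to add that second stage is a cross-encoder from the sentence-transformers library. The sketch below assumes candidates arrive as (doc_id, text) pairs; the model name is a widely used open-source default, not a recommendation specific to this chapter.
Cross-Encoder Re-ranking (Python)
from sentence_transformers import CrossEncoder

# Cross-encoders score each (query, document) pair jointly: slower, but much more accurate
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[tuple[str, str]], top_n: int = 5) -> list[tuple[str, float]]:
    """Re-score initial retrieval candidates and keep the best top_n."""
    scores = reranker.predict([(query, text) for _, text in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _), score in ranked[:top_n]]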
Key Insight
Hybrid Search Combines the Best of Semantic and Keyword Matching
Pure semantic search fails on exact matches—searching for 'error code E-4521' might retrieve documents about errors generally rather than the specific code. Pure keyword search fails on semantic understanding—searching for 'how to cancel my subscription' won't find 'ending your membership.' Hybrid search combines both approaches, typically using Reciprocal Rank Fusion (RRF) to merge results from dense (semantic) and sparse (BM25/keyword) retrieval.
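RRF itself is only a few lines: every document earns 1/(k + rank) from each result list it appears in, with k conventionally set to 60. A minimal sketch that merges ranked lists of document IDs:
Reciprocal Rank Fusion (Python)
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists (e.g. dense + BM25 results) into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused_ids = reciprocal_rank_fusion([dense_ids, bm25_ids])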
Practice Exercise
Build a RAG System for Your Documentation
90 min
Framework
The RICE Framework for RAG Quality
Reach
What percentage of queries will this improvement affect? A chunking change impacts 100% of queries; ...
Impact
How much will this improve affected queries? Measure in terms of retrieval metric improvement (e.g.,...
Confidence
How certain are you about the expected improvement? Base this on testing, benchmarks, or similar dep...
Effort
How much engineering time and infrastructure cost is required? Switching embedding models might take...
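The framework above does not spell out a scoring formula, but the standard RICE calculation multiplies reach, impact, and confidence and divides by effort; the numbers below are purely illustrative.
RICE Scoring Sketch (Python)
def rice_score(reach: float, impact: float, confidence: float, effort_weeks: float) -> float:
    """(Reach x Impact x Confidence) / Effort; higher scores get prioritized sooner."""
    return (reach * impact * confidence) / effort_weeks

# Chunking overhaul: touches 100% of queries, +10 pts recall, 70% confident, 3 weeks of work
print(rice_score(1.00, 10, 0.7, 3))   # ~2.33
# Embedding model swap: touches 100% of queries, +5 pts, 90% confident, 1 week of work
print(rice_score(1.00, 5, 0.9, 1))    # 4.5, so it wins despite the smaller impact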
Framework
The Embedding Quality Triangle
Semantic Density
How much meaning is captured per dimension in the embedding vector. Higher semantic density means be...
Domain Alignment
How well the embedding model understands your specific domain's vocabulary and concepts. A general-p...
Query-Document Symmetry
The degree to which the model handles asymmetric retrieval—short queries finding long documents. Som...
Temporal Consistency
Whether embeddings remain stable across model versions and API updates. Production systems need repr...
Popular Embedding Models: Capabilities and Trade-offs
OpenAI text-embedding-3-large
3072 dimensions with optional dimension reduction to 256/512...
Best-in-class performance on MTEB benchmarks (64.6% average)
Excellent multilingual support across 100+ languages
API-based with per-token pricing ($0.00013/1K tokens)
Cohere embed-v3
1024 dimensions with input type specification (search_docume...
Explicit asymmetric retrieval support improves query-documen...
Compression-aware training maintains quality at lower dimens...
Supports int8 and binary quantization natively
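In practice, that asymmetric support means telling the API whether you are embedding documents or queries. A rough sketch with the Cohere Python SDK; the model name and parameters reflect the v3 API but should be verified against Cohere's current documentation.
Asymmetric Embedding with input_type (Python)
import os
import cohere

co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

# Documents and queries get different input_type hints, which is how embed-v3
# handles short queries retrieving much longer documents.
doc_vectors = co.embed(
    texts=["Refunds are processed within 5-7 business days."],
    model="embed-english-v3.0",
    input_type="search_document",
).embeddings

query_vector = co.embed(
    texts=["how long do refunds take"],
    model="embed-english-v3.0",
    input_type="search_query",
).embeddings[0]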
Notion
Building AI Search Across 15 Billion Blocks
Notion AI search achieved 89% user satisfaction scores, with average query laten...
Key Insight
The Hidden Cost of Embedding API Calls
Teams consistently underestimate embedding costs at scale. Consider a knowledge base with 1 million documents averaging 2,000 tokens each: that is 2 billion tokens, or roughly $260 for a single full indexing pass at text-embedding-3-large's $0.00013/1K-token rate, and you pay it again every time a chunking change or model upgrade forces a re-embed, on top of embedding every user query at request time.
Implementing Semantic Caching for Query Embeddings (Python)
import hashlib
import numpy as np
from redis import Redis
from openai import OpenAI

class SemanticCache:
    def __init__(self, redis_client: Redis, similarity_threshold: float = 0.95):
        self.redis = redis_client
        self.threshold = similarity_threshold  # for near-duplicate queries (unused in this exact-match sketch)
        self.client = OpenAI()

    def get_embedding(self, query: str) -> np.ndarray:
        # Exact-match tier: hash the normalized query and check Redis first
        key = "emb:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if (cached := self.redis.get(key)) is not None:
            return np.frombuffer(cached, dtype=np.float32)
        response = self.client.embeddings.create(
            model="text-embedding-3-small", input=query
        )
        vector = np.array(response.data[0].embedding, dtype=np.float32)
        self.redis.set(key, vector.tobytes())
        return vector
Anti-Pattern: Embedding Everything at Maximum Dimensions
❌ Problem
Storage costs scale linearly with dimensions. A 10-million document corpus at 30...
✓ Solution
Start with the smallest embedding that meets your quality requirements. OpenAI's...
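The text-embedding-3 models expose a dimensions parameter that truncates vectors at request time, which makes it cheap to test whether a smaller size meets your quality bar. A brief sketch:
Requesting Reduced-Dimension Embeddings (Python)
from openai import OpenAI

client = OpenAI()

def embed_at_dims(text: str, dims: int = 256) -> list[float]:
    """Request a reduced-dimension embedding directly from the API."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dims,   # e.g. 256 or 512 instead of the full 3072
    )
    return response.data[0].embedding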
Systematic Embedding Model Selection Process
1. Define Your Retrieval Scenarios
2. Establish Baseline with General-Purpose Model
3. Test Domain-Specific Alternatives
4. Evaluate Dimension-Quality Tradeoffs
5. Stress Test Latency and Throughput
23%
Average improvement in retrieval precision when using query expansion techniques
Query expansion transforms short user queries into richer semantic representations before embedding.
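One simple form of query expansion is asking an LLM to restate the query with synonyms and related phrasing before embedding it. The sketch below is a minimal illustration; the model name and prompt are placeholders, not recommendations from this chapter.
LLM-Based Query Expansion (Python)
from openai import OpenAI

client = OpenAI()

def expand_query(query: str) -> str:
    """Rewrite a short query into a richer form before embedding it."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": ("Rewrite this search query as a single sentence that adds "
                        f"synonyms and closely related terms: {query}"),
        }],
    )
    expanded = response.choices[0].message.content.strip()
    return f"{query} {expanded}"  # keep the original terms alongside the expansion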
Framework
The Chunking Decision Matrix
Document Structure Analysis
Examine how information is organized in your corpus. Structured documents (legal contracts, technica...
Query Length Distribution
Analyze typical query lengths in your system. Short queries (3-5 words) match better with shorter ch...
Information Density Assessment
Measure how much unique information each document section contains. Dense technical documentation ne...
Overlap Strategy Selection
Determine how much context bleeding between chunks is acceptable. Zero overlap risks splitting criti...
Stripe
Adaptive Chunking for API Documentation
Developer satisfaction with Stripe's documentation search improved from 67% to 9...
Chunk Size Affects More Than Retrieval Quality
Smaller chunks mean more vectors to store and search, increasing infrastructure costs and latency. A 1-million-page corpus chunked at 200 tokens might produce 50 million vectors; at 1000 tokens, only 10 million.
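A back-of-the-envelope calculation makes the storage difference concrete; 1536-dimensional float32 vectors are assumed here purely for illustration.
Vector Storage Back-of-the-Envelope (Python)
dims, bytes_per_float = 1536, 4
bytes_per_vector = dims * bytes_per_float               # 6,144 bytes per vector

for num_vectors in (50_000_000, 10_000_000):
    total_gb = num_vectors * bytes_per_vector / 1e9
    print(f"{num_vectors:,} vectors -> ~{total_gb:,.0f} GB of raw vector storage")
# 50M vectors -> ~307 GB; 10M vectors -> ~61 GB, before any index overhead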
Fixed-Size vs. Semantic Chunking
Fixed-Size Chunking
Predictable chunk count and storage requirements
Simple implementation with consistent processing time
Works well for homogeneous document types
Risk of splitting semantic units mid-thought
Semantic Chunking
Respects natural language boundaries (paragraphs, sections)
More complex implementation using NLP or LLM assistance
Implementing Semantic Chunking with Sentence Boundaries (Python)
from typing import List
import spacy

nlp = spacy.load("en_core_web_sm")

def semantic_chunk(
    text: str,
    min_chunk_size: int = 200,
    max_chunk_size: int = 1000,
    overlap_sentences: int = 1
) -> List[str]:
    # Sizes are measured in characters here as a simplifying assumption.
    doc = nlp(text)
    chunks, current = [], []
    for sent in doc.sents:
        sentence = sent.text.strip()
        size = sum(len(s) for s in current)
        if current and size + len(sentence) > max_chunk_size and size >= min_chunk_size:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]  # carry trailing sentences into the next chunk
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
Vector Database Evaluation Checklist
Key Insight
Retrieval Quality Metrics That Actually Matter
Teams often optimize for the wrong metrics. Recall@K (what percentage of relevant documents appear in top K results) matters for comprehensive answers but ignores ranking quality; ranking-aware metrics such as Mean Reciprocal Rank (MRR) or nDCG complement it by rewarding systems that place the most relevant documents first.
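Mean Reciprocal Rank is one ranking-aware complement to Recall@K and takes only a few lines to compute. A minimal sketch over lists of retrieved and relevant document IDs:
Recall@K and MRR (Python)
def recall_at_k(retrieved: list[str], relevant: list[str], k: int = 5) -> float:
    """Share of the relevant documents that appear in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(results: list[tuple[list[str], list[str]]]) -> float:
    """Average 1/rank of the first relevant document per query (0 when none is retrieved)."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0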
Practice Exercise
Build a Retrieval Quality Benchmark
90 min
The RAG Retrieval Pipeline
User Query → Query Preprocessing … → Embedding Generation … → Vector Search (ANN …)
Anti-Pattern: Ignoring the Re-ranking Opportunity
❌ Problem
Without re-ranking, the most relevant document often isn't in the top position. ...
✓ Solution
Implement a two-stage retrieval: use fast vector search to get top-50 candidates...
Perplexity
Multi-Stage Retrieval for Real-Time Web Search
Perplexity handles over 10 million queries daily with average response times und...
The Hybrid Search Sweet Spot
Combine vector similarity (semantic understanding) with BM25 keyword matching (exact term matching) using Reciprocal Rank Fusion. Weight vector results at 0.7 and keyword results at 0.3 as a starting point.
Essential RAG Implementation Resources
MTEB Leaderboard (tool)
LlamaIndex Documentation (article)
Pinecone Learning Center (article)
Anthropic's Contextual Retrieval Guide (article)
THIS WEEK'S JOURNEY
Putting RAG Into Practice: From Theory to Production
Understanding RAG concepts is only half the battle—the real challenge lies in implementing systems that work reliably in production environments. This section provides hands-on exercises, real code examples, and battle-tested checklists that will transform your theoretical knowledge into practical skills.
Practice Exercise
Build a Complete RAG Pipeline from Scratch
90 min
Production-Ready RAG Pipeline in Python
import os

from openai import OpenAI
import chromadb
from chromadb.utils import embedding_functions

class ProductionRAG:
    def __init__(self, collection_name: str):
        self.client = OpenAI()
        self.chroma = chromadb.PersistentClient(path="./rag_db")
        self.embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
            model_name="text-embedding-3-small",
            api_key=os.environ["OPENAI_API_KEY"]
        )
        # Reuse the collection across runs; Chroma persists it under ./rag_db
        self.collection = self.chroma.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_fn
        )
Practice Exercise
Implement and Compare Chunking Strategies
45 min
Semantic Chunking Implementation (Python)
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder; swap in your preferred model

def semantic_chunk(text: str,
                   similarity_threshold: float = 0.5,
                   min_chunk_size: int = 100,
                   max_chunk_size: int = 1000) -> list[str]:
    """Chunk text based on semantic similarity between sentences; break where topic shifts occur."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]  # crude sentence split for brevity
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], sentences[:1]
    for i in range(1, len(sentences)):
        size = sum(len(s) for s in current)
        # break on a topic shift (low cosine similarity) or an oversized chunk
        if size >= min_chunk_size and (np.dot(embeddings[i - 1], embeddings[i]) < similarity_threshold or size >= max_chunk_size):
            chunks.append(". ".join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(". ".join(current))
    return chunks
RAG Production Readiness Checklist
Anti-Pattern: The 'Embed Everything' Approach
❌ Problem
Users receive irrelevant or contradictory information because the retrieval syst...
✓ Solution
Implement strict content curation before indexing. Define clear criteria for wha...
Anti-Pattern: Ignoring the Retrieval-Generation Gap
❌ Problem
Resources are wasted optimizing the wrong metric. The system retrieves documents...
✓ Solution
Implement end-to-end evaluation from the start. Create a golden dataset of quest...
Implement content-type-aware chunking. Analyze your corpus and identify 3-5 dist...
Practice Exercise
Build a RAG Evaluation Dashboard
120 min
RAG Evaluation Metrics Implementation (Python)
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class EvaluationResult:
    query: str
    retrieved_ids: list[str]
    expected_ids: list[str]
    generated_answer: str
    expected_answer: str
    retrieval_precision: float
RAG Debugging Checklist
Essential RAG Learning Resources
LangChain RAG Documentation (article)
Pinecone Learning Center (article)
RAGAS: RAG Assessment Framework (tool)
Anthropic's Contextual Retrieval Guide (article)
Framework
RAG Quality Improvement Cycle
Measure Baseline
Establish current performance using a golden dataset of 100+ queries. Measure retrieval precision, r...
Analyze Failures
Categorize failed queries into failure modes: retrieval miss (relevant document not retrieved), retr...
Hypothesize Solutions
Based on failure analysis, form specific hypotheses. 'Increasing chunk size from 500 to 800 tokens w...
Implement and Test
Implement changes in isolation and test against your golden dataset. Use A/B testing in production w...
Start Simple, Measure Everything, Iterate
The most successful RAG implementations start with the simplest possible architecture—basic chunking, standard embeddings, single-stage retrieval—and improve based on measured failures. Teams that over-engineer from the start spend months building complex systems that solve the wrong problems.
Practice Exercise
Conduct a RAG System Audit
60 min
RAG Tools and Infrastructure
LlamaIndex (tool)
Weaviate (tool)
Cohere Rerank (tool)
Unstructured.io (tool)
The 80/20 Rule of RAG Optimization
80% of RAG quality improvements come from three areas: better chunking strategies, adding a reranking step, and improving your generation prompt. Before investing in exotic techniques like hypothetical document embeddings or graph-based retrieval, ensure you've optimized these fundamentals.
67%
of RAG failures are retrieval problems, not generation problems
This finding emphasizes the importance of investing in retrieval quality.
Chapter Complete!
RAG systems combine retrieval and generation to ground LLM r...
Chunking strategy is the most underrated factor in RAG perfo...
Embedding model selection significantly impacts retrieval qu...
Evaluation infrastructure is non-negotiable for production R...
Next: Build a minimal RAG system this week using the code examples provided