Beyond Basic RAG: Production Patterns That Actually Scale
Basic Retrieval-Augmented Generation gets you 60% of the way there—the remaining 40% separates demos from production systems. This chapter dives deep into the advanced patterns that companies like Anthropic, Notion, and Perplexity use to build RAG systems handling millions of queries daily with sub-second latency.
73% of production RAG systems require hybrid search to meet accuracy requirements.
Pure vector search, while powerful for semantic similarity, misses exact matches and keyword-specific queries that users frequently need.
Key Insight
The Retrieval Quality Ceiling Is Lower Than You Think
Most teams focus obsessively on their LLM choice while neglecting retrieval quality—yet retrieval errors account for 67% of incorrect RAG responses according to LangChain's analysis of 10,000 production queries. The fundamental truth is that your RAG system can never be smarter than your retrieval.
Basic RAG vs. Production-Ready RAG
Basic RAG
Single embedding model for all content types
Pure vector similarity search (cosine/dot product)
Fixed chunk sizes regardless of content structure
Top-k retrieval without relevance filtering
Production RAG
Domain-specific embeddings with hybrid search fallbacks
Hybrid search combining BM25, vector, and metadata filtering
Semantic chunking respecting document structure and context
Reranking with cross-encoders and relevance thresholds
Notion: How Notion AI Handles 100M+ Documents with Sub-Second Search
Relevance accuracy improved from 71% to 89%, along with user satisfaction scores for AI search.
Framework
The Retrieval Quality Stack
Layer 1: Indexing Quality
The foundation layer focuses on how you chunk, embed, and store documents. Poor chunking creates irrelevant or fragmented chunks that no downstream stage can repair.
Layer 2: Query Understanding
Before searching, understand what the user actually needs. This includes query classification (factual lookup vs. exploratory vs. navigational) and intent detection.
Layer 3: Reranking
After initial retrieval, use cross-encoder models to reorder results by true relevance. Cross-encoders are slower than bi-encoders but capture far richer query-document interactions.
Implementing Hybrid Search with Reciprocal Rank Fusion (Python)

from typing import List, Tuple
from collections import defaultdict

def reciprocal_rank_fusion(
    results_lists: List[List[Tuple[str, float]]],
    k: int = 60,
    weights: List[float] = None
) -> List[Tuple[str, float]]:
    """
    Combine multiple ranked lists using RRF.

    Each inner list holds (doc_id, score) pairs sorted best-first. RRF
    ignores raw scores and uses only ranks: each document accumulates
    weight / (k + rank) from every list it appears in.
    """
    weights = weights or [1.0] * len(results_lists)
    fused = defaultdict(float)
    for results, weight in zip(results_lists, weights):
        for rank, (doc_id, _score) in enumerate(results, start=1):
            fused[doc_id] += weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
The BM25 Renaissance in the Age of Embeddings
Don't dismiss BM25 as 'legacy technology.' Elasticsearch's BM25 implementation consistently outperforms pure vector search for queries containing product names, error codes, technical terms, and proper nouns. Anyscale's benchmark showed BM25 achieving 82% accuracy on technical documentation queries where vector search scored 71%.
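To see why exact tokens win, here is a minimal, self-contained BM25 scorer in pure Python; k1 and b use the common defaults, and the toy corpus is purely illustrative:

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document (a list of tokens) against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        numer = tf[term] * (k1 + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * numer / denom
    return score

# An exact token like an error name matches strongly or not at all --
# exactly the sharp behavior that embedding similarity smooths away:
corpus = [["TypeError", "cannot", "concatenate"], ["memory", "leak", "fix"]]
print(bm25_score(["TypeError"], corpus[0], corpus))  # high score
print(bm25_score(["TypeError"], corpus[1], corpus))  # 0.0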
Key Insight
Reranking: The 10x Improvement Most Teams Skip
Cross-encoder reranking is the single highest-ROI improvement you can make to an existing RAG system. Unlike bi-encoders (which embed query and document separately), cross-encoders process them together, enabling deep token-level interaction that captures nuanced relevance signals.
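As a concrete sketch (not any particular company's production setup), the sentence-transformers CrossEncoder API scores query-document pairs jointly; the MS MARCO model named below is a commonly used public reranker:

from sentence_transformers import CrossEncoder

# A cross-encoder reads query and document together, one forward pass per pair
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do I rotate API keys?"
candidates = [
    "Rotate your API keys from the dashboard under Settings > Security.",
    "Our API supports JSON and XML response formats.",
]

# predict() returns one relevance score per (query, document) pair
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for doc, score in reranked:
    print(f"{score:.3f}  {doc}")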
Perplexity: Building a Search Engine That Outperforms Google on Complex Queries
Perplexity achieved 40% higher user satisfaction scores than traditional search engines on complex queries.
Two-Stage Retrieval Architecture
User Query → Query Expansion (HyDE) → Stage 1: Fast Retrieval (Hybrid: BM25 + Vector) → Stage 2: Cross-Encoder Reranking → Final Context
Anti-Pattern: The 'More Chunks Is Better' Fallacy
❌ Problem
Anthropic's research shows that answer quality degrades when more than 30% of the retrieved context is irrelevant to the question.
✓ Solution
Focus on retrieval precision over recall. Use aggressive reranking with relevance thresholds so only chunks that clear the bar reach the context window.
Key Insight
Query Expansion: Teaching Your System to Read Minds
Users are terrible at expressing what they actually need. They search for 'python error' when they mean 'TypeError: cannot concatenate str and int objects.' Query expansion bridges this gap by automatically enriching queries with related terms, synonyms, and reformulations.
HyDE Retrieval Implementation (Python)

from openai import OpenAI
import numpy as np

client = OpenAI()

def hyde_retrieval(query: str, vector_store, num_hypothetical: int = 3,
                   top_k: int = 10) -> list:
    """HyDE: embed LLM-written hypothetical answers instead of the raw query."""
    # Generate short passages that plausibly answer the query
    docs = [client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user",
                           "content": f"Write a short passage answering: {query}"}],
            ).choices[0].message.content for _ in range(num_hypothetical)]
    # Average the passage embeddings and search with the result;
    # vector_store is assumed to expose search(embedding, top_k=...)
    emb = client.embeddings.create(model="text-embedding-3-small", input=docs)
    mean_vec = np.mean([e.embedding for e in emb.data], axis=0)
    return vector_store.search(mean_vec, top_k=top_k)
Cost-Effective Query Expansion Strategy
HyDE adds LLM calls to every query, which gets expensive at scale. Implement a tiered approach: use simple synonym expansion for common queries (cached), HyDE for complex/ambiguous queries (detected via query classifier), and skip expansion entirely for exact-match queries containing product names or error codes.
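A sketch of that tiered routing; classify_query, the heuristics inside it, and the cache size are all illustrative stand-ins for a real trained classifier:

from functools import lru_cache

def classify_query(query: str) -> str:
    """Hypothetical classifier: cheap heuristics stand in for a trained model."""
    if any(ch.isdigit() for ch in query) or query.isupper():
        return "exact"        # error codes, SKUs, version strings
    if len(query.split()) <= 3:
        return "ambiguous"    # short queries benefit most from expansion
    return "common"

@lru_cache(maxsize=10_000)
def synonym_expand(query: str) -> tuple:
    """Stand-in for a cached thesaurus/synonym lookup."""
    return (query,)

def expand(query: str, llm_expand) -> list:
    """Route each query to the cheapest expansion tier that suffices.
    llm_expand is a callable wrapping HyDE (see the listing above)."""
    tier = classify_query(query)
    if tier == "exact":
        return [query]                      # skip expansion entirely
    if tier == "common":
        return list(synonym_expand(query))  # cheap and cacheable
    return llm_expand(query)                # reserve LLM calls for ambiguous queries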
Implementing Production-Ready Reranking
1. Choose Your Reranker Model
2. Determine Retrieval-to-Rerank Ratio
3. Implement Relevance Thresholds
4. Handle Score Calibration
5. Optimize for Latency
The sketch below combines steps 2 through 5.
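A minimal sketch of those steps, assuming the sentence-transformers CrossEncoder shown earlier; the 50-candidate limit, 0.25 threshold, and batch size are illustrative starting points, not tuned values:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 5,
           rerank_limit: int = 50, threshold: float = 0.25) -> list:
    """Rerank a capped candidate set in batches, with a relevance floor."""
    pool = candidates[:rerank_limit]          # step 2: cap the rerank set
    scores = reranker.predict([(query, doc) for doc in pool],
                              batch_size=32)  # step 5: batch instead of looping
    ranked = sorted(zip(pool, scores), key=lambda x: x[1], reverse=True)
    # Steps 3 and 4: scores may be raw logits depending on the model, so
    # calibrate the threshold against labeled data rather than guessing.
    return [(doc, s) for doc, s in ranked if s >= threshold][:top_k]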
Stripe: How Stripe Docs Achieves 94% Query Resolution Without Human Support
Query resolution rate improved from 76% to 94%, substantially reducing support ticket volume.
The Reranking Latency Trap
Cross-encoder reranking can dominate your latency budget if not carefully managed. A naive implementation reranking 100 documents sequentially on CPU can take 2-3 seconds.
Framework
The Hybrid Search Architecture
Sparse Retrieval Layer
BM25 or TF-IDF based keyword matching that excels at exact term matching, acronyms, product codes, and proper nouns.
Dense Retrieval Layer
Vector embeddings that capture semantic meaning, enabling retrieval of conceptually similar content even when the wording differs.
Fusion Strategy
The algorithm that combines results from both layers. Reciprocal Rank Fusion (RRF) is the most common because it requires no score normalization across retrievers.
Query Router
An intelligent classifier that analyzes incoming queries to determine the optimal weighting between sparse and dense retrieval; a sketch follows the RRF listing below.
Implementing Reciprocal Rank Fusion (Python)

from typing import List, Tuple
from collections import defaultdict

def reciprocal_rank_fusion(
    sparse_results: List[Tuple[str, float]],
    dense_results: List[Tuple[str, float]],
    k: int = 60,
    sparse_weight: float = 0.5
) -> List[Tuple[str, float]]:
    """
    Combine sparse and dense retrieval results using RRF.
    k parameter controls how much to favor top-ranked documents:
    lower k amplifies rank-1 results, higher k flattens the curve.
    """
    fused = defaultdict(float)
    for rank, (doc_id, _) in enumerate(sparse_results, start=1):
        fused[doc_id] += sparse_weight / (k + rank)
    for rank, (doc_id, _) in enumerate(dense_results, start=1):
        fused[doc_id] += (1.0 - sparse_weight) / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
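And a hedged sketch of the Query Router component: a heuristic classifier that sets the sparse_weight fed into reciprocal_rank_fusion above. The regex, cutoffs, and weights are illustrative, not tuned:

import re

# Illustrative signal that a query needs exact keyword matching
EXACT_TOKEN = re.compile(r"[A-Z]{2,}|\d|[_\-./:]")

def route_query(query: str) -> float:
    """Return sparse_weight in [0, 1]: 1.0 = all BM25, 0.0 = all vector."""
    tokens = query.split()
    exact_hits = sum(bool(EXACT_TOKEN.search(t)) for t in tokens)
    if exact_hits >= max(1, len(tokens) // 2):
        return 0.8   # error codes, SKUs, identifiers: favor BM25
    if len(tokens) >= 8:
        return 0.3   # long natural-language questions: favor dense retrieval
    return 0.5       # default: equal weighting

sparse_results = [("doc1", 12.1), ("doc2", 8.4)]   # toy (doc_id, score) lists
dense_results = [("doc2", 0.91), ("doc3", 0.88)]
weight = route_query("ERR_CONN_REFUSED 502")        # -> 0.8, favors BM25
fused = reciprocal_rank_fusion(sparse_results, dense_results, sparse_weight=weight)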
Elastic: Building hybrid search into Elasticsearch 8.0
Hybrid search improved mean reciprocal rank by 23% over dense-only and 31% over sparse-only retrieval.
Retriever vs Reranker: Understanding the Tradeoffs
Bi-Encoder Retriever
Encodes query and documents independently into separate vectors
Extremely fast: can search millions of documents in milliseconds
Vectors can be pre-computed and cached for instant retrieval
Limited understanding of query-document interaction
Cross-Encoder Reranker
Processes query and document together as a single input
Slower: must run inference for each query-document pair
Cannot pre-compute—requires real-time processing
Deep understanding of how query relates to specific document
Query Expansion: The Multiplier Effect for Retrieval
Query expansion transforms a single user query into multiple semantically related queries, dramatically improving recall. A user searching for 'Python memory issues' might also benefit from documents about 'memory leaks', 'garbage collection', 'RAM optimization', and 'heap allocation'.
Implementing Effective Query Expansion
1. Analyze Query Intent
2. Generate Semantic Variants
3. Add Hypothetical Document Queries
4. Execute Parallel Retrieval
5. Deduplicate and Merge Results
LLM-Powered Query Expansion with HyDE (Python)

from openai import OpenAI

client = OpenAI()

QUERY_EXPANSION_PROMPT = """Generate 4 alternative search queries for the following question.
Each alternative should:
1. Use different words but seek the same information
2. Include one that uses more technical terminology
3. Include one that uses simpler, everyday language
4. Include one that approaches the topic from a different angle

Question: {query}
Return one query per line."""

def expand_query(query: str) -> list:
    # One LLM call yields four reformulations; keep the original query too
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": QUERY_EXPANSION_PROMPT.format(query=query)}],
    )
    return [query] + response.choices[0].message.content.strip().splitlines()
Perplexity AI: Multi-query retrieval for comprehensive answers
Query expansion improved their 'answer completeness' metric by 47%.
Anti-Pattern: Over-Expanding Simple Queries
❌ Problem
Systems that over-expand see 40-60% unnecessary API costs from expansion LLM calls.
✓ Solution
Implement a query classifier that routes simple factual queries directly to retrieval, reserving LLM expansion for ambiguous or complex queries.
Framework
Multi-Hop Retrieval Architecture
Query Decomposition
Break complex questions into atomic sub-questions that can be answered independently. Use an LLM to identify the sub-questions and the dependencies between them.
Iterative Retrieval
Execute retrieval for each sub-question in dependency order. Use answers from earlier hops to reformulate the queries for later hops.
Context Accumulation
Maintain a growing context window that accumulates relevant information across hops. Implement intelligent deduplication so repeated passages don't crowd out new evidence.
Hop Termination Logic
Determine when sufficient information has been gathered to answer the original question. Implement both a confidence-based stop condition and a hard cap on hop count.
Multi-Hop Retrieval Implementation (Python)

from typing import List, Dict
from dataclasses import dataclass

@dataclass
class HopResult:
    query: str
    documents: List[Dict]
    extracted_answer: str
    confidence: float

class MultiHopRetriever:
    """Decompose, retrieve per sub-question, stop when confident or at max_hops."""
    def __init__(self, retriever, llm, max_hops: int = 3, stop_confidence: float = 0.8):
        self.retriever, self.llm = retriever, llm
        self.max_hops, self.stop_confidence = max_hops, stop_confidence

    def retrieve(self, question: str) -> List[HopResult]:
        hops: List[HopResult] = []
        query = question
        for _ in range(self.max_hops):
            docs = self.retriever.search(query)
            hop = self._extract_answer(query, docs)      # LLM answer + confidence
            hops.append(hop)
            if hop.confidence >= self.stop_confidence:   # termination logic
                break
            query = self._reformulate(question, hops)    # fold prior answers into next hop
        return hops
    # _extract_answer and _reformulate wrap LLM calls; omitted here for brevity.
Anthropic: Multi-hop reasoning in Claude's research capabilities
Multi-hop retrieval improved complex question accuracy from 52% to 84%, with latency as the main trade-off.
Key Insight
Self-RAG: Teaching Models to Know When They Don't Know
Self-RAG represents a paradigm shift in retrieval-augmented generation. Instead of always retrieving and always using retrieved content, Self-RAG models learn to decide when retrieval is necessary, whether retrieved documents are relevant, and whether the generated response is supported by the evidence.
Self-RAG Decision Flow
Query Input → [Retrieve?] Decision → If Yes: Retrieve Documents → [Relevant?] Per-Document Check → Generate → [Supported?] Response Critique
67% of production RAG failures are retrieval failures, not generation failures.
This statistic underscores why advanced retrieval patterns matter so much.
The Reranker Threshold Trap
Setting reranker relevance thresholds too high causes silent failures where the system returns no results rather than low-confidence results. Start with a threshold of 0.2-0.3 and monitor the percentage of queries returning zero results after reranking.
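A quick way to monitor for those silent failures, sketched with an illustrative metrics dictionary standing in for your real metrics backend:

def apply_threshold(ranked, threshold=0.25, metrics=None):
    """Filter reranked (doc, score) pairs and track how often nothing survives."""
    kept = [(doc, score) for doc, score in ranked if score >= threshold]
    if metrics is not None:
        metrics["queries_total"] = metrics.get("queries_total", 0) + 1
        if not kept:
            # Silent-failure signal: alert if this rate climbs past a few percent
            metrics["zero_result_queries"] = metrics.get("zero_result_queries", 0) + 1
    return kept

metrics = {}
apply_threshold([("doc1", 0.12), ("doc2", 0.08)], threshold=0.3, metrics=metrics)
print(metrics)  # {'queries_total': 1, 'zero_result_queries': 1}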
Practice Exercise: Build a Hybrid Search Evaluation Framework (45 min)
Advanced RAG Implementation Resources
Self-RAG: Learning to Retrieve, Generate, and Critique (article)
Cohere Rerank API Documentation (tool)
LlamaIndex Multi-Hop Query Engine (tool)
Hybrid Search in Elasticsearch: A Practical Guide (article)
Practice Exercise: Build a Hybrid Search System from Scratch (90 min)
Complete Hybrid Search Implementation (Python)

import asyncio
from dataclasses import dataclass
from typing import List
from openai import AsyncOpenAI
import asyncpg

@dataclass
class SearchResult:
    id: str
    content: str
    score: float

class HybridSearcher:
    """Run keyword and vector search in parallel, then fuse with RRF."""
    def __init__(self, pool: asyncpg.Pool, client: AsyncOpenAI):
        self.pool, self.client = pool, client

    async def search(self, query: str, top_k: int = 10) -> List[SearchResult]:
        sparse, dense = await asyncio.gather(
            self._keyword_search(query),   # e.g. Postgres full-text search
            self._vector_search(query),    # embed the query, search pgvector
        )
        return self._rrf_fuse(sparse, dense)[:top_k]
    # _keyword_search, _vector_search, and _rrf_fuse are the exercise: wire
    # them to your schema using the RRF function shown earlier in this chapter.
Practice Exercise: Implement Query Expansion with Evaluation (60 min)
Query Expansion with LLM (Python)

from anthropic import Anthropic
from pydantic import BaseModel
from typing import List
import json

class ExpandedQueries(BaseModel):
    original: str
    synonyms: List[str]   # Same meaning, different words
    specific: List[str]   # More narrow interpretations
    general: List[str]    # Broader interpretations
    related: List[str]    # Adjacent concepts

client = Anthropic()

def expand(query: str) -> ExpandedQueries:
    # Request JSON matching the schema, then validate it with pydantic
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # model name illustrative; any recent Claude works
        max_tokens=512,
        messages=[{"role": "user", "content":
                   "Expand this search query. Return only JSON with keys "
                   f"original, synonyms, specific, general, related: {query}"}],
    )
    return ExpandedQueries(**json.loads(message.content[0].text))
Anti-Pattern: The Reranker Bottleneck
❌ Problem
Users experience 2-3 second search latencies that destroy the interactive feel of the application.
✓ Solution
Implement tiered reranking with strict candidate limits. Use a fast first-stage retriever to narrow candidates before the expensive cross-encoder runs.
Anti-Pattern: The Embedding Monoculture
❌ Problem
Retrieval quality varies wildly across use cases. Code search might work well while prose search suffers, because one embedding model serves every domain.
✓ Solution
Evaluate embedding models per domain using your actual content and queries. Create a per-domain benchmark and choose the best model for each content type.
Anti-Pattern: The Context Window Stuffing
❌ Problem
Response quality degrades as the LLM struggles with information dilution. Relevant passages get lost in the middle of long contexts.
✓ Solution
Treat context as precious real estate regardless of window size. Implement aggressive relevance filtering so only chunks that earn their place reach the prompt; a context-budget sketch follows below.
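A minimal sketch of that filtering, assuming reranked (chunk, score) pairs sorted best-first and a rough 4-characters-per-token estimate (both illustrative):

def pack_context(ranked_chunks, token_budget=2000, min_score=0.3):
    """Keep only high-relevance chunks, best-first, until the budget is spent."""
    est_tokens = lambda text: len(text) // 4   # rough heuristic, not a real tokenizer
    packed, used = [], 0
    for chunk, score in ranked_chunks:         # assumed sorted best-first
        if score < min_score:
            break                              # everything after this scores lower
        cost = est_tokens(chunk)
        if used + cost > token_budget:
            continue                           # skip chunks that would overflow
        packed.append(chunk)
        used += cost
    return packed

chunks = [("Key rotation is under Settings > Security.", 0.92),
          ("Our company was founded in 2010.", 0.18)]
print(pack_context(chunks))  # only the relevant chunk survives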
Practice Exercise: Build a Self-RAG Evaluation Loop (120 min)
Self-RAG Implementation with Critique Loop (Python)

from dataclasses import dataclass
from typing import List
from anthropic import Anthropic
import json

@dataclass
class RetrievalCritique:
    document_id: str
    relevance_score: int        # 1-5
    reasoning: str
    useful_excerpts: List[str]

client = Anthropic()

def critique_document(query: str, doc_id: str, text: str) -> RetrievalCritique:
    # The critique step: grade each retrieved document before generation
    message = client.messages.create(
        model="claude-sonnet-4-20250514",  # model name illustrative
        max_tokens=512,
        messages=[{"role": "user", "content":
                   f"Rate how relevant this document is to the query on a 1-5 scale.\n"
                   f"Query: {query}\nDocument: {text}\n"
                   'Return only JSON: {"relevance_score": ..., "reasoning": ..., '
                   '"useful_excerpts": [...]}'}],
    )
    return RetrievalCritique(document_id=doc_id, **json.loads(message.content[0].text))
Essential RAG Research and Tools
Lost in the Middle: How Language Models Use Long Contexts (article)
Cohere Rerank API (tool)
RAGAS: RAG Assessment Framework (tool)
LlamaIndex Advanced RAG Techniques (article)
Start with Evaluation, Not Implementation
Before building any advanced RAG pattern, create your evaluation dataset first. Collect 100+ real user queries, manually label relevant documents, and establish baseline metrics.
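A baseline harness can be very small. This sketch computes recall@k and MRR over labeled (query, relevant_doc_ids) pairs; the fake retriever and toy dataset are placeholders for your own:

def evaluate(labeled_queries, search_fn, k=10):
    """labeled_queries: list of (query, set_of_relevant_doc_ids) pairs."""
    recall_hits, rr_total = 0, 0.0
    for query, relevant in labeled_queries:
        results = search_fn(query)[:k]          # ranked doc ids from your retriever
        if any(doc_id in relevant for doc_id in results):
            recall_hits += 1
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                rr_total += 1.0 / rank          # reciprocal rank of first hit
                break
    n = len(labeled_queries)
    return {"recall@k": recall_hits / n, "mrr": rr_total / n}

# Toy usage with a fake retriever; swap in your real search function
dataset = [("rotate api keys", {"doc_security"}), ("refund policy", {"doc_billing"})]
fake_search = lambda q: ["doc_security"] if "keys" in q else ["doc_faq", "doc_billing"]
print(evaluate(dataset, fake_search))  # {'recall@k': 1.0, 'mrr': 0.75}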
Framework
RAG Complexity Decision Framework
Query Complexity Assessment
Analyze your query distribution: What percentage are simple factual lookups vs. complex multi-part questions?
Latency Budget Allocation
Define your total latency budget and allocate it across stages. Interactive applications need sub-second responses; batch pipelines can tolerate much more.
Content Characteristics Analysis
Evaluate your content: Is it homogeneous or diverse? Does it require freshness? How large is the corpus?
Quality vs. Cost Trade-off
Map the cost curve for your application: What's the cost per query at each quality level? Basic RAG is cheapest; each added stage (expansion, reranking, multi-hop) raises both quality and cost. The decision sketch below turns this assessment into code.
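The framework can be expressed as a small decision function; the query-type categories and latency cutoffs here are illustrative defaults, not prescriptions:

def choose_rag_pattern(query_type: str, latency_budget_ms: int) -> list:
    """Map the assessment above to a pipeline: start simple, add stages as budget allows."""
    pipeline = ["hybrid_search"]                      # baseline for most workloads
    if latency_budget_ms >= 500:
        pipeline.append("cross_encoder_rerank")       # highest-ROI addition
    if query_type == "ambiguous" and latency_budget_ms >= 1000:
        pipeline.append("query_expansion")            # extra LLM call per query
    if query_type == "multi_part" and latency_budget_ms >= 3000:
        pipeline.append("multi_hop_retrieval")        # several sequential hops
    return pipeline

print(choose_rag_pattern("multi_part", 5000))
# ['hybrid_search', 'cross_encoder_rerank', 'multi_hop_retrieval']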
Practice Exercise: RAG Pipeline Optimization Challenge (180 min)
The Compound Effect of RAG Improvements
RAG improvements compound multiplicatively. A 15% improvement in retrieval quality combined with a 20% improvement in reranking and a 10% improvement in context formatting can yield a 50%+ improvement in final answer quality (1.15 × 1.20 × 1.10 ≈ 1.52).
3.2x average improvement in answer accuracy when combining hybrid search, reranking, and query expansion vs. basic vector RAG.
This benchmark tested 15 different RAG configurations across 5 datasets including enterprise documentation, customer support, and code search.
RAG Pattern Selection by Use Case
Customer Support Bot
Hybrid search essential - users mix product names with descriptions of their problem
Lightweight reranking (Cohere) - latency critical for chat UX
Skip multi-hop - support queries are usually single-intent
Query expansion helps for typos and colloquial language
Research Assistant
Heavy reranking justified - accuracy more important than speed
Multi-hop retrieval essential for complex research questions
Self-RAG valuable for ensuring comprehensive coverage
Query expansion critical for academic terminology variations
Don't Optimize Prematurely
Many teams implement complex RAG patterns before validating that basic RAG is insufficient. Start with simple vector search and a good embedding model.
Chapter Complete!
Hybrid search combining semantic and keyword retrieval improves accuracy over either approach alone.
Reranking is the highest-ROI RAG improvement for most applications.
Query expansion helps most for short, ambiguous queries. Use a classifier to skip it when it isn't needed.
Multi-hop retrieval is essential for complex questions requiring evidence from multiple documents.
Next: Begin by establishing your RAG evaluation baseline with at least 100 labeled query-document pairs