Building Production RAG Systems on AWS: From Documents to Intelligent Responses
Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for building enterprise AI applications that combine the power of large language models with proprietary knowledge bases. Unlike fine-tuning, which requires expensive retraining cycles, RAG systems dynamically retrieve relevant context at query time, enabling real-time knowledge updates and transparent source attribution.
Key Insight
RAG Fundamentally Changes How We Build AI Applications
Traditional LLM applications suffer from knowledge cutoffs, hallucinations, and inability to access proprietary data—RAG solves all three problems simultaneously. By retrieving relevant documents before generation, you ground the model's responses in factual, up-to-date information that you control.
End-to-End RAG Architecture on AWS
Architecture flow: S3 Document Bucket → Lambda Processor → Bedrock Titan Embeddings → OpenSearch Serverless
Framework
The RAG Quality Triangle
Retrieval Precision
The percentage of retrieved documents that are actually relevant to the query. Low precision floods ...
Retrieval Recall
The percentage of relevant documents that are successfully retrieved. Low recall means missing criti...
Generation Fidelity
How accurately the generated response reflects the retrieved context without hallucination or distor...
Latency Budget
The total time from query to response, typically targeting under 3 seconds for interactive applicati...
Notion
Building AI Search Across Millions of Workspaces
Notion AI search now handles 10 million queries daily with P95 latency under 800...
Vector Database Options on AWS
OpenSearch Serverless
Fully managed with automatic scaling from 0 to millions of v...
Native hybrid search combining vector similarity with BM25 k...
Built-in security with IAM integration and encryption at res...
Pay-per-use pricing starting at $0.24/OCU-hour, minimum 2 OC...
Amazon Aurora PostgreSQL with pgvector
Combines vector search with relational data in single databa...
Lower cost for smaller deployments under 1M vectors
Familiar PostgreSQL interface and tooling ecosystem
Requires capacity planning and manual scaling configuration
67% of RAG implementations fail to reach production
The primary failure modes are poor retrieval quality (cited by 45% of failed projects), unacceptable latency (28%), and cost overruns from embedding and inference (27%).
Document Ingestion Lambda with Chunking (Python)
import boto3
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from hashlib import sha256
bedrock = boto3.client('bedrock-runtime')
s3 = boto3.client('s3')
def chunk_document(text: str, metadata: dict) -> list[dict]:
    """Split document into overlapping chunks with metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,  # original snippet truncated here; remainder reconstructed as a minimal sketch
    )
    chunks = []
    for i, chunk_text in enumerate(splitter.split_text(text)):
        chunks.append({
            'chunk_id': sha256(f"{metadata.get('source', '')}-{i}".encode()).hexdigest(),
            'text': chunk_text,
            'metadata': {**metadata, 'chunk_index': i},
        })
    return chunks
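A minimal sketch of the Lambda entry point that ties the pieces together. It reuses the s3 client and chunk_document helper from the snippet above; embed_chunk and index_chunk are hypothetical helpers standing in for the embedding and indexing code shown later in this chapter.
def lambda_handler(event, context):
    """Process each uploaded S3 object: fetch, chunk, embed, and index."""
    processed = 0
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        text = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        chunks = chunk_document(text, metadata={'source': f's3://{bucket}/{key}'})
        for chunk in chunks:
            embedding = embed_chunk(chunk)   # hypothetical helper: Bedrock Titan call (see embedding pipeline)
            index_chunk(chunk, embedding)    # hypothetical helper: bulk write to OpenSearch (see indexing section)
        processed += 1
    return {'statusCode': 200, 'processed': processed}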
Document Ingestion Pipeline Requirements
Anti-Pattern: The Monolithic Chunk
❌ Problem
Large chunks dilute the semantic signal, causing the embedding to represent an a...
✓ Solution
Use chunks of 200-500 tokens that represent single coherent concepts. Implement ...
Setting Up OpenSearch Serverless for Vector Search
1. Create the Vector Search Collection
2. Configure Data Access Policies (steps 1-2 are sketched in code after this list)
3. Create the Vector Index
4. Configure Index Settings for Performance
5. Implement the Indexing Client
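A hedged boto3 sketch of steps 1 and 2. The collection, policy, and role names are illustrative, and OpenSearch Serverless also requires an encryption security policy covering the collection before it can be created.
import boto3
import json

aoss = boto3.client('opensearchserverless')

# An encryption security policy covering 'rag-vectors' must already exist
# (create_security_policy with type='encryption').
collection = aoss.create_collection(name='rag-vectors', type='VECTORSEARCH')

# Grant the indexing role access to the collection and its indexes.
aoss.create_access_policy(
    name='rag-vectors-access',
    type='data',
    policy=json.dumps([{
        'Rules': [
            {'ResourceType': 'collection', 'Resource': ['collection/rag-vectors'], 'Permission': ['aoss:*']},
            {'ResourceType': 'index', 'Resource': ['index/rag-vectors/*'], 'Permission': ['aoss:*']},
        ],
        'Principal': ['arn:aws:iam::123456789012:role/rag-indexer'],  # placeholder account and role
    }]),
)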
OpenSearch Serverless Minimum Costs
OpenSearch Serverless has a minimum of 2 OCUs (OpenSearch Compute Units) for indexing and 2 OCUs for search, totaling approximately $700/month even with zero traffic. For development and small-scale deployments, consider Aurora PostgreSQL with pgvector, which can run on a db.t4g.medium instance for under $50/month.
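For the Aurora PostgreSQL path, a minimal pgvector sketch, assuming the pgvector and psycopg2 Python packages, a placeholder connection string, and a query embedding already produced by Bedrock.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect('host=... dbname=rag user=... password=...')  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('CREATE EXTENSION IF NOT EXISTS vector;')
    cur.execute('CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, text TEXT, embedding vector(1024));')
register_vector(conn)  # register the vector type with psycopg2 once the extension exists

query_embedding = np.zeros(1024)  # placeholder: substitute the real query embedding from Bedrock
with conn.cursor() as cur:
    cur.execute('SELECT id, text FROM chunks ORDER BY embedding <=> %s LIMIT 5;', (query_embedding,))
    top_chunks = cur.fetchall()  # five nearest chunks by cosine distance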
Pure vector search excels at semantic similarity but struggles with exact term matching—searching for 'error code 5012' might return documents about general error handling rather than the specific code. Hybrid search combines vector similarity with traditional keyword matching (BM25), capturing both semantic meaning and lexical precision.
Search relevance improved from 62% to 91% measured by click-through on first res...
Query Expansion Improves Recall by 20-30%
Before embedding user queries, use a fast LLM call to expand abbreviations, add synonyms, and reformulate the question. A query like 'how to fix 504 errors' becomes 'How to troubleshoot and resolve HTTP 504 Gateway Timeout errors in web applications'.
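A minimal sketch of this expansion step, assuming Claude 3 Haiku on Bedrock as the fast model; the model ID and prompt wording are illustrative.
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

def expand_query(raw_query: str) -> str:
    """Rewrite a terse user query with expanded abbreviations and synonyms before embedding."""
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 200,
        'messages': [{'role': 'user', 'content':
            'Rewrite this search query with abbreviations expanded and helpful synonyms added. '
            f'Return only the rewritten query.\n\nQuery: {raw_query}'}],
    })
    response = bedrock.invoke_model(modelId='anthropic.claude-3-haiku-20240307-v1:0', body=body)
    return json.loads(response['body'].read())['content'][0]['text'].strip()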
Framework
The Retrieval Quality Stack
Query Understanding
Transform raw user queries into optimized search queries. Techniques include query expansion (adding...
Reranking
Apply a more sophisticated model to reorder initial retrieval results. Cross-encoder models like Coh...
Context Compression
Reduce retrieved context to the most relevant portions before generation. Extract key sentences, rem...
Practice Exercise
Build a Document Ingestion Pipeline
45 min
Essential RAG Architecture Resources
Amazon Bedrock Knowledge Bases Documentation
LangChain RAG Tutorial
Anthropic's RAG Best Practices
OpenSearch Vector Search Documentation
Framework
RAG Embedding Strategy Framework
Content Classification Layer
Categorize your documents into semantic types: structured data (tables, forms), narrative content (a...
Embedding Model Selection Matrix
Map content types to embedding models based on benchmarks like MTEB. For general text, Amazon Titan ...
Chunk Optimization Engine
Implement adaptive chunking that respects document structure. Use 512-token chunks with 50-token ove...
Index Architecture Design
Design your OpenSearch Serverless index with appropriate shard counts based on expected document vol...
Embedding Model Selection for AWS RAG
Amazon Titan Embeddings V2
1024-dimensional vectors with excellent general-purpose perf...
Native integration with Bedrock means no additional infrastr...
Cost-effective at $0.00002 per 1K tokens, roughly 80% cheape...
Supports up to 8,192 tokens per request, enabling larger chu...
Cohere Embed v3 (via Bedrock)
1024-dimensional vectors with state-of-the-art performance o...
Compression-aware training allows 4x vector size reduction w...
Input type parameter distinguishes between search_document a...
Superior performance on code and technical content compared ...
Anthropic
Building Claude's Documentation RAG System
Search abandonment dropped from 40% to 8%, and the average time to find relevant...
Production Embedding Pipeline with Batching and Error Handling (Python)
import boto3
import json
from typing import List, Dict
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor, as_completed
import backoff
bedrock = boto3.client('bedrock-runtime')
@dataclass
class EmbeddingResult:
    chunk_id: str
    embedding: List[float]
    token_count: int
# Original snippet truncated above; the functions below are a minimal reconstruction.
@backoff.on_exception(backoff.expo, Exception, max_tries=5)
def embed_chunk(chunk: Dict) -> EmbeddingResult:
    response = bedrock.invoke_model(modelId='amazon.titan-embed-text-v2:0',
                                    body=json.dumps({'inputText': chunk['text']}))
    payload = json.loads(response['body'].read())
    return EmbeddingResult(chunk['chunk_id'], payload['embedding'], payload.get('inputTextTokenCount', 0))
def embed_all(chunks: List[Dict]) -> List[EmbeddingResult]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        return [f.result() for f in as_completed([pool.submit(embed_chunk, c) for c in chunks])]
Key Insight
Hybrid Search Dramatically Outperforms Pure Vector Search
Production RAG systems achieve 20-35% better retrieval accuracy by combining dense vector search with sparse keyword matching. OpenSearch Serverless supports this through its neural search plugin, which lets you combine k-NN similarity scores with BM25 text matching in a single query.
Implementing OpenSearch Serverless Vector Store
1. Create Collection with Vector Search Configuration
2. Configure Network and Data Access Policies
3. Design Index Mapping for Vector and Metadata (sketched in code after this list)
4. Implement Bulk Indexing Pipeline
5. Configure Index Refresh and Replica Settings
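A hedged sketch of step 3, assuming an already-authenticated opensearch-py client (like the HybridSearchClient shown later) and illustrative index and field names.
index_body = {
    'settings': {'index': {'knn': True}},  # enable k-NN search on this index
    'mappings': {'properties': {
        'text': {'type': 'text'},          # BM25-searchable chunk text
        'doc_id': {'type': 'keyword'},     # parent document ID for filtering
        'embedding': {
            'type': 'knn_vector',
            'dimension': 1024,             # must match the embedding model's output size
            'method': {'name': 'hnsw', 'engine': 'nmslib', 'space_type': 'cosinesimil'},
        },
    }},
}
client.indices.create(index='rag-chunks', body=index_body)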
Anti-Pattern: Embedding Everything with the Same Model and Parameters
Implement content-type-aware embedding pipelines. Use code-specific models like ...
Vector Dimension Trade-offs Impact Both Cost and Quality
Higher-dimensional embeddings (1024-1536) capture more semantic nuance, but storage costs and per-comparison search compute both grow roughly linearly with dimension. Amazon Titan V2 supports dimension reduction from 1024 to 512 or 256 at embedding time with minimal quality loss for most use cases.
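A short sketch of requesting reduced dimensions directly from Titan Text Embeddings V2; the example text is illustrative.
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

response = bedrock.invoke_model(
    modelId='amazon.titan-embed-text-v2:0',
    body=json.dumps({'inputText': 'How do I rotate IAM access keys?',
                     'dimensions': 512,   # 256, 512, or 1024
                     'normalize': True}),
)
embedding = json.loads(response['body'].read())['embedding']  # 512 floats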
Framework
Multi-Stage Retrieval Architecture
Stage 1: Candidate Generation
Cast a wide net using efficient approximate nearest neighbor search. Retrieve top-100 to top-500 can...
Stage 2: Metadata Filtering
Apply business logic filters to candidates: date ranges, access permissions, content categories, and...
Stage 3: Cross-Encoder Reranking
Use a cross-encoder model to rerank remaining candidates by computing query-document relevance score...
Stage 4: Diversity Optimization
Apply maximal marginal relevance (MMR) or similar diversity algorithms to ensure retrieved documents...
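A minimal, framework-free sketch of MMR over candidate embeddings in pure NumPy; the lambda_param trade-off between relevance and diversity is a common default, not a prescribed value.
import numpy as np

def mmr(query_vec, doc_vecs, lambda_param=0.7, top_k=5):
    """Maximal Marginal Relevance: balance relevance to the query against redundancy among picks."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices into doc_vecs, in selection order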
Notion
Scaling RAG to Billions of User Documents
Query latency improved from 2-3 seconds to under 400ms at p99, even for enterpri...
OpenSearch Serverless Hybrid Search Query (Python)
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3
class HybridSearchClient:
    def __init__(self, collection_endpoint: str, index_name: str):
        credentials = boto3.Session().get_credentials()
        self.auth = AWS4Auth(credentials.access_key, credentials.secret_key,
                             'us-east-1', 'aoss', session_token=credentials.token)
        # Original snippet truncated here; the rest is a minimal reconstruction.
        self.index_name = index_name
        self.client = OpenSearch(hosts=[{'host': collection_endpoint, 'port': 443}],
                                 http_auth=self.auth, use_ssl=True,
                                 connection_class=RequestsHttpConnection)
    def hybrid_search(self, query_text: str, query_vector: list, k: int = 10) -> dict:
        # Combine BM25 keyword matching with k-NN vector similarity in a single bool query.
        body = {'size': k, 'query': {'bool': {'should': [
            {'match': {'text': query_text}},
            {'knn': {'embedding': {'vector': query_vector, 'k': k}}}]}}}
        return self.client.search(index=self.index_name, body=body)
47% improvement in retrieval accuracy with reranking
Cross-encoder reranking of initial retrieval results improves Mean Reciprocal Rank by 47% on average across benchmark datasets.
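One way to sketch this stage is with an open cross-encoder from the sentence-transformers library; production systems on AWS might instead call a managed reranker, but the pattern is the same: score (query, document) pairs jointly and reorder.
from sentence_transformers import CrossEncoder

def rerank(query: str, hits: list[dict], top_k: int = 5) -> list[dict]:
    """Reorder first-stage hits by cross-encoder relevance scores."""
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # small open cross-encoder
    scores = model.predict([(query, hit['text']) for hit in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked[:top_k]]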
Retrieval Quality Optimization Checklist
Multi-Index Retrieval Architecture
Flow: User Query → Query Analyzer → Index Router → specialized indexes (Documentation Index, ...)
Practice Exercise
Build a Retrieval Evaluation Pipeline
90 min
Pre-compute Common Query Embeddings
Analyze your query logs to identify the top 1000 most frequent queries and their variations. Pre-compute and cache embeddings for these queries, eliminating embedding latency for 40-60% of production traffic.
Key Insight
Document Summaries Enable Hierarchical Retrieval
Generate and embed document-level summaries alongside chunk embeddings to enable two-stage retrieval. First, identify relevant documents using summary embeddings, then search within those documents using chunk embeddings.
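A minimal two-stage sketch, assuming an opensearch-py client and two hypothetical indexes: 'doc-summaries' (one summary embedding per document) and 'rag-chunks' (chunk embeddings tagged with a doc_id field).
def hierarchical_search(client, query_vector, top_docs=5, top_chunks=8):
    # Stage 1: find candidate documents via their summary embeddings.
    summary_hits = client.search(index='doc-summaries', body={
        'size': top_docs,
        'query': {'knn': {'embedding': {'vector': query_vector, 'k': top_docs}}}})
    doc_ids = [h['_source']['doc_id'] for h in summary_hits['hits']['hits']]
    # Stage 2: search chunks, restricted to those candidate documents.
    chunk_hits = client.search(index='rag-chunks', body={
        'size': top_chunks,
        'query': {'bool': {
            'must': [{'knn': {'embedding': {'vector': query_vector, 'k': top_chunks}}}],
            'filter': [{'terms': {'doc_id': doc_ids}}]}}})
    return chunk_hits['hits']['hits']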
Practice Exercise
Build a Complete Document Ingestion Pipeline
90 min
Complete RAG Query Handler with Bedrock (Python)
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
class RAGQueryHandler:
    def __init__(self):
        self.bedrock = boto3.client('bedrock-runtime')
        self.opensearch = self._init_opensearch()
        self.embedding_model = 'amazon.titan-embed-text-v1'
        self.generation_model = 'anthropic.claude-3-sonnet-20240229-v1:0'
    # Original snippet truncated here (including _init_opensearch); the query flow below is a minimal reconstruction.
    def answer(self, question: str, k: int = 5) -> str:
        emb = self.bedrock.invoke_model(modelId=self.embedding_model, body=json.dumps({'inputText': question}))
        query_vector = json.loads(emb['body'].read())['embedding']
        hits = self.opensearch.search(index='rag-chunks', body={'size': k, 'query': {'knn': {'embedding': {'vector': query_vector, 'k': k}}}})
        context = '\n\n'.join(h['_source']['text'] for h in hits['hits']['hits'])
        prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
        gen = self.bedrock.invoke_model(modelId=self.generation_model, body=json.dumps({'anthropic_version': 'bedrock-2023-05-31', 'max_tokens': 1024, 'messages': [{'role': 'user', 'content': prompt}]}))
        return json.loads(gen['body'].read())['content'][0]['text']
Production RAG Deployment Checklist
Anti-Pattern: The Monolithic Prompt
❌ Problem
Monolithic prompts increase latency by 40-60% due to longer input processing, re...
✓ Solution
Implement a prompt routing layer that classifies incoming queries and selects ap...
Anti-Pattern: The Embedding-Everything Trap
❌ Problem
Retrieval quality degrades significantly as irrelevant chunks compete with valua...
✓ Solution
Implement a content qualification pipeline before embedding. Filter out boilerpl...
Implement adaptive chunking strategies based on content type and structure. Use ...
Practice Exercise
Implement RAG Evaluation Pipeline
120 min
Practice Exercise
Build Hybrid Search with Keyword and Semantic Retrieval
75 min
RAG Evaluation Metrics Implementation (Python)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
import boto3
import json
class RAGEvaluator:
    def __init__(self):
        self.bedrock = boto3.client('bedrock-runtime')
        self.rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    def evaluate_retrieval(self, retrieved_docs: list, relevant_docs: list, k: int = 5) -> dict:
        # Original snippet truncated here; the metric computation below is a minimal reconstruction.
        retrieved_k = retrieved_docs[:k]
        relevant = set(relevant_docs)
        hits = [d for d in retrieved_k if d in relevant]
        precision = len(hits) / max(len(retrieved_k), 1)
        recall = len(hits) / max(len(relevant), 1)
        mrr = next((1.0 / rank for rank, d in enumerate(retrieved_k, 1) if d in relevant), 0.0)
        return {'precision@k': precision, 'recall@k': recall, 'mrr': mrr}
Framework
RAG Optimization Flywheel
Measure
Establish comprehensive metrics across the RAG pipeline: retrieval relevance scores, generation qual...
Analyze
Deep dive into failure cases using systematic categorization. Identify patterns: Are failures due to...
Experiment
Run controlled experiments testing specific hypotheses. A/B test chunking strategies, prompt variati...
Implement
Deploy winning variations through gradual rollout. Use feature flags for quick rollback capability. ...
Essential RAG Development Resources
Amazon Bedrock Knowledge Bases Documentation
LlamaIndex RAG Evaluation Guide
OpenSearch k-NN Plugin Documentation
Anthropic Prompt Engineering Guide
Cost Optimization Quick Win
Implement embedding caching using ElastiCache Redis with document content hashes as keys. For a typical enterprise corpus with 30% query overlap, this reduces Bedrock embedding API calls by 40-50%, saving $500-2000/month at moderate scale.
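A minimal caching sketch using the redis-py client against an ElastiCache endpoint; the endpoint, key prefix, and TTL are illustrative.
import boto3
import json
import hashlib
import redis

bedrock = boto3.client('bedrock-runtime')
cache = redis.Redis(host='my-cache.xxxxxx.use1.cache.amazonaws.com', port=6379)  # placeholder endpoint

def cached_embedding(text: str) -> list[float]:
    """Return a Titan embedding, hitting Redis first keyed by content hash."""
    key = 'emb:' + hashlib.sha256(text.encode()).hexdigest()
    hit = cache.get(key)
    if hit:
        return json.loads(hit)
    response = bedrock.invoke_model(modelId='amazon.titan-embed-text-v2:0',
                                    body=json.dumps({'inputText': text}))
    embedding = json.loads(response['body'].read())['embedding']
    cache.set(key, json.dumps(embedding), ex=60 * 60 * 24 * 30)  # 30-day TTL
    return embedding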
Bedrock Throttling Considerations
Bedrock has default quotas that can throttle high-volume RAG systems: Titan Embeddings at 2000 requests/minute, Claude at varying limits by model size. Request quota increases proactively before launch.
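While quota increases are pending, client-side adaptive retries soften throttling spikes; a minimal botocore configuration sketch:
import boto3
from botocore.config import Config

# Retry throttled Bedrock calls with adaptive client-side rate limiting.
bedrock = boto3.client(
    'bedrock-runtime',
    config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}),
)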
Data Privacy in RAG Systems
RAG systems often process sensitive documents. Ensure your architecture addresses data residency requirements—Bedrock processes data in-region without persistence.
67% of RAG failures stem from retrieval issues, not generation
This statistic underscores the importance of investing in retrieval quality.
Notion
Building AI-Powered Search with RAG
Notion AI search achieves sub-200ms p95 latency across workspaces with millions ...
Anthropic
Constitutional AI for RAG Guardrails
Claude models on Bedrock demonstrate industry-leading faithfulness to retrieved ...
Complete AWS RAG Architecture
Architecture flow: S3 (Documents) → Lambda (Processor) → Bedrock (Embeddings) → OpenSearch Serverless
Chapter Complete!
RAG on AWS combines S3 for document storage, Lambda for proc...
Document ingestion quality determines RAG success more than ...
Retrieval optimization offers the highest ROI for RAG improv...
Comprehensive evaluation is non-negotiable for production RA...
Next: Begin by implementing a minimal RAG pipeline with Bedrock Knowledge Bases to validate your use case quickly