Building Production RAG Systems on AWS: From Documents to Intelligent Responses
Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for building enterprise AI applications that combine the power of large language models with proprietary knowledge bases. Unlike fine-tuning, which requires expensive retraining cycles, RAG systems dynamically retrieve relevant context at query time, enabling real-time knowledge updates and transparent source attribution.
Key Insight
RAG Fundamentally Changes How We Build AI Applications
Traditional LLM applications suffer from knowledge cutoffs, hallucinations, and inability to access proprietary data—RAG solves all three problems simultaneously. By retrieving relevant documents before generation, you ground the model's responses in factual, up-to-date information that you control.
End-to-End RAG Architecture on AWS
Architecture flow: S3 Document Bucket → Lambda Processor → Bedrock Titan Embeddings → OpenSearch Serverless
Framework
The RAG Quality Triangle
Retrieval Precision
The percentage of retrieved documents that are actually relevant to the query. Low precision floods ...
Retrieval Recall
The percentage of relevant documents that are successfully retrieved. Low recall means missing criti...
Generation Fidelity
How accurately the generated response reflects the retrieved context without hallucination or distor...
Latency Budget
The total time from query to response, typically targeting under 3 seconds for interactive applicati...
Notion
Building AI Search Across Millions of Workspaces
Notion AI search now handles 10 million queries daily with P95 latency under 800...
Vector Database Options on AWS
OpenSearch Serverless
Fully managed with automatic scaling from 0 to millions of v...
Native hybrid search combining vector similarity with BM25 k...
Built-in security with IAM integration and encryption at res...
Pay-per-use pricing starting at $0.24/OCU-hour, minimum 2 OC...
Amazon Aurora PostgreSQL with pgvector
Combines vector search with relational data in single databa...
Lower cost for smaller deployments under 1M vectors
Familiar PostgreSQL interface and tooling ecosystem
Requires capacity planning and manual scaling configuration
67% of RAG implementations fail to reach production
The primary failure modes are poor retrieval quality (cited by 45% of failed projects), unacceptable latency (28%), and cost overruns from embedding and inference (27%).
Document Ingestion Lambda with Chunking (Python)
import boto3
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from hashlib import sha256
bedrock = boto3.client('bedrock-runtime')
s3 = boto3.client('s3')
def chunk_document(text: str, metadata: dict) -> list[dict]:
    """Split document into overlapping chunks with metadata."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,  # original snippet truncated here; remainder reconstructed as a minimal sketch
    )
    chunks = []
    for i, chunk_text in enumerate(splitter.split_text(text)):
        chunks.append({
            'chunk_id': sha256(f"{metadata.get('source', '')}-{i}".encode()).hexdigest(),
            'text': chunk_text,
            'metadata': {**metadata, 'chunk_index': i},
        })
    return chunks
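A minimal sketch of the Lambda entry point that ties the pieces together. It reuses the s3 client and chunk_document helper from the snippet above; embed_chunk and index_chunk are hypothetical helpers standing in for the embedding and indexing code shown later in this chapter.
def lambda_handler(event, context):
    """Process each uploaded S3 object: fetch, chunk, embed, and index."""
    processed = 0
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        text = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        chunks = chunk_document(text, metadata={'source': f's3://{bucket}/{key}'})
        for chunk in chunks:
            embedding = embed_chunk(chunk)   # hypothetical helper: Bedrock Titan call (see embedding pipeline)
            index_chunk(chunk, embedding)    # hypothetical helper: bulk write to OpenSearch (see indexing section)
        processed += 1
    return {'statusCode': 200, 'processed': processed}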
Document Ingestion Pipeline Requirements
Anti-Pattern: The Monolithic Chunk
❌ Problem
Large chunks dilute the semantic signal, causing the embedding to represent an a...
✓ Solution
Use chunks of 200-500 tokens that represent single coherent concepts. Implement ...
Setting Up OpenSearch Serverless for Vector Search
1. Create the Vector Search Collection
2. Configure Data Access Policies (steps 1-2 are sketched in code after this list)
3. Create the Vector Index
4. Configure Index Settings for Performance
5. Implement the Indexing Client
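A hedged boto3 sketch of steps 1 and 2. The collection, policy, and role names are illustrative, and OpenSearch Serverless also requires an encryption security policy covering the collection before it can be created.
import boto3
import json

aoss = boto3.client('opensearchserverless')

# An encryption security policy covering 'rag-vectors' must already exist
# (create_security_policy with type='encryption').
collection = aoss.create_collection(name='rag-vectors', type='VECTORSEARCH')

# Grant the indexing role access to the collection and its indexes.
aoss.create_access_policy(
    name='rag-vectors-access',
    type='data',
    policy=json.dumps([{
        'Rules': [
            {'ResourceType': 'collection', 'Resource': ['collection/rag-vectors'], 'Permission': ['aoss:*']},
            {'ResourceType': 'index', 'Resource': ['index/rag-vectors/*'], 'Permission': ['aoss:*']},
        ],
        'Principal': ['arn:aws:iam::123456789012:role/rag-indexer'],  # placeholder account and role
    }]),
)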
OpenSearch Serverless Minimum Costs
OpenSearch Serverless has a minimum of 2 OCUs (OpenSearch Compute Units) for indexing and 2 OCUs for search, totaling approximately $700/month even with zero traffic. For development and small-scale deployments, consider Aurora PostgreSQL with pgvector, which can run on a db.t4g.medium instance for under $50/month.
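For the Aurora PostgreSQL path, a minimal pgvector sketch, assuming the pgvector and psycopg2 Python packages, a placeholder connection string, and a query embedding already produced by Bedrock.
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect('host=... dbname=rag user=... password=...')  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute('CREATE EXTENSION IF NOT EXISTS vector;')
    cur.execute('CREATE TABLE IF NOT EXISTS chunks (id TEXT PRIMARY KEY, text TEXT, embedding vector(1024));')
register_vector(conn)  # register the vector type with psycopg2 once the extension exists

query_embedding = np.zeros(1024)  # placeholder: substitute the real query embedding from Bedrock
with conn.cursor() as cur:
    cur.execute('SELECT id, text FROM chunks ORDER BY embedding <=> %s LIMIT 5;', (query_embedding,))
    top_chunks = cur.fetchall()  # five nearest chunks by cosine distance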
Pure vector search excels at semantic similarity but struggles with exact term matching—searching for 'error code 5012' might return documents about general error handling rather than the specific code. Hybrid search combines vector similarity with traditional keyword matching (BM25), capturing both semantic meaning and lexical precision.
Search relevance improved from 62% to 91% measured by click-through on first res...
Query Expansion Improves Recall by 20-30%
Before embedding user queries, use a fast LLM call to expand abbreviations, add synonyms, and reformulate the question. A query like 'how to fix 504 errors' becomes 'How to troubleshoot and resolve HTTP 504 Gateway Timeout errors in web applications'.
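A minimal sketch of this expansion step, assuming Claude 3 Haiku on Bedrock as the fast model; the model ID and prompt wording are illustrative.
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

def expand_query(raw_query: str) -> str:
    """Rewrite a terse user query with expanded abbreviations and synonyms before embedding."""
    body = json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 200,
        'messages': [{'role': 'user', 'content':
            'Rewrite this search query with abbreviations expanded and helpful synonyms added. '
            f'Return only the rewritten query.\n\nQuery: {raw_query}'}],
    })
    response = bedrock.invoke_model(modelId='anthropic.claude-3-haiku-20240307-v1:0', body=body)
    return json.loads(response['body'].read())['content'][0]['text'].strip()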
Framework
The Retrieval Quality Stack
Query Understanding
Transform raw user queries into optimized search queries. Techniques include query expansion (adding...
Reranking
Apply a more sophisticated model to reorder initial retrieval results. Cross-encoder models like Coh...
Context Compression
Reduce retrieved context to the most relevant portions before generation. Extract key sentences, rem...
Practice Exercise
Build a Document Ingestion Pipeline
45 min
Essential RAG Architecture Resources
Amazon Bedrock Knowledge Bases Documentation
LangChain RAG Tutorial
Anthropic's RAG Best Practices
OpenSearch Vector Search Documentation
Framework
RAG Embedding Strategy Framework
Content Classification Layer
Categorize your documents into semantic types: structured data (tables, forms), narrative content (a...
Embedding Model Selection Matrix
Map content types to embedding models based on benchmarks like MTEB. For general text, Amazon Titan ...
Chunk Optimization Engine
Implement adaptive chunking that respects document structure. Use 512-token chunks with 50-token ove...
Index Architecture Design
Design your OpenSearch Serverless index with appropriate shard counts based on expected document vol...
Embedding Model Selection for AWS RAG
Amazon Titan Embeddings V2
1024-dimensional vectors with excellent general-purpose perf...
Native integration with Bedrock means no additional infrastr...
Cost-effective at $0.00002 per 1K tokens, roughly 80% cheape...
Supports up to 8,192 tokens per request, enabling larger chu...
Cohere Embed v3 (via Bedrock)
1024-dimensional vectors with state-of-the-art performance o...
Compression-aware training allows 4x vector size reduction w...
Input type parameter distinguishes between search_document a...
Superior performance on code and technical content compared ...
Anthropic
Building Claude's Documentation RAG System
Search abandonment dropped from 40% to 8%, and the average time to find relevant...
Production Embedding Pipeline with Batching and Error Handling (Python)
import boto3
import json
from typing import List, Dict
from dataclasses import dataclass
from concurrent.futures import ThreadPoolExecutor, as_completed
import backoff
bedrock = boto3.client('bedrock-runtime')
@dataclass
class EmbeddingResult:
    chunk_id: str
    embedding: List[float]
    token_count: int
# Original snippet truncated above; the functions below are a minimal reconstruction.
@backoff.on_exception(backoff.expo, Exception, max_tries=5)
def embed_chunk(chunk: Dict) -> EmbeddingResult:
    response = bedrock.invoke_model(modelId='amazon.titan-embed-text-v2:0',
                                    body=json.dumps({'inputText': chunk['text']}))
    payload = json.loads(response['body'].read())
    return EmbeddingResult(chunk['chunk_id'], payload['embedding'], payload.get('inputTextTokenCount', 0))
def embed_all(chunks: List[Dict]) -> List[EmbeddingResult]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        return [f.result() for f in as_completed([pool.submit(embed_chunk, c) for c in chunks])]
Key Insight
Hybrid Search Dramatically Outperforms Pure Vector Search
Production RAG systems achieve 20-35% better retrieval accuracy by combining dense vector search with sparse keyword matching. OpenSearch Serverless supports this through its neural search plugin, which lets you combine k-NN similarity scores with BM25 text matching in a single query.
Implementing OpenSearch Serverless Vector Store
1. Create Collection with Vector Search Configuration
2. Configure Network and Data Access Policies
3. Design Index Mapping for Vector and Metadata (sketched in code after this list)
4. Implement Bulk Indexing Pipeline
5. Configure Index Refresh and Replica Settings
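A hedged sketch of step 3, assuming an already-authenticated opensearch-py client (like the HybridSearchClient shown later) and illustrative index and field names.
index_body = {
    'settings': {'index': {'knn': True}},  # enable k-NN search on this index
    'mappings': {'properties': {
        'text': {'type': 'text'},          # BM25-searchable chunk text
        'doc_id': {'type': 'keyword'},     # parent document ID for filtering
        'embedding': {
            'type': 'knn_vector',
            'dimension': 1024,             # must match the embedding model's output size
            'method': {'name': 'hnsw', 'engine': 'nmslib', 'space_type': 'cosinesimil'},
        },
    }},
}
client.indices.create(index='rag-chunks', body=index_body)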
Anti-Pattern: Embedding Everything with the Same Model and Parameters
Implement content-type-aware embedding pipelines. Use code-specific models like ...
Vector Dimension Trade-offs Impact Both Cost and Quality
Higher-dimensional embeddings (1024-1536) capture more semantic nuance, but storage costs and per-comparison search compute both grow roughly linearly with dimension. Amazon Titan V2 supports dimension reduction from 1024 to 512 or 256 at embedding time with minimal quality loss for most use cases.
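A short sketch of requesting reduced dimensions directly from Titan Text Embeddings V2; the example text is illustrative.
import boto3
import json

bedrock = boto3.client('bedrock-runtime')

response = bedrock.invoke_model(
    modelId='amazon.titan-embed-text-v2:0',
    body=json.dumps({'inputText': 'How do I rotate IAM access keys?',
                     'dimensions': 512,   # 256, 512, or 1024
                     'normalize': True}),
)
embedding = json.loads(response['body'].read())['embedding']  # 512 floats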
Framework
Multi-Stage Retrieval Architecture
Stage 1: Candidate Generation
Cast a wide net using efficient approximate nearest neighbor search. Retrieve top-100 to top-500 can...
Stage 2: Metadata Filtering
Apply business logic filters to candidates: date ranges, access permissions, content categories, and...
Stage 3: Cross-Encoder Reranking
Use a cross-encoder model to rerank remaining candidates by computing query-document relevance score...
Stage 4: Diversity Optimization
Apply maximal marginal relevance (MMR) or similar diversity algorithms to ensure retrieved documents...
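A minimal, framework-free sketch of MMR over candidate embeddings in pure NumPy; the lambda_param trade-off between relevance and diversity is a common default, not a prescribed value.
import numpy as np

def mmr(query_vec, doc_vecs, lambda_param=0.7, top_k=5):
    """Maximal Marginal Relevance: balance relevance to the query against redundancy among picks."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        def score(i):
            relevance = cos(query_vec, doc_vecs[i])
            redundancy = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_param * relevance - (1 - lambda_param) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # indices into doc_vecs, in selection order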
Notion
Scaling RAG to Billions of User Documents
Query latency improved from 2-3 seconds to under 400ms at p99, even for enterpri...
OpenSearch Serverless Hybrid Search Query (Python)
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3
class HybridSearchClient:
    def __init__(self, collection_endpoint: str, index_name: str):
        credentials = boto3.Session().get_credentials()
        self.auth = AWS4Auth(credentials.access_key, credentials.secret_key,
                             'us-east-1', 'aoss', session_token=credentials.token)
        # Original snippet truncated here; the rest is a minimal reconstruction.
        self.index_name = index_name
        self.client = OpenSearch(hosts=[{'host': collection_endpoint, 'port': 443}],
                                 http_auth=self.auth, use_ssl=True,
                                 connection_class=RequestsHttpConnection)
    def hybrid_search(self, query_text: str, query_vector: list, k: int = 10) -> dict:
        # Combine BM25 keyword matching with k-NN vector similarity in a single bool query.
        body = {'size': k, 'query': {'bool': {'should': [
            {'match': {'text': query_text}},
            {'knn': {'embedding': {'vector': query_vector, 'k': k}}}]}}}
        return self.client.search(index=self.index_name, body=body)
47% improvement in retrieval accuracy with reranking
Cross-encoder reranking of initial retrieval results improves Mean Reciprocal Rank by 47% on average across benchmark datasets.
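One way to sketch this stage is with an open cross-encoder from the sentence-transformers library; production systems on AWS might instead call a managed reranker, but the pattern is the same: score (query, document) pairs jointly and reorder.
from sentence_transformers import CrossEncoder

def rerank(query: str, hits: list[dict], top_k: int = 5) -> list[dict]:
    """Reorder first-stage hits by cross-encoder relevance scores."""
    model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # small open cross-encoder
    scores = model.predict([(query, hit['text']) for hit in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked[:top_k]]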
Retrieval Quality Optimization Checklist
Multi-Index Retrieval Architecture
Flow: User Query → Query Analyzer → Index Router → specialized indexes (Documentation Index, ...)
Practice Exercise
Build a Retrieval Evaluation Pipeline
90 min
Pre-compute Common Query Embeddings
Analyze your query logs to identify the top 1000 most frequent queries and their variations. Pre-compute and cache embeddings for these queries, eliminating embedding latency for 40-60% of production traffic.
Key Insight
Document Summaries Enable Hierarchical Retrieval
Generate and embed document-level summaries alongside chunk embeddings to enable two-stage retrieval. First, identify relevant documents using summary embeddings, then search within those documents using chunk embeddings.
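A minimal two-stage sketch, assuming an opensearch-py client and two hypothetical indexes: 'doc-summaries' (one summary embedding per document) and 'rag-chunks' (chunk embeddings tagged with a doc_id field).
def hierarchical_search(client, query_vector, top_docs=5, top_chunks=8):
    # Stage 1: find candidate documents via their summary embeddings.
    summary_hits = client.search(index='doc-summaries', body={
        'size': top_docs,
        'query': {'knn': {'embedding': {'vector': query_vector, 'k': top_docs}}}})
    doc_ids = [h['_source']['doc_id'] for h in summary_hits['hits']['hits']]
    # Stage 2: search chunks, restricted to those candidate documents.
    chunk_hits = client.search(index='rag-chunks', body={
        'size': top_chunks,
        'query': {'bool': {
            'must': [{'knn': {'embedding': {'vector': query_vector, 'k': top_chunks}}}],
            'filter': [{'terms': {'doc_id': doc_ids}}]}}})
    return chunk_hits['hits']['hits']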
Practice Exercise
Build a Complete Document Ingestion Pipeline
90 min
Complete RAG Query Handler with Bedrock (Python)
import boto3
import json
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
class RAGQueryHandler:
    def __init__(self):
        self.bedrock = boto3.client('bedrock-runtime')
        self.opensearch = self._init_opensearch()
        self.embedding_model = 'amazon.titan-embed-text-v1'
        self.generation_model = 'anthropic.claude-3-sonnet-20240229-v1:0'
    # Original snippet truncated here (including _init_opensearch); the query flow below is a minimal reconstruction.
    def answer(self, question: str, k: int = 5) -> str:
        emb = self.bedrock.invoke_model(modelId=self.embedding_model, body=json.dumps({'inputText': question}))
        query_vector = json.loads(emb['body'].read())['embedding']
        hits = self.opensearch.search(index='rag-chunks', body={'size': k, 'query': {'knn': {'embedding': {'vector': query_vector, 'k': k}}}})
        context = '\n\n'.join(h['_source']['text'] for h in hits['hits']['hits'])
        prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
        gen = self.bedrock.invoke_model(modelId=self.generation_model, body=json.dumps({'anthropic_version': 'bedrock-2023-05-31', 'max_tokens': 1024, 'messages': [{'role': 'user', 'content': prompt}]}))
        return json.loads(gen['body'].read())['content'][0]['text']
Production RAG Deployment Checklist
Anti-Pattern: The Monolithic Prompt
❌ Problem
Monolithic prompts increase latency by 40-60% due to longer input processing, re...
✓ Solution
Implement a prompt routing layer that classifies incoming queries and selects ap...
Anti-Pattern: The Embedding-Everything Trap
❌ Problem
Retrieval quality degrades significantly as irrelevant chunks compete with valua...
✓ Solution
Implement a content qualification pipeline before embedding. Filter out boilerpl...
Implement adaptive chunking strategies based on content type and structure. Use ...
Practice Exercise
Implement RAG Evaluation Pipeline
120 min
Practice Exercise
Build Hybrid Search with Keyword and Semantic Retrieval
75 min
RAG Evaluation Metrics Implementation (Python)
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
import boto3
import json
class RAGEvaluator:
    def __init__(self):
        self.bedrock = boto3.client('bedrock-runtime')
        self.rouge = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    def evaluate_retrieval(self, retrieved_docs: list, relevant_docs: list, k: int = 5) -> dict:
        # Original snippet truncated here; the metric computation below is a minimal reconstruction.
        retrieved_k = retrieved_docs[:k]
        relevant = set(relevant_docs)
        hits = [d for d in retrieved_k if d in relevant]
        precision = len(hits) / max(len(retrieved_k), 1)
        recall = len(hits) / max(len(relevant), 1)
        mrr = next((1.0 / rank for rank, d in enumerate(retrieved_k, 1) if d in relevant), 0.0)
        return {'precision@k': precision, 'recall@k': recall, 'mrr': mrr}
Framework
RAG Optimization Flywheel
Measure
Establish comprehensive metrics across the RAG pipeline: retrieval relevance scores, generation qual...
Analyze
Deep dive into failure cases using systematic categorization. Identify patterns: Are failures due to...
Experiment
Run controlled experiments testing specific hypotheses. A/B test chunking strategies, prompt variati...
Implement
Deploy winning variations through gradual rollout. Use feature flags for quick rollback capability. ...
Essential RAG Development Resources
Amazon Bedrock Knowledge Bases Documentation
LlamaIndex RAG Evaluation Guide
OpenSearch k-NN Plugin Documentation
Anthropic Prompt Engineering Guide
Cost Optimization Quick Win
Implement embedding caching using ElastiCache Redis with document content hashes as keys. For a typical enterprise corpus with 30% query overlap, this reduces Bedrock embedding API calls by 40-50%, saving $500-2000/month at moderate scale.
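A minimal caching sketch using the redis-py client against an ElastiCache endpoint; the endpoint, key prefix, and TTL are illustrative.
import boto3
import json
import hashlib
import redis

bedrock = boto3.client('bedrock-runtime')
cache = redis.Redis(host='my-cache.xxxxxx.use1.cache.amazonaws.com', port=6379)  # placeholder endpoint

def cached_embedding(text: str) -> list[float]:
    """Return a Titan embedding, hitting Redis first keyed by content hash."""
    key = 'emb:' + hashlib.sha256(text.encode()).hexdigest()
    hit = cache.get(key)
    if hit:
        return json.loads(hit)
    response = bedrock.invoke_model(modelId='amazon.titan-embed-text-v2:0',
                                    body=json.dumps({'inputText': text}))
    embedding = json.loads(response['body'].read())['embedding']
    cache.set(key, json.dumps(embedding), ex=60 * 60 * 24 * 30)  # 30-day TTL
    return embedding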
Bedrock Throttling Considerations
Bedrock has default quotas that can throttle high-volume RAG systems: Titan Embeddings at 2000 requests/minute, Claude at varying limits by model size. Request quota increases proactively before launch.
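While quota increases are pending, client-side adaptive retries soften throttling spikes; a minimal botocore configuration sketch:
import boto3
from botocore.config import Config

# Retry throttled Bedrock calls with adaptive client-side rate limiting.
bedrock = boto3.client(
    'bedrock-runtime',
    config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}),
)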
Data Privacy in RAG Systems
RAG systems often process sensitive documents. Ensure your architecture addresses data residency requirements—Bedrock processes data in-region without persistence.
67% of RAG failures stem from retrieval issues, not generation
This statistic underscores the importance of investing in retrieval quality.
Notion
Building AI-Powered Search with RAG
Notion AI search achieves sub-200ms p95 latency across workspaces with millions ...
Anthropic
Constitutional AI for RAG Guardrails
Claude models on Bedrock demonstrate industry-leading faithfulness to retrieved ...
Complete AWS RAG Architecture
Architecture flow: S3 (Documents) → Lambda (Processor) → Bedrock (Embeddings) → OpenSearch Serverless
Chapter Complete!
RAG on AWS combines S3 for document storage, Lambda for proc...
Document ingestion quality determines RAG success more than ...
Retrieval optimization offers the highest ROI for RAG improv...
Comprehensive evaluation is non-negotiable for production RA...
Next: Begin by implementing a minimal RAG pipeline with Bedrock Knowledge Bases to validate your use case quickly