Vector Stores: The Foundation of Modern AI-Powered Search and Retrieval
Vector stores have become the critical infrastructure layer enabling retrieval-augmented generation (RAG), semantic search, and recommendation systems across the AI landscape. Unlike traditional databases that rely on exact keyword matching, vector stores operate on mathematical representations of meaning, allowing systems to find conceptually similar content even when no words match.
847% growth in vector database adoption among enterprises
This explosive growth reflects the fundamental shift toward AI-native applications where semantic understanding trumps keyword matching.
Key Insight
Embeddings Transform Meaning Into Mathematics
Vector embeddings convert text, images, audio, and other content into dense numerical arrays that capture semantic meaning in high-dimensional space. When you embed the phrase 'machine learning engineer' and 'AI developer,' they'll occupy nearby positions in vector space despite sharing no common words.
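Measuring Semantic Similarity (Python)
To make "nearby positions in vector space" concrete, here is a minimal sketch of the standard nearness measure, cosine similarity. The four-dimensional vectors are toy stand-ins invented for illustration; real embeddings from a model such as Titan or text-embedding-3-small would have 1,024-1,536 dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # 1.0 = same direction (same meaning); near 0.0 = unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for embeddings of the two phrases
ml_engineer = np.array([0.82, 0.11, 0.54, 0.05])   # "machine learning engineer"
ai_developer = np.array([0.79, 0.15, 0.58, 0.02])  # "AI developer"
print(cosine_similarity(ml_engineer, ai_developer))  # ~0.99: near neighbors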
Vector Search Architecture Flow: User Query → Embedding Model → Vector Index Search → Metadata Filtering
Vector Store Options on AWS
OpenSearch Serverless
Fully managed with automatic scaling from 0 to millions of queries
Native k-NN plugin with HNSW and IVF algorithms built-in
Pay-per-OCU pricing starting at $0.24/OCU-hour with a 4-OCU minimum
Supports up to 16,000 dimensions per vector field
Aurora PostgreSQL pgvector
Familiar PostgreSQL interface with SQL-based vector queries
Combines vector search with relational data in a single database
Instance-based pricing from $0.073/hour for db.t4g.medium
Supports up to 16,000 dimensions with the latest pgvector versions (HNSW and IVFFlat indexes cap at 2,000)
Case Study (Notion): Scaling Semantic Search to 100M+ Documents
Search latency improved from 450ms p99 to 89ms p99, while infrastructure costs dropped...
Framework
The DIMS Framework for Vector Store Selection
Data Volume & Velocity
Assess your current vector count and growth trajectory. Solutions that work at 1M vectors may fail c...
Integration Requirements
Evaluate how vectors interact with existing data. If you need ACID transactions combining vector updates...
Metadata Complexity
Consider your filtering requirements. Simple filters (tenant_id, timestamp) work everywhere. Complex...
SLA Requirements
Define your latency and availability targets. Real-time applications need p99 latencies under 100ms....
Embedding Model Lock-in Is Real
Your vector store contains embeddings generated by a specific model version. Switching embedding models requires re-embedding your entire corpus and rebuilding all indexes.
Creating a Vector-Enabled Table in Aurora PostgreSQL with pgvector (SQL)
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for document embeddings
CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id VARCHAR(255) NOT NULL,
    chunk_index INTEGER NOT NULL,
    content TEXT NOT NULL,
    embedding vector(1536), -- OpenAI text-embedding-3-small dimension
    metadata JSONB DEFAULT '{}',
    tenant_id VARCHAR(100) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- HNSW index for fast approximate nearest-neighbor search (cosine distance)
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_cosine_ops);
Key Insight
Chunking Strategy Determines RAG Quality More Than Vector Store Choice
Before obsessing over vector store selection, recognize that how you chunk documents has 3-5x more impact on retrieval quality than infrastructure choices. Fixed-size chunking (512 tokens) is simple but often splits semantic units awkwardly.
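Fixed-Size Chunking with Overlap (Python)
As a baseline for experimentation, a minimal fixed-size chunker with overlap is sketched below. It approximates tokens with whitespace-split words (a simplifying assumption); the overlap reduces the chance that a fact is severed exactly at a chunk boundary.
from typing import List

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Sliding-window chunking; 'tokens' approximated by whitespace words."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]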
Deploying OpenSearch Serverless for Vector Search
1. Create the Collection (see the sketch after this list)
2. Configure Network Access
3. Set Up Data Access Policies
4. Create the Vector Index
5. Configure Index Settings
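Creating a Vector Search Collection with boto3 (Python)
A minimal sketch of step 1, assuming a hypothetical collection name rag-vectors. OpenSearch Serverless requires an encryption security policy to exist before the collection is created; the network and data access policies in steps 2-3 follow the same create_security_policy / create_access_policy pattern.
import json
import boto3

aoss = boto3.client('opensearchserverless', region_name='us-east-1')

# An encryption policy must cover the collection before it can be created
aoss.create_security_policy(
    name='rag-vectors-encryption',
    type='encryption',
    policy=json.dumps({
        'Rules': [{'ResourceType': 'collection', 'Resource': ['collection/rag-vectors']}],
        'AWSOwnedKey': True,
    }),
)

# VECTORSEARCH collections are tuned for k-NN workloads
aoss.create_collection(name='rag-vectors', type='VECTORSEARCH')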
Anti-Pattern: Storing Raw Text in Vector Indexes
❌ Problem
Index sizes balloon 10-50x larger than necessary, dramatically increasing costs ...
✓ Solution
Store only embeddings, document IDs, and essential filter metadata in your vector store...
Vector Store Production Readiness Checklist
Case Study (Anthropic): Building Claude's Context Retrieval Infrastructure
The system handles 50,000 queries per second with p99 latency of 34ms. Cache hit...
Start with pgvector, Graduate to OpenSearch
For teams building their first RAG system, pgvector on Aurora PostgreSQL offers the fastest path to production. You likely already have PostgreSQL expertise, and the familiar SQL interface accelerates development.
Key Insight
Hybrid Search Combines Keyword and Semantic Retrieval
Pure vector search misses exact matches that keyword search handles perfectly—searching for 'error code E-4521' should return documents containing that exact string, not semantically similar error descriptions. Hybrid search combines BM25 keyword scoring with vector similarity, typically using reciprocal rank fusion to merge results.
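Reciprocal Rank Fusion (Python)
A minimal sketch of reciprocal rank fusion with the conventional k=60 smoothing constant; it assumes each retriever returns an ordered list of document IDs.
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each list contributes 1/(k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 ranking with a vector-search ranking
merged = reciprocal_rank_fusion([['d3', 'd1', 'd7'], ['d1', 'd9', 'd3']])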
Average vector search latency on OpenSearch Serverless: this benchmark reflects queries against indexes with 10 million 1536-dimensional vectors using HNSW with ef_search=100.
Framework
The Vector Store Cost Model
Storage Costs
Raw vector storage plus index overhead. HNSW indexes add 40-60% overhead. Calculate: (vector_count × dimensions × 4 bytes) × (1 + index overhead); see the estimator sketch after this list.
Compute Costs
Query processing and index maintenance. OpenSearch Serverless uses OCUs ($0.24/OCU-hour, 4-OCU minimum)...
Embedding Generation
Often the largest cost component. OpenAI text-embedding-3-small costs $0.02/1M tokens. At 500 tokens...
Data Transfer
Frequently overlooked but significant at scale. Cross-AZ transfer costs $0.01/GB. Embedding API call...
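Back-of-the-Envelope Cost Estimator (Python)
The storage and embedding terms above reduce to simple arithmetic. A sketch, assuming float32 vectors (4 bytes per dimension) and a 50% HNSW overhead taken from the middle of the 40-60% range:
def storage_gb(vector_count: int, dims: int, index_overhead: float = 0.5) -> float:
    # float32 storage plus HNSW index overhead
    return vector_count * dims * 4 * (1 + index_overhead) / 1e9

def embedding_cost_usd(chunks: int, tokens_per_chunk: int = 500,
                       usd_per_m_tokens: float = 0.02) -> float:
    # e.g., OpenAI text-embedding-3-small at $0.02 per 1M tokens
    return chunks * tokens_per_chunk / 1e6 * usd_per_m_tokens

print(storage_gb(10_000_000, 1536))    # ~92 GB for 10M 1536-dim vectors
print(embedding_cost_usd(10_000_000))  # $100 to embed 10M 500-token chunks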
Framework
Vector Store Selection Matrix
Query Pattern Analysis
Evaluate whether your workload is read-heavy (RAG applications), write-heavy (continuous embedding updates)...
Scale Trajectory Assessment
Project your vector count growth over 12-24 months. If you're starting under 1 million vectors but e...
Integration Complexity Score
Audit your existing data infrastructure. If you already use PostgreSQL for application data, pgvector...
Latency Budget Allocation
Define your P99 latency requirements. Real-time applications like chatbots need sub-100ms responses,...
Case Study (Notion): Building AI-Powered Search Across 200 Million Pages
Achieved 89ms average query latency across their entire corpus, reduced infrastructure...
OpenSearch Serverless vs Aurora pgvector: Deep Technical Analysis
OpenSearch Serverless
Native k-NN plugin with HNSW and IVF algorithms, supporting ...
Automatic scaling from 0 to thousands of OCUs based on query volume
Built-in hybrid search combining BM25 text relevance with vector similarity
Distributed architecture handles billions of vectors across ...
Aurora PostgreSQL pgvector
pgvector extension supports up to 2,000 dimensions with HNSW and IVFFlat indexes
Vertical scaling through instance resizing, horizontal read scaling via replicas
Requires application-level implementation for hybrid search ...
Single-node index limits practical vector counts to 50-100 million
Implementing Production-Grade RAG with OpenSearch Serverless
1. Create Collection with Optimal Settings
2. Design Your Index Mapping (see the mapping sketch after this list)
3. Implement Chunking Pipeline
4. Generate and Index Embeddings
5. Build Query Pipeline with Reranking
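k-NN Index Mapping for OpenSearch Serverless (Python)
A minimal sketch of step 2's mapping, assuming 1536-dimension embeddings, the faiss engine, and starting-point HNSW parameters (m=16, ef_construction=128) to tune against your recall targets; client is an opensearch-py client like the one constructed in the hybrid-search example below.
index_body = {
    'settings': {'index.knn': True},  # enable the k-NN codec on this index
    'mappings': {'properties': {
        'embedding': {
            'type': 'knn_vector',
            'dimension': 1536,
            'method': {'name': 'hnsw', 'engine': 'faiss', 'space_type': 'l2',
                       'parameters': {'m': 16, 'ef_construction': 128}},
        },
        'content': {'type': 'text'},       # BM25 side of hybrid search
        'tenant_id': {'type': 'keyword'},  # exact-match metadata filtering
    }},
}
client.indices.create(index='documents', body=index_body)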
Anti-Pattern: The Monolithic Vector Index
❌ Problem
Query relevance plummeted because the embedding space became polluted with semantically unrelated content...
✓ Solution
Implement domain-specific indices with a routing layer. Create separate indices ...
Hybrid Search Query with OpenSearch Serverless (Python)
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
import boto3

# Initialize client with IAM authentication
region = 'us-east-1'
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, service, session_token=credentials.token)
client = OpenSearch(
    hosts=[{'host': 'your-collection-id.us-east-1.aoss.amazonaws.com', 'port': 443}],
    http_auth=awsauth, use_ssl=True, verify_certs=True,
    connection_class=RequestsHttpConnection)

# query_vec: the embedded user query (e.g., 1536 floats); BM25 and k-NN scores combine
response = client.search(index='documents', body={'query': {'bool': {'should': [
    {'match': {'content': 'error code E-4521'}},
    {'knn': {'embedding': {'vector': query_vec, 'k': 10}}}]}}})
47% reduction in hallucinations with proper chunking strategies
Anthropic's research found that RAG systems using semantic chunking with appropriate overlap reduced hallucination rates by 47% compared to fixed-size chunking.
Key Insight
The Hidden Cost of Vector Dimension Inflation
Many teams default to OpenAI's text-embedding-ada-002 (1536 dimensions) or larger models without considering the compound costs. Each dimension adds storage, memory, and compute overhead that scales linearly with your vector count.
Vector Store Production Readiness Assessment
Case Study (Stripe): Scaling Document Intelligence with Aurora pgvector
Achieved 45ms average query latency for their AI documentation assistant, reduced...
OpenSearch Serverless Cold Start Implications
OpenSearch Serverless can scale to zero OCUs during periods of inactivity, but cold starts add 30-60 seconds to the first query. For production RAG applications, configure minimum OCU capacity (at least 2 OCUs for search) to eliminate cold starts.
Framework
RAG Quality Optimization Loop
Retrieval Metrics Collection
Instrument your pipeline to log query text, retrieved chunks, relevance scores, and final LLM responses...
Hypothesis Formation
Based on failure analysis, form specific hypotheses (e.g., 'reducing chunk size from 512 to 256 tokens...
Embedding Model Evaluation
Quarterly, benchmark your current embedding model against newer alternatives using your production q...
Aurora pgvector with Metadata Filtering (Python)
import psycopg2
from pgvector.psycopg2 import register_vector
import numpy as np

# Connection with pgvector extension
conn = psycopg2.connect(
    host="your-aurora-cluster.cluster-xxx.us-east-1.rds.amazonaws.com",
    database="vectors_db",
    user="admin",
    password="your-password"
)
register_vector(conn)

# Top-5 nearest chunks for one tenant; <=> is pgvector's cosine-distance operator
query_vec = np.random.rand(1536).astype(np.float32)  # stand-in for a real query embedding
cur = conn.cursor()
cur.execute(
    "SELECT document_id, content, embedding <=> %s AS distance "
    "FROM document_embeddings "
    "WHERE tenant_id = %s AND metadata @> '{\"status\": \"published\"}' "  # illustrative JSONB filter
    "ORDER BY embedding <=> %s LIMIT 5",
    (query_vec, "tenant-a", query_vec))
rows = cur.fetchall()  # (document_id, content, distance) tuples, nearest first
Key Insight
Amazon Kendra: When Semantic Search Isn't Enough
While vector databases excel at semantic similarity, Amazon Kendra provides enterprise search capabilities that pure vector stores lack. Kendra's ML-powered ranking considers document authority, freshness, user feedback signals, and organizational context—factors that embedding similarity alone cannot capture.
Multi-Stage RAG Pipeline Architecture: User Query → Query Cache (ElastiCache) → Query Embedding (Bedrock) → Metadata Filter (SQL) → ...
Practice Exercise: Build a Hybrid Search Benchmark Suite (90 min)
Leverage OpenSearch Index State Management
Configure ISM policies to automatically manage vector index lifecycle. Set up policies that move indices older than 30 days to warm storage (UltraWarm), reducing costs by 90% while maintaining query capability.
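ISM Policy for Aging Vector Indices (Python)
A minimal sketch of such a policy, assuming a managed OpenSearch domain with UltraWarm nodes (ISM and UltraWarm are not available on Serverless collections) and a hypothetical policy name vector-lifecycle; client is an opensearch-py client.
# Move indices from hot to UltraWarm once they are 30 days old
ism_policy = {
    'policy': {
        'description': 'Tier aging vector indices to UltraWarm',
        'default_state': 'hot',
        'states': [
            {'name': 'hot', 'actions': [],
             'transitions': [{'state_name': 'warm',
                              'conditions': {'min_index_age': '30d'}}]},
            {'name': 'warm', 'actions': [{'warm_migration': {}}],
             'transitions': []},
        ],
    }
}
client.transport.perform_request(
    'PUT', '/_plugins/_ism/policies/vector-lifecycle', body=ism_policy)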
Anti-Pattern: Ignoring Query-Document Asymmetry
❌ Problem
Short queries produce embeddings that don't align well with document chunk embeddings...
✓ Solution
Use asymmetric embedding models designed for query-document retrieval (like Cohere's embed models)...
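Asymmetric Query and Document Embeddings (Python)
A minimal sketch using Cohere's v3 embed models, where input_type tells the model whether it is embedding a short query or a document chunk; the model name, example texts, and API-key handling are assumptions.
import cohere

co = cohere.Client('your-api-key')  # hypothetical credentials

# Documents and queries share one vector space but are embedded asymmetrically
doc_vectors = co.embed(
    texts=['Error E-4521 indicates a failed disk write during checkpoint.'],
    model='embed-english-v3.0', input_type='search_document').embeddings

query_vector = co.embed(
    texts=['what does error E-4521 mean'],
    model='embed-english-v3.0', input_type='search_query').embeddings[0]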
Practice Exercise: Build a Complete RAG Pipeline with OpenSearch Serverless
Anti-Pattern: Using Single-Threaded Batch Indexing
❌ Problem
A 10 million document migration takes 2-3 days instead of 2-3 hours. During this...
✓ Solution
Implement parallel batch indexing with configurable concurrency. Use asyncio or a thread pool...
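Parallel Bulk Indexing (Python)
A minimal sketch of the parallel approach using a thread pool and the opensearch-py bulk helper; client is an OpenSearch client as constructed earlier, docs is your prepared list of documents with precomputed embeddings, and the batch size and worker count are illustrative values to tune.
from concurrent.futures import ThreadPoolExecutor
from opensearchpy import helpers

def index_batch(batch):
    # Each source document already carries its embedding under 'embedding'
    actions = [{'_index': 'documents', '_source': doc} for doc in batch]
    return helpers.bulk(client, actions)

batches = [docs[i:i + 500] for i in range(0, len(docs), 500)]  # 500-doc bulk requests
with ThreadPoolExecutor(max_workers=8) as pool:                # 8 concurrent indexers
    results = list(pool.map(index_batch, batches))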
Embedding Model Lock-In Risk
Your vector store is tightly coupled to your embedding model's output dimensions and semantic space. Changing embedding models requires complete reindexing of all vectors—a process that can take days for large datasets and requires maintaining dual indexes during migration.
Practice Exercise: Build a Multi-Tenant Vector Search System (120 min)
Framework
Vector Store Observability Framework
Query Performance Monitoring
Track latency distributions (p50, p95, p99) for embedding generation, vector search, and total request time...
Search Quality Metrics
Implement offline evaluation using labeled test sets to measure precision, recall, and NDCG. Track o...
Resources
AWS re:Invent 2023: Building RAG Applications with OpenSearch Serverless (video)
pgvector GitHub Repository and Documentation (tool)
LangChain Vector Store Integration Guide (article)
Anti-Pattern: Ignoring Query Result Diversity
❌ Problem
LLM responses become repetitive and miss important information that exists in the...
✓ Solution
Implement Maximum Marginal Relevance (MMR) or similar diversity algorithms that ...
Maximum Marginal Relevance Implementation (Python)
import numpy as np
from typing import List, Tuple

def maximal_marginal_relevance(
    query_embedding: np.ndarray,
    candidate_embeddings: np.ndarray,
    candidate_contents: List[str],
    k: int = 5,
    lambda_param: float = 0.7,
    initial_k: int = 20
) -> List[Tuple[int, str, float]]:
    """Select k results balancing relevance and diversity (assumes unit-norm embeddings)."""
    sims = candidate_embeddings @ query_embedding  # relevance of each candidate to the query
    candidates = list(np.argsort(sims)[::-1][:initial_k])
    selected = [candidates.pop(0)]  # seed with the single most relevant candidate
    while candidates and len(selected) < k:
        # MMR score: relevance minus worst-case redundancy against already-picked results
        best = max(candidates, key=lambda i: lambda_param * sims[i] - (1 - lambda_param)
                   * max(float(candidate_embeddings[i] @ candidate_embeddings[j]) for j in selected))
        selected.append(best)
        candidates.remove(best)
    return [(i, candidate_contents[i], float(sims[i])) for i in selected]
Cost Optimization Through Tiered Storage
Implement a tiered vector storage strategy based on access patterns. Keep frequently accessed vectors (last 30 days, high-traffic documents) in your primary vector store with fast retrieval.
67% of RAG system latency comes from embedding generation
Most teams focus on optimizing vector search when the real bottleneck is embedding API calls.
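Caching Embedding Calls (Python)
Because repeated and near-duplicate queries are common, an embedding cache attacks this bottleneck directly. A minimal in-process sketch, assuming embed_fn wraps your embedding API call; a shared store such as ElastiCache would replace the dict in production.
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Memoize embeddings by content hash so only cache misses pay API latency."""
    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self.embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode('utf-8')).hexdigest()
        if key not in self._store:
            self._store[key] = self.embed_fn(text)  # only a miss pays the API call
        return self._store[key]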
Practice Exercise: Implement Vector Search Caching Layer (45 min)
Anti-Pattern: Treating Vector Search as Exact Match
❌ Problem
Applications fail silently when semantic search returns related but not exact matches...
✓ Solution
Design for probabilistic retrieval by implementing confidence thresholds and res...
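Confidence-Thresholded Retrieval (Python)
A minimal sketch of that design: hits below an assumed cosine-similarity floor are dropped, and an empty result triggers an explicit fallback instead of silently passing weak matches to the LLM; search_results is a hypothetical list of (document, score) pairs from your vector store.
from typing import List, Tuple

def filter_by_confidence(hits: List[Tuple[str, float]],
                         min_score: float = 0.75) -> List[Tuple[str, float]]:
    """Keep only hits whose similarity clears the floor."""
    return [(doc, score) for doc, score in hits if score >= min_score]

confident = filter_by_confidence(search_results)
if not confident:
    # Fall back rather than feeding weakly related chunks to the LLM
    answer = "No confident match found; try rephrasing or adding detail."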
Test with Production-Scale Data
Vector search performance characteristics change dramatically at scale. An index that performs perfectly with 100,000 vectors may have completely different latency and recall characteristics at 10 million vectors.
Chapter Complete!
AWS offers three primary vector store options: OpenSearch Serverless, Aurora PostgreSQL pgvector, and ...
Production vector search requires comprehensive infrastructure...
Chunk size and overlap significantly impact retrieval quality...
Hybrid search combining semantic vectors with keyword matching...
Next: Start by deploying a proof-of-concept with OpenSearch Serverless using the code examples provided, indexing 10,000 representative documents from your corpus