Implementing Memory on AWS

Building Intelligent Memory Systems for Production AI Agents on AWS

Memory is what transforms a stateless AI model into an intelligent agent capable of learning, adapting, and maintaining context across interactions. In production environments, implementing robust memory systems requires careful orchestration of multiple AWS services, each optimized for a specific memory pattern, from sub-millisecond working memory to petabyte-scale long-term storage.

Key Insight

The Four Pillars of Agent Memory Architecture

Production AI agents require four distinct memory types, each served by different AWS services optimized for specific access patterns. Working memory handles immediate context with sub-millisecond latency using ElastiCache Redis, typically storing the current conversation turn and active tool states.

47ms

Average memory retrieval latency for production agents at scale

This benchmark represents the combined latency of fetching working memory from ElastiCache, session state from DynamoDB, and relevant context from OpenSearch.

Framework

MARS: Memory Architecture for Reliable Systems

Mutability Classification

Categorize data by how frequently it changes. High-mutability data like current conversation state b...

Access Pattern Analysis

Map read/write ratios and latency requirements for each data type. Working memory sees 100:1 read:wr...

Retention Policy Design

Define TTLs and archival rules for each memory tier. Session memory typically expires after 24 hours...

Synchronization Strategy

Design how memories flow between tiers. Working memory should flush to short-term storage every conv...

Notion

Building Semantic Memory for AI Writing Assistants

The new architecture reduced AI response latency by 62% while improving relevanc...

Agent Memory Flow Architecture

User Request

ElastiCache (Working...

DynamoDB (Session St...

OpenSearch (Semantic...

DynamoDB vs. Traditional SQL for Agent State Management

DynamoDB

Single-digit millisecond latency at any scale with no perfor...

Automatic scaling handles traffic spikes without manual inte...

Pay-per-request pricing ideal for variable agent workloads

Built-in TTL automatically expires stale session data

PostgreSQL (RDS/Aurora)

Latency increases under load, requiring careful capacity pla...

Manual scaling with potential downtime during resize operati...

Fixed costs regardless of actual usage, expensive for bursty...

Requires scheduled jobs or triggers for data cleanup

Memory Isolation is Non-Negotiable for Multi-Tenant Agents

Every memory access must be scoped to the correct tenant, user, and session. A single bug that leaks one user's memories to another can destroy trust and create legal liability.
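
One way to make that scoping hard to forget is to build every storage key through a single helper that refuses incomplete scopes. The sketch below is illustrative only: `MemoryScope`, `scoped_key`, and the `TENANT#...#USER#...#SESSION#...` layout are assumptions made for this example, not names from any AWS SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryScope:
    """Identifies exactly whose memory is being accessed (hypothetical helper)."""
    tenant_id: str
    user_id: str
    session_id: str

    def __post_init__(self):
        for name in ("tenant_id", "user_id", "session_id"):
            value = getattr(self, name)
            # Reject empty parts and the '#' delimiter so one scope can never
            # be crafted to look like another tenant's key prefix.
            if not value or "#" in value:
                raise ValueError(f"invalid {name}: {value!r}")


def scoped_key(scope: MemoryScope, suffix: str) -> str:
    """Build a storage key that always carries tenant, user, and session."""
    return (
        f"TENANT#{scope.tenant_id}#USER#{scope.user_id}"
        f"#SESSION#{scope.session_id}#{suffix}"
    )
```

Because the scope object cannot be constructed without all three identifiers, code paths that read or write memory cannot silently omit the tenant.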

DynamoDB Table Design for Agent Session Memory (TypeScript)

import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand, QueryCommand } from '@aws-sdk/lib-dynamodb';

interface SessionMemory {
  tenantId: string;
  sessionId: string;
  messageId: string;
  timestamp: number;
  role: 'user' | 'assistant' | 'system';
  content: string;
  metadata: Record<string, any>;
  ttl: number; // expiry as Unix epoch seconds, the format DynamoDB TTL expects
}
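
For comparison, here is a minimal Python sketch of building an item with the same fields as the interface above. The function name `build_session_item`, the millisecond `timestamp`, and the 24-hour expiry constant are assumptions for illustration; only the requirement that a DynamoDB TTL attribute holds epoch seconds is a property of the service.

```python
import time
import uuid
from typing import Any, Dict, Optional

SESSION_TTL_SECONDS = 24 * 60 * 60  # assumption: sessions expire after 24 hours


def build_session_item(tenant_id: str, session_id: str, role: str, content: str,
                       metadata: Optional[Dict[str, Any]] = None,
                       now: Optional[float] = None) -> Dict[str, Any]:
    """Return a dict shaped like the SessionMemory interface (hypothetical helper)."""
    if role not in {"user", "assistant", "system"}:
        raise ValueError(f"unknown role: {role!r}")
    now = time.time() if now is None else now
    return {
        "tenantId": tenant_id,
        "sessionId": session_id,
        "messageId": str(uuid.uuid4()),
        "timestamp": int(now * 1000),           # milliseconds, used for ordering
        "role": role,
        "content": content,
        "metadata": metadata or {},
        "ttl": int(now) + SESSION_TTL_SECONDS,  # epoch seconds for DynamoDB TTL
    }
```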

Key Insight

Vector Embeddings Are the Bridge Between Language and Memory

Vector embeddings transform text into numerical representations that capture semantic meaning, enabling agents to retrieve relevant memories even when the exact words differ. When a user asks about 'quarterly revenue,' a well-designed vector search can retrieve memories about 'Q3 earnings,' 'financial results,' or 'sales performance' because these concepts cluster together in embedding space.
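
The clustering idea can be illustrated with cosine similarity over toy vectors. The three-dimensional vectors below are hand-made stand-ins, not output from a real embedding model, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], memories: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k memories most similar to the query, best first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in memories.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors: the first axis loosely stands for "finance", the last for "food".
TOY_MEMORIES = {
    "Q3 earnings summary": [0.9, 0.1, 0.0],
    "Sales performance review": [0.8, 0.3, 0.1],
    "Office lunch menu": [0.0, 0.1, 0.9],
}
```

A query vector pointing along the "finance" axis, standing in for 'quarterly revenue', ranks the two finance memories ahead of the lunch menu even though no words are shared.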

Anti-Pattern: The Infinite Context Window Fallacy

❌ Problem

Costs explode as you pay for tokens the model largely ignores. Latency increases...

✓ Solution

Implement semantic retrieval that fetches only relevant memories for each query....
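
A rough sketch of the retrieval-side fix: given memories that already carry a relevance score, keep only the best ones that fit a token budget. The four-characters-per-token estimate, the `min_score` cutoff, and the function names are assumptions for illustration.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def select_context(scored_memories: List[Tuple[str, float]], token_budget: int,
                   min_score: float = 0.5) -> List[str]:
    """Pick the highest-scoring memories that fit within the token budget."""
    selected: List[str] = []
    used = 0
    for text, score in sorted(scored_memories, key=lambda m: m[1], reverse=True):
        if score < min_score:
            break  # everything after this point is even less relevant
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            continue  # too large for the remaining budget, try smaller ones
        selected.append(text)
        used += cost
    return selected
```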

Implementing Your First Agent Memory System on AWS

Create the DynamoDB Session Table

Set Up OpenSearch Serverless Collection

Deploy ElastiCache Redis Cluster

Implement the Memory Service Layer

Build the Memory Retrieval Pipeline

Memory System Production Readiness Checklist

Key Insight

ElastiCache Redis: The Speed Layer Your Agents Need

ElastiCache Redis serves as the speed layer in your memory architecture, providing sub-millisecond access to frequently-used data. For agent systems, Redis excels at three critical functions: caching assembled context to avoid repeated DynamoDB and OpenSearch queries, storing working memory for multi-step tool execution, and maintaining rate limiting counters for API calls.
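
The first of those functions, caching assembled context, follows the cache-aside pattern. The sketch below uses an in-memory stand-in so it runs without a Redis server; `InMemoryCache`, `get_context`, and the `ctx:` key prefix are assumptions, and the stand-in only mimics the `get` and `setex` call shapes a Redis client exposes.

```python
import json
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryCache:
    """Stand-in offering the get/setex subset of a Redis client, for local runs."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[0] <= time.monotonic():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]

    def setex(self, key: str, ttl_seconds: int, value: str) -> None:
        self._data[key] = (time.monotonic() + ttl_seconds, value)


def get_context(cache, session_id: str, assemble: Callable[[str], dict],
                ttl_seconds: int = 300) -> dict:
    """Cache-aside read: return the cached context or assemble and cache it."""
    key = f"ctx:{session_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    context = assemble(session_id)  # slow path: DynamoDB and OpenSearch in practice
    cache.setex(key, ttl_seconds, json.dumps(context))
    return context
```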

Anthropic

Scaling Claude's Memory for Enterprise Deployments

The redesigned system reduced memory-related latency by 73% and decreased storag...

Use DynamoDB Single-Table Design for Agent State

Rather than creating separate tables for sessions, messages, user preferences, and agent state, use a single-table design with carefully crafted partition and sort keys. This lets one Query fetch related items of different entity types in a single round trip, leaves fewer tables to provision, monitor, and back up, and can improve cost efficiency.
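
A minimal sketch of what "carefully crafted" keys might look like. The `PK`/`SK` attribute names, the entity prefixes, and the helper functions are assumptions chosen for illustration, not a prescribed schema.

```python
from typing import Dict


def session_key(tenant_id: str, session_id: str) -> Dict[str, str]:
    """Key for the session's metadata item."""
    return {"PK": f"TENANT#{tenant_id}#SESSION#{session_id}", "SK": "META"}


def message_key(tenant_id: str, session_id: str, turn_number: int) -> Dict[str, str]:
    """Key for one message; shares the session's partition key."""
    return {
        "PK": f"TENANT#{tenant_id}#SESSION#{session_id}",
        # Zero-pad so lexicographic sort-key order matches numeric turn order.
        "SK": f"MSG#{turn_number:06d}",
    }


def preference_key(tenant_id: str, user_id: str, name: str) -> Dict[str, str]:
    """Key for a user preference, stored under the user's partition."""
    return {"PK": f"TENANT#{tenant_id}#USER#{user_id}", "SK": f"PREF#{name}"}
```

With this layout, a Query on the session's partition key with a `begins_with(SK, "MSG#")` condition returns only that session's messages, already in turn order.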

Practice Exercise

Build a Memory-Enabled Conversation Agent

45 min

Essential Resources for Agent Memory Implementation

AWS DynamoDB Developer Guide - Best Practices (article)

OpenSearch Vector Search Tutorial (article)

Redis University - RU101: Introduction to Redis Data Structures (video)

LangChain Memory Documentation (article)

Key Insight

S3: The Artifact Memory Layer for Generated Content

While DynamoDB and OpenSearch handle structured data and vectors, S3 serves as the artifact memory layer for large, unstructured content that agents generate or reference. This includes generated images, PDF reports, code files, audio transcriptions, and any content too large for database storage (a DynamoDB item, for example, cannot exceed 400 KB).
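
A common way to combine the two is to store the artifact in S3 and keep only a small pointer record alongside the session data. The sketch below only builds that record; the `artifacts/...` key layout, the 100 KB inline threshold, and the function names are assumptions, and no upload is performed.

```python
import hashlib
from typing import Any, Dict

DYNAMODB_MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's documented item size limit
INLINE_THRESHOLD_BYTES = 100 * 1024   # assumption: stay well under that limit


def artifact_key(tenant_id: str, session_id: str, content: bytes, extension: str) -> str:
    """Content-addressed S3 key, so identical artifacts map to the same object."""
    digest = hashlib.sha256(content).hexdigest()
    return f"artifacts/{tenant_id}/{session_id}/{digest}.{extension}"


def build_artifact_record(tenant_id: str, session_id: str, content: bytes,
                          extension: str, bucket: str) -> Dict[str, Any]:
    """Return either an inline record or a pointer to where the bytes belong in S3."""
    if len(content) <= INLINE_THRESHOLD_BYTES:
        return {"storage": "inline", "size": len(content), "body": content}
    return {
        "storage": "s3",
        "size": len(content),
        "bucket": bucket,
        "key": artifact_key(tenant_id, session_id, content, extension),
    }
```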

Framework

The Memory Tier Architecture Pattern

Hot Memory Layer (ElastiCache)

Sub-millisecond access for active conversation context, current task state, and frequently accessed ...

Warm Memory Layer (DynamoDB)

Single-digit millisecond access for session history, task queues, and structured metadata. This laye...

Semantic Memory Layer (OpenSearch)

Tens of milliseconds for vector similarity search, enabling agents to retrieve relevant past experie...

Cold Memory Layer (S3)

Hundreds of milliseconds for large artifacts, complete conversation archives, and audit trails. This...

DynamoDB On-Demand vs Provisioned Capacity for Agent Workloads

On-Demand Capacity

Automatic scaling handles unpredictable agent traffic spikes...

Pay per request pricing ($1.25 per million writes, $0.25 per...

Zero capacity planning required—ideal for new agent deployme...

Instant scaling from zero to thousands of requests per secon...

Provisioned Capacity

Predictable costs when you understand your agent's traffic p...

Reserved capacity pricing can reduce costs by 70% for steady...

Auto-scaling available but requires configuration and has re...

Risk of throttling if traffic exceeds provisioned capacity

Notion

Building Notion AI's Memory System with DynamoDB

Memory retrieval latency dropped from 150ms to 23ms average, while storage costs...

DynamoDB Single-Table Design for Agent Memory (Python)

import boto3
from datetime import datetime, timedelta
import json

class AgentMemoryStore:
    def __init__(self, table_name: str):
        self.dynamodb = boto3.resource('dynamodb')
        self.table = self.dynamodb.Table(table_name)
    
    def save_turn(self, session_id: str, turn_number: int, 
                  user_input: str, agent_response: str, 
                  tool_calls: list = None):

Key Insight

OpenSearch Serverless Eliminates Vector Database Operations Overhead

OpenSearch Serverless added vector engine support in 2023, which removes most of the operational work of running a vector store for agent memory. Instead of managing cluster sizing, shard allocation, and capacity planning, you create a collection and start indexing vectors.

Implementing Semantic Memory with OpenSearch

Create OpenSearch Serverless Collection

Design Your Vector Index Schema

Implement Embedding Generation Pipeline

Build the Retrieval Interface

Configure Memory Lifecycle Management
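
The index schema and retrieval steps can be sketched as plain request bodies. The field names (`tenant_id`, `text`, `created_at`, `embedding`), the HNSW/faiss/L2 method settings, and the helper names are assumptions for illustration; check the engine and filter options against the k-NN documentation for your collection before relying on them.

```python
from typing import Any, Dict, List


def build_memory_index(dimension: int) -> Dict[str, Any]:
    """Index body with a knn_vector field sized to the embedding model's output."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "tenant_id": {"type": "keyword"},
                "text": {"type": "text"},
                "created_at": {"type": "date"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
            }
        },
    }


def build_memory_query(vector: List[float], tenant_id: str, k: int = 5) -> Dict[str, Any]:
    """k-NN search body restricted to a single tenant's memories."""
    return {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": vector,
                    "k": k,
                    "filter": {"term": {"tenant_id": tenant_id}},
                }
            }
        },
    }
```

Keeping the tenant filter inside the query builder means no caller can run a similarity search across tenants by accident.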

Anti-Pattern: Storing Raw Conversation Text as Embeddings

❌ Problem

Retrieval quality degrades as the memory store grows. Agents retrieve irrelevant...

✓ Solution

Extract and embed semantic summaries rather than raw text. Before embedding, use...

Intercom

Fin AI's Multi-Tier Memory Architecture

Average memory retrieval time dropped from 340ms to 67ms. Customer satisfaction ...

ElastiCache Cluster Mode Disabled vs Enabled

For agent memory workloads, start with Cluster Mode Disabled unless you need more than 500GB of data or 500,000 operations per second. Cluster Mode Enabled adds complexity with hash slot management and cross-slot operation limitations that can break common agent patterns like MULTI/EXEC transactions across different keys.
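
If you do move to Cluster Mode Enabled, the usual workaround for cross-slot errors is a hash tag: Redis Cluster hashes only the part of the key between the first `{` and the next `}`, so keys that share a tag land in the same slot. The sketch below reproduces that slot calculation locally (CRC16/XMODEM modulo 16384); the function name `hash_slot` and the example key names are assumptions.

```python
import binascii


def hash_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot for a key, honoring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with an initial value of 0 is the CRC16/XMODEM variant.
    return binascii.crc_hqx(data, 0) % 16384
```

Naming session keys like `{session:42}:context` and `{session:42}:tools` keeps them in one slot, so a MULTI/EXEC transaction touching both remains valid.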

ElastiCache Working Memory Implementation (Python)

import redis
import json
from datetime import datetime, timedelta
from typing import Optional, List, Dict
import hashlib

class AgentWorkingMemory:
    def __init__(self, redis_url: str, default_ttl: int = 1800):
        self.redis = redis.from_url(redis_url, decode_responses=True)
        self.default_ttl = default_ttl  # 30 minutes
    
    def get_or_create_context(self, session_id: str) -> dict:

847x

Latency difference between ElastiCache and DynamoDB for hot data

ElastiCache delivers sub-millisecond response times (0.2-0.5ms). DynamoDB typically returns simple key lookups in single-digit milliseconds, but the complex queries behind this comparison took 170-400ms.

Framework

The Memory Consistency Model for Multi-Agent Systems

Strong Consistency (DynamoDB)

Use for state that must never conflict: task ownership, workflow status, financial transactions. Dyn...

Eventual Consistency (Default)

Acceptable for most agent memory: conversation history, cached tool results, user preferences. Reads...

Session Consistency (ElastiCache)

A single agent instance always sees its own writes immediately through cache locality. Different ins...

Causal Consistency (Event Sourcing)

Operations that depend on each other are seen in order, but independent operations may be reordered....
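
For the strong-consistency case, task ownership is typically enforced with a conditional write rather than a read followed by a write. The sketch below only builds the arguments for a low-level DynamoDB `put_item` call; the `TASK#` key format, the `ownerAgentId` attribute, and the function name are assumptions.

```python
from typing import Any, Dict


def claim_task_request(table_name: str, task_id: str, agent_id: str) -> Dict[str, Any]:
    """Arguments for a put_item call that succeeds only if the task is unclaimed."""
    return {
        "TableName": table_name,
        "Item": {
            "PK": {"S": f"TASK#{task_id}"},
            "ownerAgentId": {"S": agent_id},
        },
        # The write is rejected if an item with this key already exists,
        # so two agents can never both believe they own the task.
        "ConditionExpression": "attribute_not_exists(PK)",
    }
```

When the condition fails, the service rejects the write with a `ConditionalCheckFailedException`, which the losing agent can treat as "someone else owns this task".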

S3 Artifact Storage Best Practices

Complete Agent Memory Architecture on AWS

Agent Request

ElastiCache (Working...

DynamoDB (Session St...

↓

OpenSearch (Semantic...

Stripe

Stripe's Fraud Detection Agent Memory System

Decision latency dropped from 230ms to 47ms while fraud detection accuracy impro...

Practice Exercise

Build a Multi-Tier Memory System for a Customer Support Agent

90 min

Memory Encryption Requirements for Production

Enable encryption at rest for all memory tiers: DynamoDB (AWS-managed or CMK), ElastiCache (at-rest and in-transit encryption), OpenSearch (node-to-node encryption and encryption at rest), and S3 (SSE-S3 or SSE-KMS). For agents handling PII or sensitive data, use customer-managed KMS keys to maintain key rotation control and audit trails.

Key Insight

Memory Partitioning Strategies Determine Scale Limits

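How you partition agent memory across storage systems determines your maximum scale. The naive approach of one DynamoDB partition key per session works until sessions grow to thousands of turns: all of a session's traffic then lands on a single partition, which is capped at roughly 3,000 read and 1,000 write capacity units per second, and tables with local secondary indexes also run into the 10 GB item-collection limit.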
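
One common mitigation is to spread a long session across several partition keys by bucketing turns. This is a sketch under stated assumptions: the bucket size of 500 turns, the key format, and the helper names are illustrative and would need tuning to your item sizes and access patterns.

```python
from typing import List

TURNS_PER_BUCKET = 500  # assumption: tune to item size and read patterns


def bucketed_partition_key(session_id: str, turn_number: int) -> str:
    """Partition key for the bucket that holds the given turn."""
    bucket = turn_number // TURNS_PER_BUCKET
    return f"SESSION#{session_id}#BUCKET#{bucket:04d}"


def buckets_for_range(session_id: str, first_turn: int, last_turn: int) -> List[str]:
    """Partition keys that must be queried to read turns first_turn..last_turn."""
    first = first_turn // TURNS_PER_BUCKET
    last = last_turn // TURNS_PER_BUCKET
    return [f"SESSION#{session_id}#BUCKET#{b:04d}" for b in range(first, last + 1)]
```

Reading the most recent turns usually touches only the last bucket, so the common path stays a single Query.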

Practice Exercise

Build a Complete Agent Memory System

90 min

Complete Memory Manager Implementation (Python)

import boto3
import json
import hashlib
from datetime import datetime, timedelta
from typing import Dict, List, Optional
from opensearchpy import OpenSearch, RequestsHttpConnection
import redis

class AgentMemoryManager:
    def __init__(self, config: Dict):
        self.dynamodb = boto3.resource('dynamodb')
        self.s3 = boto3.client('s3')

Production Memory System Deployment Checklist

Anti-Pattern: The Monolithic Memory Store

❌ Problem

As the agent handles more conversations, scan operations become increasingly exp...

✓ Solution

Design a purpose-built architecture where each service handles what it does best...

Anti-Pattern: Ignoring Cache Invalidation Complexity

❌ Problem

Agents operate on stale context, leading to repetitive questions or contradictor...

✓ Solution

Implement a tiered TTL strategy based on data volatility—active conversation con...

Anti-Pattern: Over-Indexing Everything in OpenSearch

❌ Problem

OpenSearch costs grow 5-10x higher than necessary because storage and compute sc...

✓ Solution

Be selective about what gets indexed for semantic search. Only index memories wh...
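
One way to decide what is worth indexing is to score each memory and index only those above a threshold. The weights, the seven-day half-life, the cap of 20 accesses, and the field names (`created_at`, `access_count`, `user_flagged`) below are all assumptions for illustration, not values from the course.

```python
import math
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def importance_score(memory: Dict, now: Optional[datetime] = None,
                     half_life_days: float = 7.0) -> float:
    """Blend recency, access frequency, and an explicit flag into a 0..1 score."""
    if now is None:
        now = datetime.now(timezone.utc)
    age_days = max(0.0, (now - memory["created_at"]).total_seconds() / 86400)
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    # Log scaling so the 20th access matters far less than the 2nd.
    frequency = min(1.0, math.log1p(memory.get("access_count", 0)) / math.log1p(20))
    explicit = 1.0 if memory.get("user_flagged") else 0.0
    return round(0.5 * recency + 0.3 * frequency + 0.2 * explicit, 4)


def should_index(memory: Dict, threshold: float = 0.4,
                 now: Optional[datetime] = None) -> bool:
    """Index a memory for semantic search only if it scores above the threshold."""
    return importance_score(memory, now=now) >= threshold
```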

Practice Exercise

Implement Memory Consolidation Pipeline

60 min

Practice Exercise

Build Multi-Region Memory Replication

75 min

Memory Importance Scoring and Consolidation (Python)

import boto3
from datetime import datetime, timedelta
from typing import List, Dict
import anthropic

class MemoryConsolidator:
    def __init__(self, memory_manager, claude_client):
        self.memory = memory_manager
        self.claude = claude_client
        
    def calculate_importance_score(self, memory: Dict) -> float:
        """Calculate memory importance based on multiple factors"""

Essential Resources for AWS Memory Implementation

AWS DynamoDB Best Practices Guide (article)

OpenSearch k-NN Plugin Documentation (article)

ElastiCache for Redis Best Practices (article)

Building Serverless Applications with DynamoDB Streams (video)

Monitor DynamoDB Consumed Capacity Closely

DynamoDB throttling can cascade into cache misses and increased load on other services. Set up CloudWatch alarms for consumed read/write capacity exceeding 80% of provisioned capacity, and enable auto-scaling with appropriate minimum and maximum bounds.
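
The 80% threshold needs one conversion: the consumed-capacity metrics are reported as a sum over the period, while provisioned capacity is a per-second figure. The sketch below builds the arguments for a CloudWatch `put_metric_alarm` call; the alarm name format, the five evaluation periods, and the function name are assumptions.

```python
from typing import Any, Dict


def consumed_capacity_alarm(table_name: str, provisioned_units: int,
                            metric_name: str = "ConsumedReadCapacityUnits",
                            period_seconds: int = 60,
                            utilization: float = 0.8) -> Dict[str, Any]:
    """Alarm arguments that fire when consumption exceeds a share of provisioned capacity."""
    # Provisioned capacity is per second; the metric is a Sum over the period.
    threshold = provisioned_units * period_seconds * utilization
    return {
        "AlarmName": f"{table_name}-{metric_name}-high",
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 5,  # assumption: five consecutive breaching periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

For a table provisioned at 100 read capacity units, the one-minute threshold works out to 100 × 60 × 0.8 = 4,800 consumed units.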

Use DynamoDB Streams for Real-Time Cache Invalidation

Instead of implementing complex cache invalidation logic in your application, use DynamoDB Streams to trigger Lambda functions that update or invalidate cache entries. This decouples your write path from cache management and ensures consistency even when writes come from multiple sources.
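
A minimal sketch of such a Lambda handler follows. It assumes the table's partition key attribute is a string named `PK`, that cached context lives under a `ctx:` prefix, and that the cache client offers a Redis-style `delete(*keys)`; the optional `cache` parameter is a hypothetical injection point so the logic can run without a real cache.

```python
from typing import Any, Dict, List


def cache_keys_to_invalidate(event: Dict[str, Any]) -> List[str]:
    """Map DynamoDB Streams records to the cache keys they make stale."""
    keys: List[str] = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY", "REMOVE"):
            continue
        # Stream records carry the item's key attributes in DynamoDB JSON form.
        pk = record["dynamodb"]["Keys"]["PK"]["S"]
        cache_key = f"ctx:{pk}"
        if cache_key not in keys:  # several changes to one partition, one delete
            keys.append(cache_key)
    return keys


def handler(event: Dict[str, Any], context: Any, cache=None) -> Dict[str, Any]:
    """Lambda entry point: delete stale cache entries for changed items."""
    keys = cache_keys_to_invalidate(event)
    if cache is not None and keys:
        cache.delete(*keys)
    return {"invalidated": keys}
```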

Framework

Memory System Health Metrics Framework

Cache Effectiveness Score

Composite metric combining cache hit rate (target >85%), cache latency (target <5ms p99), and evicti...

Memory Retrieval Quality

Measures semantic search relevance through user feedback signals and retrieval-augmented generation ...

Storage Efficiency Ratio

Ratio of active memories to total stored memories, accounting for consolidation and archival. Target...

Cross-Service Latency Budget

End-to-end latency breakdown across cache lookup, DynamoDB query, OpenSearch search, and S3 retrieva...

99.999%

DynamoDB availability SLA for global tables

DynamoDB global tables carry a 99.999% monthly availability SLA, compared with 99.99% for standard single-Region tables, making them well suited to critical agent state.

Plan for Memory Migration from Day One

Your memory schema will evolve as agent capabilities expand. Design your DynamoDB schema with version fields and implement backward-compatible readers that handle multiple schema versions.
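
A backward-compatible reader can be as simple as a chain of upgrade functions applied at read time. In this sketch the `schema_version` field name and the v1-to-v2 change (renaming `text` to `content` and adding `metadata`) are invented for illustration.

```python
from typing import Any, Callable, Dict

CURRENT_SCHEMA_VERSION = 2


def _v1_to_v2(item: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical migration: v1 stored 'text', v2 stores 'content' plus 'metadata'."""
    upgraded = {k: v for k, v in item.items() if k != "text"}
    upgraded["content"] = item.get("text", "")
    upgraded.setdefault("metadata", {})
    upgraded["schema_version"] = 2
    return upgraded


UPGRADERS: Dict[int, Callable[[Dict[str, Any]], Dict[str, Any]]] = {1: _v1_to_v2}


def read_memory(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return the item in the current schema, upgrading older versions on read."""
    version = item.get("schema_version", 1)  # unversioned items are treated as v1
    if version > CURRENT_SCHEMA_VERSION:
        raise ValueError(f"item written by a newer schema: {version}")
    while version < CURRENT_SCHEMA_VERSION:
        item = UPGRADERS[version](item)
        version = item["schema_version"]
    return item
```

Upgrading on read lets old and new items coexist in the table, so a schema change does not require rewriting every stored memory up front.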

Notion

Building AI Memory for Millions of Workspaces

The architecture supports over 30 million workspaces with AI memory capabilities...

Practice Exercise

Implement Memory Access Audit Trail

45 min

Complete Memory System Data Flow

Agent Request

API Gateway

Lambda Handler

ElastiCache Check

Chapter Complete!

DynamoDB serves as the foundation for agent memory, providin...

OpenSearch enables semantic memory retrieval through vector ...

ElastiCache dramatically reduces latency and cost by caching...

S3 provides cost-effective storage for large artifacts like ...

Next: Begin by implementing a basic memory system using DynamoDB for state and ElastiCache for caching
