EXPANSION · 40 min · 61 sections

Agent Infrastructure on AWS

THIS WEEK'S JOURNEY

Building Production-Grade Agent Infrastructure on AWS

Deploying AI agents to production requires far more than wrapping an LLM in an API endpoint—it demands robust infrastructure that handles unpredictable execution times, maintains state across complex multi-step workflows, and scales seamlessly from prototype to millions of requests. AWS provides a uniquely powerful combination of serverless compute, managed orchestration, and persistent storage services that together form the backbone of production agent systems at companies like Anthropic, Notion, and Stripe.

94%

of production agent failures trace back to infrastructure issues, not model problems

This striking statistic reveals the critical importance of infrastructure in agent deployments.

Key Insight

The Agent Infrastructure Triangle: Compute, State, and Orchestration

Every production agent system rests on three fundamental pillars that must work in harmony. Compute handles the execution of individual tools and LLM calls; Lambda excels here with sub-second cold starts and automatic scaling.

Notion

Rebuilding their AI assistant infrastructure for 100x scale

Post-migration, Notion's AI assistant handles 2.3 million daily requests with p9...

Framework

The SCALE Framework for Agent Infrastructure

Statelessness

Design every compute component to be completely stateless, pushing all state to dedicated storage. T...

Compensation

Build compensation logic for every action an agent can take. When a multi-step workflow fails midway...

Asynchrony

Embrace asynchronous patterns throughout your architecture. Agents naturally involve unpredictable e...

Logging

Implement comprehensive structured logging from day one, capturing every LLM call, tool execution, a...
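The Compensation principle can be illustrated with a small saga-style runner. This is a minimal sketch, not a prescribed implementation: the step names, the `run_with_compensation` helper, and the in-memory `ledger` are all hypothetical, and a real deployment would more likely express the same idea as Catch branches in a Step Functions workflow.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns a log of what happened so the outcome can be inspected.
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"failed:{name}:{exc}")
            # Roll back newest-first so later effects are undone before earlier ones.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log


def _fail() -> None:
    raise RuntimeError("payment declined")


# Hypothetical three-step workflow where the last step fails.
ledger: List[str] = []
example_log = run_with_compensation([
    ("reserve", lambda: ledger.append("reserved"), lambda: ledger.remove("reserved")),
    ("email", lambda: ledger.append("emailed"), lambda: ledger.remove("emailed")),
    ("charge", _fail, lambda: None),
])
```

After the failing `charge` step, the two earlier steps are compensated in reverse order and `ledger` ends up empty again.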

Production Agent Infrastructure Architecture on AWS

[Diagram: API Gateway (Rate Li...) → Lambda Router (Reque...) → Step Functions (Orch...) → Lambda Tools (Execut...)]

Key Insight

Why Serverless Wins for Agent Workloads

Agent workloads have characteristics that make serverless architecture not just convenient but strategically superior. First, agent execution is inherently unpredictable—a simple query might resolve in one LLM call while a complex research task requires 50 tool invocations over several minutes.

Container-Based vs. Serverless Agent Infrastructure

ECS/EKS Containers

Predictable per-hour pricing but requires capacity planning ...

Unlimited execution time supports long-running agents but re...

Full control over runtime environment but increases operatio...

Warm instances eliminate cold starts but waste 70%+ of compu...

Lambda Serverless

Pay-per-millisecond billing perfectly matches unpredictable ...

15-minute timeout requires architectural patterns for long-r...

Managed runtime reduces operational overhead to near-zero, l...

Cold starts of 100-500ms are negligible compared to LLM late...

The Cold Start Myth in Agent Systems

Teams often reject Lambda for agents citing cold start concerns, but this fear is misplaced. A Lambda cold start adds 100-500ms to the first invocation, while a single GPT-4 or Claude API call takes 1-30 seconds.
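The comparison above is easy to check with a few lines of arithmetic. The sketch below uses the 100-500ms and 1-30 second ranges quoted in the text; the "typical" inputs (300ms cold start, three LLM calls of 5 seconds each) are assumptions chosen for illustration.

```python
def cold_start_share(cold_start_ms: float, llm_call_ms: float, llm_calls: int = 1) -> float:
    """Fraction of total request latency spent in a single cold start."""
    total_ms = cold_start_ms + llm_call_ms * llm_calls
    return cold_start_ms / total_ms


# Slowest quoted cold start paired with the fastest quoted LLM call.
worst_case = cold_start_share(500, 1_000)

# Assumed mid-range values: 300ms cold start, three 5-second LLM calls.
typical = cold_start_share(300, 5_000, llm_calls=3)
```

Under these assumptions the cold start is about a third of total latency in the least favorable pairing, and about 2% for a three-call agent turn. The argument therefore holds best for multi-call interactions, which is the common case for agents.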

Key Insight

The True Cost of Agent Infrastructure

Understanding the real cost structure of agent infrastructure prevents budget surprises and enables optimization. LLM API costs typically dominate at 60-80% of total spend—a single GPT-4 agent conversation might cost $0.10-0.50 in API fees.

Anti-Pattern: The Monolithic Agent Lambda

❌ Problem

Teams with monolithic agent Lambdas report 3x longer deployment cycles, 5x more ...

✓ Solution

Decompose your agent into discrete Lambda functions: one for each tool, one for ...

Agent Infrastructure Readiness Assessment

Stripe

Building their fraud detection agent infrastructure for real-time decisions

The new architecture handles 8,000 transactions per second with p99 latency of 8...

Key Insight

Multi-Region Considerations for Global Agent Deployments

Deploying agents globally introduces complexities that catch teams off guard. LLM API latency varies dramatically by region—calling OpenAI from ap-southeast-1 adds 200-400ms compared to us-east-1 due to network round trips.

Setting Up Your First Production Agent Infrastructure

Create the foundational DynamoDB tables

Deploy the core Lambda functions

Configure Step Functions workflow

Set up API Gateway integration

Implement authentication and rate limiting

Start with Infrastructure as Code from Day One

Use AWS CDK, Terraform, or SAM to define your agent infrastructure from the very first deployment. Teams that start with console-based setup invariably accumulate configuration drift and struggle to replicate environments.

Basic Agent Infrastructure with AWS CDK (TypeScript)

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

export class AgentInfrastructureStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // State management table
    const stateTable = new dynamodb.Table(this, 'AgentState', {

Essential Resources for Agent Infrastructure on AWS

AWS Well-Architected Framework - Machine Learning Lens

article

Serverless Land - Step Functions Patterns

article

AWS Lambda Power Tuning

tool

The DynamoDB Book by Alex DeBrie

book

Key Insight

The Hidden Complexity of Agent State Management

Agent state is deceptively complex because it spans multiple dimensions that traditional applications don't face. Conversation state tracks the ongoing dialogue and must be instantly accessible for response generation.

Framework

AWS Agent Infrastructure Stack

Execution Layer (Lambda)

Individual tool functions that perform discrete actions like API calls, data transformations, or ext...

Orchestration Layer (Step Functions)

Coordinates multi-step agent workflows, handles retries, manages parallel executions, and implements...

State Layer (DynamoDB)

Persists conversation history, agent memory, session context, and execution checkpoints. Use single-...

Interface Layer (API Gateway)

Exposes agent capabilities via REST or WebSocket APIs, handles authentication, rate limiting, and re...
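The retry handling in the orchestration layer follows exponential backoff. As a rough sketch, the function below computes the wait before each retry for a Step Functions Retry block, using the `IntervalSeconds: 2` and `MaxAttempts: 3` values that appear in this chapter's state machine listings. The `BackoffRate` of 2.0 is an assumption (it is the Step Functions default; the listings are cut off before any explicit value), and `retry_delays` is a hypothetical helper name.

```python
from typing import List


def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float = 2.0) -> List[float]:
    """Wait before each retry: interval * backoff_rate ** (retry_number - 1)."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


delays = retry_delays(interval_seconds=2, max_attempts=3)
total_wait = sum(delays)
```

With these settings a task that keeps failing waits 2, 4, and then 8 seconds, so retries alone can add 14 seconds to a workflow before the error reaches a Catch handler. That is worth budgeting for in user-facing paths.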

Notion

Building AI Assistant Infrastructure on AWS Lambda

Tool invocation latency dropped from 3.2 seconds average to 340ms. Cold start fr...

Step Functions Standard vs Express Workflows for Agent Orchestration

Standard Workflows

Maximum duration of 1 year - ideal for long-running agent ta...

Exactly-once execution guarantee - critical for agents perfo...

Priced per state transition ($0.025 per 1000) - cost-effecti...

Built-in execution history for 90 days - essential for debug...

Express Workflows

Maximum duration of 5 minutes - suitable for quick agent int...

At-least-once execution - acceptable for idempotent operatio...

Priced per execution and duration ($1 per million + duration...

Execution history via CloudWatch Logs - requires additional ...

Step Functions State Machine for Multi-Tool Agent (JSON)

{
  "Comment": "Agent orchestration with tool selection and execution",
  "StartAt": "AnalyzeIntent",
  "States": {
    "AnalyzeIntent": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:analyze-intent",
      "ResultPath": "$.intent",
      "Next": "SelectTools",
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException"],
        "IntervalSeconds": 2,

Key Insight

DynamoDB Single-Table Design Enables Sub-10ms Agent State Access

Agent state management requires multiple access patterns: retrieve conversation by session ID, fetch user's recent sessions, query tool execution history, and access agent memory by topic. Traditional multi-table designs require multiple queries and joins at the application layer, adding 50-100ms latency per state access.
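To make the access patterns concrete, here is a small sketch of key construction using the `SESSION#`, `MSG#`, and `USER#` prefixes from this chapter's schema. The helper names and the `text` attribute are hypothetical, and the query is returned as plain data rather than passed to any SDK.

```python
from typing import Dict


def message_item(session_id: str, user_id: str, timestamp: str, text: str) -> Dict[str, str]:
    """Build a message item keyed for lookup by session and, via GSI1, by user."""
    return {
        "PK": f"SESSION#{session_id}",
        "SK": f"MSG#{timestamp}",
        "GSI1PK": f"USER#{user_id}",
        "GSI1SK": f"SESSION#{session_id}#{timestamp}",
        "type": "message",
        "text": text,  # hypothetical attribute name
    }


def session_messages_query(session_id: str) -> Dict[str, object]:
    """Describe the key condition for 'all messages in a session' as plain data."""
    return {
        "KeyConditionExpression": "PK = :pk AND begins_with(SK, :prefix)",
        "ExpressionAttributeValues": {":pk": f"SESSION#{session_id}", ":prefix": "MSG#"},
    }


item = message_item("s1", "u1", "2024-01-01T00:00:00Z", "hi")
query = session_messages_query("s1")
```

Because the sort key embeds an ISO 8601 timestamp, messages within a session sort chronologically, and a single query on the partition key returns the whole conversation without any application-side join.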

DynamoDB Single-Table Schema for Agent State (TypeScript)

// Table: AgentState
// PK: Partition Key, SK: Sort Key
// GSI1PK, GSI1SK: Global Secondary Index for alternate access patterns

interface AgentStateItem {
  PK: string;           // SESSION#<sessionId> or USER#<userId>
  SK: string;           // MSG#<timestamp> or TOOL#<executionId> or META
  GSI1PK?: string;      // USER#<userId> or TOOL#<toolName>
  GSI1SK?: string;      // SESSION#<sessionId>#<timestamp>
  
  // Entity-specific attributes
  type: 'message' | 'toolExecution' | 'sessionMeta' | 'memory';

Anti-Pattern: Storing Full Conversation History in Lambda Memory

❌ Problem

Lambda containers are recycled unpredictably, causing sudden context loss mid-co...

✓ Solution

Use DynamoDB with DAX (DynamoDB Accelerator) for microsecond-latency reads of re...

Stripe

API Gateway Configuration for High-Volume Agent APIs

API latency p50 improved from 890ms to 340ms through caching and request validat...

Implementing WebSocket Connections for Real-Time Agent Interactions

Create WebSocket API in API Gateway

Implement Connection Management Lambda

Build Message Handler with Streaming Support

Configure Connection Keep-Alive

Implement Reconnection Logic

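WebSocket Message Size Limits Can Close Connections

API Gateway WebSocket APIs have a 128KB message size limit and a 32KB frame size limit. A message larger than 32KB must be split into multiple frames, and a message or frame that exceeds these limits is not delivered: API Gateway closes the connection with code 1009. Large agent responses or tool results therefore need to be chunked before they are sent.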
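One way to stay under the 32KB frame limit is to split large responses on character boundaries before sending. The sketch below is illustrative only: `chunk_utf8` is a hypothetical helper, and it measures raw text, so any JSON envelope wrapped around each chunk needs its own headroom.

```python
from typing import List

FRAME_LIMIT_BYTES = 32 * 1024  # frame size limit quoted above


def chunk_utf8(text: str, max_bytes: int = FRAME_LIMIT_BYTES) -> List[str]:
    """Split text into pieces whose UTF-8 encoding fits in max_bytes each.

    Splits on character boundaries so no multi-byte character is cut in half.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must fit at least one UTF-8 character (4 bytes)")
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for ch in text:
        ch_len = len(ch.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the budget.
        if size + ch_len > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_len
    if current:
        chunks.append("".join(current))
    return chunks


large_response = "a" * 70_000
pieces = chunk_utf8(large_response)
```

Each piece can then be sent as its own message, with a sequence number added by the caller so the client can reassemble them in order.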

Framework

Long-Running Agent Architecture Patterns

Step Functions with Wait States

For workflows that need to pause for external events or human approval, use Step Functions wait stat...

ECS Fargate Spot for Batch Processing

Run long-running agent tasks on Fargate Spot instances for up to 70% cost savings. Implement checkpo...

Lambda Continuation Pattern

Chain Lambda invocations through SQS or Step Functions, with each invocation processing a portion of...

ECS with Application Load Balancer

For agents requiring persistent connections or continuous operation, run on ECS with ALB for load ba...

Runway

Hybrid Architecture for Video Generation Agents

Job completion rate improved from 94% to 99.7% after implementing checkpointing ...

Lambda Continuation Pattern with SQS (Python)

import boto3
import json
from typing import Optional
import hashlib

sqs = boto3.client('sqs')
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('AgentExecutionState')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/agent-continuation'
MAX_ITERATIONS_PER_INVOCATION = 50
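The listing above is cut off before the handler, so the following is a self-contained sketch of the continuation idea rather than a reconstruction of the original code. It swaps SQS and DynamoDB for in-memory stand-ins (a `deque` and a `dict`) so the control flow can run locally; the event fields `execution_id` and `total` are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List

MAX_ITERATIONS_PER_INVOCATION = 50  # same budget as the listing above

# In-memory stand-ins (assumptions): a deque for the SQS queue, a dict for the state table.
queue: Deque[Dict[str, str]] = deque()
state_table: Dict[str, Dict[str, int]] = {}


def handler(event: Dict[str, str]) -> str:
    """Process up to the per-invocation budget, checkpoint, then hand off if unfinished."""
    execution_id = event["execution_id"]
    state = state_table.setdefault(execution_id, {"cursor": 0, "total": int(event["total"])})
    end = min(state["cursor"] + MAX_ITERATIONS_PER_INVOCATION, state["total"])
    for _ in range(state["cursor"], end):
        pass  # one unit of agent work would run here
    state["cursor"] = end  # checkpoint before handing off
    if end < state["total"]:
        # Enqueue a continuation message; the next invocation resumes from the cursor.
        queue.append({"execution_id": execution_id, "total": event["total"]})
        return "continued"
    return "complete"


def drain(first_event: Dict[str, str]) -> List[str]:
    """Simulate the queue re-invoking the handler until the work finishes."""
    results = [handler(first_event)]
    while queue:
        results.append(handler(queue.popleft()))
    return results


results = drain({"execution_id": "run-1", "total": "120"})
```

A job of 120 iterations takes three invocations here. In a real deployment the checkpoint write should happen before the continuation message is sent, so a retried message never repeats completed work.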

Production Readiness Checklist for Agent Infrastructure

340ms

Average cold start time for optimized Lambda functions

Achieving sub-500ms cold starts requires careful optimization: use ARM64 architecture (15% faster startup), keep the deployment package under 50MB, initialize SDK clients outside handlers, and use provisioned concurrency for latency-critical functions.
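"Initialize SDK clients outside handlers" means doing expensive setup at module scope, where it runs once per execution environment instead of once per request. The sketch below shows the pattern with `SlowClient`, a hypothetical stand-in for an SDK client, so it runs without AWS credentials.

```python
import time


class SlowClient:
    """Stand-in for an SDK client whose construction is expensive (assumption)."""

    instances = 0

    def __init__(self) -> None:
        SlowClient.instances += 1
        time.sleep(0.01)  # simulate connection and credential setup

    def get(self, key: str) -> str:
        return f"value-for-{key}"


# Module scope: runs once per execution environment (the cold start),
# then is reused by every warm invocation that lands on the same environment.
client = SlowClient()


def handler(event: dict, context: object = None) -> str:
    # Only per-request work belongs inside the handler.
    return client.get(event["key"])


responses = [handler({"key": str(i)}) for i in range(3)]
```

Three invocations share a single client instance, so the setup cost is paid only on the cold start.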

Complete Agent Infrastructure Data Flow

[Diagram: User Request → API Gateway (Auth + ...) → Lambda Router → Step Functions Orche...]

Use Step Functions Express Workflows for Sub-5-Minute Agent Interactions

For most conversational agent interactions that complete in under 5 minutes, Express Workflows cost 90% less than Standard Workflows while providing synchronous execution. A typical agent handling 100,000 daily interactions saves approximately $2,400/month by using Express for short interactions and reserving Standard for long-running workflows requiring exactly-once semantics.
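The savings figure depends heavily on how many state transitions each interaction uses, so it helps to work the numbers. The sketch below uses the Standard and Express prices quoted in this chapter's comparison. The Express duration rate, the 32 transitions per interaction, the 10-second average duration, and the 64MB billed memory are all assumptions and should be checked against current pricing and your own workload.

```python
STANDARD_PER_TRANSITION = 0.025 / 1000   # USD, from the Standard pricing quoted in this chapter
EXPRESS_PER_REQUEST = 1.00 / 1_000_000   # USD, from the Express pricing quoted in this chapter
EXPRESS_PER_GB_SECOND = 0.00001667       # USD, assumed duration rate; check current pricing


def standard_monthly_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard Workflows bill per state transition."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_monthly_cost(executions: int, avg_seconds: float, memory_gb: float) -> float:
    """Express Workflows bill per request plus duration (GB-seconds)."""
    requests = executions * EXPRESS_PER_REQUEST
    duration = executions * avg_seconds * memory_gb * EXPRESS_PER_GB_SECOND
    return requests + duration


# Assumptions: 100,000 interactions/day for 30 days, 32 transitions each,
# 10 seconds average duration, 64 MB billed memory.
executions = 100_000 * 30
standard = standard_monthly_cost(executions, 32)
express = express_monthly_cost(executions, avg_seconds=10, memory_gb=0.0625)
savings = standard - express
```

With those inputs Standard comes to about $2,400 per month and Express to roughly $34, which is in the neighborhood of the savings quoted above. Workflows with fewer transitions per interaction shrink the gap proportionally.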

Key Insight

API Gateway WebSocket APIs Enable True Streaming Agent Responses

Traditional REST APIs require the complete response before sending to the client, creating perceived latency even when the agent starts generating immediately. WebSocket APIs allow pushing response chunks as they're generated, reducing time-to-first-token from seconds to milliseconds.

Practice Exercise

Build a Fault-Tolerant Agent Tool Chain

90 min

Practice Exercise

Build a Complete Agent Tool System

90 min

Complete Tool Orchestrator Implementation (TypeScript)

import { DynamoDBClient, GetItemCommand, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';
import { z } from 'zod';

const ToolRequestSchema = z.object({
  toolName: z.enum(['search', 'calculator', 'database']),
  parameters: z.record(z.unknown()),
  sessionId: z.string(),
  requestId: z.string()
});

const TOOL_CONFIG = {

Practice Exercise

Implement Step Functions Agent Workflow

120 min

Step Functions State Machine Definition (JSON)

{
  "Comment": "Production Agent Orchestration Workflow",
  "StartAt": "ValidateAndParse",
  "States": {
    "ValidateAndParse": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:agent-validate",
      "ResultPath": "$.parsed",
      "Retry": [{
        "ErrorEquals": ["Lambda.ServiceException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,

Production Agent Infrastructure Checklist

Anti-Pattern: The Monolithic Agent Lambda

❌ Problem

Monolithic agent functions lead to cold start times exceeding 5 seconds, making ...

✓ Solution

Decompose your agent into focused Lambda functions: one for planning (calls LLM)...

Anti-Pattern: Synchronous Everything Architecture

❌ Problem

Synchronous architectures create terrible user experiences with perceived latenc...

✓ Solution

Implement asynchronous patterns with immediate acknowledgment. Return a request ...
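A minimal sketch of the acknowledge-then-poll pattern follows. It assumes the truncated sentence above refers to returning a request ID; the `jobs` dictionary stands in for a durable store such as a DynamoDB table, and `submit`, `worker`, and `get_status` are hypothetical names.

```python
import uuid
from typing import Dict, Optional

# In-memory stand-in for a durable job store such as a DynamoDB table (assumption).
jobs: Dict[str, Dict[str, Optional[str]]] = {}


def submit(prompt: str) -> Dict[str, str]:
    """Acknowledge immediately with a request ID; the work happens later."""
    request_id = str(uuid.uuid4())
    jobs[request_id] = {"status": "PENDING", "prompt": prompt, "result": None}
    return {"requestId": request_id, "status": "PENDING"}


def worker(request_id: str) -> None:
    """Background processing, e.g. triggered from a queue in a real deployment."""
    job = jobs[request_id]
    job["result"] = (job["prompt"] or "").upper()  # placeholder for the agent run
    job["status"] = "COMPLETE"


def get_status(request_id: str) -> Dict[str, Optional[str]]:
    """Polling endpoint: report progress without blocking on the agent."""
    job = jobs.get(request_id)
    if job is None:
        return {"status": "NOT_FOUND", "result": None}
    return {"status": job["status"], "result": job["result"]}


ack = submit("hello")
before = get_status(ack["requestId"])
worker(ack["requestId"])
after = get_status(ack["requestId"])
```

The caller gets a response as soon as the job is recorded and can poll or subscribe for the result, so perceived latency no longer depends on how long the agent takes.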

Anti-Pattern: Ignoring Cold Start Impact

❌ Problem

Cold starts devastate user experience for agent systems where responsiveness is ...

✓ Solution

Minimize deployment package size by using Lambda layers for dependencies and tre...

Practice Exercise

Build Agent State Recovery System

60 min

Agent State Checkpointing Implementation (TypeScript)

interface AgentCheckpoint {
  sessionId: string;
  checkpointId: string;
  timestamp: number;
  conversationHistory: Message[];
  pendingActions: Action[];
  completedActions: Action[];
  currentPlan: Plan | null;
  toolResults: Record<string, unknown>;
  metadata: Record<string, unknown>;
}
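For readers working in Python, the same checkpoint shape can be sketched as a dataclass with a JSON round trip and a "resume from newest" helper. The field names mirror the interface above in snake_case. The `serialize`, `deserialize`, and `latest` helpers are hypothetical, and messages, actions, and plans are simplified to plain dictionaries.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class AgentCheckpoint:
    session_id: str
    checkpoint_id: str
    timestamp: int
    conversation_history: List[Dict[str, Any]] = field(default_factory=list)
    pending_actions: List[Dict[str, Any]] = field(default_factory=list)
    completed_actions: List[Dict[str, Any]] = field(default_factory=list)
    current_plan: Optional[Dict[str, Any]] = None
    tool_results: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)


def serialize(cp: AgentCheckpoint) -> str:
    """Encode a checkpoint as JSON so it can be stored as a single item."""
    return json.dumps(asdict(cp), sort_keys=True)


def deserialize(raw: str) -> AgentCheckpoint:
    """Rebuild a checkpoint from its stored JSON form."""
    return AgentCheckpoint(**json.loads(raw))


def latest(checkpoints: List[AgentCheckpoint]) -> Optional[AgentCheckpoint]:
    """Recovery resumes from the newest checkpoint, if any exists."""
    return max(checkpoints, key=lambda cp: cp.timestamp, default=None)


cp1 = AgentCheckpoint("s1", "c1", 100, pending_actions=[{"tool": "search"}])
cp2 = AgentCheckpoint("s1", "c2", 200)
restored = deserialize(serialize(cp1))
```

On recovery, the agent would load `latest(...)` for the session and replay only the `pending_actions`, leaving `completed_actions` untouched.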

Essential AWS Agent Infrastructure Resources

AWS Lambda Powertools for TypeScript

tool

Step Functions Local for Testing

tool

AWS Solutions Constructs

tool

DynamoDB Single-Table Design by Alex DeBrie

book

Practice Exercise

Implement Agent Observability Dashboard

45 min

Cost Optimization is Continuous

Agent infrastructure costs can spiral quickly—one team's prototype that cost $50/month in development reached $15,000/month in production due to inefficient patterns. Implement cost monitoring from day one: tag all resources, set up AWS Budgets with alerts at 50% and 80% of targets, and review Cost Explorer weekly.
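The 50% and 80% alert levels amount to a simple threshold check. AWS Budgets evaluates this for you once alerts are configured; the sketch below only illustrates the logic, and `crossed_thresholds` is a hypothetical helper.

```python
from typing import List, Sequence


def crossed_thresholds(spend: float, budget: float, thresholds: Sequence[float] = (0.5, 0.8)) -> List[float]:
    """Return the alert thresholds (as fractions of budget) that spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


mid_month = crossed_thresholds(spend=500, budget=1000)
late_month = crossed_thresholds(spend=900, budget=1000)
```

Running the same check against forecast spend as well as actual spend gives earlier warning of a cost spike.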

Framework

Agent Infrastructure Maturity Model

Level 1: Functional

Basic Lambda functions with API Gateway, simple DynamoDB table for state, manual deployments. Suitab...

Level 2: Observable

CloudWatch dashboards and alarms, X-Ray tracing enabled, structured logging with correlation IDs, ba...

Level 3: Resilient

Step Functions orchestration with retry logic, DynamoDB with auto-scaling and backups, multi-AZ depl...

Level 4: Optimized

Provisioned concurrency for latency-sensitive paths, response caching, cost allocation and optimizat...

Start with Step Functions Express Workflows

For most agent use cases, Express Workflows provide the best balance of features and cost. They support execution start rates above 100,000 per second, are billed at $1 per million requests plus a duration charge (versus $25 per million state transitions for Standard), and handle 90% of agent orchestration needs.

67%

of agent system failures traced to state management issues

State management is the most common failure point in agent systems.

Chapter Complete!

Lambda functions provide the ideal execution environment for...

Step Functions orchestration enables complex agent workflows...

DynamoDB single-table design with composite keys (SESSION#id...

API Gateway with proper throttling (10,000 RPS account-level default, burst ...

Next: Start by implementing a single tool as a Lambda function with proper error handling and observability
