Reference Architectures: Production-Tested Blueprints for Serverless ML Systems
Building serverless ML systems from scratch requires navigating countless architectural decisions, from choosing the right event-driven patterns to optimizing inference pipelines for cost and latency. This chapter provides five complete reference architectures that have been battle-tested in production environments handling millions of requests daily.
Key Insight
Reference Architectures Reduce Time-to-Production by 60-70%
Organizations that adopt proven reference architectures consistently reach production faster than those designing systems from scratch. Anthropic's engineering team documented that their standardized ML architecture patterns reduced new feature deployment time from 8 weeks to under 3 weeks.
73%
of ML projects fail to reach production
The primary causes are architectural complexity and operational challenges, not model quality.
Framework
SCALE Reference Architecture Framework
Security Boundaries
Define clear security perimeters using VPCs, IAM roles, and encryption at rest and in transit. Each ...
Cost Optimization Layers
Implement cost controls at every tier including request batching, caching strategies, and right-size...
Availability Patterns
Define SLA requirements and implement corresponding redundancy. This includes multi-AZ deployments, ...
Latency Requirements
Map end-to-end latency budgets across all components. Allocate specific latency targets to each service.
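The latency-mapping step above can be sketched as a simple budget check. The component names and millisecond allocations below are illustrative assumptions, not recommendations:

```python
# Illustrative latency-budget allocation for the SCALE framework.
# Component names and millisecond targets are assumptions for this sketch.
END_TO_END_P99_MS = 500

component_budgets_ms = {
    "api_gateway": 20,
    "auth_and_routing": 30,
    "feature_lookup": 50,
    "model_inference": 350,
    "response_serialization": 20,
}

def validate_budget(budgets, total_ms):
    """Return remaining headroom; raise if allocations exceed the budget."""
    allocated = sum(budgets.values())
    if allocated > total_ms:
        raise ValueError(f"over budget by {allocated - total_ms} ms")
    return total_ms - allocated

headroom = validate_budget(component_budgets_ms, END_TO_END_P99_MS)
```

Keeping explicit headroom (here 30ms) absorbs variance you did not budget for, such as retry overhead.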
Serverless ML Architecture Component Taxonomy
Ingestion Layer (API...
Processing Layer (La...
ML Layer (SageMaker,...
Storage Layer (S3, D...
Synchronous vs Asynchronous ML Architecture Patterns
Synchronous Architectures
Sub-second latency requirements with P99 targets under 500ms
Real-time user interactions requiring immediate responses
Higher cost due to provisioned concurrency and always-on endpoints
Simpler error handling with immediate feedback to clients
Asynchronous Architectures
Flexible latency from seconds to hours based on workload
Batch processing and background tasks without user waiting
Lower cost through spot instances and request batching
Complex error handling requiring dead letter queues and retries
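The dead-letter-queue and retry handling that asynchronous architectures require can be sketched in a few lines. This in-memory stand-in mimics what an SQS redrive policy does for you; the max-receive count of 3 is an assumption:

```python
# In-memory sketch of the retry + dead-letter pattern that SQS redrive
# policies provide. The max-receive count is an assumption.
from collections import deque

MAX_RECEIVES = 3

def process_with_dlq(messages, handler):
    """Retry each message up to MAX_RECEIVES, then move it to the DLQ."""
    queue = deque((msg, 0) for msg in messages)
    dead_letter_queue, succeeded = [], []
    while queue:
        msg, attempts = queue.popleft()
        try:
            succeeded.append(handler(msg))
        except Exception:
            attempts += 1
            if attempts >= MAX_RECEIVES:
                dead_letter_queue.append(msg)   # give up: route to DLQ
            else:
                queue.append((msg, attempts))   # redrive for another attempt
    return succeeded, dead_letter_queue

def flaky_handler(msg):
    if msg == "poison":
        raise RuntimeError("unprocessable payload")
    return msg.upper()

ok, dlq = process_with_dlq(["a", "poison", "b"], flaky_handler)
```

The DLQ gives you a durable record of failures to replay after a fix, instead of silently dropping work.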
Notion
Building a Unified ML Architecture Supporting Multiple Use Cases
Reduced infrastructure costs by 40% compared to separate systems, achieved 99.95...
Architecture Selection Criteria
Before implementing any reference architecture, document your specific requirements for latency (P50, P95, P99), throughput (requests per second, concurrent users), cost constraints (monthly budget, cost per request), and compliance requirements (data residency, encryption standards). These constraints will determine which patterns to adopt and which to modify.
Key Insight
Event-Driven Patterns Enable Architecture Evolution Without Rewrites
The most successful serverless ML architectures use event-driven patterns that allow components to evolve independently. When Vercel built their AI SDK infrastructure, they designed every ML interaction as an event that flows through EventBridge, enabling them to add new model providers without modifying existing code.
Reference Architecture Evaluation Criteria
Anti-Pattern: The Monolithic Serverless Function
❌ Problem
A single slow inference call causes the entire function to timeout, wasting all ...
✓ Solution
Decompose into separate functions connected by Step Functions or EventBridge. Cr...
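The decomposition the solution describes is ultimately expressed as an Amazon States Language definition. A minimal sketch follows; the state names, function ARNs, and timeout values are placeholder assumptions:

```python
# Minimal Amazon States Language (ASL) definition chaining two
# single-purpose Lambda functions, each with its own timeout and retry
# policy. ARNs, state names, and timeouts are placeholder assumptions.
import json

state_machine = {
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:preprocess",
            "TimeoutSeconds": 30,
            "Next": "Infer",
        },
        "Infer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:infer",
            "TimeoutSeconds": 60,
            # a slow inference now fails only this state, not the whole pipeline
            "Retry": [{"ErrorEquals": ["States.Timeout"], "MaxAttempts": 2}],
            "End": True,
        },
    },
}

definition = json.dumps(state_machine)
```

Because each state has its own timeout, a slow inference call no longer wastes the compute already spent on preprocessing.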
Implementing a Reference Architecture from This Chapter
1. Select the Closest Matching Reference Architecture
2. Deploy the Minimal Viable Architecture
3. Implement Observability Before Scaling
4. Load Test to Identify Bottlenecks
5. Add Resilience Patterns
Key Insight
Multi-Region Architectures Require Explicit Data Synchronization Strategies
Many teams assume that deploying serverless architectures to multiple regions automatically provides disaster recovery, but ML systems have unique challenges around model versioning and feature store consistency. When Linear implemented their multi-region AI features, they discovered that DynamoDB Global Tables introduced 200-500ms replication lag, causing inconsistent predictions between regions during model updates.
Stripe
Designing for 99.999% Availability in Payment Fraud Detection
Reduced fraud losses by 40% compared to rule-based systems alone, maintained sub...
Start with the Simplest Architecture That Could Work
Resist the temptation to implement every optimization from day one. Begin with synchronous Lambda invocations and add Step Functions orchestration only when you need complex workflows.
Base Architecture Configuration with AWS CDK (TypeScript)

import * as cdk from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

export class ServerlessMLStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Feature store with on-demand scaling
    const featureStore = new dynamodb.Table(this, 'FeatureStore', {
      partitionKey: { name: 'entityId', type: dynamodb.AttributeType.STRING },
      billingMode: dynamodb.BillingMode.PAY_PER_REQUEST, // on-demand capacity
    });
  }
}
Framework
Architecture Decision Record (ADR) Template for ML Systems
Context
Describe the business and technical requirements driving the decision. Include latency requirements,...
Decision
State the architectural choice clearly and specifically. Reference the pattern from this chapter if ...
Consequences
Document both positive outcomes and accepted tradeoffs. Include cost implications, operational compl...
Alternatives Considered
List other approaches that were evaluated and why they were rejected. This prevents future engineers...
3.7x
faster incident resolution with documented architectures
Teams with comprehensive architecture documentation and runbooks resolve production incidents 3.7 times faster than those without.
Key Insight
Serverless ML Architectures Should Plan for Model Versioning from Day One
Model updates are inevitable, yet many architectures treat them as afterthoughts, leading to complex deployment procedures and potential downtime. OpenAI's API architecture demonstrates best practices by maintaining multiple model versions simultaneously, allowing gradual traffic migration and instant rollback.
Account for Cold Start Impact in Architecture Design
Lambda cold starts (100-500ms), SageMaker Serverless cold starts (2-30 seconds), and DynamoDB on-demand scaling delays can cascade into timeout failures. Design architectures with warming strategies, provisioned concurrency for critical paths, and timeout budgets that account for worst-case initialization times.
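The timeout-budget advice above can be made concrete by computing worst-case budgets programmatically. The sketch below uses the cold-start upper bounds quoted in the tip; the 20% safety margin is an assumption:

```python
# Worst-case timeout budgeting that accounts for cold-start initialization.
# Cold-start figures are the upper bounds quoted in the text; the 20%
# safety margin is an assumption.
COLD_START_WORST_CASE_MS = {
    "lambda": 500,                   # Lambda cold start upper bound
    "sagemaker_serverless": 30_000,  # SageMaker Serverless upper bound
}

def timeout_budget_ms(steady_state_ms, services, safety_percent=120):
    """Timeout that survives a worst-case cold start on every service."""
    cold = sum(COLD_START_WORST_CASE_MS[s] for s in services)
    return (steady_state_ms + cold) * safety_percent // 100

budget = timeout_budget_ms(800, ["lambda", "sagemaker_serverless"])
```

Running this for a path with one Lambda and one SageMaker Serverless hop shows why naive sub-second timeouts cascade into failures: the worst case is tens of seconds, not hundreds of milliseconds.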
Framework
Serverless Chatbot Architecture Framework
Conversation Ingestion Layer
API Gateway WebSocket APIs handle bidirectional communication with sub-100ms connection establishment.
Context Management Engine
ElastiCache Redis clusters store conversation context with sliding window expiration for active sessions.
Complex connection management requires dedicated state handling
Sub-second perceived latency with progressive response rendering
Essential for conversational AI with long-form generative responses
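The sliding-window expiration pattern can be illustrated without Redis. The class below is an in-process stand-in; the TTL and window size are assumptions, and in production the same behavior comes from Redis EXPIRE refreshes plus a capped list:

```python
# In-process stand-in for sliding-window conversation context. The 30-minute
# TTL and 20-turn window are assumptions; ElastiCache Redis provides the same
# semantics via EXPIRE refreshes and trimmed lists.
import time

class ConversationContext:
    def __init__(self, ttl_seconds=1800, max_turns=20):
        self.ttl = ttl_seconds
        self.max_turns = max_turns
        self._store = {}  # session_id -> (expires_at, turns)

    def append(self, session_id, turn, now=None):
        now = time.time() if now is None else now
        expires_at, turns = self._store.get(session_id, (0.0, []))
        if now >= expires_at:        # session expired: start fresh
            turns = []
        turns = (turns + [turn])[-self.max_turns:]  # keep a sliding window
        # every write slides the expiration forward, like a Redis EXPIRE refresh
        self._store[session_id] = (now + self.ttl, turns)

    def get(self, session_id, now=None):
        now = time.time() if now is None else now
        expires_at, turns = self._store.get(session_id, (0.0, []))
        return turns if now < expires_at else []

ctx = ConversationContext(ttl_seconds=1800, max_turns=3)
ctx.append("s1", "hi", now=0)
ctx.append("s1", "hello", now=10)
history = ctx.get("s1", now=100)
```

The sliding expiration matters: an active conversation never expires mid-session, while abandoned sessions age out without explicit cleanup.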
Building a Document Processing Pipeline
1. Configure Intelligent Document Ingestion
2. Implement Document Classification
3. Extract Text with Amazon Textract
4. Apply Entity Extraction and Enrichment
5. Orchestrate with Step Functions
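The classification step above can be sketched as a routing function that picks the next Step Functions state for an uploaded object. The file extensions and state names below are assumptions:

```python
# Sketch of the document-classification step: route an uploaded object to
# the next pipeline state by content type. Extensions and state names are
# illustrative assumptions.
NEXT_STATE = {
    "pdf": "TextractExtraction",
    "png": "TextractExtraction",
    "jpg": "TextractExtraction",
    "txt": "ComprehendEntities",   # plain text can skip OCR entirely
}

def classify_document(s3_key):
    """Pick the next Step Functions state for an uploaded document."""
    ext = s3_key.rsplit(".", 1)[-1].lower()
    return NEXT_STATE.get(ext, "ManualReview")  # unknown types go to a human

next_state = classify_document("claims/2024/form-17.pdf")
```

Routing unknown formats to a manual-review state is the cheap insurance that keeps one malformed upload from poisoning the pipeline.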
Lemonade
AI-Powered Insurance Claims Processing
Lemonade's AI Jim handles 30% of claims end-to-end without human intervention, w...
Intelligent Document Processing Pipeline
S3 Upload → Lambda Classifier → Textract Extraction → Comprehend Entities
Key Insight
Async Processing is Non-Negotiable for Document Pipelines
Document processing workloads are inherently unpredictable—a single PDF might contain 2 pages or 200 pages, and processing time can vary from 2 seconds to 2 minutes. Synchronous architectures fail catastrophically under these conditions, causing timeout errors and poor user experience.
Framework
Serverless Recommendation Engine Architecture
Event Collection Layer
Kinesis Data Streams capture user interactions (views, clicks, purchases, ratings) with guaranteed o...
Feature Engineering Pipeline
Step Functions orchestrate daily feature computation jobs processing billions of events. EMR Serverl...
Model Training Infrastructure
SageMaker training jobs run on spot instances for 70% cost reduction on recommendation model training.
Real-Time Serving Layer
SageMaker Serverless Inference endpoints serve recommendation models with automatic scaling from zero.
Spotify
Discover Weekly Recommendation Architecture
Discover Weekly drives 30% of all Spotify streams, with users who engage with pe...
Real-Time Recommendation API with Feature Store Integration (Python)
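The code listing for this example did not survive extraction, so the following is a hedged sketch of what a feature-store-backed scoring handler might look like. The in-memory dict stands in for the online feature store; feature names, affinities, and item ids are all illustrative assumptions:

```python
# Hedged sketch of a real-time recommendation handler. The in-memory dict
# stands in for an online feature store; feature names, weights, and item
# ids are illustrative assumptions.
ONLINE_FEATURES = {
    "user-42": {"genre_affinity": {"jazz": 0.9, "rock": 0.4}},
}
CANDIDATE_ITEMS = [
    {"item_id": "track-1", "genre": "jazz"},
    {"item_id": "track-2", "genre": "rock"},
    {"item_id": "track-3", "genre": "jazz"},
]

def recommend(user_id, k=2):
    """Score candidates against the user's features and return top-k ids."""
    features = ONLINE_FEATURES.get(user_id, {"genre_affinity": {}})
    affinity = features["genre_affinity"]
    scored = sorted(
        CANDIDATE_ITEMS,
        key=lambda item: affinity.get(item["genre"], 0.0),
        reverse=True,
    )
    return [item["item_id"] for item in scored[:k]]

top = recommend("user-42")
```

In a real deployment the `ONLINE_FEATURES` lookup becomes a single-digit-millisecond read against DynamoDB or SageMaker Feature Store's online store, and scoring moves to a model endpoint.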
Anti-Pattern: Synchronous Model Training in Request Path
❌ Problem
Systems implementing this pattern experience cascading failures during traffic s...
✓ Solution
Implement a hybrid architecture separating batch training from real-time serving...
Framework
Computer Vision Pipeline Architecture
Media Ingestion Gateway
S3 multipart upload with Transfer Acceleration handles files up to 5TB with global edge optimization...
Preprocessing Pipeline
Lambda functions with container images (up to 10GB) include heavy dependencies like OpenCV and FFmpe...
Multi-Model Inference Layer
SageMaker multi-model endpoints host dozens of models on single infrastructure, reducing costs by 80...
Rekognition Integration
Amazon Rekognition provides managed APIs for common CV tasks: object detection, face analysis, text ...
Pinterest
Visual Search and Content Understanding Pipeline
Visual search drives 600 million monthly searches on Pinterest, with users engag...
Computer Vision Inference Deployment Options
SageMaker Serverless Inference
Zero cost when idle, scales automatically from 0 to configured concurrency
Cold start latency of 1-2 seconds for first request after idle periods
Maximum 6GB memory limits model size and batch processing capacity
Ideal for sporadic workloads with unpredictable traffic patterns
SageMaker Real-Time Endpoints
Always-on instances with consistent sub-100ms latency
No cold starts, immediate response for all requests
Support for GPU instances and large models up to hundreds of...
Better for steady-state workloads with predictable traffic
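The tradeoffs above can be folded into a rule-of-thumb chooser. The cutoffs below encode the listed constraints (the 6GB serverless memory cap, cold-start sensitivity, traffic sparsity), but the exact thresholds are assumptions to tune for your workload:

```python
# Rule-of-thumb chooser between SageMaker Serverless Inference and real-time
# endpoints. Thresholds encode the tradeoffs listed in the comparison; the
# exact cutoff values are assumptions.
def choose_cv_endpoint(requests_per_hour, needs_sub_100ms, model_size_gb):
    if model_size_gb > 6:
        return "realtime"        # serverless caps memory at 6GB
    if needs_sub_100ms:
        return "realtime"        # 1-2s cold starts break sub-100ms targets
    if requests_per_hour < 100:
        return "serverless"      # sporadic traffic: pay nothing when idle
    return "realtime"            # steady traffic amortizes always-on cost

choice = choose_cv_endpoint(requests_per_hour=20, needs_sub_100ms=False, model_size_gb=2)
```

Encoding the decision as a function keeps it reviewable and testable as the thresholds evolve with pricing and workload changes.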
Multi-Tenant ML Security Checklist
78%
of enterprise ML platforms now support multi-tenancy
The shift toward multi-tenant ML architectures reflects the growing demand for ML-as-a-Service offerings.
Cold Start Impact on Real-Time CV Applications
Computer vision models on SageMaker Serverless Inference experience 2-5 second cold starts due to model loading time. For user-facing applications requiring consistent sub-second latency, implement a warming strategy using CloudWatch Events to invoke endpoints every 5 minutes.
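A minimal warming handler looks like the sketch below: the scheduled CloudWatch Events rule invokes the function with a sentinel payload that short-circuits before any inference work. The `warmup` key name is an assumption:

```python
# Warming-strategy sketch: a scheduled rule invokes the function with a
# sentinel payload, which returns before touching the model endpoint.
# The "warmup" key name is an assumption.
def handler(event, context=None):
    if event.get("warmup"):
        # keep-alive ping: keeps the container/endpoint warm, does no inference
        return {"statusCode": 200, "body": "warm"}
    # real path: this is where the SageMaker Serverless endpoint would be
    # invoked with the referenced image
    return {"statusCode": 200, "body": f"prediction for {event['image_key']}"}

warm = handler({"warmup": True})
real = handler({"image_key": "img-001.jpg"})
```

Returning early on the sentinel keeps warming invocations nearly free while still exercising the init path that cold starts pay for.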
Canva
Multi-Tenant Design AI Platform
Canva's multi-tenant ML architecture supports 10x user growth while reducing per...
Key Insight
Feature Store is the Foundation of Multi-Tenant ML
SageMaker Feature Store provides the critical capability of point-in-time correct feature retrieval with tenant isolation built-in. When serving predictions, you need features computed at the exact time of the original event to avoid data leakage—Feature Store handles this automatically with its offline store queries.
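The tenant-isolation guarantee can be sketched with a composite-key scheme. This is an illustrative stand-in, not the Feature Store API; the `tenant#entity` key format is an assumption:

```python
# Tenant-isolation sketch for feature reads: every record id is scoped by a
# tenant prefix, so one tenant can never read another's features. The
# "tenant#entity" key scheme is an assumption, not the Feature Store format.
class TenantScopedFeatureStore:
    def __init__(self):
        self._records = {}

    def put(self, tenant_id, entity_id, features):
        self._records[f"{tenant_id}#{entity_id}"] = features

    def get(self, tenant_id, entity_id):
        # the composite key makes cross-tenant reads structurally impossible:
        # a caller without the right tenant_id can never form the right key
        return self._records.get(f"{tenant_id}#{entity_id}", {})

store = TenantScopedFeatureStore()
store.put("acme", "user-1", {"ltv": 120.0})
hit = store.get("acme", "user-1")
miss = store.get("other-tenant", "user-1")
```

Pairing a key scheme like this with IAM policies that constrain each tenant's role to its own key prefix turns the convention into an enforced boundary.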
Practice Exercise: Build a Multi-Tenant Image Classification Service (90 min)
Practice Exercise: Build a Complete Serverless Chatbot Architecture (180 min)
Multi-Tenant ML Inference Lambda with Tenant Isolation (Python)

import boto3
import json
from functools import lru_cache

from aws_lambda_powertools import Logger, Tracer, Metrics
from aws_lambda_powertools.utilities.typing import LambdaContext
from aws_lambda_powertools.metrics import MetricUnit

logger = Logger()
tracer = Tracer()
metrics = Metrics()

# Tenant-specific model endpoints
Production Readiness Checklist for Serverless ML Architectures
Anti-Pattern: The Monolithic Lambda
❌ Problem
Monolithic Lambda functions result in 3-5x higher costs due to over-provisioned ...
✓ Solution
Decompose ML pipelines into single-responsibility Lambda functions connected via...
Anti-Pattern: Synchronous Everything
❌ Problem
Synchronous ML architectures hit API Gateway's 29-second timeout limit, causing ...
✓ Solution
Implement async-first architecture using SQS for request queuing and WebSockets ...
Anti-Pattern: Shared State in Multi-Tenant Systems
❌ Problem
Shared state creates noisy neighbor problems where one tenant's traffic spike af...
✓ Solution
Implement physical isolation for sensitive operations using separate DynamoDB ta...
Practice Exercise: Implement Document Processing Pipeline with Error Recovery (150 min)
Practice Exercise: Build Multi-Model Recommendation Engine with A/B Testing (240 min)
Computer Vision Pipeline with Parallel Processing (Python)

# Step Functions State Machine Definition (CDK)
from aws_cdk import (
    aws_stepfunctions as sfn,
    aws_stepfunctions_tasks as tasks,
    aws_lambda as lambda_,
    Duration,
)
from constructs import Construct

class ComputerVisionPipeline(Construct):
    def __init__(self, scope, id, **kwargs):
        super().__init__(scope, id, **kwargs)
Framework
Serverless ML Architecture Decision Framework
Latency Classification
Categorize your ML workload into real-time (<100ms), near-real-time (<1s), batch (<1 hour), or offline.
State Complexity Assessment
Evaluate whether your workload is stateless (single inference), session-stateful (conversation conte...
Cost Model Selection
Select the appropriate cost model based on traffic patterns. Sporadic traffic (<1000 requests/day) b...
Essential Resources for Serverless ML Architecture
AWS Well-Architected Machine Learning Lens (article)
Serverless Land Patterns Collection (tool)
AWS Solutions Library - ML Solutions (article)
Amazon Bedrock Workshop (tool)
Start with the Simplest Architecture That Could Work
Reference architectures show complete, production-grade patterns, but you shouldn't implement everything on day one. Start with a minimal viable architecture—perhaps just API Gateway, Lambda, and DynamoDB—and add complexity as requirements demand.
Use AWS Application Composer for Architecture Visualization
AWS Application Composer provides a visual canvas for designing serverless architectures with automatic CloudFormation/SAM generation. Use it to prototype reference architectures before committing to implementation.
Build vs. Buy Decision for ML Infrastructure Components
Build Custom Components
Full control over implementation details and optimization
No vendor lock-in for critical ML pipeline components
Can implement proprietary algorithms and unique requirements
Potentially lower per-unit costs at massive scale
Use AWS Managed Services
Production-ready from day one with built-in best practices
Automatic scaling, security patches, and feature updates
Integrated observability with CloudWatch and X-Ray
Predictable costs with pay-per-use pricing model
Migration Path from Traditional ML Infrastructure to Serverless
1. Audit Current Architecture and Identify Migration Candidates
2. Implement Strangler Fig Pattern for Gradual Migration
3. Migrate Batch Processing Workloads First
4. Refactor Inference Endpoints for Serverless
5. Migrate Data Processing and Feature Pipelines
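The strangler fig step above amounts to a weighted router in front of both systems. The sketch below routes a fixed share of traffic to the new serverless path by hashing the request id, so retries stay sticky to one backend; the 20% rollout share is an assumption:

```python
# Weighted strangler-fig router: a deterministic hash of the request id
# sends a fixed percentage of traffic to the new serverless path and the
# rest to the legacy system. The 20% rollout share is an assumption.
import hashlib

SERVERLESS_PERCENT = 20

def route(request_id):
    """Stable per-request routing so retries always hit the same backend."""
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "serverless" if bucket < SERVERLESS_PERCENT else "legacy"

share = sum(route(f"req-{i}") == "serverless" for i in range(1000))
```

Raising `SERVERLESS_PERCENT` in small increments, while comparing error rates and latency between the two backends, is the gradual migration the strangler fig pattern describes.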
Validate Cold Start Impact Before Production Deployment
Every reference architecture should be load tested with realistic cold start scenarios. Use Lambda Provisioned Concurrency for latency-sensitive paths, but understand that provisioned concurrency has a cost floor regardless of actual usage.
Duolingo
Serverless Architecture Powers 500M+ Language Learners
Duolingo reduced infrastructure costs by 50% while improving recommendation late...
Chapter Complete!
Reference architectures provide proven patterns for common M...
Multi-tenant ML architectures require careful consideration ...
Cost optimization in serverless ML requires understanding th...
Production readiness requires comprehensive observability, p...
Next: Begin by selecting one reference architecture that matches your highest-priority use case