Real-Time ML on Streaming Data: The New Competitive Frontier
In an era where batch processing delays can cost millions in missed opportunities, streaming ML has become the cornerstone of modern intelligent systems. Amazon Kinesis provides the foundation for processing millions of records per second while applying machine learning models in real-time, enabling use cases from fraud detection to personalized recommendations that respond in milliseconds.
500B+
Daily events processed by Netflix through Kinesis-based streaming pipelines
Netflix's recommendation engine processes over 500 billion events daily through their streaming infrastructure, making real-time decisions about what content to surface to 230+ million subscribers.
Key Insight
Streaming ML Fundamentally Changes the Value Equation
Traditional batch ML systems operate on a simple premise: collect data, train models, deploy, and wait for the next batch cycle. Streaming ML inverts this model by making inference a continuous process that responds to events as they occur.
Batch ML vs Streaming ML: Architectural Trade-offs
Batch ML Processing
Data collected over hours/days before processing begins
Models trained and deployed on fixed schedules (daily/weekly)
Predictions available minutes to hours after events occur
Simpler infrastructure with clear processing boundaries
Streaming ML Processing
Data processed immediately as events arrive (milliseconds)
Models continuously updated or applied in real-time
Predictions available within 10-500ms of event occurrence
Complex infrastructure requiring careful state management
Framework
The STREAM Framework for Real-Time ML Architecture
Scalability Design
Plan for 10x current load from day one. Kinesis shards should be provisioned for peak throughput, not average load.
Temporal Handling
Implement explicit handling for event time vs processing time. Late arrivals are not edge cases; they are routine in distributed systems and need a defined policy.
Resilience Patterns
Build for failure at every layer. Implement dead letter queues for failed records and circuit breakers for downstream dependencies.
Exactly-Once Semantics
Understand where you need exactly-once processing vs at-least-once. True exactly-once is expensive; for most inference paths, at-least-once delivery with idempotent processing is sufficient.
Stripe
Building a Sub-100ms Fraud Detection Pipeline
The system blocks $500 million in fraudulent transactions annually while maintaining sub-100ms decision latency.
Streaming ML Architecture on AWS
Flow: Data Sources → Kinesis Data Streams → Lambda Consumers (parallel per shard) → SageMaker Endpoints
Shard Count Determines Your Parallelism Ceiling
Your Kinesis shard count directly limits how many Lambda functions can process records in parallel. With 10 shards and default settings, at most 10 concurrent Lambda invocations process your stream, regardless of your Lambda concurrency limits. The one escape hatch is the event source mapping's ParallelizationFactor, which allows up to 10 concurrent batches per shard while still preserving ordering per partition key.
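A minimal sketch of raising that ceiling with boto3; the function name, stream ARN, and settings below are placeholders rather than values from this chapter:

import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',  # placeholder ARN
    FunctionName='ml-inference-consumer',  # placeholder function name
    StartingPosition='LATEST',
    BatchSize=100,                     # records handed to each invocation
    ParallelizationFactor=10,          # up to 10 concurrent batches per shard
    BisectBatchOnFunctionError=True,   # split failing batches to isolate bad records
    FunctionResponseTypes=['ReportBatchItemFailures'],  # enable partial failure reporting
)

Records that share a partition key are still processed in order, even with a factor above 1.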
Configuring a Kinesis Data Stream with Terraform (HCL)
resource "aws_kinesis_stream" "ml_inference_stream" {
name = "ml-inference-events"
shard_count = 16
retention_period = 168 # 7 days in hours
shard_level_metrics = [
"IncomingBytes",
"IncomingRecords",
"OutgoingBytes",
"OutgoingRecords",
"WriteProvisionedThroughputExceeded",
"ReadProvisionedThroughputExceeded",
Key Insight
Partition Key Strategy Makes or Breaks Your Pipeline
The partition key determines which shard receives each record, and poor partition key choices are responsible for more streaming ML failures than any other design decision. A common anti-pattern is using user_id as the partition key when a small percentage of users generate most of the traffic: this creates hot shards that throttle while others sit idle.
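One mitigation is to salt only the known-hot keys so their traffic spreads over several shards, while everyone else keeps strict per-key ordering. A sketch, with the stream name and hot-user list as assumptions:

import json
import random
import boto3

kinesis = boto3.client('kinesis')
HOT_USERS = {'user_123', 'user_456'}  # assumed set of known heavy hitters

def put_event(user_id: str, payload: dict) -> None:
    # Salt hot keys across 8 sub-keys; cold keys keep strict per-user ordering.
    partition_key = f"{user_id}#{random.randint(0, 7)}" if user_id in HOT_USERS else user_id
    kinesis.put_record(
        StreamName='ml-inference-events',
        Data=json.dumps(payload),
        PartitionKey=partition_key,
    )

The trade-off: salted keys lose per-user ordering, so reserve salting for keys whose volume actually threatens a shard.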
Anti-Pattern: The Synchronous Inference Trap
Problem
During a Black Friday traffic spike, one team saw their iterator age grow from seconds to minutes because every record waited on a blocking inference call.
Solution
Implement an asynchronous inference pattern using SageMaker Async Inference or an SQS-based fan-out, so record ingestion is decoupled from model latency.
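A sketch of the async variant using invoke_endpoint_async, which reads its input from S3 instead of the request body; the bucket and endpoint names are assumptions:

import json
import uuid
import boto3

s3 = boto3.client('s3')
sm_runtime = boto3.client('sagemaker-runtime')
BUCKET = 'ml-async-requests'  # assumed staging bucket

def submit_async_inference(payload: dict) -> str:
    # Async inference reads its input from S3 rather than the request body.
    key = f"requests/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload))
    response = sm_runtime.invoke_endpoint_async(
        EndpointName='fraud-detection-async',  # assumed async-configured endpoint
        InputLocation=f"s3://{BUCKET}/{key}",
        ContentType='application/json',
    )
    return response['OutputLocation']  # poll this location, or subscribe to the SNS success topic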
Setting Up Your First Kinesis ML Pipeline
1
Estimate Throughput Requirements
2
Design Your Partition Key Strategy
3
Create the Kinesis Stream with Monitoring
4
Implement the Producer with KPL (a boto3 stand-in is sketched after this list)
5
Build the Lambda Consumer Function
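The KPL proper is a Java library with aggregation and async retries built in. As a rough boto3 stand-in for step 4, this sketch batches writes with put_records and retries only the records Kinesis rejected; the stream name and the entity_id field are assumptions:

import json
import time
import boto3

kinesis = boto3.client('kinesis')

def send_batch(events: list, stream: str = 'ml-inference-events') -> None:
    """Send up to 500 records per call, retrying only rejected (usually throttled) records."""
    records = [{'Data': json.dumps(e), 'PartitionKey': str(e['entity_id'])} for e in events]
    while records:
        response = kinesis.put_records(StreamName=stream, Records=records)
        if response['FailedRecordCount'] == 0:
            return
        # Keep only the failed records and back off briefly before retrying.
        records = [r for r, result in zip(records, response['Records']) if 'ErrorCode' in result]
        time.sleep(0.2)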
Kinesis Stream Production Readiness Checklist
Netflix
Real-Time Personalization at 500 Billion Events Per Day
The streaming ML system drives a 20% improvement in content discovery metrics compared to the previous batch-based pipeline.
Use On-Demand Mode for Unpredictable Workloads
Kinesis On-Demand mode automatically scales from zero to 200 MB/s write and 400 MB/s read throughput without capacity planning. While it costs approximately 20% more than well-utilized provisioned capacity, it eliminates the risk of throttling during traffic spikes and removes the operational burden of capacity management.
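Enabling it is one parameter at stream creation, or an in-place conversion for an existing stream; the ARN below is a placeholder:

import boto3

kinesis = boto3.client('kinesis')

# Create a stream that scales capacity automatically -- no shard_count needed.
kinesis.create_stream(
    StreamName='ml-inference-events-ondemand',
    StreamModeDetails={'StreamMode': 'ON_DEMAND'},
)

# Or convert an existing provisioned stream in place.
kinesis.update_stream_mode(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',  # placeholder ARN
    StreamModeDetails={'StreamMode': 'ON_DEMAND'},
)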
Key Insight
Lambda Batch Size Is Your Primary Latency Lever
The batch size configuration in your Kinesis-Lambda event source mapping directly controls the trade-off between latency and throughput efficiency. A batch size of 1 gives minimum latency but maximum Lambda invocation overhead: you'll pay for 1 million invocations to process 1 million records.
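Both knobs live on the event source mapping. A sketch of tuning an existing mapping (the UUID is a placeholder):

import boto3

lambda_client = boto3.client('lambda')

# Larger batches amortize invocation overhead; the batching window caps the added latency.
lambda_client.update_event_source_mapping(
    UUID='mapping-uuid-placeholder',
    BatchSize=100,                      # up to 10,000 for Kinesis sources
    MaximumBatchingWindowInSeconds=1,   # wait at most 1s to fill a batch
)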
Lambda Consumer with Batch Processing and Partial Failure Handling (Python)
import json
import boto3
import base64
from typing import Dict, List, Any

sagemaker_runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event: Dict[str, Any], context) -> Dict[str, List]:
    """Process Kinesis records with ML inference and partial failure reporting."""
    batch_item_failures = []
    for record in event['Records']:
        try:
            payload = json.loads(base64.b64decode(record['kinesis']['data']))
            sagemaker_runtime.invoke_endpoint(EndpointName='ml-inference-endpoint',
                                              ContentType='application/json',
                                              Body=json.dumps(payload))
        except Exception:
            # Report only this record; the rest of the batch commits normally
            batch_item_failures.append({'itemIdentifier': record['kinesis']['sequenceNumber']})
    return {'batchItemFailures': batch_item_failures}
Key Insight
Iterator Age Is Your Most Important Metric
The IteratorAgeMilliseconds metric tells you how far behind your consumers are from the latest data in the stream; it's the single most important indicator of streaming ML system health. An iterator age of 0 means you're processing records as fast as they arrive; an age of 60,000 (1 minute) means your predictions are based on minute-old data.
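An alarm on this metric should exist before the first production event. A sketch with boto3; the stream name and SNS topic are placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='ml-stream-iterator-age-high',
    Namespace='AWS/Kinesis',
    MetricName='GetRecords.IteratorAgeMilliseconds',
    Dimensions=[{'Name': 'StreamName', 'Value': 'ml-inference-events'}],
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=3,
    Threshold=60000,                     # alert when predictions lag more than 1 minute
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:oncall'],  # placeholder topic
)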
Practice Exercise
Build a Real-Time Anomaly Detection Pipeline
90 min
Essential Resources for Kinesis ML Architectures
AWS Big Data Blog: Real-time ML Inference with Kinesis
article
Designing Data-Intensive Applications by Martin Kleppmann
book
AWS re:Invent 2023: ANT340 - Real-Time ML at Scale
video
Kinesis Data Generator
tool
Framework
Stream Processing Decision Framework
Latency Classification
Categorize your use case: Ultra-low (<10ms) for fraud detection, Low (<100ms) for recommendations, Moderate (<1s) for monitoring and alerting.
Throughput Profiling
Analyze your data velocity patterns: steady-state volume, peak multipliers, burst duration, and record size distribution.
Anti-Pattern: The Synchronous Model Invocation Bottleneck
Problem
Iterator age grows exponentially during traffic spikes, triggering shard splitting that doesn't help because the bottleneck is the synchronous model call, not ingest capacity.
Solution
Implement asynchronous fan-out: Lambda writes inference requests to an SQS queue, and a separate consumer pool invokes the model at its own pace.
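A sketch of the write side of that fan-out; the queue URL is a placeholder, and SQS caps batch sends at 10 messages:

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/inference-requests'  # placeholder

def enqueue_inference_requests(records: list) -> None:
    # SQS accepts at most 10 messages per batch call.
    for i in range(0, len(records), 10):
        entries = [{'Id': str(n), 'MessageBody': json.dumps(r)}
                   for n, r in enumerate(records[i:i + 10])]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)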
Key Insight
Partition Key Design Determines ML Pipeline Scalability
Your Kinesis partition key strategy directly impacts ML inference quality and system scalability. Using random partition keys maximizes throughput but destroys temporal ordering needed for time-series models.
Production Streaming ML Readiness Checklist
DoorDash
Real-Time Delivery Time Prediction with Kinesis
Improved delivery time prediction accuracy by 23% and reduced customer complaints about late deliveries.
Implementing Late Arrival Handling for Streaming ML
1
Define Event Time vs. Processing Time (a sketch follows this list)
2
Establish Watermark Strategy
3
Implement Speculative Processing
4
Design Late Data Side Outputs
5
Build Reprocessing Capability
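For step 1, the two clocks come from different places: the producer embeds event time in the payload, while Kinesis stamps arrival time on the record. A sketch, assuming the producer writes an ISO-8601 event_time field:

import base64
import json
from datetime import datetime, timezone

def extract_times(record: dict) -> tuple:
    """Return (event_time, processing_time) for one Kinesis record in a Lambda event."""
    payload = json.loads(base64.b64decode(record['kinesis']['data']))
    event_time = datetime.fromisoformat(payload['event_time'])  # assumed producer-set field
    arrival = datetime.fromtimestamp(
        record['kinesis']['approximateArrivalTimestamp'], tz=timezone.utc)
    return event_time, arrival  # lateness = arrival - event_time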
Kinesis Data Streams Retention and Cost Implications
Default retention is 24 hours, extendable to 365 days, but extended retention costs $0.023 per shard-hour per day of retention. A 10-shard stream with 7-day retention costs an additional $38.64/month just for retention.
Multi-Stage Streaming ML Inference Architecture
Flow: Data Sources (IoT, APIs) → Kinesis Data Streams → Lambda (Feature Engineering) → Feature Store (DynamoDB)
Framework
SCALE Framework for Streaming ML Capacity Planning
Steady-State Analysis
Calculate baseline throughput: average records per second, average record size, and required processing time per record.
Ceiling Identification
Identify your maximum expected load: peak hour multiplier, special event spikes, and the failure-recovery backlog (the catch-up burst after an outage).
Amplification Factors
Account for processing amplification: feature lookups (3-5x latency), model inference (10-100ms added per record), and retries.
Latency Budgets
Allocate your end-to-end latency budget across stages: ingestion (10-50ms), feature engineering (20-100ms), inference, and result delivery.
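The steady-state and ceiling steps reduce to arithmetic against Kinesis's per-shard write limits of 1 MB/s and 1,000 records/s. A sketch:

import math

def required_shards(peak_records_per_sec: float, avg_record_kb: float) -> int:
    """Size for the ceiling, not the steady state: a shard ingests 1 MB/s or 1,000 records/s."""
    by_count = peak_records_per_sec / 1000.0
    by_bytes = peak_records_per_sec * avg_record_kb / 1024.0  # MB/s of ingest
    return max(1, math.ceil(max(by_count, by_bytes)))

# Example: 20,000 records/s at 2 KB each -> 40 shards (byte-bound, not count-bound).
print(required_shards(20_000, 2.0))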
47%
of streaming ML pipelines fail during their first major traffic spike
The primary causes are insufficient shard capacity (31%), Lambda concurrency limits (28%), and downstream service throttling (24%).
Enhanced Fan-Out Consumer with Backpressure Handling (Python)
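A minimal sketch of the pattern this block names, assuming the consumer was already registered with register_stream_consumer and that a separate worker thread drains the buffer; the ARN is a placeholder:

import time
import boto3

kinesis = boto3.client('kinesis')
CONSUMER_ARN = 'arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events/consumer/efo-ml:1'  # placeholder
buffer = []          # drained by a separate worker (not shown)
MAX_BUFFER = 1000

def consume_shard(shard_id: str) -> None:
    # Each registered EFO consumer gets a dedicated 2 MB/s pipe per shard over HTTP/2 push.
    subscription = kinesis.subscribe_to_shard(
        ConsumerARN=CONSUMER_ARN,
        ShardId=shard_id,
        StartingPosition={'Type': 'LATEST'})
    # A subscription lasts ~5 minutes; real code resubscribes from the continuation sequence number.
    for event in subscription['EventStream']:
        buffer.extend(event['SubscribeToShardEvent']['Records'])
        while len(buffer) > MAX_BUFFER:  # crude backpressure: stop reading until drained
            time.sleep(0.5)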
Feature Store Integration is Critical for Streaming ML Accuracy
Real-time inference often requires features that can't be computed from the current event alone: user history, rolling aggregates, or entity embeddings. Integrating a feature store (DynamoDB, ElastiCache, or SageMaker Feature Store) into your streaming pipeline is essential but adds complexity.
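A sketch of the lookup step against a DynamoDB-backed store, batching keys to cut round trips; the table and key names are assumptions:

import boto3

dynamodb = boto3.client('dynamodb')
FEATURE_TABLE = 'user-features'  # assumed feature store table

def fetch_features(user_ids: list) -> dict:
    """Batch-read rolling aggregates for up to 100 users in one call."""
    response = dynamodb.batch_get_item(
        RequestItems={FEATURE_TABLE: {
            'Keys': [{'user_id': {'S': uid}} for uid in user_ids[:100]],
        }}
    )
    # Real code must also retry response['UnprocessedKeys'].
    return {item['user_id']['S']: item for item in response['Responses'][FEATURE_TABLE]}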
Use Kinesis Data Firehose for ML Training Data Collection
While your Lambda processes records for real-time inference, attach a Kinesis Data Firehose delivery stream to the same Kinesis Data Stream to automatically archive all raw events to S3. This creates a training data lake with zero additional application code.
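A sketch of attaching that archive branch with boto3; all ARNs and the bucket are placeholders:

import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='ml-events-to-s3',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': 'arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-read-kinesis',
    },
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-write-s3',
        'BucketARN': 'arn:aws:s3:::ml-training-data-lake',
        'Prefix': 'raw-events/',  # Firehose appends YYYY/MM/DD/HH/ by default
        'BufferingHints': {'IntervalInSeconds': 300, 'SizeInMBs': 128},
        'CompressionFormat': 'GZIP',
    },
)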
Twitch
Real-Time Content Moderation with Streaming ML
Reduced time-to-moderation from 45 seconds (human-only) to under 2 seconds for clear policy violations.
Essential Resources for Streaming ML Mastery
Streaming Systems by Tyler Akidau
book
AWS Kinesis Data Streams Developer Guide
article
Flink ML Documentation
article
Real-Time Machine Learning at Uber
video
Practice Exercise
Build a Real-Time Fraud Detection Pipeline
90 min
Complete Streaming ML Lambda with Error Handling (Python)
import json
import boto3
import time
from decimal import Decimal  # DynamoDB numeric attributes require Decimal
from botocore.config import Config

# Configure clients with retry logic
config = Config(
    retries={'max_attempts': 3, 'mode': 'adaptive'},
    connect_timeout=5,
    read_timeout=10
)
sagemaker_runtime = boto3.client('sagemaker-runtime', config=config)
dynamodb = boto3.resource('dynamodb', config=config)
Production Streaming ML Deployment Checklist
Anti-Pattern: Synchronous External Calls
Problem
This pattern causes cascading failures during traffic spikes. External service latency is paid once per record, so a slow dependency multiplies across the entire stream.
Solution
Batch external calls within each Lambda invocation. Collect records into groups and issue one bulk request per group instead of one call per record.
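The grouping itself is mechanical. A sketch that scores fifty records per endpoint call, assuming an endpoint that accepts a JSON list (the endpoint name is a placeholder):

import json
import boto3

sm_runtime = boto3.client('sagemaker-runtime')

def score_in_groups(payloads: list, group_size: int = 50) -> list:
    """One endpoint call per group of records instead of one per record."""
    predictions = []
    for i in range(0, len(payloads), group_size):
        response = sm_runtime.invoke_endpoint(
            EndpointName='fraud-detection',  # assumed endpoint accepting list inputs
            ContentType='application/json',
            Body=json.dumps(payloads[i:i + group_size]),
        )
        predictions.extend(json.loads(response['Body'].read()))
    return predictions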
Practice Exercise
Implement Windowed Aggregations with Late Arrival Handling
60 min
Windowed Aggregation with Watermark Management (Python)
import json, boto3
from datetime import datetime, timedelta
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
kinesis = boto3.client('kinesis')
WINDOW_TABLE = dynamodb.Table('windowed-aggregations')
LATE_ARRIVALS_STREAM = 'late-arrivals-stream'
WATERMARK_DELAY = timedelta(minutes=5)  # tolerated event-time lag

def process_event(event_time: datetime, value: float) -> None:
    """Aggregate into a 1-minute tumbling window; route late events to a side stream."""
    if datetime.utcnow() - event_time > WATERMARK_DELAY:  # behind the watermark
        kinesis.put_record(StreamName=LATE_ARRIVALS_STREAM, PartitionKey='late',
                           Data=json.dumps({'ts': event_time.isoformat(), 'value': value}))
        return
    window_start = event_time.replace(second=0, microsecond=0)
    WINDOW_TABLE.update_item(Key={'window_start': window_start.isoformat()},
                             UpdateExpression='ADD event_count :one, value_sum :v',
                             ExpressionAttributeValues={':one': 1, ':v': Decimal(str(value))})
Anti-Pattern: The Monolithic Stream Consumer
Problem
Monolithic consumers suffer from poor maintainability and a high blast radius: a bug in any stage risks the entire pipeline.
Solution
Decompose into a pipeline of focused functions connected via Kinesis streams or SQS queues, each owning a single stage.
Essential Resources for Streaming ML Mastery
Designing Data-Intensive Applications by Martin Kleppmann
book
AWS Streaming Data Solution for Amazon Kinesis
tool
Real-Time Machine Learning with Amazon SageMaker
article
Apache Kafka: The Definitive Guide
book
Practice Exercise
Build a Multi-Model Streaming Inference Pipeline
120 min
Multi-Model Endpoint Router with Fallback Logic (Python)
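A minimal sketch of the router this block names, using a SageMaker Multi-Model Endpoint's TargetModel parameter with a fallback model; the endpoint and artifact names are assumptions:

import json
import boto3
from botocore.exceptions import ClientError

sm_runtime = boto3.client('sagemaker-runtime')
ENDPOINT = 'multi-model-endpoint'    # assumed multi-model endpoint
DEFAULT_MODEL = 'baseline.tar.gz'    # assumed always-available fallback artifact

def predict(payload: dict, segment: str) -> dict:
    body = json.dumps(payload)
    for target in (f"{segment}.tar.gz", DEFAULT_MODEL):
        try:
            response = sm_runtime.invoke_endpoint(
                EndpointName=ENDPOINT,
                TargetModel=target,  # routes to one artifact behind the shared endpoint
                ContentType='application/json',
                Body=body,
            )
            return json.loads(response['Body'].read())
        except ClientError:
            continue                 # fall through to the default model
    raise RuntimeError('all models failed for this record')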
Anti-Pattern: Unbounded State Growth
Problem
Unbounded state leads to exponential cost growth and eventual system failures. Window entries and per-entity state that are never expired accumulate until storage costs and lookup latency make the pipeline unsustainable.
Solution
Implement TTL on all state from day one. DynamoDB supports automatic TTL-based deletion at no additional cost; stamp every item with an expiry attribute when it is written.
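Enabling TTL is one API call plus an epoch-seconds attribute on every item; the table name matches the windowed-aggregation example above, and the rest is a sketch:

import time
import boto3

client = boto3.client('dynamodb')
table = boto3.resource('dynamodb').Table('windowed-aggregations')

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName='windowed-aggregations',
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'expires_at'},
)

# Per write: stamp every state item with an expiry (here, 24 hours out).
table.put_item(Item={
    'window_start': '2024-01-01T00:00',
    'event_count': 0,
    'expires_at': int(time.time()) + 24 * 3600,  # deleted automatically by DynamoDB
})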
Streaming ML Monitoring and Alerting Checklist
47%
of streaming ML pipelines fail in first production month
Nearly half of streaming ML deployments experience significant failures within the first month, primarily due to inadequate testing of failure scenarios, missing monitoring for data drift, and underestimated operational complexity.
Framework
Streaming ML Maturity Model
Level 1: Reactive Streaming
Basic Lambda consumers process Kinesis events and invoke ML endpoints. No windowing or state management.
Level 2: Stateful Processing
Windowed aggregations enable time-based features. DynamoDB maintains processing state with proper TTL policies.
Level 3: Feature Store Integration
Centralized feature store ensures training-serving consistency. Real-time features computed on streaming data.
Level 4: Continuous Learning
Models retrain automatically based on streaming data. Online learning updates model weights incrementally as new events arrive.
Practice Exercise
Implement Comprehensive Streaming ML Monitoring
75 min
Cost Optimization: Right-Size Before You Scale
Before adding Kinesis shards or Lambda concurrency, profile your existing resources. Most streaming ML pipelines waste 40-60% of compute on inefficient code.
Custom Metrics with CloudWatch Embedded Metric Format (Python)
import time
from aws_embedded_metrics import metric_scope
from aws_embedded_metrics.config import get_config

# Configure EMF
config = get_config()
config.service_name = 'streaming-ml-pipeline'
config.namespace = 'StreamingML/Production'

class MetricsCollector:
    def __init__(self):
        self.batch_start = time.time()

    @metric_scope
    def record_inference(self, latency_ms, model_version, metrics):
        """metric_scope injects `metrics` and flushes one EMF blob per call."""
        metrics.put_dimensions({'ModelVersion': model_version})
        metrics.put_metric('InferenceLatency', latency_ms, 'Milliseconds')
        metrics.put_metric('BatchAge', time.time() - self.batch_start, 'Seconds')
Testing Streaming ML: The Chaos Engineering Approach
Don't just test happy paths. Inject failures systematically: kill SageMaker endpoints during processing, introduce network latency, send malformed records, simulate DynamoDB throttling.
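A sketch of one such test, assuming the partial-failure consumer from earlier in the chapter is importable as app.lambda_handler: a malformed record must land in batchItemFailures instead of crashing the batch.

import base64
from unittest import mock

import app  # assumed module holding the lambda_handler sketched earlier

def _event(data: bytes) -> dict:
    record = {'kinesis': {'data': base64.b64encode(data).decode(),
                          'sequenceNumber': '4959033827149025660855969253'}}
    return {'Records': [record]}

def test_malformed_record_is_isolated():
    with mock.patch.object(app, 'sagemaker_runtime'):  # stub out the real endpoint
        result = app.lambda_handler(_event(b'not valid json'), context=None)
    assert result['batchItemFailures'], 'bad record should be reported, not raised'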
Chapter Complete!
Kinesis Data Streams provides the foundation for real-time ML inference at scale.
Windowing patterns are essential for streaming ML features; handle late arrivals explicitly with watermarks and side outputs.
Lambda consumers must be designed for resilience. Implement partial failure reporting, retries with backoff, and dead letter queues.
Multi-model serving with SageMaker Multi-Model Endpoints enables many models behind a single endpoint.
Next: Begin by implementing a basic streaming ML pipeline with a single model, focusing on proper error handling and monitoring