Real-Time ML on Streaming Data: The New Competitive Frontier
In an era where batch processing delays can cost millions in missed opportunities, streaming ML has become the cornerstone of modern intelligent systems. Amazon Kinesis provides the foundation for processing millions of records per second while applying machine learning models in real-time, enabling use cases from fraud detection to personalized recommendations that respond in milliseconds.
500B+
Daily events processed by Netflix through Kinesis-based streaming pipelines
Netflix's recommendation engine processes over 500 billion events daily through their streaming infrastructure, making real-time decisions about what content to surface to 230+ million subscribers.
Key Insight
Streaming ML Fundamentally Changes the Value Equation
Traditional batch ML systems operate on a simple premise: collect data, train models, deploy, and wait for the next batch cycle. Streaming ML inverts this model by making inference a continuous process that responds to events as they occur.
Batch ML vs Streaming ML: Architectural Trade-offs
Batch ML Processing
Data collected over hours/days before processing begins
Models trained and deployed on fixed schedules (daily/weekly)
Predictions available minutes to hours after events occur
Simpler infrastructure with clear processing boundaries
Streaming ML Processing
Data processed immediately as events arrive (milliseconds)
Models continuously updated or applied in real-time
Predictions available within 10-500ms of event occurrence
Complex infrastructure requiring careful state management
Framework
The STREAM Framework for Real-Time ML Architecture
Scalability Design
Plan for 10x current load from day one. Kinesis shards should be provisioned for peak throughput, not average load.
Temporal Handling
Implement explicit handling for event time vs processing time. Late arrivals are not edge cases; they are routine in distributed systems and need a defined policy.
Resilience Patterns
Build for failure at every layer. Implement dead letter queues for failed records and circuit breakers for downstream dependencies.
Exactly-Once Semantics
Understand where you need exactly-once processing vs at-least-once. True exactly-once is expensive; for most inference paths, at-least-once delivery with idempotent processing is sufficient.
Stripe
Building a Sub-100ms Fraud Detection Pipeline
The system blocks $500 million in fraudulent transactions annually while maintaining sub-100ms decision latency.
Streaming ML Architecture on AWS
Flow: Data Sources → Kinesis Data Streams → Lambda Consumers (parallel per shard) → SageMaker Endpoints
Shard Count Determines Your Parallelism Ceiling
Your Kinesis shard count directly limits how many Lambda functions can process records in parallel. With 10 shards and default settings, at most 10 concurrent Lambda invocations process your stream, regardless of your Lambda concurrency limits. The one escape hatch is the event source mapping's ParallelizationFactor, which allows up to 10 concurrent batches per shard while still preserving ordering per partition key.
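A minimal sketch of raising that ceiling with boto3; the function name, stream ARN, and settings below are placeholders rather than values from this chapter:

import boto3

lambda_client = boto3.client('lambda')

lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',  # placeholder ARN
    FunctionName='ml-inference-consumer',  # placeholder function name
    StartingPosition='LATEST',
    BatchSize=100,                     # records handed to each invocation
    ParallelizationFactor=10,          # up to 10 concurrent batches per shard
    BisectBatchOnFunctionError=True,   # split failing batches to isolate bad records
    FunctionResponseTypes=['ReportBatchItemFailures'],  # enable partial failure reporting
)

Records that share a partition key are still processed in order, even with a factor above 1.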
Configuring a Kinesis Data Stream with Terraform (HCL)
resource "aws_kinesis_stream" "ml_inference_stream" {
name = "ml-inference-events"
shard_count = 16
retention_period = 168 # 7 days in hours
shard_level_metrics = [
"IncomingBytes",
"IncomingRecords",
"OutgoingBytes",
"OutgoingRecords",
"WriteProvisionedThroughputExceeded",
"ReadProvisionedThroughputExceeded",
Key Insight
Partition Key Strategy Makes or Breaks Your Pipeline
The partition key determines which shard receives each record, and poor partition key choices are responsible for more streaming ML failures than any other design decision. A common anti-pattern is using user_id as the partition key when a small percentage of users generate most of the traffic: this creates hot shards that throttle while others sit idle.
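One mitigation is to salt only the known-hot keys so their traffic spreads over several shards, while everyone else keeps strict per-key ordering. A sketch, with the stream name and hot-user list as assumptions:

import json
import random
import boto3

kinesis = boto3.client('kinesis')
HOT_USERS = {'user_123', 'user_456'}  # assumed set of known heavy hitters

def put_event(user_id: str, payload: dict) -> None:
    # Salt hot keys across 8 sub-keys; cold keys keep strict per-user ordering.
    partition_key = f"{user_id}#{random.randint(0, 7)}" if user_id in HOT_USERS else user_id
    kinesis.put_record(
        StreamName='ml-inference-events',
        Data=json.dumps(payload),
        PartitionKey=partition_key,
    )

The trade-off: salted keys lose per-user ordering, so reserve salting for keys whose volume actually threatens a shard.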
Anti-Pattern: The Synchronous Inference Trap
Problem
During a Black Friday traffic spike, one team saw their iterator age grow from seconds to minutes because every record waited on a blocking inference call.
Solution
Implement an asynchronous inference pattern using SageMaker Async Inference or an SQS-based fan-out, so record ingestion is decoupled from model latency.
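A sketch of the async variant using invoke_endpoint_async, which reads its input from S3 instead of the request body; the bucket and endpoint names are assumptions:

import json
import uuid
import boto3

s3 = boto3.client('s3')
sm_runtime = boto3.client('sagemaker-runtime')
BUCKET = 'ml-async-requests'  # assumed staging bucket

def submit_async_inference(payload: dict) -> str:
    # Async inference reads its input from S3 rather than the request body.
    key = f"requests/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload))
    response = sm_runtime.invoke_endpoint_async(
        EndpointName='fraud-detection-async',  # assumed async-configured endpoint
        InputLocation=f"s3://{BUCKET}/{key}",
        ContentType='application/json',
    )
    return response['OutputLocation']  # poll this location, or subscribe to the SNS success topic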
Setting Up Your First Kinesis ML Pipeline
1
Estimate Throughput Requirements
2
Design Your Partition Key Strategy
3
Create the Kinesis Stream with Monitoring
4
Implement the Producer with KPL (a boto3 stand-in is sketched after this list)
5
Build the Lambda Consumer Function
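The KPL proper is a Java library with aggregation and async retries built in. As a rough boto3 stand-in for step 4, this sketch batches writes with put_records and retries only the records Kinesis rejected; the stream name and the entity_id field are assumptions:

import json
import time
import boto3

kinesis = boto3.client('kinesis')

def send_batch(events: list, stream: str = 'ml-inference-events') -> None:
    """Send up to 500 records per call, retrying only rejected (usually throttled) records."""
    records = [{'Data': json.dumps(e), 'PartitionKey': str(e['entity_id'])} for e in events]
    while records:
        response = kinesis.put_records(StreamName=stream, Records=records)
        if response['FailedRecordCount'] == 0:
            return
        # Keep only the failed records and back off briefly before retrying.
        records = [r for r, result in zip(records, response['Records']) if 'ErrorCode' in result]
        time.sleep(0.2)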
Kinesis Stream Production Readiness Checklist
Netflix
Real-Time Personalization at 500 Billion Events Per Day
The streaming ML system drives a 20% improvement in content discovery metrics compared to the previous batch-based pipeline.
Use On-Demand Mode for Unpredictable Workloads
Kinesis On-Demand mode automatically scales from zero to 200 MB/s write and 400 MB/s read throughput without capacity planning. While it costs approximately 20% more than well-utilized provisioned capacity, it eliminates the risk of throttling during traffic spikes and removes the operational burden of capacity management.
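Enabling it is one parameter at stream creation, or an in-place conversion for an existing stream; the ARN below is a placeholder:

import boto3

kinesis = boto3.client('kinesis')

# Create a stream that scales capacity automatically -- no shard_count needed.
kinesis.create_stream(
    StreamName='ml-inference-events-ondemand',
    StreamModeDetails={'StreamMode': 'ON_DEMAND'},
)

# Or convert an existing provisioned stream in place.
kinesis.update_stream_mode(
    StreamARN='arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',  # placeholder ARN
    StreamModeDetails={'StreamMode': 'ON_DEMAND'},
)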
Key Insight
Lambda Batch Size Is Your Primary Latency Lever
The batch size configuration in your Kinesis-Lambda event source mapping directly controls the trade-off between latency and throughput efficiency. A batch size of 1 gives minimum latency but maximum Lambda invocation overhead: you'll pay for 1 million invocations to process 1 million records.
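Both knobs live on the event source mapping. A sketch of tuning an existing mapping (the UUID is a placeholder):

import boto3

lambda_client = boto3.client('lambda')

# Larger batches amortize invocation overhead; the batching window caps the added latency.
lambda_client.update_event_source_mapping(
    UUID='mapping-uuid-placeholder',
    BatchSize=100,                      # up to 10,000 for Kinesis sources
    MaximumBatchingWindowInSeconds=1,   # wait at most 1s to fill a batch
)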
Lambda Consumer with Batch Processing and Partial Failure Handling (Python)
import json
import boto3
import base64
from typing import Dict, List, Any

sagemaker_runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event: Dict[str, Any], context) -> Dict[str, List]:
    """Process Kinesis records with ML inference and partial failure reporting."""
    batch_item_failures = []
    for record in event['Records']:
        try:
            payload = json.loads(base64.b64decode(record['kinesis']['data']))
            sagemaker_runtime.invoke_endpoint(EndpointName='ml-inference-endpoint',
                                              ContentType='application/json',
                                              Body=json.dumps(payload))
        except Exception:
            # Report only this record; the rest of the batch commits normally
            batch_item_failures.append({'itemIdentifier': record['kinesis']['sequenceNumber']})
    return {'batchItemFailures': batch_item_failures}
Key Insight
Iterator Age Is Your Most Important Metric
The IteratorAgeMilliseconds metric tells you how far behind your consumers are from the latest data in the stream; it's the single most important indicator of streaming ML system health. An iterator age of 0 means you're processing records as fast as they arrive; an age of 60,000 (1 minute) means your predictions are based on minute-old data.
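An alarm on this metric should exist before the first production event. A sketch with boto3; the stream name and SNS topic are placeholders:

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='ml-stream-iterator-age-high',
    Namespace='AWS/Kinesis',
    MetricName='GetRecords.IteratorAgeMilliseconds',
    Dimensions=[{'Name': 'StreamName', 'Value': 'ml-inference-events'}],
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=3,
    Threshold=60000,                     # alert when predictions lag more than 1 minute
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:oncall'],  # placeholder topic
)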
Practice Exercise
Build a Real-Time Anomaly Detection Pipeline
90 min
Essential Resources for Kinesis ML Architectures
AWS Big Data Blog: Real-time ML Inference with Kinesis
article
Designing Data-Intensive Applications by Martin Kleppmann
book
AWS re:Invent 2023: ANT340 - Real-Time ML at Scale
video
Kinesis Data Generator
tool
Framework
Stream Processing Decision Framework
Latency Classification
Categorize your use case: Ultra-low (<10ms) for fraud detection, Low (<100ms) for recommendations, Moderate (<1s) for monitoring and alerting.
Throughput Profiling
Analyze your data velocity patterns: steady-state volume, peak multipliers, burst duration, and record size distribution.
Anti-Pattern: The Synchronous Model Invocation Bottleneck
Problem
Iterator age grows exponentially during traffic spikes, triggering shard splitting that doesn't help because the bottleneck is the synchronous model call, not ingest capacity.
Solution
Implement asynchronous fan-out: Lambda writes inference requests to an SQS queue, and a separate consumer pool invokes the model at its own pace.
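A sketch of the write side of that fan-out; the queue URL is a placeholder, and SQS caps batch sends at 10 messages:

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/inference-requests'  # placeholder

def enqueue_inference_requests(records: list) -> None:
    # SQS accepts at most 10 messages per batch call.
    for i in range(0, len(records), 10):
        entries = [{'Id': str(n), 'MessageBody': json.dumps(r)}
                   for n, r in enumerate(records[i:i + 10])]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)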
Key Insight
Partition Key Design Determines ML Pipeline Scalability
Your Kinesis partition key strategy directly impacts ML inference quality and system scalability. Using random partition keys maximizes throughput but destroys temporal ordering needed for time-series models.
Production Streaming ML Readiness Checklist
DoorDash
Real-Time Delivery Time Prediction with Kinesis
Improved delivery time prediction accuracy by 23% and reduced customer complaints about late deliveries.
Implementing Late Arrival Handling for Streaming ML
1
Define Event Time vs. Processing Time (a sketch follows this list)
2
Establish Watermark Strategy
3
Implement Speculative Processing
4
Design Late Data Side Outputs
5
Build Reprocessing Capability
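For step 1, the two clocks come from different places: the producer embeds event time in the payload, while Kinesis stamps arrival time on the record. A sketch, assuming the producer writes an ISO-8601 event_time field:

import base64
import json
from datetime import datetime, timezone

def extract_times(record: dict) -> tuple:
    """Return (event_time, processing_time) for one Kinesis record in a Lambda event."""
    payload = json.loads(base64.b64decode(record['kinesis']['data']))
    event_time = datetime.fromisoformat(payload['event_time'])  # assumed producer-set field
    arrival = datetime.fromtimestamp(
        record['kinesis']['approximateArrivalTimestamp'], tz=timezone.utc)
    return event_time, arrival  # lateness = arrival - event_time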
Kinesis Data Streams Retention and Cost Implications
Default retention is 24 hours, extendable to 365 days, but extended retention costs $0.023 per shard-hour per day of retention. A 10-shard stream with 7-day retention costs an additional $38.64/month just for retention.
Multi-Stage Streaming ML Inference Architecture
Flow: Data Sources (IoT, APIs) → Kinesis Data Streams → Lambda (Feature Engineering) → Feature Store (DynamoDB)
Framework
SCALE Framework for Streaming ML Capacity Planning
Steady-State Analysis
Calculate baseline throughput: average records per second, average record size, and required processing time per record.
Ceiling Identification
Identify your maximum expected load: peak hour multiplier, special event spikes, and the failure-recovery backlog (the catch-up burst after an outage).
Amplification Factors
Account for processing amplification: feature lookups (3-5x latency), model inference (10-100ms added per record), and retries.
Latency Budgets
Allocate your end-to-end latency budget across stages: ingestion (10-50ms), feature engineering (20-100ms), inference, and result delivery.
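The steady-state and ceiling steps reduce to arithmetic against Kinesis's per-shard write limits of 1 MB/s and 1,000 records/s. A sketch:

import math

def required_shards(peak_records_per_sec: float, avg_record_kb: float) -> int:
    """Size for the ceiling, not the steady state: a shard ingests 1 MB/s or 1,000 records/s."""
    by_count = peak_records_per_sec / 1000.0
    by_bytes = peak_records_per_sec * avg_record_kb / 1024.0  # MB/s of ingest
    return max(1, math.ceil(max(by_count, by_bytes)))

# Example: 20,000 records/s at 2 KB each -> 40 shards (byte-bound, not count-bound).
print(required_shards(20_000, 2.0))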
47%
of streaming ML pipelines fail during their first major traffic spike
The primary causes are insufficient shard capacity (31%), Lambda concurrency limits (28%), and downstream service throttling (24%).
Enhanced Fan-Out Consumer with Backpressure Handling (Python)
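A minimal sketch of the pattern this block names, assuming the consumer was already registered with register_stream_consumer and that a separate worker thread drains the buffer; the ARN is a placeholder:

import time
import boto3

kinesis = boto3.client('kinesis')
CONSUMER_ARN = 'arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events/consumer/efo-ml:1'  # placeholder
buffer = []          # drained by a separate worker (not shown)
MAX_BUFFER = 1000

def consume_shard(shard_id: str) -> None:
    # Each registered EFO consumer gets a dedicated 2 MB/s pipe per shard over HTTP/2 push.
    subscription = kinesis.subscribe_to_shard(
        ConsumerARN=CONSUMER_ARN,
        ShardId=shard_id,
        StartingPosition={'Type': 'LATEST'})
    # A subscription lasts ~5 minutes; real code resubscribes from the continuation sequence number.
    for event in subscription['EventStream']:
        buffer.extend(event['SubscribeToShardEvent']['Records'])
        while len(buffer) > MAX_BUFFER:  # crude backpressure: stop reading until drained
            time.sleep(0.5)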
Feature Store Integration is Critical for Streaming ML Accuracy
Real-time inference often requires features that can't be computed from the current event alone: user history, rolling aggregates, or entity embeddings. Integrating a feature store (DynamoDB, ElastiCache, or SageMaker Feature Store) into your streaming pipeline is essential but adds complexity.
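A sketch of the lookup step against a DynamoDB-backed store, batching keys to cut round trips; the table and key names are assumptions:

import boto3

dynamodb = boto3.client('dynamodb')
FEATURE_TABLE = 'user-features'  # assumed feature store table

def fetch_features(user_ids: list) -> dict:
    """Batch-read rolling aggregates for up to 100 users in one call."""
    response = dynamodb.batch_get_item(
        RequestItems={FEATURE_TABLE: {
            'Keys': [{'user_id': {'S': uid}} for uid in user_ids[:100]],
        }}
    )
    # Real code must also retry response['UnprocessedKeys'].
    return {item['user_id']['S']: item for item in response['Responses'][FEATURE_TABLE]}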
Use Kinesis Data Firehose for ML Training Data Collection
While your Lambda processes records for real-time inference, attach a Kinesis Data Firehose delivery stream to the same Kinesis Data Stream to automatically archive all raw events to S3. This creates a training data lake with zero additional application code.
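A sketch of attaching that archive branch with boto3; all ARNs and the bucket are placeholders:

import boto3

firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='ml-events-to-s3',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': 'arn:aws:kinesis:us-east-1:123456789012:stream/ml-inference-events',
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-read-kinesis',
    },
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/firehose-write-s3',
        'BucketARN': 'arn:aws:s3:::ml-training-data-lake',
        'Prefix': 'raw-events/',  # Firehose appends YYYY/MM/DD/HH/ by default
        'BufferingHints': {'IntervalInSeconds': 300, 'SizeInMBs': 128},
        'CompressionFormat': 'GZIP',
    },
)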
Twitch
Real-Time Content Moderation with Streaming ML
Reduced time-to-moderation from 45 seconds (human-only) to under 2 seconds for clear policy violations.
Essential Resources for Streaming ML Mastery
Streaming Systems by Tyler Akidau
book
AWS Kinesis Data Streams Developer Guide
article
Flink ML Documentation
article
Real-Time Machine Learning at Uber
video
Practice Exercise
Build a Real-Time Fraud Detection Pipeline
90 min
Complete Streaming ML Lambda with Error Handling (Python)
import json
import boto3
import time
from decimal import Decimal  # DynamoDB numeric attributes require Decimal
from botocore.config import Config

# Configure clients with retry logic
config = Config(
    retries={'max_attempts': 3, 'mode': 'adaptive'},
    connect_timeout=5,
    read_timeout=10
)
sagemaker_runtime = boto3.client('sagemaker-runtime', config=config)
dynamodb = boto3.resource('dynamodb', config=config)
Production Streaming ML Deployment Checklist
Anti-Pattern: Synchronous External Calls
Problem
This pattern causes cascading failures during traffic spikes. External service latency is paid once per record, so a slow dependency multiplies across the entire stream.
Solution
Batch external calls within each Lambda invocation. Collect records into groups and issue one bulk request per group instead of one call per record.
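The grouping itself is mechanical. A sketch that scores fifty records per endpoint call, assuming an endpoint that accepts a JSON list (the endpoint name is a placeholder):

import json
import boto3

sm_runtime = boto3.client('sagemaker-runtime')

def score_in_groups(payloads: list, group_size: int = 50) -> list:
    """One endpoint call per group of records instead of one per record."""
    predictions = []
    for i in range(0, len(payloads), group_size):
        response = sm_runtime.invoke_endpoint(
            EndpointName='fraud-detection',  # assumed endpoint accepting list inputs
            ContentType='application/json',
            Body=json.dumps(payloads[i:i + group_size]),
        )
        predictions.extend(json.loads(response['Body'].read()))
    return predictions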
Practice Exercise
Implement Windowed Aggregations with Late Arrival Handling
60 min
Windowed Aggregation with Watermark Management (Python)
import json, boto3
from datetime import datetime, timedelta
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
kinesis = boto3.client('kinesis')
WINDOW_TABLE = dynamodb.Table('windowed-aggregations')
LATE_ARRIVALS_STREAM = 'late-arrivals-stream'
WATERMARK_DELAY = timedelta(minutes=5)  # tolerated event-time lag

def process_event(event_time: datetime, value: float) -> None:
    """Aggregate into a 1-minute tumbling window; route late events to a side stream."""
    if datetime.utcnow() - event_time > WATERMARK_DELAY:  # behind the watermark
        kinesis.put_record(StreamName=LATE_ARRIVALS_STREAM, PartitionKey='late',
                           Data=json.dumps({'ts': event_time.isoformat(), 'value': value}))
        return
    window_start = event_time.replace(second=0, microsecond=0)
    WINDOW_TABLE.update_item(Key={'window_start': window_start.isoformat()},
                             UpdateExpression='ADD event_count :one, value_sum :v',
                             ExpressionAttributeValues={':one': 1, ':v': Decimal(str(value))})
Anti-Pattern: The Monolithic Stream Consumer
Problem
Monolithic consumers suffer from poor maintainability and a high blast radius: a bug in any stage risks the entire pipeline.
Solution
Decompose into a pipeline of focused functions connected via Kinesis streams or SQS queues, each owning a single stage.
Essential Resources for Streaming ML Mastery
Designing Data-Intensive Applications by Martin Kleppmann
book
AWS Streaming Data Solution for Amazon Kinesis
tool
Real-Time Machine Learning with Amazon SageMaker
article
Apache Kafka: The Definitive Guide
book
Practice Exercise
Build a Multi-Model Streaming Inference Pipeline
120 min
Multi-Model Endpoint Router with Fallback Logic (Python)
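A minimal sketch of the router this block names, using a SageMaker Multi-Model Endpoint's TargetModel parameter with a fallback model; the endpoint and artifact names are assumptions:

import json
import boto3
from botocore.exceptions import ClientError

sm_runtime = boto3.client('sagemaker-runtime')
ENDPOINT = 'multi-model-endpoint'    # assumed multi-model endpoint
DEFAULT_MODEL = 'baseline.tar.gz'    # assumed always-available fallback artifact

def predict(payload: dict, segment: str) -> dict:
    body = json.dumps(payload)
    for target in (f"{segment}.tar.gz", DEFAULT_MODEL):
        try:
            response = sm_runtime.invoke_endpoint(
                EndpointName=ENDPOINT,
                TargetModel=target,  # routes to one artifact behind the shared endpoint
                ContentType='application/json',
                Body=body,
            )
            return json.loads(response['Body'].read())
        except ClientError:
            continue                 # fall through to the default model
    raise RuntimeError('all models failed for this record')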
Anti-Pattern: Unbounded State Growth
Problem
Unbounded state leads to exponential cost growth and eventual system failures. Window entries and per-entity state that are never expired accumulate until storage costs and lookup latency make the pipeline unsustainable.
Solution
Implement TTL on all state from day one. DynamoDB supports automatic TTL-based deletion at no additional cost; stamp every item with an expiry attribute when it is written.
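Enabling TTL is one API call plus an epoch-seconds attribute on every item; the table name matches the windowed-aggregation example above, and the rest is a sketch:

import time
import boto3

client = boto3.client('dynamodb')
table = boto3.resource('dynamodb').Table('windowed-aggregations')

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
client.update_time_to_live(
    TableName='windowed-aggregations',
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'expires_at'},
)

# Per write: stamp every state item with an expiry (here, 24 hours out).
table.put_item(Item={
    'window_start': '2024-01-01T00:00',
    'event_count': 0,
    'expires_at': int(time.time()) + 24 * 3600,  # deleted automatically by DynamoDB
})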
Streaming ML Monitoring and Alerting Checklist
47%
of streaming ML pipelines fail in first production month
Nearly half of streaming ML deployments experience significant failures within the first month, primarily due to inadequate testing of failure scenarios, missing monitoring for data drift, and underestimated operational complexity.
Framework
Streaming ML Maturity Model
Level 1: Reactive Streaming
Basic Lambda consumers process Kinesis events and invoke ML endpoints. No windowing or state management.
Level 2: Stateful Processing
Windowed aggregations enable time-based features. DynamoDB maintains processing state with proper TTL policies.
Level 3: Feature Store Integration
Centralized feature store ensures training-serving consistency. Real-time features computed on streaming data.
Level 4: Continuous Learning
Models retrain automatically based on streaming data. Online learning updates model weights incrementally as new events arrive.
Practice Exercise
Implement Comprehensive Streaming ML Monitoring
75 min
Cost Optimization: Right-Size Before You Scale
Before adding Kinesis shards or Lambda concurrency, profile your existing resources. Most streaming ML pipelines waste 40-60% of compute on inefficient code.
Custom Metrics with CloudWatch Embedded Metric Format (Python)
import time
from aws_embedded_metrics import metric_scope
from aws_embedded_metrics.config import get_config

# Configure EMF
config = get_config()
config.service_name = 'streaming-ml-pipeline'
config.namespace = 'StreamingML/Production'

class MetricsCollector:
    def __init__(self):
        self.batch_start = time.time()

    @metric_scope
    def record_inference(self, latency_ms, model_version, metrics):
        """metric_scope injects `metrics` and flushes one EMF blob per call."""
        metrics.put_dimensions({'ModelVersion': model_version})
        metrics.put_metric('InferenceLatency', latency_ms, 'Milliseconds')
        metrics.put_metric('BatchAge', time.time() - self.batch_start, 'Seconds')
Testing Streaming ML: The Chaos Engineering Approach
Don't just test happy paths. Inject failures systematically: kill SageMaker endpoints during processing, introduce network latency, send malformed records, simulate DynamoDB throttling.
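A sketch of one such test, assuming the partial-failure consumer from earlier in the chapter is importable as app.lambda_handler: a malformed record must land in batchItemFailures instead of crashing the batch.

import base64
from unittest import mock

import app  # assumed module holding the lambda_handler sketched earlier

def _event(data: bytes) -> dict:
    record = {'kinesis': {'data': base64.b64encode(data).decode(),
                          'sequenceNumber': '4959033827149025660855969253'}}
    return {'Records': [record]}

def test_malformed_record_is_isolated():
    with mock.patch.object(app, 'sagemaker_runtime'):  # stub out the real endpoint
        result = app.lambda_handler(_event(b'not valid json'), context=None)
    assert result['batchItemFailures'], 'bad record should be reported, not raised'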
Chapter Complete!
Kinesis Data Streams provides the foundation for real-time ML inference at scale.
Windowing patterns are essential for streaming ML features; handle late arrivals explicitly with watermarks and side outputs.
Lambda consumers must be designed for resilience. Implement partial failure reporting, retries with backoff, and dead letter queues.
Multi-model serving with SageMaker Multi-Model Endpoints enables many models behind a single endpoint.
Next: Begin by implementing a basic streaming ML pipeline with a single model, focusing on proper error handling and monitoring