Email Personalization System Architecture 🏗️

From 10K to 10M personalized emails/month with AI agents and ML pipelines

August 14, 2025
📧 Email Marketing · 🏗️ Production Architecture · 🤖 Multi-Agent · 📊 ML Pipelines · 🔒 GDPR-Ready

From prompts to production email personalization.

Monday: 3 prompts for personalization. Tuesday: automated engine code. Wednesday: team workflows. Thursday: complete production architecture, covering agent orchestration, ML pipelines, multi-tenant design, GDPR compliance, and scaling from 10K to 10M emails/month.

Key Assumptions

  • Email volume: 10K-10M sends/month across startup and enterprise tiers
  • Personalization depth: Subject lines, body content, CTAs, send-time optimization
  • Data sources: CDP (customer data platform), CRM, website events, past engagement
  • Compliance: GDPR (EU), CAN-SPAM (US), CASL (Canada), consent management required
  • Integration: SendGrid/Mailgun/AWS SES for delivery, OpenAI/Anthropic for generation
  • Latency target: <500ms p95 for personalization, <5min for batch campaigns
  • Quality bar: >90% relevance score, <2% unsubscribe rate, >25% open rate improvement

System Requirements

Functional

  • Generate personalized subject lines, body content, CTAs based on user profile + behavior
  • Trigger-based sends: Welcome series, cart abandonment, re-engagement, milestone
  • Batch campaign personalization: Segment 100K+ users, personalize each, schedule sends
  • A/B testing: Test variants, auto-select winner, roll out to remaining audience
  • Template management: Version control, preview, approval workflow, brand compliance
  • Analytics: Open/click/conversion tracking, attribution, cohort analysis
  • Consent management: GDPR opt-in/opt-out, preference center, audit trail

Non-Functional (SLOs)

  • Latency (p95): 500 ms
  • Batch processing time: 5 min
  • Availability: 99.9%
  • Personalization accuracy: 90%
  • Delivery success: 98%

💰 Cost Targets:

  • Per email: $0.002
  • LLM cost per 1K emails: $0.50
  • Infrastructure per million emails: $150
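To see what these per-unit targets imply at each end of the volume range, here is a minimal sketch that turns them into monthly ceilings. The function name and the "remaining" bucket are illustrative assumptions, not part of the original design.

```python
# Cost targets from the article; the "remaining" bucket is an illustrative assumption.
PER_EMAIL_USD = 0.002
LLM_PER_1K_EMAILS_USD = 0.5
INFRA_PER_MILLION_EMAILS_USD = 150


def monthly_budget(emails_per_month: int) -> dict:
    """Translate the per-unit targets into monthly dollar ceilings."""
    total = emails_per_month * PER_EMAIL_USD
    llm = emails_per_month / 1_000 * LLM_PER_1K_EMAILS_USD
    infra = emails_per_month / 1_000_000 * INFRA_PER_MILLION_EMAILS_USD
    return {
        "total_usd": round(total, 2),
        "llm_usd": round(llm, 2),
        "infra_usd": round(infra, 2),
        # Whatever is left would have to cover ESP fees and everything else.
        "remaining_usd": round(total - llm - infra, 2),
    }


if __name__ == "__main__":
    for volume in (10_000, 1_000_000, 10_000_000):
        print(volume, monthly_budget(volume))
```

At 10M emails/month the per-email target works out to a $20,000 ceiling, of which $5,000 is the LLM target and $1,500 the infrastructure target.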

Agent Layer

planner

L3

Decomposes campaign task into subtasks: fetch features, personalize, validate, send

🔧 FeatureStore.getFeatures(userId), TemplateManager.getTemplate(templateId), SegmentService.getUserSegment(userId)

⚡ Recovery: If feature fetch fails → use cached profile (stale up to 24h), If template missing → fallback to default template, Retry with exponential backoff (3 attempts)
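A minimal sketch of the planner's feature-fetch recovery, assuming a hypothetical `fetch` callable and a dict-like `cache` that stores `(features, stored_at)` pairs. It retries three times with exponential backoff, then falls back to a cached profile no older than 24 hours.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

MAX_STALENESS_S = 24 * 3600  # "stale up to 24h"


def get_features(
    user_id: str,
    fetch: Callable[[str], dict],             # hypothetical live lookup
    cache: Dict[str, Tuple[dict, float]],     # hypothetical cache: id -> (features, stored_at)
    attempts: int = 3,
    base_delay_s: float = 0.2,                # assumed starting delay
    sleep: Callable[[float], Any] = time.sleep,
    now: Callable[[], float] = time.time,
) -> Optional[dict]:
    """Try the live store with backoff, then a sufficiently fresh cached profile."""
    for attempt in range(attempts):
        try:
            return fetch(user_id)
        except Exception:
            # Back off 0.2s, 0.4s, ... between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay_s * (2 ** attempt))
    entry = cache.get(user_id)
    if entry is not None:
        features, stored_at = entry
        if now() - stored_at <= MAX_STALENESS_S:
            return features
    return None  # caller decides what to do when nothing usable exists
```

Injecting `sleep` and `now` keeps the function testable without real waiting.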

executor

L4

Executes personalization: generates subject, body, CTA using LLM + user features

🔧 OpenAI.createCompletion(prompt, model='gpt-4o-mini'), PromptStore.getVersionedPrompt('personalization_v3'), VariableInjector.fillTemplate(template, features)

⚡ Recovery: If LLM timeout → fallback to rule-based personalization, If low quality score (<0.7) → regenerate with different prompt, If API quota exceeded → queue for batch processing
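The executor's recovery rules can be sketched as a small loop. `llm_generate` and `score` are hypothetical callables standing in for the LLM call and the quality classifier. Falling back to rules once every prompt has been tried is an assumption; the article does not say what happens in that case.

```python
from typing import Callable, Sequence, Tuple

MIN_QUALITY = 0.7  # threshold from the recovery rules above


def rule_based(features: dict) -> str:
    """Deterministic fallback; the wording is a placeholder."""
    name = features.get("first_name", "there")
    return f"Hi {name}, here are this week's picks for you"


def personalize(
    features: dict,
    prompts: Sequence[str],
    llm_generate: Callable[[str, dict], str],
    score: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (content, source), where source is a prompt id or 'rule_based'."""
    for prompt_id in prompts:
        try:
            content = llm_generate(prompt_id, features)
        except TimeoutError:
            # LLM timeout: stop calling the model and use rules instead.
            return rule_based(features), "rule_based"
        if score(content) >= MIN_QUALITY:
            return content, prompt_id
        # Low score: loop on and regenerate with the next prompt.
    return rule_based(features), "rule_based"
```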

evaluator

L3

Validates personalized content: quality score, relevance, engagement prediction

🔧 QualityClassifier.score(content), RelevanceModel.predict(content, userFeatures), EngagementPredictor.estimateOpenRate(subject, userHistory)

⚡ Recovery: If quality score <0.7 → flag for human review, If relevance score <0.6 → reject and regenerate, If predictor unavailable → skip prediction, log warning
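As a rough illustration, the evaluator's thresholds map onto a small decision function. Checking relevance before quality is an assumption, since the article does not say which rule wins when both fail.

```python
from typing import List, Optional, Tuple

MIN_QUALITY = 0.7
MIN_RELEVANCE = 0.6


def evaluate(
    quality: float,
    relevance: float,
    predicted_open_rate: Optional[float] = None,
) -> Tuple[str, List[str]]:
    """Return (decision, warnings) for one piece of generated content."""
    warnings = []
    if predicted_open_rate is None:
        # Predictor unavailable: skip the prediction and log a warning.
        warnings.append("engagement predictor unavailable; prediction skipped")
    if relevance < MIN_RELEVANCE:
        decision = "regenerate"
    elif quality < MIN_QUALITY:
        decision = "human_review"
    else:
        decision = "pass"
    return decision, warnings
```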

guardrail

L2

Policy enforcement: PII detection, brand compliance, legal checks, consent validation

🔧 PIIDetector.scan(content), BrandComplianceChecker.validate(content, brandGuidelines), ConsentManager.checkConsent(userId, campaignType), LegalFilter.checkCANSPAM(content)

⚡ Recovery: If PII detected → auto-redact and log incident, If consent invalid → block send, notify marketer, If brand violation → reject, require manual approval
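A simplified sketch of two of the guardrail checks: consent validation, then PII redaction. It only detects email addresses with a basic regular expression, and the `consents` mapping is a hypothetical stand-in for the consent manager. A production PII detector would cover far more than this.

```python
import re
from typing import Dict, Set

# Deliberately simple pattern; real PII detection needs much broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def guardrail(
    content: str,
    user_id: str,
    campaign_type: str,
    consents: Dict[str, Set[str]],  # hypothetical: user id -> consented campaign types
) -> dict:
    """Block the send without consent; otherwise redact email addresses."""
    if campaign_type not in consents.get(user_id, set()):
        return {"action": "block", "content": None, "pii_found": 0}
    redacted, count = EMAIL_RE.subn("[REDACTED_EMAIL]", content)
    return {"action": "send", "content": redacted, "pii_found": count}
```

The `pii_found` count is what a caller would log as an incident when it is non-zero.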

trigger

L3

Monitors events and triggers campaigns: welcome, cart abandonment, re-engagement

🔧 EventStore.getRecentEvents(userId, window='7d'), TriggerRules.evaluate(event, userHistory), CampaignScheduler.checkThrottling(userId)

⚡ Recovery: If event processing fails → dead-letter queue for retry, If throttle limit reached → defer to next window, If duplicate event → deduplicate using event ID
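Deduplication and throttling can be sketched with an in-memory gate. The class name, the per-window limit, and the in-process storage are assumptions for illustration; a real deployment would presumably keep this state in Redis or a similar shared store.

```python
class TriggerGate:
    """Decide whether an event should trigger a send, be deferred, or be dropped."""

    def __init__(self, max_sends_per_window: int = 2):
        self.max_sends_per_window = max_sends_per_window
        self._handled_event_ids = set()
        self._sends_in_window = {}

    def decide(self, event_id: str, user_id: str) -> str:
        if event_id in self._handled_event_ids:
            return "duplicate"
        if self._sends_in_window.get(user_id, 0) >= self.max_sends_per_window:
            # Not marked as handled, so the event can be retried next window.
            return "deferred"
        self._handled_event_ids.add(event_id)
        self._sends_in_window[user_id] = self._sends_in_window.get(user_id, 0) + 1
        return "send"

    def start_new_window(self) -> None:
        """Reset per-user send counts when the throttling window rolls over."""
        self._sends_in_window.clear()
```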

orchestrator

L4

Coordinates all agents, manages state, handles retries, routes based on outcomes

🔧 StateManager.saveState(campaignId, state), RetryPolicy.shouldRetry(failureType, attemptCount), RoutingEngine.routeToNextAgent(currentState)

⚡ Recovery: If agent timeout → retry with circuit breaker, If cascade failure → graceful degradation (skip optional steps), If critical failure → rollback, log, alert ops
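The "retry with circuit breaker" rule could look roughly like this. The failure threshold and cool-down are assumed values, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after repeated failures; allow a trial call after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,     # assumed value
        reset_after_s: float = 30.0,    # assumed value
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls flow normally
        # Open: only allow a trial call once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

An orchestrator would call `allow()` before invoking an agent and skip optional agents whose breaker is open, which is one way to get the graceful degradation described above.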

ML Layer

Feature Store

Update: Real-time for critical (opens, clicks), Daily batch for derived (engagement score)

  • user_engagement_score (0-100, updated daily)
  • last_open_date (timestamp)
  • preferred_content_categories (array)
  • purchase_frequency (purchases/month)
  • avg_time_to_open (minutes)
  • device_type (mobile/desktop/tablet)
  • timezone (for send-time optimization)

Model Registry

Strategy: Semantic versioning, A/B test new versions, auto-rollback if metrics degrade >5%
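One way to read the auto-rollback rule, assuming the tracked metric is "higher is better" (open rate, for example) and that the 5% is relative to the baseline version:

```python
def should_rollback(baseline: float, candidate: float, max_degradation: float = 0.05) -> bool:
    """True when the candidate metric is more than 5% (relative) below the baseline."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - candidate) / baseline > max_degradation
```

For a baseline open rate of 30%, a candidate at 28% would trigger a rollback (about 6.7% worse), while 29% would not (about 3.3% worse).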

  • gpt-4o-mini
  • quality_classifier_v2
  • engagement_predictor_v1
  • pii_detector_v3

Observability

Metrics

  • 📊 email_send_rate
  • 📊 personalization_latency_p95_ms
  • 📊 llm_token_usage
  • 📊 llm_cost_per_email_usd
  • 📊 quality_score_avg
  • 📊 relevance_score_avg
  • 📊 open_rate_percent
  • 📊 click_rate_percent
  • 📊 unsubscribe_rate_percent
  • 📊 bounce_rate_percent
  • 📊 agent_success_rate
  • 📊 agent_retry_count
  • 📊 guardrail_rejection_rate
  • 📊 pii_detection_count
  • 📊 consent_rejection_rate
  • 📊 api_error_rate
  • 📊 database_query_latency_ms

Dashboards

  • 📈 ops_dashboard
  • 📈 marketing_dashboard
  • 📈 ml_dashboard
  • 📈 cost_dashboard
  • 📈 compliance_dashboard

Traces

✅ Enabled

Deployment Variants

🚀 Startup

Infrastructure:

  • AWS Lambda (serverless, pay-per-use)
  • API Gateway (managed)
  • RDS PostgreSQL (single AZ)
  • ElastiCache Redis (single node)
  • S3 + CloudFront
  • SendGrid (shared IP pool)
  • OpenAI API (shared)

Fast to deploy (<1 week)

Low fixed cost ($150/mo base)

Scales automatically to 100K emails/mo

Single region (us-east-1)

Basic auth (OIDC via Auth0)

No VPC, uses public endpoints

Monitoring via CloudWatch

🏢 Enterprise

Infrastructure:

  • EKS (Kubernetes, multi-region)
  • Kafka + Confluent (event streaming)
  • Aurora PostgreSQL Global Database
  • Redis Global Datastore
  • VPC with private subnets
  • Dedicated IP pools (SendGrid/Mailgun)
  • Multi-LLM (OpenAI + Anthropic + Azure OpenAI)
  • Customer-managed KMS keys
  • WAF + DDoS protection
  • SSO/SAML (Okta/Azure AD)

Multi-tenant with tenant isolation

Data residency (EU/US/APAC)

VPC peering with customer networks

99.99% SLA with multi-region failover

Advanced RBAC + audit logging

Dedicated support + SRE team

Cost: $8K-20K/mo depending on volume

📈 Migration: Phase 1: Migrate to ECS Fargate (containerized). Phase 2: Add Kafka for event streaming. Phase 3: Move to EKS with multi-region. Phase 4: Add enterprise features (SSO, VPC, data residency). Estimated timeline: 3-6 months.

Risks & Mitigations

⚠️ LLM cost explosion (10x traffic spike)

Medium

✓ Mitigation: Set cost guardrails: $1K/day budget alert, $2K/day hard limit. Auto-throttle to rule-based personalization if exceeded. Monthly cost reviews.
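The budget guardrail reduces to two thresholds. A minimal sketch, with a hypothetical function name and return shape:

```python
from typing import Tuple

ALERT_USD_PER_DAY = 1_000
HARD_LIMIT_USD_PER_DAY = 2_000


def generation_mode(llm_spend_today_usd: float) -> Tuple[str, bool]:
    """Return (mode, alert) given today's LLM spend so far."""
    if llm_spend_today_usd >= HARD_LIMIT_USD_PER_DAY:
        return "rule_based", True   # hard limit: throttle to rule-based personalization
    if llm_spend_today_usd >= ALERT_USD_PER_DAY:
        return "llm", True          # keep generating, but raise the budget alert
    return "llm", False
```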

⚠️ Email deliverability drop (spam complaints)

Medium

✓ Mitigation: Monitor spam complaint rate (<0.1% threshold). Implement feedback loops with ESPs. Double opt-in required. Auto-suppress complainers. Warm up new IPs gradually.

⚠️ GDPR violation (consent not checked)

Low

✓ Mitigation: Guardrail agent checks consent before every send. Audit trail for all consent changes. Annual compliance audit. Legal review of all templates.

⚠️ PII leak (LLM logs contain emails)

Medium

✓ Mitigation: Auto-redact PII before LLM calls. No PII in application logs. Encrypt all logs at rest. Access controls on log storage. Regular security audits.

⚠️ Model drift (open rates decline over time)

High

✓ Mitigation: Daily drift detection. Auto-retrain trigger if performance drops >10%. A/B test new models before rollout. Rollback policy if metrics degrade.

⚠️ Dependency failure (SendGrid outage)

Low

✓ Mitigation: Multi-ESP failover (SendGrid → Mailgun → SES). Health checks every 60s. Auto-switch on 3 consecutive failures. SQS queue for retry.
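The failover chain can be sketched as a small state machine that advances to the next provider after three consecutive failures. The class and method names are illustrative. Failing back to the primary once it recovers is left out because the article does not describe it.

```python
from typing import Sequence


class EspFailover:
    """Track consecutive failures and advance through an ordered provider list."""

    def __init__(self, providers: Sequence[str], max_consecutive_failures: int = 3):
        self.providers = list(providers)
        self.max_consecutive_failures = max_consecutive_failures
        self.active_index = 0
        self.consecutive_failures = 0

    @property
    def current(self) -> str:
        return self.providers[self.active_index]

    def report(self, ok: bool) -> None:
        """Feed in the result of a send or of a health check."""
        if ok:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        has_next = self.active_index < len(self.providers) - 1
        if self.consecutive_failures >= self.max_consecutive_failures and has_next:
            self.active_index += 1
            self.consecutive_failures = 0


router = EspFailover(["sendgrid", "mailgun", "ses"])
```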

⚠️ Agent cascade failure (all agents timeout)

Low

✓ Mitigation: Circuit breaker per agent. Graceful degradation (skip optional agents). Fallback to rule-based. Auto-scale on high latency. Ops alerts on cascade.

Evolution Roadmap

Phase 1: MVP (0-3 months)

Weeks 1-12

  • Ship basic personalization (subject + body)
  • Integrate SendGrid + OpenAI
  • Implement 2 trigger types (welcome, cart abandonment)
  • Basic analytics dashboard

Phase 2: Scale (3-6 months)

Weeks 13-26

  • Scale to 1M emails/month
  • Add A/B testing
  • Implement send-time optimization
  • Multi-ESP failover

Phase 3: Enterprise (6-12 months)

Weeks 27-52

  • Multi-tenant architecture
  • SSO/SAML integration
  • Data residency (EU/US)
  • 99.99% SLA

Complete Systems Architecture

9-layer architecture from presentation to security

Presentation
Marketing Dashboard
Campaign Builder UI
Template Editor
Analytics Dashboard
Preference Center
API Gateway
Load Balancer (ALB/NLB)
Rate Limiter (per tenant)
Auth Gateway (OIDC/SAML)
API Versioning
Request Validator
Agent Layer
Planner Agent
Executor Agent
Evaluator Agent
Guardrail Agent
Personalization Agent
Trigger Agent
Orchestrator
ML Layer
Feature Store (user profiles, engagement history)
Model Registry (LLMs, classifiers, rerankers)
Offline Training Pipeline
Online Inference Service
Evaluation Loop
Prompt Store (versioned)
Integration
SendGrid/Mailgun Adapter
CDP Connector (Segment/mParticle)
CRM Sync (Salesforce/HubSpot)
Webhook Manager
Event Bus (Kafka/SQS)
Data
PostgreSQL (campaigns, templates, users)
Redis (session cache, rate limits)
S3 (email assets, logs, backups)
Vector DB (embeddings for similarity)
TimescaleDB (analytics time-series)
External
OpenAI/Anthropic API
SendGrid/Mailgun API
CDP API
CRM API
Analytics Trackers (PostHog/Mixpanel)
Observability
Metrics (Prometheus/CloudWatch)
Logs (ELK/CloudWatch Logs)
Traces (Jaeger/X-Ray)
Dashboards (Grafana/Datadog)
Alerts (PagerDuty/Opsgenie)
Security
WAF (Cloudflare/AWS WAF)
KMS (encryption keys)
Secrets Manager
Audit Logger
PII Redaction Service
DDoS Protection

Sequence Diagram - Personalized Email Send Flow

Participants: Marketer, API Gateway, Planner Agent, Feature Store, Personalization Agent, Evaluator Agent, Guardrail Agent, SendGrid, Analytics

Messages, in order:

  1. POST /campaigns/trigger (userId, template)
  2. Decompose task: fetch profile, personalize, validate, send
  3. GET /features (userId) → {age, location, past_purchases, engagement_score}
  4. Return user features JSON
  5. Generate(template, features) → personalized content
  6. Evaluate(content) → quality score, relevance check
  7. Check(content) → PII scan, brand compliance, policy validation
  8. PASS → safe to send
  9. POST /send (to, subject, body, tracking_params)
  10. Webhook: email.delivered
  11. Dashboard update: sent +1

Email Personalization - Agent Orchestration

6 Components: Orchestrator Agent (4 capabilities), Planner Agent (3 capabilities), Trigger Agent (3 capabilities), Executor Agent (4 capabilities), Evaluator Agent (4 capabilities), Guardrail Agent (4 capabilities)

Messages:

  • [Event] Campaign trigger
  • [RPC] Campaign request
  • [Response] Execution plan
  • [RPC] Personalize content
  • [Response] Personalized email
  • [RPC] Validate content
  • [RPC] Policy check
  • [Response] Quality score
  • [Response] Compliance result

Protocol legend: HTTP, REST, gRPC, Event, Stream, WebSocket

Email Personalization - External Integrations

9 Components: Core System (4 capabilities), CRM Platform (4 capabilities), CDP (Customer Data Platform) (4 capabilities), LLM Service (3 capabilities), Email Service Provider (4 capabilities), Analytics Platform (4 capabilities), Consent Management (4 capabilities), Feature Store (4 capabilities), Admin Dashboard (4 capabilities)

Messages:

  • [Webhook] Campaign triggers
  • [REST] Engagement updates
  • [gRPC] Feature request
  • [Response] User features
  • [REST] Generation prompt
  • [Response] Personalized content
  • [REST] Consent check
  • [SMTP] Email payload
  • [Webhook] Delivery events
  • [Event] Engagement metrics
  • [HTTP] Campaign config
  • [WebSocket] Real-time status

Protocol legend: HTTP, REST, gRPC, Event, Stream, WebSocket

Data Flow - Triggered Email Journey

From event to inbox in <1 second

  1. User (0ms): Abandons cart → Event: cart.abandoned
  2. Trigger Agent (20ms): Detects event, checks rules → triggerDecision: send cart_abandonment_v2
  3. Planner Agent (50ms): Plans: fetch profile → personalize → validate → send → taskPlan: 4 steps
  4. Feature Store (80ms): Fetches user features → JSON: {cartItems, engagementScore, lastPurchase}
  5. Personalization Agent (300ms): Generates subject + body with cart items → Personalized content
  6. Evaluator Agent (150ms): Scores quality (0.89) + relevance (0.92) → evaluation: PASS
  7. Guardrail Agent (100ms): Scans PII, checks consent, validates brand → guardrailResult: PASS
  8. Orchestrator (10ms): Routes to SendGrid adapter → emailPayload: {to, subject, body, tracking}
  9. SendGrid (200ms): Delivers email → messageId: abc123
  10. Analytics (50ms): Logs send event → Stored in TimescaleDB
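The per-stage figures above can be added up to check the "<1 second" claim. This sketch assumes the stages run strictly one after another, which the journey implies but does not state.

```python
# Per-stage latencies (ms) copied from the journey above, in execution order.
STAGE_LATENCY_MS = {
    "User": 0,
    "Trigger Agent": 20,
    "Planner Agent": 50,
    "Feature Store": 80,
    "Personalization Agent": 300,
    "Evaluator Agent": 150,
    "Guardrail Agent": 100,
    "Orchestrator": 10,
    "SendGrid": 200,
    "Analytics": 50,
}


def total_latency_ms(stages: dict) -> int:
    """End-to-end latency if every stage runs sequentially."""
    return sum(stages.values())


def latency_before(stages: dict, stage_name: str) -> int:
    """Cumulative latency of all stages that run before the named one."""
    total = 0
    for name, ms in stages.items():
        if name == stage_name:
            return total
        total += ms
    raise KeyError(stage_name)


def slowest_stage(stages: dict) -> tuple:
    return max(stages.items(), key=lambda item: item[1])
```

The stages sum to 960 ms end to end, with the Personalization Agent as the largest contributor at 300 ms. The stages before the SendGrid call total 710 ms, which would sit above the 500 ms p95 personalization target if that target is meant to cover the whole pre-send pipeline.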

Scaling Patterns

0-100K emails/month

  • Pattern: Serverless Monolith
  • Architecture: AWS Lambda (Node.js/Python), API Gateway, RDS PostgreSQL, ElastiCache Redis, S3 for assets, SendGrid API
  • Cost: $150/month
  • Latency: 500-800ms p95

100K-1M emails/month

  • Pattern: Queue + Workers
  • Architecture: ECS Fargate (API + workers), SQS for job queue, RDS Multi-AZ, ElastiCache cluster, S3 + CloudFront, SendGrid + Mailgun
  • Cost: $600/month
  • Latency: 300-500ms p95

1M-10M emails/month

  • Pattern: Multi-Agent Orchestration
  • Architecture: EKS (Kubernetes), Kafka event streaming, Aurora PostgreSQL, Redis Cluster, Vector DB (Pinecone), Multi-LLM (OpenAI + Anthropic)
  • Cost: $2,500/month
  • Latency: 200-400ms p95

10M+ emails/month

  • Pattern: Enterprise Multi-Region
  • Architecture: Multi-region K8s, Kafka + Confluent, Aurora Global Database, Redis Global Datastore, Multi-region CDN, Dedicated IP pools per region
  • Cost: $8,000+/month
  • Latency: 100-300ms p95

Key Integrations

SendGrid/Mailgun

Protocol: REST API + Webhooks
POST /send with personalized payload
Receive messageId
Webhook: delivered/opened/clicked/bounced
Store events in TimescaleDB

Customer Data Platform (Segment/mParticle)

Protocol: REST API + Event Streaming
Subscribe to user events (track, identify, page)
Enrich user profiles in Feature Store
Trigger campaigns based on events
Send email events back to CDP for attribution

CRM (Salesforce/HubSpot)

Protocol: REST API + Webhooks
Sync contacts daily (incremental)
Enrich with CRM data (deal stage, lead score)
Trigger campaigns on CRM events (deal won, lead scored)
Push email engagement back to CRM

OpenAI/Anthropic API

Protocol: REST API
POST /chat/completions with prompt + user context
Receive generated content
Log token usage for cost tracking
Fallback to Anthropic if OpenAI quota exceeded

Security & Compliance

Failure Modes & Fallbacks

Failure | Fallback | Impact | SLA
LLM API down (OpenAI timeout) | Anthropic API → Rule-based personalization | Degraded quality, not broken | 99.5%
Feature Store unavailable | Use cached features (stale up to 24h) | Slightly less personalized | 99.9%
SendGrid rate limit exceeded | Mailgun API → Queue for retry | Delayed send (up to 5min) | 99.0%
Guardrail agent rejects content (PII detected) | Auto-redact → Regenerate → Human review queue | Send blocked until safe | 100% (safety first)
Database connection pool exhausted | Read replica → Queue writes | Read-only mode, writes delayed | 99.9%
Campaign A/B test inconclusive | Extend test duration → Use default variant | Suboptimal variant selection | 95%
User consent revoked mid-campaign | Immediate suppression → Refund send credit | Email not sent, user protected | 100%

Advanced ML/AI Patterns

Production ML engineering beyond basic API calls

RAG vs Fine-Tuning

Email best practices evolve rapidly. RAG allows daily updates to knowledge base (industry benchmarks, seasonal trends, A/B test results) without retraining. Fine-tuning would require quarterly retraining at $5K/cycle.
✅ RAG (Chosen)

  • Cost: $200/mo (vector DB + embeddings)
  • Update: Daily
  • How: Add new email examples to vector DB, retrieve top-K similar

❌ Fine-Tuning

  • Cost: $5K/quarter
  • Update: Quarterly
  • How: Retrain GPT-4 on 10K+ labeled examples

Implementation: Pinecone vector DB with 50K high-performing email examples. Embed user profile + campaign goal, retrieve top-5 similar emails, inject into prompt context. Refresh DB weekly with new A/B test winners.
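The retrieval step can be illustrated without any vector database. This sketch ranks an in-memory list of `(text, embedding)` pairs by cosine similarity and returns the top K. In the setup described above, Pinecone would do this ranking and the embeddings would come from an embedding model. The tiny two-dimensional vectors here are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[str]:
    """Return the k example texts whose embeddings are closest to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(goal: str, examples: List[str]) -> str:
    """Inject the retrieved examples into the prompt context."""
    shots = "\n".join(f"- {e}" for e in examples)
    return f"Campaign goal: {goal}\nHigh-performing examples:\n{shots}\nWrite a new email."
```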

Hallucination Detection

LLMs hallucinate product names, prices, URLs, dates
  • L1: Confidence scores (<0.7 = flag for review)
  • L2: Cross-reference product catalog (validate SKUs, prices)
  • L3: URL validation (check all links are valid, not hallucinated)
  • L4: Date sanity checks (no past events in future tense)
  • L5: Human review queue (0.5% of emails flagged)

0.2% hallucination rate, 100% caught before send
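Two of the checks above, catalog cross-referencing and URL validation, can be sketched as follows. The function assumes product mentions have already been extracted into `(sku, price)` pairs and that valid links must point at an allow-listed host. Both are simplifying assumptions, and the names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s)>\"']+")


def find_issues(
    content: str,
    mentioned_items: Iterable[Tuple[str, float]],  # (sku, price) pairs found in the email
    catalog: Dict[str, float],                     # sku -> current price
    allowed_hosts: Set[str],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    issues = []
    for sku, price in mentioned_items:
        if sku not in catalog:
            issues.append(f"unknown sku {sku}")
        elif abs(catalog[sku] - price) > 0.005:
            issues.append(f"price mismatch {sku}")
    for url in URL_RE.findall(content):
        host = urlparse(url).hostname
        if host not in allowed_hosts:
            issues.append(f"unexpected host {host}")
    return issues
```

Any non-empty result would route the email to regeneration or the human review queue rather than to send.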

Evaluation Framework

  • Content Quality Score: 0.89 (target: >0.85)
  • Relevance Score: 0.87 (target: >0.80)
  • Open Rate Lift: +32% (target: +25%)
  • Click Rate Lift: +19% (target: +15%)
  • Unsubscribe Rate: 1.3% (target: <2%)
  • Hallucination Rate: 0.2% (target: <0.5%)

Testing: Shadow mode: 10K emails sent in parallel (AI vs human-written). AI wins 68% of time on open rate, 71% on click rate. Human eval: Cohen's Kappa 0.91 (near-perfect agreement).
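For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal sketch with made-up labels, not the article's evaluation data:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```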

Dataset Curation

  1. Collect: 100K historical emails. Export from SendGrid, anonymize
  2. Filter: 50K high-performers (open rate >30%). Keep top quartile by engagement
  3. Label: 10K manually labeled. $20K (professional copywriters)
  4. Augment: +5K synthetic (GPT-4 generated variations). Paraphrase + A/B test variants
  5. Validate: 15K final dataset

15K high-quality examples used for RAG retrieval and fine-tuning experiments

Agentic RAG

Agent iteratively retrieves based on reasoning, not one-shot
User is high-value customer in electronics category → RAG retrieves electronics email examples → Agent reasons 'need urgency' → RAG retrieves limited-time-offer examples → Agent reasons 'need social proof' → RAG retrieves testimonial snippets → Final email combines all context
💡 Multi-hop reasoning, agent decides what else it needs to know. 12% higher relevance score vs one-shot RAG.
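The multi-hop loop can be sketched with two injected callables: `retrieve`, which stands in for the vector search, and `next_query`, which stands in for the agent's reasoning step and returns `None` when it has enough context. Both names and the hop limit are assumptions for illustration.

```python
from typing import Callable, List, Optional


def agentic_retrieve(
    initial_query: str,
    retrieve: Callable[[str], List[str]],
    next_query: Callable[[List[str]], Optional[str]],
    max_hops: int = 3,
) -> List[str]:
    """Retrieve, let the agent decide what is still missing, and repeat."""
    context: List[str] = []
    query: Optional[str] = initial_query
    for _ in range(max_hops):
        context.extend(retrieve(query))
        query = next_query(context)  # agent reasons over what it has so far
        if query is None:
            break
    return context
```

In the example above, the hops would be "electronics", then "limited-time offer", then "testimonial", with the final email generated from the combined context.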

Send-Time Optimization

Tech Stack Summary

  • LLMs: OpenAI (GPT-4o-mini), Anthropic (Claude 3.5), Azure OpenAI (enterprise)
  • Agent Orchestration: LangGraph, LangChain
  • Database: PostgreSQL (RDS/Aurora), TimescaleDB (time-series analytics)
  • Cache: Redis (ElastiCache), Redis Cluster (enterprise)
  • Queue: SQS (startup), Kafka (enterprise)
  • Compute: Lambda (startup), ECS Fargate (mid-tier), EKS (enterprise)
  • Email Delivery: SendGrid, Mailgun, AWS SES
  • Vector DB: Pinecone, Weaviate
  • Monitoring: CloudWatch, Datadog, Grafana, Sentry
  • Security: AWS KMS, Secrets Manager, WAF, GuardDuty
  • Auth: Auth0 (startup), Okta/Azure AD (enterprise)