From prompts to production email personalization.
Monday: 3 prompts for personalization. Tuesday: automated engine code. Wednesday: team workflows. Thursday: complete production architecture, covering agent orchestration, ML pipelines, multi-tenant design, GDPR compliance, and scaling from 10K to 10M emails/month.
Key Assumptions
- Email volume: 10K-10M sends/month across startup and enterprise tiers
- Personalization depth: Subject lines, body content, CTAs, send-time optimization
- Data sources: CDP (customer data platform), CRM, website events, past engagement
- Compliance: GDPR (EU), CAN-SPAM (US), CASL (Canada), consent management required
- Integration: SendGrid/Mailgun/AWS SES for delivery, OpenAI/Anthropic for generation
- Latency target: <500ms p95 for personalization, <5min for batch campaigns
- Quality bar: >90% relevance score, <2% unsubscribe rate, >25% open rate improvement
System Requirements
Functional
- Generate personalized subject lines, body content, CTAs based on user profile + behavior
- Trigger-based sends: Welcome series, cart abandonment, re-engagement, milestone
- Batch campaign personalization: Segment 100K+ users, personalize each, schedule sends
- A/B testing: Test variants, auto-select winner, roll out to remaining audience
- Template management: Version control, preview, approval workflow, brand compliance
- Analytics: Open/click/conversion tracking, attribution, cohort analysis
- Consent management: GDPR opt-in/opt-out, preference center, audit trail
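The "auto-select winner" step in the A/B testing requirement can be sketched as a two-proportion z-test on open rates. This is a minimal illustration, not the system's actual selection logic: the function name `ab_winner`, the 5% significance level, and the choice of open rate as the metric are all assumptions.

```python
import math
from typing import Optional

def ab_winner(opens_a: int, sends_a: int, opens_b: int, sends_b: int,
              alpha: float = 0.05) -> Optional[str]:
    """Return "A" or "B" if one variant's open rate is significantly
    higher, or None if the test is inconclusive at level `alpha`."""
    p_a = opens_a / sends_a
    p_b = opens_b / sends_b
    # Pooled open rate under the null hypothesis that A and B are equal.
    pooled = (opens_a + opens_b) / (sends_a + sends_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if se == 0:
        return None
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    if p_value >= alpha:
        return None
    return "B" if z > 0 else "A"
```

A `None` result corresponds to the inconclusive case, where the caller would extend the test or fall back to the default variant.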
Non-Functional (SLOs)
💰 Cost Targets: {"per_email_usd":0.002,"llm_cost_per_1k_emails_usd":0.5,"infrastructure_per_million_emails_usd":150}
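To see what these per-unit targets imply at each end of the volume range, the arithmetic can be sketched as follows. The function name `monthly_budget` and the "other" bucket (whatever remains after LLM and infrastructure spend) are assumptions made for illustration.

```python
# Cost targets copied from the Cost Targets line above.
COST_TARGETS = {
    "per_email_usd": 0.002,
    "llm_cost_per_1k_emails_usd": 0.5,
    "infrastructure_per_million_emails_usd": 150,
}

def monthly_budget(emails_per_month: int) -> dict:
    """Translate the per-unit targets into monthly dollar ceilings."""
    total = emails_per_month * COST_TARGETS["per_email_usd"]
    llm = emails_per_month / 1_000 * COST_TARGETS["llm_cost_per_1k_emails_usd"]
    infra = (emails_per_month / 1_000_000
             * COST_TARGETS["infrastructure_per_million_emails_usd"])
    return {
        "total_usd": round(total, 2),
        "llm_usd": round(llm, 2),
        "infra_usd": round(infra, 2),
        # Assumed bucket: ESP fees and everything else under the total cap.
        "other_usd": round(total - llm - infra, 2),
    }

for volume in (10_000, 10_000_000):
    print(volume, monthly_budget(volume))
```

At 10M emails/month the targets work out to a $20,000 total ceiling, of which $5,000 is LLM spend and $1,500 is infrastructure.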
Agent Layer
planner
L3 · Decomposes campaign task into subtasks: fetch features, personalize, validate, send
🔧 FeatureStore.getFeatures(userId), TemplateManager.getTemplate(templateId), SegmentService.getUserSegment(userId)
⚡ Recovery: If feature fetch fails → use cached profile (stale up to 24h), If template missing → fallback to default template, Retry with exponential backoff (3 attempts)
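The planner's recovery path (retry with exponential backoff, then fall back to a cached profile no older than 24h) might look like the sketch below. The function `fetch_with_recovery`, the cache record shape (`features`, `age_s`), and the base delay are hypothetical; only the 3 attempts and the 24h staleness limit come from the text above.

```python
import random
import time
from typing import Callable, Optional, Tuple

def fetch_with_recovery(
    fetch: Callable[[], dict],
    cached: Optional[dict],          # assumed shape: {"features": ..., "age_s": ...}
    attempts: int = 3,
    base_delay_s: float = 0.2,       # assumed starting delay
    max_stale_s: int = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[dict, str]:
    """Try the live fetch up to `attempts` times, then use the cache."""
    for attempt in range(attempts):
        try:
            return fetch(), "fresh"
        except Exception:
            if attempt < attempts - 1:
                # Delay doubles each attempt; jitter avoids synchronized retries.
                sleep(base_delay_s * 2 ** attempt + random.uniform(0, base_delay_s))
    if cached is not None and cached["age_s"] <= max_stale_s:
        return cached["features"], "cached"
    raise RuntimeError("features unavailable and no usable cached profile")
```

Returning the source tag alongside the features lets downstream agents record when an email was personalized from stale data.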
executor
L4 · Executes personalization: generates subject, body, CTA using LLM + user features
🔧 OpenAI.createCompletion(prompt, model='gpt-4o-mini'), PromptStore.getVersionedPrompt('personalization_v3'), VariableInjector.fillTemplate(template, features)
⚡ Recovery: If LLM timeout → fallback to rule-based personalization, If low quality score (<0.7) → regenerate with different prompt, If API quota exceeded → queue for batch processing
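A minimal sketch of the executor's fallback chain follows. The LLM call and the quality scorer are passed in as plain callables so that no real provider API is assumed; `llm_generate`, `score`, the rule-based wording, and the `personalization_v3_alt` prompt name are hypothetical stand-ins.

```python
from typing import Callable, Tuple

def rule_based_subject(features: dict) -> str:
    """Deterministic fallback that needs no LLM call."""
    categories = features.get("preferred_content_categories") or []
    if categories:
        return f"New in {categories[0]}: picks for you"
    return "New picks for you"

def personalize_subject(
    features: dict,
    llm_generate: Callable[[str, dict], str],  # hypothetical LLM wrapper
    score: Callable[[str], float],             # hypothetical quality scorer
    prompt_versions: Tuple[str, ...] = ("personalization_v3", "personalization_v3_alt"),
    min_quality: float = 0.7,
) -> Tuple[str, str]:
    """Return (subject, source), where source is a prompt name or "rule_based"."""
    for prompt in prompt_versions:
        try:
            candidate = llm_generate(prompt, features)
        except TimeoutError:
            # LLM timeout: fall back to rule-based personalization.
            return rule_based_subject(features), "rule_based"
        if score(candidate) >= min_quality:
            return candidate, prompt
        # Low quality score: loop on to regenerate with a different prompt.
    return rule_based_subject(features), "rule_based"
```

Quota handling (queueing for batch processing) is left out here because it depends on the queueing infrastructure in use.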
evaluator
L3 · Validates personalized content: quality score, relevance, engagement prediction
🔧 QualityClassifier.score(content), RelevanceModel.predict(content, userFeatures), EngagementPredictor.estimateOpenRate(subject, userHistory)
⚡ Recovery: If quality score <0.7 → flag for human review, If relevance score <0.6 → reject and regenerate, If predictor unavailable → skip prediction, log warning
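The evaluator's thresholds can be expressed as a small routing function. The source does not say which rule wins when both scores are low, so checking relevance first is an assumption, as are the names `Verdict` and `evaluate`.

```python
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger("evaluator")

class Verdict(Enum):
    APPROVE = "approve"
    HUMAN_REVIEW = "human_review"
    REGENERATE = "regenerate"

def evaluate(quality: float, relevance: float,
             predicted_open_rate: Optional[float] = None) -> Verdict:
    """Route content using the 0.7 quality and 0.6 relevance thresholds."""
    if predicted_open_rate is None:
        # Predictor unavailable: skip the prediction and log a warning.
        logger.warning("engagement predictor unavailable; skipping prediction")
    if relevance < 0.6:
        return Verdict.REGENERATE      # assumed to take precedence
    if quality < 0.7:
        return Verdict.HUMAN_REVIEW
    return Verdict.APPROVE
```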
guardrail
L2 · Policy enforcement: PII detection, brand compliance, legal checks, consent validation
🔧 PIIDetector.scan(content), BrandComplianceChecker.validate(content, brandGuidelines), ConsentManager.checkConsent(userId, campaignType), LegalFilter.checkCANSPAM(content)
⚡ Recovery: If PII detected → auto-redact and log incident, If consent invalid → block send, notify marketer, If brand violation → reject, require manual approval
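As a rough illustration of the guardrail ordering (consent first, then PII redaction), consider the sketch below. A single email-address regex is a deliberately simplified stand-in for a real PII detector, and the function `guard` and its return shape are hypothetical.

```python
import re

# Simplified pattern for email addresses only; a production PII detector
# would also cover phone numbers, postal addresses, IDs, and so on.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def guard(content: str, has_consent: bool) -> dict:
    """Block the send without consent; otherwise redact detected PII."""
    if not has_consent:
        # Consent invalid: never send, regardless of content.
        return {"action": "block", "content": None, "pii_found": 0}
    redacted, count = EMAIL_RE.subn("[REDACTED]", content)
    return {"action": "send", "content": redacted, "pii_found": count}

print(guard("Reach me at jane@example.com today", has_consent=True))
```

A non-zero `pii_found` is the signal a caller would use to log the incident.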
trigger
L3 · Monitors events and triggers campaigns: welcome, cart abandonment, re-engagement
🔧 EventStore.getRecentEvents(userId, window='7d'), TriggerRules.evaluate(event, userHistory), CampaignScheduler.checkThrottling(userId)
⚡ Recovery: If event processing fails → dead-letter queue for retry, If throttle limit reached → defer to next window, If duplicate event → deduplicate using event ID
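Deduplication by event ID and per-user throttling could be combined in one in-memory gate, sketched below. The class `TriggerGate`, the limit of 2 sends per 24h window, and the in-process storage are assumptions; a deployed system would more likely keep this state in a shared store such as Redis.

```python
from typing import Dict, List, Set

class TriggerGate:
    """Decide whether a trigger event should produce a send."""

    def __init__(self, max_per_window: int = 2, window_s: float = 86_400):
        self.max_per_window = max_per_window   # assumed throttle limit
        self.window_s = window_s               # assumed window length
        self.seen_event_ids: Set[str] = set()
        self.sent_at: Dict[str, List[float]] = {}

    def decide(self, event_id: str, user_id: str, now: float) -> str:
        if event_id in self.seen_event_ids:
            return "duplicate"                 # same event delivered twice
        self.seen_event_ids.add(event_id)
        # Keep only sends that still fall inside the throttle window.
        recent = [t for t in self.sent_at.get(user_id, []) if now - t < self.window_s]
        if len(recent) >= self.max_per_window:
            self.sent_at[user_id] = recent
            return "deferred"                  # retry in the next window
        recent.append(now)
        self.sent_at[user_id] = recent
        return "send"
```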
orchestrator
L4 · Coordinates all agents, manages state, handles retries, routes based on outcomes
🔧 StateManager.saveState(campaignId, state), RetryPolicy.shouldRetry(failureType, attemptCount), RoutingEngine.routeToNextAgent(currentState)
⚡ Recovery: If agent timeout → retry with circuit breaker, If cascade failure → graceful degradation (skip optional steps), If critical failure → rollback, log, alert ops
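The circuit breaker mentioned in the orchestrator's recovery path can be sketched as a small wrapper around an agent call. The threshold of 3 failures and the 30-second reset window are assumed values, and `CircuitBreaker` is an illustrative class rather than a specific library's API.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

class CircuitBreaker:
    """Stop calling a failing agent until a cool-down period has passed."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()          # open: skip the agent entirely
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            return fallback()
        self.failures = 0                  # success closes the circuit
        self.opened_at = None
        return result
```

Using one breaker instance per agent keeps a slow evaluator, for example, from stalling the whole pipeline; the fallback is where graceful degradation (skipping an optional step) would plug in.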
ML Layer
Feature Store
Update: Real-time for critical (opens, clicks), Daily batch for derived (engagement score)
- user_engagement_score (0-100, updated daily)
- last_open_date (timestamp)
- preferred_content_categories (array)
- purchase_frequency (purchases/month)
- avg_time_to_open (minutes)
- device_type (mobile/desktop/tablet)
- timezone (for send-time optimization)
Model Registry
Strategy: Semantic versioning, A/B test new versions, auto-rollback if metrics degrade >5%
- gpt-4o-mini
- quality_classifier_v2
- engagement_predictor_v1
- pii_detector_v3
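The auto-rollback rule in the registry strategy reduces to a one-line comparison once "degrade >5%" is pinned down. The sketch below assumes the 5% is a relative drop against the baseline version and that the metric is higher-is-better (open rate, for instance); both are interpretations, not something the text specifies.

```python
def should_rollback(baseline_metric: float, candidate_metric: float,
                    max_degradation: float = 0.05) -> bool:
    """True if the candidate is more than `max_degradation` worse than
    the baseline, measured relative to the baseline."""
    if baseline_metric <= 0:
        raise ValueError("baseline_metric must be positive")
    relative_drop = (baseline_metric - candidate_metric) / baseline_metric
    return relative_drop > max_degradation

# Open rate falls from 25% to 23%: an 8% relative drop, so roll back.
print(should_rollback(0.25, 0.23))
```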
Observability
Metrics
- 📊 email_send_rate
- 📊 personalization_latency_p95_ms
- 📊 llm_token_usage
- 📊 llm_cost_per_email_usd
- 📊 quality_score_avg
- 📊 relevance_score_avg
- 📊 open_rate_percent
- 📊 click_rate_percent
- 📊 unsubscribe_rate_percent
- 📊 bounce_rate_percent
- 📊 agent_success_rate
- 📊 agent_retry_count
- 📊 guardrail_rejection_rate
- 📊 pii_detection_count
- 📊 consent_rejection_rate
- 📊 api_error_rate
- 📊 database_query_latency_ms
Dashboards
- 📈 ops_dashboard
- 📈 marketing_dashboard
- 📈 ml_dashboard
- 📈 cost_dashboard
- 📈 compliance_dashboard
Traces
✅ Enabled
Deployment Variants
🚀 Startup
Infrastructure:
- AWS Lambda (serverless, pay-per-use)
- API Gateway (managed)
- RDS PostgreSQL (single AZ)
- ElastiCache Redis (single node)
- S3 + CloudFront
- SendGrid (shared IP pool)
- OpenAI API (shared)
→ Fast to deploy (<1 week)
→ Low fixed cost ($150/mo base)
→ Scales automatically to 100K emails/mo
→ Single region (us-east-1)
→ Basic auth (OIDC via Auth0)
→ No VPC, uses public endpoints
→ Monitoring via CloudWatch
🏢 Enterprise
Infrastructure:
- EKS (Kubernetes, multi-region)
- Kafka + Confluent (event streaming)
- Aurora PostgreSQL Global Database
- Redis Global Datastore
- VPC with private subnets
- Dedicated IP pools (SendGrid/Mailgun)
- Multi-LLM (OpenAI + Anthropic + Azure OpenAI)
- Customer-managed KMS keys
- WAF + DDoS protection
- SSO/SAML (Okta/Azure AD)
→ Multi-tenant with tenant isolation
→ Data residency (EU/US/APAC)
→ VPC peering with customer networks
→ 99.99% SLA with multi-region failover
→ Advanced RBAC + audit logging
→ Dedicated support + SRE team
→ Cost: $8K-20K/mo depending on volume
📈 Migration: Phase 1: Migrate to ECS Fargate (containerized). Phase 2: Add Kafka for event streaming. Phase 3: Move to EKS with multi-region. Phase 4: Add enterprise features (SSO, VPC, data residency). Estimated timeline: 3-6 months.
Risks & Mitigations
⚠️ LLM cost explosion (10x traffic spike)
Medium · Mitigation: Set cost guardrails: $1K/day budget alert, $2K/day hard limit. Auto-throttle to rule-based personalization if exceeded. Monthly cost reviews.
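The two budget thresholds map naturally onto a three-state switch. In this sketch the function `budget_mode` and the mode names are hypothetical; the $1K and $2K figures are the ones stated in the mitigation.

```python
def budget_mode(spend_today_usd: float,
                alert_usd: float = 1_000.0,
                hard_limit_usd: float = 2_000.0) -> str:
    """Pick the personalization mode from today's LLM spend."""
    if spend_today_usd >= hard_limit_usd:
        return "rule_based"        # hard limit: stop LLM calls for the day
    if spend_today_usd >= alert_usd:
        return "llm_with_alert"    # keep generating, but notify ops
    return "llm"

for spend in (250.0, 1_200.0, 2_000.0):
    print(spend, budget_mode(spend))
```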
⚠️ Email deliverability drop (spam complaints)
Medium · Mitigation: Monitor spam complaint rate (<0.1% threshold). Implement feedback loops with ESPs. Double opt-in required. Auto-suppress complainers. Warm up new IPs gradually.
⚠️ GDPR violation (consent not checked)
Low · Mitigation: Guardrail agent checks consent before every send. Audit trail for all consent changes. Annual compliance audit. Legal review of all templates.
⚠️ PII leak (LLM logs contain emails)
Medium · Mitigation: Auto-redact PII before LLM calls. No PII in application logs. Encrypt all logs at rest. Access controls on log storage. Regular security audits.
⚠️ Model drift (open rates decline over time)
High · Mitigation: Daily drift detection. Auto-retrain trigger if performance drops >10%. A/B test new models before rollout. Rollback policy if metrics degrade.
⚠️ Dependency failure (SendGrid outage)
Low · Mitigation: Multi-ESP failover (SendGrid → Mailgun → SES). Health checks every 60s. Auto-switch on 3 consecutive failures. SQS queue for retry.
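The "auto-switch on 3 consecutive failures" rule can be sketched as a small state machine fed by health-check results. The class `EspFailover` and the lowercase provider labels are hypothetical, and the sketch only moves forward through the chain; switching back to a recovered provider is left out.

```python
from typing import Sequence

class EspFailover:
    """Track health-check results and advance to the next ESP when needed."""

    def __init__(self, providers: Sequence[str] = ("sendgrid", "mailgun", "ses"),
                 max_consecutive_failures: int = 3):
        self.providers = list(providers)
        self.max_consecutive_failures = max_consecutive_failures
        self.active = 0
        self.failures = 0

    def record(self, healthy: bool) -> str:
        """Record one health check and return the provider to use next."""
        if healthy:
            self.failures = 0          # any success resets the streak
        else:
            self.failures += 1
            is_last = self.active == len(self.providers) - 1
            if self.failures >= self.max_consecutive_failures and not is_last:
                self.active += 1       # fail over to the next provider
                self.failures = 0
        return self.providers[self.active]
```

With health checks every 60s, three consecutive failures means a switch roughly three minutes after an outage begins, during which sends would wait in the retry queue.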
⚠️ Agent cascade failure (all agents timeout)
Low · Mitigation: Circuit breaker per agent. Graceful degradation (skip optional agents). Fallback to rule-based. Auto-scale on high latency. Ops alerts on cascade.
Evolution Roadmap
Phase 1: MVP (0-3 months)
Weeks 1-12
- → Ship basic personalization (subject + body)
- → Integrate SendGrid + OpenAI
- → Implement 2 trigger types (welcome, cart abandonment)
- → Basic analytics dashboard
Phase 2: Scale (3-6 months)
Weeks 13-26
- → Scale to 1M emails/month
- → Add A/B testing
- → Implement send-time optimization
- → Multi-ESP failover
Phase 3: Enterprise (6-12 months)
Weeks 27-52
- → Multi-tenant architecture
- → SSO/SAML integration
- → Data residency (EU/US)
- → 99.99% SLA
Complete Systems Architecture
9-layer architecture from presentation to security
Sequence Diagram - Personalized Email Send Flow
Email Personalization - Agent Orchestration (6 components)
Email Personalization - External Integrations (9 components)
Data Flow - Triggered Email Journey: from event to inbox in <1 second
Scaling Patterns
Key Integrations
SendGrid/Mailgun
Customer Data Platform (Segment/mParticle)
CRM (Salesforce/HubSpot)
OpenAI/Anthropic API
Security & Compliance
Failure Modes & Fallbacks
Failure | Fallback | Impact | SLA |
---|---|---|---|
LLM API down (OpenAI timeout) | → Anthropic API → Rule-based personalization | Degraded quality, not broken | 99.5% |
Feature Store unavailable | → Use cached features (stale up to 24h) | Slightly less personalized | 99.9% |
SendGrid rate limit exceeded | → Mailgun API → Queue for retry | Delayed send (up to 5min) | 99.0% |
Guardrail agent rejects content (PII detected) | → Auto-redact → Regenerate → Human review queue | Send blocked until safe | 100% (safety first) |
Database connection pool exhausted | → Read replica → Queue writes | Read-only mode, writes delayed | 99.9% |
Campaign A/B test inconclusive | → Extend test duration → Use default variant | Suboptimal variant selection | 95% |
User consent revoked mid-campaign | → Immediate suppression → Refund send credit | Email not sent, user protected | 100% |
Advanced ML/AI Patterns
Production ML engineering beyond basic API calls