
Retention System Architecture πŸ—οΈ

From 100 to 10,000 users/day with ML-powered churn prevention

October 23, 2025
20 min read
πŸ“ˆ GrowthπŸ—οΈ ArchitectureπŸ€– ML PipelineπŸ”’ HIPAA-Ready
🎯This Week's Journey

From prompts to production retention engine.

Monday: 3 core prompts (churn prediction, engagement scoring, campaign generation). Tuesday: automated agent code. Wednesday: team workflows (Data → Growth → Product). Thursday: the complete technical architecture, covering agents, the ML pipeline, scaling patterns, and compliance for 10,000+ users daily.

πŸ“‹

Key Assumptions

1
Monitor 1K-100K active users with 5-20% monthly churn baseline
2
Real-time engagement scoring (<500ms p95 latency required)
3
Daily batch churn prediction + hourly incremental updates
4
GDPR/CCPA compliance for user data, HIPAA optional for healthcare SaaS
5
Integration with existing product analytics (Segment, Amplitude, Mixpanel)

System Requirements

Functional

  • Ingest user events (pageviews, feature usage, purchases) from multiple sources
  • Compute engagement scores in real-time (<500ms) and churn risk daily
  • Generate personalized retention campaigns (email, in-app, push) via LLM agents
  • A/B test campaign variants and measure lift in retention metrics
  • Provide dashboards for Growth team (churn trends, campaign performance, cohort analysis)
  • Support multi-tenancy for enterprise customers (data isolation, custom models)
  • Audit trail for all ML predictions and campaign decisions (HIPAA compliance)

Non-Functional (SLOs)

  • Latency: ≤500 ms p95
  • Feature freshness: ≤60 min
  • Availability: 99.5%
  • Churn prediction accuracy: ≥0.85
  • Engagement score drift threshold: ≤0.05

πŸ’° Cost Targets: {"per_user_per_month_usd":0.15,"ml_inference_per_1k_users_usd":2.5,"campaign_generation_per_user_usd":0.03}

Agent Layer

planner_agent

Autonomy: L3

Decomposes retention tasks, selects tools, decides campaign strategy

πŸ”§ segment_classifier, campaign_recommender, timing_optimizer

⚡ Recovery: If a tool fails, retry 3x with exponential backoff; if all retries fail, route to the manual review queue; log the failure to the audit trail with context.

executor_agent

Autonomy: L2

Executes retention workflows (score computation, campaign generation, delivery)

πŸ”§ feature_store.get_features(), ml_model.predict(), llm_api.generate_campaign(), email_api.send()

⚡ Recovery: If the feature fetch fails, use cached features (staleness <1 hr); if the ML model fails, fall back to rule-based scoring; if the LLM API fails, use a pre-generated template; if the email send fails, queue for retry (max 3x).

evaluator_agent

Autonomy: L3

Validates campaign quality, checks for hallucinations, measures effectiveness

πŸ”§ content_classifier (toxicity, relevance), brand_checker, hallucination_detector, engagement_predictor

⚡ Recovery: If quality score <0.6, reject and regenerate; if a hallucination is detected, flag for human review; if a brand violation occurs, auto-reject with an explanation.

guardrail_agent

Autonomy: L4

Enforces PII redaction, policy compliance, safety filters

πŸ”§ pii_detector (AWS Comprehend/Presidio), policy_engine (OPA/custom rules), content_filter (profanity, sensitive topics)

⚡ Recovery: If PII is detected, auto-redact and log; if a policy violation occurs, block delivery and alert; if redaction fails, route to manual review.

campaign_generator_agent

Autonomy: L2

Generates personalized retention campaigns using LLM

πŸ”§ llm_api (GPT-4/Claude), template_selector, personalization_engine

⚡ Recovery: If the LLM times out, use a cached template with dynamic fields; if generation fails, fall back to a rule-based template; if content is too generic, retry with more context.

scoring_agent

Autonomy: L2

Real-time engagement scoring and churn prediction

πŸ”§ feature_store.get_online_features(), ml_model.predict() (XGBoost/LightGBM), feature_importance_explainer

⚡ Recovery: If the feature fetch fails, use the last known score (staleness <24 hr); if model inference fails, fall back to a rule-based heuristic; if a score anomaly is detected, flag for review.

ML Layer

Feature Store

Update: Real-time (streaming) + Daily batch refresh

  • β€’ user_tenure_days
  • β€’ events_last_7d
  • β€’ events_last_30d
  • β€’ feature_usage_breadth (distinct features used)
  • β€’ feature_usage_depth (avg usage per feature)
  • β€’ days_since_last_login
  • β€’ purchase_count_lifetime
  • β€’ support_tickets_last_90d
  • β€’ cohort_retention_rate
  • β€’ referral_count

Model Registry

Strategy: Semantic versioning (major.minor.patch), Git-backed

  • β€’ churn_classifier_v3
  • β€’ engagement_scorer_v2
  • β€’ campaign_recommender_v1

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (10 metrics) → Processing (aggregate & transform) → Dashboards (4 views) → Alerts (enabled)

📊 Metrics (10) · 📝 Logs (structured) · 🔗 Traces (distributed)

Key metrics:
  • retention_campaign_sent_total
  • engagement_score_p95
  • churn_prediction_accuracy
  • llm_latency_p95_ms
  • email_delivery_rate
  • feature_staleness_seconds

Deployment Variants

πŸš€

Startup Architecture

Fast to deploy, cost-efficient, scales to 1K users/day

Infrastructure

βœ“
Serverless (Lambda/Cloud Functions)
βœ“
Managed services (RDS, DynamoDB, SQS)
βœ“
OpenAI/Anthropic APIs (no self-hosting)
βœ“
Single-region deployment
βœ“
Shared infrastructure (multi-tenant DB)
β†’Fast time-to-market (2-4 weeks)
β†’Low operational overhead
β†’Cost: $200-600/mo for 100-1K users/day
β†’99.5% SLA acceptable
β†’No custom ML models (use pre-trained)

Risks & Mitigations

⚠️ Model accuracy degrades over time (concept drift)
Likelihood: High (user behavior changes)
✓ Mitigation: Weekly offline evaluation. Auto-retrain if accuracy drops >5%. A/B test new models before full rollout.

⚠️ LLM generates inappropriate content (toxicity, bias)
Likelihood: Medium (LLMs are imperfect)
✓ Mitigation: Guardrail Agent with toxicity filter. Human review for flagged content. Content policy enforcement via OPA.

⚠️ PII leakage to LLM provider
Likelihood: Low (with proper redaction)
✓ Mitigation: PII detection before LLM call. Audit trail for all LLM requests. DPA with the LLM provider. Fail-closed if redaction fails.

⚠️ Cost overruns from LLM API usage
Likelihood: Medium (usage spikes)
✓ Mitigation: Rate limiting per user. Cost budgets with alerts. Fallback to templates if the budget is exceeded. Monthly cost review.

⚠️ Feature store staleness (outdated scores)
Likelihood: Medium (infrastructure failures)
✓ Mitigation: Cache with TTL <1 hr. Monitoring for staleness. Alert if staleness >30 min. Fallback to rule-based scoring on cache miss.

⚠️ Email deliverability issues (spam filters)
Likelihood: Medium (reputation management)
✓ Mitigation: SPF/DKIM/DMARC setup. Monitor bounce/complaint rates. Warm up IP addresses. Use a reputable email service (SendGrid).

⚠️ Agent orchestration failures (infinite loops, deadlocks)
Likelihood: Low (with proper testing)
✓ Mitigation: Circuit breakers for agents. Max retry limits (3x). Timeout for each agent step (5 s). Dead-letter queue for failed tasks.

🧬 Evolution Roadmap

Progressive transformation from MVP to scale

🌱 Phase 1: Foundation (0-3 months · Weeks 1-12)
1. Deploy the basic retention system (100-1K users/day)
2. Implement 3 core agents (Scoring, Campaign, Guardrail)
3. Integrate with product analytics (Segment/Amplitude)
4. Launch the first A/B test (LLM campaigns vs manual)

🌿 Phase 2: Scale (3-6 months · Weeks 13-24)
1. Scale to 1K-10K users/day
2. Add Planner, Executor, and Evaluator agents
3. Deploy the feature store (Feast) + model registry (MLflow)
4. Improve churn prediction to 85% accuracy

🌳 Phase 3: Enterprise (6-12 months · Weeks 25-52)
1. Scale to 10K+ users/day with a 99.9% SLA
2. Multi-region deployment (US + EU)
3. Self-hosted ML models (GPU instances)
4. Full HIPAA/SOC2 compliance

🚀 Production Ready
πŸ—οΈ

Complete Systems Architecture

9-layer architecture from user events to retention campaigns

1. Presentation (4 components)
  • Growth Dashboard (React)
  • Campaign Builder UI
  • Analytics Embed (iframe)
  • Mobile SDK

2. API Gateway (4 components)
  • Load Balancer (ALB/NGINX)
  • Rate Limiter (Redis)
  • Auth Service (OAuth2/OIDC)
  • API Router

3. Agent Layer (6 components)
  • Planner Agent (task decomposition)
  • Executor Agent (workflow engine)
  • Evaluator Agent (quality checks)
  • Guardrail Agent (PII redaction, policy)
  • Campaign Generator Agent (LLM-powered)
  • Scoring Agent (real-time inference)

4. ML Layer (6 components)
  • Feature Store (Feast/Tecton)
  • Model Registry (MLflow)
  • Training Pipeline (Airflow)
  • Inference Service (TorchServe/TFServing)
  • Evaluation Framework
  • Drift Detector

5. Integration (5 components)
  • Event Ingestion (Kafka/Kinesis)
  • Analytics Connector (Segment API)
  • Email Provider (SendGrid/SES)
  • Push Notification (FCM/APNS)
  • CRM Sync (Salesforce/HubSpot)

6. Data (5 components)
  • Event Stream (Kafka)
  • OLTP Database (PostgreSQL)
  • OLAP Warehouse (Snowflake/BigQuery)
  • Feature Cache (Redis)
  • Model Artifacts (S3)

7. External (4 components)
  • LLM API (OpenAI/Anthropic)
  • Analytics Platform (Amplitude)
  • Email Service (SendGrid)
  • Cloud Provider (AWS/GCP)

8. Observability (5 components)
  • Metrics (Prometheus/Datadog)
  • Logs (ELK/Loki)
  • Traces (Jaeger/Tempo)
  • ML Monitoring (Arize/Evidently)
  • Alerting (PagerDuty)

9. Security (5 components)
  • IAM (Keycloak/Auth0)
  • Secrets Manager (Vault/KMS)
  • Audit Log (CloudWatch/Splunk)
  • PII Redactor
  • WAF (Cloudflare/AWS WAF)
πŸ”„

Sequence Diagram - Retention Campaign Flow

Automated data flow every hour

Participants: User → Event Tracker → Kafka → Feature Store → Scoring Agent → Planner Agent → Campaign Agent → Email Service

1. User → Event Tracker: feature_used event
2. Event Tracker → Kafka: publish event
3. Kafka → Feature Store: update features
4. Feature Store → Scoring Agent: trigger scoring
5. Scoring Agent → Feature Store: fetch user features
6. Scoring Agent: compute engagement_score
7. Scoring Agent → Planner Agent: score=0.32 (at-risk)
8. Planner Agent: decide: send campaign
9. Planner Agent → Campaign Agent: generate campaign
10. Campaign Agent: LLM call (personalized email)
11. Campaign Agent: validate content
12. Campaign Agent: check PII/policy
13. Campaign Agent → Email Service: send email
14. Email Service → User: retention email delivered

Data Flow - Event to Campaign

User event β†’ Scoring β†’ Campaign generation β†’ Delivery in <2 seconds

1. User (0 ms): Triggers event (feature_used) → Event JSON
2. Event Tracker SDK (5 ms): Publishes to Kafka → Event stream
3. Kafka Consumer (50 ms): Reads event, updates feature store → Updated features
4. Scoring Agent (150 ms): Fetches features, runs ML model → Engagement score (0.32)
5. Planner Agent (50 ms): Analyzes score, decides to send campaign → Task plan
6. Executor Agent (10 ms): Calls Campaign Generator Agent → Campaign request
7. Campaign Agent (800 ms): Generates personalized email via LLM → Email content
8. Evaluator Agent (100 ms): Validates quality, checks hallucinations → Quality score (0.89)
9. Guardrail Agent (50 ms): Redacts PII, enforces policy → Approved content
10. Email Service (200 ms): Sends email via SendGrid → Delivery receipt
11. Audit Logger (20 ms): Records decision + content → Audit entry
Scaling tiers:

Tier 1 · 0-100 users/day · Serverless Monolith
  • Lambda/Cloud Functions for agents
  • DynamoDB or Firestore
  • SQS/Pub-Sub for async tasks
  • Managed LLM APIs (OpenAI/Anthropic)
Cost & Performance: $200/mo · 1-2 sec p95

Tier 2 · 100-1K users/day · Queue + Workers
  • API server (ECS/Cloud Run)
  • Redis for caching + queue
  • Worker pool (3-5 instances)
  • PostgreSQL RDS
  • Feast feature store
Cost & Performance: $600/mo · 500-800 ms p95

Tier 3 · 1K-10K users/day · Multi-Agent Orchestration (Recommended)
  • Kubernetes (EKS/GKE) with autoscaling
  • Kafka for event streaming
  • Feast + Redis for online features
  • TorchServe for ML models
  • LangGraph for agent orchestration
  • PostgreSQL + Snowflake
Cost & Performance: $2K/mo · 300-500 ms p95

Tier 4 · 10K+ users/day · Enterprise Multi-Region
  • Multi-region Kubernetes
  • Kafka + Flink for stream processing
  • Tecton (managed feature store)
  • Multi-LLM failover (OpenAI + Anthropic + self-hosted)
  • Snowflake + Databricks
  • Global load balancer
Cost & Performance: $8K+/mo · 200-400 ms p95

Key Integrations

Product Analytics (Segment/Amplitude)

Protocol: REST API + Webhooks
User event β†’ Segment API
Segment β†’ Kafka (via webhook)
Kafka β†’ Feature Store
Feature Store β†’ Scoring Agent

Email Service (SendGrid/AWS SES)

Protocol: SMTP + REST API
Campaign Agent β†’ Email Service API
Email Service β†’ User inbox
Email Service β†’ Webhook (open/click events)
Webhook β†’ Analytics pipeline

CRM (Salesforce/HubSpot)

Protocol: REST API (OAuth 2.0)
Churn prediction β†’ CRM sync job
CRM API β†’ Update contact field (churn_risk)
Sales team views in CRM dashboard

LLM API (OpenAI/Anthropic)

Protocol: REST API
Campaign Agent β†’ LLM API (generate)
LLM API β†’ Campaign content
Evaluator Agent β†’ LLM API (judge quality)
LLM API β†’ Quality score

Feature Store (Feast/Tecton)

Protocol: gRPC (online) + Parquet (offline)
Kafka β†’ Feature computation job
Job β†’ Write to online store (Redis)
Scoring Agent β†’ Fetch features (gRPC)
Daily batch β†’ Write to offline store (Snowflake)
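
For the online path, the Scoring Agent's feature fetch could look like this Feast client sketch (gRPC to the online store under the hood); the repo path, feature-view name, and user id are assumptions:

```python
# Sketch of the Scoring Agent's online feature fetch through Feast.
# Repo path, feature-view name, and the user id are assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at the Feast repo config

row = store.get_online_features(
    features=[
        "user_engagement:events_last_7d",
        "user_engagement:days_since_last_login",
        "user_engagement:feature_usage_breadth",
    ],
    entity_rows=[{"user_id": "u_12345"}],  # hypothetical user
).to_dict()

events_last_7d = row["events_last_7d"][0]
```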

Security & Compliance

Failure Modes & Fallbacks

• LLM API down (OpenAI/Anthropic) → Fallback: cached templates with dynamic personalization. Impact: degraded personalization, campaigns still sent. SLA: 99.5%
• ML model inference timeout → Fallback: rule-based scoring (engagement = events_last_7d / 10). Impact: lower accuracy, no churn prediction. SLA: 99.0%
• Feature store unavailable → Fallback: cached features (staleness <1 hr acceptable). Impact: slightly outdated scores, campaigns still sent. SLA: 99.5%
• Email delivery service down → Fallback: queue campaigns for retry (max 3x over 24 hr). Impact: delayed delivery, no data loss. SLA: 99.0%
• PII detection service fails → Fallback: block all campaign generation until the service recovers. Impact: no campaigns sent (safety first). SLA: 100% compliance
• Kafka broker down → Fallback: buffer events in the API server (max 1 hr). Impact: delayed feature updates, eventual consistency. SLA: 99.5%
• Database connection pool exhausted → Fallback: read from a replica, queue writes. Impact: read-only mode for dashboards. SLA: 99.0%
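
The rule-based fallback in the second row is explicit (engagement = events_last_7d / 10), so the scoring path can be sketched directly; the model interface, feature order, and the cap at 1.0 are assumptions:

```python
# Sketch of the scoring path with the rule-based fallback from the table:
# engagement = events_last_7d / 10; the cap at 1.0 is an assumption.
def engagement_score(features: dict, model=None) -> float:
    if model is not None:
        try:
            # Hypothetical sklearn/XGBoost-style interface and feature order.
            return float(model.predict([[
                features["events_last_7d"],
                features["days_since_last_login"],
            ]])[0])
        except Exception:
            pass  # fall through to the rule-based heuristic
    return min(features["events_last_7d"] / 10.0, 1.0)

print(engagement_score({"events_last_7d": 3, "days_since_last_login": 9}))  # 0.3
```
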
System Architecture
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Orchestrator  β”‚ ← Coordinates all agents
β”‚  (LangGraph)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚          β”‚          β”‚           β”‚           β”‚          β”‚
β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”  β”Œβ”€β”€β–Όβ”€β”€β”€β”  β”Œβ”€β”€β”€β–Όβ”€β”€β”€β”€β”  β”Œβ”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”Œβ”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”  β”Œβ–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚Plannerβ”‚  β”‚Scorerβ”‚  β”‚Campaignβ”‚  β”‚Evaluatorβ”‚  β”‚Guardrailβ”‚  β”‚Executor β”‚
β”‚ Agent β”‚  β”‚Agent β”‚  β”‚ Agent  β”‚  β”‚  Agent  β”‚  β”‚  Agent  β”‚  β”‚  Agent  β”‚
β””β”€β”€β”€β”¬β”€β”€β”€β”˜  β””β”€β”€β”¬β”€β”€β”€β”˜  β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜  β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚          β”‚          β”‚           β”‚            β”‚          β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                β”‚
                         β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
                         β”‚   Actions   β”‚
                         β”‚ (Email/Push)β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”„Agent Collaboration Flow

1. Orchestrator: Receives user event (feature_used), routes to Scoring Agent
2. Scoring Agent: Fetches features from feature store, runs ML model → Returns engagement_score=0.32 (at-risk)
3. Planner Agent: Analyzes score, decides "User at-risk, send re-engagement campaign" → Creates task plan
4. Executor Agent: Executes task plan, calls Campaign Generator Agent with user context
5. Campaign Generator Agent: Retrieves similar campaigns (RAG), generates personalized email via LLM → Returns draft
6. Evaluator Agent: Validates quality (hallucination check, relevance score) → Returns quality_score=0.89 (approved)
7. Guardrail Agent: Scans for PII, enforces content policy → Redacts PII, approves content
8. Executor Agent: Sends email via SendGrid API, logs to audit trail
9. Orchestrator: Monitors delivery, schedules a follow-up if no engagement in 3 days
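
As one possible wiring, this flow maps naturally onto a LangGraph state machine. The sketch below uses stub node functions and an assumed at-risk threshold of 0.5; only the graph shape mirrors the steps above:

```python
# Sketch: the collaboration flow as a LangGraph state machine.
# Node bodies are stubs; the 0.5 at-risk threshold is an assumption.
from typing import TypedDict

from langgraph.graph import END, StateGraph

class RetentionState(TypedDict):
    user_id: str
    engagement_score: float
    campaign_draft: str
    quality_score: float

def score(state: RetentionState) -> dict:
    return {"engagement_score": 0.32}   # stub: Scoring Agent

def plan(state: RetentionState) -> dict:
    return {}                           # stub: Planner Agent

def generate(state: RetentionState) -> dict:
    return {"campaign_draft": "..."}    # stub: Campaign Generator

def evaluate(state: RetentionState) -> dict:
    return {"quality_score": 0.89}      # stub: Evaluator + Guardrail

def route_after_score(state: RetentionState) -> str:
    return "plan" if state["engagement_score"] < 0.5 else "done"

graph = StateGraph(RetentionState)
graph.add_node("score", score)
graph.add_node("plan", plan)
graph.add_node("generate", generate)
graph.add_node("evaluate", evaluate)
graph.set_entry_point("score")
graph.add_conditional_edges("score", route_after_score,
                            {"plan": "plan", "done": END})
graph.add_edge("plan", "generate")
graph.add_edge("generate", "evaluate")
graph.add_edge("evaluate", END)

app = graph.compile()
final = app.invoke({"user_id": "u_12345", "engagement_score": 0.0,
                    "campaign_draft": "", "quality_score": 0.0})
```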

🎭Agent Types

Reactive Agent (complexity: Low): Scoring Agent responds to an event and returns a score. State: stateless.

Reflexive Agent (complexity: Medium): Campaign Generator uses context (user profile, past campaigns). State: reads context.

Deliberative Agent (complexity: High): Planner Agent plans a multi-step campaign strategy. State: stateful (maintains a plan).

Orchestrator Agent (complexity: Highest): Main Orchestrator coordinates all agents and handles loops/retries. State: full state management.

πŸ“ˆLevels of Autonomy

L1 · Tool: Human calls, agent responds (no decisions) → Monday's prompts (manual execution)
L2 · Chained Tools: Sequential execution (predefined flow) → Tuesday's code (Score → Generate → Send)
L3 · Agent: Makes decisions, can loop/retry → Planner Agent (decides campaign strategy)
L4 · Multi-Agent: Agents collaborate autonomously and self-correct → This system (6 agents working together)

RAG vs Fine-Tuning for Campaign Generation

Campaign best practices evolve weekly. RAG allows daily updates without retraining. Fine-tuning would require monthly retraining ($5K/mo) vs RAG ($200/mo for vector DB).
βœ… RAG (Chosen)
Cost: $200/mo (Pinecone)
Update: Daily (add new campaign examples)
How:
❌ Fine-Tuning
Cost: $5K/mo (retraining)
Update: Monthly
How:
Implementation: Vector DB (Pinecone) with 10K high-performing campaign examples. Retrieved top-5 similar campaigns during generation. Reranker (Cohere) to select best examples.
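
A sketch of that retrieve-then-rerank step using the Pinecone and Cohere Python clients; the index name, metadata field, over-retrieval count, and the omitted query-embedding step are assumptions:

```python
# Sketch of retrieve-then-rerank for campaign generation.
# Index name, metadata field, and over-retrieval count are assumptions.
import cohere
from pinecone import Pinecone

pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("campaign-examples")
co = cohere.Client("COHERE_API_KEY")

def retrieve_examples(query: str, query_vector: list[float]) -> list[str]:
    # Over-retrieve candidates, then rerank down to the top-5 for the prompt.
    hits = index.query(vector=query_vector, top_k=25, include_metadata=True)
    docs = [m.metadata["campaign_text"] for m in hits.matches]
    ranked = co.rerank(model="rerank-english-v3.0", query=query,
                       documents=docs, top_n=5)
    return [docs[r.index] for r in ranked.results]
```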

Hallucination Detection for Campaigns

LLMs hallucinate features ("Try our new AI assistant" when no such feature exists). Detection is layered:
L1: Confidence scores from the LLM (<0.7 = flag)
L2: Cross-reference against the product feature list (API call)
L3: Fact-checking layer (LLM-as-judge with product docs)
L4: Human review queue for flagged campaigns
Result: 0.8% hallucination rate, with 98% caught by L1-L3.
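
The L2 check is easy to sketch: compare feature names the draft claims against the live catalog. The catalog contents and the upstream extraction of claimed features are assumptions:

```python
# Sketch of the L2 check: flag claimed features missing from the catalog.
# Catalog contents and upstream claim extraction are assumptions.
def check_feature_claims(claimed_features: list[str], catalog: set[str]) -> list[str]:
    known = {c.lower() for c in catalog}
    return [f for f in claimed_features if f.lower() not in known]

catalog = {"Usage Insights", "Team Spaces", "Smart Reports"}  # assumed list
flags = check_feature_claims(
    claimed_features=["AI assistant"],  # extracted upstream from the draft
    catalog=catalog,
)
if flags:
    print(f"Hallucinated features {flags}: route draft to human review")
```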

Evaluation Framework

• Churn Prediction Accuracy: 87.2% (target: 85%+)
• Engagement Score MAE: 0.08 (target: <0.1)
• Campaign Open Rate Lift: +18% (target: +15% vs baseline)
• Hallucination Rate: 0.8% (target: <1%)
• PII Leakage Rate: 0% (target: 0%)
Testing: Shadow mode: 1,000 users scored in parallel (ML vs rule-based). A/B test: 10% traffic to the new model for 2 weeks. Champion-challenger: gradual rollout (10% → 50% → 100%).
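
Shadow mode could be as simple as this sketch: both models score every request, both results are logged, and only the champion's score is served. The model objects and logger wiring are stubs:

```python
# Sketch of shadow-mode scoring: both models run, only the champion serves.
# The model objects and logger wiring are stubs.
import logging

log = logging.getLogger("shadow_eval")

def score_with_shadow(features: dict, champion, challenger) -> float:
    served = champion.predict(features)
    try:
        shadow = challenger.predict(features)
        log.info("shadow_compare served=%.3f shadow=%.3f diff=%.3f",
                 served, shadow, abs(served - shadow))
    except Exception:
        log.exception("challenger failed; served score unaffected")
    return served  # users only ever see the champion's score
```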

Dataset Curation

1. Collect: 50K user events + 5K campaigns, exported from the production DB (anonymized)
2. Clean: 42K usable events after removing duplicates, filtering bots, validating schema
3. Label: 42K labeled (churned/retained) at ~$8K (Mechanical Turk + expert review)
4. Augment: +8K synthetic examples via SMOTE for the minority class (churned users)
→ 50K high-quality training examples with an 80/10/10 train/val/test split.

Agentic RAG for Campaign Personalization

Agent iteratively retrieves context based on reasoning
User segment: 'power user, low recent activity' β†’ RAG retrieves 'power user re-engagement campaigns' β†’ Agent reasons 'need feature usage data' β†’ RAG retrieves 'feature adoption stats' β†’ Campaign generated: 'You used to love [Feature X]. Check out these new updates.'
πŸ’‘ Not one-shot retrieval. Agent decides what context it needs based on user profile.

Multi-Armed Bandit for Campaign Optimization
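
A multi-armed bandit lets observed open rates steer traffic across campaign variants instead of a fixed A/B split. A minimal Thompson-sampling sketch of that idea; the variant names and the opened/not-opened reward are illustrative assumptions:

```python
# Minimal Thompson-sampling sketch for campaign variant selection.
# Variant names and the opened/not-opened reward are assumptions.
import random

class CampaignBandit:
    def __init__(self, variants: list[str]):
        # Beta(1, 1) prior per variant: opens vs. non-opens.
        self.stats = {v: {"wins": 1, "losses": 1} for v in variants}

    def choose(self) -> str:
        # Sample a plausible open rate per variant; send the best draw.
        draws = {v: random.betavariate(s["wins"], s["losses"])
                 for v, s in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, variant: str, opened: bool) -> None:
        self.stats[variant]["wins" if opened else "losses"] += 1

bandit = CampaignBandit(["discount_10", "feature_tour", "social_proof"])
variant = bandit.choose()
bandit.record(variant, opened=True)
```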

Tech Stack Summary

• LLMs: GPT-4, Claude 3.5 Sonnet, or self-hosted Llama 3
• Agent Orchestration: LangGraph, CrewAI, or Temporal
• ML Framework: XGBoost, LightGBM, PyTorch
• Feature Store: Feast (OSS) or Tecton (managed)
• Model Registry: MLflow or Weights & Biases
• Event Streaming: Kafka or AWS Kinesis
• Database: PostgreSQL (OLTP), Snowflake (OLAP)
• Cache: Redis or Memcached
• Queue: SQS, RabbitMQ, or Kafka
• Compute: Kubernetes (EKS/GKE) or Lambda/Cloud Functions
• Monitoring: Prometheus, Datadog, Grafana, Arize (ML)
• Security: Vault (secrets), Presidio (PII), OPA (policy)
• Email Service: SendGrid or AWS SES
• Analytics: Segment, Amplitude, or Mixpanel
πŸ—οΈ

Need Architecture Review?

We'll audit your retention system, identify bottlenecks, and show you how to scale 10x while cutting costs 30%.

Β©

2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.