โ† Wednesday's Workflows

User Onboarding System Architecture 🏗️

Adaptive, AI-powered onboarding from 1K to 1M users/month

October 2, 2025
🚀 Product · 🏗️ Architecture · 🤖 AI-Driven · 📊 Analytics

From prompts to production onboarding system.

Monday: 3 core prompts for path generation, progress tracking, and nudge creation. Tuesday: automated onboarding platform code. Wednesday: team workflows across product, growth, and engineering. Thursday: complete technical architecture with adaptive path engine, ML-driven nudges, and real-time analytics.

Key Assumptions

  • Handle 1K-100K new users per month (scale to 1M)
  • Support 5-20 onboarding paths (persona-based, use-case-based)
  • Real-time progress tracking (<500ms latency)
  • Multi-channel nudges (email, SMS, in-app, push)
  • GDPR/CCPA compliant (PII redaction, data residency)
  • SOC2 Type II compliance for enterprise customers

System Requirements

Functional

  • Adaptive path engine: select optimal onboarding flow per user
  • Progress tracking: real-time state management across sessions
  • Nudge system: multi-channel triggers based on behavior
  • Analytics: funnel analysis, drop-off detection, cohort tracking
  • A/B testing: path variants with statistical significance
  • Integration: Segment, Mixpanel, Amplitude, Intercom, SendGrid
  • Admin UI: path editor, analytics dashboard, experiment manager

Non-Functional (SLOs)

  • Latency: 500 ms (p95)
  • Freshness: 1 min
  • Availability: 99.9%
  • Accuracy: 95%

💰 Cost Targets: $0.05 per user, $0.10 per 1K events, $0.02 per 1K ML inferences

Agent Layer

path-planner

L3

Select optimal onboarding path based on user context

🔧 Feature Store: getUserFeatures(), Model Registry: getPathClassifier(v1.2), Path Database: getAvailablePaths()

⚡ Recovery: If ML inference fails → fall back to rule-based path selection. If the feature store is unavailable → use cached features (stale <5 min). If no path matches → default to the generic path.
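The recovery chain above can be sketched in a few lines of Python. Everything here is illustrative, not the production API: `classify_path` stands in for the real ML inference call, and the persona-to-path mapping is a made-up example.

```python
# Fallback chain: ML inference -> rule-based selection -> generic default.
GENERIC_PATH = "generic"
RULE_BASED_PATHS = {"developer": "api_first", "marketer": "campaign_first"}

def classify_path(features):
    """Placeholder for ML inference; raises when the model is unavailable."""
    raise RuntimeError("model unavailable")

def select_path(user_id, features):
    try:
        return classify_path(features)      # preferred: ML-selected path
    except Exception:
        pass                                # fall through to rules
    persona = (features or {}).get("user_persona")
    if persona in RULE_BASED_PATHS:         # rule-based fallback by persona
        return RULE_BASED_PATHS[persona]
    return GENERIC_PATH                     # nothing matched: generic path
```

The key design choice is that every tier degrades to a cheaper, more reliable one, so path selection never hard-fails the signup flow.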

progress-evaluator

L3

Validate step completion, detect stuck users, trigger interventions

🔧 Progress Database: getState(userId), Validation Rules: checkCompletionCriteria(stepId), Analytics: getStepStats(stepId)

⚡ Recovery: If validation is ambiguous → flag for human review. If a database write fails → retry 3x with exponential backoff. If stuck detection fails → default to a time-based nudge.
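The "retry 3x with exponential backoff" behavior is a standard pattern; a minimal sketch, where `write_fn` is a stand-in for the real progress-database write:

```python
import time

def write_with_retry(write_fn, retries=3, base_delay=0.1):
    """Call write_fn, retrying up to `retries` times with exponential backoff."""
    for attempt in range(retries):
        try:
            return write_fn()
        except Exception:
            if attempt == retries - 1:
                raise                                    # exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))      # 0.1s, 0.2s, 0.4s, ...
```

In production you would typically add jitter and restrict the `except` clause to transient error types so that permanent failures fail fast.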

nudge-executor

L2

Generate and send personalized nudges across channels

🔧 Prompt Store: getNudgeTemplate(trigger, persona), LLM: generatePersonalizedMessage(template, context), Channel Gateway: send(channel, userId, message)

⚡ Recovery: If the LLM fails → use a static template. If a channel is unavailable → queue for retry. If a rate limit is hit → defer to the next window.
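The static-template fallback looks roughly like this. All callables and the template store are illustrative stand-ins, not the system's actual interfaces:

```python
def send_nudge(user_id, trigger, persona, templates, llm_generate, send):
    """Personalize via the LLM; if that fails, send the static template as-is."""
    template = templates[(trigger, persona)]
    try:
        message = llm_generate(template, user_id)
    except Exception:
        message = template                 # fallback: un-personalized template
    return send(user_id, message)
```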

guardrail-agent

L4

PII redaction, policy enforcement, abuse detection

🔧 PII Detector: detectSensitiveData(text), Policy Engine: checkPolicy(action, userId), Abuse Detector: checkAnomalousActivity(userId)

⚡ Recovery: If PII detection fails → block the action (fail-safe). If a policy check times out → deny by default. If abuse detection is unavailable → log and allow (monitored).
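The fail-safe behavior is worth making explicit: a detector *failure* is treated the same as a detector *hit*. A sketch with stand-in callables for the detector and LLM:

```python
def guarded_llm_call(text, detect_pii, call_llm):
    """Fail-closed guardrail: block when PII is found OR when the detector
    itself fails. Returning None means the action was blocked."""
    try:
        if detect_pii(text):
            return None        # blocked: sensitive data detected
    except Exception:
        return None            # blocked: detector failed (fail-safe default)
    return call_llm(text)
```

This inverts the usual instinct to "log and continue": for PII, availability is sacrificed before confidentiality.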

analytics-agent

L2

Real-time funnel analysis, drop-off detection, cohort insights

🔧 ClickHouse: runQuery(funnel_sql), Feature Store: getSegmentFeatures(), Alerting: sendAlert(anomaly)

⚡ Recovery: If a query times out → return cached results. If anomaly detection fails → skip the alert. If segmentation is unavailable → use default segments.

experiment-orchestrator

L3

A/B test allocation, statistical analysis, winner declaration

🔧 Experiment Store: getExperiment(id), Allocation Engine: assignVariant(userId, experiment), Stats Engine: bayesianAnalysis(results)

⚡ Recovery: If allocation fails → default to control. If the stats engine is unavailable → defer analysis. If confidence is low → extend the experiment duration.
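Sticky allocation (the same user must always see the same variant) is commonly done with deterministic hashing rather than stored assignments; a sketch under assumed names, with "default to control" as the failure path:

```python
import hashlib

def assign_variant(user_id, experiment_id, variants, weights):
    """Deterministic bucketing: hashing (experiment, user) gives the same
    variant on every call without persisting the assignment."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform-ish in [0, 1]
    cumulative = 0.0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket <= cumulative:
            return variant
    return variants[0]   # rounding/misconfig fallback: default to control
```

Hashing on `experiment_id:user_id` (not just the user) keeps bucketing independent across experiments.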

ML Layer

Feature Store

Update: Real-time for behavioral features, daily for derived features

  • user_persona (categorical)
  • signup_source (categorical)
  • engagement_score (numeric, 0-100)
  • company_size (categorical)
  • use_case (categorical)
  • time_since_signup (numeric, hours)
  • steps_completed (numeric)
  • session_count (numeric)
  • avg_session_duration (numeric, seconds)

Model Registry

Strategy: Semantic versioning with A/B testing before promotion

  • path_classifier
  • stuck_detector
  • nudge_response_predictor

Observability

Metrics

  • 📊 onboarding_start_rate
  • 📊 activation_rate
  • 📊 time_to_activation_p50_hours
  • 📊 step_completion_rate
  • 📊 drop_off_rate_by_step
  • 📊 nudge_send_rate
  • 📊 nudge_open_rate
  • 📊 nudge_conversion_rate
  • 📊 path_selection_latency_p95_ms
  • 📊 ml_inference_latency_p95_ms
  • 📊 api_latency_p95_ms
  • 📊 error_rate_percent
  • 📊 experiment_allocation_consistency

Dashboards

  • 📈 ops_dashboard
  • 📈 product_dashboard
  • 📈 ml_dashboard
  • 📈 experiment_dashboard

Traces

✅ Enabled

Deployment Variants

🚀 Startup

Infrastructure:

  • AWS Lambda + API Gateway
  • RDS PostgreSQL (single AZ)
  • ElastiCache Redis
  • S3 for events
  • CloudWatch for logs/metrics
  • Managed services (SendGrid, Twilio, Intercom)

→ Serverless-first for low ops overhead

→ Single region (us-east-1 or eu-west-1)

→ Managed services to reduce complexity

→ Cost: $200-500/month for 1K-10K users

→ Deploy via Serverless Framework or SAM

๐Ÿข Enterprise

Infrastructure:

  • Kubernetes (EKS/GKE) multi-cluster
  • Aurora PostgreSQL Global Database
  • Redis Enterprise (multi-region)
  • Kafka (Confluent Cloud or self-hosted)
  • ClickHouse Cloud
  • Private networking (VPC peering, PrivateLink)
  • BYO KMS/HSM for encryption
  • SSO/SAML integration
  • Dedicated ML inference service (GPU)
  • Multi-region active-active

→ Multi-tenant with tenant isolation

→ VPC isolation per customer (optional)

→ Data residency per region

→ SSO/SAML for enterprise auth

→ Custom SLAs (99.9% or 99.99%)

→ Dedicated support and on-call

→ Cost: $20K+/month for 1M+ users

📈 Migration: Start with the startup stack. Migrate to Kubernetes when >100K users/month. Add multi-region when >500K users/month. Implement tenant isolation when the first enterprise customer signs. Use feature flags to gradually shift traffic.

Risks & Mitigations

โš ๏ธ Low activation rates (<40%)

Medium

โœ“ Mitigation: A/B test paths continuously. Shadow mode with manual path selection. Fallback to proven paths.

โš ๏ธ Nudge fatigue (too many notifications)

Medium

โœ“ Mitigation: Frequency caps (max 1 nudge/day). User preferences (opt-out). Track unsubscribe rate.
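The frequency-cap plus opt-out check is simple enough to sketch in memory. This is an illustrative in-process version; production would back the same logic with Redis keyed by user ID:

```python
from datetime import datetime, timedelta

class FrequencyCap:
    """Allow at most one nudge per user per interval, honoring opt-outs."""
    def __init__(self, min_interval=timedelta(days=1)):
        self.min_interval = min_interval
        self.last_sent = {}       # user_id -> datetime of last nudge
        self.opted_out = set()    # users who disabled nudges entirely

    def allow(self, user_id, now=None):
        now = now or datetime.utcnow()
        if user_id in self.opted_out:
            return False
        last = self.last_sent.get(user_id)
        if last is not None and now - last < self.min_interval:
            return False          # capped: already nudged in this window
        self.last_sent[user_id] = now
        return True
```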

โš ๏ธ PII leakage to LLM

Low

โœ“ Mitigation: PII detection before LLM calls. Fail-safe mode if detection fails. Audit all LLM inputs.

โš ๏ธ ML model drift (accuracy degrades over time)

High

โœ“ Mitigation: Weekly drift detection. Automatic retraining if accuracy drops >5%. Human review for edge cases.

โš ๏ธ Scalability bottleneck (Redis/PostgreSQL)

Medium

โœ“ Mitigation: Read replicas. Sharding for Redis. Caching layer. Load testing before launch.

โš ๏ธ Integration failures (Segment, SendGrid, etc.)

Medium

โœ“ Mitigation: Retry logic with exponential backoff. Backup providers. Queue for failed deliveries.

โš ๏ธ Cost overruns (LLM API costs)

Medium

โœ“ Mitigation: Rate limiting. Cost alerts. Use cheaper models for non-critical tasks. Cache LLM responses.
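Caching LLM responses is the highest-leverage cost mitigation when many users share the same trigger and persona. A minimal sketch, where `generate_fn` is an illustrative stand-in for the API call:

```python
import hashlib
import json

class LLMCache:
    """Cache LLM responses keyed by (template, context) to avoid paying
    for the same generation twice."""
    def __init__(self, generate_fn):
        self.generate_fn = generate_fn
        self.store = {}
        self.hits = 0

    def get(self, template_id, context):
        # Canonical JSON keeps the key stable regardless of dict ordering.
        key = hashlib.sha256(
            json.dumps([template_id, context], sort_keys=True).encode()
        ).hexdigest()
        if key in self.store:
            self.hits += 1               # served from cache: no API cost
            return self.store[key]
        result = self.generate_fn(template_id, context)
        self.store[key] = result
        return result
```

In practice the cache would live in Redis with a TTL, and the context should be coarsened (persona, trigger, step) rather than fully user-specific, or the hit rate collapses.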

Evolution Roadmap

Phase 1: MVP (0-3 months) — Q1 2025
  • Launch 3 core paths (by persona)
  • Basic progress tracking
  • Email nudges only
  • Manual analytics (SQL queries)

Phase 2: Scale (3-6 months) — Q2 2025
  • 10+ paths (persona + use-case)
  • Multi-channel nudges (SMS, in-app, push)
  • Real-time analytics dashboard
  • A/B testing framework

Phase 3: Optimize (6-12 months) — Q3-Q4 2025
  • Personalized paths (1:1)
  • Predictive nudges (ML-driven timing)
  • Advanced analytics (cohort analysis, drop-off prediction)
  • Enterprise features (SSO, audit logs, multi-tenancy)

Complete Systems Architecture

9-layer architecture from user interaction to ML inference

Presentation: Web App · Mobile App (iOS/Android) · Admin Dashboard · Embed SDK
API Gateway: Load Balancer (ALB/NLB) · Rate Limiter (Redis) · Auth (OAuth2/OIDC) · API Router
Agent Layer: Path Planner Agent · Nudge Executor Agent · Progress Evaluator Agent · Guardrail Agent
ML Layer: Feature Store (Feast/Tecton) · Model Registry (MLflow) · Inference Service (TorchServe/TFServing) · Prompt Store (versioned)
Integration: Event Bus (Kafka/SQS) · Segment Adapter · Email/SMS Gateway · In-app Messaging (Intercom)
Data: PostgreSQL (user state) · Redis (session cache) · ClickHouse (analytics) · S3 (event archive)
External: Segment API · Mixpanel/Amplitude · SendGrid/Twilio · OpenAI/Anthropic
Observability: Metrics (Prometheus/Datadog) · Logs (CloudWatch/Loki) · Traces (Jaeger/Tempo) · Dashboards (Grafana)
Security: WAF (Cloudflare/AWS WAF) · Secrets (Vault/KMS) · Audit Log (immutable) · PII Redaction Service

Sequence Diagram - New User Onboarding Request

User → API Gateway: POST /onboarding/start
API Gateway → Path Planner Agent: selectPath(userId, context)
Path Planner Agent → Feature Store: getFeatures(userId)
Feature Store → Path Planner Agent: features (persona, use_case, engagement)
Path Planner Agent: ML inference — classify optimal path
Path Planner Agent → Progress Tracker: initializeState(userId, pathId)
Path Planner Agent → Analytics: track(onboarding_started)
Path Planner Agent → API Gateway: 200 OK + first step
API Gateway → User: Onboarding path + first task
User → API Gateway: POST /onboarding/step/complete
API Gateway → Progress Tracker: updateProgress(userId, stepId)
Progress Tracker → Nudge Executor: checkTriggers(userId, progress)
Nudge Executor: evaluate — send nudge?
Progress Tracker → Analytics: track(step_completed)

User Onboarding System - Hub Architecture

7 components, each with 4 capabilities, connected over HTTP, REST, gRPC, Kafka, event streams, and WebSocket: Onboarding Orchestrator, Path Planner Agent, Progress Evaluator Agent, Nudge Executor Agent, Guardrail Agent, Analytics Agent, Experiment Orchestrator Agent.

Edge labels (from the diagram):
  • [gRPC] User context & profile → Optimal path & steps
  • [Event] Step completion events → Validation results & alerts
  • [Kafka] Nudge requests → Delivery status
  • [REST] Content validation requests → Sanitized content & flags
  • [Event] User interaction events · [WebSocket] Real-time insights
  • [gRPC] Variant assignment request → Assigned variant & config

User Onboarding System - Feedback Loops & Adaptive Learning

7 components, each with 4 capabilities, connected over HTTP, REST, gRPC, Kafka, event streams, and WebSocket: Path Planner Agent, Progress Evaluator Agent, Nudge Executor Agent, Analytics Agent, Experiment Orchestrator Agent, Guardrail Agent, Learning Engine.

Edge labels (from the diagram):
  • [Feedback] Path effectiveness metrics · Engagement response data · Path variant requests
  • [Event] Path adjustments · Intervention triggers · Winning variant config · User progress events · Nudge effectiveness data · Policy violation events
  • [REST] Content for validation → Approved/rejected content
  • [Stream] Path performance insights · Variant performance data
  • [Kafka] Training data & metrics
  • [gRPC] Updated ML models · Optimized messaging models

Data Flow

User signup โ†’ activation in 8 steps

1. User (0s): Signs up → Email, company, role
2. API Gateway (50ms): Authenticates, routes → JWT token
3. Path Planner Agent (300ms): Selects optimal path → pathId, confidence
4. Progress Tracker (400ms): Initializes state → UserProgress record
5. User (120s): Completes step 1 → Step completion event
6. Progress Evaluator (150ms): Validates completion → isValid, qualityScore
7. Nudge Executor (200ms): Checks triggers → No nudge (on-track)
8. Analytics (50ms): Updates funnel → Event ingested
9. User (86400s): Gets stuck on step 3 → No activity for 24h
10. Nudge Executor (2s): Sends email nudge → Personalized message
11. User (3600s): Completes all steps → Activation event
12. Analytics (100ms): Records activation → Cohort update

Scaling Patterns

1K-10K users/month — Serverless Monolith
  • API Gateway (AWS API Gateway)
  • Lambda functions (Python)
  • RDS PostgreSQL (single instance)
  • ElastiCache Redis
  • S3 for event storage
Cost: $200-500/month · Latency: 300-500ms p95

10K-100K users/month — Microservices + Queue
  • Load balancer (ALB)
  • ECS/Fargate containers
  • RDS Multi-AZ
  • Redis Cluster
  • SQS for async tasks
  • ClickHouse for analytics
Cost: $1K-2K/month · Latency: 200-400ms p95

100K-1M users/month — Event-Driven Multi-Agent
  • Global load balancer
  • Kubernetes (EKS/GKE)
  • Aurora PostgreSQL (multi-region)
  • Redis Cluster (sharded)
  • Kafka for event streaming
  • ClickHouse Cluster
  • Dedicated ML inference service
Cost: $5K-10K/month · Latency: 100-300ms p95

1M+ users/month — Enterprise Multi-Region
  • Multi-region active-active
  • Kubernetes (multi-cluster)
  • Aurora Global Database
  • Redis Enterprise
  • Confluent Kafka (managed)
  • ClickHouse Cloud
  • Dedicated GPU instances for ML
  • CDN (CloudFront/Cloudflare)
Cost: $20K+/month · Latency: 50-200ms p95

Key Integrations

Segment

Protocol: HTTP API + Webhooks
User event โ†’ Segment track()
Segment โ†’ onboarding system webhook
Onboarding system โ†’ progress update
Onboarding system โ†’ Segment identify() with traits

Product Analytics (Mixpanel/Amplitude)

Protocol: HTTP API
Onboarding event โ†’ analytics API
Analytics โ†’ funnel calculation
Dashboard queries โ†’ analytics API
Real-time cohort sync

Email/SMS (SendGrid/Twilio)

Protocol: HTTP API
Nudge trigger โ†’ template selection
LLM โ†’ personalized message generation
SendGrid/Twilio API โ†’ delivery
Webhook โ†’ delivery status update

In-app Messaging (Intercom)

Protocol: REST API + JavaScript SDK
User context โ†’ Intercom identify()
Nudge trigger โ†’ Intercom message API
User interaction โ†’ webhook
Progress update

Security & Compliance

Failure Modes & Fallbacks

Failure | Fallback | Impact | SLA
Path Planner Agent unavailable | Rule-based path selection (persona → default path mapping) | Degraded personalization, 10-15% lower activation | 99.5%
Feature Store timeout | Use cached features (stale <5min acceptable) | Slightly stale data, minimal impact | 99.9%
ML inference high latency (>1s) | Switch to lightweight model or rule-based logic | Faster response, 5-10% lower accuracy | 99.0%
Nudge delivery failure (SendGrid/Twilio down) | Queue for retry, switch to backup provider | Delayed nudge (up to 1 hour), minimal impact | 99.5%
Analytics database (ClickHouse) unavailable | Return cached dashboard data, disable real-time queries | Stale analytics, no new insights | 99.0%
Progress Tracker (Redis) unavailable | Fall back to PostgreSQL for state reads (slower) | Higher latency (500ms vs 50ms), no data loss | 99.9%
PII detection service fails | Block all LLM calls (fail-safe mode) | No personalized nudges, manual fallback | 100%

Advanced ML/AI Patterns

Production ML engineering beyond basic API calls

RAG vs Fine-Tuning

Nudge templates change frequently (weekly A/B tests), so RAG allows rapid updates without retraining; path classification is stable (quarterly updates), where fine-tuning gives better accuracy.
✅ RAG (chosen): Cost $100/month · Update daily · How: add new templates to the vector DB
❌ Fine-tuning: Cost $2K/month · Update quarterly · How: retrain the classifier on new data

Implementation: Vector DB (Pinecone) with nudge templates, retrieved based on user context. Fine-tuned Random Forest for path classification (scikit-learn).

Hallucination Detection

Problem: LLMs hallucinate user data (fake names, companies). Detection layers:
  L1: Confidence scores (<0.7 = flag)
  L2: Cross-reference with user profile database
  L3: Logical consistency checks (e.g., company size vs. industry)
  L4: Human review queue for low-confidence cases
Result: 0.2% hallucination rate, 99.8% caught before delivery
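The first two layers (confidence threshold, profile cross-reference) can be sketched directly; field names and the 0.7 threshold follow the text above, everything else is illustrative:

```python
def hallucination_flags(message_fields, confidence, user_profile):
    """Return a list of flags for a generated nudge; empty list = safe to send.
    Implements L1 (confidence) and L2 (profile cross-reference) only."""
    flags = []
    if confidence < 0.7:
        flags.append("low_confidence")               # L1: model is unsure
    for field, value in message_fields.items():
        known = user_profile.get(field)
        if known is not None and known != value:
            flags.append(f"mismatch:{field}")        # L2: contradicts profile
    return flags
```

Anything flagged here would route to the L4 human review queue rather than being delivered.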

Evaluation Framework

Path Selection Accuracy: 89% (target: 85%+)
Stuck Detection Precision: 92% (target: 90%+)
Nudge Response Prediction AUC: 0.76 (target: 0.75+)
Hallucination Rate: 0.2% (target: <0.5%)

Testing: Shadow mode — 1,000 users run in parallel with manual path selection; compare activation rates.

Dataset Curation

1. Collect: 50K onboarding sessions — historical data + manual labels
2. Clean: 42K usable — remove incomplete sessions and duplicates
3. Label: 42K labeled ($8.4K)
4. Augment: +8K synthetic — edge-case generation (stuck users, fast completers)
→ 50K high-quality examples (Cohen's Kappa: 0.88)

Agentic RAG

Agent iteratively retrieves based on its reasoning:
User stuck on step 3 → RAG retrieves similar cases → agent reasons "need clarification on feature X" → RAG retrieves feature docs → nudge generated with full context
💡 Not one-shot retrieval: the agent decides what else it needs to know.

Multi-Armed Bandit for Nudge Optimization
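This section's title suggests bandit-based selection of nudge variants. A minimal Thompson-sampling sketch under assumed names (Beta-Bernoulli posteriors per variant; the variant labels are illustrative):

```python
import random

class ThompsonBandit:
    """Thompson sampling over nudge variants: keep a Beta(alpha, beta)
    posterior per variant, sample one draw from each, send the variant
    with the highest draw, then update on conversion feedback."""
    def __init__(self, variants):
        self.posteriors = {v: [1.0, 1.0] for v in variants}  # uniform priors

    def choose(self):
        draws = {v: random.betavariate(a, b)
                 for v, (a, b) in self.posteriors.items()}
        return max(draws, key=draws.get)

    def update(self, variant, converted):
        a, b = self.posteriors[variant]
        if converted:
            self.posteriors[variant] = [a + 1, b]   # success: bump alpha
        else:
            self.posteriors[variant] = [a, b + 1]   # failure: bump beta
```

Compared with a fixed A/B split, the bandit shifts traffic toward winning variants during the experiment, trading some statistical cleanliness for faster conversion gains.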

Tech Stack Summary

LLMs: GPT-4, Claude 3.5, or Gemini
ML: scikit-learn, XGBoost, PyMC
Orchestration: LangGraph, Celery, or Temporal
Database: PostgreSQL, Redis, ClickHouse
Messaging: Kafka, SQS, or RabbitMQ
Compute: Lambda, ECS/Fargate, or Kubernetes
Monitoring: Prometheus, Grafana, Datadog, Sentry
Feature Store: Feast, Tecton, or custom
Experiment Platform: Custom or Optimizely
๐Ÿ—๏ธ

Need Architecture Review?

We'll audit your onboarding system, identify bottlenecks, and show you how to scale to 10x users with higher activation rates.