From prompts to production onboarding system.
Monday: 3 core prompts for path generation, progress tracking, and nudge creation. Tuesday: automated onboarding platform code. Wednesday: team workflows across product, growth, and engineering. Thursday: complete technical architecture with adaptive path engine, ML-driven nudges, and real-time analytics.
Key Assumptions
- Handle 1K-100K new users per month (scale to 1M)
- Support 5-20 onboarding paths (persona-based, use-case-based)
- Real-time progress tracking (<500ms latency)
- Multi-channel nudges (email, SMS, in-app, push)
- GDPR/CCPA compliant (PII redaction, data residency)
- SOC2 Type II compliance for enterprise customers
System Requirements
Functional
- Adaptive path engine: select optimal onboarding flow per user
- Progress tracking: real-time state management across sessions
- Nudge system: multi-channel triggers based on behavior
- Analytics: funnel analysis, drop-off detection, cohort tracking
- A/B testing: path variants with statistical significance
- Integration: Segment, Mixpanel, Amplitude, Intercom, SendGrid
- Admin UI: path editor, analytics dashboard, experiment manager
Non-Functional (SLOs)
💰 Cost Targets: $0.05 per user, $0.10 per 1K events, $0.02 per 1K ML inferences
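As a rough worked example of how these targets relate, the sketch below checks whether event and inference costs fit under the per-user figure. It assumes the per-user target is an all-in ceiling, and the per-user event and inference volumes in the usage note are hypothetical; only the three unit costs come from the cost targets above.

```python
# Unit costs taken from the cost targets; everything else is an assumption.
COST_TARGETS = {
    "per_user_usd": 0.05,
    "per_1k_events_usd": 0.10,
    "ml_inference_per_1k_usd": 0.02,
}


def variable_cost_per_user(events_per_user: int, inferences_per_user: int) -> float:
    """Event plus ML inference cost attributable to one onboarded user."""
    events = events_per_user / 1000 * COST_TARGETS["per_1k_events_usd"]
    inference = inferences_per_user / 1000 * COST_TARGETS["ml_inference_per_1k_usd"]
    return events + inference


def within_budget(events_per_user: int, inferences_per_user: int) -> bool:
    """True if the variable cost stays at or below the per-user target."""
    cost = variable_cost_per_user(events_per_user, inferences_per_user)
    return cost <= COST_TARGETS["per_user_usd"]
```

With a hypothetical 200 events and 10 inferences per user, the variable cost is $0.02 + $0.0002 = $0.0202, which leaves headroom under the $0.05 target.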
Agent Layer
path-planner
L3: Select optimal onboarding path based on user context
🔧 Feature Store: getUserFeatures(), Model Registry: getPathClassifier(v1.2), Path Database: getAvailablePaths()
⚡ Recovery: If ML inference fails → fall back to rule-based path selection. If feature store unavailable → use cached features (stale <5min). If no path matches → default to generic path.
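The three recovery rules for path-planner form a fallback chain, which might be sketched as follows. The function names (`load_features`, `select_path`), the `rules` persona-to-path mapping, and the `"generic"` path id are hypothetical; only the 5-minute staleness limit and the order of fallbacks come from the rules above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FEATURE_TTL_SECONDS = 300  # "stale <5min" from the recovery rules
GENERIC_PATH = "generic"   # hypothetical id for the default path


@dataclass
class CachedFeatures:
    features: dict
    fetched_at: float  # epoch seconds


def load_features(fetch: Callable[[], dict],
                  cache: Optional[CachedFeatures],
                  now: float) -> Optional[dict]:
    """Try the feature store; on failure use the cache if it is fresh enough."""
    try:
        return fetch()
    except Exception:
        if cache is not None and now - cache.fetched_at < FEATURE_TTL_SECONDS:
            return cache.features
        return None


def select_path(features: Optional[dict],
                classify: Callable[[dict], str],
                rules: dict,
                available: set) -> str:
    """ML classifier first, persona rules second, generic path last."""
    path = None
    if features is not None:
        try:
            path = classify(features)
        except Exception:
            # Rule-based fallback: map persona to a known path.
            path = rules.get(features.get("user_persona"))
    return path if path in available else GENERIC_PATH
```

Keeping the generic path as the final return means the planner always answers, even when both the feature store and the model are down.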
progress-evaluator
L3: Validate step completion, detect stuck users, trigger interventions
🔧 Progress Database: getState(userId), Validation Rules: checkCompletionCriteria(stepId), Analytics: getStepStats(stepId)
⚡ Recovery: If validation ambiguous → flag for human review. If database write fails → retry 3x with exponential backoff. If stuck detection fails → default to time-based nudge.
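The "retry 3x with exponential backoff" rule could look roughly like the helper below. It assumes "3x" means three retries after the initial attempt, and the 0.1-second base delay is an illustrative value; `retry_with_backoff` is a hypothetical name, not part of the platform.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(op: Callable[[], T],
                       retries: int = 3,
                       base_delay: float = 0.1,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op; on failure wait base_delay * 2**attempt and try again.

    After `retries` failed retries the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return op()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s by default
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would usually also add jitter to avoid synchronized retries.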
nudge-executor
L2: Generate and send personalized nudges across channels
🔧 Prompt Store: getNudgeTemplate(trigger, persona), LLM: generatePersonalizedMessage(template, context), Channel Gateway: send(channel, userId, message)
⚡ Recovery: If LLM fails → use static template. If channel unavailable → queue for retry. If rate limit hit → defer to next window.
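The first recovery rule (LLM fails, use the static template) can be illustrated with a small sketch. The template text, the `(trigger, persona)` keys, and the `build_message` name are all hypothetical; the `generate` callable stands in for whatever LLM client the platform uses.

```python
from string import Template
from typing import Callable, Optional

# Hypothetical static templates keyed by (trigger, persona).
STATIC_TEMPLATES = {
    ("stuck", "developer"): Template("Hi $name, need a hand with $step?"),
}


def build_message(trigger: str,
                  persona: str,
                  context: dict,
                  generate: Optional[Callable[[str, dict], str]] = None) -> str:
    """Personalize via the LLM when possible, else fill the static template."""
    template = STATIC_TEMPLATES[(trigger, persona)]
    if generate is not None:
        try:
            return generate(template.template, context)
        except Exception:
            pass  # fall through to the static template
    return template.substitute(context)
```

Because the static path only needs `string.Template`, a nudge can still go out when the LLM provider is unreachable.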
guardrail-agent
L4: PII redaction, policy enforcement, abuse detection
🔧 PII Detector: detectSensitiveData(text), Policy Engine: checkPolicy(action, userId), Abuse Detector: checkAnomalousActivity(userId)
⚡ Recovery: If PII detection fails → block action (fail-safe). If policy check times out → deny by default. If abuse detection unavailable → log and allow (monitored).
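A minimal sketch of the fail-safe behaviour for PII detection follows. The email regex is a deliberately simple stand-in for a real PII detector, and `prepare_llm_input` is a hypothetical name; the point is only that a detector error blocks the call rather than letting raw text through.

```python
import re
from typing import Callable, Optional

# Simplified stand-in for a PII detector: catches only basic email addresses.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def prepare_llm_input(text: str,
                      detector: Callable[[str], str] = redact_emails) -> Optional[str]:
    """Return redacted text, or None to signal that the LLM call is blocked."""
    try:
        return detector(text)
    except Exception:
        return None  # fail closed: never forward unchecked text
```

Callers treat `None` as "do not call the LLM", which matches the block-on-failure rule above.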
analytics-agent
L2: Real-time funnel analysis, drop-off detection, cohort insights
🔧 ClickHouse: runQuery(funnel_sql), Feature Store: getSegmentFeatures(), Alerting: sendAlert(anomaly)
⚡ Recovery: If query times out → return cached results. If anomaly detection fails → skip alert. If segmentation unavailable → use default segments.
experiment-orchestrator
L3: A/B test allocation, statistical analysis, winner declaration
🔧 Experiment Store: getExperiment(id), Allocation Engine: assignVariant(userId, experiment), Stats Engine: bayesianAnalysis(results)
⚡ Recovery: If allocation fails → default to control. If stats engine unavailable → defer analysis. If confidence low → extend experiment duration.
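One common way to make variant allocation consistent across sessions is to hash the user and experiment ids into a bucket. The sketch below assumes that approach; the source does not say how `assignVariant` is implemented, and the variant names and equal default weights are illustrative.

```python
import hashlib
from typing import Optional, Sequence


def assign_variant(user_id: str,
                   experiment_id: str,
                   variants: Sequence[str] = ("control", "treatment"),
                   weights: Optional[Sequence[float]] = None) -> str:
    """Deterministically map a user to a variant; fall back to control on error."""
    try:
        weights = weights or [1 / len(variants)] * len(variants)
        digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).digest()
        # 48 bits keep the bucket exactly representable as a float in [0, 1).
        bucket = int.from_bytes(digest[:6], "big") / 2 ** 48
        cumulative = 0.0
        for variant, weight in zip(variants, weights):
            cumulative += weight
            if bucket < cumulative:
                return variant
        return variants[-1]
    except Exception:
        return "control"  # "If allocation fails, default to control"
```

Since the bucket depends only on the two ids, the same user always lands in the same variant without any stored assignment.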
ML Layer
Feature Store
Update: Real-time for behavioral features, daily for derived features
- user_persona (categorical)
- signup_source (categorical)
- engagement_score (numeric, 0-100)
- company_size (categorical)
- use_case (categorical)
- time_since_signup (numeric, hours)
- steps_completed (numeric)
- session_count (numeric)
- avg_session_duration (numeric, seconds)
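The feature list above maps naturally onto a typed record. The sketch below is one possible shape, with a range check for `engagement_score` taken from the 0-100 bound; the class name, the non-negativity checks, and the example values in the usage note are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserFeatures:
    user_persona: str            # categorical
    signup_source: str           # categorical
    engagement_score: float      # numeric, 0-100
    company_size: str            # categorical
    use_case: str                # categorical
    time_since_signup: float     # numeric, hours
    steps_completed: int         # numeric
    session_count: int           # numeric
    avg_session_duration: float  # numeric, seconds

    def __post_init__(self) -> None:
        if not 0 <= self.engagement_score <= 100:
            raise ValueError("engagement_score must be within 0-100")
        # Assumed constraint: counts and durations cannot be negative.
        for name in ("time_since_signup", "steps_completed",
                     "session_count", "avg_session_duration"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
```

For example, `UserFeatures("developer", "organic", 72.5, "11-50", "api_integration", 3.0, 2, 4, 310.0)` constructs a valid record, while an `engagement_score` of 140 raises `ValueError`.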
Model Registry
Strategy: Semantic versioning with A/B testing before promotion
- path_classifier
- stuck_detector
- nudge_response_predictor
Observability
Metrics
- onboarding_start_rate
- activation_rate
- time_to_activation_p50_hours
- step_completion_rate
- drop_off_rate_by_step
- nudge_send_rate
- nudge_open_rate
- nudge_conversion_rate
- path_selection_latency_p95_ms
- ml_inference_latency_p95_ms
- api_latency_p95_ms
- error_rate_percent
- experiment_allocation_consistency
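To make one of these metrics concrete, the sketch below computes `drop_off_rate_by_step` from ordered funnel counts. The definition used here (share of users who reached a step but not the next one) is an assumption, as are the step names in the usage note; the source lists the metric without defining it.

```python
def drop_off_rate_by_step(funnel: list[tuple[str, int]]) -> dict[str, float]:
    """Drop-off per step from (step_name, users_reached) pairs in funnel order.

    The last step has no successor, so it gets no drop-off rate.
    """
    rates: dict[str, float] = {}
    for (step, reached), (_, reached_next) in zip(funnel, funnel[1:]):
        if reached == 0:
            rates[step] = 0.0  # nobody reached this step, nothing to lose
        else:
            rates[step] = (reached - reached_next) / reached
    return rates
```

With hypothetical counts `[("signup", 1000), ("connect_data", 600), ("invite_team", 300)]`, the drop-off is 0.4 at signup and 0.5 at connect_data.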
Dashboards
- ops_dashboard
- product_dashboard
- ml_dashboard
- experiment_dashboard
Traces
✓ Enabled
Deployment Variants
Startup
Infrastructure:
- AWS Lambda + API Gateway
- RDS PostgreSQL (single AZ)
- ElastiCache Redis
- S3 for events
- CloudWatch for logs/metrics
- Managed services (SendGrid, Twilio, Intercom)
✓ Serverless-first for low ops overhead
✓ Single region (us-east-1 or eu-west-1)
✓ Managed services to reduce complexity
✓ Cost: $200-500/month for 1K-10K users
✓ Deploy via Serverless Framework or SAM
Enterprise
Infrastructure:
- Kubernetes (EKS/GKE) multi-cluster
- Aurora PostgreSQL Global Database
- Redis Enterprise (multi-region)
- Kafka (Confluent Cloud or self-hosted)
- ClickHouse Cloud
- Private networking (VPC peering, PrivateLink)
- BYO KMS/HSM for encryption
- SSO/SAML integration
- Dedicated ML inference service (GPU)
- Multi-region active-active
✓ Multi-tenant with tenant isolation
✓ VPC isolation per customer (optional)
✓ Data residency per region
✓ SSO/SAML for enterprise auth
✓ Custom SLAs (99.9% or 99.99%)
✓ Dedicated support and on-call
✓ Cost: $20K+/month for 1M+ users
Migration: Start with the startup stack. Migrate to Kubernetes above 100K users/month. Add multi-region above 500K users/month. Implement tenant isolation when the first enterprise customer signs. Use feature flags to shift traffic gradually.
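Shifting traffic gradually with a feature flag is often done with a deterministic percentage rollout, sketched below. The flag name and the `in_rollout` helper are hypothetical, and hashing into 100 buckets is one common approach rather than something the source prescribes.

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """True if the user falls inside the first `percent` of 100 stable buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def route(user_id: str, percent_on_new_stack: int) -> str:
    """Pick the backend for a request during a migration (names are illustrative)."""
    if in_rollout(user_id, "k8s-migration", percent_on_new_stack):
        return "kubernetes"
    return "serverless"
```

Because each user's bucket is fixed, raising the percentage only adds users to the new stack; nobody flips back and forth between backends as the rollout widens.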
Risks & Mitigations
⚠️ Low activation rates (<40%)
Medium. Mitigation: A/B test paths continuously. Shadow mode with manual path selection. Fallback to proven paths.
⚠️ Nudge fatigue (too many notifications)
Medium. Mitigation: Frequency caps (max 1 nudge/day). User preferences (opt-out). Track unsubscribe rate.
⚠️ PII leakage to LLM
Low. Mitigation: PII detection before LLM calls. Fail-safe mode if detection fails. Audit all LLM inputs.
⚠️ ML model drift (accuracy degrades over time)
High. Mitigation: Weekly drift detection. Automatic retraining if accuracy drops >5%. Human review for edge cases.
⚠️ Scalability bottleneck (Redis/PostgreSQL)
Medium. Mitigation: Read replicas. Sharding for Redis. Caching layer. Load testing before launch.
⚠️ Integration failures (Segment, SendGrid, etc.)
Medium. Mitigation: Retry logic with exponential backoff. Backup providers. Queue for failed deliveries.
⚠️ Cost overruns (LLM API costs)
Medium. Mitigation: Rate limiting. Cost alerts. Use cheaper models for non-critical tasks. Cache LLM responses.
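The nudge-fatigue mitigation (frequency cap plus opt-out) reduces to a small gate in front of the sender. The sketch below reads "max 1 nudge/day" as a rolling 24-hour window, which is an assumption; a calendar-day cap in the user's timezone would be an equally valid reading. The function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_send_nudge(now: datetime,
                   last_sent_at: Optional[datetime],
                   opted_out: bool,
                   min_gap: timedelta = timedelta(days=1)) -> bool:
    """Allow a nudge only if the user has not opted out and the cap permits it."""
    if opted_out:
        return False
    return last_sent_at is None or now - last_sent_at >= min_gap
```

For instance, a user last nudged 3 hours ago is skipped, while one last nudged 25 hours ago is eligible again.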
Evolution Roadmap
Phase 1: MVP (0-3 months)
Q1 2025
- Launch 3 core paths (by persona)
- Basic progress tracking
- Email nudges only
- Manual analytics (SQL queries)
Phase 2: Scale (3-6 months)
Q2 2025
- 10+ paths (persona + use-case)
- Multi-channel nudges (SMS, in-app, push)
- Real-time analytics dashboard
- A/B testing framework
Phase 3: Optimize (6-12 months)
Q3-Q4 2025
- Personalized paths (1:1)
- Predictive nudges (ML-driven timing)
- Advanced analytics (cohort analysis, drop-off prediction)
- Enterprise features (SSO, audit logs, multi-tenancy)
Complete Systems Architecture
9-layer architecture from user interaction to ML inference
Sequence Diagram - New User Onboarding Request
User Onboarding System - Hub Architecture
7 Components
User Onboarding System - Feedback Loops & Adaptive Learning
7 Components
Data Flow
User signup → activation in 8 steps
Scaling Patterns
Key Integrations
Segment
Product Analytics (Mixpanel/Amplitude)
Email/SMS (SendGrid/Twilio)
In-app Messaging (Intercom)
Security & Compliance
Failure Modes & Fallbacks
Failure | Fallback | Impact | SLA |
---|---|---|---|
Path Planner Agent unavailable | Use rule-based path selection (persona → default path mapping) | Degraded personalization, 10-15% lower activation | 99.5% |
Feature Store timeout | Use cached features (stale <5min acceptable) | Slightly stale data, minimal impact | 99.9% |
ML inference high latency (>1s) | Switch to lightweight model or rule-based logic | Faster response, 5-10% lower accuracy | 99.0% |
Nudge delivery failure (SendGrid/Twilio down) | Queue for retry, switch to backup provider | Delayed nudge (up to 1 hour), minimal impact | 99.5% |
Analytics database (ClickHouse) unavailable | Return cached dashboard data, disable real-time queries | Stale analytics, no new insights | 99.0% |
Progress Tracker (Redis) unavailable | Fallback to PostgreSQL for state reads (slower) | Higher latency (500ms vs 50ms), no data loss | 99.9% |
PII detection service fails | Block all LLM calls (fail-safe mode) | No personalized nudges, manual fallback | 100% |
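The Progress Tracker row in the table above (Redis unavailable, read state from PostgreSQL) might be sketched as a read-through with fallback. The two callables stand in for real Redis and PostgreSQL clients, so no specific client API is implied, and `read_progress` is a hypothetical name.

```python
from typing import Callable, Optional, Tuple


def read_progress(user_id: str,
                  cache_get: Callable[[str], Optional[dict]],
                  db_get: Callable[[str], Optional[dict]]) -> Tuple[Optional[dict], str]:
    """Read onboarding state from the cache, falling back to the database.

    Returns the state plus the source that served it, so callers can
    emit a metric when the slower path is in use.
    """
    try:
        state = cache_get(user_id)
        if state is not None:
            return state, "cache"
    except Exception:
        pass  # cache unavailable: degrade to the database read
    return db_get(user_id), "database"
```

Returning the source alongside the state makes the degraded mode visible to dashboards, which helps when tracking the higher latency noted in the table.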
Advanced ML/AI Patterns
Production ML engineering beyond basic API calls