User Onboarding System Architecture: AI-Powered Product

From prompts to production onboarding system.

Monday: 3 core prompts for path generation, progress tracking, and nudge creation. Tuesday: automated onboarding platform code. Wednesday: team workflows across product, growth, and engineering. Thursday: complete technical architecture with adaptive path engine, ML-driven nudges, and real-time analytics.

Key Assumptions

•Handle 1K-100K new users per month (scale to 1M)
•Support 5-20 onboarding paths (persona-based, use-case-based)
•Real-time progress tracking (<500ms latency)
•Multi-channel nudges (email, SMS, in-app, push)
•GDPR/CCPA compliant (PII redaction, data residency)
•SOC2 Type II compliance for enterprise customers

System Requirements

Functional

Adaptive path engine: select optimal onboarding flow per user
Progress tracking: real-time state management across sessions
Nudge system: multi-channel triggers based on behavior
Analytics: funnel analysis, drop-off detection, cohort tracking
A/B testing: path variants with statistical significance
Integration: Segment, Mixpanel, Amplitude, Intercom, SendGrid
Admin UI: path editor, analytics dashboard, experiment manager

Non-Functional (SLOs)

• Latency (p95): 500 ms
• Freshness: 1 min
• Availability: 99.9%
• Accuracy: 95%

💰 Cost Targets: $0.05 per user, $0.10 per 1K events, $0.02 per 1K ML inferences
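The availability and per-user cost targets translate into concrete monthly budgets. The following is a minimal sketch of that arithmetic, assuming a 30-day month; the function names are illustrative and not part of the system described here.

```python
# Turn the availability SLO and per-user cost target into monthly budgets.
# Assumptions: a 30-day month; helper names are hypothetical.

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes
PER_USER_COST_TARGET_USD = 0.05          # from the cost targets above


def downtime_budget_minutes(availability_percent: float) -> float:
    """Allowed downtime per 30-day month for a given availability target."""
    return MINUTES_PER_30_DAY_MONTH * (1 - availability_percent / 100)


def per_user_cost_ceiling_usd(new_users: int) -> float:
    """Monthly spend ceiling implied by the per-user cost target."""
    return new_users * PER_USER_COST_TARGET_USD


if __name__ == "__main__":
    print(f"99.9% availability allows {downtime_budget_minutes(99.9):.1f} min/month")
    print(f"100K users implies a ceiling of ${per_user_cost_ceiling_usd(100_000):,.0f}")
```

Under these assumptions, 99.9% availability leaves roughly 43 minutes of downtime per month, which is a useful number to keep next to the on-call policy.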

Agent Layer

path-planner

Select optimal onboarding path based on user context

🔧 Feature Store: getUserFeatures(), Model Registry: getPathClassifier(v1.2), Path Database: getAvailablePaths()

⚡ Recovery: If ML inference fails → fallback to rule-based path selection, If feature store unavailable → use cached features (stale <5min), If no path matches → default to generic path
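The recovery chain for the path planner (ML first, then rules, then the generic path) can be sketched as below. The persona-to-path mapping, the path IDs, and the `classify` callable are assumptions made for illustration; the real classifier and path database are not specified here.

```python
from typing import Callable, Mapping, Optional

GENERIC_PATH = "generic"

# Hypothetical persona -> path mapping for the rule-based fallback.
RULE_BASED_PATHS = {"developer": "api-quickstart", "marketer": "campaign-setup"}


def select_path(
    features: Mapping[str, str],
    classify: Callable[[Mapping[str, str]], Optional[str]],
) -> tuple[str, str]:
    """Return (path_id, source): try ML, then rules, then the generic path."""
    try:
        path_id = classify(features)
        if path_id is not None:
            return path_id, "ml"
    except Exception:
        # ML inference failed: fall through to rule-based selection.
        pass
    rule_path = RULE_BASED_PATHS.get(features.get("user_persona", ""))
    if rule_path is not None:
        return rule_path, "rules"
    # No path matched: use the generic path.
    return GENERIC_PATH, "default"
```

Returning the source alongside the path makes it easy to count how often each fallback fires.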

progress-evaluator

Validate step completion, detect stuck users, trigger interventions

🔧 Progress Database: getState(userId), Validation Rules: checkCompletionCriteria(stepId), Analytics: getStepStats(stepId)

⚡ Recovery: If validation ambiguous → flag for human review, If database write fails → retry 3x with exponential backoff, If stuck detection fails → default to time-based nudge
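A minimal sketch of the retry-with-exponential-backoff recovery step follows. It assumes "3x" means three attempts in total and a 0.1 s base delay; both values, and the injectable `sleep` parameter, are illustrative choices.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation; on failure wait base, 2*base, 4*base, ... between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the function testable without real delays. In production, adding random jitter to each delay is a common refinement.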

nudge-executor

Generate and send personalized nudges across channels

🔧 Prompt Store: getNudgeTemplate(trigger, persona), LLM: generatePersonalizedMessage(template, context), Channel Gateway: send(channel, userId, message)

⚡ Recovery: If LLM fails → use static template, If channel unavailable → queue for retry, If rate limit hit → defer to next window
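The "LLM fails → static template" fallback might look like the sketch below. The template text, the trigger name, and the `generate` callable are hypothetical stand-ins for the prompt store and LLM client.

```python
from string import Template
from typing import Callable, Mapping

# Hypothetical static templates keyed by trigger.
STATIC_TEMPLATES = {
    "stuck": Template("Hi $name, need a hand finishing $step?"),
}


def build_nudge(
    trigger: str,
    context: Mapping[str, str],
    generate: Callable[[str, Mapping[str, str]], str],
) -> tuple[str, str]:
    """Return (message, source), falling back to the static template on LLM failure."""
    template = STATIC_TEMPLATES[trigger]
    try:
        message = generate(template.template, context)
        if message.strip():
            return message, "llm"
    except Exception:
        # LLM call failed: use the static template instead.
        pass
    return template.safe_substitute(context), "static"
```

`safe_substitute` leaves unknown placeholders in place rather than raising, so a missing context field degrades the message instead of dropping the nudge.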

guardrail-agent

PII redaction, policy enforcement, abuse detection

🔧 PII Detector: detectSensitiveData(text), Policy Engine: checkPolicy(action, userId), Abuse Detector: checkAnomalousActivity(userId)

⚡ Recovery: If PII detection fails → block action (fail-safe), If policy check times out → deny by default, If abuse detection unavailable → log and allow (monitored)
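The fail-safe behaviour for PII detection (block the action when detection itself fails) can be illustrated with a small wrapper. The regular expressions below are deliberately simple assumptions and nowhere near a complete PII detector; `GuardrailBlocked` is a hypothetical exception type.

```python
import re
from typing import Callable

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class GuardrailBlocked(Exception):
    """Raised when an action must be blocked for safety."""


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def guard_llm_input(text: str, detector: Callable[[str], str] = redact_pii) -> str:
    """Return redacted text, or block the call if detection fails (fail-safe)."""
    try:
        return detector(text)
    except Exception as exc:
        raise GuardrailBlocked("PII detection failed; blocking LLM call") from exc
```

The key design choice is that a detector error never lets the raw text through: the caller has to handle `GuardrailBlocked`, for example by sending a static template.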

analytics-agent

Real-time funnel analysis, drop-off detection, cohort insights

🔧 ClickHouse: runQuery(funnel_sql), Feature Store: getSegmentFeatures(), Alerting: sendAlert(anomaly)

⚡ Recovery: If query times out → return cached results, If anomaly detection fails → skip alert, If segmentation unavailable → use default segments

experiment-orchestrator

A/B test allocation, statistical analysis, winner declaration

🔧 Experiment Store: getExperiment(id), Allocation Engine: assignVariant(userId, experiment), Stats Engine: bayesianAnalysis(results)

⚡ Recovery: If allocation fails → default to control, If stats engine unavailable → defer analysis, If confidence low → extend experiment duration
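One common way to keep variant allocation consistent across sessions is to hash the user and experiment IDs. The sketch below assumes that approach and a variant literally named "control"; neither detail is confirmed by the description above.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, variants: list[str]) -> str:
    """Deterministically map a user to a variant; default to control on failure."""
    control = "control"
    try:
        key = f"{experiment_id}:{user_id}".encode()
        digest = hashlib.sha256(key).digest()
        # Use the first 8 bytes as an integer bucket.
        bucket = int.from_bytes(digest[:8], "big") % len(variants)
        return variants[bucket]
    except Exception:
        # Allocation failed (for example, an empty variant list).
        return control
```

Because the assignment is a pure function of the two IDs, the same user always sees the same variant without any stored state, which is what an allocation-consistency metric would check.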

ML Layer

Feature Store

Update: Real-time for behavioral features, daily for derived features

• user_persona (categorical)
• signup_source (categorical)
• engagement_score (numeric, 0-100)
• company_size (categorical)
• use_case (categorical)
• time_since_signup (numeric, hours)
• steps_completed (numeric)
• session_count (numeric)
• avg_session_duration (numeric, seconds)
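The feature list above maps naturally onto a typed record. This is a sketch of one possible schema; the class name and the validation rules beyond the stated 0-100 range are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserFeatures:
    """One row of the feature store, mirroring the feature list above."""

    user_persona: str
    signup_source: str
    engagement_score: float      # 0-100
    company_size: str
    use_case: str
    time_since_signup: float     # hours
    steps_completed: int
    session_count: int
    avg_session_duration: float  # seconds

    def __post_init__(self) -> None:
        if not 0 <= self.engagement_score <= 100:
            raise ValueError("engagement_score must be in [0, 100]")
        # Assumed rule: counts and durations cannot be negative.
        for name in ("time_since_signup", "steps_completed",
                     "session_count", "avg_session_duration"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
```

Validating at construction time keeps malformed rows out of the classifier input instead of surfacing them later as odd predictions.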

Model Registry

Strategy: Semantic versioning with A/B testing before promotion

• path_classifier
• stuck_detector
• nudge_response_predictor

Observability

Metrics

📊 onboarding_start_rate
📊 activation_rate
📊 time_to_activation_p50_hours
📊 step_completion_rate
📊 drop_off_rate_by_step
📊 nudge_send_rate
📊 nudge_open_rate
📊 nudge_conversion_rate
📊 path_selection_latency_p95_ms
📊 ml_inference_latency_p95_ms
📊 api_latency_p95_ms
📊 error_rate_percent
📊 experiment_allocation_consistency

Dashboards

📈 ops_dashboard
📈 product_dashboard
📈 ml_dashboard
📈 experiment_dashboard

Traces

✅ Enabled

Deployment Variants

🚀 Startup

Infrastructure:

• AWS Lambda + API Gateway
• RDS PostgreSQL (single AZ)
• ElastiCache Redis
• S3 for events
• CloudWatch for logs/metrics
• Managed services (SendGrid, Twilio, Intercom)

→ Serverless-first for low ops overhead

→ Single region (us-east-1 or eu-west-1)

→ Managed services to reduce complexity

→ Cost: $200-500/month for 1K-10K users

→ Deploy via Serverless Framework or SAM

🏢 Enterprise

Infrastructure:

• Kubernetes (EKS/GKE) multi-cluster
• Aurora PostgreSQL Global Database
• Redis Enterprise (multi-region)
• Kafka (Confluent Cloud or self-hosted)
• ClickHouse Cloud
• Private networking (VPC peering, PrivateLink)
• BYO KMS/HSM for encryption
• SSO/SAML integration
• Dedicated ML inference service (GPU)
• Multi-region active-active

→ Multi-tenant with tenant isolation

→ VPC isolation per customer (optional)

→ Data residency per region

→ SSO/SAML for enterprise auth

→ Custom SLAs (99.9% or 99.99%)

→ Dedicated support and on-call

→ Cost: $20K+/month for 1M+ users

📈 Migration: Start with startup stack. Migrate to Kubernetes when >100K users/month. Add multi-region when >500K users/month. Implement tenant isolation when first enterprise customer signs. Use feature flags to gradually shift traffic.
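The volume bands in the Scaling Patterns section can be expressed as a small lookup, which is handy for capacity alerts. The sketch assumes each band's upper bound is inclusive, since the bands share their boundary values.

```python
# Upper bound (new users/month) -> pattern name, per the Scaling Patterns section.
# Assumption: upper bounds are inclusive.
SCALING_TIERS = [
    (10_000, "Serverless Monolith"),
    (100_000, "Microservices + Queue"),
    (1_000_000, "Event-Driven Multi-Agent"),
]


def recommended_pattern(new_users_per_month: int) -> str:
    """Return the scaling pattern for a given monthly signup volume."""
    for upper_bound, pattern in SCALING_TIERS:
        if new_users_per_month <= upper_bound:
            return pattern
    return "Enterprise Multi-Region"
```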

Risks & Mitigations

⚠️ Low activation rates (<40%)

Medium

✓ Mitigation: A/B test paths continuously. Shadow mode with manual path selection. Fallback to proven paths.

⚠️ Nudge fatigue (too many notifications)

Medium

✓ Mitigation: Frequency caps (max 1 nudge/day). User preferences (opt-out). Track unsubscribe rate.
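A frequency cap with opt-out can be enforced with a single check before every send. This sketch interprets "max 1 nudge/day" as a rolling 24-hour window, which is an assumption; a calendar-day cap would need the user's time zone.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_send_nudge(
    last_sent_at: Optional[datetime],
    opted_out: bool,
    now: datetime,
    min_gap: timedelta = timedelta(days=1),
) -> bool:
    """Allow a nudge only if the user has not opted out and the cap is respected."""
    if opted_out:
        return False
    if last_sent_at is None:
        return True  # no nudge sent yet
    return now - last_sent_at >= min_gap
```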

⚠️ PII leakage to LLM

Low

✓ Mitigation: PII detection before LLM calls. Fail-safe mode if detection fails. Audit all LLM inputs.

⚠️ ML model drift (accuracy degrades over time)

High

✓ Mitigation: Weekly drift detection. Automatic retraining if accuracy drops >5%. Human review for edge cases.

⚠️ Scalability bottleneck (Redis/PostgreSQL)

Medium

✓ Mitigation: Read replicas. Sharding for Redis. Caching layer. Load testing before launch.

⚠️ Integration failures (Segment, SendGrid, etc.)

Medium

✓ Mitigation: Retry logic with exponential backoff. Backup providers. Queue for failed deliveries.

⚠️ Cost overruns (LLM API costs)

Medium

✓ Mitigation: Rate limiting. Cost alerts. Use cheaper models for non-critical tasks. Cache LLM responses.

Evolution Roadmap

Phase 1: MVP (0-3 months)

Q1 2025

→ Launch 3 core paths (by persona)
→ Basic progress tracking
→ Email nudges only
→ Manual analytics (SQL queries)

Phase 2: Scale (3-6 months)

Q2 2025

→ 10+ paths (persona + use-case)
→ Multi-channel nudges (SMS, in-app, push)
→ Real-time analytics dashboard
→ A/B testing framework

Phase 3: Optimize (6-12 months)

Q3-Q4 2025

→ Personalized paths (1:1)
→ Predictive nudges (ML-driven timing)
→ Advanced analytics (cohort analysis, drop-off prediction)
→ Enterprise features (SSO, audit logs, multi-tenancy)

Complete Systems Architecture

9-layer architecture from user interaction to ML inference

Presentation

• Web App
• Mobile App (iOS/Android)
• Admin Dashboard
• Embed SDK

API Gateway

• Load Balancer (ALB/NLB)
• Rate Limiter (Redis)
• Auth (OAuth2/OIDC)
• API Router

Agent Layer

• Path Planner Agent
• Nudge Executor Agent
• Progress Evaluator Agent
• Guardrail Agent

ML Layer

• Feature Store (Feast/Tecton)
• Model Registry (MLflow)
• Inference Service (TorchServe/TFServing)
• Prompt Store (versioned)

Integration

• Event Bus (Kafka/SQS)
• Segment Adapter
• Email/SMS Gateway
• In-app Messaging (Intercom)

Data

• PostgreSQL (user state)
• Redis (session cache)
• ClickHouse (analytics)
• S3 (event archive)

External

• Segment API
• Mixpanel/Amplitude
• SendGrid/Twilio
• OpenAI/Anthropic

Observability

• Metrics (Prometheus/Datadog)
• Logs (CloudWatch/Loki)
• Traces (Jaeger/Tempo)
• Dashboards (Grafana)

Security

• WAF (Cloudflare/AWS WAF)
• Secrets (Vault/KMS)
• Audit Log (immutable)
• PII Redaction Service

[Diagram: Sequence Diagram - New User Onboarding Request]

[Diagram: User Onboarding System - Hub Architecture. 7 components; connection types: HTTP, REST, gRPC, Event, Stream, WebSocket]

[Diagram: User Onboarding System - Feedback Loops & Adaptive Learning. 7 components; connection types: HTTP, REST, gRPC, Event, Stream, WebSocket]

Data Flow

User signup → activation in 8 steps

• User (0s): Signs up → Email, company, role
• API Gateway (50ms): Authenticates, routes → JWT token
• Path Planner Agent (300ms): Selects optimal path → pathId, confidence
• Progress Tracker (400ms): Initializes state → UserProgress record
• User (120s): Completes step 1 → Step completion event
• Progress Evaluator (150ms): Validates completion → isValid, qualityScore
• Nudge Executor (200ms): Checks triggers → No nudge (on-track)
• Analytics (50ms): Updates funnel → Event ingested
• User (86400s): Gets stuck on step 3 → No activity 24h
• Nudge Executor (2s): Sends email nudge → Personalized message
• User (3600s): Completes all steps → Activation event
• Analytics (100ms): Records activation → Cohort update

Scaling Patterns

1K-10K users/month: Serverless Monolith

Architecture:

• API Gateway (AWS API Gateway)
• Lambda functions (Python)
• RDS PostgreSQL (single instance)
• ElastiCache Redis
• S3 for event storage

→ Cost: $200-500/month

→ Latency: 300-500ms p95

10K-100K users/month: Microservices + Queue

Architecture:

• Load balancer (ALB)
• ECS/Fargate containers
• RDS Multi-AZ
• Redis Cluster
• SQS for async tasks
• ClickHouse for analytics

→ Cost: $1K-2K/month

→ Latency: 200-400ms p95

100K-1M users/month: Event-Driven Multi-Agent

Architecture:

• Global load balancer
• Kubernetes (EKS/GKE)
• Aurora PostgreSQL (multi-region)
• Redis Cluster (sharded)
• Kafka for event streaming
• ClickHouse Cluster
• Dedicated ML inference service

→ Cost: $5K-10K/month

→ Latency: 100-300ms p95

1M+ users/month: Enterprise Multi-Region

Architecture:

• Multi-region active-active
• Kubernetes (multi-cluster)
• Aurora Global Database
• Redis Enterprise
• Confluent Kafka (managed)
• ClickHouse Cloud
• Dedicated GPU instances for ML
• CDN (CloudFront/Cloudflare)

→ Cost: $20K+/month

→ Latency: 50-200ms p95

Key Integrations

Segment

Protocol: HTTP API + Webhooks

User event → Segment track()

Segment → onboarding system webhook

Onboarding system → progress update

Onboarding system → Segment identify() with traits

Product Analytics (Mixpanel/Amplitude)

Protocol: HTTP API

Onboarding event → analytics API

Analytics → funnel calculation

Dashboard queries → analytics API

Real-time cohort sync

Email/SMS (SendGrid/Twilio)

Protocol: HTTP API

Nudge trigger → template selection

LLM → personalized message generation

SendGrid/Twilio API → delivery

Webhook → delivery status update

In-app Messaging (Intercom)

Protocol: REST API + JavaScript SDK

User context → Intercom identify()

Nudge trigger → Intercom message API

User interaction → webhook

Progress update

Security & Compliance

Failure Modes & Fallbacks

Failure	Fallback	Impact	SLA
Path Planner Agent unavailable	Use rule-based path selection (persona → default path mapping)	Degraded personalization, 10-15% lower activation	99.5%
Feature Store timeout	Use cached features (stale <5min acceptable)	Slightly stale data, minimal impact	99.9%
ML inference high latency (>1s)	Switch to lightweight model or rule-based logic	Faster response, 5-10% lower accuracy	99.0%
Nudge delivery failure (SendGrid/Twilio down)	Queue for retry, switch to backup provider	Delayed nudge (up to 1 hour), minimal impact	99.5%
Analytics database (ClickHouse) unavailable	Return cached dashboard data, disable real-time queries	Stale analytics, no new insights	99.0%
Progress Tracker (Redis) unavailable	Fallback to PostgreSQL for state reads (slower)	Higher latency (500ms vs 50ms), no data loss	99.9%
PII detection service fails	Block all LLM calls (fail-safe mode)	No personalized nudges, manual fallback	100%
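Several rows in the table share one shape: try the fast dependency, then fall back to a slower but durable one. The Redis-to-PostgreSQL row could be sketched as follows, with `read_cache` and `read_database` as hypothetical callables wrapping the real clients.

```python
from typing import Callable, Optional


def read_progress(
    user_id: str,
    read_cache: Callable[[str], Optional[dict]],
    read_database: Callable[[str], Optional[dict]],
) -> tuple[Optional[dict], str]:
    """Read user state from the cache, falling back to the database."""
    try:
        state = read_cache(user_id)
        if state is not None:
            return state, "cache"
    except Exception:
        # Cache unavailable: accept the slower database read.
        pass
    return read_database(user_id), "database"
```

Tagging each result with its source gives a direct signal for the higher-latency fallback path in dashboards.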

Advanced ML/AI Patterns

Production ML engineering beyond basic API calls

RAG vs Fine-Tuning

Nudge templates change frequently (weekly A/B tests), so RAG is used there: it allows rapid updates without retraining. Path classification is stable (quarterly updates), so a classifier retrained on new data gives better accuracy for that task.

✅ RAG (Chosen)

Cost: $100/month

Update: Daily

How: Add new templates to vector DB

❌ Fine-Tuning

Cost: $2K/month

Update: Quarterly

How: Retrain classifier on new data

Implementation: a vector DB (Pinecone) stores the nudge templates, which are retrieved based on user context. Path classification uses a Random Forest (scikit-learn) that is retrained on new data.
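To show the retrieval idea without a hosted service, the sketch below uses an in-memory index with a toy bag-of-words "embedding" and cosine similarity. It stands in for the vector DB and a real embedding model; the class and template IDs are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TemplateIndex:
    """In-memory stand-in for a vector DB of nudge templates."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, template_id: str, description: str) -> None:
        self._items.append((template_id, embed(description)))

    def query(self, context: str, top_k: int = 1) -> list[str]:
        q = embed(context)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [template_id for template_id, _ in ranked[:top_k]]
```

Adding a template is a single `add` call with no retraining, which is the property that motivates RAG for fast-changing nudge content.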

Hallucination Detection

LLMs hallucinate user data (fake names, companies)

Confidence scores (<0.7 = flag)

Cross-reference with user profile database

Logical consistency checks (e.g., company size vs. industry)

Human review queue for low-confidence cases

0.2% hallucination rate, 99.8% caught before delivery
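The first two checks (confidence threshold and profile cross-reference) can be combined into one gate before delivery. This sketch assumes an upstream step has already extracted the facts a message claims into a dictionary; that extraction step and the field names are hypothetical.

```python
from typing import Mapping


def review_decision(
    claimed_facts: Mapping[str, str],
    profile: Mapping[str, str],
    confidence: float,
    threshold: float = 0.7,
) -> str:
    """Return 'send' or 'human_review' for a generated nudge."""
    if confidence < threshold:
        return "human_review"  # low confidence: flag
    for field, value in claimed_facts.items():
        # Any claimed fact that disagrees with the profile is treated as a hallucination.
        if profile.get(field) != value:
            return "human_review"
    return "send"
```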

Evaluation Framework

• Path Selection Accuracy: 89% (target: 85%+)
• Stuck Detection Precision: 92% (target: 90%+)
• Nudge Response Prediction AUC: 0.76 (target: 0.75+)
• Hallucination Rate: 0.2% (target: <0.5%)

Testing: run shadow mode on 1,000 users in parallel with manual path selection, then compare activation rates between the two groups.
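Comparing activation rates between the two groups needs a significance check so that noise is not read as a win. The sketch below uses a frequentist two-proportion z-test as one simple option; it is not the Bayesian analysis named for the stats engine, and the counts in the usage line are made-up inputs.

```python
import math


def two_proportion_z(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical, degenerate rates
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z(450, 1000, 400, 1000)  # illustrative counts only
    print(f"z = {z:.2f}, p = {p:.3f}")
```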

Dataset Curation

Collect: 50K onboarding sessions - Historical data + manual labels

Clean: 42K usable - Remove incomplete sessions, duplicates

Label: 42K labeled - ($8.4K)

Augment: +8K synthetic - Edge case generation (stuck users, fast completers)

→ 50K high-quality examples (Cohen's Kappa: 0.88)
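Cohen's Kappa measures agreement between two labelers after correcting for agreement expected by chance. A minimal implementation for two equal-length label lists is sketched below; the function name and the toy labels in the example are illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal frequency per class.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical class
    return (observed - expected) / (1 - expected)
```

For example, with labels `["x", "x", "y", "y"]` and `["x", "x", "y", "x"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.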

Agentic RAG

Agent iteratively retrieves based on reasoning

User stuck on step 3 → RAG retrieves similar cases → Agent reasons 'need clarification on feature X' → RAG retrieves feature docs → Nudge generated with full context

💡 Not one-shot retrieval. Agent decides what else it needs to know.
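The iterative loop described above can be reduced to a few lines once retrieval and reasoning are abstracted as callables. In this sketch `retrieve` and `next_query` are hypothetical stand-ins for the vector search and the agent's reasoning step, and the round limit is an assumed safeguard.

```python
from typing import Callable, Optional


def agentic_retrieve(
    initial_query: str,
    retrieve: Callable[[str], list[str]],
    next_query: Callable[[list[str]], Optional[str]],
    max_rounds: int = 3,
) -> list[str]:
    """Retrieve, let the agent decide what else it needs, and repeat."""
    context: list[str] = []
    query: Optional[str] = initial_query
    for _ in range(max_rounds):
        if query is None:
            break  # the agent has enough context
        context.extend(retrieve(query))
        query = next_query(context)
    return context
```

The `max_rounds` cap bounds both latency and LLM cost when the agent keeps asking for more context.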

Multi-Armed Bandit for Nudge Optimization
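Only the heading survives for this pattern, so the following is an assumption about how it might be realized rather than a description of the actual design. Thompson sampling with Beta posteriors is one common bandit strategy for choosing between nudge variants; the arm names and the conversion signal are hypothetical.

```python
import random
from typing import Optional, Sequence


class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of nudge variants."""

    def __init__(self, arms: Sequence[str], seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [alpha, beta].
        self._stats = {arm: [1, 1] for arm in arms}

    def choose(self) -> str:
        """Sample a conversion rate per arm and pick the highest sample."""
        samples = {
            arm: self._rng.betavariate(alpha, beta)
            for arm, (alpha, beta) in self._stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, arm: str, converted: bool) -> None:
        """Record one outcome: a conversion raises alpha, a miss raises beta."""
        self._stats[arm][0 if converted else 1] += 1
```

Unlike a fixed-split A/B test, this shifts traffic toward better-performing variants while the experiment is still running, at the cost of a less clean statistical comparison.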

Tech Stack Summary

LLMs

GPT-4, Claude 3.5, or Gemini

scikit-learn, XGBoost, PyMC

Orchestration

LangGraph, Celery, or Temporal

Database

PostgreSQL, Redis, ClickHouse

Messaging

Kafka, SQS, or RabbitMQ

Compute

Lambda, ECS/Fargate, or Kubernetes

Monitoring

Prometheus, Grafana, Datadog, Sentry

Feature Store

Feast, Tecton, or custom

Experiment Platform

Custom or Optimizely

