
Feature Feedback Analysis System Architecture 🏗️

Production-grade design from 100 to 100K users/day with multi-agent orchestration

May 22, 2025
18 min read
🚀 SaaS/Product · 🏗️ Architecture · 📊 Scalable · 🤖 Multi-Agent
🎯This Week's Journey

From prompts to production feedback intelligence.

Monday: 3 core prompts for feedback analysis. Tuesday: automated agent code. Wednesday: team workflows. Thursday: complete technical architecture. Today we show the full system: agent orchestration, ML pipelines, multi-tenant scaling, security layers, and evolution from startup to enterprise deployment.

📋 Key Assumptions

1. Process 100-100K feedback items per day across 10-1000 tenants
2. Support 15+ feedback sources (in-app, email, Slack, support tickets, NPS surveys)
3. Real-time processing (<5 s) for critical feedback, batch for analytics
4. GDPR/SOC2 compliance with PII redaction and data residency
5. Multi-tenant isolation with tenant-specific models and data

System Requirements

Functional

  • Ingest feedback from 15+ sources with unified schema
  • Extract features, sentiment, themes, and user intent via ML
  • Deduplicate similar feedback across sources and time
  • Prioritize features based on impact, frequency, and strategic fit
  • Generate weekly insights reports with trend analysis
  • Support custom taxonomies per tenant (enterprise)
  • Real-time alerts for critical feedback (P0/P1 bugs, churn signals)

Non-Functional (SLOs)

  • Latency (p95): 3,000 ms
  • Freshness: 5 min
  • Availability: 99.9%
  • Accuracy: 95%
  • Deduplication recall: 92%

💰 Cost Targets: $0.02 per feedback item · $150 per tenant per month · ML cost ≤ 8% of revenue

Agent Layer

Planner (L4)

Decomposes feedback processing into subtasks and selects appropriate tools

🔧 task_decomposer, tool_selector, dependency_resolver

⚡ Recovery: Retry with simplified plan (skip optional steps), Route to manual review queue if decomposition fails, Log planning failures for model retraining

Executor (L3)

Executes the planned workflow: feature extraction, deduplication, prioritization

🔧 llm_extractor, vector_search, priority_model, feature_mapper

⚡ Recovery: Retry failed steps up to 3x with exponential backoff, Use cached embeddings if vector search fails, Fallback to rule-based prioritization if ML model unavailable, Queue for manual review if extraction confidence < 0.7
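
A minimal sketch of the Executor's retry-and-fallback policy described above: the 0.7 confidence threshold and the 3x exponential backoff mirror the recovery notes, while the callables and everything else are assumptions rather than the production implementation.

```python
import random
import time
from typing import Callable, Optional

MAX_RETRIES = 3
CONFIDENCE_THRESHOLD = 0.7  # below this, feedback is routed to manual review


def run_with_retries(
    extract: Callable[[str], dict],   # LLM-backed extraction (assumed callable)
    fallback: Callable[[str], dict],  # rule-based extraction (assumed callable)
    feedback_text: str,
) -> Optional[dict]:
    """Run extraction with exponential backoff; fall back or escalate on failure."""
    for attempt in range(MAX_RETRIES):
        try:
            result = extract(feedback_text)
            if result.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD:
                return result
            return None  # low confidence: caller queues the item for manual review
        except Exception:
            # Exponential backoff with jitter: roughly 1s, 2s, 4s between attempts
            time.sleep(2 ** attempt + random.random())
    # Retries exhausted: rule-based fallback keeps the pipeline moving
    return fallback(feedback_text)
```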

Evaluator (L3)

Validates output quality, checks for hallucinations, ensures schema compliance

🔧 schema_validator, hallucination_detector, quality_scorer, consistency_checker

⚡ Recovery: Flag low-quality outputs for human review, Request re-execution with stricter parameters, Log evaluation failures for threshold tuning

Guardrail (L4)

Enforces safety policies: PII redaction, profanity filtering, content moderation

🔧 pii_detector, profanity_filter, content_moderator, compliance_checker

⚡ Recovery: Block processing if PII detection fails (fail-safe), Use conservative redaction if uncertain, Alert security team for high-risk content, Quarantine feedback if moderation service unavailable
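
A sketch of the fail-safe behaviour above: if redaction itself errors, processing is blocked rather than passed through. The regex patterns are illustrative only; production would use a managed PII service (e.g. AWS Comprehend) plus tenant-specific rules.

```python
import re

# Illustrative patterns only, not an exhaustive PII taxonomy
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def redact_or_block(text: str) -> str:
    """Redact known PII; raise (i.e. block processing) if detection itself fails."""
    try:
        redacted = text
        for label, pattern in PII_PATTERNS.items():
            redacted = pattern.sub(f"[REDACTED_{label.upper()}]", redacted)
        return redacted
    except Exception as exc:
        # Fail-safe: never forward unredacted text to the LLM
        raise RuntimeError("PII detection failed; feedback quarantined") from exc
```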

Deduplication (L3)

Identifies and clusters duplicate or similar feedback across sources and time

🔧 vector_search, clustering_algorithm, semantic_similarity_model, merge_logic

⚡ Recovery: Use exact text matching if embedding search fails, Create new cluster if no matches above threshold, Log clustering failures for model retraining
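
A sketch of the clustering logic implied by the recovery notes: attach new feedback to the nearest existing cluster by cosine similarity, or open a new cluster when nothing clears the threshold. The 0.85 cutoff and the in-memory centroid map are assumptions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def assign_cluster(
    embedding: np.ndarray,
    centroids: dict[str, np.ndarray],
    threshold: float = 0.85,  # illustrative similarity cutoff
) -> str:
    """Attach feedback to the closest existing cluster, or start a new one."""
    best_id, best_score = None, threshold
    for cluster_id, centroid in centroids.items():
        score = cosine_similarity(embedding, centroid)
        if score >= best_score:
            best_id, best_score = cluster_id, score
    if best_id is None:
        # No match above threshold: create a new cluster for this feedback
        best_id = f"cluster-{len(centroids) + 1}"
        centroids[best_id] = embedding
    return best_id
```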

Prioritization (L3)

Scores and ranks features based on impact, frequency, strategic fit, and urgency

🔧 priority_model, impact_estimator, roadmap_aligner, explanation_generator

⚡ Recovery: Fallback to rule-based scoring if ML model unavailable, Use historical priority data for cold-start features, Flag for PM review if conflicting signals
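
A sketch of the rule-based scoring fallback named above. The signals mirror the stated criteria (impact, frequency, strategic fit, urgency); the weights and caps are illustrative assumptions, not the trained priority model.

```python
from dataclasses import dataclass


@dataclass
class FeatureSignals:
    frequency_30d: int          # how often the theme appeared in the last 30 days
    enterprise_requests: int    # requests from enterprise-segment users
    churn_risk_weighted: float  # sum of requesting users' churn-risk scores
    roadmap_aligned: bool       # does it map to a current roadmap theme?


def rule_based_priority(s: FeatureSignals) -> float:
    """Deterministic 0-100 score used when the ML priority model is unavailable."""
    score = 0.0
    score += min(s.frequency_30d, 50)             # frequency, capped
    score += min(s.enterprise_requests * 5, 25)   # impact via enterprise weight
    score += min(s.churn_risk_weighted * 10, 15)  # urgency via churn signal
    score += 10 if s.roadmap_aligned else 0       # strategic fit
    return min(score, 100.0)
```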

ML Layer

Feature Store

Update: Real-time for new feedback, batch daily for aggregates

  • feedback_sentiment (pos/neg/neutral)
  • theme_category (bug/feature/ux/performance)
  • user_intent (request/complaint/praise/question)
  • urgency_score (0-100)
  • user_segment (enterprise/smb/trial)
  • source_type (in-app/email/slack/ticket)
  • text_embedding (1536d)
  • historical_frequency (30d/90d)
  • similar_feedback_count
  • user_churn_risk
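
A minimal sketch of what one feature-store record could look like, based on the fields listed above; the TypedDict shape and the exact field names (e.g. splitting historical_frequency into 30d/90d columns) are illustrative assumptions rather than the production schema.

```python
from typing import Literal, TypedDict


class FeedbackFeatures(TypedDict):
    feedback_sentiment: Literal["pos", "neg", "neutral"]
    theme_category: Literal["bug", "feature", "ux", "performance"]
    user_intent: Literal["request", "complaint", "praise", "question"]
    urgency_score: int                  # 0-100
    user_segment: Literal["enterprise", "smb", "trial"]
    source_type: str                    # in-app / email / slack / ticket
    text_embedding: list[float]         # 1536-dimensional vector
    historical_frequency_30d: int
    historical_frequency_90d: int
    similar_feedback_count: int
    user_churn_risk: float
```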

Model Registry

Strategy: Semantic versioning with A/B testing for major updates

  • sentiment_classifier
  • theme_extractor
  • priority_model
  • deduplication_model

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (12 metrics) → Processing (aggregate & transform) → Dashboards (4 views) → Alerts (enabled)

📊 Metrics (12) · 📝 Logs (structured) · 🔗 Traces (distributed)

Key metrics include:
  • feedback_ingestion_rate
  • agent_execution_time_ms
  • llm_latency_p95_ms
  • llm_token_usage
  • llm_cost_per_request
  • deduplication_recall
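
On the metrics side, a hedged sketch of how a few of the listed metrics could be registered with prometheus_client; the metric names mirror the list above, but the label sets and bucket boundaries are assumptions.

```python
from prometheus_client import Counter, Histogram, start_http_server

FEEDBACK_INGESTED = Counter(
    "feedback_ingestion_total", "Feedback items ingested", ["tenant", "source"]
)
AGENT_EXECUTION_TIME = Histogram(
    "agent_execution_time_ms", "Per-agent execution time (ms)", ["agent"],
    buckets=(100, 250, 500, 1000, 2500, 5000),
)
LLM_LATENCY = Histogram(
    "llm_latency_ms", "LLM call latency (ms)", ["model"],
    buckets=(250, 500, 1000, 3000, 5000, 10000),
)
LLM_COST = Counter("llm_cost_usd_total", "Cumulative LLM spend (USD)", ["tenant"])

if __name__ == "__main__":
    start_http_server(9000)  # expose /metrics for Prometheus to scrape
    FEEDBACK_INGESTED.labels(tenant="acme", source="slack").inc()
    AGENT_EXECUTION_TIME.labels(agent="executor").observe(1200)
```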

Deployment Variants

🚀 Startup Architecture

Fast to deploy, cost-efficient, scales to roughly 100 customers

Infrastructure

Serverless (Lambda/Cloud Functions)
Managed PostgreSQL (RDS/Cloud SQL)
Managed Redis (ElastiCache/Memorystore)
Vector DB SaaS (Pinecone/Weaviate Cloud)
LLM APIs (OpenAI/Anthropic)
CloudWatch/Cloud Logging
Single-tenant per customer
Synchronous processing for <1K/day
No custom models (use GPT-4/Claude)
Basic auth (OIDC)
Cost: $200-800/mo for 1-10K feedback/day

Risks & Mitigations

⚠️ LLM cost explosion (unpredictable token usage) · Risk: High

✓ Mitigation: Token budgets per request (max 4K tokens). Cache embeddings. Use cheaper models for non-critical tasks (GPT-4o-mini for theme extraction). Monitor cost per feedback item, alert if > $0.03. Quarterly cost optimization reviews.

⚠️ Model drift (accuracy degrades as product evolves) · Risk: Medium

✓ Mitigation: Weekly model performance monitoring. Automated retraining triggered when accuracy drops > 5%. Monthly human review of 100 errors. RAG knowledge base updated daily with new product docs.

⚠️ PII leakage (customer data sent to LLM) · Risk: Low

✓ Mitigation: Guardrail Agent blocks processing if PII detection fails (fail-safe). All feedback redacted before LLM call. Audit logs for all PII access. Quarterly security audits. Encrypted storage (KMS).

⚠️ Vendor lock-in (OpenAI/Anthropic dependency) · Risk: Medium

✓ Mitigation: Multi-LLM strategy (GPT + Claude + fine-tuned models). Abstract LLM calls behind interface. Test failover quarterly. Maintain fine-tuned models as backup (can run on-prem).

⚠️ Scaling bottlenecks (vector DB, database) · Risk: Medium

✓ Mitigation: Load testing at 2x expected volume. Auto-scaling for compute. Read replicas for database. Vector DB sharding. Caching layer (Redis). Quarterly capacity planning reviews.

⚠️ Data quality issues (garbage in, garbage out) · Risk: High

✓ Mitigation: Input validation (schema checks, length limits). Confidence thresholds (< 0.7 = flag). Human review queue for low-confidence outputs. Feedback on feedback (users rate AI quality). Monthly data quality audits.

⚠️ Hallucinations (LLM invents features, quotes) · Risk: Medium

✓ Mitigation: 5-layer hallucination detection (confidence, cross-reference, fact-checking, statistical validation, human review). Evaluator Agent validates all outputs. Monthly hallucination rate audits (target < 1%).

🧬 Evolution Roadmap

Progressive transformation from MVP to scale

🌱 Phase 1: MVP (Months 0-3)

1. Launch with 5 pilot customers (100-1K feedback/day)
2. Prove core value prop (automated theme extraction, deduplication)
3. Establish baseline metrics (accuracy, latency, cost)
🌿 Phase 2: Scale (Months 3-6)

1. Onboard 20 customers (1K-10K feedback/day)
2. Add Planner, Deduplication, Prioritization agents
3. Achieve 90%+ theme extraction accuracy
4. Reduce cost per feedback item to $0.02
🌳 Phase 3: Enterprise (Months 6-12)

1. Onboard 5 enterprise customers (10K-100K feedback/day)
2. Multi-region deployment (US + EU for data residency)
3. Achieve 99.9% uptime SLA
4. Support custom taxonomies per tenant

🚀 Production Ready
🏗️ Complete Systems Architecture

9-layer architecture from presentation to security

1. Presentation (4 components): Product Dashboard, Mobile App, Slack Bot, Email Parser
2. API Gateway (4 components): Load Balancer, Rate Limiter, Auth (OIDC/SAML), Tenant Router
3. Agent Layer (6 components): Planner Agent, Executor Agent, Evaluator Agent, Guardrail Agent, Deduplication Agent, Prioritization Agent
4. ML Layer (6 components): Feature Store, Model Registry, Offline Training, Online Inference, Evaluation Loop, Prompt Store
5. Integration (5 components): Intercom Adapter, Zendesk Adapter, Slack Adapter, Jira Adapter, Webhook Handler
6. Data (4 components): PostgreSQL (metadata), Vector DB (embeddings), S3 (raw feedback), Redis (cache)
7. External (5 components): OpenAI API, Anthropic API, Cohere Rerank, SendGrid, Twilio
8. Observability (5 components): Metrics (Prometheus), Logs (CloudWatch), Traces (Jaeger), Dashboards (Grafana), Eval Metrics
9. Security (5 components): KMS (secrets), WAF, PII Redactor, Audit Logger, RBAC Engine
🔄 Sequence Diagram - Feedback Processing Flow

Automated data flow every hour

Participants: User, API Gateway, Planner Agent, Executor Agent, Guardrail Agent, Evaluator Agent, Feature Store, Vector DB

1. User → API Gateway: POST /feedback (text, source, metadata)
2. API Gateway → Guardrail Agent: validate (PII check, profanity filter)
3. Guardrail Agent → Planner Agent: sanitized feedback + plan request
4. Planner Agent → Executor Agent: plan: [extract_features, dedupe, prioritize]
5. Executor Agent: extract_features(text) → sentiment, theme, intent
6. Executor Agent → Vector DB: search_similar(embedding, top_k=10)
7. Vector DB → Executor Agent: similar feedback IDs + scores
8. Executor Agent → Evaluator Agent: validate output quality
9. Evaluator Agent: quality_score=0.94, approved
10. Executor Agent → API Gateway: result: {feature_id, priority, duplicates[]}
11. API Gateway → User: 200 OK + processed feedback
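
To make the entry point concrete, a hedged sketch of the initial POST in step 1 above; the host, auth headers, and any field names beyond text/source/metadata are assumptions, not the documented API contract.

```python
import requests

payload = {
    "text": "The export feature is slow when I select more than 10k rows.",
    "source": "in-app",
    "metadata": {"user_id": "u_123", "segment": "enterprise", "page": "/reports"},
}

resp = requests.post(
    "https://api.example.com/feedback",       # illustrative host
    json=payload,
    headers={"Authorization": "Bearer <token>", "X-Tenant-Id": "acme"},
    timeout=10,
)
resp.raise_for_status()
# Expected shape per the sequence above: {"feature_id": ..., "priority": ..., "duplicates": [...]}
print(resp.json())
```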

Data Flow

Feedback submission → insights in 2.6 seconds

1. User (0 ms): Submits feedback → text + metadata
2. API Gateway (50 ms): Authenticates, routes to tenant → validated request
3. Guardrail Agent (200 ms): Redacts PII, filters profanity → sanitized text
4. Planner Agent (500 ms): Creates execution plan → task sequence
5. Executor Agent (1,700 ms): Extracts sentiment, theme, intent → structured features
6. Feature Store (1,900 ms): Computes embedding, stores features → vector + metadata
7. Deduplication Agent (2,100 ms): Searches for similar feedback → duplicate cluster ID
8. Prioritization Agent (2,250 ms): Scores feature priority → priority score + explanation
9. Evaluator Agent (2,450 ms): Validates output quality → quality score
10. Database (2,550 ms): Persists feedback + features → saved record
11. API Gateway (2,600 ms): Returns result to user → processing confirmation
Tier 1 · Volume: 100-1K feedback/day · Pattern: Serverless Monolith
🏗️ Architecture: API Gateway (AWS/GCP), Lambda/Cloud Functions, PostgreSQL (managed), Redis cache, S3 for raw feedback
Cost & Performance: $200/mo · 2-4 s latency

Tier 2 · Volume: 1K-10K feedback/day · Pattern: Queue + Workers
🏗️ Architecture: Load balancer, API servers (ECS/Cloud Run), SQS/Pub-Sub queue, worker pool (auto-scaling), PostgreSQL + read replicas, Redis cluster, Vector DB (Pinecone/Weaviate)
Cost & Performance: $800/mo · 1-3 s latency

Tier 3 (Recommended) · Volume: 10K-100K feedback/day · Pattern: Multi-Agent Orchestration
🏗️ Architecture: Global load balancer, agent framework (LangGraph/CrewAI), event bus (Kafka/EventBridge), serverless agents (Lambda), feature store (Feast), model registry (MLflow), multi-region PostgreSQL, vector DB cluster
Cost & Performance: $3K/mo · <1 s (p95)

Tier 4 · Volume: 100K+ feedback/day · Pattern: Enterprise Multi-Region
🏗️ Architecture: Multi-region Kubernetes, service mesh (Istio), Kafka (multi-DC), distributed feature store, model serving (KServe/Seldon), sharded PostgreSQL, multi-region vector DB, CDN for static assets
Cost & Performance: $10K+/mo · <500 ms (p95)

Key Integrations

Intercom / Zendesk

Protocol: REST API + Webhooks
Webhook receives new conversation
Extract feedback from messages
Enrich with user metadata
Process via agent pipeline

Slack

Protocol: Slack Events API
Listen to #feedback channel
Parse message + thread context
Attribute to user via Slack ID
Process via agent pipeline
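
A minimal sketch of that Slack listener using Flask. The URL-verification handshake and event fields follow the Slack Events API, while the channel ID and the hand-off to the agent pipeline are placeholders.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
FEEDBACK_CHANNEL_ID = "C0FEEDBACK"  # hypothetical channel ID for #feedback


@app.post("/slack/events")
def slack_events():
    body = request.get_json()
    # Slack's one-time URL verification handshake
    if body.get("type") == "url_verification":
        return jsonify({"challenge": body["challenge"]})
    event = body.get("event", {})
    if event.get("type") == "message" and event.get("channel") == FEEDBACK_CHANNEL_ID:
        feedback = {
            "text": event.get("text", ""),
            "source": "slack",
            "metadata": {
                "slack_user": event.get("user"),     # for user attribution
                "thread_ts": event.get("thread_ts"), # thread context
            },
        }
        # Hand off to the agent pipeline (queue or internal API call in production)
        print("enqueue:", feedback)
    return "", 200
```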

Jira

Protocol: REST API v3
Webhook on issue creation/update
Map Jira fields to feedback schema
Link to feature request
Sync priority scores back to Jira

In-App Widget

Protocol: Custom REST API
User submits feedback in product
Capture context (page, feature, user segment)
Submit to API
Real-time processing

Email Parser

Protocol: IMAP + NLP
Poll feedback@company.com
Extract body text
Parse sender metadata
Process via agent pipeline

Security & Compliance

Failure Modes & Fallbacks

Failure | Fallback | Impact | SLA
LLM API down (OpenAI/Anthropic outage) | Switch to backup LLM (Claude ↔ GPT), queue if all down | Degraded (slower processing), not broken | 99.5%
Feature extraction low confidence (<0.7) | Flag for human review, use rule-based fallback for critical fields | Quality maintained, slight delay | 99.9%
Vector DB unavailable (Pinecone/Weaviate outage) | Skip deduplication, use exact text matching, create new clusters | More duplicates created, eventual consistency | 99.0%
PII detection service fails | Block processing (fail-safe), queue for manual redaction | Processing halted for safety | 100% (no PII leaks)
Database write timeout | Retry 3x with exponential backoff, write to S3 as backup | Eventual consistency, slight delay | 99.9%
Tenant quota exceeded (rate limit) | Queue excess requests, notify tenant admin | Delayed processing for tenant | Per-tenant SLA
Model serving latency spike (>5s p95) | Route to faster model (GPT-4 → GPT-4o-mini), scale workers | Slightly lower accuracy, maintained latency | 99.5%
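
The first row's fallback (switch to a backup LLM, queue if every provider is down) can be expressed as a small failover wrapper; a sketch in which the provider callables and the dead-letter hook are assumptions.

```python
from typing import Callable


class AllProvidersDown(Exception):
    pass


def call_with_failover(
    prompt: str,
    providers: list[tuple[str, Callable[[str], str]]],  # e.g. [("gpt", ...), ("claude", ...)]
    dead_letter: Callable[[str], None],                  # queue for later replay
) -> str:
    """Try providers in order; queue the prompt if all of them fail."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception:
            continue  # provider outage or timeout: try the next one
    dead_letter(prompt)  # park the work; a retry worker drains this queue later
    raise AllProvidersDown("all LLM providers unavailable; request queued")
```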

RAG vs Fine-Tuning

Product taxonomy changes frequently (RAG allows daily updates). Sentiment is stable (fine-tuning gives 2x speedup and 40% cost reduction).
✅ RAG (chosen): $0.015/request · updated daily (add new themes to vector DB)
❌ Fine-Tuning: $0.009/request · monthly retraining

Implementation: Fine-tuned DistilBERT for sentiment (hosted on SageMaker). RAG with GPT-4o-mini for theme extraction (retrieve top 10 similar feedback + product docs from vector DB, inject into prompt).
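
A sketch of the retrieval-augmented prompt assembly for theme extraction described above; the prompt wording, the truncation limits, and the retrieval inputs are assumptions.

```python
def build_theme_prompt(
    feedback_text: str,
    similar_feedback: list[str],   # top-10 neighbours from the vector DB
    product_docs: list[str],       # relevant taxonomy / product doc snippets
) -> str:
    """Assemble the RAG prompt sent to the theme-extraction model."""
    context = "\n".join(f"- {s}" for s in similar_feedback[:10])
    docs = "\n".join(f"- {d}" for d in product_docs[:5])
    return (
        "You label product feedback with a theme from the taxonomy below.\n"
        f"Taxonomy and product context:\n{docs}\n\n"
        f"Similar past feedback (already labelled):\n{context}\n\n"
        f"New feedback: {feedback_text}\n"
        'Answer with JSON: {"theme": ..., "confidence": 0-1}.'
    )
```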

Hallucination Detection

LLMs hallucinate feature names, invent user quotes, fabricate statistics
Layer 1: Confidence thresholding (< 0.7 = flag for review)
Layer 2: Cross-reference against product catalog (valid feature names)
Layer 3: Fact-checking user quotes (match against original feedback text)
Layer 4: Statistical validation (check if numbers are plausible)
Layer 5: Human review queue for flagged outputs

Result: 0.8% hallucination rate, 98% caught before production
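
Layer 3 (fact-checking user quotes against the original feedback) can start as a normalized substring or fuzzy match; a sketch in which the 0.9 overlap threshold is an illustrative assumption.

```python
from difflib import SequenceMatcher


def quote_is_grounded(quote: str, original_feedback: str, min_overlap: float = 0.9) -> bool:
    """Layer-3 check: does the claimed 'user quote' actually appear in the source text?"""
    q = " ".join(quote.lower().split())
    src = " ".join(original_feedback.lower().split())
    if q in src:
        return True
    # Fuzzy fallback: the longest common block must cover most of the quote,
    # which tolerates small paraphrases but flags invented quotes.
    match = SequenceMatcher(None, q, src).find_longest_match(0, len(q), 0, len(src))
    return len(q) > 0 and match.size / len(q) >= min_overlap
```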

Evaluation Framework

Sentiment Accuracy: 94.2% (target: 94%+)
Theme Extraction Precision: 91.1% (target: 90%+)
Theme Extraction Recall: 89.3% (target: 88%+)
Deduplication Recall: 92.5% (target: 92%+)
Priority Ranking NDCG@10: 87.2% (target: 85%+)
Hallucination Rate: 0.8% (target: <1%)
Testing: shadow mode, with 1,000 real feedback items processed in parallel with manual labeling; weekly evaluation on a held-out test set (200 samples).

Dataset Curation

1. Collect: 50K feedback items (historical data from 20 pilot customers)
2. Clean: 42K usable (remove spam, duplicates, non-English)
3. Label: 5K labeled with sentiment + theme + priority (~$15K: 3 PMs × 2 weeks)
4. Augment: +3K synthetic; GPT-4 generates edge cases (sarcasm, mixed sentiment, ambiguous themes)
5. Split: Train 6.4K / Val 0.8K / Test 0.8K, stratified by theme + sentiment
8K high-quality labeled examples. Models trained on this dataset achieve 94% sentiment accuracy, 91% theme precision.
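
A sketch of the stratified 80/10/10 split in step 5, assuming the 8K labelled examples live in a DataFrame with theme and sentiment columns; the file name and random seed are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Columns assumed: text, sentiment, theme, priority (the 8K labelled examples)
df = pd.read_parquet("labelled_feedback.parquet")  # illustrative path

# Stratify on the combined theme + sentiment label so rare combinations
# are represented in every split (80 / 10 / 10).
strata = df["theme"] + "_" + df["sentiment"]
train, rest = train_test_split(df, test_size=0.2, stratify=strata, random_state=42)
val, test = train_test_split(
    rest, test_size=0.5, stratify=strata.loc[rest.index], random_state=42
)
print(len(train), len(val), len(test))  # roughly 6400, 800, 800
```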

Agentic RAG

Agent iteratively retrieves based on reasoning, not one-shot retrieval
User feedback: 'The export feature is slow.' → Executor Agent reasons: 'Need context on export feature' → Retrieves product docs on export → Reasons: 'Need similar past feedback' → Retrieves 10 similar feedback items → Generates theme: 'Performance issue with CSV export' with high confidence.
💡 Agent decides what context it needs (product docs, past feedback, user history) instead of retrieving everything upfront. Reduces token usage by 40%, improves accuracy by 6%.
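
A minimal sketch of that iterative retrieve-then-reason loop; the decide_next and retrieve callables stand in for the LLM and vector-DB/doc lookups, and the step budget is an assumption.

```python
from typing import Callable

MAX_STEPS = 4  # bound the loop so retrieval cost stays predictable


def agentic_rag(
    feedback_text: str,
    decide_next: Callable[[str, list[str]], dict],  # LLM: {"action": "retrieve", "query": ...} or {"action": "answer", ...}
    retrieve: Callable[[str], list[str]],           # vector-DB / product-doc lookup
) -> dict:
    """Iteratively fetch only the context the agent asks for, then answer."""
    context: list[str] = []
    for _ in range(MAX_STEPS):
        step = decide_next(feedback_text, context)
        if step["action"] == "answer":
            return step  # e.g. {"action": "answer", "theme": ..., "confidence": ...}
        # Agent asked for more context (product docs, similar feedback, user history)
        context.extend(retrieve(step["query"]))
    # Step budget exhausted: answer with whatever context was gathered
    return decide_next(feedback_text, context + ["FINAL: answer now"])
```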

Model Drift Detection

Weekly model performance monitoring on production traffic, with automated retraining triggered when accuracy drops by more than 5% (see the model drift mitigation above).

Tech Stack Summary

LLMs: GPT-4o-mini (theme extraction), Claude Sonnet (prioritization), fine-tuned DistilBERT (sentiment)
Agent Framework: LangGraph (orchestration), CrewAI (agent collaboration)
Vector DB: Pinecone (startup), Milvus (enterprise)
Feature Store: Feast (open-source), Tecton (enterprise)
Model Registry: MLflow (open-source), SageMaker Model Registry (AWS)
Database: PostgreSQL (RDS/Cloud SQL), Redis (ElastiCache/Memorystore)
Queue: SQS (startup), Kafka (enterprise)
Compute: Lambda (startup), ECS/Cloud Run (scale), Kubernetes (enterprise)
Observability: CloudWatch/Cloud Logging (startup), Prometheus + Grafana (enterprise), Datadog (unified)
Security: AWS KMS/Cloud KMS, Secrets Manager, WAF, AWS Comprehend (PII detection)
🏗️ Need Architecture Review?

We'll audit your feedback system design, identify bottlenecks, and show you how to scale 10x with multi-agent orchestration.

© 2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.