
Roadmap Prioritization System Architecture 🏗️

From 10 to 10,000 features/month with 6 AI agents and ML pipeline

November 6, 2025
17 min read
🎯 Product · 🏗️ Architecture · 🤖 6 Agents · 📊 ML Pipeline · 🔒 SOC2
🎯 This Week's Journey

From prompts to production roadmap system.

Monday: 3 core prompts for scoring features. Tuesday: automated code to run them. Wednesday: team workflows (PM, Eng, Leadership). Thursday: the complete technical architecture, with 6 specialized agents, an ML pipeline, and a production deployment path that scales from 10 to 10,000 features per month.

📋 Key Assumptions

1. Product teams manage 10-10,000 feature requests per month
2. Data sources: Jira, Linear, customer feedback, sales CRM, usage analytics
3. Scoring criteria: business value, effort, strategic alignment, customer impact, technical debt
4. SOC2 compliance required for enterprise customers
5. Multi-tenant architecture with data isolation per organization

System Requirements

Functional

  • Ingest feature requests from multiple sources (Jira, Linear, Slack, email)
  • Score features across 5 dimensions using AI agents
  • Aggregate scores with weighted formulas and ML models
  • Generate prioritized roadmap with confidence intervals
  • Provide explanations for each score and ranking
  • Support custom scoring criteria per organization
  • Track score changes over time with audit trail

Non-Functional (SLOs)

  • Latency (p95): 500 ms
  • Freshness: 15 min
  • Availability: 99.5%
  • Accuracy target: 0.92

💰 Cost Targets: $0.05 per feature scored · $200 per org per month

Agent Layer

planner (autonomy L3)

Decomposes feature scoring into subtasks and selects appropriate tools.

🔧 Tools: feature_parser, criteria_loader, context_retriever

⚡ Recovery: if parsing fails, use default criteria; if context is unavailable, proceed without historical data; retry up to 3x with exponential backoff.

executor (autonomy L4)

Orchestrates the scoring workflow across all dimensions.

🔧 Tools: scorer_agent, parallel_executor, result_aggregator

⚡ Recovery: if one dimension fails, continue with the others and flag the result incomplete; if all fail, route to the manual queue; timeout per dimension: 5 seconds.
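
To make that recovery policy concrete, here is a minimal Python sketch of the executor's fan-out, assuming a hypothetical async `score_dimension` call standing in for a Scorer Agent; the dimension names and 5-second timeout come from the policy above, everything else is illustrative.

```python
# Minimal sketch of the executor's parallel fan-out with per-dimension
# timeouts; score_dimension() is a stand-in for a real Scorer Agent call.
import asyncio

DIMENSIONS = ["value", "effort", "alignment", "impact", "debt"]
DIMENSION_TIMEOUT_S = 5  # per-dimension timeout from the recovery policy


async def score_dimension(feature: dict, dimension: str) -> float:
    """Placeholder for an LLM-backed Scorer Agent."""
    await asyncio.sleep(0.1)
    return 7.0


async def execute_plan(feature: dict) -> dict:
    # Launch all five dimensions concurrently, each with its own timeout.
    tasks = {
        dim: asyncio.wait_for(score_dimension(feature, dim), DIMENSION_TIMEOUT_S)
        for dim in DIMENSIONS
    }
    results = await asyncio.gather(*tasks.values(), return_exceptions=True)

    scores, failed = {}, []
    for dim, result in zip(tasks, results):
        if isinstance(result, Exception):
            failed.append(dim)  # one failed dimension does not abort the others
        else:
            scores[dim] = result

    if not scores:
        return {"status": "manual_review"}  # all dimensions failed
    return {
        "status": "incomplete" if failed else "complete",
        "scores": scores,
        "failed_dimensions": failed,
    }


if __name__ == "__main__":
    print(asyncio.run(execute_plan({"title": "Dark mode"})))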

scorer (autonomy L2)

Scores individual dimensions using LLM reasoning.

🔧 Tools: claude_api, gpt4_api, prompt_formatter, score_normalizer

⚡ Recovery: if the LLM times out, retry with a different model; if the score is invalid, use the median of historical scores; if confidence < 0.5, flag for human review.

evaluator (autonomy L3)

Validates output quality and detects anomalies.

🔧 Tools: statistical_validator, anomaly_detector, consistency_checker

⚡ Recovery: if validation fails, route to the human review queue; if an anomaly is detected, flag and continue with a warning; if quality < threshold, trigger re-scoring.

guardrail (autonomy L4)

Enforces policy checks, PII redaction, and safety filters.

🔧 Tools: pii_detector, policy_engine, content_filter

⚡ Recovery: if PII is detected, redact and log; if a policy is violated, block and alert; if uncertain, flag for manual review.
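
As an illustration of the `pii_detector` behaviour, here is a hedged sketch using simple regexes; the production stack lists Amazon Comprehend for this job, so treat the patterns and placeholder format below as assumptions rather than the actual detector.

```python
# Illustrative PII redaction before any LLM call; real deployments would use a
# dedicated detector (the tech stack lists Amazon Comprehend) rather than regexes.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with placeholders and report what was found."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)  # entity types are logged for the audit trail
            text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text, found


clean_text, entities = redact_pii("Requested by jane@example.com, call +1 415 555 0100")
print(clean_text, entities)
```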

aggregator (autonomy L3)

Combines dimension scores into a final ranking with an ML model.

🔧 Tools: weighted_average, ml_ranking_model, confidence_estimator

⚡ Recovery: if the ML model fails, fall back to a weighted average; if weights are missing, use equal weights; if rank calculation fails, use score-only ordering.
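
The fallback chain is simple enough to show directly; this sketch assumes a hypothetical `model.predict()` interface for `ml_ranking_model`, with the documented weighted-average and equal-weights fallbacks.

```python
# Sketch of the aggregator's fallback chain: ML model -> weighted average ->
# equal weights. The model interface is a placeholder.
def weighted_average(scores: dict, weights: dict | None = None) -> float:
    if not weights:
        weights = {dim: 1.0 for dim in scores}  # missing weights: equal weights
    total = sum(weights.get(dim, 0.0) for dim in scores)
    return sum(s * weights.get(dim, 0.0) for dim, s in scores.items()) / total


def aggregate(scores: dict, weights: dict | None = None, model=None) -> float:
    try:
        if model is not None:
            return float(model.predict(scores))  # primary path: ML ranking model
    except Exception:
        pass  # model failure: fall through to the weighted average
    return weighted_average(scores, weights)


print(aggregate(
    {"value": 8, "effort": 4, "alignment": 7, "impact": 9, "debt": 3},
    weights={"value": 0.3, "effort": 0.15, "alignment": 0.2, "impact": 0.25, "debt": 0.1},
))
```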

ML Layer

Feature Store

Update: Real-time for new features, batch daily for historical aggregates

  • β€’ feature_text_embeddings (768-dim)
  • β€’ historical_score_avg
  • β€’ historical_score_std
  • β€’ source_reliability_score
  • β€’ user_engagement_metrics
  • β€’ similar_feature_scores
  • β€’ org_scoring_patterns

Model Registry

Strategy: Semantic versioning with A/B testing for new versions

  • β€’ score_aggregator_v2
  • β€’ confidence_estimator_v1
  • β€’ anomaly_detector_v1

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (8 metrics) → Processing (aggregate & transform) → Dashboards (4 views) → Alerts (enabled)

📊 Metrics (8) · 📝 Logs (structured) · 🔗 Traces (distributed)

Key metrics:
  • feature_scoring_latency_p95_ms
  • agent_success_rate_percent
  • llm_api_latency_ms
  • score_accuracy_percent
  • cost_per_feature_usd
  • queue_depth
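
For the metrics above, one plausible emission path is CloudWatch via boto3, assuming AWS credentials are configured; the namespace and dimensions below are illustrative, and percentile statistics such as p95 are then derived by CloudWatch from the raw samples.

```python
# Emitting raw per-feature samples to CloudWatch; p95 latency is derived by
# CloudWatch statistics, not computed client-side. Namespace is illustrative.
import boto3

cloudwatch = boto3.client("cloudwatch")


def emit_scoring_metrics(org_id: str, latency_ms: float, cost_usd: float) -> None:
    cloudwatch.put_metric_data(
        Namespace="RoadmapPrioritization",
        MetricData=[
            {
                "MetricName": "feature_scoring_latency_ms",
                "Dimensions": [{"Name": "OrgId", "Value": org_id}],
                "Value": latency_ms,
                "Unit": "Milliseconds",
            },
            {
                "MetricName": "cost_per_feature_usd",
                "Dimensions": [{"Name": "OrgId", "Value": org_id}],
                "Value": cost_usd,
                "Unit": "None",
            },
        ],
    )
```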

Deployment Variants

🚀

Startup Architecture

Fast to deploy, cost-efficient, scales to roughly 100 features per month

Infrastructure

  • Render or Railway for API server
  • Neon or Supabase for PostgreSQL
  • Upstash Redis for cache
  • Anthropic/OpenAI API (pay-as-you-go)
  • Vercel for frontend

→ Single-tenant architecture
→ Synchronous processing (no queue)
→ Managed services only
→ Ship in 2 weeks
→ Cost: $100-200/mo for 100 features/mo

Risks & Mitigations

⚠️ LLM cost explosion (>$1K/day) · Medium
✓ Mitigation: Set daily budget caps. Cache similar features. Use cheaper models for low-value features. Monitor cost per feature. Alert at $500/day threshold.

⚠️ Score accuracy degrades over time (concept drift) · High
✓ Mitigation: Weekly evaluation against PM judgments. Retrain aggregation model monthly. A/B test new prompts. Human-in-the-loop for low-confidence scores.

⚠️ PII leakage to LLM providers · Low
✓ Mitigation: PII detection before LLM calls. Redact all detected entities. Audit logs for all LLM requests. Use enterprise LLM plans with no training on customer data.

⚠️ Latency exceeds 5 seconds (user frustration) · Medium
✓ Mitigation: Parallel scoring across dimensions. Cache common patterns. Use faster models for low-value features. Async processing with webhook callback.

⚠️ Agent orchestration bugs (infinite loops, deadlocks) · Medium
✓ Mitigation: Timeout per agent (5 sec). Max retry count (3x). Circuit breaker pattern. Extensive testing with chaos engineering.

⚠️ Integration failures (Jira/Linear API down) · Medium
✓ Mitigation: Queue-based architecture. Retry with exponential backoff. Manual upload fallback. Monitor integration health.

⚠️ Bias in scoring (favors certain types of features) · High
✓ Mitigation: Diverse training data. Bias detection in evaluation. Human review of edge cases. Regular fairness audits.

🧬

Evolution Roadmap

Progressive transformation from MVP to scale

🌱 Phase 1: MVP (Months 0-3)
1. Core scoring for 5 dimensions
2. Single-tenant deployment
3. Jira + Linear integration
4. Basic dashboard

🌿 Phase 2: Scale (Months 3-6)
1. Multi-tenant architecture
2. Async processing with queue
3. Custom scoring criteria per org
4. ML aggregation model

🌳 Phase 3: Enterprise (Months 6-12)
1. SOC2 compliance
2. Multi-region deployment
3. SSO/SAML integration
4. Advanced analytics

🚀 Production Ready
πŸ—οΈ

Complete Systems Architecture

9-layer architecture from presentation to security

1. Presentation (4 components): Web Dashboard, Mobile App, Slack Bot, Email Digest
2. API Gateway (4 components): Load Balancer (ALB), Rate Limiter (Redis), Auth Proxy (OIDC), API Versioning
3. Agent Layer (6 components): Planner Agent, Executor Agent, Evaluator Agent, Guardrail Agent, Scorer Agent, Aggregator Agent
4. ML Layer (5 components): Feature Store, Model Registry, Offline Training, Online Inference, Evaluation Loop
5. Integration (5 components): Jira Connector, Linear Connector, Slack Connector, CRM Connector, Analytics Connector
6. Data (4 components): PostgreSQL (primary), Redis (cache), S3 (blob storage), Vector DB (embeddings)
7. External (5 components): Claude API, GPT-4 API, Jira API, Linear API, Mixpanel API
8. Observability (5 components): CloudWatch Metrics, CloudWatch Logs, X-Ray Tracing, Grafana Dashboards, PagerDuty Alerts
9. Security (5 components): IAM Roles, KMS Encryption, WAF Rules, Audit Logger, Secrets Manager
🔄

Sequence Diagram - Feature Scoring Flow

Automated data flow every hour

1. PM → API Gateway: POST /features (title, description, source)
2. API Gateway → Planner Agent: plan(feature)
3. Planner Agent → Executor Agent: execute_plan(steps: [extract, score, aggregate])
4. Executor Agent → Scorer Agents: score(feature, dimensions: [value, effort, alignment, impact, debt])
5. Scorer Agents → Guardrail Agent: validate(scores)
6. Guardrail Agent → Aggregator Agent: aggregate(validated_scores)
7. Aggregator Agent → Evaluator Agent: evaluate(final_score, confidence)
8. Evaluator Agent → Database: save(feature_id, scores, final_score, confidence, explanation)
9. API Gateway → PM: 200 OK (feature_id, scores, rank, explanation)

Data Flow

Feature request → Prioritized roadmap in 4 seconds

1. PM (0s): Submits feature via Jira/Linear/Slack → title, description, source
2. API Gateway (50ms): Authenticates & rate limits → validated request
3. Planner Agent (100ms): Creates execution plan → plan with 5 scoring dimensions
4. Executor Agent (150ms): Spawns 5 parallel Scorer Agents → 5 concurrent scoring tasks
5. Scorer Agents (3200ms): Score each dimension in parallel → 5 scores with explanations
6. Guardrail Agent (3400ms): Validates & sanitizes → validated scores, PII redacted
7. Aggregator Agent (3600ms): Combines with ML model → final score + confidence
8. Evaluator Agent (3800ms): Quality check → validation result
9. Database (3900ms): Saves to PostgreSQL → feature record with scores
10. PM (4000ms): Receives prioritized ranking → score, rank, explanation
The architecture scales through four volume tiers:

Tier 1 · Volume: 0-100 features/month · Pattern: Monolith
Architecture: Single API server, synchronous LLM calls, PostgreSQL, Redis cache
Cost & performance: $100/mo · 4-6 sec latency

Tier 2 · Volume: 100-1K features/month · Pattern: Queue + Workers
Architecture: API server, SQS queue, Lambda workers, RDS PostgreSQL, ElastiCache Redis
Cost & performance: $400/mo · 3-5 sec latency

Tier 3 (Recommended) · Volume: 1K-10K features/month · Pattern: Multi-Agent Orchestration
Architecture: ALB load balancer, ECS Fargate (agent containers), EventBridge message bus, Aurora PostgreSQL, Vector DB (Pinecone)
Cost & performance: $1500/mo · 2-4 sec latency

Tier 4 · Volume: 10K+ features/month · Pattern: Enterprise Multi-Region
Architecture: EKS Kubernetes, Kafka event streaming, multi-region Aurora, multi-LLM providers, dedicated vector DB cluster
Cost & performance: $5K+/mo · 1-3 sec latency

Key Integrations

Jira Integration

Protocol: REST API v3
Webhook receives new issue
Extract title, description, labels, priority
Map Jira fields to internal schema
Trigger scoring pipeline
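
A hedged sketch of that webhook step, assuming FastAPI and a hypothetical `enqueue_for_scoring` helper; the field names follow Jira's REST v3 issue payload but should be verified against your webhook configuration.

```python
# Sketch of the Jira webhook -> internal schema mapping; enqueue_for_scoring()
# is a placeholder for the queue producer (SQS in the scaled deployment).
from fastapi import FastAPI, Request

app = FastAPI()


def enqueue_for_scoring(feature: dict) -> None:
    print("queued:", feature["title"])


@app.post("/webhooks/jira")
async def jira_webhook(request: Request) -> dict:
    payload = await request.json()
    fields = payload.get("issue", {}).get("fields", {})
    feature = {  # map Jira fields to the internal schema
        "title": fields.get("summary", ""),
        "description": fields.get("description") or "",
        "labels": fields.get("labels", []),
        "priority": (fields.get("priority") or {}).get("name"),
        "source": "jira",
    }
    enqueue_for_scoring(feature)  # trigger the scoring pipeline asynchronously
    return {"status": "accepted"}
```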

Linear Integration

Protocol: GraphQL API
Webhook receives issue.created event
GraphQL query for full issue details
Transform to internal format
Enqueue for scoring

Slack Integration

Protocol: Slack Events API + Slash Commands
User submits feature via /roadmap command
Parse command text
Create feature record
Return immediate ack + async scoring

CRM Integration (Salesforce/HubSpot)

Protocol: REST API
Pull customer feedback from CRM
Enrich feature with customer tier, ARR, churn risk
Use in scoring (customer impact dimension)
Sync scores back to CRM

Analytics Integration (Mixpanel/Amplitude)

Protocol: REST API
Query usage data for related features
Calculate engagement metrics
Use in scoring (customer impact dimension)
Track score accuracy vs actual adoption

Security & Compliance

Failure Modes

Each entry: failure → fallback · impact · SLA

• LLM API down (Anthropic/OpenAI) → switch to backup LLM provider; if all fail, queue for retry. Impact: degraded (slower), not broken. SLA: 99.5%
• Scoring confidence < 0.5 → flag for human review; use historical median score as placeholder. Impact: quality maintained, some manual work. SLA: 99.9%
• Database unavailable → read from replica; write to queue for replay. Impact: read-only mode, eventual consistency. SLA: 99.9%
• PII detection service down → block all processing (safety first); alert on-call. Impact: complete halt to protect privacy. SLA: 100%
• Aggregation model fails → use weighted average of dimension scores. Impact: slightly less accurate, fully functional. SLA: 99.0%
• Queue depth > 1000 → scale up workers; rate limit new submissions. Impact: slower processing, no data loss. SLA: 99.5%
• Cost budget exceeded ($500/day) → throttle LLM calls; use cached scores for similar features. Impact: slower, less fresh scores. SLA: N/A (cost control)
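
The first entry is easy to express as code; this sketch uses stub provider clients, so treat the call signatures as placeholders rather than the real Anthropic/OpenAI SDKs.

```python
# Sketch of the "LLM API down" fallback: try providers in order, queue for
# retry if all fail. The provider calls are stubs, not real SDK clients.
def call_claude(prompt: str) -> str:
    raise RuntimeError("simulated Anthropic outage")  # stand-in for the primary client


def call_gpt4(prompt: str) -> str:
    return "score: 7"  # stand-in for the backup client


PROVIDERS = [("anthropic", call_claude), ("openai", call_gpt4)]


def score_with_fallback(prompt: str, retry_queue: list) -> str | None:
    for name, call in PROVIDERS:
        try:
            return call(prompt)  # degraded (slower) but not broken if a backup answers
        except Exception:
            continue
    retry_queue.append(prompt)  # all providers failed: queue for retry
    return None


queue: list = []
print(score_with_fallback("Score this feature: dark mode", queue), queue)
```
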
System Architecture
┌──────────────┐
│   Planner    │  ← Decomposes task
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Executor   │  ← Orchestrates workflow
└──────┬───────┘
       │
      ┌┴─────────────┬──────────────┬──────────────┐
      │              │              │              │
┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
│  Scorer   │  │  Scorer   │  │  Scorer   │  │  Scorer   │   (5 parallel)
│  (Value)  │  │ (Effort)  │  │(Alignment)│  │ (Impact)  │
└─────┬─────┘  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
      │              │              │              │
      └──────────────┴──────┬───────┴──────────────┘
                            │
                      ┌─────▼─────┐
                      │ Guardrail │  ← Validates & sanitizes
                      └─────┬─────┘
                            │
                      ┌─────▼─────┐
                      │ Aggregator│  ← Combines scores
                      └─────┬─────┘
                            │
                      ┌─────▼─────┐
                      │ Evaluator │  ← Quality check
                      └───────────┘

🔄 Agent Collaboration Flow

1. Planner: Receives the feature request, creates an execution plan with 5 scoring dimensions
2. Executor: Spawns 5 parallel Scorer Agents (one per dimension)
3. Scorer Agents: Each scores its dimension using LLM reasoning → returns score + explanation
4. Guardrail: Validates scores, detects PII, checks policy → returns sanitized scores
5. Aggregator: Combines dimension scores with the ML model → returns final score + rank
6. Evaluator: Quality check (confidence, anomalies, consistency) → returns validation result
7. Executor: If validation fails, loops back to the Planner; if it passes, saves to the database (see the control-flow sketch below).
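
Step 7's loop-back behaviour, stripped of the LLM calls, looks roughly like this; every helper here is a stub, so this only illustrates the control flow, not the real agents.

```python
# Control-flow sketch of the collaboration loop; all helpers are stubs.
MAX_ATTEMPTS = 3


def make_plan(feature: dict) -> list:                      # Planner
    return ["value", "effort", "alignment", "impact", "debt"]


def score_dimensions(feature: dict, plan: list) -> dict:   # Executor -> Scorers
    return {dim: 6.0 for dim in plan}


def sanitize(scores: dict) -> dict:                        # Guardrail
    return scores


def aggregate(scores: dict) -> dict:                       # Aggregator
    return {"final_score": sum(scores.values()) / len(scores), "confidence": 0.8}


def evaluate(result: dict) -> bool:                        # Evaluator
    return result["confidence"] >= 0.5


def run_pipeline(feature: dict) -> dict:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        plan = make_plan(feature)
        result = aggregate(sanitize(score_dimensions(feature, plan)))
        if evaluate(result):                     # validation passed: save and return
            return {"status": "saved", "attempts": attempt, **result}
    return {"status": "manual_review"}           # persistent failures go to humans


print(run_pipeline({"title": "Dark mode"}))
```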

🎭 Agent Types

• Reactive Agent (low complexity): Scorer Agent. Responds to input, returns a score. Stateless.
• Reflexive Agent (medium complexity): Guardrail Agent. Uses rules + context for validation. Reads context.
• Deliberative Agent (high complexity): Planner Agent. Plans multi-step execution. Stateful.
• Orchestrator Agent (highest complexity): Executor Agent. Coordinates all agents, handles loops. Full state management.

📈 Levels of Autonomy

• L1 Tool: Human calls, agent responds → Monday's prompts
• L2 Chained Tools: Sequential execution → Tuesday's code
• L3 Agent: Makes decisions, can loop → Planner, Evaluator
• L4 Multi-Agent: Agents collaborate autonomously → this system (Executor coordinates 6 agents)

RAG vs Fine-Tuning

Product terminology and priorities change constantly. RAG allows daily updates without retraining. Fine-tuning would require quarterly retraining at $5K+ per iteration.
✅ RAG (chosen): Cost $100/mo · Updated daily · How: add new docs to the vector DB
❌ Fine-tuning: Cost $5K/mo · Updated quarterly · How: retrain the entire model

Implementation: Vector DB (Pinecone) with product docs, historical roadmaps, and customer feedback, retrieved during scoring for context. Embeddings updated daily.

Hallucination Detection

Risk: LLMs hallucinate scores (e.g., inventing customer quotes, fabricating usage data). Four detection layers:

• Layer 1: Confidence scores (< 0.7 flagged for review)
• Layer 2: Cross-reference with source data (Jira, CRM, analytics)
• Layer 3: Logical consistency checks (e.g., high effort + low value makes a high score unlikely)
• Layer 4: Human review queue for flagged items

Result: 0.5% hallucination rate, 100% caught by layers 1-3
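
A Layer 3 check can be as simple as a few hand-written rules; the thresholds below are illustrative, not the system's actual values.

```python
# Illustrative Layer 3 consistency rules; thresholds are made up for the example.
def consistency_flags(scores: dict, final_score: float) -> list:
    flags = []
    if scores.get("effort", 0) >= 8 and scores.get("value", 10) <= 3 and final_score >= 7:
        flags.append("high_effort_low_value_but_high_final_score")
    if final_score > max(scores.values()):
        flags.append("final_score_exceeds_every_dimension_score")
    return flags  # any flag routes the feature to the human review queue (Layer 4)


print(consistency_flags(
    {"value": 2, "effort": 9, "alignment": 5, "impact": 3, "debt": 4}, 8.1
))
```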

Evaluation Framework

• Score Accuracy (vs PM judgment): 92% (target: 90%+)
• Ranking Correlation (vs actual roadmap): 0.88 (target: 0.85+)
• Explanation Quality (human rating): 4.2/5 (target: 4/5)
• Confidence Calibration Error: 0.08 (target: < 0.1)

Testing: Shadow mode: 500 features scored by both AI and PMs in parallel. Compare results weekly. A/B test new models on 10% of traffic.
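
The ranking-correlation figure can be computed as a Spearman correlation between the system's ordering and the shipped roadmap; this sketch assumes SciPy is available and uses dummy rank lists.

```python
# Ranking correlation between AI ordering and the shipped roadmap (dummy data).
from scipy.stats import spearmanr

ai_rank = [1, 2, 3, 4, 5, 6, 7, 8]        # position assigned by the scoring system
roadmap_rank = [1, 3, 2, 4, 6, 5, 7, 8]   # position on the roadmap PMs actually shipped

correlation, p_value = spearmanr(ai_rank, roadmap_rank)
print(f"ranking correlation: {correlation:.2f} (target: 0.85+)")
```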

Dataset Curation

1. Collect: 5K historical features, exported from Jira/Linear
2. Clean: 4.2K usable after removing duplicates and incomplete data
3. Label: 4.2K labeled (~$21K labeling cost)
4. Augment: +800 synthetic examples for edge cases (e.g., very high value + very high effort)

→ 5K high-quality examples. Inter-rater agreement (Fleiss' Kappa): 0.85. Used for model training and evaluation.
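
For the inter-rater agreement figure, one option is the Fleiss' kappa implementation in statsmodels; the rating matrix below is dummy data (rows are features, columns are three hypothetical PM raters), so treat it purely as a sketch of the calculation.

```python
# Computing Fleiss' kappa over PM labels; ratings here are dummy data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = features, columns = raters, values = assigned priority bucket (0-2)
ratings = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 1, 2],
    [1, 1, 1],
])

table, _ = aggregate_raters(ratings)  # per-feature counts for each category
print(f"Fleiss' kappa: {fleiss_kappa(table):.2f}")
```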

Agentic RAG

The agent iteratively retrieves context based on its own reasoning, rather than doing one-shot retrieval.
Example: a feature mentions 'mobile app'; the agent retrieves mobile roadmap context, sees a mention of 'React Native migration', retrieves the migration plan, reasons that this feature depends on migration completion, and generates a score with full context.
💡 The agent decides what context it needs and retrieves iteratively, which is more accurate than one-shot retrieval.
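
Reduced to control flow, the loop in that example looks roughly like this; `retrieve` and `needs_more_context` are stand-ins for the vector-DB query and the agent's own reasoning step, so this is a sketch of the pattern, not the implementation.

```python
# Control-flow sketch of agentic (iterative) retrieval; the corpus and the
# follow-up decision are hard-coded stand-ins for the vector DB and the LLM.
def retrieve(query: str) -> str:
    corpus = {
        "mobile app": "Mobile roadmap notes a React Native migration in progress",
        "React Native migration": "Migration plan: core screens ported by Q3",
    }
    return corpus.get(query, "")


def needs_more_context(context: list) -> str | None:
    # The real agent asks the LLM what is still missing; here we hard-code the
    # follow-up question from the example in the text.
    if any("React Native migration" in chunk for chunk in context) and len(context) < 2:
        return "React Native migration"
    return None


def agentic_retrieve(feature_text: str, max_hops: int = 3) -> list:
    context, query = [], feature_text
    for _ in range(max_hops):
        chunk = retrieve(query)
        if chunk:
            context.append(chunk)
        query = needs_more_context(context)
        if query is None:
            break  # the agent decides it has enough context to score
    return context


print(agentic_retrieve("mobile app"))
```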

Multi-Model Ensemble

Tech Stack Summary

• LLMs: Claude 3.5 Sonnet (primary), GPT-4 (fallback), Gemini Pro (ensemble)
• Agent Framework: LangGraph or CrewAI
• Database: PostgreSQL (Aurora for enterprise)
• Vector DB: Pinecone or Weaviate
• Cache: Redis (ElastiCache for enterprise)
• Queue: SQS or Kafka (enterprise)
• Compute: Lambda (startup), ECS Fargate (enterprise)
• API Gateway: AWS ALB + API Gateway
• Monitoring: CloudWatch + Grafana + PagerDuty
• Security: KMS, Secrets Manager, Comprehend, WAF
• Frontend: Next.js + Vercel (startup), CloudFront + S3 (enterprise)
πŸ—οΈ

Need Architecture Review?

We'll audit your product roadmap system, identify bottlenecks, and show you how to scale to 10K features/month with AI.

© 2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.