From prompts to production routing system.
Monday: 3 core prompts for route planning. Tuesday: automation code with LangGraph. Wednesday: team workflows across dispatch, drivers, and ops. Thursday: complete technical architecture. Multi-agent coordination, real-time optimization, ML-driven predictions, and scaling patterns for 10,000+ routes daily.
Key Assumptions
System Requirements
Functional
- Ingest orders (address, time window, priority, size/weight)
- Optimize routes considering distance, time, capacity, constraints
- Real-time re-routing on traffic, delays, cancellations
- Driver assignment based on skills, location, availability
- ETA prediction with 90%+ accuracy (±15 min)
- Cost tracking: fuel, labor, vehicle wear per route
- Analytics: route efficiency, on-time %, cost per delivery
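The order fields listed above could be captured in a small record type. This is a minimal sketch: the field names, units, priority scale, and validation rules are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Order:
    """One delivery order; every field name here is hypothetical."""
    order_id: str
    address: str
    window_start: datetime   # earliest acceptable arrival
    window_end: datetime     # latest acceptable arrival
    priority: int            # assumed scale: 1 (highest) to 5 (lowest)
    weight_kg: float
    volume_m3: float

    def __post_init__(self) -> None:
        # Reject orders that could never be scheduled or loaded.
        if self.window_end <= self.window_start:
            raise ValueError("time window must end after it starts")
        if self.weight_kg <= 0 or self.volume_m3 <= 0:
            raise ValueError("size and weight must be positive")


order = Order("o-1", "1 Main St",
              datetime(2026, 1, 5, 9), datetime(2026, 1, 5, 11),
              priority=2, weight_kg=12.5, volume_m3=0.2)
```

Validating at ingestion keeps obviously infeasible orders out of the optimizer, where they would otherwise surface as solver failures.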
Non-Functional (SLOs)
Cost Targets: $0.05 per route, $2 per vehicle per day, $5 of ML inference per 1,000 routes
Agent Layer
planner
L3: Decompose the routing problem into solvable sub-problems (clustering, sequencing, assignment)
Tools: GeospatialClusteringTool (DBSCAN, k-means), ConstraintAnalysisTool (feasibility check), StrategySelector (problem size → algorithm)
Recovery: if clustering fails → fall back to simple geographic bounds; if constraints are infeasible → relax time windows incrementally; if no strategy matches → default to greedy nearest-neighbor
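The planner's size-based strategy selection and its greedy default could look roughly like this. The strategy names and the stop-count cut-offs are assumptions made for illustration; only the "default to greedy nearest-neighbor" rule comes from the text above.

```python
import math

Point = tuple[float, float]

# Hypothetical (max_stops, strategy) table; the cut-offs are illustrative only.
STRATEGIES = [(50, "exact_vrp"), (500, "cluster_then_vrp")]


def select_strategy(num_stops: int) -> str:
    """Map problem size to an algorithm name."""
    for max_stops, name in STRATEGIES:
        if num_stops <= max_stops:
            return name
    return "greedy_nearest_neighbor"  # default when no strategy matches


def nearest_neighbor_route(depot: Point, stops: list[Point]) -> list[Point]:
    """Greedy fallback: always drive to the closest unvisited stop."""
    route: list[Point] = []
    current, remaining = depot, list(stops)
    while remaining:
        nxt = min(remaining, key=lambda p: math.dist(current, p))
        remaining.remove(nxt)
        route.append(nxt)
        current = nxt
    return route
```

Nearest-neighbor is rarely optimal, but it always produces a valid visiting order, which is what makes it usable as a last resort.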
executor
L4: Run the VRP solver with ML-enhanced cost functions (traffic, weather, historical patterns)
Tools: VRPSolver (OR-Tools, Gurobi, or custom), MLCostPredictor (traffic impact, ETA), DistanceMatrixAPI (Google Maps)
Recovery: if the solver times out → return the best-so-far solution; if the ML predictor fails → fall back to historical averages; if the API rate limit is hit → use the cached distance matrix
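One way to read "ML-enhanced cost function with a historical fallback" is a travel-time function that multiplies a base time by a predicted traffic factor and degrades gracefully when the predictor fails. In this sketch the function name, the multiplier convention, and the shape of the historical table are all assumptions.

```python
from typing import Callable


def travel_minutes(
    segment_id: str,
    base_minutes: float,
    predictor: Callable[[str], float],
    historical_avg: dict[str, float],
) -> float:
    """Travel time for one segment, ML-adjusted when the predictor works."""
    try:
        multiplier = predictor(segment_id)  # assumed: 1.0 means free-flow traffic
        if multiplier <= 0:
            raise ValueError("implausible traffic multiplier")
        return base_minutes * multiplier
    except Exception:
        # Predictor failed: fall back to the historical average for this
        # segment, or to the unadjusted base time if none is recorded.
        return historical_avg.get(segment_id, base_minutes)
```

The solver only ever sees a number, so a predictor outage changes route quality but never blocks optimization.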
evaluator
L2: Validate route quality (capacity, time windows, cost, ETA accuracy)
Tools: CapacityValidator (sum weights per vehicle), TimeWindowValidator (arrival vs deadline), CostAnalyzer (compare to historical baseline)
Recovery: if violations are found → flag for human review; if quality_score < 70 → trigger re-optimization; if there is a critical (safety) violation → block deployment
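The evaluator's checks and recovery rules can be sketched as a single decision function. The `Stop` fields and the returned action names are hypothetical; the threshold of 70 and the three outcomes follow the rules above.

```python
from dataclasses import dataclass


@dataclass
class Stop:
    weight_kg: float
    arrival_min: float    # planned arrival, minutes after shift start
    deadline_min: float   # end of the delivery time window


def evaluate_route(stops: list[Stop], capacity_kg: float,
                   quality_score: float, safety_ok: bool = True) -> str:
    """Return the action the evaluator would take for one route."""
    if not safety_ok:
        return "block"  # critical violation: never deploy
    over_capacity = sum(s.weight_kg for s in stops) > capacity_kg
    late = any(s.arrival_min > s.deadline_min for s in stops)
    if over_capacity or late:
        return "human_review"
    if quality_score < 70:
        return "reoptimize"
    return "approve"
```

Ordering the checks from most to least severe means a route with several problems is always reported under its worst one.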
guardrail
L1: Enforce safety policies (max drive time, rest periods, hazmat restrictions)
Tools: SafetyPolicyEngine (rule-based checks), DriverComplianceChecker (hours of service), HazmatValidator (route restrictions)
Recovery: if a policy is violated → block the route and alert the dispatcher; if the driver is unavailable → reassign to a backup driver; if the hazmat route is invalid → reroute avoiding restricted zones
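A rule-based guardrail of this kind might be sketched as below. The numeric limits are placeholders chosen for the example (real hours-of-service limits depend on jurisdiction), and the `RoutePlan` fields and violation names are assumptions.

```python
from dataclasses import dataclass

# Placeholder limits for illustration; actual values come from regulation.
MAX_DRIVE_HOURS = 11.0
MAX_STRETCH_HOURS = 8.0   # longest allowed driving stretch without a break


@dataclass
class RoutePlan:
    drive_hours: float
    longest_stretch_hours: float
    carries_hazmat: bool
    zones: frozenset       # identifiers of the zones the route passes through


def policy_violations(plan: RoutePlan, hazmat_restricted: frozenset) -> list[str]:
    """Return every violated policy; an empty list means the route may deploy."""
    violations = []
    if plan.drive_hours > MAX_DRIVE_HOURS:
        violations.append("max_drive_time")
    if plan.longest_stretch_hours > MAX_STRETCH_HOURS:
        violations.append("rest_period")
    if plan.carries_hazmat and plan.zones & hazmat_restricted:
        violations.append("hazmat_zone")
    return violations
```

Returning all violations rather than the first one gives the dispatcher the full picture in a single alert.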
rebalancer
L3: Dynamically adjust routes based on real-time events (traffic, cancellations, new orders)
Tools: TrafficMonitor (live traffic API), EventProcessor (order updates, driver status), IncrementalOptimizer (local search, 2-opt)
Recovery: if the traffic API is down → use cached data (5 min stale); if rebalancing is infeasible → maintain current routes; if the cost increase is > 20% → require dispatcher approval
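The 2-opt local search named for the IncrementalOptimizer, together with the 20% approval gate, can be illustrated as follows. This is a plain textbook 2-opt on straight-line distances for an open path with fixed endpoints, not the system's optimizer; the function names are hypothetical.

```python
import math

Point = tuple[float, float]


def route_length(route: list[Point]) -> float:
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


def two_opt(route: list[Point]) -> list[Point]:
    """Reverse inner segments while doing so shortens the route."""
    best = list(route)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 2):
            for j in range(i + 1, len(best) - 1):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if route_length(candidate) < route_length(best) - 1e-9:
                    best, improved = candidate, True
    return best


def needs_dispatcher_approval(old_cost: float, new_cost: float,
                              threshold: float = 0.20) -> bool:
    """True when rebalancing raises cost by more than the threshold."""
    return new_cost > old_cost * (1 + threshold)
```

Because 2-opt only touches a route that already exists, it suits mid-day adjustments where a full re-solve would be too slow.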
eta_predictor
L2: ML-driven ETA predictions considering traffic, weather, and driver behavior
Tools: TrafficModel (XGBoost on historical patterns), WeatherImpactModel (rain → 15% slower), DriverBehaviorModel (speed profile)
Recovery: if the ML model fails → fall back to a distance/speed formula; if confidence < 60% → widen the interval to ±30 min; if historical data is missing → use city-wide averages
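The ETA recovery rules translate fairly directly into code. In this sketch the model interface (a callable returning minutes and a confidence), the default speed of 40 km/h, and the choice to report the formula fallback with the wide ±30 min interval are assumptions; the 60% threshold and the 15% rain slowdown come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Eta:
    minutes: float
    half_width_min: float   # the estimate is minutes ± half_width_min


# Hypothetical model interface: distance_km -> (minutes, confidence in [0, 1])
Model = Callable[[float], Tuple[float, float]]


def predict_eta(distance_km: float, model: Optional[Model] = None,
                raining: bool = False, avg_speed_kmh: float = 40.0) -> Eta:
    if model is not None:
        try:
            minutes, confidence = model(distance_km)
            # Low confidence widens the interval from ±15 to ±30 minutes.
            return Eta(minutes, 15.0 if confidence >= 0.60 else 30.0)
        except Exception:
            pass  # fall through to the formula below
    # Distance/speed fallback; rain is modeled as a 15% lower speed.
    speed = avg_speed_kmh * (0.85 if raining else 1.0)
    return Eta(distance_km / speed * 60.0, 30.0)
```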
ML Layer
Feature Store
Update: Real-time (traffic), Hourly (weather), Daily (historical aggregates)
- traffic_speed_by_road_segment (5 min intervals)
- weather_conditions (rain, snow, temp)
- historical_delivery_times (by time_of_day, day_of_week)
- driver_speed_profile (avg speed, variability)
- order_density_heatmap (orders per km²)
- vehicle_utilization (capacity used %)
Model Registry
Strategy: Semantic versioning (major.minor.patch), A/B test new models at 10% traffic
- ETA Predictor
- Demand Forecaster
- Route Clustering
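The registry strategy above (semantic versions, new models on 10% of traffic) could be sketched with a deterministic hash split. The function names and the use of a route ID as the bucketing key are assumptions for the example.

```python
import hashlib


def parse_semver(version: str) -> tuple[int, int, int]:
    """'1.10.0' -> (1, 10, 0), so versions compare numerically, not as text."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def pick_model(route_id: str, stable: str, candidate: str,
               candidate_share: float = 0.10) -> str:
    """Send a fixed share of routes to the candidate model version."""
    digest = hashlib.sha256(route_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return candidate if bucket < candidate_share else stable
```

Hashing the ID instead of drawing a random number keeps each route on the same model across retries, which makes A/B comparisons cleaner.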
Observability Stack
Real-time monitoring, tracing & alerting
Deployment Variants
Startup Architecture
Fast to deploy, cost-efficient, scales to 100 competitors
Infrastructure
Risks & Mitigations
VRP solver timeout on large problems (>500 stops)
Medium. Mitigation: Implement hierarchical optimization (cluster → optimize clusters → merge). Set a hard timeout (60 sec) with best-so-far fallback. Use heuristics (genetic algorithm, simulated annealing) for large problems.
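The "hard timeout with best-so-far fallback" idea can be shown with a time-boxed improvement loop. This sketch uses random segment reversals on straight-line distances purely for illustration; it does not implement the hierarchical clustering step, and the function names are hypothetical.

```python
import math
import random
import time

Point = tuple[float, float]


def path_length(route: list[Point]) -> float:
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


def improve_until_deadline(route: list[Point], time_limit_s: float = 60.0,
                           seed: int = 0) -> list[Point]:
    """Keep the best route found so far and return it when time runs out."""
    best = list(route)
    if len(best) < 4:
        return best  # nothing to reorder between fixed endpoints
    rng = random.Random(seed)
    best_len = path_length(best)
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() < deadline:
        # Reverse a random inner segment; endpoints stay in place.
        i, j = sorted(rng.sample(range(1, len(best) - 1), 2))
        candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
        candidate_len = path_length(candidate)
        if candidate_len < best_len:
            best, best_len = candidate, candidate_len
    return best
```

Since a candidate is accepted only when it is shorter, the returned route is never worse than the input, whatever the time limit.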
Google Maps API cost explosion ($10K+/mo)
High (if not monitored). Mitigation: Cache the distance matrix (1 hr TTL), implement rate limiting (100 req/sec), use OSM as backup, negotiate volume pricing with Google, monitor daily spend with alerts.
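A distance-matrix cache with a one-hour TTL needs very little code. The class below is an in-memory sketch with an injectable clock so expiry can be tested; the class and method names are hypothetical, and a production system would more likely keep this in a shared cache such as Redis.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """Cache values for ttl_s seconds, then refetch on the next request."""

    def __init__(self, ttl_s: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_s
        self._clock = clock
        self._data: dict[Hashable, tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._data.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]                 # fresh entry: no API call
        value = fetch()                   # missing or expired: call the API
        self._data[key] = (now, value)
        return value
```

Keying on an (origin, destination) pair means each pair is billed at most once per hour, no matter how many optimizations request it.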
ML model drift (ETA accuracy drops from 90% to 70%)
Medium. Mitigation: Monitor ETA accuracy daily, alert if <85%, retrain the model weekly, A/B test new models, keep 3 model versions in the registry for rollback.
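The daily accuracy check could be as simple as the sketch below, assuming "accuracy" means the share of deliveries whose ETA error falls within ±15 minutes (an interpretation of the requirement stated earlier, not something the source defines). The function names are hypothetical.

```python
def eta_accuracy(errors_min: list[float], tolerance_min: float = 15.0) -> float:
    """Share of deliveries whose ETA error is within the tolerance."""
    if not errors_min:
        raise ValueError("no deliveries to evaluate")
    hits = sum(1 for e in errors_min if abs(e) <= tolerance_min)
    return hits / len(errors_min)


def drift_status(accuracy: float, alert_below: float = 0.85) -> str:
    """'alert' when daily accuracy falls under the alert threshold."""
    return "alert" if accuracy < alert_below else "ok"
```

For example, errors of 5, -10, 20, and 14 minutes give 3 hits out of 4, an accuracy of 0.75, which is below 0.85 and would raise an alert.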
Real-time traffic data unavailable (API outage)
Low. Mitigation: Use historical traffic patterns as fallback, cache traffic data (5 min TTL), switch to a backup provider (Waze, TomTom), accept degraded accuracy (±15 min) temporarily.
Driver non-compliance (ignoring optimized routes)
Medium. Mitigation: Gamification (leaderboard, bonuses for following routes), real-time coaching (in-app nudges), analytics dashboard for fleet managers, escalation for repeated violations.
Data privacy breach (customer addresses leaked)
Low. Mitigation: Encrypt addresses at rest (AES-256), redact in logs, store as lat/lng + geohash, implement RBAC (least privilege), audit all access, pen-test quarterly.
Multi-tenancy data leakage (Tenant A sees Tenant B's routes)
Low. Mitigation: Row-level security (RLS) in PostgreSQL, tenant_id in all queries, separate schemas per tenant (enterprise), automated testing for cross-tenant access, regular security audits.
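The "tenant_id in all queries" rule and an automated cross-tenant check can be demonstrated with an in-memory SQLite database. SQLite has no row-level security, so this sketch only shows application-level scoping; the table layout and function name are hypothetical, and in PostgreSQL the same filter would additionally be enforced by an RLS policy.

```python
import sqlite3


def routes_for_tenant(conn: sqlite3.Connection, tenant_id: str) -> list[str]:
    """Every query takes tenant_id as a bound parameter, never as an option."""
    rows = conn.execute(
        "SELECT route_id FROM routes WHERE tenant_id = ? ORDER BY route_id",
        (tenant_id,),
    )
    return [row[0] for row in rows]


# Tiny fixture with two tenants, as a cross-tenant test might set up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE routes (route_id TEXT, tenant_id TEXT)")
conn.executemany(
    "INSERT INTO routes VALUES (?, ?)",
    [("r1", "tenant_a"), ("r2", "tenant_b"), ("r3", "tenant_a")],
)
```

A test suite would then assert that no route belonging to `tenant_b` ever appears in a `tenant_a` result, and run that check on every query helper.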
Evolution Roadmap
Progressive transformation from MVP to scale
Phase 1: MVP (0-3 months)
Phase 2: Scale & ML (3-6 months)
Phase 3: Enterprise & Multi-Region (6-12 months)
Complete Systems Architecture
9-layer architecture from driver app to ML models
- Presentation (4 components)
- API Gateway (4 components)
- Agent Layer (4 components)
- ML Layer (4 components)
- Integration (4 components)
- Data (4 components)
- External (4 components)
- Observability (4 components)
- Security (4 components)
Sequence Diagram - Route Optimization Request
Automated data flow every hour
Data Flow - Order to Delivery
Key Integrations
Google Maps Platform
Weather API (OpenWeather)
TMS (Transportation Management System)
Telematics (GPS Tracking)
Payment Gateway (Stripe)
Security & Compliance
Failure Modes & Fallbacks
| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| Google Maps API down or rate limited | Use cached distance matrix (1 hr stale) → OpenStreetMap (OSM) as backup → Manual distance estimation (haversine formula) | Degraded accuracy (±10% distance error), slower optimization | 99.5% (Maps API SLA: 99.9%) |
| VRP solver timeout (>60 sec) | Return best-so-far solution (may be suboptimal) → Greedy nearest-neighbor as last resort | Routes 5-15% less efficient, but still valid | 99.0% |
| ML model prediction fails (ETA, demand) | Use historical averages (last 30 days) → Simple distance/speed formula | ETA accuracy drops from 90% to 75% | 99.5% |
| Database unavailable (PostgreSQL) | Read from replica (eventual consistency) → Cache (Redis) for recent data → Fail gracefully (return 503) | Read-only mode, no new routes generated | 99.9% (RDS Multi-AZ) |
| Guardrail agent detects policy violation (e.g., max drive time exceeded) | Block route deployment → Alert dispatcher → Suggest manual adjustment | Safety maintained, but manual intervention required | 100% (safety-critical) |
| Real-time traffic data unavailable | Use historical traffic patterns (by time_of_day, day_of_week) → Assume free-flow speed | ETA accuracy drops by 10-15% | 99.0% |
| Kafka message broker down | Buffer events in Redis (limited capacity) → Fall back to synchronous processing → Alert on-call | Increased latency, risk of data loss if buffer full | 99.5% |
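The first row's fallback chain (cached matrix, then a backup provider, then the haversine formula) can be sketched as an ordered list of providers with a formula of last resort. The provider callables are hypothetical stand-ins; the haversine function is the standard great-circle formula, which measures straight-line distance over the Earth's surface rather than road distance.

```python
import math
from typing import Callable, Sequence

LatLng = tuple[float, float]
EARTH_RADIUS_KM = 6371.0


def haversine_km(a: LatLng, b: LatLng) -> float:
    """Great-circle distance between two (lat, lng) points in kilometers."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_km(a: LatLng, b: LatLng,
                providers: Sequence[Callable[[LatLng, LatLng], float]]) -> float:
    """Try each provider in order; fall back to haversine if all fail."""
    for provider in providers:
        try:
            return provider(a, b)
        except Exception:
            continue   # provider down or rate limited: try the next one
    return haversine_km(a, b)
```

Because the last step is pure arithmetic, the chain always returns a distance, which is what lets optimization continue in degraded mode.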
                  ┌──────────────┐
                  │ Orchestrator │ ← Coordinates all agents, manages state
                  └──────┬───────┘
                         │
   ┌────────┬────────┬───┴────┬────────┬────────┐
   │        │        │        │        │        │
┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐  ┌──▼──┐
│Plan │  │Exec │  │Eval │  │Guard│  │Rebal│  │ETA  │
│Agent│  │Agent│  │Agent│  │Agent│  │Agent│  │Pred │
└──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘  └──┬──┘
   │        │        │        │        │        │
   └────────┴────────┴───┬────┴────────┴────────┘
                         │
                  ┌──────▼──────┐
                  │  Database   │
                  │  (Routes)   │
                  └─────────────┘
Agent Collaboration Flow
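One possible shape for the collaboration flow is a loop in which the orchestrator calls the agents in turn and decides whether to deploy, retry, or escalate. The call order (plan, execute, guard, evaluate), the retry limit, and the status names are assumptions for this sketch; the quality threshold of 70 is the evaluator's re-optimization rule.

```python
from typing import Any, Callable


def orchestrate(orders: list, plan: Callable[[list], Any],
                execute: Callable[[Any], Any],
                evaluate: Callable[[Any], float],
                guard: Callable[[Any], bool],
                max_attempts: int = 3) -> dict:
    """Coordinate the agents and return the final decision with its state."""
    clusters = plan(orders)                     # planner: decompose the problem
    routes = None
    for attempt in range(1, max_attempts + 1):
        routes = execute(clusters)              # executor: solve for routes
        if not guard(routes):                   # guardrail: safety comes first
            return {"status": "blocked", "routes": None, "attempts": attempt}
        if evaluate(routes) >= 70:              # evaluator: quality gate
            return {"status": "deployed", "routes": routes, "attempts": attempt}
        # Quality too low: loop again to trigger re-optimization.
    return {"status": "needs_review", "routes": routes, "attempts": max_attempts}
```

Passing the agents in as callables keeps the orchestrator free of solver or model details, so each agent can be replaced or stubbed in tests.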
Agent Types
Reactive Agent
Low: ETA Predictor - Receives route segment, returns ETA
Reflexive Agent
Medium: Evaluator Agent - Uses rules + context (capacity, time windows)
Deliberative Agent
High: Planner Agent - Plans clustering strategy based on problem size
Orchestrator Agent
Highest: Coordinator - Makes routing decisions, handles loops, manages state
Levels of Autonomy
RAG vs Fine-Tuning for Route Planning
Hallucination Detection in Route Planning
Evaluation Framework
Dataset Curation for Route Optimization
Agentic RAG for Dynamic Routing
Online Learning for ETA Prediction
Tech Stack Summary
2026 Randeep Bhatia. All Rights Reserved.
No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.