Smart Factory IoT System Architecture 🏗️

Edge-to-cloud architecture for 10 to 10,000+ connected machines

August 21, 2025
18 min read
🏭 Manufacturing · 🔌 IoT · 🤖 Multi-Agent · ⚡ Real-Time
🎯 This Week's Journey

From sensor data to predictive insights in milliseconds.

Monday showed 3 core prompts for anomaly detection, predictive maintenance, and quality control. Tuesday automated data ingestion and agent orchestration. Wednesday mapped team workflows across plant managers, data engineers, and ML ops. Today: complete production architecture. Edge computing, multi-agent coordination, real-time analytics, and ML pipelines that scale from pilot line to enterprise manufacturing.

📋

Key Assumptions

1. Monitor 10-10,000 industrial machines (PLCs, CNCs, robots, sensors)
2. Real-time requirements: <100ms for critical alerts, <5s for analytics
3. Edge deployment in harsh environments (temperature, vibration, dust)
4. Integration with legacy systems (OPC-UA, Modbus, proprietary protocols)
5. Compliance: ISO 27001, SOC 2, data residency requirements for multinational plants

System Requirements

Functional

  • Ingest sensor data from 1,000+ data points per machine at 1-100Hz
  • Detect anomalies in real-time with <100ms latency for critical events
  • Predict machine failures 24-72 hours in advance with 85%+ accuracy
  • Generate quality control insights from vision + sensor fusion
  • Orchestrate maintenance workflows across plant teams
  • Provide digital twin visualization with real-time state sync
  • Support batch analytics for production optimization

Non-Functional (SLOs)

  • Latency (p95): 100 ms
  • Freshness: 1 min
  • Availability: 99.9%
  • Prediction accuracy: 85%
  • False positive rate: 5%
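
One way to make the SLOs above checkable is to encode them as data and compare them against observed values. This is a minimal sketch: the `Slo` class, its field names, and the `slo_violations` helper are hypothetical, and only the target numbers come from the list above.

```python
from dataclasses import asdict, dataclass

# Targets where a lower observed value is better; the rest are "at least" targets.
LOWER_IS_BETTER = {"latency_p95_ms", "freshness_min", "false_positive_rate_pct"}


@dataclass(frozen=True)
class Slo:
    latency_p95_ms: float = 100.0
    freshness_min: float = 1.0
    availability_pct: float = 99.9
    prediction_accuracy_pct: float = 85.0
    false_positive_rate_pct: float = 5.0


def slo_violations(observed, slo=Slo()):
    """Return the names of SLOs that the observed values miss.

    Metrics absent from `observed` are skipped rather than treated as failures.
    """
    violations = []
    for name, target in asdict(slo).items():
        if name not in observed:
            continue
        value = observed[name]
        missed = value > target if name in LOWER_IS_BETTER else value < target
        if missed:
            violations.append(name)
    return violations
```

Feeding it the numbers a dashboard already exports (for example a p95 latency of 120 ms) returns the list of breached targets, which is enough to drive an alert rule.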

💰 Cost Targets: $15 per machine per month, $0.02 per prediction, $0.10 per GB of storage per month

Agent Layer

planner

L4

Decomposes incoming sensor data streams, routes to specialized agents, coordinates multi-step workflows

🔧 anomaly_detector, predictive_maintenance, quality_control, work_order_creator

⚡ Recovery: Retry with exponential backoff (3 attempts), Fallback to rule-based routing if ML planner fails, Queue for human review if all agents fail
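
The planner's recovery chain can be sketched as a single function. This is an illustration under assumptions, not the production planner: `route_with_recovery`, its parameters, and the 0.1 s base delay are hypothetical, and "all agents fail" is read here as "rule-based routing also fails".

```python
import time


def route_with_recovery(reading, ml_route, rule_route, review_queue,
                        attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Try the ML planner with backoff, then rule-based routing, then human review."""
    for attempt in range(attempts):
        try:
            return ml_route(reading)
        except Exception:
            # Exponential backoff between attempts; no sleep after the last one.
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    try:
        return rule_route(reading)
    except Exception:
        # Nothing could route the reading: park it for a human.
        review_queue.append(reading)
        return None
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting.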

executor

L3

Runs primary data processing workflows: feature extraction, anomaly detection, prediction generation

🔧 feature_store.read, ml_model.infer, time_series_analyzer, feature_store.write

⚡ Recovery: Switch to cached features if feature store unavailable, Use last known good model version if inference fails, Degrade to statistical methods if ML unavailable

evaluator

L3

Validates output quality, checks prediction confidence, triggers retraining when drift detected

🔧 statistical_validator, confidence_scorer, drift_detector, human_review_trigger

⚡ Recovery: Flag low-confidence predictions for human review, Trigger model retraining if accuracy drops below 80%, Escalate to domain expert if drift is severe

guardrail

L2

Enforces safety policies, prevents false positives, validates before triggering critical actions

🔧 policy_checker, risk_assessor, false_positive_filter, audit_logger

⚡ Recovery: Default to conservative (reject) if policy engine fails, Require human approval for high-risk actions, Log all decisions for post-incident review
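
The guardrail's "default to conservative" behavior is a fail-closed check. A minimal sketch, assuming hypothetical `policy_check` and `is_high_risk` callables and an in-memory list standing in for the audit log:

```python
def guardrail_decision(action, policy_check, is_high_risk, audit_log):
    """Return 'approve', 'reject', or 'needs_human_approval' and log the decision."""
    try:
        allowed = policy_check(action)
    except Exception:
        # Policy engine failure: fail closed instead of letting the action through.
        decision = "reject"
    else:
        if not allowed:
            decision = "reject"
        elif is_high_risk(action):
            decision = "needs_human_approval"
        else:
            decision = "approve"
    # Every decision is recorded, including rejections caused by engine failures.
    audit_log.append({"action": action, "decision": decision})
    return decision
```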

anomaly_detector

L3

Real-time anomaly detection using statistical methods + ML models

🔧 isolation_forest, autoencoder_model, statistical_outlier_detector, feature_store.read

⚡ Recovery: Fallback to z-score method if ML fails, Use ensemble of 3 methods, majority vote, Alert if detection latency exceeds 100ms
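
The majority vote with a z-score fallback might look like the sketch below. The function names, the 3-sigma threshold, and the detector call signature are assumptions; real isolation forest or autoencoder models would be wrapped as the `ml_detectors` callables.

```python
from statistics import mean, stdev


def zscore_is_anomaly(history, value, threshold=3.0):
    """Statistical detector: flag values more than `threshold` sigmas from the mean."""
    sigma = stdev(history)
    if sigma == 0:
        return value != history[0]
    return abs(value - mean(history)) / sigma > threshold


def detect(history, value, ml_detectors):
    """Majority vote over the z-score detector plus any ML detectors that respond.

    A detector that raises drops out of the vote, so if every ML detector
    fails the result falls back to the z-score verdict alone. Ties count
    as "not anomalous".
    """
    votes = [zscore_is_anomaly(history, value)]
    for detector in ml_detectors:
        try:
            votes.append(bool(detector(history, value)))
        except Exception:
            continue
    return sum(votes) * 2 > len(votes)
```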

predictive_maintenance

L4

Predict machine failures 24-72 hours in advance using survival analysis + deep learning

🔧 survival_model, lstm_predictor, feature_store.read, maintenance_recommender

⚡ Recovery: Use rule-based heuristics if ML unavailable, Require 2/3 model agreement for high-confidence predictions, Flag low-confidence predictions for expert review

ML Layer

Feature Store

Update: Real-time (streaming) + hourly batch aggregations

  • vibration_rms_1h
  • temperature_mean_1h
  • pressure_std_1h
  • cycle_count_1d
  • power_consumption_1h
  • quality_defect_rate_1d
  • maintenance_days_since_last
  • operating_hours_total
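
Three of the hourly features above can be derived from raw samples as follows. This is a sketch: the sample layout (dicts with `ts`, `vibration`, `temperature`, `pressure`) and the use of a population standard deviation are assumptions, not the feature store's actual definitions.

```python
import math
from statistics import fmean, pstdev


def hourly_features(samples, now_s):
    """Aggregate samples from the last hour; each sample carries a 'ts' in seconds."""
    window = [s for s in samples if now_s - 3600 < s["ts"] <= now_s]
    if not window:
        return {}
    return {
        # Root mean square of the vibration signal over the window.
        "vibration_rms_1h": math.sqrt(fmean([s["vibration"] ** 2 for s in window])),
        "temperature_mean_1h": fmean([s["temperature"] for s in window]),
        "pressure_std_1h": pstdev([s["pressure"] for s in window]),
    }
```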

Model Registry

Strategy: Semantic versioning with A/B testing for major versions

  • anomaly_detector_v3
  • failure_predictor_v2
  • quality_classifier_v1
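
The registry strategy can be illustrated with two small helpers: one decides whether a candidate is a major version bump that needs an A/B test, the other splits machines between variants deterministically. All names, the `major.minor.patch` string format, and the 10% candidate share are assumptions for illustration.

```python
import hashlib


def parse_version(version):
    """Split 'major.minor.patch' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def needs_ab_test(current, candidate):
    """Only major version bumps go through A/B testing."""
    return parse_version(candidate)[0] > parse_version(current)[0]


def assign_variant(machine_id, candidate_share=0.1):
    """Map a machine to 'candidate' or 'current' by hashing its ID.

    Hashing keeps the assignment stable across restarts, so a machine
    does not flip between model versions mid-experiment.
    """
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    return "candidate" if bucket < candidate_share else "current"
```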

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (10 metrics) → Processing (aggregate & transform) → Dashboards (5 views) → Alerts (enabled)
Signals: 📊 Metrics (10), 📝 Logs (structured), 🔗 Traces (distributed)
sensor_data_ingestion_rate
edge_gateway_uptime
anomaly_detection_latency_p95_ms
prediction_accuracy
false_positive_rate
work_order_creation_rate

Deployment Variants

🚀

Startup Architecture

Fast to deploy, cost-efficient, scales to 100 machines

Infrastructure

Edge: Raspberry Pi 4 (1 per line)
Cloud: AWS Lambda + API Gateway
DB: InfluxDB Cloud (managed)
ML: SageMaker serverless inference
Auth: Auth0 free tier
Monitoring: CloudWatch + Grafana Cloud
Single-tenant, single-region (us-east-1)
Serverless-first to minimize ops overhead
Managed services for fast time-to-market
Cost target: <$500/month for 10-50 machines
Deploy in 2-4 weeks

Risks & Mitigations

⚠️ Edge gateway hardware failure in harsh factory environment

Medium

✓ Mitigation: Redundant gateways per line, industrial-grade hardware (IP65 rated), local buffering (24h), automated failover, quarterly hardware audits

⚠️ Model drift due to changing production processes

High

✓ Mitigation: Continuous monitoring (PSI, KL divergence), automated retraining pipeline (weekly), A/B testing before deployment, human-in-the-loop validation

⚠️ False positives causing alert fatigue

Medium

✓ Mitigation: Guardrail agent with adaptive thresholds, human feedback loop, precision-focused tuning (target <5% FP rate), escalation for repeated FPs

⚠️ Integration failure with legacy SCADA systems

Medium

✓ Mitigation: Protocol abstraction layer, extensive testing in staging, fallback to manual data entry, vendor support contracts, phased rollout

⚠️ Data privacy violation (sensor data contains trade secrets)

Low

✓ Mitigation: Data classification policy, encryption at rest/in transit, access controls (RBAC), audit logging, data residency compliance, regular security audits

⚠️ Cloud connectivity loss during critical production

Low

✓ Mitigation: Edge autonomy mode (local ML inference), critical alerts still work, 24h local buffering, automatic sync when reconnected, multi-path networking (LTE backup)

⚠️ Insufficient training data for rare failure modes

High

✓ Mitigation: Synthetic data generation (physics-based simulation), transfer learning from similar machines, active learning (prioritize labeling of uncertain predictions), partnerships with OEMs for historical data
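
The drift mitigation above names PSI (Population Stability Index) as one monitoring signal. A minimal sketch of the standard PSI formula over two histograms with identical bins; the function name and the `eps` floor for empty bins are assumptions.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between a baseline and a current histogram."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the fractions so empty bins do not produce log(0).
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score
```

Identical distributions score 0, and the score grows as the current distribution moves away from the baseline. The alerting threshold is a tuning decision that this article does not specify.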

🧬

Evolution Roadmap

Progressive transformation from MVP to scale

🌱 Phase 1: MVP (weeks 1-12, 0-3 months)

1. Deploy edge gateways on 1 production line (10 machines)
2. Implement anomaly detection (single model)
3. Basic alerting (email/SMS)
4. Manual work order creation
🌿 Phase 2: Scale (weeks 13-24, 3-6 months)

1. Expand to 3 production lines (50 machines)
2. Add predictive maintenance (failure prediction)
3. Integrate with CMMS (automated work orders)
4. Multi-agent orchestration (planner + executor + evaluator)
🌳 Phase 3: Enterprise (weeks 25-52, 6-12 months)

1. Multi-factory deployment (500+ machines)
2. Federated learning (cross-factory model training)
3. Advanced ML (vision + sensor fusion for quality control)
4. Enterprise security (SSO, RBAC, audit)
5. 99.9% SLA

🚀 Production Ready
🏗️

Complete Systems Architecture

10-layer edge-to-cloud architecture

1. 🌐 Edge Layer (5 components)
  • IoT Gateway (Edge Compute)
  • Protocol Adapters (OPC-UA, Modbus, MQTT)
  • Local Agent Runtime
  • Edge ML Inference
  • Time-Series Buffer

2. ⚙️ Presentation Layer (4 components)
  • Factory Dashboard (Real-Time)
  • Digital Twin Viewer
  • Maintenance Mobile App
  • Analytics Portal

3. 💾 API Gateway (4 components)
  • Load Balancer
  • Rate Limiter
  • Auth (OIDC/SAML)
  • API Versioning

4. 🔌 Agent Layer (6 components)
  • Planner Agent
  • Executor Agent
  • Evaluator Agent
  • Guardrail Agent
  • Anomaly Detector Agent
  • Predictive Maintenance Agent

5. 📊 ML Layer (5 components)
  • Feature Store
  • Model Registry
  • Training Pipeline
  • Inference Service
  • Evaluation Loop

6. 🌐 Integration Layer (4 components)
  • ERP Connector (SAP, Oracle)
  • CMMS Integration (Maximo, eMaint)
  • MES Adapter
  • SCADA Bridge

7. ⚙️ Data Layer (4 components)
  • Time-Series DB (InfluxDB, TimescaleDB)
  • Event Store (Kafka)
  • Object Storage (S3)
  • Graph DB (Neo4j - Asset Relationships)

8. 💾 External Services (4 components)
  • Cloud ML APIs (Vertex AI, SageMaker)
  • Notification Services (Twilio, PagerDuty)
  • Weather APIs (for environmental correlation)
  • Supplier APIs (spare parts inventory)

9. 🔌 Observability (5 components)
  • Metrics (Prometheus)
  • Logs (Loki)
  • Traces (Tempo)
  • Dashboards (Grafana)
  • Alerting (AlertManager)

10. 📊 Security (5 components)
  • Identity Provider (Okta, Auth0)
  • Secrets Manager (Vault)
  • Certificate Authority
  • Audit Log Store
  • Network Firewall

🔄

Request Flow - Anomaly Detection

Automated data flow every hour

Participants: Machine Sensor → Edge Gateway → Planner Agent → Anomaly Agent → Evaluator Agent → Guardrail Agent → Maintenance System

1. Sensor data (100Hz)
2. Aggregated metrics
3. Analyze for anomalies
4. Anomaly score: 0.92
5. Validate severity
6. Critical: true
7. Create work order
8. Alert sent (130ms total)

Data Flow - Sensor to Action

From sensor reading to maintenance work order in <200ms

1. Machine Sensor (10ms): Emits data at 100Hz. Output: raw telemetry (temp, vibration, pressure)
2. Edge Gateway (10ms): Aggregates to 1Hz, buffers locally. Output: aggregated metrics
3. Edge ML Inference (25ms): Runs lightweight anomaly detection. Output: preliminary anomaly score
4. Cloud Planner Agent (20ms): Receives batch, routes to specialized agents. Output: task routing plan
5. Executor Agent (30ms): Extracts features, runs full ML models. Output: feature vector + predictions
6. Anomaly Agent (25ms): Deep analysis with historical context. Output: anomaly score + type classification
7. Evaluator Agent (10ms): Validates confidence, checks drift. Output: quality score + confidence interval
8. Guardrail Agent (5ms): Applies safety policies, filters false positives. Output: action approval
9. Work Order System (50ms): Creates maintenance ticket, notifies technician. Output: work order + mobile alert
10. Digital Twin (15ms): Updates machine status visualization. Output: real-time dashboard update
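
The per-stage timings above can be checked against the <200ms claim with a simple latency budget. This sketch assumes the stages run strictly one after another; the stage names and millisecond values are copied from the steps above, while `STAGES_MS` and `latency_through` are hypothetical names.

```python
STAGES_MS = [
    ("Machine Sensor", 10),
    ("Edge Gateway", 10),
    ("Edge ML Inference", 25),
    ("Cloud Planner Agent", 20),
    ("Executor Agent", 30),
    ("Anomaly Agent", 25),
    ("Evaluator Agent", 10),
    ("Guardrail Agent", 5),
    ("Work Order System", 50),
    ("Digital Twin", 15),
]


def latency_through(stage_name, stages=STAGES_MS):
    """Cumulative latency in ms from the sensor up to and including `stage_name`."""
    total = 0
    for name, ms in stages:
        total += ms
        if name == stage_name:
            return total
    raise KeyError(stage_name)
```

Summed this way, the path reaches the work order system at 185 ms, inside the 200 ms budget, and the digital twin update lands at exactly 200 ms.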
1
Volume: 10-100 machines
Pattern: Edge + Serverless
🏗️ Architecture:
  • Edge gateway per production line
  • Serverless functions (Lambda/Cloud Run)
  • Managed time-series DB (InfluxDB Cloud)
  • Single-region deployment
Cost & Performance: $500/month, 100-200ms latency

2
Volume: 100-1,000 machines
Pattern: Edge + Containerized Services
🏗️ Architecture:
  • Edge gateways with local ML inference
  • Kubernetes cluster (EKS/GKE)
  • Kafka for event streaming
  • TimescaleDB for time-series
  • Multi-zone deployment
Cost & Performance: $3,000/month, 50-100ms latency

3
Volume: 1,000-10,000 machines
Pattern: Multi-Region Edge + Cloud
🏗️ Architecture:
  • Distributed edge compute (fog layer)
  • Multi-region Kubernetes
  • Kafka clusters per region
  • Federated feature store
  • Global load balancing
Cost & Performance: $15,000/month, 20-50ms latency
Recommended

4
Volume: 10,000+ machines
Pattern: Hybrid Edge-Cloud with Data Sovereignty
🏗️ Architecture:
  • Edge autonomy (local decision-making)
  • Private cloud per region (data residency)
  • Hierarchical data aggregation
  • Federated learning for model training
  • Multi-cloud (AWS + Azure for redundancy)
Cost & Performance: $50,000+/month, <20ms latency
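
Choosing a deployment pattern from the fleet size reduces to a threshold lookup. A sketch only: the volume ranges above overlap at 100, 1,000, and 10,000 machines, so assigning a boundary value to the smaller tier is an assumption, as are the names `TIERS` and `pick_pattern`.

```python
# (upper bound on machine count, pattern name) taken from the four tiers above.
TIERS = [
    (100, "Edge + Serverless"),
    (1_000, "Edge + Containerized Services"),
    (10_000, "Multi-Region Edge + Cloud"),
]
LARGEST_PATTERN = "Hybrid Edge-Cloud with Data Sovereignty"


def pick_pattern(machines):
    """Return the deployment pattern for a fleet of `machines` machines."""
    if machines < 1:
        raise ValueError("machine count must be positive")
    for upper_bound, pattern in TIERS:
        if machines <= upper_bound:
            return pattern
    return LARGEST_PATTERN
```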

Key Integrations

OPC-UA (Industrial Protocol)

Protocol: OPC Unified Architecture
Edge gateway subscribes to OPC-UA server
Receives data change notifications
Translates to internal JSON format
Publishes to MQTT broker
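
The translation step between an OPC-UA data change and the MQTT publish can be sketched as a pure function. The topic layout, JSON field names, and the example node ID are hypothetical; a real gateway would call this from its OPC-UA subscription handler and pass the result to its MQTT client.

```python
import json
from datetime import datetime, timezone


def to_mqtt_message(plant, machine_id, node_id, value, source_time):
    """Convert one data change notification into an (MQTT topic, JSON payload) pair."""
    topic = f"factory/{plant}/{machine_id}/telemetry"
    payload = json.dumps(
        {
            "node_id": node_id,
            "value": value,
            # Normalize to UTC so plants in different time zones stay comparable.
            "ts": source_time.astimezone(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )
    return topic, payload
```

Keeping the translation free of network calls makes it easy to unit test separately from the protocol adapters.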

ERP Integration (SAP)

Protocol: REST API + RFC (Remote Function Call)
Query spare parts inventory
Create purchase requisitions
Update production orders
Sync maintenance costs

CMMS (Maximo)

Protocol: REST API
Create work orders from anomaly alerts
Retrieve maintenance history
Update work order status
Log completed maintenance

MES (Manufacturing Execution System)

Protocol: ISA-95 standard + REST
Receive production schedule
Report machine downtime
Sync quality control data
Update production metrics

Security & Compliance

Failure Modes & Recovery

| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| Edge gateway offline | Local buffering (24h), auto-reconnect, alert plant manager | Degraded real-time visibility, no data loss | 99.5% gateway uptime |
| ML model inference timeout | Use cached predictions, fallback to rule-based heuristics | Reduced accuracy, no critical alerts missed | 99.9% inference availability |
| Cloud connectivity loss | Edge autonomy mode (local ML inference, local alerts) | No cloud analytics, critical alerts still work | 99.0% cloud connectivity |
| Feature store unavailable | Use cached features (1h TTL), degrade to raw sensor data | Reduced prediction accuracy, increased latency | 99.9% feature store uptime |
| False positive spike | Guardrail agent increases threshold, human review queue | Reduced alert volume, potential missed anomalies | <5% false positive rate |
| Database write failure | Buffer writes in Kafka (7 days retention), retry with backoff | Delayed analytics, no real-time impact | 99.9% write availability |
| Model drift detected | Trigger retraining pipeline, flag predictions as low-confidence | Degraded accuracy, increased human review | Detect drift within 24h, retrain within 48h |
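
The 24h local buffering used as the gateway fallback can be sketched as a time-bounded queue that is drained on reconnect. `LocalBuffer` is a hypothetical in-memory illustration; a real gateway would persist to disk so that a power cycle does not lose the buffer.

```python
from collections import deque


class LocalBuffer:
    """Keeps readings for at most `retention_s` seconds while the uplink is down."""

    def __init__(self, retention_s=24 * 3600):
        self.retention_s = retention_s
        self._items = deque()

    def append(self, ts, reading):
        """Store a reading and evict anything older than the retention window."""
        self._items.append((ts, reading))
        while self._items and ts - self._items[0][0] > self.retention_s:
            self._items.popleft()

    def drain(self):
        """Return all buffered readings in arrival order and empty the buffer."""
        items = list(self._items)
        self._items.clear()
        return items
```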

RAG vs Fine-Tuning

Hallucination Detection

LLMs hallucinate failure modes, recommend non-existent parts, or suggest unsafe maintenance procedures
L1: Confidence scoring (reject if <0.8)
L2: Cross-reference parts database (validate part numbers exist)
L3: Safety policy checker (flag unsafe procedures)
L4: Human expert review for high-risk recommendations
0.2% hallucination rate, 100% caught before reaching technicians
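
The four validation layers can be chained so that a recommendation stops at the first check it fails. A sketch under assumptions: the record fields, verdict strings, and the keyword match standing in for a safety policy engine are all hypothetical.

```python
def review_recommendation(rec, known_parts, unsafe_terms, min_confidence=0.8):
    """Run the four checks in order and return the first verdict that applies."""
    # L1: confidence scoring
    if rec["confidence"] < min_confidence:
        return "reject_low_confidence"
    # L2: every referenced part number must exist in the parts database
    if any(part not in known_parts for part in rec["part_numbers"]):
        return "reject_unknown_part"
    # L3: safety policy check (a keyword match stands in for a real policy engine)
    procedure = rec["procedure"].lower()
    if any(term in procedure for term in unsafe_terms):
        return "flag_unsafe_procedure"
    # L4: high-risk recommendations go to a human expert
    if rec["high_risk"]:
        return "needs_expert_review"
    return "accept"
```

Ordering the cheap checks first means most bad recommendations never reach the human review queue.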

Evaluation Framework

  • Anomaly Detection Precision: 96.2% (target: 95%+)
  • Anomaly Detection Recall: 92.1% (target: 90%+)
  • Failure Prediction Accuracy: 87.3% (target: 85%+)
  • Lead Time (hours before failure): 48h median (target: 24-72h)
  • False Positive Rate: 3.8% (target: <5%)
Testing: Shadow mode: 3 months parallel with manual inspections. Backtesting: 6 months historical data with known failures.
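
During shadow mode or backtesting, the detection metrics above come from confusion-matrix counts. A minimal sketch; the article does not say how its false positive rate is computed, so the standard FP / (FP + TN) definition used here is an assumption.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, and false positive rate from confusion-matrix counts."""
    return {
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "false_positive_rate": _ratio(fp, fp + tn),
    }
```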

Dataset Curation

1. Collect: 50K failure events + 2M hours sensor data (export from CMMS + time-series DB)
2. Clean: 45K usable events (remove duplicates, filter incomplete records)
3. Label: 45K labeled with failure mode, root cause, severity ($90K)
4. Augment: +15K synthetic for rare failure modes (SMOTE for imbalanced classes, physics-based simulation)
60K high-quality examples, balanced across 12 failure modes (Cohen's Kappa: 0.89)
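
Cohen's Kappa measures how much two annotators agree beyond what chance would produce. A sketch of the standard two-annotator formula; the function name and the example labels are illustrative, not drawn from the dataset described here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```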

Agentic RAG

Planner agent iteratively retrieves context based on reasoning chain
Anomaly detected in bearing vibration → RAG retrieves bearing specs → Agent reasons 'need lubrication history' → RAG retrieves maintenance logs → Agent reasons 'check for misalignment' → RAG retrieves alignment procedure → Final recommendation with full context.
💡 Not one-shot retrieval. Agent builds understanding through multi-step reasoning, retrieves only what's needed, reduces hallucination by grounding in facts.
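
The retrieve-reason-retrieve loop described above can be sketched with two injected callables. `retrieve` and `next_query` are hypothetical stand-ins for the vector store lookup and the LLM reasoning step, and `max_steps` is an assumed safety cap.

```python
def agentic_retrieve(first_query, retrieve, next_query, max_steps=5):
    """Gather context step by step until the reasoner stops asking.

    `retrieve(query)` returns one piece of context; `next_query(context)`
    returns the next thing to look up, or None when it has enough.
    """
    context = []
    query = first_query
    for _ in range(max_steps):
        if query is None:
            break
        context.append(retrieve(query))
        query = next_query(context)
    return context
```

The step cap bounds latency and cost if the reasoner keeps asking for more context.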

Federated Learning

Tech Stack Summary

Edge Compute
Raspberry Pi 4 (pilot), Dell Edge Gateway 5200 (production), NVIDIA Jetson Nano (vision workloads)
Protocol Adapters
node-opcua (OPC-UA), pymodbus (Modbus), Eclipse Paho (MQTT)
LLMs
GPT-4 (planning), Claude 3.5 Sonnet (explanations), Fine-tuned BERT (classification)
Orchestration
LangGraph (agent coordination), Apache Airflow (batch pipelines), Temporal (workflow engine)
Time-Series DB
TimescaleDB (PostgreSQL extension), InfluxDB (high-cardinality metrics)
Event Streaming
Apache Kafka (enterprise), AWS Kinesis (startup), MQTT (edge-to-cloud)
ML Framework
PyTorch (deep learning), scikit-learn (classical ML), TensorFlow Lite (edge inference)
Feature Store
Feast (open-source), Tecton (enterprise)
Model Registry
MLflow (tracking + registry), SageMaker Model Registry
Compute
AWS Lambda (serverless), EKS (Kubernetes), EC2 (batch training)
Monitoring
Prometheus (metrics), Loki (logs), Tempo (traces), Grafana (dashboards)
Security
HashiCorp Vault (secrets), Okta (SSO), AWS KMS (encryption)
🏭

Ready to Build Your Smart Factory System?

We'll architect your edge-to-cloud IoT platform, from 10 machines to enterprise scale. Multi-agent orchestration, real-time analytics, and ML-driven predictive maintenance.

© 2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.