
Data Governance System Architecture

From 10K to 10M+ records/day with automated compliance

August 14, 2025
Enterprise · Architecture · Scalable · Compliant

This Week's Journey

From prompts to enterprise data governance platform.

  • Monday: 3 core prompts for metadata extraction, quality checks, and policy enforcement.
  • Tuesday: automated agents for data cataloging.
  • Wednesday: team workflows across data engineers, compliance officers, and business users.
  • Thursday: complete technical architecture with 9 layers, multi-agent orchestration, ML-driven quality checks, and scaling from startup to 10M+ records/day.

Key Assumptions

1. Monitor 100-10,000 data sources (databases, APIs, files, streams)
2. Process 10K-10M records/day with sub-second policy enforcement
3. Multi-region deployment with data residency requirements (US, EU, APAC)
4. Compliance requirements: GDPR, CCPA, SOC2, HIPAA (healthcare), PCI-DSS (finance)
5. Integration with existing data stack (Snowflake, Databricks, BigQuery, S3, etc.)

System Requirements

Functional

  • Automated metadata extraction from structured and unstructured sources
  • Real-time data quality scoring with ML-driven anomaly detection
  • Policy engine for access control, PII redaction, and retention rules
  • End-to-end data lineage tracking with column-level granularity
  • Self-service data catalog with search, tagging, and business glossary
  • Automated compliance reporting (GDPR Article 30, CCPA disclosures)
  • Multi-agent orchestration for complex governance workflows

Non-Functional (SLOs)

  • Latency (p95): 500 ms
  • Freshness: 5 min
  • Availability: 99.9%

Cost Targets: $0.0001 per record, $50 per user per month, $20 per source per month
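To make the cost targets concrete, here is a minimal sketch that turns them into a monthly figure. It assumes the three targets are additive and that a month has 30 days; neither assumption is stated in the targets themselves, and the function name is hypothetical.

```python
# Unit prices copied from the cost targets above.
PER_RECORD_USD = 0.0001
PER_USER_MONTH_USD = 50
PER_SOURCE_MONTH_USD = 20


def monthly_cost_usd(records_per_day: int, users: int, sources: int, days: int = 30) -> float:
    """Sum the three cost-target components for one month (assumed additive)."""
    record_cost = records_per_day * days * PER_RECORD_USD
    return record_cost + users * PER_USER_MONTH_USD + sources * PER_SOURCE_MONTH_USD


if __name__ == "__main__":
    # 100K records/day, 10 users, 20 sources: 300 + 500 + 400 USD
    print(round(monthly_cost_usd(100_000, users=10, sources=20), 2))
```

Under these assumptions, per-record cost only starts to dominate once daily volume is well above the user and source counts multiplied by their monthly prices.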

Agent Layer

planner (L4)

Decomposes governance workflows into tasks and routes to specialized agents

Tools: task_decomposer, resource_estimator, dependency_resolver

Recovery: if agent unavailable, queue task with backoff; if plan invalid, fall back to manual workflow; if resources exhausted, throttle new requests

executor (L3)

Executes primary governance workflows (metadata extraction, policy application)

Tools: metadata_extractor, data_profiler, policy_applier, lineage_builder

Recovery: if connection fails, retry 3x with exponential backoff; if extraction incomplete, mark for manual review; if policy conflict, escalate to admin

evaluator (L3)

Validates data quality, policy compliance, and lineage accuracy

Tools: quality_scorer, anomaly_detector, compliance_checker, lineage_validator

Recovery: if quality check fails, flag for re-profiling; if compliance violation, trigger alert and block access; if lineage broken, rebuild from source

guardrail (L4)

Enforces safety constraints (PII redaction, access control, rate limits)

Tools: pii_detector, access_control_engine, rate_limiter, audit_logger

Recovery: if PII detection fails, deny access (fail-safe); if access control unavailable, deny and alert admin; if rate limit exceeded, queue request

metadata (L2)

Extracts and enriches metadata from diverse data sources

Tools: schema_introspector, data_sampler, stats_calculator, llm_enricher (for descriptions)

Recovery: if schema locked, retry after delay; if sampling fails, use metadata only; if LLM unavailable, skip enrichment

lineage (L3)

Builds and maintains end-to-end data lineage graphs

Tools: sql_parser, log_analyzer, graph_builder, impact_analyzer

Recovery: if parsing fails, mark lineage as partial; if cycle detected, alert and break cycle; if graph too large, paginate or sample

ML Layer

Feature Store

Update: Real-time for access patterns, hourly for quality scores, daily for lineage metrics

  • data_quality_score (rolling 7-day avg)
  • access_frequency (requests/day per table)
  • pii_density (% of PII columns per table)
  • lineage_depth (max hops from source)
  • user_role_embedding (vector representation)
  • policy_complexity_score (rule count + conditions)
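Two of the features listed above are simple enough to sketch directly. The snippet assumes per-column PII flags and one quality score per day are already available; the class and function names are hypothetical.

```python
from collections import deque


def pii_density(column_is_pii: list[bool]) -> float:
    """Percentage of columns in a table flagged as PII."""
    if not column_is_pii:
        return 0.0
    return 100.0 * sum(column_is_pii) / len(column_is_pii)


class RollingAverage:
    """Mean of the most recent `window` daily scores (7 by default)."""

    def __init__(self, window: int = 7) -> None:
        self._scores = deque(maxlen=window)  # oldest score drops out automatically

    def add(self, score: float) -> float:
        self._scores.append(score)
        return sum(self._scores) / len(self._scores)
```

Until the window fills, the average is taken over the days seen so far, which is one reasonable convention among several.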

Model Registry

Strategy: Semantic versioning with A/B testing for 7 days before promotion

  • pii_detector
  • quality_scorer
  • anomaly_detector
  • metadata_enricher

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (10 metrics) → Processing (aggregate and transform) → Dashboards (4 views) → Alerts (enabled)

Signals: Metrics (10), Logs (structured), Traces (distributed)

Metrics shown:

  • sources_registered
  • metadata_extraction_time_p95_ms
  • quality_score_avg
  • pii_detections
  • policy_violations
  • lineage_edges_created

Deployment Variants

Startup Architecture

Fast to deploy, cost-efficient, suited to volumes under 100K records/day

Infrastructure

  • Serverless (Lambda/Cloud Run)
  • Managed PostgreSQL (RDS/Cloud SQL)
  • Managed Redis (ElastiCache/Memorystore)
  • S3 for logs and backups
  • Single region (us-east-1 or eu-west-1)

→ Quick to deploy (<1 week)
→ Low operational overhead
→ Pay-per-use pricing
→ Good for <100K records/day
→ Single-tenant by default

Risks & Mitigations

⚠️ Source credentials leaked or compromised

Medium

βœ“ Mitigation: Store credentials in Secrets Manager with 90-day rotation. Use least-privilege IAM roles. Audit all credential access. Encrypt at rest with customer-managed keys.

⚠️ PII detector fails, exposing sensitive data

Low

βœ“ Mitigation: Fail-safe: Deny all access if PII detector unavailable. Multi-layer detection (NER + regex + manual review). Regular audits of PII detection accuracy. Immediate rollback if accuracy drops.

⚠️ Lineage graph becomes too large to query

High (at scale)

βœ“ Mitigation: Paginate graph queries. Limit depth to 10 hops. Add indexes on frequently queried paths. Consider graph sampling or pre-computed summaries. Offer filtering by date/source.
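The 10-hop depth limit can be illustrated with a bounded breadth-first walk. This sketch uses an in-memory adjacency mapping as a stand-in for the graph database; the function name and data layout are assumptions.

```python
from collections import deque


def downstream(edges: dict[str, list[str]], start: str, max_hops: int = 10) -> dict[str, int]:
    """Breadth-first walk returning {node: hops from start}, stopping at max_hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in edges.get(node, []):
            if nxt not in hops:  # first visit is the shortest path in BFS
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops
```

Because each node is recorded once with its shortest hop count, the result is also safe to paginate by sorting on that count.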

⚠️ Agent orchestration deadlock or infinite loop

Medium

βœ“ Mitigation: Timeout all agent calls (30s max). Circuit breaker pattern. Monitor agent utilization and kill hung processes. Retry with different plan if deadlock detected. Telemetry on agent execution time.
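A minimal circuit breaker, one of the mitigations named above, might look like this. It only covers the breaker; enforcing the 30s call timeout is left to the caller. The thresholds, the cool-down value, and all names are assumptions for illustration.

```python
import time


class CircuitOpen(Exception):
    """Raised when calls are rejected without reaching the agent."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so the cool-down can be tested without sleeping
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpen("agent calls suspended")
            # Cool-down elapsed: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, a hung agent costs nothing further: callers fail fast and the planner can try a different plan.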

⚠️ Multi-tenant data leakage (tenant A sees tenant B's data)

Low (with proper design)

βœ“ Mitigation: Tenant ID in all queries (enforce at DB level). Row-level security in PostgreSQL. Separate schemas per tenant. Regular penetration testing. Audit all cross-tenant queries (should be zero).

⚠️ Cost explosion from LLM API usage

Medium (without guardrails)

βœ“ Mitigation: Rate limits per tenant. Cost budgets with alerts. Cache LLM responses. Use cheaper models for non-critical tasks. Batch requests where possible. Monitor cost per record.
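Response caching and per-tenant budgets can be combined in one small wrapper. This is a sketch only: `complete` stands in for whichever LLM client is used, the flat per-call cost is a simplification, and every name here is hypothetical.

```python
import hashlib


class BudgetExceeded(Exception):
    """Raised when a tenant has no budget left for another LLM call."""


class CachedLLM:
    """Wraps a `complete(prompt) -> str` callable with a cache and a per-tenant budget."""

    def __init__(self, complete, cost_per_call_usd, budget_usd):
        self.complete = complete
        self.cost_per_call_usd = cost_per_call_usd
        self.budget_usd = budget_usd
        self.cache = {}
        self.spent = {}

    def ask(self, tenant, prompt):
        # The tenant is part of the key so cached answers never cross tenants.
        key = hashlib.sha256(f"{tenant}\x00{prompt}".encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hit: no spend
        spent = self.spent.get(tenant, 0.0)
        if spent + self.cost_per_call_usd > self.budget_usd:
            raise BudgetExceeded(tenant)
        answer = self.complete(prompt)
        self.spent[tenant] = spent + self.cost_per_call_usd
        self.cache[key] = answer
        return answer
```

Keying the cache by tenant gives up some hit rate, but it keeps the cost control from undermining tenant isolation.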

⚠️ Compliance drift (policies not enforced consistently)

Medium (as system grows)

βœ“ Mitigation: Automated policy testing (unit tests for policy logic). Regular compliance audits (quarterly). Policy change log with approval workflow. Guardrail agent validates all policy changes before deployment.

Evolution Roadmap

Progressive transformation from MVP to scale

Phase 1 (Q1 2025): 0-3 months (MVP)

1. Launch with 5-10 data sources
2. Core metadata extraction and cataloging
3. Basic policy enforcement (access control, PII redaction)
4. Single-tenant deployment

Phase 2 (Q2 2025): 3-6 months (Scale)

1. Support 50+ data sources
2. Add data lineage tracking
3. ML-driven quality scoring
4. Multi-tenant architecture

Phase 3 (Q3-Q4 2025): 6-12 months (Enterprise)

1. Support 500+ data sources
2. Multi-region deployment
3. Advanced ML (agentic RAG, anomaly detection)
4. SOC2 Type II certification

Production Ready

Complete Systems Architecture

9-layer architecture for enterprise data governance

1. Presentation (4 components): Data Catalog UI, Compliance Dashboard, Admin Portal, API Explorer
2. API Gateway (4 components): GraphQL API, REST API, Rate Limiter, Auth Middleware
3. Agent Layer (6 components): Planner Agent, Executor Agent, Evaluator Agent, Guardrail Agent, Metadata Agent, Lineage Agent
4. ML Layer (4 components): Feature Store, Model Registry, Inference Engine, Evaluation Loop
5. Integration (4 components): Source Connectors, Data Profiler, Policy Engine, Notification Service
6. Data (4 components): Metadata Store (PostgreSQL), Graph DB (Neo4j), Vector DB (Pinecone), Cache (Redis)
7. External (4 components): Data Sources, Identity Provider, Notification APIs, Audit Services
8. Observability (4 components): Metrics (Prometheus), Logs (ELK), Traces (Jaeger), Dashboards (Grafana)
9. Security (4 components): KMS, Secret Manager, WAF, Audit Logger

Sequence Diagram - New Data Source Registration

Participants: User, API Gateway, Planner Agent, Metadata Agent, Lineage Agent, Guardrail Agent, Policy Engine, Metadata Store

Messages, in order:

1. POST /sources (connection details)
2. registerSource(config)
3. validateConfig()
4. config_valid=true
5. extractMetadata(source)
6. INSERT tables, columns, stats
7. buildLineage(source_id)
8. INSERT lineage graph
9. applyPolicies(source_id)
10. UPDATE access_rules, retention
11. 201 Created (source_id, metadata)

Data Flow - Source Registration to Policy Enforcement

1. User (0s): Registers new data source → Connection config (host, port, credentials)
2. API Gateway (50ms): Validates auth and rate limits → JWT token + request payload
3. Planner Agent (200ms): Creates task plan → Task graph (extract → profile → lineage → policy)
4. Guardrail Agent (100ms): Validates connection config → Config + security rules
5. Metadata Agent (3s): Connects to source and extracts schema → Tables, columns, data types, row counts
6. Metadata Agent (2s): Samples data and calculates stats → Distinct values, nulls, min/max, histograms
7. Metadata Agent (4s): Enriches with LLM-generated descriptions → Business-friendly column descriptions
8. Evaluator Agent (500ms): Runs quality checks → Quality scores (0-100) per column
9. Evaluator Agent (300ms): Detects PII → PII labels (email, SSN, credit card, etc.)
10. Lineage Agent (2s): Parses SQL queries from logs → Source → target relationships
11. Lineage Agent (1s): Builds lineage graph in Neo4j → Graph edges with transformation metadata
12. Executor Agent (500ms): Applies default policies → Access rules, retention periods, redaction rules
13. Guardrail Agent (100ms): Audits policy application → Audit log entry
14. API Gateway (50ms): Returns success response → Source ID + metadata summary
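Adding up the step durations shows where registration time goes. The sketch below reads the durations as strictly sequential, which is an assumption; any overlap between agents would shorten the total.

```python
# Step durations in milliseconds, copied from the data-flow steps above.
STEP_MS = [
    ("User", 0), ("API Gateway", 50), ("Planner Agent", 200), ("Guardrail Agent", 100),
    ("Metadata Agent", 3000), ("Metadata Agent", 2000), ("Metadata Agent", 4000),
    ("Evaluator Agent", 500), ("Evaluator Agent", 300),
    ("Lineage Agent", 2000), ("Lineage Agent", 1000),
    ("Executor Agent", 500), ("Guardrail Agent", 100), ("API Gateway", 50),
]


def total_ms(steps):
    """End-to-end time if every step runs one after another."""
    return sum(ms for _, ms in steps)


def by_component(steps):
    """Total time attributed to each participant."""
    totals = {}
    for name, ms in steps:
        totals[name] = totals.get(name, 0) + ms
    return totals


if __name__ == "__main__":
    print(total_ms(STEP_MS))                        # 13800
    print(by_component(STEP_MS)["Metadata Agent"])  # 9000
```

Read this way, the 14 steps take 13.8s end to end, and the Metadata Agent accounts for 9s of it, so schema extraction, sampling, and LLM enrichment are the first places to look for savings.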
1. Serverless Monolith
   Volume: 10K-100K records/day
   Architecture: AWS Lambda or Cloud Run; PostgreSQL (managed); Redis (managed); S3 for logs
   Cost: $200/mo
   Performance: 3-5s per source

2. Queue + Workers
   Volume: 100K-1M records/day
   Architecture: API server (ECS/GKE); Message queue (SQS/Pub/Sub); Worker pool (auto-scaling); PostgreSQL + Redis; Neo4j for lineage
   Cost: $800/mo
   Performance: 1-3s per source

3. Multi-Agent Orchestration (Recommended)
   Volume: 1M-10M records/day
   Architecture: Load balancer; Agent framework (LangGraph); Event bus (Kafka); Distributed cache (Redis Cluster); Sharded PostgreSQL; Neo4j cluster
   Cost: $3K/mo
   Performance: 500ms-2s per source

4. Enterprise Multi-Region
   Volume: 10M+ records/day
   Architecture: Kubernetes (EKS/GKE); Multi-region event streaming (Kafka); Distributed tracing (Jaeger); Multi-region DB replication; CDN for UI; Dedicated ML inference cluster
   Cost: $10K+/mo
   Performance: 200-500ms per source
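Picking a pattern from daily volume is a threshold lookup. In this sketch the tier boundaries overlap in the source ranges (100K appears in both of the first two), so boundary values are assigned to the lower tier; that tie-break and the function name are assumptions.

```python
import bisect

# Upper bounds in records/day and pattern names, taken from the four tiers above.
TIER_BOUNDS = [100_000, 1_000_000, 10_000_000]
TIER_PATTERNS = [
    "Serverless Monolith",
    "Queue + Workers",
    "Multi-Agent Orchestration",
    "Enterprise Multi-Region",
]


def pattern_for(records_per_day: int) -> str:
    """Map a daily record volume to the matching architecture pattern."""
    # bisect_left keeps a volume equal to a bound in the lower tier.
    return TIER_PATTERNS[bisect.bisect_left(TIER_BOUNDS, records_per_day)]
```

For example, `pattern_for(500_000)` returns `"Queue + Workers"`.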

Key Integrations

Data Sources (Snowflake, Databricks, BigQuery, S3, etc.)

Protocol: JDBC/ODBC, REST APIs, S3 SDK
Connect via secure credentials
Query information_schema for metadata
Sample data for profiling
Extract query logs for lineage

Identity Provider (Okta, Auth0, Azure AD)

Protocol: SAML 2.0, OIDC
User authenticates via SSO
IdP returns JWT with roles/groups
API Gateway validates JWT
RBAC enforced per request

Notification Services (Slack, Email, PagerDuty)

Protocol: Webhooks, SMTP, REST APIs
Policy violation detected
Notification service triggered
Alert sent to relevant stakeholders
Audit log updated

Audit & Compliance (Splunk, Datadog, CloudWatch)

Protocol: Syslog, REST APIs, agent-based
All governance actions logged
Logs forwarded to SIEM
Real-time dashboards and alerts
Long-term retention for compliance

Security & Compliance

Failure Modes & Recovery

| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| Source connection timeout | Retry 3x with exponential backoff (1s, 2s, 4s) → Queue for manual review | Delayed metadata extraction, no data loss | 99.5% |
| LLM API down (metadata enrichment) | Skip enrichment, use basic metadata only → Queue for later enrichment | Degraded UX (no descriptions), core functionality intact | 99.9% |
| PII detector fails | Deny all access (fail-safe) → Alert admin → Manual review | Service degraded, data protected | 100% (safety first) |
| Policy engine unavailable | Deny all access → Serve cached policy decisions (if available) → Alert admin | Service degraded, compliance maintained | 99.9% |
| Database unavailable | Read from replica (if available) → Queue writes → Alert admin | Read-only mode, no data loss | 99.9% |
| Lineage graph too large (>1M nodes) | Paginate graph queries → Limit depth to 10 hops → Suggest filtering | Slower lineage queries, functionality intact | 99.0% |
| Agent orchestration deadlock | Timeout after 30s → Rollback transaction → Retry with different plan | Delayed workflow completion, no data corruption | 99.5% |
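The connection-timeout fallback (retry 3x with 1s, 2s, 4s delays) can be sketched as below. It reads "3x" as three retries after the first attempt, which is an interpretation; the function name and the injectable `sleep` parameter are illustrative.

```python
import time


def retry_with_backoff(fn, retries=3, base_delay_s=1.0, sleep=time.sleep):
    """Call fn(); on a connection error or timeout, wait 1s, 2s, 4s between retries."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # retries exhausted: the caller queues the source for manual review
            sleep(base_delay_s * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the delays observable in tests without actually waiting.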
System Architecture
              ┌─────────────┐
              │   Planner   │ ← Coordinates all agents
              │    Agent    │
              └──────┬──────┘
                     │
   ┌────────┬────────┼─────────┬─────────┐
   │        │        │         │         │
┌──▼──┐  ┌──▼──┐  ┌──▼───┐  ┌──▼───┐  ┌──▼────┐
│Exec │  │Eval │  │Guard │  │Meta  │  │Lineage│
│Agent│  │Agent│  │Agent │  │Agent │  │ Agent │
└──┬──┘  └──┬──┘  └──┬───┘  └──┬───┘  └──┬────┘
   │        │        │         │         │
   └────────┴────────┼─────────┴─────────┘
                     │
                 ┌───▼────┐
                 │  Data  │
                 │ Store  │
                 └────────┘

Agent Collaboration Flow

1. Planner Agent: Receives source registration request, creates task plan (extract → profile → lineage → policy)
2. Guardrail Agent: Validates connection config, checks security rules → Returns validation result
3. Metadata Agent: Connects to source, extracts schema, samples data → Returns metadata JSON
4. Evaluator Agent: Runs quality checks, detects PII → Returns quality scores and PII labels
5. Lineage Agent: Parses SQL logs, builds lineage graph → Stores in Neo4j
6. Executor Agent: Applies default policies (access control, retention) → Updates policy engine
7. Planner Agent: Aggregates results, returns success response to user
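The planner's task plan is a small dependency graph, and Python's standard `graphlib` can order it. The task names below are shorthand for the steps above and are assumptions, as is the idea that the planner stores dependencies in this form.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on (extract → profile → lineage → policy),
# with config validation placed first as in step 2 of the flow above.
PLAN = {
    "validate_config": set(),
    "extract": {"validate_config"},
    "profile": {"extract"},
    "lineage": {"profile"},
    "policy": {"lineage"},
}


def execution_order(plan):
    """Return tasks in an order that respects every dependency."""
    return list(TopologicalSorter(plan).static_order())


if __name__ == "__main__":
    print(execution_order(PLAN))
```

`TopologicalSorter` raises `graphlib.CycleError` when the plan contains a circular dependency, which gives the planner a cheap validity check before dispatching any agent.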

🎭Agent Types

Reactive Agent

Low

Metadata Agent - Responds to extraction request, returns metadata

Stateless

Reflexive Agent

Medium

Guardrail Agent - Uses rules + context to make security decisions

Reads context (user role, resource sensitivity)

Deliberative Agent

High

Evaluator Agent - Plans quality checks based on data type and history

Stateful (tracks quality trends)

Orchestrator Agent

Highest

Planner Agent - Makes routing decisions, handles loops and retries

Full state management (task graph, agent status)

Levels of Autonomy

  • L1 Tool: Human calls, agent responds → Monday's prompts (manual metadata extraction)
  • L2 Chained Tools: Sequential execution → Tuesday's code (extract → validate → store)
  • L3 Agent: Makes decisions, can loop → Evaluator Agent (decides which quality checks to run)
  • L4 Multi-Agent: Agents collaborate autonomously → This system (Planner coordinates 5 specialized agents)

RAG vs Fine-Tuning for Metadata Enrichment

Business glossaries and data dictionaries change frequently. RAG allows daily updates without retraining. Fine-tuning requires weeks of work and $5K+ per iteration.
RAG (Chosen)
  • Cost: $100/mo (vector DB + embeddings)
  • Update: Daily (add new docs to vector DB)

Fine-Tuning (Not chosen)
  • Cost: $5K/iteration + $500/mo inference
  • Update: Monthly (retrain on new examples)

Implementation: Vector DB (Pinecone) with company glossary, industry terms, and example descriptions. Retrieved during metadata extraction. GPT-4 generates descriptions with retrieved context.

Hallucination Detection in Metadata Enrichment

LLMs hallucinate column descriptions, inventing business context that doesn't exist.
  • L1: Confidence scores from LLM (logprobs). Flag if <0.7.
  • L2: Cross-reference with glossary. Flag if terms not found.
  • L3: Semantic similarity check. Flag if description doesn't match column name/type.
  • L4: Human review queue for flagged descriptions.

Results: 0.8% hallucination rate, 95% caught by L1-L3, 100% caught by L4
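The first three layers can be approximated with cheap checks that feed the human review queue. In this sketch, L2 flags a description that shares no word with the glossary, and L3 uses plain token overlap with the column name as a crude stand-in for a real semantic similarity model. Both simplifications, and the function name, are assumptions.

```python
import re


def review_flags(description, confidence, column_name, glossary, min_confidence=0.7):
    """Return the layers (L1-L3) that flag this description for human review."""
    flags = []
    if confidence < min_confidence:
        flags.append("L1")  # low model confidence
    words = set(re.findall(r"[a-z]+", description.lower()))
    if not words & {term.lower() for term in glossary}:
        flags.append("L2")  # no glossary term appears in the description
    name_tokens = set(re.findall(r"[a-z]+", column_name.lower()))
    if not words & name_tokens:
        flags.append("L3")  # description shares nothing with the column name
    return flags
```

Any non-empty result would send the description to the L4 review queue. An empty list means the description passed all automated layers, not that it is guaranteed correct.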

Evaluation Framework for Governance Models

  • PII Detection Precision: 97.2% (target: 98%+)
  • PII Detection Recall: 96.8% (target: 95%+)
  • Quality Score MAE: 3.2 points (target: <5 points)
  • Lineage Accuracy: 94.1% (target: 95%+)
  • Metadata Enrichment Satisfaction: 4.2/5.0 (target: 4.0/5.0)

Testing: Shadow mode. Run new models in parallel with production for 7 days, compare outputs, and promote if metrics improve by >5% with no regressions.
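The promotion rule leaves some details open, so the sketch below fixes one reading of it: gains are relative, no metric may regress, and at least one must improve by more than 5%. That reading, the function name, and the need for non-zero production values are all assumptions.

```python
def should_promote(prod, candidate, higher_is_better, min_gain=0.05):
    """Promote when no metric regresses and at least one improves by more than min_gain."""
    improved = False
    for name, old in prod.items():
        new = candidate[name]
        gain = (new - old) / abs(old)  # relative change; assumes old != 0
        if not higher_is_better[name]:
            gain = -gain  # for error metrics such as MAE, lower is better
        if gain < 0:
            return False  # any regression blocks promotion
        if gain > min_gain:
            improved = True
    return improved
```

For metrics already near their ceiling, a 5% relative gain may be unreachable, so an absolute threshold per metric could be a better fit in practice.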

Dataset Curation for PII Detection

1. Collect: 50K columns from real sources (de-identified and anonymized)
2. Clean: 42K usable after removing duplicates and invalid data (automated + manual review)
3. Label: 42K labeled as PII vs non-PII ($21K, data labeling service)
4. Augment: +8K synthetic examples via edge-case generation (rare PII types, obfuscated formats)

Result: 50K high-quality labeled examples. Cohen's Kappa: 0.94 (excellent inter-annotator agreement).
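Cohen's kappa, the agreement statistic quoted above, compares observed agreement between two annotators with the agreement expected by chance. The sketch below is a plain implementation of the standard formula; the sample labels are made up for illustration and are not from the dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["pii", "pii", "non", "non"]  # hypothetical labels
    annotator_2 = ["pii", "non", "non", "non"]
    print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

The formula is undefined when chance agreement is 1, which happens if both annotators assign the same single label to every item.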

Agentic RAG for Policy Enforcement

Agent iteratively retrieves policy rules based on reasoning about data context.
User requests access to 'customer_email' column. Agent reasons: 'This is PII' → RAG retrieves GDPR rules → Agent reasons: 'Need consent check' → RAG retrieves consent records → Agent generates decision: 'Allow with redaction if no consent'.
Not one-shot retrieval. Agent decides what rules to fetch based on data context. Handles complex, multi-step policy logic.
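The iterative retrieve-then-reason loop can be shown with a toy stand-in. Here a dictionary plays the role of the vector store and the reasoning step is hard-coded; a real system would use an LLM and a retriever for both. The query strings, `KB`, `retrieve`, and `decide_access` are hypothetical.

```python
# Toy knowledge base: each retrieval yields new facts and the next thing to look up.
KB = {
    "classification:customer_email": {"facts": {"is_pii": True}, "next_query": "gdpr:pii"},
    "gdpr:pii": {"facts": {"needs_consent": True}, "next_query": "consent:lookup"},
}


def retrieve(query, facts):
    """Stand-in for retrieval plus reasoning; returns a rule or a final decision."""
    if query == "consent:lookup":
        return {"decision": "allow" if facts.get("has_consent") else "allow_with_redaction"}
    return KB[query]


def decide_access(column, facts, retrieve, max_steps=5):
    """Retrieve rules step by step until a decision is reached; deny otherwise."""
    trace = []
    query = f"classification:{column}"
    for _ in range(max_steps):
        rule = retrieve(query, facts)
        trace.append(query)
        if "decision" in rule:
            return rule["decision"], trace
        facts.update(rule.get("facts", {}))  # what was learned shapes the next query
        query = rule["next_query"]
    return "deny", trace  # step budget exhausted: fail closed
```

The step limit matters as much as the loop: it bounds cost and falls back to deny, in line with the fail-safe behavior used elsewhere in the system.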

Continuous Learning Loop

Tech Stack Summary

  • LLMs: GPT-4 (OpenAI), Claude 3.5 (Anthropic), or Gemini (Google)
  • Orchestration: LangGraph, CrewAI, or custom agent framework
  • Database: PostgreSQL (metadata), Neo4j (lineage graphs), Redis (cache)
  • Queue: AWS SQS, Google Pub/Sub, or Apache Kafka
  • Compute: Kubernetes (EKS/GKE), AWS Lambda, or Google Cloud Run
  • ML Inference: TensorFlow Serving, TorchServe, or AWS SageMaker
  • Vector DB: Pinecone, Weaviate, or Qdrant
  • Monitoring: Prometheus + Grafana, Datadog, or New Relic
  • Logging: ELK Stack (Elasticsearch, Logstash, Kibana) or Splunk
  • Tracing: Jaeger, Zipkin, or AWS X-Ray
  • Security: AWS KMS, HashiCorp Vault, Cloudflare WAF

© 2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.