From prompts to an enterprise data governance platform.
Monday: 3 core prompts for metadata extraction, quality checks, and policy enforcement. Tuesday: automated agents for data cataloging. Wednesday: team workflows across data engineers, compliance officers, and business users. Thursday: complete technical architecture with 9 layers, multi-agent orchestration, ML-driven quality checks, and scaling from startup to 10M+ records/day.
Key Assumptions
System Requirements
Functional
- Automated metadata extraction from structured and unstructured sources
- Real-time data quality scoring with ML-driven anomaly detection
- Policy engine for access control, PII redaction, and retention rules
- End-to-end data lineage tracking with column-level granularity
- Self-service data catalog with search, tagging, and business glossary
- Automated compliance reporting (GDPR Article 30, CCPA disclosures)
- Multi-agent orchestration for complex governance workflows
Non-Functional (SLOs)
💰 Cost Targets: $0.0001 per record, $50 per user per month, $20 per source per month
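To make these targets concrete, the arithmetic below estimates a monthly bill from the three unit costs. This is a rough sketch: the helper name `estimate_monthly_cost`, the 30-day month, the example user and source counts, and the idea that the three targets simply add up are all assumptions, not something the source states.

```python
# Unit-cost targets taken from the Cost Targets line.
COST_TARGETS = {
    "per_record_usd": 0.0001,
    "per_user_month_usd": 50,
    "per_source_month_usd": 20,
}


def estimate_monthly_cost(records_per_day: int, users: int, sources: int, days: int = 30) -> float:
    """Hypothetical helper: add up the three targets for one month of usage."""
    record_cost = records_per_day * days * COST_TARGETS["per_record_usd"]
    user_cost = users * COST_TARGETS["per_user_month_usd"]
    source_cost = sources * COST_TARGETS["per_source_month_usd"]
    return round(record_cost + user_cost + source_cost, 2)


# 10M records/day (from the overview) with an assumed 200 users and 50 sources.
print(estimate_monthly_cost(10_000_000, users=200, sources=50))
```

At that volume the per-record term dominates, which is why per-record cost is worth monitoring separately.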
Agent Layer
planner
L4: Decomposes governance workflows into tasks and routes them to specialized agents
🔧 task_decomposer, resource_estimator, dependency_resolver
⚡ Recovery: if an agent is unavailable, queue the task with backoff; if the plan is invalid, fall back to a manual workflow; if resources are exhausted, throttle new requests
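One plausible way a dependency resolver could order decomposed tasks is a topological sort. The sketch below uses Python's standard `graphlib` module. The task names and the `order_tasks` helper are illustrative assumptions, not the planner's actual interface.

```python
from graphlib import CycleError, TopologicalSorter


def order_tasks(dependencies: dict[str, set[str]]) -> list[str]:
    """Return tasks in an order where every task follows its prerequisites."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # Corresponds to the "plan invalid" recovery path: the caller can
        # fall back to a manual workflow.
        raise ValueError(f"invalid plan, cyclic dependency: {exc.args[1]}") from exc


# Each key maps a task to the set of tasks it depends on (hypothetical names).
plan = {
    "extract_metadata": set(),
    "profile_data": {"extract_metadata"},
    "apply_policy": {"profile_data"},
    "build_lineage": {"extract_metadata"},
}
print(order_tasks(plan))
```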
executor
L3: Executes primary governance workflows (metadata extraction, policy application)
🔧 metadata_extractor, data_profiler, policy_applier, lineage_builder
⚡ Recovery: if the connection fails, retry 3x with exponential backoff; if extraction is incomplete, mark for manual review; if policies conflict, escalate to an admin
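The "retry 3x with exponential backoff" recovery rule can be sketched as follows. The function name, the choice of `ConnectionError` as the retryable error, and the injectable `sleep` parameter (useful for testing) are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op; on ConnectionError retry up to `retries` times, doubling the delay."""
    for attempt in range(retries + 1):
        try:
            return op()
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted: caller marks the source for manual review
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    raise AssertionError("unreachable when retries >= 0")
```

A caller would wrap the connector call, for example `retry_with_backoff(lambda: connector.fetch_schema())`, where `connector` stands in for whatever source client is in use.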
evaluator
L3: Validates data quality, policy compliance, and lineage accuracy
🔧 quality_scorer, anomaly_detector, compliance_checker, lineage_validator
⚡ Recovery: if a quality check fails, flag for re-profiling; if a compliance violation is found, trigger an alert and block access; if lineage is broken, rebuild from source
guardrail
L4: Enforces safety constraints (PII redaction, access control, rate limits)
🔧 pii_detector, access_control_engine, rate_limiter, audit_logger
⚡ Recovery: if PII detection fails, deny access (fail-safe); if access control is unavailable, deny and alert an admin; if the rate limit is exceeded, queue the request
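The fail-safe behaviour described for the guardrail agent (deny access whenever PII detection itself fails) might look like the sketch below. The two regular expressions are deliberately simplistic placeholders, and `redact` and `guarded_read` are hypothetical names. The source does not describe the detector's internals beyond NER, regex, and manual review.

```python
import re
from typing import Callable

# Illustrative patterns only (assumptions, not the platform's rules).
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),  # email-like strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),     # US SSN-like numbers
]


def redact(text: str) -> str:
    """Replace every match of every pattern with a fixed marker."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


def guarded_read(text: str, redactor: Callable[[str], str] = redact) -> str:
    """Return redacted text, or deny access if redaction could not run."""
    try:
        return redactor(text)
    except Exception as exc:
        # Fail closed: a detector failure is never treated as "no PII found".
        raise PermissionError("PII detection failed, access denied") from exc
```

The key design choice is that the error path raises instead of returning the raw text, so an outage degrades availability rather than privacy.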
metadata
L2: Extracts and enriches metadata from diverse data sources
🔧 schema_introspector, data_sampler, stats_calculator, llm_enricher (for descriptions)
⚡ Recovery: if the schema is locked, retry after a delay; if sampling fails, use metadata only; if the LLM is unavailable, skip enrichment
lineage
L3: Builds and maintains end-to-end data lineage graphs
🔧 sql_parser, log_analyzer, graph_builder, impact_analyzer
⚡ Recovery: if parsing fails, mark lineage as partial; if a cycle is detected, alert and break the cycle; if the graph is too large, paginate or sample
ML Layer
Feature Store
Update: Real-time for access patterns, hourly for quality scores, daily for lineage metrics
- data_quality_score (rolling 7-day avg)
- access_frequency (requests/day per table)
- pii_density (% of PII columns per table)
- lineage_depth (max hops from source)
- user_role_embedding (vector representation)
- policy_complexity_score (rule count + conditions)
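Two of the features listed above are simple enough to sketch directly. The code below computes `pii_density` and a rolling 7-day average for `data_quality_score`. The function and class names are assumptions, and it assumes one quality score is recorded per day.

```python
from collections import deque


def pii_density(column_is_pii: list[bool]) -> float:
    """Percentage of a table's columns that are flagged as PII."""
    if not column_is_pii:
        return 0.0
    return 100.0 * sum(column_is_pii) / len(column_is_pii)


class RollingAverage:
    """Mean of the most recent `window` scores (7 by default, one per day)."""

    def __init__(self, window: int = 7) -> None:
        self.scores: deque[float] = deque(maxlen=window)

    def add(self, score: float) -> float:
        # deque(maxlen=...) silently drops the oldest score once the window is full.
        self.scores.append(score)
        return sum(self.scores) / len(self.scores)
```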
Model Registry
Strategy: Semantic versioning with A/B testing for 7 days before promotion
- pii_detector
- quality_scorer
- anomaly_detector
- metadata_enricher
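The promotion strategy (semantic versioning plus a 7-day A/B test) could be expressed as a small gate like the one below. The `Candidate` fields and the rule that the candidate metric must match or beat the baseline are assumptions. The source only states the versioning scheme and the 7-day window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Candidate:
    name: str
    version: tuple[int, int, int]  # (major, minor, patch)
    ab_started: datetime
    candidate_metric: float        # assumed quality metric measured during the A/B test
    baseline_metric: float         # same metric for the currently promoted model


def can_promote(
    c: Candidate,
    current: tuple[int, int, int],
    now: datetime,
    min_days: int = 7,
) -> bool:
    """Promote only a newer version that has completed the A/B window without regressing."""
    ran_long_enough = now - c.ab_started >= timedelta(days=min_days)
    return c.version > current and ran_long_enough and c.candidate_metric >= c.baseline_metric
```

Comparing versions as integer tuples avoids the string-comparison trap where "1.10.0" sorts before "1.9.0".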
Observability Stack
Real-time monitoring, tracing & alerting
Deployment Variants
Startup Architecture
Fast to deploy, cost-efficient, scales to 100 competitors
Infrastructure
Risks & Mitigations
⚠️ Source credentials leaked or compromised
Medium. Mitigation: Store credentials in Secrets Manager with 90-day rotation. Use least-privilege IAM roles. Audit all credential access. Encrypt at rest with customer-managed keys.
⚠️ PII detector fails, exposing sensitive data
Low. Mitigation: Fail-safe: deny all access if the PII detector is unavailable. Multi-layer detection (NER + regex + manual review). Regular audits of PII detection accuracy. Immediate rollback if accuracy drops.
⚠️ Lineage graph becomes too large to query
High (at scale). Mitigation: Paginate graph queries. Limit depth to 10 hops. Add indexes on frequently queried paths. Consider graph sampling or pre-computed summaries. Offer filtering by date/source.
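The depth limit mentioned in this mitigation can be sketched as a bounded breadth-first traversal. The adjacency-dict representation and the `downstream` function name are assumptions. A production lineage store would more likely push this limit into the graph query itself.

```python
from collections import deque


def downstream(graph: dict[str, list[str]], start: str, max_depth: int = 10) -> dict[str, int]:
    """Return every node reachable from start within max_depth hops, with its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_depth:
            continue  # do not expand past the depth limit
        for child in graph.get(node, []):
            if child not in hops:  # the visited check also makes cycles harmless
                hops[child] = hops[node] + 1
                queue.append(child)
    del hops[start]
    return hops
```

For example, with `graph = {"raw": ["staging"], "staging": ["mart"]}`, calling `downstream(graph, "raw", max_depth=1)` returns only the `staging` node.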
⚠️ Agent orchestration deadlock or infinite loop
Medium. Mitigation: Timeout all agent calls (30s max). Circuit breaker pattern. Monitor agent utilization and kill hung processes. Retry with a different plan if deadlock is detected. Telemetry on agent execution time.
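A minimal circuit breaker, one of the mitigations named above, is sketched here. The thresholds, the class shape, and the injectable clock are assumptions. The wrapped call is expected to enforce its own 30-second timeout and raise on expiry.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Stop calling a failing agent for `reset_after` seconds after repeated failures."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open, agent call skipped")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Because the failure count is kept while half-open, a single failed trial call re-opens the circuit immediately.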
⚠️ Multi-tenant data leakage (tenant A sees tenant B's data)
Low (with proper design). Mitigation: Tenant ID in all queries (enforce at DB level). Row-level security in PostgreSQL. Separate schemas per tenant. Regular penetration testing. Audit all cross-tenant queries (should be zero).
⚠️ Cost explosion from LLM API usage
Medium (without guardrails). Mitigation: Rate limits per tenant. Cost budgets with alerts. Cache LLM responses. Use cheaper models for non-critical tasks. Batch requests where possible. Monitor cost per record.
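Two of these mitigations, per-tenant cost budgets and response caching, combine naturally in a thin wrapper around the LLM call. The sketch below is in-memory and uses a flat per-call cost. The class name, the flat cost model, and the decision to return `None` when over budget (mirroring "skip enrichment") are assumptions.

```python
from collections import defaultdict
from typing import Callable, Optional


class BudgetedEnricher:
    """Wrap an LLM call with a per-tenant spend cap and a response cache."""

    def __init__(self, llm_call: Callable[[str], str], cost_per_call_usd: float, budget_usd: float) -> None:
        self.llm_call = llm_call
        self.cost_per_call_usd = cost_per_call_usd
        self.budget_usd = budget_usd
        self.spent: dict[str, float] = defaultdict(float)
        self.cache: dict[tuple[str, str], str] = {}

    def describe(self, tenant: str, prompt: str) -> Optional[str]:
        key = (tenant, prompt)  # cache entries are scoped per tenant
        if key in self.cache:
            return self.cache[key]  # a cache hit costs nothing
        if self.spent[tenant] + self.cost_per_call_usd > self.budget_usd:
            return None  # over budget: skip enrichment for this tenant
        self.spent[tenant] += self.cost_per_call_usd
        self.cache[key] = self.llm_call(prompt)
        return self.cache[key]
```

Scoping the cache key by tenant trades some hit rate for the guarantee that one tenant's generated descriptions are never served to another.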
⚠️ Compliance drift (policies not enforced consistently)
Medium (as system grows). Mitigation: Automated policy testing (unit tests for policy logic). Regular compliance audits (quarterly). Policy change log with approval workflow. Guardrail agent validates all policy changes before deployment.
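"Unit tests for policy logic" can be as plain as the standard `unittest` module applied to a pure policy function. The retention rule below (`is_past_retention`) is a hypothetical example policy invented for illustration, not one taken from the platform.

```python
import unittest
from datetime import date, timedelta


def is_past_retention(created: date, today: date, retention_days: int) -> bool:
    """Hypothetical rule: a record expires once it is older than retention_days."""
    return today - created > timedelta(days=retention_days)


class RetentionPolicyTest(unittest.TestCase):
    def test_record_within_window_is_kept(self):
        # Exactly 30 days old with a 30-day window: still retained.
        self.assertFalse(is_past_retention(date(2026, 1, 1), date(2026, 1, 31), retention_days=30))

    def test_record_past_window_expires(self):
        # 31 days old with a 30-day window: expired.
        self.assertTrue(is_past_retention(date(2026, 1, 1), date(2026, 2, 1), retention_days=30))
```

Saved in a file, these tests run with `python -m unittest`, which makes them easy to wire into the approval workflow for policy changes.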
Evolution Roadmap
Progressive transformation from MVP to scale
0-3 months (MVP)
3-6 months (Scale)
6-12 months (Enterprise)
Complete Systems Architecture
9-layer architecture for enterprise data governance
- Presentation: 4 components
- API Gateway: 4 components
- Agent Layer: 6 components
- ML Layer: 4 components
- Integration: 4 components
- Data: 4 components
- External: 4 components
- Observability: 4 components
- Security: 4 components
Sequence Diagram - New Data Source Registration
Automated data flow every hour
Data Flow - Source Registration to Policy Enforcement
Key Integrations
Data Sources (Snowflake, Databricks, BigQuery, S3, etc.)
Identity Provider (Okta, Auth0, Azure AD)
Notification Services (Slack, Email, PagerDuty)
Audit & Compliance (Splunk, Datadog, CloudWatch)
Security & Compliance
Failure Modes & Recovery
| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| Source connection timeout | Retry 3x with exponential backoff (1s, 2s, 4s) → Queue for manual review | Delayed metadata extraction, no data loss | 99.5% |
| LLM API down (metadata enrichment) | Skip enrichment, use basic metadata only → Queue for later enrichment | Degraded UX (no descriptions), core functionality intact | 99.9% |
| PII detector fails | Deny all access (fail-safe) → Alert admin → Manual review | Service degraded, data protected | 100% (safety first) |
| Policy engine unavailable | Serve cached policy decisions (if available), otherwise deny all access → Alert admin | Service degraded, compliance maintained | 99.9% |
| Database unavailable | Read from replica (if available) → Queue writes → Alert admin | Read-only mode, no data loss | 99.9% |
| Lineage graph too large (>1M nodes) | Paginate graph queries → Limit depth to 10 hops → Suggest filtering | Slower lineage queries, functionality intact | 99.0% |
| Agent orchestration deadlock | Timeout after 30s → Rollback transaction → Retry with different plan | Delayed workflow completion, no data corruption | 99.5% |
                 ┌─────────────┐
                 │   Planner   │ ← Coordinates all agents
                 │    Agent    │
                 └──────┬──────┘
                        │
    ┌─────────┬─────────┼─────────┬─────────┐
    │         │         │         │         │
┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐ ┌───▼───┐
│ Exec  │ │ Eval  │ │ Guard │ │ Meta  │ │Lineage│
│ Agent │ │ Agent │ │ Agent │ │ Agent │ │ Agent │
└───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘
    │         │         │         │         │
    └─────────┴─────────┼─────────┴─────────┘
                        │
                    ┌───▼───┐
                    │ Data  │
                    │ Store │
                    └───────┘
Agent Collaboration Flow
Agent Types
Reactive Agent
Low: Metadata Agent - responds to an extraction request and returns metadata
Reflexive Agent
Medium: Guardrail Agent - uses rules + context to make security decisions
Deliberative Agent
High: Evaluator Agent - plans quality checks based on data type and history
Orchestrator Agent
Highest: Planner Agent - makes routing decisions, handles loops and retries
Levels of Autonomy
RAG vs Fine-Tuning for Metadata Enrichment
Hallucination Detection in Metadata Enrichment
Evaluation Framework for Governance Models
Dataset Curation for PII Detection
Agentic RAG for Policy Enforcement
Continuous Learning Loop
Tech Stack Summary
2026 Randeep Bhatia. All Rights Reserved.
No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.