From prompts to a production credit risk system.
Monday: 3 core prompts for credit assessment. Tuesday: automated code for data extraction and scoring. Wednesday: team workflows across risk, compliance, and engineering. Thursday: complete technical architecture with AI agents, ML pipelines, real-time scoring, and regulatory compliance for 100,000+ applications daily.
Key Assumptions
System Requirements
Functional
- Ingest applicant data from web forms, APIs, and partner integrations with schema validation
- Extract and engineer 200+ features from structured and unstructured data sources
- Score applications in real-time using ensemble ML models and business rules
- Generate explainable credit decisions with SHAP values and reason codes
- Orchestrate multi-step workflows: data validation → feature extraction → scoring → decisioning → adverse action
- Support manual review queues for edge cases and compliance spot-checks
- Maintain full audit trail of all decisions, model versions, and data lineage for 7 years
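Intake validation is the first gate in this pipeline. A minimal sketch of that check, assuming Pydantic v2 and purely illustrative field names (ApplicantPayload, ssn_last4); the real schema would mirror the web-form and partner-API contracts:

```python
# Sketch: schema validation at ingest. Field names are illustrative assumptions.
from pydantic import BaseModel, Field, ValidationError

class ApplicantPayload(BaseModel):
    applicant_id: str
    annual_income_usd: float = Field(gt=0)
    requested_amount_usd: float = Field(gt=0)
    employment_tenure_months: int = Field(ge=0)
    ssn_last4: str = Field(pattern=r"^\d{4}$")   # full SSN never stored in this payload

def ingest(raw: dict) -> ApplicantPayload | None:
    """Validate an inbound application; return None (and log) on schema failure."""
    try:
        return ApplicantPayload(**raw)
    except ValidationError as exc:
        print(f"Schema validation failed: {exc.errors()}")  # replace with structured logging
        return None
```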
Non-Functional (SLOs)
💰 Cost Targets: $0.15 per application, $0.001 per feature compute, $0.02 per model inference, $500 monthly infrastructure per 1,000 applications
Agent Layer
planner
L4: Decompose the credit application into subtasks and coordinate agent execution
🔧 Tools: task_decomposer, dependency_resolver, agent_selector
⚡ Recovery: Retry with exponential backoff (3 attempts); route to the manual review queue if planning fails; log failure context for debugging
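A minimal sketch of the planner's retry-then-escalate recovery path; planner.decompose and the queue interface are hypothetical stand-ins:

```python
import random
import time

def with_retry(fn, max_attempts=3, base_delay_s=0.5):
    """Retry a call with exponential backoff; re-raise after the final attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Backoff with jitter: ~0.5s, 1s, 2s.
            time.sleep(base_delay_s * 2 ** (attempt - 1) + random.uniform(0, 0.1))

def plan_application(application_id, planner, manual_review_queue):
    try:
        return with_retry(lambda: planner.decompose(application_id))
    except Exception as exc:
        # Planning failed after all retries: log context and park the case for a human.
        manual_review_queue.put({"application_id": application_id, "error": repr(exc)})
        return None
```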
executor
L3: Execute the credit workflow: validation → feature extraction → scoring → decisioning
🔧 Tools: data_validator, feature_extractor, model_inference_service, decision_engine
⚡ Recovery: Checkpoint intermediate results to resume on failure; fall back to rule-based scoring if ML inference fails; queue for manual review if a critical step fails
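A checkpointed executor sketch under those recovery rules; the stage list, checkpoint store, and rule-based scorer are assumed interfaces:

```python
def run_workflow(app_id, stages, checkpoint_store, rule_based_score):
    """Run validation → feature extraction → scoring → decisioning,
    checkpointing after each stage so a failed run resumes where it stopped."""
    state = checkpoint_store.load(app_id) or {}
    for name, stage_fn in stages:
        if name in state:                 # stage already completed in a prior attempt
            continue
        try:
            state[name] = stage_fn(app_id, state)
        except Exception as exc:
            if name == "scoring":
                state[name] = rule_based_score(app_id, state)   # degrade to rules
            else:
                state["error"] = repr(exc)
                checkpoint_store.save(app_id, state)
                raise                     # critical step failed; caller routes to manual review
        checkpoint_store.save(app_id, state)
    return state
```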
evaluator
L3: Validate outputs at each stage: data quality, feature validity, score sanity, compliance
🔧 Tools: schema_validator, statistical_test_runner, drift_detector, explainability_checker
⚡ Recovery: Flag low-confidence outputs for human review; trigger alerts if the validation failure rate exceeds 5%; automatically roll back the model if drift is detected
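One way the 5% alert threshold could be enforced, as a rolling-window check (sketch; the alerter interface is an assumption):

```python
from collections import deque

class ValidationMonitor:
    """Track recent validation outcomes; alert when the failure rate exceeds 5%."""
    def __init__(self, alerter, window=1000, threshold=0.05):
        self.outcomes = deque(maxlen=window)
        self.alerter = alerter
        self.threshold = threshold

    def record(self, passed: bool) -> None:
        self.outcomes.append(passed)
        if len(self.outcomes) < 100:                      # wait for a meaningful sample
            return
        rate = self.outcomes.count(False) / len(self.outcomes)
        if rate > self.threshold:
            self.alerter.fire(f"Validation failure rate {rate:.1%} exceeds 5%")
```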
guardrail
L4: Enforce compliance policies (FCRA, ECOA), detect bias, and prevent prohibited-basis discrimination
🔧 Tools: pii_redactor, bias_detector, prohibited_basis_checker, adverse_action_generator
⚡ Recovery: Block the decision if a policy violation is detected (hard stop); generate an adverse action notice with reason codes; escalate to a compliance officer for review
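The pii_redactor step has to run before any text reaches an external LLM. A regex-based sketch (patterns are illustrative, not exhaustive):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact_pii(text: str) -> str:
    """Replace SSNs, phone numbers, and emails with placeholder tokens
    before the text is sent to an external LLM."""
    text = SSN_RE.sub("[SSN]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return EMAIL_RE.sub("[EMAIL]", text)
```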
feature_extraction
L2: Extract and engineer 200+ features from raw applicant data and external sources
🔧 Tools: credit_bureau_api, bank_statement_parser, income_estimator, debt_to_income_calculator, feature_store_writer
⚡ Recovery: Use cached features if an external API fails (max 24-hour staleness); impute missing features with the population median; flag incomplete feature sets for manual review
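A sketch of the cache-then-impute fallback, assuming a simple key-value cache interface and precomputed population medians:

```python
import time

MAX_STALENESS_S = 24 * 3600   # cached external features may be at most 24 hours old

def get_feature(name, applicant_id, fetch_fn, cache, population_medians):
    """Return (value, provenance): fresh from the source, cached within 24h,
    or imputed with the population median when nothing usable is available."""
    try:
        value = fetch_fn(applicant_id)
        cache.set(name, applicant_id, value, ts=time.time())
        return value, "fresh"
    except Exception:
        cached = cache.get(name, applicant_id)
        if cached and time.time() - cached["ts"] <= MAX_STALENESS_S:
            return cached["value"], "cached"
        return population_medians[name], "imputed"   # flagged downstream for review
```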
scoring
L2: Generate a credit score using an ensemble of ML models and business rules
🔧 Tools: model_inference_service, ensemble_aggregator, explainability_engine, rule_engine
⚡ Recovery: Fall back to the single best model if the ensemble fails; use rule-based scoring if all ML models fail; route to manual underwriting if confidence < 70%
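An ensemble-with-degradation sketch; each model is assumed to expose a predict call returning a single default probability, and ensemble agreement stands in for a real confidence estimate:

```python
def score_application(app_id, features, models, rule_based_score, manual_review_queue):
    probs = []
    for model in models:                       # e.g. XGBoost, LightGBM, neural net
        try:
            probs.append(float(model.predict(features)))
        except Exception:
            continue                           # skip failed models, keep the rest
    if probs:
        score = sum(probs) / len(probs)
        confidence = 1.0 - (max(probs) - min(probs))   # high disagreement -> low confidence
    else:
        score, confidence = rule_based_score(features), 0.5   # all ML models down
    if confidence < 0.70:
        manual_review_queue.put({"application_id": app_id, "score": score})
    return score, confidence
```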
explainability
L2: Generate human-readable explanations for credit decisions (required for adverse actions)
🔧 Tools: shap_interpreter, reason_code_mapper, natural_language_generator, regulatory_template_engine
⚡ Recovery: Use template-based explanations if LLM generation fails; require human review if explanation confidence < 80%; log all explanations for the audit trail
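A sketch of the SHAP-to-reason-code path, assuming the tree-based XGBoost scorer and an illustrative reason-code map:

```python
import numpy as np
import shap

REASON_CODES = {   # illustrative subset of an FCRA reason-code mapping
    "utilization_rate": "Proportion of balances to credit limits is too high",
    "num_delinquencies_24mo": "Delinquency on accounts",
    "num_inquiries_6mo": "Too many recent credit inquiries",
    "debt_to_income_ratio": "Level of debt relative to income is too high",
}

def adverse_action_reasons(model, X_row, feature_names, top_k=4):
    """Rank features by their SHAP push toward higher predicted default risk
    and map the top contributors to reason codes."""
    contribs = np.asarray(shap.TreeExplainer(model).shap_values(X_row)).reshape(-1)
    reasons = []
    for idx in np.argsort(contribs)[::-1]:       # largest positive contribution first
        name = feature_names[idx]
        if contribs[idx] > 0 and name in REASON_CODES:
            reasons.append(REASON_CODES[name])
        if len(reasons) == top_k:
            break
    return reasons
```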
ML Layer
Feature Store
Update: Real-time for transactional data, daily batch for credit bureau data, hourly for alternative data
- credit_score
- num_tradelines
- total_debt
- debt_to_income_ratio
- num_inquiries_6mo
- oldest_account_age_months
- num_delinquencies_24mo
- utilization_rate
- income_stability_score
- employment_tenure_months
- rent_payment_history
- utility_payment_history
- bank_balance_avg_3mo
- transaction_velocity
- fraud_risk_score
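How those features and cadences might be declared as a lightweight spec feeding the feature store (a sketch; the dataclass and labels are assumptions, not a specific feature-store API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSpec:
    name: str
    source: str      # "transactional", "credit_bureau", or "alternative"
    refresh: str     # matches the update cadence stated above
    dtype: str

FEATURES = [
    FeatureSpec("credit_score",         "credit_bureau", "daily_batch", "int"),
    FeatureSpec("utilization_rate",     "credit_bureau", "daily_batch", "float"),
    FeatureSpec("bank_balance_avg_3mo", "transactional", "real_time",   "float"),
    FeatureSpec("rent_payment_history", "alternative",   "hourly",      "float"),
    FeatureSpec("fraud_risk_score",     "transactional", "real_time",   "float"),
]
```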
Model Registry
Strategy: Semantic versioning (major.minor.patch), Git-backed model artifacts, MLflow tracking
- credit_score_xgboost
- credit_score_lightgbm
- credit_score_neural_net
- fraud_detector
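A registration sketch for the MLflow-backed registry; exact signatures vary by MLflow version, and the tracking URI, metrics, and stage-promotion policy are deployment-specific assumptions:

```python
import mlflow
import mlflow.xgboost

def register_credit_model(model, params: dict, metrics: dict):
    """Log a trained XGBoost credit model and register it under the
    registry name used above."""
    with mlflow.start_run() as run:
        mlflow.log_params(params)              # e.g. max_depth, learning_rate
        mlflow.log_metrics(metrics)            # e.g. AUC, KS on the holdout set
        mlflow.xgboost.log_model(model, artifact_path="model")
        model_uri = f"runs:/{run.info.run_id}/model"
        return mlflow.register_model(model_uri, "credit_score_xgboost")
```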
Observability Stack
Real-time monitoring, tracing & alerting
Deployment Variants
Startup Architecture
Fast to deploy, cost-efficient, and able to scale toward the 100,000+ applications-per-day target
Infrastructure
Risks & Mitigations
⚠️ Model bias leading to discriminatory lending (ECOA violation)
Likelihood: Medium. Mitigation: Bias testing in every model release (demographic parity, equalized odds), guardrail agent enforcing fairness thresholds, quarterly audits by the compliance team, explainability for all denials, and model governance committee approval required for production deployment.
⚠️ Credit bureau API outage causing application backlog
Likelihood: Low (99.5% SLA from bureaus). Mitigation: 24-hour cache for credit reports, multi-bureau redundancy (if Experian fails, try Equifax), fallback to rule-based scoring, auto-scaling message queue to absorb the backlog, and SLA monitoring with PagerDuty alerts.
⚠️ Model drift due to economic changes (e.g., recession, interest rate hikes)
Likelihood: High (economic cycles are inevitable). Mitigation: Daily drift detection (PSI, KS test), automated alerts if drift exceeds 0.25, quarterly model retraining on recent data, A/B testing before full rollout, business rule overrides during economic shocks, and stress testing with recession scenarios.
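A minimal PSI check matching the 0.25 alert threshold above (equal-width bins for brevity; production jobs often use decile edges from the training distribution):

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI between the training-time and recent score distributions."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    edges = np.linspace(min(baseline.min(), current.min()),
                        max(baseline.max(), current.max()), bins + 1)
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)   # avoid log(0)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Daily drift job (sketch): alert and consider retraining when drift is significant.
# if population_stability_index(train_scores, todays_scores) > 0.25: page_on_call()
```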
⚠️ Data breach exposing PII (SSNs, credit reports)
Likelihood: Low (with proper security controls). Mitigation: Encryption at rest (AES-256) and in transit (TLS 1.3), PII redaction before LLM calls, role-based access controls (RBAC), audit logging with 7-year retention, annual penetration testing, SOC 2 Type II certification, and an incident response plan with < 1 hour detection time.
⚠️ LLM hallucination in adverse action notices (FCRA violation)
Likelihood: Medium (LLMs hallucinate roughly 5% of the time). Mitigation: Four-layer hallucination detection (confidence, fact-check, consistency, human review), template-based fallback, 100% of explanations validated against ground truth, compliance officer spot-checks on a 10% sample, and a zero-tolerance policy for inaccurate notices.
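One way the four layers could be wired for adverse action notices (a sketch with assumed inputs: an LLM confidence score, two independent drafts, and the model's reason codes):

```python
import random

def validate_adverse_action_notice(draft, second_draft, reason_codes,
                                   llm_confidence, template_notice,
                                   spot_check_rate=0.10):
    # Layer 1: confidence gate on the generation itself.
    if llm_confidence < 0.80:
        return template_notice, "human_review"
    # Layer 2: fact-check -- every model-produced reason code must appear in the draft.
    if not all(code.lower() in draft.lower() for code in reason_codes):
        return template_notice, "human_review"
    # Layer 3: consistency -- two independent drafts must cite the same reasons.
    if any((code.lower() in draft.lower()) != (code.lower() in second_draft.lower())
           for code in reason_codes):
        return template_notice, "human_review"
    # Layer 4: route a 10% random sample to the compliance officer queue.
    return draft, ("spot_check" if random.random() < spot_check_rate else "approved")
```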
⚠️ Vendor lock-in to a single LLM provider (OpenAI, Anthropic)
Likelihood: Medium (providers change pricing and terms). Mitigation: Multi-provider architecture (GPT-4 + Claude), an abstraction layer for LLM calls, prompt versioning in Git, fallback to open-source models (Llama 3, Mistral), quarterly cost analysis, and contract negotiations with volume commitments.
⚠️ Regulatory changes requiring model retraining (e.g., new FCRA rules)
Likelihood: Medium (regulations evolve). Mitigation: Modular architecture (easy to swap components), RAG for policy enforcement (instant updates), legal team monitoring of regulatory changes, compliance committee review of model changes, a 3-month buffer for major regulatory updates, and partnerships with RegTech vendors.
Evolution Roadmap
Progressive transformation from MVP to scale
Phase 1: MVP (0-3 months)
Phase 2: Scale (3-6 months)
Phase 3: Enterprise (6-12 months)
Complete Systems Architecture
9-layer architecture from presentation to security
- Presentation: 4 components
- API Gateway: 4 components
- Agent Layer: 7 components
- ML Layer: 5 components
- Integration: 5 components
- Data: 6 components
- External: 5 components
- Observability: 6 components
- Security: 6 components
Sequence Diagram - Credit Application Flow
Automated data flow every hour
Data Flow - Application to Decision
Applicant submission → credit decision in under 500 ms
Key Integrations
Credit Bureaus (Experian, Equifax, TransUnion)
Bank Statement Parsing (Plaid, Yodlee, MX)
Identity Verification (Jumio, Onfido, Persona)
Fraud Detection (Sift, Forter, Riskified)
LLM APIs (OpenAI GPT-4, Anthropic Claude)
Security & Compliance
Failure Modes & Recovery
| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| Credit bureau API timeout or rate limit | Use cached credit report (max 24-hour staleness) → if no cache, route to manual underwriting queue | Degraded accuracy but no downtime | 99.5% (credit bureau SLA is 99.0%) |
| ML model inference service down | Fall back to rule-based scoring (credit score + DTI + employment) → if rules fail, manual review | Lower approval rate (rule-based is more conservative), increased manual review volume | 99.9% (multi-model redundancy) |
| Feature extraction agent fails (missing data, API errors) | Impute missing features with population median → flag application for review if >20% of features are missing | Reduced model confidence, higher manual review rate | 99.8% |
| Guardrail agent detects policy violation (e.g., prohibited-basis discrimination) | Block decision immediately (hard stop) → route to compliance officer for manual review → generate incident report | Application delayed but compliance maintained (critical for regulatory adherence) | 100% (zero tolerance for policy violations) |
| Database unavailable (PostgreSQL primary down) | Fail over to read replica (read-only mode) → queue write operations → promote replica to primary | Read-only mode for 2-5 minutes; write operations queued and replayed after recovery | 99.95% (automated failover) |
| LLM API fails (OpenAI/Anthropic outage) | Use template-based adverse action notices (pre-approved regulatory language) → flag for human review | Generic explanations instead of personalized ones; compliance maintained | 99.9% (multi-provider redundancy) |
| Message queue overload (SQS/Pub/Sub backlog) | Auto-scale workers (horizontal scaling) → if backlog >10K messages, throttle new submissions → alert ops team | Increased latency (queue processing time), potential user-facing delays | 99.9% |
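Every row in the table follows the same primary → fallback → terminal-action shape, which can be captured in one helper (sketch; the callables in the example are hypothetical):

```python
def run_with_fallbacks(steps, on_exhausted):
    """Try each (name, callable) in order; hand off to a terminal action
    (manual review, ops alert) only if every option in the chain fails."""
    for name, step in steps:
        try:
            return name, step()
        except Exception:
            continue                      # degrade to the next option in the chain
    return "exhausted", on_exhausted()

# Example, mirroring the first table row (callables are illustrative):
# outcome, report = run_with_fallbacks(
#     [("bureau_api", pull_bureau_report), ("cache_24h", read_cached_report)],
#     on_exhausted=route_to_manual_underwriting,
# )
```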
RAG vs Fine-Tuning for Credit Policy Enforcement
Hallucination Detection in Credit Explanations
Evaluation Framework for Credit Models
Dataset Curation for Credit Risk
Agentic RAG for Dynamic Feature Engineering
Model Ensemble Strategy
Tech Stack Summary
© 2026 Randeep Bhatia. All Rights Reserved.
No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.