From prompts to a production contract review system.
Monday: 3 core prompts for clause extraction, risk scoring, and recommendations. Tuesday: automated multi-step pipeline. Wednesday: team workflows across legal, compliance, and business teams. Thursday: complete technical architecture with 4 specialized agents, ML evaluation, and scaling patterns for enterprise legal operations.
Key Assumptions
System Requirements
Functional
- Extract 25+ clause types (termination, liability, indemnity, IP, confidentiality, payment, renewal)
- Risk score each clause (0-100) with confidence intervals and reasoning
- Generate redline recommendations with legal justification
- Support batch processing (upload 100 contracts) and real-time API
- Multi-language support (English, Spanish, French, German, Mandarin)
- Version control: track contract revisions and clause changes over time
- Human-in-the-loop review queue for low-confidence extractions (<80%)
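To make the extraction and scoring requirements above concrete, here is a minimal sketch of the per-clause output record they imply, written with Pydantic. The field names (clause_type, risk_score, risk_confidence, needs_human_review) are illustrative assumptions, not the system's actual schema.

```python
# Minimal sketch of a per-clause output record implied by the requirements above.
# Field names are illustrative assumptions, not the production schema.
from pydantic import BaseModel, Field

class ExtractedClause(BaseModel):
    clause_type: str                       # one of the 25+ supported types, e.g. "termination"
    text: str                              # verbatim clause text from the contract
    risk_score: int = Field(ge=0, le=100)  # 0-100 risk score
    risk_confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str                         # justification for the score
    redline_suggestion: str | None = None  # populated only for high-risk clauses

    @property
    def needs_human_review(self) -> bool:
        # Requirement: route extractions below 80% confidence to the review queue
        return self.risk_confidence < 0.80
```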
Non-Functional (SLOs)
Cost targets: $0.85 per contract, $0.06 per page, $1,200/month infrastructure.
Agent Layer
planner
L4 · Decomposes contract review into tasks, selects tools, routes to specialized agents
Tools: contract_classifier (determines contract type), page_counter (estimates processing time), language_detector
Recovery: if OCR fails → retry with enhanced preprocessing; if classifier is uncertain → route to human review queue; if timeout → split into chunks and process in parallel
extractor
L3 · Extract 25+ clause types from contract PDF using OCR + LLM
Tools: aws_textract (OCR), gpt4_extraction (structured output), clause_deduplicator
Recovery: if OCR quality <70% → flag for manual review; if LLM extraction confidence <80% → route to Validator for double-check; if parsing error → retry with fallback prompt
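A minimal sketch of the extractor's recovery branches described above: gate on OCR quality, then retry the LLM call once with a fallback prompt on parsing errors and route low-confidence output to the Validator. `ocr_pdf`, `call_llm`, and the prompt names are hypothetical placeholders standing in for AWS Textract and the GPT-4 extraction prompt.

```python
import json

OCR_QUALITY_FLOOR = 0.70   # below this, flag for manual review
CONFIDENCE_FLOOR = 0.80    # below this, route to the Validator for a double-check

def extract_clauses(pdf_bytes: bytes, ocr_pdf, call_llm) -> dict:
    """Sketch of the extractor agent's happy path plus its recovery branches.

    ocr_pdf and call_llm are injected, hypothetical callables standing in for
    AWS Textract and the GPT-4 structured-extraction prompt.
    """
    text, ocr_quality = ocr_pdf(pdf_bytes)
    if ocr_quality < OCR_QUALITY_FLOOR:
        return {"status": "manual_review", "reason": f"ocr_quality={ocr_quality:.2f}"}

    for prompt_name in ("primary_extraction_prompt", "fallback_extraction_prompt"):
        raw = call_llm(prompt_name, text)
        try:
            clauses = json.loads(raw)
        except json.JSONDecodeError:
            continue  # parsing error -> retry once with the fallback prompt
        if clauses.get("confidence", 0.0) < CONFIDENCE_FLOOR:
            return {"status": "validator_double_check", "clauses": clauses}
        return {"status": "ok", "clauses": clauses}

    return {"status": "manual_review", "reason": "parsing_failed_after_fallback"}
```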
validator
L2 · Validate extracted clauses against schema, check completeness, flag anomalies
Tools: schema_validator (JSON schema check), completeness_checker (25 required clause types), anomaly_detector (unusual clause language)
Recovery: if missing critical clauses → trigger re-extraction with focused prompt; if anomaly detected → flag for legal review; if validation fails → route to human queue with context
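A sketch of the completeness check the Validator runs before anything else. REQUIRED_CLAUSE_TYPES is a hypothetical stand-in for the 25 required clause types; the re-extraction trigger mirrors the recovery rule above.

```python
# Sketch of the validator's completeness check. REQUIRED_CLAUSE_TYPES is a
# hypothetical stand-in for the 25 required clause types.
REQUIRED_CLAUSE_TYPES = {
    "termination", "liability", "indemnity", "ip", "confidentiality",
    "payment", "renewal",  # ... remaining required types
}

def check_completeness(extracted: list[dict]) -> dict:
    found = {c["clause_type"] for c in extracted}
    missing = REQUIRED_CLAUSE_TYPES - found
    if missing:
        # Recovery path: trigger re-extraction with a prompt focused on the gaps
        return {"complete": False, "reextract_types": sorted(missing)}
    return {"complete": True, "reextract_types": []}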
risk_scorer
L3 · Score each clause 0-100 for legal risk using RAG + historical precedent
Tools: rag_retriever (legal precedent database), risk_model (fine-tuned classifier), confidence_estimator
Recovery: if RAG retrieval empty → fall back to rule-based scoring; if model confidence <70% → flag for attorney review; if contradictory scores → ensemble vote across 3 models
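A sketch of how the risk scorer's three recovery rules could compose, assuming hypothetical `models` callables and a `rule_based_score` fallback; the ensemble vote is shown here as a median over the three model scores.

```python
from statistics import median

CONFIDENCE_FLOOR = 0.70        # below this, flag for attorney review
DISAGREEMENT_THRESHOLD = 20    # score spread that counts as "contradictory"

def score_clause(clause: dict, precedents: list[dict], models: list, rule_based_score) -> dict:
    """Sketch of the risk scorer's recovery logic. `models` are three scoring
    callables returning (score, confidence); `rule_based_score` is a hypothetical
    rules fallback used when RAG retrieval comes back empty."""
    if not precedents:
        # RAG retrieval empty -> fall back to rule-based scoring
        return {"score": rule_based_score(clause), "method": "rules_fallback"}

    results = [m(clause, precedents) for m in models]
    scores = [s for s, _ in results]
    best_score, best_conf = max(results, key=lambda r: r[1])

    if max(scores) - min(scores) > DISAGREEMENT_THRESHOLD:
        # Contradictory scores -> ensemble vote across the 3 models
        return {"score": median(scores), "method": "ensemble_median"}
    if best_conf < CONFIDENCE_FLOOR:
        return {"score": best_score, "method": "attorney_review_flagged"}
    return {"score": best_score, "method": "primary_model"}
```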
recommender
L3 · Generate redline suggestions with legal justification for high-risk clauses
Tools: redline_generator (GPT-4 with legal prompt), policy_matcher (company-specific rules), precedent_finder (similar contracts)
Recovery: if recommendation unclear → provide multiple options; if policy conflict → escalate to legal team; if generation fails → fall back to template-based suggestions
guardrail
L4 · Final safety check: PII redaction, policy compliance, output validation
Tools: pii_detector (NER model), policy_checker (SOC2 compliance rules), output_validator (schema + safety)
Recovery: if PII detected → redact and log; if policy violation → block output and alert; if validation fails → route to manual review queue
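A sketch of the guardrail's PII redaction step, assuming AWS Comprehend as the detector (the risk section below names Comprehend for pre-redaction). The fail-closed behavior follows the stated policy: if detection fails, block processing rather than risk leakage. Function and variable names are illustrative.

```python
# Sketch of the guardrail's PII redaction step, assuming AWS Comprehend as the
# detector. If detection fails, processing is blocked (fail-closed).
import boto3

comprehend = boto3.client("comprehend")

def redact_pii(text: str, language_code: str = "en") -> str:
    try:
        resp = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    except Exception:
        # PII detection failed -> block processing rather than risk leakage
        raise RuntimeError("PII detection failed; blocking output for manual review")

    redacted = text
    # Replace detected spans right-to-left so earlier offsets stay valid
    for ent in sorted(resp["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
        redacted = (
            redacted[: ent["BeginOffset"]]
            + f"[{ent['Type']}]"
            + redacted[ent["EndOffset"]:]
        )
    return redacted
```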
ML Layer
Feature Store
Update: Daily batch + real-time for new contracts
- clause_length_chars
- clause_complexity_score (Flesch-Kincaid; see the sketch after this list)
- entity_count (parties, dates, amounts)
- historical_risk_score_avg (per clause type)
- jurisdiction_risk_modifier
- industry_standard_deviation
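A simplified sketch of the clause_complexity_score feature using the Flesch-Kincaid grade-level formula. The syllable counter is a naive vowel-group heuristic; the production feature may well use a proper readability library.

```python
# Simplified sketch of clause_complexity_score via the Flesch-Kincaid grade
# level. The syllable counter is a naive vowel-group heuristic.
import re

def _syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def clause_complexity_score(clause_text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", clause_text)))
    words = re.findall(r"[A-Za-z']+", clause_text)
    if not words:
        return 0.0
    syllables = sum(_syllables(w) for w in words)
    # Flesch-Kincaid grade level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```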
Model Registry
Strategy: Semantic versioning (major.minor.patch), blue-green deployment
- clause_extractor
- risk_classifier
- embeddings
- reranker
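A minimal sketch of what a registry entry and a blue-green promotion could look like under the versioning strategy above. The class and field names are illustrative, not the real registry API.

```python
# Sketch of a registry entry and blue-green promotion. Names are illustrative.
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str       # e.g. "risk_classifier"
    version: str    # semantic version, e.g. "2.3.1"
    slot: str       # "blue" (serving) or "green" (candidate)

def promote(serving: ModelVersion, candidate: ModelVersion) -> tuple[ModelVersion, ModelVersion]:
    """Swap slots: the green candidate becomes the serving model, and the old
    serving model is kept in green as the rollback target."""
    candidate.slot, serving.slot = "blue", "green"
    return candidate, serving
```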
Observability Stack
Real-time monitoring, tracing & alerting
Deployment Variants
Startup Architecture
Fast to deploy, cost-efficient, scales to 100 competitors
Infrastructure
Risks & Mitigations
LLM hallucinations (fake clauses, incorrect risk scores)
Medium · Mitigation: 4-layer hallucination detection: confidence scores, cross-reference database, logical consistency, human review queue. 98% catch rate.
OCR quality issues (poor scans, handwritten notes)
Medium · Mitigation: Pre-process PDFs (deskew, denoise). Flag low-quality scans (<70%) for manual review. Provide scan-quality feedback to users.
Model drift (accuracy degrades over time)
High · Mitigation: Monthly drift detection. Automatic retraining if accuracy drops >3%. Rollback to previous model version. Weekly evaluation on a 1K-contract test set.
PII leakage (sensitive data sent to LLM)
Low · Mitigation: Pre-redact PII via AWS Comprehend before LLM processing. Audit logs for all PII access. Encrypt PII at rest (KMS). Block processing if PII detection fails.
Cost overruns (LLM API costs spike)
Medium · Mitigation: Cost tracking per contract. Budget alerts (>$1,000/day). Rate limiting (10 req/sec per tenant). Optimize prompts to reduce tokens. Cache common extractions.
Scalability bottleneck (queue overflow during peak)
Medium · Mitigation: Auto-scaling workers (2-20 based on queue depth). Kafka for high-throughput event streaming. Throttle API (429 responses) if queue >10K. Priority queue for urgent contracts.
Vendor lock-in (OpenAI API dependency)
High · Mitigation: Multi-LLM architecture (GPT-4, Claude, Gemini). Abstract LLM calls via an interface so providers are easy to swap (see the sketch below). Test failover monthly.
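A sketch of the provider abstraction behind the vendor lock-in mitigation: every model call goes through one interface, and a failover chain tries providers in order (GPT-4 → Claude → Gemini, matching the failure-mode table below). The Protocol and function names are assumptions for illustration.

```python
# Sketch of the multi-LLM abstraction: one interface, ordered failover chain.
from typing import Protocol

class LLMProvider(Protocol):
    name: str
    def complete(self, prompt: str) -> str: ...

def complete_with_failover(prompt: str, providers: list[LLMProvider]) -> str:
    errors = []
    for provider in providers:
        try:
            return provider.complete(prompt)
        except Exception as exc:   # provider outage, rate limit, timeout, ...
            errors.append(f"{provider.name}: {exc}")
    # All providers failed -> queue the request for retry (per the failure-mode table)
    raise RuntimeError("all LLM providers failed; queue for retry: " + "; ".join(errors))
```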
Evolution Roadmap
Progressive transformation from MVP to scale
Phase 1: MVP (0-3 months)
Phase 2: Scale (3-6 months)
Phase 3: Enterprise (6-12 months)
Complete Systems Architecture
9-layer architecture from ingestion to compliance
- Presentation (4 components)
- API Gateway (4 components)
- Agent Layer (7 components)
- ML Layer (5 components)
- Integration (4 components)
- Data (4 components)
- External (4 components)
- Observability (4 components)
- Security (4 components)
Sequence Diagram - Contract Review Flow
Automated data flow every hour
Data Flow
Contract upload → reviewed output in 45 seconds
Key Integrations
DocuSign
Salesforce
Internal Contract Management System
AWS Textract
Security & Compliance
Failure Modes & Recovery
| Failure | Fallback | Impact | SLA |
|---|---|---|---|
| LLM API down (OpenAI outage) | Failover to Claude 3.5 → then Gemini → then queue for retry | Degraded latency (+5-10 sec), not broken | 99.5% |
| OCR quality low (<70%) | Flag for manual review → human uploads a higher-quality scan | Requires human intervention | 99.0% |
| Extraction confidence low (<80%) | Route to Validator for double-check → if still low, human review queue | Quality maintained, slight latency increase | 99.9% |
| Risk scoring model drift (accuracy drops >3%) | Trigger retraining pipeline → roll back to previous model version | Temporary accuracy degradation | 99.0% |
| Database unavailable | Read from replica → queue writes → retry on primary recovery | Read-only mode, writes delayed | 99.9% |
| PII detection fails | Block processing → alert security team → manual review | Safety first, processing halted | 100% |
| Queue overflow (>10K contracts pending) | Scale workers 2x → if still overwhelmed, throttle API (429 responses) | Increased latency, some requests rejected | 99.0% |
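A small sketch of the queue-overflow policy from the last row of the table: scale workers first, then reject new uploads with HTTP 429 once the queue exceeds 10K pending contracts. The thresholds mirror the stated limits; the function name and return shape are illustrative.

```python
# Sketch of the queue-overflow policy: scale first, then throttle with 429.
QUEUE_HARD_LIMIT = 10_000
MIN_WORKERS, MAX_WORKERS = 2, 20

def autoscale_and_throttle(queue_depth: int, current_workers: int) -> dict:
    if queue_depth > QUEUE_HARD_LIMIT:
        # Still overwhelmed after scaling -> reject new uploads with 429
        return {"workers": MAX_WORKERS, "accept_new": False, "http_status": 429}
    # Scale workers 2x under load, within the 2-20 range
    target = current_workers * 2 if queue_depth > 1_000 else current_workers
    target = min(MAX_WORKERS, max(MIN_WORKERS, target))
    return {"workers": target, "accept_new": True, "http_status": 200}
```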
                ┌───────────────────┐
                │   Orchestrator    │  ← LangGraph coordinator
                │   (LangGraph)     │
                └─────────┬─────────┘
                          │
   ┌──────────┬───────────┼───────────┬───────────┬──────────┐
   │          │           │           │           │          │
┌──▼────┐ ┌───▼───┐ ┌─────▼──┐ ┌──────▼───┐ ┌─────▼────┐ ┌───▼──────┐
│Planner│ │Extract│ │Validate│ │RiskScore │ │Recommend │ │Guardrail │
│ Agent │ │ Agent │ │ Agent  │ │  Agent   │ │  Agent   │ │  Agent   │
└──┬────┘ └───┬───┘ └─────┬──┘ └──────┬───┘ └─────┬────┘ └───┬──────┘
   │          │           │           │           │          │
   └──────────┴───────────┼───────────┴───────────┴──────────┘
                          │
                ┌─────────▼─────────┐
                │      RAG DB       │  ← Legal precedent
                │     (Vector)      │
                └─────────┬─────────┘
                          │
                ┌─────────▼─────────┐
                │    PostgreSQL     │  ← Contracts, clauses
                └───────────────────┘

Agent Collaboration Flow
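A minimal sketch of the coordinator wiring shown in the diagram, using LangGraph's StateGraph. The node functions and the ReviewState fields are placeholders; only the graph structure follows the diagram, including the validator's branch to the human review queue.

```python
# Minimal sketch of the LangGraph wiring for the six agents. Node functions and
# state fields are placeholders; only the graph structure follows the diagram.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict, total=False):
    pdf_bytes: bytes
    clauses: list
    validation_ok: bool
    risk_scores: list
    recommendations: list

def plan(state): ...
def extract(state): ...
def validate(state): ...
def score_risk(state): ...
def recommend(state): ...
def guardrail(state): ...

graph = StateGraph(ReviewState)
for name, fn in [("planner", plan), ("extractor", extract), ("validator", validate),
                 ("risk_scorer", score_risk), ("recommender", recommend), ("guardrail", guardrail)]:
    graph.add_node(name, fn)

graph.set_entry_point("planner")
graph.add_edge("planner", "extractor")
graph.add_edge("extractor", "validator")
# Validator either passes the contract on or sends it to the human review queue
graph.add_conditional_edges(
    "validator",
    lambda s: "risk_scorer" if s.get("validation_ok") else "human_review",
    {"risk_scorer": "risk_scorer", "human_review": END},
)
graph.add_edge("risk_scorer", "recommender")
graph.add_edge("recommender", "guardrail")
graph.add_edge("guardrail", END)

app = graph.compile()
```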
Agent Types
Reactive Agent
Low · Extractor Agent responds to PDF input, returns clauses
Reflexive Agent
Medium · Validator Agent uses rules + context to validate
Deliberative Agent
High · Risk Scorer Agent reasons over precedent, plans a scoring strategy
Orchestrator Agent
Highest · Planner + Guardrail make routing decisions, handle failures, ensure safety
Levels of Autonomy
RAG vs Fine-Tuning
Hallucination Detection
Evaluation Framework
Dataset Curation
Agentic RAG
Multi-LLM Ensemble
Tech Stack Summary
© 2026 Randeep Bhatia. All Rights Reserved.
No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.