
Grant Application System Architecture 🏗️

From 10 to 10,000 applications/month with multi-agent AI orchestration

October 9, 2025
💰 Fundraising · 🏗️ Architecture · 🤖 Multi-Agent · 📊 Scalable · 🔒 Secure

From prompts to a production grant system.

Monday: 3 core prompts (Opportunity Finder, Application Builder, Deadline Tracker). Tuesday: automation code with LangGraph. Wednesday: team workflows (Grant Writer, Development Director, Executive Director). Thursday: the complete technical architecture, covering multi-agent orchestration, document management, deadline tracking, and scaling to 10,000 applications/month.

Key Assumptions

  • Process 10-10,000 grant applications per month
  • Average application: 15 pages, 8 attachments, 30-day deadline
  • Integration with 3-5 grant databases (GrantStation, Foundation Directory, Candid)
  • Document storage: 50GB-5TB (startup to enterprise)
  • SOC2 Type II compliance required for enterprise tier
  • Multi-tenant with org-level data isolation
  • 99.5% uptime SLA for grant submission deadlines

System Requirements

Functional

  • Opportunity discovery: Search 100K+ grants, match to org profile
  • Application generation: Auto-fill from templates, org data, past wins
  • Deadline tracking: Calendar sync, 7/3/1-day reminders, priority scoring
  • Document management: Version control, template library, attachment handling
  • Collaboration: Multi-user editing, comment threads, approval workflows
  • Reporting: Win rate, ROI per grant, time saved vs manual
  • Integration: Grant databases, Google Workspace, Microsoft 365, Salesforce

Non-Functional (SLOs)

  • Latency (p95): 5,000 ms
  • Freshness: 60 min
  • Availability: 99.5%
  • Application generation time: 180 sec
  • Search response time: 2,000 ms

💰 Cost Targets: $2.50 per application · $50 per user per month · $0.15 per GB of storage per month

Agent Layer

planner (autonomy L4)

Decomposes high-level tasks (create application) into steps (search → draft → review → submit)

🔧 OpportunityAgent.search, WriterAgent.generate, EvaluatorAgent.validate

⚡ Recovery: if a tool fails, retry 3x with backoff; if unrecoverable, route to the human queue; log failure context for debugging.

executor (autonomy L3)

Orchestrates workflow execution, manages state, handles loops (search → refine → search)

🔧 All domain agents (Opportunity, Writer, Deadline), Database operations, External API calls

⚡ Recovery: checkpoint state after each step; resume from the last checkpoint on failure; timeout protection (max 5 min per step).
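A minimal sketch of this checkpoint-and-resume loop in plain Python. The local JSON file, `run_workflow`, and the step list are illustrative stand-ins; a production build would persist state to PostgreSQL/Redis (or use LangGraph's built-in checkpointers) and enforce the timeout at the job-runner level:

```python
import json
import time
from pathlib import Path
from typing import Callable

CHECKPOINT = Path("workflow_checkpoint.json")  # stand-in for a DB-backed store
STEP_TIMEOUT_SEC = 300  # max 5 min per step, per the recovery policy above

def load_checkpoint() -> dict:
    # Resume from the last saved state, or start fresh.
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"completed": [], "state": {}}

def run_workflow(steps: list[tuple[str, Callable]]) -> dict:
    cp = load_checkpoint()
    for name, fn in steps:
        if name in cp["completed"]:
            continue  # already finished in a previous run; skip on resume
        started = time.monotonic()
        cp["state"][name] = fn(cp["state"])  # execute the step
        if time.monotonic() - started > STEP_TIMEOUT_SEC:
            # Post-hoc check only; real enforcement needs asyncio.wait_for
            # or a worker-level timeout.
            raise TimeoutError(f"step '{name}' exceeded {STEP_TIMEOUT_SEC}s")
        cp["completed"].append(name)
        CHECKPOINT.write_text(json.dumps(cp))  # checkpoint after each step
    return cp["state"]
```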

evaluator (autonomy L2)

Quality checks: completeness, coherence, compliance with grant requirements

🔧 GPT-4 for coherence check, Rule-based validator for completeness, Similarity scorer vs past wins

⚡ Recovery: if the LLM fails, fall back to rule-based checks; if score < 70, flag for human review; log evaluation criteria and results.
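A small sketch of that fallback chain, assuming a hypothetical `llm_coherence_score` callable (e.g. a GPT-4 grading prompt) and an illustrative required-sections list:

```python
REQUIRED_SECTIONS = ["summary", "need_statement", "budget", "outcomes"]  # illustrative

def rule_based_score(draft: dict) -> float:
    # Completeness: percentage of required sections present and non-empty.
    present = sum(1 for s in REQUIRED_SECTIONS if draft.get(s, "").strip())
    return 100.0 * present / len(REQUIRED_SECTIONS)

def evaluate(draft: dict, llm_coherence_score) -> dict:
    try:
        score = llm_coherence_score(draft)   # hypothetical GPT-4 grading call
    except Exception:
        score = rule_based_score(draft)      # fallback: rule-based checks only
    return {
        "score": score,
        "needs_human_review": score < 70,    # threshold from the policy above
    }
```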

guardrail (autonomy L2)

Policy enforcement: PII redaction, compliance checks, safety filters

🔧 AWS Comprehend (PII detection), Custom rule engine (policy checks), Blocklist matcher (prohibited terms)

⚡ Recovery: if PII detection fails, block processing (fail-safe); if the rule engine fails, route to manual review; log all violations for audit.
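A sketch of the fail-safe gate. `detect_pii` is a hypothetical wrapper around AWS Comprehend (or Presidio) assumed to return entities as start/end offsets; if detection itself errors, processing is blocked rather than allowed through:

```python
def guard(text: str, detect_pii, blocklist: set[str]) -> str:
    """Redact PII and enforce the blocklist; raise to block processing."""
    try:
        entities = detect_pii(text)  # hypothetical Comprehend/Presidio wrapper
    except Exception as exc:
        # Fail-safe: if detection is unavailable, block rather than risk a leak.
        raise RuntimeError("PII detection unavailable; processing blocked") from exc
    # Redact from the end so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + "[REDACTED]" + text[ent["end"]:]
    for term in blocklist:
        if term.lower() in text.lower():
            raise ValueError(f"prohibited term detected: {term}")
    return text
```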

opportunity (autonomy L3)

Search 100K+ grants, rank by fit score, return top 10 matches

🔧 Grant DB APIs (GrantStation, Candid), Vector DB (semantic search), Ranking model (fine-tuned classifier)

⚡ Recovery: if the API is down, fall back to cached results; if there are no matches, broaden the search criteria; log search parameters and results.

writer (autonomy L3)

Generate 15-page grant application from template, org data, and past wins

🔧 GPT-4 or Claude (long-context generation), RAG system (retrieve relevant past content), Template engine (fill placeholders)

⚡ Recovery: if the LLM fails, retry with shorter context; if output is truncated, chunk and merge; save partial drafts for recovery.
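One way the "retry with shorter context" policy might look, with `call_llm` a hypothetical client wrapper; each failure halves the retrieved context before retrying:

```python
def generate_with_backoff(prompt: str, context_chunks: list[str],
                          call_llm, max_tries: int = 3) -> str:
    chunks = list(context_chunks)
    for _ in range(max_tries):
        try:
            return call_llm("\n\n".join(chunks) + "\n\n" + prompt)
        except Exception:
            # Context-length or transient failure: halve the context and retry.
            chunks = chunks[: max(1, len(chunks) // 2)]
    raise RuntimeError("generation failed after retries")
```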

deadline (autonomy L2)

Track deadlines, send reminders (7/3/1 days), prioritize by urgency × amount

🔧 Calendar API (Google, Outlook), Email service (SendGrid), SMS service (Twilio)

⚡ Recovery: if email fails, retry 3x, then SMS; if calendar sync fails, log and notify an admin; queue reminders for retry.
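A sketch of the 7/3/1-day schedule, the urgency × amount priority score, and the email-then-SMS fallback; `send_email` and `send_sms` are hypothetical SendGrid/Twilio wrappers:

```python
from datetime import datetime, timedelta

REMINDER_OFFSETS_DAYS = (7, 3, 1)

def reminder_times(deadline: datetime) -> list[datetime]:
    # One reminder 7, 3, and 1 days before the deadline.
    return [deadline - timedelta(days=d) for d in REMINDER_OFFSETS_DAYS]

def priority(deadline: datetime, amount_usd: float, now: datetime) -> float:
    # Urgency × amount: fewer days remaining and larger grants rank higher.
    days_left = max((deadline - now).days, 1)
    return amount_usd / days_left

def send_reminder(message: str, send_email, send_sms, retries: int = 3) -> str:
    for _ in range(retries):
        try:
            send_email(message)   # hypothetical SendGrid wrapper
            return "email"
        except Exception:
            continue
    send_sms(message)             # hypothetical Twilio fallback
    return "sms"
```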

ML Layer

Feature Store

Update: Daily batch + real-time on new application

  • org_past_win_rate (rolling 12mo)
  • grant_fit_score (semantic similarity)
  • funder_relationship_strength (interaction history)
  • budget_alignment (org budget vs grant amount)
  • deadline_urgency (days remaining)
  • application_quality_score (historical avg)
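For illustration, the `org_past_win_rate` feature might be computed like this from past outcome records; the record shape is an assumption, not the system's actual schema:

```python
from datetime import datetime, timedelta

def org_past_win_rate(outcomes: list[dict], now: datetime) -> float:
    """Rolling 12-month win rate from {'submitted_at': datetime, 'won': bool} records."""
    window = [o for o in outcomes if o["submitted_at"] >= now - timedelta(days=365)]
    if not window:
        return 0.0
    return sum(o["won"] for o in window) / len(window)
```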

Model Registry

Strategy: Semantic versioning (major.minor.patch), A/B test new versions

  • grant_ranker_v3
  • gpt4_writer
  • claude_evaluator

Observability

Metrics

  • 📊 application_generation_time_sec
  • 📊 llm_latency_p95_ms
  • 📊 grant_search_success_rate
  • 📊 quality_score_distribution
  • 📊 deadline_reminder_delivery_rate
  • 📊 api_error_rate
  • 📊 cost_per_application_usd
  • 📊 win_rate_percent

Dashboards

  • 📈 ops_dashboard
  • 📈 ml_dashboard
  • 📈 business_metrics_dashboard
  • 📈 cost_tracking_dashboard

Traces

✅ Enabled

Deployment Variants

🚀 Startup

Infrastructure:

  • Vercel/Netlify (frontend)
  • Serverless functions (backend)
  • Managed PostgreSQL (Supabase/Neon)
  • Redis Cloud (cache)
  • S3 (documents)
  • OpenAI API (LLMs)

  • Single-tenant (1 org per deployment)
  • No VPC, public endpoints
  • Shared infrastructure
  • Quick to deploy (< 1 week)
  • Cost: $100-500/month
  • Good for MVP, up to 500 apps/month

🏢 Enterprise

Infrastructure:

  • Kubernetes (EKS/GKE) in VPC
  • Multi-region deployment
  • Private PostgreSQL (RDS/Cloud SQL)
  • Redis Cluster (ElastiCache)
  • S3/GCS with customer-managed keys
  • Multi-LLM failover (GPT-4 + Claude + Gemini)
  • Dedicated vector DB (Pinecone Enterprise)
  • WAF + DDoS protection

  • Multi-tenant with org-level isolation
  • VPC peering, private networking
  • BYO KMS/HSM for encryption
  • SSO/SAML integration
  • Audit logging (7-year retention)
  • Data residency (EU/US/APAC)
  • 99.9% SLA
  • Cost: $8,000+/month
  • Supports 10,000+ apps/month

📈 Migration: Phase 1: Migrate DB to RDS/Cloud SQL. Phase 2: Deploy K8s cluster, run parallel. Phase 3: Cutover DNS, decommission serverless. Phase 4: Add multi-region, SSO, audit logging.

Risks & Mitigations

⚠️ LLM hallucinations (fake grants, incorrect data) · Risk: Medium
✓ Mitigation: 4-layer detection: confidence scores, cross-reference DB, logical checks, human review. 0.5% rate, 100% caught.

⚠️ Grant DB API rate limits or downtime · Risk: Medium
✓ Mitigation: 24-hour cache, multi-provider failover, rate limiting (10 req/sec), exponential backoff.

⚠️ PII leakage to LLM providers · Risk: Low
✓ Mitigation: Redact PII before LLM call (AWS Comprehend), audit logs, fail-safe if detection fails.

⚠️ Deadline missed due to system failure · Risk: Low
✓ Mitigation: 99.9% SLA, multi-channel reminders (email + SMS), auto-failover, 7/3/1-day alerts.

⚠️ Cost overrun (LLM API costs) · Risk: Medium
✓ Mitigation: Per-org quotas, cost tracking dashboard, alerts at 80% budget, auto-throttle at 100% (sketch below).
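A minimal sketch of that quota logic; the in-memory dicts stand in for the billing store, and `alert` is a hypothetical notifier:

```python
def may_call_llm(org_id: str, spend_usd: dict, budget_usd: dict, alert) -> bool:
    """Return True if the org may make another LLM call."""
    spent, budget = spend_usd.get(org_id, 0.0), budget_usd[org_id]
    if spent >= budget:
        return False                  # auto-throttle at 100% of budget
    if spent >= 0.8 * budget:
        alert(org_id, spent, budget)  # alert at 80% of budget
    return True
```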

⚠️ Low-quality applications (poor win rate) · Risk: Medium
✓ Mitigation: Quality scoring (< 70 = refine), human review queue, A/B testing new prompts, quarterly retraining.

⚠️ Data loss (document storage failure) · Risk: Low
✓ Mitigation: S3 versioning, cross-region replication, daily backups, 30-day retention, disaster recovery plan.

Evolution Roadmap

Phase 1: MVP (0-3 months) · Weeks 1-12
  • Launch core features: Opportunity Finder, Application Builder, Deadline Tracker
  • Support 10-50 applications/month
  • Single-tenant deployment
  • Serverless architecture

Phase 2: Scale (3-6 months) · Weeks 13-26
  • Scale to 500 applications/month
  • Add multi-agent orchestration
  • Improve quality with RAG
  • Add collaboration features

Phase 3: Enterprise (6-12 months) · Weeks 27-52
  • Scale to 10,000+ applications/month
  • Multi-tenant with org isolation
  • SOC2 compliance
  • 99.9% SLA

Complete Systems Architecture

9-layer architecture from user interface to external integrations

Presentation
  • Web App (React/Next.js)
  • Mobile App (React Native)
  • Email Notifications
  • Calendar Widgets

API Gateway
  • Load Balancer
  • Rate Limiter (per-org quotas)
  • Auth Gateway (OIDC/SAML)
  • API Versioning

Agent Layer
  • PlannerAgent (task decomposition)
  • ExecutorAgent (workflow orchestration)
  • EvaluatorAgent (quality checks)
  • GuardrailAgent (compliance, PII)
  • OpportunityAgent (grant search)
  • WriterAgent (application generation)
  • DeadlineAgent (tracking, reminders)

ML Layer
  • Feature Store (org profile, past wins)
  • Model Registry (GPT-4, Claude, fine-tuned classifiers)
  • Prompt Store (versioned templates)
  • Evaluation Pipeline (quality, cost, latency)

Integration
  • Grant DB Adapters (GrantStation, Candid)
  • Document APIs (Google Drive, Dropbox, SharePoint)
  • Calendar Sync (Google Calendar, Outlook)
  • CRM Connector (Salesforce, HubSpot)

Data
  • PostgreSQL (applications, orgs, users)
  • Redis (cache, session, queue)
  • S3/GCS (documents, attachments)
  • Vector DB (semantic search, RAG)

External
  • OpenAI API (GPT-4)
  • Anthropic API (Claude)
  • Grant Databases (REST APIs)
  • Email Service (SendGrid, SES)
  • SMS Service (Twilio)

Observability
  • Metrics (Prometheus, Datadog)
  • Logs (CloudWatch, ELK)
  • Traces (Jaeger, Honeycomb)
  • Dashboards (Grafana)
  • Alerts (PagerDuty, Opsgenie)

Security
  • IAM (RBAC, ABAC)
  • Secrets Manager (AWS KMS, Vault)
  • Audit Logger (immutable logs)
  • PII Redactor (Comprehend, Presidio)
  • WAF (rate limiting, DDoS protection)

Sequence Diagram - Grant Application Flow

Participants: User, API Gateway, PlannerAgent, OpportunityAgent, WriterAgent, EvaluatorAgent, GuardrailAgent, Document Store.

1. User → API Gateway: POST /applications/create {org_id, keywords}
2. PlannerAgent: decompose(create_application)
3. OpportunityAgent: search_grants(keywords, org_profile) → query(filters) → 50 matches → ranked_opportunities[10]
4. WriterAgent: generate_application(grant_id, org_data) → GPT-4 completion (15 pages) → draft_application (JSON + text)
5. GuardrailAgent: check_compliance(draft) → passed (no PII leaks, meets requirements)
6. Document Store: save_draft(application_id, version=1)
7. API Gateway → User: 200 OK {application_id, preview_url}

Grant Application System - Hub Orchestration

7 components in a hub-and-spoke topology: the ExecutorAgent (4 capabilities) is the hub; each spoke agent exposes 3 capabilities.

  • PlannerAgent: RPC "request task plan" → response "execution steps"
  • OpportunityAgent: RPC "search grants" → response "top 10 matches"
  • WriterAgent: RPC "generate application" → response "draft application"
  • EvaluatorAgent: RPC "quality check" → response "quality report"
  • GuardrailAgent: RPC "policy validation" → response "compliance status"
  • DeadlineAgent: event "register deadline" → event "deadline alerts"

Transports: HTTP/REST, gRPC, events/streams, WebSocket.

Grant Application System - Iterative Refinement Mesh

7 components: the six domain agents plus a FeedbackLoop (3 capabilities each), connected by streams and feedback channels:

  • Grant opportunities (stream) ↔ refine search criteria (feedback)
  • Draft sections (stream) ↔ revision requests (feedback)
  • Content stream (stream) ↔ policy violations (feedback)
  • Grant requirements (REST); urgency signals (event); task completion (event)
  • Quality metrics (stream) ↔ scoring improvements (feedback)
  • Template optimization (feedback); compliance patterns (event)

Components: PlannerAgent, OpportunityAgent, WriterAgent, EvaluatorAgent, GuardrailAgent, DeadlineAgent, FeedbackLoop.

Transports: HTTP/REST, gRPC, events/streams, WebSocket.

Data Flow - End-to-End Application Creation

From user request to submitted application in 3 minutes

1. User (0s): submits request (keywords, filters)
2. API Gateway (50 ms): authenticates, rate limits (JWT token)
3. PlannerAgent (200 ms): decomposes into steps (task DAG)
4. OpportunityAgent (2 s): searches grant DBs (100K grants → 50 matches)
5. Vector DB (500 ms): semantic ranking (top 10 grants)
6. User (user time): selects a grant (grant_id)
7. WriterAgent (3 s): retrieves context via RAG (org profile + past wins)
8. WriterAgent (120 s): generates draft with GPT-4 (15 pages, 8 sections)
9. GuardrailAgent (3 s): scans for PII (redacted text)
10. EvaluatorAgent (10 s): quality check (score: 85/100)
11. ExecutorAgent (100 ms): decides to save or refine (save, since score >= 70)
12. Document Store (1 s): saves draft + metadata (application_id, S3 key)
13. DeadlineAgent (500 ms): schedules reminders (7/3/1-day alerts)
14. API Gateway (50 ms): returns response (application_id, preview_url)

Scaling Patterns

10-50 applications/month · Serverless Monolith
  • Next.js API routes
  • OpenAI API (direct calls)
  • PostgreSQL (managed)
  • S3 for documents
  Cost: $100/month · 3-5 min per application

50-500 applications/month · Queue + Workers
  • API server (Node.js/Python)
  • Redis queue (Bull/Celery)
  • Worker processes (3-5)
  • PostgreSQL + Redis cache
  • S3 for documents
  Cost: $500/month · 2-4 min per application

500-5,000 applications/month · Multi-Agent Orchestration
  • Load balancer (ALB/Cloud Load Balancer)
  • LangGraph agent framework
  • Message bus (SQS/Pub/Sub)
  • Serverless functions (Lambda/Cloud Run)
  • Managed PostgreSQL + Redis
  • Vector DB (Pinecone/Weaviate)
  • S3/GCS for documents
  Cost: $2,000/month · 1-3 min per application

5,000-10,000+ applications/month · Enterprise Multi-Region
  • Global load balancer
  • Kubernetes (EKS/GKE) multi-region
  • Event streaming (Kafka/Kinesis)
  • Multi-LLM failover (GPT-4 + Claude + Gemini)
  • Replicated PostgreSQL (read replicas)
  • Distributed cache (Redis Cluster)
  • Multi-region vector DB
  • CDN for document delivery
  Cost: $8,000+/month · < 2 min per application

Key Integrations

Grant Databases (GrantStation, Candid, Foundation Directory)

Protocol: REST APIs + OAuth 2.0
  • OpportunityAgent queries with filters
  • API returns paginated results (JSON)
  • Agent ranks by fit score
  • Cache results for 24h

Document Storage (Google Drive, Dropbox, SharePoint)

Protocol: REST APIs + OAuth 2.0
  • User connects account (OAuth flow)
  • System uploads drafts/attachments
  • Real-time sync on changes
  • Webhook notifications

Calendar Sync (Google Calendar, Outlook)

Protocol: CalDAV / REST APIs
  • DeadlineAgent creates calendar events
  • Sets reminders (7/3/1 days before)
  • Syncs bidirectionally
  • Handles timezone conversions

CRM (Salesforce, HubSpot)

Protocol: REST APIs
  • Sync org profiles to CRM
  • Log application submissions as activities
  • Track win/loss outcomes
  • Update funder relationships

Email/SMS (SendGrid, Twilio)

Protocol: REST APIs
  • DeadlineAgent triggers reminders
  • Email for 7/3-day, SMS for 1-day
  • Track delivery status
  • Handle bounces/failures

Security & Compliance

Failure Modes & Recovery

Each entry lists the failure, its fallback, the user-facing impact, and the SLA it supports.

  • OpenAI API down → switch to Claude API (multi-LLM failover). Impact: degraded (different model), not broken. SLA: 99.5% (multi-provider resilience).
  • Grant DB API timeout → serve cached results (24h cache). Impact: stale data (up to 24h old). SLA: 99.0% (cache hit rate 80%).
  • WriterAgent generates low-quality draft (score < 70) → loop back with feedback, max 3 iterations. Impact: higher latency (3-5 min vs 2 min). SLA: 99.9% (quality maintained).
  • PII detection service fails → block processing (fail-safe). Impact: application creation blocked. SLA: 100% (safety first).
  • PostgreSQL primary down → promote read replica to primary. Impact: 5-10 sec downtime. SLA: 99.9% (auto-failover).
  • Document upload fails (S3 error) → retry 3x, then queue for later. Impact: delayed upload (eventual consistency). SLA: 99.5%.
  • Deadline reminder not sent (email service down) → retry email 3x, then send SMS. Impact: delayed reminder (< 1 hour). SLA: 99.9% (multi-channel).

Multi-Agent Architecture

7 specialized agents collaborating autonomously

┌─────────────┐
│   Planner   │ ← Decomposes tasks
└──────┬──────┘
       │
   ┌───┴────┬─────────┬──────────┬─────────┐
   │        │         │          │         │
┌──▼───┐ ┌──▼───┐ ┌───▼────┐  ┌──▼───┐ ┌───▼───┐
│Oppor-│ │Writer│ │Deadline│  │ Eval │ │Guard- │
│tunity│ │Agent │ │ Agent  │  │Agent │ │ rail  │
└──┬───┘ └──┬───┘ └───┬────┘  └──┬───┘ └───┬───┘
   │        │         │          │         │
   └────────┴─────────┴──────────┴─────────┘
                      │
                 ┌────▼─────┐
                 │ Executor │ ← Orchestrates workflow
                 └──────────┘

Agent Collaboration Flow

1. PlannerAgent: user requests 'create application' → plans: search → draft → review → submit
2. ExecutorAgent: executes plan step 1 → calls OpportunityAgent
3. OpportunityAgent: searches grant DBs → returns top 10 matches with fit scores
4. PlannerAgent: user selects grant → plans: draft sections → assemble → validate
5. WriterAgent: generates 15-page draft using GPT-4 + RAG → returns markdown
6. GuardrailAgent: scans draft for PII, prohibited terms → redacts if needed
7. EvaluatorAgent: checks completeness, coherence → quality score 85/100
8a. ExecutorAgent: if score < 70, loop back to WriterAgent with feedback
8b. ExecutorAgent: if score >= 70, save draft → notify user
9. DeadlineAgent: adds deadline to calendar → schedules reminders (7/3/1 days)

Reactive Agent

DeadlineAgent - Triggers on date/time events
Autonomy: Low · Stateless

Reflexive Agent

GuardrailAgent - Applies rules + context
Autonomy: Medium · Reads policy context

Deliberative Agent

WriterAgent - Plans sections, generates content
Autonomy: High · Stateful (draft history)

Orchestrator Agent

PlannerAgent + ExecutorAgent - Manages workflow, handles loops
Autonomy: Highest · Full state management

Levels of Autonomy

  • L1 · Tool: human calls, agent responds (Monday's prompts)
  • L2 · Chained Tools: sequential execution (Tuesday's code)
  • L3 · Agent: makes decisions, can loop (WriterAgent refining a draft)
  • L4 · Multi-Agent: agents collaborate autonomously (this system)

Advanced ML/AI Patterns

Production ML engineering for grant writing systems

RAG vs Fine-Tuning

Grant requirements and funder priorities change constantly. RAG allows daily updates without retraining. Fine-tuning would require quarterly retraining at $5K+ per cycle.
✅ RAG (Chosen)
Cost: $200/month (vector DB + embeddings)
Update: Daily (add new grants, past wins)
How: Embed documents, retrieve top-k, augment prompt
❌ Fine-Tuning
Cost: $5,000/quarter (training + compute)
Update: Quarterly (full retraining)
How: Curate dataset, train, validate, deploy
Implementation: Vector DB (Pinecone/Weaviate) with 10K+ past applications, grant requirements, funder profiles. Retrieved during application generation (top 5 similar grants).
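To make the retrieval step concrete, here is a hedged sketch with an in-memory index standing in for Pinecone/Weaviate; `query_vec` and the document embeddings are assumed to come from your embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec: list[float],
                   index: list[tuple[list[float], str]], k: int = 5) -> list[str]:
    # index holds (embedding, document) pairs; a stand-in for the vector DB.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_prompt(task: str, retrieved: list[str]) -> str:
    context = "\n---\n".join(retrieved)
    return f"Relevant past material:\n{context}\n\nTask: {task}"
```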

Hallucination Detection

LLMs hallucinate facts (fake grants, incorrect amounts, false success stories)
  • L1: confidence scores (GPT-4 logprobs; flag if < 0.7)
  • L2: cross-reference the grant database (verify the grant exists, amount is correct)
  • L3: logical consistency checks (budget adds up, dates are sequential)
  • L4: human review queue (if any layer flags an issue)
Result: 0.5% hallucination rate, 100% caught before submission.
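A sketch of how the layers might chain, with each layer as a boolean check and any flag (or check failure) routing to human review; `budget_adds_up` shows one illustrative L3 consistency check:

```python
def budget_adds_up(draft: dict) -> bool:
    # Illustrative L3 check: line items must sum to the stated total.
    items = draft.get("budget_items", [])
    return abs(sum(i["amount"] for i in items) - draft.get("budget_total", 0)) < 0.01

def run_hallucination_checks(draft: dict, checks: list) -> dict:
    """checks: ordered (name, fn) pairs for layers L1-L3; any flag → human review (L4)."""
    flags = []
    for name, check in checks:
        try:
            if not check(draft):
                flags.append(name)
        except Exception:
            flags.append(f"{name} (check errored)")  # fail toward review, not past it
    return {"flags": flags, "needs_human_review": bool(flags)}
```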

Evaluation Framework

  • Content Quality (BLEU): 73.2 (target: 70+)
  • Coherence (ROUGE-L): 0.78 (target: 0.75+)
  • Human Rating (1-5): 4.2 (target: 4.0+)
  • Win Rate: 34% (target: 30%+)
  • Time Saved: 12 hours (target: 10+ hours/application)
Testing: shadow mode ran 100 applications in parallel with human writers to compare quality and win rate.

Dataset Curation

1. Collect: 5K applications (anonymized from customers)
2. Clean: 4.2K usable (duplicates and incomplete apps removed)
3. Label: 4.2K labeled ($21K)
4. Augment: +1K synthetic (GPT-4 generates edge cases: urgent deadlines, complex budgets)

Result: 5.2K high-quality training examples, updated quarterly.

Agentic RAG

Agent iteratively retrieves based on reasoning
User requests 'education grant for STEM' → OpportunityAgent retrieves 50 grants → Reasons 'need past STEM wins' → RAG retrieves 3 similar wins → WriterAgent generates draft with context
💡 Not one-shot retrieval. Agent decides what else it needs to know, retrieves multiple times.
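A sketch of that loop; `retrieve` and `llm_decide` are hypothetical callables, and the agent keeps asking for more context until the LLM signals it has enough (or a round limit is hit):

```python
def agentic_retrieve(task: str, retrieve, llm_decide, max_rounds: int = 3) -> list[str]:
    """Iterative retrieval: the agent decides what else it needs to know."""
    context: list[str] = []
    query = task
    for _ in range(max_rounds):
        context.extend(retrieve(query))
        # llm_decide returns None when context suffices, else a follow-up
        # query (e.g. "past STEM wins for this funder").
        query = llm_decide(task, context)
        if query is None:
            break
    return context
```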

Multi-LLM Ensemble
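This section is blank in the export, but the pattern is referenced throughout: enterprise deployments route each request across GPT-4, Claude, and Gemini for resilience (see Failure Modes above). A hedged sketch of the failover order, with `clients` a hypothetical map from provider name to a completion callable:

```python
FAILOVER_ORDER = ["gpt-4", "claude-3-opus", "gemini-pro"]

def complete_with_failover(prompt: str, clients: dict) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, completion)."""
    last_err = None
    for name in FAILOVER_ORDER:
        try:
            return name, clients[name](prompt)  # clients: name → completion callable
        except Exception as exc:
            last_err = exc                      # degraded (different model), not broken
    raise RuntimeError("all LLM providers failed") from last_err
```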

Tech Stack Summary

  • Frontend: Next.js 14, React, TailwindCSS, shadcn/ui
  • Backend: Node.js (Express) or Python (FastAPI)
  • LLMs: OpenAI GPT-4 Turbo, Anthropic Claude 3 Opus, Google Gemini Pro
  • Agent Framework: LangGraph, LangChain, or CrewAI
  • Database: PostgreSQL (RDS/Cloud SQL), Redis (ElastiCache/Memorystore)
  • Vector DB: Pinecone, Weaviate, or Qdrant
  • Document Storage: S3, GCS, or Azure Blob
  • Queue: Redis (Bull), RabbitMQ, SQS, or Pub/Sub
  • Compute: Serverless (Lambda/Cloud Run) or Kubernetes (EKS/GKE)
  • Monitoring: Datadog, Prometheus + Grafana, or CloudWatch
  • Security: AWS KMS/GCP KMS, Auth0/Okta, AWS Comprehend/Presidio
  • CI/CD: GitHub Actions, GitLab CI, or CircleCI
🏗️ Need an Architecture Review?

We'll audit your grant system design, identify bottlenecks, and show you how to scale 10x while maintaining quality and compliance.