Property Listings System Architecture 🏗️

From 1,000 to 1,000,000 listings/day with AI enrichment and multi-tenant support

May 15, 2025
17 min read
🏘️ Real Estate · 🏗️ Architecture · 📊 Scalable · 🤖 AI-Powered
🎯 This Week's Journey

From MLS feeds to intelligent property platform.

Monday: 3 core prompts for property enrichment. Tuesday: automation code for ingestion pipelines. Wednesday: team workflows across data, AI, and operations. Thursday: complete production architecture with multi-agent orchestration, ML pipelines, and enterprise-grade scaling patterns.

📋

Key Assumptions

1. Handle 1K-1M listings/day across multiple MLS feeds
2. Support 100-1000 concurrent brokerage tenants
3. Real-time sync (<5 min) for active listings
4. AI enrichment for descriptions, valuations, recommendations
5. Compliance with NAR, DMCA, fair housing regulations

System Requirements

Functional

  • Ingest listings from 800+ MLS APIs (RETS, RESO Web API)
  • Normalize heterogeneous schemas to unified property model
  • AI enrichment: generate descriptions, estimate valuations, extract features
  • Real-time search with filters (price, beds, location, school district)
  • Multi-tenant isolation with per-client branding and data access
  • CRM integration for lead routing and agent assignment
  • Analytics dashboards for market trends and listing performance
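
The schema normalization requirement above can be sketched as a per-feed field mapping. This is a minimal illustration, not the platform's actual mapper: the feed names `feed_a` and `feed_b`, the `feed_b` field names, and the unified keys (`price`, `beds`, `sqft`) are all hypothetical.

```python
from typing import Any, Dict

# Hypothetical per-feed field mappings (source field -> unified field).
# Real MLS feeds differ; these names are illustrative only.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "feed_a": {"ListPrice": "price", "BedroomsTotal": "beds", "LivingArea": "sqft"},
    "feed_b": {"LP": "price", "BR": "beds", "SqFt": "sqft"},
}


def normalize_listing(feed: str, raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map one raw feed record onto the unified property model."""
    mapping = FIELD_MAPS[feed]
    unified: Dict[str, Any] = {
        target: raw[source] for source, target in mapping.items() if source in raw
    }
    # Record which unified fields are absent so incomplete listings can be flagged.
    unified["missing_fields"] = sorted(set(mapping.values()) - set(unified))
    return unified
```

Keeping the mapping as data rather than code means a new feed is a new dictionary entry, which matters when the target is hundreds of feeds.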

Non-Functional (SLOs)

  • Latency (p95): 200 ms
  • Freshness: 5 min
  • Availability: 99.9%
  • Enrichment latency: 30 sec
  • Search latency (p95): 100 ms

💰 Cost Targets: $0.002 per listing ingested, $0.05 per enrichment, $0.0001 per search query
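
To see what the per-unit cost targets imply at volume, here is a small sketch that multiplies them out. The function name and its three inputs are assumptions for illustration; only the three unit prices come from the targets above.

```python
# Unit prices taken from the cost targets; everything else is illustrative.
COST_TARGETS = {
    "per_listing_ingestion_usd": 0.002,
    "per_enrichment_usd": 0.05,
    "per_search_query_usd": 0.0001,
}


def daily_cost_usd(listings: int, enriched: int, searches: int) -> float:
    """Estimate one day's spend from ingestion, enrichment, and search volume."""
    return (
        listings * COST_TARGETS["per_listing_ingestion_usd"]
        + enriched * COST_TARGETS["per_enrichment_usd"]
        + searches * COST_TARGETS["per_search_query_usd"]
    )
```

At 1M listings/day with every listing enriched, enrichment ($50,000) dominates ingestion ($2,000), which is why the enrichment path is where cost controls pay off.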

Agent Layer

planner

L3

Decompose listing ingestion into subtasks, route to specialized agents

🔧 schema_normalizer, policy_checker, cost_estimator

⚡ Recovery: If schema invalid → queue for manual review, If cost exceeds budget → skip optional enrichments

ingestion_executor

L2

Fetch listings from MLS APIs, normalize schemas, store raw data

🔧 rets_client, reso_client, schema_mapper, deduplicator

⚡ Recovery: If MLS API down → retry with exponential backoff (max 3x), If rate limit hit → queue and throttle, If schema mismatch → log and alert
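
The "retry with exponential backoff (max 3x)" recovery could look roughly like the sketch below. It assumes "max 3x" means three retries after the first attempt, that an outage surfaces as `ConnectionError`, and a one-second base delay; none of these are specified in the design.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(
    fetch: Callable[[], T],
    max_retries: int = 3,          # assumption: "max 3x" = 3 retries
    base_delay: float = 1.0,       # assumption: 1s, doubling each retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); on ConnectionError retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller queue or alert
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")
```

The `sleep` parameter is injectable so the retry schedule can be tested without waiting.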

enrichment_executor

L3

Generate AI-powered descriptions, valuations, feature extraction

🔧 llm_api (Claude/GPT), valuation_model, feature_extractor, image_analyzer

⚡ Recovery: If LLM timeout → retry with shorter context, If low confidence (<0.7) → flag for human review, If cost spike → switch to cheaper model

evaluator

L4

Validate enriched data quality, detect hallucinations, ensure compliance

🔧 fact_checker, hallucination_detector, compliance_validator, sentiment_analyzer

⚡ Recovery: If quality < 0.7 → re-enrich with different prompt, If compliance violation → block and alert, If hallucination detected → discard and retry

guardrail

L4

Policy enforcement, PII redaction, fair housing compliance

🔧 pii_detector (AWS Comprehend), policy_engine, fair_housing_checker, dmca_validator

⚡ Recovery: If PII detected → redact and log, If policy violation → block publish and notify admin, If uncertain → escalate to human review
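
The "redact and log" recovery can be illustrated with a deliberately simple regex pass. This is a stand-in, not the AWS Comprehend integration named above: the two patterns (email, US-style phone) and the placeholder tokens are assumptions, and a real detector would cover far more PII types.

```python
import re
from typing import Tuple

# Simplified patterns for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def redact_pii(text: str) -> Tuple[str, int]:
    """Replace emails and phone numbers; return redacted text and match count."""
    text, n_email = EMAIL_RE.subn("[REDACTED_EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[REDACTED_PHONE]", text)
    return text, n_email + n_phone
```

Returning the match count lets the caller decide whether to log the event or escalate to human review.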

orchestrator

L4

Coordinate multi-agent workflows, handle retries, manage state

🔧 state_machine (LangGraph/Temporal), retry_handler, event_bus

⚡ Recovery: If agent fails → retry with different agent, If workflow timeout → checkpoint and resume, If deadlock → abort and alert

ML Layer

Feature Store

Update: batch (daily) for neighborhood stats; real-time on each listing update

  • price_per_sqft
  • days_on_market
  • neighborhood_avg_price
  • school_rating_avg
  • crime_index
  • walkability_score
  • price_change_velocity
  • listing_completeness_score
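
Three of the listing-level features above are simple enough to sketch directly. The definitions below are assumptions (the source names the features but does not define them), in particular treating completeness as the fraction of required fields that are filled.

```python
from datetime import date
from typing import Optional, Sequence


def price_per_sqft(price: float, sqft: float) -> Optional[float]:
    """Price divided by living area; None when area is missing or zero."""
    return round(price / sqft, 2) if sqft > 0 else None


def days_on_market(listed_on: date, as_of: date) -> int:
    """Whole days since listing, never negative."""
    return max((as_of - listed_on).days, 0)


def listing_completeness_score(listing: dict, required: Sequence[str]) -> float:
    """Assumed definition: share of required fields that are present and non-empty."""
    filled = sum(1 for field in required if listing.get(field) not in (None, ""))
    return filled / len(required)
```

Neighborhood-level features such as `neighborhood_avg_price` would come from the daily batch job instead, since they aggregate over many listings.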

Model Registry

Strategy: Semantic versioning with A/B testing

  • description_generator
  • valuation_estimator
  • feature_extractor
  • recommendation_ranker

Observability Stack

Real-time monitoring, tracing & alerting

Pipeline: Sources (apps, services, infra) → Collection (10 metrics) → Processing (aggregate & transform) → Dashboards (5 views) → Alerts (enabled)

Signals: 📊 Metrics (10) · 📝 Logs (structured) · 🔗 Traces (distributed)

Metrics include:
  • listings_ingested_per_min
  • enrichment_latency_p95_sec
  • llm_tokens_used_per_hour
  • cost_per_listing_usd
  • search_latency_p95_ms
  • error_rate_percent

Deployment Variants

🚀

Startup Architecture

Fast to deploy, cost-efficient, scales to 100 customers

Infrastructure

Single AWS region (us-east-1)
Serverless-first (Lambda, DynamoDB, API Gateway)
Managed services (RDS, ElastiCache, Elasticsearch Service)
CloudFront CDN
No Kubernetes (ECS Fargate for containers)
Single-tenant architecture (one DB per customer)
Fast to deploy (<1 week)
Low operational overhead
Pay-per-use pricing
Good for 1-50 tenants
Manual scaling for spikes
Cost: $300-1500/mo

Risks & Mitigations

⚠️ MLS API rate limits (varies by feed, 100-10K req/hour)

High

✓ Mitigation: Implement adaptive rate limiting per MLS. Queue requests. Negotiate higher limits with top MLSs. Cache aggressively (30 min TTL).
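
A per-feed limiter is the core of this mitigation. The sketch below is a plain token bucket, which enforces a fixed rate; the adaptive part (tuning the rate from each MLS's observed limits or throttling responses) is left out. The class name and parameters are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Refills at rate_per_hour and holds at most `capacity` tokens."""

    def __init__(
        self,
        rate_per_hour: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_sec = rate_per_hour / 3600.0
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; callers queue the request otherwise."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per MLS feed, keyed by feed ID, would let a 100 req/hour feed and a 10K req/hour feed be throttled independently.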

⚠️ LLM cost explosion ($0.05/listing × 1M listings = $50K/day)

Medium

✓ Mitigation: Set cost guardrails ($10K/day max). Use cheaper models for non-premium listings. Batch processing during off-peak hours. Cache descriptions for similar properties.
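
The cost guardrail can be expressed as a small routing decision made before each enrichment call. This is a sketch under assumptions: the tier labels `"primary"` and `"cheap"` are placeholders rather than real model names, and skipping enrichment once the cap is reached is one possible policy.

```python
from typing import Optional


def choose_model(
    spent_today_usd: float,
    is_premium: bool,
    daily_cap_usd: float = 10_000.0,  # the $10K/day guardrail from the mitigation
) -> Optional[str]:
    """Return a model tier for this listing, or None to skip enrichment."""
    if spent_today_usd >= daily_cap_usd:
        return None  # cap reached: defer to off-peak batch or skip
    return "primary" if is_premium else "cheap"
```

For scale: at $0.05 per enrichment, a $10K cap covers at most 200,000 enrichments, so at 1M listings/day most listings need the cheaper path, caching, or deferral.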

⚠️ Data quality issues (incomplete/incorrect MLS data)

High

✓ Mitigation: Validation layer checks 50+ fields. Flag incomplete listings for manual review. Auto-enrich missing data from public records. Display confidence scores.

⚠️ Fair housing compliance violations (discriminatory language)

Low

✓ Mitigation: Guardrail agent blocks protected class mentions (race, religion, family status). Human review for flagged content. Regular compliance audits. Legal review of prompts.

⚠️ Multi-tenant data leakage (tenant A sees tenant B's data)

Low

✓ Mitigation: Row-level security in PostgreSQL. Tenant ID in every query. API gateway enforces tenant isolation. Regular security audits. Penetration testing.

⚠️ MLS contract violations (scraping, data redistribution)

Medium

✓ Mitigation: Comply with MLS terms (attribution, data use restrictions). Legal review of contracts. Only display data for authorized users. Audit trail for compliance.

⚠️ Search relevance degradation (poor ranking, stale index)

Medium

✓ Mitigation: Monitor search quality metrics (NDCG, CTR). A/B test ranking algorithms. Real-time index updates (<5 min). User feedback loop (thumbs up/down).
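
NDCG, one of the search quality metrics named above, compares the ranking a user saw against the best possible ordering of the same results. The sketch below uses the linear-gain form of DCG; the relevance grades are whatever the feedback loop assigns and are an assumption here.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the top-k as ranked, divided by DCG of the ideal top-k ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

A score of 1.0 means the results were already in ideal order; a falling average across queries is an early signal of ranking or index staleness problems.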

🧬

Evolution Roadmap

Progressive transformation from MVP to scale

🌱 Phase 1 · Q2 2025

0-3 months (MVP)

1. Launch with 5 MLS feeds
2. Basic AI enrichment (descriptions only)
3. 10 beta customers
4. Prove product-market fit

🌿 Phase 2 · Q3 2025

3-6 months (Scale)

1. Expand to 50 MLS feeds
2. Add valuations and feature extraction
3. 100 paying customers
4. Multi-tenant SaaS

🌳 Phase 3 · Q4 2025 - Q1 2026

6-12 months (Enterprise)

1. Support 800+ MLS feeds (nationwide)
2. Enterprise features (SSO, data residency)
3. 1,000 customers
4. 99.99% SLA

🚀 Production Ready

🏗️
🏗️

Complete Systems Architecture

9-layer architecture from client to data

1. 🌐 Presentation (4 components)
  • Web App (React/Next.js)
  • Mobile App (React Native)
  • Agent Portal
  • Admin Dashboard

2. ⚙️ API Gateway (4 components)
  • Load Balancer (ALB/CloudFront)
  • Rate Limiter (per tenant)
  • Auth Gateway (OIDC/SAML)
  • API Versioning

3. 💾 Agent Layer (6 components)
  • Planner Agent (task decomposition)
  • Ingestion Executor (MLS sync)
  • Enrichment Executor (AI processing)
  • Evaluator Agent (quality checks)
  • Guardrail Agent (policy/PII)
  • Orchestrator (workflow coordination)

4. 🔌 ML Layer (5 components)
  • Feature Store (property metrics)
  • Model Registry (LLMs, valuation models)
  • Inference Service (real-time)
  • Training Pipeline (batch)
  • Evaluation Loop (drift detection)

5. 📊 Integration (4 components)
  • MLS Adapters (RETS/RESO)
  • CRM Connectors (Salesforce/HubSpot)
  • Payment Gateway
  • Notification Service (email/SMS)

6. 🌐 Data (5 components)
  • PostgreSQL (listings, tenants)
  • Elasticsearch (search index)
  • Redis (cache, queues)
  • S3 (images, documents)
  • Vector DB (embeddings)

7. ⚙️ External (5 components)
  • MLS APIs (800+ feeds)
  • LLM APIs (Claude, GPT)
  • Geocoding (Google Maps)
  • School Data (GreatSchools)
  • Crime Data (SpotCrime)

8. 💾 Observability (5 components)
  • Metrics (Prometheus/Datadog)
  • Logs (ELK/CloudWatch)
  • Traces (Jaeger/X-Ray)
  • Dashboards (Grafana)
  • Alerting (PagerDuty)

9. 🔌 Security (5 components)
  • IAM (role-based access)
  • KMS (encryption keys)
  • WAF (DDoS protection)
  • Audit Log (compliance)
  • PII Redaction (Comprehend)

🔄

Sequence Diagram - Listing Ingestion Flow

Automated data flow every hour

Participants: MLS API, Ingestion Worker, Planner Agent, Enrichment Agent, Evaluator Agent, Guardrail Agent, Search Index, Notification Service

Messages, in order:

1. POST new listing (RESO)
2. normalize(raw_data)
3. check_policy(listing)
4. approved (no PII/violations)
5. enrich(listing)
6. generate_description()
7. description + features
8. validate(enriched_data)
9. quality_score=0.92 (pass)
10. INSERT listing
11. index(listing)
12. notify_agents(new_listing)
13. Email/SMS alert

Data Flow - MLS to Search Index

Complete pipeline from ingestion to user search

1. MLS API (0s): Webhook notification (new listing) → RESO JSON
2. Ingestion Worker (2s): Fetch full listing details → Raw MLS data
3. Planner Agent (2.05s): Normalize schema, create task graph → Normalized listing + tasks
4. Guardrail Agent (2.2s): Check for PII and policy violations → Sanitized listing
5. Enrichment Executor (5s): Generate description (LLM) → AI-generated description
6. Enrichment Executor (5.5s): Estimate valuation (ML model) → Price estimate + confidence
7. Enrichment Executor (6s): Extract features (NER model) → Structured features (pool, garage, etc.)
8. Evaluator Agent (6.2s): Quality check (hallucination, completeness) → Quality score + issues
9. Orchestrator (6.3s): Approve and route to storage → Final listing
10. PostgreSQL (6.4s): INSERT listing → Persisted
11. Elasticsearch (6.6s): Index for search → Searchable
12. Notification Service (7s): Alert matching agents → Email/SMS sent

1. Volume: 1K-10K listings/day
Pattern: Monolith with Background Jobs
🏗️ Architecture:
  • Single API server (Node.js/Python)
  • PostgreSQL (single instance)
  • Redis (cache + queue)
  • Background workers (Celery/Bull)
  • Elasticsearch (single node)
Cost & Performance: $300/mo, 5-10 sec ingestion

2. Volume: 10K-100K listings/day
Pattern: Microservices with Message Queue
🏗️ Architecture:
  • API Gateway + Load Balancer
  • Microservices (ingestion, enrichment, search)
  • PostgreSQL (read replicas)
  • Redis Cluster
  • RabbitMQ or SQS
  • Elasticsearch (3-node cluster)
  • Horizontal pod autoscaling
Cost & Performance: $1,500/mo, 3-7 sec ingestion

3. Volume: 100K-1M listings/day (Recommended)
Pattern: Event-Driven Multi-Agent
🏗️ Architecture:
  • Kubernetes (EKS/GKE)
  • Event streaming (Kafka/Kinesis)
  • Agent orchestration (LangGraph/Temporal)
  • PostgreSQL (sharded by tenant)
  • Elasticsearch (multi-region)
  • Redis (distributed cache)
  • Serverless functions (Lambda) for spikes
  • CDN for images (CloudFront)
Cost & Performance: $8,000/mo, 2-5 sec ingestion

4. Volume: 1M+ listings/day
Pattern: Global Multi-Region Platform
🏗️ Architecture:
  • Multi-region Kubernetes
  • Global event mesh (Kafka + replication)
  • Multi-model databases (Postgres + Cassandra)
  • Elasticsearch (geo-replicated)
  • Edge computing (CloudFlare Workers)
  • Multi-LLM failover (Claude + GPT + Gemini)
  • Active-active disaster recovery
  • Dedicated VPCs per enterprise client
Cost & Performance: $30,000+/mo, 1-3 sec ingestion
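
The four volume bands above amount to a lookup from daily volume to architecture pattern. A minimal sketch follows; because the published bands share their endpoints (10K appears in both the first and second band), it assumes each upper bound is inclusive.

```python
# (inclusive upper bound in listings/day, pattern name)
# Assumption: shared endpoints such as 10K belong to the smaller band.
VOLUME_BANDS = [
    (10_000, "Monolith with Background Jobs"),
    (100_000, "Microservices with Message Queue"),
    (1_000_000, "Event-Driven Multi-Agent"),
]


def pick_pattern(listings_per_day: int) -> str:
    """Return the architecture pattern for a given daily listing volume."""
    for upper_bound, pattern in VOLUME_BANDS:
        if listings_per_day <= upper_bound:
            return pattern
    return "Global Multi-Region Platform"
```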

Key Integrations

MLS APIs (RETS/RESO)

Protocol: RETS 1.8 / RESO Web API 2.0
Register webhook for listing updates
Poll every 5 min for changes (fallback)
Fetch full listing data via Search (RETS) or an OData query (RESO Web API); fetch photos via GetObject
Normalize heterogeneous schemas
Store raw + normalized data

CRM Integration (Salesforce/HubSpot)

Protocol: REST API
New lead from listing → POST to CRM
Enrich lead with listing data
Auto-assign to agent based on territory
Sync lead status back to platform

Geocoding & POI (Google Maps)

Protocol: REST API
Extract address from listing
Geocode to lat/lng
Fetch nearby POIs (schools, transit, parks)
Calculate walkability score
Store in property features

School Data (GreatSchools API)

Protocol: REST API
Query schools within 1 mile of listing
Fetch ratings and demographics
Display in listing details
Use in search filters

Image Storage (S3 + CloudFront)

Protocol: S3 API
Upload listing images to S3
Generate thumbnails (Lambda)
Serve via CloudFront CDN
Lazy load in UI

Security & Compliance

🔒

Authentication

Controls
OIDC/SAML for enterprise SSO
OAuth 2.0 for API access
MFA for admin accounts
Session timeout (30 min idle)
Implementation: Auth0 or AWS Cognito
🔒

Authorization (RBAC)

Controls
Tenant-level isolation (row-level security)
Role-based access (admin, agent, viewer)
Resource-level permissions (listing ownership)
API rate limiting per tenant
Implementation: PostgreSQL RLS + API gateway policies
🔒

Secrets Management

Controls
MLS credentials in KMS
API keys rotated every 90 days
No secrets in code or logs
Encrypted environment variables
Implementation: AWS KMS or HashiCorp Vault
🔒

Audit Trail

Controls
Log all data access (who, what, when)
Immutable audit logs (7 year retention)
Real-time anomaly detection
Compliance reports (SOC 2, NAR)
Implementation: CloudWatch Logs + S3 archival
🔒

Data Privacy

Controls
PII redaction before LLM processing
Data residency per tenant (US/EU/CA)
Right to deletion (GDPR/CCPA)
Consent management for marketing
Implementation: AWS Comprehend + geo-replicated DBs

Failure Modes & Fallbacks

Failure: MLS API down (5% of feeds)
Fallback: Retry with exponential backoff (max 3x) → Queue for manual sync → Alert ops team
Impact: Delayed updates for affected feeds (5-30 min)
SLA: 99.5% API availability

Failure: LLM API timeout (Claude/GPT)
Fallback: Retry with shorter context → Switch to backup LLM (GPT ↔ Claude) → Fall back to template-based description
Impact: Lower quality descriptions (template) for <1% of listings
SLA: 99.9% enrichment success

Failure: Elasticsearch cluster unhealthy
Fallback: Read from PostgreSQL (slower) → Rebuild index from DB → Scale cluster
Impact: Search latency 2-5x slower (200ms → 500ms)
SLA: 99.9% search availability

Failure: Database connection pool exhausted
Fallback: Queue writes → Scale read replicas → Shed non-critical traffic (analytics)
Impact: Delayed writes (1-5 min) but reads unaffected
SLA: 99.95% write availability

Failure: Enrichment quality below threshold (<0.7)
Fallback: Retry with different prompt → Route to human review queue → Block publish until approved
Impact: Delayed listing publish (30 min - 2 hours)
SLA: 100% quality compliance

Failure: PII detected in listing description
Fallback: Block publish → Redact PII → Alert compliance team → Re-submit for approval
Impact: Listing not published until sanitized
SLA: 100% PII compliance

Failure: Multi-region replication lag (>5 min)
Fallback: Route reads to local region (eventual consistency) → Alert ops → Investigate network
Impact: Stale data in secondary regions
SLA: 99% cross-region consistency within 5 min
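
The LLM timeout fallback (primary → backup → template) is an ordered chain, which can be sketched as below. The generator callables stand in for real LLM clients, and the listing keys `beds`, `baths`, and `city` used by the template are hypothetical.

```python
from typing import Callable, Sequence


def describe_with_fallback(
    listing: dict,
    generators: Sequence[Callable[[dict], str]],
) -> str:
    """Try each generator in order; fall back to a template if all time out."""
    for generate in generators:
        try:
            return generate(listing)
        except TimeoutError:
            continue  # move on to the next (backup) generator
    # Last resort: template-based description (lower quality, always available).
    return f"{listing['beds']}-bed, {listing['baths']}-bath home in {listing['city']}."
```

Passing the chain as a list keeps the order configurable, so swapping primary and backup providers does not require a code change.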

RAG vs Fine-Tuning

Real estate data changes daily (new listings, price updates, market trends). RAG allows real-time updates without retraining. Fine-tuning requires monthly retraining ($10K+) and lags market changes.
✅ RAG (Chosen)
Cost: $200/mo (vector DB + embeddings)
Update: Real-time
How: Add new listings to vector DB, retrieve top-K for context
❌ Fine-Tuning
Cost: $10K/mo (training + compute)
Update: Monthly
How: Retrain LLM on 100K+ examples
Implementation: Vector DB (Pinecone/Weaviate) with 1M+ listing embeddings. Retrieve top-10 similar listings for description generation. Hybrid search (vector + keyword) for best results.
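
The retrieval step ("retrieve top-10 similar listings") reduces to a nearest-neighbor search over embeddings. The brute-force sketch below shows the idea in pure Python; it is not the Pinecone or Weaviate API, and a vector DB would replace the linear scan with an approximate index.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 10) -> List[str]:
    """Return the IDs of the k listings whose embeddings are most similar to query."""
    ranked = sorted(index, key=lambda lid: cosine(query, index[lid]), reverse=True)
    return ranked[:k]
```

The returned listing IDs would then be used to load the listings whose text goes into the description prompt as context.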

Hallucination Detection

LLMs hallucinate property features (fake amenities, incorrect square footage, false claims)
  • L1: Confidence scores (<0.7 = flag for review)
  • L2: Cross-reference with structured data (beds, baths, sqft must match)
  • L3: Fact-check against MLS database (no pool if MLS says no pool)
  • L4: Human review queue for flagged listings
0.2% hallucination rate, 100% caught before publish
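
The cross-reference and fact-check layers can be sketched as rule checks of generated text against the structured MLS record. This is a deliberately naive illustration: the record keys `beds` and `has_pool` are hypothetical, and the keyword match would wrongly flag phrasing such as "no pool", which a real checker would need to handle.

```python
import re
from typing import List


def check_against_mls(description: str, mls: dict) -> List[str]:
    """Return a list of mismatches between a description and its MLS record."""
    issues: List[str] = []
    # Bedroom count stated in text must match the structured field.
    match = re.search(r"(\d+)[- ]bed", description, re.IGNORECASE)
    if match and int(match.group(1)) != mls.get("beds"):
        issues.append("beds mismatch")
    # A pool may only be mentioned if the MLS record has one.
    if re.search(r"\bpool\b", description, re.IGNORECASE) and not mls.get("has_pool", False):
        issues.append("pool not in MLS record")
    return issues
```

A non-empty result would route the listing to the human review queue rather than publishing it.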

Evaluation Framework

  • Description Quality (BLEU): 0.68 (target: >0.6)
  • Valuation Accuracy (MAE): 4.2% (target: <5%)
  • Feature Extraction (F1): 0.89 (target: >0.85)
  • Hallucination Rate: 0.2% (target: <1%)
  • User Engagement (CTR): 9.3% (target: >8%)
Testing (shadow mode): Run AI enrichment in parallel with manual enrichment for 1,000 listings. Human raters score quality (1-5). Track CTR and lead conversion.
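
Two of the metrics in this framework are easy to compute directly. Since valuation accuracy is reported as a percentage, the sketch assumes it is a mean absolute percentage error against sale price; F1 is computed over sets of extracted feature labels. Both function names are illustrative.

```python
from typing import Sequence, Set


def mean_abs_pct_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average of |predicted - actual| / actual, expressed in percent."""
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


def f1_score(predicted: Set[str], actual: Set[str]) -> float:
    """Harmonic mean of precision and recall over extracted feature labels."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```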

Dataset Curation

1. Collect: 100K listings - Historical MLS data
2. Clean: 85K usable - Remove duplicates, incomplete data
3. Label: 85K labeled ($170K)
4. Augment: +15K synthetic - Generate edge cases (luxury, rural, fixer-upper)
100K high-quality training examples (inter-rater reliability: 0.88)

Agentic RAG

Agent iteratively retrieves based on reasoning. Not one-shot retrieval.
Listing mentions 'near downtown' → RAG retrieves downtown listings → Agent reasons 'need walkability data' → RAG retrieves transit/walkability scores → Agent reasons 'need comparable sales' → RAG retrieves recent sales → Agent generates description with full context.
💡 Agent decides what else it needs to know. Multi-hop reasoning. Better context selection.
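
The multi-hop loop described above can be sketched as: ask a decision function what is still missing, retrieve it, and repeat. In this sketch `decide_next` is a rule-based stand-in for the LLM's reasoning step, and the source names (`similar_listings`, `walkability`, `recent_sales`) are hypothetical labels for the three retrievals in the example.

```python
from typing import Callable, Dict, List, Optional

Retriever = Callable[[str], List[str]]
Context = Dict[str, List[str]]


def agentic_retrieve(
    listing_text: str,
    retrievers: Dict[str, Retriever],
    decide_next: Callable[[str, Context], Optional[str]],
    max_hops: int = 3,
) -> Context:
    """Run up to max_hops retrieval rounds, letting decide_next pick each source."""
    context: Context = {}
    for _ in range(max_hops):
        need = decide_next(listing_text, context)
        if need is None:
            break  # the agent has enough context
        context[need] = retrievers[need](listing_text)
    return context


def decide_next(text: str, context: Context) -> Optional[str]:
    """Hypothetical stand-in for LLM reasoning: request each source once, in order."""
    for source in ("similar_listings", "walkability", "recent_sales"):
        if source not in context:
            return source
    return None
```

The `max_hops` bound matters in practice: without it, a model that keeps asking for more context would loop indefinitely and run up retrieval and token costs.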

Multi-Model Ensemble

Tech Stack Summary

  • LLMs: Claude 3.5 Sonnet (primary), GPT-4 (backup), Gemini 1.5 Pro (cost-effective)
  • Orchestration: LangGraph (agent workflows), Temporal (long-running workflows)
  • Database: PostgreSQL (listings, tenants), Redis (cache, queues), Elasticsearch (search)
  • Vector DB: Pinecone or Weaviate
  • Message Queue: Kafka (high volume), SQS (serverless), RabbitMQ (startup)
  • Compute: Kubernetes (EKS), Lambda (serverless), ECS Fargate (containers)
  • Storage: S3 (images, backups), CloudFront (CDN)
  • Monitoring: Datadog (metrics, APM), CloudWatch (logs), Sentry (errors)
  • Security: AWS KMS (encryption), WAF (DDoS), Comprehend (PII detection)
  • ML Platform: SageMaker (training), MLflow (model registry), Feast (feature store)
© 2026 Randeep Bhatia. All Rights Reserved.

No part of this content may be reproduced, distributed, or transmitted in any form without prior written permission.