The difference between teams that ship AI products weekly and teams that ship monthly often comes down to infrastructure choices made in the first few weeks. Great infrastructure disappears: you don't think about it, you just ship.
A chat endpoint deployed at the edge, close to the user, is the foundation. The handler body below the `req.json()` line is a sketch; the model name and TTL are illustrative:

```ts
// api/chat/route.ts - Edge function for AI chat
import { OpenAI } from 'openai';
import { Redis } from '@upstash/redis';

export const runtime = 'edge'; // Deploy to 200+ edge locations

const openai = new OpenAI();
const redis = Redis.fromEnv();

export async function POST(req: Request) {
  const { messages, sessionId } = await req.json();

  // Sketch: ask the model, persist the session, reply.
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // illustrative
    messages,
  });
  const reply = completion.choices[0].message;
  await redis.set(`session:${sessionId}`, JSON.stringify([...messages, reply]), { ex: 3600 });
  return Response.json(reply);
}
```
Streaming matters as much as placement. Vercel's `ai` package turns an OpenAI stream into a response the browser can render token by token; the continuation after `req.json()` is a sketch with an illustrative model choice:

```ts
// Vercel Edge Function with OpenAI streaming
import { OpenAIStream, StreamingTextResponse } from 'ai';
import OpenAI from 'openai';

export const config = {
  runtime: 'edge', // Run at the edge for lowest latency
};

const openai = new OpenAI();

export default async function handler(req: Request) {
  const { messages, userId } = await req.json();

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // illustrative
    messages,
    stream: true,
    user: userId, // tag requests per user for provider-side abuse monitoring
  });

  // Pipe tokens to the client as they arrive.
  return new StreamingTextResponse(OpenAIStream(response));
}
```
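On the client, that stream can be consumed with plain `fetch`. This is a minimal sketch: the `/api/chat` path is assumed to match the handler above, and `render` stands in for whatever updates your UI:

```ts
// Read the token stream incrementally and append it to the page.
const res = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ messages, userId }),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
let text = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  text += decoder.decode(value, { stream: true });
  render(text); // hypothetical UI callback, e.g. a React state setter
}
```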
Exact-match caching misses paraphrases. A semantic cache embeds each query and reuses the stored answer when a new question lands close enough in embedding space. The function body is a sketch, with an illustrative embedding model:

```ts
import { OpenAI } from 'openai';
import { Index } from '@upstash/vector';

const openai = new OpenAI();
const vectorIndex = new Index();

const SIMILARITY_THRESHOLD = 0.95;
const CACHE_TTL = 3600; // 1 hour

async function semanticCache(
  query: string,
  generateFn: () => Promise<string>
): Promise<string> {
  // Embed the query, then look for a semantically similar cached answer.
  const embedding = await openai.embeddings.create({ model: 'text-embedding-3-small', input: query });
  const vector = embedding.data[0].embedding;
  const [match] = await vectorIndex.query({ vector, topK: 1, includeMetadata: true });
  if (match && match.score >= SIMILARITY_THRESHOLD) {
    return match.metadata!.response as string; // near-duplicate: serve the cached answer
  }
  const response = await generateFn(); // miss: generate, then cache for next time
  await vectorIndex.upsert({ id: crypto.randomUUID(), vector, metadata: { response } });
  return response;
}
```
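In use, it wraps any expensive generation; `answerWithLLM` here is a hypothetical helper:

```ts
// Repeat and near-repeat questions skip the model entirely.
const answer = await semanticCache(
  'How do I cancel my subscription?',
  () => answerWithLLM('How do I cancel my subscription?')
);
```

The 0.95 threshold is deliberately conservative: lowering it raises the hit rate but risks serving an answer to a subtly different question. And while the sketch above does not enforce `CACHE_TTL`, storing an `expiresAt` timestamp in the metadata and treating stale hits as misses is one way to honor it.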
Every request should clear a rate limit before any tokens are paid for. The limiter below is a sketch with illustrative limits:

```ts
// app/api/chat/route.ts
import { OpenAIStream, StreamingTextResponse } from 'ai'
import OpenAI from 'openai'
import { Redis } from '@upstash/redis'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const redis = Redis.fromEnv()

export async function POST(req: Request) {
  const { messages, userId } = await req.json()

  // Rate limiting check: fixed window of 20 requests per user per minute
  const window = Math.floor(Date.now() / 60_000)
  const count = await redis.incr(`ratelimit:${userId}:${window}`)
  if (count === 1) await redis.expire(`ratelimit:${userId}:${window}`, 60)
  if (count > 20) return new Response('Rate limit exceeded', { status: 429 })

  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // illustrative
    messages,
    stream: true,
  })
  return new StreamingTextResponse(OpenAIStream(response))
}
```
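A hand-rolled fixed window is fine to start, but it lets a user burst up to double the limit by straddling a window boundary. The sliding-window limiter from `@upstash/ratelimit`, which appears in the stack below, closes that gap for one extra dependency.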
No provider is healthy 100% of the time. Hiding them all behind a common interface turns failover into a routing decision:

```ts
// lib/ai-providers.ts
type Message = { role: string; content: string } // minimal shape, assumed here

interface AIProvider {
  name: string
  isHealthy: boolean
  latencyP99: number
  complete(messages: Message[]): Promise<AsyncIterable<string>>
}

class AIProviderManager {
  private providers: AIProvider[] = []
  private healthCheckInterval: NodeJS.Timeout // set by a periodic health-check loop (omitted)
}
```
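The routing policy is the interesting part. Here is a hedged sketch of one reasonable choice, written as a standalone function against the interface above rather than as the manager's actual method:

```ts
// Prefer the healthy provider with the lowest tail latency, and fail
// over down the list when a call throws.
async function completeWithFailover(
  providers: AIProvider[],
  messages: Message[]
): Promise<AsyncIterable<string>> {
  const ranked = providers
    .filter((p) => p.isHealthy)
    .sort((a, b) => a.latencyP99 - b.latencyP99)
  for (const provider of ranked) {
    try {
      return await provider.complete(messages)
    } catch {
      provider.isHealthy = false // sidelined until the next health check passes
    }
  }
  throw new Error('No healthy AI providers available')
}
```

Sorting by `latencyP99` rather than average latency optimizes for the experience of the slowest requests, which is usually what users notice.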
Put together, the speed stack fits in one module. The rate-limit values are illustrative:

```ts
// lib/speed-stack.ts
import { Redis } from '@upstash/redis'
import { Ratelimit } from '@upstash/ratelimit'
import { OpenAI } from 'openai'
import { Anthropic } from '@anthropic-ai/sdk'

// 1. Multi-layer caching: a process-local map in front of shared Redis
const redis = Redis.fromEnv()
const browserCache = new Map<string, { data: string; timestamp: number }>()

// 2. Rate limiting with sliding window
const ratelimit = new Ratelimit({
  redis,
  limiter: Ratelimit.slidingWindow(20, '60 s'),
})

// 3. Both providers ready up front, so failover never waits on a cold client
const openai = new OpenAI()
const anthropic = new Anthropic()
```
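Here is a sketch of how the two caching layers compose on the request path; `cachedAnswer` and `CACHE_TTL_MS` are illustrative names, not part of the module above:

```ts
const CACHE_TTL_MS = 60_000 // keep the in-memory layer deliberately short-lived

async function cachedAnswer(key: string, generate: () => Promise<string>): Promise<string> {
  // Layer 1: process-local memory, zero network hops
  const local = browserCache.get(key)
  if (local && Date.now() - local.timestamp < CACHE_TTL_MS) return local.data

  // Layer 2: Redis, shared across regions
  const shared = await redis.get<string>(key)
  if (shared) {
    browserCache.set(key, { data: shared, timestamp: Date.now() })
    return shared
  }

  // Miss on both layers: generate once, then populate top-down
  const fresh = await generate()
  await redis.set(key, fresh, { ex: 3600 })
  browserCache.set(key, { data: fresh, timestamp: Date.now() })
  return fresh
}
```

Rate limiting is then a single call at the top of each handler: `const { success } = await ratelimit.limit(userId)`, rejecting the request when `success` is false.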