Traditional software testing assumes deterministic outputs—run the same input, get the same output, every time. AI products shatter this assumption completely, creating a testing paradox that paralyzes many teams.
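To see the paradox concretely, compare a conventional unit test with a property-based check on an AI response. The sketch below is illustrative only: `add_tax` and the `ai_client.summarize()` fixture are hypothetical stand-ins, not part of any specific library.

```python
# Deterministic code: the same input always produces the same output,
# so an exact-match assertion is meaningful.
def add_tax(price: float, rate: float = 0.08) -> float:
    return round(price * (1 + rate), 2)


def test_add_tax():
    assert add_tax(100.0) == 108.0


# An AI model can phrase its answer differently on every call, so the test
# has to assert properties of the output rather than its exact text.
def test_summary_properties(ai_client):  # ai_client: hypothetical fixture wrapping your model
    summary = ai_client.summarize("The quarterly report shows revenue grew 12%...")
    assert summary.strip()                    # non-empty
    assert len(summary) <= 500                # respects the length budget
    assert "revenue" in summary.lower()       # keeps the key fact
```

Exact-match assertions break the moment the model rephrases an answer; property checks like these survive rewording while still catching real failures.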
```python
import pytest
import asyncio
from your_ai_client import AIClient

client = AIClient()


class TestAISmokeTests:
    """Fast smoke tests - run on every commit"""

    @pytest.mark.asyncio  # requires the pytest-asyncio plugin
    @pytest.mark.timeout(5)
    async def test_basic_completion_works(self):
        """AI responds to basic prompt"""
        # Minimal sketch of the assertion: assumes the client exposes an async
        # complete() that returns text. Adjust the call to your client's real API.
        response = await client.complete("Say hello in one short sentence.")
        assert isinstance(response, str)
        assert response.strip()
```
```python
import pytest
import time
from typing import Any
import json


class AISmokeTests:
    """Run in <30 seconds on every commit"""

    def __init__(self, model_client):
        self.client = model_client
        self.timeout = 10  # seconds per test

    def check_basic_response(self) -> dict[str, Any]:
        """Single smoke check; assumes a synchronous client.complete() that returns text."""
        start = time.time()
        response = self.client.complete("Reply with the word OK.")
        elapsed = time.time() - start
        return {
            "passed": bool(response) and elapsed < self.timeout,
            "latency_seconds": round(elapsed, 2),
        }
```
```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Dict
import statistics


class RegressionSeverity(Enum):
    CRITICAL = "critical"  # Block deployment
    MAJOR = "major"        # Investigate first
    MINOR = "minor"        # Monitor closely
    NOISE = "noise"        # Ignore


@dataclass
class RegressionReport:
    # Illustrative assumption: a report pairing baseline and current scores
    # for one metric; rename the fields to fit your evaluation pipeline.
    metric_name: str
    baseline_scores: List[float]
    current_scores: List[float]
    severity: RegressionSeverity

    def mean_delta(self) -> float:
        """Mean change vs. baseline (negative means a regression)."""
        return statistics.mean(self.current_scores) - statistics.mean(self.baseline_scores)
```
```python
import pytest
import asyncio
from typing import Literal
import time


class AISmokeTester:
    def __init__(self, client, timeout: float = 10.0):
        self.client = client
        self.timeout = timeout
        self.results = []

    async def test_endpoint(
        self, name: str, prompt: str
    ) -> Literal["pass", "fail", "timeout"]:
        # The full signature and body are an assumed sketch; adapt the call to your client's API.
        try:
            response = await asyncio.wait_for(
                self.client.complete(prompt), timeout=self.timeout
            )
            status = "pass" if response else "fail"
        except asyncio.TimeoutError:
            status = "timeout"
        self.results.append({"name": name, "status": status})
        return status
```
```python
import json
from dataclasses import dataclass
from enum import Enum
import hashlib


class TestCategory(Enum):
    REGRESSION = "regression"
    EDGE_CASE = "edge_case"
    ADVERSARIAL = "adversarial"


@dataclass
class GoldenExample:
    # Fields below are an assumed sketch of a golden-dataset entry; rename to fit your schema.
    prompt: str
    expected_behavior: str
    category: TestCategory

    @property
    def example_id(self) -> str:
        """Stable ID so an example keeps its identity as the dataset evolves."""
        return hashlib.sha256(self.prompt.encode("utf-8")).hexdigest()[:12]
```
```python
from pydantic import BaseModel
from typing import Literal


class JudgmentCriteria(BaseModel):
    name: str
    description: str
    score: Literal[1, 2, 3, 4, 5]
    reasoning: str


class JudgmentResult(BaseModel):
    criteria_scores: list[JudgmentCriteria]
    overall_score: float
```
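As a usage sketch, assuming the judge model has been instructed to return JSON in this shape, the raw output can be validated directly into these models with pydantic v2's `model_validate_json`:

```python
# Assumes the judge's raw response is JSON matching JudgmentResult (pydantic v2).
raw_judge_output = """
{
  "criteria_scores": [
    {"name": "accuracy", "description": "Response is factually correct",
     "score": 4, "reasoning": "One minor date error."}
  ],
  "overall_score": 4.0
}
"""

result = JudgmentResult.model_validate_json(raw_judge_output)
print(result.overall_score)             # 4.0
print(result.criteria_scores[0].score)  # 4
```

Constraining `score` to `Literal[1, 2, 3, 4, 5]` means an out-of-range or fractional per-criterion score fails validation loudly instead of silently skewing averages.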