Testing Agents: Building Confidence in Autonomous Systems
Testing AI agents presents unique challenges that traditional software testing methodologies simply weren't designed to handle. Unlike deterministic applications where the same input always produces the same output, agents exhibit emergent behaviors, make autonomous decisions, and interact with unpredictable environments in ways that can surprise even their creators.
67% of agent failures in production traced to inadequate testing
A comprehensive analysis of over 200 production agent deployments on AWS revealed that two-thirds of critical failures could have been prevented with proper testing strategies.
Key Insight
Agent Testing Requires a Fundamentally Different Mindset
Traditional software testing assumes deterministic behavior—you can write assertions that check for exact outputs. Agent testing must embrace non-determinism as a feature, not a bug.
Framework
The Agent Testing Pyramid
Unit Tests (Foundation - 60%)
Test individual components in isolation: tool implementations, prompt templates, and parsing logic.
Integration Tests (Middle - 25%)
Test interactions between components: tool chains, memory systems, and orchestration logic. Use recorded LLM responses to keep these tests fast and repeatable.
End-to-End Tests (Top - 10%)
Test complete agent workflows against real or simulated environments. These are slower and more expensive, so reserve them for critical workflows.
Chaos and Adversarial Tests (Peak - 5%)
Stress test agents with malicious inputs, service failures, and edge cases. Include prompt injection attempts and simulated dependency outages.
- Test cases must cover behavioral outcomes and goals
- Mocking requires realistic response distributions
- Behavioral coverage and scenario coverage matter more
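A behavioral assertion along these lines checks outcomes rather than exact strings. A minimal sketch, assuming a simple reply object (`AgentReply` and the helper name are illustrative, not from a real framework):

```python
import re
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentReply:
    text: str
    tools_used: List[str] = field(default_factory=list)

def assert_behavior(reply: AgentReply, must_mention: List[str],
                    allowed_tools: List[str]) -> List[str]:
    """Return a list of behavioral violations instead of failing on exact output."""
    violations = []
    for term in must_mention:
        if not re.search(re.escape(term), reply.text, re.IGNORECASE):
            violations.append(f"missing topic: {term}")
    for tool in reply.tools_used:
        if tool not in allowed_tools:
            violations.append(f"unexpected tool: {tool}")
    return violations

# Two phrasings of the same correct answer both pass the behavioral check
reply_a = AgentReply("Your refund of $20 was processed.", ["refund_lookup"])
reply_b = AgentReply("We processed the $20 refund today!", ["refund_lookup"])
ok_a = assert_behavior(reply_a, ["refund"], ["refund_lookup"])
ok_b = assert_behavior(reply_b, ["refund"], ["refund_lookup"])
```

Because the check returns a violation list rather than raising on the first mismatch, a test report can show every behavioral gap for a given input at once.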
Setting Up an Agent Testing Framework on AWS (Python)

# agent_testing/framework.py
import pytest
import boto3
from dataclasses import dataclass
from typing import Callable, List, Optional
from enum import Enum
import json
import hashlib

class TestMode(Enum):
    UNIT = "unit"                 # Mocked LLM, fast
    INTEGRATION = "integration"   # Recorded responses
    E2E = "e2e"                   # Live model calls, slow and costly

@dataclass
class AgentTestCase:
    name: str
    mode: TestMode
    input_text: str
    expected_behaviors: List[str]          # behavioral assertions, not exact outputs
    validator: Optional[Callable] = None   # custom check on the agent's response
Always Test with Production-Like Data
Agent behavior can vary dramatically based on input characteristics. A customer service agent tested only with polite, well-formed queries will fail when faced with typos, multiple languages, or frustrated customers using profanity.
Stripe
Building a Comprehensive Agent Testing Pipeline
The testing pipeline caught 23 critical bugs before production deployment.
Key Insight
Mock LLM Responses Must Be Realistic, Not Perfect
A common mistake in agent testing is creating overly clean mock responses that don't reflect real LLM behavior. Production LLMs occasionally produce malformed JSON, include unexpected preambles, or generate responses that technically follow instructions but miss the intent.
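One way to capture that messiness is a mock that samples from a distribution of realistic response shapes rather than always returning clean JSON. A sketch (the response templates and parser are invented for illustration):

```python
import json
import random
from typing import Optional

# Realistic response shapes: clean JSON, preamble-wrapped JSON, and truncated output
RESPONSE_TEMPLATES = [
    lambda payload: json.dumps(payload),                                  # clean
    lambda payload: "Sure! Here is the result:\n" + json.dumps(payload),  # preamble
    lambda payload: json.dumps(payload)[:-1],                             # malformed
]

def mock_llm_response(payload: dict, rng: random.Random) -> str:
    """Sample one realistic (possibly malformed) LLM response."""
    return rng.choice(RESPONSE_TEMPLATES)(payload)

def robust_parse(raw: str) -> Optional[dict]:
    """Parser under test: tolerate preambles, reject malformed JSON."""
    start = raw.find("{")
    if start == -1:
        return None
    try:
        return json.loads(raw[start:])
    except json.JSONDecodeError:
        return None

rng = random.Random(42)  # seeded so the test run is reproducible
results = [robust_parse(mock_llm_response({"status": "ok"}, rng)) for _ in range(100)]
parsed = [r for r in results if r is not None]
```

A parser tested only against the first template would show 100% success in CI and then fail on the preamble and truncation cases that real models produce.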
Implementing Unit Tests for Agent Tools
1. Define Tool Contracts
2. Generate Test Cases from Contracts
3. Test Input Validation
4. Test Output Formatting
5. Test Error Handling
Unit Testing Agent Tools with Pytest (Python)

# tests/test_tools/test_database_tool.py
import pytest
from hypothesis import given, strategies as st
from unittest.mock import Mock, patch
from agent.tools.database import DatabaseQueryTool
from agent.tools.schemas import QueryInput, QueryOutput

class TestDatabaseQueryTool:
    @pytest.fixture
    def tool(self):
        return DatabaseQueryTool(connection=Mock())

    def test_rejects_empty_query(self, tool):
        with pytest.raises(ValueError):
            tool.run(QueryInput(query=""))

    @given(query=st.text(min_size=1, max_size=500))
    def test_arbitrary_text_returns_typed_output(self, query):
        tool = DatabaseQueryTool(connection=Mock())
        assert isinstance(tool.run(QueryInput(query=query)), QueryOutput)
Unit Testing Checklist for Agent Components
Anti-Pattern: Testing Only the Happy Path
❌ Problem
Teams deploy agents that work flawlessly in demos but fail catastrophically in production.
✓ Solution
Build test suites that explicitly cover failure modes. For every happy-path test, add at least one test that exercises a corresponding failure mode.
Key Insight
Snapshot Testing Catches Prompt Regression
When you modify prompts or agent logic, subtle changes in behavior can slip through traditional tests. Snapshot testing captures the agent's complete response (including reasoning traces) for a set of canonical inputs and alerts you when outputs change.
Implementing Snapshot Testing for Agent Responses (Python)

# tests/snapshot_testing.py
import json
import hashlib
import boto3
from datetime import datetime
from typing import Optional
from deepdiff import DeepDiff

class AgentSnapshotTester:
    def __init__(self, agent_name: str, version: str):
        self.agent_name = agent_name
        self.version = version
        self.s3 = boto3.client("s3")

    def snapshot_key(self, input_text: str) -> str:
        digest = hashlib.sha256(input_text.encode()).hexdigest()[:16]
        return f"snapshots/{self.agent_name}/{self.version}/{digest}.json"

    def diff(self, stored: dict, current: dict) -> DeepDiff:
        # Ignore fields that legitimately vary between runs
        return DeepDiff(stored, current,
                        exclude_paths=["root['timestamp']", "root['request_id']"])
Agent Testing Pipeline Architecture on AWS
Git Push → CodeBuild: Unit Tests → CodeBuild: Integration Tests → Step Functions: E2E Tests
Use Deterministic Seeds for Reproducible Tests
When testing with real LLMs, set temperature to 0, use deterministic sampling where available, and pin the exact model version so results are comparable across runs.
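With Amazon Bedrock's `converse` API, this amounts to pinning the sampling parameters and model version in every test request. A sketch (the model ID is an example; the helper is not from a real framework):

```python
def deterministic_request(model_id: str, prompt: str) -> dict:
    """Build a Bedrock `converse` request pinned for reproducibility."""
    return {
        "modelId": model_id,  # pin an exact version, never an unversioned alias
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "temperature": 0.0,  # greedy sampling for maximum reproducibility
            "topP": 1.0,
            "maxTokens": 512,
        },
    }

req = deterministic_request(
    "anthropic.claude-3-5-sonnet-20240620-v1:0", "Classify this ticket."
)
# In a test, the same request dict would be sent via:
# boto3.client("bedrock-runtime").converse(**req)
```

Keeping the request builder in one place guarantees every test tier uses identical sampling settings, so any output drift points at the prompt or the model, not the harness.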
Notion
Implementing Behavioral Test Suites for AI Features
The behavioral testing approach caught 31 quality regressions that traditional tests missed.
Essential Tools for Agent Testing
- pytest-asyncio
- Hypothesis
- DeepDiff
- LocalStack
Practice Exercise (45 min): Build a Unit Test Suite for an Agent Tool
Framework
Agent Testing Pyramid
Unit Tests (Base Layer)
Fast, isolated tests for individual functions like prompt templates, tool parsers, and state transformations.
Component Tests
Test individual tools, memory systems, and decision modules with controlled inputs. Use recorded LLM responses for determinism.
Integration Tests
Verify multi-component workflows like tool chains and memory retrieval pipelines. Use lightweight LLM calls or cached responses.
Behavioral Tests
Test agent decision-making patterns against scenario libraries. These verify the agent chooses appropriate actions for each scenario.
Testing Strategies: Traditional Software vs. AI Agents
Traditional Software Testing
- Deterministic outputs enable exact assertions
- Test coverage measured by code path execution
- Mocking external services is straightforward
- Performance testing focuses on latency and throughput
Analysis of production agent failures reveals that 80% of issues stem from 20% of scenarios: malformed tool responses, context window overflow, ambiguous user intents, and state corruption during multi-turn conversations. Prioritize test coverage for these high-risk areas before expanding to edge cases.
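Those four high-risk areas map naturally onto a small prioritized suite that is run before any broader edge-case exploration. A sketch, with an invented agent stub standing in for the real agent under test:

```python
HIGH_RISK_SCENARIOS = {
    "malformed_tool_response": '{"result": ',      # truncated JSON from a tool
    "context_overflow": "word " * 200_000,         # input far beyond the context window
    "ambiguous_intent": "it didn't work, fix it",  # no referent, no details
    "state_corruption": None,                      # conversation state lost mid-turn
}

def agent_stub(payload):
    """Stand-in for the agent under test: must degrade gracefully, never raise."""
    if payload is None:
        return "error: conversation state missing, asking user to restate"
    if isinstance(payload, str) and len(payload) > 100_000:
        return "error: input truncated to fit context window"
    return "handled"

def run_high_risk_suite(agent):
    """Exercise every high-risk scenario and report any crashes."""
    failures = []
    for name, payload in HIGH_RISK_SCENARIOS.items():
        try:
            agent(payload)
        except Exception as exc:
            failures.append((name, repr(exc)))
    return failures

failures = run_high_risk_suite(agent_stub)
```

The suite asserts only that the agent degrades gracefully in each category; behavioral quality checks for these scenarios can be layered on once the crash-free baseline holds.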
Implementing Integration Tests for Multi-Tool Agent Workflows
1. Define Tool Interaction Contracts
2. Build Tool Response Recorders
3. Create Workflow Scenario Files
4. Implement State Verification Checkpoints
5. Build Timeout and Retry Testing
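The recorder in step 2 can be a thin wrapper that captures real tool responses on a recording run and replays them deterministically afterwards. A sketch (the cassette format and class name are invented for illustration):

```python
import json
import hashlib
import os
import tempfile
from pathlib import Path

class ToolRecorder:
    """Record real tool responses once, then replay them in integration tests."""
    def __init__(self, cassette_path: str, mode: str = "replay"):
        self.path = Path(cassette_path)
        self.mode = mode  # "record" or "replay"
        self.cassette = json.loads(self.path.read_text()) if self.path.exists() else {}

    def _key(self, tool_name: str, args: dict) -> str:
        raw = tool_name + json.dumps(args, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()[:16]

    def call(self, tool_name: str, args: dict, real_fn=None):
        key = self._key(tool_name, args)
        if self.mode == "replay":
            if key not in self.cassette:
                raise KeyError(f"no recording for {tool_name}({args})")
            return self.cassette[key]
        result = real_fn(**args)  # record mode: hit the real tool
        self.cassette[key] = result
        self.path.write_text(json.dumps(self.cassette, indent=2))
        return result

# Record once against a stand-in "real" tool, then replay deterministically
cassette = os.path.join(tempfile.gettempdir(), "agent_tool_cassette.json")
Path(cassette).unlink(missing_ok=True)  # fresh cassette for the demo
rec = ToolRecorder(cassette, mode="record")
rec.call("get_weather", {"city": "Paris"}, real_fn=lambda city: {"temp_c": 21})
replayer = ToolRecorder(cassette, mode="replay")
replayed = replayer.call("get_weather", {"city": "Paris"})
```

Keying recordings by tool name plus sorted arguments means the replay fails loudly (rather than silently returning stale data) whenever the agent starts calling a tool with arguments no recording covers.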
Anti-Pattern: Testing Only Happy Paths
❌ Problem
Production agents encounter untested failure modes and behave unpredictably.
✓ Solution
Implement failure injection testing as a first-class practice. For every tool integration, inject timeouts, errors, and malformed responses.
Linear
Simulation Environment for Issue Triage Agent Testing
The simulation environment discovered 34 edge cases in the first month that manual test design had missed.
Framework
TRACE: Test Reliability for Agent Correctness Evaluation
Tolerance Bands
Define acceptable variation ranges for each metric. Rather than pass/fail, tests report whether results fall within tolerance.
Reproducibility Scores
Track how consistently tests produce the same result across runs. Tests with reproducibility below your chosen threshold should be flagged for review.
Assertion Confidence Levels
Categorize assertions by confidence: 'structural' (100% deterministic) and 'semantic' (embedding-based similarity checks).
Cost Attribution
Track LLM token usage and API costs per test. Identify expensive tests that could use cached responses instead.
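The tolerance-band idea reduces to a small helper that reports where a metric falls rather than a bare pass/fail. A sketch (the band values are arbitrary illustrations, not recommendations):

```python
from dataclasses import dataclass

@dataclass
class ToleranceBand:
    """Acceptable range for a metric, with a wider warning zone around it."""
    metric: str
    ok_low: float
    ok_high: float
    warn_margin: float  # extra slack before the result counts as a failure

    def grade(self, value: float) -> str:
        if self.ok_low <= value <= self.ok_high:
            return "pass"
        if self.ok_low - self.warn_margin <= value <= self.ok_high + self.warn_margin:
            return "warn"
        return "fail"

band = ToleranceBand(metric="response_length_words",
                     ok_low=50, ok_high=200, warn_margin=30)
grades = [band.grade(120), band.grade(215), band.grade(400)]
```

The three-way grade lets a CI job fail only on hard violations while surfacing "warn" results as trends to review, which fits the non-deterministic outputs the TRACE framework targets.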
Version Lock Your Test Models
Always pin specific model versions in test configurations. Using 'latest' or unversioned model IDs means your tests can suddenly fail when providers update models.
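A pinned test configuration can be as simple as a checked-in mapping plus a guard that rejects unversioned IDs. A sketch (the model IDs are examples; verify the current IDs for your provider):

```python
# Pinned model versions for each test tier; never "latest" or bare aliases
TEST_MODELS = {
    "unit": "mock",  # no real model in unit tests
    "integration": "anthropic.claude-3-haiku-20240307-v1:0",
    "e2e": "anthropic.claude-3-5-sonnet-20240620-v1:0",
}

FORBIDDEN_FRAGMENTS = ("latest", ":*")

def validate_model_pins(models: dict) -> list:
    """Return the config entries that are not safely pinned."""
    bad = []
    for tier, model_id in models.items():
        if any(frag in model_id for frag in FORBIDDEN_FRAGMENTS):
            bad.append(tier)
    return bad

bad = validate_model_pins(TEST_MODELS)
```

Running the guard as its own unit test turns "someone switched a tier to `latest`" from a silent source of flaky results into an immediate CI failure.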
Building a Simulation Environment for Agent Testing (Python)

import boto3
import random
from dataclasses import dataclass
from typing import List, Dict, Generator
import uuid
from datetime import datetime, timedelta

@dataclass
class SimulatedUser:
    id: str
    name: str
    role: str  # 'engineer', 'pm', 'customer'

def generate_users(count: int) -> List[SimulatedUser]:
    """Create a population of simulated users with mixed roles."""
    roles = ["engineer", "pm", "customer"]
    return [
        SimulatedUser(id=str(uuid.uuid4()), name=f"user-{i}", role=random.choice(roles))
        for i in range(count)
    ]
3.2x increase in bugs caught pre-production with simulation testing
Organizations implementing simulation-based testing for AI agents catch 3.2 times more bugs before production deployment compared to manual test case creation alone.
Traditional regression testing compares outputs, but agent regression testing must compare behaviors. Develop 'behavioral fingerprints' that characterize how your agent responds to scenario categories: response length distributions, tool selection patterns, confidence score ranges, and error handling approaches.
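A behavioral fingerprint can be as small as a few summary statistics per scenario category. A sketch (the response shape and drift threshold are invented for illustration):

```python
from statistics import mean

def fingerprint(responses):
    """Summarize agent behavior over a batch of responses for one scenario category."""
    lengths = [len(r["text"].split()) for r in responses]
    tool_counts = {}
    for r in responses:
        for tool in r["tools"]:
            tool_counts[tool] = tool_counts.get(tool, 0) + 1
    return {
        "mean_length": mean(lengths),
        "tool_mix": {t: c / len(responses) for t, c in sorted(tool_counts.items())},
    }

def drifted(old, new, length_tolerance=0.25):
    """Flag a regression when mean length shifts beyond tolerance or the tool mix changes."""
    length_shift = abs(new["mean_length"] - old["mean_length"]) / old["mean_length"]
    return length_shift > length_tolerance or set(new["tool_mix"]) != set(old["tool_mix"])

baseline = fingerprint([
    {"text": "The order ships Monday.", "tools": ["order_lookup"]},
    {"text": "Your order arrives in two days.", "tools": ["order_lookup"]},
])
candidate = fingerprint([
    {"text": "Let me explain our entire shipping policy in detail first.", "tools": ["policy_search"]},
    {"text": "Shipping depends on many factors, including carrier and region.", "tools": ["policy_search"]},
])
regression = drifted(baseline, candidate)
```

Here the candidate build answers in longer responses using a different tool, so the fingerprint flags it even though no individual output is "wrong" by exact-match standards.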
Vercel
Continuous Regression Testing for v0 AI Code Generator
The regression system caught 12 significant quality regressions in the first quarter.
Agent Test Execution Pipeline
Code Change → Static Analysis (seconds) → Unit Tests (minutes) → Component Tests w/ Mocks
Use Test Scenario Generators for Coverage
Build parameterized test generators that create thousands of variations from templates. A single 'user asks about pricing' template can generate tests for different products, user tiers, currencies, and edge cases.
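The "user asks about pricing" template from this tip can be expanded combinatorially. A sketch (the products, tiers, and phrasings are invented for illustration):

```python
from itertools import product

PRODUCTS = ["starter", "pro", "enterprise"]
TIERS = ["free", "paid", "trial-expired"]
CURRENCIES = ["USD", "EUR", "JPY"]
PHRASINGS = [
    "How much does the {product} plan cost in {currency}?",
    "price for {product}?? im on {tier}",  # terse, typo-prone variant
]

def generate_pricing_scenarios():
    """Expand one template family into a full scenario matrix."""
    scenarios = []
    for prod, tier, cur, phrasing in product(PRODUCTS, TIERS, CURRENCIES, PHRASINGS):
        scenarios.append({
            "query": phrasing.format(product=prod, tier=tier, currency=cur),
            "expected_topics": ["price", prod],
            "user_tier": tier,
        })
    return scenarios

scenarios = generate_pricing_scenarios()
```

Three products, three tiers, three currencies, and two phrasings already yield 54 scenarios from a single template family; each scenario carries expected topics so the generated cases plug directly into behavioral assertions.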
Practice Exercise (90 min): Build a Multi-Layer Agent Test Suite
Practice Exercise (90 min): Build a Comprehensive Agent Test Suite
Complete Agent Testing Framework (Python)

import pytest
import asyncio
from unittest.mock import AsyncMock, patch
from dataclasses import dataclass
from typing import List, Dict, Any
import json
import hashlib

# Agent under test
@dataclass
class AgentResponse:
    message: str
    tool_calls: List[Dict[str, Any]]
    confidence: float

@pytest.mark.asyncio
async def test_agent_selects_expected_tool():
    agent = AsyncMock()
    agent.run.return_value = AgentResponse(
        message="Order #123 ships tomorrow.",
        tool_calls=[{"name": "lookup_order"}],
        confidence=0.92,
    )
    response = await agent.run("Where is my order?")
    assert any(call["name"] == "lookup_order" for call in response.tool_calls)
Production Agent Testing Checklist
Anti-Pattern: Testing Only Happy Paths
❌ Problem
When Robinhood experienced their 2021 outage during high market volatility, their systems failed in ways no test suite had exercised.
✓ Solution
Adopt adversarial testing as a first-class practice. For every happy path test, write a corresponding adversarial test.
Practice Exercise (60 min): Implement Chaos Testing for Agent Resilience
Chaos Testing Middleware Implementation (Python)

import random
import asyncio
from functools import wraps
from typing import Callable, Any
import os

class ChaosConfig:
    """Configuration for chaos injection."""
    def __init__(self):
        self.enabled = os.getenv("CHAOS_ENABLED", "false").lower() == "true"
        self.timeout_rate = float(os.getenv("CHAOS_TIMEOUT_RATE", "0.10"))
        self.error_rate = float(os.getenv("CHAOS_ERROR_RATE", "0.05"))

def chaos(config: ChaosConfig) -> Callable:
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            if config.enabled:
                roll = random.random()
                if roll < config.timeout_rate:
                    await asyncio.sleep(30)  # simulate a hung dependency
                elif roll < config.timeout_rate + config.error_rate:
                    raise ConnectionError("chaos: injected service failure")
            return await fn(*args, **kwargs)
        return wrapper
    return decorator
Anti-Pattern: Ignoring Test Environment Parity
❌ Problem
A fintech startup tested their financial advisor agent using GPT-3.5 mocks that behaved nothing like their production model.
✓ Solution
Implement environment parity as a non-negotiable requirement. Use the same model versions, configurations, and infrastructure in test as in production.
Framework
VALIDATE Testing Framework
V - Verification Tests
Verify the agent produces correct outputs for known inputs. This includes unit tests for individual components.
A - Adversarial Tests
Test the agent's behavior under attack or unusual conditions. Include prompt injection attempts and malformed inputs.
L - Latency Tests
Measure and validate response time characteristics. Set SLOs for p50, p95, and p99 latency. Test under realistic load.
I - Integration Tests
Verify the agent works correctly with all external dependencies. Test tool integrations, database connections, and API clients.
Essential Agent Testing Tools and Resources
- pytest-asyncio
- Locust
- LangSmith
- Promptfoo
Practice Exercise (120 min): Create a Regression Test Suite from Production Logs
Production Log to Regression Test Pipeline (Python)

import json
import hashlib
from dataclasses import dataclass
from typing import List, Dict, Optional
from sentence_transformers import SentenceTransformer
import numpy as np
import boto3
from datetime import datetime, timedelta

@dataclass
class ConversationTurn:
    role: str  # 'user' or 'agent'
    content: str
    timestamp: datetime
    tool_calls: Optional[List[Dict]] = None

    def fingerprint(self) -> str:
        """Stable hash for deduplicating near-identical turns across logs."""
        return hashlib.sha256(f"{self.role}:{self.content}".encode()).hexdigest()[:16]
Anti-Pattern: Skipping End-to-End Tests Due to Cost
❌ Problem
A healthcare AI company skipped E2E testing to save on API costs, relying on mocked responses instead.
✓ Solution
Budget for E2E testing as a non-negotiable production cost. Calculate the cost of a single production incident and weigh it against your testing spend.
Pre-Deployment Testing Gate Checklist
94% reduction in production incidents after implementing comprehensive agent testing
Anthropic reported that teams implementing their recommended testing framework—including unit tests, integration tests, simulation environments, and regression testing—saw a 94% reduction in production incidents.
Practice Exercise (75 min): Implement Automated Test Generation from User Feedback
Test Data Privacy Requirements
When using production conversations for regression testing, ensure compliance with data privacy regulations. Implement PII detection and redaction before storing test cases.
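A minimal redaction pass over conversations before they become test cases might look like the sketch below. The patterns are illustrative: a regex pass is a floor, not a complete solution, and production deployments should add a dedicated PII detector such as Amazon Comprehend.

```python
import re

# Simple patterns for common PII; order matters (SSN before the broader phone pattern)
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with typed placeholders before storing test cases."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

clean = redact("Contact jane.doe@example.com or 555-867-5309, SSN 123-45-6789.")
```

Typed placeholders (rather than blanket masking) preserve the conversational structure, so the redacted transcript still works as a realistic regression test input.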
Advanced Agent Testing Resources
- DeepEval
- Giskard
- Weights & Biases Prompts
- Arize Phoenix
Chapter Complete!
- Comprehensive agent testing requires a multi-layered approach.
- Chaos testing is essential for production resilience. Inject failures deliberately to find weaknesses before users do.
- Production log mining is the most valuable source of regression tests.
- Test environment parity with production is non-negotiable. Use the same models and configurations in both.
Next: Begin by auditing your current test coverage using the VALIDATE framework