AI systems fail differently than traditional software—they don't crash cleanly with stack traces, they degrade subtly, hallucinate confidently, or drift silently over time. A single AI incident can erode months of user trust in minutes, making incident response for AI products fundamentally different from conventional software operations.
from dataclasses import dataclass
from datetime import datetime, timedelta
import numpy as np
from collections import deque


@dataclass
class AIOutputMetrics:
    """A single per-request observation of an AI system's output quality.

    One record is captured for each model response so downstream monitors
    can build rolling baselines and detect drift or degradation.
    """

    # When the output was produced.
    timestamp: datetime
    # End-to-end response latency in milliseconds.
    latency_ms: float
    # Model-reported confidence for the response.
    # NOTE(review): range not shown in this excerpt — presumably 0.0-1.0; confirm.
    confidence_score: float
    # Size of the generated output; units (tokens vs. characters) set by the caller.
    output_length: int
    # Explicit user signal: -1 negative, 0 neutral, 1 positive.
    user_feedback: int
import { MetricsClient } from './metrics';
import { AlertManager } from './alerts';

interface AIHealthMetrics {
  errorRate: number;
  latencyP99: number;
  confidenceP10: number;
  fallbackRate: number;
}

class AIIncidentDetector {
  private baseline: AIHealthMetrics;
interface DegradedResponse {
  result: any;
  degraded: boolean;
  degradationReason?: string;
  userMessage?: string;
}

class GracefulAIService {
  async processRequest(input: string): Promise<DegradedResponse> {
    // Try primary AI provider
    try {
      const result = await this.primaryProvider.process(input);
interface IncidentTimeline {
  detected: Date;
  acknowledged: Date;
  mitigated: Date;
  resolved: Date;
  events: TimelineEvent[];
}

class PostMortemCollector {
  async generatePostMortemData(incidentId: string): Promise<PostMortemReport> {
    const incident = await this.incidentStore.get(incidentId);