Accuracy
Measure how often the AI gives correct answers
Why This Matters
A 5% increase in accuracy can reduce customer complaints by 40% and save $100k+ annually in incorrect decisions. Medical AI with 95% vs 90% accuracy = difference between FDA approval and rejection.
Target: 95% accuracy
is the minimum for production AI in healthcare and finance. Below this = legal liability.
Real Cost of Ignoring This:
One incorrect AI diagnosis = $2M+ in lawsuits. Testing prevents this.
Exact Match
Response exactly matches expected output
function calculateExactMatch(predictions: string[], expected: string[]): number {
  if (predictions.length !== expected.length) return 0;
  if (predictions.length === 0) return 0; // avoid division by zero on empty input
  const matches = predictions.filter((pred, i) =>
    pred.trim().toLowerCase() === expected[i].trim().toLowerCase()
  ).length;
  return matches / predictions.length;
}

// Usage
const accuracy = calculateExactMatch(
  ['Paris', 'London', 'Berlin'],
  ['Paris', 'London', 'Berlin']
);
console.log(`Accuracy: ${(accuracy * 100).toFixed(1)}%`); // 100%
💡 When to use: Classification tasks, structured outputs, deterministic answers
Semantic Similarity
Responses are semantically similar to expected
import { cosineSimilarity } from '@/lib/utils';

// EmbeddingModel is any client exposing embed(text) => Promise<number[]>
async function calculateSemanticSimilarity(
  prediction: string,
  expected: string,
  embedModel: EmbeddingModel
): Promise<number> {
  const [predEmbed, expEmbed] = await Promise.all([
    embedModel.embed(prediction),
    embedModel.embed(expected)
  ]);
  return cosineSimilarity(predEmbed, expEmbed);
}

// Usage — embedModel wraps your embeddings API (see the sketch below)
const similarity = await calculateSemanticSimilarity(
  "The capital of France is Paris",
  "Paris is the capital city of France",
  embedModel
);
console.log(`Similarity: ${(similarity * 100).toFixed(1)}%`); // ~95%
💡 When to use: Open-ended responses, paraphrasing OK, general Q&A
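The snippet above leans on two helpers: an EmbeddingModel exposing embed() and a cosineSimilarity utility. If you don't already have them in @/lib/utils, here is a minimal sketch of both, assuming the official openai Node SDK and the text-embedding-3-small model (swap in whatever embeddings client you actually use):

import OpenAI from 'openai';

const openai = new OpenAI();

// Minimal EmbeddingModel implementation backed by the OpenAI embeddings endpoint
const embedModel = {
  async embed(text: string): Promise<number[]> {
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return res.data[0].embedding;
  },
};

// Cosine similarity between two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}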
Relevance
Measure how relevant responses are to the query
Why This Matters
Irrelevant responses waste user time and erode trust. Studies show users abandon apps after 2-3 irrelevant answers. Customer support AI with 80% relevance = 60% user satisfaction. With 90% relevance = 85% satisfaction.
Target: 85%+ relevance score
separates good AI from great AI. Below 75% = users lose trust and stop using your app.
Real Cost of Ignoring This:
Irrelevant responses cost companies $50-200 per customer in lost sales and support time.
Keyword Presence
Check if important keywords appear in response
function calculateKeywordRelevance(
  response: string,
  requiredKeywords: string[]
): number {
  const lowerResponse = response.toLowerCase();
  const presentKeywords = requiredKeywords.filter(keyword =>
    lowerResponse.includes(keyword.toLowerCase())
  );
  return presentKeywords.length / requiredKeywords.length;
}

// Usage
const relevance = calculateKeywordRelevance(
  "RAG uses vector databases like Pinecone to retrieve relevant documents",
  ["RAG", "vector", "retrieve"]
);
console.log(`Relevance: ${(relevance * 100).toFixed(1)}%`); // 100%
💡 When to use: Domain-specific tasks, technical content, required terms
Answer Relevance Score
Use LLM to judge relevance of answer to question
async function judgeRelevance(
  question: string,
  answer: string,
  llm: LLM
): Promise<{ score: number; reasoning: string }> {
  const prompt = `Rate the relevance of this answer to the question on a scale of 1-10.
Question: ${question}
Answer: ${answer}
Respond with JSON: { "score": number, "reasoning": string }`;
  // Note: for production, prefer a JSON/structured-output mode so this parse can't fail
  const response = await llm.call(prompt);
  return JSON.parse(response);
}

// Usage
const result = await judgeRelevance(
  "What is RAG?",
  "RAG stands for Retrieval Augmented Generation...",
  gpt4
);
console.log(`Relevance: ${result.score}/10 - ${result.reasoning}`);
💡 When to use: Complex queries, nuanced content, subjective relevance
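JSON.parse on a raw completion can fail if the model wraps its answer in prose or code fences. If you're on the OpenAI SDK, a rough sketch of the same judge using JSON mode so the output is guaranteed to parse (the model name and rubric wording are placeholders, not a recommendation):

import OpenAI from 'openai';

const openai = new OpenAI();

async function judgeRelevanceOpenAI(question: string, answer: string) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini', // placeholder; use whichever judge model you trust
    response_format: { type: 'json_object' }, // forces syntactically valid JSON
    messages: [
      {
        role: 'system',
        content: 'You grade answer relevance. Respond with JSON: { "score": 1-10, "reasoning": string }',
      },
      { role: 'user', content: `Question: ${question}\nAnswer: ${answer}` },
    ],
  });
  return JSON.parse(completion.choices[0].message.content ?? '{}') as {
    score: number;
    reasoning: string;
  };
}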
Consistency
Measure consistency across multiple runs
Why This Matters
Inconsistent AI destroys user trust. Same question = same answer should be true 90%+ of the time. Without testing, temperature=0.7 can give wildly different answers. Financial AI giving different risk scores for same data = compliance nightmare.
Target: < 10% variance
is required for production systems. Above 20% variance = unreliable, users complain, regulators investigate.
Real Cost of Ignoring This:
Inconsistent automated lending decisions have triggered multimillion-dollar regulatory fines and consent orders. Testing prevents this.
Response Variance
Check how much responses vary for same input
async function measureConsistency(
  prompt: string,
  llm: LLM,
  embedModel: EmbeddingModel,
  runs: number = 5
): Promise<{ variance: number; responses: string[] }> {
  const responses = await Promise.all(
    Array(runs).fill(null).map(() => llm.call(prompt))
  );
  // Calculate pairwise similarities
  const similarities: number[] = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      const sim = await calculateSemanticSimilarity(
        responses[i],
        responses[j],
        embedModel
      );
      similarities.push(sim);
    }
  }
  const avgSimilarity = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const variance = 1 - avgSimilarity;
  return { variance, responses };
}

// Usage — embedModel is the same embeddings wrapper from the Semantic Similarity section
const consistency = await measureConsistency(
  "What is the capital of France?",
  gpt35,
  embedModel,
  5
);
console.log(`Variance: ${(consistency.variance * 100).toFixed(1)}%`);
// Low variance = high consistency
💡 When to use: Critical applications, factual queries, deterministic needs
Safety
Measure safety and appropriateness of responses
Why This Matters
One harmful AI response = brand damage, lawsuits, regulatory fines. Microsoft Tay chatbot lasted 16 hours before harmful outputs forced shutdown. Without safety testing, your AI WILL say something harmful eventually.
Target: 99.9%+ safety rate
is minimum for customer-facing AI. Even 0.1% failure = 1,000 harmful responses per 1M queries.
Real Cost of Ignoring This:
One viral harmful AI response = $10M+ in brand damage (see: Google Gemini, Microsoft Tay, Meta Galactica). Safety testing is mandatory.
Content Filtering
Detect harmful, biased, or inappropriate content
interface SafetyCheck {
  isSafe: boolean;
  categories: string[];
  confidence: number;
}

// Crude first-pass filter: pattern matching only catches obvious cases
function checkContentSafety(response: string): SafetyCheck {
  const harmfulPatterns = [
    { pattern: /\b(kill|harm|hurt|violence)\b/i, category: 'violence' },
    { pattern: /\b(illegal|crime|steal)\b/i, category: 'illegal' },
    { pattern: /\b(hate|racist|sexist)\b/i, category: 'bias' },
  ];
  const flaggedCategories: string[] = [];
  harmfulPatterns.forEach(({ pattern, category }) => {
    if (pattern.test(response)) {
      flaggedCategories.push(category);
    }
  });
  return {
    isSafe: flaggedCategories.length === 0,
    categories: flaggedCategories,
    confidence: flaggedCategories.length === 0 ? 1.0 : 0.5, // keyword hits are a weak signal
  };
}

// Usage with OpenAI Moderation API
async function checkWithModeration(text: string): Promise<SafetyCheck> {
  const response = await openai.moderations.create({
    input: text,
  });
  const result = response.results[0];
  const flaggedCategories = Object.entries(result.categories)
    .filter(([_, isFlagged]) => isFlagged)
    .map(([category]) => category);
  return {
    isSafe: !result.flagged,
    categories: flaggedCategories,
    confidence: 1.0,
  };
}
💡 When to use: User-facing apps, content moderation, brand safety
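To track the 99.9%+ safety-rate target from above, a small sketch that runs either checker across a batch of generated responses and reports the overall rate (testResponses is a placeholder for your own sample of model outputs):

// Runs a safety checker over a batch of model outputs and reports the overall pass rate
async function measureSafetyRate(
  responses: string[],
  check: (text: string) => SafetyCheck | Promise<SafetyCheck>
): Promise<number> {
  let safeCount = 0;
  for (const response of responses) {
    const result = await check(response); // handles both the regex and moderation checkers
    if (result.isSafe) safeCount++;
  }
  return responses.length === 0 ? 1 : safeCount / responses.length;
}

// Usage — testResponses is a placeholder array of generated outputs from your test suite
const safetyRate = await measureSafetyRate(testResponses, checkWithModeration);
console.log(`Safety rate: ${(safetyRate * 100).toFixed(2)}%`); // target: >= 99.9%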
Performance
Measure speed, cost, and efficiency
Why This Matters
Every 100ms of latency = 1% drop in conversions (Amazon study). Slow AI = users abandon. High API costs = your startup goes bankrupt. Without monitoring: one prompt injection attack caused a company $100k in API bills in 48 hours.
Target: < 2 second response time
is critical for user-facing apps. Above 5 seconds = 50% of users leave. Cost monitoring prevents bill shock.
Real Cost of Ignoring This:
Unoptimized AI can cost 10x more than necessary. Example: Company spent $50k/mo, optimized to $5k/mo with proper testing. That's $540k saved per year.
Latency & Cost
Track response time and token usage
interface PerformanceMetrics {
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  cost: number;
}

async function measurePerformance(
  prompt: string,
  llm: LLM,
  pricing: { input: number; output: number } // USD per 1M tokens
): Promise<PerformanceMetrics> {
  const startTime = Date.now();
  const response = await llm.call(prompt, {
    return_usage: true,
  });
  const latencyMs = Date.now() - startTime;
  const cost =
    (response.usage.prompt_tokens / 1_000_000) * pricing.input +
    (response.usage.completion_tokens / 1_000_000) * pricing.output;
  return {
    latencyMs,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost,
  };
}

// Usage
const metrics = await measurePerformance(
  "Explain RAG in 100 words",
  gpt4,
  { input: 10, output: 30 } // example rates in USD per 1M tokens
);
console.log(`Latency: ${metrics.latencyMs}ms`);
console.log(`Cost: $${metrics.cost.toFixed(4)}`);
💡 When to use: Production monitoring, cost optimization, SLA tracking
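Averages hide the outliers that blow a latency SLA, so it helps to aggregate a batch of PerformanceMetrics into percentiles. A minimal sketch, assuming you've collected allRuns from repeated measurePerformance calls:

// Nearest-rank percentile over a pre-sorted array
function percentile(sortedValues: number[], p: number): number {
  const idx = Math.min(
    sortedValues.length - 1,
    Math.ceil((p / 100) * sortedValues.length) - 1
  );
  return sortedValues[Math.max(0, idx)];
}

function summarizePerformance(runs: PerformanceMetrics[]) {
  const latencies = runs.map(r => r.latencyMs).sort((a, b) => a - b);
  const totalCost = runs.reduce((sum, r) => sum + r.cost, 0);
  return {
    p50LatencyMs: percentile(latencies, 50),
    p95LatencyMs: percentile(latencies, 95),
    totalCost,
    avgCostPerCall: totalCost / runs.length,
  };
}

// Usage: alert (or fail CI) if p95 exceeds the 2-second target
const summary = summarizePerformance(allRuns); // allRuns: PerformanceMetrics[] from a test batch
if (summary.p95LatencyMs > 2000) {
  console.warn(`p95 latency ${summary.p95LatencyMs}ms exceeds the 2s SLA`);
}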
Testing Strategies
Unit Testing for Prompts
Test individual prompts with known inputs and expected outputs
💡 Unit tests catch 80% of prompt issues before they reach production. A broken prompt can generate garbage for days before anyone notices. With unit tests, you know within seconds if your prompt still works after changes.
Impact: 80% of bugs caught
before production with automated prompt testing. No unit tests = bugs discovered by angry customers. With tests = bugs caught in CI/CD in 30 seconds.
Writing 10 unit tests takes 1 hour. Finding and fixing 1 production prompt bug takes 4+ hours. ROI = 40x time savings after just 10 bugs prevented.
// Example: Testing a summarization prompt
describe('Summarization Prompt', () => {
  it('should summarize long text correctly', async () => {
    const longText = "..."; // Your long text
    const expectedSummary = "..."; // A known-good reference summary
    const result = await summarize(longText);

    // Check length
    expect(result.length).toBeLessThan(200);

    // Check key points present
    expect(result).toContain('main point');

    // Check semantic similarity to expected (embedModel from the Semantic Similarity section)
    const similarity = await calculateSemanticSimilarity(
      result,
      expectedSummary,
      embedModel
    );
    expect(similarity).toBeGreaterThan(0.8);
  });

  it('should handle edge cases', async () => {
    expect(await summarize('')).toBe('');
    expect(await summarize('short')).toBe('short');
  });
});
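These tests assume a summarize() wrapper around your prompt. A hypothetical sketch of what that wrapper might look like, using the abstract llm.call interface from earlier on this page (the prompt wording and the 50-character short-circuit are illustrative choices, not a fixed recipe):

// Hypothetical prompt wrapper under test — the prompt text is what the unit tests protect
async function summarize(text: string): Promise<string> {
  if (text.trim().length === 0) return '';   // edge case: empty input
  if (text.trim().length < 50) return text;  // edge case: already short enough
  const prompt = `Summarize the following text in under 200 characters, keeping the main point:\n\n${text}`;
  return llm.call(prompt);
}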
Regression Testing
Ensure changes don't break existing functionality
💡 Every prompt change can break working features. Regression tests prevent "we fixed X but broke Y". Spotify: regression tests prevented 40+ production bugs in their AI systems last year.
Impact: 40-60% of changes
unintentionally break something. Without regression tests, you discover this in production (customers complain). With tests, you catch it in 2 minutes.
Production bug = 4 hours to fix + angry customers. Regression test = 30 seconds to run. ROI = 480x time savings.
// Store golden examples
const goldenExamples = [
  { input: 'What is AI?', expectedKeywords: ['artificial', 'intelligence'] },
  { input: 'Explain RAG', expectedKeywords: ['retrieval', 'generation'] },
];

async function runRegressionTests() {
  const results = [];
  for (const example of goldenExamples) {
    const response = await llm.call(example.input);
    const relevance = calculateKeywordRelevance(
      response,
      example.expectedKeywords
    );
    results.push({
      input: example.input,
      passed: relevance >= 0.8,
      relevance,
    });
  }
  const passRate = results.filter(r => r.passed).length / results.length;
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
  return passRate >= 0.95; // 95% threshold
}
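To actually get the "caught in CI/CD in 30 seconds" payoff, wrap runRegressionTests() in an ordinary test so the pipeline fails whenever the pass rate drops. A minimal sketch assuming a Jest/Vitest-style runner:

// regression.test.ts — fails the build if the golden examples regress
describe('Prompt regression suite', () => {
  it('keeps the golden-example pass rate at or above 95%', async () => {
    const passed = await runRegressionTests();
    expect(passed).toBe(true);
  });
});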
A/B Testing
Compare different prompts, models, or parameters
💡 Small changes = big impact. Changing "explain" to "describe" increased accuracy by 12% in one case. A/B testing finds these wins. Companies using A/B tests improve AI performance by 20-40% over 6 months.
Impact: 20-40% improvement
is typical from systematic A/B testing over 6 months. One word change can improve accuracy 5-15%. Testing 10 variations finds the winner.
Example: Switching from GPT-4 to Claude 4.5 for code tasks = 70% cost reduction with BETTER quality. A/B testing proved this = $20k/mo saved.
interface ABTestResult {
  variantA: { accuracy: number; cost: number; latency: number };
  variantB: { accuracy: number; cost: number; latency: number };
  winner: 'A' | 'B' | 'tie';
}

async function abTest(
  testCases: Array<{ input: string; expected: string }>,
  variantA: LLMConfig,
  variantB: LLMConfig
): Promise<ABTestResult> {
  const resultsA = await runTests(testCases, variantA);
  const resultsB = await runTests(testCases, variantB);

  // Calculate winner based on weighted score
  // (the weights and the cost/latency normalizers are illustrative — tune for your priorities)
  const scoreA = resultsA.accuracy * 0.6 + (1 - resultsA.cost / 100) * 0.2 + (1 - resultsA.latency / 1000) * 0.2;
  const scoreB = resultsB.accuracy * 0.6 + (1 - resultsB.cost / 100) * 0.2 + (1 - resultsB.latency / 1000) * 0.2;

  return {
    variantA: resultsA,
    variantB: resultsB,
    winner: scoreA > scoreB ? 'A' : scoreB > scoreA ? 'B' : 'tie',
  };
}
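abTest assumes a runTests(testCases, config) helper. One possible sketch, reusing calculateExactMatch from earlier; llmFromConfig, config.pricing, and the { text, usage } response shape are assumptions about your own client layer, not an existing API:

// Hypothetical helper: run every test case against one variant and aggregate the results.
// Assumes llm.call(input, { return_usage: true }) resolves to { text, usage } and that
// LLMConfig carries pricing in USD per 1M tokens — adjust to match your client.
async function runTests(
  testCases: Array<{ input: string; expected: string }>,
  config: LLMConfig
): Promise<{ accuracy: number; cost: number; latency: number }> {
  const llm = llmFromConfig(config); // placeholder: however you build a client from a variant config
  const predictions: string[] = [];
  let totalCost = 0;
  let totalLatencyMs = 0;

  for (const testCase of testCases) {
    const start = Date.now();
    const response = await llm.call(testCase.input, { return_usage: true });
    totalLatencyMs += Date.now() - start;
    totalCost +=
      (response.usage.prompt_tokens / 1_000_000) * config.pricing.input +
      (response.usage.completion_tokens / 1_000_000) * config.pricing.output;
    predictions.push(response.text);
  }

  return {
    accuracy: calculateExactMatch(predictions, testCases.map(t => t.expected)),
    cost: totalCost,
    latency: totalLatencyMs / testCases.length, // average ms per call
  };
}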
Human Evaluation
Get human ratings for subjective quality
💡 Some things only humans can judge: tone, empathy, creativity, appropriateness. Automated metrics miss 30% of quality issues. Human eval is the gold standard for production readiness.
Impact: 30% of quality issues
are only caught by human evaluation. Automated tests say "passed" but humans say "this sucks". Need both for production AI.
Skipping human eval = ship bad AI = lose customers. Anthropic/OpenAI spend $10M+ annually on human evals. You can do it for $500/mo (5 evals * $100).
interface HumanEvaluation {
  responseId: string;
  ratings: {
    accuracy: number;    // 1-5
    relevance: number;   // 1-5
    helpfulness: number; // 1-5
    clarity: number;     // 1-5
  };
  feedback: string;
}

async function collectHumanEvaluations(
  responses: Array<{ id: string; question: string; answer: string }>
): Promise<HumanEvaluation[]> {
  // Present to human evaluators via UI
  // Store ratings in database
  // Calculate inter-rater reliability
  const evaluations: HumanEvaluation[] = []; // stub: populate from your rating UI / database
  return evaluations;
}

const average = (values: number[]) => values.reduce((a, b) => a + b, 0) / values.length;

function analyzeHumanEvals(evals: HumanEvaluation[]) {
  const avgRatings = {
    accuracy: average(evals.map(e => e.ratings.accuracy)),
    relevance: average(evals.map(e => e.ratings.relevance)),
    helpfulness: average(evals.map(e => e.ratings.helpfulness)),
    clarity: average(evals.map(e => e.ratings.clarity)),
  };
  return avgRatings;
}
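The collection stub above mentions inter-rater reliability. If each response is rated by several evaluators, a lightweight way to approximate it is the share of rating pairs that land within one point of each other; a sketch of that idea (a stand-in for formal statistics like Cohen's kappa, not a replacement):

// Percentage of rating pairs (same response, same dimension) that agree within 1 point.
// Assumes evals contains multiple HumanEvaluation records per responseId, one per rater.
function approximateInterRaterAgreement(evals: HumanEvaluation[]): number {
  const byResponse = new Map<string, HumanEvaluation[]>();
  for (const e of evals) {
    const group = byResponse.get(e.responseId) ?? [];
    group.push(e);
    byResponse.set(e.responseId, group);
  }

  let agreements = 0;
  let comparisons = 0;
  for (const group of byResponse.values()) {
    for (let i = 0; i < group.length; i++) {
      for (let j = i + 1; j < group.length; j++) {
        for (const dim of ['accuracy', 'relevance', 'helpfulness', 'clarity'] as const) {
          comparisons++;
          if (Math.abs(group[i].ratings[dim] - group[j].ratings[dim]) <= 1) agreements++;
        }
      }
    }
  }
  return comparisons === 0 ? 1 : agreements / comparisons;
}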
Need Help Testing Your AI System?
I can design a comprehensive testing strategy for your AI application and implement automated evaluation pipelines.
I've built evaluation frameworks for Fortune 500 companies that caught critical issues before production, saving millions in potential damages.