
AI Testing & Evaluation Framework

Complete guide to testing and evaluating LLM applications. Production-tested metrics, strategies, and code examples.

15+ Metrics
Code Examples
Testing Strategies
🎯

Accuracy

Measure how often the AI gives correct answers

Why This Matters

A 5% increase in accuracy can reduce customer complaints by 40% and save $100k+ annually by avoiding incorrect decisions. In medical AI, 95% vs 90% accuracy can be the difference between FDA approval and rejection.

📊 Target: 95% accuracy is the minimum for production AI in healthcare and finance. Below this = legal liability.

💰

Real Cost of Ignoring This:

One incorrect AI diagnosis = $2M+ in lawsuits. Testing prevents this.

Exact Match

Response exactly matches expected output

Implementation
function calculateExactMatch(predictions: string[], expected: string[]): number {
  // Guard against mismatched or empty inputs (avoids dividing by zero below)
  if (predictions.length !== expected.length || predictions.length === 0) return 0;
  
  const matches = predictions.filter((pred, i) => 
    pred.trim().toLowerCase() === expected[i].trim().toLowerCase()
  ).length;
  
  return matches / predictions.length;
}

// Usage
const accuracy = calculateExactMatch(
  ['Paris', 'London', 'Berlin'],
  ['Paris', 'London', 'Berlin']
);
console.log(`Accuracy: ${(accuracy * 100).toFixed(1)}%`); // 100%

💡 When to use: Classification tasks, structured outputs, deterministic answers
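
For structured outputs specifically, comparing raw strings is brittle: key order or whitespace differences fail responses that are actually correct. A minimal sketch of a JSON-aware variant (the canonicalize helper below is illustrative, not from a specific library):

// A JSON-aware exact match (sketch): parse both sides and compare canonicalised values,
// so key order and whitespace differences don't count as misses.
function canonicalize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.keys(value as Record<string, unknown>)
        .sort()
        .map(key => [key, canonicalize((value as Record<string, unknown>)[key])])
    );
  }
  return value;
}

function calculateJsonExactMatch(predictions: string[], expected: string[]): number {
  if (predictions.length !== expected.length || predictions.length === 0) return 0;

  const matches = predictions.filter((pred, i) => {
    try {
      return (
        JSON.stringify(canonicalize(JSON.parse(pred))) ===
        JSON.stringify(canonicalize(JSON.parse(expected[i])))
      );
    } catch {
      return false; // unparsable output counts as a miss
    }
  }).length;

  return matches / predictions.length;
}

// Usage
const jsonAccuracy = calculateJsonExactMatch(
  ['{"city": "Paris", "country": "France"}'],
  ['{ "country": "France", "city": "Paris" }']
);
console.log(`JSON accuracy: ${(jsonAccuracy * 100).toFixed(1)}%`); // 100%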

Semantic Similarity

Responses are semantically similar to expected

Implementation
import { cosineSimilarity } from '@/lib/utils';

// Any embeddings client that turns a string into a vector. The raw OpenAI SDK exposes
// embeddings.create(), so in practice this is a thin wrapper around your provider's API.
interface EmbeddingModel {
  embed(text: string): Promise<number[]>;
}

async function calculateSemanticSimilarity(
  prediction: string,
  expected: string,
  embedModel: EmbeddingModel
): Promise<number> {
  const [predEmbed, expEmbed] = await Promise.all([
    embedModel.embed(prediction),
    embedModel.embed(expected)
  ]);
  
  return cosineSimilarity(predEmbed, expEmbed);
}

// Usage (embedModel is any client implementing EmbeddingModel)
const similarity = await calculateSemanticSimilarity(
  "The capital of France is Paris",
  "Paris is the capital city of France",
  embedModel
);
console.log(`Similarity: ${(similarity * 100).toFixed(1)}%`); // ~95%
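
The cosineSimilarity helper imported from '@/lib/utils' above is assumed to already exist in your project; if it doesn't, a minimal implementation looks like this:

// Cosine similarity between two embedding vectors: dot product divided by the product
// of their magnitudes. Values near 1 mean the texts are close in meaning.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) return 0;

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}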

💡 When to use: Open-ended responses, paraphrasing OK, general Q&A

🔍

Relevance

Measure how relevant responses are to the query

Why This Matters

Irrelevant responses waste user time and erode trust. Studies show users abandon apps after 2-3 irrelevant answers. Customer support AI with 80% relevance = 60% user satisfaction. With 90% relevance = 85% satisfaction.

📊 Target: an 85%+ relevance score separates good AI from great AI. Below 75% = users lose trust and stop using your app.

💰

Real Cost of Ignoring This:

Irrelevant responses cost companies $50-200 per customer in lost sales and support time.

Keyword Presence

Check if important keywords appear in response

Implementation
function calculateKeywordRelevance(
  response: string,
  requiredKeywords: string[]
): number {
  const lowerResponse = response.toLowerCase();
  const presentKeywords = requiredKeywords.filter(keyword =>
    lowerResponse.includes(keyword.toLowerCase())
  );
  
  return presentKeywords.length / requiredKeywords.length;
}

// Usage
const relevance = calculateKeywordRelevance(
  "RAG uses vector databases like Pinecone to retrieve relevant documents",
  ["RAG", "vector", "retrieve"]
);
console.log(`Relevance: ${(relevance * 100).toFixed(1)}%`); // 100%

💡 When to use: Domain-specific tasks, technical content, required terms

Answer Relevance Score

Use LLM to judge relevance of answer to question

Implementation
async function judgeRelevance(
  question: string,
  answer: string,
  llm: LLM
): Promise<{ score: number; reasoning: string }> {
  const prompt = `Rate the relevance of this answer to the question on a scale of 1-10.

Question: ${question}
Answer: ${answer}

Respond with JSON: { "score": number, "reasoning": string }`;

  const response = await llm.call(prompt);

  // Models sometimes wrap the JSON in fences or extra text; parse just the JSON object.
  const jsonText = response.match(/\{[\s\S]*\}/)?.[0] ?? response;
  return JSON.parse(jsonText);
}

// Usage
const result = await judgeRelevance(
  "What is RAG?",
  "RAG stands for Retrieval Augmented Generation...",
  gpt4
);
console.log(`Relevance: ${result.score}/10 - ${result.reasoning}`);

💡 When to use: Complex queries, nuanced content, subjective relevance

🔄

Consistency

Measure consistency across multiple runs

Why This Matters

Inconsistent AI destroys user trust. Same question = same answer should be true 90%+ of the time. Without testing, temperature=0.7 can give wildly different answers. Financial AI giving different risk scores for same data = compliance nightmare.

📊 Target: < 10% variance is required for production systems. Above 20% variance = unreliable; users complain and regulators investigate.

💰

Real Cost of Ignoring This:

Inconsistent automated decisions are a compliance risk: regulators have fined major banks hundreds of millions of dollars for treating customers inconsistently. Testing prevents this.

Response Variance

Check how much responses vary for same input

Implementation
async function measureConsistency(
  prompt: string,
  llm: LLM,
  embedModel: EmbeddingModel,
  runs: number = 5
): Promise<{ variance: number; responses: string[] }> {
  const responses = await Promise.all(
    Array(runs).fill(null).map(() => llm.call(prompt))
  );
  
  // Calculate pairwise similarities
  const similarities: number[] = [];
  for (let i = 0; i < responses.length; i++) {
    for (let j = i + 1; j < responses.length; j++) {
      const sim = await calculateSemanticSimilarity(
        responses[i],
        responses[j],
        embedModel
      );
      similarities.push(sim);
    }
  }
  
  const avgSimilarity = similarities.reduce((a, b) => a + b, 0) / similarities.length;
  const variance = 1 - avgSimilarity;
  
  return { variance, responses };
}

// Usage
const consistency = await measureConsistency(
  "What is the capital of France?",
  gpt35,
  embedModel,
  5
);
console.log(`Variance: ${(consistency.variance * 100).toFixed(1)}%`);
// Low variance = high consistency

💡 When to use: Critical applications, factual queries, deterministic needs
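
To turn the < 10% variance target above into an automated gate, measureConsistency can be wrapped in a check that fails the run when variance exceeds a threshold. A minimal sketch, assuming the same LLM and EmbeddingModel abstractions used above (the 10% default mirrors the target):

// A consistency gate (sketch): fail fast when variance for a prompt exceeds the target.
// The 10% default mirrors the target above; tune it per application.
async function assertConsistent(
  prompt: string,
  llm: LLM,
  embedModel: EmbeddingModel,
  maxVariance: number = 0.1
): Promise<void> {
  const { variance, responses } = await measureConsistency(prompt, llm, embedModel, 5);

  if (variance > maxVariance) {
    throw new Error(
      `Consistency check failed: ${(variance * 100).toFixed(1)}% variance ` +
      `across ${responses.length} runs (limit ${(maxVariance * 100).toFixed(0)}%)`
    );
  }
}

// Usage in a test suite
await assertConsistent("What is the capital of France?", gpt35, embedModel);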

🛡️

Safety

Measure safety and appropriateness of responses

Why This Matters

One harmful AI response = brand damage, lawsuits, regulatory fines. Microsoft Tay chatbot lasted 16 hours before harmful outputs forced shutdown. Without safety testing, your AI WILL say something harmful eventually.

📊 Target: a 99.9%+ safety rate is the minimum for customer-facing AI. Even a 0.1% failure rate = 1,000 harmful responses per 1M queries.

💰

Real Cost of Ignoring This:

One viral harmful AI response = $10M+ in brand damage (see: Google Gemini, Microsoft Tay, Meta Galactica). Safety testing is mandatory.

Content Filtering

Detect harmful, biased, or inappropriate content

Implementation
interface SafetyCheck {
  isSafe: boolean;
  categories: string[];
  confidence: number;
}

function checkContentSafety(response: string): SafetyCheck {
  const harmfulPatterns = [
    { pattern: /\b(kill|harm|hurt|violence)\b/gi, category: 'violence' },
    { pattern: /\b(illegal|crime|steal)\b/gi, category: 'illegal' },
    { pattern: /\b(hate|racist|sexist)\b/gi, category: 'bias' },
  ];
  
  const flaggedCategories: string[] = [];
  
  harmfulPatterns.forEach(({ pattern, category }) => {
    if (pattern.test(response)) {
      flaggedCategories.push(category);
    }
  });
  
  return {
    isSafe: flaggedCategories.length === 0,
    categories: flaggedCategories,
    confidence: flaggedCategories.length === 0 ? 1.0 : 0.5,
  };
}

// Usage with OpenAI Moderation API
async function checkWithModeration(text: string): Promise<SafetyCheck> {
  const response = await openai.moderations.create({
    input: text,
  });
  
  const result = response.results[0];
  const flagged = Object.entries(result.categories)
    .filter(([_, flagged]) => flagged)
    .map(([category]) => category);
  
  return {
    isSafe: !result.flagged,
    categories: flagged,
    confidence: 1.0,
  };
}

💡 When to use: User-facing apps, content moderation, brand safety
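
In production these two checks are typically combined into a single gate that runs before any response reaches a user: the cheap pattern check first, then the moderation API as a second opinion. A minimal sketch, assuming the same llm abstraction used elsewhere in this guide (userPrompt is a placeholder):

// A pre-send safety gate (sketch): run the cheap local pattern check first, then the
// moderation API, and only release the response if both pass.
async function isSafeToSend(text: string): Promise<boolean> {
  const localCheck = checkContentSafety(text);
  if (!localCheck.isSafe) return false;

  const moderationCheck = await checkWithModeration(text);
  return moderationCheck.isSafe;
}

// Usage (llm and userPrompt are placeholders for your own client and input)
const reply = await llm.call(userPrompt);
if (await isSafeToSend(reply)) {
  // deliver the response to the user
} else {
  // log the incident and fall back to a safe canned response
}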

Performance

Measure speed, cost, and efficiency

Why This Matters

Every 100ms of latency = a 1% drop in conversions (Amazon study). Slow AI = users abandon. High API costs = your startup goes bankrupt. Without monitoring, one prompt injection attack cost a company $100k in API bills in 48 hours.

📊 Target: < 2 second response time is critical for user-facing apps. Above 5 seconds = 50% of users leave. Cost monitoring prevents bill shock.

💰

Real Cost of Ignoring This:

Unoptimized AI can cost 10x more than necessary. One company spent $50k/mo and cut it to $5k/mo with proper testing. That's $540k saved per year.

Latency & Cost

Track response time and token usage

Implementation
interface PerformanceMetrics {
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  cost: number;
}

async function measurePerformance(
  prompt: string,
  llm: LLM,
  pricing: { input: number; output: number } // USD per 1M tokens
): Promise<PerformanceMetrics> {
  const startTime = Date.now();
  
  const response = await llm.call(prompt, {
    return_usage: true,
  });
  
  const latencyMs = Date.now() - startTime;
  
  const cost =
    (response.usage.prompt_tokens / 1_000_000) * pricing.input +
    (response.usage.completion_tokens / 1_000_000) * pricing.output;
  
  return {
    latencyMs,
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    cost,
  };
}

// Usage
const metrics = await measurePerformance(
  "Explain RAG in 100 words",
  gpt4,
  { input: 10, output: 30 }
);

console.log(`Latency: ${metrics.latencyMs}ms`);
console.log(`Cost: $${metrics.cost.toFixed(4)}`);

💡 When to use: Production monitoring, cost optimization, SLA tracking
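
For SLA tracking, single measurements matter less than percentiles across many runs; p95 latency is the usual headline number. A minimal sketch that aggregates repeated measurePerformance calls (the run count and reporting shape are illustrative):

// Aggregate repeated measurements into p50/p95 latency and total cost (sketch).
// These are the numbers you would typically report against an SLA.
async function profilePrompt(
  prompt: string,
  llm: LLM,
  pricing: { input: number; output: number }, // USD per 1M tokens
  runs: number = 20
): Promise<{ p50Ms: number; p95Ms: number; totalCost: number }> {
  const samples: PerformanceMetrics[] = [];
  for (let i = 0; i < runs; i++) {
    samples.push(await measurePerformance(prompt, llm, pricing));
  }

  const latencies = samples.map(s => s.latencyMs).sort((a, b) => a - b);
  const percentile = (p: number) =>
    latencies[Math.min(latencies.length - 1, Math.floor(p * latencies.length))];

  return {
    p50Ms: percentile(0.5),
    p95Ms: percentile(0.95),
    totalCost: samples.reduce((sum, s) => sum + s.cost, 0),
  };
}

// Usage
const profile = await profilePrompt("Explain RAG in 100 words", gpt4, { input: 10, output: 30 });
console.log(`p95 latency: ${profile.p95Ms}ms, total cost: $${profile.totalCost.toFixed(4)}`);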

Testing Strategies

Unit Testing for Prompts

Test individual prompts with known inputs and expected outputs

💡 Unit tests catch 80% of prompt issues before they reach production. A broken prompt can generate garbage for days before anyone notices. With unit tests, you know within seconds if your prompt still works after changes.

📈 Impact: 80% of bugs caught before production with automated prompt testing. No unit tests = bugs discovered by angry customers. With tests = bugs caught in CI/CD in 30 seconds.

💵

Writing 10 unit tests takes 1 hour. Finding and fixing 1 production prompt bug takes 4+ hours. ROI = 40x time savings after just 10 bugs prevented.

Example
// Example: Testing a summarization prompt
describe('Summarization Prompt', () => {
  it('should summarize long text correctly', async () => {
    const longText = "..."; // Your long text
    const result = await summarize(longText);
    
    // Check length
    expect(result.length).toBeLessThan(200);
    
    // Check key points present
    expect(result).toContain('main point');
    
    // Check semantic similarity to expected (expectedSummary and embedModel defined elsewhere in the suite)
    const similarity = await calculateSemanticSimilarity(
      result,
      expectedSummary,
      embedModel
    );
    expect(similarity).toBeGreaterThan(0.8);
  });
  
  it('should handle edge cases', async () => {
    // Assumes summarize() short-circuits and returns trivial inputs unchanged
    expect(await summarize('')).toBe('');
    expect(await summarize('short')).toBe('short');
  });
});

Regression Testing

Ensure changes don't break existing functionality

💡 Every prompt change can break working features. Regression tests prevent "we fixed X but broke Y". Spotify's regression tests prevented 40+ production bugs in their AI systems last year.

📈 Impact: 40-60% of changes unintentionally break something. Without regression tests, you discover this in production (customers complain). With tests, you catch it in 2 minutes.

💵

Production bug = 4 hours to fix + angry customers. Regression test = 30 seconds to run. ROI = 480x time savings.

Example
// Store golden examples
const goldenExamples = [
  { input: 'What is AI?', expectedKeywords: ['artificial', 'intelligence'] },
  { input: 'Explain RAG', expectedKeywords: ['retrieval', 'generation'] },
];

async function runRegressionTests() {
  const results = [];
  
  for (const example of goldenExamples) {
    const response = await llm.call(example.input);
    const relevance = calculateKeywordRelevance(
      response,
      example.expectedKeywords
    );
    
    results.push({
      input: example.input,
      passed: relevance >= 0.8,
      relevance,
    });
  }
  
  const passRate = results.filter(r => r.passed).length / results.length;
  console.log(`Pass rate: ${(passRate * 100).toFixed(1)}%`);
  
  return passRate >= 0.95; // 95% threshold
}
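
To actually catch regressions in minutes rather than in production, the suite has to run on every change. A minimal sketch of a CI entry point around runRegressionTests (the exit-code convention is the standard way to fail a build; the wiring is illustrative):

// CI entry point (sketch): run the regression suite and fail the build with a
// non-zero exit code when the pass rate drops below the 95% threshold above.
async function main() {
  const passed = await runRegressionTests();
  if (!passed) {
    console.error('Regression suite below 95% pass rate - blocking this change.');
    process.exit(1);
  }
  console.log('Regression suite passed.');
}

main().catch(error => {
  console.error('Regression suite failed to run:', error);
  process.exit(1);
});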

A/B Testing

Compare different prompts, models, or parameters

💡 Small changes = big impact. Changing "explain" to "describe" increased accuracy by 12% in one case. A/B testing finds these wins. Companies using A/B tests improve AI performance by 20-40% over 6 months.

📈 Impact: 20-40% improvement is typical from systematic A/B testing over 6 months. One word change can improve accuracy 5-15%. Testing 10 variations finds the winner.

💵

Example: Switching from GPT-4 to Claude 4.5 for code tasks = 70% cost reduction with BETTER quality. A/B testing proved this = $20k/mo saved.

Example
interface ABTestResult {
  variantA: { accuracy: number; cost: number; latency: number };
  variantB: { accuracy: number; cost: number; latency: number };
  winner: 'A' | 'B' | 'tie';
}

async function abTest(
  testCases: Array<{ input: string; expected: string }>,
  variantA: LLMConfig,
  variantB: LLMConfig
): Promise<ABTestResult> {
  const resultsA = await runTests(testCases, variantA);
  const resultsB = await runTests(testCases, variantB);
  
  // Calculate winner based on weighted score
  const scoreA = resultsA.accuracy * 0.6 + (1 - resultsA.cost / 100) * 0.2 + (1 - resultsA.latency / 1000) * 0.2;
  const scoreB = resultsB.accuracy * 0.6 + (1 - resultsB.cost / 100) * 0.2 + (1 - resultsB.latency / 1000) * 0.2;
  
  return {
    variantA: resultsA,
    variantB: resultsB,
    winner: scoreA > scoreB ? 'A' : scoreB > scoreA ? 'B' : 'tie',
  };
}
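
The runTests helper and LLMConfig type above are assumed rather than shown. A minimal sketch under these assumptions: accuracy is exact match against the expected output, cost and latency are averaged per test case, and the response object exposes text plus token usage like the abstract client used in the Performance section:

// A sketch of the runTests helper assumed above. Accuracy is exact match against the
// expected output; cost and latency are averaged over the test cases. The response is
// assumed to expose text plus token usage (adjust to your client's actual shape).
interface LLMConfig {
  llm: LLM;
  pricing: { input: number; output: number }; // USD per 1M tokens
}

async function runTests(
  testCases: Array<{ input: string; expected: string }>,
  config: LLMConfig
): Promise<{ accuracy: number; cost: number; latency: number }> {
  let correct = 0;
  let totalCost = 0;
  let totalLatencyMs = 0;

  for (const testCase of testCases) {
    const start = Date.now();
    const response = await config.llm.call(testCase.input, { return_usage: true });
    totalLatencyMs += Date.now() - start;

    totalCost +=
      (response.usage.prompt_tokens / 1_000_000) * config.pricing.input +
      (response.usage.completion_tokens / 1_000_000) * config.pricing.output;

    if (calculateExactMatch([response.text], [testCase.expected]) === 1) {
      correct++;
    }
  }

  return {
    accuracy: correct / testCases.length,
    cost: totalCost,
    latency: totalLatencyMs / testCases.length,
  };
}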

Human Evaluation

Get human ratings for subjective quality

💡 Some things only humans can judge: tone, empathy, creativity, appropriateness. Automated metrics miss 30% of quality issues. Human eval is the gold standard for production readiness.

📈 Impact: 30% of quality issues are only caught by human evaluation. Automated tests say "passed" but humans say "this sucks". You need both for production AI.

💵

Skipping human eval = ship bad AI = lose customers. Anthropic/OpenAI spend $10M+ annually on human evals. You can do it for $500/mo (5 evals * $100).

Example
interface HumanEvaluation {
  responseId: string;
  ratings: {
    accuracy: number;    // 1-5
    relevance: number;   // 1-5
    helpfulness: number; // 1-5
    clarity: number;     // 1-5
  };
  feedback: string;
}

async function collectHumanEvaluations(
  responses: Array<{ id: string; question: string; answer: string }>
): Promise<HumanEvaluation[]> {
  // Present to human evaluators via UI
  // Store ratings in database
  // Calculate inter-rater reliability
  const evaluations: HumanEvaluation[] = []; // populate from your rating UI / database
  return evaluations;
}

function analyzeHumanEvals(evals: HumanEvaluation[]) {
  const avgRatings = {
    accuracy: average(evals.map(e => e.ratings.accuracy)),
    relevance: average(evals.map(e => e.ratings.relevance)),
    helpfulness: average(evals.map(e => e.ratings.helpfulness)),
    clarity: average(evals.map(e => e.ratings.clarity)),
  };
  
  return avgRatings;
}
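
The average helper above is assumed; alongside it, the standard deviation of ratings per dimension is a cheap way to flag disagreement between evaluators, as a lightweight stand-in for a formal inter-rater reliability metric:

// The average helper assumed above, plus a rough disagreement check: the standard
// deviation of ratings per dimension. High spread flags responses worth a second look;
// this is a lightweight stand-in for a formal inter-rater reliability metric.
function average(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function ratingSpread(
  evals: HumanEvaluation[],
  dimension: keyof HumanEvaluation['ratings']
): number {
  const ratings = evals.map(e => e.ratings[dimension]);
  const mean = average(ratings);
  return Math.sqrt(average(ratings.map(r => (r - mean) ** 2)));
}

// Usage (after collecting evaluations): a spread above ~1 point on a 1-5 scale
// usually means evaluators disagree and the item deserves review.
const evals = await collectHumanEvaluations(responses); // responses: your eval batch
console.log(`Accuracy rating spread: ${ratingSpread(evals, 'accuracy').toFixed(2)}`);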

Need Help Testing Your AI System?

I can design a comprehensive testing strategy for your AI application and implement automated evaluation pipelines.

I've built evaluation frameworks for Fortune 500 companies that caught critical issues before production, saving millions in potential damages.