EXPANSION30 min61 sections

Context Optimization

THIS WEEK'S JOURNEY

Context Optimization: The Art of Doing More with Less

In the world of LLM applications, context is both your most powerful asset and your most expensive resource. Every token you send to a model costs money, adds latency, and competes for attention within the model's finite context window.

67%

of tokens in typical LLM contexts are redundant or low-value

This staggering figure reveals the massive opportunity in context optimization.

Key Insight

The Paradox of Context: More Information Often Means Worse Results

Contrary to intuition, providing more context frequently degrades LLM performance rather than improving it. Research from Google DeepMind shows that models exhibit 'lost in the middle' behavior, where information in the center of long contexts receives less attention than content at the beginning or end.

Notion

Reduced AI costs by 73% through intelligent context selection

API costs dropped from $2.3M to $620K monthly while user satisfaction scores inc...

Framework

The Context Optimization Triangle

Quality

The accuracy and helpfulness of model responses. Higher quality requires more relevant context, bett...

Cost

The financial expense of API calls, measured in tokens consumed. Reducing cost means smaller context...

Latency

The time from request to response. Larger contexts increase processing time, sometimes dramatically....

The Sweet Spot

The optimal context size that maximizes quality while meeting cost and latency requirements. This va...

Naive Context Loading vs. Optimized Context Engineering

Naive Approach

Include all potentially relevant documents in full

Use the same context template for every query type

Send complete conversation history every request

Include verbose system prompts with every example

Optimized Approach

Score and select only highest-relevance content chunks

Dynamic templates that adapt to query classification

Summarize older conversation turns, detail recent ones

Compressed instructions with examples loaded conditionally

Context Optimization is Not Prompt Shortening

A common misconception is that context optimization means making prompts shorter. The real goal is making prompts more efficient—delivering maximum information density per token.

Key Insight

Token Budgeting: Treat Context Like a Financial Portfolio

The most sophisticated teams approach context construction like portfolio allocation. You have a fixed budget (the context window) and multiple asset classes competing for allocation: system instructions, examples, retrieved documents, conversation history, and user-specific context.

The Context Optimization Pipeline

Raw Sources

Relevance Scoring

Chunking & Selection

Compression

Anti-Pattern: The Kitchen Sink Context

❌ Problem

Beyond the obvious cost increase, kitchen sink contexts actively harm response q...

✓ Solution

Implement aggressive relevance filtering before context assembly. Use embedding ...

Establishing Your Context Optimization Baseline

Audit Current Context Composition

Measure Quality Baseline

Calculate Current Unit Economics

Identify Optimization Opportunities

Set Optimization Targets

Linear

Achieved 4x throughput increase through context compression

Average context size dropped to 3,100 tokens. P95 latency improved from 4.2 seco...

Key Insight

The Compression-Quality Curve is Not Linear

When you begin compressing context, the first 30-40% of reduction typically has zero impact on quality—you're just removing genuine waste. The next 20-30% has minimal impact if done intelligently—you're trading verbosity for concision.

Start with the Biggest Context Sections

When beginning optimization, focus on your largest context components first. A 20% reduction in a 5,000 token section saves more than a 50% reduction in a 500 token section.

Context Optimization Quick Wins

Basic Token Budget Allocationtypescript

123456789101112
interface TokenBudget {
  total: number;
  systemPrompt: number;
  examples: number;
  retrievedDocs: number;
  conversationHistory: number;
  userContext: number;
  buffer: number;
}

function allocateTokenBudget(maxTokens: number, queryType: string): TokenBudget {
  // Different query types get different allocations

3.2x

average cost reduction achieved by teams implementing systematic context optimization

This multiplier comes from analyzing 47 production LLM applications before and after implementing context optimization strategies.

Key Insight

Context Windows Are Growing, But Optimization Still Matters

With models like Claude offering 200K token windows and Gemini pushing to 1M+, some argue context optimization is becoming irrelevant. This reasoning is flawed for three reasons.

Practice Exercise

Audit Your Current Context Efficiency

45 min

Framework

The SCORE Framework for Context Compression

Summarize

Convert verbose content into dense summaries using extractive or abstractive techniques. For documen...

Chunk

Break content into semantically meaningful units rather than arbitrary character limits. Use sentenc...

Order

Arrange context by relevance using embedding similarity, recency weighting, or task-specific scoring...

Remove

Eliminate redundant, outdated, or low-value content through deduplication and staleness detection. I...

Anthropic

Implementing Hierarchical Context Summarization for Claude

This approach reduced average context usage by 73% for document Q&A tasks while ...

Compression Strategies: Lossy vs. Lossless

Lossless Compression

Removes redundancy without losing any information (whitespac...

Achieves 20-40% size reduction typically, with guaranteed in...

Safe for all use cases including legal, medical, and complia...

Techniques: deduplication, minification, reference compressi...

Lossy Compression

Trades some information for dramatically smaller context (su...

Achieves 60-90% size reduction but requires careful validati...

Requires task-specific tuning—what's 'unimportant' varies by...

Techniques: abstractive summarization, key phrase extraction...

Key Insight

Relevance Scoring Is the Foundation of Intelligent Context Selection

Every piece of potential context should have a relevance score before entering your prompt. This isn't just about vector similarity—effective relevance scoring combines multiple signals: semantic similarity to the query (0.4 weight typically), recency of information (0.2 weight), source authority (0.2 weight), and user-specific personalization factors (0.2 weight).

Multi-Signal Relevance Scoring Implementationpython

123456789101112
from dataclasses import dataclass
from datetime import datetime, timedelta
import numpy as np

@dataclass
class ContextCandidate:
    content: str
    embedding: np.ndarray
    created_at: datetime
    source_authority: float  # 0-1 scale
    user_interaction_count: int

Implementing Dynamic Context Selection

Define Your Context Budget

Categorize Context Sources

Implement Relevance Scoring

Build the Selection Algorithm

Add Adaptive Compression

Anti-Pattern: The 'Kitchen Sink' Context Pattern

❌ Problem

Excessive context degrades model performance through attention dilution—the mode...

✓ Solution

Implement strict relevance thresholds and token budgets. Require every piece of ...

Linear

Token Budgeting for Issue Context in AI Features

Context costs dropped by 87% while user satisfaction with AI features increased ...

Framework

The Token Budget Allocation Matrix

Fixed Allocations (30-40% of budget)

Content that's always included regardless of query: system prompt, safety guidelines, output format ...

Query-Dependent Allocations (40-50% of budget)

Retrieved documents, relevant examples, and task-specific context that varies per request. This is y...

Conversation History (10-20% of budget)

Previous turns in multi-turn conversations. Implement sliding window with summarization—keep recent ...

Response Buffer (10-15% of budget)

Reserved space for the model's response. This is often forgotten, leading to truncated outputs. Esti...

The Hidden Cost of Context: Attention Computation

Token costs are just the visible expense—attention computation scales quadratically with context length. A 32K context request doesn't just cost 8x more than 4K in tokens; it requires roughly 64x more computation for self-attention.

Key Insight

Caching Is Your Most Powerful Optimization Lever

Intelligent caching can reduce effective context costs by 60-80% for most applications. The key insight is that much of your context is repetitive across requests: system prompts are identical, popular documents are retrieved frequently, and users often ask similar questions.

Implementing a Semantic Cache for Contexttypescript

123456789101112
interface CacheEntry {
  queryEmbedding: number[];
  context: string;
  response: string;
  timestamp: Date;
  hitCount: number;
}

class SemanticContextCache {
  private cache: Map<string, CacheEntry> = new Map();
  private similarityThreshold = 0.95;
  private maxAge = 3600000; // 1 hour

67%

of retrieved context is never referenced in LLM responses

This striking finding reveals massive inefficiency in typical RAG implementations.

Context Optimization Audit Checklist

Vercel

v0's Intelligent Code Context Management

Average context size dropped from 32K to 8K tokens—a 75% reduction. Generation q...

The Context Optimization Pipeline

Raw Sources

Retrieval & Scoring

Deduplication

Compression

Start with Measurement, Not Optimization

Before implementing any optimization, instrument your current system to understand baseline metrics: average context size, token costs per request, cache hit rates (if any), and response quality scores. Many teams optimize blindly, sometimes making things worse.

Practice Exercise

Build a Context Budget Enforcer

45 min

Practice Exercise

Build a Context Compression Pipeline

45 min

Complete Token Budget Manager Implementationtypescript

123456789101112
interface TokenBudget {
  total: number;
  allocated: Map<string, number>;
  used: Map<string, number>;
}

interface ContextSource {
  id: string;
  priority: number;
  minTokens: number;
  maxTokens: number;
  content: string;

Context Optimization Production Readiness Checklist

Anti-Pattern: The Kitchen Sink Context

❌ Problem

Token costs explode as contexts routinely hit maximum limits. Response quality a...

✓ Solution

Implement strict relevance filtering before any content enters context. Score ev...

Practice Exercise

Implement Semantic Caching for Similar Queries

60 min

Relevance Scoring with Hybrid Rankingpython

123456789101112
from dataclasses import dataclass
from typing import List
import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class ContextChunk:
    id: str
    content: str
    source: str
    timestamp: float
    metadata: dict

Anti-Pattern: Static Budget Allocation

❌ Problem

Simple queries waste tokens on unnecessary context while complex queries are sta...

✓ Solution

Implement query classification to categorize requests by complexity and type. De...

Practice Exercise

Build a Context Optimization Dashboard

90 min

Context Optimization Tools and Libraries

tiktoken

tool

LangChain Text Splitters

tool

Chroma DB

tool

LLMLingua

tool

Anti-Pattern: Compression Without Quality Monitoring

❌ Problem

Response quality silently degrades as important information is compressed away. ...

✓ Solution

Establish quality baselines before implementing compression using a golden test ...

Context Optimization Quality Assurance Checklist

Practice Exercise

Optimize a Real Context Pipeline End-to-End

120 min

Production Context Optimization Pipelinetypescript

123456789101112
class ContextOptimizationPipeline {
  private cache: SemanticCache;
  private scorer: HybridRelevanceScorer;
  private compressor: TieredCompressor;
  private budgetManager: TokenBudgetManager;
  private metrics: MetricsCollector;
  
  async assembleContext(query: string, userId: string): Promise<OptimizedContext> {
    const startTime = Date.now();
    
    // Check semantic cache first
    const cached = await this.cache.findSimilar(query, 0.92);

Start with Measurement, Not Optimization

Before implementing any optimization technique, establish comprehensive measurement of your current state. Know your baseline token usage, cost per conversation, response quality scores, and latency percentiles.

Framework

The SCALE Framework for Context Optimization

Score

Implement relevance scoring for all context sources. Every piece of potential context should have a ...

Compress

Apply tiered compression based on relevance and content type. High-relevance content gets light comp...

Allocate

Dynamically allocate token budgets based on query complexity and source importance. Move away from s...

Learn

Continuously learn from production data to improve optimization. Track which context actually improv...

Further Learning: Advanced Context Optimization

Attention Is All You Need (Original Transformer Paper)

article

Lost in the Middle: How Language Models Use Long Contexts

article

Prompt Compression Research (Microsoft)

article

Building LLM Applications for Production (Chip Huyen)

book

Quick Wins for Immediate Impact

If you need to show optimization results quickly, start with these high-impact, low-effort changes: enable provider prompt caching (often just a configuration change), implement basic deduplication to remove repeated content, add simple recency filters to conversation history, and cache embedding computations. These typically yield 20-30% cost reduction within a week of implementation..

Chapter Complete!

Context optimization is essential for production LLM applica...

Effective optimization requires a multi-layered approach com...

Quality preservation must be the constraint on all optimizat...

Measurement is the foundation of optimization—instrument eve...

Next: Begin by auditing your current context usage—measure token counts by source, identify redundancy, and establish quality baselines

PreviousNext