Context Compression: The Art of Saying More with Less
As context windows expand to 128K, 200K, and even 1M tokens, a counterintuitive truth emerges: the most effective LLM applications often use far less context than available. Context compression is the discipline of maximizing information density while minimizing token usage, directly impacting latency, cost, and model performance.
4-8x
Token reduction achievable with modern compression techniques
Studies demonstrate that intelligent compression can reduce token counts by 4-8x while maintaining 95%+ of task performance.
Key Insight
Compression Is Not Just About Cost—It's About Quality
The most overlooked benefit of context compression is improved model reasoning. When you feed an LLM 50,000 tokens of loosely relevant information, the attention mechanism must distribute focus across all of it, diluting attention on truly important content.
Lossy vs Lossless Context Compression
Lossy Compression
Summarization that discards details deemed less important
Higher compression ratios (10-20x possible)
Risk of losing critical information for edge cases
Best for general understanding and broad queries
Lossless Compression
Removes redundancy without losing any semantic content (see the sketch after this comparison)
Lower compression ratios (2-4x typical)
Preserves all information for precise retrieval
Best for legal, medical, or compliance-critical applications
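Where the lossless route is required, even mechanical redundancy removal earns a meaningful reduction. Below is a minimal sketch of that idea, assuming a simple regex-based sentence splitter; the function name and normalization rules are illustrative, not from a specific library:

import re

def dedupe_sentences(text: str) -> str:
    # Drop repeated sentences, keeping first occurrences; unique content is untouched.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    seen, kept = set(), []
    for s in sentences:
        key = re.sub(r"\s+", " ", s.strip().lower())  # normalize for comparison
        if key and key not in seen:
            seen.add(key)
            kept.append(s.strip())
    return " ".join(kept)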
Framework
The Compression Decision Matrix
Criticality Assessment
Evaluate how critical exact information preservation is. Legal contracts require lossless approaches; exploratory or general-understanding queries tolerate lossy summarization.
Latency Requirements
Real-time applications with <500ms requirements need pre-computed compression or lightweight algorithms.
Cost Sensitivity
Calculate your cost-per-query and annual token spend. Applications spending >$10K/month on tokens should prioritize compression investment.
Query Diversity
If users ask predictable questions, aggressive compression works well. High query diversity requires more conservative compression so information needed by edge-case queries survives.
Case Study: Notion
Hierarchical Compression for AI-Powered Search
67% reduction in API costs, 340ms average latency improvement, and a 23% improvement…
Implementing LLMLingua Compression in Python
from llmlingua import PromptCompressor

# Initialize compressor with a small model for speed
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True,
    device_map="cuda"  # GPU acceleration recommended
)

def compress_context(context: str, target_ratio: float = 0.5) -> dict:
    """
    Compress context while preserving semantic meaning.
    target_ratio is the fraction of tokens to keep (0.5 keeps ~half).
    """
    # compress_prompt returns the compressed text plus token statistics
    result = compressor.compress_prompt(context, rate=target_ratio)
    return {
        "compressed": result["compressed_prompt"],
        "original_tokens": result["origin_tokens"],
        "compressed_tokens": result["compressed_tokens"],
    }
Compression Can Introduce Subtle Errors
Always validate compressed outputs against your specific task. A 95% accuracy preservation rate means 5% of queries will have degraded responses.
Key Insight
Selective Context: The 80/20 Rule of Information Retrieval
Selective context is the practice of dynamically choosing which information to include based on the specific query, rather than compressing everything uniformly. Studies from Google's research team show that for most queries, 80% of the useful information comes from 20% of available documents.
The Selective Context Funnel
[Funnel diagram: All Documents → Keyword Filter → Embedding Similarity → LLM Relevance Score]
Implementing a Selective Context Pipeline
1. Build Your Document Index
2. Implement Query Analysis
3. Execute Parallel Retrieval
4. Apply Cross-Encoder Reranking
5. Compute Information Density Scores
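A condensed sketch of steps 1-4 above, using off-the-shelf sentence-transformers models as stand-ins. The model names, the keyword heuristic, and the shortlist size are illustrative assumptions, and information-density scoring (step 5) is left as a final filter:

import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def select_context(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    # Cheap keyword filter narrows the candidate pool first
    terms = set(query.lower().split())
    candidates = [d for d in docs if terms & set(d.lower().split())] or docs
    # Bi-encoder similarity keeps a shortlist for the expensive reranker
    q = bi_encoder.encode(query, normalize_embeddings=True)
    embs = bi_encoder.encode(candidates, normalize_embeddings=True)
    shortlist = [candidates[i] for i in np.argsort(-(embs @ q))[:top_k * 3]]
    # Cross-encoder reranking supplies the final precision
    scores = cross_encoder.predict([(query, d) for d in shortlist])
    ranked = sorted(zip(scores, shortlist), key=lambda p: -p[0])
    return [d for _, d in ranked[:top_k]]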
Anti-Pattern: The 'Stuff Everything' Approach
❌ Problem
Systems using this approach typically see 30-40% higher latency and 3-5x higher costs.
✓ Solution
Implement relevance-based filtering before context assembly. Start with aggressive filtering and relax the threshold only if answer quality suffers.
Case Study: Anthropic
Research on Context Utilization Patterns
After restructuring context assembly based on these research findings, Anthropic...
Key Insight
Summary Chains: Recursive Compression for Massive Document Sets
When you need to compress hundreds or thousands of documents into a coherent context, summary chains provide a structured approach. The technique works by recursively summarizing groups of documents until you reach your target token budget.
Use Different Models for Different Compression Stages
You don't need GPT-4 for every compression step. Use fast, cheap models (GPT-3.5-turbo, Claude Instant) for initial summarization passes where volume is high.
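As a sketch of that tiering: fold document groups with a caller-supplied cheap summarizer and reserve the strong model for the single final merge. Here `cheap_summarize` and `strong_summarize` are hypothetical callables wrapping whatever LLM client you use:

from typing import Callable, List

def summary_chain(docs: List[str],
                  cheap_summarize: Callable[[str], str],   # e.g. a GPT-3.5-turbo call
                  strong_summarize: Callable[[str], str],  # e.g. a GPT-4 call, used once
                  fan_in: int = 5) -> str:
    # Recursively collapse groups of fan_in texts until one strong pass suffices.
    level = docs
    while len(level) > fan_in:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        level = [cheap_summarize("\n\n".join(g)) for g in groups]
    return strong_summarize("\n\n".join(level))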
23%
Average accuracy improvement when using compressed vs. verbose context
Counter to intuition, compressed contexts often yield better results than verbose ones.
Practice Exercise
Build a Basic Compression Pipeline
45 min
Framework
The Compression Decision Matrix
Information Criticality Assessment
Evaluate each piece of context on a 1-5 scale for how essential it is to task completion. Critical items warrant lossless handling.
Retrieval Frequency Analysis
Track how often specific context pieces are actually used in model outputs. High-frequency context earns its tokens; rarely used pieces are candidates for aggressive compression.
Temporal Relevance Scoring
Assign decay weights to context based on age and task requirements. Recent context typically needs higher fidelity than older material (a minimal decay sketch follows this list).
Compression Technique Mapping
Map each context category to appropriate techniques: extractive for factual data, abstractive for narrative content.
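A decay weight can be as simple as an exponential applied to the relevance score; the 30-day half-life below is an illustrative default, not a recommendation from the text:

import time

def decayed_relevance(similarity: float, timestamp: float,
                      half_life_days: float = 30.0) -> float:
    # A chunk half_life_days old retains half of its relevance weight.
    age_days = (time.time() - timestamp) / 86400
    return similarity * 0.5 ** (age_days / half_life_days)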
Case Study: Notion
Hierarchical Summarization for Workspace Context
Notion AI achieved 89% answer accuracy on workspace-wide questions, up from 67%.
Lossy vs Lossless Compression Strategies
Lossy Compression
Achieves 5-20x compression ratios by removing 'unnecessary' detail.
Selective Context Filtering with Importance Scoring in Python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class ContextChunk:
    content: str
    source: str
    timestamp: float
    token_count: int

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rank_by_importance(query: str,
                       chunks: List[ContextChunk]) -> List[Tuple[float, ContextChunk]]:
    # Importance = cosine similarity between the query and each chunk's content
    q = encoder.encode(query, normalize_embeddings=True)
    embs = encoder.encode([c.content for c in chunks], normalize_embeddings=True)
    return sorted(zip((embs @ q).tolist(), chunks), key=lambda p: -p[0])
Anti-Pattern: The Uniform Compression Fallacy
❌ Problem
Uniform compression either under-compresses redundant content (wasting tokens) or over-compresses dense content (losing critical information).
✓ Solution
Implement content-type-aware compression with different strategies per source. Summarize verbose prose aggressively while keeping code and structured data close to verbatim.
Key Insight
The Compression Cliff: Why 10x Compression Often Outperforms 5x
Counter-intuitively, aggressive compression sometimes produces better results than moderate compression. At 5x compression, you're often left with awkward partial sentences and fragmented context that confuses the model; at 10x, the summarizer must rewrite content into complete, coherent prose, which the model parses more reliably.
Case Study: Anthropic
Constitutional AI Context Compression Research
The hierarchical compression approach reduced constitutional context from 8,000 tokens…
73%
of context tokens in typical RAG applications are redundant or low-value
This finding from Microsoft's analysis of production RAG systems reveals massive optimization potential.
Compression Can Amplify Bias
Summarization models often preserve majority viewpoints while dropping minority perspectives or edge cases. If your context includes diverse user feedback or multiple stakeholder views, aggressive compression may create a biased representation.
Framework
The CRISP Compression Framework
Classify (Content Typing)
Automatically categorize each context chunk by type: factual data, procedural instructions, conversation history, and so on.
Rank (Importance Scoring)
Score each chunk's importance for the current query using semantic similarity, recency, and source authority.
Identify (Redundancy Detection)
Find overlapping information across chunks using embedding clustering and entity extraction. When multiple chunks carry the same fact, keep only the best-sourced copy.
Summarize (Adaptive Compression)
Apply type-appropriate compression to each chunk. Factual data gets extractive compression (key sentences kept verbatim), while narrative content gets abstractive summarization; a dispatch sketch follows.
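One way to wire the Classify and Summarize stages together is a per-type dispatch table. The two compressors here are trivial placeholders standing in for real extractive and abstractive models:

from typing import Callable, Dict, List, Tuple

def extract_key_sentences(text: str) -> str:
    # Placeholder extractive pass: keep first and last sentences verbatim.
    sents = [s for s in text.split(". ") if s]
    return ". ".join(sents[:1] + sents[-1:]) if len(sents) > 2 else text

def abstractive_summary(text: str) -> str:
    return text[:200]  # placeholder for an LLM summarization call

STRATEGIES: Dict[str, Callable[[str], str]] = {
    "factual": extract_key_sentences,  # extractive: preserve exact figures
    "narrative": abstractive_summary,  # abstractive: rewrite concisely
    "code": lambda t: t,               # pass code through untouched
}

def crisp_compress(chunks: List[Tuple[str, str]]) -> List[str]:
    # chunks are (content_type, text) pairs from the Classify step; the Rank
    # and Identify stages would filter and dedupe before this dispatch.
    return [STRATEGIES.get(ctype, abstractive_summary)(text)
            for ctype, text in chunks]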
Practice Exercise
Build a Multi-Strategy Compression Pipeline
90 min
Summary Chain Architecture
[Flow diagram: Original Content → Paragraph Chunking → Paragraph Summaries → Section Grouping → …]
Use Smaller Models for Compression
Summarization is a well-understood task where smaller models perform nearly as well as large ones. GPT-3.5-turbo or Claude Instant can summarize at 10-20x lower cost than GPT-4 or Claude Opus with minimal quality difference.
Case Study: Stripe
API Documentation Compression for Developer Copilot
The specialized compression pipeline reduced documentation context by 65% while ...
Key Insight
Compression Quality Degrades Non-Linearly with Ratio
Quality degradation from compression follows an S-curve, not a linear relationship. From 0-40% compression, quality loss is minimal (1-3%) because you're removing true redundancy.
Hierarchical Summarization with Quality Verification in Python
from typing import List, Optional
import hashlib
from dataclasses import dataclass

@dataclass
class SummaryNode:
    content: str
    level: int  # 0 = root, higher = more detailed
    token_count: int
    children: List['SummaryNode']
    source_hash: str
    embedding: Optional[List[float]] = None

def make_node(content: str, level: int,
              children: List['SummaryNode']) -> SummaryNode:
    # Hash the content so stale summaries can be detected and rebuilt later.
    return SummaryNode(
        content=content,
        level=level,
        token_count=len(content.split()),  # crude whitespace token estimate
        children=children,
        source_hash=hashlib.sha256(content.encode()).hexdigest(),
    )
Essential Context Compression Resources
LLMLingua GitHub Repository (tool)
Anthropic's Context Distillation Paper (article)
LangChain Summarization Chains Documentation (article)
ROUGE Score Implementation on Hugging Face (tool)
Practice Exercise
Build a Compression Benchmark Suite
45 min
Implementing Selective Context with Relevance Scoring in Python
import numpy as np
from typing import List
from sentence_transformers import SentenceTransformer

class SelectiveContextCompressor:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.encoder = SentenceTransformer(model_name)
        self.relevance_cache = {}

    def compress(self, query: str, context_chunks: List[str],
                 token_budget: int = 2000) -> List[str]:
        # Greedily keep the most query-relevant chunks until the budget is spent.
        q = self.encoder.encode(query, normalize_embeddings=True)
        embs = self.encoder.encode(context_chunks, normalize_embeddings=True)
        kept, used = [], 0
        for i in np.argsort(-(embs @ q)):
            cost = len(context_chunks[i].split())  # rough token estimate
            if used + cost <= token_budget:
                kept.append(context_chunks[i])
                used += cost
        return kept
Anti-Pattern: The 'Compress Everything Equally' Mistake
❌ Problem
System instructions get compressed and lose critical behavioral constraints, causing the model to drift from its intended behavior.
✓ Solution
Implement tiered compression policies based on content type and importance. System instructions stay verbatim; retrieved documents absorb the compression budget.
Anti-Pattern: Ignoring Compression Latency in Real-Time Systems
❌ Problem
User-perceived latency increases by 15-30%, causing frustration and abandonment.
✓ Solution
Implement compression latency budgets as a first-class constraint alongside compression ratio and quality targets, as sketched below.
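One sketch of such a budget: race the compressor against a hard timeout with asyncio and fall back to plain truncation rather than blocking the request. The 150ms budget and fallback size are illustrative, and `compress_fn` is any synchronous compression callable:

import asyncio

async def compress_within_budget(text: str, compress_fn, budget_s: float = 0.15,
                                 fallback_max_chars: int = 8000) -> str:
    # If compression misses the latency budget, degrade gracefully to truncation.
    try:
        return await asyncio.wait_for(asyncio.to_thread(compress_fn, text),
                                      timeout=budget_s)
    except asyncio.TimeoutError:
        return text[:fallback_max_chars]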
Essential Context Compression Resources
LLMLingua: Compressing Prompts for Accelerated Inference (article)
LangChain ContextualCompressionRetriever (tool)
Anthropic's Context Window Research Blog (article)
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (article)
Practice Exercise
Build a Lossy vs Lossless Compression Comparator
40 min
Compression Quality Degrades Non-Linearly
Compression quality doesn't degrade linearly with compression ratio. Most content can be compressed 30-40% with negligible quality loss, but pushing beyond 60% compression often causes sudden quality cliffs where critical information is lost.
Implementing Summary Chains with Quality Checkpoints in Python
from typing import Dict
from dataclasses import dataclass

@dataclass
class SummaryResult:
    text: str
    level: int
    source_tokens: int
    output_tokens: int
    quality_score: float
    metadata: Dict

async def summarize_with_checkpoint(text: str, level: int, summarize, score,
                                    min_quality: float = 0.8) -> SummaryResult:
    # summarize is an async LLM call; score compares source and summary (e.g. ROUGE)
    summary = await summarize(text)
    quality = score(text, summary)
    retried = quality < min_quality
    if retried:  # one retry before surfacing a low-quality summary
        summary = await summarize(text)
        quality = score(text, summary)
    return SummaryResult(summary, level, len(text.split()),
                         len(summary.split()), quality, {"retried": retried})
Anti-Pattern: Compressing Without Preserving Structure
❌ Problem
Compressed code examples become useless because they no longer run or even parse.
✓ Solution
Implement structure-aware compression that handles different content types appropriately; a minimal sketch follows.
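A minimal sketch of that idea for markdown-style documents: split on fenced code blocks, summarize only the prose segments, and pass code through verbatim. `summarize` is a caller-supplied function, not a specific library API:

import re

def compress_preserving_structure(text: str, summarize) -> str:
    # The capture group keeps code fences in the split output so they survive intact.
    parts = re.split(r"(```.*?```)", text, flags=re.DOTALL)
    return "".join(p if p.startswith("```") else summarize(p) for p in parts)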
Practice Exercise
Create a Context Budget Allocator
50 min
Pre-compute Compression for Static Content
System prompts, documentation, and other static content should be compressed offline and cached, not compressed at request time. Pre-computing compression for your 100 most common retrieved documents can eliminate 60-80% of runtime compression overhead.
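A sketch of that caching layer: static content hashes to a stable key, so each document is compressed once and served from cache on every later request (`compress_fn` is whatever compressor you use):

import hashlib

_compression_cache: dict[str, str] = {}

def cached_compress(content: str, compress_fn) -> str:
    # Same bytes -> same key -> the expensive compression runs at most once.
    key = hashlib.sha256(content.encode()).hexdigest()
    if key not in _compression_cache:
        _compression_cache[key] = compress_fn(content)
    return _compression_cache[key]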
47%
Average tokens that can be removed from typical RAG contexts without quality impact
The LLMLingua paper demonstrated that nearly half of tokens in typical retrieval-augmented contexts are low-information tokens that can be removed while maintaining task performance.
Framework
The CRISP Compression Decision Framework
Content Type Analysis
Categorize your context content into types (prose, code, structured data, conversation) and identify which types tolerate lossy compression.
Requirements Mapping
Define hard constraints: maximum acceptable latency, minimum quality threshold, and token budget limits.
Information Density Assessment
Evaluate how information-dense your content is. Dense content (technical specs, legal text) tolerates far less compression than redundant conversational logs.
Sensitivity Classification
Identify content that must never be compressed or modified: safety guidelines, legal disclaimers, critical system instructions.
Compression Can Amplify Retrieval Errors
If your retrieval system returns marginally relevant documents, compression will often make them appear more relevant by removing the irrelevant parts. This can mask retrieval quality issues and lead to overconfidence in poor retrieval.
Chapter Complete!
Context compression is essential for production LLM systems, improving cost, latency, and answer quality together.
Selective context based on relevance scoring provides the best balance of quality and token savings.
Hierarchical summarization enables handling of arbitrarily large document sets.
The distinction between lossy and lossless compression is critical for matching technique to risk tolerance.
Next: Start by auditing your current context composition to identify the highest-impact compression opportunities