Every interaction with a large language model begins with a fundamental transformation: your text becomes numbers, those numbers flow through attention mechanisms, and meaning emerges from mathematical operations performed billions of times per second. Understanding these mechanics isn't just academic curiosity—it's the foundation of effective context engineering.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> dict:
    """Count tokens and estimate costs before making API calls."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    # GPT-4 pricing as of 2024
    input_cost_per_1k = 0.03
    return {
        "token_count": len(tokens),
from dataclasses import dataclass
from typing import List, Optional
from enum import Enum

class AttentionZone(Enum):
    PRIMARY_START = "primary_start"  # First 500 tokens
    PRIMARY_END = "primary_end"      # Last 300 tokens
    SECONDARY = "secondary"          # Positions 500-1000 and -600 to -300
    COMPRESSION = "compression"      # Middle zone

@dataclass
class ContextBlock:
import tiktoken
from typing import List, Dict, Optional
from dataclasses import dataclass

@dataclass
class ContextSegment:
    content: str
    priority: int  # 1 = critical, 2 = important, 3 = helpful
    position: str  # 'start', 'middle', 'end'

class TokenAwareContextBuilder:
    def __init__(self, model: str = 'gpt-4', max_tokens: int = 8000):
interface RetrievedChunk {
  content: string;
  relevanceScore: number;
  tokenCount: number;
  source: string;
}

interface ContextConfig {
  maxTokens: number;
  reserveForQuery: number;
  reserveForResponse: number;
  boundaryBoost: number; // Extra weight for boundary positions
from typing import List, Tuple
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticContextCompressor:
    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        self.encoder = SentenceTransformer(model_name)

    def compress_context(
        self,
        documents: List[str],
        query: str,