What is LLM Throughput?
Executive Summary
LLM throughput measures the total number of tokens a language model can generate or process per unit time, typically expressed as tokens per second across all concurrent requests.
Throughput is a system-level metric that captures aggregate processing capacity, distinct from latency which measures individual request response time, and optimizing one often involves tradeoffs with the other.
Key factors affecting throughput include batch size, model architecture, hardware specifications, memory bandwidth, and serving infrastructure design, with continuous batching and speculative decoding being primary optimization techniques.
Production systems must balance throughput against latency SLAs, cost constraints, and quality requirements, making throughput optimization a multi-dimensional engineering challenge rather than a single-variable problem.
The Bottom Line
LLM throughput determines the economic viability and scalability of AI applications by directly impacting how many users or requests a system can serve with given hardware resources. Understanding and optimizing throughput is essential for any production LLM deployment, as it fundamentally affects infrastructure costs, user experience, and system capacity.
Definition
LLM throughput is a performance metric that quantifies the aggregate rate at which a large language model system generates or processes tokens, typically measured in tokens per second (TPS) across all concurrent requests being served.
This metric reflects the total productive capacity of an inference system and serves as the primary indicator of how efficiently hardware resources are being utilized to serve model predictions at scale.
Extended Definition
LLM throughput encompasses both the prefill phase (processing input tokens) and the decode phase (generating output tokens), though the decode phase typically dominates in autoregressive generation scenarios. The metric can be measured at various granularities: per-GPU, per-node, per-cluster, or per-service, depending on the architectural scope being evaluated. Throughput is fundamentally bounded by memory bandwidth during the decode phase and by compute capacity during the prefill phase, creating distinct optimization strategies for each bottleneck. In production environments, throughput must be considered alongside quality-of-service constraints, as maximum theoretical throughput often cannot be achieved while maintaining acceptable latency percentiles for individual requests.
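As a concrete illustration of the definition, the sketch below computes aggregate throughput from a log of completed requests over a measurement window. The dataclass and field names are illustrative rather than drawn from any particular serving framework.

```python
from dataclasses import dataclass

@dataclass
class CompletedRequest:
    prefill_tokens: int   # input tokens processed
    decode_tokens: int    # output tokens generated

def aggregate_throughput(requests: list[CompletedRequest], window_seconds: float) -> dict:
    """System-level throughput over a measurement window.

    Returns prefill, decode, and total tokens per second across all requests
    completed in the window, regardless of how long any individual request
    took (that is latency, not throughput).
    """
    prefill = sum(r.prefill_tokens for r in requests)
    decode = sum(r.decode_tokens for r in requests)
    return {
        "prefill_tokens_per_s": prefill / window_seconds,
        "decode_tokens_per_s": decode / window_seconds,
        "total_tokens_per_s": (prefill + decode) / window_seconds,
    }

# Example: 120 requests finished in a 10-second window
reqs = [CompletedRequest(prefill_tokens=500, decode_tokens=200) for _ in range(120)]
print(aggregate_throughput(reqs, window_seconds=10.0))
# -> {'prefill_tokens_per_s': 6000.0, 'decode_tokens_per_s': 2400.0, 'total_tokens_per_s': 8400.0}
```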
Etymology & Origins
The term 'throughput' originates from industrial engineering and manufacturing, where it described the rate at which products move through a production system. In computing, throughput became a standard performance metric in the 1960s with batch processing systems, measuring jobs completed per unit time. The application to LLMs emerged with the scaling of transformer models in 2020-2023, as practitioners needed to distinguish between single-request latency and system-wide processing capacity. The tokens-per-second formulation became standard as the atomic unit of LLM computation, analogous to FLOPS for general compute or IOPS for storage systems.
Not To Be Confused With
Latency
Latency measures the time for a single request to complete (time-to-first-token or total generation time), while throughput measures aggregate tokens generated across all requests per unit time. A system can have high throughput but poor latency if it batches many requests together, or low throughput but good latency if it processes requests sequentially with minimal queuing.
Requests per second (RPS)
RPS counts completed requests regardless of their token count, while throughput measures total tokens processed. A system handling many short requests may have high RPS but lower throughput than one handling fewer long-form generations. Throughput provides a more accurate measure of computational work performed.
FLOPS (Floating Point Operations Per Second)
FLOPS measures raw computational capacity of hardware, while throughput measures actual productive output in application-specific units (tokens). A system may have high theoretical FLOPS but low throughput due to memory bandwidth limitations, inefficient batching, or suboptimal kernel implementations.
Bandwidth
Memory bandwidth refers to the rate of data transfer between memory and compute units (GB/s), while throughput refers to the rate of meaningful output generation. Bandwidth is a hardware specification that constrains throughput, but they measure different things and have different units.
Capacity
Capacity often refers to the maximum number of concurrent requests or users a system can handle, while throughput specifically measures the rate of token generation. A system might have high capacity (many concurrent connections) but limited throughput if each request generates tokens slowly.
Batch size
Batch size is the number of requests processed simultaneously in a single forward pass, while throughput is the resulting rate of token generation. Increasing batch size generally increases throughput up to memory limits, but batch size is an input parameter while throughput is an output metric.
Conceptual Foundation
Mental Models
Factory Assembly Line
Think of LLM inference as a factory where tokens are products being manufactured. Throughput is the total number of products leaving the factory per hour, regardless of how long any individual product spent in production. Batching is like processing multiple products simultaneously on the assembly line, and memory bandwidth is the speed of the conveyor belt bringing raw materials (weights) to the workers (compute units).
Highway Traffic Flow
Consider throughput as the total number of vehicles passing a checkpoint per hour on a highway. Latency is how long each individual car takes to travel the route. Adding more lanes (GPUs) increases capacity, but each lane has a speed limit (memory bandwidth). Batching is like carpooling - more passengers (tokens) per vehicle (forward pass) means more people transported with the same road usage.
Restaurant Kitchen
Imagine the LLM as a restaurant kitchen where throughput is meals served per hour. The chef (GPU) can only cook as fast as ingredients (model weights) arrive from the pantry (memory). Cooking multiple orders at once (batching) is efficient, but the kitchen has limited counter space (GPU memory) for active orders. The KV cache is like keeping partially prepared ingredients for each table's ongoing meal.
Roofline Model
The roofline model visualizes throughput as bounded by either compute capacity (horizontal ceiling) or memory bandwidth (diagonal slope), with actual performance falling below this roof based on operational intensity. LLM decode operations sit on the memory-bound slope, while prefill operations approach the compute-bound ceiling.
Queueing System
Model the inference system as a queueing network where requests arrive, wait in queue, receive service, and depart. Throughput is the departure rate, which in steady state equals the arrival rate up to system capacity. Little's Law (L = λW) relates concurrent requests, throughput, and latency, providing a mathematical foundation for capacity planning (a worked sketch follows these mental models).
Bucket Brigade
Envision token generation as a bucket brigade where each person (pipeline stage) passes buckets (tensors) to the next. Throughput is limited by the slowest person in the chain. Pipeline parallelism adds more brigades working on different buckets simultaneously, while tensor parallelism adds more people at each station to handle larger buckets.
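The worked sketch promised under the queueing-system model: Little's Law translates directly into capacity arithmetic. The numbers below are illustrative, not measurements from any particular system.

```python
def concurrent_requests(throughput_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W.

    In steady state, the average number of requests in the system equals the
    arrival/departure rate (throughput) times the average time each request
    spends in the system (latency).
    """
    return throughput_rps * avg_latency_s

def required_throughput(concurrency: float, avg_latency_s: float) -> float:
    """Rearranged: lambda = L / W, the throughput needed to sustain a given
    number of in-flight requests at a given average latency."""
    return concurrency / avg_latency_s

# Example: 20 requests/s at 4 s average end-to-end latency implies ~80 requests
# in flight, which must fit within the batch and queue budget.
print(concurrent_requests(20, 4.0))   # 80.0
print(required_throughput(80, 4.0))   # 20.0
```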
Key Insights
Throughput and latency are not simply inversely related - they can be independently optimized to some degree through techniques like continuous batching, which maintains high throughput while bounding latency through iteration-level scheduling and preemption.
The memory-bound nature of LLM decoding means that faster GPUs (higher TFLOPS) often provide minimal throughput improvement compared to GPUs with higher memory bandwidth (HBM3 vs HBM2), making bandwidth-per-dollar a more relevant metric than compute-per-dollar for inference workloads.
KV cache memory consumption grows linearly with both batch size and sequence length, creating a two-dimensional constraint surface that limits achievable throughput - doubling context length effectively halves maximum batch size and thus throughput for decode-bound workloads (a back-of-the-envelope sketch follows this list).
Continuous batching can improve throughput by 2-10x over static batching by eliminating padding waste and enabling immediate scheduling of new requests into completed sequence slots, making it essential for production deployments with variable-length requests.
Speculative decoding can improve throughput by 2-3x for appropriate workloads without quality degradation, but requires careful draft model selection and provides diminishing returns as generation becomes less predictable or draft model accuracy decreases.
Quantization provides roughly linear throughput improvements (2x for INT8, 4x for INT4) by reducing memory bandwidth requirements, but the actual gains depend on whether the workload is truly memory-bound and whether quantization kernels are well-optimized.
Multi-GPU inference introduces communication overhead that can negate throughput gains if not carefully managed - tensor parallelism across 8 GPUs rarely achieves 8x throughput due to all-reduce synchronization costs, especially for smaller batch sizes.
The optimal batch size for throughput is not always the maximum that fits in memory - there's often a sweet spot where further batching increases latency more than it improves throughput due to compute saturation or scheduling overhead.
Prefill operations can severely impact decode throughput in shared systems because prefill is compute-intensive and can starve decode iterations of GPU cycles, making prefill-decode disaggregation valuable for mixed workloads.
Real-world throughput is typically 30-50% of theoretical maximum due to software overhead, memory fragmentation, variable sequence lengths, and the inability to maintain perfect batch utilization - benchmarks claiming higher numbers often use unrealistic synthetic workloads.
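The KV cache constraint noted above can be made concrete with a simple memory budget. The sketch below estimates per-token KV cache size from model shape and derives the maximum batch size at a target context length; the parameters are illustrative, not a specific model's published configuration.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Keys and values are both cached, hence the factor of 2.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_batch_size(gpu_mem_gb: float, weights_gb: float, context_len: int,
                   per_token_bytes: int, overhead_gb: float = 2.0) -> int:
    """Batch size limited by the memory left for KV cache after weights and runtime overhead."""
    free_bytes = (gpu_mem_gb - weights_gb - overhead_gb) * 1e9
    per_sequence_bytes = per_token_bytes * context_len
    return max(int(free_bytes // per_sequence_bytes), 0)

# Illustrative 7B-class model in FP16 on an 80 GB GPU
per_tok = kv_cache_bytes_per_token(layers=32, kv_heads=32, head_dim=128)  # ~0.5 MB/token
print(max_batch_size(gpu_mem_gb=80, weights_gb=14, context_len=4096, per_token_bytes=per_tok))
# -> 29 sequences at a 4096-token context; doubling context_len roughly halves this, as noted above.
```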
When to Use
Ideal Scenarios
Batch processing pipelines where large volumes of text need to be processed offline without real-time latency requirements, such as document summarization, content moderation, or data extraction jobs that can tolerate queuing delays in exchange for cost efficiency.
High-volume API services serving many concurrent users where aggregate capacity matters more than individual request speed, such as chatbot platforms, content generation services, or AI-assisted writing tools with thousands of simultaneous users.
Cost optimization initiatives where the goal is to maximize useful work per dollar of infrastructure spend, requiring throughput metrics to evaluate hardware efficiency and identify optimization opportunities.
Capacity planning exercises that need to determine how many GPUs or instances are required to handle projected request volumes while meeting SLO targets for both throughput and latency.
A/B testing of inference optimizations where throughput serves as the primary metric for evaluating the effectiveness of changes like quantization, batching strategies, or serving framework upgrades.
Hardware selection decisions comparing different GPU models, cloud instance types, or inference accelerators where throughput-per-dollar determines the most cost-effective option for specific workload patterns.
Autoscaling configuration where throughput metrics drive scaling decisions, triggering scale-out when throughput approaches capacity limits and scale-in when throughput indicates underutilization.
SLO definition and monitoring where throughput targets are established as part of service level objectives, with alerts triggered when throughput drops below acceptable thresholds indicating system degradation.
Benchmark comparisons evaluating different models, serving frameworks, or optimization techniques where throughput provides a standardized metric for comparing alternatives under controlled conditions.
Resource allocation decisions in multi-tenant environments where throughput metrics help fairly distribute GPU resources among competing workloads based on their processing requirements.
Performance regression detection in CI/CD pipelines where throughput benchmarks catch degradations introduced by code changes before they reach production.
Inference cost modeling where throughput directly determines the cost per token or cost per request, enabling accurate pricing and margin calculations for AI-powered products.
Prerequisites
Stable model deployment with consistent inference behavior, as throughput measurements are meaningless if the model or serving infrastructure is still being actively modified or debugged.
Representative workload characterization including typical input lengths, output lengths, request arrival patterns, and concurrency levels, since throughput varies significantly across different workload profiles.
Adequate monitoring infrastructure capable of measuring tokens generated, request counts, and timing at appropriate granularity without introducing significant measurement overhead.
Defined latency constraints that bound acceptable throughput optimization strategies, since maximum throughput often requires latency tradeoffs that may violate user experience requirements.
Sufficient hardware resources to achieve meaningful throughput levels, as throughput optimization on severely under-provisioned systems may yield misleading results that don't transfer to production scale.
Understanding of the memory-compute tradeoff landscape for the specific model and hardware combination, enabling informed decisions about batch sizes, parallelism strategies, and optimization techniques.
Baseline measurements establishing current throughput levels before optimization efforts begin, providing a reference point for measuring improvement and detecting regressions.
Clear business requirements translating throughput into business value, such as cost per request targets, user capacity requirements, or infrastructure budget constraints.
Signals You Need This
Infrastructure costs are disproportionately high relative to the value delivered by AI features, suggesting that throughput optimization could significantly reduce cost per inference.
Request queues are growing during peak periods, indicating that current throughput is insufficient to handle demand and users are experiencing increased latency due to queuing.
GPU utilization metrics show low compute utilization (under 30%) during inference, suggesting that batching or other throughput optimizations could extract more value from existing hardware.
Autoscaling is triggering frequently and aggressively, potentially indicating that throughput per instance is suboptimal and more efficient serving could reduce instance counts.
Batch processing jobs are taking longer than acceptable time windows, requiring either more hardware or throughput optimization to complete within required timeframes.
Cost projections for scaling to target user counts exceed budget constraints, necessitating throughput improvements to serve more users with existing or reduced infrastructure.
Competitive analysis reveals that similar services achieve better price-performance ratios, suggesting throughput optimization opportunities that competitors have already captured.
Memory utilization is high while compute utilization is low, indicating memory-bound operation where throughput optimization techniques like quantization or KV cache optimization could help.
Request rejection rates or timeout rates are increasing, signaling that throughput capacity is being exceeded and optimization or scaling is required.
Energy costs or carbon footprint concerns are driving efficiency requirements, where throughput optimization directly reduces environmental impact per useful inference.
Organizational Readiness
ML engineering team with experience in inference optimization, including familiarity with profiling tools, serving frameworks, and common optimization techniques like quantization and batching.
Infrastructure team capable of deploying and managing GPU clusters, including expertise in container orchestration, networking, and resource scheduling for ML workloads.
Monitoring and observability systems that can capture inference metrics at appropriate granularity, including custom metrics for tokens per second, batch utilization, and queue depths.
Performance testing capabilities including load generation tools, benchmark harnesses, and analysis frameworks for conducting rigorous throughput evaluations.
Cross-functional alignment between ML, infrastructure, and product teams on throughput targets, latency constraints, and acceptable quality tradeoffs.
Budget allocation for potential hardware upgrades, serving framework licenses, or cloud instance costs that may be required to achieve throughput targets.
Change management processes that allow for iterative optimization with appropriate testing, rollback capabilities, and gradual rollout of throughput improvements.
When NOT to Use
Anti-Patterns
Optimizing throughput for latency-critical applications like real-time voice assistants or interactive coding tools, where time-to-first-token and inter-token latency directly impact user experience, making throughput gains that increase latency counterproductive.
Pursuing maximum throughput on single-user or low-concurrency systems where batching opportunities are limited and throughput optimization provides minimal benefit while adding complexity.
Sacrificing model quality for throughput gains in applications where output quality directly impacts business value, such as medical diagnosis assistance or legal document analysis where errors have significant consequences.
Over-engineering throughput optimization for prototype or experimental systems that may be significantly redesigned, wasting effort on optimizations that won't transfer to the final architecture.
Focusing on throughput metrics while ignoring tail latency, potentially creating systems that perform well on average but provide unacceptable experience for a significant fraction of users.
Applying generic throughput optimization techniques without understanding workload-specific characteristics, potentially optimizing for the wrong bottleneck or introducing regressions for actual usage patterns.
Measuring throughput on synthetic benchmarks that don't represent production workloads, leading to optimization decisions that don't translate to real-world improvements.
Pursuing throughput optimization before establishing baseline stability, potentially masking underlying reliability issues or introducing optimizations that work only under ideal conditions.
Optimizing throughput for batch sizes that exceed actual concurrent request volumes, achieving theoretical improvements that can never be realized in production due to insufficient load.
Ignoring the cost of throughput optimization itself, such as engineering time, increased system complexity, or ongoing maintenance burden that may exceed the infrastructure cost savings achieved.
Applying throughput optimization uniformly across all request types when different requests have different latency requirements, potentially degrading experience for latency-sensitive requests to improve aggregate throughput.
Chasing throughput improvements through techniques that reduce system reliability or increase failure rates, trading availability for efficiency in ways that harm overall service quality.
Red Flags
Latency SLOs are being violated while throughput targets are being met, indicating that throughput optimization has gone too far at the expense of user experience.
Model quality metrics are degrading as throughput increases, suggesting that aggressive optimization techniques like quantization or speculative decoding are introducing unacceptable quality tradeoffs.
System complexity is increasing faster than throughput improvements, creating maintenance burden and reliability risks that outweigh efficiency gains.
Throughput benchmarks show impressive numbers but production metrics don't improve, indicating that benchmarks don't represent actual workload characteristics.
Team is spending more time on throughput optimization than on core product features, suggesting misallocated priorities or diminishing returns from further optimization.
Throughput optimization requires specialized hardware or software that creates vendor lock-in or limits future flexibility.
Optimization techniques are brittle and break with model updates, requiring constant re-optimization effort that negates efficiency gains.
Memory errors or OOM conditions are increasing as batch sizes are pushed higher, indicating that throughput optimization is approaching unsafe limits.
Request timeout rates are increasing even as throughput metrics improve, suggesting that tail latency is being sacrificed for aggregate throughput.
Cost per token is decreasing but total infrastructure cost is increasing due to over-provisioning for throughput headroom that isn't needed.
Better Alternatives
Real-time conversational AI with strict latency requirements
Latency optimization with streaming and time-to-first-token focus
User experience in conversational applications depends primarily on perceived responsiveness, which is determined by latency rather than throughput. Optimizing for throughput may increase batch sizes in ways that harm the interactive feel of the conversation.
Low-volume, high-value inference tasks like medical diagnosis
Quality optimization with ensemble methods or larger models
When request volume is low and accuracy is paramount, the marginal cost savings from throughput optimization are negligible compared to the value of improved output quality. Resources are better spent on model quality than inference efficiency.
Highly variable workloads with unpredictable spikes
Elastic scaling with serverless inference
Throughput optimization assumes relatively stable load patterns. For highly variable workloads, the ability to rapidly scale capacity up and down provides better cost efficiency than optimizing throughput on fixed infrastructure.
Multi-model pipelines with complex orchestration
End-to-end pipeline latency optimization
In multi-step pipelines, optimizing throughput of individual models may not improve overall pipeline performance if other stages are bottlenecks. End-to-end optimization considering all stages typically yields better results.
Development and experimentation environments
Flexibility and iteration speed optimization
In development contexts, the ability to quickly test different models, prompts, and configurations provides more value than inference efficiency. Throughput optimization adds constraints that slow experimentation.
Edge deployment with limited hardware
Model compression and architecture optimization
On resource-constrained edge devices, fundamental model size and architecture changes provide larger efficiency gains than serving-level throughput optimization, which has limited headroom on small devices.
Applications with highly variable output lengths
Adaptive batching with output length prediction
Standard throughput optimization assumes relatively uniform request characteristics. For highly variable outputs, adaptive strategies that predict output length and batch accordingly outperform static optimization.
Cost-sensitive applications with flexible latency requirements
Spot instance utilization with queue-based processing
When latency is flexible, using cheaper spot instances with queue-based request handling may provide better cost efficiency than throughput optimization on more expensive on-demand instances.
Common Mistakes
Measuring throughput only under optimal conditions with uniform request lengths and perfect batching, leading to unrealistic expectations when deployed with real-world variable workloads.
Ignoring the impact of KV cache memory consumption on achievable batch sizes, leading to throughput projections that assume larger batches than memory constraints allow.
Confusing tokens per second with requests per second, leading to incorrect capacity planning when request lengths vary significantly.
Optimizing for throughput on a single GPU without considering the communication overhead that reduces efficiency in multi-GPU deployments.
Assuming linear throughput scaling with hardware additions, ignoring the sublinear scaling caused by communication overhead, memory bandwidth sharing, and scheduling inefficiencies.
Neglecting the prefill phase impact on decode throughput in systems handling mixed workloads with varying input lengths.
Over-relying on quantization for throughput gains without validating quality impact on the specific use case and model combination.
Implementing aggressive batching without latency bounds, creating systems that achieve high throughput but violate user experience requirements.
Measuring throughput with synthetic prompts that don't trigger the same attention patterns, cache behaviors, or generation characteristics as production requests.
Failing to account for warmup effects, leading to throughput measurements that don't reflect steady-state performance.
Core Taxonomy
Primary Types
Decode Throughput
The rate of token generation during the autoregressive decoding phase, measured in output tokens per second. This is typically the primary throughput metric for generative applications where output length dominates processing time.
Characteristics
- Memory-bandwidth bound due to low arithmetic intensity
- Scales roughly linearly with batch size until memory saturation
- Highly sensitive to KV cache memory consumption
- Benefits significantly from quantization and memory optimization
Tradeoffs
Optimizing decode throughput often requires larger batch sizes that increase per-request latency. Memory optimization techniques like quantization improve decode throughput but may impact output quality.
Classification Dimensions
Measurement Scope
Throughput can be measured at different levels of the infrastructure stack, with each level providing different insights into system performance and bottlenecks.
Workload Type
Different request handling patterns have different throughput characteristics and optimization strategies, requiring workload-specific measurement approaches.
Optimization Target
Throughput optimization can target different objectives, with each target implying different tradeoffs and measurement criteria.
Time Horizon
Throughput varies over time, and different time horizons provide different views of system capacity and performance stability.
Request Priority
In multi-priority systems, throughput may be measured separately for different priority classes to ensure SLO compliance across service tiers.
Model Configuration
Different model optimization techniques produce different throughput characteristics, requiring separate measurement for each configuration.
Evolutionary Stages
Naive Serving
Initial deployment, 0-3 months. Single-request processing without batching, minimal optimization, throughput limited to sequential request handling. Typical of initial deployments and development environments.
Basic Batching
Early production, 3-6 months. Static batching with fixed batch sizes, basic padding strategies, throughput improved but limited by batch formation delays and padding waste.
Continuous Batching
Mature production, 6-12 months. Dynamic batch formation with iteration-level scheduling, minimal padding waste, significantly improved throughput with bounded latency impact.
Advanced Optimization
Optimized production, 12-24 months. Quantization, speculative decoding, KV cache optimization, prefill-decode disaggregation, approaching hardware efficiency limits.
Hardware-Software Co-optimization
Highly optimized systems, 24+ months. Custom kernels, hardware-specific optimizations, inference accelerator utilization, maximum achievable throughput for given hardware.
Architecture Patterns
Single-GPU Serving
The simplest deployment pattern where a single GPU handles all inference requests, suitable for smaller models or low-traffic applications. Throughput is limited by single GPU memory bandwidth and compute capacity.
Components
- Inference server (vLLM, TGI, or similar)
- Request queue
- Model weights in GPU memory
- KV cache manager
Data Flow
Requests arrive at the inference server, are queued for batching, processed through the model on a single GPU, and responses are returned to clients. The KV cache is managed within GPU memory for active sequences.
Best For
- Development and testing environments
- Low-traffic production services
- Models that fit in single GPU memory
- Cost-sensitive deployments
Limitations
- Single point of failure
- Limited throughput ceiling
- Cannot serve models larger than GPU memory
- No horizontal scaling capability
Scaling Characteristics
Vertical scaling only through GPU upgrade. Throughput ceiling determined by GPU memory bandwidth (typically 1-3 TB/s for modern GPUs). Maximum batch size limited by GPU memory capacity.
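A rough way to measure this pattern's throughput ceiling is an offline benchmark that pushes a fixed batch of prompts through the server and counts generated tokens. The sketch below assumes a vLLM-style offline API (LLM, SamplingParams); adapt it to whichever inference server is actually deployed, and treat the model name as a placeholder.

```python
import time
# Assumes vLLM is installed; other servers expose comparable offline entry points.
from vllm import LLM, SamplingParams

def benchmark_decode_throughput(model_name: str, prompts: list[str], max_tokens: int = 200) -> float:
    llm = LLM(model=model_name)                       # loads weights onto the single GPU
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)

    start = time.perf_counter()
    outputs = llm.generate(prompts, params)           # batched generation across all prompts
    elapsed = time.perf_counter() - start

    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated / elapsed                        # aggregate decode tokens per second

# Example (model name is a placeholder):
# tps = benchmark_decode_throughput("meta-llama/Llama-3.1-8B-Instruct", ["Hello"] * 64)
# print(f"{tps:.0f} tokens/s")
```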
Integration Points
Load Balancer
Distributes incoming requests across inference server replicas, performs health checking, and implements routing policies for optimal throughput.
Load balancer must understand request characteristics (input length, expected output length) for intelligent routing. Session affinity may be needed for stateful interactions. Must handle streaming responses appropriately.
Request Queue
Buffers incoming requests for batch formation, implements priority scheduling, and manages backpressure when throughput is exceeded.
Queue depth directly impacts latency. Must implement fair scheduling across priority levels. Backpressure mechanisms prevent queue overflow. Queue persistence may be needed for reliability.
KV Cache Manager
Allocates, manages, and optimizes key-value cache memory for active sequences, implementing strategies like PagedAttention for efficient memory utilization.
KV cache management directly impacts achievable batch size and thus throughput. Must handle variable sequence lengths efficiently. Memory fragmentation can reduce effective capacity over time.
Metrics Collector
Gathers throughput metrics, latency distributions, and resource utilization data for monitoring, alerting, and optimization feedback.
Metrics collection overhead should be minimal to avoid impacting throughput. Must capture both aggregate throughput and per-request latency. Real-time metrics enable dynamic optimization.
Autoscaler
Monitors throughput and latency metrics to automatically adjust replica count, maintaining target throughput while minimizing cost.
Scaling decisions must account for GPU startup time (minutes). Throughput-based scaling must consider latency impact. Predictive scaling can improve responsiveness to load changes.
Model Registry
Stores and versions model artifacts, provides model metadata, and supports model loading for multi-model serving scenarios.
Model loading time impacts throughput during scaling or model switches. Registry must support efficient partial loading for large models. Version management enables safe rollbacks.
Batch Scheduler
Forms optimal batches from queued requests, balancing batch size for throughput against latency constraints and fairness requirements.
Scheduler directly determines the throughput-latency tradeoff. Must handle variable sequence lengths efficiently. Continuous batching requires iteration-level scheduling decisions (a simplified sketch follows these integration points).
Hardware Abstraction Layer
Provides unified interface to underlying GPU/accelerator hardware, enabling portability across different hardware platforms while exposing optimization opportunities.
Abstraction overhead should be minimal. Must expose hardware-specific optimizations where available. Enables multi-hardware deployments and hardware migration.
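The simplified sketch referenced under the batch scheduler above: with iteration-level (continuous) batching, finished sequences free their slots after every decode step and waiting requests are admitted immediately, rather than waiting for the whole batch to drain. All class and method names are hypothetical; production schedulers also track KV cache blocks, preemption, and fairness.

```python
from collections import deque

class ContinuousBatcher:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()   # requests not yet scheduled
        self.running = []        # requests currently in the batch

    def submit(self, request):
        self.waiting.append(request)

    def step(self, decode_fn):
        """One iteration: top up the batch, run one decode step, retire finished requests."""
        # Admit new requests into free slots (iteration-level scheduling).
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        if not self.running:
            return []

        # decode_fn is assumed to generate exactly one token per running request.
        decode_fn(self.running)

        finished = [r for r in self.running if r.is_finished()]
        self.running = [r for r in self.running if not r.is_finished()]
        return finished  # their slots are reused on the very next step
```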
Decision Framework
If latency-constrained, optimize for minimal batch sizes and fast single-request processing. Throughput optimization may be counterproductive.
If throughput-constrained, proceed with throughput optimization strategies including larger batch sizes and continuous batching.
Most production systems have both constraints. Define clear SLOs for both metrics before optimizing. Latency percentiles (p95, p99) often matter more than averages.
Technical Deep Dive
Overview
LLM throughput emerges from the interaction between model architecture, hardware capabilities, and serving system design. At its core, throughput is determined by how efficiently the system can move model weights from memory to compute units and perform the necessary matrix multiplications for token generation. The autoregressive nature of LLM generation creates a fundamental bottleneck: each new token depends on all previous tokens, requiring sequential generation that limits parallelism within a single sequence.
The key insight is that while individual sequences must be generated sequentially, multiple sequences can be processed in parallel through batching. This amortizes the fixed cost of loading model weights across multiple useful computations, dramatically improving throughput. Modern serving systems achieve high throughput through sophisticated batching strategies, memory management techniques, and hardware-aware optimizations that maximize the utilization of available compute and memory bandwidth.
Throughput optimization requires understanding the distinct characteristics of the prefill and decode phases. Prefill processes all input tokens in parallel and is compute-bound, while decode generates tokens one at a time and is memory-bound. These phases have different bottlenecks and optimization strategies, and production systems must balance both for optimal overall throughput.
The practical throughput ceiling is determined by the minimum of compute capacity and memory bandwidth, adjusted for software overhead and utilization efficiency. Modern GPUs have far more compute capacity than memory bandwidth relative to the arithmetic intensity of LLM inference, making memory bandwidth the dominant constraint for most workloads.
Step-by-Step Process
Incoming inference requests arrive at the serving system and are placed in a request queue. Each request contains input tokens (prompt) and generation parameters (max tokens, temperature, etc.). The queue maintains request metadata including arrival time, priority, and estimated resource requirements.
Unbounded queues can lead to memory exhaustion and unpredictable latency. Queue depth must be monitored and backpressure applied when capacity is exceeded. Priority inversion can occur if queue management doesn't account for request characteristics.
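A minimal sketch of the admission-control behavior described above: a bounded queue that rejects new work once a depth limit is reached, so overload surfaces as explicit backpressure rather than unbounded queueing latency. Names and limits are illustrative.

```python
import asyncio

class AdmissionQueue:
    """Bounded request queue: rejects new work when full instead of queueing indefinitely."""

    def __init__(self, max_depth: int = 256):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)

    def try_submit(self, request) -> bool:
        """Returns False when the queue is full; the caller should respond 429/503."""
        try:
            self.queue.put_nowait(request)
            return True
        except asyncio.QueueFull:
            return False  # backpressure: shed load instead of letting latency grow unbounded

    async def next_request(self):
        """Awaited by the batching loop to pull the next request for scheduling."""
        return await self.queue.get()
```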
Under The Hood
At the hardware level, LLM inference throughput is fundamentally constrained by the memory bandwidth of the GPU. Modern GPUs like the NVIDIA H100 provide approximately 3.35 TB/s of HBM3 bandwidth while offering up to 1979 TFLOPS of FP16 compute (with structured sparsity). The arithmetic intensity (FLOPS per byte) required to fully utilize this compute is approximately 590 FLOPS/byte. During autoregressive decoding, however, the arithmetic intensity is only about 1-2 FLOPS/byte because each token generation requires loading the entire model weights while performing relatively few operations per weight.
This memory-bound nature means that throughput scales with batch size because batching amortizes weight loading across multiple sequences. If a 70B parameter model (140GB in FP16) requires loading all weights for each token, a single sequence achieves only about 24 tokens/second (3350 GB/s Ć· 140 GB). With a batch size of 32, the same weight loading produces 32 tokens, achieving approximately 768 tokens/second of aggregate throughput. This linear scaling continues until memory capacity limits batch size or compute becomes saturated.
The KV cache adds another dimension to the memory constraints. For a 70B model with 80 layers, 64 attention heads, and 128 dimensions per head, each token in the KV cache consumes approximately 2.6MB in FP16 (2 Ć 80 Ć 64 Ć 128 Ć 2 bytes for keys and values; models using grouped-query attention cache proportionally less). A sequence of 4096 tokens therefore requires roughly 10.7GB of KV cache. With 80GB of GPU memory per device and 140GB of model weights (requiring tensor parallelism), the memory remaining for KV cache limits the product of batch size and sequence length. This creates a tradeoff between supporting more concurrent sequences (higher batch size) and supporting longer sequences.
Modern serving systems implement sophisticated memory management to maximize throughput within these constraints. PagedAttention allocates KV cache in fixed-size blocks rather than contiguous per-sequence allocations, eliminating fragmentation and enabling near-100% memory utilization. Continuous batching maintains high batch utilization by immediately filling slots vacated by completed sequences. Speculative decoding converts memory-bound sequential decoding into compute-bound parallel verification, effectively increasing arithmetic intensity and throughput.
Quantization provides another path to higher throughput by reducing memory bandwidth requirements. INT8 quantization halves the memory footprint, enabling 2x larger batches or 2x faster weight loading, and INT4 provides a 4x improvement. The throughput gain is roughly proportional to the compression ratio, minus any overhead from dequantization operations. Modern quantization techniques (GPTQ, AWQ, SqueezeLLM) minimize quality degradation while maximizing the throughput benefit.
Multi-GPU inference introduces communication overhead that affects throughput. Tensor parallelism requires all-reduce operations to synchronize activations between layers, consuming interconnect bandwidth. With NVLink (900 GB/s bidirectional), this overhead is manageable, but with PCIe (64 GB/s), it can significantly reduce efficiency. Pipeline parallelism avoids all-reduce but introduces pipeline bubbles that reduce utilization. The optimal parallelism strategy depends on model size, batch size, interconnect bandwidth, and latency requirements.
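The bandwidth arithmetic above generalizes into a simple first-order estimator: in the memory-bound regime, each decode step must stream the (possibly quantized) weights plus the KV cache for the active batch, so aggregate tokens per second is roughly batch size times bandwidth divided by bytes moved per step. The sketch below ignores compute limits, kernel overhead, and imperfect bandwidth utilization, so treat its output as an upper bound.

```python
def decode_throughput_upper_bound(params_billions: float, bytes_per_param: float,
                                  bandwidth_gbs: float, batch_size: int,
                                  kv_bytes_per_token: float = 0.0, avg_context: int = 0) -> float:
    """First-order, bandwidth-only estimate of aggregate decode tokens per second."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    kv_bytes = batch_size * avg_context * kv_bytes_per_token   # KV cache read on each step
    bytes_per_step = weight_bytes + kv_bytes
    steps_per_second = bandwidth_gbs * 1e9 / bytes_per_step
    return steps_per_second * batch_size  # one token per sequence per step

# 70B model in FP16 on ~3350 GB/s of HBM, ignoring KV cache traffic:
print(decode_throughput_upper_bound(70, 2.0, 3350, batch_size=1))    # ~24 tokens/s
print(decode_throughput_upper_bound(70, 2.0, 3350, batch_size=32))   # ~766 tokens/s aggregate
```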
Failure Modes
KV cache growth exceeds available GPU memory due to long sequences, large batch sizes, or memory fragmentation. Can also occur from memory leaks in the serving framework.
- CUDA OOM errors in logs
- Sudden request failures after period of normal operation
- Increasing memory usage over time without corresponding load increase
- Failures correlated with long sequence requests
Complete inference failure for affected requests. May crash entire serving process, affecting all concurrent requests. Recovery requires process restart.
Implement memory budgeting with hard limits on batch size and sequence length. Use PagedAttention for efficient memory utilization. Monitor memory usage and implement proactive eviction. Set conservative memory limits with safety margin.
Implement graceful degradation by rejecting new requests when memory pressure is high. Use request queuing with admission control. Implement automatic process restart with state recovery. Consider KV cache offloading for memory pressure relief.
Operational Considerations
Key Metrics
Total tokens per second
Aggregate token generation rate across all requests, the primary throughput metric. Includes both input (prefill) and output (decode) tokens.
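A minimal instrumentation sketch for this metric using the prometheus_client library; the metric names are illustrative and should follow whatever naming convention existing dashboards use.

```python
from prometheus_client import Counter, start_http_server

PREFILL_TOKENS = Counter("llm_prefill_tokens_total", "Input tokens processed")
DECODE_TOKENS = Counter("llm_decode_tokens_total", "Output tokens generated")

def record_request(prefill_tokens: int, decode_tokens: int) -> None:
    PREFILL_TOKENS.inc(prefill_tokens)
    DECODE_TOKENS.inc(decode_tokens)

# Expose /metrics; aggregate throughput is then a PromQL rate, e.g.
#   rate(llm_decode_tokens_total[1m]) + rate(llm_prefill_tokens_total[1m])
start_http_server(9100)
```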
Alerting Strategy
Implement tiered alerting with different severity levels and response procedures. Use anomaly detection for metrics without fixed thresholds. Correlate alerts across metrics to identify root causes. Implement alert suppression during known maintenance windows. Use runbooks linked to alerts for consistent response. Page on-call for critical alerts affecting user experience. Use Slack/email for warning-level alerts requiring investigation. Aggregate similar alerts to prevent alert fatigue. Track alert-to-resolution time for continuous improvement.
Cost Analysis
Cost Drivers
(10)GPU Instance Hours
Primary cost driver, typically 60-80% of total inference cost. Directly proportional to number of GPUs and runtime hours.
Maximize throughput per GPU to reduce required instance count. Use spot instances for fault-tolerant workloads. Right-size instances based on actual utilization. Implement autoscaling to match capacity to demand.
Memory Capacity Requirements
Determines minimum GPU tier required. Higher memory GPUs cost significantly more (A100 80GB vs 40GB, H100 80GB vs others).
Use quantization to reduce memory requirements. Implement efficient KV cache management. Consider model sharding across cheaper GPUs. Optimize batch sizes to maximize memory utilization.
Network Egress
Can be significant for high-volume streaming responses. Cloud providers charge for outbound data transfer.
Compress responses where possible. Use regional deployments to reduce cross-region transfer. Implement response caching for repeated queries. Consider CDN for static or cacheable content.
Storage for Model Artifacts
Model weights require significant storage. Multiple model versions multiply storage costs.
Use efficient model formats (safetensors). Implement model deduplication for shared layers. Use tiered storage (hot/cold) for model versions. Clean up unused model versions.
Monitoring and Logging
High-volume metrics and logs can incur significant observability costs. Often overlooked in initial estimates.
Sample metrics appropriately. Aggregate logs before shipping. Use retention policies to limit storage. Consider self-hosted observability for large scale.
Load Balancing and Networking
Load balancer costs scale with request volume. Private networking may incur additional charges.
Use efficient load balancing tiers. Batch requests where possible to reduce per-request overhead. Optimize health check frequency.
Engineering Time for Optimization
Throughput optimization requires significant engineering investment. Ongoing maintenance adds to total cost.
Focus on high-impact optimizations first. Use managed services where build cost exceeds buy cost. Automate routine optimization tasks.
Redundancy and Fault Tolerance
High availability requires redundant capacity, increasing costs. N+1 or N+2 redundancy multiplies base cost.
Right-size redundancy based on actual availability requirements. Use spot instances for redundant capacity. Implement graceful degradation to reduce redundancy needs.
Development and Staging Environments
Non-production environments consume resources for testing and development. Often over-provisioned.
Use smaller instances for dev/staging. Implement environment scheduling (shut down overnight). Share resources across teams where possible.
Data Transfer for Distributed Inference
Multi-GPU and multi-node inference incurs internal data transfer costs. Can be significant for large-scale deployments.
Minimize cross-node communication. Use high-bandwidth interconnects to reduce transfer time. Optimize parallelism strategy to reduce communication.
Cost Models
Cost per Million Tokens
(GPU_hourly_cost Ć hours) / (throughput_tokens_per_second Ć 3600 Ć hours / 1,000,000)
Example: an H100 at $3/hour achieving 5000 tokens/second: $3 / (5000 Ć 3600 / 1,000,000) = $0.17 per million tokens
Cost per Request
cost_per_million_tokens Ć (avg_input_tokens + avg_output_tokens) / 1,000,000
Example: at $0.17 per million tokens with 500 input + 200 output tokens: $0.17 Ć 700 / 1,000,000 = $0.000119 per request
Total Cost of Ownership (Monthly)
GPU_cost + storage_cost + network_cost + monitoring_cost + engineering_cost + overhead
Example: 10 H100s at $3/hr for 720 hours = $21,600 GPU + $500 storage + $1,000 network + $500 monitoring + $5,000 engineering = $28,600/month
Break-even Analysis for Optimization
optimization_cost / (baseline_monthly_cost - optimized_monthly_cost) = months_to_break_even
Example: a $50,000 optimization effort saving $10,000/month: 50,000 / 10,000 = 5 months to break even
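The cost models above combine into a small calculator; the dollar figures and throughput numbers are the illustrative values from the examples, not quoted prices.

```python
def cost_per_million_tokens(gpu_hourly_cost: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost / (tokens_per_hour / 1_000_000)

def cost_per_request(cost_per_m: float, input_tokens: int, output_tokens: int) -> float:
    return cost_per_m * (input_tokens + output_tokens) / 1_000_000

def months_to_break_even(optimization_cost: float, monthly_savings: float) -> float:
    return optimization_cost / monthly_savings

cpm = cost_per_million_tokens(gpu_hourly_cost=3.0, tokens_per_second=5000)   # ~$0.167 per million tokens
print(round(cpm, 3), round(cost_per_request(cpm, 500, 200), 6))              # ~0.167  ~0.000117
print(months_to_break_even(50_000, 10_000))                                  # 5.0 months
```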
Optimization Strategies
1. Implement quantization (INT8/INT4) to increase throughput per GPU, reducing required instance count
2. Use spot/preemptible instances for fault-tolerant batch workloads, achieving 60-90% cost reduction
3. Right-size GPU instances based on actual memory and compute requirements rather than over-provisioning
4. Implement autoscaling to match capacity to demand, avoiding paying for idle resources
5. Use continuous batching to maximize throughput and GPU utilization
6. Consider smaller, more efficient models for latency-tolerant use cases
7. Implement request caching for repeated or similar queries
8. Use reserved instances or committed use discounts for baseline capacity
9. Consolidate low-utilization workloads onto shared infrastructure
10. Optimize model loading and warmup to reduce startup costs during scaling
11. Implement tiered serving with different cost-performance tradeoffs for different request types
12. Use speculative decoding to improve throughput without additional hardware
Hidden Costs
- GPU idle time during low-traffic periods if autoscaling is not implemented or is too slow
- Over-provisioning for peak capacity that is rarely utilized
- Engineering time spent on debugging and optimization that could be spent on features
- Opportunity cost of using expensive GPUs for workloads that could run on cheaper hardware
- Cost of failed experiments and optimization attempts that don't pan out
- Compliance and security overhead for handling sensitive data in inference
- Cost of incidents and outages including engineering time and potential SLA penalties
- Training and onboarding costs for specialized ML infrastructure skills
ROI Considerations
Return on investment for throughput optimization depends heavily on scale. At small scale (a few GPUs, low traffic), the engineering investment in optimization may exceed the infrastructure savings. The break-even point typically occurs when monthly infrastructure costs exceed $10,000-$50,000, at which point a 20-30% efficiency improvement justifies dedicated optimization effort.
Throughput optimization ROI should be calculated considering both direct cost savings and indirect benefits. Direct savings come from reduced instance counts and better utilization. Indirect benefits include improved user experience (if latency also improves), the ability to serve more users without scaling, and reduced environmental impact.
The ROI calculation should also account for the ongoing maintenance cost of optimizations. Complex optimizations like speculative decoding or custom kernels require ongoing maintenance as models and frameworks evolve. Simpler optimizations like quantization or continuous batching have a lower maintenance burden and thus better long-term ROI.
Finally, consider the opportunity cost of engineering time. If throughput optimization delays product features that would generate revenue, the net ROI may be negative even if infrastructure costs decrease. Prioritize optimization efforts based on overall business impact, not just infrastructure efficiency.
Security Considerations
Threat Model
(10 threats)Denial of Service through Resource Exhaustion
Attacker sends requests designed to consume maximum resources (long inputs, long outputs, many concurrent requests) to exhaust GPU memory or compute capacity.
Service unavailability for legitimate users. Potential cost explosion if autoscaling responds to attack traffic.
Implement rate limiting per user/API key. Set maximum input/output length limits. Use admission control based on resource availability. Implement cost-based quotas.
Prompt Injection Affecting Throughput
Malicious prompts designed to cause extremely long outputs, infinite loops, or resource-intensive generation patterns.
Reduced throughput for other users. Potential resource exhaustion. Unexpected cost increases.
Enforce strict output length limits. Implement generation timeout. Monitor for anomalous generation patterns. Use output validation.
Side-Channel Information Leakage
Attacker infers information about other users' requests through timing analysis, resource utilization patterns, or batching behavior.
Privacy violation. Potential exposure of sensitive information from other users' prompts or outputs.
Implement request isolation where required. Add timing noise to responses. Use separate infrastructure for sensitive workloads. Avoid batching requests from different security contexts.
Model Extraction through High-Volume Queries
Attacker sends large volumes of carefully crafted queries to extract model behavior, effectively stealing the model.
Intellectual property theft. Competitive disadvantage. Potential for creating adversarial attacks.
Implement query rate limits. Monitor for extraction patterns. Use watermarking techniques. Limit API access to authorized users.
Data Exfiltration through Model Outputs
If model has been trained on or has access to sensitive data, attacker crafts prompts to extract this information.
Data breach. Compliance violations. Reputational damage.
Implement output filtering for sensitive patterns. Use data loss prevention (DLP) tools. Audit model training data. Implement access controls on sensitive capabilities.
Adversarial Inputs Causing Model Misbehavior
Specially crafted inputs cause model to produce harmful, biased, or incorrect outputs at scale.
Reputational damage. Potential legal liability. User harm.
Implement input validation and sanitization. Use output filtering and moderation. Monitor for adversarial patterns. Implement human review for high-risk outputs.
Infrastructure Compromise
Attacker gains access to inference infrastructure through vulnerabilities in serving software, containers, or cloud configuration.
Data breach. Service disruption. Potential for persistent access and ongoing attacks.
Keep software updated. Use minimal container images. Implement network segmentation. Use secrets management. Regular security audits.
Supply Chain Attacks on Model Artifacts
Compromised model weights or serving software introduced through supply chain, containing backdoors or malicious behavior.
Compromised outputs. Potential data exfiltration. Difficult to detect.
Verify model artifact integrity. Use trusted sources. Implement artifact signing. Scan for known vulnerabilities. Audit model behavior.
Insider Threat
Malicious insider with access to inference infrastructure exfiltrates data, modifies models, or disrupts service.
Data breach. Service disruption. Difficult to detect and attribute.
Implement least-privilege access. Audit all access and changes. Use multi-person approval for sensitive operations. Monitor for anomalous behavior.
API Key Compromise
Stolen or leaked API keys used to access inference service, potentially at scale.
Unauthorized usage. Cost explosion. Potential data access.
Implement key rotation. Use short-lived tokens. Monitor for anomalous usage patterns. Implement IP allowlisting where possible.
Security Best Practices
- Implement authentication and authorization for all API endpoints
- Use TLS for all communications, including internal service-to-service
- Implement rate limiting at multiple levels (global, per-user, per-endpoint)
- Set and enforce maximum input and output length limits
- Implement request timeouts to prevent resource exhaustion
- Use network segmentation to isolate inference infrastructure
- Implement audit logging for all requests and administrative actions
- Use secrets management for API keys, credentials, and certificates
- Regularly update and patch all software components
- Implement input validation and sanitization
- Use output filtering for sensitive content
- Implement monitoring and alerting for security events
- Conduct regular security assessments and penetration testing
- Implement incident response procedures for security events
- Use minimal container images and runtime privileges
Data Protection
- Encrypt all data at rest including model weights, KV cache (if persisted), and logs
- Use TLS 1.3 for all data in transit
- Implement data classification to identify sensitive content in prompts and outputs
- Use data loss prevention (DLP) tools to detect sensitive data in outputs
- Implement retention policies for inference logs with appropriate deletion
- Use anonymization or pseudonymization for logs where possible
- Implement access controls limiting who can view inference logs
- Consider on-premises or private cloud deployment for highly sensitive workloads
- Implement data residency controls for geographic compliance requirements
- Use secure deletion procedures for decommissioned storage
Compliance Implications
GDPR (General Data Protection Regulation)
Personal data processing must be lawful, fair, and transparent. Data subjects have rights to access, rectification, and erasure.
Implement data minimization in prompts. Provide mechanisms for data subject requests. Document lawful basis for processing. Implement data retention policies.
CCPA (California Consumer Privacy Act)
Consumers have right to know what personal information is collected and to request deletion.
Implement data inventory for inference logs. Provide opt-out mechanisms. Honor deletion requests including inference logs.
HIPAA (Health Insurance Portability and Accountability Act)
Protected health information (PHI) must be safeguarded with appropriate administrative, physical, and technical safeguards.
Use dedicated infrastructure for PHI workloads. Implement encryption at rest and in transit. Audit all access. Sign BAAs with cloud providers.
SOC 2 (Service Organization Control 2)
Demonstrate controls for security, availability, processing integrity, confidentiality, and privacy.
Document and implement controls. Maintain audit trails. Conduct regular assessments. Implement change management procedures.
PCI DSS (Payment Card Industry Data Security Standard)
Protect cardholder data with specific technical and operational requirements.
Isolate systems processing payment data. Implement strong access controls. Encrypt cardholder data. Regular vulnerability scanning.
AI Act (EU Artificial Intelligence Act)
High-risk AI systems must meet requirements for transparency, human oversight, and risk management.
Document AI system capabilities and limitations. Implement human oversight mechanisms. Conduct risk assessments. Maintain technical documentation.
FedRAMP (Federal Risk and Authorization Management Program)
Cloud services for federal agencies must meet standardized security requirements.
Use FedRAMP-authorized cloud infrastructure. Implement required security controls. Maintain continuous monitoring. Document system security plan.
ISO 27001
Implement information security management system (ISMS) with appropriate controls.
Establish ISMS policies and procedures. Implement risk assessment process. Conduct internal audits. Maintain continuous improvement.
Scaling Guide
Scaling Dimensions
Horizontal Scaling (Replicas)
Add more independent model replicas behind a load balancer. Each replica handles a portion of traffic. Throughput scales linearly with replica count.
Limited by load balancer capacity, network bandwidth, and cost. Requires model to fit in single replica's memory (or use model parallelism within replica).
Simplest scaling approach. No inter-replica communication overhead. Requires effective load balancing. Cold start time affects scaling responsiveness.
Vertical Scaling (GPU Upgrade)
Use more powerful GPUs with higher memory bandwidth and capacity. Enables larger batch sizes and faster per-token generation.
Limited by available GPU tiers. Diminishing returns as memory bandwidth scales slower than cost. Single point of failure.
May require code changes for new GPU architectures. Cost per GPU increases faster than performance. Consider total cost of ownership.
Tensor Parallelism Scaling
Shard model across more GPUs within a node to increase aggregate memory bandwidth and enable larger models.
Limited by interconnect bandwidth (NVLink). Communication overhead increases with parallelism degree. Typically limited to 8 GPUs per node.
Requires high-bandwidth interconnects for efficiency. All GPUs must participate in each request. Best for latency-sensitive workloads.
Pipeline Parallelism Scaling
Distribute model layers across multiple nodes, enabling very large models and high throughput for large batches.
Pipeline bubbles reduce efficiency. Latency increases with pipeline depth. Requires careful micro-batch sizing.
Best for throughput-focused batch processing. Can cross node boundaries. Requires sophisticated scheduling.
Batch Size Scaling
Increase batch size to improve throughput by amortizing weight loading across more sequences.
Limited by GPU memory capacity. Increases per-request latency. Diminishing returns as compute becomes saturated.
Most cost-effective scaling for memory-bound workloads. Must balance against latency requirements. Requires sufficient request volume.
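To see how GPU memory caps batch size in practice, here is a back-of-the-envelope sketch of KV-cache growth. The model dimensions approximate a 7B Llama-style model and the 80 GB budget assumes an A100/H100-class GPU; treat all numbers as illustrative assumptions rather than measurements.

```python
# A minimal sketch of how GPU memory bounds batch size via the KV cache.
# Dimensions approximate a Llama-2-7B-class model (32 layers, 32 KV heads,
# head_dim 128, FP16 weights and cache); all numbers are illustrative.

GIB = 1024 ** 3

layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2  # FP16
weight_bytes = 7e9 * 2                    # ~7B parameters in FP16
gpu_memory = 80 * GIB
overhead = 0.10 * gpu_memory              # activations, fragmentation, workspace

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
kv_budget = gpu_memory - weight_bytes - overhead

for seq_len in (1024, 2048, 4096, 8192):
    per_sequence = kv_bytes_per_token * seq_len
    max_batch = int(kv_budget // per_sequence)
    print(f"seq_len {seq_len:>5}: ~{per_sequence / GIB:.2f} GiB KV per sequence "
          f"-> max batch ~{max_batch}")
```

Under these assumptions a 2,048-token sequence costs roughly 1 GiB of KV cache, which is why batch size (and therefore throughput) falls quickly as context length grows.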
Geographic Scaling
Deploy replicas in multiple regions to reduce latency for global users and provide geographic redundancy.
Increases operational complexity. May require data residency considerations. Cost multiplier for multi-region deployment.
Essential for global services. Provides disaster recovery. Consider traffic routing and failover strategies.
Model Efficiency Scaling
Use quantization, distillation, or smaller models to increase throughput per GPU without adding hardware.
Quality degradation at aggressive optimization levels. Diminishing returns from further compression. May not apply to all use cases.
Most cost-effective scaling approach when applicable. Validate quality impact. Consider multiple model tiers for different use cases.
Speculative Decoding Scaling
Implement speculative decoding to increase effective throughput without additional hardware for suitable workloads.
Effectiveness depends on workload predictability. Requires additional memory for draft model. Not all workloads benefit.
Can provide 2-3x improvement for suitable workloads. Requires draft model selection and tuning. Monitor acceptance rates.
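As a rough model of why acceptance rate matters, the sketch below applies the expected-tokens-per-verification formula from Leviathan et al. (2023), E[tokens] = (1 - alpha^(k+1)) / (1 - alpha), where alpha is the per-token acceptance rate and k the draft length. The draft-model cost ratio is an assumed value, not a benchmark.

```python
# Approximate speculative-decoding speedup as a function of acceptance rate.
# The draft_cost_ratio (draft step time relative to a target step) is assumed.

def expected_speedup(alpha: float, k: int, draft_cost_ratio: float = 0.1) -> float:
    """Approximate speedup vs. plain autoregressive decoding."""
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # tokens per cycle
    cycle_cost = 1 + k * draft_cost_ratio                   # 1 verification + k draft steps
    return expected_tokens / cycle_cost

for alpha in (0.6, 0.8, 0.9):
    print(f"acceptance {alpha:.0%}: "
          f"~{expected_speedup(alpha, k=4):.1f}x speedup with 4 draft tokens")
```

With these assumptions, acceptance rates around 80-90% yield roughly 2.4-2.9x, which is consistent with the 2-3x range quoted above; at 60% acceptance the gain shrinks toward 1.6x, which is why monitoring acceptance rates matters.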
Capacity Planning
Required replicas = (peak_requests_per_second × (avg_input_tokens + avg_output_tokens)) / (throughput_per_replica × target_utilization) × redundancy_factor
Plan for 30-50% headroom above expected peak load. Account for replica failures (N+1 or N+2 redundancy). Consider scaling lag time (GPU startup takes minutes, not seconds). Include a buffer for unexpected traffic spikes.
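A worked example of the formula above, using assumed workload numbers for a hypothetical deployment:

```python
# Worked capacity-planning example; all workload values are illustrative.
import math

peak_rps = 50                   # peak requests per second
avg_input_tokens = 500
avg_output_tokens = 300
throughput_per_replica = 3000   # sustained tokens/second per replica
target_utilization = 0.7        # leave ~30% headroom
redundancy_factor = 1.25        # roughly N+2 at this fleet size

peak_token_rate = peak_rps * (avg_input_tokens + avg_output_tokens)
replicas = math.ceil(
    peak_token_rate / (throughput_per_replica * target_utilization)
    * redundancy_factor
)
print(f"Peak load: {peak_token_rate:,} tokens/s -> {replicas} replicas")
# Peak load: 40,000 tokens/s -> 24 replicas
```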
Scaling Milestones
Milestone 1: Initial deployment
- Establishing baseline performance
- Setting up monitoring
- Validating model quality
Approach: Single GPU deployment is sufficient. Focus on correctness over optimization. Use managed services if available.
Milestone 2: Early production
- Implementing continuous batching
- Setting up autoscaling
- Establishing SLOs
Approach: Multiple replicas with load balancing. Implement basic monitoring and alerting. Consider quantization for cost efficiency.
Milestone 3: Growth
- Optimizing throughput per GPU
- Managing costs at scale
- Handling traffic variability
Approach: Sophisticated autoscaling. Advanced batching optimization. Consider multi-region deployment. Implement cost monitoring.
Milestone 4: Large scale
- Load balancer capacity
- Operational complexity
- Cost optimization at scale
Approach: Multiple load balancer tiers. Dedicated infrastructure teams. Advanced optimization (speculative decoding, custom kernels). Multi-region with traffic management.
Milestone 5: Hyperscale
- Custom infrastructure requirements
- Extreme cost sensitivity
- Global availability
Approach: Custom hardware and software optimization. Dedicated capacity planning team. Sophisticated traffic management. Consider custom silicon (TPUs, custom ASICs).
Benchmarks
Industry Benchmarks
| Metric | P50 | P95 | P99 | World Class |
|---|---|---|---|---|
| Decode throughput (7B model, single A100 80GB) | 2500 tokens/second | 3500 tokens/second | 4000 tokens/second | 5000+ tokens/second |
| Decode throughput (70B model, 8x A100 80GB tensor parallel) | 800 tokens/second | 1200 tokens/second | 1500 tokens/second | 2000+ tokens/second |
| Decode throughput (7B model, single H100 80GB) | 4000 tokens/second | 5500 tokens/second | 6500 tokens/second | 8000+ tokens/second |
| Prefill throughput (7B model, single A100 80GB) | 15000 tokens/second | 25000 tokens/second | 35000 tokens/second | 50000+ tokens/second |
| Batch utilization (continuous batching) | 70% | 85% | 92% | 95%+ |
| GPU memory utilization | 75% | 88% | 93% | 95%+ |
| Cost per million tokens (7B model, cloud GPU) | $0.30 | $0.15 | $0.10 | $0.05 |
| Throughput improvement from INT8 quantization | 1.5x | 1.8x | 1.95x | 2x |
| Throughput improvement from INT4 quantization | 2.5x | 3.2x | 3.7x | 4x |
| Speculative decoding speedup (suitable workloads) | 1.8x | 2.5x | 2.8x | 3x+ |
| Continuous batching improvement over static batching | 2x | 4x | 6x | 10x |
| Multi-GPU scaling efficiency (8 GPUs, tensor parallel) | 65% | 78% | 85% | 90%+ |
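The cost-per-million-tokens row follows directly from GPU price and sustained throughput. The sketch below shows the arithmetic with assumed hourly prices, which vary widely by provider and commitment level.

```python
# Cost per million tokens from GPU hourly price and sustained throughput.
# Prices below are illustrative assumptions, not quotes.

def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

scenarios = [
    ("A100, baseline serving", 2.50, 2500),
    ("A100, optimized serving", 2.50, 5000),
    ("H100, optimized serving", 4.00, 8000),
]
for name, hourly_price, tps in scenarios:
    print(f"{name}: ${cost_per_million_tokens(hourly_price, tps):.3f} per million tokens")
```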
Comparison Matrix
| Serving Framework | Continuous Batching | PagedAttention | Speculative Decoding | Quantization Support | Multi-GPU | Relative Throughput |
|---|---|---|---|---|---|---|
| vLLM | Yes | Yes | Yes | INT8, INT4, AWQ, GPTQ | Tensor, Pipeline | 1.0 (baseline) |
| TGI (Text Generation Inference) | Yes | Yes | Yes | INT8, INT4, GPTQ, AWQ | Tensor | 0.9-1.0 |
| TensorRT-LLM | Yes | Yes | Yes | INT8, INT4, FP8 | Tensor, Pipeline | 1.1-1.3 |
| llama.cpp | Limited | No | Yes | GGUF (various) | Limited | 0.5-0.8 |
| Triton Inference Server | Yes (with backend) | Backend dependent | Backend dependent | Backend dependent | Yes | 0.9-1.1 |
| DeepSpeed-MII | Yes | Yes | No | INT8 | Tensor, Pipeline | 0.8-1.0 |
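As a point of reference for the baseline row, a minimal vLLM sketch for measuring offline batch throughput might look like the following. Continuous batching and PagedAttention are handled internally by the engine; the model name and prompt set are placeholders, and the example assumes vLLM is installed on a GPU large enough to hold the model.

```python
# Minimal offline-throughput measurement with vLLM (placeholder model name).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")          # placeholder model
sampling = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Summarize the benefits of continuous batching."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)            # engine batches internally
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tokens/second across the batch")
```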
Performance Tiers
| Configuration | Throughput (7B model) | Throughput (70B model) |
|---|---|---|
| Single GPU, static batching, no optimization. Suitable for development and low-traffic production. | 500-1500 tokens/second | 50-150 tokens/second |
| Continuous batching, basic memory optimization. Standard production deployment. | 2000-3500 tokens/second | 300-600 tokens/second |
| Quantization, PagedAttention, tuned batch sizes. Cost-optimized production. | 4000-6000 tokens/second | 600-1000 tokens/second |
| Speculative decoding, custom kernels, hardware-specific optimization. Maximum efficiency. | 6000-10000 tokens/second | 1000-2000 tokens/second |
| Custom hardware, proprietary optimizations, dedicated infrastructure. Hyperscale deployments. | 10000+ tokens/second | 2000+ tokens/second |
Real World Examples
Real-World Scenarios
High-Volume Customer Service Chatbot
E-commerce company deploying LLM-powered customer service handling 10,000+ concurrent conversations during peak shopping periods. Latency requirements moderate (2-3 second response acceptable), cost sensitivity high.
Deployed 7B parameter model with INT8 quantization on replicated A100 instances. Implemented continuous batching with vLLM. Used autoscaling based on queue depth. Implemented request prioritization for VIP customers.
Achieved 4,500 tokens/second per GPU, supporting 15,000 concurrent conversations on 8 GPUs. Cost per conversation reduced 60% compared to API-based solution. P95 latency maintained under 2 seconds.
- Quantization provided 1.8x throughput improvement with negligible quality impact for this use case
- Continuous batching was essential for handling variable conversation lengths
- Autoscaling lag (3-5 minutes for GPU startup) required maintaining 30% headroom
- Request prioritization prevented VIP customer impact during traffic spikes
Code Completion IDE Integration
Developer tools company integrating LLM code completion into IDE. Strict latency requirements (< 200ms for suggestions), high concurrency (100,000+ developers), predictable output patterns.
Deployed 13B code-specialized model with speculative decoding using 1B draft model. Used tensor parallelism across 4 GPUs per replica. Implemented aggressive caching for common completions. Geographic distribution across 5 regions.
Achieved 3x throughput improvement from speculative decoding due to predictable code patterns. P95 latency under 150ms. Supported 150,000 concurrent developers across global deployment.
- Speculative decoding exceptionally effective for code completion (85% acceptance rate)
- Geographic distribution essential for latency requirements
- Caching provided significant cost reduction for common patterns
- Draft model quality critical: invested in code-specific draft model training
Document Processing Pipeline
Legal technology company processing millions of documents for contract analysis. No real-time latency requirements, extreme cost sensitivity, variable document lengths (1-100 pages).
Deployed 70B model for quality on spot instances with pipeline parallelism. Implemented offline batch processing with large batch sizes. Used INT4 quantization validated for legal domain. Implemented checkpointing for spot instance interruptions.
Achieved cost of $0.08 per million tokens using spot instances (70% discount). Processed 10 million documents monthly. Quality maintained at 95% accuracy for contract clause extraction.
- Spot instances viable for batch processing with proper checkpointing
- INT4 quantization required careful validation for legal accuracy requirements
- Pipeline parallelism enabled 70B model deployment cost-effectively
- Variable document lengths required sophisticated batching to avoid padding waste
Real-Time Content Moderation
Social media platform moderating user-generated content in real-time. Strict latency requirements (< 500ms), extremely high volume (1M+ posts per hour), high availability requirements (99.99%).
Deployed ensemble of 7B models specialized for different content types. Used replicated serving with geographic distribution. Implemented tiered moderation (fast model for initial screen, larger model for edge cases). Aggressive autoscaling with predictive scaling based on traffic patterns.
Achieved 99.99% availability with sub-300ms P95 latency. Processed 1.5M posts per hour across 50 GPU instances. False positive rate maintained under 0.1%.
- Tiered approach essential for balancing throughput and accuracy
- Predictive autoscaling critical for handling traffic spikes
- Geographic distribution required for latency at global scale
- Specialized models outperformed single general model for this use case
RAG-Powered Enterprise Search
Enterprise deploying RAG system for internal knowledge base search. Long contexts (retrieved documents), moderate latency requirements (3-5 seconds), variable query complexity.
Deployed 13B model with prefill-decode disaggregation to handle long contexts efficiently. Implemented chunked prefill to prevent decode blocking. Used KV cache offloading for session persistence. Implemented semantic caching for repeated queries.
Achieved 2x throughput improvement from disaggregation for long-context workload. Supported 10,000 concurrent users. Average response time 2.5 seconds including retrieval.
- Prefill-decode disaggregation essential for long-context RAG workloads
- Chunked prefill prevented latency spikes for existing conversations
- Semantic caching provided 40% hit rate, significantly reducing compute
- KV cache offloading enabled cost-effective session persistence
Multi-Model AI Platform
AI platform provider serving multiple customer models on shared infrastructure. Diverse model sizes (7B-70B), variable traffic patterns per model, cost optimization critical.
Implemented multi-model serving with dynamic model loading. Used model popularity tracking to keep hot models loaded. Implemented fair scheduling across tenants. Used spot instances for cold model serving.
Achieved 3x better GPU utilization compared to dedicated per-model serving. Supported 50+ models on 20 GPU instances. Cold start latency under 30 seconds for model loading.
- Model popularity follows a power law: 20% of models handle 80% of traffic
- Dynamic loading essential for cost efficiency with many models
- Fair scheduling critical for multi-tenant SLO compliance
- Model loading optimization (parallel loading, caching) critical for cold start
Streaming Voice Assistant
Consumer electronics company deploying voice assistant with LLM backend. Strict time-to-first-token requirements (< 300ms), streaming output for natural speech synthesis, global deployment.
Deployed 7B model optimized for TTFT with minimal batching. Implemented streaming output with word-level chunking. Used edge caching for common responses. Geographic distribution with latency-based routing.
Achieved P95 TTFT under 250ms. Supported 500,000 daily active users. Natural conversation flow maintained with streaming.
- TTFT optimization required sacrificing some throughput efficiency
- Streaming chunking strategy affected perceived naturalness
- Edge caching effective for common queries (weather, time, etc.)
- Latency-based routing essential for global user experience
Financial Document Analysis
Investment firm analyzing earnings reports and financial filings. High accuracy requirements, moderate volume (thousands of documents daily), strict data security requirements.
Deployed 70B model on-premises for data security. Used full precision (no quantization) for accuracy. Implemented batch processing during market close. Extensive output validation and human review integration.
Achieved 98% accuracy on financial metric extraction. Processed 5,000 documents daily on 16 GPU cluster. Maintained complete data isolation for compliance.
- On-premises deployment required for financial data compliance
- Full precision necessary for numerical accuracy in financial context
- Batch processing during off-hours maximized resource utilization
- Human-in-the-loop essential for high-stakes financial decisions
Industry Applications
Healthcare
Clinical documentation and medical coding assistance
HIPAA compliance requires dedicated infrastructure and audit logging. Medical terminology requires specialized models or fine-tuning. High accuracy requirements limit aggressive optimization. Integration with EHR systems adds latency constraints.
Financial Services
Trading signal analysis and risk assessment
Extreme latency sensitivity for trading applications. Regulatory requirements (SOX, FINRA) affect deployment options. Numerical accuracy critical - quantization must be validated. Audit trail requirements for all inferences.
E-commerce
Product description generation and customer service
High seasonality requires elastic scaling. Cost sensitivity drives optimization focus. Multi-language support increases model complexity. Integration with product catalogs and inventory systems.
Media and Entertainment
Content generation and personalization
Creative quality often more important than throughput. Long-form content generation requires extended context. Copyright and originality concerns affect output validation. Real-time personalization requires low latency.
Legal
Contract analysis and legal research
Accuracy requirements limit optimization aggressiveness. Long document contexts require memory optimization. Confidentiality requirements may require on-premises deployment. Citation accuracy critical for research applications.
Manufacturing
Technical documentation and maintenance assistance
Domain-specific terminology requires specialized models. Integration with IoT and sensor data. Offline capability may be required for factory floor. Safety-critical applications require extensive validation.
Education
Tutoring systems and content generation
Accessibility requirements affect response format. Age-appropriate content filtering required. Scalability for peak usage (exam periods). Integration with learning management systems.
Government
Citizen services and document processing
FedRAMP and other compliance requirements. Accessibility (Section 508) requirements. Multi-language support for diverse populations. Transparency and explainability requirements.
Telecommunications
Customer support and network operations
24/7 availability requirements. Integration with network management systems. High volume during outages requires burst capacity. Multi-channel support (voice, chat, email).
Gaming
NPC dialogue and dynamic content generation
Real-time latency requirements for interactive experiences. Creative and entertaining outputs valued over accuracy. Integration with game engines. Scaling for game launch events.
Frequently Asked Questions
Fundamentals
How does throughput differ from latency?
Throughput measures the total number of tokens generated per unit time across all requests (tokens/second), while latency measures the time for individual requests (time-to-first-token, inter-token latency, or total response time). A system can have high throughput but poor latency if it batches many requests together, causing individual requests to wait. Conversely, a system optimized for latency may have lower throughput due to smaller batch sizes. Production systems must balance both metrics based on application requirements.
Glossary
Arithmetic Intensity
The ratio of compute operations to memory operations (FLOPS per byte), determining whether a workload is compute-bound or memory-bound.
Context: LLM decoding has low arithmetic intensity (~1-2 FLOPS/byte), making it memory-bound on modern GPUs.
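A quick way to see why decode sits below the GPU's balance point is to compare its arithmetic intensity with the hardware ridge point. The sketch below uses nominal A100 figures and the common 2-FLOPs-per-parameter approximation, and it ignores KV-cache traffic, so the numbers are illustrative.

```python
# Back-of-the-envelope arithmetic-intensity check for decode.
# Assumes ~2 FLOPs per parameter per generated token and one full
# weight read per decode step; KV-cache traffic is ignored.

params = 7e9
bytes_per_param = 2          # FP16 weights
peak_flops = 312e12          # nominal A100 dense FP16/BF16 peak
hbm_bandwidth = 2.0e12       # ~2 TB/s, A100 80GB HBM2e

ridge_point = peak_flops / hbm_bandwidth   # FLOPs/byte where the GPU balances

for batch in (1, 8, 64, 256):
    flops = 2 * params * batch               # work for one decode step
    bytes_moved = params * bytes_per_param   # weights streamed once per step
    intensity = flops / bytes_moved
    bound = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"batch {batch:>3}: {intensity:6.1f} FLOPs/byte "
          f"(ridge ~{ridge_point:.0f}) -> {bound}")
```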
Autoscaling
Automatically adjusting the number of inference replicas based on load metrics.
Context: Essential for handling variable traffic while optimizing cost. Limited by GPU startup time (minutes).
Batch Size
The number of sequences processed simultaneously in a single forward pass.
Context: Larger batch sizes improve throughput but increase latency. Limited by GPU memory capacity.
Batch Utilization
The ratio of actual batch size to maximum possible batch size, indicating how efficiently batching is being used.
Context: Low batch utilization indicates potential for throughput improvement through better batching or increased load.
Chunked Prefill
Breaking long prefill operations into smaller chunks interleaved with decode iterations.
Context: Prevents prefill from blocking decode progress, improving latency consistency for mixed workloads.
Continuous Batching
A serving technique that allows requests to enter and exit batches at each decode iteration, rather than waiting for all sequences in a batch to complete.
Context: Essential for production throughput, providing 2-10x improvement over static batching for variable-length workloads.
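The gain comes from refilling freed batch slots immediately instead of waiting for the longest sequence in a batch to finish. A toy simulation with made-up request lengths illustrates the effect; it is a sketch of the scheduling idea, not of any particular serving framework.

```python
# Toy comparison of static vs. continuous batching for variable-length
# decode workloads. Request lengths and batch size are made-up parameters.
import random
from collections import deque

random.seed(0)
MAX_BATCH = 8
lengths = [random.randint(10, 200) for _ in range(64)]   # output tokens per request

def static_batching(lengths):
    """Each batch of MAX_BATCH requests runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), MAX_BATCH):
        steps += max(lengths[i:i + MAX_BATCH])
    return steps

def continuous_batching(lengths):
    """Finished sequences are replaced immediately from the waiting queue."""
    queue = deque(lengths)
    active = [queue.popleft() for _ in range(min(MAX_BATCH, len(queue)))]
    steps = 0
    while active:
        steps += 1                                   # one token per active request
        active = [r - 1 for r in active if r > 1]    # drop sequences that just finished
        while queue and len(active) < MAX_BATCH:
            active.append(queue.popleft())           # admit new requests mid-batch
    return steps

total = sum(lengths)
for name, fn in (("static", static_batching), ("continuous", continuous_batching)):
    steps = fn(list(lengths))
    print(f"{name:>10} batching: {steps:4d} decode steps, "
          f"~{total / steps:.1f} tokens generated per step on average")
```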
Decode Phase
The autoregressive generation phase where tokens are generated one at a time, each depending on all previous tokens.
Context: Memory-bound phase that typically dominates inference time and determines throughput for generative workloads.
Flash Attention
An optimized attention implementation that reduces memory footprint and improves memory access patterns.
Context: Enables longer sequences and larger batches by reducing attention memory from O(N²) to O(N).
HBM (High Bandwidth Memory)
Stacked memory technology used in GPUs providing high bandwidth for AI workloads.
Context: HBM bandwidth is the primary determinant of LLM decode throughput. HBM3 provides ~3 TB/s.
Inference Server
Software system that loads models, manages requests, and executes inference operations.
Context: Examples include vLLM, TGI, TensorRT-LLM, and Triton. Choice significantly affects achievable throughput.
Inter-Token Latency (ITL)
The time between consecutive token generations during the decode phase.
Context: Affects streaming experience and perceived generation speed. Determined by decode throughput and batch size.
KV Cache
Storage for key and value projection tensors from previous tokens, enabling efficient attention computation without recomputing past representations.
Context: Major consumer of GPU memory during inference, growing linearly with sequence length and limiting batch sizes.
Load Balancing
Distributing requests across multiple inference replicas to optimize utilization and performance.
Context: Must account for inference-specific characteristics like request length and GPU memory state.
Memory Bandwidth
The rate at which data can be transferred between GPU memory (HBM) and compute units, measured in GB/s or TB/s.
Context: Primary bottleneck for LLM decode throughput. H100 provides 3.35 TB/s HBM3 bandwidth.
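The bandwidth bound can be turned into a simple ceiling: a decode step cannot finish faster than one full read of the weights from HBM. The sketch below applies that bound for an assumed 7B model at several precisions, ignoring KV-cache traffic and kernel overheads, so real systems land well below these numbers.

```python
# Roofline-style ceiling: decode-step rate <= HBM bandwidth / weight bytes.
# Bandwidth is the nominal H100 HBM3 figure; the 7B model size is assumed.

hbm_bandwidth = 3.35e12      # bytes/second
params = 7e9

for label, bytes_per_param in (("FP16", 2), ("INT8", 1), ("INT4", 0.5)):
    weight_bytes = params * bytes_per_param
    max_steps_per_s = hbm_bandwidth / weight_bytes
    print(f"{label}: <= {max_steps_per_s:.0f} decode steps/s "
          f"(per-sequence ceiling; aggregate scales with batch while memory-bound)")
```

The FP16/INT8/INT4 ratios also show where the roughly linear quantization speedups quoted elsewhere in this document come from: each halving of weight bytes halves the minimum time to stream the model per decode step.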
Model Parallelism
Distributing a single model across multiple devices, including tensor and pipeline parallelism.
Context: Required for models that don't fit in single GPU memory. Introduces communication overhead.
NCCL (NVIDIA Collective Communications Library)
Library for multi-GPU and multi-node communication operations like all-reduce.
Context: Used for synchronization in tensor and pipeline parallel inference. Performance critical for multi-GPU throughput.
NVLink
High-bandwidth interconnect between NVIDIA GPUs enabling efficient multi-GPU communication.
Context: Provides 600+ GB/s bandwidth, essential for efficient tensor parallelism. Much faster than PCIe.
PagedAttention
A memory management technique that allocates KV cache in fixed-size pages rather than contiguous per-sequence blocks, eliminating fragmentation.
Context: Implemented in vLLM and other modern serving frameworks to maximize memory utilization and enable larger batch sizes.
Pipeline Parallelism
A model parallelism technique that distributes model layers sequentially across GPUs, with different GPUs processing different layers.
Context: Enables very large models across multiple nodes, with higher throughput for large batches but increased latency.
Prefill Phase
The initial phase of inference where all input tokens are processed in parallel to generate the initial KV cache.
Context: Compute-bound phase with high arithmetic intensity, distinct from the memory-bound decode phase.
Prefill-Decode Disaggregation
Separating prefill and decode phases onto different GPU pools for independent optimization.
Context: Valuable for long-context workloads where prefill and decode have very different resource requirements.
Quantization
Reducing model precision from FP16/BF16 to INT8 or INT4 to decrease memory footprint and bandwidth requirements.
Context: Provides roughly linear throughput improvement (2x for INT8, 4x for INT4) with potential quality tradeoffs.
Request Queue
Buffer holding pending inference requests awaiting batch inclusion and processing.
Context: Queue depth indicates backlog and affects latency. Must be bounded to prevent memory exhaustion.
Roofline Model
A visual performance model showing throughput bounded by compute capacity or memory bandwidth.
Context: Useful for identifying whether workload is compute-bound or memory-bound and guiding optimization.
SLO (Service Level Objective)
Target performance metrics that a service commits to achieving, such as latency percentiles or availability.
Context: Throughput optimization must respect SLOs. Common SLOs include P95 latency and availability percentage.
Speculative Decoding
A technique using a smaller draft model to propose multiple tokens that are verified in parallel by the target model.
Context: Can provide 2-3x throughput improvement for predictable generation patterns without quality degradation.
Tensor Parallelism
A model parallelism technique that shards model layers across multiple GPUs, with each GPU computing a portion of each layer.
Context: Enables serving models larger than single GPU memory and increases aggregate memory bandwidth.
Throughput-Latency Tradeoff
The fundamental tension between maximizing aggregate throughput and minimizing individual request latency.
Context: Production systems must balance both based on application requirements and SLOs.
Time-to-First-Token (TTFT)
The latency from request arrival to generation of the first output token, critical for perceived responsiveness.
Context: Primarily determined by prefill time and queue wait time. Key metric for interactive applications.
Tokens per Second (TPS)
The primary unit of LLM throughput measurement, representing the number of tokens generated or processed per second across all concurrent requests.
Context: Used as the standard metric for comparing inference performance across different systems, models, and configurations.
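In practice, aggregate TPS is usually computed from serving logs as tokens completed within a time window divided by the window length. A small sketch with made-up log records (the field names are hypothetical, not any specific server's schema):

```python
# Aggregate tokens-per-second from completed-request records over a window.
from dataclasses import dataclass

@dataclass
class CompletedRequest:
    finished_at: float        # seconds on a monotonic clock
    prompt_tokens: int
    completion_tokens: int

def aggregate_tps(records: list[CompletedRequest], window_s: float) -> dict:
    cutoff = max(r.finished_at for r in records) - window_s
    recent = [r for r in records if r.finished_at >= cutoff]
    return {
        "decode_tps": sum(r.completion_tokens for r in recent) / window_s,
        "total_tps": sum(r.prompt_tokens + r.completion_tokens for r in recent) / window_s,
    }

logs = [CompletedRequest(t, 400, 150) for t in range(60)]   # one request per second
print(aggregate_tps(logs, window_s=60.0))
# {'decode_tps': 150.0, 'total_tps': 550.0}
```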
References & Resources
Academic Papers
- ⢠Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) - Introduces PagedAttention for efficient KV cache management
- ⢠Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023) - Foundational paper on speculative decoding
- ⢠FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) - Memory-efficient attention algorithm
- ⢠Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022) - Continuous batching for LLM serving
- ⢠GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) - Quantization technique for LLMs
- ⢠AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) - Advanced quantization method
- ⢠Efficiently Scaling Transformer Inference (Pope et al., 2022) - Google's analysis of transformer inference scaling
- ⢠Splitwise: Efficient Generative LLM Inference Using Phase Splitting (Patel et al., 2024) - Prefill-decode disaggregation
Industry Standards
- ⢠MLPerf Inference Benchmark - Standard benchmark suite for ML inference performance
- ⢠NVIDIA TensorRT-LLM Documentation - Reference for optimized LLM inference
- ⢠vLLM Documentation and Benchmarks - Open-source high-throughput serving reference
- ⢠Hugging Face Text Generation Inference (TGI) - Production-ready inference server documentation
- ⢠OpenAI API Rate Limits and Best Practices - Industry reference for API design
- ⢠NVIDIA GPU Architecture Whitepapers - Hardware specifications and optimization guides
Resources
- ⢠NVIDIA Developer Blog - LLM Inference Optimization Series
- ⢠Hugging Face Blog - Optimizing Transformers for Production
- ⢠Anyscale Blog - Scaling LLM Inference with Ray
- ⢠vLLM GitHub Repository and Documentation
- ⢠PyTorch Documentation - Inference Optimization
- ⢠CUDA Programming Guide - Memory Optimization
- ⢠Google Cloud Architecture Center - ML Inference Best Practices
- ⢠AWS Machine Learning Blog - SageMaker Inference Optimization
Last updated: 2026-01-05 • Version: v1.0 • Status: citation-safe-reference
Keywords: LLM throughput, tokens per second, inference throughput, batch inference