
What is LLM Throughput

Canonical Definitions Ā· Citation-safe reference Ā· 45-minute read Ā· Updated: 2026-01-05

Executive Summary

  • LLM throughput measures the total number of tokens a language model can generate or process per unit time, typically expressed as tokens per second across all concurrent requests.
  • Throughput is a system-level metric that captures aggregate processing capacity, distinct from latency, which measures individual request response time; optimizing one often involves tradeoffs with the other.
  • Key factors affecting throughput include batch size, model architecture, hardware specifications, memory bandwidth, and serving infrastructure design, with continuous batching and speculative decoding being primary optimization techniques.
  • Production systems must balance throughput against latency SLAs, cost constraints, and quality requirements, making throughput optimization a multi-dimensional engineering challenge rather than a single-variable problem.

The Bottom Line

LLM throughput determines the economic viability and scalability of AI applications by directly impacting how many users or requests a system can serve with given hardware resources. Understanding and optimizing throughput is essential for any production LLM deployment, as it fundamentally affects infrastructure costs, user experience, and system capacity.

Definition

LLM throughput is a performance metric that quantifies the aggregate rate at which a large language model system generates or processes tokens, typically measured in tokens per second (TPS) across all concurrent requests being served.

This metric reflects the total productive capacity of an inference system and serves as the primary indicator of how efficiently hardware resources are being utilized to serve model predictions at scale.

Extended Definition

LLM throughput encompasses both the prefill phase (processing input tokens) and the decode phase (generating output tokens), though the decode phase typically dominates in autoregressive generation scenarios. The metric can be measured at various granularities: per-GPU, per-node, per-cluster, or per-service, depending on the architectural scope being evaluated. Throughput is fundamentally bounded by memory bandwidth during the decode phase and by compute capacity during the prefill phase, creating distinct optimization strategies for each bottleneck. In production environments, throughput must be considered alongside quality-of-service constraints, as maximum theoretical throughput often cannot be achieved while maintaining acceptable latency percentiles for individual requests.
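
In practice the metric is easy to compute once per-request token counts are logged. The sketch below, in Python, shows one minimal way to aggregate prefill, decode, and total tokens per second over a measurement window; the tuple format and function name are illustrative, not taken from any particular serving framework.

```python
def measure_throughput(results, wall_clock_seconds):
    """Aggregate tokens-per-second over a measurement window.

    `results` is assumed to be a list of (prompt_tokens, completion_tokens)
    tuples, one per request completed inside the window.
    """
    prefill_tokens = sum(p for p, _ in results)
    decode_tokens = sum(c for _, c in results)
    return {
        "prefill_tps": prefill_tokens / wall_clock_seconds,
        "decode_tps": decode_tokens / wall_clock_seconds,
        "total_tps": (prefill_tokens + decode_tokens) / wall_clock_seconds,
    }

# Example: 40 requests finished in a 10-second window,
# each with ~500 input tokens and ~200 output tokens.
stats = measure_throughput([(500, 200)] * 40, wall_clock_seconds=10.0)
print(stats)  # decode_tps = 800.0, total_tps = 2800.0
```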

Etymology & Origins

The term 'throughput' originates from industrial engineering and manufacturing, where it described the rate at which products move through a production system. In computing, throughput became a standard performance metric in the 1960s with batch processing systems, measuring jobs completed per unit time. The application to LLMs emerged with the scaling of transformer models in 2020-2023, as practitioners needed to distinguish between single-request latency and system-wide processing capacity. The tokens-per-second formulation became standard as the atomic unit of LLM computation, analogous to FLOPS for general compute or IOPS for storage systems.

Also Known As

tokens per second (TPS) Ā· inference throughput Ā· generation throughput Ā· token generation rate Ā· model serving capacity Ā· aggregate inference rate Ā· system throughput Ā· batch throughput

Not To Be Confused With

Latency

Latency measures the time for a single request to complete (time-to-first-token or total generation time), while throughput measures aggregate tokens generated across all requests per unit time. A system can have high throughput but poor latency if it batches many requests together, or low throughput but good latency if it processes requests sequentially with minimal queuing.

Requests per second (RPS)

RPS counts completed requests regardless of their token count, while throughput measures total tokens processed. A system handling many short requests may have high RPS but lower throughput than one handling fewer long-form generations. Throughput provides a more accurate measure of computational work performed.

FLOPS (Floating Point Operations Per Second)

FLOPS measures raw computational capacity of hardware, while throughput measures actual productive output in application-specific units (tokens). A system may have high theoretical FLOPS but low throughput due to memory bandwidth limitations, inefficient batching, or suboptimal kernel implementations.

Bandwidth

Memory bandwidth refers to the rate of data transfer between memory and compute units (GB/s), while throughput refers to the rate of meaningful output generation. Bandwidth is a hardware specification that constrains throughput, but they measure different things and have different units.

Capacity

Capacity often refers to the maximum number of concurrent requests or users a system can handle, while throughput specifically measures the rate of token generation. A system might have high capacity (many concurrent connections) but limited throughput if each request generates tokens slowly.

Batch size

Batch size is the number of requests processed simultaneously in a single forward pass, while throughput is the resulting rate of token generation. Increasing batch size generally increases throughput up to memory limits, but batch size is an input parameter while throughput is an output metric.

Conceptual Foundation

Mental Models

(6 models)

Factory Assembly Line

Think of LLM inference as a factory where tokens are products being manufactured. Throughput is the total number of products leaving the factory per hour, regardless of how long any individual product spent in production. Batching is like processing multiple products simultaneously on the assembly line, and memory bandwidth is the speed of the conveyor belt bringing raw materials (weights) to the workers (compute units).

Highway Traffic Flow

Consider throughput as the total number of vehicles passing a checkpoint per hour on a highway. Latency is how long each individual car takes to travel the route. Adding more lanes (GPUs) increases capacity, but each lane has a speed limit (memory bandwidth). Batching is like carpooling - more passengers (tokens) per vehicle (forward pass) means more people transported with the same road usage.

Restaurant Kitchen

Imagine the LLM as a restaurant kitchen where throughput is meals served per hour. The chef (GPU) can only cook as fast as ingredients (model weights) arrive from the pantry (memory). Cooking multiple orders at once (batching) is efficient, but the kitchen has limited counter space (GPU memory) for active orders. The KV cache is like keeping partially prepared ingredients for each table's ongoing meal.

Roofline Model

The roofline model visualizes throughput as bounded by either compute capacity (horizontal ceiling) or memory bandwidth (diagonal slope), with actual performance falling below this roof based on operational intensity. LLM decode operations sit on the memory-bound slope, while prefill operations approach the compute-bound ceiling.
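
A minimal sketch of the roofline calculation, using the illustrative H100-class figures quoted later in this article (1979 TFLOPS FP16, 3.35 TB/s HBM3); the arithmetic-intensity values used for decode and prefill are rough assumptions for illustration only.

```python
def attainable_flops(peak_flops, peak_bw_bytes, arithmetic_intensity):
    """Roofline model: attainable FLOP/s = min(compute roof, bandwidth x intensity)."""
    memory_bound_limit = peak_bw_bytes * arithmetic_intensity
    bound = "memory-bound" if memory_bound_limit < peak_flops else "compute-bound"
    return min(peak_flops, memory_bound_limit), bound

# Illustrative H100-class numbers (see "Under the Hood" below).
PEAK_FLOPS = 1979e12   # FP16 tensor FLOP/s
PEAK_BW = 3.35e12      # HBM3 bytes/s

ridge = PEAK_FLOPS / PEAK_BW
print(f"ridge point ~= {ridge:.0f} FLOPs/byte")          # ~590 FLOPs/byte
print(attainable_flops(PEAK_FLOPS, PEAK_BW, 2))           # decode-like: ~6.7e12, memory-bound
print(attainable_flops(PEAK_FLOPS, PEAK_BW, 1000))        # prefill-like: compute roof, compute-bound
```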

Queueing System

Model the inference system as a queueing network where requests arrive, wait in queue, receive service, and depart. Throughput is the departure rate, which in steady state equals the arrival rate up to system capacity. Little's Law (L = λW) relates concurrent requests, throughput, and latency, providing mathematical foundations for capacity planning.
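
A short worked example of Little's Law applied to capacity planning; the request rate and latency figures are assumptions chosen only to make the arithmetic concrete.

```python
def littles_law_concurrency(throughput_rps, avg_latency_s):
    """Little's Law: L = lambda * W.

    In steady state, the average number of requests in the system equals the
    arrival rate (equal to throughput) times the average time each request
    spends in the system.
    """
    return throughput_rps * avg_latency_s

# Example capacity check: to sustain 50 requests/s when each request spends
# ~4 s in the system end to end, ~200 requests must be in flight at once.
print(littles_law_concurrency(throughput_rps=50, avg_latency_s=4.0))  # 200.0
```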

Bucket Brigade

Envision token generation as a bucket brigade where each person (pipeline stage) passes buckets (tensors) to the next. Throughput is limited by the slowest person in the chain. Pipeline parallelism adds more brigades working on different buckets simultaneously, while tensor parallelism adds more people at each station to handle larger buckets.

Key Insights

(10 insights)

Throughput and latency are not simply inversely related - they can be independently optimized to some degree through techniques like continuous batching, which maintains high throughput while bounding latency through iteration-level scheduling and preemption.

The memory-bound nature of LLM decoding means that faster GPUs (higher TFLOPS) often provide minimal throughput improvement compared to GPUs with higher memory bandwidth (HBM3 vs HBM2), making bandwidth-per-dollar a more relevant metric than compute-per-dollar for inference workloads.

KV cache memory consumption grows linearly with both batch size and sequence length, creating a two-dimensional constraint surface that limits achievable throughput - doubling context length effectively halves maximum batch size and thus throughput for decode-bound workloads.

Continuous batching can improve throughput by 2-10x over static batching by eliminating padding waste and enabling immediate scheduling of new requests into completed sequence slots, making it essential for production deployments with variable-length requests.

Speculative decoding can improve throughput by 2-3x for appropriate workloads without quality degradation, but requires careful draft model selection and provides diminishing returns as generation becomes less predictable or draft model accuracy decreases.

Quantization provides roughly linear throughput improvements (2x for INT8, 4x for INT4) by reducing memory bandwidth requirements, but the actual gains depend on whether the workload is truly memory-bound and whether quantization kernels are well-optimized.

Multi-GPU inference introduces communication overhead that can negate throughput gains if not carefully managed - tensor parallelism across 8 GPUs rarely achieves 8x throughput due to all-reduce synchronization costs, especially for smaller batch sizes.

The optimal batch size for throughput is not always the maximum that fits in memory - there's often a sweet spot where further batching increases latency more than it improves throughput due to compute saturation or scheduling overhead.

Prefill operations can severely impact decode throughput in shared systems because prefill is compute-intensive and can starve decode iterations of GPU cycles, making prefill-decode disaggregation valuable for mixed workloads.

Real-world throughput is typically 30-50% of theoretical maximum due to software overhead, memory fragmentation, variable sequence lengths, and the inability to maintain perfect batch utilization - benchmarks claiming higher numbers often use unrealistic synthetic workloads.

When to Use

Ideal Scenarios

(12)

Batch processing pipelines where large volumes of text need to be processed offline without real-time latency requirements, such as document summarization, content moderation, or data extraction jobs that can tolerate queuing delays in exchange for cost efficiency.

High-volume API services serving many concurrent users where aggregate capacity matters more than individual request speed, such as chatbot platforms, content generation services, or AI-assisted writing tools with thousands of simultaneous users.

Cost optimization initiatives where the goal is to maximize useful work per dollar of infrastructure spend, requiring throughput metrics to evaluate hardware efficiency and identify optimization opportunities.

Capacity planning exercises that need to determine how many GPUs or instances are required to handle projected request volumes while meeting SLO targets for both throughput and latency.

A/B testing of inference optimizations where throughput serves as the primary metric for evaluating the effectiveness of changes like quantization, batching strategies, or serving framework upgrades.

Hardware selection decisions comparing different GPU models, cloud instance types, or inference accelerators where throughput-per-dollar determines the most cost-effective option for specific workload patterns.

Autoscaling configuration where throughput metrics drive scaling decisions, triggering scale-out when throughput approaches capacity limits and scale-in when throughput indicates underutilization.

SLO definition and monitoring where throughput targets are established as part of service level objectives, with alerts triggered when throughput drops below acceptable thresholds indicating system degradation.

Benchmark comparisons evaluating different models, serving frameworks, or optimization techniques where throughput provides a standardized metric for comparing alternatives under controlled conditions.

Resource allocation decisions in multi-tenant environments where throughput metrics help fairly distribute GPU resources among competing workloads based on their processing requirements.

Performance regression detection in CI/CD pipelines where throughput benchmarks catch degradations introduced by code changes before they reach production.

Inference cost modeling where throughput directly determines the cost per token or cost per request, enabling accurate pricing and margin calculations for AI-powered products.

Prerequisites

(8)
1

Stable model deployment with consistent inference behavior, as throughput measurements are meaningless if the model or serving infrastructure is still being actively modified or debugged.

2

Representative workload characterization including typical input lengths, output lengths, request arrival patterns, and concurrency levels, since throughput varies significantly across different workload profiles.

3

Adequate monitoring infrastructure capable of measuring tokens generated, request counts, and timing at appropriate granularity without introducing significant measurement overhead.

4

Defined latency constraints that bound acceptable throughput optimization strategies, since maximum throughput often requires latency tradeoffs that may violate user experience requirements.

5

Sufficient hardware resources to achieve meaningful throughput levels, as throughput optimization on severely under-provisioned systems may yield misleading results that don't transfer to production scale.

6

Understanding of the memory-compute tradeoff landscape for the specific model and hardware combination, enabling informed decisions about batch sizes, parallelism strategies, and optimization techniques.

7

Baseline measurements establishing current throughput levels before optimization efforts begin, providing a reference point for measuring improvement and detecting regressions.

8

Clear business requirements translating throughput into business value, such as cost per request targets, user capacity requirements, or infrastructure budget constraints.

Signals You Need This

(10)

Infrastructure costs are disproportionately high relative to the value delivered by AI features, suggesting that throughput optimization could significantly reduce cost per inference.

Request queues are growing during peak periods, indicating that current throughput is insufficient to handle demand and users are experiencing increased latency due to queuing.

GPU utilization metrics show low compute utilization (under 30%) during inference, suggesting that batching or other throughput optimizations could extract more value from existing hardware.

Autoscaling is triggering frequently and aggressively, potentially indicating that throughput per instance is suboptimal and more efficient serving could reduce instance counts.

Batch processing jobs are taking longer than acceptable time windows, requiring either more hardware or throughput optimization to complete within required timeframes.

Cost projections for scaling to target user counts exceed budget constraints, necessitating throughput improvements to serve more users with existing or reduced infrastructure.

Competitive analysis reveals that similar services achieve better price-performance ratios, suggesting throughput optimization opportunities that competitors have already captured.

Memory utilization is high while compute utilization is low, indicating memory-bound operation where throughput optimization techniques like quantization or KV cache optimization could help.

Request rejection rates or timeout rates are increasing, signaling that throughput capacity is being exceeded and optimization or scaling is required.

Energy costs or carbon footprint concerns are driving efficiency requirements, where throughput optimization directly reduces environmental impact per useful inference.

Organizational Readiness

(7)

ML engineering team with experience in inference optimization, including familiarity with profiling tools, serving frameworks, and common optimization techniques like quantization and batching.

Infrastructure team capable of deploying and managing GPU clusters, including expertise in container orchestration, networking, and resource scheduling for ML workloads.

Monitoring and observability systems that can capture inference metrics at appropriate granularity, including custom metrics for tokens per second, batch utilization, and queue depths.

Performance testing capabilities including load generation tools, benchmark harnesses, and analysis frameworks for conducting rigorous throughput evaluations.

Cross-functional alignment between ML, infrastructure, and product teams on throughput targets, latency constraints, and acceptable quality tradeoffs.

Budget allocation for potential hardware upgrades, serving framework licenses, or cloud instance costs that may be required to achieve throughput targets.

Change management processes that allow for iterative optimization with appropriate testing, rollback capabilities, and gradual rollout of throughput improvements.

When NOT to Use

Anti-Patterns

(12)

Optimizing throughput for latency-critical applications like real-time voice assistants or interactive coding tools where time-to-first-token and inter-token latency directly impact user experience and throughput gains that increase latency are counterproductive.

Pursuing maximum throughput on single-user or low-concurrency systems where batching opportunities are limited and throughput optimization provides minimal benefit while adding complexity.

Sacrificing model quality for throughput gains in applications where output quality directly impacts business value, such as medical diagnosis assistance or legal document analysis where errors have significant consequences.

Over-engineering throughput optimization for prototype or experimental systems that may be significantly redesigned, wasting effort on optimizations that won't transfer to the final architecture.

Focusing on throughput metrics while ignoring tail latency, potentially creating systems that perform well on average but provide unacceptable experience for a significant fraction of users.

Applying generic throughput optimization techniques without understanding workload-specific characteristics, potentially optimizing for the wrong bottleneck or introducing regressions for actual usage patterns.

Measuring throughput on synthetic benchmarks that don't represent production workloads, leading to optimization decisions that don't translate to real-world improvements.

Pursuing throughput optimization before establishing baseline stability, potentially masking underlying reliability issues or introducing optimizations that work only under ideal conditions.

Optimizing throughput for batch sizes that exceed actual concurrent request volumes, achieving theoretical improvements that can never be realized in production due to insufficient load.

Ignoring the cost of throughput optimization itself, such as engineering time, increased system complexity, or ongoing maintenance burden that may exceed the infrastructure cost savings achieved.

Applying throughput optimization uniformly across all request types when different requests have different latency requirements, potentially degrading experience for latency-sensitive requests to improve aggregate throughput.

Chasing throughput improvements through techniques that reduce system reliability or increase failure rates, trading availability for efficiency in ways that harm overall service quality.

Red Flags

(10)

Latency SLOs are being violated while throughput targets are being met, indicating that throughput optimization has gone too far at the expense of user experience.

Model quality metrics are degrading as throughput increases, suggesting that aggressive optimization techniques like quantization or speculative decoding are introducing unacceptable quality tradeoffs.

System complexity is increasing faster than throughput improvements, creating maintenance burden and reliability risks that outweigh efficiency gains.

Throughput benchmarks show impressive numbers but production metrics don't improve, indicating that benchmarks don't represent actual workload characteristics.

Team is spending more time on throughput optimization than on core product features, suggesting misallocated priorities or diminishing returns from further optimization.

Throughput optimization requires specialized hardware or software that creates vendor lock-in or limits future flexibility.

Optimization techniques are brittle and break with model updates, requiring constant re-optimization effort that negates efficiency gains.

Memory errors or OOM conditions are increasing as batch sizes are pushed higher, indicating that throughput optimization is approaching unsafe limits.

Request timeout rates are increasing even as throughput metrics improve, suggesting that tail latency is being sacrificed for aggregate throughput.

Cost per token is decreasing but total infrastructure cost is increasing due to over-provisioning for throughput headroom that isn't needed.

Better Alternatives

(8)
1
When:

Real-time conversational AI with strict latency requirements

Use Instead:

Latency optimization with streaming and time-to-first-token focus

Why:

User experience in conversational applications depends primarily on perceived responsiveness, which is determined by latency rather than throughput. Optimizing for throughput may increase batch sizes in ways that harm the interactive feel of the conversation.

2
When:

Low-volume, high-value inference tasks like medical diagnosis

Use Instead:

Quality optimization with ensemble methods or larger models

Why:

When request volume is low and accuracy is paramount, the marginal cost savings from throughput optimization are negligible compared to the value of improved output quality. Resources are better spent on model quality than inference efficiency.

3
When:

Highly variable workloads with unpredictable spikes

Use Instead:

Elastic scaling with serverless inference

Why:

Throughput optimization assumes relatively stable load patterns. For highly variable workloads, the ability to rapidly scale capacity up and down provides better cost efficiency than optimizing throughput on fixed infrastructure.

4
When:

Multi-model pipelines with complex orchestration

Use Instead:

End-to-end pipeline latency optimization

Why:

In multi-step pipelines, optimizing throughput of individual models may not improve overall pipeline performance if other stages are bottlenecks. End-to-end optimization considering all stages typically yields better results.

5
When:

Development and experimentation environments

Use Instead:

Flexibility and iteration speed optimization

Why:

In development contexts, the ability to quickly test different models, prompts, and configurations provides more value than inference efficiency. Throughput optimization adds constraints that slow experimentation.

6
When:

Edge deployment with limited hardware

Use Instead:

Model compression and architecture optimization

Why:

On resource-constrained edge devices, fundamental model size and architecture changes provide larger efficiency gains than serving-level throughput optimization, which has limited headroom on small devices.

7
When:

Applications with highly variable output lengths

Use Instead:

Adaptive batching with output length prediction

Why:

Standard throughput optimization assumes relatively uniform request characteristics. For highly variable outputs, adaptive strategies that predict output length and batch accordingly outperform static optimization.

8
When:

Cost-sensitive applications with flexible latency requirements

Use Instead:

Spot instance utilization with queue-based processing

Why:

When latency is flexible, using cheaper spot instances with queue-based request handling may provide better cost efficiency than throughput optimization on more expensive on-demand instances.

Common Mistakes

(10)

Measuring throughput only under optimal conditions with uniform request lengths and perfect batching, leading to unrealistic expectations when deployed with real-world variable workloads.

Ignoring the impact of KV cache memory consumption on achievable batch sizes, leading to throughput projections that assume larger batches than memory constraints allow.

Confusing tokens per second with requests per second, leading to incorrect capacity planning when request lengths vary significantly.

Optimizing for throughput on a single GPU without considering the communication overhead that reduces efficiency in multi-GPU deployments.

Assuming linear throughput scaling with hardware additions, ignoring the sublinear scaling caused by communication overhead, memory bandwidth sharing, and scheduling inefficiencies.

Neglecting the prefill phase impact on decode throughput in systems handling mixed workloads with varying input lengths.

Over-relying on quantization for throughput gains without validating quality impact on the specific use case and model combination.

Implementing aggressive batching without latency bounds, creating systems that achieve high throughput but violate user experience requirements.

Measuring throughput with synthetic prompts that don't trigger the same attention patterns, cache behaviors, or generation characteristics as production requests.

Failing to account for warmup effects, leading to throughput measurements that don't reflect steady-state performance.

Core Taxonomy

Primary Types

(8 types)

Decode Throughput

The rate of token generation during the autoregressive decoding phase, measured in output tokens per second. This is typically the primary throughput metric for generative applications where output length dominates processing time.

Characteristics
  • Memory-bandwidth bound due to low arithmetic intensity
  • Scales roughly linearly with batch size until memory saturation
  • Highly sensitive to KV cache memory consumption
  • Benefits significantly from quantization and memory optimization
Use Cases
  • Chatbots and conversational AI
  • Content generation applications
  • Code completion and generation
  • Long-form text synthesis
Tradeoffs

Optimizing decode throughput often requires larger batch sizes that increase per-request latency. Memory optimization techniques like quantization improve decode throughput but may impact output quality.

Classification Dimensions

Measurement Scope

Throughput can be measured at different levels of the infrastructure stack, with each level providing different insights into system performance and bottlenecks.

Per-GPU throughput Ā· Per-node throughput Ā· Per-cluster throughput Ā· Per-service throughput

Workload Type

Different request handling patterns have different throughput characteristics and optimization strategies, requiring workload-specific measurement approaches.

Synchronous throughput Ā· Asynchronous throughput Ā· Streaming throughput Ā· Batch throughput

Optimization Target

Throughput optimization can target different objectives, with each target implying different tradeoffs and measurement criteria.

Latency-constrained throughput Ā· Cost-optimized throughput Ā· Quality-constrained throughput Ā· Maximum theoretical throughput

Time Horizon

Throughput varies over time, and different time horizons provide different views of system capacity and performance stability.

Instantaneous throughput Ā· Average throughput Ā· Peak throughput Ā· Sustained throughput

Request Priority

In multi-priority systems, throughput may be measured separately for different priority classes to ensure SLO compliance across service tiers.

High-priority throughput Ā· Best-effort throughput Ā· Background throughput Ā· Preemptible throughput

Model Configuration

Different model optimization techniques produce different throughput characteristics, requiring separate measurement for each configuration.

Full-precision throughput Ā· Quantized throughput Ā· Pruned model throughput Ā· Distilled model throughput

Evolutionary Stages

1

Naive Serving

Initial deployment, 0-3 months

Single-request processing without batching, minimal optimization, throughput limited to sequential request handling. Typical of initial deployments and development environments.

2

Basic Batching

Early production, 3-6 months

Static batching with fixed batch sizes, basic padding strategies, throughput improved but limited by batch formation delays and padding waste.

3

Continuous Batching

Mature production, 6-12 months

Dynamic batch formation with iteration-level scheduling, minimal padding waste, significantly improved throughput with bounded latency impact.

4

Advanced Optimization

Optimized production, 12-24 months

Quantization, speculative decoding, KV cache optimization, prefill-decode disaggregation, approaching hardware efficiency limits.

5

Hardware-Software Co-optimization

Highly optimized systems, 24+ months

Custom kernels, hardware-specific optimizations, inference accelerator utilization, maximum achievable throughput for given hardware.

Architecture Patterns

(8 patterns)

Single-GPU Serving

The simplest deployment pattern where a single GPU handles all inference requests, suitable for smaller models or low-traffic applications. Throughput is limited by single GPU memory bandwidth and compute capacity.

Components
  • Inference server (vLLM, TGI, or similar)
  • Request queue
  • Model weights in GPU memory
  • KV cache manager
Data Flow

Requests arrive at the inference server, are queued for batching, processed through the model on a single GPU, and responses are returned to clients. The KV cache is managed within GPU memory for active sequences.

Best For
  • Development and testing environments
  • Low-traffic production services
  • Models that fit in single GPU memory
  • Cost-sensitive deployments
Limitations
  • Single point of failure
  • Limited throughput ceiling
  • Cannot serve models larger than GPU memory
  • No horizontal scaling capability
Scaling Characteristics

Vertical scaling only through GPU upgrade. Throughput ceiling determined by GPU memory bandwidth (typically 1-3 TB/s for modern GPUs). Maximum batch size limited by GPU memory capacity.

Integration Points

Load Balancer

Distributes incoming requests across inference server replicas, performs health checking, and implements routing policies for optimal throughput.

Interfaces:
HTTP/gRPC ingress Ā· Health check endpoints Ā· Metrics export Ā· Configuration API

Load balancer must understand request characteristics (input length, expected output length) for intelligent routing. Session affinity may be needed for stateful interactions. Must handle streaming responses appropriately.

Request Queue

Buffers incoming requests for batch formation, implements priority scheduling, and manages backpressure when throughput is exceeded.

Interfaces:
Enqueue API Ā· Dequeue/batch formation API Ā· Queue depth metrics Ā· Priority configuration

Queue depth directly impacts latency. Must implement fair scheduling across priority levels. Backpressure mechanisms prevent queue overflow. Queue persistence may be needed for reliability.

KV Cache Manager

Allocates, manages, and optimizes key-value cache memory for active sequences, implementing strategies like PagedAttention for efficient memory utilization.

Interfaces:
Allocation API Ā· Deallocation API Ā· Memory pressure signals Ā· Defragmentation triggers

KV cache management directly impacts achievable batch size and thus throughput. Must handle variable sequence lengths efficiently. Memory fragmentation can reduce effective capacity over time.

Metrics Collector

Gathers throughput metrics, latency distributions, and resource utilization data for monitoring, alerting, and optimization feedback.

Interfaces:
Prometheus metrics endpoint Ā· Custom metrics API Ā· Trace export Ā· Log aggregation

Metrics collection overhead should be minimal to avoid impacting throughput. Must capture both aggregate throughput and per-request latency. Real-time metrics enable dynamic optimization.

Autoscaler

Monitors throughput and latency metrics to automatically adjust replica count, maintaining target throughput while minimizing cost.

Interfaces:
Metrics ingestion Ā· Scaling API Ā· Configuration management Ā· Cost constraints

Scaling decisions must account for GPU startup time (minutes). Throughput-based scaling must consider latency impact. Predictive scaling can improve responsiveness to load changes.
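
A sketch of a throughput-driven scaling rule, assuming per-replica sustainable throughput is known and a fixed headroom factor is acceptable; real autoscalers layer hysteresis, cooldowns, and latency signals on top of a rule like this.

```python
import math

def target_replicas(current_tps, tps_per_replica, headroom=0.8,
                    min_replicas=1, max_replicas=64):
    """Pick a replica count so each replica runs below `headroom` of its
    sustainable throughput. GPU startup time (minutes per node) is the reason
    production autoscalers add hysteresis or predictive scaling on top."""
    needed = math.ceil(current_tps / (tps_per_replica * headroom))
    return max(min_replicas, min(max_replicas, needed))

# Example: 42,000 aggregate tokens/s, 5,000 tokens/s sustainable per replica,
# keep each replica at <=80% load -> 11 replicas.
print(target_replicas(42_000, 5_000))  # 11
```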

Model Registry

Stores and versions model artifacts, provides model metadata, and supports model loading for multi-model serving scenarios.

Interfaces:
Model upload/download API Ā· Version management Ā· Metadata queries Ā· Access control

Model loading time impacts throughput during scaling or model switches. Registry must support efficient partial loading for large models. Version management enables safe rollbacks.

Batch Scheduler

Forms optimal batches from queued requests, balancing batch size for throughput against latency constraints and fairness requirements.

Interfaces:
Request intake Ā· Batch emission Ā· Scheduling policy configuration Ā· Priority handling

Scheduler directly determines throughput-latency tradeoff. Must handle variable sequence lengths efficiently. Continuous batching requires iteration-level scheduling decisions.
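
The sketch below illustrates the iteration-level scheduling idea behind continuous batching: slots freed by finished sequences are refilled immediately rather than waiting for the whole batch to drain. The class and field names are hypothetical and the decode step is stubbed out.

```python
from collections import deque

class ContinuousBatcher:
    """Toy iteration-level scheduler: after every decode step, finished
    sequences free their slots and waiting requests are admitted immediately,
    instead of waiting for a static batch to complete."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def submit(self, request):
        # request is assumed to be a dict with "generated" and "max_tokens" keys
        self.waiting.append(request)

    def step(self):
        # 1. Refill free slots from the queue (no padding to a fixed batch).
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # 2. One decode iteration for the whole running batch (stubbed here).
        for req in self.running:
            req["generated"] += 1
        # 3. Retire sequences that hit their token budget and return them.
        finished = [r for r in self.running if r["generated"] >= r["max_tokens"]]
        self.running = [r for r in self.running if r["generated"] < r["max_tokens"]]
        return finished
```

Each call to step() corresponds to one forward pass over the running batch, so aggregate decode throughput is roughly the running batch size times the number of steps completed per second.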

Hardware Abstraction Layer

Provides unified interface to underlying GPU/accelerator hardware, enabling portability across different hardware platforms while exposing optimization opportunities.

Interfaces:
Compute dispatch Ā· Memory management Ā· Hardware capability queries Ā· Performance counters

Abstraction overhead should be minimal. Must expose hardware-specific optimizations where available. Enables multi-hardware deployments and hardware migration.

Decision Framework

Is the workload primarily latency-constrained rather than throughput-constrained?

āœ“ If Yes

If latency-constrained, optimize for minimal batch sizes and fast single-request processing. Throughput optimization may be counterproductive.

āœ— If No

If throughput-constrained, proceed with throughput optimization strategies including larger batch sizes and continuous batching.

Considerations

Most production systems have both constraints. Define clear SLOs for both metrics before optimizing. Latency percentiles (p95, p99) often matter more than averages.

Technical Deep Dive

Overview

LLM throughput emerges from the interaction between model architecture, hardware capabilities, and serving system design. At its core, throughput is determined by how efficiently the system can move model weights from memory to compute units and perform the necessary matrix multiplications for token generation. The autoregressive nature of LLM generation creates a fundamental bottleneck: each new token depends on all previous tokens, requiring sequential generation that limits parallelism within a single sequence.

The key insight is that while individual sequences must be generated sequentially, multiple sequences can be processed in parallel through batching. This amortizes the fixed cost of loading model weights across multiple useful computations, dramatically improving throughput. Modern serving systems achieve high throughput through sophisticated batching strategies, memory management techniques, and hardware-aware optimizations that maximize the utilization of available compute and memory bandwidth.

Throughput optimization requires understanding the distinct characteristics of the prefill and decode phases. Prefill processes all input tokens in parallel and is compute-bound, while decode generates tokens one at a time and is memory-bound. These phases have different bottlenecks and optimization strategies, and production systems must balance both for optimal overall throughput.

The practical throughput ceiling is determined by the minimum of compute capacity and memory bandwidth, adjusted for software overhead and utilization efficiency. Modern GPUs have far more compute capacity than memory bandwidth relative to the arithmetic intensity of LLM inference, making memory bandwidth the dominant constraint for most workloads.

Step-by-Step Process

Incoming inference requests arrive at the serving system and are placed in a request queue. Each request contains input tokens (prompt) and generation parameters (max tokens, temperature, etc.). The queue maintains request metadata including arrival time, priority, and estimated resource requirements.

āš ļø Pitfalls to Avoid

Unbounded queues can lead to memory exhaustion and unpredictable latency. Queue depth must be monitored and backpressure applied when capacity is exceeded. Priority inversion can occur if queue management doesn't account for request characteristics.
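
A minimal sketch of the bounded-queue idea described above, assuming the caller translates a rejected enqueue into backpressure (for example, an HTTP 429 with a retry-after hint); the class name and depth limit are illustrative.

```python
import queue

class AdmissionQueue:
    """Bounded request queue: reject new work (backpressure) instead of
    growing without limit, so queueing delay and memory stay predictable."""

    def __init__(self, max_depth=256):
        self._q = queue.Queue(maxsize=max_depth)

    def enqueue(self, request):
        try:
            self._q.put_nowait(request)
            return True   # accepted
        except queue.Full:
            return False  # caller should surface backpressure, e.g. HTTP 429

    def depth(self):
        # Exported as a metric; rising depth is an early warning of saturation.
        return self._q.qsize()
```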

Under The Hood

At the hardware level, LLM inference throughput is fundamentally constrained by the memory bandwidth of the GPU. Modern GPUs like the NVIDIA H100 provide approximately 3.35 TB/s of HBM3 bandwidth, while offering 1979 TFLOPS of FP16 compute. The arithmetic intensity (FLOPS per byte) required to fully utilize this compute is approximately 590 FLOPS/byte. However, during autoregressive decoding, the arithmetic intensity is only about 1-2 FLOPS/byte because each token generation requires loading the entire model weights while performing relatively few operations per weight.

This memory-bound nature means that throughput scales with batch size because batching amortizes weight loading across multiple sequences. If a 70B parameter model (140GB in FP16) requires loading all weights for each token, a single sequence achieves only about 24 tokens/second (3350 GB/s Ć· 140 GB). With a batch size of 32, the same weight loading produces 32 tokens, achieving approximately 768 tokens/second aggregate throughput. This linear scaling continues until memory capacity limits batch size or compute becomes saturated.

The KV cache adds another dimension to memory constraints. For a typical 70B model, each token in the KV cache consumes approximately 2.6MB (assuming 80 layers, 64 heads, 128 dimensions per head, keys and values, FP16). A sequence of 4096 tokens therefore requires roughly 10.7GB of KV cache. With 80GB of GPU memory and 140GB of model weights (requiring tensor parallelism), the remaining memory for KV cache limits the product of batch size and sequence length. This creates a tradeoff between supporting more concurrent sequences (higher batch size) and supporting longer sequences.

Modern serving systems implement sophisticated memory management to maximize throughput within these constraints. PagedAttention allocates KV cache in fixed-size blocks rather than contiguous per-sequence allocations, eliminating fragmentation and enabling near-100% memory utilization. Continuous batching maintains high batch utilization by immediately filling slots vacated by completed sequences. Speculative decoding converts memory-bound sequential decoding into compute-bound parallel verification, effectively increasing arithmetic intensity and throughput.

Quantization provides another path to higher throughput by reducing memory bandwidth requirements. INT8 quantization halves the memory footprint, enabling 2x larger batches or 2x faster weight loading. INT4 quantization provides 4x improvement. The throughput gain is roughly proportional to the compression ratio, minus any overhead from dequantization operations. Modern quantization techniques (GPTQ, AWQ, SqueezeLLM) minimize quality degradation while maximizing throughput benefit.

Multi-GPU inference introduces communication overhead that affects throughput. Tensor parallelism requires all-reduce operations to synchronize activations between layers, consuming interconnect bandwidth. With NVLink (900 GB/s bidirectional), this overhead is manageable, but with PCIe (64 GB/s), it can significantly reduce efficiency. Pipeline parallelism avoids all-reduce but introduces pipeline bubbles that reduce utilization. The optimal parallelism strategy depends on model size, batch size, interconnect bandwidth, and latency requirements.
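
The bandwidth and KV-cache arithmetic above can be captured in a few lines. The sketch below reproduces those back-of-the-envelope numbers; it ignores compute saturation, attention FLOPs, and activation traffic, so treat the results as upper bounds rather than predictions.

```python
def decode_tps_upper_bound(bandwidth_gb_s, weights_gb, batch_size):
    """Memory-bandwidth ceiling for decode: every step streams the full
    weights once and yields `batch_size` tokens."""
    steps_per_second = bandwidth_gb_s / weights_gb
    return steps_per_second * batch_size

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Keys plus values, summed over all layers.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Numbers from the paragraphs above (70B model, FP16, full multi-head attention):
print(decode_tps_upper_bound(3350, 140, batch_size=1))   # ~24 tokens/s
print(decode_tps_upper_bound(3350, 140, batch_size=32))  # ~766 tokens/s
print(kv_bytes_per_token(80, 64, 128) / 1e6)             # ~2.6 MB per cached token
```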

Failure Modes

GPU Out-of-Memory (OOM)

Root Cause

KV cache growth exceeds available GPU memory due to long sequences, large batch sizes, or memory fragmentation. Can also occur from memory leaks in serving framework.

Symptoms
  • CUDA OOM errors in logs
  • Sudden request failures after period of normal operation
  • Increasing memory usage over time without corresponding load increase
  • Failures correlated with long sequence requests
Impact

Complete inference failure for affected requests. May crash entire serving process, affecting all concurrent requests. Recovery requires process restart.

Prevention

Implement memory budgeting with hard limits on batch size and sequence length. Use PagedAttention for efficient memory utilization. Monitor memory usage and implement proactive eviction. Set conservative memory limits with safety margin.

Mitigation

Implement graceful degradation by rejecting new requests when memory pressure is high. Use request queuing with admission control. Implement automatic process restart with state recovery. Consider KV cache offloading for memory pressure relief.
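
One way to express the prevention advice as code is a worst-case KV-cache budget that caps concurrent sequences at admission time. The sketch below assumes a 4-way tensor-parallel 70B shard (roughly 35 GB of weights and roughly 0.65 MB of KV cache per token per GPU); all of these figures, and the safety margin, are illustrative assumptions.

```python
def max_batch_size(gpu_mem_gb, weights_gb, kv_mb_per_token, max_seq_len,
                   safety_margin=0.9):
    """Hard cap on concurrent sequences so the KV cache can never outgrow the
    memory left after model weights (a conservative, worst-case budget)."""
    budget_gb = (gpu_mem_gb - weights_gb) * safety_margin
    per_seq_gb = kv_mb_per_token * max_seq_len / 1024
    return max(0, int(budget_gb // per_seq_gb))

# Illustration: 80 GB card, ~35 GB weight shard, ~0.65 MB/token KV per shard,
# 4096-token maximum sequences -> admit at most ~15 concurrent sequences.
print(max_batch_size(80, 35, 0.65, 4096))  # 15
```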

Operational Considerations

Key Metrics (15)

Tokens per second (TPS): aggregate token generation rate across all requests, the primary throughput metric. Includes both input (prefill) and output (decode) tokens.

Normal: Depends on hardware and model. Typical: 1000-10000 TPS per GPU for a 7B model, 100-1000 TPS per GPU for a 70B model.
Alert: Alert when below 70% of baseline for a sustained period (5+ minutes). Critical alert below 50%.
Response: Investigate batch utilization, memory pressure, and hardware health. Check for degraded modes. Scale out if load-related.
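
A tiny sketch of the tiered threshold check described above; it assumes the caller supplies a sustained (for example, 5-minute) average rather than an instantaneous sample.

```python
def throughput_alert(current_tps, baseline_tps):
    """Tiered alert per the thresholds above: warning below 70% of baseline,
    critical below 50%."""
    ratio = current_tps / baseline_tps
    if ratio < 0.5:
        return "critical"
    if ratio < 0.7:
        return "warning"
    return "ok"

print(throughput_alert(3_100, 5_000))  # "warning" (62% of baseline)
```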

Dashboard Panels

  • Real-time throughput (tokens/second) with historical trend and baseline comparison
  • Throughput breakdown by phase (prefill vs decode) and request type
  • Batch size distribution histogram and time series
  • GPU memory utilization with KV cache breakdown
  • Request latency percentiles (TTFT, ITL, total) with SLO lines
  • Queue depth and wait time distribution
  • GPU utilization (compute, memory bandwidth) per device
  • Error rate by type (OOM, timeout, model error) with recent errors list
  • Autoscaling status (current replicas, scaling events, target)
  • Cost efficiency metrics (cost per token, utilization efficiency)

Alerting Strategy

Implement tiered alerting with different severity levels and response procedures. Use anomaly detection for metrics without fixed thresholds. Correlate alerts across metrics to identify root causes. Implement alert suppression during known maintenance windows. Use runbooks linked to alerts for consistent response. Page on-call for critical alerts affecting user experience. Use Slack/email for warning-level alerts requiring investigation. Aggregate similar alerts to prevent alert fatigue. Track alert-to-resolution time for continuous improvement.

Cost Analysis

Cost Drivers

(10)

GPU Instance Hours

Impact:

Primary cost driver, typically 60-80% of total inference cost. Directly proportional to number of GPUs and runtime hours.

Optimization:

Maximize throughput per GPU to reduce required instance count. Use spot instances for fault-tolerant workloads. Right-size instances based on actual utilization. Implement autoscaling to match capacity to demand.

Memory Capacity Requirements

Impact:

Determines minimum GPU tier required. Higher memory GPUs cost significantly more (A100 80GB vs 40GB, H100 80GB vs others).

Optimization:

Use quantization to reduce memory requirements. Implement efficient KV cache management. Consider model sharding across cheaper GPUs. Optimize batch sizes to maximize memory utilization.

Network Egress

Impact:

Can be significant for high-volume streaming responses. Cloud providers charge for outbound data transfer.

Optimization:

Compress responses where possible. Use regional deployments to reduce cross-region transfer. Implement response caching for repeated queries. Consider CDN for static or cacheable content.

Storage for Model Artifacts

Impact:

Model weights require significant storage. Multiple model versions multiply storage costs.

Optimization:

Use efficient model formats (safetensors). Implement model deduplication for shared layers. Use tiered storage (hot/cold) for model versions. Clean up unused model versions.

Monitoring and Logging

Impact:

High-volume metrics and logs can incur significant observability costs. Often overlooked in initial estimates.

Optimization:

Sample metrics appropriately. Aggregate logs before shipping. Use retention policies to limit storage. Consider self-hosted observability for large scale.

Load Balancing and Networking

Impact:

Load balancer costs scale with request volume. Private networking may incur additional charges.

Optimization:

Use efficient load balancing tiers. Batch requests where possible to reduce per-request overhead. Optimize health check frequency.

Engineering Time for Optimization

Impact:

Throughput optimization requires significant engineering investment. Ongoing maintenance adds to total cost.

Optimization:

Focus on high-impact optimizations first. Use managed services where build cost exceeds buy cost. Automate routine optimization tasks.

Redundancy and Fault Tolerance

Impact:

High availability requires redundant capacity, increasing costs. N+1 or N+2 redundancy multiplies base cost.

Optimization:

Right-size redundancy based on actual availability requirements. Use spot instances for redundant capacity. Implement graceful degradation to reduce redundancy needs.

Development and Staging Environments

Impact:

Non-production environments consume resources for testing and development. Often over-provisioned.

Optimization:

Use smaller instances for dev/staging. Implement environment scheduling (shut down overnight). Share resources across teams where possible.

Data Transfer for Distributed Inference

Impact:

Multi-GPU and multi-node inference incurs internal data transfer costs. Can be significant for large-scale deployments.

Optimization:

Minimize cross-node communication. Use high-bandwidth interconnects to reduce transfer time. Optimize parallelism strategy to reduce communication.

Cost Models

Cost per Million Tokens

(GPU_hourly_cost Ɨ hours) / (throughput_tokens_per_second Ɨ 3600 Ɨ hours / 1_000_000)
Variables:
  • GPU_hourly_cost: Hourly cost of GPU instance
  • throughput_tokens_per_second: Achieved throughput
  • hours: Operating hours
Example:

H100 at $3/hour achieving 5000 tokens/second: $3 / (5000 Ɨ 3600 / 1M) = $0.17 per million tokens

Cost per Request

cost_per_million_tokens Ɨ (avg_input_tokens + avg_output_tokens) / 1_000_000
Variables:
  • cost_per_million_tokens: From previous model
  • avg_input_tokens: Average input length
  • avg_output_tokens: Average output length
Example:

At $0.17/M tokens with 500 input + 200 output tokens: $0.17 Ɨ 700 / 1M = $0.000119 per request

Total Cost of Ownership (Monthly)

GPU_cost + storage_cost + network_cost + monitoring_cost + engineering_cost + overhead
Variables:
  • GPU_cost: Instance hours Ɨ hourly rate
  • storage_cost: Model storage + logs
  • network_cost: Egress + load balancing
  • monitoring_cost: Metrics + logging services
  • engineering_cost: Maintenance time Ɨ rate
  • overhead: Management, security, compliance
Example:

10 H100s at $3/hr for 720 hours = $21,600 + $500 storage + $1,000 network + $500 monitoring + $5,000 engineering = $28,600/month

Break-even Analysis for Optimization

optimization_cost / (baseline_cost - optimized_cost) = months_to_break_even
Variables:
  • optimization_cost: Engineering time + any tooling costs
  • baseline_cost: Monthly cost before optimization
  • optimized_cost: Monthly cost after optimization
Example:

$50,000 optimization effort saving $10,000/month: 50,000 / 10,000 = 5 months to break even
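
The four cost models reduce to a few lines of arithmetic. The sketch below reproduces the worked examples above; the small difference in cost per request comes from using the unrounded cost per million tokens rather than the rounded $0.17.

```python
def cost_per_million_tokens(gpu_hourly_cost, tps):
    tokens_per_hour = tps * 3600
    return gpu_hourly_cost / (tokens_per_hour / 1_000_000)

def cost_per_request(cost_per_m_tokens, avg_input_tokens, avg_output_tokens):
    return cost_per_m_tokens * (avg_input_tokens + avg_output_tokens) / 1_000_000

def months_to_break_even(optimization_cost, baseline_monthly, optimized_monthly):
    return optimization_cost / (baseline_monthly - optimized_monthly)

cpm = cost_per_million_tokens(3.0, 5000)              # H100 at $3/hr, 5000 TPS
print(round(cpm, 3))                                  # ~$0.167 per million tokens
print(round(cost_per_request(cpm, 500, 200), 6))      # ~$0.000117 per request
print(months_to_break_even(50_000, 28_600, 18_600))   # 5.0 months
```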

Optimization Strategies

  1. Implement quantization (INT8/INT4) to increase throughput per GPU, reducing required instance count
  2. Use spot/preemptible instances for fault-tolerant batch workloads, achieving 60-90% cost reduction
  3. Right-size GPU instances based on actual memory and compute requirements rather than over-provisioning
  4. Implement autoscaling to match capacity to demand, avoiding paying for idle resources
  5. Use continuous batching to maximize throughput and GPU utilization
  6. Consider smaller, more efficient models for latency-tolerant use cases
  7. Implement request caching for repeated or similar queries
  8. Use reserved instances or committed use discounts for baseline capacity
  9. Consolidate low-utilization workloads onto shared infrastructure
  10. Optimize model loading and warmup to reduce startup costs during scaling
  11. Implement tiered serving with different cost-performance tradeoffs for different request types
  12. Use speculative decoding to improve throughput without additional hardware

Hidden Costs

  • šŸ’° GPU idle time during low-traffic periods if autoscaling is not implemented or is too slow
  • šŸ’° Over-provisioning for peak capacity that is rarely utilized
  • šŸ’° Engineering time spent on debugging and optimization that could be spent on features
  • šŸ’° Opportunity cost of using expensive GPUs for workloads that could run on cheaper hardware
  • šŸ’° Cost of failed experiments and optimization attempts that don't pan out
  • šŸ’° Compliance and security overhead for handling sensitive data in inference
  • šŸ’° Cost of incidents and outages including engineering time and potential SLA penalties
  • šŸ’° Training and onboarding costs for specialized ML infrastructure skills

ROI Considerations

Return on investment for throughput optimization depends heavily on scale. At small scale (few GPUs, low traffic), the engineering investment in optimization may exceed the infrastructure savings. The break-even point typically occurs when monthly infrastructure costs exceed $10,000-$50,000, at which point a 20-30% efficiency improvement justifies dedicated optimization effort.

Throughput optimization ROI should be calculated considering both direct cost savings and indirect benefits. Direct savings come from reduced instance counts and better utilization. Indirect benefits include improved user experience (if latency also improves), ability to serve more users without scaling, and reduced environmental impact.

The ROI calculation should also account for ongoing maintenance costs of optimizations. Complex optimizations like speculative decoding or custom kernels require ongoing maintenance as models and frameworks evolve. Simpler optimizations like quantization or continuous batching have lower maintenance burden and thus better long-term ROI.

Finally, consider the opportunity cost of engineering time. If throughput optimization delays product features that would generate revenue, the net ROI may be negative even if infrastructure costs decrease. Prioritize optimization efforts based on overall business impact, not just infrastructure efficiency.

Security Considerations

Threat Model

(10 threats)
1

Denial of Service through Resource Exhaustion

Attack Vector

Attacker sends requests designed to consume maximum resources (long inputs, long outputs, many concurrent requests) to exhaust GPU memory or compute capacity.

Impact

Service unavailability for legitimate users. Potential cost explosion if autoscaling responds to attack traffic.

Mitigation

Implement rate limiting per user/API key. Set maximum input/output length limits. Use admission control based on resource availability. Implement cost-based quotas.

2

Prompt Injection Affecting Throughput

Attack Vector

Malicious prompts designed to cause extremely long outputs, infinite loops, or resource-intensive generation patterns.

Impact

Reduced throughput for other users. Potential resource exhaustion. Unexpected cost increases.

Mitigation

Enforce strict output length limits. Implement generation timeout. Monitor for anomalous generation patterns. Use output validation.

3

Side-Channel Information Leakage

Attack Vector

Attacker infers information about other users' requests through timing analysis, resource utilization patterns, or batching behavior.

Impact

Privacy violation. Potential exposure of sensitive information from other users' prompts or outputs.

Mitigation

Implement request isolation where required. Add timing noise to responses. Use separate infrastructure for sensitive workloads. Avoid batching requests from different security contexts.

4

Model Extraction through High-Volume Queries

Attack Vector

Attacker sends large volumes of carefully crafted queries to extract model behavior, effectively stealing the model.

Impact

Intellectual property theft. Competitive disadvantage. Potential for creating adversarial attacks.

Mitigation

Implement query rate limits. Monitor for extraction patterns. Use watermarking techniques. Limit API access to authorized users.

5

Data Exfiltration through Model Outputs

Attack Vector

If model has been trained on or has access to sensitive data, attacker crafts prompts to extract this information.

Impact

Data breach. Compliance violations. Reputational damage.

Mitigation

Implement output filtering for sensitive patterns. Use data loss prevention (DLP) tools. Audit model training data. Implement access controls on sensitive capabilities.

6

Adversarial Inputs Causing Model Misbehavior

Attack Vector

Specially crafted inputs cause model to produce harmful, biased, or incorrect outputs at scale.

Impact

Reputational damage. Potential legal liability. User harm.

Mitigation

Implement input validation and sanitization. Use output filtering and moderation. Monitor for adversarial patterns. Implement human review for high-risk outputs.

7

Infrastructure Compromise

Attack Vector

Attacker gains access to inference infrastructure through vulnerabilities in serving software, containers, or cloud configuration.

Impact

Data breach. Service disruption. Potential for persistent access and ongoing attacks.

Mitigation

Keep software updated. Use minimal container images. Implement network segmentation. Use secrets management. Regular security audits.

8

Supply Chain Attacks on Model Artifacts

Attack Vector

Compromised model weights or serving software introduced through supply chain, containing backdoors or malicious behavior.

Impact

Compromised outputs. Potential data exfiltration. Difficult to detect.

Mitigation

Verify model artifact integrity. Use trusted sources. Implement artifact signing. Scan for known vulnerabilities. Audit model behavior.

9

Insider Threat

Attack Vector

Malicious insider with access to inference infrastructure exfiltrates data, modifies models, or disrupts service.

Impact

Data breach. Service disruption. Difficult to detect and attribute.

Mitigation

Implement least-privilege access. Audit all access and changes. Use multi-person approval for sensitive operations. Monitor for anomalous behavior.

10

API Key Compromise

Attack Vector

Stolen or leaked API keys used to access inference service, potentially at scale.

Impact

Unauthorized usage. Cost explosion. Potential data access.

Mitigation

Implement key rotation. Use short-lived tokens. Monitor for anomalous usage patterns. Implement IP allowlisting where possible.

Security Best Practices

  • āœ“ Implement authentication and authorization for all API endpoints
  • āœ“ Use TLS for all communications, including internal service-to-service
  • āœ“ Implement rate limiting at multiple levels (global, per-user, per-endpoint)
  • āœ“ Set and enforce maximum input and output length limits
  • āœ“ Implement request timeout to prevent resource exhaustion
  • āœ“ Use network segmentation to isolate inference infrastructure
  • āœ“ Implement audit logging for all requests and administrative actions
  • āœ“ Use secrets management for API keys, credentials, and certificates
  • āœ“ Regularly update and patch all software components
  • āœ“ Implement input validation and sanitization
  • āœ“ Use output filtering for sensitive content
  • āœ“ Implement monitoring and alerting for security events
  • āœ“ Conduct regular security assessments and penetration testing
  • āœ“ Implement incident response procedures for security events
  • āœ“ Use minimal container images and runtime privileges

Data Protection

  • šŸ”’Encrypt all data at rest including model weights, KV cache (if persisted), and logs
  • šŸ”’Use TLS 1.3 for all data in transit
  • šŸ”’Implement data classification to identify sensitive content in prompts and outputs
  • šŸ”’Use data loss prevention (DLP) tools to detect sensitive data in outputs
  • šŸ”’Implement retention policies for inference logs with appropriate deletion
  • šŸ”’Use anonymization or pseudonymization for logs where possible (see the sketch after this list)
  • šŸ”’Implement access controls limiting who can view inference logs
  • šŸ”’Consider on-premises or private cloud deployment for highly sensitive workloads
  • šŸ”’Implement data residency controls for geographic compliance requirements
  • šŸ”’Use secure deletion procedures for decommissioned storage
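A minimal sketch of keyed pseudonymization for inference logs, as referenced in the list above: stable user identifiers are replaced with a keyed HMAC so logs remain joinable for debugging without exposing the raw identifier. The hard-coded key is a placeholder; a real deployment would load and rotate it through a secrets manager.

```python
import hmac
import hashlib

# Illustrative pseudonymization for inference logs. Key management (storage
# in a secrets manager, rotation) is out of scope for this sketch.
PSEUDONYM_KEY = b"load-from-secrets-manager"   # placeholder, never hard-code

def pseudonymize(user_id: str) -> str:
    return hmac.new(PSEUDONYM_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

log_record = {
    "user": pseudonymize("user-8472"),   # keyed hash instead of the real ID
    "prompt_tokens": 412,
    "output_tokens": 128,
}
```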

Compliance Implications

GDPR (General Data Protection Regulation)

Requirement:

Personal data processing must be lawful, fair, and transparent. Data subjects have rights to access, rectification, and erasure.

Implementation:

Implement data minimization in prompts. Provide mechanisms for data subject requests. Document lawful basis for processing. Implement data retention policies.

CCPA (California Consumer Privacy Act)

Requirement:

Consumers have right to know what personal information is collected and to request deletion.

Implementation:

Implement data inventory for inference logs. Provide opt-out mechanisms. Honor deletion requests including inference logs.

HIPAA (Health Insurance Portability and Accountability Act)

Requirement:

Protected health information (PHI) must be safeguarded with appropriate administrative, physical, and technical safeguards.

Implementation:

Use dedicated infrastructure for PHI workloads. Implement encryption at rest and in transit. Audit all access. Sign BAAs with cloud providers.

SOC 2 (Service Organization Control 2)

Requirement:

Demonstrate controls for security, availability, processing integrity, confidentiality, and privacy.

Implementation:

Document and implement controls. Maintain audit trails. Conduct regular assessments. Implement change management procedures.

PCI DSS (Payment Card Industry Data Security Standard)

Requirement:

Protect cardholder data with specific technical and operational requirements.

Implementation:

Isolate systems processing payment data. Implement strong access controls. Encrypt cardholder data. Regular vulnerability scanning.

AI Act (EU Artificial Intelligence Act)

Requirement:

High-risk AI systems must meet requirements for transparency, human oversight, and risk management.

Implementation:

Document AI system capabilities and limitations. Implement human oversight mechanisms. Conduct risk assessments. Maintain technical documentation.

FedRAMP (Federal Risk and Authorization Management Program)

Requirement:

Cloud services for federal agencies must meet standardized security requirements.

Implementation:

Use FedRAMP-authorized cloud infrastructure. Implement required security controls. Maintain continuous monitoring. Document system security plan.

ISO 27001

Requirement:

Implement information security management system (ISMS) with appropriate controls.

Implementation:

Establish ISMS policies and procedures. Implement risk assessment process. Conduct internal audits. Maintain continuous improvement.

Scaling Guide

Scaling Dimensions

Horizontal Scaling (Replicas)

Strategy:

Add more independent model replicas behind a load balancer. Each replica handles a portion of traffic. Throughput scales linearly with replica count.

Limits:

Limited by load balancer capacity, network bandwidth, and cost. Requires the model to fit in a single replica's memory (or the use of model parallelism within each replica).

Considerations:

Simplest scaling approach. No inter-replica communication overhead. Requires effective load balancing. Cold start time affects scaling responsiveness.

Vertical Scaling (GPU Upgrade)

Strategy:

Use more powerful GPUs with higher memory bandwidth and capacity. Enables larger batch sizes and faster per-token generation.

Limits:

Limited by available GPU tiers. Diminishing returns as memory bandwidth scales slower than cost. Single point of failure.

Considerations:

May require code changes for new GPU architectures. Cost per GPU increases faster than performance. Consider total cost of ownership.

Tensor Parallelism Scaling

Strategy:

Shard model across more GPUs within a node to increase aggregate memory bandwidth and enable larger models.

Limits:

Limited by interconnect bandwidth (NVLink). Communication overhead increases with parallelism degree. Typically limited to 8 GPUs per node.

Considerations:

Requires high-bandwidth interconnects for efficiency. All GPUs must participate in each request. Best for latency-sensitive workloads.

Pipeline Parallelism Scaling

Strategy:

Distribute model layers across multiple nodes, enabling very large models and high throughput for large batches.

Limits:

Pipeline bubbles reduce efficiency. Latency increases with pipeline depth. Requires careful micro-batch sizing.

Considerations:

Best for throughput-focused batch processing. Can cross node boundaries. Requires sophisticated scheduling.
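A commonly used back-of-envelope estimate for the idle ("bubble") fraction of a GPipe-style schedule is (p āˆ’ 1) / (m + p āˆ’ 1), where p is the number of pipeline stages and m the number of micro-batches per batch. The sketch below evaluates it for a few micro-batch counts; treat it as an upper-bound approximation, since interleaved and 1F1B schedules reduce bubbles further.

```python
# Back-of-envelope pipeline bubble estimate for a GPipe-style schedule:
# with p pipeline stages and m micro-batches, the idle fraction is roughly
# (p - 1) / (m + p - 1). Real schedulers reduce this further, so read the
# numbers as an upper-bound sketch.
def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"p=8 stages, m={m:>2} micro-batches -> bubble ā‰ˆ {bubble_fraction(8, m):.0%}")
# p=8, m=1  -> 88%  (almost all time idle)
# p=8, m=64 -> ~10% (bubbles amortized across many micro-batches)
```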

Batch Size Scaling

Strategy:

Increase batch size to improve throughput by amortizing weight loading across more sequences.

Limits:

Limited by GPU memory capacity. Increases per-request latency. Diminishing returns as compute becomes saturated.

Considerations:

Most cost-effective scaling for memory-bound workloads. Must balance against latency requirements. Requires sufficient request volume.
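To make the memory limit concrete, the sketch below estimates KV-cache size per sequence and a rough maximum batch size. The architecture numbers describe a Llama-2-7B-like model with an FP16 cache and are assumptions for illustration; substitute your model's actual configuration.

```python
# Rough KV-cache sizing to estimate the largest feasible batch size.
# Config values below approximate a Llama-2-7B-style model (FP16 cache)
# and are illustrative assumptions only.
def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, per token, per sequence
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_batch_size(gpu_mem_gb=80, weights_gb=14, max_seq_len=4096, overhead_frac=0.1):
    free_bytes = (gpu_mem_gb * (1 - overhead_frac) - weights_gb) * 1e9
    per_seq = kv_cache_bytes_per_token() * max_seq_len   # ~2 GB per full-length sequence
    return int(free_bytes // per_seq)

print(max_batch_size())   # ~27 full 4K-context sequences on an 80 GB GPU under these assumptions
```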

Geographic Scaling

Strategy:

Deploy replicas in multiple regions to reduce latency for global users and provide geographic redundancy.

Limits:

Increases operational complexity. May require data residency considerations. Cost multiplier for multi-region deployment.

Considerations:

Essential for global services. Provides disaster recovery. Consider traffic routing and failover strategies.

Model Efficiency Scaling

Strategy:

Use quantization, distillation, or smaller models to increase throughput per GPU without adding hardware.

Limits:

Quality degradation at aggressive optimization levels. Diminishing returns from further compression. May not apply to all use cases.

Considerations:

Most cost-effective scaling approach when applicable. Validate quality impact. Consider multiple model tiers for different use cases.

Speculative Decoding Scaling

Strategy:

Implement speculative decoding to increase effective throughput without additional hardware for suitable workloads.

Limits:

Effectiveness depends on workload predictability. Requires additional memory for draft model. Not all workloads benefit.

Considerations:

Can provide 2-3x improvement for suitable workloads. Requires draft model selection and tuning. Monitor acceptance rates.
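For suitability analysis, the expected speedup can be estimated from the measured acceptance rate using the analysis in Leviathan et al. (2023; see References). The sketch below applies that estimate with illustrative acceptance rates and draft-model costs; acceptance rates must be measured on the real workload.

```python
# Back-of-envelope speculative decoding speedup, following the analysis in
# Leviathan et al. (2023): with per-token acceptance rate alpha, gamma draft
# tokens per step, and draft cost c (draft forward time relative to the
# target model), the expected speedup is roughly
#   (1 - alpha**(gamma + 1)) / ((1 - alpha) * (gamma * c + 1))
# The alpha, gamma, and c values below are illustrative.
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    step_cost = gamma * c + 1          # gamma draft passes + 1 target verification pass
    return expected_tokens / step_cost

print(expected_speedup(alpha=0.85, gamma=4, c=0.05))   # ā‰ˆ 3.1x for highly predictable output
print(expected_speedup(alpha=0.60, gamma=4, c=0.05))   # ā‰ˆ 1.9x for less predictable text
```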

Capacity Planning

Key Factors:
  • Peak request rate (requests per second)
  • Average and P99 input token count
  • Average and P99 output token count
  • Latency SLOs (TTFT, ITL, total)
  • Availability requirements (uptime SLA)
  • Growth projections (monthly/yearly)
  • Seasonal patterns and special events
  • Budget constraints
Formula:

Required replicas = (peak_requests_per_second Ɨ (avg_input_tokens + avg_output_tokens)) / (throughput_per_replica Ɨ target_utilization) Ɨ redundancy_factor
Safety Margin:

Plan for 30-50% headroom above expected peak load. Account for replica failures (N+1 or N+2 redundancy). Consider scaling lag time (GPU startup is minutes, not seconds). Include buffer for unexpected traffic spikes.
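The sketch below implements the formula directly with illustrative inputs. Note that it lumps prefill and decode into a single throughput figure; teams with long-context workloads sometimes plan the two phases separately.

```python
import math

# Direct implementation of the capacity-planning formula above.
# Inputs are illustrative; substitute measured per-replica throughput and
# your own utilization target and redundancy policy.
def required_replicas(peak_rps: float,
                      avg_input_tokens: float,
                      avg_output_tokens: float,
                      throughput_per_replica: float,   # tokens/second per replica
                      target_utilization: float = 0.6,
                      redundancy_factor: float = 1.25) -> int:
    tokens_per_second = peak_rps * (avg_input_tokens + avg_output_tokens)
    raw = tokens_per_second / (throughput_per_replica * target_utilization)
    return math.ceil(raw * redundancy_factor)

# Example: 50 req/s peak, 800 input + 300 output tokens, 4,000 tok/s per replica
print(required_replicas(50, 800, 300, 4_000))   # -> 29 replicas
```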

Scaling Milestones

Development (< 10 requests/second)
Challenges:
  • Establishing baseline performance
  • Setting up monitoring
  • Validating model quality
Architecture Changes:

Single GPU deployment sufficient. Focus on correctness over optimization. Use managed services if available.

Early Production (10-100 requests/second)
Challenges:
  • Implementing continuous batching
  • Setting up autoscaling
  • Establishing SLOs
Architecture Changes:

Multiple replicas with load balancing. Implement basic monitoring and alerting. Consider quantization for cost efficiency.

Growth (100-1000 requests/second)
Challenges:
  • Optimizing throughput per GPU
  • Managing costs at scale
  • Handling traffic variability
Architecture Changes:

Sophisticated autoscaling. Advanced batching optimization. Consider multi-region deployment. Implement cost monitoring.

Scale (1000-10000 requests/second)
Challenges:
  • Load balancer capacity
  • Operational complexity
  • Cost optimization at scale
Architecture Changes:

Multiple load balancer tiers. Dedicated infrastructure teams. Advanced optimization (speculative decoding, custom kernels). Multi-region with traffic management.

Hyperscale (> 10000 requests/second)
Challenges:
  • Custom infrastructure requirements
  • Extreme cost sensitivity
  • Global availability
Architecture Changes:

Custom hardware and software optimization. Dedicated capacity planning team. Sophisticated traffic management. Consider custom silicon (TPUs, custom ASICs).

Benchmarks

Industry Benchmarks

Metric | P50 | P95 | P99 | World Class
Decode throughput (7B model, single A100 80GB) | 2500 tokens/second | 3500 tokens/second | 4000 tokens/second | 5000+ tokens/second
Decode throughput (70B model, 8x A100 80GB tensor parallel) | 800 tokens/second | 1200 tokens/second | 1500 tokens/second | 2000+ tokens/second
Decode throughput (7B model, single H100 80GB) | 4000 tokens/second | 5500 tokens/second | 6500 tokens/second | 8000+ tokens/second
Prefill throughput (7B model, single A100 80GB) | 15000 tokens/second | 25000 tokens/second | 35000 tokens/second | 50000+ tokens/second
Batch utilization (continuous batching) | 70% | 85% | 92% | 95%+
GPU memory utilization | 75% | 88% | 93% | 95%+
Cost per million tokens (7B model, cloud GPU) | $0.30 | $0.15 | $0.10 | $0.05
Throughput improvement from INT8 quantization | 1.5x | 1.8x | 1.95x | 2x
Throughput improvement from INT4 quantization | 2.5x | 3.2x | 3.7x | 4x
Speculative decoding speedup (suitable workloads) | 1.8x | 2.5x | 2.8x | 3x+
Continuous batching improvement over static batching | 2x | 4x | 6x | 10x
Multi-GPU scaling efficiency (8 GPUs, tensor parallel) | 65% | 78% | 85% | 90%+

Comparison Matrix

Serving Framework | Continuous Batching | PagedAttention | Speculative Decoding | Quantization Support | Multi-GPU | Relative Throughput
vLLM | Yes | Yes | Yes | INT8, INT4, AWQ, GPTQ | Tensor, Pipeline | 1.0 (baseline)
TGI (Text Generation Inference) | Yes | Yes | Yes | INT8, INT4, GPTQ, AWQ | Tensor | 0.9-1.0
TensorRT-LLM | Yes | Yes | Yes | INT8, INT4, FP8 | Tensor, Pipeline | 1.1-1.3
llama.cpp | Limited | No | Yes | GGUF (various) | Limited | 0.5-0.8
Triton Inference Server | Yes (with backend) | Backend dependent | Backend dependent | Backend dependent | Yes | 0.9-1.1
DeepSpeed-MII | Yes | Yes | No | INT8 | Tensor, Pipeline | 0.8-1.0

Performance Tiers

Basic

Single GPU, static batching, no optimization. Suitable for development and low-traffic production.

Target:

500-1500 tokens/second (7B model), 50-150 tokens/second (70B model)

Optimized

Continuous batching, basic memory optimization. Standard production deployment.

Target:

2000-3500 tokens/second (7B model), 300-600 tokens/second (70B model)

Advanced

Quantization, PagedAttention, tuned batch sizes. Cost-optimized production.

Target:

4000-6000 tokens/second (7B model), 600-1000 tokens/second (70B model)

Expert

Speculative decoding, custom kernels, hardware-specific optimization. Maximum efficiency.

Target:

6000-10000 tokens/second (7B model), 1000-2000 tokens/second (70B model)

World-Class

Custom hardware, proprietary optimizations, dedicated infrastructure. Hyperscale deployments.

Target:

10000+ tokens/second (7B model), 2000+ tokens/second (70B model)

Real World Examples

1

High-Volume Customer Service Chatbot

Context

E-commerce company deploying LLM-powered customer service handling 10,000+ concurrent conversations during peak shopping periods. Latency requirements moderate (2-3 second response acceptable), cost sensitivity high.

Approach

Deployed 7B parameter model with INT8 quantization on replicated A100 instances. Implemented continuous batching with vLLM. Used autoscaling based on queue depth. Implemented request prioritization for VIP customers.

Outcome

Achieved 4,500 tokens/second per GPU, supporting 15,000 concurrent conversations on 8 GPUs. Cost per conversation reduced 60% compared to API-based solution. P95 latency maintained under 2 seconds.

Lessons Learned
  • šŸ’”Quantization provided 1.8x throughput improvement with negligible quality impact for this use case
  • šŸ’”Continuous batching was essential for handling variable conversation lengths
  • šŸ’”Autoscaling lag (3-5 minutes for GPU startup) required maintaining 30% headroom
  • šŸ’”Request prioritization prevented VIP customer impact during traffic spikes
2

Code Completion IDE Integration

Context

Developer tools company integrating LLM code completion into IDE. Strict latency requirements (< 200ms for suggestions), high concurrency (100,000+ developers), predictable output patterns.

Approach

Deployed 13B code-specialized model with speculative decoding using 1B draft model. Used tensor parallelism across 4 GPUs per replica. Implemented aggressive caching for common completions. Geographic distribution across 5 regions.

Outcome

Achieved 3x throughput improvement from speculative decoding due to predictable code patterns. P95 latency under 150ms. Supported 150,000 concurrent developers across global deployment.

Lessons Learned
  • šŸ’”Speculative decoding exceptionally effective for code completion (85% acceptance rate)
  • šŸ’”Geographic distribution essential for latency requirements
  • šŸ’”Caching provided significant cost reduction for common patterns
  • šŸ’”Draft model quality critical - invested in code-specific draft model training
3

Document Processing Pipeline

Context

Legal technology company processing millions of documents for contract analysis. No real-time latency requirements, extreme cost sensitivity, variable document lengths (1-100 pages).

Approach

Deployed 70B model for quality on spot instances with pipeline parallelism. Implemented offline batch processing with large batch sizes. Used INT4 quantization validated for legal domain. Implemented checkpointing for spot instance interruptions.

Outcome

Achieved cost of $0.08 per million tokens using spot instances (70% discount). Processed 10 million documents monthly. Quality maintained at 95% accuracy for contract clause extraction.

Lessons Learned
  • šŸ’”Spot instances viable for batch processing with proper checkpointing
  • šŸ’”INT4 quantization required careful validation for legal accuracy requirements
  • šŸ’”Pipeline parallelism enabled 70B model deployment cost-effectively
  • šŸ’”Variable document lengths required sophisticated batching to avoid padding waste
4

Real-Time Content Moderation

Context

Social media platform moderating user-generated content in real-time. Strict latency requirements (< 500ms), extremely high volume (1M+ posts per hour), high availability requirements (99.99%).

Approach

Deployed ensemble of 7B models specialized for different content types. Used replicated serving with geographic distribution. Implemented tiered moderation (fast model for initial screen, larger model for edge cases). Aggressive autoscaling with predictive scaling based on traffic patterns.

Outcome

Achieved 99.99% availability with sub-300ms P95 latency. Processed 1.5M posts per hour across 50 GPU instances. False positive rate maintained under 0.1%.

Lessons Learned
  • šŸ’”Tiered approach essential for balancing throughput and accuracy
  • šŸ’”Predictive autoscaling critical for handling traffic spikes
  • šŸ’”Geographic distribution required for latency at global scale
  • šŸ’”Specialized models outperformed single general model for this use case
5

RAG-Powered Enterprise Search

Context

Enterprise deploying RAG system for internal knowledge base search. Long contexts (retrieved documents), moderate latency requirements (3-5 seconds), variable query complexity.

Approach

Deployed 13B model with prefill-decode disaggregation to handle long contexts efficiently. Implemented chunked prefill to prevent decode blocking. Used KV cache offloading for session persistence. Implemented semantic caching for repeated queries.

Outcome

Achieved 2x throughput improvement from disaggregation for long-context workload. Supported 10,000 concurrent users. Average response time 2.5 seconds including retrieval.

Lessons Learned
  • šŸ’”Prefill-decode disaggregation essential for long-context RAG workloads
  • šŸ’”Chunked prefill prevented latency spikes for existing conversations
  • šŸ’”Semantic caching provided 40% hit rate, significantly reducing compute
  • šŸ’”KV cache offloading enabled cost-effective session persistence
6

Multi-Model AI Platform

Context

AI platform provider serving multiple customer models on shared infrastructure. Diverse model sizes (7B-70B), variable traffic patterns per model, cost optimization critical.

Approach

Implemented multi-model serving with dynamic model loading. Used model popularity tracking to keep hot models loaded. Implemented fair scheduling across tenants. Used spot instances for cold model serving.

Outcome

Achieved 3x better GPU utilization compared to dedicated per-model serving. Supported 50+ models on 20 GPU instances. Cold start latency under 30 seconds for model loading.

Lessons Learned
  • šŸ’”Model popularity follows power law - 20% of models handle 80% of traffic
  • šŸ’”Dynamic loading essential for cost efficiency with many models
  • šŸ’”Fair scheduling critical for multi-tenant SLO compliance
  • šŸ’”Model loading optimization (parallel loading, caching) critical for cold start
7

Streaming Voice Assistant

Context

Consumer electronics company deploying voice assistant with LLM backend. Strict time-to-first-token requirements (< 300ms), streaming output for natural speech synthesis, global deployment.

Approach

Deployed 7B model optimized for TTFT with minimal batching. Implemented streaming output with word-level chunking. Used edge caching for common responses. Geographic distribution with latency-based routing.

Outcome

Achieved P95 TTFT under 250ms. Supported 500,000 daily active users. Natural conversation flow maintained with streaming.

Lessons Learned
  • šŸ’”TTFT optimization required sacrificing some throughput efficiency
  • šŸ’”Streaming chunking strategy affected perceived naturalness
  • šŸ’”Edge caching effective for common queries (weather, time, etc.)
  • šŸ’”Latency-based routing essential for global user experience
8

Financial Document Analysis

Context

Investment firm analyzing earnings reports and financial filings. High accuracy requirements, moderate volume (thousands of documents daily), strict data security requirements.

Approach

Deployed 70B model on-premises for data security. Used full precision (no quantization) for accuracy. Implemented batch processing during market close. Extensive output validation and human review integration.

Outcome

Achieved 98% accuracy on financial metric extraction. Processed 5,000 documents daily on 16 GPU cluster. Maintained complete data isolation for compliance.

Lessons Learned
  • šŸ’”On-premises deployment required for financial data compliance
  • šŸ’”Full precision necessary for numerical accuracy in financial context
  • šŸ’”Batch processing during off-hours maximized resource utilization
  • šŸ’”Human-in-the-loop essential for high-stakes financial decisions

Industry Applications

Healthcare

Clinical documentation and medical coding assistance

Key Considerations:

HIPAA compliance requires dedicated infrastructure and audit logging. Medical terminology requires specialized models or fine-tuning. High accuracy requirements limit aggressive optimization. Integration with EHR systems adds latency constraints.

Financial Services

Trading signal analysis and risk assessment

Key Considerations:

Extreme latency sensitivity for trading applications. Regulatory requirements (SOX, FINRA) affect deployment options. Numerical accuracy critical - quantization must be validated. Audit trail requirements for all inferences.

E-commerce

Product description generation and customer service

Key Considerations:

High seasonality requires elastic scaling. Cost sensitivity drives optimization focus. Multi-language support increases model complexity. Integration with product catalogs and inventory systems.

Media and Entertainment

Content generation and personalization

Key Considerations:

Creative quality often more important than throughput. Long-form content generation requires extended context. Copyright and originality concerns affect output validation. Real-time personalization requires low latency.

Legal

Contract analysis and legal research

Key Considerations:

Accuracy requirements limit optimization aggressiveness. Long document contexts require memory optimization. Confidentiality requirements may require on-premises deployment. Citation accuracy critical for research applications.

Manufacturing

Technical documentation and maintenance assistance

Key Considerations:

Domain-specific terminology requires specialized models. Integration with IoT and sensor data. Offline capability may be required for factory floor. Safety-critical applications require extensive validation.

Education

Tutoring systems and content generation

Key Considerations:

Accessibility requirements affect response format. Age-appropriate content filtering required. Scalability for peak usage (exam periods). Integration with learning management systems.

Government

Citizen services and document processing

Key Considerations:

FedRAMP and other compliance requirements. Accessibility (Section 508) requirements. Multi-language support for diverse populations. Transparency and explainability requirements.

Telecommunications

Customer support and network operations

Key Considerations:

24/7 availability requirements. Integration with network management systems. High volume during outages requires burst capacity. Multi-channel support (voice, chat, email).

Gaming

NPC dialogue and dynamic content generation

Key Considerations:

Real-time latency requirements for interactive experiences. Creative and entertaining outputs valued over accuracy. Integration with game engines. Scaling for game launch events.

Frequently Asked Questions

Fundamentals

How does throughput differ from latency?

Throughput measures the total number of tokens generated per unit time across all requests (tokens/second), while latency measures the time for individual requests (time-to-first-token, inter-token latency, or total response time). A system can have high throughput but poor latency if it batches many requests together, causing individual requests to wait. Conversely, a system optimized for latency may have lower throughput due to smaller batch sizes. Production systems must balance both metrics based on application requirements.
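A toy calculation of this tradeoff, under the simplifying assumption that a memory-bound decode step has a large fixed cost (streaming the weights) plus a small per-sequence cost. All constants are illustrative, not measurements.

```python
# Toy illustration of the throughput/latency tradeoff: decode step time grows
# slowly with batch size because the step is dominated by streaming the
# weights, not by per-sequence compute. All constants are illustrative.
def step_time_ms(batch_size: int) -> float:
    return 20.0 + 0.15 * batch_size      # fixed weight-read cost + per-sequence cost

for batch in (1, 8, 32, 128):
    t = step_time_ms(batch)
    throughput = batch * 1000.0 / t      # tokens/second across the whole batch
    latency_256 = 256 * t / 1000.0       # seconds to generate 256 tokens per request
    print(f"batch={batch:>3}  throughput={throughput:7.0f} tok/s  "
          f"256-token latency={latency_256:4.1f}s")
# batch=1   ->  ~50 tok/s,  5.2 s per request
# batch=128 -> ~3265 tok/s, 10.0 s per request (65x throughput for ~2x latency)
```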

Glossary

A

Arithmetic Intensity

The ratio of compute operations to memory operations (FLOPS per byte), determining whether a workload is compute-bound or memory-bound.

Context: LLM decoding has low arithmetic intensity (~1-2 FLOPS/byte), making it memory-bound on modern GPUs.

Autoscaling

Automatically adjusting the number of inference replicas based on load metrics.

Context: Essential for handling variable traffic while optimizing cost. Limited by GPU startup time (minutes).

B

Batch Size

The number of sequences processed simultaneously in a single forward pass.

Context: Larger batch sizes improve throughput but increase latency. Limited by GPU memory capacity.

Batch Utilization

The ratio of actual batch size to maximum possible batch size, indicating how efficiently batching is being used.

Context: Low batch utilization indicates potential for throughput improvement through better batching or increased load.

C

Chunked Prefill

Breaking long prefill operations into smaller chunks interleaved with decode iterations.

Context: Prevents prefill from blocking decode progress, improving latency consistency for mixed workloads.

Continuous Batching

A serving technique that allows requests to enter and exit batches at each decode iteration, rather than waiting for all sequences in a batch to complete.

Context: Essential for production throughput, providing 2-10x improvement over static batching for variable-length workloads.

D

Decode Phase

The autoregressive generation phase where tokens are generated one at a time, each depending on all previous tokens.

Context: Memory-bound phase that typically dominates inference time and determines throughput for generative workloads.

F

Flash Attention

An optimized attention implementation that reduces memory footprint and improves memory access patterns.

Context: Enables longer sequences and larger batches by reducing attention memory from O(N²) to O(N).

H

HBM (High Bandwidth Memory)

Stacked memory technology used in GPUs providing high bandwidth for AI workloads.

Context: HBM bandwidth is the primary determinant of LLM decode throughput. HBM3 provides ~3 TB/s.

I

Inference Server

Software system that loads models, manages requests, and executes inference operations.

Context: Examples include vLLM, TGI, TensorRT-LLM, and Triton. Choice significantly affects achievable throughput.

Inter-Token Latency (ITL)

The time between consecutive token generations during the decode phase.

Context: Affects streaming experience and perceived generation speed. Determined by decode throughput and batch size.

K

KV Cache

Storage for key and value projection tensors from previous tokens, enabling efficient attention computation without recomputing past representations.

Context: Major consumer of GPU memory during inference, growing linearly with sequence length and limiting batch sizes.

L

Load Balancing

Distributing requests across multiple inference replicas to optimize utilization and performance.

Context: Must account for inference-specific characteristics like request length and GPU memory state.

M

Memory Bandwidth

The rate at which data can be transferred between GPU memory (HBM) and compute units, measured in GB/s or TB/s.

Context: Primary bottleneck for LLM decode throughput. H100 provides 3.35 TB/s HBM3 bandwidth.

Model Parallelism

Distributing a single model across multiple devices, including tensor and pipeline parallelism.

Context: Required for models that don't fit in single GPU memory. Introduces communication overhead.

N

NCCL (NVIDIA Collective Communications Library)

Library for multi-GPU and multi-node communication operations like all-reduce.

Context: Used for synchronization in tensor and pipeline parallel inference. Performance critical for multi-GPU throughput.

NVLink

High-bandwidth interconnect between NVIDIA GPUs enabling efficient multi-GPU communication.

Context: Provides 600+ GB/s bandwidth, essential for efficient tensor parallelism. Much faster than PCIe.

P

PagedAttention

A memory management technique that allocates KV cache in fixed-size pages rather than contiguous per-sequence blocks, eliminating fragmentation.

Context: Implemented in vLLM and other modern serving frameworks to maximize memory utilization and enable larger batch sizes.

Pipeline Parallelism

A model parallelism technique that distributes model layers sequentially across GPUs, with different GPUs processing different layers.

Context: Enables very large models across multiple nodes, with higher throughput for large batches but increased latency.

Prefill Phase

The initial phase of inference where all input tokens are processed in parallel to generate the initial KV cache.

Context: Compute-bound phase with high arithmetic intensity, distinct from the memory-bound decode phase.

Prefill-Decode Disaggregation

Separating prefill and decode phases onto different GPU pools for independent optimization.

Context: Valuable for long-context workloads where prefill and decode have very different resource requirements.

Q

Quantization

Reducing model precision from FP16/BF16 to INT8 or INT4 to decrease memory footprint and bandwidth requirements.

Context: Provides roughly linear throughput improvement (2x for INT8, 4x for INT4) with potential quality tradeoffs.

R

Request Queue

Buffer holding pending inference requests awaiting batch inclusion and processing.

Context: Queue depth indicates backlog and affects latency. Must be bounded to prevent memory exhaustion.

Roofline Model

A visual performance model showing throughput bounded by compute capacity or memory bandwidth.

Context: Useful for identifying whether workload is compute-bound or memory-bound and guiding optimization.

S

SLO (Service Level Objective)

Target performance metrics that a service commits to achieving, such as latency percentiles or availability.

Context: Throughput optimization must respect SLOs. Common SLOs include P95 latency and availability percentage.

Speculative Decoding

A technique using a smaller draft model to propose multiple tokens that are verified in parallel by the target model.

Context: Can provide 2-3x throughput improvement for predictable generation patterns without quality degradation.

T

Tensor Parallelism

A model parallelism technique that shards model layers across multiple GPUs, with each GPU computing a portion of each layer.

Context: Enables serving models larger than single GPU memory and increases aggregate memory bandwidth.

Throughput-Latency Tradeoff

The fundamental tension between maximizing aggregate throughput and minimizing individual request latency.

Context: Production systems must balance both based on application requirements and SLOs.

Time-to-First-Token (TTFT)

The latency from request arrival to generation of the first output token, critical for perceived responsiveness.

Context: Primarily determined by prefill time and queue wait time. Key metric for interactive applications.

Tokens per Second (TPS)

The primary unit of LLM throughput measurement, representing the number of tokens generated or processed per second across all concurrent requests.

Context: Used as the standard metric for comparing inference performance across different systems, models, and configurations.

References & Resources

Academic Papers

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (Kwon et al., 2023) - Introduces PagedAttention for efficient KV cache management
  • Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2023) - Foundational paper on speculative decoding
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022) - Memory-efficient attention algorithm
  • Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu et al., 2022) - Continuous batching for LLM serving
  • GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (Frantar et al., 2022) - Quantization technique for LLMs
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (Lin et al., 2023) - Advanced quantization method
  • Efficiently Scaling Transformer Inference (Pope et al., 2022) - Google's analysis of transformer inference scaling
  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (Patel et al., 2024) - Prefill-decode disaggregation

Industry Standards

  • MLPerf Inference Benchmark - Standard benchmark suite for ML inference performance
  • NVIDIA TensorRT-LLM Documentation - Reference for optimized LLM inference
  • vLLM Documentation and Benchmarks - Open-source high-throughput serving reference
  • Hugging Face Text Generation Inference (TGI) - Production-ready inference server documentation
  • OpenAI API Rate Limits and Best Practices - Industry reference for API design
  • NVIDIA GPU Architecture Whitepapers - Hardware specifications and optimization guides

Resources

  • NVIDIA Developer Blog - LLM Inference Optimization Series
  • Hugging Face Blog - Optimizing Transformers for Production
  • Anyscale Blog - Scaling LLM Inference with Ray
  • vLLM GitHub Repository and Documentation
  • PyTorch Documentation - Inference Optimization
  • CUDA Programming Guide - Memory Optimization
  • Google Cloud Architecture Center - ML Inference Best Practices
  • AWS Machine Learning Blog - SageMaker Inference Optimization

Last updated: 2026-01-05 • Version: v1.0 • Status: citation-safe-reference

Keywords: LLM throughput, tokens per second, inference throughput, batch inference