LLM Inference Fundamentals
Batching Strategies for Maximum Throughput
Batching is the key to maximizing GPU utilization and throughput. Let's explore the evolution from static to continuous batching.
The GPU Utilization Problem
GPUs are massively parallel processors designed for matrix operations. Single-request inference wastes this power:
Single Request Processing:
┌─────────────────────────────────────────────┐
│ GPU UTILIZATION: 5% │
├─────────────────────────────────────────────┤
│ ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ │
│ ███ │
│ └── Only small portion of GPU active │
│ │
│ Problem: Memory bandwidth is bottleneck │
│ GPU compute units sit idle waiting for │
│ memory reads during decode phase │
└─────────────────────────────────────────────┘
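To see why a single request cannot saturate the GPU, a rough back-of-envelope sketch of bandwidth-bound decode throughput helps. The figures below are illustrative assumptions (an 8B-parameter FP16 model and roughly H100-class memory bandwidth), and KV-cache traffic and compute limits are ignored:

# Back-of-envelope: every decode step streams the full model weights from HBM
# once, no matter how many sequences share that step -- which is why batching
# raises throughput almost for free. All numbers are illustrative assumptions.
weight_bytes = 8e9 * 2        # assumed 8B-parameter model in FP16 (~16 GB)
hbm_bandwidth = 3.35e12       # bytes/s, roughly H100 SXM-class HBM (assumed)

steps_per_second = hbm_bandwidth / weight_bytes   # ~209 weight passes per second

for batch_size in (1, 16, 64, 256):
    # Each step yields one new token per sequence in the batch.
    tokens_per_second = steps_per_second * batch_size
    print(f"batch={batch_size:>3}: ~{tokens_per_second:,.0f} tokens/s upper bound")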
Static Batching
The traditional approach: wait until a full batch has accumulated, then process all requests together:
Static Batching:
─────────────────
Step 1: Collect requests until batch is full
Step 2: Pad all sequences to maximum length
Step 3: Process entire batch together
Step 4: Wait for slowest request to complete
Step 5: Return all responses together
┌─────────────────────────────────────────────┐
│ STATIC BATCH (size=4) │
├─────────────────────────────────────────────┤
│ Request A: "Hello" → [128 tokens response] │
│ Request B: "Hi" → [512 tokens response] │
│ Request C: "Hey" → [64 tokens response] │
│ Request D: "Yo" → [256 tokens response] │
├─────────────────────────────────────────────┤
│ All padded to 512 tokens │
│ A, C, D finish early → wait for B │
│ Wasted compute: ~50% │
└─────────────────────────────────────────────┘
Problems:
- Early-finishing requests block new ones
- Padding wastes compute on empty tokens
- Unpredictable latency
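The "Wasted compute: ~50%" figure follows directly from the padding. Using the four response lengths in the diagram above:

# Wasted compute in a static batch padded to the longest response
response_lengths = [128, 512, 64, 256]                          # tokens per request
padded_slots = max(response_lengths) * len(response_lengths)    # 4 x 512 = 2048
useful_slots = sum(response_lengths)                            # 960
print(f"wasted compute: {1 - useful_slots / padded_slots:.0%}")  # -> 53%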
Continuous (In-Flight) Batching
The modern approach: rebuild the batch at every iteration, so finished requests leave and waiting requests join without delay:
Continuous Batching Timeline:
───────────────────────────────
Time 0: [A][B][C][D] start together
Time 5: [A][B][ ][D] → C done, slot freed
Time 5: [A][B][E][D] → E starts immediately
Time 10: [ ][B][E][D] → A done
Time 10: [F][B][E][D] → F starts immediately
...
┌─────────────────────────────────────────────┐
│ CONTINUOUS BATCHING │
├─────────────────────────────────────────────┤
│ │
│ Time → │
│ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │
│ │A │A │A │A │A │ │ │ │ │ │ │ │ A │
│ │B │B │B │B │B │B │B │B │B │B │ │ │ B │
│ │C │C │ │ │ │ │ │ │ │ │ │ │ C │
│ │D │D │D │D │D │D │ │ │ │ │ │ │ D │
│ │ │ │E │E │E │E │E │E │E │ │ │ │ E │
│ │ │ │ │ │ │F │F │F │F │F │F │ │ F │
│ │ │ │ │ │ │ │ │G │G │G │G │G │ G │
│ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │
│ │
│ Benefits: │
│ • Near-100% GPU utilization │
│ • No waiting for slow requests │
│ • Immediate response when complete │
│ • 2-10x throughput improvement │
│ │
└─────────────────────────────────────────────┘
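To make the throughput benefit concrete, here is a toy comparison of the two strategies on a small queue of requests. The lengths, the batch size of 4, and the one-token-per-sequence-per-iteration model are illustrative assumptions, not a benchmark:

# Toy comparison: iterations to finish a queue of requests with batch size 4,
# assuming one generated token per sequence per iteration (illustrative).
import math

lengths = [128, 512, 64, 256, 300, 200, 150]   # output length of each request

# Static batching: requests run in fixed groups of 4; each group takes as long
# as its longest member, and the next group cannot start until it finishes.
static_iters = sum(max(lengths[i:i + 4]) for i in range(0, len(lengths), 4))

# Continuous batching: a freed slot is refilled immediately, so total time is
# bounded below by the total work spread over 4 slots, or the longest request.
continuous_iters = max(math.ceil(sum(lengths) / 4), max(lengths))

print(static_iters, continuous_iters)   # 812 vs. 512 iterations (lower bound)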
Implementing Continuous Batching
Key components:
# Scheduler manages request lifecycle
from collections import deque


class ContinuousBatchScheduler:
    def __init__(self, max_batch_size: int):
        self.waiting_queue = deque()  # Requests waiting to start
        self.running_batch = []       # Active requests
        self.max_batch_size = max_batch_size

    def schedule(self):
        # Remove completed requests so their slots can be reused
        self.running_batch = [
            r for r in self.running_batch
            if not r.is_complete
        ]
        # Fill empty slots with waiting requests (FCFS order)
        available_slots = self.max_batch_size - len(self.running_batch)
        for _ in range(available_slots):
            if not self.waiting_queue:
                break
            self.running_batch.append(self.waiting_queue.popleft())
        return self.running_batch
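A minimal sketch of a serving loop built around the scheduler above. The Request dataclass and the model_step stand-in are hypothetical; in a real system, model_step would be one batched forward pass of the model:

# Sketch of a serving loop around ContinuousBatchScheduler (illustrative)
import random
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_tokens: int = 32
    output_tokens: list = field(default_factory=list)
    is_complete: bool = False

def model_step(batch):
    # Stand-in for one batched forward pass: emit one token per request
    for r in batch:
        r.output_tokens.append(random.randrange(32000))
        if len(r.output_tokens) >= r.max_tokens:
            r.is_complete = True

scheduler = ContinuousBatchScheduler(max_batch_size=4)
for prompt in ["Hello", "Hi", "Hey", "Yo", "Howdy", "Hola"]:
    scheduler.waiting_queue.append(Request(prompt))

# One iteration = free finished slots, admit waiting requests, run one step
while scheduler.waiting_queue or scheduler.running_batch:
    batch = scheduler.schedule()
    model_step(batch)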
Chunked Prefill
An optimization for long prompts: break the prefill into chunks so it can be interleaved with decode steps from other requests:
Without Chunked Prefill:
────────────────────────
Long prompt prefill blocks all decodes
[PPPPPPPPPPPP][D][D][D][D]
└── 12-step prefill blocks responses for other requests
With Chunked Prefill:
────────────────────────
[PPP][D][D][PPP][D][D][PPP][D][D][PPP][D][D]
      └── Prefill split into 4 chunks of 3 steps each
Other requests make progress between chunks
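The scheduling rule behind chunked prefill can be sketched with a per-iteration token budget: running decodes claim one token each, and whatever budget remains goes to the next slice of a pending prefill. The function and numbers below are illustrative, not vLLM's actual scheduler:

# Illustrative token-budget split between decodes and a chunked prefill
def plan_iteration(num_decoding: int, prefill_remaining: int, token_budget: int = 512):
    decode_tokens = num_decoding                    # one new token per running sequence
    budget_left = max(token_budget - decode_tokens, 0)
    prefill_chunk = min(prefill_remaining, budget_left)
    return decode_tokens, prefill_chunk

# A 6,000-token prompt arrives while 200 other sequences are decoding
remaining, iterations = 6000, 0
while remaining > 0:
    _, chunk = plan_iteration(num_decoding=200, prefill_remaining=remaining)
    remaining -= chunk
    iterations += 1

# The prompt is prefilled in 312-token chunks over ~20 iterations, while the
# 200 decoding sequences keep producing one token every iteration
print(iterations)   # 20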
vLLM V1 (January 2025) optimizes chunked prefill with:
- Automatic chunk size tuning
- Zero-overhead prefix caching
- Improved scheduling for mixed workloads
Batching Configuration
Key parameters to tune:
# vLLM configuration example
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",

    # Batching settings
    max_num_seqs=256,                # Max concurrent sequences
    max_num_batched_tokens=8192,     # Max tokens per batch iteration

    # Scheduling policy
    scheduling_policy="fcfs",        # First-come-first-served
    # Alternative: "priority" for priority-based scheduling

    # Chunked prefill
    enable_chunked_prefill=True,
    max_num_partial_prefills=4,      # Max concurrent partial prefills
)
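With the engine configured, submitting many prompts in a single generate call is enough; the engine applies continuous batching internally, so there is no need to group requests by hand:

from vllm import SamplingParams

prompts = [f"Summarize item {i} in one sentence." for i in range(100)]
params = SamplingParams(temperature=0.7, max_tokens=256)

# All 100 requests are scheduled and batched continuously inside the engine
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)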
Measuring Batching Efficiency
# Key metrics
batch_utilization = active_sequences / max_batch_size
# Target: >80%
# Tokens per iteration
tokens_per_iter = sum(active_tokens) / iterations
# Higher is better (more work per GPU round-trip)
# Queue wait time
avg_wait = sum(queue_times) / num_requests
# Target: <100ms for interactive workloads
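A minimal sketch of how these metrics might be collected from per-iteration scheduler statistics; the class and field names here are illustrative, not any particular framework's API:

# Illustrative metrics collector for a continuous-batching scheduler
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class BatchingMetrics:
    max_batch_size: int
    batch_sizes: list = field(default_factory=list)    # active sequences per iteration
    iter_tokens: list = field(default_factory=list)    # tokens processed per iteration
    queue_waits: list = field(default_factory=list)    # seconds each request waited

    def record_iteration(self, active_sequences: int, tokens_processed: int):
        self.batch_sizes.append(active_sequences)
        self.iter_tokens.append(tokens_processed)

    def report(self) -> dict:
        return {
            "batch_utilization": mean(self.batch_sizes) / self.max_batch_size,        # target > 0.8
            "tokens_per_iter": mean(self.iter_tokens),                                # higher is better
            "avg_queue_wait_s": mean(self.queue_waits) if self.queue_waits else 0.0,  # target < 0.1
        }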
Continuous batching is now standard in production systems; it is the foundation of modern LLM serving.
Next, we'll explore speculative decoding for even faster generation.