LLM Inference Fundamentals

Batching Strategies for Maximum Throughput

Batching is the key to maximizing GPU utilization and throughput. Let's explore the evolution from static to continuous batching.

The GPU Utilization Problem

GPUs are massively parallel processors designed for matrix operations. Single-request inference wastes this power:

Single Request Processing:
┌─────────────────────────────────────────────┐
│            GPU UTILIZATION: 5%              │
├─────────────────────────────────────────────┤
│  ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  │
│  ███                                        │
│  └── Only small portion of GPU active       │
│                                             │
│  Problem: Memory bandwidth is bottleneck    │
│  GPU compute units sit idle waiting for     │
│  memory reads during decode phase           │
└─────────────────────────────────────────────┘
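
A rough back-of-the-envelope sketch shows why (the 70B/FP16 model size and ~2 TB/s HBM bandwidth below are illustrative assumptions): every decode step has to stream the full weights from memory, so at batch size 1 the token rate is capped by bandwidth, and batching amortizes that same weight read across many sequences.

# Back-of-the-envelope: decode is memory-bandwidth bound at small batch sizes
params = 70e9                             # assumed 70B-parameter model
bytes_per_param = 2                       # FP16
hbm_bandwidth = 2e12                      # assumed ~2 TB/s of HBM bandwidth

weight_bytes = params * bytes_per_param   # weights streamed once per decode step
step_time = weight_bytes / hbm_bandwidth  # lower bound; ignores KV-cache reads

for batch_size in (1, 8, 64):
    # The same weight read serves every sequence in the batch, so throughput
    # scales roughly linearly until compute becomes the limiting factor.
    print(f"batch={batch_size:3d}: ~{batch_size / step_time:,.0f} tokens/s")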

Static Batching

The traditional approach: wait for a batch, process together:

Static Batching:
─────────────────
Step 1: Collect requests until batch is full
Step 2: Pad all sequences to maximum length
Step 3: Process entire batch together
Step 4: Wait for slowest request to complete
Step 5: Return all responses together
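
In code, steps 1 and 2 amount to something like this minimal sketch (token IDs assumed already tokenized, with 0 as the pad ID):

# Minimal static batcher: collect, pad, run as one unit
def build_static_batch(pending: list[list[int]], batch_size: int, pad_id: int = 0):
    # Step 1: take requests until the batch is full
    batch = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]

    # Step 2: pad every sequence to the longest one in the batch
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

    # Steps 3-5 happen in the model: the batch runs together and nothing
    # is returned until the longest generation has finished.
    return padded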

┌─────────────────────────────────────────────┐
│              STATIC BATCH (size=4)          │
├─────────────────────────────────────────────┤
│  Request A: "Hello" → [128 tokens response] │
│  Request B: "Hi"    → [512 tokens response] │
│  Request C: "Hey"   → [64 tokens response]  │
│  Request D: "Yo"    → [256 tokens response] │
├─────────────────────────────────────────────┤
│  All padded to 512 tokens                   │
│  A, C, D finish early → wait for B          │
│  Wasted compute: ~50%                       │
└─────────────────────────────────────────────┘
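
The "~50%" figure is just arithmetic on the token counts above (a quick check):

# Useful output vs. slots reserved when padding everything to the longest response
response_lengths = {"A": 128, "B": 512, "C": 64, "D": 256}

useful = sum(response_lengths.values())                             # 960 tokens actually generated
reserved = max(response_lengths.values()) * len(response_lengths)   # 4 x 512 = 2048 slots

waste = 1 - useful / reserved
print(f"wasted decode slots: {waste:.0%}")                          # ~53%, roughly half the batch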

Problems:

  • Slots freed by early finishers sit idle; new requests can't start until the whole batch is done
  • Padding wastes compute on empty tokens
  • Latency is unpredictable, gated by the slowest request in the batch

Continuous (In-Flight) Batching

The modern approach: manage the batch dynamically, freeing a slot the moment a request completes and filling it from the queue right away:

Continuous Batching Timeline:
───────────────────────────────
Time 0:  [A][B][C][D] start together
Time 5:  [A][B][ ][D] → C done, slot freed
Time 5:  [A][B][E][D] → E starts immediately
Time 10: [ ][B][E][D] → A done
Time 10: [F][B][E][D] → F starts immediately
...

┌─────────────────────────────────────────────┐
│         CONTINUOUS BATCHING                 │
├─────────────────────────────────────────────┤
│                                             │
│  Time →                                     │
│  ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐    │
│  │A │A │A │A │A │  │  │  │  │  │  │  │ A  │
│  │B │B │B │B │B │B │B │B │B │B │  │  │ B  │
│  │C │C │  │  │  │  │  │  │  │  │  │  │ C  │
│  │D │D │D │D │D │D │  │  │  │  │  │  │ D  │
│  │  │  │E │E │E │E │E │E │E │  │  │  │ E  │
│  │  │  │  │  │  │F │F │F │F │F │F │  │ F  │
│  │  │  │  │  │  │  │  │G │G │G │G │G │ G  │
│  └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘    │
│                                             │
│  Benefits:                                  │
│  • Near-100% GPU utilization               │
│  • No waiting for slow requests            │
│  • Immediate response when complete        │
│  • 2-10x throughput improvement            │
│                                             │
└─────────────────────────────────────────────┘

Implementing Continuous Batching

The core component is a scheduler that retires finished requests and admits waiting ones at every iteration:

# Scheduler manages request lifecycle
class ContinuousBatchScheduler:
    def __init__(self, max_batch_size: int):
        self.waiting_queue = []      # Requests waiting to start
        self.running_batch = []      # Active requests
        self.max_batch_size = max_batch_size

    def schedule(self):
        # Remove completed requests
        self.running_batch = [
            r for r in self.running_batch
            if not r.is_complete
        ]

        # Fill empty slots with waiting requests
        available_slots = self.max_batch_size - len(self.running_batch)

        for _ in range(available_slots):
            if self.waiting_queue:
                request = self.waiting_queue.pop(0)
                self.running_batch.append(request)

        return self.running_batch
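
A toy driver shows the slot-reuse behavior from the timeline. The Request class here is a stand-in for illustration (a real engine also tracks the prompt, KV-cache blocks, and sampling state):

import random
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    target_len: int      # tokens to generate before this request is finished
    generated: int = 0

    @property
    def is_complete(self) -> bool:
        return self.generated >= self.target_len

random.seed(0)
scheduler = ContinuousBatchScheduler(max_batch_size=4)
scheduler.waiting_queue = [
    Request(rid=name, target_len=random.randint(2, 8)) for name in "ABCDEFG"
]

step = 0
while scheduler.waiting_queue or scheduler.running_batch:
    batch = scheduler.schedule()          # drop finished requests, admit waiting ones
    if not batch:
        break
    for request in batch:
        request.generated += 1            # one decode token per active request per step
    print(f"step {step:2d}: {[r.rid for r in batch]}")
    step += 1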

Chunked Prefill

An optimization for long prompts: split the prefill into chunks so decode steps for other requests can run in between:

Without Chunked Prefill:
────────────────────────
Long prompt prefill blocks all decodes
[PPPPPPPPPPPP][D][D][D][D]
└── 12-step prefill blocks responses for other requests

With Chunked Prefill:
────────────────────────
[PPP][D][D][PPP][D][D][PPP][D][D][PPP][D][D]
└── Prefill split into 4 chunks
    Other requests make progress between chunks
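
A minimal sketch of the chunking decision, assuming a fixed per-iteration token budget (the function and parameter names are illustrative, not vLLM's API):

def plan_iteration(prefill_remaining: int, decode_requests: int, token_budget: int):
    """Split one iteration's token budget between decodes and a prefill chunk."""
    # Each running decode request needs one token slot this iteration;
    # whatever budget is left becomes the long prompt's next prefill chunk.
    decode_tokens = min(decode_requests, token_budget)
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_chunk

# Example: a 12-token prefill, 2 running decodes, budget of 5 tokens per iteration
remaining, step = 12, 0
while remaining > 0:
    decodes, chunk = plan_iteration(remaining, decode_requests=2, token_budget=5)
    remaining -= chunk
    print(f"iter {step}: {decodes} decode tokens + {chunk}-token prefill chunk ({remaining} left)")
    step += 1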

vLLM V1 (January 2025) optimizes chunked prefill with:

  • Automatic chunk size tuning
  • Zero-overhead prefix caching
  • Improved scheduling for mixed workloads

Batching Configuration

Key parameters to tune:

# vLLM configuration example
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",

    # Batching settings
    max_num_seqs=256,           # Max concurrent sequences
    max_num_batched_tokens=8192, # Max tokens per batch iteration

    # Scheduling policy
    scheduling_policy="fcfs",   # First-come-first-served
    # Alternative: "priority" for priority-based scheduling

    # Chunked prefill
    enable_chunked_prefill=True,
    max_num_partial_prefills=4, # Max concurrent prefills
)
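
With the engine configured, batching is transparent to callers: submit many prompts in one call and the scheduler keeps them in flight together. A minimal usage sketch, reusing the llm object above:

from vllm import SamplingParams

prompts = ["Summarize continuous batching in one sentence."] * 32
params = SamplingParams(max_tokens=128, temperature=0.7)

# All 32 prompts are scheduled by the continuous batcher; results come
# back in the same order as the prompts once each request completes.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)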

Measuring Batching Efficiency

# Key metrics
batch_utilization = active_sequences / max_batch_size
# Target: >80%

# Tokens per iteration
tokens_per_iter = sum(active_tokens) / iterations
# Higher is better (more work per GPU round-trip)

# Queue wait time
avg_wait = sum(queue_times) / num_requests
# Target: <100ms for interactive workloads
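
A sketch of how these could be computed from counters collected in the serving loop (the sample numbers below are made up for illustration):

from statistics import mean

# Hypothetical per-iteration samples collected while serving
active_seqs_per_iter = [210, 245, 198, 230]
batched_tokens_per_iter = [6900, 7800, 6400, 7200]
queue_wait_ms = [12, 45, 8, 150, 30]        # per-request time spent waiting
MAX_BATCH_SIZE = 256                        # matches max_num_seqs above

batch_utilization = mean(active_seqs_per_iter) / MAX_BATCH_SIZE
tokens_per_iter = mean(batched_tokens_per_iter)
avg_wait_ms = mean(queue_wait_ms)

print(f"batch utilization: {batch_utilization:.0%}")   # target: >80%
print(f"tokens per iteration: {tokens_per_iter:.0f}")  # higher is better
print(f"avg queue wait: {avg_wait_ms:.0f} ms")         # target: <100 ms interactive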

Continuous batching is now standard in production systems—it's the foundation for all modern LLM serving.

Next, we'll explore speculative decoding for even faster generation.
