TensorRT-LLM & GPU Optimization

TensorRT-LLM Performance Tuning

3 min read

This lesson covers advanced performance tuning techniques for TensorRT-LLM to maximize throughput and minimize latency in production deployments.

Key Performance Metrics

┌─────────────────────────────────────────────────────────────┐
│              Performance Metrics Dashboard                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Throughput Metrics          Latency Metrics                │
│  ─────────────────          ────────────────                │
│  • Tokens/second            • TTFT (Time to First Token)    │
│  • Requests/second          • TPOT (Time per Output Token)  │
│  • GPU utilization          • P50/P95/P99 latency           │
│                                                             │
│  Resource Metrics            Quality Metrics                │
│  ────────────────           ───────────────                 │
│  • GPU memory usage         • KV cache hit rate             │
│  • Batch utilization        • Preemption rate               │
│  • Queue depth              • Timeout rate                  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
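These metrics are easy to compute once the serving layer records a few timestamps per request. The sketch below is a minimal, framework-agnostic example, assuming a hypothetical RequestTiming record collected by your own client or gateway (it is not a TensorRT-LLM API); it shows how TTFT, TPOT, throughput, and latency percentiles relate to those timestamps.

from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RequestTiming:
    # Hypothetical per-request timestamps collected by your serving layer
    submit_time: float        # when the request entered the queue
    first_token_time: float   # when the first output token was returned
    finish_time: float        # when the last output token was returned
    num_output_tokens: int

def summarize(timings: list[RequestTiming]) -> dict:
    ttft = [t.first_token_time - t.submit_time for t in timings]
    # TPOT: average time per output token after the first one
    tpot = [
        (t.finish_time - t.first_token_time) / max(t.num_output_tokens - 1, 1)
        for t in timings
    ]
    e2e = [t.finish_time - t.submit_time for t in timings]
    cuts = quantiles(e2e, n=100)  # 99 cut points: index 49 = P50, 94 = P95, 98 = P99

    total_tokens = sum(t.num_output_tokens for t in timings)
    duration = max(t.finish_time for t in timings) - min(t.submit_time for t in timings)

    return {
        "mean_ttft_s": sum(ttft) / len(ttft),
        "mean_tpot_s": sum(tpot) / len(tpot),
        "p50_s": cuts[49], "p95_s": cuts[94], "p99_s": cuts[98],
        "tokens_per_second": total_tokens / duration,
    }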

Batch Size Optimization

The optimal batch size is a trade-off between throughput and latency:

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

# Profile different batch sizes
batch_configs = [
    {"max_batch_size": 32, "use_case": "low_latency"},
    {"max_batch_size": 128, "use_case": "balanced"},
    {"max_batch_size": 512, "use_case": "high_throughput"},
]

# Optimal configuration for balanced workloads
build_config = BuildConfig(
    max_batch_size=128,
    max_num_tokens=8192,  # Total tokens across batch

    # Enable dynamic batching
    enable_chunked_context=True,
    max_chunk_size=2048,

    # Batch scheduling
    scheduler_policy="guaranteed_no_evict",
)

Batch Size Guidelines

Use Case           Batch Size   TTFT Impact   Throughput
─────────────────  ───────────  ────────────  ──────────
Real-time chat     16-32        Minimal       Medium
API service        64-128       Moderate      High
Batch processing   256-512      Higher        Maximum
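
These guidelines translate directly into an initial build configuration. The helper below is a hypothetical convenience for this lesson (choose_build_config and its preset table are not part of TensorRT-LLM); treat the values as starting points to refine with profiling.

from tensorrt_llm.builder import BuildConfig

# Hypothetical starting points taken from the table above; tune with profiling.
_BATCH_PRESETS = {
    "low_latency": 32,       # real-time chat
    "balanced": 128,         # API service
    "high_throughput": 512,  # offline batch processing
}

def choose_build_config(use_case: str) -> BuildConfig:
    """Return an initial BuildConfig for the given use case."""
    return BuildConfig(
        max_batch_size=_BATCH_PRESETS[use_case],
        max_num_tokens=8192,
    )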

Memory Optimization

KV Cache Configuration

build_config = BuildConfig(
    # Paged KV cache for memory efficiency
    use_paged_context_fmha=True,

    # KV cache quantization
    kv_cache_quant_algo="FP8",  # 50% memory reduction

    # Block size for paged attention
    tokens_per_block=64,  # Default, good for most cases

    # Maximum KV cache memory (fraction of GPU memory)
    kv_cache_fraction=0.9,  # Reserve 10% for activations
)
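
Whether a given kv_cache_fraction is enough comes down to arithmetic: each cached token stores one key and one value vector per layer. The estimator below is a back-of-envelope sketch, using illustrative Llama-3.1-70B-style parameters (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and an assumed amount of GPU memory left over after loading the weights.

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_elem: float) -> float:
    # 2x for the key and the value tensor stored at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Illustrative values for a Llama-3.1-70B-style model with an FP8 KV cache
per_token = kv_cache_bytes_per_token(
    num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_elem=1,  # FP8 = 1 byte
)

free_after_weights = 20 * 1024**3     # assumed memory left after loading weights
kv_budget = 0.9 * free_after_weights  # mirrors kv_cache_fraction=0.9 above
max_cached_tokens = int(kv_budget / per_token)

print(f"{per_token / 1024:.0f} KiB per token, "
      f"~{max_cached_tokens:,} tokens fit in the KV cache budget")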

Speculative Decoding Integration

TensorRT-LLM supports speculative decoding for faster generation:

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

# Configure speculative decoding for the target model
build_config = BuildConfig(
    # Main model settings
    max_batch_size=64,
    max_seq_len=8192,

    # Speculative decoding
    speculative_decoding_mode="draft_tokens_external",
    max_draft_tokens=5,  # Number of draft tokens per step
)

# The draft model uses a standard (non-speculative) build config
draft_config = BuildConfig(
    max_batch_size=64,
    max_seq_len=8192,
)

# Build both draft and target models
draft_llm = LLM(
    model="/models/llama-3.1-8b",  # Smaller draft model
    build_config=draft_config,
)

target_llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
)

Speculative decoding can provide a 2-3x speedup on generation-heavy workloads; the actual gain depends on how often the target model accepts the draft model's tokens.
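
Under the standard speculative decoding analysis, with per-token acceptance probability α and k draft tokens per verification step, the expected number of tokens produced per target-model forward pass is (1 − α^(k+1)) / (1 − α). The sketch below simply evaluates that expectation; it is a rough model of the potential speedup, not a measurement of TensorRT-LLM itself.

def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens generated per target-model forward pass
    with k draft tokens and per-token acceptance probability alpha."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With k = 5 draft tokens (max_draft_tokens=5 above):
for alpha in (0.6, 0.7, 0.8, 0.9):
    e = expected_tokens_per_step(alpha, k=5)
    print(f"acceptance {alpha:.1f}: ~{e:.2f} tokens per target step")

# Ignoring draft-model cost, the speedup is roughly this value compared with
# one token per step; the draft model's own overhead reduces it in practice.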

Kernel Auto-Tuning

TensorRT-LLM can auto-tune kernels for your specific GPU:

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

build_config = BuildConfig(
    # Enable kernel auto-tuning
    builder_opt_level=5,  # Maximum optimization (0-5)

    # Tactic selection
    max_workspace_size=8 * 1024**3,  # 8GB for tuning

    # Profile for specific GPU
    # Tuning takes longer but produces faster engines
)

# Build with tuning (takes longer, runs faster)
llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
    # Save tuned engine
    engine_dir="/engines/llama-3.1-70b-tuned",
)
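
Because higher builder_opt_level values mainly trade build time for runtime speed, it is worth timing both sides before settling on a level. The loop below is a minimal sketch using the same LLM/BuildConfig pattern as above; the opt levels and model path are illustrative.

import time

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

for level in (3, 5):
    start = time.perf_counter()
    llm = LLM(
        model="/models/llama-3.1-70b",
        build_config=BuildConfig(builder_opt_level=level),
    )
    build_s = time.perf_counter() - start
    print(f"builder_opt_level={level}: engine built in {build_s:.0f} s")
    # Next: run a fixed set of benchmark prompts against each engine and
    # compare tokens/second before choosing a level for production builds.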

Runtime Optimization

Request Scheduling

from tensorrt_llm.executor import ExecutorConfig

executor_config = ExecutorConfig(
    # Batching strategy
    batching_type="INFLIGHT",  # Continuous batching

    # Scheduling
    scheduler_policy="MAX_UTILIZATION",

    # Request handling
    max_queue_delay_microseconds=1000,  # 1ms max queue time

    # Parallel execution
    parallel_config={
        "communication_mode": "LEADER",
        "participant_ids": [0, 1],  # GPU IDs
    },
)
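
max_queue_delay_microseconds trades a small amount of added latency for fuller batches: at an arrival rate of λ requests per second, waiting d seconds after the first request collects roughly λ·d additional requests. The sketch below is only this back-of-envelope model (it is not an ExecutorConfig feature), but it helps pick a delay that matches your traffic.

def expected_batch_fill(arrival_rate_rps: float, max_queue_delay_s: float,
                        max_batch_size: int) -> float:
    """Rough expected number of requests batched together if the scheduler
    waits up to max_queue_delay_s after the first request arrives."""
    return min(1 + arrival_rate_rps * max_queue_delay_s, max_batch_size)

# 1 ms of queue delay (as configured above) at different traffic levels
for rps in (100, 1_000, 10_000):
    fill = expected_batch_fill(rps, max_queue_delay_s=0.001, max_batch_size=128)
    print(f"{rps:>6} req/s -> ~{fill:.0f} requests per scheduling window")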

Streaming Configuration

import asyncio

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/engines/llama-3.1-70b")

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
    # Streaming settings
    stream=True,
    stream_interval=1,  # Yield every N tokens
)

# Streaming generation (async iteration must run inside an event loop)
async def stream_response(prompt: str) -> None:
    async for chunk in llm.generate_async(prompt, sampling_params):
        print(chunk.text, end="", flush=True)

asyncio.run(stream_response("Explain chunked prefill in one paragraph."))
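
Streaming is also a convenient place to measure TTFT and TPOT from the client side. The sketch below reuses the llm and sampling_params objects defined above and assumes the same chunk interface; the timing logic itself is plain Python.

import asyncio
import time

async def timed_stream(prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    num_chunks = 0

    async for chunk in llm.generate_async(prompt, sampling_params):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time to first token
        num_chunks += 1
        print(chunk.text, end="", flush=True)

    finish = time.perf_counter()
    ttft = first_token_at - start
    # With stream_interval=1 each chunk is one token, so this approximates TPOT
    tpot = (finish - first_token_at) / max(num_chunks - 1, 1)
    print(f"\nTTFT: {ttft * 1000:.1f} ms, TPOT: {tpot * 1000:.1f} ms")

asyncio.run(timed_stream("Summarize the benefits of a paged KV cache."))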

Profiling and Benchmarking

Built-in Benchmarking

# TensorRT-LLM benchmark tool
python -m tensorrt_llm.bench.benchmark \
  --engine_dir /engines/llama-3.1-70b \
  --dataset /data/benchmark.json \
  --num_requests 1000 \
  --concurrency 64 \
  --output_file results.json
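
If you want to post-process the benchmark output programmatically, a short script is enough. The field names below ("requests", "latency_ms", "output_tokens") are assumptions for illustration; check the schema your TensorRT-LLM version actually writes to results.json and adjust accordingly.

import json
from statistics import quantiles

with open("results.json") as f:
    results = json.load(f)

# Field names below are illustrative; adapt them to the real output schema.
latencies = [r["latency_ms"] for r in results["requests"]]
tokens = sum(r["output_tokens"] for r in results["requests"])

cuts = quantiles(latencies, n=100)  # index 49 = P50, 94 = P95, 98 = P99
print(f"P50/P95/P99 latency: {cuts[49]:.0f}/{cuts[94]:.0f}/{cuts[98]:.0f} ms")
print(f"Total output tokens: {tokens}")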

NVIDIA Nsight Integration

# Profile with Nsight Systems
nsys profile -o llm_profile \
  python inference_script.py

# Analyze with Nsight Compute for kernel details
ncu --target-processes all \
  python inference_script.py

Performance Tuning Checklist

┌─────────────────────────────────────────────────────────────┐
│              Performance Optimization Steps                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  □ Choose appropriate quantization (FP8/FP4 for Hopper+)    │
│  □ Set max_batch_size based on latency requirements         │
│  □ Enable paged attention for memory efficiency             │
│  □ Configure KV cache quantization                          │
│  □ Enable chunked prefill for long contexts                 │
│  □ Run kernel auto-tuning for production GPUs               │
│  □ Profile with representative workload                     │
│  □ Monitor GPU memory and utilization in production         │
│  □ Set appropriate queue delays for SLOs                    │
│  □ Consider speculative decoding for generation workloads   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Common Performance Issues

Issue              Symptom                    Solution
─────────────────  ─────────────────────────  ────────────────────────────────
Low GPU util       <80% utilization           Increase batch size
High TTFT          Slow first token           Enable chunked prefill
OOM errors         Memory allocation fails    Reduce batch size, use FP8/FP4
Throughput drops   Varies with load           Check preemption rate
Latency spikes     P99 >> P50                 Reduce queue delay, add replicas
