TensorRT-LLM & GPU Optimization
TensorRT-LLM Performance Tuning
This lesson covers advanced performance tuning techniques for TensorRT-LLM to maximize throughput and minimize latency in production deployments.
Key Performance Metrics
┌─────────────────────────────────────────────────────────────┐
│                Performance Metrics Dashboard                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Throughput Metrics        Latency Metrics                  │
│  ──────────────────        ───────────────                  │
│  • Tokens/second           • TTFT (Time to First Token)     │
│  • Requests/second         • TPOT (Time per Output Token)   │
│  • GPU utilization         • P50/P95/P99 latency            │
│                                                             │
│  Resource Metrics          Quality Metrics                  │
│  ────────────────          ───────────────                  │
│  • GPU memory usage        • KV cache hit rate              │
│  • Batch utilization       • Preemption rate                │
│  • Queue depth             • Timeout rate                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
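Throughput and latency figures are easy to derive once you log a few timestamps per request. The snippet below is a minimal, framework-agnostic sketch; the RequestTiming record and its field names are assumptions for illustration, not part of the TensorRT-LLM API.

# Illustrative only: RequestTiming and its fields are assumptions, not TensorRT-LLM APIs.
import statistics
from dataclasses import dataclass

@dataclass
class RequestTiming:
    submit_time: float        # when the request entered the queue
    first_token_time: float   # when the first output token was returned
    finish_time: float        # when generation completed
    output_tokens: int        # number of generated tokens

def summarize(timings: list[RequestTiming]) -> dict:
    ttft = [t.first_token_time - t.submit_time for t in timings]
    tpot = [
        (t.finish_time - t.first_token_time) / max(t.output_tokens - 1, 1)
        for t in timings
    ]
    e2e = sorted(t.finish_time - t.submit_time for t in timings)
    total_tokens = sum(t.output_tokens for t in timings)
    wall_clock = max(t.finish_time for t in timings) - min(t.submit_time for t in timings)
    return {
        "tokens_per_second": total_tokens / wall_clock,
        "mean_ttft_s": statistics.mean(ttft),
        "mean_tpot_s": statistics.mean(tpot),
        "p50_latency_s": e2e[len(e2e) // 2],
        "p99_latency_s": e2e[int(len(e2e) * 0.99)],
    }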
Batch Size Optimization
The optimal batch size balances throughput against latency:
from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig
# Profile different batch sizes
batch_configs = [
    {"max_batch_size": 32, "use_case": "low_latency"},
    {"max_batch_size": 128, "use_case": "balanced"},
    {"max_batch_size": 512, "use_case": "high_throughput"},
]

# Optimal configuration for balanced workloads
build_config = BuildConfig(
    max_batch_size=128,
    max_num_tokens=8192,  # Total tokens across batch
    # Enable dynamic batching
    enable_chunked_context=True,
    max_chunk_size=2048,
    # Batch scheduling
    scheduler_policy="guaranteed_no_evict",
)
Batch Size Guidelines
| Use Case | Batch Size | TTFT Impact | Throughput |
|---|---|---|---|
| Real-time chat | 16-32 | Minimal | Medium |
| API service | 64-128 | Moderate | High |
| Batch processing | 256-512 | Higher | Maximum |
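To find where your workload sits in this table, sweep batch sizes against a representative prompt set. The sketch below reuses the batch_configs list from the snippet above; the model path and prompt set are placeholders, and each configuration triggers a full engine build, so cache engines rather than rebuilding in practice.

# Hedged sketch: sweep batch sizes and measure raw throughput.
import time
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.builder import BuildConfig

prompts = ["Summarize the benefits of paged attention."] * 1000  # placeholder workload
sampling_params = SamplingParams(max_tokens=256, temperature=0.7)

for cfg in batch_configs:  # defined in the snippet above
    llm = LLM(
        model="/models/llama-3.1-8b",  # placeholder model path
        build_config=BuildConfig(max_batch_size=cfg["max_batch_size"]),
    )
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f'{cfg["use_case"]}: {tokens / elapsed:,.0f} tokens/s')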
Memory Optimization
KV Cache Configuration
build_config = BuildConfig(
    # Paged KV cache for memory efficiency
    use_paged_context_fmha=True,
    # KV cache quantization
    kv_cache_quant_algo="FP8",  # ~50% memory reduction vs. FP16
    # Block size for paged attention
    tokens_per_block=64,  # Default, good for most cases
    # Maximum KV cache memory (fraction of GPU memory)
    kv_cache_fraction=0.9,  # Reserve 10% for activations
)
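A quick way to sanity-check these settings is to estimate how much KV cache each token needs and how many concurrent sequences your memory budget supports. A back-of-envelope sketch with assumed Llama-3.1-70B-class dimensions:

# Back-of-envelope KV cache sizing (illustrative model dimensions, FP8 KV cache).
num_layers = 80          # e.g. a Llama-3.1-70B-class model
num_kv_heads = 8         # grouped-query attention
head_dim = 128
bytes_per_elem = 1       # FP8; use 2 for FP16/BF16

# K and V per token, across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

kv_cache_bytes = 30 * 1024**3   # assume ~30 GB left for KV cache after weights/activations
max_cached_tokens = kv_cache_bytes // kv_bytes_per_token
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token, "
      f"~{max_cached_tokens // 8192} concurrent 8K-token sequences")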
Speculative Decoding Integration
TensorRT-LLM supports speculative decoding for faster generation:
from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

# Configure speculative decoding on the target model
build_config = BuildConfig(
    # Main model settings
    max_batch_size=64,
    max_seq_len=8192,
    # Speculative decoding
    speculative_decoding_mode="draft_tokens_external",
    max_draft_tokens=5,  # Number of draft tokens per step
)

# The draft model only needs a basic build configuration
draft_config = BuildConfig(
    max_batch_size=64,
    max_seq_len=8192,
)

# Build both draft and target models
draft_llm = LLM(
    model="/models/llama-3.1-8b",  # Smaller draft model
    build_config=draft_config,
)
target_llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
)
Speculative decoding can provide roughly a 2-3x speedup for generation-heavy workloads; the actual gain depends on how often the target model accepts the draft model's tokens.
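To see where that 2-3x figure comes from: under the standard speculative sampling estimate (an analytical rule of thumb, not a TensorRT-LLM API), a draft length of k tokens with per-token acceptance rate α yields about (1 − α^(k+1)) / (1 − α) tokens per target-model verification step:

# Expected tokens produced per target-model step for draft length k and
# per-token acceptance rate alpha (standard speculative sampling estimate).
def expected_tokens_per_step(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.7, 0.8, 0.9):
    # With max_draft_tokens=5, as in the build config above
    print(f"alpha={alpha:.1f}: ~{expected_tokens_per_step(alpha, 5):.2f} tokens/step")

The realized speedup is lower than these numbers because each step also pays for the draft model's forward passes and the verification overhead.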
Kernel Auto-Tuning
TensorRT-LLM can auto-tune kernels for your specific GPU:
from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

build_config = BuildConfig(
    # Enable kernel auto-tuning
    builder_opt_level=5,  # Maximum optimization (0-5)
    # Tactic selection
    max_workspace_size=8 * 1024**3,  # 8 GB workspace for tuning
    # Tuning profiles tactics on the specific GPU you build on;
    # it takes longer but produces faster engines
)

# Build with tuning (build takes longer, inference runs faster)
llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
    # Save tuned engine
    engine_dir="/engines/llama-3.1-70b-tuned",
)
Runtime Optimization
Request Scheduling
from tensorrt_llm.executor import ExecutorConfig
executor_config = ExecutorConfig(
    # Batching strategy
    batching_type="INFLIGHT",  # Continuous batching
    # Scheduling
    scheduler_policy="MAX_UTILIZATION",
    # Request handling
    max_queue_delay_microseconds=1000,  # 1 ms max queue time
    # Parallel execution
    parallel_config={
        "communication_mode": "LEADER",
        "participant_ids": [0, 1],  # GPU IDs
    },
)
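The queue delay bounds how long the scheduler waits for additional requests before launching work. A rough estimate of how many new requests arrive within that window (the arrival rate below is an assumption):

# How many requests accumulate within the max queue delay (illustrative numbers).
arrival_rate_rps = 2000          # assumed steady-state request arrival rate
max_queue_delay_s = 1000 / 1e6   # 1000 microseconds, as configured above

expected_new_requests = arrival_rate_rps * max_queue_delay_s
print(f"~{expected_new_requests:.0f} new requests arrive per scheduling window")
# With in-flight batching the batch is mostly filled by already-running requests,
# so a short queue delay adds little latency while still grouping arrivals.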
Streaming Configuration
import asyncio

from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="/engines/llama-3.1-70b")

sampling_params = SamplingParams(
    max_tokens=512,
    temperature=0.7,
    # Streaming settings
    stream=True,
    stream_interval=1,  # Yield every N tokens
)

# Streaming generation must run inside an async function
async def stream_response(prompt: str) -> None:
    async for chunk in llm.generate_async(prompt, sampling_params):
        print(chunk.text, end="", flush=True)

asyncio.run(stream_response("Explain paged attention in one paragraph."))
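Streaming also gives a cheap way to measure TTFT in production: time the arrival of the first chunk. A sketch reusing the llm and sampling_params objects above:

# Hedged sketch: time-to-first-token measured from the streaming iterator above.
import asyncio
import time

async def measure_ttft(prompt: str) -> float:
    start = time.perf_counter()
    async for chunk in llm.generate_async(prompt, sampling_params):
        # The first yielded chunk marks the first output token
        return time.perf_counter() - start
    return float("nan")  # no output produced

ttft = asyncio.run(measure_ttft("Explain paged attention in one paragraph."))
print(f"TTFT: {ttft * 1000:.1f} ms")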
Profiling and Benchmarking
Built-in Benchmarking
# TensorRT-LLM benchmark tool
python -m tensorrt_llm.bench.benchmark \
    --engine_dir /engines/llama-3.1-70b \
    --dataset /data/benchmark.json \
    --num_requests 1000 \
    --concurrency 64 \
    --output_file results.json
NVIDIA Nsight Integration
# Profile with Nsight Systems
nsys profile -o llm_profile \
    python inference_script.py

# Analyze with Nsight Compute for kernel details
ncu --target-processes all \
    python inference_script.py
Performance Tuning Checklist
┌─────────────────────────────────────────────────────────────┐
│               Performance Optimization Steps                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  □ Choose appropriate quantization (FP8/FP4 for Hopper+)    │
│  □ Set max_batch_size based on latency requirements         │
│  □ Enable paged attention for memory efficiency             │
│  □ Configure KV cache quantization                          │
│  □ Enable chunked prefill for long contexts                 │
│  □ Run kernel auto-tuning for production GPUs               │
│  □ Profile with representative workload                     │
│  □ Monitor GPU memory and utilization in production         │
│  □ Set appropriate queue delays for SLOs                    │
│  □ Consider speculative decoding for generation workloads   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Common Performance Issues
| Issue | Symptom | Solution |
|---|---|---|
| Low GPU util | <80% utilization | Increase batch size |
| High TTFT | Slow first token | Enable chunked prefill |
| OOM errors | Memory allocation fails | Reduce batch size, use FP8/FP4 |
| Throughput drops | Varies with load | Check preemption rate |
| Latency spikes | P99 >> P50 | Reduce queue delay, add replicas |
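For the low-utilization and memory rows above, NVML provides a quick production-side check. A sketch using the nvidia-ml-py bindings (pip install nvidia-ml-py), sampling the first GPU once per second:

# Hedged sketch: poll GPU utilization and memory with NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(10):  # sample for ~10 seconds
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu:3d}%  "
          f"memory: {mem.used / 1024**3:.1f}/{mem.total / 1024**3:.1f} GiB")
    time.sleep(1)

pynvml.nvmlShutdown()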