TensorRT-LLM & GPU Optimization

Blackwell GPU Optimization

3 min read

NVIDIA's Blackwell architecture (2024-2025) introduces significant improvements for LLM inference. This lesson covers optimization strategies specific to B200, GB200, B300, and GB300 GPUs.

Blackwell Architecture Overview

┌─────────────────────────────────────────────────────────────┐
│                  Blackwell GPU Features                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────┐  ┌───────────────────────────────┐  │
│  │  5th Gen Tensor   │  │  HBM3e Memory                 │  │
│  │  Cores            │  │  ─────────────────────────    │  │
│  │  ─────────────    │  │  B200: 192GB @ 8TB/s          │  │
│  │  • FP4 support    │  │  B300: 288GB @ 10TB/s         │  │
│  │  • 2x FP8 perf    │  │  GB200: 384GB (2x192GB NVLink)│  │
│  │  • Sparsity 2:4   │  │                               │  │
│  └───────────────────┘  └───────────────────────────────┘  │
│                                                             │
│  ┌───────────────────┐  ┌───────────────────────────────┐  │
│  │  NVLink 5.0       │  │  Transformer Engine 2.0       │  │
│  │  ─────────────    │  │  ────────────────────────     │  │
│  │  1.8TB/s GPU-GPU  │  │  • Automatic FP4/FP8 select   │  │
│  │  Grace-Blackwell  │  │  • Dynamic precision scaling  │  │
│  │  unified memory   │  │  • Optimized attention        │  │
│  └───────────────────┘  └───────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Blackwell GPU Comparison

Specification      B200     GB200             B300     GB300
HBM3e Memory       192GB    384GB             288GB    576GB
Memory Bandwidth   8TB/s    16TB/s            10TB/s   20TB/s
FP8 TFLOPS         4,500    9,000             5,600    11,200
FP4 TFLOPS         9,000    18,000            11,200   22,400
TDP                1000W    2700W             1200W    3000W
Form Factor        SXM      Grace+Blackwell   SXM      Grace+Blackwell

FP4 Optimization for Blackwell

FP4 is Blackwell's standout feature for LLM inference:

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig
from tensorrt_llm.quantization import QuantConfig

# Optimal FP4 configuration for Blackwell
quant_config = QuantConfig(
    quant_algo="FP4",
    kv_cache_quant_algo="FP8",
    exclude_modules=["lm_head"],  # Keep output layer at higher precision
)

build_config = BuildConfig(
    max_input_len=8192,
    max_seq_len=32768,  # Longer context with FP4 memory savings
    max_batch_size=256,  # Much larger batches possible
    max_num_tokens=32768,

    # Blackwell-specific optimizations
    use_fp8_context_fmha=True,
    use_paged_context_fmha=True,

    quant_config=quant_config,
)

# Build for Blackwell
llm = LLM(
    model="/models/llama-3.1-405b",
    build_config=build_config,
    # Enable multi-GPU if needed
    tensor_parallel_size=2,  # B200 NVLink pair
)

FP4 Memory Benefits

Llama 3.1 405B weight memory (weights only; KV cache and activations add overhead):

FP16:  ~810GB → requires at least 11x H100-80GB or 5x B200
FP8:   ~405GB → requires at least 6x H100-80GB or 3x B200
FP4:   ~203GB → fits on 2x B200 or 1x GB200

Cost Reduction: ~60% fewer GPUs than FP16 on the same hardware (5x B200 → 2x B200)
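
The arithmetic behind these figures is simple: parameter count times bytes per weight, divided by per-GPU memory. A minimal sketch of that calculation (weights only; real deployments also need headroom for KV cache and activations):

import math

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def min_gpus(params_billions: float, bits_per_weight: int, gpu_mem_gb: float) -> int:
    """Minimum GPUs needed just to hold the weights, ignoring runtime overhead."""
    return math.ceil(weight_memory_gb(params_billions, bits_per_weight) / gpu_mem_gb)

for precision, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    gb = weight_memory_gb(405, bits)
    print(f"{precision}: {gb:.0f}GB -> {min_gpus(405, bits, 192)}x B200 (192GB each)")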

GB200 Superchip Configuration

The GB200 Grace-Blackwell Superchip pairs two Blackwell GPUs (384GB of HBM3e combined) with a Grace CPU; the two GPUs communicate over 1.8TB/s NVLink 5:

# Optimal configuration for a GB200 superchip
build_config = BuildConfig(
    max_input_len=8192,
    max_seq_len=65536,  # Extended context enabled by 384GB of HBM3e
    max_batch_size=512,

    # Maximize memory efficiency with native FP4 weights
    quant_config=QuantConfig(quant_algo="FP4"),

    # Chunked prefill pipelines long prompts alongside decode
    enable_chunked_context=True,
    max_chunk_size=8192,
)

llm = LLM(
    model="/models/llama-3.1-405b",
    build_config=build_config,
    tensor_parallel_size=2,  # Use both Blackwell GPUs in the GB200 over NVLink
)

Transformer Engine 2.0 Integration

Blackwell's Transformer Engine automatically optimizes precision:

# Enable Transformer Engine optimizations
build_config = BuildConfig(
    # ... other settings ...

    # Transformer Engine 2.0 features
    use_transformer_engine=True,

    # Automatic precision selection per layer
    # - Attention: FP8 for stability
    # - FFN: FP4 for throughput
    # - LayerNorm: FP16 for accuracy
    mixed_precision_strategy="auto",
)

Multi-Node B200 Configuration

For extremely large models or massive throughput:

# 8x B200 configuration (4 nodes, 2 GPUs per node)
build_config = BuildConfig(
    max_batch_size=1024,
    max_num_tokens=65536,
    quant_config=QuantConfig(quant_algo="FP4"),
)

llm = LLM(
    model="/models/llama-3.1-405b",
    build_config=build_config,
    tensor_parallel_size=2,    # Tensor parallelism within each node
    pipeline_parallel_size=4,  # Pipeline parallelism across nodes
)

# Launch through the TensorRT-LLM distributed runner (e.g. via mpirun);
# inter-node communication uses NCCL over InfiniBand.

Blackwell Optimization Checklist

  1. Use FP4 quantization - Native support, 4x memory reduction
  2. Maximize batch size - Blackwell excels at high throughput
  3. Enable paged attention - Essential for memory efficiency
  4. Use chunked prefill - Reduces TTFT for long contexts
  5. Configure NVLink correctly - Ensure optimal GPU-GPU communication
  6. Monitor power limits - B200 draws up to 1000W under load (see the monitoring sketch after this list)
  7. Test FP4 quality - Validate model accuracy after quantization
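
Checklist item 6 can be automated with the NVML Python bindings; this is generic NVML usage, not a TensorRT-LLM API, and the package name (nvidia-ml-py, imported as pynvml) is an assumption about your environment:

import pynvml  # NVML bindings, typically installed as nvidia-ml-py

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        draw_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0           # milliwatts -> watts
        limit_w = pynvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000.0  # configured power cap
        print(f"GPU {i} ({name}): {draw_w:.0f}W / {limit_w:.0f}W limit")
finally:
    pynvml.nvmlShutdown()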

Performance Expectations

Model           GPU Config   Quantization   Throughput      TTFT
Llama 3.1 8B    1x B200      FP4            15,000 tok/s    12ms
Llama 3.1 70B   1x B200      FP4            3,500 tok/s     45ms
Llama 3.1 405B  2x B200      FP4            800 tok/s       120ms
Llama 3.1 405B  1x GB200     FP4            900 tok/s       100ms
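
Actual numbers depend on the TensorRT-LLM version, driver, and prompt mix, so it is worth measuring throughput on your own hardware. A rough sketch, assuming the llm object built earlier and the LLM API's SamplingParams/generate(); the import path and output attribute names may differ across versions, and TTFT measurement (which requires streaming generation) is omitted here:

import time

from tensorrt_llm import SamplingParams  # assumed top-level export; may live under tensorrt_llm.llmapi

prompts = ["Summarize the Blackwell architecture in one paragraph."] * 64
sampling = SamplingParams(max_tokens=256)  # cap generated length per request

llm.generate(prompts[:4], sampling)  # warm-up pass to exclude one-time setup cost

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Assumption: each result exposes generated token IDs as outputs[0].token_ids
generated = sum(len(out.outputs[0].token_ids) for out in outputs)
print(f"{generated / elapsed:,.0f} generated tok/s across {len(prompts)} requests")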

Quiz

Module 3: TensorRT-LLM & GPU Optimization

Take Quiz