TensorRT-LLM & GPU Optimization
LLM Quantization Techniques
4 min read
Quantization reduces model precision from FP32/FP16 to lower bit-widths, dramatically reducing memory requirements and increasing throughput while maintaining acceptable quality.
Quantization Formats
Precision Hierarchy (Memory per Parameter):
FP32 (32-bit) ████████████████████████████████ 4 bytes
FP16 (16-bit) ████████████████ 2 bytes
BF16 (16-bit) ████████████████ 2 bytes
FP8 (8-bit) ████████ 1 byte
INT8 (8-bit) ████████ 1 byte
INT4 (4-bit) ████ 0.5 bytes
FP4 (4-bit) ████ 0.5 bytes
Memory Savings (70B model):
FP16: 140GB → FP8: 70GB → INT4: 35GB
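The savings are simple arithmetic on bytes per parameter. A minimal sketch (weights only; a real deployment also needs memory for KV cache, activations, and runtime buffers):

```python
# Back-of-the-envelope weight memory for a dense model at various precisions.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0,
                   "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "FP4": 0.5}

def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache,
    activations, and runtime workspace."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("FP16", "FP8", "INT4"):
    print(f"70B @ {fmt}: {weight_memory_gb(70e9, fmt):.0f} GB")
# FP16: 140 GB, FP8: 70 GB, INT4: 35 GB
```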
FP8 Quantization (Hopper/Ada/Blackwell)
FP8 is the sweet spot for modern inference, offering 2x memory reduction with minimal quality loss:
```python
from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig
from tensorrt_llm.quantization import QuantConfig

# FP8 quantization configuration
quant_config = QuantConfig(
    quant_algo="FP8",
    kv_cache_quant_algo="FP8",  # Also quantize the KV cache
)

build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=64,
    precision="float16",  # Base precision
    quant_config=quant_config,
)

# Build with FP8
llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
)
```
FP8 Formats
| Format | Exponent | Mantissa | Range | Best For |
|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±448 | Weights and activations |
| E5M2 | 5 bits | 2 bits | ±57344 | Gradients (training) |
TensorRT-LLM automatically selects optimal formats per layer.
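The dynamic range of each format follows directly from its exponent/mantissa split. As a quick sanity check, the largest finite value can be derived from the bit layout: E4M3 reserves only the all-ones exponent-plus-mantissa code for NaN, while E5M2 follows IEEE-style semantics and reserves its top exponent for inf/NaN:

```python
# Derive the largest finite FP8 value from the bit layout.
def fp8_max(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2-style: top exponent code reserved for inf/NaN,
        # full mantissa usable one exponent below it
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        # E4M3-style: only exponent=1111 with mantissa=111 is NaN,
        # so the top exponent is usable with mantissa up to 110
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    return max_man * 2 ** max_exp

print(fp8_max(4, 3, ieee_like=False))  # E4M3 -> 448.0
print(fp8_max(5, 2, ieee_like=True))   # E5M2 -> 57344.0
```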
FP4 Quantization (Blackwell Only)
FP4 is exclusive to Blackwell architecture, providing 4x memory reduction:
```python
# FP4 configuration for Blackwell GPUs (B200, B300)
quant_config = QuantConfig(
    quant_algo="FP4",
    kv_cache_quant_algo="FP8",  # KV cache stays at FP8
)

build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=128,  # Larger batches possible
    quant_config=quant_config,
)
```
FP4 enables:
- Running 405B models (~200GB of FP4 weights) on a single B300 (288GB HBM3e)
- 4x higher batch sizes at same memory
- Cost-effective serving of massive models
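The batch-size gain comes from the memory the smaller weights free up for KV cache. A rough illustration, with assumed numbers for a 70B model on a hypothetical 192GB GPU (real engines also need workspace and activation memory):

```python
# Memory freed by FP4 weights becomes KV-cache headroom, which is what
# permits larger batch sizes at the same HBM capacity.
HBM_GB = 192.0                         # assumed GPU memory
WEIGHTS_FP16_GB = 70e9 * 2.0 / 1e9     # 140 GB of FP16 weights
WEIGHTS_FP4_GB = 70e9 * 0.5 / 1e9      # 35 GB of FP4 weights

kv_headroom_fp16 = HBM_GB - WEIGHTS_FP16_GB  # 52 GB left for KV cache
kv_headroom_fp4 = HBM_GB - WEIGHTS_FP4_GB    # 157 GB left for KV cache
print(kv_headroom_fp4 / kv_headroom_fp16)    # ~3x more KV-cache headroom
```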
INT8/INT4 Weight-Only Quantization
For Ampere GPUs without FP8 support:
```python
# INT8 weight-only (works on all GPUs)
quant_config = QuantConfig(
    quant_algo="W8A16",  # 8-bit weights, 16-bit activations
)

# INT4 weight-only (AWQ-style)
quant_config = QuantConfig(
    quant_algo="W4A16",
    group_size=128,  # Per-group quantization
)
```
Quantization Comparison
| Method | Memory | Speed | Quality | GPU Support |
|---|---|---|---|---|
| FP16 | 1x | 1x | Baseline | All |
| FP8 | 0.5x | 1.5-2x | 99%+ | Hopper+ |
| FP4 | 0.25x | 2-3x | 97%+ | Blackwell |
| INT8 (W8A16) | 0.5x | 1.2x | 98%+ | All |
| INT4 (W4A16) | 0.25x | 1.5x | 95%+ | All |
Calibration for Quantization
Proper calibration is essential for maintaining model quality:
```python
from tensorrt_llm.quantization import quantize

# Prepare calibration dataset
calibration_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning models require careful optimization.",
    # 100-500 representative samples
]

# Quantize with calibration
quantized_model = quantize(
    model_dir="/models/llama-3.1-70b",
    output_dir="/models/llama-3.1-70b-fp8",
    quant_config=quant_config,
    calib_dataset=calibration_data,
    calib_batch_size=8,
    calib_max_seq_length=512,
)
```
Calibration Best Practices
- Representative Data: Use samples that match production workloads
- Dataset Size: 100-500 samples for stable calibration
- Sequence Length: Match expected inference lengths
- Validation: Compare perplexity before/after quantization
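The validation step can be automated as a simple quality gate. A hedged sketch, where the per-token negative log-likelihood lists stand in for whatever your eval harness produces:

```python
import math

def perplexity(nlls: list) -> float:
    """Perplexity from per-token negative log-likelihoods."""
    return math.exp(sum(nlls) / len(nlls))

def passes_quality_gate(nlls_fp16, nlls_quant, max_rel_increase=0.005):
    """Accept the quantized model only if perplexity rises <0.5% by default."""
    ppl_base = perplexity(nlls_fp16)
    ppl_quant = perplexity(nlls_quant)
    return (ppl_quant - ppl_base) / ppl_base <= max_rel_increase

# Toy numbers: the quantized model is ~0.2% worse, within the 0.5% budget
base = [2.10, 2.30, 1.95, 2.40]
quant = [n + 0.002 for n in base]
print(passes_quality_gate(base, quant))
```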
SmoothQuant for Challenging Models
For models with activation outliers:
```python
# SmoothQuant moves quantization difficulty from activations to weights
quant_config = QuantConfig(
    quant_algo="W8A8_SQ_PER_CHANNEL",
    smoothquant_val=0.5,  # Balance factor alpha (0.0-1.0)
)
```
Production Quantization Workflow
┌──────────────────────────────────────────────────────────┐
│ Quantization Pipeline │
├──────────────────────────────────────────────────────────┤
│ │
│ 1. Baseline Evaluation │
│ ├── Measure FP16 perplexity/accuracy │
│ └── Establish quality threshold │
│ │
│ 2. Calibration │
│ ├── Collect representative samples │
│ ├── Run calibration pass │
│ └── Generate quantization scales │
│ │
│ 3. Quantized Model Build │
│ ├── Apply quantization │
│ ├── Build TensorRT engine │
│ └── Optimize kernel selection │
│ │
│ 4. Quality Validation │
│ ├── Compare perplexity (<0.5% increase) │
│ ├── Run task-specific benchmarks │
│ └── A/B test in staging │
│ │
│ 5. Production Deployment │
│ └── Monitor quality metrics continuously │
└──────────────────────────────────────────────────────────┘