TensorRT-LLM & GPU Optimization

LLM Quantization Techniques

4 min read

Quantization stores model weights (and optionally activations) at lower bit-widths than FP32/FP16, dramatically cutting memory requirements and increasing throughput while maintaining acceptable quality.

Quantization Formats

Precision Hierarchy (Memory per Parameter):

FP32 (32-bit)  ████████████████████████████████  4 bytes
FP16 (16-bit)  ████████████████                  2 bytes
BF16 (16-bit)  ████████████████                  2 bytes
FP8  (8-bit)   ████████                          1 byte
INT8 (8-bit)   ████████                          1 byte
INT4 (4-bit)   ████                              0.5 bytes
FP4  (4-bit)   ████                              0.5 bytes

Memory Savings (70B model):
FP16: 140GB → FP8: 70GB → INT4: 35GB
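
The arithmetic behind these figures is just parameters × bytes per parameter. A quick sketch (illustrative helper, not a TensorRT-LLM API) reproduces the 70B numbers:

```python
# Back-of-the-envelope weight-memory calculator using the bytes-per-parameter
# values from the table above. Weights only: KV cache and activations extra.
BYTES_PER_PARAM = {
    "FP32": 4.0, "FP16": 2.0, "BF16": 2.0,
    "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "FP4": 0.5,
}

def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight memory in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9

for fmt in ("FP16", "FP8", "INT4"):
    print(f"70B @ {fmt}: {weight_memory_gb(70e9, fmt):.0f} GB")
# 70B @ FP16: 140 GB
# 70B @ FP8: 70 GB
# 70B @ INT4: 35 GB
```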

FP8 Quantization (Hopper/Ada/Blackwell)

FP8 is the sweet spot for modern inference, offering 2x memory reduction with minimal quality loss:

from tensorrt_llm import LLM  # high-level LLM API entry point
from tensorrt_llm.builder import BuildConfig
from tensorrt_llm.quantization import QuantConfig

# FP8 quantization configuration
quant_config = QuantConfig(
    quant_algo="FP8",
    kv_cache_quant_algo="FP8",  # Also quantize KV cache
)

build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=64,
    precision="float16",  # Base precision
    quant_config=quant_config,
)

# Build with FP8
llm = LLM(
    model="/models/llama-3.1-70b",
    build_config=build_config,
)

FP8 Formats

Format  Exponent  Mantissa  Max Range  Best For
E4M3    4 bits    3 bits    ±448       Weights and activations (inference)
E5M2    5 bits    2 bits    ±57344     Gradients (wide dynamic range)

TensorRT-LLM automatically selects optimal formats per layer.
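
The range column follows directly from the bit layout. A small sketch reproduces both maxima (with the caveat that NVIDIA's E4M3 variant reclaims the top exponent code for normal values, while E5M2 reserves it for Inf/NaN as in IEEE 754):

```python
# Max finite value of a small float format (illustrative derivation).
def max_finite(exp_bits: int, man_bits: int, ieee_reserved_top_exp: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_reserved_top_exp:
        # Top exponent code is Inf/NaN: largest usable exponent is one below,
        # with a full mantissa (E5M2 behaves like IEEE 754 here).
        top_exp = (2 ** exp_bits - 2) - bias
        mantissa = 2 - 2 ** -man_bits
    else:
        # E4M3FN-style: top exponent code is usable; only the all-ones
        # mantissa at that exponent encodes NaN.
        top_exp = (2 ** exp_bits - 1) - bias
        mantissa = 2 - 2 ** -(man_bits - 1)
    return mantissa * 2 ** top_exp

print(max_finite(4, 3, ieee_reserved_top_exp=False))  # E4M3  -> 448.0
print(max_finite(5, 2, ieee_reserved_top_exp=True))   # E5M2  -> 57344.0
```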

FP4 Quantization (Blackwell Only)

FP4 is exclusive to the Blackwell architecture, providing 4x memory reduction:

# FP4 configuration for Blackwell GPUs (B200, B300)
quant_config = QuantConfig(
    quant_algo="FP4",
    kv_cache_quant_algo="FP8",  # KV cache stays at FP8
)

build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=128,  # Larger batches possible
    quant_config=quant_config,
)

FP4 enables:

  • Fitting 405B-scale models on two B200 GPUs (192GB HBM3e each)
  • 4x higher batch sizes at same memory
  • Cost-effective serving of massive models
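
To see where those larger batches come from, compare the memory left over for KV cache once weights are loaded, taking B200's 192GB HBM3e capacity (back-of-the-envelope arithmetic only; a real budget must also cover activations and runtime overheads):

```python
# Rough KV-cache headroom after loading weights (illustrative, not a sizing
# tool): GPU memory minus (billions of params x bytes per param) in GB.
def kv_headroom_gb(gpu_gb: float, params_b: float, bytes_per_param: float) -> float:
    return gpu_gb - params_b * bytes_per_param

# 70B model on one 192 GB B200:
print(kv_headroom_gb(192, 70, 2.0))   # FP16 ->  52 GB for KV cache
print(kv_headroom_gb(192, 70, 0.5))   # FP4  -> 157 GB, ~3x the KV budget
```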

INT8/INT4 Weight-Only Quantization

For Ampere GPUs without FP8 support:

# INT8 weight-only (works on all GPUs)
quant_config = QuantConfig(
    quant_algo="W8A16",  # 8-bit weights, 16-bit activations
)

# INT4 weight-only (AWQ-style)
quant_config = QuantConfig(
    quant_algo="W4A16",
    group_size=128,  # Per-group quantization
)
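
What group_size controls can be sketched in a few lines of NumPy (an illustrative model of symmetric per-group W4A16 quantization, not TensorRT-LLM's actual kernels): each group of 128 weights along the input dimension shares one scale, and values are stored as integers in [-8, 7]:

```python
import numpy as np

def quantize_int4_groupwise(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group 4-bit quantization of a [out, in] weight matrix."""
    out_dim, in_dim = w.shape
    groups = w.reshape(out_dim, in_dim // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0  # one per group
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(out_dim, in_dim), scales.squeeze(-1)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    groups = q.reshape(q.shape[0], -1, group_size).astype(np.float32)
    return (groups * scales[..., None]).reshape(q.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 256)).astype(np.float32)
q, s = quantize_int4_groupwise(w)
err = np.abs(dequantize(q, s) - w).max()  # bounded by half a scale step
```

Smaller groups give tighter scales (lower error) at the cost of more scale storage, which is the trade-off `group_size` tunes.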

Quantization Comparison

Method        Memory  Speed    Quality   GPU Support
FP16          1x      1x       Baseline  All
FP8           0.5x    1.5-2x   99%+      Hopper+
FP4           0.25x   2-3x     97%+      Blackwell
INT8 (W8A16)  0.5x    1.2x     98%+      All
INT4 (W4A16)  0.25x   1.5x     95%+      All

Calibration for Quantization

Proper calibration is essential for maintaining model quality:

from tensorrt_llm.quantization import quantize

# Prepare calibration dataset
calibration_data = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning models require careful optimization.",
    # 100-500 representative samples
]

# Quantize with calibration
quantized_model = quantize(
    model_dir="/models/llama-3.1-70b",
    output_dir="/models/llama-3.1-70b-fp8",
    quant_config=quant_config,
    calib_dataset=calibration_data,
    calib_batch_size=8,
    calib_max_seq_length=512,
)

Calibration Best Practices

  1. Representative Data: Use samples that match production workloads
  2. Dataset Size: 100-500 samples for stable calibration
  3. Sequence Length: Match expected inference lengths
  4. Validation: Compare perplexity before/after quantization
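
The validation step in practice reduces to comparing exp(mean NLL) on held-out text. A minimal sketch with hypothetical per-token NLL values shows the shape of the check:

```python
import math

# Perplexity is exp(mean negative log-likelihood) over held-out tokens,
# so the FP16-vs-quantized comparison reduces to two NLL lists.
def perplexity(nlls):
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs from the baseline and the quantized model.
nll_fp16 = [2.01, 1.87, 2.10, 1.95]
nll_fp8 = [2.012, 1.872, 2.102, 1.952]

ppl_base, ppl_quant = perplexity(nll_fp16), perplexity(nll_fp8)
regression = (ppl_quant - ppl_base) / ppl_base
assert regression < 0.005, "quantized perplexity regressed more than 0.5%"
```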

SmoothQuant for Challenging Models

For models with activation outliers:

# SmoothQuant moves quantization difficulty from activations to weights
quant_config = QuantConfig(
    quant_algo="W8A8_SQ_PER_CHANNEL",
    smoothquant_val=0.5,  # Balance factor (0.0-1.0)
)
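
The idea behind smoothquant_val (the balance factor α) can be sketched in NumPy: per-input-channel scales s_j = max|X_j|^α / max|W_j|^(1-α) divide the activations and multiply the weights, leaving the layer output unchanged while shrinking activation outliers. This is an illustrative sketch, not TensorRT-LLM's implementation:

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5, eps: float = 1e-8):
    """x: [tokens, in], w: [out, in]. Returns smoothed (x', w') with
    x' @ w'.T == x @ w.T, but activation outliers migrated into weights."""
    act_max = np.abs(x).max(axis=0) + eps   # per input channel
    wgt_max = np.abs(w).max(axis=0) + eps
    s = act_max ** alpha / wgt_max ** (1 - alpha)
    return x / s, w * s

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64))
x[:, 3] *= 50                               # inject an outlier channel
w = rng.standard_normal((128, 64))
xs, ws = smooth(x, w, alpha=0.5)            # alpha plays the smoothquant_val role
```

Because the transform is exact, quality loss comes only from the subsequent INT8 rounding, which now sees a flatter activation distribution.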

Production Quantization Workflow

┌──────────────────────────────────────────────────────────┐
│                  Quantization Pipeline                   │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  1. Baseline Evaluation                                  │
│     ├── Measure FP16 perplexity/accuracy                 │
│     └── Establish quality threshold                      │
│                                                          │
│  2. Calibration                                          │
│     ├── Collect representative samples                   │
│     ├── Run calibration pass                             │
│     └── Generate quantization scales                     │
│                                                          │
│  3. Quantized Model Build                                │
│     ├── Apply quantization                               │
│     ├── Build TensorRT engine                            │
│     └── Optimize kernel selection                        │
│                                                          │
│  4. Quality Validation                                   │
│     ├── Compare perplexity (<0.5% increase)              │
│     ├── Run task-specific benchmarks                     │
│     └── A/B test in staging                              │
│                                                          │
│  5. Production Deployment                                │
│     └── Monitor quality metrics continuously             │
└──────────────────────────────────────────────────────────┘

