TensorRT-LLM & GPU Optimization
TensorRT-LLM Architecture & Setup
TensorRT-LLM is NVIDIA's high-performance inference library built on TensorRT, designed specifically for large language model serving on NVIDIA GPUs.
TensorRT-LLM 1.0 Architecture
The 1.0 release (2025) introduced a PyTorch-based architecture that fundamentally changed how models are defined and optimized:
┌─────────────────────────────────────────────────────────────┐
│                      TensorRT-LLM 1.0                        │
├─────────────────────────────────────────────────────────────┤
│ PyTorch Model Definition                                     │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ Native PyTorch layers with TRT extensions             │    │
│ │ - Attention, MLP, Embedding layers                    │    │
│ │ - Automatic graph capture and optimization            │    │
│ └───────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│ TensorRT Compiler                                            │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ - Layer fusion and kernel optimization                │    │
│ │ - Precision calibration (FP16, FP8, INT8, FP4)        │    │
│ │ - Memory planning and allocation                      │    │
│ └───────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│ Runtime Engine                                               │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ - Paged KV cache (like vLLM)                          │    │
│ │ - Continuous batching (in-flight batching)            │    │
│ │ - Multi-GPU tensor parallelism                        │    │
│ └───────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
Hardware Support Matrix
TensorRT-LLM 1.0 supports the full range of NVIDIA datacenter GPUs:
| GPU Generation | Models | FP8 Support | FP4 Support | Best For |
|---|---|---|---|---|
| Hopper | H100, H200 | Yes | No | Current production |
| Ada Lovelace | L40S, L4 | Yes | No | Cost-effective inference |
| Blackwell | B200, GB200, B300 | Yes | Yes | Next-gen maximum performance |
| Ampere | A100, A10G | No | No | Legacy/budget deployments |
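The right precision target follows directly from the GPU's compute capability. Below is a minimal sketch of that check using PyTorch (already a dependency of TensorRT-LLM); the capability-to-precision mapping simply mirrors the table above:
# Map CUDA compute capability to a target precision, mirroring the table above.
# Blackwell reports sm_100+, Hopper sm_90, Ada sm_89, Ampere sm_80/sm_86.
import torch

major, minor = torch.cuda.get_device_capability(0)

if major >= 10:                    # Blackwell (B200, GB200, B300)
    precision = "fp4"
elif (major, minor) >= (8, 9):     # Hopper (H100, H200) or Ada (L40S, L4)
    precision = "fp8"
else:                              # Ampere (A100, A10G) and older
    precision = "fp16"

print(f"sm_{major}{minor}: targeting {precision}")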
Installation and Setup
Docker-based Installation (Recommended)
# Pull the official TensorRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
# Or the standalone TensorRT-LLM image
docker pull nvcr.io/nvidia/tensorrt-llm:25.01-py3
# Run with GPU access
docker run --gpus all -it --rm \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tensorrt-llm:25.01-py3
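Once inside the container, a quick sanity check confirms that the library imports cleanly and the GPUs are visible (the exact output depends on your version and hardware):
# Sanity check inside the container: the package imports and CUDA sees a GPU.
import torch
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))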
Building an Engine
TensorRT-LLM requires converting models to optimized TensorRT engines:
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.builder import BuildConfig
# Define build configuration
build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=64,
    max_num_tokens=8192,  # For paged attention
    # Quantization settings
    strongly_typed=True,
    precision="float16",  # or "fp8", "int8"
    # Optimization settings
    use_paged_context_fmha=True,
    use_fused_mlp=True,
)
# Build the engine
llm = LLM(
    model="/models/llama-3.1-8b",
    build_config=build_config,
)
# Save for production
llm.save("/engines/llama-3.1-8b-trt")
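To build a quantized engine rather than FP16, the LLM API also accepts a quantization config. The sketch below is hedged: the QuantConfig/QuantAlgo import path and enum names follow the LLM API examples and may differ slightly between releases, and on Blackwell you could target FP4 instead of FP8:
# FP8 variant of the build above (requires Hopper, Ada, or Blackwell).
# QuantConfig / QuantAlgo names are assumed from the LLM API examples.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="/models/llama-3.1-8b",  # quantize from the original checkpoint, not a built engine
    quant_config=quant_config,
)
llm.save("/engines/llama-3.1-8b-trt-fp8")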
Running Inference
from tensorrt_llm import LLM, SamplingParams
# Load pre-built engine
llm = LLM(model="/engines/llama-3.1-8b-trt")
# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Generate
prompts = ["Explain quantum computing:", "Write a haiku about AI:"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
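The runtime features from the architecture diagram (paged KV cache, in-flight batching, tensor parallelism) surface as constructor options. Here is a sketch of a two-GPU deployment; KvCacheConfig and tensor_parallel_size come from the LLM API, and exact field names may vary by release:
# Two-GPU tensor-parallel deployment with an explicit paged KV cache budget.
# Field names follow the LLM API and may vary slightly between releases.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="/engines/llama-3.1-8b-trt",
    tensor_parallel_size=2,  # shard weights and KV cache across two GPUs
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # fraction of free VRAM reserved for KV pages
    ),
)

outputs = llm.generate(
    ["Summarize the benefits of a paged KV cache:"],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)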
TensorRT-LLM vs vLLM: When to Choose
| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Hardware | NVIDIA only | Multi-vendor |
| Setup Complexity | Higher (engine build) | Lower (direct HuggingFace) |
| Performance | 10-30% faster on NVIDIA | Excellent, slightly slower |
| Quantization | FP4, FP8, INT8, INT4 | FP8, AWQ, GPTQ |
| Community | NVIDIA-driven | Large open-source |
| Use Case | Maximum NVIDIA performance | Flexibility, multi-hardware |
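Headline performance numbers are workload-dependent, so benchmark both engines on your own prompts. Both projects ship OpenAI-compatible HTTP servers (trtllm-serve and vllm serve); the sketch below assumes both are already running locally on the ports shown, and the model name and prompt set are illustrative:
# Rough sequential throughput probe against two OpenAI-compatible endpoints.
# Ports, model name, and prompts are assumptions; adjust to your setup.
import time
import requests

PROMPTS = ["Explain quantum computing:"] * 32

def tokens_per_second(base_url: str, model: str) -> float:
    start = time.perf_counter()
    completion_tokens = 0
    for prompt in PROMPTS:
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": model, "prompt": prompt, "max_tokens": 128},
            timeout=120,
        )
        resp.raise_for_status()
        completion_tokens += resp.json()["usage"]["completion_tokens"]
    return completion_tokens / (time.perf_counter() - start)

print("TensorRT-LLM:", tokens_per_second("http://localhost:8000", "llama-3.1-8b"))
print("vLLM:", tokens_per_second("http://localhost:8001", "llama-3.1-8b"))
For a realistic comparison, also drive concurrent requests so that continuous batching on both servers is actually exercised; a purely sequential probe mostly measures per-request latency.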
Choose TensorRT-LLM when:
- You're committed to NVIDIA hardware
- You need maximum single-GPU performance
- You're using Blackwell GPUs with FP4
- Latency is critical (trading complexity for speed)
Choose vLLM when:
- You need hardware flexibility
- Faster iteration and simpler deployment matter
- Community support and ecosystem integration are priorities