TensorRT-LLM & GPU Optimization

TensorRT-LLM Architecture & Setup

TensorRT-LLM is NVIDIA's high-performance inference library built on TensorRT, designed specifically for large language model serving on NVIDIA GPUs.

TensorRT-LLM 1.0 Architecture

The 1.0 release (2025) introduced a PyTorch-based architecture that fundamentally changed how models are defined and optimized:

┌─────────────────────────────────────────────────────────────┐
│                    TensorRT-LLM 1.0                         │
├─────────────────────────────────────────────────────────────┤
│  PyTorch Model Definition                                   │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Native PyTorch layers with TRT extensions          │   │
│  │  - Attention, MLP, Embedding layers                 │   │
│  │  - Automatic graph capture and optimization         │   │
│  └─────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│  TensorRT Compiler                                          │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  - Layer fusion and kernel optimization             │   │
│  │  - Precision calibration (FP16, FP8, INT8, FP4)     │   │
│  │  - Memory planning and allocation                   │   │
│  └─────────────────────────────────────────────────────┘   │
├─────────────────────────────────────────────────────────────┤
│  Runtime Engine                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  - Paged KV cache (like vLLM)                       │   │
│  │  - Continuous batching (in-flight batching)         │   │
│  │  - Multi-GPU tensor parallelism                     │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
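
The runtime layer is configurable through the same LLM API used later in this lesson. As a minimal sketch (assuming the KvCacheConfig class and its free_gpu_memory_fraction field from the LLM API documentation; verify against your installed version), here is how the paged KV cache's memory budget is set:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

# Let the paged KV cache claim up to 85% of free GPU memory
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.85)

llm = LLM(
    model="/models/llama-3.1-8b",
    kv_cache_config=kv_cache_config,
)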

Hardware Support Matrix

TensorRT-LLM 1.0 supports the full range of NVIDIA datacenter GPUs:

GPU Generation   Models              FP8 Support   FP4 Support   Best For
Hopper           H100, H200          Yes           No            Current production
Ada Lovelace     L40S, L4            Yes           No            Cost-effective inference
Blackwell        B200, GB200, B300   Yes           Yes           Next-gen maximum performance
Ampere           A100, A10G          No            No            Legacy/budget deployments
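
Whether FP8 or FP4 is usable follows directly from the GPU's compute capability: 8.0/8.6 for Ampere, 8.9 for Ada Lovelace, 9.0 for Hopper, and 10.0+ for Blackwell. A small startup check using plain PyTorch, with thresholds mirroring the table above:

import torch

# Compute capability: Ampere=8.0/8.6, Ada=8.9, Hopper=9.0, Blackwell=10.0+
major, minor = torch.cuda.get_device_capability()

if major >= 10:
    precision = "fp4"      # Blackwell adds FP4 tensor cores
elif (major, minor) >= (8, 9):
    precision = "fp8"      # Ada Lovelace and Hopper support FP8
else:
    precision = "float16"  # Ampere and older: fall back to FP16

print(f"SM {major}.{minor} -> using {precision}")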

Installation and Setup

# Pull the official TensorRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3

# Or the standalone TensorRT-LLM image
docker pull nvcr.io/nvidia/tensorrt-llm:25.01-py3

# Run with GPU access
docker run --gpus all -it --rm \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tensorrt-llm:25.01-py3
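
Before building anything, confirm the container can see both the library and a GPU. A minimal sanity check:

import torch
import tensorrt_llm

# Verify the import works and at least one GPU is visible inside the container
print(f"TensorRT-LLM version: {tensorrt_llm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device: {torch.cuda.get_device_name(0)}")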

Building an Engine

TensorRT-LLM requires converting models to optimized TensorRT engines:

from tensorrt_llm import LLM
from tensorrt_llm.builder import BuildConfig

# Define build configuration
build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=64,
    max_num_tokens=8192,  # Token budget per scheduled batch (in-flight batching)

    # Quantization settings
    strongly_typed=True,
    precision="float16",  # or "fp8", "int8"

    # Optimization settings
    use_paged_context_fmha=True,
    use_fused_mlp=True,
)

# Build the engine
llm = LLM(
    model="/models/llama-3.1-8b",
    build_config=build_config,
)

# Save for production
llm.save("/engines/llama-3.1-8b-trt")
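
To build a quantized engine instead of the FP16 one above, the LLM API takes a separate quantization config. This is a hedged sketch assuming the QuantConfig and QuantAlgo names from the LLM API documentation; per the hardware table, FP8 needs Ada-class GPUs or newer:

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # assumed import path

# FP8 weights and activations (requires SM 8.9+: Ada, Hopper, or Blackwell)
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="/models/llama-3.1-8b",
    quant_config=quant_config,
)
llm.save("/engines/llama-3.1-8b-fp8")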

Running Inference

from tensorrt_llm import LLM, SamplingParams

# Load pre-built engine
llm = LLM(model="/engines/llama-3.1-8b-trt")

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate
prompts = ["Explain quantum computing:", "Write a haiku about AI:"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")

TensorRT-LLM vs vLLM: When to Choose

Aspect             TensorRT-LLM                 vLLM
Hardware           NVIDIA only                  Multi-vendor
Setup complexity   Higher (engine build)        Lower (direct HuggingFace)
Performance        10-30% faster on NVIDIA      Excellent, slightly slower
Quantization       FP4, FP8, INT8, INT4         FP8, AWQ, GPTQ
Community          NVIDIA-driven                Large open-source
Use case           Maximum NVIDIA performance   Flexibility, multi-hardware

Choose TensorRT-LLM when:

  • You're committed to NVIDIA hardware
  • You need maximum single-GPU performance
  • You're using Blackwell GPUs with FP4
  • Latency is critical (trading complexity for speed)

Choose vLLM when:

  • You need hardware flexibility
  • Faster iteration and simpler deployment matters
  • Community support and ecosystem integration are priorities
