TensorRT-LLM & GPU Optimization
TensorRT-LLM Architecture & Setup
TensorRT-LLM is NVIDIA's high-performance inference library built on TensorRT, designed specifically for large language model serving on NVIDIA GPUs.
TensorRT-LLM 1.0 Architecture
The 1.0 release (2025) introduced a PyTorch-based architecture that fundamentally changed how models are defined and optimized:
┌─────────────────────────────────────────────────────────────┐
│                      TensorRT-LLM 1.0                        │
├─────────────────────────────────────────────────────────────┤
│ PyTorch Model Definition                                     │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ Native PyTorch layers with TRT extensions             │    │
│ │ - Attention, MLP, Embedding layers                    │    │
│ │ - Automatic graph capture and optimization            │    │
│ └───────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│ TensorRT Compiler                                            │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ - Layer fusion and kernel optimization                │    │
│ │ - Precision calibration (FP16, FP8, INT8, FP4)        │    │
│ │ - Memory planning and allocation                      │    │
│ └───────────────────────────────────────────────────────┘    │
├─────────────────────────────────────────────────────────────┤
│ Runtime Engine                                               │
│ ┌───────────────────────────────────────────────────────┐    │
│ │ - Paged KV cache (like vLLM)                          │    │
│ │ - Continuous batching (in-flight batching)            │    │
│ │ - Multi-GPU tensor parallelism                        │    │
│ └───────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
Hardware Support Matrix
TensorRT-LLM 1.0 supports the full range of NVIDIA datacenter GPUs:
| GPU Generation | Models | FP8 Support | FP4 Support | Best For |
|---|---|---|---|---|
| Hopper | H100, H200 | Yes | No | Current production |
| Ada Lovelace | L40S, L4 | Yes | No | Cost-effective inference |
| Blackwell | B200, GB200, B300 | Yes | Yes | Next-gen maximum performance |
| Ampere | A100, A10G | No | No | Legacy/budget deployments |
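The right precision target follows directly from the GPU's compute capability. Below is a minimal sketch of that check using PyTorch (already a dependency of TensorRT-LLM); the capability-to-precision mapping simply mirrors the table above:
# Map CUDA compute capability to a target precision, mirroring the table above.
# Blackwell reports sm_100+, Hopper sm_90, Ada sm_89, Ampere sm_80/sm_86.
import torch

major, minor = torch.cuda.get_device_capability(0)

if major >= 10:                    # Blackwell (B200, GB200, B300)
    precision = "fp4"
elif (major, minor) >= (8, 9):     # Hopper (H100, H200) or Ada (L40S, L4)
    precision = "fp8"
else:                              # Ampere (A100, A10G) and older
    precision = "fp16"

print(f"sm_{major}{minor}: targeting {precision}")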
Installation and Setup
Docker-based Installation (Recommended)
# Pull the official TensorRT-LLM container
docker pull nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
# Or the standalone TensorRT-LLM image
docker pull nvcr.io/nvidia/tensorrt-llm:25.01-py3
# Run with GPU access
docker run --gpus all -it --rm \
  -v /path/to/models:/models \
  nvcr.io/nvidia/tensorrt-llm:25.01-py3
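Once inside the container, a quick sanity check confirms that the library imports cleanly and the GPUs are visible (the exact output depends on your version and hardware):
# Sanity check inside the container: the package imports and CUDA sees a GPU.
import torch
import tensorrt_llm

print("TensorRT-LLM:", tensorrt_llm.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU 0:", torch.cuda.get_device_name(0))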
Building an Engine
TensorRT-LLM requires converting models to optimized TensorRT engines:
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.builder import BuildConfig
# Define build configuration
build_config = BuildConfig(
    max_input_len=4096,
    max_seq_len=8192,
    max_batch_size=64,
    max_num_tokens=8192,  # For paged attention
    # Quantization settings
    strongly_typed=True,
    precision="float16",  # or "fp8", "int8"
    # Optimization settings
    use_paged_context_fmha=True,
    use_fused_mlp=True,
)
# Build the engine
llm = LLM(
    model="/models/llama-3.1-8b",
    build_config=build_config,
)
# Save for production
llm.save("/engines/llama-3.1-8b-trt")
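To build a quantized engine rather than FP16, the LLM API also accepts a quantization config. The sketch below is hedged: the QuantConfig/QuantAlgo import path and enum names follow the LLM API examples and may differ slightly between releases, and on Blackwell you could target FP4 instead of FP8:
# FP8 variant of the build above (requires Hopper, Ada, or Blackwell).
# QuantConfig / QuantAlgo names are assumed from the LLM API examples.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="/models/llama-3.1-8b",  # quantize from the original checkpoint, not a built engine
    quant_config=quant_config,
)
llm.save("/engines/llama-3.1-8b-trt-fp8")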
Running Inference
from tensorrt_llm import LLM, SamplingParams
# Load pre-built engine
llm = LLM(model="/engines/llama-3.1-8b-trt")
# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)
# Generate
prompts = ["Explain quantum computing:", "Write a haiku about AI:"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
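The runtime features from the architecture diagram (paged KV cache, in-flight batching, tensor parallelism) surface as constructor options. Here is a sketch of a two-GPU deployment; KvCacheConfig and tensor_parallel_size come from the LLM API, and exact field names may vary by release:
# Two-GPU tensor-parallel deployment with an explicit paged KV cache budget.
# Field names follow the LLM API and may vary slightly between releases.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="/engines/llama-3.1-8b-trt",
    tensor_parallel_size=2,  # shard weights and KV cache across two GPUs
    kv_cache_config=KvCacheConfig(
        free_gpu_memory_fraction=0.9,  # fraction of free VRAM reserved for KV pages
    ),
)

outputs = llm.generate(
    ["Summarize the benefits of a paged KV cache:"],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)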
TensorRT-LLM vs vLLM: When to Choose
| Aspect | TensorRT-LLM | vLLM |
|---|---|---|
| Hardware | NVIDIA only | Multi-vendor |
| Setup Complexity | Higher (engine build) | Lower (direct HuggingFace) |
| Performance | 10-30% faster on NVIDIA | Excellent, slightly slower |
| Quantization | FP4, FP8, INT8, INT4 | FP8, AWQ, GPTQ |
| Community | NVIDIA-driven | Large open-source |
| Use Case | Maximum NVIDIA performance | Flexibility, multi-hardware |
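Headline performance numbers are workload-dependent, so benchmark both engines on your own prompts. Both projects ship OpenAI-compatible HTTP servers (trtllm-serve and vllm serve); the sketch below assumes both are already running locally on the ports shown, and the model name and prompt set are illustrative:
# Rough sequential throughput probe against two OpenAI-compatible endpoints.
# Ports, model name, and prompts are assumptions; adjust to your setup.
import time
import requests

PROMPTS = ["Explain quantum computing:"] * 32

def tokens_per_second(base_url: str, model: str) -> float:
    start = time.perf_counter()
    completion_tokens = 0
    for prompt in PROMPTS:
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": model, "prompt": prompt, "max_tokens": 128},
            timeout=120,
        )
        resp.raise_for_status()
        completion_tokens += resp.json()["usage"]["completion_tokens"]
    return completion_tokens / (time.perf_counter() - start)

print("TensorRT-LLM:", tokens_per_second("http://localhost:8000", "llama-3.1-8b"))
print("vLLM:", tokens_per_second("http://localhost:8001", "llama-3.1-8b"))
For a realistic comparison, also drive concurrent requests so that continuous batching on both servers is actually exercised; a purely sequential probe mostly measures per-request latency.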
Choose TensorRT-LLM when:
- You're committed to NVIDIA hardware
- You need maximum single-GPU performance
- You're using Blackwell GPUs with FP4
- Latency is critical (trading complexity for speed)
Choose vLLM when:
- You need hardware flexibility
- Faster iteration and simpler deployment matter
- Community support and ecosystem integration are priorities