Model Serving & Inference
NVIDIA Triton Inference Server
NVIDIA Triton Inference Server is an open-source inference serving platform that drives up GPU utilization through dynamic batching and concurrent model execution, and serves models from multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX Runtime, and more) behind a single endpoint.
Triton Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ Triton Inference Server │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ HTTP/gRPC Endpoints │ │
│ │ /v2/models/{model}/infer │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Request Scheduler │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ Dynamic │ │ Sequence │ │ Priority │ │ │
│ │ │ Batching │ │ Batching │ │ Queuing │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Model Repository │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │TensorRT │ │PyTorch │ │TensorFlow│ │ ONNX │ │ │
│ │ │ Backend │ │ Backend │ │ Backend │ │ Backend │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ GPU Execution │ │
│ │ - Multi-GPU support - CUDA streams - Memory pooling │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
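The /v2/models/{model}/infer endpoint in the diagram speaks the KServe v2 inference protocol over plain HTTP. A minimal sketch of a raw request against it, assuming Triton is reachable on localhost:8000 and the text_classifier model defined later in this lesson is loaded:
import requests

# KServe v2 inference request: each input carries name, shape, datatype, data.
payload = {
    "inputs": [
        {"name": "input_ids", "shape": [1, 6], "datatype": "INT64",
         "data": [101, 2054, 2003, 1996, 2034, 102]},
        {"name": "attention_mask", "shape": [1, 6], "datatype": "INT64",
         "data": [1, 1, 1, 1, 1, 1]},
    ],
    "outputs": [{"name": "logits"}],
}
resp = requests.post("http://localhost:8000/v2/models/text_classifier/infer",
                     json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"])  # flat list of logits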
Model Repository Structure
model_repository/
├── text_classifier/
│ ├── config.pbtxt
│ ├── 1/
│ │ └── model.onnx
│ └── 2/
│ └── model.onnx
├── image_encoder/
│ ├── config.pbtxt
│ └── 1/
│ └── model.plan # TensorRT engine
└── llm_model/
├── config.pbtxt
└── 1/
└── model.pt
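Versioning is purely directory-based: Triton treats each numbered subdirectory as a model version, so publishing a new version is just adding the next number. A hypothetical helper that stages this, with the paths chosen only for illustration:
import shutil
from pathlib import Path

def publish_version(repo: Path, model: str, artifact: Path) -> Path:
    """Copy a model artifact into the next numbered version directory."""
    model_dir = repo / model
    existing = [int(p.name) for p in model_dir.iterdir() if p.name.isdigit()]
    version_dir = model_dir / str(max(existing, default=0) + 1)
    version_dir.mkdir(parents=True)
    # Keep the file name the backend expects (model.onnx, model.plan, model.pt).
    shutil.copy2(artifact, version_dir / artifact.name)
    return version_dir

# publish_version(Path("model_repository"), "text_classifier", Path("model.onnx"))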
Model Configuration
# config.pbtxt for text_classifier
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
{
name: "input_ids"
data_type: TYPE_INT64
dims: [ -1 ] # Dynamic sequence length
},
{
name: "attention_mask"
data_type: TYPE_INT64
dims: [ -1 ]
}
]
output [
{
name: "logits"
data_type: TYPE_FP32
dims: [ 2 ]
}
]
# Dynamic batching configuration
dynamic_batching {
preferred_batch_size: [ 8, 16, 32 ]
max_queue_delay_microseconds: 100000
}
# Instance group for GPU allocation
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
# Version policy
version_policy: { latest: { num_versions: 2 }}
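Once the model is loaded, the effective configuration (including any fields Triton auto-completed) can be read back from the server. A short sketch with the gRPC client, assuming Triton is reachable on localhost:8001:
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Effective config as the server loaded it, returned as a plain dict.
config = client.get_model_config("text_classifier", as_json=True)["config"]
print(config.get("max_batch_size"), config.get("dynamic_batching"))

# Per-version readiness under the version policy above.
print(client.is_model_ready("text_classifier", model_version="2"))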
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-inference
namespace: ml-serving
spec:
replicas: 3
selector:
matchLabels:
app: triton
template:
metadata:
labels:
app: triton
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.01-py3
command: ["tritonserver"]
args:
- "--model-repository=s3://models/triton-repo"
- "--model-control-mode=poll"
- "--repository-poll-secs=30"
- "--strict-model-config=false"
- "--log-verbose=1"
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: grpc
- containerPort: 8002
name: metrics
resources:
requests:
nvidia.com/gpu: 2
memory: "32Gi"
cpu: "8"
limits:
nvidia.com/gpu: 2
memory: "64Gi"
cpu: "16"
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 60
periodSeconds: 10
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 60
periodSeconds: 5
volumeMounts:
- name: shm
mountPath: /dev/shm
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: "16Gi"
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
name: triton-service
namespace: ml-serving
spec:
selector:
app: triton
ports:
- name: http
port: 8000
- name: grpc
port: 8001
- name: metrics
port: 8002
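A quick smoke test can mirror the liveness and readiness probes from a debug pod or any host that resolves the service. A sketch with the HTTP client, where the in-cluster DNS name is an assumption:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton-service.ml-serving:8000")

print("live: ", client.is_server_live())   # same check as /v2/health/live
print("ready:", client.is_server_ready())  # same check as /v2/health/ready

# List what the polling model-control mode has picked up from the repository.
for model in client.get_model_repository_index():
    print(model["name"], model.get("state"))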
Advanced Features
Ensemble Models
# Ensemble pipeline: preprocess -> model -> postprocess
name: "ensemble_pipeline"
platform: "ensemble"
max_batch_size: 64
input [
{ name: "raw_text" data_type: TYPE_STRING dims: [ 1 ] }
]
output [
{ name: "predictions" data_type: TYPE_FP32 dims: [ 10 ] }
]
ensemble_scheduling {
step [
{
model_name: "tokenizer"
model_version: -1
input_map { key: "text" value: "raw_text" }
output_map { key: "tokens" value: "preprocessed" }
},
{
model_name: "bert_model"
model_version: -1
input_map { key: "input_ids" value: "preprocessed" }
output_map { key: "logits" value: "model_output" }
},
{
model_name: "postprocessor"
model_version: -1
input_map { key: "raw_logits" value: "model_output" }
output_map { key: "final" value: "predictions" }
}
]
}
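Ensemble inputs declared as TYPE_STRING travel as BYTES tensors on the wire, which the Python client packs from a numpy object array. A sketch of calling the pipeline end to end, assuming it is loaded on localhost:8001:
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Shape (1, 1): batch of one, with the single string element declared in dims.
text = np.array([["What does dynamic batching do?"]], dtype=np.object_)
inputs = [grpcclient.InferInput("raw_text", text.shape, "BYTES")]
inputs[0].set_data_from_numpy(text)

result = client.infer(
    model_name="ensemble_pipeline",
    inputs=inputs,
    outputs=[grpcclient.InferRequestedOutput("predictions")],
)
print(result.as_numpy("predictions"))  # shape (1, 10) per the ensemble config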
Model Warmup
# Warmup configuration for consistent latency.
# Every input the model declares must appear in the warmup sample.
model_warmup [
  {
    name: "warmup_requests"
    batch_size: 32
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT64
        dims: [ 128 ]
        zero_data: true
      }
    }
    inputs: {
      key: "attention_mask"
      value: {
        data_type: TYPE_INT64
        dims: [ 128 ]
        zero_data: true
      }
    }
    count: 10
  }
]
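One way to confirm warmup is paying off is to compare first-request latency with steady state right after a server restart. A rough client-side measurement sketch, reusing the text_classifier inputs with shapes matching the warmup sample above:
import time
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
ids = np.zeros((1, 128), dtype=np.int64)
mask = np.ones((1, 128), dtype=np.int64)

def timed_request() -> float:
    """Send one request and return client-observed latency in milliseconds."""
    inputs = [
        grpcclient.InferInput("input_ids", ids.shape, "INT64"),
        grpcclient.InferInput("attention_mask", mask.shape, "INT64"),
    ]
    inputs[0].set_data_from_numpy(ids)
    inputs[1].set_data_from_numpy(mask)
    start = time.perf_counter()
    client.infer("text_classifier", inputs)
    return (time.perf_counter() - start) * 1000

print(f"first request: {timed_request():.1f} ms")
print(f"steady state : {min(timed_request() for _ in range(20)):.1f} ms")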
Performance Metrics
# Query Triton metrics
curl http://triton-service:8002/metrics
# Key metrics
# - nv_inference_request_success: Successful requests
# - nv_inference_request_failure: Failed requests
# - nv_inference_queue_duration_us: Queue wait time
# - nv_inference_compute_infer_duration_us: Inference time
# - nv_gpu_utilization: GPU utilization percentage
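For a quick look without a Prometheus server in the loop, the endpoint can be scraped and filtered directly. A small sketch, assuming the metrics port is reachable as triton-service:8002:
import requests

metrics = requests.get("http://triton-service:8002/metrics", timeout=5).text

# Keep only the counters and gauges called out above; each line carries
# model/version/GPU labels plus the current value.
watch = (
    "nv_inference_request_success",
    "nv_inference_request_failure",
    "nv_inference_queue_duration_us",
    "nv_inference_compute_infer_duration_us",
    "nv_gpu_utilization",
)
for line in metrics.splitlines():
    if line.startswith(watch):
        print(line)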
Client Integration
import tritonclient.grpc as grpcclient
import numpy as np
client = grpcclient.InferenceServerClient(url="triton-service:8001")
# Prepare inputs (the model declares both input_ids and attention_mask)
input_ids = np.array([[101, 2054, 2003, 1996, 2034, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)
inputs = [
    grpcclient.InferInput("input_ids", input_ids.shape, "INT64"),
    grpcclient.InferInput("attention_mask", attention_mask.shape, "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)
# Inference request
outputs = [grpcclient.InferRequestedOutput("logits")]
result = client.infer(model_name="text_classifier", inputs=inputs, outputs=outputs)
predictions = result.as_numpy("logits")
print(f"Predictions: {predictions}")
Next lesson: Autoscaling and traffic management for inference workloads.