Model Serving & Inference

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server is an open-source inference serving platform that maximizes GPU utilization through dynamic batching, concurrent model execution, and a backend system that serves models from multiple frameworks (TensorRT, PyTorch, TensorFlow, ONNX Runtime, and more) behind a single API.

Triton Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                  Triton Inference Server                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    HTTP/gRPC Endpoints                       │    │
│  │              /v2/models/{model}/infer                        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Request Scheduler                          │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │    │
│  │  │   Dynamic    │  │  Sequence    │  │   Priority   │       │    │
│  │  │   Batching   │  │   Batching   │  │   Queuing    │       │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   Model Repository                           │    │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐       │    │
│  │  │TensorRT  │ │PyTorch   │ │TensorFlow│ │  ONNX    │       │    │
│  │  │ Backend  │ │ Backend  │ │ Backend  │ │ Backend  │       │    │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    GPU Execution                             │    │
│  │  - Multi-GPU support  - CUDA streams  - Memory pooling      │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
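The /v2 endpoints follow the KServe inference protocol, so any HTTP client can call them directly on port 8000. A minimal sketch using Python's requests library against the text_classifier model configured later in this lesson (host name and token IDs are placeholders):

import requests

# KServe v2 inference request body: one entry per model input
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [101, 2054, 2003, 1996, 2034, 102],
        },
        {
            "name": "attention_mask",
            "shape": [1, 6],
            "datatype": "INT64",
            "data": [1, 1, 1, 1, 1, 1],
        },
    ]
}

resp = requests.post(
    "http://triton-service:8000/v2/models/text_classifier/infer",
    json=payload,
)
resp.raise_for_status()
print(resp.json()["outputs"][0]["data"])  # logits as a flat list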

Model Repository Structure

model_repository/
├── text_classifier/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── image_encoder/
│   ├── config.pbtxt
│   └── 1/
│       └── model.plan  # TensorRT engine
└── llm_model/
    ├── config.pbtxt
    └── 1/
        └── model.pt
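Rolling out a new model version is just a matter of dropping the artifact into the next numbered directory; with poll-mode model control (shown in the deployment below) Triton picks it up automatically. A rough sketch, assuming a local copy of the repository and an already-exported ONNX file at a hypothetical path exported/model.onnx:

import shutil
from pathlib import Path

repo = Path("model_repository/text_classifier")

# Next version = highest existing numeric directory + 1
versions = [int(p.name) for p in repo.iterdir() if p.is_dir() and p.name.isdigit()]
next_version = repo / str(max(versions, default=0) + 1)
next_version.mkdir(parents=True, exist_ok=True)

# File name must match what the backend expects (model.onnx for ONNX Runtime)
shutil.copy("exported/model.onnx", next_version / "model.onnx")  # hypothetical source path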

Model Configuration

# config.pbtxt for text_classifier
name: "text_classifier"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]  # Dynamic sequence length
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]

output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]

# Dynamic batching configuration
dynamic_batching {
  preferred_batch_size: [ 8, 16, 32 ]
  max_queue_delay_microseconds: 100000
}

# Instance group for GPU allocation
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Version policy
version_policy: { latest: { num_versions: 2 }}
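Once the model is loaded, the effective configuration (including any fields Triton filled in automatically) can be read back from the server. A small sketch with the tritonclient gRPC package, assuming the service name used later in this lesson:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton-service:8001")

# Returns a ModelConfigResponse protobuf; .config mirrors config.pbtxt
response = client.get_model_config("text_classifier")
print(response.config.max_batch_size)    # 64 if the config above is in effect
print(response.config.dynamic_batching)  # preferred sizes and queue delay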

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.01-py3
        command: ["tritonserver"]
        args:
        - "--model-repository=s3://models/triton-repo"
        - "--model-control-mode=poll"
        - "--repository-poll-secs=30"
        - "--strict-model-config=false"
        - "--log-verbose=1"
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        - containerPort: 8002
          name: metrics
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: "32Gi"
            cpu: "8"
          limits:
            nvidia.com/gpu: 2
            memory: "64Gi"
            cpu: "16"
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 5
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
  namespace: ml-serving
spec:
  selector:
    app: triton
  ports:
  - name: http
    port: 8000
  - name: grpc
    port: 8001
  - name: metrics
    port: 8002
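After the deployment rolls out, the same health endpoints used by the probes can be checked from a client, along with per-model readiness. A quick sketch using the gRPC client and the in-cluster DNS name of the Service above:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton-service.ml-serving:8001")

print(client.is_server_live())                   # maps to /v2/health/live
print(client.is_server_ready())                  # maps to /v2/health/ready
print(client.is_model_ready("text_classifier"))  # model loaded and servable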

Advanced Features

Ensemble Models

# Ensemble pipeline: preprocess -> model -> postprocess
name: "ensemble_pipeline"
platform: "ensemble"
max_batch_size: 64

input [
  { name: "raw_text" data_type: TYPE_STRING dims: [ 1 ] }
]

output [
  { name: "predictions" data_type: TYPE_FP32 dims: [ 10 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "raw_text" }
      output_map { key: "tokens" value: "preprocessed" }
    },
    {
      model_name: "bert_model"
      model_version: -1
      input_map { key: "input_ids" value: "preprocessed" }
      output_map { key: "logits" value: "model_output" }
    },
    {
      model_name: "postprocessor"
      model_version: -1
      input_map { key: "raw_logits" value: "model_output" }
      output_map { key: "final" value: "predictions" }
    }
  ]
}
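From the client's point of view the ensemble behaves like a single model: it accepts raw text and returns the post-processed predictions, with the tokenizer, model, and postprocessor steps running server-side. A sketch of a gRPC call, assuming the pipeline above is loaded:

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="triton-service:8001")

# TYPE_STRING tensors are sent as BYTES; shape includes the batch dimension
text = np.array([["What is the first rule?".encode("utf-8")]], dtype=np.object_)
inputs = [grpcclient.InferInput("raw_text", text.shape, "BYTES")]
inputs[0].set_data_from_numpy(text)

outputs = [grpcclient.InferRequestedOutput("predictions")]
result = client.infer(model_name="ensemble_pipeline", inputs=inputs, outputs=outputs)
print(result.as_numpy("predictions"))  # shape (1, 10)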

Model Warmup

# Warmup configuration for consistent latency
model_warmup [
  {
    name: "warmup_requests"
    batch_size: 32
    # Every declared model input needs a warmup entry
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT64
        dims: [ 128 ]
        zero_data: true
      }
    }
    inputs: {
      key: "attention_mask"
      value: {
        data_type: TYPE_INT64
        dims: [ 128 ]
        zero_data: true
      }
    }
    count: 10
  }
]

Performance Metrics

# Query Triton metrics
curl http://triton-service:8002/metrics

# Key metrics
# - nv_inference_request_success: Successful requests
# - nv_inference_request_failure: Failed requests
# - nv_inference_queue_duration_us: Queue wait time
# - nv_inference_compute_infer_duration_us: Inference time
# - nv_gpu_utilization: GPU utilization percentage
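The metrics endpoint serves Prometheus text format, so it can be scraped by a Prometheus server or spot-checked ad hoc. A minimal sketch that pulls the endpoint and prints only the inference counters:

import requests

metrics = requests.get("http://triton-service:8002/metrics", timeout=5).text

# One series per model/version label set
for line in metrics.splitlines():
    if line.startswith("nv_inference_") and not line.startswith("#"):
        print(line)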

Client Integration

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="triton-service:8001")

# Prepare both declared inputs (input_ids and attention_mask)
input_ids = np.array([[101, 2054, 2003, 1996, 2034, 102]], dtype=np.int64)
attention_mask = np.ones_like(input_ids)

inputs = [
    grpcclient.InferInput("input_ids", input_ids.shape, "INT64"),
    grpcclient.InferInput("attention_mask", attention_mask.shape, "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)

# Inference request
outputs = [grpcclient.InferRequestedOutput("logits")]
result = client.infer(model_name="text_classifier", inputs=inputs, outputs=outputs)

predictions = result.as_numpy("logits")
print(f"Predictions: {predictions}")
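
Dynamic batching pays off when many requests are in flight at once, which the gRPC client supports through async_infer with a completion callback. A brief sketch reusing the client, inputs, and outputs defined above:

import queue

results = queue.Queue()

def on_complete(result, error):
    # Invoked from a background thread when each response arrives
    results.put(error if error else result.as_numpy("logits"))

# Fire several requests without waiting; the server batches them together
for _ in range(8):
    client.async_infer(
        model_name="text_classifier",
        inputs=inputs,
        callback=on_complete,
        outputs=outputs,
    )

for _ in range(8):
    print(results.get())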

Next lesson: Autoscaling and traffic management for inference workloads.
