Model Serving & Inference

LLM Serving with vLLM and TGI

4 min read

Large Language Model serving requires specialized inference engines optimized for transformer architectures. vLLM and Text Generation Inference (TGI) are two of the most widely used engines for production LLM deployment on Kubernetes.

LLM Serving Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    LLM Serving Stack                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                   API Gateway / Load Balancer                │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              LLM Inference Engine                            │    │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │    │
│  │  │    vLLM     │  │    TGI      │  │  TensorRT   │          │    │
│  │  │  (PagedAttn)│  │(HuggingFace)│  │     LLM     │          │    │
│  │  └─────────────┘  └─────────────┘  └─────────────┘          │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Optimizations                                   │    │
│  │  - Continuous batching    - KV cache management              │    │
│  │  - Tensor parallelism     - Speculative decoding            │    │
│  │  - Flash Attention        - Quantization (AWQ/GPTQ)         │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Multi-GPU / Multi-Node                          │    │
│  │  [GPU 0] [GPU 1] [GPU 2] [GPU 3] [GPU 4] [GPU 5] [GPU 6]    │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

vLLM Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=/models/llama-3-70b"
        # Name clients pass as "model" in the OpenAI API
        - "--served-model-name=llama-3-70b"
        - "--tensor-parallel-size=4"
        - "--max-model-len=8192"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        - "--max-num-seqs=256"
        # Requires AWQ-quantized weights at /models/llama-3-70b
        - "--quantization=awq"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
            cpu: "32"
          limits:
            nvidia.com/gpu: 4
            memory: "256Gi"
            cpu: "64"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
        - name: NCCL_DEBUG
          value: "INFO"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-storage
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "32Gi"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-llama
  ports:
  - port: 8000
    targetPort: 8000
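
Once the pods are Ready, a quick smoke test confirms that the engine loaded the model and that the Service routes traffic. A minimal sketch in Python, assuming access via kubectl port-forward (in-cluster, http://vllm-service.ml-serving:8000 works the same); /health and /v1/models are endpoints of vLLM's OpenAI-compatible server.

# Smoke test for the vLLM deployment above.
# Assumes: kubectl -n ml-serving port-forward svc/vllm-service 8000:8000
import requests

BASE_URL = "http://localhost:8000"  # or http://vllm-service.ml-serving:8000 in-cluster

# /health returns 200 once the engine has loaded the model
requests.get(f"{BASE_URL}/health", timeout=10).raise_for_status()

# The served model id should match what clients send as "model"
models = requests.get(f"{BASE_URL}/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])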

TGI (Text Generation Inference) Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-mistral
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tgi-mistral
  template:
    metadata:
      labels:
        app: tgi-mistral
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:2.0
        args:
        - "--model-id=mistralai/Mistral-7B-Instruct-v0.3"
        - "--num-shard=2"
        - "--max-batch-size=64"
        - "--max-input-length=4096"
        - "--max-total-tokens=8192"
        # AWQ expects a pre-quantized checkpoint; point --model-id at AWQ weights
        # or use an on-the-fly scheme such as --quantize=eetq
        - "--quantize=awq"
        - "--trust-remote-code"
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: "48Gi"
          limits:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: MAX_CONCURRENT_REQUESTS
          value: "128"
        volumeMounts:
        - name: cache
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: tgi-cache
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
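
The deployment can be exercised with the huggingface_hub client, which wraps TGI's native /generate API. A minimal sketch, assuming the pods are reachable at the URL below (no Service is defined above, so port-forward a pod or front it with one):

# Query TGI via the huggingface_hub InferenceClient.
# URL is an assumption, e.g.: kubectl -n ml-serving port-forward deploy/tgi-mistral 8080:80
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Single-shot generation
print(client.text_generation(
    "Explain continuous batching in one paragraph.",
    max_new_tokens=200,
    temperature=0.7,
))

# Token-by-token streaming
for token in client.text_generation("List three Kubernetes objects:", max_new_tokens=64, stream=True):
    print(token, end="", flush=True)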

KServe LLM Runtime

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
  supportedModelFormats:
  - name: vllm
    version: "1"
    autoSelect: true
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.6.0
    command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - "--model=/mnt/models"
    - "--port=8080"
    ports:
    - containerPort: 8080
      protocol: TCP
    env:
    - name: STORAGE_URI
      value: "{{.StorageUri}}"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-vllm
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: "s3://models/llama-3-8b"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          memory: "48Gi"
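
Because the runtime wraps vLLM's OpenAI server, the InferenceService can be called with the standard openai client. A sketch under assumptions: the hostname below is illustrative (read the real one from the InferenceService status), and the served model name defaults to the load path.

# Query the llama-vllm InferenceService through its OpenAI-compatible API.
# Real URL: kubectl get inferenceservice llama-vllm -o jsonpath='{.status.url}'
from openai import OpenAI

client = OpenAI(
    base_url="http://llama-vllm.default.example.com/v1",  # illustrative hostname
    api_key="not-used",  # vLLM ignores the key unless --api-key is set
)

resp = client.completions.create(
    model="/mnt/models",  # default served name when vLLM loads from a path
    prompt="KServe is",
    max_tokens=64,
)
print(resp.choices[0].text)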

Multi-Node LLM Serving (70B+ Models)

# Ray Cluster for distributed inference
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: llm-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
  workerGroupSpecs:
  - groupName: gpu-workers  # required by KubeRay; any unique name works
    replicas: 3
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB
---
# vLLM on Ray for 70B model across 4 nodes (32 GPUs)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-70b-service
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path: vllm_ray:deployment
      deployments:
      - name: VLLMDeployment
        num_replicas: 1
        ray_actor_options:
          num_gpus: 32
        user_config:
          model: meta-llama/Llama-3-70B-Instruct
          tensor_parallel_size: 8
          pipeline_parallel_size: 4
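
The import_path above points at a user-supplied vllm_ray module. A minimal sketch of what such a module might contain, assuming Ray Serve plus vLLM's AsyncLLMEngine; the user_config plumbing, OpenAI-compatible routing, and exact GPU placement (vLLM's Ray executor reserves its own workers) are left out.

# Hypothetical vllm_ray.py backing import_path "vllm_ray:deployment" above.
from ray import serve
from starlette.requests import Request

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.sampling_params import SamplingParams
from vllm.utils import random_uuid


@serve.deployment  # deployment name defaults to the class name, "VLLMDeployment"
class VLLMDeployment:
    def __init__(self, model: str, tensor_parallel_size: int, pipeline_parallel_size: int):
        # TP x PP degree must match the GPUs reserved for this replica (8 x 4 = 32)
        engine_args = AsyncEngineArgs(
            model=model,
            tensor_parallel_size=tensor_parallel_size,
            pipeline_parallel_size=pipeline_parallel_size,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

    async def __call__(self, request: Request) -> dict:
        body = await request.json()
        params = SamplingParams(max_tokens=body.get("max_tokens", 256))
        final = None
        # generate() yields incremental RequestOutputs; keep the last one
        async for output in self.engine.generate(body["prompt"], params, random_uuid()):
            final = output
        return {"text": final.outputs[0].text}


# Bound application imported by Ray Serve as "vllm_ray:deployment"
deployment = VLLMDeployment.bind(
    model="meta-llama/Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=4,
)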

OpenAI-Compatible API Usage

# vLLM exposes OpenAI-compatible endpoints
curl http://vllm-service:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Kubernetes in simple terms."}
    ],
    "max_tokens": 500,
    "temperature": 0.7
  }'

# Streaming response
curl http://vllm-service:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "prompt": "The benefits of containerization are:",
    "max_tokens": 200,
    "stream": true
  }'
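
The same endpoints work from the official openai Python client by pointing base_url at the Service; a minimal streaming sketch (the api_key value is a placeholder, vLLM ignores it unless --api-key is set):

# Streaming chat completion against the vLLM OpenAI-compatible server
from openai import OpenAI

client = OpenAI(base_url="http://vllm-service:8000/v1", api_key="not-used")

stream = client.chat.completions.create(
    model="llama-3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Kubernetes in simple terms."},
    ],
    max_tokens=500,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)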

Performance Optimization

Technique               Throughput Gain    Memory Reduction
PagedAttention          2-4x               50-70%
AWQ Quantization        1.2x               75%
Continuous Batching     10-20x             -
Flash Attention 2       1.5-2x             40%
Speculative Decoding    2-3x               -
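
Most of these techniques map directly to vLLM engine arguments; PagedAttention and continuous batching are built into the engine and need no flag, and FlashAttention is selected automatically on supported GPUs. A hedged sketch using the offline LLM API, with placeholder model paths:

# How the optimizations above map to vLLM engine arguments (paths are placeholders)
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3-70b-awq",        # placeholder AWQ-quantized checkpoint
    quantization="awq",                      # AWQ weight quantization
    tensor_parallel_size=4,                  # tensor parallelism across 4 GPUs
    enable_prefix_caching=True,              # reuse KV cache across shared prefixes
    gpu_memory_utilization=0.9,              # VRAM fraction for weights + KV cache
    speculative_model="/models/llama-3-8b",  # placeholder draft model for speculative decoding
    num_speculative_tokens=4,
)

outputs = llm.generate(
    ["The benefits of containerization are:"],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)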

Next module: Service mesh and networking for ML workloads with Istio.
