Model Serving & Inference
LLM Serving with vLLM and TGI
4 min read
Serving large language models requires inference engines purpose-built for transformer workloads: they must manage the KV cache, batch requests continuously, and shard weights across GPUs. vLLM and Text Generation Inference (TGI) are two of the most widely used engines for production LLM deployment on Kubernetes.
LLM Serving Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        LLM Serving Stack                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                API Gateway / Load Balancer                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   LLM Inference Engine                    │  │
│  │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐  │  │
│  │  │     vLLM      │  │      TGI      │  │   TensorRT    │  │  │
│  │  │  (PagedAttn)  │  │ (HuggingFace) │  │      LLM      │  │  │
│  │  └───────────────┘  └───────────────┘  └───────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Optimizations                       │  │
│  │   - Continuous batching      - KV cache management        │  │
│  │   - Tensor parallelism       - Speculative decoding       │  │
│  │   - Flash Attention          - Quantization (AWQ/GPTQ)    │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                   Multi-GPU / Multi-Node                  │  │
│  │  [GPU 0] [GPU 1] [GPU 2] [GPU 3] [GPU 4] [GPU 5] [GPU 6]  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama
  namespace: ml-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm-llama
  template:
    metadata:
      labels:
        app: vllm-llama
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.0
        command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=/models/llama-3-70b"
        - "--served-model-name=llama-3-70b"  # name clients pass in the OpenAI "model" field
        - "--tensor-parallel-size=4"         # must match the nvidia.com/gpu request below
        - "--max-model-len=8192"
        - "--gpu-memory-utilization=0.9"
        - "--enable-prefix-caching"
        - "--enable-chunked-prefill"
        - "--max-num-seqs=256"
        - "--quantization=awq"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            nvidia.com/gpu: 4
            memory: "128Gi"
            cpu: "32"
          limits:
            nvidia.com/gpu: 4
            memory: "256Gi"
            cpu: "64"
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0,1,2,3"
        - name: NCCL_DEBUG
          value: "INFO"
        volumeMounts:
        - name: models
          mountPath: /models
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: model-storage
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "32Gi"
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ml-serving
spec:
  selector:
    app: vllm-llama
  ports:
  - port: 8000
    targetPort: 8000
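The Deployment above defines no health checks, which matters for large models: vLLM can take several minutes to load and shard 70B weights, during which the pod should neither receive traffic nor be restarted. A sketch of probes against vLLM's /health endpoint follows; the timing values are assumptions to tune for your model size:

# Added under the vllm container in spec.template.spec.containers
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60      # allow up to ~10 minutes for weight loading
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 15
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30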
TGI (Text Generation Inference) Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-mistral
  namespace: ml-serving
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tgi-mistral
  template:
    metadata:
      labels:
        app: tgi-mistral
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:2.0
        args:
        - "--model-id=mistralai/Mistral-7B-Instruct-v0.3"
        - "--num-shard=2"
        - "--max-batch-size=64"
        - "--max-input-length=4096"
        - "--max-total-tokens=8192"
        - "--quantize=awq"
        - "--trust-remote-code"
        ports:
        - containerPort: 80
          name: http
        resources:
          requests:
            nvidia.com/gpu: 2
            memory: "48Gi"
          limits:
            nvidia.com/gpu: 2
            memory: "64Gi"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: MAX_CONCURRENT_REQUESTS
          value: "128"
        volumeMounts:
        - name: cache
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: tgi-cache
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "16Gi"
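No Service is defined for TGI above, so the smoke test below assumes one named tgi-mistral in the ml-serving namespace forwarding to port 80. TGI's native /generate endpoint takes an inputs string plus generation parameters:

# Service name and namespace are assumptions; adjust to your setup
curl http://tgi-mistral.ml-serving/generate \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Explain Kubernetes in one sentence.",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7}
  }'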
KServe LLM Runtime
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: vllm-runtime
spec:
  annotations:
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
  supportedModelFormats:
  - name: vllm
    version: "1"
    autoSelect: true
  containers:
  - name: vllm
    image: vllm/vllm-openai:v0.6.0
    command: ["python", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - "--model=/mnt/models"
    - "--port=8080"
    ports:
    - containerPort: 8080
      protocol: TCP
    env:
    - name: STORAGE_URI
      value: "{{.StorageUri}}"
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-vllm
spec:
  predictor:
    model:
      modelFormat:
        name: vllm
      runtime: vllm-runtime
      storageUri: "s3://models/llama-3-8b"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "32Gi"
        limits:
          nvidia.com/gpu: 1
          memory: "48Gi"
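Once applied, KServe reports readiness and a URL on the InferenceService status; the hostname depends on your ingress and domain configuration, so the one below is illustrative. Because the runtime serves the model from /mnt/models, check /v1/models first to see which model name requests must reference:

kubectl get inferenceservice llama-vllm
# URL and READY appear in the output; the hostname below is illustrative
curl http://llama-vllm.ml-serving.example.com/v1/models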
Multi-Node LLM Serving (70B+ Models)
# Ray Cluster for distributed inference
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: llm-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: '0.0.0.0'
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
  workerGroupSpecs:
  - groupName: gpu-workers   # required by the KubeRay CRD; the name is arbitrary
    replicas: 3
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray-ml:2.9.0-gpu
          resources:
            requests:
              nvidia.com/gpu: 8
              memory: "512Gi"
            limits:
              nvidia.com/gpu: 8
              memory: "768Gi"
        nodeSelector:
          nvidia.com/gpu.product: NVIDIA-H100-80GB
---
# vLLM on Ray for a 70B model across 4 nodes (32 GPUs)
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: llama-70b-service
spec:
  serveConfigV2: |
    applications:
    - name: llm
      route_prefix: /
      import_path: vllm_ray:deployment
      deployments:
      - name: VLLMDeployment
        num_replicas: 1
        ray_actor_options:
          num_gpus: 32
        user_config:
          model: meta-llama/Llama-3-70B-Instruct
          tensor_parallel_size: 8
          pipeline_parallel_size: 4
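KubeRay fronts the Serve application with a generated Service (by convention <rayservice-name>-serve-svc on port 8000). A quick way to verify the rollout and reach the application locally is sketched below; the exact HTTP routes depend on what vllm_ray:deployment registers, so the final curl is an assumption:

kubectl get rayservice llama-70b-service
# Port-forward the generated Serve service and query it
kubectl port-forward svc/llama-70b-service-serve-svc 8000:8000
# Assumes the Serve app exposes OpenAI-style routes
curl http://localhost:8000/v1/models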
OpenAI-Compatible API Usage
# vLLM exposes OpenAI-compatible endpoints
curl http://vllm-service:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain Kubernetes in simple terms."}
    ],
    "max_tokens": 500,
    "temperature": 0.7,
    "stream": true
  }'
# Text completion endpoint (also streamed)
curl http://vllm-service:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-70b",
    "prompt": "The benefits of containerization are:",
    "max_tokens": 200,
    "stream": true
  }'
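Two smaller endpoints help with debugging: /v1/models reports the served model name (the "model" field in requests must match it, which is why the Deployment sets --served-model-name), and /health is a cheap liveness check:

# List served models; the "id" field is what requests must reference
curl http://vllm-service:8000/v1/models

# Plain 200 response when the engine is up
curl -i http://vllm-service:8000/health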
Performance Optimization
| Technique | Throughput Gain | Memory Reduction |
|---|---|---|
| PagedAttention | 2-4x | 50-70% |
| AWQ Quantization | 1.2x | 75% |
| Continuous Batching | 10-20x | - |
| Flash Attention 2 | 1.5-2x | 40% |
| Speculative Decoding | 2-3x | - |
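Most rows in this table map to flags already shown: PagedAttention and continuous batching are on by default in vLLM, AWQ is enabled with --quantization=awq, and tensor parallelism with --tensor-parallel-size. Speculative decoding additionally needs a smaller draft model; a sketch of the relevant flags, where the draft-model path is illustrative and flag names match the v0.6 series used above:

# Draft model path is an example; use a small model from the same family
python -m vllm.entrypoints.openai.api_server \
  --model /models/llama-3-70b \
  --tensor-parallel-size 4 \
  --speculative-model /models/llama-3-8b \
  --num-speculative-tokens 5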
Next module: Service mesh and networking for ML workloads with Istio.