Model Serving & Inference
KServe: Kubernetes-Native Model Serving
KServe is a CNCF incubating project that provides a standardized, scalable model serving solution for Kubernetes. It abstracts away infrastructure complexity, enabling data scientists to deploy models without deep Kubernetes expertise.
KServe Architecture
┌─────────────────────────────────────────────────────────────────────┐
│ KServe Architecture │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ InferenceService CRD │ │
│ │ ┌────────────┐ ┌────────────┐ ┌────────────┐ │ │
│ │ │ Predictor │ │Transformer │ │ Explainer │ │ │
│ │ │ (Model) │ │ (Pre/Post) │ │ (SHAP) │ │ │
│ │ └────────────┘ └────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ↓ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Knative Serving / Raw Deployment │ │
│ │ - Serverless autoscaling (scale-to-zero) │ │
│ │ - Traffic splitting (canary/blue-green) │ │
│ │ - Revision management │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ Istio/ │ │ Model Storage │ │ Monitoring │ │
│ │ Gateway │ │ (S3/GCS/PVC) │ │ (Prometheus/Grafana)│ │
│ └─────────────┘ └─────────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
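The Transformer component in the diagram wraps the predictor with pre- and post-processing. Below is a minimal sketch of an InferenceService that adds a custom transformer container; the image name is hypothetical, and such a service is typically built with the KServe Python SDK:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-transformer
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
  transformer:
    containers:
      - name: kserve-container   # conventional container name for custom components
        # Hypothetical image: feature encoding before predict, label mapping after
        image: registry.example.com/iris-transformer:latest
        resources:
          requests:
            cpu: "100m"
            memory: "256Mi"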
KServe Installation (2025-2026)
# Install the KServe controller (RawDeployment mode - no Istio/Knative required)
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml
# Or, with Knative Serving installed, use Serverless mode for scale-to-zero
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve-knative.yaml
# Verify installation
kubectl get pods -n kserve
kubectl get crd | grep serving
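Which execution path a service takes (Knative revisions vs. plain Deployments) is controlled by the deployment mode; the cluster-wide default lives in the inferenceservice-config ConfigMap in the kserve namespace. A hedged sketch of overriding it per service with the serving.kserve.io/deploymentMode annotation:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
  annotations:
    # "Serverless" (Knative) or "RawDeployment" (plain Deployment + HPA)
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"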
Basic InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"
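After applying the manifest (assume it was saved as sklearn-iris.yaml), the controller reports readiness and the routable URL on the resource status; the output shown below is illustrative:
kubectl apply -f sklearn-iris.yaml
kubectl get inferenceservice sklearn-iris -n ml-serving
# NAME           URL                                          READY   AGE
# sklearn-iris   http://sklearn-iris.ml-serving.example.com   True    2m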
GPU-Enabled Model Serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  namespace: ml-serving
  annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/metric: "gpu-utilization"
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      runtime: kserve-torchserve
      storageUri: "s3://models/llama-7b"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "32Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: 1
          memory: "48Gi"
          cpu: "16"
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
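Rolling out a new model version uses the traffic-splitting layer from the architecture. A hedged sketch of a canary rollout with canaryTrafficPercent (Serverless/Knative mode; the updated storageUri is a hypothetical path):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    # Send 10% of traffic to the latest revision, 90% to the previously ready one
    canaryTrafficPercent: 10
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://my-bucket/models/sklearn/v2"   # hypothetical updated model
Promotion typically means raising canaryTrafficPercent in steps and finally removing the field once the new revision is trusted.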
Supported Model Formats
| Framework | Format | Runtime |
|---|---|---|
| TensorFlow | SavedModel | TF Serving |
| PyTorch | TorchScript/MAR | TorchServe |
| ONNX | .onnx | Triton |
| XGBoost | .bst | XGBoost Server |
| LightGBM | .txt | LightGBM Server |
| Scikit-learn | .pkl | SKLearn Server |
| HuggingFace | Transformers | vLLM/TGI |
| Custom | Any | Custom Runtime |
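For the HuggingFace row, recent KServe releases ship a Hugging Face serving runtime backed by vLLM. A hedged sketch; the runtime name follows the kserve-huggingfaceserver convention and should be verified with kubectl get clusterservingruntimes:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: hf-llm
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-huggingfaceserver   # assumed bundled runtime name; may differ by version
      storageUri: "s3://models/llama-7b"
      resources:
        limits:
          nvidia.com/gpu: 1
          memory: "32Gi"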
Multi-Model Serving (ModelMesh)
# Efficient serving of many models with shared resources
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-runtime
spec:
  supportedModelFormats:
    - name: tensorflow
      version: "2"
    - name: pytorch
      version: "2"
    - name: onnx
      version: "1"
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.01-py3
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-service
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      runtime: triton-runtime
      storageUri: "s3://models/bert-base"
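To actually pack many models onto shared runtime pods, the ModelMesh controller must be installed and each InferenceService opts in via its deployment-mode annotation. A hedged sketch of a second model reusing the same Triton runtime; the model name and storage path are hypothetical:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: bert-sentiment
  namespace: ml-serving
  annotations:
    # Place this model on shared ModelMesh runtime pods instead of its own Deployment
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      runtime: triton-runtime
      storageUri: "s3://models/bert-sentiment-onnx"   # hypothetical path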
Inference Request
# Get service URL
SERVICE_URL=$(kubectl get inferenceservice sklearn-iris \
  -n ml-serving -o jsonpath='{.status.url}')
# Send prediction request
curl -v "${SERVICE_URL}/v1/models/sklearn-iris:predict" \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [6.8, 2.8, 4.8, 1.4],
      [6.0, 3.4, 4.5, 1.6]
    ]
  }'
# Response format
{
  "predictions": [1, 1]
}
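Newer runtimes also expose the Open Inference Protocol (v2 REST). A hedged sketch of the equivalent request; the tensor name and datatype are assumptions for the sklearn server:
curl -s "${SERVICE_URL}/v2/models/sklearn-iris/infer" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "input-0",
        "shape": [2, 4],
        "datatype": "FP64",
        "data": [6.8, 2.8, 4.8, 1.4, 6.0, 3.4, 4.5, 1.6]
      }
    ]
  }'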
Production Considerations
Autoscaling metrics (see the sketch after this list):
- CPU/Memory utilization
- GPU utilization
- Requests per second
- Custom Prometheus metrics
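These can be set directly on the component spec instead of annotations; a hedged sketch using the scaleMetric and scaleTarget fields from the v1beta1 spec:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: rps     # alternatives include cpu, memory, concurrency
    scaleTarget: 100     # target roughly 100 requests/second per replica
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"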
High availability (a PodDisruptionBudget sketch follows this list):
- Pod disruption budgets
- Multi-zone deployment
- Health probe configuration
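A minimal PodDisruptionBudget sketch for the predictor pods, assuming KServe's standard serving.kserve.io/inferenceservice pod label:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sklearn-iris-pdb
  namespace: ml-serving
spec:
  minAvailable: 1     # keep at least one predictor pod during voluntary disruptions
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: sklearn-iris   # label assumed on KServe-managed pods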
Next lesson: NVIDIA Triton Inference Server for high-performance multi-framework serving.