Model Serving & Inference

KServe: Kubernetes-Native Model Serving

KServe is a CNCF incubating project that provides a standardized, scalable model serving solution for Kubernetes. It abstracts away infrastructure complexity, enabling data scientists to deploy models without deep Kubernetes expertise.

KServe Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    KServe Architecture                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │                   InferenceService CRD                        │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐              │   │
│  │  │ Predictor  │  │Transformer │  │ Explainer  │              │   │
│  │  │  (Model)   │  │ (Pre/Post) │  │  (SHAP)    │              │   │
│  │  └────────────┘  └────────────┘  └────────────┘              │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│                              ↓                                       │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │              Knative Serving / Raw Deployment                 │   │
│  │  - Serverless autoscaling (scale-to-zero)                    │   │
│  │  - Traffic splitting (canary/blue-green)                     │   │
│  │  - Revision management                                        │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                              │                                       │
│  ┌─────────────┐   ┌─────────────────┐   ┌─────────────────────┐   │
│  │   Istio/    │   │  Model Storage  │   │    Monitoring       │   │
│  │   Gateway   │   │  (S3/GCS/PVC)   │   │ (Prometheus/Grafana)│   │
│  └─────────────┘   └─────────────────┘   └─────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
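
The Predictor is the only required component; the Transformer (pre/post-processing) and Explainer are optional steps in the inference graph. A minimal sketch of attaching a custom Transformer to a predictor (the transformer image is a hypothetical placeholder):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-with-transformer
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
  transformer:
    containers:
    - name: kserve-container   # conventional container name for KServe components
      image: registry.example.com/iris-transformer:latest   # hypothetical pre/post-processing image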

KServe Installation (2025-2026)

# Install KServe in standalone (raw deployment) mode - no Istio/Knative required,
# but cert-manager must already be installed in the cluster
kubectl apply -f https://github.com/kserve/kserve/releases/download/v0.14.0/kserve.yaml

# Or use the quick-install script, which also sets up Knative and Istio for serverless mode
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.14/hack/quick_install.sh" | bash

# Verify installation
kubectl get pods -n kserve
kubectl get crd | grep serving
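
Before creating InferenceServices, it can help to wait until the controller pods report ready (a small sketch using kubectl wait):

# Block until all KServe controller pods are Ready, or time out after 5 minutes
kubectl wait --for=condition=Ready pods --all -n kserve --timeout=300s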

Basic InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "2"
          memory: "4Gi"

GPU-Enabled Model Serving

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  namespace: ml-serving
  annotations:
    serving.kserve.io/autoscalerClass: hpa
    serving.kserve.io/metric: "gpu-utilization"
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      runtime: kserve-torchserve
      storageUri: "s3://models/llama-7b"
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: "32Gi"
          cpu: "8"
        limits:
          nvidia.com/gpu: 1
          memory: "48Gi"
          cpu: "16"
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
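
The architecture diagram lists traffic splitting among Knative's features; on the InferenceService side this is typically expressed with the predictor's canaryTrafficPercent field. A minimal sketch of rolling out a new model version for the service above (the -v2 storage path is hypothetical):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-inference
  namespace: ml-serving
spec:
  predictor:
    canaryTrafficPercent: 10          # send 10% of traffic to the latest revision
    model:
      modelFormat:
        name: pytorch
      runtime: kserve-torchserve
      storageUri: "s3://models/llama-7b-v2"   # hypothetical updated model location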

Supported Model Formats

Framework       Format              Runtime
---------       ------              -------
TensorFlow      SavedModel          TF Serving
PyTorch         TorchScript / MAR   TorchServe
ONNX            .onnx               Triton
XGBoost         .bst                XGBoost Server
LightGBM        .txt                LightGBM Server
Scikit-learn    .pkl                SKLearn Server
HuggingFace     Transformers        vLLM / TGI
Custom          Any                 Custom Runtime
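
For the HuggingFace row, recent KServe releases accept a huggingface model format and route it to an LLM-oriented runtime such as vLLM. A minimal sketch, with a hypothetical model location:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-example
  namespace: ml-serving
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      storageUri: "s3://models/llama-3-8b-instruct"   # hypothetical location
      resources:
        requests:
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1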

Multi-Model Serving (ModelMesh)

# Efficient serving of many models with shared resources
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-runtime
spec:
  multiModel: true   # pack many models onto shared runtime pods (ModelMesh)
  supportedModelFormats:
  - name: tensorflow
    version: "2"
  - name: pytorch
    version: "2"
  - name: onnx
    version: "1"
  containers:
  - name: triton
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: multi-model-service
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      runtime: triton-runtime
      storageUri: "s3://models/bert-base"

Inference Request

# Get service URL
SERVICE_URL=$(kubectl get inferenceservice sklearn-iris \
  -n ml-serving -o jsonpath='{.status.url}')

# Send prediction request
curl -v "${SERVICE_URL}/v1/models/sklearn-iris:predict" \
  -H "Content-Type: application/json" \
  -d '{
    "instances": [
      [6.8, 2.8, 4.8, 1.4],
      [6.0, 3.4, 4.5, 1.6]
    ]
  }'

# Response format
{
  "predictions": [1, 1]
}
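
Runtimes that implement the Open Inference Protocol also expose a v2 endpoint with explicit tensor metadata. A sketch of the same request in v2 form, assuming the predictor was deployed with protocolVersion: v2 (the input name, shape, and datatype are illustrative):

curl "${SERVICE_URL}/v2/models/sklearn-iris/infer" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "input-0",
        "shape": [2, 4],
        "datatype": "FP32",
        "data": [6.8, 2.8, 4.8, 1.4, 6.0, 3.4, 4.5, 1.6]
      }
    ]
  }'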

Production Considerations

Autoscaling metrics (see the sketch after this list):

  • CPU/Memory utilization
  • GPU utilization
  • Requests per second
  • Custom Prometheus metrics
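
Besides annotations, the per-component scaleMetric and scaleTarget fields are a common way to express these targets. A minimal sketch on the predictor (the values are illustrative):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 20
    scaleMetric: concurrency   # cpu, memory, concurrency, or rps, depending on the autoscaler
    scaleTarget: 5             # e.g. target concurrent requests per replica
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"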

High availability:

  • Pod disruption budgets (see the sketch below)
  • Multi-zone deployment
  • Health probes configuration
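
For pod disruption budgets, a standard Kubernetes PodDisruptionBudget can target the predictor pods. A sketch assuming KServe's serving.kserve.io/inferenceservice pod label (verify the labels on your own pods):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sklearn-iris-pdb
  namespace: ml-serving
spec:
  minAvailable: 1               # keep at least one predictor pod during voluntary disruptions
  selector:
    matchLabels:
      serving.kserve.io/inferenceservice: sklearn-iris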

Next lesson: NVIDIA Triton Inference Server for high-performance multi-framework serving.
