Infrastructure & Deployment

Model Serving Infrastructure

Model serving sits at the core of production MLOps systems. Interviewers test your knowledge of serving frameworks, deployment patterns, and their trade-offs.

Serving Framework Comparison

| Framework | Best For | Latency | Scalability | Interview Points |
| --- | --- | --- | --- | --- |
| TensorFlow Serving | TF models, gRPC | Low | High | Native TF, batching |
| Triton Inference Server | Multi-framework | Very Low | Very High | Dynamic batching, GPU sharing |
| BentoML | Fast iteration | Medium | High | Python-native, easy packaging |
| Ray Serve | Complex pipelines | Low | Very High | Autoscaling, composition |
| TorchServe | PyTorch models | Low | High | Native PyTorch, MAR format |

Interview Question: Serving Architecture

Question: "Design a serving system for a recommendation model with 50ms latency SLA."

Answer Structure:

# High-level architecture discussion
architecture = {
    "load_balancer": "AWS ALB or Envoy",
    "serving_layer": "Triton Inference Server",
    "caching": "Redis for user features",
    "feature_store": "Feast for real-time features",
    "monitoring": "Prometheus + Grafana"
}

# Key optimizations to discuss
optimizations = [
    "Model quantization (FP32 -> INT8)",
    "Dynamic batching (wait 5ms, batch up to 32)",
    "GPU memory management (multi-model on single GPU)",
    "Warm model pool (pre-loaded models)",
    "Feature caching (cache user embeddings)"
]
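
A concrete way to back up the feature-caching point in the list above is a small read-through cache in front of the feature store. A minimal sketch, assuming a Redis instance at localhost:6379 and a placeholder fetch_from_feature_store lookup standing in for the real online store (e.g. Feast); key format, TTL, and embedding size are illustrative.

import pickle

import numpy as np
import redis

cache = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint
EMBEDDING_TTL_SECONDS = 300  # refresh cached embeddings every 5 minutes


def fetch_from_feature_store(user_id: str) -> np.ndarray:
    # Placeholder for the real online-store lookup (e.g. Feast); returns a dummy vector here.
    return np.zeros(64, dtype=np.float32)


def get_user_embedding(user_id: str) -> np.ndarray:
    """Return the user embedding, serving from Redis when possible."""
    key = f"user_emb:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return pickle.loads(cached)  # cache hit: skip the feature-store round trip
    embedding = fetch_from_feature_store(user_id)  # slow path
    cache.setex(key, EMBEDDING_TTL_SECONDS, pickle.dumps(embedding))
    return embedding

Cache hits avoid the feature-store round trip entirely, which is often where a large share of a 50ms budget goes.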

Dynamic Batching Deep Dive

# Triton configuration for dynamic batching
# config.pbtxt
"""
dynamic_batching {
    preferred_batch_size: [8, 16, 32]
    max_queue_delay_microseconds: 5000  # 5ms wait
}

instance_group [
    {
        count: 2
        kind: KIND_GPU
    }
]
"""

# Why this matters
interview_talking_points = """
- Without batching: 1 request = 1 GPU inference
- With batching: 32 requests = 1 GPU inference
- GPU utilization: 10% -> 80%+
- Latency trade-off: +5ms wait, -20ms inference
"""

Multi-Model Serving

Interview Question: "How do you serve 100 models on limited GPU resources?"

# Strategy: GPU memory sharing with Triton
# (illustrative pseudo-config: instance_group mirrors Triton's config.pbtxt fields,
#  while loading_strategy and memory_config describe the policy, not literal syntax)
model_config:
  model_name: "fraud_detector_v1"
  instance_group:
    - count: 1
      kind: KIND_GPU
      gpus: [0]  # Share GPU 0 with other models

# Model loading strategy
loading_strategy:
  type: "on-demand"  # Load when first request arrives
  unload_after_seconds: 300  # Unload after 5 min idle

# Memory management
memory_config:
  gpu_memory_limit_mb: 2048  # Limit per model
  cpu_memory_limit_mb: 4096

Key Talking Points:

  1. Model prioritization: Hot models always loaded, cold models on-demand (see the sketch after this list)
  2. GPU sharing: Multiple models on same GPU with memory limits
  3. Load shedding: Reject requests when capacity exceeded vs queuing
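
A common way to make the hot-vs-cold point concrete is a small LRU-style model pool: hot models stay resident, cold models are loaded on first request and the least recently used one is evicted when the pool is full. A minimal sketch, assuming hypothetical load_model / unload_model helpers rather than any particular serving framework's API:

from collections import OrderedDict


def load_model(name: str):
    # Placeholder: load weights onto the GPU and return a handle.
    return f"<model {name}>"


def unload_model(model) -> None:
    # Placeholder: free the GPU memory held by this model.
    pass


class ModelPool:
    """Keep at most `capacity` models in GPU memory, evicting the least recently used."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._models = OrderedDict()  # model_name -> loaded model handle

    def get(self, name: str):
        if name in self._models:
            self._models.move_to_end(name)      # mark as recently used
            return self._models[name]
        if len(self._models) >= self.capacity:  # pool full: evict the coldest model
            _, cold_model = self._models.popitem(last=False)
            unload_model(cold_model)
        model = load_model(name)
        self._models[name] = model
        return model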

A/B Testing and Canary Deployments

# Istio VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-serving
spec:
  hosts:
  - model-serving
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: model-serving-canary
        port:
          number: 8080
  - route:
    - destination:
        host: model-serving-stable
        port:
          number: 8080
      weight: 90
    - destination:
        host: model-serving-canary
        port:
          number: 8080
      weight: 10
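
Given the routing rules above, exercising the canary from a smoke test is just a matter of setting the x-canary header; traffic without the header falls through to the 90/10 weighted split. A minimal client sketch using requests, with a placeholder endpoint and payload:

import requests

URL = "http://model-serving:8080/v1/predict"  # placeholder endpoint
payload = {"user_id": "123", "context": {}}   # placeholder request body

# Explicitly target the canary (matches the header rule in the VirtualService).
canary_resp = requests.post(URL, json=payload, headers={"x-canary": "true"}, timeout=1.0)

# Normal traffic: no header, so Istio applies the 90/10 stable/canary split.
default_resp = requests.post(URL, json=payload, timeout=1.0)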

Interview Follow-up: "How do you measure if the canary is successful?"

Answer Framework:

  • Business metrics: Click-through rate, conversion, revenue
  • Technical metrics: Latency p99, error rate, GPU utilization
  • Statistical significance: Bayesian analysis or frequentist tests
  • Automated rollback: If error rate > 1% or latency p99 > SLA

Expert Insight: Mention that model A/B tests need larger sample sizes than UI A/B tests due to higher variance in ML predictions.
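
For the statistical-significance point, one simple frequentist check is a two-proportion z-test on error rates (or conversions) between stable and canary. A sketch with scipy and made-up counts; in practice you would also gate on latency p99 and the business metrics listed above:

import math

from scipy.stats import norm

# Made-up counts: errors observed out of requests served by each version.
stable_errors, stable_requests = 180, 90_000
canary_errors, canary_requests = 30, 10_000

p1 = stable_errors / stable_requests
p2 = canary_errors / canary_requests
pooled = (stable_errors + canary_errors) / (stable_requests + canary_requests)
se = math.sqrt(pooled * (1 - pooled) * (1 / stable_requests + 1 / canary_requests))

z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))  # two-sided test

print(f"stable={p1:.4f}, canary={p2:.4f}, p-value={p_value:.3f}")
if p2 > 0.01 or (p_value < 0.05 and p2 > p1):
    print("Roll back the canary")  # mirrors the automated-rollback rule above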

Next, we'll cover cloud-specific ML infrastructure questions.
