Infrastructure & Deployment
Model Serving Infrastructure
5 min read
Model serving is the core of MLOps production systems. Interviewers test your knowledge of serving frameworks, patterns, and trade-offs.
Serving Framework Comparison
| Framework | Best For | Latency | Scalability | Interview Points |
|---|---|---|---|---|
| TensorFlow Serving | TF models, gRPC | Low | High | Native TF, batching |
| Triton Inference Server | Multi-framework | Very Low | Very High | Dynamic batching, GPU sharing |
| BentoML | Fast iteration | Medium | High | Python-native, easy packaging |
| Ray Serve | Complex pipelines | Low | Very High | Autoscaling, composition |
| TorchServe | PyTorch models | Low | High | Native PyTorch, MAR format |
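To make the comparison concrete, it helps to be able to sketch what a client call against one of these servers looks like. Below is a minimal sketch using Triton's Python HTTP client (the `tritonclient` package); the model name, tensor names, and shapes are illustrative placeholders, not from a real deployment.

```python
# Minimal Triton HTTP inference call (requires `pip install tritonclient[http]`).
# Model name, tensor names, and shapes are illustrative placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build one request carrying a single FP32 feature vector
infer_input = httpclient.InferInput("input__0", [1, 128], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 128).astype(np.float32))
requested_output = httpclient.InferRequestedOutput("output__0")

result = client.infer(
    model_name="recommender",  # placeholder model name
    inputs=[infer_input],
    outputs=[requested_output],
)
scores = result.as_numpy("output__0")
print(scores.shape)
```

Being able to write a call like this quickly signals hands-on experience with at least one serving framework.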
Interview Question: Serving Architecture
Question: "Design a serving system for a recommendation model with 50ms latency SLA."
Answer Structure:
# High-level architecture discussion
architecture = {
    "load_balancer": "AWS ALB or Envoy",
    "serving_layer": "Triton Inference Server",
    "caching": "Redis for user features",
    "feature_store": "Feast for real-time features",
    "monitoring": "Prometheus + Grafana"
}
# Key optimizations to discuss
optimizations = [
    "Model quantization (FP32 -> INT8)",
    "Dynamic batching (wait 5ms, batch up to 32)",
    "GPU memory management (multi-model on single GPU)",
    "Warm model pool (pre-loaded models)",
    "Feature caching (cache user embeddings)"
]
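Interviewers often drill into the feature-caching point. A rough sketch of a cache-aside read path, assuming Redis as the cache and a generic feature-store client (the `get_online_features` call here is a stand-in, not a specific library API):

```python
# Cache-aside lookup for user features; Redis keys, the TTL, and the feature-store
# client interface are assumptions for illustration.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # keep cached user embeddings reasonably fresh


def get_user_features(user_id: str, feature_store) -> dict:
    """Return user features, hitting Redis first and the feature store on a miss."""
    key = f"user_features:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no feature-store round trip
    features = feature_store.get_online_features(user_id)  # hypothetical client call
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(features))
    return features
```

The design point worth stating out loud: every cache hit removes a network hop from the critical path, which is often the difference between meeting and missing a 50ms SLA.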
Dynamic Batching Deep Dive
# Triton configuration for dynamic batching
# config.pbtxt
"""
dynamic_batching {
  preferred_batch_size: [8, 16, 32]
  max_queue_delay_microseconds: 5000  # 5ms wait
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
"""
# Why this matters
interview_talking_points = """
- Without batching: 1 request = 1 GPU inference
- With batching: 32 requests = 1 GPU inference
- GPU utilization: 10% -> 80%+
- Latency trade-off: +5ms wait, -20ms inference
"""
Multi-Model Serving
Interview Question: "How do you serve 100 models on limited GPU resources?"
# Strategy: GPU memory sharing with Triton (illustrative config sketch, not exact Triton syntax)
model_config:
  model_name: "fraud_detector_v1"
  instance_group:
    - count: 1
      kind: KIND_GPU
      gpus: [0]  # Share GPU 0 with other models

# Model loading strategy
loading_strategy:
  type: "on-demand"          # Load when first request arrives
  unload_after_seconds: 300  # Unload after 5 min idle

# Memory management
memory_config:
  gpu_memory_limit_mb: 2048  # Limit per model
  cpu_memory_limit_mb: 4096
Key Talking Points:
- Model prioritization: Hot models always loaded, cold models on-demand
- GPU sharing: Multiple models on same GPU with memory limits
- Load shedding: Reject requests outright when capacity is exceeded, rather than letting them queue and blow the latency SLA
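One way to sketch the on-demand loading idea is an LRU-style model pool with idle eviction. The `loader.load`/`loader.unload` interface below is a placeholder for whatever model-management API the serving layer exposes, not a specific Triton call:

```python
# On-demand model loading with LRU eviction and idle unload.
# The `loader` object (load/unload methods) is an assumed interface for illustration.
import time
from collections import OrderedDict


class ModelPool:
    """Keep hot models resident; load cold models on demand, evict least recently used."""

    def __init__(self, loader, max_resident: int = 10, idle_timeout_s: int = 300):
        self.loader = loader
        self.max_resident = max_resident
        self.idle_timeout_s = idle_timeout_s
        self.resident = OrderedDict()  # model name -> last access timestamp

    def ensure_loaded(self, name: str) -> None:
        if name not in self.resident:
            if len(self.resident) >= self.max_resident:
                coldest, _ = self.resident.popitem(last=False)  # drop least recently used
                self.loader.unload(coldest)
            self.loader.load(name)
        self.resident[name] = time.time()
        self.resident.move_to_end(name)

    def evict_idle(self) -> None:
        now = time.time()
        for name, last_used in list(self.resident.items()):
            if now - last_used > self.idle_timeout_s:
                self.loader.unload(name)
                del self.resident[name]
```

In production this logic usually lives inside the serving framework (for example, Triton's explicit model-control mode), but being able to reason about the eviction policy yourself is what the question is really testing.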
A/B Testing and Canary Deployments
# Istio VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-serving
spec:
  hosts:
    - model-serving
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: model-serving-canary
            port:
              number: 8080
    - route:
        - destination:
            host: model-serving-stable
            port:
              number: 8080
          weight: 90
        - destination:
            host: model-serving-canary
            port:
              number: 8080
          weight: 10
Interview Follow-up: "How do you measure if the canary is successful?"
Answer Framework:
- Business metrics: Click-through rate, conversion, revenue
- Technical metrics: Latency p99, error rate, GPU utilization
- Statistical significance: Bayesian analysis or frequentist tests
- Automated rollback: If error rate > 1% or latency p99 > SLA
Expert Insight: Mention that model A/B tests need larger sample sizes than UI A/B tests due to higher variance in ML predictions.
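For the automated-rollback point, a minimal gate might combine a hard SLA check with a one-sided two-proportion z-test on error rates. The thresholds, the SciPy dependency, and the function shape are all assumptions for illustration:

```python
# Canary gate: hard error-rate cap plus a one-sided two-proportion z-test.
# Thresholds (1% error rate, alpha=0.05) are illustrative, not recommendations.
import math
from scipy.stats import norm


def canary_should_roll_back(stable_errors: int, stable_total: int,
                            canary_errors: int, canary_total: int,
                            max_error_rate: float = 0.01,
                            alpha: float = 0.05) -> bool:
    stable_rate = stable_errors / stable_total
    canary_rate = canary_errors / canary_total
    if canary_rate > max_error_rate:  # hard SLA breach, no statistics needed
        return True
    # Is the canary error rate significantly worse than stable?
    pooled = (stable_errors + canary_errors) / (stable_total + canary_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / stable_total + 1 / canary_total))
    if se == 0:
        return False
    z = (canary_rate - stable_rate) / se
    p_value = 1 - norm.cdf(z)  # one-sided: canary worse than stable
    return p_value < alpha
```

The same gate structure extends to latency p99 or business metrics; the point to make is that rollback decisions should be automated against thresholds agreed before the rollout, not eyeballed on a dashboard.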
Next, we'll cover cloud-specific ML infrastructure questions.