# Canary Deployments & A/B Testing
Deploying ML models safely requires gradual rollouts and data-driven validation. Canary deployments and A/B testing minimize risk while maximizing learning.
## Deployment Strategies
| Strategy | Risk | Rollback Speed | Use Case |
|---|---|---|---|
| Big Bang | High | Slow | Development only |
| Blue-Green | Medium | Fast | Quick switchover |
| Canary | Low | Fast | Production ML |
| A/B Test | Low | Fast | Model comparison |
## Canary Deployments

### Concept

Deploy the new model to a small percentage of traffic, monitor its metrics, then gradually increase its share.
Traffic Flow:

```text
 Users (100%)
      │
      ▼
 ┌─────────┐
 │  Load   │
 │ Balancer│
 └────┬────┘
      │
  ┌───┴───┐
  │       │
  ▼       ▼
┌───┐   ┌───┐
│95%│   │ 5%│ ◄── Canary
│v1 │   │v2 │
└───┘   └───┘
```
### Kubernetes Implementation
```yaml
# model-v1-deployment.yaml (stable)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-stable
  labels:
    app: fraud-detector
    version: v1
spec:
  replicas: 9  # ~90% of traffic (replica counts give only an approximate split)
  selector:
    matchLabels:
      app: fraud-detector
      version: v1
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v1
    spec:
      containers:
        - name: model
          image: myregistry/fraud-detector:v1
          ports:
            - containerPort: 3000
---
# model-v2-deployment.yaml (canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-canary
  labels:
    app: fraud-detector
    version: v2
spec:
  replicas: 1  # ~10% of traffic
  selector:
    matchLabels:
      app: fraud-detector
      version: v2
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v2
    spec:
      containers:
        - name: model
          image: myregistry/fraud-detector:v2
          ports:
            - containerPort: 3000
---
# service.yaml (routes to both)
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector
spec:
  selector:
    app: fraud-detector  # Matches both v1 and v2 pods
  ports:
    - port: 80
      targetPort: 3000
```
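With plain Kubernetes, shifting the split means resizing the two Deployments. A minimal sketch using the official `kubernetes` Python client (the deployment names come from the manifests above; the `default` namespace and cluster access are assumptions):

```python
# shift_replicas.py -- illustrative sketch, assumes the two Deployments above exist
from kubernetes import client, config

def set_canary_split(canary_replicas: int, total_replicas: int = 10,
                     namespace: str = "default") -> None:
    """Approximate a traffic split by resizing the stable and canary Deployments."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    for name, replicas in [
        ("fraud-detector-stable", total_replicas - canary_replicas),
        ("fraud-detector-canary", canary_replicas),
    ]:
        apps.patch_namespaced_deployment_scale(
            name=name,
            namespace=namespace,
            body={"spec": {"replicas": replicas}},
        )

# Move the canary from ~10% to ~20% of traffic
set_canary_split(canary_replicas=2)
```

Because the Service balances across all ready pods, replica counts only approximate the split and cannot express fine-grained percentages like 95/5; for that, use a service mesh as shown next.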
### Istio Traffic Splitting
```yaml
# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
spec:
  hosts:
    - fraud-detector
  http:
    - route:
        - destination:
            host: fraud-detector
            subset: stable
          weight: 95
        - destination:
            host: fraud-detector
            subset: canary
          weight: 5
---
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fraud-detector
spec:
  host: fraud-detector
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
```
### Gradual Rollout Script
```python
# canary_rollout.py
import subprocess
import time
from typing import Callable

def update_traffic_weight(canary_weight: int) -> None:
    """Patch the Istio VirtualService to send canary_weight% to the canary."""
    stable_weight = 100 - canary_weight
    patch = f'''
spec:
  http:
  - route:
    - destination:
        host: fraud-detector
        subset: stable
      weight: {stable_weight}
    - destination:
        host: fraud-detector
        subset: canary
      weight: {canary_weight}
'''
    subprocess.run([
        "kubectl", "patch", "virtualservice", "fraud-detector",
        "--type=merge", "-p", patch,
    ], check=True)

def canary_rollout(
    check_metrics: Callable[[], bool],
    weights: tuple[int, ...] = (5, 10, 25, 50, 100),
    wait_minutes: int = 10,
) -> bool:
    """Gradually increase canary traffic as long as metrics stay healthy."""
    for weight in weights:
        print(f"Setting canary weight to {weight}%")
        update_traffic_weight(weight)
        print(f"Waiting {wait_minutes} minutes...")
        time.sleep(wait_minutes * 60)
        if not check_metrics():
            print("Metrics check failed! Rolling back...")
            update_traffic_weight(0)
            return False
        print(f"Metrics passed at {weight}%")
    print("Canary rollout complete!")
    return True

# Usage
def check_model_metrics() -> bool:
    """Check if the canary meets its SLOs (helpers query Prometheus/Grafana)."""
    latency_ok = get_p99_latency("canary") < 100     # ms
    error_rate_ok = get_error_rate("canary") < 0.01  # 1%
    accuracy_ok = get_accuracy("canary") > 0.90
    return latency_ok and error_rate_ok and accuracy_ok

canary_rollout(check_model_metrics)
```
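The metric helpers above (`get_p99_latency`, `get_error_rate`, `get_accuracy`) are left undefined. A minimal sketch of one against the Prometheus HTTP API, assuming the in-cluster address and the `version` label used in the alert rules below:

```python
# metrics.py -- illustrative sketch; the Prometheus URL and label names are assumptions
import requests

PROMETHEUS_URL = "http://prometheus:9090"

def get_p99_latency(version: str) -> float:
    """Return p99 request latency in milliseconds for one deployment version."""
    query = (
        "histogram_quantile(0.99, sum by (le) ("
        f'rate(http_request_duration_seconds_bucket{{version="{version}"}}[5m])))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError(f"no latency data for version={version!r}")
    return float(result[0]["value"][1]) * 1000  # Prometheus returns seconds
```

`get_error_rate` and `get_accuracy` follow the same pattern with different queries.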
## A/B Testing for Models

### Concept

Split traffic between models to measure business impact, not just technical metrics.
| Canary | A/B Test |
|---|---|
| Validate stability | Measure business impact |
| Quick rollout | Statistical significance |
| Technical metrics | Business metrics |
| Temporary | Can run for weeks |
### Implementation with Feature Flags
```python
# ab_test_router.py
import hashlib
import time
from enum import Enum

class ModelVariant(Enum):
    CONTROL = "v1"
    TREATMENT = "v2"

def get_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> ModelVariant:
    """Deterministic assignment based on user_id."""
    hash_input = f"{user_id}:{experiment}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100
    if bucket < treatment_pct:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL

# Usage in service (load_model and log_prediction are your own helpers)
class FraudDetectorABTest:
    def __init__(self):
        self.model_v1 = load_model("fraud_detector:v1")
        self.model_v2 = load_model("fraud_detector:v2")

    def predict(self, user_id: str, features: dict) -> dict:
        variant = get_variant(user_id, "fraud_model_v2_test")
        if variant == ModelVariant.TREATMENT:
            model = self.model_v2
        else:
            model = self.model_v1
        prediction = model.predict(features)
        # Log every prediction so the experiment can be analyzed offline
        log_prediction(
            user_id=user_id,
            variant=variant.value,
            prediction=prediction,
            timestamp=time.time(),
        )
        return {
            "prediction": prediction,
            "variant": variant.value,
        }
```
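A quick sanity check that the hashing is stable and splits roughly evenly (synthetic user IDs):

```python
# Deterministic: the same user always gets the same variant
assert get_variant("user-42", "fraud_model_v2_test") == \
       get_variant("user-42", "fraud_model_v2_test")

# Roughly 50/50 across many users
assignments = [get_variant(f"user-{i}", "fraud_model_v2_test") for i in range(10_000)]
treatment_share = sum(v is ModelVariant.TREATMENT for v in assignments) / len(assignments)
print(f"Treatment share: {treatment_share:.1%}")  # expect ~50%
```

Hashing the experiment name along with the user ID means the same user can land in different groups across experiments, keeping concurrent experiments independent.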
### Statistical Analysis
```python
# ab_analysis.py
import scipy.stats as stats

def analyze_ab_test(
    control_conversions: int,
    control_total: int,
    treatment_conversions: int,
    treatment_total: int,
    confidence: float = 0.95,
) -> dict:
    """Analyze A/B test results with a two-proportion z-test."""
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Two-proportion z-test (pooled standard error)
    pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)
    se = (pooled * (1 - pooled) * (1 / control_total + 1 / treatment_total)) ** 0.5
    z_stat = (treatment_rate - control_rate) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Confidence interval for the difference in rates
    z_critical = stats.norm.ppf((1 + confidence) / 2)
    diff = treatment_rate - control_rate
    margin = z_critical * se

    return {
        "control_rate": control_rate,
        "treatment_rate": treatment_rate,
        "lift": (treatment_rate - control_rate) / control_rate * 100,
        "p_value": p_value,
        "significant": p_value < (1 - confidence),
        "confidence_interval": (diff - margin, diff + margin),
    }

# Example usage
results = analyze_ab_test(
    control_conversions=450,
    control_total=10000,
    treatment_conversions=520,
    treatment_total=10000,
)
print(f"Control: {results['control_rate']:.2%}")
print(f"Treatment: {results['treatment_rate']:.2%}")
print(f"Lift: {results['lift']:.1f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")
```
## Monitoring Canary Deployments

### Key Metrics to Track
```yaml
# prometheus-rules.yaml
groups:
  - name: canary-alerts
    rules:
      - alert: CanaryHighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
            by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Canary p99 latency above 500ms"
      - alert: CanaryHighErrorRate
        expr: |
          sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{version="canary"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary error rate above 5%"
```
### Grafana Dashboard Queries
```promql
# Compare p99 latency between versions
histogram_quantile(0.99,
  sum by (version, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Compare error rates
sum by (version) (
  rate(http_requests_total{status=~"5.."}[5m])
) / sum by (version) (
  rate(http_requests_total[5m])
)
```
## Automated Rollback

### Argo Rollouts
```yaml
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 10
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
        - name: model
          image: myregistry/fraud-detector:v2
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 10m}
        - setWeight: 20
        - pause: {duration: 10m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
```
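Day to day, you watch and gate the rollout with the Argo Rollouts kubectl plugin. A small sketch in the same subprocess style as the rollout script above (assumes the plugin is installed):

```python
# rollout_ops.py -- thin wrappers around the Argo Rollouts kubectl plugin
import subprocess

def watch_rollout(name: str = "fraud-detector") -> None:
    """Stream live rollout status: current step, weights, analysis results."""
    subprocess.run(
        ["kubectl", "argo", "rollouts", "get", "rollout", name, "--watch"],
        check=True,
    )

def abort_rollout(name: str = "fraud-detector") -> None:
    """Abort the canary and send all traffic back to the stable version."""
    subprocess.run(["kubectl", "argo", "rollouts", "abort", name], check=True)
```

In practice the AnalysisTemplate handles rollback automatically; a manual abort is the escape hatch when a human spots something the metrics miss.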
## Best Practices
| Practice | Why |
|---|---|
| Start small | 1-5% initial canary |
| Monitor business metrics | Not just latency/errors |
| Automate rollback | React faster than humans |
| Use deterministic routing | Consistent user experience |
| Run long enough | Statistical significance |
**Key insight:** Safe model deployment is iterative. Deploy small, measure impact, automate decisions, and always have a rollback plan.
Next module: ML Monitoring & Production Operations.