Model Registry & Serving

Canary Deployments & A/B Testing

Deploying ML models safely requires gradual rollouts and data-driven validation. Canary deployments and A/B testing minimize risk while maximizing learning.

Deployment Strategies

Strategy   | Risk   | Rollback Speed | Use Case
Big Bang   | High   | Slow           | Development only
Blue-Green | Medium | Fast           | Quick switchover
Canary     | Low    | Fast           | Production ML
A/B Test   | Low    | Fast           | Model comparison

Canary Deployments

Concept

Deploy new model to small percentage of traffic, monitor metrics, then gradually increase.

Traffic Flow:
┌─────────────────────────────────────────────┐
│                                             │
│  Users (100%)                               │
│      │                                      │
│      ▼                                      │
│  ┌─────────┐                                │
│  │  Load   │                                │
│  │Balancer │                                │
│  └────┬────┘                                │
│       │                                     │
│   ┌───┴───┐                                 │
│   │       │                                 │
│   ▼       ▼                                 │
│ ┌───┐   ┌───┐                               │
│ │95%│   │ 5%│  ◄── Canary                   │
│ │v1 │   │v2 │                               │
│ └───┘   └───┘                               │
└─────────────────────────────────────────────┘

Kubernetes Implementation

# model-v1-deployment.yaml (stable)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-stable
  labels:
    app: fraud-detector
    version: v1
spec:
  replicas: 9  # 90% of traffic
  selector:
    matchLabels:
      app: fraud-detector
      version: v1
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v1
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v1
        ports:
        - containerPort: 3000
---
# model-v2-deployment.yaml (canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-canary
  labels:
    app: fraud-detector
    version: v2
spec:
  replicas: 1  # 10% of traffic
  selector:
    matchLabels:
      app: fraud-detector
      version: v2
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v2
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v2
        ports:
        - containerPort: 3000
---
# service.yaml (routes to both)
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector
spec:
  selector:
    app: fraud-detector  # Matches both v1 and v2
  ports:
  - port: 80
    targetPort: 3000
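
A plain Service splits traffic only as finely as the replica ratio allows (roughly 90/10 here), and the only lever is scaling the two Deployments. A minimal sketch of shifting that ratio, assuming the Deployment names above:

# shift_canary_replicas.py (illustrative sketch)
import subprocess

def set_replica_split(stable_replicas: int, canary_replicas: int) -> None:
    """Scale the stable and canary Deployments; the Service load-balances
    roughly in proportion to the number of ready pods behind each version."""
    for name, count in [
        ("fraud-detector-stable", stable_replicas),
        ("fraud-detector-canary", canary_replicas),
    ]:
        subprocess.run(
            ["kubectl", "scale", "deployment", name, f"--replicas={count}"],
            check=True,
        )

# Move from ~10% to ~20% canary traffic
set_replica_split(stable_replicas=8, canary_replicas=2)

Because pod-count splitting can only hit coarse percentages, a service mesh such as Istio (next) is the usual choice for fine-grained weights.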

Istio Traffic Splitting

# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
spec:
  hosts:
  - fraud-detector
  http:
  - route:
    - destination:
        host: fraud-detector
        subset: stable
      weight: 95
    - destination:
        host: fraud-detector
        subset: canary
      weight: 5
---
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fraud-detector
spec:
  host: fraud-detector
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2

Gradual Rollout Script

# canary_rollout.py
import json
import subprocess
import time
from typing import Callable

def update_traffic_weight(canary_weight: int):
    """Update the Istio VirtualService route weights."""
    stable_weight = 100 - canary_weight

    # Build the patch as JSON so kubectl parses it unambiguously
    patch = json.dumps({
        "spec": {
            "http": [{
                "route": [
                    {"destination": {"host": "fraud-detector", "subset": "stable"},
                     "weight": stable_weight},
                    {"destination": {"host": "fraud-detector", "subset": "canary"},
                     "weight": canary_weight},
                ]
            }]
        }
    })

    subprocess.run([
        "kubectl", "patch", "virtualservice", "fraud-detector",
        "--type=merge", "-p", patch
    ], check=True)

def canary_rollout(
    check_metrics: Callable[[], bool],
    weights: list[int] = [5, 10, 25, 50, 100],
    wait_minutes: int = 10
):
    """Gradually increase canary traffic if metrics pass."""
    for weight in weights:
        print(f"Setting canary weight to {weight}%")
        update_traffic_weight(weight)

        print(f"Waiting {wait_minutes} minutes...")
        time.sleep(wait_minutes * 60)

        if not check_metrics():
            print("Metrics check failed! Rolling back...")
            update_traffic_weight(0)
            return False

        print(f"Metrics passed at {weight}%")

    print("Canary rollout complete!")
    return True

# Usage
def check_model_metrics() -> bool:
    """Check if canary model meets SLOs."""
    # Query Prometheus/Grafana
    latency_ok = get_p99_latency("canary") < 100  # ms
    error_rate_ok = get_error_rate("canary") < 0.01  # 1%
    accuracy_ok = get_accuracy("canary") > 0.90

    return latency_ok and error_rate_ok and accuracy_ok

canary_rollout(check_model_metrics)
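
The helpers get_p99_latency, get_error_rate, and get_accuracy are placeholders for your metrics backend. A minimal sketch of one of them against the Prometheus HTTP API, assuming a Prometheus server at http://prometheus:9090 and request histograms labelled by version (as in the alert rules later in this lesson):

# prometheus_metrics.py (illustrative sketch)
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: in-cluster Prometheus

def get_p99_latency(version: str) -> float:
    """Return p99 request latency in milliseconds for one model version."""
    query = (
        "histogram_quantile(0.99, sum by (le) ("
        f'rate(http_request_duration_seconds_bucket{{version="{version}"}}[5m])))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError(f"No latency data for version={version}")
    return float(result[0]["value"][1]) * 1000  # Prometheus returns seconds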

A/B Testing for Models

Concept

Split traffic between models to measure business impact, not just technical metrics.

Canary             | A/B Test
Validate stability | Measure business impact
Quick rollout      | Statistical significance
Technical metrics  | Business metrics
Temporary          | Can run for weeks

Implementation with Feature Flags

# ab_test_router.py
import hashlib
import time  # used when logging predictions below
from enum import Enum

class ModelVariant(Enum):
    CONTROL = "v1"
    TREATMENT = "v2"

def get_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> ModelVariant:
    """Deterministic assignment based on user_id."""
    hash_input = f"{user_id}:{experiment}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100

    if bucket < treatment_pct:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL

# Usage in service
class FraudDetectorABTest:
    def __init__(self):
        self.model_v1 = load_model("fraud_detector:v1")
        self.model_v2 = load_model("fraud_detector:v2")

    def predict(self, user_id: str, features: dict) -> dict:
        variant = get_variant(user_id, "fraud_model_v2_test")

        if variant == ModelVariant.TREATMENT:
            model = self.model_v2
        else:
            model = self.model_v1

        prediction = model.predict(features)

        # Log for analysis
        log_prediction(
            user_id=user_id,
            variant=variant.value,
            prediction=prediction,
            timestamp=time.time()
        )

        return {
            "prediction": prediction,
            "variant": variant.value
        }
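
load_model and log_prediction are stand-ins for your model registry and event pipeline. As one possibility, log_prediction could append JSON lines to a local file (in production this would more likely go to Kafka, a warehouse, or an experimentation platform):

# prediction_logger.py (illustrative sketch)
import json

LOG_PATH = "predictions.jsonl"  # assumption: local file for illustration only

def log_prediction(user_id: str, variant: str, prediction, timestamp: float) -> None:
    """Append one prediction event; events are later aggregated per variant."""
    event = {
        "user_id": user_id,
        "variant": variant,
        "prediction": prediction,
        "timestamp": timestamp,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")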

Statistical Analysis

# ab_analysis.py
import scipy.stats as stats

def analyze_ab_test(
    control_conversions: int,
    control_total: int,
    treatment_conversions: int,
    treatment_total: int,
    confidence: float = 0.95
) -> dict:
    """Analyze A/B test results."""
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Two-proportion z-test
    pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)
    se = (pooled * (1 - pooled) * (1/control_total + 1/treatment_total)) ** 0.5
    z_stat = (treatment_rate - control_rate) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Confidence interval
    z_critical = stats.norm.ppf((1 + confidence) / 2)
    diff = treatment_rate - control_rate
    margin = z_critical * se

    return {
        "control_rate": control_rate,
        "treatment_rate": treatment_rate,
        "lift": (treatment_rate - control_rate) / control_rate * 100,
        "p_value": p_value,
        "significant": p_value < (1 - confidence),
        "confidence_interval": (diff - margin, diff + margin)
    }

# Example usage
results = analyze_ab_test(
    control_conversions=450,
    control_total=10000,
    treatment_conversions=520,
    treatment_total=10000
)

print(f"Control: {results['control_rate']:.2%}")
print(f"Treatment: {results['treatment_rate']:.2%}")
print(f"Lift: {results['lift']:.1f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")

Monitoring Canary Deployments

Key Metrics to Track

# prometheus-rules.yaml
groups:
- name: canary-alerts
  rules:
  - alert: CanaryHighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
        by (le)
      ) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Canary p99 latency above 500ms"

  - alert: CanaryHighErrorRate
    expr: |
      sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{version="canary"}[5m]))
      > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Canary error rate above 5%"

Grafana Dashboard Query

# Compare latency between versions
histogram_quantile(0.99,
  sum by (version, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Compare error rates
sum by (version) (
  rate(http_requests_total{status=~"5.."}[5m])
) / sum by (version) (
  rate(http_requests_total[5m])
)

Automated Rollback

Argo Rollouts

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 10
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v2
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 1
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

Best Practices

Practice                  | Why
Start small               | 1-5% initial canary
Monitor business metrics  | Not just latency/errors
Automate rollback         | React faster than humans
Use deterministic routing | Consistent user experience
Run long enough           | Statistical significance (sample-size sketch below)
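
"Run long enough" can be made concrete before the test starts by estimating the required sample size. A sketch using the standard two-proportion approximation, assuming a 5% significance level and 80% power (not specific to this module):

# sample_size.py (illustrative sketch)
from scipy.stats import norm

def required_sample_size(p_control: float, p_treatment: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect the given difference."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2
    return int(n) + 1

# Detecting a lift from 4.5% to 5.2% conversion takes roughly 15k users per variant
print(required_sample_size(0.045, 0.052))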

Key insight: Safe model deployment is iterative—deploy small, measure impact, automate decisions, and always have a rollback plan.

Next module: ML Monitoring & Production Operations.
