Model Registry & Serving

Canary Deployments & A/B Testing

Deploying ML models safely requires gradual rollouts and data-driven validation. Canary deployments and A/B testing minimize risk while maximizing learning.

Deployment Strategies

Strategy   | Risk   | Rollback Speed | Use Case
Big Bang   | High   | Slow           | Development only
Blue-Green | Medium | Fast           | Quick switchover
Canary     | Low    | Fast           | Production ML
A/B Test   | Low    | Fast           | Model comparison

Canary Deployments

Concept

Deploy new model to small percentage of traffic, monitor metrics, then gradually increase.

Traffic Flow:
┌─────────────────────────────────────────────┐
│                                             │
│  Users (100%)                               │
│      │                                      │
│      ▼                                      │
│  ┌─────────┐                                │
│  │  Load   │                                │
│  │Balancer │                                │
│  └────┬────┘                                │
│       │                                     │
│   ┌───┴───┐                                 │
│   │       │                                 │
│   ▼       ▼                                 │
│ ┌───┐   ┌───┐                               │
│ │95%│   │ 5%│  ◄── Canary                   │
│ │v1 │   │v2 │                               │
│ └───┘   └───┘                               │
└─────────────────────────────────────────────┘

Kubernetes Implementation

# model-v1-deployment.yaml (stable)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-stable
  labels:
    app: fraud-detector
    version: v1
spec:
  replicas: 9  # 90% of traffic
  selector:
    matchLabels:
      app: fraud-detector
      version: v1
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v1
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v1
        ports:
        - containerPort: 3000
---
# model-v2-deployment.yaml (canary)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fraud-detector-canary
  labels:
    app: fraud-detector
    version: v2
spec:
  replicas: 1  # 10% of traffic
  selector:
    matchLabels:
      app: fraud-detector
      version: v2
  template:
    metadata:
      labels:
        app: fraud-detector
        version: v2
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v2
        ports:
        - containerPort: 3000
---
# service.yaml (routes to both)
apiVersion: v1
kind: Service
metadata:
  name: fraud-detector
spec:
  selector:
    app: fraud-detector  # Matches both v1 and v2
  ports:
  - port: 80
    targetPort: 3000
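
A plain Service splits traffic only as finely as the replica ratio allows (roughly 90/10 here), and the only lever is scaling the two Deployments. A minimal sketch of shifting that ratio, assuming the Deployment names above:

# shift_canary_replicas.py (illustrative sketch)
import subprocess

def set_replica_split(stable_replicas: int, canary_replicas: int) -> None:
    """Scale the stable and canary Deployments; the Service load-balances
    roughly in proportion to the number of ready pods behind each version."""
    for name, count in [
        ("fraud-detector-stable", stable_replicas),
        ("fraud-detector-canary", canary_replicas),
    ]:
        subprocess.run(
            ["kubectl", "scale", "deployment", name, f"--replicas={count}"],
            check=True,
        )

# Move from ~10% to ~20% canary traffic
set_replica_split(stable_replicas=8, canary_replicas=2)

Because pod-count splitting can only hit coarse percentages, a service mesh such as Istio (next) is the usual choice for fine-grained weights.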

Istio Traffic Splitting

# virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: fraud-detector
spec:
  hosts:
  - fraud-detector
  http:
  - route:
    - destination:
        host: fraud-detector
        subset: stable
      weight: 95
    - destination:
        host: fraud-detector
        subset: canary
      weight: 5
---
# destination-rule.yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: fraud-detector
spec:
  host: fraud-detector
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2

Gradual Rollout Script

# canary_rollout.py
import json
import subprocess
import time
from typing import Callable

def update_traffic_weight(canary_weight: int):
    """Update the Istio VirtualService route weights."""
    stable_weight = 100 - canary_weight

    # Build the patch as JSON so kubectl parses it unambiguously
    patch = json.dumps({
        "spec": {
            "http": [{
                "route": [
                    {"destination": {"host": "fraud-detector", "subset": "stable"},
                     "weight": stable_weight},
                    {"destination": {"host": "fraud-detector", "subset": "canary"},
                     "weight": canary_weight},
                ]
            }]
        }
    })

    subprocess.run([
        "kubectl", "patch", "virtualservice", "fraud-detector",
        "--type=merge", "-p", patch
    ], check=True)

def canary_rollout(
    check_metrics: Callable[[], bool],
    weights: list[int] = [5, 10, 25, 50, 100],
    wait_minutes: int = 10
):
    """Gradually increase canary traffic if metrics pass."""
    for weight in weights:
        print(f"Setting canary weight to {weight}%")
        update_traffic_weight(weight)

        print(f"Waiting {wait_minutes} minutes...")
        time.sleep(wait_minutes * 60)

        if not check_metrics():
            print("Metrics check failed! Rolling back...")
            update_traffic_weight(0)
            return False

        print(f"Metrics passed at {weight}%")

    print("Canary rollout complete!")
    return True

# Usage
def check_model_metrics() -> bool:
    """Check if canary model meets SLOs."""
    # Query Prometheus/Grafana
    latency_ok = get_p99_latency("canary") < 100  # ms
    error_rate_ok = get_error_rate("canary") < 0.01  # 1%
    accuracy_ok = get_accuracy("canary") > 0.90

    return latency_ok and error_rate_ok and accuracy_ok

canary_rollout(check_model_metrics)
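
The helpers get_p99_latency, get_error_rate, and get_accuracy are placeholders for your metrics backend. A minimal sketch of one of them against the Prometheus HTTP API, assuming a Prometheus server at http://prometheus:9090 and request histograms labelled by version (as in the alert rules later in this lesson):

# prometheus_metrics.py (illustrative sketch)
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # assumption: in-cluster Prometheus

def get_p99_latency(version: str) -> float:
    """Return p99 request latency in milliseconds for one model version."""
    query = (
        "histogram_quantile(0.99, sum by (le) ("
        f'rate(http_request_duration_seconds_bucket{{version="{version}"}}[5m])))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        raise RuntimeError(f"No latency data for version={version}")
    return float(result[0]["value"][1]) * 1000  # Prometheus returns seconds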

A/B Testing for Models

Concept

Split traffic between models to measure business impact, not just technical metrics.

Canary             | A/B Test
Validate stability | Measure business impact
Quick rollout      | Statistical significance
Technical metrics  | Business metrics
Temporary          | Can run for weeks

Implementation with Feature Flags

# ab_test_router.py
import hashlib
import time  # used when logging predictions below
from enum import Enum

class ModelVariant(Enum):
    CONTROL = "v1"
    TREATMENT = "v2"

def get_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> ModelVariant:
    """Deterministic assignment based on user_id."""
    hash_input = f"{user_id}:{experiment}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    bucket = hash_value % 100

    if bucket < treatment_pct:
        return ModelVariant.TREATMENT
    return ModelVariant.CONTROL

# Usage in service
class FraudDetectorABTest:
    def __init__(self):
        self.model_v1 = load_model("fraud_detector:v1")
        self.model_v2 = load_model("fraud_detector:v2")

    def predict(self, user_id: str, features: dict) -> dict:
        variant = get_variant(user_id, "fraud_model_v2_test")

        if variant == ModelVariant.TREATMENT:
            model = self.model_v2
        else:
            model = self.model_v1

        prediction = model.predict(features)

        # Log for analysis
        log_prediction(
            user_id=user_id,
            variant=variant.value,
            prediction=prediction,
            timestamp=time.time()
        )

        return {
            "prediction": prediction,
            "variant": variant.value
        }
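
load_model and log_prediction are stand-ins for your model registry and event pipeline. As one possibility, log_prediction could append JSON lines to a local file (in production this would more likely go to Kafka, a warehouse, or an experimentation platform):

# prediction_logger.py (illustrative sketch)
import json

LOG_PATH = "predictions.jsonl"  # assumption: local file for illustration only

def log_prediction(user_id: str, variant: str, prediction, timestamp: float) -> None:
    """Append one prediction event; events are later aggregated per variant."""
    event = {
        "user_id": user_id,
        "variant": variant,
        "prediction": prediction,
        "timestamp": timestamp,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")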

Statistical Analysis

# ab_analysis.py
import scipy.stats as stats

def analyze_ab_test(
    control_conversions: int,
    control_total: int,
    treatment_conversions: int,
    treatment_total: int,
    confidence: float = 0.95
) -> dict:
    """Analyze A/B test results."""
    control_rate = control_conversions / control_total
    treatment_rate = treatment_conversions / treatment_total

    # Two-proportion z-test
    pooled = (control_conversions + treatment_conversions) / (control_total + treatment_total)
    se = (pooled * (1 - pooled) * (1/control_total + 1/treatment_total)) ** 0.5
    z_stat = (treatment_rate - control_rate) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))

    # Confidence interval
    z_critical = stats.norm.ppf((1 + confidence) / 2)
    diff = treatment_rate - control_rate
    margin = z_critical * se

    return {
        "control_rate": control_rate,
        "treatment_rate": treatment_rate,
        "lift": (treatment_rate - control_rate) / control_rate * 100,
        "p_value": p_value,
        "significant": p_value < (1 - confidence),
        "confidence_interval": (diff - margin, diff + margin)
    }

# Example usage
results = analyze_ab_test(
    control_conversions=450,
    control_total=10000,
    treatment_conversions=520,
    treatment_total=10000
)

print(f"Control: {results['control_rate']:.2%}")
print(f"Treatment: {results['treatment_rate']:.2%}")
print(f"Lift: {results['lift']:.1f}%")
print(f"P-value: {results['p_value']:.4f}")
print(f"Significant: {results['significant']}")

Monitoring Canary Deployments

Key Metrics to Track

# prometheus-rules.yaml
groups:
- name: canary-alerts
  rules:
  - alert: CanaryHighLatency
    expr: |
      histogram_quantile(0.99,
        sum(rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
        by (le)
      ) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Canary p99 latency above 500ms"

  - alert: CanaryHighErrorRate
    expr: |
      sum(rate(http_requests_total{version="canary",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{version="canary"}[5m]))
      > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Canary error rate above 5%"

Grafana Dashboard Query

# Compare latency between versions
histogram_quantile(0.99,
  sum by (version, le) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)

# Compare error rates
sum by (version) (
  rate(http_requests_total{status=~"5.."}[5m])
) / sum by (version) (
  rate(http_requests_total[5m])
)

Automated Rollback

Argo Rollouts

# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-detector
spec:
  replicas: 10
  selector:
    matchLabels:
      app: fraud-detector
  template:
    metadata:
      labels:
        app: fraud-detector
    spec:
      containers:
      - name: model
        image: myregistry/fraud-detector:v2
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 1
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 1m
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

Best Practices

Practice                  | Why
Start small               | 1-5% initial canary
Monitor business metrics  | Not just latency/errors
Automate rollback         | React faster than humans
Use deterministic routing | Consistent user experience
Run long enough           | Statistical significance (sample-size sketch below)
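
"Run long enough" can be made concrete before the test starts by estimating the required sample size. A sketch using the standard two-proportion approximation, assuming a 5% significance level and 80% power (not specific to this module):

# sample_size.py (illustrative sketch)
from scipy.stats import norm

def required_sample_size(p_control: float, p_treatment: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per variant to detect the given difference."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = (z_alpha + z_beta) ** 2 * variance / (p_treatment - p_control) ** 2
    return int(n) + 1

# Detecting a lift from 4.5% to 5.2% conversion takes roughly 15k users per variant
print(required_sample_size(0.045, 0.052))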

Key insight: Safe model deployment is iterative—deploy small, measure impact, automate decisions, and always have a rollback plan.

Next module: ML Monitoring & Production Operations.
