GitOps & Deployment Strategies

Canary & Blue-Green for ML

5 min read

Why Progressive Delivery for ML?

ML model deployments carry unique risks:

  • Silent performance degradation
  • Data distribution sensitivity
  • Inference latency variations
  • Business metric impacts not visible in technical metrics

Progressive delivery minimizes risk by gradually exposing new models to production traffic.

Deployment Strategies Comparison

Strategy    | Risk Level | Rollback Speed | Resource Cost   | Best For
Big Bang    | High       | Slow           | Low             | Simple updates
Blue-Green  | Medium     | Fast           | 2x resources    | Major changes
Canary      | Low        | Fast           | 1.1x resources  | ML models
A/B Testing | Low        | Fast           | 2x resources    | Business metrics

Blue-Green Deployment

Run two identical environments and switch traffic instantly by flipping the Service selector:

# blue-green/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server
    version: blue  # Switch to 'green' for cutover
  ports:
    - port: 80
      targetPort: 8080
# blue-green/deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: model-server
      version: blue
  template:
    metadata:
      labels:
        app: model-server
        version: blue
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.2.0
# blue-green/deployment-green.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: model-server
      version: green
  template:
    metadata:
      labels:
        app: model-server
        version: green
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0  # New version
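
Cutover is a one-line change to the Service selector. A minimal sketch of how the switch and the instant rollback might be scripted with kubectl (in a GitOps setup the same edit would be committed to the Service manifest rather than patched directly):

# Switch live traffic to the green deployment
kubectl patch service model-server \
  -p '{"spec":{"selector":{"app":"model-server","version":"green"}}}'

# Roll back instantly by pointing the selector at blue again
kubectl patch service model-server \
  -p '{"spec":{"selector":{"app":"model-server","version":"blue"}}}'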

Canary Deployment with Argo Rollouts

# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# rollout/model-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 10
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        # Phase 1: 10% traffic
        - setWeight: 10
        - pause: {duration: 5m}

        # Phase 2: 25% traffic
        - setWeight: 25
        - pause: {duration: 10m}

        # Phase 3: 50% traffic
        - setWeight: 50
        - pause: {duration: 15m}

        # Phase 4: 75% traffic
        - setWeight: 75
        - pause: {duration: 10m}

        # Full rollout
        - setWeight: 100
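
Assuming the Argo Rollouts kubectl plugin is installed, the canary's progress can be watched and a pause skipped once you are satisfied with the new model:

# Watch the rollout advance through its canary steps
kubectl argo rollouts get rollout model-server --watch

# Skip the remaining pause and continue to the next step
kubectl argo rollouts promote model-server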

Automated Analysis for ML Models

# rollout/model-rollout-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 10
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 50
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 100
# rollout/analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ml-model-analysis
spec:
  metrics:
    # Check model accuracy
    - name: accuracy
      interval: 1m
      successCondition: result[0] >= 0.90
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(model_predictions_correct_total{app="model-server"}[5m])) /
            sum(rate(model_predictions_total{app="model-server"}[5m]))

    # Check inference latency
    - name: latency-p99
      interval: 1m
      successCondition: result[0] <= 200
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(model_inference_duration_seconds_bucket{app="model-server"}[5m])) by (le)
            ) * 1000

    # Check error rate
    - name: error-rate
      interval: 1m
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(model_predictions_errors_total{app="model-server"}[5m])) /
            sum(rate(model_predictions_total{app="model-server"}[5m]))
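
The analysis can be made more tolerant of noisy metrics: count controls how many measurements are taken and failureLimit how many of them may fail before the run is marked failed. A sketch for the accuracy metric, keeping the same threshold:

# rollout/analysis-template-tolerant.yaml (sketch; thresholds as above)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ml-model-analysis
spec:
  metrics:
    - name: accuracy
      interval: 1m
      count: 5            # take five measurements, one per minute
      failureLimit: 1     # tolerate a single failing sample
      successCondition: result[0] >= 0.90
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(model_predictions_correct_total{app="model-server"}[5m])) /
            sum(rate(model_predictions_total{app="model-server"}[5m]))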

Shadow Testing (Dark Launch)

Mirror production traffic to the new model without affecting the responses users receive:

# shadow/istio-virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
    - model-server
  http:
    - route:
        - destination:
            host: model-server
            subset: stable
          weight: 100
      mirror:
        host: model-server
        subset: canary
      mirrorPercentage:
        value: 100.0  # Mirror 100% of traffic
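
The stable and canary subsets referenced above must be defined in an Istio DestinationRule; a minimal sketch, assuming each deployment carries a matching version label:

# shadow/istio-destination-rule.yaml (label values are assumptions)
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: model-server
spec:
  host: model-server
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary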

A/B Testing for Business Metrics

# ab-test/experiment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: model-ab-test
spec:
  duration: 24h
  templates:
    - name: model-a
      replicas: 5
      selector:
        matchLabels:
          app: model-server
          variant: model-a
      template:
        metadata:
          labels:
            app: model-server
            variant: model-a
        spec:
          containers:
            - name: model-server
              image: registry/model:v1.2.0
    - name: model-b
      replicas: 5
      selector:
        matchLabels:
          app: model-server
          variant: model-b
      template:
        metadata:
          labels:
            app: model-server
            variant: model-b
        spec:
          containers:
            - name: model-server
              image: registry/model:v1.3.0
  analyses:
    - name: conversion-rate
      templateName: business-metrics
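
The experiment points at a business-metrics AnalysisTemplate that is not shown above; a minimal sketch of what it could look like, with hypothetical conversion and impression metric names that would need to match whatever the application actually exports to Prometheus:

# ab-test/analysis-business-metrics.yaml (illustrative; metric names and threshold are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: business-metrics
spec:
  metrics:
    - name: conversion-rate
      interval: 5m
      successCondition: result[0] >= 0.05   # example threshold, tune per product
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(checkout_conversions_total{app="model-server"}[30m])) /
            sum(rate(recommendation_impressions_total{app="model-server"}[30m]))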

Automated Rollback

# rollout/rollback-config.yaml (replicas/selector/template omitted for brevity)
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: ml-model-analysis
      # A failed analysis run aborts the rollout automatically and
      # shifts traffic back to the stable version
      abortScaleDownDelaySeconds: 30  # keep aborted canary pods briefly before scaling down
# Manual rollback
kubectl argo rollouts undo model-server

# Abort current rollout
kubectl argo rollouts abort model-server
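
An aborted rollout stays in a degraded state until it is retried or the spec is reverted; with the kubectl plugin the state can be checked and the update retried once the cause is fixed:

# Show the rollout's current status
kubectl argo rollouts status model-server

# Retry the aborted update
kubectl argo rollouts retry rollout model-server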

Key Takeaways

Strategy    | When to Use
Blue-Green  | Major model changes, need instant rollback
Canary      | Gradual rollout with automated analysis
Shadow      | Test with production data safely
A/B Testing | Compare business metric impact
