GitOps & Deployment Strategies
Canary & Blue-Green for ML
Why Progressive Delivery for ML?
ML model deployments carry unique risks:
- Silent performance degradation
- Data distribution sensitivity
- Inference latency variations
- Business metric impacts not visible in technical metrics
Progressive delivery reduces this risk by exposing a new model to production traffic gradually, so a regression surfaces while it affects only a small share of requests.
Deployment Strategies Comparison
| Strategy | Risk Level | Rollback Speed | Resource Cost | Best For |
|---|---|---|---|---|
| Big Bang | High | Slow | Low | Simple updates |
| Blue-Green | Medium | Fast | 2x resources | Major changes |
| Canary | Low | Fast | 1.1x resources | ML models |
| A/B Testing | Low | Fast | 2x resources | Business metrics |
Blue-Green Deployment
Run two identical environments side by side and cut traffic over in one step by changing the Service selector:
# blue-green/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  selector:
    app: model-server
    version: blue # Switch to 'green' for cutover
  ports:
    - port: 80
      targetPort: 8080
# blue-green/deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: model-server
      version: blue
  template:
    metadata:
      labels:
        app: model-server
        version: blue
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.2.0
# blue-green/deployment-green.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: model-server
      version: green
  template:
    metadata:
      labels:
        app: model-server
        version: green
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0 # New version
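The cutover is a one-field change to the Service selector. The following is a minimal Python sketch of that step, assuming the Service name and labels from the manifests above. The helper names `cutover_patch` and `kubectl_patch_command` are illustrative and not part of any library. The code only builds the patch and the `kubectl patch` argument list, so the result can be inspected before anything touches a cluster.

```python
import json

VALID_VERSIONS = ("blue", "green")


def cutover_patch(target_version: str) -> dict:
    """Build a merge patch that points the Service at one environment."""
    if target_version not in VALID_VERSIONS:
        raise ValueError(f"unknown version: {target_version!r}")
    return {"spec": {"selector": {"app": "model-server", "version": target_version}}}


def kubectl_patch_command(service: str, target_version: str) -> list:
    """Return the argument list for `kubectl patch`; running it is left to the caller."""
    patch = json.dumps(cutover_patch(target_version))
    return ["kubectl", "patch", "service", service, "--type", "merge", "-p", patch]


if __name__ == "__main__":
    # Cut over to green; pass "blue" instead to switch back.
    print(" ".join(kubectl_patch_command("model-server", "green")))
```

Rolling back is the same call with `"blue"`, which is why rollback is fast as long as the blue Deployment is still running.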
Canary Deployment with Argo Rollouts
# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# rollout/model-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 10
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0
          ports:
            - containerPort: 8080
  strategy:
    canary:
      steps:
        # Phase 1: 10% traffic
        - setWeight: 10
        - pause: {duration: 5m}
        # Phase 2: 25% traffic
        - setWeight: 25
        - pause: {duration: 10m}
        # Phase 3: 50% traffic
        - setWeight: 50
        - pause: {duration: 15m}
        # Phase 4: 75% traffic
        - setWeight: 75
        - pause: {duration: 10m}
        # Full rollout
        - setWeight: 100
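The Rollout above configures no traffic router, so Argo Rollouts approximates each weight through the ratio of canary to stable pods. The sketch below tabulates what the schedule implies for a 10-replica service: roughly how many canary pods each phase needs and when each phase starts. Rounding up is an assumption made for illustration. The controller's actual rounding and surge handling may differ, and the function names are hypothetical.

```python
from typing import List, Tuple

# (weight in percent, pause in minutes) for each phase of the Rollout above;
# the final setWeight: 100 has no pause.
PHASES: List[Tuple[int, int]] = [(10, 5), (25, 10), (50, 15), (75, 10), (100, 0)]


def approx_canary_pods(replicas: int, weight: int) -> int:
    """Approximate canary pod count for a weight, rounding up (an assumption)."""
    return -(-replicas * weight // 100)  # integer ceiling of replicas * weight / 100


def schedule(replicas: int) -> List[Tuple[int, int, int]]:
    """Return (weight, approx canary pods, minutes elapsed when the phase starts)."""
    rows, elapsed = [], 0
    for weight, pause in PHASES:
        rows.append((weight, approx_canary_pods(replicas, weight), elapsed))
        elapsed += pause
    return rows


if __name__ == "__main__":
    for weight, pods, start in schedule(10):
        print(f"{weight:>3}% -> ~{pods} canary pods, starts at {start} min")
```

With these pauses, the new version cannot reach 100% in less than 40 minutes, which bounds how quickly a fix can be shipped through the same pipeline.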
Automated Analysis for ML Models
# rollout/model-rollout-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 10
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry/model:v1.3.0
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 2m}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 50
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: ml-model-analysis
        - setWeight: 100
# rollout/analysis-template.yaml
# The queries below aggregate over every pod labeled app="model-server", stable and canary alike.
# To judge the new version alone, filter on a label that identifies the canary pods.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ml-model-analysis
spec:
  metrics:
    # Check model accuracy
    - name: accuracy
      interval: 1m
      # With an interval but no count, a metric runs indefinitely and an inline
      # analysis step never completes. The value 5 is illustrative.
      count: 5
      successCondition: result[0] >= 0.90
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(model_predictions_correct_total{app="model-server"}[5m])) /
            sum(rate(model_predictions_total{app="model-server"}[5m]))
    # Check inference latency (p99 in milliseconds)
    - name: latency-p99
      interval: 1m
      count: 5
      successCondition: result[0] <= 200
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(model_inference_duration_seconds_bucket{app="model-server"}[5m])) by (le)
            ) * 1000
    # Check error rate
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] <= 0.01
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(model_predictions_errors_total{app="model-server"}[5m])) /
            sum(rate(model_predictions_total{app="model-server"}[5m]))
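The three success conditions amount to a set of threshold gates on measured values. The sketch below restates them in Python so candidate thresholds can be tried against historical measurements before they are committed to a template. The thresholds come from the AnalysisTemplate above. The `Gate` class and `failed_gates` function are hypothetical names, and the code does not reproduce how Argo Rollouts evaluates expressions internally.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    name: str
    threshold: float
    higher_is_better: bool

    def passes(self, value: float) -> bool:
        """Apply >= for metrics that should stay high, <= for ones that should stay low."""
        if self.higher_is_better:
            return value >= self.threshold
        return value <= self.threshold


# Thresholds copied from the successCondition fields above.
GATES: List[Gate] = [
    Gate("accuracy", 0.90, higher_is_better=True),
    Gate("latency-p99", 200.0, higher_is_better=False),  # milliseconds
    Gate("error-rate", 0.01, higher_is_better=False),
]


def failed_gates(measurements: Dict[str, float]) -> List[str]:
    """Return the names of the gates that one set of measurements fails."""
    return [g.name for g in GATES if not g.passes(measurements[g.name])]


if __name__ == "__main__":
    sample = {"accuracy": 0.93, "latency-p99": 250.0, "error-rate": 0.004}
    print(failed_gates(sample))
```

A model that is more accurate but slower fails the latency gate in this sample, which is the kind of trade-off the combined analysis is meant to catch.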
Shadow Testing (Dark Launch)
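Test a new model against live production requests without affecting users. Istio sends a copy of each request to the shadow subset and discards its responses; the stable and canary subsets must be defined in a DestinationRule: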
# shadow/istio-virtual-service.yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-server
spec:
  hosts:
    - model-server
  http:
    - route:
        - destination:
            host: model-server
            subset: stable
          weight: 100
      mirror:
        host: model-server
        subset: canary
      mirrorPercentage:
        value: 100.0 # Mirror 100% of traffic
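Because the responses to mirrored requests never reach the caller, the two models have to be compared offline. The sketch below assumes that both the stable and the shadow service log their prediction keyed by a shared request ID. That logging scheme and the function name `disagreement_rate` are assumptions for illustration, not something the setup above provides.

```python
from typing import Dict, Hashable


def disagreement_rate(stable: Dict[str, Hashable], shadow: Dict[str, Hashable]) -> float:
    """Share of requests seen by both models whose predictions differ."""
    shared = stable.keys() & shadow.keys()  # ignore requests only one side logged
    if not shared:
        raise ValueError("no request IDs in common")
    differing = sum(1 for request_id in shared if stable[request_id] != shadow[request_id])
    return differing / len(shared)


if __name__ == "__main__":
    # Hypothetical logged predictions, keyed by request ID.
    stable_log = {"r1": "cat", "r2": "dog", "r3": "cat", "r4": "bird"}
    shadow_log = {"r1": "cat", "r2": "cat", "r3": "cat", "r5": "dog"}
    print(f"disagreement: {disagreement_rate(stable_log, shadow_log):.1%}")
```

A high disagreement rate is not a failure in itself, since the new model may be the one that is right, but it shows how much behavior will change before any user sees it.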
A/B Testing for Business Metrics
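# ab-test/experiment.yaml
apiVersion: argoproj.io/v1alpha1
kind: Experiment
metadata:
  name: model-ab-test
spec:
  duration: 24h
  templates:
    # Each template needs a selector and a pod template; the variant labels are illustrative
    - name: model-a
      replicas: 5
      selector:
        matchLabels:
          app: model-server
          variant: model-a
      template:
        metadata:
          labels:
            app: model-server
            variant: model-a
        spec:
          containers:
            - name: model-server
              image: registry/model:v1.2.0
    - name: model-b
      replicas: 5
      selector:
        matchLabels:
          app: model-server
          variant: model-b
      template:
        metadata:
          labels:
            app: model-server
            variant: model-b
        spec:
          containers:
            - name: model-server
              image: registry/model:v1.3.0
  analyses:
    - name: conversion-rate
      templateName: business-metrics # AnalysisTemplate that must be defined separately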
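Once the experiment has collected data, the two variants still have to be compared in a way that separates a real difference from noise. One common choice, swapped in here as an illustration rather than taken from the experiment definition above, is a two-proportion z-test on conversion counts. The function name and the sample numbers are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rate, B minus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # conversion rate if both arms were identical
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("conversion rate is 0% or 100% in both arms; test is undefined")
    z = (p_b - p_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: model-a converts 500 of 10,000 requests, model-b 560 of 10,000.
    z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
    print(f"z = {z:.2f}, p = {p:.3f}")
```

The significance threshold and the minimum sample size should be fixed before the experiment starts, otherwise repeated peeking at the result inflates the false-positive rate.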
Automated Rollback
# rollout/rollback-config.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        # Automatic rollback: a failed analysis run aborts the rollout
        # and traffic returns to the stable version
        - analysis:
            templates:
              - templateName: ml-model-analysis
      # Seconds to keep canary pods running after an abort
      # (applies when traffic routing is configured)
      abortScaleDownDelaySeconds: 30
# Manual rollback to the previous revision
kubectl argo rollouts undo model-server
# Abort the current rollout
kubectl argo rollouts abort model-server
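The abort behavior can be pictured as a walk over the step list that stops at the first failed analysis. The sketch below is a simplified model of that control flow, not the controller's implementation. It uses the weights from the analysis-driven Rollout earlier in this section with the pauses left out, and the names `walk_rollout`, `"aborted"` and `"promoted"` are illustrative.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, int]  # ("setWeight", percent) or ("analysis", 0)

STEPS: List[Step] = [
    ("setWeight", 10),
    ("analysis", 0),
    ("setWeight", 50),
    ("analysis", 0),
    ("setWeight", 100),
]


def walk_rollout(steps: List[Step], analysis_ok: Callable[[int], bool]) -> Tuple[str, int]:
    """Walk canary steps; return (outcome, final canary weight)."""
    weight = 0
    for kind, value in steps:
        if kind == "setWeight":
            weight = value
        elif kind == "analysis":
            if not analysis_ok(weight):
                return "aborted", 0  # all traffic goes back to the stable version
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return "promoted", weight


if __name__ == "__main__":
    # Hypothetical model that holds up at 10% traffic but degrades at 50%.
    print(walk_rollout(STEPS, lambda weight: weight < 50))
```

The earlier an analysis step appears in the list, the smaller the share of traffic that is exposed when it fails.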
Key Takeaways
| Strategy | When to Use |
|---|---|
| Blue-Green | Major model changes that need instant rollback |
| Canary | Gradual rollout with automated analysis |
| Shadow | Test with production data safely |
| A/B Testing | Compare business metric impact |