Production Operations & GitOps
ArgoCD for ML Deployments
ArgoCD brings GitOps-based continuous delivery to ML platforms: every deployment is declarative, version-controlled, and reviewed before it reaches the cluster. Organizations adopting GitOps commonly report substantially faster releases (figures as high as 70% are cited) and improved audit compliance, since every change is traceable to a commit.
GitOps Architecture for ML
┌──────────────────────────────────────────────────────────────────┐
│                      GitOps for ML Platform                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │              Git Repository (Source of Truth)              │  │
│  │    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │  │
│  │    │ Model Configs│  │ Infra Configs│  │ App Configs  │    │  │
│  │    │  (versions)  │  │ (k8s yamls)  │  │  (services)  │    │  │
│  │    └──────────────┘  └──────────────┘  └──────────────┘    │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│                      PR Review + Approval                        │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                     ArgoCD Controller                      │  │
│  │   - Sync status monitoring                                 │  │
│  │   - Drift detection                                        │  │
│  │   - Automated rollback                                     │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                │                                 │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                     Kubernetes Cluster                     │  │
│  │  [ML Serving]  [Training]  [Monitoring]  [Feature Store]   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
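In practice, the Git repository above maps to a concrete directory layout. A sketch of one plausible structure for the ml-platform-config repo follows; the directory names are illustrative, chosen to match the manifests later in this lesson:

ml-platform-config/
├── environments/
│   ├── staging/
│   │   └── values-staging.yaml
│   └── production/
│       ├── values.yaml
│       └── values-production.yaml
└── models/
    └── llm-service/
        ├── kustomization.yaml
        ├── inference-service.yaml
        ├── hpa.yaml
        └── pdb.yaml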
ArgoCD Installation
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Install the ArgoCD CLI (macOS; see the ArgoCD docs for other platforms)
brew install argocd

# Expose the API server locally and log in
# (the in-cluster Service is not reachable from a workstation directly)
kubectl port-forward svc/argocd-server -n argocd 8080:443 &
argocd login localhost:8080 --username admin --insecure
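With the CLI logged in, it is worth verifying the install and registering the config repository. The repo URL matches the Application manifest in the next section; the credential flags are placeholders, only needed for a private repo:

# Confirm the ArgoCD control-plane pods are healthy
kubectl get pods -n argocd

# Register the GitOps config repository
argocd repo add https://github.com/org/ml-platform-config \
  --username ci-bot --password <personal-access-token>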
ML Platform Application
# ArgoCD Application for ML platform
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: ml-production
  source:
    repoURL: https://github.com/org/ml-platform-config
    targetRevision: main
    path: environments/production
    helm:
      valueFiles:
        - values.yaml
        - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PruneLast=true
      - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# ArgoCD Project with RBAC
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ml-production
  namespace: argocd
spec:
  description: ML Platform Production
  sourceRepos:
    - https://github.com/org/ml-platform-config
    - https://github.com/org/ml-models
  destinations:
    - namespace: ml-serving
      server: https://kubernetes.default.svc
    - namespace: ml-training
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: ''
      kind: Namespace
    - group: 'apiextensions.k8s.io'
      kind: CustomResourceDefinition
  namespaceResourceWhitelist:
    - group: ''
      kind: '*'
    - group: 'apps'
      kind: '*'
    - group: 'serving.kserve.io'
      kind: '*'
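AppProjects can also carry project-scoped RBAC roles, which is how the "Project with RBAC" part is usually realized. A minimal sketch, appended under the spec above; the role and SSO group names are illustrative:

# Appended under spec: of the ml-production AppProject
roles:
  - name: ml-deployer
    description: ML engineers may view and sync apps in this project
    policies:
      - p, proj:ml-production:ml-deployer, applications, get, ml-production/*, allow
      - p, proj:ml-production:ml-deployer, applications, sync, ml-production/*, allow
    groups:
      - org:ml-engineers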
Model Deployment with GitOps
# models/llm-service/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-serving
resources:
  - inference-service.yaml
  - hpa.yaml
  - pdb.yaml
configMapGenerator:
  - name: model-config
    literals:
      - MODEL_VERSION=v2.1.0
      - MAX_BATCH_SIZE=32
      - ENABLE_CACHING=true
images:
  - name: inference-server
    newTag: v2.1.0
commonLabels:
  app.kubernetes.io/managed-by: argocd
  # caution: commonLabels also propagates into selectors on core workload types
  model.ml/version: v2.1.0
---
# models/llm-service/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm/v2.1.0"
      resources:
        limits:
          nvidia.com/gpu: 1  # extended resources must set a limit; the request defaults to it
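With this layout, promoting a new model version is just a Git change that flows through review before ArgoCD syncs it. A sketch of the flow, with illustrative branch and version names:

# Bump the serving image tag in the overlay
cd models/llm-service
kustomize edit set image inference-server:v2.2.0

# Update MODEL_VERSION and storageUri to match, then open a PR
git checkout -b release/llm-v2.2.0
git commit -am "llm-service: promote model v2.2.0"
git push origin release/llm-v2.2.0   # merging to main triggers the automated sync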
Progressive Rollouts with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
        - name: inference
          image: inference:v2
          resources:
            limits:
              nvidia.com/gpu: 1
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - analysis:
            templates:
              - templateName: inference-success-rate
            args:
              - name: service-name
                value: inference-canary
        - setWeight: 30
        - pause: {duration: 10m}
        - analysis:
            templates:
              - templateName: inference-latency
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
      canaryService: inference-canary
      stableService: inference-stable
      trafficRouting:
        istio:
          virtualService:
            name: inference-vsvc
            routes:
              - primary
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: inference-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(inference_requests_total{service="{{args.service-name}}",status="success"}[5m])) /
            sum(rate(inference_requests_total{service="{{args.service-name}}"}[5m]))
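The rollout's second analysis step references an inference-latency template that is not shown above. A plausible definition follows; the histogram metric name and the 500 ms p99 threshold are assumptions to adapt to your SLOs:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: inference-latency
spec:
  metrics:
    - name: p99-latency
      interval: 1m
      count: 5
      successCondition: result[0] <= 0.5   # assumed SLO: p99 below 500 ms
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(inference_request_duration_seconds_bucket{service="inference-canary"}[5m])) by (le))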
Sync Waves for ML Components
# Order of deployment using sync waves
# Wave 0: Namespaces and RBAC
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: ConfigMaps and Secrets
apiVersion: v1
kind: Secret
metadata:
  name: model-credentials
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
# Wave 2: Storage (PVCs)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  annotations:
    argocd.argoproj.io/sync-wave: "2"
---
# Wave 3: Services
apiVersion: v1
kind: Service
metadata:
  name: inference-service
  annotations:
    argocd.argoproj.io/sync-wave: "3"
---
# Wave 4: Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
  annotations:
    argocd.argoproj.io/sync-wave: "4"
---
# Wave 5: Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
  annotations:
    argocd.argoproj.io/sync-wave: "5"
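Sync waves order resources within a sync, but one-off setup steps, such as pre-pulling model weights before the wave-4 Deployment starts, fit ArgoCD resource hooks better. A sketch using a PreSync Job; the image and command are hypothetical:

apiVersion: batch/v1
kind: Job
metadata:
  name: model-prefetch
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prefetch
          image: org/model-tools:latest                                # hypothetical image
          command: ["prefetch", "s3://models/llm/v2.1.0", "/cache"]   # hypothetical CLI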
ArgoCD Notifications for ML
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-deployed: |
    - description: Model deployment completed
      send: [slack-ml-team]
      when: app.status.operationState.phase in ['Succeeded']
  trigger.on-sync-failed: |
    - description: Model sync failed
      send: [slack-ml-team, pagerduty]
      when: app.status.operationState.phase in ['Error', 'Failed']
  template.slack-ml-team: |
    message: |
      {{if eq .app.status.operationState.phase "Succeeded"}}:white_check_mark:{{end}}
      {{if eq .app.status.operationState.phase "Failed"}}:x:{{end}}
      Application {{.app.metadata.name}} sync {{.app.status.operationState.phase}}
      Revision: {{.app.status.sync.revision}}
      {{range .app.status.operationState.syncResult.resources}}
        - {{.kind}}/{{.name}}: {{.status}}
      {{end}}
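Triggers and templates alone do not send anything: the Slack service must be configured and each Application must subscribe. A minimal sketch; the channel name and token secret key are assumptions:

# Also in argocd-notifications-cm: the Slack integration itself
service.slack: |
  token: $slack-token   # resolved from argocd-notifications-secret

# Per-Application subscription, added to the Application's metadata
metadata:
  annotations:
    notifications.argoproj.io/subscribe.on-deployed.slack: ml-deployments
    notifications.argoproj.io/subscribe.on-sync-failed.slack: ml-deployments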
Next lesson: Monitoring and alerting for ML production systems.