Production Operations & GitOps

ArgoCD for ML Deployments

ArgoCD enables GitOps-based continuous delivery for ML platforms: the Git repository becomes the single source of truth, and every deployment is declarative, version-controlled, and reviewable. Teams adopting GitOps commonly report faster releases and easier audit compliance, since every change to the platform is a traceable commit.

GitOps Architecture for ML

┌─────────────────────────────────────────────────────────────────────┐
│                    GitOps for ML Platform                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    Git Repository (Source of Truth)          │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │    │
│  │  │ Model Configs│  │ Infra Configs│  │ App Configs  │       │    │
│  │  │ (versions)   │  │ (k8s yamls)  │  │ (services)   │       │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│                        PR Review + Approval                          │
│                              ↓                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    ArgoCD Controller                         │    │
│  │  - Sync status monitoring                                    │    │
│  │  - Drift detection                                           │    │
│  │  - Automated rollback                                        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                              │                                       │
│                              ↓                                       │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                  Kubernetes Cluster                          │    │
│  │  [ML Serving] [Training] [Monitoring] [Feature Store]       │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

ArgoCD Installation

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

# Install the ArgoCD CLI (Homebrew on macOS; other platforms: see the ArgoCD releases page)
brew install argocd

# Port-forward the API server (the in-cluster service DNS name is not reachable from a workstation)
kubectl port-forward svc/argocd-server -n argocd 8080:443 &

# Login with the initial admin password retrieved above
argocd login localhost:8080 --username admin --insecure

ML Platform Application

# ArgoCD Application for ML platform
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
spec:
  project: ml-production
  source:
    repoURL: https://github.com/org/ml-platform-config
    targetRevision: main
    path: environments/production
    helm:
      valueFiles:
      - values.yaml
      - values-production.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
    - CreateNamespace=true
    - PruneLast=true
    - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
---
# ArgoCD Project with RBAC
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: ml-production
  namespace: argocd
spec:
  description: ML Platform Production
  sourceRepos:
  - https://github.com/org/ml-platform-config
  - https://github.com/org/ml-models
  destinations:
  - namespace: ml-serving
    server: https://kubernetes.default.svc
  - namespace: ml-training
    server: https://kubernetes.default.svc
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  - group: 'apiextensions.k8s.io'
    kind: CustomResourceDefinition
  namespaceResourceWhitelist:
  - group: ''
    kind: '*'
  - group: 'apps'
    kind: '*'
  - group: 'serving.kserve.io'
    kind: '*'

Model Deployment with GitOps

# models/llm-service/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: ml-serving

resources:
- inference-service.yaml
- hpa.yaml
- pdb.yaml

configMapGenerator:
- name: model-config
  literals:
  - MODEL_VERSION=v2.1.0
  - MAX_BATCH_SIZE=32
  - ENABLE_CACHING=true

images:
- name: inference-server
  newTag: v2.1.0

commonLabels:
  app.kubernetes.io/managed-by: argocd
  model.ml/version: v2.1.0
---
# models/llm-service/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llm-service
  annotations:
    argocd.argoproj.io/sync-wave: "2"
spec:
  predictor:
    minReplicas: 2
    maxReplicas: 10
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/llm/v2.1.0"
      resources:
        requests:
          nvidia.com/gpu: 1

Progressive Rollouts with Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
  namespace: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  replicas: 10
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: inference
        image: inference:v2
        resources:
          limits:
            nvidia.com/gpu: 1
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: inference-success-rate
          args:
          - name: service-name
            value: inference-canary
      - setWeight: 30
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: inference-latency
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      canaryService: inference-canary
      stableService: inference-stable
      trafficRouting:
        istio:
          virtualService:
            name: inference-vsvc
            routes:
            - primary
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: inference-success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.99
    failureLimit: 2
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(inference_requests_total{service="{{args.service-name}}",status="success"}[5m])) /
          sum(rate(inference_requests_total{service="{{args.service-name}}"}[5m]))
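The AnalysisTemplate's PromQL divides the rate of successful requests by the total rate and compares the result against the `successCondition` threshold. The same check can be sketched locally with hypothetical counter values standing in for the `inference_requests_total` series:

```python
# Sketch of the check the AnalysisTemplate performs: successful requests over
# total requests in a window, compared against the 0.99 threshold.
def success_rate(success_count, total_count):
    """Fraction of successful requests; 0.0 when there was no traffic."""
    return success_count / total_count if total_count else 0.0

def passes(rate, threshold=0.99):
    # Mirrors successCondition: result[0] >= 0.99
    return rate >= threshold

rate = success_rate(4975, 5000)  # hypothetical counts over a 5m window
print(rate, passes(rate))  # 0.995 True
```

With `count: 5`, `interval: 1m`, and `failureLimit: 2`, the rollout tolerates at most two failing measurements across the five-minute analysis before aborting and rolling traffic back to the stable service.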

Sync Waves for ML Components

# Order of deployment using sync waves
# Wave 0: Namespaces and RBAC
apiVersion: v1
kind: Namespace
metadata:
  name: ml-serving
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
# Wave 1: ConfigMaps and Secrets
apiVersion: v1
kind: Secret
metadata:
  name: model-credentials
  annotations:
    argocd.argoproj.io/sync-wave: "1"
---
# Wave 2: Storage (PVCs)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  annotations:
    argocd.argoproj.io/sync-wave: "2"
---
# Wave 3: Services
apiVersion: v1
kind: Service
metadata:
  name: inference-service
  annotations:
    argocd.argoproj.io/sync-wave: "3"
---
# Wave 4: Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
  annotations:
    argocd.argoproj.io/sync-wave: "4"
---
# Wave 5: Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
  annotations:
    argocd.argoproj.io/sync-wave: "5"
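ArgoCD applies resources in ascending sync-wave order, waiting for each wave to become healthy before starting the next. A simplified sketch of that ordering over the manifests above (ArgoCD additionally orders by resource kind and phase within a wave):

```python
# Sketch: sort resources by their argocd.argoproj.io/sync-wave annotation,
# the way ArgoCD orders application. Wave annotations are strings; default is "0".
resources = [
    {"kind": "Deployment", "name": "inference", "wave": "4"},
    {"kind": "Namespace", "name": "ml-serving", "wave": "0"},
    {"kind": "Service", "name": "inference-service", "wave": "3"},
    {"kind": "Secret", "name": "model-credentials", "wave": "1"},
    {"kind": "PersistentVolumeClaim", "name": "model-cache", "wave": "2"},
    {"kind": "HorizontalPodAutoscaler", "name": "inference-hpa", "wave": "5"},
]

def apply_order(items):
    """Kinds in the order ArgoCD would apply them."""
    return [r["kind"] for r in sorted(items, key=lambda r: int(r.get("wave", "0")))]

print(apply_order(resources))
```

This guarantees, for example, that the Secret holding model credentials exists before the Deployment that mounts it, and that the HPA only targets a Deployment that is already present.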

ArgoCD Notifications for ML

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-deployed: |
    - description: Model deployment completed
      send: [slack-ml-team]
      when: app.status.operationState.phase in ['Succeeded']

  trigger.on-sync-failed: |
    - description: Model sync failed
      send: [slack-ml-team, pagerduty]
      when: app.status.operationState.phase in ['Error', 'Failed']

  template.slack-ml-team: |
    message: |
      {{if eq .app.status.operationState.phase "Succeeded"}}:white_check_mark:{{end}}
      {{if eq .app.status.operationState.phase "Failed"}}:x:{{end}}
      Application {{.app.metadata.name}} sync {{.app.status.operationState.phase}}
      Revision: {{.app.status.sync.revision}}
      {{range .app.status.operationState.syncResult.resources}}
      - {{.kind}}/{{.name}}: {{.status}}
      {{end}}
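The two triggers above route on the sync operation's terminal phase: successes go to the team channel only, failures also page on-call. The `when` expressions reduce to a simple membership test, sketched here with the phase names and destinations from the ConfigMap:

```python
# Sketch of the routing the notification triggers above encode:
# phase -> list of notification services.
def route(phase):
    if phase == "Succeeded":
        return ["slack-ml-team"]               # trigger.on-deployed
    if phase in ("Error", "Failed"):
        return ["slack-ml-team", "pagerduty"]  # trigger.on-sync-failed
    return []                                  # non-terminal phases: no notification

print(route("Failed"))  # ['slack-ml-team', 'pagerduty']
```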

Next lesson: Monitoring and alerting for ML production systems.
