CI/CD for ML Systems

GitOps and Infrastructure as Code

GitOps brings Git workflows to infrastructure and ML deployments. Interviewers test your knowledge of ArgoCD, Terraform, and declarative ML infrastructure.

GitOps Principles for ML

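| Principle   | Application to ML                                                    |
|-------------|----------------------------------------------------------------------|
| Declarative | Model deployments defined in YAML                                    |
| Versioned   | All config in Git, auditable                                         |
| Automated   | ArgoCD syncs the cluster to the desired state                        |
| Observable  | Desired state readable in Git; live sync and health status in ArgoCD |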
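
To make the "Automated" row concrete, here is a minimal sketch of the comparison a GitOps controller performs conceptually: diff the desired state from Git against the live state and decide what to create, update, or prune. The `SyncPlan` type, the `plan_sync` function, and the resource names are hypothetical illustrations, not ArgoCD's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    create: dict  # resources in Git but not in the cluster
    update: dict  # resources present in both whose spec differs
    prune: list   # resources in the cluster but no longer in Git


def plan_sync(desired: dict, live: dict, prune: bool = True) -> SyncPlan:
    """Compare desired state (from Git) with live state (from the cluster)."""
    create = {k: v for k, v in desired.items() if k not in live}
    update = {k: v for k, v in desired.items() if k in live and live[k] != v}
    extra = sorted(k for k in live if k not in desired)
    return SyncPlan(create=create, update=update, prune=extra if prune else [])


# Hypothetical states: Git asks for v2.3.2, the cluster still runs v2.3.1.
desired = {
    "deployment/fraud-detection": {"image": "fraud:v2.3.2", "replicas": 5},
    "service/fraud-detection": {"port": 8080},
}
live = {
    "deployment/fraud-detection": {"image": "fraud:v2.3.1", "replicas": 5},
    "configmap/old-settings": {"threshold": 0.5},
}
plan = plan_sync(desired, live)
print(plan)
```

Running the same comparison in a loop is what makes the process self-healing: any manual change to the cluster shows up as a diff and is reverted on the next sync.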

ArgoCD for ML Deployments

# argocd/applications/model-serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detection-model
  namespace: argocd
spec:
  project: ml-production
  source:
    repoURL: https://github.com/company/ml-deployments.git
    targetRevision: main
    path: models/fraud-detection
    helm:
      valueFiles:
        - values-production.yaml
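      # Note: parameters take precedence over valueFiles. Pin model.version
      # in one place only, or bumps to values-production.yaml have no effect.
      parameters: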
        - name: model.version
          value: "v2.3.1"
        - name: model.replicas
          value: "5"
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
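
One detail worth being able to explain in an interview is how the `valueFiles` and `parameters` entries above combine: values set through `parameters` win over values from the listed files. The sketch below imitates that layering in plain Python. The helper names and the contents of the values file are assumptions for illustration, and values are kept as strings rather than being type-converted the way Helm would.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where keys in override win, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def set_by_path(values: dict, dotted_key: str, value) -> None:
    """Set values['a']['b'] = value for a dotted key such as 'a.b'."""
    *parents, leaf = dotted_key.split(".")
    node = values
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def effective_values(value_files: list, parameters: list) -> dict:
    """Layer value files in order, then apply parameters on top."""
    result = {}
    for layer in value_files:
        result = deep_merge(result, copy.deepcopy(layer))
    for param in parameters:
        set_by_path(result, param["name"], param["value"])
    return result


# Hypothetical contents of values-production.yaml after a version bump.
values_production = {"model": {"version": "v2.3.2", "replicas": 3}}
# The parameters from the Application manifest above.
parameters = [
    {"name": "model.version", "value": "v2.3.1"},
    {"name": "model.replicas", "value": "5"},
]
effective = effective_values([values_production], parameters)
print(effective)  # {'model': {'version': 'v2.3.1', 'replicas': '5'}}
```

In this example the bump to v2.3.2 in the values file is masked by the pinned parameter, which is why many teams keep the model version in exactly one of the two places.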

Interview Question: GitOps for ML

Question: "How would you implement GitOps for a model serving platform?"

Answer Structure:

# Repository structure for GitOps ML
ml-deployments/
├── base/                          # Shared resources
│   ├── kustomization.yaml
│   ├── model-serving-deployment.yaml
│   ├── model-serving-service.yaml
│   └── model-serving-hpa.yaml
├── models/
│   ├── fraud-detection/
│   │   ├── kustomization.yaml
│   │   ├── values-staging.yaml
│   │   └── values-production.yaml
│   │
│   └── recommendation/
│       ├── kustomization.yaml
│       ├── values-staging.yaml
│       └── values-production.yaml
├── infrastructure/
│   ├── terraform/
│   │   ├── gke-cluster.tf
│   │   ├── mlflow-server.tf
│   │   └── feature-store.tf
│   │
│   └── crossplane/
│       ├── database.yaml
│       └── redis.yaml
└── argocd/
    ├── app-of-apps.yaml
    └── applications/
        ├── fraud-detection.yaml
        └── recommendation.yaml
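
A layout like this is easier to keep consistent when CI checks it. The following is a minimal sketch of such a check, assuming every directory under `models/` must contain the three files shown in the tree. The function name `missing_model_files` is hypothetical, and the demo builds a throwaway directory rather than reading a real repository.

```python
import tempfile
from pathlib import Path

# Files each model directory is expected to contain, per the tree above.
REQUIRED_FILES = ("kustomization.yaml", "values-staging.yaml", "values-production.yaml")


def missing_model_files(repo_root: Path) -> dict:
    """Map each model directory name to the required files it lacks."""
    problems = {}
    models_dir = repo_root / "models"
    for model_dir in sorted(p for p in models_dir.iterdir() if p.is_dir()):
        missing = [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]
        if missing:
            problems[model_dir.name] = missing
    return problems


# Demo on a temporary repo where "recommendation" lacks its production values.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    layout = {
        "fraud-detection": REQUIRED_FILES,
        "recommendation": ("kustomization.yaml", "values-staging.yaml"),
    }
    for model, files in layout.items():
        model_dir = root / "models" / model
        model_dir.mkdir(parents=True)
        for name in files:
            (model_dir / name).touch()
    problems = missing_model_files(root)

print(problems)  # {'recommendation': ['values-production.yaml']}
```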

Terraform for ML Infrastructure

# terraform/ml-platform.tf

# GKE cluster with GPU node pool
resource "google_container_cluster" "ml_cluster" {
  name     = "ml-production"
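  location = var.region

  # Node pools are managed as separate resources, so drop the default pool.
  # initial_node_count is still required when no node_pool block is set.
  remove_default_node_pool = true
  initial_node_count       = 1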

  # Enable features needed for ML
  addons_config {
    gce_persistent_disk_csi_driver_config {
      enabled = true  # For model storage
    }
  }

  # Node auto-provisioning with cluster-wide resource limits
  cluster_autoscaling {
    enabled = true
    resource_limits {
      resource_type = "cpu"
      maximum       = 1000
    }
    resource_limits {
      resource_type = "memory"
      maximum       = 4000  # GB
    }
    resource_limits {
      resource_type = "nvidia-tesla-t4"
      maximum       = 50
    }
  }
}

# GPU node pool for inference
resource "google_container_node_pool" "gpu_pool" {
  name       = "gpu-inference"
  cluster    = google_container_cluster.ml_cluster.name
  location   = var.region

  autoscaling {
    min_node_count = 1
    max_node_count = 20
  }

  node_config {
    machine_type = "n1-standard-8"
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }

    # Security hardening: disable the legacy metadata server endpoints
    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

# PostgreSQL backend store for the MLflow tracking server
resource "google_sql_database_instance" "mlflow_db" {
  name             = "mlflow-tracking"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier = "db-custom-2-7680"

    backup_configuration {
      enabled = true
    }
  }
}

# Artifact storage for models
resource "google_storage_bucket" "model_artifacts" {
  name     = "ml-model-artifacts-${var.project_id}"
  location = var.region

  versioning {
    enabled = true  # Model version history
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      num_newer_versions = 10  # Keep last 10 versions
    }
  }
}
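
The `num_newer_versions = 10` condition is easy to misread, so here is a small sketch of how it selects versions for deletion: a version matches once at least that many newer versions of the same object exist, counting the live one. The function `versions_to_delete` and the generation numbers are hypothetical; this models the rule only and does not call any Google Cloud API.

```python
def versions_to_delete(generations: list, num_newer_versions: int) -> list:
    """Return the generations a num_newer_versions rule would delete.

    generations holds every version of one object; the highest number is
    the live version and the rest are noncurrent.
    """
    ordered = sorted(generations)
    doomed = []
    for index, generation in enumerate(ordered):
        newer = len(ordered) - index - 1  # versions newer than this one
        if newer >= num_newer_versions:
            doomed.append(generation)
    return doomed


# Twelve uploads of the same model artifact, generation 12 being live.
all_generations = list(range(1, 13))
deleted = versions_to_delete(all_generations, num_newer_versions=10)
kept = [g for g in all_generations if g not in deleted]
print(deleted)    # [1, 2]
print(len(kept))  # 10
```

With twelve versions, the two oldest are removed and ten remain (the live version plus nine noncurrent ones), which matches the "keep last 10 versions" comment.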

GitOps Workflow for Model Updates

# How a model update flows through GitOps
model_update_workflow:
  1_data_scientist:
    action: "Trains new model, registers in MLflow"
    output: "Model version v2.3.2 in registry"

  2_ml_engineer:
    action: "Updates the production values file with the new model version"
    file: "models/fraud-detection/values-production.yaml"
    change: "model.version: v2.3.1 → v2.3.2"

  3_pull_request:
    action: "Opens PR with model version bump"
    ci_checks:
      - "Model exists in registry"
      - "Model passed offline tests"
      - "Staging deployment successful"

  4_approval:
    action: "Tech lead approves PR"
    requirements:
      - "CI checks passed"
      - "Model metrics validated"

  5_argocd_sync:
    action: "ArgoCD detects change, syncs cluster"
    rollout: "Canary → full rollout"

  6_observability:
    action: "Monitor dashboards for issues"
    rollback: "Revert Git commit if needed"
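
The "Model exists in registry" check in step 3 can be sketched as a small CI gate. The version below assumes tags follow a `vMAJOR.MINOR.PATCH` pattern and that the registry can be represented as a set of tags. Both are assumptions, as are the function names, and no real MLflow API is called. The "must be newer" rule is also an assumption; a rollback PR that reverts to an older version would need an exemption from it.

```python
import re

VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> tuple:
    """Turn 'v2.3.1' into (2, 3, 1) so versions compare numerically."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def check_version_bump(current: str, proposed: str, registered: set) -> list:
    """Return a list of problems; an empty list means the PR may proceed."""
    errors = []
    if proposed not in registered:
        errors.append(f"{proposed} is not in the model registry")
    if parse_version(proposed) <= parse_version(current):
        errors.append(f"{proposed} is not newer than {current}")
    return errors


# Hypothetical registry contents for the fraud-detection model.
registry = {"v2.3.0", "v2.3.1", "v2.3.2"}
print(check_version_bump("v2.3.1", "v2.3.2", registry))  # []
print(check_version_bump("v2.3.1", "v2.4.0", registry))
```

Comparing parsed tuples rather than raw strings matters once a component reaches two digits: as strings, "v2.10.0" sorts before "v2.9.0".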

Benefits to Mention in Interviews

gitops_benefits = {
    "auditability": "Every change tracked in Git history",
    "reproducibility": "Can recreate any environment from Git",
    "rollback": "git revert restores the previous desired state; ArgoCD re-syncs the cluster",
    "collaboration": "Code review process for infrastructure",
    "consistency": "Same process for all environments",
    "security": "Cluster credentials stay in ArgoCD, not CI"
}

Pro Tip: "We use ArgoCD's ApplicationSet to manage multiple models. One template generates an Application per model, so a version bump in Git is deployed automatically for each of them."
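
The templating idea behind the tip can be illustrated in a few lines: take a list of model directories and stamp out one Application manifest per directory. This is only a Python sketch of the concept, with a hypothetical `render_applications` function; in practice the ApplicationSet controller does this inside the cluster, and the field values below simply mirror the Application manifest shown earlier.

```python
def render_applications(model_paths: list, repo_url: str, revision: str = "main") -> list:
    """Build one ArgoCD Application manifest (as a dict) per model directory."""
    apps = []
    for path in sorted(model_paths):
        model_name = path.rstrip("/").split("/")[-1]
        apps.append({
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"name": f"{model_name}-model", "namespace": "argocd"},
            "spec": {
                "project": "ml-production",
                "source": {
                    "repoURL": repo_url,
                    "targetRevision": revision,
                    "path": path,
                },
                "destination": {
                    "server": "https://kubernetes.default.svc",
                    "namespace": "ml-production",
                },
            },
        })
    return apps


apps = render_applications(
    ["models/recommendation", "models/fraud-detection"],
    repo_url="https://github.com/company/ml-deployments.git",
)
print([app["metadata"]["name"] for app in apps])
# ['fraud-detection-model', 'recommendation-model']
```

Adding a third model then means adding a directory, not writing another manifest by hand.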

Next module covers Behavioral & Negotiation interview questions.
