CI/CD for ML Systems
GitOps and Infrastructure as Code
GitOps brings Git workflows to infrastructure and ML deployments. Interviewers test your knowledge of ArgoCD, Terraform, and declarative ML infrastructure.
GitOps Principles for ML
| Principle | Application to ML |
|---|---|
| Declarative | Model deployments defined in YAML |
| Versioned | All config in Git, auditable |
| Automated | ArgoCD syncs desired state |
| Observable | Sync and health status visible in ArgoCD; drift from Git is flagged |
ArgoCD for ML Deployments
# argocd/applications/model-serving.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: fraud-detection-model
  namespace: argocd
spec:
  project: ml-production
  source:
    repoURL: https://github.com/company/ml-deployments.git
    targetRevision: main
    path: models/fraud-detection
    helm:
      valueFiles:
        - values-production.yaml
      parameters:
        - name: model.version
          value: "v2.3.1"
        - name: model.replicas
          value: "5"
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
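To make `prune` and `selfHeal` concrete, here is a minimal Python sketch of the reconciliation idea behind automated sync. It is not ArgoCD's implementation: the `reconcile` function, the `last_synced` argument, and the dict-based state model are hypothetical simplifications (a real controller diffs full Kubernetes objects and only prunes resources it tracks).

```python
from typing import Dict, List, Tuple

Manifest = Dict[str, str]  # resource name -> spec (simplified to a string)


def reconcile(desired: Manifest, live: Manifest, last_synced: Manifest,
              prune: bool = True, self_heal: bool = True) -> List[Tuple[str, str]]:
    """Return the actions needed to move `live` toward `desired`.

    Hypothetical model of automated sync:
    - a change in Git (desired != last_synced) always triggers a sync
    - drift in the cluster alone triggers a sync only if self_heal is on
    - resources missing from Git are deleted only if prune is on
    """
    git_changed = desired != last_synced
    if not git_changed and not self_heal:
        return []  # drift would be reported, but not corrected

    actions: List[Tuple[str, str]] = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        for name in sorted(live.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


# Git now says v2.3.2; the cluster still runs v2.3.1 plus a stale job.
desired = {"deployment": "v2.3.2", "service": "v1"}
live = {"deployment": "v2.3.1", "service": "v1", "old-job": "v0"}
print(reconcile(desired, live, last_synced=live))
# [('update', 'deployment'), ('delete', 'old-job')]
```

With `self_heal=False`, a manual edit in the cluster is left alone until the next Git change, which is why interview answers usually pair `selfHeal: true` with a rule that all changes go through pull requests.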
Interview Question: GitOps for ML
Question: "How would you implement GitOps for a model serving platform?"
Answer Structure:
# Repository structure for GitOps ML
ml-deployments/
├── base/                          # Shared resources
│   ├── kustomization.yaml
│   ├── model-serving-deployment.yaml
│   ├── model-serving-service.yaml
│   └── model-serving-hpa.yaml
│
├── models/
│   ├── fraud-detection/
│   │   ├── kustomization.yaml
│   │   ├── values-staging.yaml
│   │   └── values-production.yaml
│   │
│   └── recommendation/
│       ├── kustomization.yaml
│       ├── values-staging.yaml
│       └── values-production.yaml
│
├── infrastructure/
│   ├── terraform/
│   │   ├── gke-cluster.tf
│   │   ├── mlflow-server.tf
│   │   └── feature-store.tf
│   │
│   └── crossplane/
│       ├── database.yaml
│       └── redis.yaml
│
└── argocd/
    ├── app-of-apps.yaml
    └── applications/
        ├── fraud-detection.yaml
        └── recommendation.yaml
Terraform for ML Infrastructure
# terraform/ml-platform.tf

# GKE cluster with GPU node pool
resource "google_container_cluster" "ml_cluster" {
  name     = "ml-production"
  location = var.region

  # Node pools are managed separately below, so create the smallest
  # possible default pool and remove it right away
  remove_default_node_pool = true
  initial_node_count       = 1

  # Enable features needed for ML
  addons_config {
    gce_persistent_disk_csi_driver_config {
      enabled = true # For model storage
    }
  }

  # Autoscaling cluster (node auto-provisioning limits)
  cluster_autoscaling {
    enabled = true

    resource_limits {
      resource_type = "cpu"
      maximum       = 1000
    }
    resource_limits {
      resource_type = "memory"
      maximum       = 4000 # GB
    }
    resource_limits {
      resource_type = "nvidia-tesla-t4"
      maximum       = 50
    }
  }
}

# GPU node pool for inference
resource "google_container_node_pool" "gpu_pool" {
  name     = "gpu-inference"
  cluster  = google_container_cluster.ml_cluster.name
  location = var.region

  # Counts are per zone when location is a region
  autoscaling {
    min_node_count = 1
    max_node_count = 20
  }

  node_config {
    machine_type = "n1-standard-8"

    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
    }

    # Security hardening: disable the legacy metadata endpoints
    metadata = {
      disable-legacy-endpoints = "true"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

# PostgreSQL backend store for the MLflow tracking server
resource "google_sql_database_instance" "mlflow_db" {
  name             = "mlflow-tracking"
  database_version = "POSTGRES_15"
  region           = var.region

  settings {
    tier = "db-custom-2-7680"

    backup_configuration {
      enabled = true
    }
  }
}

# Artifact storage for models
resource "google_storage_bucket" "model_artifacts" {
  name     = "ml-model-artifacts-${var.project_id}"
  location = var.region

  versioning {
    enabled = true # Model version history
  }

  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      num_newer_versions = 10 # Keep last 10 versions
    }
  }
}
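The `num_newer_versions = 10` condition is easy to misread, so here is a small worked example in Python. It only simulates the rule as I understand it (a version becomes deletable once at least N newer versions of the same object exist, with the live version counted as one of them); `versions_to_delete` is a hypothetical helper, not part of any Google Cloud SDK.

```python
from typing import List


def versions_to_delete(versions: List[str], num_newer_versions: int) -> List[str]:
    """Return versions (ordered oldest to newest) that have at least
    `num_newer_versions` newer versions and so match the Delete rule."""
    deletable = []
    for index, version in enumerate(versions):
        newer = len(versions) - index - 1  # how many versions come after this one
        if newer >= num_newer_versions:
            deletable.append(version)
    return deletable


# Twelve uploads of the same model artifact, oldest first.
uploads = [f"gen-{i}" for i in range(1, 13)]
doomed = versions_to_delete(uploads, num_newer_versions=10)
kept = [v for v in uploads if v not in doomed]

print(doomed)     # ['gen-1', 'gen-2']
print(len(kept))  # 10
```

Under this reading, the bucket retains the live version plus nine noncurrent ones, which is worth stating precisely if an interviewer asks how far back a rollback can reach.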
GitOps Workflow for Model Updates
# How a model update flows through GitOps
model_update_workflow:
  1_data_scientist:
    action: "Trains new model, registers in MLflow"
    output: "Model version v2.3.2 in registry"

  2_ml_engineer:
    action: "Updates values-production.yaml with new model version"
    file: "models/fraud-detection/values-production.yaml"
    change: "model.version: v2.3.1 → v2.3.2"

  3_pull_request:
    action: "Opens PR with model version bump"
    ci_checks:
      - "Model exists in registry"
      - "Model passed offline tests"
      - "Staging deployment successful"

  4_approval:
    action: "Tech lead approves PR"
    requirements:
      - "CI checks passed"
      - "Model metrics validated"

  5_argocd_sync:
    action: "ArgoCD detects change, syncs cluster"
    rollout: "Canary → full rollout (needs a progressive-delivery controller such as Argo Rollouts)"

  6_observability:
    action: "Monitor dashboards for issues"
    rollback: "Revert Git commit if needed"
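The first CI check in the pull request step ("Model exists in registry") could look roughly like the sketch below. Everything here is an assumption for illustration: the `vMAJOR.MINOR.PATCH` tag format is inferred from the example versions, and `registered` stands in for whatever a real registry lookup (for example against MLflow) would return.

```python
import re
from typing import Iterable, List, Tuple

# Assumed tag format, inferred from "v2.3.1" / "v2.3.2" in the workflow.
SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag: str) -> Tuple[int, int, int]:
    """Parse a tag such as 'v2.3.2' into (2, 3, 2); raise ValueError otherwise."""
    match = SEMVER.match(tag)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {tag!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch)


def check_version_bump(old: str, new: str, registered: Iterable[str]) -> List[str]:
    """Return failure messages; an empty list means the check passes.

    `registered` is a hypothetical stand-in for the versions the model
    registry reports for this model.
    """
    failures: List[str] = []
    try:
        parse_version(new)
    except ValueError as err:
        failures.append(str(err))
    if new == old:
        failures.append("model.version is unchanged")
    if new not in set(registered):
        failures.append(f"{new} is not in the model registry")
    return failures


registry = {"v2.3.1", "v2.3.2"}
print(check_version_bump("v2.3.1", "v2.3.2", registry))  # []
print(check_version_bump("v2.3.1", "v2.3.3", registry))
# ['v2.3.3 is not in the model registry']
```

The check deliberately does not require the new version to be higher than the old one, so a `git revert` back to v2.3.1 still passes CI.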
Benefits to Mention in Interviews
gitops_benefits = {
    "auditability": "Every change tracked in Git history",
    "reproducibility": "Can recreate any environment from Git",
    "rollback": "git revert restores the previous state on the next ArgoCD sync",
    "collaboration": "Code review process for infrastructure",
    "consistency": "Same process for all environments",
    "security": "Cluster credentials stay in ArgoCD, not CI",
}
Pro Tip: "We use ArgoCD's ApplicationSet to manage multiple models. One template generates an Application for every model, so onboarding a new model only takes a new directory in Git."
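The "one template, many Applications" idea can be sketched in Python. This is not the ApplicationSet API, which does the templating inside the cluster; `render_application` and `render_all` are hypothetical helpers that reuse the repository URL, project, and namespaces from the Application manifest earlier in this lesson.

```python
from typing import Dict, List

REPO_URL = "https://github.com/company/ml-deployments.git"


def render_application(model: str) -> Dict:
    """Fill a single Application template for one model directory."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{model}-model", "namespace": "argocd"},
        "spec": {
            "project": "ml-production",
            "source": {
                "repoURL": REPO_URL,
                "targetRevision": "main",
                "path": f"models/{model}",
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": "ml-production",
            },
        },
    }


def render_all(model_dirs: List[str]) -> List[Dict]:
    """One Application per model directory, in a stable order."""
    return [render_application(name) for name in sorted(model_dirs)]


apps = render_all(["recommendation", "fraud-detection"])
print([app["metadata"]["name"] for app in apps])
# ['fraud-detection-model', 'recommendation-model']
```

Adding a third directory under `models/` would yield a third Application with no other change, which is the property the tip is describing.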
Next module covers Behavioral & Negotiation interview questions.