Production Operations & GitOps

CI/CD for ML Model Deployment

ML-specific CI/CD pipelines extend traditional software delivery with model validation, performance testing, and automated canary deployments. This lesson covers GitHub Actions and Tekton pipelines for ML workflows.

ML CI/CD Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    ML CI/CD Pipeline                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐         │
│  │   Code   │──→│  Build   │──→│   Test   │──→│   Scan   │         │
│  │   Push   │   │  Image   │   │  Model   │   │ Security │         │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘         │
│                                      │                               │
│                                      ↓                               │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐         │
│  │  Prod    │←──│  Canary  │←──│  Stage   │←──│ Registry │         │
│  │  Deploy  │   │  Deploy  │   │  Test    │   │  Push    │         │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘         │
│       │              │                                               │
│       └──────────────┼───────────────────────────────────→          │
│                      │          Monitor & Rollback                   │
│                      ↓                                               │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │              Observability (Metrics, Logs, Traces)           │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

GitHub Actions for ML

# .github/workflows/ml-deploy.yaml
name: ML Model Deployment

on:
  push:
    branches: [main]
    paths:
      - 'models/**'
      - 'inference/**'
  pull_request:
    branches: [main]

env:
  REGISTRY: gcr.io
  PROJECT_ID: ml-production
  CLUSTER_NAME: ml-cluster
  CLUSTER_ZONE: us-central1-a

jobs:
  test-model:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.11'

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pytest pytest-benchmark

    - name: Run model unit tests
      run: pytest tests/unit/ -v

    - name: Run model performance tests
      run: |
        pytest tests/performance/ --benchmark-json=benchmark.json

    - name: Check performance regression
      run: |
        python scripts/check_performance.py benchmark.json \
          --baseline benchmarks/baseline.json \
          --threshold 0.1  # Max 10% regression

  build-and-push:
    needs: test-model
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
    - uses: actions/checkout@v4

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Login to GCR
      uses: docker/login-action@v3
      with:
        registry: gcr.io
        username: _json_key
        password: ${{ secrets.GCP_SA_KEY }}

    - name: Extract metadata
      id: meta
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.PROJECT_ID }}/inference
        # Emit a single immutable tag (full commit SHA) so downstream jobs get one image ref
        tags: |
          type=sha,prefix=,format=long

    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: .
        push: true
        tags: ${{ steps.meta.outputs.tags }}
        cache-from: type=gha
        cache-to: type=gha,mode=max

  security-scan:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Login to GCR
      uses: docker/login-action@v3
      with:
        registry: gcr.io
        username: _json_key
        password: ${{ secrets.GCP_SA_KEY }}

    - name: Scan image for vulnerabilities
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: ${{ needs.build-and-push.outputs.image-tag }}
        format: 'sarif'
        output: 'trivy-results.sarif'
        severity: 'HIGH,CRITICAL'

    - name: Upload scan results
      uses: github/codeql-action/upload-sarif@v3
      with:
        sarif_file: 'trivy-results.sarif'

  deploy-staging:
    needs: [build-and-push, security-scan]
    runs-on: ubuntu-latest
    environment: staging
    steps:
    - uses: actions/checkout@v4

    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v2
      with:
        credentials_json: ${{ secrets.GCP_SA_KEY }}

    - name: Set up gcloud
      uses: google-github-actions/setup-gcloud@v2
      with:
        install_components: gke-gcloud-auth-plugin

    - name: Get GKE credentials
      run: |
        gcloud container clusters get-credentials ${{ env.CLUSTER_NAME }} \
          --zone ${{ env.CLUSTER_ZONE }}

    - name: Deploy to staging
      run: |
        kubectl set image deployment/inference-staging \
          inference=${{ needs.build-and-push.outputs.image-tag }} \
          -n ml-staging

    - name: Wait for rollout
      run: |
        kubectl rollout status deployment/inference-staging \
          -n ml-staging --timeout=300s

    - name: Run integration tests
      run: |
        python scripts/integration_tests.py \
          --endpoint https://staging.inference.example.com \
          --test-data tests/fixtures/integration.json

  deploy-canary:
    needs: [deploy-staging, build-and-push]
    runs-on: ubuntu-latest
    environment: production
    steps:
    - uses: actions/checkout@v4

    - name: Authenticate to Google Cloud
      uses: google-github-actions/auth@v2
      with:
        credentials_json: ${{ secrets.GCP_SA_KEY }}

    - name: Set up gcloud
      uses: google-github-actions/setup-gcloud@v2
      with:
        install_components: gke-gcloud-auth-plugin

    - name: Get GKE credentials
      run: |
        gcloud container clusters get-credentials ${{ env.CLUSTER_NAME }} \
          --zone ${{ env.CLUSTER_ZONE }}

    - name: Deploy canary (10% traffic)
      run: |
        # Update canary deployment
        kubectl set image deployment/inference-canary \
          inference=${{ needs.build-and-push.outputs.image-tag }} \
          -n ml-serving

        # Shift 10% of traffic to the canary (route weights must sum to 100)
        kubectl patch virtualservice inference-vs -n ml-serving \
          --type=json \
          -p='[{"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 90}, {"op": "replace", "path": "/spec/http/0/route/1/weight", "value": 10}]'

    - name: Monitor canary metrics
      id: canary-monitor
      # canary_monitor.py exits non-zero if the error-rate or latency thresholds are breached
      continue-on-error: true
      run: |
        python scripts/canary_monitor.py \
          --duration 600 \
          --error-threshold 0.01 \
          --latency-threshold-p99 2.0

    - name: Promote or rollback
      run: |
        if [ "${{ steps.canary-monitor.outcome }}" == "success" ]; then
          # Promote: send all traffic to the canary route
          kubectl patch virtualservice inference-vs -n ml-serving \
            --type=json \
            -p='[{"op": "replace", "path": "/spec/http/0/route/0/weight", "value": 0}, {"op": "replace", "path": "/spec/http/0/route/1/weight", "value": 100}]'
        else
          # Rollback the canary deployment and fail the job
          kubectl rollout undo deployment/inference-canary -n ml-serving
          exit 1
        fi
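
The canary job depends on scripts/canary_monitor.py, which is not shown in this lesson. The sketch below is one possible implementation, assuming the requests library, a Prometheus endpoint at http://prometheus.example.com:9090, and metric names (inference_requests_total, inference_errors_total, inference_latency_seconds_bucket) that match the queries used elsewhere in this module. It exits non-zero when a threshold is breached, so the failed step triggers the rollback branch above.

# scripts/canary_monitor.py -- illustrative sketch; endpoint and metric names are assumptions
import argparse
import sys
import time

import requests

PROMETHEUS_URL = "http://prometheus.example.com:9090"  # assumption: adjust for your cluster


def query(promql: str) -> float:
    """Run an instant PromQL query and return the first sample value (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=600, help="soak time in seconds")
    parser.add_argument("--error-threshold", type=float, default=0.01)
    parser.add_argument("--latency-threshold-p99", type=float, default=2.0, help="seconds")
    parser.add_argument("--interval", type=int, default=60, help="seconds between checks")
    args = parser.parse_args()

    deadline = time.time() + args.duration
    while time.time() < deadline:
        # Canary error rate over the last 5 minutes
        error_rate = query(
            'sum(rate(inference_errors_total{deployment="canary"}[5m])) / '
            'sum(rate(inference_requests_total{deployment="canary"}[5m]))'
        )
        # Canary p99 latency from a Prometheus histogram
        p99 = query(
            'histogram_quantile(0.99, sum(rate(inference_latency_seconds_bucket'
            '{deployment="canary"}[5m])) by (le))'
        )
        print(f"canary error_rate={error_rate:.4f} p99={p99:.3f}s")
        if error_rate > args.error_threshold or p99 > args.latency_threshold_p99:
            print("Canary thresholds breached")
            return 1  # non-zero exit fails the step and routes the workflow to rollback
        time.sleep(args.interval)
    return 0


if __name__ == "__main__":
    sys.exit(main())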

Tekton Pipeline for ML

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: ml-deployment-pipeline
spec:
  params:
  - name: git-url
    type: string
  - name: git-revision
    type: string
    default: main
  - name: image-name
    type: string

  workspaces:
  - name: shared-workspace
  - name: docker-credentials

  tasks:
  - name: fetch-source
    taskRef:
      name: git-clone
    params:
    - name: url
      value: $(params.git-url)
    - name: revision
      value: $(params.git-revision)
    workspaces:
    - name: output
      workspace: shared-workspace

  - name: run-tests
    runAfter: [fetch-source]
    taskSpec:
      workspaces:
      - name: source
      steps:
      - name: test
        image: python:3.11
        script: |
          cd $(workspaces.source.path)
          pip install -r requirements.txt
          pytest tests/ -v --junitxml=test-results.xml
    workspaces:
    - name: source
      workspace: shared-workspace

  - name: validate-model
    runAfter: [run-tests]
    taskSpec:
      workspaces:
      - name: source
      steps:
      - name: validate
        image: python:3.11
        script: |
          cd $(workspaces.source.path)
          python scripts/validate_model.py \
            --model-path models/latest \
            --validation-data data/validation.csv \
            --min-accuracy 0.95
    workspaces:
    - name: source
      workspace: shared-workspace

  - name: build-image
    runAfter: [validate-model]
    taskRef:
      name: kaniko
    params:
    - name: IMAGE
      value: $(params.image-name)
    workspaces:
    - name: source
      workspace: shared-workspace
    - name: dockerconfig
      workspace: docker-credentials

  - name: deploy-canary
    runAfter: [build-image]
    taskRef:
      name: kubernetes-actions
    params:
    - name: script
      value: |
        kubectl set image deployment/inference-canary \
          inference=$(params.image-name) -n ml-serving
        kubectl rollout status deployment/inference-canary \
          -n ml-serving --timeout=300s

  - name: run-canary-analysis
    runAfter: [deploy-canary]
    taskSpec:
      steps:
      - name: analyze
        # alpine with curl + jq so the Prometheus JSON response can be parsed
        image: alpine:3.19
        script: |
          #!/bin/sh
          apk add --no-cache curl jq

          # Query Prometheus for the current error rate of canary and stable
          CANARY_ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query \
            --data-urlencode "query=sum(rate(inference_errors_total{deployment='canary'}[10m]))" \
            | jq -r '.data.result[0].value[1] // "0"')
          STABLE_ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query \
            --data-urlencode "query=sum(rate(inference_errors_total{deployment='stable'}[10m]))" \
            | jq -r '.data.result[0].value[1] // "0"')

          # Compare the floating-point rates and fail the task if the canary is worse
          if awk -v c="$CANARY_ERROR_RATE" -v s="$STABLE_ERROR_RATE" 'BEGIN { exit !(c > s) }'; then
            echo "Canary error rate ($CANARY_ERROR_RATE) exceeds stable ($STABLE_ERROR_RATE), rolling back"
            exit 1
          fi
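
To execute the pipeline, create a PipelineRun that supplies the parameters and binds both workspaces. The sketch below is illustrative: the repository URL, image name, PVC name, and registry-credentials secret are placeholders to replace with your own. Because it uses generateName, create it with kubectl create -f rather than apply.

# Illustrative PipelineRun for ml-deployment-pipeline (all values are placeholders)
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: ml-deployment-run-
spec:
  pipelineRef:
    name: ml-deployment-pipeline
  params:
  - name: git-url
    value: https://github.com/example/ml-inference.git
  - name: git-revision
    value: main
  - name: image-name
    value: gcr.io/ml-production/inference:v1
  workspaces:
  - name: shared-workspace
    persistentVolumeClaim:
      claimName: ml-pipeline-pvc
  - name: docker-credentials
    secret:
      secretName: gcr-credentials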

Model Validation Gate

apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: model-validation-gate
spec:
  params:
  - name: model-uri
    type: string
  - name: min-accuracy
    type: string
    default: "0.95"
  - name: max-latency-ms
    type: string
    default: "100"
  steps:
  - name: download-model
    image: amazon/aws-cli
    script: |
      aws s3 cp $(params.model-uri) /workspace/model

  - name: validate-accuracy
    image: python:3.11
    script: |
      pip install scikit-learn numpy
      python << 'EOF'
      import pickle
      import sys
      from sklearn.metrics import accuracy_score

      with open('/workspace/model', 'rb') as f:
          model = pickle.load(f)

      # Load validation data (placeholder: wire in your project's data loader)
      X_val, y_val = load_validation_data()
      predictions = model.predict(X_val)
      accuracy = accuracy_score(y_val, predictions)

      if accuracy < float("$(params.min-accuracy)"):
          print(f"Model accuracy {accuracy} below threshold")
          sys.exit(1)
      EOF

  - name: validate-latency
    image: python:3.11
    script: |
      # scikit-learn is needed to unpickle the model
      pip install scikit-learn numpy
      python << 'EOF'
      import time
      import pickle
      import sys

      with open('/workspace/model', 'rb') as f:
          model = pickle.load(f)

      # Measure single-request inference latency in milliseconds
      latencies = []
      for _ in range(100):
          start = time.time()
          model.predict([[1, 2, 3, 4]])
          latencies.append((time.time() - start) * 1000)

      # 99th percentile of 100 samples (index 98 of the sorted list)
      p99_latency = sorted(latencies)[98]
      if p99_latency > float("$(params.max-latency-ms)"):
          print(f"P99 latency {p99_latency}ms exceeds threshold")
          sys.exit(1)
      EOF
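
The validation gate can also be exercised on its own with a TaskRun. The model URI and thresholds below are illustrative placeholders, and the download step assumes AWS credentials are already available to the pod (for example via a mounted secret or an IAM role).

# Illustrative TaskRun for the model-validation-gate Task (values are placeholders)
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  generateName: model-validation-gate-run-
spec:
  taskRef:
    name: model-validation-gate
  params:
  - name: model-uri
    value: s3://ml-models/fraud-detector/v42/model.pkl
  - name: min-accuracy
    value: "0.97"
  - name: max-latency-ms
    value: "50"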

Automated Rollback

# Argo Rollouts with automatic rollback
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: inference-rollout
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: success-rate
          - templateName: latency-check
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 50
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
          args:
          - name: canary-hash
            valueFrom:
              podTemplateHashValue: Latest
      - setWeight: 100
      canaryService: inference-canary
      stableService: inference-stable
      # A failed analysis run aborts the rollout automatically; keep the aborted
      # canary ReplicaSet around for 30s before scaling it down
      abortScaleDownDelaySeconds: 30
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: canary-hash
  metrics:
  - name: success-rate
    successCondition: result[0] >= 0.99
    failureCondition: result[0] < 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(inference_success_total{rollouts_pod_template_hash="{{args.canary-hash}}"}[5m])) /
          sum(rate(inference_requests_total{rollouts_pod_template_hash="{{args.canary-hash}}"}[5m]))
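
The canary steps also reference a latency-check template that is not defined in this lesson. A minimal sketch is shown below, assuming the same Prometheus address and a histogram metric named inference_latency_seconds_bucket; adjust the threshold to your own SLO.

# Sketch of the latency-check AnalysisTemplate (metric name and threshold are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-check
spec:
  args:
  - name: canary-hash
  metrics:
  - name: p99-latency
    # Fail the analysis if canary p99 latency exceeds 2 seconds
    successCondition: result[0] <= 2.0
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          histogram_quantile(0.99,
            sum(rate(inference_latency_seconds_bucket{rollouts_pod_template_hash="{{args.canary-hash}}"}[5m])) by (le))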

Congratulations! You've completed the Kubernetes for AI/ML course. You now have the knowledge to deploy, scale, and operate production ML workloads on Kubernetes.
