Kubeflow & ML Pipelines

Kubeflow: The ML Platform for Kubernetes


Kubeflow is a composable set of components for running end-to-end ML workflows on Kubernetes. It is a CNCF project whose community is working toward graduation, and it is becoming a common choice for production ML pipelines.

Kubeflow Ecosystem

Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Kubeflow Platform                             │
├─────────────────────────────────────────────────────────────────┤
│  User Interface                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Central Dashboard │ Notebooks UI │ Pipelines UI │ Katib UI ││
│  └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│  ML Components                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│  │ Notebooks│ │ Pipelines│ │  Katib   │ │ Training │          │
│  │(Jupyter) │ │  (KFP)   │ │(AutoML)  │ │ Operators│          │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│  │  KServe  │ │  Feast   │ │  Model   │                        │
│  │(Serving) │ │(Features)│ │ Registry │                        │
│  └──────────┘ └──────────┘ └──────────┘                        │
├─────────────────────────────────────────────────────────────────┤
│  Platform Services                                               │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Istio (Service Mesh) │ Dex (Auth) │ MinIO (Storage)        ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Key Components

| Component | Purpose | Use Case |
|---|---|---|
| Notebooks | Interactive development | Data exploration, prototyping |
| Pipelines | Workflow orchestration | End-to-end ML pipelines |
| Katib | Hyperparameter tuning | AutoML, NAS |
| Training Operators | Distributed training | PyTorch, TensorFlow, MPI |
| KServe | Model serving | Inference endpoints |
| Feast | Feature store | Feature management |

Installing Kubeflow

Standalone Installation

# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install Kubeflow (full installation)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Verify installation
kubectl get pods -n kubeflow
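
Reading the `kubectl get pods` output by eye gets tedious on a full installation, since components come up at different speeds. The following is a minimal Python sketch, not part of Kubeflow, that flags pods that are not yet ready. The function names `unready_pods` and `fetch_pods` and the sample pod names are hypothetical; only the `kubectl get pods -n kubeflow -o json` call reflects a real command.

```python
import json
import subprocess
from typing import List


def unready_pods(pod_list: dict) -> List[str]:
    """Return names of pods that are neither Succeeded nor fully Ready."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed one-shot pods are fine
        containers = status.get("containerStatuses", [])
        # No container statuses yet means the pod has not started.
        if not containers or not all(c.get("ready", False) for c in containers):
            pending.append(pod["metadata"]["name"])
    return pending


def fetch_pods(namespace: str = "kubeflow") -> dict:
    """Run kubectl and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Illustrative payload with hypothetical pod names,
# shaped like the output of `kubectl get pods -o json`.
sample = {
    "items": [
        {"metadata": {"name": "ml-pipeline-abc"},
         "status": {"phase": "Running", "containerStatuses": [{"ready": True}]}},
        {"metadata": {"name": "katib-controller-xyz"},
         "status": {"phase": "Pending"}},
    ]
}
print(unready_pods(sample))  # ['katib-controller-xyz']
```

Against a live cluster, `unready_pods(fetch_pods())` returns an empty list once every pod in the namespace is ready.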

deployKF Installation

# deployKF provides a GitOps-based Kubeflow deployment
# and uses Argo CD for lifecycle management

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Apply the deployKF generator (confirm this manifest path against the current deployKF documentation)
kubectl apply -f https://raw.githubusercontent.com/deployKF/deployKF/main/argocd-application.yaml

Cloud Provider Options

| Provider | Service | Features |
|---|---|---|
| AWS | Amazon SageMaker on EKS | Managed notebooks, pipelines |
| GCP | Vertex AI Pipelines | Native GKE integration |
| Azure | Azure ML on AKS | Managed Kubeflow |

Kubeflow Central Dashboard

Accessing the Dashboard

# Port-forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Access at http://localhost:8080
# Default credentials: user@example.com / 12341234 (change these before exposing the dashboard)

Multi-Tenancy with Profiles

# Create user profile (namespace + RBAC)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-alpha
spec:
  owner:
    kind: User
    name: alice@example.com
  resourceQuotaSpec:
    hard:
      cpu: "32"
      memory: "128Gi"
      # Quota on extended resources such as GPUs requires the "requests." prefix
      requests.nvidia.com/gpu: "8"
      persistentvolumeclaims: "10"

# Apply profile
kubectl apply -f ml-team-alpha-profile.yaml

# View profiles
kubectl get profiles
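
When several teams need profiles, writing each manifest by hand invites copy-paste drift. As a rough sketch, the script below generates Profile manifests from a small table and prints them as JSON, which `kubectl apply -f -` accepts. The helper `make_profile` and the second team (`ml-team-beta`, `bob@example.com`) are hypothetical; the field layout mirrors the Profile shown above.

```python
import json


def make_profile(name: str, owner_email: str, cpu: str, memory: str,
                 gpus: int, pvcs: int = 10) -> dict:
    """Build a Kubeflow Profile manifest as a plain dict."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            "resourceQuotaSpec": {
                "hard": {
                    "cpu": cpu,
                    "memory": memory,
                    # Extended resources are quota-limited via the "requests." prefix.
                    "requests.nvidia.com/gpu": str(gpus),
                    "persistentvolumeclaims": str(pvcs),
                }
            },
        },
    }


# Hypothetical team table: (profile name, owner, CPU, memory, GPUs).
teams = [
    ("ml-team-alpha", "alice@example.com", "32", "128Gi", 8),
    ("ml-team-beta", "bob@example.com", "16", "64Gi", 2),
]

# A v1 List lets kubectl apply all profiles in one call.
manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [make_profile(*team) for team in teams],
}
print(json.dumps(manifest, indent=2))
```

Saved under a name of your choosing (say `make_profiles.py`), the output can be piped straight into the cluster: `python make_profiles.py | kubectl apply -f -`.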

Kubeflow Notebooks

Creating a Notebook Server

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: pytorch-notebook
  namespace: ml-team-alpha
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: workspace
          mountPath: /home/jovyan
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: pytorch-notebook-pvc
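
The manifest above mounts a claim named `pytorch-notebook-pvc`. When a Notebook is applied as raw YAML, that claim has to exist already, otherwise the pod stays Pending. A small sketch of generating a matching claim follows; the helper name `make_workspace_pvc`, the 10Gi size, and the `ReadWriteOnce` access mode are assumptions rather than values from this lesson, and the claim relies on the cluster having a default StorageClass.

```python
import json


def make_workspace_pvc(name: str, namespace: str, size: str = "10Gi") -> dict:
    """Build a PersistentVolumeClaim manifest for a notebook workspace."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed: a single notebook pod mounts the volume read-write.
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


# Name and namespace match the Notebook spec; the size is an assumption.
pvc = make_workspace_pvc("pytorch-notebook-pvc", "ml-team-alpha")
print(json.dumps(pvc, indent=2))
```

Piping the printed JSON into `kubectl apply -f -` before creating the Notebook gives the pod a volume to bind. Note that each claim counts against the `persistentvolumeclaims` quota of the profile.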

Pre-built Notebook Images

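| Image | Framework | GPU Support |
|---|---|---|
| jupyter-pytorch-full | PyTorch | Yes, via the CUDA variant (jupyter-pytorch-cuda-full) |
| jupyter-tensorflow-full | TensorFlow | Yes, via the CUDA variant (jupyter-tensorflow-cuda-full) |
| jupyter-scipy | SciPy/Pandas | No |
| codeserver-python | VS Code | Optional |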

Training Operators

PyTorch Training Operator

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed
  namespace: ml-team-alpha
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
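            # Upstream tags spell out the full version; train.py must be baked into the image you actually use
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime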
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
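
The training operator wires the replicas together by injecting the standard PyTorch rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) into each pod. As a sketch of how a `train.py` might read them, the snippet below parses those variables into a small record. The `DistEnv` class, the `read_dist_env` helper, and the sample values are hypothetical; a real script would typically let `torch.distributed.init_process_group` pick up the same variables through its default `env://` initialization.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self) -> bool:
        # Rank 0 usually handles checkpointing and logging.
        return self.rank == 0


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse the rendezvous variables set for each PyTorchJob replica."""
    return DistEnv(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )


# Illustrative values for one worker in the job above:
# 1 master + 3 workers gives a world size of 4.
sample_env = {
    "MASTER_ADDR": "pytorch-distributed-master-0",
    "MASTER_PORT": "23456",
    "WORLD_SIZE": "4",
    "RANK": "2",
}
env = read_dist_env(sample_env)
print(env.world_size, env.rank, env.is_master)  # 4 2 False
```

Passing a mapping instead of reading `os.environ` directly keeps the parsing testable outside the cluster.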

Other Training Operators

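# TensorFlow Training
# Skeleton only: each replica type also needs a pod template
# (with a container named "tensorflow"), as in the PyTorchJob above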
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-distributed
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
    Worker:
      replicas: 2
    PS:
      replicas: 1
---
# MPI Job (for Horovod)
# Skeleton only: Launcher and Worker each need a pod template as well
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 4
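
For TFJob, the operator describes the cluster topology to each replica through the `TF_CONFIG` environment variable, a JSON document with a `cluster` map and a `task` entry. The sketch below shows one way a training script could read it. The helper `parse_tf_config`, the single-process fallback, and the sample host names are assumptions made for illustration; TensorFlow's distribution strategies normally consume `TF_CONFIG` themselves.

```python
import json
import os
from typing import Mapping, Tuple


def parse_tf_config(environ: Mapping[str, str] = os.environ) -> Tuple[str, int, dict]:
    """Return (task type, task index, cluster map) from TF_CONFIG."""
    raw = environ.get("TF_CONFIG")
    if not raw:
        # Assumed fallback: treat a local run as a single chief process.
        return "chief", 0, {}
    config = json.loads(raw)
    task = config.get("task", {})
    return task.get("type", "chief"), int(task.get("index", 0)), config.get("cluster", {})


# Illustrative TF_CONFIG for the second worker of the TFJob above
# (1 chief, 2 workers, 1 parameter server); host names are hypothetical.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "chief": ["tf-distributed-chief-0:2222"],
            "worker": ["tf-distributed-worker-0:2222", "tf-distributed-worker-1:2222"],
            "ps": ["tf-distributed-ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}
task_type, task_index, cluster = parse_tf_config(sample_env)
print(task_type, task_index, len(cluster["worker"]))  # worker 1 2
```

A script can use the task type to decide, for example, that only the chief writes checkpoints.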

Next, we'll dive deep into Kubeflow Pipelines for building reproducible ML workflows.
