Kubeflow: The ML Platform for Kubernetes

Kubeflow is a composable set of tools for running end-to-end ML workflows on Kubernetes. It is a CNCF incubating project, and with the community preparing for CNCF graduation, it is a common choice for production ML pipelines.

Kubeflow Ecosystem

Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Kubeflow Platform                             │
├─────────────────────────────────────────────────────────────────┤
│  User Interface                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Central Dashboard │ Notebooks UI │ Pipelines UI │ Katib UI ││
│  └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│  ML Components                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│  │ Notebooks│ │ Pipelines│ │  Katib   │ │ Training │          │
│  │(Jupyter) │ │  (KFP)   │ │(AutoML)  │ │ Operators│          │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│  │  KServe  │ │  Feast   │ │  Model   │                        │
│  │(Serving) │ │(Features)│ │ Registry │                        │
│  └──────────┘ └──────────┘ └──────────┘                        │
├─────────────────────────────────────────────────────────────────┤
│  Platform Services                                               │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Istio (Service Mesh) │ Dex (Auth) │ MinIO (Storage)        ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Key Components

Component	Purpose	Use Case
Notebooks	Interactive development	Data exploration, prototyping
Pipelines	Workflow orchestration	End-to-end ML pipelines
Katib	Hyperparameter tuning	AutoML, NAS
Training Operators	Distributed training	PyTorch, TensorFlow, MPI
KServe	Model serving	Inference endpoints
Feast	Feature store	Feature management
Model Registry	Model metadata and versioning	Tracking model versions and artifacts

Installing Kubeflow

Standalone Installation

# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install Kubeflow (full installation)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Verify installation
kubectl get pods -n kubeflow
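
The install loop above retries forever, and kubectl get pods only shows a snapshot. A minimal sketch of a bounded retry and a readiness count follows. Both retry and count_unready are hypothetical helper names, and the readiness rule (every container in the READY column is up, or the pod is Completed) is an assumption about what "healthy" means on your cluster.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 1 if every attempt fails.
retry() {
  max=$1; delay=$2; shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s..." >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# count_unready: read `kubectl get pods --no-headers` output on stdin and
# print how many pods are neither Completed nor fully ready (READY x/y, x == y).
count_unready() {
  awk '
    $3 == "Completed" { next }
    { split($2, ready, "/") }
    $3 != "Running" || ready[1] != ready[2] { n++ }
    END { print n + 0 }
  '
}
```

Because the install command is a pipeline, it has to be wrapped in a shell to be passed as one command: retry 30 10 sh -c 'kustomize build example | kubectl apply -f -'. Afterwards, kubectl get pods -n kubeflow --no-headers | count_unready should print 0 once everything has settled.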

deployKF (Recommended for Production)

# deployKF provides GitOps-based Kubeflow deployment
# Uses ArgoCD for lifecycle management

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Apply deployKF generator
kubectl apply -f https://raw.githubusercontent.com/deployKF/deployKF/main/argocd-application.yaml

Cloud Provider Options

Provider	Service	Features
AWS	Amazon SageMaker on EKS	Managed notebooks, pipelines
GCP	Vertex AI Pipelines	Managed service that runs pipelines built with the KFP SDK
Azure	Azure ML on AKS	Managed Kubeflow

Kubeflow Central Dashboard

Accessing the Dashboard

# Port-forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Access at http://localhost:8080
# Default credentials from the example manifests: user@example.com / 12341234
# (change them before exposing the dashboard to other users)

Multi-Tenancy with Profiles

# Create user profile (namespace + RBAC)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-alpha
spec:
  owner:
    kind: User
    name: alice@example.com
  resourceQuotaSpec:
    hard:
      cpu: "32"
      memory: "128Gi"
      # Quotas on extended resources such as GPUs require the "requests." prefix
      requests.nvidia.com/gpu: "8"
      persistentvolumeclaims: "10"

# Apply profile
kubectl apply -f ml-team-alpha-profile.yaml

# View profiles
kubectl get profiles
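
When onboarding several teams, the same Profile manifest gets repeated with different names and quotas. One way to template it from the shell is sketched below. render_profile is a hypothetical helper, and the quota fields follow the Profile above rather than any recommended defaults.

```shell
# render_profile NAME OWNER_EMAIL CPU MEMORY GPUS
# Print a Kubeflow Profile manifest to stdout; nothing is applied here.
render_profile() {
  cat <<EOF
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: $1
spec:
  owner:
    kind: User
    name: $2
  resourceQuotaSpec:
    hard:
      cpu: "$3"
      memory: "$4"
      requests.nvidia.com/gpu: "$5"
EOF
}
```

A second team could then be created with render_profile ml-team-beta bob@example.com 16 64Gi 4 | kubectl apply -f - (the team name and owner here are placeholders).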

Kubeflow Notebooks

Creating a Notebook Server

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: pytorch-notebook
  namespace: ml-team-alpha
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: workspace
          mountPath: /home/jovyan
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: pytorch-notebook-pvc
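
The Notebook above mounts a claim named pytorch-notebook-pvc, which has to exist in the namespace before the pod can start. A minimal sketch of creating it follows. The 20Gi size in the usage line and the ReadWriteOnce access mode are assumptions, render_pvc is a hypothetical helper, and the cluster's default StorageClass is used because none is named.

```shell
# render_pvc NAME NAMESPACE SIZE
# Print a PersistentVolumeClaim manifest to stdout; nothing is applied here.
render_pvc() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
  namespace: $2
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}
```

For the notebook above this would be render_pvc pytorch-notebook-pvc ml-team-alpha 20Gi | kubectl apply -f -, run before applying the Notebook manifest.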

Pre-built Notebook Images

Image	Framework	GPU Support
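jupyter-pytorch-full	PyTorch	No (CPU only)
jupyter-pytorch-cuda-full	PyTorch	Yes (CUDA)
jupyter-tensorflow-full	TensorFlow	No (CPU only)
jupyter-tensorflow-cuda-full	TensorFlow	Yes (CUDA)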
jupyter-scipy	SciPy/Pandas	No
codeserver-python	VS Code	Optional

Training Operators

PyTorch Training Operator

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed
  namespace: ml-team-alpha
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            # train.py is assumed to be baked into the image (the stock image does not ship one)
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
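
With one Master and three Workers, this job runs four training processes in total. Assuming the training operator's usual job-role-index pod naming (worth confirming with kubectl get pods on your own cluster), the expected pod names can be listed as sketched below. pytorchjob_pods is a hypothetical helper.

```shell
# pytorchjob_pods JOB WORKERS
# Print the pod names expected for a PyTorchJob with one Master and
# WORKERS workers, assuming <job>-master-0 and <job>-worker-<i> naming.
pytorchjob_pods() {
  job=$1
  workers=$2
  echo "${job}-master-0"
  i=0
  while [ "$i" -lt "$workers" ]; do
    echo "${job}-worker-${i}"
    i=$((i + 1))
  done
}
```

Job status is available with kubectl get pytorchjob pytorch-distributed -n ml-team-alpha, and the master's output can be followed with kubectl logs -f -n ml-team-alpha "$(pytorchjob_pods pytorch-distributed 3 | head -n 1)".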

Other Training Operators

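# TensorFlow Training
# (skeleton: each replica spec also needs a pod template, omitted here;
# the training container in a TFJob must be named "tensorflow")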
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-distributed
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
    Worker:
      replicas: 2
    PS:
      replicas: 1
---
# MPI Job (for Horovod)
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 4
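
The TFJob skeleton above leaves out the pod templates that each replica type needs. A fuller sketch with one template per replica type could look like this. render_tf_replica, render_tfjob, and the image argument are hypothetical, train.py is assumed to exist in that image, and the container is named tensorflow because that is the name the TFJob controller looks for by default.

```shell
# render_tf_replica ROLE REPLICAS IMAGE
# Print one entry of tfReplicaSpecs, including its pod template.
render_tf_replica() {
  cat <<EOF
    $1:
      replicas: $2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow
            image: $3
            command: ["python", "train.py"]
EOF
}

# render_tfjob NAME NAMESPACE IMAGE WORKERS
# Print a complete TFJob manifest with one Chief, WORKERS workers, and one PS.
render_tfjob() {
  cat <<EOF
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: $1
  namespace: $2
spec:
  tfReplicaSpecs:
EOF
  render_tf_replica Chief 1 "$3"
  render_tf_replica Worker "$4" "$3"
  render_tf_replica PS 1 "$3"
}
```

Piping the result through kubectl apply --dry-run=client -f - is a cheap way to check the manifest before submitting it, for example render_tfjob tf-distributed ml-team-alpha registry.example.com/tf-train:latest 2 (the image name is a placeholder).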

Next, we'll dive deep into Kubeflow Pipelines for building reproducible ML workflows.