Kubeflow & ML Pipelines

Kubeflow: The ML Platform for Kubernetes

Kubeflow provides a composable ecosystem of tools for end-to-end ML on Kubernetes. It is a CNCF incubating project, and with the community working toward CNCF graduation, it is increasingly a common choice for production ML pipelines.

Kubeflow Ecosystem

Component Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Kubeflow Platform                             │
├─────────────────────────────────────────────────────────────────┤
│  User Interface                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Central Dashboard │ Notebooks UI │ Pipelines UI │ Katib UI ││
│  └─────────────────────────────────────────────────────────────┘│
├─────────────────────────────────────────────────────────────────┤
│  ML Components                                                   │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐          │
│  │ Notebooks│ │ Pipelines│ │  Katib   │ │ Training │          │
│  │(Jupyter) │ │  (KFP)   │ │(AutoML)  │ │ Operators│          │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘          │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                        │
│  │  KServe  │ │  Feast   │ │  Model   │                        │
│  │(Serving) │ │(Features)│ │ Registry │                        │
│  └──────────┘ └──────────┘ └──────────┘                        │
├─────────────────────────────────────────────────────────────────┤
│  Platform Services                                               │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │ Istio (Service Mesh) │ Dex (Auth) │ MinIO (Storage)        ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Key Components

| Component          | Purpose                | Use Case                      |
|--------------------|------------------------|-------------------------------|
| Notebooks          | Interactive development| Data exploration, prototyping |
| Pipelines          | Workflow orchestration | End-to-end ML pipelines       |
| Katib              | Hyperparameter tuning  | AutoML, NAS                   |
| Training Operators | Distributed training   | PyTorch, TensorFlow, MPI      |
| KServe             | Model serving          | Inference endpoints           |
| Feast              | Feature store          | Feature management            |

Installing Kubeflow

Standalone Installation

# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash

# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install Kubeflow (full installation)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Verify installation
kubectl get pods -n kubeflow

GitOps Installation with deployKF

# deployKF provides GitOps-based Kubeflow deployment
# Uses ArgoCD for lifecycle management

# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Apply deployKF generator
# (verify this manifest path against the deployKF documentation for your release)
kubectl apply -f https://raw.githubusercontent.com/deployKF/deployKF/main/argocd-application.yaml

Cloud Provider Options

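| Provider | Service                 | Features                     |
|----------|-------------------------|------------------------------|
| AWS      | Amazon SageMaker on EKS | Managed notebooks, pipelines |
| GCP      | Vertex AI Pipelines     | Native GKE integration       |
| Azure    | Azure ML on AKS         | Managed Kubeflow             |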

Kubeflow Central Dashboard

Accessing the Dashboard

# Port-forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Access at http://localhost:8080
# Default credentials: user@example.com / 12341234
# (change these before exposing the dashboard beyond localhost)

Multi-Tenancy with Profiles

# Create user profile (namespace + RBAC)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-alpha
spec:
  owner:
    kind: User
    name: alice@example.com
  resourceQuotaSpec:
    hard:
      cpu: "32"
      memory: "128Gi"
      # Extended resources such as GPUs are capped through the "requests." prefix
      requests.nvidia.com/gpu: "8"
      persistentvolumeclaims: "10"

# Apply profile
kubectl apply -f ml-team-alpha-profile.yaml

# View profiles
kubectl get profiles

Kubeflow Notebooks

Creating a Notebook Server

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: pytorch-notebook
  namespace: ml-team-alpha
spec:
  template:
    spec:
      containers:
      - name: pytorch
        image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: workspace
          mountPath: /home/jovyan
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: pytorch-notebook-pvc
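
Because this notebook runs in the ml-team-alpha namespace, its resource requests count against the quota set in that Profile. The following is a rough sketch of that arithmetic, not part of Kubeflow itself: `parse_quantity` and `max_replicas` are hypothetical helpers, the dictionary keys are simplified labels rather than exact quota keys, and the estimate ignores anything else in the namespace (such as Istio sidecars) that also consumes quota.

```python
"""Estimate how many identical notebook servers fit in a Profile's quota."""

# Binary and decimal suffixes used by Kubernetes resource quantities.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(value) -> float:
    """Convert a quantity such as "500m", "4" or "16Gi" to a plain number."""
    value = str(value).strip()
    if value.endswith("m"):  # milli-units, e.g. "500m" CPU is 0.5 cores
        return int(value[:-1]) / 1000
    for suffix, factor in _SUFFIXES.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value)


def max_replicas(quota: dict, requests: dict) -> int:
    """Return how many pods with `requests` fit inside `quota`."""
    fits = []
    for name, requested in requests.items():
        if name not in quota:
            continue  # this resource is not capped by the quota
        fits.append(int(parse_quantity(quota[name]) // parse_quantity(requested)))
    if not fits:
        raise ValueError("no requested resource is limited by the quota")
    return min(fits)


if __name__ == "__main__":
    # Values copied from the Profile and Notebook manifests in this lesson.
    profile_quota = {"cpu": "32", "memory": "128Gi", "gpu": "8"}
    notebook_requests = {"cpu": "2", "memory": "8Gi", "gpu": "1"}
    print(max_replicas(profile_quota, notebook_requests))
```

With these numbers, CPU and memory would each allow 16 such notebooks, but the GPU cap allows only 8, so the GPU quota is the binding constraint.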

Pre-built Notebook Images

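| Image                   | Framework    | GPU Support                     |
|-------------------------|--------------|---------------------------------|
| jupyter-pytorch-full    | PyTorch      | Yes (via the -cuda-full variant)|
| jupyter-tensorflow-full | TensorFlow   | Yes (via the -cuda-full variant)|
| jupyter-scipy           | SciPy/Pandas | No                              |
| codeserver-python       | VS Code      | Optional                        |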

Training Operators

PyTorch Training Operator

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed
  namespace: ml-team-alpha
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1
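
The manifest runs the same train.py in every replica, so the script has to work out which process it is. To my understanding, the training operator injects the usual PyTorch rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) into each pod, which for this job would give a world size of 4 (1 master plus 3 workers). Below is a minimal sketch of reading them. `DistEnv` and `read_dist_env` are hypothetical names, and the fallback values are assumptions chosen only so the script also runs as a single local process.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    """Rendezvous settings for one process of a distributed job."""
    master_addr: str
    master_port: int
    rank: int
    world_size: int

    @property
    def is_master(self) -> bool:
        # By convention, rank 0 handles checkpointing and logging.
        return self.rank == 0


def read_dist_env(env=os.environ) -> DistEnv:
    """Read the rendezvous variables, falling back to one local process."""
    return DistEnv(
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "29500")),  # assumed fallback
        rank=int(env.get("RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    print(
        f"rank {dist_env.rank} of {dist_env.world_size}, "
        f"master at {dist_env.master_addr}:{dist_env.master_port}"
    )
```

In a real train.py, `torch.distributed.init_process_group(backend="nccl")` reads the same variables through its default `env://` init method, so a helper like this is mainly useful for logging and for deciding which rank writes checkpoints.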

Other Training Operators

# TensorFlow Training
# (abbreviated: each replica spec also needs a pod template whose container is named "tensorflow")
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-distributed
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
    Worker:
      replicas: 2
    PS:
      replicas: 1
---
# MPI Job (for Horovod)
# (abbreviated: each replica spec also needs a pod template)
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 4
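
For a TFJob, the operator typically describes the cluster layout to each pod through the TF_CONFIG environment variable, a JSON document listing the replica addresses and the current pod's role. The sketch below shows one way to inspect it. `parse_tf_config` and `count_replicas` are hypothetical helpers, and the hostnames in the example are illustrative placeholders, not values the operator is guaranteed to produce.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (task_type, task_index, cluster) from a TF_CONFIG JSON string.

    When `raw` is None the value is read from the environment; a missing
    variable is treated as a single-process "chief" with an empty cluster.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return task.get("type", "chief"), int(task.get("index", 0)), cluster


def count_replicas(cluster: dict) -> dict:
    """Map each replica type to the number of addresses listed for it."""
    return {role: len(addresses) for role, addresses in cluster.items()}


if __name__ == "__main__":
    # Illustrative TF_CONFIG matching the Chief/Worker/PS counts above.
    example = json.dumps({
        "cluster": {
            "chief": ["tf-distributed-chief-0:2222"],
            "worker": ["tf-distributed-worker-0:2222", "tf-distributed-worker-1:2222"],
            "ps": ["tf-distributed-ps-0:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
    task_type, task_index, cluster = parse_tf_config(example)
    print(task_type, task_index, count_replicas(cluster))
```

TensorFlow's distribution strategies read TF_CONFIG on their own, so a training script rarely needs to parse it. Doing so is mostly helpful for debugging a job whose replicas cannot find each other.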

Next, we'll dive deep into Kubeflow Pipelines for building reproducible ML workflows.
