Kubeflow & ML Pipelines
Kubeflow: The ML Platform for Kubernetes
Kubeflow provides a composable ecosystem of tools for end-to-end ML on Kubernetes. It is a CNCF incubating project whose community is working toward graduation, and it is widely used as a platform for production ML pipelines.
Kubeflow Ecosystem
Component Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        Kubeflow Platform                        │
├─────────────────────────────────────────────────────────────────┤
│  User Interface                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Central Dashboard │ Notebooks UI │ Pipelines UI │ Katib UI  │ │
│ └─────────────────────────────────────────────────────────────┘ │
├─────────────────────────────────────────────────────────────────┤
│  ML Components                                                  │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │ Notebooks│ │ Pipelines│ │  Katib   │ │ Training │            │
│  │ (Jupyter)│ │  (KFP)   │ │ (AutoML) │ │ Operators│            │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘            │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐                         │
│  │  KServe  │ │  Feast   │ │  Model   │                         │
│  │ (Serving)│ │(Features)│ │ Registry │                         │
│  └──────────┘ └──────────┘ └──────────┘                         │
├─────────────────────────────────────────────────────────────────┤
│  Platform Services                                              │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Istio (Service Mesh) │ Dex (Auth) │ MinIO (Storage)         │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Key Components
| Component | Purpose | Use Case |
|---|---|---|
| Notebooks | Interactive development | Data exploration, prototyping |
| Pipelines | Workflow orchestration | End-to-end ML pipelines |
| Katib | Hyperparameter tuning | AutoML, NAS |
| Training Operators | Distributed training | PyTorch, TensorFlow, MPI |
| KServe | Model serving | Inference endpoints |
| Feast | Feature store | Feature management |
Installing Kubeflow
Standalone Installation
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
# Clone Kubeflow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
# Install Kubeflow (full installation)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done
# Verify installation
kubectl get pods -n kubeflow
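Reading the `kubectl get pods` table by eye gets tedious on a full installation, which starts dozens of pods. The following is a minimal sketch, not part of the Kubeflow tooling, that filters the JSON form of the same command down to the pods that are still starting. The function names `not_ready_pods` and `fetch_pods` are made up for this example, and it assumes `kubectl` is on the PATH and already points at the cluster.

```python
import json
import subprocess


def not_ready_pods(pod_list: dict) -> list[str]:
    """Return names of pods that are neither fully ready nor completed."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        phase = status.get("phase")
        if phase == "Succeeded":
            continue  # completed Job pods need no further attention
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if phase != "Running" or not all_ready:
            pending.append(pod["metadata"]["name"])
    return pending


def fetch_pods(namespace: str = "kubeflow") -> dict:
    """Run `kubectl get pods -o json` and parse the result (assumes kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `not_ready_pods(fetch_pods())` after the install loop finishes returns an empty list once every pod in the `kubeflow` namespace is running and ready.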
deployKF (Recommended for Production)
# deployKF provides GitOps-based Kubeflow deployment
# Uses ArgoCD for lifecycle management
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Apply deployKF generator
kubectl apply -f https://raw.githubusercontent.com/deployKF/deployKF/main/argocd-application.yaml
Cloud Provider Options
| Provider | Service | Features |
|---|---|---|
| AWS | Amazon SageMaker on EKS | Managed notebooks, pipelines |
| GCP | Vertex AI Pipelines | Managed runner for pipelines written with the KFP SDK |
| Azure | Azure ML on AKS | Managed ML platform using AKS as compute (not a Kubeflow distribution) |
Kubeflow Central Dashboard
Accessing the Dashboard
# Port-forward to access dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Access at http://localhost:8080
# Default credentials: user@example.com / 12341234
Multi-Tenancy with Profiles
# Create user profile (namespace + RBAC)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-alpha
spec:
  owner:
    kind: User
    name: alice@example.com
  resourceQuotaSpec:
    hard:
      cpu: "32"
      memory: "128Gi"
      # Quotas on extended resources such as GPUs need the "requests." prefix
      requests.nvidia.com/gpu: "8"
      persistentvolumeclaims: "10"
# Apply profile
kubectl apply -f ml-team-alpha-profile.yaml
# View profiles
kubectl get profiles
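When several teams need profiles, writing one YAML file per team becomes repetitive. As a rough sketch (the team names, owners, and quotas below are hypothetical, and `profile_manifest` is a helper invented for this example), the same Profile structure can be generated as JSON, which `kubectl apply` accepts alongside YAML.

```python
import json


def profile_manifest(name: str, owner_email: str, hard_quota: dict) -> dict:
    """Build a Kubeflow Profile as a plain dict, mirroring the YAML fields."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": {
            "owner": {"kind": "User", "name": owner_email},
            "resourceQuotaSpec": {"hard": dict(hard_quota)},
        },
    }


# Hypothetical teams and quotas, for illustration only.
TEAMS = {
    "ml-team-alpha": ("alice@example.com", {"cpu": "32", "memory": "128Gi"}),
    "ml-team-beta": ("bob@example.com", {"cpu": "16", "memory": "64Gi"}),
}

manifests = [
    profile_manifest(name, owner, quota) for name, (owner, quota) in TEAMS.items()
]

# A v1 List wraps several objects so they can be applied from one document.
bundle = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(bundle, indent=2))
```

Saved under an arbitrary name such as `make_profiles.py`, the output can be piped straight into the cluster with `python make_profiles.py | kubectl apply -f -`.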
Kubeflow Notebooks
Creating a Notebook Server
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: pytorch-notebook
  namespace: ml-team-alpha
spec:
  template:
    spec:
      containers:
        - name: pytorch
          image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0
          resources:
            requests:
              cpu: "2"
              memory: "8Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: pytorch-notebook-pvc
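It helps to check how many such notebook servers fit inside the `ml-team-alpha` quota before handing the profile to a team. The sketch below works through that arithmetic. It assumes the quota's `cpu` and `memory` entries are counted against pod requests (Kubernetes treats them as `requests.cpu` and `requests.memory`), uses the short key `gpu` as a stand-in for the GPU quota entry, and parses only the few quantity formats that appear here. Both helper functions are invented for this example.

```python
# Binary suffixes used by Kubernetes memory quantities (subset, for this sketch).
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(q: str) -> float:
    """Parse a small subset of Kubernetes quantities: '2', '500m', '8Gi'."""
    for suffix, factor in _BINARY.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    if q.endswith("m"):  # millicores, e.g. '500m' is half a CPU
        return float(q[:-1]) / 1000
    return float(q)


def max_replicas(quota: dict, requests: dict) -> tuple[int, str]:
    """Return (count, limiting_resource) for identical workloads under a quota."""
    best = None
    for resource, requested in requests.items():
        if resource not in quota:
            continue  # the quota does not constrain this resource
        fit = int(parse_quantity(quota[resource]) // parse_quantity(requested))
        if best is None or fit < best[0]:
            best = (fit, resource)
    if best is None:
        raise ValueError("quota constrains none of the requested resources")
    return best


# Values copied from the Profile and Notebook manifests in this lesson.
quota = {"cpu": "32", "memory": "128Gi", "gpu": "8", "persistentvolumeclaims": "10"}
notebook = {"cpu": "2", "memory": "8Gi", "gpu": "1", "persistentvolumeclaims": "1"}

print(max_replicas(quota, notebook))  # (8, 'gpu')
```

CPU and memory would each allow 16 notebook servers and the PVC count would allow 10, so the eight GPUs are the binding limit for this team.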
Pre-built Notebook Images
| Image | Framework | GPU Support |
|---|---|---|
| jupyter-pytorch-full | PyTorch | CPU only; use jupyter-pytorch-cuda-full for GPUs |
| jupyter-tensorflow-full | TensorFlow | CPU only; use jupyter-tensorflow-cuda-full for GPUs |
| jupyter-scipy | SciPy/Pandas | No |
| codeserver-python | VS Code | Optional |
Training Operators
PyTorch Training Operator
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed
  namespace: ml-team-alpha
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              # Stock image shown for brevity; train.py must be added to the image or mounted
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
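The manifest runs the same `train.py` on the master and on every worker, so the script has to find out which replica it is. The training operator is generally documented as injecting the standard PyTorch rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) into each container. The sketch below shows one way a script might read them. The `DistEnv` class and the fallback defaults are assumptions for illustration, not part of any Kubeflow API.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def is_master(self) -> bool:
        # Rank 0 conventionally handles checkpointing and logging.
        return self.rank == 0


def read_dist_env(env=os.environ) -> DistEnv:
    """Read rendezvous settings, falling back to a single local process."""
    return DistEnv(
        master_addr=env.get("MASTER_ADDR", "localhost"),
        master_port=int(env.get("MASTER_PORT", "23456")),
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
    )


cfg = read_dist_env()
print(f"rank {cfg.rank} of {cfg.world_size} via {cfg.master_addr}:{cfg.master_port}")

# A real train.py would call torch.distributed.init_process_group(backend="nccl")
# at this point; its default env:// init method reads the same four variables.
```

With one master and three workers, each replica should see `WORLD_SIZE=4`. Run outside the cluster, the defaults let the same script execute as a single process.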
Other Training Operators
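# TensorFlow Training
# (skeleton: each replica type also needs a `template` with a pod spec,
#  as in the PyTorchJob example)
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tf-distributed
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
    Worker:
      replicas: 2
    PS:
      replicas: 1
---
# MPI Job (for Horovod)
# (skeleton: Launcher and Worker also need a `template` with a pod spec)
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: horovod-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
    Worker:
      replicas: 4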
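For the TFJob above, each replica learns its role from the `TF_CONFIG` environment variable, a JSON document that the operator sets and that TensorFlow's distribution strategies read. The following sketch parses it to show what a replica sees. The host names in `EXAMPLE` are illustrative assumptions chosen to match the `tf-distributed` replica counts, and `describe_task` is a helper invented for this example.

```python
import json
import os


def describe_task(tf_config_json: str) -> dict:
    """Summarize the cluster layout and this replica's role from TF_CONFIG."""
    cfg = json.loads(tf_config_json or "{}")
    cluster = cfg.get("cluster", {})
    task = cfg.get("task", {"type": "chief", "index": 0})
    return {
        "role": task["type"],
        "index": task["index"],
        "replicas": {role: len(hosts) for role, hosts in cluster.items()},
        "is_chief": task["type"] == "chief",
    }


# Hypothetical TF_CONFIG as the second worker of tf-distributed might see it.
EXAMPLE = json.dumps({
    "cluster": {
        "chief": ["tf-distributed-chief-0:2222"],
        "worker": ["tf-distributed-worker-0:2222", "tf-distributed-worker-1:2222"],
        "ps": ["tf-distributed-ps-0:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

print(describe_task(os.environ.get("TF_CONFIG", EXAMPLE)))
```

Only the chief normally writes checkpoints and summaries, which is why training scripts tend to branch on a check like `is_chief`.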
Next, we'll dive deep into Kubeflow Pipelines for building reproducible ML workflows.