Kubeflow & ML Pipelines
Katib: AutoML & Hyperparameter Tuning
Katib automates hyperparameter tuning and neural architecture search (NAS) on Kubernetes. It supports multiple ML frameworks and optimization algorithms for finding optimal model configurations.
Katib Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          Katib System                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Experiment                                                │  │
│  │  - Objective (minimize/maximize)                          │  │
│  │  - Search space (hyperparameters)                         │  │
│  │  - Algorithm (Bayesian, Grid, Random, etc.)               │  │
│  │  - Trial template (training job)                          │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Suggestion Service                                        │  │
│  │ Generates hyperparameter combinations                     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐             │
│  │ Trial 1 │  │ Trial 2 │  │ Trial 3 │  │ Trial N │             │
│  │ lr=0.01 │  │ lr=0.001│  │ lr=0.1  │  │ ...     │             │
│  │ bs=32   │  │ bs=64   │  │ bs=128  │  │         │             │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘             │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Best Trial: lr=0.001, bs=64, accuracy=0.95                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
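The flow above is easier to internalize with a toy simulation. The sketch below uses no Katib APIs at all: plain random sampling stands in for the suggestion service, a made-up objective function stands in for the training job, and the loop keeps the best trial, which is exactly the role Katib's controller plays at cluster scale.

import random

# Toy stand-in for a real training job: returns a fake "accuracy"
# that prefers moderate learning rates and larger batch sizes.
def run_trial(lr: float, batch_size: int) -> float:
    return 1.0 - abs(lr - 0.001) * 5 - (1.0 / batch_size)

search_space = {
    "lr": lambda: 10 ** random.uniform(-4, -1),       # 0.0001 .. 0.1
    "batch_size": lambda: random.choice([16, 32, 64, 128]),
}

best = None
for trial_id in range(20):                            # maxTrialCount
    # "Suggestion service": sample one hyperparameter combination
    params = {name: sample() for name, sample in search_space.items()}
    accuracy = run_trial(params["lr"], params["batch_size"])
    if best is None or accuracy > best[1]:            # maximize objective
        best = (params, accuracy)

print(f"Best trial: {best[0]}, accuracy={best[1]:.4f}")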
Supported Algorithms
| Algorithm | Type | Best For |
|---|---|---|
| Random | Search | Initial exploration |
| Grid | Search | Small search spaces |
| Bayesian optimization | Optimization | Expensive evaluations |
| TPE | Optimization | Large, mixed search spaces |
| CMA-ES | Evolutionary | Continuous parameters |
| Hyperband | Early stopping | Large search spaces |
| ENAS | NAS | Architecture search |
| DARTS | NAS | Differentiable NAS |
Creating an Experiment
Basic Hyperparameter Tuning
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorch-hpo
  namespace: ml-research
spec:
  # Optimization objective
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
      - f1_score

  # Optimization algorithm
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "random_state"
        value: "42"

  # Number of trials
  parallelTrialCount: 4
  maxTrialCount: 20
  maxFailedTrialCount: 3

  # Search space
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
        step: "0.0001"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
        step: "16"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - adam
          - sgd
          - adamw
    - name: dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"

  # Trial template
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate
        reference: learning_rate
      - name: batchSize
        description: Batch size
        reference: batch_size
      - name: optimizer
        description: Optimizer type
        reference: optimizer
      - name: dropout
        description: Dropout rate
        reference: dropout
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: my-training-image:latest
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
                  - "--optimizer=${trialParameters.optimizer}"
                  - "--dropout=${trialParameters.dropout}"
                resources:
                  limits:
                    nvidia.com/gpu: 1
                    memory: "16Gi"
            restartPolicy: Never
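Submitting the experiment is normally just `kubectl apply -f experiment.yaml`. If you prefer to do it from Python, a minimal sketch with the official Kubernetes client (assuming PyYAML and the kubernetes package are installed, a valid kubeconfig, the Katib CRDs present, and the manifest above saved as experiment.yaml):

# Sketch: submit the Experiment custom resource from Python,
# equivalent to `kubectl apply -f experiment.yaml`.
import yaml
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

with open("experiment.yaml") as f:
    experiment = yaml.safe_load(f)

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1beta1",
    namespace="ml-research",
    plural="experiments",
    body=experiment,
)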
Metrics Collection
# In your training script (train.py)
import logging

# Configure Katib metrics logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def train():
    for epoch in range(epochs):
        train_loss = train_epoch()
        val_accuracy, f1 = validate()  # validate() also returns F1 here

        # Log metrics for Katib (parsed from stdout)
        logger.info(f"epoch={epoch}")
        logger.info(f"loss={train_loss:.4f}")
        logger.info(f"accuracy={val_accuracy:.4f}")  # Objective metric
        logger.info(f"f1_score={f1:.4f}")
Metrics Collector Configuration
metricsCollectorSpec:
  collector:
    kind: StdOut  # Parse metrics from container stdout
  source:
    filter:
      # metricsFormat entries are regular expressions with two capture
      # groups: metric name and metric value (matches e.g. "accuracy=0.95")
      metricsFormat:
        - '([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?)'
Neural Architecture Search (NAS)
ENAS Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: nas-enas
  namespace: ml-research
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: enas
    algorithmSettings:
      - name: "controller_hidden_size"
        value: "64"
      - name: "controller_temperature"
        value: "5.0"
  nasConfig:
    graphConfig:
      numLayers: 8
      inputSizes:
        - 32
        - 32
        - 3
      outputSizes:
        - 10
    operations:
      - operationType: convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list: ["3", "5", "7"]
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list: ["32", "64", "128"]
      - operationType: pooling
        parameters:
          - name: pooling_type
            parameterType: categorical
            feasibleSpace:
              list: ["max", "avg"]
Monitoring Experiments
# View experiment status
kubectl get experiment pytorch-hpo -n ml-research
# View all trials
kubectl get trials -n ml-research
# Get best trial
kubectl get experiment pytorch-hpo -n ml-research \
-o jsonpath='{.status.currentOptimalTrial}'
# View experiment in Katib UI
kubectl port-forward svc/katib-ui -n kubeflow 8080:80
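The same `status.currentOptimalTrial` object the jsonpath query prints can be read programmatically. A sketch with the Kubernetes Python client, assuming kubeconfig access to the cluster (field layout as reported by Katib v1beta1; adjust if your version differs):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

exp = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1beta1",
    namespace="ml-research",
    plural="experiments",
    name="pytorch-hpo",
)

# currentOptimalTrial holds the best parameter assignment seen so far
best = exp["status"]["currentOptimalTrial"]
print(best["parameterAssignments"])
print(best["observation"]["metrics"])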
Early Stopping with Hyperband
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperband-experiment
spec:
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
      - name: "resource_name"
        value: "epoch"
      - name: "eta"
        value: "3"
      - name: "r_l"
        value: "9"  # Max epochs
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
  # Separate early stopping saves GPU hours on clearly bad trials
  earlyStopping:
    algorithmName: medianstop
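To make `eta` and `r_l` concrete: Hyperband's successive-halving rungs keep roughly the top 1/eta of trials at each step and give survivors eta times the budget. A back-of-envelope schedule for one bracket with eta=3 and a 9-epoch budget (illustrative arithmetic, not Katib's exact bracketing):

# Back-of-envelope successive-halving rungs for eta=3, max resource 9.
# Each rung keeps ~1/eta of the trials and triples the epoch budget.
eta, r_max = 3, 9
n, r = 9, 1                  # start with 9 trials at 1 epoch each
while r <= r_max:
    print(f"{n} trial(s) x {r} epoch(s)")
    n = max(1, n // eta)     # keep roughly the top third
    r *= eta                 # survivors train 3x longer
# Output:
# 9 trial(s) x 1 epoch(s)
# 3 trial(s) x 3 epoch(s)
# 1 trial(s) x 9 epoch(s)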
Next, we'll explore Argo Workflows for advanced pipeline orchestration beyond Kubeflow Pipelines.