Kubeflow & ML Pipelines
Katib: AutoML & Hyperparameter Tuning
Katib automates hyperparameter tuning and neural architecture search (NAS) on Kubernetes. It supports multiple ML frameworks and optimization algorithms for finding optimal model configurations.
Katib Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          Katib System                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Experiment                                                │  │
│  │  - Objective (minimize/maximize)                          │  │
│  │  - Search space (hyperparameters)                         │  │
│  │  - Algorithm (Bayesian, Grid, Random, etc.)               │  │
│  │  - Trial template (training job)                          │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Suggestion Service                                        │  │
│  │ Generates hyperparameter combinations                     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                ↓                                │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐             │
│  │ Trial 1 │  │ Trial 2 │  │ Trial 3 │  │ Trial N │             │
│  │ lr=0.01 │  │ lr=0.001│  │ lr=0.1  │  │ ...     │             │
│  │ bs=32   │  │ bs=64   │  │ bs=128  │  │         │             │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘             │
│                                ↓                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Best Trial: lr=0.001, bs=64, accuracy=0.95                │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
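The flow above is easier to internalize with a toy simulation. The sketch below uses no Katib APIs at all: plain random sampling stands in for the suggestion service, a made-up objective function stands in for the training job, and the loop keeps the best trial, which is exactly the role Katib's controller plays at cluster scale.

import random

# Toy stand-in for a real training job: returns a fake "accuracy"
# that prefers moderate learning rates and larger batch sizes.
def run_trial(lr: float, batch_size: int) -> float:
    return 1.0 - abs(lr - 0.001) * 5 - (1.0 / batch_size)

search_space = {
    "lr": lambda: 10 ** random.uniform(-4, -1),       # 0.0001 .. 0.1
    "batch_size": lambda: random.choice([16, 32, 64, 128]),
}

best = None
for trial_id in range(20):                            # maxTrialCount
    # "Suggestion service": sample one hyperparameter combination
    params = {name: sample() for name, sample in search_space.items()}
    accuracy = run_trial(params["lr"], params["batch_size"])
    if best is None or accuracy > best[1]:            # maximize objective
        best = (params, accuracy)

print(f"Best trial: {best[0]}, accuracy={best[1]:.4f}")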
Supported Algorithms
| Algorithm | Type | Best For |
|---|---|---|
| Random | Search | Initial exploration |
| Grid | Search | Small search spaces |
| Bayesian optimization | Optimization | Expensive evaluations |
| TPE | Optimization | Large, mixed search spaces |
| CMA-ES | Evolutionary | Continuous parameters |
| Hyperband | Early stopping | Large search spaces |
| ENAS | NAS | Architecture search |
| DARTS | NAS | Differentiable NAS |
Creating an Experiment
Basic Hyperparameter Tuning
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: pytorch-hpo
  namespace: ml-research
spec:
  # Optimization objective
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
      - f1_score

  # Optimization algorithm
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: "random_state"
        value: "42"

  # Number of trials
  parallelTrialCount: 4
  maxTrialCount: 20
  maxFailedTrialCount: 3

  # Search space
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
        step: "0.0001"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
        step: "16"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - adam
          - sgd
          - adamw
    - name: dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"

  # Trial template
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: learningRate
        description: Learning rate
        reference: learning_rate
      - name: batchSize
        description: Batch size
        reference: batch_size
      - name: optimizer
        description: Optimizer type
        reference: optimizer
      - name: dropout
        description: Dropout rate
        reference: dropout
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training-container
                image: my-training-image:latest
                command:
                  - "python"
                  - "train.py"
                  - "--lr=${trialParameters.learningRate}"
                  - "--batch-size=${trialParameters.batchSize}"
                  - "--optimizer=${trialParameters.optimizer}"
                  - "--dropout=${trialParameters.dropout}"
                resources:
                  limits:
                    nvidia.com/gpu: 1
                    memory: "16Gi"
            restartPolicy: Never
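Submitting the experiment is normally just `kubectl apply -f experiment.yaml`. If you prefer to do it from Python, a minimal sketch with the official Kubernetes client (assuming PyYAML and the kubernetes package are installed, a valid kubeconfig, the Katib CRDs present, and the manifest above saved as experiment.yaml):

# Sketch: submit the Experiment custom resource from Python,
# equivalent to `kubectl apply -f experiment.yaml`.
import yaml
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig

with open("experiment.yaml") as f:
    experiment = yaml.safe_load(f)

api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubeflow.org",
    version="v1beta1",
    namespace="ml-research",
    plural="experiments",
    body=experiment,
)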
Metrics Collection
# In your training script (train.py)
import logging

# Configure Katib metrics logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def train():
    for epoch in range(epochs):
        train_loss = train_epoch()
        val_accuracy, f1 = validate()  # validate() also returns F1 here

        # Log metrics for Katib (parsed from stdout)
        logger.info(f"epoch={epoch}")
        logger.info(f"loss={train_loss:.4f}")
        logger.info(f"accuracy={val_accuracy:.4f}")  # Objective metric
        logger.info(f"f1_score={f1:.4f}")
Metrics Collector Configuration
metricsCollectorSpec:
  collector:
    kind: StdOut  # Parse metrics from container stdout
  source:
    filter:
      # metricsFormat entries are regular expressions with two capture
      # groups: metric name and metric value (matches e.g. "accuracy=0.95")
      metricsFormat:
        - '([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?)'
Neural Architecture Search (NAS)
ENAS Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: nas-enas
  namespace: ml-research
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: enas
    algorithmSettings:
      - name: "controller_hidden_size"
        value: "64"
      - name: "controller_temperature"
        value: "5.0"
  nasConfig:
    graphConfig:
      numLayers: 8
      inputSizes:
        - 32
        - 32
        - 3
      outputSizes:
        - 10
    operations:
      - operationType: convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list: ["3", "5", "7"]
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list: ["32", "64", "128"]
      - operationType: pooling
        parameters:
          - name: pooling_type
            parameterType: categorical
            feasibleSpace:
              list: ["max", "avg"]
Monitoring Experiments
# View experiment status
kubectl get experiment pytorch-hpo -n ml-research
# View all trials
kubectl get trials -n ml-research
# Get best trial
kubectl get experiment pytorch-hpo -n ml-research \
-o jsonpath='{.status.currentOptimalTrial}'
# View experiment in Katib UI
kubectl port-forward svc/katib-ui -n kubeflow 8080:80
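The same `status.currentOptimalTrial` object the jsonpath query prints can be read programmatically. A sketch with the Kubernetes Python client, assuming kubeconfig access to the cluster (field layout as reported by Katib v1beta1; adjust if your version differs):

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

exp = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1beta1",
    namespace="ml-research",
    plural="experiments",
    name="pytorch-hpo",
)

# currentOptimalTrial holds the best parameter assignment seen so far
best = exp["status"]["currentOptimalTrial"]
print(best["parameterAssignments"])
print(best["observation"]["metrics"])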
Early Stopping with Hyperband
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: hyperband-experiment
spec:
  algorithm:
    algorithmName: hyperband
    algorithmSettings:
      - name: "resource_name"
        value: "epoch"
      - name: "eta"
        value: "3"
      - name: "r_l"
        value: "9"  # Max epochs
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
  # Separate early stopping saves GPU hours on clearly bad trials
  earlyStopping:
    algorithmName: medianstop
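To make `eta` and `r_l` concrete: Hyperband's successive-halving rungs keep roughly the top 1/eta of trials at each step and give survivors eta times the budget. A back-of-envelope schedule for one bracket with eta=3 and a 9-epoch budget (illustrative arithmetic, not Katib's exact bracketing):

# Back-of-envelope successive-halving rungs for eta=3, max resource 9.
# Each rung keeps ~1/eta of the trials and triples the epoch budget.
eta, r_max = 3, 9
n, r = 9, 1                  # start with 9 trials at 1 epoch each
while r <= r_max:
    print(f"{n} trial(s) x {r} epoch(s)")
    n = max(1, n // eta)     # keep roughly the top third
    r *= eta                 # survivors train 3x longer
# Output:
# 9 trial(s) x 1 epoch(s)
# 3 trial(s) x 3 epoch(s)
# 1 trial(s) x 9 epoch(s)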
Next, we'll explore Argo Workflows for advanced pipeline orchestration beyond Kubeflow Pipelines.