ML Pipelines & Orchestration

Airflow vs Kubeflow vs Prefect

5 min read

Choosing the right orchestrator is a common interview topic. Know the trade-offs to show architectural thinking.

Feature Comparison

Feature Airflow Kubeflow Pipelines Prefect
Primary use General workflow ML-specific Modern workflows
Infrastructure Any (VM, K8s) Kubernetes-only Any (hybrid cloud)
Learning curve Medium High Low
Dynamic DAGs Limited Native Native
ML integrations Via operators Native (TFX, KServe) Via tasks
Scaling Celery/K8s executor Native K8s Hybrid agents
Community Largest Growing Growing

Interview Question: Which Orchestrator?

Question: "You're building ML pipelines for a team of 10 ML engineers. Which orchestrator would you choose and why?"

Framework Answer:

def choose_orchestrator(context):
    # Airflow: Best for mixed workloads
    if context["team_has_existing_airflow"]:
        return "Airflow - leverage existing expertise"

    if context["workflows"] == "general_etl_plus_ml":
        return "Airflow - most versatile, largest community"

    # Kubeflow: Best for ML-native teams
    if context["infrastructure"] == "kubernetes_only":
        if context["ml_heavy"] and context["team_familiar_with_k8s"]:
            return "Kubeflow - native ML features, K8s-native"

    # Prefect: Best for modern teams
    if context["wants_modern_developer_experience"]:
        if context["needs_dynamic_workflows"]:
            return "Prefect - dynamic DAGs, better testing"

    return "Airflow - safest default choice"

Airflow Deep Dive

When to Use: General-purpose orchestration, mixed data + ML pipelines

# Airflow 2.x TaskFlow API
from airflow.decorators import dag, task
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator
)

@dag(schedule="@daily", catchup=False)
def ml_training_dag():
    @task
    def prepare_data():
        # Runs on Airflow worker
        return {"dataset_path": "s3://bucket/training_data"}

    # GPU training on Kubernetes
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="model-training",
        namespace="ml-pipelines",
        image="training:v1.2",
        arguments=["--data", "{{ ti.xcom_pull('prepare_data') }}"],
        resources={
            "limits": {"nvidia.com/gpu": "1"},
            "requests": {"memory": "8Gi"}
        },
        node_selector={"gpu-type": "a100"}
    )

    data = prepare_data()
    data >> train_model

Kubeflow Pipelines Deep Dive

When to Use: Kubernetes-native teams, heavy ML focus

from kfp import dsl
from kfp.dsl import component, pipeline, Input, Output, Dataset, Model

@component(base_image="python:3.11")
def preprocess_data(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset]
):
    import pandas as pd
    df = pd.read_csv(raw_data.path)
    df_clean = df.dropna()
    df_clean.to_csv(processed_data.path)

@component(base_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime")
def train_model(
    training_data: Input[Dataset],
    model_artifact: Output[Model]
):
    # Training logic
    pass

@pipeline(name="ml-training-pipeline")
def training_pipeline():
    preprocess = preprocess_data(raw_data=...)
    train = train_model(training_data=preprocess.outputs["processed_data"])

Interview Point: Kubeflow Pipelines v2 uses IR (Intermediate Representation) for platform portability.

Prefect Deep Dive

When to Use: Modern teams, dynamic workflows, hybrid cloud

from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta

@task(
    retries=3,
    retry_delay_seconds=[60, 120, 300],  # Exponential backoff
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(hours=1)
)
def fetch_training_data(dataset_id: str):
    # Cached for 1 hour
    return load_dataset(dataset_id)

@task
def train_model(data, hyperparams: dict):
    # Training logic
    pass

@flow(name="dynamic-training-flow")
def training_flow(model_configs: list[dict]):
    # Dynamic: Different number of tasks per run
    for config in model_configs:
        data = fetch_training_data(config["dataset_id"])
        train_model(data, config["hyperparams"])

Prefect Advantages:

  • Native Python - no decorators for task dependencies
  • First-class caching and retry support
  • Easier testing (just call functions)

Pro Tip: In interviews, acknowledge that orchestrator choice often depends on existing infrastructure. There's no universally "best" tool.

Next, we'll cover feature stores and data versioning. :::

Quiz

Module 3: ML Pipelines & Orchestration

Take Quiz