ML Pipelines & Orchestration
Airflow vs Kubeflow vs Prefect
Choosing the right orchestrator is a common interview topic. Know the trade-offs to show architectural thinking.
Feature Comparison
| Feature | Airflow | Kubeflow Pipelines | Prefect |
|---|---|---|---|
| Primary use | General workflow orchestration | ML-specific pipelines | Python-native workflows |
| Infrastructure | Any (VM, K8s) | Kubernetes-only | Any (hybrid cloud) |
| Learning curve | Medium | High | Low |
| Dynamic DAGs | Limited | Native | Native |
| ML integrations | Via operators | Native (TFX, KServe) | Via tasks |
| Scaling | Celery/K8s executor | Native K8s | Hybrid agents |
| Community | Largest | Growing | Growing |
Interview Question: Which Orchestrator?
Question: "You're building ML pipelines for a team of 10 ML engineers. Which orchestrator would you choose and why?"
Framework Answer:
def choose_orchestrator(context):
    # Airflow: best for mixed workloads and existing expertise
    if context.get("team_has_existing_airflow"):
        return "Airflow - leverage existing expertise"
    if context.get("workflows") == "general_etl_plus_ml":
        return "Airflow - most versatile, largest community"
    # Kubeflow: best for ML-native, Kubernetes-first teams
    if context.get("infrastructure") == "kubernetes_only":
        if context.get("ml_heavy") and context.get("team_familiar_with_k8s"):
            return "Kubeflow - native ML features, K8s-native"
    # Prefect: best for teams that want a modern developer experience
    if context.get("wants_modern_developer_experience"):
        if context.get("needs_dynamic_workflows"):
            return "Prefect - dynamic DAGs, better testing"
    return "Airflow - safest default choice"
Airflow Deep Dive
When to Use: General-purpose orchestration, mixed data + ML pipelines
# Airflow 2.x TaskFlow API
from airflow.decorators import dag, task
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

@dag(schedule="@daily", catchup=False)
def ml_training_dag():
    @task
    def prepare_data():
        # Runs on an Airflow worker; the return value is passed downstream via XCom
        return {"dataset_path": "s3://bucket/training_data"}

    # GPU training runs in its own pod on Kubernetes
    train_model = KubernetesPodOperator(
        task_id="train_model",
        name="model-training",
        namespace="ml-pipelines",
        image="training:v1.2",
        arguments=[
            "--data",
            "{{ ti.xcom_pull(task_ids='prepare_data')['dataset_path'] }}",
        ],
        container_resources=k8s.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},
            requests={"memory": "8Gi"},
        ),
        node_selector={"gpu-type": "a100"},
    )

    data = prepare_data()
    data >> train_model

ml_training_dag()
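The comparison table lists Airflow's dynamic DAG support as limited; since Airflow 2.3, dynamic task mapping narrows that gap by fanning a task out over a list computed at runtime. A minimal sketch, with illustrative dataset IDs:

from airflow.decorators import dag, task

@dag(schedule=None, catchup=False)
def dynamic_training_dag():
    @task
    def list_dataset_ids():
        # In practice this might query a metadata store
        return ["sales", "churn", "fraud"]

    @task
    def train_on_dataset(dataset_id: str):
        print(f"training on {dataset_id}")

    # One mapped task instance per dataset, resolved at runtime
    train_on_dataset.expand(dataset_id=list_dataset_ids())

dynamic_training_dag()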
Kubeflow Pipelines Deep Dive
When to Use: Kubernetes-native teams, heavy ML focus
from kfp import dsl
from kfp.dsl import component, pipeline, Input, Output, Dataset, Model

@component(base_image="python:3.11")
def preprocess_data(
    raw_data: Input[Dataset],
    processed_data: Output[Dataset],
):
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    df_clean = df.dropna()
    df_clean.to_csv(processed_data.path, index=False)

@component(base_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime")
def train_model(
    training_data: Input[Dataset],
    model_artifact: Output[Model],
):
    # Training logic
    pass

@pipeline(name="ml-training-pipeline")
def training_pipeline():
    preprocess = preprocess_data(raw_data=...)  # placeholder: wire up a real Dataset input
    train = train_model(training_data=preprocess.outputs["processed_data"])
Interview Point: Kubeflow Pipelines v2 compiles pipelines to an Intermediate Representation (IR) YAML, which is what makes them portable across execution backends.
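For example, compiling the pipeline above produces that portable IR YAML (the output filename is arbitrary):

from kfp import compiler

compiler.Compiler().compile(
    pipeline_func=training_pipeline,
    package_path="training_pipeline.yaml",  # IR YAML, submittable to any KFP v2 backend
)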
Prefect Deep Dive
When to Use: Modern teams, dynamic workflows, hybrid cloud
from prefect import flow, task
from prefect.tasks import task_input_hash
from datetime import timedelta

@task(
    retries=3,
    retry_delay_seconds=[60, 120, 300],  # Increasing back-off between retries
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(hours=1),
)
def fetch_training_data(dataset_id: str):
    # Result is cached for 1 hour per unique dataset_id
    return load_dataset(dataset_id)  # load_dataset: your own data-loading helper

@task
def train_model(data, hyperparams: dict):
    # Training logic
    pass

@flow(name="dynamic-training-flow")
def training_flow(model_configs: list[dict]):
    # Dynamic: the number of task runs is driven by the input list at runtime
    for config in model_configs:
        data = fetch_training_data(config["dataset_id"])
        train_model(data, config["hyperparams"])
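Running the flow with a different config list changes how many task runs are created, with no DAG re-definition; the dataset IDs and hyperparameters below are illustrative:

if __name__ == "__main__":
    training_flow([
        {"dataset_id": "sales_2024", "hyperparams": {"lr": 0.01}},
        {"dataset_id": "churn_2024", "hyperparams": {"lr": 0.001, "epochs": 20}},
    ])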
Prefect Advantages:
- Native Python - task dependencies come from ordinary function calls, not explicit `>>` operators
- First-class caching and retry support
- Easier testing - tasks and flows are plain functions you can call directly (see the sketch below)
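A minimal sketch of that testing story with pytest, assuming the placeholder bodies above (e.g. load_dataset and the training logic) are filled in:

from prefect.testing.utilities import prefect_test_harness

def test_train_model_logic():
    # .fn bypasses the Prefect engine and calls the plain Python function
    train_model.fn(data=[[1.0, 2.0]], hyperparams={"lr": 0.01})

def test_training_flow_end_to_end():
    # prefect_test_harness gives the flow a throwaway local Prefect backend
    with prefect_test_harness():
        training_flow([{"dataset_id": "sales_2024", "hyperparams": {"lr": 0.01}}])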
Pro Tip: In interviews, acknowledge that orchestrator choice often depends on existing infrastructure. There's no universally "best" tool.
Next, we'll cover feature stores and data versioning.