ML Workflow Orchestration

Pipeline Concepts


ML pipelines automate the flow from raw data to deployed model. Understanding pipeline fundamentals helps you choose the right orchestration tool.

What is an ML Pipeline?

An ML pipeline is a sequence of automated steps:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Ingest     │───▶│  Transform  │───▶│  Validate   │
│  Data       │    │  Data       │    │  Data       │
└─────────────┘    └─────────────┘    └─────────────┘
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Deploy     │◀───│  Evaluate   │◀───│  Train      │
│  Model      │    │  Model      │    │  Model      │
└─────────────┘    └─────────────┘    └─────────────┘

DAG: The Core Abstraction

Pipelines are modeled as Directed Acyclic Graphs (DAGs):

  • Directed: Tasks flow in one direction
  • Acyclic: No circular dependencies
  • Graph: Tasks (nodes) connected by dependencies (edges)

# Conceptual DAG structure
dag = {
    "ingest": [],                    # No dependencies
    "transform": ["ingest"],         # Depends on ingest
    "validate": ["transform"],       # Depends on transform
    "train": ["validate"],           # Depends on validate
    "evaluate": ["train"],           # Depends on train
    "deploy": ["evaluate"]           # Depends on evaluate
}
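Given a dependency dict like the one above, an orchestrator must find an execution order in which every task runs after its dependencies. A minimal sketch using a depth-first topological sort (it assumes the graph is acyclic, as a valid DAG must be):

```python
# Topological sort: compute a valid execution order for the DAG above.
def topo_order(dag):
    order = []
    visited = set()

    def visit(task):
        if task in visited:
            return
        for dep in dag[task]:   # run all dependencies first
            visit(dep)
        visited.add(task)
        order.append(task)

    for task in dag:
        visit(task)
    return order

dag = {
    "ingest": [],
    "transform": ["ingest"],
    "validate": ["transform"],
    "train": ["validate"],
    "evaluate": ["train"],
    "deploy": ["evaluate"],
}

print(topo_order(dag))
# → ['ingest', 'transform', 'validate', 'train', 'evaluate', 'deploy']
```

Real orchestrators do the same resolution, plus cycle detection and parallel dispatch of tasks whose dependencies are all satisfied.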

Key Pipeline Components

Component    Purpose                       Example
Task         Single unit of work           Train model, validate data
Dependency   Order of execution            Train after validate
Trigger      What starts the pipeline      Schedule, data arrival
Artifact     Output passed between tasks   Processed dataset, model
Parameter    Configuration values          Learning rate, epochs

Pipeline Patterns

Sequential Pipeline

A ──▶ B ──▶ C ──▶ D

Each step waits for the previous one. Simple but slow.

Parallel Pipeline

    ┌──▶ B ──┐
A ──┤        ├──▶ D
    └──▶ C ──┘

Independent tasks run simultaneously. Faster execution.
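The parallel pattern can be sketched with Python's standard thread pool. The task functions here are placeholders for illustration; the point is that B and C both depend only on A, so they are submitted concurrently and D waits on both:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical tasks illustrating the parallel pattern.
def task_a():
    return "raw"

def task_b(data):
    return data + "-features"

def task_c(data):
    return data + "-stats"

def task_d(b_out, c_out):
    return [b_out, c_out]

a = task_a()
with ThreadPoolExecutor() as pool:
    fut_b = pool.submit(task_b, a)   # B and C run concurrently
    fut_c = pool.submit(task_c, a)
    result = task_d(fut_b.result(), fut_c.result())

print(result)  # → ['raw-features', 'raw-stats']
```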

Conditional Pipeline

         ┌──▶ B (if condition)
A ──▶ ◆──┤
         └──▶ C (else)

Different paths based on results. Flexible workflows.

Fan-Out/Fan-In

    ┌──▶ B1 ──┐
    │         │
A ──┼──▶ B2 ──┼──▶ C
    │         │
    └──▶ B3 ──┘

Process data in parallel, then aggregate. Common for distributed training.
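A minimal fan-out/fan-in sketch: the data is split into chunks (A), each chunk is processed independently in parallel (B1..B3), and the partial results are aggregated (C). The chunk contents and per-chunk work are illustrative stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    return sum(chunk)            # B1/B2/B3: independent per-chunk work

def aggregate(partials):
    return sum(partials)         # C: combine the partial results

chunks = [[1, 2], [3, 4], [5, 6]]    # A: data split into chunks
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))  # fan-out

print(aggregate(partials))  # fan-in → 21
```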

ML-Specific Pipeline Stages

Data Pipeline

┌─────────────────────────────────────────────────────────┐
│                    Data Pipeline                        │
├─────────────┬─────────────┬─────────────┬──────────────┤
│  Extract    │  Transform  │  Validate   │  Load        │
│  (sources)  │  (clean)    │  (quality)  │  (store)     │
└─────────────┴─────────────┴─────────────┴──────────────┘

Training Pipeline

┌─────────────────────────────────────────────────────────┐
│                   Training Pipeline                      │
├─────────────┬─────────────┬─────────────┬──────────────┤
│  Feature    │  Train      │  Validate   │  Register    │
│  Engineer   │  Model      │  Model      │  Model       │
└─────────────┴─────────────┴─────────────┴──────────────┘

Inference Pipeline

┌─────────────────────────────────────────────────────────┐
│                   Inference Pipeline                     │
├─────────────┬─────────────┬─────────────┬──────────────┤
│  Preprocess │  Load       │  Predict    │  Postprocess │
│  Input      │  Model      │             │  Output      │
└─────────────┴─────────────┴─────────────┴──────────────┘
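The inference stages compose naturally into a single function. This is a sketch with a placeholder model; in practice the model would be loaded once (from a registry or file) rather than per request:

```python
# Minimal inference pipeline: preprocess -> predict -> postprocess.
def preprocess(raw):
    return [float(x) for x in raw]        # e.g. parse and scale input

def load_model():
    return lambda features: sum(features)  # placeholder for a real model

def postprocess(score):
    return {"score": score, "label": "high" if score > 5 else "low"}

def predict_pipeline(raw_input):
    model = load_model()
    features = preprocess(raw_input)
    return postprocess(model(features))

print(predict_pipeline(["2", "4"]))  # → {'score': 6.0, 'label': 'high'}
```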

Scheduling Strategies

Strategy     When to Use          Example
Time-based   Regular intervals    Daily retraining at 2 AM
Event-based  On data arrival      New batch file triggers pipeline
Manual       Ad-hoc experiments   Triggered by data scientist
Continuous   Stream processing    Real-time feature updates
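An event-based trigger can be as simple as polling a landing directory and running the pipeline once per newly arrived file. The directory layout, `*.csv` pattern, and `run_pipeline` callback here are assumptions for illustration:

```python
from pathlib import Path

# Minimal event-based trigger: each unseen file in the landing
# directory fires the pipeline exactly once.
def poll_for_new_files(landing_dir, seen, run_pipeline):
    for path in sorted(Path(landing_dir).glob("*.csv")):
        if path.name not in seen:
            seen.add(path.name)
            run_pipeline(path)   # new batch file triggers the pipeline
```

In production this loop would run on a schedule or be replaced by a filesystem/object-store notification, but the dedup-by-seen-set idea is the same.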

Orchestration vs Workflow

Orchestration        Workflow
Manages execution    Defines steps
Handles failures     Specifies logic
Schedules runs       Declares dependencies
Monitors progress    Holds configuration

Choosing an Orchestrator

Tool            Best For                  Complexity
DVC             Simple ML pipelines       Low
Airflow         Data engineering + ML     Medium
Prefect         Modern Python workflows   Medium
Kubeflow        Kubernetes-native ML      High
Argo Workflows  Generic Kubernetes jobs   High

Error Handling Patterns

Retry

# Retry failed tasks automatically
task_config = {
    "retries": 3,
    "retry_delay": "5m"
}
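The retry config above could be implemented with a small wrapper like this (a sketch, not any particular orchestrator's API; the delay is a plain sleep):

```python
import time

# Re-run a failing task up to `retries` times with a fixed delay.
def with_retries(task, retries=3, retry_delay=0.0):
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise            # out of retries: surface the failure
            time.sleep(retry_delay)
```

Orchestrators like Airflow expose the same idea declaratively, so transient failures (a flaky network call, a preempted worker) heal without human intervention.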

Fallback

# Use an alternative path on failure (function names are illustrative)
try:
    model = train_new_model()
except TrainingError:
    model = load_previous_model()   # fall back to the last good model

Alerting

# Notify on failure
on_failure_callback = send_slack_alert

Best Practices

Practice              Why
Idempotent tasks      Safe to retry without side effects
Small, focused tasks  Easier to debug and rerun
Version everything    Reproduce any pipeline run
Log extensively       Debug failures quickly
Test locally first    Catch errors before deployment
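Idempotency often comes down to writing outputs to a deterministic location and skipping work that is already done. A minimal sketch (the write-to-temp-then-rename step is one common way to avoid publishing half-written files):

```python
from pathlib import Path

# Idempotent task: skip if the output already exists, so a retry
# cannot duplicate side effects or recompute finished work.
def materialize(out_path, compute):
    out = Path(out_path)
    if out.exists():             # already produced: safe to skip
        return out
    tmp = out.with_suffix(".tmp")
    tmp.write_text(compute())    # do the work to a temporary file
    tmp.rename(out)              # publish only a complete result
    return out
```

Running `materialize` twice with the same path does the expensive `compute` only once, which is exactly what makes automatic retries safe.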

Key insight: A well-designed pipeline turns a manual, error-prone process into a reliable, automated system. Start simple, then add complexity as needed.

Next, we'll dive into Kubeflow Pipelines for Kubernetes-native ML orchestration.
