ML Workflow Orchestration
Pipeline Concepts
3 min read
ML pipelines automate the flow from raw data to deployed model. Understanding pipeline fundamentals helps you choose the right orchestration tool.
What is an ML Pipeline?
An ML pipeline is a sequence of automated steps:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Ingest    │───▶│  Transform  │───▶│  Validate   │
│    Data     │    │    Data     │    │    Data     │
└─────────────┘    └─────────────┘    └─────────────┘
                                             │
                                             ▼
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Deploy    │◀───│  Evaluate   │◀───│   Train     │
│   Model     │    │   Model     │    │   Model     │
└─────────────┘    └─────────────┘    └─────────────┘
DAG: The Core Abstraction
Pipelines are modeled as Directed Acyclic Graphs (DAGs):
- Directed: Tasks flow in one direction
- Acyclic: No circular dependencies
- Graph: Tasks (nodes) connected by dependencies (edges)
# Conceptual DAG structure: each task maps to the tasks it depends on
dag = {
    "ingest": [],                 # No dependencies
    "transform": ["ingest"],      # Depends on ingest
    "validate": ["transform"],    # Depends on transform
    "train": ["validate"],        # Depends on validate
    "evaluate": ["train"],        # Depends on train
    "deploy": ["evaluate"],       # Depends on evaluate
}
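To see what an orchestrator actually does with this structure, here is a minimal sketch using Python's standard-library graphlib (3.9+) to compute a valid execution order; run_task is a hypothetical placeholder for real task logic:

# Executing the DAG in dependency order (sketch, Python 3.9+)
from graphlib import TopologicalSorter

dag = {"ingest": [], "transform": ["ingest"], "validate": ["transform"],
       "train": ["validate"], "evaluate": ["train"], "deploy": ["evaluate"]}

def run_task(name):   # placeholder for real work
    print(f"running {name}")

for task in TopologicalSorter(dag).static_order():
    run_task(task)    # ingest, transform, validate, train, evaluate, deploy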
Key Pipeline Components
| Component | Purpose | Example |
|---|---|---|
| Task | Single unit of work | Train model, validate data |
| Dependency | Order of execution | Train after validate |
| Trigger | What starts the pipeline | Schedule, data arrival |
| Artifact | Output passed between tasks | Processed dataset, model |
| Parameter | Configuration values | Learning rate, epochs |
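To tie the table together, here is a hypothetical task: the function is the task, its keyword arguments are parameters, its input path and return value are artifacts, and whatever invokes it is the trigger.

# Hypothetical task illustrating the components above
def train_model(dataset_path, learning_rate=0.01, epochs=10):
    # dataset_path is an input artifact; learning_rate and epochs
    # are parameters; the returned path is the output artifact
    print(f"training on {dataset_path} (lr={learning_rate}, epochs={epochs})")
    return "model.pkl"

model_artifact = train_model("processed/dataset.parquet")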
Pipeline Patterns
Sequential Pipeline
A ──▶ B ──▶ C ──▶ D
Each step waits for the previous one. Simple but slow.
Parallel Pipeline
    ┌──▶ B ──┐
A ──┤        ├──▶ D
    └──▶ C ──┘
Independent tasks run simultaneously. Faster execution.
Conditional Pipeline
         ┌──▶ B (if condition)
A ──▶ ◆──┤
         └──▶ C (else)
Different paths based on results. Flexible workflows.
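In code, the ◆ decision node is just a branch on an upstream result. A minimal sketch with hypothetical task names and an assumed accuracy threshold:

# The decision node as a branch on an upstream result (sketch)
def next_step(metrics, threshold=0.9):
    if metrics["accuracy"] >= threshold:   # B: model clears the quality bar
        return "deploy_model"
    return "retrain_with_more_data"        # C: quality gate failed

print(next_step({"accuracy": 0.93}))       # -> deploy_model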
Fan-Out/Fan-In
    ┌──▶ B1 ──┐
    │         │
A ──┼──▶ B2 ──┼──▶ C
    │         │
    └──▶ B3 ──┘
Process data in parallel, then aggregate. Common for distributed training.
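A standard-library sketch of the pattern: fan per-shard work out to a pool, then fan back in to aggregate. process_shard stands in for real per-shard computation:

# Fan-out/fan-in with the standard library (sketch)
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):                  # B1, B2, B3: independent work
    return sum(shard)

shards = [[1, 2], [3, 4], [5, 6]]          # A: split the input

with ThreadPoolExecutor() as pool:         # fan out
    partials = list(pool.map(process_shard, shards))

print(sum(partials))                       # C: fan in / aggregate -> 21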
ML-Specific Pipeline Stages
Data Pipeline
┌────────────────────────────────────────────────────────┐
│                     Data Pipeline                      │
├─────────────┬─────────────┬─────────────┬──────────────┤
│   Extract   │  Transform  │  Validate   │     Load     │
│  (sources)  │   (clean)   │  (quality)  │   (store)    │
└─────────────┴─────────────┴─────────────┴──────────────┘
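Wired together in plain Python, the four stages might look like this. The CSV file names and the strip-whitespace "cleaning" are illustrative assumptions, and the last line expects a raw.csv next to the script:

# Sketch of the four data-pipeline stages as plain functions
import csv

def extract(path):                     # Extract: read from a source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):                   # Transform: clean each field
    return [{k: v.strip() for k, v in row.items()} for row in rows]

def validate(rows):                    # Validate: basic quality check
    if not rows:
        raise ValueError("dataset is empty")
    return rows

def load(rows, path):                  # Load: write to a store
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(validate(transform(extract("raw.csv"))), "clean.csv")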
Training Pipeline
┌────────────────────────────────────────────────────────┐
│                   Training Pipeline                    │
├─────────────┬─────────────┬─────────────┬──────────────┤
│   Feature   │    Train    │  Validate   │   Register   │
│  Engineer   │    Model    │    Model    │    Model     │
└─────────────┴─────────────┴─────────────┴──────────────┘
Inference Pipeline
┌────────────────────────────────────────────────────────┐
│                   Inference Pipeline                   │
├─────────────┬─────────────┬─────────────┬──────────────┤
│ Preprocess  │    Load     │   Predict   │ Postprocess  │
│   Input     │    Model    │             │   Output     │
└─────────────┴─────────────┴─────────────┴──────────────┘
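The same idea for inference, as a hedged sketch: it assumes a pickled model exposing a scikit-learn-style predict() that returns a numeric score, and the preprocessing and label logic are placeholders.

# Sketch of the inference stages
import pickle

def preprocess(raw):                       # Preprocess Input
    return [float(x) for x in raw]

def load_model(path="model.pkl"):          # Load Model
    with open(path, "rb") as f:
        return pickle.load(f)

def postprocess(score):                    # Postprocess Output
    return {"label": "positive" if score > 0.5 else "negative"}

model = load_model()
features = preprocess(["0.4", "1.2"])
score = model.predict([features])[0]       # Predict
print(postprocess(score))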
Scheduling Strategies
| Strategy | When to Use | Example |
|---|---|---|
| Time-based | Regular intervals | Daily retraining at 2 AM |
| Event-based | On data arrival | New batch file triggers pipeline |
| Manual | Ad-hoc experiments | Triggered by data scientist |
| Continuous | Stream processing | Real-time feature updates |
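To make event-based triggering concrete, here is the simplest possible form, a polling loop. The incoming/ directory and run_pipeline are hypothetical; real orchestrators offer sensors or event hooks that do this more robustly.

# Hypothetical event-based trigger: poll a drop directory and start
# the pipeline when a new batch file appears
import time
from pathlib import Path

def run_pipeline(path):                    # placeholder for the real DAG run
    print(f"pipeline triggered by {path}")

seen = set()
while True:                                # runs forever; stop with Ctrl-C
    for path in Path("incoming").glob("*.csv"):
        if path not in seen:
            seen.add(path)
            run_pipeline(path)
    time.sleep(60)                         # check once a minute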
Orchestration vs Workflow
| Orchestration | Workflow |
|---|---|
| Manages execution | Defines steps |
| Handles failures | Specifies logic |
| Schedules runs | Declares dependencies |
| Monitors progress | Holds configuration |
Choosing an Orchestrator
| Tool | Best For | Complexity |
|---|---|---|
| DVC | Simple ML pipelines | Low |
| Airflow | Data engineering + ML | Medium |
| Prefect | Modern Python workflows | Medium |
| Kubeflow | Kubernetes-native ML | High |
| Argo Workflows | Generic Kubernetes jobs | High |
Error Handling Patterns
Retry
# Retry failed tasks automatically
task_config = {
    "retries": 3,
    "retry_delay": "5m",
}
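Conceptually, a config like this expands to a loop around the task. A minimal sketch of what the policy does under the hood, with task as a stand-in for the real work:

# What a retry policy does under the hood (sketch)
import time

def run_with_retries(task, retries=3, retry_delay_s=300):
    for attempt in range(1, retries + 1):  # attempt the task up to `retries` times
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                raise                      # out of retries: fail the run
            print(f"attempt {attempt} failed ({exc}); retrying")
            time.sleep(retry_delay_s)      # "5m" == 300 seconds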
Fallback
# Use alternative path on failure
if model_training_failed:
    use_previous_model()
else:
    use_new_model()
Alerting
# Notify on failure
on_failure_callback = send_slack_alert
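For illustration, one plausible implementation of that callback using only the standard library; the webhook URL is a placeholder you would replace with your own Slack incoming webhook.

# Possible send_slack_alert implementation (sketch): POST the failure
# context to a Slack incoming webhook
import json
import urllib.request

def send_slack_alert(context):
    payload = {"text": f"Pipeline failed: {context}"}
    req = urllib.request.Request(
        "https://hooks.slack.com/services/...",   # placeholder webhook URL
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)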
Best Practices
| Practice | Why |
|---|---|
| Idempotent tasks | Safe to retry without side effects |
| Small, focused tasks | Easier to debug and rerun |
| Version everything | Reproduce any pipeline run |
| Log extensively | Debug failures quickly |
| Test locally first | Catch errors before deployment |
Key insight: A well-designed pipeline turns a manual, error-prone process into a reliable, automated system. Start simple, then add complexity as needed.
Next, we'll dive into Kubeflow Pipelines for Kubernetes-native ML orchestration.