ETL Pipelines & Orchestration

ETL vs ELT Patterns

Understanding data integration patterns is fundamental to data engineering interviews. Be prepared to discuss when to use each approach and the trade-offs involved.

ETL: Extract, Transform, Load

Traditional approach where transformation happens before loading.

┌─────────┐    ┌──────────────┐    ┌─────────────┐
│ Sources │───▶│ Staging Area │───▶│  Warehouse  │
└─────────┘    │ (Transform)  │    └─────────────┘
               └──────────────┘

Characteristics

Aspect              | ETL
--------------------|------------------------------
Transform Location  | Dedicated ETL server/cluster
Compute             | Separate from warehouse
Data in Warehouse   | Clean, transformed only
Best For            | On-premise, legacy systems
Tools               | Informatica, Talend, SSIS

When to Use ETL

  • On-premise warehouses: Limited warehouse compute
  • Sensitive data: Need to filter/mask before loading (see the sketch after this list)
  • Data quality gates: Must validate before warehouse
  • Compliance requirements: Transform in controlled environment
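
For the sensitive-data case, a light masking step can run on the ETL server so raw PII never reaches the warehouse. Below is a minimal pandas sketch; the email and ssn column names are assumptions:

import hashlib

import pandas as pd

def mask_pii(df: pd.DataFrame) -> pd.DataFrame:
    """Mask or drop PII before load (column names are illustrative)."""
    return df.assign(
        # One-way hash keeps emails joinable without exposing the raw value
        email=df['email'].map(lambda e: hashlib.sha256(e.encode()).hexdigest())
    ).drop(columns=['ssn'])  # fields the warehouse must never see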

Example ETL Pipeline

# Traditional ETL with pandas
# (source_conn / warehouse_conn are assumed pre-configured SQLAlchemy connections)
import pandas as pd

# Extract
source_df = pd.read_sql("SELECT * FROM source_orders", source_conn)

# Transform (on ETL server)
transformed_df = (
    source_df
    .drop_duplicates(subset=['order_id'])
    .assign(
        order_date=lambda x: pd.to_datetime(x['order_date']),
        total_amount=lambda x: x['quantity'] * x['unit_price']
    )
    .query('order_status != "cancelled"')
)

# Load (index=False keeps the pandas index out of the warehouse table)
transformed_df.to_sql('dim_orders', warehouse_conn, if_exists='append', index=False)

ELT: Extract, Load, Transform

Modern approach leveraging cloud warehouse compute.

┌─────────┐    ┌─────────────────────────────────┐
│ Sources │───▶│            Warehouse            │
└─────────┘    │ ┌─────────┐    ┌─────────────┐  │
               │ │  Raw    │───▶│ Transformed │  │
               │ └─────────┘    └─────────────┘  │
               └─────────────────────────────────┘

Characteristics

Aspect              | ELT
--------------------|------------------------------
Transform Location  | Inside the warehouse
Compute             | Warehouse MPP engine
Data in Warehouse   | Raw + transformed
Best For            | Cloud warehouses
Tools               | dbt, Dataform, warehouse SQL

When to Use ELT

  • Cloud warehouses: Abundant, scalable compute
  • Flexibility needed: Transform requirements change often
  • Data exploration: Analysts need raw data access
  • Schema-on-read: Define structure at query time

Example ELT Pipeline

-- Extract & Load: Raw data lands in staging
-- (Done by ingestion tool: Fivetran, Airbyte, custom)

-- Transform in warehouse with dbt
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        quantity,
        unit_price,
        quantity * unit_price AS total_amount,
        order_status
    FROM source
    WHERE order_status != 'cancelled'
    -- Keep only the latest version of each order
    -- (QUALIFY is Snowflake/BigQuery/DuckDB syntax)
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id ORDER BY updated_at DESC
    ) = 1
)
SELECT * FROM cleaned

Comparison Matrix

Factor                 | ETL                      | ELT
-----------------------|--------------------------|--------------------------
Transformation Compute | ETL server               | Warehouse
Raw Data Storage       | Not in warehouse         | In warehouse
Latency                | Higher (extra hop)       | Lower (direct load)
Flexibility            | Lower (fixed transforms) | Higher (re-transform)
Cost Model             | ETL tool licensing       | Warehouse compute
Data Volume            | Better for smaller       | Better for larger
Skill Set              | ETL tool expertise       | SQL expertise

Hybrid Patterns

EtLT: Extract, small Transform, Load, Transform

Common in practice—light preprocessing before load.

# Extract with light transform
# (extract_from_source / load_to_warehouse are placeholder helper functions)
from datetime import datetime, timezone

raw_df = extract_from_source()

# Light transform (t) - just enough to load
light_transformed = (
    raw_df
    .rename(columns=str.lower)  # Standardize column names
    .assign(
        _loaded_at=datetime.now(timezone.utc),  # timezone-aware; utcnow() is deprecated
        _source='salesforce'
    )
)

# Load to raw layer
load_to_warehouse(light_transformed, 'raw.salesforce_accounts')

# Heavy Transform (T) in warehouse with dbt
# dbt run --select staging.stg_accounts

Reverse ETL

Push transformed data back to operational systems.

┌───────────┐    ┌─────────────┐    ┌───────────┐
│ Warehouse │───▶│ Census/     │───▶│ Salesforce│
│  (Gold)   │    │ Hightouch   │    │ HubSpot   │
└───────────┘    └─────────────┘    └───────────┘

Use Cases:

  • Sync customer segments to marketing tools
  • Push ML scores to CRM (see the sketch after this list)
  • Update operational dashboards
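
Dedicated tools handle retries, batching, and field mapping, but a hand-rolled sync has the same shape. A minimal sketch; warehouse_conn, the endpoint URL, and the field names are assumptions:

import pandas as pd
import requests

# Read the curated (gold) segments from the warehouse
segments = pd.read_sql(
    'SELECT customer_id, segment, churn_score FROM gold.customer_segments',
    warehouse_conn,
)

# Upsert one record per customer into the operational system
for record in segments.to_dict(orient='records'):
    resp = requests.patch(
        f"https://api.example-crm.com/contacts/{record['customer_id']}",
        json={'segment': record['segment'], 'churn_score': record['churn_score']},
        timeout=10,
    )
    resp.raise_for_status()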

Interview Questions

Question: "When would you choose ETL over ELT?"

Answer Framework:

Choose ETL When                           | Choose ELT When
------------------------------------------|------------------------------------------
On-premise warehouse with limited compute | Cloud warehouse with scalable compute
Must filter sensitive data before loading | Need raw data for ad-hoc analysis
Strict data quality gates required        | Transform requirements change frequently
Legacy systems integration                | Modern data stack (dbt, Snowflake)

Question: "Design a data pipeline for real-time e-commerce analytics"

Answer Structure:

1. Source: Transaction database (CDC with Debezium)
2. Stream: Kafka for real-time events
3. Load: Kafka Connect → Snowflake raw tables (see the sketch after this list)
4. Transform: dbt models (staged → fact/dim)
5. Serve: BI dashboard with real-time layer

Key decisions:
- ELT pattern for flexibility
- CDC for low-latency capture
- Streaming for time-sensitive metrics
- Batch transforms for historical aggregates
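
To make the load step concrete: in practice Kafka Connect or Snowpipe Streaming does this, but a hand-rolled consumer shows the mechanics. A sketch only; the topic name and the load_to_raw_table helper are assumptions:

import json

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'broker:9092',
    'group.id': 'orders-raw-loader',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # commit manually, only after a durable load
})
consumer.subscribe(['orders.cdc'])

batch = []
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))
    if len(batch) >= 500:            # micro-batch to amortize insert overhead
        load_to_raw_table(batch)     # placeholder warehouse loader
        consumer.commit()            # offsets advance only after the load succeeds
        batch = []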

Data Pipeline Design Principles

1. Idempotency

Pipelines should produce the same result when run multiple times.

-- Bad: Append without dedup (reruns create duplicates)
INSERT INTO target SELECT * FROM source;

-- Good: Merge/upsert (id and amount are placeholder columns)
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
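
The same guarantee in Python: delete-then-insert for one partition inside a single transaction, so reruns converge to the same state. A sketch; the table and column names are assumptions:

import sqlite3  # stand-in for any DB-API connection

def load_partition(conn, rows, ds):
    """Idempotent daily load: rerunning for the same ds yields the same rows."""
    with conn:  # delete + insert commit (or roll back) as one transaction
        conn.execute('DELETE FROM fact_orders WHERE ds = ?', (ds,))
        conn.executemany(
            'INSERT INTO fact_orders (order_id, amount, ds) VALUES (?, ?, ?)',
            [(r['order_id'], r['amount'], ds) for r in rows],
        )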

2. Incremental Processing

Process only new/changed data when possible.

-- Process only new records; COALESCE handles the first run,
-- when the target is empty and MAX(updated_at) is NULL
SELECT * FROM source
WHERE updated_at > COALESCE(
    (SELECT MAX(updated_at) FROM target),
    '1900-01-01'  -- cast the fallback literal to your timestamp type if needed
);
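
The same high-watermark pattern driven from Python, handling the empty-target first run explicitly. A sketch; the table names are assumptions and the ? placeholder style is driver-specific:

import pandas as pd

def extract_incremental(source_conn, target_conn):
    # Read the current high watermark from the target (None on first run)
    wm = pd.read_sql('SELECT MAX(updated_at) AS wm FROM target', target_conn)['wm'].iloc[0]
    if wm is None:
        return pd.read_sql('SELECT * FROM source', source_conn)  # first run: full load
    return pd.read_sql(
        'SELECT * FROM source WHERE updated_at > ?',  # placeholder style varies by driver
        source_conn,
        params=(wm,),
    )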

3. Schema Evolution

Handle source schema changes gracefully.

# Detect columns present in the source but missing from the target
# (get_target_columns / add_column_to_target are placeholder helpers)
source_cols = set(source_df.columns)
target_cols = set(get_target_columns())
new_cols = source_cols - target_cols

# Evolve the target before loading; a no-op when new_cols is empty
for col in new_cols:
    add_column_to_target(col)
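
One way those placeholder helpers might be implemented, assuming warehouse_conn is a connection that supports conn.execute (as in sqlite3 or DuckDB) and that new columns can land as a permissive VARCHAR to be tightened later:

def get_target_columns(table='target'):
    # A zero-row SELECT exposes the column names via cursor.description
    cur = warehouse_conn.execute(f'SELECT * FROM {table} WHERE 1 = 0')
    return [desc[0] for desc in cur.description]

def add_column_to_target(col, table='target'):
    # NOTE: col comes from a trusted schema diff, never from user input
    warehouse_conn.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" VARCHAR')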

Interview Insight: Modern data engineering favors ELT for cloud warehouses, but understand both patterns. Many organizations have hybrid approaches, and legacy systems still use ETL.

Next, we'll explore pipeline orchestration with Airflow, Prefect, and Dagster.
