ETL Pipelines & Orchestration
ETL vs ELT Patterns
Understanding data integration patterns is fundamental to data engineering interviews. Be prepared to discuss when to use each approach and the trade-offs involved.
ETL: Extract, Transform, Load
Traditional approach where transformation happens before loading.
┌─────────┐    ┌──────────────┐    ┌─────────────┐
│ Sources │───▶│ Staging Area │───▶│  Warehouse  │
└─────────┘    │ (Transform)  │    └─────────────┘
               └──────────────┘
Characteristics
| Aspect | ETL |
|---|---|
| Transform Location | Dedicated ETL server/cluster |
| Compute | Separate from warehouse |
| Data in Warehouse | Clean, transformed only |
| Best For | On-premise, legacy systems |
| Tools | Informatica, Talend, SSIS |
When to Use ETL
- On-premise warehouses: Limited warehouse compute
- Sensitive data: Need to filter/mask before loading
- Data quality gates: Must validate before warehouse
- Compliance requirements: Transform in controlled environment
Example ETL Pipeline
# Traditional ETL with pandas
# (source_conn and warehouse_conn are database connections created elsewhere)
import pandas as pd
# Extract
source_df = pd.read_sql("SELECT * FROM source_orders", source_conn)
# Transform (on ETL server)
transformed_df = (
    source_df
    .drop_duplicates(subset=['order_id'])
    .assign(
        order_date=lambda x: pd.to_datetime(x['order_date']),
        total_amount=lambda x: x['quantity'] * x['unit_price']
    )
    .query('order_status != "cancelled"')
)
# Load
transformed_df.to_sql('dim_orders', warehouse_conn, if_exists='append', index=False)
ELT: Extract, Load, Transform
Modern approach leveraging cloud warehouse compute.
┌─────────┐    ┌─────────────────────────────────┐
│ Sources │───▶│            Warehouse            │
└─────────┘    │ ┌─────────┐     ┌─────────────┐ │
               │ │   Raw   │────▶│ Transformed │ │
               │ └─────────┘     └─────────────┘ │
               └─────────────────────────────────┘
Characteristics
| Aspect | ELT |
|---|---|
| Transform Location | Inside the warehouse |
| Compute | Warehouse MPP engine |
| Data in Warehouse | Raw + transformed |
| Best For | Cloud warehouses |
| Tools | dbt, Dataform, warehouse SQL |
When to Use ELT
- Cloud warehouses: Abundant, scalable compute
- Flexibility needed: Transform requirements change often
- Data exploration: Analysts need raw data access
- Schema-on-read: Define structure at query time
Example ELT Pipeline
-- Extract & Load: Raw data lands in staging
-- (Done by ingestion tool: Fivetran, Airbyte, custom)
-- Transform in warehouse with dbt
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        quantity,
        unit_price,
        quantity * unit_price AS total_amount,
        order_status
    FROM source
    WHERE order_status != 'cancelled'
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id ORDER BY updated_at DESC
    ) = 1
)
SELECT * FROM cleaned
Comparison Matrix
| Factor | ETL | ELT |
|---|---|---|
| Transformation Compute | ETL server | Warehouse |
| Raw Data Storage | Not in warehouse | In warehouse |
| Latency | Higher (extra hop) | Lower (direct load) |
| Flexibility | Lower (fixed transforms) | Higher (re-transform) |
| Cost Model | ETL tool licensing | Warehouse compute |
| Data Volume | Better for smaller | Better for larger |
| Skill Set | ETL tool expertise | SQL expertise |
Hybrid Patterns
EtLT: Extract, small Transform, Load, Transform
Common in practice: light preprocessing (for example, standardizing column names or adding load metadata) happens before the load, while the heavy transformation still runs in the warehouse.
from datetime import datetime
# Extract (extract_from_source / load_to_warehouse are placeholder helpers)
raw_df = extract_from_source()
# Light transform (t) - just enough to load
light_transformed = (
    raw_df
    .rename(columns=str.lower)  # Standardize column names
    .assign(
        _loaded_at=datetime.utcnow(),
        _source='salesforce'
    )
)
# Load to raw layer
load_to_warehouse(light_transformed, 'raw.salesforce_accounts')
# Heavy Transform (T) in warehouse with dbt
# dbt run --models staging.stg_accounts
Reverse ETL
Push transformed data back to operational systems.
┌───────────┐    ┌─────────────┐    ┌────────────┐
│ Warehouse │───▶│   Census/   │───▶│ Salesforce │
│  (Gold)   │    │  Hightouch  │    │  HubSpot   │
└───────────┘    └─────────────┘    └────────────┘
Use Cases:
- Sync customer segments to marketing tools
- Push ML scores to CRM
- Update operational dashboards
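A minimal sketch of what a reverse ETL sync does under the hood, assuming a hypothetical CRM endpoint and a gold.customer_segments table (in practice, tools like Census or Hightouch manage this declaratively):
import pandas as pd
import requests
def sync_segment_to_crm(warehouse_conn, api_token):
    # Read the curated segment from the warehouse gold layer
    segment = pd.read_sql(
        "SELECT customer_id, email, segment FROM gold.customer_segments",
        warehouse_conn
    )
    # Push each record to the operational system (hypothetical endpoint)
    for record in segment.to_dict(orient='records'):
        requests.post(
            "https://api.example-crm.com/v1/contacts",
            json=record,
            headers={"Authorization": f"Bearer {api_token}"},
            timeout=10
        )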
Interview Questions
Question: "When would you choose ETL over ELT?"
Answer Framework:
| Choose ETL When | Choose ELT When |
|---|---|
| On-premise warehouse with limited compute | Cloud warehouse with scalable compute |
| Must filter sensitive data before loading | Need raw data for ad-hoc analysis |
| Strict data quality gates required | Transform requirements change frequently |
| Legacy systems integration | Modern data stack (dbt, Snowflake) |
Question: "Design a data pipeline for real-time e-commerce analytics"
Answer Structure:
1. Source: Transaction database (CDC with Debezium)
2. Stream: Kafka for real-time events
3. Load: Kafka Connect → Snowflake (raw tables)
4. Transform: dbt models (staged → fact/dim)
5. Serve: BI dashboard with real-time layer
Key decisions:
- ELT pattern for flexibility
- CDC for low-latency capture
- Streaming for time-sensitive metrics
- Batch transforms for historical aggregates
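As a rough illustration of the streaming leg (step 2), here is a sketch of a consumer that maintains a per-minute revenue metric; the topic name, broker address, and event fields are assumptions, and kafka-python is used for brevity:
import json
from collections import defaultdict
from kafka import KafkaConsumer  # kafka-python
# Consume raw order events and keep a rolling revenue-per-minute aggregate
consumer = KafkaConsumer(
    'orders',                               # hypothetical topic
    bootstrap_servers='localhost:9092',     # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)
revenue_per_minute = defaultdict(float)
for msg in consumer:
    event = msg.value
    minute = event['order_ts'][:16]         # e.g. '2024-01-01T12:34'
    revenue_per_minute[minute] += event['quantity'] * event['unit_price']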
Data Pipeline Design Principles
1. Idempotency
Pipelines should produce the same result when run multiple times.
-- Bad: Append without dedup
INSERT INTO target SELECT * FROM source;
-- Good: Merge/upsert
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
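Where MERGE is unavailable, a delete-then-insert by partition gives the same guarantee. A minimal sketch, using sqlite3 purely for illustration (the table and column names are assumptions):
import sqlite3
import pandas as pd
def load_orders_for_day(conn: sqlite3.Connection, day_df: pd.DataFrame, ds: str) -> None:
    # Re-running for the same day deletes and re-inserts the same rows,
    # so a retry cannot double-count
    with conn:  # one transaction: wipe the partition, then re-insert it
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (ds,))
        day_df.to_sql('fact_orders', conn, if_exists='append', index=False)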
2. Incremental Processing
Process only new/changed data when possible.
-- Process only new records
SELECT * FROM source
WHERE updated_at > (SELECT MAX(updated_at) FROM target);
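The same high-watermark idea sketched in Python; the table and column names follow the earlier examples, and the parameter style depends on your database driver:
import pandas as pd
def incremental_extract(source_conn, warehouse_conn):
    # Read the current high-water mark from the target table
    wm = pd.read_sql("SELECT MAX(updated_at) AS wm FROM dim_orders", warehouse_conn)['wm'].iloc[0]
    if pd.isna(wm):
        wm = '1970-01-01'  # first run: take everything
    # Pull only rows changed since the last successful load
    return pd.read_sql(
        "SELECT * FROM source_orders WHERE updated_at > %(wm)s",
        source_conn,
        params={'wm': wm}
    )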
3. Schema Evolution
Handle source schema changes gracefully.
# Detect new columns
source_cols = set(source_df.columns)
target_cols = set(get_target_columns())
new_cols = source_cols - target_cols
if new_cols:
    for col in new_cols:
        add_column_to_target(col)
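One possible implementation of add_column_to_target (an assumption; the helper is not defined above, and this sketch adds connection and table parameters) issues an ALTER TABLE, landing new columns as nullable text so types can be tightened downstream:
import sqlite3
def add_column_to_target(col, conn: sqlite3.Connection, table: str = 'dim_orders'):
    # Only pass trusted, validated column names into the DDL statement
    with conn:
        conn.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" TEXT')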
Interview Insight: Modern data engineering favors ELT for cloud warehouses, but understand both patterns. Many organizations have hybrid approaches, and legacy systems still use ETL.
Next, we'll explore pipeline orchestration with Airflow, Prefect, and Dagster.