ETL Pipelines & Orchestration
ETL vs ELT Patterns
Understanding data integration patterns is fundamental to data engineering interviews. Be prepared to discuss when to use each approach and the trade-offs involved.
ETL: Extract, Transform, Load
Traditional approach where transformation happens before loading.
┌─────────┐    ┌──────────────┐    ┌─────────────┐
│ Sources │───▶│ Staging Area │───▶│  Warehouse  │
└─────────┘    │ (Transform)  │    └─────────────┘
               └──────────────┘
Characteristics
| Aspect | ETL |
|---|---|
| Transform Location | Dedicated ETL server/cluster |
| Compute | Separate from warehouse |
| Data in Warehouse | Clean, transformed only |
| Best For | On-premise, legacy systems |
| Tools | Informatica, Talend, SSIS |
When to Use ETL
- On-premise warehouses: Limited warehouse compute
- Sensitive data: Need to filter/mask before loading
- Data quality gates: Must validate before warehouse
- Compliance requirements: Transform in controlled environment
Example ETL Pipeline
# Traditional ETL with pandas
# (source_conn and warehouse_conn are database connections created elsewhere)
import pandas as pd
# Extract
source_df = pd.read_sql("SELECT * FROM source_orders", source_conn)
# Transform (on ETL server)
transformed_df = (
    source_df
    .drop_duplicates(subset=['order_id'])
    .assign(
        order_date=lambda x: pd.to_datetime(x['order_date']),
        total_amount=lambda x: x['quantity'] * x['unit_price']
    )
    .query('order_status != "cancelled"')
)
# Load
transformed_df.to_sql('dim_orders', warehouse_conn, if_exists='append', index=False)
ELT: Extract, Load, Transform
Modern approach leveraging cloud warehouse compute.
┌─────────┐    ┌─────────────────────────────────┐
│ Sources │───▶│            Warehouse            │
└─────────┘    │ ┌─────────┐     ┌─────────────┐ │
               │ │   Raw   │────▶│ Transformed │ │
               │ └─────────┘     └─────────────┘ │
               └─────────────────────────────────┘
Characteristics
| Aspect | ELT |
|---|---|
| Transform Location | Inside the warehouse |
| Compute | Warehouse MPP engine |
| Data in Warehouse | Raw + transformed |
| Best For | Cloud warehouses |
| Tools | dbt, Dataform, warehouse SQL |
When to Use ELT
- Cloud warehouses: Abundant, scalable compute
- Flexibility needed: Transform requirements change often
- Data exploration: Analysts need raw data access
- Schema-on-read: Define structure at query time
Example ELT Pipeline
-- Extract & Load: Raw data lands in staging
-- (Done by ingestion tool: Fivetran, Airbyte, custom)
-- Transform in warehouse with dbt
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),
cleaned AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        quantity,
        unit_price,
        quantity * unit_price AS total_amount,
        order_status
    FROM source
    WHERE order_status != 'cancelled'
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id ORDER BY updated_at DESC
    ) = 1
)
SELECT * FROM cleaned
Comparison Matrix
| Factor | ETL | ELT |
|---|---|---|
| Transformation Compute | ETL server | Warehouse |
| Raw Data Storage | Not in warehouse | In warehouse |
| Latency | Higher (extra hop) | Lower (direct load) |
| Flexibility | Lower (fixed transforms) | Higher (re-transform) |
| Cost Model | ETL tool licensing | Warehouse compute |
| Data Volume | Better for smaller | Better for larger |
| Skill Set | ETL tool expertise | SQL expertise |
Hybrid Patterns
EtLT: Extract, small Transform, Load, Transform
Common in practice: light preprocessing (for example, standardizing column names or adding load metadata) happens before the load, while the heavy transformation still runs in the warehouse.
from datetime import datetime
# Extract (extract_from_source / load_to_warehouse are placeholder helpers)
raw_df = extract_from_source()
# Light transform (t) - just enough to load
light_transformed = (
    raw_df
    .rename(columns=str.lower)  # Standardize column names
    .assign(
        _loaded_at=datetime.utcnow(),
        _source='salesforce'
    )
)
# Load to raw layer
load_to_warehouse(light_transformed, 'raw.salesforce_accounts')
# Heavy Transform (T) in warehouse with dbt
# dbt run --models staging.stg_accounts
Reverse ETL
Push transformed data back to operational systems.
┌───────────┐    ┌─────────────┐    ┌────────────┐
│ Warehouse │───▶│   Census/   │───▶│ Salesforce │
│  (Gold)   │    │  Hightouch  │    │  HubSpot   │
└───────────┘    └─────────────┘    └────────────┘
Use Cases:
- Sync customer segments to marketing tools
- Push ML scores to CRM
- Update operational dashboards
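A minimal sketch of what a reverse ETL sync does under the hood, assuming a hypothetical CRM endpoint and a gold.customer_segments table (in practice, tools like Census or Hightouch manage this declaratively):
import pandas as pd
import requests
def sync_segment_to_crm(warehouse_conn, api_token):
    # Read the curated segment from the warehouse gold layer
    segment = pd.read_sql(
        "SELECT customer_id, email, segment FROM gold.customer_segments",
        warehouse_conn
    )
    # Push each record to the operational system (hypothetical endpoint)
    for record in segment.to_dict(orient='records'):
        requests.post(
            "https://api.example-crm.com/v1/contacts",
            json=record,
            headers={"Authorization": f"Bearer {api_token}"},
            timeout=10
        )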
Interview Questions
Question: "When would you choose ETL over ELT?"
Answer Framework:
| Choose ETL When | Choose ELT When |
|---|---|
| On-premise warehouse with limited compute | Cloud warehouse with scalable compute |
| Must filter sensitive data before loading | Need raw data for ad-hoc analysis |
| Strict data quality gates required | Transform requirements change frequently |
| Legacy systems integration | Modern data stack (dbt, Snowflake) |
Question: "Design a data pipeline for real-time e-commerce analytics"
Answer Structure:
1. Source: Transaction database (CDC with Debezium)
2. Stream: Kafka for real-time events
3. Load: Kafka Connect → Snowflake (raw tables)
4. Transform: dbt models (staged → fact/dim)
5. Serve: BI dashboard with real-time layer
Key decisions:
- ELT pattern for flexibility
- CDC for low-latency capture
- Streaming for time-sensitive metrics
- Batch transforms for historical aggregates
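As a rough illustration of the streaming leg (step 2), here is a sketch of a consumer that maintains a per-minute revenue metric; the topic name, broker address, and event fields are assumptions, and kafka-python is used for brevity:
import json
from collections import defaultdict
from kafka import KafkaConsumer  # kafka-python
# Consume raw order events and keep a rolling revenue-per-minute aggregate
consumer = KafkaConsumer(
    'orders',                               # hypothetical topic
    bootstrap_servers='localhost:9092',     # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode('utf-8'))
)
revenue_per_minute = defaultdict(float)
for msg in consumer:
    event = msg.value
    minute = event['order_ts'][:16]         # e.g. '2024-01-01T12:34'
    revenue_per_minute[minute] += event['quantity'] * event['unit_price']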
Data Pipeline Design Principles
1. Idempotency
Pipelines should produce the same result when run multiple times.
-- Bad: Append without dedup
INSERT INTO target SELECT * FROM source;
-- Good: Merge/upsert
MERGE INTO target t
USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
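Where MERGE is unavailable, a delete-then-insert by partition gives the same guarantee. A minimal sketch, using sqlite3 purely for illustration (the table and column names are assumptions):
import sqlite3
import pandas as pd
def load_orders_for_day(conn: sqlite3.Connection, day_df: pd.DataFrame, ds: str) -> None:
    # Re-running for the same day deletes and re-inserts the same rows,
    # so a retry cannot double-count
    with conn:  # one transaction: wipe the partition, then re-insert it
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (ds,))
        day_df.to_sql('fact_orders', conn, if_exists='append', index=False)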
2. Incremental Processing
Process only new/changed data when possible.
-- Process only new records
SELECT * FROM source
WHERE updated_at > (SELECT MAX(updated_at) FROM target);
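The same high-watermark idea sketched in Python; the table and column names follow the earlier examples, and the parameter style depends on your database driver:
import pandas as pd
def incremental_extract(source_conn, warehouse_conn):
    # Read the current high-water mark from the target table
    wm = pd.read_sql("SELECT MAX(updated_at) AS wm FROM dim_orders", warehouse_conn)['wm'].iloc[0]
    if pd.isna(wm):
        wm = '1970-01-01'  # first run: take everything
    # Pull only rows changed since the last successful load
    return pd.read_sql(
        "SELECT * FROM source_orders WHERE updated_at > %(wm)s",
        source_conn,
        params={'wm': wm}
    )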
3. Schema Evolution
Handle source schema changes gracefully.
# Detect new columns
source_cols = set(source_df.columns)
target_cols = set(get_target_columns())
new_cols = source_cols - target_cols
if new_cols:
    for col in new_cols:
        add_column_to_target(col)
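One possible implementation of add_column_to_target (an assumption; the helper is not defined above, and this sketch adds connection and table parameters) issues an ALTER TABLE, landing new columns as nullable text so types can be tightened downstream:
import sqlite3
def add_column_to_target(col, conn: sqlite3.Connection, table: str = 'dim_orders'):
    # Only pass trusted, validated column names into the DDL statement
    with conn:
        conn.execute(f'ALTER TABLE {table} ADD COLUMN "{col}" TEXT')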
Interview Insight: Modern data engineering favors ELT for cloud warehouses, but understand both patterns. Many organizations have hybrid approaches, and legacy systems still use ETL.
Next, we'll explore pipeline orchestration with Airflow, Prefect, and Dagster.