ETL Pipelines & Orchestration
dbt for Data Transformations
dbt (data build tool) has become essential for modern data engineering. Interview questions range from basic concepts to advanced patterns.
dbt Fundamentals
What dbt Does
| Capability | Description |
|---|---|
| SQL-based transforms | Write SELECT statements, dbt handles DDL |
| Dependency management | Automatic DAG from ref() |
| Testing | Built-in and custom data tests |
| Documentation | Auto-generated docs from YAML |
| Incremental models | Process only new data |
| Snapshots | SCD Type 2 tracking |
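Each capability maps onto a core CLI command; the everyday workflow looks roughly like this:
dbt run            # materialize models
dbt test           # run schema and data tests
dbt build          # run models and tests together, in DAG order
dbt snapshot       # capture SCD Type 2 snapshots
dbt docs generate  # build the documentation site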
Project Structure
dbt_project/
├── dbt_project.yml            # Project configuration
├── profiles.yml               # Connection settings (often kept in ~/.dbt/ instead)
├── models/
│   ├── staging/               # Raw → cleaned
│   │   ├── stg_orders.sql
│   │   └── _stg_models.yml
│   ├── intermediate/          # Business logic
│   │   └── int_order_items_joined.sql
│   └── marts/                 # Final tables
│       ├── core/
│       │   └── fct_orders.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/
│   └── custom_tests.sql
├── macros/
│   └── custom_macros.sql
└── seeds/
    └── country_codes.csv
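For reference, dbt_project.yml wires these folders together and sets folder-level defaults. A minimal sketch (the project and profile names here are assumptions):
# dbt_project.yml
name: 'analytics'
version: '1.0.0'
profile: 'analytics'

model-paths: ["models"]

models:
  analytics:
    staging:
      +materialized: view
    marts:
      +materialized: table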
Model Types
Staging Models
Clean and standardize raw data.
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        status AS order_status,
        total_amount,
        _loaded_at
    FROM source
)

SELECT * FROM renamed
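The source('raw', 'orders') call resolves against a source definition in YAML. A minimal sketch (the schema name and freshness threshold are assumptions):
# models/staging/_sources.yml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}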
Intermediate Models
Apply business logic and join related data.
-- models/intermediate/int_order_items_joined.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
)

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oi.product_id,
    p.product_name,
    p.category,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
Mart Models
Business-ready tables for analytics.
-- models/marts/core/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

WITH order_items AS (
    SELECT * FROM {{ ref('int_order_items_joined') }}
),

aggregated AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        COUNT(DISTINCT product_id) AS unique_products,
        SUM(quantity) AS total_items,
        SUM(line_total) AS order_total
    FROM order_items
    GROUP BY order_id, customer_id, order_date
)

SELECT * FROM aggregated

{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
Incremental Models
Process only new/changed data for efficiency.
Strategies
| Strategy | Use Case |
|---|---|
| append | Insert new rows only; no updates |
| merge | Upsert rows based on a unique key |
| delete+insert | Delete rows matching the key, then insert the new batch |
| insert_overwrite | Replace entire matching partitions |
Example: Merge Strategy
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        merge_exclude_columns=['created_at']
    )
}}

SELECT
    order_id,
    customer_id,
    order_total,
    updated_at,
    CURRENT_TIMESTAMP() AS created_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
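By default, incremental models ignore columns added to the source query after the table is first built. The on_schema_change config controls this behavior; a minimal sketch reusing the model above:
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        on_schema_change='append_new_columns'
    )
}}
The accepted values are 'ignore' (the default), 'append_new_columns', 'sync_all_columns', and 'fail'.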
Testing in dbt
Built-in Tests
# models/_schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
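Any test can be downgraded from a hard failure to a warning, which is useful while a check is still noisy. A minimal sketch:
- name: order_total
  tests:
    - not_null:
        config:
          severity: warn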
Custom Tests
Singular tests are SQL files in the tests/ directory; a test fails if its query returns any rows.
-- tests/assert_positive_amounts.sql
SELECT
    order_id,
    order_total
FROM {{ ref('fct_orders') }}
WHERE order_total < 0
Generic Tests (Macros)
-- macros/test_positive_value.sql
{% test positive_value(model, column_name) %}

SELECT
    {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
# Usage in a schema file
- name: order_total
  tests:
    - positive_value
Advanced Patterns
Jinja Macros
-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ROUND({{ column_name }} / 100.0, 2)
{% endmacro %}

-- Usage in a model
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_orders') }}
Dynamic SQL Generation
-- Generate union of monthly tables
{% set months = ['01', '02', '03', '04', '05', '06'] %}
{% for month in months %}
SELECT * FROM raw.events_2024_{{ month }}
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
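At compile time the loop expands to plain SQL; for the list above, the rendered query is:
SELECT * FROM raw.events_2024_01
UNION ALL
SELECT * FROM raw.events_2024_02
UNION ALL
-- ... continuing through ...
SELECT * FROM raw.events_2024_06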
Snapshots (SCD Type 2)
-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

SELECT
    customer_id,
    customer_name,
    segment,
    region,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
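dbt adds metadata columns (dbt_valid_from, dbt_valid_to, dbt_updated_at, dbt_scd_id) to the snapshot table; the current version of each record has a NULL dbt_valid_to:
-- Current version of each customer
SELECT customer_id, segment, region
FROM snapshots.customer_snapshot
WHERE dbt_valid_to IS NULL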
Interview Questions
Question: "Explain how dbt handles dependencies"
Answer: "dbt automatically builds a DAG from ref() and source() functions. When you write {{ ref('stg_orders') }}, dbt knows that model must run first. The DAG enables:
- Correct execution order
- Parallel execution where possible
- Selective runs (
dbt run --select stg_orders+) - Automatic documentation of lineage"
Question: "How do you optimize dbt for large datasets?"
Answer:
# 1. Use incremental models
materialized='incremental'

# 2. Partition large tables
partition_by={'field': 'event_date', 'data_type': 'date'}

# 3. Cluster for common query patterns
cluster_by=['user_id']

# 4. Defer to production artifacts during development
# dbt run --defer --state prod-artifacts/

# 5. Build only what changed with model selection
# dbt run --select staging.stg_orders+
Question: "How would you implement data quality checks in dbt?"
Answer: "I'd implement a multi-layer approach:
- Schema tests: unique, not_null, relationships, accepted_values
- Custom SQL tests: Business-specific validations
- dbt-expectations: Statistical tests (outliers, distribution)
- Pre-hooks: Validate source freshness
- Post-hooks: Log test results to monitoring
For critical pipelines, I'd use dbt build which runs tests after each model, failing fast on issues."
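A dbt-expectations check might look like this (a minimal sketch; the package is installed via packages.yml, and the bounds here are assumptions):
- name: order_total
  tests:
    - dbt_expectations.expect_column_values_to_be_between:
        min_value: 0
        max_value: 100000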
Interview Insight: dbt is now table stakes for data engineering. Know the fundamentals deeply, and be prepared to discuss how you'd structure a project for a team of 10+ engineers.
Next, we'll cover data quality and testing strategies.