ETL Pipelines & Orchestration

dbt for Data Transformations

dbt (data build tool) has become essential for modern data engineering. Interview questions range from basic concepts to advanced patterns.

dbt Fundamentals

What dbt Does

Capability              Description
SQL-based transforms    Write SELECT statements; dbt handles the DDL
Dependency management   Automatic DAG built from ref() calls
Testing                 Built-in and custom data tests
Documentation           Auto-generated docs from YAML
Incremental models      Process only new or changed data
Snapshots               SCD Type 2 change tracking

Project Structure

dbt_project/
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Connection settings
├── models/
│   ├── staging/             # Raw → cleaned
│   │   ├── stg_orders.sql
│   │   └── _stg_models.yml
│   ├── intermediate/        # Business logic
│   │   └── int_order_items.sql
│   └── marts/               # Final tables
│       ├── core/
│       │   └── fct_orders.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/
│   └── custom_tests.sql
├── macros/
│   └── custom_macros.sql
└── seeds/
    └── country_codes.csv
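
The dbt_project.yml at the root wires this structure together. A minimal sketch (the profile name my_warehouse is a placeholder for whatever your profiles.yml defines):

# dbt_project.yml (minimal sketch)
name: 'dbt_project'
version: '1.0.0'
profile: 'my_warehouse'     # placeholder; must match a profile in profiles.yml

model-paths: ["models"]

models:
  dbt_project:
    staging:
      +materialized: view   # staging models default to views
    marts:
      +materialized: table  # marts default to tables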

Model Types

Staging Models

Clean and standardize raw data.

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        status AS order_status,
        total_amount,
        _loaded_at
    FROM source
)

SELECT * FROM renamed
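
The source('raw', 'orders') call above only resolves if the source is declared in YAML, typically in the _stg_models.yml file from the project tree. A minimal sketch, assuming the raw tables live in a schema named raw:

# models/staging/_stg_models.yml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
      - name: order_items
      - name: products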

Intermediate Models

Apply business logic and join related data.

-- models/intermediate/int_order_items_joined.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
)

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oi.product_id,
    p.product_name,
    p.category,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id

Mart Models

Business-ready tables for analytics.

-- models/marts/core/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

WITH order_items AS (
    SELECT * FROM {{ ref('int_order_items_joined') }}
),

aggregated AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        COUNT(DISTINCT product_id) AS unique_products,
        SUM(quantity) AS total_items,
        SUM(line_total) AS order_total
    FROM order_items
    GROUP BY order_id, customer_id, order_date
)

SELECT * FROM aggregated

{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
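
Note that the is_incremental() filter above runs after aggregation, so the full upstream table is still scanned on every run. A common refinement (same logic, just pushed into the first CTE) prunes the scan instead:

WITH order_items AS (
    SELECT * FROM {{ ref('int_order_items_joined') }}
    {% if is_incremental() %}
    -- only pull rows newer than what this table already holds
    WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
    {% endif %}
)
-- ...aggregation as above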

Incremental Models

Process only new/changed data for efficiency.

Strategies

Strategy            Use Case
append              Insert new rows only; no updates
merge               Upsert based on a unique key
delete+insert       Delete rows matching the key, then insert the new batch
insert_overwrite    Replace entire matching partitions

Example: Merge Strategy

{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        merge_exclude_columns=['created_at']
    )
}}

SELECT
    order_id,
    customer_id,
    order_total,
    updated_at,
    CURRENT_TIMESTAMP() AS created_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
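
For the insert_overwrite strategy from the table above, dbt replaces whole partitions rather than matching on a key. A sketch in BigQuery's dialect (the stg_events model and its columns are assumed):

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

SELECT
    event_date,
    user_id,
    event_name
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
-- reprocess a trailing window; those partitions get overwritten
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}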

Testing in dbt

Built-in Tests

# models/_schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
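
These run with dbt test (or dbt build, which interleaves tests with model builds); scope them to a single model with dbt test --select stg_orders.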

Custom Tests

A singular test is a SELECT that returns the failing rows; it passes only when the query comes back empty.

-- tests/assert_positive_amounts.sql
SELECT
    order_id,
    order_total
FROM {{ ref('fct_orders') }}
WHERE order_total < 0

Generic Tests (Macros)

-- macros/test_positive_value.sql
{% test positive_value(model, column_name) %}

SELECT
    {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}

Usage in a schema file:

  - name: order_total
    tests:
      - positive_value

Advanced Patterns

Jinja Macros

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ROUND({{ column_name }} / 100.0, 2)
{% endmacro %}

-- Usage in model
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_orders') }}

Dynamic SQL Generation

-- Generate union of monthly tables
{% set months = ['01', '02', '03', '04', '05', '06'] %}

{% for month in months %}
SELECT * FROM raw.events_2024_{{ month }}
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
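
Hardcoding the month list works for a demo; if the dbt_utils package is installed, the same union can be generated from the warehouse catalog instead (a sketch, package assumed):

{% set relations = dbt_utils.get_relations_by_pattern('raw', 'events_2024_%') %}

WITH unioned AS (
    {{ dbt_utils.union_relations(relations=relations) }}
)

SELECT * FROM unioned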

Snapshots (SCD Type 2)

-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT
    customer_id,
    customer_name,
    segment,
    region,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
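
dbt adds dbt_valid_from and dbt_valid_to columns to every snapshot row, so downstream models can pick the current version of each customer with a null-end-date filter:

-- current state of each customer
SELECT
    customer_id,
    customer_name,
    segment,
    region
FROM {{ ref('customer_snapshot') }}
WHERE dbt_valid_to IS NULL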

Interview Questions

Question: "Explain how dbt handles dependencies"

Answer: "dbt automatically builds a DAG from ref() and source() functions. When you write {{ ref('stg_orders') }}, dbt knows that model must run first. The DAG enables:

  1. Correct execution order
  2. Parallel execution where possible
  3. Selective runs (dbt run --select stg_orders+)
  4. Automatic documentation of lineage"
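
The graph operators matter here: +stg_orders selects the model plus its upstream parents, stg_orders+ selects it plus everything downstream, and +stg_orders+ selects both.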

Question: "How do you optimize dbt for large datasets?"

Answer:

  1. Use incremental models: materialized='incremental'
  2. Partition large tables: partition_by={'field': 'event_date', 'data_type': 'date'}
  3. Cluster for common query patterns: cluster_by=['user_id']
  4. Defer to production artifacts during development: dbt run --defer --state prod-artifacts/
  5. Build only what's needed with model selection: dbt run --select staging.stg_orders+

Question: "How would you implement data quality checks in dbt?"

Answer: "I'd implement a multi-layer approach:

  1. Schema tests: unique, not_null, relationships, accepted_values
  2. Custom SQL tests: Business-specific validations
  3. dbt-expectations: Statistical tests (outliers, distribution)
  4. Pre-hooks: Validate source freshness
  5. Post-hooks: Log test results to monitoring

For critical pipelines, I'd use dbt build which runs tests after each model, failing fast on issues."
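
For item 4, a minimal freshness sketch against the _loaded_at column from the staging example:

# models/staging/_stg_models.yml (freshness block)
sources:
  - name: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders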

Interview Insight: dbt is now table stakes for data engineering. Know the fundamentals deeply, and be prepared to discuss how you'd structure a project for a team of 10+ engineers.

Next, we'll cover data quality and testing strategies.
