ETL Pipelines & Orchestration

dbt for Data Transformations

dbt (data build tool) has become essential for modern data engineering. Interview questions range from basic concepts to advanced patterns.

dbt Fundamentals

What dbt Does

Capability              Description
SQL-based transforms    Write SELECT statements; dbt handles the DDL
Dependency management   Automatic DAG built from ref() calls
Testing                 Built-in and custom data tests
Documentation           Auto-generated docs from YAML
Incremental models      Process only new or changed data
Snapshots               SCD Type 2 change tracking

Project Structure

dbt_project/
├── dbt_project.yml          # Project configuration
├── profiles.yml             # Connection settings
├── models/
│   ├── staging/             # Raw → cleaned
│   │   ├── stg_orders.sql
│   │   └── _stg_models.yml
│   ├── intermediate/        # Business logic
│   │   └── int_order_items.sql
│   └── marts/               # Final tables
│       ├── core/
│       │   └── fct_orders.sql
│       └── marketing/
│           └── dim_customers.sql
├── tests/
│   └── custom_tests.sql
├── macros/
│   └── custom_macros.sql
└── seeds/
    └── country_codes.csv
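
The dbt_project.yml at the root wires this structure together. A minimal sketch (the profile name my_warehouse is a placeholder for whatever your profiles.yml defines):

# dbt_project.yml (minimal sketch)
name: 'dbt_project'
version: '1.0.0'
profile: 'my_warehouse'     # placeholder; must match a profile in profiles.yml

model-paths: ["models"]

models:
  dbt_project:
    staging:
      +materialized: view   # staging models default to views
    marts:
      +materialized: table  # marts default to tables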

Model Types

Staging Models

Clean and standardize raw data.

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        CAST(order_date AS DATE) AS order_date,
        status AS order_status,
        total_amount,
        _loaded_at
    FROM source
)

SELECT * FROM renamed
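
The source('raw', 'orders') call above only resolves if the source is declared in YAML, typically in the _stg_models.yml file from the project tree. A minimal sketch, assuming the raw tables live in a schema named raw:

# models/staging/_stg_models.yml
version: 2

sources:
  - name: raw
    schema: raw
    tables:
      - name: orders
      - name: order_items
      - name: products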

Intermediate Models

Apply business logic and join related data.

-- models/intermediate/int_order_items_joined.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

order_items AS (
    SELECT * FROM {{ ref('stg_order_items') }}
),

products AS (
    SELECT * FROM {{ ref('stg_products') }}
)

SELECT
    o.order_id,
    o.customer_id,
    o.order_date,
    oi.product_id,
    p.product_name,
    p.category,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id

Mart Models

Business-ready tables for analytics.

-- models/marts/core/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        partition_by={'field': 'order_date', 'data_type': 'date'}
    )
}}

WITH order_items AS (
    SELECT * FROM {{ ref('int_order_items_joined') }}
),

aggregated AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        COUNT(DISTINCT product_id) AS unique_products,
        SUM(quantity) AS total_items,
        SUM(line_total) AS order_total
    FROM order_items
    GROUP BY order_id, customer_id, order_date
)

SELECT * FROM aggregated

{% if is_incremental() %}
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
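
Note that the is_incremental() filter above runs after aggregation, so the full upstream table is still scanned on every run. A common refinement (same logic, just pushed into the first CTE) prunes the scan instead:

WITH order_items AS (
    SELECT * FROM {{ ref('int_order_items_joined') }}
    {% if is_incremental() %}
    -- only pull rows newer than what this table already holds
    WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
    {% endif %}
)
-- ...aggregation as above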

Incremental Models

Process only new/changed data for efficiency.

Strategies

Strategy            Use Case
append              Insert new rows only; no updates
merge               Upsert based on a unique key
delete+insert       Delete rows matching the key, then insert the new batch
insert_overwrite    Replace entire matching partitions

Example: Merge Strategy

{{
    config(
        materialized='incremental',
        unique_key='order_id',
        incremental_strategy='merge',
        merge_exclude_columns=['created_at']
    )
}}

SELECT
    order_id,
    customer_id,
    order_total,
    updated_at,
    CURRENT_TIMESTAMP() AS created_at
FROM {{ ref('stg_orders') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
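
For the insert_overwrite strategy from the table above, dbt replaces whole partitions rather than matching on a key. A sketch in BigQuery's dialect (the stg_events model and its columns are assumed):

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

SELECT
    event_date,
    user_id,
    event_name
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
-- reprocess a trailing window; those partitions get overwritten
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}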

Testing in dbt

Built-in Tests

# models/_schema.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered', 'cancelled']
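
These run with dbt test (or dbt build, which interleaves tests with model builds); scope them to a single model with dbt test --select stg_orders.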

Custom Tests

A singular test is a SELECT that returns the failing rows; it passes only when the query comes back empty.

-- tests/assert_positive_amounts.sql
SELECT
    order_id,
    order_total
FROM {{ ref('fct_orders') }}
WHERE order_total < 0

Generic Tests (Macros)

-- macros/test_positive_value.sql
{% test positive_value(model, column_name) %}

SELECT
    {{ column_name }}
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}

Usage in a schema file:

  - name: order_total
    tests:
      - positive_value

Advanced Patterns

Jinja Macros

-- macros/cents_to_dollars.sql
{% macro cents_to_dollars(column_name) %}
    ROUND({{ column_name }} / 100.0, 2)
{% endmacro %}

-- Usage in model
SELECT
    order_id,
    {{ cents_to_dollars('amount_cents') }} AS amount_dollars
FROM {{ ref('stg_orders') }}

Dynamic SQL Generation

-- Generate union of monthly tables
{% set months = ['01', '02', '03', '04', '05', '06'] %}

{% for month in months %}
SELECT * FROM raw.events_2024_{{ month }}
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
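
Hardcoding the month list works for a demo; if the dbt_utils package is installed, the same union can be generated from the warehouse catalog instead (a sketch, package assumed):

{% set relations = dbt_utils.get_relations_by_pattern('raw', 'events_2024_%') %}

WITH unioned AS (
    {{ dbt_utils.union_relations(relations=relations) }}
)

SELECT * FROM unioned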

Snapshots (SCD Type 2)

-- snapshots/customer_snapshot.sql
{% snapshot customer_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at',
    )
}}

SELECT
    customer_id,
    customer_name,
    segment,
    region,
    updated_at
FROM {{ source('raw', 'customers') }}

{% endsnapshot %}
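
dbt adds dbt_valid_from and dbt_valid_to columns to every snapshot row, so downstream models can pick the current version of each customer with a null-end-date filter:

-- current state of each customer
SELECT
    customer_id,
    customer_name,
    segment,
    region
FROM {{ ref('customer_snapshot') }}
WHERE dbt_valid_to IS NULL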

Interview Questions

Question: "Explain how dbt handles dependencies"

Answer: "dbt automatically builds a DAG from ref() and source() functions. When you write {{ ref('stg_orders') }}, dbt knows that model must run first. The DAG enables:

  1. Correct execution order
  2. Parallel execution where possible
  3. Selective runs (dbt run --select stg_orders+)
  4. Automatic documentation of lineage"
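
The graph operators matter here: +stg_orders selects the model plus its upstream parents, stg_orders+ selects it plus everything downstream, and +stg_orders+ selects both.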

Question: "How do you optimize dbt for large datasets?"

Answer:

  1. Use incremental models: materialized='incremental'
  2. Partition large tables: partition_by={'field': 'event_date', 'data_type': 'date'}
  3. Cluster for common query patterns: cluster_by=['user_id']
  4. Defer to production artifacts during development: dbt run --defer --state prod-artifacts/
  5. Build only what's needed with model selection: dbt run --select staging.stg_orders+

Question: "How would you implement data quality checks in dbt?"

Answer: "I'd implement a multi-layer approach:

  1. Schema tests: unique, not_null, relationships, accepted_values
  2. Custom SQL tests: Business-specific validations
  3. dbt-expectations: Statistical tests (outliers, distribution)
  4. Pre-hooks: Validate source freshness
  5. Post-hooks: Log test results to monitoring

For critical pipelines, I'd use dbt build which runs tests after each model, failing fast on issues."
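
For item 4, a minimal freshness sketch against the _loaded_at column from the staging example:

# models/staging/_stg_models.yml (freshness block)
sources:
  - name: raw
    loaded_at_field: _loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders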

Interview Insight: dbt is now table stakes for data engineering. Know the fundamentals deeply, and be prepared to discuss how you'd structure a project for a team of 10+ engineers.

Next, we'll cover data quality and testing strategies.
