Data Modeling & Warehousing
Slowly Changing Dimensions
Slowly Changing Dimensions (SCDs) handle historical tracking in dimensional models. This is a classic interview topic—expect to explain types, trade-offs, and implementation.
SCD Types Overview
| Type | Description | History | Storage |
|---|---|---|---|
| Type 0 | Retain original value | None | Minimal |
| Type 1 | Overwrite old value | None | Minimal |
| Type 2 | Add new row with versioning | Full | High |
| Type 3 | Add column for previous value | Limited | Medium |
| Type 4 | Separate history table | Full | Medium |
| Type 6 | Hybrid (1+2+3) | Full | Highest |
Type 1: Overwrite
No history maintained—simply replace old values.
Implementation
-- Before update
SELECT * FROM dim_customer WHERE customer_id = 100;
-- customer_key=1, customer_id=100, email='old@email.com'
-- Type 1 Update
UPDATE dim_customer
SET email = 'new@email.com',
updated_at = CURRENT_TIMESTAMP
WHERE customer_id = 100;
-- After update
-- customer_key=1, customer_id=100, email='new@email.com'
Use Cases
- Correction of data entry errors
- Non-analytically significant attributes
- When history is not required
Type 2: Add New Row
Full history preserved with versioning columns.
Schema Design
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY, -- Surrogate key
customer_id VARCHAR(50), -- Natural key
customer_name VARCHAR(100),
segment VARCHAR(50),
region VARCHAR(50),
effective_date DATE, -- Row valid from
expiration_date DATE, -- Row valid to
is_current BOOLEAN, -- Current record flag
version INT -- Version number
);
Implementation
-- Step 1: Expire current record
UPDATE dim_customer
SET expiration_date = CURRENT_DATE - 1,
is_current = FALSE
WHERE customer_id = 100
AND is_current = TRUE;
-- Step 2: Insert new version
INSERT INTO dim_customer (
customer_key, customer_id, customer_name, segment, region,
effective_date, expiration_date, is_current, version
)
SELECT
nextval('customer_key_seq'),
100,
'John Smith',
'Premium', -- Changed from 'Standard'
'West',
CURRENT_DATE,
'9999-12-31', -- High date for current
TRUE,
(SELECT MAX(version) + 1 FROM dim_customer WHERE customer_id = 100);
Query Patterns
-- Current state only
SELECT * FROM dim_customer WHERE is_current = TRUE;
-- Point-in-time lookup
SELECT * FROM dim_customer
WHERE customer_id = 100
AND '2024-06-15' BETWEEN effective_date AND expiration_date;
-- Historical analysis with facts
SELECT
f.sale_date,
f.amount,
d.segment -- Segment at time of sale
FROM fact_sales f
JOIN dim_customer d ON f.customer_key = d.customer_key;
Interview Question: "Why use surrogate keys with SCD Type 2?"
Answer: "Surrogate keys (customer_key) allow each version to have a unique identifier. Fact tables reference the surrogate key, automatically capturing the dimension state at the time of the transaction. Without surrogate keys, you'd need complex date-based joins for every query."
Type 3: Previous Value Column
Track limited history with additional columns.
Schema Design
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50),
current_segment VARCHAR(50),
previous_segment VARCHAR(50),
segment_change_date DATE
);
Implementation
UPDATE dim_customer
SET previous_segment = current_segment,
current_segment = 'Premium',
segment_change_date = CURRENT_DATE
WHERE customer_id = 100;
Use Cases
- When only one previous value matters
- Analyzing "before and after" scenarios
- Storage-constrained environments
Type 4: History Table
Separate table for historical records.
Schema Design
-- Current dimension (Type 1 updates)
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(100),
segment VARCHAR(50),
updated_at TIMESTAMP
);
-- Historical tracking
CREATE TABLE dim_customer_history (
history_id SERIAL PRIMARY KEY,
customer_key INT,
customer_name VARCHAR(100),
segment VARCHAR(50),
effective_date DATE,
expiration_date DATE,
change_type VARCHAR(20) -- INSERT, UPDATE, DELETE
);
Benefits
- Current queries remain simple and fast
- History available when needed
- Cleaner separation of concerns
Type 6: Hybrid (1+2+3)
Combines multiple approaches for maximum flexibility.
Schema Design
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(100),
-- Type 3 columns
current_segment VARCHAR(50),
previous_segment VARCHAR(50),
original_segment VARCHAR(50), -- Type 0
-- Type 2 columns
effective_date DATE,
expiration_date DATE,
is_current BOOLEAN
);
Use Cases
- Need both current comparison and full history
- Complex analytical requirements
- When different attributes need different SCD treatment
Implementation with dbt
Type 2 Snapshot
# models/snapshots/dim_customer_snapshot.sql
{% snapshot dim_customer_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
)
}}
SELECT
customer_id,
customer_name,
email,
segment,
region,
updated_at
FROM {{ source('raw', 'customers') }}
{% endsnapshot %}
Result Schema (auto-generated by dbt)
customer_id | customer_name | segment | dbt_scd_id | dbt_valid_from | dbt_valid_to | dbt_updated_at
Performance Considerations
| Aspect | Type 1 | Type 2 | Type 3 | Type 4 |
|---|---|---|---|---|
| Storage | Low | High | Medium | Medium |
| Current Query | Fast | Medium | Fast | Fast |
| Historical Query | N/A | Complex | Limited | Medium |
| ETL Complexity | Low | High | Medium | Medium |
| Fact Table Joins | Simple | Simple | Simple | Complex |
Interview Design Question
Question: "Design an SCD strategy for an e-commerce customer dimension"
Answer:
| Attribute | SCD Type | Rationale |
|---|---|---|
customer_name |
Type 1 | Name corrections, not analytically significant |
email |
Type 1 | Current contact info needed |
segment |
Type 2 | Critical for historical analysis |
lifetime_value_tier |
Type 2 | Drives marketing analysis |
address |
Type 2 or 4 | Depends on shipping analysis needs |
phone |
Type 1 | Current contact only |
registration_date |
Type 0 | Never changes |
Implementation Choice: "I'd use Type 2 for segment and tier, Type 1 for contact info, and consider Type 4 if we need detailed address change history for fraud detection."
Interview Tip: When discussing SCDs, always connect your choice to business requirements. There's no universally "correct" type—the best choice depends on analytical needs and storage constraints.
Next module covers ETL pipelines and orchestration patterns. :::