Data Modeling & Warehousing

Modern Warehouse Architectures

4 min read

Understanding cloud data warehouse architectures is essential for data engineering interviews. Each platform has distinct characteristics that affect design decisions.

Snowflake Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│              Cloud Services Layer               │
│  (Query optimization, metadata, security)       │
├─────────────────────────────────────────────────┤
│           Multi-Cluster Compute Layer           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ XS       │  │ Medium   │  │ 2XL      │     │
│  │Warehouse │  │Warehouse │  │Warehouse │     │
│  └──────────┘  └──────────┘  └──────────┘     │
├─────────────────────────────────────────────────┤
│            Centralized Storage Layer            │
│      (Columnar, compressed, S3/Azure/GCS)       │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

FeatureWhat to Know
Separation of storage/computeScale independently, pay separately
Virtual WarehousesIndependent compute clusters
Zero-copy cloningInstant dev/test environments
Time TravelQuery historical data (up to 90 days)
Micro-partitionsAuto-partitioning, 50-500MB compressed

Interview Question: "How does Snowflake achieve performance isolation?"

Answer: "Snowflake uses separate virtual warehouses for different workloads. Each warehouse has dedicated compute resources, so a heavy ETL job won't affect dashboard queries if they use different warehouses. This is the key benefit of storage-compute separation."

Cost Optimization Strategies

-- Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh
SET AUTO_SUSPEND = 60  -- seconds
    AUTO_RESUME = TRUE;

-- Query result caching
-- Automatically caches results for 24 hours
-- Same query, same role = instant results

-- Clustering keys for large tables
ALTER TABLE events CLUSTER BY (event_date, user_id);

BigQuery Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│                 Dremel Engine                    │
│      (Serverless, massively parallel)           │
├─────────────────────────────────────────────────┤
│              Colossus Storage                    │
│     (Capacitor columnar format, petabyte-scale) │
├─────────────────────────────────────────────────┤
│              Jupiter Network                     │
│        (Petabit bandwidth between layers)       │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

FeatureWhat to Know
ServerlessNo cluster management
SlotsUnits of compute (2000 slots = ~$10K/month)
PartitioningRequired for cost control on large tables
ClusteringColocate related data within partitions
BI EngineIn-memory acceleration for dashboards

Interview Question: "When would you use BigQuery vs Snowflake?"

Answer: "BigQuery excels for GCP-native workloads, truly serverless operation, and when you want to avoid cluster management. Snowflake offers more flexibility with multi-cloud, better data sharing, and more predictable pricing. Choose based on existing cloud footprint and operational preferences."

Cost Control Patterns

-- Always partition large tables
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;

-- Use query dry-run to estimate cost
-- bq query --dry_run --use_legacy_sql=false 'SELECT ...'

-- Set custom quotas
-- Project-level: Maximum bytes billed per day/query

Redshift Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│              Leader Node                         │
│    (Query planning, coordination)               │
├─────────────────────────────────────────────────┤
│            Compute Nodes                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Node 1  │  │  Node 2  │  │  Node N  │     │
│  │ (Slices) │  │ (Slices) │  │ (Slices) │     │
│  └──────────┘  └──────────┘  └──────────┘     │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

FeatureWhat to Know
Distribution stylesKEY, ALL, EVEN, AUTO
Sort keysCompound vs interleaved
Redshift SpectrumQuery S3 data directly
Concurrency ScalingAuto-add clusters for spikes
AQUAHardware-accelerated cache

Interview Question: "How do you choose distribution keys in Redshift?"

Answer: "Choose distribution key based on the most frequent join column. If tables are joined on customer_id, both should use DISTKEY(customer_id) to colocate data. Small dimension tables can use DISTSTYLE ALL to replicate to all nodes. Avoid data skew—check distribution with SVV_TABLE_INFO."

Performance Tuning

-- Check table distribution
SELECT "table", diststyle, sortkey1
FROM SVV_TABLE_INFO
WHERE "table" = 'fact_sales';

-- Analyze column encoding
ANALYZE COMPRESSION fact_sales;

-- Vacuum and analyze regularly
VACUUM FULL fact_sales;
ANALYZE fact_sales;

Architecture Comparison

AspectSnowflakeBigQueryRedshift
Compute ModelVirtual warehousesServerless slotsFixed clusters
ScalingWarehouse resizeAuto (or reserved)Resize/elastic
Pricing ModelCompute + storageBytes scanned + storageNode hours + storage
ConcurrencyMulti-cluster warehousesHigh (slots-based)WLM queues
Data SharingNative, cross-accountAuthorized views / Analytics HubData sharing (GA)
Semi-structuredVARIANT typeSTRUCT, ARRAYSUPER type

Design Patterns Across Platforms

Pattern 1: Time-Based Partitioning

-- Snowflake: Automatic micro-partitioning + clustering
CREATE TABLE events (
    event_id BIGINT,
    event_timestamp TIMESTAMP,
    user_id INT,
    event_type STRING
) CLUSTER BY (event_timestamp::DATE, user_id);

-- BigQuery: Explicit partitioning required
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT ...;

-- Redshift: Sort keys
CREATE TABLE events (
    event_id BIGINT,
    event_timestamp TIMESTAMP,
    user_id INT
)
SORTKEY (event_timestamp);

Pattern 2: Incremental Loading

-- Snowflake: MERGE statement
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- BigQuery: MERGE with partitioning
MERGE target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- Redshift: Native MERGE statement (GA since 2023)
MERGE INTO target USING source ON target.id = source.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT VALUES (...);

Pattern 3: Materialized Views

-- All platforms support materialized views
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
    DATE(sale_timestamp) as sale_date,
    product_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count
FROM fact_sales
GROUP BY DATE(sale_timestamp), product_id;

Interview Insight: Be prepared to discuss trade-offs. Each warehouse excels in different scenarios—there's no universally "best" platform. Your ability to reason about trade-offs demonstrates senior-level thinking.

Next, we'll explore the data lake vs lakehouse evolution. :::

Quick check: how does this lesson land for you?

Quiz

Module 3: Data Modeling & Warehousing

Take Quiz
FREE WEEKLY NEWSLETTER

Stay on the Nerd Track

One email per week — courses, deep dives, tools, and AI experiments.

No spam. Unsubscribe anytime.