Modern Warehouse Architectures

Understanding cloud data warehouse architectures is essential for data engineering interviews. Each platform has distinct characteristics that affect design decisions.

Snowflake Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│              Cloud Services Layer               │
│  (Query optimization, metadata, security)       │
├─────────────────────────────────────────────────┤
│           Multi-Cluster Compute Layer           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │ XS       │  │ Medium   │  │ 2XL      │     │
│  │Warehouse │  │Warehouse │  │Warehouse │     │
│  └──────────┘  └──────────┘  └──────────┘     │
├─────────────────────────────────────────────────┤
│            Centralized Storage Layer            │
│      (Columnar, compressed, S3/Azure/GCS)       │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

| Feature | What to Know |
| --- | --- |
| Separation of storage and compute | Scale and pay for each independently |
| Virtual warehouses | Independent compute clusters per workload |
| Zero-copy cloning | Instant dev/test environments |
| Time Travel | Query historical data (up to 90 days) |
| Micro-partitions | Automatic partitioning; each holds 50-500 MB of uncompressed data |
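
Zero-copy cloning and Time Travel often come up together in interviews. A minimal sketch of both (database and table names are illustrative):

-- Spin up a dev environment instantly; no data is physically copied
CREATE DATABASE analytics_dev CLONE analytics_prod;

-- Query a table as it looked one hour ago (offset in seconds)
SELECT COUNT(*) FROM orders AT(OFFSET => -3600);

-- Recover a table dropped within the retention window
UNDROP TABLE orders;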

Interview Question: "How does Snowflake achieve performance isolation?"

Answer: "Snowflake uses separate virtual warehouses for different workloads. Each warehouse has dedicated compute resources, so a heavy ETL job won't affect dashboard queries if they use different warehouses. This is the key benefit of storage-compute separation."

Cost Optimization Strategies

-- Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh
SET AUTO_SUSPEND = 60  -- seconds
    AUTO_RESUME = TRUE;

-- Query result caching
-- Results are cached for 24 hours (retention extends when the cache is reused)
-- Identical query text + unchanged data = instant results at no compute cost

-- Clustering keys for large tables
ALTER TABLE events CLUSTER BY (event_date, user_id);
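
Beyond auto-suspend, spend can be capped with a resource monitor. A sketch of the idea (monitor name and quota are illustrative):

-- Cap monthly credit consumption and suspend the warehouse at the limit
CREATE RESOURCE MONITOR analytics_monthly_cap
    WITH CREDIT_QUOTA = 100
    FREQUENCY = MONTHLY
    START_TIMESTAMP = IMMEDIATELY
    TRIGGERS ON 80 PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = analytics_monthly_cap;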

BigQuery Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│                 Dremel Engine                    │
│      (Serverless, massively parallel)           │
├─────────────────────────────────────────────────┤
│              Colossus Storage                    │
│     (Capacitor columnar format, petabyte-scale) │
├─────────────────────────────────────────────────┤
│              Jupiter Network                     │
│        (Petabit bandwidth between layers)       │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

| Feature | What to Know |
| --- | --- |
| Serverless | No cluster management |
| Slots | Units of compute capacity; on-demand or reserved via capacity-based pricing |
| Partitioning | Effectively required for cost control on large tables |
| Clustering | Colocates related data within partitions |
| BI Engine | In-memory acceleration for dashboards |

Interview Question: "When would you use BigQuery vs Snowflake?"

Answer: "BigQuery excels for GCP-native workloads, truly serverless operation, and when you want to avoid cluster management. Snowflake offers more flexibility with multi-cloud, better data sharing, and more predictable pricing. Choose based on existing cloud footprint and operational preferences."

Cost Control Patterns

-- Always partition large tables
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;

-- Use query dry-run to estimate cost
-- bq query --dry_run --use_legacy_sql=false 'SELECT ...'

-- Set custom quotas
-- Project-level: Maximum bytes billed per day/query
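
Partition pruning is what makes the pattern above pay off: filtering on the partition column means only the matching partitions are scanned and billed. A sketch against the events table defined above (the require_partition_filter option is a common guardrail):

-- Only the seven matching daily partitions are scanned
SELECT user_id, COUNT(*) AS event_count
FROM events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-07'
GROUP BY user_id;

-- Reject any query that forgets a partition filter
ALTER TABLE events SET OPTIONS (require_partition_filter = TRUE);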

Redshift Architecture

Key Concepts

┌─────────────────────────────────────────────────┐
│              Leader Node                         │
│    (Query planning, coordination)               │
├─────────────────────────────────────────────────┤
│            Compute Nodes                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     │
│  │  Node 1  │  │  Node 2  │  │  Node N  │     │
│  │ (Slices) │  │ (Slices) │  │ (Slices) │     │
│  └──────────┘  └──────────┘  └──────────┘     │
└─────────────────────────────────────────────────┘

Interview-Relevant Features

| Feature | What to Know |
| --- | --- |
| Distribution styles | KEY, ALL, EVEN, AUTO |
| Sort keys | Compound vs. interleaved |
| Redshift Spectrum | Query data in S3 directly |
| Concurrency Scaling | Automatically adds clusters for query spikes |
| AQUA | Hardware-accelerated cache layer |
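
Redshift Spectrum from the table above is a common follow-up: it queries files in S3 through an external schema backed by the Glue Data Catalog. A hedged sketch (schema, database, role ARN, and table names are illustrative):

CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'raw_events_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Scans the files in S3 without loading them into the cluster
SELECT event_type, COUNT(*)
FROM spectrum.raw_events
WHERE event_date = '2024-01-01'
GROUP BY event_type;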

Interview Question: "How do you choose distribution keys in Redshift?"

Answer: "Choose distribution key based on the most frequent join column. If tables are joined on customer_id, both should use DISTKEY(customer_id) to colocate data. Small dimension tables can use DISTSTYLE ALL to replicate to all nodes. Avoid data skew—check distribution with SVV_TABLE_INFO."

Performance Tuning

-- Check table distribution style, sort key, and row skew
SELECT "table", diststyle, sortkey1, skew_rows
FROM SVV_TABLE_INFO
WHERE "table" = 'fact_sales';

-- Recommend column encodings (compression) from a sample of the data
ANALYZE COMPRESSION fact_sales;

-- Reclaim deleted space, re-sort rows, and refresh planner statistics
VACUUM FULL fact_sales;
ANALYZE fact_sales;

Architecture Comparison

| Aspect | Snowflake | BigQuery | Redshift |
| --- | --- | --- | --- |
| Compute model | Virtual warehouses | Serverless slots | Fixed clusters |
| Scaling | Warehouse resize | Automatic (or reserved slots) | Resize / elastic resize |
| Pricing model | Compute + storage | Bytes scanned + storage | Node hours + storage |
| Concurrency | Multi-cluster warehouses | High (slot-based) | WLM queues |
| Data sharing | Native, cross-account | Authorized views | Data sharing (newer) |
| Semi-structured | VARIANT type | STRUCT, ARRAY | SUPER type |
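
The semi-structured row is a frequent follow-up. A hedged sketch of reading the same nested field on each platform (column and key names are hypothetical):

-- Snowflake: VARIANT path navigation plus a cast
SELECT payload:customer:id::INT AS customer_id FROM events_raw;

-- BigQuery: JSON column read with JSON_VALUE (returns a STRING)
SELECT CAST(JSON_VALUE(payload, '$.customer.id') AS INT64) AS customer_id
FROM events_raw;

-- Redshift: SUPER column with PartiQL dot navigation
SELECT payload.customer.id AS customer_id FROM events_raw;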

Design Patterns Across Platforms

Pattern 1: Time-Based Partitioning

-- Snowflake: Automatic micro-partitioning + clustering
CREATE TABLE events (
    event_id BIGINT,
    event_timestamp TIMESTAMP,
    user_id INT,
    event_type STRING
) CLUSTER BY (event_timestamp::DATE, user_id);

-- BigQuery: Explicit partitioning required
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT ...;

-- Redshift: Sort keys
CREATE TABLE events (
    event_id BIGINT,
    event_timestamp TIMESTAMP,
    user_id INT
)
SORTKEY (event_timestamp);

Pattern 2: Incremental Loading

-- Snowflake: MERGE statement
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- BigQuery: MERGE with partitioning
MERGE target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

-- Redshift: Delete + Insert pattern (MERGE is newer)
BEGIN;
DELETE FROM target WHERE id IN (SELECT id FROM source);
INSERT INTO target SELECT * FROM source;
COMMIT;
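
In an interview it helps to have the elided MERGE written out in full. A minimal sketch with hypothetical columns (id, amount, updated_at) that should run on both Snowflake and BigQuery:

MERGE INTO target t
USING source s
    ON t.id = s.id
WHEN MATCHED THEN
    UPDATE SET amount = s.amount, updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at);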

Pattern 3: Materialized Views

-- All platforms support materialized views
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
    DATE(sale_timestamp) as sale_date,
    product_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count
FROM fact_sales
GROUP BY DATE(sale_timestamp), product_id;
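
One difference worth calling out: refresh semantics. Snowflake and BigQuery maintain their materialized views automatically (within each platform's restrictions on what the view may contain), while Redshift supports auto refresh but also allows explicit refreshes:

-- Redshift: refresh on demand (or create the view with AUTO REFRESH YES)
REFRESH MATERIALIZED VIEW daily_sales_summary;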

Interview Insight: Be prepared to discuss trade-offs. Each warehouse excels in different scenarios—there's no universally "best" platform. Your ability to reason about trade-offs demonstrates senior-level thinking.

Next, we'll explore the data lake vs lakehouse evolution.
