Data Modeling & Warehousing
Modern Warehouse Architectures
Understanding cloud data warehouse architectures is essential for data engineering interviews. Each platform has distinct characteristics that affect design decisions.
Snowflake Architecture
Key Concepts
┌─────────────────────────────────────────────────┐
│ Cloud Services Layer │
│ (Query optimization, metadata, security) │
├─────────────────────────────────────────────────┤
│ Multi-Cluster Compute Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ XS │ │ Medium │ │ 2XL │ │
│ │Warehouse │ │Warehouse │ │Warehouse │ │
│ └──────────┘ └──────────┘ └──────────┘ │
├─────────────────────────────────────────────────┤
│ Centralized Storage Layer │
│ (Columnar, compressed, S3/Azure/GCS) │
└─────────────────────────────────────────────────┘
Interview-Relevant Features
| Feature | What to Know |
|---|---|
| Separation of storage/compute | Scale independently, pay separately |
| Virtual Warehouses | Independent compute clusters |
| Zero-copy cloning | Instant dev/test environments |
| Time Travel | Query historical data (up to 90 days) |
| Micro-partitions | Automatic partitioning; each holds 50–500 MB of uncompressed data |
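Zero-copy cloning and Time Travel are easy to demonstrate live. A minimal sketch (table name and offset are illustrative):
-- Clone a production table instantly; storage is shared until either copy changes
CREATE TABLE orders_dev CLONE orders;
-- Query the table as it looked one hour ago (offset in seconds)
SELECT * FROM orders AT (OFFSET => -3600);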
Interview Question: "How does Snowflake achieve performance isolation?"
Answer: "Snowflake uses separate virtual warehouses for different workloads. Each warehouse has dedicated compute resources, so a heavy ETL job won't affect dashboard queries if they use different warehouses. This is the key benefit of storage-compute separation."
Cost Optimization Strategies
-- Auto-suspend and auto-resume
ALTER WAREHOUSE analytics_wh
SET AUTO_SUSPEND = 60 -- seconds
AUTO_RESUME = TRUE;
-- Query result caching
-- Automatically caches results for 24 hours
-- Same query, same role = instant results
-- Clustering keys for large tables
ALTER TABLE events CLUSTER BY (event_date, user_id);
BigQuery Architecture
Key Concepts
┌─────────────────────────────────────────────────┐
│ Dremel Engine │
│ (Serverless, massively parallel) │
├─────────────────────────────────────────────────┤
│ Colossus Storage │
│ (Capacitor columnar format, petabyte-scale) │
├─────────────────────────────────────────────────┤
│ Jupiter Network │
│ (Petabit bandwidth between layers) │
└─────────────────────────────────────────────────┘
Interview-Relevant Features
| Feature | What to Know |
|---|---|
| Serverless | No cluster management |
| Slots | Units of compute capacity; billed on-demand (bytes scanned) or via slot reservations |
| Partitioning | Required for cost control on large tables |
| Clustering | Colocate related data within partitions |
| BI Engine | In-memory acceleration for dashboards |
Interview Question: "When would you use BigQuery vs Snowflake?"
Answer: "BigQuery excels for GCP-native workloads, truly serverless operation, and when you want to avoid cluster management. Snowflake offers more flexibility with multi-cloud, better data sharing, and more predictable pricing. Choose based on existing cloud footprint and operational preferences."
Cost Control Patterns
-- Always partition large tables
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT * FROM raw_events;
-- Use query dry-run to estimate cost
-- bq query --dry_run --use_legacy_sql=false 'SELECT ...'
-- Set custom quotas
-- Project-level: Maximum bytes billed per day/query
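A hedged example of the byte-cap and partition-filter controls mentioned above (dataset and table names are placeholders):
-- Hard-cap a single query's scan; it fails instead of billing more
-- bq query --maximum_bytes_billed=1000000000 --use_legacy_sql=false 'SELECT ...'
-- Force callers to filter on the partition column
ALTER TABLE mydataset.events SET OPTIONS (require_partition_filter = TRUE);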
Redshift Architecture
Key Concepts
┌─────────────────────────────────────────────────┐
│ Leader Node │
│ (Query planning, coordination) │
├─────────────────────────────────────────────────┤
│ Compute Nodes │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node N │ │
│ │ (Slices) │ │ (Slices) │ │ (Slices) │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────┘
Interview-Relevant Features
| Feature | What to Know |
|---|---|
| Distribution styles | KEY, ALL, EVEN, AUTO |
| Sort keys | Compound vs interleaved |
| Redshift Spectrum | Query S3 data directly |
| Concurrency Scaling | Auto-add clusters for spikes |
| AQUA | Hardware-accelerated cache |
Interview Question: "How do you choose distribution keys in Redshift?"
Answer: "Choose distribution key based on the most frequent join column. If tables are joined on customer_id, both should use DISTKEY(customer_id) to colocate data. Small dimension tables can use DISTSTYLE ALL to replicate to all nodes. Avoid data skew—check distribution with SVV_TABLE_INFO."
Performance Tuning
-- Check table distribution
SELECT "table", diststyle, sortkey1
FROM SVV_TABLE_INFO
WHERE "table" = 'fact_sales';
-- Analyze column encoding
ANALYZE COMPRESSION fact_sales;
-- Vacuum and analyze regularly
VACUUM FULL fact_sales;
ANALYZE fact_sales;
Architecture Comparison
| Aspect | Snowflake | BigQuery | Redshift |
|---|---|---|---|
| Compute Model | Virtual warehouses | Serverless slots | Fixed clusters |
| Scaling | Warehouse resize | Auto (or reserved) | Resize/elastic |
| Pricing Model | Compute + storage | Bytes scanned + storage | Node hours + storage |
| Concurrency | Multi-cluster warehouses | High (slots-based) | WLM queues |
| Data Sharing | Native, cross-account | Authorized views | Data sharing (newer) |
| Semi-structured | VARIANT type | STRUCT, ARRAY | SUPER type |
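The semi-structured row above is a common follow-up. A rough sketch of how the same nested field is read on each platform (the payload column and path are hypothetical):
-- Snowflake: VARIANT column, colon-path navigation plus a cast
SELECT payload:user.id::INT AS user_id FROM events;
-- BigQuery: STRUCT column, ordinary dot notation
SELECT payload.user.id AS user_id FROM events;
-- Redshift: SUPER column, PartiQL dot notation
SELECT payload.user.id AS user_id FROM events;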
Design Patterns Across Platforms
Pattern 1: Time-Based Partitioning
-- Snowflake: Automatic micro-partitioning + clustering
CREATE TABLE events (
event_id BIGINT,
event_timestamp TIMESTAMP,
user_id INT,
event_type STRING
) CLUSTER BY (event_timestamp::DATE, user_id);
-- BigQuery: Explicit partitioning required
CREATE TABLE events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id
AS SELECT ...;
-- Redshift: Sort keys
CREATE TABLE events (
event_id BIGINT,
event_timestamp TIMESTAMP,
user_id INT
)
SORTKEY (event_timestamp);
Pattern 2: Incremental Loading
-- Snowflake: MERGE statement
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
-- BigQuery: MERGE with partitioning
MERGE target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
-- Redshift: Delete + Insert pattern (MERGE is newer)
BEGIN;
DELETE FROM target WHERE id IN (SELECT id FROM source);
INSERT INTO target SELECT * FROM source;
COMMIT;
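Interviewers often ask for the full statement rather than the skeleton. A hedged version with hypothetical columns that runs on both Snowflake and BigQuery:
MERGE INTO target t
USING source s
  ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  amount = s.amount,
  updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at)
  VALUES (s.id, s.amount, s.updated_at);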
Pattern 3: Materialized Views
-- All platforms support materialized views
CREATE MATERIALIZED VIEW daily_sales_summary AS
SELECT
DATE(sale_timestamp) as sale_date,
product_id,
SUM(amount) as total_amount,
COUNT(*) as transaction_count
FROM fact_sales
GROUP BY DATE(sale_timestamp), product_id;
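One nuance worth raising: Snowflake and BigQuery keep materialized views up to date automatically, while Redshift refreshes them on demand unless AUTO REFRESH is enabled at creation. The manual form, using the view name from the example above:
-- Redshift: bring the materialized view up to date with base-table changes
REFRESH MATERIALIZED VIEW daily_sales_summary;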
Interview Insight: Be prepared to discuss trade-offs. Each warehouse excels in different scenarios—there's no universally "best" platform. Your ability to reason about trade-offs demonstrates senior-level thinking.
Next, we'll explore the data lake vs lakehouse evolution.