Data Lakes & Lakehouse Architecture

The lakehouse paradigm combines the flexibility of data lakes with the reliability of data warehouses. Understanding this evolution is crucial for modern data engineering interviews.

Data Lake Evolution

Traditional Data Lake Challenges

| Challenge | Impact |
|---|---|
| No ACID transactions | Data corruption, inconsistent reads |
| No schema enforcement | Garbage in, garbage out |
| No indexing | Full table scans for every query |
| Stale metadata | Slow query planning |
| No time travel | Can't recover from bad writes |

The Lakehouse Solution

┌─────────────────────────────────────────────────────┐
│                  Query Engines                       │
│     (Spark, Trino, Dremio, Athena, Databricks SQL)  │
├─────────────────────────────────────────────────────┤
│              Table Format Layer                      │
│         (Delta Lake, Apache Iceberg, Hudi)          │
├─────────────────────────────────────────────────────┤
│              Object Storage                          │
│           (S3, GCS, ADLS, MinIO)                    │
└─────────────────────────────────────────────────────┘

Delta Lake

Databricks' open-source lakehouse format.

Key Features

| Feature | Benefit |
|---|---|
| ACID transactions | Reliable concurrent writes |
| Schema evolution | Add or rename columns without rewriting data |
| Time travel | Query any previous version of the table |
| Z-ordering | Multi-dimensional clustering for data skipping |
| Change data feed | Stream row-level changes to downstream consumers |

Common Operations

# Create Delta table
df.write.format("delta").save("/data/events")

# Time travel - read previous version
spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/data/events")

# Time travel - read by timestamp
spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/data/events")

# Optimize and Z-order
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.optimize().executeZOrderBy("user_id", "event_date")

# Vacuum old versions (keep 7 days)
delta_table.vacuum(168)  # hours
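
The change data feed listed in the features table can be consumed as a versioned read. A minimal sketch, assuming the table was created with delta.enableChangeDataFeed set to true and that version 5 exists (both illustrative):

# Read the change data feed starting from a given version
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .load("/data/events")

# Rows carry _change_type (insert / update_preimage / update_postimage / delete),
# plus _commit_version and _commit_timestamp metadata columns
changes.filter("_change_type = 'delete'").show()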

Interview Question: "How does Delta Lake achieve ACID transactions?"

Answer: "Delta Lake uses a transaction log (_delta_log/) that records every change as JSON commits. Writes are atomic through optimistic concurrency—each write creates a new commit file. Reads use the log to determine which files to read. Conflicts are detected by checking if committed transactions overlap with the current write."

Apache Iceberg

Netflix-developed format, now an Apache project.

Key Features

| Feature | Benefit |
|---|---|
| Hidden partitioning | Queries prune partitions without referencing partition columns |
| Schema evolution | Full support (add, drop, rename, reorder) |
| Time travel | Snapshot-based versioning |
| Partition evolution | Change the partition spec without rewriting or migrating data |
| Multiple engines | Spark, Flink, Trino, Hive, Presto |

Architecture

┌─────────────────────────────────────┐
│          Catalog (Metadata)          │
│  (Hive Metastore, Glue, Nessie)     │
├─────────────────────────────────────┤
│        Metadata Layer                │
│   ┌──────────────────────────────┐  │
│   │ Manifest List (snapshot)     │  │
│   │   └── Manifest Files         │  │
│   │       └── Data File Refs     │  │
│   └──────────────────────────────┘  │
├─────────────────────────────────────┤
│          Data Files                  │
│    (Parquet, ORC, Avro)             │
└─────────────────────────────────────┘

Common Operations

-- Create Iceberg table
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id INT,
    event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id));

-- Time travel
SELECT * FROM events VERSION AS OF 123456;
SELECT * FROM events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Partition evolution (no data rewrite!)
ALTER TABLE events ADD PARTITION FIELD bucket(8, region);

-- Expire snapshots
CALL system.expire_snapshots('events', TIMESTAMP '2024-01-01 00:00:00');
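
Snapshots and data files can also be inspected through Iceberg's metadata tables. A sketch, assuming the table is reachable as db.events in an Iceberg-enabled Spark catalog (the qualified name is illustrative):

# One row per snapshot/commit, with its id, timestamp, and operation
spark.sql("SELECT snapshot_id, committed_at, operation FROM db.events.snapshots").show()

# Data files referenced by the current snapshot
spark.sql("SELECT file_path, record_count FROM db.events.files").show(truncate=False)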

Interview Question: "What's the advantage of Iceberg's hidden partitioning?"

Answer: "With hidden partitioning, users write queries using the actual column values (WHERE event_time > '2024-01-01'), and Iceberg automatically translates this to the correct partition predicate. This means:

  1. Partition evolution doesn't require query changes
  2. No need to know the partitioning scheme
  3. Partition transforms (day, month, bucket) are abstracted away"
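
As a concrete sketch against the events table created above: the same query keeps working before and after the partition spec changes, because the filter is on the raw column rather than on a derived partition column.

# The filter is on event_time itself; Iceberg maps it to the
# days(event_time) partitions (and, after evolution, the new spec) for pruning
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY event_type
""").show()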

Apache Hudi

Uber-developed format for incremental processing.

Key Features

| Feature | Benefit |
|---|---|
| Upserts | Native support for record-level updates |
| Incremental queries | Read only changed records |
| Record-level indexing | Fast lookups and updates |
| Compaction | Merge small files automatically |
| Table types | Copy-on-Write vs. Merge-on-Read |

Table Types Comparison

| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
|---|---|---|
| Write latency | Higher (rewrites files) | Lower (appends logs) |
| Read latency | Lower (pre-merged) | Higher (merges at read) |
| Best for | Read-heavy workloads | Write-heavy workloads |
| Storage | More (full rewrites) | Less (only changes) |
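
A minimal PySpark sketch of an upsert and an incremental read; updates_df, the paths, and the instant time are hypothetical, and the table type can be switched between COPY_ON_WRITE and MERGE_ON_READ:

# Upsert a batch of changed records into a Merge-on-Read Hudi table
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",     # record key for upserts
    "hoodie.datasource.write.precombine.field": "event_time",  # latest record wins
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
updates_df.write.format("hudi").options(**hudi_options) \
    .mode("append").save("/data/hudi/events")

# Incremental query: read only records committed after a given instant
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load("/data/hudi/events")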

Format Comparison

| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Origin | Databricks | Netflix | Uber |
| Primary use | General lakehouse | Analytics | CDC/streaming |
| ACID transactions | Yes | Yes | Yes |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Good | Excellent | Good |
| Partition evolution | Limited | Excellent | Limited |
| Upserts | Good | Good | Excellent |
| Incremental reads | Change Data Feed | Incremental scans | Native |
| Engine support | Best with Spark | Multi-engine | Good with Spark |

Interview Question: "How do you choose between Delta, Iceberg, and Hudi?"

Answer Framework:

| If you need | Choose |
|---|---|
| Databricks ecosystem | Delta Lake |
| Engine portability | Iceberg |
| Heavy CDC/upserts | Hudi |
| Simple operations | Delta Lake |
| Partition flexibility | Iceberg |
| Real-time ingestion | Hudi |

Medallion Architecture

Common lakehouse design pattern.

┌───────────┐    ┌───────────┐    ┌────────────┐
│  Bronze   │───▶│  Silver   │───▶│    Gold    │
│  (Raw)    │    │ (Cleaned) │    │(Aggregated)│
└───────────┘    └───────────┘    └────────────┘

| Layer | Purpose | Schema | Quality |
|---|---|---|---|
| Bronze | Raw ingestion | Schema-on-read | As-is from source |
| Silver | Cleaned, conformed | Enforced | Validated, deduped |
| Gold | Business aggregates | Modeled | Ready for BI |

Implementation Example

from pyspark.sql.functions import col, count, to_date

# Bronze: Raw ingestion
raw_df = spark.read.json("/landing/events/")
raw_df.write.format("delta").mode("append") \
    .save("/bronze/events")

# Silver: Clean and deduplicate
bronze_df = spark.read.format("delta").load("/bronze/events")
silver_df = bronze_df \
    .dropDuplicates(["event_id"]) \
    .filter(col("user_id").isNotNull()) \
    .withColumn("event_date", to_date("event_time"))
silver_df.write.format("delta").mode("overwrite") \
    .partitionBy("event_date") \
    .save("/silver/events")

# Gold: Business aggregates
silver_df = spark.read.format("delta").load("/silver/events")
gold_df = silver_df.groupBy("event_date", "event_type") \
    .agg(count("*").alias("event_count"))
gold_df.write.format("delta").mode("overwrite") \
    .save("/gold/daily_event_summary")

Interview Insight: The lakehouse represents the convergence of data lakes and warehouses. Being able to explain when a lakehouse is a better fit than a traditional warehouse demonstrates architectural maturity.

Next, we'll dive into slowly changing dimensions and their implementations.
