Data Modeling & Warehousing

Data Lakes & Lakehouse Architecture

4 min read

The lakehouse paradigm combines the flexibility of data lakes with the reliability of data warehouses. Understanding this evolution is crucial for modern data engineering interviews.

Data Lake Evolution

Traditional Data Lake Challenges

ChallengeImpact
No ACID transactionsData corruption, inconsistent reads
Schema enforcementGarbage in, garbage out
No indexingFull scans for every query
Stale metadataSlow query planning
No time travelCan't recover from bad writes

The Lakehouse Solution

┌─────────────────────────────────────────────────────┐
│                  Query Engines                       │
│     (Spark, Trino, Dremio, Athena, Databricks SQL)  │
├─────────────────────────────────────────────────────┤
│              Table Format Layer                      │
│         (Delta Lake, Apache Iceberg, Hudi)          │
├─────────────────────────────────────────────────────┤
│              Object Storage                          │
│           (S3, GCS, ADLS, MinIO)                    │
└─────────────────────────────────────────────────────┘

Delta Lake

Databricks' open-source lakehouse format.

Key Features

FeatureBenefit
ACID transactionsReliable concurrent writes
Schema evolutionAdd/rename columns without rewriting
Time travelQuery any previous version
Z-orderingMulti-dimensional clustering
Change data feedStream changes for downstream

Common Operations

# Create Delta table
df.write.format("delta").save("/data/events")

# Time travel - read previous version
spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/data/events")

# Time travel - read by timestamp
spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/data/events")

# Optimize and Z-order
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.optimize().executeZOrderBy("user_id", "event_date")

# Vacuum old versions (keep 7 days)
delta_table.vacuum(168)  # hours

Interview Question: "How does Delta Lake achieve ACID transactions?"

Answer: "Delta Lake uses a transaction log (_delta_log/) that records every change as JSON commits. Writes are atomic through optimistic concurrency—each write creates a new commit file. Reads use the log to determine which files to read. Conflicts are detected by checking if committed transactions overlap with the current write."

Apache Iceberg

Netflix-developed format, now an Apache project.

Key Features

FeatureBenefit
Hidden partitioningPartition evolution without rewriting
Schema evolutionFull support (add, drop, rename, reorder)
Time travelSnapshot-based versioning
Partition evolutionChange partitioning without migration
Multiple enginesSpark, Flink, Trino, Hive, Presto

Architecture

┌─────────────────────────────────────┐
│          Catalog (Metadata)          │
│  (Hive Metastore, Glue, Nessie)     │
├─────────────────────────────────────┤
│        Metadata Layer                │
│   ┌──────────────────────────────┐  │
│   │ Manifest List (snapshot)     │  │
│   │   └── Manifest Files         │  │
│   │       └── Data File Refs     │  │
│   └──────────────────────────────┘  │
├─────────────────────────────────────┤
│          Data Files                  │
│    (Parquet, ORC, Avro)             │
└─────────────────────────────────────┘

Common Operations

-- Create Iceberg table
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id INT,
    event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id));

-- Time travel
SELECT * FROM events VERSION AS OF 123456;
SELECT * FROM events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Partition evolution (no data rewrite!)
ALTER TABLE events ADD PARTITION FIELD bucket(8, region);

-- Expire snapshots
CALL system.expire_snapshots('events', TIMESTAMP '2024-01-01 00:00:00');

Interview Question: "What's the advantage of Iceberg's hidden partitioning?"

Answer: "With hidden partitioning, users write queries using the actual column values (WHERE event_time > '2024-01-01'), and Iceberg automatically translates this to the correct partition predicate. This means:

  1. Partition evolution doesn't require query changes
  2. No need to know the partitioning scheme
  3. Partition transforms (day, month, bucket) are abstracted away"

Apache Hudi

Uber-developed format for incremental processing.

Key Features

FeatureBenefit
UpsertsNative support for record-level updates
Incremental queriesRead only changed records
Record-level indexingFast lookups and updates
CompactionMerge small files automatically
Table typesCopy-on-Write vs Merge-on-Read

Table Types Comparison

AspectCopy-on-Write (CoW)Merge-on-Read (MoR)
Write latencyHigher (rewrites files)Lower (appends logs)
Read latencyLower (pre-merged)Higher (merges at read)
Best forRead-heavy workloadsWrite-heavy workloads
StorageMore (full rewrites)Less (only changes)

Format Comparison

FeatureDelta LakeIcebergHudi
OriginDatabricksNetflixUber
Primary UseGeneral lakehouseAnalyticsCDC/Streaming
ACID
Time Travel
Schema EvolutionGoodExcellentGood
Partition EvolutionLimitedExcellentLimited
UpsertsGoodGoodExcellent
Incremental ReadsChange Data FeedIncrementalNative
Engine SupportBest SparkMulti-engineGood Spark

Interview Question: "How do you choose between Delta, Iceberg, and Hudi?"

Answer Framework:

If You NeedChoose
Databricks ecosystemDelta Lake
Engine portabilityIceberg
Heavy CDC/upsertsHudi
Simple operationsDelta Lake
Partition flexibilityIceberg
Real-time ingestionHudi

Medallion Architecture

Common lakehouse design pattern.

┌───────────┐    ┌───────────┐    ┌───────────┐
│  Bronze   │───▶│  Silver   │───▶│   Gold    │
│  (Raw)    │    │ (Cleaned) │    │(Aggregated│
└───────────┘    └───────────┘    └───────────┘
LayerPurposeSchemaQuality
BronzeRaw ingestionSchema-on-readAs-is from source
SilverCleaned, conformedEnforcedValidated, deduped
GoldBusiness aggregatesModeledReady for BI

Implementation Example

# Bronze: Raw ingestion
raw_df = spark.read.json("/landing/events/")
raw_df.write.format("delta").mode("append") \
    .save("/bronze/events")

# Silver: Clean and deduplicate
bronze_df = spark.read.format("delta").load("/bronze/events")
silver_df = bronze_df \
    .dropDuplicates(["event_id"]) \
    .filter(col("user_id").isNotNull()) \
    .withColumn("event_date", to_date("event_time"))
silver_df.write.format("delta").mode("overwrite") \
    .partitionBy("event_date") \
    .save("/silver/events")

# Gold: Business aggregates
silver_df = spark.read.format("delta").load("/silver/events")
gold_df = silver_df.groupBy("event_date", "event_type") \
    .agg(count("*").alias("event_count"))
gold_df.write.format("delta").mode("overwrite") \
    .save("/gold/daily_event_summary")

Interview Insight: The lakehouse represents the convergence of data lakes and warehouses. Understanding when to use lakehouse vs traditional warehouse demonstrates architectural maturity.

Next, we'll dive into slowly changing dimensions and their implementations. :::

Quick check: how does this lesson land for you?

Quiz

Module 3: Data Modeling & Warehousing

Take Quiz
FREE WEEKLY NEWSLETTER

Stay on the Nerd Track

One email per week — courses, deep dives, tools, and AI experiments.

No spam. Unsubscribe anytime.