Data Lakes & Lakehouse Architecture

The lakehouse paradigm combines the flexibility of data lakes with the reliability of data warehouses. Understanding this evolution is crucial for modern data engineering interviews.

Data Lake Evolution

Traditional Data Lake Challenges

| Challenge | Impact |
|---|---|
| No ACID transactions | Data corruption, inconsistent reads |
| No schema enforcement | Garbage in, garbage out |
| No indexing | Full table scans for every query |
| Stale metadata | Slow query planning |
| No time travel | Can't recover from bad writes |

The Lakehouse Solution

┌─────────────────────────────────────────────────────┐
│                  Query Engines                       │
│     (Spark, Trino, Dremio, Athena, Databricks SQL)  │
├─────────────────────────────────────────────────────┤
│              Table Format Layer                      │
│         (Delta Lake, Apache Iceberg, Hudi)          │
├─────────────────────────────────────────────────────┤
│              Object Storage                          │
│           (S3, GCS, ADLS, MinIO)                    │
└─────────────────────────────────────────────────────┘

Delta Lake

Databricks' open-source lakehouse format.

Key Features

| Feature | Benefit |
|---|---|
| ACID transactions | Reliable concurrent writes |
| Schema evolution | Add or rename columns without rewriting data |
| Time travel | Query any previous version of the table |
| Z-ordering | Multi-dimensional clustering for data skipping |
| Change data feed | Stream row-level changes to downstream consumers |

Common Operations

# Create Delta table
df.write.format("delta").save("/data/events")

# Time travel - read previous version
spark.read.format("delta") \
    .option("versionAsOf", 5) \
    .load("/data/events")

# Time travel - read by timestamp
spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("/data/events")

# Optimize and Z-order
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.optimize().executeZOrderBy("user_id", "event_date")

# Vacuum old versions (keep 7 days)
delta_table.vacuum(168)  # hours
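
The change data feed listed in the features table can be consumed as a versioned read. A minimal sketch, assuming the table was created with delta.enableChangeDataFeed set to true and that version 5 exists (both illustrative):

# Read the change data feed starting from a given version
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .load("/data/events")

# Rows carry _change_type (insert / update_preimage / update_postimage / delete),
# plus _commit_version and _commit_timestamp metadata columns
changes.filter("_change_type = 'delete'").show()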

Interview Question: "How does Delta Lake achieve ACID transactions?"

Answer: "Delta Lake uses a transaction log (_delta_log/) that records every change as JSON commits. Writes are atomic through optimistic concurrency—each write creates a new commit file. Reads use the log to determine which files to read. Conflicts are detected by checking if committed transactions overlap with the current write."

Apache Iceberg

Netflix-developed format, now an Apache project.

Key Features

| Feature | Benefit |
|---|---|
| Hidden partitioning | Queries prune partitions without referencing partition columns |
| Schema evolution | Full support (add, drop, rename, reorder) |
| Time travel | Snapshot-based versioning |
| Partition evolution | Change the partition spec without rewriting or migrating data |
| Multiple engines | Spark, Flink, Trino, Hive, Presto |

Architecture

┌─────────────────────────────────────┐
│          Catalog (Metadata)          │
│  (Hive Metastore, Glue, Nessie)     │
├─────────────────────────────────────┤
│        Metadata Layer                │
│   ┌──────────────────────────────┐  │
│   │ Manifest List (snapshot)     │  │
│   │   └── Manifest Files         │  │
│   │       └── Data File Refs     │  │
│   └──────────────────────────────┘  │
├─────────────────────────────────────┤
│          Data Files                  │
│    (Parquet, ORC, Avro)             │
└─────────────────────────────────────┘

Common Operations

-- Create Iceberg table
CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id INT,
    event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id));

-- Time travel
SELECT * FROM events VERSION AS OF 123456;
SELECT * FROM events TIMESTAMP AS OF '2024-01-01 00:00:00';

-- Partition evolution (no data rewrite!)
ALTER TABLE events ADD PARTITION FIELD bucket(8, region);

-- Expire snapshots
CALL system.expire_snapshots('events', TIMESTAMP '2024-01-01 00:00:00');
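
Snapshots and data files can also be inspected through Iceberg's metadata tables. A sketch, assuming the table is reachable as db.events in an Iceberg-enabled Spark catalog (the qualified name is illustrative):

# One row per snapshot/commit, with its id, timestamp, and operation
spark.sql("SELECT snapshot_id, committed_at, operation FROM db.events.snapshots").show()

# Data files referenced by the current snapshot
spark.sql("SELECT file_path, record_count FROM db.events.files").show(truncate=False)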

Interview Question: "What's the advantage of Iceberg's hidden partitioning?"

Answer: "With hidden partitioning, users write queries using the actual column values (WHERE event_time > '2024-01-01'), and Iceberg automatically translates this to the correct partition predicate. This means:

  1. Partition evolution doesn't require query changes
  2. No need to know the partitioning scheme
  3. Partition transforms (day, month, bucket) are abstracted away"
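
As a concrete sketch against the events table created above: the same query keeps working before and after the partition spec changes, because the filter is on the raw column rather than on a derived partition column.

# The filter is on event_time itself; Iceberg maps it to the
# days(event_time) partitions (and, after evolution, the new spec) for pruning
spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY event_type
""").show()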

Apache Hudi

Uber-developed format for incremental processing.

Key Features

| Feature | Benefit |
|---|---|
| Upserts | Native support for record-level updates |
| Incremental queries | Read only changed records |
| Record-level indexing | Fast lookups and updates |
| Compaction | Merge small files automatically |
| Table types | Copy-on-Write vs. Merge-on-Read |

Table Types Comparison

| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
|---|---|---|
| Write latency | Higher (rewrites files) | Lower (appends logs) |
| Read latency | Lower (pre-merged) | Higher (merges at read) |
| Best for | Read-heavy workloads | Write-heavy workloads |
| Storage | More (full rewrites) | Less (only changes) |
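
A minimal PySpark sketch of an upsert and an incremental read; updates_df, the paths, and the instant time are hypothetical, and the table type can be switched between COPY_ON_WRITE and MERGE_ON_READ:

# Upsert a batch of changed records into a Merge-on-Read Hudi table
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",     # record key for upserts
    "hoodie.datasource.write.precombine.field": "event_time",  # latest record wins
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
updates_df.write.format("hudi").options(**hudi_options) \
    .mode("append").save("/data/hudi/events")

# Incremental query: read only records committed after a given instant
incremental_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .load("/data/hudi/events")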

Format Comparison

| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Origin | Databricks | Netflix | Uber |
| Primary use | General lakehouse | Analytics | CDC/streaming |
| ACID transactions | Yes | Yes | Yes |
| Time travel | Yes | Yes | Yes |
| Schema evolution | Good | Excellent | Good |
| Partition evolution | Limited | Excellent | Limited |
| Upserts | Good | Good | Excellent |
| Incremental reads | Change Data Feed | Incremental scans | Native |
| Engine support | Best with Spark | Multi-engine | Good with Spark |

Interview Question: "How do you choose between Delta, Iceberg, and Hudi?"

Answer Framework:

| If you need | Choose |
|---|---|
| Databricks ecosystem | Delta Lake |
| Engine portability | Iceberg |
| Heavy CDC/upserts | Hudi |
| Simple operations | Delta Lake |
| Partition flexibility | Iceberg |
| Real-time ingestion | Hudi |

Medallion Architecture

Common lakehouse design pattern.

┌───────────┐    ┌───────────┐    ┌────────────┐
│  Bronze   │───▶│  Silver   │───▶│    Gold    │
│  (Raw)    │    │ (Cleaned) │    │(Aggregated)│
└───────────┘    └───────────┘    └────────────┘

| Layer | Purpose | Schema | Quality |
|---|---|---|---|
| Bronze | Raw ingestion | Schema-on-read | As-is from source |
| Silver | Cleaned, conformed | Enforced | Validated, deduped |
| Gold | Business aggregates | Modeled | Ready for BI |

Implementation Example

from pyspark.sql.functions import col, count, to_date

# Bronze: Raw ingestion
raw_df = spark.read.json("/landing/events/")
raw_df.write.format("delta").mode("append") \
    .save("/bronze/events")

# Silver: Clean and deduplicate
bronze_df = spark.read.format("delta").load("/bronze/events")
silver_df = bronze_df \
    .dropDuplicates(["event_id"]) \
    .filter(col("user_id").isNotNull()) \
    .withColumn("event_date", to_date("event_time"))
silver_df.write.format("delta").mode("overwrite") \
    .partitionBy("event_date") \
    .save("/silver/events")

# Gold: Business aggregates
silver_df = spark.read.format("delta").load("/silver/events")
gold_df = silver_df.groupBy("event_date", "event_type") \
    .agg(count("*").alias("event_count"))
gold_df.write.format("delta").mode("overwrite") \
    .save("/gold/daily_event_summary")

Interview Insight: The lakehouse represents the convergence of data lakes and warehouses. Being able to explain when a lakehouse is a better fit than a traditional warehouse demonstrates architectural maturity.

Next, we'll dive into slowly changing dimensions and their implementations.
