Data Modeling & Warehousing
Data Lakes & Lakehouse Architecture
The lakehouse paradigm combines the flexibility of data lakes with the reliability of data warehouses. Understanding this evolution is crucial for modern data engineering interviews.
Data Lake Evolution
Traditional Data Lake Challenges
| Challenge | Impact |
|---|---|
| No ACID transactions | Data corruption, inconsistent reads |
| No schema enforcement | Garbage in, garbage out |
| No indexing | Full scans for every query |
| Stale metadata | Slow query planning |
| No time travel | Can't recover from bad writes |
The Lakehouse Solution
┌─────────────────────────────────────────────────────┐
│                    Query Engines                    │
│   (Spark, Trino, Dremio, Athena, Databricks SQL)    │
├─────────────────────────────────────────────────────┤
│                 Table Format Layer                  │
│         (Delta Lake, Apache Iceberg, Hudi)          │
├─────────────────────────────────────────────────────┤
│                   Object Storage                    │
│               (S3, GCS, ADLS, MinIO)                │
└─────────────────────────────────────────────────────┘
Delta Lake
An open table format originally created by Databricks, now a Linux Foundation project.
Key Features
| Feature | Benefit |
|---|---|
| ACID transactions | Reliable concurrent writes |
| Schema evolution | Add/rename columns without rewriting |
| Time travel | Query any previous version |
| Z-ordering | Multi-dimensional clustering |
| Change data feed | Stream changes for downstream |
Common Operations
# Create Delta table
df.write.format("delta").save("/data/events")
# Time travel - read previous version
spark.read.format("delta") \
.option("versionAsOf", 5) \
.load("/data/events")
# Time travel - read by timestamp
spark.read.format("delta") \
.option("timestampAsOf", "2024-01-01") \
.load("/data/events")
# Optimize and Z-order
from delta.tables import DeltaTable
delta_table = DeltaTable.forPath(spark, "/data/events")
delta_table.optimize().executeZOrderBy("user_id", "event_date")
# Vacuum old versions (keep 7 days)
delta_table.vacuum(168) # hours
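The change data feed can be consumed with the standard Delta reader; a minimal sketch, assuming the table was created with the property delta.enableChangeDataFeed = true (the starting version below is illustrative):
# Read the change data feed from version 5 onward
changes = spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 5) \
    .load("/data/events")
# Each row carries _change_type, _commit_version, _commit_timestamp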
Interview Question: "How does Delta Lake achieve ACID transactions?"
Answer: "Delta Lake uses a transaction log (_delta_log/) that records every change as JSON commits. Writes are atomic through optimistic concurrency—each write creates a new commit file. Reads use the log to determine which files to read. Conflicts are detected by checking if committed transactions overlap with the current write."
Apache Iceberg
Netflix-developed format, now an Apache project.
Key Features
| Feature | Benefit |
|---|---|
| Hidden partitioning | Queries filter on raw column values; no partition columns to remember |
| Schema evolution | Full support (add, drop, rename, reorder) |
| Time travel | Snapshot-based versioning |
| Partition evolution | Change partitioning without migration |
| Multiple engines | Spark, Flink, Trino, Hive, Presto |
Architecture
┌─────────────────────────────────────┐
│          Catalog (Metadata)         │
│   (Hive Metastore, Glue, Nessie)    │
├─────────────────────────────────────┤
│            Metadata Layer           │
│  ┌──────────────────────────────┐   │
│  │ Manifest List (snapshot)     │   │
│  │  └── Manifest Files          │   │
│  │      └── Data File Refs      │   │
│  └──────────────────────────────┘   │
├─────────────────────────────────────┤
│             Data Files              │
│        (Parquet, ORC, Avro)         │
└─────────────────────────────────────┘
Common Operations
-- Create Iceberg table
CREATE TABLE events (
event_id BIGINT,
event_time TIMESTAMP,
user_id INT,
event_type STRING
)
USING iceberg
PARTITIONED BY (days(event_time), bucket(16, user_id));
-- Time travel
SELECT * FROM events VERSION AS OF 123456;
SELECT * FROM events TIMESTAMP AS OF '2024-01-01 00:00:00';
-- Partition evolution (no data rewrite!)
ALTER TABLE events ADD PARTITION FIELD bucket(8, region);
-- Expire old snapshots (in practice the procedure call is
-- catalog-qualified, e.g. CALL my_catalog.system.expire_snapshots(...))
CALL system.expire_snapshots('events', TIMESTAMP '2024-01-01 00:00:00');
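To find valid targets for VERSION AS OF and TIMESTAMP AS OF, Iceberg exposes metadata tables alongside each table; a minimal sketch, assuming a Spark session with an Iceberg catalog configured:
# List snapshots (snapshot_id feeds VERSION AS OF)
spark.sql("""
    SELECT committed_at, snapshot_id, operation
    FROM events.snapshots
    ORDER BY committed_at
""").show(truncate=False)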
Interview Question: "What's the advantage of Iceberg's hidden partitioning?"
Answer: "With hidden partitioning, users write queries using the actual column values (WHERE event_time > '2024-01-01'), and Iceberg automatically translates this to the correct partition predicate. This means:
- Partition evolution doesn't require query changes
- No need to know the partitioning scheme
- Partition transforms (day, month, bucket) are abstracted away"
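A minimal sketch of this in practice: the query filters only on event_time, and Iceberg maps the predicate onto the days(event_time) partitions declared above:
# No partition column in sight; Iceberg prunes day partitions from
# the event_time predicate automatically
spark.sql("""
    SELECT event_type, count(*) AS events
    FROM events
    WHERE event_time >= TIMESTAMP '2024-01-01 00:00:00'
    GROUP BY event_type
""").show()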
Apache Hudi
Uber-developed format for incremental processing.
Key Features
| Feature | Benefit |
|---|---|
| Upserts | Native support for record-level updates |
| Incremental queries | Read only changed records |
| Record-level indexing | Fast lookups and updates |
| Compaction | Merge small files automatically |
| Table types | Copy-on-Write vs Merge-on-Read |
Table Types Comparison
| Aspect | Copy-on-Write (CoW) | Merge-on-Read (MoR) |
|---|---|---|
| Write latency | Higher (rewrites files) | Lower (appends logs) |
| Read latency | Lower (pre-merged) | Higher (merges at read) |
| Best for | Read-heavy workloads | Write-heavy workloads |
| Storage | More (full rewrites) | Less (only changes) |
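A minimal upsert sketch, assuming Hudi's Spark datasource is on the classpath; the table name, key fields, and path are illustrative:
# Upsert into a Merge-on-Read table: rows with an existing event_id are
# updated (latest event_time wins via the precombine field), new rows inserted
df.write.format("hudi") \
    .option("hoodie.table.name", "events") \
    .option("hoodie.datasource.write.recordkey.field", "event_id") \
    .option("hoodie.datasource.write.precombine.field", "event_time") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
    .mode("append") \
    .save("/data/hudi/events")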
Format Comparison
| Feature | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Origin | Databricks | Netflix | Uber |
| Primary Use | General lakehouse | Analytics | CDC/Streaming |
| ACID | ✅ | ✅ | ✅ |
| Time Travel | ✅ | ✅ | ✅ |
| Schema Evolution | Good | Excellent | Good |
| Partition Evolution | Limited | Excellent | Limited |
| Upserts | Good | Good | Excellent |
| Incremental Reads | Change Data Feed | Incremental | Native |
| Engine Support | Strongest with Spark | Broad multi-engine | Strongest with Spark/Flink |
Interview Question: "How do you choose between Delta, Iceberg, and Hudi?"
Answer Framework:
| If You Need | Choose |
|---|---|
| Databricks ecosystem | Delta Lake |
| Engine portability | Iceberg |
| Heavy CDC/upserts | Hudi |
| Simple operations | Delta Lake |
| Partition flexibility | Iceberg |
| Real-time ingestion | Hudi |
Medallion Architecture
Common lakehouse design pattern.
┌───────────┐    ┌───────────┐    ┌────────────┐
│  Bronze   │───▶│  Silver   │───▶│    Gold    │
│   (Raw)   │    │ (Cleaned) │    │(Aggregated)│
└───────────┘    └───────────┘    └────────────┘
| Layer | Purpose | Schema | Quality |
|---|---|---|---|
| Bronze | Raw ingestion | Schema-on-read | As-is from source |
| Silver | Cleaned, conformed | Enforced | Validated, deduped |
| Gold | Business aggregates | Modeled | Ready for BI |
Implementation Example
from pyspark.sql.functions import col, count, to_date

# Bronze: Raw ingestion
raw_df = spark.read.json("/landing/events/")
raw_df.write.format("delta").mode("append") \
.save("/bronze/events")
# Silver: Clean and deduplicate
bronze_df = spark.read.format("delta").load("/bronze/events")
silver_df = bronze_df \
.dropDuplicates(["event_id"]) \
.filter(col("user_id").isNotNull()) \
.withColumn("event_date", to_date("event_time"))
silver_df.write.format("delta").mode("overwrite") \
.partitionBy("event_date") \
.save("/silver/events")
# Gold: Business aggregates
silver_df = spark.read.format("delta").load("/silver/events")
gold_df = silver_df.groupBy("event_date", "event_type") \
.agg(count("*").alias("event_count"))
gold_df.write.format("delta").mode("overwrite") \
.save("/gold/daily_event_summary")
Interview Insight: The lakehouse represents the convergence of data lakes and warehouses. Knowing when a lakehouse is appropriate versus a traditional warehouse demonstrates architectural maturity.
Next, we'll dive into slowly changing dimensions and their implementations.