Query Optimization & Execution Plans

Understanding query optimization demonstrates the difference between junior and senior data engineers. Interviewers test this to assess production readiness.

Reading Execution Plans

The EXPLAIN Command

-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;

-- MySQL
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;

-- Snowflake
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Key Metrics to Understand

Metric	Meaning	Good vs Bad
Rows Examined	Records scanned	Lower is better
Actual Time	Execution duration	Context dependent
Cost	Estimated resource usage	Relative comparison
Buffer Hits	Cache utilization	Higher is better

Scan Types (Best to Worst)

Scan Type	Speed	When Used
Index Unique Scan	Fastest	Primary key lookup
Index Range Scan	Fast	Range conditions with index
Index Full Scan	Medium	All rows via index
Full Table Scan	Slowest	No suitable index

Interview Question: "This query is slow. How would you diagnose it?"

-- Step 1: Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
AND o.order_date > '2024-01-01';

-- Look for:
-- 1. Full table scans on large tables
-- 2. High row estimates vs actual
-- 3. Sort operations without indexes
-- 4. Nested loop joins on large datasets

Common Performance Anti-Patterns

1. SELECT * Instead of Specific Columns

-- Bad: Fetches all columns
SELECT * FROM orders WHERE status = 'pending';

-- Good: Only needed columns
SELECT order_id, customer_id, total
FROM orders WHERE status = 'pending';

Why it matters: Wide tables with many columns waste I/O and memory.

2. Functions on Indexed Columns

-- Bad: Function prevents index usage
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;

-- Good: Preserves index usage
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';

3. Implicit Type Conversions

-- Bad: customer_id is INT, comparing to string
SELECT * FROM customers WHERE customer_id = '123';

-- Good: Matching types
SELECT * FROM customers WHERE customer_id = 123;

4. OR Conditions on Different Columns

-- Bad: Often results in full scan
SELECT * FROM orders
WHERE customer_id = 123 OR product_id = 456;

-- Good: Union of indexed queries
SELECT * FROM orders WHERE customer_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;

5. Correlated Subqueries

-- Bad: Executes subquery for each row
SELECT o.order_id,
    (SELECT COUNT(*) FROM order_items oi
     WHERE oi.order_id = o.order_id) as item_count
FROM orders o;

-- Good: Join with aggregation
SELECT o.order_id, COUNT(oi.item_id) as item_count
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id;

Join Optimization

Understanding Join Algorithms

Algorithm	Best For	Memory
Nested Loop	Small tables, indexed lookups	Low
Hash Join	Large tables, no index	High
Merge Join	Sorted data, equi-joins	Medium

Interview Pattern: "Why might a hash join be chosen over nested loop?"

Answer: Hash joins are efficient for large tables without indexes. The optimizer builds a hash table from the smaller table and probes it with the larger table, avoiding repeated scans.

Join Order Matters

-- Query with multiple joins
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'USA';

-- Optimization: Filter early to reduce join size
SELECT *
FROM (SELECT * FROM customers WHERE country = 'USA') c
JOIN orders o ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;

Statistics and Cardinality

Why Statistics Matter

The query optimizer relies on table statistics to choose execution plans.

-- PostgreSQL: Update statistics
ANALYZE table_name;

-- MySQL: Update statistics
ANALYZE TABLE table_name;

-- Check cardinality estimates
EXPLAIN (ANALYZE, BUFFERS) SELECT ...

Cardinality Estimation Issues

Problem	Symptom	Solution
Stale statistics	Wrong row estimates	Run ANALYZE
Skewed data	Bad join order	Histogram statistics
Correlated columns	Underestimates	Extended statistics

Interview Question: "The execution plan shows 100 estimated rows but 1M actual rows. What's wrong?"

Answer: Statistics are likely stale or there's data skew. Solutions:

Run ANALYZE to update statistics
Create histogram statistics for skewed columns
Use query hints if necessary

Query Rewriting Techniques

Exists vs IN vs JOIN

-- IN: Good for small subquery results
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM premium_customers);

-- EXISTS: Better for correlated conditions
SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM premium_customers p
    WHERE p.id = o.customer_id
);

-- JOIN: Often best for large datasets
SELECT DISTINCT o.* FROM orders o
JOIN premium_customers p ON o.customer_id = p.id;

Pagination Optimization

-- Bad: OFFSET scans all previous rows
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;

-- Good: Keyset pagination (seek method)
SELECT * FROM orders
WHERE id > 100000
ORDER BY id LIMIT 10;

Interview Insight: When asked about query optimization, always structure your response: (1) Identify the problem using EXPLAIN, (2) Analyze the bottleneck, (3) Propose specific solutions, (4) Measure the improvement.

Next, we'll explore indexing strategies for optimal query performance. :::