SQL Mastery & Query Optimization
Query Optimization & Execution Plans
Understanding query optimization demonstrates the difference between junior and senior data engineers. Interviewers test this to assess production readiness.
Reading Execution Plans
The EXPLAIN Command
-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;
-- MySQL
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;
-- Snowflake
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;
Key Metrics to Understand
| Metric | Meaning | Good vs Bad |
|---|---|---|
| Rows Examined | Records scanned | Lower is better |
| Actual Time | Execution duration | Context dependent |
| Cost | Estimated resource usage | Relative comparison |
| Buffer Hits | Cache utilization | Higher is better |
Scan Types (Best to Worst)
| Scan Type | Speed | When Used |
|---|---|---|
| Index Unique Scan | Fastest | Primary key lookup |
| Index Range Scan | Fast | Range conditions with index |
| Index Full Scan | Medium | All rows via index |
| Full Table Scan | Slowest | No suitable index |
Interview Question: "This query is slow. How would you diagnose it?"
-- Step 1: Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
AND o.order_date > '2024-01-01';
-- Look for:
-- 1. Full table scans on large tables
-- 2. High row estimates vs actual
-- 3. Sort operations without indexes
-- 4. Nested loop joins on large datasets
Common Performance Anti-Patterns
1. SELECT * Instead of Specific Columns
-- Bad: Fetches all columns
SELECT * FROM orders WHERE status = 'pending';
-- Good: Only needed columns
SELECT order_id, customer_id, total
FROM orders WHERE status = 'pending';
Why it matters: Wide tables with many columns waste I/O and memory.
2. Functions on Indexed Columns
-- Bad: Function prevents index usage
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;
-- Good: Preserves index usage
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';
3. Implicit Type Conversions
-- Bad: customer_id is INT, comparing to string
SELECT * FROM customers WHERE customer_id = '123';
-- Good: Matching types
SELECT * FROM customers WHERE customer_id = 123;
4. OR Conditions on Different Columns
-- Bad: Often results in full scan
SELECT * FROM orders
WHERE customer_id = 123 OR product_id = 456;
-- Good: Union of indexed queries
SELECT * FROM orders WHERE customer_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;
5. Correlated Subqueries
-- Bad: Executes subquery for each row
SELECT o.order_id,
(SELECT COUNT(*) FROM order_items oi
WHERE oi.order_id = o.order_id) as item_count
FROM orders o;
-- Good: Join with aggregation
SELECT o.order_id, COUNT(oi.item_id) as item_count
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id;
Join Optimization
Understanding Join Algorithms
| Algorithm | Best For | Memory |
|---|---|---|
| Nested Loop | Small tables, indexed lookups | Low |
| Hash Join | Large tables, no index | High |
| Merge Join | Sorted data, equi-joins | Medium |
Interview Pattern: "Why might a hash join be chosen over nested loop?"
Answer: Hash joins are efficient for large tables without indexes. The optimizer builds a hash table from the smaller table and probes it with the larger table, avoiding repeated scans.
Join Order Matters
-- Query with multiple joins
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'USA';
-- Optimization: Filter early to reduce join size
SELECT *
FROM (SELECT * FROM customers WHERE country = 'USA') c
JOIN orders o ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;
Statistics and Cardinality
Why Statistics Matter
The query optimizer relies on table statistics to choose execution plans.
-- PostgreSQL: Update statistics
ANALYZE table_name;
-- MySQL: Update statistics
ANALYZE TABLE table_name;
-- Check cardinality estimates
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
Cardinality Estimation Issues
| Problem | Symptom | Solution |
|---|---|---|
| Stale statistics | Wrong row estimates | Run ANALYZE |
| Skewed data | Bad join order | Histogram statistics |
| Correlated columns | Underestimates | Extended statistics |
Interview Question: "The execution plan shows 100 estimated rows but 1M actual rows. What's wrong?"
Answer: Statistics are likely stale or there's data skew. Solutions:
- Run ANALYZE to update statistics
- Create histogram statistics for skewed columns
- Use query hints if necessary
Query Rewriting Techniques
Exists vs IN vs JOIN
-- IN: Good for small subquery results
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM premium_customers);
-- EXISTS: Better for correlated conditions
SELECT * FROM orders o
WHERE EXISTS (
SELECT 1 FROM premium_customers p
WHERE p.id = o.customer_id
);
-- JOIN: Often best for large datasets
SELECT DISTINCT o.* FROM orders o
JOIN premium_customers p ON o.customer_id = p.id;
Pagination Optimization
-- Bad: OFFSET scans all previous rows
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;
-- Good: Keyset pagination (seek method)
SELECT * FROM orders
WHERE id > 100000
ORDER BY id LIMIT 10;
Interview Insight: When asked about query optimization, always structure your response: (1) Identify the problem using EXPLAIN, (2) Analyze the bottleneck, (3) Propose specific solutions, (4) Measure the improvement.
Next, we'll explore indexing strategies for optimal query performance. :::