SQL Mastery & Query Optimization

Query Optimization & Execution Plans

4 min read

Understanding query optimization demonstrates the difference between junior and senior data engineers. Interviewers test this to assess production readiness.

Reading Execution Plans

The EXPLAIN Command

-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;

-- MySQL
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;

-- Snowflake
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Key Metrics to Understand

Metric Meaning Good vs Bad
Rows Examined Records scanned Lower is better
Actual Time Execution duration Context dependent
Cost Estimated resource usage Relative comparison
Buffer Hits Cache utilization Higher is better

Scan Types (Best to Worst)

Scan Type Speed When Used
Index Unique Scan Fastest Primary key lookup
Index Range Scan Fast Range conditions with index
Index Full Scan Medium All rows via index
Full Table Scan Slowest No suitable index

Interview Question: "This query is slow. How would you diagnose it?"

-- Step 1: Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
AND o.order_date > '2024-01-01';

-- Look for:
-- 1. Full table scans on large tables
-- 2. High row estimates vs actual
-- 3. Sort operations without indexes
-- 4. Nested loop joins on large datasets

Common Performance Anti-Patterns

1. SELECT * Instead of Specific Columns

-- Bad: Fetches all columns
SELECT * FROM orders WHERE status = 'pending';

-- Good: Only needed columns
SELECT order_id, customer_id, total
FROM orders WHERE status = 'pending';

Why it matters: Wide tables with many columns waste I/O and memory.

2. Functions on Indexed Columns

-- Bad: Function prevents index usage
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;

-- Good: Preserves index usage
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';

3. Implicit Type Conversions

-- Bad: customer_id is INT, comparing to string
SELECT * FROM customers WHERE customer_id = '123';

-- Good: Matching types
SELECT * FROM customers WHERE customer_id = 123;

4. OR Conditions on Different Columns

-- Bad: Often results in full scan
SELECT * FROM orders
WHERE customer_id = 123 OR product_id = 456;

-- Good: Union of indexed queries
SELECT * FROM orders WHERE customer_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;

5. Correlated Subqueries

-- Bad: Executes subquery for each row
SELECT o.order_id,
    (SELECT COUNT(*) FROM order_items oi
     WHERE oi.order_id = o.order_id) as item_count
FROM orders o;

-- Good: Join with aggregation
SELECT o.order_id, COUNT(oi.item_id) as item_count
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id;

Join Optimization

Understanding Join Algorithms

Algorithm Best For Memory
Nested Loop Small tables, indexed lookups Low
Hash Join Large tables, no index High
Merge Join Sorted data, equi-joins Medium

Interview Pattern: "Why might a hash join be chosen over nested loop?"

Answer: Hash joins are efficient for large tables without indexes. The optimizer builds a hash table from the smaller table and probes it with the larger table, avoiding repeated scans.

Join Order Matters

-- Query with multiple joins
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'USA';

-- Optimization: Filter early to reduce join size
SELECT *
FROM (SELECT * FROM customers WHERE country = 'USA') c
JOIN orders o ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;

Statistics and Cardinality

Why Statistics Matter

The query optimizer relies on table statistics to choose execution plans.

-- PostgreSQL: Update statistics
ANALYZE table_name;

-- MySQL: Update statistics
ANALYZE TABLE table_name;

-- Check cardinality estimates
EXPLAIN (ANALYZE, BUFFERS) SELECT ...

Cardinality Estimation Issues

Problem Symptom Solution
Stale statistics Wrong row estimates Run ANALYZE
Skewed data Bad join order Histogram statistics
Correlated columns Underestimates Extended statistics

Interview Question: "The execution plan shows 100 estimated rows but 1M actual rows. What's wrong?"

Answer: Statistics are likely stale or there's data skew. Solutions:

  1. Run ANALYZE to update statistics
  2. Create histogram statistics for skewed columns
  3. Use query hints if necessary

Query Rewriting Techniques

Exists vs IN vs JOIN

-- IN: Good for small subquery results
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM premium_customers);

-- EXISTS: Better for correlated conditions
SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM premium_customers p
    WHERE p.id = o.customer_id
);

-- JOIN: Often best for large datasets
SELECT DISTINCT o.* FROM orders o
JOIN premium_customers p ON o.customer_id = p.id;

Pagination Optimization

-- Bad: OFFSET scans all previous rows
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;

-- Good: Keyset pagination (seek method)
SELECT * FROM orders
WHERE id > 100000
ORDER BY id LIMIT 10;

Interview Insight: When asked about query optimization, always structure your response: (1) Identify the problem using EXPLAIN, (2) Analyze the bottleneck, (3) Propose specific solutions, (4) Measure the improvement.

Next, we'll explore indexing strategies for optimal query performance. :::

Quiz

Module 2: SQL Mastery & Query Optimization

Take Quiz