SQL Mastery & Query Optimization

Query Optimization & Execution Plans

4 min read

Understanding query optimization demonstrates the difference between junior and senior data engineers. Interviewers test this to assess production readiness.

Reading Execution Plans

The EXPLAIN Command

-- PostgreSQL
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 123;

-- MySQL
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 123;

-- Snowflake
EXPLAIN SELECT * FROM orders WHERE customer_id = 123;

Key Metrics to Understand

MetricMeaningGood vs Bad
Rows ExaminedRecords scannedLower is better
Actual TimeExecution durationContext dependent
CostEstimated resource usageRelative comparison
Buffer HitsCache utilizationHigher is better

Scan Types (Best to Worst)

Scan TypeSpeedWhen Used
Index Unique ScanFastestPrimary key lookup
Index Range ScanFastRange conditions with index
Index Full ScanMediumAll rows via index
Full Table ScanSlowestNo suitable index

Interview Question: "This query is slow. How would you diagnose it?"

-- Step 1: Get execution plan
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA'
AND o.order_date > '2024-01-01';

-- Look for:
-- 1. Full table scans on large tables
-- 2. High row estimates vs actual
-- 3. Sort operations without indexes
-- 4. Nested loop joins on large datasets

Common Performance Anti-Patterns

1. SELECT * Instead of Specific Columns

-- Bad: Fetches all columns
SELECT * FROM orders WHERE status = 'pending';

-- Good: Only needed columns
SELECT order_id, customer_id, total
FROM orders WHERE status = 'pending';

Why it matters: Wide tables with many columns waste I/O and memory.

2. Functions on Indexed Columns

-- Bad: Function prevents index usage
SELECT * FROM orders
WHERE YEAR(order_date) = 2024;

-- Good: Preserves index usage
SELECT * FROM orders
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01';

3. Implicit Type Conversions

-- Bad: customer_id is INT, comparing to string
SELECT * FROM customers WHERE customer_id = '123';

-- Good: Matching types
SELECT * FROM customers WHERE customer_id = 123;

4. OR Conditions on Different Columns

-- Bad: Often results in full scan
SELECT * FROM orders
WHERE customer_id = 123 OR product_id = 456;

-- Good: Union of indexed queries
SELECT * FROM orders WHERE customer_id = 123
UNION
SELECT * FROM orders WHERE product_id = 456;

5. Correlated Subqueries

-- Bad: Executes subquery for each row
SELECT o.order_id,
    (SELECT COUNT(*) FROM order_items oi
     WHERE oi.order_id = o.order_id) as item_count
FROM orders o;

-- Good: Join with aggregation
SELECT o.order_id, COUNT(oi.item_id) as item_count
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id;

Join Optimization

Understanding Join Algorithms

AlgorithmBest ForMemory
Nested LoopSmall tables, indexed lookupsLow
Hash JoinLarge tables, no indexHigh
Merge JoinSorted data, equi-joinsMedium

Interview Pattern: "Why might a hash join be chosen over nested loop?"

Answer: Hash joins are efficient for large tables without indexes. The optimizer builds a hash table from the smaller table and probes it with the larger table, avoiding repeated scans.

Join Order Matters

-- Query with multiple joins
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id
WHERE c.country = 'USA';

-- Optimization: Filter early to reduce join size
SELECT *
FROM (SELECT * FROM customers WHERE country = 'USA') c
JOIN orders o ON o.customer_id = c.id
JOIN products p ON o.product_id = p.id;

Statistics and Cardinality

Why Statistics Matter

The query optimizer relies on table statistics to choose execution plans.

-- PostgreSQL: Update statistics
ANALYZE table_name;

-- MySQL: Update statistics
ANALYZE TABLE table_name;

-- Check cardinality estimates
EXPLAIN (ANALYZE, BUFFERS) SELECT ...

Cardinality Estimation Issues

ProblemSymptomSolution
Stale statisticsWrong row estimatesRun ANALYZE
Skewed dataBad join orderHistogram statistics
Correlated columnsUnderestimatesExtended statistics

Interview Question: "The execution plan shows 100 estimated rows but 1M actual rows. What's wrong?"

Answer: Statistics are likely stale or there's data skew. Solutions:

  1. Run ANALYZE to update statistics
  2. Create histogram statistics for skewed columns
  3. Use query hints if necessary

Query Rewriting Techniques

Exists vs IN vs JOIN

-- IN: Good for small subquery results
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM premium_customers);

-- EXISTS: Better for correlated conditions
SELECT * FROM orders o
WHERE EXISTS (
    SELECT 1 FROM premium_customers p
    WHERE p.id = o.customer_id
);

-- JOIN: Often best for large datasets
SELECT DISTINCT o.* FROM orders o
JOIN premium_customers p ON o.customer_id = p.id;

Pagination Optimization

-- Bad: OFFSET scans all previous rows
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 100000;

-- Good: Keyset pagination (seek method)
SELECT * FROM orders
WHERE id > 100000
ORDER BY id LIMIT 10;

Interview Insight: When asked about query optimization, always structure your response: (1) Identify the problem using EXPLAIN, (2) Analyze the bottleneck, (3) Propose specific solutions, (4) Measure the improvement.

Next, we'll explore indexing strategies for optimal query performance. :::

Quick check: how does this lesson land for you?

Quiz

Module 2: SQL Mastery & Query Optimization

Take Quiz
FREE WEEKLY NEWSLETTER

Stay on the Nerd Track

One email per week — courses, deep dives, tools, and AI experiments.

No spam. Unsubscribe anytime.