SQL Mastery & Query Optimization
Indexing Strategies
Index design is a critical data engineering skill. Interview questions often ask you to design indexes for specific query patterns or troubleshoot slow queries.
Index Types and Use Cases
B-Tree Indexes (Default)
The most common index type, optimal for equality and range queries.
-- Create B-Tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Good for:
SELECT * FROM orders WHERE customer_id = 123;
SELECT * FROM orders WHERE customer_id > 100 AND customer_id < 200;
SELECT * FROM orders WHERE customer_id IN (1, 2, 3);
| Operation | B-Tree Performance |
|---|---|
= equality |
Excellent |
<, >, BETWEEN |
Excellent |
LIKE 'prefix%' |
Good |
LIKE '%suffix' |
Poor (full scan) |
IS NULL |
Good |
Hash Indexes
Optimized for exact equality lookups only.
-- PostgreSQL hash index
CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);
-- Only efficient for:
SELECT * FROM orders WHERE status = 'pending';
-- NOT useful for:
SELECT * FROM orders WHERE status != 'pending'; -- Full scan
SELECT * FROM orders WHERE status IN ('a', 'b'); -- Multiple lookups
Composite (Multi-Column) Indexes
Critical for queries filtering on multiple columns.
-- Composite index
CREATE INDEX idx_orders_customer_date
ON orders(customer_id, order_date);
Left-Most Prefix Rule: The index is used when queries filter on leftmost columns.
| Query | Uses Index? |
|---|---|
WHERE customer_id = 1 |
✅ Yes |
WHERE customer_id = 1 AND order_date = '2024-01-01' |
✅ Yes |
WHERE order_date = '2024-01-01' |
❌ No |
WHERE order_date = '2024-01-01' AND customer_id = 1 |
✅ Yes (optimizer reorders) |
Interview Question: "Design indexes for these query patterns"
-- Query 1: Filter by customer, sort by date
SELECT * FROM orders WHERE customer_id = ? ORDER BY order_date DESC;
-- Query 2: Filter by date range and status
SELECT * FROM orders WHERE order_date BETWEEN ? AND ? AND status = ?;
-- Query 3: Get latest order per customer
SELECT * FROM orders WHERE customer_id = ? ORDER BY order_date DESC LIMIT 1;
Answer:
-- For Query 1 & 3: customer_id first, then order_date for sorting
CREATE INDEX idx_customer_date ON orders(customer_id, order_date DESC);
-- For Query 2: Consider selectivity - if status is more selective
CREATE INDEX idx_status_date ON orders(status, order_date);
-- Or if date range is more selective
CREATE INDEX idx_date_status ON orders(order_date, status);
Covering Indexes
Include all columns needed by the query to avoid table lookups.
-- Query
SELECT customer_id, order_date, total
FROM orders
WHERE customer_id = 123;
-- Covering index
CREATE INDEX idx_covering ON orders(customer_id)
INCLUDE (order_date, total);
-- Now query reads only from index, no table access
Partial (Filtered) Indexes
Index only rows matching a condition—smaller and faster.
-- Only index pending orders (small subset)
CREATE INDEX idx_pending_orders ON orders(customer_id)
WHERE status = 'pending';
-- Efficient for:
SELECT * FROM orders WHERE status = 'pending' AND customer_id = 123;
Index Design Principles
1. Query Analysis First
Before creating indexes, analyze actual query patterns:
-- PostgreSQL: Find slow queries
SELECT query, calls, mean_time, total_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 10;
-- Identify columns in:
-- 1. WHERE clauses
-- 2. JOIN conditions
-- 3. ORDER BY clauses
-- 4. GROUP BY clauses
2. Selectivity Matters
High selectivity (unique values) makes indexes more effective.
-- Check selectivity
SELECT
COUNT(DISTINCT status) as status_cardinality, -- Low (e.g., 5)
COUNT(DISTINCT customer_id) as customer_cardinality, -- High (e.g., 100K)
COUNT(*) as total_rows
FROM orders;
-- Index customer_id first (high selectivity)
-- Status might not need index alone
3. Column Order in Composite Indexes
Order by:
- Equality conditions first (exact match)
- Range conditions last (>, <, BETWEEN)
- Consider sort order for ORDER BY
-- Query: WHERE status = 'active' AND created_at > '2024-01-01' ORDER BY created_at
-- Best index:
CREATE INDEX idx_optimized ON table(status, created_at);
-- status (equality) → created_at (range + sort)
4. Index Overhead Awareness
| Consideration | Impact |
|---|---|
| Write performance | Each index slows INSERT/UPDATE/DELETE |
| Storage | Indexes consume disk space |
| Maintenance | Index fragmentation over time |
| Vacuum/Analyze | More indexes = longer maintenance |
Rule of Thumb: Aim for 3-5 indexes per table maximum. Each additional index needs strong justification.
Common Interview Scenarios
Scenario 1: Slow Dashboard Query
Problem: "This dashboard query takes 30 seconds"
SELECT
DATE(created_at) as date,
COUNT(*) as orders,
SUM(total) as revenue
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
AND status = 'completed'
GROUP BY DATE(created_at);
Solution:
-- 1. Create composite index
CREATE INDEX idx_orders_status_created ON orders(status, created_at)
WHERE status = 'completed'; -- Partial index if supported
-- 2. Consider materialized view for dashboard
CREATE MATERIALIZED VIEW daily_completed_orders AS
SELECT
DATE(created_at) as date,
COUNT(*) as orders,
SUM(total) as revenue
FROM orders
WHERE status = 'completed'
GROUP BY DATE(created_at);
Scenario 2: Join Performance
Problem: "Join between orders and customers is slow"
SELECT c.name, o.total
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'USA';
Solution:
-- Ensure foreign key has index
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Index for filter condition
CREATE INDEX idx_customers_country ON customers(country);
-- Or covering index
CREATE INDEX idx_customers_country_covering ON customers(country)
INCLUDE (id, name);
Scenario 3: Unused Index Detection
Interview Question: "How do you identify unused indexes?"
-- PostgreSQL: Find unused indexes
SELECT
schemaname || '.' || relname AS table,
indexrelname AS index,
pg_size_pretty(pg_relation_size(i.indexrelid)) AS size,
idx_scan AS scans
FROM pg_stat_user_indexes ui
JOIN pg_index i ON ui.indexrelid = i.indexrelid
WHERE idx_scan = 0
AND NOT indisunique
ORDER BY pg_relation_size(i.indexrelid) DESC;
Interview Insight: When discussing indexes, always consider the trade-off between read performance (faster queries) and write performance (slower inserts/updates). Production systems need this balance.
Next, we'll practice common SQL interview patterns and solutions. :::