Database Design & SQL Mastery
Schema Design & Normalization
Database schema design is arguably the most critical skill tested in backend interviews. A poorly designed schema leads to data anomalies, performance problems, and maintenance nightmares. This lesson covers normalization forms, denormalization trade-offs, and a full interview-style schema walkthrough.
Normal Forms: From 1NF to BCNF
Normalization eliminates data redundancy by organizing columns and tables. Each normal form builds on the previous one.
First Normal Form (1NF)
A table is in 1NF when:
- Every column contains atomic (indivisible) values
- There are no repeating groups or arrays in a single column
Violation example -- storing multiple phone numbers in one column:
-- BAD: violates 1NF
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
phones VARCHAR(255) -- "555-0101, 555-0102, 555-0103"
);
-- GOOD: 1NF compliant
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100)
);
CREATE TABLE customer_phones (
id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(id),
phone VARCHAR(20) NOT NULL
);
Second Normal Form (2NF)
A table is in 2NF when it is in 1NF and every non-key column depends on the entire primary key, not just part of it. This only applies to tables with composite primary keys.
Violation example -- product_name depends only on product_id, not on the full composite key:
-- BAD: violates 2NF (product_name depends only on product_id)
CREATE TABLE order_items (
order_id INTEGER,
product_id INTEGER,
product_name VARCHAR(100), -- depends only on product_id
quantity INTEGER,
PRIMARY KEY (order_id, product_id)
);
-- GOOD: 2NF compliant — separate the partial dependency
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
CREATE TABLE order_items (
order_id INTEGER REFERENCES orders(id),
product_id INTEGER REFERENCES products(id),
quantity INTEGER NOT NULL CHECK (quantity > 0),
PRIMARY KEY (order_id, product_id)
);
Third Normal Form (3NF)
A table is in 3NF when it is in 2NF and no non-key column depends on another non-key column (no transitive dependencies).
Violation example -- city depends on zip_code, which depends on the primary key:
-- BAD: violates 3NF (city depends on zip_code, not directly on id)
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
zip_code VARCHAR(10),
city VARCHAR(100) -- transitively depends on id via zip_code
);
-- GOOD: 3NF compliant
CREATE TABLE zip_codes (
zip_code VARCHAR(10) PRIMARY KEY,
city VARCHAR(100) NOT NULL
);
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
zip_code VARCHAR(10) REFERENCES zip_codes(zip_code)
);
Boyce-Codd Normal Form (BCNF)
BCNF strengthens 3NF: every determinant must be a candidate key. In practice, most 3NF schemas are already BCNF. The exception arises when a table has multiple overlapping candidate keys.
Rule of thumb for interviews: If you can justify 3NF, that is usually sufficient. Mention BCNF if the interviewer pushes deeper.
Denormalization: When and Why
Fully normalized schemas minimize redundancy but require joins for most queries. Denormalization intentionally introduces redundancy to improve read performance.
| Scenario | Normalize | Denormalize |
|---|---|---|
| Write-heavy OLTP system | Yes | No |
| Read-heavy dashboard/analytics | Partially | Yes |
| Real-time leaderboard | No | Yes |
| Financial transaction records | Yes | No |
| Product catalog with search | Partially | Yes (materialized views) |
Common denormalization techniques:
- Duplicating columns -- Store
customer_namedirectly onordersto avoid a JOIN on every order lookup - Pre-computed aggregates -- Store
total_reviewsandavg_ratingon theproductstable - Materialized views -- Periodically refreshed read-only tables combining multiple source tables
Interview tip: Always start with a normalized design, then explain which specific read patterns justify denormalization. Interviewers want to see that you understand the trade-off: faster reads at the cost of more complex writes and potential data inconsistency.
Constraints: Your Schema's Safety Net
Constraints enforce data integrity at the database level, which is more reliable than application-level validation alone.
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
sku VARCHAR(50) NOT NULL UNIQUE,
price NUMERIC(10,2) NOT NULL CHECK (price >= 0),
stock_quantity INTEGER NOT NULL DEFAULT 0 CHECK (stock_quantity >= 0),
category_id INTEGER NOT NULL REFERENCES categories(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
| Constraint | Purpose | Example |
|---|---|---|
PRIMARY KEY |
Unique row identifier | id SERIAL PRIMARY KEY |
NOT NULL |
Column must have a value | name VARCHAR(200) NOT NULL |
UNIQUE |
No duplicate values | sku VARCHAR(50) UNIQUE |
CHECK |
Custom validation rule | CHECK (price >= 0) |
DEFAULT |
Fallback value if none provided | DEFAULT 0 |
REFERENCES (FK) |
Enforces referential integrity | REFERENCES categories(id) |
Interview Walkthrough: Social Media Schema
A common interview question: "Design the database schema for a social media application."
Here is a structured approach:
Step 1: Identify Core Entities
Users, Posts, Comments, Likes, Follows, Media
Step 2: Define Relationships
- A user has many posts (one-to-many)
- A post has many comments (one-to-many)
- A user follows many users (many-to-many, self-referential)
- A user likes many posts (many-to-many)
Step 3: Write the Schema
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username VARCHAR(30) NOT NULL UNIQUE,
email VARCHAR(255) NOT NULL UNIQUE,
display_name VARCHAR(100) NOT NULL,
bio TEXT,
avatar_url VARCHAR(500),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE posts (
id BIGSERIAL PRIMARY KEY,
author_id BIGINT NOT NULL REFERENCES users(id),
content TEXT NOT NULL CHECK (char_length(content) <= 5000),
media_url VARCHAR(500),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE comments (
id BIGSERIAL PRIMARY KEY,
post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
author_id BIGINT NOT NULL REFERENCES users(id),
content TEXT NOT NULL CHECK (char_length(content) <= 2000),
parent_comment_id BIGINT REFERENCES comments(id), -- threaded replies
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE likes (
user_id BIGINT NOT NULL REFERENCES users(id),
post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, post_id) -- prevents duplicate likes
);
CREATE TABLE follows (
follower_id BIGINT NOT NULL REFERENCES users(id),
following_id BIGINT NOT NULL REFERENCES users(id),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (follower_id, following_id),
CHECK (follower_id != following_id) -- cannot follow yourself
);
-- Denormalized counters for fast reads
ALTER TABLE users ADD COLUMN follower_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE users ADD COLUMN following_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE posts ADD COLUMN like_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE posts ADD COLUMN comment_count INTEGER NOT NULL DEFAULT 0;
Step 4: Explain Your Trade-offs
ON DELETE CASCADEon comments and likes: when a post is deleted, its associated data is automatically cleaned up- Composite primary key on
likesandfollows: prevents duplicates without a separate unique constraint parent_comment_id: enables threaded comment replies using a self-referential foreign key- Denormalized counters:
like_countandfollower_countavoid expensiveCOUNT(*)queries on every profile or post view, but require careful updating (use triggers or application logic)
Key takeaway: In interviews, always narrate your reasoning out loud. Walk through entities, relationships, constraints, and trade-offs in a structured way. A mediocre schema with excellent reasoning often scores higher than a perfect schema with no explanation.
Next: Indexing strategies, query optimization, and transaction isolation levels. :::