Schema Design & Normalization

Database schema design is arguably the most critical skill tested in backend interviews. A poorly designed schema leads to data anomalies, performance problems, and maintenance nightmares. This lesson covers normalization forms, denormalization trade-offs, and a full interview-style schema walkthrough.

Normal Forms: From 1NF to BCNF

Normalization eliminates data redundancy by organizing columns and tables. Each normal form builds on the previous one.

First Normal Form (1NF)

A table is in 1NF when:

Every column contains atomic (indivisible) values
There are no repeating groups or arrays in a single column

Violation example -- storing multiple phone numbers in one column:

-- BAD: violates 1NF
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    phones VARCHAR(255)  -- "555-0101, 555-0102, 555-0103"
);

-- GOOD: 1NF compliant
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100)
);

CREATE TABLE customer_phones (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    phone VARCHAR(20) NOT NULL
);

Second Normal Form (2NF)

A table is in 2NF when it is in 1NF and every non-key column depends on the entire primary key, not just part of it. This only applies to tables with composite primary keys.

Violation example -- product_name depends only on product_id, not on the full composite key:

-- BAD: violates 2NF (product_name depends only on product_id)
CREATE TABLE order_items (
    order_id INTEGER,
    product_id INTEGER,
    product_name VARCHAR(100),  -- depends only on product_id
    quantity INTEGER,
    PRIMARY KEY (order_id, product_id)
);

-- GOOD: 2NF compliant — separate the partial dependency
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE order_items (
    order_id INTEGER REFERENCES orders(id),
    product_id INTEGER REFERENCES products(id),
    quantity INTEGER NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)
);

Third Normal Form (3NF)

A table is in 3NF when it is in 2NF and no non-key column depends on another non-key column (no transitive dependencies).

Violation example -- city depends on zip_code, which depends on the primary key:

-- BAD: violates 3NF (city depends on zip_code, not directly on id)
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    zip_code VARCHAR(10),
    city VARCHAR(100)  -- transitively depends on id via zip_code
);

-- GOOD: 3NF compliant
CREATE TABLE zip_codes (
    zip_code VARCHAR(10) PRIMARY KEY,
    city VARCHAR(100) NOT NULL
);

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100),
    zip_code VARCHAR(10) REFERENCES zip_codes(zip_code)
);

Boyce-Codd Normal Form (BCNF)

BCNF strengthens 3NF: every determinant must be a candidate key. In practice, most 3NF schemas are already BCNF. The exception arises when a table has multiple overlapping candidate keys.

Rule of thumb for interviews: If you can justify 3NF, that is usually sufficient. Mention BCNF if the interviewer pushes deeper.

Denormalization: When and Why

Fully normalized schemas minimize redundancy but require joins for most queries. Denormalization intentionally introduces redundancy to improve read performance.

Scenario	Normalize	Denormalize
Write-heavy OLTP system	Yes	No
Read-heavy dashboard/analytics	Partially	Yes
Real-time leaderboard	No	Yes
Financial transaction records	Yes	No
Product catalog with search	Partially	Yes (materialized views)

Common denormalization techniques:

Duplicating columns -- Store customer_name directly on orders to avoid a JOIN on every order lookup
Pre-computed aggregates -- Store total_reviews and avg_rating on the products table
Materialized views -- Periodically refreshed read-only tables combining multiple source tables

Interview tip: Always start with a normalized design, then explain which specific read patterns justify denormalization. Interviewers want to see that you understand the trade-off: faster reads at the cost of more complex writes and potential data inconsistency.

Constraints: Your Schema's Safety Net

Constraints enforce data integrity at the database level, which is more reliable than application-level validation alone.

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    sku VARCHAR(50) NOT NULL UNIQUE,
    price NUMERIC(10,2) NOT NULL CHECK (price >= 0),
    stock_quantity INTEGER NOT NULL DEFAULT 0 CHECK (stock_quantity >= 0),
    category_id INTEGER NOT NULL REFERENCES categories(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Constraint	Purpose	Example
`PRIMARY KEY`	Unique row identifier	`id SERIAL PRIMARY KEY`
`NOT NULL`	Column must have a value	`name VARCHAR(200) NOT NULL`
`UNIQUE`	No duplicate values	`sku VARCHAR(50) UNIQUE`
`CHECK`	Custom validation rule	`CHECK (price >= 0)`
`DEFAULT`	Fallback value if none provided	`DEFAULT 0`
`REFERENCES` (FK)	Enforces referential integrity	`REFERENCES categories(id)`

A common interview question: "Design the database schema for a social media application."

Here is a structured approach:

Step 1: Identify Core Entities

Users, Posts, Comments, Likes, Follows, Media

Step 2: Define Relationships

A user has many posts (one-to-many)
A post has many comments (one-to-many)
A user follows many users (many-to-many, self-referential)
A user likes many posts (many-to-many)

Step 3: Write the Schema

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(30) NOT NULL UNIQUE,
    email VARCHAR(255) NOT NULL UNIQUE,
    display_name VARCHAR(100) NOT NULL,
    bio TEXT,
    avatar_url VARCHAR(500),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    author_id BIGINT NOT NULL REFERENCES users(id),
    content TEXT NOT NULL CHECK (char_length(content) <= 5000),
    media_url VARCHAR(500),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE comments (
    id BIGSERIAL PRIMARY KEY,
    post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    author_id BIGINT NOT NULL REFERENCES users(id),
    content TEXT NOT NULL CHECK (char_length(content) <= 2000),
    parent_comment_id BIGINT REFERENCES comments(id),  -- threaded replies
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE likes (
    user_id BIGINT NOT NULL REFERENCES users(id),
    post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, post_id)  -- prevents duplicate likes
);

CREATE TABLE follows (
    follower_id BIGINT NOT NULL REFERENCES users(id),
    following_id BIGINT NOT NULL REFERENCES users(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (follower_id, following_id),
    CHECK (follower_id != following_id)  -- cannot follow yourself
);

-- Denormalized counters for fast reads
ALTER TABLE users ADD COLUMN follower_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE users ADD COLUMN following_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE posts ADD COLUMN like_count INTEGER NOT NULL DEFAULT 0;
ALTER TABLE posts ADD COLUMN comment_count INTEGER NOT NULL DEFAULT 0;

Step 4: Explain Your Trade-offs

ON DELETE CASCADE on comments and likes: when a post is deleted, its associated data is automatically cleaned up
Composite primary key on likes and follows: prevents duplicates without a separate unique constraint
parent_comment_id: enables threaded comment replies using a self-referential foreign key
Denormalized counters: like_count and follower_count avoid expensive COUNT(*) queries on every profile or post view, but require careful updating (use triggers or application logic)

Key takeaway: In interviews, always narrate your reasoning out loud. Walk through entities, relationships, constraints, and trade-offs in a structured way. A mediocre schema with excellent reasoning often scores higher than a perfect schema with no explanation.

Next: Indexing strategies, query optimization, and transaction isolation levels. :::