LLM Fundamentals for Interviews
Transformer Architecture Deep Dive
Why This Matters for Interviews
At OpenAI, Anthropic, Meta, and Google, transformer architecture questions are mandatory in L5-L6 LLM engineer interviews. Interviewers will ask you to:
- Explain attention mechanism from scratch (whiteboard/code)
- Derive multi-head attention math (matrix dimensions, complexity)
- Debug positional encoding issues in real code
- Compare architectures (encoder-only, decoder-only, encoder-decoder)
Real Interview Question (OpenAI L5):
"Walk me through how self-attention works in GPT. What's the computational complexity? How would you optimize it for 100K+ context windows?"
The Transformer in 60 Seconds
Before Transformers (2017):
- RNNs/LSTMs: Sequential processing → slow, gradient issues
- CNNs: Limited receptive fields → weak at long-range dependencies
After "Attention Is All You Need" (Vaswani et al.):
- Parallel processing of entire sequence
- Self-attention captures relationships between all tokens
- Scalability to billions of parameters
Key Innovation: Replace recurrence with attention mechanism.
Self-Attention Mechanism (The Core)
The Math Interviewers Expect You to Know
Given input sequence: X = [x₁, x₂, ..., xₙ] where each xᵢ ∈ ℝᵈᵐᵒᵈᵉˡ
Step 1: Create Q, K, V Matrices
Q = X · Wq (Query: "what am I looking for?")
K = X · Wk (Key: "what do I contain?")
V = X · Wv (Value: "what do I actually represent?")
Where Wq, Wk, Wv ∈ ℝᵈᵐᵒᵈᵉˡ ˣ ᵈᵏ are learned weight matrices.
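A minimal numpy sketch of Step 1, just to make the shapes concrete (the dimensions and random initialization here are illustrative, not taken from any particular model):

```python
import numpy as np

# Illustrative sizes (not from a specific model)
seq_len, d_model, d_k = 10, 512, 64
X = np.random.randn(seq_len, d_model)            # input token embeddings

# Learned projection matrices (randomly initialized here for illustration)
Wq = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_k) / np.sqrt(d_model)

Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each is (seq_len, d_k)
```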
Step 2: Compute Attention Scores
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
Why divide by √dₖ?
- Prevents dot products from getting too large (gradient issues)
- Keeps softmax gradients stable
- Interview Answer: "If the query and key components have unit variance, the dot product has variance dₖ, so we scale by √dₖ to bring the scores back to unit variance"
Step 3: Softmax Normalizes to Probabilities
```python
import numpy as np

def self_attention(Q, K, V):
    """
    Q, K, V: (batch_size, seq_len, d_k)
    Returns: output (batch_size, seq_len, d_k), attention weights (batch_size, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    # Attention scores: (batch, seq_len, seq_len)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    # Softmax over the keys dimension (subtract the max for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights
```
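A quick sanity check of the function above with random tensors (the shapes are arbitrary):

```python
batch, seq_len, d_k = 2, 8, 64
Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_k)

out, weights = self_attention(Q, K, V)
print(out.shape)              # (2, 8, 64)
print(weights.sum(axis=-1))   # every row sums to ~1.0 (softmax property)
```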
Multi-Head Attention (What Makes Transformers Work)
Why Multi-Head?
- Different heads learn different patterns:
  - Head 1: Syntax relationships (subject-verb)
  - Head 2: Semantic similarity (synonyms)
  - Head 3: Long-range dependencies (coreference)
The Formula:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wₒ
where headᵢ = Attention(Q·Wᵢq, K·Wᵢk, V·Wᵢv)
Interview Question: "Why not just use one big attention head instead of 8 smaller ones?"
Good Answer:
"Multiple heads allow the model to attend to information from different representation subspaces simultaneously. A single large head would have the same parameter count but less expressiveness - it's like having 8 different 'lenses' to view the data vs. one averaged lens. Empirically, 8 heads of 64 dimensions outperform 1 head of 512 dimensions."
Computational Complexity Analysis
Attention Complexity: O(n² · d)
Where:
- n = sequence length
- d = model dimension
Breakdown:
- Q·Kᵀ: O(n² · d) - this is the bottleneck!
- Softmax: O(n²)
- Attention·V: O(n² · d)
For GPT-5.2 with 128K context:
- n = 128,000 tokens
- n² ≈ 16.4 billion attention scores per layer
- With 96 layers → roughly 1.6 trillion attention scores to compute
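A quick back-of-the-envelope check of those numbers (plain arithmetic; the layer count is just the figure quoted above):

```python
n = 128_000               # context length in tokens
layers = 96               # layer count from the example above
scores_per_layer = n ** 2
print(f"{scores_per_layer:,}")            # 16,384,000,000  (~16.4 billion)
print(f"{scores_per_layer * layers:,}")   # 1,572,864,000,000 (~1.6 trillion)
```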
Interview Follow-up: "How do we scale to 1M+ context windows?"
Optimization Strategies:
- Sparse Attention (Longformer, BigBird)
  - Only attend to local + global tokens
  - Complexity: O(n · k) where k << n
- Linear Attention (Performers, RWKV)
  - Approximate attention with kernel methods
  - Complexity: O(n · d²)
- Flash Attention (Dao et al. 2022)
  - Reorder operations for GPU memory efficiency
  - Same complexity, but 2-4x faster in practice
Code Example - Sparse Attention Pattern:
```python
def sparse_attention_mask(seq_len, window_size=256):
    """
    Creates a local + global sparse attention mask
    (Longformer/BigBird-style pattern).
    """
    mask = np.zeros((seq_len, seq_len))
    # Local attention: window around each token
    for i in range(seq_len):
        start = max(0, i - window_size)
        end = min(seq_len, i + window_size + 1)
        mask[i, start:end] = 1
    # Global attention: every token also attends to the first/last 64 tokens
    mask[:, :64] = 1
    mask[:, -64:] = 1
    return mask

# Complexity: O(n · window_size) instead of O(n²)
```
Positional Encodings (The "Where" Information)
Problem: Self-attention is permutation-invariant
- "I love AI" and "AI love I" would produce same output
- Need to inject position information
Absolute Positional Encoding (Original Transformer)
Sinusoidal Encoding (Vaswani et al.):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Why sine/cosine?
- Continuous: Generalizes to unseen sequence lengths
- Periodic: Different frequencies capture different patterns
- Deterministic: No learned parameters needed
```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.
    Used in: the original Transformer (Vaswani et al., 2017)
    """
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

# Properties:
# - PE(pos + k) can be expressed as a linear function of PE(pos)
# - This lets the model learn relative positions
```
Learned Positional Embeddings (GPT, BERT)
```python
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """
    Learned absolute position embeddings.
    Used in: GPT-2, GPT-3, BERT
    """
    def __init__(self, max_len, d_model):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, positions):
        # positions: (batch, seq_len) integer position ids
        return self.embedding(positions)

# Pros: Flexible, can learn task-specific patterns
# Cons: Fixed max length, doesn't generalize beyond training
```
Relative Positional Encoding (GPT-5, Llama)
RoPE (Rotary Position Embedding) - Used in GPT-5.2, Llama 3.3:
```
Instead of:  X + PE
Do:          rotate Q and K by a position-dependent angle

Q_rotated = RoPE(Q, position)
K_rotated = RoPE(K, position)
Attention(Q_rotated, K_rotated, V)
```
Why RoPE is Superior:
- Relative positions: Natural for language (distance matters, not absolute position)
- Extrapolation: Can handle longer sequences than training
- Efficiency: No extra parameters
Interview Question (Anthropic): "Why does Llama use RoPE instead of sinusoidal encoding?"
Strong Answer:
"RoPE encodes relative positions directly into Q/K interactions rather than adding positional info to embeddings. This gives better length extrapolation - Llama 3.3 trained on 8K can handle 128K at inference. The rotation preserves inner products while encoding distance, which aligns with how attention should work: tokens care about their relative distance, not absolute position."
Decoder-Only vs. Encoder-Only vs. Encoder-Decoder
Architecture Comparison Table
| Architecture | Examples | Use Case | Cross-Attention? |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Understanding tasks (classification, NER) | No |
| Decoder-Only | GPT-5, Llama 3, Claude 4.5 | Generation tasks (chat, code, reasoning) | No |
| Encoder-Decoder | T5, BART, Flan-UL2 | Translation, summarization | Yes |
Interview Question: "Why are all frontier LLMs (GPT-5, Claude 4.5, Gemini 3) decoder-only?"
Answer:
"Decoder-only scales better. Encoder-decoder requires cross-attention between encoder/decoder, which doesn't parallelize as well. For large-scale pretraining (trillions of tokens), autoregressive next-token prediction is simpler and more efficient. We can do 'understanding' tasks with decoder-only by framing them as generation (e.g., 'Question: X Answer:')."
Causal Masking (Why GPT Can't See the Future)
Autoregressive generation requires causal masking:
```python
def create_causal_mask(seq_len):
    """
    Lower-triangular mask for decoder-only models.
    Position i can only attend to positions <= i.
    """
    mask = np.tril(np.ones((seq_len, seq_len)))
    # Convert disallowed positions to a large negative value (acts like -inf)
    mask = np.where(mask == 0, -1e9, 0.0)
    return mask

# Example for seq_len=4:
# [[0, -inf, -inf, -inf],   # Token 0 sees only itself
#  [0,    0, -inf, -inf],   # Token 1 sees 0, 1
#  [0,    0,    0, -inf],   # Token 2 sees 0, 1, 2
#  [0,    0,    0,    0]]   # Token 3 sees all
```
Why -inf instead of 0?
- exp(-inf) = 0, so masked positions get exactly zero attention weight after the softmax
- Clean way to mask without special-casing (quick demo below)
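A tiny demonstration of the effect (the scores here are made up):

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0]) + np.array([0.0, 0.0, -1e9])   # mask the last position
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights)   # ~[0.27, 0.73, 0.00] -- the masked position gets zero weight
```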
LayerNorm vs. RMSNorm (What Modern LLMs Use)
LayerNorm (Original Transformer):
```python
def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (..., d_model) numpy array
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta
```
RMSNorm (Llama 3, GPT-5) - 30% Faster:
```python
def rms_norm(x, gamma, eps=1e-5):
    """
    Root Mean Square Normalization.
    Used in: Llama 3.3, GPT-5, DeepSeek R1
    """
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    x_norm = x / rms
    return gamma * x_norm

# No beta! No mean subtraction!
```
Why RMSNorm?
- Simpler: No mean calculation
- Faster: 30% less compute
- Same performance: Empirically validated
Interview Insight: Mentioning RMSNorm shows you're up-to-date with 2025-2026 LLM implementations.
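The transformer block code later in this lesson instantiates RMSNorm as a module, so here is a minimal PyTorch sketch matching the numpy function above (an illustration, not any specific library's implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by a learned gamma; no mean subtraction, no beta."""
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)
```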
Feed-Forward Network (Roughly 2/3 of Model Parameters)
Structure:
```python
class FeedForward(nn.Module):
    """
    2-layer MLP with GeLU activation.
    d_model → 4*d_model → d_model
    """
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model)
        self.fc2 = nn.Linear(4 * d_model, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        return self.fc2(self.activation(self.fc1(x)))

# For GPT-5.2 (d_model=12288):
# - fc1: 12288 → 49152 (~604M parameters)
# - fc2: 49152 → 12288 (~604M parameters)
# - Total: ~1.2B parameters per layer!
```
Interview Question: "Why is the FFN hidden dimension 4x the model dimension?"
Answer:
"Empirically optimal trade-off. Larger ratios (8x) give better quality but worse efficiency. Smaller ratios (2x) save compute but hurt performance. The 4x ratio comes from scaling laws research (Kaplan et al. 2020, Hoffman et al. 2022) showing it balances performance per FLOP."
Full Transformer Block Code
```python
class TransformerBlock(nn.Module):
    """
    One decoder layer in GPT-5 / Llama 3.3.
    """
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.attn_norm = RMSNorm(d_model)
        # Feed-forward network
        self.ffn = FeedForward(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (used in GPT-5, Llama)
        # Attention block
        normed = self.attn_norm(x)
        attn_out, _ = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)   # Residual connection
        # FFN block
        normed = self.ffn_norm(x)
        ffn_out = self.ffn(normed)
        x = x + self.dropout(ffn_out)    # Residual connection
        return x
```
Pre-Norm vs. Post-Norm:
- Post-Norm (Original Transformer): Norm(X + Sublayer(X))
- Pre-Norm (GPT-5, Llama): X + Sublayer(Norm(X))
Why Pre-Norm?
- Gradient flow: Cleaner gradients for deep models (96+ layers)
- Stability: Less likely to diverge during training
- All modern LLMs use pre-norm
Common Interview Questions & Answers
Q1: "What's the memory complexity of self-attention?"
Answer:
"O(n²) for storing attention weights. For a 100K token sequence in float16, that's 100K × 100K × 2 bytes = 20GB per layer. This is why Flash Attention recomputes attention on-the-fly instead of storing it - trading compute for memory."
Q2: "How would you reduce transformer inference latency?"
Strong Answer:
- KV-cache: Store key/value tensors from previous tokens so they aren't recomputed each step (used in GPT-5.2 API); see the sketch after this list
- Quantization: INT8/INT4 weights (DeepSeek R1 uses 4-bit)
- Speculative decoding: Draft with small model, verify with large model
- Prompt caching: Cache computation for common prompt prefixes (Claude 4.5 feature)
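A minimal numpy sketch of the KV-cache idea for a single head (the function and variable names are illustrative; real inference stacks cache per layer and per head in preallocated buffers):

```python
import numpy as np

def decode_step_with_kv_cache(x_t, Wq, Wk, Wv, cache):
    """
    One autoregressive decoding step for a single attention head.
    x_t: (d_model,) embedding of the newest token.
    cache: dict holding 'K' and 'V' arrays of shape (t, d_k) from prior steps.
    Only the new token's K/V are computed; everything else is reused.
    """
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    cache["K"] = np.vstack([cache["K"], k]) if cache["K"].size else k[None, :]
    cache["V"] = np.vstack([cache["V"], v]) if cache["V"].size else v[None, :]
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])      # (t,) scores vs. all past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]                         # (d_k,) attention output

# Usage: per-step decoding without recomputing past keys/values
d_model, d_k = 512, 64
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
cache = {"K": np.zeros((0, d_k)), "V": np.zeros((0, d_k))}
for x_t in np.random.randn(5, d_model):                 # 5 decoding steps
    out = decode_step_with_kv_cache(x_t, Wq, Wk, Wv, cache)
```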
Q3: "Why do transformers need so much data to train?"
Answer:
"Unlike CNNs with inductive biases (locality, translation invariance), transformers are 'tabula rasa' - they learn everything from data. Attention can connect any two tokens, so the model must learn from examples which connections matter. With trillions of parameters, you need trillions of tokens to avoid overfitting. This is why GPT-5 trained on ~15 trillion tokens."
Q4: "Explain the difference between encoder and decoder attention masks."
Answer:
"Encoder (BERT-style): Bidirectional mask - token i sees all tokens. Used for understanding tasks. Decoder (GPT-style): Causal mask - token i only sees tokens ≤ i. Prevents looking ahead during autoregressive generation. Encoder-decoder (T5): Encoder has bidirectional, decoder has causal + cross-attention to encoder (bidirectional over source sequence)."
Practical Debugging Scenario (Real Interview)
Interviewer: "You trained a GPT-style model, but it's generating garbage. The loss plateaued at 8.2 instead of going below 3.0. What could be wrong?"
Debugging Checklist:
```python
# 1. Check positional encodings are actually being added (shapes must match for the sum)
assert embeddings.shape == pos_encodings.shape

# 2. Verify the causal mask is applied
attention_weights = model.get_attention_weights(input_ids)
# Upper triangle (future positions) should be near-zero
assert attention_weights.triu(diagonal=1).abs().max() < 0.01

# 3. Check gradient flow through residuals
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"No gradient: {name}")          # Bad!
    elif param.grad.abs().mean() < 1e-7:
        print(f"Vanishing gradient: {name}")

# 4. Verify layer norm is normalizing
activations = model.get_layer_activations(input_ids, layer=12)
mean = activations.mean(dim=-1)
var = activations.var(dim=-1)
assert mean.abs().max() < 0.1          # Should be near zero
assert (var - 1.0).abs().max() < 0.1   # Should be near one

# 5. Check attention isn't collapsing onto a single token
#    (entropy near 0 = collapsed; uniform attention over seq_len=100 gives log(100) ≈ 4.6)
attention_entropy = -(attention_weights * (attention_weights + 1e-9).log()).sum(dim=-1).mean()
assert attention_entropy > 2.0, f"Attention collapsed: entropy={attention_entropy}"
```
Key Takeaways for Interviews
✅ Know the math: Attention formula, complexity analysis, matrix dimensions
✅ Understand trade-offs: Multi-head vs. single-head, pre-norm vs. post-norm
✅ Modern variants: RoPE, RMSNorm, Flash Attention (shows you're current)
✅ Scaling challenges: Why O(n²) matters, how to optimize for long contexts
✅ Debugging skills: Gradient flow, attention collapse, positional encoding bugs
Next Step: Understand how all these parameters translate to token costs in Module 1, Lesson 2: Token Economics.