LLM Fundamentals for Interviews
Transformer Architecture Deep Dive
Why This Matters for Interviews
At OpenAI, Anthropic, Meta, and Google, transformer architecture questions are mandatory in L5-L6 LLM engineer interviews. Interviewers will ask you to:
- Explain attention mechanism from scratch (whiteboard/code)
- Derive multi-head attention math (matrix dimensions, complexity)
- Debug positional encoding issues in real code
- Compare architectures (encoder-only, decoder-only, encoder-decoder)
Real Interview Question (OpenAI L5):
"Walk me through how self-attention works in GPT. What's the computational complexity? How would you optimize it for 100K+ context windows?"
The Transformer in 60 Seconds
Before Transformers (2017):
- RNNs/LSTMs: Sequential processing → slow, gradient issues
- CNNs: Limited receptive fields → weak at long-range dependencies
After "Attention Is All You Need" (Vaswani et al.):
- Parallel processing of entire sequence
- Self-attention captures relationships between all tokens
- Scalability to billions of parameters
Key Innovation: Replace recurrence with attention mechanism.
Self-Attention Mechanism (The Core)
The Math Interviewers Expect You to Know
Given input sequence: X = [x₁, x₂, ..., xₙ] where each xᵢ ∈ ℝᵈᵐᵒᵈᵉˡ
Step 1: Create Q, K, V Matrices
Q = X · Wq (Query: "what am I looking for?")
K = X · Wk (Key: "what do I contain?")
V = X · Wv (Value: "what do I actually represent?")
Where Wq, Wk, Wv ∈ ℝᵈᵐᵒᵈᵉˡ ˣ ᵈᵏ are learned weight matrices.
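A minimal numpy sketch of Step 1, just to make the shapes concrete (the dimensions and random initialization here are illustrative, not taken from any particular model):

```python
import numpy as np

# Illustrative sizes (not from a specific model)
seq_len, d_model, d_k = 10, 512, 64
X = np.random.randn(seq_len, d_model)            # input token embeddings

# Learned projection matrices (randomly initialized here for illustration)
Wq = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wk = np.random.randn(d_model, d_k) / np.sqrt(d_model)
Wv = np.random.randn(d_model, d_k) / np.sqrt(d_model)

Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # each is (seq_len, d_k)
```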
Step 2: Compute Attention Scores
Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V
Why divide by √dₖ?
- Prevents dot products from getting too large (gradient issues)
- Keeps softmax gradients stable
- Interview Answer: "If the query and key components have unit variance, the dot product has variance dₖ, so we scale by √dₖ to bring the scores back to unit variance"
Step 3: Softmax Normalizes to Probabilities
```python
import numpy as np

def self_attention(Q, K, V):
    """
    Q, K, V: (batch_size, seq_len, d_k)
    Returns: output (batch_size, seq_len, d_k), attention weights (batch_size, seq_len, seq_len)
    """
    d_k = Q.shape[-1]
    # Attention scores: (batch, seq_len, seq_len)
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)
    # Softmax over the keys dimension (subtract the max for numerical stability)
    scores -= scores.max(axis=-1, keepdims=True)
    attention_weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # Weighted sum of values
    output = np.matmul(attention_weights, V)
    return output, attention_weights
```
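A quick sanity check of the function above with random tensors (the shapes are arbitrary):

```python
batch, seq_len, d_k = 2, 8, 64
Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_k)

out, weights = self_attention(Q, K, V)
print(out.shape)              # (2, 8, 64)
print(weights.sum(axis=-1))   # every row sums to ~1.0 (softmax property)
```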
Multi-Head Attention (What Makes Transformers Work)
Why Multi-Head?
- Different heads learn different patterns:
  - Head 1: Syntax relationships (subject-verb)
  - Head 2: Semantic similarity (synonyms)
  - Head 3: Long-range dependencies (coreference)
The Formula:
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · Wₒ
where headᵢ = Attention(Q·Wᵢq, K·Wᵢk, V·Wᵢv)
Interview Question: "Why not just use one big attention head instead of 8 smaller ones?"
Good Answer:
"Multiple heads allow the model to attend to information from different representation subspaces simultaneously. A single large head would have the same parameter count but less expressiveness - it's like having 8 different 'lenses' to view the data vs. one averaged lens. Empirically, 8 heads of 64 dimensions outperform 1 head of 512 dimensions."
Computational Complexity Analysis
Attention Complexity: O(n² · d)
Where:
- n = sequence length
- d = model dimension
Breakdown:
- Q·Kᵀ: O(n² · d) - this is the bottleneck!
- Softmax: O(n²)
- Attention·V: O(n² · d)
For GPT-5.2 with 128K context:
- n = 128,000 tokens
- n² ≈ 16.4 billion attention scores per layer
- With 96 layers → roughly 1.6 trillion attention scores to compute
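A quick back-of-the-envelope check of those numbers (plain arithmetic; the layer count is just the figure quoted above):

```python
n = 128_000               # context length in tokens
layers = 96               # layer count from the example above
scores_per_layer = n ** 2
print(f"{scores_per_layer:,}")            # 16,384,000,000  (~16.4 billion)
print(f"{scores_per_layer * layers:,}")   # 1,572,864,000,000 (~1.6 trillion)
```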
Interview Follow-up: "How do we scale to 1M+ context windows?"
Optimization Strategies:
- Sparse Attention (Longformer, BigBird)
  - Only attend to local + global tokens
  - Complexity: O(n · k) where k << n
- Linear Attention (Performers, RWKV)
  - Approximate attention with kernel methods
  - Complexity: O(n · d²)
- Flash Attention (Dao et al. 2022)
  - Reorder operations for GPU memory efficiency
  - Same complexity, but 2-4x faster in practice
Code Example - Sparse Attention Pattern:
```python
def sparse_attention_mask(seq_len, window_size=256):
    """
    Creates a local + global sparse attention mask
    (Longformer/BigBird-style pattern).
    """
    mask = np.zeros((seq_len, seq_len))
    # Local attention: window around each token
    for i in range(seq_len):
        start = max(0, i - window_size)
        end = min(seq_len, i + window_size + 1)
        mask[i, start:end] = 1
    # Global attention: every token also attends to the first/last 64 tokens
    mask[:, :64] = 1
    mask[:, -64:] = 1
    return mask

# Complexity: O(n · window_size) instead of O(n²)
```
Positional Encodings (The "Where" Information)
Problem: Self-attention is permutation-invariant
- "I love AI" and "AI love I" would produce same output
- Need to inject position information
Absolute Positional Encoding (Original Transformer)
Sinusoidal Encoding (Vaswani et al.):
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Why sine/cosine?
- Continuous: Generalizes to unseen sequence lengths
- Periodic: Different frequencies capture different patterns
- Deterministic: No learned parameters needed
```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """
    Generate sinusoidal positional encodings.
    Used in: the original Transformer (Vaswani et al., 2017)
    """
    position = np.arange(max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return pe

# Properties:
# - PE(pos + k) can be expressed as a linear function of PE(pos)
# - This lets the model learn relative positions
```
Learned Positional Embeddings (GPT, BERT)
```python
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """
    Learned absolute position embeddings.
    Used in: GPT-2, GPT-3, BERT
    """
    def __init__(self, max_len, d_model):
        super().__init__()
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, positions):
        # positions: (batch, seq_len) integer position ids
        return self.embedding(positions)

# Pros: Flexible, can learn task-specific patterns
# Cons: Fixed max length, doesn't generalize beyond training
```
Relative Positional Encoding (GPT-5, Llama)
RoPE (Rotary Position Embedding) - Used in GPT-5.2, Llama 3.3:
```
Instead of:  X + PE
Do:          rotate Q and K by a position-dependent angle

Q_rotated = RoPE(Q, position)
K_rotated = RoPE(K, position)
Attention(Q_rotated, K_rotated, V)
```
Why RoPE is Superior:
- Relative positions: Natural for language (distance matters, not absolute position)
- Extrapolation: Can handle longer sequences than training
- Efficiency: No extra parameters
Interview Question (Anthropic): "Why does Llama use RoPE instead of sinusoidal encoding?"
Strong Answer:
"RoPE encodes relative positions directly into Q/K interactions rather than adding positional info to embeddings. This gives better length extrapolation - Llama 3.3 trained on 8K can handle 128K at inference. The rotation preserves inner products while encoding distance, which aligns with how attention should work: tokens care about their relative distance, not absolute position."
Decoder-Only vs. Encoder-Only vs. Encoder-Decoder
Architecture Comparison Table
| Architecture | Examples | Use Case | Cross-Attention? |
|---|---|---|---|
| Encoder-Only | BERT, RoBERTa | Understanding tasks (classification, NER) | No |
| Decoder-Only | GPT-5, Llama 3, Claude 4.5 | Generation tasks (chat, code, reasoning) | No |
| Encoder-Decoder | T5, BART, Flan-UL2 | Translation, summarization | Yes |
Interview Question: "Why are all frontier LLMs (GPT-5, Claude 4.5, Gemini 3) decoder-only?"
Answer:
"Decoder-only scales better. Encoder-decoder requires cross-attention between encoder/decoder, which doesn't parallelize as well. For large-scale pretraining (trillions of tokens), autoregressive next-token prediction is simpler and more efficient. We can do 'understanding' tasks with decoder-only by framing them as generation (e.g., 'Question: X Answer:')."
Causal Masking (Why GPT Can't See the Future)
Autoregressive generation requires causal masking:
```python
def create_causal_mask(seq_len):
    """
    Lower-triangular mask for decoder-only models.
    Position i can only attend to positions <= i.
    """
    mask = np.tril(np.ones((seq_len, seq_len)))
    # Convert disallowed positions to a large negative value (acts like -inf)
    mask = np.where(mask == 0, -1e9, 0.0)
    return mask

# Example for seq_len=4:
# [[0, -inf, -inf, -inf],   # Token 0 sees only itself
#  [0,    0, -inf, -inf],   # Token 1 sees 0, 1
#  [0,    0,    0, -inf],   # Token 2 sees 0, 1, 2
#  [0,    0,    0,    0]]   # Token 3 sees all
```
Why -inf instead of 0?
- exp(-inf) = 0, so masked positions get exactly zero attention weight after the softmax
- Clean way to mask without special-casing (quick demo below)
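A tiny demonstration of the effect (the scores here are made up):

```python
import numpy as np

scores = np.array([1.0, 2.0, 3.0]) + np.array([0.0, 0.0, -1e9])   # mask the last position
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights)   # ~[0.27, 0.73, 0.00] -- the masked position gets zero weight
```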
LayerNorm vs. RMSNorm (What Modern LLMs Use)
LayerNorm (Original Transformer):
```python
def layer_norm(x, gamma, beta, eps=1e-5):
    # x: (..., d_model) numpy array
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta
```
RMSNorm (Llama 3, GPT-5) - 30% Faster:
```python
def rms_norm(x, gamma, eps=1e-5):
    """
    Root Mean Square Normalization.
    Used in: Llama 3.3, GPT-5, DeepSeek R1
    """
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    x_norm = x / rms
    return gamma * x_norm

# No beta! No mean subtraction!
```
Why RMSNorm?
- Simpler: No mean calculation
- Faster: 30% less compute
- Same performance: Empirically validated
Interview Insight: Mentioning RMSNorm shows you're up-to-date with 2025-2026 LLM implementations.
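The transformer block code later in this lesson instantiates RMSNorm as a module, so here is a minimal PyTorch sketch matching the numpy function above (an illustration, not any specific library's implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by a learned gamma; no mean subtraction, no beta."""
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)
```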
Feed-Forward Network (Roughly 2/3 of Model Parameters)
Structure:
```python
class FeedForward(nn.Module):
    """
    2-layer MLP with GeLU activation.
    d_model → 4*d_model → d_model
    """
    def __init__(self, d_model):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model)
        self.fc2 = nn.Linear(4 * d_model, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        return self.fc2(self.activation(self.fc1(x)))

# For GPT-5.2 (d_model=12288):
# - fc1: 12288 → 49152 (~604M parameters)
# - fc2: 49152 → 12288 (~604M parameters)
# - Total: ~1.2B parameters per layer!
```
Interview Question: "Why is the FFN hidden dimension 4x the model dimension?"
Answer:
"Empirically optimal trade-off. Larger ratios (8x) give better quality but worse efficiency. Smaller ratios (2x) save compute but hurt performance. The 4x ratio comes from scaling laws research (Kaplan et al. 2020, Hoffman et al. 2022) showing it balances performance per FLOP."
Full Transformer Block Code
```python
class TransformerBlock(nn.Module):
    """
    One decoder layer in GPT-5 / Llama 3.3.
    """
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.attn_norm = RMSNorm(d_model)
        # Feed-forward network
        self.ffn = FeedForward(d_model)
        self.ffn_norm = RMSNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm architecture (used in GPT-5, Llama)
        # Attention block
        normed = self.attn_norm(x)
        attn_out, _ = self.attention(normed, normed, normed, mask)
        x = x + self.dropout(attn_out)   # Residual connection
        # FFN block
        normed = self.ffn_norm(x)
        ffn_out = self.ffn(normed)
        x = x + self.dropout(ffn_out)    # Residual connection
        return x
```
Pre-Norm vs. Post-Norm:
- Post-Norm (Original Transformer): Norm(X + Sublayer(X))
- Pre-Norm (GPT-5, Llama): X + Sublayer(Norm(X))
Why Pre-Norm?
- Gradient flow: Cleaner gradients for deep models (96+ layers)
- Stability: Less likely to diverge during training
- All modern LLMs use pre-norm
Common Interview Questions & Answers
Q1: "What's the memory complexity of self-attention?"
Answer:
"O(n²) for storing attention weights. For a 100K token sequence in float16, that's 100K × 100K × 2 bytes = 20GB per layer. This is why Flash Attention recomputes attention on-the-fly instead of storing it - trading compute for memory."
Q2: "How would you reduce transformer inference latency?"
Strong Answer:
- KV-cache: Store key/value tensors from previous tokens so they aren't recomputed each step (used in GPT-5.2 API); see the sketch after this list
- Quantization: INT8/INT4 weights (DeepSeek R1 uses 4-bit)
- Speculative decoding: Draft with small model, verify with large model
- Prompt caching: Cache computation for common prompt prefixes (Claude 4.5 feature)
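A minimal numpy sketch of the KV-cache idea for a single head (the function and variable names are illustrative; real inference stacks cache per layer and per head in preallocated buffers):

```python
import numpy as np

def decode_step_with_kv_cache(x_t, Wq, Wk, Wv, cache):
    """
    One autoregressive decoding step for a single attention head.
    x_t: (d_model,) embedding of the newest token.
    cache: dict holding 'K' and 'V' arrays of shape (t, d_k) from prior steps.
    Only the new token's K/V are computed; everything else is reused.
    """
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    cache["K"] = np.vstack([cache["K"], k]) if cache["K"].size else k[None, :]
    cache["V"] = np.vstack([cache["V"], v]) if cache["V"].size else v[None, :]
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])      # (t,) scores vs. all past tokens
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]                         # (d_k,) attention output

# Usage: per-step decoding without recomputing past keys/values
d_model, d_k = 512, 64
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
cache = {"K": np.zeros((0, d_k)), "V": np.zeros((0, d_k))}
for x_t in np.random.randn(5, d_model):                 # 5 decoding steps
    out = decode_step_with_kv_cache(x_t, Wq, Wk, Wv, cache)
```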
Q3: "Why do transformers need so much data to train?"
Answer:
"Unlike CNNs with inductive biases (locality, translation invariance), transformers are 'tabula rasa' - they learn everything from data. Attention can connect any two tokens, so the model must learn from examples which connections matter. With trillions of parameters, you need trillions of tokens to avoid overfitting. This is why GPT-5 trained on ~15 trillion tokens."
Q4: "Explain the difference between encoder and decoder attention masks."
Answer:
"Encoder (BERT-style): Bidirectional mask - token i sees all tokens. Used for understanding tasks. Decoder (GPT-style): Causal mask - token i only sees tokens ≤ i. Prevents looking ahead during autoregressive generation. Encoder-decoder (T5): Encoder has bidirectional, decoder has causal + cross-attention to encoder (bidirectional over source sequence)."
Practical Debugging Scenario (Real Interview)
Interviewer: "You trained a GPT-style model, but it's generating garbage. The loss plateaued at 8.2 instead of going below 3.0. What could be wrong?"
Debugging Checklist:
```python
# 1. Check positional encodings are actually being added (shapes must match for the sum)
assert embeddings.shape == pos_encodings.shape

# 2. Verify the causal mask is applied
attention_weights = model.get_attention_weights(input_ids)
# Upper triangle (future positions) should be near-zero
assert attention_weights.triu(diagonal=1).abs().max() < 0.01

# 3. Check gradient flow through residuals
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"No gradient: {name}")          # Bad!
    elif param.grad.abs().mean() < 1e-7:
        print(f"Vanishing gradient: {name}")

# 4. Verify layer norm is normalizing
activations = model.get_layer_activations(input_ids, layer=12)
mean = activations.mean(dim=-1)
var = activations.var(dim=-1)
assert mean.abs().max() < 0.1          # Should be near zero
assert (var - 1.0).abs().max() < 0.1   # Should be near one

# 5. Check attention isn't collapsing onto a single token
#    (entropy near 0 = collapsed; uniform attention over seq_len=100 gives log(100) ≈ 4.6)
attention_entropy = -(attention_weights * (attention_weights + 1e-9).log()).sum(dim=-1).mean()
assert attention_entropy > 2.0, f"Attention collapsed: entropy={attention_entropy}"
```
Key Takeaways for Interviews
✅ Know the math: Attention formula, complexity analysis, matrix dimensions
✅ Understand trade-offs: Multi-head vs. single-head, pre-norm vs. post-norm
✅ Modern variants: RoPE, RMSNorm, Flash Attention (shows you're current)
✅ Scaling challenges: Why O(n²) matters, how to optimize for long contexts
✅ Debugging skills: Gradient flow, attention collapse, positional encoding bugs
Next Step: Understand how all these parameters translate to token costs in Module 1, Lesson 2: Token Economics.