LLM Fundamentals for Interviews
Sampling Strategies & Generation Control
Why This Matters for Interviews
Real Interview Question (OpenAI L6):
"Your chatbot is sometimes creative but sometimes gives factually incorrect answers. How would you control this trade-off? What parameters would you adjust and why?"
Real Interview Question (Meta):
"Explain the difference between temperature and top-p sampling. When would you use one over the other?"
Understanding sampling is critical for production LLM systems where you need to balance creativity, consistency, and factual accuracy.
How LLMs Generate Text (The Autoregressive Process)
Step-by-step:
- Forward pass: Model computes logits for next token
- Convert to probabilities: Apply softmax
- Sample: Choose next token based on strategy
- Repeat: Add token to sequence, generate next
def autoregressive_generation(model, prompt, max_tokens=100, strategy="greedy"):
    """
    Core LLM generation loop.
    """
    tokens = tokenize(prompt)
    for _ in range(max_tokens):
        # 1. Forward pass
        logits = model(tokens)  # Shape: (vocab_size,)
        # 2. Convert to probabilities
        probs = softmax(logits)
        # 3. Sample next token
        next_token = sample(probs, strategy=strategy)
        # 4. Add to sequence
        tokens.append(next_token)
        # Stop if <EOS> token
        if next_token == EOS_TOKEN:
            break
    return tokens
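The snippets in this module assume NumPy plus a few helpers (`tokenize`, `sample`, and `softmax`) that aren't defined above. A minimal `softmax` sketch you can assume throughout the rest of this section:

import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    # Subtract the max logit for numerical stability before exponentiating
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()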
Sampling Strategy 1: Greedy Decoding (Deterministic)
Always pick the most probable token:
def greedy_sample(probs):
    """
    Greedy sampling: argmax(probs).
    """
    return np.argmax(probs)

# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])  # Raw model outputs
probs = softmax(logits)                  # ≈ [0.05, 0.93, 0.02, 0.01]
next_token = greedy_sample(probs)        # Always returns index 1
Pros:
- ✅ Deterministic (same prompt → same output)
- ✅ Fast (no sampling overhead)
- ✅ Good for factual tasks (Q&A, translation)
Cons:
- ❌ Repetitive (gets stuck in loops)
- ❌ No creativity
- ❌ Misses better alternatives
When to Use:
- Factual Q&A
- Translation
- Code completion (when you want most likely)
Example Output:
Prompt: "The capital of France is"
Greedy: "Paris. The capital of France is Paris. The capital of..."
→ Stuck in repetition!
Sampling Strategy 2: Temperature Scaling
Adjust randomness by dividing logits before softmax:
probs = softmax(logits / temperature)
Temperature Values & Effects
Temperature = 0 (near-greedy):
temperature = 0.01
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.0, 1.0, 0.0, 0.0]  # Almost deterministic

Temperature = 1 (unchanged):
temperature = 1.0
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.05, 0.93, 0.02, 0.01]  # Original distribution

Temperature = 2 (more random):
temperature = 2.0
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.15, 0.68, 0.09, 0.07]  # Flatter distribution
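You can reproduce these distributions with the `softmax` helper sketched earlier:

logits = np.array([2.0, 5.0, 1.0, 0.5])
for T in (0.01, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 2))
# 0.01 → ≈ [0.00, 1.00, 0.00, 0.00]
# 1.0  → ≈ [0.05, 0.93, 0.02, 0.01]
# 2.0  → ≈ [0.15, 0.68, 0.09, 0.07]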
Visual Comparison:
| Temperature | Distribution | Use Case |
|---|---|---|
| 0.0-0.3 | Sharp peak | Factual tasks, code |
| 0.5-0.7 | Balanced | General chat |
| 0.8-1.0 | Creative | Brainstorming, stories |
| 1.5-2.0 | Very random | Experimental, art |
Code Implementation:
def temperature_sample(logits, temperature=1.0):
    """
    Sample with temperature scaling.
    """
    if temperature == 0:
        return np.argmax(logits)  # Greedy
    # Scale logits by temperature
    scaled_logits = logits / temperature
    # Convert to probabilities
    probs = softmax(scaled_logits)
    # Sample from distribution
    next_token = np.random.choice(len(probs), p=probs)
    return next_token
# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])
# Low temperature (more focused)
token1 = temperature_sample(logits, temperature=0.3)
# → Almost always returns index 1 (highest logit)
# High temperature (more random)
token2 = temperature_sample(logits, temperature=2.0)
# → Could return any index, more likely to pick lower-prob tokens
Interview Question: "What happens if temperature is too high?"
Answer:
"The distribution becomes too flat - low-probability tokens get boosted. This leads to:
- Incoherent outputs (random words)
- Grammar errors
- Factual hallucinations
At temperature=10, even a token with logit -5 gets significant probability. The model loses its learned preferences from training."
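A tiny sketch (made-up logits, not from a real model) that makes this concrete:

logits = np.array([5.0, 2.0, 0.0, -5.0])
print(np.round(softmax(logits / 1.0), 5))   # ≈ [0.947, 0.047, 0.006, 0.00004]
print(np.round(softmax(logits / 10.0), 3))  # ≈ [0.368, 0.273, 0.223, 0.136]
# At T=10, the logit -5 token jumps from ~0.004% to ~14% probability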
Sampling Strategy 3: Top-p (Nucleus Sampling)
Sample from smallest set of tokens whose cumulative probability ≥ p:
def top_p_sample(probs, p=0.9):
    """
    Nucleus sampling: sample from top tokens totaling p probability.
    """
    # Sort probabilities in descending order
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    # Compute cumulative probabilities
    cumulative_probs = np.cumsum(sorted_probs)
    # Find cutoff index where cumsum >= p
    cutoff_index = np.searchsorted(cumulative_probs, p)
    # Keep only top-p tokens
    top_p_indices = sorted_indices[:cutoff_index + 1]
    top_p_probs = sorted_probs[:cutoff_index + 1]
    # Renormalize
    top_p_probs = top_p_probs / top_p_probs.sum()
    # Sample from nucleus
    relative_index = np.random.choice(len(top_p_probs), p=top_p_probs)
    next_token = top_p_indices[relative_index]
    return next_token
# Example:
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
# Sorted: [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
# Cumsum: [0.5, 0.8, 0.9, 0.95, 0.98, 1.0]
# p=0.9 → Keep first 3 tokens [0.5, 0.3, 0.1]
# Renorm: [0.556, 0.333, 0.111]
# Sample from these 3 tokens only
Why Top-p > Temperature:
- Adaptive: Nucleus size changes based on confidence
- High-confidence step (one token 0.9 prob): nucleus size = 1
- Low-confidence step (many tokens with ~0.1 prob each): nucleus size ≈ 9 or more
- Prevents tail sampling: Never samples from very low-prob tokens
Comparison:
| Scenario | Probs | Temperature=1.0 | Top-p=0.9 |
|---|---|---|---|
| High confidence | [0.9, 0.05, ...] | Samples all (including 0.001 tokens) | Samples only [0.9, 0.05] |
| Low confidence | [0.2, 0.18, 0.15, ...] | Samples all | Samples top ~7 tokens |
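A small sketch (made-up distributions) illustrating the adaptive nucleus size; note that top-k with a fixed k would keep the same number of tokens in both cases:

def nucleus_size(probs, p=0.9):
    """Number of tokens kept by top-p for a given distribution."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

confident = np.array([0.92, 0.04, 0.02, 0.01, 0.01])
uncertain = np.array([0.15, 0.14, 0.13, 0.12, 0.11, 0.10, 0.09, 0.08, 0.05, 0.03])

print(nucleus_size(confident))  # 1 -- a single dominant token already covers p
print(nucleus_size(uncertain))  # 8 -- needs 8 tokens to reach 0.9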
When to Use:
- Top-p=0.9: General purpose (GPT-5.2 default)
- Top-p=0.95: More creative
- Top-p=0.5-0.7: Very focused, factual
Sampling Strategy 4: Top-k (Fixed Nucleus)
Keep only top k most probable tokens:
def top_k_sample(probs, k=50):
    """
    Top-k sampling: sample from top k tokens.
    """
    # Get top-k indices
    top_k_indices = np.argsort(probs)[-k:]
    # Get top-k probs
    top_k_probs = probs[top_k_indices]
    # Renormalize
    top_k_probs = top_k_probs / top_k_probs.sum()
    # Sample
    relative_index = np.random.choice(len(top_k_probs), p=top_k_probs)
    next_token = top_k_indices[relative_index]
    return next_token
Top-k vs. Top-p:
| Factor | Top-k | Top-p |
|---|---|---|
| Nucleus size | Fixed (always k tokens) | Adaptive |
| High confidence | Wastes k on low-prob tokens | Only keeps high-prob |
| Low confidence | May cut off valid options | Keeps all plausible |
| Modern LLMs | Less common | Standard (GPT-5, Claude) |
Interview Insight: Top-p has largely replaced top-k in production (more adaptive).
Combined Strategies (What Production Uses)
GPT-5.2 API Default:
temperature = 1.0
top_p = 0.9
Process:
- Apply temperature scaling
- Then apply top-p filtering
- Sample from result
def combined_sample(logits, temperature=1.0, top_p=0.9, top_k=None):
    """
    Production-ready sampling combining temperature + top-p + top-k.
    """
    # Step 1: Temperature scaling
    if temperature == 0:
        return np.argmax(logits)
    scaled_logits = logits / temperature
    probs = softmax(scaled_logits)
    # Step 2: Top-k filtering (if specified)
    if top_k is not None:
        top_k_indices = np.argsort(probs)[-top_k:]
        filtered_probs = np.zeros_like(probs)
        filtered_probs[top_k_indices] = probs[top_k_indices]
        probs = filtered_probs / filtered_probs.sum()
    # Step 3: Top-p (nucleus) filtering
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumsum_probs = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cumsum_probs, top_p)
    nucleus_indices = sorted_indices[:cutoff + 1]
    nucleus_probs = sorted_probs[:cutoff + 1]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    # Step 4: Sample
    relative_idx = np.random.choice(len(nucleus_probs), p=nucleus_probs)
    next_token = nucleus_indices[relative_idx]
    return next_token
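A quick usage sketch with the toy logits from earlier (the exact draws vary because sampling is random):

np.random.seed(0)  # For a reproducible illustration
logits = np.array([2.0, 5.0, 1.0, 0.5])
samples = [combined_sample(logits, temperature=2.0, top_p=0.9) for _ in range(10)]
print(samples)
# Mostly index 1, occasionally 0 or 2; index 3 falls outside the nucleus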
Advanced: Frequency & Presence Penalties
Discourage repetition without changing core sampling:
Frequency Penalty:
def apply_frequency_penalty(logits, generated_tokens, penalty=0.5):
    """
    Reduce logits for tokens that appeared often.
    """
    adjusted_logits = logits.copy()
    # Count token frequencies
    token_counts = {}
    for token in generated_tokens:
        token_counts[token] = token_counts.get(token, 0) + 1
    # Apply penalty
    for token, count in token_counts.items():
        adjusted_logits[token] -= penalty * count
    return adjusted_logits

# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])
generated = [1, 1, 1]  # Token 1 appeared 3 times
penalized = apply_frequency_penalty(logits, generated, penalty=0.5)
# [2.0, 5.0 - 0.5*3, 1.0, 0.5] = [2.0, 3.5, 1.0, 0.5]
# Token 1 less likely to be sampled again
Presence Penalty:
def apply_presence_penalty(logits, generated_tokens, penalty=0.5):
    """
    Reduce logits for tokens that appeared at all.
    """
    adjusted_logits = logits.copy()
    unique_tokens = set(generated_tokens)
    for token in unique_tokens:
        adjusted_logits[token] -= penalty  # Fixed penalty, not count-based
    return adjusted_logits
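Applied to the same toy logits as the frequency-penalty example:

logits = np.array([2.0, 5.0, 1.0, 0.5])
generated = [1, 1, 1, 2]  # Token 1 three times, token 2 once
penalized = apply_presence_penalty(logits, generated, penalty=0.5)
# [2.0, 4.5, 0.5, 0.5] -- each token that appeared gets one fixed penalty,
# no matter how many times it occurred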
When to Use:
- Frequency penalty: Prevent repetitive phrases
- Presence penalty: Encourage diverse vocabulary
- Both: Creative writing, brainstorming
GPT-5.2 API:
response = openai.ChatCompletion.create(
    model="gpt-5",
    messages=[...],
    temperature=0.8,
    top_p=0.9,
    frequency_penalty=0.5,  # 0.0-2.0
    presence_penalty=0.3,   # 0.0-2.0
)
Real-World Configuration Examples
Factual Q&A Bot
config = {
    "temperature": 0.2,        # Low randomness
    "top_p": 0.9,              # Standard nucleus
    "frequency_penalty": 0.0,  # Don't penalize facts
    "presence_penalty": 0.0,
}
# Output: Consistent, factual answers
# "The capital of France is Paris."
Creative Writing Assistant
config = {
    "temperature": 0.9,        # High creativity
    "top_p": 0.95,             # Wider nucleus
    "frequency_penalty": 0.6,  # Avoid repetition
    "presence_penalty": 0.4,   # Diverse vocabulary
}
# Output: Creative, varied prose
# "The crimson sunset painted the horizon with ethereal hues..."
Code Completion
config = {
    "temperature": 0.0,  # Deterministic
    "top_p": 1.0,        # Not used (temp=0)
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
# Output: Most likely code (greedy)
# def fibonacci(n):
#     if n <= 1:
#         return n
Customer Support (Balanced)
config = {
    "temperature": 0.6,        # Balanced
    "top_p": 0.9,
    "frequency_penalty": 0.3,  # Slight variation
    "presence_penalty": 0.2,
}
# Output: Consistent but not robotic
# "I understand your concern. Let me help you resolve this issue..."
Common Interview Questions & Answers
Q1: "Temperature vs. Top-p - which is better?"
Answer:
"They serve different purposes:
- Temperature adjusts overall randomness globally for all tokens
- Top-p adapts nucleus size based on model confidence
Best practice: Use both. Temperature=0.7 + Top-p=0.9 is common. Temperature sets baseline creativity, top-p prevents sampling from very unlikely tokens."
Q2: "Your chatbot sometimes repeats itself. How do you fix it?"
Debugging Steps:
# 1. Check for repetition in output
output = "I can help you. I can help you. I can help you."
# → Frequency penalty needed

# 2. Add frequency penalty
config = {
    "temperature": 0.7,
    "frequency_penalty": 0.8,  # Start here
}

# 3. If still repeating, check for:
# - Very low temperature (< 0.3) → increase to 0.5+
# - Prompt engineering issue (ask model to "be concise")
# - Context window overflow (model sees its own output, loops)
Q3: "How do you balance creativity vs. factual accuracy?"
Answer Framework:
| Task Type | Temperature | Top-p | Reasoning |
|---|---|---|---|
| Factual Q&A | 0.0-0.3 | 0.9 | Near-greedy decoding minimizes hallucination |
| General Chat | 0.5-0.7 | 0.9 | Balanced |
| Creative Writing | 0.8-1.0 | 0.95 | Exploration needed |
| Code | 0.0-0.2 | 0.9 | Syntax errors costly |
Production Strategy:
def adaptive_temperature(task_type, confidence_score):
    """
    Adjust temperature based on task and model confidence.
    """
    base_temps = {
        "factual": 0.2,
        "chat": 0.7,
        "creative": 0.9,
        "code": 0.1,
    }
    base_temp = base_temps[task_type]
    # Lower temperature if model is uncertain
    if confidence_score < 0.5:
        return base_temp * 0.7  # More conservative
    else:
        return base_temp
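For instance (`confidence_score` is assumed to come from some upstream estimator, e.g. a classifier or logprob-based heuristic):

print(adaptive_temperature("factual", confidence_score=0.9))   # 0.2
print(adaptive_temperature("creative", confidence_score=0.3))  # 0.9 * 0.7 ≈ 0.63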
Q4: "Why does temperature=0 not always give the same output?"
Answer:
"Small implementation details can cause variation even at temperature=0:
- Floating-point precision: Different hardware (GPU vs CPU) may compute slightly different logits
- Batching: Batch processing can introduce non-determinism
- Sampling implementation: some backends treat temperature=0 as a tiny positive temperature rather than a strict argmax, so top-p filtering and sampling can still introduce slight randomness
- Model versioning: API model updates
For true determinism: Set temperature=0, top-p=1.0, and use the same seed (if API supports)."
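For the local NumPy samplers in this module, determinism across runs just means fixing the RNG seed; a minimal sketch:

np.random.seed(42)  # Fix the global RNG state
logits = np.array([2.0, 5.0, 1.0, 0.5])
run1 = [temperature_sample(logits, temperature=0.7) for _ in range(5)]

np.random.seed(42)  # Reset to the same seed before the second run
run2 = [temperature_sample(logits, temperature=0.7) for _ in range(5)]

assert run1 == run2  # Identical samples when the seed is identical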
Advanced Technique: Beam Search
Alternative to sampling - explore multiple paths:
def beam_search(model, prompt, beam_size=5, max_length=50):
    """
    Beam search for finding high-probability sequences.
    """
    # Initialize with prompt
    beams = [(prompt, 0.0)]  # (sequence, log_prob)
    for _ in range(max_length):
        candidates = []
        for sequence, score in beams:
            # Get next token probabilities
            logits = model(sequence)
            probs = softmax(logits)
            # Take top-k most likely next tokens
            top_k = 5  # Or beam_size
            top_indices = np.argsort(probs)[-top_k:]
            for idx in top_indices:
                new_sequence = sequence + [idx]
                new_score = score + np.log(probs[idx])
                candidates.append((new_sequence, new_score))
        # Keep top beam_size candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_size]
    # Return best sequence
    return beams[0][0]
When to Use:
- Translation (find best overall translation)
- Summarization (maximize coherence)
- Not for chat (too conservative, boring)
Key Takeaways for Interviews
✅ Know the basics: Temperature, top-p, top-k, penalties
✅ Explain trade-offs: Creativity vs. factuality, determinism vs. diversity
✅ Production configs: Memorize common settings (Q&A vs. creative)
✅ Debugging: Repetition → frequency penalty, hallucination → lower temperature
✅ Advanced: Mention beam search, adaptive sampling
Next Module: Apply these fundamentals to Prompt Engineering in Module 2.