LLM Fundamentals for Interviews
Sampling Strategies & Generation Control
Why This Matters for Interviews
Real Interview Question (OpenAI L6):
"Your chatbot is sometimes creative but sometimes gives factually incorrect answers. How would you control this trade-off? What parameters would you adjust and why?"
Real Interview Question (Meta):
"Explain the difference between temperature and top-p sampling. When would you use one over the other?"
Understanding sampling is critical for production LLM systems where you need to balance creativity, consistency, and factual accuracy.
How LLMs Generate Text (The Autoregressive Process)
Step-by-step:
- Forward pass: Model computes logits for next token
- Convert to probabilities: Apply softmax
- Sample: Choose next token based on strategy
- Repeat: Add token to sequence, generate next
def autoregressive_generation(model, prompt, max_tokens=100, strategy="greedy"):
    """
    Core LLM generation loop.
    """
    tokens = tokenize(prompt)
    for _ in range(max_tokens):
        # 1. Forward pass
        logits = model(tokens)  # Shape: (vocab_size,)
        # 2. Convert to probabilities
        probs = softmax(logits)
        # 3. Sample next token
        next_token = sample(probs, strategy=strategy)
        # 4. Add to sequence
        tokens.append(next_token)
        # Stop if <EOS> token
        if next_token == EOS_TOKEN:
            break
    return tokens
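The snippets in this module assume NumPy plus a few helpers (`tokenize`, `sample`, and `softmax`) that aren't defined above. A minimal `softmax` sketch you can assume throughout the rest of this section:

import numpy as np

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    # Subtract the max logit for numerical stability before exponentiating
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()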
Sampling Strategy 1: Greedy Decoding (Deterministic)
Always pick the most probable token:
def greedy_sample(probs):
    """
    Greedy sampling: argmax(probs).
    """
    return np.argmax(probs)

# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])  # Raw model outputs
probs = softmax(logits)                  # ≈ [0.05, 0.93, 0.02, 0.01]
next_token = greedy_sample(probs)        # Always returns index 1
Pros:
- ✅ Deterministic (same prompt → same output)
- ✅ Fast (no sampling overhead)
- ✅ Good for factual tasks (Q&A, translation)
Cons:
- ❌ Repetitive (gets stuck in loops)
- ❌ No creativity
- ❌ Misses better alternatives
When to Use:
- Factual Q&A
- Translation
- Code completion (when you want most likely)
Example Output:
Prompt: "The capital of France is"
Greedy: "Paris. The capital of France is Paris. The capital of..."
→ Stuck in repetition!
Sampling Strategy 2: Temperature Scaling
Adjust randomness by dividing logits before softmax:
probs = softmax(logits / temperature)
Temperature Values & Effects
Temperature = 0 (near-greedy):
temperature = 0.01
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.0, 1.0, 0.0, 0.0]  # Almost deterministic

Temperature = 1 (unchanged):
temperature = 1.0
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.05, 0.93, 0.02, 0.01]  # Original distribution

Temperature = 2 (more random):
temperature = 2.0
probs = softmax(np.array([2.0, 5.0, 1.0, 0.5]) / temperature)
# probs ≈ [0.15, 0.68, 0.09, 0.07]  # Flatter distribution
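You can reproduce these distributions with the `softmax` helper sketched earlier:

logits = np.array([2.0, 5.0, 1.0, 0.5])
for T in (0.01, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 2))
# 0.01 → ≈ [0.00, 1.00, 0.00, 0.00]
# 1.0  → ≈ [0.05, 0.93, 0.02, 0.01]
# 2.0  → ≈ [0.15, 0.68, 0.09, 0.07]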
Visual Comparison:
| Temperature | Distribution | Use Case |
|---|---|---|
| 0.0-0.3 | Sharp peak | Factual tasks, code |
| 0.5-0.7 | Balanced | General chat |
| 0.8-1.0 | Creative | Brainstorming, stories |
| 1.5-2.0 | Very random | Experimental, art |
Code Implementation:
def temperature_sample(logits, temperature=1.0):
    """
    Sample with temperature scaling.
    """
    if temperature == 0:
        return np.argmax(logits)  # Greedy
    # Scale logits by temperature
    scaled_logits = logits / temperature
    # Convert to probabilities
    probs = softmax(scaled_logits)
    # Sample from distribution
    next_token = np.random.choice(len(probs), p=probs)
    return next_token
# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])
# Low temperature (more focused)
token1 = temperature_sample(logits, temperature=0.3)
# → Almost always returns index 1 (highest logit)
# High temperature (more random)
token2 = temperature_sample(logits, temperature=2.0)
# → Could return any index, more likely to pick lower-prob tokens
Interview Question: "What happens if temperature is too high?"
Answer:
"The distribution becomes too flat - low-probability tokens get boosted. This leads to:
- Incoherent outputs (random words)
- Grammar errors
- Factual hallucinations
At temperature=10, even a token with logit -5 gets significant probability. The model loses its learned preferences from training."
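A tiny sketch (made-up logits, not from a real model) that makes this concrete:

logits = np.array([5.0, 2.0, 0.0, -5.0])
print(np.round(softmax(logits / 1.0), 5))   # ≈ [0.947, 0.047, 0.006, 0.00004]
print(np.round(softmax(logits / 10.0), 3))  # ≈ [0.368, 0.273, 0.223, 0.136]
# At T=10, the logit -5 token jumps from ~0.004% to ~14% probability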
Sampling Strategy 3: Top-p (Nucleus Sampling)
Sample from smallest set of tokens whose cumulative probability ≥ p:
def top_p_sample(probs, p=0.9):
    """
    Nucleus sampling: sample from top tokens totaling p probability.
    """
    # Sort probabilities in descending order
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    # Compute cumulative probabilities
    cumulative_probs = np.cumsum(sorted_probs)
    # Find cutoff index where cumsum >= p
    cutoff_index = np.searchsorted(cumulative_probs, p)
    # Keep only top-p tokens
    top_p_indices = sorted_indices[:cutoff_index + 1]
    top_p_probs = sorted_probs[:cutoff_index + 1]
    # Renormalize
    top_p_probs = top_p_probs / top_p_probs.sum()
    # Sample from nucleus
    relative_index = np.random.choice(len(top_p_probs), p=top_p_probs)
    next_token = top_p_indices[relative_index]
    return next_token
# Example:
probs = np.array([0.5, 0.3, 0.1, 0.05, 0.03, 0.02])
# Sorted: [0.5, 0.3, 0.1, 0.05, 0.03, 0.02]
# Cumsum: [0.5, 0.8, 0.9, 0.95, 0.98, 1.0]
# p=0.9 → Keep first 3 tokens [0.5, 0.3, 0.1]
# Renorm: [0.556, 0.333, 0.111]
# Sample from these 3 tokens only
Why Top-p > Temperature:
- Adaptive: Nucleus size changes based on confidence
- High-confidence step (one token 0.9 prob): nucleus size = 1
- Low-confidence step (many tokens with ~0.1 prob each): nucleus size ≈ 9 or more
- Prevents tail sampling: Never samples from very low-prob tokens
Comparison:
| Scenario | Probs | Temperature=1.0 | Top-p=0.9 |
|---|---|---|---|
| High confidence | [0.9, 0.05, ...] | Samples all (including 0.001 tokens) | Samples only [0.9, 0.05] |
| Low confidence | [0.2, 0.18, 0.15, ...] | Samples all | Samples top ~7 tokens |
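A small sketch (made-up distributions) illustrating the adaptive nucleus size; note that top-k with a fixed k would keep the same number of tokens in both cases:

def nucleus_size(probs, p=0.9):
    """Number of tokens kept by top-p for a given distribution."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.searchsorted(cumulative, p) + 1)

confident = np.array([0.92, 0.04, 0.02, 0.01, 0.01])
uncertain = np.array([0.15, 0.14, 0.13, 0.12, 0.11, 0.10, 0.09, 0.08, 0.05, 0.03])

print(nucleus_size(confident))  # 1 -- a single dominant token already covers p
print(nucleus_size(uncertain))  # 8 -- needs 8 tokens to reach 0.9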
When to Use:
- Top-p=0.9: General purpose (GPT-5.2 default)
- Top-p=0.95: More creative
- Top-p=0.5-0.7: Very focused, factual
Sampling Strategy 4: Top-k (Fixed Nucleus)
Keep only top k most probable tokens:
def top_k_sample(probs, k=50):
    """
    Top-k sampling: sample from top k tokens.
    """
    # Get top-k indices
    top_k_indices = np.argsort(probs)[-k:]
    # Get top-k probs
    top_k_probs = probs[top_k_indices]
    # Renormalize
    top_k_probs = top_k_probs / top_k_probs.sum()
    # Sample
    relative_index = np.random.choice(len(top_k_probs), p=top_k_probs)
    next_token = top_k_indices[relative_index]
    return next_token
Top-k vs. Top-p:
| Factor | Top-k | Top-p |
|---|---|---|
| Nucleus size | Fixed (always k tokens) | Adaptive |
| High confidence | Wastes k on low-prob tokens | Only keeps high-prob |
| Low confidence | May cut off valid options | Keeps all plausible |
| Modern LLMs | Less common | Standard (GPT-5, Claude) |
Interview Insight: Top-p has largely replaced top-k in production (more adaptive).
Combined Strategies (What Production Uses)
GPT-5.2 API Default:
temperature = 1.0
top_p = 0.9
Process:
- Apply temperature scaling
- Then apply top-p filtering
- Sample from result
def combined_sample(logits, temperature=1.0, top_p=0.9, top_k=None):
    """
    Production-ready sampling combining temperature + top-p + top-k.
    """
    # Step 1: Temperature scaling
    if temperature == 0:
        return np.argmax(logits)
    scaled_logits = logits / temperature
    probs = softmax(scaled_logits)
    # Step 2: Top-k filtering (if specified)
    if top_k is not None:
        top_k_indices = np.argsort(probs)[-top_k:]
        filtered_probs = np.zeros_like(probs)
        filtered_probs[top_k_indices] = probs[top_k_indices]
        probs = filtered_probs / filtered_probs.sum()
    # Step 3: Top-p (nucleus) filtering
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]
    cumsum_probs = np.cumsum(sorted_probs)
    cutoff = np.searchsorted(cumsum_probs, top_p)
    nucleus_indices = sorted_indices[:cutoff + 1]
    nucleus_probs = sorted_probs[:cutoff + 1]
    nucleus_probs = nucleus_probs / nucleus_probs.sum()
    # Step 4: Sample
    relative_idx = np.random.choice(len(nucleus_probs), p=nucleus_probs)
    next_token = nucleus_indices[relative_idx]
    return next_token
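A quick usage sketch with the toy logits from earlier (the exact draws vary because sampling is random):

np.random.seed(0)  # For a reproducible illustration
logits = np.array([2.0, 5.0, 1.0, 0.5])
samples = [combined_sample(logits, temperature=2.0, top_p=0.9) for _ in range(10)]
print(samples)
# Mostly index 1, occasionally 0 or 2; index 3 falls outside the nucleus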
Advanced: Frequency & Presence Penalties
Discourage repetition without changing core sampling:
Frequency Penalty:
def apply_frequency_penalty(logits, generated_tokens, penalty=0.5):
    """
    Reduce logits for tokens that appeared often.
    """
    adjusted_logits = logits.copy()
    # Count token frequencies
    token_counts = {}
    for token in generated_tokens:
        token_counts[token] = token_counts.get(token, 0) + 1
    # Apply penalty
    for token, count in token_counts.items():
        adjusted_logits[token] -= penalty * count
    return adjusted_logits

# Example:
logits = np.array([2.0, 5.0, 1.0, 0.5])
generated = [1, 1, 1]  # Token 1 appeared 3 times
penalized = apply_frequency_penalty(logits, generated, penalty=0.5)
# [2.0, 5.0 - 0.5*3, 1.0, 0.5] = [2.0, 3.5, 1.0, 0.5]
# Token 1 less likely to be sampled again
Presence Penalty:
def apply_presence_penalty(logits, generated_tokens, penalty=0.5):
    """
    Reduce logits for tokens that appeared at all.
    """
    adjusted_logits = logits.copy()
    unique_tokens = set(generated_tokens)
    for token in unique_tokens:
        adjusted_logits[token] -= penalty  # Fixed penalty, not count-based
    return adjusted_logits
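Applied to the same toy logits as the frequency-penalty example:

logits = np.array([2.0, 5.0, 1.0, 0.5])
generated = [1, 1, 1, 2]  # Token 1 three times, token 2 once
penalized = apply_presence_penalty(logits, generated, penalty=0.5)
# [2.0, 4.5, 0.5, 0.5] -- each token that appeared gets one fixed penalty,
# no matter how many times it occurred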
When to Use:
- Frequency penalty: Prevent repetitive phrases
- Presence penalty: Encourage diverse vocabulary
- Both: Creative writing, brainstorming
GPT-5.2 API:
response = openai.ChatCompletion.create(
    model="gpt-5",
    messages=[...],
    temperature=0.8,
    top_p=0.9,
    frequency_penalty=0.5,  # 0.0-2.0
    presence_penalty=0.3,   # 0.0-2.0
)
Real-World Configuration Examples
Factual Q&A Bot
config = {
    "temperature": 0.2,        # Low randomness
    "top_p": 0.9,              # Standard nucleus
    "frequency_penalty": 0.0,  # Don't penalize facts
    "presence_penalty": 0.0,
}
# Output: Consistent, factual answers
# "The capital of France is Paris."
Creative Writing Assistant
config = {
    "temperature": 0.9,        # High creativity
    "top_p": 0.95,             # Wider nucleus
    "frequency_penalty": 0.6,  # Avoid repetition
    "presence_penalty": 0.4,   # Diverse vocabulary
}
# Output: Creative, varied prose
# "The crimson sunset painted the horizon with ethereal hues..."
Code Completion
config = {
    "temperature": 0.0,  # Deterministic
    "top_p": 1.0,        # Not used (temp=0)
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}
# Output: Most likely code (greedy)
# def fibonacci(n):
#     if n <= 1:
#         return n
Customer Support (Balanced)
config = {
    "temperature": 0.6,        # Balanced
    "top_p": 0.9,
    "frequency_penalty": 0.3,  # Slight variation
    "presence_penalty": 0.2,
}
# Output: Consistent but not robotic
# "I understand your concern. Let me help you resolve this issue..."
Common Interview Questions & Answers
Q1: "Temperature vs. Top-p - which is better?"
Answer:
"They serve different purposes:
- Temperature adjusts overall randomness globally for all tokens
- Top-p adapts nucleus size based on model confidence
Best practice: Use both. Temperature=0.7 + Top-p=0.9 is common. Temperature sets baseline creativity, top-p prevents sampling from very unlikely tokens."
Q2: "Your chatbot sometimes repeats itself. How do you fix it?"
Debugging Steps:
# 1. Check for repetition in output
output = "I can help you. I can help you. I can help you."
# → Frequency penalty needed

# 2. Add frequency penalty
config = {
    "temperature": 0.7,
    "frequency_penalty": 0.8,  # Start here
}

# 3. If still repeating, check for:
# - Very low temperature (< 0.3) → increase to 0.5+
# - Prompt engineering issue (ask model to "be concise")
# - Context window overflow (model sees its own output, loops)
Q3: "How do you balance creativity vs. factual accuracy?"
Answer Framework:
| Task Type | Temperature | Top-p | Reasoning |
|---|---|---|---|
| Factual Q&A | 0.0-0.3 | 0.9 | Near-greedy decoding minimizes hallucination |
| General Chat | 0.5-0.7 | 0.9 | Balanced |
| Creative Writing | 0.8-1.0 | 0.95 | Exploration needed |
| Code | 0.0-0.2 | 0.9 | Syntax errors costly |
Production Strategy:
def adaptive_temperature(task_type, confidence_score):
    """
    Adjust temperature based on task and model confidence.
    """
    base_temps = {
        "factual": 0.2,
        "chat": 0.7,
        "creative": 0.9,
        "code": 0.1,
    }
    base_temp = base_temps[task_type]
    # Lower temperature if model is uncertain
    if confidence_score < 0.5:
        return base_temp * 0.7  # More conservative
    else:
        return base_temp
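For instance (`confidence_score` is assumed to come from some upstream estimator, e.g. a classifier or logprob-based heuristic):

print(adaptive_temperature("factual", confidence_score=0.9))   # 0.2
print(adaptive_temperature("creative", confidence_score=0.3))  # 0.9 * 0.7 ≈ 0.63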
Q4: "Why does temperature=0 not always give the same output?"
Answer:
"Small implementation details can cause variation even at temperature=0:
- Floating-point precision: Different hardware (GPU vs CPU) may compute slightly different logits
- Batching: Batch processing can introduce non-determinism
- Sampling implementation: some backends treat temperature=0 as a tiny positive temperature rather than a strict argmax, so top-p filtering and sampling can still introduce slight randomness
- Model versioning: API model updates
For true determinism: Set temperature=0, top-p=1.0, and use the same seed (if API supports)."
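For the local NumPy samplers in this module, determinism across runs just means fixing the RNG seed; a minimal sketch:

np.random.seed(42)  # Fix the global RNG state
logits = np.array([2.0, 5.0, 1.0, 0.5])
run1 = [temperature_sample(logits, temperature=0.7) for _ in range(5)]

np.random.seed(42)  # Reset to the same seed before the second run
run2 = [temperature_sample(logits, temperature=0.7) for _ in range(5)]

assert run1 == run2  # Identical samples when the seed is identical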
Advanced Technique: Beam Search
Alternative to sampling - explore multiple paths:
def beam_search(model, prompt, beam_size=5, max_length=50):
    """
    Beam search for finding high-probability sequences.
    """
    # Initialize with prompt
    beams = [(prompt, 0.0)]  # (sequence, log_prob)
    for _ in range(max_length):
        candidates = []
        for sequence, score in beams:
            # Get next token probabilities
            logits = model(sequence)
            probs = softmax(logits)
            # Take top-k most likely next tokens
            top_k = 5  # Or beam_size
            top_indices = np.argsort(probs)[-top_k:]
            for idx in top_indices:
                new_sequence = sequence + [idx]
                new_score = score + np.log(probs[idx])
                candidates.append((new_sequence, new_score))
        # Keep top beam_size candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_size]
    # Return best sequence
    return beams[0][0]
When to Use:
- Translation (find best overall translation)
- Summarization (maximize coherence)
- Not for chat (too conservative, boring)
Key Takeaways for Interviews
✅ Know the basics: Temperature, top-p, top-k, penalties
✅ Explain trade-offs: Creativity vs. factuality, determinism vs. diversity
✅ Production configs: Memorize common settings (Q&A vs. creative)
✅ Debugging: Repetition → frequency penalty, hallucination → lower temperature
✅ Advanced: Mention beam search, adaptive sampling
Next Module: Apply these fundamentals to Prompt Engineering in Module 2.