Prompt Engineering for Interviews

In-Context Learning Mechanics

5 min read

Why This Matters for Interviews

LLM engineering interviews at OpenAI, Anthropic, Meta, and Google routinely ask questions like:

  • "Explain how in-context learning works" (mechanism, not just definition)
  • "When would you use ICL vs RAG vs fine-tuning?" (decision framework)
  • "How do you optimize few-shot example selection?" (retrieval strategies)

Real Interview Question (Meta L5):

"You're building a customer support chatbot. It needs to learn from 10,000 past conversations. Would you use in-context learning, RAG, or fine-tuning? Walk me through your decision process and the trade-offs."


What is In-Context Learning?

Definition: LLMs learn from examples in the prompt without updating model weights.

The Surprising Discovery (GPT-3, 2020):

  • Give GPT-3 a few examples of translation → it translates
  • Give it examples of code → it writes code
  • Give it examples of sentiment analysis → it classifies sentiment

All without a single gradient update.

Why This Matters:

  • No training required: Deploy new tasks instantly
  • No data needed: Just a few examples (2-8)
  • Flexible: Change behavior by changing examples
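A minimal sketch of what this looks like in practice: the "training data" is just text in the context window, so changing the task means editing the prompt, not the weights. (The reviews and labels below are made up for illustration.)

few_shot_prompt = """Classify the sentiment of each review.

Review: The battery lasts all day.
Sentiment: POSITIVE

Review: Stopped working after a week.
Sentiment: NEGATIVE

Review: The packaging was torn on arrival.
Sentiment:"""

# Send this string to any chat/completions API and the model continues the
# pattern (likely "NEGATIVE"); no gradient update, no training job.
print(few_shot_prompt)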

How Does ICL Actually Work?

The Mechanism (Research Perspective)

Hypothesis 1: Pattern Matching (Shallow)

  • Model recognizes surface patterns in examples
  • Problem: Doesn't explain why GPT-5 can do unseen tasks

Hypothesis 2: Induction Heads (Mechanistic Interpretability)

  • Specific attention heads learn to copy patterns from context
  • "Induction heads" in layers 10-20 detect and repeat patterns
  • Evidence: Ablating these heads destroys ICL ability

Hypothesis 3: Meta-Learning During Pretraining

  • Model sees millions of "task examples" in pretraining
  • Learns a general algorithm for "task inference from examples"
  • ICL is implicit meta-learning

Interview Answer:

"ICL works through induction heads - attention circuits that detect patterns in the context and apply them to new inputs. During pretraining on trillions of tokens, the model encounters countless implicit few-shot scenarios (e.g., a Wikipedia article defining a term, then using it). This trains the model to infer tasks from examples. At inference, giving 3-5 examples activates these induction heads to apply the pattern to your query."

Code Example - Visualizing Attention on Examples:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def visualize_icl_attention(model_name="gpt2"):
    """
    Show which parts of the prompt GPT attends to during ICL.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

    # Few-shot prompt
    prompt = """Translate English to French:

    English: Hello
    French: Bonjour

    English: Goodbye
    French: Au revoir

    English: Thank you
    French:"""

    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        attentions = outputs.attentions  # tuple of num_layers tensors, each (batch, num_heads, seq_len, seq_len)

    # Examine the last head of the last layer (a real analysis would sweep
    # layers and heads to locate induction heads, which tend to sit mid-network)
    last_layer_attn = attentions[-1][0, -1, :, :]  # (seq_len, seq_len)

    # Where does the model attend when predicting the French translation?
    last_token_attn = last_layer_attn[-1, :]  # Attention from last token

    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    print("Top attended tokens (model is copying patterns from):")
    top_indices = last_token_attn.argsort(descending=True)[:5]
    for idx in top_indices:
        print(f"  {tokens[idx]}: {last_token_attn[idx]:.3f}")

visualize_icl_attention()
# With an induction-style head, the top-attended tokens are the earlier French
# completions ("Bonjour", "Au revoir"): the head is copying the example pattern.

ICL vs RAG vs Fine-Tuning: The Decision Matrix

This is the most important framework for interviews.

Comparison Table

| Dimension | In-Context Learning (ICL) | RAG (Retrieval-Augmented) | Fine-Tuning |
|---|---|---|---|
| Setup Time | Instant (just prompt) | Hours (build index) | Days-weeks (training) |
| Data Required | 2-8 examples | 100s-1000s documents | 1000s-100Ks examples |
| Cost per Request | $$$ (larger prompts) | $$ (embedding + LLM) | $ (inference only) |
| Customization Depth | Surface (format, style) | Medium (knowledge) | Deep (behavior, domain) |
| Knowledge Updates | Instant (change examples) | Fast (update index) | Slow (retrain) |
| Context Window Limit | Yes (fits in prompt?) | Partial (retrieval filters) | No |
| Best For | New tasks, format learning | Knowledge-intensive Q&A | Domain adaptation, behavior |

Decision Framework (Interview Gold)

Use ICL when:

  • ✅ Task is new/experimental (testing ideas)
  • ✅ Need instant deployment (no training time)
  • ✅ Few examples available (2-10)
  • ✅ Task changes frequently (A/B testing prompts)

Use RAG when:

  • ✅ Need external knowledge (docs, databases)
  • ✅ Knowledge updates frequently (news, inventory)
  • ✅ Questions require multi-hop reasoning over documents
  • ✅ Want to cite sources (transparency)

Use Fine-Tuning when:

  • ✅ Need consistent behavior on a specific task
  • ✅ Task is well-defined and stable
  • ✅ Have 1000+ high-quality examples
  • ✅ Want lower latency and cost per request (smaller model)
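As a rough sketch, this framework can be encoded as a simple heuristic. The thresholds and argument names below are illustrative assumptions, not fixed rules.

def choose_adaptation_strategy(num_examples, needs_fresh_knowledge,
                               task_is_stable, high_volume):
    """Toy heuristic mirroring the decision framework above (illustrative thresholds)."""
    strategies = ["ICL"]  # few-shot examples are cheap and always worth trying first

    if needs_fresh_knowledge:
        strategies.append("RAG")  # knowledge lives outside the model and changes often

    if num_examples >= 1000 and task_is_stable and high_volume:
        strategies.append("fine-tuning")  # enough data, stable task, volume to amortize training

    return strategies

# The Meta question: 10K conversations, policies change, high request volume
print(choose_adaptation_strategy(10_000, True, True, True))
# -> ['ICL', 'RAG', 'fine-tuning'] (the hybrid answer below)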

Example Interview Response:

Interviewer: "10,000 customer support conversations. ICL, RAG, or fine-tune?"

Strong Answer:

Analysis:
- Data size: 10,000 examples → enough for fine-tuning
- Task stability: Customer support is ongoing → stable task
- Knowledge updates: Policies change → need fresh info

Decision: Hybrid approach (production-ready)

1. RAG for knowledge:
   - Index all 10K conversations + company docs
   - Retrieve top-3 similar past conversations
   - Retrieve relevant policy docs
   - This handles: "What's our return policy?" (cite latest policy)

2. Few-shot ICL for format:
   - 2-3 examples of ideal response style
   - "Be concise, empathetic, always offer next steps"
   - This ensures: Consistent tone

3. Fine-tuning (optional, if budget allows):
   - Fine-tune GPT-5.4-mini on 10K conversations
   - Lower cost per request ($0.75 vs $2.50 per 1M input tokens)
   - Faster (less prompt overhead)
   - Use for high-volume (>100K requests/month)

Why NOT pure ICL?
- 10K conversations won't fit in context window (128K tokens ≈ 2000 conversations max)
- Expensive: 2000 example tokens per request × 1M requests/month = $3,500/month just for examples

Why NOT pure fine-tuning?
- Can't update knowledge without retraining
- Might hallucinate outdated policies
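A minimal sketch of how the hybrid prompt could be assembled per request. The retriever objects and dict keys are assumptions for illustration; any similarity search works, e.g. the FewShotRetriever shown later in this lesson.

def build_support_prompt(user_message, convo_retriever, policy_retriever, style_examples):
    """Assemble the hybrid prompt: RAG for knowledge, few-shot ICL for tone."""
    past_convos = convo_retriever.retrieve(user_message, k=3)   # similar past tickets
    policies = policy_retriever.retrieve(user_message, k=2)     # current policy snippets

    prompt = "You are a support agent. Be concise, empathetic, always offer next steps.\n\n"

    prompt += "Relevant policies:\n"
    for doc in policies:
        prompt += f"- {doc['text']}\n"

    prompt += "\nSimilar past conversations:\n"
    for convo in past_convos:
        prompt += f"Customer: {convo['input']}\nAgent: {convo['output']}\n\n"

    prompt += "Example responses (tone to imitate):\n"
    for ex in style_examples:
        prompt += f"Customer: {ex['input']}\nAgent: {ex['output']}\n\n"

    prompt += f"Customer: {user_message}\nAgent:"
    return prompt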

Optimizing Few-Shot Example Selection

The Problem: With 10,000 examples, which 3-5 do you include in the prompt?

Strategy 1: Random Sampling (Baseline)

import random

def random_few_shot(examples, k=3):
    """Simplest approach: random k examples."""
    return random.sample(examples, k)

# Pros: Fast, no bias
# Cons: Might pick irrelevant examples

Strategy 2: Diversity Sampling

Goal: Cover different input patterns.

from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
import numpy as np

def diversity_few_shot(examples, k=3):
    """
    Select k diverse examples using clustering.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed all examples
    texts = [ex['input'] for ex in examples]
    embeddings = model.encode(texts)

    # Cluster into k groups
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(embeddings)

    # Select example closest to each cluster center
    selected = []
    for i in range(k):
        cluster_points = np.where(kmeans.labels_ == i)[0]
        center = kmeans.cluster_centers_[i]

        # Find closest point to center
        distances = np.linalg.norm(embeddings[cluster_points] - center, axis=1)
        closest_idx = cluster_points[distances.argmin()]

        selected.append(examples[closest_idx])

    return selected

# Example
examples = [
    {"input": "Great product!", "output": "POSITIVE"},
    {"input": "Terrible quality", "output": "NEGATIVE"},
    {"input": "Okay I guess", "output": "NEUTRAL"},
    {"input": "Love it so much!", "output": "POSITIVE"},
    {"input": "Not what I expected", "output": "NEGATIVE"},
    {"input": "It's fine", "output": "NEUTRAL"},
]

diverse_examples = diversity_few_shot(examples, k=3)
# Typically returns one POSITIVE, one NEGATIVE, one NEUTRAL example
# (clusters usually, but not always, align with the labels)

Strategy 3: Similarity-Based Retrieval (Best for Production)

Goal: Select examples most similar to the current query.

from sentence_transformers import SentenceTransformer, util
import torch

class FewShotRetriever:
    """
    Retrieve most relevant few-shot examples for each query.
    A common pattern in production few-shot pipelines.
    """
    def __init__(self, example_pool):
        self.examples = example_pool
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

        # Pre-compute embeddings (do once, cache)
        self.embeddings = self.model.encode(
            [ex['input'] for ex in example_pool],
            convert_to_tensor=True
        )

    def retrieve(self, query, k=3):
        """Retrieve k most similar examples to query."""
        query_embedding = self.model.encode(query, convert_to_tensor=True)

        # Compute cosine similarity
        similarities = util.cos_sim(query_embedding, self.embeddings)[0]

        # Get top-k
        top_k = torch.topk(similarities, k=k)

        selected = []
        for idx in top_k.indices:
            selected.append(self.examples[idx.item()])

        return selected

# Usage
retriever = FewShotRetriever(examples)

query = "I'm disappointed with this purchase"
relevant_examples = retriever.retrieve(query, k=3)

# Likely returns: "Terrible quality", "Not what I expected", "Okay I guess"
# (all negative/neutral, similar sentiment)

Cost Analysis:

For 10,000-example pool:

  • Embedding storage: 10,000 × 384 dimensions × 4 bytes ≈ 15 MB (negligible)
  • Retrieval latency: <10ms with FAISS/Annoy
  • API cost: $0 (local similarity search)

Interview Insight: Mentioning "we'd use sentence-transformers for retrieval, cache embeddings, and use FAISS for scale" shows production thinking.
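For reference, a minimal sketch of that setup with FAISS, assuming the same all-MiniLM-L6-v2 embeddings and the examples pool from above. Normalized vectors with an inner-product index give exact cosine-similarity search.

import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed and normalize the example pool once (inner product == cosine similarity)
embeddings = model.encode([ex['input'] for ex in examples]).astype('float32')
faiss.normalize_L2(embeddings)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact search; use IVF/HNSW at larger scale
index.add(embeddings)

# Per-query retrieval
query_vec = model.encode(["I'm disappointed with this purchase"]).astype('float32')
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 3)

relevant_examples = [examples[i] for i in ids[0]]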

Strategy 4: Hard Example Mining

Goal: Include challenging examples that teach edge cases.

import numpy as np

def hard_example_mining(examples, model, k=3):
    """
    Select examples the model currently gets wrong zero-shot.
    Improves performance on hard cases.
    `model.predict_zero_shot` is a placeholder for your own zero-shot call.
    """
    scores = []

    for ex in examples:
        # Test model with zero-shot
        prediction = model.predict_zero_shot(ex['input'])

        # Score by difficulty: 0 if correct, 1 if wrong
        correct = (prediction == ex['output'])
        scores.append(1 - correct)  # Higher score = harder

    # Select top-k hardest examples
    hard_indices = np.argsort(scores)[-k:]
    return [examples[i] for i in hard_indices]

# This is adaptive few-shot learning
# Model improves on its weaknesses

Dynamic vs Static Few-Shot Prompts

Static Few-Shot (Simple)

STATIC_EXAMPLES = [
    {"input": "Great!", "output": "POSITIVE"},
    {"input": "Bad.", "output": "NEGATIVE"},
    {"input": "Okay.", "output": "NEUTRAL"}
]

def static_prompt(query):
    prompt = "Classify sentiment:\n\n"
    for ex in STATIC_EXAMPLES:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {query}\nOutput:"
    return prompt

Pros:

  • ✅ Simple, deterministic
  • ✅ Can cache prompt prefix (90% cost reduction with GPT-5.4 prompt caching)

Cons:

  • ❌ Same examples for all queries (not optimal)

Dynamic Few-Shot (Production)

def dynamic_prompt(query, retriever, k=3):
    """Retrieve relevant examples for each query."""
    examples = retriever.retrieve(query, k=k)

    prompt = "Classify sentiment:\n\n"
    for ex in examples:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {query}\nOutput:"

    return prompt

Pros:

  • ✅ Better accuracy (relevant examples)
  • ✅ Handles edge cases

Cons:

  • ❌ More complex
  • ❌ Can't cache (different examples per query)

Interview Trade-off:

"I'd start with static few-shot + prompt caching for cost optimization. If accuracy isn't good enough, upgrade to dynamic retrieval for the bottom 10% of queries (based on confidence scores)."


Context Length Considerations

The Math:

For GPT-5.4 (128K context window):

  • 1 token ≈ 4 characters
  • 128K tokens ≈ 512,000 characters ≈ 100-150 pages of text

Few-Shot Budget:

def calculate_few_shot_budget(max_context=128000, query_tokens=100,
                               output_tokens=500, safety_margin=0.1):
    """
    Return the token budget available for few-shot examples.
    """
    # Reserve space for query and output
    reserved = query_tokens + output_tokens

    # Safety margin for system prompt, formatting
    available = max_context * (1 - safety_margin) - reserved

    return available

# For GPT-5.4
available = calculate_few_shot_budget()
print(f"Available tokens for examples: {available:,.0f}")

# If each example is ~200 tokens (input + output + formatting):
max_examples = int(available // 200)
print(f"Max few-shot examples: {max_examples}")

# Output: ~570 examples max (but diminishing returns after 5-10)

Interview Question: "You have 1000 examples but only 5 fit in the prompt. How do you choose?"

Answer:

  1. Use retrieval to select most relevant 5 (similarity-based)
  2. Or: Stratified sampling (1-2 from each category/cluster)
  3. Or: Meta-learning approach (learn which examples generalize best)
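A quick sketch of option 2, stratified sampling by label (assumes each example dict carries an 'output' label like the pools above):

import random
from collections import defaultdict

def stratified_few_shot(examples, k=5):
    """Spread the few-shot budget across output labels (one or two per category)."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex['output']].append(ex)

    selected = []
    labels = list(by_label)
    # Round-robin over labels until the budget is used up
    while len(selected) < k and any(by_label.values()):
        for label in labels:
            if by_label[label] and len(selected) < k:
                pool = by_label[label]
                selected.append(pool.pop(random.randrange(len(pool))))
    return selected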

When ICL Fails: Common Pitfalls

Failure Mode 1: Task Requires Memorization

# Bad use of ICL
examples = [
    {"employee_id": "E12345", "manager": "Alice"},
    {"employee_id": "E67890", "manager": "Bob"},
    # ... 1000 more
]

query = "Who is the manager of E12345?"
# Model will guess, not memorize 1000 mappings

Solution: Use RAG (index employee data, retrieve exact match).

Failure Mode 2: Examples Are Too Diverse

# Bad: Examples have no common pattern
examples = [
    {"input": "Translate: Hello", "output": "Bonjour"},
    {"input": "2 + 2 = ?", "output": "4"},
    {"input": "Sentiment: I love it", "output": "Positive"}
]

query = "Translate: Goodbye"
# Model is confused about the task

Solution: Keep examples focused on one task.

Failure Mode 3: Order Sensitivity

# ICL can be sensitive to example order
examples_v1 = [easy, medium, hard]  # Accuracy: 85%
examples_v2 = [hard, medium, easy]  # Accuracy: 78%

# Recency bias: last example has more influence

Solution:

  • Put most complex/representative example last
  • Or: Test multiple orderings, pick best
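A small sketch of the second option, scoring a handful of orderings on a held-out set. Here `evaluate` is a placeholder for building the prompt in that order and measuring accuracy on validation queries.

import itertools
import random

def best_example_order(examples, evaluate, max_orders=6):
    """Try several permutations of the few-shot examples and keep the best.

    `evaluate(ordered_examples) -> accuracy` is a placeholder for your own
    validation loop (build the prompt in that order, score held-out queries).
    """
    all_orders = list(itertools.permutations(examples))
    candidates = random.sample(all_orders, min(max_orders, len(all_orders)))

    scored = [(evaluate(list(order)), list(order)) for order in candidates]
    best_score, best_order = max(scored, key=lambda pair: pair[0])
    return best_order, best_score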

Prompt Caching for ICL (Cost Optimization)

The Problem: Few-shot prompts are long → expensive.

The Solution (GPT-5.4, Claude 4.6):

  • Cache static prompt prefix
  • Pay 90% less for cached tokens

Example:

from openai import OpenAI

client = OpenAI()

def cached_few_shot_classifier(query):
    # Static examples (will be cached)
    system_prompt = """You are a sentiment classifier.

Examples:
Input: Great product, very happy!
Output: POSITIVE

Input: Terrible quality, broke immediately.
Output: NEGATIVE

Input: It's okay, nothing special.
Output: NEUTRAL"""

    # With prompt caching, this static system prompt is reused across calls
    # (providers typically require the cached prefix to exceed a minimum length)
    response = client.chat.completions.create(
        model="gpt-5.2-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Input: {query}\nOutput:"}
        ]
    )

    return response.choices[0].message.content

# Cost savings:
# First call: 500 tokens × $1.75/1M = $0.000875
# Subsequent calls (cached): 500 tokens × $0.175/1M = $0.0000875
# 90% savings!

Interview Insight: "For production, I'd use static few-shot examples in the system prompt with prompt caching enabled. This reduces cost by 90% while maintaining quality."


Advanced: Instruction-Tuned Models vs ICL

Key Difference:

| Model Type | ICL Performance | Why |
|---|---|---|
| Base Model (GPT-2, Llama 3 base) | Requires many examples (5-10) | Not trained to follow instructions |
| Instruction-Tuned (GPT-5.4, Claude 4.6) | Works with 1-3 examples | Trained on instruction-following |

Code Example:

# Base model: needs more examples
base_prompt = """Question: What is 2+2?
Answer: 4

Question: What is 3+3?
Answer: 6

Question: What is 5+5?
Answer: 10

Question: What is 7+7?
Answer:"""

# Instruction-tuned: needs fewer
instruct_prompt = """Answer the math question.

Question: 2+2 = ?
Answer: 4

Question: 7+7 = ?
Answer:"""

# Instruction-tuned models extract the pattern faster

Interview Takeaway: "Modern instruction-tuned models (GPT-5, Claude 4.6) need fewer examples than base models. This is why few-shot learning became practical for production."


Key Takeaways for Interviews

  • ✅ ICL mechanism: Induction heads form during pretraining and apply patterns from the context at inference
  • ✅ Decision matrix: ICL for new tasks, RAG for knowledge, fine-tuning for stable tasks
  • ✅ Example selection: Use retrieval (similarity-based) for best accuracy
  • ✅ Cost optimization: Static examples + prompt caching = 90% savings
  • ✅ Context limits: 128K tokens fit hundreds of examples, but returns diminish after 5-10 good ones
  • ✅ Instruction-tuned: GPT-5/Claude need 1-3 examples vs 5-10 for base models

Next: Learn how to use system prompts to constrain behavior and prevent jailbreaks in Lesson 3.

No spam. Unsubscribe anytime.