Prompt Engineering for Interviews
In-Context Learning Mechanics
Why This Matters for Interviews
LLM engineer interviews at OpenAI, Anthropic, Meta, and Google routinely ask:
- "Explain how in-context learning works" (mechanism, not just definition)
- "When would you use ICL vs RAG vs fine-tuning?" (decision framework)
- "How do you optimize few-shot example selection?" (retrieval strategies)
Real Interview Question (Meta L5):
"You're building a customer support chatbot. It needs to learn from 10,000 past conversations. Would you use in-context learning, RAG, or fine-tuning? Walk me through your decision process and the trade-offs."
What is In-Context Learning?
Definition: LLMs learn from examples in the prompt without updating model weights.
The Surprising Discovery (GPT-3, 2020):
- Give GPT-3 a few examples of translation → it translates
- Give it examples of code → it writes code
- Give it examples of sentiment analysis → it classifies sentiment
All without a single gradient update.
Why This Matters:
- ✅ No training required: Deploy new tasks instantly
- ✅ Minimal data needed: Just a few examples (2-8)
- ✅ Flexible: Change behavior by changing examples
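Here is a minimal sketch of the idea, assuming the OpenAI Python client; the prompt content is illustrative and the model name simply follows the examples used later in this lesson:
from openai import OpenAI

client = OpenAI()

few_shot_prompt = """Classify the sentiment of each review.

Review: The battery lasts forever, love it.
Sentiment: POSITIVE

Review: Stopped working after two days.
Sentiment: NEGATIVE

Review: Shipping was slow but the product is decent.
Sentiment:"""

response = client.chat.completions.create(
    model="gpt-5.2-mini",  # illustrative; any chat model works
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)
# The model completes the pattern (e.g. "NEUTRAL") from two examples in the prompt, with no gradient update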
How Does ICL Actually Work?
The Mechanism (Research Perspective)
Hypothesis 1: Pattern Matching (Shallow)
- Model recognizes surface patterns in examples
- Problem: Doesn't explain why GPT-5 can do unseen tasks
Hypothesis 2: Induction Heads (Mechanistic Interpretability)
- Specific attention heads learn to copy patterns from context
- "Induction heads" in early-to-middle layers detect and repeat patterns from earlier in the context
- Evidence: Ablating these heads destroys ICL ability
Hypothesis 3: Meta-Learning During Pretraining
- Model sees millions of "task examples" in pretraining
- Learns a general algorithm for "task inference from examples"
- ICL is implicit meta-learning
Interview Answer:
"ICL works through induction heads - attention circuits that detect patterns in the context and apply them to new inputs. During pretraining on trillions of tokens, the model encounters countless implicit few-shot scenarios (e.g., a Wikipedia article defining a term, then using it). This trains the model to infer tasks from examples. At inference, giving 3-5 examples activates these induction heads to apply the pattern to your query."
Code Example - Visualizing Attention on Examples:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def visualize_icl_attention(model_name="gpt2"):
    """
    Show which parts of the prompt GPT attends to during ICL.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)

    # Few-shot prompt
    prompt = """Translate English to French:
English: Hello
French: Bonjour
English: Goodbye
French: Au revoir
English: Thank you
French:"""

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    attentions = outputs.attentions  # Tuple of num_layers tensors, each (batch, num_heads, seq_len, seq_len)

    # Examine the last layer, last head as a simple illustration
    # (identified induction heads usually sit in earlier/middle layers)
    last_layer_attn = attentions[-1][0, -1, :, :]  # (seq_len, seq_len)

    # Where does the model attend when predicting the French translation?
    last_token_attn = last_layer_attn[-1, :]  # Attention from the last token

    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    print("Top attended tokens (model is copying patterns from):")
    top_indices = last_token_attn.argsort(descending=True)[:5]
    for idx in top_indices.tolist():
        print(f"  {tokens[idx]}: {last_token_attn[idx].item():.3f}")

# The top-attended tokens typically include the French examples ("Bonjour", "Au revoir"),
# consistent with the model copying the translation pattern from the context
ICL vs RAG vs Fine-Tuning: The Decision Matrix
This is the most important framework for interviews.
Comparison Table
| Dimension | In-Context Learning (ICL) | RAG (Retrieval-Augmented) | Fine-Tuning |
|---|---|---|---|
| Setup Time | Instant (just prompt) | Hours (build index) | Days-weeks (training) |
| Data Required | 2-8 examples | 100s-1000s documents | 1000s-100Ks examples |
| Cost per Request | $$$ (larger prompts) | $$ (embedding + LLM) | $ (inference only) |
| Customization Depth | Surface (format, style) | Medium (knowledge) | Deep (behavior, domain) |
| Knowledge Updates | Instant (change examples) | Fast (update index) | Slow (retrain) |
| Context Window Limit | Yes (fits in prompt?) | Partial (retrieval filters) | No |
| Best For | New tasks, format learning | Knowledge-intensive Q&A | Domain adaptation, behavior |
Decision Framework (Interview Gold)
Use ICL when:
- ✅ Task is new/experimental (testing ideas)
- ✅ Need instant deployment (no training time)
- ✅ Few examples available (2-10)
- ✅ Task changes frequently (A/B testing prompts)
Use RAG when:
- ✅ Need external knowledge (docs, databases)
- ✅ Knowledge updates frequently (news, inventory)
- ✅ Questions require multi-hop reasoning over documents
- ✅ Want to cite sources (transparency)
Use Fine-Tuning when:
- ✅ Need consistent behavior on a specific task
- ✅ Task is well-defined and stable
- ✅ Have 1000+ high-quality examples
- ✅ Want lower latency and cost per request (smaller model)
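The checklists above can be condensed into a rough first-pass heuristic; the sketch below uses illustrative thresholds and is not a hard rule:
def choose_adaptation_strategy(num_examples, needs_external_knowledge,
                               knowledge_changes_often, task_is_stable,
                               high_request_volume):
    """Rough first pass at ICL vs RAG vs fine-tuning (thresholds are illustrative)."""
    strategies = []
    if needs_external_knowledge or knowledge_changes_often:
        strategies.append("RAG")           # fresh, citable knowledge
    if task_is_stable and num_examples >= 1000 and high_request_volume:
        strategies.append("fine-tuning")   # consistent behavior, lower per-request cost
    if not strategies or num_examples < 10:
        strategies.append("few-shot ICL")  # instant to deploy, easy to iterate
    return strategies

# The Meta scenario: 10K conversations, changing policies, high volume
print(choose_adaptation_strategy(10_000, True, True, True, True))
# ['RAG', 'fine-tuning'] - pointing toward the hybrid answer below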
Example Interview Response:
Interviewer: "10,000 customer support conversations. ICL, RAG, or fine-tune?"
Strong Answer:
Analysis:
- Data size: 10,000 examples → enough for fine-tuning
- Task stability: Customer support is ongoing → stable task
- Knowledge updates: Policies change → need fresh info
Decision: Hybrid approach (production-ready)
1. RAG for knowledge:
- Index all 10K conversations + company docs
- Retrieve top-3 similar past conversations
- Retrieve relevant policy docs
- This handles: "What's our return policy?" (cite latest policy)
2. Few-shot ICL for format:
- 2-3 examples of ideal response style
- "Be concise, empathetic, always offer next steps"
- This ensures: Consistent tone
3. Fine-tuning (optional, if budget allows):
- Fine-tune GPT-5.2-mini on 10K conversations
- Lower cost per request ($0.15 vs $1.75 per 1M input tokens)
- Faster (less prompt overhead)
- Use for high-volume (>100K requests/month)
Why NOT pure ICL?
- 10K conversations won't fit in the context window (at a few hundred tokens per conversation, only a few hundred fit in 128K tokens)
- Expensive: 2000 example tokens per request × 1M requests/month = $3,500/month just for examples
Why NOT pure fine-tuning?
- Can't update knowledge without retraining
- Might hallucinate outdated policies
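A sketch of how the hybrid answer could come together in one prompt builder. The retriever objects and their retrieve(query, k) interface are assumptions (similar to the FewShotRetriever shown later in this lesson), and both retrievers are assumed to return plain text snippets:
def build_support_prompt(user_message, conversation_retriever, policy_retriever, style_examples):
    past = conversation_retriever.retrieve(user_message, k=3)   # similar past conversations (RAG)
    policies = policy_retriever.retrieve(user_message, k=2)     # current policy snippets (RAG)

    prompt = "You are a support agent. Be concise, empathetic, and always offer next steps.\n\n"
    prompt += "Style examples:\n"                               # few-shot ICL for tone and format
    for ex in style_examples[:2]:
        prompt += f"Customer: {ex['input']}\nAgent: {ex['output']}\n\n"
    prompt += "Relevant policies:\n" + "\n".join(policies) + "\n\n"
    prompt += "Similar past conversations:\n" + "\n\n".join(past) + "\n\n"
    prompt += f"Customer: {user_message}\nAgent:"
    return prompt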
Optimizing Few-Shot Example Selection
The Problem: With 10,000 examples, which 3-5 do you include in the prompt?
Strategy 1: Random Sampling (Baseline)
import random

def random_few_shot(examples, k=3):
    """Simplest approach: random k examples."""
    return random.sample(examples, k)

# Pros: Fast, no bias
# Cons: Might pick irrelevant examples
Strategy 2: Diversity Sampling
Goal: Cover different input patterns.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer
import numpy as np

def diversity_few_shot(examples, k=3):
    """
    Select k diverse examples using clustering.
    """
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Embed all examples
    texts = [ex['input'] for ex in examples]
    embeddings = model.encode(texts)

    # Cluster into k groups
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(embeddings)

    # Select the example closest to each cluster center
    selected = []
    for i in range(k):
        cluster_points = np.where(kmeans.labels_ == i)[0]
        center = kmeans.cluster_centers_[i]
        # Find the closest point to the center
        distances = np.linalg.norm(embeddings[cluster_points] - center, axis=1)
        closest_idx = cluster_points[distances.argmin()]
        selected.append(examples[closest_idx])

    return selected
# Example
examples = [
    {"input": "Great product!", "output": "POSITIVE"},
    {"input": "Terrible quality", "output": "NEGATIVE"},
    {"input": "Okay I guess", "output": "NEUTRAL"},
    {"input": "Love it so much!", "output": "POSITIVE"},
    {"input": "Not what I expected", "output": "NEGATIVE"},
    {"input": "It's fine", "output": "NEUTRAL"},
]

diverse_examples = diversity_few_shot(examples, k=3)
# Typically returns one POSITIVE, one NEGATIVE, and one NEUTRAL example (one per cluster)
Strategy 3: Similarity-Based Retrieval (Best for Production)
Goal: Select examples most similar to the current query.
from sentence_transformers import SentenceTransformer, util
import torch

class FewShotRetriever:
    """
    Retrieve the most relevant few-shot examples for each query.
    A common pattern in production few-shot systems.
    """
    def __init__(self, example_pool):
        self.examples = example_pool
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        # Pre-compute embeddings (do once, cache)
        self.embeddings = self.model.encode(
            [ex['input'] for ex in example_pool],
            convert_to_tensor=True
        )

    def retrieve(self, query, k=3):
        """Retrieve the k most similar examples to the query."""
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        # Compute cosine similarity
        similarities = util.cos_sim(query_embedding, self.embeddings)[0]
        # Get top-k
        top_k = torch.topk(similarities, k=k)
        selected = []
        for idx in top_k.indices:
            selected.append(self.examples[idx.item()])
        return selected
# Usage
retriever = FewShotRetriever(examples)
query = "I'm disappointed with this purchase"
relevant_examples = retriever.retrieve(query, k=3)
# Likely returns: "Terrible quality", "Not what I expected", "Okay I guess"
# (all negative/neutral, similar sentiment)
Cost Analysis:
For 10,000-example pool:
- Embedding storage: 10,000 × 384 dimensions × 4 bytes ≈ 15 MB (negligible)
- Retrieval latency: <10ms with FAISS/Annoy
- API cost: $0 (local similarity search)
Interview Insight: Mentioning "we'd use sentence-transformers for retrieval, cache embeddings, and use FAISS for scale" shows production thinking.
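A sketch of that production setup, reusing the examples pool and sentence-transformers model from above; FAISS exact inner-product search over L2-normalized embeddings is equivalent to cosine similarity:
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [ex['input'] for ex in examples]                # the example pool
embeddings = model.encode(texts).astype('float32')
faiss.normalize_L2(embeddings)                          # normalize so inner product = cosine similarity

index = faiss.IndexFlatIP(embeddings.shape[1])          # exact search; swap in IVF/HNSW indexes at larger scale
index.add(embeddings)

query_vec = model.encode(["I'm disappointed with this purchase"]).astype('float32')
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 3)
relevant_examples = [examples[i] for i in ids[0]]       # same result as FewShotRetriever, millisecond latency at 10K+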
Strategy 4: Hard Example Mining
Goal: Include challenging examples that teach edge cases.
import numpy as np

def hard_example_mining(examples, model, k=3):
    """
    Select examples the model currently gets wrong.
    Improves performance on hard cases.
    """
    scores = []
    for ex in examples:
        # Test the model zero-shot (predict_zero_shot is a placeholder for your inference call)
        prediction = model.predict_zero_shot(ex['input'])
        # Difficulty score: 0 if the model is already correct, 1 if it is wrong
        correct = (prediction == ex['output'])
        scores.append(1 - correct)  # Higher score = harder

    # Select the top-k hardest examples
    hard_indices = np.argsort(scores)[-k:]
    return [examples[i] for i in hard_indices]

# This is adaptive few-shot learning:
# the model gets examples targeted at its current weaknesses
Dynamic vs Static Few-Shot Prompts
Static Few-Shot (Simple)
STATIC_EXAMPLES = [
    {"input": "Great!", "output": "POSITIVE"},
    {"input": "Bad.", "output": "NEGATIVE"},
    {"input": "Okay.", "output": "NEUTRAL"}
]

def static_prompt(query):
    prompt = "Classify sentiment:\n\n"
    for ex in STATIC_EXAMPLES:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {query}\nOutput:"
    return prompt
Pros:
- ✅ Simple, deterministic
- ✅ Can cache prompt prefix (90% cost reduction with GPT-5.2 prompt caching)
Cons:
- ❌ Same examples for all queries (not optimal)
Dynamic Few-Shot (Production)
def dynamic_prompt(query, retriever, k=3):
    """Retrieve relevant examples for each query."""
    examples = retriever.retrieve(query, k=k)
    prompt = "Classify sentiment:\n\n"
    for ex in examples:
        prompt += f"Input: {ex['input']}\nOutput: {ex['output']}\n\n"
    prompt += f"Input: {query}\nOutput:"
    return prompt
Pros:
- ✅ Better accuracy (relevant examples)
- ✅ Handles edge cases
Cons:
- ❌ More complex
- ❌ Harder to cache (the example block changes per query)
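The two approaches also combine well: serve most queries from the cheap, cacheable static prompt and fall back to dynamic retrieval only when confidence is low, as in the trade-off below. A sketch, where classify_static and classify_dynamic are assumed wrappers returning a (label, confidence) pair; in practice the confidence could come from token logprobs or a self-reported score:
def classify_with_fallback(query, classify_static, classify_dynamic, threshold=0.9):
    # First pass: static few-shot prompt (cache-friendly, cheap)
    label, confidence = classify_static(query)
    if confidence >= threshold:
        return label
    # Low-confidence fallback: retrieve query-specific examples and retry
    label, _ = classify_dynamic(query)
    return label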
Interview Trade-off:
"I'd start with static few-shot + prompt caching for cost optimization. If accuracy isn't good enough, upgrade to dynamic retrieval for the bottom 10% of queries (based on confidence scores)."
Context Length Considerations
The Math:
For GPT-5.2 (128K context window):
- 1 token ≈ 4 characters
- 128K tokens ≈ 512,000 characters ≈ 100-150 pages of text
Few-Shot Budget:
def calculate_few_shot_budget(max_context=128000, query_tokens=100,
                              output_tokens=500, safety_margin=0.1):
    """
    Return the number of tokens available for few-shot examples.
    """
    # Reserve space for the query and the output
    reserved = query_tokens + output_tokens
    # Safety margin for system prompt, formatting
    available = max_context * (1 - safety_margin) - reserved
    return available

# For GPT-5.2
available = calculate_few_shot_budget()
print(f"Available tokens for examples: {available:,.0f}")

# If each example is ~200 tokens (input + output + formatting):
max_examples = available // 200
print(f"Max few-shot examples: {max_examples:,.0f}")
# Output: ~570 examples max (but diminishing returns after 5-10)
Interview Question: "You have 1000 examples but only 5 fit in the prompt. How do you choose?"
Answer:
- Use retrieval to select most relevant 5 (similarity-based)
- Or: Stratified sampling (1-2 from each category/cluster)
- Or: Meta-learning approach (learn which examples generalize best)
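A sketch of the stratified option, assuming each example carries an 'output' label as in the earlier snippets:
import random
from collections import defaultdict

def stratified_few_shot(examples, k=5):
    """Spread the few-shot slots across output categories (round-robin)."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex['output']].append(ex)

    selected = []
    while len(selected) < k and any(by_label.values()):
        for label in list(by_label):
            if by_label[label] and len(selected) < k:
                pick = random.randrange(len(by_label[label]))
                selected.append(by_label[label].pop(pick))
    return selected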
When ICL Fails: Common Pitfalls
Failure Mode 1: Task Requires Memorization
# Bad use of ICL
examples = [
    {"employee_id": "E12345", "manager": "Alice"},
    {"employee_id": "E67890", "manager": "Bob"},
    # ... 1000 more
]
query = "Who is the manager of E12345?"
# Model will guess, not memorize 1000 mappings
Solution: Use RAG (index employee data, retrieve exact match).
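A minimal sketch of that fix, reusing the employee records above: index them once, retrieve the single matching fact, and put only that fact in the prompt.
# Build the index once (a dict here; a real system would use a database or vector store)
employee_db = {ex["employee_id"]: ex["manager"] for ex in examples}

def manager_context(employee_id):
    """Return the one retrieved fact to place in the prompt, instead of 1000 in-context rows."""
    manager = employee_db.get(employee_id)
    return f"{employee_id} reports to {manager}." if manager else "No record found."

print(manager_context("E12345"))  # "E12345 reports to Alice." - exact lookup, no guessing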
Failure Mode 2: Examples Are Too Diverse
# Bad: Examples have no common pattern
examples = [
    {"input": "Translate: Hello", "output": "Bonjour"},
    {"input": "2 + 2 = ?", "output": "4"},
    {"input": "Sentiment: I love it", "output": "Positive"}
]
query = "Translate: Goodbye"
# Model is confused about the task
Solution: Keep examples focused on one task.
Failure Mode 3: Order Sensitivity
# ICL can be sensitive to example order
examples_v1 = [easy, medium, hard]  # Accuracy: 85% (illustrative)
examples_v2 = [hard, medium, easy]  # Accuracy: 78% (illustrative)
# Recency bias: last example has more influence
Solution:
- Put most complex/representative example last
- Or: Test multiple orderings, pick best
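A sketch of the second option: score a handful of orderings on a small dev set and keep the best. Here eval_fn is an assumed helper that builds the prompt from the ordered examples and returns dev-set accuracy:
from itertools import permutations

def best_example_order(examples, eval_fn, max_orderings=6):
    best_order, best_score = None, -1.0
    for order in list(permutations(examples))[:max_orderings]:
        score = eval_fn(list(order))      # prompt with this ordering, measure dev accuracy
        if score > best_score:
            best_order, best_score = list(order), score
    return best_order, best_score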
Prompt Caching for ICL (Cost Optimization)
The Problem: Few-shot prompts are long → expensive.
The Solution (GPT-5.2, Claude 4.5):
- Cache static prompt prefix
- Pay 90% less for cached tokens
Example:
from openai import OpenAI

client = OpenAI()

def cached_few_shot_classifier(query):
    # Static examples (will be cached)
    system_prompt = """You are a sentiment classifier.
Examples:
Input: Great product, very happy!
Output: POSITIVE
Input: Terrible quality, broke immediately.
Output: NEGATIVE
Input: It's okay, nothing special.
Output: NEUTRAL"""

    # This prompt prefix will be cached after the first call
    response = client.chat.completions.create(
        model="gpt-5.2-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Input: {query}\nOutput:"}
        ]
    )
    return response.choices[0].message.content

# Cost savings:
# First call: 500 tokens × $1.75/1M = $0.000875
# Subsequent calls (cached): 500 tokens × $0.175/1M = $0.0000875
# 90% savings!
Interview Insight: "For production, I'd use static few-shot examples in the system prompt with prompt caching enabled. This reduces cost by 90% while maintaining quality."
Advanced: Instruction-Tuned Models vs ICL
Key Difference:
| Model Type | ICL Performance | Why |
|---|---|---|
| Base Model (GPT-2, Llama 3 base) | Requires many examples (5-10) | Not trained to follow instructions |
| Instruction-Tuned (GPT-5.2, Claude 4.5) | Works with 1-3 examples | Trained on instruction-following |
Code Example:
# Base model: needs more examples
base_prompt = """Question: What is 2+2?
Answer: 4
Question: What is 3+3?
Answer: 6
Question: What is 5+5?
Answer: 10
Question: What is 7+7?
Answer:"""
# Instruction-tuned: needs fewer
instruct_prompt = """Answer the math question.
Question: 2+2 = ?
Answer: 4
Question: 7+7 = ?
Answer:"""
# Instruction-tuned models extract the pattern faster
Interview Takeaway: "Modern instruction-tuned models (GPT-5, Claude 4.5) need fewer examples than base models. This is why few-shot learning became practical for production."
Key Takeaways for Interviews
- ✅ ICL mechanism: Induction heads form during pretraining and copy patterns from the context at inference
- ✅ Decision matrix: ICL for new tasks, RAG for knowledge, fine-tuning for stable tasks
- ✅ Example selection: Use retrieval (similarity-based) for best accuracy
- ✅ Cost optimization: Static examples + prompt caching = 90% savings
- ✅ Context limits: 128K tokens fits hundreds of examples, but returns diminish after 5-10
- ✅ Instruction-tuned: GPT-5/Claude need 1-3 examples vs 5-10 for base models
Next: Learn how to use system prompts to constrain behavior and prevent jailbreaks in Lesson 3.
:::