LLM Fundamentals for Interviews
Token Economics & Cost Optimization
Why This Matters for Interviews
OpenAI, Anthropic, and Meta explicitly ask about token economics in senior (L5-L6) engineering interviews:
Real Interview Question (Anthropic):
"You're building a customer support chatbot. Each conversation averages 20 messages. How would you optimize token costs while maintaining quality? Walk me through your calculation."
Real Interview Question (Meta):
"Your LLM API bill is $50K/month. You have 1M users. How would you reduce costs by 50% without degrading user experience?"
Understanding tokens = money is critical for production LLM engineering.
What is a Token? (The Unit of Computation)
Not a word! A token is a subword unit.
Examples (GPT-5.2 BPE tokenizer):
| Text | Tokens | Count |
|---|---|---|
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-5.2" | ["G", "PT", "-", "5", ".", "2"] | 6 |
| "مرحبا" (Arabic) | ["م", "ر", "ح", "ب", "ا"] | 5 |
| "你好" (Chinese) | ["你", "好"] | 2 |
Key Rule: 1 token ≈ 4 characters in English, but varies by language.
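You can sanity-check counts like these with OpenAI's open-source tiktoken library. A minimal sketch; cl100k_base is used here as a stand-in encoding (an assumption, pick whichever encoding matches your target model), so exact splits may differ slightly from the table above:
import tiktoken

# cl100k_base is a stand-in encoding; use the one that matches your model
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, world!", "Tokenization", "GPT-5.2"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r}: {len(token_ids)} tokens -> {pieces}")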
BPE Tokenization (Byte-Pair Encoding)
How GPT-5.2, Claude 4.5, Llama 3.3 tokenize:
Algorithm:
- Start with character-level vocabulary
- Find most frequent pair of tokens
- Merge pair into new token
- Repeat until vocabulary size reached (typically 50K-200K)
Example Training:
Input text: "low low low low low lower lower newest newest"
(word counts: "low" ×5, "lower" ×2, "newest" ×2)
Iteration 1:
Most frequent pair: ('l', 'o') with 7 occurrences (tied with ('o', 'w'); ties broken arbitrarily) → merge to 'lo'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']
Iteration 2:
Most frequent pair: ('lo', 'w') with 7 occurrences → merge to 'low'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']
Iteration 3:
Most frequent pair: ('e', 'w') → merge to 'ew'
...continues...
Final (after enough merges): "low" becomes 1 token, "lower" becomes 2 tokens ('low' + 'er'), "newest" becomes 2 tokens ('new' + 'est')
Code Implementation:
def bpe_tokenize(text, merges):
    """
    BPE tokenization (simplified).
    Used in: GPT-5.2, Llama 3.3
    """
    # Start with characters
    tokens = list(text)
    # Apply merge rules in the order they were learned
    for pair, new_token in merges:
        i = 0
        while i < len(tokens) - 1:
            if (tokens[i], tokens[i + 1]) == pair:
                tokens = tokens[:i] + [new_token] + tokens[i + 2:]
            else:
                i += 1
    return tokens

# Example usage:
merges = [
    (('l', 'o'), 'lo'),
    (('lo', 'w'), 'low'),
    (('e', 'w'), 'ew'),
    (('n', 'ew'), 'new'),
    (('e', 's'), 'es'),
    (('es', 't'), 'est'),
    (('new', 'est'), 'newest'),
]
text = "lowest newest"
tokens = bpe_tokenize(text, merges)
# Result: ['low', 'est', ' ', 'newest']
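The function above only applies merges that were already learned. For completeness, here is a minimal sketch of the training loop itself (learning merges from a whitespace-split corpus); this is the textbook algorithm, not the exact implementation behind any production tokenizer:
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a list of words (simplified sketch)."""
    # Each word starts as a tuple of characters, weighted by frequency
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair into a single new symbol
        (a, b), _count = pair_counts.most_common(1)[0]
        merges.append(((a, b), a + b))
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Learn merges on the toy corpus from the training example above
corpus = "low low low low low lower lower newest newest".split()
print(learn_bpe_merges(corpus, num_merges=5))
# First two merges: (('l', 'o'), 'lo'), (('lo', 'w'), 'low'); later ones depend on tie-breaking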
Tokenizer Comparison (January 2026)
| Model | Tokenizer | Vocab Size | Efficiency | Notes |
|---|---|---|---|---|
| GPT-5.2 | tiktoken (cl100k_base) | 100K | ⭐⭐⭐⭐⭐ | Optimized for code + multilingual |
| Claude 4.5 | claude-v1 | 100K | ⭐⭐⭐⭐⭐ | Similar to GPT-5.2 |
| Gemini 3 Pro | SentencePiece | 256K | ⭐⭐⭐⭐ | Larger vocab, fewer tokens per text |
| Llama 3.3 | tiktoken (Llama) | 128K | ⭐⭐⭐⭐⭐ | Open-source, multilingual |
| DeepSeek R1 | DeepSeek-tokenizer | 100K | ⭐⭐⭐⭐ | Chinese-optimized |
Interview Insight: Mention that modern tokenizers use 100K+ vocab (vs. 50K in GPT-2) for better efficiency across languages.
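A quick way to see the vocab-size effect is to compare a ~50K-vocab encoding with a ~100K-vocab one from tiktoken on the same text (these are older OpenAI encodings, used here purely as an illustration):
import tiktoken

text = "Internationalization of naïve café menus: 国际化"

# GPT-2's ~50K-vocab encoding vs. the ~100K-vocab cl100k_base encoding
for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name}: vocab={enc.n_vocab}, tokens={len(enc.encode(text))}")
# The larger vocabulary typically produces fewer tokens for the same text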
Context Window Evolution
Historical Progression
| Year | Model | Context Window | What Changed |
|---|---|---|---|
| 2020 | GPT-3 | 2K tokens | Baseline |
| 2023 | GPT-4 Turbo | 128K tokens | 64x increase! |
| 2024 | Claude 3 Opus | 200K tokens | Production-ready |
| 2025 | GPT-5 | 128K-1M tokens | Mixture of window sizes |
| 2026 | GPT-5.2 | 128K tokens (standard) | Optimized for speed |
| 2026 | Claude Sonnet 4.5 | 200K (1M beta) | Extended context |
| 2026 | Gemini 3 Pro | 1M tokens | Largest production window |
Interview Question: "Why do we need context windows beyond 100K tokens?"
Strong Answer:
"Real-world applications need long context:
- Code: Entire repository context (50K-200K tokens)
- Legal: Full contracts + amendments (100K+ tokens)
- Research: Multiple papers for literature review (200K+ tokens)
- Customer support: Full conversation history across sessions (50K+ tokens)
Larger windows reduce the need for RAG/retrieval, improving latency and accuracy."
Token Pricing (January 2026)
Pricing per 1M Tokens
| Model | Input | Output | Cached Input | Use Case |
|---|---|---|---|---|
| GPT-5.2 | $1.75 | $14.00 | $0.175 (90% off) | General-purpose |
| GPT-4 Turbo | $5.00 | $15.00 | — | Legacy |
| Claude Opus 4.5 | $15.00 | $75.00 | — | Premium reasoning |
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.60 (80% off >200K) | Production |
| Claude Haiku 3.5 | $1.00 | $5.00 | — | High-volume |
| Gemini 3 Pro | $2.50 | $10.00 | $0.50 (80% off) | Cost-effective |
| DeepSeek R1 | $0.55 | $2.19 | — | Budget option |
Key Observations:
- Output tokens cost 5-10x input tokens (generation is expensive)
- Cached inputs save 80-90% (critical for chatbots with system prompts)
- DeepSeek R1 is 3x cheaper than GPT-5.2 (MIT license, self-hostable)
Cost Calculation Examples
Example 1: Customer Support Chatbot
Scenario:
- System prompt: 500 tokens (cached)
- Average conversation: 10 messages
- Average message: 50 tokens user + 200 tokens assistant
- 1,000 conversations/day
Cost Calculation (GPT-5.2):
def calculate_chatbot_cost(
    system_prompt_tokens=500,
    messages_per_conversation=10,
    user_tokens_per_message=50,
    assistant_tokens_per_message=200,
    conversations_per_day=1000,
    model="gpt-5.2",
):
    """
    Calculate monthly chatbot costs.
    """
    # Pricing (per 1M tokens)
    pricing = {
        "gpt-5.2": {
            "input": 1.75,
            "output": 14.00,
            "cached_input": 0.175,
        },
        "claude-sonnet-4.5": {
            "input": 3.00,
            "output": 15.00,
            "cached_input": 0.60,
        },
    }
    p = pricing[model]

    # Tokens per conversation
    cached_system_prompt = system_prompt_tokens  # Once per conversation
    input_tokens = messages_per_conversation * user_tokens_per_message
    output_tokens = messages_per_conversation * assistant_tokens_per_message

    # Cost per conversation
    cost_per_conversation = (
        (cached_system_prompt / 1_000_000) * p["cached_input"]  # System prompt (cached)
        + (input_tokens / 1_000_000) * p["input"]                # User messages
        + (output_tokens / 1_000_000) * p["output"]              # Assistant messages
    )

    # Monthly cost (30 days)
    monthly_cost = cost_per_conversation * conversations_per_day * 30

    return {
        "cost_per_conversation": cost_per_conversation,
        "daily_cost": cost_per_conversation * conversations_per_day,
        "monthly_cost": monthly_cost,
        "tokens_per_conversation": cached_system_prompt + input_tokens + output_tokens,
    }
# Calculate
result = calculate_chatbot_cost()
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")
Output:
Cost per conversation: $0.0290
Monthly cost: $868.88
Breakdown:
- Cached system prompt (500 tokens): ~$0.0001 per conversation
- Input tokens (500 tokens): ~$0.0009 per conversation
- Output tokens (2,000 tokens): $0.0280 per conversation
Optimization Strategies:
- Cache the system prompt (90% discount): ✅ already applied above
- Reduce output tokens by 20% (assistant_tokens_per_message=160 instead of 200): output cost drops to $0.0224 per conversation, bringing the monthly bill to roughly $701 (~19% savings)
- Route simple queries to Claude Haiku 3.5: if ~70% of queries are simple (Haiku) and 30% stay on GPT-5.2, the blended cost falls to roughly $492/month (~43% savings); see the sketch below
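A quick sanity check of that routing estimate, using the pricing table above and assuming Haiku gets no cache discount and per-conversation token counts stay unchanged (a back-of-the-envelope sketch, not a production calculator):
# 70/30 routing estimate (assumption: Haiku has no cache discount; token counts unchanged)
gpt52_cost = 500 / 1e6 * 0.175 + 500 / 1e6 * 1.75 + 2000 / 1e6 * 14.00  # ≈ $0.0290/conversation
haiku_cost = 1000 / 1e6 * 1.00 + 2000 / 1e6 * 5.00                      # ≈ $0.0110/conversation

blended = 0.7 * haiku_cost + 0.3 * gpt52_cost
print(f"Blended: ${blended:.4f}/conversation, ${blended * 1000 * 30:,.2f}/month")
# → roughly $0.0164/conversation and $492/month (vs. ~$869 all on GPT-5.2)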
Example 2: Code Generation Tool
Scenario:
- Repository context: 50K tokens (cached)
- User query: 100 tokens
- Generated code: 500 tokens
- 10,000 requests/month
Cost Calculation:
def calculate_code_gen_cost(
    repo_context_tokens=50_000,
    user_query_tokens=100,
    generated_code_tokens=500,
    requests_per_month=10_000,
    model="gpt-5.2",
):
    pricing = {
        "gpt-5.2": {"input": 1.75, "output": 14.00, "cached_input": 0.175},
    }
    p = pricing[model]

    cost_per_request = (
        (repo_context_tokens / 1_000_000) * p["cached_input"]  # Cached context
        + (user_query_tokens / 1_000_000) * p["input"]          # Query
        + (generated_code_tokens / 1_000_000) * p["output"]     # Generated code
    )
    monthly_cost = cost_per_request * requests_per_month

    return {
        "cost_per_request": cost_per_request,
        "monthly_cost": monthly_cost,
    }
result = calculate_code_gen_cost()
print(f"Cost per request: ${result['cost_per_request']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")
Output:
Cost per request: $0.0159
Monthly cost: $159.25
Breakdown:
- Cached repo context (50K tokens): $0.0088 (~55% of cost)
- User query (100 tokens): $0.0002
- Generated code (500 tokens): $0.0070 (~44% of cost)
Critical Insight: Even with the 90% caching discount, the cached repo context is still the single largest cost. Optimization: send sparse context (only the relevant files), as in the semantic compression strategy below.
Token Budgeting Strategies
1. Dynamic Context Truncation
def truncate_context(messages, max_tokens=120_000, reserve_for_output=4_000):
    """
    Intelligently truncate conversation history to fit the context window.
    Used in production chatbots.
    """
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-5")
    available_tokens = max_tokens - reserve_for_output

    # Always keep the system prompt + last 3 messages
    system_prompt = messages[0]
    recent_messages = messages[-3:]
    middle_messages = messages[1:-3]

    # Count tokens for the always-kept parts
    system_tokens = len(enc.encode(system_prompt["content"]))
    recent_tokens = sum(len(enc.encode(m["content"])) for m in recent_messages)
    used_tokens = system_tokens + recent_tokens

    # Fill the remaining budget with older messages, most recent first
    included_middle = []
    for message in reversed(middle_messages):
        msg_tokens = len(enc.encode(message["content"]))
        if used_tokens + msg_tokens <= available_tokens:
            included_middle.insert(0, message)
            used_tokens += msg_tokens
        else:
            break

    final_messages = [system_prompt] + included_middle + recent_messages
    return final_messages, used_tokens

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... 100 messages of history ...
]
truncated, token_count = truncate_context(messages, max_tokens=128_000)
print(f"Truncated to {len(truncated)} messages ({token_count} tokens)")
2. Semantic Compression (RAG Hybrid)
Instead of sending entire document (50K tokens), send relevant excerpts (5K tokens):
def semantic_context_selection(query, documents, max_tokens=5000):
    """
    Use embedding similarity to select the most relevant context.
    Reduces token usage by ~90% while maintaining quality.
    """
    from sentence_transformers import SentenceTransformer
    import numpy as np
    import tiktoken

    # Embed query and documents (normalized, so dot product = cosine similarity)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    doc_embeddings = model.encode(documents, normalize_embeddings=True)

    # Compute similarity scores and rank by relevance
    similarities = np.dot(doc_embeddings, query_embedding)
    ranked_indices = np.argsort(similarities)[::-1]

    # Select top docs until the token budget is exhausted
    enc = tiktoken.encoding_for_model("gpt-5")
    selected_docs = []
    total_tokens = 0
    for idx in ranked_indices:
        doc = documents[idx]
        doc_tokens = len(enc.encode(doc))
        if total_tokens + doc_tokens <= max_tokens:
            selected_docs.append(doc)
            total_tokens += doc_tokens
        else:
            break

    return "\n\n".join(selected_docs), total_tokens

# Example:
query = "How do I implement authentication?"
docs = [doc1, doc2, doc3, ...]  # 50K tokens total
context, tokens = semantic_context_selection(query, docs, max_tokens=5000)
# Result: ~5K tokens of the most relevant content (≈90% cost reduction)
3. Model Cascading (Quality vs. Cost)
Route requests to different models based on complexity:
def route_to_model(query):
    """
    Route simple queries to a cheap model, complex ones to an expensive model.
    Reduces average cost per query by ~60%.
    """
    # Use a small model (or heuristic) to classify complexity
    complexity_score = estimate_query_complexity(query)  # 0-1 scale

    if complexity_score < 0.3:
        return "claude-haiku-3.5"   # $1 input / $5 output
    elif complexity_score < 0.7:
        return "gpt-5.2"            # $1.75 input / $14 output
    else:
        return "claude-opus-4.5"    # $15 input / $75 output

def estimate_query_complexity(query):
    """
    Heuristic complexity scoring.
    In production, use a small classifier model.
    """
    complexity_signals = {
        "code": 0.6,
        "debug": 0.6,
        "explain": 0.4,
        "reasoning": 0.9,
        "prove": 0.9,
        "math": 0.8,
        "summarize": 0.3,
    }
    query_lower = query.lower()
    max_complexity = 0.2  # Baseline
    for signal, score in complexity_signals.items():
        if signal in query_lower:
            max_complexity = max(max_complexity, score)
    return max_complexity

# Example:
query1 = "What's the capital of France?"
# → claude-haiku-3.5 (simple factual)
query2 = "Debug this Python code and explain the issue"
# → gpt-5.2 (moderate complexity)
query3 = "Prove that P != NP using complexity theory"
# → claude-opus-4.5 (advanced reasoning)
Context Window Limits & Handling
What Happens When You Exceed Context Window?
GPT-5.2 (128K context):
from openai import OpenAI

client = OpenAI()

# Request with 130K tokens
response = client.chat.completions.create(
    model="gpt-5.2",
    messages=messages,  # 130K tokens
)
# ❌ Error: "This model's maximum context length is 131072 tokens"
Solutions:
- Truncate (naive):
  messages = messages[-50:]  # Keep last 50 messages
  # ⚠️ Loses important context
- Summarize (better):
  # Summarize old messages
  old_summary = summarize_with_llm(messages[:-10])
  new_messages = [
      {"role": "system", "content": f"Previous conversation summary: {old_summary}"},
      *messages[-10:],
  ]
  # ✅ Preserves key information, fits in context
- Sparse Attention (advanced):
  # Use a model with sparse attention (e.g., Longformer-style pattern)
  # Attends to:
  # - System prompt
  # - Last N messages
  # - Important messages (detected via keyword/embedding)
  # ✅ Efficient for very long conversations
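The summarize_with_llm helper used above is not defined in this lesson; here is a minimal sketch, assuming the OpenAI Python client and a gpt-5.2 model id (both assumptions; substitute your own stack):
from openai import OpenAI

client = OpenAI()

def summarize_with_llm(messages, max_tokens=500):
    """Compress older conversation turns into a short summary (sketch)."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    response = client.chat.completions.create(
        model="gpt-5.2",  # assumption: use whatever model you run in production
        max_tokens=max_tokens,
        messages=[
            {"role": "system", "content": "Summarize the conversation below. Keep facts, decisions, and open questions."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content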
Common Interview Questions & Answers
Q1: "Why do output tokens cost more than input tokens?"
Answer:
"Output tokens require autoregressive generation: each token depends on all previous tokens, so a 100-token output takes 100 sequential decode steps. With a KV-cache, each step computes only the new token (attention still reads the cached keys/values); without a cache you would reprocess the growing prefix every step, roughly 1 + 2 + ... + 100 = 5,050 token computations. Input tokens, by contrast, are processed in parallel in a single prefill pass. Sequential decoding also keeps GPUs memory-bandwidth-bound and ties up capacity longer, which is why providers price output tokens 5-10x higher."
Q2: "How would you reduce costs for a chatbot with a 500-token system prompt?"
Strong Answer:
"Use prompt caching. With GPT-5.2's 90% cache discount:
- Uncached: 500 tokens × $1.75/1M = $0.000875 per request
- Cached: 500 tokens × $0.175/1M = $0.0000875 per request (10x cheaper)
For 1M requests/month, savings = $787.50/month. Also ensure the system prompt is static across requests - any change invalidates the cache."
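The arithmetic, spelled out using the prices from the table above:
# 500-token system prompt at GPT-5.2 prices, 1M requests/month
uncached = 500 / 1e6 * 1.75    # $0.000875 per request
cached = 500 / 1e6 * 0.175     # $0.0000875 per request (10x cheaper)
savings = (uncached - cached) * 1_000_000
print(f"${savings:,.2f}/month saved")  # $787.50/month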
Q3: "Your API bill doubled month-over-month. How do you debug?"
Debugging Checklist:
# 1. Check token usage distribution
token_stats = {
    "total_requests": 1_000_000,
    "avg_input_tokens": 5_000,   # ⚠️ Was 2,000 last month
    "avg_output_tokens": 800,    # ⚠️ Was 400 last month
}

# 2. Identify which feature/endpoint spiked
#    (use the OpenAI usage API or your own logs)
top_endpoints = {
    "/chat": 900_000,        # requests
    "/summarize": 100_000,   # requests  ⚠️ New feature!
}

# 3. Check for inefficiencies
issues_found = [
    "Summarize endpoint sends full 50K-token docs (should use excerpts)",
    "Chat endpoint not using cached system prompts",
    "No truncation - some conversations hit the 128K limit",
]

# 4. Estimate savings
optimizations = [
    ("Enable prompt caching", "60% reduction on /chat"),
    ("Use semantic search for /summarize", "90% reduction"),
    ("Implement conversation truncation", "30% reduction overall"),
]
Q4: "Should we use GPT-5.2 or Claude Sonnet 4.5 for production?"
Answer Framework:
| Factor | GPT-5.2 | Claude Sonnet 4.5 | Winner |
|---|---|---|---|
| Cost | $1.75/$14 | $3/$15 | GPT-5.2 |
| Context | 128K | 200K (1M beta) | Claude |
| Speed | Faster | Slower | GPT-5.2 |
| Safety | Good | Excellent | Claude |
| Code | Excellent | Excellent | Tie |
| Reasoning | Excellent | Excellent | Tie |
Decision:
- High-volume, cost-sensitive: GPT-5.2
- Long-context (>128K): Claude Sonnet 4.5
- Safety-critical: Claude Sonnet 4.5
- Fastest response: GPT-5.2
Key Takeaways for Interviews
✅ Know the pricing: Memorize the top 3-4 models' costs (GPT-5.2, Claude, Gemini)
✅ Calculate costs: Be ready to estimate $/conversation or $/month on a whiteboard
✅ Optimize ruthlessly: Caching, truncation, model cascading, semantic compression
✅ Understand trade-offs: Quality vs. cost, latency vs. context window
✅ Debug systematically: Token usage distribution → identify the spike → fix the inefficiency
Next Step: Learn how to control LLM outputs with sampling strategies in Module 1, Lesson 3.