LLM Fundamentals for Interviews
Token Economics & Cost Optimization
Why This Matters for Interviews
OpenAI, Anthropic, and Meta explicitly ask about token economics in L5-L6 interviews:
Real Interview Question (Anthropic):
"You're building a customer support chatbot. Each conversation averages 20 messages. How would you optimize token costs while maintaining quality? Walk me through your calculation."
Real Interview Question (Meta):
"Your LLM API bill is $50K/month. You have 1M users. How would you reduce costs by 50% without degrading user experience?"
Understanding that tokens = money is critical for production LLM engineering.
What is a Token? (The Unit of Computation)
Not a word! A token is a subword unit.
Examples (GPT-5.4 BPE tokenizer):
| Text | Tokens | Count |
|---|---|---|
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-5.4" | ["G", "PT", "-", "5", ".", "4"] | 6 |
| "مرحبا" (Arabic) | ["م", "ر", "ح", "ب", "ا"] | 5 |
| "你好" (Chinese) | ["你", "好"] | 2 |
Key Rule: 1 token ≈ 4 characters in English, but varies by language.
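The rule of thumb above can be turned into a rough estimator. This is only a sketch: the function name `estimate_tokens` and the default of 4 characters per token are assumptions for illustration, and the result is an approximation for English text. For billing-grade counts, use the provider's actual tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English-centric heuristic)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


# "Hello, world!" has 13 characters, so the estimate is ceil(13 / 4)
print(estimate_tokens("Hello, world!"))  # 4
```

For non-English text the ratio is usually lower (fewer characters per token), so a smaller `chars_per_token` would be a safer assumption there.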
BPE Tokenization (Byte-Pair Encoding)
How GPT-5.4, Claude 4.6, Llama 3.3 tokenize:
Algorithm:
- Start with character-level vocabulary
- Find most frequent pair of tokens
- Merge pair into new token
- Repeat until vocabulary size reached (typically 50K-200K)
Example Training:
Input text: "low low low low lower lower newest newest newest newest newest newest"
The merge order below is illustrative. A real trainer merges the highest-count pair at each step, which for this exact corpus is ('w', 'e') with 8 occurrences (('l', 'o') has 6).
Iteration 1:
Merge pair: ('l', 'o') → 'lo'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']
Iteration 2:
Merge pair: ('lo', 'w') → 'low'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']
Iteration 3:
Merge pair: ('e', 'w') → 'ew'
...continues...
Final: "low" becomes 1 token, "lower" becomes 2 tokens, "newest" becomes 2 tokens
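The merge loop described above can be sketched as a tiny trainer. This is a minimal illustration under stated assumptions, not any production tokenizer: it splits on whitespace, works on characters rather than bytes, and breaks ties by whichever pair was counted first. The helper names (`count_pairs`, `merge_pair`, `train_bpe`) are hypothetical.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-separated text."""
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # highest-count pair; ties go to the first seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


corpus = "low low low low lower lower newest newest newest newest newest newest"
counts = count_pairs(Counter(tuple(word) for word in corpus.split()))
print(counts[("w", "e")], counts[("l", "o")])  # 8 6

merges, words = train_bpe(corpus, num_merges=3)
print(merges[0])  # ('w', 'e')
```

Running the counter on the sample corpus is a quick way to check pair frequencies by hand before trusting a merge order.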
Code Implementation:
def bpe_tokenize(text, merges):
    """
    BPE tokenization (simplified).
    Splits on whitespace, then applies the learned merges to each word.
    Used in: GPT-5.4, Llama 3.3
    """
    tokens = []
    for word in text.split():
        # Start with characters
        symbols = list(word)
        # Apply merge rules in the order they were learned
        for pair, new_token in merges:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == pair:
                    # Replace the two symbols with the merged token
                    symbols[i:i + 2] = [new_token]
                i += 1
        tokens.extend(symbols)
    return tokens

# Example usage:
merges = [
    (('l', 'o'), 'lo'),
    (('lo', 'w'), 'low'),
    (('e', 'w'), 'ew'),
    (('n', 'ew'), 'new'),
    (('e', 's'), 'es'),    # 'est' must be built before it can be merged
    (('es', 't'), 'est'),
    (('new', 'est'), 'newest'),
]
text = "lowest newest"
tokens = bpe_tokenize(text, merges)
# Result: ['low', 'est', 'newest']
Tokenizer Comparison (April 2026)
| Model | Tokenizer | Vocab Size | Efficiency | Notes |
|---|---|---|---|---|
| GPT-5.4 | tiktoken (o200k_base) | 200K | ⭐⭐⭐⭐⭐ | Optimized for code + multilingual |
| Claude 4.6 | claude-v1 | 100K | ⭐⭐⭐⭐⭐ | Similar efficiency to GPT-5.4 |
| Gemini 3.1 Pro | SentencePiece | 256K | ⭐⭐⭐⭐ | Larger vocab, fewer tokens per text |
| Llama 3.3 | tiktoken (Llama) | 128K | ⭐⭐⭐⭐⭐ | Open-source, multilingual |
| DeepSeek R1 | DeepSeek-tokenizer | 100K | ⭐⭐⭐⭐ | Chinese-optimized |
Interview Insight: Mention that modern tokenizers use 100K–256K vocab (vs. 50K in GPT-2) for better efficiency across languages.
Context Window Evolution
Historical Progression
| Year | Model | Context Window | What Changed |
|---|---|---|---|
| 2020 | GPT-3 | 2K tokens | Baseline |
| 2023 | GPT-4 Turbo | 128K tokens | 64x increase |
| 2024 | Claude 3 Opus | 200K tokens | Production-ready |
| 2025 | GPT-5 | 128K-400K tokens | Mixture of window sizes |
| 2026 | GPT-5.4 | 400K (1.1M beta) | Long-context standard |
| 2026 | Claude Sonnet 4.6 | 1M tokens (standard) | Full 1M at base price |
| 2026 | Gemini 3.1 Pro | 1M tokens | Multimodal 1M window |
Interview Question: "Why do we need context windows beyond 100K tokens?"
Strong Answer:
"Real-world applications need long context:
- Code: Entire repository context (50K-200K tokens)
- Legal: Full contracts + amendments (100K+ tokens)
- Research: Multiple papers for literature review (200K+ tokens)
- Customer support: Full conversation history across sessions (50K+ tokens)
Larger windows reduce the need for RAG/retrieval and avoid retrieval misses, which can improve accuracy, though very long prompts cost more and add latency."
Token Pricing (April 2026)
Pricing per 1M Tokens
| Model | Input | Output | Cached Input | Use Case |
|---|---|---|---|---|
| GPT-5.4 | $2.50 | $15.00 | $0.25 (90% off) | General-purpose |
| GPT-5.4 Mini | $0.75 | $4.50 | $0.075 | High-volume |
| GPT-5.4 Pro | $30.00 | $180.00 | — | Deep reasoning |
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 (90% off) | Frontier reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 (90% off) | Production workhorse |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | High-volume |
| Gemini 3.1 Pro | $2.00 | $12.00 | — | Multimodal |
| DeepSeek R1 | $0.55 | $2.19 | — | Budget / self-hostable |
Key Observations:
- Output tokens cost roughly 4-6x input tokens in this table (generation is expensive)
- Prompt caching saves up to 90% on cached input (critical for chatbots with system prompts)
- Batch API delivers an additional 50% discount on both input and output
- Opus 4.6 is a major price drop vs. earlier Opus pricing ($5/$25 vs. the historical $15/$75)
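The observations above can be checked with a few lines of arithmetic. The sketch below copies three rows from the pricing table; the dictionary keys are labels chosen for this example (not official API model IDs), and applying the 50% batch discount to the whole request, including cached tokens, is an assumption.

```python
# USD per 1M tokens, copied from the pricing table (None = no cached rate listed)
PRICES = {
    "gpt-5.4": {"input": 2.50, "output": 15.00, "cached_input": 0.25},
    "claude-opus-4.6": {"input": 5.00, "output": 25.00, "cached_input": 0.50},
    "deepseek-r1": {"input": 0.55, "output": 2.19, "cached_input": None},
}


def output_input_ratio(model):
    """How many times more an output token costs than an input token."""
    p = PRICES[model]
    return p["output"] / p["input"]


def request_cost(model, input_tokens, output_tokens, cached_tokens=0, batch=False):
    """Cost in USD of one request; cached_tokens are billed at the cached rate."""
    p = PRICES[model]
    cached_rate = p["cached_input"] if p["cached_input"] is not None else p["input"]
    cost = (
        input_tokens * p["input"]
        + cached_tokens * cached_rate
        + output_tokens * p["output"]
    ) / 1_000_000
    # Assumption: the batch discount halves the cost of the whole request
    return cost * 0.5 if batch else cost


for model in PRICES:
    print(model, round(output_input_ratio(model), 1))

print(request_cost("gpt-5.4", input_tokens=1_000, output_tokens=1_000))
print(request_cost("gpt-5.4", input_tokens=1_000, output_tokens=1_000, batch=True))
```

Whether cache and batch discounts actually stack depends on the provider, so that line is worth confirming against current pricing pages before quoting a number.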
Cost Calculation Examples
Example 1: Customer Support Chatbot
Scenario:
- System prompt: 500 tokens (cached)
- Average conversation: 10 messages
- Average message: 50 tokens user + 200 tokens assistant
- 1,000 conversations/day
Cost Calculation (GPT-5.4):
def calculate_chatbot_cost(
    system_prompt_tokens=500,
    messages_per_conversation=10,
    user_tokens_per_message=50,
    assistant_tokens_per_message=200,
    conversations_per_day=1000,
    model="gpt-5.4",
):
    """
    Calculate monthly chatbot costs.

    Simplification: every message is counted once. In a real multi-turn
    chat each request resends the earlier history, so input tokens grow
    with conversation length.
    """
    # Pricing (per 1M tokens, April 2026)
    pricing = {
        "gpt-5.4": {
            "input": 2.50,
            "output": 15.00,
            "cached_input": 0.25,
        },
        "claude-sonnet-4-6": {
            "input": 3.00,
            "output": 15.00,
            "cached_input": 0.30,
        },
    }
    p = pricing[model]
    # Tokens per conversation
    cached_system_prompt = system_prompt_tokens  # Once per conversation
    input_tokens = messages_per_conversation * user_tokens_per_message
    output_tokens = messages_per_conversation * assistant_tokens_per_message
    # Cost per conversation
    cost_per_conversation = (
        (cached_system_prompt / 1_000_000) * p["cached_input"]  # System prompt
        + (input_tokens / 1_000_000) * p["input"]               # User messages
        + (output_tokens / 1_000_000) * p["output"]             # Assistant messages
    )
    # Monthly cost (30 days)
    monthly_cost = cost_per_conversation * conversations_per_day * 30
    return {
        "cost_per_conversation": cost_per_conversation,
        "daily_cost": cost_per_conversation * conversations_per_day,
        "monthly_cost": monthly_cost,
        "tokens_per_conversation": cached_system_prompt + input_tokens + output_tokens,
    }

# Calculate
result = calculate_chatbot_cost()
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")
Output:
Cost per conversation: $0.0314
Monthly cost: $941.25
Breakdown:
- Cached system prompt (500 tokens): $0.000125 per conversation
- Input tokens (500 tokens): $0.00125 per conversation
- Output tokens (2,000 tokens): $0.0300 per conversation
Optimization Strategies:
- Cache system prompt (90% discount): ✅ Already doing
- Reduce output tokens by 20%:
  assistant_tokens_per_message=160  # 200 → 160
  # New monthly cost: $761.25 (about 19% savings)
- Switch to Claude Haiku 4.5 for simple queries:
  # 70% of queries are simple (route to Haiku)
  # 30% complex (use GPT-5.4)
  # New monthly cost: about $504 (about 46% savings vs. the $941.25 baseline)
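The calculator in this example counts each message once. In a typical chat API every turn resends the system prompt and the full prior history, so billed input grows faster than the message count suggests. The sketch below models that under stated assumptions: no caching, fixed message sizes from the scenario, and GPT-5.4 list prices. The function names are hypothetical.

```python
def conversation_tokens(system_prompt=500, turns=10, user=50, assistant=200):
    """Total billed tokens when every turn resends the full history."""
    input_tokens = 0
    history = 0  # tokens of earlier user + assistant messages
    for _ in range(turns):
        # Each request carries: system prompt + prior history + new user message
        input_tokens += system_prompt + history + user
        history += user + assistant
    return {"input_tokens": input_tokens, "output_tokens": turns * assistant}


def conversation_cost(tokens, input_price=2.50, output_price=15.00):
    """USD cost of one conversation at per-1M-token prices, with no caching."""
    return (
        tokens["input_tokens"] * input_price
        + tokens["output_tokens"] * output_price
    ) / 1_000_000


tokens = conversation_tokens()
print(tokens)  # {'input_tokens': 16750, 'output_tokens': 2000}
print(f"${conversation_cost(tokens):.6f}")  # $0.071875
```

Under these assumptions the resent history, not the new user messages, makes up most of the input bill, which is why caching and truncation matter more as conversations get longer.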
Example 2: Code Generation Tool
Scenario:
- Repository context: 50K tokens (cached)
- User query: 100 tokens
- Generated code: 500 tokens
- 10,000 requests/month
Cost Calculation:
def calculate_code_gen_cost(
    repo_context_tokens=50_000,
    user_query_tokens=100,
    generated_code_tokens=500,
    requests_per_month=10_000,
    model="gpt-5.4",
):
    """Calculate per-request and monthly cost for the code generation tool."""
    # Pricing (per 1M tokens, April 2026)
    pricing = {
        "gpt-5.4": {"input": 2.50, "output": 15.00, "cached_input": 0.25},
    }
    p = pricing[model]
    cost_per_request = (
        (repo_context_tokens / 1_000_000) * p["cached_input"]  # Cached context
        + (user_query_tokens / 1_000_000) * p["input"]         # Query
        + (generated_code_tokens / 1_000_000) * p["output"]    # Generated code
    )
    monthly_cost = cost_per_request * requests_per_month
    return {
        "cost_per_request": cost_per_request,
        "monthly_cost": monthly_cost,
    }

result = calculate_code_gen_cost()
print(f"Cost per request: ${result['cost_per_request']:.5f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")
Output:
Cost per request: $0.02025
Monthly cost: $202.50
Breakdown:
- Cached repo context (50K tokens): $0.0125 (about 62% of cost)
- User query (100 tokens): $0.00025
- Generated code (500 tokens): $0.0075
Critical Insight: Even with the 90% caching discount, the repo context is the largest cost component. Optimization: Use sparse context (only relevant files).
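To see how the context size drives the bill, the cost shares can be computed directly from the scenario's numbers. This is a small sketch using the GPT-5.4 prices from the table; the function name `cost_shares` and the 10K-token comparison point are assumptions for illustration.

```python
def cost_shares(repo_context_tokens, user_query_tokens=100, generated_code_tokens=500,
                cached_price=0.25, input_price=2.50, output_price=15.00):
    """Return (total cost per request, share of cost per component)."""
    parts = {
        "cached_context": repo_context_tokens * cached_price / 1_000_000,
        "query": user_query_tokens * input_price / 1_000_000,
        "generated_code": generated_code_tokens * output_price / 1_000_000,
    }
    total = sum(parts.values())
    return total, {name: cost / total for name, cost in parts.items()}


for context_size in (50_000, 10_000):
    total, shares = cost_shares(context_size)
    print(f"{context_size:>6} tokens: ${total:.5f} per request, "
          f"context share {shares['cached_context']:.1%}")
```

Trimming the context from 50K to 10K tokens roughly halves the per-request cost in this scenario, and the generated code becomes the dominant component instead.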
Token Budgeting Strategies
1. Dynamic Context Truncation
import tiktoken

def truncate_context(messages, max_tokens=120_000, reserve_for_output=4_000):
    """
    Truncate conversation history to fit the context window.
    Keeps the system prompt and the last 3 messages, then fills the
    remaining budget with older messages, newest first.
    Used in production chatbots.
    """
    # o200k_base is the encoding this lesson lists for GPT-5.4
    enc = tiktoken.get_encoding("o200k_base")
    available_tokens = max_tokens - reserve_for_output

    def count(message):
        # Counts content only; per-message formatting overhead is ignored
        return len(enc.encode(message["content"]))

    # Always keep system prompt + last 3 messages
    system_prompt, history = messages[0], messages[1:]
    recent_messages = history[-3:]
    # Older messages compete for whatever budget is left
    middle_messages = history[:-3]
    # Count tokens
    used_tokens = count(system_prompt) + sum(count(m) for m in recent_messages)
    # Add older messages until budget exhausted
    included_middle = []
    for message in reversed(middle_messages):  # Start from most recent
        msg_tokens = count(message)
        if used_tokens + msg_tokens > available_tokens:
            break
        included_middle.insert(0, message)
        used_tokens += msg_tokens
    final_messages = [system_prompt] + included_middle + recent_messages
    return final_messages, used_tokens

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... 100 messages of history ...
]
truncated, token_count = truncate_context(messages, max_tokens=128_000)
print(f"Truncated to {len(truncated)} messages ({token_count} tokens)")
2. Semantic Compression (RAG Hybrid)
Instead of sending the entire document (50K tokens), send only the relevant excerpts (5K tokens):
import numpy as np
import tiktoken
from sentence_transformers import SentenceTransformer

def semantic_context_selection(query, documents, max_tokens=5000):
    """
    Use embedding similarity to select the most relevant context.
    In the example below this cuts the prompt from 50K to 5K tokens (90% fewer);
    whether quality holds depends on how well retrieval finds the right excerpts.
    """
    # Embed query and documents (unit-normalized, so dot product = cosine similarity)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    doc_embeddings = model.encode(documents, normalize_embeddings=True)
    # Compute similarity scores
    similarities = np.dot(doc_embeddings, query_embedding)
    # Sort by relevance (most similar first)
    ranked_indices = np.argsort(similarities)[::-1]
    # Select top docs until token budget
    enc = tiktoken.get_encoding("o200k_base")
    selected_docs = []
    total_tokens = 0
    for idx in ranked_indices:
        doc = documents[idx]
        doc_tokens = len(enc.encode(doc))
        if total_tokens + doc_tokens > max_tokens:
            continue  # Too large for the remaining budget; a smaller doc may still fit
        selected_docs.append(doc)
        total_tokens += doc_tokens
    return "\n\n".join(selected_docs), total_tokens

# Example:
query = "How do I implement authentication?"
docs = [doc1, doc2, doc3, ...]  # 50K tokens total
context, tokens = semantic_context_selection(query, docs, max_tokens=5000)
# Result: at most 5K tokens of the most relevant content (about 90% cost reduction on context)
3. Model Cascading (Quality vs. Cost)
Route requests to different models based on complexity:
def route_to_model(query):
    """
    Route simple queries to a cheap model, complex ones to an expensive model.
    Can cut average cost per query substantially; the exact figure depends on the traffic mix.
    """
    # Use small model to classify complexity
    complexity_score = estimate_query_complexity(query)  # 0-1 scale
    if complexity_score < 0.3:
        return "claude-haiku-4-5"  # $1 input / $5 output
    elif complexity_score < 0.7:
        return "gpt-5.4"  # $2.50 input / $15 output
    else:
        return "claude-opus-4-6"  # $5 input / $25 output

def estimate_query_complexity(query):
    """
    Heuristic complexity scoring based on keyword matches.
    In production, use a small classifier model.
    """
    complexity_signals = {
        "code": 0.7,
        "debug": 0.8,
        "explain": 0.4,
        "reasoning": 0.9,
        "prove": 0.9,
        "math": 0.8,
        "summarize": 0.3,
    }
    query_lower = query.lower()
    max_complexity = 0.2  # Baseline when no signal matches
    for signal, score in complexity_signals.items():
        if signal in query_lower:
            max_complexity = max(max_complexity, score)
    return max_complexity

# Example:
query1 = "What's the capital of France?"
# → claude-haiku-4-5 (no signal matches, baseline 0.2)
query2 = "Explain how this caching layer works"
# → gpt-5.4 ("explain" scores 0.4, moderate complexity)
query3 = "Prove that P != NP using complexity theory"
# → claude-opus-4-6 ("prove" scores 0.9, advanced reasoning)
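The payoff from cascading depends on the traffic mix, which is easy to estimate before building a router. The sketch below is a back-of-the-envelope calculation under stated assumptions: it reuses the chatbot workload from Example 1 (500 cached, 500 input, and 2,000 output tokens per conversation), a 70/30 split between Claude Haiku 4.5 and GPT-5.4, and ignores the cost of the classifier itself. The function name `blended_cost` is hypothetical.

```python
def blended_cost(mix, cost_per_query):
    """Average cost per query for a traffic mix; shares must sum to 1."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * cost_per_query[model] for model, share in mix.items())


# Per-conversation cost: cached prompt + input + output, at per-1M-token prices
cost_per_query = {
    "claude-haiku-4-5": (500 * 0.10 + 500 * 1.00 + 2_000 * 5.00) / 1_000_000,
    "gpt-5.4": (500 * 0.25 + 500 * 2.50 + 2_000 * 15.00) / 1_000_000,
}

baseline = cost_per_query["gpt-5.4"]
routed = blended_cost({"claude-haiku-4-5": 0.7, "gpt-5.4": 0.3}, cost_per_query)
savings = 1 - routed / baseline

print(f"Baseline: ${baseline:.6f}  Routed: ${routed:.6f}")
print(f"Savings: {savings:.1%}")  # Savings: 46.5%
```

A router that sends complex queries to a more expensive tier would shift this number, so it helps to plug in the measured share of each tier rather than a guess.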
Context Window Limits & Handling
What Happens When You Exceed Context Window?
GPT-5.4 (400K standard context):
from openai import OpenAI

client = OpenAI()
# Request with 410K tokens
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=messages,  # 410K tokens
)
# ❌ Error (illustrative wording): "This model's maximum context length is 400000 tokens"
Solutions:
- Truncate (naive):
  messages = messages[-50:]  # Keep last 50 messages
  # ⚠️ Loses important context
- Summarize (better):
  # Summarize old messages
  old_summary = summarize_with_llm(messages[:-10])
  new_messages = [
      {"role": "system", "content": f"Previous conversation summary: {old_summary}"},
      *messages[-10:],
  ]
  # ✅ Preserves key information, fits in context
- Sparse Attention (advanced):
  # Use model with sparse attention (e.g., Longformer pattern)
  # Attends to:
  # - System prompt
  # - Last N messages
  # - Important messages (detected via keyword/embedding)
  # ✅ Efficient for very long conversations
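The summarize option can be packaged as a small helper that takes the summarizer as a parameter, which keeps it testable without an API call. This is a sketch under stated assumptions: `compact_history` and `fake_summarize` are hypothetical names, the first message is assumed to be the system prompt, and in practice the `summarize` callable would wrap an LLM request.

```python
def compact_history(messages, summarize, keep_last=10):
    """Replace older turns with one summary message; keep system prompt and recent turns."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    system, history = messages[0], messages[1:]
    if len(history) <= keep_last:
        return list(messages)  # Nothing old enough to summarize
    old, recent = history[:-keep_last], history[-keep_last:]
    summary_message = {
        "role": "system",
        "content": f"Previous conversation summary: {summarize(old)}",
    }
    return [system, summary_message, *recent]


def fake_summarize(old_messages):
    """Stand-in summarizer for demonstration; a real one would call an LLM."""
    return f"{len(old_messages)} earlier messages omitted"


messages = [{"role": "system", "content": "You are a helpful assistant."}]
messages += [{"role": "user", "content": f"message {i}"} for i in range(25)]

compacted = compact_history(messages, fake_summarize, keep_last=10)
print(len(compacted))  # 12
print(compacted[1]["content"])
```

Passing the summarizer in as an argument also makes it easy to swap in a cheaper model for summarization than the one answering the user.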
Common Interview Questions & Answers
Q1: "Why do output tokens cost more than input tokens?"
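Answer:
"Output tokens require autoregressive generation: each token depends on all previous tokens, so a 100-token output takes 100 sequential forward passes. With a KV-cache, each pass processes only the one new token while attending over the cached keys and values. Without it, step t would reprocess all t tokens, for 1 + 2 + ... + 100 = 5,050 token computations (ignoring the prompt). Input tokens, by contrast, are processed in parallel in a single forward pass. Decoding is therefore sequential and memory-bandwidth-bound, which uses the hardware less efficiently per token. Sampling (temperature, top-p) adds only negligible overhead."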
Q2: "How would you reduce costs for a chatbot with a 500-token system prompt?"
Strong Answer:
"Use prompt caching. With GPT-5.4's 90% cache discount:
- Uncached: 500 tokens × $2.50/1M = $0.00125 per request
- Cached: 500 tokens × $0.25/1M = $0.000125 per request (10x cheaper)
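For 1M requests/month, savings ≈ $1,125/month. Also ensure the system prompt is static across requests, because any change invalidates the cache, and check the provider's minimum cacheable prompt length (some APIs have required around 1,024 tokens), since a 500-token prompt may fall below it."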
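The savings figure can be reproduced with a one-line formula. The sketch below assumes every request is a cache hit and uses the GPT-5.4 input and cached-input prices from the pricing table; the function name `monthly_cache_savings` is hypothetical.

```python
def monthly_cache_savings(prompt_tokens, requests_per_month,
                          input_price=2.50, cached_price=0.25):
    """USD saved per month by billing the prompt at the cached rate (100% hit rate assumed)."""
    saved_per_request = prompt_tokens * (input_price - cached_price) / 1_000_000
    return saved_per_request * requests_per_month


print(f"${monthly_cache_savings(500, 1_000_000):,.2f}")  # $1,125.00
```

With a lower hit rate the savings scale down proportionally, so multiplying by the observed hit rate gives a more realistic estimate.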
Q3: "Your API bill doubled month-over-month. How do you debug?"
Debugging Checklist:
# 1. Check token usage distribution
token_stats = {
    "total_requests": 1_000_000,
    "avg_input_tokens": 5_000,   # ⚠️ Was 2,000 last month
    "avg_output_tokens": 800,    # ⚠️ Was 400 last month
}
# 2. Identify which feature/endpoint spiked
# Use OpenAI usage API or logs
top_endpoints = {
    "/chat": 900_000,        # requests
    "/summarize": 100_000,   # requests ⚠️ New feature!
}
# 3. Check for inefficiencies
issues_found = [
    "Summarize endpoint sends full 50K-token docs (should use excerpts)",
    "Chat endpoint not using cached system prompts",
    "No truncation - some conversations hit the context limit",
]
# 4. Estimate savings
optimizations = [
    ("Enable prompt caching", "60% reduction on /chat"),
    ("Use semantic search for /summarize", "90% reduction"),
    ("Implement conversation truncation", "30% reduction overall"),
]
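Step 2 of the checklist usually comes down to grouping logged usage by endpoint and ranking by cost. The sketch below assumes each log record carries an endpoint name plus input and output token counts; the record shape, the sample values, and the function name `cost_by_endpoint` are hypothetical, and prices default to GPT-5.4's from the pricing table.

```python
from collections import defaultdict


def cost_by_endpoint(records, input_price=2.50, output_price=15.00):
    """Sum USD cost per endpoint and return it sorted from most to least expensive."""
    totals = defaultdict(float)
    for record in records:
        totals[record["endpoint"]] += (
            record["input_tokens"] * input_price
            + record["output_tokens"] * output_price
        ) / 1_000_000
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical log rows for illustration
records = [
    {"endpoint": "/chat", "input_tokens": 2_000, "output_tokens": 400},
    {"endpoint": "/chat", "input_tokens": 2_000, "output_tokens": 400},
    {"endpoint": "/summarize", "input_tokens": 50_000, "output_tokens": 800},
]

report = cost_by_endpoint(records)
for endpoint, cost in report.items():
    print(f"{endpoint}: ${cost:.4f}")
```

Even in this tiny sample, one `/summarize` call costs several times more than two `/chat` calls, which is the kind of skew a request-count dashboard alone would hide.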
Q4: "Should we use GPT-5.4 or Claude Sonnet 4.6 for production?"
Answer Framework:
| Factor | GPT-5.4 | Claude Sonnet 4.6 | Winner |
|---|---|---|---|
| Cost | $2.50/$15 | $3/$15 | GPT-5.4 (slightly) |
| Context | 400K (1.1M beta) | 1M (standard) | Claude |
| Speed | Faster | Comparable | GPT-5.4 |
| Safety | Good | Excellent | Claude |
| Code / tool use | Excellent | Excellent (SWE-bench leader) | Claude |
| Reasoning | Excellent | Excellent | Tie |
Decision:
- High-volume, cost-sensitive: GPT-5.4 Mini or Claude Haiku 4.5
- Long-context (>400K): Claude Sonnet 4.6 (1M at base price)
- Safety-critical: Claude Sonnet 4.6
- Agentic coding tasks: Claude Sonnet 4.6
Key Takeaways for Interviews
✅ Know the pricing: Memorize top 3-4 models' costs (GPT-5.4, Claude 4.6, Gemini 3.1)
✅ Calculate costs: Be ready to estimate $/conversation or $/month on a whiteboard
✅ Optimize ruthlessly: Caching, truncation, model cascading, semantic compression
✅ Understand trade-offs: Quality vs. cost, latency vs. context window
✅ Debug systematically: Token usage distribution → identify spike → fix inefficiency
Next Step: Learn how to control LLM outputs with sampling strategies in Module 1, Lesson 3.