LLM Fundamentals for Interviews

Token Economics & Cost Optimization


Why This Matters for Interviews

OpenAI, Anthropic, Meta explicitly ask about token economics in L5-L6 interviews:

Real Interview Question (Anthropic):

"You're building a customer support chatbot. Each conversation averages 20 messages. How would you optimize token costs while maintaining quality? Walk me through your calculation."

Real Interview Question (Meta):

"Your LLM API bill is $50K/month. You have 1M users. How would you reduce costs by 50% without degrading user experience?"

Understanding tokens = money is critical for production LLM engineering.


What is a Token? (The Unit of Computation)

Not a word! A token is a subword unit.

Examples (GPT-5.4 BPE tokenizer):

| Text | Tokens | Count |
| --- | --- | --- |
| "Hello, world!" | ["Hello", ",", " world", "!"] | 4 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-5.4" | ["G", "PT", "-", "5", ".", "4"] | 6 |
| "مرحبا" (Arabic) | ["م", "ر", "ح", "ب", "ا"] | 5 |
| "你好" (Chinese) | ["你", "好"] | 2 |

Key Rule: 1 token ≈ 4 characters in English, but varies by language.
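
The rule of thumb above can be turned into a quick back-of-the-envelope estimator. This is a minimal sketch, not a tokenizer: `estimate_tokens` and its `chars_per_token` parameter are hypothetical names, and the default of 4 characters per token only holds roughly for English text.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (a heuristic, not a tokenizer)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)

# 13 characters / 4 characters per token, rounded up
print(estimate_tokens("Hello, world!"))  # 4
```

For languages that tokenize less efficiently, pass a smaller `chars_per_token`; for billing-grade numbers, count with the provider's actual tokenizer.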


BPE Tokenization (Byte-Pair Encoding)

How GPT-5.4, Claude 4.6, Llama 3.3 tokenize:

Algorithm:

  1. Start with character-level vocabulary
  2. Find most frequent pair of tokens
  3. Merge pair into new token
  4. Repeat until vocabulary size reached (typically 50K-200K)

Example Training:

Input text: "low low low low lower lower newest newest newest newest newest newest"

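Iteration 1:
Most frequent pair: ('w', 'e'), 8 occurrences (2 in "lower", 6 in "newest") → merge to 'we'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'we']

Iteration 2:
Several pairs now tie at 6; breaking ties by first occurrence: ('l', 'o') → merge to 'lo'
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'we', 'lo']

Iteration 3:
Tie at 6 again; first occurrence: ('n', 'e') → merge to 'ne'
...continues...

Final (after 9 merges with this tie-breaking rule): "low", "lower", and "newest" each become 1 token. Stopping earlier leaves rarer words split: after 7 merges, "lower" is still 3 tokens ('lo', 'we', 'r').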
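
The training loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production trainer: `train_bpe` is a hypothetical name, words are split on whitespace, and ties between equally frequent pairs are broken by first occurrence (real tokenizers work on bytes and may break ties differently, so merge order can vary).

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Learn up to num_merges BPE merges from a whitespace-split corpus."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # max() returns the first pair seen among equally frequent ones.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges, dict(words)

corpus = "low low low low lower lower newest newest newest newest newest newest"
merges, words = train_bpe(corpus, num_merges=9)
for step, pair in enumerate(merges, start=1):
    print(step, pair)
print(words)
```

Printing the pair counts inside the loop is a quick way to check a hand-worked example against what the algorithm actually selects.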

Code Implementation:

def bpe_tokenize(text, merges):
    """
    BPE tokenization (simplified).
    Used in: GPT-5.4, Llama 3.3

    Simplifications: words are split on whitespace and the spaces are
    dropped; production tokenizers work on bytes and keep the spaces.
    """
    output = []
    for word in text.split():
        # Start with characters
        tokens = list(word)

        # Apply merge rules in the order they were learned
        for pair, new_token in merges:
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == pair:
                    tokens[i:i + 2] = [new_token]
                i += 1

        output.extend(tokens)

    return output

# Example usage:
merges = [
    (('l', 'o'), 'lo'),
    (('lo', 'w'), 'low'),
    (('e', 'w'), 'ew'),
    (('n', 'ew'), 'new'),
    (('e', 's'), 'es'),      # 'est' must be built before it can be merged
    (('es', 't'), 'est'),
    (('new', 'est'), 'newest'),
]

text = "lowest newest"
tokens = bpe_tokenize(text, merges)
print(tokens)
# Result: ['low', 'est', 'newest']

Tokenizer Comparison (April 2026)

| Model | Tokenizer | Vocab Size | Efficiency | Notes |
| --- | --- | --- | --- | --- |
| GPT-5.4 | tiktoken (o200k_base) | 200K | ⭐⭐⭐⭐⭐ | Optimized for code + multilingual |
| Claude 4.6 | claude-v1 | 100K | ⭐⭐⭐⭐⭐ | Similar efficiency to GPT-5.4 |
| Gemini 3.1 Pro | SentencePiece | 256K | ⭐⭐⭐⭐ | Larger vocab, fewer tokens per text |
| Llama 3.3 | tiktoken (Llama) | 128K | ⭐⭐⭐⭐⭐ | Open-source, multilingual |
| DeepSeek R1 | DeepSeek-tokenizer | 100K | ⭐⭐⭐⭐ | Chinese-optimized |

Interview Insight: Mention that modern tokenizers use vocabularies of roughly 100K-256K entries (vs. 50K in GPT-2) for better efficiency across languages.


Context Window Evolution

Historical Progression

| Year | Model | Context Window | What Changed |
| --- | --- | --- | --- |
| 2020 | GPT-3 | 2K tokens | Baseline |
| 2023 | GPT-4 Turbo | 128K tokens | 64x increase |
| 2024 | Claude 3 Opus | 200K tokens | Production-ready |
| 2025 | GPT-5 | 128K-400K tokens | Mixture of window sizes |
| 2026 | GPT-5.4 | 400K (1.1M beta) | Long-context standard |
| 2026 | Claude Sonnet 4.6 | 1M tokens (standard) | Full 1M at base price |
| 2026 | Gemini 3.1 Pro | 1M tokens | Multimodal 1M window |

Interview Question: "Why do we need context windows beyond 100K tokens?"

Strong Answer:

"Real-world applications need long context:

  • Code: Entire repository context (50K-200K tokens)
  • Legal: Full contracts + amendments (100K+ tokens)
  • Research: Multiple papers for literature review (200K+ tokens)
  • Customer support: Full conversation history across sessions (50K+ tokens)

Larger windows reduce the need for a RAG/retrieval pipeline and avoid retrieval misses, at the price of higher per-request cost and longer prefill latency."


Token Pricing (April 2026)

Pricing per 1M Tokens

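| Model | Input | Output | Cached Input | Use Case |
| --- | --- | --- | --- | --- |
| GPT-5.4 | $2.50 | $15.00 | $0.25 (90% off) | General-purpose |
| GPT-5.4 Mini | $0.75 | $4.50 | $0.075 | High-volume |
| GPT-5.4 Pro | $30.00 | $180.00 | not listed | Deep reasoning |
| Claude Opus 4.6 | $5.00 | $25.00 | $0.50 (90% off) | Frontier reasoning |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 (90% off) | Production workhorse |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | High-volume |
| Gemini 3.1 Pro | $2.00 | $12.00 | not listed | Multimodal |
| DeepSeek R1 | $0.55 | $2.19 | not listed | Budget / self-hostable |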

Key Observations:

  • Output tokens cost roughly 4-6x input tokens in this table (generation is expensive)
  • Prompt caching saves up to 90% (critical for chatbots with system prompts)
  • Batch API delivers an additional 50% discount on both input and output
  • Opus 4.6 is a major price drop vs. earlier Opus models ($5/$25 vs. historical $15/$75)
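
The observations above can be checked directly against the table. The following is a small sketch under stated assumptions: the prices are copied from this lesson's pricing table, the 50% batch discount is the figure quoted above, and `output_to_input_ratio` and `request_cost` are hypothetical helper names.

```python
# USD per 1M tokens as (input, output), copied from the pricing table in this lesson
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "Claude Opus 4.6": (5.00, 25.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "DeepSeek R1": (0.55, 2.19),
}
BATCH_DISCOUNT = 0.50  # assumption: the batch discount quoted above

def output_to_input_ratio(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES[model]
    return output_price / input_price

def request_cost(model: str, input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    """Cost in USD of one request, optionally at the batch rate."""
    input_price, output_price = PRICES[model]
    cost = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return cost * (1 - BATCH_DISCOUNT) if batch else cost

for model in PRICES:
    print(f"{model}: output costs {output_to_input_ratio(model):.1f}x input")
```

Because output is priced several times higher than input, trimming response length usually moves the bill more than trimming prompts of the same size.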

Cost Calculation Examples

Example 1: Customer Support Chatbot

Scenario:

  • System prompt: 500 tokens (cached)
  • Average conversation: 10 messages
  • Average message: 50 tokens user + 200 tokens assistant
  • 1,000 conversations/day

Cost Calculation (GPT-5.4):

def calculate_chatbot_cost(
    system_prompt_tokens=500,
    messages_per_conversation=10,
    user_tokens_per_message=50,
    assistant_tokens_per_message=200,
    conversations_per_day=1000,
    model="gpt-5.4"
):
    """
    Calculate monthly chatbot costs.
    """
    # Pricing (per 1M tokens, April 2026)
    pricing = {
        "gpt-5.4": {
            "input": 2.50,
            "output": 15.00,
            "cached_input": 0.25,
        },
        "claude-sonnet-4-6": {
            "input": 3.00,
            "output": 15.00,
            "cached_input": 0.30,
        },
    }

    p = pricing[model]

    # Tokens per conversation
    cached_system_prompt = system_prompt_tokens  # Once per conversation
    input_tokens = messages_per_conversation * user_tokens_per_message
    output_tokens = messages_per_conversation * assistant_tokens_per_message

    # Cost per conversation
    cost_per_conversation = (
        (cached_system_prompt / 1_000_000) * p["cached_input"] +  # System prompt
        (input_tokens / 1_000_000) * p["input"] +                 # User messages
        (output_tokens / 1_000_000) * p["output"]                 # Assistant messages
    )

    # Monthly cost (30 days)
    monthly_cost = cost_per_conversation * conversations_per_day * 30

    return {
        "cost_per_conversation": cost_per_conversation,
        "daily_cost": cost_per_conversation * conversations_per_day,
        "monthly_cost": monthly_cost,
        "tokens_per_conversation": cached_system_prompt + input_tokens + output_tokens,
    }

# Calculate
result = calculate_chatbot_cost()
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")

Output:

Cost per conversation: $0.0314
Monthly cost: $941.25

Breakdown:
- Cached system prompt (500 tokens): $0.0001 per conversation
- Input tokens (500 tokens): $0.0013 per conversation
- Output tokens (2000 tokens): $0.0300 per conversation
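
One caveat about the model above: it bills each message once, whereas chat APIs are stateless and the full history is usually re-sent on every turn. The sketch below estimates the cost with that growth included. It is an illustration under stated assumptions: `conversation_cost_with_history` is a hypothetical name, one "turn" is one user message plus one assistant reply, only the system prompt is assumed to hit the cache, and the default prices are the GPT-5.4 figures from the pricing table.

```python
def conversation_cost_with_history(
    system_prompt_tokens=500,
    turns=10,
    user_tokens=50,
    assistant_tokens=200,
    input_price=2.50,          # USD per 1M tokens
    output_price=15.00,
    cached_input_price=0.25,
):
    """Estimate one conversation's cost when every turn re-sends the history."""
    cached = uncached = output = 0
    history = 0  # tokens of earlier user + assistant messages
    for _ in range(turns):
        cached += system_prompt_tokens     # system prompt, billed at the cached rate
        uncached += history + user_tokens  # earlier turns plus the new user message
        output += assistant_tokens
        history += user_tokens + assistant_tokens
    cost = (
        cached * cached_input_price
        + uncached * input_price
        + output * output_price
    ) / 1_000_000
    return {
        "cached_input_tokens": cached,
        "uncached_input_tokens": uncached,
        "output_tokens": output,
        "cost": cost,
    }

result = conversation_cost_with_history()
print(result)
```

With these defaults the uncached input grows to 11,750 tokens and the conversation costs about $0.06, roughly twice what `calculate_chatbot_cost()` returns. If the provider also caches the conversation prefix, the real figure lands somewhere in between.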

Optimization Strategies:

  1. Cache system prompt (90% discount): ✅ Already doing
  2. Reduce output tokens by 20%:
    assistant_tokens_per_message=160  # 200 → 160
    # New monthly cost: $761.25 (about 19% savings)
    
  3. Switch to Claude Haiku 4.5 for simple queries:
    # 70% of queries are simple (route to Haiku)
    # 30% complex (use GPT-5.4)
    # New monthly cost: about $504 (about 46% savings)
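
The strategies above can be recomputed with the same cost model as Example 1. This is a minimal sketch under stated assumptions: `monthly_cost` is a hypothetical helper, prices come from this lesson's pricing table, each strategy is evaluated independently against the baseline, and simple queries routed to Haiku are assumed to have the same token counts as the rest.

```python
# USD per 1M tokens, copied from the pricing table in this lesson
PRICING = {
    "gpt-5.4": {"input": 2.50, "output": 15.00, "cached_input": 0.25},
    "claude-haiku-4-5": {"input": 1.00, "output": 5.00, "cached_input": 0.10},
}

def monthly_cost(
    model,
    assistant_tokens_per_message=200,
    system_prompt_tokens=500,
    messages_per_conversation=10,
    user_tokens_per_message=50,
    conversations_per_day=1000,
    days=30,
):
    """Monthly cost in USD using Example 1's flat per-message model."""
    p = PRICING[model]
    per_conversation = (
        system_prompt_tokens * p["cached_input"]
        + messages_per_conversation * user_tokens_per_message * p["input"]
        + messages_per_conversation * assistant_tokens_per_message * p["output"]
    ) / 1_000_000
    return per_conversation * conversations_per_day * days

baseline = monthly_cost("gpt-5.4")
shorter_outputs = monthly_cost("gpt-5.4", assistant_tokens_per_message=160)
routed = 0.7 * monthly_cost("claude-haiku-4-5") + 0.3 * baseline

for name, cost in [("baseline", baseline), ("20% shorter outputs", shorter_outputs), ("70/30 routing", routed)]:
    print(f"{name}: ${cost:.2f}/month ({1 - cost / baseline:.0%} savings)")
```

Changing the routing split or the output length in one place makes it easy to see which lever matters most for a given workload.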
    

Example 2: Code Generation Tool

Scenario:

  • Repository context: 50K tokens (cached)
  • User query: 100 tokens
  • Generated code: 500 tokens
  • 10,000 requests/month

Cost Calculation:

def calculate_code_gen_cost(
    repo_context_tokens=50_000,
    user_query_tokens=100,
    generated_code_tokens=500,
    requests_per_month=10_000,
    model="gpt-5.4"
):
    pricing = {
        "gpt-5.4": {"input": 2.50, "output": 15.00, "cached_input": 0.25},
    }

    p = pricing[model]

    cost_per_request = (
        (repo_context_tokens / 1_000_000) * p["cached_input"] +  # Cached context
        (user_query_tokens / 1_000_000) * p["input"] +           # Query
        (generated_code_tokens / 1_000_000) * p["output"]        # Generated code
    )

    monthly_cost = cost_per_request * requests_per_month

    return {
        "cost_per_request": cost_per_request,
        "monthly_cost": monthly_cost,
    }

result = calculate_code_gen_cost()
print(f"Cost per request: ${result['cost_per_request']:.5f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")

Output:

Cost per request: $0.02025
Monthly cost: $202.50

Breakdown:
- Cached repo context (50K tokens): $0.0125 (about 62% of cost)
- User query (100 tokens): $0.00025
- Generated code (500 tokens): $0.0075

Critical Insight: Even with 90% caching discount, the repo context dominates cost. Optimization: Use sparse context (only relevant files).
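
To see how much sparse context helps, the per-request cost can be split by component and recomputed for a smaller context. A minimal sketch, assuming the GPT-5.4 prices from the pricing table; `cost_shares` is a hypothetical helper and the 5K-token figure is only an illustrative target.

```python
def cost_shares(
    repo_context_tokens,
    user_query_tokens=100,
    generated_code_tokens=500,
    input_price=2.50,          # USD per 1M tokens
    output_price=15.00,
    cached_input_price=0.25,
):
    """Return (total cost per request, share of the total per component)."""
    parts = {
        "cached_context": repo_context_tokens * cached_input_price / 1_000_000,
        "query": user_query_tokens * input_price / 1_000_000,
        "generated_code": generated_code_tokens * output_price / 1_000_000,
    }
    total = sum(parts.values())
    return total, {name: cost / total for name, cost in parts.items()}

full_total, full_shares = cost_shares(repo_context_tokens=50_000)
sparse_total, sparse_shares = cost_shares(repo_context_tokens=5_000)

print(f"Full context:   ${full_total:.5f} per request, context share {full_shares['cached_context']:.0%}")
print(f"Sparse context: ${sparse_total:.5f} per request, context share {sparse_shares['cached_context']:.0%}")
```

Once the context shrinks, generated code becomes the largest component, so output length is the next thing to optimize.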


Token Budgeting Strategies

1. Dynamic Context Truncation

def truncate_context(messages, max_tokens=120_000, reserve_for_output=4_000):
    """
    Truncate conversation history to fit the context window.
    Keeps the system prompt and the last 3 messages, then adds older
    messages (newest first) until the token budget is used up.
    Per-message formatting overhead is ignored, so leave some headroom.
    """
    import tiktoken

    # o200k_base is used as an approximation here; other providers'
    # tokenizers will give different counts.
    enc = tiktoken.get_encoding("o200k_base")
    available_tokens = max_tokens - reserve_for_output

    def count(message):
        return len(enc.encode(message["content"]))

    # Always keep system prompt + last 3 messages
    system_prompt = messages[0]
    history = messages[1:]
    recent_messages = history[-3:]

    # Older messages compete for whatever budget is left
    middle_messages = history[:-3]

    used_tokens = count(system_prompt) + sum(count(m) for m in recent_messages)

    # Add older messages, most recent first, until the budget is exhausted
    included_middle = []
    for message in reversed(middle_messages):
        msg_tokens = count(message)
        if used_tokens + msg_tokens > available_tokens:
            break
        included_middle.insert(0, message)
        used_tokens += msg_tokens

    final_messages = [system_prompt] + included_middle + recent_messages

    return final_messages, used_tokens

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... 100 messages of history ...
]

truncated, token_count = truncate_context(messages, max_tokens=128_000)
print(f"Truncated to {len(truncated)} messages ({token_count} tokens)")

2. Semantic Compression (RAG Hybrid)

Instead of sending entire document (50K tokens), send relevant excerpts (5K tokens):

def semantic_context_selection(query, documents, max_tokens=5000):
    """
    Use embedding similarity to select the most relevant context.
    Sending 5K tokens instead of 50K cuts input tokens by 90%; answer
    quality depends on retrieval quality, so evaluate it on real queries.
    """
    from sentence_transformers import SentenceTransformer
    import numpy as np
    import tiktoken

    # Embed query and documents (unit-normalized, so dot product = cosine similarity)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    doc_embeddings = model.encode(documents, normalize_embeddings=True)

    # Compute similarity scores
    similarities = np.dot(doc_embeddings, query_embedding)

    # Sort by relevance (highest first)
    ranked_indices = np.argsort(similarities)[::-1]

    # Select top docs until the token budget is used up
    enc = tiktoken.get_encoding("o200k_base")  # approximation for token counts
    selected_docs = []
    total_tokens = 0

    for idx in ranked_indices:
        doc = documents[idx]
        doc_tokens = len(enc.encode(doc))

        if total_tokens + doc_tokens > max_tokens:
            continue  # Too large for the remaining budget; try the next one
        selected_docs.append(doc)
        total_tokens += doc_tokens

    return "\n\n".join(selected_docs), total_tokens

# Example:
query = "How do I implement authentication?"
docs = [doc1, doc2, doc3, ...]  # 50K tokens total
context, tokens = semantic_context_selection(query, docs, max_tokens=5000)
# Result: 5K tokens of most relevant content (90% cost reduction)

3. Model Cascading (Quality vs. Cost)

Route requests to different models based on complexity:

def route_to_model(query):
    """
    Route simple queries to a cheap model, complex ones to an expensive model.
    Savings depend on the traffic mix, so measure them on real queries.
    """
    # Score complexity with the keyword heuristic below (0-1 scale)
    complexity_score = estimate_query_complexity(query)

    if complexity_score < 0.3:
        return "claude-haiku-4-5"   # $1 input / $5 output
    elif complexity_score < 0.7:
        return "gpt-5.4"             # $2.50 input / $15 output
    else:
        return "claude-opus-4-6"     # $5 input / $25 output

def estimate_query_complexity(query):
    """
    Heuristic complexity scoring based on keyword matches.
    In production, use a small classifier model.
    """
    complexity_signals = {
        "code": 0.7,
        "debug": 0.8,
        "explain": 0.4,
        "reasoning": 0.9,
        "math": 0.8,
        "summarize": 0.3,
    }

    query_lower = query.lower()
    max_complexity = 0.2  # Baseline when no keyword matches

    for signal, score in complexity_signals.items():
        if signal in query_lower:
            max_complexity = max(max_complexity, score)

    return max_complexity

# Example:
query1 = "What's the capital of France?"
# → claude-haiku-4-5 (no keyword matches, baseline 0.2)

query2 = "Explain how OAuth works"
# → gpt-5.4 ("explain" scores 0.4)

query3 = "Debug this Python code and explain the issue"
# → claude-opus-4-6 ("debug" scores 0.8)

query4 = "Prove that P != NP using complexity theory"
# → claude-haiku-4-5 (no keyword matches, so the heuristic misroutes it;
#   this is why production systems use a trained classifier)

Context Window Limits & Handling

What Happens When You Exceed Context Window?

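GPT-5.4 (400K standard context, per the table above):

from openai import OpenAI, BadRequestError
client = OpenAI()

# Request with ~410K tokens
try:
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=messages,  # ~410K tokens
    )
except BadRequestError as err:
    # ❌ The API rejects the request instead of silently truncating it;
    # the error message reports the model's maximum context length.
    print(err)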

Solutions:

  1. Truncate (naive):

    messages = messages[-50:]  # Keep last 50 messages
    # ⚠️ Loses important context
    
  2. Summarize (better):

    # Summarize old messages
    old_summary = summarize_with_llm(messages[:-10])
    new_messages = [
        {"role": "system", "content": f"Previous conversation summary: {old_summary}"},
        *messages[-10:]
    ]
    # ✅ Preserves key information, fits in context
    
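  3. Selective retention (advanced):

    # Application-level analogue of sparse attention (e.g., the Longformer
    # pattern of local + global tokens). Hosted APIs do not expose attention
    # settings, so you choose which messages to send. Keep:
    # - System prompt
    # - Last N messages
    # - Important messages (detected via keyword/embedding)
    # ✅ Efficient for very long conversations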
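
The summarize approach (solution 2) can be sketched as a reusable helper. This is an illustration under stated assumptions, not a definitive implementation: `compact_history` is a hypothetical name, and both the token counter and the summarizer are passed in as callables so the sketch runs without any API. The demo uses a crude 4-characters-per-token counter and a stub in place of an LLM summarizer.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def compact_history(
    messages: List[Message],
    count_tokens: Callable[[str], int],
    summarize: Callable[[List[Message]], str],
    max_tokens: int,
    keep_recent: int = 10,
) -> List[Message]:
    """Return messages unchanged if they fit, else summarize the older ones."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= max_tokens:
        return messages

    system, history = messages[0], messages[1:]
    split = max(len(history) - keep_recent, 0)
    old, recent = history[:split], history[split:]
    if not old:
        return messages  # nothing to summarize; the caller must truncate instead

    summary = summarize(old)
    return [
        system,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent,
    ]

# Demo with stand-ins (both are assumptions, not real tokenizer or LLM calls)
def rough_count(text: str) -> int:
    return len(text) // 4

def stub_summarizer(old_messages: List[Message]) -> str:
    return f"{len(old_messages)} earlier messages omitted"

demo = [{"role": "system", "content": "You are a helpful assistant."}]
demo += [{"role": "user", "content": "x" * 400} for _ in range(30)]

compacted = compact_history(demo, rough_count, stub_summarizer, max_tokens=1500)
print(len(demo), "->", len(compacted), "messages")
```

The result is not re-checked against the budget, so a very long recent window can still overflow; in practice this helper would be combined with truncation.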
    

Common Interview Questions & Answers

Q1: "Why do output tokens cost more than input tokens?"

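Answer:

"Output tokens are generated autoregressively: each new token depends on all previous tokens, so a 100-token output takes 100 sequential forward passes. With a KV cache, each pass processes only the one new token (about 100 token computations in total), but it still has to read the cached keys and values of every earlier token, which makes decoding memory-bandwidth-bound and leaves the GPU underutilized. Without a KV cache, each pass would reprocess the whole sequence (1 + 2 + ... + 100 = 5,050 token computations for the generated tokens alone). Input tokens, by contrast, are processed in parallel in a single forward pass, which uses the hardware far more efficiently. Sampling (temperature, top-p) adds negligible compute."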

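The arithmetic behind Q1 can be checked with a short counting sketch. It is a simplified model under stated assumptions: `tokens_processed` is a hypothetical helper that counts how many token positions pass through the model's layers, and it ignores the cost of attention reading the cache.

```python
def tokens_processed(prompt_tokens: int, output_tokens: int, kv_cache: bool) -> int:
    """Count token positions pushed through the model to generate output_tokens."""
    if output_tokens <= 0:
        return 0
    if kv_cache:
        # Prefill the prompt once, then feed one new token per remaining step.
        return prompt_tokens + (output_tokens - 1)
    # No cache: step k re-processes the prompt plus the k tokens generated so far.
    return sum(prompt_tokens + k for k in range(output_tokens))

# A 1-token prompt (e.g. a start token) and a 100-token output
print(tokens_processed(1, 100, kv_cache=True))   # 100
print(tokens_processed(1, 100, kv_cache=False))  # 5050
```

Even in the cached case the 100 steps run one after another, which is why output is priced higher than input that is processed in one parallel pass.
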
Q2: "How would you reduce costs for a chatbot with a 500-token system prompt?"

Strong Answer:

"Use prompt caching. With GPT-5.4's 90% cache discount:

  • Uncached: 500 tokens × $2.50/1M = $0.00125 per request
  • Cached: 500 tokens × $0.25/1M = $0.000125 per request (10x cheaper)

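For 1M requests/month, savings ≈ $1,125/month. Also keep the system prompt identical and at the start of every request, since caches match on the prompt prefix and any change invalidates them. Check the provider's minimum cacheable length too: several providers have only cached prefixes of roughly 1,024 tokens or more, in which case a 500-token prompt needs to be cached together with other stable content such as tool definitions or few-shot examples."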

Q3: "Your API bill doubled month-over-month. How do you debug?"

Debugging Checklist:

# 1. Check token usage distribution
token_stats = {
    "total_requests": 1_000_000,
    "avg_input_tokens": 5_000,   # ⚠️ Was 2,000 last month
    "avg_output_tokens": 800,    # ⚠️ Was 400 last month
}

# 2. Identify which feature/endpoint spiked
# Use OpenAI usage API or logs
top_endpoints = {
    "/chat": 900_000,       # requests
    "/summarize": 100_000,  # requests  ⚠️ New feature!
}

# 3. Check for inefficiencies
issues_found = [
    "Summarize endpoint sends full 50K-token docs (should use excerpts)",
    "Chat endpoint not using cached system prompts",
    "No truncation - some conversations hit 128K limit",
]

# 4. Estimate savings
optimizations = [
    ("Enable prompt caching", "60% reduction on /chat"),
    ("Use semantic search for /summarize", "90% reduction"),
    ("Implement conversation truncation", "30% reduction overall"),
]
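
Steps 1 and 2 of the checklist can be sketched as a small aggregation over request logs. This is an illustration under stated assumptions: the record format (`endpoint`, `input_tokens`, `output_tokens`), the helper names, and the sample data are all hypothetical, and the prices are the GPT-5.4 figures from the pricing table.

```python
from collections import defaultdict

def usage_by_endpoint(records, input_price=2.50, output_price=15.00):
    """Aggregate request count, tokens, and cost (USD) per endpoint."""
    totals = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0, "cost": 0.0}
    )
    for record in records:
        t = totals[record["endpoint"]]
        t["requests"] += 1
        t["input_tokens"] += record["input_tokens"]
        t["output_tokens"] += record["output_tokens"]
        t["cost"] += (
            record["input_tokens"] * input_price
            + record["output_tokens"] * output_price
        ) / 1_000_000
    return dict(totals)

def biggest_increase(last_month, this_month):
    """Return the endpoint whose cost grew the most, plus all cost deltas."""
    deltas = {
        endpoint: stats["cost"] - last_month.get(endpoint, {"cost": 0.0})["cost"]
        for endpoint, stats in this_month.items()
    }
    return max(deltas, key=deltas.get), deltas

# Hypothetical sample logs
last_records = [{"endpoint": "/chat", "input_tokens": 2_000, "output_tokens": 400}] * 3
this_records = last_records + [
    {"endpoint": "/summarize", "input_tokens": 50_000, "output_tokens": 800}
]

culprit, deltas = biggest_increase(
    usage_by_endpoint(last_records), usage_by_endpoint(this_records)
)
print(culprit, deltas)
```

Grouping by endpoint first separates "more traffic" from "more tokens per request", which call for different fixes.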

Q4: "Should we use GPT-5.4 or Claude Sonnet 4.6 for production?"

Answer Framework:

| Factor | GPT-5.4 | Claude Sonnet 4.6 | Winner |
| --- | --- | --- | --- |
| Cost | $2.50/$15 | $3/$15 | GPT-5.4 (slightly) |
| Context | 400K (1.1M beta) | 1M (standard) | Claude |
| Speed | Faster | Comparable | GPT-5.4 |
| Safety | Good | Excellent | Claude |
| Code / tool use | Excellent | Excellent (SWE-bench leader) | Claude |
| Reasoning | Excellent | Excellent | Tie |

Decision:

  • High-volume, cost-sensitive: GPT-5.4 Mini or Claude Haiku 4.5
  • Long-context (>400K): Claude Sonnet 4.6 (1M at base price)
  • Safety-critical: Claude Sonnet 4.6
  • Agentic coding tasks: Claude Sonnet 4.6

Key Takeaways for Interviews

✅ Know the pricing: Memorize top 3-4 models' costs (GPT-5.4, Claude 4.6, Gemini 3.1)
✅ Calculate costs: Be ready to estimate $/conversation or $/month on a whiteboard
✅ Optimize ruthlessly: Caching, truncation, model cascading, semantic compression
✅ Understand trade-offs: Quality vs. cost, latency vs. context window
✅ Debug systematically: Token usage distribution → identify spike → fix inefficiency

Next Step: Learn how to control LLM outputs with sampling strategies in Module 1, Lesson 3.
