Token Economics & Cost Optimization

Why This Matters for Interviews

OpenAI, Anthropic, and Meta explicitly ask about token economics in L5-L6 interviews:

Real Interview Question (Anthropic):

"You're building a customer support chatbot. Each conversation averages 20 messages. How would you optimize token costs while maintaining quality? Walk me through your calculation."

Real Interview Question (Meta):

"Your LLM API bill is $50K/month. You have 1M users. How would you reduce costs by 50% without degrading user experience?"

Tokens are money: understanding how they are counted and billed is critical for production LLM engineering.

What is a Token? (The Unit of Computation)

Not a word! A token is a subword unit.

Examples (GPT-5.2 BPE tokenizer):

Text	Tokens	Count
"Hello, world!"	`["Hello", ",", " world", "!"]`	4
"Tokenization"	`["Token", "ization"]`	2
"GPT-5.2"	`["G", "PT", "-", "5", ".", "2"]`	6
"مرحبا" (Arabic)	`["م", "ر", "ح", "ب", "ا"]`	5
"你好" (Chinese)	`["你", "好"]`	2

Key Rule: 1 token ≈ 4 characters in English, but varies by language.
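
The 4-characters-per-token rule can be turned into a quick estimator for back-of-the-envelope sizing. This is a minimal sketch that assumes English text; `estimate_tokens` and its `chars_per_token` parameter are illustrative names, and billing-grade counts should come from the model's real tokenizer.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English rule of thumb)."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)

# 13 characters / 4 = 3.25, rounded up to 4 tokens
print(estimate_tokens("Hello, world!"))
```

For non-English text, `chars_per_token` would need to be lowered, since the same number of characters usually splits into more tokens.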

BPE Tokenization (Byte-Pair Encoding)

How GPT-5.2, Claude 4.5, Llama 3.3 tokenize:

Algorithm:

Start with character-level vocabulary
Find most frequent pair of tokens
Merge pair into new token
Repeat until vocabulary size reached (typically 50K-200K)

Example Training:

Input text: "low low low low lower lower newest newest newest newest newest newest"

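Iteration 1:
Merge ('l', 'o') → 'lo' (6 occurrences: 4 in "low", 2 in "lower")
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']

Iteration 2:
Merge ('lo', 'w') → 'low' (6 occurrences)
Vocab: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']

Iteration 3:
Merge ('e', 'w') → 'ew' (6 occurrences, all in "newest")
...continues...

Note: the merge order above is simplified to follow one word at a time. A strict frequency count would merge ('w', 'e') first, because it occurs 8 times (2 in "lower", 6 in "newest").

Final: after enough merges, "low" becomes 1 token, "lower" becomes 2 tokens ("low" + "er"), "newest" becomes 2 tokens ("new" + "est")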
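
The training loop described by the algorithm steps can be sketched in a few lines. This is a simplified illustration, not any vendor's production trainer: it splits on whitespace, works on characters rather than bytes, and breaks ties by picking the pair seen first. The function name `learn_bpe_merges` is an assumption made for this example.

```python
from collections import Counter

def learn_bpe_merges(corpus: str, num_merges: int):
    """Learn up to `num_merges` BPE merge rules from whitespace-split text."""
    # Each distinct word is a tuple of symbols, weighted by its frequency.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []

    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token

        # Most frequent pair; ties go to the pair that was seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word with the chosen pair merged.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(best[0] + best[1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words

    return merges, dict(words)

corpus = "low low low low lower lower newest newest newest newest newest newest"
merges, segmented = learn_bpe_merges(corpus, num_merges=5)
for pair in merges:
    print(pair)
```

Printing the learned pairs with their counts is a useful way to check a hand-worked example, because the strict frequency order is easy to get wrong by eye.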

Code Implementation:

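def bpe_tokenize(text, merges):
    """
    BPE tokenization (simplified).
    Used in: GPT-5.2, Llama 3.3

    `merges` is an ordered list of ((left, right), merged) rules learned
    during training. Words are split on whitespace so merges never cross
    word boundaries (real tokenizers keep the space as part of a token).
    """
    result = []
    for word in text.split():
        # Start with characters
        tokens = list(word)

        # Apply merge rules in order
        for pair, new_token in merges:
            i = 0
            while i < len(tokens) - 1:
                if (tokens[i], tokens[i + 1]) == pair:
                    tokens[i:i + 2] = [new_token]  # merge the pair in place
                i += 1

        result.extend(tokens)

    return result

# Example usage:
merges = [
    (('l', 'o'), 'lo'),
    (('lo', 'w'), 'low'),
    (('e', 's'), 'es'),      # 'est' must be built before it can be merged
    (('es', 't'), 'est'),
    (('e', 'w'), 'ew'),
    (('n', 'ew'), 'new'),
    (('new', 'est'), 'newest'),
]

text = "lowest newest"
tokens = bpe_tokenize(text, merges)
# Result: ['low', 'est', 'newest']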

Tokenizer Comparison (January 2026)

Model	Tokenizer	Vocab Size	Efficiency	Notes
GPT-5.2	tiktoken (cl100k_base)	100K	⭐⭐⭐⭐⭐	Optimized for code + multilingual
Claude 4.5	claude-v1	100K	⭐⭐⭐⭐⭐	Similar to GPT-5.2
Gemini 3 Pro	SentencePiece	256K	⭐⭐⭐⭐	Larger vocab, fewer tokens per text
Llama 3.3	tiktoken (Llama)	128K	⭐⭐⭐⭐⭐	Open-source, multilingual
DeepSeek R1	DeepSeek-tokenizer	100K	⭐⭐⭐⭐	Chinese-optimized

Interview Insight: Mention that modern tokenizers use 100K+ vocab (vs. 50K in GPT-2) for better efficiency across languages.

Context Window Evolution

Historical Progression

Year	Model	Context Window	What Changed
2020	GPT-3	2K tokens	Baseline
2023	GPT-4 Turbo	128K tokens	64x increase!
2024	Claude 3 Opus	200K tokens	Production-ready
2025	GPT-5	128K-1M tokens	Mixture of window sizes
2026	GPT-5.2	128K tokens (standard)	Optimized for speed
2026	Claude Sonnet 4.5	200K (1M beta)	Extended context
2026	Gemini 3 Pro	1M tokens	Largest production window

Interview Question: "Why do we need context windows beyond 100K tokens?"

Strong Answer:

"Real-world applications need long context:

Code: Entire repository context (50K-200K tokens)

Legal: Full contracts + amendments (100K+ tokens)

Research: Multiple papers for literature review (200K+ tokens)

Customer support: Full conversation history across sessions (50K+ tokens)

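Larger windows reduce the need for RAG/retrieval, which removes a pipeline stage and the risk of retrieving the wrong chunks, although every extra input token still adds cost and latency."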

Token Pricing (January 2026)

Pricing per 1M Tokens

Model	Input	Output	Cached Input	Use Case
GPT-5.2	$1.75	$14.00	$0.175 (90% off)	General-purpose
GPT-4 Turbo	$5.00	$15.00	—	Legacy
Claude Opus 4.5	$15.00	$75.00	—	Premium reasoning
Claude Sonnet 4.5	$3.00	$15.00	$0.60 (80% off >200K)	Production
Claude Haiku 4.5	$0.80	$4.00	—	High-volume
Gemini 3 Pro	$2.50	$10.00	$0.50 (80% off)	Cost-effective
DeepSeek R1	$0.55	$2.19	—	Budget option

Key Observations:

Output tokens cost 3-8x input tokens in this table (generation is expensive)
Cached inputs save 80-90% (critical for chatbots with system prompts)
DeepSeek R1 is roughly 3x cheaper than GPT-5.2 on input and 6x cheaper on output (MIT license, self-hostable)
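
The ratios behind these observations can be checked directly from the table. The sketch below copies the listed prices into a dictionary and derives the output-to-input multiple and the cache discount for each model; `PRICING` and `price_ratios` are names introduced only for this illustration.

```python
# Prices in USD per 1M tokens, copied from the pricing table above.
PRICING = {
    "GPT-5.2": {"input": 1.75, "output": 14.00, "cached": 0.175},
    "GPT-4 Turbo": {"input": 5.00, "output": 15.00, "cached": None},
    "Claude Opus 4.5": {"input": 15.00, "output": 75.00, "cached": None},
    "Claude Sonnet 4.5": {"input": 3.00, "output": 15.00, "cached": 0.60},
    "Claude Haiku 4.5": {"input": 0.80, "output": 4.00, "cached": None},
    "Gemini 3 Pro": {"input": 2.50, "output": 10.00, "cached": 0.50},
    "DeepSeek R1": {"input": 0.55, "output": 2.19, "cached": None},
}

def price_ratios(pricing):
    """Return output/input multiple and cache discount (or None) per model."""
    rows = {}
    for model, p in pricing.items():
        discount = None
        if p["cached"] is not None:
            discount = 1 - p["cached"] / p["input"]
        rows[model] = {
            "output_to_input": p["output"] / p["input"],
            "cache_discount": discount,
        }
    return rows

for model, row in price_ratios(PRICING).items():
    discount = "n/a" if row["cache_discount"] is None else f"{row['cache_discount']:.0%}"
    print(f"{model:18s} output/input = {row['output_to_input']:.1f}x  cache discount = {discount}")
```

Keeping prices in one table like this also makes it easy to rerun every cost estimate when a provider changes its rates.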

Cost Calculation Examples

Example 1: Customer Support Chatbot

Scenario:

System prompt: 500 tokens (cached)
Average conversation: 10 messages
Average message: 50 tokens user + 200 tokens assistant
1,000 conversations/day

Cost Calculation (GPT-5.2):

def calculate_chatbot_cost(
    system_prompt_tokens=500,
    messages_per_conversation=10,
    user_tokens_per_message=50,
    assistant_tokens_per_message=200,
    conversations_per_day=1000,
    model="gpt-5.2"
):
    """
    Calculate monthly chatbot costs.
    """
    # Pricing (per 1M tokens)
    pricing = {
        "gpt-5.2": {
            "input": 1.75,
            "output": 14.00,
            "cached_input": 0.175,
        },
        "claude-sonnet-4.5": {
            "input": 3.00,
            "output": 15.00,
            "cached_input": 0.60,
        },
    }

    p = pricing[model]

    # Tokens per conversation
    cached_system_prompt = system_prompt_tokens  # Once per conversation
    input_tokens = messages_per_conversation * user_tokens_per_message
    output_tokens = messages_per_conversation * assistant_tokens_per_message

    # Cost per conversation
    cost_per_conversation = (
        (cached_system_prompt / 1_000_000) * p["cached_input"] +  # System prompt
        (input_tokens / 1_000_000) * p["input"] +                 # User messages
        (output_tokens / 1_000_000) * p["output"]                 # Assistant messages
    )

    # Monthly cost (30 days)
    monthly_cost = cost_per_conversation * conversations_per_day * 30

    return {
        "cost_per_conversation": cost_per_conversation,
        "daily_cost": cost_per_conversation * conversations_per_day,
        "monthly_cost": monthly_cost,
        "tokens_per_conversation": cached_system_prompt + input_tokens + output_tokens,
    }

# Calculate
result = calculate_chatbot_cost()
print(f"Cost per conversation: ${result['cost_per_conversation']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")

Output (approximately):

Cost per conversation: $0.0290
Monthly cost: $868.88

Breakdown:
- Cached system prompt (500 tokens): $0.0001 per conversation
- Input tokens (500 tokens): $0.0009 per conversation
- Output tokens (2000 tokens): $0.0280 per conversation

Optimization Strategies:

Cache system prompt (90% discount): ✅ Already doing

Reduce output tokens by 20%:

assistant_tokens_per_message=160  # 200 → 160
# Output cost falls from $0.0280 to $0.0224 per conversation
# New monthly cost: about $701 (19% savings)

Switch to Claude Haiku 4.5 for simple queries:

# 70% of queries are simple (route to Haiku)
# 30% complex (use GPT-5.2)
# Haiku conversation at $0.80 input / $4.00 output: about $0.0088
# Blended: 0.7 × $0.0088 + 0.3 × $0.0290 = $0.0148 per conversation
# New monthly cost: about $445 (49% savings)
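
The calculation above bills each message once. Chat completion APIs are typically stateless, so the client resends the system prompt and all earlier turns with every request, and input tokens grow with conversation length. The sketch below estimates that effect under two assumptions that are not in the original scenario: only the system prompt is billed at the cached rate, and earlier turns are billed at the normal input rate. The function name is hypothetical.

```python
def conversation_cost_with_history(
    system_prompt_tokens=500,
    turns=10,
    user_tokens=50,
    assistant_tokens=200,
    input_price=1.75,     # USD per 1M tokens (GPT-5.2, from the table)
    cached_price=0.175,
    output_price=14.00,
):
    """Cost of one conversation when every turn resends the full history."""
    cached_tokens = 0   # system prompt, resent and billed at the cached rate
    input_tokens = 0    # earlier turns plus the new user message
    output_tokens = 0
    history = 0         # tokens accumulated from previous turns

    for _ in range(turns):
        cached_tokens += system_prompt_tokens
        input_tokens += history + user_tokens
        output_tokens += assistant_tokens
        history += user_tokens + assistant_tokens

    cost = (
        cached_tokens * cached_price
        + input_tokens * input_price
        + output_tokens * output_price
    ) / 1_000_000

    return {
        "cached_tokens": cached_tokens,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost": cost,
    }

result = conversation_cost_with_history()
print(result["input_tokens"], f"${result['cost']:.4f}")
```

Under these assumptions the 10-turn conversation sends 11,750 uncached input tokens instead of 500, which is why truncation and summarization matter more as conversations get longer.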

Example 2: Code Generation Tool

Scenario:

Repository context: 50K tokens (cached)
User query: 100 tokens
Generated code: 500 tokens
10,000 requests/month

Cost Calculation:

def calculate_code_gen_cost(
    repo_context_tokens=50_000,
    user_query_tokens=100,
    generated_code_tokens=500,
    requests_per_month=10_000,
    model="gpt-5.2"
):
    pricing = {
        "gpt-5.2": {"input": 1.75, "output": 14.00, "cached_input": 0.175},
    }

    p = pricing[model]

    cost_per_request = (
        (repo_context_tokens / 1_000_000) * p["cached_input"] +  # Cached context
        (user_query_tokens / 1_000_000) * p["input"] +           # Query
        (generated_code_tokens / 1_000_000) * p["output"]        # Generated code
    )

    monthly_cost = cost_per_request * requests_per_month

    return {
        "cost_per_request": cost_per_request,
        "monthly_cost": monthly_cost,
    }

result = calculate_code_gen_cost()
print(f"Cost per request: ${result['cost_per_request']:.4f}")
print(f"Monthly cost: ${result['monthly_cost']:.2f}")

Output:

Cost per request: $0.0159
Monthly cost: $159.25

Breakdown:
- Cached repo context (50K tokens): $0.0088 (55% of cost)
- User query (100 tokens): $0.0002 (1% of cost)
- Generated code (500 tokens): $0.0070 (44% of cost)

Critical Insight: Even with the 90% caching discount, the repo context is the largest line item, and without caching it would cost $0.0875 per request (about 92% of the total). Optimization: Use sparse context (only relevant files).
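
To see how much the context size and the cache discount each contribute, the per-request cost can be split into shares. This is a rough sketch using the GPT-5.2 prices from the table; the 5K-token "sparse" context is an assumed figure for illustration, and `cost_shares` is a hypothetical helper.

```python
def cost_shares(context_tokens, context_price, query_tokens=100,
                output_tokens=500, input_price=1.75, output_price=14.00):
    """Return (total cost per request, share of cost per component)."""
    parts = {
        "context": context_tokens * context_price / 1_000_000,
        "query": query_tokens * input_price / 1_000_000,
        "output": output_tokens * output_price / 1_000_000,
    }
    total = sum(parts.values())
    shares = {name: value / total for name, value in parts.items()}
    return total, shares

scenarios = {
    "full repo, cached": (50_000, 0.175),
    "full repo, uncached": (50_000, 1.75),
    "sparse 5K, uncached": (5_000, 1.75),   # assumed context size
    "sparse 5K, cached": (5_000, 0.175),
}

for name, (tokens, price) in scenarios.items():
    total, shares = cost_shares(tokens, price)
    print(f"{name:22s} ${total:.4f}  context share = {shares['context']:.0%}")
```

One thing this comparison suggests is that a 10x smaller uncached context costs the same as the full cached context, so combining both techniques is what pushes generation to be the dominant cost.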

Token Budgeting Strategies

1. Dynamic Context Truncation

import tiktoken

def truncate_context(messages, max_tokens=120_000, reserve_for_output=4_000):
    """
    Intelligently truncate conversation history to fit context window.
    Used in production chatbots.

    Counts content tokens only; chat APIs add a few tokens of per-message
    overhead, so leave some headroom in max_tokens.
    """
    try:
        enc = tiktoken.encoding_for_model("gpt-5")
    except KeyError:
        # Older tiktoken releases may not know this model name;
        # fall back to a general-purpose encoding as an approximation.
        enc = tiktoken.get_encoding("o200k_base")

    def count(message):
        return len(enc.encode(message["content"]))

    available_tokens = max_tokens - reserve_for_output

    # Always keep system prompt + last 3 messages
    system_prompt, history = messages[0], messages[1:]
    recent_messages = history[-3:]   # never includes the system prompt
    middle_messages = history[:-3]   # older messages, oldest first

    used_tokens = count(system_prompt) + sum(count(m) for m in recent_messages)

    # Add older messages, newest first, until the budget is exhausted
    included_middle = []
    for message in reversed(middle_messages):
        msg_tokens = count(message)
        if used_tokens + msg_tokens > available_tokens:
            break
        included_middle.insert(0, message)
        used_tokens += msg_tokens

    final_messages = [system_prompt] + included_middle + recent_messages

    return final_messages, used_tokens

# Example usage:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... 100 messages of history ...
]

truncated, token_count = truncate_context(messages, max_tokens=128_000)
print(f"Truncated to {len(truncated)} messages ({token_count} tokens)")

2. Semantic Compression (RAG Hybrid)

Instead of sending entire document (50K tokens), send relevant excerpts (5K tokens):

def semantic_context_selection(query, documents, max_tokens=5000):
    """
    Use embedding similarity to select most relevant context.
    In the example below this cuts context from 50K to 5K tokens (90% fewer);
    answer quality depends on the relevant passages actually being selected.
    """
    from sentence_transformers import SentenceTransformer
    import numpy as np
    import tiktoken

    # Embed query and documents (in production, load the model once at startup)
    model = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    doc_embeddings = model.encode(documents, normalize_embeddings=True)

    # Cosine similarity: dot product of unit-length vectors
    similarities = np.dot(doc_embeddings, query_embedding)

    # Sort by relevance (highest first)
    ranked_indices = np.argsort(similarities)[::-1]

    # Select top docs until token budget
    try:
        enc = tiktoken.encoding_for_model("gpt-5")
    except KeyError:
        # Older tiktoken releases may not know this model name
        enc = tiktoken.get_encoding("o200k_base")
    selected_docs = []
    total_tokens = 0

    for idx in ranked_indices:
        doc = documents[idx]
        doc_tokens = len(enc.encode(doc))

        if total_tokens + doc_tokens <= max_tokens:
            selected_docs.append(doc)
            total_tokens += doc_tokens
        else:
            continue  # too large; a smaller, lower-ranked doc may still fit

    return "\n\n".join(selected_docs), total_tokens

# Example:
query = "How do I implement authentication?"
docs = [doc1, doc2, doc3, ...]  # placeholders: 50K tokens total
context, tokens = semantic_context_selection(query, docs, max_tokens=5000)
# Result: at most 5K tokens of the most relevant content (90% fewer context tokens)

3. Model Cascading (Quality vs. Cost)

Route requests to different models based on complexity:

def route_to_model(query):
    """
    Route simple queries to cheap model, complex to expensive model.
    Savings depend on the traffic mix: the more queries land on the cheap
    tier, the lower the average cost per query.
    """
    # Use small model to classify complexity
    complexity_score = estimate_query_complexity(query)  # 0-1 scale

    if complexity_score < 0.3:
        return "claude-haiku-4.5"  # $0.80 input / $4 output
    elif complexity_score < 0.7:
        return "gpt-5.2"            # $1.75 input / $14 output
    else:
        return "claude-opus-4.5"    # $15 input / $75 output

def estimate_query_complexity(query):
    """
    Heuristic complexity scoring.
    In production, use a small classifier model.
    """
    # Scores are tuned so the examples below land on the intended tier.
    complexity_signals = {
        "code": 0.6,
        "debug": 0.6,
        "explain": 0.4,
        "reasoning": 0.9,
        "prove": 0.9,
        "math": 0.8,
        "summarize": 0.3,
    }

    query_lower = query.lower()
    max_complexity = 0.2  # Baseline

    for signal, score in complexity_signals.items():
        if signal in query_lower:
            max_complexity = max(max_complexity, score)

    return max_complexity

# Example:
query1 = "What's the capital of France?"
# → claude-haiku-4.5 (no signal matched, baseline 0.2: simple factual)

query2 = "Debug this Python code and explain the issue"
# → gpt-5.2 ("debug" and "code" score 0.6: moderate complexity)

query3 = "Prove that P != NP using complexity theory"
# → claude-opus-4.5 ("prove" scores 0.9: advanced reasoning)
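
Whether a cascade saves money depends on the traffic mix and on the baseline it is compared against. The sketch below computes the expected cost per query for a routing mix using the prices from the pricing table. The 70/20/10 split and the 1,000-input / 500-output token query are assumptions chosen for illustration, and `blended_cost` is a hypothetical helper.

```python
# USD per 1M tokens as (input, output), from the pricing table.
PRICES = {
    "claude-haiku-4.5": (0.80, 4.00),
    "gpt-5.2": (1.75, 14.00),
    "claude-opus-4.5": (15.00, 75.00),
}

def blended_cost(traffic_mix, input_tokens, output_tokens):
    """Expected cost per query for a routing mix whose shares sum to 1."""
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    total = 0.0
    for model, share in traffic_mix.items():
        input_price, output_price = PRICES[model]
        per_query = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
        total += share * per_query
    return total

# Assumed mix: 70% simple, 20% moderate, 10% complex.
mix = {"claude-haiku-4.5": 0.7, "gpt-5.2": 0.2, "claude-opus-4.5": 0.1}
cascade = blended_cost(mix, input_tokens=1_000, output_tokens=500)
all_gpt = blended_cost({"gpt-5.2": 1.0}, 1_000, 500)
all_opus = blended_cost({"claude-opus-4.5": 1.0}, 1_000, 500)

print(f"cascade:  ${cascade:.5f}")
print(f"all GPT:  ${all_gpt:.5f}")
print(f"all Opus: ${all_opus:.5f}")
```

With this assumed mix the cascade is far cheaper than sending everything to the premium tier, but slightly more expensive than sending everything to GPT-5.2, because the 10% routed to the premium model dominates the blend. The share of traffic that reaches the top tier is the number to watch.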

Context Window Limits & Handling

What Happens When You Exceed Context Window?

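GPT-5.2 (128K context):

from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Request whose messages exceed the 131,072-token limit
try:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=messages,
    )
except BadRequestError as err:
    # ❌ Error: "This model's maximum context length is 131072 tokens"
    print(err)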

Solutions:

Truncate (naive):

messages = messages[-50:]  # Keep last 50 messages
# ⚠️ Loses important context

Summarize (better):

# Summarize old messages (summarize_with_llm is a placeholder for your own
# helper that asks a cheap model for a summary)
old_summary = summarize_with_llm(messages[:-10])
new_messages = [
    {"role": "system", "content": f"Previous conversation summary: {old_summary}"},
    *messages[-10:]
]
# ✅ Preserves key information, fits in context

Sparse context selection (advanced):

# Borrow the sparse-attention idea (e.g., Longformer pattern) at the application level
# Send only:
# - System prompt
# - Last N messages
# - Important messages (detected via keyword/embedding)
# ✅ Efficient for very long conversations
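
The summarize approach can be wrapped in a small helper that takes the summarizer as a parameter, which keeps the logic testable without calling a model. This is a sketch under assumptions: `compress_history` and the `summarize` callable are hypothetical names, and the stub summarizer below stands in for a real LLM call.

```python
from typing import Callable

def compress_history(messages: list, summarize: Callable[[list], str],
                     keep_last: int = 10) -> list:
    """Replace all but the last `keep_last` messages with a single summary."""
    if len(messages) <= keep_last:
        return list(messages)  # nothing to compress

    old, recent = messages[:-keep_last], messages[-keep_last:]
    summary = summarize(old)
    summary_message = {
        "role": "system",
        "content": f"Previous conversation summary: {summary}",
    }
    return [summary_message, *recent]

def stub_summarizer(old_messages: list) -> str:
    # Stand-in for an LLM call; a real helper would send old_messages to a cheap model.
    return f"{len(old_messages)} earlier messages omitted"

history = [{"role": "user", "content": f"message {i}"} for i in range(25)]
compressed = compress_history(history, stub_summarizer, keep_last=10)
print(len(compressed), compressed[0]["content"])
```

Passing the summarizer in also makes it straightforward to cache summaries or swap in a cheaper model later.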

Common Interview Questions & Answers

Q1: "Why do output tokens cost more than input tokens?"

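Answer:

"Output tokens require autoregressive generation: each token depends on all previous tokens, so a 100-token output takes 100 sequential forward passes. With a KV-cache, each pass computes only the one new token but still has to read the model weights and the growing cache from memory, which makes decoding memory-bandwidth-bound and hard to parallelize. Without a KV-cache, every pass would recompute the whole sequence (1 + 2 + ... + 100 = 5,050 token computations for the output alone). Input tokens are processed in parallel in a single forward pass. Sampling (temperature, top-p) adds a small per-token overhead on top."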

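The arithmetic behind sequential decoding can be made concrete with a simplified counting model. In this sketch, which ignores the prompt, step k of generation sees k positions; `generation_work` is a hypothetical helper, and the counts are token positions, not FLOPs.

```python
def generation_work(output_tokens: int, kv_cache: bool = True) -> dict:
    """Simplified work count for generating `output_tokens` tokens.

    Step k sees k positions: k - 1 earlier outputs plus the current one.
    """
    steps = range(1, output_tokens + 1)
    if kv_cache:
        # Only the new position is computed at each step.
        positions = output_tokens
    else:
        # Every step recomputes all k positions from scratch.
        positions = sum(steps)
    return {
        "forward_passes": output_tokens,          # always sequential
        "positions_computed": positions,
        "new_token_attention_reads": sum(steps),  # new token attends to k keys
    }

print(generation_work(100, kv_cache=True))
print(generation_work(100, kv_cache=False))
```

Either way the number of forward passes equals the number of output tokens, which is the part that cannot be parallelized the way input processing can.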
Q2: "How would you reduce costs for a chatbot with a 500-token system prompt?"

Strong Answer:

"Use prompt caching. With GPT-5.2's 90% cache discount:

Uncached: 500 tokens × $1.75/1M = $0.000875 per request

Cached: 500 tokens × $0.175/1M = $0.0000875 per request (10x cheaper)

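For 1M requests/month, savings = $787.50/month. Also keep the system prompt static across requests, since any change invalidates the cache, and check the provider's minimum cacheable prompt length, because short prompts may not qualify for caching."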
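
The savings figure generalizes to any prompt size and request volume. The sketch below reproduces the calculation and adds a `hit_rate` parameter, which is an assumption introduced here to model requests that miss the cache; `monthly_cache_savings` is a hypothetical helper.

```python
def monthly_cache_savings(prompt_tokens, requests_per_month,
                          input_price, cached_price, hit_rate=1.0):
    """Monthly USD saved by billing a shared prompt at the cached rate.

    Prices are per 1M tokens; hit_rate is the fraction of requests served
    from the cache (1.0 means every request hits).
    """
    saved_per_hit = prompt_tokens * (input_price - cached_price) / 1_000_000
    return saved_per_hit * requests_per_month * hit_rate

# 500-token system prompt, 1M requests/month, GPT-5.2 prices from the table
print(monthly_cache_savings(500, 1_000_000, 1.75, 0.175))
print(monthly_cache_savings(500, 1_000_000, 1.75, 0.175, hit_rate=0.8))
```

Savings scale linearly with the hit rate, so a prompt that changes often, or traffic that is too sparse to keep the cache warm, erodes the benefit.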

Q3: "Your API bill doubled month-over-month. How do you debug?"

Debugging Checklist:

# 1. Check token usage distribution
token_stats = {
    "total_requests": 1_000_000,
    "avg_input_tokens": 5_000,   # ⚠️ Was 2,000 last month
    "avg_output_tokens": 800,    # ⚠️ Was 400 last month
}

# 2. Identify which feature/endpoint spiked
# Use OpenAI usage API or logs
top_endpoints = {
    "/chat": 900_000,       # requests
    "/summarize": 100_000,  # requests ⚠️ New feature!
}

# 3. Check for inefficiencies
issues_found = [
    "Summarize endpoint sends full 50K-token docs (should use excerpts)",
    "Chat endpoint not using cached system prompts",
    "No truncation - some conversations hit 128K limit",
]

# 4. Estimate savings
optimizations = [
    ("Enable prompt caching", "60% reduction on /chat"),
    ("Use semantic search for /summarize", "90% reduction"),
    ("Implement conversation truncation", "30% reduction overall"),
]
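
Turning the token statistics into dollars shows where the increase comes from. This sketch assumes GPT-5.2 pricing from the table and the same request volume in both months, neither of which is stated in the checklist; `monthly_bill` is a hypothetical helper.

```python
def monthly_bill(requests, avg_input_tokens, avg_output_tokens,
                 input_price=1.75, output_price=14.00):
    """Return (input cost, output cost) in USD for one month."""
    input_cost = requests * avg_input_tokens * input_price / 1_000_000
    output_cost = requests * avg_output_tokens * output_price / 1_000_000
    return input_cost, output_cost

last_in, last_out = monthly_bill(1_000_000, 2_000, 400)
this_in, this_out = monthly_bill(1_000_000, 5_000, 800)

last_total = last_in + last_out
this_total = this_in + this_out

print(f"Last month: ${last_total:,.0f} (input ${last_in:,.0f}, output ${last_out:,.0f})")
print(f"This month: ${this_total:,.0f} (input ${this_in:,.0f}, output ${this_out:,.0f})")
print(f"Growth: {this_total / last_total:.2f}x")
print(f"Increase from input:  ${this_in - last_in:,.0f}")
print(f"Increase from output: ${this_out - last_out:,.0f}")
```

Under these assumptions the bill grows about 2.2x, with input and output growth contributing similar dollar amounts, so both the oversized documents and the longer responses are worth investigating.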

Q4: "Should we use GPT-5.2 or Claude Sonnet 4.5 for production?"

Answer Framework:

Factor	GPT-5.2	Claude Sonnet 4.5	Winner
Cost	$1.75/$14	$3/$15	GPT-5.2
Context	128K	200K (1M beta)	Claude
Speed	Faster	Slower	GPT-5.2
Safety	Good	Excellent	Claude
Code	Excellent	Excellent	Tie
Reasoning	Excellent	Excellent	Tie

Decision:

High-volume, cost-sensitive: GPT-5.2
Long-context (>128K): Claude Sonnet 4.5
Safety-critical: Claude Sonnet 4.5
Fastest response: GPT-5.2

✅ Know the pricing: Memorize top 3-4 models' costs (GPT-5.2, Claude, Gemini)
✅ Calculate costs: Be ready to estimate $/conversation or $/month on a whiteboard
✅ Optimize ruthlessly: Caching, truncation, model cascading, semantic compression
✅ Understand trade-offs: Quality vs. cost, latency vs. context window
✅ Debug systematically: Token usage distribution → identify spike → fix inefficiency

Next Step: Learn how to control LLM outputs with sampling strategies in Module 1, Lesson 3.
