Understanding LLMs
How LLMs Work
Understanding the mechanics behind Large Language Models helps you use them more effectively. Let's break down the key concepts without getting too technical.
Tokenization: Breaking Down Text
Before an LLM can process text, it needs to break it into smaller pieces called tokens. A token might be:
- A whole word: "hello" → 1 token
- Part of a word: "understanding" → "under" + "standing" = 2 tokens
- A single character for rare words
For example, the sentence "I love AI" might become: ["I", " love", " AI"] = 3 tokens.
Why this matters: LLMs have token limits. When you hear "8K context" or "100K context," that's how many tokens the model can process at once.
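The splitting described above can be sketched with a toy tokenizer. This is only an illustration: real LLM tokenizers (such as byte-pair encoding) learn their vocabularies from data, whereas the small vocabulary here is hand-made for the demo.

```python
# Toy tokenizer: greedy longest-match against a tiny hand-made vocabulary.
# Real tokenizers learn vocabularies of ~50K-100K pieces from data.
VOCAB = ["under", "standing", "hello", "I", " love", " AI", " "]

def tokenize(text: str) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Prefer the longest vocabulary piece that matches at position i.
        match = None
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                match = piece
                break
        if match is None:
            match = text[i]  # rare text falls back to single characters
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("I love AI"))      # ['I', ' love', ' AI'] = 3 tokens
print(tokenize("understanding"))  # ['under', 'standing'] = 2 tokens
```

The fallback to single characters mirrors how real tokenizers guarantee that any input, however rare, can still be encoded.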
The Transformer Architecture
LLMs are built on a revolutionary design called the Transformer (introduced by Google in 2017). The key innovation is attention—the ability to look at all parts of the input simultaneously and determine what's most relevant.
Think of reading a sentence: "The cat sat on the mat because it was tired." When you read "it," your brain automatically connects it to "cat." Transformers do something similar, calculating relationships between all words at once.
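The "calculating relationships between all words at once" step can be sketched as scaled dot-product attention. The 2-dimensional embeddings below are made up for illustration; real models use learned vectors with hundreds or thousands of dimensions and separate query/key/value projections.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention: score one query against every key."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Made-up 2-d embeddings; "it" is deliberately close to "cat".
embeddings = {
    "the": [0.1, 0.0],
    "cat": [0.9, 0.8],
    "mat": [0.2, 0.1],
    "it":  [0.8, 0.9],
}
words = ["the", "cat", "mat"]
weights = attention_weights(embeddings["it"], [embeddings[w] for w in words])
for w, a in zip(words, weights):
    print(f"{w}: {a:.2f}")  # "cat" gets the largest weight
```

Because the dot product between "it" and "cat" is largest, attention assigns "cat" the highest weight, which is the mechanical version of your brain connecting "it" back to "cat".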
Training: Learning from Text
LLMs learn through a simple but powerful process:
- Predict the next word: Given "The sky is," predict "blue"
- Compare with actual text: Check if the prediction matches
- Adjust parameters: Nudge billions of internal weights to improve future predictions
- Repeat billions of times: Process massive amounts of text
This is called self-supervised learning—the training data labels itself.
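The loop above can be sketched with the simplest possible "model": counting which word follows which. A real LLM adjusts billions of weights by gradient descent instead of counting, but the self-supervised signal is the same: the next word in the text is the label.

```python
from collections import Counter, defaultdict

# Tiny corpus. The "labels" are just the next word in the text itself,
# which is exactly what makes this self-supervised.
corpus = "the sky is blue . the grass is green . the sky is blue".split()

counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    counts[current][nxt] += 1  # observe: `nxt` followed `current`

def predict_next(word: str) -> str:
    # Predict the most frequently observed follower.
    return counts[word].most_common(1)[0][0]

print(predict_next("is"))  # "blue" follows "is" twice, "green" once
```

Even this crude counter captures the core idea: after seeing enough text, statistically likely continuations become predictable.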
Inference: Generating Responses
When you send a prompt to an LLM:
- Your text is tokenized
- The model calculates attention across all tokens
- It predicts the most likely next token
- That token is added, and the process repeats
- This continues until the model emits a special stop token or reaches a length limit
Each token generation involves billions of calculations, but modern hardware makes this fast.
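The token-by-token loop above can be sketched as follows. Here `next_token` is a hypothetical stand-in for the full model forward pass (tokenize, compute attention, score candidates); the canned lookup table exists only so the demo runs.

```python
# Autoregressive generation sketch. `next_token` stands in for the model:
# in reality this is billions of calculations per step.
def next_token(tokens: list[str]) -> str:
    table = {  # canned continuations, purely for the demo
        ("The",): "sky",
        ("The", "sky"): "is",
        ("The", "sky", "is"): "blue",
        ("The", "sky", "is", "blue"): "<eos>",
    }
    return table.get(tuple(tokens), "<eos>")

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):  # stop condition 1: length limit
        tok = next_token(tokens)
        if tok == "<eos>":       # stop condition 2: special stop token
            break
        tokens.append(tok)       # feed the new token back into the input
    return tokens

print(generate(["The"]))  # ['The', 'sky', 'is', 'blue']
```

The key structural point is the feedback: each predicted token is appended to the input before the next prediction, which is why generation is inherently sequential.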
Temperature and Creativity
LLMs have a temperature setting that controls randomness:
- Low temperature (0.0-0.3): More predictable, factual responses
- High temperature (0.7-1.0): More creative, varied outputs
This is why the same prompt can give different answers—there's intentional randomness in token selection.
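A minimal sketch of how temperature reshapes the token distribution before sampling (the token names and raw scores below are invented for illustration):

```python
import math, random

def sample_with_temperature(logits, temperature, rng=random.random):
    """Scale raw scores (logits) by 1/temperature, convert to
    probabilities with softmax, then sample an index."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r, acc = rng(), 0.0  # draw from the distribution
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i, probs
    return len(probs) - 1, probs

# Hypothetical candidates for the next token after "The sky is":
tokens = ["blue", "clear", "falling"]
logits = [2.0, 1.0, 0.1]

for t in (0.2, 1.0):
    _, probs = sample_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

At low temperature the distribution is sharply peaked on the top candidate, so sampling is nearly deterministic; at high temperature the probabilities flatten, so lower-ranked tokens get picked more often, which is the source of the varied outputs.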
Key Takeaway
LLMs are essentially very sophisticated prediction machines. They don't store facts in a database—they've learned patterns that let them generate statistically likely continuations of text. This is both their power and their limitation.