How LLMs Work (High-Level)

In the last post, we covered what LLMs are. Massive models, trained on text, that predict the next token.

But that description doesn't really tell you what's happening inside. What does the model actually do when you send it a message? How does it go from your words to a response?

That's what this post is about.

We're not going to do math. No formulas. No matrix multiplication. Just what's actually happening, explained clearly.

Step 1 - Your Text Becomes Tokens

When you type a message to an LLM, the first thing that happens is your text gets broken into tokens.

We touched on tokens in the last post. But let's go a bit deeper here because it matters.

A token is not a word. It's a chunk of text that the model has learned to treat as a unit. Sometimes it's a full word. Sometimes it's part of a word. Sometimes it's punctuation or a space.

Here's a rough example. The sentence "I love programming" might become:

["I", " love", " program", "ming"]

Notice how "programming" got split into two tokens. That's because the model's tokenizer - the thing that splits text - decided those were the most useful chunks based on how it was trained.

The tokenizer is built before training. It looks at the entire training dataset and figures out which chunks of text appear most often. Common words like "the", "is", "and" get their own token. Rare or long words get split. Numbers, punctuation, code syntax - all of it gets tokenized.

This matters because the model never actually sees your raw text. Everything is tokens. And tokens are immediately converted to numbers, because models only understand numbers.

Each token has an ID. "I" might be token 40. " love" might be token 1842. " program" might be token 5649. These are just integers that index into a large vocabulary list, typically somewhere between 50,000 and 200,000 tokens depending on the model.

So your beautiful English sentence becomes a list of integers. That's what gets fed into the model.
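To make that concrete, here's a minimal sketch of the text-to-IDs step. It assumes a tiny made-up vocabulary and a simple "longest match first" rule. Real tokenizers (byte-pair encoding and friends) are more involved, and the ID for "ming" is invented for illustration since the example above doesn't give one.

```python
# Toy vocabulary: token text -> token ID.
# The first three IDs match the example in the text; the ID for "ming" is hypothetical.
TOY_VOCAB = {"I": 40, " love": 1842, " program": 5649, "ming": 7000}


def tokenize(text, vocab):
    """Greedy longest-match split. A simplification, not how real tokenizers work."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest possible chunk first, then shrink until one is in the vocab.
        for end in range(len(text), pos, -1):
            chunk = text[pos:end]
            if chunk in vocab:
                tokens.append(chunk)
                pos = end
                break
        else:
            raise ValueError(f"no token covers the text at position {pos}")
    return tokens


tokens = tokenize("I love programming", TOY_VOCAB)
token_ids = [TOY_VOCAB[t] for t in tokens]

print(tokens)     # ['I', ' love', ' program', 'ming']
print(token_ids)  # [40, 1842, 5649, 7000]
```

The list of integers on the last line is the only thing the model ever receives.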

Step 2 - Tokens Become Embeddings

A list of integers isn't very useful on its own. Token 40 and token 41 are consecutive numbers but they might mean completely unrelated things.

So the model converts each token ID into something called an embedding.

An embedding is a list of numbers - a vector. And not just any numbers. These numbers are learned during training and they encode meaning.

Words with similar meanings end up with similar embeddings. "King" and "Queen" have embeddings that are close to each other in this mathematical space. "Dog" and "Cat" are close. What about "Python" (the snake) and "Python" (the language)? At this stage they share the same embedding, because the lookup only sees the token. Telling them apart takes context - but we'll get to that.

The point is this. After this step, your sequence of tokens is now a sequence of vectors. Each vector is a rich numerical representation of that token's meaning.

A typical embedding has anywhere from 768 to 12,288 numbers per token, depending on model size. A large model with a long input is crunching through a genuinely enormous quantity of numbers.
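Here's a small sketch of what "close to each other" means. The 4-number vectors below are made up by hand (real embeddings are learned and much longer), and cosine similarity is one common way to measure closeness, used here as an assumption rather than something the model itself computes at this step.

```python
import math

# Hypothetical 4-dimensional embeddings, written by hand for illustration.
EMBEDDING_TABLE = {
    "king":  [0.90, 0.80, 0.10, 0.00],
    "queen": [0.85, 0.90, 0.15, 0.05],
    "dog":   [0.10, 0.00, 0.90, 0.80],
}


def embed(tokens):
    """The embedding step is just a table lookup: one vector per token."""
    return [EMBEDDING_TABLE[t] for t in tokens]


def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


king, queen, dog = embed(["king", "queen", "dog"])
sim_king_queen = cosine_similarity(king, queen)
sim_king_dog = cosine_similarity(king, dog)

print(f"king vs queen: {sim_king_queen:.2f}")
print(f"king vs dog:   {sim_king_dog:.2f}")
```

With these toy numbers, "king" lands much closer to "queen" than to "dog", which is the property the learned embeddings are meant to have.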

Step 3 - The Transformer Processes Everything

Now comes the actual brain of the LLM - the Transformer.

The Transformer is a series of layers. Each layer takes the sequence of embeddings, does some processing, and outputs a new (slightly updated) sequence of embeddings. This happens layer after layer. About a dozen layers in small models. Close to a hundred, sometimes more, in large ones.

What kind of processing? Two main things happen in each layer.

Attention. The layer looks at every token and asks: which other tokens in this sequence are relevant to understanding this one right now? It updates each token's embedding based on the answer.

Feed-forward network. After attention, each token's embedding goes through a small neural network that further transforms it. Think of this as the model "thinking harder" about each token individually.

These two steps happen in every layer. Over and over. With each layer, the embeddings become richer representations that capture more context and meaning.

By the time you're through all the layers, the embeddings are no longer just "what this token means in isolation." They now encode "what this token means given everything around it, and what's likely to come next."
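The layer-after-layer structure can be sketched roughly like this. It's a heavily simplified stand-in: the attention here has no learned weights, the feed-forward step is a fixed toy function, and details real Transformers rely on (like layer normalization) are left out. Adding each step's result back onto its input is an assumption borrowed from how real Transformers are usually built, and it's what keeps each update "slight".

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(vectors):
    """Simplified attention: each vector becomes a weighted average of all vectors."""
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed.append([sum(w * v[i] for w, v in zip(weights, vectors))
                      for i in range(len(query))])
    return mixed


def feed_forward(vector):
    """Toy per-token transformation. A real layer uses a learned neural network."""
    return [max(0.0, x) * 0.5 for x in vector]


def transformer_layer(vectors):
    # Step 1: attention mixes information across tokens, added onto the input.
    after_attention = [[x + a for x, a in zip(vec, att)]
                       for vec, att in zip(vectors, attention(vectors))]
    # Step 2: feed-forward transforms each token on its own, again added on.
    return [[x + f for x, f in zip(vec, feed_forward(vec))]
            for vec in after_attention]


NUM_LAYERS = 3  # hypothetical; real models use far more
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

hidden = embeddings
for _ in range(NUM_LAYERS):
    hidden = transformer_layer(hidden)

print(len(hidden), "vectors in,", len(hidden), "vectors out")
```

The thing to notice is the shape: three vectors go in, three vectors come out, and only their contents change from layer to layer.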

Self-Attention - The Important Part

Attention deserves its own section because it's the key idea behind modern LLMs.

Let me explain the problem it solves.

Take this sentence: "The trophy didn't fit in the suitcase because it was too big."

What does "it" refer to? The trophy. Not the suitcase.

You figured that out without thinking. But how? You processed "it" while keeping in mind "trophy", "fit", "suitcase", and "too big". You connected "it" to "trophy" because that's what makes logical sense given everything else.

That's exactly what self-attention does.

For every token in the sequence, the attention mechanism calculates a score between that token and every other token. How relevant is this other token to understanding the current one? The scores themselves are computed fresh for each input, but the weights used to compute them are learned during training.

Then it uses those scores to create a weighted average. The token's embedding gets updated based on the embeddings of all the other tokens it's paying attention to. Tokens with high scores influence it more. Tokens with low scores barely affect it.

So when processing "it", the model might have learned to strongly attend to "trophy" and "big". It pulls information from those tokens into the embedding for "it". After this, the embedding for "it" now carries context about what it's referring to.

This happens for every single token simultaneously, every single layer.

And the model doesn't just have one "version" of attention per layer. It has multiple - called attention heads. Each head learns to attend to different kinds of relationships. One head might focus on grammatical relationships. Another on factual associations. Another on logical dependencies. All running in parallel.

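That's self-attention. Every token watching the other tokens, learning what to pay attention to. (One detail: in GPT-style models that generate text left to right, a token can only look at the tokens before it, not the ones after. The idea is the same.)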
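Here's a minimal sketch of the score-then-average idea for a single token. The vectors are made up so that "it" lines up with "trophy". In a real model the query, key and value vectors come from learned projections of the embeddings, and dividing by the square root of the vector length is the usual scaling trick, included here as an assumption.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Score the query against every key, then average the values by those scores."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # weights are positive and add up to 1
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


tokens = ["trophy", "suitcase", "it"]
# Hypothetical vectors, chosen by hand for illustration.
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_for_it = [2.0, 0.0]

weights, updated_it = attend(query_for_it, keys, values)

for token, weight in zip(tokens, weights):
    print(f"'it' attends to {token!r} with weight {weight:.3f}")
```

With these numbers almost all of the weight lands on "trophy", so the updated vector for "it" ends up looking a lot like the value vector for "trophy".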

Step 4 - The Model Outputs Probabilities

After all those layers, the model takes the final embedding of the last token and does one thing.

It produces a probability distribution over its entire vocabulary.

Every possible next token gets a raw score. Those scores are then turned into probabilities that add up to 1. "Paris" might end up at 0.72. "London" might get 0.08. "the" might get 0.03. And so on for all 50,000+ tokens.

The model is essentially saying: "given everything I've processed, here's how likely each possible next token is."
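The step from raw scores to probabilities is commonly done with a function called softmax. Here's a small sketch of it. The four tokens and their raw scores are invented, and a real model does this over its whole vocabulary rather than four entries.

```python
import math


def softmax(raw_scores):
    """Turn arbitrary scores into positive numbers that add up to 1."""
    peak = max(raw_scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for four candidate next tokens.
raw_scores = {"Paris": 5.0, "London": 2.8, "the": 1.8, "banana": -3.0}

probs = dict(zip(raw_scores, softmax(list(raw_scores.values()))))

for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token:>7}: {p:.3f}")
print("total:", round(sum(probs.values()), 6))
```

Higher raw scores turn into higher probabilities, and even a very unlikely token like "banana" keeps a tiny nonzero chance.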

Then one token gets picked.

How the Model Actually Picks the Next Token

This is where it gets interesting.

The naive approach would be to always pick the token with the highest probability. Always pick the most likely next token.

But that produces very boring, repetitive text. "The cat sat on the... mat. The cat sat on the... mat." It loops.

So models use sampling. Instead of always picking the top token, the model picks randomly from the distribution - but weighted by probability. High probability tokens are more likely to get picked. Low probability tokens can still get picked, just rarely.

This introduces a parameter called temperature.

High temperature (like 1.5 or 2.0) flattens the distribution. Every token becomes more equally probable. The model gets creative and unpredictable. Sometimes brilliant. Often nonsensical.

Low temperature (like 0.1 or 0.2) sharpens the distribution. The top tokens become overwhelmingly dominant. The model becomes very focused, very predictable, very "safe."

Temperature of 0 is basically greedy - always pick the top token.

When you're asking an LLM to write code, you usually want low temperature. Precise. Predictable. When you're asking it to brainstorm ideas, higher temperature gives you more interesting variety.

Some AI products expose a creativity slider or similar setting. That's usually temperature under the hood.
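Here's a rough sketch of how temperature fits into sampling. It assumes the common approach of dividing the raw scores by the temperature before turning them into probabilities. The tokens and scores are made up, and real systems often layer other tricks on top that this skips.

```python
import math
import random


def softmax(raw_scores):
    peak = max(raw_scores)
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(raw_scores, temperature):
    """Low temperature sharpens the distribution, high temperature flattens it."""
    return softmax([s / temperature for s in raw_scores])


def sample_token(tokens, raw_scores, temperature, rng):
    if temperature == 0:
        # Greedy: always take the top-scoring token.
        return tokens[raw_scores.index(max(raw_scores))]
    probs = apply_temperature(raw_scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical candidates for "The cat sat on the ..."
tokens = ["mat", "sofa", "roof", "moon"]
raw_scores = [3.0, 2.0, 1.0, 0.0]

cold = apply_temperature(raw_scores, 0.2)
neutral = apply_temperature(raw_scores, 1.0)
hot = apply_temperature(raw_scores, 2.0)

print(f"P('mat') at T=0.2: {cold[0]:.3f}")
print(f"P('mat') at T=1.0: {neutral[0]:.3f}")
print(f"P('mat') at T=2.0: {hot[0]:.3f}")

rng = random.Random(0)  # seeded so repeated runs pick the same tokens
picks = [sample_token(tokens, raw_scores, 1.0, rng) for _ in range(10)]
print(picks)
```

The top token's share shrinks as the temperature goes up, which is exactly the "flattening" described above.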

This Is Autoregressive Generation

Here's the part that clicks everything together.

The model generates one token at a time. Pick a token. Append it to the sequence. Feed the whole sequence back in. Generate the next token. Repeat.

This is called autoregressive generation.

That's why LLMs stream text to you piece by piece. Each token is genuinely being computed one step at a time. The model isn't generating the whole response and then showing it to you. It's generating token by token and streaming each one out.

It also means the model sees its own previous outputs as input. Each new token is generated with awareness of everything that came before - including everything the model itself just said.

This is also why LLMs can "go off track." If an early token in the response is slightly wrong, all subsequent tokens are generated based on that slightly wrong context. The error can compound. It's not correcting itself in hindsight. It's always only generating the next token.
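The loop itself is simple enough to sketch. The "model" below is a made-up lookup table that only looks at the last token, which is a big simplification: a real LLM conditions on the entire sequence. The `<start>` and `<end>` markers are hypothetical names for illustration.

```python
import random

# Hypothetical stand-in for a model: last token -> next-token probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
    "mat": {"<end>": 1.0},
}


def toy_model(sequence):
    """Receives the whole sequence, like a real model, but only uses the last token."""
    return NEXT_TOKEN_PROBS[sequence[-1]]


def generate(prompt, max_new_tokens, rng):
    sequence = list(prompt)
    for _ in range(max_new_tokens):
        probs = toy_model(sequence)              # feed the whole sequence back in
        candidates = list(probs)
        next_token = rng.choices(
            candidates, weights=[probs[t] for t in candidates], k=1
        )[0]                                     # pick one token
        if next_token == "<end>":
            break                                # the model "decided" to stop
        sequence.append(next_token)              # append it and go again
    return sequence


output = generate(["<start>"], max_new_tokens=20, rng=random.Random(0))
print(" ".join(output[1:]))
```

Every pass through the loop produces exactly one token, and each new token depends on what was already appended, including the model's own earlier picks.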

Training vs Inference

One last distinction worth being clear on.

Everything we've described above - the forward pass through the Transformer, generating token probabilities, sampling the next token - that's inference. That's what happens when you send the model a message and it responds.

Training is different. During training, the model isn't generating text for a user. It's processing massive amounts of existing text and adjusting its internal parameters (all those billions of knobs) to get better at predicting the next token.

Training happens once (or periodically). It's slow and extremely expensive. GPT-4 reportedly cost more than $100 million to train.

Inference happens every time someone uses the model. It's much cheaper per use but adds up at scale. A company serving millions of users runs inference millions of times a day.

When you use Claude or ChatGPT, you're always doing inference. The model's parameters are frozen. It's not learning from your conversation in real time. It's just predicting, one token at a time, based on what it learned during training.
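To see the difference in miniature, here's a toy sketch. The "model" is just three adjustable numbers, one per token, which is an assumption made to keep the example tiny. A real model has billions of parameters spread across its layers. Training nudges the numbers so the correct next token gets more probability. Inference only reads them.

```python
import math


def softmax(raw_scores):
    peak = max(raw_scores)
    exps = [math.exp(s - peak) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(params, target_index, learning_rate):
    """One gradient-descent step on the cross-entropy loss for a single example."""
    probs = softmax(params)
    loss = -math.log(probs[target_index])  # small when the target is likely
    # Gradient of this loss for each score: probability minus 1 for the target,
    # plain probability for every other token.
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    new_params = [x - learning_rate * g for x, g in zip(params, grads)]
    return new_params, loss


vocab = ["Paris", "London", "the"]
params = [0.0, 0.0, 0.0]           # hypothetical "knobs", all equal before training
target = vocab.index("Paris")      # the token that actually came next in the text

# Training: the parameters change on every step.
losses = []
for _ in range(50):
    params, loss = training_step(params, target, learning_rate=0.5)
    losses.append(loss)

# Inference: the parameters are frozen and only read.
frozen_probs = softmax(params)

print(f"loss before: {losses[0]:.3f}, loss after: {losses[-1]:.3f}")
print(f"P('Paris') at inference: {frozen_probs[target]:.3f}")
```

The loss drops as training goes on, and at inference time the same frozen numbers are reused for every request.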

Putting It All Together

Here's the full picture from your message to the model's response.

You type a message. It gets tokenized into a list of token IDs. Each ID gets converted to an embedding. That sequence of embeddings goes through dozens of Transformer layers, where self-attention lets every token gather context from the tokens around it. The final layer produces a probability distribution over all possible next tokens. One token gets sampled. It gets appended to the sequence. The whole thing runs again. And again. Token by token, until the model produces a special end-of-sequence token or hits a length limit.

That's it.
