Attention, Explained from Scratch

Attention is the engine inside every transformer, and therefore inside every modern LLM. The math looks scary; the idea is not. It answers one question:

When processing this word, which other words should I pay attention to?

Queries, keys, values

For each token, the model produces three vectors:

Query ($Q$): what this token is looking for.
Key ($K$): what each token offers.
Value ($V$): the actual information each token carries.
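Here is a minimal sketch of where those three vectors can come from. It assumes the common setup in which each token's embedding passes through three separate learned linear projections. The names `x`, `W_q`, `W_k`, `W_v` and the sizes are illustrative choices, not something fixed by this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Illustrative sizes (assumptions): 5 tokens, 16-dim embeddings, 16-dim Q/K/V.
seq_len, embed_dim, d_k = 5, 16, 16

x = torch.randn(seq_len, embed_dim)  # stand-in for one embedding per token

# Three separate learned projections (randomly initialised here, untrained).
W_q = nn.Linear(embed_dim, d_k, bias=False)
W_k = nn.Linear(embed_dim, d_k, bias=False)
W_v = nn.Linear(embed_dim, d_k, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)  # each has one row per token
print(Q.shape, K.shape, V.shape)
```

All three are computed from the same token embeddings; only the projection weights differ, which is what lets a token ask for one thing (its query) while offering another (its key).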

You match every query against every key to get attention scores, turn those into weights, and use them to take a weighted average of the values. That's it.

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

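The $QK^\top$ term is the dot product of every query with every key, so entry $(i, j)$ measures how well token $i$'s query matches token $j$'s key (linear algebra again). Dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors, stops the scores from growing with that dimension; without it, large scores push the softmax toward a near one-hot output with very small gradients. The softmax is applied row by row, turning each token's scores into non-negative weights that sum to 1.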
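To see why the $\sqrt{d_k}$ matters, here is a small experiment. It is a sketch under one stated assumption: the query and key entries are independent standard-normal numbers, which real trained models only approximate. Under that assumption the spread of the raw dot products grows roughly like $\sqrt{d_k}$, while the scaled ones stay near 1.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so runs are repeatable

# Assumption: entries of q and k are independent standard-normal samples.
stds = {}
for d_k in (4, 64, 1024):
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    raw = (q * k).sum(dim=-1)   # 10,000 independent dot products
    scaled = raw / d_k**0.5     # the same scores after dividing by sqrt(d_k)
    stds[d_k] = (raw.std().item(), scaled.std().item())
    print(f"d_k={d_k:5d}  raw std={stds[d_k][0]:7.2f}  scaled std={stds[d_k][1]:.2f}")

# Effect on the softmax: one query against 8 keys at d_k = 1024.
d_k = 1024
query = torch.randn(d_k)
keys = torch.randn(8, d_k)
raw_scores = keys @ query
w_raw = F.softmax(raw_scores, dim=-1)
w_scaled = F.softmax(raw_scores / d_k**0.5, dim=-1)
print("largest weight, unscaled:", w_raw.max().item())
print("largest weight, scaled:  ", w_scaled.max().item())
```

The unscaled weights tend to pile almost entirely onto a single key, while the scaled weights stay spread across several, which is the "stable" behaviour the formula is after.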

In code

import torch
import torch.nn.functional as F
 
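def attention(Q, K, V):
    """Scaled dot-product attention; Q, K: (seq, d_k), V: (seq, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (seq, seq) query-key scores
    weights = F.softmax(scores, dim=-1)          # who attends to whom; rows sum to 1
    return weights @ V                           # weighted average of values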
 
# toy example: 5 tokens, 16-dim
Q = K = V = torch.randn(5, 16)
print(attention(Q, K, V).shape)  # torch.Size([5, 16])
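As a sanity check, the sketch below compares the same computation against PyTorch's built-in `F.scaled_dot_product_attention`. It assumes PyTorch 2.0 or newer, where that function is available. The helper `attention_with_weights` is a hypothetical variant of the function above that also returns the weight matrix so it can be inspected.

```python
import torch
import torch.nn.functional as F

def attention_with_weights(Q, K, V):
    # Same computation as the post's attention(), but also returns the weights.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
# Independent random Q, K, V for 5 tokens with 16 dimensions each.
Q, K, V = (torch.randn(5, 16) for _ in range(3))

out, weights = attention_with_weights(Q, K, V)
print(weights.shape)        # one row of weights per token
print(weights.sum(dim=-1))  # each row sums to 1, up to rounding

# The built-in is called here with (batch, heads, seq, dim) inputs,
# so add two leading dimensions of size 1 and strip them afterwards.
ref = F.scaled_dot_product_attention(Q[None, None], K[None, None], V[None, None])
matches = torch.allclose(out, ref[0, 0], atol=1e-5)
print("matches built-in:", matches)
```

In practice you would call the built-in directly, since it can dispatch to fused kernels; the hand-written version is only there to make the formula visible.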

Why it changed everything

Before attention, the dominant sequence models were recurrent networks, which read text one token at a time and had to squeeze everything seen so far into a fixed-size hidden state, so they struggled to connect distant words. Attention lets every token see every other token in one step, so "it" can link directly back to the noun it refers to, ten words earlier. Scale that up, stack it deep, and you get the models powering today's AI.

In a future post we'll add the multi-head twist and positional encodings.
