Attention, Explained from Scratch
Jun 23, 2026 · 2 min read
Attention is the engine inside every transformer, and therefore inside nearly every modern LLM. The math looks scary; the idea is not. It answers one question:
When processing this word, which other words should I pay attention to?
Queries, keys, values
For each token, the model produces three vectors:
- Query ($Q$): what this token is looking for.
- Key ($K$): what each token offers.
- Value ($V$): the actual information each token carries.
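The post doesn't say how those three vectors are produced. In the standard transformer design, each one is a learned linear projection of the token's embedding. Here is a minimal sketch under that assumption; the names `W_q`, `W_k`, `W_v` and all the sizes are illustrative, not taken from this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions): 5 tokens, 16-dim embeddings, 16-dim Q/K/V.
seq_len, d_model, d_k = 5, 16, 16

# One learned linear map per role. The names W_q, W_k, W_v are hypothetical.
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(seq_len, d_model)   # random stand-in for real token embeddings

# Row i of each result is token i's query, key, or value vector.
Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)    # each is (5, 16)
```

The weights here are untrained, so the vectors are meaningless; in a real model, training shapes these projections so that useful queries line up with useful keys.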
You match every query against every key to get attention scores, turn those into weights, and use them to take a weighted average of the values. That's it.
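$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
The $QK^\top$ term is just a similarity between every pair of tokens (a dot product; linear algebra again). The $\sqrt{d_k}$ divisor keeps the numbers stable. The softmax turns each token's scores into weights that sum to 1.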
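To make the scoring, scaling, and softmax steps concrete, here is a small worked example in plain Python for a single query against three keys. The vectors are hand-picked for easy arithmetic, and the `softmax` helper is written out for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)                           # subtract the max for numerical safety
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 4
query = [1.0, 0.0, 1.0, 0.0]
keys = [
    [1.0, 0.0, 1.0, 0.0],   # same direction as the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
    [0.5, 0.5, 0.5, 0.5],   # partial overlap
]

# Dot product of the query with each key: the raw similarity scores.
raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]   # divide by sqrt(d_k)

weights = softmax(scaled)    # attention weights for this one query
unscaled = softmax(raw)      # same scores without the divisor, for comparison

print(raw)      # [2.0, 0.0, 1.0]
print(scaled)   # [1.0, 0.0, 0.5]
print([round(w, 3) for w in weights])
print([round(w, 3) for w in unscaled])
```

The matching key gets the largest weight and the orthogonal key the smallest. Without the divisor, the top weight is noticeably larger: bigger raw scores push the softmax toward putting almost all of its weight on one token, which is the instability the scaling guards against.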
In code
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (seq, seq)
    weights = F.softmax(scores, dim=-1)          # who attends to whom
    return weights @ V                           # weighted sum of values

# toy example: 5 tokens, 16-dim
Q = K = V = torch.randn(5, 16)
print(attention(Q, K, V).shape)  # torch.Size([5, 16])

Why it changed everything
Before attention, recurrent models read text one token at a time, passing information along step by step, so they struggled to connect distant words. Attention lets every token see every other token in one step, so "it" can link straight back to the noun it refers to, ten words earlier. Scale that up, stack it deep, and you get the models powering today's AI.
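Here is a toy illustration of that "ten words earlier" link. It is a hand-planted sketch, not a trained model: the query at position 10 (standing in for "it") and the key at position 0 (standing in for the noun) are deliberately set to point the same way, and the helper name `attention_weights` is made up for this example.

```python
import torch
import torch.nn.functional as F

def attention_weights(Q, K):
    # Scaled dot-product scores, then a softmax over each row.
    d_k = Q.size(-1)
    return F.softmax(Q @ K.transpose(-2, -1) / d_k**0.5, dim=-1)

torch.manual_seed(0)
seq_len, d = 12, 8

# Small random queries and keys play the role of unrelated filler tokens.
Q = 0.1 * torch.randn(seq_len, d)
K = 0.1 * torch.randn(seq_len, d)

# Plant a match by hand: noun at position 0, pronoun at position 10.
direction = torch.zeros(d)
direction[0] = 1.0
K[0] = 8.0 * direction    # what the noun "offers"
Q[10] = 8.0 * direction   # what the pronoun is "looking for"

w = attention_weights(Q, K)
print(w.shape)                 # torch.Size([12, 12]): every token scores every token
print(w[10].argmax().item())   # 0: position 10 attends most to position 0
```

The link takes a single matrix multiply, no matter how many tokens sit between the two positions.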
In a future post we'll add the multi-head twist and positional encodings.