Inside AI Models
Transformers · Deep Learning

Attention, Explained from Scratch

Jun 23, 2026 · 2 min read

Attention is the engine inside every transformer, and therefore inside nearly every modern LLM. The math looks scary; the idea is not. It answers one question:

When processing this word, which other words should I pay attention to?

Queries, keys, values

For each token, the model produces three vectors:

  • Query ($Q$): what this token is looking for.
  • Key ($K$): what each token offers.
  • Value ($V$): the actual information each token carries.
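
The list above does not say how these three vectors are produced. In the standard transformer formulation they come from three separate learned linear projections of the same token embedding. A minimal sketch, assuming that setup; the sizes and the names `d_model`, `to_q`, `to_k`, and `to_v` are illustrative, not from this article:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_k = 16, 16   # illustrative sizes (assumption)
seq_len = 5

# Hypothetical projection layers: one learned matrix per role.
to_q = nn.Linear(d_model, d_k, bias=False)
to_k = nn.Linear(d_model, d_k, bias=False)
to_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(seq_len, d_model)   # one embedding per token

# The same input is projected three different ways.
Q, K, V = to_q(x), to_k(x), to_v(x)
print(Q.shape, K.shape, V.shape)    # each has shape (seq_len, d_k)
```

The toy example in the code section skips this step and reuses one random tensor for all three roles.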

You match every query against every key to get attention scores, turn those into weights, and use them to take a weighted average of the values. That's it.
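
To make those three steps concrete, here is a tiny hand-checkable sketch in plain Python for a single query against three tokens. The vectors are made up for illustration, and the scaling by the square root of the key dimension, which the full formula includes, is left out to keep the arithmetic simple:

```python
import math

# Made-up 2-dim vectors for illustration (not from the article).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Step 1: match the query against every key -> one score per token.
scores = [dot(query, k) for k in keys]          # [1.0, 0.0, 1.0]

# Step 2: softmax turns scores into positive weights that sum to 1.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# Step 3: weighted average of the value vectors.
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]

print([round(w, 3) for w in weights])   # [0.422, 0.155, 0.422]
print([round(o, 3) for o in output])    # [6.335, 3.665]
```

The second token's key does not match the query, so it gets the smallest weight and its value contributes least to the output.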

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

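The $QK^\top$ term is just a similarity score between every pair of tokens (a dot product, so linear algebra again). Dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the query and key vectors, keeps the scores from growing with the vector size and saturating the softmax. The softmax turns each row of scores into weights that sum to 1.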
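
One way to see why the scaling matters: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance equal to the vector dimension, so raw scores spread out as the dimension grows. The sketch below checks this empirically in plain Python. The Gaussian inputs, the sample count, and the helper name `score_std` are assumptions chosen for illustration:

```python
import math
import random
import statistics

random.seed(0)

def score_std(d_k, scaled, n_samples=1000):
    """Empirical std of q.k for random unit-variance vectors of size d_k."""
    scores = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d_k) scaling from the formula.
        scores.append(s / math.sqrt(d_k) if scaled else s)
    return statistics.pstdev(scores)

for d_k in (4, 64, 256):
    raw = score_std(d_k, scaled=False)
    fixed = score_std(d_k, scaled=True)
    print(f"d_k={d_k:3d}  raw std={raw:6.2f}  scaled std={fixed:5.2f}")
```

Under these assumptions the raw spread should grow roughly like the square root of the dimension, while the scaled spread should stay near 1 regardless of size.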

In code

import torch
import torch.nn.functional as F
 
def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5  # (seq, seq)
    weights = F.softmax(scores, dim=-1)          # who attends to whom
    return weights @ V                           # weighted sum of values
 
# toy example: 5 tokens, 16-dim
Q = K = V = torch.randn(5, 16)
print(attention(Q, K, V).shape)  # torch.Size([5, 16])
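
As a sanity check, the sketch below uses a variant of the function above that also returns the weights, confirms that each row of weights sums to 1, and compares the output against PyTorch's built-in `F.scaled_dot_product_attention` (available in PyTorch 2.0 and later). The batch dimension, the separate random tensors, and the tolerance are illustrative choices, not part of the original example:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k**0.5
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights   # also return the weights for inspection

torch.manual_seed(0)
# Shape (batch, seq, dim) with a batch of 1; Q, K, V are distinct tensors.
Q, K, V = (torch.randn(1, 5, 16) for _ in range(3))

out, weights = attention(Q, K, V)
row_sums = weights.sum(dim=-1)                    # one sum per query token
builtin = F.scaled_dot_product_attention(Q, K, V)

print(torch.allclose(row_sums, torch.ones_like(row_sums)))  # rows sum to 1?
print(torch.allclose(out, builtin, atol=1e-5))              # matches built-in?
```

In practice the built-in is usually the better choice, since it can dispatch to fused kernels; the hand-written version is mainly useful for seeing the steps.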

Why it changed everything

Before transformers, the dominant sequence models were recurrent: they read text one token at a time and struggled to connect distant words, because information had to survive every step in between. Attention lets every token see every other token in one step, so "it" can link directly back to the noun it refers to, ten words earlier. Scale that up, stack it deep, and you get the models powering today's AI.

In a future post we'll add the multi-head twist and positional encodings.
