Building GPT From Scratch: A Step-by-Step nano-gpt Walkthrough
Jun 27, 2026 · 12 min read
Large language models look impossibly complex from the outside, but their core is surprisingly small. To really understand it, I rebuilt one from scratch: nano-gpt, a character-level, decoder-only Transformer written in plain PyTorch, following Andrej Karpathy's "Let's build GPT" lecture. Apart from scale, the mechanism is identical to the models powering today's AI:
token + position embedding → causal self-attention → multi-head → residual blocks → autoregressive generation
This post is the full walkthrough — every step, with the actual code from the notebook. On top of the lecture baseline I added a cosine-annealing learning-rate schedule, early stopping, and live Weights & Biases tracking — the "HERO RUN v4" config you'll see below.
The big picture
The whole pipeline goes from raw text to freshly generated text. Each stage maps to a function we'll build:
| Stage | Function | What it does |
|---|---|---|
| 1. Download | download_dataset() | Pulls ~1.1 MB of Tiny Shakespeare. |
| 2. Tokenize | build_tokenizer() | Maps each character to an integer. |
| 3. Split | make_splits() | Text → one long tensor, 90% train / 10% val. |
| 4. Batch | get_batch() | Random (context, target) pairs. |
| 5. Model | BigramLanguageModel | Embedding → N× Block → LayerNorm → lm_head. |
| 6. Generate | model.generate() | Autoregressive sampling, character by character. |
Step 1 — Data and a tiny tokenizer
We train on Tiny Shakespeare, and we tokenize at the character level. No BPE, no tiktoken — just a bidirectional map between every unique character and an integer. The vocabulary ends up around 65 symbols, which keeps the embedding table small and makes everything easy to debug. The trade-off is longer sequences, since every character is its own token.
```python
def build_tokenizer(text: str):
    chars = sorted(set(text))
    vocab_size = len(chars)
    stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
    itos = {i: ch for i, ch in enumerate(chars)}  # int -> string

    def encode(s: str) -> list[int]:
        return [stoi[c] for c in s]

    def decode(ids: list[int]) -> str:
        return "".join(itos[i] for i in ids)

    return chars, vocab_size, encode, decode
```

The whole corpus then becomes a single integer tensor, split 90/10:
```python
import torch

def make_splits(text: str, encode):
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9 * len(data))  # first 90% for training, the rest for validation
    return data[:n], data[n:]
```

The core idea to hold onto: a token is just an integer, and decode turns
integers back into text.
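To make the round trip concrete, here is a minimal sketch that builds the same kind of character map on a two-line stand-in corpus (the sample text is only for illustration, not the real dataset) and encodes and decodes one word. It also shows the flip side of a closed character vocabulary: a character that never appears in the corpus simply has no id.

```python
# Illustrative stand-in corpus; the real run uses the full Tiny Shakespeare text.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

chars = sorted(set(sample_text))              # the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}  # string -> int
itos = {i: ch for i, ch in enumerate(chars)}  # int -> string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

ids = encode("First")
roundtrip = decode(ids)
print(ids, "->", repr(roundtrip))

# A character that never occurs in the corpus has no id at all.
try:
    encode("Z")
    unknown_fails = False
except KeyError:
    unknown_fails = True
print("unknown character rejected:", unknown_fails)
```

On the full corpus this is rarely a problem, since every character the model can ever see or emit is already in the table.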
Step 2 — Batches, and the shift-by-one trick
Training data is pulled as (B, T) batches: B independent sequences, each T
characters long. The target y is the same window moved one character ahead in the
data, so y[t] is the character that follows x[t]. That is exactly the model's job:
predict the next character for each context.
```python
def get_batch(split, train_data, val_data, block_size, batch_size, device="cpu"):
    data = train_data if split == "train" else val_data
    # random start offsets; the upper bound leaves room for the shifted target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)
```

```
x :  F  i  r  s  t  ␣  C
     ↓  ↓  ↓  ↓  ↓  ↓  ↓
y :  i  r  s  t  ␣  C  i
```

(␣ marks the space in "First Citizen".)
A single block_size-long slice secretly contains T separate (context, target)
examples: predicting i from F, r from Fi, s from Fir, and so on. This is
exactly the source of the Transformer's parallel-learning efficiency — one slice,
many lessons.
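The "one slice, many lessons" idea can be spelled out on the opening characters of the corpus. This is a plain-Python sketch on a string rather than a tensor, with a block_size of 7 chosen only to keep the printout short:

```python
text = "First Citizen"
block_size = 7  # illustrative; the actual run uses 256

x = text[:block_size]         # "First C"
y = text[1:block_size + 1]    # "irst Ci"

# Position t contributes one training example: everything up to t -> y[t].
pairs = [(x[:t + 1], y[t]) for t in range(block_size)]
for context, target in pairs:
    print(f"{context!r:>10} -> {target!r}")
```

Each printed row is one of the T predictions the model is graded on in a single forward pass; the causal mask introduced in the next step is what keeps the row for context "Fir" from peeking at "s".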
Step 3 — A single self-attention head
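This is the heart of it. Each token produces three projections of itself: Query (what am I looking for?), Key (what do I offer?), and Value (what do I actually carry?). We score every query against every key, scale, hide the future with a causal mask, softmax into weights, and take a weighted sum of the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

Here $d_k$ is the head size and $M$ is the causal mask: $0$ where attention is allowed and $-\infty$ on future positions.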
The causal mask is what makes this a decoder: a token may attend only to itself and the past, never the future.
```
      attended →
      t0 t1 t2 t3 t4
t0 [  1  0  0  0  0 ]
t1 [  1  1  0  0  0 ]
t2 [  1  1  1  0  0 ]
t3 [  1  1  1  1  0 ]
t4 [  1  1  1  1  1 ]
```
In code, the mask is a lower-triangular buffer; we fill the upper triangle with
-inf so softmax sends those weights to zero:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """A single self-attention head."""

    def __init__(self, n_embed, head_size, dropout_rate, block_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        # lower-triangular mask; a buffer moves with the module but is not trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.head_size = head_size
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * (self.head_size ** -0.5)      # scaled scores, (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)  # (B, T, head_size)
        return wei @ v     # weighted sum, (B, T, head_size)
```

The head_size ** -0.5 scaling keeps the variance of the dot products from growing
with head_size, which would otherwise push softmax into saturation.
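To see the score, scale, mask, softmax, weighted-sum sequence without any tensor machinery, here is a small plain-Python sketch of one head on three tokens. It is not the notebook's code: the learned key/query/value projections and dropout are left out, and the q, k, v vectors are made-up values chosen so the result can be checked by eye.

```python
import math

def softmax(row):
    # math.exp(-inf) is 0.0, so masked positions end up with zero weight
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Scaled, causally masked attention for one head; q, k, v are lists of T vectors."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    weights, out = [], []
    for t in range(T):
        # token t may only score positions s <= t; the future is set to -inf
        scores = [
            scale * sum(a * b for a, b in zip(q[t], k[s])) if s <= t else float("-inf")
            for s in range(T)
        ]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w[s] * v[s][d] for s in range(T)) for d in range(len(v[0]))])
    return out, weights

# Hypothetical toy vectors (T = 3 tokens, head_size = 2).
q = k = v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(q, k, v)
for t, row in enumerate(weights):
    print(f"t{t}:", [round(w, 3) for w in row])
```

The first row is always [1, 0, 0]: the first token has nothing but itself to attend to, so its output is just its own value vector. Every later row spreads its weight over itself and the past only.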
Step 4 — Multi-head attention
One head learns one kind of relationship. Running several in parallel lets the model attend to different patterns at once; we concatenate their outputs and project back to the model dimension:
```python
class MultiHeadAttention(nn.Module):
    """Several self-attention heads in parallel."""

    def __init__(self, n_embed, num_heads, head_size, dropout_rate, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embed, head_size, dropout_rate, block_size) for _ in range(num_heads)]
        )
        # assumes num_heads * head_size == n_embed
        self.proj = nn.Linear(n_embed, n_embed)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, num_heads * head_size)
        return self.dropout(self.proj(out))
```

Step 5 — The Transformer block
A block has two sub-layers: multi-head attention (the tokens communicate) and a feed-forward network (each token computes). Both use pre-LayerNorm and a residual connection — the pattern that makes deep Transformers trainable:
x = x + Dropout( MultiHeadAttention( LayerNorm(x) ) )
x = x + Dropout( FeedForward( LayerNorm(x) ) )
```python
class Block(nn.Module):
    """Transformer block: communication (attention) + computation (FFN)."""

    def __init__(self, n_embed, n_head, dropout_rate, block_size):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_embed, n_head, head_size, dropout_rate, block_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout_rate),
        )
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)

    def forward(self, x):
        # pre-LN residual; sa and ffwd already end in dropout, so dropout1/dropout2
        # apply a second dropout on top of it
        x = x + self.dropout1(self.sa(self.ln1(x)))
        x = x + self.dropout2(self.ffwd(self.ln2(x)))
        return x
```

The feed-forward network expands to 4 × n_embed and back. This is where most of
the model's non-linear "thinking" capacity lives.
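The pre-LayerNorm residual pattern is easy to lose inside the class, so here is a plain-Python sketch of it on a single token vector. It is a simplification under stated assumptions: the layer norm has no learned gain or bias, dropout is omitted, and toy_sublayer is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """x + sublayer(LayerNorm(x)): the sub-layer sees normalized input, the skip path does not."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(h):
    # hypothetical stand-in for attention or the FFN
    return [0.1 * v for v in h]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
y = pre_ln_residual(x, toy_sublayer)
print("normalized:", [round(v, 3) for v in normed])
print("block out :", [round(v, 3) for v in y])
```

Because the input x is added back untouched, a sub-layer that outputs zeros leaves the token unchanged. That identity path is what lets gradients reach the early blocks of a deep stack.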
Step 6 — Assembling the model
Now we stack the pieces. Token embeddings say what a token is; position embeddings
say where it sits. Their sum flows through n_layer blocks, a final LayerNorm, and
a linear head that projects to vocabulary-sized logits:
```python
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed, n_head, n_layer, dropout_rate, block_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.tok_pos_dropout = nn.Dropout(dropout_rate)
        self.blocks = nn.Sequential(
            *[Block(n_embed, n_head, dropout_rate, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.block_size = block_size

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        # use idx.device rather than a global so the model follows its input
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = self.tok_pos_dropout(tok_emb + pos_emb)  # pos_emb broadcasts over the batch
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab)
        if targets is None:
            return logits, None
        # CrossEntropy expects (B*T, vocab) and (B*T)
        B, T, C_vocab = logits.shape
        logits = logits.view(B * T, C_vocab)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss
```

The class is still called BigramLanguageModel, a leftover from the starting point of Karpathy's lecture. The code here is a full decoder-only Transformer; the bigram is just the bare baseline it grew out of (more on why that baseline is limited at the end).
Step 7 — Generating text
Generation is autoregressive: at each step we crop the context to the last
block_size tokens (because that's all the position embeddings cover), take the
logits at the final position, softmax into a distribution, sample one token, and
append it:
```python
    # method of BigramLanguageModel
    @torch.no_grad()  # no gradients are needed while sampling
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]  # crop to context window
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # focus on the last step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```

```python
# call model.eval() first so dropout is off while sampling
context = torch.zeros((1, 1), dtype=torch.long, device=DEVICE)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
```

Step 8 — Training
The training loop is the classic four lines — forward, backward, step, schedule —
wrapped with evaluation, early stopping, and logging. I used AdamW, a
cosine-annealing schedule that decays the learning rate smoothly from 5e-4 to
1e-5, and Weights & Biases for live curves. estimate_loss averages
train/val loss in eval mode every 500 steps, and early stopping halts the run if
validation loss stalls for five evaluations.
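The two additions on top of the lecture baseline can each be sketched in a few lines of plain Python. The schedule below is the closed form of a single cosine decay, which is what PyTorch documents for CosineAnnealingLR; that the notebook uses that exact scheduler class is my assumption, and TOTAL_STEPS is a hypothetical budget since the step count is not stated here. The stopped_at helper and the validation losses fed to it are made up for illustration.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=1e-5):
    """Learning rate after `step` scheduler steps of a single cosine decay."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

def stopped_at(val_losses, patience=5):
    """Index of the evaluation at which early stopping triggers, or None."""
    best, bad = float("inf"), 0
    for i, val in enumerate(val_losses):
        if val < best:
            best, bad = val, 0
        else:
            bad += 1
            if bad >= patience:
                return i
    return None

TOTAL_STEPS = 5000  # hypothetical; the post does not state the step budget
schedule = [cosine_lr(s, TOTAL_STEPS) for s in (0, 2500, 5000)]
print("lr at start / middle / end:", schedule)

# Made-up validation curve: improves for three evaluations, then stalls.
history = [2.0, 1.8, 1.7, 1.71, 1.72, 1.70, 1.73, 1.75, 1.9]
stop_index = stopped_at(history)
print("stops at evaluation", stop_index)
```

Note that an evaluation which merely ties the best loss (the 1.70 above) still counts against patience, because the comparison is a strict less-than, the same as in the training loop.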
```python
for step in range(cfg.epochs):  # cfg.epochs counts optimizer steps, one batch each
    if step % EVAL_INTERVAL == 0 or step == cfg.epochs - 1:
        losses = estimate_loss(model, train_data, val_data, encode,
                               cfg.block_size, cfg.batch_size)
        val = losses["val"].item()
        wandb.log({"train_loss": losses["train"], "val_loss": val,
                   "learning_rate": scheduler.get_last_lr()[0]}, step=step)
        # early stopping
        if val < best_val_loss:
            best_val_loss, patience_counter = val, 0
        else:
            patience_counter += 1
        if patience_counter >= patience:
            break

    xb, yb = get_batch("train", train_data, val_data,
                       cfg.block_size, cfg.batch_size, device=DEVICE)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
```

The HERO RUN v4 configuration:
| Parameter | Value |
|---|---|
| block_size | 256 |
| batch_size | 64 |
| n_embed | 256 |
| n_head | 8 |
| n_layer | 6 |
| dropout_rate | 0.2 |
| lr | 5e-4 |
| weight_decay | 1e-3 |
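As a sanity check on model size, the parameter count implied by this configuration can be tallied by hand. The sketch below assumes a vocabulary of exactly 65 characters (the post only says "around 65") and PyTorch's default layer shapes as used in the code above: bias-free key/query/value projections, biased nn.Linear layers everywhere else, and a weight plus bias vector per LayerNorm. The count_parameters helper is my own, not part of the notebook.

```python
def count_parameters(vocab_size, block_size, n_embed, n_head, n_layer):
    head_size = n_embed // n_head
    # attention: key/query/value per head (no bias) + output projection (with bias)
    attn = n_head * 3 * n_embed * head_size + (n_embed * n_embed + n_embed)
    # feed-forward: two Linear layers with bias, expanding to 4 * n_embed and back
    ffn = (n_embed * 4 * n_embed + 4 * n_embed) + (4 * n_embed * n_embed + n_embed)
    layer_norms = 2 * (2 * n_embed)  # weight + bias for ln1 and ln2
    block = attn + ffn + layer_norms
    embeddings = vocab_size * n_embed + block_size * n_embed  # token + position tables
    final = 2 * n_embed + (n_embed * vocab_size + vocab_size)  # ln_f + lm_head
    return {
        "attention_per_block": attn,
        "ffn_per_block": ffn,
        "total": n_layer * block + embeddings + final,
    }

counts = count_parameters(vocab_size=65, block_size=256, n_embed=256, n_head=8, n_layer=6)
print(counts)
print(f"total: {counts['total'] / 1e6:.2f}M parameters")
```

Under those assumptions the total comes to about 4.8M parameters, and each block's feed-forward network holds roughly twice as many weights as its attention layer.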
Results
Cross-Entropy Loss
The loss function we train against is cross-entropy. Here's the formula:

For a single example, where $p$ is the probability the model assigns to the correct class (token):

$$\mathcal{L} = -\log p$$

More generally, with the true distribution $y$ (one-hot) and the model's prediction $\hat{y}$:

$$\mathcal{L} = -\sum_{i=1}^{V} y_i \log \hat{y}_i$$

where $V$ is the vocabulary size and $\hat{y}_i$ is the probability assigned to token $i$ after the softmax.

How does it work? At every step the model produces a probability distribution over the entire vocabulary for the next token (the softmax output). Cross-entropy looks at how much probability the model placed on the token that actually came next and takes its negative logarithm. If the model gave the correct token a high probability ($p \to 1$), then $-\log p \to 0$ and the loss shrinks; if it gave it a low probability ($p \to 0$), then $-\log p \to \infty$ and the loss becomes a large penalty. In other words, the function punishes the model heavily for predictions it was "confident about but wrong on."

Why do we use it? Because predicting the next token is really a classification problem: at each position we're trying to pick the right token out of $V$ candidates. Cross-entropy is the natural loss for this kind of probabilistic classification: it pulls the model's predicted distribution toward the true distribution (equivalent to maximum likelihood estimation). It also has clean, stable gradients: combined with softmax, its derivative with respect to the logits simplifies to just $\hat{y} - y$, which lets gradient descent work efficiently. In PyTorch it's computed in a single line, F.cross_entropy(logits, targets), which fuses the softmax + log + negative-log-likelihood steps into one numerically stable operation.
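A few lines of plain Python are enough to work one example by hand. This is a sketch with made-up logits over a three-token vocabulary, not output from the trained model. It computes the loss, the softmax-minus-one-hot gradient, and the loss of a model that spreads its probability evenly over 65 characters.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """-log of the softmax probability assigned to the target index."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.5, -1.0]  # hypothetical scores over a 3-token vocabulary
target = 0

probs = softmax(logits)
loss = cross_entropy(logits, target)

# gradient of the loss w.r.t. the logits: softmax output minus the one-hot target
one_hot = [1.0 if i == target else 0.0 for i in range(len(logits))]
grad = [p - t for p, t in zip(probs, one_hot)]

# a model that is uniform over 65 characters scores ln(65), about 4.17
uniform_loss = cross_entropy([0.0] * 65, target=0)

print("probs:", [round(p, 3) for p in probs])
print("loss:", round(loss, 3), "grad:", [round(g, 3) for g in grad])
print("uniform over 65:", round(uniform_loss, 2))
```

The uniform case is a useful reference point: an untrained character model with a 65-symbol vocabulary should start near ln 65, and anything the model learns shows up as loss below that.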
Validation loss is cross-entropy (negative log-likelihood) — lower is better. The reference figures for this character-level Tiny Shakespeare setup:
Validation loss (lower is better)
Random init ████████████████████████████████████ 4.17
Pure bigram ██████████████████████ 2.50
This Transformer █████████████ 1.50
Qualitatively, a pure bigram produces letter-soup that respects single-character
frequencies but never forms real words. This Transformer produces
Shakespeare-looking text: plausible words, NAME: dialogue structure, line breaks
and punctuation. It's stylistically convincing but semantically nonsensical, which
is what you'd expect from a character model this small (about 4.8M parameters for
the configuration above, assuming a 65-character vocabulary).
Why the bigram baseline hits a wall — and how attention breaks through
Understanding the baseline's ceiling is the entire motivation for attention. A true
bigram models P(next char | previous char): its context window is exactly one
token, it has no notion of position, and it's structurally just a vocab × vocab
lookup table with no composition. So its loss plateaus around 2.5 no matter how long
you train.
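The "lookup table" description can be taken literally. Below is a count-based sketch of a bigram model in plain Python, fitted and scored on a made-up sentence rather than Tiny Shakespeare, so the loss it prints is not comparable to the 2.5 figure. A bigram trained by gradient descent on cross-entropy is pushed toward the same conditional frequencies this table computes directly.

```python
import math
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count-based bigram: for each previous character, the distribution over the next one."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {ch: c / sum(nxts.values()) for ch, c in nxts.items()}
        for prev, nxts in counts.items()
    }

def bigram_loss(table, text):
    """Average cross-entropy (nats per character) of the bigram table on `text`."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log(table[prev][nxt]) for prev, nxt in pairs) / len(pairs)

# Illustrative sample text, scored on the same text it was fitted on.
sample = "the quick brown fox jumps over the lazy dog. the dog sleeps. the fox runs."
table = fit_bigram(sample)
loss = bigram_loss(table, sample)
uniform = math.log(len(set(sample)))

print("after 't':", {ch: round(p, 2) for ch, p in table["t"].items()})
print(f"bigram loss {loss:.2f} vs uniform {uniform:.2f}")
```

Nothing in the table can express "the last three characters were `the`": each row only knows the single character before it, which is the ceiling the additions below are designed to lift.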
Every addition removes one of those limits:
| Addition | Limitation it removes |
|---|---|
| Self-attention | context up to block_size (256) instead of 1 |
| Position embeddings | order and position become meaningful |
| Multi-head attention | several relationship types learned in parallel |
| Residual blocks + FFN | depth and non-linear, compositional computation |
Together they drop validation loss from ~2.5 to ~1.5 and turn letter-soup into plausible, Shakespeare-style text, which is the whole leap this project is meant to demonstrate.
Try it yourself
The full implementation lives in a single, heavily-commented notebook:
- Repo: github.com/gocenalper/nano-gpt
- Run it in the browser: Open in Colab
If you've ever wanted to demystify "GPT," there's no better way than building the smallest possible one yourself — and watching the loss curve drop as each piece clicks into place.
References
The implementation and explanations in this post lean heavily on the following resources:
- Andrej Karpathy — "Let's build GPT: from scratch, in code, spelled out" (YouTube): youtube.com/watch?v=kCc8FmEb1nY
- Andrej Karpathy — nanoGPT repository: github.com/karpathy/nanoGPT
- Andrej Karpathy — minGPT repository: github.com/karpathy/minGPT
- Vaswani et al. — "Attention Is All You Need" (2017): arxiv.org/abs/1706.03762
- Radford et al. — "Language Models are Unsupervised Multitask Learners" (GPT-2): cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Jay Alammar — "The Illustrated Transformer": jalammar.github.io/illustrated-transformer
- Ba, Kiros & Hinton — "Layer Normalization" (2016): arxiv.org/abs/1607.06450
- The Tiny Shakespeare dataset: github.com/karpathy/char-rnn