Inside AI Models

Building GPT From Scratch: A Step-by-Step nano-gpt Walkthrough

Jun 27, 2026 · 12 min read

Large language models look impossibly complex from the outside, but their core is surprisingly small. To really understand it, I rebuilt one from scratch: nano-gpt, a character-level, decoder-only Transformer written in plain PyTorch, following Andrej Karpathy's "Let's build GPT" lecture. Apart from scale, the core mechanism is the same one used by the models powering today's AI:

token + position embedding → causal self-attention → multi-head → residual blocks → autoregressive generation

This post is the full walkthrough — every step, with the actual code from the notebook. On top of the lecture baseline I added a cosine-annealing learning-rate schedule, early stopping, and live Weights & Biases tracking — the "HERO RUN v4" config you'll see below.

The big picture

The whole pipeline goes from raw text to freshly generated text. Each stage maps to a function we'll build:

Stage        Function              What it does
1. Download  download_dataset()    Pulls ~1.1 MB of Tiny Shakespeare.
2. Tokenize  build_tokenizer()     Maps each character to an integer.
3. Split     make_splits()         Text → one long tensor, 90% train / 10% val.
4. Batch     get_batch()           Random (context, target) pairs.
5. Model     BigramLanguageModel   Embedding → N× Block → LayerNorm → lm_head.
6. Generate  model.generate()      Autoregressive sampling, character by character.

Step 1 — Data and a tiny tokenizer

We train on Tiny Shakespeare, and we tokenize at the character level. No BPE, no tiktoken — just a bidirectional map between every unique character and an integer. The vocabulary ends up around 65 symbols, which keeps the embedding table small and makes everything easy to debug. The trade-off is longer sequences, since every character is its own token.

def build_tokenizer(text: str):
    chars = sorted(set(text))
    vocab_size = len(chars)
    stoi = {ch: i for i, ch in enumerate(chars)}   # string -> int
    itos = {i: ch for i, ch in enumerate(chars)}   # int -> string
 
    def encode(s: str) -> list[int]:
        return [stoi[c] for c in s]
 
    def decode(ids: list[int]) -> str:
        return "".join(itos[i] for i in ids)
 
    return chars, vocab_size, encode, decode

The whole corpus then becomes a single integer tensor, split 90/10:

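import torch

def make_splits(text: str, encode):
    # Encode the whole corpus once into a 1-D tensor of token ids.
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9 * len(data))   # first 90% -> train, last 10% -> val
    return data[:n], data[n:]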

The core idea to hold onto: a token is just an integer, and decode turns integers back into text.
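
To make the round trip concrete, here is a small sketch that runs the same tokenizer on a toy string. The corpus "hello world" is my own stand-in rather than Tiny Shakespeare, and the tokenizer is restated compactly so the snippet runs by itself.

```python
def build_tokenizer(text: str):
    # Same construction as above: one integer id per unique character.
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)
    return chars, len(chars), encode, decode


# Toy corpus (an assumption for illustration, not Tiny Shakespeare).
chars, vocab_size, encode, decode = build_tokenizer("hello world")

ids = encode("hello")
print(vocab_size)      # 8 unique characters: " ", d, e, h, l, o, r, w
print(ids)             # [3, 2, 4, 4, 5]
print(decode(ids))     # hello

# Characters that never appeared in the corpus have no id.
try:
    encode("xyz")
except KeyError as err:
    unseen = err.args[0]
    print("unknown character:", unseen)   # unknown character: x
```

The KeyError is the price of a closed character vocabulary: any prompt you feed the trained model has to be made of characters that occurred in the training text.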

Step 2 — Batches, and the shift-by-one trick

Training data is pulled as (B, T) batches: B independent sequences, each T characters long. The target y is the same slice offset by one character, so y[t] = x[t + 1]: at every position the model's job is to predict the character that comes next.

def get_batch(split, train_data, val_data, block_size, batch_size, device="cpu"):
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)

x :  F  i  r  s  t     C
     ↓  ↓  ↓  ↓  ↓  ↓  ↓
y :  i  r  s  t     C  i

A single block_size-long slice secretly contains T separate (context, target) examples: predicting i from F, r from Fi, s from Fir, and so on. This is exactly the source of the Transformer's parallel-learning efficiency — one slice, many lessons.
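
The "one slice, many lessons" idea can be spelled out in plain Python. This is only an illustrative sketch: unroll_examples is a hypothetical helper of mine, and it works on raw characters instead of token ids to keep the output readable.

```python
def unroll_examples(text: str, start: int, block_size: int):
    """List the (context, target) pairs hidden inside one training slice."""
    x = text[start:start + block_size]            # model input
    y = text[start + 1:start + 1 + block_size]    # same slice, offset by one
    # Position t sees x[:t+1] as context and must predict y[t].
    return [(x[:t + 1], y[t]) for t in range(block_size)]


pairs = unroll_examples("First Citizen", start=0, block_size=4)
for context, target in pairs:
    print(repr(context), "->", repr(target))
# 'F' -> 'i'
# 'Fi' -> 'r'
# 'Fir' -> 's'
# 'Firs' -> 't'
```

During training the model never materializes these pairs one by one; the causal mask in the next step lets a single forward pass score all of them at once.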

Step 3 — A single self-attention head

This is the heart of it. Each token produces three projections of itself — Query (what am I looking for?), Key (what do I offer?), and Value (what do I actually carry?). We score every query against every key, scale, hide the future with a causal mask, softmax into weights, and take a weighted sum of the values:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The causal mask is what makes this a decoder: a token may attend only to itself and the past, never the future.

        attended →
        t0  t1  t2  t3  t4
   t0 [  1   0   0   0   0 ]
   t1 [  1   1   0   0   0 ]
   t2 [  1   1   1   0   0 ]
   t3 [  1   1   1   1   0 ]
   t4 [  1   1   1   1   1 ]

In code, the mask is a lower-triangular buffer; we fill the upper triangle with -inf so softmax sends those weights to zero:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """A single self-attention head."""
 
    def __init__(self, n_embed, head_size, dropout_rate, block_size):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)
        self.query = nn.Linear(n_embed, head_size, bias=False)
        self.value = nn.Linear(n_embed, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.head_size = head_size
        self.dropout = nn.Dropout(dropout_rate)
 
    def forward(self, x):
        B, T, C = x.shape                                             # batch, time, channels
        k = self.key(x)                                               # (B, T, head_size)
        q = self.query(x)                                             # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * (self.head_size ** -0.5)      # scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = F.softmax(wei, dim=-1)
        wei = self.dropout(wei)
        v = self.value(x)
        return wei @ v                                               # weighted sum
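
To see what the mask does to the attention weights, here is a dependency-free sketch of the same masked softmax. It uses plain Python lists instead of tensors, and causal_softmax is a name I made up for this illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Upper triangle (future positions) is replaced by -inf before softmax.
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(row)                                # subtract max for stability
        exps = [math.exp(v - m) for v in row]       # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With all-zero scores every visible position gets equal weight.
wei = causal_softmax([[0.0] * 3 for _ in range(3)])
for row in wei:
    print([round(w, 3) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.333, 0.333, 0.333]
```

Each row still sums to 1, but all of the weight sits on the current and earlier positions, which is the lower-triangular pattern drawn above.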

The head_size ** -0.5 scaling keeps the dot products from growing with dimension, which would otherwise push softmax into saturation.
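
A quick numerical check of that claim, as a sketch under the assumption that query and key entries are independent with zero mean and unit variance (real activations only approximate this). The variance of an unscaled dot product then grows like head_size, and the scaling brings it back to about 1.

```python
import random

random.seed(0)  # fixed seed so the estimate is repeatable

def dot_variance(d, scale, n_samples=2000):
    """Empirical variance of scale * (q . k) for q, k with unit-normal entries."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        dots.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / n_samples
    return sum((v - mean) ** 2 for v in dots) / n_samples


head_size = 32   # n_embed // n_head for the config in this post
raw = dot_variance(head_size, scale=1.0)
scaled = dot_variance(head_size, scale=head_size ** -0.5)
print(f"unscaled variance ~ {raw:.1f}")     # close to head_size
print(f"scaled variance   ~ {scaled:.2f}")  # close to 1
```

Scores with variance around 32 would regularly differ by ten or more, and softmax turns gaps that large into a near one-hot distribution with tiny gradients.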

Step 4 — Multi-head attention

One head learns one kind of relationship. Running several in parallel lets the model attend to different patterns at once; we concatenate their outputs and project back to the model dimension:

class MultiHeadAttention(nn.Module):
    """Several self-attention heads in parallel."""
 
    def __init__(self, n_embed, num_heads, head_size, dropout_rate, block_size):
        super().__init__()
        self.heads = nn.ModuleList(
            [Head(n_embed, head_size, dropout_rate, block_size) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_size, n_embed)  # concatenated width -> n_embed
        self.dropout = nn.Dropout(dropout_rate)
 
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.dropout(self.proj(out))
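
The shape bookkeeping behind the concatenation, sketched with plain lists for a single token position. The per-head vectors here are placeholders I filled with the head index, not real attention outputs.

```python
n_embed, n_head = 256, 8            # values from the config used later in this post
head_size = n_embed // n_head       # 32 channels per head

# Stand-in for the per-head outputs at one token position: each head
# returns a vector of length head_size (filled with the head's index here).
head_outputs = [[float(h)] * head_size for h in range(n_head)]

# torch.cat(..., dim=-1) amounts to laying these vectors end to end.
concatenated = [v for out in head_outputs for v in out]

print(head_size)            # 32
print(len(concatenated))    # 256, back to the model width n_embed
```

This only works out cleanly when n_embed is divisible by n_head; otherwise the concatenated width falls short of n_embed.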

Step 5 — The Transformer block

A block has two sub-layers: multi-head attention (the tokens communicate) and a feed-forward network (each token computes). Both use pre-LayerNorm and a residual connection — the pattern that makes deep Transformers trainable:

x = x + Dropout( MultiHeadAttention( LayerNorm(x) ) )
x = x + Dropout( FeedForward( LayerNorm(x) ) )

class Block(nn.Module):
    """Transformer block: communication (attention) + computation (FFN)."""
 
    def __init__(self, n_embed, n_head, dropout_rate, block_size):
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_embed, n_head, head_size, dropout_rate, block_size)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embed, 4 * n_embed),
            nn.ReLU(),
            nn.Linear(4 * n_embed, n_embed),
            nn.Dropout(dropout_rate),
        )
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)
 
    def forward(self, x):
        x = x + self.dropout1(self.sa(self.ln1(x)))
        x = x + self.dropout2(self.ffwd(self.ln2(x)))
        return x

The feed-forward network expands to 4 × n_embed and back — this is where most of the model's non-linear "thinking" capacity lives.
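
The pre-LayerNorm residual pattern is easier to see on a single token vector. Below is a minimal sketch in plain Python: layer_norm omits the learned scale and shift that nn.LayerNorm has, and the sub-layers are toy functions of my own rather than attention or an FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance.
    (nn.LayerNorm additionally applies a learned scale and shift.)"""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """x + sublayer(LayerNorm(x)): the sub-layer only sees normalized input,
    while the residual path carries x through unchanged."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


x = [10.0, 20.0, 30.0, 40.0]

# A sub-layer that outputs zeros leaves the residual stream untouched.
print(pre_ln_residual(x, lambda h: [0.0] * len(h)))   # [10.0, 20.0, 30.0, 40.0]

# A toy sub-layer (hypothetical) that halves its normalized input.
out = pre_ln_residual(x, lambda h: [0.5 * v for v in h])
print([round(v, 2) for v in out])
```

Because the identity path is always present, each block starts out as a small correction to its input, which is a large part of why deep stacks of these blocks remain trainable.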

Step 6 — Assembling the model

Now we stack the pieces. Token embeddings say what a token is; position embeddings say where it sits. Their sum flows through n_layer blocks, a final LayerNorm, and a linear head that projects to vocabulary-sized logits:

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embed, n_head, n_layer, dropout_rate, block_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.tok_pos_dropout = nn.Dropout(dropout_rate)
        self.blocks = nn.Sequential(
            *[Block(n_embed, n_head, dropout_rate, block_size) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.block_size = block_size
 
    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                          # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device))  # (T, C)
        x = self.tok_pos_dropout(tok_emb + pos_emb)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                                           # (B, T, vocab)
 
        if targets is None:
            return logits, None
 
        # CrossEntropy expects (B*T, vocab) and (B*T)
        B, T, C_vocab = logits.shape
        logits = logits.view(B * T, C_vocab)
        targets = targets.view(B * T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss

The class is still called BigramLanguageModel — a leftover from the starting point of Karpathy's lecture. The code here is a full decoder-only Transformer; the bigram is just the bare baseline it grew out of (more on why that baseline is limited at the end).

Step 7 — Generating text

Generation is autoregressive: at each step we crop the context to the last block_size tokens (because that's all the position embeddings cover), take the logits at the final position, softmax into a distribution, sample one token, and append it:

def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -self.block_size:]   # crop to context window
        logits, _ = self(idx_cond)
        logits = logits[:, -1, :]              # focus on the last step
        probs = F.softmax(logits, dim=-1)
        idx_next = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

context = torch.zeros((1, 1), dtype=torch.long, device=DEVICE)
print(decode(model.generate(context, max_new_tokens=500)[0].tolist()))
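
The same crop, predict, sample, append loop can be exercised without a trained network. In this sketch toy_model is a hypothetical stand-in that returns a probability list over a 3-token vocabulary, and random.Random.choices plays the role of torch.multinomial.

```python
import random

def generate(next_token_probs, idx, max_new_tokens, block_size, rng):
    """Autoregressive sampling: crop, predict, sample, append."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]                  # crop to the context window
        probs = next_token_probs(context)            # distribution over the vocab
        # Draw one token id in proportion to its probability
        # (the role torch.multinomial plays in the method above).
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        idx.append(next_id)
    return idx


seen_lengths = []

def toy_model(context):
    """Hypothetical stand-in for the Transformer: over a 3-token vocabulary,
    strongly prefers the token after the last one (0 -> 1 -> 2 -> 0)."""
    seen_lengths.append(len(context))
    probs = [0.05, 0.05, 0.05]
    probs[(context[-1] + 1) % 3] = 0.90
    return probs


out = generate(toy_model, [0], max_new_tokens=10, block_size=4, rng=random.Random(0))
print(out)                 # 11 token ids, starting with the prompt token 0
print(max(seen_lengths))   # 4: the model never sees more than block_size tokens
```

Sampling, rather than always taking the most likely token, is what keeps the generated text from collapsing into a repeating loop.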

Step 8 — Training

The training loop is the classic four lines — forward, backward, step, schedule — wrapped with evaluation, early stopping, and logging. I used AdamW, a cosine-annealing schedule that decays the learning rate smoothly from 5e-4 to 1e-5, and Weights & Biases for live curves. estimate_loss averages train/val loss in eval mode every 500 steps, and early stopping halts the run if validation loss stalls for five evaluations.

for step in range(cfg.epochs):   # cfg.epochs counts optimizer steps, not passes over the data
    if step % EVAL_INTERVAL == 0 or step == cfg.epochs - 1:
        losses = estimate_loss(model, train_data, val_data, encode,
                               cfg.block_size, cfg.batch_size)
        train, val = losses["train"].item(), losses["val"].item()
        wandb.log({"train_loss": train, "val_loss": val,
                   "learning_rate": scheduler.get_last_lr()[0]}, step=step)
        # early stopping
        if val < best_val_loss:
            best_val_loss, patience_counter = val, 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                break
 
    xb, yb = get_batch("train", train_data, val_data,
                       cfg.block_size, cfg.batch_size, device=DEVICE)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    scheduler.step()
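
The loop above does not show how the scheduler was built, so the following is a sketch of the schedule's shape rather than the notebook's code. It assumes the standard cosine-annealing closed form (the one PyTorch documents for CosineAnnealingLR) with the 5e-4 and 1e-5 endpoints quoted earlier; total_steps is a made-up run length.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=1e-5):
    """Cosine annealing: lr_max at step 0, lr_min at step == total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


total_steps = 5000   # hypothetical run length; the post does not state cfg.epochs
for step in (0, 1250, 2500, 3750, 5000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The curve is flat near both ends and steepest in the middle, so the model spends its early steps at close to the full learning rate and its last steps making very small updates.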

The HERO RUN v4 configuration:

Parameter      Value
block_size     256
batch_size     64
n_embed        256
n_head         8
n_layer        6
dropout_rate   0.2
lr             5e-4
weight_decay   1e-3
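
For a sense of scale, the model's size can be worked out from this configuration by hand. The sketch below counts trainable parameters layer by layer for the classes defined earlier. count_parameters is a helper of my own, and vocab_size=65 is an assumption based on the "around 65" figure from Step 1.

```python
def count_parameters(vocab_size, n_embed, n_head, n_layer, block_size):
    """Trainable parameters of the model defined above, layer by layer."""
    head_size = n_embed // n_head
    # Attention: bias-free key/query/value per head, then a biased projection.
    qkv = n_head * 3 * n_embed * head_size
    proj = n_embed * n_embed + n_embed
    # Feed-forward: n_embed -> 4*n_embed -> n_embed, both with bias.
    ffn = (n_embed * 4 * n_embed + 4 * n_embed) + (4 * n_embed * n_embed + n_embed)
    layer_norms = 2 * (2 * n_embed)                 # weight + bias, twice per block
    block = qkv + proj + ffn + layer_norms

    embeddings = vocab_size * n_embed + block_size * n_embed
    final_norm = 2 * n_embed
    lm_head = n_embed * vocab_size + vocab_size
    return n_layer * block + embeddings + final_norm + lm_head


# vocab_size=65 is an assumption (the post says "around 65").
total = count_parameters(vocab_size=65, n_embed=256, n_head=8, n_layer=6, block_size=256)
print(f"{total:,}")   # 4,833,345
```

Almost all of that sits in the six blocks, and within each block the feed-forward network holds about twice as many weights as attention.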

Results

Cross-Entropy Loss

The loss function we train against is cross-entropy. Here's the formula:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}\big(y_i \mid x_i\big)$$

For a single example, where $p$ is the probability the model assigns to the correct class (token):

$$\mathcal{L}_i = -\log p_{\theta}(y_i \mid x_i)$$

More generally, with the true distribution $y$ (one-hot) and the model's prediction $\hat{y}$:

$$\mathcal{L} = -\sum_{c=1}^{V} y_c \, \log \hat{y}_c$$

where $V$ is the vocabulary size and $\hat{y}_c$ is the probability assigned to token $c$ after the softmax.

How does it work? At every step the model produces a probability distribution over the entire vocabulary for the next token (the softmax output). Cross-entropy looks at how much probability the model placed on the token that actually came next and takes its negative logarithm. If the model gave the correct token a high probability ($p \to 1$), then $-\log p \to 0$ and the loss shrinks; if it gave it a low probability ($p \to 0$), then $-\log p \to \infty$ and the loss becomes a large penalty. In other words, the function punishes the model heavily for predictions it was "confident about but wrong on."

Why do we use it? Because predicting the next token is really a classification problem: at each position we're trying to pick the right token out of $V$ candidates. Cross-entropy is the natural loss for this kind of probabilistic classification: minimizing it pulls the model's predicted distribution toward the true distribution, which is equivalent to maximum likelihood estimation. It also has clean, stable gradients: combined with softmax, its derivative with respect to the logits simplifies to just $\hat{y} - y$, which lets gradient descent work efficiently. In PyTorch it's computed in a single line, F.cross_entropy(logits, targets), which fuses the softmax + log + negative-log-likelihood steps into one numerically stable operation.
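
Both claims, the loss as the negative log of the correct token's probability and the gradient as the prediction minus the one-hot target, can be checked numerically. This is a plain-Python sketch on made-up logits for a 4-token vocabulary, not the F.cross_entropy implementation.

```python
import math

def softmax(logits):
    m = max(logits)                                  # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """-log of the probability assigned to the correct token."""
    return -math.log(softmax(logits)[target])


logits, target = [2.0, 0.5, -1.0, 0.0], 0
probs = softmax(logits)
loss = cross_entropy(logits, target)

# Analytic gradient with respect to the logits: y_hat - y (y is one-hot).
analytic = [p - (1.0 if c == target else 0.0) for c, p in enumerate(probs)]

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = []
for c in range(len(logits)):
    bumped = list(logits)
    bumped[c] += eps
    numeric.append((cross_entropy(bumped, target) - loss) / eps)

print(round(loss, 4))
print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])
```

The two gradient lists agree to several decimal places, and the entries sum to zero: raising the correct logit lowers the loss, while raising any other logit increases it.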

Validation loss is cross-entropy (negative log-likelihood) — lower is better. The reference figures for this character-level Tiny Shakespeare setup:

Validation loss (lower is better)
Random init      ████████████████████████████████████  4.17
Pure bigram      ██████████████████████                2.50
This Transformer █████████████                         1.50
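
The random-init figure is not arbitrary: a model that spreads probability evenly over the vocabulary pays log(V) per character. The sketch below checks that and converts the other two losses into perplexities. It assumes a vocabulary of exactly 65 characters.

```python
import math

vocab_size = 65   # assumption: the post says "around 65" characters

# A model that spreads probability evenly pays -log(1/V) = log(V) per character.
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))          # 4.17

# exp(loss) is the perplexity: the number of equally likely characters
# the model is effectively choosing between at each step.
for name, loss in [("pure bigram", 2.5), ("this Transformer", 1.5)]:
    print(name, round(math.exp(loss), 1))
# pure bigram 12.2
# this Transformer 4.5
```

Read this way, training narrows the model's uncertainty from 65 candidate characters at initialization to the equivalent of roughly four or five.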

Qualitatively, a pure bigram produces letter-soup that respects single-character frequencies but never forms real words. This Transformer produces Shakespeare-looking text: plausible words, NAME: dialogue structure, line breaks and punctuation. It's stylistically convincing but semantically nonsensical, which is what you'd expect from a character-level model of this size (roughly 4.8M parameters for the config above, assuming a 65-character vocabulary).

Why the bigram baseline hits a wall — and how attention breaks through

Understanding the baseline's ceiling is the entire motivation for attention. A true bigram models P(next char | previous char): its context window is exactly one token, it has no notion of position, and it's structurally just a vocab × vocab lookup table with no composition. So its loss plateaus around 2.5 no matter how long you train.
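
To make the "lookup table" point concrete, here is a count-based bigram in plain Python. This is a sketch of the idea, not the lecture's version (which learns the table as an embedding by gradient descent), and the corpus is a toy string of my own.

```python
from collections import Counter, defaultdict

def fit_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_probs(counts, prev):
    """P(next | prev): one normalized row of the vocab x vocab table."""
    total = sum(counts[prev].values())
    return {ch: n / total for ch, n in counts[prev].items()}


counts = fit_bigram("the then them")   # toy corpus, for illustration only
print(next_char_probs(counts, "t"))    # {'h': 1.0}
print(next_char_probs(counts, "e"))
```

After "e" the table is split evenly between a space, "n" and "m", and nothing earlier in the text can break the tie, because the previous character is the only input the model has.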

Every addition removes one of those limits:

Addition                 Limitation it removes
Self-attention           context up to block_size (256) instead of 1
Position embeddings      order and position become meaningful
Multi-head attention     several relationship types learned in parallel
Residual blocks + FFN    depth and non-linear, compositional computation

Together they drop validation loss from ~2.5 to ~1.5 and turn letter-soup into text with real words and Shakespeare-style structure, which is the whole leap this project is meant to demonstrate.

Try it yourself

The full implementation lives in a single, heavily-commented notebook:

If you've ever wanted to demystify "GPT," there's no better way than building the smallest possible one yourself — and watching the loss curve drop as each piece clicks into place.

