How Gradient Descent Actually Works
Jun 22, 2026 · 2 min read
Every trained neural network — from a tiny classifier to a frontier LLM — was fit by the same simple idea: gradient descent. Here's the whole thing, without the mystery.
The picture
Imagine a hilly landscape. The height at each point is your loss — how wrong the model is. You want the lowest valley. You can't see the whole map, but at any point you can feel which way is downhill. So you take a small step that way, and repeat.
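The "which way is downhill" comes from the gradient: the vector of partial derivatives of the loss with respect to each parameter. Each step applies one update rule:
$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$
- $\theta$: the parameters (weights).
- $\nabla_\theta L(\theta)$: the gradient (the uphill direction).
- $\eta$: the learning rate (how big a step you take).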
We subtract the gradient because the gradient points uphill and we want to go down.
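To make the sign concrete, here is a single update step on a made-up one-parameter loss. The loss `(theta - 3) ** 2`, the starting point, and the step size `lr = 0.25` are all assumptions chosen for easy arithmetic, not anything from a real model.

```python
# Hypothetical toy loss: L(theta) = (theta - 3)^2, which is lowest at theta = 3.
def loss_grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return 2 * (theta - 3)

theta = 0.0   # starting guess
lr = 0.25     # assumed learning rate

grad = loss_grad(theta)         # negative here: the loss rises as theta decreases
theta_new = theta - lr * grad   # subtracting a negative gradient moves theta up

print(theta, "->", theta_new)
```

The gradient is negative at the start, so subtracting it pushes `theta` toward 3, which is where this toy loss bottoms out.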
Twelve lines that fit a line
Let's minimize a simple loss by hand — fitting y = w·x to some data:
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)  # true relationship: y = 2x

w = 0.0
lr = 0.01

for step in range(200):
    pred = w * x
    loss = ((pred - y) ** 2).mean()     # mean squared error
    grad = (2 * x * (pred - y)).mean()  # d loss / d w
    w -= lr * grad                      # the update

print(round(w, 3))  # ≈ 2.0

It starts at w = 0, feels the slope, and walks straight to w = 2. That's
gradient descent. A real network just does this for millions of parameters at
once, with the gradients computed automatically by backpropagation.
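As a small step toward "many parameters at once", here is a sketch of the same loop with two parameters, a slope `w` and an intercept `b`, updated together as one vector. The data (`y = 2x + 1`), the step size `lr = 0.05`, and the step count are assumptions picked for illustration, and the gradient is still written by hand rather than computed by backpropagation.

```python
import numpy as np

# Hypothetical data following y = 2x + 1 (an assumption for this sketch).
x = np.array([1, 2, 3, 4], dtype=float)
y = 2 * x + 1

# One column per parameter: x for the slope w, ones for the intercept b.
X = np.stack([x, np.ones_like(x)], axis=1)

theta = np.zeros(2)  # [w, b], both starting at 0
lr = 0.05            # assumed step size, small enough to be stable on this data

for step in range(2000):
    residual = X @ theta - y              # prediction error for each data point
    grad = 2 * (X.T @ residual) / len(y)  # gradient of the MSE w.r.t. [w, b]
    theta -= lr * grad                    # one update moves both parameters

w, b = theta
print(round(w, 3), round(b, 3))
```

Both parameters should settle close to 2 and 1. The update line is identical in shape to the one-parameter version; only the gradient became a vector.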
The learning rate is everything
- Too small → training crawls; you may run out of steps before you reach the valley.
- Too large → you overshoot and bounce around (or diverge).
- Just right → steady, fast descent.
Many "my model won't train" problems trace back to this one number. When in doubt, lower it and watch the loss curve.
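You can see all three regimes on the line-fitting example by rerunning the same loop with different step sizes. This is a rough sketch: the helper `fit` and the three learning rates are assumptions chosen for this tiny dataset, and the thresholds would be different for any other problem.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)  # true relationship: y = 2x

def fit(lr, steps=200):
    """Run gradient descent on y = w*x from w = 0 and return the final w."""
    w = 0.0
    for _ in range(steps):
        grad = (2 * x * (w * x - y)).mean()  # d loss / d w for the MSE
        w -= lr * grad
    return w

# Assumed learning rates: one too small, one reasonable, one too large.
results = {lr: fit(lr) for lr in (0.0001, 0.01, 0.2)}
for lr, w in results.items():
    print(f"lr={lr}: w={w:.4g}")
```

With the smallest rate, `w` has covered only about a quarter of the distance to 2 after 200 steps. With `lr = 0.01` it lands on 2. With `lr = 0.2` every step overshoots by more than the last one, so `w` blows up to an enormous magnitude instead of settling.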