How Gradient Descent Actually Works

Nearly every trained neural network, from a tiny classifier to a frontier LLM, was fit by some variant of the same simple idea: gradient descent. Here's the whole thing, without the mystery.

The picture

Imagine a hilly landscape. The height at each point is your loss — how wrong the model is. You want the lowest valley. You can't see the whole map, but at any point you can feel which way is downhill. So you take a small step that way, and repeat.

That sense of slope comes from the gradient: the vector of partial derivatives of the loss with respect to each parameter. Each step updates the parameters like this:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}$$

$\theta$ — the parameters (weights).
$\nabla_\theta \mathcal{L}$ — the gradient (the uphill direction).
$\eta$ — the learning rate (how big a step you take).

We subtract the gradient because the gradient points uphill and we want to go down.
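
To make the sign concrete, here is a single update step worked through in code. This is only a sketch: the bowl-shaped loss (θ − 3)² and the starting values are made up for illustration and are not from any real model.

```python
def loss(theta):
    # Hypothetical bowl-shaped loss with its minimum at theta = 3.
    return (theta - 3.0) ** 2

def grad(theta):
    # Derivative of the loss above: d/dtheta (theta - 3)^2 = 2 * (theta - 3).
    return 2.0 * (theta - 3.0)

theta = 0.0   # starting parameter (assumed)
eta = 0.1     # learning rate (assumed)

g = grad(theta)               # -6.0: negative, so "uphill" is toward smaller theta
theta_new = theta - eta * g   # subtracting moves toward larger theta, to about 0.6

print(loss(theta), loss(theta_new))  # the loss falls from 9.0 to roughly 5.76
```

The gradient at θ = 0 is negative, so subtracting it pushes θ up toward 3, which is where the valley is.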

Twelve lines that fit a line

Let's minimize a simple loss by hand — fitting y = w·x to some data:

import numpy as np
 
x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)   # true relationship: y = 2x
w = 0.0
lr = 0.01
 
for step in range(200):
    pred = w * x
    loss = ((pred - y) ** 2).mean()       # mean squared error
    grad = (2 * x * (pred - y)).mean()    # d loss / d w
    w -= lr * grad                        # the update
print(round(w, 3))                        # ≈ 2.0

It starts at w = 0, feels the slope, and walks to w = 2, with each step shrinking the remaining error. That's gradient descent. A real network does the same thing for millions or billions of parameters at once, with the gradients computed automatically by backpropagation.
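
As a small step toward "many parameters at once", here is a sketch of the same loop with two parameters, a slope and an intercept, stored in one vector. The data (y = 2x + 1), the learning rate, and the step count are assumptions chosen for illustration, and the gradients are still written out by hand rather than computed by backpropagation.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = 2.0 * x + 1.0        # assumed relationship: y = 2x + 1
theta = np.zeros(2)      # theta[0] is the slope w, theta[1] is the intercept b
lr = 0.05

for step in range(2000):
    w, b = theta
    err = w * x + b - y              # prediction error per data point
    grad = np.array([
        (2 * x * err).mean(),        # d loss / d w
        (2 * err).mean(),            # d loss / d b
    ])
    theta -= lr * grad               # one update for the whole parameter vector

print(np.round(theta, 3))            # close to slope 2 and intercept 1
```

The update line is unchanged from the one-parameter version. Only the gradient grew, from one number to a vector with one entry per parameter.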

The learning rate is everything

Too small → training crawls; you may run out of steps before you reach the valley.
Too large → you overshoot and bounce around (or diverge).
Just right → steady, fast descent.
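
The three regimes are easy to reproduce on the line-fitting problem. The sketch below reruns the same loop with three learning rates; the helper name `fit` and the specific rates are choices made here for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)
y = np.array([2, 4, 6, 8], dtype=float)   # true relationship: y = 2x

def fit(lr, steps=200):
    """Run gradient descent from w = 0 and return the final w."""
    w = 0.0
    for _ in range(steps):
        grad = (2 * x * (w * x - y)).mean()   # d loss / d w
        w -= lr * grad
    return w

for lr in (0.001, 0.01, 0.2):
    # 0.001 is still short of 2 after 200 steps, 0.01 lands on 2,
    # and 0.2 overshoots further on every step and blows up.
    print(lr, fit(lr))
```

For this particular data the gradient works out to 15·(w − 2), so every step multiplies the remaining error by (1 − 15·lr). Any rate above 2/15 ≈ 0.13 makes that factor larger than 1 in magnitude, and the error grows instead of shrinking.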

Many "my model won't train" problems trace back to this one number. When in doubt, lower it and watch whether the loss curve falls steadily instead of spiking or stalling.
