What Is a Tensor? Linear Algebra for Deep Learning

For almost everyone new to deep learning, one of the first intimidating words is tensor. It shows up throughout the PyTorch and JAX docs and sounds like something borrowed from physics. The reality is far more modest: for deep learning purposes, a tensor is nothing more than an array of numbers with a particular shape. In this post we'll climb from scalars to tensors one rung at a time, and along the way notice something more important: essentially all of deep learning is moving numbers through the right shapes.

From scalar to tensor: a ladder

Rather than memorizing a single definition, it helps to picture a ladder that rises by number of dimensions.

At the bottom sits the scalar: a single number, such as the 0.001 you might use as a learning rate. One rung up is the vector, an ordered list of numbers; a word embedding is the canonical example. The third rung is the matrix — a table of rows and columns, which is the usual form of a neural network layer's weights. A tensor is simply the general name for all of these: numbers arranged on a regular grid, with however many dimensions you need.

| Object | Dimensions | Example shape | Typical use |
| --- | --- | --- | --- |
| Scalar | 0 | `()` | learning rate, loss |
| Vector | 1 | `(n,)` | embedding, bias |
| Matrix | 2 | `(m, n)` | weight layer |
| Tensor | N | `(a, b, c, …)` | image batch, sequence |

A concrete example grounds the abstraction: a batch of 32 RGB images at 224×224 is a four-dimensional tensor of shape `(32, 3, 224, 224)`, where the numbers denote batch size, channels, height, and width, in that order. This is PyTorch's default channels-first layout; other frameworks may put the channel dimension last. A tensor, in other words, is just a slightly more formal way of saying "multi-dimensional container of numbers."
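Each rung of the ladder can be inspected directly in PyTorch. The snippet below is a minimal sketch: the variable names and the embedding size of 8 are illustrative choices, and the image batch is filled with zeros as a stand-in for real pixel data.

```python
import torch

learning_rate = torch.tensor(0.001)      # scalar: 0 dimensions
embedding = torch.randn(8)               # vector: 1 dimension (8 is an arbitrary size)
weights = torch.randn(3, 4)              # matrix: 2 dimensions
images = torch.zeros(32, 3, 224, 224)    # 4-D tensor: batch, channels, height, width

for name, t in [("scalar", learning_rate), ("vector", embedding),
                ("matrix", weights), ("image batch", images)]:
    # ndim is the number of dimensions; shape is the size along each one
    print(f"{name:12s} ndim={t.ndim} shape={tuple(t.shape)}")
```

The only thing that changes from rung to rung is how many entries the shape tuple has.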

The real protagonist: shape

When you work with tensors, the thing to keep alive in your mind is not the values inside them but their shape. Nearly every operation in deep learning amounts to reshaping, multiplying, and adding tensors according to fixed rules, so in practice a model "working" largely means its shapes line up.

The most fundamental operation that joins two tensors is matrix multiplication, and it is only defined when the inner dimensions agree: multiply an (m, k) matrix by a (k, n) matrix and you get an (m, n) matrix; the two k's must match. On top of this sit the broadcasting rules, which let tensors of different but compatible shapes be combined by automatically expanding the smaller one. This mechanism quietly kicks in across your code and deserves a post of its own.
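To make the shape rule concrete, here is a small sketch in PyTorch. The sizes 2, 3, and 5 are arbitrary, and `row_offsets` is a made-up vector used only to illustrate the simplest broadcasting case.

```python
import torch

A = torch.randn(2, 3)          # shape (m, k) = (2, 3)
B = torch.randn(3, 5)          # shape (k, n) = (3, 5)
C = A @ B                      # inner dimensions match (3 and 3)
print(C.shape)                 # torch.Size([2, 5])

# Mismatched inner dimensions: (2, 3) @ (2, 3) is not defined.
try:
    A @ A
    matmul_failed = False
except RuntimeError as err:
    matmul_failed = True
    print("matmul error:", err)

# Broadcasting: the (3,) vector is treated as if it were repeated for each of the 2 rows.
row_offsets = torch.tensor([10.0, 20.0, 30.0])   # shape (3,)
shifted = A + row_offsets                        # shape (2, 3)
print(shifted.shape)
```

The failed multiplication raises an error, while the broadcast addition succeeds silently. That silence is what makes broadcasting both convenient and easy to overlook.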

Why linear algebra is the language

Written in its plainest form, a neural network layer is a matrix multiplication plus a bias term:

$$\mathbf{y} = W\mathbf{x} + \mathbf{b}$$

Here $W$ holds the learned weights, $\mathbf{x}$ is the input, $\mathbf{b}$ is the bias, and $\mathbf{y}$ is the layer's output. On its own this is just a linear transformation; it acquires its real power once we stack such layers and slot a non-linearity between them. The word "deep" refers precisely to this stack.

The example below shows a single layer mapping a four-element input to a three-element output:

```python
import torch

x = torch.randn(4)        # input vector, shape (4,)
W = torch.randn(3, 4)     # weights, shape (3, 4)
b = torch.randn(3)        # bias, shape (3,)

y = W @ x + b             # output vector, shape (3,)
print(y.shape)            # torch.Size([3])
```

The `W @ x` term multiplies (3, 4) by (4,) to produce (3,); the only requirement is that the inner dimensions, the two 4s, agree. Adding the bias leaves the shape unchanged and merely shifts each component. All of deep learning, stripped down, is this step applied again and again with the right shapes and suitable non-linearities; what remains is to find good values for $W$ and $\mathbf{b}$, which is the job of gradient descent.
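Why the non-linearity matters can be seen in a few lines. The sketch below stacks two layers with illustrative sizes (4 to 3 to 2) and uses ReLU as the non-linearity; both choices are assumptions made for the example. It then shows that without ReLU the two layers collapse into a single linear layer.

```python
import torch

torch.manual_seed(0)                                # reproducible random values
x = torch.randn(4, dtype=torch.float64)             # input, shape (4,)
W1 = torch.randn(3, 4, dtype=torch.float64)         # first layer: 4 -> 3
b1 = torch.randn(3, dtype=torch.float64)
W2 = torch.randn(2, 3, dtype=torch.float64)         # second layer: 3 -> 2
b2 = torch.randn(2, dtype=torch.float64)

# Two layers with a ReLU non-linearity in between
hidden = torch.relu(W1 @ x + b1)                    # shape (3,)
y = W2 @ hidden + b2                                # shape (2,)

# Without the non-linearity the stack collapses into one linear layer
y_linear = W2 @ (W1 @ x + b1) + b2
W_merged = W2 @ W1                                  # shape (2, 4)
b_merged = W2 @ b1 + b2                             # shape (2,)
y_merged = W_merged @ x + b_merged

print(y.shape)                                      # torch.Size([2])
print(torch.allclose(y_linear, y_merged))           # True
```

Because $W_2(W_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (W_2 W_1)\mathbf{x} + (W_2\mathbf{b}_1 + \mathbf{b}_2)$, any number of purely linear layers is equivalent to one. The non-linearity is what stops that collapse.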

Beyond the numbers: geometric intuition

Treating linear algebra as mere arithmetic misses half the picture. You can think of a vector as a point or a direction in space, and a matrix as a transformation that stretches, rotates, or scales that space. Multiplying by $W$ moves the input into a new space, and each layer of a network nudges the data toward a representation where the classes are easier to separate. This geometric view will make notions like "similarity" and "distance" feel far more natural when we later reach embeddings and attention.
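The geometric picture is easy to try in two dimensions. In this sketch the 90-degree rotation and the particular scaling factors are arbitrary choices for illustration, and cosine similarity stands in for the notion of "similarity" mentioned above.

```python
import math
import torch

v = torch.tensor([1.0, 0.0])                 # points along the x-axis

theta = math.pi / 2                          # 90 degrees, counter-clockwise
rotate = torch.tensor([[math.cos(theta), -math.sin(theta)],
                       [math.sin(theta),  math.cos(theta)]])
scale = torch.tensor([[2.0, 0.0],
                      [0.0, 0.5]])           # stretch x by 2, shrink y by half

rotated = rotate @ v                         # approximately [0, 1]
scaled = scale @ torch.tensor([1.0, 1.0])    # [2.0, 0.5]

# Cosine similarity compares directions and ignores length
cos = torch.nn.functional.cosine_similarity(v, rotated, dim=0)

print(rotated, scaled, cos)
```

The rotation changes the vector's direction but not its length, so the original and rotated vectors end up perpendicular, with a cosine similarity of approximately zero.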

Common pitfalls

A surprising share of the errors you'll hit are not conceptual but shape mismatches in disguise. Confusing a row vector with a column vector, forgetting a transpose, or placing the batch dimension in the wrong slot all reduce to the same class of bug. The single most useful habit, then, is to sprinkle `print(x.shape)` at the critical points of your code and confirm at each step that the dimensions are what you expect. That small reflex resolves at a glance what could otherwise be hours of debugging.
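Two of these pitfalls are sketched below using a batched version of the single-layer example. The batch size of 32 and the variable names are assumptions made for illustration.

```python
import torch

X = torch.randn(32, 4)        # batch of 32 inputs, each with 4 features
W = torch.randn(3, 4)         # weights, shape (3, 4) as in the single-layer example
b = torch.randn(3)            # bias, shape (3,)

# Pitfall: forgetting the transpose. (32, 4) @ (3, 4) has inner dims 4 and 3.
try:
    X @ W
    forgot_transpose_ok = True
except RuntimeError:
    forgot_transpose_ok = False

# Fix: transpose W so the inner dimensions agree; b broadcasts over the batch.
Y = X @ W.T + b
assert Y.shape == (32, 3), f"unexpected shape {tuple(Y.shape)}"

# Pitfall: a (3,) vector versus a (3, 1) column. Adding them broadcasts to (3, 3).
col = b.unsqueeze(1)          # shape (3, 1)
print((b + col).shape)        # torch.Size([3, 3]), usually not what was intended
```

The first mistake fails loudly, which makes it easy to find. The second runs without error and produces a tensor of the wrong shape, which is why an explicit shape check such as the `assert` above is often worth the extra line.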

Where this goes next

Now that we've seen what a tensor is and why a layer is just a matrix multiplication, the natural question is how the model actually finds "good" values for those weights. The answer lives in the optimization loop at the heart of deep learning — gradient descent — which the next post takes apart, again from scratch.
