The Math Engine: What Powers Encoders?

We've seen the clever designs of LSTMs and Transformers, but what makes them actually work? Underneath the gates and attention mechanisms lies the engine room: fundamental mathematics, primarily linear algebra and calculus. Here is what each one does inside a network.

Linear algebra is how information moves through the network. Calculus is how the network figures out the right way to move it.

Linear Algebra: The Language of Transformations

Neural networks process information by passing numerical representations through layers. Linear algebra governs these representations and transformations:

Information as Vectors: Words, pixels, or other features are represented as vectors (lists of numbers). Our word embeddings are exactly this.
Operations as Matrices: The connections and transformations within network layers are defined by matrices — the various $W$ terms ( $W_{f}, W_{i}, W_{o}$ in LSTMs; $W_{Q}, W_{K}, W_{V}$ in Transformers).
Processing via Matrix Multiplication: The fundamental operation of applying a layer's transformation to an input is matrix multiplication. This is how information is combined and propagated through the network.
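
The three ideas above can be sketched in a few lines of NumPy. The embedding values and the weight matrix `W` below are made up for illustration (they are not taken from any trained model); the point is only to show a vector going in, a matrix acting on it, and a new vector coming out.

```python
import numpy as np

# Hypothetical 4-dimensional embedding for one word (illustrative values only).
x = np.array([0.2, -1.0, 0.5, 0.3])

# Hypothetical weight matrix mapping 4 input features to 3 output features.
W = np.array([
    [1.0, 0.0, 0.0, 0.0],   # copies the first feature
    [0.0, 1.0, 1.0, 0.0],   # adds the second and third features
    [0.5, 0.5, 0.5, 0.5],   # half the sum of all four features
])

y = W @ x                   # the layer's linear transformation
print(x.shape, W.shape, y.shape)
print(y)
```

A `(3, 4)` matrix applied to a length-4 vector gives a length-3 vector: the inner dimensions must agree, and the outer dimension sets the size of the output.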

Matrix Multiplication in Action

Remember the LSTM gates or the Transformer's Q, K, V calculations? They all rely on matrix multiplication. For matrices $A$ of shape $(m \times k)$ and $B$ of shape $(k \times n)$ , each element of the result is:

Matrix Multiplication

C[i, j] = \sum_{l=1}^{k} A[i, l] \times B[l, j]

python

import numpy as np
import time

def naive_matmul(A, B):
    m, k  = A.shape
    k2, n = B.shape
    assert k == k2, "incompatible shapes"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):      # inner dot product
                C[i, j] += A[i, l] * B[l, j]
    return C

A = np.random.randn(64, 64)
B = np.random.randn(64, 64)

t0      = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0

t0     = time.perf_counter()
C_fast = A @ B                  # calls optimised BLAS under the hood
t_fast = time.perf_counter() - t0

print(f"Naive  : {t_naive * 1000:.1f} ms")
print(f"np @   : {t_fast  * 1000:.3f} ms")
print(f"Speedup: {t_naive / t_fast:.0f}×")
print(f"Results match: {np.allclose(C_naive, C_fast)}")
# Exact timings depend on the machine and BLAS build; expect the naive loop
# to be orders of magnitude slower than np @.
# BLAS routines exploit CPU SIMD, cache tiling, and multi-threading.

Worked example: with $A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$ and $B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$, each entry of $C$ is a row of $A$ dotted with a column of $B$:

C[0,0] = 1×5 + 2×7 = 19
C[0,1] = 1×6 + 2×8 = 22
C[1,0] = 3×5 + 4×7 = 43
C[1,1] = 3×6 + 4×8 = 50
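
The four entries above can be checked with NumPy. The matrices are read off from the products listed: the rows of `A` are (1, 2) and (3, 4), and the columns of `B` are (5, 7) and (6, 8). This is only a quick sanity check, not part of the original walkthrough.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

C = A @ B

# A single entry is the dot product of one row of A with one column of B.
c00 = A[0, :] @ B[:, 0]     # 1*5 + 2*7

print(C)
print(c00)
```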

In Practice

You'll rarely implement matrix multiplication from scratch. Libraries like NumPy and PyTorch rely on highly optimised, low-level BLAS (Basic Linear Algebra Subprograms) routines, often written in Fortran or C, to perform these operations with incredible speed on modern hardware.

Pause & Reflect

When your code throws a dimension mismatch error during a matrix multiplication, what subtle tensor misalignment might be the cause?

Calculus: The Engine of Learning

Linear algebra defines how information is processed, but how do the weight matrices get the right values? That's where calculus — specifically gradient descent and backpropagation — comes in.

Forward Pass: Input data flows through the network to produce an output.
Loss Calculation: We compare the network's output to the desired target using a loss function, which quantifies the error.
Backward Pass (Backpropagation): The network calculates the gradient — the derivative of the loss with respect to every weight. The chain rule is the workhorse here, allowing the error signal to be efficiently propagated backward through all the layers.
Weight Update: Each weight $W$ is adjusted slightly in the direction that reduces the loss, controlled by the learning rate.

Weight Update Rule

W_{\text{new}} = W_{\text{old}} - (\text{learning rate}) \cdot \frac{\partial L}{\partial W_{\text{old}}}

Run this enough times and the weights inch toward something that actually works.
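
To see the update rule repeated, here is a minimal sketch on the simplest possible model: one weight `w`, one training example, and a squared-error loss. The numbers (`x = 2`, `y = 6`, a learning rate of `0.1`, 20 steps) are arbitrary choices for illustration; real networks apply the same rule to millions of weights at once.

```python
# Minimise L(w) = 0.5 * (w * x - y)^2 for a single weight w.
x, y = 2.0, 6.0      # one training example; the ideal weight is y / x = 3
w = 0.0              # initial guess
lr = 0.1             # learning rate

losses = []
for step in range(20):
    y_hat = w * x                    # forward pass
    loss = 0.5 * (y_hat - y) ** 2    # loss calculation
    grad = (y_hat - y) * x           # dL/dw via the chain rule
    w = w - lr * grad                # weight update rule
    losses.append(loss)

print(f"final w    = {w:.4f}")
print(f"first loss = {losses[0]:.4f}")
print(f"last loss  = {losses[-1]:.8f}")
```

With these settings each step shrinks the distance to the ideal weight by a constant factor, so the loss falls steadily. A learning rate that is too large would instead overshoot and diverge.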

python

import numpy as np

# Tiny 2-layer network: x(2) → hidden(2, ReLU) → output(1, linear)
x      = np.array([0.5, -0.3])
W1     = np.array([[0.4, -0.2], [0.1,  0.8]])  # (2×2) hidden weights
W2     = np.array([0.6, -0.5])                  # (2,)  output weights
y_true = 1.0
lr     = 0.1

# Forward
h     = np.maximum(0, W1 @ x)      # ReLU hidden layer
y_hat = W2 @ h                     # predicted output
loss  = 0.5 * (y_hat - y_true)**2

# Backward — chain rule by hand
dL_dyhat  = y_hat - y_true                         # ∂L/∂ŷ
dL_dW2    = dL_dyhat * h                           # ∂L/∂W2
dL_dh     = dL_dyhat * W2                          # ∂L/∂h
dL_dh_pre = dL_dh * (W1 @ x > 0).astype(float)   # ReLU gate
dL_dW1    = np.outer(dL_dh_pre, x)                # ∂L/∂W1

W1 -= lr * dL_dW1
W2 -= lr * dL_dW2

print(f"loss   : {loss:.4f}")
print(f"∂L/∂W2 : {dL_dW2.round(4)}")
print(f"∂L/∂W1 :\n{dL_dW1.round(4)}")
# PyTorch .backward() computes the same gradients automatically.
# Understanding the manual steps clarifies what 'gradient' means in practice.
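
One way to gain confidence in hand-derived gradients is a finite-difference check: nudge each weight slightly, re-run the forward pass, and compare the measured slope with the analytic value. The sketch below repeats the same tiny network with its starting weights (before the update step); `loss_fn` and the step size `eps` are names and values chosen here for illustration.

```python
import numpy as np

def loss_fn(W1, W2, x, y_true):
    """Forward pass of the tiny network: ReLU hidden layer, linear output."""
    h = np.maximum(0, W1 @ x)
    y_hat = W2 @ h
    return 0.5 * (y_hat - y_true) ** 2

x = np.array([0.5, -0.3])
W1 = np.array([[0.4, -0.2], [0.1, 0.8]])
W2 = np.array([0.6, -0.5])
y_true = 1.0

# Analytic gradients, following the same chain-rule steps as the manual version.
z = W1 @ x
h = np.maximum(0, z)
d_yhat = W2 @ h - y_true
grad_W2 = d_yhat * h
grad_W1 = np.outer(d_yhat * W2 * (z > 0), x)

# Numerical gradients by central differences.
eps = 1e-6
num_W1 = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_W1[i, j] = (loss_fn(Wp, W2, x, y_true)
                        - loss_fn(Wm, W2, x, y_true)) / (2 * eps)

num_W2 = np.zeros_like(W2)
for i in range(W2.size):
    Wp, Wm = W2.copy(), W2.copy()
    Wp[i] += eps
    Wm[i] -= eps
    num_W2[i] = (loss_fn(W1, Wp, x, y_true)
                 - loss_fn(W1, Wm, x, y_true)) / (2 * eps)

print("W1 gradients agree:", np.allclose(grad_W1, num_W1, atol=1e-6))
print("W2 gradients agree:", np.allclose(grad_W2, num_W2, atol=1e-6))
```

The check is far too slow for real training, since it needs two forward passes per weight, but it is a common way to debug a backward pass.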

checkpoint

Which mathematical operation allows the error signal to be propagated backward through multiple layers during training?

Activation Functions & Their Derivatives

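Non-linear activation functions (tanh, sigmoid, ReLU, etc.) are essential: without them, a stack of layers collapses into a single linear transformation, since $W_{2} (W_{1} x) = (W_{2} W_{1}) x$, so extra depth adds nothing. Their derivatives are critical for backpropagation: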

$\frac{d}{dx} \tanh(x) = 1 - \tanh^{2}(x)$
$\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x))$; this derivative never exceeds 0.25 (its value at $x = 0$), which contributes to vanishing gradients.
ReLU: the derivative is 1 for positive inputs, which helps mitigate vanishing gradients.

ReLU Derivative

\frac{d}{dx} \mathrm{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}
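
The derivative formulas can be evaluated numerically to see why the choice of activation matters for gradient flow. This is a small sketch using the formulas above; the helper names (`d_tanh`, `d_sigmoid`, `d_relu`) and the sampling grid are choices made here, and the last line ignores the weight matrices to isolate the effect of the activation alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_relu(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

xs = np.linspace(-5.0, 5.0, 1001)    # evenly spaced grid centred on 0

max_d_tanh = d_tanh(xs).max()
max_d_sigmoid = d_sigmoid(xs).max()
relu_values = np.unique(d_relu(xs))

print("largest tanh derivative    :", max_d_tanh)
print("largest sigmoid derivative :", max_d_sigmoid)
print("distinct ReLU derivatives  :", relu_values)

# Best case for a gradient passing through 10 sigmoid layers (weights ignored):
# each layer multiplies it by at most 0.25.
print("0.25 ** 10 =", 0.25 ** 10)
```

Even in the best case, ten sigmoid layers scale the gradient down by roughly a factor of a million, whereas ReLU passes it through unchanged wherever the unit is active.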

Silent Broadcasting Bugs

Neglecting to verify tensor shapes is a classic pitfall. While a dimension mismatch often throws an error, sometimes libraries like NumPy or PyTorch will use broadcasting to "stretch" a smaller tensor to make the operation valid. This can lead to silent bugs where your code runs without crashing but produces incorrect results because the operation wasn't what you intended. Always print your tensor shapes before multiplying.
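
A common instance of this, sketched with made-up values: predictions stored as a column of shape `(3, 1)` and targets stored as a flat vector of shape `(3,)`. Subtracting them does not raise an error. NumPy broadcasts both to `(3, 3)` and computes every pairwise difference instead.

```python
import numpy as np

y_pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), e.g. a model's column output
y_true = np.array([1.0, 2.0, 3.0])         # shape (3,), a flat vector of targets

# Intended: element-wise difference of three predictions and three targets.
# Actual: (3, 1) - (3,) broadcasts to a (3, 3) matrix of pairwise differences.
diff = y_pred - y_true
print(diff.shape)
print((diff ** 2).mean())    # non-zero, even though predictions equal targets

# Flattening the predictions and asserting matching shapes makes the intent explicit.
fixed = y_pred.reshape(-1) - y_true
assert fixed.shape == y_true.shape
print((fixed ** 2).mean())   # zero, as expected for a perfect prediction
```

A one-line shape assertion before the arithmetic turns this kind of silent bug into an immediate, readable failure.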

checkpoint

Which activation functions have a derivative of exactly 1 for positive inputs? (Select all that apply)


Summary

Encoders run on two mathematical pillars. Linear algebra provides the framework for representing data as vectors and transformations as matrices, with matrix multiplication being the core processing step (used in LSTM gates, Transformer Q/K/V projections, etc.). Calculus, through backpropagation and gradient descent, uses derivatives (calculated via the chain rule) to determine how to adjust the weight matrices to minimise error. It is also the source of the vanishing/exploding gradient issues covered in the first post.

TL;DR

Linear algebra is the skeleton: vectors carry the data, matrices apply the transformations. Calculus is the feedback loop: derivatives tell each weight how wrong it was, and by how much to change.