gradientflow

theory

The Math Engine: What Powers Encoders?

A look under the hood at the linear algebra and calculus that power modern encoders like LSTMs and Transformers, from matrix multiplication to backpropagation.

July 7, 2025
6 min read
Module 5
Linear Algebra, Calculus, Backpropagation, Deep Learning, Math

We've seen the clever designs of LSTMs and Transformers, but what makes them actually work? Underneath the gates and attention mechanisms lies the engine room: fundamental mathematics, primarily linear algebra and calculus. Here is what they actually do inside a network.

Linear algebra is how information moves through the network. Calculus is how the network figures out the right way to move it.

Linear Algebra: The Language of Transformations

Neural networks process information by passing numerical representations through layers. Linear algebra governs these representations and transformations:

  • Information as Vectors: Words, pixels, or other features are represented as vectors (lists of numbers). Our word embeddings are exactly this.
  • Operations as Matrices: The connections and transformations within network layers are defined by weight matrices: the various $W$ terms ($W_f$, $W_i$, $W_o$, $W_c$ in LSTMs; $W_Q$, $W_K$, $W_V$ in Transformers).
  • Processing via Matrix Multiplication: The fundamental operation of applying a layer's transformation to an input is matrix multiplication. This is how information is combined and propagated through the network.
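
As a minimal sketch of the three points above, here is a single layer applied to a single vector. The sizes, the random values, and the names `embedding`, `W`, and `b` are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable

# Information as a vector: a hypothetical 4-dimensional word embedding.
embedding = rng.standard_normal(4)      # shape (4,)

# Operation as a matrix: a hypothetical layer mapping 4 features to 3.
W = rng.standard_normal((3, 4))         # shape (3, 4)
b = np.zeros(3)                         # shape (3,)

# Processing via matrix multiplication: apply the layer to the input.
hidden = W @ embedding + b              # shape (3,)

# Each output entry is the dot product of one row of W with the input.
row0 = np.dot(W[0], embedding)

print("input shape :", embedding.shape)
print("output shape:", hidden.shape)
print("row-0 check :", np.isclose(hidden[0], row0))
```

The shape rule is the thing to watch: a `(3, 4)` matrix consumes a length-4 vector and produces a length-3 vector.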

Matrix Multiplication in Action

Remember the LSTM gates or the Transformer's Q, K, V calculations? They all rely on matrix multiplication. For matrices $A$ of shape $m \times k$ and $B$ of shape $k \times n$, each element of the result $C = AB$ is:

$$C_{ij} = \sum_{l=1}^{k} A_{il} \, B_{lj}$$

Matrix Multiplication
python
import numpy as np
import time

def naive_matmul(A, B):
    m, k  = A.shape
    k2, n = B.shape
    assert k == k2, "incompatible shapes"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):      # inner dot product
                C[i, j] += A[i, l] * B[l, j]
    return C

A = np.random.randn(64, 64)
B = np.random.randn(64, 64)

t0      = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0

t0     = time.perf_counter()
C_fast = A @ B                  # calls optimised BLAS under the hood
t_fast = time.perf_counter() - t0

print(f"Naive  : {t_naive * 1000:.1f} ms")
print(f"np @   : {t_fast  * 1000:.3f} ms")
print(f"Speedup: {t_naive / t_fast:.0f}×")
print(f"Results match: {np.allclose(C_naive, C_fast)}")
# Typical output on a laptop: Naive ≈ 800ms, np @ ≈ 0.3ms → ~2500× faster.
# BLAS routines exploit CPU SIMD, cache tiling, and multi-threading.
In Practice
You'll rarely implement matrix multiplication from scratch. Libraries like NumPy and PyTorch rely on highly optimised, low-level BLAS (Basic Linear Algebra Subprograms) routines, often written in Fortran or C, to perform these operations with incredible speed on modern hardware.
Pause & Reflect

When your code throws a dimension mismatch error during a matrix multiplication, what subtle tensor misalignment might be the cause?

Calculus: The Engine of Learning

Linear algebra defines how information is processed, but how do the weight matrices get the right values? That's where calculus — specifically gradient descent and backpropagation — comes in.

  1. Forward Pass: Input data flows through the network to produce an output.
  2. Loss Calculation: We compare the network's output to the desired target using a loss function, which quantifies the error.
  3. Backward Pass (Backpropagation): The network calculates the gradient — the derivative of the loss with respect to every weight. The chain rule is the workhorse here, allowing the error signal to be efficiently propagated backward through all the layers.
  4. Weight Update: Each weight is adjusted slightly in the direction that reduces the loss, controlled by the learning rate.
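$$W \leftarrow W - \eta \, \frac{\partial L}{\partial W}$$
Weight Update Rule, where $\eta$ is the learning rate and $\partial L / \partial W$ is the gradient computed in the backward pass.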

Run this enough times and the weights inch toward something that actually works.
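
A minimal sketch of that repetition, assuming a toy one-parameter loss $L(w) = (w - 3)^2$ chosen purely for illustration. The target value 3.0, the starting point, and the step count are arbitrary, and `loss_fn` and `grad_fn` are hypothetical helper names.

```python
# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
def loss_fn(w):
    return (w - 3.0) ** 2

def grad_fn(w):
    return 2.0 * (w - 3.0)

w = 0.0                      # arbitrary starting weight
lr = 0.1                     # learning rate
history = [loss_fn(w)]       # loss before any update

for step in range(50):
    w -= lr * grad_fn(w)     # step against the gradient
    history.append(loss_fn(w))

print(f"final w   : {w:.4f}")
print(f"first loss: {history[0]:.4f}")
print(f"last loss : {history[-1]:.6f}")
```

A real network applies the same kind of update to every entry of every weight matrix at once; only the way the gradient is obtained gets more involved.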

python
import numpy as np

# Tiny 2-layer network: x(2) → hidden(2, ReLU) → output(1, linear)
x      = np.array([0.5, -0.3])
W1     = np.array([[0.4, -0.2], [0.1,  0.8]])  # (2×2) hidden weights
W2     = np.array([0.6, -0.5])                  # (2,)  output weights
y_true = 1.0
lr     = 0.1

# Forward
h     = np.maximum(0, W1 @ x)      # ReLU hidden layer
y_hat = W2 @ h                     # predicted output
loss  = 0.5 * (y_hat - y_true)**2

# Backward — chain rule by hand
dL_dyhat  = y_hat - y_true                         # ∂L/∂ŷ
dL_dW2    = dL_dyhat * h                           # ∂L/∂W2
dL_dh     = dL_dyhat * W2                          # ∂L/∂h
dL_dh_pre = dL_dh * (W1 @ x > 0).astype(float)   # ReLU gate
dL_dW1    = np.outer(dL_dh_pre, x)                # ∂L/∂W1

W1 -= lr * dL_dW1
W2 -= lr * dL_dW2

print(f"loss   : {loss:.4f}")
print(f"∂L/∂W2 : {dL_dW2.round(4)}")
print(f"∂L/∂W1 :\n{dL_dW1.round(4)}")
# PyTorch .backward() computes the same gradients automatically.
# Understanding the manual steps clarifies what 'gradient' means in practice.
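
One way to gain confidence in hand-derived gradients like these is a finite-difference check. The sketch below rebuilds the same tiny network with its original (pre-update) weights and compares the analytic gradients against central-difference estimates. The helper names `forward_loss` and `numerical_grad` are hypothetical, introduced only for this illustration.

```python
import numpy as np

x = np.array([0.5, -0.3])
W1 = np.array([[0.4, -0.2], [0.1, 0.8]])
W2 = np.array([0.6, -0.5])
y_true = 1.0

def forward_loss(W1, W2):
    """Loss of the 2-layer ReLU network for the fixed input x."""
    h = np.maximum(0, W1 @ x)
    y_hat = W2 @ h
    return 0.5 * (y_hat - y_true) ** 2

def numerical_grad(f, W, eps=1e-6):
    """Central-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (f(W_plus) - f(W_minus)) / (2 * eps)
    return grad

# Analytic gradients, derived with the chain rule as in the example above.
pre = W1 @ x
h = np.maximum(0, pre)
dL_dyhat = W2 @ h - y_true
dL_dW2 = dL_dyhat * h
dL_dW1 = np.outer(dL_dyhat * W2 * (pre > 0), x)

# Numerical estimates, perturbing one weight matrix at a time.
num_dW1 = numerical_grad(lambda W: forward_loss(W, W2), W1)
num_dW2 = numerical_grad(lambda W: forward_loss(W1, W), W2)

print("W1 gradients match:", np.allclose(dL_dW1, num_dW1, atol=1e-6))
print("W2 gradients match:", np.allclose(dL_dW2, num_dW2, atol=1e-6))
```

The check is only reliable away from the ReLU kink at zero; here both pre-activations are comfortably non-zero, so a small `eps` does not cross it.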
checkpoint

Which mathematical operation allows the error signal to be propagated backward through multiple layers during training?

Activation Functions & Their Derivatives

Non-linear activation functions (tanh, sigmoid, ReLU, etc.) are essential: without them, any stack of layers would collapse into a single linear transformation, so extra depth would add nothing. Their derivatives are critical for backpropagation:

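  • Sigmoid, $\sigma(x) = 1 / (1 + e^{-x})$: the derivative $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$ never exceeds 0.25 (its value at $x = 0$), contributing to vanishing gradients.
  • ReLU, $\mathrm{ReLU}(x) = \max(0, x)$: the derivative is 1 for positive inputs, which helps mitigate vanishing gradients.

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$

ReLU Derivative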
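
The 0.25 bound quoted above matches the logistic sigmoid, whose derivative is $\sigma(x)(1 - \sigma(x))$. The sketch below evaluates both derivatives and shows, under the simplifying assumption that each layer contributes only its activation derivative to the chain-rule product (weights are ignored), how quickly the sigmoid's best case shrinks with depth. The depth of 10 and the helper names are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # Convention used here: the derivative is taken as 0 at z == 0.
    return (z > 0).astype(float)

z = np.linspace(-6.0, 6.0, 1201)      # grid centred on z = 0

max_sig = sigmoid_grad(z).max()       # peak of the sigmoid derivative
relu_vals = np.unique(relu_grad(z))   # distinct ReLU derivative values

# Best-case chain-rule factor from activations alone over `depth` layers.
depth = 10
sigmoid_chain = 0.25 ** depth         # 2**-20, just under one millionth
relu_chain = 1.0 ** depth             # stays at 1 on the active path

print(f"max sigmoid'   : {max_sig:.4f}")
print(f"ReLU' values   : {relu_vals}")
print(f"sigmoid, depth {depth}: {sigmoid_chain:.2e}")
print(f"ReLU,    depth {depth}: {relu_chain:.2e}")
```

Even in the sigmoid's best case the gradient signal is scaled by about one millionth after ten layers, while an active ReLU path passes it through unchanged.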
Silent Broadcasting Bugs
Neglecting to verify tensor shapes is a classic pitfall. While a dimension mismatch often throws an error, sometimes libraries like NumPy or PyTorch will use broadcasting to "stretch" a smaller tensor to make the operation valid. This can lead to silent bugs where your code runs without crashing but produces incorrect results because the operation wasn't what you intended. Always print your tensor shapes before multiplying.
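
A minimal sketch of such a silent bug in NumPy, using made-up predictions and labels. A column vector of shape `(3, 1)` minus a flat vector of shape `(3,)` broadcasts to a `(3, 3)` matrix instead of raising an error.

```python
import numpy as np

y_pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), e.g. a model output
y_true = np.array([1.0, 2.0, 3.0])         # shape (3,),   e.g. the labels

# Intended: an element-wise error of shape (3,), which is all zeros here.
# Actual: broadcasting pairs every prediction with every label.
diff_bad = y_pred - y_true                 # shape (3, 3), no error raised
mse_bad = np.mean(diff_bad ** 2)

# Fix: make the shapes agree explicitly before subtracting.
diff_ok = y_pred.reshape(-1) - y_true      # shape (3,)
mse_ok = np.mean(diff_ok ** 2)

print("buggy shape:", diff_bad.shape, "mse:", mse_bad)
print("fixed shape:", diff_ok.shape, "mse:", mse_ok)
```

The predictions are perfect, yet the buggy version reports a non-zero error. An explicit `assert y_pred.shape == y_true.shape` before the subtraction would turn this silent bug into a loud one.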
checkpoint

Which activation functions have a derivative of exactly 1 for positive inputs? (Select all that apply)


Summary

Encoders run on two mathematical pillars. Linear algebra provides the framework for representing data as vectors and transformations as matrices, with matrix multiplication being the core processing step (used in LSTM gates, Transformer Q/K/V projections, etc.). Calculus, through backpropagation and gradient descent, uses derivatives (calculated via the chain rule) to determine how to adjust the weight matrices to minimise error. Because the chain rule multiplies many derivatives together, it is also the source of the vanishing/exploding gradient issues covered in the first post.

TL;DR
Linear algebra is the skeleton: vectors carry the data, matrices apply the transformations. Calculus is the feedback loop: derivatives tell each weight how wrong it was, and by how much to change.