The Math Engine: What Powers Encoders?
A look under the hood at the linear algebra and calculus that power modern encoders like LSTMs and Transformers, from matrix multiplication to backpropagation.
We've seen the clever designs of LSTMs and Transformers, but what makes them actually work? Underneath the gates and attention mechanisms lies the engine room: fundamental mathematics, primarily linear algebra and calculus. Here is what they actually do inside a network.
Linear algebra is how information moves through the network. Calculus is how the network figures out the right way to move it.
Linear Algebra: The Language of Transformations
Neural networks process information by passing numerical representations through layers. Linear algebra governs these representations and transformations:
- Information as Vectors: Words, pixels, or other features are represented as vectors (lists of numbers). Our word embeddings are exactly this.
- Operations as Matrices: The connections and transformations within network layers are defined by weight matrices (such as the gate weights $W_f$, $W_i$, $W_o$ in LSTMs; the projections $W_Q$, $W_K$, $W_V$ in Transformers).
- Processing via Matrix Multiplication: The fundamental operation of applying a layer's transformation to an input is matrix multiplication. This is how information is combined and propagated through the network.
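The three ideas above can be sketched in a few lines. This is a minimal illustration, assuming a made-up 4-dimensional embedding and a hypothetical layer with 3 output units; the names `embedding`, `W`, and `b` are illustrative and not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Information as a vector: a hypothetical 4-dimensional word embedding.
embedding = np.array([0.2, -1.0, 0.5, 0.7])

# Operation as a matrix: an illustrative layer mapping 4 inputs to 3 outputs.
W = rng.standard_normal((3, 4))  # made-up weights
b = np.zeros(3)                  # made-up bias

# Processing via matrix multiplication: apply the layer to the input.
output = W @ embedding + b

print(output.shape)  # (3,)
```

The output has one entry per row of `W`: each entry is the dot product of that row with the embedding.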
Matrix Multiplication in Action
Remember the LSTM gates or the Transformer's Q, K, V calculations? They all rely on matrix multiplication. For matrices $A$ of shape $m \times k$ and $B$ of shape $k \times n$, each element of the result $C = AB$ is:
$$C_{ij} = \sum_{l=1}^{k} A_{il} B_{lj}$$
import numpy as np
import time
def naive_matmul(A, B):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "incompatible shapes"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for l in range(k):  # inner dot product
                C[i, j] += A[i, l] * B[l, j]
    return C
A = np.random.randn(64, 64)
B = np.random.randn(64, 64)
t0 = time.perf_counter()
C_naive = naive_matmul(A, B)
t_naive = time.perf_counter() - t0
t0 = time.perf_counter()
C_fast = A @ B # calls optimised BLAS under the hood
t_fast = time.perf_counter() - t0
print(f"Naive : {t_naive * 1000:.1f} ms")
print(f"np @ : {t_fast * 1000:.3f} ms")
print(f"Speedup: {t_naive / t_fast:.0f}×")
print(f"Results match: {np.allclose(C_naive, C_fast)}")
# Typical output on a laptop: Naive ≈ 800ms, np @ ≈ 0.3ms → ~2500× faster.
# BLAS routines exploit CPU SIMD, cache tiling, and multi-threading.
When your code throws a dimension mismatch error during a matrix multiplication, what subtle tensor misalignment might be the cause?
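One common cause is inner dimensions that do not line up, often because of a swapped operand order or a missing transpose. The sketch below assumes a hypothetical batch of 5 tokens with 8 features each and an illustrative projection named `W_q`; the shapes are made up for the example.

```python
import numpy as np

# Hypothetical batch: 5 tokens, each an 8-dimensional embedding.
X = np.ones((5, 8))
# Illustrative projection mapping 8 features to 4.
W_q = np.ones((8, 4))

# Inner dimensions agree: (5, 8) @ (8, 4) -> (5, 4)
Q = X @ W_q

# Swapping the operands misaligns them: (8, 4) @ (5, 8), inner 4 != 5.
try:
    W_q @ X
    mismatch = False
except ValueError as err:
    mismatch = True
    print("shape error:", err)

# If the column-major layout was intended, transposing both operands fixes it:
# (4, 8) @ (8, 5) -> (4, 5), which is just Q transposed.
Q_t = W_q.T @ X.T
print(Q.shape, Q_t.shape)
```

Printing `.shape` on both operands just before the failing line is usually the quickest way to spot which axis is out of place.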
Calculus: The Engine of Learning
Linear algebra defines how information is processed, but how do the weight matrices get the right values? That's where calculus — specifically gradient descent and backpropagation — comes in.
- Forward Pass: Input data flows through the network to produce an output.
- Loss Calculation: We compare the network's output to the desired target using a loss function, which quantifies the error.
- Backward Pass (Backpropagation): The network calculates the gradient — the derivative of the loss with respect to every weight. The chain rule is the workhorse here, allowing the error signal to be efficiently propagated backward through all the layers.
- Weight Update: Each weight is adjusted slightly in the direction that reduces the loss, controlled by the learning rate.
Run this enough times and the weights inch toward something that actually works.
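The update rule on its own can be sketched with a single weight. This is a toy example, assuming an illustrative loss $L(w) = (w - 3)^2$ and an arbitrary starting point; the functions `loss` and `grad` are made up for the demonstration.

```python
# Toy loss L(w) = (w - 3)^2, minimised at w = 3 (an illustrative choice).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

w = 0.0   # arbitrary starting weight
lr = 0.1  # learning rate

history = []
for step in range(50):
    history.append(loss(w))
    w -= lr * grad(w)  # step against the gradient

print(f"final w = {w:.4f}, final loss = {loss(w):.6f}")
```

Each step shrinks the distance to the minimum by a constant factor here, so the loss falls on every iteration. Real networks follow the same rule, only with millions of weights and a much bumpier loss surface.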
import numpy as np
# Tiny 2-layer network: x(2) → hidden(2, ReLU) → output(1, linear)
x = np.array([0.5, -0.3])
W1 = np.array([[0.4, -0.2], [0.1, 0.8]]) # (2×2) hidden weights
W2 = np.array([0.6, -0.5]) # (2,) output weights
y_true = 1.0
lr = 0.1
# Forward
h = np.maximum(0, W1 @ x) # ReLU hidden layer
y_hat = W2 @ h # predicted output
loss = 0.5 * (y_hat - y_true)**2
# Backward — chain rule by hand
dL_dyhat = y_hat - y_true # ∂L/∂ŷ
dL_dW2 = dL_dyhat * h # ∂L/∂W2
dL_dh = dL_dyhat * W2 # ∂L/∂h
dL_dh_pre = dL_dh * (W1 @ x > 0).astype(float) # ReLU gate
dL_dW1 = np.outer(dL_dh_pre, x) # ∂L/∂W1
W1 -= lr * dL_dW1
W2 -= lr * dL_dW2
print(f"loss : {loss:.4f}")
print(f"∂L/∂W2 : {dL_dW2.round(4)}")
print(f"∂L/∂W1 :\n{dL_dW1.round(4)}")
# PyTorch .backward() computes the same gradients automatically.
# Understanding the manual steps clarifies what 'gradient' means in practice.
Which mathematical operation allows the error signal to be propagated backward through multiple layers during training?
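A quick way to build trust in hand-derived chain-rule gradients is to compare them with a numerical estimate. The sketch below reuses the same tiny network and values as above and checks the gradient for `W1` against central finite differences; the helper `forward_loss` and the step size `eps` are assumptions made for this illustration.

```python
import numpy as np

def forward_loss(W1, W2, x, y_true):
    """Loss of the tiny network: x -> ReLU(W1 @ x) -> W2 @ h."""
    h = np.maximum(0, W1 @ x)
    y_hat = W2 @ h
    return 0.5 * (y_hat - y_true) ** 2

x = np.array([0.5, -0.3])
W1 = np.array([[0.4, -0.2], [0.1, 0.8]])
W2 = np.array([0.6, -0.5])
y_true = 1.0

# Analytic gradient via the chain rule (same steps as the manual backward pass).
pre = W1 @ x
h = np.maximum(0, pre)
dL_dyhat = W2 @ h - y_true
dL_dW1 = np.outer(dL_dyhat * W2 * (pre > 0), x)

# Numerical gradient: nudge each weight up and down and measure the loss change.
eps = 1e-6
numeric = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus = W1.copy()
        W_minus = W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (
            forward_loss(W_plus, W2, x, y_true)
            - forward_loss(W_minus, W2, x, y_true)
        ) / (2 * eps)

print("gradients agree:", np.allclose(dL_dW1, numeric, atol=1e-6))
```

The numerical version needs two forward passes per weight, which is why backpropagation, with a single backward pass for all weights, is the method used in practice.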
Activation Functions & Their Derivatives
Non-linear activation functions (tanh, sigmoid, ReLU, etc.) are essential: without them, any stack of linear layers collapses into a single linear transformation, so adding depth would gain nothing. Their derivatives are critical for backpropagation:
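- Sigmoid, with derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$: the derivative never exceeds 0.25 (its value at $x = 0$), contributing to vanishing gradients.
- ReLU: the derivative is 1 for positive inputs and 0 for negative inputs, which helps mitigate vanishing gradients.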
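The two derivatives above can be checked numerically. This is a rough sketch: the helper names are illustrative, and the depth comparison deliberately ignores the weight matrices, which also enter the backward product in a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return (z > 0).astype(float)

z = np.linspace(-6, 6, 1201)

# The sigmoid derivative peaks at z = 0 and never exceeds 0.25.
max_sigmoid_grad = sigmoid_grad(z).max()
print(f"max sigmoid derivative: {max_sigmoid_grad:.4f}")

# The ReLU derivative is exactly 1 wherever the input is positive.
positive = z[z > 0]
print("ReLU derivative on positive inputs:", np.unique(relu_grad(positive)))

# Backpropagation multiplies one activation derivative per layer.
# Best case for sigmoid over 10 layers versus an always-active ReLU path:
depth = 10
sigmoid_best_case = 0.25 ** depth
relu_active_path = 1.0 ** depth
print(f"sigmoid factor after {depth} layers: {sigmoid_best_case:.2e}")
print(f"ReLU factor after {depth} layers   : {relu_active_path:.2e}")
```

Even in its best case the sigmoid factor shrinks by at least a factor of four per layer, while a ReLU path that stays active passes the signal through unscaled.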
Which activation functions have a derivative of exactly 1 for positive inputs? (Select all that apply)
Encoders run on two mathematical pillars. Linear algebra provides the framework for representing data as vectors and transformations as matrices, with matrix multiplication being the core processing step (used in LSTM gates, Transformer Q/K/V projections, etc.). Calculus, through backpropagation and gradient descent, uses derivatives (calculated via the chain rule) to determine how to adjust the weight matrices to minimise error. It is also the source of the vanishing/exploding gradient issues covered in the first post.