gradientflow

Why Can Deep Networks Go Off the Rails?

Understanding the vanishing and exploding gradient problems that plagued early deep learning, and the modern solutions that tamed these unruly gradients.

June 29, 2025
9 min read
Module 1
Deep Learning, Gradients, Backpropagation, Neural Networks

Deep networks have a dirty secret: the deeper you go, the harder they become to train. Not because the architecture breaks down. Because the learning signal does.

That signal is called a gradient. It travels backward through every layer, telling each one how wrong it was and by how much to adjust. In a shallow network, fine. In a twenty-layer network, that signal passes through twenty sets of hands. And like anything passed through too many hands, it either fades to nothing or blows up completely.

That is the vanishing gradient problem and the exploding gradient problem. You cannot understand why modern deep learning is designed the way it is without understanding both.

Picture this: you're in a classroom and you've got a juicy secret to share with your friend group. You lean over and whisper it to the friend next to you. They pass it along, each person whispering to the next. But sometimes the secret doesn't make it — it fades away with each quiet pass until it's just a barely audible breath, lost before it reaches the last friend. That's exactly what happens with the vanishing gradient problem: the signal gets weaker as it moves through the layers, becoming too tiny to nudge the model's weights.

Then there's the other extreme. Imagine one of your overenthusiastic friends starts discussing the secret out loud — their voice bouncing off the classroom walls until every single classmate knows it. That's the exploding gradient problem, where the signal balloons too big, throwing training into chaos.

The Culprit: Backpropagation and the Chain Rule

The core learning mechanism for most neural networks is backpropagation — an algorithm that calculates gradients for each layer by propagating error signals backward through the network. Here's the gist:

  1. The network makes a prediction.
  2. It compares the prediction to the actual target and calculates the error.
  3. It sends this error signal backward through its layers.
  4. Each weight gets feedback based on its contribution to the error, telling it how to adjust.

This backward pass relies heavily on the chain rule from calculus. As the error signal travels back, it gets multiplied by the local gradient at each layer. In a deep network, that's a lot of multiplications. Here's the critical part: what happens if the numbers you're repeatedly multiplying are consistently small, or consistently large?
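To see those multiplications happen, here is a minimal sketch of a scalar "network" where each layer is just `sigmoid(w * h)`. The weights and the input value are made up for illustration. The backward pass multiplies one local derivative per layer, and a finite-difference check confirms the result.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    """Run the scalar chain h_k = sigmoid(w_k * h_{k-1}) and keep every activation."""
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))
    return activations

def backward(activations, weights):
    """Return d(output)/d(h_k) for every k, walking from the last layer to the first."""
    grad = 1.0                                # d(output)/d(output)
    grads = [grad]
    for w, h_out in zip(reversed(weights), reversed(activations[1:])):
        local = h_out * (1.0 - h_out) * w     # local derivative d(h_k)/d(h_{k-1})
        grad *= local                         # chain rule: one more multiplication
        grads.append(grad)
    return list(reversed(grads))              # grads[k] = d(output)/d(h_k)

weights = [0.8, -1.2, 0.5, 1.5]               # illustrative values, not from the article
acts = forward(0.3, weights)
grads = backward(acts, weights)

# Sanity check against a finite-difference estimate of d(output)/d(input).
eps = 1e-6
numeric = (forward(0.3 + eps, weights)[-1] - forward(0.3 - eps, weights)[-1]) / (2 * eps)
print(f"analytic: {grads[0]:.8f}  numeric: {numeric:.8f}")
```

The gradient at the input is a product of four local derivatives. Every extra layer adds one more factor to that product.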

Why Gradients Vanish (And Why It Matters)

If those local gradients are often less than 1, as with sigmoid activations (whose derivative never exceeds 0.25), the error signal shrinks exponentially with each step back:

$$\text{Vanishing Signal} \approx \text{Initial Error} \times (\text{Small Factor})^{N}$$

where Small Factor < 1 and N is the number of layers the signal has passed through. After enough layers, the signal reaching the earliest layers is practically zero.

The Consequence: The initial layers get almost no feedback and stop learning. It's like the network has fallen asleep at the beginning of the sequence.
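You can watch the early layers go quiet in a small experiment. This is a rough sketch, assuming a 20-layer fully connected sigmoid network with random weights and a random input (the sizes, the seed, and the unit-norm starting signal are all my choices for illustration). It backpropagates an error signal from the top and records its norm at each layer.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 32          # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass, remembering each layer's output for the backward pass.
weights = [rng.normal(0.0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]
h = rng.normal(size=width)
outputs = []
for W in weights:
    h = sigmoid(W @ h)
    outputs.append(h)

# Backward pass: start from a unit-norm error signal at the top layer.
grad = np.ones(width) / np.sqrt(width)
norms = []
for W, out in zip(reversed(weights), reversed(outputs)):
    grad = W.T @ (grad * out * (1.0 - out))   # sigmoid derivative, then the weights
    norms.append(np.linalg.norm(grad))

norms = norms[::-1]   # norms[0] is the signal arriving at the input of layer 1
for layer in (20, 10, 1):
    print(f"layer {layer:>2}: gradient norm {norms[layer - 1]:.2e}")
```

The exact numbers depend on the seed, but the norm at layer 1 should come out many orders of magnitude smaller than at layer 20.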

In Practice: Two words. That was the effective context window after hours of training an RNN on next-word prediction. Two words, then total amnesia. Goldfish with a keyboard.

Pause & Reflect

Think about how you read this sentence. Do you process each word sequentially, or do you sometimes look ahead or back to understand the meaning?

Why Gradients Explode (And How to Fix It)

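Conversely, if local gradients are consistently greater than 1, the error signal balloons exponentially:

$$\text{Exploding Signal} \approx \text{Initial Error} \times (\text{Large Factor})^{N}$$

where Large Factor > 1 and N is the number of layers the signal has passed through. It can quickly overflow to Infinity or NaN, crashing the training or making learning wildly unstable, like trying to fine-tune a watch with a sledgehammer.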

The Consequence: Often results in NaN loss values and weights jumping wildly, making learning impossible.

Taming the Explosion: Gradient clipping is the direct fix. If the gradient's norm exceeds a threshold, the whole gradient is typically scaled down so that its norm equals the threshold. Smart weight initialisation and normalisation layers also play crucial supporting roles, keeping gradients in a sane range from the start.
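Clipping by global norm takes only a few lines. The sketch below is a plain NumPy illustration, not any particular library's implementation. The function name `clip_by_global_norm`, the threshold of 5.0, and the gradient values are all made up for the example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([30.0, -40.0]), np.array([[120.0]])]  # made-up exploding gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(f"norm before: {norm_before:.1f}, after: {norm_after:.1f}")
# norm before: 130.0, after: 5.0
```

Because every array is multiplied by the same factor, the direction of the update is preserved. Only its length changes.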

Real-World Pain Points

This isn't just theoretical. Ever seen training loss suddenly become NaN? Exploding gradients. Ever trained a model that just won't learn long-term dependencies in text? Vanishing gradients. These were major headaches making truly deep networks incredibly hard to train initially.
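When you hit one of these symptoms, a quick look at per-layer gradient norms usually tells you which problem you have. Here is a small hypothetical helper for that. The name `diagnose_gradients` and the thresholds `tiny` and `huge` are assumptions chosen for illustration, not standard values.

```python
import numpy as np

def diagnose_gradients(grad_norms, tiny=1e-7, huge=1e3):
    """Label a list of per-layer gradient norms. Thresholds are illustrative only."""
    norms = np.asarray(grad_norms, dtype=float)
    if not np.all(np.isfinite(norms)):
        return "exploded (NaN or Inf present)"
    if norms.max() > huge:
        return "exploding"
    if norms.min() < tiny:
        return "vanishing"
    return "healthy"

print(diagnose_gradients([0.5, 0.1, 0.02]))        # healthy
print(diagnose_gradients([1e-12, 1e-6, 0.3]))      # vanishing
print(diagnose_gradients([4e5, 2e2, 1.0]))         # exploding
print(diagnose_gradients([float("nan"), 1.0]))     # exploded (NaN or Inf present)
```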

A Quick Numerical Peek

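Let's make this concrete. Take a simple 20-layer network where the gradient starts at 1.0 at the output (layer 20).

Vanishing: If each backward step multiplies by 0.2:

  • Layer 19 gets: 1.0 × 0.2 = 0.2
  • Layer 1 gets: 1.0 × 0.2^19 ≈ 5.24 × 10^-14 (Practically zero!)

Exploding: If each backward step multiplies by 2.5:

  • Layer 19 gets: 1.0 × 2.5 = 2.5
  • Layer 1 gets: 1.0 × 2.5^19 ≈ 3.64 × 10^7 (Huge!)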

python
gradient = 1.0   # error signal at the output (layer 20)
layers   = 20

# Index i = number of backward steps taken, so index i corresponds to layer (layers - i).

# Vanishing: each layer multiplies by 0.2 (sigmoid-like derivative)
vanishing = [gradient * (0.2 ** i) for i in range(layers)]
print(f"Layer 19 gradient : {vanishing[1]:.4f}")    # 0.2000
print(f"Layer  1 gradient : {vanishing[19]:.2e}")   # ≈ 5.24e-14

# Exploding: each layer multiplies by 2.5
exploding = [gradient * (2.5 ** i) for i in range(layers)]
print(f"Layer 19 gradient : {exploding[1]:.4f}")    # 2.5000
print(f"Layer  1 gradient : {exploding[19]:.2e}")   # ≈ 3.64e+07

Checkpoint

Why might using ReLU (derivative of 1 for positive inputs) help mitigate the vanishing gradient problem compared to sigmoid (derivative at most 0.25)?

The Fixes (That Are Now Just... Standard)

Thankfully, researchers developed clever tricks now standard in deep learning:

  • Smarter Weight Initialisation: Techniques like Xavier/Glorot or He initialisation set starting weights carefully to promote stable signal propagation. In my experience, using He initialisation with ReLU networks often saves more time than fiddling endlessly with learning rates.
  • Better Activation Functions: ReLU has a derivative of exactly 1 for positive inputs, and variants such as Leaky ReLU and GELU stay close to 1 over most of the positive range, avoiding the systematic gradient shrinkage of sigmoid.
  • Normalisation Layers: Batch Normalisation and Layer Normalisation help stabilise activations and gradients within layers.
  • Architectural Innovations: LSTMs, GRUs, and Residual Networks (ResNets) incorporate mechanisms specifically designed to help gradients flow.
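
To get a feel for the initialisation bullet, here is a rough sketch that pushes a random input through 20 ReLU layers twice: once with He initialisation (weights drawn with standard deviation sqrt(2 / fan_in)) and once with a fixed small standard deviation of 0.01. The function name `forward_signal`, the layer width, and the seed are assumptions for the example.

```python
import numpy as np

def forward_signal(std_fn, depth=20, width=256, seed=0):
    """Push a random input through `depth` ReLU layers and return the final activation RMS."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width), (width, width))
        h = np.maximum(0.0, W @ h)
    return float(np.sqrt(np.mean(h ** 2)))

he_rms    = forward_signal(lambda fan_in: np.sqrt(2.0 / fan_in))  # He: std = sqrt(2 / fan_in)
naive_rms = forward_signal(lambda fan_in: 0.01)                   # fixed small std
print(f"He init    : {he_rms:.3e}")
print(f"naive 0.01 : {naive_rms:.3e}")
```

With He initialisation the activations should stay at roughly the same scale from layer to layer, while the fixed small standard deviation lets them collapse toward zero. The same scaling argument applies to the gradients flowing back.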

Stack all four together and you get something that can reliably train a hundred layers.
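
The residual idea is easy to check numerically. In this sketch each layer computes `tanh(W @ h)`, and the residual variant adds the input back: `h + tanh(W @ h)`. The weights are deliberately small (my assumption, to make the plain stack shrink the signal), and the sizes and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 32
# Deliberately small weights (an assumption) so the plain stack shrinks the signal.
weights = [rng.normal(0.0, 0.1 / np.sqrt(width), (width, width)) for _ in range(depth)]
x0 = rng.normal(size=width)

def input_gradient_norm(residual):
    """Backpropagate a unit-norm signal through tanh layers, with or without skips."""
    h, pre_acts = x0, []
    for W in weights:
        pre = W @ h
        pre_acts.append(pre)
        h = h + np.tanh(pre) if residual else np.tanh(pre)

    grad = np.ones(width) / np.sqrt(width)
    for W, pre in zip(reversed(weights), reversed(pre_acts)):
        branch = W.T @ (grad * (1.0 - np.tanh(pre) ** 2))  # path through the layer
        grad = grad + branch if residual else branch       # skip path carries grad unchanged
    return float(np.linalg.norm(grad))

plain = input_gradient_norm(residual=False)
skip = input_gradient_norm(residual=True)
print(f"plain stack    : {plain:.2e}")
print(f"residual stack : {skip:.2e}")
```

The skip connection adds an identity term to each layer's local gradient, so the signal has a direct route back to the input even when the layer itself contributes very little.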

python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_d(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu_d(x):
    # np.asarray lets this accept plain Python floats as well as arrays
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2., -1., 0., 1., 2.])
print("{:>6}  {:>12}  {:>10}".format("x", "sigmoid'(x)", "relu'(x)"))
print("-" * 34)
for xi in x:
    print(f"{xi:>6.1f}  {sigmoid_d(xi):>12.4f}  {relu_d(xi):>10.0f}")

# Cumulative gradient magnitude through 20 layers
layers = 20
print(f"\nAfter {layers} layers (input x = 1.0):")
print(f"  sigmoid: {sigmoid_d(1.0) ** layers:.2e}")   # ≈ 7.45e-15
print(f"  relu   : {relu_d(1.0)    ** layers:.2e}")   # = 1.00e+00

Checkpoint

Besides ReLU, which of the following techniques also help gradients flow better through deep networks? (Select all that apply)

TL;DR

Deep networks learn via backpropagation (chain rule multiplications). Small multipliers cause signals to vanish (early layers don't learn); large ones cause signals to explode (training unstable). Modern fixes: better initialisation, activations (ReLU), normalisation, gradient clipping, specialised architectures.

Summary

When you're training a deep network, the learning signal has to travel backward through every layer. Like that telephone game, the message can get garbled (exploding) or fade out (vanishing). The fixes are not optional extras. Better initialisation, ReLU, normalisation, and gated architectures are now the baseline. Every model in this series exists because these problems were solved first.