gradientflow

theory

How Do Encoders Distill Information?

Encoders are AI's expert summarizers, distilling raw data like text and images into numerical representations (embeddings). This module explores different encoder types like RNNs, CNNs, and Transformers.

June 30, 2025
6 min read
Module 2
Encoders, Deep Learning, RNN, CNN, Transformers, GNN, Autoencoders

So we know deep networks have a gradient problem. Gradients vanish or explode, and early layers stop learning. The obvious next question is: how do you actually build something useful on top of that? The answer starts with the encoder.

An encoder has one job: take raw, messy input and compress it into something a neural network can actually work with.

Think back to reading a dense novel. You don't memorize every word; you extract the essence: the plot, themes, characters. An encoder does something similar for AI. It's a specialized component that reads input data (text, images, audio, graphs, you name it!) and compresses it into a concise, numerical summary. This summary is often called a latent representation, context vector, or embedding. It is not usually human-readable, but it captures the crucial features of the input in a form that the rest of the network can operate on.

The architecture changes depending on the data. That is not a design flaw; it is the point. A sprinter and a marathon runner are both athletes with the same goal: cross the finish line. But you would not hand one the other's training plan. Encoders work the same way.

A Tour of Encoders: The Right Tool for the Job

Different data needs different encoders. Here's a quick reference of the main encoder families:

  • RNN / LSTM / GRU: Sequential text or time-series. Processes input one step at a time, maintaining a hidden state. Good for tasks where order matters deeply.
  • CNN (1D / 2D): Images (2D) or local pattern detection in text (1D). Uses convolutional filters to detect features in local regions.
  • Transformer Encoder: Text and most modern NLP tasks. Uses self-attention to weigh every token against every other token in parallel. The current default for most NLP work.
  • GNN (Graph Neural Network): Graph or network-structured data (social networks, molecules). Uses message passing between nodes.
  • Autoencoder: Unsupervised compression, anomaly detection, dimensionality reduction. Learns a bottleneck representation by trying to reconstruct its own input.
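
To make the list concrete, here is a minimal PyTorch sketch, not a tuned implementation, of three of these families turning the same batch of token ids into one fixed-size vector per sequence. The vocabulary size, dimensions, and pooling choices (last hidden state, max-pool, mean-pool) are assumptions picked for illustration, and the Transformer here omits the positional encoding a real text model would add.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
VOCAB, EMB, HID, BATCH, SEQ_LEN = 100, 16, 32, 2, 5

tokens = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))  # fake token ids
embed = nn.Embedding(VOCAB, EMB)
x = embed(tokens)                                   # (BATCH, SEQ_LEN, EMB)

# 1) Recurrent encoder: read one step at a time, keep the final hidden state.
gru = nn.GRU(EMB, HID, batch_first=True)
_, h_n = gru(x)                                     # h_n: (1, BATCH, HID)
rnn_vec = h_n[-1]                                   # (BATCH, HID)

# 2) 1D CNN encoder: slide filters over local windows, then max-pool over time.
conv = nn.Conv1d(EMB, HID, kernel_size=3, padding=1)
feat = torch.relu(conv(x.transpose(1, 2)))          # (BATCH, HID, SEQ_LEN)
cnn_vec = feat.max(dim=2).values                    # (BATCH, HID)

# 3) Transformer encoder: self-attention over all positions, then mean-pool.
layer = nn.TransformerEncoderLayer(
    d_model=EMB, nhead=4, dim_feedforward=64, batch_first=True
)
transformer = nn.TransformerEncoder(layer, num_layers=1)
tr_vec = transformer(x).mean(dim=1)                 # (BATCH, EMB)

print(rnn_vec.shape, cnn_vec.shape, tr_vec.shape)
```

Whatever the mechanism, each encoder ends with one vector per input sequence, which is the embedding the rest of the network consumes.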

Checkpoint

Which architecture is the most common and powerful starting point for state-of-the-art results on a text sentiment analysis task today?

Which Encoder Should You Choose?

As a rough heuristic: sequential text or time-series → Transformer encoder (or LSTM for smaller datasets). Image data → CNN. Graphs → GNN. Unsupervised compression → Autoencoder. When in doubt, benchmark.
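
The heuristic can be written down as a tiny lookup. This is only a sketch: `suggest_encoder` and its category names are hypothetical, and the result is a starting point to benchmark rather than a rule.

```python
def suggest_encoder(data_kind: str, small_dataset: bool = False) -> str:
    """Return a starting-point encoder family for a kind of data.

    Hypothetical helper: the category names are illustrative, and the
    answer is a first guess to benchmark, not a guarantee.
    """
    if data_kind in ("text", "time-series"):
        # Sequences: Transformer by default, LSTM when data is scarce.
        return "LSTM" if small_dataset else "Transformer encoder"
    table = {
        "image": "CNN",
        "graph": "GNN",
        "unsupervised compression": "Autoencoder",
    }
    # Anything the heuristic does not cover: when in doubt, benchmark.
    return table.get(data_kind, "benchmark several candidates")


print(suggest_encoder("text"))
print(suggest_encoder("time-series", small_dataset=True))
print(suggest_encoder("graph"))
```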

Beyond Sequences: Autoencoders, GNNs, and More

This series focuses on sequence encoders, but two other families are worth knowing because they are built around different goals and data shapes:

Autoencoder: The encoder half typically uses convolutional or dense layers to progressively shrink the input down to a low-dimensional bottleneck vector. The decoder then tries to reconstruct the original input from this bottleneck. Because everything the decoder needs has to pass through that bottleneck, the encoder learns to keep the most salient features in compressed form, discarding noise and retaining essential structure.

Conceptual dry run (simple autoencoder): Imagine encoding a 4-pixel grayscale image [0.1, 0.8, 0.2, 0.7] into a latent vector of just 2 numbers.

  1. Encoding: The encoder transforms the 4-value input into a 2-value vector, say [0.55, -0.23]. This is the compressed essence.
  2. Decoding: The decoder takes this 2-value vector and tries to expand it back to 4 values.
  3. Learning: The system adjusts weights to make the decoded output close to the original [0.1, 0.8, 0.2, 0.7], forcing the encoder to learn a useful compression.
python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the run repeatable


# 4 → 2 → 4 autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Tanh lets latent values go negative and avoids "dead" ReLU
        # units, which are a real risk in a bottleneck this small.
        self.encoder = nn.Sequential(nn.Linear(4, 2), nn.Tanh())
        # Sigmoid keeps reconstructed pixel values in [0, 1].
        self.decoder = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # compress to latent vector
        return self.decoder(z), z    # reconstruct + return latent


x = torch.tensor([[0.1, 0.8, 0.2, 0.7]])
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for step in range(600):
    recon, z = model(x)
    loss = loss_fn(recon, x)         # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluate once more with the final weights, without tracking gradients.
with torch.no_grad():
    recon, z = model(x)
    loss = loss_fn(recon, x)


def fmt(t):
    """Round the single row of a (1, n) tensor for printing."""
    return [round(v, 3) for v in t.tolist()[0]]


print(f"original     : {fmt(x)}")
print(f"latent z     : {fmt(z)}")        # 2 numbers: the compressed essence
print(f"reconstructed: {fmt(recon)}")    # should be close to original
print(f"loss         : {loss.item():.6f}")

Graph Neural Network (GNN): Designed for graph data (nodes, edges). GNN encoders typically use message passing: each node aggregates information from its neighbours, updates its own representation (embedding), and repeats. After k rounds, a node's embedding reflects the structure of its k-hop neighbourhood, which is how node embeddings come to capture local graph structure.
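
A rough sketch of two rounds of message passing in plain PyTorch, assuming a toy 4-node path graph, mean aggregation with self-loops, and a single linear update per round. The class name `MeanMessagePassing` and all sizes are made up for illustration; real GNN libraries offer many other aggregation and update rules.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy undirected graph with 4 nodes in a path: 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
num_nodes, feat_dim, hid_dim = 4, 3, 8

adj = torch.zeros(num_nodes, num_nodes)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0
adj = adj + torch.eye(num_nodes)           # self-loops: a node keeps its own features
adj = adj / adj.sum(dim=1, keepdim=True)   # row-normalise, so aggregation is a mean


class MeanMessagePassing(nn.Module):
    """One round: average neighbour features, then apply a learned update."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.update = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj_norm):
        messages = adj_norm @ h            # aggregate: (num_nodes, in_dim)
        return torch.relu(self.update(messages))


h0 = torch.randn(num_nodes, feat_dim)      # random initial node features
layer1 = MeanMessagePassing(feat_dim, hid_dim)
layer2 = MeanMessagePassing(hid_dim, hid_dim)

h1 = layer1(h0, adj)   # each node now reflects its 1-hop neighbourhood
h2 = layer2(h1, adj)   # after a second round, its 2-hop neighbourhood
print(h2.shape)
```

In this sketch node 0 is two hops from node 2 and three hops from node 3, so after two rounds its embedding can depend on node 2's features but not yet on node 3's.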

Pause & Reflect

Think of an edge case where a CNN might outperform an RNN on short text inputs.

Chasing the Hype

A common mistake is defaulting to the most complex or "hot" architecture, such as a Transformer, for every problem. For smaller datasets or simpler sequence tasks, a well-tuned LSTM or even a 1D CNN can be faster to train, require less data, and perform just as well. Always match the tool to your specific task, data, and compute constraints.

TL;DR

Encoders are the translation layer between the world and the model. Raw pixels, words, molecular graphs: none of it is usable until it has been compressed into a vector a network can operate on. Each kind of data has an architecture built around its structure. Pick the wrong one and you are not just leaving performance on the table, you are making the problem harder than it needs to be.

Summary

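Every major data structure has an encoder built around it. Sequences get recurrence, grids get convolution, graphs get message passing. The field essentially reverse-engineered the architecture from the shape of the data. Transformers are the odd one out. They assume little about the shape of the input, so they work on almost everything, but that generality is expensive: standard self-attention compares every position with every other, so its cost grows quadratically with input length, which is a large part of why they often cost the most to run.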

Next: how LSTMs and Transformers actually encode sequences, and why they handle the gradient problem differently.