How Do Encoders Distill Information?

So we know deep networks have a gradient problem. Gradients vanish or explode, and early layers stop learning. The obvious next question is: how do you actually build something useful on top of that? The answer starts with the encoder.

An encoder has one job: take raw, messy input and compress it into something a neural network can actually work with.

Think back to reading a dense novel. You don't memorize every word; you extract the essence: the plot, themes, characters. An encoder does something similar for AI. It's a specialized component that reads input data (text, images, audio, graphs, you name it!) and compresses it into a concise, numerical summary. This summary is often called a latent representation, context vector, or embedding. It is not usually human-readable, but it captures the crucial features of the input in a form that the rest of the network can operate on.

The architecture changes depending on the data. That is not a design flaw, that is the point. A sprinter and a marathon runner are both athletes with the same goal: cross the finish line. But you would not hand one the other's training plan. Encoders work the same way.

A Tour of Encoders: The Right Tool for the Job

Different data needs different encoders. Here's the landscape.


RNN / LSTM: Reads data step-by-step, updating a hidden state at each position. LSTMs add a separate cell state and three gates (forget, input, output), letting the network selectively remember or discard context over long sequences. This gating mitigates, though does not eliminate, the vanishing gradient problem that cripples vanilla RNNs.

best for

›Tasks where strict token order matters — music generation, handwriting synthesis
›Smaller datasets where Transformers would overfit
›Streaming or online inference: processes one step at a time, no need to see the full sequence

watch out for

›Sequential processing can't be parallelised — slow to train on modern hardware
›Still struggles on very long sequences even with gating
›Largely superseded by Transformers for most NLP — only prefer it on tight compute budgets

e.g. ELMo, ULMFiT, original seq2seq (Sutskever et al.)

Covered in this series.
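To make the gating described in the RNN / LSTM entry concrete, here is a minimal sketch of a single LSTM step in NumPy. The weights are random placeholders (nothing is trained), and the names `lstm_step`, `W`, and `b` are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked gate pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: how much old cell state to keep
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: how much new candidate to write
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: how much cell state to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate values
    c_t = f * c_prev + i * g                # additive cell update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.5, size=(4 * hidden_size, hidden_size + input_size))  # untrained placeholder
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
sequence = rng.normal(size=(5, input_size))
for x_t in sequence:                        # strictly one step at a time
    h, c = lstm_step(x_t, h, c, W, b)

print(h.shape, c.shape)
```

The loop is the point: each step needs the previous `h` and `c`, which is why this style of encoder suits streaming input but cannot be parallelised across time steps.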

CNN (1D / 2D): Convolutional filters slide across the input detecting local features such as edges, textures, and shapes. Stacking layers builds a feature hierarchy from simple to complex. Weight sharing makes CNNs parameter-efficient and translation-equivariant: a shifted input produces a correspondingly shifted feature map, and pooling then adds a degree of translation invariance. 1D convolutions apply the same idea to sequences, excelling at local pattern detection without the sequential bottleneck of RNNs.

best for

›Detecting local spatial patterns efficiently via parameter sharing
›Translation-invariant tasks — the model detects a cat whether it's top-left or bottom-right
›Fast inference — highly optimised CUDA kernels, lower memory than Transformers

watch out for

›Receptive field grows slowly — poor at capturing long-range dependencies without deep stacking
›Doesn't naturally handle variable-length sequences
›Sensitive to scale and rotation without aggressive data augmentation

e.g. ResNet, EfficientNet, VGG, ConvNeXt; TextCNN for NLP
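The weight-sharing idea in the CNN entry can be sketched with a hand-written 1D convolution. This is a toy illustration with a hand-picked kernel rather than a learned one; `conv1d_valid` is a hypothetical helper name.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Slide one shared kernel over the signal (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

kernel = np.array([-1.0, 0.0, 1.0])     # responds positively to a rising edge
signal = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
shifted = np.roll(signal, 2)            # same pattern, moved two steps to the right

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

print(out)
print(out_shifted)
print(out.max(), out_shifted.max())     # a global max-pool gives the same value for both
```

The same three weights are reused at every position, so the response to the shifted input is simply the shifted response. Pooling over positions is what turns that into invariance.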

Transformer encoder: Processes the entire sequence in parallel via self-attention, with every token attending to every other token simultaneously. This captures both local and global dependencies regardless of distance. Positional encodings add order information. Bidirectional context makes it well suited to understanding tasks: unlike decoder-only models, it sees the full sequence at once.

best for

›Any NLP task requiring global context — sentiment, NER, reading comprehension
›Transfer learning: fine-tune a pre-trained encoder with minimal task-specific data
›Relationships between distant tokens — 'The cat that the dog chased ran away'

watch out for

›Attention is O(n²) in sequence length — expensive for very long inputs
›Data hungry: needs a large pre-training corpus to learn strong representations
›Positional encoding requires careful handling for order-sensitive tasks

e.g. BERT, RoBERTa, DistilBERT, ALBERT, DeBERTa, ModernBERT

Covered in this series.
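As a rough illustration of the all-pairs computation in the Transformer encoder entry, here is a single-head self-attention sketch in NumPy. The projection matrices are random placeholders rather than trained parameters, and a real encoder adds multiple heads, residual connections, layer normalisation, and positional information.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Bidirectional scaled dot-product attention over a (n_tokens, d_model) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n_tokens, n_tokens): every token vs every token
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8
X = rng.normal(size=(n_tokens, d_model))      # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)  # (6, 6)
```

The `weights` matrix has one entry per pair of tokens, which is where the O(n²) cost in sequence length comes from.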

Decoder-only Transformer: Uses causal (masked) self-attention, so each token attends only to itself and earlier tokens, never future ones. Designed for autoregressive generation: predict the next token, append it, repeat. Most modern large language models are decoder-only. They carry context forward through cached keys and values (the KV cache), but lack the bidirectional view of encoder-only models like BERT. Key innovations vary per model: LLaMA uses RoPE and GQA; DeepSeek-V3 uses Multi-head Latent Attention (MLA) to compress the KV cache and Mixture-of-Experts (MoE) routing; Qwen has used dynamic NTK-aware scaling to extend context. Open-weight families such as LLaMA, Qwen, DeepSeek, and Mistral belong here. GPT-4 and Claude are closed models whose architectures are not fully public, though both generate text autoregressively.

best for

›Open-ended generation, dialogue, and reasoning via chain-of-thought
›Few-shot and zero-shot tasks through prompting alone
›Tasks where encoding and generation happen in the same forward pass

watch out for

›Each token is blind to future context — limits pure understanding tasks vs. BERT
›KV cache grows linearly with context length — expensive for very long prompts
›Highly sensitive to prompt phrasing: small wording changes can shift output significantly

e.g. GPT-4, Claude 3/4 (Anthropic), LLaMA 3 (Meta), Qwen 2.5 (Alibaba), DeepSeek-V3/R1, Mistral, Gemma 2
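The causal mask in the decoder-only entry can be sketched in a few lines. This is a simplified single-head version that uses the input itself as queries, keys, and values (an assumption made to keep the example short; real models use learned projections).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X):
    """Single-head attention with Q = K = V = X and a lower-triangular (causal) mask."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to future positions
    weights = softmax(scores, axis=-1)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
out, weights = causal_self_attention(X)

# Change only the last token: earlier outputs must not move.
X_changed = X.copy()
X_changed[-1] += 10.0
out_changed, _ = causal_self_attention(X_changed)

print(np.round(weights, 2))
print(np.allclose(out[:-1], out_changed[:-1]))
```

Because earlier positions never depend on later ones, their keys and values can be cached and reused during generation, which is the role of the KV cache.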

State space model (SSM): SSMs model sequences as discretised continuous-time dynamical systems, carrying a fixed-size state from step to step. Unlike attention, computation scales linearly with sequence length, avoiding the Transformer's O(n²) bottleneck. Linear time-invariant SSMs such as S4 can be trained as parallel convolutions and run at inference as efficient RNNs. Mamba makes the state update input-dependent (selective), which gives up the convolutional form in favour of a parallel scan but substantially improves recall on long sequences. Jamba (AI21 Labs) was introduced as the first production-grade Mamba-based hybrid: it interleaves Mamba SSM layers with Transformer attention layers, combining linear-complexity sequence processing with global attention.

best for

›Very long sequences where O(n²) attention is prohibitive — genomics, hour-long audio
›Memory-efficient inference: recurrent mode processes tokens one at a time with fixed state size
›Hybrid architectures: Jamba-style interleaving captures best of SSM speed and Transformer quality

watch out for

›Recall degrades on precise lookback — 'needle in a haystack' retrieval still favours Transformers
›Smaller ecosystem: fewer fine-tuning recipes, less tooling than Transformers
›Pure SSMs often underperform hybrid variants; Jamba / Zamba2 are strong baselines

e.g. S4, Mamba, Mamba-2, Jamba (Mamba + Transformer, AI21 Labs), Zamba2
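The "convolution for training, recurrence for inference" duality applies to linear time-invariant SSMs, and a small NumPy sketch can check it numerically. The parameters `A`, `B`, and `C` below are random, hypothetical values for a toy discrete-time system, not those of S4 or Mamba, and the selective (input-dependent) case is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, seq_len = 4, 12

# Toy discrete-time parameters; diagonal A with entries in (0, 1) keeps the state stable.
A = np.diag(rng.uniform(0.5, 0.95, size=state_size))
B = rng.normal(size=state_size)
C = rng.normal(size=state_size)
u = rng.normal(size=seq_len)              # scalar input sequence

# Recurrent view: one step per input, fixed-size state.
h = np.zeros(state_size)
y_recurrent = np.empty(seq_len)
for t in range(seq_len):
    h = A @ h + B * u[t]
    y_recurrent[t] = C @ h

# Convolutional view: precompute kernel[k] = C A^k B, then apply a causal convolution.
kernel = np.empty(seq_len)
A_power = np.eye(state_size)
for k in range(seq_len):
    kernel[k] = C @ A_power @ B
    A_power = A @ A_power
y_convolution = np.convolve(u, kernel)[:seq_len]

print(np.allclose(y_recurrent, y_convolution))
```

Both views compute the same outputs. The recurrent form needs only the current state at inference time, while the convolutional form can process a whole training sequence at once.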

Autoencoder: Compresses input through a bottleneck and reconstructs it on the other side. The bottleneck forces retention of only the most salient features. Because it trains on reconstruction alone, without labels, it fits the structure of its training data; inputs that don't fit that structure tend to show high reconstruction error, which makes autoencoders natural anomaly detectors. Variational autoencoders (VAEs) add a probabilistic latent space enabling controlled generation. VQ-VAE learns discrete latents; the original DALL-E encoded images with a closely related discrete VAE.

best for

›Anomaly detection: normal inputs reconstruct well, outliers don't
›Non-linear dimensionality reduction — richer than PCA for complex distributions
›Denoising: train with noisy input, clean target, model learns to filter noise

watch out for

›Latent space has no continuity guarantee unless you use a VAE — interpolation may produce garbage
›Reconstruction quality trades off against compression ratio
›Not naturally suited for supervised downstream tasks without fine-tuning

e.g. DAE (Denoising AE), VAE, VQ-VAE (a related discrete VAE was used in the original DALL-E), β-VAE
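The anomaly-detection use of the autoencoder entry can be sketched without training a network. The example below swaps in PCA as a stand-in, since a linear autoencoder trained with squared error learns the same subspace; the data, the 99% threshold, and the helper name `reconstruction_error` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" data lies near a 2-D plane inside a 5-D space (a made-up toy distribution).
basis = rng.normal(size=(2, 5))
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 5))

# Linear autoencoder via PCA: encode = project onto the top-2 directions, decode = project back.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
components = Vt[:2]                                   # (2, 5) bottleneck directions

def reconstruction_error(x):
    z = (x - mean) @ components.T                     # encode: 5 -> 2
    x_hat = z @ components + mean                     # decode: 2 -> 5
    return np.sum((x - x_hat) ** 2, axis=-1)

threshold = np.quantile(reconstruction_error(normal), 0.99)

typical = rng.normal(size=2) @ basis                  # a new point on the plane
outlier = typical + 3.0 * Vt[-1]                      # pushed off the plane

print(reconstruction_error(typical) < threshold)      # reconstructs well
print(reconstruction_error(outlier) > threshold)      # flagged as anomalous
```

A non-linear autoencoder follows the same recipe: fit on normal data, pick a threshold on reconstruction error, and flag inputs that exceed it.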

Sparse autoencoder (SAE): An over-complete autoencoder trained on the internal activations of a larger model rather than raw data. A sparsity penalty ensures that only a few features are active for any given input. The aim is to decompose dense, polysemantic neuron activations into a large dictionary of more interpretable, ideally monosemantic, features. Anthropic has used SAEs to find features in Claude's residual stream corresponding to concepts like the Golden Gate Bridge or errors in code. This is an active area of AI safety and interpretability research.

best for

›Decomposing LLM activations into human-readable, monosemantic features
›Understanding what concepts a trained model has encoded internally
›Safety research: identifying and potentially steering internal representations

watch out for

›Extremely niche — not used for general encoding tasks, purely for interpretability
›Over-complete dictionaries (e.g. 131K features) are expensive to train and analyse
›Active research area: best practices, evaluation metrics, and scaling laws still evolving

e.g. Anthropic SAE research (Claude residual streams), EleutherAI / DeepMind interpretability work
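A minimal sketch of the sparse autoencoder objective, assuming a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The parameters are random and untrained, the activations are synthetic stand-ins for a model's residual stream, and published SAE recipes differ in their details.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64            # over-complete: 4x more features than activation dims

# Hypothetical, untrained parameters; a real SAE learns these by gradient descent.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.full(n_features, -0.5)       # negative bias pushes most features to zero
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(activations, l1_coeff=1e-3):
    # ReLU makes feature activations non-negative and exactly zero when inactive
    features = np.maximum(0.0, (activations - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    mse = np.mean((activations - reconstruction) ** 2)         # reconstruction term
    l1 = np.mean(np.sum(np.abs(features), axis=-1))            # sparsity term
    return features, reconstruction, mse + l1_coeff * l1

activations = rng.normal(size=(8, d_model))   # synthetic stand-in for residual-stream activations
features, reconstruction, loss = sae_forward(activations)
active_fraction = np.mean(features > 0)

print(features.shape, round(float(active_fraction), 3))
```

The two loss terms pull in opposite directions: reconstruction wants many active features, the L1 penalty wants few. Training looks for a dictionary where a handful of features explains each activation vector.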

Contrastive encoder: Learns by pulling embeddings of similar pairs together and pushing dissimilar ones apart in the latent space, with no class labels required. Positive pairs are typically augmentations of the same example (SimCLR, MoCo) or paired cross-modal data (CLIP: image + caption). The resulting embedding space supports retrieval, clustering, and zero-shot transfer. CLIP-style models are widely used for image-text matching, and many commercial text-embedding models, including those from OpenAI and Cohere, are reported to use contrastive training.

best for

›Learning strong representations without labels from paired or augmented data
›Zero-shot classification: embed unseen class names and find nearest image
›Retrieval and similarity search across large corpora

watch out for

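›Needs informative negatives: trivially easy ones give little learning signal and weak representations, and dropping negatives altogether invites representation collapse unless the method guards against it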
›Sensitive to batch size: large batches expose more negatives and are critical for quality
›Representations can be anisotropic — embeddings cluster in narrow cones, not spread evenly

e.g. SimCLR, MoCo v3, CLIP (OpenAI), ALIGN (Google), Cohere Embed, OpenAI text-embedding-3
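The pull-together, push-apart objective of the contrastive entry is commonly written as an InfoNCE loss. Below is a minimal NumPy sketch in one direction only (anchors to positives), with in-batch negatives and an assumed temperature of 0.1; the embeddings are random stand-ins, not encoder outputs.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchors, positives, temperature=0.1):
    """Row i of `positives` is the positive for row i of `anchors`; other rows act as negatives."""
    a, p = l2_normalize(anchors), l2_normalize(positives)
    logits = a @ p.T / temperature                          # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                     # cross-entropy with the diagonal as target

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.05 * rng.normal(size=(8, 16))         # positives close to their anchors
shuffled = np.roll(aligned, 1, axis=0)                      # each anchor paired with the wrong positive

loss_aligned = info_nce(anchors, aligned)
loss_shuffled = info_nce(anchors, shuffled)
print(loss_aligned < loss_shuffled)
```

Every other row in the batch serves as a negative, which is one reason larger batches tend to help: each anchor is contrasted against more alternatives.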

Multimodal encoder: Encodes multiple modalities into a shared or aligned embedding space so that semantically similar content across modalities is nearby. Most are trained contrastively on paired data. CLIP aligns images and text; ImageBind aligns six modalities (images and video, text, audio, depth, thermal, IMU) using images as a binding anchor. BLIP-2 and Flamingo bridge frozen image encoders and frozen LLMs with lightweight trainable modules (a Q-Former in BLIP-2, gated cross-attention layers in Flamingo). Encoders like these supply the 'vision' in many multimodal LLMs; GPT-4o and Claude also accept images, although their vision components are not publicly documented.

best for

›Image-text retrieval and zero-shot image classification without per-class training
›Grounding language in visual context — visual question answering, image captioning
›Cross-modal generation: retrieve relevant images given text, or vice versa

watch out for

›Alignment can be superficial — models match pairs without deep semantic understanding
›Fine-tuning on one modality can degrade others (catastrophic forgetting)
›Requires massive paired datasets — noisy web-scraped pairs degrade representation quality

e.g. CLIP (OpenAI), ALIGN (Google), ImageBind (Meta, 6 modalities), BLIP-2, Flamingo
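Zero-shot classification in a shared embedding space reduces to a nearest-neighbour lookup. The sketch below uses made-up 4-dimensional vectors in place of real text and image encoder outputs, so only the mechanics carry over; the prompts in the comments are illustrative.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Made-up embeddings standing in for the outputs of a text encoder and an image encoder
# trained into a shared space. Real models use hundreds of dimensions.
class_names = ["cat", "dog", "car"]
text_embeddings = l2_normalize(np.array([
    [0.9, 0.1, 0.0, 0.2],   # "a photo of a cat"
    [0.1, 0.9, 0.0, 0.2],   # "a photo of a dog"
    [0.0, 0.1, 0.9, 0.1],   # "a photo of a car"
]))
image_embedding = l2_normalize(np.array([0.2, 0.8, 0.1, 0.3]))   # pretend this came from a dog photo

similarities = text_embeddings @ image_embedding     # cosine similarity with each class prompt
predicted = class_names[int(np.argmax(similarities))]
print(predicted)  # dog
```

No per-class training is involved: adding a class means embedding one more text prompt.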

Vision Transformer (ViT): Applies a standard Transformer encoder to images by splitting them into fixed-size patches (e.g. 16×16 px), flattening each patch, and treating the sequence of patch embeddings like tokens. A learnable [CLS] token aggregates global image information. Rivals CNNs at scale and transfers exceptionally well. DINOv2 (a self-supervised ViT) produces features that work for depth estimation, segmentation, and classification without fine-tuning the backbone. SAM (Segment Anything) uses a ViT-H image encoder to produce dense patch embeddings.

best for

›Large-scale image classification after pre-training on massive datasets (ImageNet-21k, JFT)
›Transfer across diverse vision tasks — depth, segmentation, classification from one backbone
›Tasks requiring global spatial relationships — whole-image context, not just local patches

watch out for

›Data hungry: underperforms CNNs on small/medium datasets without pre-training
›No built-in translation invariance — must learn spatial biases from data
›Patch size is a hard hyperparameter — wrong choice hurts both resolution and compute

e.g. ViT (Dosovitskiy et al.), DeiT, CLIP image encoder, DINOv2, SAM image encoder
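The patch-splitting step of the Vision Transformer entry is mostly a reshape. Here is a sketch for a 224×224 RGB image with 16×16 patches; `patchify` is a hypothetical helper name, and the learned linear projection and [CLS] token that follow in a real ViT are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must divide evenly into patches"
    rows, cols = H // patch_size, W // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (rows, cols, patch, patch, C)
    return patches.reshape(rows * cols, patch_size * patch_size * C)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = patchify(image, patch_size=16)
print(patches.shape)  # (196, 768)
```

Halving the patch size to 8 would quadruple the number of tokens to 784, and attention cost grows with the square of that count, which is why patch size trades resolution against compute.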

Graph neural network (GNN): Operates on graph-structured data via message passing: each node aggregates information from its neighbours, updates its own representation, and repeats for several rounds. This captures both node features and local (and eventually global) graph structure. Attention-based variants (GAT) weight neighbour contributions. Graph Transformers apply full self-attention over all nodes, bypassing the neighbourhood constraint at the cost of O(n²) complexity.

best for

›Data with explicit relational structure — molecules, protein interactions, knowledge graphs
›Node classification and link prediction where connectivity patterns matter as much as features
›Molecular property prediction: atoms as nodes, bonds as edges

watch out for

›Over-smoothing: stacking too many layers collapses all node embeddings toward the same value
›Basic variants assume static, undirected, unweighted graphs; directed, weighted, or dynamically changing graphs need extensions
›Requires graph construction — text and images don't have graphs out of the box

e.g. GCN (Kipf & Welling), GraphSAGE, GAT, Graph Transformer; AlphaFold uses GNN-like message passing

Point cloud encoder: Point clouds are sets of 3D coordinates, unordered and irregularly sampled, so standard grid-based CNNs do not apply directly. PointNet processes each point independently then uses a global max-pool to aggregate, achieving permutation invariance. PointNet++ adds hierarchical local grouping. Point Transformer applies self-attention directly to point sets. These encoders are the backbone of LiDAR perception in autonomous vehicles and robotic manipulation systems.

best for

›3D scene understanding from LiDAR sensors — autonomous vehicles, drones, robots
›Shape classification and part segmentation of 3D objects
›Tasks requiring explicit 3D geometry rather than projecting to 2D and using a CNN

watch out for

›Point clouds are sparse and unordered — vanilla CNNs fail, need specialised architectures
›Density varies across the cloud, making uniform processing hard
›3D labelled data is expensive to collect and annotate

e.g. PointNet, PointNet++, Point Transformer, 3DETR, LiDAR-based perception stacks such as Waymo's
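The permutation invariance attributed to PointNet comes from two ingredients: the same function applied to every point, and a symmetric pooling step. A minimal sketch with random, untrained weights (the layer sizes and the name `encode_point_cloud` are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 16))     # hypothetical shared per-point MLP weights (untrained)
W2 = rng.normal(scale=0.5, size=(16, 32))

def encode_point_cloud(points):
    """Apply the same small MLP to every point, then max-pool over points."""
    per_point = np.maximum(0.0, points @ W1)         # (num_points, 16)
    per_point = np.maximum(0.0, per_point @ W2)      # (num_points, 32)
    return per_point.max(axis=0)                     # (32,) global feature, order-independent

cloud = rng.normal(size=(100, 3))                    # 100 points with x, y, z coordinates
shuffled = cloud[rng.permutation(len(cloud))]        # same points, different order

global_feature = encode_point_cloud(cloud)
global_feature_shuffled = encode_point_cloud(shuffled)
print(np.allclose(global_feature, global_feature_shuffled))
```

Max is a symmetric function, so reordering the points cannot change the pooled feature. Mean or sum pooling would give the same guarantee.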

Audio / speech encoder: Often a hybrid: CNN layers extract local patterns from spectrograms or raw waveforms, followed by Transformer or LSTM layers modelling temporal dynamics. Modern self-supervised speech encoders (Wav2Vec 2.0, HuBERT) learn from unlabeled audio by predicting masked portions of the signal, capturing phonetic and prosodic features transferable to downstream tasks with minimal labelled data. Whisper's encoder processes 30-second log-mel spectrogram chunks through a small convolutional stem followed by a Transformer stack.

best for

›Learning from large unlabeled audio corpora via self-supervised pre-training
›Transfer to downstream tasks with very little labelled speech data
›Capturing both spectral patterns (CNN) and temporal dynamics (Transformer) in one pass

watch out for

›Raw audio sequences are extremely long — computationally expensive
›Domain-sensitive: models trained on English speech transfer poorly to other languages and accents
›Evaluation metrics like WER don't always reflect real-world deployment quality

e.g. Wav2Vec 2.0 (Meta), HuBERT (Meta), Whisper encoder (OpenAI), SpeechBERT
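Spectrogram-based audio encoders start from a time-frequency representation. The sketch below computes a plain log-magnitude spectrogram with NumPy's FFT, assuming 16 kHz audio, a 25 ms window, and a 10 ms hop (common choices for speech). It omits the mel filterbank that models such as Whisper apply on top, and `log_spectrogram` is a hypothetical helper name.

```python
import numpy as np

def log_spectrogram(waveform, frame_length=400, hop_length=160):
    """Short-time Fourier transform magnitudes on a log scale: (num_frames, frame_length // 2 + 1)."""
    num_frames = 1 + (len(waveform) - frame_length) // hop_length
    window = np.hanning(frame_length)                 # taper each frame to reduce spectral leakage
    frames = np.stack([
        waveform[i * hop_length : i * hop_length + frame_length] * window
        for i in range(num_frames)
    ])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes + 1e-8)                  # small constant avoids log(0)

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate              # one second of audio
waveform = np.sin(2 * np.pi * 440.0 * t)              # a pure 440 Hz test tone

spec = log_spectrogram(waveform)
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * sample_rate / 400                # bin width = sample_rate / frame_length
print(spec.shape, peak_hz)  # (98, 201) 440.0
```

One second of audio is 16,000 raw samples but only 98 spectrogram frames, which is a large part of why encoders prefer this representation over raw waveforms.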


Here's a quick reference of the main encoder families:

RNN / LSTM / GRU — Sequential text or time-series. Processes input one step at a time, maintaining a hidden state. Good for tasks where order matters deeply.
CNN (1D / 2D) — Images (2D) or local pattern detection in text (1D). Uses convolutional filters to detect features in local regions.
Transformer Encoder — Text and most modern NLP tasks. Uses self-attention to weigh the importance of all words relative to each other simultaneously. The current gold standard.
GNN (Graph Neural Network) — Graph or network-structured data (social networks, molecules). Uses message passing between nodes.
Autoencoder — Unsupervised compression, anomaly detection, dimensionality reduction. Learns a bottleneck representation by trying to reconstruct its own input.

Checkpoint

For achieving state-of-the-art results on a text sentiment analysis task today, which architecture would be the most common and powerful starting point?

Which Encoder Should You Choose?


As a rough heuristic: sequential text or time-series → Transformer encoder (or LSTM for smaller datasets). Image data → CNN. Graphs → GNN. Unsupervised compression → Autoencoder. When in doubt, benchmark.

Beyond Sequences: Autoencoders, GNNs, and More

The series focuses on sequence encoders, but two other families handle things differently:

Autoencoder: The encoder part typically uses layers (like CNNs or dense layers) to progressively shrink the input representation down to a low-dimensional bottleneck vector. The decoder then tries to reconstruct the original input from this bottleneck. By forcing the network to squeeze information through this bottleneck and then reconstruct it, the encoder learns to capture the most salient features in a compressed form, discarding noise and retaining essential structure.

Conceptual dry run (simple autoencoder): Imagine encoding a 4-pixel grayscale image [0.1, 0.8, 0.2, 0.7] into a 2D latent vector.

Encoding: The encoder transforms the 4D input into a 2D vector, say [0.55, -0.23]. This is the compressed essence.
Decoding: The decoder takes this 2D vector and tries to expand it back to 4D.
Learning: The system adjusts weights to make the decoded output close to the original [0.1, 0.8, 0.2, 0.7], forcing the encoder to learn a useful compression.

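```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # make the run reproducible

# 4 → 2 → 4 autoencoder
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Tanh rather than ReLU at the bottleneck: with only two latent units,
        # ReLU can zero both of them for this input and leave the latent stuck at [0, 0].
        self.encoder = nn.Sequential(nn.Linear(4, 2), nn.Tanh())
        # Sigmoid keeps reconstructions in [0, 1], matching grayscale pixel values.
        self.decoder = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # compress to latent vector
        return self.decoder(z), z    # reconstruct + return latent

x         = torch.tensor([[0.1, 0.8, 0.2, 0.7]])
model     = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn   = nn.MSELoss()

# Train on the single example by minimising reconstruction error
for step in range(600):
    recon, z = model(x)
    loss = loss_fn(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluate once more so the printed loss matches the printed reconstruction
with torch.no_grad():
    recon, z = model(x)
    loss = loss_fn(recon, x)

fmt = lambda t: [round(v, 3) for v in t.tolist()[0]]
print(f"original     : {fmt(x)}")
print(f"latent z     : {fmt(z)}")        # 2 numbers, the compressed essence
print(f"reconstructed: {fmt(recon)}")    # should be close to original
print(f"loss         : {loss.item():.6f}")
```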

Graph Neural Network (GNN): Designed for graph data (nodes, edges). GNN encoders typically use message passing — nodes aggregate information from neighbours, update their own representation (embedding), and repeat. This allows node embeddings to capture local graph structure.
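Message passing can be sketched on a four-node path graph. This toy version only averages neighbour embeddings and leaves out the learned weight matrices and non-linearities that a real GNN layer applies after aggregation; `message_passing_round` is a hypothetical helper name.

```python
import numpy as np

# A small undirected graph: a path 0 - 1 - 2 - 3, stored as an adjacency matrix.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
features = np.array([[1.0], [0.0], [0.0], [0.0]])    # only node 0 starts with a signal

def message_passing_round(adjacency, h):
    """Each node averages its own embedding with its neighbours' embeddings."""
    with_self_loops = adjacency + np.eye(len(adjacency))
    degree = with_self_loops.sum(axis=1, keepdims=True)
    return (with_self_loops @ h) / degree

h1 = message_passing_round(adjacency, features)      # information travels one hop
h2 = message_passing_round(adjacency, h1)            # and then a second hop

print(h1.ravel())
print(h2.ravel())
```

After one round, node 2 (two hops from node 0) is still zero; after two rounds it has picked up some of the signal, while node 3 has not. Each round widens a node's view by one hop, which is how stacked layers capture progressively larger neighbourhoods.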


Pause & Reflect

Think of an edge case where a CNN might outperform an RNN on short text inputs.

Chasing the Hype

A common mistake is defaulting to the most complex or "hot" architecture — like a Transformer — for every problem. For smaller datasets or simpler sequence tasks, a well-tuned LSTM or even a 1D CNN can be faster to train, require less data, and perform just as well. Always match the tool to your specific task, data, and compute constraints.

TL;DR

Encoders are the translation layer between the world and the model. Raw pixels, words, molecular graphs: none of it means anything until it has been compressed into a vector a network can operate on. Text, images, graphs: each has an architecture built around its structure. Pick the wrong one and you are not just leaving performance on the table, you are making the problem harder than it needs to be.

Summary

Every major data structure has an encoder built around it. Sequences get recurrence, grids get convolution, graphs get message passing. The field essentially reverse-engineered the architecture from the shape of the data. Transformers are the odd one out: they work on almost everything, but all-pairs attention also makes them among the most expensive to run.

Next: how LSTMs and Transformers actually encode sequences, and why they handle the gradient problem differently.