What Makes Transformers So Powerful?
A deep dive into the Transformer architecture, explaining how self-attention and positional encodings allow models to process entire sequences at once, revolutionizing NLP.
LSTMs and GRUs were a huge leap, letting networks remember things for longer. But they still read sequentially — like painstakingly reading a book one word after another. What if you could gulp down the whole sentence at once, letting every word simultaneously understand its relationship with all others? That's where Transformers strut in, introduced in the groundbreaking paper "Attention Is All You Need" (Vaswani et al., 2017), and changing how the field thought about sequences entirely.
These models, particularly their encoder part, achieve this by ditching recurrence altogether. Instead, they use a powerful mechanism called self-attention. This allows each word, when being processed, to look at and weigh the importance of all other words in the sequence simultaneously. Long-range dependencies become much easier to capture, because any two words are connected in a single attention step regardless of how far apart they sit. And because there is no recurrence, all positions in a sequence can be processed in parallel during training.
The Transformer Encoder Layer: A High-Level View
A Transformer encoder is typically built by stacking multiple identical layers. Each layer has two main sub-components:
1. Multi-Head Self-Attention (MHSA): The star player. It allows each word to "attend" to all other words (including itself) in the sequence, figuring out which ones are most relevant for understanding its own meaning in context.
2. Position-wise Feed-Forward Network (FFN): A standard feed-forward network applied independently to each position's representation after attention. It helps process the attention output further.
Crucially, each sub-layer also has a residual connection followed by layer normalisation. These are vital for stabilising training in deep architectures.
Input Preparation: Embeddings + Positional Encoding
Before the sequence enters the first encoder layer, two things are needed: the base meaning of each word (embedding) and its position in the sequence (positional encoding).
1. Word Embeddings
Each word / token is converted into a numerical vector. These embeddings are either pre-trained or learned from scratch during training.
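As a rough illustration of the lookup step, the sketch below uses a hypothetical five-entry vocabulary and a randomly initialised embedding table. The names `vocab`, `embedding_table`, and `embed` are made up for this example, and in a real model the table values would be pre-trained or learned rather than random.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> integer id (an assumption for illustration)
vocab = {"the": 0, "cat": 1, "chased": 2, "mouse": 3, "<unk>": 4}
d_model = 8

rng = np.random.default_rng(0)
# One row per token; random here, pre-trained or learned in practice
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of tokens to a (seq_len, d_model) array of embeddings."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embedding_table[ids]

X = embed(["the", "cat", "chased", "the", "mouse"])
print(X.shape)  # (5, 8)
```

Both occurrences of "the" receive exactly the same vector here, since the lookup depends only on the token and not on where it appears.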
2. Positional Encoding (PE)
Since Transformers look at all words at once (unlike RNNs), they don't inherently know word order. "The cat chased the mouse" looks the same as "the mouse chased the cat" without it. To solve this, we add positional encodings: unique "address labels" for each position. The original paper proposed using sine and cosine waves, where $pos$ is the position index, $i$ is the dimension index, and $d_{model}$ is the embedding dimension:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
import numpy as np

def positional_encoding(max_seq_len, d_model):
    PE = np.zeros((max_seq_len, d_model))
    positions = np.arange(max_seq_len)[:, np.newaxis]  # (seq, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]     # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    PE[:, 0::2] = np.sin(angles)  # even dims
    PE[:, 1::2] = np.cos(angles)  # odd dims
    return PE

PE = positional_encoding(max_seq_len=6, d_model=8)
print("PE shape:", PE.shape)          # (6, 8)
print("Position 0:", PE[0].round(3))  # [0. 1. 0. 1. 0. 1. 0. 1.]
print("Position 1:", PE[1].round(3))  # unique address for token 1
# Each position gets a unique fingerprint: this is how the Transformer
# tells "cat chased dog" from "dog chased cat" despite looking at all
# tokens simultaneously.

Multi-Head Self-Attention (MHSA): The Core Mechanism
MHSA is the core of everything. It lets the model ask, for each word: which other words in this sentence actually matter for understanding what I mean?
Generating Query, Key, and Value Vectors
For each word's input vector $x$ (embedding + PE), self-attention first creates three distinct vectors by multiplying the input with learned weight matrices $W^Q$, $W^K$, $W^V$:
- Query (Q): Represents the current word's perspective — asking "What am I looking for?"
- Key (K): Represents the word's identity — "What information do I hold?"
- Value (V): Represents the actual information — "If I'm relevant, what should I provide?"
Calculating Attention Scores
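To figure out how much attention word $i$ should pay to word $j$, the model calculates a score based on their Query and Key. The dot product measures similarity; scaling by $\sqrt{d_k}$ stabilises gradients:

$$\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$

These scores are then passed through softmax to get attention weights: positive numbers summing to 1. The final output for word $i$ is a weighted sum of all Value vectors:

$$\alpha_{ij} = \frac{\exp(\text{score}(i, j))}{\sum_{j'} \exp(\text{score}(i, j'))}, \qquad z_i = \sum_{j} \alpha_{ij}\, v_j$$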
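To see why the scaling matters, here is a small numerical sketch. It assumes queries and keys whose components are independent with unit variance (an assumption chosen for illustration, not something a trained model guarantees). Under that assumption the spread of raw dot products grows with the key dimension, which pushes softmax towards near one-hot outputs; dividing by the square root of the key dimension brings the scores back to a moderate range.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512
n_samples = 2000

# Random queries and keys with unit-variance components (illustrative assumption)
q = rng.normal(size=(n_samples, d_k))
k = rng.normal(size=(n_samples, d_k))

raw = (q * k).sum(axis=-1)    # one unscaled dot product per (q, k) pair
scaled = raw / np.sqrt(d_k)   # the scaled score used in attention

print("std of raw scores:   ", raw.std())     # on the order of sqrt(d_k)
print("std of scaled scores:", scaled.std())  # on the order of 1

# Effect on softmax for a single query attending over 10 keys
keys = rng.normal(size=(10, d_k))
query = rng.normal(size=(d_k,))
w_raw = softmax(keys @ query)
w_scaled = softmax(keys @ query / np.sqrt(d_k))
print("largest weight, unscaled:", w_raw.max())
print("largest weight, scaled:  ", w_scaled.max())
```

When one weight dominates this strongly, the softmax gradient with respect to the other scores is close to zero, which is the training problem the scaling is meant to avoid.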
The 'Multi-Head' Aspect
Instead of doing this once, Transformers run the Q, K, V process multiple times in parallel; each "head" $h$ has its own $W^Q_h$, $W^K_h$, $W^V_h$. Each head can learn to focus on different types of relationships (e.g., syntax vs. semantics). Outputs from all heads are concatenated and passed through a final linear layer $W^O$.
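The split, attend, concatenate, and project steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `multi_head_self_attention` is hypothetical, the weights are random, and the per-head matrices are packed side by side into single `(d_model, d_model)` arrays (a common implementation choice that is equivalent to giving each head its own smaller matrix).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

    # Scaled dot-product attention, computed independently for each head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores)
    heads = weights @ V                                   # (heads, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model), then apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 3, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)   # (3, 8)
print(attn.shape)  # (2, 3, 3): one attention matrix per head
```

Each head works in a smaller subspace of size `d_model / num_heads`, so the total cost stays comparable to a single full-width head.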
Test Your Understanding
In self-attention, which two vectors are primarily used to calculate the similarity between two tokens?
For the Curious: A Quick Dry Run & Code
Scaled Dot-Product Attention — From Scratch
Three tokens, four dimensions. No pretrained weights — just the raw mechanism.
import numpy as np

def softmax(x):
    # Subtract the row maximum before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy: 3 tokens ("The", "cat", "sat"), d_model = 4
np.random.seed(42)
X = np.random.randn(3, 4)  # token embeddings (3 × 4)

# Learned weight matrices (random here; normally trained)
Wq = np.random.randn(4, 4)
Wk = np.random.randn(4, 4)
Wv = np.random.randn(4, 4)

Q, K, V = X @ Wq, X @ Wk, X @ Wv  # (3 × 4) each
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)   # raw scores (3 × 3)
weights = softmax(scores)         # each row sums to 1.0
output = weights @ V              # context-rich reps (3 × 4)

print("Attention weights (row i = how much token i attends to each token):")
print(weights.round(3))
print("\nOutput shape:", output.shape)
# Reading the matrix: a value of weights[0][2] close to 1.0 would mean
# "The" is attending mostly to "sat".
# Real BERT-base Layer 9 Head 11 weights are in the dry-run below.

Final Touches: FFN, Add & Norm, and Impact
Position-wise Feed-Forward Network (FFN)
Following MHSA (and Add & Norm), each position's vector is passed independently through an identical FFN, typically two linear layers with a GELU activation (the original paper used ReLU):

$$\text{FFN}(x) = \text{GELU}(x W_1 + b_1)\, W_2 + b_2$$
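A minimal NumPy sketch of the position-wise FFN is shown below. It makes a few assumptions for illustration: the weights are random rather than trained, GELU is computed with its widely used tanh approximation, and the hidden width `d_ff` is set to four times `d_model`, mirroring the ratio in the original paper.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Apply the same two-layer network to every position (row) of x."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 3, 8, 32  # d_ff = 4 * d_model (assumed ratio)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (3, 8)

# "Position-wise": feeding one row on its own matches that row of the full result
same = np.allclose(feed_forward(x[1:2], W1, b1, W2, b2), y[1:2])
print(same)  # True
```

Because no position sees any other inside the FFN, all mixing of information between tokens happens in the attention sub-layer.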
Residual Connections and Layer Normalisation (Add & Norm)
Both the MHSA and FFN sub-layers are wrapped by a residual connection ($x + \text{Sublayer}(x)$) and layer normalisation, which stabilises the network by normalising activations:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$
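The wrapper can be sketched as below. This is an illustrative version with the residual added first and normalisation applied afterwards; `add_and_norm` is a hypothetical helper name, the scale and shift parameters `gamma` and `beta` are fixed to identity values instead of being learned, and the sub-layer is a stand-in function rather than real attention or an FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each position's vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by layer normalisation: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
x = rng.normal(size=(seq_len, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)  # learned in practice; identity here

# Stand-in sub-layer (hypothetical): any map from (seq, d_model) to (seq, d_model)
W = rng.normal(size=(d_model, d_model))
out = add_and_norm(x, lambda h: np.tanh(h @ W), gamma, beta)

print(out.mean(axis=-1).round(6))  # approximately 0 for every position
print(out.std(axis=-1).round(3))   # approximately 1 for every position
```

The residual path lets the input pass through unchanged alongside the sub-layer output, which is part of why deep stacks of these layers remain trainable.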
Real-World Impact
At this point, Transformers are everywhere:
- ChatGPT & GPT series — Use Transformer decoders for text generation.
- BERT — Uses Transformer encoders for language understanding.
- Google Translate — Leverages Transformers for translation.
- AlphaFold — Uses attention inspired by Transformers for protein folding.
Key Takeaways
- Transformers ditch sequential processing for self-attention, looking at all words at once.
- Positional Encodings are added to embeddings to give the model a sense of word order.
- Multi-Head Self-Attention uses Q, K, V vectors to build context-aware representations.
- Each attention head can learn different types of linguistic relationships in parallel.
- Highly parallelisable for training, but complexity grows as $O(n^2)$ with sequence length $n$.
Transformer encoders ditch recurrence for self-attention, letting each word weigh the importance of all others simultaneously. Key components: positional encodings (for order), multi-head self-attention (Q, K, V → scores → weights → weighted V sum), feed-forward networks, and Add & Norm (residual connections + layer norm). This excels at long-range dependencies and parallelisation but has $O(n^2)$ complexity with sequence length $n$.
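The quadratic cost comes from every token scoring every other token. A quick count makes the growth concrete; the helper name `attention_matrix_entries` is hypothetical and the sequence lengths are arbitrary example values.

```python
seq_lens = [128, 256, 512, 1024]

def attention_matrix_entries(n):
    # Each of n tokens computes a score against each of n tokens
    return n * n

for n in seq_lens:
    print(f"n = {n:5d} -> {attention_matrix_entries(n):>9,d} scores per head")
```

Doubling the sequence length quadruples the number of scores each head has to compute and store.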