What Makes Transformers So Powerful?

LSTMs and GRUs were a huge leap, letting networks remember things for longer. But they still read sequentially, like painstakingly reading a book one word after another. What if you could gulp down the whole sentence at once, letting every word simultaneously understand its relationship with all others? That's where Transformers strut in. Introduced in the groundbreaking paper "Attention Is All You Need" (Vaswani et al., 2017), they changed how the field thought about sequences entirely.

These models, particularly their encoder part, achieve this by ditching recurrence altogether. Instead, they use a mechanism called self-attention, which allows each word, when being processed, to look at and weigh the importance of all other words in the sequence simultaneously. Long-range dependencies become far easier to capture, because any two words are connected in a single attention step no matter how far apart they sit. And because there is no recurrence, every position in the sequence can be processed in parallel during training.

every word, suspiciously watching every other word

The Transformer Encoder Layer: A High-Level View

A Transformer encoder is typically built by stacking multiple identical layers. Each layer has two main sub-components:

1. Multi-Head Self-Attention (MHSA): The star player. It allows each word to "attend" to all other words (including itself) in the sequence, figuring out which ones are most relevant for understanding its own meaning in context.
2. Position-wise Feed-Forward Network (FFN): A standard feed-forward network applied independently to each position's representation after attention. It transforms the attention output further, adding a non-linearity at each position.

Crucially, each sub-layer also has a residual connection followed by layer normalisation. These are vital for stabilising training in deep architectures.

Input Preparation: Embeddings + Positional Encoding

Before the sequence enters the first encoder layer, two things are needed: the base meaning of each word (embedding) and its position in the sequence (positional encoding).

1. Word Embeddings

Each word / token is converted into a numerical vector. These embeddings are either pre-trained or learned from scratch during training.
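As a rough illustration of the lookup, here is a minimal sketch. The five-word vocabulary, the `embed` helper, and the randomly initialised table are all made up for this example; a real model would learn or load its embeddings.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "chased": 2, "mouse": 3, "<unk>": 4}
d_model = 8

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))  # one row per token

def embed(tokens):
    """Map a list of tokens to a (seq_len, d_model) array of embeddings."""
    ids = [vocab.get(t.lower(), vocab["<unk>"]) for t in tokens]
    return embedding_table[ids]

X = embed(["The", "cat", "chased", "the", "mouse"])
print(X.shape)  # (5, 8)
```

Both occurrences of "the" map to the same row of the table, so the embedding alone carries no information about where a word sits in the sentence.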

2. Positional Encoding (PE)

Since Transformers look at all words at once (unlike RNNs), they don't inherently know word order. Without extra information, "The cat chased the mouse" looks the same as "The mouse chased the cat". To solve this, we add positional encodings: unique "address labels" for each position. The original paper proposed using sine and cosine waves, where $pos$ is the position index, $i$ indexes the sine/cosine dimension pairs, and $d_{model}$ is the embedding dimension:

Positional Encoding

PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)

PE(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)

Forgetting Positional Encodings

If you omit positional encodings, a Transformer treats your sentence as a bag of words, losing all sense of order. Kind of like reading a mystery novel with every page shuffled. I once spent an afternoon convinced my model was broken before I noticed the encoding was missing.
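The bag-of-words effect can be checked directly. The sketch below uses a bare-bones, single-head version of the self-attention described in the next section, with random weights and no positional encoding; the function name `self_attention` and the chosen permutation are only for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 token embeddings, no positional encoding
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

perm = np.array([4, 2, 0, 3, 1])             # shuffle the word order
out = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the input only shuffles the output rows: each token gets the
# same representation regardless of where it sits in the sentence.
print(np.allclose(out[perm], out_shuffled))  # True
```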

[Interactive figure: positional encoding values for the sentence "The quick brown fox jumps over the lazy dog" (positions 0 to 8), shown across dimensions 0 to 15.]

python

import numpy as np

def positional_encoding(max_seq_len, d_model):
    PE        = np.zeros((max_seq_len, d_model))
    positions = np.arange(max_seq_len)[:, np.newaxis]   # (seq, 1)
    dims      = np.arange(0, d_model, 2)[np.newaxis, :] # (1, d_model/2)
    angles    = positions / (10000 ** (dims / d_model))
    PE[:, 0::2] = np.sin(angles)   # even dims
    PE[:, 1::2] = np.cos(angles)   # odd dims
    return PE

PE = positional_encoding(max_seq_len=6, d_model=8)
print("PE shape:", PE.shape)         # (6, 8)
print("Position 0:", PE[0].round(3)) # [0. 1. 0. 1. 0. 1. 0. 1.]
print("Position 1:", PE[1].round(3)) # unique address for token 1

# Each position gets a unique fingerprint — this is how the Transformer
# tells "cat chased dog" from "dog chased cat" despite looking at all
# tokens simultaneously.

Multi-Head Self-Attention (MHSA): The Core Mechanism

MHSA is the core of everything. It lets the model ask, for each word: which other words in this sentence actually matter for understanding what I mean?

Generating Query, Key, and Value Vectors

For each word's input vector (embedding + PE), self-attention first creates three distinct vectors by multiplying the input with learned weight matrices $W_{Q}$, $W_{K}$, $W_{V}$:

Query (Q): Represents the current word's perspective — asking "What am I looking for?"
Key (K): Represents the word's identity — "What information do I hold?"
Value (V): Represents the actual information — "If I'm relevant, what should I provide?"

Calculating Attention Scores

To figure out how much attention word $i$ should pay to word $j$, the model calculates a score based on their Query and Key. The dot product measures similarity; dividing by $\sqrt{d_{k}}$, where $d_{k}$ is the dimension of the Key vectors, keeps the scores from growing with dimension and stabilises gradients:

Attention Score

score(i, j) = \frac{Q_{i} \cdot K_{j}}{\sqrt{d_{k}}}

These scores are then passed through softmax to get attention weights: positive numbers summing to 1. The final output for word $i$ is a weighted sum of all Value vectors:

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V
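To get a feel for why the scaling matters, here is a small experiment. It assumes the entries of Q and K are independent standard-normal values, which is an idealisation rather than a property of trained models. Under that assumption the raw dot product has a standard deviation of roughly $\sqrt{d_{k}}$, and large scores push softmax toward a near one-hot output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 64
Q = rng.normal(size=(1000, d_k))   # 1000 random queries
K = rng.normal(size=(1000, d_k))   # 1000 random keys

raw    = (Q * K).sum(axis=-1)      # one dot product per (query, key) pair
scaled = raw / np.sqrt(d_k)

print("std of raw scores:   ", raw.std().round(2))     # roughly sqrt(64) = 8
print("std of scaled scores:", scaled.std().round(2))  # roughly 1

# Effect on softmax: one query scored against 10 keys
scores = Q[:1] @ K[:10].T                              # (1, 10)
print("max weight, unscaled:", softmax(scores).max().round(3))
print("max weight, scaled:  ", softmax(scores / np.sqrt(d_k)).max().round(3))
```

The unscaled softmax concentrates more weight on a single key, and a saturated softmax has very small gradients, which is the instability the scaling is meant to avoid.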

The 'Multi-Head' Aspect

Instead of doing this once, Transformers run the Q, K, V process multiple times in parallel: each "head" has its own $W_{Q}, W_{K}, W_{V}$. Each head can learn to focus on different types of relationships (e.g., syntax vs. semantics). Outputs from all heads are concatenated and passed through a final linear layer $W_{O}$.
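The multi-head step can be sketched in NumPy as follows. One assumption to flag: instead of storing a separate small matrix per head, this sketch packs all heads into single `(d_model, d_model)` matrices and slices the result into `num_heads` chunks of size `d_k = d_model / num_heads`, which is a common implementation choice. The function name and the random weights are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model). All weight matrices: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads                           # per-head dimension

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)

    scores  = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (heads, seq, seq)
    weights = softmax(scores)                            # each row sums to 1
    heads   = weights @ V                                # (heads, seq, d_k)

    # Concatenate heads back to (seq_len, d_model), then mix them with W_O
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5): one attention matrix per head
```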

Test Your Understanding

In self-attention, which two vectors are primarily used to calculate the similarity between two tokens?

For the Curious: A Quick Dry Run & Code

Attention dry run

Sentence: "The dog hid because it was scared". Attention from "it" to each token:

The	8.9%
dog	88.9% (coreference)
hid	0.9%
because	0.1%
it	0.6%
was	0.2%
scared	0.5%

Coreference resolved — the main pattern

"it" attends to "dog" at 89%. This head has learned that the pronoun refers to the noun introduced earlier. At 89%, this is one of the strongest single-token attention signals across all 144 heads in BERT-base.

Full 7×7 attention matrix (rows are queries, columns are keys):

q ↓ k →	The	dog	hid	because	it	was	scared
The	0.05	0.29	0.01	0.00	0.59	0.03	0.03
dog	0.02	0.04	0.01	0.00	0.84	0.06	0.03
hid	0.21	0.39	0.02	0.00	0.36	0.01	0.00
because	0.27	0.17	0.27	0.01	0.20	0.01	0.07
it	0.09	0.89	0.01	0.00	0.01	0.00	0.00
was	0.09	0.78	0.01	0.00	0.07	0.01	0.03
scared	0.21	0.28	0.07	0.02	0.20	0.11	0.10

Highlighted entries: it→dog (0.89) and dog→it (0.84).

Source: real BERT-base-uncased, layer 9, head 11, with rows re-normalised over content tokens. The toy walkthrough uses hand-crafted $d_{k} = 4$ vectors (mathematically exact, not from a trained model).


Scaled Dot-Product Attention — From Scratch

Three tokens, four dimensions. No pretrained weights — just the raw mechanism.

python

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy: 3 tokens ("The", "cat", "sat"), d_model = 4
np.random.seed(42)
X  = np.random.randn(3, 4)   # token embeddings  (3 × 4)

# Learned weight matrices (random here; normally trained)
Wq = np.random.randn(4, 4)
Wk = np.random.randn(4, 4)
Wv = np.random.randn(4, 4)

Q, K, V = X @ Wq, X @ Wk, X @ Wv   # (3 × 4) each

d_k     = Q.shape[-1]
scores  = Q @ K.T / np.sqrt(d_k)    # raw scores  (3 × 3)
weights = softmax(scores)            # each row sums to 1.0
output  = weights @ V                # context-rich reps (3 × 4)

print("Attention weights (row i = how much token i attends to each token):")
print(weights.round(3))
print("\nOutput shape:", output.shape)

# A value of weights[0][2] close to 1.0 would mean "The" attends mostly to "sat".
# Real BERT-base layer 9, head 11 weights are in the dry run above.

Final Touches: FFN, Add & Norm, and Impact

Position-wise Feed-Forward Network (FFN)

Following MHSA (and Add & Norm), each position's vector is passed independently through an identical FFN: typically two linear layers with a non-linearity in between. The original paper used ReLU, which is the form shown below; many later models such as BERT use GELU instead:

Feed-Forward Network

FFN(x) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}
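A minimal NumPy sketch of the FFN, with the activation as a swappable argument. The weights are random stand-ins, and the inner width `d_ff = 4 * d_model` follows the ratio used in the original paper (512 to 2048) but is otherwise just a choice made for this example. The GELU here is the widely used tanh approximation.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2, activation=relu):
    """x: (seq_len, d_model); the same weights are applied to every position."""
    return activation(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
X  = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model))
b2 = np.zeros(d_model)

out = ffn(X, W1, b1, W2, b2)
print(out.shape)                                            # (5, 8)

# "Position-wise": running one row alone matches its row in the full result
print(np.allclose(out[2], ffn(X[2:3], W1, b1, W2, b2)[0]))  # True
```

Passing `activation=gelu` gives the GELU variant without changing anything else.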

Residual Connections and Layer Normalisation (Add & Norm)

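Both the MHSA and FFN sub-layers are wrapped by a residual connection ($x + Sublayer(x)$) and layer normalisation, which stabilises the network by normalising each position's activations to zero mean and unit variance before applying a learned scale $γ$ and shift $β$. Here $μ$ and $σ^{2}$ are the mean and variance across the features of one position, and $ε$ is a small constant for numerical stability:

Layer Normalisation

LayerNorm(x) = γ \cdot \frac{x - μ}{\sqrt{σ^{2} + ε}} + β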
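Here is a small sketch of Add & Norm. It divides by $\sqrt{variance + ε}$, a common way to implement the normalisation, and uses the post-norm ordering described above. The `toy_sublayer` is a made-up stand-in; in the encoder its place is taken by MHSA or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each position's vector (last axis) to zero mean, unit variance."""
    mu  = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by layer norm: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(loc=3.0, scale=5.0, size=(4, d_model))   # deliberately off-centre
gamma, beta = np.ones(d_model), np.zeros(d_model)       # typical initial values

# Hypothetical stand-in sublayer, for illustration only
toy_sublayer = lambda x: 0.1 * x

out = add_and_norm(X, toy_sublayer, gamma, beta)
print(out.mean(axis=-1).round(6))  # approximately 0 for every position
print(out.std(axis=-1).round(3))   # approximately 1 for every position
```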

Real-World Impact

At this point, Transformers are everywhere:

ChatGPT & GPT series — Use Transformer decoders for text generation.
BERT — Uses Transformer encoders for language understanding.
Google Translate — Leverages Transformers for translation.
AlphaFold — Uses attention inspired by Transformers for protein folding.

TL;DR

Transformers ditch sequential processing for self-attention, looking at all words at once.
Positional Encodings are added to embeddings to give the model a sense of word order.
Multi-Head Self-Attention uses Q, K, V vectors to build context-aware representations.
Each attention head can learn different types of linguistic relationships in parallel.
Highly parallelisable for training, but attention compute and memory grow as $O(N^{2})$ with sequence length $N$.

Summary

Transformer encoders ditch recurrence for self-attention, letting each word weigh the importance of all others simultaneously. Key components: positional encodings (for order), multi-head self-attention (Q, K, V → scores → weights → weighted V sum), feed-forward networks, and Add & Norm (residual connections + layer norm). This excels at long-range dependencies and parallelisation but has $O (N^{2})$ complexity with sequence length.

My Take

Transformers read everything at once, which is exactly why they are so good and exactly why your GPU hates them. When I was fine-tuning a BERT-style model on lengthy legal documents, the $O(N^{2})$ complexity hit hard, forcing a rethink of batch sizes and sequence splitting strategies just to avoid running out of memory. The architecture earns its cost.
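For a sense of scale, here is a back-of-the-envelope estimate of the memory taken by the attention matrices alone. The assumptions are mine: 12 layers of 12 heads (the BERT-base shape), 32-bit floats, and every attention matrix kept in memory for the backward pass. It ignores all other activations, weights, and optimiser state, so real usage is higher.

```python
def attention_matrix_bytes(seq_len, num_heads=12, num_layers=12,
                           batch_size=1, bytes_per_value=4):
    """Bytes needed to hold every (seq_len x seq_len) attention matrix at once."""
    return batch_size * num_layers * num_heads * seq_len**2 * bytes_per_value

for n in (128, 512, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"N={n:>5}: {mib:>8.1f} MiB")

# N=  128:      9.0 MiB
# N=  512:    144.0 MiB
# N= 2048:   2304.0 MiB
# N= 4096:   9216.0 MiB
```

Doubling the sequence length quadruples the cost, and that is per example, so batch size multiplies it again.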