Temperature Is Not a Creativity Dial
What temperature actually does to a language model's output distribution — and why the same mechanism powers knowledge distillation.
Temperature is not a creativity dial.
The standard explanation: low temperature makes the model more deterministic, high temperature makes it more creative. Set 0.2 for facts, crank to 1.5 for poetry. That's not wrong. It's also the least interesting thing you can say about it.
It's a mechanism for controlling which parts of a model's knowledge are accessible at inference time. The previous posts in this series covered architecture: how sequence models read, remember, and attend. None of that touches what happens at output: when the model produces a distribution over every possible next token and you have to pick one. That's where temperature lives. And it connects to a training technique that lets a 7B model (gaming GPU territory) outperform one twice its size.
The Audition: Logits and Softmax
Before temperature makes sense, you need to understand what a language model is actually doing when it picks the next word.
For the prompt “The sky is...”, the model has seen billions of sentences. It knows “blue” almost always follows. “Cloudy” sometimes. “Cheese” essentially never. This knowledge isn't stored as a rule. It's compressed into numbers during training.
Those numbers are called logits: raw scores on the model's internal scorecard. “Blue” might score 15.2. “Cloudy” scores 10.1. “Cheese” scores −4.8.
You can't sample from raw scores directly: they can be negative and they don't sum to 1. So you convert them using softmax, which exponentiates each logit $z_i$ and divides by the total. Exponentiation amplifies big score differences and shrinks small ones, exactly the property you want when converting a scorecard into a probability distribution:
$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Result, normalising over the six candidate tokens used in the code example below: blue ≈ 99.3%, cloudy ≈ 0.6%, cheese ≈ 0%. Now the model picks a token by sampling from this distribution.
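The sampling step itself can be sketched in a few lines. This is a minimal illustration, not how any particular inference stack does it: the six-token vocabulary is the toy one from this post, and `sample_next_token` is a hypothetical helper name.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max()  # stability shift; probabilities unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

def sample_next_token(tokens, logits, rng):
    """Hypothetical helper: draw one token according to softmax(logits)."""
    probs = softmax(logits)
    return tokens[rng.choice(len(tokens), p=probs)]

# Toy vocabulary and logits for "The sky is ..."
tokens = ["blue", "cloudy", "clear", "dark", "green", "cheese"]
logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])

rng = np.random.default_rng(seed=0)  # fixed seed only so the run is repeatable
draws = [sample_next_token(tokens, logits, rng) for _ in range(1000)]
counts = {tok: draws.count(tok) for tok in tokens}
print(counts)  # "blue" should dominate; exact counts depend on the seed
```

Over 1,000 draws, "blue" wins nearly every time, but not necessarily every time. That small remainder is what temperature reshapes.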
Enter Temperature
Temperature (τ) is a single number, greater than zero. It is applied to the logits before softmax: you divide every score by τ first. With $z_i$ the logit of token $i$:
$$p_i = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}$$
One division. The effect depends entirely on what τ is.
τ = 1 (default): Divide every score by 1. Nothing changes. You sample from the model's unmodified output distribution: the logits exactly as produced.
τ → 0 (near-zero, e.g. 0.1): Dividing by a tiny number amplifies the gaps between scores. Blue's 15.2 becomes 152. Cheese's −4.8 becomes −48. The gap widens until softmax collapses to a single spike: blue gets more than 99.99%, everything else essentially 0. Effectively the same answer every time. In the limit, this is greedy decoding. (The “When Temperature Goes Wrong” section below covers the small exception.)
τ > 1 (e.g. 1.5, 2.0): Dividing by a large number compresses the scores. At τ = 2, 15.2 becomes 7.6. −4.8 becomes −2.4. The gap shrinks. Cheese goes from about 2 in a billion at τ = 1 to about 0.004% at τ = 2: still rare, but roughly 20,000 times more likely. You're sampling from a flatter distribution: more variance, more surprise, and occasionally “cheese.”
```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    scaled = logits / temperature
    # Subtract max for numerical stability (does not change probabilities)
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Logits for "The sky is ..."
tokens = ["blue", "cloudy", "clear", "dark", "green", "cheese"]
logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])

for tau in [0.1, 1.0, 2.0]:
    probs = softmax_with_temperature(logits, temperature=tau)
    print(f"\nτ = {tau}")
    for tok, p in zip(tokens, probs):
        bar = "█" * int(p * 40)
        print(f"  {tok:8s} {p:.4f} {bar}")

# τ = 0.1 → blue: 1.0000 (everything else ≈ 0)
# τ = 1.0 → blue: 0.9927, cloudy: 0.0061, clear: 0.0011, dark: 0.0001, cheese: ≈ 0
# τ = 2.0 → blue: 0.8896, cloudy: 0.0695, clear: 0.0297, dark: 0.0099, cheese: 0.00004
```
At τ → 0, what does sampling from the model's output become equivalent to?
Dark Knowledge
When the model produces logits for “The sky is...”, the probability for “cheese” is tiny. But it's not zero. That number encodes something real: the model has encountered contexts where unusual words followed those tokens. A recipe. A metaphor. A fragment. The model knows “cheese” is wrong here, and the logit score reflects exactly how wrong, in a way that captures the shape of language around that prompt.
Geoff Hinton called this dark knowledge. The top prediction is just one number from a vector of thousands. The relative scores of all candidates encode semantic similarity: which words are neighbors, which meanings cluster, which alternatives are close calls.
- The model knows “apple” and “pear” are similar not because it was told so, but because they appear in similar contexts across billions of sentences. If “apple” is the top-1 prediction, “pear” will have a noticeably higher logit than “hammer.” That gap is the knowledge.
- Ask for the capital of Australia. Sydney dominates English-language web text; it's the largest city, the Olympic host, the one in every travel article. The answer is Canberra. At τ near 0, a model trained on uncurated text can surface Sydney as its top token. At a higher τ, Canberra has a chance. “Lower temperature = more factual” only holds when the correct answer is also the most frequent one in training data.
Low τ suppresses all of this. You'll see only the tip: the dominant response. High τ flattens the distribution enough that wrong tokens, not just uncertain ones, gain non-trivial probability. That's why hallucinations increase at high temperature: you're sampling from the tails, where both genuinely ambiguous answers and outright wrong ones live.
Temperature never reorders the logits; the top token stays the top token at any τ. So why might a model give a different answer at τ=0 vs τ=0.7 on the same factual question? And does the τ=0.7 answer count as 'more correct'?
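The first half of that statement is easy to check numerically. The sketch below uses this post's toy logits (an illustrative assumption, not real model output) and a local helper `softmax_t`; it confirms the ranking is identical at every τ while the chance of drawing something other than the top token grows.

```python
import numpy as np

def softmax_t(logits, tau):
    """Softmax of logits / tau, for tau > 0."""
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])  # toy values from this post
base_order = np.argsort(-logits)  # ranking of tokens by raw score

for tau in [0.1, 0.7, 1.0, 2.0]:
    probs = softmax_t(logits, tau)
    same_order = np.array_equal(np.argsort(-probs), base_order)
    p_not_top = 1.0 - probs.max()  # chance a single draw is NOT the top token
    print(f"tau={tau}: ranking preserved={same_order}, P(not top)={p_not_top:.2e}")
```

In multi-token generation, even a small per-token chance of leaving the top choice compounds across positions, which is one plausible reading of why answers can differ between τ = 0 and τ = 0.7.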
Distillation: What the Runner-Ups Know
Suppose you want to train a small model (7B parameters, fits on a gaming GPU) to behave like a large one (70B parameters, requires a data center). The obvious approach: have the large model answer questions, then train the small model on those answers.
The problem: the naive approach takes only the teacher's top prediction as a hard label. “The capital of France is Paris.” The small model learns that Paris is correct. It doesn't learn that another city was a close second, or that the model was far more confident here than it was on a question it was genuinely uncertain about.
The fix: train the small model on the large model's soft targets (the full probability distribution at high temperature) instead of the final answers. At τ = 3 or τ = 4, the distribution is flat enough that the relative scores of all candidates become visible. The small model learns not just what the teacher predicts; it learns which answers the teacher was nearly certain about and which it was hedging on. That signal is invisible in a one-hot label.
This is knowledge distillation (Hinton et al., 2015). The high-temperature logits are the dark knowledge transfer mechanism. It's one reason smaller models trained via distillation regularly outperform equivalently-sized models trained only on hard labels; the meaningful comparison is a distilled 7B vs a 7B trained the usual way, not distilled 7B vs the 70B teacher it learned from.
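To make the contrast concrete, here is a small sketch comparing a hard label with soft targets. The teacher logits and the five-token vocabulary are invented for illustration; a real teacher would produce a vector over its whole vocabulary.

```python
import numpy as np

def softmax_t(logits, tau=1.0):
    """Softmax of logits / tau, for tau > 0."""
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

# Hypothetical teacher logits for "The capital of France is ..."
tokens = ["Paris", "Lyon", "Marseille", "London", "banana"]
teacher_logits = np.array([12.0, 6.5, 6.0, 4.0, -3.0])

hard_label = np.zeros(len(tokens))
hard_label[np.argmax(teacher_logits)] = 1.0   # one-hot: only the winner survives

soft_t1 = softmax_t(teacher_logits, tau=1.0)  # teacher's ordinary distribution
soft_t4 = softmax_t(teacher_logits, tau=4.0)  # softened targets for distillation

for tok, h, p1, p4 in zip(tokens, hard_label, soft_t1, soft_t4):
    print(f"{tok:10s} hard={h:.0f}  tau=1: {p1:.4f}  tau=4: {p4:.4f}")
```

At τ = 4 the runner-ups carry visible probability, and their order (French cities above London, London above "banana") is exactly the structure a one-hot label throws away.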
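For the Curious: The Distillation Loss
Why τ² appears in the loss
When you compute the KL divergence between soft distributions at temperature τ, the gradients are scaled by 1/τ² compared to hard-label cross-entropy. To keep the soft loss on the same scale as the hard loss, multiply by τ²:
$$\mathcal{L} = \alpha\,\tau^{2}\,\mathrm{KL}\big(\mathrm{softmax}(z_t/\tau)\,\|\,\mathrm{softmax}(z_s/\tau)\big) + (1-\alpha)\,\mathrm{CE}\big(y,\ \mathrm{softmax}(z_s)\big)$$
Where $z_s$ = student logits, $z_t$ = teacher logits, $y$ = hard labels, and α controls the blend. Typical values: τ = 3 to 4, α = 0.7.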
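For readers who want the reason behind the τ² factor, here is a sketch of the standard argument from Hinton et al. (2015). The symbols are introduced only for this derivation: $q$ and $p$ are the student and teacher probabilities at temperature τ, $z_i$ and $v_i$ are the student and teacher logits for token $i$, and $N$ is the vocabulary size. The second step assumes a high temperature and zero-mean logits.

```latex
% Gradient of the soft loss C with respect to a student logit z_i,
% where q = softmax(z / \tau) and p = softmax(v / \tau)
\[
\frac{\partial C}{\partial z_i} = \frac{1}{\tau}\,(q_i - p_i)
\]

% High-temperature approximation (e^{x} \approx 1 + x, logits assumed zero-mean)
\[
\frac{\partial C}{\partial z_i}
\approx \frac{1}{\tau}\left(\frac{1 + z_i/\tau}{N} - \frac{1 + v_i/\tau}{N}\right)
= \frac{z_i - v_i}{N\,\tau^{2}}
\]
```

The soft-target gradient shrinks as 1/τ², so multiplying the soft loss by τ² keeps its contribution roughly constant when the temperature is changed.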
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, tau=3.0, alpha=0.7):
    """
    student_logits: (batch, vocab), raw scores from the small model
    teacher_logits: (batch, vocab), raw scores from the large model
    labels:         (batch,), ground-truth token indices
    tau:            temperature applied to both before KL divergence
    alpha:          weight on soft loss vs hard loss
    """
    # Soft targets from teacher at high temperature
    soft_teacher = F.softmax(teacher_logits / tau, dim=-1)
    soft_student = F.log_softmax(student_logits / tau, dim=-1)

    # KL divergence, scaled by tau^2 to match gradient magnitude of hard loss
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (tau ** 2)

    # Standard cross-entropy on hard labels
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Typical setup: tau=3-4 surfaces enough dark knowledge without destabilising training
# alpha=0.7 weights the soft loss more heavily: the soft targets are richer signal
```
Exploit vs Explore
Every temperature choice is a tradeoff.
Exploit (low τ): You trust the model's training distribution. You want the most likely completion, consistently. The same prompt should give the same answer. Use this for: code autocomplete, SQL generation, factual summarisation, anything where reproducibility matters. A code assistant at τ near 0 generates the same correct function body every run. At high τ, it occasionally invents a method signature that doesn't exist.
Explore (high τ): You're willing to trade reliability for variety. You want candidates from further down the list: novel phrasings, unexpected angles, combinations that would never surface at low temperature. Use this for: brainstorming, creative writing, generating multiple options to pick from, anywhere diversity is the point. Ask for ten product name ideas at low τ and you'll get ten variations of the same safe answer. At higher τ, the spread is genuinely wider.
The wrong temperature for the task gives a specific kind of wrong. A factual task at high τ hallucinates because you're sampling from the model's uncertainty. A creative task at low τ produces safe, repetitive output that reads like every other generated response.
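The tradeoff shows up even on a toy distribution. The sketch below reuses this post's six illustrative logits and a hypothetical helper, `distinct_tokens`, that reports which tokens appear at all across repeated draws.

```python
import numpy as np

def softmax_t(logits, tau):
    """Softmax of logits / tau, for tau > 0."""
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

tokens = ["blue", "cloudy", "clear", "dark", "green", "cheese"]
logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])
rng = np.random.default_rng(seed=0)  # fixed seed only so the run is repeatable

def distinct_tokens(tau, n_draws=1000):
    """Hypothetical helper: which tokens show up in n_draws samples at this tau."""
    probs = softmax_t(logits, tau)
    idx = rng.choice(len(tokens), size=n_draws, p=probs)
    return sorted({tokens[i] for i in idx})

for tau in [0.2, 1.0, 2.0]:
    print(f"tau={tau}: {distinct_tokens(tau)}")
```

At τ = 0.2 the only token you should expect to see is "blue"; at τ = 2.0 several alternatives appear. Real prompts have far larger vocabularies, so the effect there is correspondingly broader.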
One practical note: most production APIs (GPT-4, Claude, Gemini) run on instruction-tuned models, not raw pretrained ones. RLHF and supervised fine-tuning tend to concentrate probability mass on preferred outputs, effectively sharpening the distribution before any τ is applied. That means the base model's τ=0.7 and the instruction-tuned model's τ=0.7 are not the same thing. API defaults (typically 0.7 to 1.0) are calibrated for instruction-tuned distributions specifically; the raw intuition still holds, but the thresholds shift.
Why does knowledge distillation use high temperature (τ = 3 to 4) when generating soft targets from the teacher?
When Temperature Goes Wrong
It is not a creativity dial. Temperature controls entropy, which correlates with perceived creativity only in a narrow band. Above τ ≈ 1.5 on most current models, outputs become incoherent faster than they become novel. Empirically, larger models tend to maintain coherence at higher τ, their output distributions being more structured, though this varies by architecture and training regime.
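The claim that temperature controls entropy can be checked directly. This is a sketch on the post's toy logits, with `entropy_bits` as an illustrative helper; it measures the distribution's spread, not output quality or coherence.

```python
import numpy as np

def softmax_t(logits, tau):
    """Softmax of logits / tau, for tau > 0."""
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    nz = probs[probs > 0]
    return float(-(nz * np.log2(nz)).sum())

logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])  # toy values from this post
max_entropy = float(np.log2(len(logits)))  # entropy of a uniform distribution

taus = [0.1, 0.5, 1.0, 1.5, 2.0, 5.0]
entropies = [entropy_bits(softmax_t(logits, tau)) for tau in taus]
for tau, h in zip(taus, entropies):
    print(f"tau={tau:>4}: entropy = {h:.4f} bits (max {max_entropy:.3f})")
```

Entropy rises steadily with τ toward the uniform-distribution ceiling. Nothing in that number distinguishes novel output from noise, which is the point of the caveat above.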
τ = 0 is not perfectly deterministic. Since dividing by zero is undefined, implementations treat it as greedy decoding, and hardware-level floating-point non-determinism (GPU parallel reduction order, CUDA kernel differences) can still flip near-tied logits. For true reproducibility you need fixed seeds and deterministic CUDA kernels, not just τ = 0.
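The root cause is that floating-point addition is not associative. The three-term example below is only an illustration of the effect; real GPU kernels reduce thousands of terms in parallel, and the `rival` logit is an invented value chosen to create a near-tie.

```python
import numpy as np

parts = [0.1, 0.2, 0.3]  # contributions that mathematically sum to 0.6

# Same numbers, two summation orders (a stand-in for different reduction orders)
order_1 = (parts[0] + parts[1]) + parts[2]
order_2 = parts[0] + (parts[1] + parts[2])
print(order_1, order_2)      # 0.6000000000000001 0.6
print(order_1 == order_2)    # False

rival = 0.6  # hypothetical competing token's logit

# Greedy decoding picks the argmax, so the winner depends on summation order
print(np.argmax([rival, order_1]))  # 1: the summed logit edges out the rival
print(np.argmax([rival, order_2]))  # 0: exact tie, argmax returns the first index
```

A difference in the sixteenth decimal place is enough to change which token greedy decoding selects, and every later token is conditioned on that choice.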
It is not interchangeable with top-p or top-k. In a typical sampling pipeline, top-p and top-k both truncate the distribution after temperature has already reshaped it. High τ spreads probability across the tails; top-k=10 then cuts those tails off, so the combination can end up more restrictive than low τ with a permissive top-p=0.95. A sensible default: set temperature first, then use top-p around 0.9 as a ceiling to clip noise in the extreme tails. Stacking all three with conflicting settings (say, τ=2.0 + top-k=5 + top-p=0.3) can be more restrictive than any single setting alone, because each truncation step applies independently to the already-reshaped distribution. API documentation commonly recommends adjusting either temperature or top-p rather than both.
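Here is a minimal sketch of that interaction, assuming the order temperature, then top-k, then top-p (a common ordering, though some libraries apply the steps differently). `filter_top_k` and `filter_top_p` are illustrative helpers, not any library's API, and the logits are the toy values from this post.

```python
import numpy as np

def softmax_t(logits, tau):
    """Softmax of logits / tau, for tau > 0."""
    scaled = logits / tau
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

def filter_top_k(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalise."""
    keep = np.argsort(-probs)[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # include the token that crosses p
    keep = order[:cutoff]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def survivors(tokens, probs):
    """Tokens that can still be sampled after filtering."""
    return [t for t, q in zip(tokens, probs) if q > 0]

tokens = ["blue", "cloudy", "clear", "dark", "green", "cheese"]
logits = np.array([15.2, 10.1, 8.4, 6.2, 2.1, -4.8])

probs = softmax_t(logits, tau=2.0)                        # temperature first
after_p = filter_top_p(probs, p=0.9)                      # then top-p alone
stacked = filter_top_p(filter_top_k(probs, k=5), p=0.3)   # top-k, then a tight top-p

print("tau=2.0 only:        ", survivors(tokens, probs))
print("tau=2.0, top-p=0.9:  ", survivors(tokens, after_p))
print("tau=2.0, k=5, p=0.3: ", survivors(tokens, stacked))
```

With τ=2.0, top-k=5 and top-p=0.3 stacked, only "blue" survives: the high temperature is cancelled out entirely, and the output is as deterministic as greedy decoding.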
I've found the temperature setting matters less than people assume at the extremes. A well-prompted model at a higher τ can outperform a poorly-prompted one at a lower τ on factual tasks. Below 0.3, you're mostly getting the same answer with slightly different surface phrasing; the top token dominates regardless. Above 1.5 on most current models, you're heading toward incoherence faster than novelty. The useful range is narrow: roughly 0.5 to 1.2, where you're genuinely trading precision for vocabulary diversity.
The more interesting lever is often not temperature but prompt structure. Temperature changes which part of the distribution you sample from. The prompt changes what distribution you are sampling from in the first place.
- Temperature divides logits before softmax: smaller τ sharpens the distribution toward the top token, larger τ flattens it toward the tails.
- At τ → 0 you get greedy decoding: effectively the same answer every time.
- At high τ you surface dark knowledge: the model's implicit understanding of similarity and uncertainty encoded in the logit scores of non-top candidates.
- In distillation, high-temperature soft targets transfer this hidden structure from a large teacher to a small student, carrying more information per training example than hard labels ever could.
Temperature is one dial, but it touches three different things: how deterministic your outputs are, how much of the model's implicit knowledge is accessible, and how information flows from large models to small ones during training. The “creativity slider” framing captures the first and ignores the other two. Distillation is the clearest proof that the runner-up probabilities contain real signal; practitioners have been deliberately amplifying them for years to make smaller models smarter.
Next: how models are evaluated, and why the metrics you choose shape the model you end up with.