Cross-entropy measures the **average number of bits needed to encode samples from one distribution P using a code optimized for another distribution Q**. It is the workhorse loss function of modern machine learning: most classifiers, language models that predict the next token, and many generative models are trained by minimizing some form of cross-entropy.
## Definition
For distributions P (true) and Q (predicted) over the same set of outcomes:
*H(P, Q) = -Σ P(x) log Q(x)*
Cross-entropy decomposes into the entropy of P plus the KL divergence from P to Q:
*H(P, Q) = H(P) + D_KL(P ∥ Q)*
When the logarithm is base 2, cross-entropy is in bits; with the natural log, it is in nats.
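
The definition and the decomposition can be checked numerically. The sketch below is a minimal pure-Python illustration, not a library implementation: the helper names `entropy`, `cross_entropy`, and `kl_divergence` and the example distributions are made up for this example, and distributions are assumed to be plain lists of probabilities over the same outcomes.

```python
import math

def entropy(p, base=2.0):
    """H(P) = -sum_x P(x) log P(x). Terms with P(x) = 0 contribute nothing."""
    return -sum(px * math.log(px) / math.log(base) for px in p if px > 0)

def cross_entropy(p, q, base=2.0):
    """H(P, Q) = -sum_x P(x) log Q(x). Infinite if Q(x) = 0 where P(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log(qx) / math.log(base)
    return total

def kl_divergence(p, q, base=2.0):
    """D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx) / math.log(base)
    return total

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]   # "true" distribution
q = [0.25, 0.25, 0.5]   # model distribution

h_pq = cross_entropy(p, q)                    # about 1.75 bits
h_p_plus_kl = entropy(p) + kl_divergence(p, q)  # about 1.5 + 0.25 bits
print(h_pq, h_p_plus_kl)

# The same quantity in nats: pass base=math.e (or multiply bits by ln 2).
print(cross_entropy(p, q, base=math.e))
```

Passing `base=math.e` switches the unit from bits to nats; the two differ only by a constant factor of ln 2.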
## Intuition
- If Q = P, cross-entropy equals the entropy H(P) — the minimum possible average code length
- If Q ≠ P, cross-entropy is strictly greater than H(P), with the excess being the KL divergence — the *cost* of using the wrong model
- When the true distribution is one-hot, a perfectly confident, perfectly correct prediction gives cross-entropy of 0
- Confidently wrong predictions blow up cross-entropy toward infinity (the −log of a near-zero probability)
This last property is why cross-entropy is such an effective loss: it strongly penalizes models that are *confidently wrong*, pushing them toward calibrated uncertainty.
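
To make the "confidently wrong" penalty concrete, the following sketch prints the per-example loss for a one-hot label as the probability assigned to the correct class shrinks. The function name `surprisal` and the chosen probabilities are illustrative assumptions.

```python
import math

def surprisal(q_correct):
    """Per-example cross-entropy in nats for a one-hot label:
    the negative log of the probability given to the correct class."""
    return -math.log(q_correct)

# Probabilities the model assigns to the correct class, from confident-and-right
# down to confident-and-wrong.
for q_correct in (0.99, 0.9, 0.5, 0.1, 0.01, 1e-6):
    print(f"Q(correct) = {q_correct:<8g} loss = {surprisal(q_correct):.3f} nats")
```

The loss grows slowly near Q(correct) = 1 and without bound as Q(correct) approaches 0: dropping from 0.5 to 0.01 raises it from about 0.69 to about 4.6 nats.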
## Why It's the Default ML Loss
For classification tasks, the true label distribution is one-hot: P puts probability 1 on the correct class. Then:
*H(P, Q) = -log Q(correct class)*
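Minimizing this over a dataset is equivalent to **maximum likelihood estimation**: the summed loss is the negative log-likelihood of the labels under the model. Since H(P) is constant when labels are fixed (and exactly zero for one-hot labels), minimizing cross-entropy is identical to minimizing the KL divergence from the label distribution to the model's predictions.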
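
The equivalence with maximum likelihood can be seen on a toy dataset. This is a hedged sketch with invented predictions and labels; `one_hot_cross_entropy` and `mean_nll` are hypothetical helper names, and the loss is in nats.

```python
import math

def one_hot_cross_entropy(one_hot, probs):
    """H(P, Q) for a one-hot P: only the correct class contributes."""
    return -sum(p * math.log(q) for p, q in zip(one_hot, probs) if p > 0)

def mean_nll(labels, predictions):
    """Average negative log-likelihood of integer labels under the model."""
    return -sum(math.log(probs[y]) for y, probs in zip(labels, predictions)) / len(labels)

# Toy model outputs for three examples over three classes (made-up numbers).
predictions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
labels = [0, 1, 2]
one_hots = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

mean_ce = sum(one_hot_cross_entropy(t, q) for t, q in zip(one_hots, predictions)) / len(labels)
nll = mean_nll(labels, predictions)

# Likelihood of the whole label set, assuming independent examples.
likelihood = math.prod(probs[y] for y, probs in zip(labels, predictions))

print(mean_ce, nll)                          # the two averages agree
print(-math.log(likelihood) / len(labels))   # same value again
```

Because the mean cross-entropy is a decreasing function of the likelihood, lowering one raises the other.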
## Common Forms
- **Binary cross-entropy** (BCE): for two-class problems with sigmoid outputs. Used in binary classification, multi-label classification, and GAN discriminators
- **Categorical cross-entropy**: for multi-class problems with softmax outputs. Standard for image classification, sentiment analysis, etc.
- **Sparse categorical cross-entropy**: same as categorical, but takes integer labels instead of one-hot vectors
- **Token-level cross-entropy**: averaged over tokens in language models. Perplexity is the exponential of this loss when it is measured in nats
- **Sequence cross-entropy**: summed across all positions in sequence-to-sequence tasks
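
The forms above are the same quantity under different input conventions. The sketch below illustrates this in plain Python; the function names echo common framework terminology but are local helpers written for this example, not any library's API, and probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def binary_cross_entropy(y, p):
    """BCE for one example: y is 0 or 1, p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(one_hot, probs):
    """Cross-entropy with a one-hot target vector."""
    return -sum(t * math.log(q) for t, q in zip(one_hot, probs) if t > 0)

def sparse_categorical_cross_entropy(label, probs):
    """Same loss, but the target is an integer class index."""
    return -math.log(probs[label])

probs = [0.1, 0.7, 0.2]
dense = categorical_cross_entropy([0, 1, 0], probs)
sparse = sparse_categorical_cross_entropy(1, probs)
print(dense, sparse)  # identical up to floating-point rounding

# BCE is the two-class special case with probabilities [1 - p, p].
print(binary_cross_entropy(1, 0.7), categorical_cross_entropy([0, 1], [0.3, 0.7]))

# Token-level vs. sequence-level: mean vs. sum of per-position losses.
step_probs = [[0.6, 0.4], [0.2, 0.8], [0.5, 0.5]]  # made-up per-position outputs
targets = [0, 1, 1]
token_losses = [sparse_categorical_cross_entropy(y, q) for y, q in zip(targets, step_probs)]
sequence_loss = sum(token_losses)
token_loss = sequence_loss / len(token_losses)
print(sequence_loss, token_loss)
```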
## Cross-Entropy and Language Models
Language model training is cross-entropy minimization at scale:
- For each token, the true distribution is one-hot on the actual next token
- The model outputs a probability distribution over the vocabulary
- The loss is the negative log-probability the model assigned to the actual next token
- **Perplexity** = exp(cross-entropy) when the loss is in nats (or 2^cross-entropy when it is in bits): the effective number of equally likely options the model is choosing among
Lower cross-entropy means the model is less surprised by real text, i.e., it has learned the distribution of language better.
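
A small sketch of the perplexity calculation, assuming we already have the probability the model assigned to each actual next token (the lists below are invented for illustration, and `token_cross_entropy` and `perplexity` are hypothetical helper names):

```python
import math

def token_cross_entropy(token_probs):
    """Mean negative log-probability (nats) assigned to the actual tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """exp of the token-level cross-entropy in nats."""
    return math.exp(token_cross_entropy(token_probs))

# A model that always spreads its mass uniformly over 8 options
# has perplexity of about 8, whatever the text is.
uniform_probs = [1 / 8] * 5
print(perplexity(uniform_probs))

# Mixed confidence: perplexity is the inverse geometric mean of the probabilities.
mixed_probs = [0.5, 0.125]
print(perplexity(mixed_probs))  # about 4, since sqrt(0.5 * 0.125) = 0.25
```

This is why perplexity reads as an "effective branching factor": a perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens.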
## Practical Properties
- **Convex** in the parameters for linear models with softmax/sigmoid outputs (logistic and softmax regression), leading to well-behaved optimization
- **Calibration sensitive**: cross-entropy rewards calibrated probability outputs, not just correct argmax decisions
- **Numerical stability**: implementations fuse softmax and log into a single log-softmax computed from the raw logits with the log-sum-exp trick, avoiding overflow and underflow in the exponentials
- **Class imbalance**: the unweighted loss is dominated by the majority class, so imbalanced problems often need class reweighting, focal loss, or resampling
- **Label smoothing**: replacing one-hot labels with a softened distribution can improve generalization by preventing the model from becoming over-confident
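
The numerical-stability and label-smoothing points can be sketched together. This is an illustrative pure-Python version, not how any particular framework implements it; all function names are hypothetical, and the smoothing rule shown, (1 - eps) on the one-hot vector plus eps / K spread uniformly, is one common formulation assumed here.

```python
import math

def logsumexp(logits):
    """log(sum(exp(z))) computed stably by subtracting the max logit first."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_from_logits(logits, label):
    """-log softmax(logits)[label] = logsumexp(logits) - logits[label]."""
    return logsumexp(logits) - logits[label]

def naive_cross_entropy(logits, label):
    """Softmax then log, done separately; fails for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[label] / sum(exps))

def smooth_labels(label, num_classes, eps):
    """Assumed smoothing rule: (1 - eps) * one_hot + eps / num_classes."""
    return [(1 - eps) * (1.0 if i == label else 0.0) + eps / num_classes
            for i in range(num_classes)]

def soft_cross_entropy_from_logits(logits, targets):
    """Cross-entropy against a soft target distribution, using log-softmax."""
    lse = logsumexp(logits)
    return -sum(t * (z - lse) for t, z in zip(targets, logits))

small = [2.0, 1.0, 0.1]
print(naive_cross_entropy(small, 0), cross_entropy_from_logits(small, 0))  # agree

large = [1000.0, 0.0]
try:
    naive_cross_entropy(large, 0)
except OverflowError:
    print("naive version overflowed")
print(cross_entropy_from_logits(large, 0))  # finite

# Label smoothing: the target is no longer one-hot, so the loss stays above zero.
targets = smooth_labels(0, 3, eps=0.1)
print(soft_cross_entropy_from_logits(small, targets))
```

With smoothed targets the minimum of the loss is reached at finite logits, which is what discourages the model from pushing its confidence toward 1.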
## Cross-Entropy vs. Other Losses
- **MSE for classification** is a poor choice: it doesn't penalize confident wrong predictions as sharply, and its gradient with respect to the logit vanishes when the sigmoid saturates, even if the prediction is wrong
- **Hinge loss** (used in SVMs) optimizes margin, not probability
- **Focal loss** is a modulated cross-entropy that down-weights easy examples — useful for object detection with extreme class imbalance
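
Two of these comparisons are easy to check by hand. The sketch below compares the gradient with respect to the logit for sigmoid plus BCE against sigmoid plus squared error, then shows focal loss as a rescaled cross-entropy. The function names are hypothetical, squared error is taken as (p - y)^2, and the focal loss omits the optional class-balancing weight that is often added in practice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_bce_wrt_logit(z, y):
    """d/dz of BCE(y, sigmoid(z)) simplifies to sigmoid(z) - y."""
    return sigmoid(z) - y

def grad_mse_wrt_logit(z, y):
    """d/dz of (sigmoid(z) - y)^2 = 2 (p - y) p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# Confidently wrong: true label 1, logit strongly negative.
z, y = -10.0, 1
print(grad_bce_wrt_logit(z, y))  # close to -1: strong corrective signal
print(grad_mse_wrt_logit(z, y))  # close to 0: saturated sigmoid kills the gradient

def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t given to the true class:
    cross-entropy scaled by (1 - p_t) ** gamma."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy, hard = 0.9, 0.1
for p_t in (easy, hard):
    ce = -math.log(p_t)
    print(p_t, ce, focal_loss(p_t))  # the easy example is down-weighted far more
```

Setting `gamma=0` recovers plain cross-entropy; larger values shrink the contribution of examples the model already classifies confidently.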
## Mental Model
Cross-entropy answers: *given my model's beliefs about the world, how surprised would I be on average to see real data?* Minimizing it is the most direct way to make a probabilistic model match reality. Every time you train a softmax classifier, fine-tune an LLM, or evaluate a language model with perplexity, you are working with cross-entropy.