Mutual Information (MI) quantifies the amount of information that two random variables share. Formally, it is the reduction in uncertainty about one variable that results from observing another. Unlike correlation, mutual information captures **any** statistical dependency — linear, nonlinear, or symbolic — making it one of the most general measures of association in statistics and machine learning.
## Definition
For random variables X and Y with joint distribution p(x, y):
*I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)*
Equivalently, it is the KL divergence between the joint distribution and the product of the marginals:
*I(X; Y) = D_KL( p(x, y) ∥ p(x) p(y) )*
Mutual information is **non-negative**, **symmetric** (I(X; Y) = I(Y; X)), and equals **zero if and only if X and Y are statistically independent**.
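The two forms of the definition can be checked numerically on a small discrete example. The sketch below is illustrative only: the 2×2 joint table is a made-up distribution, and the helper names `entropy` and `mutual_information` are assumptions introduced here, not part of any library.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) in bits, where joint[i][j] = p(x_i, y_j).

    Computed as the KL divergence between the joint distribution
    and the product of its marginals.
    """
    px = [sum(row) for row in joint]          # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]    # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # 0 * log(0) is treated as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Hypothetical joint distribution: X and Y agree 80% of the time.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

h_x = entropy(sum(row) for row in joint)
h_y = entropy(sum(col) for col in zip(*joint))
h_xy = entropy(p for row in joint for p in row)

mi_kl = mutual_information(joint)   # KL form, about 0.278 bits
mi_entropy = h_x + h_y - h_xy       # entropy form, same value up to rounding

print(f"KL form: {mi_kl:.4f} bits, entropy form: {mi_entropy:.4f} bits")
```

Transposing the table leaves the result unchanged (symmetry), and a table that factorizes into its marginals gives zero.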
## Intuition
- I(X; Y) = 0 → X and Y are independent; observing one tells you nothing about the other
- I(X; Y) is large → X and Y carry a lot of shared information
- I(X; X) = H(X) → a variable carries full information about itself
It is measured in bits (when using log₂) or nats (when using natural log).
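As a rough illustration of the three cases and the choice of units, the following sketch evaluates MI for a variable paired with itself and for two independent variables. The uniform four-valued X and the `base` parameter of the helper are assumptions made for this example.

```python
import math

def mutual_information(joint, base=2.0):
    """I(X; Y) from a joint table; base 2 gives bits, base e gives nats."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]), base)
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

n = 4  # X is uniform over 4 values (assumption for this example)

# Y = X: all probability mass sits on the diagonal of the joint table.
identical = [[1 / n if i == j else 0.0 for j in range(n)] for i in range(n)]

# X and Y independent and uniform: every cell is p(x) * p(y) = 1/16.
independent = [[1 / (n * n)] * n for _ in range(n)]

mi_self_bits = mutual_information(identical)               # H(X) = log2(4) = 2 bits
mi_self_nats = mutual_information(identical, base=math.e)  # H(X) = ln(4) nats
mi_indep = mutual_information(independent)                 # 0: nothing shared

print(mi_self_bits, mi_self_nats, mi_indep)
```

The bit and nat values differ only by the constant factor ln 2, so the choice of logarithm changes the scale, not the ranking of dependencies.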
## Why It Beats Correlation
Pearson correlation measures only the strength of linear association. Two variables can be deterministically related yet have zero correlation (e.g., Y = X² when X is distributed symmetrically around zero). Mutual information captures **any** dependency:
- Linear, nonlinear, or piecewise relationships
- Categorical, ordinal, or mixed data
- Symbolic dependencies (e.g., Y is a hash of X)
This makes MI a natural choice for feature selection, dependency discovery, and probing learned representations.
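The Y = X² case can be worked through exactly on a small discrete example. This is a minimal sketch assuming X is uniform on the symmetric support {−2, −1, 0, 1, 2}; the variable names are chosen here for illustration.

```python
import math
from collections import Counter

xs = [-2, -1, 0, 1, 2]             # X uniform on a symmetric support (assumption)
pairs = [(x, x * x) for x in xs]   # Y = X^2; each (x, y) pair has probability 1/5
n = len(pairs)

# Covariance, the numerator of the Pearson correlation.
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

# Exact joint and marginal probabilities for this finite distribution.
p_xy = {pair: c / n for pair, c in Counter(pairs).items()}
p_x = {x: c / n for x, c in Counter(x for x, _ in pairs).items()}
p_y = {y: c / n for y, c in Counter(y for _, y in pairs).items()}

mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
h_y = -sum(p * math.log2(p) for p in p_y.values())

print(f"covariance = {cov}")        # 0.0: no linear association at all
print(f"I(X; Y)   = {mi:.3f} bits")  # about 1.522 bits
print(f"H(Y)      = {h_y:.3f} bits")  # equal to I(X; Y), since Y is a function of X
```

Because Y is fully determined by X, H(Y | X) = 0 and the mutual information equals H(Y), even though the correlation is exactly zero.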
## Applications
- **Feature selection**: Pick features with high mutual information with the target. Used in filter methods like mRMR (minimum-redundancy-maximum-relevance)
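- **Decision trees**: Information gain, the reduction in label entropy from splitting on a feature, *is* the mutual information between that feature and the label
- **Representation learning**: Contrastive objectives such as InfoNCE maximize a lower bound on the mutual information between representations (for example, of an input and its context, or of two views of the same input), while MINE trains a neural network to estimate MI through a variational lower bound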
- **Communication theory**: The capacity of a noisy channel is the maximum mutual information between input and output over input distributions
- **Independent Component Analysis (ICA)**: Find sources by minimizing mutual information between components
- **Neuroscience**: Measure how much information neural responses carry about stimuli
- **Genomics**: Detect gene-gene dependencies that linear methods miss
- **Causal inference**: Used (with caveats) as evidence of statistical association before causal claims
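The decision-tree item in the list above can be made concrete: information gain computed the way a tree learner would (parent entropy minus the weighted entropy of the children) matches mutual information computed from the joint counts. The toy `outlook`/`play` data and the function names are hypothetical and only serve the illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical Shannon entropy in bits of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of each split branch."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        branch = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(branch) / n * entropy(branch)
    return entropy(labels) - remainder

def mutual_information(xs, ys):
    """Empirical I(X; Y) in bits from paired samples (KL form)."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
        for (x, y), c in c_xy.items()
    )

# Hypothetical toy data: one categorical feature and a binary label.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast", "sunny", "rain"]
play    = ["no",    "no",    "yes",  "no",   "yes",      "yes",      "yes",   "yes"]

ig = information_gain(outlook, play)
mi = mutual_information(outlook, play)
print(f"information gain = {ig:.4f} bits, mutual information = {mi:.4f} bits")
```

The two numbers agree up to floating-point rounding because information gain is H(Y) − H(Y | X) evaluated on the empirical distribution.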
## Pointwise vs. Average
Mutual information is the **expected** value of the *pointwise mutual information* (PMI):
*PMI(x, y) = log( p(x, y) / [p(x) p(y)] )*
PMI is widely used in NLP for measuring word association (e.g., 'New York' has very high PMI), while MI averages PMI over the joint distribution.
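A small sketch of the relationship between PMI and MI, using made-up bigram counts (the words and numbers below are hypothetical, not drawn from any corpus):

```python
import math

# Hypothetical bigram counts: (first word, second word) -> count.
counts = {
    ("new", "york"): 8,
    ("new", "car"): 2,
    ("old", "york"): 1,
    ("old", "car"): 9,
}
total = sum(counts.values())

# Joint and marginal probabilities estimated from the counts.
p_xy = {pair: c / total for pair, c in counts.items()}
p_x, p_y = {}, {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# PMI for each observed pair: positive when the pair co-occurs more often
# than independence would predict, negative when it co-occurs less often.
pmi = {
    (x, y): math.log2(p / (p_x[x] * p_y[y]))
    for (x, y), p in p_xy.items()
}

# MI is the expectation of PMI under the joint distribution.
mi = sum(p_xy[pair] * pmi[pair] for pair in p_xy)

for pair, value in pmi.items():
    print(pair, f"PMI = {value:+.3f} bits")
print(f"MI = {mi:.3f} bits")
```

Individual PMI values can be negative, but their average weighted by p(x, y) never is.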
## Practical Notes
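- **Estimation is hard**: With finite samples, mutual information is notoriously difficult to estimate, especially in continuous, high-dimensional settings. Common approaches are histogram (binning) estimators, k-NN estimators (the Kraskov–Stögbauer–Grassberger, or KSG, estimator), and neural estimators (MINE, InfoNCE). Plug-in estimates from binned data are biased upward when samples are scarce
- **Not bounded above**: Unlike correlation, MI has no universal upper bound. For discrete variables it is at most min(H(X), H(Y)), so the ceiling grows with the entropy of the variables; for continuous variables it can be infinite (e.g., when Y is an invertible function of X)
- **Normalized variants**: Normalized mutual information (NMI) rescales MI to a [0, 1] range. Adjusted mutual information (AMI) additionally corrects for chance agreement, so it is near 0 for random labelings, at most 1, and can be slightly negative. Both are useful for comparing clusterings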
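The finite-sample difficulty can be seen with the simplest estimator. The sketch below draws two independent categorical variables, whose true MI is zero, and applies a plug-in estimator at two sample sizes. The seed, the number of categories `k`, the sample sizes, and the function names are arbitrary choices for illustration.

```python
import math
import random
from collections import Counter

def plugin_mi(xs, ys):
    """Plug-in MI estimate in bits: the KL form applied to empirical frequencies."""
    n = len(xs)
    c_xy, c_x, c_y = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (c_x[x] * c_y[y]))
        for (x, y), c in c_xy.items()
    )

rng = random.Random(0)  # fixed seed so the run is repeatable (arbitrary choice)
k = 5                   # categories per variable (assumption)

def independent_sample(n):
    """Draw n pairs in which X and Y are independent and uniform on k values."""
    xs = [rng.randrange(k) for _ in range(n)]
    ys = [rng.randrange(k) for _ in range(n)]
    return xs, ys

mi_small = plugin_mi(*independent_sample(50))    # few samples per cell
mi_large = plugin_mi(*independent_sample(5000))  # many samples per cell

print(f"n = 50:   estimated MI = {mi_small:.4f} bits (true value is 0)")
print(f"n = 5000: estimated MI = {mi_large:.4f} bits (true value is 0)")
```

The estimate is positive even though the variables are independent, and it shrinks toward zero as the sample grows. This upward bias is one reason bias-corrected, k-NN, or permutation-based approaches are often preferred over raw plug-in values.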
## Mental Model
Mutual information answers: *if I learn the value of X, how many bits less surprised am I, on average, when I see Y?* It is the formal currency of 'how much do these two things tell me about each other?' — agnostic to the form of the relationship.