Huffman coding is a foundational **lossless data compression** algorithm that produces an optimal prefix-free binary code for a given set of symbols and their probabilities. Designed by David A. Huffman in 1952 while he was an MIT graduate student — and famously a homework problem assigned by his professor Robert Fano — Huffman's algorithm beats Fano's own prior method (Shannon-Fano coding) and remains a building block of modern compression pipelines.
## Core Idea
In fixed-length encoding (like ASCII), every symbol takes the same number of bits regardless of how often it appears. This wastes space when the distribution is skewed: 'e' appears far more often than 'z' in English, so giving them equal-length codes is suboptimal.
Huffman coding exploits this by:
1. Counting symbol frequencies
2. Assigning **shorter codes to frequent symbols** and **longer codes to rare ones**
3. Ensuring the codes are **prefix-free** so no code is the start of another — making decoding unambiguous
The result minimizes the expected number of bits per symbol for a given symbol distribution.
## The Algorithm
Huffman's algorithm is elegantly simple:
1. Create a leaf node for each symbol with weight equal to its probability/frequency
2. While more than one node remains:
- Pick the two nodes with smallest weights
- Create a parent with weight = sum of the two weights, with these as children
- Edges to children get bit labels (0 and 1)
3. The path from root to each leaf gives the symbol's binary code
Using a priority queue (min-heap), this runs in **O(n log n)** time for n symbols.
## Example
Consider symbols A (frequency 5), B (2), C (1), D (1):
- Combine C+D → 2
- Combine that with B → 4
- Combine with A → 9
- Resulting codes: A=0 (1 bit), B=10 (2 bits), C=110 (3 bits), D=111 (3 bits)
Average code length = (5·1 + 2·2 + 1·3 + 1·3) / 9 ≈ 1.67 bits/symbol — much better than the 2 bits/symbol fixed encoding would give, and very close to the Shannon entropy of this distribution.
## Optimality
Huffman coding is **optimal among symbol-by-symbol prefix codes** when symbol probabilities are known and fixed. The expected code length L satisfies:
*H(X) ≤ L < H(X) + 1*
where H(X) is the Shannon entropy of the source. So Huffman achieves the entropy bound to within one bit per symbol — and exactly meets it when all probabilities are negative powers of 2.
## Limitations
Huffman is optimal but not perfect:
- **Integer-bit constraint**: each code must have an integer number of bits, so it can only approach (not match) the entropy bound for non-dyadic distributions
- **Static distributions**: classic Huffman requires knowing the symbol probabilities up front. **Adaptive Huffman** updates the tree as symbols are seen, at the cost of complexity
- **Header overhead**: the code table must be transmitted along with the compressed data
- **No context modeling**: symbols are encoded independently, ignoring correlations between adjacent symbols
- **Beaten by arithmetic and range coders** for compression ratio, since they encode using fractional bits
## Variants and Improvements
- **Adaptive Huffman coding** (FGK, Vitter): tree updates as data is read, allowing one-pass compression
- **Canonical Huffman coding**: orders codes by length to enable compact tree storage and fast decoding
- **Length-limited Huffman**: caps maximum code length for fixed-width hardware decoders
- **Range coding / arithmetic coding**: assigns fractional bits per symbol, achieving entropy almost exactly. Used in modern compressors
- **Asymmetric Numeral Systems (ANS)**: a more recent family that combines Huffman's speed with arithmetic coding's compression ratio. Used in Zstandard, Facebook's Zstd, and Apple's LZFSE
## Where Huffman Is Used Today
Despite its age, Huffman coding is everywhere:
- **DEFLATE**: the algorithm behind ZIP, gzip, and PNG combines LZ77 with Huffman coding
- **JPEG**: applies Huffman coding to quantized DCT coefficients
- **MP3 and AAC**: Huffman-codes quantized audio coefficients
- **HTTP/2 HPACK**: compresses headers using a static Huffman code
- **HEVC and H.264 fallback paths**: use Huffman coding (CAVLC) as alternative to arithmetic coding
- **Brotli, Zstandard**: use Huffman coding for some sub-streams (entropy stage)
In modern codecs, Huffman is often the entropy coder of choice when speed matters more than the last few percent of compression — its decoder is essentially a small lookup table.
## Significance
Huffman coding was the first concrete demonstration that Shannon's source coding theorem could be approached in practice, with a simple greedy algorithm. It made information theory operational and showed that **compression is fundamentally about probability** — the more you know about your data, the better you can compress it.
## Mental Model
Huffman coding is the principle that **rare events deserve long descriptions; common events deserve short ones**. Anywhere you allocate variable-cost representations to items of varying frequency — from URL shorteners reserving short codes for popular pages, to languages giving common words short forms ('the' vs. 'antidisestablishmentarianism') — you are using Huffman's idea.