What Makes Information Mutual
The structure that lets information flow between variables
Mutual information is often introduced through its formula: I(X;Y) = H(X) − H(X|Y). Knowing Y reduces your uncertainty about X. The difference between what you didn't know before and what you don't know after — that's the information they share.
But this formula hides something more interesting. It tells us that sharing requires structure. Two variables can only exchange information if their joint distribution deviates from independence in specific ways. The formula doesn't just quantify overlap — it reveals the conditions under which information can flow at all.
The Formula as Meaning
Start with entropy H(X), the uncertainty you have about X before observing anything. If you observe Y, your remaining uncertainty about X becomes H(X|Y), the conditional entropy. The difference — what Y removed from your uncertainty — is the mutual information.
This is genuinely information-theoretic. You're not learning the value of X. You're learning something that constrains X, narrowing its possibilities. Y might not tell you X's value directly, but it tells you enough to rule out some values. MI measures how much Y constrains X.
The equivalent form I(X;Y) = H(X) + H(Y) − H(X,Y) reveals another view. Picture a Venn diagram: two circles for H(X) and H(Y), overlapping where they share information. The overlap is I(X;Y). The union is H(X,Y). Subtract the union from the sum of the individuals, and you isolate the shared region.
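Both forms can be checked numerically. A minimal sketch in Python, using a small hypothetical joint distribution over two bits (the probabilities are made up for illustration):

```python
from math import log2

# Hypothetical joint distribution p(x, y) over two bits, chosen so
# that X and Y are dependent (it does not factor into its marginals).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, axis):
    """Sum the joint over the other coordinate."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

p_x, p_y = marginal(joint, 0), marginal(joint, 1)

# H(X|Y) = sum over y of p(y) * H(X | Y=y)
h_x_given_y = 0.0
for y, py in p_y.items():
    cond = {x: p / py for (x, yy), p in joint.items() if yy == y}
    h_x_given_y += py * entropy(cond)

mi_def  = entropy(p_x) - h_x_given_y                    # H(X) - H(X|Y)
mi_venn = entropy(p_x) + entropy(p_y) - entropy(joint)  # H(X) + H(Y) - H(X,Y)
print(mi_def, mi_venn)  # the two forms agree (~0.278 bits)
```

Both expressions evaluate to the same number, as the algebra guarantees: subtracting the union from the sum of the individuals isolates the same overlap that conditioning removes.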
But these visualizations can obscure the deeper point: the sharing itself has structure. Not just any pair of variables shares information. Only coupled ones do.
What Zero Means
Mutual information is zero if and only if X and Y are independent. Not "weakly correlated" — mathematically independent, meaning p(x,y) = p(x)·p(y) for all values.
This is a precise statement: I(X;Y) = 0 when the joint distribution factors into the product of marginals. The joint distribution p(x,y) contains no structure beyond what's already in p(x) and p(y) separately. There's no coupling, no dependency, no information channel between them.
Conversely, I(X;Y) > 0 exactly when p(x,y) cannot be factored this way. The deviation from independence is what creates mutual information. The stronger the deviation, the more structure in the joint that can't be explained by independent behavior.
This is why MI is the Kullback-Leibler divergence between the joint and the product of marginals:
I(X;Y) = D_KL(p(x,y) || p(x)·p(y))
The KL divergence measures "distance" between distributions (not a true metric, but it quantifies divergence). MI is the divergence between what the variables actually do together versus what they would do if independent. It measures coupling.
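The KL identity is easy to verify directly. A sketch, with two made-up joint distributions, one that factors and one that does not:

```python
from math import log2

def mi_as_kl(joint):
    """I(X;Y) = D_KL( p(x,y) || p(x)p(y) ), in bits."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Product of marginals: p(x,y) = p(x) * p(y) everywhere -> divergence is 0.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# A joint that deviates from independence -> positive divergence.
coupled = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

print(mi_as_kl(indep))    # 0.0
print(mi_as_kl(coupled))  # ~0.278 bits of coupling
```

The independent joint yields exactly zero, because every log ratio inside the sum is log(1); the coupled joint yields a positive number that grows with the deviation from factorizability.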
The Structure of Coupling
What makes p(x,y) unfactorable? Dependency, which arises from:
Direct causation. If X causes Y, then observing Y tells you something about X. The causal arrow creates statistical dependency. Temperature causes ice cream sales to rise. The joint p(temperature, sales) doesn't factor — higher temperatures cluster with higher sales, lower with lower.
Common cause. If X ← Z → Y, both X and Y depend on Z. Observing one tells you about Z, which tells you about the other. Your height and vocabulary are coupled because both depend on age. The joint can't factor because knowing height constrains age, which constrains vocabulary.
Constraint sharing. If X and Y both obey the same constraint, they become coupled. Two prices in a market are coupled because arbitrage opportunities must vanish. Two particles' positions are coupled because total energy is conserved.
Noise in a channel. A message X transmitted through a noisy channel becomes Y. The noise adds independent randomness, but the signal structure survives. MI measures how much of X's information survives the channel. Y is "what X became" after corruption.
In each case, the joint distribution acquires structure that independence can't explain. MI quantifies this structure.
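The channel case can be made quantitative. A sketch of the standard binary symmetric channel (a uniform input bit, flipped with probability eps), where the closed form I(X;Y) = 1 − H_b(eps) follows from I(X;Y) = H(Y) − H(Y|X):

```python
from math import log2

def h_bin(p):
    """Binary entropy H_b(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mi(eps):
    """I(X;Y) for a uniform input bit X sent through a binary symmetric
    channel that flips the bit with probability eps. By symmetry Y is
    also uniform, so I(X;Y) = H(Y) - H(Y|X) = 1 - H_b(eps)."""
    return 1.0 - h_bin(eps)

print(bsc_mi(0.0))   # 1.0: noiseless, all of X survives
print(bsc_mi(0.5))   # 0.0: pure noise, Y is independent of X
print(bsc_mi(0.1))   # partial survival, strictly between 0 and 1
```

The two endpoints bracket the intuition: a noiseless channel preserves the full bit, and a coin-flip channel severs the coupling entirely.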
What Must Be True
For mutual information to exist — for I(X;Y) to be positive — the variables must share statistical structure. The required condition is dependence, a strictly broader notion than correlation:
Correlation is linear: Cov(X,Y) ≠ 0. MI captures any dependency, linear or not, so two variables can be uncorrelated yet highly mutually informative. A symmetric U-shaped relationship is the classic case: if Y = X² and X is distributed symmetrically about zero, the linear correlation is exactly zero, yet knowing X tells you exactly what Y is, so I(X;Y) = H(Y) > 0.
MI captures the full dependence structure. Any function Y = f(X) has I(X;Y) = H(X) when f is invertible (you can recover X from Y), and I(X;Y) = H(Y) < H(X) when f loses information (many X values map to the same Y, so Y carries less than all of X).
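The Y = X² case can be checked directly. A minimal sketch with X uniform on {-1, 0, 1}, a hypothetical choice that makes the symmetry exact:

```python
from math import log2

# X uniform on {-1, 0, 1}, Y = X**2: the joint puts 1/3 on each x.
joint = {(-1, 1): 1/3, (0, 0): 1/3, (1, 1): 1/3}

def moment(f):
    """Expectation of f(X, Y) under the joint."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

cov = (moment(lambda x, y: x * y)
       - moment(lambda x, y: x) * moment(lambda x, y: y))

# Mutual information via the KL form, marginals computed from the joint.
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p
mi = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
h_y = -sum(p * log2(p) for p in py.values())

print(cov)      # 0.0: no linear correlation at all
print(mi, h_y)  # ~0.918 bits each: I(X;Y) = H(Y), since Y = f(X)
```

The covariance vanishes by symmetry, while the mutual information equals H(Y): the squaring discards the sign of X, so Y carries H(Y) bits of X's information rather than all H(X) bits.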
The condition for MI > 0 is: there exists at least one pair (x, y) with p(y) > 0 where p(X=x|Y=y) ≠ p(X=x). Observing Y changes X's probability distribution for at least one outcome. If no such pair exists, MI = 0.
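That condition is directly checkable. Using a hypothetical coupled joint over two bits:

```python
# Hypothetical coupled joint over two bits.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x0 = joint[(0, 0)] + joint[(0, 1)]   # p(X=0) = 0.5
p_y0 = joint[(0, 0)] + joint[(1, 0)]   # p(Y=0) = 0.5
p_x0_given_y0 = joint[(0, 0)] / p_y0   # p(X=0 | Y=0) = 0.8

# Observing Y=0 shifts X's distribution (0.5 -> 0.8), so MI > 0.
print(p_x0, p_x0_given_y0)
```

One such pair is enough: the conditional differs from the marginal, so the joint cannot factor, and the mutual information is positive.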
The Compression View
There's another interpretation. Suppose you want to transmit X over a channel. Without any side information, you need H(X) bits on average — enough to encode all its uncertainty.
But if your receiver already knows Y, and Y shares information with X, you don't need to encode the mutual information. You only need H(X|Y) bits. MI is the savings you get from the shared knowledge.
This is Slepian-Wolf coding: if two correlated sources X and Y are encoded separately but decoded jointly, you don't need H(X) + H(Y) bits. You need H(X,Y) bits. The savings is precisely I(X;Y).
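The bookkeeping can be sketched for a toy pair of correlated bit sources (the joint below is hypothetical, chosen so Y agrees with X 90% of the time):

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Correlated sources: X a fair bit, Y a 90%-faithful copy of X.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.5, 1: 0.5}

separate = entropy(p_x) + entropy(p_y)  # rate if each source encoded alone
together = entropy(joint)               # rate achievable with joint decoding
savings  = separate - together          # this difference is exactly I(X;Y)

print(separate, together, savings)  # 2.0 bits vs ~1.469 bits: ~0.531 saved
```

The half-bit saved per pair is the mutual information: the part of Y's description that a joint decoder can reconstruct from X instead of receiving over the wire.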
The compression view makes the structure explicit: you can only save bits if the variables are coupled. Independent random variables save nothing. You need genuine dependency structure to compress.
What This Reveals
Mutual information isn't just a measure. It's a probe. When you compute I(X;Y) and find it's nonzero, you've learned that the joint distribution contains structure beyond independence. The variables are coupled.
When you find 0 < I(X;Y) < H(X) and I(X;Y) < H(Y), you've learned that neither variable fully determines the other. There's shared information, but also private information. The coupling is partial.
When you find I(X;Y) = H(X) = H(Y), you've learned that X and Y are informationally equivalent — knowing one fully specifies the other. The variables are different representations of the same information.
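Informational equivalence is easy to exhibit: any invertible relabeling of X produces exactly this situation. A sketch, using an arbitrary made-up distribution:

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a {outcome: probability} mapping."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Y is an invertible relabeling of X: same information, different names.
p_x = {"a": 0.5, "b": 0.25, "c": 0.25}
relabel = {"a": 1, "b": 2, "c": 3}
joint = {(x, relabel[x]): p for x, p in p_x.items()}
p_y = {relabel[x]: p for x, p in p_x.items()}

mi = entropy(p_x) + entropy(p_y) - entropy(joint)
print(entropy(p_x), entropy(p_y), mi)  # 1.5 1.5 1.5: I(X;Y) = H(X) = H(Y)
```

All three quantities coincide because the joint distribution has exactly the same probability values as each marginal: no uncertainty about one variable remains once the other is known.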
The formula I(X;Y) = H(X) − H(X|Y) captures a deep principle: information is reduction of uncertainty. But information can only be mutual — can only be shared — when the variables are not independent. Sharing requires dependence. Dependence requires coupling. And coupling requires a mechanism: cause, common cause, constraint, or channel.
What makes information mutual? Structure that can't be factored away.