Machine learning and information theory

A simple and abstract way to look at machine learning is as a method to find optimal data compression algorithms (and indeed, data compression can be used as a measure of general intelligence). Fundamentally, the following two goals are equivalent: (1) approximating the distribution of some data (2) finding the optimal representation (compression algorithm) for said data -- the optimal representation is just the entropy encoding given the distribution.

Note that all of this is said about unsupervised learning algorithms -- generative modelling, dimensionality reduction/clustering, etc. -- but such structure in the data can also be uncovered during supervised learning (e.g. decision trees can emulate hierarchical clustering, feed-forward neural networks embed data into its feature representations), when the desired labels depend strongly on the clustering.

In supervised learning, you are no longer seeking to simply find an optimal compression for the data, but a compression that "agrees" with the labelling -- you are no longer seeking a representation that best predicts the original data, but a representation that best predicts the labels. In other words, in unsupervised learning you are trying to find a representation of fixed length $T(X)$ that maximizes the expected information that a representation provides on the data, the mutual entropy $I(X;T(X))$; in supervised learning you are trying to find a representation $T(X)$ that maximizes the expected information that a representation provides on the labels $I(y;T(X))$. This is how information gain appears in decision trees.

(See Minimal Achievable Sufficient Statistic Learning. See also information bottleneck method, a generalization of minimal sufficient statistics to allow for representations that lose some information about the labels.)

There is yet a third place that entropy functions appear in machine learning, and that has to do with proper scoring rules for making probabilistic predictions.

Suppose you were betting on an election, and you believed there was a 70% chance of a candidate winning -- under a classic betting scheme, you should put all your money on the winning candidate to maximize your expected earnings. Thus the fraction of money you bet on a candidate (which is 100%) does not reflect your predicted belief of them winning the election (which is 70%). In order to have your bets reflect actual probabilities, one needs to use a proper scoring rule to score your prediction. The only additive proper scoring rule is $\log(q)$, where $q$ is the probability you assign to the outcome -- so where $p$ is the actual probability distribution of the outcomes, the score is $\sum p \log q$, which is the cross-entropy. The same applies to machine learning algorithms making probabilistic predictions for the labels.