### Lossless compression, invertible change of variables and entropy

In the last article, we introduced the notion of information, and defined a bit as a unit of information that halved the probability mass on the sample space. We used that to more generally define the information associated with an observation, in intuitive terms of a "number of bits" it possesses.

$$I(X=x)=\log_{1/2}(P(X=x))$$
All this makes perfect sense of course, when talking about a sample space of IID uniform binary sequences (e.g. fair coin tosses), then discovering each term in the sequence (i.e. a bit) literally gives you one bit of information.

But e.g. consider the sample space $\{HH,HT,TH,TT\}$ and consider receiving the information that the result is either $HH$ or $TT$. You haven't actually received a physical bit, but you have halved your sample space, so you have received an information-theoretic bit.

That's uncomfortable. Maybe we should start looking for the source of our discomfort.

Well, out of all the words in italic sequence above, the only phrase I'm uncomfortable is "physical bit". That's a term that makes sense when we're talking about physical computers, but what does it really mean in an abstract, mathematical, information-theoretic sense? What exactly does it mean to receive the "physical bit" $H$, and how does it differ from receiving information like "the two tosses are the same"?

And then you realize, it doesn't differ. "Receiving the physical bit $H$" just means receiving the information that the first bit is an $H$. That's all the bit means. There is no such thing as a physical bit. (I mean, there is -- we'll use "physical bit" interchangeably with "character" from now on, but the notion of a physical bit particularly having to be something like "$H$" or "$T$", rather than some arbitrary encoding, is something arbitrary.)

Instead of encoding our sample space in terms of "(what was the first coin toss?)(what was the second coin toss?)", we could encode our sample space in terms of "(were the two coin tosses the same?)(what was the first coin toss?)". There is absolutely nothing fundamental about either encoding. And the notion of a physical bit doesn't really exist (think of unfair coins), all we have is the abstract idea of a bit.

This is of course completely expected -- most of our data (not on computers) isn't even in the form of binary sequences.

What we have is an invertible change of variables between two encodings.

Often what determines our choice of encoding are cost considerations.

What do I mean? Suppose we have two coin tosses, as before, but instead of being fair, the probability of a heads is 90%. Then HH will appear 81% of the time, HT and TH will both appear 9% of the time, and TT will appear 1% of the time.

Then consider the following two encodings:
• (first toss is an H?)(second toss is an H?)
• (result is HH?)(else: result is HT?)(else: result is TH?)
In the first encoding, the data will always occupy 2 characters.

In the second encoding, the data occupies 1 character 81% of the time, 2 character 9% of the time and 3 character 10% of the time, i.e. an expected value of 1.29 characters.

This sort of knowledge about probability distributions is used for data compression all the time. For example, images are very likely to have long, continued strings of the same colour, so rather than writing an extra character for each pixel, you just store information like "5 reds, 6 greens..." (this is the idea behind run-length encoding, by the way).

This use of change of encoding is known as data compression -- and the basic idea is to get more common strings to have shorter encodings (the probability of a string's occurrence captures the probabilities of each individual character and the correlations between the characters).

(and we can have a non-invertible change of variables too, of course, which leads to lossy compression, where we project our data by discarding dimensions we don't care about.)

The natural question to ask is what the optimal data encoding should be. This is a very fundamental question -- about how much information is actually present in an object, how many Yes/No questions are actually needed, on average, to specify this particular object in the sample space.

(And if you don't understand why these are the same question: you should re-read the first article in this course: Information and bits.)

Well, the formulation of the problem basically spells out "binary tree". We need to organize the elements of our sample space into a binary tree so that the expected number of yes/no questions we need to ask/bits we need to provide are minimized, so these physical bits are actual bits of information. We organize our data space into a binary tree, with each split (bit) dividing the data into two halves of the probability mass -- this way each physical bit (character) represents an actual bit of information.

Such an encoding -- an optimal binary tree -- is known as entropy encoding; here, the optimal length of code for an object is precisely equal to its information, $\log_{1/2}(P(X))$. As a result, the expected number of bits necessary to specify an object is $\sum P(X)\log_{1/2}(P(X))$, which is known as the entropy of the system.

Simple entropy encodings include the Huffman coding and the Arithmetic coding.

In machine learning, we are often more interested in lossy compression -- the problem of determining the optimal representation for a given expected value of "distortion" (calculated via a loss function). Often practically, we have a given size (rate) of representation and calculate the minimum expected loss (distortion) that can be achieved through a compression algorithm.

In general, the distortion is represented by a loss function $d(x,\hat{x})$ -- this loss function is not an information theoretic construct, but derives from the practical motivation to reduce distortion, i.e. the preferences/utility of the decision-making agent choosing the compression algorithm. Typical loss functions include the Hamming (discrete/Dirac) distortion $\delta(x\ne\hat{x})$, appropriate for symbols that are essentially categorical, the mean-squared-error, appropriate for continuous variables, the cross-entropy, appropriate for probability distributions etc.

One may imagine that in the encoder-decoder setting as above, $Y$ should be a lossless encoding  for $\hat{X}$ and thus its length should be the entropy of $\hat{X}$. However, in general, we are only concerned with the information that $\hat{X}$ provides on $X$, i.e. the mutual entropy $I(X;\hat{X})$. Thus the rate-distortion function -- representing the minimal rate for a given distortion -- ought to be given by $R(D)=\inf\limits_{\hat{X}\mid E(d(\hat{X},X)) < D}I(X;\hat{X})$, where the minimization is over probability distributions $p(\hat{X}\mid X)$.