### Information and bits

When we deal with questions of statistical inference, we often talk in terms of information: we say that an observation provides information, which interacts with our prior distribution, through Bayes's theorem, to form the posterior.

But what exactly does "information" mean? Surely there must be a way to quantify it, to be able to compare two sources of information based on the "amount" of information they provide.

Well, consider our sample space, say for an experiment that involved tossing three coins:

$$\{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$$

"Information" would be something like the statement "the first toss was a heads", or "the first two tosses yielded different results", or "the tosses were not all heads". This is information, because this narrows down our sample space. So we can think of information as being quantified by the fraction of hypotheses eliminated, or more precisely: the fraction of probability mass eliminated.

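As a rough illustration of "fraction of probability mass eliminated", here is a minimal Python sketch for the three-coin experiment. It assumes fair, independent coins (so all eight outcomes are equally likely), and the names `sample_space` and `mass_eliminated` are made up for this example. The three events are the three statements quoted above.

```python
from fractions import Fraction
from itertools import product

# All 8 outcomes of three coin tosses, assumed equally likely (fair coins).
sample_space = ["".join(t) for t in product("HT", repeat=3)]

def mass_eliminated(event):
    """Fraction of probability mass ruled out by learning that `event` holds."""
    kept = [w for w in sample_space if event(w)]
    return 1 - Fraction(len(kept), len(sample_space))

first_is_heads = lambda w: w[0] == "H"
first_two_differ = lambda w: w[0] != w[1]
not_all_heads = lambda w: w != "HHH"

print(mass_eliminated(first_is_heads))    # 1/2
print(mass_eliminated(first_two_differ))  # 1/2
print(mass_eliminated(not_all_heads))     # 1/8
```
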
"But statistical inference isn't just about eliminating hypotheses!" It kind of is, actually. Observations restrict your sample space, i.e. they completely eliminate specific possible outcomes. Even when applying Bayes's theorem $P(Y|X=x)=P(X=x|Y)P(Y)/P(X=x)$, the statement $X=x$ is a certain fact, and every inference made through Bayes's theorem is a recalculation of probabilities on this restricted sample space.

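To check the claim that Bayes's theorem is just a recalculation on a restricted sample space, the sketch below computes $P(Y|X=x)$ two ways on the three-coin example. The choices here are assumptions made purely for illustration: fair coins, $X=x$ meaning "the first toss is heads", and $Y$ meaning "at least two heads". The helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

# Three fair coin tosses: 8 equally likely outcomes (an assumption).
sample_space = ["".join(t) for t in product("HT", repeat=3)]

def prob(event, space):
    """Probability of `event` when every outcome in `space` is equally likely."""
    return Fraction(sum(1 for w in space if event(w)), len(space))

x = lambda w: w[0] == "H"        # observation X = x: first toss is heads
y = lambda w: w.count("H") >= 2  # hypothesis Y: at least two heads

# Route 1: throw away every outcome inconsistent with X = x, then recount.
restricted = [w for w in sample_space if x(w)]
by_restriction = prob(y, restricted)

# Route 2: Bayes's theorem, P(X=x|Y) P(Y) / P(X=x), on the full sample space.
given_y = [w for w in sample_space if y(w)]
by_bayes = prob(x, given_y) * prob(y, sample_space) / prob(x, sample_space)

print(by_restriction, by_bayes)  # 3/4 3/4
```
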
It's common (for reasons that will become obvious later in the article on data compression) to represent the amount of information in terms of bits.

Each bit of information eliminates half of the remaining probability mass upon being found to be true.

Intuitively, one may imagine that we're attempting to identify some data in the form of a binary sequence 11001????????... where each sequence has an equal probability. Discovering one new bit halves the space of possible sequences. Of course, our data may not be in this form, something we will get to in the data compression article.

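To make the halving concrete, here is a short worked step, assuming (as above) that all sequences are equally likely, and writing $k$ for the number of bits discovered so far (a symbol introduced here just for illustration). Each discovered bit keeps half of the remaining mass, so the fact "the first $k$ bits are such-and-such" has probability $(1/2)^k$, and inverting that relation recovers the bit count from the probability:

$$P(\text{first } k \text{ bits known}) = \left(\frac{1}{2}\right)^k \quad\Longleftrightarrow\quad k = \log_{1/2} P(\text{first } k \text{ bits known})$$
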
Or in general, the quantity of information carried by an observation $X=x$ is defined as:

$$I(X=x)=\log_{1/2}(P(X=x))=-\log_2 P(X=x)$$

So this notion of information captures our intuitive idea that "the unlikelier a fact seems, the more information it will provide us if it turns out to be true, and the more it will change our beliefs".
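
As a quick numerical sketch of this definition, the function below (the name `information_bits` is made up here) evaluates $\log_{1/2}(p)$ as $-\log_2(p)$, which is the same quantity. The probabilities plugged in are those of the three-coin statements from earlier, assuming fair coins.

```python
import math

def information_bits(p):
    """Information, in bits, of an observation with probability p.

    Computes log base 1/2 of p, written equivalently as -log2(p).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

print(information_bits(1 / 2))  # "the first toss was heads": 1.0 bit
print(information_bits(1 / 8))  # "all three tosses were heads": 3.0 bits
print(information_bits(7 / 8))  # "the tosses were not all heads": about 0.19 bits
```

The unlikely statement ("all heads") carries three bits, while the near-certain one ("not all heads") carries only a fraction of a bit, matching the intuition above.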