### Introduction to Bayesian inference

When the Higgs boson was discovered at the LHC, you may have heard comments (e.g. here) that "there is no way to definitively talk about the probability of the existence of a Higgs boson". More specifically, the physicist quoted in the linked article claimed "there is no way eliminate the conditional".

This is just wrong. There is a way to eliminate the conditional -- it's just Bayes's theorem!

Imagine setting up a tree diagram for this experiment. (The LHC made a certain observation that is highly correlated with the existence of the Higgs boson, and that is what we're expressing below.)
Bayes's theorem can be read as a graphical interpretation of that tree:

$$\mathrm{Pr}(\mathrm{Theory}\mid\mathrm{Data})=\frac{\mathrm{Pr}(\mathrm{Data}\mid\mathrm{Theory})}{\mathrm{Pr}(\mathrm{Data})}\mathrm{Pr}(\mathrm{Theory})$$

What this needs, however, is a prior distribution $\mathrm{Pr}(\mathrm{Theory})$ on the probability of the Higgs boson existing -- before any data is supplied.
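To make the role of the prior concrete, here is a minimal sketch of the formula above for a theory that is either true or false. The function name `posterior` and all the numbers are made up for illustration; they are not actual LHC figures.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Pr(Theory | Data) for a theory that is either true or false."""
    # Pr(Data), summed over the two branches of the tree diagram
    evidence = likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)
    return likelihood_if_true * prior / evidence


# Hypothetical numbers, chosen only for illustration
prior_higgs = 0.9              # Pr(Theory): "pretty sure" before the experiment
p_data_given_higgs = 0.5       # Pr(Data | Theory)
p_data_given_no_higgs = 1e-6   # Pr(Data | not Theory): the signal is a fluke

post_higgs = posterior(prior_higgs, p_data_given_higgs, p_data_given_no_higgs)
print(post_higgs)
```

Without a value for `prior_higgs` the function cannot be evaluated at all, which is the point: the conditional can be eliminated, but only once a prior is supplied.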

And this distribution obviously exists. If you've ever heard a physicist say "we were already pretty sure it existed, this experiment just increased our confidence in it", then you know what I'm talking about. This "pretty sure"-ness in its existence is a prior distribution.

OK -- but how does this prior distribution exist? Why were physicists pretty sure that the Higgs boson existed?

Well, there's a reason they postulated a Higgs boson in the first place -- to explain the mass of certain particles, etc. And they had already observed that those particles have a mass -- i.e. they had already collected some data.

This prior distribution is the posterior distribution of a previous experiment.

But... you're saying that the prior distribution is inferred as the posterior distribution of a previous experiment, which itself depends on the prior distribution of that experiment... isn't it just turtles all the way down?

Well, as you might have guessed, we need a true prior distribution on all physical theories about the universe, before collecting any data. This prior is essentially arbitrary, which might be philosophically troubling for you, but that's just how things are. There is no right prior distribution; it's fundamentally subjective.

But I'd say most people would adopt a prior distribution based on Occam's razor, giving higher priors to simpler theories. For example, even though all existing data agrees with both "string theory" and "string theory until New Year 2020, and then the world ends", we give a higher prior probability to the first one, simply because it takes "fewer lines of code" to write. If you're interested in this stuff, I recommend reading about:
• Occam's razor
• Kolmogorov complexity
• Solomonoff's theory of inductive inference
I suppose Solomonoff's is not the only way to set priors -- another principle, which I have heard of but do not really understand (and don't bother with, because it seems really boring), is the principle of indifference.
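As a toy illustration of the "fewer lines of code" idea, here is a sketch that weights each theory by 2 to the power of minus its description length. This is only an assumption-laden stand-in: Kolmogorov complexity is uncomputable, so the length of an English description in characters is used as a crude proxy, and the function name `occam_prior` is made up for this example.

```python
def occam_prior(descriptions):
    """Give each theory a prior proportional to 2 ** (-description length)."""
    # Crude proxy: character count of the description, NOT true Kolmogorov complexity
    weights = {name: 2.0 ** (-len(text)) for name, text in descriptions.items()}
    total = sum(weights.values())
    # Normalize so the priors sum to 1 over the theories being compared
    return {name: w / total for name, w in weights.items()}


theories = {
    "plain": "string theory",
    "doomsday": "string theory until New Year 2020, and then the world ends",
}

prior = occam_prior(theories)
print(prior)
```

Both theories fit the existing data equally well, so the likelihoods cannot separate them; only the prior does, and here it heavily favours the shorter description.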

This perspective on inference, though, requires the following key fact about Bayesian inference, known as the stability of Bayesian inference: feeding in some data, then feeding in some other independent data, is equivalent to feeding in both pieces of data together. (The independence, more precisely independence of the two observations given the theory, is important: e.g. if you fed in the same piece of data twice, the second copy should not affect your distribution.) We can prove this fairly easily:

Given a prior distribution $\mathrm{B}_0(\theta)$ on theory-space values $\theta$, the posterior distribution upon observing some data $\delta_1$ is:

$$\mathrm{B}_1(\theta)=\frac{\mathrm{P}(\delta_1\mid\theta)}{\sum_\phi \mathrm{P}(\delta_1\mid\phi)\mathrm{B}_0(\phi)}\mathrm{B}_0(\theta)$$
Which then becomes the prior distribution for a subsequent observation of $\delta_2$:

$$\mathrm{B}_2(\theta)=\frac{\mathrm{P}(\delta_2\mid\theta)}{\sum_\phi\mathrm{P}(\delta_2\mid\phi)\mathrm{B}_1(\phi)}\mathrm{B}_1(\theta)$$
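Substituting in $\mathrm{B}_1(\theta)$, the normalizing sum of $\mathrm{B}_1$ appears in both the numerator and the denominator and cancels, and by independence $\mathrm{P}(\delta_1\mid\theta)\mathrm{P}(\delta_2\mid\theta)=\mathrm{P}(\delta_1\land\delta_2\mid\theta)$, so: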

$$\mathrm{B}_2(\theta)=\frac{\mathrm{P}(\delta_1\land\delta_2\mid\theta)}{\sum_\phi\mathrm{P}(\delta_1\land\delta_2\mid\phi)\mathrm{B}_0(\phi)}\mathrm{B}_0(\theta)$$
Which is precisely what we wanted.
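The same identity can be checked numerically. The sketch below is an illustration under assumptions of my own choosing: theory-space is a coarse grid of coin biases, $\delta_1$ is "heads", $\delta_2$ is "tails", and the two tosses are independent given the bias. The helper name `update` is hypothetical.

```python
def update(belief, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    unnormalized = [lk * b for lk, b in zip(likelihood, belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# theta = probability of heads, on a coarse illustrative grid
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
b0 = [1.0 / len(thetas)] * len(thetas)   # uniform prior B_0

lik_1 = [t for t in thetas]              # P(delta_1 = heads | theta)
lik_2 = [1.0 - t for t in thetas]        # P(delta_2 = tails | theta)

# Feed delta_1, then delta_2
b2_sequential = update(update(b0, lik_1), lik_2)

# Feed both at once; the joint likelihood factorizes by independence given theta
lik_joint = [l1 * l2 for l1, l2 in zip(lik_1, lik_2)]
b2_batch = update(b0, lik_joint)

max_gap = max(abs(s - b) for s, b in zip(b2_sequential, b2_batch))
print(max_gap)
```

The two routes should agree up to floating-point rounding, since the intermediate normalization in the sequential route cancels exactly as in the derivation.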

Apparently there are systems of probability theory, called noncommutative probability, in which this does not hold and statistical inference is not possible -- see (Redei 1992) (isn't it weird how everything in probability and statistics is so recent?). Obviously, this is not relevant to the physical applications of probability.

If you want to see Bayesian inference in action, have a look at this interactive RShiny applet I wrote that demonstrates Bayesian inference from a continuous stream of data (relying on this stability of Bayesian inference). Here's a snapshot from the applet, to pique your interest, of the evolution of this belief distribution while tossing a coin that you gradually learn is pretty unfair: