Special cases of Bayesian inference

In the last article, we saw that Bayes's theorem allows us to infer theories from data -- or rather infer from data a distribution on the theory-space. This is the general way to make inferences and use these inferences to make predictions. We don't really need to bother about questions of whether we should choose the mean or median or mode as the theory we "choose", because we don't have to choose -- we make predictions with the entire distribution.

For example, in the most general setting, we're trying to pick our optimal action out of a set of choices, based on a utility function whose value is unknown because it depends on the right theory of physics (which we need in order to calculate the consequences of the action). We don't need to "choose" the right theory of physics from the probability distribution we have on it -- we just make predictions for each possible theory in the support and integrate over the distribution to calculate the distribution on possible consequences. It's only at this point that we need to care about whether we want the action with the maximum mean, median, mode or whatever of the utility function.
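As a rough illustration of "predict under every theory, then integrate", here is a minimal sketch. Everything in it is made up for the example: the theory is the unknown bias `theta` of a coin, the posterior sits on a five-point grid, and the three actions and their payoffs are hypothetical. It also assumes we settle on the mean of the utility as the thing to maximise, which is only one of the options mentioned above.

```python
# Hypothetical theory-space: theta is the unknown bias of a coin, and our
# posterior over theta lives on a small grid (the weights sum to 1).
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
posterior = [0.05, 0.10, 0.20, 0.40, 0.25]

def utility(action, heads):
    """Made-up payoff of an action once the coin has landed."""
    if action == "bet_heads":
        return 1.0 if heads else -1.0
    if action == "bet_tails":
        return -1.0 if heads else 1.0
    return 0.0  # "no_bet"

def expected_utility(action):
    total = 0.0
    for theta, weight in zip(thetas, posterior):
        # Prediction under this particular theory ...
        under_theta = (theta * utility(action, True)
                       + (1 - theta) * utility(action, False))
        # ... weighted by how much we believe the theory.
        total += weight * under_theta
    return total

actions = ["bet_heads", "bet_tails", "no_bet"]
scores = {action: expected_utility(action) for action in actions}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

At no point do we pick a single value of `theta`; the only choice made is which summary of the utility to maximise at the very end.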

Can you show that Bayesian inference can always converge? I.e. that there is always some data collection mechanism that allows your distribution to converge to a Dirac delta at some point? Well, there is, and it's called "sampling". What if we can't sample directly but collect some other indirect data? Comment on the philosophical implications of such a theory-space (hint: think about a probabilistic generalisation of falsifiability (a criterion for a theory to be scientific)).
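For the direct-sampling case, the concentration of the posterior can at least be watched numerically. The sketch below is not a proof and makes several assumptions of its own: a coin with hypothetical true bias `true_theta`, a uniform prior on $[0, 1]$ (so the posterior after the flips is a Beta distribution), and a fixed random seed.

```python
import random

random.seed(0)       # fixed seed so the run is reproducible
true_theta = 0.3     # hypothetical "right theory": the coin's real bias

def beta_mean_sd(heads, tails):
    """Mean and standard deviation of the Beta(heads + 1, tails + 1)
    posterior that a uniform prior on [0, 1] gives after these flips."""
    a, b = heads + 1, tails + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

checkpoints = [10, 100, 1000, 10000]
heads = 0
summaries = []       # (number of flips, posterior mean, posterior sd)
for n in range(1, checkpoints[-1] + 1):
    heads += random.random() < true_theta   # one more direct sample
    if n in checkpoints:
        mean, sd = beta_mean_sd(heads, n - heads)
        summaries.append((n, mean, sd))

for n, mean, sd in summaries:
    print(f"n={n:>5}  posterior mean={mean:.3f}  posterior sd={sd:.4f}")
```

The posterior standard deviation shrinks roughly like $1/\sqrt{n}$, which is the sense in which the distribution heads towards a Dirac delta; the indirect-data case is left as the question above poses it.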

But some special cases of Bayesian inference -- with special choices of prior -- exist and are often discussed/used in the literature for various reasons, and in fact, they often come with mechanisms to choose specific estimates based on the posterior distribution, as well as specific projections of our data to use. We will discuss some of these examples below.

Perhaps the most straightforward example is maximum likelihood estimation. Here, the prior distribution is the uniform distribution, i.e.

$$\mathrm{Pr}(\theta\mid x)=\frac{\mathrm{Pr}(x\mid\theta)\,\mathrm{Pr}(\theta)}{\mathrm{Pr}(x)}\propto\mathrm{Pr}(x\mid\theta)$$
And we always take the mode of this (the value of $\theta$ that maximises $\mathrm{Pr}(\theta\mid x)$) to be the estimate of the parameter. Since $\mathrm{Pr}(\theta)$ is constant, this is just the value of $\theta$ that maximises $\mathrm{Pr}(x\mid\theta)$ (for observed data values $x$), which, viewed as a function of $\theta$, is called the likelihood function.

I.e. we're saying we have no preconceptions about the theory, and the right theory is just the one that gives the observed data the highest probability (or probability density).
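A quick numerical check of this, as a sketch under made-up assumptions: the data are 7 heads in 10 coin flips, $\theta$ is the coin's bias, and the theory-space is a grid of 101 values on $[0, 1]$.

```python
k, n = 7, 10   # hypothetical data: 7 heads in 10 flips

thetas = [i / 100 for i in range(101)]   # grid over the parameter

def likelihood(theta):
    # Pr(x | theta) up to the binomial coefficient, which does not depend
    # on theta and so does not move the maximum.
    return theta**k * (1 - theta)**(n - k)

likelihoods = [likelihood(t) for t in thetas]

# Uniform prior on the grid, then Bayes's theorem.
prior = [1 / len(thetas)] * len(thetas)
unnormalised = [l * p for l, p in zip(likelihoods, prior)]
evidence = sum(unnormalised)             # Pr(x)
posterior = [u / evidence for u in unnormalised]

mle = thetas[likelihoods.index(max(likelihoods))]
map_estimate = thetas[posterior.index(max(posterior))]
print(mle, map_estimate)
```

Both estimates land on the same grid point, $k/n = 0.7$, because multiplying by a constant prior and dividing by $\mathrm{Pr}(x)$ rescales the curve without moving its peak.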

Note that even in maximum likelihood estimation, you kinda do have a non-uniform prior, in the sense that the prior is only uniform on its support, and you give zero probability to things outside the domain of the parameter, or to theories other than the family being tested, etc. So if you were just taught maximum likelihood estimation, it would be easy for you to come up with the idea of Bayesian inference.

The next one is a hypothesis test -- it actually took me a while to figure out the implied prior here, because of the whole weirdness with looking at the probabilities of things like $\mathrm{Pr}(x\ge \mathrm{data})$ rather than just $\mathrm{Pr}(x=\mathrm{data})$.

But then I realised that the whole thing is plainly a distraction -- the $x\ge \mathrm{data}$ stuff is just there so we can say smart-sounding things like "the probability of getting a result that extreme is such-and-such if the null hypothesis were true". One can write a bijection $\mathrm{Pr}(x\ge a)\leftrightarrow a$, because $\mathrm{Pr}(x\ge a)$ is a decreasing function of $a$ (strictly so wherever the distribution actually puts probability). So the value of $\mathrm{Pr}(x\ge a)$ pins down $a$, and in particular allows us to find the PMF or PDF at $a$.
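To see concretely that the tail probabilities carry the same information as the PMF, here is a small sketch for a discrete case. The null hypothesis used (10 flips of a fair coin) is just an assumption for the example. In the discrete case the PMF falls out as a difference of neighbouring tail probabilities; in the continuous case the PDF would be minus the derivative of the tail probability with respect to $a$.

```python
from math import comb

n, p = 10, 0.5   # hypothetical null hypothesis: 10 flips of a fair coin

def pmf(a):
    """Pr(x = a) under the null."""
    return comb(n, a) * p**a * (1 - p)**(n - a)

def tail(a):
    """Pr(x >= a) under the null."""
    return sum(pmf(k) for k in range(a, n + 1))

# Tail probabilities for a = 0, ..., n + 1 (the last one is an empty sum, 0).
tails = [tail(a) for a in range(n + 2)]

# Strictly decreasing on the support, so each tail value identifies one a.
strictly_decreasing = all(tails[a] > tails[a + 1] for a in range(n + 1))

# Pr(x = a) = Pr(x >= a) - Pr(x >= a + 1)
recovered_pmf = [tails[a] - tails[a + 1] for a in range(n + 1)]
max_error = max(abs(recovered_pmf[a] - pmf(a)) for a in range(n + 1))
print(strictly_decreasing, max_error)
```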

So suppose we have the hierarchical prior $\mathrm{Pr}(\theta)$ given by:


hypothesis test, confidence interval -- http://econ.ucsb.edu/~startz/Choosing%20The%20More%20Likely%20Hypothesis.pdf
lasso, ridge, tikhonov
decision theory
various test statistics
