The Winding Number: Uniform priors and Maximum Likelihood Estimation

In the last article, we saw that Bayes's theorem allows us to infer theories from data -- or rather infer from data a distribution on the theory-space. This is the general way to make inferences and use these inferences to make predictions. We don't really need to bother about questions of whether we should choose the mean or median or mode as the theory we "choose", because we don't have to choose -- we make predictions with the entire distribution.

For example, in the most general setting, we're trying to choose our optimal action from a set of choices, based on a utility function that is unknown because it depends on the right theory of physics (which we need in order to calculate the consequences of the action). We don't need to "choose" the right theory of physics from the probability distribution we have on it -- we just make predictions for each possible theory in the support and integrate over the distribution to calculate the distribution on possible consequences. It's only at this point that we need to care about whether we want the action with the maximum mean, median, mode or whatever of the utility function.
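To make this concrete, here is a minimal sketch on a toy problem that is not from the article: the "theory" is a coin's bias, the theory-space is a grid of 101 candidate biases, and the actions are bets on the next flip. All names (`THETAS`, `posterior`, `expected_utility`) and the win-1/lose-1 payoffs are assumptions made for illustration, and the sketch picks the action with the highest mean utility, which is only one of the possible choices mentioned above.

```python
# Hypothetical toy problem: the "theory" is a coin's bias theta, and the
# actions are bets on the next flip. None of these names come from the article.
THETAS = [i / 100 for i in range(101)]      # discretised theory-space
PRIOR = [1 / len(THETAS)] * len(THETAS)     # uniform prior on the grid


def posterior(heads, tails):
    """Posterior over THETAS after the observed flips (Bayes's theorem)."""
    weights = [p * t**heads * (1 - t)**tails for p, t in zip(PRIOR, THETAS)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_utility(action, post):
    """Average the utility over every theory, weighted by the posterior."""
    def utility_given_theta(t):
        if action == "pass":
            return 0.0
        p_win = t if action == "bet_heads" else 1 - t
        return p_win * 1.0 + (1 - p_win) * (-1.0)   # assumed payoffs: win 1, lose 1
    return sum(p * utility_given_theta(t) for p, t in zip(post, THETAS))


post = posterior(heads=7, tails=3)
actions = ["bet_heads", "bet_tails", "pass"]
# No single theta is ever selected: each action is scored against the
# whole posterior, and only then do we pick the action.
best_action = max(actions, key=lambda a: expected_utility(a, post))
print(best_action, expected_utility(best_action, post))
```

Note that no estimate of the bias is ever computed; the only thing that gets "chosen" is the action.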

Can you show that Bayesian inference can always converge? I.e. that there is always some data collection mechanism that allows your distribution to converge to a Dirac delta at some point? Well, there is, and it's called "sampling". What if we can't sample directly but collect some other indirect data? Comment on the philosophical implications of such a theory-space (hint: think about a probabilistic generalisation of falsifiability (a criterion for a theory to be scientific)).

But some special cases of Bayesian inference -- with special choices of prior -- exist and are often discussed/used in the literature for various reasons, and in fact, they often come with mechanisms to choose specific estimates based on the posterior distribution, as well as specific projections of our data to use. We will discuss some of these examples below.

Perhaps the most straightforward example is maximum likelihood estimation. Here, the prior distribution is the uniform distribution, so the posterior is proportional to the likelihood, i.e.

$\mathrm{Pr}(\theta\mid x)=\frac{\mathrm{Pr}(x\mid\theta)\,\mathrm{Pr}(\theta)}{\mathrm{Pr}(x)}\propto\mathrm{Pr}(x\mid\theta)$

And we always take the mode of this (the value of $\theta$ that maximises $\mathrm{Pr}(\theta\mid x)$) to be the estimate of the parameter. This is just the value of $\theta$ that maximises $\mathrm{Pr}(x\mid\theta)$ (for the observed data values $x$), which is called the likelihood function.

I.e. we're saying we have no preconceptions about the theory, and the right theory is just the one that gives the observed data the highest probability (or probability density).
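As a quick check of this claim, here is a small sketch using a binomial coin-flip model, which is an assumption chosen for illustration (the article does not fix a model). It evaluates the likelihood on a grid, normalises it into a posterior under a uniform prior, and confirms that both are maximised at the same value of theta.

```python
from math import comb


def likelihood(theta, heads, flips):
    """Pr(x | theta) for `heads` heads in `flips` flips (binomial model)."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)


heads, flips = 7, 10                          # hypothetical observed data
grid = [i / 1000 for i in range(1001)]        # candidate values of theta

lik = [likelihood(t, heads, flips) for t in grid]

# Under a uniform prior on [0, 1], Pr(x) is the average likelihood over
# the grid (a Riemann approximation), and the posterior density is the
# likelihood divided by that constant.
evidence = sum(lik) / len(grid)
post = [l / evidence for l in lik]

theta_mle = grid[max(range(len(grid)), key=lik.__getitem__)]
theta_map = grid[max(range(len(grid)), key=post.__getitem__)]
print(theta_mle, theta_map)
```

Dividing by the constant `evidence` rescales the curve but cannot move its peak, which is why the posterior mode and the maximum likelihood estimate coincide here.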

Note that even in maximum likelihood estimation, you kinda do have a non-uniform prior, in the sense that the prior is only uniform on its support, and you give zero values to things outside the domain of the parameter, or to theories other than the family being tested, etc. So if you were just taught maximum likelihood estimation, it would be easy for you to come up with the idea of Bayesian inference.
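The effect of the support can be seen in a small sketch, again assuming a coin-flip model that is not in the article. The hypothetical helper `mode_on_support` maximises the likelihood only over an interval `[lo, hi]`, which is the same as taking the posterior mode under a prior that is uniform on that interval and zero elsewhere.

```python
def likelihood(theta, heads, flips):
    """Likelihood of the data up to a constant factor (binomial model)."""
    return theta**heads * (1 - theta)**(flips - heads)


def mode_on_support(heads, flips, lo, hi, steps=1000):
    """Posterior mode for a prior uniform on [lo, hi] and zero elsewhere."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: likelihood(t, heads, flips))


# Same hypothetical data (7 heads in 10 flips), two different supports.
full = mode_on_support(7, 10, 0.0, 1.0)       # any bias allowed
clipped = mode_on_support(7, 10, 0.0, 0.5)    # biases above 0.5 ruled out a priori
print(full, clipped)
```

Restricting the support changes the estimate even though the prior is "uniform" in both cases, so the choice of support is itself prior information.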

Suppose you have a prior distribution given by $\rho(\theta)$ and some specific value $\theta_0$, which, regardless of the actual value of the prior distribution, you are "decisively" biased towards for some reason (e.g. you're looking at a suspect's guilt and you'll act on the assumption of innocence as long as the probability of his innocence is at least 1%). You may also be happy to consider a small neighbourhood around $\theta_0$ as "effectively" the same as $\theta_0$ for any practical purposes you may have (e.g. no coin is perfectly fair, but most coins are pretty fair).
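One way to read this setup is sketched below, under several assumptions that are not in the article: a coin-flip model, a grid approximation of the posterior, a neighbourhood half-width `eps`, and a decision rule that keeps acting on $\theta_0$ while the posterior mass of its neighbourhood stays above a threshold (1%, borrowed from the innocence example). The function name `posterior_mass_near` is hypothetical.

```python
def posterior_mass_near(theta0, eps, heads, tails, prior_density, steps=1000):
    """Posterior probability that theta lies within eps of theta0.

    Grid approximation for a coin-flip model; prior_density plays the
    role of rho(theta) and need not be normalised.
    """
    grid = [i / steps for i in range(steps + 1)]
    weights = [prior_density(t) * t**heads * (1 - t)**tails for t in grid]
    total = sum(weights)
    near = sum(w for t, w in zip(grid, weights) if abs(t - theta0) <= eps)
    return near / total


# Hypothetical data: 7 heads, 3 tails; theta0 = 0.5 is the "fair coin" value.
mass = posterior_mass_near(
    theta0=0.5, eps=0.05, heads=7, tails=3,
    prior_density=lambda t: 1.0,   # assumed uniform rho(theta) on [0, 1]
)

THRESHOLD = 0.01                   # assumed, from the 1% example above
keep_theta0 = mass >= THRESHOLD    # keep acting as if theta = theta0?
print(mass, keep_theta0)
```

Both `eps` and `THRESHOLD` encode the "decisive" bias: the data only need to leave a small amount of posterior mass near $\theta_0$ for it to remain the working assumption.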
