The Winding Number: hypothesis testing

The basic idea behind a confidence region is this: given that the true value of some parameter is $\theta$, we may have some mechanism to sample "random regions" $R$ for $\theta$ such that 95% of these random regions contain $\theta$.

The first obvious issue is that this mechanism should not depend on $\theta$, as it is not known to us. We want a general experimental mechanism that, for any $\theta$, produces random regions of the same confidence level ("95%").

In some basic cases, this is easy: for example, suppose we have some $X\sim N(\mu, 1)$. Then for any $\mu$, 95% of intervals generated as $X\pm 1.96$ contain $\mu$.
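A quick way to convince yourself of this is to simulate it. The following is only a sketch: the function name `coverage`, the trial count and the particular values of $\mu$ are arbitrary choices made for illustration.

```python
import random

def coverage(mu, n_trials=20000, half_width=1.96, seed=0):
    """Fraction of intervals X ± half_width that contain mu, with X ~ N(mu, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x = rng.gauss(mu, 1.0)          # one draw of the data
        if x - half_width <= mu <= x + half_width:
            hits += 1                    # this random interval caught mu
    return hits / n_trials

# The same recipe is applied for every mu; none of them is "known" to it.
for mu in (-50.0, 0.0, 3.7):
    print(mu, coverage(mu))
```

Each printed fraction should land close to 0.95, whichever $\mu$ is used to generate the data.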

The key hint in the example above is that $\mu$ is a location parameter for $X$: the density of $X\mid\mu$ is a function of just $X-\mu$, so the distribution of $X-\mu$ does not depend on $\mu$, and is just $N(0,1)$. $X-\mu$ is what we call a pivotal quantity here.

In general, a pivotal quantity is a function $k(X,\theta)$ of the data and the true value of the parameter whose distribution is completely specified, i.e. does not depend on $\theta$. Then a confidence region for $k$ can hopefully be transformed back into a confidence region for $\theta$ at the same confidence level.
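As a worked illustration with a pivot that is not a simple difference, assume (this example is ours, not part of the discussion above) that $X$ is exponentially distributed with mean $\theta$. Then $k(X,\theta)=X/\theta$ is exponential with mean 1 whatever $\theta$ is, and an equal-tailed region for $k$ transforms back as follows:

$$\mathrm{Pr}\left(a\le\frac{X}{\theta}\le b\right)=0.95,\qquad a=-\ln 0.975\approx 0.0253,\qquad b=-\ln 0.025\approx 3.689$$
$$\iff\quad\mathrm{Pr}\left(\frac{X}{b}\le\theta\le\frac{X}{a}\right)=0.95$$

So the random interval $[X/b,\,X/a]$ is a 95% confidence region for $\theta$. The equal-tailed choice of $a$ and $b$ is just one option; any pair with the same total probability would do.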

OK, next question: what is the implied prior of confidence region calculations? I.e. under what prior can the confidence level be interpreted as the probability that the true value of the parameter is contained in the confidence region?

(For a general prior, such a region that gives you some probability of containing the true value of the parameter is called a credible region.)

Well, what exactly is the confidence level? It's the probability that a randomly generated region contains the true parameter value -- i.e. before you actually know what the random region is. Once you get the generated random region, this probability may change depending on the prior probability of the true parameter value being in this concrete region.

In other words, the implied prior is one such that $\theta$ has an equal prior probability of being in any possible confidence region. This is easy to calculate in some specific examples:

If $\theta$ is a location parameter for $X$, the implied prior on $\theta$ is uniform, $\propto 1$.
If $\theta$ is a scale parameter for $X$, the implied prior on $\theta$ is logarithmic, $\propto 1/\theta$.
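The location case can be checked directly in the normal example. Taking $X\sim N(\mu,1)$ and assuming the flat prior $\rho(\mu)\propto 1$, a short sketch of the calculation is:

$$\mathrm{Pr}(\mu\mid x)\propto\mathrm{Pr}(x\mid\mu)\cdot 1\propto e^{-(x-\mu)^2/2}\quad\Longrightarrow\quad\mu\mid x\sim N(x,1)$$
$$\Longrightarrow\quad\mathrm{Pr}\left(x-1.96\le\mu\le x+1.96\mid x\right)=0.95$$

That is, under the flat prior the interval $x\pm 1.96$ is also a 95% credible region, whatever value of $x$ was observed.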

The way that hypothesis testing is first introduced, one talks of things like "the probability of finding a value of $x$ at least as extreme as you did". And one sometimes chooses a "one-sided" hypothesis test and other times a "two-sided" hypothesis test. It should be clear that this isn't too fundamental a concept to be interested in.

Rather, one sensible, more generally appropriate way of thinking of hypothesis tests is in terms of confidence regions. Specifically: testing a null hypothesis is equivalent to asking if it is contained within the confidence region of our data.
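For the $X\sim N(\mu,1)$ example, this equivalence can be sketched in a few lines. The helper names below are hypothetical, and the symmetric interval is only one possible choice of region:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, the distribution of the pivot X - mu

def in_confidence_interval(x, mu0, level=0.95):
    """True if mu0 lies in the symmetric interval x ± z, for X ~ N(mu, 1)."""
    z = Z.inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    return abs(x - mu0) <= z

def two_sided_p_value(x, mu0):
    """Probability, if mu = mu0, of a value at least as far from mu0 as x."""
    return 2 * (1 - Z.cdf(abs(x - mu0)))

# "mu0 is inside the 95% interval" and "p-value is at least 0.05" agree.
for x in (0.3, 1.5, 2.5):
    print(x, in_confidence_interval(x, 0.0), round(two_sided_p_value(x, 0.0), 4))
```

The usual "at least as extreme" p-value is then just a by-product of having picked the symmetric interval.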

Obviously, this depends entirely on the shape we choose for our confidence region. We can always just choose a confidence region that includes or excludes our null hypothesis and maintain the same confidence level.

While it may be disappointing that there is no one way to construct a confidence region, this makes a great deal of sense. For example, consider the following multimodal distribution:

The sensible confidence region to construct would then be one that contains the bulk of both peaks. "Sensible" here means choosing the confidence region of least total length (you may observe that this is not reparameterization-invariant).
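To make "least length" concrete, here is a rough numerical sketch. The density is a hypothetical stand-in (an equal mixture of $N(-3,1)$ and $N(3,1)$), and the grid-based greedy selection is just one simple way to approximate the shortest 95% region:

```python
from statistics import NormalDist

LEFT, RIGHT = NormalDist(-3, 1), NormalDist(3, 1)

def bimodal_pdf(t):
    """Equal mixture of N(-3, 1) and N(3, 1): a hypothetical stand-in density."""
    return 0.5 * LEFT.pdf(t) + 0.5 * RIGHT.pdf(t)

def mixture_cdf(t):
    return 0.5 * LEFT.cdf(t) + 0.5 * RIGHT.cdf(t)

def shortest_region(pdf, lo, hi, level=0.95, n=4000):
    """Greedily keep the highest-density grid cells until they hold `level` mass.

    Returns the sorted centres of the kept cells and the cell width.
    """
    step = (hi - lo) / n
    centres = [lo + (i + 0.5) * step for i in range(n)]
    mass, kept = 0.0, []
    for c in sorted(centres, key=pdf, reverse=True):
        if mass >= level:
            break
        mass += pdf(c) * step
        kept.append(c)
    return sorted(kept), step

def quantile(p, lo=-10.0, hi=10.0):
    """Invert the mixture CDF by bisection."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if mixture_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

kept, step = shortest_region(bimodal_pdf, -10.0, 10.0)
print("shortest region, total length:", round(len(kept) * step, 2))
print("shortest region covers t = 0:", any(abs(c) < step for c in kept))
print("equal-tailed interval length:", round(quantile(0.975) - quantile(0.025), 2))
```

The shortest region comes out as two separate pieces, one around each peak, and is noticeably shorter than the single equal-tailed interval, which has to span the low-density valley in between.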

Different constructions of confidence regions are what give you things like two-tailed and one-tailed tests.
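As a small sketch of this for $X\sim N(\mu,1)$ (the function names are hypothetical), here are two regions with the same 95% confidence level that disagree about the same null hypothesis:

```python
from statistics import NormalDist

Z = NormalDist()  # distribution of the pivot X - mu

def two_sided_region(x, level=0.95):
    """[x - z, x + z]: contains mu whenever |X - mu| <= z."""
    z = Z.inv_cdf(0.5 + level / 2)   # about 1.96
    return (x - z, x + z)

def lower_bounded_region(x, level=0.95):
    """[x - z, +inf): contains mu whenever X - mu <= z."""
    z = Z.inv_cdf(level)             # about 1.645
    return (x - z, float("inf"))

def contains(region, mu0):
    lo, hi = region
    return lo <= mu0 <= hi

x, mu0 = 1.8, 0.0
print("two-sided keeps mu0:    ", contains(two_sided_region(x), mu0))
print("lower-bounded keeps mu0:", contains(lower_bounded_region(x), mu0))
```

Asking whether $\mu_0$ lies in the first region is the two-tailed test; asking whether it lies in the second is a one-tailed test that only rejects when $x$ is too large.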

Also read: Choosing the more likely hypothesis by Richard Startz

In the last article, we saw that Bayes's theorem allows us to infer theories from data -- or rather infer from data a distribution on the theory-space. This is the general way to make inferences and use these inferences to make predictions. We don't really need to bother about questions of whether we should choose the mean or median or mode as the theory we "choose", because we don't have to choose -- we make predictions with the entire distribution.

For example in the most general setting, we're trying to calculate our optimal action out of a set of choices based on a certain unknown utility function that depends on the right theory of physics (to calculate the consequences of the action). We don't need to "choose" the right theory of physics from the probability distribution we have on it -- we just make predictions for each possible theory in the support and integrate over the distribution to calculate the distribution on possible consequences. It's only at this point where we need to care about whether we want the action with the maximum mean, median, mode or whatever of the utility function.
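A toy version of this, with everything in it assumed for illustration (a three-point posterior over a coin's probability of heads, three actions, and the posterior mean of the utility as the criterion), might look like:

```python
def expected_utility(action, posterior, utility):
    """Average utility of `action` over every theory in the posterior's support."""
    return sum(p * utility(action, theta) for theta, p in posterior.items())

def best_action(actions, posterior, utility):
    """Action with the highest posterior-mean utility (one of several possible criteria)."""
    return max(actions, key=lambda a: expected_utility(a, posterior, utility))

# Hypothetical posterior over theta = Pr(heads); the weights sum to 1.
posterior = {0.3: 0.2, 0.5: 0.5, 0.8: 0.3}

def utility(action, theta):
    """Expected payoff of a unit bet under theory theta: win 1 if right, lose 1 if wrong."""
    if action == "bet_heads":
        return 2 * theta - 1
    if action == "bet_tails":
        return 1 - 2 * theta
    return 0.0  # "abstain"

actions = ["bet_heads", "bet_tails", "abstain"]
for a in actions:
    print(a, round(expected_utility(a, posterior, utility), 3))
print("chosen:", best_action(actions, posterior, utility))
```

Note that if we had first "chosen" the modal theory $\theta=0.5$, all three actions would look equally good; integrating over the whole distribution is what separates them.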

Can you show that Bayesian inference can always converge? I.e. that there is always some data collection mechanism that allows your distribution to converge to a Dirac delta at some point? Well, there is, and it's called "sampling". What if we can't sample directly but collect some other indirect data? Comment on the philosophical implications of such a theory-space (hint: think about a probabilistic generalisation of falsifiability (a criterion for a theory to be scientific)).
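For the direct-sampling case, a small simulation gives a feel for what convergence looks like. This is only a sketch under assumptions of our own choosing: the theory is a coin's probability of heads, the prior is uniform (so the posterior is a Beta distribution), and `true_theta` is a made-up value.

```python
import random

def beta_posterior(heads, flips):
    """Mean and standard deviation of Beta(1 + heads, 1 + tails),
    the posterior for a coin's bias under a uniform prior."""
    a, b = 1 + heads, 1 + flips - heads
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

rng = random.Random(0)
true_theta = 0.3  # hypothetical "true theory"
heads = 0
results = {}
for n in range(1, 10001):
    heads += rng.random() < true_theta   # one more sampled coin flip
    if n in (10, 100, 1000, 10000):
        results[n] = beta_posterior(heads, n)
        print(n, results[n])
```

The posterior standard deviation shrinks roughly like $1/\sqrt{n}$, so the distribution narrows towards a spike near the true value as samples accumulate.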

But some special cases of Bayesian inference -- with special choices of prior -- exist and are often discussed/used in the literature for various reasons, and in fact, they often come with mechanisms to choose specific estimates based on the posterior distribution, as well as specific projections of our data to use. We will discuss some of these examples below.

Perhaps the most straightforward example is maximum likelihood estimation. Here, the prior distribution is the uniform distribution, so the prior factor in Bayes's theorem is a constant (taken to be 1 here), i.e.

$$\mathrm{Pr}(\theta\mid x)=\frac{\mathrm{Pr}(x\mid\theta)}{\mathrm{Pr}(x)}$$
And we always take the mode of this (the value of $\theta$ that maximises $\mathrm{Pr}(\theta\mid x)$) to be the estimate of the parameter. This is just the value of $\theta$ that maximises $\mathrm{Pr}(x\mid\theta)$, which, viewed as a function of $\theta$ for the observed data values $x$, is called the likelihood function.

I.e. we're saying we have no preconceptions about the theory, and the right theory is just the one that gives the observed data the highest probability (or probability density).
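A minimal sketch of this for coin flips, assuming a made-up data set of 7 heads in 10 flips and a crude grid search in place of calculus:

```python
import math

def log_likelihood(theta, heads, flips):
    """log Pr(x | theta) for a coin, up to a constant.

    The binomial coefficient is dropped because it does not depend on theta
    and so does not move the maximum.
    """
    return heads * math.log(theta) + (flips - heads) * math.log(1 - theta)

def grid_mle(heads, flips, n=999):
    """Value of theta on the grid 0.001, 0.002, ..., 0.999 with the highest likelihood."""
    grid = [(i + 1) / (n + 1) for i in range(n)]
    return max(grid, key=lambda t: log_likelihood(t, heads, flips))

print(grid_mle(7, 10))
```

The maximiser is the observed frequency of heads, which is also what you get by setting the derivative of the log-likelihood to zero.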

Note that even in maximum likelihood estimation, you kinda do have a non-uniform prior, in the sense that the prior is only uniform on its support, and you give zero probability to things outside the domain of the parameter, or to theories other than the family being tested, etc. So if you were just taught maximum likelihood estimation, it would be easy for you to come up with the idea of Bayesian inference.

Suppose you have a prior distribution given by $\rho(\theta)$ and some specific value of $\theta_0$, which, regardless of the actual value of the prior distribution, you are "decisively" biased towards for some reason (e.g. you're looking at a suspect's guilt and you'll act on the assumption of innocence as long as the probability of his innocence is at least 1%). You may also be happy to consider a small neighbourhood around $\theta_0$ as "effectively" the same as $\theta_0$ for any practical purposes you may have (e.g. no coin is perfectly fair, but most coins are pretty fair).

Classical Statistics, Confidence Regions and Hypothesis tests

Uniform priors and Maximum Likelihood Estimation