Sufficient statistics and the Rao-Blackwell theorem

The motivation for the definition of a sufficient statistic is very obvious from a Bayesian standpoint -- we call a statistic $T(X)$ sufficient for $\theta$ if it carries all the information that $X$ has to offer on $\theta$, i.e. if

$$P(\theta|T\land X)=P(\theta|T)$$

(Or equivalently, since $T$ is a function of $X$, $P(\theta|T)=P(\theta|X)$.) I.e. $\theta$ and $X$ are conditionally independent given $T$. An equivalent, more common formulation is that $P(X|T)$ does not depend on $\theta$:

$$P(X|T\land\theta)=P(X|T)$$

This captures the intuitive idea that $\theta$ is causally linked to $X$ only through $T$ -- from this perspective, the two formulations are precisely symmetric. 

One can check whether a statistic is sufficient simply by looking at the distribution of $X$. We can write $P(X|\theta)=P(X|T(X)\land\theta)\,P(T(X)|\theta)$ for any statistic $T(X)$ -- thus for $T(X)$ to be sufficient, i.e. for $P(X|T(X)\land\theta)$ to reduce to $P(X|T(X))$ with no dependence on $\theta$, we only require that a factorization of the following form exists:

$$f(X|\theta)=h(X)g(T(X),\theta)$$

I.e. the distribution function can be written as the product of a function that doesn't depend on $\theta$ and a function that depends on $\theta$ and $T(X)$ but not $X$ directly. This is called the factorization criterion.
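
As a standard illustration (the Bernoulli model here is just a convenient choice, not tied to anything above): take $X=(x_1,\dots,x_n)$ to be $n$ independent Bernoulli($\theta$) observations. Then

$$f(X|\theta)=\prod_{i=1}^n\theta^{x_i}(1-\theta)^{1-x_i}=\underbrace{1}_{h(X)}\cdot\underbrace{\theta^{T(X)}(1-\theta)^{n-T(X)}}_{g(T(X),\theta)},\qquad T(X)=\sum_{i=1}^n x_i,$$

so the number of successes $T(X)$ is sufficient for $\theta$.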

The interpretation of this theorem is that when doing Bayesian inference, two sets of data yielding the same value of $T(X)$ should yield the same inference about $\theta$ -- for this to be possible, the likelihoods' dependence on $\theta$ should only be in conjunction with $T(X)$ so that the direct dependence of the posterior on $X$ cancels out through the normalization factor.
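
Here is a minimal numerical sketch of that claim (the Bernoulli model, flat prior and grid discretization are all illustrative choices, not part of the original argument): two samples sharing the same value of $T(X)=\sum_i x_i$ produce identical posteriors.

```python
import numpy as np

# Two different Bernoulli samples sharing the same sufficient statistic T(X) = sum(X)
x1 = np.array([1, 1, 1, 0, 0])   # T = 3
x2 = np.array([0, 1, 0, 1, 1])   # T = 3

theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter
prior = np.ones_like(theta)              # flat prior, purely for illustration

def posterior(x):
    # Likelihood depends on x only through sum(x) and len(x)
    lik = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())
    post = prior * lik
    return post / post.sum()             # normalize on the grid

print(np.allclose(posterior(x1), posterior(x2)))   # True: same T(X) => same posterior
```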

We are further interested in sufficient statistics that are "minimal" in the sense that they carry as little superfluous information as possible while remaining sufficient -- "necessary and sufficient statistics", if you will.

Can we define "necessary and sufficient statistics" as statistics that give no information other than that about $\theta$, i.e. such that knowing $T$ gives us no additional information about $X$ beyond what $\theta$ already gives? Consider whether basic examples of minimal sufficient statistics have this property, and try to find a simple counter-example.

We thus define a minimal sufficient statistic as a sufficient statistic that can be written as a function of any other sufficient statistic. So, for example, the sample mean may be a minimal sufficient statistic for the mean (it is, e.g., for a normal model with known variance), while the entire sample, though sufficient, is not minimal, because it cannot be recovered from the sample mean alone.

A particular MSS of importance comes from the likelihood-ratio equivalence relation: one can partition the space of samples by declaring $X\sim Y$ whenever $P(X|\theta)/P(Y|\theta)$ is independent of $\theta$ -- the resulting quotient map (which sends $X$ to its equivalence class) is then a minimal sufficient statistic for $\theta$.

As all MSSs can be recovered from each other, this is often used as a characterization of MSSs: $T$ is an MSS iff: 

$$T(X)=T(Y)\iff \frac{P(X|\theta)}{P(Y|\theta)}\text{ independent of }\theta$$

E.g. if two samples have the same sample mean, then any difference in their likelihoods comes from factors other than the distribution mean, i.e. the likelihood ratio does not depend on the mean parameter.
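
To spell this out in the standard normal case (taking known unit variance purely for illustration): for two samples $X$ and $Y$ of size $n$,

$$\frac{P(X|\mu)}{P(Y|\mu)}=\exp\!\Big(-\tfrac12\sum_i\big[(x_i-\mu)^2-(y_i-\mu)^2\big]\Big)=\exp\!\Big(n\mu(\bar x-\bar y)-\tfrac12\sum_i(x_i^2-y_i^2)\Big),$$

which is independent of $\mu$ exactly when $\bar x=\bar y$, so the sample mean is minimal sufficient for $\mu$.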



Rao-Blackwell theorem

If an estimator $\hat{\theta}$ of $\theta$ takes different values for different $X$ with the same value of sufficient statistic $T(X)$, then it stands to reason that this estimator is sub-optimal, in that irrelevant features of the data contribute to the variance in $\hat{\theta}$. So for any sample $X$, we might want to take the average of $\hat{\theta}$ among all samples with the same value of $T(X)$, which would be a new estimator $\tilde{\theta}$.

This is the content of the Rao-Blackwell theorem. Specifically, it says that the new estimator $\tilde{\theta}=E(\hat{\theta}|T(X))$ has the same bias as $\hat{\theta}$ and variance less than or equal to that of $\hat{\theta}$ (with equality if $\hat{\theta}$ was already a function of $T$). Note that sufficiency of $T$ is what guarantees that $E(\hat{\theta}|T(X))$ does not depend on $\theta$, so that $\tilde{\theta}$ is a genuine estimator. The proof follows straightforwardly from the conditional breakdown of variance (the law of total variance):

$$\mathrm{Var}(\hat{\theta})=\mathrm{E}\big(\mathrm{Var}(\hat{\theta}\,|\,T)\big)+\mathrm{Var}\big(\mathrm{E}(\hat{\theta}\,|\,T)\big)$$

The second term is exactly $\mathrm{Var}(\tilde{\theta})$ and the first is non-negative, so $\mathrm{Var}(\tilde{\theta})\le\mathrm{Var}(\hat{\theta})$; the bias claim is just the tower property, $\mathrm{E}(\tilde{\theta})=\mathrm{E}\big(\mathrm{E}(\hat{\theta}\,|\,T)\big)=\mathrm{E}(\hat{\theta})$.
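
As a concrete sanity check, here is a minimal simulation sketch (the Bernoulli model and the naive unbiased estimator $\hat{\theta}=x_1$ are chosen purely for illustration): conditioning on the sufficient statistic $T=\sum_i x_i$ gives $\tilde{\theta}=E(x_1|T)=T/n$ by symmetry, i.e. the sample mean, and the simulation confirms that the bias is unchanged while the variance drops.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 100_000

# reps independent Bernoulli(theta) samples of size n
samples = rng.binomial(1, theta, size=(reps, n))

# Naive unbiased estimator: just the first observation
theta_hat = samples[:, 0].astype(float)

# Rao-Blackwellized estimator: E(x_1 | T) = T/n, i.e. the sample mean
theta_tilde = samples.mean(axis=1)

print("bias:    ", theta_hat.mean() - theta, theta_tilde.mean() - theta)
print("variance:", theta_hat.var(), theta_tilde.var())
# Both biases are ~0; the variance drops from ~theta(1-theta) to ~theta(1-theta)/n
```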

Continuing this line of reasoning, we would like to average over the minimal sufficient statistic, to eliminate as much superfluous information as possible. Indeed, it is easy to show that if $T_1$ and $T_2=h(T_1)$ are both sufficient statistics, then averaging over $T_2$ gives an estimator with variance no greater than the one obtained by averaging over $T_1$, as the short argument below shows.
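
The argument is one line: since $T_2$ is a function of $T_1$, the tower property gives

$$\mathrm{E}(\hat{\theta}\,|\,T_2)=\mathrm{E}\big(\mathrm{E}(\hat{\theta}\,|\,T_1)\,\big|\,T_2\big),$$

i.e. the $T_2$-estimator is the Rao-Blackwellization of the $T_1$-estimator, so the theorem above applies directly. Conditioning on a minimal sufficient statistic therefore squeezes out as much extraneous variance as possible.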
