### Covariance matrix and Mahalanobis distance

In the article Random variables as vectors, we discussed how random variables can be viewed as vectors, with covariance as their dot product.

The basic motivation for this idea came from contrasting $\mathrm{Var}(X+X)=4\mathrm{Var}(X)$ with the formula for variables with zero covariance, $\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)$. These correspond to the geometric cases of adding two parallel and two perpendicular vectors -- the general case of vector addition is expressed through the cosine rule. So what's the "cosine rule for random variables"?

Well, it's $\mathrm{Var}(X+Y)=\mathrm{Var}(X)+\mathrm{Var}(Y)+2\mathrm{Cov}(X,Y)$. To me, this -- like the dot product form of the cosine rule -- is highly suggestive of a bilinear form: specifically, the Gram matrix, called the covariance matrix, of the random vector $\mathbf{X}=\left[\begin{array}{c}X\\Y\end{array}\right]$ (which is really to be seen as a "matrix", because each random variable is itself to be understood as a row vector).
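Both the parallel case and the general "cosine rule" can be sanity-checked numerically; a minimal sketch with NumPy (the samples, seed, and the 0.6 coupling are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated samples standing in for the random variables X and Y.
x = rng.normal(size=100_000)
y = 0.6 * x + rng.normal(size=100_000)
cov_xy = np.cov(x, y, bias=True)[0, 1]  # bias=True matches np.var's 1/n convention

# Parallel case: Var(X + X) = 4 Var(X).
print(np.isclose(np.var(x + x), 4 * np.var(x)))

# General "cosine rule": Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).
print(np.isclose(np.var(x + y), np.var(x) + np.var(y) + 2 * cov_xy))
```

Both identities hold exactly (up to floating point) for any sample, as long as the variance and covariance estimators use the same normalisation.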

$$\Sigma ({X_1}, \ldots ,{X_n}) = \left[ \mathrm{Cov}({X_i},{X_j}) \right]_{ij}$$

One may compare this Gram matrix interpretation -- $\Sigma=\mathbf{X}\mathbf{X}^T$ (note: not $\mathbf{X}^T\mathbf{X}$, given the way we've defined $\mathbf{X}$ -- this is important!) -- to the one-dimensional variance formula $\sigma^2=XX^T$, and realise that the covariance matrix is the "right" measure of the variance of a random vector (note how, if we had made random variables column vectors instead, this would all become $X^TX$, etc.).
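The Gram matrix reading can be checked directly: stacking (centered) samples of the random variables as rows of a matrix, $\mathbf{X}\mathbf{X}^T/n$ reproduces the covariance matrix. A sketch with NumPy (the sample size and mixing matrix are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Rows of X are the random variables, realised as long sample rows.
X = np.array([[2.0, 0.0], [1.0, 1.0]]) @ rng.normal(size=(2, n))  # correlated pair
X = X - X.mean(axis=1, keepdims=True)  # "subtract the mean"

sigma_gram = X @ X.T / n          # Gram matrix of the row vectors
sigma_np = np.cov(X, bias=True)   # NumPy's covariance matrix (rows = variables)
print(np.allclose(sigma_gram, sigma_np))
```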

(yeah, yeah, you need to subtract the mean, etc.)

Analogously, one may define a cross-covariance matrix $K_{\mathbf{X}\mathbf{Y}}=\mathrm{E}((\mathbf{X}-\mu_{\mathbf{X}})(\mathbf{Y}-\mu_{\mathbf{Y}})^T)$ measuring the covariance between two random vectors.
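The cross-covariance matrix can likewise be estimated from samples. A minimal sketch: the mixing matrix $A$ below is an arbitrary choice, and since $\mathbf{Y}=A\mathbf{X}+\text{noise}$ with $\mathbf{X}$ having identity covariance, the estimate should come out close to $A^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two random vectors; rows are components, columns are samples.
A = np.array([[1.0, 2.0], [0.0, 1.0]])
X = rng.normal(size=(2, n))            # identity covariance
Y = A @ X + rng.normal(size=(2, n))    # correlated with X through A

Xc = X - X.mean(axis=1, keepdims=True)
Yc = Y - Y.mean(axis=1, keepdims=True)
K_xy = Xc @ Yc.T / n  # sample estimate of E[(X - mu_X)(Y - mu_Y)^T]

print(np.allclose(K_xy, A.T, atol=0.05))  # should be close to A^T here
```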

Since this is a bilinear form, it is rather natural to relate it to some notion of distance -- the standard deviation, after all, can be seen as a "natural distance unit" in one dimension (in the sense that the "unlikeliness" of a data point depends on its distance from the mean measured in units of standard deviation).

Suppose we wish to find the variance along some direction, i.e. the variance of the random variable $u_1X+u_2Y=\hat{u}^T\mathbf{X}$ with $|\hat{u}|=1$ -- this is clearly just $\hat{u}^T\Sigma\hat{u}$. So this defines a natural distance scale in the direction of $\hat{u}$, suggesting we define the squared norm of a vector $\vec{v}$ by rescaling accordingly:

$$\|\vec{v}\|^2=\frac{\vec{v}^T\vec{v}}{\hat{v}^T\Sigma\hat{v}}$$

Along the principal axes of the distribution (the eigenvectors of $\Sigma$), this agrees with the quadratic form below, which -- being genuinely bilinear -- is the natural way to extend the rescaling to all directions at once:

$$\|\vec{v}\|^2=\vec{v}^T\Sigma^{-1}\vec{v}$$

Another way to interpret this: the whitening map $\Sigma^{-1/2}$ transforms the distribution into a spherical one (one with identity covariance matrix), and this is just the squared norm of the data point in that spherical distribution -- i.e. the norm adjusted for variances and covariances. The square root of this quantity is called the Mahalanobis distance.
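Both the directional-variance claim and the sphering interpretation can be sketched numerically, computing $\Sigma^{-1/2}$ from an eigendecomposition (the data, the test direction, and the test point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Correlated 2-D data; rows are the random variables.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
X = A @ rng.normal(size=(2, n))
X = X - X.mean(axis=1, keepdims=True)
sigma = np.cov(X, bias=True)

# Variance along a unit direction u is u^T Sigma u.
u = np.array([3.0, 4.0]) / 5.0
print(np.isclose(np.var(u @ X), u @ sigma @ u))

# Whitening map Sigma^{-1/2}, built from the eigendecomposition of Sigma.
vals, vecs = np.linalg.eigh(sigma)
W = vecs @ np.diag(vals ** -0.5) @ vecs.T

# Whitened data has identity covariance...
print(np.allclose(np.cov(W @ X, bias=True), np.eye(2)))

# ...and the Euclidean norm of a whitened point equals its Mahalanobis distance.
v = np.array([1.0, 2.0])
d_mahal = np.sqrt(v @ np.linalg.inv(sigma) @ v)
print(np.isclose(np.linalg.norm(W @ v), d_mahal))
```

The last check is exactly the statement $\|\Sigma^{-1/2}\vec{v}\|^2=\vec{v}^T\Sigma^{-1}\vec{v}$.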