### Exogenous variables and normal linear models

In the last article, we described two simplifications that allow us to more sensibly (than trying to fit a complicated joint distribution) deal with modeling the relationships between random variables. The first of these simplifications was causal modeling, which we discussed at some length.

In this section, we shall focus on the latter type of statistical model, where we do not try to model the explanatory variables themselves, but are only interested in modeling our dependent variable in terms of them. In particular, we shall discuss the special case of normal linear models, a toy set-up to start thinking about this sort of statistical modeling.

In the normal linear model: $X$ is a vector-valued ($\in \mathbb{R}^n$) random variable and $Y$ is a vector-valued ($\in\mathbb{R}^m$) random variable, and we write $$Y\sim N(\mu+\mathbf{B}X, \Sigma)$$ Where $\mu\in\mathbb{R}^m$ and $\mathbf{B}\in\mathbb{R}^{m\times n}$. For $p$ data points, we can write the model concisely as:

$$\mathbf{Y}=\mathbf{X}\mathbf{B}+\mathbf{E}$$

Where $\mathbf{Y}\in\mathbb{R}^{p\times m}$ are the observed values of $Y$, $\mathbf{X}\in\mathbb{R}^{p\times (n+1)}$ is the design matrix (the first column is a column of 1s), $\mathbf{B}\in\mathbb{R}^{(n+1)\times m}$ is the parameter matrix and $\mathbf{E}\in\mathbb{R}^{p\times m}$ is the error matrix, with each row being distributed $N(0, \Sigma)$.

$\mathbf{B}$ and $\Sigma$ are thus the parameters we want to estimate from data. In particular, we want to construct estimators and confidence regions for them.

The description above is usually called a general or multivariate linear model, while the special case $m=1$ is what is usually referred to by the term normal linear model. Here we have:

$$Y=X\beta+E$$
where $Y\in\mathbb{R}^p$, $X\in\mathbb{R}^{p\times(n+1)}$, $\beta\in\mathbb{R}^{n+1}$ and $E\in\mathbb{R}^p$ with each $E_i\sim N(0,\sigma^2)$.

Maximum Likelihood Estimator for $\beta$ given known variance

The square term in the exponent of the normal distribution closely links it to the square-minimization that you see so much in variance/error-related statistics. In particular, the MLE for the parameter $\beta$ in the normal linear model is the same as the Least Squares Estimator.
• $\hat \beta = {\left( {{X^T}X} \right)^{ - 1}}{X^T}Y$
• $\hat\beta$ is the same as the Least Squares Estimator, because $X\left( {{X^T}X} \right)^{ - 1}{X^T}$ is the symmetric projection matrix onto the image of $X$, so the vector of residuals $Y-\hat{Y}$ has minimal length (its norm is called the residual sum of squares $\mathrm{RSS}$).
• $\hat{\beta}$ is linear in $Y$ and unbiased.
• $\mathrm{cov}(\hat{\beta})=\sigma^2(X^TX)^{-1}$
Observe that $\hat\beta$ is only defined ("identifiable") when $X^T X$ is invertible, i.e. when $X$ is full column rank. When $X$ is not full column rank, the data points are "vertical", so there are multiple models that give rise to the same data.

Unbiased estimator for the variance $\sigma^2$

This is a generalization of the classic statistical problem of sample variance. In the simplest example, you have some random variable $X$ with some mean $\mu$ and variance $\sigma^2$, and you want to estimate $\sigma^2$ from an $n$-sample of $X$. The "obvious" answer -- and this is the Maximum Likelihood Estimator -- is just $\frac{1}{n}\sum{(x_i-\hat{\mu})^2}$.

But if you consider, e.g. the situation with only one data point (then the sample has zero variance), it becomes very clear that this underestimates the population variance. In fact, the variance of the population has two additive components: the variance within the sample itself, and the variance between the sample and the population mean. For example in the one-data-point case, the value of the data point is itself an estimator of variance (a distribution with higher variance is more likely to produce samples away from the mean), or at least it would be if we knew the population mean.

In the situation with one explanatory variable, the same result applies when you have two data points, because a single line can fit both data points without error, unless the two data points are vertically colinear. In general when you have $n$ explanatory variables, you can perfectly fit $n+1$ data points with zero error, unless the $(n+1)$-plane passing through them is vertical.

So the expected generalization of the $\frac{\mathrm{RSS}}{p-1}$ estimator (known as the Bessel correction) to the case of a linear model is $\frac{\mathrm{RSS}}{p-r}$ where $r$ is the rank of the design matrix. Indeed, one can check that this is an unbiased estimator for the variance of the error term.

In fact it turns out that $\frac{\mathrm{RSS}}{\sigma^2}\sim\chi_{p-r}^2$.

Confidence region for $\beta$

Let $c\in\mathbb{R}^{n+1}$ We have:

$$\frac{c^T\hat{\beta}-c^T\beta}{\mathrm{Var}(c^T\hat{\beta})}\sim N(0, 1)$$
$$\frac{c^T\hat{\beta}-c^T\beta}{\sqrt{{c}^T(X^TX)^{-1}{c}\sigma^2}}\sim N(0, 1)$$

Note that for $Z\sim N(0, 1)$ and $V \sim \chi_{p-k}^2$ independent, $Z/\sqrt{V/(p-k)}\sim t_{p-k}$. So:

$$\frac{c^T\hat{\beta}-c^T\beta}{\sqrt{c^T(X^TX)^{-1}c\cdot \frac{\mathrm{RSS}}{p-(n+1)}}}\sim t_{p-(n+1)}$$

Which can be rewritten as a confidence region for $\beta$:

$$\frac{(\hat{\beta}-\beta)^T X^T X (\hat\beta - \beta)}{\mathrm{RSS}}\frac{p-(n+1)}{n+1}\sim F_{n+1,p-(n+1)}$$

Confidence interval for predictions

When you use your linear model to make new predictions (i.e. estimate $Y$ for some $x$, as $\hat{Y}=\hat{\beta}^T x$), there are two sources of error, since $Y\sim N(\beta^T x, \epsilon^2)$:
• Our estimator $\hat{\beta}$ is not an exact estimate of $\beta$.
• $Y$ has some variance $\epsilon^2$ even after specifying $x$.
In principle, if we had a Bayesian model, we could calculate the exact distribution of $Y$ by compounding the distributions of $\hat{\beta}\mid\beta$ and $Y\mid \hat{\beta}$. In the frequentist setting, $\beta$ is not seen to have a distribution as such, so we instead want to construct a confidence interval for $Y$.

ANOVA

Statistical modeling is closely linked to decision theory -- exogenous variables are those that we have "direct control" over, so the effects of the decision can be seen suppressing the distribution of and the internal correlations between these controllable variables. These "projected" correlations are causations.

This problem is related to the problem of explaining variance. Sure, statistical modeling is more general -- the variability of a random variable has more to do with just its variance, but in the special case of a normal linear model, the distribution is summarized by the mean and the variance, and so explaining the variance in $Y$ through exogenous variables is equivalent to determining a statistical model.

The basic motivating fact behind ANOVA is the law of total variance: variance in a dependent variable can be broken down, in a Pythagorean fashion, into variances in the exogenous variables. This is a result of the independence of $Y\mid X$ from the residuals.

$$\mathrm{Var}(Y)=\mathrm{Var}\left(\mathrm{E}\left(Y\mid X\right)\right)+\mathrm{E}\left(\mathrm{Var}\left(Y\mid X\right)\right)$$

This simplifies rather nicely in the case of a normal linear model, where errors are assumed to be independent of exogenous variables.

A very simple application of ANOVA is in assessing the "importance" of a particular exogenous variable to $Y$, by looking at the fractions of variance explained by each exogenous variable. More generally, ANOVA can be used to test the validity of any sub-model -- if a particular factor doesn't explain much of the variance in a variable $Y$, it can probably be discarded while still retaining a suitable model.

Any linear model can be represented as $E(Y)=\mathrm{span}(X)$, representing the hypothesis that the mean of $Y$ is a plane/that the mean of $Y$ is a linear function of $X$. A submodel $E(Y)=\mathrm{span}(X_0)$ (where $X_0$ is a submatrix of columns in $X$) represents the further hypothesis that $Y$ does not correlate with any of the variables in $X$ except those in $X_0$.

The fraction of variance in $Y$ unexplained by the sub-model is then a test statistic for the sub-model (the larger this fraction is, the less likely the sub-model is to be true), and its distribution can be calculated under the sub-model as a null hypothesis.

$$F=\frac{\mathrm{RSS}_0-\mathrm{RSS}}{\mathrm{RSS}}\cdot \frac{p-r}{r-r_0} \sim F_{r-r_0,p-r}$$

Model diagnostics

Specifying a model, one can then do inference to determine the model parameters from the data. However, in the general paradigm of statistical inference, we don't know at all that the specified model is valid (in the full paradigm of Solmonoff Induction, the model is also inferred statistically). With simple data of low dimensionality, we often "eyeball" the data to set the model. Heuristics that help us decide on/evaluate a model are called model diagnostics.

We earlier discussed ANOVA, which is a diagnostic to evaluate sub-models of a normal linear model. A related approach to evaluate the suitability of a normal linear model (without reference to a super-model) is the Coefficient of Determination $R^2$, which is defined as the fraction of variance in the response variable explained  by the model.

Well, even an uncorrelated explanatory variable will spuriously explain some of the variance in $Y$ because the model will fit to whatever insignificant sample correlation is observed -- in particular if we have a full set of $p$ explanatory variables (where $p$ is the number of data points), $Y$ is fitted exactly and $R^2$ is 1. This is not a result of the accuracy of the model -- the model isn't predicting anything, it simply lacks the freedom to equal anything but the observed data. So analogous to our degree-of-freedom argument for scaling by $1/(n-p)$ for the sample variance, one may want to divide the numerator and denominator of $R^2$ by the degrees of freedom $n-p$ and $n-1$ to obtain unbiased estimates of the errors of each model.

+leverages etc.

Linear parameterization: PCA