In the last article, we described two simplifications that let us model the relationships between random variables more sensibly than by trying to fit a complicated joint distribution. The first of these simplifications was *causal modeling*, which we discussed at some length.

In this section, we shall focus on the latter type of statistical model, where we do not try to model the explanatory variables themselves, but are only interested in modeling our dependent variable in terms of them. In particular, we shall discuss the special case of *normal linear models*, a toy set-up to start thinking about this sort of statistical modeling.

With $m$ response variables, this set-up is called the *general* or *multivariate* linear model, while the special case $m=1$ is what is usually referred to by the term *normal linear model*: $Y = X\beta + \epsilon$, where $X$ is the design matrix and the errors $\epsilon$ are independent normals with mean zero and variance $\sigma^2$. Here we have:

**Maximum Likelihood Estimator for $\beta$ given known variance**

- $\hat \beta = (X^TX)^{-1}X^TY$
- $\hat\beta$ is the same as the Least Squares Estimator, because $X(X^TX)^{-1}X^T$ is the symmetric projection matrix onto the image of $X$, so the vector of residuals $Y-\hat{Y}$ has minimal length (its squared norm is called the *residual sum of squares* $\mathrm{RSS}$).
- $\hat{\beta}$ is linear in $Y$ and unbiased.
- $\mathrm{cov}(\hat{\beta})=\sigma^2(X^TX)^{-1}$
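The properties above can be checked numerically. The following is a minimal sketch on simulated data (the dimensions, coefficients, and noise level are all hypothetical choices for illustration): it computes $\hat\beta = (X^TX)^{-1}X^TY$ directly and verifies that the residuals are orthogonal to the image of $X$, which is exactly the least-squares property.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: p = 50 observations, intercept plus 3 covariates.
p, sigma = 50, 0.5
X = np.column_stack([np.ones(p), rng.normal(size=(p, 3))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
Y = X @ beta_true + sigma * rng.normal(size=p)

# MLE / least squares estimator: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# H = X (X^T X)^{-1} X^T projects Y onto im(X), so the residuals
# Y - H Y are orthogonal to every column of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
residuals = Y - H @ Y
```

Solving the normal equations with `np.linalg.solve` rather than forming the inverse explicitly is the usual numerically safer choice.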

**Unbiased estimator for the variance $\sigma^2$**

The average squared residual is *itself* an estimator of variance (a distribution with higher variance is more likely to produce samples away from the mean), or at least it would be if we knew the population mean.

For example, with a single explanatory variable you cannot estimate the variance from just *two* data points, because a single line can fit both data points without error, unless the two data points are vertically collinear. In general, when you have $n$ explanatory variables, you can perfectly fit $n+1$ data points with zero error, unless the $(n+1)$-plane passing through them is vertical.

The generalization of this correction (the *Bessel correction*) to the case of a linear model is $\frac{\mathrm{RSS}}{p-r}$, where $p$ is the number of observations and $r$ is the rank of the design matrix. Indeed, one can check that this is an unbiased estimator for the variance of the error term.

**Confidence region for $\beta$**

**Confidence interval for predictions**

- Our estimator $\hat{\beta}$ is not an exact estimate of $\beta$.
- $Y$ has some variance $\sigma^2$ even after specifying $x$.

Thus the prediction for $Y$ is a *distribution* as such rather than a point value, so we instead want to construct a confidence interval for $Y$.
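The two sources of uncertainty above add in the prediction variance. Here is a minimal sketch, assuming for simplicity that $\sigma^2$ is known (so normal rather than $t$ quantiles apply); all the data and the query point $x_0$ are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Fit a normal linear model (sigma assumed known for simplicity).
p, sigma = 40, 1.0
X = np.column_stack([np.ones(p), rng.normal(size=p)])
Y = X @ np.array([2.0, 3.0]) + sigma * rng.normal(size=p)
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Prediction at a new point x0: the variance has two parts --
#   sigma^2 * x0^T (X^T X)^{-1} x0   (uncertainty in beta_hat), plus
#   sigma^2                          (irreducible noise in Y given x0).
x0 = np.array([1.0, 0.7])
pred = x0 @ beta_hat
pred_var = sigma**2 * (1.0 + x0 @ XtX_inv @ x0)

# Approximate 95% prediction interval via the normal quantile 1.96.
lo, hi = pred - 1.96 * np.sqrt(pred_var), pred + 1.96 * np.sqrt(pred_var)
```

With $\sigma^2$ unknown, one would plug in $\mathrm{RSS}/(p-r)$ and use $t$-distribution quantiles instead.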

**ANOVA**

Statistical modeling is closely linked to decision theory -- exogenous variables are those that we have "direct control" over, so the effects of a decision can be seen by suppressing the distribution of, and the internal correlations between, these controllable variables. The correlations that survive this "projection" are causations.

This problem is related to the problem of *explaining variance*. Sure, statistical modeling is more general -- the variability of a random variable involves more than just its variance -- but in the special case of a normal linear model, the distribution is summarized by the mean and the variance, and so explaining the variance in $Y$ through exogenous variables is equivalent to determining a statistical model.

The basic motivating fact behind ANOVA is the **law of total variance**: the variance in a dependent variable can be broken down, in a Pythagorean fashion, into the variance explained by the exogenous variables and the residual variance. This is a result of the orthogonality (uncorrelatedness) of the conditional mean $\mathrm{E}(Y\mid X)$ and the residual $Y-\mathrm{E}(Y\mid X)$.

$$\mathrm{Var}(Y)=\mathrm{Var}\left(\mathrm{E}\left(Y\mid X\right)\right)+\mathrm{E}\left(\mathrm{Var}\left(Y\mid X\right)\right)$$

This simplifies rather nicely in the case of a normal linear model, where errors are assumed to be independent of exogenous variables.
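The decomposition can be seen concretely on a toy simulated example (a hypothetical coin-flip covariate, chosen only for illustration); the sample version of the identity holds exactly, since group means make the within- and between-group pieces orthogonal:

```python
import numpy as np

rng = np.random.default_rng(3)

# X is a fair coin flip; Y | X is normal with mean 3X and variance 1.
n = 100_000
X = rng.integers(0, 2, size=n)
Y = 3.0 * X + rng.normal(size=n)

# Conditional means estimated from the sample (the "fitted values").
cond_mean = np.where(X == 1, Y[X == 1].mean(), Y[X == 0].mean())

explained = cond_mean.var()                   # Var(E(Y | X))
unexplained = ((Y - cond_mean) ** 2).mean()   # E(Var(Y | X))
# explained + unexplained recovers Var(Y) exactly (Pythagoras).
```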

A very simple application of ANOVA is in assessing the "importance" of a particular exogenous variable to $Y$, by looking at the fractions of variance explained by each exogenous variable. More generally, ANOVA can be used to test the validity of any sub-model -- if a particular factor doesn't explain much of the variance in a variable $Y$, it can probably be discarded while still retaining a suitable model.

Any linear model can be represented as $\mathrm{E}(Y)\in\mathrm{span}(X)$, representing the hypothesis that the mean of $Y$ lies in a plane, i.e. that the mean of $Y$ is a linear function of $X$. A sub-model $\mathrm{E}(Y)\in\mathrm{span}(X_0)$ (where $X_0$ is a submatrix of columns of $X$) represents the further hypothesis that $Y$ does not correlate with any of the variables in $X$ except those in $X_0$.

The fraction of variance in $Y$ unexplained by the sub-model is then a test statistic for the sub-model (the larger this fraction is, the less likely the sub-model is to be true), and its distribution can be calculated under the sub-model as a null hypothesis.

$$F=\frac{\mathrm{RSS}_0-\mathrm{RSS}}{\mathrm{RSS}}\cdot \frac{p-r}{r-r_0} \sim F_{r-r_0,p-r}$$
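Computing this $F$ statistic takes only a few lines. In the sketch below (a hypothetical nested pair: the full design $X$ adds one irrelevant column to $X_0$, and the data are generated from the sub-model), the statistic would then be compared against the $F_{r-r_0,\,p-r}$ distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical nested designs: X extends X0 by one extra column.
p = 60
X0 = np.column_stack([np.ones(p), rng.normal(size=p)])
X = np.column_stack([X0, rng.normal(size=p)])
# Data generated from the SUB-model: the extra column is irrelevant.
Y = X0 @ np.array([1.0, 2.0]) + rng.normal(size=p)

def rss(design, y):
    """Residual sum of squares after projecting y onto im(design)."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return np.sum((y - design @ coef) ** 2)

r0, r = np.linalg.matrix_rank(X0), np.linalg.matrix_rank(X)
F = (rss(X0, Y) - rss(X, Y)) / rss(X, Y) * (p - r) / (r - r0)
# Under the sub-model, F ~ F_{r - r0, p - r}; a large value rejects it.
```

Note that $\mathrm{RSS}_0 \ge \mathrm{RSS}$ always, since the sub-model projects onto a smaller subspace, so $F \ge 0$.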

**Model diagnostics**

Specifying a model, one can then do inference to determine the model parameters from the data. However, in the general paradigm of statistical inference, we don't know at all that the specified model is valid (in the full paradigm of Solomonoff Induction, the model is also inferred statistically). With simple data of low dimensionality, we often "eyeball" the data to set the model. Heuristics that help us decide on/evaluate a model are called *model diagnostics*.

We earlier discussed ANOVA, which is a diagnostic to evaluate sub-models of a normal linear model. A related approach to evaluate the suitability of a normal linear model (without reference to a super-model) is the **Coefficient of Determination** $R^2$, which is defined as the fraction of variance in the response variable explained by the model.

However, even an uncorrelated explanatory variable will spuriously explain some of the variance in $Y$, because the model will fit whatever insignificant sample correlation is observed -- in particular, if we have a full set of $p$ explanatory variables, where $p$ equals the number of data points $n$, then $Y$ is fitted exactly and $R^2$ is 1. This is not a result of the accuracy of the model -- the model isn't *predicting* anything; it simply lacks the freedom to equal anything but the observed data. So, analogous to our degree-of-freedom argument for scaling the sample variance by $1/(n-p)$, one may want to divide the residual and total sums of squares in $R^2$ by their respective degrees of freedom $n-p$ and $n-1$, obtaining unbiased estimates of the errors of each model (the *adjusted* $R^2$).
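The plain and degree-of-freedom-adjusted versions can be compared directly. In this hypothetical simulation, two of the four columns are pure noise, and the adjustment duly pulls the score down:

```python
import numpy as np

rng = np.random.default_rng(5)

# n = 50 data points; intercept plus 3 covariates, 2 of which are irrelevant.
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
p = X.shape[1]
Y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
rss = np.sum((Y - X @ coef) ** 2)
tss = np.sum((Y - Y.mean()) ** 2)

r2 = 1 - rss / tss
# Adjusted version: divide each sum of squares by its degrees of freedom.
r2_adj = 1 - (rss / (n - p)) / (tss / (n - 1))
```

Algebraically $1 - R^2_{\text{adj}} = (1-R^2)\frac{n-1}{n-p}$, so the adjusted value is always below the plain $R^2$ whenever $p > 1$.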

+leverages etc.

**Linear parameterization: PCA**
