### Statistical models; causal models

A statistical model in the most general sense is a proposed distribution for some random variable. In the simplest case, one can have a univariate distribution, e.g. $X\sim N(\mu, \sigma^2)$ and experiment to get posterior distributions for the values of $\mu$, $\sigma$.

More generally, the random variable in question may be correlated with some other random variable. The most general way to express such a model is via the joint distribution of all the variables, but the full joint distribution might be too expensive to reasonably test -- a classic example is that of a causal relationship.

However, there are various special cases of this general description that naturally capture the assumptions that one would make while attempting to model these variables:

• Hierarchical/Causal model: We arrange the random variables in a tree-like format, with each parent's distribution taking the value of its child random variables as an explanatory variable -- the distribution of each node is marginalized against the distribution of its parents so that it only depends on its own children. E.g. $Y\sim N(\mu_{0Y} + \mu_{1Y} X, 1)$, $X\sim N(\mu_X, 1)$.
• Exogenous variables: We don't bother about the distribution of the explanatory variables, treating them as "exogenous variables". In the context of decision theory, these should be seen as variables which we can manipulate at will -- e.g. if we're discussing the effects of taxation on GDP growth, then the tax rate should be seen as an exogenous variable (even though it might be true that the tax rate is actually influenced by voter priorities and can be predicted in such a way -- that's just not the question we're interested in addressing). Essentially, we are only interested in modeling the conditional distribution of the dependent variable given the explanatory variables.
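As a minimal sketch of the hierarchical example above, one can sample $X$ first and then $Y$ given $X$, and check that a least-squares fit of $Y$ on $X$ recovers the conditional-mean coefficients. The parameter values (`mu_x=2.0`, `mu_0y=1.0`, `mu_1y=0.5`) and the helper names are assumptions chosen purely for illustration.

```python
import random

def sample_hierarchical(n, mu_x=2.0, mu_0y=1.0, mu_1y=0.5, seed=0):
    """Draw n pairs: X ~ N(mu_x, 1), then Y | X ~ N(mu_0y + mu_1y * X, 1)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(mu_x, 1.0)                # explanatory variable
        y = rng.gauss(mu_0y + mu_1y * x, 1.0)   # its mean depends on x
        xs.append(x)
        ys.append(y)
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

xs, ys = sample_hierarchical(20000)
intercept, slope = fit_line(xs, ys)
print(intercept, slope)  # estimates of mu_0y and mu_1y
```

Note that the fit only uses the conditional distribution of $Y$ given $X$; the distribution chosen for $X$ affects the precision of the estimates but not what they estimate, which is the sense in which $X$ can be treated as exogenous.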

#### Causation

Causation is different from correlation, and is closely related to partial correlation.

To illustrate what we mean, consider this highly publicized study (reported e.g. here) arguing that higher taxes benefit the economy, because the U.S. economy grew slightly faster in the 1960s when taxes were higher. This study had a host of other problems (see the p-hacking article, where I point out that doing an experimental study on such noisy data is meaningless in the absence of a solid theoretical backing for your claim), but the main problem was that there are a lot of differences between the 1960s and now apart from tax rates, e.g. regulations were fewer in the 1960s, the kinds of industries and their accounting in GDP were different in the 1960s.

The fundamental problem is this: the question you want to answer is "what will the effect of increasing taxes be on economic growth?" Since it is unlikely that increasing taxes will reduce regulations or change accounting methods, we want to remove the contributions of these other explanatory variables from the correlation between tax rates and economic growth.

(The obvious solution would be to take cross-sectional data (comparing the U.S. to countries that had different time trends in their tax rates) instead of longitudinal data -- this is an example of randomized interventions, or reasoning on the joint probability distribution.)

This is the idea behind partial correlations. To study the causal link between a dependent variable $Y$ and an independent variable $X_i$ means to study how $Y$ varies with $X_i$ once the other variables are held fixed, i.e. in the conditional distribution given $X_{j\ne i}$.

Notice how this definition completely depends on the random variables we have chosen to control. For example, if we had also controlled for some economic variables that taxation affects GDP growth "through", the effect of changing taxation would perhaps be smaller. The choice of these variables depends on our purpose -- i.e. based on what we're actually able to control.

Apply this reasoning to the context of a classic example: since wind speed correlates with the rotation of a windmill, does this mean the windmill rotation affects wind speed? What are we really asking here? What are our variables (you should have at least 3 of them)?

We can also immediately carry over constructions from correlation to causation. We define the partial correlation of $Y$ and $X_i$ given $X_{j\ne i}$ as the correlation between $Y$ and $X_i$ computed from their conditional distribution given $X_{j\ne i}$. Geometrically, it may be interpreted as the $\cos\theta$ between the projections of $Y$ and $X_i$ (as vectors) onto the orthogonal complement of the span of the controlled variables (therefore eliminating the correlations $Y$ and $X_i$ have with them). The two descriptions agree when the variables are jointly Gaussian.
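The geometric description can be sketched directly: regress both $Y$ and $X$ on the controlled variable, keep the residuals (the components orthogonal to the control), and correlate them. The example below assumes a single control `z` that drives both `x` and `y`, with no direct link between `x` and `y`; the data-generating coefficients and function names are illustrative assumptions.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def residuals(v, z):
    """Component of v orthogonal to span{1, z} (least-squares residuals)."""
    mv, mz = mean(v), mean(z)
    beta = (sum((a - mv) * (b - mz) for a, b in zip(v, z))
            / sum((b - mz) ** 2 for b in z))
    return [(a - mv) - beta * (b - mz) for a, b in zip(v, z)]

def corr(u, v):
    """Pearson correlation, i.e. cos(theta) between the centered vectors."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u)
                    * sum((b - mv) ** 2 for b in v))
    return num / den

def partial_corr(y, x, z):
    """Correlation of y and x after projecting out the control z."""
    return corr(residuals(y, z), residuals(x, z))

rng = random.Random(1)
z = [rng.gauss(0, 1) for _ in range(20000)]   # common driver (controlled)
x = [zi + rng.gauss(0, 1) for zi in z]        # depends on z only
y = [2 * zi + rng.gauss(0, 1) for zi in z]    # depends on z only

raw = corr(x, y)
partial = partial_corr(y, x, z)
print(raw, partial)
```

Here the raw correlation between `x` and `y` is substantial, while the partial correlation is close to zero, since everything the two share comes through `z`. With several controls, the same idea applies with a multiple regression in place of the single-variable projection.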

#### Causal models

It is important to note that the existence of partial correlation is not actually equivalent to causation either. Causation is an underlying explanation -- it is represented in the form of causal networks, which are directed acyclic graphs that are models of the underlying behavior of the system. It's much like how quarks are an underlying model of the observed particle zoo -- the quarks themselves cannot be observed, but they form a framework on which predictions can be made.

So a causal model might look like this, with edges $T\to X_1$, $T\to X_2$, $X_1\to Y$ and $X_2\to Y$:

With distributions $f_T$, $f_{X_1\mid T}$, $f_{X_2\mid T}$, $f_{Y\mid X_1, X_2}$.

$f_{T,X_1,X_2,Y}(t,x_1,x_2,y) = f_{Y\mid X_1,X_2}(y\mid x_1,x_2)\,f_{X_1\mid T}(x_1\mid t)\,f_{X_2\mid T}(x_2\mid t)\,f_T(t)$

So the idea is that multiple different causal models may be equivalent in the sense that they lead to the same joint distribution -- saying that something causes another thing is not an absolute truth, but simply a model that is consistent with the truth. If two models lead to an identical joint distribution, they are "equivalent" for all physical purposes.

E.g. consider the following models: $R\to B\to A$, $A\to B\to R$ and $A\leftarrow B\to R$.

Then the joint distribution corresponding to each causal model is:

$f_{ABR}(a,b,r) = f_{A\mid B}(a\mid b)\,f_{B\mid R}(b\mid r)\,f_R(r) = \frac{f(a,b)\,f(b,r)}{f(b)}$

$f_{ABR}(a,b,r) = f_{R\mid B}(r\mid b)\,f_{B\mid A}(b\mid a)\,f_A(a) = \frac{f(a,b)\,f(b,r)}{f(b)}$

$f_{ABR}(a,b,r) = f_{A\mid B}(a\mid b)\,f_{R\mid B}(r\mid b)\,f_B(b) = \frac{f(a,b)\,f(b,r)}{f(b)}$

All three reduce to the same expression, so the models are equivalent.
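This equivalence can be checked numerically on a small discrete example. The sketch below builds a joint distribution from the chain $R\to B\to A$ using made-up binary conditional tables (the numbers are assumptions, not taken from any data), then re-factorizes that joint in the other two orders and confirms that all three products agree.

```python
from itertools import product

# Hypothetical conditional tables for binary variables (chain R -> B -> A).
f_R = {0: 0.3, 1: 0.7}
f_B_given_R = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}    # indexed [r][b]
f_A_given_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # indexed [b][a]

states = list(product((0, 1), repeat=3))  # tuples (a, b, r)

# R -> B -> A: f(a|b) f(b|r) f(r)
chain = {(a, b, r): f_A_given_B[b][a] * f_B_given_R[r][b] * f_R[r]
         for a, b, r in states}

def marginal(joint, keep):
    """Sum the joint over all positions not listed in keep."""
    out = {}
    for state, p in joint.items():
        key = tuple(state[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

f_a = marginal(chain, [0])
f_b = marginal(chain, [1])
f_ab = marginal(chain, [0, 1])
f_br = marginal(chain, [1, 2])

# A -> B -> R: f(r|b) f(b|a) f(a), with conditionals derived from the joint
reversed_chain = {
    (a, b, r): (f_br[(b, r)] / f_b[(b,)]) * (f_ab[(a, b)] / f_a[(a,)]) * f_a[(a,)]
    for a, b, r in states
}

# A <- B -> R: f(a|b) f(r|b) f(b)
fork = {
    (a, b, r): (f_ab[(a, b)] / f_b[(b,)]) * (f_br[(b, r)] / f_b[(b,)]) * f_b[(b,)]
    for a, b, r in states
}

max_gap = max(max(abs(chain[s] - reversed_chain[s]), abs(chain[s] - fork[s]))
              for s in states)
print(max_gap)  # differences are at floating-point rounding level
```

The check works because the chain's joint satisfies "$A$ is independent of $R$ given $B$", which is the only constraint any of the three factorizations imposes.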

On the other hand, consider the following three models ($R\to B\to A$, $A\to B\leftarrow R$ and $R\to A\to B$):

$f_{ABR}(a,b,r) = f_{A\mid B}(a\mid b)\,f_{B\mid R}(b\mid r)\,f_R(r) = \frac{f(a,b)\,f(b,r)}{f(b)}$

$f_{ABR}(a,b,r) = f_{B\mid A,R}(b\mid a,r)\,f_A(a)\,f_R(r)$

$f_{ABR}(a,b,r) = f_{B\mid A}(b\mid a)\,f_{A\mid R}(a\mid r)\,f_R(r) = \frac{f(a,b)\,f(a,r)}{f(a)}$

These are not equivalent: they impose different conditional independence constraints on the joint distribution, and can thus be distinguished by experiment.
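To see what distinguishes the second of these models, in which $A$ and $R$ are independent causes of $B$, here is a minimal sketch with hypothetical binary tables (all numbers are assumptions). It measures how far $A$ and $R$ are from independence, first marginally and then after conditioning on $B$.

```python
from itertools import product

# Hypothetical tables: A and R are independent causes of B.
f_A = {0: 0.5, 1: 0.5}
f_R = {0: 0.3, 1: 0.7}
f_B_given_AR = {                      # indexed [(a, r)][b]
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.05, 1: 0.95},
}

# Joint over (a, b, r): f(b | a, r) f(a) f(r)
collider = {(a, b, r): f_B_given_AR[(a, r)][b] * f_A[a] * f_R[r]
            for a, b, r in product((0, 1), repeat=3)}

def prob(joint, **fixed):
    """Probability that the named variables ('a', 'b', 'r') take the given values."""
    idx = {"a": 0, "b": 1, "r": 2}
    return sum(p for state, p in joint.items()
               if all(state[idx[name]] == val for name, val in fixed.items()))

def dependence(joint, given_b=None):
    """Largest |P(a, r | .) - P(a | .) P(r | .)|, optionally conditioning on B."""
    cond = {} if given_b is None else {"b": given_b}
    norm = prob(joint, **cond)
    gap = 0.0
    for a, r in product((0, 1), repeat=2):
        p_ar = prob(joint, a=a, r=r, **cond) / norm
        p_a = prob(joint, a=a, **cond) / norm
        p_r = prob(joint, r=r, **cond) / norm
        gap = max(gap, abs(p_ar - p_a * p_r))
    return gap

marginal_gap = dependence(collider)
conditional_gap = max(dependence(collider, given_b=b) for b in (0, 1))
print(marginal_gap, conditional_gap)
```

In this model $A$ and $R$ are independent until $B$ is observed, at which point they become dependent. The chain $R\to B\to A$ shows the opposite pattern (dependent marginally, independent given $B$), which is why data can tell the two apart.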

Because this equivalence is a property of graphs, we expect to be able to read off equivalences by simply glancing at the causal graphs. The key idea here is that a causal graph is defined by assumptions of conditional independence (because causation is about conditional independence), and so the equivalence of two graphs means that they have the same pattern of conditional independence. The criterion for reading off these patterns of conditional independence from the graph is known as d-separation.