### Causation, transfer and symmetry

In this article we attempt to build intuition for disentangled representations, and the closely related ideas of transfer learning, equivariant learning and causation.

• Transfer learning and causation
• Transfer learning and hierarchical models; counterfactual reasoning
• Out-of-distribution generalization and Independent Causal Mechanisms
• Disentangled representations and symmetry

Transfer learning and causation

Transfer learning applies when two models are not independent. An example we've previously discussed is semi-supervised learning, which analyzes the correlation between $P(X)$ and $P(Y\mid X)$ (viewed as random variables drawn from some distribution $\phi_{p_{x,y}}$ over possible distributions). As in that example, one way to think about "our belief about a distribution" is to consider that distribution as parameterized by (or conditioned on) some third variable $Z$ and talk about our beliefs about $Z$.

More generally, we may consider correlations between arbitrary such distributions. Saying that $P(Y\mid X)$ is independent of some $Z$ is equivalent to saying that $Y\perp\!\!\!\perp Z\mid X$ -- in a faithful causal model, this means that all causal paths between $Y$ and $Z$ are blocked by conditioning on $X$. Consequently, $Z$ carries information about $P(Y\mid X)$ only when some such path is left unblocked, and transfer learning is possible only if there exists some $Z$ that plays this role for both distributions. The following diagrams represent the only possible structures of this kind:
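Independently of the diagrams, the blocking criterion can be checked numerically. The sketch below is only an illustration under assumptions that are not in the article: linear-Gaussian variables, two hypothetical structures (a chain $Z \to X \to Y$, and one where $Z$ acts on $Y$ directly), and a made-up helper `z_coefficient`. In the linear-Gaussian case, $Y\perp\!\!\!\perp Z\mid X$ corresponds to a zero coefficient on $Z$ when regressing $Y$ on $X$ and $Z$.

```python
import numpy as np

def z_coefficient(x, y, z):
    """Least-squares coefficient on z when regressing y on (1, x, z)."""
    design = np.column_stack([np.ones_like(x), x, z])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[2]

rng = np.random.default_rng(0)
n = 50_000
z = rng.normal(size=n)

# Structure A (chain Z -> X -> Y): conditioning on X blocks the only path,
# so Z should carry no information about P(Y | X).
x_a = z + rng.normal(size=n)
y_a = 2.0 * x_a + rng.normal(size=n)

# Structure B (Z -> Y and X -> Y): Z acts on Y directly,
# so P(Y | X) changes with Z.
x_b = rng.normal(size=n)
y_b = 2.0 * x_b + 1.5 * z + rng.normal(size=n)

coef_chain = z_coefficient(x_a, y_a, z)
coef_direct = z_coefficient(x_b, y_b, z)
print(f"chain  Z -> X -> Y : coefficient on Z = {coef_chain:+.3f}")
print(f"direct Z -> Y      : coefficient on Z = {coef_direct:+.3f}")
```

The coefficient on $Z$ should come out near zero for the chain and near the simulated value of 1.5 for the direct structure. Only in the second case would knowing $Z$ help with learning $P(Y\mid X)$.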

Transfer learning and hierarchical models; counterfactual reasoning

The classic example of a hierarchical model goes something like this: you're trying to model the score of students based on a couple of parameters, and the students can be divided into a bunch of schools. While the school category is itself a variable with some value and can thus be considered a parameter, we often treat it differently in practice: we may, for example, be faced with an entirely new school we haven't heard of before, or with a school with few data points. Our predictions about such a school should then take into account what we know from the other schools.

A very simple hierarchical model might look like this:

-- i.e. instead of simply describing score as having a particular distribution $f_\mu(x)$, we recognize that $\mu$ is a random variable in itself and consider $f(x\mid\mu)$. The main idea of importance in a hierarchical model, however, has to do with the inference of $\mu$ for a particular sample. The full model for something like this looks like:

And we would know correlations between the distributions $P(\mu_1)$ and $P(\mu_2)$, representable by another random variable $\theta$ -- essentially our $\mu$s are sampled from some unknown distribution, and the identity of this distribution (e.g. a parameter) is $\theta$.
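To make the role of $\theta$ concrete, here is a minimal sketch of one common way to share information across schools. It is an illustration rather than the model from the article: it assumes a normal-normal hierarchy with known spreads `sigma` (within a school) and `tau` (between schools), and the function name `shrunk_school_means` and all numbers are made up.

```python
import numpy as np

def shrunk_school_means(scores_by_school, sigma, tau):
    """Empirical-Bayes estimates for a normal-normal hierarchical model.

    Assumed model (hypothetical, for illustration):
        mu_j ~ Normal(theta, tau^2)     # one mean per school
        x_ij ~ Normal(mu_j, sigma^2)    # scores within school j
    sigma and tau are taken as known; theta is estimated from all schools.
    """
    means = np.array([np.mean(s) for s in scores_by_school])
    counts = np.array([len(s) for s in scores_by_school])
    # Marginal variance of each school's sample mean around theta.
    marginal_var = tau**2 + sigma**2 / counts
    # Precision-weighted estimate of the shared parameter theta.
    theta_hat = np.sum(means / marginal_var) / np.sum(1.0 / marginal_var)
    # Weight on the school's own data; small schools get a small weight.
    weight = tau**2 / marginal_var
    posterior_means = weight * means + (1.0 - weight) * theta_hat
    return theta_hat, posterior_means

rng = np.random.default_rng(0)
theta, tau, sigma = 70.0, 5.0, 10.0
sizes = [200, 150, 3]  # the last school has very few data points
true_mu = rng.normal(theta, tau, size=len(sizes))
scores = [rng.normal(mu, sigma, size=n) for mu, n in zip(true_mu, sizes)]

theta_hat, posterior_means = shrunk_school_means(scores, sigma, tau)
for n, s, p in zip(sizes, scores, posterior_means):
    print(f"n={n:3d}  raw mean={np.mean(s):6.2f}  shrunk mean={p:6.2f}")
print(f"prediction for an unseen school: {theta_hat:.2f}")
```

The estimate for the school with few data points is pulled towards `theta_hat`, and an entirely new school is predicted at `theta_hat` itself. In this sketch, that is how knowledge of the other schools enters the prediction.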

This idea -- of "extending" samples as random variables in a causal diagram -- also matters in the context of counterfactuals. A classic counterfactual question looks like this:
X is elected president, and the GDP becomes \$1. What would the GDP of the country have been if Y were elected president instead?

Understanding the precise meaning of such a question (so as to formalize it as a causal diagram) requires us to think about the purpose of the answer to the question. For example, we might be interested in a future election, or the use of President X's vs President Y's policies in some other time or country, etc. In general, we are interested in an alternate world whose mechanisms are correlated with this one.

A particular observation of President-1 and GDP-1 gives us information on Mechanism, which gives us information to make inferences in the counterfactual world.

Relevant references: [1] [2]

Out-of-distribution generalization and Independent Causal Mechanisms

Ordinary machine learning involves some random variable $(X, Y)$ which we sample IID and whose underlying joint distribution we attempt to learn -- or some functional of the joint distribution, like $P(Y\mid X)$. Instead, we might be interested in sampling from several random variables $(X_i, Y_i)$ with different distributions and exploiting this full information to infer desired distributions. The problem of transferring information to some new distribution $(X_{n+1}, Y_{n+1})$ from which no data points have been sampled is called domain generalization, while combining this prior with information from the sample of the specific task $(X_n, Y_n)$ is called multi-task learning.

This general setting is exactly the same as that in the first section, of course. However, having prior information on the precise correlation between the distributions of the mechanisms $P(Y_1\mid X_1)$ and $P(Y_2\mid X_2)$ is practically quite difficult; we are often interested in the case where we plainly know that these mechanisms are identical. You may immediately see the connection to independent causal mechanisms. Indeed, the precise connection has to do with the idea of extending samples as random variables, as in the previous section. See [1] [2] for details on precise algorithms for such tasks.
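As a toy illustration of the identical-mechanism case, the sketch below pools data from several domains that differ only in $P(X)$ and then predicts in a new domain from which nothing was sampled during training. It is not one of the algorithms from the references: the linear mechanism, the domain means, and the helper `sample_domain` are assumptions made for the example, and the fitted model is well-specified, which is what lets it extrapolate.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_domain(x_mean, n):
    """Draw (X, Y) with a domain-specific P(X) but a shared mechanism P(Y | X)."""
    x = rng.normal(loc=x_mean, scale=1.0, size=n)
    # Shared mechanism (assumed for illustration): Y = 3X - 1 + noise.
    y = 3.0 * x - 1.0 + rng.normal(scale=0.5, size=n)
    return x, y

# Training domains differ only in where X is concentrated.
train = [sample_domain(m, 500) for m in (-2.0, 0.0, 2.0)]
x_train = np.concatenate([x for x, _ in train])
y_train = np.concatenate([y for _, y in train])

# Because the mechanism is identical across domains, pooling is justified.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# A new domain with a shifted P(X); its data is used only for evaluation.
x_new, y_new = sample_domain(5.0, 500)
rmse_new = np.sqrt(np.mean((slope * x_new + intercept - y_new) ** 2))
print(f"fitted mechanism: y = {slope:.2f} x + {intercept:.2f}")
print(f"RMSE on the unseen domain: {rmse_new:.2f}")
```

If the mechanisms differed between domains, pooling like this would no longer be justified, and one would need the kind of prior on their correlation that the paragraph above describes as hard to obtain.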

Disentangled representations and symmetry

[article to be extended]

Relevant references: [1] [2] [3] [4a] [4b].