**Table of contents**

- Semi-supervised learning
- Semi-supervised learning and causation
- Independent noise and causation
- Hierarchical models and transfer learning

**Semi-supervised learning**

We have previously discussed that machine learning can be considered an information-theoretic problem of finding an optimal representation. Most of that discussion concerned unsupervised learning, while supervised learning requires a notion of optimality with respect to some target variable (formalized via the information bottleneck). In particular, we're interested in whether the unsupervised representations are *correlated* with the supervised ones.

More specifically, suppose you're trying to classify points into some number of categories (i.e. learn a (random) function $f:X\to Y$), and have some number of labeled points (values of $f(x)$ for some $x$). Then when -- if ever -- will a further sampling from $X$ (i.e. without labels) help you achieve greater classification accuracy? Does $P(X)$ provide *information* on $P(Y\mid X)$?

So for example, if you believe that clustered $x$ tend to have similar labels, then $P(X)$ gives information on $P(Y\mid X)$. Formally, we have a belief distribution $\phi_{p_{x,y}}$ on the possible distributions $p_{x,y}$ that generate $X, Y$ -- so the above hypothesis represents a high prior probability for distributions $p_{x,y}$ that predict a high probability for nearby $x$ to produce nearby $y$. The random distributions $p_x$ and $p_{y\mid x}$ can be extracted as function(al)s of the random distribution $p_{x,y}$, and so it makes sense to talk about their independence -- and if they are independent, semi-supervised learning doesn't work.
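As a rough illustration of the cluster hypothesis, here is a minimal sketch in which unlabeled points sharpen a decision threshold. Everything in it is an assumption made for the example: the two 1-D Gaussian clusters, the helper names `two_means_threshold` and `accuracy`, and the choice of a plain 2-means step as the "unsupervised representation".

```python
import numpy as np

def two_means_threshold(x, n_iter=50):
    """Run 1-D 2-means (Lloyd's algorithm); return the midpoint of the two centroids."""
    lo, hi = x.min(), x.max()  # initialise the centroids at the extremes
    for _ in range(n_iter):
        near_hi = np.abs(x - hi) < np.abs(x - lo)
        lo, hi = x[~near_hi].mean(), x[near_hi].mean()
    return 0.5 * (lo + hi)

def accuracy(threshold, x, y, one_is_high):
    """Accuracy of the rule 'predict the high class iff x > threshold'."""
    high, low = (1, 0) if one_is_high else (0, 1)
    return float(np.mean(np.where(x > threshold, high, low) == y))

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)                # hidden labels
x = rng.normal(loc=4.0 * y - 2.0, scale=1.0)  # clusters at -2 (y=0) and +2 (y=1)

# Only two points per class keep their labels.
labeled = np.concatenate([np.flatnonzero(y == c)[:2] for c in (0, 1)])
x_lab, y_lab = x[labeled], y[labeled]
mean0, mean1 = x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()
one_is_high = bool(mean1 > mean0)  # labels decide which side is which

t_sup = 0.5 * (mean0 + mean1)    # threshold from the labeled points only
t_semi = two_means_threshold(x)  # threshold from P(X), estimated on all points

acc_sup = accuracy(t_sup, x, y, one_is_high)
acc_semi = accuracy(t_semi, x, y, one_is_high)
print(f"supervised:      threshold={t_sup:+.2f}, accuracy={acc_sup:.3f}")
print(f"semi-supervised: threshold={t_semi:+.2f}, accuracy={acc_semi:.3f}")
```

The unlabeled points help here only because the data were generated so that clusters and labels coincide; if the labels were assigned independently of the cluster structure, the 2-means threshold would carry no information about $P(Y\mid X)$.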

**Semi-supervised learning and causation**

One may think about the physical mechanisms underlying such data. Suppose that $P(Y\mid X)$ does give us information about $P(X)$ -- e.g. we believe that if $Y$ tends to cluster with $X$ then $X$ has a correspondingly clustered process -- then that tells us that either $X$ and $Y$ have a common cause, or that $Y$ causes $X$.

Why? Let's be a bit more concrete about this and suppose that $X$ and $Y$ represent altitude and temperature (of some weather stations) respectively. And suppose we believe that observing a distribution like this for $X$:

Made it likely for the relationship between $X$ and $Y$ to follow the same clustering, i.e. to look like this:

No. In fact, one way to think about "our belief about a distribution" is to consider that distribution as parameterized by (conditioned on) some third variable $Z$ and think about our beliefs on $Z$. For $P(X)$ and $P(Y\mid X)$ to be "independent" means that there is no $Z$ such that $P(X)$ and $P(Y\mid X)$ both depend on $Z$, i.e. for all variables $Z$, either $X$ and $Z$ are independent, or $Y$ and $Z$ are independent conditioned on $X$:

$$\forall Z, (X\perp\!\!\!\perp Z)\lor (Y\perp\!\!\!\perp Z\mid X)$$

*<=> For any $Z$ correlated with $X$, $Z$ affects $Y$ only through $X$.*

*<=> If there is an unblocked causal path between $X$ and some $Z$, then conditioning on $X$ blocks all causal paths between $Y$ and $Z$.*

*<=> If there are unblocked causal paths between some $Z$ and $X$, $Y$, then the paths between $Z$ and $Y$ contain $X$ as either a mediator or a common cause.*

*<=> If there are unblocked causal paths between some $Z$ and $X$, $Y$, then the paths between $Z$ and $Y$ contain $X$, and $X$ is a cause of $Y$.*

*<=> $X$ causes $Y$ and they have no common causes.*

This is called the "Principle of Independent Causal Mechanisms" -- if semi-supervised learning doesn't add anything, then there is a defined causal direction $X\to Y$.
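To see why independence of $P(X)$ and $P(Y\mid X)$ makes unlabeled data useless, here is a short sketch under assumptions that are not in the original notation: $p_x$ is indexed by a parameter $\theta$ and $p_{y\mid x}$ by a parameter $\psi$, the prior factorizes as $\phi(\theta,\psi)=\phi_1(\theta)\,\phi_2(\psi)$, and we observe $n$ i.i.d. labeled pairs $(x_i,y_i)$ together with $m$ further unlabeled points $x_{n+1},\dots,x_{n+m}$. The posterior over $\psi$ is then

$$\begin{aligned} p(\psi\mid \text{data}) &\propto \phi_2(\psi)\prod_{i=1}^{n}p(y_i\mid x_i,\psi)\int\phi_1(\theta)\prod_{i=1}^{n+m}p(x_i\mid\theta)\,d\theta \\ &\propto \phi_2(\psi)\prod_{i=1}^{n}p(y_i\mid x_i,\psi). \end{aligned}$$

Under these assumptions the unlabeled points enter only through the integral over $\theta$, which is constant in $\psi$, so they cannot change our beliefs about $P(Y\mid X)$.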

(Our treatment of the Principle of Independent Causal Mechanisms is similar to that by Sokolovska & Wuillemin [1]. We have not covered the algorithmic formulation of the principle; worth reading in this regard are Janzing et al. on what ICM translates to in physics [2] and Besserve et al. on a group-theoretic formulation [3]. Furthermore, semi-supervised learning can be considered a special case of transfer learning; relevant papers on the theory of transfer learning include [4], [5] and [6, p. 177].)

**Independent noise and causation**

Although we have formulated the notion of $P(X)$ and $P(Y\mid X)$ being independent in terms of our beliefs about what those distributions might be, such beliefs can also be updated by data. Indeed the characterization above, in which the variable $Z$ is used for parameterization, is one such formulation.

*If there are unblocked causal paths between some $Z$ and $X$, $Y$, then the paths between $Z$ and $Y$ contain $X$ as either a mediator or a common cause.*

The contrapositive of this reads:

*If $Y$ depends on $Z$ not through $X$, then $X$ and $Z$ must be independent.*

*<=> $X$ is independent of the noise term in $Y$.*

Indeed, this is one of the most common characterizations used to establish causation when a full picture of the causal model and its specific variables is not known. The idea is that the noise term captures the relevant stuff from the unknown variables in the true causal model.
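The asymmetry can be sketched on synthetic data. The following is a toy check, not a standard method: the linear mechanism, the uniform noise, and the helper `dependence_score` are all assumptions made for the example, and the score (correlation between squared residual and squared regressor) is only a crude stand-in for a proper independence test such as a kernel-based one.

```python
import numpy as np

def dependence_score(cause, effect):
    """Fit effect ~ a*cause + b and return |corr(residual^2, cause^2)|.

    A crude proxy for 'the residual depends on the regressor': it only
    detects dependence that shows up in the spread of the residual.
    """
    coeffs = np.polyfit(cause, effect, deg=1)
    residual = effect - np.polyval(coeffs, cause)
    return float(abs(np.corrcoef(residual**2, cause**2)[0, 1]))

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-1.0, 1.0, size=n)
noise = rng.uniform(-1.0, 1.0, size=n)  # independent of x by construction
y = x + noise                           # true mechanism: X -> Y

forward = dependence_score(x, y)   # residual of Y given X vs. X
backward = dependence_score(y, x)  # residual of X given Y vs. Y
print(f"X -> Y score: {forward:.3f}")
print(f"Y -> X score: {backward:.3f}")
```

In the causal direction the residual is just the noise term, so its score should be close to zero; in the anti-causal direction the residual's spread shrinks as $|Y|$ grows, so the score is clearly larger. The example relies on the noise being non-Gaussian: with a linear mechanism and Gaussian variables, both directions would look equally independent.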

(See Goudet et al [4] for the standard treatment of causal inference from noise.)
