The most general formulation of machine learning is that it is the attempt to approximate (or "learn") some mathematical object. For example:
- Regression attempts to approximate a function.
- Classification attempts to approximate an equivalence relation.
- Dimensionality reduction attempts to approximate a parameterization.
- Generative neural networks attempt to approximate a distribution (i.e. a sampler).
- Representation learning attempts to approximate a compression algorithm.
- Game-theoretic learning (e.g. reinforcement learning, no-regret learning) attempts to approximate a distribution of strategies.
- Neural networks in NLP attempt to approximate a vector space that effectively represents desired relationships between words.
The Data Science course addresses several "toy algorithms" that discover low-order or simplified approximations. There are two different explanations of what more "AI-esque" machine learning is, which lead to two different motivations for machine learning, and two different ways to understand each neural network architecture and several other ideas in the field.
- The statistical motivation -- machine learning is a non-linear generalisation of "linear" statistical techniques (data mining) like linear regression, PCA and linear decision boundaries. One can always apply these linear techniques to some transformation of the data that makes the relationships linear (while making sure the transformation is not absurd), but you need a way to "train" what the right transformation is. In this sense, machine learning acts as a function approximator (see the sketch after this list).
- The cognitive science motivation -- a computer should be able to do whatever a brain can, but how exactly does a brain do the stuff it does? To take a simple example, the brain can recognise digits -- well, whatever the brain does, it takes an image as input and outputs a digit, i.e. it's a function. So once again, we need a function approximator.
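To make the statistical motivation concrete, here is a minimal sketch in plain NumPy (the toy data, the tanh layer and every hyperparameter are my own illustrative choices, not anything canonical): a small learned transformation followed by an ordinary linear regression on the transformed features, trained jointly by gradient descent. The "linear technique" is unchanged; what gets learned is the coordinate system in which it is applied.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y is a non-linear function of x, so plain linear regression fails.
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * x) + 0.1 * rng.standard_normal((200, 1))

# Learned transformation phi(x) = tanh(x W1 + b1),
# followed by a linear regression y ~ phi(x) W2 + b2.
H = 32
W1 = 0.5 * rng.standard_normal((1, H))
b1 = np.zeros(H)
W2 = 0.5 * rng.standard_normal((H, 1))
b2 = np.zeros(1)

lr = 0.1
for step in range(5000):
    phi = np.tanh(x @ W1 + b1)          # learned features
    pred = phi @ W2 + b2                # linear model in feature space
    err = pred - y
    loss = np.mean(err ** 2)

    # Backpropagate the mean squared error through both stages by hand.
    d_pred = 2 * err / len(x)
    dW2 = phi.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_phi = d_pred @ W2.T
    d_pre = d_phi * (1 - phi ** 2)      # derivative of tanh
    dW1 = x.T @ d_pre
    db1 = d_pre.sum(axis=0)

    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training MSE: {loss:.4f}")
```

With the tanh layer removed, this collapses back to plain linear regression, which cannot fit the sine at all; the whole point is that the non-linear "preprocessing" is trained rather than chosen by hand.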
Great. So machine learning is about building function (or object) approximators. The basic idea is that we're looking for a function that minimises the overall error over a population of data -- it's basically a calculus of variations problem, isn't it? Well, except it isn't, because we don't have access to the entire population, only a finite sample, so we need to avoid overfitting (i.e. we need a Bayesian prior over functions). This is also what we meant earlier by "making sure the transformation is not absurd".
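Here is a toy illustration of that point (my own example): fit a flexible model to a small sample, once with essentially no penalty and once with a weight penalty that plays the role of a Gaussian prior on the coefficients, then compare the error on the sample with the error on held-out points standing in for the rest of the population.

```python
import numpy as np

rng = np.random.default_rng(1)

# We only ever see a small sample from the "population".
n = 15
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.2 * rng.standard_normal(n)

# Degree-9 polynomial features: flexible enough to chase the noise in 15 points.
degree = 9
X = np.vander(x, degree + 1, increasing=True)

def fit(lam):
    """Minimise ||Xw - y||^2 + lam * ||w||^2 (ridge regression).

    The penalty is equivalent to a zero-mean Gaussian prior on the weights --
    one concrete way of "making sure the fit is not absurd".
    """
    A = X.T @ X + lam * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)

# Held-out points stand in for the rest of the population.
x_test = rng.uniform(-1, 1, size=500)
X_test = np.vander(x_test, degree + 1, increasing=True)
y_test = np.sin(3 * x_test)

for lam in (1e-8, 1e-3, 1e-1):   # 1e-8 is effectively "no prior"
    w = fit(lam)
    train_mse = np.mean((X @ w - y) ** 2)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"lam={lam:g}  sample MSE={train_mse:.3f}  population MSE={test_mse:.3f}")
```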
As a general rule, I'd say that if a feature of human brains is present at birth, we should expect to have to hard-code it, while if a feature is learned by humans, we should definitely get our AI to learn it too (this is the "converse" of a general rule I have when trying to organize knowledge about biology in my head, which is to pretend that evolution is not an algorithm but a hard-coding, simply because actually trying to simulate an evolutionary algorithm in a highly complicated environment is hard). So, for example, the set of transformations that a spatial transformer network finds acceptable is hard-coded, because we didn't actually need to stand on our heads or squish our eyes with a truncheon to learn how to read squished-up text.
Anyway, we want a universal function approximator -- a system that can produce a function arbitrarily close to any given function, given sufficiently many parameters. A basic example of such a system is polynomial regression, but for most applications it has the wrong Bayesian prior (it assigns zero prior probability to high-order polynomials, but most machine learning applications require functions with heavy non-local effects).
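A quick sketch of the "universal approximator" part of that claim (the target function and the degrees are arbitrary choices of mine): least-squares polynomial fits of increasing degree drive the worst-case error on an interval towards zero, which is essentially the Weierstrass approximation theorem -- and says nothing about whether the implied prior is sensible.

```python
import numpy as np

# Target: a smooth but non-polynomial function on [-1, 1].
xs = np.linspace(-1, 1, 2000)
target = np.exp(np.sin(3 * xs))

# Least-squares polynomial fits of increasing degree: the worst-case error
# keeps shrinking, which is the universal-approximation property.
for degree in (1, 3, 5, 9, 15):
    coeffs = np.polynomial.polynomial.polyfit(xs, target, degree)
    approx = np.polynomial.polynomial.polyval(xs, coeffs)
    print(f"degree {degree:2d}: max error = {np.max(np.abs(approx - target)):.4f}")
```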
Another function approximator is a neural network. The fact that a neural network (even with a single hidden layer) is a universal approximator is the universal approximation theorem: a continuous function on a compact domain can be approximated arbitrarily well by a linear combination of scaled and translated sigmoid functions.
(Exercise: explain why the universal approximation theorem is true for the sigmoid function. What other kinds of functions is it true for? It's actually not that hard at all. If you do get stuck, check out the visuals in Michael Nielsen's e-book. A rigorous proof can be found here.)
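If you want to see the construction without doing the exercise, here is a rough numerical sketch of the idea behind Nielsen's visuals (the steepness and number of bumps are arbitrary choices of mine): a pair of steep, shifted sigmoids makes an approximate "bump" over a sub-interval, and a weighted sum of such bumps -- which is exactly a single hidden layer of sigmoid units with a linear output -- approximates the target by a step function.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow warnings for very steep units.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60, 60)))

# Target on [0, 1]; any continuous function would do.
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

steepness = 500.0   # steeper sigmoids give sharper "steps"
n_bumps = 50        # more bumps give a finer step-function approximation

xs = np.linspace(0, 1, 1000)
approx = np.zeros_like(xs)
edges = np.linspace(0, 1, n_bumps + 1)
for left, right in zip(edges[:-1], edges[1:]):
    height = f((left + right) / 2)
    # sigmoid(s*(x - left)) - sigmoid(s*(x - right)) is ~1 on [left, right]
    # and ~0 outside: a "bump" built from two hidden sigmoid units.
    approx += height * (sigmoid(steepness * (xs - left)) - sigmoid(steepness * (xs - right)))

print(f"max error with {n_bumps} bumps: {np.max(np.abs(approx - f(xs))):.3f}")
```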
In fact, the universal approximation theorem is not actually particularly important to the success of neural networks -- like we said, plenty of systems are universal approximators, but they don't have the right Bayesian prior (and this matters when you have limited data). The single-layer neural network actually has a bad implied prior for many tasks, which is why we usually study "deep" neural networks, which have a surprisingly good prior.
I have seen many explanations as to why this is so: deep learning means doing things in steps, deep learning corresponds to decomposing "hierarchy" or "structure" in the world and our world is "inherently hierarchical", etc. But honestly, these all seem like terrible rationalizations -- I don't even see how these claims are testable. You would need to create a system that isn't "inherently hierarchical" and demonstrate that deep learning doesn't do very well on it (which, as far as I can see, makes no sense). The right way to "explain why deep learning works so well" is to compute the implied prior and find out how it scales with depth.
Someone do this.
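I don't know how to do the exact computation, but here is one crude empirical way to start poking at it (everything here -- tanh units, width 64, Gaussian initialisation -- is an assumption of mine, not a claim about what the right calculation looks like): sample randomly initialised networks of different depths and measure, across draws, how correlated the function values at nearby and distant inputs are. That correlation structure is one summary of the prior over functions that the architecture implies.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_mlp(depth, width=64):
    """Draw one network from the 'prior': Gaussian weights, tanh hidden units."""
    dims = [1] + [width] * depth + [1]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), 0.5 * rng.standard_normal(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    return h

# Evaluate many random nets on a 1-D grid; across draws, measure how correlated
# f(0) is with f(x) at a nearby and a distant point. How fast that correlation
# decays with distance, and how it changes with depth, is one crude summary of
# the implied prior over functions.
xs = np.linspace(0.0, 3.0, 100).reshape(-1, 1)
n_draws = 500
for depth in (1, 2, 4, 8):
    samples = np.stack([forward(random_mlp(depth), xs)[:, 0] for _ in range(n_draws)])
    corr = lambda j: np.corrcoef(samples[:, 0], samples[:, j])[0, 1]
    print(f"depth {depth}: corr(f(0), f(0.76)) = {corr(25):.2f}, "
          f"corr(f(0), f(3.0)) = {corr(99):.2f}")
```

A proper answer would presumably do this analytically, and would control for the total parameter count rather than just stacking layers of a fixed width, which this sketch doesn't.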