### Machine learning as function approximation; statistical and cognitive motivations

The most general formulation of machine learning is that it is the attempt to approximate (or "learn") some mathematical object. For example:

• Regression attempts to approximate a function.
• Classification attempts to approximate an equivalence relation or a distance function.
• Dimensionality reduction attempts to approximate a parameterization.
• Generative neural networks attempt to approximate a distribution (i.e. a sampler).
• Game-theoretic learning (e.g. reinforcement learning, no-regret learning) attempts to approximate a distribution of strategies.

The Data Science course addresses several "toy algorithms" that discover low-order or simplified approximations. There are two different explanations of what more "AI-esque" machine learning is, which lead to two different motivations for machine learning, and two different ways of understanding each neural network architecture and several other ideas in machine learning.

1. The statistical motivation -- machine learning is a non-linear generalisation of "linear" statistical techniques (data mining) like linear regression, PCA and linear decision boundaries. One can always apply these linear techniques to some transformation of the data that makes the relationships linear (while making sure the transformation is not absurd), but you need a way to "train" the right such transformation. In this sense, machine learning acts as a function approximator.
2. The cognitive science motivation -- a computer should be able to do whatever a brain can, but how exactly does a brain do the things it does? To take a simple example, the brain can recognise digits: whatever it is the brain does, it takes an image as input and outputs a digit, i.e. it's a function. So once again, we need a function approximator.
Great. So machine learning is about building function approximators. The basic idea is that we're looking for a function that minimises the overall error over a population of data -- basically a calculus of variations problem, isn't it? Well, except it isn't, because we don't have access to the entire population, only a sample, so we need to avoid overfitting (i.e. we need a Bayesian prior). This is also what we meant earlier by "making sure the transformation is not absurd".
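To make the overfitting point concrete, here's a minimal sketch (using NumPy; the toy target, sample size and polynomial degree are all arbitrary choices of mine for illustration): a degree-9 polynomial drives the error on ten training samples to essentially zero, while the error on the rest of the "population" stays orders of magnitude larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# The "population": y = sin(2*pi*x), observed with some noise.
def target(x):
    return np.sin(2 * np.pi * x)

# We only get to see ten noisy samples of it.
x_train = np.linspace(0, 1, 10)
y_train = target(x_train) + rng.normal(0, 0.1, size=10)

# A degree-9 polynomial can interpolate ten points almost exactly:
# the empirical error is ~0, but that says nothing about the population.
coeffs = np.polyfit(x_train, y_train, deg=9)

x_test = np.linspace(0, 1, 500)
train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2)

print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2e}")
```

Minimising the error on the sample alone is trivially easy; it's the gap between the two numbers that the prior has to close.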

As a general rule ... I'd say that if a feature of human brains is present at birth, we should expect to have to hard-code it, while if a feature is learned by humans, we should get our AI to learn it too (this is the "converse" of a general rule I have when organising knowledge about biology in my head, which is to pretend that evolution is not an algorithm but a hard-coding, simply because actually simulating an evolutionary algorithm in a highly complicated environment is hard). So for example, the set of transformations that a spatial transformer network finds acceptable is hard-coded, because we didn't actually need to stand on our heads or squish our eyes with a truncheon to learn how to read squished-up text.

Anyway, we want a universal function approximator -- a system that can produce a function arbitrarily close to any given function, given sufficiently many parameters. Well, we have such a system -- it's called polynomial regression. But it doesn't work. Why not? It's the wrong Bayesian prior. Fix any finite degree and you've assigned zero prior probability to all higher-order polynomials; but let the degree grow and every coefficient exerts heavy non-local effects on the fit, which most machine learning applications can't tolerate.
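One way to see the bad prior in action (a NumPy sketch; the degree and the classic Runge-style target are my choices, nothing canonical): fit a high-degree polynomial through noise-free samples of a perfectly tame, bounded function, and the fit develops wild oscillations near the edges -- local data exerting heavy non-local effects on the curve.

```python
import numpy as np

# Runge's classic example: a smooth function bounded in (0, 1] on [-1, 1].
def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)

# Interpolate it exactly at 15 equispaced points with a degree-14 polynomial.
x_nodes = np.linspace(-1, 1, 15)
coeffs = np.polyfit(x_nodes, runge(x_nodes), deg=14)

# The true function never leaves (0, 1], but the fit swings far outside
# that range between the outermost nodes.
x_dense = np.linspace(-1, 1, 1000)
fit = np.polyval(coeffs, x_dense)
print(f"true range: (0, 1], fit max |value|: {np.max(np.abs(fit)):.2f}")
```

The data here is noiseless and the target is as nice as can be; the pathology is entirely in what the polynomial family "expects" functions to look like.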

A function approximator based on the neurobiological analogy is a neural network. That this is a universal function approximator (this is called the universal approximation theorem) is mathematically not immediately obvious, and that it has an appropriate Bayesian prior is certainly not easy to guess. But the fact that our brains work as neural networks can be used as an empirical "proof" of these facts.

In fact, a neural network with a single hidden layer -- the input, one layer of processing, then the output -- can approximate any function to arbitrary precision, given sufficiently many neurons. This basically just means that functions can be written as linear combinations of scaled and translated sigmoid functions.
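Here's a direct illustration of that claim (a NumPy sketch; the bin width, the steepness k and the target x² are arbitrary choices of mine): pair up steep sigmoids into "bumps", weight each bump by the target's value at its centre, and the sum -- which is exactly a single-hidden-layer network with a linear output -- tracks the target closely, with no training at all.

```python
import numpy as np

def sigmoid(z):
    # Clip to keep np.exp from overflowing at extreme arguments.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def target(x):
    return x ** 2

# Chop a slightly padded interval into narrow bins; each bin gets a "bump"
# built from two steep sigmoids, i.e. 2 hidden neurons per bin.
edges = np.linspace(-0.05, 1.05, 111)   # 110 bins of width 0.01
k = 2000.0                              # steepness of each sigmoid

x = np.linspace(0, 1, 501)
approx = np.zeros_like(x)
for a, b in zip(edges[:-1], edges[1:]):
    centre = (a + b) / 2
    bump = sigmoid(k * (x - a)) - sigmoid(k * (x - b))
    approx += target(centre) * bump     # linear output layer

max_err = np.max(np.abs(approx - target(x)))
print(f"max error of the sigmoid network: {max_err:.4f}")
```

Narrower bins (more neurons) shrink the error further, which is the universal approximation theorem in miniature -- at the cost of a hidden layer whose width blows up with the precision you demand.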

(Exercise: explain why the universal approximation theorem is true for the sigmoid function. What other kinds of functions is it true for? It's actually not that hard at all. If you do get stuck, check out the visuals in Michael Nielsen's e-book. A rigorous proof can be found here.)

In fact, the universal approximation theorem is not actually particularly important to the success of neural networks -- as we said, plenty of systems are universal approximators, but they don't have the right Bayesian prior (and this matters when you have limited data).

It turns out that a single-hidden-layer neural network often has a bad implied prior for many tasks -- but a multilayered neural network (this is what "deep learning" refers to) has a good one.

I have seen many explanations of why this is so: deep learning means doing things in steps, deep learning corresponds to decomposing "hierarchy" or "structure" in the world and our world is "inherently hierarchical", etc. But honestly, these all seem like terrible rationalisations -- I don't even see how these claims are testable. You would need to create a system that isn't "inherently hierarchical" and demonstrate that deep learning doesn't do very well on it (which, as far as I can see, makes no sense).

The right way to "explain why deep learning works so well" is to compute the implied prior and find out how it scales with depth.

Someone do this.
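For what it's worth, a toy version of the experiment is easy to set up (a NumPy sketch; the width, the bias-free He-style initialisation and "mean squared second difference" as a roughness proxy are all my arbitrary choices): sample functions from randomly initialised networks of varying depth and look at what kind of functions each architecture "expects" before it has seen any data.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_function(depth, width=64, n_points=200):
    """Draw one function R -> R from a randomly initialised ReLU MLP
    (no biases, for brevity)."""
    x = np.linspace(-1, 1, n_points).reshape(-1, 1)
    h = x
    fan_in = 1
    for _ in range(depth):
        W = rng.normal(0, np.sqrt(2.0 / fan_in), size=(fan_in, width))
        h = np.maximum(h @ W, 0.0)   # He-style init, ReLU
        fan_in = width
    w_out = rng.normal(0, np.sqrt(1.0 / fan_in), size=(fan_in, 1))
    return (h @ w_out).ravel()

def roughness(f):
    """Mean squared second difference -- a crude proxy for curvature."""
    return np.mean(np.diff(f, n=2) ** 2)

# The implied prior, depth by depth: what do typical draws look like?
for depth in (1, 2, 4, 8):
    draws = [sample_function(depth) for _ in range(20)]
    print(f"depth {depth}: mean roughness "
          f"{np.mean([roughness(f) for f in draws]):.2e}")
```

This only eyeballs one summary statistic, of course -- the serious version of the question is how the whole distribution over functions scales with depth, not just its roughness.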