*is*, which also lead to two different *motivations* for machine learning, and two different ways to understand each neural network architecture and many of the ideas in the field.

- The **statistical** motivation -- machine learning is a *non-linear generalisation* of "linear" statistical techniques (data mining) like linear regression, PCA and linear decision boundaries. One can always apply these linear techniques to some transformation of the data that makes the relationships linear (while making sure the transformation is not absurd), but you need a way to "train" what the right such transformation is. In this sense, machine learning acts as a **function approximator**.
- The **neurobiological** motivation -- a computer should be able to do whatever a brain can, but how exactly does a brain do the stuff it does? To take a simple example, the brain can recognise digits -- well, whatever the brain does, it takes an image as input and outputs a digit, i.e. *it's a function*. So once again, we need a **function approximator**.
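For instance, here's a minimal sketch of the statistical view, with made-up data. The log transform is hand-chosen; in this view, machine learning's job is to *learn* such a transformation instead of having us supply it:

```python
import numpy as np

# Hypothetical data: a relationship that is non-linear in x
# but linear in the transformed feature log(x).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)
y = 3.0 * np.log(x) + 1.0 + rng.normal(scale=0.1, size=200)

# A "linear" statistical technique applied to a hand-chosen
# transformation: ordinary least squares on the feature log(x).
A = np.column_stack([np.log(x), np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
slope, intercept = coeffs  # recovers roughly 3.0 and 1.0
```

The linear machinery does all the fitting; the only "intelligent" part was knowing to feed it log(x) rather than x.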

Great. So machine learning is about making function approximators. The basic idea is that we're looking for a *function* that minimises the overall error for a population of data -- it's basically a calculus of variations problem, isn't it? Well, except it isn't, because we don't have access to the entire population, so we need to avoid overfitting (i.e. we need to consider a **Bayesian prior**). This is also what we meant earlier by "making sure the transformation is not absurd".
Anyway, we want a **universal function approximator** -- a system that, given sufficiently many parameters, can generate a function arbitrarily close to any given function. Well, we have such a system -- it's called **polynomial regression**. But it doesn't work. Why not? It's the *wrong Bayesian prior*. Polynomial regression gives zero prior probability to polynomials of order higher than whatever degree you fix, but if you think about it, most machine learning applications necessarily require functions with heavy non-local effects.
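One concrete way the polynomial prior goes wrong, as a sketch with illustrative numbers (none of them come from the text above): with limited data, a high-order fit can match the sample perfectly and still behave wildly just outside it.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1.0, 1.0, size=15))
y = np.sin(3 * x) + rng.normal(scale=0.05, size=15)

lo = np.polyfit(x, y, deg=3)    # low order: a modest, restrictive prior
hi = np.polyfit(x, y, deg=14)   # degree = #points - 1: interpolates exactly

# Evaluate both fits slightly outside the training range.
x_test = np.linspace(-1.2, 1.2, 100)
err_lo = np.max(np.abs(np.polyval(lo, x_test) - np.sin(3 * x_test)))
err_hi = np.max(np.abs(np.polyval(hi, x_test) - np.sin(3 * x_test)))

# The degree-14 fit nails all 15 training points, yet its error just
# beyond the data dwarfs the humble cubic's.
```

This is exactly the "limited data" problem: the universal approximator with the wrong prior loses to a much weaker one.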
A function approximator based on the neurobiological analogy is a **neural network**. That this is a universal function approximator (this is called the **universal approximation theorem**) is mathematically not immediately obvious, and that it has an appropriate Bayesian prior is certainly not easy to guess. But the fact that our brains work as neural networks can be used as an empirical "proof" of these facts.

In fact, a *single-layer neural network* -- the input, a single layer of processing, then the output -- can be used to approximate, with sufficiently many neurons, any function to arbitrary precision. This basically just means that functions can be written as linear combinations of some scaled and translated sigmoid functions.

(**Exercise:** explain why the universal approximation theorem is true for the sigmoid function. What other kinds of functions is it true for? It's actually not that hard at all. If you do get stuck, check out the visuals in Michael Nielsen's e-book. A rigorous proof can be found here.)
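As a hint for the exercise, here's a sketch of the standard construction (the steepness and bump count are arbitrary choices, not canonical): a steep sigmoid is nearly a step function, the difference of two shifted steps is a "bump" on an interval, and a sum of scaled bumps is a staircase that tracks any continuous function.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for very steep arguments.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

def staircase_approx(f, x, n_bumps=200, steepness=5000.0):
    """Approximate f on the range of x with a single hidden layer of
    sigmoid neurons: two neurons per bump, one bump per interval."""
    edges = np.linspace(x.min(), x.max(), n_bumps + 1)
    out = np.zeros_like(x)
    for a, b in zip(edges[:-1], edges[1:]):
        height = f((a + b) / 2.0)          # target value on this interval
        # Steep sigmoid up at a, minus steep sigmoid up at b = a bump.
        bump = sigmoid(steepness * (x - a)) - sigmoid(steepness * (x - b))
        out += height * bump               # linear combination in the output
    return out

x = np.linspace(0.0, 1.0, 1000)
f = lambda t: np.sin(2 * np.pi * t)
max_err = np.max(np.abs(staircase_approx(f, x) - f(x)))
# max_err shrinks as n_bumps grows -- the theorem in miniature.
```

Note that the whole thing is one hidden layer: every sigmoid acts directly on the input, and the output is just their weighted sum.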

In fact, the universal approximation theorem is not actually particularly important at all to the success of neural networks -- like we said, plenty of systems are universal approximators, but they don't have the right Bayesian prior (and this matters when you have limited data). And a single-layer neural network, as it turns out, also often leads to a bad Bayesian prior. Actual learning often involves detecting several "**features**" of the data in several "**steps**" and integrating them together, e.g. first filtering an image, then segmenting it, then recognising the edges of a digit, then recognising the shapes themselves. This is a metaphor that a single-layer network doesn't really capture.

Instead, we typically use multiple layers, like in the brain, called *deep neural networks*, or **deep learning**. The basic idea here comes from looking at the way we process things, which is often in "**steps**".
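As a sketch (the layer sizes and the feature interpretations in the comments are purely illustrative, not a real trained network), a forward pass through such a network is just function composition, one "step" per layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
# Made-up weights for a two-hidden-layer network on 784 "pixels".
W1, b1 = 0.01 * rng.normal(size=(16, 784)), np.zeros(16)  # pixels -> low-level features
W2, b2 = 0.1 * rng.normal(size=(8, 16)), np.zeros(8)      # features -> shapes
W3, b3 = 0.1 * rng.normal(size=(10, 8)), np.zeros(10)     # shapes -> digit scores

def forward(x):
    h1 = sigmoid(W1 @ x + b1)   # step 1: detect simple features
    h2 = sigmoid(W2 @ h1 + b2)  # step 2: features of features
    return W3 @ h2 + b3         # step 3: integrate into per-digit scores

scores = forward(rng.normal(size=784))  # one score per digit, 0-9
```

Each layer only ever sees the previous layer's output, which is what lets later layers work with progressively more abstract "features".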
