So recently you might have heard a lot about deep networks, or neural networks, or deep learning architectures. What are these, and how did they come about? We'll talk about this starting from the beginning of so-called neural network research and tracing it to where it has gotten to today.

Modern neural network research can be traced back to the psychologist Donald Hebb in the 1940s. Donald Hebb proposed that networks of simple units following very simple learning rules can learn to understand and model very complicated patterns. The simplest rule that he proposed is just: if two units are active at the same time, make the connection between them a little bit stronger, and if they're not active at the same time, make it a little bit weaker. This was largely inspired by his ideas about biological neurons and how they might learn patterns, and since then we've actually found that biological neurons do carry out some of these simple rules. The rules that Donald Hebb proposed are today called Hebbian learning. They're a little bit different from what's used in supervised learning and play a bigger part in unsupervised learning, so we won't talk about them in depth, but this was the beginning of our conception of neural networks, which are models of simple units that are connected to each other and that learn by changing the weights between the units.
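Just to make that rule concrete, here is a minimal sketch in Python of a Hebbian-style update. The function name, the learning rate, and the exact way non-co-active pairs are weakened are illustrative choices of mine, not a specific algorithm from Hebb's work:

```python
import numpy as np

def hebbian_update(weights, activities, lr=0.01):
    """One Hebbian-style update over a set of fully connected units.

    activities: vector of 0/1 unit activities. If two units are active at
    the same time, the weight between them grows a little; otherwise it
    shrinks a little (the simple rule described above).
    """
    a = np.asarray(activities, dtype=float)
    co_active = np.outer(a, a)            # 1 where both units are active
    weights += lr * (2 * co_active - 1)   # +lr if co-active, -lr otherwise
    np.fill_diagonal(weights, 0.0)        # no self-connections
    return weights

# Example: three units, where units 0 and 1 fire together
w = np.zeros((3, 3))
w = hebbian_update(w, [1, 1, 0])
```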
The modern neural network really arose with the work of Frank Rosenblatt, another psychologist, who in the 1950s invented what he called the perceptron. The perceptron was a computational model of learning, and it was already a supervised architecture, so it could learn to predict patterns that were given to it. Frank Rosenblatt in the 50s demonstrated that he could train the perceptron to recognise simple patterns like letters, and this actually generated a lot of excitement at the time, because it was pretty unprecedented in AI research.

So how does a perceptron work? Well, a perceptron consists, like I said, of several simple units, which are called neurons or nodes, and in this diagram we see the essential layout. Here are the two so-called input neurons, which I have labeled x1 and x2. If we remember our previous discussion of classifying something like images of dogs and cats, where we plotted them in two dimensions, those two dimensions would be the numbers that are fed into x1 and x2. Of course, that means we could have many more inputs, thousands of inputs if we wanted, but for conceptual reasons we'll keep it as simple as possible. So we have two input neurons, x1 and x2. These are combined using something called a weighted sum: for each of these neurons there is a weight, the neuron can have a certain value, and we multiply that value by the weight and then sum them together. You see that in the blue circle. After that, the weighted sum is passed through something called a non-linearity. A non-linearity is just a nonlinear function, and what the perceptron uses is called a threshold non-linearity: it basically says, if the weighted sum is below some threshold, output a 0, and if it's above the threshold, output a 1. So it's like a cutoff. Now, the learning algorithm for perceptrons involves adjusting the weights, here weight 1 and weight 2, and actually also the threshold b, so as to learn a certain mapping from the input neurons to 0 or 1 outputs, which could be, for example, cat or dog.
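To make the two-input perceptron concrete, here is a minimal sketch in Python of the weighted sum, the threshold non-linearity, and a weight update. The video doesn't spell out the update rule, so the version below is the standard textbook perceptron rule, with an illustrative learning rate and toy data:

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Weighted sum followed by a threshold non-linearity: 1 if w.x >= b, else 0."""
    return 1 if np.dot(w, x) >= b else 0

def perceptron_train(X, y, lr=0.1, epochs=20):
    """Classic perceptron learning rule: nudge the weights and the threshold
    whenever the prediction is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - perceptron_forward(x_i, w, b)
            w += lr * error * x_i   # push the weighted sum toward the target
            b -= lr * error         # lower/raise the threshold accordingly
    return w, b

# Toy linearly separable data: class 1 when both inputs are large
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([0, 0, 1, 1])
w, b = perceptron_train(X, y)
```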
From now on I'll represent the weighted sum and the non-linearity as a combined neuron in this greenish-blue color, and this will show up in later slides.

Now, if you remember, before I discussed how in supervised learning we have some training data set that has, for example, inputs from two classes (it could be more classes, but to keep it simple, two), and we learn a surface that separates inputs from one class and the other class. For the kind of perceptron that you see here, this surface is a line, and we know it's a line, if you remember a little bit of high school geometry, because a weighted sum essentially defines a line. So in this case we have a weighted sum, and then we check whether it's above or below the threshold, and this actually creates a linear separating surface, where everything on one side of the line belongs to class 1, let's say, and everything on the other side of the line belongs to class 0. Using the perceptron training algorithm, we can actually learn the kinds of separating surfaces that you see here, where the training data set is represented by plus signs for one class and minus signs for the other class.
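To see why a weighted sum plus a threshold gives a line, write out the decision rule for the two-input case:

\[
\text{output} =
\begin{cases}
1 & \text{if } w_1 x_1 + w_2 x_2 \ge b \\
0 & \text{otherwise}
\end{cases}
\]

The boundary between the two outputs sits where \( w_1 x_1 + w_2 x_2 = b \), which is exactly the equation of a straight line in the \((x_1, x_2)\) plane.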
Even though the perceptron comes from the 50s, it actually has almost all the ingredients of a modern neural network, and all the research that has come since has built on this basic architecture.

An important development in the field of neural network research happened in 1969, when two AI pioneers, Minsky and Papert, published a book called Perceptrons. They were very interested in the idea that Frank Rosenblatt had proposed, and so they did a lot of mathematical and theoretical analysis of the perceptron. However, one result that they proved essentially killed neural network research for 20 years: they proved that the kind of perceptron we saw in the previous slide could not learn to recognize certain kinds of patterns.

Now, you might already have some idea of the kinds of patterns it might not be able to recognize, but again, it's easy to see visually. As I said before, for the perceptron the separating surface is always a line. If you provide it with a training data set like the one you see in the lower right-hand corner of your screen, with members of the plus and minus classes arranged as they are, there's simply no way to separate the pluses and the minuses using a single line. This kind of problem is called a non-linearly separable problem, because the members of the different classes can't be separated by a single line. Since the perceptron can only learn linear separating surfaces, there's no way this kind of perceptron could learn to properly classify pluses and minuses arranged in this way. At the same time, this class of problems, the non-linearly separable ones, occurs in many cases: it occurs whenever elements of one class have either one thing or the other thing, but not both.
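The textbook name for this "one or the other, but not both" pattern is XOR (exclusive or), and it's the standard example of a non-linearly separable problem. With two binary inputs, the four training points are:

(0, 0) -> class 0
(0, 1) -> class 1
(1, 0) -> class 1
(1, 1) -> class 0

No single straight line can put the two class-1 points on one side and the two class-0 points on the other.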
And clearly that occurs in many situations, and one would like machine learning to be able to learn patterns like that. So this was a big issue for perceptrons, and people lost interest in them and thought that they couldn't really do many interesting things.

Now, the situation changed dramatically in the mid-1980s, when two cognitive scientists, Rumelhart and McClelland, published a book called Parallel Distributed Processing. You saw in the previous case the simplest form of the perceptron: we have two inputs, or some number of inputs, that get a weighted sum and a non-linearity, and that's the output. What Parallel Distributed Processing discussed was perceptrons, or more broadly neural networks, in which there are many nested layers. There's a set of input neurons; these get summed and passed through a non-linearity, but then there are multiple of these sums and non-linearities, and these now serve as the inputs for the next layer, so their outputs are themselves summed and passed through another non-linearity, and so on. In this diagram you see a multi-layer neural network with one hidden layer: the inputs go to two different sums and non-linearities, and those then go to yet a further sum and non-linearity. We call the sums and non-linearities in the middle, in this case, a hidden layer.
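Written out as formulas, a network with one hidden layer of this kind computes something like the following (the symbols here are generic notation, not taken from the slides):

\[
h_j = \varphi\!\left(\sum_i w_{ij}\, x_i - b_j\right), \qquad
y = \varphi\!\left(\sum_j v_j\, h_j - c\right)
\]

where \(\varphi\) is the non-linearity, the \(w_{ij}\) and \(v_j\) are the weights, and \(b_j\) and \(c\) play the role of the thresholds.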
What Parallel Distributed Processing showed is that you could design such multi-layer neural networks, that there is a very computationally efficient learning rule that could train the weights of these multi-layer neural networks, and that these multi-layer neural networks could learn patterns like non-linearly separable problems. In fact, there are results showing that with enough hidden units and hidden layers, they could learn essentially any function at all.

Now we'll see how multi-layer neural networks can actually solve something like a non-linearly separable problem, which was such an issue for the single-layer perceptron. We know that each of the individual weighted sums and non-linearities essentially defines a linear separating surface, so we can think of each of the weighted sums and non-linearities in the hidden layer as setting up its own linear separating surface. Here, the top one, for example, will say that class 1 is everything above and to the left of the red line, and the bottom one will say that class 1 is everything below and to the right of the shifted red line. Now, interestingly, we can represent an intersection as a weighted sum passed through a non-linearity: imagine setting both of the incoming weights to 1, and then saying, if the weighted sum is less than 2, output a 0, and if it's 2 or greater, output a 1. That means the final output neuron will only turn on if both of its input neurons are turned on, so essentially that's like taking the intersection of the class-1 regions of both of the hidden units. And in this case, that's exactly what's needed to solve the non-linearly separable problem and separate the minuses from the pluses in the example we saw before.
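Putting specific numbers on this, here is a minimal sketch of a two-layer threshold network that solves the XOR arrangement. The hidden weights and thresholds are illustrative choices of mine; the output unit is exactly the intersection unit just described, with both incoming weights set to 1 and a threshold of 2:

```python
import numpy as np

def threshold_unit(x, w, b):
    """A single weighted sum followed by a threshold non-linearity."""
    return 1 if np.dot(w, x) >= b else 0

def two_layer_net(x):
    """Two hidden threshold units, each defining a half-plane, plus an output
    unit that takes their intersection."""
    h1 = threshold_unit(x, w=[1, 1], b=1)        # fires above the line x1 + x2 = 1
    h2 = threshold_unit(x, w=[-1, -1], b=-1.5)   # fires below the shifted line x1 + x2 = 1.5
    return threshold_unit([h1, h2], w=[1, 1], b=2)  # fires only if both hidden units fire

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, "->", two_layer_net(point))     # prints 0, 1, 1, 0: the XOR pattern
```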
Now, as I mentioned, the original perceptron used something called a threshold non-linearity, which basically switches from a 0 to a 1 as soon as the weighted-sum input passes a certain threshold. In modern neural network algorithms, including the ones used starting from the 80s, we use a differentiable non-linearity, meaning it's a smooth function whose derivative we can take, and we minimize the training error by essentially using the derivatives of the functions that are transforming our signals. This sounds complicated, but it's pretty easy to think about visually. If we think of a function such as the training error, defined over the values of the weights on the connections in our network, then this training algorithm essentially tries to roll down the hill, changing the weights so as to minimize the training error, and because we have the derivatives, we know which way to roll. This set of algorithms, which broadly follow the derivatives downhill so as to minimize the training error little by little, are called gradient descent algorithms. You might also hear very often in modern machine learning terms like stochastic gradient descent (SGD), which is a small variant of this basic idea. And this was very successful for training multi-layer neural networks.
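To make "rolling down the hill" concrete, here is a minimal sketch of gradient descent on a toy error surface over two weights. The error function, step size, and number of steps are illustrative choices, not anything specific from the video:

```python
import numpy as np

def training_error(w):
    """A toy, bowl-shaped error surface over two weights (illustrative only)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def error_gradient(w):
    """Derivatives of the toy error with respect to each weight."""
    return np.array([2 * (w[0] - 3.0), 2 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])          # start somewhere on the hill
lr = 0.1                          # step size (learning rate)
for step in range(100):
    w -= lr * error_gradient(w)   # move a little bit downhill each step

print(w, training_error(w))       # ends up very close to the minimum at (3, -1)
```

Stochastic gradient descent does the same thing, except that at each step the derivative is estimated from a small random batch of training examples rather than the whole training set.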
I should also add that one reason it was successful was that a certain trick was discovered for doing gradient descent on neural networks. If you just want to compute which way you should change the weights so as to minimize training error for a large neural network, that's actually a very computationally difficult problem. You might also hear the term backpropagation, or backprop, which is essentially a very computationally quick way to do gradient descent, and it became widely used in the 1980s and made neural networks practical to train.
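For reference, the trick in backpropagation is the chain rule of calculus applied layer by layer, working backwards from the output. For the one-hidden-layer network written out earlier, for instance, the derivative of the error E with respect to a first-layer weight factors as

\[
\frac{\partial E}{\partial w_{ij}}
= \frac{\partial E}{\partial y}\cdot\frac{\partial y}{\partial h_j}\cdot\frac{\partial h_j}{\partial w_{ij}},
\]

and the factors closer to the output are computed once and reused for every weight, which is what makes the procedure computationally quick.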
Now, starting in the mid-80s and through 2010 or so, neural networks generated a lot of excitement among psychologists and cognitive scientists. They actually seemed to be quite good models of human perceptual performance and of various kinds of behaviors that people show in psychological tasks. However, for actual machine learning applications, they just weren't very good. They weren't the state of the art, and other algorithms tended to perform better than neural networks did at applied tasks like recognizing images and determining whether an image shows a cat or a dog, for example. Because of this, there was a kind of winter of applied neural network research that lasted for almost two decades, or maybe even a little bit more, where people did not take neural networks very seriously as state-of-the-art machine learning algorithms.

This changed dramatically in the early 2010s, and in particular there was a kind of dramatic explosion of interest in neural networks in 2012. I should add that on a yearly basis there are many competitions in the machine learning academic community, where different groups try to crack machine learning problems and compete with each other based on how well their algorithms do. There has been one competition in particular which involved classifying images according to what's shown in them. We talked earlier about classifying images as either being a dog or a cat; this competition, which is called ImageNet, actually had on the order of a thousand classes. It has classes, as you can see now in this slide, like leopard and mushroom and mite and all kinds of stuff. So it's a much harder task: it's not just cat or dog, it has something like a thousand classes, and the goal was to try to predict what class the image belonged to. Now, there had been some improvement in this task year on year, maybe the best performer improving by a percent or two, but in 2012 something very dramatic happened. For the first time, a neural network algorithm won first place in the competition, and it improved dramatically over any non-neural-network entry. In particular, the neural network entry that won first place beat the next best entry by more than 10 percent error: the next best entry got something like 25 percent error, rather than 15. Now, the next best entry used hand-coded features that were aimed at capturing some important aspects of images and visual recognition, hand-tuned algorithms that had been in development for many, many years by some really smart people, but the neural network essentially started from scratch and learned to beat it by a lot. So this really shocked people, that a neural network that was kind of not domain-specific and not really hand-tuned could do so well. In the next video we'll discuss what it was that made the neural network do so well on this competition, and that really started what we might call the deep learning revolution that's going on right now.