So recently you might have heard a lot about deep networks, neural networks, or deep-learning architectures. What are these, and how did they come about? We'll talk about this starting from the beginning of so-called neural network research and trace it to where it has gotten to today.

Modern neural network research can be traced back to the psychologist Donald Hebb in the 1940s. Hebb proposed that networks of simple units following very simple learning rules can learn to understand and model very complicated patterns. The simplest rule he proposed is: if two units are active at the same time, make the connection between them a little bit stronger, and if they're not active at the same time, make it a little bit weaker. This was largely inspired by his ideas about biological neurons and how they might learn patterns, and since then we've actually found that biological neurons do carry out some of these simple rules. The rules Hebb proposed are today called Hebbian learning. They're a little bit different from what's used in supervised learning and play a bigger part in unsupervised learning, so we won't talk about them in depth. But this was the beginning of our conception of neural networks: models of simple units that are connected to each other and that learn by changing the weights between the units.

The modern neural network really arose with the work of Frank Rosenblatt, another psychologist, who in the 1950s invented what he called the perceptron. The perceptron was a computational model of learning, and it was already a supervised architecture, so it could learn to predict patterns that were given to it. Rosenblatt demonstrated in the 50s that he could train the perceptron to recognise simple patterns like letters, and this generated a lot of excitement at the time, because it was pretty unprecedented in AI research.

So how does a perceptron work? A perceptron consists, like I said, of several simple units, which are called neurons or nodes. In this diagram we see the essential layout: two so-called input neurons, which I have labeled x1 and x2. If you remember our previous discussion of classifying something like images of dogs and cats, where we plotted them in two dimensions, those two dimensions would be the numbers fed into x1 and x2. Of course we could have many more inputs, thousands if we wanted, but for conceptual reasons we'll keep it as simple as possible with two input neurons, x1 and x2. These are combined using something called a weighted sum: each input neuron has a weight, the neuron takes a certain value, we multiply that value by the weight, and then we sum the results together. You see that in the blue circle. After that, the weighted sum is passed through something called a non-linearity. A non-linearity is just a nonlinear function, and what the perceptron uses is called a threshold non-linearity: if the weighted sum is below some threshold, output a 0, and if it's above the threshold, output a 1. So it's like a cutoff. The learning algorithm for perceptrons involves adjusting the weights, here weight 1 and weight 2, and actually also the threshold b, so as to learn a certain mapping from the input neurons to 0 or 1 outputs, which could be, for example, cat or dog. From now on I'll represent the weighted sum and the non-linearity as a single combined neuron in a greenish-blue color, and this will show up in later slides.
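To make the weighted sum and threshold concrete, here is a minimal sketch in Python of the forward pass just described. The weight and threshold values are made up purely for illustration; they are not taken from the lecture or learned from any data.

```python
# A minimal sketch of the perceptron forward pass described above.
# The weights and threshold here are made-up illustration values,
# not ones learned from any data set.

def perceptron(x1, x2, w1, w2, b):
    """Weighted sum of the inputs followed by a threshold non-linearity."""
    weighted_sum = w1 * x1 + w2 * x2
    return 1 if weighted_sum >= b else 0   # 1 above the threshold, 0 below

# Example: two input values, like the two dimensions of our cat/dog plot
print(perceptron(x1=0.4, x2=0.9, w1=1.0, w2=1.0, b=1.0))  # -> 1
print(perceptron(x1=0.2, x2=0.3, w1=1.0, w2=1.0, b=1.0))  # -> 0
```

Learning, as described above, then amounts to adjusting w1, w2, and the threshold b until the 0/1 outputs match the labels in the training data.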
Now, if you remember, before I discussed how in supervised learning we have some training data set that has, for example, inputs from two classes (it could be more classes, but to keep it simple, two), and we learn a surface that separates inputs from one class and the other class. For the kind of perceptron that you see here, this surface is a line, and we know it's a line, if you remember a little bit of high school geometry, because a weighted sum essentially defines a line. So in this case we compute a weighted sum and then check whether it's above or below the threshold, and this creates a linear separating surface where everything on one side of the line belongs to class 1, let's say, and everything on the other side belongs to class 0. Using the perceptron training algorithm we can actually learn the kinds of separating surfaces that you see here, where the training data set is represented by pluses for one class and minuses for the other class. Even though the perceptron comes from the 50s, it actually has almost all the ingredients of a modern neural network, and all the research that has come since has built on this basic architecture.

An important development in the field of neural network research happened in 1969, when two AI pioneers, Minsky and Papert, published a book called Perceptrons. They were very interested in the idea that Frank Rosenblatt had proposed, and so they did a lot of mathematical and theoretical analysis of the perceptron. However, one result that they proved essentially killed neural network research for 20 years: they showed that the kind of perceptron we saw in the previous slide could not learn to recognize certain kinds of patterns. You might already have some idea of the kinds of patterns it might not be able to recognize, but again it's easy to see visually. As I said before, for the perceptron the separating surface is always a line. Now, if you provide it with a training data set like the one in the lower right-hand corner of your screen, with members of the plus and minus classes arranged as they are, there's simply no way to separate the pluses and the minuses using a single line. This kind of problem is called a non-linearly separable problem, because the members of the different classes can't be separated by a single line. Since the perceptron can only learn linear separating surfaces, there's no way this kind of perceptron could learn to properly classify pluses and minuses arranged in this way. At the same time, this class of problems, the non-linearly separable ones, occurs in many cases: it occurs whenever elements of one class have either one thing or the other thing but not both, and clearly that happens in many situations, and one would like machine learning to be able to learn patterns like that. So this was a big issue for perceptrons, and people lost interest in them and thought that they couldn't really do many interesting things.
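As a side note, the lecture doesn't spell out the exact update procedure, but the classic perceptron learning rule looks roughly like the sketch below. The tiny data set and learning rate are made up for illustration; the point is that the weights and threshold get nudged whenever the output is wrong, which gradually moves the separating line.

```python
# A rough sketch of the classic perceptron learning rule on a tiny,
# made-up linearly separable data set (labels 0 and 1).

data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]

w1, w2, b = 0.0, 0.0, 0.0           # weights and threshold, all learned
lr = 0.1                            # learning rate

for epoch in range(20):
    for (x1, x2), target in data:
        output = 1 if w1 * x1 + w2 * x2 >= b else 0
        error = target - output     # -1, 0, or +1
        w1 += lr * error * x1       # nudge the weights toward the target
        w2 += lr * error * x2
        b  -= lr * error            # nudge the threshold as well

print(w1, w2, b)                    # defines a line separating the two classes
# On a non-linearly separable arrangement of pluses and minuses (the case
# Minsky and Papert analysed), this loop never settles, because no single
# line can separate the two classes.
```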
The situation changed dramatically in the mid-1980s, when the cognitive scientists Rumelhart and McClelland published a book called Parallel Distributed Processing. In the previous case, the simplest form of the perceptron, we have two inputs (or some number of inputs) that go through a weighted sum and a non-linearity, and that's the output. What Parallel Distributed Processing discussed were perceptrons, or more broadly neural networks, with many nested layers. So there's a set of input neurons; these get summed and passed through a non-linearity, but there are multiple of these sums and non-linearities, and these then serve as the inputs for the next layer, so their outputs are themselves summed and passed through another non-linearity, and so on. In this diagram you see a multi-layer neural network where the inputs go to two different sums and non-linearities, and those then go to yet a further sum and non-linearity. We call the sums and non-linearities in the middle, in this case, a hidden layer. What Parallel Distributed Processing showed is that you could design such multi-layer neural networks, that there is a very computationally efficient learning rule that could train the weights of these multi-layer networks, and that these multi-layer networks could learn patterns like non-linearly separable problems. In fact, there are results showing that with enough hidden units and hidden layers they could learn essentially any function at all.

Here we will see how multi-layer neural networks can actually solve something like a non-linearly separable problem, which was such an issue for the single-layer perceptron. We know that each of the individual weighted sums and non-linearities essentially defines a linear separating surface, so we can think of each of the weighted sums and non-linearities in the hidden layer as setting up its own linear separating surface. Here, the top one, for example, will say that class 1 is everything above and to the left of the red line, and the bottom one will say that class 1 is everything below and to the right of the shifted red line. Now, interestingly, we can represent an intersection as a weighted sum passed through a non-linearity: imagine setting both of the incoming weights to 1 and then saying, if the weighted sum is less than 2, output a 0, and if it's 2 or greater, output a 1. That means the output neuron will only turn on if both of the input neurons are turned on, so essentially it takes the intersection of the class-1 regions of both of the hidden units, and in this case that's exactly what's needed to solve the non-linearly separable problem and separate the minuses from the pluses in the example we saw before.
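To make that intersection construction concrete, here is a hand-built two-layer network in the same spirit: two hidden threshold units each define a line, and the output unit (both incoming weights set to 1, threshold 2) fires only when both hidden units fire, i.e. it takes the intersection of their class-1 regions. The specific weight values below are illustrative choices for the classic XOR-style arrangement of points, not numbers taken from the lecture's diagram.

```python
# A hand-built two-layer network of threshold units that solves an
# XOR-style non-linearly separable problem via the intersection trick.

def threshold_unit(inputs, weights, b):
    """Weighted sum followed by the threshold non-linearity."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= b else 0

def two_layer_net(x1, x2):
    h1 = threshold_unit((x1, x2), ( 1.0,  1.0),  0.5)  # above the lower line
    h2 = threshold_unit((x1, x2), (-1.0, -1.0), -1.5)  # below the upper line
    # Output unit: both incoming weights 1, threshold 2 -> logical AND,
    # i.e. the intersection of the two hidden units' class-1 regions.
    return threshold_unit((h1, h2), (1.0, 1.0), 2.0)

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, two_layer_net(*point))
# (0,1) and (1,0) land in the band between the two lines -> class 1,
# while (0,0) and (1,1) fall outside it -> class 0: exactly the kind of
# pattern a single perceptron cannot learn.
```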
Now, as I mentioned, the original perceptron used something called a threshold non-linearity, which flips from a 0 to a 1 as soon as the weighted-sum input passes a certain threshold. In modern neural network algorithms, including the ones used starting from the 80s, we use a differentiable non-linearity, meaning a smooth function whose derivative we can take, and we minimize the training error by essentially using the derivatives of the functions that are transforming our signals. This sounds complicated, but it's pretty easy to think about visually: if we think of some function, such as the training error, defined over the values of the weights on the connections in our network, then the training algorithm essentially tries to roll down the hill, changing the weights so as to minimize the training error, and because we have the derivatives, we know which way to roll. This set of algorithms, which broadly follow the derivatives downhill so as to minimize the training error little by little, are called gradient descent algorithms. You might also hear very often in modern machine learning terms like stochastic gradient descent (SGD), which is a small variant of this basic idea. This was very successful for training multi-layer neural networks. I should also add that one reason it was successful was a certain trick that was discovered for doing gradient descent in neural networks: just computing which way you should change the weights so as to minimize training error for a large neural network is actually a very computationally difficult problem. You might also hear the term backpropagation, or backprop, which is essentially a very computationally quick way to do gradient descent; it became widely used in the 1980s and made neural networks practical to train.

Now, starting from the mid-80s and through 2010 or so, neural networks generated a lot of excitement among psychologists and cognitive scientists. They actually seemed to be quite good models of human perceptual performance and of various kinds of behavior that people show in psychological tasks. However, for actual machine learning applications they just weren't very good: they weren't the state of the art, and other algorithms tended to perform better than neural networks did at applied tasks like recognizing whether an image shows a cat or a dog. Because of this there was a kind of winter of applied neural network research that lasted for almost two decades, or maybe even a little bit more, during which people did not take neural networks very seriously as state-of-the-art machine learning algorithms.

This changed dramatically in the early 2010s, and in particular there was a dramatic explosion of interest in neural networks in 2012. I should add that, on a yearly basis, there are many competitions in the machine learning academic community where different groups try to crack machine learning problems and compete based on how well their algorithms do. There has been one competition in particular that involved classifying images according to what's shown in them. We talked earlier about classifying images as either a dog or a cat; this competition, which is called ImageNet, has classes, as you can see on this slide, like leopard, mushroom, mite, and all kinds of other things. So it's a much harder task: it's not just cat or dog, it has on the order of a thousand classes, and the goal is to predict which class an image belongs to. There had been some improvement in this task year on year, maybe the best performer improving by a percent or two, but in 2012 something very dramatic happened: for the first time, a neural network algorithm won first place in the competition, and it improved dramatically over every non-neural-network entry. In particular, the neural network entry that won first place beat the next best entry by more than 10 percentage points of error: the next best entry got something like 25 percent error, whereas the neural network got about 15 percent. The next best entry used hand-coded features aimed at capturing important aspects of images and visual recognition, hand-tuned algorithms that had been in development for many, many years by some really smart people, but the neural network essentially started from scratch and learned to beat it by a lot. This really shocked people: a neural network that was not domain-specific and not really hand-tuned could do so well. In the next video we'll discuss what it was that made the neural network do so well on this competition and really started what we might call the deep learning revolution that's going on right now.
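To close out this section, here is a toy sketch of the gradient descent idea described above: a single neuron with a differentiable (sigmoid) non-linearity, trained by repeatedly nudging its weights downhill on the training error. The data set, learning rate, and squared-error loss are assumptions made purely for illustration; this is the general "roll down the hill" idea, not the algorithm used in the ImageNet entry.

```python
# A toy illustration of gradient descent with a differentiable non-linearity:
# one sigmoid neuron, squared training error, made-up data and learning rate.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5

for step in range(2000):
    for (x1, x2), target in data:            # one example at a time ("stochastic")
        y = sigmoid(w1 * x1 + w2 * x2 + b)   # forward pass
        # derivative of 0.5 * (y - target)**2 with respect to the pre-activation
        grad = (y - target) * y * (1 - y)
        w1 -= lr * grad * x1                 # roll downhill on the training error
        w2 -= lr * grad * x2
        b  -= lr * grad

print(round(w1, 2), round(w2, 2), round(b, 2))
```

Backpropagation extends this same derivative computation efficiently through many layers of weights, which is what made multi-layer networks practical to train from the 1980s onward.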