So recently you might have heard a lot about deep networks, neural networks, or deep-learning architectures. What are these, and how did they come about? We'll talk about this starting from the beginning of so-called neural network research and trace it to where it has gotten to today.

Modern neural network research can be traced back to the psychologist Donald Hebb in the 1940s. Hebb proposed that networks of simple units following very simple learning rules can learn to understand and model very complicated patterns. The simplest rule he proposed is: if two units are active at the same time, make the connection between them a little bit stronger, and if they're not active at the same time, make it a little bit weaker. This was largely inspired by his ideas about biological neurons and how they might learn patterns, and since then we've actually found that biological neurons do carry out some of these simple rules. The rules Hebb proposed are today called Hebbian learning. They're a little bit different from what's used in supervised learning and play a bigger part in unsupervised learning, so we won't talk about them in depth. But this was the beginning of our conception of neural networks: models of simple units that are connected to each other and that learn by changing the weights between the units.

The modern neural network really arose with the work of Frank Rosenblatt, another psychologist, who in the 1950s invented what he called the perceptron. The perceptron was a computational model of learning, and it was already a supervised architecture, so it could learn to predict patterns that were given to it. Rosenblatt demonstrated in the 50s that he could train the perceptron to recognise simple patterns like letters, and this generated a lot of excitement at the time, because it was pretty unprecedented in AI research.

So how does a perceptron work? A perceptron consists, like I said, of several simple units, which are called neurons or nodes. In this diagram we see the essential layout: two so-called input neurons, which I have labeled x1 and x2. If you remember our previous discussion of classifying something like images of dogs and cats, where we plotted them in two dimensions, those two dimensions would be the numbers fed into x1 and x2. Of course we could have many more inputs, thousands if we wanted, but for conceptual reasons we'll keep it as simple as possible with two input neurons, x1 and x2. These are combined using something called a weighted sum: each input neuron has a weight, the neuron takes a certain value, we multiply that value by the weight, and then we sum the results together. You see that in the blue circle. After that, the weighted sum is passed through something called a non-linearity. A non-linearity is just a nonlinear function, and what the perceptron uses is called a threshold non-linearity: if the weighted sum is below some threshold, output a 0, and if it's above the threshold, output a 1. So it's like a cutoff. The learning algorithm for perceptrons involves adjusting the weights, here weight 1 and weight 2, and actually also the threshold b, so as to learn a certain mapping from the input neurons to 0 or 1 outputs, which could be, for example, cat or dog. From now on I'll represent the weighted sum and the non-linearity as a single combined neuron in a greenish-blue color, and this will show up in later slides.
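To make the weighted sum and threshold concrete, here is a minimal sketch in Python of the forward pass just described. The weight and threshold values are made up purely for illustration; they are not taken from the lecture or learned from any data.

```python
# A minimal sketch of the perceptron forward pass described above.
# The weights and threshold here are made-up illustration values,
# not ones learned from any data set.

def perceptron(x1, x2, w1, w2, b):
    """Weighted sum of the inputs followed by a threshold non-linearity."""
    weighted_sum = w1 * x1 + w2 * x2
    return 1 if weighted_sum >= b else 0   # 1 above the threshold, 0 below

# Example: two input values, like the two dimensions of our cat/dog plot
print(perceptron(x1=0.4, x2=0.9, w1=1.0, w2=1.0, b=1.0))  # -> 1
print(perceptron(x1=0.2, x2=0.3, w1=1.0, w2=1.0, b=1.0))  # -> 0
```

Learning, as described above, then amounts to adjusting w1, w2, and the threshold b until the 0/1 outputs match the labels in the training data.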
Now, if you remember, before I discussed how in supervised learning we have some training data set that has, for example, inputs from two classes (it could be more classes, but to keep it simple, two), and we learn a surface that separates inputs from one class and the other class. For the kind of perceptron that you see here, this surface is a line, and we know it's a line, if you remember a little bit of high school geometry, because a weighted sum essentially defines a line. So in this case we compute a weighted sum and then check whether it's above or below the threshold, and this creates a linear separating surface where everything on one side of the line belongs to class 1, let's say, and everything on the other side belongs to class 0. Using the perceptron training algorithm we can actually learn the kinds of separating surfaces that you see here, where the training data set is represented by pluses for one class and minuses for the other class. Even though the perceptron comes from the 50s, it actually has almost all the ingredients of a modern neural network, and all the research that has come since has built on this basic architecture.

An important development in the field of neural network research happened in 1969, when two AI pioneers, Minsky and Papert, published a book called Perceptrons. They were very interested in the idea that Frank Rosenblatt had proposed, and so they did a lot of mathematical and theoretical analysis of the perceptron. However, one result that they proved essentially killed neural network research for 20 years: they showed that the kind of perceptron we saw in the previous slide could not learn to recognize certain kinds of patterns. You might already have some idea of the kinds of patterns it might not be able to recognize, but again it's easy to see visually. As I said before, for the perceptron the separating surface is always a line. Now, if you provide it with a training data set like the one in the lower right-hand corner of your screen, with members of the plus and minus classes arranged as they are, there's simply no way to separate the pluses and the minuses using a single line. This kind of problem is called a non-linearly separable problem, because the members of the different classes can't be separated by a single line. Since the perceptron can only learn linear separating surfaces, there's no way this kind of perceptron could learn to properly classify pluses and minuses arranged in this way. At the same time, this class of problems, the non-linearly separable ones, occurs in many cases: it occurs whenever elements of one class have either one thing or the other thing but not both, and clearly that happens in many situations, and one would like machine learning to be able to learn patterns like that. So this was a big issue for perceptrons, and people lost interest in them and thought that they couldn't really do many interesting things.
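As a side note, the lecture doesn't spell out the exact update procedure, but the classic perceptron learning rule looks roughly like the sketch below. The tiny data set and learning rate are made up for illustration; the point is that the weights and threshold get nudged whenever the output is wrong, which gradually moves the separating line.

```python
# A rough sketch of the classic perceptron learning rule on a tiny,
# made-up linearly separable data set (labels 0 and 1).

data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]

w1, w2, b = 0.0, 0.0, 0.0           # weights and threshold, all learned
lr = 0.1                            # learning rate

for epoch in range(20):
    for (x1, x2), target in data:
        output = 1 if w1 * x1 + w2 * x2 >= b else 0
        error = target - output     # -1, 0, or +1
        w1 += lr * error * x1       # nudge the weights toward the target
        w2 += lr * error * x2
        b  -= lr * error            # nudge the threshold as well

print(w1, w2, b)                    # defines a line separating the two classes
# On a non-linearly separable arrangement of pluses and minuses (the case
# Minsky and Papert analysed), this loop never settles, because no single
# line can separate the two classes.
```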
The situation changed dramatically in the mid-1980s, when the cognitive scientists Rumelhart and McClelland published a book called Parallel Distributed Processing. In the previous case, the simplest form of the perceptron, we have two inputs (or some number of inputs) that go through a weighted sum and a non-linearity, and that's the output. What Parallel Distributed Processing discussed were perceptrons, or more broadly neural networks, with many nested layers. So there's a set of input neurons; these get summed and passed through a non-linearity, but there are multiple of these sums and non-linearities, and these then serve as the inputs for the next layer, so their outputs are themselves summed and passed through another non-linearity, and so on. In this diagram you see a multi-layer neural network where the inputs go to two different sums and non-linearities, and those then go to yet a further sum and non-linearity. We call the sums and non-linearities in the middle, in this case, a hidden layer. What Parallel Distributed Processing showed is that you could design such multi-layer neural networks, that there is a very computationally efficient learning rule that could train the weights of these multi-layer networks, and that these multi-layer networks could learn patterns like non-linearly separable problems. In fact, there are results showing that with enough hidden units and hidden layers they could learn essentially any function at all.

Here we will see how multi-layer neural networks can actually solve something like a non-linearly separable problem, which was such an issue for the single-layer perceptron. We know that each of the individual weighted sums and non-linearities essentially defines a linear separating surface, so we can think of each of the weighted sums and non-linearities in the hidden layer as setting up its own linear separating surface. Here, the top one, for example, will say that class 1 is everything above and to the left of the red line, and the bottom one will say that class 1 is everything below and to the right of the shifted red line. Now, interestingly, we can represent an intersection as a weighted sum passed through a non-linearity: imagine setting both of the incoming weights to 1 and then saying, if the weighted sum is less than 2, output a 0, and if it's 2 or greater, output a 1. That means the output neuron will only turn on if both of the input neurons are turned on, so essentially it takes the intersection of the class-1 regions of both of the hidden units, and in this case that's exactly what's needed to solve the non-linearly separable problem and separate the minuses from the pluses in the example we saw before.
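To make that intersection construction concrete, here is a hand-built two-layer network in the same spirit: two hidden threshold units each define a line, and the output unit (both incoming weights set to 1, threshold 2) fires only when both hidden units fire, i.e. it takes the intersection of their class-1 regions. The specific weight values below are illustrative choices for the classic XOR-style arrangement of points, not numbers taken from the lecture's diagram.

```python
# A hand-built two-layer network of threshold units that solves an
# XOR-style non-linearly separable problem via the intersection trick.

def threshold_unit(inputs, weights, b):
    """Weighted sum followed by the threshold non-linearity."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= b else 0

def two_layer_net(x1, x2):
    h1 = threshold_unit((x1, x2), ( 1.0,  1.0),  0.5)  # above the lower line
    h2 = threshold_unit((x1, x2), (-1.0, -1.0), -1.5)  # below the upper line
    # Output unit: both incoming weights 1, threshold 2 -> logical AND,
    # i.e. the intersection of the two hidden units' class-1 regions.
    return threshold_unit((h1, h2), (1.0, 1.0), 2.0)

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, two_layer_net(*point))
# (0,1) and (1,0) land in the band between the two lines -> class 1,
# while (0,0) and (1,1) fall outside it -> class 0: exactly the kind of
# pattern a single perceptron cannot learn.
```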
Now, as I mentioned, the original perceptron used something called a threshold non-linearity, which flips from a 0 to a 1 as soon as the weighted-sum input passes a certain threshold. In modern neural network algorithms, including the ones used starting from the 80s, we use a differentiable non-linearity, meaning a smooth function whose derivative we can take, and we minimize the training error by essentially using the derivatives of the functions that are transforming our signals. This sounds complicated, but it's pretty easy to think about visually: if we think of some function, such as the training error, defined over the values of the weights on the connections in our network, then the training algorithm essentially tries to roll down the hill, changing the weights so as to minimize the training error, and because we have the derivatives, we know which way to roll. This set of algorithms, which broadly follow the derivatives downhill so as to minimize the training error little by little, are called gradient descent algorithms. You might also hear very often in modern machine learning terms like stochastic gradient descent (SGD), which is a small variant of this basic idea. This was very successful for training multi-layer neural networks. I should also add that one reason it was successful was a certain trick that was discovered for doing gradient descent in neural networks: just computing which way you should change the weights so as to minimize training error for a large neural network is actually a very computationally difficult problem. You might also hear the term backpropagation, or backprop, which is essentially a very computationally quick way to do gradient descent; it became widely used in the 1980s and made neural networks practical to train.

Now, starting from the mid-80s and through 2010 or so, neural networks generated a lot of excitement among psychologists and cognitive scientists. They actually seemed to be quite good models of human perceptual performance and of various kinds of behavior that people show in psychological tasks. However, for actual machine learning applications they just weren't very good: they weren't the state of the art, and other algorithms tended to perform better than neural networks did at applied tasks like recognizing whether an image shows a cat or a dog. Because of this there was a kind of winter of applied neural network research that lasted for almost two decades, or maybe even a little bit more, during which people did not take neural networks very seriously as state-of-the-art machine learning algorithms.

This changed dramatically in the early 2010s, and in particular there was a dramatic explosion of interest in neural networks in 2012. I should add that, on a yearly basis, there are many competitions in the machine learning academic community where different groups try to crack machine learning problems and compete based on how well their algorithms do. There has been one competition in particular that involved classifying images according to what's shown in them. We talked earlier about classifying images as either a dog or a cat; this competition, which is called ImageNet, has classes, as you can see on this slide, like leopard, mushroom, mite, and all kinds of other things. So it's a much harder task: it's not just cat or dog, it has on the order of a thousand classes, and the goal is to predict which class an image belongs to. There had been some improvement in this task year on year, maybe the best performer improving by a percent or two, but in 2012 something very dramatic happened: for the first time, a neural network algorithm won first place in the competition, and it improved dramatically over every non-neural-network entry. In particular, the neural network entry that won first place beat the next best entry by more than 10 percentage points of error: the next best entry got something like 25 percent error, whereas the neural network got about 15 percent. The next best entry used hand-coded features aimed at capturing important aspects of images and visual recognition, hand-tuned algorithms that had been in development for many, many years by some really smart people, but the neural network essentially started from scratch and learned to beat it by a lot. This really shocked people: a neural network that was not domain-specific and not really hand-tuned could do so well. In the next video we'll discuss what it was that made the neural network do so well on this competition and really started what we might call the deep learning revolution that's going on right now.
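To close out this section, here is a toy sketch of the gradient descent idea described above: a single neuron with a differentiable (sigmoid) non-linearity, trained by repeatedly nudging its weights downhill on the training error. The data set, learning rate, and squared-error loss are assumptions made purely for illustration; this is the general "roll down the hill" idea, not the algorithm used in the ImageNet entry.

```python
# A toy illustration of gradient descent with a differentiable non-linearity:
# one sigmoid neuron, squared training error, made-up data and learning rate.

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

w1, w2, b = 0.0, 0.0, 0.0
lr = 0.5

for step in range(2000):
    for (x1, x2), target in data:            # one example at a time ("stochastic")
        y = sigmoid(w1 * x1 + w2 * x2 + b)   # forward pass
        # derivative of 0.5 * (y - target)**2 with respect to the pre-activation
        grad = (y - target) * y * (1 - y)
        w1 -= lr * grad * x1                 # roll downhill on the training error
        w2 -= lr * grad * x2
        b  -= lr * grad

print(round(w1, 2), round(w2, 2), round(b, 2))
```

Backpropagation extends this same derivative computation efficiently through many layers of weights, which is what made multi-layer networks practical to train from the 1980s onward.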