In the last video, we briefly talked about the history of neural network research, and we discussed how in the early 2010s neural network architectures started to win machine learning competitions. In particular, there was the ImageNet competition, which involved classifying images, in which the neural network entry did significantly better than any non-neural-network entry. And, interestingly, this was the case even though the non-neural-network entries used hand-coded features and hand-tuned algorithms. This reminds us a little bit of the development of machine learning in go-playing, where AlphaGo actually succeeded and improved by reducing the impact of hand-coded features and replacing them with features learned entirely automatically from large amounts of data. Now, what is it that made the neural network entry in ImageNet so successful? Today, we would call the kind of architecture that was entered into ImageNet and won it a deep neural network or a deep learning architecture. And, even though this is a bit of a buzzword, there are a few typical characteristics that define such neural networks and distinguish them from previous uses of neural networks. So, what happened? What was different? First, the very design and architecture of the neural network was deeper -- and more structured. And I'll define what I mean by both of those terms in a minute. Second, there was a huge data set provided. The data sets used in these competitions grew bigger and bigger every year, and it turns out that neural networks seem to do really well when there's a lot of data provided. Lastly, the team that entered this neural network used graphics cards, which were developed essentially for faster gaming and better graphics in games, but which also do an operation called matrix multiplication that's used extensively in neural networks, and they used this to train the neural network several orders of magnitude faster than would be possible with regular processors. 
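To make the matrix-multiplication point concrete: a single layer of a neural network is, at its core, one matrix multiply (inputs times weights) followed by a simple element-wise operation, and that multiply is exactly what graphics cards accelerate. Here's a minimal sketch in NumPy; the layer sizes, weights, and biases are made up purely for illustration:

```python
import numpy as np

# A batch of 4 input examples, each with 3 features.
x = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 2.0, 0.0],
              [0.5, 0.5, 0.5]])

# Weight matrix for one layer: 3 inputs -> 2 output neurons,
# plus one bias per output neuron.
W = np.array([[ 0.1, -0.2],
              [ 0.4,  0.3],
              [-0.5,  0.2]])
b = np.array([0.05, -0.1])

# The core operation of a layer: a matrix multiplication.
# This is the step that GPUs speed up by orders of magnitude.
z = x @ W + b              # shape (4, 2): weighted sums for every example
a = np.maximum(z, 0.0)     # a non-linearity applied element-wise

print(a.shape)  # (4, 2)
```

Training repeats this kind of multiply (and its gradient counterpart) billions of times, which is why moving it onto graphics hardware made such a difference.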
So this allowed them to train the network for a long time on large amounts of data, and it turns out that just having more data and more training makes a big difference with neural networks. So let me now go back to my first point, that these deep networks are deeper, obviously, and more structured. Remember, I talked about multi-layer neural networks where, instead of just having an input that goes directly to the final output, there are intermediate weighted sums and non-linearities. And each layer of those weighted sums and non-linearities, in between the input and the final output, is called a hidden layer. In deep neural networks, there are typically many, many hidden layers -- many more than were used even in the 80s or 90s, when you might have had one or two hidden layers in a typical neural network. Nowadays, a few dozen or easily hundreds of hidden layers are possible in modern deep architectures. Now, I should note two things about this. Prior to the rise of deep architectures, training networks with many, many hidden layers ran into various kinds of technical difficulties. However, by tweaking the non-linear function, it turned out that it actually is possible to train neural networks with many hidden layers and resolve some of these computational difficulties. The other thing I'd like to note is that we don't have a very good sense of why having more hidden layers helps improve performance. We have some ideas. Generally, consider neural networks with many hidden layers. For example, I've diagrammed a prototypical example: a neural network that takes in images, so it takes in pixel information, and outputs, for example, the name of a person; you might see this being used on Facebook, for example, when it recognizes your friends in pictures. 
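The hidden-layer idea can be sketched in a few lines of code: each hidden layer is a weighted sum followed by a non-linearity, and "deeper" just means stacking more of them. The sizes and random weights below are illustrative, and the choice of ReLU as the non-linearity is one common example of the kind of tweak that helped make very deep networks trainable:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # The non-linearity applied after each hidden layer's weighted sum.
    return np.maximum(z, 0.0)

def forward(x, layers):
    """Run input x through a stack of (W, b) layers,
    applying a non-linearity after every hidden layer."""
    a = x
    for W, b in layers[:-1]:     # hidden layers
        a = relu(a @ W + b)
    W, b = layers[-1]            # final output layer, left linear here
    return a @ W + b

# A small "deep" network: 8 inputs -> five hidden layers of 16 -> 3 outputs.
sizes = [8, 16, 16, 16, 16, 16, 3]
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((4, 8))  # a batch of 4 example inputs
y = forward(x, layers)
print(y.shape)  # (4, 3)
```

A real image network is vastly larger and trained rather than random, but the structure -- layer upon layer of weighted sums and non-linearities -- is the same.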
In a network like this, where there are many, many hidden layers, if we look at the kinds of patterns that seem to activate the neurons in the different layers, they seem to become, in some sense, more and more abstract and conceptual. So, at the earliest layers, what really turns the neurons on are things like edges or high-contrast spots. At intermediate layers, the neurons seem to be activated by things like noses, ears, mouths -- parts of the face. And towards the final layers, it almost seems that the neurons are responding to what might be called prototypical faces, or some kind of underlying variation in the types of faces and expressions that people have. So we see that by adding more layers, it might be that we're able to capture higher-level and more abstract concepts that are then recombined in useful ways. So essentially, we can think of deep neural networks as encoding an assumption that the kinds of data we're interested in are frequently hierarchical: they have many scales, and they reuse some of the lower-scale components in various ways at the higher scales. The other thing that was different between more recent deep network architectures and more traditional approaches to neural networks was that many of the deep learning architectures have a lot more structure. So here on the screen, on the left-hand side, you see a more traditional neural network. Even though it has many hidden layers, essentially all the neurons in one layer are connected to all the neurons in the next layer. By contrast, for the winning neural network in ImageNet that we discussed previously, we show the topology of that network on the right-hand side using a kind of block diagram, where each of the cubes represents a whole group of neurons. Here, you can see that there's a lot of structure: there are two streams, the sizes of the blocks change, some of them are densely interconnected, and some are not interconnected at all. 
So there's a lot of knowledge and design put into how the neurons are interconnected with each other. I should also add that, especially for image tasks including ImageNet, what's often used are so-called convolutional layers. Convolutional layers have very structured, repetitive weight patterns, and so they impose a kind of constraint on the neurons and a certain kind of structure on the connectivity pattern that's possible for the neural network. So we see that, unlike more traditional neural networks, deep nets are often very structured. They don't just have everything connected to everything else, as was assumed to be acceptable before. As I mentioned, designing such architectures requires quite a bit of domain knowledge, and it's actually more of an art than a science. People don't really understand why it works, but it seems to make a big difference to the performance of the neural networks. But interestingly, there's been some recent work showing that we can actually train machine learning algorithms to themselves design the topology of the neural networks, which are then trained on big data sets. And this is very interesting because it's a kind of meta-learning, or meta-design: machine learning algorithms designing other machine learning algorithms, and doing just as well as, or even better than, people can. So, probably, this is the beginning of the singularity. Now, given this recipe that I mentioned -- large amounts of data, lots of computing power and training on graphics processors, and structured architectures for the connectivity between neurons -- deep networks are coming to dominate almost all domains of machine learning, or at least many, many of them. We already talked about image recognition: classifying images according to the objects inside of them. Next, voice recognition. Many people noticed that, for example, Siri on the iPhone or the voice recognition on Android got much, much better all of a sudden. 
These systems could suddenly recognize what people were saying with very high accuracy. A lot of this was due to deep neural networks being used in this application. Translation is another example. Machine translation -- translating from one human language to another -- is traditionally an extremely difficult task for artificial intelligence, and it was thought that statistical models like neural networks, and many other kinds of machine learning algorithms, would never really do very well at such tasks because language is too structured: there's too much syntax and there are too many rules to follow. It turns out that, given enough data, deep neural networks actually do great at this task. And, if you've used Google Translate, they moved from a system that used hand-designed features, designed over many decades by linguists, to essentially training a huge, deep neural net on large bodies of text from the internet, and it does better at translating than the hand-designed algorithms. And finally, we already mentioned things like video games and board games being treated as supervised learning problems. Brendan talked about the development of go-playing algorithms, and, actually, a big chunk of the machine learning that was used in AlphaGo was a deep neural net. And so, in combination with other techniques, we saw in AlphaGo that deep neural nets actually solved an AI task that was thought to be intractable for many, many years. And there are many other examples of deep learning doing very well at tasks that were thought to be very difficult.