The paper that I will be reviewing today is "Deep Learning" by LeCun, Bengio and Hinton. It was published in Nature in 2015, and is one of the most highly cited papers in the field, with 16,750 citations. The authors of the paper went on to win the Turing Award in 2018.
I did not know any deep learning or machine learning before reading this article. Hence, a lot of my experience of reading the paper involved reading a paragraph, not knowing what the heck was going on, and then watching YouTube videos to break down what was happening. I will be linking those YouTube videos as I go along. Needless to say, I am amazed at the quality of the content related to this field that is available online.
What does deep learning do? Deep learning helps us classify data. Looking at a picture of a dog, it should be able to tell you that it is a dog, and not a pangolin (which might be useful information in these times). But how does it do that? It does this through a neural network, which is essentially a series of layers of “calculators”. Suppose we take a number which has 4 digits. This can obviously be encoded in a 4-tuple. Now we want to know if that number is divisible by 7 or not (let us imagine a world in which we care whether numbers are divisible by 7). What the layers of “calculators”, or neural units as they’re generally called, do is that they take in that 4-tuple, and spit out another 4-tuple. At the end, we will again have a 4-tuple which will look nothing like the original tuple. However, we might have a test saying that if the first component of this final tuple is close to 1, then this number has a high chance of being divisible by 7. This is only a simplistic example (which might have been better performed on an actual calculator), but what neural networks do is that they give us the probability that a certain input lies inside a certain class of objects: like the probability that the 4-digit number that was input lies inside the class of numbers that are divisible by 7.
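To make the “layers of calculators” picture concrete, here is a minimal sketch in Python (using numpy) of what such a stack of layers computes. The layer sizes, the divisor 7 and the weights are all my own made-up choices, and the weights are untrained, so the output is meaningless; the point is only the shape of the computation: a 4-tuple goes in, gets transformed layer by layer, and the first component of the result is read as a probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random, untrained weights; a real network would learn these from examples.
W1 = rng.normal(size=(4, 8))   # first layer of "calculators": 4-tuple -> 8-tuple
W2 = rng.normal(size=(8, 4))   # second layer: 8-tuple -> 4-tuple

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(digits):
    """Push a 4-tuple of digits through two layers of neural units."""
    x = np.array(digits, dtype=float)
    h = np.tanh(x @ W1)        # intermediate 8-tuple
    return sigmoid(h @ W2)     # final 4-tuple, with entries between 0 and 1

out = forward([1, 7, 2, 9])    # the number 1729, encoded as a 4-tuple
print(out)                     # read out[0] as "probability of being divisible by 7"
```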
Supervised Learning and Back Propagation
How do we train a neural network to give us the right outputs? The same way you train a kid to behave around guests. When they scream at guests and break the china, you send them outside and ask them to re-evaluate their behavior. You keep doing that until they start behaving themselves around guests.
A neural network has been displayed below:

Each edge corresponds to certain weights, or punishment techniques that will help kids learn how to behave around strangers. You may for instance try a variety of punishments: offer them ice cream, ask them to go to their room, threaten to take away their toys, or tell them that they will only be allowed to eat in yellow bowls instead of white from now on. After trying combinations of all these punishment techniques, you find out the following: you should take away ice cream instead of offering more (a negative gradient for this weight), children’s behavior improves if you ask them to go to their room, their behavior improves much faster if you threaten to take away their toys, and they really don’t care if they eat in white bowls or yellow bowls. From this analysis, you conclude that you will be best served by a combination of threatening to take away their toys and taking away their ice cream.
Similarly, neural networks might take in the picture of a dog, and tell you that it’s a cat. You’ll then have to adjust the weights associated with the edges such that this error is corrected. Because millions of inputs are used to train neural networks, some inputs might ask us to increase certain weights, whilst others might require us to decrease them. Hence, we take an average desired change in weights over all inputs, and then change weights accordingly so as to decrease the error “on average”.
What is back propagation? It is a fancy way of saying that we work out the corrections to the weights backwards. Refer to the image of the neural network given above. First we work out how the error depends on the weights of the edges connecting the units in Hidden Units H2 to the output layer. Then, using those values, we work out how it depends on the weights of the edges connecting Hidden Units H1 to Hidden Units H2, and so on; all the weights are then nudged so as to minimize the error. An amazing video that explains this much better than I ever could is this.
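Here is a minimal numpy sketch of that “change the weights to decrease the error on average” step, for a single layer only. Everything in it (the tiny batch, the squared-error loss, the learning rate) is my own illustrative choice; full backpropagation chains the same calculation backwards through every layer, as the linked video explains.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny batch of inputs and the outputs we would like the network to produce.
X = rng.normal(size=(5, 4))          # 5 examples, each a 4-tuple
y = rng.integers(0, 2, size=(5, 1))  # 5 desired outputs (0 or 1)

w = rng.normal(size=(4, 1))          # weights of a single-layer network

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(1000):
    pred = sigmoid(X @ w)                              # forward pass for the whole batch
    error = pred - y                                   # how wrong we are on each example
    # Gradient of the mean squared error, averaged over all 5 examples:
    grad = X.T @ (error * pred * (1 - pred)) / len(X)
    w -= 1.0 * grad                                    # nudge the weights against the gradient

print(np.round(sigmoid(X @ w), 2).ravel())             # predictions after training
print(y.ravel())                                       # the targets we were aiming for
```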
Unsupervised Training
But what if you don’t have the time to train your child? Would it still be likely that the child behaves well in front of unknown guests? A kid will not necessarily know good from bad. However, a sufficiently observant child might know that screaming and breaking china evoke certain kinds of reactions from parents and guests, and that laughing and playing with guests evoke other kinds of reactions. This is similar to the clustering algorithms used in unsupervised training. If you feed new raw data without labels into an unsupervised neural network, it may not know the *names* of the classes that the data lies in, but it can plot that data, and determine that the data can be subdivided into certain classes. Look at the illustration below for an example of clustering:
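Alongside the illustration, here is a minimal runnable sketch of clustering, using scikit-learn’s k-means on two made-up blobs of points. Neither the paper nor the video prescribes k-means in particular; it is just one common clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Unlabeled data: two blobs of 2-D points, with no class names attached.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))
data = np.vstack([blob_a, blob_b])

# k-means never learns what the groups *mean*; it only discovers that
# the points split naturally into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster ids, not names
print(kmeans.cluster_centers_)                   # the centers it found
```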
Another feature of unsupervised learning is that instead of classification, it can encode images, and then decode images. What this does is that it removes noise, and preserves only the important details of the input. An illustration is given below:
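Alongside that illustration, here is a runnable stand-in for the encode-then-decode idea: compress noisy data down to a few numbers and reconstruct it, so that noise which doesn’t fit the main directions gets thrown away. I am using PCA from scikit-learn as a linear stand-in for brevity; a real autoencoder is a nonlinear neural network trained to do the same compression and reconstruction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Clean data that really lives in 2 dimensions, embedded in 20 dimensions...
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 20))
clean = latent @ mixing
noisy = clean + 0.3 * rng.normal(size=clean.shape)   # ...plus noise.

# "Encode" each example down to 2 numbers, then "decode" back to 20.
pca = PCA(n_components=2).fit(noisy)
codes = pca.transform(noisy)                 # the compressed representation
reconstructed = pca.inverse_transform(codes)

print("error before:", np.mean((noisy - clean) ** 2).round(3))
print("error after: ", np.mean((reconstructed - clean) ** 2).round(3))
```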
Much like an unsupervised child, an unsupervised neural network can do a lot of things without being trained by its parents. All the above illustrations have been taken from this amazing video.
Convolutional Neural Networks
Suppose that in a more fun parallel universe, you’re asked to locate a 1 ft tall Mickey Mouse statue in a dimly lit room. How would you do that? One way is the following: you look around the room to see if you can spot objects that are approximately 1 ft tall. Objects that are too big or too small are automatically eliminated. Then, amongst the more or less 1 ft tall objects, you look for objects that have two round ears sticking out. By now you must have eliminated everything apart from, possibly, Minnie Mouse. Now you can perhaps concentrate harder on facial features to conclusively locate the Mickey Mouse statue.
Similarly, convolutional networks are neural networks that try to detect features of an input image step by step, in order to determine what it is an image of. In the first hidden layer, the convolutional layer might detect where in the image one may find circles. This information is then input into the second hidden layer, which might detect whiskers, or spots, or perhaps stripes. At the end of this process, perhaps after being processed through many layers of hidden units, the neural network might be able to tell us what the image is of.
The detection of edges etc. happens by breaking the input image into pixels, and constructing a very large matrix containing information about those pixels. Say this pixel matrix contains a 1 whenever the color black is present in a pixel, and a 0 whenever white is present. Gray would perhaps correspond to 0.5. Now construct a much smaller matrix, say a matrix of the form

$$\begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \end{pmatrix}.$$
This matrix corresponds to two (short) rows of pixels, the top row containing the color black, and the bottom row white. Now we take this small matrix, and multiply it with all the sub-matrices of the bigger pixel matrix. This process will help us detect which parts of the image look like one row of black on top of one row of white, which helps us find white edges on a black background. All this has been amazingly explained in this video.
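Here is a minimal numpy sketch of that sliding multiplication on a made-up 6×6 image. The only liberty I am taking is to put −1 (rather than 0) in the bottom row of the small matrix, so that the response is large only where black genuinely sits on top of white; the image, the kernel values and the sizes are all my own choices.

```python
import numpy as np

# Toy 6x6 image: 1 = black pixel, 0 = white pixel.
# The top half is black and the bottom half is white, so there is one
# horizontal black-over-white edge between rows 2 and 3.
image = np.array([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

# Small matrix (kernel): expects black in its top row, white in its bottom row.
kernel = np.array([
    [ 1,  1,  1],
    [-1, -1, -1],
])

kh, kw = kernel.shape
response = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))

# Slide the kernel over every sub-matrix of the image and multiply entrywise.
for i in range(response.shape[0]):
    for j in range(response.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        response[i, j] = np.sum(patch * kernel)

print(response)   # the row of 3s marks where black sits directly above white
```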
Natural Language Processing
When we type “neural” into the Google search bar, how does Google know that we mean “neural networks” (*meta*)? This is because the neural network responsible for making these predictions has been trained on billions of pages of text, which help it “understand” which words make meaningful phrases. I will mostly be referring to this video on natural language processing.
Let us suppose that each word in the English language has been assigned a 2-tuple, in which each entry of the tuple is some number. For instance, it is possible that the word “neural” corresponds to (0.9, 0.4). Now suppose we decide to train the neural network to understand English phrases that are three words long. We feed billions of pages of text into it, and train it to give us a positive score whenever the phrase is meaningful, like “neural networks rock”, and a negative score whenever the phrase is not meaningful, like “Amit Shah rocks”.

But how does a neural network mathematically process a phrase? It takes the three 2-tuples corresponding to the words, glues them together, and now makes a 6-tuple. Assuming that the weights of all the edges in this neural network remain constant, we’re going to change the entries in the 2-tuple of each word such that when the neural network performs mathematical manipulations on the 6-tuples, we get a positive answer for meaningful phrases, and a negative answer for nonsense phrases. So after all this training is done, the word “Amit” might correspond to (-0.6, 0.2), “Shah” might correspond to (-0.7, 0.3), and “rocks” might correspond to (0.7, -0.1). Hence, when the 6-tuple (-0.6, 0.2, -0.7, 0.3, 0.7, -0.1) is plugged into the neural network, we get a negative answer.
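Here is a minimal sketch of that glue-and-score step in numpy, using the made-up 2-number word vectors from above and an equally made-up weight vector; in a real system both would be learned from billions of pages of text.

```python
import numpy as np

# Made-up 2-tuples for a handful of words (learned from text in reality).
embeddings = {
    "neural":   np.array([ 0.9,  0.4]),
    "networks": np.array([ 0.8,  0.5]),
    "rock":     np.array([ 0.7, -0.1]),
    "Amit":     np.array([-0.6,  0.2]),
    "Shah":     np.array([-0.7,  0.3]),
    "rocks":    np.array([ 0.7, -0.1]),
}

# Made-up weights of the scoring network (also learned in practice).
weights = np.array([1.0, 0.5, 1.0, 0.5, 1.0, 0.5])

def score(phrase):
    """Glue the three word vectors into a 6-tuple and score it."""
    glued = np.concatenate([embeddings[word] for word in phrase.split()])
    return float(glued @ weights)

print(score("neural networks rock"))   # positive: a meaningful phrase
print(score("Amit Shah rocks"))        # negative: the network disapproves
```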
But then how does word prediction work? When we type “Amit Shah” into the Google search bar, the neural network tries out the words that would complete the three-word phrase with a positive score. Then it calculates the probability of each of those words completing that phrase, probably depending on the frequency of use in the training data, and offers the words with the highest probabilities as predictions just below the search bar.
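Continuing the sketch, prediction is then just a matter of scoring every candidate completion and turning the scores into probabilities. The tiny vocabulary and the scores below are made up, and the softmax step is my own choice of how to convert scores into probabilities.

```python
import numpy as np

# Made-up scores that a trained phrase-scoring network might assign
# to "Amit Shah <word>" for a tiny candidate vocabulary.
candidate_scores = {
    "said":   2.1,
    "news":   1.7,
    "rocks": -0.4,
    "banana": -3.0,
}

# Softmax: turn the scores into probabilities that sum to 1.
words = list(candidate_scores)
scores = np.array([candidate_scores[w] for w in words])
probs = np.exp(scores) / np.exp(scores).sum()

# Offer the highest-probability completions, like the suggestions
# that appear just below the search bar.
for word, p in sorted(zip(words, probs), key=lambda pair: -pair[1]):
    print(f"{word:8s} {p:.2f}")
```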
Recurrent Neural Networks
Let us now talk about speech recognition. Suppose you say the phrase “Black Lives Matter” (in order to train your neural network to do right by society). Each word that you say has to be separately interpreted by the neural network. This is a task that a recurrent neural network can perform: it takes as input each word that you say, assigns probabilities to the various options corresponding to that input, stores them, and then takes the next input.
For instance, when you say “Black”, the neural network assigns a 99% probability that you said black, and a 1% probability that you said blot, based on the clarity of your speech. This gets stored (in Long Short Term Memory (LSTM) units). Now let us assume that when you say “Lives”, your speech is not clear. Hence, the neural network assigns a 40% probability to that word being Lives, and a 60% probability to that word being Pies. Now when you say “Matter”, your speech is much clearer, and the neural network is 99% sure that you said “Matter”. Now it tries to process the two phrases “Black Lives Matter” and “Black Pies Matter”. The neural network will give a negative score for the latter phrase as it has never seen it before, or perhaps has been negatively trained on it, and a positive score for the former. Hence, the neural network ultimately detects “Black Lives Matter”. We hope that it says an emphatic “Yes” in response.
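Here is a minimal sketch of that last step in plain Python, using the probabilities from the example; the phrase-plausibility scores are made up, and this is only the arithmetic of combining the two sources of evidence, not an actual recurrent network.

```python
from itertools import product

# Per-word probabilities from the (hypothetical) speech model,
# using the numbers from the example above.
word_options = [
    {"Black": 0.99, "Blot": 0.01},
    {"Lives": 0.40, "Pies": 0.60},
    {"Matter": 0.99},
]

# Made-up plausibility scores that a trained language model might assign.
phrase_plausibility = {
    "Black Lives Matter": 0.95,
    "Black Pies Matter":  0.01,
    "Blot Lives Matter":  0.02,
    "Blot Pies Matter":   0.01,
}

best_phrase, best_score = None, 0.0
for combo in product(*word_options):      # every way of picking one word per slot
    phrase = " ".join(combo)
    acoustic = 1.0
    for word, choices in zip(combo, word_options):
        acoustic *= choices[word]         # how sure we are of each word we heard
    total = acoustic * phrase_plausibility.get(phrase, 0.0)
    if total > best_score:
        best_phrase, best_score = phrase, total

print(best_phrase)   # "Black Lives Matter", despite the muffled second word
```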
Future directions
The authors opine that much of the future of deep learning lies in further developing unsupervised learning, and in basing classification algorithms on human vision, which focuses on only a small part of an object and blurs everything else. The authors conclude that combining representation learning with complex reasoning will be most useful in the long run.
Thanks for reading!