“Machine intelligence is the last invention that humanity will ever need to make” – Nick Bostrom
Deep learning has greatly changed the landscape of machine learning and artificial intelligence in the last ten years. Professors Yoshua Bengio, Geoffrey Hinton, and Yann LeCun, pioneers of deep learning, received the prestigious 2018 ACM A.M. Turing Award for “conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”. This chapter reviews the fundamentals of deep learning. Other chapters will cover its applications in computer vision and natural language processing. Deep learning is covered in great detail in (Goodfellow et al., 2016) and in the documentation of TensorFlow and PyTorch.
The work on artificial neurons started in the 1930s and 1940s. In 1943, McCulloch and Pitts proposed that “neural events and the relations among them can be treated by means of propositional logic. It is found that the behavior of every net can be described in these terms.” In 1958, Frank Rosenblatt, a researcher at the Cornell Aeronautical Laboratory funded by the Office of Naval Research, invented the Perceptron to perform image recognition using photocells and a one-layer neural network. It could, however, perform only rudimentary image classification tasks.
Figure 1. Description of Mark I Perceptron.
Deep learning is a branch of machine learning that uses layers of functions, described as neurons, linking inputs to outputs. The inputs form an input layer, which can be a numerical value, a vector, a matrix, or a multidimensional array (a tensor). The input can represent a picture, a video frame, some text, a sound wave, or any data collected by a sensor. Each function acts as a neuron with inputs and outputs. The function can be linear or nonlinear. When it is nonlinear, it works as an activation function: its output is very small when the combined inputs are sufficiently small and increases in value when the combined inputs are sufficiently large. The outputs in turn form an output layer. Between the input and the output layers, there can be several hidden layers (Figure 2).
Figure 2. A neural network with an input layer, a hidden layer, and an output layer.
The inspiration for the artificial neuron is the human neuron (Figure 3). A human neuron has a cell body called the soma, receives nerve signals through its dendrites, and sends an output signal through its axon to other neurons or to other cells such as muscle cells. The axon connects to the dendrite of another neuron and forms a synapse. The signal can be electrical, carried by moving ions, or chemical, carried by neurotransmitters. Each neuron is in contact with on the order of a thousand or more other neurons. It is estimated that there are around 86 billion neurons in the brain. In addition to neurons, there are glial cells, which play important roles in supporting the neurons. They were long said to outnumber the neurons by a factor of ten, although more recent cell counts suggest a ratio closer to one to one.
Figure 3. Neuron
By BruceBlaus – Own work, CC BY 3.0, https://commons.wikimedia.org/w/index.php?curid=28761830
The human brain appears to be vastly more complicated than a neural network. This should not be a concern, as there are many examples of artificial technologies playing the same role as natural ones: the fixed wings and jet engines of a plane replacing the wings and muscles of a bird, or the combustion engine replacing carriage horses.
To be fair, the artificial neural network does not run in a vacuum. If we include the software, hardware, and power required to run the neural network (brain cells have to generate their own power), the complexity can be comparable. For instance, the Wafer Scale Engine 2 by Cerebras, a deep-learning integrated circuit announced in 2021, has 2.6 trillion transistors. A graphics processing unit (GPU) can have more than 20 billion transistors, and it is not uncommon to run hundreds of GPUs in parallel to train some deep neural networks.
Feedforward Neural Network
Figure 4 shows a feedforward neural network. It is feedforward because the information flows from the inputs to the outputs in only one direction, forward. Some other neural networks such as recurrent neural networks allow loops, with information moving backward.
The network can be described as a sequence of layers: an input layer, hidden layers that each have a number of inputs and outputs and apply an activation function, and an output layer. Each layer holds a state made of learnable parameters such as weights and biases and performs some computation, such as multiplying the inputs by the weights and adding the biases.
Figure 4. Neural network defined as a sequence of layers
The layers can be of different types:
The input layer is a tensor object with an indication of the input shape, e.g. (n,), and the batch size m. Each observation is an n-dimensional vector and the model takes m observations at a time.
The dense layer takes the outputs of the previous layer as inputs, multiplies them by some weights, adds some bias terms, and transforms the result through an activation function. The activation function is typically a ReLU (rectified linear unit), which returns the maximum of its input value and zero, max(value, 0) (Figure 5). Another popular activation for classification problems is the softmax (Figure 6). In the softmax, outputs are converted to probabilities between 0 and 1 by taking the exponential of their values and normalizing them so that they add up to 1.
Figure 5. ReLU activation
Figure 6. Softmax activation
The activation layer transforms input values with some functions similar to the ones used in the dense layer or more complex functions.
The embedding layer transforms the input values into vector representations. This is commonly used in natural language processing (word embedding), where indexed words are converted to vector representations such as Word2Vec (Mikolov et al., 2013). Words with similar meanings will tend to be close in the vector space, and relationships between words will tend to be preserved in that space. Closeness is measured by a distance, for instance the cosine distance.
A masking layer discards certain input values, for instance because they are missing. Missing values could be coded as 0, in which case the mask value is set to 0.
A lambda layer allows arbitrary calculations on previous layers. It works as an activation layer but is more general as it can for instance make calculations with multiple input layers.
A subclassed layer extends an existing layer class and adds new states and computation methods. For instance, several input layers can be combined and passed through a new computation to produce new outputs.
The successive transformations from the input layer through each layer up to the output layer form the forward propagation of the neural network. If all the network parameters are known, the propagation gives the model outputs. If the model needs to fit some output data, as in supervised learning, the parameters need to be learned with backward propagation.
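As an illustration of the dense layer, the ReLU and softmax activations, and the forward propagation described above, here is a minimal sketch in plain Python. The function names, the layer sizes, and the hand-picked weights are assumptions made for the example; a real model would rely on the layer classes of TensorFlow or PyTorch.

```python
import math


def relu(values):
    """Rectified linear unit: max(value, 0) applied to each element."""
    return [max(v, 0.0) for v in values]


def softmax(values):
    """Exponentiate and normalize so that the outputs add up to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases, activation):
    """One dense layer: activation(inputs . weights + biases).

    weights[i][j] links input i to output unit j.
    """
    pre_activation = [
        sum(x * weights[i][j] for i, x in enumerate(inputs)) + biases[j]
        for j in range(len(biases))
    ]
    return activation(pre_activation)


# Hypothetical network: 3 inputs -> 2 hidden units (ReLU) -> 2 classes (softmax).
x = [1.0, -2.0, 0.5]
W1 = [[0.2, -0.5], [-0.4, 0.1], [0.6, 0.3]]
b1 = [0.1, 0.0]
W2 = [[1.0, -1.0], [0.5, 0.5]]
b2 = [0.0, 0.0]

hidden = dense(x, W1, b1, relu)          # forward pass through the hidden layer
probs = dense(hidden, W2, b2, softmax)   # class probabilities
print(hidden, probs)
```

The second hidden unit receives a negative combined input and is therefore set to zero by the ReLU, while the softmax turns the two output values into probabilities.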
Model training will adjust the model parameters such as the weights and biases to minimize some loss function.
Model Loss Function
Sum of squared errors
The sum of squared errors is often used for regression problems. It is calculated as the sum of the squared differences between predicted values and true values. If we take the mean, it becomes the mean squared error.
Other losses that can be used for regressions are the mean absolute error, the mean absolute percentage error, the mean squared logarithmic error, and the cosine similarity among others.
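The regression losses above can be sketched in a few lines of plain Python. The function names and the toy values are assumptions chosen for illustration.

```python
def sum_squared_errors(y_true, y_pred):
    """Sum of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mean_squared_error(y_true, y_pred):
    """Sum of squared errors divided by the number of observations."""
    return sum_squared_errors(y_true, y_pred) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data, chosen only for the example.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

sse = sum_squared_errors(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(sse, mse, mae)
```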
The cross-entropy loss is used for classification problems. It is calculated as the negative of the sum of the products between the true class probability values (so 0 or 1) and the logarithm of the predicted probability values.
The KL divergence loss can also be used for classification problems. It is calculated as the sum of the products between the true class probability values (so 0 or 1) and the logarithm of the ratio of true class probability values to the predicted probability values.
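A minimal sketch of the two classification losses, for a single observation, is given below. The function names and the probabilities are illustrative assumptions, and the usual convention that a term with a true probability of 0 contributes nothing is applied.

```python
import math


def cross_entropy(p_true, p_pred):
    """-sum(p_true * log(p_pred)); terms with p_true == 0 are skipped."""
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred) if t > 0)


def kl_divergence(p_true, p_pred):
    """sum(p_true * log(p_true / p_pred)), with the convention 0 * log(0) = 0."""
    return sum(t * math.log(t / p) for t, p in zip(p_true, p_pred) if t > 0)


# One observation whose true class is the second of three classes.
p_true = [0.0, 1.0, 0.0]
p_pred = [0.2, 0.7, 0.1]

ce = cross_entropy(p_true, p_pred)
kl = kl_divergence(p_true, p_pred)
print(ce, kl)
```

When the true probabilities are 0 or 1, both losses reduce to minus the logarithm of the probability predicted for the true class, so they take the same value. They differ when the true distribution is not one-hot.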
The model parameters are initialized when the layers are created. Usually, zero initialization is not a good idea because of the need to break the symmetry between neurons. With zero weights, all the neurons of a layer produce the same output whatever the inputs, so the network conveys no information. The neurons of a hidden layer also receive identical gradient updates, so their weights stay identical to each other and never become differentiated.
With normal initialization, the weights are taken from a random normal distribution of a given mean (usually 0) and standard deviation.
With Glorot/Xavier (Glorot and Bengio, 2010) normal initialization, the weights are taken from a random normal distribution with a given mean (usually 0) and a standard deviation that is inversely proportional to the square root of the sum of the number of inputs and the number of outputs.
With Glorot/Xavier uniform initialization, the weights are taken from a random uniform distribution centered on a given mean (usually 0), with boundaries that are inversely proportional to the square root of the sum of the number of inputs and the number of outputs.
He initialization (He et al., 2015) is similar to the Glorot/Xavier normal initialization, but the variance depends only on the number of inputs: it is equal to 2 divided by the number of inputs.
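The initialization schemes above can be sketched as follows. The constants are the commonly quoted ones (standard deviation sqrt(2 / (inputs + outputs)) and uniform limit sqrt(6 / (inputs + outputs)) for Glorot, standard deviation sqrt(2 / inputs) for He). The helper names and layer sizes are assumptions, and library implementations may differ in details such as truncating the normal distribution.

```python
import math
import random


def glorot_normal_std(fan_in, fan_out):
    """Standard deviation of the Glorot/Xavier normal initializer."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_limit(fan_in, fan_out):
    """Weights are drawn uniformly in [-limit, limit]."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    """Standard deviation of the He normal initializer (variance 2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def init_matrix(fan_in, fan_out, draw):
    """Build a fan_in x fan_out weight matrix by calling draw() for each entry."""
    return [[draw() for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)      # fixed seed so the sketch is reproducible
fan_in, fan_out = 64, 32    # hypothetical layer sizes

std = glorot_normal_std(fan_in, fan_out)
limit = glorot_uniform_limit(fan_in, fan_out)

w_glorot_normal = init_matrix(fan_in, fan_out, lambda: rng.gauss(0.0, std))
w_glorot_uniform = init_matrix(fan_in, fan_out, lambda: rng.uniform(-limit, limit))
w_he_normal = init_matrix(fan_in, fan_out, lambda: rng.gauss(0.0, he_normal_std(fan_in)))
```

A uniform distribution on [-limit, limit] has a variance of limit squared divided by 3, so the two Glorot variants share the same variance.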
Deep neural networks have come back in vogue thanks to the rediscovery of backpropagation and the application of stochastic gradient descent (Bottou, 2011). The objective during the training of a neural network is to minimize a loss function by adjusting the weights and biases of the neural network.
In the univariate case (Figure 7), the first-order derivative of the loss indicates the direction in which the free weight parameter x has to be adjusted. If it is positive, then x has to be lowered. If it is negative, then x has to be raised. If the loss function is convex, this procedure reliably finds the global minimum. If it is not convex, the procedure might only find a local minimum.
Figure 7. Model loss as a function of weight (univariate case)
Optimizing a neural network adds two major complications to the univariate case. First, there are many weights to optimize, so the derivative becomes a gradient. Some very large language models such as GPT-3 (Brown et al., 2020) have billions of parameters. Second, there are many layers: the network is a composition of layer functions, and the gradient has to be calculated with the chain rule.
The chain rule is a simple method to calculate the derivative of a composite function. For instance, if h(x)=f(g(x)) then h’(x)=f’(g(x))g’(x). The derivative of h is the product of two derivatives. If there are n layers, the derivative is the product of n derivatives.
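The chain rule can be checked numerically. In the sketch below, the choice of g(x) = x^2 and f(u) = sin(u) is an arbitrary assumption; the analytic derivative f'(g(x)) g'(x) is compared with a central finite difference.

```python
import math


def g(x):
    return x ** 2


def g_prime(x):
    return 2.0 * x


def f(u):
    return math.sin(u)


def f_prime(u):
    return math.cos(u)


def h(x):
    """Composite function h(x) = f(g(x))."""
    return f(g(x))


def h_prime(x):
    """Chain rule: h'(x) = f'(g(x)) * g'(x)."""
    return f_prime(g(x)) * g_prime(x)


x0 = 1.5
eps = 1e-6
numeric = (h(x0 + eps) - h(x0 - eps)) / (2.0 * eps)  # central difference
analytic = h_prime(x0)
print(numeric, analytic)
```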
With one weight variable, a new value x_new is calculated from the current weight x, the derivative at this point, and a positive learning rate parameter lr: x_new = x - lr * f’(x).
If the weights form a vector, we use the gradient instead of the derivative and the formula becomes: x_new = x - lr * ∇f(x), where ∇f(x) is the vector of partial derivatives of f with respect to each component of x.
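The update rule above can be sketched as a short loop. The quadratic loss functions, the learning rate, and the number of steps are assumptions picked so that the minimum is known in advance.

```python
def gradient_descent(grad, x, lr=0.1, steps=100):
    """Repeat the update x <- x - lr * grad(x).

    x and grad(x) are lists of floats, so the same code covers
    the univariate case (one element) and the vector case.
    """
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Univariate convex loss f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
x_min = gradient_descent(lambda x: [2.0 * (x[0] - 3.0)], [0.0])

# Two weights: f(x) = (x0 - 3)^2 + (x1 + 1)^2, minimized at (3, -1).
target = [3.0, -1.0]
xs_min = gradient_descent(
    lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)], [0.0, 0.0]
)
print(x_min, xs_min)
```

Because both losses are convex, the iterates approach the global minimum. On a non-convex loss the same loop could settle in a local minimum.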
This procedure is iterative. Each application of the formula is an update. It is common to make an update after making the calculation for a group of observations (a mini-batch) taken from the training sample. The update uses the average of the gradients across the mini-batch observations: this is mini-batch stochastic gradient descent. Once all the mini-batches from the training sample have been used, we have completed an epoch. We repeat the procedure and monitor the error on the training and validation sets after each epoch.
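The mini-batch procedure can be illustrated on a deliberately small problem. The sketch below fits a line y = w * x + b with a squared-error loss; the data, the batch size, the learning rate, and the number of epochs are all assumptions made for the example.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, batch_size=4, epochs=300, seed=0):
    """Fit y = w * x + b with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batches at every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average gradient of the squared error over the mini-batch.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # One update per mini-batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [i / 10 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

With 20 observations and a batch size of 4, each epoch consists of 5 updates.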
The learning rate is usually not constant. It typically decreases in value as the learning progresses. In addition, several methods adapt the update itself, such as momentum, AdaGrad, RMSProp, or Adam. The idea is to speed up the descent by building the update from the past values of the gradient (first moment) or from its past squared values (second moment). The higher the past values, the larger the adjustment of the parameters; the higher the past squared values, the smaller the adjustment. RMSProp and Adam divide the gradient by the square root of a running average of its squared values, so that the direction counts more than the magnitude of the gradient itself.
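To make the role of the first and second moments concrete, here is a sketch of a momentum update and of an Adam update applied to the loss f(x) = (x - 3)^2. The loss, the hyperparameter values, and the function names are assumptions; the Adam formulas follow the commonly published form with bias correction.

```python
import math


def grad(x):
    """Derivative of the illustrative loss f(x) = (x - 3)^2."""
    return 2.0 * (x - 3.0)


def momentum_descent(x, lr=0.05, mu=0.9, steps=300):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    velocity = 0.0
    for _ in range(steps):
        velocity = mu * velocity - lr * grad(x)
        x += velocity
    return x


def adam_descent(x, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: first moment m (past gradients), second moment v (past squared gradients)."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)  # bias correction of the first moment
        v_hat = v / (1.0 - beta2 ** t)  # bias correction of the second moment
        # The gradient estimate is divided by the root of the second moment.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


x_momentum = momentum_descent(0.0)
x_adam = adam_descent(0.0)
print(x_momentum, x_adam)
```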
Exploding and vanishing gradient
Because the gradient is a product of many derivatives, it can end up being very small (vanishing gradient) or very large (exploding gradient). Vanishing gradient problems can be addressed by alternative weight initialization methods and activation functions such as ReLU. Exploding gradient problems can be addressed by gradient clipping, which simply imposes a maximum and a minimum value on the gradient.
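Gradient clipping by value, as described above, can be sketched in one line. A second variant, clipping by norm, is added as an assumption since it is also widely used: it rescales the whole gradient so that its direction is preserved. The function names and thresholds are illustrative.

```python
import math


def clip_by_value(gradient, limit):
    """Force every component of the gradient into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradient]


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


clipped_value = clip_by_value([3.0, -4.0, 0.5], limit=1.0)
clipped_norm = clip_by_norm([3.0, -4.0], max_norm=1.0)
print(clipped_value, clipped_norm)
```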
As in all supervised learning problems, there is always a risk of overfitting the model and losing generalization. The model will perform well in-sample on the training data but poorly out-of-sample on the validation data. Figure 8 shows the loss curves as a function of the number of epochs. The training loss and the validation loss both decrease until a point where the validation loss starts to increase. From that point, the model overfits the training data and generalizes less well to the validation data. Early stopping, which halts the training once the validation loss has stopped improving, prevents some of the overfitting.
Figure 8. Model training and validation losses as a function of the number of epochs
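Early stopping is often implemented with a patience parameter: training halts once the validation loss has failed to improve for a given number of consecutive epochs. The sketch below applies this rule to a list of validation losses; the function name, the patience value, and the loss values are assumptions for illustration.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stopped_epoch) for a sequence of validation losses.

    Training stops after `patience` consecutive epochs without improvement.
    """
    best_loss = float("inf")
    best_epoch = None
    stopped_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, stopped_epoch


# Hypothetical validation losses that start to increase after epoch 3.
val_losses = [1.00, 0.80, 0.60, 0.55, 0.58, 0.62, 0.70]
best_epoch, stopped_epoch = early_stopping(val_losses)
print(best_epoch, stopped_epoch)
```

In practice the weights saved at the best epoch are the ones kept for the final model.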
L1 and L2 Regularization
Another method to limit overfitting is to use L1 and L2 regularization. It consists of limiting the size of the weights by adding a regularization term to the loss. Instead of minimizing f(x), the training minimizes f(x) + alpha * ||x||1 or f(x) + alpha * ||x||2, where ||.||1 is the L1 norm (sum of the absolute values of the vector components) and ||.||2 is the L2 norm (square root of the sum of the squared component values). In practice, the L2 penalty is often applied to the squared norm, alpha * ||x||2^2, which is easier to differentiate. By limiting the size of the weights, there is less risk of overfitting the training data because the weights cannot take extreme values.
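The regularized loss can be sketched as the data loss plus a penalty on the weights. The function names, the value of alpha, and the toy weights are assumptions for the example.

```python
import math


def l1_norm(weights):
    """Sum of the absolute values of the components."""
    return sum(abs(w) for w in weights)


def l2_norm(weights):
    """Square root of the sum of the squared components."""
    return math.sqrt(sum(w * w for w in weights))


def regularized_loss(data_loss, weights, alpha, norm):
    """Data loss plus alpha times the chosen norm of the weights."""
    return data_loss + alpha * norm(weights)


weights = [3.0, -4.0]
loss_l1 = regularized_loss(1.0, weights, alpha=0.1, norm=l1_norm)
loss_l2 = regularized_loss(1.0, weights, alpha=0.1, norm=l2_norm)
print(loss_l1, loss_l2)
```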
Dropout (Srivastava et al., 2014) is a powerful regularization technique. Dropout randomly drops inputs of a layer (sets them to 0) at a fixed rate during training. The remaining inputs are scaled up to preserve the expected sum of the inputs. With dropout, the model does not rely on particular inputs or weights, is more robust to overfitting, and will generalize better.
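A minimal sketch of dropout is given below, in the so-called inverted form where the values that are kept are divided by the keep probability during training. The function name, the rate, and the random seed are assumptions for the example.

```python
import random


def dropout(inputs, rate, rng, training=True):
    """Zero each input with probability `rate` and rescale the others.

    At inference time (training=False) the inputs pass through unchanged.
    """
    if not training or rate == 0.0:
        return list(inputs)
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]


rng = random.Random(0)   # fixed seed so the sketch is reproducible
inputs = [1.0] * 10000

dropped = dropout(inputs, rate=0.5, rng=rng)
mean_after_dropout = sum(dropped) / len(dropped)
print(mean_after_dropout)
```

With a rate of 0.5, about half of the values become 0 and the others become 2, so the average stays close to 1.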
Batch normalization (Ioffe and Szegedy, 2015) is a technique to stabilize the training of a deep neural network. Within each mini-batch, the inputs of a layer are renormalized to a mean of 0 and a standard deviation of 1, typically before entering an activation function, and are then rescaled and shifted by two learnable parameters. This makes the learning easier as the weight updates have a similar scale and do not become too large or too small.
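The normalization step can be sketched for a single feature observed across one mini-batch. The names gamma and beta for the learnable scale and shift, the small constant eps added for numerical stability, and the toy values are assumptions; the sketch covers the training-time computation only and leaves out the running statistics used at inference.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((x - out_mean) ** 2 for x in normalized) / len(normalized)
print(out_mean, out_var)
```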
When the model is trained, other metrics can be useful in addition to the model loss. Measures of model fit that also serve as losses can be reported, such as cross-entropy for probabilistic models and cosine similarity for regression models. For instance, in a classification model, accuracy is a useful statistic, as well as AUC (area under the ROC curve), true positives and negatives, false positives and negatives, precision and recall, and sensitivity and specificity.
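Several of these classification metrics derive from the four counts of the confusion matrix. The sketch below computes them for binary labels; the function names and the toy labels are assumptions, and the code assumes that every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels coded as 0 or 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall (sensitivity), and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),
    }


# Hypothetical true labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```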
While the model is training, it is also run on validation data. The same metrics and loss statistics are calculated for both training and validation data. Before being deployed in production, the model can be run on test data.
The model is then used for inference and prediction on new data online or in batch mode.
During the training, validation, and inference phases, model and performance data and statistics should be collected. In TensorFlow, TensorBoard (Figure 9) can be used to visually present and monitor such data. The model weights, summary plots, and training graphs can easily be reported on such a dashboard.
Figure 9. TensorBoard
Bottou, L., 2011. Large-Scale Machine Learning with Stochastic Gradient Descent, in: Statistical Learning and Data Science. Chapman and Hall/CRC, pp. 33–42. https://doi.org/10.1201/b11429-6
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D., 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning, Illustrated edition. The MIT Press, Cambridge, Massachusetts.
He, K., Zhang, X., Ren, S., Sun, J., 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs].
Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs].
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R., 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 15, 1929–1958.