The fourth and fifth weeks of Andrew Ng’s Machine Learning course at Coursera were about Neural Networks: from picking a neural network architecture to fitting it to the data at hand, as well as some practical advice. Following are my notes about it.
A simple Neural Network diagram
Figure 1 represents a neural network with three layers. The first (blue) layer, denoted as $a^{(1)} = (a_1^{(1)}, a_2^{(1)}, a_3^{(1)})^T$, has three nodes and is called the input layer because its nodes are formed by the covariates/features $x = (x_1, x_2, x_3)^T$, so that $a^{(1)} = x$.
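The second (red) layer, denoted as $a^{(2)}$, is called a hidden layer because we don’t observe but rather compute the value of its nodes. The components of $a^{(2)}$ are given by a non-linear function applied to a linear combination of the nodes of the previous layer, so that, for each unit $j$ of the hidden layer,

$$a_j^{(2)} = g\left(\Theta_{j0}^{(1)} a_0^{(1)} + \Theta_{j1}^{(1)} a_1^{(1)} + \Theta_{j2}^{(1)} a_2^{(1)} + \Theta_{j3}^{(1)} a_3^{(1)}\right),$$

where $a_0^{(1)} = 1$ is the bias unit and $g(z) = 1/(1 + e^{-z})$ is the sigmoid/logistic function.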
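The third (green) layer, denoted as $a^{(3)}$, is called the output layer (later we see that in a multi-class classification the output layer can have more than one element) because it returns the hypothesis function $h_\Theta(x)$, which is again a non-linear function applied to a linear combination of the nodes of the previous layer,

$$h_\Theta(x) = a^{(3)} = g\left(\sum_{k \geq 0} \Theta_{1k}^{(2)} a_k^{(2)}\right), \qquad (1)$$

where the sum runs over the units of layer 2, with $a_0^{(2)} = 1$ as the bias unit.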
So, $a^{(j)}$ denotes the elements on layer $j$, $a_i^{(j)}$ denotes the $i$-th unit in layer $j$ and $\Theta^{(j)}$ denotes the matrix of parameters controlling the mapping from layer $j$ to layer $j+1$.
What is going on?
Note that Eq. (1) is similar to the formula used in logistic regression, with the exception that now $h_\theta(x)$ equals $g(\theta^T a^{(2)})$ instead of $g(\theta^T x)$. That is, the original features $x$ are now replaced by the second layer of the neural network.
Although it is hard to see exactly what is going on, Eq. (1) uses the nodes in layer 2, $a^{(2)}$, as input to produce the final output of the neural network, and each element of $a^{(2)}$ is formed by a non-linear combination of the original features. Intuitively, the neural network creates complex non-linear boundaries that are then used to classify future observations.
In the third week of the course it was mentioned that non-linear classification boundaries could be obtained by using non-linear transformations of the original features, like $x_1 x_2$ or $x_1^2$. However, the type of non-linear transformation had to be hand-picked and varied on a case-by-case basis, which gets hard to do when the number of features is large. In a sense, a neural network automates this process of creating non-linear functions of the features to produce non-linear classification boundaries.
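To make this concrete, here is a minimal sketch of a tiny network that computes the XNOR of two binary inputs, a boundary that no single logistic unit on $x_1$ and $x_2$ can represent. The weights are hand-picked rather than learned (the AND / NOR / OR construction shown in the course lectures), and the function names are only illustrative.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, inputs):
    """One node: g(bias + weighted sum of inputs); weights[0] is the bias weight."""
    z = weights[0] + sum(w * a for w, a in zip(weights[1:], inputs))
    return sigmoid(z)

def xnor_network(x1, x2):
    # Hidden layer: each unit is a non-linear function of the original features.
    a1 = unit([-30.0, 20.0, 20.0], [x1, x2])    # approximately x1 AND x2
    a2 = unit([10.0, -20.0, -20.0], [x1, x2])   # approximately (NOT x1) AND (NOT x2)
    # Output layer: logistic regression on the hidden units, not on x1 and x2.
    return unit([-10.0, 20.0, 20.0], [a1, a2])  # approximately a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor_network(x1, x2), 3))
```

The output is close to 1 when both inputs agree and close to 0 otherwise, so the hidden units play the role of the hand-crafted non-linear features.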
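In a binary classification problem, the target variable can be represented by $y \in \{0, 1\}$ and the neural network has one output unit, as represented by Figure 1. In a neural network context, for a multi-class classification problem with $K$ classes, the target variable is represented by a vector of length $K$ instead of $y \in \{1, \dots, K\}$. For example, for $K = 4$, $y \in \mathbb{R}^4$ and can take one of the following vectors:

$$\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix},$$

and in this case the output layer will have $K$ output units, as represented in Figure 2 for $K = 4$.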
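The encoding of the target variable can be sketched as below. The helper names `one_hot` and `predicted_class` are hypothetical, and picking the class with the largest output unit is a common convention assumed here rather than something stated in these notes.

```python
def one_hot(label, num_classes):
    """Return the length-K indicator vector for a class label in 1..K."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be between 1 and num_classes")
    return [1 if k == label else 0 for k in range(1, num_classes + 1)]

def predicted_class(output_units):
    """Pick the class (1..K) whose output unit has the largest activation."""
    best_index = max(range(len(output_units)), key=lambda k: output_units[k])
    return best_index + 1

print(one_hot(3, 4))                           # [0, 0, 1, 0]
print(predicted_class([0.1, 0.2, 0.9, 0.05]))  # 3
```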
The neural network represented in Figure 2 is similar to the one in Figure 1. The only difference is that it has one extra hidden layer and that the dimensions of the layers $a^{(j)}$, $j = 1, \dots, 4$, are different. The math behind it stays exactly the same, as can be seen with the forward propagation algorithm.
Forward propagation: vectorized implementation
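Forward propagation shows how to compute $h_\Theta(x)$ for a given $x$. Assume your neural network has $L$ layers; then the pseudo-code for forward propagation is given by:
Algorithm 1 (forward propagation)
- Set $a^{(1)} = x$.
- For $l = 1, \dots, L-1$: compute $z^{(l+1)} = \Theta^{(l)} a^{(l)}$ and $a^{(l+1)} = g(z^{(l+1)})$, where the bias unit $a_0^{(l)} = 1$ is added to $a^{(l)}$ before multiplying by $\Theta^{(l)}$.
- Return $h_\Theta(x) = a^{(L)}$.
The only things that change between different neural networks are the number of layers $L$ and the dimensions of the vectors $a^{(l)}$, $z^{(l)}$, $l = 1, \dots, L$, and of the matrices $\Theta^{(l)}$, $l = 1, \dots, L-1$.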
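A minimal sketch of forward propagation in plain Python follows. It uses nested lists instead of a matrix library so that it runs without dependencies, and it assumes that column 0 of each parameter matrix holds the bias weight; the name `forward_propagation` is illustrative.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_propagation(thetas, x):
    """Return the activations of every layer for input x.

    thetas[l] is a list of rows; row i holds the weights into unit i of the
    next layer, with the bias weight in column 0 (an assumed convention).
    """
    activations = [list(x)]                    # a^(1) = x
    for theta in thetas:
        a_with_bias = [1.0] + activations[-1]  # prepend the bias unit a_0 = 1
        z = [sum(w * a for w, a in zip(row, a_with_bias)) for row in theta]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations                         # activations[-1] is h(x)

# A network with 3 inputs, 2 hidden units and 1 output unit.
theta1 = [[0.0, 1.0, -1.0, 0.5], [0.1, 0.0, 0.0, 0.0]]  # 2 x (3 + 1)
theta2 = [[0.0, 1.0, -1.0]]                             # 1 x (2 + 1)
layers = forward_propagation([theta1, theta2], [1.0, 1.0, 0.0])
print([len(a) for a in layers])  # [3, 2, 1]
```

Changing the architecture only changes the shapes of the entries in `thetas`; the loop itself stays the same.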
From now on, assume we have a training set with $m$ data-points, $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$. The cost function for a neural network with $K$ output units is very similar to the logistic regression one:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log \left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right],$$

where $\left(h_\Theta(x^{(i)})\right)_k$ is the $k$-th unit of the output layer. The main difference is that now $h_\Theta(x^{(i)})$ is computed with the forward propagation algorithm. Technically, everything we have so far is enough to optimize the cost function above. Many modern optimization algorithms allow you to provide just the function to be optimized, while derivatives are computed numerically within the optimization routine. However, this can be very inefficient in some cases. The backpropagation algorithm provides an efficient way to compute the gradient of the cost function of a neural network.
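The cost function described above can be sketched as follows. This assumes the unregularized cross-entropy form and the bias-in-column-0 layout for the parameter matrices; `hypothesis` and `cost` are illustrative names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(thetas, x):
    """Forward propagation, returning only the output layer."""
    a = list(x)
    for theta in thetas:
        a_with_bias = [1.0] + a  # prepend the bias unit
        a = [sigmoid(sum(w * v for w, v in zip(row, a_with_bias))) for row in theta]
    return a

def cost(thetas, xs, ys):
    """Unregularized cross-entropy cost averaged over the m examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = hypothesis(thetas, x)
        for h_k, y_k in zip(h, y):
            total += y_k * math.log(h_k) + (1 - y_k) * math.log(1.0 - h_k)
    return -total / len(xs)

# 2 inputs, 2 hidden units, K = 2 output units, all weights set to zero.
thetas = [[[0.0] * 3, [0.0] * 3], [[0.0] * 3, [0.0] * 3]]
xs = [[0.0, 1.0], [1.0, 0.0]]
ys = [[1, 0], [0, 1]]
print(cost(thetas, xs, ys))
```

With all weights at zero every output unit equals 0.5, so the cost is $K \log 2$, about 1.386 for $K = 2$. Note that the sketch does not guard against `log(0)` when an output unit saturates at exactly 0 or 1.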
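For the cost function given above, we have that the contribution of a single training example $(x, y)$ to the gradient is $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)}$, where $\delta_j^{(l)}$ can be interpreted as the “error” of node $j$ in layer $l$ and is computed by
Algorithm 2 (backpropagation)
- Set $\delta^{(L)} = a^{(L)} - y$.
- For $l = L-1, \dots, 2$: compute $\delta^{(l)} = \left(\Theta^{(l)}\right)^T \delta^{(l+1)} \odot g'(z^{(l)})$, where $g'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})$, $\odot$ denotes the element-wise product, and the component corresponding to the bias unit is dropped.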
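Knowing how to compute the $\delta$’s, the complete pseudo-algorithm to compute the gradient of $J(\Theta)$ is given by
Algorithm 3 (gradient of $J(\Theta)$)
- Set $\Delta^{(l)} = 0$ for $l = 1, \dots, L-1$.
- For $i = 1, \dots, m$: set $a^{(1)} = x^{(i)}$; compute $a^{(l)}$, $l = 2, \dots, L$, with forward propagation (Algorithm 1); compute $\delta^{(l)}$, $l = L, \dots, 2$, with backpropagation (Algorithm 2) using $y^{(i)}$; update $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left(a^{(l)}\right)^T$.
- Return $\frac{\partial}{\partial \Theta^{(l)}} J(\Theta) = \frac{1}{m} \Delta^{(l)}$.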
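The gradient computation can be sketched in plain Python as below. It assumes the unregularized cost, sigmoid activations in every layer, and parameter matrices whose column 0 holds the bias weight; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Activations of every layer, starting with the input itself."""
    activations = [list(x)]
    for theta in thetas:
        a_bias = [1.0] + activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, a_bias))) for row in theta]
        )
    return activations

def gradient(thetas, xs, ys):
    """Gradient of the unregularized cost, with the same shape as thetas."""
    m = len(xs)
    grads = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(xs, ys):
        activations = forward(thetas, x)
        # Error of the output layer: delta^(L) = a^(L) - y.
        delta = [a - t for a, t in zip(activations[-1], y)]
        for l in range(len(thetas) - 1, -1, -1):
            a_bias = [1.0] + activations[l]
            # Accumulate delta^(l+1) * (a^(l))^T, already divided by m.
            for i, d in enumerate(delta):
                for j, a in enumerate(a_bias):
                    grads[l][i][j] += d * a / m
            if l > 0:
                # Propagate the error back, skipping the bias column of theta.
                a_prev = activations[l]
                delta = [
                    sum(thetas[l][i][j + 1] * delta[i] for i in range(len(delta)))
                    * a_prev[j] * (1.0 - a_prev[j])
                    for j in range(len(a_prev))
                ]
    return grads

# With no hidden layer the result reduces to the logistic regression gradient.
print(gradient([[[0.0, 0.0, 0.0]]], [[1.0, 2.0]], [[1]]))  # [[[-0.5, -0.5, -1.0]]]
```

Here `thetas[l]` maps `activations[l]` to `activations[l + 1]`, so the list index is one less than the layer index used in the text.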
Some practical advice
- Pick a network architecture. The number of input units is given by the dimension of the features $x^{(i)}$. The number of output units is given by the number of classes. So basically we need to decide the number of hidden layers and how many units to put in each hidden layer. A reasonable default is to have one hidden layer with the number of units equal to a small multiple of the number of input units. If there is more than one hidden layer, use the same number of hidden units in every layer. The more hidden units the better; the constraint here is the computational burden, which increases with the number of hidden units.
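- When using numerical optimization you need initial values for $\Theta$. Randomly initialize the parameters so that each $\Theta_{ij}^{(l)} \in [-\epsilon, \epsilon]$. Initializing $\Theta_{ij}^{(l)} = 0$ for all $i, j, l$ will result in problems: all hidden units in a layer will compute the same value and receive the same gradient update, which is clearly undesirable.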
- While building your neural network implementation, compare the gradient returned by backpropagation against numerical derivatives to make sure that your implementation of backpropagation is correct.
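The last two points can be sketched together: draw the initial weights uniformly from $[-\epsilon, \epsilon]$ and then compare the backpropagation gradient with a central-difference approximation. Everything here is an assumption for illustration: the value `epsilon=0.12`, the step size, the fixed seed, the unregularized cost, and all function names.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(thetas, x):
    """Activations of every layer; column 0 of each row is the bias weight."""
    activations = [list(x)]
    for theta in thetas:
        a_bias = [1.0] + activations[-1]
        activations.append(
            [sigmoid(sum(w * a for w, a in zip(row, a_bias))) for row in theta]
        )
    return activations

def cost(thetas, xs, ys):
    """Unregularized cross-entropy cost averaged over the examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        for h_k, y_k in zip(forward(thetas, x)[-1], y):
            total += y_k * math.log(h_k) + (1 - y_k) * math.log(1.0 - h_k)
    return -total / len(xs)

def backprop_gradient(thetas, xs, ys):
    """Gradient of cost() computed with backpropagation."""
    grads = [[[0.0] * len(row) for row in theta] for theta in thetas]
    for x, y in zip(xs, ys):
        acts = forward(thetas, x)
        delta = [a - t for a, t in zip(acts[-1], y)]
        for l in range(len(thetas) - 1, -1, -1):
            a_bias = [1.0] + acts[l]
            for i, d in enumerate(delta):
                for j, a in enumerate(a_bias):
                    grads[l][i][j] += d * a / len(xs)
            if l > 0:
                delta = [
                    sum(thetas[l][i][j + 1] * delta[i] for i in range(len(delta)))
                    * acts[l][j] * (1.0 - acts[l][j])
                    for j in range(len(acts[l]))
                ]
    return grads

def random_thetas(layer_sizes, epsilon=0.12, seed=0):
    """Draw every weight uniformly from [-epsilon, epsilon]."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-epsilon, epsilon) for _ in range(n_in + 1)]
         for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]

def numerical_gradient(thetas, xs, ys, step=1e-4):
    """Central-difference approximation of the gradient of cost()."""
    grads = []
    for theta in thetas:
        layer_grad = []
        for row in theta:
            row_grad = []
            for j in range(len(row)):
                original = row[j]
                row[j] = original + step
                cost_plus = cost(thetas, xs, ys)
                row[j] = original - step
                cost_minus = cost(thetas, xs, ys)
                row[j] = original  # restore the weight
                row_grad.append((cost_plus - cost_minus) / (2.0 * step))
            layer_grad.append(row_grad)
        grads.append(layer_grad)
    return grads

def max_abs_difference(grads_a, grads_b):
    """Largest element-wise gap between two gradients of the same shape."""
    return max(
        abs(a - b)
        for layer_a, layer_b in zip(grads_a, grads_b)
        for row_a, row_b in zip(layer_a, layer_b)
        for a, b in zip(row_a, row_b)
    )

thetas = random_thetas([3, 4, 2])  # 3 inputs, 4 hidden units, 2 outputs
xs = [[0.5, -1.0, 2.0], [1.5, 0.3, -0.7]]
ys = [[1, 0], [0, 1]]
difference = max_abs_difference(
    backprop_gradient(thetas, xs, ys), numerical_gradient(thetas, xs, ys)
)
print(difference)  # should be tiny if backpropagation is implemented correctly
```

The numerical gradient needs two cost evaluations per parameter, so a check like this is best run on a small network and switched off before the actual training.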