The neural network architecture - Large Scale Machine Learning with Python

Let's now focus on how neural networks are organized, starting from their architecture and a few definitions.

A network in which information flows forward from the inputs all the way to the outputs in a single pass is referred to as a feedforward neural network.

A basic feedforward neural network can easily be depicted by a network diagram, as shown here:

Chapter 4

In the network diagram, you can see that this architecture consists of an input layer, a hidden layer, and an output layer. The input layer contains the feature vectors (where each observation has n features). The output layer consists of a separate unit for each class of the output vector in the case of classification, and a single numerical output in the case of regression.

The strength of the connections between the units is expressed through weights. Each unit multiplies its inputs by these weights, and the weighted sum is then passed to an activation function. The goal of an activation function is to transform its input into an output that makes binary decisions more separable.

These activation functions are preferably differentiable so they can be used to learn.

The widely-used activation functions are sigmoid and tanh, and even more recently the rectified linear unit (ReLU) has gained traction. Let's compare the most important activation functions so that we understand their advantages and drawbacks. Note that we mention the output range and active range of the function.

The output range is simply the actual output of the function itself. The active range, however, is a little more complicated; it is the range where the gradient has the most variance in the final weight updates. This means that outside of this range, the gradient is near zero and does not add to the parameter updates during learning.

This problem of a close-to-zero gradient is also referred to as the vanishing gradient problem. It is mitigated by the ReLU activation function, which at this time is the most popular activation for larger neural networks.

Neural Networks and Deep Learning

It is important to note that features need to be scaled to the active range of the chosen activation function. Most up-to-date packages will have this as a standard preprocessing procedure so you don't need to do this yourself:
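If a package does not scale the features for you, the step can be done by hand. The following is a minimal sketch in NumPy, assuming that standardization (zero mean, unit variance per feature) is an acceptable scaling; the text does not prescribe a specific method, and the array `X` is only a toy example.

```python
import numpy as np

# Toy feature matrix (assumed for illustration); one value is far larger than the rest
X = np.array([[.35, .21, .33],
              [.2, .4, .3],
              [.4, .34, .5],
              [.18, .21, 16]])

# Standardize each feature (column) to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step, no single feature dominates the weighted sums purely because of its scale.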

Sigmoid functions are often used for mathematical convenience because their derivatives are very easy to calculate, which we will use to calculate the weight updates in training algorithms:

Interestingly, the tanh and logistic sigmoid functions are linearly related: tanh(x) = 2 * sigmoid(2x) - 1. In other words, tanh can be seen as a rescaled version of the sigmoid function whose range lies between -1 and 1.
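Both properties can be checked numerically. The following is a small sketch in NumPy; the helper names `sigmoid` and `sigmoid_grad` are chosen here for illustration only.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid; output range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-4.0, 4.0, 9)

# tanh as a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
max_diff = np.max(np.abs(np.tanh(x) - tanh_from_sigmoid))
```

The value of `max_diff` is at the level of floating-point rounding, and the derivative needs nothing more than the sigmoid output itself, which is what makes it convenient in weight updates.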

The ReLU function is often the best choice for deeper architectures. It can be seen as a ramp function whose range runs from 0 to infinity. You can see that it is much easier to calculate than the sigmoid function. The biggest benefit of this function is that its gradient does not shrink for positive inputs, so it largely bypasses the vanishing gradient problem. If ReLU is an option during a deep learning project, use it.
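To make the difference concrete, the sketch below compares the gradient of ReLU with the gradient of the sigmoid on a few inputs. The function names and the sample inputs are assumptions made for this illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

def sigmoid_grad(x):
    """Gradient of the logistic sigmoid."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.5, 10.0])
relu_out = relu(x)
relu_g = relu_grad(x)      # stays at 1 for every positive input
sigm_g = sigmoid_grad(x)   # shrinks toward 0 as |x| grows
```

For large positive inputs the sigmoid gradient is close to zero while the ReLU gradient stays at 1. Note that the ReLU gradient is 0 for negative inputs, so units can still stop learning there.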

Softmax for classification

So far, we have seen that activation functions transform the values within a certain range after they are multiplied with the weight vectors. We also need to transform the outputs of the last hidden layer before providing balanced classes or probability

This will convert the output of the previous layer to probability values so that a final class prediction can be made. The exponentiation in this case will return a near-zero value whenever the output is significantly less than the maximum of all the values; this way the differences are amplified:

$$\mathrm{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}$$
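As an illustration of the softmax transformation, here is a minimal NumPy sketch. The score vector is hypothetical, and subtracting the maximum before exponentiating is a common numerical safeguard added here; it is not part of the definition and does not change the result.

```python
import numpy as np

def softmax(x):
    """Softmax over a 1-D vector of scores."""
    shifted = x - np.max(x)   # avoids overflow in exp; the output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([1.0, 2.0, 5.0])   # hypothetical outputs of the last layer
probs = softmax(scores)
predicted_class = int(np.argmax(probs))
```

The entries of `probs` sum to 1, and the largest score receives most of the probability mass, while scores well below the maximum end up close to zero.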

Forward propagation

Now that we understand activation functions and the final outputs of a network, let's see how the input features are fed through the network to provide a final prediction.

Computations with huge chunks of units and connections might look like a complex task, but fortunately the feedforward process of a neural network comes down to a sequence of vector computations:


We arrive at a final prediction by performing the following steps:

1. Performing a dot-product on the inputs with the weights between the first and second layer and transforming the result with the activation function.

2. Performing a dot-product on the outputs of the first hidden layer with the weights between the second and third layer. These results are then transformed with the activation function on each unit of the second hidden layer.

3. Finally, we arrive at our prediction by applying the output activation function (softmax for classification) to the resulting vector.

We can treat each layer in the network as a vector and apply simple vector multiplications. More formally, this will look like the following:

θx = the weight vector of layer x
b1 and b2 are the bias units
f = the activation function

Note that this example is based on a single hidden layer network architecture.
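For reference, a forward pass through a single hidden layer can be written as below. This is a sketch that assumes the symbols from the legend above: \theta_1 and \theta_2 for the weights of the two layers, b_1 and b_2 for the bias units, f for the activation function, x for the input vector, and \hat{y} (a name introduced here) for the prediction.

```latex
\hat{y} = f\left(\theta_2 \cdot f\left(\theta_1 \cdot x + b_1\right) + b_2\right)
```

For classification, the outer f would typically be the softmax function rather than the activation used in the hidden layer.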

Let's perform a simple feedforward pass on a neural network with two hidden layers with basic NumPy. We apply a softmax function to the final output:

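import numpy as np
import math

b1 = 0  # bias unit 1
b2 = 0  # bias unit 2

def sigmoid(x):  # sigmoid function (assumed; same definition as in the backpropagation example)
    return 1 / (1 + (math.e ** -x))

def softmax(x):  # softmax function
    l_exp = np.exp(x)
    sm = l_exp / np.sum(l_exp, axis=0)
    return sm

# input dataset with 3 features
X = np.array([[.35, .21, .33],
              [.2, .4, .3],
              [.4, .34, .5],
              [.18, .21, 16]])

len_X = len(X)  # training set size
input_dim = 3  # input layer dimensionality
output_dim = 1  # output layer dimensionality
hidden_units = 4

# The weight initialization and the layer computations were lost from the text;
# the lines below are an assumed reconstruction using random weights.
np.random.seed(22)
theta0 = 2 * np.random.random((input_dim, hidden_units))
theta1 = 2 * np.random.random((hidden_units, output_dim))

# forward propagation pass
l1 = sigmoid(X.dot(theta0) + b1)
l2 = l1.dot(theta1) + b2

# let's apply softmax to the output of the final layer
output = softmax(l2)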

Note that the bias unit enables the function to move up and down, which helps it fit the target values more closely. Each hidden layer contains one bias unit.

Backpropagation

With our simple feedforward example, we have taken our first steps in training the model. Neural networks are trained in much the same way as the gradient descent methods that we have seen with other machine learning algorithms. Namely, we update the parameters of a model in order to find the global minimum of the error function. An important difference with neural networks is that we now have to deal with multiple units across the network, each with weights that need to be trained. We do this using the partial derivatives of the cost function, which tell us how much the error changes when we change a particular parameter vector, and we then move each parameter by an amount scaled by the learning rate. We start with the layer closest to the output and calculate the gradient with respect to the derivative of our loss function. If there are hidden layers, we move to the second hidden layer and update the weights until the first layer in the


The core idea of backpropagation is quite similar to other machine learning algorithms, with the important complication that we are dealing with multiple layers and units. We have seen that each layer in the network is represented by a weight vector θij. So, how do we solve this issue? It might seem intimidating that we have to train a large number of weights independently. However, quite conveniently, we can use vectorized operations. Just like we did with the forward pass, we calculate the gradients and apply the updates to the weight vectors (θij).

We can summarize the following steps in the backpropagation algorithm:

1. Feedforward pass: We randomly initialize the weight vectors and multiply the input with the subsequent weight vectors toward a final output.

2. Calculate the error: We calculate the error/loss of the output of the feedforward step.

3. Backpropagation to the last hidden layer (with respect to the output): We calculate the gradient of this error and change the weights in the direction that reduces the error. We do this by multiplying the weight vector θj with the gradients.

4. Update the weights till the stopping criterion is reached (minimum error or number of training rounds).

We have now covered a feedforward pass of an arbitrary two-layer neural network; let's apply backpropagation with SGD in NumPy to the same input that we used in the previous example. Take special note of how we update the weight parameters:

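import numpy as np
import math

def sigmoid(x):  # sigmoid function
    return 1 / (1 + (math.e ** -x))

def deriv_sigmoid(y):  # the derivative of the sigmoid function, given its output y
    return y * (1.0 - y)

# The setup lines were lost from the text. The learning rate, targets, seed,
# and weight initialization below are assumed values, so the printed numbers
# may differ from the output shown after this listing.
alpha = .1  # learning rate (assumed)
X = np.array([[.35, .21, .33],
              [.2, .4, .3],
              [.4, .34, .5],
              [.18, .21, 16]])
y = np.array([[0], [1], [1], [0]])  # target values (assumed)
np.random.seed(1)
theta0 = 2 * np.random.random((3, 4)) - 1  # weights: input -> hidden
theta1 = 2 * np.random.random((4, 1)) - 1  # weights: hidden -> output

for iter in range(205000):  # here we specify the amount of training rounds.
    # Feedforward the input like we did in the previous exercise
    input_layer = X
    l1 = sigmoid(np.dot(input_layer, theta0))
    l2 = sigmoid(np.dot(l1, theta1))

    # Calculate error
    l2_error = y - l2
    if (iter % 1000) == 0:
        print("Neuralnet accuracy:" + str(np.mean(1 - (np.abs(l2_error)))))

    # Calculate the gradients in vectorized form
    # Softmax and bias units are left out for instructional simplicity
    l2_delta = alpha * (l2_error * deriv_sigmoid(l2))
    l1_error = l2_delta.dot(theta1.T)
    l1_delta = alpha * (l1_error * deriv_sigmoid(l1))

    # Update the weights
    theta1 += l1.T.dot(l2_delta)
    theta0 += input_layer.T.dot(l1_delta)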

Now look how the accuracy increases with each pass over the network:

Neuralnet accuracy:0.983345051044
Neuralnet accuracy:0.983919393086
Neuralnet accuracy:0.983973975799
Neuralnet accuracy:0.984028069878
Neuralnet accuracy:0.984081682304
Neuralnet accuracy:0.984134819919

Common problems with backpropagation

One familiar problem with neural networks is that, during optimization with backpropagation, gradient descent can get stuck in local minima. This occurs when the error minimization is tricked into treating a point as the minimum (the point S in the image) when it is really just a local dip, and the optimizer would have to pass a peak to reach a lower point:

Another common problem is when the gradient descent misses the global minimum, which can sometimes result in surprisingly poor performing models. This problem is referred to as overshooting.
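Overshooting is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w**2, which is chosen here purely for illustration; the function name and learning rates are assumptions.

```python
def gradient_descent(learning_rate, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w), starting from w0."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

w_small = gradient_descent(learning_rate=0.1)   # shrinks steadily toward the minimum at 0
w_large = gradient_descent(learning_rate=1.1)   # every step jumps past 0 and lands further away
```

With the smaller learning rate, `w_small` ends close to 0. With the larger one, each update overshoots the minimum by more than the previous error, so `w_large` moves away from it.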

It is possible to address both of these problems by choosing a lower learning rate when the model is overshooting, or a higher learning rate when it is getting stuck in local minima. Sometimes this adjustment still doesn't lead to a satisfying and quick convergence. Recently, a range of solutions has been found to mitigate these problems: learning algorithms that add tweaks to the vanilla SGD algorithm that we just covered. It is important to understand them so that you can choose the right one for any given task. Let's cover these learning algorithms in more detail.

Backpropagation with mini batch

Batch gradient descent computes the gradient using the whole dataset, but backpropagation with SGD can also work with so-called mini batches, where a sample of the dataset of size k (a batch) is used for each parameter update. Mini batches smooth out the irregularity of the error between updates, which can help avoid both getting stuck in local minima and overshooting. In most neural network packages, we can change the batch size of the algorithm (we will look at this later). Depending on the number of training examples, a batch size anywhere between 10 and 300 can be helpful.
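The mechanics of mini-batch updates can be sketched as follows. To keep the example short, it trains a linear model rather than a neural network; the data, the helper `iterate_minibatches`, the batch size, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data once."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical toy inputs
true_w = np.array([1.5, -2.0, 0.5])    # weights used to generate the targets
y = X @ true_w                         # noise-free linear targets

w = np.zeros(3)
alpha = 0.1                            # learning rate
for epoch in range(50):
    for X_b, y_b in iterate_minibatches(X, y, batch_size=20, rng=rng):
        error = X_b @ w - y_b
        grad = X_b.T @ error / len(X_b)   # gradient averaged over the batch
        w -= alpha * grad
```

Each update uses only 20 examples, yet averaging the gradient over the batch keeps consecutive updates far less erratic than single-example updates would be.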

Momentum training

Momentum is a method that adds a fraction of the previous weight update to the current one:

A high momentum parameter can help increase the speed of convergence, so that the global minimum is reached faster. Looking at the formulation, you can see a v parameter. This is the equivalent of the velocity of the gradient updates, scaled by the learning rate. A simple way to understand this is to see that when the gradient keeps pointing in the same direction over multiple instances, the speed of convergence increases with each step toward the minimum. This also smooths out irregularities between the gradients by a certain margin. Most packages will have this momentum parameter available (as we will see in a later example). When we set this parameter too high, we have to keep in mind that there is a risk of overshooting the global minimum. On the other hand, when we set the momentum parameter too low, the optimization might get stuck in local minima and learning can slow down. Ideal settings for the momentum coefficient are normally between 0.5 and 0.99.
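A minimal sketch of a momentum update is shown below. It assumes one common formulation of classical momentum (v = mu * v - alpha * gradient, then w = w + v); the names `mu` and `alpha`, the function `sgd_momentum`, and the toy objective are illustrative choices, not taken from the text.

```python
def sgd_momentum(grad_fn, w0, alpha=0.1, mu=0.9, steps=300):
    """Gradient descent with classical momentum on a scalar parameter."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = mu * v - alpha * grad_fn(w)   # velocity: fraction of the last update plus the new step
        w = w + v
    return w

# Toy objective f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With `mu=0.9` the iterate swings around the minimum at 3 before settling, which is the mild overshooting that the text warns about when the momentum parameter is set high.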

Nesterov momentum

Nesterov momentum is a newer and improved version of classical momentum. In addition to classical momentum training, it will look ahead in the direction of the gradient. In other words, Nesterov momentum takes a simple step going from x to y, and moves a little bit further in that direction so that x to y becomes x to {y (v1 +1)} in the direction given by the previous point. I will spare you the technical details, but remember that it consistently outperforms normal momentum training in terms of

Adaptive gradient (ADAGRAD)

ADAGRAD provides a feature-specific learning rate that utilizes information from the previous updates:

ADAGRAD updates the learning rate for each parameter according to information from previously iterated gradients for that parameter. This is done by dividing each term by the square root of the sum of squares of its previous gradients. This allows the learning rate to decrease over time because the sum of squares will continue to increase with each iteration. A decreasing learning rate has the advantage of decreasing the risk of overshooting the global minimum quite substantially.
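The per-parameter scaling can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the function name `adagrad`, the small constant `eps` that guards against division by zero, the base learning rate, and the toy objective are all choices made here.

```python
import numpy as np

def adagrad(grad_fn, w0, alpha=0.5, eps=1e-8, steps=100):
    """ADAGRAD: each parameter's step is divided by the root of its summed squared gradients."""
    w = np.asarray(w0, dtype=float).copy()
    grad_sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        grad_sq_sum += g ** 2                          # only ever grows
        w -= alpha * g / (np.sqrt(grad_sq_sum) + eps)  # per-parameter effective learning rate
    return w, grad_sq_sum

# Toy objective with very different curvature per parameter:
# f(w) = 0.5 * sum(scales * w**2), so the gradient is scales * w
scales = np.array([1.0, 100.0])
w_final, grad_sq_sum = adagrad(lambda w: scales * w, w0=[1.0, 1.0])
```

Although the second parameter has gradients 100 times larger, both parameters move at a similar pace because each step is normalized by that parameter's own gradient history.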

Resilient backpropagation (RPROP)

RPROP is an adaptive method that does not look at historical information, but merely looks at the sign of the partial derivative over a training instance and updates the weights accordingly.

A direct adaptive method for faster backpropagation learning: The RPROP Algorithm. Martin Riedmiller 1993

Inspecting the preceding image closely, we can see that once the partial derivative of the error changes its sign (> 0 or < 0), the update starts moving in the opposite direction, leading toward the global minimum and correcting for the overshooting. However, if the sign doesn't change at all, larger steps are taken toward the global minimum. Many articles report that RPROP outperforms ADAGRAD, but in practice this is not confirmed consistently. Another important thing to keep in mind is that RPROP does not work properly with mini batches.
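The sign-based logic can be sketched as follows. This is a simplified variant without weight backtracking, written for illustration; the function name `rprop`, the step-size constants (growth factor 1.2, shrink factor 0.5, and the step bounds), and the toy objective are assumptions rather than values given in the text.

```python
import numpy as np

def rprop(grad_fn, w0, steps=200, step_init=0.1,
          eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Simplified RPROP: only the sign of each partial derivative is used."""
    w = np.asarray(w0, dtype=float).copy()
    step = np.full_like(w, step_init)
    prev_grad = np.zeros_like(w)
    for _ in range(steps):
        grad = grad_fn(w)
        sign_product = grad * prev_grad
        # Same sign as last time: take a larger step
        step = np.where(sign_product > 0, np.minimum(step * eta_plus, step_max), step)
        # Sign flipped: the minimum was overshot, so shrink the step
        step = np.where(sign_product < 0, np.maximum(step * eta_minus, step_min), step)
        w -= np.sign(grad) * step
        prev_grad = grad
    return w

# Toy objective: f(w) = sum(scales * (w - target)**2)
target = np.array([3.0, -2.0])
scales = np.array([1.0, 1000.0])
w_final = rprop(lambda w: 2.0 * scales * (w - target), w0=[0.0, 0.0])
```

Because only the sign of the gradient is used, the very different gradient magnitudes of the two parameters have no effect on the size of the updates.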

RMSProp

RMSProp is an adaptive learning method without shrinking the learning rate:

RMSProp is also an adaptive learning method that utilizes ideas from momentum learning and ADAGRAD, with the important addition that it avoids the shrinkage of the learning rate over time. With this technique, the shrinkage is controlled by using an exponentially decaying average of the squared gradients instead of their ever-growing sum.
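A minimal sketch of this idea in NumPy is shown below. The function name `rmsprop`, the decay factor of 0.9, the constant `eps`, the learning rate, and the toy objective are assumptions made for illustration.

```python
import numpy as np

def rmsprop(grad_fn, w0, alpha=0.01, decay=0.9, eps=1e-8, steps=1000):
    """RMSProp: scale each step by a decaying average of squared gradients."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq_grad = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        # Old gradients fade out, so the effective learning rate does not shrink forever
        avg_sq_grad = decay * avg_sq_grad + (1.0 - decay) * g ** 2
        w -= alpha * g / (np.sqrt(avg_sq_grad) + eps)
    return w

# Toy objective: f(w) = sum((w - target)**2), gradient 2 * (w - target)
target = np.array([3.0, -2.0])
w_final = rmsprop(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

With a constant learning rate, the result ends up close to the minimum rather than exactly on it, because the normalized steps never become arbitrarily small.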

The following is the list of gradient descent optimization algorithms:

| Algorithm | Applications | Common problems | Practical tips |
|---|---|---|---|
| Regular SGD | Widely applicable | Overshooting, stuck in local minima | Use with momentum and mini-batch |
| ADAGRAD | Smaller datasets <10k | Slow convergence | Use a learning rate between .01 and .1. Widely applicable. Works with sparse data |
| RPROP | Larger datasets >10k | Not effective with mini-batches | Use RMSProp when possible |
| RMSProp | Larger datasets >10k | Not effective with wide and shallow nets | Particularly useful for wide sparse data |


In the document Large Scale Machine Learning with Python (pages 153-165)