
Training with Simulated Annealing

To train a neural network, you must define its tasks. An objective function, also known as a scoring or loss function, can define these tasks. Essentially, an objective function evaluates the neural network and returns a number indicating how useful the network is. The training process modifies the weights of the neural network in each iteration so that the value returned from the objective function improves.
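As a minimal sketch of this idea, the following Python code scores a tiny network with mean squared error. The network shape (two inputs, one sigmoid output), the OR data set, and the names forward and objective are all assumptions made for illustration; they are not taken from the book's example code.

```python
import math

def forward(weights, x):
    """Tiny 2-input, 1-output network: weighted sum plus bias through a sigmoid."""
    w1, w2, bias = weights
    return 1.0 / (1.0 + math.exp(-(w1 * x[0] + w2 * x[1] + bias)))

def objective(weights, inputs, ideals):
    """Mean squared error of the network over a data set (lower is better)."""
    total = 0.0
    for x, ideal in zip(inputs, ideals):
        total += (forward(weights, x) - ideal) ** 2
    return total / len(inputs)

# Logical OR as a toy training set (hypothetical data for this sketch).
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
IDEALS = [0, 1, 1, 1]

poor = objective([0.0, 0.0, 0.0], INPUTS, IDEALS)       # all-zero weights
better = objective([10.0, 10.0, -5.0], INPUTS, IDEALS)  # hand-picked weights
print(poor, better)
```

The objective function knows nothing about how the weights were chosen. It only maps a weight vector to a single number, which is what lets a general-purpose optimizer drive the training.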

Simulated annealing is an effective optimization technique that we examined in Artificial Intelligence for Humans Volume 1. In this chapter, we will review simulated annealing and show how any vector optimization algorithm can improve the weights of a feedforward neural network. In the next chapter, we will examine more advanced optimization techniques that take advantage of a differentiable loss function.

As a review, simulated annealing works by first assigning the weight vector of a neural network to random values. This vector is treated like a position, and the program evaluates every possible move from that position. To understand how a neural network weight vector translates to a position, think of a neural network with just three weights. In the real world, we consider position in terms of the x, y, and z coordinates. We can write any position as a vector of three values. If we are willing to move in a single dimension at a time, we could move in a total of six different directions. We would have the option of moving forward or backward in the x, y, or z dimension.
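The six directions can be enumerated directly. The short Python sketch below is only an illustration; the function name single_axis_moves and the step size of 1.0 are assumptions, not part of the book's code.

```python
def single_axis_moves(position, step=1.0):
    """Return every position reachable by moving +/- step along one axis."""
    moves = []
    for axis in range(len(position)):
        for direction in (+step, -step):
            candidate = list(position)   # copy so the original is untouched
            candidate[axis] += direction
            moves.append(candidate)
    return moves

moves = single_axis_moves([0.0, 0.0, 0.0])
for m in moves:
    print(m)
```

A vector of n weights yields 2n such moves, so a three-weight network has six.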

Simulated annealing functions by moving forward or backward in all available dimensions. If the algorithm always took the best move, a simple hill-climbing algorithm would result. Hill climbing only accepts moves that improve the score; therefore, it is called a greedy algorithm. To reach the best position, an algorithm will sometimes need to move to a lower position first. As a result, simulated annealing very much follows the expression of two steps forward, one step back.

In other words, simulated annealing will sometimes allow a move to a weight configuration with a worse score. The probability of accepting such a move starts high and decreases. This probability is governed by the current temperature, which simulates the actual metallurgical annealing process in which a metal is heated and then cooled slowly. Figure 5.9 shows the entire process:

Figure 5.9: Simulated Annealing
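The process can be sketched in a few lines of Python. This is one common variant, not necessarily the implementation behind the figure: it assumes an exponential cooling schedule, Gaussian perturbation of every weight, and the acceptance probability exp(-delta / t) for worse moves. The names anneal and bowl, and all parameter defaults, are hypothetical.

```python
import math
import random

def anneal(score, start, k_max=1000, t_start=10.0, t_end=0.001, step=0.5, seed=42):
    """Minimize score(vector) with a basic simulated annealing loop."""
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = list(current), current_score

    for k in range(1, k_max + 1):
        # Assumed exponential cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / k_max)

        # Propose a move: nudge every dimension forward or backward a little.
        candidate = [w + rng.gauss(0.0, step) for w in current]
        candidate_score = score(candidate)

        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with
        # probability exp(-delta / t), which shrinks as t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = list(current), current_score

    return best, best_score

def bowl(v):
    """Toy objective: squared distance from the point (3, -2)."""
    return (v[0] - 3.0) ** 2 + (v[1] + 2.0) ** 2

best, best_score = anneal(bowl, [0.0, 0.0])
print(best, best_score)
```

Because anneal only ever calls score on a plain list of numbers, the same loop could be pointed at a neural network's weight vector by passing in an objective function that loads the weights and evaluates the network.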

A feedforward neural network can utilize simulated annealing to learn the iris data set.

The following listing shows the output from this training:

Iteration #1, Score=0.3937, k=1,kMax=100,t=343.5891,prob=0.9998
Iteration #2, Score=0.3937, k=2,kMax=100,t=295.1336,prob=0.9997
Iteration #3, Score=0.3835, k=3,kMax=100,t=253.5118,prob=0.9989
Iteration #4, Score=0.3835, k=4,kMax=100,t=217.7597,prob=0.9988
Iteration #5, Score=0.3835, k=5,kMax=100,t=187.0496,prob=0.9997
Iteration #6, Score=0.3835, k=6,kMax=100,t=160.6705,prob=0.9997
Iteration #7, Score=0.3835, k=7,kMax=100,t=138.0116,prob=0.9996
...
Iteration #99, Score=0.1031, k=99,kMax=100,t=1.16E-4,prob=2.8776E-7
Iteration #100, Score=0.1031, k=100,kMax=100,t=9.9999E-5,prob=2.1443E-70
Final score: 0.1031

[0.22222222222222213, 0.6249999999999999, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa

[0.1666666666666668, 0.41666666666666663, 0.06779661016949151, 0.04166666666666667] -> Iris-setosa, Ideal: Iris-setosa

...

[0.6666666666666666, 0.41666666666666663, 0.711864406779661, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica

[0.5555555555555555, 0.20833333333333331, 0.6779661016949152, 0.75] -> Iris-virginica, Ideal: Iris-virginica

[0.611111111111111, 0.41666666666666663, 0.711864406779661, 0.7916666666666666] -> Iris-virginica, Ideal: Iris-virginica

[0.5277777777777778, 0.5833333333333333, 0.7457627118644068, 0.9166666666666666] -> Iris-virginica, Ideal: Iris-virginica

[0.44444444444444453, 0.41666666666666663, 0.6949152542372881, 0.7083333333333334] -> Iris-virginica, Ideal: Iris-virginica

[1.178018083703488, 16.66575553359515, -0.6101619300462806, -3.9894606091020965, 13.989551673146842, -8.87489712462323, 8.027287801488647, -4.615098285283519, 6.426489182215509, -1.4672962642199618, 4.136699061975335, 4.20036115439746, 0.9052469139543605, -2.8923515248132063, -4.733219252086315, 18.6497884912826, 2.5459600552510895, -5.618872440836617, 4.638827606092005, 0.8887726364890928, 8.730809901357286, -6.4963370793479545, -6.4003385330186795, -11.820235441582424, -3.29494170904095, -1.5320936828139837, 0.1094081633203249, 0.26353076268018827, 3.935780218339343, 0.8881280604852664, -5.048729642423418, 8.288232057956957, -14.686080237582006, 3.058305829324875, -2.4144038920292608, 21.76633883966702, 12.151853576801647, -3.6372061664901416, 6.28253174293219, -4.209863472970308, 0.8614258660906541, -9.382012074551428, -3.346419915864691, -0.6326977049713416, 2.1391118323593203, 0.44832732990560714, 6.853600355726914, 2.8210824313745957, 1.3901883615737192, -5.962068350552335, 0.502596306917136]
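The t column in the output above is consistent with an exponential cooling schedule that starts at 400 and ends at 0.0001 over kMax = 100 iterations. Those two endpoints are inferred from the printed values, not stated in the text, so the sketch below is an assumption about how the temperature might be computed. The function name temperature is hypothetical.

```python
def temperature(k, k_max=100, t_start=400.0, t_end=0.0001):
    """Exponential cooling: t_start at k=0, falling to t_end at k=k_max."""
    return t_start * (t_end / t_start) ** (k / k_max)

for k in (1, 2, 3, 100):
    print(k, temperature(k))
```

Under these assumed endpoints, the schedule gives approximately 343.5891 at k=1 and 253.5118 at k=3, in line with the first and third iterations of the listing. Each step multiplies the temperature by the same constant factor, so the temperature drops quickly at first and then approaches the final value slowly.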

The initial random neural network starts out with a high multi-class log loss score of 30. As the training progresses, this value falls until it is low enough for training to stop.

For this example, the training stops as soon as the error falls below 10. To determine a good stopping point for the error, you should evaluate how well the network is performing for your intended use. A log loss below 0.5 is often in the acceptable range; however, you might not be able to achieve this score with all data sets.
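For reference, multi-class log loss can be computed as the average negative logarithm of the probability that the network assigned to the correct class. The Python sketch below assumes natural logarithms and clamps probabilities to avoid log(0). The function name, the eps value, and the sample predictions are hypothetical and are not drawn from the iris run.

```python
import math

def multiclass_log_loss(predicted, ideal_index, eps=1e-15):
    """Average negative log of the probability assigned to the correct class."""
    total = 0.0
    for probs, correct in zip(predicted, ideal_index):
        p = min(max(probs[correct], eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -math.log(p)
    return total / len(predicted)

# Three samples, three classes; ideal gives the correct class index per sample.
ideal = [0, 1, 2]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
unsure = [[1 / 3, 1 / 3, 1 / 3]] * 3

loss_confident = multiclass_log_loss(confident, ideal)
loss_unsure = multiclass_log_loss(unsure, ideal)
print(loss_confident, loss_unsure)
```

A network that spreads its probability evenly over three classes scores ln(3), roughly 1.1, while confident and correct predictions push the score toward zero.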

The following URL shows an example of a neural network trained with simulated annealing:

http://www.heatonresearch.com/aifh/vol3/anneal_roc.html

Chapter Summary

Objective functions can evaluate neural networks. They simply return a number that indicates the success of the neural network. Regression neural networks will frequently utilize mean squared error (MSE). Classification neural networks will typically use a log loss or multi-class log loss function. Neural networks can also use custom objective functions.

Simulated annealing can optimize the neural network. You can utilize any of the optimization algorithms presented in Volumes 1 and 2 of Artificial Intelligence for Humans. In fact, you can optimize any vector in this way because the optimization algorithms are not tied to a neural network. In the next chapter, you will see several training methods designed specifically for neural networks. While these specialized training algorithms are often more efficient, they require objective functions that have a derivative.

Chapter 6: Backpropagation Training

Gradient Calculation

Backpropagation

Learning Rate & Momentum

Stochastic Gradient Descent

Backpropagation is one of the most common methods for training a neural network. Rumelhart, Hinton, & Williams (1986) introduced backpropagation, and it remains popular today. Programmers frequently train deep neural networks with backpropagation because it scales well when run on graphics processing units (GPUs). To understand this algorithm, we must examine how a neural network is trained as well as how it processes a pattern.

Classic backpropagation has been extended and modified to give rise to many different training algorithms. In this chapter, we will discuss the most commonly used training algorithms for neural networks. We begin with classic backpropagation and then end the chapter with stochastic gradient descent (SGD).