
2.4 Training an LSTM Network


All parameters in a neural network are updated for a given number of iterations, after which the training phase is complete.

2.4.3 Backpropagation and Backpropagation Through Time

In a deep learning scenario, a neural network commonly consists of multiple hidden layers. Therefore, when performing gradient descent, the chain rule is applied repeatedly from the output layer back to the input layer. This process of repeatedly applying the chain rule is called backpropagation, or the backward pass.

A typical FNN uses backpropagation to calculate the gradients from the output layer backward to the input layer. An RNN, on the other hand, is first unfolded across multiple timesteps; after unfolding, it can be treated as a typical FNN with multiple layers [25]. Backpropagation through time (BPTT) is the algorithm that performs backpropagation in a recurrent neural network. The underlying theory is the same as typical backpropagation, except that BPTT backpropagates through timesteps from the last timestep to the first [21].
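To make the unfolding concrete, the following is a minimal sketch of BPTT for a hypothetical one-unit linear RNN, h_t = w·h_{t-1} + u·x_t, with a squared-error loss on the final hidden state; this toy model and its variable names are illustrative only, not the network used in this thesis. Because w and u are shared across timesteps, the backward pass walks from the last timestep back to the first and accumulates their gradients.

```python
import numpy as np

def bptt_single_unit(x, y, w=0.5, u=0.3):
    """Forward pass over all timesteps, then backpropagation through time
    from the last timestep back to the first, accumulating gradients for
    the shared weights w and u."""
    T = len(x)
    h = np.zeros(T + 1)                # h[0] is the initial hidden state
    for t in range(1, T + 1):          # forward pass through all timesteps
        h[t] = w * h[t - 1] + u * x[t - 1]

    dL_dh = h[T] - y                   # gradient of the squared-error loss at the last timestep
    dw, du = 0.0, 0.0
    for t in range(T, 0, -1):          # backward pass: last timestep -> first
        dw += dL_dh * h[t - 1]         # contribution of timestep t to the shared weight w
        du += dL_dh * x[t - 1]         # contribution of timestep t to the shared weight u
        dL_dh = dL_dh * w              # chain rule: propagate the gradient to h_{t-1}
    return dw, du

print(bptt_single_unit(x=np.array([1.0, 2.0, 3.0]), y=1.0))
```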

2.4.4 Hyperparameter Tuning

Hyperparameters are values chosen by the user before training a model, whereas parameters are values learned by the learning algorithm.

We use grid search to find the best set of hyperparameters. Grid search generates all possible combinations of the given hyperparameters and iterates through them to find the best one [26]. The hyperparameters and their respective candidate values are chosen by the user. Note that the range of some hyperparameters has infinitely many possible values, so it is impossible to try out every combination. Let $N$ be the number of different hyperparameters to be tuned and $H_i$ be the set of candidate values for the i-th hyperparameter; then there are

$$\prod_{i=1}^{N} |H_i|$$

possible combinations. Grid search will exhaust all the combinations and see which one produces the best performance.

We train a model for each of the given combinations and select the best one, namely the combination with the highest validation accuracy. In total we train 9 models, covering the combinations of 3 optimization algorithms and 3 dropout rates.
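As an illustration of the procedure, the sketch below enumerates the same 3 × 3 grid with itertools.product; train_and_validate is a hypothetical placeholder for the actual training routine and simply returns a dummy score here.

```python
import itertools

def train_and_validate(optimizer, dropout):
    """Placeholder for the real training routine; it only returns a dummy
    score so that the sketch runs end to end."""
    return hash((optimizer, dropout)) % 100 / 100.0

# The grid used here: 3 optimization algorithms x 3 dropout rates = 9 models.
optimizers = ["adagrad", "rmsprop", "adam"]
dropout_rates = [0.0, 0.3, 0.5]

best_score, best_combo = -1.0, None
for optimizer, dropout in itertools.product(optimizers, dropout_rates):
    score = train_and_validate(optimizer=optimizer, dropout=dropout)
    if score > best_score:
        best_score, best_combo = score, (optimizer, dropout)

print("best combination:", best_combo, "validation accuracy:", best_score)
```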

2.4.4.1 Dropout Rate

Dropout is a way to regularize a neural network by “shutting down” neurons. The chosen neurons are given a weight of 0 in that round of forward propagation. Since their weight is zero, they have no effect on the network; it is as if the neurons were shut down.

A portion of the neurons is chosen at random to be shut down. This portion, called the “dropout rate”, is a hyperparameter assigned before training and is usually fixed throughout the whole training process. N. Srivastava et al. show that the higher the dropout rate, the lower the test error in experiments conducted on the MNIST dataset [27]. We try out 3 different dropout rates, 0%, 30% and 50%, and find that, on average, the higher the dropout rate, the lower the validation error.
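The sketch below illustrates the idea with a hypothetical apply_dropout helper. It uses the common “inverted dropout” formulation, in which surviving activations are rescaled by 1 / (1 − dropout rate); that rescaling detail is an assumption of the sketch and is not part of the description above.

```python
import numpy as np

def apply_dropout(activations, dropout_rate, training=True, rng=None):
    """Randomly zero out a portion of activations during training.
    Inverted dropout: surviving activations are scaled by 1 / (1 - dropout_rate)
    so their expected value is unchanged and no rescaling is needed at test time."""
    if not training or dropout_rate == 0.0:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    keep_prob = 1.0 - dropout_rate
    mask = rng.random(activations.shape) < keep_prob   # True = keep, False = shut down
    return activations * mask / keep_prob

# With dropout_rate = 0.3, roughly 30% of the activations are zeroed.
hidden = np.ones(8)
print(apply_dropout(hidden, dropout_rate=0.3, rng=np.random.default_rng(0)))
```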

2.4.4.2 Neural Network Optimization Algorithm

When training a neural network, the most typical method is gradient descent, also called batch gradient descent. Batch gradient descent updates the parameters only after all training examples have completed their forward pass. Therefore, parameter updates can be slow when the training set is large.

Stochastic gradient descent (SGD) and mini-batch gradient descent are designed to speed up parameter updates. SGD updates the parameters every time a single training example completes its forward pass. Mini-batch gradient descent updates the parameters after every m training examples complete their forward pass, where m is a positive integer predefined by the user [28].
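As a concrete illustration of the update frequency, and on a toy least-squares problem rather than the LSTM used in this thesis, the sketch below updates the parameters after every m examples; m = 1 recovers SGD and m = len(X) recovers batch gradient descent.

```python
import numpy as np

def minibatch_gradient_descent(X, y, m=16, lr=0.01, epochs=20):
    """Mini-batch gradient descent on a simple least-squares problem:
    the parameters are updated after every m examples."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for start in range(0, len(X), m):
            xb, yb = X[start:start + m], y[start:start + m]
            grad = xb.T @ (xb @ theta - yb) / len(xb)   # gradient on this mini-batch
            theta -= lr * grad                          # update after m examples
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_gradient_descent(X, y, m=16))
```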

Although SGD and mini-batch methods reduce the training time by updating the parameters more frequently, they use a fixed learning rate predefined by the user throughout the whole training phase. However, it is better to have a higher learning rate at the beginning of the training phase and to gradually decrease it as training proceeds [3]. This is where optimization algorithms with adaptive learning rates come to the rescue.

Adagrad uses a different learning rate in every iteration; the algorithm is shown in Algorithm 2. As i increases, τ increases while α decreases. Therefore, the learning rate gradually decreases as the training iterations continue.

Algorithm 2: Adagrad. Initialization: i = 0, n = number of iterations, parameters θ, constants ϵ and ψ, gradient accumulator τ = 0; loop while i < n.
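The sketch below shows the standard Adagrad update that Algorithm 2 describes: the accumulator (τ) sums squared gradients, so the effective learning rate shrinks as training proceeds. The variable names and the small constant eps are the sketch's own and need not match the symbols ϵ and ψ of Algorithm 2 exactly.

```python
import numpy as np

def adagrad_update(theta, grad, accumulator, base_lr=0.01, eps=1e-8):
    """One Adagrad step in its standard formulation: the accumulator (tau)
    sums squared gradients, so the effective learning rate keeps shrinking."""
    accumulator = accumulator + grad ** 2
    adjusted_lr = base_lr / (np.sqrt(accumulator) + eps)   # decreases as the accumulator grows
    theta = theta - adjusted_lr * grad
    return theta, accumulator

theta = np.array([0.5, -0.3])
accumulator = np.zeros_like(theta)
for step in range(3):
    grad = 2 * theta                      # toy gradient of f(theta) = sum(theta^2)
    theta, accumulator = adagrad_update(theta, grad, accumulator)
    print(step, theta)
```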

A learning rate that keeps decreasing can be a disadvantage, because training can slow down; that is, the parameters change only slowly. RMSprop introduces a moving average to resolve this potential problem and can be seen as an upgrade of Adagrad. The moving average helps “discard history from the extreme past” and speeds up the training phase [3]. Adaptive Moment Estimation (Adam) is a more sophisticated optimization algorithm; it is inspired by both Adagrad and RMSprop and combines the strengths of the two [29]. Which optimization algorithm is best remains an open problem: although optimization algorithms equipped with adaptive learning rates perform fairly well, no single optimization algorithm stands out under all scenarios [3].
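For reference, the following sketch shows the textbook update rules of RMSprop and Adam rather than this thesis's own pseudocode; both replace Adagrad's ever-growing accumulator with exponentially decaying moving averages.

```python
import numpy as np

def rmsprop_update(theta, grad, v, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: an exponentially decaying average of squared gradients
    replaces Adagrad's ever-growing accumulator."""
    v = rho * v + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta, v

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: moving averages of the gradient (first moment) and the squared
    gradient (second moment), with bias correction for the early steps."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([0.5, -0.3])
m, v = np.zeros_like(theta), np.zeros_like(theta)
theta, m, v = adam_update(theta, grad=2 * theta, m=m, v=v, t=1)
print(theta)
```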

We try out the Adagrad, RMSprop and Adam optimization algorithms in the hyperparameter tuning phase to see which one best suits our problem. The results shown in Table 5.3 indicate that, on average, Adam achieves better training accuracy than the other two optimization algorithms in our scenario.


Chapter 3

Bitcoin and Blockchain

Bitcoin is the first cryptocurrency to become widely known to the world. It runs on a blockchain, a new technology on which Bitcoin transactions take place. Since the internal features are extracted from the Bitcoin blockchain, this chapter provides a brief introduction to these technologies. Section 3.1 introduces Bitcoin’s history and the mechanism of the blockchain. Section 3.2 explains why Bitcoin is volatile. Section 3.3 provides the background knowledge needed to understand the reasons for extracting the features listed in Section 4.4.

3.1 Bitcoin on Blockchain

Bitcoin runs on a blockchain over the Internet as a peer-to-peer (P2P) network [30]. Introduced by Satoshi Nakamoto in 2008, Bitcoin was first mined in 2009, when the genesis block, block number zero, was mined and 50 bitcoins were created as a reward. Cryptocurrency is a newly coined term that distinguishes it from fiat currency; the main difference is that, unlike fiat currency, a cryptocurrency is neither issued nor controlled by governments. Because it runs on a blockchain, no government can take full control of Bitcoin, although governments worldwide try to keep it under surveillance. Since Bitcoin is not issued by any government and can be traded across national boundaries, some believe it could become a universal currency.

In the blockchain, every node can generate a transaction (tx) and broadcast it to the whole network; after verification, the transaction is put into a block and the block is put onto the blockchain. People who verify transactions are called miners, as an analogy to digging gold out of a mine. The incentive for miners to verify transactions for the whole network is that they are rewarded with Bitcoin if they are the first to mine the block.

Figure 3.1: Blockchain Illustration

Besides providing transaction verification, the Bitcoin blockchain serves as a distributed ledger that records the transactions issued within the blockchain. Every single transaction is written down and can theoretically never be altered. Every node that runs on the Bitcoin blockchain keeps an identical copy of the ledger.
