國立政治大學資訊科學系
Department of Computer Science
National Chengchi University

碩士論文
Master's Thesis

Applying LSTM to Bitcoin Price Prediction
運用 LSTM 進行 Bitcoin 價格預測

研究生:陳維睿
指導教授:胡毓忠

中華民國一百零七年七月
July 2018

DOI: 10.6814/THE.NCCU.CS.005.2018.B02


Applying LSTM to Bitcoin Price Prediction
運用 LSTM 進行 Bitcoin 價格預測

研究生:陳維睿  Student: Wei Rui Chen
指導教授:胡毓忠  Advisor: Yuh Jong Hu

國立政治大學資訊科學系碩士論文

A Thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science

中華民國一百零七年七月
July 2018


Acknowledgements

I am grateful to my thesis advisor, Professor Yuh-Jong Hu, for his guidance throughout the research. It has been a great time studying in the Department of Computer Science, NCCU, where I had the chance to dive deeper into computer science theory. During those days, I gained a better understanding of data science and a lot of experience in data processing and modeling.

Special thanks to my friends Wei-Zong Liao and Ming for their valuable suggestions; I really enjoyed discussing and solving problems together.

Most importantly, I want to thank my mother for the love she gives me. Her full support always warms my heart and encourages me to keep going and never stop trying.

Finally, I would like to express my appreciation to the Nvidia GPU grant for providing the Titan Xp, which made the experiments much faster and simpler.

Abstract

This thesis focuses on applying the Long Short-Term Memory (LSTM) [1] technique to predict the direction of the Bitcoin [2] price. Internal and external features are extracted from the Bitcoin blockchain and an exchange center, respectively. Cryptocurrency is a new type of currency that is traded over the infrastructure of the Internet. Bitcoin (BTC) is the first cryptocurrency and ranks first in market capitalization among all cryptocurrencies. Predicting the Bitcoin price is a novel topic because of its differences from traditional financial assets and its volatility.

As contributions, this thesis provides a guide to processing Bitcoin blockchain data and serves as an empirical study on applying LSTM to Bitcoin price prediction.

摘要

本論文運用長短期記憶模型 (Long Short-Term Memory, LSTM) [1] 來預測比特幣 (Bitcoin) [2] 價格走向。特徵值資料包含內部及外部特徵值,各抽取自比特幣區塊鏈以及交易中心。

加密貨幣是一種新型態的貨幣,其交易運行在網路中。在所有加密貨幣中,比特幣 (Bitcoin, BTC) 是第一個加密貨幣,且目前擁有最高的市值。預測比特幣價格是一個新興的研究題目,因為其與傳統金融資產有所差異,且其價格非常波動。

本論文對比特幣區塊鏈資料處理方法提出指引,並將長短期記憶模型實務應用到比特幣價格預測。

Contents

Acknowledgements
Abstract
摘要

1 Introduction
  1.1 Research Objective
  1.2 Deep Learning on Time Series Data
  1.3 Predicting Bitcoin Price
  1.4 Related Works

2 LSTM on Time Series Data
  2.1 Neural Network and Deep Learning
  2.2 Recurrent Neural Network
  2.3 Long Short-Term Memory
  2.4 Training an LSTM Network
    2.4.1 Loss Function
    2.4.2 Gradient Descent
    2.4.3 Backpropagation and Backpropagation Through Time
    2.4.4 Hyperparameter Tuning
      2.4.4.1 Dropout Rate
      2.4.4.2 Neural Network Optimization Algorithm

3 Bitcoin and Blockchain
  3.1 Bitcoin on Blockchain

  3.2 Bitcoin as a Cryptocurrency
  3.3 Mining Bitcoin

4 Machine Learning Pipeline
  4.1 Pipeline
  4.2 Data Collection
    4.2.1 Collecting Bitcoin Blockchain Data
    4.2.2 Collecting Data from Exchange Center
  4.3 Data Cleaning
  4.4 Data Processing
    4.4.1 Internal Features Extraction
    4.4.2 External Features Extraction
    4.4.3 Align and Combine Internal and External Features
    4.4.4 Min-Max Normalization
    4.4.5 Train/Validation/Test Split

5 Methodology
  5.1 Tools and Platform
  5.2 Experiments
    5.2.1 Dataset Summary
    5.2.2 Neural Network Architecture
    5.2.3 Hyperparameter Tuning
  5.3 Results
    5.3.1 Performance Comparison with Related Works

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

Appendices
A Feature Description and Representation
B Feature Calculation

References

List of Figures

1.1 Volatility of Bitcoin Price
1.2 Bitcoin and S&P500
2.1 FNN
2.2 RNN
2.3 An LSTM Unit
3.1 Blockchain Illustration
3.2 Comparison between EUR/USD and BTC/USD
3.3 Upward Trend
4.1 Machine Learning Pipeline
4.2 Communicating with Bitcoin Client
4.3 Train/Validation/Test Split
5.1 LSTM Network Architecture Visualization
5.2 Many to One Model

List of Tables

5.1 Machine Specification
5.2 Dataset Summary
5.3 Hyperparameters Combinations Performance
5.4 Optimization Algorithms Average Performance
5.5 Dropout Rate Average Performance
5.6 Performance Comparison with Related Works

Chapter 1
Introduction

Bitcoin has been attracting more and more attention from both academia and industry, and its price has been very volatile. Motivated by the novelty of this new technology, we strive to apply the LSTM technique to predict the direction of the Bitcoin price. Chapter 1 provides the objective and an overview of this research.

1.1 Research Objective

Bitcoin has been considered one of the most influential technologies of the near future. Not only are entrepreneurs looking for applications over Bitcoin blockchain technology, but speculators and investors are also trading Bitcoin as a financial asset. Without regulation by any government or organization, the price of Bitcoin has been very volatile. We strive to apply the LSTM technique to predict the rise and fall of the Bitcoin price. The internal features are extracted through official Bitcoin blockchain RPCs, and the external features are collected from the exchange center GDAX.

1.2 Deep Learning on Time Series Data

A time series dataset is a sequential dataset that involves a time aspect. In contrast to datasets whose datapoints can be independent from each other, the datapoints in a time series are related to each other in chronological order. Due to this dependency, in the deep learning field, a traditional feedforward neural network (FNN) does not perform well at modeling time series data, because it treats datapoints independently and that breaks their temporal structure.

The Recurrent Neural Network (RNN) was developed to overcome this lack of memory. An RNN is known for its recurrent nature: the state of the RNN at time t refers back to the state at time t−1. This can be expressed in the form [3]

s_t = f(s_{t−1}; θ)

where s_t denotes the state at timestep t and θ denotes the parameters. Since the state at time t−1 can affect the state at time t, the order of the input data matters: the training of an RNN can differ if the order changes.

Deep learning has been a state-of-the-art technique for finding patterns in data. Theoretically, neural networks are capable of approximating complex functions [4]. Due to this capability of deep learning, the RNN has gained more and more attention. LSTM, one specific type of RNN, was proposed in 1997 [1]. With a more complex unit structure, an LSTM network is able to memorize past information better than a traditional RNN [5].

1.3 Predicting Bitcoin Price

The Bitcoin price is very volatile. Taking the stock market as an example, governments are able to regulate the stock market in ways such as enforcing daily rise and fall percentage limits, or even closing the market when the situation becomes extreme. This limits the volatility of the stock market. When it comes to Bitcoin, a brand-new financial asset traded 24/7 without government regulation, the volatility becomes high. From 2017/11/13 to 2017/12/18, the Bitcoin price rose from about 6000 to 20000. However, from 2017/12/18 to 2018/02/06, it fell back to around 6000. The price grew over 300% and dropped about 70% in just three months, as shown in Figure 1.1. Measuring volatility with the standard deviation [6][7], Bitcoin is much more volatile than the stock market. From 2016/02 to 2018/02, the standard deviation of the Bitcoin price is 4125.8 and that of the S&P500 is 219.1; the Bitcoin price is about 18.8 times more volatile than the S&P500. Figure 1.2 shows the line chart of the Bitcoin price and the S&P500.
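As a minimal sketch, this standard deviation comparison can be reproduced with pandas; the CSV file names and the "close" column are hypothetical stand-ins for price series covering 2016/02 to 2018/02:

import pandas as pd

# Hypothetical daily close-price files for the 2016/02-2018/02 window.
btc = pd.read_csv("btc_usd.csv")["close"]
sp500 = pd.read_csv("sp500.csv")["close"]

btc_std, sp500_std = btc.std(), sp500.std()  # thesis reports 4125.8 and 219.1
print(f"volatility ratio: {btc_std / sp500_std:.1f}")  # about 18.8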

Figure 1.1: Volatility of Bitcoin Price

Figure 1.2: Bitcoin and S&P500

1.4 Related Works

Quantitative approaches have been applied to handle and predict market data since the 2000s. Huang et al. used the Support Vector Machine (SVM) to predict the direction of the stock market [8]. Given a large dataset, a neural network is able to find the relationship between a number of variables and a stock price. Hamid et al. implemented a three-layered FNN to predict the S&P500 index [9]. Vejendla et al. used an RNN with 4 hidden layers to predict stock markets, and the RNN performed better than a multilayer FNN in their study [10]. Akita et al. [11] applied LSTM to stock price prediction.

Studies on Bitcoin price prediction started in the 2010s. Matta et al. found that the number of positive tweets about Bitcoin on Twitter is positively correlated with its price [12]. That work extracted features from Twitter such as the number of tweets and the number of positive tweets related to Bitcoin. Madan et al. selected features from the Bitcoin blockchain and an exchange center, and predicted the rise and fall of the Bitcoin price with a random forest at 57.4% accuracy [13]. Greaves et al. achieved 55.1% accuracy in price direction binary classification with a neural network [14]. Using LSTM and RNN for Bitcoin price classification, McNally achieved 52.78% in binary classification with LSTM [15].

In more recent studies, researchers have tried to solve the Bitcoin prediction problem from an economics perspective. Jang et al. included macroeconomic indicators and currency exchange ratios [16]. The former included indexes such as the S&P 500, Nasdaq and Dow30, and the latter included CNY/USD, JPY/USD, etc. Bitcoin, or more generally cryptocurrency, is gaining more and more attention from researchers with economics and finance backgrounds.

Chapter 2
LSTM on Time Series Data

Section 2.1 gives the reason for adopting deep learning in this research and a brief introduction to neural networks. Sections 2.2 and 2.3 introduce the RNN and LSTM models. Section 2.4 describes the procedure for training an LSTM network.

2.1 Neural Network and Deep Learning

There are a great number of neurons in the human brain, and they transmit information to each other using electrical and chemical signals. Multiple neurons form a neural network (NN). A neural network is capable of processing and reacting to input information. A typical FNN is illustrated in Figure 2.1.

An Artificial Neural Network (ANN) is a system mimicking the operation of a biological neural network. In the machine learning field, an ANN is used to find a function that fits the given data. A Deep Neural Network (DNN) is basically an ANN with numerous layers and possibly numerous neurons in a layer [4]. It is believed that with more layers in an NN, the power to fit data increases. Also, with greater depth, an NN could potentially learn abstract patterns [17][18]. Deep learning is the field that puts DNNs to use: we use a DNN to "learn" the pattern from data.

Figure 2.1: FNN

2.2 Recurrent Neural Network

A traditional FNN does not take the order of datapoints into consideration. However, in a time series scenario, it is obvious that a value depends on its previous values. Therefore, a simple FNN may not fulfill the need; this is where the RNN comes in handy [19]. A typical RNN with one RNN unit is illustrated in Figure 2.2 [20]. The figure shows the folded form on the left-hand side and the unfolded form on the right-hand side.

A Recurrent Neural Network (RNN) is capable of remembering past information by passing the current information to the cell of the next timestep. In other words, timestep t remembers timestep t−1 because it receives the information of timestep t−1 [5]. As shown in Figure 2.2, when an RNN is unfolded, it may easily become deep, depending on how long the training sequence is. Similar to backpropagation for an FNN, an RNN uses backpropagation through time (BPTT) to update the weights and minimize the loss [21]. However, as an RNN's depth grows large, it can suffer from the vanishing and exploding gradient problems.

Figure 2.2: RNN

In the vanishing gradient problem, gradients of small numeric value can become values near zero after being multiplied a large number of times. On the other hand, in the exploding gradient problem, gradients of large numeric value can become very large numbers after being multiplied a large number of times.

Due to this limitation, the RNN is not suitable for learning patterns from a long sequence, and sometimes it is critical to learn patterns from a long-term view. LSTM was proposed to overcome this problem and strengthen the ability to remember longer temporal structure [5][22].

2.3 Long Short-Term Memory

An LSTM network is a network built with LSTM cells. It is a network like the one in Figure 2.2, but with every U replaced by a more complex LSTM cell, illustrated in Figure 2.3. There are three gates in an LSTM cell: the input gate, the forget gate and the output gate. Usually the sigmoid function is used, so that their outputs are squashed between 0 and 1 and can be interpreted as the level of openness of the gate.

There are 3 inputs and 8 weights for one LSTM cell. X_t, h_{t−1}, C_{t−1} are the inputs. W_f, W_i, W_o, W_c, U_f, U_i, U_o, U_c are the weights for the forget gate, input gate, output gate and candidate cell state C′, respectively. There is also one bias for each gate and for the candidate cell state: b_f, b_i, b_o, b_c for the forget, input and output gates and the candidate cell state, respectively.

There are 2 outputs for one LSTM cell: C_t and h_t.

Two activation functions are used: the sigmoid and the hyperbolic tangent, denoted by σ and tanh and defined respectively as

σ(x) = 1 / (1 + e^{−x})
tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

A typical LSTM unit is illustrated in Figure 2.3.

Figure 2.3: An LSTM Unit

Let the forget gate, input gate and output gate's output values at timestep t be f_t, i_t, o_t respectively, and let the candidate cell state at timestep t be C′_t. Then the output values of the 3 gates and the 2 outputs of the cell are calculated as below:

f_t = σ(W_f · X_t + U_f · h_{t−1} + b_f)
i_t = σ(W_i · X_t + U_i · h_{t−1} + b_i)
o_t = σ(W_o · X_t + U_o · h_{t−1} + b_o)
C′_t = tanh(W_c · X_t + U_c · h_{t−1} + b_c)
C_t = f_t · C_{t−1} + i_t · C′_t
h_t = o_t · tanh(C_t)
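For concreteness, the equations above can be written as a short NumPy sketch of one cell step; the dictionary-based weight layout is an assumption made for readability, not the layout of any particular library:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b are dicts keyed by "f", "i", "o", "c" (gates and candidate).
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])     # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])     # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])     # output gate
    c_cand = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate cell state C'
    c_t = f_t * c_prev + i_t * c_cand  # new cell state
    h_t = o_t * np.tanh(c_t)           # new hidden state
    return h_t, c_t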

2.4 Training an LSTM Network

There are two phases in training an LSTM network: forward propagation and backward propagation. In forward propagation, a training example is fed into the network sequentially; by multiplying the weights and calculating the cell states at each timestep, the output, or prediction, is produced and sent into the loss function to measure the difference between the prediction and the ground truth. After forward propagation, backward propagation (backpropagation) calculates the gradients backward from the last timestep to the first timestep and updates the weights, or parameters, to improve the network. After backward propagation, another iteration of forward and backward propagation starts over again.

2.4.1 Loss Function

Training an NN is typically a process of minimizing the loss function. In the training phase, the distance between the ground truth Y and the prediction of the neural network Ŷ is called the "loss". We use the categorical cross entropy loss [23] to train our LSTM network, denoting the loss function as L:

L = Σ_{i=1}^{n} −(y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i))

where i denotes the i-th training example and n denotes the number of training examples. y denotes the ground truth and ŷ denotes the prediction, in the form of a probability. The probability can be seen as the "confidence" of the prediction. For one arbitrary training example i, there are 4 extreme cases: 2 best cases and 2 worst cases, shown below.

ŷ = y = 1 ⟹ L_i = −[1 · log(1) + (1 − 1) · log(1 − 1)] = −[1 · 0 + 0 · (−∞)] = 0
ŷ = y = 0 ⟹ L_i = −[0 · log(0) + (1 − 0) · log(1 − 0)] = −[0 · (−∞) + 1 · 0] = 0
ŷ = 0, y = 1 ⟹ L_i = −[1 · log(0) + (1 − 1) · log(1 − 0)] = −[1 · (−∞) + 0 · 0] = ∞
ŷ = 1, y = 0 ⟹ L_i = −[0 · log(1) + (1 − 0) · log(1 − 1)] = −[0 · 0 + 1 · (−∞)] = ∞

In most situations the prediction ŷ will be in (0, 1), so the cases are not extreme. As illustrated above, categorical cross entropy penalizes harder when the difference between prediction and ground truth is larger. Due to this design, the LSTM network strives to output a higher probability when the ground truth is 1 and a lower probability when the ground truth is 0. After the losses are calculated, backward propagation and gradient descent begin.

2.4.2 Gradient Descent

In the neural network scenario, gradient descent is a technique for approaching the minimum of the loss function. Since the loss function is a function of the parameters within the network, taking the partial derivative of the loss function with respect to a certain parameter retrieves a measure of how this parameter affects the loss function if the parameter changes one unit [24]. That is, ∂L/∂θ is a value indicating how a parameter θ affects the loss function L.

The gradient descent algorithm for updating one parameter is shown in Algorithm 1.

Initialization: parameter θ = 0, int i = 0, int n = number of iterations
while i < n do
    ∇θ = ∂L/∂θ;
    θ := θ − ∇θ;
    i++;
end
Algorithm 1: Gradient Descent

Theoretically, the loss reduces as the parameters update their values. After a predefined number of updates, the loss may or may not be at its global minimum. Since we use the cross entropy loss as our loss function, as discussed in 2.4.1, we derive the partial derivative of the cross entropy loss with respect to an arbitrary parameter θ:

∂L/∂θ = (∂L/∂ŷ)(∂ŷ/∂θ) = (−y/ŷ + (1 − y)/(1 − ŷ)) · ∂ŷ/∂θ
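A toy sketch tying 2.4.1 and 2.4.2 together, assuming a single parameter θ with a sigmoid output ŷ = σ(θ), for which the derivative above simplifies to ∂L/∂θ = ŷ − y (this is not the network's actual training code, only an illustration of the update rule):

import math

y = 1.0      # ground truth
theta = 0.0  # single parameter; the prediction is y_hat = sigmoid(theta)
lr = 0.5     # fixed learning rate

for _ in range(100):
    y_hat = 1.0 / (1.0 + math.exp(-theta))
    grad = y_hat - y    # dL/dtheta for cross entropy through a sigmoid
    theta -= lr * grad  # the update step of Algorithm 1

loss = -(y * math.log(y_hat))  # shrinks toward 0 as y_hat approaches 1
print(y_hat, loss)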

All parameters in the neural network are updated for a given number of iterations, and then the training phase is complete.

2.4.3 Backpropagation and Backpropagation Through Time

In a deep learning scenario, a neural network commonly consists of multiple hidden layers. Therefore, when performing gradient descent, the chain rule is applied repeatedly from the output layer to the input layer. This process of repeatedly applying the chain rule is called backpropagation, or the backward pass.

A typical FNN uses backpropagation to calculate the gradients from the output layer backward to the input layer. An RNN, on the other hand, is unfolded across multiple timesteps. After unfolding, an RNN can be seen as a typical FNN with multiple layers [25]. Backpropagation through time (BPTT) is the algorithm for performing backpropagation in a recurrent neural network. The underlying theory is the same as typical backpropagation, except that it backpropagates through timesteps, from the last timestep to the first [21].

2.4.4 Hyperparameter Tuning

Hyperparameters are values chosen by the user before training a model, while parameters are the values learned by the learning algorithm.

We use grid search to find the best set of hyperparameters. Grid search generates all possible combinations of the given hyperparameters and iterates through all of them to find the best combination [26]. The hyperparameters and their respective possible values are chosen by the user. Note that the range of some hyperparameters can offer infinite possibilities, in which case it is impossible to try out all combinations. Let N be the number of different hyperparameters to be tuned and H_i be the i-th set of hyperparameter options; then there are

∏_{i=1}^{N} |H_i|

possible combinations. Grid search exhausts all the combinations and sees which one produces the best performance.
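A minimal sketch of this grid search over the two hyperparameters tuned in this thesis; train_and_validate is a hypothetical placeholder for training one model and returning its validation accuracy:

from itertools import product

def train_and_validate(optimizer, dropout):
    # Placeholder: train one model with these hyperparameters and
    # return its validation accuracy. Stubbed here for illustration.
    return 0.0

optimizers = ["adagrad", "rmsprop", "adam"]
dropout_rates = [0.0, 0.3, 0.5]

# |H_o| x |H_d| = 3 x 3 = 9 combinations, exhausted one by one.
best = max(product(optimizers, dropout_rates),
           key=lambda combo: train_and_validate(*combo))
print("best combination:", best)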

We select the best hyperparameter combination after training models with all given possible combinations, choosing the one with the highest validation accuracy. We train 9 models from the combinations of 3 optimization algorithms and 3 dropout rates.

2.4.4.1 Dropout Rate

Dropout is a way to regularize a neural network by "shutting down" neurons. The chosen neurons are given weight 0 in that round of forward propagation; since a neuron given a weight of zero has no effect on the network, it appears to be shut down. The neurons to shut down are chosen randomly under a given portion, which is a hyperparameter assigned before training and usually fixed throughout the whole training process. This portion is called the "dropout rate". N. Srivastava et al. show that the higher the dropout rate, the lower the test error in experiments conducted on the MNIST dataset [27]. We try out 3 different dropout rates, 0%, 30% and 50%, and find that on average the higher the dropout rate, the lower the validation error.

2.4.4.2 Neural Network Optimization Algorithm

When training a neural network, the most typical method is gradient descent, or equivalently batch gradient descent. Batch gradient descent updates the parameters after all training examples have done their forward pass. Therefore, the update of parameters can be slow when the training set is large.

Stochastic gradient descent (SGD) and mini-batch gradient descent are designed to speed up the update of parameters. SGD updates the parameters every time a training example completes its forward pass. Mini-batch gradient descent updates the parameters after every m training examples complete their forward pass, where m is a positive integer predefined by the user [28].

Although the SGD and mini-batch methods reduce training time by updating the parameters more frequently, they use a fixed learning rate predefined by the user throughout the whole training phase. However, it is better to have a higher learning rate at the beginning of the training phase and gradually decrease it during training [3]. This is where optimization algorithms with adaptive learning rates come to the rescue.

Adagrad uses a different learning rate in every iteration; its procedure is shown in Algorithm 2.

As i goes up, τ goes up while α goes down. Therefore, the learning rate gradually decreases as the training iterations continue.

Initialization: int i = 0, int n = number of iterations, parameter θ
Initialization: constant values ϵ, ψ, gradient accumulator τ = 0
while i < n do
    g = gradient;
    τ = τ + g · g;
    α = ψ / (ϵ + √τ);
    θ = θ − α · g;
    i = i + 1;
end
Algorithm 2: Adagrad Algorithm

It can be a disadvantage for training when the learning rate keeps lowering, because training can slow down; that is, the parameters change slowly. RMSprop introduces a moving average to resolve this potential problem and can be seen as an upgrade of Adagrad: it helps "discard history from the extreme past" to speed up the training phase [3]. Adaptive Moment Estimation (Adam) is a more sophisticated optimization algorithm, inspired by both Adagrad and RMSprop and combining the good parts of both [29].

It is still an open problem which optimization algorithm is best. Although optimization algorithms equipped with adaptive learning rates perform fairly well, no one specific optimization algorithm stands out under all scenarios [3]. We try out the Adagrad, RMSprop and Adam optimization algorithms in the hyperparameter tuning phase to see which best suits our problem. The results shown in Table 5.3 indicate that Adam on average performs better in training accuracy than the other two optimization algorithms in our scenario.
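A NumPy sketch of the Adagrad update in Algorithm 2; the accumulated squared gradient τ shrinks the effective learning rate α over time:

import numpy as np

def adagrad_update(theta, grad, tau, psi=0.01, eps=1e-8):
    tau = tau + grad * grad             # accumulate squared gradients
    alpha = psi / (eps + np.sqrt(tau))  # per-parameter adaptive learning rate
    theta = theta - alpha * grad
    return theta, tau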

Chapter 3
Bitcoin and Blockchain

Bitcoin is the first cryptocurrency to have become widely known to the world. It runs on the blockchain, a new technology where Bitcoin transactions take place. Since the internal features are extracted from the Bitcoin blockchain, this chapter provides a brief introduction to these technologies. Section 3.1 introduces Bitcoin's history and the mechanism of the blockchain. Section 3.2 explains why Bitcoin is volatile. Section 3.3 serves as background knowledge for understanding the reasons for extracting certain features listed in section 4.4.

3.1 Bitcoin on Blockchain

Bitcoin runs on a blockchain over the Internet as a peer-to-peer (P2P) network [30]. Introduced by Satoshi Nakamoto in 2008, the first bitcoins were mined in 2009 when the genesis block, block number zero, was mined and 50 bitcoins were created as a reward.

Cryptocurrency is a newly coined term distinguishing it from fiat currency. The main difference is that, unlike fiat currency, cryptocurrency is neither issued nor controlled by governments. Although governments worldwide try to keep Bitcoin under surveillance, running on a blockchain means no government can take full control of it. Since Bitcoin is not issued by any government and can be traded across national boundaries, some believe it could become a universal currency.

In a blockchain, every node can generate a transaction (tx) and broadcast it to the whole network; after verification, the transaction will be put into a block, and the block will be put onto the blockchain.

Figure 3.1: Blockchain Illustration

People who verify transactions are called miners, as an analogy to digging gold out of a mine. The incentive for miners to verify transactions for the whole network is that they are rewarded with bitcoins if they are the first to mine the block.

Besides providing the functionality of transaction verification, the Bitcoin blockchain serves as a distributed ledger that records the transactions issued within the blockchain. Every single transaction is written down and can theoretically never be altered. Every node that runs on the Bitcoin blockchain keeps an identical version of the ledger.

3.2 Bitcoin as a Cryptocurrency

The first real-world purchase with bitcoin as a currency was made in 2010, when a programmer indirectly bought two pizzas with 10000 bitcoins, which were worth over 80 million dollars in 2018. Until now, Bitcoin is still not a popularized currency; that is, not a lot of people use bitcoin to buy things. Some argue that Bitcoin is not a kind of "currency" because it is not yet widely used to buy things; also, a currency's value should not be so volatile. Despite the criticism, Bitcoin attracts investors and speculators as a novel financial asset, and there are dozens of exchange centers that provide Bitcoin buy/sell services. Because of its novelty and high potential, people expect the value of bitcoin to rise and see bitcoin as a financial asset rather than a currency.

Figure 3.2: Comparison between EUR/USD and BTC/USD

Figure 3.2 shows the difference between a fiat currency, the Euro (EUR), and Bitcoin. EUR/USD stays at roughly 1:1, so its graph appears as a nearly horizontal line. Compared with the fiat currency EUR, Bitcoin is clearly much more volatile. The volatile nature of bitcoin draws criticism arguing that the pursuit of bitcoin is a gold rush, and this volatility hinders Bitcoin from becoming a currency that can be used for daily purchases. Between 2016/02 and 2018/02, the standard deviation of EUR/USD is 0.04 while the standard deviation of the Bitcoin price is 4125.8.

By the basic idea of monetary supply, when the money in the market rises, that is, in one way, when a central bank prints more money and spreads it in the market, inflation can come and the value of the currency can fall. In Bitcoin's case, however, the total amount of Bitcoin was limited and fixed when it was designed. Due to this limited supply, some investors believe Bitcoin could play a role like gold, which is always valuable because of its scarcity. The price soared from 2016 to 2018; it encountered a few dips but maintained an overall rising trend, as shown in Figure 3.3.

In the Bitcoin network, a user can possess one or more pseudonymous identities, called addresses. On the other hand, an address can be possessed by an individual or a group of people. Anonymity is preserved until one decides to reveal one's real-world identity, making the connection between one's address and one's identity.

Figure 3.3: Upward Trend

The number of unique addresses involved in transactions serves as one of the features, because it can be an indication of how many people, or groups of people, are interested in Bitcoin.

3.3 Mining Bitcoin

This section serves as a supplement explaining the basic idea of mining Bitcoin, because the selected features in the model, including block difficulty, block confirmation time and mining revenue, are closely related to mining activity.

Proof of Work (POW) is the mechanism Bitcoin adopts to make sure every node in the distributed P2P network cooperates correctly [31]. POW in the Bitcoin network works by requesting miners to solve a very computationally costly problem: finding a SHA256 hash of a given string, the content of the transactions, that comes out with a specific number of leading zeros [32]. To be specific, the task is to find the nonce value that satisfies the following inequality [33]. Letting || denote concatenation, the problem is shown as the formula below:

SHA256(nonce || previous_block_hash || tx_0 || tx_1 || ... || tx_n) < target

where target is a 256-bit number. The SHA256 hash value must be lower than the set target. Normally, miners increase the nonce by 1 at each round of guessing.

Since a little change in the input causes a huge change in the hash value, it is theoretically not possible to deduce the correct nonce. Therefore, solving the problem is computationally costly: miners have to go through a large number of trials and errors, without any shortcut.

The miner who first solves the problem has the right to edit the first transaction of that block, the coinbase transaction, and gets paid with the mining reward provided by the blockchain itself plus the transaction fees provided by the people who issue transactions.

Since Bitcoin has become more and more valuable, the number of miners has increased rapidly. Miners are strongly motivated to pursue the mining reward and keep increasing their computing power. The growth of the total computing power of the network can shorten the confirmation time taken for a new block to be mined. Satoshi Nakamoto proposed a method to tackle this problem: once the total computation power grows, the difficulty of finding the right hash value grows to lengthen the confirmation time. We therefore pick difficulty, block confirmation time and mining revenue as features, since they reflect the condition of the Bitcoin network.
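A toy sketch of the nonce search described above, using Python's hashlib; real Bitcoin hashes a binary block header with double SHA256, which is simplified away here:

import hashlib

def mine(prev_hash: str, txs: str, target: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{nonce}{prev_hash}{txs}".encode()).hexdigest()
        if int(digest, 16) < target:  # hash below target: block found
            return nonce
        nonce += 1                    # otherwise try the next nonce

# A deliberately easy target: about 1 in 2^16 hashes succeeds.
print(mine("0000abcd", "tx0||tx1", target=2 ** 240))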

Chapter 4
Machine Learning Pipeline

In a typical data science scenario, a machine learning pipeline shows the workflow from data collection to model optimization. In this chapter, each stage in the pipeline is explained except the modeling and model optimization stages, which are presented later in chapter 5.

4.1 Pipeline

In the data collection stage, data is collected from the Bitcoin client and the Global Digital Asset Exchange (GDAX) [34]. The exchange center provides the external features and the Bitcoin blockchain provides the internal features.

In the data cleaning stage, the data retrieved from the exchange center APIs is in a structured format, clean csv, so it does not need to be cleaned, whereas the data from the Bitcoin blockchain is in a semi-structured format, json, and requires further cleaning. Both block and transaction data are retrieved in json format.

In the data processing stage, the desired blockchain features are calculated based on the json files generated in the data cleaning stage, producing a csv file that includes block information and transaction information for each block between block #400000 and block #510000, ranging from 2016/02 to 2018/02. The blockchain csv file and the GDAX csv file are then aligned by timestamp and combined into one csv file. The combined file is then normalized by min-max normalization into numeric values between 0 and 1. Afterwards, the dataset is split into training and testing data.

In the modeling stage, training data is passed into a multi-layer LSTM model. In the model optimization stage, the LSTM model is tuned with the validation set and evaluated with the test set. The machine learning pipeline is illustrated in Figure 4.1.

Figure 4.1: Machine Learning Pipeline

4.2 Data Collection

4.2.1 Collecting Bitcoin Blockchain Data

There are mainly two ways to collect data from the Bitcoin blockchain: the official Bitcoin client or Bitcoin-related websites. Bitcoin-related websites such as blockchain.info [35] provide a graphical user interface and APIs for users and programmers to interact with. However, websites may not open all blockchain historical data to users, or may limit connections. Therefore, we retrieve data from the Bitcoin client. The Bitcoin client serves as an interface for users to communicate with the Bitcoin blockchain, providing various RPCs for users to call. bitcoin-cli, developed by the official Bitcoin community, communicates with the Bitcoin blockchain. The procedure for retrieving data from the Bitcoin client by calling RPCs is shown in Figure 4.2.

Figure 4.2: Communicating with Bitcoin Client
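As a sketch of this step, a synced local node can be queried by shelling out to bitcoin-cli (getblockhash and getblock are standard Bitcoin Core RPCs; error handling and batching over the block range are omitted):

import json
import subprocess

def cli(*args):
    return subprocess.check_output(["bitcoin-cli", *args]).decode().strip()

block_hash = cli("getblockhash", "400000")
block = json.loads(cli("getblock", block_hash, "2"))  # verbosity 2: full transactions
print(block["height"], block["difficulty"], block["size"], len(block["tx"]))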

4.2.2 Collecting Data from Exchange Center

An exchange center is where Bitcoin and other cryptocurrencies are traded. Exchange centers serve as a platform for buyers and sellers to bargain over a price; if a buying price and a selling price meet, a transaction is performed. The transaction fee paid by buyer and seller is the revenue of the exchange center. Most exchange centers support USD; therefore, in this thesis, the price of Bitcoin is in USD.

GDAX is an exchange center that provides Bitcoin and other cryptocurrency trading for users all around the world. GDAX provides REST APIs for developers to request data. We retrieve a dataset of 1-minute time granularity. Since the amount of data can largely affect training performance, bigger data is preferred: rather than daily data, we collect data at 1-minute intervals to increase the number of training examples. There are 1440 datapoints each day, so we get a dataset 1440 times larger than one with daily datapoints. The dataset ranges from 2016/02/26 to 2018/02/20, and 1014508 datapoints are collected. In comparison, there would be only 725 datapoints if collecting daily data.
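A sketch of fetching 1-minute candles, assuming GDAX's historic-rates endpoint as it was exposed at the time (pagination and rate limits are ignored):

import requests

resp = requests.get(
    "https://api.gdax.com/products/BTC-USD/candles",
    params={"start": "2016-02-26T00:00:00Z",
            "end": "2016-02-26T05:00:00Z",
            "granularity": 60},  # 60 seconds per candle
)
# Each row is [time, low, high, open, close, volume].
for t, low, high, open_, close, volume in resp.json():
    print(t, open_, close, low, high, volume)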

4.3 Data Cleaning

In this stage, we clean the data retrieved from the Bitcoin client, while the data from the GDAX exchange center remains the same. Data cleaning is crucial here because the transaction data retrieved from the Bitcoin blockchain is large and only a small portion of it is needed. Therefore, we keep only a few values from each transaction, requiring massively less disk space than storing all transactions in full.

4.4 Data Processing

Data processing is one of the most important stages before training a machine learning model. In the data processing stage, researchers have to process the cleaned data into a format that is ready for the machine learning algorithm to train on. Why and how the internal and external features are extracted is discussed in 4.4.1 and 4.4.2, respectively. For more details, please refer to Appendices A and B.

4.4.1 Internal Features Extraction

13 features are extracted from the Bitcoin blockchain. Block height, block difficulty and block size can be directly extracted from the block json file; the other features require processing to obtain. These features can be divided into three groups: length of the blockchain, miners' incentive and market condition. The length of the blockchain is represented by the block height. Miners' incentive is represented by block difficulty, block confirmation time, miner's revenue, and miner's revenue per tx, which are related to the mining activity discussed in section 3.3. Number of tx, number of unique addresses, block size, block total output, tx output mean, tx output median, tx output variance and tx output std relate to market condition.

4.4.2 External Features Extraction

14 features are extracted from the exchange center. The price is the value at which a buy order and a sell order match. We retrieve the open, close, low, high and volume features from the exchange center. Open and close represent the price at the starting point and the ending point of the time interval. Low and high represent the lowest and highest price in the time interval. Volume is the number of Bitcoin traded in the time interval. We further process the external features into the moving average, moving median and moving standard deviation over 5, 30 and 60 minutes, trying to capture the short, medium and long term patterns of the price.
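A pandas sketch of that derivation; per the formulas in Appendix B, each window ends at t−1, hence the shift before rolling (the file and column names are assumptions):

import pandas as pd

df = pd.read_csv("gdax_1min.csv")  # hypothetical 1-minute candle file
for w in (5, 30, 60):
    roll = df["close"].shift(1).rolling(window=w)  # window covers t-w .. t-1
    df[f"moving_average_{w}"] = roll.mean()
    df[f"moving_median_{w}"] = roll.median()
    df[f"moving_std_{w}"] = roll.std()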

4.4.3 Align and Combine Internal and External Features

Datapoints of the internal features have variable frequency, because the time spent mining a block varies. Although S. Nakamoto designed the difficulty metric to be automatically adjusted to control the time interval between new blocks, the interval between two blocks still varies with the total computation power of the Bitcoin network; on average, the frequency is about 10 minutes per block. Datapoints of the external features, however, have a fixed frequency of 1 datapoint per minute. Thus, it is necessary to align these two datasets of different frequencies. The method is simply to distribute each block's internal features across the multiple external datapoints within its time interval. Because the internal features represent the blockchain and the external features represent the exchange center, let the datasets be B and E and the timestamp be ts. The algorithm is shown as Algorithm 3.

Initialization: i = 0, j = 0
while j < len(B) do
    if i > len(E) or j > len(B) then
        break;
    if E[ts][i] > B[ts][j] and E[ts][i] < B[ts][j+1] then
        Concatenate B features into E;
        i++;
        continue;
    j++;
end
Algorithm 3: Alignment of Internal and External Features
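The same alignment can be sketched with pandas: each 1-minute row of E picks up the features of the most recent block at or before its timestamp, so one block's features are repeated for every minute until the next block appears, as in Algorithm 3 (file and column names are assumptions):

import pandas as pd

E = pd.read_csv("external.csv", parse_dates=["ts"]).sort_values("ts")
B = pd.read_csv("internal.csv", parse_dates=["ts"]).sort_values("ts")

# Backward as-of match: the latest B row with B.ts <= E.ts.
combined = pd.merge_asof(E, B, on="ts", direction="backward")
combined.to_csv("combined.csv", index=False)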

4.4.4 Min-Max Normalization

The scale of the features can vary massively. For example, the difficulty feature can be over a trillion (10^12), while the number of transactions per block is normally in the thousands and the miner's revenue is about 10. If not normalized, the model can be dominated by the features whose values are much larger than the others. The commonly adopted min-max normalization helps scale the values into [0, 1] by applying the operation below to every feature:

X′ = (X − min(X)) / (max(X) − min(X))

4.4.5 Train/Validation/Test Split

The train/validation/test split is a necessary step before training. The whole dataset is broken down into three non-overlapping subsets: a training set, a validation set and a test set. Since we have multiple hyperparameter combinations, the training set is used as input to learn models under the different hyperparameter settings, and the validation set is used to evaluate the performance of those models. The hyperparameter combination with the highest validation accuracy is picked to train a new model, which is then evaluated on the test set to obtain the testing accuracy.

As shown in Figure 4.3, we split the data into three subsets: the first 50% for the training set, the next 25% for the validation set, and the last 25% for the test set, as suggested by T. Hastie et al. [36]. The split is not randomized because we have to retain the chronological order of the data. There is no overlap of time intervals between these three sub-datasets.

Figure 4.3: Train/Validation/Test Split
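A sketch of these two steps: it normalizes the full feature table and then slices it chronologically, mirroring the 50/25/25 split above (the file name is an assumption):

import pandas as pd

df = pd.read_csv("combined.csv")

# Min-max normalization per feature column: X' = (X - min) / (max - min).
df = (df - df.min()) / (df.max() - df.min())

n = len(df)
train = df.iloc[:int(0.50 * n)]             # first 50%, no shuffling
val = df.iloc[int(0.50 * n):int(0.75 * n)]  # next 25%
test = df.iloc[int(0.75 * n):]              # last 25%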

Chapter 5
Methodology

After the data preprocessing stages discussed in chapter 4, the data is prepared for modeling. This chapter focuses on how we design the model, how the experiments are conducted, and the experiment results.

5.1 Tools and Platform

The LSTM experiments are conducted using Keras [37], a popular deep learning framework that supports a Tensorflow backend. We use an Nvidia Titan Xp GPU to run our experiments; deep learning computing jobs are automatically distributed and run in parallel on the Nvidia GPU [38]. The experiments are conducted on a machine whose specification is listed in Table 5.1.

5.2 Experiments

5.2.1 Dataset Summary

There are 27 features used in training the models. The dataset summary is shown in Table 5.2. For more details about how and why they are extracted and calculated, please refer to chapter 4 and Appendices A and B.

Table 5.1: Machine Specification

GPU    | Nvidia Titan Xp
CPU    | Intel® Core™ i5-7400 CPU @ 3.00GHz × 4
Memory | 32GB
Disk   | Intel 600P
OS     | Linux Ubuntu 16.04

Table 5.2: Dataset Summary

Time interval | 2016/02/26 - 2018/02/20
Block         | 400000 - 510000
Granularity   | 1 datapoint per minute
#Datapoint    | 1014508

5.2.2 Neural Network Architecture

There are 5 layers in the network, where the first and last layers are the input layer and output layer, respectively. The input layer consists of 27 neurons because there are 27 features in one training example. The output layer consists of 2 neurons whose activation function is softmax: one for the probability of a rise and the other for the probability of a fall. This is designed for the categorical cross entropy discussed in section 2.4.1. The layers between the input layer and output layer are hidden layers constructed with LSTM units. There are 3 LSTM layers with 30, 20 and 10 LSTM units, respectively. We choose the sigmoid activation function for the forget, input and output gates; the activation function for the candidate cell state is the hyperbolic tangent. The chosen activation functions are the same as discussed in section 2.4.

Figure 5.1: LSTM Network Architecture Visualization

The network architecture is illustrated in Figure 5.1, where the superscript represents the order of the neuron in its layer, or the feature number in the input layer. The subscript represents the order of the layer in the network. Because there is only one input layer and one output layer, the neurons in these two layers always have subscript 1. There are 3 layers of LSTM cells, so their subscripts run from 1 to 3.

We use a many-to-one model, or ten-to-one specifically: X_t, X_{t+1}, X_{t+2}, ..., X_{t+9} are input to the NN, and a prediction Ŷ of whether timestep t+10 will rise or fall is produced. The model is illustrated in Figure 5.2 [39]. Since the dataset has one training example per minute, the model tries to predict the direction of the price at the next minute from the past ten minutes' information.
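A Keras sketch consistent with the description above (windows of 10 timesteps × 27 features, three stacked LSTM layers of 30, 20 and 10 units, and a 2-unit softmax output trained with categorical cross entropy); the exact placement of dropout and other details of the author's code are assumptions:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential([
    LSTM(30, return_sequences=True, dropout=0.5, input_shape=(10, 27)),
    LSTM(20, return_sequences=True, dropout=0.5),
    LSTM(10, dropout=0.5),           # last LSTM layer returns one vector: many-to-one
    Dense(2, activation="softmax"),  # P(rise), P(fall)
])
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])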

Figure 5.2: Many to One Model

5.2.3 Hyperparameter Tuning

Hyperparameters are chosen through the grid search discussed in section 2.4.4. We pick the optimization algorithm and the dropout rate as the hyperparameters to be tuned, as we would like to observe how different optimization algorithms and dropout rates perform in our scenario. Let H_o, H_d be the chosen values of optimization algorithm and dropout rate, where H_o = {Adagrad, RMSprop, Adam} and H_d = {0%, 30%, 50%}. Therefore, there are |H_o| × |H_d| = 3 × 3 = 9 combinations. We run the experiments with 3000 epochs. The results are shown in Tables 5.3, 5.4 and 5.5.

Hyperparameter combinations with the Adam optimization algorithm perform better on average than those with Adagrad and RMSprop in training accuracy. However, in validation accuracy, Adagrad performs best on average. Although RMSprop does not outperform the other two optimization algorithms on average in training or validation accuracy, RMSprop with a 50% dropout rate performs best in validation accuracy.

As for dropout, a method to regularize the neural network: theoretically, the higher the dropout rate, the higher the regularization. We observe that on average a dropout rate of 50% performs better on the validation accuracy metric than dropout rates of 30% and 0%; also, the best validation accuracy is obtained with a 50% dropout rate.

Table 5.3: Hyperparameters Combinations Performance

Hyperparameter            | Training Accuracy (%) | Validation Accuracy (%)
Adagrad, dropout rate 0%  | 56.073 | 55.275
RMSprop, dropout rate 0%  | 56.206 | 54.162
Adam, dropout rate 0%     | 56.904 | 55.318
Adagrad, dropout rate 30% | 56.143 | 55.333
RMSprop, dropout rate 30% | 56.514 | 54.739
Adam, dropout rate 30%    | 56.604 | 54.695
Adagrad, dropout rate 50% | 56.127 | 55.331
RMSprop, dropout rate 50% | 56.616 | 55.624
Adam, dropout rate 50%    | 56.611 | 55.550

Table 5.4: Optimization Algorithms Average Performance

Optimization Algorithm | Avg. Training Accuracy (%) | Avg. Validation Accuracy (%)
Adagrad | 56.114 | 55.313
RMSprop | 56.445 | 54.842
Adam    | 56.706 | 55.188

Table 5.5: Dropout Rate Average Performance

Dropout Rate | Avg. Training Accuracy (%) | Avg. Validation Accuracy (%)
0%  | 56.394 | 54.918
30% | 56.420 | 54.922
50% | 56.451 | 55.502

The hyperparameter combination of the RMSprop optimization algorithm and a 50% dropout rate achieves the highest validation accuracy, 55.624%. Therefore, we choose it as the best hyperparameter combination. We train a new model with this hyperparameter setting for 10000 epochs on both the training and validation sets, and evaluate the model on the test set. The result is 56.776% testing accuracy.

5.3 Results

5.3.1 Performance Comparison with Related Works

Since the Bitcoin blockchain is a new technology, studies on predicting the Bitcoin price are not as numerous as those on the stock market or the foreign exchange market. Due to the different experiment settings of the works, it is not feasible to simply compare the accuracy performance between studies. For example, Madan et al. collected data from 2010 to 2014, Greaves et al. collected data from 2012/2/1 to 2013/4/1, and McNally collected data between 2013/08/19 and 2016/07/19. The volatility grows over time, as shown in Figure 3.3, which may be a factor affecting performance across studies.

The extracted features are also different. Madan et al. extracted many features from the Bitcoin blockchain. Greaves et al. processed the blockchain data and further analyzed it to find special addresses that may affect the market price. McNally used features mainly from the exchange center, where only the difficulty and hash rate features are related to the blockchain. Despite the differences discussed above, there is one thing that all three works and our study have in common: the extracted features come mainly from the exchange center and the Bitcoin blockchain.

Table 5.6: Performance Comparison with Related Works

Study | Model | Accuracy (%)
This thesis | LSTM | 56.776
Automated Bitcoin Trading via Machine Learning Algorithms [13] | Random Forest | 57.4
Using the Bitcoin Transaction Graph to Predict the Price of Bitcoin [14] | Neural Network | 55.1
Predicting the price of Bitcoin using Machine Learning [15] | LSTM | 52.78

Chapter 6
Conclusion and Future Work

6.1 Conclusion

Bitcoin has been gaining more and more attention, and its price varies largely due to people's varied outlooks and speculation. Inspired by studies that apply quantitative techniques to predicting the prices of financial assets, we strive to find an LSTM model that predicts the direction of the Bitcoin price.

We build a machine learning pipeline which collects and integrates 13 internal and 14 external features from the Bitcoin blockchain and the exchange center, respectively. We then input the combined 27 features into an LSTM model and train it. In the model optimization phase, we select 3 optimization algorithms, Adagrad, RMSprop and Adam, and 3 dropout rates, 0%, 30% and 50%, and run hyperparameter tuning with grid search. The best outcome is the RMSprop optimization algorithm with a 50% dropout rate. With this hyperparameter setting, we retrain a new model and achieve 56.776% testing accuracy.

As contributions, this thesis provides data processing details for both internal and external Bitcoin features. Also, it serves as an empirical study on applying LSTM to cryptocurrency price prediction.

6.2 Future Work

In this thesis, the features are extracted from the exchange center and the blockchain. However, it is obvious that the economic environment could affect the price of Bitcoin. Extracting economic features such as stock market indexes and foreign exchange rates could reflect the condition of other financial markets. On the other hand, new positive and negative information could also affect the price. Therefore, integrating features from more diverse sources may benefit the model performance.

Also, we do not adopt trading strategies over the model to form a system. It could be another topic of research in which a system coordinates the trained model with trading strategies and strives to obtain profit. Such a system would have to figure out what actions to take in order to attain the best return. Reinforcement learning is the study of this kind of decision making process. Inspired by neural networks and deep learning, deep reinforcement learning (DRL) is a technique that combines deep learning and reinforcement learning. It may be possible to build a profitable DRL system where Bitcoin is traded automatically.

Appendix A

Feature Description and Representation

Table A.1: Feature description and representation

    Feature                    Description and Representation
    Open                       The price at the starting point of the time interval.
    Close                      The price at the ending point of the time interval.
    Low                        The lowest price in the time interval.
    High                       The highest price in the time interval.
    Volume                     The amount of Bitcoin traded on the exchange center during the time interval.
    Moving average 5 mins      The short-term average price.
    Moving average 30 mins     The medium-term average price.
    Moving average 60 mins     The long-term average price.
    Moving median 5 mins       The short-term middle price.
    Moving median 30 mins      The medium-term middle price.
    Moving median 60 mins      The long-term middle price.
    Moving std 5 mins          The short-term volatility.
    Moving std 30 mins         The medium-term volatility.
    Moving std 60 mins         The long-term volatility.
    Block height               The height of the block.
    Block difficulty           The difficulty of mining the block at the time it was mined.
    Block size                 How much data is stored in the block, reflecting the number and size of its transactions.
    Block confirmation time    How long it takes for a block to be confirmed: the time spent confirming the last 6 blocks.
    Miner's revenue            The mining reward plus total transaction fees; the incentive for validating transactions.
    Miner's revenue per tx     The miner's revenue per transaction.
    Number of tx               The number of transactions in the block.
    Number of unique address   The number of unique addresses that send or receive Bitcoin in at least one transaction in the block.
    Block total output         The total output of the block.
    Tx output mean             The mean of the transactions' outputs; an indication of the average value transferred per transaction.
    Tx output median           The transaction output whose value is in the middle among all transactions in the block.
    Tx output variance         The variance of the transactions' outputs; how spread out they are.
    Tx output std              The standard deviation of the transactions' outputs; how spread out they are.
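The moving statistics in Table A.1 are plain rolling-window computations over the close price; Appendix B below gives the exact formulas. The following is a minimal sketch using pandas, where the DataFrame layout, column names, and 1-minute indexing are assumptions. The `shift(1)` keeps each feature strictly backward-looking (rows t-w .. t-1), and `ddof=0` matches the 1/N form of the standard deviation in Appendix B.

```python
import numpy as np
import pandas as pd

def add_moving_features(df, price_col='close'):
    """Append rolling mean / median / std of the price over 5-, 30-,
    and 60-minute windows; each value uses only rows t-w .. t-1."""
    past = df[price_col].shift(1)  # exclude the current interval
    for w in (5, 30, 60):
        df['ma_%d' % w] = past.rolling(w).mean()
        df['median_%d' % w] = past.rolling(w).median()
        df['std_%d' % w] = past.rolling(w).std(ddof=0)  # population std
    return df

# Example with a placeholder 1-minute close series.
rng = np.random.RandomState(0)
df = pd.DataFrame({'close': 6000 + np.cumsum(rng.randn(300))})
df = add_moving_features(df).dropna()  # drop warm-up rows without history
```

Note that pandas defaults to the sample standard deviation (`ddof=1`); passing `ddof=0` is what aligns the output with the formulas in Table B.1.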

Appendix B

Feature Calculation

Table B.1: Feature calculation

    Feature                    Calculation
    Open                       Retrieved directly from the GDAX API.
    Close                      Retrieved directly from the GDAX API.
    Low                        Retrieved directly from the GDAX API.
    High                       Retrieved directly from the GDAX API.
    Volume                     Retrieved directly from the GDAX API.
    Moving average 5 mins      $\frac{1}{5}\sum_{i=t-5}^{t-1} price_i$
    Moving average 30 mins     $\frac{1}{30}\sum_{i=t-30}^{t-1} price_i$
    Moving average 60 mins     $\frac{1}{60}\sum_{i=t-60}^{t-1} price_i$
    Moving median 5 mins       $\mathrm{median}(price_{t-5}, price_{t-4}, \ldots, price_{t-1})$
    Moving median 30 mins      $\mathrm{median}(price_{t-30}, price_{t-29}, \ldots, price_{t-1})$
    Moving median 60 mins      $\mathrm{median}(price_{t-60}, price_{t-59}, \ldots, price_{t-1})$
    Moving std 5 mins          $\sqrt{\frac{1}{5}\sum_{i=t-5}^{t-1} (price_i - price_{ma5})^2}$
    Moving std 30 mins         $\sqrt{\frac{1}{30}\sum_{i=t-30}^{t-1} (price_i - price_{ma30})^2}$
    Moving std 60 mins         $\sqrt{\frac{1}{60}\sum_{i=t-60}^{t-1} (price_i - price_{ma60})^2}$
    Block height               Extract the height attribute of the block.
    Block difficulty           Extract the difficulty attribute of the block.
    Block size                 Extract the size attribute of the block.
    Block confirmation time    Block t's timestamp minus block t-6's timestamp.
    Miner's revenue            Retrieve the coinbase transaction of the block and sum the value attributes of its vout entries.
    Miner's revenue per tx     Miner's revenue divided by Number of tx.
    Number of tx               The length of the block's tx attribute.
    Number of unique address   Count the unique addresses appearing on either the input or the output side of the block's transactions.
    Block total output         $\sum_{tx \in block}\ \sum_{out \in tx[vout]} out[value]$
    Tx output mean             Block total output divided by Number of tx.
    Tx output median           Sort the transaction outputs and take the middle one.
    Tx output variance         The variance of the block's transaction outputs.
    Tx output std              The standard deviation of the block's transaction outputs.
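The block-level rows of Table B.1 map directly onto the fields of a raw block record. The sketch below assumes a blockchain.info-style block dict [35] with height, size, time, and tx fields, where each transaction carries out and inputs lists; these field names are assumptions about the data source, not a fixed API, and would need adjusting for other block exports.

```python
import statistics

def block_features(block, prev_blocks):
    """Compute the block-level features of Table B.1 from one block dict.
    prev_blocks must hold at least the 6 preceding blocks so the
    confirmation-time feature can be derived."""
    txs = block['tx']
    # By convention the first transaction is the coinbase, whose total
    # output is the block subsidy plus all transaction fees. Values are
    # denominated in satoshi in blockchain.info records.
    miner_revenue = sum(o['value'] for o in txs[0]['out'])
    outputs = [sum(o['value'] for o in tx['out']) for tx in txs]
    addresses = set()
    for tx in txs:
        for o in tx.get('out', []):
            if 'addr' in o:
                addresses.add(o['addr'])
        for i in tx.get('inputs', []):
            prev = i.get('prev_out') or {}  # coinbase inputs lack prev_out
            if 'addr' in prev:
                addresses.add(prev['addr'])
    return {
        'block_height': block['height'],
        # 'difficulty' appears in bitcoind-style records and may be
        # absent from other sources, hence the defensive get().
        'block_difficulty': block.get('difficulty'),
        'block_size': block['size'],
        'confirmation_time': block['time'] - prev_blocks[-6]['time'],
        'miner_revenue': miner_revenue,
        'miner_revenue_per_tx': miner_revenue / len(txs),
        'n_tx': len(txs),
        'n_unique_address': len(addresses),
        'block_total_output': sum(outputs),
        'tx_output_mean': sum(outputs) / len(txs),
        'tx_output_median': statistics.median(outputs),
        'tx_output_variance': statistics.pvariance(outputs),
        'tx_output_std': statistics.pstdev(outputs),
    }
```

`statistics.pvariance` and `statistics.pstdev` implement the population (1/N) forms, matching the variance and standard deviation rows of Table B.1.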

References

[1] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, 1997.

[2] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2009.

[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, 2015.

[5] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins, “Learning to forget: Continual prediction with LSTM,” Neural Comput., vol. 12, no. 10, 2000.

[6] S. J. Taylor, An Introduction to Volatility. Princeton University Press, 2005.

[7] Investopedia. Volatility. [Online]. Available: https://www.investopedia.com/terms/v/volatility.asp

[8] W. Huang, Y. Nakamori, and S.-Y. Wang, “Forecasting stock market movement direction with support vector machine,” Computers & Operations Research, vol. 32, no. 10, 2005.

[9] S. A. Hamid and Z. Iqbal, “Using neural networks for forecasting volatility of S&P 500 Index futures prices,” Journal of Business Research, 2004.

[10] A. Vejendla and D. Enke, “Evaluation of GARCH, RNN and FNN models for forecasting volatility in the financial markets,” IUP Journal of Financial Risk Management, vol. 10, no. 1, 2013.

[11] R. Akita, A. Yoshihara, T. Matsubara, and K. Uehara, “Deep learning for stock prediction using numerical and textual information,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), 2016.

[12] M. Matta, M. I. Lunesu, and M. Marchesi, “Bitcoin spread prediction using social and web search media,” in UMAP Workshops, 2015.

[13] I. Madan and S. Saluja, “Automated bitcoin trading via machine learning algorithms,” Stanford University, 2014.

[14] A. Greaves and B. Au, “Using the bitcoin transaction graph to predict the price of bitcoin,” Stanford University, 2015.

[15] S. McNally, “Predicting the price of bitcoin using machine learning,” Master’s thesis, National College of Ireland, Dublin, 2016.

[16] H. Jang and J. Lee, “An empirical study on modeling and prediction of bitcoin prices with bayesian neural networks based on blockchain information,” IEEE Access, vol. 6, 2018.

[17] Y. Bengio, “Learning deep architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, 2009.

[18] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, 2013.

[19] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, no. 2, 1990.

[20] Z. C. Lipton, “A critical review of recurrent neural networks for sequence learning,” CoRR, vol. abs/1506.00019, 2015.

[21] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer-Verlag Berlin Heidelberg, 2012.

[22] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” CoRR, vol. abs/1503.04069, 2015.

[23] Wikipedia contributors, “Loss functions for classification — Wikipedia, the free encyclopedia,” 2018. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Loss_functions_for_classification&oldid=838253245

[24] Wikipedia contributors, “Gradient descent — Wikipedia, the free encyclopedia,” 2018. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Gradient_descent&oldid=845809247

[25] R. Rojas, Neural Networks: A Systematic Introduction. Berlin, Heidelberg: Springer-Verlag, 1996.

[26] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” J. Mach. Learn. Res., vol. 13, 2012.

[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, 2014.

[28] S. Ruder, “An overview of gradient descent optimization algorithms,” CoRR, vol. abs/1609.04747, 2016.

[29] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.

[30] S. Dziembowski, “Introduction to cryptocurrencies,” 2015.

[31] I. Bentov, A. Gabizon, and A. Mizrahi, “Cryptocurrencies without proof of work,” CoRR, vol. abs/1406.5694, 2014.

[32] Proof of work. [Online]. Available: https://en.bitcoin.it/wiki/Proof_of_work

[33] A. Narayanan, J. Bonneau, E. W. Felten, A. Miller, S. Goldfeder, and J. Clark, Bitcoin and Cryptocurrency Technologies. Princeton University Press, 2016.

[34] GDAX exchange center documentation. [Online]. Available: https://docs.gdax.com/

[35] blockchain.info. [Online]. Available: https://blockchain.info/

[36] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer New York Inc., 2001.

[37] Keras. [Online]. Available: https://keras.io/

[38] Nvidia. [Online]. Available: http://www.nvidia.com/page/home.html

[39] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” 2015. [Online]. Available: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
