
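Establishing GDELT Digital News Analytics Pipeline on the Spark Platform: Exploiting News Events Influences on S&P Stock Index Variations as an Example - NCCU Academic Hub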


Academic year: 2021


(1) National Chengchi University, Department of Computer Science. Master's Thesis. Establishing GDELT Digital News Analytics Pipeline on the Spark Platform: Exploiting News Events Influences on S&P Stock Index Variations as an Example. July 2017.

(2) Establishing GDELT Digital News Analytics Pipeline on the Spark Platform: Exploiting News Events Influences on S&P Stock Index Variations as an Example. Student: Shu-Wei Huang. Advisor: Yuh-Jong Hu. Department of Computer Science, National Chengchi University. A Thesis submitted to the Department of Computer Science, National Chengchi University, in partial fulfillment of the requirements for the degree of Master in Computer Science. July 2017.

(3) Tel-NET 2017; IEEE; ENT Lab. Stay Hungry. Stay Foolish. 2017/07/17.

(4) GDELT; Spark; S&P 500. 2013; 65; 58; AWS; Spark ML Pipeline; 2.12%; 15; 1.5%; 116.76; 45; 43.35.

(5) Establishing GDELT Digital News Analytics Pipeline on the Spark Platform: Exploiting News Events Influences on S&P Stock Index Variations as an Example
ABSTRACT
In 2013, the GDELT project was released to monitor global digital news media in 65 languages. It uses artificial intelligence (AI) technologies such as machine learning algorithms, natural language processing, and deep learning to extract news and convert it into structured datasets. The datasets, which contain 58 features, are public for further research in all areas. In this research, I used GDELT datasets to develop a big data analysis process: I constructed a rolling-window machine learning model and a Spark ML pipeline on the AWS EC2 cloud platform to predict the S&P 500 stock index, and then evaluated the causal influence of the Occupy Wall Street (OWS) event. A 45-day rolling random forest (RF) model obtained the best RMSE, 43.35 (only 2.12%), when tracking and predicting the historical index. For the online 15-minute near-real-time (NRT) rolling prediction on AWS EC2, the errors were even lower, at less than 1.5%. For the causal influence analysis, I used a BSTS model to evaluate the counterfactual of the OWS event. I found that the OWS event and its follow-up effects prompted the S&P 500 stock index to rise 116.76 points in the observation period.
Keywords: GDELT Project, Rolling-Window Machine Learning, Big Data Analysis Pipeline, News Events Influences, AWS.

(6) ABSTRACT
1.1; 1.2; 1.3
2.1 GDELT
2.1.1 CAMEO
2.1.2; 2.1.3
2.2; 2.2.1; 2.2.2
3.1 GDELT
3.2

(7) 4.1
4.2; 4.2.1; 4.2.2; 4.2.3
4.3 Pipeline
4.3.1 Python Scikit Learn
4.3.2 Apache Spark ML
4.4
5.1; 5.2
5.3 Pipeline
5.4 Causal Impact
5.5 AWS
6.1; 6.2

(8) Tables: 1 CAMEO; 2; 3; 4; 5; 6; 7 AWS EC2; 8 Spark 5080; 9 AWS EC2 / S3.

(9) Figures: 1 (1979 to 2013); 2 (2014 to 2017); 3 TABARI; 4 PETRARCH; 5 GDELT; 6; 7; 8 [3]; 9 Apache Spark; 10 Pipeline; 11; 12; 13; 14; 15; 16; 17 GDELT Pipeline; 18; 19 RF 45; 20 Spark 4040; 21 AWS EC2 - 15; 22 AWS.

(10) 1.1 GDELT [13] (Global Database of Events, Language, and Tone). Google Jigsaw (1). 65; GDELT Translingual. 15; 58. GDELT; Spark.
1. https://jigsaw.google.com/ (Google Idea; Google Jigsaw)

(11) 1.2 GDELT; 15; Ground Truth; ETL (Extract, Transform, Load); Rolling Analysis; S&P 500; Pipeline; Spark. 1.3 GDELT; ARIMA.

(12) Amazon Web Service (AWS); Spark; GDELT; 15; S&P 500.

(13) 2.1 GDELT. CAMEO (Conflict and Mediation Event Observations). TABARI (Textual Analysis by Augmented Replacement Instructions). News; Google Translate; 65; 15. Figure 1: 1979 to 2013.
2. David Masad; GDELT - Global Data on Events, Language, and Tone, p.23.

(14) Figure 2: 2014 to 2017. 2.1.1 CAMEO. Conflict and Mediation Event Observations (CAMEO) [6]. Event Root Code: 20 codes (Table 1). Quad Class: 4 classes. Example: event code 0832, "Accede to demands for change in policy", under root code 08 YIELD. Actor type codes in GDELT: IGO, NGO, IMG, MNC, GOV, MIL, OPP, REB.

(15) Table 1: CAMEO
CAMEO EVENTCODE | EVENT DESCRIPTION | Quad Class
01 | MAKE PUBLIC STATEMENT | 1
02 | APPEAL | 1
03 | EXPRESS INTENT TO COOPERATE | 1
04 | CONSULT | 1
05 | ENGAGE IN DIPLOMATIC COOPERATION | 1
06 | ENGAGE IN MATERIAL COOPERATION | 2
07 | PROVIDE AID | 2
08 | YIELD | 2
09 | INVESTIGATE | 2
10 | DEMAND | 2
11 | DISAPPROVE | 3
12 | REJECT | 3
13 | THREATEN | 3
14 | PROTEST | 3
15 | EXHIBIT FORCE POSTURE | 3
16 | REDUCE RELATIONS | 4
17 | COERCE | 4
18 | ASSAULT | 4
19 | FIGHT | 4
20 | USE UNCONVENTIONAL MASS VIOLENCE | 4
2.1.2 The Computational Event Data System (3); 1998. KEDS [22] (Kansas Event Data System), 1995; Mac OS 6.0. TABARI [21] (Textual Analysis By Augmented Replacement Instructions), 2000; Unix, Linux, Mac OS; C++; 1024.
3. http://eventdata.parusanalytics.com/intro.html

(16) 5000; GDELT. Figure 3: TABARI. PETRARCH [23] (Python Engine for Text Resolution And Related Coding Hierarchy), 2014; Python. PETRARCH 2 [16], 2016. TABARI; CoreNLP (4); Penn TreeBank (5).
4. https://stanfordnlp.github.io/CoreNLP/
5. https://en.wikipedia.org/wiki/Treebank

(17) Figure 4: PETRARCH. PETRARCH; TABARI; Python; C++; CoreNLP; 5000; 150. 2.1.3 GDELT: 58 fields; Float, String, Integer; 5; 9; 13; 36. GoldsteinScale, NumMentions, NumSources, NumArticles, AvgTone.

(18) Figure 5: GDELT. Table 2: SQLDATE (YYYYMMDD); Actor1CountryCode (3-character CAMEO country code); Actor2CountryCode (3-character CAMEO country code); EventRootCode (20 CAMEO root codes); QuadClass (1 to 4); GoldsteinScale (-10 to +10); AvgTone.

(19) 2.2; 2.2.1. 2015 to 2016; Node: 220; Edge. Avg. Degree: 88.711 to 98.61. Network Density: 0.396 to 0.444. Avg. Path Length.
Table 3:
Time Period | 15Q1 | 15Q2 | 15Q3 | 15Q4 | 16Q1 | 16Q2 | 16Q3 | 16Q4
No. of Nodes | 220 | 218 | 220 | 220 | 220 | 217 | 220 | 218
No. of Edges | 3871124 | 4023293 | 4418558 | 4554695 | 4741484 | 4699337 | 4594722 | 4542305
Avg. Degree | 88.711 | 94.307 | 95.804 | 94.655 | 92.938 | 98.054 | 95.893 | 98.61
Network Density | 0.396 | 0.421 | 0.428 | 0.421 | 0.415 | 0.442 | 0.428 | 0.444
Avg. Path Length | 1.593 | 1.579 | 1.561 | 1.571 | 1.587 | 1.546 | 1.559 | 1.540

(20) Degree; Density; Length. 2.2.2 GDELT; Gephi (6); Actor1CountryCode; Actor2CountryCode; 2015; 2016; NGO.
6. https://gephi.org/, The Open Graph Viz Platform.

(21) Figure 6.

(22) 3.1 GDELT. 2013; 2001; 2012; 2008. Yonamine [25]: ARIMA (Autoregressive Integrated Moving Average model); naïve; 62.5%. 2014: Racette [19]; Keertipati [11]. Keertipati: GDELT; Change point analysis (7); NGO; Event Code. Racette: 90.8%.
7. http://www.variation.com/cpa/tech/changepoint.html

(23) Fight; Apache Mahout; R; AUC 0.868. 3.2 GDELT. Causality; Correlation. [7] Granger causality test. Lei Jiang [9], 2014. Kumar [12], 2016: GDELT; Cyber News; Bayesian structural time series (BSTS) (8); DDoS; 30%; 50%; CAMEO.
8. https://en.wikipedia.org/wiki/Bayesian_structural_time_series

(24) GDELT; S&P. 4.1 One-hot Encoding (9): a categorical feature with N values becomes N columns of 0/1. With N = 100, each row holds one 1 and 99 zeros, so the encoded data is sparse.
9. https://en.wikipedia.org/wiki/One-hot
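The N = 100 case described above can be illustrated with a minimal sketch. It is plain Python, not the thesis's Spark code; the category labels (`cat_000` and so on) and the helper names are hypothetical.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0] * size
    vec[index] = 1
    return vec


def to_sparse(vec):
    """Keep only the positions of the non-zero entries (a sparse view)."""
    return [i for i, v in enumerate(vec) if v != 0]


# Hypothetical categorical feature with N = 100 distinct values.
categories = ["cat_%03d" % i for i in range(100)]
lookup = {name: i for i, name in enumerate(categories)}

encoded = one_hot(lookup["cat_042"], len(categories))
print(sum(encoded), encoded.count(0), to_sparse(encoded))
```

Storing only the index of the single non-zero entry is what makes the sparse representation cheap when N grows large.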

(25) One-hot Encoding; GDELT; 58; 20,000. Actor1CountryCode; Actor2CountryCode; EventRootCode (20 CAMEO root codes); QuadClass (4 CAMEO classes); GoldsteinScale; AvgTone. sparse; One-hot Encoding; 280,000. 4.2 [8] Supervised Learning. 4.2.1 Box-Jenkins [1]: ARIMA (Autoregressive Integrated Moving Average model).

(26) ARIMA. Ping-Feng Pai [17]: SVM; ARIMA. Michael [10]: ARIMA; Random Forest; H5N1; 30; MSE. 4.2.2 Rolling-Window [28]; GDELT; S&P. Figure 7: S&P 500; 45; 90.

(27) 180; GDELT; S&P 500; 600. [4] Ensemble learning; Linear Regression. 4.2.3 [2]; A; B. Graphical Causal Model [5]; Directed Acyclic Graphs (DAG).

(28) DAG [14][18]. Neyman [15]: Potential Outcomes Model. Rubin [20]: Rubin causal model (RCM). For a unit $u$, with $t$ the treatment group and $c$ the control group, the effect is
$$\delta_u = Y_{tu} - Y_{cu}$$
Table 4:
Group | $Y_{tu}$ | $Y_{cu}$ | $\delta_u$
U=0 | $E[Y_{t1}]$ | $E[Y_{c1}]$ | $E[Y_{t1}] - E[Y_{c1}]$
U=1 | $E[Y_{t2}]$ | $E[Y_{c2}]$ | $E[Y_{t2}] - E[Y_{c2}]$
Sewall Wright: Structural Equation Model (SEM); Factor analysis; Path analysis. Brodersen [3], 2015: BSTS; Nowcasting; CausalImpact (R).

(29) Bayesian dynamic diffusion-regression state-space model. Figure 8: [3]. 4.3 Pipeline. Weka; Python Scikit Learn; Spark ML. Weka: JAVA. Python Scikit Learn; Spark ML.

(30) 4.3.1 Python Scikit Learn. Scikit-learn (10); Python; SciPy; NumPy; matplotlib; ETL; Feature Engineering; ndarray; Dataframe; sklearn.pipeline; Pipeline. 4.3.2 Apache Spark ML. Apache Spark [27][26]; 2014; Sort Benchmark Competition; 100 TB; Hadoop MapReduce; 72; 30; Spark.
10. https://pypi.python.org/pypi/scikit-learn/0.18.2

(31) Figure 9: Apache Spark (11). Spark: Scala, Python, R, Java; Spark SQL; Spark Streaming; ML / MLlib; GraphX. Spark 2.0; Spark DataFrame; Spark ML; Scikit Learn Pipeline; Scikit Learn Pipelines; Spark ML Pipeline. Figure 10.
11. Stanford CS347 Guest Lecture: Apache Spark, p.82.

(32) Figure 10: Pipeline. 4.4 GDELT; 2015; GDELT 2.0; 15; 19; S&P 500. Figure 11.
• AWS EC2 (Elastic Compute Cloud); Linux; cron jobs; GDELT; 15.

(33) 16:00; AWS S3 (Simple Storage Service).
• Rolling Window; Decision Tree Regression; Gradient Boosting Regression; Random Forest Regression; K-fold Cross Validation; 10; 30.
• 15; S&P 500.
• CausalImpact (R); Google.
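The 15-minute schedule in the steps above implies that each run works on the batch for the most recent quarter-hour boundary. The following is a minimal sketch of that rounding, not the author's cron script; the function names and the `YYYYMMDDHHMMSS` stamp format are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def floor_to_interval(ts, minutes=15):
    """Round a timestamp down to the previous interval boundary.

    Only valid when `minutes` divides 60 evenly (15 does).
    """
    discard = timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )
    return ts - discard


def batch_stamp(ts, minutes=15):
    """Format the boundary as a YYYYMMDDHHMMSS string (assumed naming)."""
    return floor_to_interval(ts, minutes).strftime("%Y%m%d%H%M%S")


stamp = batch_stamp(datetime(2017, 5, 21, 16, 7, 33))
print(stamp)
```

A job triggered every 15 minutes could pass the current time through `batch_stamp` to decide which batch to fetch and where to store it.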

(34) System environment: Linux (Ubuntu 16.04); Spark 2.1.1 with PySpark; Python 3.5; R with CausalImpact Package. A Linux cron job runs every 15 minutes; Spark; GDELT; S&P 500.
5.1 GDELT 2.0: Event Dataset; Global Knowledge Graph Dataset. 58; 15. 6 fields: Actor1CountryCode, Actor2CountryCode, EventRootCode, QuadClass, GoldsteinScale, AvgTone.
Figure 12: Actor1CountryCode; Actor2CountryCode; 2015; 2016.

(35) 35582; -Loop; 2.2.1; 217∼220; 2015; 2016; Table 3. Node: 220; 220*220 = 48,400; 12,818; Non-Complete Graph. GDELT: EventRootCode (20 CAMEO root codes); QuadClass (4 CAMEO classes); 24. GoldsteinScale; AvgTone: 5 intervals each, [-10,-6], [-6,-2], [-2,2], [2,6], [6,10]; 10. 35,616. Figure 13: S&P 500.
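The counts on this page add up as follows: 48,400 − 12,818 = 35,582 country pairs, and 35,582 + 24 + 10 = 35,616. The sketch below only checks that arithmetic and shows one way to assign a score to the five width-4 intervals. Reading 12,818 as the pairs that are subtracted, and sending a boundary value to the upper interval, are assumptions, as are all names.

```python
import bisect

COUNTRY_NODES = 220
POSSIBLE_PAIRS = COUNTRY_NODES * COUNTRY_NODES     # 48,400
SUBTRACTED_PAIRS = 12818                           # figure quoted on this page
OBSERVED_PAIRS = POSSIBLE_PAIRS - SUBTRACTED_PAIRS # 35,582

EVENT_ROOT_CODES = 20
QUAD_CLASSES = 4

# Interior edges of [-10,-6], [-6,-2], [-2,2], [2,6], [6,10].
BIN_EDGES = [-6, -2, 2, 6]
BINS_PER_SCORE = len(BIN_EDGES) + 1                # 5 intervals


def bin_index(value, edges=BIN_EDGES):
    """Return 0..4; a value on an edge goes to the upper interval (assumed)."""
    return bisect.bisect_right(edges, value)


# GoldsteinScale and AvgTone each contribute 5 one-hot columns.
total_features = (
    OBSERVED_PAIRS + EVENT_ROOT_CODES + QUAD_CLASSES + 2 * BINS_PER_SCORE
)
print(OBSERVED_PAIRS, total_features, bin_index(0.0))
```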

(36) 5.2; 4.2; 3; S&P 500. Figures 14 and 15; 90; Ground Truth. Decision Tree Regression; Gradient Boosting Regression; Random Forest Regression; RMSE; Random Forest Regression. Figure 14. Table 5.

(37) Figure 15.
Table 5:
Para. | Decision Tree | Gradient Boost | Random Forest
rolling-window | 90 | 90 | 90
numTrees | 1 | 20 | 20
maxDepth | 20 | 20 | 20
maxBins | 32 | 32 | 32
impurity / stepSize / Seed | variance, - | 0.1, - | None
RMSE | 110.78 / 6.68% | 76.90 / 3.70% | 65.47 / 3.14%
Table 5; Window Size; Figure 16; 45; 90; 180; Table 6. RMSE: 45; 43.35.

(38) Figure 16.
Table 6:
Para. | Random Forest | Random Forest | Random Forest
rolling-window | 180 | 90 | 45
numTrees | 20 | 20 | 20
maxDepth | 20 | 20 | 20
maxBins | 32 | 32 | 32
RMSE | 67.41 / 3.18% | 53.46 / 2.60% | 43.35 / 2.12%
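The rolling-window evaluation compared in Tables 5 and 6 can be sketched as follows. This is a plain-Python illustration rather than the thesis's Spark ML code. `mean_model` is a placeholder standing in for the random forest, the toy series is invented, and the sketch does not reproduce the percentage reported next to each RMSE, since the page does not state how it was derived.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and equally long")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rolling_forecast(features, target, window, fit_predict):
    """Predict each step from a model fitted on the preceding `window` steps."""
    predictions = []
    for t in range(window, len(target)):
        train_x = features[t - window:t]
        train_y = target[t - window:t]
        predictions.append(fit_predict(train_x, train_y, features[t]))
    return predictions


def mean_model(train_x, train_y, next_x):
    """Placeholder learner: predicts the mean of the training targets."""
    return sum(train_y) / len(train_y)


# Toy data: a steadily rising series (hypothetical, not S&P 500 values).
target = [float(i) for i in range(10)]
features = [[v] for v in target]

preds = rolling_forecast(features, target, window=3, fit_predict=mean_model)
error = rmse(target[3:], preds)
print(len(preds), error)
```

Swapping `mean_model` for a function that fits a regressor on `train_x` and `train_y` gives the same loop for window sizes such as 45, 90, and 180.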

(39) 5.3 Pipeline. 4.3.2; Spark ML Pipelines; GDELT; Pipeline; 15. Figure 17: GDELT Pipeline. 15; AWS EC2; K-fold; 30.

(40) 5.4 Causal Impact. Brodersen [3]; 2011; BSTS; 9. Figure 18. CausalImpact, S&P 500: 1310.04 and 1426.80; 95% confidence interval [1166, 1456]; 8.9%, with 95% interval [-2.2%, 19.9%]. Hypothesis testing (p-value approach): p = 0.0643.

(41) Figure 18: CausalImpact; BSTS. 45. Figure 19; 45; S&P 500; CausalImpact; 1426.80; 1310.04; 1250.29; 60; 4.2%; GDELT.
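The CausalImpact figures quoted in this section are mutually consistent, which a short calculation makes visible. The sketch assumes 1426.80 is the observed average, 1310.04 the counterfactual prediction, and [1166, 1456] the interval on that prediction; the extracted text gives the numbers but not their labels, so this reading is an assumption.

```python
observed = 1426.80          # assumed: observed average S&P 500 level
counterfactual = 1310.04    # assumed: predicted level without the event
ci_low, ci_high = 1166.0, 1456.0  # assumed: 95% interval on the prediction

# Absolute and relative effect of the event.
abs_effect = observed - counterfactual
rel_effect = abs_effect / counterfactual

# The interval on the prediction maps to an interval on the relative effect:
# a higher prediction means a smaller effect, and vice versa.
rel_low = (observed - ci_high) / counterfactual
rel_high = (observed - ci_low) / counterfactual

print(round(abs_effect, 2))
print(round(100 * rel_effect, 1), round(100 * rel_low, 1), round(100 * rel_high, 1))
```

Under this reading the difference matches the 116.76-point rise stated in the abstract, and the relative values match the 8.9% and [-2.2%, 19.9%] figures. Because the interval includes zero, the reported p-value of 0.0643 sitting just above 0.05 is what one would expect.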

(42) Figure 19: RF; 45. 5.5 AWS. Amazon Web Service (AWS); Web; 2017; 16; 70; AWS EC2; S3; us-west-2. Spark; Hadoop HDFS (Hadoop Distributed File System); Apache Spark; EC2; spark-ec2; AMP Lab; GitHub; Spark 2.0; Master/Slave; IP; 5; 15∼20.

(43) AWS EC2 Master/Slave (Table 7): Master t2.medium, 2-core/4G-RAM, *1; Slave t2.large, 2-core/8G-RAM, *2. spark-ec2:
    ./spark-ec2-branch-2.0/spark-ec2 \
        --key-pair=Demo \
        --identity-file=/home/user/Demo.pem \
        --region=us-west-2 \
        --zone=us-west-2a \
        --instance-type=t2.large \
        --slaves=2 \
        --master-instance-type=t2.medium \
        launch GDELT_Cluster
Table 7: AWS EC2
Instance | vCPU | ECU | Memory (GiB) | Storage (GB) | Linux/UNIX price
t2.medium | 2 | - | 4 | EBS | $0.047
t2.large | 2 | - | 8 | EBS | $0.094
m4.large | 2 | 6.5 | 8 | EBS | $0.1
m4.xlarge | 4 | 13 | 16 | EBS | $0.2
15; EC2; S3; GDELT on AWS S3; Spark; EC2 Master; spark-submit.

(44) Cluster; YARN; Master; 4040; Figures 20 and 22. AWS EC2 Cluster:
    ./spark-submit \
        --master spark://master:7077 \
        --driver-memory 2g \
        --num-executors 2 \
        --executor-memory 4g \
        --executor-cores 4 \
        GDELT_RandomForestML_USA.py \
        s3n://GDELT_2017/
Figure 20: Spark 4040. Figure 21: AWS EC2 - 15. Figure 22; 15; S&P 500; AWS EC2; 7; 12.

(45) Figure 22: AWS. Apache Spark; Spark; JAVA; In-Memory; Garbage Collection; 5080; 2; 5; 8; 9; Master; AWS.

(46) Table 8: Spark 5080
Resource | Master | Slave (Worker)
CPU | 2-core | 2-core
Memory | 4G-RAM | 8G-RAM
Load / Proc.
Table 9: AWS EC2 / S3
Item | Usage | Cost | Spec
Data Transfer | | $0 |
Elastic Compute Cloud, $0.00 per t2.micro | 18 Hr | $0 | 1-core / 1G-RAM
Elastic Compute Cloud, $0.047 per t2.medium | 408 Hr | $19.176 | 2-core / 4G-RAM
Elastic Compute Cloud, $0.094 per t2.large | 6 Hr | $0.564 | 2-core / 8G-RAM
Elastic Compute Cloud, $0.1 per m4.large | 4 Hr | $0.4 | 2-core / 8G-RAM
Elastic Compute Cloud, $0.2 per m4.xlarge | 36 Hr | $0.8 | 4-core / 16G-RAM
EBS | 40,148,907 IOs | $2.01 |
Simple Storage Service | 19,422 Requests (PUT, COPY, POST, or LIST); 54,607 Requests (GET and all other requests) | $0.12 |
Total | | $23.07 |
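The line items of Table 9 can be checked against the stated total with a few lines of arithmetic. The sketch uses `Decimal` to avoid floating-point rounding. The dictionary keys are illustrative labels, and the pairing of each cost with an instance type follows the order in which the values appear in the table.

```python
from decimal import Decimal

# Cost per line item in USD, as listed in Table 9.
line_items = {
    "data_transfer": Decimal("0"),
    "ec2_t2_micro": Decimal("0"),
    "ec2_t2_medium": Decimal("19.176"),
    "ec2_t2_large": Decimal("0.564"),
    "ec2_m4_large": Decimal("0.4"),
    "ec2_m4_xlarge": Decimal("0.8"),
    "ebs": Decimal("2.01"),
    "s3": Decimal("0.12"),
}

total = sum(line_items.values(), Decimal("0"))

# Hourly rate times hours reproduces three of the EC2 line items.
t2_medium_cost = Decimal("0.047") * 408
t2_large_cost = Decimal("0.094") * 6
m4_large_cost = Decimal("0.1") * 4

print(total, t2_medium_cost, t2_large_cost, m4_large_cost)
```

The t2.medium master accounts for most of the bill because it stayed up for 408 hours, while the larger instances ran for only a few hours each.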

(47) 6.1 GDELT; Spark; Brodersen [3]; CausalImpact; S&P 500; Spark; GDELT; 15; S&P 500. 6.2 Apache Spark 2.1.1 (2017/05/21); Spark ML.

(48) Bayesian Network; scikit-learn; R; Spark ML; Python scikit-learn; BSTS. Stefan Wager [24], 2016: Causal forest. GDELT.

(49) [1] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[2] Breiman, Leo. "Random forests." Machine learning 45.1 (2001): 5-32.
[3] Brodersen, Kay H., et al. "Inferring causal impact using Bayesian structural time-series models." The Annals of Applied Statistics 9.1 (2015): 247-274.
[4] Dietterich, Thomas G. "Ensemble methods in machine learning." International workshop on multiple classifier systems. Springer Berlin Heidelberg, 2000.
[5] Elwert, Felix. "Graphical causal models." Handbook of causal analysis for social research. Springer Netherlands, 2013. 245-273.
[6] Gerner, Deborah J., et al. "Conflict and mediation event observations (CAMEO): A new event data framework for the analysis of foreign policy interactions." International Studies Association, New Orleans (2002).
[7] Granger, Clive WJ. "Investigating causal relations by econometric models and cross-spectral methods." Econometrica: Journal of the Econometric Society (1969): 424-438.
[8] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. "Overview of supervised learning." The elements of statistical learning. Springer New York, 2009. 9-41.
[9] Jiang, Lei, and Fan Mai. "Discovering bilateral and multilateral causal events in GDELT." International conference on social computing, behavioral-cultural modeling, and prediction, Washington, DC. 2014.

(50) [10] Kane, Michael J., et al. "Comparison of ARIMA and Random Forest time series models for prediction of avian influenza H5N1 outbreaks." BMC bioinformatics 15.1 (2014): 276.
[11] Keertipati, Swetha, et al. "Multi-Level Analysis of Peace and Conflict Data in GDELT." Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis. ACM, 2014.
[12] Kumar, Sumeet, Matthew Benigni, and Kathleen M. Carley. "The impact of US cyber policies on cyber-attacks trend." Intelligence and Security Informatics (ISI), 2016 IEEE Conference on. IEEE, 2016.
[13] Leetaru, Kalev, and Philip A. Schrodt. "Gdelt: Global data on events, location, and tone, 1979-2012." ISA Annual Convention. Vol. 2. No. 4. 2013.
[14] Lindquist, Martin A., and Michael E. Sobel. "Graphical models, potential outcomes and causal inference: Comment on Ramsey, Spirtes and Glymour." NeuroImage 57.2 (2011): 334-336.
[15] Neyman, Jerzy. "Sur les applications de la théorie des probabilités aux experiences agricoles: Essai des principes." Roczniki Nauk Rolniczych 10 (1923): 1-51.
[16] Norris, Clayton. "Petrarch 2: Petrarcher." arXiv preprint arXiv:1602.07236 (2016).
[17] Pai, Ping-Feng, and Chih-Sheng Lin. "A hybrid ARIMA and support vector machines model in stock price forecasting." Omega 33.6 (2005): 497-505.
[18] Pearl, Judea. "Graphical models, potential outcomes and causal inference: comment on Lindquist and Sobel." NeuroImage 58.3 (2011): 770.
[19] Racette, Mark P., et al. "Improving situational awareness for humanitarian logistics through predictive modeling." Systems and Information Engineering Design Symposium (SIEDS), 2014. IEEE, 2014.

(51) [20] Rubin, Donald B. "Causal inference using potential outcomes: Design, modeling, decisions." Journal of the American Statistical Association 100.469 (2005): 322-331.
[21] Schrodt, Philip A. "Automated coding of international event data using sparse parsing techniques." Annual meeting of the International Studies Association, Chicago. 2001.
[22] Schrodt, Philip A., and Blake Hall. "Twenty years of the Kansas event data system project." Political Methodologist 14.1 (2006): 2-6.
[23] Schrodt, Philip A., John Beieler, and Muhammed Idris. "Three's a Charm?: Open Event Data Coding with EL:DIABLO, PETRARCH, and the Open Event Data Alliance." ISA Annual Convention. 2014.
[24] Wager, Stefan, and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests." Journal of the American Statistical Association just-accepted (2017).
[25] Yonamine, James E. A nuanced study of political conflict using the Global Datasets of Events Location and Tone (GDELT) dataset. Diss. The Pennsylvania State University, 2013.
[26] Zaharia, Matei, et al. "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing." Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[27] Zaharia, Matei, et al. "Spark: Cluster computing with working sets." HotCloud 10.10-10 (2010): 95.
[28] Zivot, Eric, and Jiahui Wang. "Rolling Analysis of Time Series." Modeling Financial Time Series with S-Plus®. Springer New York, 2003. 299-346.
