Comparison of Handlers - Characteristics of Adaptive Drift Ensemble

CHAPTER 6 DISCUSSIONS

6.3 Characteristics of Adaptive Drift Ensemble

National Chengchi University

than a certain value. That is, there is no status change made by DDM within a certain number of instances. Although ADE is designed to react immediately to concept drift, it will make no reaction if DDM makes no status change.

Figure 27. NB and HT with AUE and ADE

Note: Y-axis is accuracy (from 50% to 80%), and X-axis represents time.
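The dependence of ADE on DDM status changes can be illustrated with a minimal sketch of DDM. It assumes the standard formulation of the drift detection method by Gama et al.: a warning level of 2 and a drift level of 3 standard deviations above the recorded minimum, and at least 30 instances before any check. The class name, the status strings, and the synthetic error stream are hypothetical; they are not taken from our experiments or from the MOA implementation.

```python
import math


class DDM:
    """Minimal sketch of the drift detection method (not the MOA code)."""

    def __init__(self, min_instances=30, warning_level=2.0, drift_level=3.0):
        self.min_instances = min_instances
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0                # instances seen since the last drift
        self.p = 0.0              # running error rate
        self.p_min = math.inf     # error rate at the recorded minimum of p + s
        self.s_min = math.inf     # standard deviation at that minimum

    def update(self, error):
        """Feed 1 for a misclassified instance, 0 otherwise; return the status."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(self.p * (1.0 - self.p), 0.0) / self.n)
        if self.n < self.min_instances:
            return "in_control"
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()          # start collecting statistics for the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "in_control"


# Hypothetical error stream: 20% errors for 200 instances, then 60% errors.
stream = [1 if i % 5 == 0 else 0 for i in range(200)]
stream += [1 if i % 5 < 3 else 0 for i in range(200)]

detector = DDM()
statuses = [detector.update(e) for e in stream]
first_drift = statuses.index("drift")
print("first drift at instance", first_drift)
```

In this sketch the status stays "in_control" for the whole stationary part of the stream, so an ensemble that only reacts to status changes would do nothing during that period.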

6.3.2 Comparison of Handlers

Figure 28 shows the accuracy achieved by ADE using DDM or EDDM as the base handler. When ADE is used with NB, the average accuracy is 63.29% with DDM and 64.6% with EDDM. When ADE is used with HT, the average accuracy is 64.65% with DDM and 63.76% with EDDM. When the number of instances in a leaf of the tree is smaller than a threshold (whose default value is 200), HT will not split the leaf. Hence, HT performs poorly in training if concept drift occurs frequently. Considering the larger variance shown by EDDM in Figure 28, especially when it is used by ADE with HT, we conclude that EDDM is less stable than DDM.

Figure 28 also gives the points in time at which drifts are detected by DDM or EDDM. These points are similar when the same drift detection method is used with different learning algorithms, but they are not similar when different drift detection methods are used with the same learning algorithm. Both DDM and EDDM give time intervals during which no drift occurs and the underlying data distribution does not change. Three such time intervals given by DDM could be connected to three important events in the real world: SARS (severe acute respiratory syndrome) in 2003, the financial tsunami in 2007 and 2008, and the Eurozone crisis, or European sovereign debt crisis, in 2010. However, it is difficult for us to find such connections for the time intervals given by EDDM.

Figure 28. Comparison between DDM and EDDM handlers

Note: Y-axis is accuracy (from 50% to 80%), and X-axis represents time; the lower part shows drifts which DDM or EDDM detected with the same X-axis of the upper part.
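The comparison described above rests on two simple computations: the mean and variance of an accuracy series, and the drift-free intervals between consecutive detected drifts. The following is a minimal sketch of both. All function names and all numbers are hypothetical illustrations; they are not the values behind Figure 28, and time is represented by plain instance indices.

```python
from statistics import mean, pvariance


def summarize(accuracies):
    """Return (mean, population variance) of a series of accuracy values."""
    return mean(accuracies), pvariance(accuracies)


def drift_free_intervals(drift_times, start, end):
    """Split [start, end] at the detected drift points."""
    points = [start] + sorted(t for t in drift_times if start < t < end) + [end]
    return list(zip(points[:-1], points[1:]))


def covering_interval(intervals, event_start, event_end):
    """Return the drift-free interval that contains the whole event, or None."""
    for low, high in intervals:
        if low <= event_start and event_end <= high:
            return (low, high)
    return None


# Hypothetical accuracy series for two handlers (not the values behind Figure 28).
acc_stable = [0.62, 0.64, 0.63, 0.65]
acc_unstable = [0.55, 0.72, 0.60, 0.69]
mean_stable, var_stable = summarize(acc_stable)
mean_unstable, var_unstable = summarize(acc_unstable)
print(var_stable < var_unstable)

# Hypothetical drift points and event period, given as instance indices.
intervals = drift_free_intervals([400, 120, 650], start=0, end=1000)
print(intervals)
print(covering_interval(intervals, 450, 600))
```

A handler whose accuracy series has the larger variance would be judged less stable in this sense, and an event is only matched when its whole period falls inside a single drift-free interval.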

CHAPTER 7 CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

The importance of extracting useful knowledge from massive data is undeniable. Data mining has been one of the popular research fields, and data stream mining has recently become a popular subfield of data mining, since real-world data usually arrives in the form of a stream.

In this thesis, we present our study of applying data stream mining techniques to predicting the short-term rise or fall of Taiwan Stock Exchange Capitalization Weighted Stock Index Futures (TAIEX Futures). First, we give an overview of related studies. Next, we model the problem as a binary classification problem and then describe the toolkit, data pre-processing steps, and methods for experiments. Finally, we observe from the experimental results that using the drift detection method has a significant impact on accuracy. Therefore, we assume that concept drift exists in the futures market. In this thesis, we also see that applying data stream mining techniques to data from the futures market is useful and worth further investigation.

From our previously published work, we knew that concept drifts exist in the TAIEX Futures data, and we found that DDM could significantly improve data stream mining. In this thesis, we focus on investigating the type of concept drift that exists in the TAIEX Futures data. By showing empirically that EDDM and the evolving ensemble method are not better than DDM, which is mainly proposed to deal with sudden concept drift, we concluded that the concept drifts in the TAIEX Futures data would not be gradual or incremental ones. To identify reoccurring concept drifts, we proposed a method based on the evolving ensemble method and used it to observe the relative weights of sub-classifiers.

From the experimental results, we found that concept drift occurs frequently and that reoccurring concept drift exists. Furthermore, our method achieves higher accuracy than AUE does on the TAIEX Futures data. Our method has another advantage: it can adaptively adjust the chunk size. It is therefore suitable for streaming data collected from a dynamic environment.

7.2 Future Work

There are many possible directions in which to extend the work presented in this thesis. First, we use only transaction records for classification, but more data sources may lead to better results. Hence, we can combine data from multiple sources to extend the base of data samples. Feature selection will become more important when multiple data sources are considered, and how to retrieve associations between attributes of different data sources is a practical problem. We can also model the problem as a multi-class classification problem or a regression problem instead. There is still another issue on which we can concentrate: how to explain the associations between the time points of concept drift observed in the experimental results and the trend of the stock market in the real world, including the investigation of reoccurring concept drift. This also includes exploring ways to incorporate our methods into the creation of adaptive ensembles. For profit, a cost-sensitive strategy is worth considering; however, the cost should be defined in the data. Domain experts may be interested in the rules of the models. Observing rule sets at various times and comparing them to point to concept drift may help domain experts understand the problems and make decisions.

To deal with reoccurring concept drift, we designed the ADE algorithm, which classifies by weighted majority vote. However, selecting the result of the most appropriate sub-classifier, the one that reflects the current concept, may be a better way to classify. The simplest way to select is to choose the sub-classifier with the highest weight, but there may be much better ways. Moreover, we found that ADE generated a few similar sub-classifiers. Similar sub-classifiers reduce the advantage in dealing with reoccurring concept drift because the number of active sub-classifiers is limited. Hence, combining similar sub-classifiers, or avoiding generating them, would be helpful.
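The difference between the two classification strategies, and one possible way to spot similar sub-classifiers, can be sketched as follows. This is only an illustration under our own assumptions: the function names, the labels, the weights, and the use of a prediction agreement rate as a similarity measure are hypothetical and are not part of the ADE implementation.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Sum the weights behind each label and return the heaviest label."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def highest_weight_choice(predictions, weights):
    """Return the prediction of the single sub-classifier with the highest weight."""
    best = max(range(len(weights)), key=lambda i: weights[i])
    return predictions[best]


def agreement_rate(preds_a, preds_b):
    """Fraction of instances on which two sub-classifiers predict the same label."""
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)


# Hypothetical sub-classifier outputs for one instance.
predictions = ["rise", "fall", "fall"]
weights = [0.45, 0.30, 0.30]
print(weighted_majority_vote(predictions, weights))   # fall (0.60 against 0.45)
print(highest_weight_choice(predictions, weights))    # rise

# Hypothetical prediction histories of two sub-classifiers.
history_a = ["rise", "fall", "fall", "rise"]
history_b = ["rise", "fall", "rise", "rise"]
print(agreement_rate(history_a, history_b))           # 0.75
```

The example shows that the two strategies can disagree on the same instance. A high agreement rate between two sub-classifiers could serve as a signal to merge them or to skip creating the second one, although the appropriate threshold would have to be determined experimentally.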

