Thesis Organization - Application of Data Stream Analysis to Taiwan Stock Index Futures (串流資料分析在台灣股市指數期貨之應用)

CHAPTER 1 INTRODUCTION

Extending the above findings, we study methods proposed for detecting concept drift and explore methods that can help us identify its types. As a result, we find that both sudden and reoccurring concept drift exist in the TAIEX Futures data. To handle reoccurring concept drift, we propose an ensemble algorithm called Adaptive Drift Ensemble.

The proposed algorithm uses a drift detection method and trains a set of static sub-classifiers, each of which reflects a concept. According to the results of experiments conducted on the TAIEX Futures data, the proposed algorithm outperforms the other algorithms considered in the experiments. Furthermore, it adaptively determines the chunk size of each sub-classifier, which is an important parameter that other algorithms require to be specified in advance.

In summary, we empirically show that concept drift exists in the TAIEX Futures data. Moreover, we identify the types of concept drift. Finally, we propose an ensemble method that deals with reoccurring concept drift without requiring us to pre-specify an important parameter required by other methods.

1.4 Thesis Organization

The rest of this thesis is organized as follows: In Chapter 2, we review related papers with a focus on financial data analysis, including papers using non-data mining techniques, non-streaming data mining techniques, and data stream mining techniques. In Chapter 3, we discuss challenges in data stream mining and techniques proposed to deal with them. In Chapter 4, we introduce the proposed method, Adaptive Drift Ensemble, and describe its implementation details. In Chapter 5, we report the experimental results. In Chapter 6, we discuss the characteristics of the proposed method. In Chapter 7, we conclude this thesis and discuss directions for future work.

National Chengchi University (國立政治大學)

CHAPTER 2 PRELIMINARY

2.1 Non-Data Mining Techniques for Financial Data Analysis

Financial statement analysis is one of the most popular ways to forecast the rise or fall of stocks. One of its central issues is how to select accounting attributes and how to find statistical associations among them. In the past, this work depended on people’s domain knowledge. For example, Ou and Penman proposed a summary measure, built from a subset of financial statement items, to predict stock returns [4]. Holthausen and Larcker examined purely statistical models based on historical cost accounting information [5], and their empirical results support Ou and Penman’s contentions [4].

Technical analysis is another popular way to predict stock prices. It is commonly described as the statistical modeling of trading records [6]. Its core idea is that the stock price reflects influencing factors early and that price trends reoccur. Roberts studied patterns in the stock market by means of statistical suggestions, an approach that came to be called technical analysis [7]. There are many well-known schools of technical analysis, such as candlestick chart analysis, Dow Theory, and the Elliott wave principle. Technical analysis is also a popular subject of academic research. Blume et al. studied the relation among volume, information precision, and price movements. They concluded that “volume provides information on information quality that cannot be deduced from the price statistic” [8].

Autoregressive integrated moving average (ARIMA) is a popular model for time series analysis, and it has been widely applied to financial analysis. ARIMA is an extension of the autoregressive moving average (ARMA) model [9]. However, ARIMA can hardly recognize nonlinear patterns. Hence, Pai and Lin proposed a hybrid method that integrates the ARIMA model and the support vector machine (SVM) to forecast stock prices [10]. SVM is a data mining technique that can handle nonlinear problems. Based on their empirical results, Pai and Lin concluded that the hybrid method improves on the prediction performance of either single model, and they found that simply combining the two best individual models (single ARIMA and single SVM) did not give the best performance.

2.2 Non-Streaming Data Mining Techniques for Financial Data Analysis

In this section, we briefly review related work that predicts stocks or futures from their transactions, and we compare the related work in Table I. In Table I, a column represents a paper related to the topic of this thesis and a row represents a data mining technique. A cell of the table indicates whether the paper corresponding to the column uses the method corresponding to the row; if it does, the cell is marked with “V”. For example, the cell in the upper right corner indicates that paper [11] uses a non-streaming decision tree (and so do papers [12] and [13]), and the second cell from the right in the second row indicates that paper [13] uses sequential pattern mining. For decision trees, we list the various versions used in related work. According to Table I, the popular methods include decision trees, Support Vector Machine (SVM), and neural networks.

Data of stocks or futures are naturally streaming data. When we use data mining techniques designed for non-streaming data, it is important to transform a data stream into a set of data samples by using a time frame.
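A minimal sketch of this transformation is given below. It assumes that each tick is a (timestamp in seconds, price) pair arriving in time order; the `Bar` type, the function name `to_time_frames`, and the open/high/low/close summary are illustrative assumptions rather than the representation used in the papers reviewed here.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Bar(NamedTuple):
    """One data sample summarizing all ticks inside a time frame."""
    start: int    # start of the frame, in seconds
    open: float   # first price in the frame
    high: float   # highest price in the frame
    low: float    # lowest price in the frame
    close: float  # last price in the frame


def to_time_frames(ticks: Iterable[Tuple[int, float]],
                   frame_seconds: int = 300) -> List[Bar]:
    """Group time-ordered (timestamp, price) ticks into fixed-length frames."""
    bars: List[Bar] = []
    for timestamp, price in ticks:
        # Align the tick to the start of the frame it belongs to.
        start = timestamp - timestamp % frame_seconds
        if bars and bars[-1].start == start:
            # Same frame as the previous tick: update high, low, and close.
            last = bars[-1]
            bars[-1] = Bar(start, last.open, max(last.high, price),
                           min(last.low, price), price)
        else:
            # First tick of a new frame.
            bars.append(Bar(start, price, price, price, price))
    return bars
```

With `frame_seconds=300`, ticks at 0 s, 60 s, and 299 s fall into the same 5-minute sample, while a tick at 300 s starts the next one. Frames that contain no tick produce no sample in this sketch.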

In [11], a 5-minute time frame is selected for TAIEX Futures forecasting. The paper uses the C5.0 decision tree to build two models: one, named the rising model, is trained to decide between rising and waiting; the other, named the falling model, is trained to decide between falling and waiting. The two models can be combined by majority vote, which requires a decision matrix such as the one shown in Table II. Each row of the table corresponds to a decision of the rising model, and each column corresponds to a decision of the falling model. Each cell gives the joint decision produced by majority vote.
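The joint decision can be sketched as a simple table lookup. The dictionary below is transcribed from Table II; the function name `joint_decision` and the string labels are assumptions made for illustration, not code from paper [11].

```python
# Joint decisions keyed by (rising-model output, falling-model output),
# transcribed from Table II.
DECISION_MATRIX = {
    ("Waiting", "Waiting"): "Waiting",
    ("Waiting", "Falling"): "Falling",
    ("Rising", "Waiting"): "Rising",
    ("Rising", "Falling"): "Waiting",
}


def joint_decision(rising_output: str, falling_output: str) -> str:
    """Combine the outputs of the rising model and the falling model."""
    return DECISION_MATRIX[(rising_output, falling_output)]
```

When the two models disagree in direction (one says rising, the other falling), the matrix resolves the conflict to waiting.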

Table I A Comparison of Related Work

Methods\papers

Sequential pattern mining V⁴

Adaboost V

SVM V

Neural network V

Fuzzy-based rule V

k-means V V

Association rule V

Note: 1. C5.0 Tree; 2. Unknown (not specified by the authors); 3. Hoeffding Tree; 4. ICspan, IAspam.

Paper [12] compares methods for forecasting Taiwan IC stocks: SVM, SVM with Genetic Algorithm (GA-SVM), decision tree, and Back-Propagation Network (BPN). Unlike the other papers, it trains models on accounting-based financial indexes rather than on transaction records.

Paper [13] uses data mining techniques such as classification, clustering, and association rules to forecast the future trend of stocks. It analyzes the financial accounts of a company and, to examine whether the company’s stock is worth holding, uses the C5.0 decision tree to train models.

Paper [14] uses a TSK fuzzy model to forecast the TAIEX rather than to classify it. Combined with a clustering method, the TSK fuzzy model forecasts the TAIEX better than BPN and multiple regression analysis in the paper. The data used in the paper are technical indexes, which are statistical or calculated information derived from transactions. Because there are many technical indexes, factor selection is a problem discussed in the paper. Using the groups of data clustered by k-means, the paper sets up simplified fuzzy rules and trains the parameters.

Paper [15] designs a TAIEX Futures intra-day trading system using Adaptive Boosting (AdaBoost). The data source is based on technical indexes of the TAIEX, and the time frame used in the paper is 1 minute. The paper uses several methods combined with AdaBoost to train models.

Table II Decision Matrix Used in Paper [11]

Rising\Falling    Waiting    Falling
Waiting           Waiting    Falling
Rising            Rising     Waiting

2.3 Data Stream Mining Techniques for Financial Data Analysis

2.3.1 Concept Drift Analysis

Concept drift is the problem that an old static model cannot deal with newly arriving instances, and it often occurs in long-term mining. Concept drift may therefore exist not only in data stream mining but also in non-streaming data mining. However, it is most commonly encountered as a problem in data stream mining.

Harries and Horn used general decision tree algorithms to classify Share Price Index Futures, which are traded on the Sydney Futures Exchange and based on the All Ordinaries Index. They proposed a strategy that deals with concept drift by classifying only new data instances that are similar to those used in training [16].

In paper [17], Pratt and Tschapek proposed the Brushed Parallel Histogram for visualizing concept drift. They used 500 stocks listed by Standard and Poor’s during 2002 and concluded that “a machine learning assumption of stationary would be unsupported by the data and that static market prediction models may be fragile in the face of drift” [17].

In paper [18], Marrs, Hickey and Black discussed the impact of latency on online classification when learning with concept drift. Latency corresponds to the duration of the recovery from concept drift. They proposed an online learning model for data streams and, based on that model, an online learning algorithm for evaluating the impact of latency. Furthermore, they showed that different types of latency have different impacts on the recovery of accuracy.

2.3.2 Data Stream Mining Techniques

Paper [19] discusses sequential pattern mining techniques. It uses five ordinal labels as patterns of the change of the stock price: falling, few falling, unchanged, few rising and rising. The inputs in the paper are multiple data streams consisting of the five types of patterns from various stocks. The paper uses two sequential pattern mining techniques, namely ICspan and IAspam, which can discover common sequential patterns from multiple inputs.

However, ICspan and IAspam do not meet the requirements of data stream mining. Using non-streaming data mining methods on streaming data means facing the problem of concept drift, which decreases accuracy as time passes, so the built model must be adjusted appropriately. In this thesis, we apply data stream mining techniques to TAIEX Futures data with MOA [2], which offers classification and clustering algorithms for data streams as well as evaluation strategies that help its users validate the built models appropriately.

In paper [20], Sun and Li proposed an expert system to predict financial distress and handled concept drift through instance selection, including full memory window, no memory window, fixed window size, adaptive window size, and batch selection. They evaluated the system on financial statements of companies from the Chinese Stock Exchange (CSE) and regarded a company as being in financial distress if it was specially treated by the CSE. They reported that some of these methods performed better than methods that only learn stationary models. Moreover, they stated that the full memory and no memory windowing methods were usually not good at dealing with concept drift but were good at detecting concept drift and its type. However, Sun and Li did not indicate what types of concept drift exist or how to determine the type of concept drift.

In paper [21], the author proposed On-Line Information Network (OLIN), an online classification system. OLIN deals with concept drift by adjusting the window size of training data and the number of coming instances between two model re-constructions.

CHAPTER 3 DATA STREAM MINING

3.1 Introduction to Data Stream Mining

Generally speaking, data mining problems can be divided into three categories, namely classification, clustering, and association rule mining. Classification is the problem whose goal is to develop a model that can automatically classify incoming instances into target classes. The model is trained on a given set of training instances, each of which contains values for various attributes and is labeled with its correct class. If the number of classes in the data is 2, e.g. positive and negative, we call the problem a binary-class, or simply binary, classification problem. We call it a multiple-class classification problem if the number of classes is larger than 2. Clustering is defined as the problem whose goal is to develop a model that can partition the whole data set into various subsets or groups. The differences between these two problems include 1) classification needs to learn a model but clustering does not and 2) classification needs pre-specified class labels but clustering does not.

Non-streaming data mining algorithms, such as C4.5 [22], ID3 [23], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [24], K-Nearest Neighbors (KNN) [25], Support Vector Machine (SVM) [26] and Adaptive Boosting (AdaBoost) [27], store all instances in memory for training. Sometimes, the models built by these algorithms must be reconstructed when they are updated with new instances. Some of these algorithms can be used in time series analysis. However, when processing an online task, which usually involves a series (rather than a set) of instances, a non-streaming data mining algorithm requires more and more memory and time to store and process the growing number of instances. Moreover, an online task may need to update its model; otherwise, using an old model to classify new instances may significantly decrease classification performance.

Data stream mining algorithms are designed to perform incremental learning, i.e. they can update models without reconstructing models. Furthermore, they do not store and keep all instances in memory all the time. These algorithms read instances into memory, update corresponding statistics, and then remove the instances from memory. Hence, these algorithms use a limited amount of memory and use less time because they do not need to reconstruct models.
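The read, update, discard cycle described above can be illustrated with a running mean and variance. The sketch below uses Welford's online algorithm as a stand-in for the statistics that a stream learner maintains; the class name `RunningStats` is hypothetical and the code is not taken from MOA.

```python
class RunningStats:
    """Mean and variance of a stream, kept in constant memory."""

    def __init__(self) -> None:
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one value into the statistics; the value itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0
```

Each call to `update` costs constant time and memory, so the cost of keeping the statistics current does not grow with the length of the stream.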

In summary, data stream mining takes streaming data as input, and streaming data should be analyzed in a reasonable amount of time and by using a limited amount of memory.

3.2 Concept Drift

Concept drift is an important issue in data stream mining. A concept represents the underlying data distribution, and the target concept depends on a hidden context. It is assumed that all data samples are drawn from a true underlying data distribution in the real world. When the hidden context changes, the distribution of data samples also changes, and the target classes of the data may change as well. If the concept remains the same, the distribution is stationary. A change in the underlying data distribution is called concept drift [28].

Concept drift has an impact on data analysis, and the impact is amplified in online learning. Concept drift causes a significant decrease in accuracy. When the target concept changes, the old model may make wrong decisions for instances that are similar to the instances on which it was built but that now have different class labels. Hence, updating the model is necessary when concept drift occurs.

There are four types of concept drift, as shown in Figure 1. Every row of the figure shows one type of concept drift. The color distribution of an object in a row stands for the main concept at that time, and a different color distribution within a row means that the concept has changed over time. For example, the first row of Figure 1 shows sudden concept drift. The color distributions of the third and fourth objects are different, so these objects represent data from different concepts. This change is concept drift, and because the concept changes suddenly, this type is called sudden concept drift.

Figure 1. The four types of concept drift (reproduced from [29])

Note: The distribution of color of each data graph means a concept.

Sudden concept drift means that a concept jumps to another concept suddenly. Gradual concept drift and incremental concept drift mean that a concept moves to another concept step by step. The difference between them is that incremental concept drift transfers to another concept directly. Reoccurring concept drift means that concept drift repeatedly appears at different points in time.

Bifet et al. categorized the strategies used to handle different types of concept drift into four categories [29], as listed below.

• Detector method: It uses variable-length windows and switches to a new model when the corresponding detector is triggered. This method is suitable for dealing with sudden concept drift.

• Forgetting method: It uses fixed-length windows and only processes data contained in a window, or it uses a strategy that gives higher weights to new instances. This method is suitable for dealing with sudden concept drift.

• Contextual method: It uses a dynamic integration or meta-learning strategy, and it makes a decision by referring to the results of the sub-classifiers of an ensemble. This method is suitable for dealing with reoccurring concept drift.

• Dynamic ensemble method: It uses adaptive fusion rules, and it makes decisions by weighted voting. This method is suitable for dealing with gradual and incremental concept drift.
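As an illustration of the forgetting method with a fixed-length window, the sketch below keeps only the most recent labels and predicts the majority class among them. The majority-class predictor is a deliberately simple stand-in for a real model, and the class name `WindowedMajorityClassifier` is hypothetical.

```python
from collections import Counter, deque
from typing import Hashable, Optional


class WindowedMajorityClassifier:
    """Predicts the most frequent label among the last `window_size` labels."""

    def __init__(self, window_size: int) -> None:
        # A deque with maxlen silently drops the oldest label when full,
        # which is how old data are forgotten.
        self.window: deque = deque(maxlen=window_size)

    def learn(self, label: Hashable) -> None:
        """Add the newest label to the window."""
        self.window.append(label)

    def predict(self) -> Optional[Hashable]:
        """Return the majority label in the window, or None if it is empty."""
        if not self.window:
            return None
        return Counter(self.window).most_common(1)[0][0]
```

Because labels older than the window are discarded, the prediction follows a sudden change of concept after at most a window's worth of new instances, at the price of ignoring everything learned earlier.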

3.3 MOA: A Data Stream Mining Tool

Massive Online Analysis (MOA) was proposed by Bifet et al. [2]. It is based on the Waikato Environment for Knowledge Analysis (WEKA) [3] and written in Java (an object-oriented and cross-platform programming language), and it supports both a Graphical User Interface (GUI) and a command line interface. MOA offers a large set of functions to control the data mining process for data streams. For evaluation, MOA offers Prequential Evaluation, which first tests the model on each instance and then uses the instance to train the model; it also offers Periodic Held Out Test, which tests instances with a (static) trained model and outputs results for the evaluation period. For controlling data, MOA includes artificial data generation functions and, of course, supports the use of data provided by users.
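The test-then-train order of prequential evaluation can be sketched as follows. This is a simplified illustration, not MOA's implementation: `MajorityClassLearner` and `prequential_accuracy` are hypothetical names, and the stream is assumed to be an iterable of (features, label) pairs.

```python
from collections import Counter
from typing import Any, Hashable, Iterable, Optional, Tuple


class MajorityClassLearner:
    """Toy incremental learner that always predicts the most frequent label."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def predict(self, features: Any) -> Optional[Hashable]:
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, features: Any, label: Hashable) -> None:
        self.counts[label] += 1


def prequential_accuracy(stream: Iterable[Tuple[Any, Hashable]],
                         learner: MajorityClassLearner) -> float:
    """Test on each instance first, then train on that same instance."""
    correct = 0
    total = 0
    for features, label in stream:
        # Test step: the learner has not yet seen this instance.
        if learner.predict(features) == label:
            correct += 1
        total += 1
        # Train step: only now is the instance used for learning.
        learner.learn(features, label)
    return correct / total if total else 0.0
```

Since every instance is used for testing before it is used for training, no separate hold-out set is needed, and the accuracy reflects how the model performs on data it has not seen yet.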

3.4 Data Stream Mining Algorithms

In this section, we introduce some data stream mining algorithms provided by MOA.

3.4.1 Naïve Bayes

Naïve Bayes is one of the most popular algorithms in the field of data mining [30], and it can be used in data stream mining. A Naïve Bayes classifier assumes that attributes are conditionally independent given the class. When classifying a given instance, it chooses the class label with the highest probability. The model needs only statistical information, so there is no need to store all instances.
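A minimal sketch of such a classifier for categorical attributes is shown below. It keeps only counts and updates them one instance at a time. Laplace smoothing is an added assumption for handling unseen values; the class name `IncrementalNaiveBayes` is hypothetical, and the code is not MOA's implementation.

```python
import math
from collections import defaultdict
from typing import Hashable, Optional, Sequence


class IncrementalNaiveBayes:
    """Naive Bayes over categorical attributes, trained one instance at a time."""

    def __init__(self) -> None:
        self.class_counts = defaultdict(int)   # label -> count
        self.value_counts = defaultdict(int)   # (label, index, value) -> count
        self.values_seen = defaultdict(set)    # index -> distinct values

    def learn(self, attributes: Sequence[Hashable], label: Hashable) -> None:
        """Update the counts; the instance itself is not stored."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(label, i, value)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes: Sequence[Hashable]) -> Optional[Hashable]:
        """Return the label with the highest (log) posterior score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)  # log prior
            for i, value in enumerate(attributes):
                count = self.value_counts.get((label, i, value), 0)
                n_values = len(self.values_seen[i] | {value})
                # Laplace-smoothed conditional probability (assumption).
                score += math.log((count + 1) / (n_label + n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The memory used depends on the number of classes and distinct attribute values, not on the number of instances processed.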

3.4.2 Hoeffding Tree

Hoeffding Tree [31] is a tree-based classification algorithm for data streams. The core of the algorithm is its splitting rule: a node is split on the best attribute only when the difference between the scores of the best and the second-best attributes exceeds the Hoeffding bound. The scoring function is information gain. Because the Hoeffding bound lets the algorithm decide on a split after seeing only a limited number of instances, the algorithm meets the requirements of data stream mining.
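The split test can be sketched as follows, assuming the usual form of the Hoeffding bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)). The symbols are not defined in the text above and are introduced here as assumptions: R is the range of the scoring function (log2 of the number of classes for information gain), delta is the allowed probability of choosing the wrong attribute, and n is the number of instances seen at the node. The function names are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)); shrinks as n grows."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 num_classes: int, delta: float, n: int) -> bool:
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    # Information gain lies in [0, log2(number of classes)].
    value_range = math.log2(num_classes)
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

With few instances the bound is wide and the node keeps collecting statistics; as n grows the bound narrows, so the same gap between the two best attributes eventually justifies a split. Practical implementations typically add further details, such as tie handling, that are omitted here.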

3.4.3 Hoeffding Adaptive Tree

Another version of Hoeffding Tree used in this thesis is Hoeffding Adaptive Tree [32][33], which is designed for adaptive, parameter-free learning from evolving data streams. A data stream is an ordered sequence, and the order is sensitive to time. In data stream mining, there are three central problems to be solved, as listed below:

• What to remember (i.e. keep in memory)?

• When to upgrade the built model?

• How to upgrade the built model?

These problems are dealt with by the following functions:

• A window to remember recent samples.

• Methods for detecting changes in the data distribution.

• Methods for updating estimates of the statistics of the input.

These three functions suggest a system that contains three modules: an estimator, a memory, and a change detector. Hoeffding Adaptive Tree is derived from Hoeffding Window Tree [32], which uses a window for Hoeffding Tree. When the change detector raises an alarm at a node of the tree, the algorithm creates a new alternative Hoeffding Tree for that node, and it replaces the original sub-tree at the node with the alternative tree if the accuracy of the alternative tree is higher.

However, deciding the optimal size of the window requires sufficient background knowledge. Hoeffding Adaptive Tree places instances of estimators of frequency statistics at every node instead of setting a window size as Hoeffding Window Tree does. Consequently, Hoeffding Adaptive Tree evolves with the data, and it needs no window size parameter.

3.4.4 Drift Detection Method

In paper [34], Gama et al. proposed Drift Detection Method (DDM) to detect concept drift, and the method works in the following way: It calculates the minimum error rate of the
