
CHAPTER 2 PRELIMINARY

2.3 Data Stream Mining Techniques for Financial Data Analysis

2.3.1 Concept Drift Analysis

Concept drift refers to the problem that an old, static model cannot deal with newly arriving instances, and it often occurs in long-term mining. Concept drift may therefore exist not only in data stream mining but also in non-streaming data mining. However, it is a particularly common problem when performing data stream mining.

Harries and Horn used general decision tree algorithms to classify Share Price Index Futures, which are traded on the Sydney Futures Exchange and based on the All Ordinaries Index, and they proposed a strategy that deals with concept drift by classifying only new data instances that are similar to those used in training [16].

In paper [17], Pratt and Tschapek proposed the Brushed Parallel Histogram for visualizing concept drift. They used 500 stocks listed in the Standard and Poor's index during 2002 and concluded that “a machine learning assumption of stationary would be unsupported by the data and that static market prediction models may be fragile in the face of drift” [17].

In paper [18], Marrs, Hickey and Black discussed the impact of latency on online classification when learning with concept drift, where latency corresponds to the duration of the recovery from concept drift. They proposed an online learning model for data streams, together with an online learning algorithm based on that model, to evaluate the impact of latency, and they showed that different types of latency have different impacts on recovery accuracy.

2.3.2 Data Stream Mining Techniques

Paper [19] discusses sequential pattern mining techniques. It uses five ordinal labels as patterns of the change of the stock price: falling, few falling, unchanged, few rising and rising.

The inputs in the paper are multiple data streams consisting of the five types of patterns from various stocks. The paper uses two sequential pattern mining techniques, namely ICspan and IAspam, which can discover common sequential patterns from multiple inputs.

However, ICspan and IAspam do not meet the requirements of data stream mining. Applying non-streaming data mining methods to streaming data leads to the problem of concept drift, which decreases accuracy as time passes, so it is necessary to adjust the built model appropriately. In this thesis, we apply data stream mining techniques to TAIEX Futures data with MOA [2], which offers classification and clustering algorithms for data streams as well as evaluation strategies that help users validate the built models appropriately.

In paper [20], Sun and Li proposed an expert system to predict financial distress and used instance selection for handling concept drift, including full memory window, no memory window, fixed window size, adaptive window size, and batch selection. Sun and Li evaluated the system on financial statements of companies listed on the Chinese Stock Exchange (CSE) and regarded a company as being in financial distress if it had been specially treated by the CSE. They reported that some of these methods performed better than others that only learn stationary models.

Moreover, they stated that the full memory and no memory windowing methods were usually not good at dealing with concept drift, although they were good at detecting concept drift and its type. However, Sun and Li did not indicate what types of concept drift exist or how to determine the type of concept drift.

In paper [21], the author proposed the On-Line Information Network (OLIN), an online classification system. OLIN deals with concept drift by adjusting the window size of the training data and the number of incoming instances between two model reconstructions.


CHAPTER 3

DATA STREAM MINING

3.1 Introduction to Data Stream Mining

Generally speaking, data mining problems can be divided into three categories, namely classification, clustering, and association rule mining. Classification is the problem whose goal is to develop a model that can automatically classify incoming instances into target classes. The model is trained on a given set of training instances, where each training instance contains values for various attributes and is attached with a correct class label. If the number of classes in the data is 2, e.g. positive and negative, we call the problem a binary-class, or simply binary, classification problem; if the number of classes is larger than 2, we call it a multi-class classification problem. Clustering is defined as the problem whose goal is to develop a model that can partition the whole data set into various subsets or groups. The differences between these two problems include 1) classification learns its model from labeled training instances whereas clustering requires no such supervision, and 2) classification needs pre-specified class labels but clustering does not.

Non-streaming data mining algorithms, such as C4.5 [22], ID3 [23], Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [24], K-Nearest Neighbors (KNN) [25], Support Vector Machine (SVM) [26] and Adaptive Boosting (AdaBoost) [27], store all instances in memory for training. Sometimes these algorithms must rebuild the model from scratch when updating a built model with new instances. Some of these algorithms can be used in time series analysis. However, when processing an online task, which usually involves a series (rather than a set) of instances, a non-streaming data mining algorithm will require more and more memory and time in order to store and process more and more instances.

Moreover, an online task may need to update a model. Otherwise, using an old model to classify new instances may significantly decrease the classification performance.

Data stream mining algorithms are designed to perform incremental learning, i.e. they can update models without reconstructing them. Furthermore, they do not keep all instances in memory all the time: these algorithms read instances into memory, update the corresponding statistics, and then remove the instances from memory. Hence, they use a limited amount of memory and less time, because they do not need to reconstruct models.
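The following minimal Python sketch (illustrative only, not an algorithm taken from MOA) shows this read-update-discard pattern: a nearest-class-mean classifier that keeps only running sums per class, so its memory footprint stays constant no matter how many instances arrive.

class StreamingClassMean:
    def __init__(self):
        self.sums = {}    # class label -> per-attribute running sums
        self.counts = {}  # class label -> number of instances seen

    def update(self, x, y):
        """Fold one labeled instance into the statistics; the instance itself can then be discarded."""
        if y not in self.sums:
            self.sums[y] = [0.0] * len(x)
            self.counts[y] = 0
        self.sums[y] = [s + v for s, v in zip(self.sums[y], x)]
        self.counts[y] += 1

    def predict(self, x):
        """Return the class whose mean attribute vector is closest to x."""
        best, best_dist = None, float("inf")
        for y, sums in self.sums.items():
            mean = [s / self.counts[y] for s in sums]
            dist = sum((a - b) ** 2 for a, b in zip(x, mean))
            if dist < best_dist:
                best, best_dist = y, dist
        return best

model = StreamingClassMean()
for x, y in [([1.0, 2.0], "up"), ([0.5, 1.5], "up"), ([5.0, 6.0], "down")]:
    model.update(x, y)            # each instance is used once and then forgotten
print(model.predict([0.8, 1.8]))  # -> "up"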

In summary, data stream mining takes streaming data as input, and streaming data should be analyzed in a reasonable amount of time and by using a limited amount of memory.

3.2 Concept Drift

Concept drift is an important issue in data stream mining. A concept represents the underlying data distribution, and the target concept depends on a hidden context. It is assumed that all data samples are drawn from a true underlying data distribution in the real world. When the hidden context changes, the distribution of the data samples also changes, and the target classes of the data may change as well. If the concept stays the same, the distribution is stationary; when the underlying data distribution changes, the change is called concept drift [28].

Concept drift has an impact on data analysis, and the impact is amplified under online learning. Concept drift causes a significant decrease in accuracy: when the target concept changes, the old model may make wrong decisions for instances that are similar to the instances on which the old model was built but carry different class labels. Hence, updating is necessary when concept drift occurs.

There are four types of concept drift, as shown in Figure 1. Each row of the figure shows one type of concept drift. The color distribution of an object in a row stands for the underlying concept, so different color distributions within a row mean that the concept changes over time. For example, the first row of Figure 1 shows sudden concept drift: the color distributions of the 3rd and 4th objects are different, so the two objects belong to different concepts; because the concept changes abruptly, this type is called sudden concept drift.

Figure 1. The four types of concept drift (reproduced from [29]). Note: the color distribution of each data graph represents a concept.

Sudden concept drift means that one concept jumps to another concept abruptly. Gradual concept drift and incremental concept drift both mean that a concept moves to another concept step by step; the difference between them is that incremental concept drift moves to the new concept directly, without switching back to the old one. Reoccurring concept drift means that a previously seen concept reappears at different points in time.
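As a rough illustration (not code from the thesis), the sketch below generates a synthetic binary stream whose positive-class probability follows each of the four drift patterns; the function name, the segment lengths, and the chosen probabilities are arbitrary.

import random

def drift_stream(kind, length=1000, p_old=0.2, p_new=0.8):
    """Yield 0/1 labels whose underlying positive-class probability drifts in the given way."""
    for t in range(length):
        frac = t / length
        if kind == "sudden":
            p = p_old if frac < 0.5 else p_new              # abrupt jump half-way through
        elif kind == "incremental":
            p = p_old + (p_new - p_old) * frac              # moves toward the new concept directly
        elif kind == "gradual":
            p = p_new if random.random() < frac else p_old  # old and new concepts alternate
        elif kind == "reoccurring":
            p = p_new if (t // 100) % 2 else p_old          # an earlier concept keeps coming back
        else:
            raise ValueError(kind)
        yield 1 if random.random() < p else 0

print(list(drift_stream("sudden", length=10)))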

Bifet et al. categorized the strategies used to handle different types of concept drift into four categories [29], as listed below.

• Detector method: It uses variable-length windows and switches to a new model when the corresponding detector is triggered. This method is suitable for dealing with sudden concept drift.

• Forgetting method: It uses fixed-length windows and only processes the data contained in a window, or it uses a strategy that gives higher weights to newer instances. This method is suitable for dealing with sudden concept drift.

• Contextual method: It uses a dynamic integration or meta-learning strategy, and it makes a decision by referring to the results of the sub-classifiers of an ensemble. This method is suitable for dealing with reoccurring concept drift.

• Dynamic ensemble method: It uses adaptive fusion rules, and it makes decisions by weighted voting. This method is suitable for dealing with gradual and incremental concept drift.

3.3 MOA: A Data Stream Mining Tool

Massive Online Analysis (MOA) is proposed by Bifet et al. [2]. It is based on the Waikato Environment for Knowledge Analysis (WEKA) [3], is written in Java (an object-oriented and cross-platform programming language), and supports both a Graphical User Interface (GUI) and a command line interface. MOA offers a large set of functions to control the data mining process for data streams. For evaluation, MOA offers Prequential Evaluation, which tests on each instance first and then uses it to train the model, and Periodic Held-Out Test, which tests instances with a (static) trained model and outputs results for the evaluation period. For controlling data, MOA includes artificial data generation functions and, of course, it supports the use of data provided by users.
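A minimal test-then-train loop in the spirit of Prequential Evaluation is sketched below (illustrative Python, not MOA's API); the MajorityClass learner is just a placeholder with predict/update methods.

def prequential_accuracy(model, labeled_stream):
    """Test each labeled instance first, then use the same instance to train the model."""
    correct = total = 0
    for x, y in labeled_stream:
        if model.predict(x) == y:   # test first ...
            correct += 1
        model.update(x, y)          # ... then train
        total += 1
    return correct / total if total else 0.0

class MajorityClass:
    """Placeholder learner that always predicts the most frequent class seen so far."""
    def __init__(self):
        self.counts = {}
    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None
    def update(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

stream = [([i], "up" if i % 3 else "down") for i in range(300)]
print(prequential_accuracy(MajorityClass(), stream))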

3.4 Data Stream Mining Algorithms

In this section, we introduce some data stream mining algorithms provided by MOA.

3.4.1 Naïve Bayes

Naïve Bayes is one of the most popular algorithms in the field of data mining [30], and it can also be used in data stream mining. A Naïve Bayes classifier assumes that the attributes are conditionally independent given the class. When classifying a given instance, it chooses the class label with the highest probability. The model needs only statistical information, so there is no need to store all instances.
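The sketch below (an illustrative simplification, not MOA's implementation) shows why only statistics are needed: an incremental Naïve Bayes for nominal attributes that stores per-class and per-(class, attribute, value) counts and updates them one instance at a time; the smoothing term assumes two possible values per attribute.

from collections import defaultdict

class IncrementalNaiveBayes:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)   # key: (class, attribute index, value)
        self.total = 0

    def update(self, x, y):
        self.class_counts[y] += 1
        self.total += 1
        for i, v in enumerate(x):
            self.value_counts[(y, i, v)] += 1

    def predict(self, x):
        best, best_score = None, float("-inf")
        for y, cc in self.class_counts.items():
            score = cc / self.total  # prior P(y)
            for i, v in enumerate(x):
                # smoothed P(value | y), assuming two possible values per attribute
                score *= (self.value_counts[(y, i, v)] + 1) / (cc + 2)
            if score > best_score:
                best, best_score = y, score
        return best

nb = IncrementalNaiveBayes()
nb.update(["rising", "high"], "buy")
nb.update(["falling", "low"], "sell")
print(nb.predict(["rising", "high"]))  # -> "buy"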

3.4.2 Hoeffding Tree

Hoeffding Tree [31] is a tree-based classification algorithm for data streams. The central idea of the algorithm is to split a node on the best attribute only when the difference between the grade of the best attribute and the grade of the second-best attribute is larger than the Hoeffding bound; the grading function is information gain. With the Hoeffding bound, the algorithm can meet the requirements of data stream mining.
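A worked sketch of this split test follows (illustrative numbers). The Hoeffding bound is ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the grading function (R = 1 for information gain on a two-class problem), δ is the allowed error probability, and n is the number of instances seen at the leaf.

import math

def hoeffding_bound(value_range, delta, n):
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

best_gain, second_gain = 0.42, 0.30
eps = hoeffding_bound(value_range=1.0, delta=1e-6, n=600)
if best_gain - second_gain > eps:
    print("split on the best attribute")   # the observed difference is statistically reliable
else:
    print("wait for more instances")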

3.4.3 Hoeffding Adaptive Tree

There is another version of Hoeffding Tree used in this thesis, Hoeffding Adaptive Tree [32][33], which performs adaptive, parameter-free learning from evolving data streams. A data stream is an ordered sequence, and the order is sensitive to time. In data stream mining, there are three central problems to be solved, as listed below:

• What to remember (i.e. keep in memory)?

• When to update the built model?

• How to update the built model?

The above problems are dealt with by the following functions:

• A window to remember recent samples.

• Methods for detecting changes of the data distribution.

• Methods for updating estimates of statistics of the input.

These three functions suggest that a system should contain three modules: an estimator, a memory, and a change detector. Hoeffding Adaptive Tree is derived from Hoeffding Window Tree [32], which attaches a sliding window to a Hoeffding Tree. When the change detector issues an alarm at a node of the tree, the algorithm creates a new alternative Hoeffding Tree for that node, and it replaces the original sub-tree of the node with the alternative tree if the accuracy of the alternative tree is higher.

However, deciding the optimal size of the window requires sufficient background knowledge.

Instead of setting a window size as in Hoeffding Window Tree, Hoeffding Adaptive Tree places instances of estimators of frequency statistics at every node. Consequently, Hoeffding Adaptive Tree evolves by itself and needs no window-size parameter.
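A much-simplified sketch of the estimator/memory/change-detector idea is given below: a window of recent 0/1 values that compares the means of its older and newer halves with a Hoeffding-style bound and forgets the older half when they disagree. It is only meant to illustrate the mechanism; the actual Hoeffding Adaptive Tree keeps such estimators at every tree node.

import math
from collections import deque

class SimpleAdaptiveWindow:
    def __init__(self, delta=0.01, max_size=2000):
        self.delta = delta
        self.window = deque(maxlen=max_size)

    def add(self, value):
        """Insert a new 0/1 observation; shrink the window if a change is detected."""
        self.window.append(value)
        n = len(self.window)
        if n < 20:
            return False
        half = n // 2
        items = list(self.window)
        old, new = items[:half], items[half:]
        gap = abs(sum(old) / len(old) - sum(new) / len(new))
        bound = math.sqrt(math.log(2.0 / self.delta) / (2.0 * half))
        if gap > bound:                 # the input distribution appears to have changed
            for _ in range(half):
                self.window.popleft()   # forget the outdated half
            return True
        return False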

3.4.4 Drift Detection Method

In paper [34], Gama et al. proposed the Drift Detection Method (DDM) to detect concept drift, and the method works in the following way. It keeps track of the minimum error rate of the model, denoted $p_{min}$, and the corresponding standard deviation, denoted $s_{min}$. Given a parameter $\alpha$, for the i-th data instance, DDM is at the warning level if $p_i + s_i > p_{min} + 2\alpha \cdot s_{min}$, and it is at the drift level if $p_i + s_i > p_{min} + 3\alpha \cdot s_{min}$. When DDM is at the warning level, it trains a new classification model by using the new data instances, and it replaces the old model with the new one if DDM moves to the drift level. DDM can handle sudden concept drift.
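The sketch below implements this rule in Python with the α parameter used above (the original DDM of Gama et al. corresponds to α = 1); the 30-instance warm-up and the class itself are an illustrative simplification, not MOA's implementation.

import math

class DDM:
    IN_CONTROL, WARNING, DRIFT = 0, 1, 2

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def add(self, error):
        """Feed 1 for a misclassification, 0 for a correct prediction; return the level."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n >= 30 and p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s       # remember the best point seen so far
        if p + s > self.p_min + 3 * self.alpha * self.s_min:
            self.reset()                        # drift: start monitoring the new concept
            return self.DRIFT
        if p + s > self.p_min + 2 * self.alpha * self.s_min:
            return self.WARNING
        return self.IN_CONTROL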

3.4.5 Early Drift Detection Method

In paper [35], Baena-García et al. proposed the Early Drift Detection Method (EDDM) as an enhancement of DDM. EDDM is able to detect the presence of gradual concept drift. It works similarly to DDM, but instead of the error rate, EDDM considers the interval between two errors (i.e. incorrect classifications or predictions): the smaller the interval, the higher the probability that concept drift has occurred.
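The following sketch illustrates that idea: it monitors the mean and standard deviation of the distance between consecutive errors and compares the current value with the best (largest) value observed so far, so that shrinking distances trigger a warning and then a drift signal. The class and its thresholds (0.95 and 0.90) are an illustrative simplification, not the exact EDDM implementation.

import math

class SimpleEDDM:
    def __init__(self, warning=0.95, drift=0.90):
        self.warning, self.drift = warning, drift
        self.t = 0
        self.n_errors = 0
        self.last_error_at = 0
        self.mean = 0.0   # running mean of distances between errors
        self.m2 = 0.0     # running sum of squared deviations (Welford's method)
        self.best = 0.0   # largest mean + 2 * std seen so far

    def add(self, error):
        """Feed 1 for a misclassification, 0 otherwise; return 'in control', 'warning' or 'drift'."""
        self.t += 1
        if not error:
            return "in control"
        distance = self.t - self.last_error_at
        self.last_error_at = self.t
        self.n_errors += 1
        delta = distance - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (distance - self.mean)
        score = self.mean + 2 * math.sqrt(self.m2 / self.n_errors)
        self.best = max(self.best, score)
        if self.n_errors < 30:
            return "in control"
        ratio = score / self.best
        if ratio < self.drift:
            return "drift"
        if ratio < self.warning:
            return "warning"
        return "in control"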

3.4.6 Accuracy Weighted Ensemble

An ensemble is a group of sub-classifiers that provide collective intelligence by voting or in other ways, and evolving ensembles can be used to deal with concept drift. For example, we can use chunks of data instances of a constant size to train sub-classifiers. After training a number of sub-classifiers, we use only the top-K sub-classifiers to vote for the overall classification. Every sub-classifier has its own weight, calculated from the mean square error rate, and its weight is updated when a new chunk arrives. However, a new chunk of data instances is not used to re-train the existing sub-classifiers, so these sub-classifiers may not be able to notice that the underlying data distribution has changed.

Accuracy Weighted Ensemble (AWE) [36] is one implementation of the above strategy. AWE uses a chunk of a given size to train a corresponding sub-classifier and calculates the sub-classifier's weight from the difference between the mean square error of a random (baseline) classifier and the mean square error of the sub-classifier on the most recent chunk. The trained sub-classifier is static and is not updated by new chunks.
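A small sketch of this chunk-based weighting is shown below (illustrative names and numbers; a simplification in the spirit of [36]): each sub-classifier is weighted by MSE_r − MSE_i, where MSE_i is its mean square error on the newest chunk and MSE_r is the error of a random classifier, and only the top-K sub-classifiers by weight are kept for voting.

def mse_on_chunk(probs_of_true_class):
    """Mean square error of one sub-classifier, given the probabilities it assigned to the true class."""
    return sum((1.0 - p) ** 2 for p in probs_of_true_class) / len(probs_of_true_class)

def awe_weights(per_classifier_probs, class_distribution, top_k):
    mse_r = sum(p * (1.0 - p) ** 2 for p in class_distribution.values())  # random-classifier baseline
    weights = {name: mse_r - mse_on_chunk(probs)
               for name, probs in per_classifier_probs.items()}
    # keep only the K sub-classifiers that beat the baseline the most
    return dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

ensemble = {
    "c1": [0.9, 0.8, 0.7],   # probabilities each sub-classifier gave to the true class on the newest chunk
    "c2": [0.6, 0.5, 0.4],
    "c3": [0.2, 0.3, 0.1],
}
print(awe_weights(ensemble, {"up": 0.5, "down": 0.5}, top_k=2))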


3.4.7 Accuracy Updated Ensemble

Accuracy Updated Ensemble (AUE) [37] improves AWE by updating, for every new chunk, those sub-classifiers whose weights are larger than a threshold. Moreover, AUE uses a different way to calculate the weights of the sub-classifiers. AWE and AUE require an important parameter, the chunk size. However, the best chunk size has to be evaluated beforehand, so AWE and AUE are not easy to apply to online tasks. Moreover, the best chunk size is perhaps not fixed but changes as time passes. Hence, in this thesis we propose a method that can automatically adjust the chunk size.


CHAPTER 4

ADAPTIVE DRIFT ENSEMBLE

In our experiments, we found that DDM can significantly improve accuracy no matter which algorithm we use to build a single classifier, so we concluded that concept drift exists in the TAIEX Futures data. In the work presented in this chapter, our goal is to investigate the types of concept drift existing in the TAIEX Futures data. We propose an ensemble-based method that contains static sub-classifiers which sufficiently reflect their corresponding concepts. By observing changes in the relative weights of the sub-classifiers, we discover that a concept occurs repeatedly; in other words, reoccurring concept drift exists in the TAIEX Futures data.

Figure 2 presents the flowchart of Adaptive Drift Ensemble (ADE), giving details for the different drift detection levels. The core algorithm of ADE, presented in Figure 3, works as follows. Initially, it resets or initializes the current (newest) sub-classifier and employs DDM.

In training, for each coming instance, ADE updates the DDM detector (handler) and determines the current level, which can be the in-control level, the warning level, or the drift level. The DDM detector is the current sub-classifier (marked by an asterisk in Figure 2) wrapped with DDM, and ADE updates it with the coming instance. The next step depends on the level suggested by the DDM detector. If the DDM detector suggests the in-control level, ADE uses the new data instance to update the current sub-classifier and resets the candidate classifier. If the DDM detector suggests the warning level, ADE updates the candidate classifier. If the DDM detector suggests the drift level, ADE adds the candidate classifier to the ensemble and sets it as the current sub-classifier. Then ADE trains the current sub-classifier with the coming instance and updates the weights of the sub-classifiers in the ensemble by Equations (1) and (2), which are taken from AUE. Finally, if the number of sub-classifiers exceeds the limit, ADE drops the one that performs the worst, and it returns the top k sub-classifiers according to their weights.

$$MSE_i = \frac{1}{|x_i|}\sum_{(x,c)\in x_i}\bigl(1 - f_c^i(x)\bigr)^2, \qquad MSE_r = \sum_c p(c)\bigl(1 - p(c)\bigr)^2 \qquad (1)$$

$$w_i = \frac{1}{MSE_i + \epsilon} \qquad (2)$$

where i is the index of a chunk and of its corresponding sub-classifier, x_i is the i-th chunk, (x, c) is an instance x of the chunk with class label c, f_c^i(x) is the probability that the i-th sub-classifier assigns class c to instance x, p(c) is the distribution of class c, w_i is the weight of the i-th sub-classifier, and ε is a small value close to 0 that prevents division by zero.
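As a worked example of Equations (1) and (2) (illustrative Python with made-up numbers; prob_of_true_class stands for the values f_c^i(x) on one chunk):

def chunk_weight(prob_of_true_class, epsilon=1e-6):
    """Equation (1) for one sub-classifier on one chunk, turned into a weight by Equation (2)."""
    mse_i = sum((1.0 - p) ** 2 for p in prob_of_true_class) / len(prob_of_true_class)
    return 1.0 / (mse_i + epsilon)

def baseline_mse(class_distribution):
    """MSE_r of a random classifier that predicts according to the class distribution."""
    return sum(p * (1.0 - p) ** 2 for p in class_distribution.values())

print(round(chunk_weight([0.9, 0.7, 0.8, 0.6]), 2))   # higher probability on the true class -> larger weight
print(baseline_mse({"up": 0.5, "down": 0.5}))         # reference error of a random classifier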


Figure 2. The flowchart of the handler (DDM detector) of ADE

Figure 3. The core algorithm of ADE

Input:  S: the input data stream
        k: the maximum number of sub-classifiers
Output: E: an ensemble of at most k weighted sub-classifiers

initialize the current sub-classifier C* and the DDM handler H
for each instance x_i in S:
    update H with x_i and switch on H.DDM_Level:
        case in_control_level:  C ← Ø                       // reset the candidate classifier
        case warning_level:     train the candidate C by x_i
        case drift_level:       add C to E; C* ← C; C ← Ø   // the candidate becomes the current sub-classifier
    train C* by x_i
    update the weights of the sub-classifiers in E by Equations (1) and (2)
    if |E| > k: drop the worst-performing sub-classifier
return the top k sub-classifiers of E according to their weights

In order to set the chunk size of each sub-classifier adaptively and vary the number of data instances given to every sub-classifier for training, we use DDM as the base handler to calculate and adjust the chunk size. We put into a chunk the data instances collected before the drift level is triggered, and we use the chunk to train a sub-classifier. We use DDM because, according to our experimental results (in 5.2.2 and 5.2.3), EDDM is less stable than DDM.

Although we use DDM, any concept drift detector could serve as the base handler: EDDM was proposed mainly to handle gradual concept drift, whereas DDM was proposed mainly to handle sudden concept drift [35].

AUE trains a new sub-classifier with the instances of a chunk and then adds it to the ensemble only when the chunk is full. Before a chunk is full, the coming instances do not contribute to training but only to updating the weights of the sub-classifiers. As a result, AUE can only vote for a new coming instance by using old sub-classifiers, so AUE has a delayed reaction to concept drift.

Such a condition is related to latency in learning. According to paper [18], latency causes slow recovery when concept drift occurs. To address this problem, ADE not only updates the weights but also trains the current sub-classifier with every coming instance immediately.

A chunk of instances is used to train a sub-classifier in AWE and AUE, so it is important to determine the chunk size in advance. However, the best chunk size may not constantly be the same. We use a handler in an ensemble to determine the chunk size for two reasons: one is that each sub-classifier can represent the corresponding concept, and the other is that the

