
3.3.7 Accuracy Updated Ensemble

Accuracy Updated Ensemble (AUE) [37] improves AWE by updating, on the arrival of every new chunk, the sub-classifiers whose weights are larger than a threshold.

Moreover, AUE calculates the weights of sub-classifiers in a different way. Both AWE and AUE require an important parameter, the chunk size. The best chunk size has to be evaluated beforehand, so AWE and AUE are not easy to apply to online tasks. Moreover, the best chunk size may not be fixed but may change as time passes. Hence, in this thesis we propose a method that can adjust the chunk size automatically.
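To make the role of the chunk size concrete, the following is a minimal Python sketch of the chunk-based loop that AWE and AUE share; it is not either algorithm itself, and the names build_classifier and update_weights are illustrative placeholders. It only shows why the chunk size must be fixed before processing starts.

def chunk_based_loop(stream, chunk_size, build_classifier, update_weights):
    # stream yields (features, label) pairs; chunk_size must be chosen in advance
    ensemble, chunk = [], []
    for x, y in stream:
        chunk.append((x, y))
        if len(chunk) == chunk_size:            # a sub-classifier is built only when a chunk is full
            ensemble.append(build_classifier(chunk))
            update_weights(ensemble, chunk)     # re-weight all sub-classifiers on the new chunk
            chunk = []                          # start collecting the next chunk
    return ensemble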


CHAPTER 4

ADAPTIVE DRIFT ENSEMBLE

In our experiments, we found that DDM can significantly improve accuracy regardless of which algorithm we used to build a single classifier. We therefore concluded that concept drift exists in the TAIEX Futures data. In the work presented in this chapter, our goal is to investigate the types of concept drift present in the TAIEX Futures data. We propose an ensemble-based method that contains static sub-classifiers, each of which sufficiently reflects its corresponding concept. By observing changes in the relative weights of the sub-classifiers, we discover that concepts occur repeatedly; in other words, reoccurring concept drift exists in the TAIEX Futures data.

Figure 2 presents the flowchart of Adaptive Drift Ensemble (ADE), giving details for the different drift detection levels. The core algorithm of ADE, presented in Figure 3, works as follows. Initially, it initializes (or resets) the newest, i.e., current, sub-classifier and attaches DDM to it.

In training, for each incoming instance, ADE updates the DDM detector (handler), that is, the current sub-classifier (marked by an asterisk in Figure 2) wrapped with DDM, and determines the current level, which can be the in-control level, the warning level, or the drift level. The next step depends on the level reported by the DDM detector. If the detector reports the in-control level, ADE uses the new data instance to update the current sub-classifier and resets the candidate classifier. If the detector reports the warning level, it also updates the candidate classifier. If the detector reports the


drift level, ADE adds the candidate classifier to the ensemble and sets it as the current sub-classifier. ADE then trains the current sub-classifier with the incoming instance and updates the weights of the sub-classifiers in the ensemble by Equations (1) and (2), which come from AUE. Finally, if the number of sub-classifiers exceeds the limit, ADE drops the worst-performing one, and it returns the top k sub-classifiers according to their weights.

$MSE_m = \frac{1}{|x_i|}\sum_{(x,c)\in x_i}\left(1 - f_c^{m}(x)\right)^2 ,\qquad MSE_r = \sum_{c} p(c)\left(1 - p(c)\right)^2$  (1)

$w_m = \frac{1}{MSE_m + \epsilon}$  (2)

where
i: the index of the chunk in the series of chunks
x_i: the i-th chunk
x: an instance in the chunk
c: a class label
f_c^m(x): the probability that sub-classifier m assigns class c to instance x
p(c): the distribution of class c in the chunk
w_m: the weight of sub-classifier (sub-model) m
ϵ: a small constant close to 0, used to avoid division by zero
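As a concrete illustration of Equations (1) and (2), the following Python sketch computes the weight of one sub-classifier on one chunk. The method predict_proba(x, c), which should return f_c^m(x), is an assumed interface, and the use of MSE_r to weight a brand-new candidate classifier follows the usual AUE convention rather than anything stated above.

from collections import Counter

def sub_classifier_weights(sub_classifier, chunk, epsilon=1e-9):
    # chunk is a list of (x, c) pairs; returns (w_m, candidate_weight)
    n = len(chunk)
    # Equation (1): mean squared error of sub-classifier m on the chunk
    mse_m = sum((1.0 - sub_classifier.predict_proba(x, c)) ** 2 for (x, c) in chunk) / n
    # Equation (1): MSE of a classifier that predicts the class distribution p(c)
    counts = Counter(c for (_, c) in chunk)
    mse_r = sum((cnt / n) * (1.0 - cnt / n) ** 2 for cnt in counts.values())
    # Equation (2): weight of sub-classifier m; a new candidate is often weighted via MSE_r
    return 1.0 / (mse_m + epsilon), 1.0 / (mse_r + epsilon)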


Figure 2. The flowchart of the handler (DDM detector) of ADE

Figure 3. The core algorithm of ADE

Input:  S: input data stream
        k: the number of sub-classifiers
Output: ε: ensemble of k sub-classifiers

Initialize();
for each instance xi in S do
    update the DDM handler H with xi;
    switch (H.DDM_Level) {
        case in_control_level: C ← Ф; break;                        // reset the candidate classifier
        case warning_level:    train C by xi; break;                // grow the candidate classifier
        case drift_level:      add C to ε; set C as the current sub-classifier; break;
    }
    train the current sub-classifier by xi;
    update the weights of the sub-classifiers in ε by Equations (1) and (2);
    if |ε| > k then drop the worst-performing sub-classifier;
return the top k sub-classifiers of ε;
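The following Python sketch restates the loop of Figure 3 against an abstract interface; DDM handler, learn_one, predict, and the weight attribute are assumptions made for illustration, not the actual implementation, and the point at which a fresh candidate is created after a drift is our reading of the figure.

IN_CONTROL, WARNING, DRIFT = 0, 1, 2               # assumed level constants

def ade_train(stream, handler, new_classifier, update_weights, k):
    current = new_classifier()                     # the newest (current) sub-classifier
    candidate = new_classifier()
    ensemble = [current]
    for x, y in stream:
        level = handler.update(current.predict(x) == y)   # update the DDM handler
        if level == IN_CONTROL:
            candidate = new_classifier()           # reset the candidate classifier
        elif level == WARNING:
            candidate.learn_one(x, y)              # grow the candidate classifier
        else:                                      # DRIFT
            ensemble.append(candidate)             # the candidate becomes the current sub-classifier
            current = candidate
            candidate = new_classifier()           # assumption: start a fresh candidate
        current.learn_one(x, y)                    # train the current sub-classifier immediately
        update_weights(ensemble, x, y)             # Equations (1) and (2)
        if len(ensemble) > k:                      # drop the worst-performing sub-classifier
            ensemble.remove(min(ensemble, key=lambda m: m.weight))
    return sorted(ensemble, key=lambda m: m.weight, reverse=True)[:k]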

In order to set the chunk size of each sub-classifier adaptively and vary the number of data instances given to each sub-classifier for training, we use DDM as the base handler to calculate and adjust the chunk size. We put into a chunk the data instances collected before the drift level is triggered, and we use that chunk to train a sub-classifier. We use DDM because, according to our experimental results (Sections 5.2.2 and 5.2.3), EDDM is less stable than DDM.

Although we use DDM, any concept drift detector would work as the base handler. EDDM was proposed mainly to handle gradual concept drift, and DDM was proposed mainly to handle sudden concept drift [35]; ADE could be built upon any concept drift detector.
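For completeness, the following is a compact Python sketch of the DDM rule that the handler follows, based on the commonly cited description of DDM [35]; the warm-up of 30 instances matches the setting in Section 5.2.1, while the factors 2 and 3 for the warning and drift levels are the usual defaults, not values taken from this thesis.

import math

IN_CONTROL, WARNING, DRIFT = 0, 1, 2               # same level constants as in the previous sketch

class DDMHandler:
    def __init__(self, warm_up=30, warn_factor=2.0, drift_factor=3.0):
        self.warm_up, self.warn_factor, self.drift_factor = warm_up, warn_factor, drift_factor
        self.reset()

    def reset(self):
        self.n, self.p, self.s = 0, 1.0, 0.0       # n: instances seen, p: error rate, s: its std
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, correct):
        self.n += 1
        error = 0.0 if correct else 1.0
        self.p += (error - self.p) / self.n                        # incremental mean of the error
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.warm_up:                                  # ignore the first instances
            return IN_CONTROL
        if self.p + self.s < self.p_min + self.s_min:              # remember the best point so far
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift_factor * self.s_min:
            self.reset()                                           # drift level: restart the statistics
            return DRIFT
        if self.p + self.s > self.p_min + self.warn_factor * self.s_min:
            return WARNING
        return IN_CONTROL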

AUE trains a new sub-classifier with the instances of a chunk and adds it to the ensemble only when the chunk is full. Before a chunk is full, incoming instances do not contribute to training but only to updating the weights of the sub-classifiers. As a result, AUE can vote on a new incoming instance only with old sub-classifiers, so it reacts to concept drift with a delay. Such a condition is related to learning under label latency. According to [18], latency causes slow recovery when concept drift occurs. To address this problem, ADE not only updates the weights but also trains the current sub-classifier with every incoming instance immediately.

In AWE and AUE a chunk of instances is used to train a sub-classifier, so it is important to determine the chunk size in advance. However, the best chunk size may not stay the same over time. We use a handler in the ensemble to determine the chunk size for two reasons: each sub-classifier can then represent its corresponding concept, and the handler can adjust the chunk size automatically. The chunk size is the number of data instances between two consecutive occurrences of the drift level.

In summary, we improve AUE in three aspects. First, we use a single classifier, based on the newest sub-classifier of the ensemble, together with DDM to determine the chunk size.


Second, we keep old sub-classifiers without updating them. Third, we update the current sub-classifier (the newest one) with every new data instance instead of waiting until the chunk is full (i.e., until the drift level occurs).


CHAPTER 5

EXPERIMENTS

5.1 Setup

This chapter describes the preparation for the experiments and the results of applying data stream mining techniques to the prediction of TAIEX Futures. From the results we observed that concept drift exists in the TAIEX Futures market. The next chapter discusses the concept drifts in more detail.

We describe in detail the experiments whose goal is to evaluate the performance of the algorithms mentioned earlier on the TAIEX Futures data. Figure 4 presents the procedure for experiments.


Figure 4. The procedure for experiments

TAIEX Futures transactions were used as input data. To normalize the structure, we stored the semi-structured transaction records in a database. Before using the data stream mining tool, we selected a subset of features heuristically. We conducted the experiments with the toolkit MOA [2], which is based on WEKA [3]. MOA offers a rich set of functions for data stream mining, including Prequential Evaluation, which tests each instance first and then uses it to train the model.

For the baseline experiment, the TAIEX Futures data, downloaded from the Taiwan Stock Exchange Corporation website, include transactions between January 1st, 2000 and December 31st, 2011, and the size of the time frame is set to 1 day. In order to train a model that can be used in the future, we remove all absolute timestamps and use relative timestamps instead, because absolute timestamps will not occur again in the future.

Labeling also becomes a problem, since we do not have the real settlement values for futures that have not yet settled. Therefore, we label each record with the closest available value. Furthermore, to avoid confusion, we remove instances whose Volume attribute is 0.
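A minimal Python sketch of this pre-processing step, assuming each record is a dictionary; the field names (volume, close, settlement) and the exact labelling rule written here are illustrative placeholders for the procedure described above.

def label_and_filter(records):
    labelled = []
    for r in records:
        if r["volume"] == 0:                         # drop instances with Volume equal to 0
            continue
        # use the closest available value as a stand-in for the real settlement value
        r["klass"] = "positive" if r["close"] > r["settlement"] else "negative"
        labelled.append(r)
    return labelled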

The attributes of data are listed below:

 Related Time: The number of months from the trading date to the due date.

 Open Value: The opening price of a trading day.

 Close Value: The closing price of a trading day.

 Highest and Lowest Value: The highest and lowest price traded in a trading day.

 Volume: The number of transactions in a trading day.

 Settlement Value: The settlement price of a trading day.

 Remains: The number of remaining (open) contracts in a trading day.

 Last Buy: The last price to buy at close time in a trading day.

 Last Sell: The last price to sell at close time in a trading day.

 Trading Time: In the format of Year/Month/Day.

 Class: Whether the closing price of the final settlement day is higher or lower than the closing price of the trading day.

Currently we select attributes heuristically; using feature selection algorithms or statistical approaches will be part of our future work. Furthermore, we remove attributes that are directly related to the class label. For example, we use the Close Value on the settlement day to


tag the class label, so we remove it in the experiments. We also select attributes through experiments. Consequently, after testing some subsets of these attributes, we decide to use only Relative Timestamps, Open Value, Highest Value, Lowest Value, Volume, Remains, and Class.

The distributions of the attributes are shown in Figure 5. Each subfigure in Figure 5 corresponds to one attribute: the attribute name is at the bottom, the x-axis is the set of values of the attribute, and the number at the top of each subfigure is the number of instances with the corresponding attribute value.

The number of instances of each class is similar. We observe that the attributes Volume and Remains have imbalanced distributions with respect to Class (or the target class label).

Figure 5. Distribution of each attribute.

Note: V1 = 1-6409, V2 = 134589-140998, V3 = 275587-281996, R1 = 4-3677, R2 = 36743-40417, and R3 = 77156-80830.


For the other experiments, in order to obtain a finer-grained time frame and cover a longer trading period, we used only Open Value, Highest Value, Lowest Value, Volume and Class. The TAIEX Futures data used in these experiments include transactions recorded in the period from July 22nd, 1998 to October 31st, 2012. We remove all absolute timestamps in order to train classification models that can be used in different time frames. For every data instance, we set the class label to positive when Close Value is larger than the settlement value and to negative when it is smaller. Figure 6 shows the distribution of each attribute with respect to the target class.

Figure 6. Distribution of each attribute

Note: The top row, from left to right, shows Open Value, Highest Value and Lowest Value; the bottom row, from left to right, shows Volume and Class.

MOA accepts many kinds of data formats. After pre-processing, we generated an ARFF file as input. An ARFF file is essentially a CSV file with attribute information attached at its head. Except for the target class and Related Time, all attributes are set to numeric type. For example, our data are shown in Figure 7.

Figure 7. The TAIEX Futures data in ARFF format
Note: For the class attribute, 0 means the rising label and 1 means the falling label.
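Since Figure 7 cannot be reproduced here, the following Python sketch shows how such an ARFF file could be produced; only the relation name 'ver3_day' comes from our data, while the attribute names and the row format are illustrative and reflect the attribute subset used in the later experiments.

def write_arff(path, rows):
    header = "\n".join([
        "@relation 'ver3_day'",
        "@attribute open numeric",
        "@attribute highest numeric",
        "@attribute lowest numeric",
        "@attribute volume numeric",
        "@attribute class {0, 1}",                 # 0: rising label, 1: falling label
        "@data",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        for open_v, high_v, low_v, vol, cls in rows:
            f.write(f"{open_v},{high_v},{low_v},{vol},{cls}\n")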

MOA offers many evaluation methods. We chose Prequential Evaluation, which is appropriate for online tasks. Prequential Evaluation tests each incoming instance with the current model and then uses the instance to train the model. However, Prequential Evaluation may inflate the accuracy in our case, because it tests instances immediately; in the real world, the instances are labeled only after the final settlement day. Hence, Prequential Evaluation ignores the labeling latency, which in our case is at most about 20 instances.
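The test-then-train behaviour of Prequential Evaluation can be sketched in a few lines of Python; the model interface (predict and learn_one) is assumed, and, like the evaluation in MOA, this loop ignores the labeling latency discussed above.

def prequential_accuracy(stream, model):
    correct, seen = 0, 0
    for x, y in stream:
        if model.predict(x) == y:     # test the incoming instance first
            correct += 1
        model.learn_one(x, y)         # then use the same instance to train the model
        seen += 1
    return correct / seen if seen else 0.0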

5.2 Results

5.2.1 Baseline

In this experiment, we classify instances with the default settings. For Hoeffding Tree, the parameter settings include a 95% confidence level, information gain as the split criterion, and a grace period of 200 instances.


Additionally, DDM ignores the first 30 instances.

We mainly use Hoeffding Tree (HT) [31] for evaluation and use Naïve Bayes (NB) as the control group. One reason we consider Naïve Bayes is that it is one of the most popular algorithms in the field of data mining [30] and it can be used in data stream mining. Another major reason for using Naïve Bayes as the control group is that many references, such as [38][39][40][41], use Naïve Bayes in financial data mining. Nevertheless, using other algorithms as the control group will be part of our future work.

We used prequential evaluation, which uses each incoming instance first to test the model and then to train it. When the model classifies an instance correctly, the instance is counted toward the accuracy. The figures were plotted with Microsoft Excel, and each point in a figure represents the average over 173 instances. We present the experimental results in Figure 8 and Figure 9, each of which uses accuracy in percentage as the y-axis and timestamps as the x-axis. Figure 8 shows that the three algorithms achieve an average accuracy higher than 50%. We can also see that Hoeffding Tree and Hoeffding Adaptive Tree do not perform worse than Naïve Bayes.

If stability of the built model is the concern, Hoeffding Tree or Hoeffding Adaptive Tree would be the better choice. The accuracy decreased at the beginning of almost every experiment. We think the reason is that the empty model guesses a fixed class; TAIEX fell during that period, so the model guessed the wrong class.

Figure 8. Algorithms without DDM
Note: The y-axis starts from 30.

From Figure 8, we can make the following observations. Prior to 2002, Hoeffding Tree and Hoeffding Adaptive Tree perform equally well and better than Naïve Bayes, and for each of the three algorithms the accuracy decreases as time passes. Between late 2002 and early 2004, the accuracy of Naïve Bayes increases over time, and Naïve Bayes performs better than the other two algorithms. Between late 2004 and mid-2006, all three algorithms exhibit a trend of increasing accuracy, and Hoeffding Adaptive Tree performs the best among the three (which demonstrates the “adaptive” feature of the algorithm). Between mid-2006 and mid-2008, the accuracy given by Naïve Bayes drops rapidly (it returns to its mid-2006 level only in mid-2009), while the other two algorithms perform similarly and significantly better than Naïve Bayes. Between late 2008 and mid-2009, the accuracy given by Hoeffding Adaptive Tree decreases while the other two algorithms show a trend of increasing accuracy. After mid-2009,


Hoeffding Adaptive Tree shows a trend of increasing accuracy until mid-2010, Hoeffding Tree shows a similar trend, and Naïve Bayes shows a trend of decreasing accuracy. After mid-2010, all three algorithms show decreasing accuracy, but after mid-2011, the accuracy of Naïve Bayes increases as time passes.

As shown in Figure 9, the accuracy of Hoeffding Adaptive Tree with DDM is not as good as that of the others, because Hoeffding Adaptive Tree contains its own estimators. DDM uses a type II estimator [32], while Hoeffding Adaptive Tree uses a type IV estimator. Thus, when we compare Hoeffding Adaptive Tree in Figure 8 with the other algorithms with DDM in Figure 9, we can observe that using a type IV estimator is not always better than using a type II estimator.

Figure 9. Algorithms with DDM
Note: The y-axis starts from 60.

We did another experiment using only the data of the current settlement month instead of three trading months (futures with different settlement months can be traded on the same day).


Figure 10 shows the results of the different algorithms. Hoeffding Tree performed almost as well as Naïve Bayes because Hoeffding Tree uses Naïve Bayes to classify instances at the leaves. Hoeffding Adaptive Tree performed worse than the others, even though it contains an adaptive window for concept drift. On average, Hoeffding Tree gives an accuracy of 58.08, Hoeffding Adaptive Tree gives 56.81, and Naïve Bayes gives 58.21. For standard deviation, Hoeffding Tree gives 5.65, Hoeffding Adaptive Tree gives 7.01, and Naïve Bayes gives 5.84.

Figure 10. Baseline
Note: Only current settlement month data; the y-axis starts from 40.

Figure 11 shows the results of using DDM on the same data set as above. The algorithms with DDM performed better than the algorithms without DDM. Hoeffding Tree with DDM performed worse than the others. On average, Hoeffding Tree with DDM gives an accuracy of 71.39, Hoeffding Adaptive Tree with DDM gives 73.53, and Naïve Bayes with DDM gives 72.24. For standard deviation, Hoeffding Tree with DDM gives 3.87, Hoeffding Adaptive Tree gives 1.76, and


Naïve Bayes gives 1.75.

Figure 11. Algorithms with DDM
Note: Only current settlement month data; the y-axis starts from 60.

5.2.2 Drift Detection Method

We used a different data set from the baseline experiment. For a more detailed observation of the impact of DDM, we used only the data of the current settlement month and removed the Remains attribute, which was used in the baseline experiment. Figure 12 shows the results of DDM with different classification algorithms. Although we removed the Remains attribute, there is no significant difference in the average accuracy of DDM between the two attribute subsets and the two time coverages. On average, Hoeffding Tree with DDM gives an accuracy of 71.39%, Hoeffding Adaptive Tree with DDM gives 73.53%, and Naïve Bayes with DDM gives 72.24%. For the standard deviation of accuracy, Hoeffding Tree with DDM gives 6.09, Hoeffding Adaptive Tree with DDM gives 5.66, and Naïve Bayes with DDM gives 7.34.


Figure 12. Algorithms with DDM

Note: Without the Remains attribute, 1998/7-2012/10; the y-axis starts from 60.

5.2.3 Early Drift Detection Method

We also ran experiments with the Early Drift Detection Method (EDDM) provided by MOA. The results are shown in Figure 13. We found that the trends of the DDM and EDDM results are similar, but EDDM caused less fluctuation in accuracy; in other words, EDDM performed more stably. On average, Hoeffding Tree with EDDM gives an accuracy of 72.45, Hoeffding Adaptive Tree with EDDM gives 72.89, and Naïve Bayes with EDDM gives 71.63. For standard deviation, Hoeffding Tree with EDDM gives 4.54, Hoeffding Adaptive Tree with EDDM gives 4.31, and Naïve Bayes with EDDM gives 5.00.


Figure 13. Algorithms with EDDM

Note: Without the Remains attribute, 1998/7-2012/10; the y-axis starts from 60.

5.2.4 Accuracy Weighted Ensemble

AWE is an evolving ensemble method, and we used it in this experiment. Hoeffding Tree and Hoeffding Adaptive Tree performed slightly better than Naïve Bayes when combined with AWE. Figure 14 shows the results of AWE. On average, Hoeffding Tree with AWE gives an accuracy of 58.61, Hoeffding Adaptive Tree with AWE gives 59.94, and Naïve Bayes with AWE gives 57.31. For standard deviation, Hoeffding Tree with AWE gives 6.96, Hoeffding Adaptive Tree with AWE gives 6.6, and Naïve Bayes with AWE gives 6.66.


Figure 14. Algorithms with AWE

Note: Without the Remains attribute, 1998/7-2012/10; the y-axis starts from 45.

5.2.5 Accuracy Updated Ensemble

AUE is an improvement of AWE that updates old sub-classifiers with newly arriving instances. However, AUE performed slightly worse than AWE. Figure 15 shows the results of AUE. On average, Hoeffding Tree with AUE gives an accuracy of 57.87, Hoeffding Adaptive Tree with AUE gives 59.28, and Naïve Bayes with AUE gives 56.80. For standard deviation, Hoeffding Tree with AUE gives 6.99, Hoeffding Adaptive Tree with AUE gives 7.06, and Naïve Bayes with AUE gives 6.67.
