
2.5 Strategies for Ensemble Generation

As shown in Section 2.4, the ensemble classifier with high diversity can greatly outperform every individual classifier. In this section, we analyze two strategies that further raise its performance: using the accuracy-based weighting in Section 2.5.1 and augmenting the ensemble size in Section 2.5.2.


Figure 2.5: Error regions associated with approximating the a posteriori probabilities [56].


2.5.1 Accuracy-Based Weight Setting

While the bagging method assigns an equal weight to each classifier, we found that the performance of DEDS can be further improved if the weights of the classifiers are carefully assigned. With the following proof, we show that the weights of the classifiers should be assigned in proportion to their prediction accuracy.

Theorem 2.5.1. The DEDS framework adopting accuracy-based weighting introduces a smaller prediction error than the framework adopting equal weighting.

Proof. Since the output of a reasonably well-trained classifier is expected to approximate the corresponding a posteriori class distribution, the obtained decision boundary is expected to be close to the Bayesian decision boundary. In a two-class classification problem such as link prediction, the Bayesian optimum decision assigns an instance $x$ to the class $i$ if $p(c_i|x) > p(c_j|x)$, where $p(c|x)$ is the a posteriori probability distribution of class $c$ given the input $x$. In other words, the Bayes optimum boundary is at the point $x^*$ such that $p(c_i|x^*) = p(c_j|x^*)$. However, the trained classifier is not perfect: it outputs $f_c(x) = p(c|x) + \eta_c(x)$ instead of $p(c|x)$, where $\eta_c(x)$ is the variance of the classifier given an input $x$.⁶ Therefore, the obtained boundary may drift from $x^*$ to $x_b$, as illustrated in Figure 2.5, consequently causing prediction errors in the darkly shaded region.⁷

⁶ According to the bias-variance decomposition [48], the added error for the output of a classifier includes the bias of the learning algorithm and the variance. Here we focus on the variance only, since our primary goal is to find a better weight setting that can reduce the prediction error introduced by the variance.

⁷ The lightly shaded region is the inherent error of the Bayes optimum decision.

This expected prediction error $Err$ can be expressed as

\[
Err = \int_{-\infty}^{\infty} A(b) f_b(b)\,db, \tag{2.1}
\]

where $b = x_b - x^*$, $A(b)$ is the area of the darkly shaded region, and $f_b$ is the density function for $b$.

Tumer and Ghosh [56] prove that (2.1) can be calculated by

\[
Err = \frac{\sigma_{\eta_c}^2}{s}, \tag{2.2}
\]

where $\sigma_{\eta_c}^2$ is the variance of $\eta_c(x)$, and $s$ is the difference between the derivatives of $p(c_j|x)$ and $p(c_i|x)$.

Assume that DEDS includes $k$ individual classifiers $D_1, D_2, \cdots, D_k$, and the output of each classifier $D_i$ is

\[
f_c^i(x) = p(c|x) + \eta_c^i(x).
\]

When DEDS combines these classifiers with weights $w_i$, $i = 1, \ldots, k$, normalized so that $\sum_{i=1}^{k} w_i = 1$, to build an ensemble $E$, the output of this ensemble classifier is

\[
f_c^E(x) = \sum_{i=1}^{k} w_i f_c^i(x) = p(c|x) + \sum_{i=1}^{k} w_i \eta_c^i(x),
\]

where the last term is the variance of the ensemble classifier, i.e., $\eta_c^E(x)$.

We can further assume that the variances of individual classifiers $D_i$, $i = 1, \ldots, k$ are independent, and as derived by Wang et al. [58], the variance of $\eta_c^E(x)$ is

\[
\sigma_{\eta_c^E}^2 = \sum_{i=1}^{k} w_i^2 \sigma_{\eta_c^i}^2. \tag{2.3}
\]

If DEDS adopts equal weighting (i.e., $w_i = 1/k$), by using (2.2) and (2.3), the expected prediction error of DEDS becomes

\[
Err_{equal} = \frac{\sigma_{\eta_c^E}^2}{s} = \frac{1}{k^2 s} \sum_{i=1}^{k} \sigma_{\eta_c^i}^2. \tag{2.4}
\]

In contrast, if DEDS adopts accuracy-based weighting and sets the weights of classifiers to be inversely proportional to their error rates (i.e., $w_i = \alpha / \sigma_{\eta_c^i}^2$, where $\alpha = 1 / \sum_{j=1}^{k} (1/\sigma_{\eta_c^j}^2)$ is the normalizing constant), the expected prediction error of DEDS then becomes

\[
Err_{acc\_based} = \frac{\sigma_{\eta_c^E}^2}{s} = \frac{1}{s \sum_{i=1}^{k} 1/\sigma_{\eta_c^i}^2}. \tag{2.5}
\]

With Cauchy’s Inequality, we can further derive that

\[
\Bigl( \sum_{i=1}^{k} \sigma_{\eta_c^i}^2 \Bigr) \Bigl( \sum_{i=1}^{k} \frac{1}{\sigma_{\eta_c^i}^2} \Bigr) \ge k^2,
\quad \text{i.e.,} \quad
\frac{1}{\sum_{i=1}^{k} 1/\sigma_{\eta_c^i}^2} \le \frac{1}{k^2} \sum_{i=1}^{k} \sigma_{\eta_c^i}^2.
\]

By multiplying $1/s$ on both sides of the above inequality, together with (2.4) and (2.5), we can show that

\[
Err_{acc\_based} \le Err_{equal},
\]

which completes the proof.
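As a concrete illustration of (2.4) and (2.5), the following minimal sketch (not part of the original analysis) compares the two weighting schemes numerically; the per-classifier noise variances and $s = 1$ are arbitrary illustrative choices.

```python
import numpy as np

# Arbitrary illustrative per-classifier noise variances sigma^2_{eta_c^i}.
sigma2 = np.array([0.04, 0.10, 0.25, 0.50])
k, s = len(sigma2), 1.0  # s assumed to be 1 for illustration

# Equal weighting, w_i = 1/k:  Err_equal = sum(sigma2) / (k^2 * s)      -- (2.4)
err_equal = sigma2.sum() / (k ** 2 * s)

# Accuracy-based weighting, w_i = alpha / sigma2_i, with alpha chosen so
# that the weights sum to 1:   Err_acc = 1 / (s * sum(1/sigma2))        -- (2.5)
w = (1.0 / sigma2) / np.sum(1.0 / sigma2)
err_acc = np.sum(w ** 2 * sigma2) / s    # equals 1 / (s * sum(1/sigma2))

print(f"Err_equal     = {err_equal:.4f}")   # 0.0556
print(f"Err_acc_based = {err_acc:.4f}")     # 0.0244
assert err_acc <= err_equal                 # guaranteed by Cauchy's Inequality
```

The gap between the two errors widens as the individual variances become more uneven, which is exactly when accuracy-based weighting pays off.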

The experimental results shown in Figures 2.4(a) and 2.4(b) further support this conclusion. In Figure 2.4(a), while the ensemble classifier with equal weighting (i.e., ensem_EW_WOD) already outperforms the four individual classifiers, the ensemble classifier with accuracy-based weighting (i.e., ensem_AW_WOD) brings an extra 5% increase in AUC and also a 10% decrease in MSE. That is, the ensemble classifier with accuracy-based weighting tends to have the highest true positive rate as well as the lowest prediction error. The results in Figure 2.4(b) exhibit analogous improvements.

Therefore, we adopt accuracy-based weighting when evaluating the DEDS framework in the rest of this chapter.

2.5.2 Ensemble Size Augmentation

Rokach [42] reviews various ensemble techniques and categorizes them into two groups: dependent frameworks and independent frameworks. In a dependent framework, the output of a classifier is used as the input to construct the next classifier. In contrast, the classifiers in an independent framework are built independently, and then their outputs are combined to generate the final decision. One solid reason to design DEDS as an independent framework is exactly that every individual classifier can be trained and used to generate predictions independently. This enables DEDS to fully utilize all the CPUs or cores to simultaneously train and run the maximum number of individual classifiers, as sketched below. Since these classifiers are trained and run in parallel, the total training and prediction time will not grow much, while the prediction accuracy can be raised substantially.
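As a minimal sketch of this independent design, assuming a hypothetical train_classifier routine and a pre-built list of sparsified networks (the names are illustrative, not the thesis's actual implementation):

```python
from concurrent.futures import ProcessPoolExecutor

def train_classifier(network):
    # Hypothetical placeholder for the actual training routine: fit one
    # link-prediction classifier on one sparsified network and return it.
    return {"trained_on_nodes": len(network)}

def build_ensemble(sparsified_networks):
    # The classifiers are mutually independent, so each is trained in its
    # own worker process (one per CPU core by default); total training time
    # therefore stays close to that of a single classifier.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(train_classifier, sparsified_networks))
```

The same pattern applies at prediction time: each classifier scores the test instances in parallel, and the outputs are then combined with the accuracy-based weights of Section 2.5.1.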

To further increase the ensemble size, we need to generate more than one sparsified network from each sparsification method. For a sparsification method with an inherent random flavor (i.e., random sparsification or random-walk-based sparsification), executing it multiple times generates slightly varied copies of sparsified networks. The short-path-based sparsification method can produce slightly different copies of sparsified networks by modifying the length threshold $L$. For example, if $L = 5$ at the beginning, using $L = 5 \pm 1$ and $L = 5 \pm 2$ produces four additional sparsified networks. As for the degree-based sparsification, we can moderately disturb the original sparsified network by replacing some existing edges in the sparsified network with edges outside the sparsified network. In order to maintain the core design behind the degree-based sparsification (i.e., preserving the edges with a high summation of the degrees at their two ends), the probability of the replacement is set to be proportional to $SD_i/SD_o$, where $SD_i$ and $SD_o$ are the summations of degrees for the edges in the sparsified network and the edges outside the sparsified network, respectively. A sketch of this perturbation appears below.
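The following sketch illustrates one possible implementation of this perturbation; the function name and the exact acceptance rule are our assumptions. In particular, we read the $SD_i/SD_o$ rule as accepting a swap with probability proportional to the degree sum of the incoming edge relative to the outgoing one, so that edges with high degree sums tend to be preserved, in line with the stated design goal; other readings are possible.

```python
import random

def perturb_degree_based(G, sparsified_edges, n_swaps, seed=None):
    """Replace a few edges of a degree-based sparsified network with edges
    from outside it, producing a slightly varied copy (illustrative sketch).
    G is assumed to be a networkx-style graph of the original network."""
    rng = random.Random(seed)
    inside = set(sparsified_edges)
    outside = [e for e in G.edges()
               if e not in inside and (e[1], e[0]) not in inside]
    degree_sum = lambda e: G.degree(e[0]) + G.degree(e[1])
    for _ in range(n_swaps):
        e_out = rng.choice(sorted(inside))   # candidate to leave the network
        e_in = rng.choice(outside)           # candidate to enter the network
        # Accept with probability given by the (capped) ratio of degree sums,
        # favoring incoming edges whose endpoints have high degrees; for
        # simplicity, the candidate pools are not updated between swaps.
        if rng.random() < min(1.0, degree_sum(e_in) / degree_sum(e_out)):
            inside.discard(e_out)
            inside.add(e_in)
    return inside
```

Calling perturb_degree_based several times with different seeds yields several slightly different sparsified networks, each of which can train one additional ensemble member.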

Figure 2.6: Comparison of AUC for different ensemble sizes. The dashed line indicates the AUC of the classifier trained from the original network. (condmat, sparsification ratio = 15%)

Figure 2.6 shows the performance of ensemble classifiers with different ensemble sizes. As expected, the AUC increases as the ensemble size becomes larger. When the ensemble size becomes sufficiently large, the ensemble classifier may even slightly outperform the classifier trained from the original network. However, the AUC of the ensemble classifier increases slowly when the ensemble size exceeds 13, and the AUC eventually remains at approximately 0.85. The reason for this is that, when there are already many existing classifiers, a newly joined classifier tends to possess much of the same knowledge (i.e., edges) that the existing classifiers already have.