
2.5 Strategies for Ensemble Generation

As shown in Section 2.4, the ensemble classifier with high diversity can greatly outperform every individual classifier. In this section, we analyze two strategies that further raise its performance: using the accuracy-based weighting in Section 2.5.1 and augmenting the ensemble size in Section 2.5.2.


Figure 2.5: Error regions associated with approximating the a posteriori probabilities [56].


2.5.1 Accuracy-Based Weight Setting

While the bagging method assigns an equal weight to each classifier, we found that the performance of DEDS can be further improved if the weights of the classifiers are carefully assigned. With the following proof, we show that the weights of the classifiers should be assigned in proportion to their prediction accuracy.

Theorem 2.5.1. The DEDS framework adopting accuracy-based weighting introduces a smaller prediction error than the framework adopting equal weighting.

Proof. Since the output of a reasonably well-trained classifier is expected to approximate the corresponding a posteriori class distribution, the obtained decision boundary is expected to be close to the Bayesian decision boundary. In a two-class classification problem such as link prediction, the Bayesian optimum decision assigns an instance $x$ to the class $i$ if $p(c_i|x) > p(c_j|x)$, where $p(c|x)$ is the a posteriori probability distribution of class $c$ given the input $x$. In other words, the Bayes optimum boundary is at the point $x^*$ such that $p(c_i|x^*) = p(c_j|x^*)$. However, the trained classifier is not perfect: it outputs $f_c(x) = p(c|x) + \eta_c(x)$ instead of $p(c|x)$, where $\eta_c(x)$ is the variance of the classifier given an input $x$.⁶ Therefore, the obtained boundary may drift from $x^*$ to $x_b$, as illustrated in Figure 2.5, consequently causing prediction errors in the darkly shaded region.⁷

⁶ According to the bias-variance decomposition [48], the added error for the output of a classifier includes the bias of the learning algorithm and the variance. Here we focus on the variance only, since our primary goal is to find a better weight setting that can reduce the prediction error introduced by the variance.

⁷ The lightly shaded region is the inherent error of the Bayes optimum decision.

This expected prediction error $Err$ can be expressed as

\[
Err = \int_{-\infty}^{\infty} A(b) f_b(b)\,db, \tag{2.1}
\]

where $b = x_b - x^*$, $A(b)$ is the area of the darkly shaded region, and $f_b$ is the density function for $b$.

Tumer and Ghosh [56] prove that (2.1) can be calculated by

\[
Err = \frac{\sigma_{\eta_c}^2}{s}, \tag{2.2}
\]

where $\sigma_{\eta_c}^2$ is the variance of $\eta_c(x)$, and $s$ is the difference between the derivatives of $p(c_j|x)$ and $p(c_i|x)$.

Assume that DEDS includes $k$ individual classifiers $D_1, D_2, \cdots, D_k$, and the output of each classifier $D_i$ is

\[
f_c^i(x) = p(c|x) + \eta_c^i(x).
\]

When DEDS combines these classifiers with weights $w_i$, $i = 1, \ldots, k$, normalized so that $\sum_{i=1}^{k} w_i = 1$, to build an ensemble $E$, the output of this ensemble classifier is

\[
f_c^E(x) = \sum_{i=1}^{k} w_i f_c^i(x) = p(c|x) + \sum_{i=1}^{k} w_i \eta_c^i(x),
\]

where the last term is the variance of the ensemble classifier, i.e., $\eta_c^E(x)$.

We can further assume that the variances of individual classifiers $D_i$, $i = 1, \ldots, k$ are independent, and as derived by Wang et al. [58], the variance of $\eta_c^E(x)$ is

\[
\sigma_{\eta_c^E}^2 = \sum_{i=1}^{k} w_i^2 \sigma_{\eta_c^i}^2. \tag{2.3}
\]

If DEDS adopts equal weighting (i.e., $w_i = 1/k$), by using (2.2) and (2.3), the expected prediction error of DEDS becomes

\[
Err_{equal} = \frac{\sigma_{\eta_c^E}^2}{s} = \frac{1}{k^2 s} \sum_{i=1}^{k} \sigma_{\eta_c^i}^2. \tag{2.4}
\]

In contrast, if DEDS adopts accuracy-based weighting and sets the weights of classifiers to be inversely proportional to their error rates (i.e., $w_i = \alpha / \sigma_{\eta_c^i}^2$, where $\alpha = 1 / \sum_{j=1}^{k} (1/\sigma_{\eta_c^j}^2)$ is the normalizing constant), the expected prediction error of DEDS then becomes

\[
Err_{acc\_based} = \frac{\sigma_{\eta_c^E}^2}{s} = \frac{1}{s \sum_{i=1}^{k} 1/\sigma_{\eta_c^i}^2}. \tag{2.5}
\]

With Cauchy’s Inequality, we can further derive that

\[
\Bigl( \sum_{i=1}^{k} \sigma_{\eta_c^i}^2 \Bigr) \Bigl( \sum_{i=1}^{k} \frac{1}{\sigma_{\eta_c^i}^2} \Bigr) \ge k^2,
\quad \text{i.e.,} \quad
\frac{1}{\sum_{i=1}^{k} 1/\sigma_{\eta_c^i}^2} \le \frac{1}{k^2} \sum_{i=1}^{k} \sigma_{\eta_c^i}^2.
\]

By multiplying $1/s$ on both sides of the above inequality, together with (2.4) and (2.5), we can show that

\[
Err_{acc\_based} \le Err_{equal},
\]

which completes the proof.
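As a concrete illustration of (2.4) and (2.5), the following minimal sketch (not part of the original analysis) compares the two weighting schemes numerically; the per-classifier noise variances and $s = 1$ are arbitrary illustrative choices.

```python
import numpy as np

# Arbitrary illustrative per-classifier noise variances sigma^2_{eta_c^i}.
sigma2 = np.array([0.04, 0.10, 0.25, 0.50])
k, s = len(sigma2), 1.0  # s assumed to be 1 for illustration

# Equal weighting, w_i = 1/k:  Err_equal = sum(sigma2) / (k^2 * s)      -- (2.4)
err_equal = sigma2.sum() / (k ** 2 * s)

# Accuracy-based weighting, w_i = alpha / sigma2_i, with alpha chosen so
# that the weights sum to 1:   Err_acc = 1 / (s * sum(1/sigma2))        -- (2.5)
w = (1.0 / sigma2) / np.sum(1.0 / sigma2)
err_acc = np.sum(w ** 2 * sigma2) / s    # equals 1 / (s * sum(1/sigma2))

print(f"Err_equal     = {err_equal:.4f}")   # 0.0556
print(f"Err_acc_based = {err_acc:.4f}")     # 0.0244
assert err_acc <= err_equal                 # guaranteed by Cauchy's Inequality
```

The gap between the two errors widens as the individual variances become more uneven, which is exactly when accuracy-based weighting pays off.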

The experimental results shown in Figures 2.4(a) and 2.4(b) further support this conclusion. In Figure 2.4(a), while the ensemble classifier with equal weighting (i.e., ensem_EW_WOD) already outperforms the four individual classifiers, the ensemble classifier with accuracy-based weighting (i.e., ensem_AW_WOD) brings an extra 5% increase in AUC and also a 10% decrease in MSE. That is, the ensemble classifier with accuracy-based weighting tends to have the highest true positive rate as well as the lowest prediction error. The results in Figure 2.4(b) exhibit analogous improvements.

Therefore, we adopt accuracy-based weighting when evaluating the DEDS framework in the rest of this chapter.

2.5.2 Ensemble Size Augmentation

Rokach [42] reviews various ensemble techniques and categorizes them into two groups: dependent frameworks and independent frameworks. In a dependent framework, the output of a classifier is used as the input to construct the next classifier. In contrast, the classifiers in an independent framework are built independently, and then their outputs are combined to generate the final decision. One solid reason to design DEDS as an independent framework is exactly that every individual classifier can be trained and used to generate predictions independently. This enables DEDS to fully utilize all the CPUs or cores to simultaneously train and run the maximum number of individual classifiers, as sketched below. Since these classifiers are trained and run in parallel, the total training and prediction time will not grow much, while the prediction accuracy can be raised substantially.
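As a minimal sketch of this independent design, assuming a hypothetical train_classifier routine and a pre-built list of sparsified networks (the names are illustrative, not the thesis's actual implementation):

```python
from concurrent.futures import ProcessPoolExecutor

def train_classifier(network):
    # Hypothetical placeholder for the actual training routine: fit one
    # link-prediction classifier on one sparsified network and return it.
    return {"trained_on_nodes": len(network)}

def build_ensemble(sparsified_networks):
    # The classifiers are mutually independent, so each is trained in its
    # own worker process (one per CPU core by default); total training time
    # therefore stays close to that of a single classifier.
    with ProcessPoolExecutor() as pool:
        return list(pool.map(train_classifier, sparsified_networks))
```

The same pattern applies at prediction time: each classifier scores the test instances in parallel, and the outputs are then combined with the accuracy-based weights of Section 2.5.1.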

To further increase the ensemble size, we need to generate more than one sparsified network from each sparsification method. For a sparsification method with an inherent random flavor (i.e., random sparsification or random-walk-based sparsification), executing it multiple times generates slightly varied copies of sparsified networks. The short-path-based sparsification method can produce slightly different copies of sparsified networks by modifying the length threshold $L$. For example, if $L = 5$ at the beginning, using $L = 5 \pm 1$ and $L = 5 \pm 2$ produces four additional sparsified networks. As for the degree-based sparsification, we can moderately disturb the original sparsified network by replacing some existing edges in the sparsified network with edges outside the sparsified network. In order to maintain the core design behind the degree-based sparsification (i.e., preserving the edges with a high summation of the degrees at their two ends), the probability of the replacement is set to be proportional to $SD_i/SD_o$, where $SD_i$ and $SD_o$ are the summations of degrees for the edges in the sparsified network and the edges outside the sparsified network, respectively. A sketch of this perturbation appears below.
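The following sketch illustrates one possible implementation of this perturbation; the function name and the exact acceptance rule are our assumptions. In particular, we read the $SD_i/SD_o$ rule as accepting a swap with probability proportional to the degree sum of the incoming edge relative to the outgoing one, so that edges with high degree sums tend to be preserved, in line with the stated design goal; other readings are possible.

```python
import random

def perturb_degree_based(G, sparsified_edges, n_swaps, seed=None):
    """Replace a few edges of a degree-based sparsified network with edges
    from outside it, producing a slightly varied copy (illustrative sketch).
    G is assumed to be a networkx-style graph of the original network."""
    rng = random.Random(seed)
    inside = set(sparsified_edges)
    outside = [e for e in G.edges()
               if e not in inside and (e[1], e[0]) not in inside]
    degree_sum = lambda e: G.degree(e[0]) + G.degree(e[1])
    for _ in range(n_swaps):
        e_out = rng.choice(sorted(inside))   # candidate to leave the network
        e_in = rng.choice(outside)           # candidate to enter the network
        # Accept with probability given by the (capped) ratio of degree sums,
        # favoring incoming edges whose endpoints have high degrees; for
        # simplicity, the candidate pools are not updated between swaps.
        if rng.random() < min(1.0, degree_sum(e_in) / degree_sum(e_out)):
            inside.discard(e_out)
            inside.add(e_in)
    return inside
```

Calling perturb_degree_based several times with different seeds yields several slightly different sparsified networks, each of which can train one additional ensemble member.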

Figure 2.6: Comparison of AUC for different ensemble sizes. The dashed line indicates the AUC of the classifier trained from the original network. (condmat, sparsification ratio = 15%)

Figure 2.6 shows the performance of ensemble classifiers with different ensemble sizes. As expected, the AUC increases as the ensemble size becomes larger. When the ensemble size becomes sufficiently large, the ensemble classifier may even slightly outperform the classifier trained from the original network. However, the AUC of the ensemble classifier increases slowly when the ensemble size exceeds 13, and the AUC eventually remains at approximately 0.85. The reason for this is that, when there are already many existing classifiers, a newly joined classifier tends to possess much of the same knowledge (i.e., edges) that the existing classifiers already have.