along with ensembles composed of decision trees and artificial neural networks for active learning [15]. Hsu and Srivastava demonstrate that ensembles composed of decision trees and artificial neural networks together could outperform ensembles composed of either of them alone [16]. However, researchers use decision trees and artificial neural networks together in an ensemble without further discussion. If we additionally consider hybrid ensembles composed of other types of classification algorithms, we can find various applications of hybrid ensembles, such as credit scoring [17], risk assessment [18], medical databases [19], activity recognition [20], and energy consumption forecasting [21]. The insights gained from this paper can have a potential impact on these applications.
The rest of this paper is organized as follows: Section II gives the analysis based on bias-variance decomposition. Section III describes experimental settings and reports experimental results. Section IV presents the discussion. Section V concludes this paper and suggests possible extensions of this paper.
978-1-4673-0890-8/12/$31.00 ©2012 IEEE 25 COMNETSAT 2012
II. ANALYSIS
Bias and variance are fundamental concepts in ensemble learning. The Bayes rate is the lowest error rate that a classification algorithm could possibly achieve for a given data set [22]. It indicates the difficulty of performing classification tasks on the data set. For a given data set, bias is the difference between the Bayes rate and the error rate given by a classifier based on a classification algorithm [22]. The lower its bias, the more accurate a classifier is. Variance is the expected difference among the error rates given by a set of classifiers based on a classification algorithm [22]. These classifiers are generally trained with training sets sampled from the same data set. Variance implies sensitivity of a classifier to data records that are different from those in the training set on which it is built. Variance indicates errors from the unpredictability of applying a classification algorithm to randomly generated subsets of a given data set.
In general, bias decreases but variance increases as the complexity of classifiers constructed by using a classification algorithm on a data set increases. Because we intend to form a group of member classifiers in an ensemble that are able to compensate for one another's weaknesses, there is a trade-off between bias and variance when we study ensemble construction approaches.
Bagging is basically a variance reduction approach. For a classifier, variance reduction means reducing the degree to which it is specific to a data set or sensitive to the differences among training samples. Bagging averages classifications from independently created member classifiers, and simply by doing this bagging decreases the variability of using an ensemble composed of these member classifiers to generate classifications. Because it mainly reduces variance and usually keeps bias unchanged, bagging prefers an unstable yet sufficiently accurate classification algorithm (e.g. the decision tree algorithm) rather than a stable one for building member classifiers [1].
We analyze the hybrid ensemble of decision trees and artificial neural networks from the point of view of the unified bias-variance decomposition framework for 0/1-loss [22]. We use the notation similar to that used by Domingos [22].
After applying the bootstrap procedure to the given data set 𝐷, we obtain a set of training sets, 𝕊 (i.e. ∀𝑆 ∈ 𝕊 : 𝑆 ≠ ∅, where 𝑆 is obtained by removing the subscript of 𝑆𝑖 in Algorithm 1 described later). We also sample an arbitrary test data record 𝑥. 𝐿 is the 0/1-loss function: 𝐿 returns zero if its two arguments are equal, and it returns one otherwise.
Assume we have a set of classifiers, each of which is built in the same way on a training set in 𝕊. For a test data record 𝑥, 𝑦 is the actual class label; 𝑦∗ is the optimal classification, which is the actual class label if the Bayes error rate of the given data set is zero; the main classification is denoted as 𝑦𝑚 and defined as the class label given by the majority of classifiers; and 𝑡 is a classified label (a binary class label in this paper). When the set of classifiers is used to classify 𝑥, 𝐵(𝑥) = 𝐿(𝑦∗, 𝑦𝑚) = 𝐵 and 𝑉(𝑥) = 𝔼𝕊[𝐿(𝑦𝑚, 𝑦)] = 𝑉 are the values of bias and variance, respectively.
When a classifier in the set of classifiers is used to classify 𝑥, 𝑁(𝑥) = 𝔼𝑡[𝐿(𝑡, 𝑦∗)] = 𝑁 indicates noise contained in 𝑥. The expected classification error of a classifier on a test data record 𝑥 is 𝔼𝕊,𝑡[𝐿(𝑡, 𝑦)]. Now we consider the following cases.
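The quantities defined above can be computed for a single test data record as in the following minimal Python sketch. It assumes that the second argument of 𝑉(𝑥) is read as the label produced by each classifier, as in the framework of Domingos [22]; the function names, the argument names, and the sample values are illustrative assumptions and are not taken from the experiments.

```python
from collections import Counter


def zero_one_loss(a, b):
    """0/1-loss L: zero if the two labels are equal, one otherwise."""
    return 0 if a == b else 1


def main_classification(predictions):
    """y_m: the class label given by the majority of classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def decompose(predictions, optimal, observed_labels):
    """Return (y_m, B, V, N) for one test data record.

    predictions     -- labels given by the classifiers built on each S in the set of training sets
    optimal         -- the optimal classification y*
    observed_labels -- a sample of class labels observed for the record (assumed input)
    """
    y_m = main_classification(predictions)
    bias = zero_one_loss(optimal, y_m)
    variance = sum(zero_one_loss(y_m, p) for p in predictions) / len(predictions)
    noise = sum(zero_one_loss(t, optimal) for t in observed_labels) / len(observed_labels)
    return y_m, bias, variance, noise


# Illustrative values only: ten classifiers, seven of which agree on label 1.
preds = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
y_m, B, V, N = decompose(preds, optimal=1, observed_labels=[1])
```

With these sample values the majority label coincides with the optimal classification, so the bias is zero and the variance is the fraction of classifiers that disagree with the majority.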
Case 1: 𝑃𝕊(𝑦 = 𝑦∗) > 0.5 and 𝑦 = 𝑦∗ = 𝑦𝑚. 𝑃𝕊 is the probability with respect to the set of training sets 𝕊. The probability that the actual class label is equal to the optimal classification is larger than 0.5. This implies that 𝐵 = 0.
The class label given by the majority of classifiers is equal to the actual class label, which implies that an ensemble using majority vote makes a correct classification. We decompose the expected classification error by referring to the framework proposed by Domingos [22]:
𝔼𝕊,𝑡[𝐿(𝑡, 𝑦)]
= [2 ⋅ 𝑃𝕊(𝑦 = 𝑦∗) − 1] ⋅ 𝑁 + 𝐵 + 𝑉
= (2 ⋅ 𝑃 − 1) ⋅ 𝑁 + 𝑉 (1)
Here 𝑃 abbreviates 𝑃𝕊(𝑦 = 𝑦∗). When 𝑃 > 0.5, 2 ⋅ 𝑃 − 1 > 0. If the data set is noise free and 𝑁 = 0, the expected classification error depends entirely on 𝑉. The lower the variance, the lower the error rate, so any variance reduction approach can decrease the error rate. If noise appears in the data set and 𝑁 > 0, the negative effect of noise will be amplified, and the only way to decrease the error rate is to have an even better variance reduction approach. An ensemble is a set of classifiers, but it works just like a classifier. Hence, we can derive from Equation 1 that, when we use a classification algorithm with low bias to create more accurate member classifiers, we can have an ensemble achieving a low error rate or high accuracy (as the member classifiers work together) if the classification algorithm that we use has low variance or if the member classifiers built using the classification algorithm are not very different from each other.
Case 2: 𝑃𝕊(𝑦 = 𝑦∗) < 0.5 and 𝑦 = 𝑦∗ ≠ 𝑦𝑚. The probability that the actual class label is equal to the optimal classification is smaller than 0.5. This implies that 𝐵 = 1. The class label given by the majority of classifiers is not equal to the actual class label, which implies that an ensemble using majority vote makes an incorrect classification. We decompose the expected classification error by referring to the framework proposed by Domingos [22]:
𝔼𝕊,𝑡[𝐿(𝑡, 𝑦)]
= [2 ⋅ 𝑃𝕊(𝑦 = 𝑦∗) − 1] ⋅ 𝑁 + 𝐵 − 𝑉
= 1 − [(1 − 2 ⋅ 𝑃) ⋅ 𝑁 + 𝑉] (2)
When 𝑃 < 0.5, 1 − 2 ⋅ 𝑃 > 0. If noise does not appear in the data set and 𝑁 = 0, decreasing variance increases the expected classification error, and variance reduction approaches will not help but make the situation worse.
If the data set is not noise free and 𝑁 > 0, the effect of noise will be amplified, but it will not be negative anymore.
The larger the noise and/or the larger the variance, the lower the error rate. Again, an ensemble is a set of classifiers but works just like a classifier. Hence, we can derive from Equation 2 that, when we have fewer member classifiers that are accurate or more member classifiers that are less accurate by using a classification algorithm with high bias, we can have an ensemble achieving a low error rate or high accuracy (as the member classifiers work together) if the classification algorithm that we use also has high variance or if the member classifiers built using the classification algorithm are diverse. In such a case, when some member classifiers make an error, others that differ from them would have a chance to correct the error, so that the ensemble composed of them would generate a correct classification. This holds when noise appears in the data set, since noise may “accidentally” correct an error made by some or even most member classifiers.
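Equations 1 and 2 differ only in the value of 𝐵 and in the sign of the variance term, so their behaviour can be checked numerically. The following Python sketch evaluates both; the function name and all numeric values are illustrative assumptions chosen only to show the direction of the effect, not results from this paper.

```python
def expected_error(p, noise, bias, variance):
    """Expected 0/1-loss on one test data record.

    bias == 0 gives Equation 1: (2P - 1) * N + V
    bias == 1 gives Equation 2: (2P - 1) * N + 1 - V
    """
    if bias not in (0, 1):
        raise ValueError("under 0/1-loss the bias is either 0 or 1")
    sign = 1 if bias == 0 else -1
    return (2 * p - 1) * noise + bias + sign * variance


# Case 1 (P > 0.5, B = 0): lowering the variance lowers the error.
case1_high_var = expected_error(p=0.9, noise=0.1, bias=0, variance=0.2)
case1_low_var = expected_error(p=0.9, noise=0.1, bias=0, variance=0.1)

# Case 2 (P < 0.5, B = 1): lowering the variance raises the error.
case2_high_var = expected_error(p=0.3, noise=0.1, bias=1, variance=0.3)
case2_low_var = expected_error(p=0.3, noise=0.1, bias=1, variance=0.2)

print(case1_low_var < case1_high_var)  # variance reduction helps in Case 1
print(case2_low_var > case2_high_var)  # variance reduction hurts in Case 2
```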
III. EXPERIMENTS
Algorithm 1 presents the construction of a hybrid ensemble of decision trees and artificial neural networks, and it is implemented by modifying the implementation of bagging [1] provided by WEKA [23]. In experiments, we compare it with classic ensembles of decision trees or artificial neural networks. These classic ensembles are constructed by using the implementation of bagging [1] provided by WEKA [23] without modifications. We use the implementation of the C4.5 algorithm [24] (with default parameters) provided by WEKA [23] to build a decision tree. Similarly, we use the implementation of the multilayer perceptron algorithm (with default parameters) provided by WEKA [23] to build an artificial neural network.
Input: 𝐷 is the given data set for training, and 𝑁 is the number of member classifiers
Output: 𝐸 is the resulting ensemble, i.e. a set of member classifiers
initialize 𝐸 = ∅;
for 𝑖 = 1; 𝑖 ≤ 𝑁; 𝑖 = 𝑖 + 1 do
    𝑆𝑖 ← Bootstrap(𝐷);
    if 𝑖 is even then
        specify 𝐴𝑖 as a decision tree algorithm;
    else
        specify 𝐴𝑖 as an artificial neural network algorithm;
    end
    𝐶𝑖 ← BuildClassifier(𝑆𝑖, 𝐴𝑖);
    𝐸 ← 𝐸 ∪ {𝐶𝑖};
end
Algorithm 1: The construction of a hybrid ensemble of decision trees and artificial neural networks.
In Algorithm 1, Bootstrap() performs sampling with replacement. Bootstrap is a procedure that replaces an unknown distribution with a known distribution to compute statistics in which we are interested. When performing classification tasks in real-world applications, we hardly know the underlying distribution but have the distribution of observed or collected samples. We can also use the bootstrap procedure when training samples are limited or when we use ensemble approaches that require diverse training samples drawn from different distributions. Furthermore, in Algorithm 1, BuildClassifier() constructs a classifier 𝐶𝑖 using the given training set 𝑆𝑖 and the specified classification algorithm 𝐴𝑖, and 𝐶𝑖 will be a member classifier in the resulting ensemble.
What makes Algorithm 1 different from bagging is the if-then-else statement. Algorithm 1 alternates between the decision tree algorithm and the artificial neural network algorithm, whereas bagging uses a single classification algorithm for all member classifiers. Algorithm 1 is simple but we find that it yields good results. Similar findings are also reported by others, as mentioned earlier.
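The control flow of Algorithm 1 can be sketched in Python as follows. This is not the WEKA-based implementation used in the experiments: the two learners below are hypothetical stand-ins (a nearest-class-mean rule and a one-nearest-neighbour rule on a single numeric attribute) that only occupy the places of the decision tree and artificial neural network algorithms, and the toy data set is invented for illustration.

```python
import random
from collections import Counter


def bootstrap(data, rng):
    """Bootstrap(D): sample len(data) records from data with replacement."""
    return [rng.choice(data) for _ in data]


def build_hybrid_ensemble(data, n_members, build_tree, build_net, rng):
    """Alternate two classification algorithms over bootstrap samples."""
    ensemble = []
    for i in range(1, n_members + 1):
        sample = bootstrap(data, rng)                        # S_i
        algorithm = build_tree if i % 2 == 0 else build_net  # A_i
        ensemble.append(algorithm(sample))                   # C_i joins E
    return ensemble


def classify(ensemble, x):
    """Majority vote over the member classifiers."""
    votes = Counter(member(x) for member in ensemble)
    return votes.most_common(1)[0][0]


def nearest_mean_stub(sample):
    """Hypothetical stand-in for the decision tree algorithm."""
    by_label = {}
    for x, y in sample:
        by_label.setdefault(y, []).append(x)
    means = {y: sum(xs) / len(xs) for y, xs in by_label.items()}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def one_nn_stub(sample):
    """Hypothetical stand-in for the artificial neural network algorithm."""
    return lambda x: min(sample, key=lambda rec: abs(x - rec[0]))[1]


# Invented toy data: (attribute value, binary class label).
data = [(float(v), 0) for v in range(0, 10)] + [(float(v), 1) for v in range(20, 30)]
rng = random.Random(0)
ensemble = build_hybrid_ensemble(data, 10, nearest_mean_stub, one_nn_stub, rng)
print(classify(ensemble, 2.0), classify(ensemble, 27.0))
```

With ten members, the even-numbered iterations use the first learner and the odd-numbered ones use the second, which gives five members of each type as in the B-DT+MLP setting.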
All the data sets considered in experiments are available on the web 1, and most are derived from the data sets available on the UC Irvine Machine Learning Repository [25]. Every data set is either originally for a binary classification task or transformed from a multi-class classification task into a binary one. Table I summarizes the characteristics of the data sets considered in experiments. In Table I, each row is for a data set, while the first column is its name, the second column is its number of data records, the third column is its number of attributes, and the fourth column is the percentage of the majority class of the data set.
TABLE I
Table II summarizes experimental results 2. The values of bias and variance (var) are given by the bias-variance decomposition method [26], provided by WEKA [23], with the number of iterations set to 100 and the percentage of data for training set to 50. In Table II, each row reports the result of applying a classification model (a single classifier or an ensemble of classifiers) to a data set, the first column is the name of the data set, the second column is the name of the classification model, the third column is the value of bias, the fourth column is the value of variance, and the fifth column is the error rate. Notations for the classification models considered in experiments are as follows: DT is for a single decision tree, MLP is for a single artificial neural network, B-DT is for a classic ensemble of ten decision trees, B-MLP is for a classic ensemble of ten artificial neural networks, and B-DT+MLP is a hybrid ensemble of five decision trees and five artificial neural networks. We group results for the same data set for simplicity. For example, the first five rows are in the group for the data set sonar, while the first row indicates that DT gives an error rate of 0.31 that can be decomposed into bias of 0.14 and variance of 0.17, and the third row indicates that B-DT gives an error rate of 0.26 (lower than that given by DT) that can be decomposed into bias of 0.15 (slightly higher than that given by DT) and variance of 0.12 (lower than that given by DT).
1 http://tunedit.org/
2 The details are omitted due to the page limit.
TABLE II
THE SUMMARY OF EXPERIMENTAL RESULTS.
data set model bias var err
sonar
The results in Table II indicate that the hybrid ensemble of decision trees and artificial neural networks achieves classification performance no worse than that given by classic ensembles of decision trees or those of artificial neural networks.
However, the reason why the hybrid ensemble works well on one data set may be different from that on another. For the following discussion, we summarize comparisons of values of bias and variance (var) of MLP against those values of DT in Table III, where Lo, Eq, and Hi are short for lower, equal, and higher, respectively. We can see from Table III that, for example, MLP has lower bias and variance than does DT on the data set sonar. In addition, MLP has lower bias than does DT but MLP and DT have equal variance on the data set hprice, and MLP has lower bias but higher variance than does DT on the data set credit.
TABLE III
COMPARISONS OF VALUES OF BIAS AND VARIANCE (VAR) OF MLP AGAINST THOSE VALUES OF DT.
           Lo var    Eq var                Hi var
Lo bias    sonar     hprice                credit
Eq bias    -         boston, halloffame    vote, credit-g
Hi bias    -         -                     ionosphere, colic, credit-a
IV. DISCUSSION
We divide these ten data sets into four groups according to the error rates in Table II.
Group 1. B-DT+MLP is the same as B-DT but better than B-MLP. This is the situation in which there is originally a classic ensemble of artificial neural networks (MLP) and we replace half of them with decision trees (DT). In other words, we introduce DT into an ensemble of MLP. If DT has lower bias, which makes more test data records correctly classified, and if DT has lower variance, then the expected classification error is lower according to Equation 1. This is what happens on ionosphere, colic, and credit-a, as we can observe from Table III. If the introduced DT has the same bias but lower variance, then the error rate is lower according to Equation 1, as vote in Table III shows. Furthermore, if the introduced DT has higher bias, which generates more incorrect classifications, but lower variance, then we speculate that there is noise in the data set according to Equation 2. This is what happens on credit, as we can observe from Table III. If the introduced DT has the same bias and the same variance, as boston in Table III shows, then we speculate that noise appears but has a surprisingly positive effect according to Equation 2.
Group 2. B-DT+MLP is the same as B-MLP but better than B-DT. This is the situation in which there is originally a classic ensemble of decision trees (DT) and we replace half of them with artificial neural networks (MLP). If MLP, which we introduce into the ensemble of DT, has lower bias and lower variance, then the expected classification error is lower according to Equation 1. This is what happens on sonar, as we can observe from Table III.
Group 3. B-DT+MLP is better than both B-DT and B-MLP. If there is no noise, 𝑁 = 0, and if we regard this group as closer to Group 1, in which we replace MLP with DT, then we introduce DT with lower variance into the ensemble, and the error rate is lower when variance is lower according to Equation 1. If there is noise, 𝑁 > 0, and if we regard this group as closer to Group 2, in which we replace DT with MLP, then we introduce MLP with higher variance into the ensemble of DT, and the error rate is lower when there is noise and variance is high according to Equation 2. This is what credit-g in Table III shows us.
Group 4. B-DT, B-MLP, and B-DT+MLP perform equally well. If we regard this group as closer to Group 1, where DT is introduced into an ensemble of MLP, and if the introduced DT has high bias but the same variance, as hprice in Table III shows, then we speculate that noise appears but has a surprisingly positive effect according to Equation 2. If we regard this group as closer to Group 2, where MLP is introduced into an ensemble of DT, and if the introduced MLP has low bias but the same variance, as shown to us by halloffame in Table III, then more test data records are correctly classified by MLP but their positive effect is offset by noise according to Equation 1. Regardless of whether we regard this group as closer to Group 1 or Group 2, if the introduced DT or MLP has the same bias and the same variance, as happens on halloffame, then we speculate that noise appears but has a surprisingly positive effect according to Equation 2.
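The grouping rule used in this section can be written down as a small Python sketch. The function name, the tolerance parameter, and the example error rates for B-MLP and B-DT+MLP are assumptions made for illustration; only the B-DT error rate of 0.26 on sonar is quoted from the description of Table II.

```python
def assign_group(err_b_dt, err_b_mlp, err_hybrid, tol=1e-9):
    """Return the group number (1 to 4) for one data set, or None.

    Error rates are treated as equal when they differ by at most tol,
    which is an assumed convention for comparing rounded table values.
    """
    same_as_dt = abs(err_hybrid - err_b_dt) <= tol
    same_as_mlp = abs(err_hybrid - err_b_mlp) <= tol
    if same_as_dt and same_as_mlp:
        return 4  # all three perform equally well
    if same_as_dt and err_hybrid < err_b_mlp:
        return 1  # same as B-DT but better than B-MLP
    if same_as_mlp and err_hybrid < err_b_dt:
        return 2  # same as B-MLP but better than B-DT
    if err_hybrid < err_b_dt and err_hybrid < err_b_mlp:
        return 3  # better than both
    return None   # the hybrid is worse than one of the classic ensembles


# 0.26 is the B-DT error rate on sonar; the two 0.20 values are hypothetical.
print(assign_group(err_b_dt=0.26, err_b_mlp=0.20, err_hybrid=0.20))
```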
V. CONCLUSIONS
In this paper, we study the hybrid ensemble constructed by using decision trees and artificial neural networks simultaneously. Ensemble learning constructs a classifier composed of several classifiers, and it has been found beneficial by researchers and practitioners. Both decision trees and artificial neural networks are popular types of classification algorithms often used to construct classic ensembles. Recently, researchers proposed to use the combination or mixture of these two types of classification algorithms to construct hybrid ensembles. However, researchers use decision trees and artificial neural networks together in ensembles without further discussion. The goal of this paper is to have a better understanding of the hybrid ensemble. We not only show that the hybrid ensemble can achieve comparable or even better classification performance but also provide an explanation of why it works. As part of the future work, we plan to investigate the hybrid ensemble composed of classifiers other than decision trees and artificial neural networks. We also plan to apply the analysis to other types of ensembles.
ACKNOWLEDGMENT
The National Science Council of Taiwan (R.O.C.) supported this work under Grant NSC 100-2218-E-004-002. The support is gratefully acknowledged. The author would also like to thank anonymous reviewers for their valuable time.
REFERENCES
[1] L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
[2] P. Tan, M. Steinbach, V. Kumar et al., Introduction to data mining. Pearson Addison Wesley Boston, 2006.
[3] J. Ghosh, “Multiclassifier systems: Back to the future,” in Proc. of International Workshop on Multiple Classifier Systems, vol. 3, 2002, p. 1.
[4] X. Sun, “Pitch accent prediction using ensemble machine learning,” in Seventh International Conference on Spoken Language Processing, 2002.
[5] W. Tong, H. Hong, H. Fang, Q. Xie, and R. Perkins, “Decision forest: Combining the predictions of multiple independent decision tree models,” J. Chem. Inf. Comput. Sci, vol. 43, pp. 525–531, 2003.
[6] R. Marée, P. Geurts, J. Piater, and L. Wehenkel, “Biomedical image classification with random subwindows and decision trees,” Computer Vision for Biomedical Image Applications, pp. 220–229, 2005.
[7] G. Giacinto and F. Roli, “Design of effective neural network ensembles for image classification purposes,” Image and Vision Computing, vol. 19, no. 9-10, pp. 699–707, 2001.
[8] X. Yao, M. Fischer, and G. Brown, “Neural network ensembles and their application to traffic flow prediction in telecommunications networks,” in Proc. of International Joint Conference on Neural Networks, vol. 1, 2001, pp. 693–698.
[9] Z. Zhou, Y. Jiang, Y. Yang, and S. Chen, “Lung cancer cell identification based on artificial neural network ensembles,” Artificial Intelligence in Medicine, vol. 24, no. 1, pp. 25–36, 2002.
[10] C. Shu and D. Burn, “Artificial neural network ensembles and their application in pooled flood frequency analysis,” Water Resources Research,