
5.2 SeedBoost: AdaBoost with Seeding

5.2.1 Algorithm

Before we introduce the SeedBoost algorithm, we take a closer look at AdaBoost (Algorithm 4.2). Following the gradient descent view of Mason et al. (2000), in the first iteration, AdaBoost greedily chooses $(h_1, v_1)$ to approximately minimize

$$\sum_{n=1}^{N} w_n \exp\bigl(-y_n v_1 h_1(x_n)\bigr). \qquad (5.8)$$

Then, in the $(t+1)$-th iteration, AdaBoost chooses $(h_{t+1}, v_{t+1})$ to approximately minimize

$$\sum_{n=1}^{N} w_n \exp\Bigl(-y_n \bigl(H_t(x_n) + v_{t+1} h_{t+1}(x_n)\bigr)\Bigr) = \sum_{n=1}^{N} w_n^{(t)} \exp\bigl(-y_n v_{t+1} h_{t+1}(x_n)\bigr). \qquad (5.9)$$
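The equality in (5.9) implicitly uses the standard AdaBoost bookkeeping. Under the usual convention (our reading of Algorithm 4.2, stated here for completeness rather than quoted from it), the partial ensemble and the modified weights would be

$$H_t(x) = \sum_{\tau=1}^{t} v_\tau h_\tau(x), \qquad w_n^{(t)} = w_n \exp\bigl(-y_n H_t(x_n)\bigr),$$

so the right-hand side of (5.9) takes exactly the form of (5.8) with $w_n$ replaced by $w_n^{(t)}$.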

Comparing (5.8) and (5.9), we see that AdaBoost at the $(t+1)$-th iteration using the original training set $\{(x_n, y_n, w_n)\}$ is equivalent to AdaBoost at the first iteration using the modified training set $\{(x_n, y_n, w_n^{(t)})\}$. Therefore, using (5.8) as a basic step, AdaBoost with $(t+T)$ iterations can be recursively defined as follows.

Algorithm 5.2 (A recursive view of AdaBoost with $(t+T)$ iterations).

1. Run AdaBoost on $\{(x_n, y_n, w_n)\}_{n=1}^{N}$ for $t$ steps and get an ensemble classifier $g^{(1)}(x) = \mathrm{sign}\bigl(H_t^{(1)}(x)\bigr)$.

2. Run AdaBoost on $\{(x_n, y_n, w_n^{(t)})\}_{n=1}^{N}$ for $T$ steps and get an ensemble classifier $g^{(2)}(x) = \mathrm{sign}\bigl(H_T^{(2)}(x)\bigr)$.

3. Return the combined classifier $g_{t+T}(x) = \mathrm{sign}\bigl(H_t^{(1)}(x) + H_T^{(2)}(x)\bigr)$.

Our proposed SeedBoost algorithm simply generalizes the recursive steps above by replacing the first step with an arbitrary learning algorithm A, whose real-valued decision function is then used to reweight the training examples before the AdaBoost stage.
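A minimal sketch of this two-stage procedure is given below. It assumes a generic booster adaboost(X, y, w, T) that returns a real-valued ensemble function (as in Algorithm 4.2) and a seed learner seed_fit(X, y, w) that returns a real-valued decision function; both names are placeholders rather than routines from the thesis.

```python
import numpy as np

def seedboost(X, y, w, seed_fit, adaboost, T):
    """Sketch of SeedBoost: seed with an arbitrary learner, then boost.

    X : (N, d) features; y : (N,) labels in {-1, +1}; w : (N,) example weights.
    seed_fit(X, y, w) -> H1, a callable returning real-valued scores (placeholder).
    adaboost(X, y, w, T) -> H2, a real-valued ensemble after T rounds (placeholder).
    """
    # Stage 1: run the seeding algorithm (e.g., SSVM-Perc) on the original weights.
    H1 = seed_fit(X, y, w)

    # Reweight the examples by the exponential loss of the seed's decision function,
    # mirroring the modified weights w_n^(t) in the recursive view of AdaBoost.
    w_mod = w * np.exp(-y * H1(X))

    # Stage 2: run AdaBoost for T rounds on the reweighted training set.
    H2 = adaboost(X, y, w_mod, T)

    # Final classifier: the sign of the combined decision function.
    def g(X_query):
        return np.sign(H1(X_query) + H2(X_query))

    return g
```

Choosing seed_fit to be "run AdaBoost itself for t steps" recovers Algorithm 5.2; choosing it to be SSVM-Perc or SVM-Perc gives the two settings studied next.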

We couple SeedBoost with two different algorithms as A. The first one is a potentially overfitting algorithm. Recall that in Theorem 5.5, we proved that under a minor condition, when κ is large enough, SVM with the perceptron kernel can always dichotomize the training set. We set κ = 2^16 for this purpose and call the resulting algorithm the separating SVM with the perceptron kernel (SSVM-Perc). SSVM-Perc can reach training cost 0 in all our experiments, but may not lead to a good test cost because of overfitting. The second one is SVM-Perc (see Subsection 5.1.3), which is usually better than SSVM-Perc in terms of test performance because it uses a parameter selection procedure to determine a suitable κ.
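For concreteness, the sketch below shows one way such a separating SVM could be instantiated with off-the-shelf tools. It is only an illustration under several assumptions: the perceptron kernel is taken in the form K(x, x') = Δ − ‖x − x'‖₂ suggested by Lin and Li (2008), with Δ chosen by a simple heuristic rather than the thesis's own constant; scikit-learn's SVC stands in for the solver of formulation (5.2); and κ is treated as the usual soft-margin trade-off parameter.

```python
import numpy as np
from sklearn.svm import SVC  # LIBSVM-based solver, used here only for illustration

def perceptron_kernel(X1, X2, delta):
    """Assumed form of the perceptron kernel: K(x, x') = delta - ||x - x'||_2."""
    # Pairwise Euclidean distances between rows of X1 and X2.
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    return delta - d

def fit_ssvm_perc(X, y, kappa=2**16, delta=None):
    """'Separating' SVM with the perceptron kernel: kappa is set very large so
    the training set is dichotomized (training cost 0), at the risk of overfitting."""
    if delta is None:
        # Heuristic constant: at least the diameter of the training inputs
        # (our assumption; the thesis fixes this constant through its own derivation).
        delta = np.max(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))
    K = perceptron_kernel(X, X, delta)
    clf = SVC(C=kappa, kernel="precomputed")
    clf.fit(K, y)

    def decision(X_query):
        # Real-valued decision function, usable as the seed in the SeedBoost sketch above.
        return clf.decision_function(perceptron_kernel(X_query, X, delta))

    return decision
```

SVM-Perc would differ only in that κ (and any kernel scaling) is chosen by the parameter selection procedure of Subsection 5.1.3 instead of being fixed to a huge value.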

Similar to the procedures in Subsection 5.1.3, we use decision stumps and perceptrons within the AdaBoost component of SeedBoost. We perform experiments on the eight real-world data sets (see Table 5.1) from the UCI repository (Hettich, Blake and Merz 1998). For simplicity, we fix T to 100 in all the experiments, while the rest of the setup is kept the same as that in Subsection 5.1.3.

Table 5.5 shows the results of SSVM-Perc both when used as a stand-alone learning algorithm and when coupled with SeedBoost. We see that SeedBoost can usually improve the test performance of SSVM-Perc. Such a result suggests that the second stage of AdaBoost exhibits some regularization properties that could correct overfitting.

Table 5.6 shows the results of SVM-Perc instead of SSVM-Perc. Unlike Table 5.5, SeedBoost usually cannot improve over stand-alone SVM-Perc significantly and often leads to worse test performance. That is, a decent binary classification algorithm like SVM-Perc does not benefit from being coupled with the second stage of AdaBoost.

If we compare Table 5.6 with AdaBoost-Perc in Table 5.3, we see that SeedBoost-Perc plus SVM-Perc is mostly comparable to AdaBoost-Perc. Thus, AdaBoost does not improve when we replace its first stage with a better learning algorithm. Such a result suggests that the second stage of AdaBoost plays a more important role and should be the focus of future research in explaining the success of AdaBoost.

Finally, in Table 5.7, we gather some columns from Tables 5.3 and 5.5 to show an interesting result: After SeedBoost corrects overfitting, on some of the data sets, SeedBoost-Stump plus SSVM-Perc can be significantly better than SVM-Perc. Note that because decision stumps are special cases of perceptrons, SeedBoost-Stump plus SSVM-Perc is also an algorithm that outputs an infinite ensemble of perceptrons.

Since SSVM-Perc involves solving only one optimization problem (5.2) while SVM-Perc needs to solve 55 of them during its parameter selection procedure, SeedBoost with SSVM-Perc is much faster than SVM-Perc in training.

With the decent performance in Table 5.7 and faster training, SeedBoost-Stump with SSVM-Perc can be a useful alternative for infinite ensemble learning with perceptrons.

Table 5.5: Test classification cost (%) of SeedBoost with SSVM-Perc

data set     | SSVM-Perc stand-alone | SeedBoost-Stump | SeedBoost-Perc
australian   | 16.696±0.141          | 14.101±0.152    | 15.438±0.149
breast       |  3.299±0.087          |  3.858±0.103    |  3.434±0.086
german       | 25.790±0.178          | 24.093±0.182    | 25.910±0.191
heart        | 19.722±0.323          | 18.380±0.360    | 18.250±0.327
ionosphere   |  6.078±0.200          |  9.830±0.247    | 11.298±0.254
pima         | 25.390±0.176          | 24.396±0.195    | 24.877±0.206
sonar        | 15.012±0.368          | 18.441±0.402    | 19.488±0.407
votes84      |  4.201±0.140          |  3.994±0.135    |  4.374±0.145

(those within one standard error of the lowest one are marked in bold)

Table 5.6: Test classification cost (%) of SeedBoost with SVM-Perc

data set     | SVM-Perc stand-alone | SeedBoost-Stump | SeedBoost-Perc
australian   | 14.482±0.170         | 14.525±0.176    | 15.286±0.152
breast       |  3.230±0.080         |  4.051±0.104    |  3.453±0.089
german       | 24.593±0.196         | 24.508±0.168    | 26.122±0.192
heart        | 17.556±0.307         | 19.491±0.346    | 18.583±0.327
ionosphere   |  6.404±0.198         |  9.901±0.240    | 11.262±0.261
pima         | 23.545±0.212         | 24.302±0.192    | 25.195±0.201
sonar        | 15.607±0.399         | 18.012±0.386    | 19.786±0.417
votes84      |  4.425±0.138         |  4.080±0.147    |  4.351±0.142

(those within one standard error of the lowest one are marked in bold)

Table 5.7: Test classification cost (%) of SeedBoost with SSVM versus stand-alone SVM

data set     | SSVM-Perc SeedBoost-Stump | SVM-Perc stand-alone
australian   | 14.101±0.152              | 14.482±0.170
breast       |  3.858±0.103              |  3.230±0.080
german       | 24.093±0.182              | 24.593±0.196
heart        | 18.380±0.360              | 17.556±0.307
ionosphere   |  9.830±0.247              |  6.404±0.198
pima         | 24.396±0.195              | 23.545±0.212
sonar        | 18.441±0.402              | 15.607±0.399
votes84      |  3.994±0.135              |  4.425±0.138

(those within one standard error of the lowest one are marked in bold)

Chapter 6 Conclusion

In Chapter 2, we proposed the cost-transformation technique, which reduces cost-sensitive classification to regular classification, and proved its theoretical guarantees. Then, we designed two novel cost-sensitive classification algorithms, namely CSOVA and CSOVO, by applying the cost-transformation technique to their popular counterparts in regular classification. Experimental results showed that both algorithms worked well for cost-sensitive classification problems as well as for ordinal ranking problems.

In Chapter 3, we proposed the threshold ensemble model for ordinal ranking and defined margins for the model. Novel large-margin bounds of common cost functions were proved and were extended to threshold rankers for general threshold models.

We studied two algorithms for obtaining threshold ensembles. The first algorithm, RankBoost-OR, combines RankBoost and a simple threshold algorithm. In addition, we designed a new boosting approach, ORBoost, which closely connects with the large-margin bounds. ORBoost is a direct extension of AdaBoost and inherits its theoretical and practical advantages. Experimental results demonstrated that both algorithms can perform decently on real-world data sets. In particular, ORBoost was comparable to SVM-based algorithms in terms of test cost, but enjoyed the advantage of faster training. These properties make ORBoost favorable over SVM-based algorithms on large data sets.

In Chapter 4, we presented the reduction framework from ordinal ranking to binary classification. The framework includes the reduction method and the reverse reduction technique. We showed the theoretical guarantees of the framework, including the cost bound, the regret bound, and the equivalence between ordinal ranking and binary classification.

We used the reduction framework to extend SVM to ordinal ranking. We not only derived a general cost bound for SVM-based large-margin rankers, but also demonstrated that reducing to the standard SVM can readily yield superior performance in practice. We also used reduction to design a novel boosting approach, AdaBoost.OR, which can improve the performance of any cost-sensitive ordinal ranking algorithm.

We showed the parallel between AdaBoost.OR and AdaBoost in algorithmic steps and in theoretical properties. Experimental results validated that AdaBoost.OR indeed improved both the training and test performance of existing ordinal ranking algorithms.

In Chapter 5, we first derived two novel kernels based on the SVM-based framework for infinite ensemble learning. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. These kernels can be simply evaluated by the ℓ1- or ℓ2-norm distance between feature vectors. SVM equipped with the kernels generates infinite and nonsparse ensembles, which are usually more robust than finite and sparse ones. Experimental comparisons with AdaBoost showed that SVM with the kernels usually performed much better than AdaBoost with the same base hypothesis set. Therefore, existing applications that use AdaBoost with decision stumps or perceptrons may be improved by switching to SVM with the corresponding kernel. We also discussed how such an advantage propagates from binary classification to ordinal ranking.

Then, we proposed the SeedBoost algorithm, which takes AdaBoost as machinery to regularize the overfitting effect of other learning algorithms. We conducted experimental studies on SeedBoost. The results demonstrated that SeedBoost not only improved some overfitting learning algorithms, but also achieved the best performance on some of the data sets.

Bibliography

Abe, N., B. Zadrozny, and J. Langford (2004). An iterative method for multi-class cost-sensitive learning. In W. Kim, R. Kohavi, J. Gehrke, and W. DuMouchel (Eds.), Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 3–11. ACM.

Abu-Mostafa, Y. S. (1989). The Vapnik-Chervonenkis dimension: Information versus complexity in learning. Neural Computation 1 (3), 312–317.

Abu-Mostafa, Y. S., X. Song, A. Nicholson, and M. Magdon-Ismail (2004). The bin model. Technical Report CaltechCSTR:2004.002, California Institute of Technology.

Auer, P. and R. Meir (Eds.) (2005). Learning Theory: 18th Annual Conference on Learning Theory, Volume 3559 of Lecture Notes in Artificial Intelligence. Springer-Verlag.

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory 44 (2), 525–536.

Bartlett, P. L. and J. Shawe-Taylor (1998). Generalization performance of support vector machines and other pattern classifiers. See (Schölkopf, Burges and Smola 1998), Chapter 4, pp. 43–54.

Becker, S., S. Thrun, and K. Obermayer (Eds.) (2003). Advances in Neural Information Processing Systems: Proceedings of the 2002 Conference, Volume 15. MIT Press.

Beygelzimer, A., V. Dani, T. Hayes, J. Langford, and B. Zadrozny (2005). Error limiting reductions between classification tasks. In L. D. Raedt and S. Wrobel (Eds.), Machine Learning: Proceedings of the 22nd International Conference, pp. 49–56. ACM.

Beygelzimer, A., J. Langford, and P. Ravikumar (2007). Multiclass classification with filter trees. Downloaded from http://hunch.net/jl.

Breiman, L. (1996). Bagging predictors. Machine Learning 24 (2), 123–140.

Breiman, L. (1998). Arcing classifiers. The Annals of Statistics 26 (3), 801–824.

Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation 11 (7), 1493–1517.

Cardoso, J. S. and J. F. P. da Costa (2007). Learning to classify ordinal data: The data replication method. Journal of Machine Learning Research 8, 1393–1429.

Chang, C.-C. and C.-J. Lin (2001). LIBSVM: A Library for Support Vector Machines. National Taiwan University. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chu, W. and Z. Ghahramani (2005). Gaussian processes for ordinal regression. Journal of Machine Learning Research 6, 1019–1041.

Chu, W. and S. S. Keerthi (2007). Support vector ordinal regression. Neural Computation 19 (3), 792–815.

Crammer, K. and Y. Singer (2005). Online ranking by projecting. Neural Computation 17 (1), 145–175.

Demiriz, A., K. P. Bennett, and J. Shawe-Taylor (2002). Linear programming boosting via column generation. Machine Learning 46 (1-3), 225–254.

Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM.

Frank, E. and M. Hall (2001). A simple approach to ordinal classification. In L. D. Raedt and P. Flach (Eds.), Machine Learning: Proceedings of the 12th European Conference on Machine Learning, Volume 2167 of Lecture Notes in Artificial Intelligence, pp. 145–156. Springer-Verlag.

Freund, Y., R. Iyer, R. E. Schapire, and Y. Singer (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research 4, 933–969.

Freund, Y. and R. E. Schapire (1996). Experiments with a new boosting algorithm. In L. Saitta (Ed.), Machine Learning: Proceedings of the 13th International Conference, pp. 148–156. Morgan Kaufmann.

Freund, Y. and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55 (1), 119–139.

Freund, Y. and R. E. Schapire (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14 (5), 771–780. English version downloadable at http://boosting.org.

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1 (2), 179–191.

Graepel, T., R. Herbrich, and J. Shawe-Taylor (2005). PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification. Machine Learning 59 (1-2), 55–76.

Har-Peled, S., D. Roth, and D. Zimak (2003). Constraint classification: A new approach to multiclass classification and ranking. See (Becker, Thrun and Obermayer 2003), pp. 785–792.

Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer-Verlag.

Haykin, S. (1999). Neural Networks: A Comprehensive Foundation (Second ed.). Upper Saddle River, NJ: Prentice Hall.

Herbrich, R., T. Graepel, and K. Obermayer (2000). Large margin rank boundaries for ordinal regression. See (Smola et al. 2000), pp. 115–132.

Hettich, S., C. L. Blake, and C. J. Merz (1998). UCI repository of machine learning databases. Downloadable at http://www.ics.uci.edu/~mlearn/MLRepository.html.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), 13–30.

Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11 (1), 63–91.

Hsu, C.-W., C.-C. Chang, and C.-J. Lin (2003). A practical guide to support vector classification. Technical report, National Taiwan University.

Hsu, C.-W. and C.-J. Lin (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16 (5), 550–554.

International Neural Network Society (2007). Proceedings of the 2007 International Joint Conference on Neural Networks (IJCNN 2007). International Neural Network Society: IEEE.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol (Eds.), Machine Learning: Proceedings of the 10th European Conference on Machine Learning, Volume 1398 of Lecture Notes in Computer Science, pp. 137–142. Springer-Verlag.

Kearns, M. J. and U. V. Vazirani (1994). An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press.

Keerthi, S. S. and C.-J. Lin (2003). Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (7), 1667–1689.

Langford, J. and A. Beygelzimer (2005). Sensitive error correcting output codes. See (Auer and Meir 2005), pp. 158–172.

Langford, J. and B. Zadrozny (2005). Estimating class membership probabilities using classifier learners. In R. G. Cowell and Z. Ghahramani (Eds.), Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Volume 13. Society for Artificial Intelligence and Statistics.

Li, L. and H.-T. Lin (2007a). Optimizing 0/1 loss for perceptrons by random coordinate descent. See (International Neural Network Society 2007), pp. 749–754.

Li, L. and H.-T. Lin (2007b). Ordinal regression by extended binary classification. In B. Schölkopf, J. C. Platt, and T. Hofmann (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 2006 Conference, Volume 19, pp. 865–872. MIT Press.

Li, L., A. Pratap, H.-T. Lin, and Y. S. Abu-Mostafa (2005). Improving generalization by data categorization. In A. M. Jorge, L. Torgo, P. B. Brazdil, R. Camacho, and J. Gama (Eds.), Knowledge Discovery in Databases: PKDD 2005, Volume 3721 of Lecture Notes in Computer Science, pp. 157–168. Springer-Verlag.

Lin, H.-T. (2005). Infinite ensemble learning with support vector machines. Master’s thesis, California Institute of Technology.

Lin, H.-T. and L. Li (2006). Large-margin thresholded ensembles for ordinal regression: Theory and practice. In J. L. Balcázar, P. M. Long, and F. Stephan (Eds.), Algorithmic Learning Theory, Volume 4264 of Lecture Notes in Artificial Intelligence, pp. 319–333. Springer-Verlag.

Lin, H.-T. and L. Li (2008). Support vector machinery for infinite ensemble learning. Journal of Machine Learning Research 9, 285–312.

Margineantu, D. D. (2001). Methods for Cost-Sensitive Learning. Ph. D. thesis, Oregon State University.

Mason, L., J. Baxter, P. L. Bartlett, and M. Frean (2000). Functional gradient techniques for combining hypotheses. See (Smola et al. 2000), pp. 221–246.

McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B 42 (2), 109–142.

Mease, D. and A. Wyner (2008). Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research 9, 131–156.

Meir, R. and G. Rätsch (2003). An introduction to boosting and leveraging. In S. Mendelson and A. J. Smola (Eds.), Advanced Lectures on Machine Learning, Volume 2600 of Lecture Notes in Computer Science, pp. 118–183. Springer-Verlag.

Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. See (Schölkopf, Burges and Smola 1998), Chapter 12, pp. 185–208.

Rajaram, S., A. Garg, X. S. Zhou, and T. S. Huang (2003). Classification approach towards ranking and sorting problems. In N. Lavrac, D. Gamberger, L. Todorovski, and H. Blockeel (Eds.), Machine Learning: Proceedings of the 14th European Conference on Machine Learning, Volume 2837 of Lecture Notes in Computer Science, pp. 301–312. Springer-Verlag.

Rätsch, G., A. Demiriz, and K. Bennett (2002). Sparse regression ensembles in infinite and finite hypothesis spaces. Machine Learning 48 (1-3), 189–218.

Rätsch, G., S. Mika, B. Schölkopf, and K.-R. Müller (2002). Constructing boosting algorithms from SVMs: An application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (9), 1184–1199.

Rätsch, G., T. Onoda, and K. Müller (2001). Soft margins for AdaBoost. Machine Learning 42 (3), 287–320.

Robertson, T., F. T. Wright, and R. L. Dykstra (1988). Order Restricted Statistical Inference. New York, NY: John Wiley & Sons.

Rosset, S., G. Swirszcz, N. Srebro, and J. Zhu (2007). ℓ1 regularization in infinite dimensional feature spaces. In N. H. Bshouty and C. Gentile (Eds.), Learning Theory: 20th Annual Conference on Learning Theory, Volume 4539 of Lecture Notes in Computer Science, pp. 544–558. Springer-Verlag.

Rosset, S., J. Zhu, and T. Hastie (2004). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research 5, 941–973.

Rudin, C., C. Cortes, M. Mohri, and R. E. Schapire (2005). Margin-based ranking meets boosting in the middle. See (Auer and Meir 2005), pp. 63–78.

Schapire, R. E., Y. Freund, P. L. Bartlett, and W. S. Lee (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics 26 (5), 1651–1686.

Schölkopf, B., C. J. C. Burges, and A. J. Smola (Eds.) (1998). Advances in Kernel Methods: Support Vector Learning. MIT Press.

Schölkopf, B. and A. Smola (2002). Learning with Kernels. Cambridge, MA: MIT Press.

Shashua, A. and A. Levin (2003). Ranking with large margin principle: Two approaches. See (Becker, Thrun and Obermayer 2003), pp. 937–944.

Smola, A. J., P. L. Bartlett, B. Schölkopf, and D. Schuurmans (Eds.) (2000). Advances in Large Margin Classifiers. MIT Press.