
ORBoost-LR: ORBoost with left-right margins

Note that the upper bound is equivalent to minimizing (3.13) if $h_t(x_n) \in \{-1, 0, 1\}$.

Thus, when $h_t$ is a binary classifier, the optimal $\alpha_t$ can be exactly determined. Another remark is that $\alpha_t$ is finite under some mild conditions. Thus, unlike RankBoost-OR, which sets $\alpha_t$ to $\infty$ when encountering the partial matching problem, ORBoost-LR rarely sets $\alpha_t$ to $\infty$.
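As a worked illustration of why binary outputs permit an exact and usually finite minimizer (the precise form of (3.15) follows from the upper bound above; the objective below is a simplified AdaBoost-style stand-in, not the exact bound):
\[
\min_{\alpha} \sum_{n} w_n \, e^{-\alpha \rho_n}
= W_+ e^{-\alpha} + W_- e^{\alpha}
\;\Longrightarrow\;
\alpha^\star = \tfrac{1}{2} \ln \frac{W_+}{W_-},
\qquad
W_{\pm} = \sum_{n:\, \rho_n = \pm 1} w_n,
\]
where each $\rho_n \in \{-1, +1\}$ collects the sign of the margin term. The minimizer $\alpha^\star$ is finite exactly when both $W_+$ and $W_-$ are nonzero, which matches the mild conditions mentioned above.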

Updating θ: Note that when the pair $(h_t, \alpha_t)$ is fixed, (3.13) can be reorganized as $\sum_{k=1}^{K-1} \phi_+^{(k)} \exp(\theta_k) + \phi_-^{(k)} \exp(-\theta_k)$ for some $\phi_+^{(k)}$ and $\phi_-^{(k)}$ that can be computed in $O(N)$ time. Then, each $\theta_k$ can be computed analytically, uniquely, and independently. Nevertheless, when each $\theta_k$ is updated independently, the thresholds may not be ordered. Hence, we propose to add an additional ordering constraint to (3.13).

That is, we choose θ by solving

\[
\min_{\vartheta} \;\sum_{k=1}^{K-1} \phi_+^{(k)} \exp(\vartheta_k) + \phi_-^{(k)} \exp(-\vartheta_k), \tag{3.16}
\]
such that $\vartheta_1 \le \vartheta_2 \le \cdots \le \vartheta_{K-1}$.

An efficient algorithm for solving (3.16) can be obtained by a simple modification of the pool adjacent violators (PAV) algorithm for isotonic regression (Robertson, Wright and Dykstra 1988), which takes at most $O(K^2)$ time.
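To make the modification concrete, here is a minimal sketch of the PAV-style update, assuming all $\phi_+^{(k)}$ and $\phi_-^{(k)}$ are strictly positive so that each block minimizer is well defined; the function names are our own, not from the original implementation:

```python
import math

def block_theta(p, m):
    # Unconstrained minimizer of p*exp(t) + m*exp(-t) is t = (1/2) ln(m/p).
    return 0.5 * math.log(m / p)

def solve_thresholds(phi_plus, phi_minus):
    """Solve (3.16): minimize sum_k phi_plus[k]*exp(t_k) + phi_minus[k]*exp(-t_k)
    subject to t_1 <= t_2 <= ... <= t_{K-1}, by pooling adjacent violators."""
    blocks = []  # each block: [sum of phi_plus, sum of phi_minus, count]
    for p, m in zip(phi_plus, phi_minus):
        blocks.append([p, m, 1])
        # While the last two blocks violate the ordering, merge them: the
        # pooled objective keeps the same exponential form, so its minimizer
        # is again block_theta of the summed coefficients.
        while (len(blocks) > 1 and
               block_theta(*blocks[-2][:2]) > block_theta(*blocks[-1][:2])):
            p2, m2, c2 = blocks.pop()
            blocks[-1][0] += p2
            blocks[-1][1] += m2
            blocks[-1][2] += c2
    # Expand the pooled blocks back into the K-1 ordered thresholds.
    thetas = []
    for p, m, c in blocks:
        thetas.extend([block_theta(p, m)] * c)
    return thetas
```

Since every merge removes one block, there can be at most K−2 merges in total, well within the stated $O(K^2)$ bound.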

Combination of the steps: ORBoost-LR works by combining the steps above sequentially in each iteration. Note that after $h_t$ is determined, $\alpha_t$ and $\theta_t$ can be either jointly optimized or cyclically updated. Nevertheless, we found that joint or cyclic optimization does not always improve performance and could sometimes cause ORBoost-LR to overfit. Thus, we execute each step only once in each iteration.

From the discussions above, the exact steps of ORBoost-LR are as follows (a sketch of the complete loop is given after the list).

1. For t = 1, 2, . . . , T:

(a) Obtain a confidence function $h_t$ from the base algorithm.

(b) Determine the optimal $\alpha_t \in \mathbb{R}$ by (3.15).

(c) Update θ by solving (3.16).

2. Return the threshold ensemble $r_{H,\theta}$, where $H(x) = H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$.
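Putting the steps together, the following is a minimal sketch of the training loop under our reading of the algorithm; `base_algorithm`, `optimal_alpha` (standing in for (3.15)), and `margin_weights` (the $O(N)$ computation of $\phi_\pm^{(k)}$) are hypothetical placeholders, and `solve_thresholds` is the PAV sketch above:

```python
def orboost_lr_train(X, y, base_algorithm, optimal_alpha, margin_weights, T, K):
    """Sketch of the ORBoost-LR loop: one base-learner call, one alpha
    update, and one threshold update per iteration, as described above."""
    ensemble = []            # list of (alpha_t, h_t) pairs
    thetas = [0.0] * (K - 1)
    for t in range(T):
        h = base_algorithm(X, y, ensemble, thetas)            # step (a)
        alpha = optimal_alpha(h, X, y, ensemble, thetas)      # step (b), eq. (3.15)
        ensemble.append((alpha, h))
        phi_plus, phi_minus = margin_weights(X, y, ensemble)  # O(N) per threshold
        thetas = solve_thresholds(phi_plus, phi_minus)        # step (c), eq. (3.16)

    def rank(x):
        # Threshold ensemble r_{H,theta}: the rank is 1 plus the number of
        # thresholds that H(x) exceeds.
        H = sum(a * h(x) for a, h in ensemble)
        return 1 + sum(H > th for th in thetas)
    return rank
```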

In the special case of K = 2, both ORBoost approaches are almost the same as AdaBoost with an additional term $\theta_1$, which can be thought of as the coefficient of a constant classifier. Interestingly, Rudin et al. (2005) proved the connection between RankBoost and AdaBoost when a constant classifier is included in the ensemble. Thus, when K = 2, RankBoost-OR, ORBoost-LR, and ORBoost-All all share some similarity with AdaBoost.

ORBoost formulations also have connections with SVM-based algorithms, such as SVOR by Chu and Keerthi (2007). In particular, ORBoost-LR is a counterpart of SVOR with explicit constraints (SVOR-EXC), and ORBoost-All is related to SVOR with implicit constraints (SVOR-IMC). These connections closely follow the links between AdaBoost and SVM (Lin and Li 2008; Rätsch et al. 2002).

3.3 Experiments

In this section, we compare the three boosting formulations above for constructing the threshold ensembles. We also compare these formulations with SVM-based algorithms.

Two sets of confidence functions are used in the experiments. The first one is the set of perceptrons $\{\operatorname{sign}(\langle v, x \rangle + b) : v \in \mathbb{R}^D,\ b \in \mathbb{R}\}$. The RCD-bias algorithm is known to work well with AdaBoost (Li and Lin 2007a) and is adopted as our base algorithm. In all our experiments, RCD-bias is configured with zero seeding and 200 iterations.

The second set is $\{\tanh(\langle v, x \rangle + b) : \langle v, v \rangle + b^2 = \gamma^2\}$, which contains normalized sigmoid functions. Note that sigmoid functions smooth the output of perceptrons, and the smoothness is controlled by the parameter γ. We use a naive base algorithm for normalized sigmoid functions as follows: RCD-bias is first performed to get a perceptron; then, the weights and bias of the perceptron are normalized, and the outputs are smoothed. Throughout the experiments we use γ = 4, which was picked with a few experimental runs on some data sets.
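As a concrete reading of this naive base algorithm, here is a minimal sketch; the function names and the use of NumPy are our own, and `rcd_bias` is a hypothetical stand-in for the perceptron learner above:

```python
import numpy as np

def normalized_sigmoid(v, b, gamma=4.0):
    """Rescale a perceptron (v, b) so that <v, v> + b^2 = gamma^2, and
    replace its sign(.) output with the smooth tanh(.)."""
    scale = gamma / np.sqrt(np.dot(v, v) + b * b)
    v, b = scale * v, scale * b
    return lambda x: np.tanh(np.dot(v, x) + b)

# Usage: v, b = rcd_bias(X, y, weights)   # hypothetical perceptron learner
#        h = normalized_sigmoid(v, b)     # confidence function in (-1, 1)
```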

Next, we perform experiments with eight benchmark data sets and the same setup as in Subsection 2.4.2. Here we compare two different cost functions: $C_c$ and $C_a$. Since we use the same setup as Chu and Keerthi (2007), we can fairly compare our proposed algorithms with their SVM-based results.

Table 3.1: Test absolute cost of algorithms for threshold ensembles

data set      RankBoost-OR               ORBoost-All                SVOR-IMC
              perceptron    sigmoid      perceptron    sigmoid
pyrimidines   1.352±0.049   1.408±0.050  1.360±0.046   1.398±0.052  1.294±0.046
machine       0.896±0.022   0.905±0.025  0.889±0.019   0.969±0.025  0.990±0.026
boston        0.779±0.014   0.746±0.014  0.791±0.013   0.777±0.015  0.747±0.011
abalone       1.424±0.003   1.385±0.004  1.432±0.003   1.403±0.004  1.361±0.003
bank          1.457±0.002   1.456±0.002  1.490±0.002   1.539±0.002  1.393±0.002
computer      0.600±0.002   0.606±0.002  0.626±0.002   0.634±0.002  0.596±0.002
california    0.919±0.002   0.949±0.002  0.977±0.002   0.942±0.002  1.008±0.001
census        1.212±0.002   1.186±0.002  1.265±0.002   1.198±0.002  1.205±0.002

(those within one standard error of the lowest one are marked in bold)

Table 3.2: Test classification cost of algorithms for threshold ensembles

data set      RankBoost-OR               ORBoost-LR                 SVOR-EXC
              perceptron    sigmoid      perceptron    sigmoid
pyrimidines   0.742±0.021   0.733±0.018  0.731±0.019   0.731±0.018  0.752±0.014
machine       0.614±0.009   0.625±0.011  0.610±0.009   0.633±0.011  0.661±0.012
boston        0.570±0.005   0.552±0.007  0.580±0.006   0.549±0.007  0.569±0.006
abalone       0.738±0.002   0.719±0.002  0.740±0.002   0.716±0.002  0.736±0.002
bank          0.763±0.001   0.755±0.001  0.767±0.001   0.777±0.002  0.744±0.001
computer      0.485±0.002   0.491±0.001  0.498±0.001   0.491±0.001  0.462±0.001
california    0.607±0.001   0.620±0.001  0.628±0.001   0.605±0.001  0.640±0.001
census        0.706±0.001   0.700±0.001  0.718±0.001   0.694±0.001  0.699±0.000

(those within one standard error of the lowest one are marked in bold)


We list the means and standard errors of all test results with T = 2000 in Tables 3.1 and 3.2. Table 3.1 compares the algorithms with $C_a$, and Table 3.2 compares the algorithms with $C_c$. We make several remarks here.

RankBoost versus ORBoost: RankBoost-OR can achieve decent performance even though we use a loose upper bound to decide $\alpha_t$ (see Subsection 3.2.1 and some of our earlier results on RankBoost-OR (Lin and Li 2006)). Its performance with the absolute cost is better than that of ORBoost-All, and its performance with the classification cost is slightly worse than that of ORBoost-LR. Overall, all three algorithms work well on the data sets, while the ORBoost variants enjoy the advantages of simpler implementation and better efficiency.

Perceptron versus sigmoid: The best test performance is mostly achieved with sigmoid functions. One possible reason is that the data sets are quantized from regression ones (Chu and Keerthi 2007) and therefore inherit properties such as smooth boundaries. If we only use binary classifiers like perceptrons, as depicted in Figure 3.3(b), the boundaries would not be as smooth. Thus, for ordinal ranking data sets that are quantized from regression data sets (or that follow the assumption of the threshold regression algorithm), smooth confidence functions may be more useful than discrete binary classifiers.

Boosting versus SVM: When comparing the boosting algorithms with SVOR-IMC on the absolute cost and SVOR-EXC on the classification cost (Chu and Keerthi 2007), we see that the boosting formulations achieve test costs similar to those of the SVM-based algorithms. Note, however, that the boosting formulations (especially ORBoost) with perceptrons or sigmoid functions are much faster. On the census data set, which contains 6000 training examples, ORBoost-LR takes about an hour to finish one trial, whereas the SVM-based approaches, which include a time-consuming automatic parameter selection step, need more than four days. With comparable performance and significantly less computational cost, ORBoost could be a useful tool for large data sets.

Chapter 4

Ordinal Ranking by Extended Binary Classification

In Chapter 2, we studied ordinal ranking problems from the classification perspective and proposed the novel CSOVA and CSOVO algorithms to tackle ordinal ranking problems via cost-sensitive classification. Both CSOVA and CSOVO decompose the cost-sensitive classification problem into several binary classification problems and call an underlying binary classification algorithm to solve them. In Chapter 3, we studied ordinal ranking problems from the regression perspective and proposed the threshold ensemble model for ordinal ranking. Each threshold ensemble in the model aggregates binary classifiers (confidence functions) to form its final prediction. We also designed the RankBoost-OR and ORBoost algorithms, which return a threshold ensemble by calling a base binary classification algorithm several times. RankBoost-OR and ORBoost are derived from AdaBoost, a popular binary classification algorithm.

In other words, binary classification showed up frequently in our proposed approaches for dealing with ordinal ranking. Since binary classification is arguably the most widely studied machine learning problem, it is not coincidental that we tackle more complicated machine learning problems, such as ordinal ranking, by reducing them to what we know in binary classification (Beygelzimer et al. 2005; Langford and Zadrozny 2005). A systematic reduction framework from ordinal ranking to binary classification can introduce two immediate benefits. First, well-tuned binary classification approaches can be readily transformed into good ordinal ranking ones,

which saves immense efforts in design and implementation. Second, new theoretical guarantees for ordinal ranking can be easily extended from known ones for binary classification, which saves tremendous efforts in derivation and analysis.

We introduced one such reduction framework in Subsection 2.3.2. The framework not only forms a cost-sensitive classification algorithm (CSOVO) by calling an underlying binary classification algorithm, but also guarantees that a good cost-sensitive classifier can be obtained by combining a set of decent binary classifiers. Since the framework is designed for general cost-sensitive classification rather than for ordinal ranking, arguably it does not use all the properties of ordinal ranking. For instance, it is not clear whether the framework explicitly makes ordinal comparisons between the ranks (see Section 1.2). In this chapter, we study another reduction framework that fully takes the properties of ordinal ranking into account. The framework includes both the classification and the regression perspectives of ordinal ranking. Under this framework, we will eventually show an interesting fact: ordinal ranking (with its full properties) is equivalent to binary classification.

4.1 Reduction Framework

The reduction framework was first proposed in our earlier work, which considered a more restricted cost-sensitive setup (Li and Lin 2007b).¹ The core of the framework is the following reduction method, which is composed of three stages: preprocessing, training, and prediction. Next, we introduce the stages of the reduction method and its theoretical guarantees.