Probability Estimates for Multi-class Classification by Pairwise Coupling

Ting-Fan Wu, Chih-Jen Lin
Department of Computer Science
National Taiwan University, Taipei 106, Taiwan

Ruby C. Weng
Department of Statistics
National Chengchi University, Taipei 116, Taiwan

Abstract

Pairwise coupling is a popular multi-class classification method that combines all pairwise comparisons between classes. This paper presents two approaches for obtaining class probabilities. Both methods can be reduced to linear systems and are easy to implement. We show conceptually and experimentally that the proposed approaches are more stable than two existing popular methods: voting and the method in [3].

1 Introduction

The multi-class classification problem refers to assigning each observation to one of $k$ classes. As two-class problems are much easier to solve, many authors propose to use two-class classifiers for multi-class classification. In this paper we focus on techniques that provide a multi-class classification solution by combining all pairwise comparisons. A common way to combine pairwise comparisons is voting [6, 2]: it constructs a rule for discriminating between every pair of classes and then selects the class with the most winning two-class decisions. Though the voting procedure requires only pairwise decisions, it predicts just a class label. In many scenarios, however, probability estimates are desired. As numerous (pairwise) classifiers do provide class probabilities, several authors [12, 11, 3] have proposed probability estimates obtained by combining the pairwise class probabilities. Given the observation $x$ and the class label $y$, we assume that the estimated pairwise class probabilities $r_{ij}$ of $\mu_{ij} = p(y = i \mid y = i \text{ or } j,\ x)$ are available. Here the $r_{ij}$ are obtained by some binary classifiers. The goal is then to estimate $\{p_i\}_{i=1}^k$, where $p_i = p(y = i \mid x)$, $i = 1, \ldots, k$. We propose to obtain an approximate solution to an identity and then select the label with the highest estimated class probability. The existence of the solution is guaranteed by the theory of finite Markov Chains. Motivated by the optimization formulation of this method, we propose a second approach. Interestingly, it can also be regarded as an improved version of the coupling approach given by [12]. Both of the proposed methods reduce to solving linear systems and are simple to implement. Furthermore, from conceptual and experimental points of view, we show that the two proposed methods are more stable than voting and the method in [3].

We organize the paper as follows. In Section 2, we review two existing methods. Sections 3 and 4 detail the two proposed approaches. Section 5 presents the relationship among the four methods through their corresponding optimization formulations. In Section 6, we compare these methods using simulated and real data; the classifiers considered are support vector machines. Section 7 concludes the paper. Due to space limits, we omit all detailed proofs. A complete version of this work is available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf.

2 Review of Two Methods

Let $r_{ij}$ be the estimates of $\mu_{ij} = p_i/(p_i + p_j)$. The voting rule [6, 2] is

$$\delta_V = \arg\max_i \Big[ \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}} \Big]. \qquad (1)$$

A simple estimate of probabilities can be derived as $p^v_i = 2 \sum_{j: j \neq i} I_{\{r_{ij} > r_{ji}\}} / (k(k-1))$.

The authors of [3] suggest another method to estimate class probabilities, and they claim that the resulting classification rule can outperform $\delta_V$ in some situations. Their approach is based on the minimization of the Kullback-Leibler (KL) distance between $r_{ij}$ and $\mu_{ij}$:

$$l(p) = \sum_{i \neq j} n_{ij} r_{ij} \log(r_{ij}/\mu_{ij}), \qquad (2)$$

where $\sum_{i=1}^k p_i = 1$, $p_i > 0$, $i = 1, \ldots, k$, and $n_{ij}$ is the number of instances in class $i$ or $j$. By letting $\nabla l(p) = 0$, a nonlinear system has to be solved. [3] proposes an iterative procedure to find the minimum of (2). If $r_{ij} > 0$, $\forall i \neq j$, the existence of a unique global minimal solution to (2) has been proved in [5] and references therein. Let $p^*$ denote this point. Then the resulting classification rule is

$$\delta_{HT}(x) = \arg\max_i [p^*_i].$$

It is shown in Theorem 1 of [3] that

$$p^*_i > p^*_j \ \text{if and only if} \ \tilde{p}_i > \tilde{p}_j, \quad \text{where} \ \tilde{p}_j = \frac{2 \sum_{s: s \neq j} r_{js}}{k(k-1)}; \qquad (3)$$

that is, the $\tilde{p}_i$ are in the same order as the $p^*_i$. Therefore, the $\tilde{p}_i$ are sufficient if one only requires the classification rule. In fact, as pointed out by [3], $\tilde{p}$ can be derived as an approximate solution to the identity

$$p_i = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \Big( \frac{p_i}{p_i + p_j} \Big) = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) \mu_{ij} \qquad (4)$$

by replacing $p_i + p_j$ with $2/k$ and $\mu_{ij}$ with $r_{ij}$.
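To make these two existing estimates concrete, the following is a minimal sketch in Python with NumPy (the 3-class matrix r is a made-up example, and the function name is ours, not from the paper); it computes the voting-based estimate $p^v$ from (1) and the scores $\tilde{p}$ from (3).

import numpy as np

def voting_and_tilde(r):
    # r is a k x k array with r[i, j] estimating p_i / (p_i + p_j);
    # the diagonal is ignored and r[j, i] = 1 - r[i, j] is assumed.
    k = r.shape[0]
    wins = np.array([sum(r[i, j] > r[j, i] for j in range(k) if j != i)
                     for i in range(k)], dtype=float)
    p_v = 2.0 * wins / (k * (k - 1))                               # estimate below (1)
    p_tilde = 2.0 * (r.sum(axis=1) - np.diag(r)) / (k * (k - 1))   # equation (3)
    return p_v, p_tilde

# Hypothetical 3-class example; argmax of either vector gives the predicted label.
r = np.array([[0.0, 0.7, 0.6],
              [0.3, 0.0, 0.4],
              [0.4, 0.6, 0.0]])
p_v, p_tilde = voting_and_tilde(r)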

3 Our First Approach

Note that $\delta_{HT}$ is essentially $\arg\max_i[\tilde{p}_i]$, and $\tilde{p}$ is an approximate solution to (4). Instead of replacing $p_i + p_j$ by $2/k$, in this section we propose to solve the system

$$p_i = \sum_{j: j \neq i} \Big( \frac{p_i + p_j}{k-1} \Big) r_{ij}, \ \forall i, \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \geq 0, \ \forall i. \qquad (5)$$

Let $\bar{p}$ denote the solution to (5). Then the resulting decision rule is

$$\delta_1 = \arg\max_i [\bar{p}_i].$$

As $\delta_{HT}$ relies on $p_i + p_j \approx 2/k$, in Section 6.1 we use two examples to illustrate possible problems with this rule.


To solve (5), we rewrite it as

$$Qp = p, \quad \sum_{i=1}^k p_i = 1, \quad p_i \geq 0, \ \forall i, \quad \text{where} \quad Q_{ij} = \begin{cases} r_{ij}/(k-1) & \text{if } i \neq j, \\ \sum_{s: s \neq i} r_{is}/(k-1) & \text{if } i = j. \end{cases} \qquad (6)$$

Observe that $\sum_{i=1}^k Q_{ij} = 1$ for $j = 1, \ldots, k$ and $0 \leq Q_{ij} \leq 1$ for $i, j = 1, \ldots, k$, so $Q^T$ is the transition matrix of a finite Markov Chain. Moreover, if $r_{ij} > 0$ for all $i \neq j$, then $Q_{ij} > 0$, which implies this Markov Chain is irreducible and aperiodic. These conditions guarantee the existence of a unique stationary probability distribution, with all states positive recurrent. Hence, we have the following theorem:

Theorem 1 If $r_{ij} > 0$, $\forall i \neq j$, then (6) has a unique solution $p$ with $0 < p_i < 1$, $\forall i$.

With Theorem 1 and some further analysis, if we remove the constraint $p_i \geq 0$, $\forall i$, the linear system with $k + 1$ equations still has the same unique solution. Furthermore, if any one of the $k$ equalities in $Qp = p$ is removed, we obtain a system with $k$ variables and $k$ equations which again has the same unique solution. Thus, (6) can be solved by Gaussian elimination. On the other hand, as the stationary distribution of a Markov Chain can be derived as the limit of its $n$-step transition probabilities, we can also solve for $p$ by repeatedly multiplying any initial probability vector by $Q$ (equivalently, an initial row vector by $Q^T$).
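As a sketch of these two solution strategies (Gaussian elimination on the linear system, or the power method for the stationary distribution), the following Python/NumPy code solves (6) for a given pairwise matrix r; the function names and the iteration count are our own choices, not part of the paper.

import numpy as np

def build_Q(r):
    # Q of (6): Q_ij = r_ij / (k - 1) for i != j, Q_ii = sum_{s != i} r_is / (k - 1).
    k = r.shape[0]
    Q = r / (k - 1.0)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, Q.sum(axis=1))
    return Q

def first_approach_direct(r):
    # Replace one equation of (Q - I) p = 0 by the normalization sum(p) = 1,
    # then solve the resulting k x k system by Gaussian elimination.
    k = r.shape[0]
    A = build_Q(r) - np.eye(k)
    A[-1, :] = 1.0
    b = np.zeros(k)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

def first_approach_power(r, n_iter=1000):
    # Power method: p <- Qp converges to the stationary vector when all r_ij > 0.
    k = r.shape[0]
    Q = build_Q(r)
    p = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        p = Q @ p
        p /= p.sum()   # guards against rounding drift; Q itself preserves the sum
    return p

The decision rule $\delta_1$ is then the argmax of the returned vector.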

Now we reexamine this method to gain more insight. The following arguments show that the solution to (5) is the global minimum of a meaningful optimization problem. To begin, we express (5) as $\sum_{j: j \neq i} r_{ji} p_i - \sum_{j: j \neq i} r_{ij} p_j = 0$, $i = 1, \ldots, k$, using the property that $r_{ij} + r_{ji} = 1$, $\forall i \neq j$. Then the solution to (5) is in fact the global minimum of the following problem:

$$\min_p \ \sum_{i=1}^k \Big( \sum_{j: j \neq i} r_{ji} p_i - \sum_{j: j \neq i} r_{ij} p_j \Big)^2 \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \geq 0, \ \forall i. \qquad (7)$$

This holds because the objective function is always nonnegative and attains zero at the solution of (5) and (6).

4 Our Second Approach

Note that both approaches in Sections 2 and 3 involve solving optimization problems that use relations such as $p_i/(p_i + p_j) \approx r_{ij}$ or $\sum_{j: j \neq i} r_{ji} p_i \approx \sum_{j: j \neq i} r_{ij} p_j$. Motivated by (7), we suggest another optimization formulation:

$$\min_p \ \frac{1}{2} \sum_{i=1}^k \sum_{j: j \neq i} \big( r_{ji} p_i - r_{ij} p_j \big)^2 \quad \text{subject to} \quad \sum_{i=1}^k p_i = 1, \ p_i \geq 0, \ \forall i. \qquad (8)$$

In related work, [12] proposes to solve a linear system consisting of $\sum_{i=1}^k p_i = 1$ and any $k - 1$ equations of the form $r_{ji} p_i = r_{ij} p_j$. However, as pointed out in [11], the results of [12] strongly depend on the selection of the $k - 1$ equations. In fact, as (8) considers all terms $r_{ij} p_j - r_{ji} p_i$, not just $k - 1$ of them, it can be viewed as an improved version of [12]. Let $p^\dagger$ denote the corresponding solution. We then define the classification rule as

$$\delta_2 = \arg\max_i [p^\dagger_i].$$

Since (7) has a unique solution, which can be obtained by solving a simple linear system, it is desirable to know whether the minimization problem (8) shares these nice properties. In the rest of this section, we show that it does. The following theorem shows that the nonnegativity constraints in (8) are redundant.


Theorem 2 Problem (8) is equivalent to the problem obtained by dropping the constraints $p_i \geq 0$, $\forall i$.

Note that we can rewrite the objective function of (8) as

$$\min_p \ p^T Q p, \quad \text{where} \quad Q_{ij} = \begin{cases} \sum_{s: s \neq i} r_{si}^2 & \text{if } i = j, \\ -r_{ji} r_{ij} & \text{if } i \neq j. \end{cases} \qquad (9)$$

From here we can show that $Q$ is positive semi-definite. Therefore, without the constraints $p_i \geq 0$, $\forall i$, (9) is a linearly constrained convex quadratic programming problem. Consequently, a point $p$ is a global minimum if and only if it satisfies the KKT optimality condition: there is a scalar $b$ such that

$$\begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} p \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \qquad (10)$$

Here $e$ is the vector of all ones and $b$ corresponds to the Lagrange multiplier of the equality constraint $\sum_{i=1}^k p_i = 1$. Thus, the solution of (8) can be obtained by solving the simple linear system (10). The existence of a unique solution is guaranteed by the invertibility of the coefficient matrix of (10). Moreover, if $Q$ is positive definite (PD), this matrix is invertible. The following theorem shows that $Q$ is PD under quite general conditions.

Theorem 3 If for any $i = 1, \ldots, k$ there are $s \neq i$ and $j \neq i$ such that $r_{si}\, r_{js}\, r_{ij} \neq r_{is}\, r_{sj}\, r_{ji}$ (that is, the class ratios implied by the pairwise probabilities are inconsistent on some triple involving $i$), then $Q$ is positive definite.
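A minimal sketch of this direct method in Python/NumPy: it forms $Q$ as in (9) and solves the $(k+1) \times (k+1)$ system (10). The function name is ours, and the invertibility of the coefficient matrix is assumed to hold via Theorem 3.

import numpy as np

def second_approach(r):
    # Q of (9): Q_ii = sum_{s != i} r_si^2, Q_ij = -r_ji * r_ij for i != j
    # (diagonal of r ignored, r[j, i] = 1 - r[i, j] assumed).
    k = r.shape[0]
    Q = -(r.T * r)
    np.fill_diagonal(Q, (r ** 2).sum(axis=0) - np.diag(r) ** 2)
    # KKT system (10): [[Q, e], [e^T, 0]] [p; b] = [0; 1].
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    return np.linalg.solve(A, rhs)[:k]   # estimated p; delta_2 takes its argmax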

In addition to direct methods, next we propose a simple iterative method for solving (10):

Algorithm 1

1. Start with some initial $p_i \geq 0$, $\forall i$, satisfying $\sum_{i=1}^k p_i = 1$.

2. Repeat ($t = 1, \ldots, k, 1, \ldots$)

$$p_t \leftarrow \frac{1}{Q_{tt}} \Big[ -\sum_{j: j \neq t} Q_{tj} p_j + p^T Q p \Big] \qquad (11)$$

$$\text{normalize } p \qquad (12)$$

until (10) is satisfied.

Theorem 4 If $r_{sj} > 0$, $\forall s \neq j$, and $\{p^{(i)}\}_{i=1}^{\infty}$ is the sequence generated by Algorithm 1, then any convergent subsequence converges to a global minimum of (8).

As Theorem 3 indicates that $Q$ is in general positive definite, the sequence $\{p^{(i)}\}_{i=1}^{\infty}$ produced by Algorithm 1 usually converges globally to the unique minimum of (8).
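For completeness, here is a sketch of Algorithm 1 in Python/NumPy. It assumes $Q$ has already been formed as in (9); the stopping test checks the KKT condition (10) with $b = -p^T Q p$ (which follows from multiplying $Qp + be = 0$ on the left by $p^T$ and using $e^T p = 1$), and the tolerance and sweep limit are our own additions.

import numpy as np

def algorithm1(Q, tol=1e-10, max_sweeps=10000):
    k = Q.shape[0]
    p = np.full(k, 1.0 / k)                  # initial p_i >= 0 with sum 1
    for _ in range(max_sweeps):
        for t in range(k):
            # Update (11); the right-hand side is evaluated at the current p.
            off_diag = Q[t] @ p - Q[t, t] * p[t]
            p[t] = (-off_diag + p @ Q @ p) / Q[t, t]
            p /= p.sum()                     # normalization step (12)
        b = -(p @ Q @ p)                     # multiplier implied by (10)
        if np.max(np.abs(Q @ p + b)) < tol:  # is the KKT condition (10) satisfied?
            break
    return p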

5 Relations Among Four Methods

The four decision rules $\delta_{HT}$, $\delta_1$, $\delta_2$, and $\delta_V$ can all be written as $\arg\max_i[p_i]$, where $p$ is derived from the following four optimization formulations under the constraints $\sum_{i=1}^k p_i = 1$ and $p_i \geq 0$, $\forall i$:

$$\delta_{HT}: \ \min_p \sum_{i=1}^k \Big[ \sum_{j: j \neq i} \Big( \tfrac{1}{k} r_{ij} - \tfrac{1}{2} p_i \Big) \Big]^2, \qquad (13)$$

$$\delta_1: \ \min_p \sum_{i=1}^k \Big[ \sum_{j: j \neq i} \big( r_{ij} p_j - r_{ji} p_i \big) \Big]^2, \qquad (14)$$

$$\delta_2: \ \min_p \sum_{i=1}^k \sum_{j: j \neq i} \big( r_{ij} p_j - r_{ji} p_i \big)^2, \qquad (15)$$

$$\delta_V: \ \min_p \sum_{i=1}^k \sum_{j: j \neq i} \big( I_{\{r_{ij} > r_{ji}\}} p_j - I_{\{r_{ji} > r_{ij}\}} p_i \big)^2. \qquad (16)$$

Note that (13) can be easily verified, and that (14) and (15) have been explained in Sections 3 and 4. For (16), its solution is

$$p_i = \frac{c}{\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}}},$$

where $c$ is the normalizing constant;* therefore, $\arg\max_i[p_i]$ is the same as (1). Clearly, (13) can be obtained from (14) by letting $p_j \approx 1/k$, $\forall j$, and $r_{ji} \approx 1/2$, $\forall i, j$. Such approximations ignore the differences among the $p_i$. Similarly, (16) is obtained from (15) by taking the extreme values of $r_{ij}$: 0 or 1. As a result, (16) may enlarge the differences among the $p_i$. Next, compared with (15), (14) may tend to underestimate the differences among the $p_i$'s, because (14) allows the differences between $r_{ij} p_j$ and $r_{ji} p_i$ to cancel one another first. Thus, conceptually, (13) and (16) are more extreme: the former tends to underestimate the differences among the $p_i$'s, while the latter overestimates them. These arguments will be supported by simulated and real data in the next section.

*For $I$ to be well defined, we consider $r_{ij} \neq r_{ji}$, which is generally true. In addition, if there is an $i$ for which $\sum_{j: j \neq i} I_{\{r_{ji} > r_{ij}\}} = 0$, an optimal solution of (16) is $p_i = 1$ and $p_j = 0$, $\forall j \neq i$.

6 Experiments

6.1 Simple Simulated Examples

[3] designs a simple experiment in which all the $p_i$'s are fairly close and their method $\delta_{HT}$ outperforms the voting strategy $\delta_V$. We conduct this experiment first to assess the performance of our proposed methods. As in [3], we define class probabilities $p_1 = 1.5/k$, $p_j = (1 - p_1)/(k - 1)$, $j = 2, \ldots, k$, and then set

$$r_{ij} = \frac{p_i}{p_i + p_j} + 0.1 z_{ij} \quad \text{if } i > j, \qquad (17)$$

$$r_{ji} = 1 - r_{ij} \quad \text{if } i > j, \qquad (18)$$

where the $z_{ij}$ are standard normal variates. Since the $r_{ij}$ are required to be within $(0, 1)$, we truncate them at $\epsilon$ below and $1 - \epsilon$ above, with $\epsilon = 0.00001$. In this example, class 1 has the highest probability and hence is the correct class.
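A sketch of this data-generating process, (17)-(18), in Python/NumPy; the seed, the function name, and the choice $k = 8$ are arbitrary choices of ours.

import numpy as np

def simulate_r(k, rng, eps=1e-5):
    # Balanced setting of Section 6.1: p_1 = 1.5/k, the remaining classes equal.
    p = np.full(k, (1.0 - 1.5 / k) / (k - 1))
    p[0] = 1.5 / k
    r = np.zeros((k, k))
    for i in range(k):
        for j in range(i):                                   # i > j, as in (17)
            rij = p[i] / (p[i] + p[j]) + 0.1 * rng.standard_normal()
            rij = min(max(rij, eps), 1.0 - eps)              # truncate into (0, 1)
            r[i, j], r[j, i] = rij, 1.0 - rij                # (18)
    return p, r

rng = np.random.default_rng(0)
p_true, r = simulate_r(8, rng)   # class 1 (index 0) is the correct class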

Figure 1 shows accuracy rates for each of the four methods when $k = 3, 5, 8, 10, 12, 15, 20$. The accuracy rates are averaged over 1,000 replicates. Note that in this experiment all classes are quite competitive, so, when using $\delta_V$, sometimes the highest vote occurs at two or more different classes. We handle this problem by randomly selecting one class from the ties. This partly explains why $\delta_V$ performs poorly. Another explanation is that the $r_{ij}$ here are all close to 1/2, but (16) uses 1 or 0 instead; therefore, the solution may be severely biased. Besides $\delta_V$, the other three rules have done very well in this example.

Figure 1: Accuracy of predicting the true class by the methods $\delta_{HT}$ (solid line, cross marks), $\delta_V$ (dashed line, square marks), $\delta_1$ (dotted line, circle marks), and $\delta_2$ (dashed line, asterisk marks) from simulated class probabilities $p_i$, $i = 1, 2, \ldots, k$. Panels: (a) balanced $p_i$, (b) unbalanced $p_i$, (c) highly unbalanced $p_i$; horizontal axis: $\log_2 k$, vertical axis: accuracy rate.

Since $\delta_{HT}$ relies on the approximation $p_i + p_j \approx 2/k$, this rule may suffer some loss if the class probabilities are not highly balanced. To examine this point, we consider the following two sets of class probabilities:

(1) We let $k_1 = k/2$ if $k$ is even, and $(k+1)/2$ if $k$ is odd; then we define $p_1 = 0.95 \times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2, \ldots, k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1 + 1, \ldots, k$.

(2) If $k = 3$, we define $p_1 = 0.95 \times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_3 = 0.05$. If $k > 3$, we define $p_1 = 0.475$, $p_2 = p_3 = 0.475/2$, and $p_i = 0.05/(k - 3)$ for $i = 4, \ldots, k$.

After setting the $p_i$, we define the pairwise comparisons $r_{ij}$ as in (17)-(18). Both experiments are repeated 1,000 times. The accuracy rates are shown in Figures 1(b) and 1(c). In both scenarios, the $p_i$ are not balanced. As expected, $\delta_{HT}$ is quite sensitive to the imbalance of the $p_i$. The situation is much worse in Figure 1(c) because the approximation $p_i + p_j \approx 2/k$ is more seriously violated, especially when $k$ is large.

In summary, $\delta_1$ and $\delta_2$ are less sensitive to the distribution of the $p_i$, and their overall performance is fairly stable. All features observed here agree with our analysis in Section 5.

6.2 Real Data

In this section we present experimental results on several multi-class problems: segment, satimage, and letter from the Statlog collection [9], USPS [4], and MNIST [7]. All data sets are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. Their numbers of classes are 7, 6, 26, 10, and 10, respectively. From the thousands of instances in each data set, we select 300 as our training set and 500 as our testing set.

We consider support vector machines (SVMs) with the RBF kernel $e^{-\gamma \|x_i - x_j\|^2}$ as the binary classifier. The regularization parameter $C$ and the kernel parameter $\gamma$ are selected by cross-validation. To begin, for each training set, a five-fold cross-validation is conducted on the following points of $(C, \gamma)$: $[2^{-5}, 2^{-3}, \ldots, 2^{15}] \times [2^{-5}, 2^{-3}, \ldots, 2^{15}]$. This is done by modifying LIBSVM [1], a library for SVM. At each $(C, \gamma)$, four folds are sequentially used as the training set while the remaining fold serves as the validation set. The training on the four folds consists of $k(k-1)/2$ binary SVMs. For the binary SVM of the $i$th and the $j$th classes, using the decision values $\hat{f}$ of the training data, we employ an improved implementation [8] of Platt's posterior probabilities [10] to estimate $r_{ij}$:

$$r_{ij} = P(i \mid i \text{ or } j, \ x) = \frac{1}{1 + e^{A \hat{f} + B}}, \qquad (19)$$

where $A$ and $B$ are estimated by minimizing the negative log-likelihood function.† Then, for each validation instance, we apply the four methods to obtain classification decisions. The error over the five validation sets is the cross-validation error at $(C, \gamma)$.
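The following sketch illustrates this pairwise probability estimation step. It is not the paper's implementation: scikit-learn's SVC stands in for the modified LIBSVM, a generic optimizer replaces the implementation of [8] (in particular, the target smoothing used there is omitted), and all helper names are ours.

import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def fit_sigmoid(f, t):
    # Fit A, B of (19) by minimizing the negative log-likelihood of the 0/1
    # targets t (t = 1 marks the first class of the pair) given decision values f.
    def nll(ab):
        z = ab[0] * f + ab[1]
        return np.sum(np.logaddexp(0.0, z) - (1.0 - t) * z)
    return minimize(nll, x0=np.zeros(2), method="BFGS").x

def train_pairwise(X, y, C=1.0, gamma=1.0):
    # Train k(k-1)/2 binary RBF-kernel SVMs plus one sigmoid (19) per pair.
    classes = np.unique(y)
    models = {}
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            mask = (y == classes[a]) | (y == classes[b])
            Xp, tp = X[mask], (y[mask] == classes[a]).astype(float)
            svm = SVC(C=C, gamma=gamma, kernel="rbf").fit(Xp, tp)
            f_hat = svm.decision_function(Xp)        # decision values of training data
            models[(a, b)] = (svm, fit_sigmoid(f_hat, tp))
    return classes, models

def pairwise_r(x, classes, models):
    # Return the k x k matrix of r_ij for one test point x (a 1-D feature array).
    k = len(classes)
    r = np.zeros((k, k))
    for (a, b), (svm, (A, B)) in models.items():
        f_hat = svm.decision_function(x.reshape(1, -1))[0]
        r_ab = 1.0 / (1.0 + np.exp(A * f_hat + B))   # equation (19)
        r[a, b], r[b, a] = r_ab, 1.0 - r_ab
    return r

The resulting matrix r can then be fed to any of the four rules above to obtain a class prediction.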

After the cross-validation is done, each rule obtains its best $(C, \gamma)$.‡ Using these parameters, we train on the whole training set to obtain the final model. Next, as in (19), the decision values from the training data are employed to find $r_{ij}$. Then, the testing data are classified using each of the four rules.

Table 1: Testing errors (in percentage) by four methods. Each row reports the testing errors based on one pair of training and testing sets; the mean and std (standard deviation) are from five 5-fold cross-validation procedures used to select the best $(C, \gamma)$.

Dataset    k    δHT             δ1              δ2              δV
                mean    std     mean    std     mean    std     mean    std
satimage   6    14.080  1.306   14.600  0.938   14.760  0.784   15.400  0.219
                12.960  0.320   13.400  0.400   13.400  0.400   13.360  0.080
                14.520  0.968   14.760  1.637   13.880  0.392   14.080  0.240
                12.400  0.000   12.200  0.000   12.640  0.294   12.680  1.114
                16.160  0.294   16.400  0.379   16.120  0.299   16.160  0.344
segment    7     9.960  0.480    9.480  0.240    9.000  0.400    8.880  0.271
                 6.040  0.528    6.280  0.299    6.200  0.456    6.760  0.445
                 6.600  0.000    6.680  0.349    6.920  0.271    7.160  0.196
                 5.520  0.466    5.200  0.420    5.400  0.580    5.480  0.588
                 7.440  0.625    8.160  0.637    8.040  0.408    7.840  0.344
USPS       10   14.840  0.388   13.520  0.560   12.760  0.233   12.520  0.160
                12.080  0.560   11.440  0.625   11.600  1.081   11.440  0.991
                10.640  0.933   10.000  0.657    9.920  0.483   10.320  0.744
                12.320  0.845   11.960  1.031   11.560  0.784   11.840  1.248
                13.400  0.310   12.640  0.080   12.920  0.299   12.520  0.917
MNIST      10   17.400  0.000   16.560  0.080   15.760  0.196   15.960  0.463
                15.200  0.400   14.600  0.000   13.720  0.588   12.360  0.196
                17.320  1.608   14.280  0.560   13.400  0.657   13.760  0.794
                14.720  0.449   14.160  0.196   13.360  0.686   13.520  0.325
                12.560  0.294   12.600  0.000   13.080  0.560   12.440  0.233
letter     26   39.880  1.412   37.160  1.106   34.560  2.144   33.480  0.325
                41.640  0.463   39.400  0.769   35.920  1.389   33.440  1.061
                41.320  1.700   38.920  0.854   35.800  1.453   35.000  1.066
                35.240  1.439   32.920  1.121   29.240  1.335   27.400  1.117
                43.240  0.637   40.360  1.472   36.960  1.741   34.520  1.001

Due to the randomness of separating the training data into five folds for finding the best $(C, \gamma)$, we repeat the five-fold cross-validation five times and obtain the mean and standard deviation of the testing error. Moreover, as the selection of 300 training and 500 testing instances from a larger data set is also random, we generate five such pairs. In Table 1, each row reports the testing error based on one pair of training and testing sets. The results show that when the number of classes $k$ is small, the four methods perform similarly; however, for problems with larger $k$, $\delta_{HT}$ is less competitive. In particular, for the problem letter, which has 26 classes, $\delta_2$ or $\delta_V$ outperforms $\delta_{HT}$ by at least 5%. It seems that for the problems considered here, their characteristics are closer to the setting of Figure 1(c) than to that of Figure 1(a). All these results agree with the previous findings in Sections 5 and 6.1. Note that in Table 1 some standard deviations are zero; this means that the best $(C, \gamma)$ selected by the different cross-validation runs are all the same. Overall, the variation in parameter selection due to the randomness of cross-validation is not large.

† [10] suggests using $\hat{f}$ from the validation set instead of the training set. However, this requires a further cross-validation on the four-fold data. For simplicity, we directly use $\hat{f}$ from the training set.

‡ If more than one parameter set returns the smallest cross-validation error, we simply choose one.

7 Discussions and Conclusions

As the minimization of the KL distance is a well-known criterion, some may wonder why the performance of $\delta_{HT}$ is not quite satisfactory in some of the examples. One possible explanation is that here the KL distance is derived under the assumptions that $n_{ij} r_{ij} \sim \mathrm{Bin}(n_{ij}, \mu_{ij})$ and that the $r_{ij}$ are independent; however, as pointed out in [3], neither of the assumptions holds in the classification problem.

In conclusion, we have provided two methods which are shown to be more stable than both $\delta_{HT}$ and $\delta_V$. In addition, the two proposed approaches require only the solution of linear systems instead of the nonlinear system in [3].

The authors thank S. Sathiya Keerthi for helpful comments.

References

[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[2] J. Friedman. Another approach to polychotomous classification. Technical report, Department of Statistics, Stanford University, 1996. Available at http://www-stat.stanford.edu/reports/friedman/poly.ps.Z.

[3] T. Hastie and R. Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451-471, 1998.

[4] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, May 1994.

[5] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 2004. To appear.

[6] S. Knerr, L. Personnaz, and G. Dreyfus. Single-layer learning revisited: a stepwise procedure for building and training a neural network. In J. Fogelman, editor, Neurocomputing: Algorithms, Architectures and Applications. Springer-Verlag, 1990.

[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.

[8] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, 2003.

[9] D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Prentice Hall, Englewood Cliffs, N.J., 1994. Data available at http://www.ncc.up.pt/liacc/ML/statlog/datasets.html.

[10] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

[11] D. Price, S. Knerr, L. Personnaz, and G. Dreyfus. Pairwise neural network classifiers with probabilistic outputs. In G. Tesauro, D. Touretzky, and T. Leen, editors, Neural Information Processing Systems, volume 7, pages 1109-1116. The MIT Press, 1995.

[12] P. Refregier and F. Vallet. Probabilistic approach for multiclass classification with neural networks. In Proceedings of the International Conference on Artificial Neural Networks, pages 1003-1007, 1991.
