
National Science Council, Executive Yuan — Research Project Midterm Progress Report

How to Make Support Vector Machines One of the Major Classification Methods (2/3)

Project type: Individual project

Project number: NSC93-2213-E-002-030-

Project period: August 1, 2004 to July 31, 2005

Host institution: Department of Computer Science and Information Engineering, National Taiwan University

Principal investigator: Chih-Jen Lin

Report type: Condensed report

Report attachments: Report on attending international conferences and published papers

Availability: This project report is publicly accessible

May 27, 2005

Summary

In the second year of the project, we continue the development of our Support Vector Machines (SVM) software LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm). Two new versions were released: version 2.7 on November 10, 2004 and version 2.8 on April Fools' Day, 2005. Our software remains the most widely used SVM software in the community.

In the first year of this project we finished some work on multi-class probability estimates. This year we analyze more general situations. This effort leads to our NIPS 2004 paper, which is included in this report.


A Generalized Bradley-Terry Model: From Group Competition to Individual Skill

Tzu-Kuo Huang, Chih-Jen Lin
Department of Computer Science, National Taiwan University
Taipei 106, Taiwan

Ruby C. Weng
Department of Statistics, National Chengchi University
Taipei 116, Taiwan

Abstract

The Bradley-Terry model for paired comparison has been popular in many areas. We propose a generalized version in which paired individual comparisons are extended to paired team comparisons. We introduce a simple algorithm with convergence proofs to solve the model and obtain individual skill. A useful application to multi-class probability estimates using error-correcting codes is demonstrated.

1 Introduction

The Bradley-Terry model [2] for paired comparisons has been broadly applied in many areas such as statistics, sports, and machine learning. It considers the model

$$P(\text{individual } i \text{ beats individual } j) = \frac{\pi_i}{\pi_i + \pi_j}, \quad (1)$$

where $\pi_i$ is the overall skill of the $i$th individual. Given $k$ individuals and $r_{ij}$ as the number of times that $i$ beats $j$, an approximate skill $p_i$ can be found by minimizing the negative log likelihood of the model (1):

$$\min_{\mathbf{p}} \; l(\mathbf{p}) = -\sum_{i<j} \left( r_{ij}\log\frac{p_i}{p_i+p_j} + r_{ji}\log\frac{p_j}{p_i+p_j} \right) \quad (2)$$
$$\text{subject to } 0 \le p_i,\ i = 1,\dots,k, \quad \sum_{i=1}^k p_i = 1.$$

Thus, from paired comparisons, we can obtain individual performance. This model dates back to [14] and has been extended to more general settings. Some reviews are, for example, [5, 6]. Problem (2) can be solved by a simple iterative procedure:

Algorithm 1
1. Start with any initial $p_j^0 > 0$, $j = 1,\dots,k$.
2. Repeat ($t = 0, 1, \dots$)
   a. Let $s = (t \bmod k) + 1$. For $j = 1,\dots,k$, define
      $$p_j^{t,n} \equiv \begin{cases} \dfrac{\sum_{i:i\ne s} r_{si}}{\sum_{i:i\ne s} \dfrac{r_{si}+r_{is}}{p_s^t + p_i^t}} & \text{if } j = s, \\[2ex] p_j^t & \text{if } j \ne s. \end{cases} \quad (3)$$
   b. Normalize $\mathbf{p}^{t,n}$ to be $\mathbf{p}^{t+1}$.
   until $\partial l(\mathbf{p}^t)/\partial p_j = 0$, $j = 1,\dots,k$ are satisfied.

This algorithm is so simple that there is no need to use sophisticated optimization techniques. If $r_{ij} > 0, \forall i, j$, Algorithm 1 globally converges to the unique minimum of (2). A systematic study of the convergence is in [9].
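To make the update (3) concrete, the following is a minimal Python sketch of Algorithm 1; the function name, data layout, and tolerance are illustrative assumptions, not part of the paper:

```python
import numpy as np

def bradley_terry(R, tol=1e-8, max_iter=100000):
    """Minimal sketch of Algorithm 1: R[i, j] = number of times i beats j."""
    k = R.shape[0]
    p = np.full(k, 1.0 / k)                      # any positive initial point
    for t in range(max_iter):
        # gradient of the negative log-likelihood (2), used as the stopping test
        grad = np.array([sum(-R[j, i] / p[j] + (R[j, i] + R[i, j]) / (p[j] + p[i])
                             for i in range(k) if i != j) for j in range(k)])
        if np.abs(grad).max() < tol:
            break
        s = t % k                                # cyclically update one component
        num = R[s].sum() - R[s, s]               # sum_{i != s} r_si
        den = sum((R[s, i] + R[i, s]) / (p[s] + p[i]) for i in range(k) if i != s)
        p[s] = num / den                         # update (3)
        p /= p.sum()                             # normalize
    return p
```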

Several machine learning works have used the Bradley-Terry model, and one application is to obtain multi-class probability estimates from pairwise coupling [8]. For any data instance $\mathbf{x}$, if $n_{ij}$ is the number of training data in the $i$th or $j$th class, and

$$r_{ij} \approx n_{ij}\, P(\mathbf{x} \text{ in class } i \mid \mathbf{x} \text{ in class } i \text{ or } j)$$

is available, solving (2) obtains the estimate of $P(\mathbf{x} \text{ in class } i)$, $i = 1,\dots,k$. [13] tried to extend this algorithm to other multi-class settings such as “one-against-the rest” or “error-correcting coding,” but did not provide a convergence proof. In Section 5.2 we show that the algorithm proposed in [13] indeed has some convergence problems.
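As an illustration only (the pairwise probability estimates and class counts below are hypothetical), such estimates can be fed to the Algorithm 1 sketch above:

```python
# Hypothetical pairwise estimates for k = 3 classes: rhat[i][j] approximates
# P(x in class i | x in class i or j); n[i][j] is the number of training data
# in classes i and j. Build R[i, j] = n_ij * rhat_ij and solve (2).
import numpy as np

rhat = np.array([[0.0, 0.7, 0.6],
                 [0.3, 0.0, 0.4],
                 [0.4, 0.6, 0.0]])
n = np.array([[0, 100, 100],
              [100, 0, 100],
              [100, 100, 0]])
R = n * rhat
p = bradley_terry(R)      # estimate of P(x in class i), i = 1, ..., k
print(p, p.sum())
```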

In this paper, we propose a generalized Bradley-Terry model where each comparison is between two disjoint subsets of subjects. Then, from the results of team competitions, we can approximate the skill of each individual. This model has many potential applications. For example, from records of tennis or badminton doubles (or singles and doubles combined), we may obtain the rank of all individuals. A useful application in machine learning is multi-class probability estimates using error-correcting codes. We then introduce a simple iterative method to solve the generalized model with a convergence proof. Experiments on multi-class probability estimates demonstrate the viability of the proposed model and algorithm. Due to space limitation, we omit all proofs in this paper.

2 Generalized Bradley-Terry Model

We propose a generalized Bradley-Terry model where, using team competition results, we can approximate individual skill levels. Consider a group of $k$ individuals $\{1,\dots,k\}$. Two disjoint subsets $I_i^+$ and $I_i^-$ form teams for games, and $r_i \ge 0$ ($r_i' \ge 0$) is the number of times that $I_i^+$ beats $I_i^-$ ($I_i^-$ beats $I_i^+$). Thus, we have $I_i \subset \{1,\dots,k\}$, $i = 1,\dots,m$, so that

$$I_i = I_i^+ \cup I_i^-, \quad I_i^+ \ne \emptyset, \quad I_i^- \ne \emptyset, \quad I_i^+ \cap I_i^- = \emptyset.$$

Under the model

$$P(I_i^+ \text{ beats } I_i^-) = \frac{\sum_{j\in I_i^+} \pi_j}{\sum_{j\in I_i^+} \pi_j + \sum_{j\in I_i^-} \pi_j} = \frac{\sum_{j\in I_i^+} \pi_j}{\sum_{j\in I_i} \pi_j},$$

we can define

$$q_i \equiv \sum_{j\in I_i} p_j, \quad q_i^+ \equiv \sum_{j\in I_i^+} p_j, \quad q_i^- \equiv \sum_{j\in I_i^-} p_j,$$

and minimize the negative log likelihood

$$\min_{\mathbf{p}} \; l(\mathbf{p}) = -\sum_{i=1}^m \left( r_i \log(q_i^+/q_i) + r_i' \log(q_i^-/q_i) \right), \quad (4)$$

under the same constraints of (2). If $I_i$, $i = 1,\dots,k(k-1)/2$ are as follows:

    I_i^+    I_i^-    r_i     r_i'
    {1}      {2}      r_12    r_21
    ...      ...      ...     ...

then (4) goes back to (2). The difficulty of solving (4) compared with solving (2) is that now $l(\mathbf{p})$ is expressed in terms of $q_i^+, q_i^-, q_i$, but the real variable is $\mathbf{p}$. The original Bradley-Terry model is a special case of other statistical models such as log-linear or generalized linear models, so methods other than Algorithm 1 (e.g., iterative scaling and iterative weighted least squares) can also be used. However, (4) is not in the form of such models and hence these methods cannot be applied. We propose the following algorithm to solve (4).

Algorithm 2
1. Start with $p_j^0 > 0$, $j = 1,\dots,k$ and corresponding $q_i^{0,+}, q_i^{0,-}, q_i^0$, $i = 1,\dots,m$.
2. Repeat ($t = 0, 1, \dots$)
   a. Let $s = (t \bmod k) + 1$. For $j = 1,\dots,k$, define
      $$p_j^{t,n} \equiv \begin{cases} \dfrac{\sum_{i: s\in I_i^+} \dfrac{r_i}{q_i^{t,+}} + \sum_{i: s\in I_i^-} \dfrac{r_i'}{q_i^{t,-}}}{\sum_{i: s\in I_i} \dfrac{r_i+r_i'}{q_i^t}}\, p_j^t & \text{if } j = s, \\[2ex] p_j^t & \text{if } j \ne s. \end{cases} \quad (5)$$
   b. Normalize $\mathbf{p}^{t,n}$ to $\mathbf{p}^{t+1}$.
   c. Update $q_i^{t,+}, q_i^{t,-}, q_i^t$ to $q_i^{t+1,+}, q_i^{t+1,-}, q_i^{t+1}$, $i = 1,\dots,m$.
   until $\partial l(\mathbf{p}^t)/\partial p_j = 0$, $j = 1,\dots,k$ are satisfied.

For the multiplicative factor in (5) to be well defined (i.e., non-zero denominator), we need Assumption 1, which will be discussed in Section 3. Eq. (5) is a simple fixed-point type update; in each iteration, only one component (i.e., $p_s^t$) is modified while the others remain the same. It is motivated from using a descent direction to strictly decrease $l(\mathbf{p})$: if $\partial l(\mathbf{p}^t)/\partial p_s \ne 0$, then

$$\frac{\partial l(\mathbf{p}^t)}{\partial p_s}\cdot (p_s^{t,n} - p_s^t) = \left( -\left(\frac{\partial l(\mathbf{p}^t)}{\partial p_s}\right)^2 p_s^t \right) \Bigg/ \left( \sum_{i:s\in I_i} \frac{r_i+r_i'}{q_i^t} \right) < 0, \quad (6)$$

where

$$\frac{\partial l(\mathbf{p})}{\partial p_s} = -\sum_{i:s\in I_i^+} \frac{r_i}{q_i^+} - \sum_{i:s\in I_i^-} \frac{r_i'}{q_i^-} + \sum_{i:s\in I_i} \frac{r_i+r_i'}{q_i}.$$

Thus, $p_s^{t,n} - p_s^t$ is a descent direction in optimization since a sufficiently small step along this direction guarantees the strict decrease of the function value. Since now we take the whole direction without searching for the step size, more effort is needed to prove the strict decrease in Lemma 1. However, (6) does hint that (5) is a reasonable update.

Lemma 1. If $p_s^t > 0$ is the index to be updated and $\partial l(\mathbf{p}^t)/\partial p_s \ne 0$, then $l(\mathbf{p}^{t+1}) < l(\mathbf{p}^t)$.

If we apply the update rule (5) to the pairwise model,

$$\frac{\sum_{i:i\ne s} \frac{r_{si}}{p_s^t}}{\sum_{i:i\ne s}\frac{r_{si}}{p_s^t+p_i^t} + \sum_{i:i\ne s}\frac{r_{is}}{p_s^t+p_i^t}}\; p_s^t = \frac{\sum_{i:i\ne s} r_{si}}{\sum_{i:i\ne s}\frac{r_{si}+r_{is}}{p_s^t+p_i^t}},$$

which is exactly the update (3) of Algorithm 1, so Algorithm 2 generalizes Algorithm 1.
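The following is a minimal Python sketch of Algorithm 2 under an assumed data layout (a list of `(I_plus, I_minus)` index sets with win counts `r` and `r_prime`); the names and tolerance are illustrative, not from the paper or LIBSVM:

```python
import numpy as np

def generalized_bradley_terry(teams, r, r_prime, k, tol=1e-8, max_iter=100000):
    """Minimal sketch of Algorithm 2.

    teams[i] = (I_plus, I_minus): disjoint sets of 0-based individual indices.
    r[i], r_prime[i]: number of times I_plus beats I_minus and vice versa.
    """
    m = len(teams)
    p = np.full(k, 1.0 / k)
    for t in range(max_iter):
        # q_i^+, q_i^-, q_i at the current p
        qp = np.array([p[list(Ip)].sum() for Ip, _ in teams])
        qm = np.array([p[list(Im)].sum() for _, Im in teams])
        q = qp + qm
        # gradient of l(p), used only as the stopping test
        grad = np.zeros(k)
        for i, (Ip, Im) in enumerate(teams):
            grad[list(Ip)] -= r[i] / qp[i]
            grad[list(Im)] -= r_prime[i] / qm[i]
            grad[list(Ip) + list(Im)] += (r[i] + r_prime[i]) / q[i]
        if np.abs(grad).max() < tol:
            break
        s = t % k
        num = sum(r[i] / qp[i] for i in range(m) if s in teams[i][0]) + \
              sum(r_prime[i] / qm[i] for i in range(m) if s in teams[i][1])
        den = sum((r[i] + r_prime[i]) / q[i]
                  for i in range(m) if s in teams[i][0] or s in teams[i][1])
        p[s] *= num / den             # update (5)
        p /= p.sum()                  # normalize
    return p
```

With every $I_i^+$ and $I_i^-$ a singleton pair, this reduces to the pairwise update (3).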


3 Convergence of Algorithm 2

Any point satisfying $\partial l(\mathbf{p})/\partial p_j = 0$, $j = 1,\dots,k$ and the constraints of (4) is a stationary point of (4)*. We will prove that Algorithm 2 converges to such a point. If it stops in a finite number of iterations, then $\partial l(\mathbf{p})/\partial p_j = 0$, $j = 1,\dots,k$, which means a stationary point of (4) is already obtained. Thus, we only need to handle the case where $\{\mathbf{p}^t\}$ is an infinite sequence. As $\{\mathbf{p}^t\}_{t=0}^{\infty}$ is in a compact (i.e., closed and bounded) set $\{\mathbf{p} \mid 0 \le p_j \le 1, \sum_{j=1}^k p_j = 1\}$, it has at least one convergent subsequence. Assume $\mathbf{p}^*$ is one such convergent point. In the following we will prove that $\partial l(\mathbf{p}^*)/\partial p_j = 0$, $j = 1,\dots,k$.

*A stationary point here means a Karush-Kuhn-Tucker (KKT) point for constrained optimization problems like (2) and (4); note that $\partial l(\mathbf{p})/\partial p_j = 0$, $j = 1,\dots,k$ implies (and is more restricted than) the KKT condition.

To prove the convergence of a fixed-point type algorithm, we need that if $p_s^* > 0$ and $\partial l(\mathbf{p}^*)/\partial p_s \ne 0$, then from $p_s^*$ we can use (5) to update it to $p_s^{*,n} \ne p_s^*$. We thus make the following assumption to ensure that $p_s^* > 0$ (see also Theorem 1).

Assumption 1. For each $j \in \{1,\dots,k\}$,
$$\bigcup_{i \in A} I_i = \{1,\dots,k\}, \quad \text{where } A = \{i \mid (I_i^+ = \{j\},\ r_i > 0) \text{ or } (I_i^- = \{j\},\ r_i' > 0)\}.$$
That is, each individual forms a winning (losing) team in some competitions which together involve all subjects.

An issue left in Section 2 is whether the multiplicative factor in (5) is well defined. With Assumption 1 and initial $p_j^0 > 0$, $j = 1,\dots,k$, one can show by induction that $p_j^t > 0, \forall t$, and hence the denominator of (5) is never zero: if $p_j^t > 0$, Assumption 1 implies that $\sum_{i:j\in I_i^+} r_i/q_i^{t,+}$ or $\sum_{i:j\in I_i^-} r_i'/q_i^{t,-}$ is positive. Thus, both numerator and denominator in the multiplicative factor are positive, and so is $p_j^{t+1}$.

If $r_{ij} > 0$, the original Bradley-Terry model satisfies Assumption 1. Whether or not the model satisfies the assumption, an easy way to fulfill it is to add the term

$$-\mu \sum_{s=1}^k \log\left(\frac{p_s}{\sum_{j=1}^k p_j}\right) \quad (7)$$

to $l(\mathbf{p})$, where $\mu$ is a small positive number. That is, for each $s$, we make an $I_i = \{1,\dots,k\}$ with $I_i^+ = \{s\}$, $r_i = \mu$, and $r_i' = 0$. As $\sum_{j=1}^k p_j = 1$ is one of the constraints, (7) reduces to $-\mu\sum_{s=1}^k \log p_s$, which is a barrier term in optimization to ensure that $p_s$ does not go to zero. The property $p_s^* > 0$ and the convergence of Algorithm 2 are given in Theorem 1:

Theorem 1. Under Assumption 1, any convergent point $\mathbf{p}^*$ of Algorithm 2 satisfies $p_s^* > 0$, $s = 1,\dots,k$ and is a stationary point of (4).
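For the data layout assumed in the Algorithm 2 sketch above, adding the barrier term (7) simply means appending one extra comparison per individual; a minimal, illustrative sketch (the helper name and the default value of mu are assumptions):

```python
def add_barrier(teams, r, r_prime, k, mu=1e-3):
    """Append, for every s, a comparison with I_i^+ = {s}, I_i^- = {1,...,k}\\{s},
    r_i = mu, r_i' = 0, which adds (7) to l(p)."""
    teams = list(teams) + [({s}, set(range(k)) - {s}) for s in range(k)]
    r = list(r) + [mu] * k
    r_prime = list(r_prime) + [0.0] * k
    return teams, r, r_prime
```

After this, Assumption 1 holds because every individual appears as a singleton winning team with $r_i = \mu > 0$ in a comparison that covers all subjects.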

4 Asymptotic Distribution of the Maximum Likelihood Estimator

For the standard Bradley-Terry model, the asymptotic distribution of the MLE (i.e., $\mathbf{p}$) has been discussed in [5]. In this section, we discuss the asymptotic distribution for the proposed estimator. To work on the real probability $\boldsymbol\pi$, we define

$$\bar q_i \equiv \sum_{j\in I_i}\pi_j, \quad \bar q_i^+ \equiv \sum_{j\in I_i^+}\pi_j, \quad \bar q_i^- \equiv \sum_{j\in I_i^-}\pi_j,$$

and consider $n_i \equiv r_i + r_i'$ as a constant. Note that $r_i \sim \mathrm{BIN}(n_i, \bar q_i^+/\bar q_i)$ is a random variable representing the number of times that $I_i^+$ beats $I_i^-$ in $n_i$ competitions. By defining, for $s, t = 1,\dots,k$,

$$\lambda_{ss} \equiv \mathrm{var}\!\left[\frac{\partial l(\boldsymbol\pi)}{\partial p_s}\right] = \sum_{i:s\in I_i^+} \frac{n_i \bar q_i^-}{\bar q_i^+ \bar q_i^2} + \sum_{i:s\in I_i^-} \frac{n_i \bar q_i^+}{\bar q_i^- \bar q_i^2},$$

$$\lambda_{st} \equiv \mathrm{cov}\!\left[\frac{\partial l(\boldsymbol\pi)}{\partial p_s}, \frac{\partial l(\boldsymbol\pi)}{\partial p_t}\right] = \sum_{i:s,t\in I_i^+} \frac{n_i \bar q_i^-}{\bar q_i^+ \bar q_i^2} - \sum_{i:(s,t)\in I_i^+\times I_i^-} \frac{n_i}{\bar q_i^2} - \sum_{i:(s,t)\in I_i^-\times I_i^+} \frac{n_i}{\bar q_i^2} + \sum_{i:s,t\in I_i^-} \frac{n_i \bar q_i^+}{\bar q_i^- \bar q_i^2}, \quad s \ne t,$$

we have the following theorem:

Theorem 2. Let $n$ be the total number of comparisons. If $r_i$ is independent of $r_j$, $\forall i \ne j$, then $\sqrt{n}(p_1 - \pi_1), \dots, \sqrt{n}(p_{k-1} - \pi_{k-1})$ have for large samples the multivariate normal distribution with zero means and dispersion matrix $[\lambda'_{st}]^{-1}$, where

$$\lambda'_{st} = \lambda_{st} - \lambda_{sk} - \lambda_{tk} + \lambda_{kk}, \quad s, t = 1,\dots,k-1.$$

5 Application to Multi-class Probability Estimates

Many classification methods are two-class based approaches and there are different ways to extend them for multi-class cases. Most existing studies focus on predicting class labels but not probability estimates. In this section, we discuss how the generalized Bradley-Terry model can be applied to multi-class probability estimates.

Error-correcting coding [7, 1] is a general method to construct binary classifiers and combine them for multi-class prediction. It suggests some ways to construct $I_i^+$ and $I_i^-$, both subsets of $\{1,\dots,k\}$. Then one trains a binary model using data from classes in $I_i^+$ ($I_i^-$) as positive (negative). Simple and commonly used methods such as “one-against-one” and “one-against-the rest” are its special cases. Given $n_i$, the number of training data with classes in $I_i = I_i^+ \cup I_i^-$, we assume here that for any data $\mathbf{x}$,

$$r_i \approx n_i\, P(\mathbf{x} \text{ in classes of } I_i^+ \mid \mathbf{x} \text{ in classes of } I_i^+ \text{ or } I_i^-) \quad (8)$$

is available, and the task is to approximate $P(\mathbf{x} \text{ in class } s)$, $s = 1,\dots,k$. In the rest of this section we discuss the special case “one-against-the rest” and the earlier results in [13].

5.1 Properties of the “One-against-the rest” Approach

For this approach, $I_i$, $i = 1,\dots,m$ are

    I_i^+    I_i^-              r_i    r_i'
    {1}      {2, ..., k}        r_1    1 - r_1
    {2}      {1, 3, ..., k}     r_2    1 - r_2
    ...      ...                ...    ...

Now $n_1 = \cdots = n_m =$ the total number of training data, so the solution to (4) is not affected by $n_i$. Thus, we remove it from (8), so $r_i + r_i' = 1$. As $\partial l(\mathbf{p})/\partial p_s = 0$ becomes

$$\frac{r_s}{p_s} + \sum_{j:j\ne s}\frac{r_j'}{1-p_j} = k,$$

we have

$$\frac{r_1}{p_1} - \frac{1-r_1}{1-p_1} = \cdots = \frac{r_k}{p_k} - \frac{1-r_k}{1-p_k} = k - \sum_{j=1}^{k}\frac{r_j'}{1-p_j} \equiv \delta,$$

where $\delta$ is a constant. These equalities provide another way to solve for $\mathbf{p}$:

$$p_s = \frac{(1+\delta) - \sqrt{(1+\delta)^2 - 4 r_s\delta}}{2\delta}.$$

The other root of this quadratic also satisfies the equalities, but it is negative when $\delta < 0$ and greater than 1 when $\delta > 0$. By solving $\sum_{s=1}^k p_s = 1$, we obtain $\delta$ and the optimal $\mathbf{p}$.

From the formula of $p_s$, if $\delta > 0$, a larger $p_s$ implies a smaller $(1+\delta)^2 - 4r_s\delta$ and hence a larger $r_s$. It is similar for $\delta < 0$. Thus, the order of $p_1,\dots,p_k$ is the same as that of $r_1,\dots,r_k$:

Theorem 3. If $r_s \ge r_t$, then $p_s \ge p_t$.
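A minimal sketch of this closed form: express each $p_s$ as a function of $\delta$ and pick $\delta$ so that the probabilities sum to one. The function name, the bracketing interval, and the use of scipy's brentq root finder are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

def one_vs_rest_probabilities(r):
    """r[s] ~ P(x in class s | class s vs. the rest), with r_s' = 1 - r_s."""
    r = np.asarray(r, dtype=float)

    def p_of_delta(delta):
        if abs(delta) < 1e-12:           # limit of the smaller root as delta -> 0
            return r.copy()
        disc = np.sqrt((1.0 + delta) ** 2 - 4.0 * r * delta)
        return ((1.0 + delta) - disc) / (2.0 * delta)

    # choose delta so that sum_s p_s(delta) = 1
    delta = brentq(lambda d: p_of_delta(d).sum() - 1.0, -1e6, 1e6)
    return p_of_delta(delta)

# e.g. one_vs_rest_probabilities([0.8, 0.3, 0.2]) returns estimates summing to 1
```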

5.2 The Approach in [13] for Error-Correcting Codes

[13] was the first attempt to address the probability estimates using general error-correcting codes. By considering the same optimization problem (4), it proposes a heuristic update rule

$$p_s^{t,n} \equiv \frac{\sum_{i:s\in I_i^+} r_i + \sum_{i:s\in I_i^-} r_i'}{\sum_{i:s\in I_i^+} \dfrac{n_i q_i^{t,+}}{q_i^t} + \sum_{i:s\in I_i^-} \dfrac{n_i q_i^{t,-}}{q_i^t}}\; p_s^t, \quad (9)$$

but does not provide a convergence proof. For a fixed-point update, we expect that at the optimum, the multiplicative factor in (9) is one. However, unlike (5), when the factor is one, (9) does not relate to $\partial l(\mathbf{p})/\partial p_s = 0$. In fact, a simple example shows that this algorithm may never converge. Taking the “one-against-the rest” approach, if we keep $\sum_{i=1}^k p_i^t = 1$ and assume $n_i = 1$, then $r_i + r_i' = 1$ and the factor in the update rule (9) is

$$\frac{r_s + \sum_{i:i\ne s} r_i'}{p_s^t + \sum_{i:i\ne s}(1 - p_i^t)} = \frac{k - 1 + 2r_s - \sum_{i=1}^k r_i}{k - 2 + 2 p_s^t}.$$

If the algorithm converges and the factor approaches one, then $p_s = (1 + 2r_s - \sum_{i=1}^k r_i)/2$, but these values may not satisfy $\sum_{s=1}^k p_s = 1$. Therefore, if in the algorithm we keep $\sum_{i=1}^k p_i^t = 1$ as [13] did, the factor may not approach one and the algorithm does not converge. More generally, if $I_i = \{1,\dots,k\}, \forall i$, the algorithm may not converge. As $q_i^t = 1$, the condition that the factor equals one can be written as a linear equation in $\mathbf{p}$. Together with $\sum_{i=1}^k p_i = 1$, there is an over-determined linear system (i.e., $k+1$ equations and $k$ variables).
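For readers who want to try this numerically, the following is a rough sketch of the factor derived above for the normalized “one-against-the rest” case (names, the simultaneous update, and inputs are illustrative assumptions; whether the factors settle at one can be checked empirically for a given $\mathbf{r}$):

```python
import numpy as np

def rule9_one_vs_rest(r, iters=1000):
    """Apply the heuristic factor of (9) with n_i = 1 and q_i^t = 1,
    keeping sum_i p_i = 1 by normalizing after each sweep."""
    r = np.asarray(r, dtype=float)
    k = len(r)
    p = np.full(k, 1.0 / k)
    for _ in range(iters):
        factors = (k - 1 + 2 * r - r.sum()) / (k - 2 + 2 * p)
        p = p * factors
        p /= p.sum()              # keep sum_i p_i = 1, as in [13]
    return p
```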

6 Experiments on Multi-class Probability Estimates

6.1 Simulated Examples

We consider the same settings in [8, 12] by defining three possible class probabilities:

(a) $p_1 = 1.5/k$, $p_j = (1-p_1)/(k-1)$, $j = 2,\dots,k$.
(b) $k_1 = k/2$ if $k$ is even, and $(k+1)/2$ if $k$ is odd; then $p_1 = 0.95 \times 1.5/k_1$, $p_i = (0.95 - p_1)/(k_1 - 1)$ for $i = 2,\dots,k_1$, and $p_i = 0.05/(k - k_1)$ for $i = k_1+1,\dots,k$.
(c) $p_1 = 0.95 \times 1.5/2$, $p_2 = 0.95 - p_1$, and $p_i = 0.05/(k-2)$, $i = 3,\dots,k$.

Classes are competitive in case (a), but only two dominate in case (c). We then generate $r_i$ by adding some noise to $q_i^+/q_i$:

$$r_i = \min\!\left(\max\!\left(\epsilon,\ \frac{q_i^+}{q_i}\,(1 + 0.1\,N(0,1))\right),\ 1-\epsilon\right).$$

Then $r_i' = 1 - r_i$. Here $\epsilon = 10^{-7}$ is used so that all $r_i, r_i'$ are positive. We consider the four encodings used in [1] to generate $I_i$:

1. “1vs1”: the pairwise approach (eq. (2)).

2. “1vsrest”: the “one-against-the rest” approach in Section 5.1.

3. “dense”: $I_i = \{1,\dots,k\}$ for all $i$. $I_i$ is randomly split into two equally-sized sets $I_i^+$ and $I_i^-$. $[10\log_2 k]$ such splits are generated†. Following [1], we repeat this procedure 100 times and select the one whose $[10\log_2 k]$ splits have the smallest distance.

4. “sparse”: $I_i^+$, $I_i^-$ are randomly drawn from $\{1,\dots,k\}$ with $E(|I_i^+|) = E(|I_i^-|) = k/4$. Then $[15\log_2 k]$ such splits are generated. Similar to “dense,” we repeat the procedure 100 times to find a good coding.

[Figure 1 plots: three panels (a), (b), (c); x-axis $\log_2 k$, y-axis test accuracy.]
Figure 1: Accuracy by the four encodings: “1vs1” (dashed line, square), “1vsrest” (solid line, cross), “dense” (dotted line, circle), “sparse” (dashdot line, asterisk)

[Figure 2 plots: three panels (a), (b), (c); x-axis $\log_2 k$, y-axis MSE.]
Figure 2: MSE by the four encodings; legend the same as Figure 1

Figure 1 shows averaged accuracy rates over 500 replicates for each of the four methods when $k = 2^2, 2^3, \dots, 2^6$. “1vs1” is good for (a) and (b), but suffers some losses in (c), where the class probabilities are highly unbalanced. [12] has observed this and proposed some remedies. “1vsrest” is quite competitive in all three scenarios. Furthermore, “dense” and “sparse” are less competitive in cases (a) and (b) when $k$ is large. Due to the large $|I_i^+|$ and $|I_i^-|$, the model is unable to single out a clear winner when probabilities are more balanced. We also analyze the (relative) mean square error (MSE) in Figure 2:

$$\mathrm{MSE} = \frac{1}{500}\sum_{j=1}^{500}\left(\sum_{i=1}^k (\hat p_i^j - p_i)^2 \Big/ \sum_{i=1}^k p_i^2\right), \quad (10)$$

where $\hat{\mathbf{p}}^j$ is the probability estimate obtained in the $j$th of the 500 replicates. Results of Figures 2(b) and 2(c) are consistent with those of the accuracy. Note that in Figure 2(a), as $\mathbf{p}$ (and $\hat{\mathbf{p}}^j$) are balanced, $\sum_{i=1}^k (\hat p_i^j - p_i)^2$ is small. Hence, all approaches have small MSE though some have poor accuracy.
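The relative MSE (10) is straightforward to compute; a small illustrative snippet (array names assumed):

```python
import numpy as np

def relative_mse(p_hat, p_true):
    """p_hat: (n_replicates, k) estimated probabilities; p_true: (k,) true values.
    Implements (10): mean over replicates of sum_i (p_hat - p)^2 / sum_i p^2."""
    err = ((p_hat - p_true) ** 2).sum(axis=1) / (p_true ** 2).sum()
    return err.mean()
```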

6.2 Experiments on Real Data

In this section we present experimental results on some real-world multi-class problems. They have been used in [12], which provides more information about data preparation. Two problem sizes, 300/500 and 800/1,000 for training/testing, are used. 20 training/testing splits are generated and the testing error rates are averaged. All data used are available at http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/data. We use the same four ways as in Section 6.1 to generate $I_i$. All of them have $|I_1| \approx \cdots \approx |I_m|$. With the property that these multi-class problems are reasonably balanced, we set $n_i = 1$ in (8).

Since there are no probability values available for these problems, we compare the accuracy by predicting the label with the largest probability estimate. The purpose here is to compare the four probability estimates, not to check the difference from existing multi-class classification techniques. We consider support vector machines (SVM) [4] with the RBF kernel as the binary classifier. An improved version [10] of [11] obtains $r_i$. Full SVM parameter selection is conducted before testing, although due to space limitation, we omit details here. The code is modified from LIBSVM [3], a library for support vector machines. The resulting accuracy is in Table 1 for the smaller and larger training/testing sets. Except “1vs1,” the other three approaches are quite competitive. These results indicate that practical problems are more similar to case (c) in Section 6.1, where few classes dominate. This observation is consistent with the findings in [12]. Moreover, “1vs1” suffers some losses when $k$ is larger (e.g., letter), the same as in Figure 1(c); so for “1vs1,” [12] proposed using a quadratic model instead of the Bradley-Terry model.

In terms of the computational time, because the number of binary problems for “dense” and “sparse” ($[10\log_2 k]$ and $[15\log_2 k]$, respectively) is larger than $k$, and each binary problem involves many classes of data (all, and one half, respectively), their training time is longer than that of “1vs1” and “1vsrest.” “Dense” is particularly time consuming. Note that though “1vs1” solves $k(k-1)/2$ binary problems, it is efficient as each binary problem involves only two classes of data.

Table 1: Average of 20 test errors (in percentage) by four encodings (lowest boldfaced)

                     300 training / 500 testing          800 training / 1,000 testing
Problem     k      1vs1    1vsrest  dense   sparse     1vs1     1vsrest  dense    sparse
dna         3      10.47   10.33    10.45   10.19      6.21     6.45     6.415    6.345
waveform    3      15.01   15.35    15.66   15.12      13.525   13.635   13.76    13.99
satimage    6      14.22   15.08    14.72   14.8       11.54    11.74    11.865   11.575
segment     7      6.24    6.69     6.62    6.19       3.295    3.605    3.52     3.25
USPS        10     11.37   10.89    10.81   11.14      7.78     7.49     7.31     7.575
MNIST       10     13.84   12.56    13.0    12.29      8.11     7.37     7.59     7.535
letter      26     39.73   35.17    33.86   33.88      21.11    19.685   20.14    19.49

In summary, we propose a generalized Bradley-Terry model which gives individual skill from group competition results. A useful application to general multi-class probability estimates is demonstrated.

References

[1] E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2001.

[2] R. A. Bradley and M. Terry. The rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39:324–345, 1952.

[3] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[4] C. Cortes and V. Vapnik. Support-vector network. Machine Learning, 20:273–297, 1995.

[5] H. A. David. The method of paired comparisons. Oxford University Press, New York, second edition, 1988.

[6] R. R. Davidson and P. H. Farquhar. A bibliography on the method of paired comparisons.

Biometrics, 32:241–252, 1976.

[7] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

[8] T. Hastie and R. Tibshirani. Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10. MIT Press, Cambridge, MA, 1998.

[9] D. R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32:386–408, 2004.

[10] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on Platt’s probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University, 2003.

[11] J. Platt. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, Cambridge, MA, 2000. MIT Press.

[12] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[13] B. Zadrozny. Reducing multiclass to binary by coupling probability estimates. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 1041–1048. MIT Press, Cambridge, MA, 2002.

[14] E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29:436–460, 1929.
