Ranking Individuals by Group Comparisons

Tzu-Kuo Huang r93002@csie.ntu.edu.tw

Chih-Jen Lin cjlin@csie.ntu.edu.tw

Department of Computer Science, National Taiwan University, Taipei 106, Taiwan

Ruby C. Weng chweng@nccu.edu.tw

Department of Statistics, National Chengchi University, Taipei 116, Taiwan

Editor: Greg Ridgeway

Abstract

This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models.

Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multi-class classification

1. Introduction

We address an interesting problem of estimating individuals' abilities from their group comparison results. This problem arises in some sports. One can evaluate a basketball player by his/her average points, but this criterion may be unfair as it ignores opponents' abilities. Comparison results in some sports, such as bridge, do not even reveal direct information related to individuals' abilities. In a bridge match two partnerships form a team to compete with another two. The match record fairly reflects which two partnerships are better, but a partnership's raw score, which depends on the boards played, does not indicate its ability. Finding reasonable individual rankings using all group comparison records is thus a challenging task. Another application in machine learning/statistics is multi-class classification by coding matrices (Dietterich and Bakiri, 1995; Allwein et al., 2001). This technique decomposes a multi-class problem into several two-class problems, each of which can be viewed as a comparison between two disjoint subsets of class labels. The label with the greatest ability then serves as the prediction.


This line of research stems from the study of paired comparisons (David, 1988), in which one group/team consists of only one individual, and individuals' abilities are estimated from paired comparison results. Several models have been proposed, among which the most popular one is the Bradley-Terry model (Bradley and Terry, 1952): suppose there are k individuals whose abilities are indicated by a non-negative vector $p = [p_1 \; p_2 \; \dots \; p_k]^T$.

They proposed that

$$P(\text{individual } i \text{ beats } j) = \frac{p_i}{p_i + p_j}. \qquad (1)$$

If comparisons are independent, then the maximum likelihood estimate of p is obtained by solving
$$\min_p \; -\sum_{i \neq j} n_{ij} \log \frac{p_i}{p_i + p_j} \quad \text{subject to} \quad \sum_{j=1}^k p_j = 1, \quad p_j \ge 0, \; j = 1, \dots, k, \qquad (2)$$

where $n_{ij}$ is the number of times individual i beats j. The normalizing constraint in (2) is imposed because the objective function is scale-invariant. The solution to (2) can be found via a simple iterative procedure, which converges to the unique global minimum under mild conditions. Detailed discussions are in, for example, Hunter (2004).
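As a concrete reference, here is a minimal sketch of that iterative procedure in Python/NumPy, following the minorization-maximization update described by Hunter (2004); the function name and the win-count matrix `n` are our own illustration, not code from the paper:

```python
import numpy as np

def bradley_terry(n, iters=1000, tol=1e-10):
    """Minimal sketch of the classical iterative procedure for (2).

    n[i, j] is the number of times individual i beats individual j.
    Assumes the conditions guaranteeing a unique minimum hold.
    """
    k = n.shape[0]
    p = np.full(k, 1.0 / k)
    wins = n.sum(axis=1)                 # total number of wins of each individual
    for _ in range(iters):
        p_new = np.empty(k)
        for i in range(k):
            others = np.arange(k) != i
            # MM update: wins_i / sum over j != i of (n_ij + n_ji) / (p_i + p_j)
            denom = ((n[i, others] + n[others, i]) / (p[i] + p[others])).sum()
            p_new[i] = wins[i] / denom
        p_new /= p_new.sum()             # enforce the normalizing constraint in (2)
        done = np.abs(p_new - p).max() < tol
        p = p_new
        if done:
            break
    return p
```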

Going from paired to group comparisons, we consider k individuals $\{1, \dots, k\}$ having m comparisons. The ith comparison setting involves a subset $I_i$, which is separated into two disjoint teams, $I_i^+$ and $I_i^-$. They have $n_i = n_i^+ + n_i^-$ comparisons, among which $I_i^+$ and $I_i^-$ win $n_i^+$ and $n_i^-$ times, respectively. Before seeking sophisticated models, an intuitive way to estimate the sth individual's ability is by the number of its winning comparisons normalized by the total number it is involved in:

$$\frac{\sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^-}{\sum_{i: s \in I_i} n_i}. \qquad (3)$$

In the case of paired comparisons, several authors (David, 1988; Hastie and Tibshirani, 1998) have shown that if

$$n_{si} > 0, \; n_{is} > 0, \; \text{and} \; n_{si} + n_{is} = \text{constant}, \; \forall s, i, \qquad (4)$$

then the ranking by (3) is identical to that by the solution of (2). Note that under (4), the denominator of (3) is the same over all s, so the calculation simplifies to
$$\sum_{i: i \neq s} n_{si},$$
that is, individual s's total number of wins.

Although the above property may provide some support for (3), this approach has several problems. Firstly, (4) may not hold in most applications of paired comparisons. Secondly, (3) does not consider teammates' abilities, so strong players and weak ones receive the same credit. Because of these deficiencies, we use (3) as a baseline in the experiments in Section 4 to demonstrate the need for more advanced methods. We refer to this approach as AVG.
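For reference, a minimal sketch of the AVG baseline (3); the tuple-based input format is our own convention, not the paper's:

```python
import numpy as np

def avg_ability(comparisons, k):
    """Sketch of the AVG baseline (3).

    comparisons holds (I_plus, I_minus, n_plus, n_minus) tuples, where
    I_plus/I_minus are lists of individual indices (our own convention).
    """
    wins = np.zeros(k)
    played = np.zeros(k)
    for I_plus, I_minus, n_plus, n_minus in comparisons:
        for s in I_plus:
            wins[s] += n_plus
            played[s] += n_plus + n_minus
        for s in I_minus:
            wins[s] += n_minus
            played[s] += n_plus + n_minus
    return wins / played    # estimate (3) for each individual
```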


As a direct extension of (1), Huang et al. (2006b) proposed a generalized Bradley-Terry model for group comparisons:

$$P(I_i^+ \text{ beats } I_i^-) = \frac{\sum_{j \in I_i^+} p_j}{\sum_{j \in I_i} p_j}, \qquad (5)$$

which assumes that a team’s ability is the sum of its members’. Under the assumption that comparisons are independent, individuals’ abilities can be estimated by minimizing the negative log-likelihood of (5):

$$\min_p \; -\sum_{i=1}^m \left( n_i^+ \log \frac{\sum_{j \in I_i^+} p_j}{\sum_{j \in I_i} p_j} + n_i^- \log \frac{\sum_{j \in I_i^-} p_j}{\sum_{j \in I_i} p_j} \right) \quad \text{subject to} \quad \sum_{j=1}^k p_j = 1, \quad p_j \ge 0, \; j = 1, \dots, k. \qquad (6)$$

Huang et al. (2006b) pointed out that (6) may not be a convex optimization problem, so global minima are not easy to obtain. Zadrozny (2002) made the first attempt to solve (6) by an iterative procedure, which, however, may fail to converge to a stationary point (Huang et al., 2006b). The algorithm of Huang et al. (2006b) converges to a stationary point under certain conditions. We refer to this approach as GBT.ML (Generalized Bradley-Terry Model using Maximum Likelihood).

Both models (1) and (6) consider comparisons' "binary" outcomes, that is, wins and losses. However, in many comparisons, results are also quantities reflecting opponents' performances/strengths, such as points in basketball or soccer games. Some works use these "measured" outcomes for paired comparisons; an example is Glickman (1993): instead of modeling the probability that one individual beats another, he considers the difference in two individuals' abilities as a random variable, whose realization is the difference in their two scores. Individuals' abilities are then estimated by maximizing the likelihood.

In this paper we focus on the batch setting, under which individuals' abilities are not estimated until all comparisons are finished. This setting is suitable for annual sports events, such as the Bermuda Bowl for bridge considered in Section 4, where the goal is to rank participants according to their performances in the event. However, in some applications, competitions continue to take place without a clear end and a real-time ranking is required. An example is online gaming, where players form teams to compete against one another whenever they wish and expect a real-time update of their ranking right after a game is over. Several works deal with such an online scenario. For example, Herbrich and Graepel (2007) proposed the TrueSkill™ system, which generalizes the Elo system used in chess

(Elo, 1986). The system follows a Bayesian framework and obtains real-time rankings by an online learning scheme called Gaussian density filtering (Minka, 2001). Menke and Martinez (2007) re-parameterized the Bradley-Terry model (2) as a single-layer artificial neural network (ANN) and extended it for group competitions. Individuals’ abilities are estimated by training the ANN with the delta rule, a typical online or incremental learning technique.

We advance the state of the art in two directions. On the one hand, for comparisons with binary outcomes, we propose a new exponential model in Section 2. The


main advantage over Huang et al. (2006b) is that one can estimate individuals' abilities by minimizing unconstrained convex formulations. Hence global minima are easily obtained. On the other hand, we propose in Section 3 two models for comparisons with measured outcomes, which we call scored outcomes. The induced optimization problems are also unconstrained and convex; simple solution procedures are presented. This section may be the first study on finding individuals' abilities from scored group comparisons. Section 4 ranks partnerships in real bridge matches with the proposed approaches. Properties of different methods and their relations are studied in Section 5, which helps to explain the experimental results. Section 6 demonstrates applications in multi-class classification. Section 7 concludes the work and discusses possible future directions.

Part of this work appears in a conference paper (Huang et al., 2006a).

2. Comparisons with Binary Outcomes

We denote individuals' abilities as a vector $v \in \mathbb{R}^k$, $-\infty < v_s < \infty$, $s = 1, \dots, k$. Unlike the p used in (5), v may have negative values. A team's ability is then defined as the sum of its members': for $I_i^+$ and $I_i^-$, their abilities are respectively
$$T_i^+ \equiv \sum_{s \in I_i^+} v_s \quad \text{and} \quad T_i^- \equiv \sum_{s \in I_i^-} v_s. \qquad (7)$$

We consider teams' actual performances as random variables $Y_i^+$ and $Y_i^-$, $1 \le i \le m$, and define
$$P(I_i^+ \text{ beats } I_i^-) \equiv P(Y_i^+ - Y_i^- > 0). \qquad (8)$$
The distributions of $Y_i^+$ and $Y_i^-$ are generally unknown, but a reasonable choice should place the mode (the value at which the density function is maximized) around $T_i^+$ and $T_i^-$. To derive a computationally simple form for (8), we assume that $Y_i^+$ (and similarly $Y_i^-$) has a doubly-exponential extreme value distribution with
$$P(Y_i^+ \le y) = \exp\left( -e^{-(y - T_i^+)} \right), \qquad (9)$$
whose mode is exactly $T_i^+$. Suppose $Y_i^+$ is independent of $Y_i^-$; from (8) and (9) we have
$$P(I_i^+ \text{ beats } I_i^-) = \frac{e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}}. \qquad (10)$$

The derivation is in Appendix A. One may assume other distributions (e.g., normal) in (9), but the resulting model is more complicated than (10). Such differences already occur for paired comparisons, where David (1988) gave some discussion. Thus (10) is our proposed model for binary outcomes.

For paired comparisons (i.e., each individual forms a team), (10) reduces to
$$P(\text{individual } i \text{ beats individual } j) = \frac{e^{v_i}}{e^{v_i} + e^{v_j}},$$
which is an equivalent re-parameterization (David, 1988; Hunter, 2004) of the Bradley-Terry model (1) by
$$p_i \equiv \frac{e^{v_i}}{\sum_{j=1}^k e^{v_j}}.$$


Therefore, our model (10) can also be considered a generalized Bradley-Terry model. This re-parameterization, however, does not extend to the case of group comparisons, so (10) and (5) are different. Interestingly, (10) is a conditional exponential model or a maximum entropy model (Jaynes, 1957a,b), which is commonly used in the computational linguistics community (Berger et al., 1996). Thus we can use existing properties of this type of model. Following the proposed model (10), we estimate v using available comparison results. The following two sub-sections give two approaches: one minimizes a regularized least-square formula, and the other minimizes the negative log-likelihood. Both are unconstrained convex optimization problems. Their differences are discussed in Section 5.

2.1 Regularized Least Square (Ext-B.RLS)

Recall that $n_i^+$ and $n_i^-$ are respectively the numbers of comparisons teams $I_i^+$ and $I_i^-$ win. From (10), we have
$$\frac{e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} \approx \frac{n_i^+}{n_i^+ + n_i^-}, \quad \text{and therefore} \quad e^{T_i^+ - T_i^-} = \frac{e^{T_i^+}}{e^{T_i^-}} \approx \frac{n_i^+}{n_i^-}.$$
If $n_i^+ \neq 0$ and $n_i^- \neq 0$, one can solve
$$\min_v \; \sum_{i=1}^m \left( (T_i^+ - T_i^-) - \log \frac{n_i^+}{n_i^-} \right)^2 \qquad (11)$$
to estimate the vector v of individuals' abilities. In case of $n_i^+ = 0$ or $n_i^- = 0$, a simple solution is adding a small number to all $n_i^+$ and $n_i^-$. This technique is widely used in the computational linguistics community, known as "add-one smoothing" for dealing with the zero-frequency problem. To represent (11) in a simpler form, we define a vector $d \in \mathbb{R}^m$ with
$$d_i \equiv \log \frac{n_i^+}{n_i^-},$$
and a "comparison setting matrix" $G \in \mathbb{R}^{m \times k}$ with
$$G_{ij} \equiv \begin{cases} 1 & \text{if individual } j \in I_i^+, \\ -1 & \text{if individual } j \in I_i^-, \\ 0 & \text{if individual } j \notin I_i. \end{cases} \qquad (12)$$

Take bridge in teams of four as an example. An individual stands for a partnership, so G’s jth column records the jth partnership’s team memberships in all m matches. Since a match is played by four partnerships from two teams, each row of G has two 1’s, two −1’s and k−4 0’s. Thus, G may look like

     1 1 −1 −1 0 0 0 0 1 1 0 0 −1 −1 0 0 −1 −1 0 0 0 0 1 1 .. . ... ... ... ... ... ... ...      , (13)


read as “The first match: the 1st, 2nd partnerships versus the 3rd, 4th; the second match: the 1st, 2nd versus the 5th, 6th; . . . .”
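As an illustration, a small sketch of how G could be assembled from match settings; the list-of-index-pairs input format is our own convention:

```python
import numpy as np

def build_comparison_matrix(settings, k):
    """Sketch: build the comparison setting matrix G of (12).

    settings: one (I_plus, I_minus) pair of index lists per comparison.
    In the bridge example each list holds the two partnership indices.
    """
    G = np.zeros((len(settings), k))
    for i, (I_plus, I_minus) in enumerate(settings):
        G[i, I_plus] = 1      # members of team I_i^+
        G[i, I_minus] = -1    # members of team I_i^-
    return G

# The first two rows of (13): partnerships {1,2} vs {3,4}, then {1,2} vs {5,6}
# (0-based indices here).
G = build_comparison_matrix([([0, 1], [2, 3]), ([0, 1], [4, 5])], k=8)
```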

With the help of d and G, we rewrite (11) as
$$\min_v \; (Gv - d)^T (Gv - d), \qquad (14)$$
which is equivalent to solving the following linear system:
$$G^T G v = G^T d. \qquad (15)$$

If $G^T G$ is not invertible, the linear system (15) may have multiple solutions, which lead to possibly multiple rankings. To see when $G^T G$ is invertible, we prove the following result:

Theorem 1 $G^T G$ is invertible if and only if rank(G) = k.

The proof is in Appendix B. This result shows that teams' members should change frequently across comparisons (as indicated by rank(G) = k) so that individuals' abilities are uniquely determined. To see how multiple rankings occur, consider an extreme case where several players always belong to the same team. Under the model (10), they can be merged as a single virtual player. After solving (14), their respective abilities can take any values but still remain optimal as long as the total ability is equal to the virtual player's. To handle such situations, we add a regularization term $\mu v^T v$ to (14):
$$\min_v \; (Gv - d)^T (Gv - d) + \mu v^T v,$$
where $\mu$ is a small positive number. Then a unique solution exists:
$$\left( G^T G + \mu I \right)^{-1} G^T d. \qquad (16)$$

The rationale of the regularization is that individuals are assumed to have equal abilities before any comparisons take place. We refer to this approach as Ext-B.RLS (Extreme value model for Binary outcomes using Regularized Least Square).
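A minimal sketch of Ext-B.RLS under the definitions above; the `smooth` argument applies the add-one style smoothing unconditionally, a simplification of the zero-count handling mentioned earlier:

```python
import numpy as np

def ext_b_rls(G, n_plus, n_minus, mu=1e-3, smooth=1.0):
    """Sketch of Ext-B.RLS: solve (G^T G + mu I) v = G^T d, d_i = log(n_i^+ / n_i^-)."""
    d = np.log(n_plus + smooth) - np.log(n_minus + smooth)
    k = G.shape[1]
    return np.linalg.solve(G.T @ G + mu * np.eye(k), G.T @ d)   # the solution (16)
```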

2.2 Maximum Likelihood (Ext-B.ML)

Under the assumption that comparisons are independent, the negative log-likelihood function is
$$l(v) \equiv -\sum_{i=1}^m \left( n_i^+ \log \frac{e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + n_i^- \log \frac{e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}} \right), \qquad (17)$$
and we may estimate v by
$$\arg\min_v \; l(v).$$

It is well known that the log-likelihood of a conditional exponential model is concave, and hence l(v) is convex. However, if l(v) is not strictly convex, multiple global minima may result in multiple rankings. The following theorem gives a necessary and sufficient condition for strict convexity:

Theorem 2 l(v) defined in (17) is strictly convex if and only if rank(G) = k.

The proof is in Appendix C. As discussed in Section 2.1, the condition may not hold, and a regularization term is usually added to ensure the uniqueness of the optimal solution. Here we consider a special one

$$\mu \sum_{s=1}^{k} \left( e^{v_s} + e^{-v_s} \right), \qquad (18)$$

which is strictly convex and has a unique minimum at $v_s = 0$, $s = 1, \dots, k$. Later we will see

that this function helps to derive a simple algorithm for maximizing the likelihood. The modified negative log-likelihood is as follows:

$$l(v) \equiv -\sum_{i=1}^m \left( n_i^+ \log \frac{e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + n_i^- \log \frac{e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}} \right) + \mu \sum_{s=1}^k \left( e^{v_s} + e^{-v_s} \right), \qquad (19)$$

where µ is a small positive number. We estimate individuals’ abilities by the unique global minimum

$$\arg\min_v \; l(v), \qquad (20)$$
which satisfies the optimality condition
$$\frac{\partial l(v)}{\partial v_s} = -\left( \sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^- \right) + \sum_{i: s \in I_i^+} \frac{n_i e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + \sum_{i: s \in I_i^-} \frac{n_i e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}} + \mu \left( e^{v_s} - e^{-v_s} \right) = 0, \quad s = 1, \dots, k.$$

Note that the strict convexity of (19) may not guarantee (20) to be attainable; we address this issue later in Theorem 3. Since µ is small,

$$\sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^- \approx \sum_{i: s \in I_i^+} \frac{n_i e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + \sum_{i: s \in I_i^-} \frac{n_i e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}}, \qquad (21)$$

which is a reasonable condition: the total number of observed wins of individual s is nearly the expected number under the assumed model. Meanwhile, the last term in $\partial l(v)/\partial v_s$ keeps $v_s$ from taking extreme values, and thereby brings some robustness against huge $n_i^+$ or $n_i^-$.

Standard optimization methods (e.g., gradient descent or Newton's method) can be used to find a solution of (19). For conditional exponential models, an alternative technique to maximize the likelihood is the generalized iterative scaling of Darroch and Ratcliff (1972), which generates a sequence of iterates $\{v^t\}_{t=0}^{\infty}$. The improved iterative scaling (Pietra et al., 1997) speeds up the convergence, but its update from $v^t$ to $v^{t+1}$ requires the solution of k one-variable minimization problems, which, however, usually do not have closed-form solutions. Goodman (2002) proposed the sequential conditional generalized iterative scaling, which changes only one variable at a time with a closed-form update rule. All the above techniques, however, need to be modified to solve (19) because of the regularization term (18). In the following we propose an iterative method that modifies one component of v at a time. Let $\delta \equiv [0, \dots, 0, \delta_s, 0, \dots, 0]^T$ indicate the change of the sth component. Using the


inequality $\log x \le x - 1$, $\forall x > 0$,
$$\begin{aligned} l(v+\delta) - l(v) = {} & -\left( \sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^- \right) \delta_s + \sum_{i: s \in I_i^+} n_i \log \frac{e^{T_i^+ + \delta_s} + e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}} + \sum_{i: s \in I_i^-} n_i \log \frac{e^{T_i^+} + e^{T_i^- + \delta_s}}{e^{T_i^+} + e^{T_i^-}} \\ & + \mu e^{v_s} (e^{\delta_s} - 1) + \mu e^{-v_s} (e^{-\delta_s} - 1) \\ \le {} & -\left( \sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^- \right) \delta_s + \left( \sum_{i: s \in I_i^+} \frac{n_i e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + \sum_{i: s \in I_i^-} \frac{n_i e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}} \right) (e^{\delta_s} - 1) \\ & + \mu e^{v_s} (e^{\delta_s} - 1) + \mu e^{-v_s} (e^{-\delta_s} - 1). \end{aligned} \qquad (22)$$
If $\delta_s = 0$, (22) $= 0$. We then minimize (22) to obtain the largest reduction. It is easy to see that (22) is strictly convex. Setting its derivative with respect to $\delta_s$ to zero amounts to finding the root of a second-order polynomial of $e^{\delta_s}$, so the update rule is
$$v_s \leftarrow v_s + \log \frac{B_s + \sqrt{B_s^2 + 4\mu A_s e^{-v_s}}}{2 A_s}, \qquad (23)$$
where
$$A_s \equiv \mu e^{v_s} + \sum_{i: s \in I_i^+} \frac{n_i e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}} + \sum_{i: s \in I_i^-} \frac{n_i e^{T_i^-}}{e^{T_i^+} + e^{T_i^-}}, \qquad (24)$$
$$B_s \equiv \sum_{i: s \in I_i^+} n_i^+ + \sum_{i: s \in I_i^-} n_i^-.$$

With other regularization terms, minimizing (22) may not lead to a closed-form solution for $\delta_s$. The algorithm is as follows:

Algorithm 1
1. Start with $v^0$ and obtain $T_i^{0,+}$, $T_i^{0,-}$, $i = 1, \dots, m$.
2. Repeat ($t = 0, 1, \dots$):
   (a) Let $s = (t + 1) \bmod k$. Change the sth element of $v^t$ by (23) to obtain $v^{t+1}$.
   (b) Calculate $T_i^{t+1,+}$, $T_i^{t+1,-}$, $i = 1, \dots, m$.
   until $\partial l(v^t)/\partial v_j = 0$, $j = 1, \dots, k$, are satisfied.
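A minimal sketch of Algorithm 1 in Python/NumPy; for brevity it recomputes every $T_i^+$ and $T_i^-$ before each coordinate update and replaces the exact stopping condition with an iteration budget plus a change-based tolerance:

```python
import numpy as np

def ext_b_ml(G, n_plus, n_minus, mu=1e-3, epochs=1000, tol=1e-8):
    """Sketch of Algorithm 1 (Ext-B.ML) with the update rule (23)-(24)."""
    m, k = G.shape
    n = n_plus + n_minus
    v = np.zeros(k)
    for _ in range(epochs):
        v_old = v.copy()
        for s in range(k):
            Tp = G.clip(min=0) @ v            # T_i^+: sum of v over I_i^+
            Tm = -G.clip(max=0) @ v           # T_i^-: sum of v over I_i^-
            q = np.exp(Tp) / (np.exp(Tp) + np.exp(Tm))   # P(I_i^+ beats I_i^-)
            in_p, in_m = G[:, s] == 1, G[:, s] == -1
            A = mu * np.exp(v[s]) + (n[in_p] * q[in_p]).sum() \
                + (n[in_m] * (1 - q[in_m])).sum()        # A_s of (24)
            B = n_plus[in_p].sum() + n_minus[in_m].sum() # B_s
            v[s] += np.log((B + np.sqrt(B**2 + 4 * mu * A * np.exp(-v[s]))) / (2 * A))
        if np.abs(v - v_old).max() < tol:     # proxy for the gradient condition
            break
    return v
```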

Next we address the convergence issue. As $A_s > 0$, (23) is always well-defined. A formal proof of Algorithm 1's convergence is in the following theorem:

Theorem 3 The modified negative log-likelihood l(v) defined in (19) attains a unique global minimum, and the sequence generated by Algorithm 1 converges to it.


The proof is in Appendix D. In Huang et al. (2006b), some assumptions are needed to ensure that their update rule is well-defined and convergent. In contrast, Algorithm 1 does not require any assumption, since the regularization term provides these nice properties. We refer to the approach of minimizing (19) as Ext-B.ML (Extreme value distribution model for Binary outcomes using Maximum Likelihood).

3. Comparisons with Scored Outcomes

This section proposes estimating individuals' abilities based on measured outcomes, such as points in basketball or soccer games. We still use random variables $Y_i^+$ and $Y_i^-$ for team performances, but give $n_i^+$ and $n_i^-$ different meanings: they now denote the scores of $I_i^+$ and $I_i^-$. Our idea is to view $n_i^+ - n_i^-$ as a realization of $Y_i^+ - Y_i^-$ and maximize the resulting likelihood. Note that we model the difference in scores instead of the scores themselves. We propose two approaches in the following sub-sections. One assumes normal distributions for $Y_i^+$ and $Y_i^-$, while the other assumes the same extreme value distribution (9). Individuals' abilities are estimated by maximizing the likelihood of score differences. Properties of the two approaches are investigated in Section 5.

3.1 Normal Distribution Model (NM-S.ML)

As mentioned in Section 2, using normal distributions for comparisons with binary outcomes is computationally more difficult due to a complicated form of $P(I_i^+ \text{ beats } I_i^-)$. However, for scored paired comparisons, Glickman (1993) successfully applied normal distributions. He considers individuals' performances as normally distributed random variables
$$Y_i \sim N(v_i, \sigma^2), \quad i = 1, \dots, k,$$
and views the score difference of individuals i and j as a realization of $Y_i - Y_j$. By assuming that $Y_i$ and $Y_j$ are independent for all individuals,
$$Y_i - Y_j \sim N(v_i - v_j, 2\sigma^2), \qquad (25)$$
and individuals' abilities are estimated by maximizing the likelihood. We extend this approach to group comparisons. Recall that $Y_i^+$ and $Y_i^-$ are random variables for two teams' performances. With the same assumption of independent normal distributions, we have
$$Y_i^+ \sim N(T_i^+, \sigma^2), \quad Y_i^- \sim N(T_i^-, \sigma^2), \quad \text{and} \quad Y_i^+ - Y_i^- \sim N(T_i^+ - T_i^-, 2\sigma^2).$$

Assuming comparisons are independent and defining a vector b with $b_i \equiv n_i^+ - n_i^-$, the negative log-likelihood then is
$$l(v, \sigma) = \log \sigma + \frac{1}{4\sigma^2} \sum_{i=1}^m \left( T_i^+ - T_i^- - (n_i^+ - n_i^-) \right)^2 \qquad (26)$$
$$= \log \sigma + \frac{(Gv - b)^T (Gv - b)}{4\sigma^2},$$


where G is the comparison setting matrix defined in (12). The maximum likelihood estimate of v is obtained by solving $\partial l(v, \sigma)/\partial v_s = 0 \; \forall s$, which is the following linear system:
$$G^T G v = G^T b. \qquad (27)$$
Similar to (14), (27) may have multiple solutions if $G^T G$ is not invertible. To overcome this problem, we add a regularization term and solve
$$\min_v \; l(v, \sigma) + \frac{\mu}{4\sigma^2} v^T v, \qquad (28)$$
where $\mu$ is a small positive number. The unique solution of (28) then is
$$\bar{v} \equiv \left( G^T G + \mu I \right)^{-1} G^T b. \qquad (29)$$
In addition, we also obtain an estimate of the variance by solving
$$\frac{\partial \left( l(v, \sigma) + \frac{\mu}{4\sigma^2} v^T v \right)}{\partial \sigma} = 0,$$
which leads to
$$\bar{\sigma}^2 \equiv \frac{(G\bar{v} - b)^T (G\bar{v} - b) + \mu \bar{v}^T \bar{v}}{2}.$$

We refer to this method as NM-S.ML (Normal distribution-based Model for Scored outcomes using Maximum Likelihood).
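Since both (29) and the variance estimate are closed-form, NM-S.ML takes only a few lines; a sketch:

```python
import numpy as np

def nm_s_ml(G, n_plus, n_minus, mu=1e-3):
    """Sketch of NM-S.ML: the solution (29) plus the variance estimate below it."""
    b = n_plus - n_minus                       # observed score differences
    k = G.shape[1]
    v = np.linalg.solve(G.T @ G + mu * np.eye(k), G.T @ b)
    r = G @ v - b
    sigma2 = (r @ r + mu * (v @ v)) / 2
    return v, sigma2
```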

3.2 Extreme Value Distribution Model (Ext-S.ML)

Instead of the normal distribution in (25), we now assume that $Y_i^+ - Y_i^-$ follows the extreme value distribution used for binary outcomes. Appendix A shows that
$$P(Y_i^+ - Y_i^- \le y) = \frac{e^{T_i^-}}{e^{T_i^+ - y} + e^{T_i^-}}, \qquad (30)$$
and hence the density function is
$$f_{Y_i^+ - Y_i^-}(y) = \frac{e^{T_i^- + T_i^+ - y}}{\left( e^{T_i^+ - y} + e^{T_i^-} \right)^2}.$$
The negative log-likelihood function is
$$-\sum_{i=1}^m \log \frac{e^{T_i^+ + T_i^- - (n_i^+ - n_i^-)}}{\left( e^{T_i^+ - (n_i^+ - n_i^-)} + e^{T_i^-} \right)^2}. \qquad (31)$$
A proof similar to Theorem 2's shows that (31) is convex and shares the same condition for strict convexity as in Section 2.2. Therefore, the problem of multiple solutions may also occur. We thus adopt the same regularization term as in Section 2.2 and solve
$$\min_v \; l(v) \equiv -\sum_{i=1}^m \log \frac{e^{T_i^+ + T_i^- - (n_i^+ - n_i^-)}}{\left( e^{T_i^+ - (n_i^+ - n_i^-)} + e^{T_i^-} \right)^2} + \mu \sum_{s=1}^k \left( e^{v_s} + e^{-v_s} \right). \qquad (32)$$


The unique global minimum satisfies, for $s = 1, \dots, k$,
$$\frac{\partial l(v)}{\partial v_s} = -m_s + 2 \left( \sum_{i: s \in I_i^+} \frac{e^{T_i^+ + n_i^-}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} + \sum_{i: s \in I_i^-} \frac{e^{T_i^- + n_i^+}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} \right) + \mu \left( e^{v_s} - e^{-v_s} \right) = 0, \qquad (33)$$
where
$$m_s \equiv \sum_{i: s \in I_i} 1.$$
From (30),
$$P\left( Y_i^+ - Y_i^- \ge T_i^+ - T_i^- \right) = \frac{1}{2}, \quad i = 1, \dots, m. \qquad (34)$$
Since $\mu$ is small, (33) and (34) imply that for $s = 1, \dots, k$,
$$\sum_{i: s \in I_i^+} P\left( Y_i^+ - Y_i^- \ge n_i^+ - n_i^- \right) + \sum_{i: s \in I_i^-} P\left( Y_i^- - Y_i^+ \ge n_i^- - n_i^+ \right) \approx \frac{m_s}{2} = \sum_{i: s \in I_i^+} P\left( Y_i^+ - Y_i^- \ge T_i^+ - T_i^- \right) + \sum_{i: s \in I_i^-} P\left( Y_i^- - Y_i^+ \ge T_i^- - T_i^+ \right).$$
As with (21) in Section 2.2, the above condition also indicates that the model should be consistent with the observations. To solve (32), we use Algorithm 1 with a different update rule, which is in the form of (23) but with
$$A_s \equiv \mu e^{v_s} + 2 \left( \sum_{i: s \in I_i^+} \frac{e^{T_i^+ + n_i^-}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} + \sum_{i: s \in I_i^-} \frac{e^{T_i^- + n_i^+}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} \right), \quad B_s \equiv m_s.$$
The derivation is similar to that of (23): let $\delta \equiv [0, \dots, 0, \delta_s, 0, \dots, 0]^T$. Then
$$\begin{aligned} l(v+\delta) - l(v) = {} & -m_s \delta_s + 2 \left( \sum_{i: s \in I_i^+} \log \frac{e^{T_i^+ + n_i^- + \delta_s} + e^{T_i^- + n_i^+}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} + \sum_{i: s \in I_i^-} \log \frac{e^{T_i^- + n_i^+ + \delta_s} + e^{T_i^+ + n_i^-}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} \right) \\ & + \mu e^{v_s} (e^{\delta_s} - 1) + \mu e^{-v_s} (e^{-\delta_s} - 1) \\ \le {} & -m_s \delta_s + 2 \left( \sum_{i: s \in I_i^+} \frac{e^{T_i^+ + n_i^-}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} + \sum_{i: s \in I_i^-} \frac{e^{T_i^- + n_i^+}}{e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}} \right) (e^{\delta_s} - 1) \\ & + \mu e^{v_s} (e^{\delta_s} - 1) + \mu e^{-v_s} (e^{-\delta_s} - 1). \end{aligned} \qquad (35)$$

Minimizing (35) leads to the update rule. Global convergence can be proved in a similar way to Theorem 3. We refer to this approach as Ext-S.ML (Extreme value distribution model for Scored outcomes using Maximum Likelihood).
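For illustration, a sketch of one coordinate update for Ext-S.ML with these $A_s$ and $B_s$; it can replace the update inside the loop of the Ext-B.ML sketch above:

```python
import numpy as np

def ext_s_ml_update(v, s, G, n_plus, n_minus, mu=1e-3):
    """Sketch: one update of v[s] by rule (23) with the A_s, B_s of Ext-S.ML."""
    Tp = G.clip(min=0) @ v
    Tm = -G.clip(max=0) @ v
    # q_i = e^{T_i^+ + n_i^-} / (e^{T_i^+ + n_i^-} + e^{T_i^- + n_i^+}), via a sigmoid
    q = 1.0 / (1.0 + np.exp(-((Tp + n_minus) - (Tm + n_plus))))
    in_p, in_m = G[:, s] == 1, G[:, s] == -1
    A = mu * np.exp(v[s]) + 2 * (q[in_p].sum() + (1 - q[in_m]).sum())
    B = in_p.sum() + in_m.sum()               # B_s = m_s, comparisons involving s
    v[s] += np.log((B + np.sqrt(B**2 + 4 * mu * A * np.exp(-v[s]))) / (2 * A))
    return v
```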


Figure 1: A typical bridge match setting. N, S, E and W stand for north, south, east, and west, respectively. Players A1-A4 of Team A and B1-B4 of Team B sit at the two tables.

4. Ranking Partnerships from Real Bridge Records

This section presents a real application: ranking partnerships from the match records of Bermuda Bowl 2005,¹ the most prestigious bridge event. In a match, two partnerships (four players) from a team compete with two from another team. The rules require mutual understanding within a partnership, so partnerships are typically fixed, while a team can send different partnerships to different matches. To rank partnerships using our model, an individual stands for a partnership, and every $T_i^+$ (or $T_i^-$) consists of two individuals. We caution the use of the term "team" here. Earlier we referred to each $T_i^+$ as a team, and in bridge the two partnerships (or four players) of $T_i^+$ are indeed called a team. However, these four players are from a (super-)team (usually a country), which often has six members. We use "team" in both situations, which are easily distinguishable.

4.1 Experimental Settings

We discuss why a partnership's ability is not directly available from match results, and explain why our model is applicable here. Figure 1 illustrates the match setting. $A_1, A_2, A_3, A_4$ and $B_1, B_2, B_3, B_4$ are the four players of Team A and of Team B, sitting at two tables as depicted. A match consists of several boards, each of which is played at both tables. An important feature is that a board's four hands are at identical positions at the two tables, but a team's two partnerships sit at complementary positions. In Figure 1, $A_1$ and $A_2$ sit at the north (N) and the south (S) sides of one table, so $A_3$ and $A_4$ must sit at the east (E) and the west (W) sides of the other table. This setting reduces the effect of uneven hands.

On each board winning partnerships receive raw scores. Depending on the difference in the two teams' total scores, the winning team gains International Match Points (IMPs). For example, Table 1 shows the records of the first ten boards of the match between two Indian partnerships and two Portuguese partnerships. We can see that a larger difference in raw scores results in more IMPs for the winner. IMPs are then converted to Victory Points (VPs) for the team ranking.² A quick look at Table 1 may motivate the following straightforward approach: a partnership's score in a match is the sum of its raw scores over all boards, and its ability is the average over the matches it plays. However, this estimate is unfair due to raw scores' dependency on boards and opponents.

1. All match records are available at http://www.worldbridge.org/tourn/Estoril.05/Estoril.htm. The subset used here is available at http://www.csie.ntu.edu.tw/~cjlin/papers/genBTexp/Data.zip.
2. The IMP-to-VP conversion for Bermuda Bowl 2005 is on page 32, http://www.worldbridge.org/


Board | Table I | Table II | IMPs
----- | ------- | -------- | ----
1     | 1510    | 1510     | -
2     | 100     | 650      | 11
3     | 630     | 630      | -
4     | 650     | 660      | -
5     | 690     | 690      | -
6     | 420     | 50       | 10
7     | 140     | 600      | 10
8     | 420     | 100      | 8
9     | 460     | 400      | 2
10    | 110     | 140      | 1

Table 1: Records of the first ten boards between India (IN) and Portugal (PT). India: NS at Table I and EW at Table II. The middle columns give each board's raw scores; only the winning side at each table gets points (the original table separates the NS and EW scores at each table). For example, on the second board IN's NS partnership won at Table I and got 100 points, while PT's NS got 650 at Table II. Since PT got more points than IN, it obtained the IMPs.

Summing a partnership's raw scores favors those who get better hands or play against weak opponents. Moreover, since boards differ across rounds and partnerships play in different rounds, the sum of raw scores can be even more unfair. The above analysis indicates that a partnership's ability cannot be obtained directly from group comparison results. Hence the proposed models can be helpful.

We consider the qualifying games: 22 teams from all over the world had a round-robin tournament, which consisted of $\binom{22}{2} = 231$ matches, and each team played 21. Most teams had six players in three fixed partnerships, and there were 69 partnerships in total. In order to obtain reasonable rankings, each partnership should play enough matches. The last column of Table 3 shows each partnership's number of matches. Most played 13 to 15 matches, which is close to the average (14 = 21 × 2/3) for a team with three fixed partnerships. Thus these match records are reasonable for further analysis.

To use our model, note that the comparison setting matrix G defined in (12) is of size 231 × 69; as shown in (13), each row records a match setting and has exactly two 1's (two partnerships from one team), two −1's (two partnerships from the other team) and 65 0's (the remaining partnerships). The sum of two rival teams' scores (VPs) is generally 30, but occasionally between 25 and 30, as a team's maximal VP is 25. We use the two rival teams' VPs as $n_i^+$ and $n_i^-$, respectively. Several matches have zero scores; we add one to all $n_i^+$ and $n_i^-$ for Ext-B.RLS to avoid the numerical difficulties caused by $\log(n_i^+/n_i^-)$.

4.2 Evaluation and Results

In sport events, rankings serve two main purposes. On the one hand, they summarize the relative performances of players or teams based on outcomes in the event, so that people may easily distinguish outstanding ones from poor ones. On the other hand, rankings in


past events may indicate the outcomes of future events, and can therefore become a basis for designing future event schedules. Interestingly, we may connect these two purposes with two basic concepts in machine learning: minimizing the empirical error and minimizing the generalization error. For the first purpose, a good ranking must be consistent with available outcomes of the event, which relates to minimizing errors on training data, while for the second, a good ranking must predict well on the outcomes of future events, which is about minimizing errors on unseen data. We thus adopt these two principles to evaluate the proposed approaches, and in the context of bridge matches, they translate into the following evaluation criteria:

• Empirical Performance: How well do the estimated abilities and rankings fit the available match records?

• Generalization Performance: How well do the estimated abilities and rankings predict the outcomes of unseen matches?

Here we distinguish individuals' abilities from their ranking: abilities give a ranking, but not vice versa. When we only have a ranking of individuals, groups' strengths are not directly available, since the relation of individuals' ranks to those of groups is unclear. In contrast, if individuals' abilities are available, a group's ability can be the sum of its members'. We thus propose different error measures for abilities and for rankings. Let $\{(I_1^+, I_1^-, n_1^+, n_1^-), \dots, (I_m^+, I_m^-, n_m^+, n_m^-)\}$ be the group comparisons of interest and their outcomes. For a vector $v \in \mathbb{R}^k$ of individuals' abilities, we define the

• Group Comparison Error:
$$\mathrm{GCE}(v) \equiv \frac{1}{m} \sum_{i=1}^m I\left\{ (n_i^+ - n_i^-)(T_i^+ - T_i^-) \le 0 \right\},$$
where $I\{\cdot\}$ is the indicator function, and $T_i^+$ and $T_i^-$ are the predicted group abilities of $I_i^+$ and $I_i^-$, as defined in (7). The GCE is essentially the proportion of the m comparisons wrongly predicted by the ability vector v.
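A direct sketch of this measure; since each row of G has 1's on $I_i^+$ and −1's on $I_i^-$, the product $Gv$ gives all the $T_i^+ - T_i^-$ at once:

```python
import numpy as np

def gce(v, G, n_plus, n_minus):
    """Sketch of the Group Comparison Error for an ability vector v."""
    T_diff = G @ v                                    # T_i^+ - T_i^- for every comparison
    return np.mean((n_plus - n_minus) * T_diff <= 0)  # fraction of wrong predictions
```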

In the error measure for rankings, we use r, a permutation of the k individuals, to denote a ranking, where $r_s$ is the rank of individual s. Then we define the

• Group Rank Error:
$$\mathrm{GRE}(r) \equiv \frac{\displaystyle\sum_{i=1}^m \left( I\left\{ n_i^+ > n_i^- \text{ and } U_i^+ > L_i^- \right\} + I\left\{ n_i^+ < n_i^- \text{ and } L_i^+ < U_i^- \right\} \right)}{\displaystyle\sum_{i=1}^m \left( I\left\{ U_i^+ > L_i^- \right\} + I\left\{ L_i^+ < U_i^- \right\} \right)}, \qquad (36)$$
in which
$$U_i^+ \equiv \min_{j \in I_i^+} r_j, \quad L_i^+ \equiv \max_{j \in I_i^+} r_j, \quad U_i^- \equiv \min_{j \in I_i^-} r_j, \quad L_i^- \equiv \max_{j \in I_i^-} r_j.$$


Figure 2: Empirical performances of the six approaches (Ext-B.RLS, Ext-B.ML, GBT.ML, NM-S.ML, Ext-S.ML, AVG). (a) Group Comparison Error; (b) Group Rank Error. The y-axis is the error rate.

Figure 3: Generalization performances of the six approaches, averaged over 50 random testing sets. (a) Group Comparison Error; (b) Group Rank Error. The y-axis is the error rate; vertical bars indicate standard deviations.

Since a smaller rank indicates more strength, the $U_i^+$ and $L_i^+$ defined above represent the best and the weakest in $I_i^+$, respectively. The denominator in (36) is thus the number of comparisons where one group's members are all ranked higher (or lower) than the members of the competing group, and the numerator in (36) counts the number of wrong predictions, that is, comparisons in which members of the winning group are all ranked lower than those of the defeated group. In other words, GRE computes the error only on comparisons in which relative strengths of the participating groups can be clearly determined by their members' ranks, whereas GCE considers the error on all of the comparisons. From this point of view, GRE is a more conservative error measure.
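A sketch of GRE following (36); the comparison tuples are our own input convention:

```python
def gre(r, comparisons):
    """Sketch of the Group Rank Error (36).

    r[s] is the rank of individual s (smaller = stronger);
    comparisons holds (I_plus, I_minus, n_plus, n_minus) tuples.
    """
    wrong = decided = 0
    for I_plus, I_minus, n_plus, n_minus in comparisons:
        U_p, L_p = min(r[j] for j in I_plus), max(r[j] for j in I_plus)
        U_m, L_m = min(r[j] for j in I_minus), max(r[j] for j in I_minus)
        if U_p > L_m:                  # every member of I^+ ranked below all of I^-
            decided += 1
            wrong += n_plus > n_minus  # ...yet I^+ won
        if L_p < U_m:                  # every member of I^+ ranked above all of I^-
            decided += 1
            wrong += n_plus < n_minus  # ...yet I^+ lost
    return wrong / decided
```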

Combining the two error measures with the two evaluation criteria, we conducted four sets of experiments: Empirical GCE, Empirical GRE, Generalization GCE, and Generalization GRE. We compared six approaches, including the newly proposed Ext-B.RLS, Ext-B.ML, NM-S.ML, and Ext-S.ML;


    | Ext-B.RLS | Ext-B.ML | GBT.ML | NM-S.ML | Ext-S.ML | AVG
GRE | 10/53     | 6/51     | 9/57   | 6/52    | 8/66     | 35/132

Table 2: Empirical Group Rank Errors in fractions.

the generalized Bradley-Terry model GBT.ML (Huang et al., 2006b); and AVG, the simple approach (3) of summing individuals' scores, which serves as a baseline.³ In the empirical part, we applied each approach to the entire 231

matches to estimate partnerships' abilities, and computed the two errors. Since the goal in the empirical part is to fit the available records well, we set the regularization parameter µ for all approaches⁴ except AVG to a small value, $10^{-3}$. In the generalization part, we randomly split the entire set into a training set of 162 matches and a testing set of 69 matches, 50 times. For each split, we searched for µ in $\{2^5, 2^4, \dots, 2^{-8}, 2^{-9}\}$ by Leave-One-Out (LOO) validation on the training set, estimated partnerships' abilities with the best µ, and then computed GCE and GRE on the testing set.

Results are in Figures 2 and 3 for empirical and generalization performances, respectively. In the empirical part, the four proposed approaches and GBT.ML perform clearly better than AVG, and the improvement in GRE is very significant. In particular, Ext-B.ML, NM-S.ML, and Ext-S.ML give very small GREs, of the order of $10^{-1}$. These results show that the proposed approaches are effective in fitting the available bridge match records. However, in the generalization part, all of the approaches result in poor GCEs, nearly as large as a random predictor's, and the proposed approaches do not improve over AVG. For GREs, values are smaller, but the improvements over AVG are rather marginal. In the following we give some account of the poor generalization performances. As mentioned in Section 4.1, each match setting can be viewed as a vector in $\{1, 0, -1\}^{69}$, in which only two dimensions have 1's and another two have −1's. Moreover, we are using records from the qualifying stage, a round-robin tournament in which every two teams (countries) played exactly one match. Consequently, when a match is removed from the training set, the four competing partnerships of that match have no chance to meet directly during the training stage. Indirect comparisons may only be marginally useful in predicting those partnerships' competition outcomes due to the lack of transitivity. In conclusion, the outcome of a match in this bridge data set may not be well indicated by the outcomes of the other matches, and therefore all of the approaches fail to generalize well.

To further study the rankings by the six approaches, we show in Table 2 their empirical GREs. Since GRE only looks at the subset of matches in which group members' ranks clearly decide groups' relative strengths, the size of this subset, that is, the denominator in GRE, may also be a performance indicator of each approach. We thus present GREs as fractions. It is clear that the ranking by AVG is able to determine the outcomes of more matches, but at the same time causes more errors. Similar results are also found in the generalization experiments.

3. AVG gives individuals’ abilities. We then use the same summation assumption to obtain groups’ abilities for computing GCEs.

4. In order to ensure the convergence of their algorithm, Huang et al. (2006b) added to the objective function (5) what they called a “barrier term,” which is also controlled by a small positive number µ (See Eq. (14) in Huang et al. 2006b). Here we simply refer to it as a regularization parameter.


Figure 4: Average LOO time (sec) of the six approaches over the 50 training/testing splits, plotted against log₂ µ (from 5 down to −9). Vertical bars indicate standard deviations.

We may therefore say that the proposed approaches and GBT.ML lead to rankings with more "precision," in the sense that they may not be able to decide groups' relative performances in the majority of comparisons, but once they do, their decisions are accurate.

In addition to the efficacy of the six approaches, we also report their efficiency. Figure 4 shows the average LOO time over the 50 training/testing splits under different values of µ. We obtained these timing results on an Intel® Core™2 Quad CPU (2.66 GHz) machine with 8 GB of main memory; the linear systems of Ext-B.RLS and NM-S.ML were solved by Gaussian elimination. AVG,⁵ Ext-B.RLS, and NM-S.ML finished LOO almost instantly under all values of µ, while Ext-B.ML, GBT.ML, and Ext-S.ML, the three approaches using iterative algorithms, took more time as µ decreased. However, for large-scale problems with a huge k or m, traditional linear system solvers may encounter memory or computational difficulties, and the efficiency of the proposed approaches requires a more thorough study.

Finally, we list the top ten partnerships ranked by Ext-B.ML in Appendix F. Most of them are famous bridge players.

5. Properties of Different Approaches

Although we distinguish binary comparisons from scored ones, they are similar in some situations. On the one hand, if two teams had a series of comparisons, the number of victories can be viewed as a team's score in a super-game. On the other hand, scores in a game might be the sum of binary outcomes; for example, scores in soccer games are total numbers of successful shots. It is therefore interesting to study the properties of different methods and their relations. Table 3 lists the partnership rankings obtained by applying the six approaches to the entire set of match records.

5. Apparently there is no need to run LOO for AVG, which is independent of µ; we do it here only for timing comparisons.


Team (ordered by team rankings) | Ext-B.RLS | Ext-B.ML | GBT.ML | NM-S.ML | Ext-S.ML | AVG | #match
------------------------------- | --------- | -------- | ------ | ------- | -------- | --- | ------
Italy (IT)                      | 14 18 11  | 7 14 21  | 4 12 19 | 6 18 22 | 7 4 40   | 5 4 11 | 15 14 13
U.S.A.2 (US2)                   | 57 67 1   | 53 65 1  | 39 67 1 | 53 65 1 | 54 50 1  | 42 25 2 | 8 17 17
U.S.A.1 (US1)                   | 8 27 37   | 11 17 38 | 11 13 38 | 11 14 38 | 35 9 16 | 23 6 10 | 18 10 14
Sweden (SE)                     | 2 43 50   | 2 23 55  | 2 10 65 | 2 23 56 | 5 8 47   | 1 14 38 | 14 13 15
India (IN)                      | 10 35 39  | 9 29 41  | 9 28 40 | 9 28 41 | 12 28 37 | 19 12 15 | 15 14 13
Argentina (AR)                  | 29 25 28  | 27 20 30 | 25 23 34 | 26 19 30 | 41 10 52 | 16 18 26 | 15 14 13
Egypt (EG)                      | 47 23 24 49 | 51 18 22 52 | 51 22 15 50 | 51 17 21 52 | 51 18 17 44 | 37 20 3 8 | 14 20 7 1
Brazil (BR)                     | 31 4 66   | 28 8 59  | 24 8 63 | 29 7 58 | 26 57 11 | 28 13 31 | 11 18 13
Japan (JP)                      | 5 65 38   | 3 67 39  | 3 68 27 | 3 68 40 | 3 68 15  | 7 44 46 | 14 14 14
Netherlands (NL)                | 16 52 17  | 32 43 31 | 30 45 33 | 32 43 31 | 30 34 49 | 36 32 24 | 15 15 12
China (CN)                      | 51 48 7   | 45 44 6  | 47 46 7 | 45 44 8 | 43 31 21 | 30 52 9 | 13 14 15
South Africa (ZA)               | 45 30 20  | 49 26 15 | 52 26 20 | 50 24 15 | 55 29 13 | 49 35 27 | 15 13 14
Russia (RU)                     | 34 21 42  | 35 16 46 | 36 16 49 | 36 16 47 | 48 6 23  | 39 21 53 | 14 14 14
Portugal (PT)                   | 22 12 58  | 34 10 56 | 29 14 60 | 37 10 55 | 33 22 56 | 50 29 47 | 14 14 14
Australia (AU)                  | 40 55 19  | 42 50 19 | 43 53 21 | 42 49 20 | 20 45 32 | 43 51 40 | 16 11 15
New Zealand (NZ)                | 68 41 3   | 68 48 5  | 66 42 5 | 66 48 5 | 66 58 2  | 64 41 17 | 9 16 17
England (UK)                    | 9 33 61   | 12 36 64 | 17 32 64 | 13 35 63 | 36 25 64 | 48 22 55 | 17 12 13
Canada (CA)                     | 13 36 56  | 13 40 58 | 18 35 62 | 12 39 57 | 19 24 67 | 34 45 62 | 14 16 12
Chinese Taipei (TW)             | 53 62 46 6 26 59 | 63 66 57 4 25 54 | 56 61 55 6 37 54 | 64 67 59 4 25 54 | 59 65 60 14 27 53 | 57 56 66 33 63 61 | 2 12 1 4 7 16
Poland (PL)                     | 15 54 60  | 24 47 60 | 31 48 59 | 27 46 60 | 39 38 61 | 58 54 60 | 12 15 15
Guadeloupe (GP)                 | 44 32 69  | 37 33 69 | 44 41 69 | 34 33 69 | 42 46 69 | 65 59 69 | 14 14 14
Jordan (JO)                     | 63 64     | 62 61    | 57 58   | 61 62   | 62 63    | 67 68   | 21 21

Table 3: Partnerships' rankings. A partnership corresponds to the same position within each column; for example, Italy's second partnership is ranked 18th, 14th, 12th, 18th, 4th and 4th by Ext-B.RLS, Ext-B.ML, GBT.ML, NM-S.ML, Ext-S.ML and AVG, respectively, and it plays 14 matches. Egypt fields four partnerships, Chinese Taipei six, and Jordan two. Rankings satisfying (38) and (39) are underlined and boldfaced, respectively.

We first investigate the similarity between these rankings by Kendall's tau, a standard correlation coefficient that quantifies the consistency between two rankings. We computed Kendall's tau for every pair of the six rankings and present the values in Table 4, which indicates roughly three groups: Ext-B.RLS, Ext-B.ML, GBT.ML and NM-S.ML give similar rankings; the one by Ext-S.ML is quite different, while AVG seems to be uncorrelated with the others. We then measure the distance between two groups of rankings $g_1$ and $g_2$: for each partnership,
$$d(\text{ranks by } g_1, \text{ranks by } g_2) \equiv \begin{cases} \min(\text{ranks by } g_2) - \max(\text{ranks by } g_1) & \text{if ranks by } g_1 \text{ are all smaller}, \\ \min(\text{ranks by } g_1) - \max(\text{ranks by } g_2) & \text{if ranks by } g_2 \text{ are all smaller}, \\ 0 & \text{otherwise}. \end{cases} \qquad (37)$$


          | Ext-B.RLS | Ext-B.ML | GBT.ML | NM-S.ML | Ext-S.ML | AVG
Ext-B.RLS | 1.00      | 0.84     | 0.79   | 0.82    | 0.50     | 0.44
Ext-B.ML  | 0.84      | 1.00     | 0.87   | 0.97    | 0.61     | 0.49
GBT.ML    | 0.79      | 0.87     | 1.00   | 0.86    | 0.62     | 0.53
NM-S.ML   | 0.82      | 0.97     | 0.86   | 1.00    | 0.60     | 0.49
Ext-S.ML  | 0.50      | 0.61     | 0.62   | 0.60    | 1.00     | 0.50
AVG       | 0.44      | 0.49     | 0.53   | 0.49    | 0.50     | 1.00

Table 4: Kendall's tau (correlation coefficients).

For example, from Table 3 the second partnership of U.S.A.2 is ranked 67th/65th/67th/65th by Ext-B.RLS/Ext-B.ML/GBT.ML/NM-S.ML and 25th by AVG. Therefore,
$$d(\{67, 65, 67, 65\}, 25) = \min(67, 65, 67, 65) - 25 = 40.$$
Checking all 69 partnerships' ranks gives
$$\left| d(\{\text{Ext-B.RLS}, \text{Ext-B.ML}, \text{GBT.ML}, \text{NM-S.ML}\}, \text{Ext-S.ML}) \ge 20 \right| = 6, \qquad (38)$$
$$\left| d(\{\text{Ext-B.RLS}, \text{Ext-B.ML}, \text{GBT.ML}, \text{NM-S.ML}\}, \text{AVG}) \ge 20 \right| = 11. \qquad (39)$$
In Table 3 we respectively underline and boldface partnerships satisfying (38) and (39). The eleven ranks satisfying (39) show that AVG's ranking is closer to the team ranking:⁶ partnerships satisfying (39) have higher ranks than those given by the other approaches when the team ranks are high, but the opposite when the team ranks are low. This observation indicates that AVG may fail to identify weak (strong) individuals within strong (weak) groups.

The above results suggest that approaches based on different types of comparisons may produce similar rankings, such as Ext-B.ML and NM-S.ML, while those based on the same type of outcomes may lead to diverse results, such as NM-S.ML and Ext-S.ML. Therefore, in the next two subsections we study their formulations and obtain the following results:

• When all $n_i$'s are equal, that is, the number of games or the total score in every group comparison is the same, and estimated group abilities are approximately even, Ext-B.ML and NM-S.ML give similar rankings.

• When all $n_i$'s are equal, Ext-B.RLS is more sensitive than Ext-B.ML and NM-S.ML to extreme outcomes ($n_i^+ \approx 0$ or $n_i^+ \approx n_i$).

• For the two scored-outcome approaches, extreme outcomes have a greater impact on NM-S.ML than on Ext-S.ML.

5.1 Comparing Binary- and Scored-outcome Approaches

Experimental results in Section 4 show that the binary-outcome approach Ext-B.ML and the scored-outcome approach NM-S.ML give very similar rankings. By analyzing their optimization problems, we find that

6. Recall that in the beginning of Section 4, we mentioned that all teams, after the qualifying stage was over, were ranked according to their total VPs gained in the tournament.


Claim 1 If all $n_i$'s are equal and the optimal v for Ext-B.ML satisfies
$$T_i^+ \approx T_i^- \quad \forall i,$$
then Ext-B.ML and NM-S.ML give very close rankings.

The proof is in Appendix E. For the bridge data used in Section 4, the $n_i$'s are the two rival teams' total VPs and are mostly 30; the average $|T_i^+ - T_i^-|$ from the optimal v for Ext-B.ML is 0.3983.

However, in applications where the $n_i$'s are unequal, these two approaches may give different results. Clearly, they use different approximations:
$$\frac{e^{T_i^+}}{e^{T_i^-}} \approx \frac{n_i^+}{n_i^-} \quad \text{and} \quad T_i^+ - T_i^- \approx n_i^+ - n_i^-. \qquad (40)$$
One considers the ratio, which is independent of the values of the $n_i$'s, while the other considers the difference, whose value scales with the $n_i$'s. Therefore, the estimate by NM-S.ML may be more biased than that by Ext-B.ML toward fitting comparison outcomes with large $n_i$.

Another issue is the small but perceivable dissimilarity of the ranking by Ext-B.RLS from those by Ext-B.ML and NM-S.ML, as revealed in the empirical GREs in Table 2 and the Kendall's tau values in Table 4. Investigating them more carefully, we find that
$$\left| d(\text{Ext-B.RLS}, \{\text{Ext-B.ML}, \text{NM-S.ML}\}) \ge 10 \right| = 8, \qquad (41)$$
where the distance is defined in (37). Interestingly, five of these eight partnerships played matches where weak teams beat strong teams by an extreme amount, such as Netherlands beating U.S.A.2 by 25:0, and Ext-B.RLS ranks them higher than Ext-B.ML and NM-S.ML do. This result suggests that even only a few extreme outcomes can change the overall ranking produced by Ext-B.RLS. We verify this property by comparing the estimates by Ext-B.RLS and NM-S.ML. Suppose $n_i = n$ $\forall i$ (which is the case here); then, according to (16), the ability estimate of individual s by Ext-B.RLS is
$$v_s = \sum_{i=1}^m A_{si} \left( \log n_i^+ - \log n_i^- \right) = \sum_{i=1}^m A_{si} \left( \log n_i^+ - \log(n - n_i^+) \right),$$
where $A = \left( G^T G + \mu I \right)^{-1} G^T$. To check the sensitivity of $v_s$ with respect to a change in $n_i^+$, we calculate
$$\frac{\partial v_s}{\partial n_i^+} = A_{si} \left( \frac{1}{n_i^+} + \frac{1}{n - n_i^+} \right) = \frac{n A_{si}}{n_i^+ (n - n_i^+)}.$$
Clearly, the estimate $v_s$ is more sensitive to extreme values of $n_i^+$, that is, $n_i^+ \approx 0$ or $n_i^+ \approx n$. However, for NM-S.ML we have
$$v_s = \sum_{i=1}^m A_{si} (n_i^+ - n_i^-) = \sum_{i=1}^m A_{si} (2 n_i^+ - n) \quad \text{and} \quad \frac{\partial v_s}{\partial n_i^+} = 2 A_{si}.$$


Figure 5: Error function curves and histograms. (a) Loss function curves (circles: x²; diamonds: x; squares: log(1 + cosh(x))). (b) Error histogram for NM-S.ML. (c) Error histogram for Ext-S.ML. The x-axis of the histograms is $|T_i^+ - T_i^- - (n_i^+ - n_i^-)|$.

That is, different values of $n_i^+$ have equal impact on the estimate by NM-S.ML.

In conclusion, when $n_i$ remains a constant and the estimates by Ext-B.ML have $T_i^+ \approx T_i^-$ $\forall i$, NM-S.ML and Ext-B.ML give similar estimates, which are less sensitive than that by Ext-B.RLS to extreme outcomes. When the $n_i$'s are unequal, the discussion around (40) indicates that NM-S.ML is more affected than Ext-B.ML.

5.2 Comparing the Two Scored-outcome Approaches

As shown in (38), the ranking by Ext-S.ML is rather different from those by Ext-B.RLS, Ext-B.ML, and NM-S.ML. We explore this issue by first re-writing the objective functions of NM-S.ML and Ext-S.ML respectively as

$$\min_v \; \sum_{i=1}^m \left( T_i^+ - T_i^- - (n_i^+ - n_i^-) \right)^2 + \mu \sum_{s=1}^k v_s^2$$
and
$$\min_v \; \sum_{i=1}^m \log\left( 1 + \cosh\left( T_i^+ - T_i^- - (n_i^+ - n_i^-) \right) \right) + \mu \sum_{s=1}^k \left( e^{v_s} + e^{-v_s} \right),$$

where cosh is the hyperbolic cosine function. Although these two formulations are derived to maximize the likelihood, they can be viewed as minimizing the estimation errors
$$T_i^+ - T_i^- - (n_i^+ - n_i^-)$$
with two different loss functions. As µ is small, we ignore the effect of the regularization term. It is easy to show that as $\left| T_i^+ - T_i^- - (n_i^+ - n_i^-) \right| \to \infty$,
$$\frac{\log\left( 1 + \cosh\left( T_i^+ - T_i^- - (n_i^+ - n_i^-) \right) \right)}{\left| T_i^+ - T_i^- - (n_i^+ - n_i^-) \right|} \to 1.$$

To show the behaviors of the three functions x², x, and log(1 + cosh(x)), we plot their curves in Figure 5(a). One can see that log(1 + cosh(x)) increases almost linearly with x. In the machine learning community, it is well known that quadratic loss functions may lead to a very different estimation from linear ones. The reason is that quadratic loss functions penalize large errors more severely than linear ones do; estimations are thus dominated by even only a few extreme observations and, as a side effect, may cause quite a few moderate errors. In contrast, estimations under linear loss functions may allow several large errors in order to make most errors small. Figures 5(b) and 5(c) are histograms of $|T_i^+ - T_i^- - (n_i^+ - n_i^-)|$ for NM-S.ML and Ext-S.ML, respectively; we clearly see the aforementioned two error patterns: compared with NM-S.ML, Ext-S.ML has a lot more errors in the first bin and also some in the last two. In addition, we find that the empirical GRE of Ext-S.ML in Section 4.2 is highly related to its error pattern: among the 24 correct rank predictions⁷ produced by Ext-S.ML but not by NM-S.ML, twelve have $|T_i^+ - T_i^- - (n_i^+ - n_i^-)|$ smaller than 3 (the first bin of the histograms); NM-S.ML has no $|T_i^+ - T_i^- - (n_i^+ - n_i^-)|$ larger than 27 (the last two bins of the histograms), while Ext-S.ML has four, among which the partnerships satisfying (38) participate in three. Interestingly, the two types of loss functions seem to reflect two different ranking criteria: one focuses more on performances against extreme opponents, so wins over strong opponents and losses to weak opponents greatly influence the ranking; the other is less sensitive to extreme outcomes and treats comparisons more evenly. Consequently, deciding which loss function, and hence which approach, to use may eventually be contingent on game-specific factors and subjective preferences.

6. Multi-class Classification

Multi-class classification using coding matrices (Dietterich and Bakiri, 1995; Allwein et al., 2001) is a general scheme to decompose a problem into several two-class problems. The widely used methods "one-against-one" and "one-against-the rest" are special cases of this framework. The decomposition is usually specified by a coding matrix $G \in \{1, 0, -1\}^{m \times k}$, where k is the number of classes and m is the number of two-class problems. Each row of G describes how the k classes are separated into two groups: those with 1 are in one group while those with −1 are in the other; those with 0 are not used in this two-class problem. The coding matrix in Table 5 illustrates four common types of codes: one-against-one, one-against-all, dense, and sparse; their definitions are given in Items 1 to 4 below. At the training stage, m binary classifiers are trained for the m two-class problems. For an unlabeled instance, its label is predicted by combining the results of the m binary classifiers.

There are several schemes for deciding the final prediction. Dietterich and Bakiri (1995) proposed choosing the class whose column in G has the smallest distance to the m binary decisions on the instance. This method can correct errors made by some decision rules, and thus is called error-correcting output codes (ECOC).

One-against-one       |  0  0  1  0  0 −1  0  0
One-against-the rest  | −1 −1 −1  1 −1 −1 −1 −1
Dense                 |  1  1 −1 −1  1  1 −1 −1
Sparse                |  1 −1  0  0  1  0  0 −1

Table 5: A coding matrix (k = 8). The four rows illustrate four types of codes.

Allwein et al. (2001) proposed a more general framework, the loss-based decoding, which exploits not only binary decisions but also decision values of binary classifiers. In particular, they adopted the exponential loss-based decoding (EXPLOSS): let $\hat{f}_i$ be the decision function of the ith binary classifier, where $\hat{f}_i(x) > 0$ ($< 0$) specifies that an instance x is in the classes of $I_i^+$ ($I_i^-$). Then,
$$\text{predicted label} \equiv \arg\min_s \; \sum_{i=1}^m e^{-G_{is} \hat{f}_i}.$$
If $G_{is} = 1$ and $\hat{f}_i(x)$ says $s \in I_i^+$, then $e^{-G_{is} \hat{f}_i}$ gives a small loss. By using decision values, the loss-based decoding incorporates the confidence of each binary prediction in making the final decision.
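A minimal sketch of this decoding rule, assuming G is the m × k coding matrix and `f_hat` holds the m decision values for one instance:

```python
import numpy as np

def exploss_predict(G, f_hat):
    """Sketch of exponential loss-based decoding (EXPLOSS)."""
    losses = np.exp(-G * f_hat[:, None]).sum(axis=0)  # sum_i e^{-G_is f_i}, per class
    return int(np.argmin(losses))                      # class with the smallest loss
```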

Table 5 is in the same format as our "comparison setting matrix" G defined in (12) and (13). Huang et al. (2006b) (GBT.ML) thus consider classes as individuals and two-class problems as group comparisons; the 1's and −1's in the ith row of G correspond to $I_i^+$ and $I_i^-$, respectively. The group competition results $n_i^+$ and $n_i^-$ are assumed to be available from the two-class classifiers. For an unlabeled instance, classes are ranked according to their estimated "abilities," and the highest one (with the largest ability) serves as the prediction. All of our newly proposed models can be applied in the same way, but there are two minor issues. Firstly, all of our proposed methods except Ext-B.RLS assume that group comparisons are independent. This property does not hold for multi-class classification, since two-class classifiers involving the same classes share training data. Huang et al. (2006b) pointed out that GBT.ML can be interpreted as minimizing the Kullback-Leibler distance between the model and the observations. It is easy to see that their argument also applies to Ext-B.ML, but not to NM-S.ML or Ext-S.ML. Secondly, the $n_i^+$ and $n_i^-$ given by two-class classifiers are real values, for which the binary-outcome approaches, according to their definition, may not be suitable. Despite these minor issues, as we will show, our proposed methods perform quite well in practice.

We compare our methods with EXPLOSS and GBT.ML on six real data sets: waveform, satimage, segment, USPS, MNIST, and letter; the numbers of classes range from 3 to 26. The settings of the experiments are the same as those in Huang et al. (2006b). We use the 20 subsets of 800 training and 1,000 testing instances⁸ and consider the same four types of coding matrices:

1. One-against-one: $|I_i^+| = |I_i^-| = 1$, $i = 1, \dots, k(k-1)/2$.

2. One-against-all: $|I_i^+| = 1$, $|I_i^-| = k - 1$, $i = 1, \dots, k$.

3. Dense: $|I_i^+| = |I_i^-| = k/2$, $\forall i$; $m = [10 \log_2 k]$.

4. Sparse: $E(|I_i^+|) = E(|I_i^-|) = k/4$, $\forall i$; $m = [15 \log_2 k]$.

[x] rounds a real number x to its nearest integer. We choose support vector machines (SVMs) (Boser et al., 1992) with the RBF (Radial Basis Function) kernel $e^{-\gamma \|x_i - x_j\|^2}$ as the two-class classifier, where $x_i$ and $x_j$ are two training instances. An improved version (Lin et al., 2007) of Platt (2000) generates $n_i^+$ and $n_i^- = 1 - n_i^+$ from SVM decision values. We implement our methods by modifying LIBSVM (Chang and Lin, 2001). For all of the 20 subsets, we select the SVM parameters by cross validation before testing. Figures 6(a)-6(f) report the average testing error rates and standard deviations of the six methods: EXPLOSS, Ext-B.RLS, Ext-B.ML, GBT.ML, NM-S.ML and Ext-S.ML. Each figure summarizes the results on one data set by six groups of colored error bars, which represent the error rates of the six methods under the four types of codes. We can see that EXPLOSS (black diamond) and Ext-B.RLS (red square) perform worse than the others under the one-against-one and the sparse codes as k becomes large, while GBT.ML, Ext-B.ML, NM-S.ML and Ext-S.ML are almost equally good. Regarding the performances of the four types of codes, one-against-one and sparse are less effective for large values of k, an observation consistent with the results in Huang et al. (2006b). Recall that in Section 4.2 Ext-S.ML behaves differently from the others, but here its predictions are similar to those of NM-S.ML and Ext-B.ML. The reason is that the $n_i^+$ and $n_i^-$ produced by Lin et al. (2007) are probabilities satisfying $n_i^+ + n_i^- = 1$, so the values of $|T_i^+ - T_i^- - (n_i^+ - n_i^-)|$ are mostly small and the difference between quadratic and linear loss functions is negligible. Results here suggest that the proposed methods are useful for multi-class classification with coding matrices.

Figure 6: Testing error rates (%) on the 800-training-1,000-testing data sets by the six approaches under four types of codes: one-against-one (1-vs-1), one-against-the rest, dense, and sparse. Panels: (a) waveform (k = 3), (b) satimage (k = 6), (c) segment (k = 7), (d) USPS (k = 10), (e) MNIST (k = 10), (f) letter (k = 26). Vertical bars indicate standard deviations.

7. Conclusions

We propose new and useful methods to rank individuals from group comparisons. For comparisons with binary outcomes, earlier work solves non-convex problems, but here convex formulations with easy solution procedures are developed. For scored outcomes, our formulations are probably the first for this type of problem. Experiments show that the proposed approaches give reasonable partnership rankings from bridge records and perform effectively in multi-class classification. We give theoretical accounts for the behaviors of the proposed approaches, which demonstrate how different models reflect diverse ranking criteria. We also develop techniques to evaluate different rankings, which may be used in other ranking tasks.

Appendix A. Derivation of (10) from (8)

$$P(Y_i^+ - Y_i^- > 0) \equiv \int_{-\infty}^{\infty} \int_{y^-}^{\infty} de^{-e^{-(y^+ - T_i^+)}}\, de^{-e^{-(y^- - T_i^-)}}. \qquad (42)$$

Let
$$x^+ \equiv e^{-(y^+ - T_i^+)} \quad \text{and} \quad x^- \equiv e^{-(y^- - T_i^-)}.$$

Consequently,
$$de^{-e^{-(y^+ - T_i^+)}} = -e^{-x^+}\, dx^+ \quad \text{and} \quad de^{-e^{-(y^- - T_i^-)}} = -e^{-x^-}\, dx^-.$$

Then
$$(42) = \int_0^{\infty} -e^{-x^-} \int_0^{x^- e^{T_i^+ - T_i^-}} -e^{-x^+}\, dx^+\, dx^- = \frac{e^{T_i^+}}{e^{T_i^+} + e^{T_i^-}}.$$
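As a quick sanity check of this closed form, one can sample the two extreme-value variables directly; a minimal sketch, assuming NumPy, whose `gumbel(loc=T)` has exactly the CDF $e^{-e^{-(y-T)}}$:

```python
import numpy as np

# Monte Carlo check: if Y+ and Y- are independent Gumbel variables with
# locations T+ and T-, then P(Y+ - Y- > 0) = e^{T+} / (e^{T+} + e^{T-}).
rng = np.random.default_rng(0)
T_plus, T_minus = 0.8, -0.3              # arbitrary example abilities
n = 1_000_000
y_plus = rng.gumbel(loc=T_plus, size=n)
y_minus = rng.gumbel(loc=T_minus, size=n)
empirical = (y_plus > y_minus).mean()
closed_form = np.exp(T_plus) / (np.exp(T_plus) + np.exp(T_minus))
print(empirical, closed_form)            # agree to about three decimals
```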

Appendix B. Proof of Theorem 1

If $\mathrm{rank}(G) < k$, $G^TG$ is obviously not invertible; if $\mathrm{rank}(G) = k$, the singular value decomposition of $G$ can be written as
$$G = U\Lambda V^T,$$
where $U \in \mathbb{R}^{m\times k}$ and $V \in \mathbb{R}^{k\times k}$ are orthonormal and $\Lambda \in \mathbb{R}^{k\times k}$ is diagonal with $\Lambda_{ii} \neq 0$, $i = 1, \dots, k$. Therefore,
$$G^TG = V\Lambda U^T U\Lambda V^T = V\Lambda^2 V^T$$
is invertible.
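The identity used in this proof is easy to confirm numerically; a small sketch, assuming NumPy:

```python
import numpy as np

# If G has full column rank, G^T G = V Lambda^2 V^T is invertible;
# if a column is duplicated, rank(G) < k and G^T G is singular.
rng = np.random.default_rng(0)
m, k = 8, 4
G = rng.standard_normal((m, k))                   # full column rank (a.s.)
U, s, Vt = np.linalg.svd(G, full_matrices=False)  # thin SVD: G = U L V^T
print(np.allclose(G.T @ G, Vt.T @ np.diag(s**2) @ Vt))  # True
print(np.linalg.matrix_rank(G.T @ G))                   # k

G_deficient = G.copy()
G_deficient[:, -1] = G_deficient[:, 0]                  # duplicate a column
print(np.linalg.matrix_rank(G_deficient.T @ G_deficient))  # k - 1
```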

Appendix C. Proof of Theorem 2

We first rewrite $l(v)$ as
$$l(v) = -\sum_{i=1}^m \big(n_i^+ T_i^+ + n_i^- T_i^-\big) + \sum_{i=1}^m n_i \log\big(e^{T_i^+} + e^{T_i^-}\big).$$

The first summation is obviously convex. For the second summation, by using Hölder's inequality we have
$$\begin{aligned}
&\sum_{i=1}^m n_i \log\Big(e^{\lambda T_i^+ + (1-\lambda)\tilde T_i^+} + e^{\lambda T_i^- + (1-\lambda)\tilde T_i^-}\Big) \\
&= \sum_{i=1}^m n_i \log\Big(\big(e^{T_i^+}\big)^{\lambda}\big(e^{\tilde T_i^+}\big)^{1-\lambda} + \big(e^{T_i^-}\big)^{\lambda}\big(e^{\tilde T_i^-}\big)^{1-\lambda}\Big) \\
&\le \sum_{i=1}^m n_i \log\Big(\big(e^{T_i^+} + e^{T_i^-}\big)^{\lambda}\big(e^{\tilde T_i^+} + e^{\tilde T_i^-}\big)^{1-\lambda}\Big) \\
&= \sum_{i=1}^m n_i \lambda \log\big(e^{T_i^+} + e^{T_i^-}\big) + \sum_{i=1}^m n_i (1-\lambda) \log\big(e^{\tilde T_i^+} + e^{\tilde T_i^-}\big) \qquad (43)
\end{aligned}$$


for any $v$, $\tilde v$ and $\lambda \in (0, 1)$, and the equality holds if and only if $T_i^+ - T_i^- = \tilde T_i^+ - \tilde T_i^-\ \forall i$, which can be rewritten as
$$G(v - \tilde v) = 0. \qquad (44)$$
If $\mathrm{rank}(G) = k$, then (44) holds if and only if $v = \tilde v$, so $l(v)$ is strictly convex. If $l(v)$ is strictly convex, then the equality in (43) holds if and only if $v = \tilde v$, so
$$G(v - \tilde v) = 0 \iff v = \tilde v,$$
which implies $\mathrm{rank}(G) = k$.
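A numerical illustration of the theorem, under the assumption (made here only for the sketch; the actual definitions appear in the main text) that $T_i^+$ and $T_i^-$ are the sums of $v_j$ over $j \in I_i^+$ and $j \in I_i^-$, encoded by 0/1 indicator matrices:

```python
import numpy as np

def ell(v, Gp, Gn, n_pos, n_neg):
    # l(v) as rewritten above, with T_i^+ = (Gp v)_i, T_i^- = (Gn v)_i
    # and n_i = n_i^+ + n_i^- (hypothetical encoding for this check).
    Tp, Tn = Gp @ v, Gn @ v
    n = n_pos + n_neg
    return -(n_pos * Tp + n_neg * Tn).sum() + (n * np.logaddexp(Tp, Tn)).sum()

rng = np.random.default_rng(1)
m, k = 6, 4
Gp = (rng.random((m, k)) < 0.4).astype(float)
Gn = ((rng.random((m, k)) < 0.4) & (Gp == 0)).astype(float)  # disjoint sets
n_pos = rng.integers(1, 5, m).astype(float)
n_neg = rng.integers(1, 5, m).astype(float)
v, v_t, lam = rng.standard_normal(k), rng.standard_normal(k), 0.3

lhs = ell(lam * v + (1 - lam) * v_t, Gp, Gn, n_pos, n_neg)
rhs = lam * ell(v, Gp, Gn, n_pos, n_neg) \
    + (1 - lam) * ell(v_t, Gp, Gn, n_pos, n_neg)
print(lhs <= rhs + 1e-9)   # convexity along the segment: True
```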

Appendix D. Proof of Theorem 3

It is easy to verify that the level sets of $l(v)$ are bounded. Since $l(v)$ is strictly convex, it then attains a unique global minimum. To prove the convergence of Algorithm 1, we first show that if $\partial l(v)/\partial v_s \neq 0$, then minimizing (22) leads to
$$l(v + \delta) < l(v). \qquad (45)$$
From (23), if the optimal $\delta_s$ for (22) is zero, then
$$\frac{B_s + \sqrt{B_s^2 + 4\mu A_s e^{-v_s}}}{2A_s} = 1,$$
which implies
$$4A_s\big(\mu e^{-v_s} - A_s + B_s\big) = -4A_s \frac{\partial l(v)}{\partial v_s} = 0. \qquad (46)$$
Since $A_s \neq 0$ throughout iterations, (46) implies $\partial l(v)/\partial v_s = 0$. Thus if $\partial l(v)/\partial v_s \neq 0$, the optimal $\delta_s \neq 0$. Since (22) equals zero at $\delta_s = 0$, a nonzero optimal $\delta_s$ attains a negative value, and (45) follows.

Next we show that the sequence $\{v^t\}$ generated by Algorithm 1 is bounded. If not, there must exist $j$ such that $|v_j^t| \to \infty$. Then
$$l(v^t) \ge \mu \sum_{s=1}^k \big(e^{v_s^t} + e^{-v_s^t}\big) = \mu \sum_{s=1}^k \big(e^{|v_s^t|} + e^{-|v_s^t|}\big) \ge \mu\big(e^{|v_j^t|} + e^{-|v_j^t|}\big) \to \infty,$$
which contradicts the fact that
$$l(v^0) \ge l(v^t)\ \ \forall t.$$
Since $\{v^t\}$ is bounded, it has limit points. For any limit point $v^*$, there is an infinite set $\bar N$ such that
$$\lim_{t \in \bar N,\, t\to\infty} v^t = v^*.$$
Since $v$ is finite dimensional, there is one component $s$ updated in an infinite set $N \subset \bar N$:
$$(t \bmod k) + 1 = s \quad \text{for } t \in N.$$

Because $l(v)$ is convex, to prove that $v^*$ is a global minimum, it suffices to show that
$$\frac{\partial l(v^*)}{\partial v_s} = 0 \quad \text{for } s = 1, \dots, k. \qquad (47)$$
Suppose the contrary is true; then among $s, s+1, \dots, k, 1, \dots, s-1$, there is $\bar s$ such that
$$\frac{\partial l(v^*)}{\partial v_s} = \cdots = \frac{\partial l(v^*)}{\partial v_{\bar s - 1}} = 0, \quad \frac{\partial l(v^*)}{\partial v_{\bar s}} \neq 0. \qquad (48)$$
From (45), updating $v^*_{\bar s}$ by (23) yields $v^{*+1} \neq v^*$ and $l(v^{*+1}) < l(v^*)$. We have that $\partial l(v^*)/\partial v_s = 0$ implies
$$\frac{B_s + \sqrt{B_s^2 + 4\mu A_s^* e^{-v_s^*}}}{2A_s^*} = 1,$$
where $A_s^*$ is defined according to (24) and $B_s$ is a constant independent of $v$. Therefore,
$$\lim_{t\in N,\, t\to\infty} v_s^{t+1} = \lim_{t\in N,\, t\to\infty} \left(v_s^t + \log \frac{B_s + \sqrt{B_s^2 + 4\mu A_s^t e^{-v_s^t}}}{2A_s^t}\right) = v_s^* + \log \frac{B_s + \sqrt{B_s^2 + 4\mu A_s^* e^{-v_s^*}}}{2A_s^*} = v_s^*, \qquad (49)$$
and
$$\lim_{t\in N,\, t\to\infty} v^{t+1} = \lim_{t\in N,\, t\to\infty} v^t = v^*. \qquad (50)$$
Let $\bar t$ be the iteration corresponding to $\bar s$. Using (48), a similar derivation to (49) and (50) shows that
$$\lim_{t\in N,\, t\to\infty} v^{t+1} = \cdots = \lim_{t\in N,\, t\to\infty} v^{\bar t} = v^* \quad \text{and} \quad \lim_{t\in N,\, t\to\infty} v^{\bar t + 1} = v^{*+1};$$
consequently,
$$\lim_{t\in N,\, t\to\infty} l(v^{\bar t + 1}) = l(v^{*+1}) < l(v^*),$$
which contradicts the fact that
$$l(v^*) \le \cdots \le l(v^{t+1}) \le l(v^t).$$
Thus (47) holds for all limit points. Since $l(v)$ is strictly convex, every limit point is the unique global minimum. Moreover, the sequence $\{v^t\}$ is bounded, so it globally converges to the global minimum.
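For reference, here is a schematic of the cyclic updates analyzed in this proof. `A(v, s)` and `B(v, s)` are hypothetical placeholder callables for the quantities $A_s$ and $B_s$ of (22)-(24), whose definitions appear earlier in the paper; only the update rule and the stopping test come from the proof (see (46) and (49)):

```python
import numpy as np

def algorithm1(v0, A, B, mu, tol=1e-8, max_cycles=1000):
    """Cyclic coordinate updates: at iteration t the component
    s = (t mod k) + 1 is updated multiplicatively."""
    v = np.array(v0, dtype=float)
    k = len(v)
    for _ in range(max_cycles):
        max_step = 0.0
        for s in range(k):
            As, Bs = A(v, s), B(v, s)
            # v_s <- v_s + log[(B_s + sqrt(B_s^2 + 4 mu A_s e^{-v_s})) / (2 A_s)]
            step = np.log((Bs + np.sqrt(Bs**2 + 4 * mu * As * np.exp(-v[s])))
                          / (2 * As))
            v[s] += step
            max_step = max(max_step, abs(step))
        if max_step < tol:  # ratio ~ 1 for all s, i.e., grad l(v) ~ 0 by (46)
            break
    return v
```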


Appendix E. Proof of Claim 1

From (29) it is clear that the ranking by NM-S.ML is invariant to the scale of $n_i$; we thus assume
$$n_i^+ + n_i^- = 2,\ \forall i.$$
Then (26) can be rewritten as
$$\min_v\ \sum_{i=1}^m \Big[(T_i^+ - T_i^-)^2 - (4n_i^+ - 4)(T_i^+ - T_i^-)\Big].$$
For Ext-B.ML, as $\mu$ is small and can be ignored, we consider the objective function in (17), which can be rewritten as
$$\begin{aligned}
&\sum_{i=1}^m \Big[-n_i^+(T_i^+ - T_i^-) + n_i \log\big(e^{T_i^+ - T_i^-} + 1\big)\Big] \qquad (51)\\
&= \sum_{i=1}^m \Big[-n_i^+(T_i^+ - T_i^-) + 2\Big(\log 2 + \tfrac12 (T_i^+ - T_i^-) + \tfrac18 (T_i^+ - T_i^-)^2 + O\big((T_i^+ - T_i^-)^3\big)\Big)\Big] \qquad (52)\\
&\approx \frac14 \sum_{i=1}^m \Big[(T_i^+ - T_i^-)^2 - (4n_i^+ - 4)(T_i^+ - T_i^-)\Big]
\end{aligned}$$
up to an additive constant. From (51) to (52) we use the Taylor expansion of the function $\log(e^x + 1)$ at $x = 0$ and the assumption that $T_i^+ \approx T_i^-\ \forall i$. Therefore, the rankings by NM-S.ML and Ext-B.ML are similar.
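The expansion used from (51) to (52) can be checked numerically; a minimal sketch, assuming NumPy:

```python
import numpy as np

# Near x = 0, log(e^x + 1) = log 2 + x/2 + x^2/8 + O(x^3), so for
# T_i^+ close to T_i^- the Ext-B.ML objective matches the NM-S.ML
# objective up to scaling and additive constants.
x = np.linspace(-0.5, 0.5, 101)
exact = np.log(np.exp(x) + 1.0)
approx = np.log(2.0) + x / 2.0 + x**2 / 8.0
print(np.abs(exact - approx).max())   # small on this range
```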

Appendix F. Top 10 Partnerships by Ext-B.ML

Team             Players
U.S.A.2          Eric Greco, Geoff Hampson
Sweden           Peter Bertheau, Fredrik Nystrom
Japan            Yoshiyuki Nakamura, Yasuhiro Shimizu
Chinese Taipei   Chih-Kuo Shen, Jui-Yiu Shih
New Zealand      Tom Jacob, Malcolm Mayer
China            Zhong Fu, Jie Zhao
Italy            Norberto Bocchi, Giorgio Duboin
Brazil           Gabriel Chagas, Miguel Villas-boas
India            Subhash Gupta, Rajeshwar Tewari
Portugal         Jorge Castanheira, Sofia Pessoa

References

Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113–141, 2001.


Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

Ralph A. Bradley and Milton E. Terry. The rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39:324–345, 1952.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

John N. Darroch and Douglas Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

Herbert A. David. The method of paired comparisons. Oxford University Press, second edition, 1988.

Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

Arpad E. Elo. The Rating of Chessplayers, Past and Present. Arco Pub., New York, 2nd edition, 1986.

Mark E. Glickman. Paired comparison models with time-varying parameters. PhD thesis, Department of Statistics, Harvard University, 1993.

Joshua Goodman. Sequential conditional generalized iterative scaling. In ACL, pages 9–16, 2002.

Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. The Annals of Statistics, 26(1):451–471, 1998.

Ralf Herbrich and Thore Graepel. TrueSkill™: a Bayesian skill rating system. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.

Tzu-Kuo Huang, Chih-Jen Lin, and Ruby C. Weng. Ranking individuals by group comparisons. In Proceedings of the Twenty-Third International Conference on Machine Learning (ICML), 2006a.

Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. Journal of Machine Learning Research, 7:85–115, 2006b. URL http://www.csie.ntu.edu.tw/~cjlin/papers/generalBT.pdf.

David R. Hunter. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32:386–408, 2004.

Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4): 620–630, 1957a.
