## From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk at Dept. of CSIE, National Taiwan University, March 21, 2008

Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06)

& discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap

## Outline

**1** **Introduction to Machine Learning**

**2** The Ordinal Ranking Setup

**3** Reduction from Ordinal Ranking to Binary Classification
- Algorithmic Usefulness of Reduction
- Theoretical Usefulness of Reduction
- Experimental Performance of Reduction

**4** Conclusion

## Apple, Orange, or Strawberry?

**?**

apple orange strawberry

**how can a machine learn to classify?**

## Supervised Machine Learning

Parent ⇒ (picture, category) pairs ⇒ Kid's brain (choosing among possibilities) ⇒ good decision function

Truth $f(x)$ + noise $e(x)$ ⇒ examples (picture $x_n$, category $y_n$) ⇒ learning algorithm (choosing from the learning model $\{h_\alpha(x)\}$) ⇒ good decision function $h(x) \approx f(x)$

challenge:

see only $\{(x_n, y_n)\}$ without knowing $f(x)$ or $e(x)$

$\Longrightarrow$? **generalize to unseen $(x, y)$ w.r.t. $f(x)$**

## Machine Learning Research

What can the machines learn?

- concrete applications: computer vision, multimedia analysis, architecture optimization, information retrieval, bio-informatics, computational finance, · · ·
- abstract setups: classification, regression, · · ·

How can the machines learn?

- faster algorithms
- algorithms with better **generalization performance**

Why can the machines learn?

- theoretical paradigms: statistical learning, reinforcement learning, interactive learning, · · ·
- generalization guarantees

**new opportunities of machine learning keep coming from new applications/setups**


## Which Age-Group?

**2**

infant(1) child(2) teen(3) adult(4)

**rank: a finite ordered set of labels** $\mathcal{Y} = \{1, 2, \cdots, K\}$

## Properties of Ordinal Ranking (1/2)

ranks represent **order information**

infant (1) < child (2) < teen (3) < adult (4)

**general multiclass classification cannot properly use order information**

## Hot or Not?

http://www.hotornot.com

**rank: natural representation of human preferences**

## Properties of Ordinal Ranking (2/2)

ranks do **not carry numerical information**

- rating 9 not 2.25 times “hotter” than rating 4
- actual metric hidden: infant (ages 1–3), child (ages 4–12), teen (ages 13–19), adult (ages 20–)

**general metric regression deteriorates without correct numerical information**

## How Much Did You Like These Movies?

http://www.netflix.com

**goal: use “movies you’ve rated” to automatically**
**predict your preferences (ranks) on future movies**

## Ordinal Ranking Setup

Given

$N$ examples (input $x_n$, rank $y_n$) $\in \mathcal{X} \times \mathcal{Y}$

- age-group: $\mathcal{X}$ = encoding(human pictures), $\mathcal{Y} = \{1, \cdots, 4\}$
- hotornot: $\mathcal{X}$ = encoding(human pictures), $\mathcal{Y} = \{1, \cdots, 10\}$
- netflix: $\mathcal{X}$ = encoding(movies), $\mathcal{Y} = \{1, \cdots, 5\}$

Goal

an ordinal ranker (decision function) $r(x)$ that “closely predicts” the ranks $y$ associated with some **unseen inputs** $x$

**ordinal ranking: a hot and important research problem**

## Importance of Ordinal Ranking

- relatively new for machine learning
- connecting classification and regression
- matching human preferences: many applications in social science, information retrieval, psychology, and recommendation systems

**Ongoing Heat: Netflix Million Dollar Prize**

## Ongoing Heat: Netflix Million Dollar Prize

(since 10/2006)

Given

each user $u$ (480,189 users) rates $N_u$ (from tens to thousands) movies $x$, a total of $\sum_u N_u = 100{,}480{,}507$ examples

Goal

personalized ordinal rankers $r_u(x)$ evaluated on 2,817,131 “unseen” queries $(u, x)$

**the first team being 10% better than the original Netflix system gets a million USD**

## Cost of Wrong Prediction

ranks carry no numerical information: how to say “better”?

artificially quantify the **cost of being wrong**

e.g. loss of customer loyalty when the system says [star rating] but you feel [star rating]

cost vector $\mathbf{c}$ of example $(x, y, \mathbf{c})$: $\mathbf{c}[k]$ = cost when predicting $(x, y)$ as rank $k$

e.g. for (Sweet Home Alabama, [star rating]), a proper cost is $\mathbf{c} = (1, 0, 2, 10, 15)$

**closely predict: small testing cost**

## Ordinal Cost Vectors

For an ordinal example $(x, y, \mathbf{c})$, the cost vector $\mathbf{c}$ should

- follow the rank $y$: $\mathbf{c}[y] = 0$; $\mathbf{c}[k] \ge 0$
- respect the ordinal information: V-shaped (ordinal) or even convex (strongly ordinal)

[Figure: two plots of the cost $C_{y,k}$ over the ranks 1: infant, 2: child, 3: teenager, 4: adult. V-shaped: pay more when predicting further away. Convex: pay **increasingly** more when further away.]

| cost | $\mathbf{c}[k]$ | type | example |
|---|---|---|---|
| classification | $[\![ y \neq k ]\!]$ | ordinal | (1, 0, 1, 1, 1) |
| absolute | $\lvert y - k \rvert$ | strongly ordinal | (1, 0, 1, 2, 3) |
| squared (Netflix) | $(y - k)^2$ | strongly ordinal | (1, 0, 1, 4, 9) |
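The three example cost vectors on this slide can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, the examples are assumed to use $y = 2$ and $K = 5$ (the values that reproduce the vectors shown), and "convex" is assumed to mean that successive differences $\mathbf{c}[k+1] - \mathbf{c}[k]$ never decrease.

```python
def cost_vector(y, K, kind="absolute"):
    """Return [c[1], ..., c[K]] for true rank y (ranks are 1-based)."""
    ranks = range(1, K + 1)
    if kind == "classification":
        return [int(k != y) for k in ranks]
    if kind == "absolute":
        return [abs(y - k) for k in ranks]
    if kind == "squared":
        return [(y - k) ** 2 for k in ranks]
    raise ValueError(f"unknown cost kind: {kind}")


def is_v_shaped(c, y):
    """Ordinal: cost never increases towards rank y and never decreases after it."""
    left = all(c[k - 1] >= c[k] for k in range(1, y))
    right = all(c[k] >= c[k - 1] for k in range(y, len(c)))
    return left and right


def is_convex(c):
    """Strongly ordinal (assumed definition): successive differences never decrease."""
    return all(c[k + 1] - c[k] >= c[k] - c[k - 1] for k in range(1, len(c) - 1))


for kind in ("classification", "absolute", "squared"):
    c = cost_vector(2, 5, kind)
    print(kind, c, "V-shaped:", is_v_shaped(c, 2), "convex:", is_convex(c))
```

Under these assumptions the classification cost is V-shaped but not convex, while the absolute and squared costs are both, which agrees with the "ordinal" versus "strongly ordinal" labels in the table.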

## Our Contributions

a theoretical and algorithmic foundation of ordinal ranking, which ...

- provides a methodology for designing new ordinal ranking algorithms with **any ordinal cost** effortlessly
- takes many existing ordinal ranking algorithms as **special cases**
- introduces **new theoretical guarantee** on the generalization performance of ordinal rankers
- leads to **superior experimental results**

Figure: truth; traditional algorithm; our algorithm

## Central Idea: Reduction

complex ordinal ranking problems (iPod)

⇓ reduction (adapter)

simpler binary classification problems with well-known results on models, algorithms, and theories (cassette player)

**If I have seen further it is by standing on the shoulders of Giants** (I. Newton)


## Threshold Model

If we can first get an ideal score $s(x)$ of a movie $x$, how can we construct the discrete $r(x)$ from an analog $s(x)$?

[Figure: examples with target ranks $y \in \{1, 2, 3, 4\}$ placed on the score line $s(x)$; the thresholds $\theta_1, \theta_2, \theta_3$ cut the line into the regions predicted as ranks 1 to 4 by the ordinal ranker $r(x)$]

quantize $s(x)$ by some **ordered threshold** $\theta$

commonly used in previous work:

- threshold perceptrons (PRank, Crammer and Singer, 2002)
- threshold hyperplanes (SVOR, Chu and Keerthi, 2005)
- threshold ensembles (ORBoost, Lin and Li, 2006)

**threshold model:** $r(x) = \min \{k : s(x) < \theta_k\}$
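The threshold model can be sketched in Python as a binary search over the ordered thresholds. The function name and the numeric thresholds are illustrative, and the sketch assumes the convention $\theta_K = +\infty$, so that a score above every threshold receives the top rank $K$.

```python
import bisect


def threshold_rank(score, thresholds):
    """r(x) = min{k : s(x) < theta_k} for ordered thresholds theta_1 <= ... <= theta_{K-1}.

    Assumption: theta_K = +infinity, so scores above all thresholds get rank K.
    """
    # bisect_right counts the thresholds that are <= score; the first threshold
    # strictly above the score therefore has the 1-based index (count + 1).
    return bisect.bisect_right(thresholds, score) + 1


thetas = [-1.0, 0.5, 2.0]  # hypothetical thresholds for K = 4 ranks
for s in (-3.0, 0.0, 1.2, 5.0):
    print(s, "->", threshold_rank(s, thetas))
```

A score exactly equal to $\theta_k$ is not strictly below it, so it falls into rank $k + 1$ under this reading of the formula.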

## Key of Reduction: Associated Binary Queries

getting the rank using a threshold model

1. is $s(x) > \theta_1$? Yes
2. is $s(x) > \theta_2$? No
3. is $s(x) > \theta_3$? No
4. is $s(x) > \theta_4$? No

generally, how do we query the rank of a movie $x$?

1. is movie $x$ better than rank 1? Yes
2. is movie $x$ better than rank 2? No
3. is movie $x$ better than rank 3? No
4. is movie $x$ better than rank 4? No

**associated binary queries:** is movie $x$ better than rank $k$?


## More on Associated Binary Queries

say, the machine uses $g(x, k)$ to answer the query “is movie $x$ better than rank $k$?”

e.g. threshold model $g(x, k) = \mathrm{sign}(s(x) - \theta_k)$

$K - 1$ binary classification problems w.r.t. each $k$

[Figure: the examples on the score line $s(x)$ with target ranks $y$ and predicted ranks $r_g(x)$; below it, one row per $k = 1, 2, 3$ showing the desired answers $(z)_k$ (N/Y) of all examples and the classifier $g(x, k)$ that splits them at $\theta_k$]

let $\big((x, k), (z)_k\big)$ be binary examples

- $(x, k)$: extended input w.r.t. $k$-th query
- $(z)_k$: desired binary answer Y/N

**If $g(x, k) = (z)_k$ for all $k$, we can compute $r_g(x)$ from $g(x, k)$ s.t. $r_g(x) = y$.**

## Computing Ranks from Associated Binary Queries

when g(x , k ) answers “is movie x better than rank k ?”

Consider $g(x, 1), g(x, 2), \cdots, g(x, K-1)$,

- consistent predictions: (Y,Y,N,N,N,N,N)
- extracting the rank from consistent predictions:
  - minimum index searching: $r_g(x) = \min \{k : g(x, k) = \text{N}\}$
  - counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$
- two approaches equivalent for consistent predictions
- noisy/inconsistent predictions? e.g. (Y,N,Y,Y,N,N,Y)

**counting: simpler to analyze and robust to noise**
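The two extraction rules can be compared on the slide's own prediction strings with a short Python sketch. The helper names are hypothetical, and returning $K$ when every answer is Y is an assumed convention for minimum index searching.

```python
def parse(answers):
    """Turn a string such as 'YYNNNNN' into booleans (True means Y)."""
    return [ch == "Y" for ch in answers]


def rank_by_min_index(answers):
    """r_g(x) = min{k : g(x, k) = N}; assumed to be K when all answers are Y."""
    for k, is_yes in enumerate(answers, start=1):
        if not is_yes:
            return k
    return len(answers) + 1


def rank_by_counting(answers):
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(answers)


for s in ("YYNNNNN", "YNYYNNY"):
    g = parse(s)
    print(s, "min index:", rank_by_min_index(g), "counting:", rank_by_counting(g))
```

On the consistent string both rules agree. On the inconsistent one, minimum index searching stops at the first N, while counting still uses every answer.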

## The Counting Approach

Say $y = 5$, i.e., $\big((z)_1, (z)_2, \cdots, (z)_7\big)$ = (Y,Y,Y,Y,N,N,N)

- if $g_1(x, k)$ reports consistent predictions (Y,Y,N,N,N,N,N)
  - $g_1(x, k)$ made 2 binary classification errors
  - $r_{g_1}(x) = 3$ by counting: the absolute cost is 2
  - absolute cost = # of binary classification errors
- if $g_2(x, k)$ reports inconsistent predictions (Y,N,Y,Y,N,N,Y)
  - $g_2(x, k)$ made 2 binary classification errors
  - $r_{g_2}(x) = 5$ by counting: the absolute cost is 0
  - absolute cost ≤ # of binary classification errors

**If $(z)_k$ = desired answer & $r_g$ computed by counting,**

$$\lvert y - r_g(x) \rvert \le \sum_{k=1}^{K-1} [\![ (z)_k \neq g(x, k) ]\!].$$
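One way to see why the bound holds is a short derivation. It is a sketch, assuming the desired answer $(z)_k$ is Y exactly when $y > k$, so that the true rank can also be written as a count:

$$
\begin{aligned}
y &= 1 + \sum_{k=1}^{K-1} [\![ (z)_k = \text{Y} ]\!], \qquad
r_g(x) = 1 + \sum_{k=1}^{K-1} [\![ g(x, k) = \text{Y} ]\!], \\
\lvert y - r_g(x) \rvert
&= \Big\lvert \sum_{k=1}^{K-1} \big( [\![ (z)_k = \text{Y} ]\!] - [\![ g(x, k) = \text{Y} ]\!] \big) \Big\rvert
\le \sum_{k=1}^{K-1} \big\lvert [\![ (z)_k = \text{Y} ]\!] - [\![ g(x, k) = \text{Y} ]\!] \big\rvert
= \sum_{k=1}^{K-1} [\![ (z)_k \neq g(x, k) ]\!].
\end{aligned}
$$

Each term in the middle sum is 1 when the two answers differ and 0 otherwise, which gives the last equality. Errors in opposite directions can cancel inside the absolute value, which is why the inconsistent $g_2$ pays cost 0 despite making 2 errors.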

## Binary Classification Error v.s. Ordinal Ranking Cost

Say $y = 5$, i.e., $\big((z)_1, (z)_2, \cdots, (z)_7\big)$ = (Y,Y,Y,Y,N,N,N)

- if $g_1(x, k)$ reports consistent predictions (Y,Y,N,N,N,N,N)
  - $g_1(x, k)$ made 2 binary classification errors
  - $r_{g_1}(x) = 3$ by counting: the **squared cost is 4**
- if $g_3(x, k)$ reports consistent predictions (Y,N,N,N,N,N,N)
  - $g_3(x, k)$ made 3 binary classification errors
  - $r_{g_3}(x) = 2$ by counting: the **squared cost is 9**

now 1 binary classification error can introduce up to 5 more ordinal ranking cost: how to take this into account?

## Importance of Associated Binary Queries

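| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | cost |
|---|---|---|---|---|---|---|---|---|
| $(z)_k$ | Y | Y | Y | Y | N | N | N | |
| $g_1(x, k)$ | Y | Y | N | N | N | N | N | $\mathbf{c}[r_{g_1}(x)] = \mathbf{c}[3] = 4$ |
| $g_3(x, k)$ | Y | N | N | N | N | N | N | $\mathbf{c}[r_{g_3}(x)] = \mathbf{c}[2] = 9$ |
| $(w)_k$ | 7 | **5** | 3 | 1 | 1 | 3 | 5 | |

$(w)_k \equiv \lvert \mathbf{c}[k+1] - \mathbf{c}[k] \rvert$: the importance of $\big((x, k), (z)_k\big)$

per-example cost bound (Li and Lin, 2007; Lin, 2008):

for **consistent predictions or strongly ordinal costs**

$$\mathbf{c}[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k [\![ (z)_k \neq g(x, k) ]\!]$$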

**accurate binary predictions =⇒ correct ranks**
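The weights and the per-example bound can be checked numerically for the slide's example ($y = 5$, $K = 8$, squared cost). The sketch below uses hypothetical helper names and assumes the desired answer $(z)_k$ is Y exactly when $y > k$.

```python
def squared_cost(y, K):
    """[c[1], ..., c[K]] with c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def importance_weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1, ..., K-1 (c is stored 0-based)."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]


def desired_answers(y, K):
    """(z)_k = Y (True) iff y > k, for k = 1, ..., K-1."""
    return [y > k for k in range(1, K)]


def weighted_errors(w, z, g):
    """Right-hand side of the bound: total weight of the wrong binary answers."""
    return sum(wk for wk, zk, gk in zip(w, z, g) if zk != gk)


def cost_of_counting_rank(c, g):
    """Left-hand side of the bound: c[r_g(x)] with r_g(x) = 1 + number of Y answers."""
    rank = 1 + sum(g)
    return c[rank - 1]


y, K = 5, 8
c = squared_cost(y, K)
w = importance_weights(c)
z = desired_answers(y, K)
print("weights:", w)
for name, s in (("g1", "YYNNNNN"), ("g3", "YNNNNNN"), ("g2", "YNYYNNY")):
    g = [ch == "Y" for ch in s]
    print(name, "cost:", cost_of_counting_rank(c, g), "bound:", weighted_errors(w, z, g))
```

For the consistent predictions $g_1$ and $g_3$ the bound is met with equality. For the inconsistent $g_2$ it is loose, since the cost is 0 while the weighted errors are positive.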

## The Reduction Framework (1/2)

1. transform ordinal examples $(x_n, y_n, \mathbf{c}_n)$ to weighted binary examples $\big((x_n, k), (z_n)_k, (w_n)_k\big)$
2. use your favorite algorithm on the weighted binary examples and get $K - 1$ binary classifiers (i.e., one big joint binary classifier) $g(x, k)$
3. for each new input $x$, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

**the reduction framework: systematic & easy to implement**

ordinal examples $(x_n, y_n, \mathbf{c}_n)$ ⇒ weighted binary examples $\big((x_n, k), (z_n)_k, (w_n)_k\big)$, $k = 1, \cdots, K-1$ ⇒ core binary classification algorithm ⇒ associated binary classifiers $g(x, k)$, $k = 1, \cdots, K-1$ ⇒ ordinal ranker $r_g(x)$
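The three steps of the framework can be sketched end to end in plain Python. Everything here is illustrative. The inputs are assumed to be one-dimensional numbers, and the "favorite algorithm" is replaced by a hypothetical stand-in that picks, for each $k$, the threshold with the smallest weighted error. Any weighted binary learner could take its place.

```python
def to_weighted_binary(examples, K):
    """Step 1: (x, y, c) -> ((x, k), z, w) for k = 1, ..., K-1.

    z is True (Y) iff y > k; w = |c[k+1] - c[k]| (c is a 0-based list of length K).
    """
    out = []
    for x, y, c in examples:
        for k in range(1, K):
            out.append(((x, k), y > k, abs(c[k] - c[k - 1])))
    return out


def train_thresholds(binary_examples, K):
    """Step 2 (stand-in learner, an assumption): one threshold per query k,
    chosen to minimize the weighted error of the rule 'x > theta_k'."""
    thetas = {}
    for k in range(1, K):
        rows = [(x, z, w) for (x, kk), z, w in binary_examples if kk == k]
        xs = sorted({x for x, _, _ in rows})
        candidates = [xs[0] - 1.0]
        candidates += [(a + b) / 2 for a, b in zip(xs, xs[1:])]
        candidates += [xs[-1] + 1.0]

        def weighted_error(theta, rows=rows):
            return sum(w for x, z, w in rows if (x > theta) != z)

        thetas[k] = min(candidates, key=weighted_error)
    return lambda x, k: x > thetas[k]


def predict_rank(g, x, K):
    """Step 3: r_g(x) = 1 + number of queries k answered Y."""
    return 1 + sum(g(x, k) for k in range(1, K))


K = 3
data = [(0.1, 1), (0.2, 1), (0.4, 2), (0.5, 2), (0.7, 3), (0.9, 3)]
# absolute cost vectors c_n[k] = |y_n - k|
examples = [(x, y, [abs(y - k) for k in range(1, K + 1)]) for x, y in data]
g = train_thresholds(to_weighted_binary(examples, K), K)
print([predict_rank(g, x, K) for x, _ in data])
```

Swapping `train_thresholds` for another weighted binary learner leaves steps 1 and 3 unchanged. Only step 2 depends on the choice of learner.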

## The Reduction Framework (2/2)

- **performance guarantee:** accurate binary predictions $\Longrightarrow$ correct ranks
- **wide applicability:** works with any ordinal $\mathbf{c}$ & any binary classification algorithm
- **simplicity:** mild computation overheads with $O(NK)$ binary examples
- **up-to-date:** allows new improvements in binary classification to be immediately inherited by ordinal ranking


## Theoretical Guarantees of Reduction (1/3)

is reduction a practical approach? **YES!**

error transformation theorem (Li and Lin, 2007)

For **consistent predictions or strongly ordinal costs**, if $g$ makes test error $\Delta$ in the induced binary problem, then $r_g$ pays test cost at most $\Delta$ in ordinal ranking.

- a one-step extension of the per-example cost bound
- conditions: general and minor
- performance guarantee in the absolute sense: accuracy in binary classification $\Longrightarrow$ correctness in ordinal ranking

Is reduction really **optimal**? What if the induced binary problem is “too hard”?

## Theoretical Guarantees of Reduction (2/3)

is reduction an optimal approach? **YES!**

regret transformation theorem (Lin, 2008)

For a general class of **ordinal costs**, if $g$ is $\epsilon$-close to the optimal binary classifier $g^*$, then $r_g$ is $\epsilon$-close to the optimal ordinal ranker $r^*$.

- error guarantee in the relative setting: regardless of the absolute hardness of the induced binary prob., optimality in binary classification $\Longrightarrow$ optimality in ordinal ranking
- reduction does not introduce additional hardness

“reduction to binary” sufficient, but necessary? i.e., is reduction a **principled approach**?

## Theoretical Guarantees of Reduction (3/3)

is reduction a principled approach? **YES!**

equivalence theorem (Lin, 2008)

For a general class of **ordinal costs**, ordinal ranking is learnable by a learning model **if and only if** binary classification is learnable by the associated learning model.

- a surprising equivalence: ordinal ranking is **as easy as** binary classification

reduction to binary classification: **practical, optimal, and principled**


## Unifying Existing Algorithms

ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
|---|---|---|
| PRank (Crammer and Singer, 2002) | absolute | modified perceptron rule |
| kernel ranking (Rajaram et al., 2003) | classification | modified hard-margin SVM |
| SVOR-EXP (Chu and Keerthi, 2005) | classification | modified soft-margin SVM |
| SVOR-IMC (Chu and Keerthi, 2005) | absolute | modified soft-margin SVM |
| ORBoost-LR (Lin and Li, 2006) | classification | modified AdaBoost |
| ORBoost-All (Lin and Li, 2006) | absolute | modified AdaBoost |

- development and implementation time could have been saved
- e.g. correctness proof significantly simplified (PRank)
- algorithmic structure revealed (SVOR, ORBoost)

**variants of existing algorithms can be designed quickly by tweaking reduction**

## Designing New Algorithms Effortlessly

ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
|---|---|---|
| Reduction-C4.5 | absolute | standard C4.5 decision tree |
| Reduction-SVM | absolute | standard soft-margin SVM |

SVOR (modified SVM) v.s. Reduction-SVM (standard SVM):

[Figure: bar chart of avg. training time (hour), SVOR vs. RED-SVM, on the data sets ban, com, cal, cen]

**advantages of core binary classification algorithm**
**inherited in the new ordinal ranking one**

## Designing New Algorithms Easily (1/2)

say, we have some ordinal rankers that predict your preference on movies:

- $r_1(x)$ = an ordinal ranker based on actor performance
- $r_2(x)$ = an ordinal ranker based on actress performance
- $r_3(x)$ = an ordinal ranker based on an expert opinion
- $r_4(x)$ = an ordinal ranker based on box-office reports

no single ordinal ranker can explain your preference well, but a **combination of them** possibly can

**ensemble learning:** how can machines combine simple functions to make complicated decisions?

previously: no good ensemble algorithm for ordinal ranking

## Designing New Algorithms Easily (2/2)

good ensemble alg. for bin. class.: AdaBoost (Freund and Schapire, 1997)

for $t = 1, 2, \cdots, T$,

1. find a simple $g_t$ that matches best with the current “view” of $\{(x_n, y_n)\}$
2. give a larger weight $v_t$ to $g_t$ if the match is stronger
3. update “view” by emphasizing the weights of those $(x_n, y_n)$ that $g_t$ doesn’t predict well

prediction: majority vote of $\{(v_t, g_t(x))\}$

good ensemble alg. for ord. rank.: AdaBoost.OR (Lin, 2008)

for $t = 1, 2, \cdots, T$,

1. find a simple $r_t$ that matches best with the current “view” of $\{(x_n, y_n)\}$
2. give a larger weight $v_t$ to $r_t$ if the match is stronger
3. update “view” by emphasizing the costs $\mathbf{c}_n$ of those $(x_n, y_n)$ that $r_t$ doesn’t predict well

prediction: weighted median of $\{(v_t, r_t(x))\}$

**AdaBoost.OR = reduction + any cost + AdaBoost + math derivation**
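Only the prediction step is sketched here, not the full AdaBoost.OR training loop. The code assumes the weighted median is the smallest rank whose cumulative weight reaches half of the total weight. Under that assumption it coincides with the counting rule applied to weighted majority votes on the associated binary queries "is $r_t(x) > k$?". All names and the numeric weights are illustrative.

```python
def weighted_median(weights, ranks):
    """Smallest rank r such that the rankers predicting <= r hold at least
    half of the total weight (assumed definition of the weighted median)."""
    total = sum(weights)
    cumulative = 0
    for r, v in sorted(zip(ranks, weights)):
        cumulative += v
        if 2 * cumulative >= total:
            return r
    raise ValueError("weights must be positive and non-empty")


def median_via_binary_votes(weights, ranks, K):
    """Same prediction through the reduction: for each k, take a weighted
    majority vote on 'is r_t(x) > k?', then count the Y answers."""
    total = sum(weights)
    yes_votes = 0
    for k in range(1, K):
        weight_for_yes = sum(v for v, r in zip(weights, ranks) if r > k)
        if 2 * weight_for_yes > total:
            yes_votes += 1
    return 1 + yes_votes


v = [4, 3, 2, 1]      # hypothetical ranker weights v_t
preds = [4, 2, 5, 3]  # hypothetical predictions r_t(x), with K = 5
print(weighted_median(v, preds), median_via_binary_votes(v, preds, 5))
```

The agreement between the two functions depends on the tie-breaking convention chosen here (ties on a binary vote count as N), so it should be read as an illustration of the idea rather than as the exact rule used by AdaBoost.OR.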


## Proving New Generalization Theorems

Ordinal Ranking (Lin, 2008)

For AdaBoost.OR, with prob. $> 1 - \delta$, expected test abs. cost of $r$

$$\le \underbrace{\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} [\![ \bar{\rho}(r(x_n), y_n, k) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\mathrm{poly}\!\left(K, \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Bin. Class. (Schapire et al., 1998)

For AdaBoost, with prob. $> 1 - \delta$, expected test err. of $g$

$$\le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(g(x_n), y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\mathrm{poly}\!\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

**new ordinal ranking theorem = reduction + any cost + bin. thm. + math derivation**


## Reduction-C4.5 v.s. SVOR

[Figure: bar chart of avg. test absolute cost, SVOR (Gauss) vs. RED-C4.5, on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

- C4.5: a (too) simple binary classifier (decision trees)
- SVOR: state-of-the-art ordinal ranking algorithm

**even simple Reduction-C4.5**
**sometimes beats SVOR**

## Reduction-SVM v.s. SVOR

[Figure: bar chart of avg. test absolute cost, SVOR (Gauss) vs. RED-SVM (Perc.), on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

- SVM: one of the most powerful binary classification algorithms
- SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

**Reduction-SVM without modification**
**often better than SVOR and faster**


## Conclusion

- reduction framework: not only simple, intuitive, and useful but also **practical, optimal, and principled**
- algorithmic reduction:
  - take existing ordinal ranking algorithms as **special cases**
  - design new and better ordinal ranking algorithms **easily**
- theoretic reduction: derive **new generalization guarantee** of ordinal rankers
- **superior experimental results:** better performance and faster training time

**reduction keeps ordinal ranking up-to-date with binary classification**