## From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk at Caltech CS/IST Lunch Bunch March 4, 2008

Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06)

& discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap


**Introduction to Ordinal Ranking**

Hsuan-Tien Lin (Caltech) From Ordinal Ranking to Binary Classification 03/04/2008 2 / 32


## Which Age-Group?

(Figure: a picture to be placed in one of the age groups 1, 2, 3, 4; the highlighted answer is **2**.)

**rank: a finite ordered set of labels Y = {1, 2, · · · , K }**


## Hot or Not?

http://www.hotornot.com

**rank: natural representation of human preferences**


## How Much Did You Like These Movies?

http://www.netflix.com

**goal: use “movies you’ve rated” to automatically**
**predict your preferences (ranks) on future movies**


## How Does a Machine Learn Your Preferences?

left-hand side (human): Alice → (movie, rank) pairs → brain of Bob → good hypothesis, picked from the alternatives: prefer romance/action/etc.

right-hand side (machine): You → examples (movie $x_n$, rank $y_n$) → learning algorithm → good hypothesis $r(x)$, picked from the learning model

**challenge: how to make the right-hand-side work?**


## Ordinal Ranking Problem

given: N examples (input $x_n$, rank $y_n$) ∈ X × Y, e.g.

age-group: X = encoding(human pictures), Y = {1, · · · , 4}

hotornot: X = encoding(human pictures), Y = {1, · · · , 10}

netflix: X = encoding(movies), Y = {1, · · · , 5}

goal: an ordinal ranker (hypothesis) $r(x)$ that “closely predicts” the ranks $y$ associated with some **unseen inputs $x$**

**a hot and important research problem:**

relatively new for machine learning

connecting classification and regression

matching human preferences, with many applications in social science and information retrieval


## Ongoing Heat: Netflix Million Dollar Prize

(since 10/2006) a huge joint ordinal ranking problem

given: each user $u$ (480,189 users) rates $N_u$ (from tens to hundreds) movies, a total of $\sum_u N_u = 100{,}480{,}507$ examples

goal: personalized predictions $r_u(x)$ on 2,817,131 testing queries $(u, x)$

**the first team being 10% better than the original Netflix system gets a million USD**


## Properties of Ranks

Y = {1, 2, · · · , 5} representing **order:**

(Figure: one rating < another rating.)

relabeling by (3, 1, 2, 4, 5) erases information

general multiclass classification cannot properly use ordering information

**not carrying numerical information:**

(Figure: one rating is not 2.5 times better than another.)

relabeling by (2, 3, 5, 9, 16) shouldn’t change results

general metric regression deteriorates without correct numerical information

**ordinal ranking resides uniquely between multiclass classification and metric regression**


## Cost of Wrong Prediction

ranks carry no numerical meaning: how to say “closely predict”?

artificially quantify the **cost of being wrong**

infant (1), child (2), teen (3), adult (4)

small mistake: classify a child as a teen

big mistake: classify an infant as an adult

cost vector **c** of example (x, y, **c**): **c**[k] = cost when predicting (x, y) as rank k

e.g. for an example with rank 2 (child), a reasonable cost is **c = (2, 0, 1, 4)**

**closely predict: small cost**


## Reasonable Cost Vectors

For an ordinal example (x, y, **c**), the cost vector **c** should

respect the rank y: **c**[y] = 0; **c**[k] ≥ 0

respect the ordinal information: V-shaped or even convex

(Figure: two plots of the cost $C_{y,k}$ over the predicted ranks 1: infant, 2: child, 3: teenager, 4: adult; one V-shaped, one convex.)

V-shaped: pay more when predicting further away

convex: pay **increasingly** more when further away

| cost | classification | absolute | squared (Netflix) |
|---|---|---|---|
| **c**[k] | $[\![ y \neq k ]\!]$ | $\lvert y - k \rvert$ | $(y - k)^2$ |
| shape | V-shaped only | convex | convex |
| example (y = 2, K = 4) | (1, 0, 1, 1) | (1, 0, 1, 2) | (1, 0, 1, 4) |
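The three costs and the two shape conditions can be checked mechanically. The following is a minimal sketch, not code from the talk: the function names are made up for illustration, and "V-shaped" is taken here to mean zero at the true rank and non-decreasing when moving away from it on either side (an assumption about how ties are treated).

```python
from typing import List, Sequence

def classification_cost(y: int, K: int) -> List[int]:
    """c[k] = [y != k] for k = 1..K (stored 0-indexed)."""
    return [int(k != y) for k in range(1, K + 1)]

def absolute_cost(y: int, K: int) -> List[int]:
    """c[k] = |y - k|."""
    return [abs(y - k) for k in range(1, K + 1)]

def squared_cost(y: int, K: int) -> List[int]:
    """c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]

def is_v_shaped(c: Sequence[float], y: int) -> bool:
    """Zero at rank y, never cheaper when predicting further from y."""
    i = y - 1
    left_ok = all(c[j] >= c[j + 1] for j in range(i))
    right_ok = all(c[j] <= c[j + 1] for j in range(i, len(c) - 1))
    return c[i] == 0 and min(c) >= 0 and left_ok and right_ok

def is_convex(c: Sequence[float]) -> bool:
    """Successive differences c[k+1] - c[k] never decrease."""
    return all(c[j + 1] - c[j] <= c[j + 2] - c[j + 1] for j in range(len(c) - 2))

for name, cost in [("classification", classification_cost),
                   ("absolute", absolute_cost),
                   ("squared", squared_cost)]:
    c = cost(2, 4)
    print(name, c, "V-shaped:", is_v_shaped(c, 2), "convex:", is_convex(c))
```

Under these definitions the classification cost is V-shaped but not convex, while the absolute and squared costs are both.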


## Our Contributions

a new framework that works with any reasonable cost, and ...

reduces ordinal ranking to binary classification **systematically**

unifies and **clearly explains** many existing ordinal ranking algorithms

makes the design of new ordinal ranking algorithms **much easier**

allows **simple and intuitive proof** for new ordinal ranking theorems

leads to **promising experimental results**

Figure: answer; traditional method; our method (three plots, both axes from 0 to 1)


**Reduction from Ordinal Ranking to Binary Classification**


## Thresholded Model

If we can first compute the score s(x ) of a movie x , how can we construct r (x ) from s(x )?

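(Figure: examples placed on the axis of the score function s(x), which the ordered thresholds θ1, θ2, θ3 cut into the ranks 1, 2, 3, 4 output by the ordinal ranker r(x); the target rank y of each example is marked alongside.)

quantize s(x) by some **ordered threshold θ**

commonly used in previous work: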

thresholded perceptrons (PRank, Crammer and Singer, 2002)

thresholded hyperplanes (SVOR, Chu and Keerthi, 2005)

thresholded ensembles (ORBoost, Lin and Li, 2006)

**thresholded model: $r(x) = \min\{k : s(x) < \theta_k\}$** (taking $\theta_K = \infty$ so that the minimum always exists)
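As an illustration of the thresholded model, here is a minimal sketch that turns a score into a rank. The function name and the example thresholds are assumptions made for this sketch; only the rule r(x) = min{k : s(x) < θ_k} comes from the slide, with θ_K treated as +∞.

```python
from bisect import bisect_right
from typing import Sequence

def thresholded_rank(score: float, thresholds: Sequence[float]) -> int:
    """Return min{k : score < theta_k}, with an implicit theta_K = +inf.

    `thresholds` holds theta_1 <= ... <= theta_{K-1}.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ordered")
    # The number of thresholds <= score is exactly how many ranks are skipped.
    return bisect_right(thresholds, score) + 1

thetas = [1.0, 2.0, 3.0]  # hypothetical thresholds for K = 4 ranks
print([thresholded_rank(s, thetas) for s in (0.2, 1.5, 2.0, 9.9)])  # [1, 2, 3, 4]
```

A score exactly equal to a threshold falls into the higher rank, because the rule uses a strict inequality.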


## Key of Reduction: Associated Binary Questions

getting the rank using a thresholded model

1. is $s(x) > \theta_1$? Yes
2. is $s(x) > \theta_2$? No
3. is $s(x) > \theta_3$? No
4. is $s(x) > \theta_4$? No

generally, how do we query the rank of a movie x?

1. is movie x better than rank 1? Yes
2. is movie x better than rank 2? No
3. is movie x better than rank 3? No
4. is movie x better than rank 4? No

**associated binary questions g(x, k): is movie x better than rank k?**


## More on Associated Binary Questions

g(x, k): is movie x better than rank k?

e.g. thresholded model $g(x, k) = \mathrm{sign}(s(x) - \theta_k)$

K − 1 binary classification problems w.r.t. each k

(Figure: the same examples on the s(x) axis with ranks 1 to 4; for each k = 1, 2, 3, the examples below $\theta_k$ carry the binary label $(z)_k$ = N, those above carry Y, and g(x, k) is asked to reproduce these labels.)

let $((x, k), (z)_k)$ be binary examples

(x, k): extended input w.r.t. the k-th query

$(z)_k$: binary label Y/N

**if $g(x, k) = (z)_k$ for all k, we can compute $r_g(x)$ from g(x, k) such that $r_g(x) = y$**
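The extended binary examples can be written down directly. This is a small sketch with hypothetical helper names; the label rule (z)_k = Y exactly when y > k is read off the question "is x better than rank k?", and Y/N are represented as True/False.

```python
from typing import Any, List, Tuple

def binary_labels(y: int, K: int) -> List[bool]:
    """(z)_k for k = 1..K-1: True ('Y') iff the true rank y is better than k."""
    return [y > k for k in range(1, K)]

def extend(x: Any, y: int, K: int) -> List[Tuple[Tuple[Any, int], bool]]:
    """Turn one ordinal example (x, y) into K-1 binary examples ((x, k), (z)_k)."""
    return [((x, k), z) for k, z in zip(range(1, K), binary_labels(y, K))]

print(extend("movie", 2, 4))
# [(('movie', 1), True), (('movie', 2), False), (('movie', 3), False)]

# If every question is answered exactly as labeled, the rank is recoverable:
for y in range(1, 5):
    answers = binary_labels(y, 4)
    assert 1 + sum(answers) == y
```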


## Computing Ranks from Associated Binary Questions

g(x, k): is movie x better than rank k?

Consider g(x, 1), g(x, 2), · · · , g(x, K−1):

consistent answers: (Y, Y, N, N, · · · , N)

extracting the rank from consistent answers:

minimum index searching: $r_g(x) = \min\{k : g(x, k) = \mathrm{N}\}$

counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$

two approaches equivalent for consistent answers

noisy/inconsistent answers? e.g. (Y, N, Y, Y, N, N, Y, N, N)

counting is simpler to analyze, and is robust to noise

**are all associated binary questions of the same importance?**
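The two extraction rules can be compared on the answer vectors from the slide. This is a sketch with hypothetical function names; when every answer is Y, the minimum-index rule is assumed to return K, the only rank left.

```python
from typing import Sequence

def rank_by_min_index(answers: Sequence[bool]) -> int:
    """min{k : g(x, k) = N}; falls back to K when every answer is Y."""
    for k, is_yes in enumerate(answers, start=1):
        if not is_yes:
            return k
    return len(answers) + 1

def rank_by_counting(answers: Sequence[bool]) -> int:
    """1 + number of Y answers."""
    return 1 + sum(1 for is_yes in answers if is_yes)

Y, N = True, False
consistent = [Y, Y, N, N]
noisy = [Y, N, Y, Y, N, N, Y, N, N]

print(rank_by_min_index(consistent), rank_by_counting(consistent))  # 3 3
print(rank_by_min_index(noisy), rank_by_counting(noisy))            # 2 5
```

On the noisy vector the minimum-index rule stops at the first N, whereas counting lets the later Y answers contribute as well.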


## Importance of Associated Binary Questions

given a movie x with rank y = 2 and **c**[k] = (y − k)^2

| question | answers 1 | answers 2 | answers 3 | answers 4 |
|---|---|---|---|---|
| g(x, 1): is x better than rank 1? | No | Yes | Yes | Yes |
| g(x, 2): is x better than rank 2? | No | No | Yes | Yes |
| g(x, 3): is x better than rank 3? | No | No | No | Yes |
| g(x, 4): is x better than rank 4? | No | No | No | No |
| $r_g(x)$ | 1 | 2 | 3 | 4 |
| $\mathbf{c}[r_g(x)]$ | 1 | 0 | 1 | 4 |

1 more for answering question 2 wrong; but 3 more for answering question 3 wrong

$(w)_k \equiv \lvert \mathbf{c}[k+1] - \mathbf{c}[k] \rvert$: the importance of $((x, k), (z)_k)$

per-example error bound (Li and Lin, 2007; Lin, 2008): for **consistent answers or convex costs**,

$$\mathbf{c}[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]$$

**accurate binary answers ⟹ correct ranks**
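The per-example bound can be checked by brute force on a small case. The sketch below uses hypothetical helper names, takes the weights as absolute differences of adjacent costs, and uses the counting rule for the rank; it enumerates every possible answer vector, including inconsistent ones.

```python
from itertools import product
from typing import List, Sequence

def squared_cost(y: int, K: int) -> List[int]:
    return [(y - k) ** 2 for k in range(1, K + 1)]

def weights(c: Sequence[float]) -> List[float]:
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1 (c is stored 0-indexed)."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]

def bound_holds(c: Sequence[float], y: int) -> bool:
    """Check c[r_g(x)] <= sum_k (w)_k [(z)_k != g(x, k)] for every answer vector."""
    K = len(c)
    z = [y > k for k in range(1, K)]   # correct answers
    w = weights(c)
    for g in product([False, True], repeat=K - 1):
        rank = 1 + sum(g)              # counting rule
        paid = c[rank - 1]
        budget = sum(wk for wk, zk, gk in zip(w, z, g) if zk != gk)
        if paid > budget:
            return False
    return True

print(weights(squared_cost(2, 4)))                                     # [1, 1, 3]
print(all(bound_holds(squared_cost(y, 4), y) for y in range(1, 5)))    # True
print(bound_holds([1, 0, 1, 1], 2))                                    # False
```

The last line uses the classification cost, which is not convex: the inconsistent answers (Y, N, Y) give rank 3 at cost 1 while the only wrong answer has weight 0, which shows why the bound is stated for consistent answers or convex costs.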


## The Reduction Framework

1. transform ordinal examples $(x_n, y_n, \mathbf{c}_n)$ to weighted binary examples $((x_n, k), (z_n)_k, (w_n)_k)$
2. use your favorite algorithm on the weighted binary examples and get K − 1 binary classifiers (i.e., one big joint binary classifier) g(x, k)
3. for each new input x, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$

ordinal examples $(x_n, y_n, \mathbf{c}_n)$ ⇒ weighted binary examples $((x_n, k), (z_n)_k, (w_n)_k)$, k = 1, · · · , K − 1 ⇒ core binary classification algorithm ⇒ associated binary classifiers g(x, k), k = 1, · · · , K − 1 ⇒ ordinal ranker $r_g(x)$
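The three steps can be sketched end to end. Everything below is illustrative rather than the implementation behind the talk: the inputs are one-dimensional numbers, the "favorite algorithm" is replaced by a hypothetical weighted threshold rule fitted separately for each k, and all names are made up for this sketch.

```python
from typing import Callable, List, Sequence, Tuple

Ordinal = Tuple[float, int, Sequence[float]]      # (x, y, c)
Binary = Tuple[Tuple[float, int], bool, float]    # ((x, k), (z)_k, (w)_k)

def absolute_cost(y: int, K: int) -> List[int]:
    return [abs(y - k) for k in range(1, K + 1)]

def to_weighted_binary(examples: Sequence[Ordinal], K: int) -> List[Binary]:
    """Step 1: label (z)_k = [y > k], weight (w)_k = |c[k+1] - c[k]|."""
    out: List[Binary] = []
    for x, y, c in examples:
        for k in range(1, K):
            out.append(((x, k), y > k, abs(c[k] - c[k - 1])))
    return out

def train_threshold_rules(binary: Sequence[Binary], K: int) -> Callable[[float, int], bool]:
    """Step 2 (stand-in learner): per k, pick theta_k minimizing weighted error of 'x > theta_k'."""
    thetas = {}
    for k in range(1, K):
        rows = [(x, z, w) for (x, kk), z, w in binary if kk == k]

        def weighted_error(theta: float) -> float:
            return sum(w for x, z, w in rows if (x > theta) != z)

        candidates = [float("-inf")] + sorted(x for x, _, _ in rows)
        thetas[k] = min(candidates, key=weighted_error)
    return lambda x, k: x > thetas[k]

def predict_rank(g: Callable[[float, int], bool], x: float, K: int) -> int:
    """Step 3: counting rule r_g(x) = 1 + #{k : g(x, k) = Y}."""
    return 1 + sum(g(x, k) for k in range(1, K))

K = 4
xs = [0.1, 0.4, 1.2, 1.7, 2.3, 2.9, 3.5]   # toy inputs (assumed data)
ys = [1, 1, 2, 2, 3, 3, 4]
train = [(x, y, absolute_cost(y, K)) for x, y in zip(xs, ys)]
g = train_threshold_rules(to_weighted_binary(train, K), K)
predictions = [predict_rank(g, x, K) for x in xs]
print(predictions)  # [1, 1, 2, 2, 3, 3, 4]
```

Swapping `train_threshold_rules` for any other learner that accepts weighted binary examples leaves steps 1 and 3 unchanged, which is the point of the framework.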


## Properties of Reduction

performance guarantee: accurate binary answers ⟹ correct ranks

wide applicability: systematic; works with any reasonable **c** and any binary classification algorithm

up-to-date: allows new improvements in binary classification to be immediately inherited by ordinal ranking

**If I have seen further it is by standing on the shoulders of Giants** (I. Newton)



## Theoretical Guarantees of Reduction (1/3)

is reduction a reasonable approach? **YES!**

error transformation theorem (Li and Lin, 2007)

For **consistent answers or convex costs**, if g makes test error ∆ in the induced binary problem, then $r_g$ pays test cost at most ∆ in ordinal ranking.

a one-step extension of the per-example error bound

conditions: general and minor

performance guarantee in the absolute sense: accuracy in binary classification ⟹ correctness in ordinal ranking
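The "one-step extension" can be read as taking expectations on both sides of the per-example error bound. The following is a sketch under assumptions: test examples $(x, y, \mathbf{c})$ are drawn from some distribution $P$, and ∆ is taken to be the unnormalized weighted binary test error (how the slide normalizes ∆ is not stated).

```latex
\mathbb{E}_{(x, y, \mathbf{c}) \sim P}\bigl[\mathbf{c}[r_g(x)]\bigr]
\;\le\;
\mathbb{E}_{(x, y, \mathbf{c}) \sim P}\Bigl[\sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]\Bigr]
\;=\; \Delta
```

The left side is the expected test cost of $r_g$; the right side is the weighted error of g on the induced binary problem.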
**What if the induced binary problem is “too hard” and even the best $g_*$ can only commit a big ∆?**


## Theoretical Guarantees of Reduction (2/3)

is reduction a promising approach? **YES!**

regret transformation theorem (Lin, 2008)

For a general class of **reasonable costs**, if g is ε-close to the optimal binary classifier $g_*$, then $r_g$ is ε-close to the optimal ordinal ranker $r_*$.

error guarantee in the relative setting: regardless of the absolute hardness of the induced binary prob., optimality in binary classification ⟹ optimality in ordinal ranking

reduction does not introduce additional hardness

**It is sufficient to go with reduction plus binary classification, but is it necessary?**


## Theoretical Guarantees of Reduction (3/3)

is reduction a principled approach? **YES!**

equivalence theorem (Lin, 2008)

For a general class of **reasonable costs**, ordinal ranking is learnable by a learning model **if and only if** binary classification is learnable by the associated learning model.

a surprising equivalence: ordinal ranking is **as easy as** binary classification

“without loss of generality”, we can just focus on binary classification

**reduction to binary classification:**

**systematic, reasonable, promising, and principled**


**Usefulness of the Reduction Framework**


## Unifying Existing Algorithms

| ordinal ranking | cost | binary classification algorithm |
|---|---|---|
| PRank (Crammer and Singer, 2002) | absolute | modified perceptron rule |
| kernel ranking (Rajaram et al., 2003) | classification | modified hard-margin SVM |
| SVOR-EXP (Chu and Keerthi, 2005) | classification | modified soft-margin SVM |
| SVOR-IMC (Chu and Keerthi, 2005) | absolute | modified soft-margin SVM |
| ORBoost-LR (Lin and Li, 2006) | classification | modified AdaBoost |
| ORBoost-All (Lin and Li, 2006) | absolute | modified AdaBoost |

if the reduction framework had been there,

development and implementation time could have been saved

correctness proof significantly simplified (PRank)

algorithmic structure revealed (SVOR, ORBoost)

**variants of existing algorithms can be designed quickly by tweaking reduction**


## Designing New Algorithms (1/2)

| ordinal ranking | cost | binary classification algorithm |
|---|---|---|
| Reduction-C4.5 | absolute | standard C4.5 decision tree |
| Reduction-AdaBoost | absolute | standard AdaBoost |
| Reduction-SVM | absolute | standard soft-margin SVM |

SVOR (modified SVM) vs. Reduction-SVM (standard SVM):

(Figure: bar chart of the avg. training time (hour) of SVOR and RED-SVM on the data sets ban, com, cal, cen; vertical axis from 0 to 6.)

**advantages of core binary classification algorithm inherited in the new ordinal ranking one**


## Designing New Algorithms (2/2)

AdaBoost (Freund and Schapire, 1997)

for t = 1, 2, · · · , T,

1. find a simple $g_t$ that matches best with the current “view” of $\{(X_n, Y_n)\}$
2. give a larger weight $v_t$ to $g_t$ if the match is stronger
3. update “view” by emphasizing the weights of those $(X_n, Y_n)$ that $g_t$ doesn’t predict well

prediction: majority vote of $\{(v_t, g_t(x))\}$

AdaBoost.OR (Lin, 2008)

for t = 1, 2, · · · , T,

1. find a simple $r_t$ that matches best with the current “view” of $\{(x_n, y_n)\}$
2. give a larger weight $v_t$ to $r_t$ if the match is stronger
3. update “view” by emphasizing the costs $\mathbf{c}_n$ of those $(x_n, y_n)$ that $r_t$ doesn’t predict well

prediction: weighted median of $\{(v_t, r_t(x))\}$

**AdaBoost.OR: an extension of Reduction-AdaBoost; a parallel of AdaBoost in ordinal ranking**
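The weighted-median prediction step can be sketched as follows. The slide does not spell out the tie rule, so this sketch assumes the weighted median is the smallest rank whose cumulative weight reaches half of the total; the function name is hypothetical.

```python
from typing import Sequence

def weighted_median(ranks: Sequence[int], weights: Sequence[float]) -> int:
    """Smallest rank k such that rankers predicting <= k hold at least half the total weight."""
    total = sum(weights)
    running = 0.0
    for rank, weight in sorted(zip(ranks, weights)):
        running += weight
        if running >= total / 2:
            return rank
    raise ValueError("need at least one ranker")

# Three base rankers r_t(x) with weights v_t (made-up numbers):
print(weighted_median([1, 3, 4], [0.2, 0.5, 0.3]))   # 3
# Equal weights reduce to an ordinary (lower) median:
print(weighted_median([2, 5, 2, 1], [1, 1, 1, 1]))   # 2
```

Unlike a weighted average, the result is always one of the ranks that some base ranker actually predicted.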


## Proving New Theorems

Binary Classification (Bartlett and Shawe-Taylor, 1998)

For SVM, with prob. > 1 − δ,

$$\text{expected test error} \le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(X_n, Y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Ordinal Ranking (Li and Lin, 2007)

For SVOR or Red.-SVM, with prob. > 1 − δ,

$$\text{expected test cost} \le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} (w_n)_k \, [\![ \bar{\rho}\bigl((x_n, k), (z_n)_k\bigr) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

**new test cost bounds with any c[·]**


## Reduction-C4.5 vs. SVOR

(Figure: bar chart of the avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on the data sets pyr, mac, bos, aba, ban, com, cal, cen; vertical axis from 0 to 2.5.)

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

**even simple Reduction-C4.5 sometimes beats SVOR**


## Reduction-SVM vs. SVOR

(Figure: bar chart of the avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on the data sets pyr, mac, bos, aba, ban, com, cal, cen; vertical axis from 0 to 2.5.)

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

**Reduction-SVM without modification often better than SVOR∗ and faster**


## Can We Win the Netflix Prize with Reduction?

possibly

a principled view of the problem

now easy to apply known binary classification techniques or to design suitable ordinal ranking approaches

e.g., AdaBoost.OR “boosted” some simple $r_t$ and reduced the test cost from 1.0704 to 1.0343

but not yet

need 0.8563 to win

the problem has its own characteristics

huge data set: computational bottleneck

allows real-valued predictions: $r(x) \in \mathbb{R}$ instead of $r(x) \in \{1, \cdots, K\}$

encoding(movie), encoding(user): important

**many interesting research problems arose during “CS156b: Learning Systems”**


## Conclusion

reduction framework: simple, intuitive, and useful for ordinal ranking

algorithmic reduction: unifying existing ordinal ranking algorithms; designing new ordinal ranking algorithms

theoretic reduction: new bounds on ordinal ranking test cost

promising experimental results: some for better performance; some for faster training time

**reduction keeps ordinal ranking up-to-date with binary classification**
