## From Ordinal Ranking to Binary Classification

### Hsuan-Tien Lin

### Learning Systems Group, California Institute of Technology

### Talk at CS Department, National Tsing-Hua University, March 27, 2008

### Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06)

### & discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap

## Outline

**1** **Introduction to Machine Learning**

**2** The Ordinal Ranking Setup

**3** Reduction from Ordinal Ranking to Binary Classification

### Algorithmic Usefulness of Reduction

### Theoretical Usefulness of Reduction

### Experimental Performance of Reduction

**4** Conclusion

## Apple, Orange, or Strawberry?

**?**

### apple orange strawberry

**how can a machine learn to classify?**

## Supervised Machine Learning

### Parent → (picture, category) pairs → Kid’s brain → good decision function

### possibilities → Kid’s brain

### Truth f(x) + noise e(x) → examples (picture $x_n$, category $y_n$) → learning algorithm → good decision function h(x) ≈ f(x)

### learning model $\{h_\alpha(x)\}$ → learning algorithm

### challenge: see only $\{(x_n, y_n)\}$ without knowing f(x) or e(x)

### $\overset{?}{\Longrightarrow}$ **generalize to unseen (x, y) w.r.t. f(x)**

## Machine Learning Research

### What can the machines learn? (application)

### concrete: computer vision, architecture optimization, information retrieval, bio-informatics, computational finance, · · ·

### abstract setups: classification, regression, · · ·

### How can the machines learn? (algorithm)

### faster

### better **generalization**

### Why can the machines learn? (theory)

### paradigms: statistical learning, reinforcement learning, · · ·

### generalization guarantees

**new opportunities keep coming**

**from new applications/setups**

## Outline

**1** Introduction to Machine Learning

**2** **The Ordinal Ranking Setup**

**3** Reduction from Ordinal Ranking to Binary Classification

### Algorithmic Usefulness of Reduction

### Theoretical Usefulness of Reduction

### Experimental Performance of Reduction

**4** Conclusion

## Which Age-Group?

**2**

### infant (1) child (2) teen (3) adult (4)

**rank: a finite ordered set of labels Y = {1, 2, · · · , K }**

## Properties of Ordinal Ranking (1/2)

### ranks represent **order information**

### infant (1) < child (2) < teen (3) < adult (4)

**general multiclass classification cannot properly use order information**

## Hot or Not?

### http://www.hotornot.com

**rank: natural representation of human preferences**

## Properties of Ordinal Ranking (2/2)

### ranks do **not carry numerical information**

### rating 9 not 2.25 times “hotter” than rating 4

### actual metric hidden: infant (ages 1–3), child (ages 4–12), teen (ages 13–19), adult (ages 20–)

**general metric regression deteriorates without correct numerical information**

## How Much Did You Like These Movies?

### http://www.netflix.com

**goal: use “movies you’ve rated” to automatically**

**predict your preferences (ranks) on future movies**

## Ordinal Ranking Setup

### Given

### N examples (input x _{n} , rank y _{n} ) ∈ X × Y

### age-group: X = encoding(human pictures), Y = {1, · · · , 4}

### hotornot: X = encoding(human pictures), Y = {1, · · · , 10}

### netflix: X = encoding(movies), Y = {1, · · · , 5}

### Goal

### an ordinal ranker (decision function) r (x ) that “closely predicts”

### the ranks y associated with some **unseen inputs x**

**ordinal ranking: a hot and important research problem**

## Ongoing Heat: Netflix Million Dollar Prize (since 10/2006)

### Given

### each user u (480,189 users) rates $N_u$ (from tens to thousands) movies x, a total of $\sum_u N_u = 100{,}480{,}507$ examples

### Goal

### personalized ordinal rankers $r_u(x)$ evaluated on 2,817,131 “unseen” queries (u, x)

**the first team being 10% better than**

**original Netflix system gets** **a million USD**

## Cost of Wrong Prediction

### ranks carry no numerical information: how to say “better”?

### artificially quantify the **cost of being wrong**

### e.g. loss of customer loyalty when the system says one rating but you feel another

### cost vector **c of example (x , y , c):**

**c[k ] = cost when predicting (x , y ) as rank k**

### e.g. for (Sweet Home Alabama, y = 2), a proper cost is **c = (1, 0, 2, 10, 15)**

**closely predict: small test cost**

## Ordinal Cost Vectors

### For an ordinal example (x, y, **c**), the cost vector **c** should

### follow the rank y: **c[y] = 0; c[k] ≥ 0**

### respect the ordinal information: V-shaped (ordinal) or even convex (strongly ordinal)

### Figure: two plots of the cost $C_{y,k}$ against the predicted rank k (1: infant, 2: child, 3: teenager, 4: adult)

### V-shaped: pay more when predicting further away

### convex: pay **increasingly** more when further away

| cost | definition | type | example (y = 2, K = 5) |
| --- | --- | --- | --- |
| classification | $c[k] = [\![ y \ne k ]\!]$ | ordinal | (1, 0, 1, 1, 1) |
| absolute | $c[k] = \lvert y - k \rvert$ | strongly ordinal | (1, 0, 1, 2, 3) |
| squared (Netflix) | $c[k] = (y - k)^2$ | strongly ordinal | (1, 0, 1, 4, 9) |
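The three cost vectors above can be generated and checked mechanically. The following is a minimal Python sketch, not code from the talk: the function names are illustrative assumptions, and because ranks are 1-based while Python lists are 0-based, the cost `c[k]` is stored at index `k - 1`.

```python
def classification_cost(y, K):
    """c[k] = 1 if k != y else 0, for k = 1..K."""
    return [int(k != y) for k in range(1, K + 1)]

def absolute_cost(y, K):
    """c[k] = |y - k|."""
    return [abs(y - k) for k in range(1, K + 1)]

def squared_cost(y, K):
    """c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]

def is_v_shaped(c, y):
    """Ordinal: c[y] = 0, cost never rises towards y and never falls after y."""
    left = all(c[i] >= c[i + 1] for i in range(0, y - 1))
    right = all(c[i] <= c[i + 1] for i in range(y - 1, len(c) - 1))
    return c[y - 1] == 0 and left and right

def is_convex(c):
    """Strongly ordinal: successive differences c[k+1] - c[k] never decrease."""
    diffs = [c[i + 1] - c[i] for i in range(len(c) - 1)]
    return all(d1 <= d2 for d1, d2 in zip(diffs, diffs[1:]))

for name, make in [("classification", classification_cost),
                   ("absolute", absolute_cost),
                   ("squared", squared_cost)]:
    c = make(2, 5)
    print(name, c, "V-shaped:", is_v_shaped(c, 2), "convex:", is_convex(c))
```

Under these definitions the classification cost is V-shaped but not convex, while the absolute and squared costs are both.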

## Our Contributions

### a theoretical and algorithmic foundation of ordinal ranking, which ...

### provides a methodology for designing new ordinal ranking algorithms with **any ordinal cost effortlessly**

### takes many existing ordinal ranking algorithms as **special cases**

### introduces **new theoretical guarantee on the** generalization performance of ordinal rankers

### leads to **superior experimental results**

### Figure: truth; traditional algorithm; our algorithm

## Outline

**1** Introduction to Machine Learning

**2** The Ordinal Ranking Setup

**3** **Reduction from Ordinal Ranking to Binary Classification**

### Algorithmic Usefulness of Reduction

### Theoretical Usefulness of Reduction

### Experimental Performance of Reduction

**4** Conclusion

## Threshold Model

### If we can first get an ideal score s(x ) of a movie x , how can we construct the discrete r (x ) from an analog s(x )?

### Figure: the score axis s(x) is cut by ordered thresholds $\theta_1, \theta_2, \theta_3$ into ranks 1 to 4; examples marked by target rank y are mapped to the outputs of the ordinal ranker r(x)

### quantize s(x) by some **ordered threshold θ**

### commonly used in previous work:

### threshold perceptrons (PRank, Crammer and Singer, 2002)

### threshold hyperplanes (SVOR, Chu and Keerthi, 2005)

### threshold ensembles (ORBoost, Lin and Li, 2006)

**threshold model: $r(x) = \min\{k : s(x) < \theta_k\}$**
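The threshold model can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than code from the talk: `threshold_rank` and the threshold values are hypothetical, the thresholds are assumed sorted, and a score above every threshold is assigned the top rank K.

```python
from bisect import bisect_right

def threshold_rank(score, thetas):
    """Return min{k : score < theta_k}, or K = len(thetas) + 1 if no
    threshold exceeds the score.

    thetas must be sorted: theta_1 <= theta_2 <= ... <= theta_{K-1}.
    """
    # bisect_right counts the thresholds that are <= score
    return bisect_right(thetas, score) + 1

thetas = [1.0, 2.0, 3.0]  # illustrative thresholds for K = 4 ranks
print([threshold_rank(s, thetas) for s in (0.5, 1.0, 2.5, 3.5)])
```

A score equal to a threshold falls into the higher rank, matching the strict inequality s(x) < θ_k in the definition.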

## Key of Reduction: Associated Binary Queries

### getting the rank using a threshold model

### 1. is s(x) > $\theta_1$? Yes

### 2. is s(x) > $\theta_2$? No

### 3. is s(x) > $\theta_3$? No

### 4. is s(x) > $\theta_4$? No

### generally, how do we query the rank of a movie x?

### 1. is movie x better than rank 1? Yes

### 2. is movie x better than rank 2? No

### 3. is movie x better than rank 3? No

### 4. is movie x better than rank 4? No

**associated binary queries: is movie x better than rank k?**


## More on Associated Binary Queries

### say, the machine uses g(x , k ) to answer the query

### “is movie x better than rank k ?”

### e.g. threshold model $g(x, k) = \mathrm{sign}(s(x) - \theta_k)$

### K − 1 binary classification problems w.r.t. each k

### Figure: the threshold diagram (score s(x), target rank y, predicted rank $r_g(x)$) with one row of desired answers $(z)_k$ per query: g(x, 1) is N below $\theta_1$ and Y above it, g(x, 2) is N below $\theta_2$ and Y above it, g(x, 3) is N below $\theta_3$ and Y above it

### let $((x, k), (z)_k)$ be binary examples

### (x, k): extended input w.r.t. k-th query

### $(z)_k$: desired binary answer Y/N

**If $g(x, k) = (z)_k$ for all k, we can compute $r_g(x)$ from g(x, k) s.t. $r_g(x) = y$.**

## Computing Ranks from Associated Binary Queries

### when g(x , k ) answers “is movie x better than rank k ?”

### Consider g(x, 1), g(x, 2), · · · , g(x, K−1)

### consistent predictions: (Y, Y, N, N, N, N, N)

### extracting the rank:

### minimum index searching: $r_g(x) = \min\{k : g(x, k) = \text{N}\}$

### counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

### two approaches equivalent for consistent predictions

### noisy/inconsistent predictions? e.g. (Y, N, Y, Y, N, N, Y)

**counting: simpler to analyze and robust to noise**
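The two extraction rules can be compared directly on the consistent and inconsistent answer strings from this slide. This is a small Python sketch; the helper names are illustrative assumptions, and an all-Y answer list is assumed to map to the top rank K under minimum index searching.

```python
def rank_by_min_index(answers):
    """answers[k-1] is True for 'Y' and False for 'N' on query k = 1..K-1."""
    for k, answer in enumerate(answers, start=1):
        if not answer:
            return k
    return len(answers) + 1  # every answer is 'Y': top rank K

def rank_by_counting(answers):
    """r_g(x) = 1 + number of 'Y' answers."""
    return 1 + sum(answers)

def parse(s):
    """Turn a string such as 'YYNNNNN' into a list of booleans."""
    return [ch == "Y" for ch in s]

consistent = parse("YYNNNNN")
noisy = parse("YNYYNNY")
print(rank_by_min_index(consistent), rank_by_counting(consistent))
print(rank_by_min_index(noisy), rank_by_counting(noisy))
```

The two rules agree on the consistent answers and disagree on the noisy ones: minimum index searching stops at the first N, while counting uses every answer.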

## The Counting Approach

### Say y = 5, i.e., $((z)_1, (z)_2, \cdots, (z)_7)$ = (Y, Y, Y, Y, N, N, N)

### if $g_1(x, k)$ reports (Y, Y, N, N, N, N, N)

### $g_1(x, k)$ made 2 errors

### $r_{g_1}(x) = 3$; absolute cost = 2

### absolute cost = # of binary classification errors

### if $g_2(x, k)$ reports (Y, N, Y, Y, N, N, Y)

### $g_2(x, k)$ made 2 errors

### $r_{g_2}(x) = 5$; absolute cost = 0

### absolute cost ≤ # of binary classification errors

**If $(z)_k$ = desired answer & $r_g$ computed by counting,**

$$
\lvert y - r_g(x) \rvert \;\le\; \sum_{k=1}^{K-1} [\![ (z)_k \ne g(x, k) ]\!] .
$$
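The bound for the counting approach can be checked exhaustively for a small K. The sketch below is an illustration, not code from the talk; it assumes that the desired answer $(z)_k$ is Y exactly when k < y, as in the y = 5 example above, and the function names are hypothetical.

```python
from itertools import product

def desired_answers(y, K):
    """(z)_k is True ('Y') exactly when k < y, for k = 1..K-1."""
    return [k < y for k in range(1, K)]

def check_absolute_bound(K):
    """Check |y - r_g(x)| <= number of binary errors for every rank y
    and every possible answer pattern g(x, 1), ..., g(x, K-1)."""
    for y in range(1, K + 1):
        z = desired_answers(y, K)
        for g in product([False, True], repeat=K - 1):
            r = 1 + sum(g)  # rank computed by counting
            errors = sum(zk != gk for zk, gk in zip(z, g))
            if abs(y - r) > errors:
                return False
    return True

print(check_absolute_bound(8))
```

The check reflects the counting argument behind the bound: y − 1 and r_g(x) − 1 are the numbers of Y answers in the two patterns, and those counts cannot differ by more than the number of positions where the patterns disagree.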

## Binary Classification Error v.s. Ordinal Ranking Cost

### Say y = 5, i.e., $((z)_1, (z)_2, \cdots, (z)_7)$ = (Y, Y, Y, Y, N, N, N)

### if $g_1(x, k)$ reports (Y, Y, N, N, N, N, N)

### $g_1(x, k)$ made 2 errors

### $r_{g_1}(x) = 3$; **squared cost = 4**

### if $g_3(x, k)$ reports consistent predictions (Y, N, N, N, N, N, N)

### $g_3(x, k)$ made 3 errors

### $r_{g_3}(x) = 2$; **squared cost = 9**

### now 1 error can introduce up to 5 more in cost

### —how to take this into account?

## Importance of Associated Binary Queries

| k | 1 | 2 | 3 | 4 | 5 | 6 | 7 | cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $(z)_k$ | Y | Y | Y | Y | N | N | N | |
| $g_1(x, k)$ | Y | Y | N | N | N | N | N | $c[r_{g_1}(x)] = c[3] = 4$ |
| $g_3(x, k)$ | Y | N | N | N | N | N | N | $c[r_{g_3}(x)] = c[2] = 9$ |
| $(w)_k$ | 7 | **5** | 3 | 1 | 1 | 3 | 5 | |

### $(w)_k \equiv \lvert c[k+1] - c[k] \rvert$: the importance of $((x, k), (z)_k)$

### per-example cost bound (Li and Lin, 2007; Lin, 2008): for **consistent predictions or strongly ordinal costs**

$$
c[r_g(x)] \;\le\; \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \ne g(x, k) ]\!]
$$

**accurate binary predictions =⇒ correct ranks**
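The weights and the per-example cost bound can be reproduced for the squared-cost example with y = 5 and K = 8. This is a hedged Python sketch with illustrative names; it assumes ranks are computed by counting and stores `c[k]` at list index `k - 1`.

```python
from itertools import product

def weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c[k] is stored at index k - 1."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]

def weighted_errors(z, g, w):
    """Sum of the weights (w)_k over the queries where g disagrees with z."""
    return sum(wk for zk, gk, wk in zip(z, g, w) if zk != gk)

K, y = 8, 5
c = [(y - k) ** 2 for k in range(1, K + 1)]  # squared cost vector
w = weights(c)
z = [k < y for k in range(1, K)]             # desired answers (z)_k

def cost_of(g):
    """Cost c[r] of the rank r = 1 + sum(g), stored at index r - 1."""
    return c[sum(g)]

g1 = [ch == "Y" for ch in "YYNNNNN"]
g3 = [ch == "Y" for ch in "YNNNNNN"]
print(w, cost_of(g1), weighted_errors(z, g1, w))
print(cost_of(g3), weighted_errors(z, g3, w))

# Squared cost is convex (strongly ordinal), so the bound should hold for
# every answer pattern, consistent or not.
bound_holds = all(cost_of(g) <= weighted_errors(z, g, w)
                  for g in product([False, True], repeat=K - 1))
print(bound_holds)
```

For the two consistent patterns the weighted error count equals the squared cost exactly, so the weights repair the mismatch seen with plain error counting.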

## The Reduction Framework (1/2)

### 1. transform ordinal examples $(x_n, y_n, \mathbf{c}_n)$ to weighted binary examples $((x_n, k), (z_n)_k, (w_n)_k)$

### 2. apply your favorite algorithm and get one big joint binary classifier g(x, k)

### 3. for each new input x, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

**the reduction framework: systematic & easy to implement**

### ordinal examples $(x_n, y_n, \mathbf{c}_n)$ ⇒ weighted binary examples $((x_n, k), (z_n)_k, (w_n)_k)$, k = 1, · · · , K−1 ⇒ core binary classification algorithm ⇒ associated binary classifiers g(x, k), k = 1, · · · , K−1 ⇒ ordinal ranker $r_g(x)$
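The three steps of the framework can be sketched end to end in Python. This is a toy illustration under loud assumptions, not the implementation used in the talk: inputs are scalars, the "favorite algorithm" is replaced by a hypothetical stand-in (`train_stumps`) that picks one weighted-error-minimizing threshold per query k, and the data set is made up.

```python
def to_weighted_binary(x, y, c):
    """Step 1: one ordinal example -> K-1 weighted binary examples.

    c[k - 1] holds the cost c[k]; z is True for the desired answer 'Y'.
    """
    K = len(c)
    return [((x, k), k < y, abs(c[k] - c[k - 1])) for k in range(1, K)]

def train_stumps(binary_examples, K):
    """Step 2 (stand-in learner): for each query k, choose the threshold
    theta_k that minimizes the weighted error of g(x, k) = [x > theta_k]."""
    thetas = []
    for k in range(1, K):
        rows = [(x, z, w) for (x, kk), z, w in binary_examples if kk == k]
        candidates = [float("-inf")] + sorted({x for x, _, _ in rows})

        def weighted_error(theta):
            return sum(w for x, z, w in rows if (x > theta) != z)

        thetas.append(min(candidates, key=weighted_error))
    return lambda x, k: x > thetas[k - 1]

def predict_rank(g, x, K):
    """Step 3: rank by counting the 'Y' answers."""
    return 1 + sum(g(x, k) for k in range(1, K))

K = 4
data = [(0.1, 1), (0.2, 1), (1.1, 2), (1.3, 2), (2.2, 3), (2.5, 3), (3.4, 4)]
binary = []
for x, y in data:
    c = [abs(y - k) for k in range(1, K + 1)]  # absolute cost vector
    binary.extend(to_weighted_binary(x, y, c))

g = train_stumps(binary, K)
print([predict_rank(g, x, K) for x in (0.15, 1.2, 2.3, 3.0)])
```

Only `train_stumps` is specific to this toy setting; any binary learner that accepts example weights and the extended input (x, k) could take its place in step 2.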

## The Reduction Framework (2/2)

**performance guarantee:**

### accurate binary predictions =⇒ correct ranks

**wide applicability:**

### works with any ordinal **c** & any binary classification algorithm

**simplicity:**

### mild computation overheads with O(NK) binary examples

**up-to-date:**

### allows new improvements in binary classification to be immediately inherited by ordinal ranking


## Theoretical Guarantees of Reduction (1/3)

### is reduction a practical approach? **YES!**

### error transformation theorem (Li and Lin, 2007)

### For **consistent predictions or strongly ordinal costs**, if g makes test error ∆ in the induced binary problem, then $r_g$ pays test cost at most ∆ in ordinal ranking.

### a one-step extension of the per-example cost bound

### conditions: general and minor

### performance guarantee in the absolute sense:

### accuracy in binary classification =⇒ correctness in ordinal ranking

### Is reduction really **optimal?**

### —what if the induced binary problem is “too hard”?

## Theoretical Guarantees of Reduction (2/3)

### is reduction an optimal approach? **YES!**

### regret transformation theorem (Lin, 2008)

### For a general class of **ordinal costs,**

### if g is ε-close to the optimal binary classifier $g^*$, then $r_g$ is ε-close to the optimal ordinal ranker $r^*$.

### error guarantee in the relative setting:

### regardless of the absolute hardness of the induced binary prob., optimality in binary classification =⇒ optimality in ordinal ranking

### reduction does not introduce additional hardness

### “reduction to binary” sufficient, but necessary?

### i.e., is reduction a **principled approach?**

## Theoretical Guarantees of Reduction (3/3)

### is reduction a principled approach? **YES!**

### equivalence theorem (Lin, 2008)

### For a general class of **ordinal costs,**

### ordinal ranking is learnable by a learning model **if and only if** binary classification is learnable by the associated learning model.

### a surprising equivalence:

### ordinal ranking is **as easy as binary classification**

### reduction to binary classification:

**practical, optimal, and principled**

## Outline

**1** Introduction to Machine Learning

**2** The Ordinal Ranking Setup

**3** **Reduction from Ordinal Ranking to Binary Classification**

### **Algorithmic Usefulness of Reduction**

### Theoretical Usefulness of Reduction

### Experimental Performance of Reduction

**4** Conclusion

## Unifying Existing Algorithms

### ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| PRank (Crammer and Singer, 2002) | absolute | modified perceptron rule |
| kernel ranking (Rajaram et al., 2003) | classification | modified hard-margin SVM |
| SVOR-EXP (Chu and Keerthi, 2005) | classification | modified soft-margin SVM |
| SVOR-IMC (Chu and Keerthi, 2005) | absolute | modified soft-margin SVM |
| ORBoost-LR (Lin and Li, 2006) | classification | modified AdaBoost |
| ORBoost-All (Lin and Li, 2006) | absolute | modified AdaBoost |

### correctness proof significantly simplified (PRank)

### algorithmic structure revealed (SVOR, ORBoost)

**variants of existing algorithms can be**

**designed quickly by tweaking reduction**

## Designing New Algorithms Effortlessly

### ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| Reduction-C4.5 | absolute | standard C4.5 decision tree |
| Reduction-SVM | absolute | standard soft-margin SVM |

### SVOR (modified SVM) v.s. Reduction-SVM (standard SVM):

### Figure: avg. training time (hour) of SVOR and RED-SVM on the data sets ban, com, cal, cen

**advantages of core binary classification algorithm**

**inherited in the new ordinal ranking one**

## Outline

**1** Introduction to Machine Learning

**2** The Ordinal Ranking Setup

**3** **Reduction from Ordinal Ranking to Binary Classification**

### Algorithmic Usefulness of Reduction

### **Theoretical Usefulness of Reduction**

### Experimental Performance of Reduction

**4** Conclusion

## Recall: Threshold Model

### “bad” ordinal ranker: predictions close to thresholds, so small noise changes the prediction

### Figure: scores s(x) crowded next to the thresholds $\theta_1, \theta_2, \theta_3$ that separate the ranks 1 to 4 of r(x)

### “good” ordinal ranker: clear separation using thresholds

### Figure: scores s(x) lying well inside the intervals cut by $\theta_1, \theta_2, \theta_3$ for the ranks 1 to 4 of r(x)

**next: good ordinal ranker =⇒ small expected test cost**

## Proving New Generalization Theorems

### Ordinal Ranking (Li and Lin, 2007)

### For SVOR or Reduction-SVM, with probability > 1 − δ,

$$
\text{expected test abs. cost of } r \;\le\; \underbrace{\frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} [\![ \bar{\rho}(r(x_n), y_n, k) \le \Phi ]\!]}_{\text{“goodness” in training}} \;+\; \underbrace{O\!\left( \mathrm{poly}\!\left( K, \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right) \right)}_{\text{deviation that decreases with more examples}}
$$

### Bi. Class. (Bartlett and Shawe-Taylor, 1998)

### For SVM, with probability > 1 − δ,

$$
\text{expected test err. of } g \;\le\; \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(g(x_n), y_n) \le \Phi ]\!]}_{\text{“goodness” in training}} \;+\; \underbrace{O\!\left( \mathrm{poly}\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right) \right)}_{\text{deviation that decreases with more examples}}
$$

**new ordinal ranking theorem**

**= reduction + any cost + bin. thm. + math derivation**

## Outline

**1** Introduction to Machine Learning

**2** The Ordinal Ranking Setup

**3** **Reduction from Ordinal Ranking to Binary Classification**

### Algorithmic Usefulness of Reduction

### Theoretical Usefulness of Reduction

### **Experimental Performance of Reduction**

**4** Conclusion

## Reduction-C4.5 v.s. SVOR

### Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on the data sets pyr, mac, bos, aba, ban, com, cal, cen

### C4.5: a (too) simple binary classifier (decision trees)

### SVOR: state-of-the-art ordinal ranking algorithm

**even simple Reduction-C4.5**

**sometimes beats SVOR**

## Reduction-SVM v.s. SVOR

### Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on the data sets pyr, mac, bos, aba, ban, com, cal, cen