## Automatic Ranking by Extended Binary Classification

Hsuan-Tien Lin, Learning Systems Group
Joint work with Ling Li (NIPS 2006)

EE Pizza Meeting, November 17, 2006

Introduction

## What is the Age-Group?

(figure: four face photos, one per age group, ranked 1, 2, 3, 4)

**rank: a finite ordered set of labels** $\mathcal{Y} = \{1, 2, \cdots, K\}$


## Hot or Not?

http://www.hotornot.com

**rank: natural representation of preferences in surveys**

## How Much Did You Like These Movies?

http://www.netflix.com

**Can machines use movies you’ve rated to closely predict**
**your preferences (i.e., ranks) on future movies?**


## How a Machine Learns Your Preferences

(diagram: two parallel learning flows)

Bob learns Mary's preferences: (movie, rank) pairs → Bob's brain → hypothesis
(alternatives in Bob's head: prefer romance/action/etc.)

a machine learns your preferences: examples (movie $x_n$, rank $y_n$) → learning algorithm → hypothesis $r(x)$
(alternatives: the learning model)

**machine learning: an automatic route of system design**

## Poor Bob

Bob impresses Mary by memorizing every given (movie, rank) pair;

**but he is too nervous about a new movie and guesses randomly**

**memorize ≠ generalize**
**perfect from Bob's view ≠ good for Mary**
**perfect during training ≠ good when testing**

**challenges:**

**algorithms and theories for doing well when testing**


## Ranking Problem

*input: N examples* $(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, e.g.

hotornot: $\mathcal{X}$ = human pictures, $\mathcal{Y} = \{1, \cdots, 10\}$
netflix: $\mathcal{X}$ = movies, $\mathcal{Y} = \{1, \cdots, 5\}$

*output: a ranking function* $r(x)$ *that ranks future unseen examples* $(x, y)$ "correctly"

*properties of the K elements in* $\mathcal{Y}$:

**ordered**: rank $1 <$ rank $2 < \cdots <$ rank $K$

**not carrying numerical information**: one rank is not, say, 2.5 times better than another

1. instance representation? some meaningful vectors
2. **correctly? cost of wrong prediction**

## Cost of Wrong Prediction

cannot quantify the numerical meaning of ranks;

**but can artificially quantify the cost of being wrong**

infant (1), child (2), teen (3), adult (4)

small mistake – classify a child as a teenager;

big mistake – classify an infant as an adult
$C_{y,k}$: cost when rank $y$ is predicted as rank $k$
V-shaped $C_{y,k}$ with $C_{y,y} = 0$, e.g. absolute cost $C_{y,k} = |y - k|$:

$$C = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 1 & 2 \\ 2 & 1 & 0 & 1 \\ 3 & 2 & 1 & 0 \end{pmatrix}$$
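To make the cost setup concrete, here is a minimal sketch (my own illustration, not from the original slides; `cost_matrix` and `is_v_shaped` are made-up names) that builds the absolute-cost matrix above and checks the V-shape property:

```python
import numpy as np

def cost_matrix(K):
    """Absolute cost C[y, k] = |y - k| for ranks 1..K (0-indexed here)."""
    ranks = np.arange(K)
    return np.abs(ranks[:, None] - ranks[None, :])

def is_v_shaped(C):
    """C[y, y] == 0, and each row is non-increasing before the diagonal
    and non-decreasing after it -- the V-shape condition on the cost."""
    for y in range(C.shape[0]):
        if C[y, y] != 0:
            return False
        if np.any(np.diff(C[y, :y + 1]) > 0) or np.any(np.diff(C[y, y:]) < 0):
            return False
    return True

C = cost_matrix(4)
print(C)               # the 4x4 matrix shown above
print(is_v_shaped(C))  # True
```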


## Even More Challenging: Netflix Million Dollar Prize

*input:* $N_i$ examples from each user $i$, with 480,000+ users and $\sum_i N_i \approx 100{,}000{,}000$

*output: personalized predictions* $r(i, x)$ on 2,800,000+ testing queries $(i, x)$

*cost: squared cost* $C_{y,k} = (y - k)^2$
a huge joint ranking problem

**the first team that does 10% better than the**
**existing Netflix system gets a million USD**

## Our Contributions

**a new framework that ...**

- makes the design and implementation of ranking algorithms almost effortless
- makes the proofs of ranking theories much simpler
- unifies many existing ranking algorithms and helps understand their pros and cons
- shows that ranking is theoretically not much more complex than binary classification
- leads to promising experimental performance

Figure: answer; traditional method; our method (three panels)


## Key Idea: Reduction

(analogy: an iPod, an adapter, a cassette player)

complex ranking problems
⇓ (reduction)
simpler binary problems, with well-known results on models, algorithms, proofs, etc.

**many new results immediately follow;**
**many existing results are unified**

## Intuition: Associated Binary Questions

*how do we query the rank of a movie* $x$?

1. *is movie* $x$ *better than rank 1?* Yes
2. *is movie* $x$ *better than rank 2?* No
3. *is movie* $x$ *better than rank 3?* No
4. *is movie* $x$ *better than rank 4?* No
5. *is movie* $x$ *better than rank 5?* No

$g_b(x, k)$: is movie $x$ better than rank $k$?

consistent answers: $G(x) = (1, 1, 1, 0, \cdots, 0)$

extract the rank from consistent answers:
searching: compare to a "middle" rank each time
voting: $r(x) = 1 + \sum_k g_b(x, k)$

what if the answers are not consistent? e.g. $(1, 0, 1, 1, 0, 0, 1, 0)$
– voting is simple enough to analyze, and still works

**accurate binary answers** ⇒ **correct ranks**
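A minimal sketch of the voting rule $r(x) = 1 + \sum_k g_b(x, k)$ (my own illustration; the answer vectors are the ones from this slide):

```python
def rank_by_voting(answers):
    """Predicted rank = 1 + number of 'yes' (1) answers to the
    questions 'is x better than rank k?' for k = 1..K-1."""
    return 1 + sum(answers)

print(rank_by_voting([1, 1, 1, 0, 0]))           # consistent answers -> rank 4
print(rank_by_voting([1, 0, 1, 1, 0, 0, 1, 0]))  # inconsistent answers -> rank 5
```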


## Reduction during Training

*input: N examples* $(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$

tool: your favorite binary classification algorithm

*output: a binary classifier* $g_b(x, k)$ *that can answer the associated questions correctly*

need to feed binary examples $(X_{n,k}, Y_{n,k})$ to the tool:

$$X_{n,k} \equiv (x_n, k), \qquad Y_{n,k} \equiv [y_n > k]$$
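A minimal sketch of this transformation (illustrative; it assumes each $x_n$ is a feature vector, with the threshold $k$ appended as one extra feature):

```python
import numpy as np

def extend(X, y, K):
    """(x_n, y_n) -> extended binary examples
    X_{n,k} = (x_n, k), Y_{n,k} = [y_n > k], for k = 1..K-1."""
    X_ext, Y_ext = [], []
    for x_n, y_n in zip(X, y):
        for k in range(1, K):
            X_ext.append(np.append(x_n, k))  # X_{n,k} = (x_n, k)
            Y_ext.append(int(y_n > k))       # Y_{n,k} = [y_n > k]
    return np.array(X_ext), np.array(Y_ext)
```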

about $NK$ extended binary examples extracted from the given input
– bigger, but not troublesome

some approaches extract about $N^2$ binary examples using a different intuition
– can be too big

**Are extended binary examples of the same**
**importance?**

## Importance of Extended Binary Examples

*for a given movie* $x_n$ *with rank* $y_n = 2$, *and* $C_{y,k} = (y - k)^2$:

| | $r(x_n)=1$ | $r(x_n)=2$ | $r(x_n)=3$ | $r(x_n)=4$ |
|---|---|---|---|---|
| is $x_n$ better than rank 1? | No | Yes | Yes | Yes |
| is $x_n$ better than rank 2? | No | No | Yes | Yes |
| is $x_n$ better than rank 3? | No | No | No | Yes |
| is $x_n$ better than rank 4? | No | No | No | No |
| cost | 1 | 0 | 1 | 4 |

3 more for answering question 3 wrong;
only 1 more for answering question 1 wrong

$W_{n,k} \equiv \big|\,C_{y_n,\,k+1} - C_{y_n,\,k}\big|$: the importance of $(X_{n,k}, Y_{n,k})$

_{n,k}*most binary classification algorithm can handle W*

_{n,k}**analogy to economics:**

**additional cost (marginal)**⇐⇒**importance**
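A minimal sketch computing the weights $W_{n,k}$ (illustrative; it reproduces the numbers of this slide for $y_n = 2$ and the squared cost):

```python
def weights(y_n, K, cost=lambda y, k: (y - k) ** 2):
    """W_{n,k} = |C_{y_n, k+1} - C_{y_n, k}| for k = 1..K-1."""
    return [abs(cost(y_n, k + 1) - cost(y_n, k)) for k in range(1, K)]

print(weights(2, 4))  # [1, 1, 3]: question 3 is three times as important
```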


## The Reduction Framework for Ranking

1. transform ranking examples $(x_n, y_n)$ into extended binary examples $(X_{n,k}, Y_{n,k}, W_{n,k})$ based on $C_{y,k}$
2. use your favorite algorithm to learn from the extended binary examples, and get $g_b(x, k) \equiv g_b(X)$
3. for each new instance $x$, predict its rank using $r(x) = 1 + \sum_k g_b(x, k)$

error equivalence: accurate binary answers ⇒ correct ranks

simplicity: works with almost any $C_{y,k}$ and any algorithm

up-to-date: new improvements in binary classification immediately propagate to ranking
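Putting the three steps together, a minimal end-to-end sketch (my own illustration, not the original implementation; it uses scikit-learn's `LogisticRegression` as the "favorite binary algorithm", but any classifier that accepts `sample_weight` would do):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reduction_train(X, y, K, cost=lambda y, k: (y - k) ** 2):
    # step 1: extended binary examples (X_{n,k}, Y_{n,k}, W_{n,k})
    Xe, Ye, We = [], [], []
    for x_n, y_n in zip(X, y):
        for k in range(1, K):
            Xe.append(np.append(x_n, k))                     # X_{n,k}
            Ye.append(int(y_n > k))                          # Y_{n,k}
            We.append(abs(cost(y_n, k + 1) - cost(y_n, k)))  # W_{n,k}
    # step 2: learn g_b with your favorite binary algorithm
    return LogisticRegression().fit(np.array(Xe), Ye, sample_weight=We)

def reduction_predict(g_b, x, K):
    # step 3: voting, r(x) = 1 + sum_k g_b(x, k)
    Xq = np.array([np.append(x, k) for k in range(1, K)])
    return 1 + int(g_b.predict(Xq).sum())

# toy usage: a 1-D feature whose value grows with the rank
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.clip(np.rint(X[:, 0]).astype(int), 1, 4)
g_b = reduction_train(X, y, K=4)
print(reduction_predict(g_b, np.array([2.6]), K=4))  # likely rank 3
```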

**If I have seen further it is by**

**standing on ye shoulders of Giants – I. Newton**

## Unifying Existing Algorithms

ranking with perceptrons

– (PRank, Crammer and Singer, 2002)

several long proofs
⇒ a few lines extended from binary perceptron results

large-margin (high confidence) formulations

– (Rajaram et al., 2003), (SVORIM, Chu and Keerthi, 2005)

results explained more directly; algorithm structure revealed

**variants of existing algorithms can be**
**designed quickly by tweaking reduction**


## Proposing New Algorithms

ranking using an ensemble (consensus) of classifiers – (ORBoost, Lin and Li, 2006), OR-AdaBoost

ranking using decision trees – OR-C4.5

ranking with large-margin classifiers – OR-SVM

(figure: avg. training time in hours on the bank, computer, california, and census datasets; reduction vs. SVOR-IMC)

**advantages of the underlying binary algorithm**
**inherited by the new ranking one**

## Proving New Theorems

simpler cost bound for PRank

new guarantee of ranking performance using ensemble of classifiers (Lin and Li, 2006)

new guarantee of ranking performance using large-margin classifiers, e.g.,

$$\underbrace{\mathbb{E}_{(x,y)}\, C_{y,\,r(x)}}_{\text{expected cost during testing}} \;\le\; \underbrace{\frac{1}{N} \sum_{n} \sum_{k} \big[\rho(X_{n,k}, Y_{n,k}) \le \Delta\big]}_{\text{low-confidence extended examples}} \;+\; \underbrace{K \cdot h_{\delta}(N, \Delta)}_{\text{deviation term that decreases with more data or confidence}}$$


## Reduction-C4.5 vs. SVORIM

(figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, and cen datasets; RED-C4.5 vs. SVOR-Gauss)

C4.5: decision tree, an intuitive, but often too simple, binary classifier
SVORIM: state-of-the-art ranking algorithm

**even reduction to the simple C4.5**
**sometimes beats SVORIM**

## Reduction-SVM vs. SVORIM

(figure: avg. test absolute cost on the same eight datasets; RED-SVM vs. SVOR-Gauss)

SVM: one of the most powerful binary classifiers
SVORIM: state-of-the-art ranking algorithm, extended from a modified SVM

**reducing to SVM without modification**
**often better than SVORIM**


## Reduction-Boost vs. RankBoost

(figure: test absolute error on the py, ma, bo, ab, ba, co, ca, and ce datasets; ORBoost vs. RankBoost)

Boost: a popular ensemble algorithm
RankBoost: state-of-the-art ensemble ranking algorithm

**our reduction-to-boosting approach results in a**
**significantly better ensemble ranking algorithm**

## Conclusion

reduction framework: simple, intuitive, and useful for ranking

algorithmic reduction:
unifying existing ranking algorithms
proposing new ranking algorithms

theoretic reduction:
new guarantee on ranking performance

promising experimental results:
some for better performance, some for faster training time

next level: the Netflix challenge?
handling huge datasets
finding useful representations (features)
using collaborative information from other users
**Thank you. Questions?**
