Automatic Ranking by Extended Binary Classification
Hsuan-Tien Lin Learning Systems Group Joint work with Ling Li (NIPS 2006)
EE Pizza Meeting, November 17, 2006
Introduction
What is the Age-Group?
[figure: face photos labeled with age groups 1–4]
rank: a finite ordered set of labels Y = {1, 2, ..., K}
H.-T. Lin (Learning Systems Group) Automatic Ranking 2006/11/17 2 / 22
Hot or Not?
http://www.hotornot.com
rank: natural representation of preferences in surveys
How Much Did You Like These Movies?
http://www.netflix.com
Can machines use movies you’ve rated to closely predict your preferences (i.e., ranks) on future movies?
How a Machine Learns Your Preferences
[diagram: Mary gives Bob (movie, rank) pairs; Bob's brain forms a hypothesis, with alternatives such as "prefer romance/action/etc."]
[diagram: you feed examples (movie xn, rank yn) to a learning algorithm, which, within a learning model, outputs a hypothesis r(x)]
machine learning:
an automatic route of system design
Poor Bob
Bob impresses Mary by memorizing every given (movie, rank) pair,
but is too nervous about a new movie and guesses randomly
memorize ≠ generalize
perfect from Bob's view ≠ good for Mary; perfect during training ≠ good when testing
challenges:
algorithms and theories for doing well when testing
Ranking Problem
input: N examples (xn, yn) ∈ X × Y, e.g.
hotornot: X = human pictures, Y = {1, ..., 10}
netflix: X = movies, Y = {1, ..., 5}
output: a ranking function r(x) that ranks future unseen examples (x, y) "correctly"
properties of the K elements in Y: ordered,
1 < 2 < ... < K,
but not carrying numerical information: rank 5 is not "2.5 times better" than rank 2
1 instance representation? some meaningful vectors
2 correctly? cost of wrong prediction
Cost of Wrong Prediction
cannot quantify the numerical meaning of ranks,
but can artificially quantify the cost of being wrong
infant (1)  child (2)  teen (3)  adult (4)
small mistake – classify a child as a teenager;
big mistake – classify an infant as an adult
Cy,k: the cost when rank y is predicted as rank k
V-shaped Cy,k with Cy,y = 0,
e.g. absolute cost Cy,k = |y − k|:

    0 1 2 3
    1 0 1 2
    2 1 0 1
    3 2 1 0
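The V-shaped absolute cost matrix above can be generated directly from its definition; a minimal sketch (the function name is illustrative):

```python
# Build the K-by-K absolute cost matrix C[y][k] = |y - k|
# for ranks y, k in {1, ..., K}; V-shaped with C[y][y] = 0.
def absolute_cost_matrix(K):
    return [[abs(y - k) for k in range(1, K + 1)]
            for y in range(1, K + 1)]

for row in absolute_cost_matrix(4):
    print(row)   # each row matches a row of the matrix above
```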
Even More Challenging: Netflix Million Dollar Prize
input: Ni examples from each user i, with 480,000+ users and Σi Ni ≈ 100,000,000
output: personalized predictions r(i, x) on 2,800,000+ testing queries (i, x)
cost: squared cost Cy,k = (y − k)^2
a huge joint ranking problem
The first team that does 10% better than the existing Netflix system wins a million USD
Our Contributions
a new framework that ...
makes the design and implementation of ranking algorithms almost effortless
makes the proofs of ranking theories much simpler
unifies many existing ranking algorithms and helps understand their pros and cons
shows that ranking is theoretically not much more complex than binary classification
leads to promising experimental performance
[figure: three estimated ranking functions – answer; traditional method; our method]
Key Idea: Reduction
(iPod) -> (adapter) -> (cassette player)
complex ranking problems
-> (reduction) ->
simpler binary problems, with well-known results on models, algorithms, proofs, etc.
many new results immediately come up;
many existing results unified
Intuition: Associated Binary Questions
how do we query the rank of a movie x?
1 is movie x better than rank 1? Yes
2 is movie x better than rank 2? No
3 is movie x better than rank 3? No
4 is movie x better than rank 4? No
5 is movie x better than rank 5? No
gb(x, k): is movie x better than rank k?
consistent answers: G(x) = (1, 1, 1, 0, ..., 0)
extract the rank from consistent answers:
searching: compare to a "middle" rank each time
voting: r(x) = 1 + Σk gb(x, k)
what if the answers are not consistent? e.g. (1, 0, 1, 1, 0, 0, 1, 0)
– voting is simple enough to analyze, and still works
accurate binary answers=⇒correct ranks
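The voting rule r(x) = 1 + Σk gb(x, k) is one line of code, and it tolerates inconsistent answers; a minimal sketch (the function name is illustrative):

```python
# Extract a rank from the K-1 binary answers gb(x, k) in {0, 1},
# where answer k means "x is better than rank k"; voting simply
# adds up the yes-answers, consistent or not.
def rank_by_voting(answers):
    return 1 + sum(answers)

print(rank_by_voting([1, 1, 1, 0]))               # consistent -> 4
print(rank_by_voting([1, 0, 1, 1, 0, 0, 1, 0]))   # inconsistent -> 5
```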
Reduction during Training
input: N examples (xn, yn) ∈ X × Y
tool: your favorite binary classification algorithm
output: a binary classifier gb(x, k) that can answer the associated questions correctly
need to feed binary examples (Xn,k, Yn,k) to the tool:
Xn,k ≡ (xn, k), Yn,k ≡ [yn > k]
about NK extended binary examples extracted from the given input
– bigger, but not troublesome
some approaches extract about N^2 binary examples using a different intuition
– can be too big
Are extended binary examples of the same importance?
Importance of Extended Binary Examples
for a given movie xn with rank yn = 2, and Cy,k = (y − k)^2:

is xn better than rank 1?    No   Yes  Yes  Yes
is xn better than rank 2?    No   No   Yes  Yes
is xn better than rank 3?    No   No   No   Yes
is xn better than rank 4?    No   No   No   No
r(xn)                        1    2    3    4
cost                         1    0    1    4

3 more for answering question 3 wrong;
only 1 more for answering question 1 wrong
Wn,k ≡ |Cn,k+1 − Cn,k|: the importance of (Xn,k, Yn,k)
most binary classification algorithms can handle Wn,k
analogy to economics:
additional (marginal) cost ⇐⇒ importance
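The transform from one ranking example (xn, yn) to K−1 weighted extended binary examples can be sketched as follows (an illustration with hypothetical names; the cost is passed as a function so any V-shaped Cy,k works):

```python
# Transform a ranking example (x, y) into K-1 weighted binary examples
# (X, Y, W): X = (x, k), Y = [y > k], W = |C(y, k+1) - C(y, k)|.
def extend_example(x, y, K, cost):
    extended = []
    for k in range(1, K):
        X = (x, k)
        Y = 1 if y > k else 0
        W = abs(cost(y, k + 1) - cost(y, k))
        extended.append((X, Y, W))
    return extended

squared = lambda y, k: (y - k) ** 2
# the slide's example: rank yn = 2 under the squared cost
for X, Y, W in extend_example("x_n", 2, 4, squared):
    print(X, Y, W)   # weights 1, 1, 3 - question 3 matters most
```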
The Reduction Framework for Ranking
1 transform ranking examples (xn, yn) to extended binary examples (Xn,k, Yn,k, Wn,k) based on Cy,k
2 use your favorite algorithm to learn from the extended binary examples, and get gb(x, k) ≡ gb(X)
3 for each new instance x, predict its rank using r(x) = 1 + Σk gb(x, k)
error equivalence: accurate binary answers ⇒ correct ranks
simplicity: works with almost any Cy,k and any algorithm
up-to-date: new improvements in binary classification immediately propagate to ranking
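The three steps above can be run end to end on toy data; a minimal sketch where the toy 1-D data and the weighted decision stump stand in for real data and "your favorite binary classification algorithm":

```python
# End-to-end sketch of the reduction framework on toy 1-D data.
def extend(data, K, cost):
    # step 1: ranking examples -> weighted extended binary examples
    ext = []
    for x, y in data:
        for k in range(1, K):
            ext.append(((x, k), 1 if y > k else 0,
                        abs(cost(y, k + 1) - cost(y, k))))
    return ext

def train_stump(ext):
    # step 2: learn gb(x, k) by thresholding the score x - k at the
    # point with the smallest weighted binary error
    best_t, best_err = 0.0, float("inf")
    for t in {x - k for (x, k), _, _ in ext}:
        err = sum(w for (x, k), y, w in ext if (x - k > t) != (y == 1))
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, k: 1 if x - k > best_t else 0

def rank(g, x, K):
    # step 3: predict by voting, r(x) = 1 + sum_k gb(x, k)
    return 1 + sum(g(x, k) for k in range(1, K))

K = 4
train = [(0.5, 1), (1.5, 2), (2.5, 3), (3.5, 4)]
g = train_stump(extend(train, K, lambda y, k: abs(y - k)))
print([rank(g, x, K) for x, _ in train])   # prints [1, 2, 3, 4]
```

Swapping `train_stump` for any weighted binary learner leaves steps 1 and 3 untouched, which is the point of the framework.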
If I have seen further it is by
standing on ye shoulders of Giants – I. Newton
Unifying Existing Algorithms
ranking with perceptrons
– (PRank, Crammer and Singer, 2002)
several long proofs
⇒ a few lines, extended from binary perceptron results
large-margin (high confidence) formulations
– (Rajaram et al., 2003), (SVORIM, Chu and Keerthi, 2005)
results explained more directly; algorithm structure revealed
variants of existing algorithms can be designed quickly by tweaking reduction
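The "few lines" flavor can be illustrated: running a plain binary perceptron over the extended examples, with a shared weight vector and one threshold per question, yields a PRank-style update. A sketch under stated assumptions (toy data; unlike PRank proper, the ordering of the thresholds is not explicitly enforced here):

```python
# Binary perceptron on the extended examples (x, k): shared weights w
# plus a per-question threshold theta[k-1], trained with the standard
# mistake-driven update.
def train_perceptron(data, K, dim, epochs=10):
    w = [0.0] * dim
    theta = [0.0] * (K - 1)
    for _ in range(epochs):
        for x, y in data:
            for k in range(1, K):
                Y = 1 if y > k else -1          # binary label in {-1, +1}
                score = sum(wi * xi for wi, xi in zip(w, x)) - theta[k - 1]
                if Y * score <= 0:              # mistake -> perceptron update
                    w = [wi + Y * xi for wi, xi in zip(w, x)]
                    theta[k - 1] -= Y
    return w, theta

def rank(w, theta, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 + sum(1 for t in theta if s > t)   # voting over thresholds

data = [([0.0], 1), ([1.0], 2), ([2.0], 3), ([3.0], 4)]
w, theta = train_perceptron(data, K=4, dim=1)
print([rank(w, theta, x) for x, _ in data])     # prints [1, 2, 3, 4]
```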
Proposing New Algorithms
ranking using an ensemble (consensus) of classifiers – (ORBoost, Lin and Li, 2006), OR-AdaBoost
ranking using decision trees – OR-C4.5
ranking with large-margin classifiers – OR-SVM
[figure: avg. training time (hours) on the bank, computer, california, and census datasets: reduction vs. SVOR-IMC]
advantages of the underlying binary algorithm are inherited by the new ranking algorithm
Proving New Theorems
simpler cost bound for PRank
new guarantee of ranking performance using ensemble of classifiers (Lin and Li, 2006)
new guarantee of ranking performance using large-margin classifiers, e.g.,
E(x,y) Cy,r(x)  ≤  (1/N) Σn Σk [ρ(Xn,k, Yn,k) ≤ ∆]  +  K · hδ(N, ∆)

where E(x,y) Cy,r(x) is the expected cost during testing,
the middle term is the fraction of low-confidence extended examples,
and hδ(N, ∆) is a deviation function that decreases with more data or confidence
Reduction-C4.5 vs. SVORIM
[figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, cen datasets: SVOR-Gauss vs. RED-C4.5]
C4.5: a decision tree – an intuitive, but often too simple, binary classifier
SVORIM: a state-of-the-art ranking algorithm
even reduction to the simple C4.5 beats SVORIM some of the time
Reduction-SVM vs. SVORIM
[figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, cen datasets: SVOR-Gauss vs. RED-SVM]
SVM: one of the most powerful binary classifiers
SVORIM: a state-of-the-art ranking algorithm, extended from a modified SVM
reducing to SVM without modification is often better than SVORIM
Reduction-Boost vs. RankBoost
[figure: test absolute cost per dataset (py, ma, bo, ab, ba, co, ca, ce): RankBoost vs. ORBoost]
Boost: a popular ensemble algorithm
RankBoost: a state-of-the-art ensemble ranking algorithm
our reduction to boosting results in a significantly better ensemble ranking algorithm
Conclusion
reduction framework: simple, intuitive, and useful for ranking
algorithmic reduction:
unifying existing ranking algorithms
proposing new ranking algorithms
theoretic reduction:
new guarantees on ranking performance
promising experimental results:
some for better performance, some for faster training time
next level: the Netflix challenge?
handling huge datasets
finding useful representations (features)
using collaborative information from other users
Thank you. Questions?