(1)

From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk at Caltech CS/IST Lunch Bunch March 4, 2008

Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06)

& discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap

(2)

Introduction to Ordinal Ranking

(3)

Which Age-Group?

[Figure: example pictures; rank labels shown: 2 and 1 2 3 4]

rank: a finite ordered set of labels Y = {1, 2, · · · , K }

(4)

Hot or Not?

http://www.hotornot.com

rank: natural representation of human preferences

(5)

How Much Did You Like These Movies?

http://www.netflix.com

goal: use “movies you’ve rated” to automatically predict your preferences (ranks) on future movies

(6)

How Machine Learns the Preference of YOU?

Alice → (movie, rank) pairs → brain of Bob → good hypothesis

alternatives: prefer romance/action/etc.

You → examples (movie x_n, rank y_n) → learning algorithm → good hypothesis r(x)

alternatives: learning model

challenge: how to make the second (machine) pipeline work?

(7)

Ordinal Ranking Problem

given: N examples (input x_n, rank y_n) ∈ X × Y, e.g.

age-group: X = encoding(human pictures), Y = {1, · · · , 4}

hotornot: X = encoding(human pictures), Y = {1, · · · , 10}

netflix: X = encoding(movies), Y = {1, · · · , 5}

goal: an ordinal ranker (hypothesis) r(x) that “closely predicts” the ranks y associated with some unseen inputs x

a hot and important research problem:

relatively new for machine learning

connecting classification and regression

matching human preferences, with many applications in social science and information retrieval

(8)

Ongoing Heat: Netflix Million Dollar Prize

(since 10/2006)

a huge joint ordinal ranking problem

given: each user u (480,189 users) rates N_u (from tens to hundreds) movies, for a total of $\sum_u N_u = 100{,}480{,}507$ examples

goal: personalized predictions r_u(x) on 2,817,131 testing queries (u, x)

the first team to be 10% better than the original Netflix system gets a million USD

(9)

Properties of Ranks

Y = {1, 2, · · · , 5}

representing order: [image] < [image]

relabeling by (3, 1, 2, 4, 5) erases information

general multiclass classification cannot properly use ordering information

not carrying numerical information: [image] is not 2.5 times better than [image]

relabeling by (2, 3, 5, 9, 16) shouldn’t change results

general metric regression deteriorates without correct numerical information

ordinal ranking resides uniquely between multiclass classification and metric regression

(10)

Cost of Wrong Prediction

ranks carry no numerical meaning: how to say “closely predict”?

artificially quantify the cost of being wrong

infant (1) child (2) teen (3) adult (4)

small mistake: classify a child as a teen;

big mistake: classify an infant as an adult

cost vector c of example (x, y, c): c[k] = cost when predicting (x, y) as rank k

e.g. for ([picture], 2), a reasonable cost is c = (2, 0, 1, 4)

closely predict: small cost

(11)

Reasonable Cost Vectors

For an ordinal example (x, y, c), the cost vector c should

respect the rank y: c[y] = 0; c[k] ≥ 0

respect the ordinal information: V-shaped or even convex

[Figure: cost C_{y,k} over the ranks 1: infant, 2: child, 3: teenager, 4: adult]

V-shaped: pay more when predicting further away

convex: pay increasingly more when further away

cost               c[k]                      shape           example (y = 2, K = 4)
classification     $[\![ y \ne k ]\!]$       V-shaped only   (1, 0, 1, 1)
absolute           $|y - k|$                 convex          (1, 0, 1, 2)
squared (Netflix)  $(y - k)^2$               convex          (1, 0, 1, 4)
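The three cost families above can be checked with a short Python sketch. The function names are illustrative and not from the talk, and convexity is tested through nonnegative second differences, which is one common discrete definition and is an assumption here.

```python
from typing import List, Sequence


def classification_cost(y: int, K: int) -> List[int]:
    """c[k] = [[y != k]] for k = 1..K (returned as a 0-indexed list)."""
    return [int(k != y) for k in range(1, K + 1)]


def absolute_cost(y: int, K: int) -> List[int]:
    """c[k] = |y - k|."""
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y: int, K: int) -> List[int]:
    """c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c: Sequence[int], y: int) -> bool:
    """Non-increasing up to rank y, non-decreasing after it (y is 1-based)."""
    left = all(c[i] >= c[i + 1] for i in range(0, y - 1))
    right = all(c[i] <= c[i + 1] for i in range(y - 1, len(c) - 1))
    return left and right


def is_convex(c: Sequence[int]) -> bool:
    """Assumed definition: successive differences never decrease."""
    return all(c[i + 1] - c[i] >= c[i] - c[i - 1] for i in range(1, len(c) - 1))


for cost in (classification_cost, absolute_cost, squared_cost):
    c = cost(2, 4)
    print(cost.__name__, c, is_v_shaped(c, 2), is_convex(c))
```

For y = 2 and K = 4 this reproduces the three example vectors in the table, with the classification cost V-shaped but not convex.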

(12)

Our Contributions

a new framework that works with any reasonable cost, and ...

reduces ordinal ranking to binary classification systematically

unifies and clearly explains many existing ordinal ranking algorithms

makes the design of new ordinal ranking algorithms much easier

allows simple and intuitive proof for new ordinal ranking theorems

leads to promising experimental results

Figure: answer; traditional method; our method

(13)

Reduction from Ordinal Ranking to Binary Classification

(14)

Thresholded Model

If we can first compute the score s(x ) of a movie x , how can we construct r (x ) from s(x )?

[Figure: the score function s(x) on a line, cut by thresholds θ1, θ2, θ3 into ranks 1 2 3 4; ordinal ranker r(x) compared with target rank y]

quantize s(x) by some ordered thresholds θ

commonly used in previous work:

thresholded perceptrons (PRank, Crammer and Singer, 2002)

thresholded hyperplanes (SVOR, Chu and Keerthi, 2005)

thresholded ensembles (ORBoost, Lin and Li, 2006)

thresholded model: r (x ) = min {k : s(x ) < θ_k}
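A minimal Python sketch of the thresholded model follows. It assumes K − 1 finite ordered thresholds with θ_K = +∞ implied, so that the minimum always exists; the function name `thresholded_rank` is illustrative. For ordered thresholds, min {k : s(x) < θ_k} equals one plus the number of thresholds that are at most s(x), which is what `bisect_right` counts.

```python
from bisect import bisect_right
from typing import Sequence


def thresholded_rank(score: float, thresholds: Sequence[float]) -> int:
    """r(x) = min{k : s(x) < theta_k}, with theta_K = +infinity assumed."""
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ordered")
    # bisect_right returns how many thresholds are <= score
    return 1 + bisect_right(thresholds, score)


thetas = [1.0, 2.0, 3.0]  # hypothetical thresholds for K = 4 ranks
print([thresholded_rank(s, thetas) for s in (0.5, 1.0, 2.5, 10.0)])
```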

(15)

Key of Reduction: Associated Binary Questions

getting the rank using a thresholded model

1 is s(x ) > θ₁? Yes

2 is s(x ) > θ₂? No

3 is s(x ) > θ₃? No

4 is s(x ) > θ₄? No

generally, how do we query the rank of a movie x ?

1 is movie x better than rank 1? Yes

2 is movie x better than rank 2? No

3 is movie x better than rank 3? No

4 is movie x better than rank 4? No

associated binary questions g(x, k): is movie x better than rank k?

(16)

Computing Ranks from Associated Binary Questions

g(x, k): is movie x better than rank k?

Consider g(x, 1), g(x, 2), · · · , g(x, K − 1),

consistent answers: (Y, Y, N, N, · · · , N)

extracting the rank from consistent answers:

minimum index searching: $r_g(x) = \min \{k : g(x, k) = \mathrm{N}\}$

counting: $r_g(x) = 1 + \sum_{k} [\![ g(x, k) = \mathrm{Y} ]\!]$

two approaches equivalent for consistent answers

noisy/inconsistent answers? e.g. (Y, N, Y, Y, N, N, Y, N, N)

counting is simpler to analyze, and is robust to noise

are all associated binary questions of the same importance?
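The two extraction rules can be compared on the answer vectors from this slide with a small Python sketch. Answers are encoded as booleans (True for Y), and returning K when every answer is Y is an assumption for the minimum-index rule, since the slide does not cover that case.

```python
from typing import Sequence


def rank_by_min_index(answers: Sequence[bool]) -> int:
    """Smallest k with g(x, k) = N; K if all K-1 answers are Y (assumed)."""
    for k, answer in enumerate(answers, start=1):
        if not answer:
            return k
    return len(answers) + 1


def rank_by_counting(answers: Sequence[bool]) -> int:
    """One plus the number of Y answers."""
    return 1 + sum(1 for answer in answers if answer)


Y, N = True, False
consistent = [Y, Y, N, N]
noisy = [Y, N, Y, Y, N, N, Y, N, N]
print(rank_by_min_index(consistent), rank_by_counting(consistent))
print(rank_by_min_index(noisy), rank_by_counting(noisy))
```

On the consistent vector both rules agree; on the noisy vector the minimum-index rule stops at the first N, while counting uses all nine answers.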

(18)

Importance of Associated Binary Questions

given a movie x with rank y = 2 and c[k] = (y − k)²

question                             answers
g(x, 1): is x better than rank 1?    No   Yes  Yes  Yes
g(x, 2): is x better than rank 2?    No   No   Yes  Yes
g(x, 3): is x better than rank 3?    No   No   No   Yes
g(x, 4): is x better than rank 4?    No   No   No   No
r_g(x)                               1    2    3    4
c[r_g(x)]                            1    0    1    4

1 more for answering question 2 wrong; but 3 more for answering question 3 wrong

$(w)_k \equiv |c[k+1] - c[k]|$: the importance of $((x, k), (z)_k)$

per-example error bound (Li and Lin, 2007; Lin, 2008): for consistent answers or convex costs,

$$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k [\![ (z)_k \ne g(x, k) ]\!]$$

accurate binary answers ⟹ correct ranks
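The per-example bound can be checked by brute force on small cases. The Python sketch below assumes that the label (z)_k encodes "is the true rank y better than rank k?", that ranks are extracted by counting, and that K − 1 questions are asked; the function names are illustrative.

```python
from itertools import product
from typing import List, Sequence


def binary_labels(y: int, K: int) -> List[bool]:
    """(z)_k = [[y > k]] for k = 1..K-1 (assumed label encoding)."""
    return [y > k for k in range(1, K)]


def importance_weights(c: Sequence[int]) -> List[int]:
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c is a 0-indexed list."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]


def bound_holds(y: int, c: Sequence[int]) -> bool:
    """Check c[r_g(x)] <= sum_k (w)_k [[(z)_k != g(x,k)]] for every answer vector."""
    K = len(c)
    z, w = binary_labels(y, K), importance_weights(c)
    for g in product([False, True], repeat=K - 1):
        rank = 1 + sum(g)  # counting rule
        weighted_error = sum(wk for wk, zk, gk in zip(w, z, g) if zk != gk)
        if c[rank - 1] > weighted_error:
            return False
    return True


print(importance_weights([1, 0, 1, 4]))   # weights for y = 2, squared cost
print(bound_holds(2, [1, 0, 1, 4]))       # convex cost
print(bound_holds(2, [1, 0, 1, 1]))       # classification cost, not convex
```

For the squared cost the check passes on all answer vectors, including inconsistent ones. For the classification cost it fails on the inconsistent vector (Y, N, Y), which agrees with the stated condition "consistent answers or convex costs".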

(19)

The Reduction Framework

1 transform ordinal examples (x_n, y_n, c_n) to weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k)

2 use your favorite algorithm on the weighted binary examples and get K − 1 binary classifiers (i.e., one big joint binary classifier) g(x, k)

3 for each new input x, predict its rank using $r_g(x) = 1 + \sum_{k=1}^{K-1} [\![ g(x, k) = \mathrm{Y} ]\!]$

ordinal examples (x_n, y_n, c_n)

⇒ weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k), k = 1, · · · , K − 1

⇒ core binary classification algorithm

⇒ associated binary classifiers g(x, k), k = 1, · · · , K − 1

⇒ ordinal ranker r_g(x)
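The three steps of the framework can be sketched end to end in Python. This is only an illustration under stated assumptions: inputs are scalars, the label (z)_k is taken to be "is y better than rank k?", and the "favorite algorithm" is replaced by a toy weighted decision stump per question (`fit_stumps`), which is a stand-in and not one of the learners used in the talk.

```python
from typing import Callable, Dict, List, Sequence, Tuple

BinaryExample = Tuple[Tuple[float, int], bool, float]  # ((x, k), z, w)


def to_weighted_binary(x: float, y: int, c: Sequence[float]) -> List[BinaryExample]:
    """Step 1: one ordinal example becomes K-1 weighted binary examples."""
    K = len(c)
    # z = [[y > k]] (assumed encoding); w = |c[k+1] - c[k]| in 1-based notation
    return [((x, k), y > k, abs(c[k] - c[k - 1])) for k in range(1, K)]


def fit_stumps(examples: Sequence[BinaryExample], K: int) -> Callable[[float, int], bool]:
    """Step 2 (toy stand-in): per question k, pick t minimizing weighted error of 'x > t'."""
    thresholds: Dict[int, float] = {}
    for k in range(1, K):
        subset = [(x, z, w) for (x, kk), z, w in examples if kk == k]
        candidates = [float("-inf")] + sorted(x for x, _, _ in subset)
        thresholds[k] = min(
            candidates,
            key=lambda t: sum(w for x, z, w in subset if (x > t) != z),
        )
    return lambda x, k: x > thresholds[k]


def predict_rank(g: Callable[[float, int], bool], x: float, K: int) -> int:
    """Step 3: counting rule r_g(x) = 1 + number of Y answers."""
    return 1 + sum(g(x, k) for k in range(1, K))


def absolute_cost(y: int, K: int) -> List[int]:
    return [abs(y - k) for k in range(1, K + 1)]


K = 4
train = [(0.1, 1), (0.2, 1), (1.1, 2), (1.3, 2), (2.2, 3), (3.5, 4)]  # made-up data
binary = [ex for x, y in train for ex in to_weighted_binary(x, y, absolute_cost(y, K))]
g = fit_stumps(binary, K)
print([predict_rank(g, x, K) for x, _ in train])
```

Swapping `fit_stumps` for any other learner that accepts weighted binary examples leaves steps 1 and 3 unchanged, which is the point of the reduction.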

(20)

Properties of Reduction

performance guarantee: accurate binary answers ⟹ correct ranks

wide applicability: systematic; works with any reasonable c and any binary classification algorithm

up-to-date: allows new improvements in binary classification to be immediately inherited by ordinal ranking

If I have seen further it is by

standing on the shoulders of Giants—I. Newton

(21)

Theoretical Guarantees of Reduction (1/3)

is reduction a reasonable approach? YES!

error transformation theorem (Li and Lin, 2007)

For consistent answers or convex costs, if g makes test error ∆ in the induced binary problem, then r_g pays test cost at most ∆ in ordinal ranking.

a one-step extension of the per-example error bound

conditions: general and minor

performance guarantee in the absolute sense: accuracy in binary classification ⟹ correctness in ordinal ranking

What if the induced binary problem is “too hard” and even the best g_∗ can only commit a big ∆?
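The "one-step extension" of the per-example bound can be sketched as follows. The sketch assumes that test examples (x, y, c) are drawn from a fixed distribution and that ∆ denotes the expected weighted error of g over the K − 1 induced binary examples; the exact normalization in the cited papers may differ.

```latex
\begin{align*}
\mathbb{E}_{(x,y,c)}\, c[r_g(x)]
  &\le \mathbb{E}_{(x,y,c)} \sum_{k=1}^{K-1} (w)_k \,
       [\![ (z)_k \ne g(x,k) ]\!]
  && \text{(per-example bound, taken in expectation)} \\
  &= \Delta
  && \text{(assumed definition of the binary test error)}
\end{align*}
```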

(22)

Theoretical Guarantees of Reduction (2/3)

is reduction a promising approach? YES!

regret transformation theorem (Lin, 2008)

For a general class of reasonable costs, if g is ε-close to the optimal binary classifier g_∗, then r_g is ε-close to the optimal ordinal ranker r_∗.

error guarantee in the relative setting: regardless of the absolute hardness of the induced binary problem, optimality in binary classification ⟹ optimality in ordinal ranking

reduction does not introduce additional hardness

It is sufficient to go with reduction plus binary classification, but is it necessary?

(23)

Theoretical Guarantees of Reduction (3/3)

is reduction a principled approach? YES!

equivalence theorem (Lin, 2008)

For a general class of reasonable costs, ordinal ranking is learnable by a learning model if and only if binary classification is learnable by the associated learning model.

a surprising equivalence: ordinal ranking is as easy as binary classification

“without loss of generality”, we can just focus on binary classification

reduction to binary classification:

systematic, reasonable, promising, and principled

(24)

Usefulness of the Reduction Framework

(25)

Unifying Existing Algorithms

ordinal ranking                          cost            binary classification algorithm
PRank (Crammer and Singer, 2002)         absolute        modified perceptron rule
kernel ranking (Rajaram et al., 2003)    classification  modified hard-margin SVM
SVOR-EXP (Chu and Keerthi, 2005)         classification  modified soft-margin SVM
SVOR-IMC (Chu and Keerthi, 2005)         absolute        modified soft-margin SVM
ORBoost-LR (Lin and Li, 2006)            classification  modified AdaBoost
ORBoost-All (Lin and Li, 2006)           absolute        modified AdaBoost

if the reduction framework had been there,

development and implementation time could have been saved

correctness proof significantly simplified (PRank)

algorithmic structure revealed (SVOR, ORBoost)

variants of existing algorithms can be designed quickly by tweaking reduction

(26)

Designing New Algorithms (1/2)

ordinal ranking       cost      binary classification algorithm
Reduction-C4.5        absolute  standard C4.5 decision tree
Reduction-AdaBoost    absolute  standard AdaBoost
Reduction-SVM         absolute  standard soft-margin SVM

SVOR (modified SVM) vs. Reduction-SVM (standard SVM):

[Figure: avg. training time (hour) of SVOR and RED-SVM on the data sets ban, com, cal, cen]

advantages of core binary classification algorithm inherited in the new ordinal ranking one

(27)

Designing New Algorithms (2/2)

AdaBoost (Freund and Schapire, 1997)

for t = 1, 2, · · · , T,

1 find a simple g_t that matches best with the current “view” of {(X_n, Y_n)}

2 give a larger weight v_t to g_t if the match is stronger

3 update “view” by emphasizing the weights of those (X_n, Y_n) that g_t doesn’t predict well

prediction: majority vote of {(v_t, g_t(x))}

AdaBoost.OR (Lin, 2008)

for t = 1, 2, · · · , T,

1 find a simple r_t that matches best with the current “view” of {(x_n, y_n)}

2 give a larger weight v_t to r_t if the match is stronger

3 update “view” by emphasizing the costs c_n of those (x_n, y_n) that r_t doesn’t predict well

prediction: weighted median of {(v_t, r_t(x))}

AdaBoost.OR: an extension of Reduction-AdaBoost; a parallel of AdaBoost in ordinal ranking
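The prediction step of AdaBoost.OR aggregates the ranks r_t(x) by a weighted median. A minimal Python sketch of that aggregation is given below; it uses one common definition of the weighted median (the smallest rank whose cumulative weight reaches half of the total), which is an assumption, as the talk does not spell out tie handling. The function name is illustrative.

```python
from typing import Sequence


def weighted_median(ranks: Sequence[int], weights: Sequence[float]) -> int:
    """Smallest rank whose cumulative weight reaches half of the total weight."""
    total = sum(weights)
    cumulative = 0.0
    for rank, weight in sorted(zip(ranks, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return rank
    raise ValueError("need at least one ranker with positive weight")


# three hypothetical base rankers r_t predicting ranks 1, 3, 4 with weights v_t
print(weighted_median([1, 3, 4], [2, 5, 3]))
```

Unlike a weighted mean, the result is always one of the predicted ranks, so it stays inside {1, · · · , K} and does not depend on treating ranks as numbers.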

(28)

Proving New Theorems

Binary Classification (Bartlett and Shawe-Taylor, 1998)

For SVM, with prob. > 1 − δ,

$$\text{expected test error} \le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(X_n, Y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Ordinal Ranking (Li and Lin, 2007)

For SVOR or Red.-SVM, with prob. > 1 − δ,

$$\text{expected test cost} \le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} (w_n)_k [\![ \bar{\rho}((x_n, k), (z_n)_k) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

new test cost bounds with any c[·]

(29)

Reduction-C4.5 vs. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

even simple Reduction-C4.5 sometimes beats SVOR

(30)

Reduction-SVM vs. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

Reduction-SVM without modification is often better than SVOR, and faster

(31)

Can We Win the Netflix Prize with Reduction?

possibly

a principled view of the problem

now easy to apply known binary classification techniques or to design suitable ordinal ranking approaches

e.g., AdaBoost.OR “boosted” some simple r_t and reduced the test cost from 1.0704 to 1.0343

but not yet

need 0.8563 to win

the problem has its own characteristics

huge data set: computational bottleneck

allows real-valued predictions: r(x) ∈ R instead of r(x) ∈ {1, · · · , K}

encoding(movie), encoding(user): important

many interesting research problems arose during “CS156b: Learning Systems”

(32)

Conclusion

reduction framework: simple, intuitive, and useful for ordinal ranking

algorithmic reduction: unifying existing ordinal ranking algorithms; designing new ordinal ranking algorithms

theoretic reduction: new bounds on ordinal ranking test cost

promising experimental results: some for better performance, some for faster training time

reduction keeps ordinal ranking up-to-date with binary classification
