
# From Ordinal Ranking to Binary Classification



Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk at Caltech CS/IST Lunch Bunch, March 4, 2008

Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06) and discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap


## Introduction to Ordinal Ranking

Hsuan-Tien Lin (Caltech) From Ordinal Ranking to Binary Classification 03/04/2008 2 / 32


## Which Age-Group?

[Figure: example picture with age-group rank 2, on a scale of ranks 1, 2, 3, 4]

rank: a finite ordered set of labels $\mathcal{Y} = \{1, 2, \cdots, K\}$


## Hot or Not?

http://www.hotornot.com

rank: natural representation of human preferences


## How Much Did You Like These Movies?

http://www.netflix.com

goal: use “movies you’ve rated” to automatically predict your preferences (ranks) on future movies


## How Does a Machine Learn Your Preferences?

human learning: Alice → (movie, rank) pairs → brain of Bob → good hypothesis

alternatives for the brain: prefer romance/action/etc.

machine learning: You → examples (movie $x_n$, rank $y_n$) → learning algorithm → good hypothesis $r(x)$

alternatives for the learning algorithm: the learning model

challenge: how to make the machine learning side (the right-hand side of the diagram) work?


## Ordinal Ranking Problem

given: $N$ examples (input $x_n$, rank $y_n$) $\in \mathcal{X} \times \mathcal{Y}$, e.g.

age-group: $\mathcal{X}$ = encoding(human pictures), $\mathcal{Y} = \{1, \cdots, 4\}$

hotornot: $\mathcal{X}$ = encoding(human pictures), $\mathcal{Y} = \{1, \cdots, 10\}$

netflix: $\mathcal{X}$ = encoding(movies), $\mathcal{Y} = \{1, \cdots, 5\}$

goal: an ordinal ranker (hypothesis) $r(x)$ that “closely predicts” the ranks $y$ associated with some unseen inputs $x$

a hot and important research problem:

relatively new for machine learning

connecting classification and regression

matching human preferences, with many applications in social science and information retrieval


## Ongoing Heat: Netflix Million Dollar Prize

(since 10/2006)

a huge joint ordinal ranking problem

given: each user $u$ (480,189 users) rates $N_u$ (from tens to hundreds) movies, a total of $\sum_u N_u = 100{,}480{,}507$ examples

goal: personalized predictions $r_u(x)$ on 2,817,131 testing queries $(u, x)$

the first team to be 10% better than the original Netflix system gets a million USD


## Properties of Ranks

$\mathcal{Y} = \{1, 2, \cdots, 5\}$

representing order:

[Figure: two ratings compared with <]

relabeling by (3, 1, 2, 4, 5) erases information

general multiclass classification cannot properly use ordering information

not carrying numerical information:

[Figure: one rating is not 2.5 times better than another]

relabeling by (2, 3, 5, 9, 16) shouldn’t change results

general metric regression deteriorates without correct numerical information

ordinal ranking resides uniquely between multiclass classification and metric regression


## Cost of Wrong Prediction

ranks carry no numerical meaning: how to say “closely predict”?

artificially quantify the cost of being wrong

infant (1), child (2), teen (3), adult (4)

small mistake: classify a child as a teen

big mistake: classify an infant as an adult

cost vector $c$ of example $(x, y, c)$: $c[k]$ = cost when predicting $(x, y)$ as rank $k$

e.g. for the example (picture of a child, 2), a reasonable cost is $c = (2, 0, 1, 4)$

closely predict: small cost


## Reasonable Cost Vectors

For an ordinal example $(x, y, c)$, the cost vector $c$ should

respect the rank $y$: $c[y] = 0$; $c[k] \ge 0$

respect the ordinal information: V-shaped or even convex

[Figures: two plots of the cost $C_{y,k}$ against the predicted rank $k$ (1: infant, 2: child, 3: teenager, 4: adult), one V-shaped and one convex]

V-shaped: pay more when predicting further away

convex: pay increasingly more when further away

| cost | formula | shape | example ($y = 2$, $K = 4$) |
| --- | --- | --- | --- |
| classification | $c[k] = [\![ y \ne k ]\!]$ | V-shaped only | (1, 0, 1, 1) |
| absolute | $c[k] = \lvert y - k \rvert$ | convex | (1, 0, 1, 2) |
| squared (Netflix) | $c[k] = (y - k)^2$ | convex | (1, 0, 1, 4) |
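The three cost vectors above can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative, and the checks `is_v_shaped` and `is_convex` encode one reading of "pay more" and "pay increasingly more" when further away (non-decreasing cost, and non-decreasing successive differences, respectively).

```python
def classification_cost(y, K):
    """c[k] = [[y != k]]: every wrong rank costs 1."""
    return [0 if k == y else 1 for k in range(1, K + 1)]


def absolute_cost(y, K):
    """c[k] = |y - k|."""
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y, K):
    """c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c, y):
    """Assumed reading: zero at rank y and never cheaper when further from y.

    c is a 0-based list, so c[k - 1] is the cost of predicting rank k.
    """
    K = len(c)
    falls = all(c[i] >= c[i + 1] for i in range(0, y - 1))
    rises = all(c[i] <= c[i + 1] for i in range(y - 1, K - 1))
    return c[y - 1] == 0 and falls and rises


def is_convex(c):
    """Assumed reading: successive cost differences never decrease."""
    steps = [b - a for a, b in zip(c, c[1:])]
    return all(s <= t for s, t in zip(steps, steps[1:]))


for make_cost in (classification_cost, absolute_cost, squared_cost):
    c = make_cost(2, 4)
    print(make_cost.__name__, c, is_v_shaped(c, 2), is_convex(c))
```

Under these definitions the classification cost is V-shaped but not convex, which matches the "V-shaped only" label on the slide.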


## Our Contributions

a new framework that works with any reasonable cost, and ...

reduces ordinal ranking to binary classification systematically

unifies and clearly explains many existing ordinal ranking algorithms

makes the design of new ordinal ranking algorithms much easier

allows simple and intuitive proofs for new ordinal ranking theorems


## Reduction from Ordinal Ranking to Binary Classification


## Thresholded Model

If we can first compute the score $s(x)$ of a movie $x$, how can we construct $r(x)$ from $s(x)$?

[Figure: examples placed on a line by the score function $s(x)$; the ordered thresholds $\theta_1, \theta_2, \theta_3$ cut the line into four intervals, which give the ordinal ranker $r(x) \in \{1, 2, 3, 4\}$, shown against the target rank $y \in \{1, 2, 3, 4\}$]

quantize $s(x)$ by some ordered thresholds $\theta$

commonly used in previous work:

thresholded perceptrons (PRank, Crammer and Singer, 2002)

thresholded hyperplanes (SVOR, Chu and Keerthi, 2005)

thresholded ensembles (ORBoost, Lin and Li, 2006)

thresholded model: $r(x) = \min \{k : s(x) < \theta_k\}$
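A minimal sketch of the thresholded model in Python. It assumes the $K - 1$ finite thresholds are passed in non-decreasing order and that an implicit last threshold $\theta_K = +\infty$ closes the top interval; the function name and the example numbers are illustrative, not from the talk.

```python
import math


def thresholded_rank(score, thresholds):
    """Return r(x) = min{k : s(x) < theta_k} for a given score s(x).

    `thresholds` holds theta_1 <= ... <= theta_{K-1}; theta_K = +inf is
    appended here (an assumption) so that the minimum always exists.
    """
    if any(a > b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be ordered")
    extended = list(thresholds) + [math.inf]
    for k, theta in enumerate(extended, start=1):
        if score < theta:
            return k
    return len(extended)  # only reached for score = +inf or NaN


# Hypothetical thresholds for K = 4 ranks.
print(thresholded_rank(1.5, [1.0, 2.0, 3.0]))
```

A score exactly equal to a threshold falls into the higher rank, because the comparison in the model is strict.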


## Key of Reduction: Associated Binary Questions

getting the rank using a thresholded model:

1. is $s(x) > \theta_1$? Yes
2. is $s(x) > \theta_2$? No
3. is $s(x) > \theta_3$? No
4. is $s(x) > \theta_4$? No

generally, how do we query the rank of a movie $x$?

1. is movie $x$ better than rank 1? Yes
2. is movie $x$ better than rank 2? No
3. is movie $x$ better than rank 3? No
4. is movie $x$ better than rank 4? No

associated binary questions $g(x, k)$: is movie $x$ better than rank $k$?


## More on Associated Binary Questions

$g(x, k)$: is movie $x$ better than rank $k$?

e.g. thresholded model $g(x, k) = \mathrm{sign}(s(x) - \theta_k)$

$K - 1$ binary classification problems, one w.r.t. each $k$

[Figure: the thresholded-model diagram (score $s(x)$, ranker $r_g(x)$, target rank $y$) with three extra rows; row $k$ shows threshold $\theta_k$ and marks each example N or Y, comparing the answer $g(x, k)$ with the label $(z)_k$]

let $\bigl((x, k), (z)_k\bigr)$ be binary examples

$(x, k)$: extended input w.r.t. the $k$-th query

$(z)_k$: binary label Y/N

if $g(x, k) = (z)_k$ for all $k$, we can compute $r_g(x)$ from $g(x, k)$ such that $r_g(x) = y$


## Computing Ranks from Associated Binary Questions

$g(x, k)$: is movie $x$ better than rank $k$?

Consider $g(x, 1), g(x, 2), \cdots, g(x, K-1)$

consistent answers: (Y, Y, N, N, $\cdots$, N)

extracting the rank from consistent answers:

minimum index searching: $r_g(x) = \min \{k : g(x, k) = \text{N}\}$

counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

the two approaches are equivalent for consistent answers; counting is simpler to analyze, and is robust to noise

are all associated binary questions of the same importance?
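The two ways of extracting a rank can be compared directly. In this sketch the answers to the $K - 1$ questions are a list of booleans (`True` for Y); returning $K$ when every answer is Y is an assumed convention for minimum index searching, in line with the always-No question for rank $K$ on the earlier slide.

```python
def rank_by_min_index(answers):
    """r_g(x) = min{k : g(x, k) = N}; answers[k - 1] is True for 'Y'.

    If all K - 1 answers are 'Y', return K (assumed convention).
    """
    for k, yes in enumerate(answers, start=1):
        if not yes:
            return k
    return len(answers) + 1


def rank_by_counting(answers):
    """r_g(x) = 1 + number of 'Y' answers."""
    return 1 + sum(1 for yes in answers if yes)


consistent = [True, True, False, False]    # (Y, Y, N, N) with K = 5
inconsistent = [True, False, True, True]   # not of the form Y...Y N...N
print(rank_by_min_index(consistent), rank_by_counting(consistent))
print(rank_by_min_index(inconsistent), rank_by_counting(inconsistent))
```

On inconsistent answers the two rules can disagree: minimum index searching stops at the first N, while counting lets every answer contribute.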


## Importance of Associated Binary Questions

given a movie $x$ with rank $y = 2$ and $c[k] = (y - k)^2$

| question | answers (a) | answers (b) | answers (c) | answers (d) |
| --- | --- | --- | --- | --- |
| $g(x, 1)$: is $x$ better than rank 1? | No | Yes | Yes | Yes |
| $g(x, 2)$: is $x$ better than rank 2? | No | No | Yes | Yes |
| $g(x, 3)$: is $x$ better than rank 3? | No | No | No | Yes |
| $g(x, 4)$: is $x$ better than rank 4? | No | No | No | No |
| $r_g(x)$ | 1 | 2 | 3 | 4 |
| $c[r_g(x)]$ | 1 | 0 | 1 | 4 |

1 more for answering question 2 wrong; but 3 more for answering question 3 wrong

$(w)_k \equiv \bigl\lvert c[k+1] - c[k] \bigr\rvert$: the importance of $\bigl((x, k), (z)_k\bigr)$

per-example error bound (Li and Lin, 2007; Lin, 2008): for consistent answers or convex costs

$$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \ne g(x, k) ]\!]$$

accurate binary answers $\Longrightarrow$ correct ranks
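The per-example error bound can be checked by brute force on the slide's example ($y = 2$, squared cost, $K = 4$). The sketch below assumes the label $(z)_k$ is Y exactly when $y > k$ and takes $(w)_k$ as the absolute difference of neighbouring costs; the helper names are illustrative.

```python
from itertools import product


def binary_labels(y, K):
    """(z)_k for k = 1..K-1: True ('Y') iff the true rank y exceeds k (assumed)."""
    return [y > k for k in range(1, K)]


def importance_weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c[k - 1] is the cost of rank k."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]


def counted_rank(answers):
    """r_g(x) = 1 + number of 'Y' answers."""
    return 1 + sum(answers)


def weighted_binary_error(answers, z, w):
    """Sum of (w)_k over the questions that are answered wrongly."""
    return sum(wk for a, zk, wk in zip(answers, z, w) if a != zk)


y, c = 2, [1, 0, 1, 4]          # squared cost for y = 2, K = 4
z, w = binary_labels(y, len(c)), importance_weights(c)

# Try every possible answer vector, consistent or not.
for answers in product([False, True], repeat=len(c) - 1):
    cost = c[counted_rank(answers) - 1]
    assert cost <= weighted_binary_error(answers, z, w)
print("weights:", w)
```

For this convex cost the bound holds for all eight answer vectors, including the inconsistent ones, and is tight for the consistent ones.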


## The Reduction Framework

1. transform ordinal examples $(x_n, y_n, c_n)$ to weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$

2. use your favorite algorithm on the weighted binary examples and get $K - 1$ binary classifiers (i.e., one big joint binary classifier) $g(x, k)$

3. for each new input $x$, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

[Figure: flow diagram. ordinal examples $(x_n, y_n, c_n)$ → weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$, $k = 1, \cdots, K-1$ → core binary classification algorithm → associated binary classifiers $g(x, k)$, $k = 1, \cdots, K-1$ → ordinal ranker $r_g(x)$]


## Properties of Reduction

performance guarantee:

accurate binary answers $\Longrightarrow$ correct ranks

wide applicability:

systematic; works with any reasonable $c$ and any binary classification algorithm

up-to-date:

allows new improvements in binary classification to be immediately inherited by ordinal ranking

If I have seen further it is by

standing on the shoulders of Giants—I. Newton
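The three steps of the reduction framework can be sketched end to end in plain Python. Everything concrete here is an assumption made for illustration: inputs are single numbers, the "favorite algorithm" is a toy weighted threshold learner (`fit_stump`), one classifier is trained per $k$ instead of one joint classifier, and the label $(z)_k$ is taken to be Y when $y > k$.

```python
import math


def to_weighted_binary(x, y, c):
    """Step 1: one ordinal example -> K - 1 triples ((x, k), (z)_k, (w)_k)."""
    K = len(c)
    return [((x, k), y > k, abs(c[k] - c[k - 1])) for k in range(1, K)]


def fit_stump(points):
    """Toy core learner: threshold t minimizing the weighted error of 'x > t'.

    `points` is a list of (x, z, w) triples with scalar x (an assumption).
    """
    xs = sorted({x for x, _, _ in points})
    candidates = [-math.inf] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [math.inf]

    def weighted_error(t):
        return sum(w for x, z, w in points if (x > t) != z)

    return min(candidates, key=weighted_error)


def train_reduction(examples, K):
    """Steps 1 and 2: build the binary examples and learn g(x, k)."""
    per_k = {k: [] for k in range(1, K)}
    for x, y, c in examples:
        for (xx, k), z, w in to_weighted_binary(x, y, c):
            per_k[k].append((xx, z, w))
    thresholds = {k: fit_stump(points) for k, points in per_k.items()}

    def g(x, k):
        return x > thresholds[k]

    return g


def predict_rank(g, x, K):
    """Step 3: r_g(x) = 1 + number of 'Y' answers."""
    return 1 + sum(g(x, k) for k in range(1, K))


K = 4
data = [(0.1, 1), (0.2, 1), (1.1, 2), (1.3, 2), (2.2, 3), (2.4, 3), (3.5, 4), (3.6, 4)]
examples = [(x, y, [abs(y - k) for k in range(1, K + 1)]) for x, y in data]  # absolute cost
g = train_reduction(examples, K)
print([predict_rank(g, x, K) for x, _ in data])
```

Swapping `fit_stump` for any other weighted binary learner leaves `to_weighted_binary` and `predict_rank` unchanged, which is the sense in which improvements in binary classification carry over.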


## Theoretical Guarantees of Reduction (1/3)

is reduction a reasonable approach? YES!

error transformation theorem (Li and Lin, 2007)

if $g$ makes test error $\Delta$ in the induced binary problem, then $r_g$ pays test cost at most $\Delta$ in ordinal ranking.

a one-step extension of the per-example error bound

conditions: general and minor

performance guarantee in the absolute sense:

accuracy in binary classification $\Longrightarrow$ correctness in ordinal ranking

What if the induced binary problem is “too hard” and even the best $g$ can only commit a big $\Delta$?
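The "one-step extension" behind the error transformation theorem can be written out as a short derivation. This is a sketch rather than the statement from the cited paper: it assumes $\Delta$ is measured as the expected total weighted 0/1 loss of $g$ over the $K - 1$ binary examples generated from one ordinal example, and that the conditions of the per-example error bound hold.

```latex
% Assumption: \Delta is the expected total weighted 0/1 loss of g over the
% K-1 binary examples generated from one ordinal example (x, y, c).
\begin{align*}
\mathbb{E}_{(x, y, c)} \bigl[ c[r_g(x)] \bigr]
  &\le \mathbb{E}_{(x, y, c)} \Bigl[ \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \ne g(x, k) ]\!] \Bigr]
  && \text{(per-example error bound)} \\
  &= \Delta
  && \text{(definition of the binary test error assumed above)}
\end{align*}
```

The only step is taking the expectation on both sides of the per-example bound; any normalization of the binary weights would rescale both sides in the same way.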


## Theoretical Guarantees of Reduction (2/3)

is reduction a promising approach? YES!

regret transformation theorem (Lin, 2008)

For a general class of reasonable costs, if $g$ is $\epsilon$-close to the optimal binary classifier $g^*$, then $r_g$ is $\epsilon$-close to the optimal ordinal ranker $r^*$.

error guarantee in the relative setting:

regardless of the absolute hardness of the induced binary problem, optimality in binary classification $\Longrightarrow$ optimality in ordinal ranking

reduction does not introduce additional hardness

It is sufficient to go with reduction plus binary classification, but is it necessary?


## Theoretical Guarantees of Reduction (3/3)

is reduction a principled approach? YES!

equivalence theorem (Lin, 2008)

For a general class of reasonable costs, ordinal ranking is learnable by a learning model if and only if binary classification is learnable by the associated learning model.

a surprising equivalence:

ordinal ranking is as easy as binary classification

“without loss of generality”, we can just focus on binary classification

reduction to binary classification:

systematic, reasonable, promising, and principled


## Usefulness of the Reduction Framework


## Unifying Existing Algorithms

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| PRank (Crammer and Singer, 2002) | absolute | modified perceptron rule |
| kernel ranking (Rajaram et al., 2003) | classification | modified hard-margin SVM |
| SVOR-EXP (Chu and Keerthi, 2005) | classification | modified soft-margin SVM |
| SVOR-IMC (Chu and Keerthi, 2005) | absolute | modified soft-margin SVM |
| ORBoost (Lin and Li, 2006) | | |

if the reduction framework had been there,

development and implementation time could have been saved

correctness proof significantly simplified (PRank)

algorithmic structure revealed (SVOR, ORBoost)

variants of existing algorithms can be designed quickly by tweaking reduction


## Designing New Algorithms (1/2)

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| Reduction-C4.5 | absolute | standard C4.5 decision tree |
| Reduction-AdaBoost | absolute | standard AdaBoost |
| Reduction-SVM | absolute | standard soft-margin SVM |

SVOR (modified SVM) vs. Reduction-SVM (standard SVM):

[Figure: bar chart of avg. training time in hours (axis from 0 to 6) for SVOR and RED-SVM on the data sets ban, com, cal, cen]

advantages of the core binary classification algorithm are inherited by the new ordinal ranking one


## Designing New Algorithms (2/2)

AdaBoost for binary classification:

for $t = 1, 2, \cdots, T$,

1. find a simple $g_t$ that matches best with the current “view” of $\{(X_n, Y_n)\}$
2. give a larger weight $v_t$ to $g_t$ if the match is stronger
3. update “view” by emphasizing the weights of those $(X_n, Y_n)$ that $g_t$ doesn’t predict well

prediction: majority vote of $\bigl\{(v_t, g_t(x))\bigr\}$

its counterpart for ordinal ranking:

for $t = 1, 2, \cdots, T$,

1. find a simple $r_t$ that matches best with the current “view” of $\{(x_n, y_n)\}$
2. give a larger weight $v_t$ to $r_t$ if the match is stronger
3. update “view” by emphasizing the costs $c_n$ of those $(x_n, y_n)$ that $r_t$ doesn’t predict well

prediction: weighted median of $\bigl\{(v_t, r_t(x))\bigr\}$

a parallel of AdaBoost in ordinal ranking
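The weighted-median prediction step can be sketched as follows. The tie-breaking rule (take the smallest rank whose cumulative weight reaches half of the total) is an assumption of this sketch, as are the function name and the example numbers; the training loop itself is not shown.

```python
def weighted_median_rank(ranks, weights, K):
    """Weighted median of the rankers' outputs r_t(x) with weights v_t.

    Returns the smallest rank k such that the rankers predicting a rank
    <= k hold at least half of the total weight (assumed tie-breaking).
    Weights are assumed to be non-negative with a positive total.
    """
    total = sum(weights)
    cumulative = 0.0
    for k in range(1, K + 1):
        cumulative += sum(v for r, v in zip(ranks, weights) if r == k)
        if cumulative >= total / 2:
            return k
    return K


# Three hypothetical rankers predict ranks 1, 3 and 4 with equal weights.
print(weighted_median_rank([1, 3, 4], [1.0, 1.0, 1.0], K=4))
```

Unlike a weighted mean, the weighted median always returns one of the ranks in $\{1, \cdots, K\}$ and does not treat ranks as numerical quantities.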


## Proving New Theorems

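Binary Classification (Bartlett and Shawe-Taylor, 1998)

For SVM, with prob. $> 1 - \delta$,

$$\text{expected test error} \le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(X_n, Y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Ordinal Ranking (Li and Lin, 2007)

For SVOR or Red.-SVM, with prob. $> 1 - \delta$,

$$\text{expected test cost} \le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} (w_n)_k \, [\![ \bar{\rho}\bigl((x_n, k), (z_n)_k\bigr) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

new test cost bounds with any $c[\cdot]$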


## Reduction-C4.5 vs. SVOR

[Figure: bar chart of avg. test absolute cost (axis from 0 to 2.5) for SVOR (Gauss) and RED-C4.5 on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

even simple Reduction-C4.5 sometimes beats SVOR


## Reduction-SVM vs. SVOR

[Figure: bar chart of avg. test absolute cost (axis from 0 to 2.5) for SVOR (Gauss) and RED-SVM (Perc.) on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

Reduction-SVM without modification is often better than SVOR, and faster


## Can We Win the Netflix Prize with Reduction?

possibly

a principled view of the problem

now easy to apply known binary classification techniques or to design suitable ordinal ranking approaches

e.g., AdaBoost.OR “boosted” some simple $r_t$ and reduced the test cost from 1.0704 to 1.0343

but not yet

need 0.8563 to win

the problem has its own characteristics:

huge data set: computational bottleneck

allows real-valued predictions: $r(x) \in \mathbb{R}$ instead of $r(x) \in \{1, \cdots, K\}$

encoding(movie), encoding(user): important

many interesting research problems arose during “CS156b: Learning Systems”


## Conclusion

reduction framework: simple, intuitive, and useful for ordinal ranking

algorithmic reduction:

unifying existing ordinal ranking algorithms

designing new ordinal ranking algorithms

theoretic reduction:

new bounds on ordinal ranking test cost

promising experimental results:

some for better performance

some for faster training time

reduction keeps ordinal ranking up-to-date with binary classification
