
(1)

From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Department of Computer Science and Information Engineering National Taiwan University

Talk in the Applied Math Department at NDHU February, 2010

Joint work with Dr. Ling Li at Caltech (ALT’06, NIPS’06)

(2)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(3)

Which Digit Did You Write?

?

one (1) two (2) three (3) four (4)

How can machines learn to classify?

(4)

Supervised Statistical Machine Learning

[diagram: supervised learning as teaching]

Parent → (picture, label) pairs → Kid's good decision function (brain), picked from many possibilities

Truth f(x) + noise e(x) → (iid w.r.t. some P on x and noise) → examples (picture xn, label yn) → learning algorithm, searching the learning model {gα(x)} → good decision function g(x) ≈ f(x)

challenge: see only {(xn, yn)} without knowing f(x) or e(x)
=?⇒ generalize to unseen (x, y) w.r.t. f(x)

(5)

Some Classical Learning Problems

regression: numerical yn (∈ R), e.g. students' scores; stock prices

Translation for Statisticians:
examples == sample from the process
α == parameters
decision function == estimate of parameters

classification: discrete yn

{one, two, three, four}

{apple, orange, banana}

{yes, no}:binary classification


new types of machine learning problems keep coming from new applications

(6)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(7)

Which Age-Group?

2

infant(1) child(2) teen(3) adult(4)

rank: a finite ordered set of labelsY = {1, 2, · · · , K }

(8)

Properties of Ordinal Ranking (1/2)

ranks representorder information

infant (1) < child (2) < teen (3) < adult (4)

general classification cannot properly use order information

(9)

Hot or Not?

http://www.hotornot.com

rank: natural representation of human preferences

(10)

Properties of Ordinal Ranking (2/2)

ranks do not carry numerical information: rating 9 is not 2.25 times "hotter" than rating 4

actual metric hidden:
infant (ages 1–3)
child (ages 4–12)
teen (ages 13–19)
adult (ages 20–)

general regression deteriorates without correct numerical information

(11)

How Much Did You Like These Movies?

http://www.netflix.com

goal: use "movies you've rated" to automatically predict your preferences (ranks) on future movies

(12)

Ordinal Ranking Setup

Given

N examples (input xn, rank yn) ∈ X × Y

age-group: X = encoding(human pictures), Y = {1, · · · , 4}

hotornot: X = encoding(human pictures), Y = {1, · · · , 10}

netflix: X = encoding(movies), Y = {1, · · · , 5}

Goal

an ordinal ranker (decision function) r(x) that "closely predicts" the ranks y associated with some unseen inputs x

ordinal ranking: a hot and important research problem

(13)

Importance of Ordinal Ranking

relatively new for machine learning (not so new for statisticians)

connecting classification and regression

matching human preferences—many applications in social science, information retrieval, psychology and recommendation systems

(14)

Formalizing (Non-)Closeness: Cost

ranks carry no numerical information: how to say “close”?

artificially quantify the cost of being wrong

e.g. loss of customer loyalty when the system says one rating but you feel another

cost == loss for statisticians

cost vector c of example (x, y, c):
c[k] = cost when predicting (x, y) as rank k

e.g. for (Sweet Home Alabama, ), a proper cost is c = (1, 0, 2, 10, 15)

closely predict: small cost during testing

(15)

Ordinal Cost Vectors

For an ordinal example (x, y, c), the cost vector c should be
consistent with rank y: c[y] = min_k c[k] (= 0)
respect order information: V-shaped (ordinal) or even convex (strongly ordinal)

V-shaped: pay more when predicting further away
convex: pay increasingly more when predicting further away

classification: c[k] = ⟦y ≠ k⟧ (ordinal), e.g. (1, 0, 1, 1, 1)
absolute: c[k] = |y − k| (strongly ordinal), e.g. (1, 0, 1, 2, 3)
squared: c[k] = (y − k)² (strongly ordinal), e.g. (1, 0, 1, 4, 9)
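The three cost vectors above can be generated mechanically. A minimal sketch in Python (the helper name `cost_vector` is mine, not from the talk):

```python
def cost_vector(y, K, kind):
    """Cost vector c[1..K] for true rank y, returned as a 0-indexed list."""
    if kind == "classification":        # c[k] = [y != k]
        return [int(y != k) for k in range(1, K + 1)]
    if kind == "absolute":              # c[k] = |y - k|
        return [abs(y - k) for k in range(1, K + 1)]
    if kind == "squared":               # c[k] = (y - k)^2
        return [(y - k) ** 2 for k in range(1, K + 1)]
    raise ValueError(kind)

# y = 2, K = 5 reproduces the three vectors on this slide:
print(cost_vector(2, 5, "classification"))  # [1, 0, 1, 1, 1]
print(cost_vector(2, 5, "absolute"))        # [1, 0, 1, 2, 3]
print(cost_vector(2, 5, "squared"))         # [1, 0, 1, 4, 9]
```

All three are V-shaped with minimum 0 at k = y; the absolute and squared ones are also convex (strongly ordinal).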

(16)

Our Contributions

a theoretical and algorithmic foundation of ordinal ranking, which reduces ordinal ranking to binary classification, and ...

provides a methodology for designing new ordinal ranking algorithms with any ordinal cost effortlessly

takes many existing ordinal ranking algorithms as special cases

introduces new theoretical guarantees on the generalization performance of ordinal rankers

leads to superior experimental results


Figure:truth; traditional algorithm; our algorithm

(17)

Central Idea: Reduction

complex ordinal ranking problems (the iPod)
→ (adapter: the reduction) →
simpler binary classification problems (the cassette player), with well-known results on models, algorithms and theories

If I have seen further it is by

standing on the shoulders of Giants—I. Newton

(18)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(19)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(20)

Threshold Ranker

if we can get an ideal score s(x) of a movie x, how do we construct the discrete r(x) from the analog s(x)?

[figure: the score axis s(x), cut by ordered thresholds θ1, θ2, θ3 into threshold-ranker outputs r(x) ∈ {1, 2, 3, 4} that match the target rank y]

quantize s(x) by ordered (non-uniform) thresholds θk; commonly used in previous work:

threshold perceptrons (PRank, Crammer and Singer, 2002)

threshold hyperplanes (SVOR, Chu and Keerthi, 2005)

threshold ensembles (ORBoost, Lin and Li, 2006)

threshold ranker: r(x) = min{k : s(x) < θk}
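The formula r(x) = min{k : s(x) < θk} translates directly into code. A sketch (the name `threshold_rank` is mine); with K ranks there are K−1 thresholds, and a score above all of them gets the top rank K:

```python
def threshold_rank(s_x, thetas):
    """r(x) = min{k : s(x) < theta_k}; rank K if s(x) clears every threshold."""
    for k, theta in enumerate(thetas, start=1):
        if s_x < theta:
            return k
    return len(thetas) + 1

# K = 4 ranks need K - 1 = 3 ordered thresholds:
thetas = [0.0, 1.0, 2.0]
print([threshold_rank(s, thetas) for s in (-0.5, 0.5, 1.5, 5.0)])  # [1, 2, 3, 4]
```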

(21)

Key Idea: Associated Binary Queries

getting the rank using a threshold ranker

1 is s(x) > θ1? Yes

2 is s(x) > θ2? No

3 is s(x) > θ3? No

4 is s(x) > θ4? No

generally, how do we query the rank of a movie x?

1 is movie x better than rank 1? Yes

2 is movie x better than rank 2? No

3 is movie x better than rank 3? No

4 is movie x better than rank 4? No

associated binary queries: is movie x better than rank k?

(22)

The Reduction Framework: Key Ideas

More on Associated Binary Queries

say, the machine uses g(x, k ) to answer the query

“is movie x better than rank k ?”

e.g. for threshold ranker: g(x, k ) = sign(s(x)− θk)

[figure: along the score axis s(x), each g(x, k) answers N below θk and Y above it, for k = 1, 2, 3]

associated binary examples: ((x, k), (z)k), where (x, k) is the k-th associated binary query and (z)k is its desired answer

(23)

Computing Ranks from Associated Binary Queries

when g(x, k ) answers “is movie x better than rank k ?”

Consider g(x, 1), g(x, 2), · · · , g(x, K−1); concordant predictions: (Y,Y,N,N,N,N,N)

extracting the rank from concordant predictions:
minimum index searching: rg(x) = min{k : g(x, k) = N}
counting: rg(x) = 1 + Σ_k ⟦g(x, k) = Y⟧

two approaches equivalent for concordant predictions

mistaken/non-concordant predictions? e.g. (Y,N,Y,Y,N,N,Y)
counting: simpler to analyze and robust to mistakes
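The two extraction rules can be sketched as follows (function names are mine). On concordant answers they agree; on the non-concordant (Y,N,Y,Y,N,N,Y) they diverge, with counting landing at 5:

```python
def rank_by_counting(answers):
    # rg(x) = 1 + number of Yes answers; answers[k-1] is g(x, k) == Y
    return 1 + sum(answers)

def rank_by_min_index(answers):
    # rg(x) = min{k : g(x, k) = N}; rank K if every answer is Yes
    for k, a in enumerate(answers, start=1):
        if not a:
            return k
    return len(answers) + 1

concordant = [True, True, False, False, False, False, False]   # (Y,Y,N,N,N,N,N)
messy      = [True, False, True, True, False, False, True]     # (Y,N,Y,Y,N,N,Y)
print(rank_by_counting(concordant), rank_by_min_index(concordant))  # 3 3
print(rank_by_counting(messy), rank_by_min_index(messy))            # 5 2
```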

(24)

The Counting Approach

Say y = 5, i.e., ((z)1, (z)2, · · · , (z)7) = (Y,Y,Y,Y,N,N,N)

if g1(x, k) reports concordant predictions (Y,Y,N,N,N,N,N):
g1(x, k) made 2 binary classification errors
rg1(x) = 3 by counting: the absolute cost is 2
absolute cost = # of binary classification errors

if g2(x, k) reports non-concordant predictions (Y,N,Y,Y,N,N,Y):
g2(x, k) made 2 binary classification errors
rg2(x) = 5 by counting: the absolute cost is 0
absolute cost ≤ # of binary classification errors

If (z)k = desired answer and rg is computed by counting,

|y − rg(x)| ≤ Σ_{k=1}^{K−1} ⟦(z)k ≠ g(x, k)⟧.
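A quick numeric check of the bound above, for the slide's y = 5, K = 8 example (a sketch; the variable names are mine):

```python
# |y - rg(x)| <= # of binary classification errors, rg computed by counting
y, K = 5, 8
z = [k < y for k in range(1, K)]                       # desired answers (Y,Y,Y,Y,N,N,N)
g1 = [True, True, False, False, False, False, False]   # concordant predictions
g2 = [True, False, True, True, False, False, True]     # non-concordant predictions
for g in (g1, g2):
    errors = sum(zk != gk for zk, gk in zip(z, g))
    rg = 1 + sum(g)                                    # counting
    assert abs(y - rg) <= errors
    print(rg, errors, abs(y - rg))  # g1: 3 2 2, g2: 5 2 0
```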

(25)

Binary Classification Error vs. Ordinal Ranking Cost

Say y = 5, i.e., ((z)1, (z)2, · · · , (z)7) = (Y,Y,Y,Y,N,N,N)

if g1(x, k) reports concordant predictions (Y,Y,N,N,N,N,N):
g1(x, k) made 2 binary classification errors
rg1(x) = 3 by counting: the squared cost is 4

if g3(x, k) reports concordant predictions (Y,N,N,N,N,N,N):
g3(x, k) made 3 binary classification errors
rg3(x) = 2 by counting: the squared cost is 9

1 more error in binary classification =⇒ 5 more cost in ordinal ranking

(26)

Importance of Associated Binary Examples

(z)k:      Y Y Y Y N N N
g1(x, k):  Y Y N N N N N   → c[rg1(x)] = c[3] = 4
g3(x, k):  Y N N N N N N   → c[rg3(x)] = c[2] = 9
(w)k:      7 5 3 1 1 3 5

(w)k = |c[k + 1] − c[k]|: the importance of ((x, k), (z)k)

per-example cost bound (Li and Lin, 2007): for concordant predictions or strongly ordinal costs c,

c[rg(x)] ≤ Σ_{k=1}^{K−1} (w)k ⟦(z)k ≠ g(x, k)⟧
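The weights and the per-example cost bound can be checked numerically on the slide's squared-cost example (a sketch with my own helper names):

```python
def weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c is a 0-indexed list over ranks."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]

c = [(5 - k) ** 2 for k in range(1, 9)]   # squared cost, y = 5, K = 8
w = weights(c)
print(w)  # [7, 5, 3, 1, 1, 3, 5]

# bound check for g1 = (Y,Y,N,N,N,N,N) against z = (Y,Y,Y,Y,N,N,N):
z  = [True, True, True, True, False, False, False]
g1 = [True, True, False, False, False, False, False]
rhs = sum(wk for wk, zk, gk in zip(w, z, g1) if zk != gk)  # weighted error sum
rg = 1 + sum(g1)                                           # counting
print(c[rg - 1], rhs)  # 4 4: for concordant predictions the bound is tight
```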

(27)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework Key Ideas

Important Properties Theoretical Usefulness Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(28)

The Reduction Framework (1/2)

1 transform ordinal examples (xn, yn, cn) to weighted binary examples ((xn, k), (zn)k, (wn)k)

2 use your favorite algorithm on the weighted binary examples and get K−1 binary classifiers (i.e., one big joint binary classifier) g(x, k)

3 for each new input x, predict its rank using rg(x) = 1 + Σ_k ⟦g(x, k) = Y⟧

the reduction framework: systematic & easy to implement

[diagram: ordinal examples (xn, yn, cn) ⇒ weighted binary examples ((xn, k), (zn)k, (wn)k), k = 1, · · · , K−1 ⇒ core binary classification algorithm ⇒ associated binary classifiers g(x, k), k = 1, · · · , K−1 ⇒ ordinal ranker rg(x)]
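The three steps above fit in a short sketch. Everything below is illustrative, not the paper's code: 1-D inputs, a weighted decision stump standing in for "your favorite" core binary classification algorithm, and the counting rule for prediction.

```python
def to_binary(X, y, cost):
    """Step 1: expand ordinal examples into weighted binary examples.
    For each (x_n, y_n) and each k = 1..K-1:
      z = +1 if y_n > k else -1   ("is x better than rank k?")
      w = |c_n[k+1] - c_n[k]|     (importance weight)
    """
    K = max(y)
    data = {k: [] for k in range(1, K)}
    for xn, yn in zip(X, y):
        c = cost(yn, K)                       # 0-indexed cost vector over ranks
        for k in range(1, K):
            z = 1 if yn > k else -1
            w = abs(c[k] - c[k - 1])
            data[k].append((xn, z, w))
    return data

def stump(examples):
    """Toy weighted binary learner on 1-D inputs: pick the threshold t
    minimizing the weighted error of sign(x - t)."""
    xs = sorted(set(x for x, _, _ in examples))
    cands = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    def err(t):
        return sum(w for x, z, w in examples if (1 if x > t else -1) != z)
    t = min(cands, key=err)
    return lambda x: 1 if x > t else -1

def reduce_and_train(X, y, cost):
    """Steps 2-3: train K-1 classifiers, predict ranks by counting."""
    gs = {k: stump(ex) for k, ex in to_binary(X, y, cost).items()}
    return lambda x: 1 + sum(1 for k in gs if gs[k](x) == 1)

abs_cost = lambda yn, K: [abs(yn - k) for k in range(1, K + 1)]
X = [0.5, 1.5, 2.5, 3.5]
y = [1, 2, 3, 4]
r = reduce_and_train(X, y, abs_cost)
print([r(x) for x in X])  # [1, 2, 3, 4] on this separable toy set
```

Swapping `stump` for any stronger weighted binary learner, and `abs_cost` for any ordinal cost, leaves the framework unchanged.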

(29)

The Reduction Framework (2/2)

performance guarantee: accurate binary predictions =⇒ correct ranks

wide applicability: works with any ordinal c & any binary classification algorithm

simplicity: mild computation overheads with O(NK) binary examples

state-of-the-art: allows new improvements in binary classification to be immediately inherited by ordinal ranking


(30)

Theoretical Guarantees of Reduction (1/3)

error transformation theorem(Li and Lin, 2007)

Forconcordant predictions or strongly ordinal costs, if g makes test error ∆ in the induced binary problem, then rg pays test cost at most ∆ in ordinal ranking.

a one-step extension of the per-example cost bound conditions: general and minor

performance guarantee in the absolute sense

what if no “absolutely good” binary classifier?

1 absolutely good binary classifier =⇒ absolutely good ranker? YES!

(31)

Theoretical Guarantees of Reduction (2/3)

regret transformation theorem (Lin, 2008)

For concordant predictions or strongly ordinal costs, if g is ε-close to the optimal binary classifier g∗, then rg is ε-close to the optimal ranker r∗.

"reduction to binary" is sufficient for algorithm design, but is it necessary?

1 absolutely good binary classifier =⇒ absolutely good ranker? YES!

2 relatively good binary classifier =⇒ relatively good ranker? YES!

(32)

Theoretical Guarantees of Reduction (3/3)

equivalence theorem (Lin, 2008)

For a general family of ordinal costs, a good ordinal ranking algorithm exists if & only if a good binary classification algorithm exists for the corresponding learning model.

ordinal ranking is equivalent to binary classification

1 absolutely good binary classifier

=⇒ absolutely good ranker? YES!

2 relatively good binary classifier

=⇒ relatively good ranker? YES!

3 algorithm producing relatively good binary classifier ⇐⇒ algorithm producing relatively good ranker? YES!

(33)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(34)

Proving New Generalization Theorems

Ordinal Ranking (Li and Lin, 2007)

For linear rankers, with pr. > 1 − δ, expected test cost of r

≤ (β/N) Σ_{n=1}^{N} Σ_{k=1}^{K−1} ⟦ρ̄(r(xn), yn, k) ≤ Φ⟧   (goodness of fit)
+ O( poly( K, (log N)/N, 1/Φ, √(log(1/δ)) ) )   (confidence interval)

Binary Classification (Bartlett and Shawe-Taylor, 1998)

For linear classifiers, with pr. > 1 − δ, expected test err. of g

≤ (1/N) Σ_{n=1}^{N} ⟦ρ̄(g(xn), yn) ≤ Φ⟧   (goodness of fit)
+ O( poly( (log N)/N, 1/Φ, √(log(1/δ)) ) )   (confidence interval)

new ordinal ranking theorem = reduction + any cost + binary theorem + math derivation

(35)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(36)

Unifying Existing Algorithms

ordinal ranking = reduction + cost + binary classification

  ordinal ranking                          cost            binary classification algorithm
  PRank (Crammer and Singer, 2002)         absolute        modified perceptron rule
  kernel ranking (Rajaram et al., 2003)    classification  modified hard-margin SVM
  SVOR-EXP (Chu and Keerthi, 2005)         classification  modified soft-margin SVM
  SVOR-IMC (Chu and Keerthi, 2005)         absolute        modified soft-margin SVM
  ORBoost-LR (Lin and Li, 2006)            classification  modified AdaBoost
  ORBoost-All (Lin and Li, 2006)           absolute        modified AdaBoost

development and implementation time could have been saved

algorithmic structure revealed (SVOR, ORBoost)

variants of existing algorithms can be designed quickly by tweaking reduction

(37)

Designing New Algorithms Effortlessly

ordinal ranking = reduction + cost + binary classification

  ordinal ranking              cost      binary classification algorithm
  RED-SVM (Li and Lin, 2007)   absolute  standard soft-margin SVM
  RED-C4.5 (Li and Lin, 2007)  absolute  standard C4.5 decision tree

SVOR (modified SVM) vs. RED-SVM (standard SVM):

[bar chart: avg. training time (hours) of SVOR vs. RED-SVM on the ban, com, cal, cen data sets]

advantages of the core binary classification algorithm are inherited by the new ordinal ranking one

(38)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(39)

Reduction-C4.5 vs. SVOR

[bar chart: avg. test absolute cost of SVOR (Gauss) vs. RED-C4.5 on the pyr, mac, bos, aba, ban, com, cal, cen data sets]

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

even simple Reduction-C4.5 sometimes beats SVOR

(40)

Reduction-SVM vs. SVOR

[bar chart: avg. test absolute cost of SVOR (Gauss) vs. RED-SVM (Perc.) on the pyr, mac, bos, aba, ban, com, cal, cen data sets]

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm, extended from modified SVM

Reduction-SVM without modification: often better than SVOR and faster

(41)

Outline

1 Machine Learning Setup

2 Ordinal Ranking Setup

3 The Reduction Framework
   Key Ideas
   Important Properties
   Theoretical Usefulness
   Algorithmic Usefulness

4 Experimental Results

5 Conclusion

(42)

Conclusion

reduction framework: simple but useful

establishes equivalence to binary classification

unifies existing algorithms

simplifies design of new algorithms

facilitates derivation of new theoretical guarantees

superior experimental results: better performance and faster training time

reduction keeps ordinal ranking up-to-date with binary classification
