(1)

From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Learning Systems Group, California Institute of Technology

Talk at CS Department, National Tsing-Hua University, March 27, 2008

Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06) & discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap

(2)

Outline

1 Introduction to Machine Learning
2 The Ordinal Ranking Setup
3 Reduction from Ordinal Ranking to Binary Classification
    Algorithmic Usefulness of Reduction
    Theoretical Usefulness of Reduction
    Experimental Performance of Reduction
4 Conclusion

(3)

Apple, Orange, or Strawberry?

?

apple orange strawberry

how can a machine learn to classify?

(4)

Supervised Machine Learning

Parent ⇒ (picture, category) pairs ⇒ Kid’s brain (choosing among possibilities) ⇒ good decision function

Truth f(x) + noise e(x) ⇒ examples (picture x_n, category y_n) ⇒ learning algorithm (choosing from learning model {h_α(x)}) ⇒ good decision function h(x) ≈ f(x)

challenge:
see only {(x_n, y_n)} without knowing f(x) or e(x)
=⇒? generalize to unseen (x, y) w.r.t. f(x)

(5)

Machine Learning Research

What can the machines learn? (application)
  concrete: computer vision, architecture optimization, information retrieval, bio-informatics, computational finance, · · ·
  abstract setups: classification, regression, · · ·

How can the machines learn? (algorithm)
  faster
  better generalization

Why can the machines learn? (theory)
  paradigms: statistical learning, reinforcement learning, · · ·
  generalization guarantees

new opportunities keep coming from new applications/setups


(7)

Which Age-Group?

2 infant (1) child (2) teen (3) adult (4)

rank: a finite ordered set of labels Y = {1, 2, · · · , K }

(8)

Properties of Ordinal Ranking (1/2)

ranks represent order information

infant (1) < child (2) < teen (3) < adult (4)

general multiclass classification cannot properly use order information

(9)

Hot or Not?

http://www.hotornot.com

rank: natural representation of human preferences

(10)

Properties of Ordinal Ranking (2/2)

ranks do not carry numerical information
  rating 9 not 2.25 times “hotter” than rating 4

actual metric hidden
  infant (ages 1–3), child (ages 4–12), teen (ages 13–19), adult (ages 20–)

general metric regression deteriorates without correct numerical information

(11)

How Much Did You Like These Movies?

http://www.netflix.com

goal: use “movies you’ve rated” to automatically predict your preferences (ranks) on future movies

(12)

Ordinal Ranking Setup

Given
N examples (input x_n, rank y_n) ∈ X × Y
  age-group: X = encoding(human pictures), Y = {1, · · · , 4}
  hotornot: X = encoding(human pictures), Y = {1, · · · , 10}
  netflix: X = encoding(movies), Y = {1, · · · , 5}

Goal
an ordinal ranker (decision function) r(x) that “closely predicts” the ranks y associated with some unseen inputs x

ordinal ranking: a hot and important research problem

(13)

Ongoing Heat: Netflix Million Dollar Prize (since 10/2006)

Given
each user u (480,189 users) rates N_u (from tens to thousands) movies x, a total of $\sum_u N_u$ = 100,480,507 examples

Goal
personalized ordinal rankers r_u(x) evaluated on 2,817,131 “unseen” queries (u, x)

the first team being 10% better than original Netflix system gets a million USD

(14)

Cost of Wrong Prediction

ranks carry no numerical information: how to say “better”?

artificially quantify the cost of being wrong
  e.g. loss of customer loyalty when the system says one rating but you feel another

cost vector c of example (x, y, c):
  c[k] = cost when predicting (x, y) as rank k
  e.g. for (Sweet Home Alabama, y), a proper cost is c = (1, 0, 2, 10, 15)

closely predict: small test cost
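As a small illustration of "small test cost", the sketch below averages c[r(x)] over a few test examples. The helper name `average_test_cost` and the second cost vector are hypothetical; only the cost vector (1, 0, 2, 10, 15) comes from the slide.

```python
def average_test_cost(predicted_ranks, cost_vectors):
    """Average of c_n[r(x_n)] over test examples; ranks are 1-based."""
    total = sum(c[r - 1] for r, c in zip(predicted_ranks, cost_vectors))
    return total / len(predicted_ranks)


# Cost vector from the slide: predicting rank 2 is free, rank 5 costs 15.
sweet_home_alabama = (1, 0, 2, 10, 15)
# Hypothetical second example whose cost is zero at rank 4.
other_movie = (3, 2, 1, 0, 1)

costs = [sweet_home_alabama, other_movie]
print(average_test_cost([2, 4], costs))  # both predictions hit the zero-cost rank
print(average_test_cost([4, 4], costs))  # first prediction is two ranks too high
```

A ranker "closely predicts" in the sense of the slide when this average stays small on unseen examples.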

(15)

Ordinal Cost Vectors

For an ordinal example (x, y, c), the cost vector c should
  follow the rank y: c[y] = 0; c[k] ≥ 0
  respect the ordinal information: V-shaped (ordinal) or even convex (strongly ordinal)

[Figure: cost C_{y,k} over ranks 1: infant, 2: child, 3: teenager, 4: adult] V-shaped: pay more when predicting further away
[Figure: cost C_{y,k} over ranks 1: infant, 2: child, 3: teenager, 4: adult] convex: pay increasingly more when further away

cost                formula               type               example (y = 2, K = 5)
classification      c[k] = ⟦y ≠ k⟧        ordinal            (1, 0, 1, 1, 1)
absolute            c[k] = |y − k|        strongly ordinal   (1, 0, 1, 2, 3)
squared (Netflix)   c[k] = (y − k)²       strongly ordinal   (1, 0, 1, 4, 9)
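The three cost vectors and the two shape conditions can be checked with a short sketch. The function names are hypothetical, and the V-shaped and convex tests are written from the verbal descriptions on the slide (cost does not decrease when moving away from y; successive differences do not decrease).

```python
def classification_cost(y, K):
    return [int(k != y) for k in range(1, K + 1)]


def absolute_cost(y, K):
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y, K):
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c, y):
    """Ordinal: c[y] = 0 and the cost never decreases when moving away from y."""
    left = all(c[k - 1] >= c[k] for k in range(1, y))        # ranks below y
    right = all(c[k - 1] <= c[k] for k in range(y, len(c)))  # ranks from y upward
    return c[y - 1] == 0 and left and right


def is_convex(c):
    """Strongly ordinal: successive differences are non-decreasing."""
    diffs = [c[k + 1] - c[k] for k in range(len(c) - 1)]
    return all(d1 <= d2 for d1, d2 in zip(diffs, diffs[1:]))


y, K = 2, 5
for make in (classification_cost, absolute_cost, squared_cost):
    c = make(y, K)
    print(make.__name__, c, is_v_shaped(c, y), is_convex(c))
```

With y = 2 and K = 5 this reproduces the example vectors listed above; all three are V-shaped, and only the absolute and squared costs are convex.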

(16)

Our Contributions

a theoretical and algorithmic foundation of ordinal ranking, which ...
  provides a methodology for designing new ordinal ranking algorithms with any ordinal cost effortlessly
  takes many existing ordinal ranking algorithms as special cases
  introduces new theoretical guarantees on the generalization performance of ordinal rankers
  leads to superior experimental results

Figure: truth; traditional algorithm; our algorithm


(18)

Threshold Model

If we can first get an ideal score s(x) of a movie x, how can we construct the discrete r(x) from an analog s(x)?

[Figure: score function s(x) on a line cut by thresholds θ_1, θ_2, θ_3 into ordinal ranker outputs r(x) = 1, 2, 3, 4; markers show the target rank y of each example]

quantize s(x) by some ordered threshold θ

commonly used in previous work:
  threshold perceptrons (PRank, Crammer and Singer, 2002)
  threshold hyperplanes (SVOR, Chu and Keerthi, 2005)
  threshold ensembles (ORBoost, Lin and Li, 2006)

threshold model: r(x) = min {k : s(x) < θ_k}
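A minimal sketch of the threshold model follows. It assumes K − 1 ordered thresholds and treats θ_K as +∞ so that the minimum always exists; that convention, the function name, and the threshold values are assumptions made for illustration.

```python
from bisect import bisect_right


def threshold_rank(score, thetas):
    """r(x) = min{k : s(x) < theta_k}, with theta_K = +inf assumed.

    `thetas` holds the K-1 ordered thresholds theta_1 <= ... <= theta_{K-1}.
    """
    # bisect_right counts the thresholds that are <= score, so the next
    # threshold is the first one strictly above the score.
    return bisect_right(thetas, score) + 1


thetas = [-1.0, 0.0, 2.0]  # hypothetical thresholds for K = 4 ranks
for s in (-2.0, -1.0, 0.5, 5.0):
    print(s, threshold_rank(s, thetas))
```

A score exactly on a threshold falls into the higher rank, because the definition uses a strict inequality.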

(19)

Key of Reduction: Associated Binary Queries

getting the rank using a threshold model
  1 is s(x) > θ_1? Yes
  2 is s(x) > θ_2? No
  3 is s(x) > θ_3? No
  4 is s(x) > θ_4? No

generally, how do we query the rank of a movie x?
  1 is movie x better than rank 1? Yes
  2 is movie x better than rank 2? No
  3 is movie x better than rank 3? No
  4 is movie x better than rank 4? No

associated binary queries:
is movie x better than rank k?

(20)

Reduction from Ordinal Ranking to Binary Classification

More on Associated Binary Queries

say, the machine uses g(x, k) to answer the query “is movie x better than rank k?”
  e.g. threshold model g(x, k) = sign(s(x) − θ_k)

K − 1 binary classification problems w.r.t. each k

[Figure: for each k = 1, 2, 3, the desired answers (z)_k (Y/N) of the examples along the score axis s(x), split by g(x, k) at threshold θ_k; target ranks y and predictions r_g(x) range over 1 to 4]

let ((x, k), (z)_k) be binary examples
  (x, k): extended input w.r.t. k-th query
  (z)_k: desired binary answer Y/N

If g(x, k) = (z)_k for all k, we can compute r_g(x) from g(x, k) s.t. r_g(x) = y.

(21)

Computing Ranks from Associated Binary Queries

when g(x, k) answers “is movie x better than rank k?”

Consider g(x, 1), g(x, 2), · · · , g(x, K−1),
  consistent predictions: (Y, Y, N, N, N, N, N)
  extracting the rank:
    minimum index searching: $r_g(x) = \min\{k : g(x, k) = \mathrm{N}\}$
    counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$
  two approaches equivalent for consistent predictions
  noisy/inconsistent predictions? e.g. (Y, N, Y, Y, N, N, Y)

counting: simpler to analyze and robust to noise
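The two extraction rules can be compared on the slide's two answer vectors with the sketch below. Answers are encoded as booleans (True = Y). Returning K when every answer is Y is an assumed convention, since the minimum is otherwise undefined; the function names are hypothetical.

```python
def rank_by_min_index(answers):
    """r_g(x) = min{k : g(x, k) = N}; answers[k-1] holds g(x, k)."""
    for k, ans in enumerate(answers, start=1):
        if not ans:
            return k
    return len(answers) + 1  # assumed convention: all Y means the top rank K


def rank_by_counting(answers):
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(answers)


Y, N = True, False
consistent = [Y, Y, N, N, N, N, N]
noisy = [Y, N, Y, Y, N, N, Y]

print(rank_by_min_index(consistent), rank_by_counting(consistent))
print(rank_by_min_index(noisy), rank_by_counting(noisy))
```

On the consistent vector both rules agree. On the noisy one, minimum index searching stops at the first N, while counting uses every answer.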

(22)

The Counting Approach

Say y = 5, i.e., ((z)_1, (z)_2, · · · , (z)_7) = (Y, Y, Y, Y, N, N, N)

if g_1(x, k) reports (Y, Y, N, N, N, N, N)
  g_1(x, k) made 2 errors
  r_{g_1}(x) = 3; absolute cost = 2
  absolute cost = # of binary classification errors

if g_2(x, k) reports (Y, N, Y, Y, N, N, Y)
  g_2(x, k) made 2 errors
  r_{g_2}(x) = 5; absolute cost = 0
  absolute cost ≤ # of binary classification errors

If (z)_k = desired answer & r_g computed by counting,
$|y - r_g(x)| \le \sum_{k=1}^{K-1} [\![ (z)_k \ne g(x, k) ]\!]$.
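One way to see why the counting bound holds, sketched under the convention implied by the y = 5 example that (z)_k = Y exactly when k < y: the true rank and the predicted rank are both counts of Y answers, so they can differ by at most the number of positions where the answers disagree.

```latex
y = 1 + \sum_{k=1}^{K-1} [\![ (z)_k = \mathrm{Y} ]\!],
\qquad
r_g(x) = 1 + \sum_{k=1}^{K-1} [\![ g(x,k) = \mathrm{Y} ]\!]

\bigl| y - r_g(x) \bigr|
  = \Bigl| \sum_{k=1}^{K-1} \bigl( [\![ (z)_k = \mathrm{Y} ]\!] - [\![ g(x,k) = \mathrm{Y} ]\!] \bigr) \Bigr|
  \le \sum_{k=1}^{K-1} [\![ (z)_k \ne g(x,k) ]\!]
```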

(23)

Binary Classification Error v.s. Ordinal Ranking Cost

Say y = 5, i.e., ((z)_1, (z)_2, · · · , (z)_7) = (Y, Y, Y, Y, N, N, N)

if g_1(x, k) reports (Y, Y, N, N, N, N, N)
  g_1(x, k) made 2 errors
  r_{g_1}(x) = 3; squared cost = 4

if g_3(x, k) reports consistent predictions (Y, N, N, N, N, N, N)
  g_3(x, k) made 3 errors
  r_{g_3}(x) = 2; squared cost = 9

now 1 error can introduce up to 5 more in cost
how to take this into account?

(24)

Importance of Associated Binary Queries

k            1  2  3  4  5  6  7
(z)_k        Y  Y  Y  Y  N  N  N
g_1(x, k)    Y  Y  N  N  N  N  N    c[r_{g_1}(x)] = c[3] = 4
g_3(x, k)    Y  N  N  N  N  N  N    c[r_{g_3}(x)] = c[2] = 9
(w)_k        7  5  3  1  1  3  5

(w)_k ≡ |c[k + 1] − c[k]|: the importance of ((x, k), (z)_k)

per-example cost bound (Li and Lin, 2007; Lin, 2008):
for consistent predictions or strongly ordinal costs
$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k [\![ (z)_k \ne g(x, k) ]\!]$

accurate binary predictions =⇒ correct ranks
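The importance weights and the per-example cost bound can be reproduced for the y = 5 squared-cost example with the sketch below. The function names are hypothetical; K = 8 is inferred from the seven binary queries in the example.

```python
def importance_weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c[0] is the cost of rank 1."""
    return [abs(c[k + 1] - c[k]) for k in range(len(c) - 1)]


def weighted_errors(z, g, w):
    """Sum of (w)_k over the queries where g disagrees with the desired answer."""
    return sum(wk for zk, gk, wk in zip(z, g, w) if zk != gk)


Y, N = True, False
y, K = 5, 8
c = [(y - k) ** 2 for k in range(1, K + 1)]  # squared cost around y = 5
w = importance_weights(c)
z = [k < y for k in range(1, K)]             # desired answers (Y, Y, Y, Y, N, N, N)

g1 = [Y, Y, N, N, N, N, N]
g3 = [Y, N, N, N, N, N, N]

results = []
for g in (g1, g3):
    r = 1 + sum(g)                           # rank by counting
    results.append((c[r - 1], weighted_errors(z, g, w)))
print(w)
print(results)  # (cost paid, weighted binary error) for g1 and g3
```

For both classifiers the cost paid does not exceed the weighted error, as the bound states. In this example the two quantities coincide.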

(25)

The Reduction Framework (1/2)

1 transform ordinal examples (x_n, y_n, c_n) to weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k)
2 apply your favorite algorithm and get one big joint binary classifier g(x, k)
3 for each new input x, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$

the reduction framework: systematic & easy to implement

ordinal examples (x_n, y_n, c_n)
  ⇒ weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k), k = 1, · · · , K−1
  ⇒ core binary classification algorithm
  ⇒ associated binary classifiers g(x, k), k = 1, · · · , K−1
  ⇒ ordinal ranker r_g(x)
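Steps 1 and 3 of the framework can be sketched in a few lines. Step 2 is left to any binary learner; here a fixed threshold classifier on a scalar input stands in for it, purely as a hypothetical placeholder. Representing the extended input as the pair (x, k) is also an assumption about the encoding.

```python
def to_weighted_binary(x, y, c):
    """Step 1: turn one ordinal example (x, y, c) into K-1 weighted binary examples."""
    K = len(c)                      # c[0] is the cost of predicting rank 1
    examples = []
    for k in range(1, K):
        z = y > k                   # desired answer to "is x better than rank k?"
        w = abs(c[k] - c[k - 1])    # (w)_k = |c[k+1] - c[k]| in 1-based ranks
        examples.append(((x, k), z, w))
    return examples


def predict_rank(g, x, K):
    """Step 3: counting, r_g(x) = 1 + number of Y answers."""
    return 1 + sum(bool(g(x, k)) for k in range(1, K))


# Step 2 stand-in (hypothetical): a fixed threshold classifier, not a trained one.
thetas = [0.5, 1.5, 2.5]


def g(x, k):
    return x > thetas[k - 1]


binary_examples = to_weighted_binary(x=1.8, y=3, c=[2, 1, 0, 1])
for (x, k), z, w in binary_examples:
    print(k, z, w, g(x, k))
print(predict_rank(g, 1.8, K=4))
```

In practice the weighted binary examples from all N ordinal examples would be pooled and passed to a learner that accepts per-example weights.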

(26)

The Reduction Framework (2/2)

performance guarantee: accurate binary predictions =⇒ correct ranks
wide applicability: works with any ordinal c & any binary classification algorithm
simplicity: mild computation overheads with O(NK) binary examples
up-to-date: allows new improvements in binary classification to be immediately inherited by ordinal ranking

(27)

Theoretical Guarantees of Reduction (1/3)

is reduction a practical approach? YES!

error transformation theorem (Li and Lin, 2007)

For consistent predictions or strongly ordinal costs, if g makes test error ∆ in the induced binary problem, then r_g pays test cost at most ∆ in ordinal ranking.

a one-step extension of the per-example cost bound
conditions: general and minor
performance guarantee in the absolute sense:
  accuracy in binary classification =⇒ correctness in ordinal ranking

Is reduction really optimal?
what if the induced binary problem is “too hard”?
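The "one-step extension" behind the error transformation theorem can be sketched as taking the expectation of the per-example cost bound over test examples. This assumes, as a reading of the slide rather than a statement from the cited paper, that the binary test error ∆ is measured as the expected weighted number of wrong answers.

```latex
c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k [\![ (z)_k \ne g(x,k) ]\!]
\quad\Longrightarrow\quad
\mathbb{E}_{(x,y,c)}\, c[r_g(x)]
  \le \mathbb{E}_{(x,y,c)} \sum_{k=1}^{K-1} (w)_k [\![ (z)_k \ne g(x,k) ]\!]
  = \Delta
```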

(28)

Theoretical Guarantees of Reduction (2/3)

is reduction an optimal approach? YES!

regret transformation theorem (Lin, 2008)

For a general class of ordinal costs, if g is ε-close to the optimal binary classifier g∗, then r_g is ε-close to the optimal ordinal ranker r∗.

error guarantee in the relative setting:
  regardless of the absolute hardness of the induced binary problem, optimality in binary classification =⇒ optimality in ordinal ranking
reduction does not introduce additional hardness

“reduction to binary” sufficient, but necessary?
i.e., is reduction a principled approach?

(29)

Theoretical Guarantees of Reduction (3/3)

is reduction a principled approach? YES!

equivalence theorem (Lin, 2008)

For a general class of ordinal costs, ordinal ranking is learnable by a learning model if and only if binary classification is learnable by the associated learning model.

a surprising equivalence:
  ordinal ranking is as easy as binary classification

reduction to binary classification:
practical, optimal, and principled


(31)

Unifying Existing Algorithms

ordinal ranking = reduction + cost + binary classification

ordinal ranking                          cost             binary classification algorithm
PRank (Crammer and Singer, 2002)         absolute         modified perceptron rule
kernel ranking (Rajaram et al., 2003)    classification   modified hard-margin SVM
SVOR-EXP (Chu and Keerthi, 2005)         classification   modified soft-margin SVM
SVOR-IMC (Chu and Keerthi, 2005)         absolute         modified soft-margin SVM
ORBoost-LR (Lin and Li, 2006)            classification   modified AdaBoost
ORBoost-All (Lin and Li, 2006)           absolute         modified AdaBoost

correctness proof significantly simplified (PRank)
algorithmic structure revealed (SVOR, ORBoost)

variants of existing algorithms can be designed quickly by tweaking reduction

(32)

Designing New Algorithms Effortlessly

ordinal ranking = reduction + cost + binary classification

ordinal ranking    cost       binary classification algorithm
Reduction-C4.5     absolute   standard C4.5 decision tree
Reduction-SVM      absolute   standard soft-margin SVM

SVOR (modified SVM) v.s. Reduction-SVM (standard SVM):

[Figure: avg. training time (hour) of SVOR and RED-SVM on data sets ban, com, cal, cen]

advantages of core binary classification algorithm inherited in the new ordinal ranking one


(34)

Recall: Threshold Model

“bad” ordinal ranker: predictions close to thresholds
  small noise changes prediction

[Figure: examples on the s(x) axis lying close to thresholds θ_1, θ_2, θ_3; r(x) ranges over 1 to 4]

“good” ordinal ranker: clear separation using thresholds

[Figure: examples on the s(x) axis clearly separated by thresholds θ_1, θ_2, θ_3; r(x) ranges over 1 to 4]

next: good ordinal ranker =⇒ small expected test cost

(35)

Proving New Generalization Theorems

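Ordinal Ranking (Li and Lin, 2007)
For SVOR or Reduction-SVM, with probability > 1 − δ,

$$\text{expected test abs. cost of } r \;\le\; \underbrace{\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K-1} [\![ \bar{\rho}(r(x_n), y_n, k) \le \Phi ]\!]}_{\text{“goodness” in training}} \;+\; \underbrace{O\!\left(\mathrm{poly}\!\left(K, \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log\frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with more examples}}$$

Bi. Class. (Bartlett and Shawe-Taylor, 1998)
For SVM, with probability > 1 − δ,

$$\text{expected test err. of } g \;\le\; \underbrace{\frac{1}{N}\sum_{n=1}^{N} [\![ \bar{\rho}(g(x_n), y_n) \le \Phi ]\!]}_{\text{“goodness” in training}} \;+\; \underbrace{O\!\left(\mathrm{poly}\!\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log\frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with more examples}}$$

new ordinal ranking theorem = reduction + any cost + bin. thm. + math derivation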


(37)

Reduction-C4.5 v.s. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on data sets pyr, mac, bos, aba, ban, com, cal, cen]

C4.5: a (too) simple binary classifier (decision trees)
SVOR: state-of-the-art ordinal ranking algorithm

even simple Reduction-C4.5 sometimes beats SVOR

(38)

Reduction-SVM v.s. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on data sets pyr, mac, bos, aba, ban, com, cal, cen]

SVM: one of the most powerful binary classification algorithms
SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

Reduction-SVM without modification often better than SVOR and faster

