(1)

From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Department of Computer Science and Information Engineering, National Taiwan University

Talk at Microsoft Research Asia, February 18, 2009

Joint work with Dr. Ling Li at Caltech (ALT’06, NIPS’06)

(2)

The Ordinal Ranking Problem

Which Age-Group?

infant (1), child (2), teen (3), adult (4)

rank: a finite ordered set of labels $\mathcal{Y} = \{1, 2, \ldots, K\}$

(3)

Properties of Ordinal Ranking (1/2)

ranks represent order information

infant (1) < child (2) < teen (3) < adult (4)

general classification cannot properly use order information

(4)

How Much Did You Like These Movies?

http://www.netflix.com

rank: natural representation of human preferences

(5)

Properties of Ordinal Ranking (2/2)

ranks do not carry numerical information: rank 5 is not 2.5 times “better” than rank 2

actual metric may be hidden

infant (ages 1–3)

child (ages 4–12)

teen (ages 13–19)

adult (ages 20–)

general regression deteriorates without correct numerical information

(6)

Ordinal Ranking

Setup

input space $\mathcal{X}$; rank space $\mathcal{Y}$ (a finite ordered set)

age-group: $\mathcal{X}$ = encoding(human pictures), $\mathcal{Y} = \{1, \ldots, 4\}$

netflix: $\mathcal{X}$ = encoding(movies), $\mathcal{Y} = \{1, \ldots, 5\}$

Given

$N$ examples (input $x_n$, rank $y_n$) $\in \mathcal{X} \times \mathcal{Y}$

Goal

a ranker (decision function) $r(x)$ that closely predicts the ranks $y$ associated with some unseen inputs $x$

How to say closely predict?

(7)

Formalizing (Non-)Closeness: Cost

ranks carry no numerical information: how to say “close”?

artificially quantify the cost of being wrong

e.g. loss of customer loyalty when the system says one rating but you feel another

cost vector $c$ of example $(x, y, c)$:

$c[k]$ = cost when predicting $(x, y)$ as rank $k$

e.g. for (Sweet Home Alabama, rank 2), a proper cost is $c = (1, 0, 2, 10, 15)$

closely predict: small cost during testing

(8)

Ordinal Cost Vectors

For an ordinal example $(x, y, c)$, the cost vector $c$ should

be consistent with rank $y$: $c[y] = \min_k c[k]$ $(= 0)$

respect order information: V-shaped (ordinal) or even convex (strongly ordinal)

[Figure: cost $C_{y,k}$ over ranks 1: infant, 2: child, 3: teenager, 4: adult] V-shaped: pay more when predicting further away

[Figure: cost $C_{y,k}$ over ranks 1: infant, 2: child, 3: teenager, 4: adult] convex: pay increasingly more when further away

classification: $c[k] = [\![ y \neq k ]\!]$, ordinal, e.g. (1, 0, 1, 1, 1)

absolute: $c[k] = |y - k|$, strongly ordinal, e.g. (1, 0, 1, 2, 3)

squared: $c[k] = (y - k)^2$, strongly ordinal, e.g. (1, 0, 1, 4, 9)
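The three cost families and the two shape conditions can be checked numerically. The following is a minimal sketch, not code from the talk: the function names are hypothetical, ranks are 1-based as on the slides, and "V-shaped" and "convex" are taken to mean, respectively, non-increasing up to rank y then non-decreasing, and non-decreasing consecutive differences.

```python
def classification_cost(y, K):
    """c[k] = [[y != k]] for k = 1..K."""
    return [0 if k == y else 1 for k in range(1, K + 1)]


def absolute_cost(y, K):
    """c[k] = |y - k| for k = 1..K."""
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y, K):
    """c[k] = (y - k)^2 for k = 1..K."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c, y):
    """Ordinal: cost never rises before rank y and never falls after it."""
    i = y - 1  # 0-based position of the true rank
    left = all(c[j] >= c[j + 1] for j in range(i))
    right = all(c[j] <= c[j + 1] for j in range(i, len(c) - 1))
    return left and right


def is_convex(c):
    """Strongly ordinal: consecutive differences never decrease."""
    d = [c[j + 1] - c[j] for j in range(len(c) - 1)]
    return all(d[j] <= d[j + 1] for j in range(len(d) - 1))


# The slide's examples use y = 2 and K = 5.
for cost in (classification_cost, absolute_cost, squared_cost):
    c = cost(2, 5)
    print(cost.__name__, c, is_v_shaped(c, 2), is_convex(c))
```

Under these definitions the classification cost is V-shaped but not convex, while the absolute and squared costs are both.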

(9)

Our Contributions

a theoretical and algorithmic foundation of ordinal ranking, which reduces ordinal ranking to binary classification, and ...

provides a methodology for designing new ordinal ranking algorithms with any ordinal cost effortlessly

takes many existing ordinal ranking algorithms as special cases

introduces new theoretical guarantees on the generalization performance of ordinal rankers

leads to superior experimental results

“If I have seen further it is by standing on the shoulders of Giants” (I. Newton)

(10)

Reduction from Ordinal Ranking to Binary Classification Key Ideas

Threshold Ranker

if getting an ideal score $s(x)$ of a movie $x$, how to construct the discrete $r(x)$ from an analog $s(x)$?

[Figure: examples placed on a line by the score function $s(x)$ and cut by thresholds $\theta_1, \theta_2, \theta_3$ into ranks 1 to 4; the threshold ranker output $r(x)$ is shown against the target rank $y$]

quantize $s(x)$ by ordered (non-uniform) thresholds $\theta_k$

commonly used in previous work:

threshold perceptrons (PRank, Crammer and Singer, 2002)

threshold hyperplanes (SVOR, Chu and Keerthi, 2005)

threshold ensembles (ORBoost, Lin and Li, 2006)

threshold ranker: $r(x) = \min\{k : s(x) < \theta_k\}$
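The threshold ranker can be sketched in a few lines. This is an illustration only: the threshold values are made up, and the sketch assumes the convention that the last threshold $\theta_K$ is $+\infty$, so that a score above every listed threshold maps to rank K.

```python
from bisect import bisect_right


def threshold_rank(score, thresholds):
    """r(x) = min{k : s(x) < theta_k}, with theta_K treated as +infinity.

    `thresholds` holds the K-1 ordered values theta_1 <= ... <= theta_{K-1}.
    bisect_right counts the thresholds that are <= score, so adding 1 gives
    the first index k whose threshold lies strictly above the score.
    """
    return bisect_right(thresholds, score) + 1


# Hypothetical non-uniform thresholds for K = 4 ranks.
thetas = [-1.0, 0.5, 2.0]

for s in (-3.0, 0.0, 0.5, 5.0):
    print(s, threshold_rank(s, thetas))
```

A score equal to a threshold is not strictly below it, so it falls into the next rank; this follows directly from the strict inequality in the definition.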

(11)

Key Idea: Associated Binary Queries

getting the rank using a threshold ranker

1 is $s(x) > \theta_1$? Yes

2 is $s(x) > \theta_2$? No

3 is $s(x) > \theta_3$? No

4 is $s(x) > \theta_4$? No

generally, how do we query the rank of a movie $x$?

1 is movie $x$ better than rank 1? Yes

2 is movie $x$ better than rank 2? No

3 is movie $x$ better than rank 3? No

4 is movie $x$ better than rank 4? No

associated binary queries:

is movie $x$ better than rank $k$?

(12)

More on Associated Binary Queries

say, the machine uses $g(x, k)$ to answer the query

“is movie $x$ better than rank $k$?”

e.g. for threshold ranker: $g(x, k) = \operatorname{sign}(s(x) - \theta_k)$

[Figure: examples ordered by score $s(x)$ with target ranks $y$ and predicted ranks $r_g(x)$; for each threshold $\theta_k$, $k = 1, 2, 3$, a row of Y/N marks shows the desired answers $(z)_k$ alongside the classifier output $g(x, k)$]

associated binary examples:

$$\Bigl( \underbrace{(x, k)}_{k\text{-th associated binary query}},\ \underbrace{(z)_k}_{\text{desired answer}} \Bigr)$$



(13)

Computing Ranks from Associated Binary Queries

when $g(x, k)$ answers “is movie $x$ better than rank $k$?”

Consider $g(x, 1), g(x, 2), \ldots, g(x, K-1)$

consistent predictions: (Y,Y,N,N,N,N,N)

extracting the rank from consistent predictions:

minimum index searching: $r_g(x) = \min\{k : g(x, k) = \text{N}\}$

counting: $r_g(x) = 1 + \sum_{k} [\![ g(x, k) = \text{Y} ]\!]$

two approaches equivalent for consistent predictions

mistaken/inconsistent predictions? e.g. (Y,N,Y,Y,N,N,Y)

counting: simpler to analyze and robust to mistakes

are all associated examples of the same importance?
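The two extraction rules can be compared on the slide's own answer vectors. This is a small sketch with hypothetical function names; answers are written as strings of "Y" and "N", and minimum index searching is assumed to return K when every answer is Y.

```python
def rank_by_min_index(answers):
    """min{k : g(x, k) = N}; falls back to K when no answer is N."""
    for k, answer in enumerate(answers, start=1):
        if answer == "N":
            return k
    return len(answers) + 1


def rank_by_counting(answers):
    """1 + number of queries answered Y."""
    return 1 + sum(answer == "Y" for answer in answers)


consistent = "YYNNNNN"    # (Y,Y,N,N,N,N,N)
inconsistent = "YNYYNNY"  # (Y,N,Y,Y,N,N,Y)

for answers in (consistent, inconsistent):
    print(answers, rank_by_min_index(answers), rank_by_counting(answers))
```

On the consistent vector both rules give rank 3. On the inconsistent one, minimum index searching stops at the first N and gives rank 2, while counting uses all four Y answers and gives rank 5.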

(14)

Importance of Associated Binary Examples

given movie $x$ with rank $y = 2$, and $c[k] = (y - k)^2$

                              g1   g2   g3   g4
is x better than rank 1?      N    Y    Y    Y
is x better than rank 2?      N    N    Y    Y
is x better than rank 3?      N    N    N    Y
is x better than rank 4?      N    N    N    N
r_g(x)                        1    2    3    4
c[r_g(x)]                     1    0    1    4

3 more for answering query 3 wrong;

only 1 more for answering query 1 wrong

$(w)_k \equiv \bigl| c[k+1] - c[k] \bigr|$: the importance of $\bigl((x, k), (z)_k\bigr)$

per-example cost bound (Li and Lin, 2007):

for consistent predictions or strongly ordinal costs

$$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]$$
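The weights and the per-example cost bound can be checked by brute force on the slide's example (rank y = 2 with squared cost). This is a sketch under stated assumptions: the function names are hypothetical, the desired answer is taken to be Y exactly when y > k, and the rank is extracted by counting.

```python
from itertools import product


def associated_examples(y, c):
    """Return (k, z_k, w_k) for k = 1..K-1, where c[0] is the cost of rank 1."""
    K = len(c)
    out = []
    for k in range(1, K):
        z = "Y" if y > k else "N"  # desired answer to "is x better than rank k?"
        w = abs(c[k] - c[k - 1])   # |c[k+1] - c[k]| in the slide's 1-based indexing
        out.append((k, z, w))
    return out


def bound_holds(y, c, answers):
    """Check c[r_g(x)] <= sum of the weights of the wrongly answered queries."""
    rank = 1 + sum(a == "Y" for a in answers)
    penalty = sum(w for (_, z, w), a in zip(associated_examples(y, c), answers) if a != z)
    return c[rank - 1] <= penalty


c = [1, 0, 1, 4, 9]  # squared cost for y = 2, K = 5
print(associated_examples(2, c))
print(all(bound_holds(2, c, answers) for answers in product("YN", repeat=4)))
```

The weights come out as 1, 1, 3, 5, matching the slide's remark that a wrong answer to query 3 costs 3 more while a wrong answer to query 1 costs only 1 more. Because the squared cost is convex, the check passes for all 16 answer patterns, including inconsistent ones.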

(15)

Reduction from Ordinal Ranking to Binary Classification Important Properties

The Reduction Framework (1/2)

1 transform ordinal examples $(x_n, y_n, c_n)$ to weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$

2 use your favorite algorithm on the weighted binary examples and get $K-1$ binary classifiers (i.e., one big joint binary classifier) $g(x, k)$

3 for each new input $x$, predict its rank using $r_g(x) = 1 + \sum_{k} [\![ g(x, k) = \text{Y} ]\!]$

the reduction framework:

ordinal examples $(x_n, y_n, c_n)$ ⇒ weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$, $k = 1, \ldots, K-1$ ⇒ core binary classification algorithm ⇒ associated binary classifiers $g(x, k)$, $k = 1, \ldots, K-1$ ⇒ ordinal ranker $r_g(x)$
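The three steps of the framework can be sketched end to end on toy data. Everything concrete here is an assumption made for illustration: the inputs are scalars, the data set is invented, the cost is the absolute cost, and the "favorite algorithm" is replaced by a simple weighted threshold stump trained separately for each k. It is not the RED-SVM or RED-C4.5 implementation from the talk.

```python
def absolute_cost(y, K):
    """c[k] = |y - k| for k = 1..K."""
    return [abs(y - k) for k in range(1, K + 1)]


def to_weighted_binary(examples, K, cost):
    """Step 1: turn each (x, y) into K-1 weighted binary examples (x, k, z, w)."""
    out = []
    for x, y in examples:
        c = cost(y, K)
        for k in range(1, K):
            z = 1 if y > k else -1    # desired answer: is x better than rank k?
            w = abs(c[k] - c[k - 1])  # importance |c[k+1] - c[k]| (1-based)
            out.append((x, k, z, w))
    return out


def train_stumps(binary, K):
    """Step 2 (stand-in learner): per k, pick t minimizing the weighted error of 'x > t'."""
    thresholds = {}
    for k in range(1, K):
        rows = [(x, z, w) for x, kk, z, w in binary if kk == k]
        xs = sorted({x for x, _, _ in rows})
        candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

        def weighted_error(t):
            return sum(w for x, z, w in rows if (1 if x > t else -1) != z)

        thresholds[k] = min(candidates, key=weighted_error)
    return thresholds


def predict_rank(x, thresholds):
    """Step 3: rank = 1 + number of queries answered Y."""
    return 1 + sum(1 for t in thresholds.values() if x > t)


# Invented toy data: scalar inputs with ranks 1..4.
train = [(0.1, 1), (0.4, 1), (1.2, 2), (1.5, 2), (2.3, 3), (2.9, 3), (3.6, 4), (4.0, 4)]
K = 4

binary = to_weighted_binary(train, K, absolute_cost)
stumps = train_stumps(binary, K)
print(len(binary), stumps)
print([predict_rank(x, stumps) for x in (0.3, 1.0, 2.0, 3.5)])
```

The 8 ordinal examples become 8 × 3 = 24 weighted binary examples, in line with the O(NK) overhead. Swapping the stump learner for any other weighted binary classifier leaves steps 1 and 3 unchanged.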

(16)

The Reduction Framework (2/2)

performance guarantee: accurate binary predictions ⇒ correct ranks

wide applicability: works with any ordinal $c$ and any binary classification algorithm

simplicity: mild computation overheads with $O(NK)$ binary examples

state-of-the-art: allows new improvements in binary classification to be immediately inherited by ordinal ranking

(17)

Theoretical Guarantees of Reduction (1/3)

error transformation theorem (Li and Lin, 2007)

For consistent predictions or strongly ordinal costs, if $g$ makes test error $\Delta$ in the induced binary problem, then $r_g$ pays test cost at most $\Delta$ in ordinal ranking.

a one-step extension of the per-example cost bound

conditions: general and minor

performance guarantee in the absolute sense

what if no “absolutely good” binary classifier?

1 absolutely good binary classifier ⇒ absolutely good ranker? YES!

(18)

Theoretical Guarantees of Reduction (2/3)

regret transformation theorem (Lin, 2008)

For consistent predictions or strongly ordinal costs, if $g$ is $\epsilon$-close to the optimal binary classifier $g_*$, then $r_g$ is $\epsilon$-close to the optimal ranker $r_*$.

“reduction to binary” sufficient for algorithm design, but necessary?

1 absolutely good binary classifier ⇒ absolutely good ranker? YES!

2 relatively good binary classifier ⇒ relatively good ranker? YES!

(19)

Theoretical Guarantees of Reduction (3/3)

equivalence theorem (Lin, 2008)

For a general family of ordinal costs, a good ordinal ranking algorithm exists if and only if a good binary classification algorithm exists for the corresponding learning model.

ordinal ranking is equivalent to binary classification

1 absolutely good binary classifier ⇒ absolutely good ranker? YES!

2 relatively good binary classifier ⇒ relatively good ranker? YES!

3 algorithm producing relatively good binary classifier ⇔ algorithm producing relatively good ranker? YES!

(20)

Reduction from Ordinal Ranking to Binary Classification Algorithmic Usefulness

Unifying Existing Algorithms

ordinal ranking = reduction + cost + binary classification

ordinal ranking                            cost             binary classification algorithm
PRank (Crammer and Singer, 2002)           absolute         modified perceptron rule
kernel ranking (Rajaram et al., 2003)      classification   modified hard-margin SVM
SVOR-EXP (Chu and Keerthi, 2005)           classification   modified soft-margin SVM
SVOR-IMC (Chu and Keerthi, 2005)           absolute         modified soft-margin SVM
ORBoost-LR (Lin and Li, 2006)              classification   modified AdaBoost
ORBoost-All (Lin and Li, 2006)             absolute         modified AdaBoost

development and implementation time could have been saved

algorithmic structure revealed (SVOR, ORBoost)

variants of existing algorithms can be designed quickly by tweaking reduction

(21)

Designing New Algorithms Effortlessly

ordinal ranking = reduction + cost + binary classification

ordinal ranking                  cost       binary classification algorithm
RED-SVM (Li and Lin, 2007)       absolute   standard soft-margin SVM
RED-C4.5 (Li and Lin, 2007)      absolute   standard C4.5 decision tree

SVOR (modified SVM) vs. RED-SVM (standard SVM):

[Figure: bar chart of avg. training time (hour) for SVOR and RED-SVM on data sets ban, com, cal, cen]

advantages of core binary classification algorithm

(22)

Reduction from Ordinal Ranking to Binary Classification Theoretical Usefulness

Proving New Generalization Theorems

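Ordinal Ranking (Li and Lin, 2007)

For RED-SVM/SVOR, with pr. $> 1 - \delta$, expected test cost of $r$

$$\le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} [\![ \bar{\rho}(r(x_n), y_n, k) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\Bigl(\mathrm{poly}\Bigl(K, \frac{\log N}{N}, \frac{1}{\Phi}, \sqrt{\log \tfrac{1}{\delta}}\Bigr)\Bigr)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Binary Classification (Bartlett and Shawe-Taylor, 1998)

For SVM, with pr. $> 1 - \delta$, expected test err. of $g$

$$\le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(g(x_n), y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\Bigl(\mathrm{poly}\Bigl(\frac{\log N}{N}, \frac{1}{\Phi}, \sqrt{\log \tfrac{1}{\delta}}\Bigr)\Bigr)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

new ordinal ranking theorem

= reduction + any cost + bin. thm. + math derivation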

(23)

Experimental Results

Reduction-C4.5 vs. SVOR

[Figure: bar chart of avg. test absolute cost for SVOR (Gauss) and RED-C4.5 on data sets pyr, mac, bos, aba, ban, com, cal, cen]

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

even simple Reduction-C4.5 sometimes beats SVOR

(24)

Reduction-SVM vs. SVOR

[Figure: bar chart of avg. test absolute cost for SVOR (Gauss) and RED-SVM (Perc.) on data sets pyr, mac, bos, aba, ban, com, cal, cen]

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

Reduction-SVM without modification often better than SVOR and faster

(25)

Conclusion

reduction framework: simple but useful

establish equivalence to binary classification

unify existing algorithms

simplify design of new algorithms

facilitate derivation of new theoretical guarantees

superior experimental results: better performance and faster training time

reduction keeps ordinal ranking up-to-date with binary classification
