
# Automatic Ranking by Extended Binary Classification

(1)

## Automatic Ranking by Extended Binary Classification

Hsuan-Tien Lin, Learning Systems Group

Joint work with Ling Li (NIPS 2006)

EE Pizza Meeting, November 17, 2006

(2)

Introduction

## What is the Age-Group?

[Figure: example pictures, each labeled with its age-group rank]

rank: a finite ordered set of labels $\mathcal{Y} = \{1, 2, \cdots, K\}$

H.-T. Lin (Learning Systems Group) Automatic Ranking 2006/11/17 2 / 22

(3)

## Hot or Not?

http://www.hotornot.com

rank: natural representation of preferences in surveys

(4)

## How Much Did You Like These Movies?

http://www.netflix.com

Can machines use movies you’ve rated to closely predict your preferences (i.e., ranks) on future movies?


(5)

## How a Machine Learns Your Preferences

[Diagram 1] Mary → (movie, rank) pairs → brain of Bob → hypothesis; alternatives: prefer romance/action/etc.

[Diagram 2] You → examples (movie $x_n$, rank $y_n$) → learning algorithm → hypothesis $r(x)$; alternatives: learning model

machine learning: an automatic route of system design

(6)

## Poor Bob

Bob impresses Mary by memorizing every given (movie, rank) pair,

but he is too nervous about a new movie and guesses randomly

memorize ≠ generalize

perfect from Bob's view ≠ good for Mary

perfect during training ≠ good when testing

challenges:

algorithms and theories for doing well when testing


(7)

## Ranking Problem

input: $N$ examples $(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$, e.g.

hotornot: $\mathcal{X}$ = human pictures, $\mathcal{Y} = \{1, \cdots, 10\}$

netflix: $\mathcal{X}$ = movies, $\mathcal{Y} = \{1, \cdots, 5\}$

output: a ranking function $r(x)$ that ranks future unseen examples $(x, y)$ "correctly"

properties of the $K$ elements in $\mathcal{Y}$:

ordered: one rank < another

not carrying numerical information: one rank is not "2.5 times better than" another

1. instance representation? some meaningful vectors

2. correctly? cost of wrong prediction

(8)

## Cost of Wrong Prediction

cannot quantify the numerical meaning of ranks;

but can artificially quantify the cost of being wrong

infant (1), child (2), teen (3), adult (4)

small mistake – classify a child as a teenager;

big mistake – classify an infant as an adult

$C_{y,k}$: cost when rank $y$ is predicted as rank $k$

V-shaped $C_{y,k}$ with $C_{y,y} = 0$, e.g. absolute cost $C_{y,k} = |y - k|$:

$$C = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 1 & 0 & 1 & 2 \\ 2 & 1 & 0 & 1 \\ 3 & 2 & 1 & 0 \end{pmatrix}$$
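As a small illustration of the cost definitions on this slide, the sketch below builds the absolute and squared cost matrices and checks the V-shape property row by row. The function names `cost_matrix` and `is_v_shaped` are hypothetical helpers introduced here, not part of the talk, and "V-shaped" is interpreted as non-increasing before the true rank and non-decreasing after it.

```python
def cost_matrix(K, kind="absolute"):
    """Return C as nested lists; C[y-1][k-1] is the cost of predicting rank k for true rank y."""
    if kind == "absolute":
        cost = lambda y, k: abs(y - k)
    elif kind == "squared":
        cost = lambda y, k: (y - k) ** 2
    else:
        raise ValueError("unknown cost kind: " + kind)
    return [[cost(y, k) for k in range(1, K + 1)] for y in range(1, K + 1)]


def is_v_shaped(C):
    """Check that each row is 0 at the true rank, non-increasing before it, non-decreasing after it."""
    for y, row in enumerate(C):
        if row[y] != 0:
            return False
        falls = all(row[k] >= row[k + 1] for k in range(y))
        rises = all(row[k] <= row[k + 1] for k in range(y, len(row) - 1))
        if not (falls and rises):
            return False
    return True


# The four age groups on the slide: infant (1), child (2), teen (3), adult (4).
for row in cost_matrix(4, "absolute"):
    print(row)
```

Both the absolute and the squared cost pass the check; a matrix whose row dips again after rising does not.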


(9)

## Even More Challenging: Netflix Million Dollar Prize

input: $N_i$ examples from each user $i$, with 480,000+ users and $\sum_i N_i \approx 100{,}000{,}000$

output: personalized predictions $r(i, x)$ on 2,800,000+ testing queries $(i, x)$

cost: squared cost $C_{y,k} = (y - k)^2$

a huge joint ranking problem

The first team that gets 10% better than the existing Netflix system gets a million USD

(10)

## a new framework that ...

makes the design and implementation of ranking algorithms almost effortless

makes the proof of ranking theories much simpler

unifies many existing ranking algorithms and helps understand their pros and cons

shows that ranking is theoretically not much more complex than binary classification


(11)

## Key Idea: Reduction

(iPod)

(cassette player)

complex ranking problems

(reduction)

simpler binary problems with well-known results on models, algorithms, proofs, etc.

many new results immediately come up; many existing results unified

(12)

## Intuition: Associated Binary Questions

how do we query the rank of a movie $x$?

1. is movie $x$ better than rank 1? Yes

2. is movie $x$ better than rank 2? No

3. is movie $x$ better than rank 3? No

4. is movie $x$ better than rank 4? No

5. is movie $x$ better than rank 5? No

$g_b(x, k)$: is movie $x$ better than rank $k$?

consistent answers: $G(x) = (1, 1, 1, 0, \cdots, 0)$

extract the rank from consistent answers:

searching: compare to a "middle" rank each time

voting: $r(x) = 1 + \sum_k g_b(x, k)$

what if the answers are not consistent? e.g. $(1, 0, 1, 1, 0, 0, 1, 0)$ – voting is simple enough to analyze, and still works
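The two ways of extracting a rank from the binary answers can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the answers are given as a tuple of 0/1 values indexed by question, and the search variant is one possible reading of "compare to a middle rank each time".

```python
def rank_by_voting(answers):
    """r(x) = 1 + sum_k g_b(x, k); answers[k-1] is the 0/1 answer to question k."""
    return 1 + sum(answers)


def rank_by_search(answers):
    """Binary search for the first 'No'; only meaningful when the answers are consistent."""
    lo, hi = 0, len(answers)
    while lo < hi:
        mid = (lo + hi) // 2
        if answers[mid]:      # "better than rank mid+1": look at higher ranks
            lo = mid + 1
        else:                 # "not better than rank mid+1": look at lower ranks
            hi = mid
    return lo + 1


def is_consistent(answers):
    """Consistent answers are a run of 1s followed by a run of 0s."""
    return all(a >= b for a, b in zip(answers, answers[1:]))


slide_answers = (1, 0, 0, 0, 0)            # Yes, No, No, No, No
inconsistent = (1, 0, 1, 1, 0, 0, 1, 0)    # the inconsistent example from the slide

print(rank_by_voting(slide_answers), rank_by_search(slide_answers))
print(is_consistent(inconsistent), rank_by_voting(inconsistent))
```

On consistent answers both rules agree. On inconsistent answers the search result depends on which questions happen to be asked, while voting always uses every answer, which is part of why it is easier to analyze.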


(13)

## Reduction during Training

input: $N$ examples $(x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$

tool: your favorite binary classification algorithm

output: a binary classifier $g_b(x, k)$ that can answer the associated questions correctly

need to feed binary examples $(X_{n,k}, Y_{n,k})$ to the tool: $X_{n,k} \equiv (x_n, k)$, $Y_{n,k} \equiv [y_n > k]$

about $NK$ extended binary examples extracted from given input – bigger, but not troublesome

some approaches extract about $N^2$ binary examples using a different intuition – can be too big
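A minimal sketch of the extraction step, assuming the questions run over $k = 1, \ldots, K-1$ (the question for $k = K$ is always answered "No", so it is skipped here; that range is an assumption of this sketch). The function name `extend_examples` and the movie identifiers are hypothetical.

```python
def extend_examples(examples, K):
    """Turn each (x_n, y_n) into binary examples ((x_n, k), [y_n > k]) for k = 1..K-1."""
    extended = []
    for x, y in examples:
        for k in range(1, K):
            extended.append(((x, k), int(y > k)))
    return extended


examples = [("movie_a", 2), ("movie_b", 5)]   # hypothetical (movie, rank) pairs
extended = extend_examples(examples, K=5)

for X, Y in extended:
    print(X, Y)
```

With $N$ ranking examples this yields $N(K-1)$ binary examples, in line with the "about $NK$" count above.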

Are extended binary examples of the same importance?

(14)

## Importance of Extended Binary Examples

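for a given movie $x_n$ with rank $y_n = 2$, and $C_{y,k} = (y - k)^2$:

| question | answers (a) | answers (b) | answers (c) | answers (d) |
| --- | --- | --- | --- | --- |
| is $x_n$ better than rank 1? | No | Yes | Yes | Yes |
| is $x_n$ better than rank 2? | No | No | Yes | Yes |
| is $x_n$ better than rank 3? | No | No | No | Yes |
| is $x_n$ better than rank 4? | No | No | No | No |
| $r(x_n)$ | 1 | 2 | 3 | 4 |
| cost | 1 | 0 | 1 | 4 |

3 more for answering question 3 wrong (cost 1 → 4);

only 1 more for answering question 1 wrong (cost 0 → 1)

$W_{n,k} \equiv |C_{n,k+1} - C_{n,k}|$, where $C_{n,k}$ denotes $C_{y_n,k}$: the importance of $(X_{n,k}, Y_{n,k})$

most binary classification algorithms can handle $W_{n,k}$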
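To make the weights concrete, the sketch below computes $W_{n,k} = |C_{y_n,k+1} - C_{y_n,k}|$ for the squared cost and checks, for the consistent answer patterns of the example with $y_n = 2$, that the total weight of the wrongly answered questions equals the ranking cost. The helper names are hypothetical, and the questions are assumed to run over $k = 1, \ldots, K-1$.

```python
def squared_cost(y, k):
    return (y - k) ** 2


def weights(y, K, cost):
    """W_k = |C_{y,k+1} - C_{y,k}| for k = 1..K-1."""
    return [abs(cost(y, k + 1) - cost(y, k)) for k in range(1, K)]


def weighted_binary_error(y, answers, cost):
    """Sum of W_k over the questions k whose answer differs from the truth [y > k]."""
    K = len(answers) + 1
    W = weights(y, K, cost)
    return sum(w for k, (w, a) in enumerate(zip(W, answers), start=1)
               if a != int(y > k))


y = 2
print(weights(y, 5, squared_cost))
for answers in [(0, 0, 0, 0), (1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0)]:
    r = 1 + sum(answers)   # rank obtained by voting
    print(answers, r, squared_cost(y, r), weighted_binary_error(y, answers, squared_cost))
```

The last two printed columns agree for each consistent pattern, which is the sense in which a wrong answer to a heavily weighted question costs more.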

analogy to economics:


(15)

## The Reduction Framework for Ranking

1. transform ranking examples $(x_n, y_n)$ to extended binary examples $(X_{n,k}, Y_{n,k}, W_{n,k})$ based on $C_{y,k}$

2. use your favorite algorithm to learn from the extended binary examples, and get $g_b(x, k) \equiv g_b(X)$

3. for each new instance $x$, predict its rank using $r(x) = 1 + \sum_k g_b(x, k)$

error equivalence: accurate binary answers $\Longrightarrow$ correct ranks

simplicity: works with almost any $C_{y,k}$ and any algorithm

up-to-date: new improvements in binary classification immediately propagate to ranking

"If I have seen further it is by standing on ye shoulders of Giants" – I. Newton
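The three steps can be sketched end to end as below. This is a toy illustration under stated assumptions, not one of the algorithms from the talk: inputs are one-dimensional numbers, the questions run over $k = 1, \ldots, K-1$, and the "favorite algorithm" is replaced by a hypothetical per-question threshold learner that minimizes the weighted binary error. All function names are made up for this example.

```python
def extend(examples, K, cost):
    """Step 1: (x_n, y_n) -> (x_n, k, Y_nk, W_nk) for k = 1..K-1."""
    out = []
    for x, y in examples:
        for k in range(1, K):
            out.append((x, k, int(y > k), abs(cost(y, k + 1) - cost(y, k))))
    return out


def fit_thresholds(extended, K):
    """Step 2 (toy learner): for each k, pick theta_k so that g_b(x, k) = [x > theta_k]
    has the smallest weighted error on the extended examples for that k."""
    thetas = {}
    for k in range(1, K):
        rows = [(x, Y, W) for x, kk, Y, W in extended if kk == k]
        xs = sorted({x for x, _, _ in rows})
        # Candidate thresholds: below all points, between neighbors, above all points.
        candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

        def weighted_error(theta):
            return sum(W for x, Y, W in rows if int(x > theta) != Y)

        thetas[k] = min(candidates, key=weighted_error)
    return thetas


def predict_rank(x, thetas):
    """Step 3: r(x) = 1 + sum_k g_b(x, k)."""
    return 1 + sum(int(x > theta) for theta in thetas.values())


absolute_cost = lambda y, k: abs(y - k)
train = [(0.1, 1), (0.2, 1), (0.35, 2), (0.5, 2), (0.6, 3), (0.8, 4), (0.9, 4)]
K = 4

thetas = fit_thresholds(extend(train, K, absolute_cost), K)
print([predict_rank(x, thetas) for x, _ in train])
```

Swapping in a different weighted binary learner only changes step 2; steps 1 and 3 stay the same, which is the point of the reduction.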

(16)

## Unifying Existing Algorithms

ranking with perceptrons

– (PRank, Crammer and Singer, 2002)

several long proofs

⇒ a few lines extended from binary perceptron results

large-margin (high confidence) formulations

– (Rajaram et al., 2003), (SVORIM, Chu and Keerthi, 2005)

results explained more directly; algorithm structure revealed

variants of existing algorithms can be designed quickly by tweaking reduction


(17)

## Proposing New Algorithms

ranking using ensemble (consensus) of classifiers – (ORBoost, Lin and Li, 2006), OR-AdaBoost

ranking using decision trees – OR-C4.5

ranking with large-margin classifiers – OR-SVM

[Figure: avg. training time (hour) on the bank, computer, california, and census datasets; reduction vs. SVOR-IMC]

advantages of underlying binary algorithm inherited in the new ranking one

(18)

## Proving New Theorems

simpler cost bound for PRank

new guarantee of ranking performance using ensemble of classifiers (Lin and Li, 2006)

new guarantee of ranking performance using large-margin classifiers, e.g.,

$$\underbrace{E_{(x,y)}\, C_{y,r(x)}}_{\text{expected cost during testing}} \;\le\; \underbrace{\frac{1}{N} \sum_{n} \sum_{k} \big[ \rho(X_{n,k}, Y_{n,k}) \le \Delta \big]}_{\text{low confidence extended examples}} \;+\; \underbrace{K \cdot h_\delta(N, \Delta)}_{\text{deviation func. that decreases with more data or confidence}}$$


(19)

## Reduction-C4.5 vs. SVORIM

[Figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, and cen datasets; SVOR-Gauss vs. RED-C4.5]

C4.5: decision tree, an intuitive, but often too simple, binary classifier

SVORIM: state-of-the-art ranking algorithm

even reduction to simple C4.5 sometimes beats SVORIM

(20)

## Reduction-SVM vs. SVORIM

[Figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, and cen datasets; SVOR-Gauss vs. RED-SVM]

SVM: one of the most powerful binary classifiers

SVORIM: state-of-the-art ranking algorithm extended from a modified SVM

reducing to SVM without modification is often better than SVORIM


(21)

## Reduction-Boost vs. RankBoost

[Figure: test absolute error by dataset (py, ma, bo, ab, ba, co, ca, ce); RankBoost vs. ORBoost]

Boost: a popular ensemble algorithm

RankBoost: state-of-the-art ensemble ranking algorithm

our reduction to boosting approaches results in a significantly better ensemble ranking algorithm

(22)

## Conclusion

reduction framework: simple, intuitive, and useful for ranking

algorithmic reduction: unifying existing ranking algorithms; proposing new ranking algorithms

theoretic reduction: new guarantee on ranking performance

promising experimental results: some for better performance; some for faster training time

next level: the Netflix challenge? handling huge datasets; finding useful representations (features); using collaborative information from other users

Thank you. Questions?
