## From Ordinal Ranking to Binary Classification

Hsuan-Tien Lin

Department of Computer Science and Information Engineering National Taiwan University

Talk at NTUST March 16, 2009

Joint work with Dr. Ling Li at Caltech (ALT’06, NIPS’06)

## Outline

1. **Machine Learning Setup**
2. **Ordinal Ranking Setup**
3. **The Reduction Framework**
   - **Key Ideas**
   - **Important Properties**
   - **Algorithmic Usefulness**
   - **Theoretical Usefulness**
4. **Experimental Results**
5. **Conclusion**

## Which Digit Did You Write?

one (1), two (2), three (3), or four (4)?

**How can machines learn to classify?**

## Supervised Machine Learning from Examples

human learning: parent ⟹ (picture, label) pairs ⟹ kid's brain (choosing among possibilities) ⟹ good decision function

machine learning: truth $f(x)$ + noise $e(x)$ ⟹ examples (picture $x_n$, label $y_n$) ⟹ learning algorithm (choosing from the learning model $\{g_\alpha(x)\}$) ⟹ good decision function $g(x) \approx f(x)$

challenge:

see only $\{(x_n, y_n)\}$ without knowing $f(x)$ or $e(x)$

**⟹? generalize to unseen $(x, y)$ w.r.t. $f(x)$**

## Some Classical Machine Learning Problems

classification: discrete $y_n$

{one, two, three, four}

{apple, orange, banana}

{yes, no}: **binary classification**

regression: numerical $y_n$ ($\in \mathbb{R}$)

stock prices

students’ scores


**new types of machine learning problems**
**keep coming from new applications**

## Which Age-Group?

infant (1), child (2), teen (3), adult (4)

**rank: a finite ordered set of labels** Y = {1, 2, · · · , K}

## Properties of Ordinal Ranking (1/2)

ranks represent **order information**

infant (1) < child (2) < teen (3) < adult (4)

**general classification cannot properly use order information**

## Hot or Not?

http://www.hotornot.com

**rank: natural representation of human preferences**

## Properties of Ordinal Ranking (2/2)

ranks do **not carry numerical information**

rating 9 is not 2.25 times “hotter” than rating 4

actual metric hidden:

infant (ages 1–3)

child (ages 4–12)

teen (ages 13–19)

adult (ages 20–)

**general regression deteriorates without correct numerical information**

## How Much Did You Like These Movies?

http://www.netflix.com

**goal: use “movies you’ve rated” to automatically**
**predict your preferences (ranks) on future movies**

## Ordinal Ranking Setup

Given

N examples (input $x_n$, rank $y_n$) ∈ X × Y

age-group: X = encoding(human pictures), Y = {1, · · · , 4}

hotornot: X = encoding(human pictures), Y = {1, · · · , 10}

netflix: X = encoding(movies), Y = {1, · · · , 5}

Goal

an ordinal ranker (decision function) r(x) that “closely predicts” the ranks y associated with some **unseen inputs x**

**ordinal ranking: a hot and important research problem**

## Importance of Ordinal Ranking

relatively new for machine learning

connecting classification and regression

matching human preferences: many applications in social science, information retrieval, psychology, and recommendation systems

**Ongoing Heat: Netflix Million Dollar Prize**

## Ongoing Heat: Netflix Million Dollar Prize

(since 10/2006)

Given

each user u (480,189 users) rates $N_u$ (from tens to thousands) movies x, for a total of $\sum_u N_u = 100{,}480{,}507$ examples

Goal

personalized ordinal rankers $r_u(x)$ evaluated on 2,817,131 “unseen” queries (u, x)

**the first team to be 10% better than the original Netflix system gets a million USD**

## Formalizing (Non-)Closeness: Cost

ranks carry no numerical information: how to say “close”?

artificially quantify the **cost of being wrong**

e.g. loss of customer loyalty when the system says one rating but you feel another

cost vector **c** of example (x, y, **c**):

**c**[k] = cost when predicting (x, y) as rank k

e.g. for (Sweet Home Alabama, [rating]), a proper cost is **c** = (1, 0, 2, 10, 15)

**closely predict: small cost during testing**

## Ordinal Cost Vectors

For an ordinal example (x, y, **c**), the cost vector **c** should

be consistent with rank y: $\mathbf{c}[y] = \min_k \mathbf{c}[k]$ (= 0)

respect order information: V-shaped (ordinal) or even convex (strongly ordinal)

[Figure: V-shaped cost $C_{y,k}$ over the ranks 1: infant, 2: child, 3: teenager, 4: adult; pay more when predicting further away]

[Figure: convex cost $C_{y,k}$ over the same ranks; pay **increasingly** more when further away]

| cost | definition | type | example **c** (y = 2, K = 5) |
| --- | --- | --- | --- |
| classification | $\mathbf{c}[k] = [\![ y \neq k ]\!]$ | ordinal | (1, 0, 1, 1, 1) |
| absolute | $\mathbf{c}[k] = \lvert y - k \rvert$ | strongly ordinal | (1, 0, 1, 2, 3) |
| squared | $\mathbf{c}[k] = (y - k)^2$ | strongly ordinal | (1, 0, 1, 4, 9) |
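The three cost vectors above, and the V-shaped and convex conditions, can be checked with a few lines of code. This is only a sketch: the function names are hypothetical, cost vectors are stored as 0-based lists holding **c**[1..K], and "convex" is read here as non-decreasing successive differences.

```python
from typing import List, Sequence


def classification_cost(y: int, K: int) -> List[int]:
    """c[k] = [y != k] for k = 1..K."""
    return [int(k != y) for k in range(1, K + 1)]


def absolute_cost(y: int, K: int) -> List[int]:
    """c[k] = |y - k| for k = 1..K."""
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y: int, K: int) -> List[int]:
    """c[k] = (y - k)^2 for k = 1..K."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c: Sequence[float], y: int) -> bool:
    """Non-increasing up to rank y, non-decreasing afterwards."""
    left = all(c[i] >= c[i + 1] for i in range(0, y - 1))
    right = all(c[i] <= c[i + 1] for i in range(y - 1, len(c) - 1))
    return left and right


def is_convex(c: Sequence[float]) -> bool:
    """Successive differences c[k+1] - c[k] never decrease."""
    diffs = [c[i + 1] - c[i] for i in range(len(c) - 1)]
    return all(diffs[i] <= diffs[i + 1] for i in range(len(diffs) - 1))
```

With y = 2 and K = 5, all three vectors come out V-shaped, while only the absolute and squared ones pass the convexity check, matching the "ordinal" versus "strongly ordinal" labels in the table.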

## Our Contributions

a theoretical and algorithmic foundation of ordinal ranking, which reduces ordinal ranking to binary classification, and ...

provides a methodology for designing new ordinal ranking algorithms with **any ordinal cost effortlessly**

takes many existing ordinal ranking algorithms as **special cases**

introduces **new theoretical guarantees on the generalization performance** of ordinal rankers

leads to **superior experimental results**

Figure: truth; traditional algorithm; our algorithm

## Central Idea: Reduction

complex ordinal ranking problems (iPod)

⟹ reduction (adapter) ⟹

simpler binary classification problems with well-known results on models, algorithms, and theories (cassette player)

**If I have seen further it is by standing on the shoulders of Giants. (I. Newton)**

## Threshold Ranker

if getting an ideal score s(x) of a movie x, how to construct the discrete r(x) from an analog s(x)?

[Figure: examples on the score axis $s(x)$, split by the thresholds $\theta_1, \theta_2, \theta_3$ into ranks 1 to 4 of the threshold ranker $r(x)$, shown against the target ranks y = 1, 2, 3, 4]

quantize s(x) by **ordered (non-uniform) thresholds** $\theta_k$

commonly used in previous work:

threshold perceptrons (PRank, Crammer and Singer, 2002)

threshold hyperplanes (SVOR, Chu and Keerthi, 2005)

threshold ensembles (ORBoost, Lin and Li, 2006)

**threshold ranker: $r(x) = \min\{k : s(x) < \theta_k\}$**
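A minimal sketch of the threshold ranker follows. The slide lists K − 1 thresholds; the sketch assumes an implicit last threshold of +∞ so that the minimum always exists, and the function name `threshold_rank` is hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def threshold_rank(score: float, thresholds: Sequence[float]) -> int:
    """Return r(x) = min{k : s(x) < theta_k} for ordered thresholds.

    `thresholds` holds theta_1 <= ... <= theta_{K-1}; theta_K = +inf is
    assumed, so scores above every threshold receive rank K.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ordered")
    # bisect_right counts the thresholds that are <= score; the first
    # threshold strictly above the score sits right after them.
    return bisect_right(thresholds, score) + 1
```

For example, with thresholds (-1.0, 0.5, 2.0), a score of 0.0 falls between the first two thresholds and receives rank 2.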

## Key Idea: Associated Binary Queries

getting the rank using a threshold ranker

1. is $s(x) > \theta_1$? Yes
2. is $s(x) > \theta_2$? No
3. is $s(x) > \theta_3$? No
4. is $s(x) > \theta_4$? No

generally, how do we query the rank of a movie x?

1. is movie x better than rank 1? Yes
2. is movie x better than rank 2? No
3. is movie x better than rank 3? No
4. is movie x better than rank 4? No

**associated binary queries: is movie x better than rank k?**


## More on Associated Binary Queries

say, the machine uses g(x, k) to answer the query

“is movie x better than rank k?”

e.g. for threshold ranker: $g(x, k) = \operatorname{sign}(s(x) - \theta_k)$

[Figure: examples on the score axis $s(x)$ with predicted ranks $r_g(x)$ and target ranks y; one row per query k = 1, 2, 3 showing the threshold $\theta_k$ of $g(x, k)$ and the desired answers $(z)_k$ (N or Y) of each example]

associated binary examples:

$$\Bigl(\underbrace{(x, k)}_{k\text{-th associated binary query}},\ \underbrace{(z)_k}_{\text{desired answer}}\Bigr)$$
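The associated binary examples can be generated mechanically from an ordinal example. In this sketch the desired answer to "is x better than rank k?" is taken to be Y exactly when y > k, which matches the y = 5 example used later in the talk; the function names are hypothetical and Y/N are encoded as `True`/`False`.

```python
from typing import Any, List, Sequence, Tuple


def associated_binary_examples(x: Any, y: int, K: int) -> List[Tuple[Tuple[Any, int], bool]]:
    """Return ((x, k), (z)_k) for k = 1..K-1, where (z)_k is True iff y > k."""
    return [((x, k), y > k) for k in range(1, K)]


def threshold_g(score: float, thresholds: Sequence[float], k: int) -> bool:
    """g(x, k) for a threshold ranker: True (Y) iff s(x) > theta_k."""
    return score > thresholds[k - 1]
```

For a movie with rank y = 5 out of K = 8, the desired answers are (Y, Y, Y, Y, N, N, N).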

## Computing Ranks from Associated Binary Queries

when g(x, k) answers “is movie x better than rank k?”

Consider $\bigl(g(x, 1), g(x, 2), \cdots, g(x, K-1)\bigr)$

consistent predictions: (Y,Y,N,N,N,N,N)

extracting the rank from consistent predictions:

minimum index searching: $r_g(x) = \min\{k : g(x, k) = \text{N}\}$

counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

two approaches equivalent for consistent predictions

mistaken/inconsistent predictions? e.g. (Y,N,Y,Y,N,N,Y)

**counting: simpler to analyze and robust to mistakes**
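The two extraction rules can be compared directly on the prediction vectors above. This is a small sketch with hypothetical function names; predictions are lists of booleans (`True` for Y), and returning K when no N appears is an assumption, since the slide does not cover that case.

```python
from typing import Sequence


def rank_by_min_index(preds: Sequence[bool]) -> int:
    """r_g(x) = min{k : g(x, k) = N}; returns K if every answer is Y (assumed)."""
    for k, answer in enumerate(preds, start=1):
        if not answer:
            return k
    return len(preds) + 1


def rank_by_counting(preds: Sequence[bool]) -> int:
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(1 for answer in preds if answer)


Y, N = True, False
consistent = [Y, Y, N, N, N, N, N]
inconsistent = [Y, N, Y, Y, N, N, Y]
```

Both rules give rank 3 on the consistent vector. On the inconsistent one, minimum index searching stops at the first N and returns 2, while counting returns 5.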

## The Counting Approach

Say y = 5, i.e., $\bigl((z)_1, (z)_2, \cdots, (z)_7\bigr)$ = (Y,Y,Y,Y,N,N,N)

if $g_1(x, k)$ reports consistent predictions (Y,Y,N,N,N,N,N)

$g_1(x, k)$ made 2 binary classification errors

$r_{g_1}(x) = 3$ by counting: the absolute cost is 2

absolute cost = # of binary classification errors

if $g_2(x, k)$ reports inconsistent predictions (Y,N,Y,Y,N,N,Y)

$g_2(x, k)$ made 2 binary classification errors

$r_{g_2}(x) = 5$ by counting: the absolute cost is 0

absolute cost ≤ # of binary classification errors

**If $(z)_k$ = desired answer & $r_g$ computed by counting,**

$$\lvert y - r_g(x) \rvert \le \sum_{k=1}^{K-1} [\![ (z)_k \neq g(x, k) ]\!].$$
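The bound can be checked by brute force for a small K. The sketch below enumerates every possible prediction vector for every rank y; the helper names are hypothetical, and desired answers are taken to be `True` (Y) exactly when y > k.

```python
from itertools import product
from typing import List, Sequence


def desired_answers(y: int, K: int) -> List[bool]:
    """(z)_k for k = 1..K-1: True iff y > k."""
    return [y > k for k in range(1, K)]


def rank_by_counting(preds: Sequence[bool]) -> int:
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(1 for answer in preds if answer)


def num_errors(preds: Sequence[bool], y: int) -> int:
    """Number of queries k where g(x, k) differs from (z)_k."""
    z = desired_answers(y, len(preds) + 1)
    return sum(1 for p, t in zip(preds, z) if p != t)


def counting_bound_holds(K: int) -> bool:
    """Check |y - r_g(x)| <= #errors for all y and all 2^(K-1) prediction vectors."""
    for y in range(1, K + 1):
        for preds in product([True, False], repeat=K - 1):
            if abs(y - rank_by_counting(preds)) > num_errors(preds, y):
                return False
    return True
```

The intuition is that the predicted rank counts predicted Y answers and the true rank counts desired Y answers, so the two counts cannot differ by more than the number of disagreements.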

## Binary Classification Error vs. Ordinal Ranking Cost

Say y = 5, i.e., $\bigl((z)_1, (z)_2, \cdots, (z)_7\bigr)$ = (Y,Y,Y,Y,N,N,N)

if $g_1(x, k)$ reports consistent predictions (Y,Y,N,N,N,N,N)

$g_1(x, k)$ made 2 binary classification errors

$r_{g_1}(x) = 3$ by counting: the **squared cost is 4**

if $g_3(x, k)$ reports consistent predictions (Y,N,N,N,N,N,N)

$g_3(x, k)$ made 3 binary classification errors

$r_{g_3}(x) = 2$ by counting: the **squared cost is 9**

1 more error in binary classification ⟹ 5 more cost in ordinal ranking

## Importance of Associated Binary Examples

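| $k$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | cost |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $(z)_k$ | Y | Y | Y | Y | N | N | N | |
| $g_1(x, k)$ | Y | Y | N | N | N | N | N | $\mathbf{c}[r_{g_1}(x)] = \mathbf{c}[3] = 4$ |
| $g_3(x, k)$ | Y | N | N | N | N | N | N | $\mathbf{c}[r_{g_3}(x)] = \mathbf{c}[2] = 9$ |
| $(w)_k$ | 7 | 5 | 3 | 1 | 1 | 3 | 5 | |

$(w)_k \equiv \lvert \mathbf{c}[k+1] - \mathbf{c}[k] \rvert$: the importance of $\bigl((x, k), (z)_k\bigr)$

per-example cost bound (Li and Lin, 2007):

for **consistent predictions or strongly ordinal costs**

$$\mathbf{c}[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]$$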
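The weights and the per-example cost bound can be reproduced for the squared-cost example with y = 5 and K = 8. This is a sketch with hypothetical function names; the weights are computed as absolute differences of adjacent costs, which is the reading that yields the row (7, 5, 3, 1, 1, 3, 5).

```python
from typing import List, Sequence


def squared_cost(y: int, K: int) -> List[int]:
    """c[k] = (y - k)^2 for k = 1..K, stored 0-based."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def weights(c: Sequence[float]) -> List[float]:
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]


def rank_by_counting(preds: Sequence[bool]) -> int:
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(1 for answer in preds if answer)


def weighted_binary_error(preds: Sequence[bool], y: int, c: Sequence[float]) -> float:
    """Sum of (w)_k over the queries k that g answers wrongly."""
    w = weights(c)
    return sum(w[k - 1] for k, p in enumerate(preds, start=1) if p != (y > k))


def ordinal_cost(preds: Sequence[bool], c: Sequence[float]) -> float:
    """c[r_g(x)] with the rank obtained by counting."""
    return c[rank_by_counting(preds) - 1]
```

For the two consistent predictors in the table the bound is tight: g_1 pays cost 4 against a weighted error of 3 + 1, and g_3 pays cost 9 against 5 + 3 + 1.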

## The Reduction Framework (1/2)

1. transform ordinal examples $(x_n, y_n, \mathbf{c}_n)$ to weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$
2. use your favorite algorithm on the weighted binary examples and get K−1 binary classifiers (i.e., one big joint binary classifier) $g(x, k)$
3. for each new input x, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \text{Y} ]\!]$

**the reduction framework: systematic & easy to implement**

ordinal examples $(x_n, y_n, \mathbf{c}_n)$ ⟹ weighted binary examples $\bigl((x_n, k), (z_n)_k, (w_n)_k\bigr)$, $k = 1, \cdots, K-1$ ⟹ core binary classification algorithm ⟹ associated binary classifiers $g(x, k)$, $k = 1, \cdots, K-1$ ⟹ ordinal ranker $r_g(x)$
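The three steps can be sketched end to end on a toy one-dimensional problem. Everything here is illustrative: the learner `fit_stumps` is a hypothetical stand-in for "your favorite algorithm" (it picks one cut per query k by weighted error), the data are made up, and the absolute cost is assumed.

```python
from typing import Callable, List, Sequence, Tuple

BinaryExample = Tuple[Tuple[float, int], bool, float]  # ((x, k), z, w)


def absolute_cost(y: int, K: int) -> List[float]:
    """Cost vector c[1..K] (stored 0-based) for the absolute cost |y - k|."""
    return [float(abs(y - k)) for k in range(1, K + 1)]


def to_weighted_binary(x: float, y: int, c: Sequence[float]) -> List[BinaryExample]:
    """Step 1: one weighted binary example per query k = 1..K-1."""
    return [((x, k), y > k, abs(c[k] - c[k - 1])) for k in range(1, len(c))]


def fit_stumps(examples: Sequence[BinaryExample], K: int) -> Callable[[float, int], bool]:
    """Step 2 (stand-in learner): for each k, choose the cut t_k that
    minimizes the weighted error of the rule 'x > t_k'."""
    xs = sorted({x for (x, _), _, _ in examples})
    candidates = [xs[0] - 1.0] + xs
    cuts = {}
    for k in range(1, K):
        subset = [(x, z, w) for (x, kk), z, w in examples if kk == k]

        def weighted_error(t: float) -> float:
            return sum(w for x, z, w in subset if (x > t) != z)

        cuts[k] = min(candidates, key=weighted_error)
    return lambda x, k: x > cuts[k]


def rank_by_counting(g: Callable[[float, int], bool], x: float, K: int) -> int:
    """Step 3: r_g(x) = 1 + number of queries answered Y."""
    return 1 + sum(1 for k in range(1, K) if g(x, k))


K = 4
train = [(0.1, 1), (0.2, 1), (1.1, 2), (1.3, 2), (2.2, 3), (2.4, 3), (3.5, 4)]
binary = [ex for x, y in train for ex in to_weighted_binary(x, y, absolute_cost(y, K))]
g = fit_stumps(binary, K)
```

The N ordinal examples expand into N(K − 1) weighted binary examples, and any binary learner that accepts example weights could replace `fit_stumps` without touching steps 1 and 3.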

## The Reduction Framework (2/2)

**performance guarantee:** accurate binary predictions ⟹ correct ranks

**wide applicability:** works with any ordinal **c** & any binary classification algorithm

**simplicity:** mild computation overheads with O(NK) binary examples

**state-of-the-art:** allows new improvements in binary classification to be immediately inherited by ordinal ranking


## Theoretical Guarantees of Reduction (1/3)

error transformation theorem (Li and Lin, 2007)

For **consistent predictions or strongly ordinal costs**, if g makes test error ∆ in the induced binary problem, then $r_g$ pays test cost at most ∆ in ordinal ranking.

a one-step extension of the per-example cost bound

conditions: general and minor

performance guarantee in the absolute sense

**what if no “absolutely good” binary classifier?**

1. absolutely good binary classifier ⟹ absolutely good ranker? **YES!**
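The "one-step extension" can be sketched as taking expectations on both sides of the per-example cost bound. The reading of ∆ below, as the expected weighted binary error summed over the K − 1 queries, is an assumption; the slides do not state how ∆ is normalized.

```latex
\mathbb{E}_{(x, y, \mathbf{c})}\, \mathbf{c}[r_g(x)]
\;\le\;
\mathbb{E}_{(x, y, \mathbf{c})} \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]
\;=\; \Delta
```

The inequality holds example by example under the theorem's conditions, so it survives averaging over the test distribution.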

## Theoretical Guarantees of Reduction (2/3)

regret transformation theorem (Lin, 2008)

For **consistent predictions or strongly ordinal costs**, if g is ε-close to the optimal binary classifier $g^*$, then $r_g$ is ε-close to the optimal ranker $r^*$.

“reduction to binary” sufficient for algorithm design, **but necessary?**

1. absolutely good binary classifier ⟹ absolutely good ranker? **YES!**
2. relatively good binary classifier ⟹ relatively good ranker? **YES!**

## Theoretical Guarantees of Reduction (3/3)

equivalence theorem (Lin, 2008)

For a general family of **ordinal costs**, a good ordinal ranking algorithm exists **if & only if a good binary classification algorithm exists** for the corresponding learning model.

ordinal ranking is **equivalent to binary classification**

1. absolutely good binary classifier ⟹ absolutely good ranker? **YES!**
2. relatively good binary classifier ⟹ relatively good ranker? **YES!**
3. algorithm producing relatively good binary classifier ⟺ algorithm producing relatively good ranker? **YES!**

## Unifying Existing Algorithms

ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| PRank (Crammer and Singer, 2002) | absolute | modified perceptron rule |
| kernel ranking (Rajaram et al., 2003) | classification | modified hard-margin SVM |
| SVOR-EXP (Chu and Keerthi, 2005) | classification | modified soft-margin SVM |
| SVOR-IMC (Chu and Keerthi, 2005) | absolute | modified soft-margin SVM |
| ORBoost-LR (Lin and Li, 2006) | classification | modified AdaBoost |
| ORBoost-All (Lin and Li, 2006) | absolute | modified AdaBoost |

development and implementation time could have been saved

algorithmic structure revealed (SVOR, ORBoost)

**variants of existing algorithms can be designed quickly by tweaking reduction**

## Designing New Algorithms Effortlessly

ordinal ranking = reduction + cost + binary classification

| ordinal ranking | cost | binary classification algorithm |
| --- | --- | --- |
| RED-SVM (Li and Lin, 2007) | absolute | standard soft-margin SVM |
| RED-C4.5 (Li and Lin, 2007) | absolute | standard C4.5 decision tree |

SVOR (modified SVM) vs. RED-SVM (standard SVM):

[Figure: avg. training time (hour) of SVOR and RED-SVM on the data sets ban, com, cal, cen]

**advantages of core binary classification algorithm inherited in the new ordinal ranking one**

## Proving New Generalization Theorems

Ordinal Ranking (Li and Lin, 2007)

For RED-SVM/SVOR, with pr. > 1 − δ, expected test cost of r

$$\le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} [\![ \bar{\rho}(r(x_n), y_n, k) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\mathrm{poly}\!\left(K, \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Bi. Cl. (Bartlett and Shawe-Taylor, 1998)

For SVM, with pr. > 1 − δ, expected test err. of g

$$\le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(g(x_n), y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left(\mathrm{poly}\!\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

**new ordinal ranking theorem = reduction + any cost + bin. thm. + math derivation**

## Reduction-C4.5 vs. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

C4.5: a (too) simple binary classifier (decision trees)

SVOR: state-of-the-art ordinal ranking algorithm

**even simple Reduction-C4.5 sometimes beats SVOR**

## Reduction-SVM vs. SVOR

[Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on the data sets pyr, mac, bos, aba, ban, com, cal, cen]

SVM: one of the most powerful binary classification algorithms

SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM

**Reduction-SVM without modification often better than SVOR and faster**

## Conclusion

reduction framework: simple but useful

**establish equivalence to binary classification**

**unify existing algorithms**

**simplify design of new algorithms**

**facilitate derivation of new theoretical guarantees**

**superior experimental results:** better performance and faster training time

**reduction keeps ordinal ranking up-to-date with binary classification**