From Ordinal Ranking to Binary Classification
Hsuan-Tien Lin
Department of Computer Science and Information Engineering National Taiwan University
Talk at NTUST March 16, 2009
Joint work with Dr. Ling Li at Caltech (ALT’06, NIPS’06)
Outline
1 Machine Learning Setup
2 Ordinal Ranking Setup
3 The Reduction Framework
  Key Ideas
  Important Properties
  Algorithmic Usefulness
  Theoretical Usefulness
4 Experimental Results
5 Conclusion
Which Digit Did You Write?
one (1) two (2) three (3) four (4)
How can machines learn to classify?
Supervised Machine Learning from Examples
analogy: parent ⇒ (picture, label) pairs ⇒ kid's brain ⇒ good decision function
  (the brain picks one decision function among many possibilities)
formal setup: truth f(x) + noise e(x) ⇒ examples (picture xn, label yn) ⇒ learning algorithm ⇒ good decision function g(x) ≈ f(x)
  (the learning algorithm picks g from the learning model {gα(x)})
challenge:
see only {(xn, yn)} without knowing f(x) or e(x)
=?⇒ generalize to unseen (x, y) w.r.t. f(x)
Some Classical Machine Learning Problems
classification: discrete yn
  {one, two, three, four}
  {apple, orange, banana}
  {yes, no}: binary classification
regression: numerical yn (∈ R)
  stock prices
  students’ scores
new types of machine learning problems keep coming from new applications
Which Age-Group?
infant(1) child(2) teen(3) adult(4)
rank: a finite ordered set of labels Y = {1, 2, …, K}
Properties of Ordinal Ranking (1/2)
ranks represent order information
infant (1) < child (2) < teen (3) < adult (4)
general classification cannot properly use order information
Hot or Not?
http://www.hotornot.com
rank: natural representation of human preferences
Properties of Ordinal Ranking (2/2)
ranks do not carry numerical information
  rating 9 is not 2.25 times “hotter” than rating 4
actual metric hidden
  infant (ages 1–3)
  child (ages 4–12)
  teen (ages 13–19)
  adult (ages 20–)
general regression deteriorates without correct numerical information
How Much Did You Like These Movies?
http://www.netflix.com
goal: use “movies you’ve rated” to automatically predict your preferences (ranks) on future movies
Ordinal Ranking Setup
Given
N examples (input xn, rank yn) ∈ X × Y
  age-group: X = encoding(human pictures), Y = {1, …, 4}
  hotornot: X = encoding(human pictures), Y = {1, …, 10}
  netflix: X = encoding(movies), Y = {1, …, 5}
Goal
an ordinal ranker (decision function) r(x) that “closely predicts”
the ranks y associated with some unseen inputs x
ordinal ranking: a hot and important research problem
Importance of Ordinal Ranking
relatively new for machine learning
connecting classification and regression
matching human preferences—many applications in social science, information retrieval, psychology, and recommendation systems
Ongoing Heat: Netflix Million Dollar Prize
(since 10/2006)
Given
each user u (480,189 users) rates Nu (from tens to thousands) movies x, for a total of Σu Nu = 100,480,507 examples
Goal
personalized ordinal rankers ru(x) evaluated on 2,817,131 “unseen” queries (u, x)
the first team being 10% better than the original Netflix system gets a million USD
Formalizing (Non-)Closeness: Cost
ranks carry no numerical information: how to say “close”?
artificially quantify the cost of being wrong
  e.g. loss of customer loyalty when the system says [rating] but you feel [rating]
cost vector c of example (x, y, c):
c[k] = cost when predicting (x, y) as rank k
  e.g. for (Sweet Home Alabama, [rating]), a proper cost is c = (1, 0, 2, 10, 15)
closely predict: small cost during testing
Ordinal Cost Vectors
For an ordinal example (x, y, c), the cost vector c should
  be consistent with rank y: c[y] = min_k c[k] (= 0)
  respect order information: V-shaped (ordinal) or even convex (strongly ordinal)
Figure: cost C(y, k) plotted over the ranks 1: infant, 2: child, 3: teenager, 4: adult
  V-shaped: pay more when predicting further away
  convex: pay increasingly more when further away
cost            c[k]        type              example
classification  ⟦y ≠ k⟧     ordinal           (1, 0, 1, 1, 1)
absolute        |y − k|     strongly ordinal  (1, 0, 1, 2, 3)
squared         (y − k)^2   strongly ordinal  (1, 0, 1, 4, 9)
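A minimal sketch of the three example costs and the two shape conditions, assuming "V-shaped" means non-increasing up to rank y and non-decreasing after it, and "convex" means that the successive differences c[k+1] − c[k] never decrease. The function names are illustrative and do not come from the talk.

```python
def classification_cost(y, K):
    """c[k] = [[y != k]] for k = 1..K, returned as a 0-indexed list."""
    return [0 if k == y else 1 for k in range(1, K + 1)]

def absolute_cost(y, K):
    """c[k] = |y - k|."""
    return [abs(y - k) for k in range(1, K + 1)]

def squared_cost(y, K):
    """c[k] = (y - k)^2."""
    return [(y - k) ** 2 for k in range(1, K + 1)]

def is_v_shaped(c, y):
    """Assumed definition: non-increasing before rank y, non-decreasing after."""
    i = y - 1  # 0-indexed position of rank y
    left = all(c[j] >= c[j + 1] for j in range(i))
    right = all(c[j] <= c[j + 1] for j in range(i, len(c) - 1))
    return left and right

def is_convex(c):
    """Assumed definition: differences c[k+1] - c[k] are non-decreasing in k."""
    d = [c[j + 1] - c[j] for j in range(len(c) - 1)]
    return all(d[j] <= d[j + 1] for j in range(len(d) - 1))

for name, cost in [("classification", classification_cost),
                   ("absolute", absolute_cost),
                   ("squared", squared_cost)]:
    c = cost(2, 5)
    print(name, c, "V-shaped:", is_v_shaped(c, 2), "convex:", is_convex(c))
```

Under these assumed definitions the classification cost is V-shaped but not convex, while the absolute and squared costs are both.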
Our Contributions
a theoretical and algorithmic foundation of ordinal ranking, which reduces ordinal ranking to binary classification, and ...
  provides a methodology for designing new ordinal ranking algorithms with any ordinal cost effortlessly
  takes many existing ordinal ranking algorithms as special cases
  introduces new theoretical guarantees on the generalization performance of ordinal rankers
  leads to superior experimental results
Figure: truth; traditional algorithm; our algorithm
Central Idea: Reduction
(iPod) complex ordinal ranking problems
(adapter) reduction
(cassette player) simpler binary classification problems with well-known results on models, algorithms, and theories
If I have seen further it is by
standing on the shoulders of Giants—I. Newton
Threshold Ranker
given an ideal score s(x) of a movie x, how to construct the discrete r(x) from an analog s(x)?
Figure: the score function s(x) is cut by thresholds θ1, θ2, θ3 into ranks 1, 2, 3, 4 of the threshold ranker r(x), shown against the target rank y
quantize s(x) by ordered (non-uniform) thresholds θk
commonly used in previous work:
  threshold perceptrons (PRank, Crammer and Singer, 2002)
  threshold hyperplanes (SVOR, Chu and Keerthi, 2005)
  threshold ensembles (ORBoost, Lin and Li, 2006)
threshold ranker: r(x) = min{k : s(x) < θk}
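A minimal sketch of the threshold ranker r(x) = min{k : s(x) < θk}. It assumes K − 1 finite, ordered thresholds and treats θK as +∞, so a score above every threshold gets rank K. The function name and the sample thresholds are illustrative.

```python
def threshold_rank(score, thresholds):
    """Return min{k : score < theta_k}, with theta_K taken as +infinity.

    `thresholds` holds theta_1 <= ... <= theta_{K-1}; ranks are 1..K.
    """
    for k, theta in enumerate(thresholds, start=1):
        if score < theta:
            return k
    # No finite threshold exceeds the score, so the rank is K.
    return len(thresholds) + 1

thetas = [-1.0, 0.5, 2.0]  # hypothetical thresholds for K = 4
for s in (-3.0, 0.0, 1.0, 5.0):
    print(s, "->", threshold_rank(s, thetas))
```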
Key Idea: Associated Binary Queries
getting the rank using a threshold ranker
1 is s(x) > θ1? Yes
2 is s(x) > θ2? No
3 is s(x) > θ3? No
4 is s(x) > θ4? No
generally, how do we query the rank of a movie x?
1 is movie x better than rank 1? Yes
2 is movie x better than rank 2? No
3 is movie x better than rank 3? No
4 is movie x better than rank 4? No
associated binary queries:
is movie x better than rank k?
More on Associated Binary Queries
say, the machine uses g(x, k) to answer the query
“is movie x better than rank k?”
e.g. for threshold ranker: g(x, k) = sign(s(x) − θk)
Figure: along the score axis s(x), each example carries its desired answers (z)1, (z)2, (z)3 (Y or N), shown next to the classifiers g(x, 1), g(x, 2), g(x, 3) placed at θ1, θ2, θ3
associated binary examples: ((x, k), (z)k)
  (x, k): k-th associated binary query
  (z)k: desired answer
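A small sketch of the two sides of an associated binary example, assuming the desired answer (z)k is "Y" exactly when the true rank y is better than k, and that sign(s(x) − θk) is read as "Y" when s(x) > θk. Booleans stand in for Y/N; the names are illustrative.

```python
def desired_answers(y, K):
    """(z)_k for k = 1..K-1: True ("Y") when rank y is better than rank k."""
    return [y > k for k in range(1, K)]

def threshold_answers(score, thresholds):
    """g(x, k) = sign(s(x) - theta_k), reported as True ("Y") when s(x) > theta_k."""
    return [score > theta for theta in thresholds]

def show(answers):
    """Render booleans in the slides' Y/N notation."""
    return "(" + ",".join("Y" if a else "N" for a in answers) + ")"

print(show(desired_answers(5, 8)))                     # desired answers for y = 5, K = 8
print(show(threshold_answers(0.0, [-1.0, 0.5, 2.0])))  # answers of a threshold ranker
```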
Computing Ranks from Associated Binary Queries
when g(x, k) answers “is movie x better than rank k?”
consider (g(x, 1), g(x, 2), …, g(x, K−1))
consistent predictions: (Y,Y,N,N,N,N,N)
extracting the rank from consistent predictions:
  minimum index searching: rg(x) = min{k : g(x, k) = N}
  counting: $r_g(x) = 1 + \sum_{k=1}^{K-1} \llbracket g(x, k) = \mathrm{Y} \rrbracket$
two approaches equivalent for consistent predictions
mistaken/inconsistent predictions? e.g. (Y,N,Y,Y,N,N,Y)
counting: simpler to analyze and robust to mistakes
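A minimal sketch comparing the two rank-extraction rules on the two prediction patterns from this slide. It assumes minimum index searching returns K when every answer is Y; the helper names are illustrative.

```python
def parse(pattern):
    """Turn a string such as "YYNNNNN" into a list of booleans (Y -> True)."""
    return [ch == "Y" for ch in pattern]

def rank_by_min_index(answers):
    """min{k : g(x, k) = N}; assumed to be K when all answers are Y."""
    for k, a in enumerate(answers, start=1):
        if not a:
            return k
    return len(answers) + 1

def rank_by_counting(answers):
    """1 + number of Y answers."""
    return 1 + sum(1 for a in answers if a)

for pattern in ("YYNNNNN", "YNYYNNY"):
    g = parse(pattern)
    print(pattern, "min-index:", rank_by_min_index(g), "counting:", rank_by_counting(g))
```

The two rules agree on the consistent pattern and disagree on the inconsistent one, where a single early N decides the minimum-index answer.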
The Counting Approach
Say y = 5, i.e., ((z)1, (z)2, …, (z)7) = (Y,Y,Y,Y,N,N,N)
if g1(x, k) reports consistent predictions (Y,Y,N,N,N,N,N)
  g1(x, k) made 2 binary classification errors
  rg1(x) = 3 by counting: the absolute cost is 2
  absolute cost = # of binary classification errors
if g2(x, k) reports inconsistent predictions (Y,N,Y,Y,N,N,Y)
  g2(x, k) made 2 binary classification errors
  rg2(x) = 5 by counting: the absolute cost is 0
  absolute cost ≤ # of binary classification errors
If (z)k = desired answer & rg computed by counting,
$|y - r_g(x)| \le \sum_{k=1}^{K-1} \llbracket (z)_k \ne g(x, k) \rrbracket$.
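The bound on the absolute cost can be checked exhaustively for a small K. The sketch below assumes (z)k is "Y" exactly when y > k and enumerates every possible answer pattern, consistent or not; the function name is illustrative.

```python
from itertools import product

def absolute_cost_bound_holds(K):
    """Check |y - r_g(x)| <= number of binary errors for all y and all answer patterns."""
    for y in range(1, K + 1):
        z = [y > k for k in range(1, K)]              # desired answers
        for g in product([True, False], repeat=K - 1):
            r = 1 + sum(g)                            # rank by counting
            errors = sum(zk != gk for zk, gk in zip(z, g))
            if abs(y - r) > errors:
                return False
    return True

print(absolute_cost_bound_holds(8))
```

The check reflects a counting argument: r − 1 and y − 1 are the numbers of Y answers in g and in z, and two Y/N strings cannot differ in their Y counts by more than the number of positions where they disagree.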
Binary Classification Error v.s. Ordinal Ranking Cost
Say y = 5, i.e., ((z)1, (z)2, …, (z)7) = (Y,Y,Y,Y,N,N,N)
if g1(x, k) reports consistent predictions (Y,Y,N,N,N,N,N)
  g1(x, k) made 2 binary classification errors
  rg1(x) = 3 by counting: the squared cost is 4
if g3(x, k) reports consistent predictions (Y,N,N,N,N,N,N)
  g3(x, k) made 3 binary classification errors
  rg3(x) = 2 by counting: the squared cost is 9
1 more error in binary classification
=⇒ 5 more cost in ordinal ranking
Importance of Associated Binary Examples
k          1  2  3  4  5  6  7
(z)k       Y  Y  Y  Y  N  N  N
g1(x, k)   Y  Y  N  N  N  N  N    c[rg1(x)] = c[3] = 4
g3(x, k)   Y  N  N  N  N  N  N    c[rg3(x)] = c[2] = 9
(w)k       7  5  3  1  1  3  5
(w)k ≡ |c[k + 1] − c[k]|: the importance of ((x, k), (z)k)
per-example cost bound (Li and Lin, 2007):
for consistent predictions or strongly ordinal costs,
$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \llbracket (z)_k \ne g(x, k) \rrbracket$
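A short sketch that recomputes the weights and both sides of the per-example cost bound for the squared cost with y = 5 and K = 8. The pattern g2 = (Y,N,Y,Y,N,N,Y) is the inconsistent example from the counting slide; helper names are illustrative.

```python
def squared_cost(y, K):
    """c[k] = (y - k)^2 for k = 1..K, as a 0-indexed list."""
    return [(y - k) ** 2 for k in range(1, K + 1)]

def weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1 (0-indexed input list)."""
    return [abs(c[k + 1] - c[k]) for k in range(len(c) - 1)]

def parse(pattern):
    return [ch == "Y" for ch in pattern]

def rank_by_counting(g):
    return 1 + sum(g)

def weighted_errors(z, g, w):
    """Right-hand side of the bound: total weight of the wrongly answered queries."""
    return sum(wk for zk, gk, wk in zip(z, g, w) if zk != gk)

y, K = 5, 8
c = squared_cost(y, K)
w = weights(c)
z = [y > k for k in range(1, K)]
for name, pattern in [("g1", "YYNNNNN"), ("g3", "YNNNNNN"), ("g2", "YNYYNNY")]:
    g = parse(pattern)
    r = rank_by_counting(g)
    print(name, "rank", r, "cost", c[r - 1], "bound", weighted_errors(z, g, w))
```

For the two consistent patterns the bound is tight (cost equals weighted error); for the inconsistent g2 the cost is 0 while the weighted error is 10.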
The Reduction Framework (1/2)
1 transform ordinal examples (xn, yn, cn) to weighted binary examples ((xn, k), (zn)k, (wn)k)
2 use your favorite algorithm on the weighted binary examples and get K−1 binary classifiers (i.e., one big joint binary classifier) g(x, k)
3 for each new input x, predict its rank using $r_g(x) = 1 + \sum_{k=1}^{K-1} \llbracket g(x, k) = \mathrm{Y} \rrbracket$
the reduction framework: systematic & easy to implement
ordinal examples (xn, yn, cn)
⇒ weighted binary examples ((xn, k), (zn)k, (wn)k), k = 1, …, K−1
⇒ core binary classification algorithm
⇒ associated binary classifiers g(x, k), k = 1, …, K−1
⇒ ordinal ranker rg(x)
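The three steps can be sketched end to end as follows. This is an illustration under stated assumptions, not the authors' implementation: inputs are scalars, the "favorite algorithm" is replaced by a hypothetical brute-force weighted threshold learner (one threshold per k), and the toy data and all function names are made up for the example.

```python
def absolute_cost(y, K):
    return [abs(y - k) for k in range(1, K + 1)]

def to_weighted_binary(examples):
    """Step 1: (x, y, c) -> ((x, k), z, w) for k = 1..K-1."""
    out = []
    for x, y, c in examples:
        for k in range(1, len(c)):
            z = y > k                    # desired answer to "is x better than rank k?"
            w = abs(c[k] - c[k - 1])     # |c[k+1] - c[k]| in the slides' 1-indexed notation
            out.append(((x, k), z, w))
    return out

def train_stumps(binary_examples, K):
    """Step 2 (stand-in learner): pick t_k minimizing the weighted error of g(x, k) = [x > t_k]."""
    xs = sorted({x for (x, _), _, _ in binary_examples})
    candidates = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    stumps = {}
    for k in range(1, K):
        rows = [(x, z, w) for (x, kk), z, w in binary_examples if kk == k]

        def weighted_error(t):
            return sum(w for x, z, w in rows if (x > t) != z)

        stumps[k] = min(candidates, key=weighted_error)
    return stumps

def predict_rank(stumps, x):
    """Step 3: r_g(x) = 1 + number of k with g(x, k) = Y."""
    return 1 + sum(1 for t in stumps.values() if x > t)

# Hypothetical toy data: ages as inputs, age-group ranks 1..4, absolute cost.
K = 4
data = [(2, 1), (3, 1), (8, 2), (10, 2), (15, 3), (17, 3), (30, 4), (45, 4)]
examples = [(x, y, absolute_cost(y, K)) for x, y in data]
binary_examples = to_weighted_binary(examples)
stumps = train_stumps(binary_examples, K)
print([predict_rank(stumps, age) for age in (1, 7, 14, 50)])
```

Any weighted binary learner could replace `train_stumps`; only steps 1 and 3 belong to the reduction itself.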
The Reduction Framework (2/2)
performance guarantee:
  accurate binary predictions =⇒ correct ranks
wide applicability:
  works with any ordinal c & any binary classification algorithm
simplicity:
  mild computation overheads with O(NK) binary examples
state-of-the-art:
  allows new improvements in binary classification to be immediately inherited by ordinal ranking
Theoretical Guarantees of Reduction (1/3)
error transformation theorem (Li and Lin, 2007)
For consistent predictions or strongly ordinal costs, if g makes test error ∆ in the induced binary problem, then rg pays test cost at most ∆ in ordinal ranking.
a one-step extension of the per-example cost bound
conditions: general and minor
performance guarantee in the absolute sense
what if no “absolutely good” binary classifier?
1 absolutely good binary classifier
=⇒ absolutely good ranker? YES!
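The "one-step extension" can be read as taking expectations on both sides of the per-example cost bound. The sketch below assumes that ∆ denotes the expected total weighted error of g over the K − 1 associated binary examples of a test example (x, y, c); the talk does not spell out this normalization, so it is an assumption here.

```latex
% Assumption: \Delta is the expected total weighted binary error of g
% over the K-1 associated binary examples of a test example (x, y, c).
\begin{align*}
\mathbb{E}_{(x,y,c)}\bigl[\, c[r_g(x)] \,\bigr]
  \;\le\; \mathbb{E}_{(x,y,c)}\Bigl[\, \sum_{k=1}^{K-1} (w)_k \,
      \llbracket (z)_k \ne g(x,k) \rrbracket \Bigr]
  \;=\; \Delta .
\end{align*}
```

The inequality holds for each example under the theorem's conditions (consistent predictions or strongly ordinal costs), so it survives averaging over the test distribution.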
Theoretical Guarantees of Reduction (2/3)
regret transformation theorem (Lin, 2008)
For consistent predictions or strongly ordinal costs, if g is ε-close to the optimal binary classifier g∗, then rg is ε-close to the optimal ranker r∗.
“reduction to binary” sufficient for algorithm design, but necessary?
1 absolutely good binary classifier
=⇒ absolutely good ranker? YES!
2 relatively good binary classifier
=⇒ relatively good ranker? YES!
Theoretical Guarantees of Reduction (3/3)
equivalence theorem (Lin, 2008)
For a general family of ordinal costs, a good ordinal ranking algorithm exists
if & only if a good binary classification algorithm exists for the corresponding learning model.
ordinal ranking is equivalent to binary classification
1 absolutely good binary classifier
=⇒ absolutely good ranker? YES!
2 relatively good binary classifier
=⇒ relatively good ranker? YES!
3 algorithm producing relatively good binary classifier
⇐⇒ algorithm producing relatively good ranker? YES!
Unifying Existing Algorithms
ordinal ranking = reduction + cost + binary classification
ordinal ranking                         cost            binary classification algorithm
PRank (Crammer and Singer, 2002)        absolute        modified perceptron rule
kernel ranking (Rajaram et al., 2003)   classification  modified hard-margin SVM
SVOR-EXP (Chu and Keerthi, 2005)        classification  modified soft-margin SVM
SVOR-IMC (Chu and Keerthi, 2005)        absolute        modified soft-margin SVM
ORBoost-LR (Lin and Li, 2006)           classification  modified AdaBoost
ORBoost-All (Lin and Li, 2006)          absolute        modified AdaBoost
development and implementation time could have been saved
algorithmic structure revealed (SVOR, ORBoost)
variants of existing algorithms can be designed quickly by tweaking reduction
Designing New Algorithms Effortlessly
ordinal ranking = reduction + cost + binary classification
ordinal ranking               cost      binary classification algorithm
RED-SVM (Li and Lin, 2007)    absolute  standard soft-margin SVM
RED-C4.5 (Li and Lin, 2007)   absolute  standard C4.5 decision tree
SVOR (modified SVM) v.s. RED-SVM (standard SVM):
Figure: avg. training time (hour) of SVOR and RED-SVM on data sets ban, com, cal, cen
advantages of core binary classification algorithm inherited in the new ordinal ranking one
Proving New Generalization Theorems
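Ordinal Ranking (Li and Lin, 2007)
For RED-SVM/SVOR, with pr. > 1 − δ, expected test cost of r
$\le \frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} \llbracket \bar{\rho}(r(x_n), y_n, k) \le \Phi \rrbracket + O\left(\mathrm{poly}\left(K, \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)$
  first term: ambiguous training predictions w.r.t. criteria Φ
  second term: deviation that decreases with stronger criteria or more examples
Bi. Cl. (Bartlett and Shawe-Taylor, 1998)
For SVM, with pr. > 1 − δ, expected test err. of g
$\le \frac{1}{N} \sum_{n=1}^{N} \llbracket \bar{\rho}(g(x_n), y_n) \le \Phi \rrbracket + O\left(\mathrm{poly}\left(\frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}}\right)\right)$
  first term: ambiguous training predictions w.r.t. criteria Φ
  second term: deviation that decreases with stronger criteria or more examples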
new ordinal ranking theorem
= reduction + any cost + bin. thm. + math derivation
Reduction-C4.5 v.s. SVOR
Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on data sets pyr, mac, bos, aba, ban, com, cal, cen
C4.5: a (too) simple binary classifier (decision trees)
SVOR: state-of-the-art ordinal ranking algorithm
even simple Reduction-C4.5 sometimes beats SVOR
Reduction-SVM v.s. SVOR
Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on data sets pyr, mac, bos, aba, ban, com, cal, cen
SVM: one of the most powerful binary classification algorithms
SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM
Reduction-SVM without modification often better than SVOR and faster
Conclusion
reduction framework: simple but useful
  establish equivalence to binary classification
  unify existing algorithms
  simplify design of new algorithms
  facilitate derivation of new theoretical guarantees
superior experimental results:
  better performance and faster training time
reduction keeps ordinal ranking up-to-date with binary classification