From Ordinal Ranking to Binary Classification
Hsuan-Tien Lin
Learning Systems Group, California Institute of Technology
Talk at Caltech CS/IST Lunch Bunch March 4, 2008
Benefited from joint work with Dr. Ling Li (ALT’06, NIPS’06)
& discussions with Prof. Yaser Abu-Mostafa and Dr. Amrit Pratap
Introduction to Ordinal Ranking
Hsuan-Tien Lin (Caltech) From Ordinal Ranking to Binary Classification 03/04/2008 2 / 32
Introduction to Ordinal Ranking What is Ordinal Ranking?
Which Age-Group?
[Figure: a picture to be assigned one of the age-group ranks 1, 2, 3, 4]
rank: a finite ordered set of labels Y = {1, 2, · · · , K }
Hot or Not?
http://www.hotornot.com
rank: natural representation of human preferences
How Much Did You Like These Movies?
http://www.netflix.com
goal: use “movies you’ve rated” to automatically predict your preferences (ranks) on future movies
How Machine Learns the Preference of YOU?
left (human):
  Alice → (movie, rank) pairs → brain of Bob → good hypothesis
  alternatives fed to the brain: prefer romance/action/etc.
right (machine):
  You → examples (movie x_n, rank y_n) → learning algorithm → good hypothesis r(x)
  alternatives fed to the algorithm: learning model
challenge: how to make the right-hand-side work?
Introduction to Ordinal Ranking Ordinal Ranking Problem
Ordinal Ranking Problem
given: N examples (input x_n, rank y_n) ∈ X × Y, e.g.
  age-group: X = encoding(human pictures), Y = {1, ..., 4}
  hotornot: X = encoding(human pictures), Y = {1, ..., 10}
  netflix: X = encoding(movies), Y = {1, ..., 5}
goal: an ordinal ranker (hypothesis) r(x) that “closely predicts” the ranks y associated with some unseen inputs x
a hot and important research problem:
  relatively new for machine learning
  connecting classification and regression
  matching human preferences: many applications in social science and information retrieval
Ongoing Heat: Netflix Million Dollar Prize
(since 10/2006)
a huge joint ordinal ranking problem
given: each user u (480,189 users) rates N_u (from tens to hundreds) movies, a total of $\sum_u N_u = 100{,}480{,}507$ examples
goal: personalized predictions r_u(x) on 2,817,131 testing queries (u, x)
the first team being 10% better than the original Netflix system gets a million USD
Properties of Ranks
Y = {1, 2, ..., 5}
representing order: [image] < [image]
  relabeling by (3, 1, 2, 4, 5) erases information
  general multiclass classification cannot properly use ordering information
not carrying numerical information: [image] not 2.5 times better than [image]
  relabeling by (2, 3, 5, 9, 16) shouldn’t change results
  general metric regression deteriorates without correct numerical information
ordinal ranking resides uniquely between multiclass classification and metric regression
Cost of Wrong Prediction
ranks carry no numerical meaning: how to say “closely predict”?
artificially quantify the cost of being wrong
  infant (1), child (2), teen (3), adult (4)
  small mistake: classify a child as a teen
  big mistake: classify an infant as an adult
cost vector c of example (x, y, c):
  c[k] = cost when predicting (x, y) as rank k
  e.g. for an example with rank y = 2, a reasonable cost is c = (2, 0, 1, 4)
closely predict: small cost
Reasonable Cost Vectors
For an ordinal example (x, y, c), the cost vector c should
  respect the rank y: c[y] = 0; c[k] ≥ 0
  respect the ordinal information: V-shaped or even convex
[Figure: two plots of the cost C_{y,k} against the predicted rank k (1: infant, 2: child, 3: teenager, 4: adult), one V-shaped and one convex]
V-shaped: pay more when predicting further away
convex: pay increasingly more when further away

cost            classification          absolute          squared (Netflix)
c[k]            $[\![ y \neq k ]\!]$    $|y - k|$         $(y - k)^2$
shape           V-shaped only           convex            convex
e.g. (y = 2)    (1, 0, 1, 1)            (1, 0, 1, 2)      (1, 0, 1, 4)
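The three cost vectors in the table can be generated and checked mechanically. The sketch below is illustrative only: the function names are hypothetical, ranks are 1-based as on the slides, and each cost vector is stored as a 0-based Python list (so c[k-1] holds the cost of predicting rank k). The convexity test assumes the usual discrete definition, c[k+1] − c[k] ≥ c[k] − c[k−1].

```python
def classification_cost(y, K):
    """c[k] = [[y != k]] for k = 1..K."""
    return [0 if k == y else 1 for k in range(1, K + 1)]


def absolute_cost(y, K):
    """c[k] = |y - k| for k = 1..K."""
    return [abs(y - k) for k in range(1, K + 1)]


def squared_cost(y, K):
    """c[k] = (y - k)^2 for k = 1..K."""
    return [(y - k) ** 2 for k in range(1, K + 1)]


def is_v_shaped(c, y):
    """Zero at rank y, non-increasing before y, non-decreasing after y."""
    left = all(c[i - 1] >= c[i] for i in range(1, y))
    right = all(c[i] <= c[i + 1] for i in range(y - 1, len(c) - 1))
    return c[y - 1] == 0 and min(c) >= 0 and left and right


def is_convex(c):
    """Successive differences never decrease (assumed discrete convexity)."""
    return all(c[i + 1] - c[i] >= c[i] - c[i - 1] for i in range(1, len(c) - 1))


for make_cost in (classification_cost, absolute_cost, squared_cost):
    c = make_cost(2, 4)
    print(make_cost.__name__, c, is_v_shaped(c, 2), is_convex(c))
```

For y = 2 and K = 4 this reproduces the last row of the table, and only the classification cost fails the convexity check.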
Introduction to Ordinal Ranking Contribution
Our Contributions
a new framework that works with any reasonable cost, and ...
reduces ordinal ranking to binary classification systematically
unifies and clearly explains many existing ordinal ranking algorithms
makes the design of new ordinal ranking algorithms much easier
allows simple and intuitive proof for new ordinal ranking theorems
leads to promising experimental results
[Figure: three panels showing the answer, the traditional method, and our method]
Reduction from Ordinal Ranking to Binary Classification
Reduction from Ordinal Ranking to Binary Classification Thresholded Model for Ordinal Ranking
Thresholded Model
If we can first compute the score s(x) of a movie x, how can we construct r(x) from s(x)?
[Figure: examples placed along the score axis s(x), which the thresholds θ1, θ2, θ3 cut into ranks 1, 2, 3, 4; the ordinal ranker r(x) is shown against the target rank y]
quantize s(x) by some ordered threshold θ
commonly used in previous work:
thresholded perceptrons (PRank, Crammer and Singer, 2002)
thresholded hyperplanes (SVOR, Chu and Keerthi, 2005)
thresholded ensembles (ORBoost, Lin and Li, 2006)
thresholded model: $r(x) = \min\{k : s(x) < \theta_k\}$
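As a minimal sketch of the thresholded model, the rank can be read off by counting how many ordered thresholds the score has reached. The function name and the numeric thresholds below are hypothetical; the sketch assumes K − 1 finite thresholds sorted in increasing order, with rank K returned when the score is below none of them.

```python
from bisect import bisect_right


def thresholded_rank(score, thresholds):
    """Return min{k : score < theta_k}, or K if the score reaches every threshold.

    thresholds must be sorted in increasing order (theta_1 <= ... <= theta_{K-1}).
    bisect_right counts the thresholds that are <= score.
    """
    return 1 + bisect_right(thresholds, score)


thetas = [1.0, 2.0, 3.0]  # assumed values, giving K = 4 ranks
for s in (0.5, 1.0, 2.5, 3.7):
    print(s, thresholded_rank(s, thetas))
```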
Reduction from Ordinal Ranking to Binary Classification Associated Binary Questions
Key of Reduction: Associated Binary Questions
getting the rank using a thresholded model
1 is s(x ) > θ1? Yes
2 is s(x ) > θ2? No
3 is s(x ) > θ3? No
4 is s(x ) > θ4? No
generally, how do we query the rank of a movie x?
1 is movie x better than rank 1? Yes
2 is movie x better than rank 2? No
3 is movie x better than rank 3? No
4 is movie x better than rank 4? No
associated binary questions g(x, k):
is movie x better than rank k?
More on Associated Binary Questions
g(x, k): is movie x better than rank k?
e.g. thresholded model $g(x, k) = \mathrm{sign}(s(x) - \theta_k)$
K − 1 binary classification problems w.r.t. each k
[Figure: examples along the score axis s(x) with thresholds θ1, θ2, θ3; for each k = 1, 2, 3, a row of N/Y values shows the labels (z)_k and the answers g(x, k), with N to the left of θk and Y to the right]
let ((x, k), (z)_k) be binary examples
  (x, k): extended input w.r.t. k-th query
  (z)_k: binary label Y/N
if g(x, k) = (z)_k for all k, we can compute r_g(x) from g(x, k) such that r_g(x) = y
Computing Ranks from Associated Binary Questions
g(x, k): is movie x better than rank k?
Consider g(x, 1), g(x, 2), ..., g(x, K−1),
consistent answers: (Y, Y, N, N, ..., N)
extracting the rank from consistent answers:
  minimum index searching: $r_g(x) = \min\{k : g(x, k) = \mathrm{N}\}$
  counting: $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$
  two approaches equivalent for consistent answers
noisy/inconsistent answers? e.g. (Y, N, Y, Y, N, N, Y, N, N)
  counting is simpler to analyze, and is robust to noise
are all associated binary questions of the same importance?
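The two extraction rules can be compared on the consistent and the noisy answer vectors from this slide. This is a small sketch with hypothetical function names; answers are Booleans (True for Y), and returning K when no answer is N is an assumption for the minimum-index rule.

```python
def rank_by_min_index(answers):
    """r_g(x) = min{k : g(x, k) = N}; assumed to be K when every answer is Y."""
    for k, yes in enumerate(answers, start=1):
        if not yes:
            return k
    return len(answers) + 1


def rank_by_counting(answers):
    """r_g(x) = 1 + number of Y answers."""
    return 1 + sum(answers)


consistent = [a == "Y" for a in "YYNN"]
noisy = [a == "Y" for a in "YNYYNNYNN"]  # the inconsistent example above

print(rank_by_min_index(consistent), rank_by_counting(consistent))
print(rank_by_min_index(noisy), rank_by_counting(noisy))
```

On the consistent answers both rules agree; on the noisy ones the minimum-index rule stops at the first N, while counting uses every answer.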
Importance of Associated Binary Questions
given a movie x with rank y = 2 and $c[k] = (y - k)^2$

                                    four possible sets of answers
g(x, 1): is x better than rank 1?   No    Yes   Yes   Yes
g(x, 2): is x better than rank 2?   No    No    Yes   Yes
g(x, 3): is x better than rank 3?   No    No    No    Yes
g(x, 4): is x better than rank 4?   No    No    No    No
r_g(x)                              1     2     3     4
c[r_g(x)]                           1     0     1     4

1 more for answering question 2 wrong;
but 3 more for answering question 3 wrong
$(w)_k \equiv |c[k+1] - c[k]|$: the importance of ((x, k), (z)_k)
per-example error bound (Li and Lin, 2007; Lin, 2008): for consistent answers or convex costs
$$c[r_g(x)] \le \sum_{k=1}^{K-1} (w)_k \, [\![ (z)_k \neq g(x, k) ]\!]$$
accurate binary answers ⟹ correct ranks
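The per-example bound can be checked by brute force on small problems. The sketch below uses hypothetical helper names, takes the weights as |c[k+1] − c[k]| and the labels as (z)_k = [[y > k]], and enumerates every possible answer vector with ranks computed by counting. It is only a numerical sanity check under those assumptions, not a proof.

```python
from itertools import product


def weights(c):
    """(w)_k = |c[k+1] - c[k]| for k = 1..K-1; c is a 0-based list over ranks 1..K."""
    return [abs(c[k] - c[k - 1]) for k in range(1, len(c))]


def bound_holds(y, c):
    """Check c[r_g(x)] <= sum_k (w)_k [[(z)_k != g(x, k)]] for every answer vector g."""
    K = len(c)
    z = [y > k for k in range(1, K)]          # true binary labels
    w = weights(c)
    for g in product([False, True], repeat=K - 1):
        rank = 1 + sum(g)                     # counting rule
        cost = c[rank - 1]
        bound = sum(wk for wk, zk, gk in zip(w, z, g) if zk != gk)
        if cost > bound:
            return False
    return True


print(weights([1, 0, 1, 4]))          # squared cost for y = 2
print(bound_holds(2, [1, 0, 1, 4]))   # convex cost, all answer vectors
print(bound_holds(2, [1, 0, 1, 1]))   # classification cost, inconsistent answers allowed
```

With the non-convex classification cost the check fails on an inconsistent answer vector such as (Y, N, Y), which is why the slide restricts the bound to consistent answers or convex costs.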
Reduction from Ordinal Ranking to Binary Classification The Reduction Framework
The Reduction Framework
1 transform ordinal examples (x_n, y_n, c_n) to weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k)
2 use your favorite algorithm on the weighted binary examples and get K − 1 binary classifiers (i.e., one big joint binary classifier) g(x, k)
3 for each new input x, predict its rank using $r_g(x) = 1 + \sum_k [\![ g(x, k) = \mathrm{Y} ]\!]$
[Figure: ordinal examples (x_n, y_n, c_n) ⇒ weighted binary examples ((x_n, k), (z_n)_k, (w_n)_k), k = 1, ..., K−1 ⇒ core binary classification algorithm ⇒ associated binary classifiers g(x, k), k = 1, ..., K−1 ⇒ ordinal ranker r_g(x)]
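The three steps of the framework can be sketched end to end on toy data. Everything below is an illustrative assumption rather than the method used in the talk: inputs are scalars, the "favorite algorithm" is a hypothetical weighted threshold (stump) learner trained separately for each k instead of one joint classifier, labels are (z)_k = [[y > k]], weights are |c[k+1] − c[k]|, and the cost is the absolute cost.

```python
def to_weighted_binary(examples, K):
    """Step 1: turn each (x, y, c) into K-1 weighted binary examples."""
    binary = []
    for x, y, c in examples:
        for k in range(1, K):
            z = y > k                     # (z)_k: is the true rank better than k?
            w = abs(c[k] - c[k - 1])      # (w)_k = |c[k+1] - c[k]|; c is a 0-based list
            binary.append(((x, k), z, w))
    return binary


def train_stumps(binary, K):
    """Step 2 (assumed learner): per k, the threshold with least weighted error."""
    thresholds = []
    for k in range(1, K):
        rows = [(x, z, w) for (x, kk), z, w in binary if kk == k]
        xs = sorted({x for x, _, _ in rows})
        candidates = [xs[0] - 1.0] + xs   # g(x, k) answers Yes when x > threshold

        def weighted_error(t):
            return sum(w for x, z, w in rows if (x > t) != z)

        thresholds.append(min(candidates, key=weighted_error))
    return thresholds


def predict_rank(x, thresholds):
    """Step 3: r_g(x) = 1 + number of Yes answers among g(x, 1), ..., g(x, K-1)."""
    return 1 + sum(x > t for t in thresholds)


K = 4
pairs = [(0.1, 1), (0.2, 1), (1.1, 2), (1.3, 2), (2.2, 3), (2.5, 3), (3.1, 4)]
train = [(x, y, [abs(y - k) for k in range(1, K + 1)]) for x, y in pairs]
thresholds = train_stumps(to_weighted_binary(train, K), K)
print(thresholds, [predict_rank(x, thresholds) for x, _ in pairs])
```

Swapping `train_stumps` for any other weighted binary learner leaves steps 1 and 3 unchanged, which is the point of the reduction.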
Properties of Reduction
performance guarantee:
  accurate binary answers ⟹ correct ranks
wide applicability:
  systematic; works with any reasonable c and any binary classification algorithm
up-to-date:
  allows new improvements in binary classification to be immediately inherited by ordinal ranking
“If I have seen further it is by standing on the shoulders of Giants” (I. Newton)
Reduction from Ordinal Ranking to Binary Classification Theoretical Guarantees
Theoretical Guarantees of Reduction (1/3)
is reduction a reasonable approach? YES!
error transformation theorem (Li and Lin, 2007)
  For consistent answers or convex costs, if g makes test error ∆ in the induced binary problem, then r_g pays test cost at most ∆ in ordinal ranking.
a one-step extension of the per-example error bound
conditions: general and minor
performance guarantee in the absolute sense:
  accuracy in binary classification ⟹ correctness in ordinal ranking
What if the induced binary problem is “too hard” and even the best g∗ can only commit a big ∆?
Theoretical Guarantees of Reduction (2/3)
is reduction a promising approach? YES!
regret transformation theorem (Lin, 2008)
  For a general class of reasonable costs, if g is ε-close to the optimal binary classifier g∗, then r_g is ε-close to the optimal ordinal ranker r∗.
error guarantee in the relative setting:
  regardless of the absolute hardness of the induced binary problem, optimality in binary classification ⟹ optimality in ordinal ranking
reduction does not introduce additional hardness
It is sufficient to go with reduction plus binary classification, but is it necessary?
Theoretical Guarantees of Reduction (3/3)
is reduction a principled approach? YES!
equivalence theorem (Lin, 2008)
  For a general class of reasonable costs, ordinal ranking is learnable by a learning model if and only if binary classification is learnable by the associated learning model.
a surprising equivalence:
  ordinal ranking is as easy as binary classification
“without loss of generality”, we can just focus on binary classification
reduction to binary classification:
systematic, reasonable, promising, and principled
Usefulness of the Reduction Framework
Usefulness of the Reduction Framework Algorithmic Reduction
Unifying Existing Algorithms
ordinal ranking                           cost             binary classification algorithm
PRank (Crammer and Singer, 2002)          absolute         modified perceptron rule
kernel ranking (Rajaram et al., 2003)     classification   modified hard-margin SVM
SVOR-EXP (Chu and Keerthi, 2005)          classification   modified soft-margin SVM
SVOR-IMC (Chu and Keerthi, 2005)          absolute         modified soft-margin SVM
ORBoost-LR (Lin and Li, 2006)             classification   modified AdaBoost
ORBoost-All (Lin and Li, 2006)            absolute         modified AdaBoost

if the reduction framework had been there,
  development and implementation time could have been saved
  correctness proof significantly simplified (PRank)
  algorithmic structure revealed (SVOR, ORBoost)
variants of existing algorithms can be designed quickly by tweaking reduction
Designing New Algorithms (1/2)
ordinal ranking        cost       binary classification algorithm
Reduction-C4.5         absolute   standard C4.5 decision tree
Reduction-AdaBoost     absolute   standard AdaBoost
Reduction-SVM          absolute   standard soft-margin SVM

SVOR (modified SVM) vs. Reduction-SVM (standard SVM):
[Figure: avg. training time (hour) of SVOR and RED-SVM on data sets ban, com, cal, cen]
advantages of core binary classification algorithm inherited in the new ordinal ranking one
Designing New Algorithms (2/2)
AdaBoost (Freund and Schapire, 1997)
for t = 1, 2, ..., T,
1 find a simple g_t that matches best with the current “view” of {(X_n, Y_n)}
2 give a larger weight v_t to g_t if the match is stronger
3 update “view” by emphasizing the weights of those (X_n, Y_n) that g_t doesn’t predict well
prediction: majority vote of {(v_t, g_t(x))}

AdaBoost.OR (Lin, 2008)
for t = 1, 2, ..., T,
1 find a simple r_t that matches best with the current “view” of {(x_n, y_n)}
2 give a larger weight v_t to r_t if the match is stronger
3 update “view” by emphasizing the costs c_n of those (x_n, y_n) that r_t doesn’t predict well
prediction: weighted median of {(v_t, r_t(x))}

AdaBoost.OR:
  an extension of Reduction-AdaBoost;
  a parallel of AdaBoost in ordinal ranking
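The prediction step of AdaBoost.OR, a weighted median over the base rankers, can be sketched as follows. The slides do not spell out the definition, so this assumes one common form: the smallest rank k whose cumulative vote weight reaches half of the total weight. The function name and the vote values are hypothetical.

```python
def weighted_median_rank(votes, K):
    """Smallest rank k such that the weight of votes with rank <= k
    reaches half of the total weight (assumed definition).

    votes is a list of (v_t, r_t(x)) pairs with v_t >= 0 and ranks in 1..K.
    """
    total = sum(v for v, _ in votes)
    cumulative = 0
    for k in range(1, K + 1):
        cumulative += sum(v for v, r in votes if r == k)
        if 2 * cumulative >= total:
            return k
    return K


votes = [(5, 1), (3, 3), (4, 4)]  # three base rankers voting for ranks 1, 3, 4
print(weighted_median_rank(votes, K=4))
```

Unlike a plain majority vote, the weighted median respects the order of the ranks: here no single rank holds a majority, yet the prediction lands between the extreme votes.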
Usefulness of the Reduction Framework Theoretical Reduction
Proving New Theorems
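Binary Classification (Bartlett and Shawe-Taylor, 1998)
For SVM, with prob. > 1 − δ,
$$\text{expected test error} \le \underbrace{\frac{1}{N} \sum_{n=1}^{N} [\![ \bar{\rho}(X_n, Y_n) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$

Ordinal Ranking (Li and Lin, 2007)
For SVOR or Red.-SVM, with prob. > 1 − δ,
$$\text{expected test cost} \le \underbrace{\frac{\beta}{N} \sum_{n=1}^{N} \sum_{k=1}^{K-1} (w_n)_k \, [\![ \bar{\rho}((x_n, k), (z_n)_k) \le \Phi ]\!]}_{\text{ambiguous training predictions w.r.t. criteria } \Phi} + \underbrace{O\!\left( \frac{\log N}{\sqrt{N}}, \frac{1}{\Phi}, \sqrt{\log \frac{1}{\delta}} \right)}_{\text{deviation that decreases with stronger criteria or more examples}}$$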
new test cost bounds with any c[·]
Usefulness of the Reduction Framework Experimental Comparisons
Reduction-C4.5 v.s. SVOR
[Figure: avg. test absolute cost of SVOR (Gauss) and RED-C4.5 on data sets pyr, mac, bos, aba, ban, com, cal, cen]
C4.5: a (too) simple binary classifier (decision trees)
SVOR: state-of-the-art ordinal ranking algorithm
even simple Reduction-C4.5 sometimes beats SVOR
Reduction-SVM v.s. SVOR
[Figure: avg. test absolute cost of SVOR (Gauss) and RED-SVM (Perc.) on data sets pyr, mac, bos, aba, ban, com, cal, cen]
SVM: one of the most powerful binary classification algorithms
SVOR: state-of-the-art ordinal ranking algorithm extended from modified SVM
Reduction-SVM without modification is often better than SVOR, and faster
Usefulness of the Reduction Framework Netflix Prize?
Can We Win the Netflix Prize with Reduction?
possibly
  a principled view of the problem
  now easy to apply known binary classification techniques or to design suitable ordinal ranking approaches
  e.g., AdaBoost.OR “boosted” some simple r_t and reduced the test cost from 1.0704 to 1.0343
but not yet
  need 0.8563 to win
  the problem has its own characteristics
    huge data set: computational bottleneck
    allows real-valued predictions: r(x) ∈ R instead of r(x) ∈ {1, ..., K}
    encoding(movie), encoding(user): important
many interesting research problems arose during “CS156b: Learning Systems”
Usefulness of the Reduction Framework Conclusion
Conclusion
reduction framework: simple, intuitive, and useful for ordinal ranking
algorithmic reduction:
  unifying existing ordinal ranking algorithms
  designing new ordinal ranking algorithms
theoretic reduction:
  new bounds on ordinal ranking test cost
promising experimental results:
  some for better performance
  some for faster training time
reduction keeps ordinal ranking up-to-date with binary classification