Automatic Ranking by Extended Binary Classification
Hsuan-Tien Lin Learning Systems Group Joint work with Ling Li (NIPS 2006)
EE Pizza Meeting, November 17, 2006
Introduction
What is the Age-Group?
[figure: face photos labeled with age groups 1–4]
rank: a finite ordered set of labels Y = {1, 2, ..., K}
H.-T. Lin (Learning Systems Group) Automatic Ranking 2006/11/17 2 / 22
Hot or Not?
http://www.hotornot.com
rank: natural representation of preferences in surveys
How Much Did You Like These Movies?
http://www.netflix.com
Can machines use movies you’ve rated to closely predict your preferences (i.e., ranks) on future movies?
How a Machine Learns Your Preferences
[diagram: Mary gives Bob (movie, rank) pairs; Bob's brain forms a hypothesis, with alternatives such as "prefer romance/action/etc."]
[diagram: you feed examples (movie xn, rank yn) to a learning algorithm, which, within a learning model, outputs a hypothesis r(x)]
machine learning:
an automatic route of system design
Poor Bob
Bob impresses Mary by memorizing every given (movie, rank) pair,
but is too nervous about a new movie and guesses randomly
memorize ≠ generalize
perfect from Bob's view ≠ good for Mary; perfect during training ≠ good when testing
challenges:
algorithms and theories for doing well when testing
Ranking Problem
input: N examples (xn, yn) ∈ X × Y, e.g.
hotornot: X = human pictures, Y = {1, ..., 10}
netflix: X = movies, Y = {1, ..., 5}
output: a ranking function r(x) that ranks future unseen examples (x, y) "correctly"
properties of the K elements in Y: ordered,
1 < 2 < ... < K,
but not carrying numerical information: rank 5 is not "2.5 times better" than rank 2
1 instance representation? some meaningful vectors
2 correctly? cost of wrong prediction
Cost of Wrong Prediction
cannot quantify the numerical meaning of ranks,
but can artificially quantify the cost of being wrong
infant (1)  child (2)  teen (3)  adult (4)
small mistake – classify a child as a teenager;
big mistake – classify an infant as an adult
Cy,k: the cost when rank y is predicted as rank k
V-shaped Cy,k with Cy,y = 0,
e.g. absolute cost Cy,k = |y − k|:

    0 1 2 3
    1 0 1 2
    2 1 0 1
    3 2 1 0
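The V-shaped absolute cost matrix above can be generated directly from its definition; a minimal sketch (the function name is illustrative):

```python
# Build the K-by-K absolute cost matrix C[y][k] = |y - k|
# for ranks y, k in {1, ..., K}; V-shaped with C[y][y] = 0.
def absolute_cost_matrix(K):
    return [[abs(y - k) for k in range(1, K + 1)]
            for y in range(1, K + 1)]

for row in absolute_cost_matrix(4):
    print(row)   # each row matches a row of the matrix above
```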
Even More Challenging: Netflix Million Dollar Prize
input: Ni examples from each user i, with 480,000+ users and Σi Ni ≈ 100,000,000
output: personalized predictions r(i, x) on 2,800,000+ testing queries (i, x)
cost: squared cost Cy,k = (y − k)^2
a huge joint ranking problem
The first team that does 10% better than the existing Netflix system wins a million USD
Our Contributions
a new framework that ...
makes the design and implementation of ranking algorithms almost effortless
makes the proofs of ranking theories much simpler
unifies many existing ranking algorithms and helps understand their pros and cons
shows that ranking is theoretically not much more complex than binary classification
leads to promising experimental performance
[figure: three estimated ranking functions – answer; traditional method; our method]
Key Idea: Reduction
(iPod) -> (adapter) -> (cassette player)
complex ranking problems
-> (reduction) ->
simpler binary problems, with well-known results on models, algorithms, proofs, etc.
many new results immediately come up;
many existing results unified
Intuition: Associated Binary Questions
how do we query the rank of a movie x?
1 is movie x better than rank 1? Yes
2 is movie x better than rank 2? No
3 is movie x better than rank 3? No
4 is movie x better than rank 4? No
5 is movie x better than rank 5? No
gb(x, k): is movie x better than rank k?
consistent answers: G(x) = (1, 1, 1, 0, ..., 0)
extract the rank from consistent answers:
searching: compare to a "middle" rank each time
voting: r(x) = 1 + Σk gb(x, k)
what if the answers are not consistent? e.g. (1, 0, 1, 1, 0, 0, 1, 0)
– voting is simple enough to analyze, and still works
accurate binary answers=⇒correct ranks
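The voting rule r(x) = 1 + Σk gb(x, k) is one line of code, and it tolerates inconsistent answers; a minimal sketch (the function name is illustrative):

```python
# Extract a rank from the K-1 binary answers gb(x, k) in {0, 1},
# where answer k means "x is better than rank k"; voting simply
# adds up the yes-answers, consistent or not.
def rank_by_voting(answers):
    return 1 + sum(answers)

print(rank_by_voting([1, 1, 1, 0]))               # consistent -> 4
print(rank_by_voting([1, 0, 1, 1, 0, 0, 1, 0]))   # inconsistent -> 5
```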
Reduction during Training
input: N examples (xn, yn) ∈ X × Y
tool: your favorite binary classification algorithm
output: a binary classifier gb(x, k) that can answer the associated questions correctly
need to feed binary examples (Xn,k, Yn,k) to the tool:
Xn,k ≡ (xn, k), Yn,k ≡ [yn > k]
about NK extended binary examples extracted from the given input
– bigger, but not troublesome
some approaches extract about N^2 binary examples using a different intuition
– can be too big
Are extended binary examples of the same importance?
Importance of Extended Binary Examples
for a given movie xn with rank yn = 2, and Cy,k = (y − k)^2:

is xn better than rank 1?    No   Yes  Yes  Yes
is xn better than rank 2?    No   No   Yes  Yes
is xn better than rank 3?    No   No   No   Yes
is xn better than rank 4?    No   No   No   No
r(xn)                        1    2    3    4
cost                         1    0    1    4

3 more for answering question 3 wrong;
only 1 more for answering question 1 wrong
Wn,k ≡ |Cn,k+1 − Cn,k|: the importance of (Xn,k, Yn,k)
most binary classification algorithms can handle Wn,k
analogy to economics:
additional (marginal) cost ⇐⇒ importance
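The transform from one ranking example (xn, yn) to K−1 weighted extended binary examples can be sketched as follows (an illustration with hypothetical names; the cost is passed as a function so any V-shaped Cy,k works):

```python
# Transform a ranking example (x, y) into K-1 weighted binary examples
# (X, Y, W): X = (x, k), Y = [y > k], W = |C(y, k+1) - C(y, k)|.
def extend_example(x, y, K, cost):
    extended = []
    for k in range(1, K):
        X = (x, k)
        Y = 1 if y > k else 0
        W = abs(cost(y, k + 1) - cost(y, k))
        extended.append((X, Y, W))
    return extended

squared = lambda y, k: (y - k) ** 2
# the slide's example: rank yn = 2 under the squared cost
for X, Y, W in extend_example("x_n", 2, 4, squared):
    print(X, Y, W)   # weights 1, 1, 3 - question 3 matters most
```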
The Reduction Framework for Ranking
1 transform ranking examples (xn, yn) to extended binary examples (Xn,k, Yn,k, Wn,k) based on Cy,k
2 use your favorite algorithm to learn from the extended binary examples, and get gb(x, k) ≡ gb(X)
3 for each new instance x, predict its rank using r(x) = 1 + Σk gb(x, k)
error equivalence: accurate binary answers ⇒ correct ranks
simplicity: works with almost any Cy,k and any algorithm
up-to-date: new improvements in binary classification immediately propagate to ranking
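The three steps above can be run end to end on toy data; a minimal sketch where the toy 1-D data and the weighted decision stump stand in for real data and "your favorite binary classification algorithm":

```python
# End-to-end sketch of the reduction framework on toy 1-D data.
def extend(data, K, cost):
    # step 1: ranking examples -> weighted extended binary examples
    ext = []
    for x, y in data:
        for k in range(1, K):
            ext.append(((x, k), 1 if y > k else 0,
                        abs(cost(y, k + 1) - cost(y, k))))
    return ext

def train_stump(ext):
    # step 2: learn gb(x, k) by thresholding the score x - k at the
    # point with the smallest weighted binary error
    best_t, best_err = 0.0, float("inf")
    for t in {x - k for (x, k), _, _ in ext}:
        err = sum(w for (x, k), y, w in ext if (x - k > t) != (y == 1))
        if err < best_err:
            best_t, best_err = t, err
    return lambda x, k: 1 if x - k > best_t else 0

def rank(g, x, K):
    # step 3: predict by voting, r(x) = 1 + sum_k gb(x, k)
    return 1 + sum(g(x, k) for k in range(1, K))

K = 4
train = [(0.5, 1), (1.5, 2), (2.5, 3), (3.5, 4)]
g = train_stump(extend(train, K, lambda y, k: abs(y - k)))
print([rank(g, x, K) for x, _ in train])   # prints [1, 2, 3, 4]
```

Swapping `train_stump` for any weighted binary learner leaves steps 1 and 3 untouched, which is the point of the framework.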
If I have seen further it is by
standing on ye shoulders of Giants – I. Newton
Unifying Existing Algorithms
ranking with perceptrons
– (PRank, Crammer and Singer, 2002)
several long proofs
⇒ a few lines, extended from binary perceptron results
large-margin (high confidence) formulations
– (Rajaram et al., 2003), (SVORIM, Chu and Keerthi, 2005)
results explained more directly; algorithm structure revealed
variants of existing algorithms can be designed quickly by tweaking reduction
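The "few lines" flavor can be illustrated: running a plain binary perceptron over the extended examples, with a shared weight vector and one threshold per question, yields a PRank-style update. A sketch under stated assumptions (toy data; unlike PRank proper, the ordering of the thresholds is not explicitly enforced here):

```python
# Binary perceptron on the extended examples (x, k): shared weights w
# plus a per-question threshold theta[k-1], trained with the standard
# mistake-driven update.
def train_perceptron(data, K, dim, epochs=10):
    w = [0.0] * dim
    theta = [0.0] * (K - 1)
    for _ in range(epochs):
        for x, y in data:
            for k in range(1, K):
                Y = 1 if y > k else -1          # binary label in {-1, +1}
                score = sum(wi * xi for wi, xi in zip(w, x)) - theta[k - 1]
                if Y * score <= 0:              # mistake -> perceptron update
                    w = [wi + Y * xi for wi, xi in zip(w, x)]
                    theta[k - 1] -= Y
    return w, theta

def rank(w, theta, x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 + sum(1 for t in theta if s > t)   # voting over thresholds

data = [([0.0], 1), ([1.0], 2), ([2.0], 3), ([3.0], 4)]
w, theta = train_perceptron(data, K=4, dim=1)
print([rank(w, theta, x) for x, _ in data])     # prints [1, 2, 3, 4]
```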
Proposing New Algorithms
ranking using an ensemble (consensus) of classifiers – (ORBoost, Lin and Li, 2006), OR-AdaBoost
ranking using decision trees – OR-C4.5
ranking with large-margin classifiers – OR-SVM
[figure: avg. training time (hours) on the bank, computer, california, and census datasets: reduction vs. SVOR-IMC]
advantages of the underlying binary algorithm are inherited by the new ranking algorithm
Proving New Theorems
simpler cost bound for PRank
new guarantee of ranking performance using ensemble of classifiers (Lin and Li, 2006)
new guarantee of ranking performance using large-margin classifiers, e.g.,
E(x,y) Cy,r(x)  ≤  (1/N) Σn Σk [ρ(Xn,k, Yn,k) ≤ ∆]  +  K · hδ(N, ∆)

where E(x,y) Cy,r(x) is the expected cost during testing,
the middle term is the fraction of low-confidence extended examples,
and hδ(N, ∆) is a deviation function that decreases with more data or confidence
Reduction-C4.5 vs. SVORIM
[figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, cen datasets: SVOR-Gauss vs. RED-C4.5]
C4.5: a decision tree – an intuitive, but often too simple, binary classifier
SVORIM: a state-of-the-art ranking algorithm
even reduction to the simple C4.5 beats SVORIM some of the time
Reduction-SVM vs. SVORIM
[figure: avg. test absolute cost on the pyr, mac, bos, aba, ban, com, cal, cen datasets: SVOR-Gauss vs. RED-SVM]
SVM: one of the most powerful binary classifiers
SVORIM: a state-of-the-art ranking algorithm, extended from a modified SVM
reducing to SVM without modification is often better than SVORIM
Reduction-Boost vs. RankBoost
[figure: test absolute cost per dataset (py, ma, bo, ab, ba, co, ca, ce): RankBoost vs. ORBoost]
Boost: a popular ensemble algorithm
RankBoost: a state-of-the-art ensemble ranking algorithm
our reduction to boosting results in a significantly better ensemble ranking algorithm
Conclusion
reduction framework: simple, intuitive, and useful for ranking
algorithmic reduction:
unifying existing ranking algorithms
proposing new ranking algorithms
theoretic reduction:
new guarantees on ranking performance
promising experimental results:
some for better performance, some for faster training time
next level: the Netflix challenge?
handling huge datasets
finding useful representations (features)
using collaborative information from other users
Thank you. Questions?