Improving Ranking Performance with
Cost-sensitive Ordinal Classification via Regression
Yu-Xun Ruan¹, Hsuan-Tien Lin¹, Ming-Feng Tsai²
¹National Taiwan University, ²National Chengchi University
Preference Learning @ EURO, July 10, 2012
Preference Ranking in Search Engine
not just for searching (e.g. "good machine learning book"), but also for recommendation systems & other web services
Three Properties of Search-Engine Ranking
listwise with focus on top ranks
  query-oriented & personalized
  emphasis on highly-preferred (relevant) items
large scale
  both during training & testing
  e.g. Yahoo! Learning-To-Rank Challenge 2010: 473K training URLs, 166K test URLs
ordinal data
  labeled qualitatively by humans, e.g. {highly irrelevant, irrelevant, neutral, relevant, highly relevant}
  lack of quantitative info
search-engine ranking problem:
learning a ranker from large-scale ordinal data with focus on top ranks
Search-Engine Ranking Setup
Given
for query indices $q = 1, 2, \cdots, Q$,
  a set of related documents $\{x_{q,i}\}_{i=1}^{N(q)}$
  ordinal relevance $y_{q,i} \in \mathcal{Y} = \{0, 1, \ldots, K\}$ for each document $x_{q,i}$
with large $Q$ and $N(q)$
Goal
a ranker $r(x)$ that “accurately ranks” the top $x_{Q+1,i}$ from an unseen set of documents $\{x_{Q+1,i}\}$
how to evaluate accurate ranking around the top?
Expected Reciprocal Rank
(ERR; Chapelle et al., CIKM ’09)
Assumption: Choice Probability of a Single Document
for any example (document $x$, rank $y$),
$$P(\text{user chooses document } x) = \frac{2^{y} - 1}{2^{K}}$$
Assumption: Stopping Probability of a List of Documents
$$P(\text{user stops at position } i \text{ of list}) = P(\text{doesn't stop at pos. } i - 1) \times P(\text{chooses document at pos. } i)$$
ERR: Total Discounted Stopping Probability of a List of Documents
$$\mathrm{ERR}_q(r) \equiv \sum_{i=1}^{N(q)} \frac{1}{i}\, P(\text{user stops at position } i \text{ of the list ordered by } r)$$
large ERR ⇔ small $i$ matches large $P$ ⇔ good ranking around top
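The ERR definition above can be sketched in a few lines. This is a minimal illustration, assuming the cascade reading of the stopping probability (the user reaches position i only if no earlier document was chosen); the function names `choice_probability` and `expected_reciprocal_rank` are hypothetical, not from the slides.

```python
from typing import Sequence


def choice_probability(y: int, K: int) -> float:
    """P(user chooses a document of rank y) = (2^y - 1) / 2^K."""
    return (2 ** y - 1) / (2 ** K)


def expected_reciprocal_rank(ranks_in_order: Sequence[int], K: int) -> float:
    """ERR of one query, given the true ranks y listed in the order chosen by r.

    Assumption: cascade model, i.e. the user stops at position i only if
    they did not stop at any earlier position.
    """
    err = 0.0
    p_not_stopped = 1.0  # probability the user has not stopped before position i
    for i, y in enumerate(ranks_in_order, start=1):
        p_choose = choice_probability(y, K)
        err += p_not_stopped * p_choose / i  # discounted stopping probability
        p_not_stopped *= 1.0 - p_choose
    return err


# Placing the highly relevant document first gives the larger ERR.
good = expected_reciprocal_rank([2, 0], K=2)
bad = expected_reciprocal_rank([0, 2], K=2)
```

With K = 2, the ordering that puts the rank-2 document first scores 0.75 while the reversed ordering scores 0.375, which matches the "good ranking around top" reading of ERR.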
Possible Approach 1: LambdaRank
(Burges et al., NIPS ’06)
maximize ERR directly with non-smooth optimization on $N(q)!$ list reorderings
Pros
respect top rank goal
respect ordinal nature of data
Cons
difficult optimization problem
challenging to apply on large-scale data
LambdaRank: a state-of-the-art approach, but possibly inefficient
Possible Approach 2: SVM-Rank
(Joachims, KDD ’02)
conduct listwise ranking by predicting pairwise preferences accurately
Pros
respect ordinal nature of data (w/ comparison)
somewhat applicable to large-scale data
Cons
all pairs equal, not respecting top rank goal
only somewhat applicable to large-scale data, because of $O(N^2)$ pairs
SVM-Rank: a baseline pairwise ranking approach, but possibly not the best for listwise ranking
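To make the $O(N^2)$ concern concrete, here is a small sketch of how pairwise preference examples could be enumerated for one query. It only illustrates the pair construction, not SVM-Rank itself, and the helper name `preference_pairs` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def preference_pairs(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) such that document i is preferred over document j.

    Documents with equal labels produce no pair.
    """
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            pairs.append((i, j))
        elif labels[j] > labels[i]:
            pairs.append((j, i))
    return pairs


# With N distinct labels, every one of the N(N-1)/2 pairs becomes an example.
pairs_small = preference_pairs([2, 0, 1, 1])
pairs_large = preference_pairs(list(range(100)))
```

Every pair is treated alike here, which is the "all pairs equal" limitation noted above: a swap near the top of the list counts the same as one near the bottom.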
Possible Approach 3: Direct Regression
(Cossock and Zhang, COLT ’06)
conduct listwise ranking by predicting real-valued scores accurately
Pros
respect top rank goal by embedding it in regression loss
applicable to large-scale data
Cons
treats $y$ as numerical score, not respecting ordinal nature of data
Direct Regression: a simple pointwise ranking approach, but may be improved by taking ordinal property into account
Possible Approach 4: Ordinal Classification
(McRank; Li et al., NIPS ’07)
conduct listwise ranking by predicting ordinal-valued ranks accurately
Pros
somewhat respect top rank goal
respect ordinal nature of data
applicable to large-scale data
Cons
only somewhat respect top rank goal, because of a loose bound in embedding the goal
McRank: a state-of-the-art pointwise ranking approach, but may be improved further towards top rank goal
Our Contributions
an algorithmic development on cost-sensitive ordinal classification via regression (COCR), which ...
systematically respects all three properties of search-engine ranking
algorithm          top rank  large scale  ordinal data
LambdaRank         ✓         ◦            ✓
SVM-Rank           ×         ◦            ✓
Direct Regression  ✓         ✓            ×
McRank             ◦         ✓            ✓
COCR               ✓         ✓            ✓
leads to promising experimental results
Overview of Cost-sensitive Ordinal Classification via Regression (COCR)
reduction from listwise ranking (ERR) to cost-sensitive ordinal classification (approximately)
  aim for top rank and large scale data (like Direct Regression)
reduction from cost-sensitive ordinal classification to binary classification
  aim for respecting ordinal data (like McRank)
reduction from binary classification to regression
  aim for large scale data and avoiding discrete ties (like Direct Regression)
COCR: combine the benefits of Direct Regression and McRank
Ordinal Classification via Binary Classification
(Lin & Li, Neural Computation ’12)
desired pointwise ranking problem
$r(x)$ = What is the rank of the document $x$?
reduced problems
$g_k(x)$ = Is the rank of document $x$ greater than $k$?
train binary classifiers with $\{(x_{q,i}, [y_{q,i} > k])\}$
predict with a simple counting ranker $r_g(x) = \sum_{k=0}^{K-1} g_k(x)$
simple and efficient
good theoretical guarantee:
  absolutely good binary classifier ⟹ absolutely good ranker
  relatively good binary classifier ⟹ relatively good ranker
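The reduction above can be illustrated with a toy sketch: build the K binary targets $[y > k]$ for one document, then add up the K binary answers to recover a rank. The threshold classifiers on a scalar input are purely hypothetical stand-ins for trained $g_k$; the names `binary_targets` and `counting_ranker` are assumptions too.

```python
from typing import Callable, List, Sequence


def binary_targets(y: int, K: int) -> List[int]:
    """Targets [y > k] for k = 0, ..., K-1, for a document of rank y in {0, ..., K}."""
    return [int(y > k) for k in range(K)]


def counting_ranker(classifiers: Sequence[Callable[[float], bool]], x: float) -> int:
    """r_g(x) = sum over k of g_k(x): count how many classifiers answer 'yes'."""
    return sum(int(g(x)) for g in classifiers)


K = 4
# Hypothetical stand-ins for trained classifiers: g_k(x) asks whether x > k + 0.5.
classifiers = [lambda x, t=k + 0.5: x > t for k in range(K)]

targets = binary_targets(3, K)                     # [1, 1, 1, 0]
predicted_rank = counting_ranker(classifiers, 2.7)  # three thresholds lie below 2.7
```

Summing the targets of a rank-3 document gives back 3, which is why perfect binary classifiers yield a perfect counting ranker.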
Ordinal Classification via Regression
desired pointwise ranking problem
$E(y \mid x)$ = What is the expected rank of the document $x$?
exploited by both Direct Regression and McRank
reduced problems
$\tilde{g}_k(x) = P(y > k \mid x)$ = What is the probability that the rank of document $x$ is greater than $k$?
train regressors with $\{(x_{q,i}, [y_{q,i} > k])\}$
predict with a simple counting estimator $E(y \mid x) = \sum_{k=0}^{K-1} \tilde{g}_k(x)$
absolutely good regressor ⟹ absolutely good expected rank estimator
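The counting estimator rests on a short identity that the slide leaves implicit. The derivation below is a sketch using only the definitions above, with $y \in \{0, 1, \ldots, K\}$ and $[\cdot]$ denoting the 0/1 indicator.

```latex
% Any rank y in {0, ..., K} equals the number of thresholds k it exceeds:
y = \sum_{k=0}^{K-1} [\, y > k \,]
% Taking the conditional expectation given x on both sides:
E(y \mid x)
  = \sum_{k=0}^{K-1} E\big( [\, y > k \,] \mid x \big)
  = \sum_{k=0}^{K-1} P(y > k \mid x)
  = \sum_{k=0}^{K-1} \tilde{g}_k(x)
```

So if each regressor estimates $P(y > k \mid x)$ well, their sum estimates the expected rank well.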
Cost-sensitive Ordinal Classification via Regression
desired pointwise ranking problem
$E_c(y \mid x)$ = What is the biased expected rank of the document $x$ if a mis-ranking is penalized with a cost $c[r(x)]$?
for embedding the emphasis on top rank
reduced problems
$\tilde{g}_{k,w}(x)$ = What is the biased probability that the rank of document $x$ is greater than $k$ when a wrong answer is penalized with a weight $w_k$?
train regressors with $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$
predict with a simple counting estimator $E_c(y \mid x) = \sum_{k=0}^{K-1} \tilde{g}_{k,w}(x)$
some good theoretical guarantees follow similarly
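The slide does not spell out how the weights $w_{q,i,k}$ are obtained from the cost. The sketch below assumes the weighting used in the cited reduction framework for ordinal classification, $w_k = |c[k+1] - c[k]|$; treat that formula, and the helper names, as assumptions rather than as the authors' stated construction.

```python
from typing import List, Sequence


def reduction_weights(cost: Sequence[float]) -> List[float]:
    """Assumed weighting: w_k = |c[k+1] - c[k]| for thresholds k = 0, ..., K-1.

    `cost[k]` is the penalty for predicting rank k for one particular document.
    """
    return [abs(cost[k + 1] - cost[k]) for k in range(len(cost) - 1)]


def weighted_binary_loss(y: int, r: int, weights: Sequence[float]) -> float:
    """Total weight of the thresholds on which predicting r disagrees with y."""
    return sum(w for k, w in enumerate(weights) if (y > k) != (r > k))


# Example: a document of true rank y = 2 with a squared cost, K = 4.
y = 2
cost = [(y - k) ** 2 for k in range(5)]  # [4, 1, 0, 1, 4]
weights = reduction_weights(cost)        # [3, 1, 1, 3]
```

For a V-shaped cost with $c[y] = 0$, the weighted binary mistakes made by predicting rank r add up to exactly $c[r]$, so heavier costs translate directly into heavier binary examples.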
Optimistic ERR (oERR) Cost for COCR
desired listwise criterion
How to make $\mathrm{ERR}(r)$ close to $\mathrm{ERR}(p)$, the ERR of the perfect ranker?
embed criterion within cost
$$\mathrm{ERR}(p) - \mathrm{ERR}(r) \le \cdot \sum_{i=1}^{N(q)} \left(2^{y_{q,i}} - 2^{r(x_{q,i})}\right)^2 + \Delta$$
$\Delta \approx 0$ if $r \approx p$ (optimistic)
then, $c[k] = \left(2^{y} - 2^{k}\right)^2$ embeds ERR
not a very tight bound, but better than nothing
  heuristically used in some earlier works
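A quick numerical look at the oERR cost $c[k] = (2^y - 2^k)^2$ shows how it emphasizes the top ranks. The function name `oerr_cost` is a hypothetical label for this sketch.

```python
from typing import List


def oerr_cost(y: int, K: int) -> List[int]:
    """Cost of predicting each rank k = 0, ..., K for a document of true rank y."""
    return [(2 ** y - 2 ** k) ** 2 for k in range(K + 1)]


K = 4
cost_mid = oerr_cost(2, K)  # [9, 4, 0, 16, 144]
cost_top = oerr_cost(4, K)  # [225, 196, 144, 64, 0]
cost_low = oerr_cost(0, K)  # [0, 1, 9, 49, 225]
```

Being off by one rank costs 64 for a rank-4 document but only 1 for a rank-0 document, so errors involving highly relevant documents dominate the training signal.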
The Proposed Algorithm
Given
for query indices $q = 1, 2, \cdots, Q$,
  a set of related documents $\{x_{q,i}\}_{i=1}^{N(q)}$
  ordinal relevance $y_{q,i} \in \mathcal{Y} = \{0, 1, \ldots, K\}$ for each document $x_{q,i}$
with large $Q$ and $N(q)$
1 construct $\{(x_{q,i}, y_{q,i}, c[k])\}$ with oERR cost $c$
2 obtain $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$ by reduction to binary classification
3 train regressors $\tilde{g}_k(x)$ with $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$
4 predict (order) future document $x$ with $\sum_{k=0}^{K-1} \tilde{g}_k(x)$
systematic, simple, efficient, and takes all three properties into account
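The four steps can be put together in a compact sketch. It makes two assumptions that the slides do not fix: the weights are taken as $w_k = |c[k+1] - c[k]|$, following the cited reduction framework, and each regressor is a weighted linear least-squares fit (one possible base learner). The names `train_cocr` and `cocr_scores` are hypothetical.

```python
import numpy as np


def train_cocr(X, y, K):
    """Fit K weighted linear regressors on targets [y > k]; returns a (K, d+1) array."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column

    # Step 1: oERR cost c[k] = (2^y - 2^k)^2 for every document and every rank k.
    ks = np.arange(K + 1)
    cost = (2.0 ** y[:, None] - 2.0 ** ks[None, :]) ** 2

    # Step 2 (assumed weighting): w_k = |c[k+1] - c[k]| for k = 0, ..., K-1.
    weights = np.abs(cost[:, 1:] - cost[:, :-1])

    # Step 3: one weighted least-squares regressor per threshold k.
    models = []
    for k in range(K):
        target = (y > k).astype(float)
        sw = np.sqrt(weights[:, k])  # scaling rows by sqrt(w) gives weighted LS
        coef, *_ = np.linalg.lstsq(Xb * sw[:, None], target * sw, rcond=None)
        models.append(coef)
    return np.array(models)


def cocr_scores(models, X):
    """Step 4: score each document with the sum of the K regressor outputs."""
    X = np.asarray(X, dtype=float)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (Xb @ models.T).sum(axis=1)


# Toy data: a single feature that grows with relevance.
X_toy = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_toy = [0, 1, 2, 3, 4]
models = train_cocr(X_toy, y_toy, K=4)
scores = cocr_scores(models, X_toy)
order = np.argsort(-scores)  # documents sorted from highest to lowest score
```

Because the final score is a real-valued sum rather than a discrete rank, documents are rarely tied, and training costs K pointwise regressions regardless of list length.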
Empirical Comparison Using Linear Regression
data set  Direct Regression  McRank-like  oERR-COCR
LTRC1     0.4470             0.4484       0.4505
LTRC2     0.4440             0.4465       0.4461
MS10K     0.2643             0.2642       0.2792
MS30K     0.2748             0.2748       0.2942
best ERR
significantly better than direct regression
oERR-COCR usually the best, and ordinal information is important
Empirical Comparison Using M5’ Decision Tree
data set  Direct Regression  McRank-like  oERR-COCR
LTRC1     0.4499             0.4526       0.4530
LTRC2     0.4489             0.4499       0.4538
MS10K     0.3014             0.3129       0.3156
MS30K     0.3298             0.3438       0.3451
best ERR
significantly better than direct regression
oERR-COCR the best
Conclusion
Cost-sensitive Ordinal Classification via Regression
  emphasize top rank
  respect ordinal data
  regress pointwise for large-scale data
theoretical guarantee:
  reduction from listwise to cost-sensitive ordinal, approximately
  reduction from cost-sensitive ordinal to binary
  reduction from binary to regression
obtained good experimental results
Thank you. Questions?