Improving Ranking Performance with Cost-sensitive Ordinal Classification via Regression

(1)

Yu-Xun Ruan¹, Hsuan-Tien Lin¹, Ming-Feng Tsai²

National Taiwan University¹, National Chengchi University²

Preference Learning @ EURO, July 10, 2012

(2)

Preference Ranking in Search Engine

not just for searching "good machine learning book", but also for recommendation systems & other web services

(3)

Three Properties of Search-Engine Ranking

listwise with focus on top ranks
  query-oriented & personalized
  emphasis on highly-preferred (relevant) items

large scale
  both during training & testing
  e.g. Yahoo! Learning-To-Rank Challenge 2010: 473K training URLs, 166K test URLs

ordinal data
  labeled qualitatively by humans, e.g. {highly irrelevant, irrelevant, neutral, relevant, highly relevant}
  lack of quantitative info

search-engine ranking problem:

learning a ranker from large-scale ordinal data with focus on top ranks

(4)

Search-Engine Ranking Setup

Given

for query indices $q = 1, 2, \cdots, Q$,

a set of related documents $\{x_{q,i}\}_{i=1}^{N(q)}$

ordinal relevance $y_{q,i} \in \mathcal{Y} = \{0, 1, \ldots, K\}$ for each document $x_{q,i}$

with large $Q$ and $N(q)$

Goal

a ranker $r(x)$ that “accurately ranks” the top $x_{Q+1,i}$ from an unseen set of documents $\{x_{Q+1,i}\}$

how to evaluate accurate ranking around the top?

(5)

Expected Reciprocal Rank

(ERR; Chapelle et al., CIKM ’09)

Assumption: Choice Probability of Single Document

for any example (document $x$, rank $y$), $P(\text{user chooses document } x) = (2^y - 1)/2^K$

Assumption: Stopping Probability of List of Documents

P(user stops at position i of list) = P(hasn't stopped by pos. i − 1) × P(chooses document at pos. i)

ERR: Total Discounted Stopping Probability of List of Documents

$$\mathrm{ERR}_q(r) \equiv \sum_{i=1}^{N(q)} \frac{1}{i}\, P(\text{user stops at position } i \text{ of the list ordered by } r)$$

large ERR ⇔ small i matches large P ⇔ good ranking around top
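The two assumptions and the discounted sum can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used by the authors; the function name `expected_reciprocal_rank` and the toy relevance lists are assumptions made for the example.

```python
def expected_reciprocal_rank(relevances, max_grade):
    """ERR of one ranked list.

    relevances[i] is the ordinal grade y of the document placed at position i + 1;
    max_grade is K, the largest possible grade.
    """
    err = 0.0
    p_not_stopped = 1.0  # probability the user reaches this position without stopping
    for position, y in enumerate(relevances, start=1):
        p_choose = (2 ** y - 1) / 2 ** max_grade   # choice probability of one document
        err += p_not_stopped * p_choose / position  # discounted stopping probability
        p_not_stopped *= 1.0 - p_choose
    return err


# The same three documents, ordered best-first and worst-first.
good = expected_reciprocal_rank([4, 2, 0], max_grade=4)
bad = expected_reciprocal_rank([0, 2, 4], max_grade=4)
```

Placing the highly relevant document first gives a much larger ERR than placing it last, which is the "good ranking around top" behaviour the criterion is meant to reward.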

(6)

Possible Approach 1: LambdaRank

(Burges et al., NIPS ’06)

maximize ERR directly with non-smooth optimization on N(q)! list reorderings

Pros

respects top rank goal

respects ordinal nature of data

Cons

difficult optimization problem

challenging to apply on large-scale data

LambdaRank: a state-of-the-art approach, but possibly inefficient

(7)

Possible Approach 2: SVM-Rank

(Joachims, KDD ’02)

conduct listwise ranking by predicting pairwise preferences accurately

Pros

respects ordinal nature of data (w/ comparison)

somewhat applicable to large-scale data

Cons

all pairs equal, not respecting top rank goal

only somewhat applicable to large-scale data, because of $O(N^2)$ pairs

SVM-Rank: a baseline pairwise ranking approach, but possibly not the best for listwise

(8)

Possible Approach 3:

Direct Regression

(Cossock and Tong, COLT ’06)

conduct listwise ranking by predicting real-valued scores accurately

Pros

respects top rank goal by embedding it in regression loss

applicable to large-scale data

Cons

treats $y$ as a numerical score, not respecting ordinal nature of data

Direct Regression: a simple pointwise ranking approach, but may be improved by taking ordinal property into account

(9)

Possible Approach 4:

Ordinal Classification

(McRank; Li et al., NIPS ’07)

conduct listwise ranking by predicting ordinal-valued ranks accurately

Pros

somewhat respects top rank goal

respects ordinal nature of data

applicable to large-scale data

Cons

only somewhat respects top rank goal, because of a loose bound in embedding the goal

McRank: a state-of-the-art pointwise ranking approach, but may be improved further towards top rank goal

(10)

Our Contributions

an algorithmic development on cost-sensitive ordinal classification via regression (COCR), which ...

systematically respects all three properties of search-engine ranking

algorithm          top rank  large scale  ordinal data
LambdaRank         ✓         ◦            ✓
SVM-Rank           ×         ◦            ✓
Direct Regression  ✓         ✓            ×
McRank             ◦         ✓            ✓
COCR               ✓         ✓            ✓

leads to promising experimental results

(11)

Overview of Cost-sensitive Ordinal Classification via Regression (COCR)

reduction from listwise ranking (ERR) to cost-sensitive ordinal classification (approximately)
  aim for top rank and large-scale data (like Direct Regression)

reduction from cost-sensitive ordinal classification to binary classification
  aim for respecting ordinal data (like McRank)

reduction from binary classification to regression
  aim for large-scale data and avoiding discrete ties (like Direct Regression)

COCR: combine the benefits of Direct Regression and McRank

(12)

Ordinal Classification via Binary Classification

(Lin & Li, Neural Computation ’12)

desired pointwise ranking problem

$r(x)$ = What is the rank of the document $x$?

reduced problems

$g_k(x)$ = Is the rank of document $x$ greater than $k$?

train binary classifiers with $\{(x_{q,i}, [y_{q,i} > k])\}$

predict with a simple counting ranker $r_g(x) = \sum_{k=0}^{K-1} g_k(x)$

simple and efficient

good theoretical guarantee:

absolutely good binary classifier ⟹ absolutely good ranker

relatively good binary classifier ⟹ relatively good ranker
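As a rough illustration of the reduction, the sketch below trains one binary classifier per threshold k on labels [y > k] and predicts by counting the positive answers. The threshold "stump" learner on scalar features is a hypothetical stand-in chosen only to keep the example dependency-free; any binary classifier could take its place, and all function names are assumptions.

```python
def binary_labels(ys, k):
    """Labels [y > k] for the k-th reduced binary problem."""
    return [1 if y > k else 0 for y in ys]


def train_stump(xs, labels):
    """Hypothetical 1-D learner: pick the threshold with the fewest training errors."""
    candidates = sorted(set(xs)) + [float("inf")]
    best = min(
        candidates,
        key=lambda t: sum((x >= t) != bool(label) for x, label in zip(xs, labels)),
    )
    return lambda x: 1 if x >= best else 0


def train_counting_ranker(xs, ys, K):
    """Train K binary classifiers g_0..g_{K-1}; rank by counting how many say 'yes'."""
    classifiers = [train_stump(xs, binary_labels(ys, k)) for k in range(K)]
    return lambda x: sum(g(x) for g in classifiers)


xs = [0.1, 0.3, 0.5, 0.7, 0.9]  # toy scalar features
ys = [0, 1, 2, 3, 4]            # ordinal ranks, K = 4
rank = train_counting_ranker(xs, ys, K=4)
```

On this toy data the counting ranker reproduces every training rank, and an unseen feature such as 0.6 is assigned the rank of the nearest threshold below it.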

(13)

Ordinal Classification via Regression

$E(y \mid x)$ = What is the expected rank of the document $x$?

exploited by both Direct Regression and McRank

reduced problems

$\tilde{g}_k(x) = P(y > k \mid x)$ = What is the probability that the rank of document $x$ is greater than $k$?

train regressors with $\{(x_{q,i}, [y_{q,i} > k])\}$

predict with a simple counting estimator $E(y \mid x) = \sum_{k=0}^{K-1} \tilde{g}_k(x)$

absolutely good regressor =⇒ absolutely good expected rank estimator
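The counting estimator works because the expected rank of a variable taking values in {0, 1, ..., K} equals the sum of its tail probabilities. A short derivation, using only the symbols already defined on this slide:

```latex
\begin{aligned}
E(y \mid x)
  &= \sum_{j=0}^{K} j \, P(y = j \mid x)
   = \sum_{j=0}^{K} \sum_{k=0}^{j-1} P(y = j \mid x) \\
  &= \sum_{k=0}^{K-1} \sum_{j=k+1}^{K} P(y = j \mid x)
   = \sum_{k=0}^{K-1} P(y > k \mid x)
   = \sum_{k=0}^{K-1} \tilde{g}_k(x).
\end{aligned}
```

So if each regressor estimates its threshold probability well, their sum estimates the expected rank well.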

(14)

Cost-sensitive Ordinal Classification via Regression

$E_c(y \mid x)$ = What is the biased expected rank of the document $x$ if a mis-ranking is penalized with a cost $c[r(x)]$?

for embedding the emphasis on top rank

reduced problems

$\tilde{g}_{k,w}(x)$ = What is the biased probability that the rank of document $x$ is greater than $k$ when a wrong answer is penalized with a weight $w_k$?

train regressors with $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$

predict with a simple counting estimator $E_c(y \mid x) = \sum_{k=0}^{K-1} \tilde{g}_{k,w}(x)$

some good theoretical guarantees follow similarly
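The slide does not spell out how the weights are obtained from the cost. The sketch below assumes the weighting from the Lin & Li reduction cited earlier, where the weight of the k-th binary example is the absolute difference between adjacent costs, |c[k] − c[k+1]|; treat that formula and the function name as assumptions.

```python
def weighted_binary_examples(x, y, cost):
    """Expand one cost-sensitive example (x, y, cost) into K weighted binary examples.

    cost[k] is the penalty for predicting rank k, for k = 0..K.
    Assumed weighting: w_k = |cost[k] - cost[k + 1]|.
    """
    K = len(cost) - 1
    return [(x, 1 if y > k else 0, abs(cost[k] - cost[k + 1])) for k in range(K)]


# Absolute cost |y - k| for a document with true rank y = 2 and K = 4.
examples = weighted_binary_examples("doc", 2, [abs(2 - k) for k in range(5)])
```

With the absolute cost every weight equals 1, so the weighted problem collapses to the unweighted reduction of the previous slide; a cost that grows faster near the top ranks puts more weight on the corresponding thresholds.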

(15)

Optimistic ERR (oERR) Cost for COCR

desired listwise criteria

How to make ERR(r) close to ERR(p), the ERR of the perfect ranker?

embed criteria within cost

$$\mathrm{ERR}(p) - \mathrm{ERR}(r) \le \cdot \left( \sum_{i=1}^{N(q)} \left( 2^{y_{q,i}} - 2^{r(x_{q,i})} \right)^2 + \Delta \right)$$

$\Delta \approx 0$ if $r \approx p$ (optimistic)

then, $c[k] = (2^y - 2^k)^2$ embeds ERR

not a very tight bound, but better than nothing

—heuristically used in some earlier works
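For concreteness, the oERR cost vector of a single document can be tabulated directly from the formula c[k] = (2^y − 2^k)^2. The function name `oerr_cost` is an assumption made for this sketch.

```python
def oerr_cost(y, K):
    """Cost vector c[k] = (2^y - 2^k)^2 for k = 0..K, for a document with true rank y."""
    return [(2 ** y - 2 ** k) ** 2 for k in range(K + 1)]


costs_mid = oerr_cost(2, K=4)  # true rank 2
costs_top = oerr_cost(4, K=4)  # true rank 4 (highly relevant)
```

The cost is zero at the true rank and grows quickly as the prediction moves away on the exponential scale, so mistakes involving the highest ranks are penalized far more than mistakes among the low ranks.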

(16)

The Proposed Algorithm

Given

for query indices $q = 1, 2, \cdots, Q$,

a set of related documents $\{x_{q,i}\}_{i=1}^{N(q)}$

ordinal relevance $y_{q,i} \in \mathcal{Y} = \{0, 1, \ldots, K\}$ for each document $x_{q,i}$

with large $Q$ and $N(q)$

1. construct $\{(x_{q,i}, y_{q,i}, c[k])\}$ with oERR cost $c$

2. obtain $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$ by reduction to binary classification

3. train regressors $\tilde{g}_k(x)$ with $\{(x_{q,i}, [y_{q,i} > k], w_{q,i,k})\}$

4. predict (order) future document $x$ with $\sum_{k=0}^{K-1} \tilde{g}_k(x)$

systematic, simple, efficient, and takes all three properties into account
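The four steps can be put together in a small end-to-end sketch. Several pieces here are assumptions rather than the authors' implementation: the weights are taken as |c[k] − c[k+1]| (following the Lin & Li reduction), the regressor is a weighted least-squares line on a single scalar feature, and all function names are hypothetical.

```python
def oerr_cost(y, K):
    """Step 1: oERR cost vector c[k] = (2^y - 2^k)^2 for k = 0..K."""
    return [(2 ** y - 2 ** k) ** 2 for k in range(K + 1)]


def fit_weighted_line(xs, zs, ws):
    """Stand-in regressor: weighted least-squares fit of z ~ a * x + b on scalar x."""
    total = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / total
    z_mean = sum(w * z for w, z in zip(ws, zs)) / total
    var = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    cov = sum(w * (x - x_mean) * (z - z_mean) for w, x, z in zip(ws, xs, zs))
    a = cov / var if var > 0 else 0.0
    b = z_mean - a * x_mean
    return lambda x: a * x + b


def train_cocr(xs, ys, K):
    """Steps 1-3: build weighted binary targets per threshold and fit K regressors."""
    regressors = []
    for k in range(K):
        zs = [1.0 if y > k else 0.0 for y in ys]  # binary targets [y > k]
        # Assumed weighting: absolute difference of adjacent oERR costs.
        ws = [abs(oerr_cost(y, K)[k] - oerr_cost(y, K)[k + 1]) for y in ys]
        regressors.append(fit_weighted_line(xs, zs, ws))
    # Step 4: score a document by the sum of the K regressor outputs.
    return lambda x: sum(g(x) for g in regressors)


xs = [0.1, 0.2, 0.4, 0.6, 0.8, 0.9]  # toy scalar features
ys = [0, 0, 1, 2, 2, 2]              # ordinal relevance, K = 2
score = train_cocr(xs, ys, K=2)
ranking = sorted(xs, key=score, reverse=True)  # order documents by predicted score
```

Because the final score is a real-valued sum rather than an integer rank, documents rarely tie, which is the "avoiding discrete ties" benefit mentioned in the overview.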

(17)

Empirical Comparison Using Linear Regression

data set  Direct Regression  McRank-like  oERR-COCR
LTRC1     0.4470             0.4484       0.4505
LTRC2     0.4440             0.4465       0.4461
MS10K     0.2643             0.2642       0.2792
MS30K     0.2748             0.2748       0.2942

best ERR

significantly better than direct regression

oERR-COCR usually the best, and ordinal information is important

(18)

Empirical Comparison Using M5’ Decision Tree

data set  Direct Regression  McRank-like  oERR-COCR
LTRC1     0.4499             0.4526       0.4530
LTRC2     0.4489             0.4499       0.4538
MS10K     0.3014             0.3129       0.3156
MS30K     0.3298             0.3438       0.3451

best ERR

significantly better than direct regression

oERR-COCR the best

(19)

Conclusion

Cost-sensitive Ordinal Classification via Regression

emphasizes top rank

respects ordinal data

regresses pointwise for large-scale data

theoretical guarantee:
  reduction from listwise to cost-sensitive ordinal, approximately
  reduction from cost-sensitive ordinal to binary
  reduction from binary to regression

obtained good experimental results

Thank you. Questions?