行政院國家科學委員會專題研究計畫 期末報告
貝氏方法在網路評分資料之應用(第 2 年)
計 畫 類 別 : 個別型
計 畫 編 號 : NSC 100-2118-M-004-001-MY2
執 行 期 間 : 101 年 08 月 01 日至 102 年 08 月 31 日
執 行 單 位 : 國立政治大學統計學系
計 畫 主 持 人 : 翁久幸
計畫參與人員: 碩士班研究生-兼任助理人員:陳冠全
碩士班研究生-兼任助理人員:張良卉
博士班研究生-兼任助理人員:蘇建郎
報 告 附 件 : 出席國際會議研究心得報告及發表論文
公 開 資 訊 : 本計畫可公開查詢
中 華 民 國 102 年 09 月 18 日
中 文 摘 要 : 網路提供消費者大量資訊. 網路使用者對於各種產品,如電
影,音樂,餐廳, 商品等給予的評分,常常構成相當大的網路資
料. 關於這類評分資料,目前網路上常見的呈現方式是以各產
品所獲得的平均評分表示(例如以五個星號代表五分). 然而,
因為沒有考慮到評分者與評分者之間的差異,這樣簡單的平均
分數有所缺失.
Ho and Quinn (2008) 提出一個貝氏模型 並以 MCMC 方法估
計其中參數,該模型 納入評分者與評分者之間的差異,比平均
分數能夠更合理的解釋評分資料. 可是, 如該二位作者所指
出, 此方法之最大問題在於當資料量很大,甚至於當新資料進
來而需要重新估計模型參數, 以 MCMC 方法來計算於實際應
用上是不可行的.
本研究計畫提出一個有效可行的方法來解決這個問題. 應用
在兩組實際數據,得到不錯結果.
中文關鍵詞: 貝氏方法,評分資料
英 文 摘 要 : The internet has offered consumers with a vast amount
of information. One growing area of such information
is ratings by internet users on various kinds of
products such as movies, music, restaurants,
commodities, etc. The current displays of each
product`s preference are typically based
on 'average rating,' but the average rating method
ignores systematic differences across raters.
Ho and Quinn (2008) proposed a Bayesian model and
Markov chain Monte Carlo (MCMC) methods to take into
account systematic differences across raters, and at
the same time incorporate statistical uncertainty in
the ratings. However, to work efficiently on an
industrial scale and to adjust the parameters in
real-time as new rating arrive, the MCMC methods may
not be computationally feasible.
The current project provided a feasible solution to
this problem by using
efficient approximation algorithm. Experiments on two
real datasets are promising.
A Statistical Analysis of Internet Ratings Data final report
1
Introduction
As the Internet population continue to grow, the ratings data generated by Internet users increase rapidly. The ratings data are typically ordinal measurements on the quality of all kinds of items such as movies, books, consumer products, etc. These data have provided much information for consumers to make product choices. For a given product, most of the current graphical displays provide the number of votes it receives and a number of stars to represent the mean rating. Some displays may also present the frequency plot of ratings.
However, typical ratings data exhibits systematic tendencies for some raters to give higher scores than others, and some may not discriminate very well between products. The current displays all ignore the systematic differences across raters.
Ho and Quinn [2] proposed to use a Bayesian item response theory (IRT) model for ratings data. Then, they fit the model using Markov Chain Monte Carlo (MCMC), and proposed graphical displays based on estimates of model parameters. This model can ac-count for the rater bias, and the proposed graphical displays are easily interpretable and incorporate statistical uncertainty in the ratings. While the methodology is reasonable, it has not been used by any website. The main difficulty is that, as the Internet data grow rapidly and the new ratings continuously arrive, the MCMC methods may not be computationally feasible to adjust model parameters.
Some researchers have employed Bayesian IRT models in psychological and educational studies; for example, Patz and Junker [3], Fox and Glas [1], and Wang et al. [4]. They all rely on the MCMC approach.
The MCMC approach is a nondeterministic method for approximate Bayesian infer-ence. This approach often gives more accurate inference, but requires considerably more computation. Alternatively, there are deterministic approaches such as Laplace method, variational Bayes, expectation propagation, among others. They are part of mainstream machine learning methodologies. For many applications, deterministic methods produce
solutions of comparable accuracy to MCMC at greater speed. In fact, Ho and Quinn [2, Section 5] pointed out that their model-fitting approach (MCMC) needs to be modified so as to work efficiently on an industrial scale or real-time data.
The present paper extends the online Bayesian method in Weng and Lin [5] to the IRT model for ratings data. First, we obtain an efficient online algorithm to adjust the parameters in real-time as new ratings arrive. Secondly, the proposed method provides a reasonable alternative to MCMC approach.
2
Preliminaries
For the sake of being self-contained, we review the model-based approach in Ho and Quinn [2]. Suppose that there are R raters and P products. Let yrp be the rating of product p
by rater r, and Y = [yrp] the R × P rater-by-product matrix. Assume that yrp is ordinal
and takes values in {1, 2, ..., C}, where larger numbers indicate higher preference. In many cases, yrp is not observed. It is sensible to introduce a missingness indicator zrp and assume
that the data are generated according to
yrpobs= c ⇔ y∗ rp ∈ (γc−1, γc] and zrp = 0 missing ⇔ zrp = 1 (1)
where y∗rp is a latent variable and γ0 < γ1< · · · < γC are cutpoints. Assume that γ0= −∞,
γ1= 0, and γC = ∞.
The latent variable y∗rp is parametrized as
yrp∗ = αr+ βrθp+ rp, rp iid
∼ N (0, 1), r ∈ R, p ∈ P. (2)
The parameter α captures the center of rater r’s internal scale. For a more “critical” rater r, the αr tends to be smaller. The parameter βr captures how well rater r discriminates
between low and high quality products. Ho and Quinn [2] constrained βr ∈ <+ ∀r to
identify the sign of θr; otherwise, two different sets of parameter values can give the same
model. A value of βr near 0 means that rater r is unable to distinguish between low and
high quality; a larger βr means that rater r is discriminating. The parameter θp captures
the latent quality of product p. With the constraint that βr ∈ <+ ∀r, the value of y∗rp is
P (Yobs|α, β, θ, γ) = Y p,r:zrp=0 {Φ(γyobs rp − αr− βrθp) − Φ(γyrpobs−1− αr− βrθp)}, (3) where γyobs rp = γc⇔ y obs
rp = c. The prior distribution for the parameters are assumed as
fol-lows: αr iid
∼ N (1, 1), βr iid
∼ N (−5, 20) truncated to the positive half, γ iid∼ improper uniform, θp
iid
∼ N (0, 1). Given M Markov Chain Monte Carlo samples {α(m)r , βr(m), θp(m), γ(m)}Mm=1
from the posterior distribution p(αr, βr, θp, γ|Yobs), the posterior predictive density for yrp
can be approximated with:
P(yrprep= c|Yobs)
≈ 1 M M X m=1 {Φ(γc(m)− α(m)r − βr(m)θ(m)p ) − Φ(γc−1(m)− α(m)r − βr(m)θ(m)p )} (4)
for c = 1, ..., C. Their proposed graphical displays depend on the posterior predictive probabilities for product p over all raters:
τpc =
1 |R|
X
r∈R
P (yrprep= c|Yobs) (5)
3
Main results
We propose to estimate τpc by
P (ypr= c|Yobs) ≈ Φ(γc−1− µ∗αr− µ∗βrµ∗θp) − Φ(γc−1− µ∗αr− µ∗βrµ∗θp),
where µ∗αr, µ∗βr, and µ∗θp are current estimates of posterior means.
For the prior distributions, it is assumed that each αrfollows N (µαr, σαr2 ), each βrfollows
N (µβr, σ2βr), and each θp follows N (µθp, σ2θp) with all parameters mutually independent. As
γ values are not of primal interest, we suggest to set these values by observed proportions of {yobs = c} rather than treating them as unknown parameters. Priors such as improper uniform and normal may be used; however, they result in more complicated algorithms.
Now define α∗r= αr− µαr σαr , βr∗ = βr− µβr σβr , θ∗p = θp− µθp σθp . (6)
Denote α = (α1, ..., αR)T, β = (β1, ..., βR)T, θ = (θ1, ..., θP)T, and similarly for α∗, β∗, θ∗.
Then, the posterior distribution of (α∗, β∗, θ∗) given data Yobs is
p(α∗, β∗, θ∗|Yobs)
∝ φ(α∗, β∗, θ∗) Y
p,r:zpr=0 {Φ(γyo
pr− αr− βrθp) − Φ(γyopr−1− αr− βrθp)}.
In particular, if there is only one new observation ypro = c, then the corresponding likelihood is
L(αr, βr, θp; yopr) = Φ(γc− αr− βrθp) − Φ(γc−1− αr− βrθp). (7)
Since (7) only involves (αr, βr, θp), we need only to adjust posterior distribution of the three
parameters α∗r, βr∗, θp∗. The posterior density of (αr∗, β∗r, θ∗p) given yopr is
p(α∗r, βr∗, θp∗|ypro ) ∝ φ(αr∗, βr∗, θ∗p)L(αr, βr, θp), (8)
which of the form for a version of Stein’s identity. Therefore, we apply it to obtain the pos-terior means and variances of α∗r, βr∗, θ∗p. Then, by (6) we can easily convert these moments to that of αr, βr, θp; for instance,
E(αr|ypro ) = µαr+ σαrE(α∗r|yopr)
V ar(αr|ypro ) = σαr2 V ar(α∗r|ypro ).
(9)
The following functions are needed when taking derivatives of L in (8). Let
Ma(x) = φ(x) − φ(x − a) Φ(x) − Φ(x − a) Na(x) = xφ(x) − (x − a)φ(x − a) Φ(x) − Φ(x − a) + φ(x) − φ(x − a) Φ(x) − Φ(x − a) 2 . (10)
For posterior means, applying a version of Stein’s identity gives
E(α∗r|ypro ) = E ∂L/∂α ∗ r L y o pr = −σαrE Ma(x)|ypro , (11) where x = γc− αr− βrθp; a = γc− γc−1 (12) and hence Ma(x) = φ(γc− αr− βrθp) − φ(γc−1− αr− βrθp) Φ(γc− αr− βrθp) − Φ(γc−1− αr− βrθp) .
Similarly, we have E(βr∗|ypro ) = E ∂L/∂β ∗ r L y o pr = −σβrE θpMa(x) y o pr (13) E(θp∗|yo pr) = E ∂L/∂θ∗ p L y o pr = −σθpE βrMa(x) y o pr . (14)
The expressions of the posterior variance can be obtained similarly. We have
V ar(α∗r|yopr) =1 − (σαr)2E h Na(x) y o pr i , V ar(βr∗|yopr) =1 − (σβrθp)2E h Na(x) y o pr i , V ar(θ∗r|yopr) =1 − (σθpβr)2E h Na(x) y o pr i .
The approximation of expectations are as in Weng and Lin [5], where the error incurred by this approximation is reduced by a scaling factor. Take E(α∗r|yo
pr) in (11) for illustration: E(α∗r|yo pr) = − σαrE h Ma(x) y o pr i ≈ − σαr ν Maν(xν), (15)
where a and x are as in (12), ν =q1 + σ2
αr+ σβr2 µ2θp+ σθp2 µ2βr, and aν = γc− γc−1 ν and xν = γc− µαr− µβrµθp ν . (16)
Together with (9), we have
E(αr|yopr) ≈ µαr− σαr2 ν " φ(γc−µαr−µβrµθp ν ) − φ( γc−1−µαr−µβrµθp ν ) Φ(γc−µαr−µβrµθp ν ) − Φ( γc−1−µαr−µβrµθp ν ) # . (17)
Similar approximations give
V ar(αr|yopr) ≈ σαr2 1 −σαr ν 2 Naν(xν) . (18)
The proposed algorithm is described below:
Step 1. Given current estimates µ(t)α , µ(t)β , µ(t)θ , σα(t), σ(t)β , σθ(t), where µ(t)α = (µ(t)α1, · · · , µ(t)αR)T,
Step 2. Given the (t + 1)st observation ypr = c. Calculate ν(t)= q 1 + (σ(t)αr)2+ (σ(t)βr)2(µ(t)θp)2+ (σ(t)θp)2(µ(t)βr)2, (19) ω(t) = − 1 ν(t)Ma(t)ν (x (t) ν ), (20) δ(t) = 1 ν(t)2Na(t)ν (x (t) ν ), (21) where a(t)ν = γc− γc−1 ν(t) and x (t) ν = γc−1− µ(t)αr− µ(t)βrµ(t)θp ν(t) .
Step 3. Update parameters as below:
µ(t+1)αr = µ(t)αr+σαr(t)2ω(t) µ(t+1)βr = µ(t)βr + σβr(t) 2 µ(t)θpω(t) µθp(t+1)= µ(t)θp +σθp(t)2µ(t)βrω(t) (22) (σαr(t+1))2 = (σαr(t))2max1 − (σ(t)αr)2δ(t), κ (23) (σβr(t+1))2 = (σβr(t))2max 1 − (σ(t)βrµ(t)θp)2δ(t), κ (24) (σθp(t+1))2 = (σθp(t))2max 1 − (σ(t)θrµ(t)βr)2δ(t), κ . (25)
4
Experiments
We consider the ratings of news outlets from Mondo Times (http://www.mondotimes. com/) used in Ho and Quinn [2]. Mondo Times is an online company that disseminates information about media outlets such as newspapers, magazines, radio stations, and tele-vision stations in 211 countries. Raters submit five-point ratings of the content quality of news outlets from awful, poor, average, very good, to great. The dataset used in Ho and Quinn [2], which features 1,515 products (news outlets) and 946 raters, is available from their Ratings package (available at http://cran.r-project.org/). The average number of ratings for a product is 3.0 and the average number rated by a rater is 4.8.
As in Ratings of Ho and Quinn [2], we remove raters who rate less than five products and remove products that are only rated by these raters. This ends up with 3249 ratings from 232 raters on 1344 products.
News outlets MCMC: sub-Mondo online: sub-Mondo online: whole-Mondo US News & World Report -0.215 (0.481) 0.259 (0.241) -0.148 (0.320)
Toronto Sun -1.027 (0.392) -0.684 (0.160) -0.743 (0.224)
Toronto Star 0.397 (0.400) 0.321 (0.148) 0.215 (0.199)
San Diego Union Tribune 0.089 (0.482) 0.124 (0.280) 0.426 (0.224)
People -1.793 (0.593) -1.970 (0.689) -1.909 (0.413)
PBS 1.223 (0.393) 1.695 (0.258) 0.809 (0.182)
Montana Magazine -0.233 (0.358) -0.071 (0.171) 0.060 (0.199)
London Sun -2.140 (0.569) -1.477 (0.267) -1.380 (0.312)
Great Falls Tribune -2.839 (0.770) -1.817 (0.399) -2.353 (0.484) Daily Utah Chronicle 0.054 (0.818) -0.161 (0.256) -0.686 (0.668) Colorado Public Radio 1.431 (0.602) 2.572 (0.700) 1.617 (0.484)
CNN 0.038 (0.192) -0.055 (0.078) 0.312 (0.074)
Table 1: Posterior means of θ for twelve outlets.
As in Ho and Quinn [2], we assume that each αr follows N (1, 1) and each θp follows
N (0, 1). Instead of letting each βr follow N (−5, 20) truncated to positive, we assume that each βrfollows N (1, 20). All parameters are assumed to be mutually independent. Since the
γ values are not the main interest here, we set them by the following steps: first, calculate the observed proportions #{ypr = c}/N , for c = 1, ..., 5, where N is the number of observed
ratings; next, find the z-scores corresponding to these areas; then, obtain approximations of the mean and variance of y∗ and convert the z-scores to y∗’s scale. In our scenario, the new ratings arrive sequentially. So, the empirical proportions are based on just part of the data, or in a pilot study.
The initial parameter values in Algorithm 1 are set to be µ(0)αr = 1, µ(0)βr = 1, µ(0)θp = 1,
σ(0)αr = 1, σβr(0) = 1, σθp(0) = 1, for r = 1, ..., R and p = 1, ..., P ; and the positive lower bound
κ in (23)-(25) is set to be 0.0001.
5
Conclusions
Ho and Quinn [2] have demonstrated the advantages of fitting the IRT models to Internet ratings data. However, for a real-time data pipeline that continuously collects new ratings
on new items, their MCMC approach is not computationally viable. Our proposed online method can adjust the model parameters in a large-scale problem.
References
[1] J.-P. Fox and C. A. W. Glas. Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66:269–286, 2001.
[2] D. E. Ho and K. M. Quinn. Improving the presentation and interpretation of online ratings data with model-based figures. The American Statistician, 62(4):279–288, 2008.
[3] R. J. Patz and B. Junker. A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24:146–178, 1999.
[4] X. Wang, J. O. Berger, and D. S. Burdick. Bayesian analysis of dynamic item response models in educational testing. The Annals of Applied Statistics, 7(1):126–153, 2013.
[5] R. C. Weng and C.-J. Lin. A Bayesian approximation method for online ranking. Journal of Machine Learning Research, 12:267–300, 2011.
Report on attending The 59th World Statistics Congress (WSC) Aug. 25 - 30, 2013, Hong Kong
The International Statistical Institute (ISI), established in 1885, is one of the oldest scientific associations. It dedicates to promote the statistics profession. The ISI is best renowned for its biennial World Statistics Congress (WSC), where the international statistical community congregates to exchange ideas, develop new links and discuss current trends and developments in the statistical world. This year WSC was held in Hong Kong during 25-30 August 2013. The meeting venue was the Hong Kong Convention and Exhibition Centre (HKCEC). It covers a wide range of topics and features the latest scientific developments in the fields of probability and statistics and their applications.
This year I presented the paper “An analysis of online ratings data” in a poster spotlight session, scheduled on Tuesday (August 27th) morning. I got a good chance to present my work and communicate with many people about my research. I also attended several oral presentations, and was impressed by some talks such as talks in a History session on “Jacob Bernoulli’s Ars Conjectandi and the emergence of probability” sponsored by Bernoulli Society to celebrate the 300 anniversary of the publication of this paper, talks about latent variable modeling in the session “Latent variable modeling of complex survey data,” a talk about “Business analytics and big data: what do statisticians need to succeed?” by Robert N. Rodriguez in the session “Special invited panel on career development,” especially designed for the Youth Theme. I also visited academic book booths in the Exhibition Hall and the Career Placement Service to learn more about information about current statistical job market.
During these days, I met some people from both industry and university. Having chats with them inspired me and encouraged me to keep on moving. It was a fruitful trip.