貝氏方法在網路評分資料之應用

(1)

A Statistical Analysis of Internet Ratings Data final report

1 Introduction

As the Internet population continue to grow, the ratings data generated by Internet users increase rapidly. The ratings data are typically ordinal measurements on the quality of all kinds of items such as movies, books, consumer products, etc. These data have provided much information for consumers to make product choices. For a given product, most of the current graphical displays provide the number of votes it receives and a number of stars to represent the mean rating. Some displays may also present the frequency plot of ratings.

However, typical ratings data exhibits systematic tendencies for some raters to give higher scores than others, and some may not discriminate very well between products. The current displays all ignore the systematic differences across raters.

Ho and Quinn [2] proposed to use a Bayesian item response theory (IRT) model for ratings data. Then, they fit the model using Markov Chain Monte Carlo (MCMC), and proposed graphical displays based on estimates of model parameters. This model can ac-count for the rater bias, and the proposed graphical displays are easily interpretable and incorporate statistical uncertainty in the ratings. While the methodology is reasonable, it has not been used by any website. The main difficulty is that, as the Internet data grow rapidly and the new ratings continuously arrive, the MCMC methods may not be computationally feasible to adjust model parameters.

Some researchers have employed Bayesian IRT models in psychological and educational studies; for example, Patz and Junker [3], Fox and Glas [1], and Wang et al. [4]. They all rely on the MCMC approach.

The MCMC approach is a nondeterministic method for approximate Bayesian infer-ence. This approach often gives more accurate inference, but requires considerably more computation. Alternatively, there are deterministic approaches such as Laplace method, variational Bayes, expectation propagation, among others. They are part of mainstream machine learning methodologies. For many applications, deterministic methods produce

(2)

solutions of comparable accuracy to MCMC at greater speed. In fact, Ho and Quinn [2, Section 5] pointed out that their model-fitting approach (MCMC) needs to be modified so as to work efficiently on an industrial scale or real-time data.

The present paper extends the online Bayesian method in Weng and Lin [5] to the IRT model for ratings data. First, we obtain an efficient online algorithm to adjust the parameters in real-time as new ratings arrive. Secondly, the proposed method provides a reasonable alternative to MCMC approach.

2 Preliminaries

For the sake of being self-contained, we review the model-based approach in Ho and Quinn [2]. Suppose that there are R raters and P products. Let yrp be the rating of product p

by rater r, and Y = [yrp] the R × P rater-by-product matrix. Assume that yrp is ordinal

and takes values in {1, 2, ..., C}, where larger numbers indicate higher preference. In many cases, yrp is not observed. It is sensible to introduce a missingness indicator zrp and assume

that the data are generated according to y_rpobs=    c ⇔ y∗ rp ∈ (γc−1, γc] and zrp = 0 missing ⇔ zrp = 1 (1)

where y∗_rp is a latent variable and γ0 < γ1< · · · < γC are cutpoints. Assume that γ0= −∞,

γ1= 0, and γC = ∞.

The latent variable y∗_rp is parametrized as y_rp∗ = αr+ βrθp+ rp, rp

iid

∼ N (0, 1), r ∈ R, p ∈ P. (2) The parameter α captures the center of rater r’s internal scale. For a more “critical” rater r, the αr tends to be smaller. The parameter βr captures how well rater r discriminates

between low and high quality products. Ho and Quinn [2] constrained βr ∈ <+ ∀r to

identify the sign of θr; otherwise, two different sets of parameter values can give the same

model. A value of βr near 0 means that rater r is unable to distinguish between low and

high quality; a larger βr means that rater r is discriminating. The parameter θp captures

the latent quality of product p. With the constraint that βr ∈ <+ ∀r, the value of y∗rp is

(3)

P (Yobs|α, β, θ, γ) = Y p,r:zrp=0 {Φ(γ_yobs rp − αr− βrθp) − Φ(γyrpobs−1− αr− βrθp)}, (3) where γ_yobs rp = γc⇔ y obs

rp = c. The prior distribution for the parameters are assumed as

fol-lows: αr iid

∼ N (1, 1), βr iid

∼ N (−5, 20) truncated to the positive half, γ iid∼ improper uniform, θp

iid

∼ N (0, 1). Given M Markov Chain Monte Carlo samples {α(m)r , βr(m), θp(m), γ(m)}M_m=1

from the posterior distribution p(αr, βr, θp, γ|Yobs), the posterior predictive density for yrp

can be approximated with: P(y_rprep= c|Yobs)

≈ 1 M M X m=1 {Φ(γ_c(m)− α(m)_r − β_r(m)θ(m)_p ) − Φ(γ_c−1(m)− α(m)_r − β_r(m)θ(m)_p )} (4)

for c = 1, ..., C. Their proposed graphical displays depend on the posterior predictive probabilities for product p over all raters:

τpc =

1 |R|

X

r∈R

P (y_rprep= c|Yobs) (5)

3 Main results

We propose to estimate τpc by

P (ypr= c|Yobs) ≈ Φ(γc−1− µ∗αr− µ∗βrµ∗θp) − Φ(γc−1− µ∗αr− µ∗βrµ∗θp),

where µ∗_αr, µ∗_βr, and µ∗_θp are current estimates of posterior means.

For the prior distributions, it is assumed that each αrfollows N (µαr, σαr2 ), each βrfollows

N (µ_βr, σ2_βr), and each θp follows N (µθp, σ2θp) with all parameters mutually independent. As

γ values are not of primal interest, we suggest to set these values by observed proportions of {yobs = c} rather than treating them as unknown parameters. Priors such as improper uniform and normal may be used; however, they result in more complicated algorithms.

Now define α∗_r= αr− µαr σαr , β_r∗ = βr− µβr σβr , θ∗_p = θp− µθp σθp . (6)

(4)

Denote α = (α1, ..., αR)T, β = (β1, ..., βR)T, θ = (θ1, ..., θP)T, and similarly for α∗, β∗, θ∗.

Then, the posterior distribution of (α∗, β∗, θ∗) given data Yobs is p(α∗, β∗, θ∗|Yobs₎

∝ φ(α∗, β∗, θ∗) Y

p,r:zpr=0

{Φ(γ_yo

pr− αr− βrθp) − Φ(γyopr−1− αr− βrθp)}.

In particular, if there is only one new observation y_pro = c, then the corresponding likelihood is

L(αr, βr, θp; yopr) = Φ(γc− αr− βrθp) − Φ(γc−1− αr− βrθp). (7)

Since (7) only involves (αr, βr, θp), we need only to adjust posterior distribution of the three

parameters α∗_r, β_r∗, θ_p∗. The posterior density of (α_r∗, β∗_r, θ∗_p) given yo_pr is

p(α∗_r, β_r∗, θ_p∗|y_pro ) ∝ φ(α_r∗, β_r∗, θ∗_p)L(αr, βr, θp), (8)

which of the form for a version of Stein’s identity. Therefore, we apply it to obtain the pos-terior means and variances of α∗_r, β_r∗, θ∗_p. Then, by (6) we can easily convert these moments to that of αr, βr, θp; for instance,

E(αr|ypro ) = µαr+ σαrE(α∗r|yopr)

V ar(αr|ypro ) = σαr2 V ar(α∗r|ypro ).

(9)

The following functions are needed when taking derivatives of L in (8). Let Ma(x) = φ(x) − φ(x − a) Φ(x) − Φ(x − a) Na(x) = xφ(x) − (x − a)φ(x − a) Φ(x) − Φ(x − a) + φ(x) − φ(x − a) Φ(x) − Φ(x − a) 2 . (10)

For posterior means, applying a version of Stein’s identity gives E(α∗_r|y_pro ) = E ∂L/∂α ∗ r L y o pr = −σαrE Ma(x)|ypro , (11) where x = γc− αr− βrθp; a = γc− γc−1 (12) and hence Ma(x) = φ(γc− αr− βrθp) − φ(γc−1− αr− βrθp) .

(5)

Similarly, we have E(β_r∗|y_pro ) = E ∂L/∂β ∗ r L y o pr = −σβrE θpMa(x) y o pr (13) E(θ_p∗|yo pr) = E _∂L/∂θ∗ p L y o pr = −σθpE βrMa(x) y o pr . (14)

The expressions of the posterior variance can be obtained similarly. We have V ar(α∗_r|yo_pr) =1 − (σαr)2E h Na(x) y o pr i , V ar(β_r∗|yo_pr) =1 − (σβrθp)2E h Na(x) y o pr i , V ar(θ∗_r|yo_pr) =1 − (σθpβr)2E h Na(x) y o pr i .

The approximation of expectations are as in Weng and Lin [5], where the error incurred by this approximation is reduced by a scaling factor. Take E(α∗_r|yo

pr) in (11) for illustration: E(α∗_r|yo pr) = − σαrE h Ma(x) y o pr i ≈ − σαr ν Maν(xν), (15)

where a and x are as in (12), ν =q1 + σ2

αr+ σβr2 µ2θp+ σθp2 µ2βr, and aν = γc− γc−1 ν and xν = γc− µαr− µβrµθp ν . (16)

Together with (9), we have

E(αr|yopr) ≈ µαr− σ_αr2 ν " φ(γc−µαr−µβrµθp ν ) − φ( γc−1−µαr−µβrµθp ν ) Φ(γc−µαr−µβrµθp ν ) − Φ( γc−1−µαr−µβrµθp ν ) # . (17)

Similar approximations give

V ar(αr|yopr) ≈ σαr2 1 −σαr ν 2 Naν(xν) . (18)

The proposed algorithm is described below:

Step 1. Given current estimates µ(t)α , µ(t)_β , µ(t)_θ , σα(t), σ(t)_β , σ_θ(t), where µ(t)α = (µ(t)_α1, · · · , µ(t)_αR)T,

(6)

Step 2. Given the (t + 1)st observation ypr = c. Calculate ν(t)= q 1 + (σ(t)αr)2+ (σ(t)_βr)2(µ(t)_θp)2+ (σ(t)_θp)2(µ(t)_βr)2, (19) ω(t) = − 1 ν(t)Ma(t)ν (x (t) ν ), (20) δ(t) = 1 ν(t)2Na(t)ν (x (t) ν ), (21) where a(t)_ν = γc− γc−1 ν(t) and x (t) ν = γc−1− µ(t)αr− µ(t)_βrµ(t)_θp ν(t) .

Step 3. Update parameters as below:

µ(t+1)_αr = µ(t)_αr+σ_αr(t)2ω(t) µ(t+1)_βr = µ(t)_βr + σ_βr(t) 2 µ(t)_θpω(t) µ_θp(t+1)= µ(t)_θp +σ_θp(t)2µ(t)_βrω(t) (22) (σ_αr(t+1))2 = (σ_αr(t))2max1 − (σ(t)_αr)2δ(t), κ (23) (σ_βr(t+1))2 = (σ_βr(t))2max 1 − (σ(t)_βrµ(t)_θp)2δ(t), κ (24) (σ_θp(t+1))2 = (σ_θp(t))2max 1 − (σ(t)_θrµ(t)_βr)2δ(t), κ . (25)

4 Experiments

We consider the ratings of news outlets from Mondo Times (http://www.mondotimes. com/) used in Ho and Quinn [2]. Mondo Times is an online company that disseminates information about media outlets such as newspapers, magazines, radio stations, and tele-vision stations in 211 countries. Raters submit five-point ratings of the content quality of news outlets from awful, poor, average, very good, to great. The dataset used in Ho and Quinn [2], which features 1,515 products (news outlets) and 946 raters, is available from their Ratings package (available at http://cran.r-project.org/). The average number of ratings for a product is 3.0 and the average number rated by a rater is 4.8.

As in Ratings of Ho and Quinn [2], we remove raters who rate less than five products and remove products that are only rated by these raters. This ends up with 3249 ratings from 232 raters on 1344 products.

(7)

News outlets MCMC: sub-Mondo online: sub-Mondo online: whole-Mondo US News & World Report -0.215 (0.481) 0.259 (0.241) -0.148 (0.320)

Toronto Sun -1.027 (0.392) -0.684 (0.160) -0.743 (0.224)

Toronto Star 0.397 (0.400) 0.321 (0.148) 0.215 (0.199)

San Diego Union Tribune 0.089 (0.482) 0.124 (0.280) 0.426 (0.224)

People -1.793 (0.593) -1.970 (0.689) -1.909 (0.413)

PBS 1.223 (0.393) 1.695 (0.258) 0.809 (0.182)

Montana Magazine -0.233 (0.358) -0.071 (0.171) 0.060 (0.199)

London Sun -2.140 (0.569) -1.477 (0.267) -1.380 (0.312)

Great Falls Tribune -2.839 (0.770) -1.817 (0.399) -2.353 (0.484) Daily Utah Chronicle 0.054 (0.818) -0.161 (0.256) -0.686 (0.668) Colorado Public Radio 1.431 (0.602) 2.572 (0.700) 1.617 (0.484)

CNN 0.038 (0.192) -0.055 (0.078) 0.312 (0.074)

Table 1: Posterior means of θ for twelve outlets.

As in Ho and Quinn [2], we assume that each αr follows N (1, 1) and each θp follows

N (0, 1). Instead of letting each β_r follow N (−5, 20) truncated to positive, we assume that each βrfollows N (1, 20). All parameters are assumed to be mutually independent. Since the

γ values are not the main interest here, we set them by the following steps: first, calculate the observed proportions #{ypr = c}/N , for c = 1, ..., 5, where N is the number of observed

ratings; next, find the z-scores corresponding to these areas; then, obtain approximations of the mean and variance of y∗ and convert the z-scores to y∗’s scale. In our scenario, the new ratings arrive sequentially. So, the empirical proportions are based on just part of the data, or in a pilot study.

The initial parameter values in Algorithm 1 are set to be µ(0)αr = 1, µ(0)_βr = 1, µ(0)_θp = 1,

σ(0)αr = 1, σ_βr(0) = 1, σ_θp(0) = 1, for r = 1, ..., R and p = 1, ..., P ; and the positive lower bound

κ in (23)-(25) is set to be 0.0001.

5 Conclusions

Ho and Quinn [2] have demonstrated the advantages of fitting the IRT models to Internet ratings data. However, for a real-time data pipeline that continuously collects new ratings

(8)

on new items, their MCMC approach is not computationally viable. Our proposed online method can adjust the model parameters in a large-scale problem.

References

[1] J.-P. Fox and C. A. W. Glas. Bayesian estimation of a multilevel IRT model using Gibbs sampling. Psychometrika, 66:269–286, 2001.

[2] D. E. Ho and K. M. Quinn. Improving the presentation and interpretation of online ratings data with model-based figures. The American Statistician, 62(4):279–288, 2008. [3] R. J. Patz and B. Junker. A straightforward approach to Markov Chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24:146–178, 1999.

[4] X. Wang, J. O. Berger, and D. S. Burdick. Bayesian analysis of dynamic item response models in educational testing. The Annals of Applied Statistics, 7(1):126–153, 2013. [5] R. C. Weng and C.-J. Lin. A Bayesian approximation method for online ranking. Journal