National Chengchi University
Department of Statistics
Master's Thesis

基於貝氏 IRT 模型之線上學習演算法
Online Learning Algorithms Based on Bayesian IRT Models

Advisor: Dr. 翁久幸
Author: 賴翔偉

July 2018

DOI:10.6814/THE.NCCU.STAT.006.2018.B03

摘要 (Abstract in Chinese)

In this thesis we present two types of online learning algorithms, statical and dynamical, for quantifying in real time the latent traits of users and rated items in online rating data. The statical learning algorithm extends the results of Weng and Coad (2018) by placing normal priors on the cutpoints of the Bayesian ordinal IRT model of Ho and Quinn (2008) that they adopted; the dynamical learning algorithm applies the dynamical ideas of Graepel et al. (2010) to remedy shortcomings that may arise under statical learning.

Through experiments we reach two conclusions: (1) once the cutpoints are given priors, the results obtained by updating them sequentially are better than those obtained with cutpoints fixed and without priors; (2) although the statical learning algorithm requires less computation time than the dynamical one, dynamical learning may perform better than statical learning under some configurations.

At the end of the thesis, we give several suitable prior configurations for the latent variables in Ho and Quinn's Bayesian ordinal IRT model.

Keywords: item response theory, latent variable, Bayesian, moment matching, statical learning, dynamical learning

Abstract

In this paper, we present two types of online learning algorithms, statical and dynamical, to capture users' and items' latent-trait information from online product rating data in a real-time manner. The statical one extends Weng and Coad (2018)'s deterministic moment-matching method by adding priors to the cutpoints, and the dynamical one extends the statical one with the dynamical ideas adopted in Graepel et al. (2010), taking users' and items' time-dependent latent traits into account. Both learning algorithms are designed for the Bayesian ordinal IRT model proposed by Ho and Quinn (2008).

Through experiments, we have verified two things. First, updating cutpoints sequentially produces better results. Second, although statical learning requires noticeably less computation time than dynamical learning, dynamical learning can slightly outperform statical learning under some configurations.

At the end of the paper, we give some useful configurations for setting up the priors of the latent variables of Ho and Quinn's ordinal IRT model.

Keywords: Bayesian, dynamical learning, item response theory, latent trait, moment-matching method, statical learning

Contents

1  Introduction  1
2  Online Rating Data  3
3  Item Response Theory (IRT)  5
   3.1  Unidimensional Logistic Models for Dichotomous Items  6
   3.2  Models for Polytomous Items  7
4  Online Learning Algorithm  10
   4.1  Statical Learning  10
   4.2  Dynamical Learning  16
5  Experimental Results  22
   5.1  Limiting properties of Ω_ℓ and Δ_ℓ  22
   5.2  Algorithm Evaluation  27
        5.2.1  Simulation  27
        5.2.2  Statical vs. dynamical  29
   5.3  Configuration searching process  31
6  Conclusions  36
References  37
A  Derivations of Statical Learning formulas  39
   A.1  Preliminaries  40

   A.2  Posterior expectation and variation of α_j  41
   A.3  Posterior expectation and variation of β_j  43
   A.4  Posterior expectation and variation of θ_i  45
   A.5  Posterior expectation and variation of (γ_{c−1}, γ_c)  45
   A.6  Limiting properties of Ω_ℓ and Δ_ℓ  45

List of Figures

3.1  Visualization of a basic IRT model  7
4.1  Damped pendulum  17
4.2  Phase portrait of a damped pendulum system  18
4.3  Dynamical behaviour of μ̃_dynamical(t), σ̃_dynamical(t) with different ξ  19
5.1  Visualization of (Ω_ℓ, Δ_ℓ)  24
5.2  Zoom in on Ω_ℓ(Y_ij = 1), Δ_ℓ(Y_ij = 1)  24
5.3  The p.d.f. and c.d.f. of the standard normal distribution N(0,1), φ(x), Φ(x)  25
5.4  Linearization near resonating points  26
5.5  The quality of updates with the training dataset  32

List of Tables

5.1  Comparison of estimated γ and sequentially updated γ  30
5.2  Comparison of statical learning and dynamical learning  32
5.3  Top 10 configurations for statical learning  35
5.4  Top 10 configurations for dynamical learning  35
5.5  Spearman correlation of the pairwise rank(μ̃_θ) under statical learning  35
5.6  Spearman correlation of the pairwise rank(μ̃_θ) under dynamical learning  35

1. Introduction

Recently, using customers' online historical data to expand revenue has gained considerable attention among industries, e.g., Google, Amazon, Netflix, Yahoo!, etc. Most of the data are composed of customers' historical behaviours toward online items, such as music, movies, commodities, or simply links on websites.

Records in this kind of data are usually nothing more than whether a customer has responded to an item, or how a customer responds to an item, which in turn reveals his/her first impression of and affection toward the item. The composition of the data makes it look simple at first glance, but modelling it is actually a hard problem. As an old saying goes, "The simpler it looks, the more complicated it will be."

In this paper, we focus on applying IRT models to fit the MovieLens 100k movie ratings dataset with Bayesian approaches, so as to capture raters' and movies' latent information as accurately and quickly as possible. The basic concepts follow the ones proposed by Ho and Quinn (2008), who first proposed fitting a Bayesian ordinal IRT model via MCMC techniques to Internet rating data. Their methods work well with rating data under offline settings. However, the stability of an MCMC technique is usually provided by a long period of sampling, which means that MCMC techniques usually require some time before uncovering the latent information in the data. Ho and Quinn themselves concluded that using deterministic approximation algorithms to adjust models' parameters might be more feasible in real-time settings. It turns out that, if we want to apply Ho and Quinn's concepts in a real-time scenario, other faster methods are needed. Fortunately, Weng and Coad (2018) proposed a deterministic moment-matching method to sequentially capture the information of the latent variables in Ho and Quinn's IRT model.

The aim of this paper is to extend Weng and Coad (2018)'s concepts in two directions: first, adding priors to the cutpoints (γ_1, γ_2, γ_3, γ_4) to see whether doing so brings any advantages; second, finding suitable priors for the latent variables in Ho and Quinn's ordinal IRT model.

The organization of this paper is as follows. In Chapter 2, we describe the 100k movie ratings dataset from MovieLens. In Chapter 3, we introduce the basic ideas and mathematical models behind Item Response Theory (IRT). In Chapter 4, we develop two types of online learning algorithms with some explanations. In Chapter 5, we first conduct a simulation to verify the effectiveness of adding priors to cutpoints, then evaluate the algorithms with the 100k movie ratings dataset and provide some useful configurations for setting up priors and parameters. In Chapter 6, we state our conclusions.

2. Online Rating Data

In this chapter, we introduce the online rating data, the 100k ratings dataset from MovieLens (Harper and Konstan, 2016), which will be used to assess the effectiveness of our online learning algorithms in Chapter 5.

The dataset consists of 100,000 ratings on 1,682 movies from 943 raters, collected through the MovieLens website (https://movielens.org) during the seven-month period from September 19th, 1997 through April 22nd, 1998, and finally released in April 1998 by the GroupLens Research Project for research purposes. The project has also released other datasets of various sizes; please refer to the GroupLens website (https://grouplens.org).

The dataset has the following structure:

         Rater ID  Movie ID  Rating  Timestamp
     1        196       242       3  881250949
     2        186       302       3  891717742
     3         22       377       1  878887116
     4        244        51       2  880606923
     5        166       346       1  886397596
     6        298       474       4  884182806
     7        115       265       2  881171488
     8        253       465       5  891628467
     9        305       451       3  886324817
    10          6        86       3  883603013

The timestamp column gives the cumulative seconds elapsed since 1970/01/01 00:00:00. Each row records exactly who rated what, and when.

Every rater included in this dataset is guaranteed to have rated at least 20 movies, so it should not be too difficult to capture raters' latent information. In contrast, over 22.8 percent of

the movies (384 movies) received no more than 5 ratings. With so few raters, some movies' latent information might be hard to trace.

However, there are remarks on how Bayesian approaches can perform parameter estimation with only a small amount of data, which makes Bayesian inference useful in several research fields; see McNeish (2016) and Van De Schoot et al. (2015). In Bayesian approaches, the starting point is to specify suitable models for the observed data, then pick rational priors for the models' parameters. Afterwards, we can perform inference about the parameters based on the models and the data.
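As a concrete companion to the table above, the following R sketch loads and orders the ratings; it assumes the tab-separated u.data file shipped with the MovieLens 100k distribution, and the column names are ours.

    # Load and inspect the ratings (assuming the tab-separated "u.data" file from the
    # MovieLens 100k distribution; columns: rater id, movie id, rating, timestamp).
    ratings <- read.table("u.data", sep = "\t",
                          col.names = c("rater_id", "movie_id", "rating", "timestamp"))

    # Sort chronologically, as required by the online (sequential) setting.
    ratings <- ratings[order(ratings$timestamp), ]

    # Checks mentioned above: every rater has at least 20 ratings, while a sizeable
    # share of the movies has no more than 5.
    min(table(ratings$rater_id))           # >= 20
    mean(table(ratings$movie_id) <= 5)     # about 0.228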

3. Item Response Theory (IRT)

Item response theory is a theory used to measure latent properties that cannot be easily observed or quantified in real life, such as respondents' latent traits, which affect how they respond to items' characteristics, or items' latent quality, which influences how respondents respond to them. The term item is used to point out that the theory broadly focuses on items rather than on specific objects.

The form of responses toward items can be any of the following: the correctness of multiple-choice questions, like {Correct, Incorrect}; affection toward items, such as {Awful, Poor, Average, Good, Great}; or Likert-type ratings toward commodities, namely {1, 2, 3, ..., C}.

There are several scenarios in which item response theory might help. Questionnaire designers might want to balance each question item's difficulty to reduce biases among respondents in a questionnaire inquiry. Some educational institutions might want not only to rank examinees but also to examine these examinees' performance on specific question items in high-stakes tests, e.g., GRE, GMAT. Some online marketing companies might want to capture customers' latent traits from historical data to recommend new commodities to these customers through the Internet. For the first application (questionnaire design), IRT can help define the scale of questions' latent difficulty as a reference for difficulty adjustment; for the second (performance assessment), IRT can help define the scale of examinees' latent performance; and for the last (commodity recommendation), IRT can help define the scale of customers' latent preferences, so that marketing companies can evaluate whether a customer will like a new commodity.

Mathematical models behind IRT

IRT models have been widely used to model dichotomous and polytomous data from educational tests and psychometrics; see van der Linden (2010). Theoretically, the basic idea of every IRT model is to concretize a set of latent variables by a set of continuous variables. The continuous ones can then be regarded as the latent ones and be used to infer respondents' future behaviours or items' latent characteristics. According to the number of parameters and the purpose of use, IRT models can be classified as follows: Unidimensional Logistic Models for Dichotomous Items (∗), Models for Polytomous Items (including the Nominal-response model, the Graded-response model (∗), and Partial-credit models), Multidimensional Models, Cognitive-Component Models, and Hierarchical Models. To give readers a brief taste of IRT models, we only discuss the ones marked (∗).

3.1 Unidimensional Logistic Models for Dichotomous Items

Given a dataset of a sequence of dichotomous response data points D = {D_i : D_i ∈ {0, 1}, ∀ i = 1:N}, the basic IRT model among unidimensional logistic models for this kind of data is

    P(D_i = 1 | β_i, θ) = exp(β_i − θ) / (1 + exp(β_i − θ))    (3.1)

where D_i represents a dichotomous response respondent i gave to an item, β_i represents the latent trait of respondent i, and θ represents the latent characteristic of the item.

(3.1) is a simplified version of the three-parameter logistic (3PL) model (see Birnbaum, 1968),

    P(D_i = 1 | β_i, θ) = γ_i + (1 − γ_i) · exp(α_i(θ − β_i)) / (1 + exp(α_i(θ − β_i)))

If we add the variable α_i, which controls the slope of the logistic curve, into (3.1) and assume that α_i = a ≠ 1, ∀ i, (3.1) becomes the Rasch model; see Rasch (1961).

To be more specific, assume that the item is actually a test question and D_i represents the correctness of the answer examinee i gave to the question; then we can regard β_i and θ as the latent performance of examinee i and the latent difficulty of the question,

respectively. If the performance of examinee i surpasses the difficulty of the question, we should expect the probability that he/she answers the question correctly to be over 0.5. This can be verified by visualizing (3.1) with a fixed pseudo value of θ = 0 (Figure 3.1).

Figure 3.1: Visualization of the basic IRT model (3.1)

If there is more than one question, (3.1) can be modified into

    P(D_ij = 1 | β_i, θ_j) = exp(β_i − θ_j) / (1 + exp(β_i − θ_j))

where D_ij represents the correctness of the answer examinee i gave to question j, β_i represents the latent performance of examinee i, and θ_j represents the latent difficulty of question j. More generally, (3.1) can be regarded as a special case of the following:

    P(D_ij = 1 | β_i, θ_j) = F(β_i − θ_j)

where F(·) is the item response function (IRF), which is usually the cumulative distribution function of a continuous distribution.

3.2 Models for Polytomous Items

For a dataset of a sequence of polytomous response data points P = {P_ij : P_ij ∈ {1, 2, 3, ..., C}, ∀ i = 1:I, j = 1:J; I ≡ # of respondents; J ≡ # of items}, more

sophisticated IRT models are needed. Here we are only concerned with ordered polytomous responses with ordinal levels, like ratings with ordinal levels from 1 to C, where C is the maximum ordinal level of response. Samejima (1970) proposed the graded-response model, also called the ordinal IRT model, for this kind of data:

    P(Y_ij ≥ c | β_j, θ_i, δ_{j,c}) = F(β_j(θ_i − δ_{j,c}))    (3.2)

where Y_ij is the response respondent i gave to item j, θ_i represents the latent proficiency of respondent i, β_j represents the latent discrimination of item j, which can be regarded as an item-specific slope-controlling parameter, and δ_{j,c} represents the latent item response variable of item j, an exclusive threshold for item j.

Note that, in this model, every item's maximum ordinal level of response can be different. For instance, item k has the maximum ordinal level C_k and item k−1 has the maximum C_{k−1}, resulting in C_k ≠ C_{k−1}. This might be suitable for a set of items containing many heterogeneous items. For a set of items that are all homogeneous, other appropriate models are required.

Muraki (1990) proposed a modified graded-response model which is suitable for Likert-type data, where all items' maximum ordinal levels of response are equal to C:

    P(Y_ij ≥ c | α_j, β_j, θ_i, d_{c−1}) = F(β_j(α_j + θ_i − d_{c−1}))    (3.3)

where α_j and d_{c−1} are the latent location variable of item j and the latent threshold variable respectively, both of which contain partial information of the original latent item response variable δ_{j,c}.

Ho and Quinn (2008) proposed a new ordinal IRT model for online product rating data:

    P(Y_ij ≥ c | α_j, β_j, θ_i, γ_{c−1}) = Φ(α_j + β_j θ_i − γ_{c−1})    (3.4)

where α_j represents the latent location of rater j's preferable rating, β_j represents rater j's

latent discrimination ability toward items' quality, θ_i represents the latent quality of item i, and γ_{c−1} is the latent cutpoint (or threshold) that partitions the continuous scale (−∞, ∞) into sub-groups, e.g., (−∞, γ_0] ∪ (γ_0, γ_1] ∪ ··· ∪ (γ_C, ∞).

(3.4) is the model we are concerned with in this paper, and it has the following properties:

1. Y*_ij = α_j + β_j θ_i + ε_ij,  ε_ij ~ iid N(0, 1)
2. Y_ij = c ⇔ Y*_ij ∈ (γ_{c−1}, γ_c]
3. P(Y_ij = c | α_j, β_j, θ_i, γ_{c−1}, γ_c) = Φ(α_j + β_j θ_i − γ_{c−1}) − Φ(α_j + β_j θ_i − γ_c)

The first property can be regarded as fitting a linear regression model for rater j, with α_j the intercept, β_j the slope, θ_i the independent variable, and Y*_ij the dependent variable. The second property declares a sufficient and necessary condition, which maps a discrete rating record onto a continuous scale. The third property is the likelihood function of Y_ij that we use in our learning algorithms. Although we stated that Y*_ij in the first property is the dependent variable, it is actually unobservable; the only observable quantity is Y_ij, the rating that item i received from rater j. Only when we consider the three properties simultaneously can the linear regression interpretation be true. In other words, if the second property is achieved when Y*_ij is obtained by applying the formula in the first property with the latent variables, which are estimated by our learning algorithms using the likelihood in the third property, then we can safely claim that the interpretation is valid.

Ho and Quinn (2008) applied MCMC approaches to estimate the latent variables in their proposed ordinal IRT model. The method works well in offline scenarios where computation time is not critical. But if we want to promote their ordinal IRT model to a real-time circumstance, using MCMC approaches to estimate latent variables is not feasible, especially when the data size is large: all procedures have to be rerun whenever new records arrive. In the next chapter, we present two types of online learning algorithms to fulfill the need of applying Ho and Quinn's ordinal IRT model on Internet platforms to capture raters' and items' latent information.
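As a small illustration of properties 1-3, the following R sketch computes the rating probabilities of property 3 for one rater-item pair and draws a rating through the latent-regression view; the latent values used are illustrative, not estimates from this thesis.

    # Probability of each rating for one rater-item pair (property 3):
    # P(Y = c) = Phi(alpha + beta*theta - gamma_{c-1}) - Phi(alpha + beta*theta - gamma_c)
    rating_probs <- function(alpha, beta, theta, gamma) {
      eta <- alpha + beta * theta
      p <- pnorm(eta - gamma)        # P(Y* > gamma_k) for k = 0, ..., C
      p[-length(p)] - p[-1]          # probabilities of ratings 1, ..., C
    }

    gamma <- c(-Inf, 1.5, 2.5, 3.5, 4.5, Inf)               # gamma_0, ..., gamma_5
    probs <- rating_probs(alpha = 3, beta = 1, theta = 0.4, gamma = gamma)
    sum(probs)                                              # equals 1

    # The same rating drawn through properties 1 and 2 (the latent-regression view):
    ystar <- 3 + 1 * 0.4 + rnorm(1)                         # Y* = alpha + beta*theta + eps
    y <- cut(ystar, breaks = gamma, labels = FALSE)         # Y = c  <=>  Y* in (gamma_{c-1}, gamma_c]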

4. Online Learning Algorithm

In this chapter, we present two types of online learning algorithms, statical and dynamical, which can be used to sequentially estimate both users' latent preferences and items' latent quality. The dynamical learning algorithm is an extended version of the statical one, incorporating the prior's information in each update. Both algorithms are based on Bayesian approaches.

4.1 Statical Learning

As introduced in Chapter 2, an item-response dataset usually has the following structure: D = (D_1, D_2, ..., D_N)', an N-by-3 matrix with the t-th row

    D_t = (j, i, c)

where

    j ∈ {1, 2, ..., J},  J ≡ # of users
    i ∈ {1, 2, ..., I},  I ≡ # of items
    c ∈ {1, 2, ..., C},  C ≡ maximal level of rating

so j represents the j-th user, i represents the i-th item, and c represents the rating the j-th user gave to the i-th item. Some datasets may have an additional column, timestamp, to indicate the

time when each data entry was recorded, which is useful in dynamical learning (see Section 4.2).

The whole concept of the statical learning algorithm is pretty simple. When a single data point D_t comes in, we can capture a portion of the corresponding user's favour and the item's value from the rating c, which represents how the user responded to the item. What's more, in the course of learning, if one user has contributed lots of ratings toward many items, his/her latent preference can be estimated more accurately. Likewise, if an item has many raters, its estimated latent value is more reliable.

Recall the first two properties of Ho and Quinn (2008)'s ordinal IRT model listed at the end of Chapter 3:

    Y_ij = c ⇔ Y*_ij = α_j + β_j θ_i + ε_ij ∈ (γ_{c−1}, γ_c]

where α_j represents a user's latent preference, θ_i represents an item's latent value, β_j represents the user's latent discrimination ability, ε_ij represents a small disturbance, and (γ_{c−1}, γ_c] is the interval with end points γ_{c−1}, γ_c, the cutpoints pair used when rating c appears.

This is an essential mapping from discrete ratings to continuous values in Ho and Quinn's ordinal IRT model. The relation means that "the j-th user rated the i-th item a 'c'" is equivalent to "Y*_ij is located within the interval (γ_{c−1}, γ_c]". As a result, once we have estimated every user's latent variables α_j, β_j and every item's latent variable θ_i, we can predict how a user will respond to an item they have not rated yet and evaluate how valuable an item is. In terms of a recommender system, we can make recommendations (predictions) based on this information.

Every user's and item's latent variable should have its own prior information before we can start to capture users' and items' latent information. It is common to assume the prior follows a normal distribution,

    ψ_i ~ N(μ_ψi, σ²_ψi)

where μ_ψi is the mean and σ_ψi is the standard deviation.

We set the priors of every user's latent preference and latent discrimination ability and every item's latent quality as

    α_j ~ N(μ_αj, σ²_αj),  β_j ~ N(μ_βj, σ²_βj),  θ_i ~ N(μ_θi, σ²_θi)

Moreover, we set each cutpoint's prior as

    γ_k ~ N(μ_γk, σ²_γk),  k = 0:C

By assigning normal distributions to these variables as their priors, according to the Woodroofe-Stein identity, if the full posterior distribution of all variables has the form

    C φ(ψ*) f_D(ψ*),  C^{-1} = ∫ φ(ψ*) f_D(ψ*) dψ*

then each variable's posterior mean and variance can be simplified as

    E_D(ψ*_i) = E_D( ∂ log f / ∂ψ*_i )    (4.1)
    Var_D(ψ*_i) = 1 + E_D( (∂²f / ∂(ψ*_i)²) / f ) − ( E_D( ∂ log f / ∂ψ*_i ) )²    (4.2)

For a preliminary understanding, see Weng and Coad (2018, p.6-9); for a comprehensive understanding of the mathematical results, please refer to Weng and Lin (2011). Note that the likelihood function f_D(ψ*) we use is Φ(α_j + β_j θ_i − γ_{c−1}) − Φ(α_j + β_j θ_i − γ_c).

With (4.1)(4.2), if each variable's posterior mean and variance have closed forms, then they can be easily computed in a computer system and used to update each variable's prior information, (μ, σ). The updated information then becomes the variable's new prior information for the next update. In addition, in Bayesian approaches, a variable's posterior mean and variance can be used not only to update its prior information but also to perform inference. After many updates, the information contained in a variable may gradually approach the true properties of an event in the real world, resulting in more reliable inferences. This is the common idea in Bayesian inference.
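To make (4.1) and (4.2) concrete, the following R sketch verifies both identities numerically for a standard normal prior and an illustrative likelihood f(z) = Φ(z + 0.5); this likelihood is chosen only for the check and is not the one used in the thesis.

    # Posterior proportional to phi(z) * f(z) with f(z) = Phi(z + 0.5)
    f   <- function(z) pnorm(z + 0.5)
    df  <- function(z) dnorm(z + 0.5)                 # f'(z)
    d2f <- function(z) -(z + 0.5) * dnorm(z + 0.5)    # f''(z)

    post  <- function(z) dnorm(z) * f(z)
    Cinv  <- integrate(post, -Inf, Inf)$value         # C^{-1} in the notation above
    Epost <- function(g) integrate(function(z) g(z) * post(z) / Cinv, -Inf, Inf)$value

    # Direct posterior mean and variance ...
    m_direct <- Epost(identity)
    v_direct <- Epost(function(z) z^2) - m_direct^2

    # ... versus the identities (4.1)-(4.2)
    m_stein <- Epost(function(z) df(z) / f(z))
    v_stein <- 1 + Epost(function(z) d2f(z) / f(z)) - m_stein^2

    c(m_direct, m_stein)   # the two means agree
    c(v_direct, v_stein)   # the two variances agree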

The latent information can be captured by the following formulas (the complete derivation is given in Appendix A):

    μ̃_αj,statical ← μ_αj + σ²_αj · Ω_ℓ(Y_ij = c)                              (4.3)
    σ̃²_αj,statical ← σ²_αj · (1 − σ²_αj · Δ_ℓ(Y_ij = c))                       (4.4)
    μ̃_βj,statical ← μ_βj + σ²_βj μ_θi · Ω_ℓ(Y_ij = c)                          (4.5)
    σ̃²_βj,statical ← σ²_βj · (1 − (σ_βj μ_θi)² · Δ_ℓ(Y_ij = c))                (4.6)
    μ̃_θi,statical ← μ_θi + σ²_θi μ_βj · Ω_ℓ(Y_ij = c)                          (4.7)
    σ̃²_θi,statical ← σ²_θi · (1 − (σ_θi μ_βj)² · Δ_ℓ(Y_ij = c))                (4.8)
    μ̃_γ{c−1},statical ← μ_γ{c−1} − σ²_γ{c−1} · Ω_γ(Y_ij = c)(1, 0)             (4.9)
    σ̃²_γ{c−1},statical ← σ²_γ{c−1} · (1 − σ²_γ{c−1} · Δ_γ(Y_ij = c)(1, 0))     (4.10)
    μ̃_γc,statical ← μ_γc − σ²_γc · Ω_γ(Y_ij = c)(0, 1)                         (4.11)
    σ̃²_γc,statical ← σ²_γc · (1 − σ²_γc · Δ_γ(Y_ij = c)(0, 1))                 (4.12)

where

    Ω_ℓ(Y_ij = c) = [ (1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (1/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ]    (4.13)

    Δ_ℓ(Y_ij = c) = [ (μ_{c−1}/ν²_{c−1}) · (1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (μ_c/ν²_c) · (1/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ] + ( Ω_ℓ(Y_ij = c) )²    (4.14)

    Ω_γ(Y_ij = c)(c1, c2) = [ (c1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (c2/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ]    (4.15)

    Δ_γ(Y_ij = c)(c1, c2) = [ (μ_{c−1}/ν²_{c−1}) · (c1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (μ_c/ν²_c) · (c2/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ] + ( Ω_γ(Y_ij = c)(c1, c2) )²    (4.16)

    μ_c = μ_αj + μ_βj μ_θi − μ_γc    (4.17)
    ν²_c = 1 + σ²_γc + σ²_αj + μ²_θi σ²_βj + μ²_βj σ²_θi,  c = 1:5    (4.18)
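The formulas (4.13)-(4.18) translate directly into R. The sketch below computes the four correction terms for a single observation; the large-x branch anticipates the limiting case discussed in Section 5.1 and Appendix A.6, and the function and argument names are ours.

    # Correction terms (4.13)-(4.16) with mu_c and nu_c^2 from (4.17)-(4.18), for one
    # observed rating. The cutpoint vectors run over gamma_0, ..., gamma_C with
    # gamma_0 = -Inf and gamma_C = +Inf.
    correction_terms <- function(mu_a, s2_a, mu_b, s2_b, mu_t, s2_t, mu_g, s2_g, rating) {
      idx <- c(rating, rating + 1)                       # positions of gamma_{c-1}, gamma_c
      mu  <- mu_a + mu_b * mu_t - mu_g[idx]                          # (4.17)
      nu2 <- 1 + s2_g[idx] + s2_a + mu_t^2 * s2_b + mu_b^2 * s2_t    # (4.18)
      nu  <- sqrt(nu2)
      x   <- mu / nu

      if (is.finite(x[2]) && x[2] >= 7.650731) {         # limiting approximation (A.6)
        return(list(Omega_l = -mu[2] / nu2[2], Delta_l = 1 / nu2[2],
                    Omega_g = c(0, -mu[2] / nu2[2]), Delta_g = c(0, 1 / nu2[2])))
      }
      N1 <- dnorm(x) / nu
      N2 <- ifelse(is.finite(mu), (mu / nu2) * N1, 0)    # infinite end points contribute 0
      D  <- pnorm(x[1]) - pnorm(x[2])

      Omega_l <- (N1[1] - N1[2]) / D                     # (4.13)
      Delta_l <- (N2[1] - N2[2]) / D + Omega_l^2         # (4.14)
      Omega_g <- c(N1[1], -N1[2]) / D                    # (4.15) with (c1, c2) = (1, 0) and (0, 1)
      Delta_g <- c(N2[1], -N2[2]) / D + Omega_g^2        # (4.16)
      list(Omega_l = Omega_l, Delta_l = Delta_l, Omega_g = Omega_g, Delta_g = Delta_g)
    }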

The statical learning algorithm (Algorithm 1) is presented below. We add the term κ, which acts as a small positive number, say 10^(-6), to prevent σ̃²_statical from becoming negative; see Weng and Coad (2018, p.10-11). Notice that some calculations involve element-wise computing, which is common in high-level programming languages such as R and Matlab. We denote such a result by a bold-face symbol followed by the notation (·, ·)' to remind readers that we are doing element-wise calculation.

At the beginning, every user and item has their own initial information, μ = μ^(0), σ = σ^(0). Once a data point D_t arrives, the information is updated by (4.3)-(4.18). The updated information then replaces the initial one and becomes the prior for the next round, i.e., μ ← μ̃, σ ← σ̃.

However, in this algorithm, if a user or an item has many rating records to update, their latent variables' σ will eventually converge close to 0 and no more updates can be made, because every update formula above depends on the size of σ, which controls the extent of the update. This induces the stopping-learning problem. Since users will keep rating items in the future, shifting their preferable ratings and items' values, stopping learning is not reasonable for a real recommender system.

What's more, like every recommender system, the algorithm also faces the cold-start problem, one of the sources of bias in many recommender systems. This problem comes from inactive users or unpopular items having no records for a long time. Their latent variables could store a high potential for a large jump in just one update; in other words, one data point from an inactive user might result in a large update, which may be troublesome in some ways. When those inactive users' and unpopular items' data appear, not only will their data greatly alter their own latent variables α_j, β_j with just a few data points, but their data will also distort some items' quality dramatically.

In this section, we have expounded in detail how we can exploit the Bayesian ideas to update variables' information. In the next section, we describe an alternative learning process, dynamical learning, the physical idea behind it, and how a dynamical learning algorithm can cope with the stopping-learning and cold-start problems in a more sensible way. There may still be many issues we did not mention, but we only consider these two major problems in this paper.

Algorithm 1  Statical Learning Algorithm

Setup lower bound: κ
Setup prior: μ_α^(0), μ_β^(0), μ_θ^(0), μ_γ^(0); σ_α^(0), σ_β^(0), σ_θ^(0), σ_γ^(0)
Set current:
    μ_α = μ_α^(0), μ_β = μ_β^(0), μ_θ = μ_θ^(0), μ_γ = μ_γ^(0)
    σ_α = σ_α^(0), σ_β = σ_β^(0), σ_θ = σ_θ^(0), σ_γ = σ_γ^(0)

Given D_t = (j, i, c) (the subscript t is not related to time here; it is just an index for any data point):

    μ = μ_αj + μ_βj * μ_θi − μ_γ(c−1):c ⇒ (μ_{c−1}, μ_c)'
    ν² = 1 + σ²_γ(c−1):c + σ²_αj + μ²_θi σ²_βj + μ²_βj σ²_θi ⇒ (ν²_{c−1}, ν²_c)'
    x = μ / ν ⇒ (x_{c−1}, x_c)'
    if x_c ≥ 7.650731 then (please refer to Appendix A.6)
        (Ω_ℓ, Δ_ℓ) = (−μ_c/ν²_c, 1/ν²_c)
        (Ω_γ, Δ_γ)(1, 0) = (0, 0)
        (Ω_γ, Δ_γ)(0, 1) = (−μ_c/ν²_c, 1/ν²_c)
    else
        N1 = φ(x)/ν ⇒ (N11, N12)'
        N2 = (μ/ν²) * N1 ⇒ (N21, N22)'
        D = Φ(x_{c−1}) − Φ(x_c)
        Ω_ℓ = (N11 − N12) / D
        Δ_ℓ = (N21 − N22) / D + (Ω_ℓ)²
        Ω_γ(1, 0) = N11 / D
        Δ_γ(1, 0) = N21 / D + (Ω_γ(1, 0))²
        Ω_γ(0, 1) = −N12 / D
        Δ_γ(0, 1) = −N22 / D + (Ω_γ(0, 1))²
    end if

    μ̃_αj,statical ← μ_αj + σ²_αj · Ω_ℓ
    σ̃²_αj,statical ← σ²_αj · max(κ, 1 − σ²_αj · Δ_ℓ)
    μ̃_βj,statical ← μ_βj + σ²_βj μ_θi · Ω_ℓ
    σ̃²_βj,statical ← σ²_βj · max(κ, 1 − (σ_βj μ_θi)² · Δ_ℓ)
    μ̃_θi,statical ← μ_θi + σ²_θi μ_βj · Ω_ℓ
    σ̃²_θi,statical ← σ²_θi · max(κ, 1 − (σ_θi μ_βj)² · Δ_ℓ)
    μ̃_γ{c−1},statical ← μ_γ{c−1} − σ²_γ{c−1} · Ω_γ(1, 0)
    σ̃²_γ{c−1},statical ← σ²_γ{c−1} · max(κ, 1 − σ²_γ{c−1} · Δ_γ(1, 0))
    μ̃_γc,statical ← μ_γc − σ²_γc · Ω_γ(0, 1)
    σ̃²_γc,statical ← σ²_γc · max(κ, 1 − σ²_γc · Δ_γ(0, 1))
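For readers who prefer running code, one pass of Algorithm 1 can be written in R as below, reusing the correction_terms() sketch given after (4.18); the numbers in the example call are illustrative only.

    # One statical update (Algorithm 1); kappa keeps the variances positive.
    statical_update <- function(mu_a, s2_a, mu_b, s2_b, mu_t, s2_t,
                                mu_g, s2_g, rating, kappa = 1e-6) {
      ct  <- correction_terms(mu_a, s2_a, mu_b, s2_b, mu_t, s2_t, mu_g, s2_g, rating)
      idx <- c(rating, rating + 1)
      list(
        mu_a = mu_a + s2_a * ct$Omega_l,                                       # (4.3)
        s2_a = s2_a * max(kappa, 1 - s2_a * ct$Delta_l),                       # (4.4)
        mu_b = mu_b + s2_b * mu_t * ct$Omega_l,                                # (4.5)
        s2_b = s2_b * max(kappa, 1 - s2_b * mu_t^2 * ct$Delta_l),              # (4.6)
        mu_t = mu_t + s2_t * mu_b * ct$Omega_l,                                # (4.7)
        s2_t = s2_t * max(kappa, 1 - s2_t * mu_b^2 * ct$Delta_l),              # (4.8)
        mu_g = replace(mu_g, idx, mu_g[idx] - s2_g[idx] * ct$Omega_g),         # (4.9), (4.11)
        s2_g = replace(s2_g, idx,
                       s2_g[idx] * pmax(kappa, 1 - s2_g[idx] * ct$Delta_g))    # (4.10), (4.12)
      )
    }

    # Example with illustrative priors and an observed rating of 4:
    upd <- statical_update(mu_a = 3, s2_a = 1, mu_b = 1, s2_b = 0.1,
                           mu_t = 0, s2_t = 0.1,
                           mu_g = c(-Inf, 1.5, 2.5, 3.5, 4.5, Inf),
                           s2_g = c(0, 0, 0.01, 0.01, 0.01, 0), rating = 4)
    upd$mu_t    # the item's updated latent-quality mean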

4.2 Dynamical Learning

Statistical inference for dynamical systems is important in many engineering realms, for many mathematical problems in engineering contain lots of unknown features which are usually set by personal experience after plenty of trial and error. However, with enough data, we can actually solve these feature-setting problems analytically by parameter estimation; for a brief example, see Coelho et al. (2011). On the other hand, many concepts from dynamical systems can also be applied to statistical learning algorithms, for they offer much flexibility in estimation.

In this dynamical learning section, we adopt the formulas for the real updates from Graepel et al. (2010, sec. 3.3.3):

    σ̃²_dynamical = (σ²_prior · σ²_inter) / ( (1 − ξ) σ²_prior + ξ σ²_inter )    (4.19)

    μ̃_dynamical = σ̃²_dynamical · ( (1 − ξ) μ_inter / σ²_inter + ξ μ_prior / σ²_prior )    (4.20)

where (μ_prior, σ_prior) are the prior information, and (μ_inter, σ_inter) are the update values coming from Algorithm 1 (equivalent to (μ̃_statical, σ̃_statical); see Section 4.1 for more details), which now act as intermediate values cooperating with the prior whenever (4.19)(4.20) are invoked. (4.19) is always computed first, before (4.20). These formulas can be derived from an idea called partial update, using the properties of the normal distribution; see Moser (2010, p.10).

The formulas above manifest the central part of Bayesian inference: every posterior result combines the prior assumption with the information in the data. If we have many data, in the sense that they are more significant than the prior, the posterior results will tend to depend more on the data, and vice versa.

The point of these dynamical update formulas is that the dynamical version can gradually forget the influence of previous data, restoring users' and items' latent information (μ, σ²) toward the prior if the time duration between the present and the last update is long enough, with the speed of restoration governed by ξ. Due to this property, we can easily address the problems mentioned at the end of Section 4.1, stopping learning and cold start.
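A direct R transcription of (4.19)-(4.20) is short; variances (not standard deviations) are passed in, and the function name is ours.

    # The dynamical correction (4.19)-(4.20).
    dynamical_update <- function(mu_prior, s2_prior, mu_inter, s2_inter, xi) {
      s2_dyn <- (s2_prior * s2_inter) / ((1 - xi) * s2_prior + xi * s2_inter)           # (4.19)
      mu_dyn <- s2_dyn * ((1 - xi) * mu_inter / s2_inter + xi * mu_prior / s2_prior)    # (4.20)
      list(mu = mu_dyn, s2 = s2_dyn)
    }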

The first issue is caused by active users who have rated many items. Ultimately, their latent variables' variances (σ²_αj, σ²_βj) and the corresponding items' latent variable's variance (σ²_θi) could converge toward 0 (σ² → 0), and subsequent updates for their latent variables (α_j, β_j) and the items' latent variable (θ_i) would halt. Instead, if variances can grow back slightly to a reasonably small amount rather than going to zero in the end, the stopping-learning problem might be lightened.

The second issue is caused by inactive users who have barely any data to update. Once their data arrive, large updates are possible. This issue, compared with the first one, concerns the mean of a latent variable (μ) more. That is, unless inactive users become active, their ratings of some items could greatly shift the central tendency of these items' latent quality (μ_θi). Instead, if the mean can grow back toward the prior assumption as the time duration between the present and the last update expands, in the sense that we gradually lose inactive users' information, the trouble can be mitigated.

Before giving a numerical example to reinforce the ideas above, we would like to elaborate on the term dynamical in detail first.

Damped dynamical system

Given a sequence of unit times, say 1 second or 1 minute, update every user's and item's information whether there is a datum or not. This is exactly where the term dynamical comes from. Take a damped pendulum for example, which is a basic dynamical problem in physics (Figure 4.1).

Figure 4.1: Damped pendulum (from Shane Mac, 2016)

Figure 4.2: Phase portrait of a damped pendulum system (from Strogatz, 1994, p.173)

At any moment, if there are no external forces (no data), the black ball stays at its stationary point (prior). Once it gets a push (one data point arrives), it starts wiggling back and forth. Meanwhile, it receives the downward gravitational force g (the parameter ξ), which slows it down. Without further interaction with any external force afterwards, it uses up its moving energy and ends up stopping at the stationary position.

To make a strong connection between this dynamical system and the dynamical learning property we are going to demonstrate, the whole mechanism can be illustrated more explicitly by the phase portrait of the damped pendulum system (Figure 4.2). It is clear from this figure that the system has stable fixed (stationary) points at (kπ, 0), where k = 0, ±2, ..., and saddle nodes at ((k − 1)π, 0). The origin, (0, 0), is asymptotically stable, causing the stable spiral: all solutions starting near it with |θ| < π spiral in and close to it. In other words, if |θ| < π is caused by an external force, the black ball wiggles around the lowest position for a while and then stops near it. Furthermore, this figure unintentionally emphasizes the serious consequences of cold-start effects: if the external force is so powerful (a large potential of update) that it leads to |θ| > π, though it seems fine in this picture, the final outcome would be unpredictable and the stopping point could no longer be at the position (0, 0). (The black ball in reality will always go back to the lowest position as long as the external force did not break the link between the ball and the end point; in a recommender system, however, the overall preference or quality could be distorted in this manner!)

With the explanation above, we have the following conclusion:

If there are no extra rating data (external forces) from users or items to update for a long period of time, we may take it as gradually losing their information (losing moving energy), and their preferences/qualities will eventually converge back to their priors (stationary point).

To make this conclusion more precise, assume that

    μ_prior = 0,  σ_prior = 1

and assume that one data point D_t = (j, i, c) arrives at time t. Through Algorithm 1, we obtain

    μ_inter = μ̃_statical = 3,  σ_inter = σ̃_statical = 0.5

We then apply (4.19)(4.20) to these four practical values (μ_prior, σ_prior, μ_inter, σ_inter) to generate a pair of real update values (μ̃_dynamical, σ̃_dynamical). Without any further data, we can observe the behaviour of (μ̃_dynamical(t), σ̃_dynamical(t)) by applying (4.19)(4.20) to (μ_prior, σ_prior, μ_inter = μ̃_dynamical(t − 1), σ_inter = σ̃_dynamical(t − 1)) many times, with a different ξ in each run (Figure 4.3).

Figure 4.3: Dynamical behaviour of μ̃_dynamical(t), σ̃_dynamical(t) with different ξ. The horizontal lines represent the prior's information (μ_prior, σ_prior).
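The drift back toward the prior shown in Figure 4.3 can be reproduced with the dynamical_update() sketch from earlier in this section; the ξ values and the number of steps below are illustrative.

    # One statical update has produced (mu, sigma) = (3, 0.5) against a N(0, 1) prior;
    # with no further data we keep feeding the result back through (4.19)-(4.20).
    mu_prior <- 0; s2_prior <- 1
    for (xi in c(0.1, 0.01, 0.001)) {
      mu <- 3; s2 <- 0.5^2
      for (t in 1:200) {
        upd <- dynamical_update(mu_prior, s2_prior, mu, s2, xi)
        mu <- upd$mu; s2 <- upd$s2
      }
      # larger xi drifts back toward (mu_prior, s2_prior) faster
      cat(sprintf("xi = %g: after 200 steps mu = %.3f, sigma = %.3f\n", xi, mu, sqrt(s2)))
    }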

Just like the damped pendulum problem, (μ̃_dynamical(t), σ̃_dynamical(t)) will ultimately converge back to where they started (the horizontal lines). Figure 4.3 also shows that the greater the ξ (gravitational force), the faster the speed of convergence (stopping). But exactly what the value of ξ should be is still undetermined; we will give a range of sensible values through the experimental results in Chapter 5.

The dynamical learning algorithm (Algorithm 2) is presented below. Recall the issues, stopping learning and cold start: both can be mitigated through the gradual forgetting mechanism, in the sense that we gradually lose a user's or item's information if the time period between the present and the last update is long enough. This mechanism may help reduce biases involving the latent variables of inactive users who have rated few items and the latent variables of unpopular items which have few raters.

Algorithm 2  Dynamical Learning Algorithm

Setup ξ_User, ξ_Item, ξ_γ
Setup prior: μ_α^(0), μ_β^(0), μ_θ^(0), σ_α^(0), σ_β^(0), σ_θ^(0), μ_γ^(0), σ_γ^(0)
Set current:
    μ_α = μ_α^(0), μ_β = μ_β^(0), μ_θ = μ_θ^(0), μ_γ = μ_γ^(0)
    σ_α = σ_α^(0), σ_β = σ_β^(0), σ_θ = σ_θ^(0), σ_γ = σ_γ^(0)
dynamical(prior, current, xi) { Apply (4.19)(4.20) }

At time t (t is related to time here):

(1) collect all rating data received within the time interval (t − 1, t]:
    D_t = (···, (j, i, c), ···)', an N_t-by-3 matrix,
    where N_t is the size of the data collected within the time interval.

(2) set up update flags (to indicate which user/item has been updated):
    flagU = {FALSE}_{1:J},  flagI = {FALSE}_{1:I}

(3) check {# of data} and update users'/items' information:
    if N_t == 0 then
        user_current = dynamical(user_prior, user_current, ξ_User)
        item_current = dynamical(item_prior, item_current, ξ_Item)
        cutpoints_current = dynamical(cutpoints_prior, cutpoints_current, ξ_γ)
    else
        for k = 1:N_t do
            D_k = D_t[k, ]
            inter = statical(D_k)
            user_current[D_k[1]] = dynamical(user_prior[D_k[1]], inter[1:4], ξ_User)
            item_current[D_k[2]] = dynamical(item_prior[D_k[2]], inter[5:6], ξ_Item)
            cutpoints_current[7:10] = dynamical(cutpoints_prior[7:10], inter[7:10], ξ_γ)
            flagU[D_k[1]] = TRUE
            flagI[D_k[2]] = TRUE
        end for
        user_current[!flagU] = dynamical(user_prior[!flagU], user_current[!flagU], ξ_User)
        item_current[!flagI] = dynamical(item_prior[!flagI], item_current[!flagI], ξ_Item)
    end if

5. Experimental Results

In this chapter, we first discuss the limiting properties of (4.13)(4.14), for they play the central part in all updates. Then, we conduct a simulation process to verify whether treating the cutpoints as variables (with priors) has any advantages. The following experiment is a comparison of statical learning and dynamical learning on the MovieLens 100k ratings dataset mentioned in Chapter 2. Finally, we present a configuration searching process to find the range of a subset of settings of priors and parameters. All the experimental results are obtained with the open-source statistical programming language R.

Before we start, readers should know the following notation:

    μ_γ = (μ_γ0, μ_γ1, μ_γ2, μ_γ3, μ_γ4, μ_γ5)'
    σ_γ² = (σ²_γ0, σ²_γ1, σ²_γ2, σ²_γ3, σ²_γ4, σ²_γ5)'
    μ_ℓ = (μ_αj, μ_βj, μ_θi)',  ∀ i, j
    σ_ℓ² = (σ²_αj, σ²_βj, σ²_θi)',  ∀ i, j
    ξ = (ξ_User, ξ_Item, ξ_γ)'

We will use these notations to indicate the pre-assigned numerical values of all priors' central tendency and variation in our R programs.

5.1 Limiting properties of Ω_ℓ and Δ_ℓ

The accuracy of all updates depends on the stability of Algorithm 1, the statical learning algorithm (see Section 4.1 for more details), and the most important parts dominating all

updates in this algorithm are the terms

    Ω_ℓ(Y_ij = c) = [ (1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (1/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ]    (5.1)

    Δ_ℓ(Y_ij = c) = [ (μ_{c−1}/ν²_{c−1}) · (1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (μ_c/ν²_c) · (1/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ] + ( Ω_ℓ(Y_ij = c) )²    (5.2)

To test the limiting properties of (5.1)(5.2), the first thing is to set up a set of temporary values:

    μ_γ = (−∞, 0, 3, 6, 9, ∞)'
    σ_γ² = (0, 0, 0.1, 0.1, 0.1, 0)'
    μ_ℓ = (x, 0.1, 0.1)'
    σ_ℓ² = (1, 0.1, 0.1)'

Then, we can visualize (5.1)(5.2) with these values (Figure 5.1). Note that the x in μ_ℓ represents μ_αj, which is a free variable in this experiment. Since we have set μ_γ0 = −∞ and μ_γ5 = ∞, there is no further potential to update them; that is why we set σ²_γ0 = σ²_γ5 = 0. As for setting σ²_γ1 = 0, it is because we need to pin down μ_γ1 to prevent the whole range of cutpoints from floating, just like an anchor to a vessel (see Appendix A.1).

The first two columns of Figure 5.1, starting from the left, are the curves of Ω_ℓ and Δ_ℓ, and the following three columns are the curves of Δ_ℓ multiplied by some update factors (please refer to (4.4)(4.6)(4.8) in Section 4.1). From the first row to the last row, each row uses a different rating level c and a different cutpoints pair (γ_{c−1}, γ_c). The vertical red lines within each panel of Figure 5.1 mark the location of the specific cutpoints pair. The horizontal red lines indicate the range of the update factors, kΔ. These factors are supposed to lie within (0, 1) because σ̃² ← σ²(1 − kΔ); if not, some updates of variables' posterior variance could change sign, which is unacceptable by the definition of variance.

Everything is fine until x surpasses a specific point, x*, and the curves start to resonate (Figure 5.2). (We call the point x* the resonating point because it is where the curves start resonating and become unstable in value.)

Figure 5.1: Visualization of (Ω_ℓ, Δ_ℓ)

Figure 5.2: Zoom in on Ω_ℓ(Y_ij = 1), Δ_ℓ(Y_ij = 1)

Figure 5.3: The p.d.f. and c.d.f. of the standard normal distribution N(0,1), φ(x), Φ(x)

This problem appears when x is large, such that

    φ(μ_{c−1}/ν_{c−1}) → 0,  φ(μ_c/ν_c) → 0,  Φ(μ_{c−1}/ν_{c−1}) → 1,  Φ(μ_c/ν_c) → 1  (Figure 5.3)

    ⇒ Ω_ℓ(Y_ij = c) = [ (1/ν_{c−1}) φ(μ_{c−1}/ν_{c−1}) − (1/ν_c) φ(μ_c/ν_c) ] / [ Φ(μ_{c−1}/ν_{c−1}) − Φ(μ_c/ν_c) ] → 0/0  (resonates)

Nevertheless, if μ_c/ν_c ≥ Φ^(-1)(1 − 10^(-14)) ≈ 7.650731, we can apply the following approximation in the corresponding update to fix this problem:

    (Ω_ℓ, Δ_ℓ) ≈ (−μ_c/ν²_c, 1/ν²_c)
    (Ω_γ, Δ_γ)(1, 0) ≈ (0, 0)
    (Ω_γ, Δ_γ)(0, 1) ≈ (−μ_c/ν²_c, 1/ν²_c)

(please refer to Appendix A.6 for more discussion about the approximation results).

With the approximation above, we can do linearization (blue lines) near the resonating points (red crosses) (Figure 5.4). Each blue line passes through the center of the range of each resonating curve, meaning that the values of the original curve can be approximated by the line once x exceeds the position of the resonating point.
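A quick numerical check of the threshold quoted above can be done in R (the test points are ours):

    qnorm(1 - 1e-14)               # approximately 7.650731, the threshold used in Algorithm 1
    pnorm(c(7, 7.65, 8.5)) == 1    # FALSE FALSE TRUE: once Phi(x) saturates at 1 in double
                                   # precision, the denominator of (5.1) is pure rounding error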

Figure 5.4: Linearization near resonating points

5.2 Algorithm Evaluation

5.2.1 Simulation

Our goal here is to verify which method of dealing with the cutpoints γ is more reliable. The methods we consider are: (1) fix them after estimating them from a portion of data at the beginning; (2) update them sequentially once their priors are determined. We call the first method Estimated and the second one Sequential.

Before starting, we define the initial values (μ^(0)_αj, (σ²_αj)^(0); μ^(0)_βj, (σ²_βj)^(0); μ^(0)_θi, (σ²_θi)^(0)). We set (μ^(0)_βj, (σ²_βj)^(0); μ^(0)_θi, (σ²_θi)^(0)) as (1, 0.1; 0, 0.1), and let (σ²_αj)^(0) be a free variable. (There is another free variable, (σ²_γc)^(0), which will be mentioned in what follows.) μ^(0)_αj will be set later, once some other information has been detailed.

First, assume that each user's real information (α_j, β_j) and each item's real information (θ_i) can be generated by sampling from the following random variables:

    α_j ~ Unif(1, 5),  β_j ~ N(μ^(0)_βj, (σ²_βj)^(0)),  θ_i ~ N(μ^(0)_θi, (σ²_θi)^(0))

where (I, J) are set as (100, 50), representing the 100 items and 50 users we are considering. Also, assume

    γ = (−∞, 1.5, 2.5, 3.5, 4.5, ∞)'

Then, the estimated information (α̃_j, β̃_j, θ̃_i, γ̃_c) can be obtained by running Algorithm 1 along with a pseudo dataset containing 50 × 100 ratings Y^gen_ij generated by randomly sampling from the following relations:

    Y*_ij = α_j + β_j θ_i + ε_ij ~ N(α_j + β_j θ_i, 1)
    Y*_ij ∈ (γ_{c−1}, γ_c] ⇔ Y^gen_ij = c,  c = 1:5

Note that we assume each user has rated all items in this simulation.
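A minimal R sketch of this generating process (the random seed is arbitrary and not part of the thesis):

    set.seed(1)
    n_item <- 100; n_user <- 50
    gamma <- c(-Inf, 1.5, 2.5, 3.5, 4.5, Inf)

    alpha <- runif(n_user, 1, 5)                     # users' "real" latent locations
    beta  <- rnorm(n_user, mean = 1, sd = sqrt(0.1)) # users' latent discriminations
    theta <- rnorm(n_item, mean = 0, sd = sqrt(0.1)) # items' latent qualities

    # Y*_ij = alpha_j + beta_j * theta_i + eps, eps ~ N(0, 1); rows are items, columns users.
    ystar <- sweep(outer(theta, beta), 2, alpha, "+") +
             matrix(rnorm(n_item * n_user), n_item, n_user)
    ygen  <- matrix(cut(ystar, breaks = gamma, labels = FALSE), n_item, n_user)

    # Flatten to the (user, item, rating) record format used by Algorithm 1.
    pseudo <- data.frame(user = rep(1:n_user, each = n_item),
                         item = rep(1:n_item, times = n_user),
                         rating = as.vector(ygen))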

To execute Algorithm 1, we initialize the latent variables' prior information by

    μ_γ^(0) = (−∞, (∗), ∞)'
    (σ_γ²)^(0) = (0, 0, (σ²_γc)^(0), (σ²_γc)^(0), (σ²_γc)^(0), 0)'
    μ_ℓ^(0) = ((∗∗), μ^(0)_βj, μ^(0)_θi)'
    (σ_ℓ²)^(0) = ((σ²_αj)^(0), (σ²_βj)^(0), (σ²_θi)^(0))'

where (∗) is estimated from the pseudo dataset above, and (∗∗) is set as the center of (∗). The estimation of (∗) goes as follows. First, compute the z-scores corresponding to the cumulative proportions of #{Y^gen_ij = c}/1,000, c = 1:4, from 1,000 randomly sampled ratings in the pseudo dataset. Then, multiply the z-scores by σ^(0)(Y_ij), where (σ^(0)(Y_ij))² = 1 + (σ²_αj)^(0) + ((σ²_βj)^(0) + (μ^(0)_βj)²)((σ²_θi)^(0) + (μ^(0)_θi)²). Finally, shift the results so that the first element of (∗) is 1.5, to align with γ_1. Note that, if we are using estimated cutpoints, (σ²_γc)^(0) is set as 0. For the reasoning behind setting (σ²_γ0)^(0) = (σ²_γ1)^(0) = (σ²_γ5)^(0) = 0, see Appendix A.1.

Once a set of ((σ²_αj)^(0), (σ²_γc)^(0)) and the method of dealing with cutpoints are decided, we can execute Algorithm 1 for one round to get one set of estimated information μ̃^(t), which can be used to calculate RMSE_μ(μ̃^(t)) to measure the distance between the real information (μ) and the estimated one, where RMSE_μ(μ̃^(t)) = sqrt( Σ_{i=1}^{n} (μ_i − μ̃_i^(t))² / n ), n = I or J. We repeat these steps for 100 rounds to calculate the average RMSE,

    RMSE_μ = (1/100) Σ_{t=1}^{100} RMSE_μ(μ̃^(t))

to compare the accuracy between estimated γ and sequentially updated γ under a given setting (Table 5.1). That we only choose (σ²_αj)^(0) and (σ²_γc)^(0) to be free variables, and not all of them, is to test whether using different combinations of initial values can cause significantly different results under a simpler scenario.

Remarks

Table 5.1 shows that, generally, when cutpoints are updated sequentially, the estimated α̃, β̃, θ̃, and γ̃_c are closer to the real ones. RMSE_α surges when (σ²_αj)^(0) changes from 1

to 10 under estimated cutpoints, indicating that the statical learning algorithm might be more sensitive to the selection of priors when cutpoints are fixed after being estimated from a portion of data. Note that smaller (σ²_αj)^(0) and (σ²_γc)^(0) seem to give better results.

5.2.2 Statical vs. dynamical

The previous subsection suggests the use of sequential update schemes for cutpoints. Since time-series information cannot be artificially imposed on simulation data, we are not able to compare statical learning and dynamical learning at the simulation stage. However, with the 100k ratings dataset from MovieLens, we can start to make comparisons. The dataset is first sorted by its timestamp column, then separated into a training dataset and a testing dataset with the ratio 70/30. Since we are conducting small experiments, instead of considering data points within every second or minute, we let the time unit be a day by dividing the timestamp column by 86,400 seconds.

Because we do not know each user's and item's real information, we compare the training/testing errors each algorithm produces. Training/testing errors can be computed by RMSE_{Y*}(Y^rep), where Y* is the vector of ratings in either the training dataset or the testing dataset, and Y^rep represents a vector of reproduced ratings generated by the following formula:

    Y^rep_ij = E(Y*_ij | Y^obs_ij) = Σ_{c=1}^{5} c · p(Y*_ij = c | Y^obs_ij)
             ≈ Σ_{c=1}^{5} c · ( Φ(μ̃_αj + μ̃_βj μ̃_θi − μ̃_γ{c−1}) − Φ(μ̃_αj + μ̃_βj μ̃_θi − μ̃_γc) )    (5.3)

For the reasoning behind the approximation

    p(Y*_ij = c | Y^obs_ij) ≈ Φ(μ̃_αj + μ̃_βj μ̃_θi − μ̃_γ{c−1}) − Φ(μ̃_αj + μ̃_βj μ̃_θi − μ̃_γc)

see Weng and Coad (2018, p.15-16). We initialize the settings of priors and parameters with the following configuration:

    μ_γ^(0) = (−∞, (∗), ∞)'
    (σ_γ²)^(0) = (0, 0, 0.01, 0.01, 0.01, 0)'

Method      (σ²_αj)^(0)  (σ²_γc)^(0)  RMSE_α  RMSE_β  RMSE_θ  γ_2 (2.5)  γ_3 (3.5)  γ_4 (4.5)
Estimated         1          0         0.225   0.271   0.147    2.403      3.277      4.184
Estimated        10          0         1.829   0.301   0.233    3.578      5.558      7.651
Sequential        1          0.1       0.2     0.267   0.148    2.41       3.347      4.278
Sequential        1          0.01      0.196   0.267   0.145    2.405      3.333      4.314
Sequential       10          0.1       1.217   0.286   0.227    3.311      4.793      5.867
Sequential       10          0.01      1.449   0.291   0.236    3.532      5.071      6.253

Table 5.1: Comparison of estimated γ and sequentially updated γ

    μ_ℓ^(0) = ((∗∗), 1, 0)'
    (σ_ℓ²)^(0) = (1, 0.1, 0.1)'
    ξ = (0.0001, 0.0001, 0.0001)'

and run either the statical learning algorithm or the dynamical learning algorithm to generate μ̃_αj, μ̃_βj, μ̃_θi, μ̃_γ{c−1}, μ̃_γc, which can then be plugged into (5.3) to produce Y^rep_ij. Again, (∗) is now obtained from the first 1,000 ratings in the training dataset (rather than from simulated data), and (∗∗) is set as the center of (∗). We continue setting (μ^(0)_βj, (σ²_βj)^(0); μ^(0)_θi, (σ²_θi)^(0)) as (1, 0.1; 0, 0.1) to reduce complexity, and we will search different choices of them in the next section. ((σ²_αj)^(0), (σ²_γc)^(0)) is set as (1, 0.01), for this combination generated better results under sequentially updated cutpoints in Table 5.1. Setting ξ = (0.0001, 0.0001, 0.0001)' is for convenience, and we will search different choices of them in the next section as well. The results are presented in Table 5.2.

Remarks

Table 5.2 shows that, although the differences between the two learnings seem negligible, under sequentially updated cutpoints statical learning works better on the training dataset but slightly worse on the testing dataset compared with dynamical learning, indicating that dynamical learning might be more adaptive to new data. For further investigation, we compare the quality of updates under each type of learning (Figure 5.5). Figure 5.5 shows that, generally, the biggest difference is in the size of σ²_γc: it is slightly larger under dynamical learning than under statical learning once the index of the data point exceeds 20,000, which might be the reason why dynamical learning is more adaptive to new data.
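In R, (5.3) and the error measure can be written compactly by reusing the rating_probs() helper sketched in Chapter 3; the function names are ours.

    # Reproduced rating (5.3): the expected rating under the estimated posterior means.
    predict_rating <- function(mu_a, mu_b, mu_t, mu_g) {
      p <- rating_probs(mu_a, mu_b, mu_t, mu_g)   # P(Y = c), c = 1, ..., 5
      sum(seq_along(p) * p)
    }

    # Training/testing error: root mean squared difference between observed and
    # reproduced ratings.
    rmse <- function(y_obs, y_rep) sqrt(mean((y_obs - y_rep)^2))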

Figure 5.5: The quality of updates with the training dataset

Type        training error  testing error
Statical        0.910           1.044
Dynamical       0.914           1.034

Table 5.2: Comparison of statical learning and dynamical learning

5.3 Configuration searching process

In this section, we keep using the 100k ratings dataset, separated into training data (70%) and testing data (30%), and present a configuration searching process to find the range of a subset of settings of priors and parameters, so as to construct two sets of top 10 configurations for Algorithm 1 and Algorithm 2. Sequentially updated cutpoints initialized by the first 1,000 ratings in the training dataset are used, based on the experimental results in the last section. The effectiveness of the found range of each setting is verified by the training error and testing error. Moreover, we use the Spearman correlation to discuss the sensitivity of each algorithm to the settings of priors and parameters.

First, define a set containing some possible values for the selected settings,

    S := S_{μ^(0)_βj} × S_{(σ²_βj)^(0)} × S_{(σ²_θi)^(0)} × S_{ξγ} × S_{(ξUser, ξItem)},  ξ_γ ≥ (ξ_User, ξ_Item)

where

    S_{μ^(0)_βj} := {1, 0.1, 0.01}
    S_{(σ²_βj)^(0)} := {0.1, 0.01, 0.001}
    S_{(σ²_θi)^(0)} := {0.1, 0.01, 0.001}
    S_{ξγ} := {0.001, 0.0001, 0.00001}
    S_{(ξUser, ξItem)} := {0.001, 0.0001, 0.00001}

Then, pick one configuration s_k ∈ S and run both Algorithm 1 and Algorithm 2 to produce a training error and a testing error. In addition, the rank of the estimated central tendencies of items' latent quality, rank(μ̃_θ), under each algorithm is recorded, which is used to calculate the Spearman correlation mentioned at the beginning of this chapter. To simplify the process, we assume that

    μ_γ^(0) = (−∞, (∗), ∞)',  (σ_γ²)^(0) = (0, 0, 0.01, 0.01, 0.01, 0)'
    μ^(0)_αj = mean(∗),  (σ²_αj)^(0) = 1,  μ^(0)_θi = 0,  ∀ i, j

where (∗) is estimated from the first 1,000 ratings in the training dataset as usual.

The most important setting is (σ²_βj, σ²_θi)^(0). It is the small-variance assumption on both σ²_βj and σ²_θi that makes it possible to obtain (4.3)-(4.18), the update formulas in Section 4.1 (for more details about the small-variance assumption, which is not mentioned in Section 4.1, please

refer to Appendix A.1). But how small they have to be is an open question; that is why we start from some small value, say 0.1, and decrease it exponentially to search for the scale at which the results are more reliable.

Setting S_{ξγ} and S_{(ξUser, ξItem)} is to find appropriate values of ξ_User, ξ_Item, ξ_γ for dynamical learning. To lower the time complexity, we let the values of ξ_User and ξ_Item be the same in each run.

Remarks

Table 5.3 shows that statical learning tends to work better with μ^(0)_βj = 1 and larger (σ²_βj)^(0) and (σ²_θi)^(0). In addition, Table 5.5 shows that the No. 6, 9, and 10 configurations seem to generate relatively different outcomes, resulting in smaller Spearman correlations, indicating that either setting (σ²_θi)^(0) too small or setting (σ²_βj)^(0) on a significantly different scale from μ^(0)_βj is not recommended.

Table 5.4 consists of many similar configurations with different combinations of (ξ_γ, (ξ_User, ξ_Item)). It is clear that the searching set for dynamical learning is much larger than for statical learning due to these extra parameters. Excluding ξ, the patterns of the configurations for dynamical learning seem consistent with the ones for statical learning: both tend to work better with μ^(0)_βj = 1 and (σ²_βj)^(0) = (σ²_θi)^(0) ≥ 0.01. However, with some specific combinations of (ξ_γ, (ξ_User, ξ_Item)), dynamical learning might outperform statical learning (No. 1, 2, and 4). Through Table 5.6, we see that different combinations of (ξ_γ, (ξ_User, ξ_Item)) seem to generate similar results, indicating that, even though (μ̃, σ̃²) might change with different configurations, resulting in different training and testing errors, the overall effects (the rank of latent ability or latent quality) are similar, which in turn shows that the algorithm is not that sensitive to the given configurations. As a rule of thumb, we recommend (ξ_User, ξ_Item) ≤ ξ_γ = 0.001.
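The grid S can be enumerated directly in R; a sketch with our own column names:

    # Enumerating the search set S of Section 5.3.
    S <- expand.grid(
      mu_b0 = c(1, 0.1, 0.01),            # S_{mu_beta^(0)}
      s2_b0 = c(0.1, 0.01, 0.001),        # S_{(sigma^2_beta)^(0)}
      s2_t0 = c(0.1, 0.01, 0.001),        # S_{(sigma^2_theta)^(0)}
      xi_g  = c(0.001, 0.0001, 0.00001),  # S_{xi_gamma}
      xi_ui = c(0.001, 0.0001, 0.00001)   # S_{(xi_User, xi_Item)}, with xi_User = xi_Item
    )
    S <- S[S$xi_g >= S$xi_ui, ]           # impose the constraint xi_gamma >= (xi_User, xi_Item)
    nrow(S)                               # 162 configurations; statical learning depends only
                                          # on the first three columns (27 distinct settings)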

No   $\mu_{\beta_j}^{(0)}$   $(\sigma_{\beta_j}^2)^{(0)}$   $(\sigma_{\theta_i}^2)^{(0)}$   training error   testing error
1    1      0.1     0.1     0.910   1.044
2    1      0.01    0.1     0.921   1.042
3    1      0.001   0.1     0.927   1.042
4    0.1    0.1     0.1     0.928   1.154
5    1      0.1     0.01    0.947   1.107
6    0.01   0.1     0.1     0.948   1.177
7    1      0.01    0.01    0.969   1.111
8    1      0.001   0.01    0.973   1.113
9    0.1    0.01    0.1     0.986   1.166
10   1      0.1     0.001   1.016   1.175

Table 5.3: Top 10 configurations for statical learning (sorted by training error)

No   $\mu_{\beta_j}^{(0)}$   $(\sigma_{\beta_j}^2)^{(0)}$   $(\sigma_{\theta_i}^2)^{(0)}$   $\xi_\gamma$   $(\xi_{\mathrm{User}}, \xi_{\mathrm{Item}})$   training error   testing error
1    1   0.1    0.1   0.001   1e-04   0.908   1.044
2    1   0.1    0.1   0.001   1e-05   0.908   1.044
3    1   0.1    0.1   0.001   0.001   0.909   1.048
4    1   0.1    0.1   1e-05   1e-05   0.910   1.043
5    1   0.1    0.1   1e-04   1e-04   0.914   1.034
6    1   0.1    0.1   1e-04   1e-05   0.914   1.034
7    1   0.01   0.1   0.001   1e-04   0.919   1.041
8    1   0.01   0.1   0.001   1e-05   0.919   1.041
9    1   0.01   0.1   0.001   0.001   0.921   1.045
10   1   0.01   0.1   1e-05   1e-05   0.922   1.041

Table 5.4: Top 10 configurations for dynamical learning (sorted by training error)

[Upper-triangular 10 × 10 matrix of pairwise Spearman correlations among the configurations in Table 5.3; entries range from 0.887 to 1.000.]

Table 5.5: Spearman correlation of the pairwise $\mathrm{rank}(\tilde{\mu}_\theta)$ under statical learning

[Upper-triangular 10 × 10 matrix of pairwise Spearman correlations among the configurations in Table 5.4; entries range from 0.993 to 1.000.]

Table 5.6: Spearman correlation of the pairwise $\mathrm{rank}(\tilde{\mu}_\theta)$ under dynamical learning

6 Conclusions

Through experiments, we have demonstrated two things. First, rather than keeping the cutpoints fixed after estimating them from a portion of the data, updating them sequentially after setting up their priors produces better results, for it is less sensitive to improper priors (Table 5.1). Second, although statical learning's computational time is less than dynamical learning's, under some constraints dynamical learning can outperform statical learning (Table 5.2). The computational time on the MovieLens 100k ratings dataset is about 7 seconds for statical learning (throughput: about 14,285 ratings/sec) and about 11 seconds for dynamical learning (throughput: about 9,090 ratings/sec), with the following hardware specification: OS: Windows 8.1 64-bit; CPU: Intel(R) Core(TM) i7-4720HQ @ 2.60GHz; RAM: 12.0 GB.

Suitable configurations for setting up the priors and parameters have been found, and we observed that the two types of learning algorithms prefer similar configurations (Tables 5.3-5.4). It is possible that, with rating datasets from other sources, the top 10 configurations we found could become improper, so further verification is worthwhile.

For now, we recommend the following configuration to initialize the priors' information and the parameters for both algorithms:

$$\mu_\gamma^{(0)} = (-\infty, (*), \infty)', \qquad (\sigma_\gamma^2)^{(0)} = (0, 0, 0.01, 0.01, 0.01, 0)',$$
$$\mu_\ell^{(0)} = \big(\mathrm{mean}(*),\ 1,\ 0\big)', \qquad (\sigma_\ell^2)^{(0)} = (1, 0.1, 0.1)',$$
$$\xi = (0.001 \sim 0.00001,\ 0.001 \sim 0.00001,\ 0.001)',$$

where $(*)$ is estimated from a portion of the data at hand.
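As an illustration only, the recommended initialization above can be collected into a single configuration object. The sketch below uses our own (hypothetical) names; cutpoints_hat and alpha_mean_hat stand for the quantities marked $(*)$, which must first be estimated from a portion of the ratings at hand.

```python
# One possible encoding of the recommended initialization; key names are illustrative only.
import math

def recommended_config(cutpoints_hat, alpha_mean_hat,
                       xi_user=1e-3, xi_item=1e-3, xi_gamma=1e-3):
    return {
        # priors of the cutpoints gamma
        "mu_gamma":  [-math.inf, *cutpoints_hat, math.inf],
        "var_gamma": [0, 0, 0.01, 0.01, 0.01, 0],
        # priors of (alpha_j, beta_j, theta_i): means and variances
        "mu":        [alpha_mean_hat, 1, 0],
        "var":       [1, 0.1, 0.1],
        # dynamics parameters (xi_User, xi_Item, xi_gamma);
        # xi_User and xi_Item may be lowered to 1e-5, xi_gamma stays at 1e-3
        "xi":        [xi_user, xi_item, xi_gamma],
    }

# example with made-up estimates of the interior cutpoints and of mean(alpha)
cfg = recommended_config(cutpoints_hat=[-1.5, -0.5, 0.5, 1.5], alpha_mean_hat=0.2)
```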

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. Statistical theories of mental test scores.

Coelho, F. C., Codeço, C. T., and Gomes, M. G. M. (2011). A bayesian framework for parameter estimation in dynamical models. PloS one, 6(5):e19616.

Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft's bing search engine. Omnipress.

Harper, F. M. and Konstan, J. A. (2016). The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.

Ho, D. E. and Quinn, K. M. (2008). Improving the presentation and interpretation of online ratings data with model-based figures. The American Statistician, 62(4):279-288.

McNeish, D. (2016). On using bayesian methods to address small sample problems. Structural Equation Modeling: A Multidisciplinary Journal, 23(5):750-773.

Moser, J. (2010). The math behind trueskill.

Muraki, E. (1990). Fitting a polytomous item response model to likert-type data. Applied Psychological Measurement, 14(1):59-71.

Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, volume 4, pages 321-333.

Samejima, F. (1970). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 35(1):139-139.

Shane Mac (2016). The pendulum. My attempt at building a diverse company from the start.

Strogatz, S. H. (1994). Nonlinear dynamics and chaos: With applications to physics, biology, chemistry and engineering.

Van De Schoot, R., Broere, J. J., Perryck, K. H., Zondervan-Zwijnenburg, M., and Van Loey, N. E. (2015). Analyzing small data sets using bayesian estimation: the case of post-traumatic stress symptoms following mechanical ventilation in burn survivors. European Journal of Psychotraumatology, 6(1):25216.

van der Linden, W. J. (2010). Item response theory. In International Encyclopedia of Education, pages 81-88.

Weng, R. C.-H. and Coad, D. S. (2018). Real-time bayesian parameter estimation for item response models. Bayesian Analysis, 13(1):115-137.

Weng, R. C.-H. and Lin, C.-J. (2011). A bayesian approximation method for online ranking. Journal of Machine Learning Research, 12(Jan):267-300.
