Incorporating Response Time to Model Test Behavior with Mixture SEM


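(1) National Taiwan Normal University, Department of Mathematics, Master's Program. Master's Thesis. Advisor: Dr. Rung-Ching Tsai. Incorporating Response Time to Model Test Behavior with Mixture SEM. Graduate student: Shu-Chen Chan. July 2013 (Republic of China year 102).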

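(2) Acknowledgments. Time flies, and my three years of graduate study are coming to a close. I am very grateful to my advisor, Dr. Rung-Ching Tsai. Thanks to my advisor's guidance, I grew a great deal in building models. More importantly, my advisor's dedication to research, including the late nights spent working alongside us, moved me deeply and made my own shortcomings clear to me. I thank the members of my oral examination committee, Dr. 林定香 and Dr. 蔡碧玟, for attending my defense and offering suggestions. Under the guidance of all these teachers, my thesis moved forward and my abilities kept improving. I also thank the assessment and research center at James Madison University for generously providing the data, which allowed our research to proceed smoothly. Finally, I thank my family, my graduate school classmates, and my friends. Their help along the way made it possible for me to complete this research. This period of research will remain an important part of my life. I am fortunate to have had such an experience during these three years and to have learned that research requires not only focus and patience but also an active and enthusiastic mind. I hope to keep the same attitude in the future, whether in teaching or in other work.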

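(3) Abstract. Response time has been shown to distinguish examinees with different test-taking behavior, and mixture item response theory models have been proposed to explore response patterns in item data. In the recent literature, data analysis with the mixture Rasch model (MRM) augmented with item response time (mixture Rasch model with response time components, MRM-RT) showed that dividing test behavior into two classes, rapid guessing and solution behavior, fit the data better than a single class of solution behavior. However, MRM-RT cannot explain why some members of the rapid-guessing class take far less time than other members of the same class. It may therefore be necessary to add one more class, fast respondents, to help explain the data. In this study, we embed item responses and response time simultaneously in the mixture structural equation model framework, add the fast-respondent class to facilitate model analysis, and reanalyze the data. Simulation results show that MRM-RT performs better than MRM. Specifically, MRM-RT has a better convergence rate, yields more accurate parameter estimates, describes test-taking behavior better, and allows measurement invariance across latent classes to be assessed. In addition, maximum likelihood estimation with robust standard errors takes far less time than Bayesian estimation by Markov chain Monte Carlo, which makes MRM-RT easier to study and estimate. Keywords: response time, mixture Rasch model.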

(4) Incorporating Response Time to Model Test Behavior with Mixture SEM. Shu-Chen Chan. August 20, 2013. Advisor: Rung-Ching Tsai. Mathematics, National Taiwan Normal University, Taipei, Taiwan.

(5) Abstract. Item response time has been shown to be valuable in identifying different test behavior of test takers, and mixtures of item response models have been proposed to explore response patterns in test data. In recent literature, a mixture Rasch model with response time components (MRM-RT) showed that a two-class solution representing rapid-guessing and solution-behavior examinees fit the test data better than a one-class solution. However, the two-class MRM-RT could not account for the much shorter response time of some members of the rapid-guessing class in the test data. Thus, the inclusion of an additional class of fast respondents might be necessary to fulfill the assumption of conditional independence of item responses and response time given the latent class. In this study, we embed such a simultaneous analysis of item responses and response time in the mixture structural equation model framework, which in turn facilitates the estimation of a three-class model with the fast-respondent class added, and we reanalyze the empirical test data. Our simulation results indicate that the MRM-RT performs better than the mixture Rasch model alone. Specifically, the MRM-RT has a better convergence rate, yields more accurate item parameter estimates, describes test-taking behavior better, and allows measurement invariance across latent classes to be assessed. In addition, maximum likelihood estimation with robust standard errors takes much less time than Bayesian estimation by Markov chain Monte Carlo and therefore makes the estimation of MRM-RT more accessible to researchers. Keywords: item response time, mixture Rasch model.

(6) Contents
1 Introduction
2 Model
  2.1 MRM-RT in Mixture SEM
    2.1.1 Mixture Rasch Model
    2.1.2 Mixture Response Time Model
    2.1.3 MRM-RT
  2.2 Estimation
  2.3 Identification
3 Simulation Studies
  3.1 Data Generation
  3.2 Evaluation Criteria
  3.3 Results
4 Applied Data Analysis
  4.1 Data
  4.2 Results
5 Discussion
6 Conclusion

(7) List of Tables
1 Mean and Variance Parameters of Response Log-Time for RG, SB and HARF Classes
2 Item Difficulty Parameters for RG, SB, and HARF Classes
3 Percentages of Replications that Converge under MRM Fittings
4 Mean, SD, and MSE of the Estimates for Three Class Sizes, Mean Ability of HARF Class, and Ability Variances of SB and HARF Classes, under MRM-RT and MRM Fittings
5 Mean, SD, and MSE of Difficulty Parameter Estimates for the SB Class with Sample Size of 250, under MRM-RT and MRM Fittings
6 Mean, SD, and MSE of Difficulty Parameter Estimates for the HARF Class with Sample Size of 250, under MRM-RT and MRM Fittings
7 Mean, SD, and MSE of Difficulty Parameter Estimates for the SB Class with Sample Size of 500, under MRM-RT and MRM Fittings
8 Mean, SD, and MSE of Difficulty Parameter Estimates for the HARF Class with Sample Size of 500, under MRM-RT and MRM Fittings
9 Mean, SD, and MSE of Difficulty Parameter Estimates for the SB Class with Sample Size of 1000, under MRM-RT and MRM Fittings
10 Mean, SD, and MSE of Difficulty Parameter Estimates for the HARF Class with Sample Size of 1000, under MRM-RT and MRM Fittings
11 Mean, SD, and MSE of Difficulty Parameter Estimates for the SB Class with Sample Size of 2000, under MRM-RT and MRM Fittings
12 Mean, SD, and MSE of Difficulty Parameter Estimates for the HARF Class with Sample Size of 2000, under MRM-RT and MRM Fittings
13 Model Fit Indices of the Two-Class, Three-Class, and Four-Class Models
14 Estimates of Class Sizes of the Three-Class Model
15 Estimates of Mean Response Time for SB 1 and SB 2
16 Estimates of Difficulty Parameters for SB 1 and SB 2
17 Fit Indices for Three-Class Models with and without Equality Constraints on Difficulty Parameters for SB 1 and SB 2

(8) List of Tables (continued)
18 Fit Indices for Three-Class Models with and without Equality Constraint on Difficulty Parameters for RG
19 Fit Indices of Two-Class and Three-Class Models without Equality Constraint on Difficulty Parameters for RG
20 Estimates of Difficulty Parameters for RG and for Both Solution Behavior Classes, SB 1 and SB 2
21 Estimates of Class Sizes of the Three-Class Model with the Equality Constraint on Difficulty Parameters between SB 1 and SB 2 and without the Equality Constraint on Difficulty Parameters for RG
22 Estimates of Mean Ability for RG, SB 1, and SB 2

(9) List of Figures
1 Path diagram of MRM-RT model
2 The response time distribution on item 13 for the RG, SB and HARF classes
3 Observed and expected mean response probability of getting a correct answer for each latent class of examinees classified using the greatest posterior probability
4 Observed and expected mean response probability of getting a correct answer for each latent class without the equality constraint on difficulty parameters for RG
5 Estimates of difficulty parameters and mean response log-time for rapid-guessing class (RG), solution behavior class 1 (SB 1), and solution behavior class 2 (SB 2)

(10) 1. Introduction. Item response time is the length of time that an examinee spends on an item in a test. Response time provides information about the test-taking behavior or personality styles of examinees (McIntyre, 2011; Schnipke & Scrams, 1997). It provides a tool for researchers in educational testing, psychology, and psycholinguistics to answer various research questions, such as understanding the relation between reaction time and error rate at the item and subject level (Loeys, Rosseel, & Baten, 2011). It also benefits the estimation of item response theory (IRT) models. For example, one can reduce the estimation bias of IRT model parameters by incorporating the time information from examinees whose response time is unreasonably short (Ferrando & Lorenzo-Seva, 2007; Meyer, 2010; Oshima, 1994; Ranger & Kuhn, 2012; Schnipke & Scrams, 1997; Wise & DeMars, 2006). Moreover, rapid-guessing behavior of examinees in a test can cause items to be wrongly identified as exhibiting differential item functioning (DIF) for some manifest groups (DeMars & Wise, 2010). With the help of response time, it becomes possible to account for examinees with such rapid-guessing behavior and thereby detect DIF items more accurately. Because computer-based testing and computerized adaptive testing are increasingly common in real applications, response time data have become more accessible. Thus, there has been growing interest in the joint modeling of response time and item responses in recent years (Lee & Chen, 2011; Meyer, 2010; Schnipke & Scrams, 2002; van der Linden, 2007). Under the IRT framework, the use of response time can be divided into three categories: direct inclusion of response time parameters in the model for the item responses, joint modeling of item responses and response time, and use of response time to identify useful information in test responses. For the first category, there are various approaches.
First, Roskam (1987) extends the Rasch model with an additive log-time parameter among the regressors. Second, Wang and Hanson (2005) propose the four-parameter logistic response time model (4PLRT), in which response time is incorporated into the three-parameter logistic (3PL) IRT model. They consider that performance on an item is affected by, or depends on, the time spent on that item in a test with a time limit. In 4PLRT, one term composed of response time, an item slowness parameter, and a person slowness parameter is added in the exponent of the

(11) 3PL model to determine the time-related rate of increase or decrease in the probability of getting a correct answer. The key idea of these models is that the probability of a correct answer is influenced not only by the examinee's ability but also by the response time spent on the item. As a result, item responses can be modeled more effectively. For the second category, item responses and response time are modeled simultaneously. Ability and speed can be considered related traits of the examinees, with item responses and response time modeled conditional on these two latent traits (Entink, Fox, Hornke, & Kuhn, 2009; van der Linden, 2007). Another approach regards ability as the common trait underlying both item responses and response time and introduces an additional trait for response time in a joint model of item responses and response time (Ferrando & Lorenzo-Seva, 2007; Ranger & Kuhn, 2012). By modeling item responses and response time simultaneously, these models borrow ability-related information from response time, which improves the estimation of item difficulty parameters and reduces the variance of parameter estimates in the IRT models. Entink, Fox, Hornke, and Kuhn (2009) make explicit use of both structural and measurement models to characterize the hierarchical model introduced by van der Linden (2007). They further propose using a function of item parameters to describe the underlying cognitive structures. In addition, to investigate the relationship between response styles and personality in self-report personality data, McIntyre (2011) uses the latent structure of a set of traits to model item responses and response time jointly under a mixture structural equation model (SEM); the model is estimated with Mplus (Muthén & Muthén, 1998-2010) using maximum likelihood estimation with the expectation-maximization algorithm.
All models in the above two categories are based on the assumption that examinees actually try to answer or respond to the test items with some effort. They do not account for careless responses from the examinees. However, it is widely known that the assumption of conditional independence might be violated when different response types are present, and such a violation would in turn result in biased parameter estimates. Models belonging to the third category try to take this possibility into consideration and use the information in response time to help identify examinees with certain test behavior, such as simply guessing without much thinking. With the careless responses disregarded from the response modeling, one would be able to

(12) recover the parameter estimates better. The effort-moderated item response model (EMIRM; Wise & DeMars, 2006) classifies item responses as either actually contributing to the estimation of the IRT model or simply careless responding or guessing. The classification is done through a visual inspection of the plots of response time for all the items; once the response time exceeds a threshold, the response is considered to have been given with effort and is therefore useful for parameter estimation of IRT models. The mixture Rasch model with response time components (MRM-RT; Meyer, 2010), on the other hand, uses response time to classify examinees, not responses, to reduce the influence of the unwanted information from rapid-guessing responses on parameter estimation of the Rasch model. MRM-RT assumes that item responses and response time are independent given the latent classes. Response time thus functions differently in the two models: in EMIRM, the response to any item answered with a short response time is disregarded, whereas in MRM-RT, examinees who consistently have short response times are set aside when their contribution to parameter estimation is considered. Meyer (2010) mainly considers two kinds of test-taking behavior: rapid-guessing and solution behavior (Schnipke & Scrams, 1997; Wise & Kong, 2005). It is assumed that examinees can be divided into latent classes of different testing behavior based on their response time, and within each class the response time and item responses are modeled simultaneously with the proposed IRT models. Model parameters are allowed to be class-specific. Therefore, the test-taking behavior in each class can be characterized by examining the parameter estimates for the distribution of its response time. The analysis of extant data showed that a two-class solution representing rapid-guessing and solution-behavior examinees fit the test data better than the one-class solution (Meyer, 2010).
However, the distribution of response time for the rapid-guessing class appears bimodal. That is, there seems to be a small group of examinees who “answer” rather than “guess” the items much more quickly than others in the class. Thus, Meyer (2010) suggests that one might want to include an additional class to better characterize this small group of examinees. However, with the previously proposed Markov chain Monte Carlo procedure for Bayesian estimation of MRM-RT, fitting a three-class model is time-consuming and complicated. In this study, we embed MRM-RT into the mixture SEM framework. Estimation by maximum likelihood with robust standard errors (MLR) takes much less time than the Bayesian approach. Furthermore, we can easily extend the number of

(13) latent classes under the mixture SEM framework. In the simulation studies, we add one more latent class, consider a three-class MRM-RT under the mixture SEM framework, and compare the parameter recovery of MRM-RT with that of the mixture Rasch model alone. In the next section, the MRM-RT is reviewed and the mixture SEM framework is introduced. In Section 3, a simulation study is conducted to demonstrate the performance of parameter estimation using MLR. The empirical data analyzed in Meyer (2010) are revisited in Section 4 to assess the necessity of a three-class solution. Finally, the thesis ends with some discussion and concluding remarks.

(14) 2. Model. In this section, we first briefly describe item response models with latent classes. Second, we introduce response time models with latent classes. Finally, we combine them under the mixture SEM framework. Denote examinees by $i = 1, \dots, N$, items by $j = 1, \dots, J$, and latent classes by $c = 1, \dots, C$.

2.1. MRM-RT in Mixture SEM.

2.1.1 Mixture Rasch Model

The mixture Rasch model (MRM; Rost, 1990) extends the traditional Rasch model (RM; Rasch, 1960) by adding a number of unobserved subpopulations. The Rasch model is described as follows:

$$p(u_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}, \tag{1}$$

where $u_{ij} = 1$ denotes the event of a correct response by examinee $i$ on item $j$, $\theta_i$ represents the ability of examinee $i$, and $b_j$ is the difficulty parameter of item $j$. The RM is an important model for the probability of a correct answer; it describes the relationship between an item and an examinee's ability on a psychological or educational test. However, the population may not be homogeneous with respect to the latent variable of the RM. Since there is only one continuous variable, the examinee's ability, in the latent space of the RM, the RM does not accurately describe the generation of item responses in that case. Furthermore, the population may be split into several unobserved subpopulations, and individual response propensities to an item may differ across these subpopulations. Rost (1990) proposed the MRM, which combines the theoretical strength of the RM with the exploratory power of latent class analysis. A key feature of the MRM is that the latent classes are mutually exclusive and the RM holds within each latent class. In other words, the item difficulty parameters are allowed to differ across the mutually exclusive latent classes. In MRM, the item response $U_{ij}$ for item $j$ and examinee $i$ who belongs to latent

(15) class $c$ can be formulated as follows:

$$U_{ij} \mid \theta_i, b_{cj}, c_i \sim \mathrm{Bernoulli}(P_{ij}),$$

$$g(u_{ij} \mid \theta_i, b_{cj}, c_i) = P_{ij}^{u_{ij}} (1 - P_{ij})^{1 - u_{ij}}, \tag{2}$$

with

$$\mathrm{logit}(P_{ij}) = \mathrm{logit}(P(u_{ij} = 1 \mid \theta_i, b_{cj}, c_i)) = \theta_i - b_{cj}, \quad c_i = c, \tag{3}$$

where $u_{ij} = 1$ denotes the event of a correct response by examinee $i$ on item $j$, $\theta_i$ represents the ability of examinee $i$, $b_{cj}$ is the class-specific difficulty parameter of item $j$, and $c_i$ is the class to which examinee $i$ belongs. We further assume that the mean of the ability of examinees in class $c$ is $\mu_{\theta_c}$ and the variance is $\sigma^2_{\theta_c}$. That is, each class is allowed to have its own ability mean to account for quantitative differences between classes. Moreover, the class size for class $c$ is $\pi_c$, which satisfies

$$\sum_{c=1}^{C} \pi_c = 1. \tag{4}$$

One possible way to make the model identifiable is to set the difficulty parameter of the first item in each class to zero, i.e., $b_{c1} = 0$ for each $c$.

2.1.2 Mixture Response Time Model

As response time becomes available, incorporating response time into the MRM is considered. Following the two-state mixture modeling of Schnipke and Scrams (1997), a lognormal mixture model for the response time may be helpful in detecting different test-taking behaviors. In the present model, response time, $T_{ij}$, is assumed to follow a lognormal distribution (van der Linden, 2006) because the lognormal has been shown to fit very well compared with other distributions such as the exponential (Scheiblechner, 1979, 1985), Gamma (Maris, 1993), or Weibull distribution (Lu, Rouder, Speckman, Sun, & Zhou, 2003; Tatsuoka & Tatsuoka, 1980). Moreover, it has often been used in other testing applications (van der Linden, 2006, 2007; van der

(16) Linden, Scrams, & Schnipke, 1999). More specifically, the distribution of the response time can be written as

$$\log(T_{ij}) \mid c_i = c \sim \mathrm{Normal}(\lambda_{cj}, \xi_{cj}),$$

with mean $\lambda_{cj}$ and precision $\xi_{cj}$, so that

$$f(\log(t_{ij}) \mid c_i) = \sqrt{\frac{\xi_{cj}}{2\pi}} \, \frac{1}{t_{ij}} \exp\left\{ -\frac{\xi_{cj}}{2} (\log t_{ij} - \lambda_{cj})^2 \right\}, \tag{5}$$

$$f(\log(t_{ij})) = \sum_{c=1}^{C} \pi_c f(\log(t_{ij}) \mid c) = \sum_{c=1}^{C} \pi_c \sqrt{\frac{\xi_{cj}}{2\pi}} \, \frac{1}{t_{ij}} \exp\left\{ -\frac{\xi_{cj}}{2} (\log t_{ij} - \lambda_{cj})^2 \right\}, \tag{6}$$

where the class-specific mean of response time, $\lambda_{cj}$, reflects the mean amount of log-time taken by members of class $c$ to respond to item $j$. The class-specific precision, $\xi_{cj}$, is related to the standard deviation of item response log-time in each class by $\sigma_{cj} = 1/\sqrt{\xi_{cj}}$. Again, it is assumed that

$$\sum_{c=1}^{C} \pi_c = 1. \tag{7}$$

2.1.3 MRM-RT

Meyer (2010) considers the joint modeling of the item response and item response time based on the assumption that, conditional on the latent classes, the item response and response time are independent. Furthermore, by conditioning on the ability and latent class, local dependencies in item responses attributed to time are removed. The model is called MRM-RT, and the joint probability function of item response and response time is the product of equations (2) and (5), i.e.,

$$f(\log(t_{ij}), u_{ij} \mid \lambda_{cj}, \xi_{cj}, \theta_i, b_{cj}, c_i) = f(\log(t_{ij}) \mid \lambda_{cj}, \xi_{cj}, c_i) \, g(u_{ij} \mid \theta_i, b_{cj}, c_i). \tag{8}$$

Therefore, given the assumption of conditional independence between responses and response time of items, the conditional joint probability function of $T_{ij}$ and $U_{ij}$ given

(17) class membership $c$ is simply the product of equations (2) and (5). Denote the sizes of the latent classes by $\pi = (\pi_1, \pi_2, \dots, \pi_C)$, the item difficulties of class $c$ by $b_c = (b_{c1}, b_{c2}, \dots, b_{cJ})$, the mean abilities by $\mu_\theta = (\mu_{\theta_1}, \mu_{\theta_2}, \dots, \mu_{\theta_C})$, the ability variances by $\sigma^2_\theta = (\sigma^2_{\theta_1}, \sigma^2_{\theta_2}, \dots, \sigma^2_{\theta_C})$, the mean response log-times of class $c$ by $\lambda_c = (\lambda_{c1}, \lambda_{c2}, \dots, \lambda_{cJ})$, and the precisions of log-time of class $c$ by $\xi_c = (\xi_{c1}, \xi_{c2}, \dots, \xi_{cJ})$. It is assumed that item responses and response time are independent conditional on the latent class and ability of the respondent. Let $u = (u_{11}, u_{12}, \dots, u_{NJ})$, $t^* = (\log(t_{11}), \log(t_{12}), \dots, \log(t_{NJ}))$, $\eta = (b_c, \mu_\theta, \sigma^2_\theta, \lambda_c, \xi_c, \pi)$, $\theta = (\theta_1, \theta_2, \dots, \theta_N)$, and $c = (c_1, c_2, \dots, c_N)$. In the present study, a frequentist estimation approach based on the marginal likelihood is adopted for MRM-RT. That is, we consider the marginal likelihood

$$
\begin{aligned}
f(u, t^* \mid b_c, \mu_\theta, \sigma^2_\theta, \lambda_c, \xi_c, \pi)
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(u, t^* \mid \eta, \theta, c) \, f(\theta, c \mid \eta) \, d\theta_1 \, d\theta_2 \cdots d\theta_N \\
&= \sum_{c=1}^{C} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=1}^{N} \Big[ \prod_{j=1}^{J} f(u_{ij}, \log(t_{ij}) \mid \eta, \theta_i, c_i) \Big] f(\theta_i \mid \eta, c_i) \, f(c_i \mid \eta) \, d\theta_1 \, d\theta_2 \cdots d\theta_N \\
&= \sum_{c=1}^{C} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i=1}^{N} \Big[ \prod_{j=1}^{J} f(u_{ij} \mid \eta, \theta_i, c_i) \, f(\log(t_{ij}) \mid \eta, \theta_i, c_i) \Big] f(\theta_i \mid \eta, c_i) \, f(c_i \mid \eta) \, d\theta_1 \cdots d\theta_N \\
&= \sum_{c=1}^{C} \prod_{i=1}^{N} \int_{-\infty}^{\infty} \Big[ \prod_{j=1}^{J} f(u_{ij} \mid \eta, \theta_i, c_i) \, f(\log(t_{ij}) \mid \eta, \theta_i, c_i) \Big] f(\theta_i \mid \eta, c_i) \, \pi_c \, d\theta_i \\
&= \sum_{c=1}^{C} \pi_c \prod_{i=1}^{N} \int_{-\infty}^{\infty} \Big[ \prod_{j=1}^{J} f(u_{ij} \mid \eta, \theta_i, c_i) \, f(\log(t_{ij}) \mid \eta, \theta_i, c_i) \Big] f(\theta_i \mid \eta, c_i) \, d\theta_i .
\end{aligned}
\tag{9}
$$

In the current study, we write an alternative representation of MRM-RT under mixture SEM. An SEM consists of two components, a measurement model and a structural model. The measurement model typically relates the observed responses or indicators to the latent variables. The structural model specifies relations among latent variables and regressions of latent variables on observed variables.
When the indicators are categorical, the conventional measurement model for continuous indicators must be modified. However, the structural model can remain the same as in the case of continuous variables.
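The pieces in equations (2), (5), and (8) can be sketched in code for a single examinee. The following is a minimal illustration, not the estimation routine used in this study: the function names, the equally spaced grid over $\theta$, and the grid width of six standard deviations are all assumptions made for the sketch. It evaluates the class-conditional joint density, integrates ability out on the grid, and turns the class-wise marginals into posterior class probabilities.

```python
import numpy as np


def rasch_prob(theta, b):
    """P(u = 1 | theta, b) with logit(P) = theta - b, as in Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))


def log_joint_given_theta(u, log_t, theta, b, lam, xi):
    """Log of Eq. (8), summed over the J items of one examinee in one class."""
    p = rasch_prob(theta, b)
    log_g = u * np.log(p) + (1 - u) * np.log(1 - p)  # Eq. (2)
    # Eq. (5) as written in the text, including the 1/t_ij factor.
    log_f = 0.5 * np.log(xi / (2 * np.pi)) - log_t - 0.5 * xi * (log_t - lam) ** 2
    return float(np.sum(log_g + log_f))


def class_marginal(u, log_t, b, lam, xi, mu, sigma2, n_grid=201):
    """Integrate theta out of the class-conditional likelihood on a grid.

    The grid and its range are illustrative choices, not taken from the text.
    """
    sd = np.sqrt(sigma2)
    grid = np.linspace(mu - 6 * sd, mu + 6 * sd, n_grid)
    width = grid[1] - grid[0]
    prior = np.exp(-0.5 * (grid - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    lik = np.array([np.exp(log_joint_given_theta(u, log_t, th, b, lam, xi))
                    for th in grid])
    return float(np.sum(lik * prior) * width)


def posterior_class_probs(u, log_t, classes, pi):
    """Posterior class probabilities, proportional to pi_c times the class marginal."""
    marg = np.array([class_marginal(u, log_t, **cls) for cls in classes])
    weighted = np.asarray(pi) * marg
    return weighted / weighted.sum()
```

Each entry of `classes` is a dictionary with the hypothetical keys `b`, `lam`, `xi`, `mu`, and `sigma2`. Because the $1/t_{ij}$ factor does not depend on the class, it cancels in the posterior class probabilities.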

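(18) First of all, we write the measurement part of the MRM-RT as

$$u^*_{ij} = (\theta_i \mid c_i) + \epsilon_{ij}, \tag{10}$$

$$u_{ij} = 1 \quad \text{if} \quad u^*_{ij} > b_{cj}, \tag{11}$$

$$\log(t_{ij}) = \lambda_{cj} + \delta_{cij}, \tag{12}$$

where $-\infty < b_{cj} < \infty$, $\epsilon_{ij} \sim \mathrm{logistic}(0, \pi^2/3)$, and $\delta_{cij} \sim N(0, 1/\xi_{cj})$, so that $\xi_{cj}$ is the precision of the log-time. Second, the structural part of the MRM-RT can be written as

$$c_i \sim \mathrm{Multinomial}(\pi_1, \pi_2, \dots, \pi_C), \tag{13}$$

$$(\theta_i \mid c_i) = \mu_{\theta_{c_i}} + \epsilon^*_{c_i}, \tag{14}$$

where $\epsilon^*_{c_i} \sim N(0, \sigma^2_{\theta_c})$. To make the mixture SEM representation easier to understand, Figure 1 depicts its path diagram.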
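To make the representation in equations (10) to (14) concrete, the following sketch draws data from it. It is an illustration under stated assumptions rather than the data generator used in this study: the function name and argument layout are hypothetical, and the log-time error is drawn with variance $1/\xi_{cj}$, consistent with $\sigma_{cj} = 1/\sqrt{\xi_{cj}}$.

```python
import numpy as np


def simulate_mrm_rt(n, pi, mu, sigma2, b, lam, xi, seed=0):
    """Draw (class, theta, u, log_t) following Eqs. (10)-(14).

    pi, mu, sigma2 are length-C sequences; b, lam, xi are C x J arrays.
    """
    rng = np.random.default_rng(seed)
    pi, mu, sigma2 = (np.asarray(a, dtype=float) for a in (pi, mu, sigma2))
    b, lam, xi = (np.asarray(a, dtype=float) for a in (b, lam, xi))
    n_items = b.shape[1]

    c = rng.choice(len(pi), size=n, p=pi)                          # Eq. (13)
    theta = mu[c] + np.sqrt(sigma2[c]) * rng.standard_normal(n)    # Eq. (14)

    # Standard logistic error (variance pi^2 / 3) gives the logit link.
    eps = rng.logistic(loc=0.0, scale=1.0, size=(n, n_items))
    u_star = theta[:, None] + eps                                  # Eq. (10)
    u = (u_star > b[c]).astype(int)                                # Eq. (11)

    # Log-time error with standard deviation 1 / sqrt(xi).
    delta = rng.standard_normal((n, n_items)) / np.sqrt(xi[c])
    log_t = lam[c] + delta                                         # Eq. (12)
    return c, theta, u, log_t
```

Thresholding the latent response $u^*_{ij}$ at $b_{cj}$ with a standard logistic error reproduces the Rasch probability in equation (3), which is why the simulated proportion correct can be compared directly with that formula.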

(19) [Figure 1: Path diagram of MRM-RT model. Nodes shown: latent class C, latent ability θ, log response times log(T1), log(T2), ..., log(TJ), and item responses U1, U2, ..., UJ.]

(20) 2.2. Estimation. The likelihood function of the MRM-RT model in equation (9) is

$$L(\eta) = \sum_{c=1}^{C} \pi_c \prod_{i=1}^{N} \int_{-\infty}^{\infty} \Big[ \prod_{j=1}^{J} g(u_{ij} \mid \eta, \theta_i, c_i) \, f(\log(t_{ij}) \mid \eta, \theta_i, c_i) \Big] h(\theta_i \mid \eta, c_i) \, d\theta_i, \tag{15}$$

where

$$g(u_{ij} \mid \eta, \theta_i, c_i) = P(u_{ij} = 1 \mid \eta, \theta_i, c_i)^{u_{ij}} \, P(u_{ij} = 0 \mid \eta, \theta_i, c_i)^{1 - u_{ij}},$$

$$f(\log(t_{ij}) \mid \eta, \theta_i, c_i) = \sqrt{\frac{\xi_{cj}}{2\pi}} \, \frac{1}{t_{ij}} \exp\left\{ -\frac{\xi_{cj}}{2} (\log(t_{ij}) - \lambda_{cj})^2 \right\}, \quad \text{and}$$

$$h(\theta_i \mid \eta, c_i) = \frac{1}{\sqrt{2\pi\sigma^2_{\theta_c}}} \exp\left\{ -\frac{(\theta_i - \mu_{\theta_c})^2}{2\sigma^2_{\theta_c}} \right\}.$$

When the marginal likelihood is used, models with continuous latent variables and categorical outcomes require numerical integration to compute the maximum likelihood values. That is, a numerical integration method is used to approximate the integral needed to marginalize out the nuisance continuous latent variables. Three types of numerical integration are commonly used, with or without adaptive quadrature: rectangular integration, Gauss-Hermite integration, and Monte Carlo integration. Here, we employ the maximum likelihood estimator with robust standard errors using a numerical integration algorithm implemented in Mplus 5 (Muthén & Muthén, 1998-2010). In particular, rectangular numerical integration, which uses the areas of rectangles to calculate the weights of the latent factor, is the default method in Mplus. Robust standard errors are standard errors that remain consistent even when the distributional assumptions underlying the original parameter estimates are incorrect (Huber, 1967).

2.3. Identification. The Rasch model has the identification problem that different sets of parameters give rise to the same distribution of $U_{ij}$. For example, the two sets $(b_j, \theta_i) = (0, 0.5)$ and $(b_j, \theta_i) = (-0.5, 0)$ result in the same probability of getting a correct answer under

(21) the Rasch model (1). A similar indeterminacy exists in our current model between the item difficulty $b_j$ and the mean ability $\mu_\theta$. That is, for any real constant $c$, the parameter sets $(b_j, \mu_\theta)$ and $(b_j + c, \mu_\theta + c)$ result in the same probability of getting a correct answer under the Rasch model, since

$$
\begin{aligned}
p(u_{ij} = 1)
&= \int_{-\infty}^{\infty} p(u_{ij} = 1 \mid \theta_i) \, p(\theta_i) \, d\theta_i \\
&= \int_{-\infty}^{\infty} \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \, \frac{1}{\sigma_\theta \sqrt{2\pi}} \exp\left( -\frac{(\theta_i - \mu_\theta)^2}{2\sigma^2_\theta} \right) d\theta_i \\
&= \frac{1}{\sigma_\theta \sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \exp\left( -\frac{(\theta_i - \mu_\theta)^2}{2\sigma^2_\theta} \right) d\theta_i \\
&= \frac{1}{\sigma_\theta \sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{\exp(\theta_i - b_j + c - c)}{1 + \exp(\theta_i - b_j + c - c)} \exp\left( -\frac{(\theta_i - \mu_\theta + c - c)^2}{2\sigma^2_\theta} \right) d\theta_i \\
&= \frac{1}{\sigma_\theta \sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{\exp((\theta_i + c) - (b_j + c))}{1 + \exp((\theta_i + c) - (b_j + c))} \exp\left( -\frac{((\theta_i + c) - (\mu_\theta + c))^2}{2\sigma^2_\theta} \right) d\theta_i \\
&= \frac{1}{\sigma_\theta \sqrt{2\pi}} \int_{-\infty}^{\infty} \frac{\exp(\theta'_i - (b_j + c))}{1 + \exp(\theta'_i - (b_j + c))} \exp\left( -\frac{(\theta'_i - (\mu_\theta + c))^2}{2\sigma^2_\theta} \right) d\theta'_i, \quad \theta'_i = \theta_i + c.
\end{aligned}
$$

In other words, different parameter sets can produce the same distribution of $U_{ij}$. Based on the data, we are unable to distinguish between the two sets of parameters, and therefore it is necessary to constrain some parameter values for identification purposes. In the context of multiple-group factor models for ordered-categorical measures, Millsap and Tein (2004) propose a set of parameter identification constraints for the congeneric and dichotomous cases. Congeneric denotes a factor structure that includes any single-factor model, or any multiple-factor model in which each indicator loads on only one latent variate. In the multiple-group factor models, the ordered-categorical outcome $Y_{ij}$ yields the response $s$ where

$$Y^{*c}_{ij} = a^c_j \theta^c_i + \varepsilon^c_{ij}, \qquad Y_{ij} = s \ \text{ if } \ \tau^c_{j,s} < Y^{*c}_{ij} \le \tau^c_{j,s+1}, \tag{16}$$

where $c$ is the group membership of examinee $i$, $\tau^c_{j,s}$ is the $s$th threshold of item $j$, $a^c_j$ is the loading of item $j$, and $E(\varepsilon^c_{ij}) = 0$ for all $i = 1, \dots, N$ and $j = 1, \dots, J$.
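The threshold rule in equation (16) can be illustrated with a short sketch. The function name is hypothetical, and the category labels are assumed to start at 0 with implicit outer thresholds at $-\infty$ and $+\infty$, which the text does not specify.

```python
import numpy as np


def ordered_response(theta, loading, eps, thresholds):
    """Map the latent response y* = loading * theta + eps to a category.

    Category s is returned when thresholds[s - 1] < y* <= thresholds[s],
    with -inf and +inf as the implicit outer thresholds (labels start at 0).
    """
    y_star = loading * np.asarray(theta, dtype=float) + eps
    return np.digitize(y_star, np.asarray(thresholds, dtype=float), right=True)
```

With a single threshold, the rule reduces to a dichotomous item: the response is 1 exactly when the latent response exceeds the threshold.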

(22) The set of identification constraints proposed by Millsap and Tein (2004) is:

For some $c_0 \in \{1, \dots, C\}$, $E(\theta^{c_0}_i) = 0$ and $\mathrm{Var}(\varepsilon^{c_0}_{ij}) = 1$, $\forall j = 1, \dots, J$;
$\forall j = 1, \dots, J$, $\tau^1_{j,s_1} = \tau^2_{j,s_1} = \cdots = \tau^C_{j,s_1}$ for some $s_1 \in \{1, \dots, S\}$;
for some $j_0 \in \{1, \dots, J\}$, $a^1_{j_0} = a^2_{j_0} = \cdots = a^C_{j_0} = 1$, and $\tau^1_{j_0,s_2} = \tau^2_{j_0,s_2} = \cdots = \tau^C_{j_0,s_2}$ for some $s_2 \in \{1, \dots, S\}$ ($s_2 \ne s_1$).

In the two-group case of the focal ($f$) and reference ($r$) groups, Hwang (2012) proposes a set of identification constraints for model (16) as

$E(\theta^r_i) = 0$;
$\forall j = 1, \dots, J$, $\mathrm{Var}(\varepsilon^r_{ij}) = \mathrm{Var}(\varepsilon^f_{ij}) = 1$;
$a^r_1 = a^f_1 = 1$, and $\tau^r_{1,1} = \tau^f_{1,1}$.

An alternative set of equivalent identification constraints based on Millsap and Tein (2004) in this case is

$E(\theta^r_i) = 0$ and $\mathrm{Var}(\varepsilon^r_{ij}) = 1$, $\forall j = 1, \dots, J$;
$\forall j = 1, \dots, J$, $\tau^r_{j,1} = \tau^f_{j,1}$;
$a^r_1 = a^f_1 = 1$, and $\tau^r_{1,2} = \tau^f_{1,2}$.

Although the above identification constraints are proposed for multiple-group models, similar principles apply to mixture or latent class models. Our present model (8) uses the logit link, and the variance of the standard logistic distribution is $\pi^2/3$. That is, if we formulate our model with the $\varepsilon_{ij}$ terms in (16), the variances of the $\varepsilon_{ij}$'s are fixed for all items and all classes, and therefore Hwang's approach is more applicable. Moreover, we only consider Rasch models for dichotomous responses, so the identification issue is much simpler. To be more specific, our study adopts the following identification constraints:

For some $c_0 \in \{1, \dots, C\}$, $E(\theta_i \mid \eta, c_0) = \mu_{\theta_{c_0}} = 0$;
for some $j_0 \in \{1, \dots, J\}$, $b_{1j_0} = b_{2j_0} = \cdots = b_{Cj_0}$.
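The location indeterminacy between item difficulty and mean ability can also be checked numerically. The sketch below is an illustration only: the function name, the grid-based integration, and the particular values of $b_j$, $\mu_\theta$, and the shift are assumptions chosen for the example. It approximates the marginal probability of a correct answer and shows that it is unchanged when the same constant is added to both parameters.

```python
import numpy as np


def marginal_p_correct(b, mu, sigma2, n_grid=2001):
    """Approximate P(u = 1) = integral of P(u = 1 | theta) N(theta; mu, sigma2)."""
    sd = np.sqrt(sigma2)
    grid = np.linspace(mu - 8 * sd, mu + 8 * sd, n_grid)
    width = grid[1] - grid[0]
    density = np.exp(-0.5 * (grid - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    p_correct = 1.0 / (1.0 + np.exp(-(grid - b)))   # Rasch model, Eq. (1)
    return float(np.sum(p_correct * density) * width)


shift = 1.7  # arbitrary constant added to both parameters
p_original = marginal_p_correct(b=0.4, mu=0.0, sigma2=1.0)
p_shifted = marginal_p_correct(b=0.4 + shift, mu=0.0 + shift, sigma2=1.0)
print(p_original, p_shifted)
```

The two printed values agree up to floating-point error, which is why one class mean is fixed at zero and one item difficulty is held equal across classes.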

(23) 3. Simulation Studies. In this section, we examine the estimation performance of the MRM-RT model under the mixture SEM framework. Furthermore, the advantage of using the information in response time is evaluated by comparing results from simultaneously analyzing item responses and response time with those from using item responses alone.

3.1. Data Generation. First, we introduce the simulation settings. There are three latent classes and 25 multiple-choice items, each with four options. Parameters of the three classes are designed to characterize examinees whose behaviors are rapid-guessing (RG, class 1), solution behavior (SB, class 2), and high ability and/or respond with familiarity (HARF, class 3). The proportions of examinees in the RG, SB, and HARF classes are 0.15, 0.55, and 0.3, respectively. The parameter values are mainly based on the simulation settings used in Meyer (2010). Examinees in the SB class generally spend the most time on each item among the three classes. In contrast, examinees in the HARF class spend less time on items than those in the SB class because of their higher ability or their familiarity with the items from practice. Examinees in the RG class take only a few seconds on each question, as they simply read through the item quickly and guess without much thinking. In the RG class, the mean and variance of response log-time are fixed at -0.5 and 0.01, respectively. For the SB and HARF classes, the mean of response log-time increases linearly as item difficulty increases. Response times on the easiest and on the most difficult items are likely to be similarly short or similarly long across examinees, and therefore the variances of response log-time for both the easier and the more difficult items are set to be small. In the SB class, the mean and variance of response log-time range from -0.3 to 0.9 and from 0.32 to 0.41, respectively.
In the HARF class, the mean and variance of response log-time range from -0.47 to 0.25 and from 0.21 to 0.28, respectively. In other words, the variability in response time is considered smaller for examinees in the HARF class than for those in the SB class. The parameters in Table 1 are on the log scale, and to better show the differences among the classes, the response time distributions on item 13 for each class are plotted in Figure 2. The
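The class structure and response log-time settings described so far can be sketched as follows. This is an illustration only, not the generator used in the study: the seed, the sample size of 1000, the variable names, and the equal spacing of the SB means across items are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(2013)

n_examinees = 1000   # illustrative sample size
n_items = 25
class_names = np.array(["RG", "SB", "HARF"])
class_props = np.array([0.15, 0.55, 0.30])

# Latent class membership for each examinee.
membership = rng.choice(class_names, size=n_examinees, p=class_props)

# RG class: log-time mean -0.5 and variance 0.01 on every item.
n_rg = int(np.sum(membership == "RG"))
rg_log_time = rng.normal(loc=-0.5, scale=np.sqrt(0.01), size=(n_rg, n_items))

# SB class: mean log-time rising linearly from -0.3 to 0.9 across the items
# (equal spacing between consecutive items is an assumption of this sketch).
sb_mean_log_time = np.linspace(-0.3, 0.9, n_items)
```

Exponentiating `rg_log_time` returns response times on the original scale, which makes the short and tightly clustered times of the RG class easy to see.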

(24) response time for examinees in the RG class is generally much shorter than that for other classes.

Table 1: Mean and Variance Parameters of Response Log-Time for RG, SB and HARF Classes.

        RG                 SB                HARF
  mean   variance    mean   variance    mean   variance
  -0.5   0.01        -0.3   0.32        -0.47  0.23
  -0.5   0.01        -0.25  0.35        -0.44  0.23
  -0.5   0.01        -0.2   0.35        -0.41  0.23
  -0.5   0.01        -0.15  0.35        -0.38  0.23
  -0.5   0.01        -0.1   0.35        -0.35  0.23
  -0.5   0.01        -0.05  0.38        -0.32  0.25
  -0.5   0.01        0.00   0.38        -0.29  0.25
  -0.5   0.01        0.05   0.38        -0.26  0.25
  -0.5   0.01        0.10   0.38        -0.23  0.25
  -0.5   0.01        0.15   0.38        -0.2   0.25
  -0.5   0.01        0.20   0.41        -0.17  0.28
  -0.5   0.01        0.25   0.41        -0.14  0.28
  -0.5   0.01        0.30   0.41        -0.11  0.28
  -0.5   0.01        0.35   0.41        -0.08  0.28
  -0.5   0.01        0.40   0.41        -0.05  0.28
  -0.5   0.01        0.45   0.41        -0.02  0.28
  -0.5   0.01        0.50   0.32        0.01   0.23
  -0.5   0.01        0.55   0.32        0.04   0.23
  -0.5   0.01        0.60   0.32        0.07   0.23
  -0.5   0.01        0.65   0.32        0.10   0.23
  -0.5   0.01        0.70   0.32        0.13   0.23
  -0.5   0.01        0.75   0.32        0.16   0.23
  -0.5   0.01        0.80   0.32        0.19   0.23
  -0.5   0.01        0.85   0.32        0.22   0.23
  -0.5   0.01        0.90   0.32        0.25   0.23

RG = rapid-guessing; SB = solution behavior; HARF = high ability and/or respond with familiarity.

As for the distribution of the ability, we assume that examinees in the RG class simply guess one of the options without thinking, and therefore each option of an item has the same probability of being chosen. In other words, the probability of getting a correct answer on each four-option item by guessing is .25 in the RG class. Therefore, the mean and variance of the ability distribution are both conveniently

set to 0, and we simply choose the item difficulty parameter that yields this probability of a correct answer for each item. The SB class represents the more general population, and its ability distribution is therefore assumed to be standard normal, N(0, 1). In contrast, the mean ability is higher for the HARF class, reflecting that the faster responses are partly due to the higher ability of examinees in this class. Moreover, its ability variance is smaller than that of the SB class, indicating that examinees in the HARF class are more homogeneous in ability. The mean and variance of the ability distribution in the HARF class are 0.5 and 0.65, respectively.

Figure 2: The response time distribution on item 13 for the RG, SB and HARF classes.

The characteristics of the three classes are reflected not only in the response time and ability distributions but also in the item difficulty parameters. In the RG class, the item difficulty parameters b_j are fixed at 1.099 (approximately ln 3) for all items, so that with ability fixed at 0 the Rasch probability of a correct answer on each item is 1/(1 + exp(1.099)), or about .25. In the SB and HARF classes, the item difficulty parameters b_j lie between -2.4 and 2.4 (see Table 2). To capture the feature that examinees in the HARF class may be more familiar with some test items through practice, so that those items appear easier to them, the difficulty parameters of those items are set to be smaller for the HARF class than for the SB class. Here, we randomly select ten of the 25 items to have smaller difficulty parameters for the HARF class than for the SB class. The chosen items are 3, 7, 9, 11, 15, 16, 19, 20, 23, and 25. These items are considered to exhibit differential item functioning (DIF) with respect to the latent groups of SB and HARF (Maij-de Meij, Kelderman &

van der Flier, 2010). All the item difficulty parameter values are summarized in Table 2.

Table 2: Item Difficulty Parameters for RG, SB, and HARF Classes.

item   RG      SB     HARF
1      1.099   -2.4   -2.4
2      1.099   -2.2   -2.2
3      1.099   -2.0   -2.3
4      1.099   -1.8   -1.8
5      1.099   -1.6   -1.6
6      1.099   -1.4   -1.4
7      1.099   -1.2   -1.5
8      1.099   -1.0   -1.0
9      1.099   -0.8   -1.1
10     1.099   -0.6   -0.6
11     1.099   -0.4   -0.7
12     1.099   -0.2   -0.2
13     1.099   0.0    0.0
14     1.099   0.2    0.2
15     1.099   0.4    0.0
16     1.099   0.6    0.2
17     1.099   0.8    0.8
18     1.099   1.0    1.0
19     1.099   1.2    0.6
20     1.099   1.4    0.8
21     1.099   1.6    1.6
22     1.099   1.8    1.8
23     1.099   2.0    1.5
24     1.099   2.2    2.2
25     1.099   2.4    1.9

RG = rapid-guessing; SB = solution behavior; HARF = high ability and/or respond with familiarity.

Estimation performance under various sample sizes, i.e., total numbers of examinees, is also of interest. In addition to the sample sizes of 500 and 2000 used in Meyer (2010), we also consider 250 and 1000 to better represent small, medium, and large sample sizes. For each sample size, 100 independent replications are simulated. To examine the information contributed by the response time, two models, MRM-RT and MRM, are fitted to each replication. Maximum likelihood estimation with robust standard errors (MLR) is applied to both fitted models. All estimation is done using Mplus 5. Necessary identification constraints are imposed on both the SB and HARF classes to

ensure identification of the item parameters in both fittings. More specifically, the ability mean is fixed at 0 for the SB class, and item 1 is treated as the anchor item with no differential item functioning between the SB and HARF classes. In addition, for the RG class, the ability mean and variance are fixed at 0, and the item difficulty parameters are fixed at 1.099 rather than estimated.

3.2. Evaluation Criteria

To summarize estimation performance, we use the mean, the standard deviation (SD), and the mean squared error (MSE) of the parameter estimates to compare MRM-RT and MRM. The mean is the average of the estimates obtained from the 100 replications and is used to evaluate the empirical bias of the estimator. The SD is obtained by summing the squared differences between the estimates and their average, dividing by the number of replications minus one, and taking the square root; it measures the variability of the estimates. The MSE is the average of the squared differences between the estimates and the true parameter value. The smaller the SD and MSE, the better the estimation performance.

3.3. Results

The number of replications is limited to 100 in our study, given that it takes 1 to 5 minutes to obtain parameter estimates for each replication. By comparison, it takes one to two weeks for a Markov chain Monte Carlo (MCMC) chain to arrive at the posterior distribution of the parameters under Bayesian estimation. For some replications, convergence problems occur under the MRM fitting but not under the MRM-RT fitting. We suspect that, in the current simulation condition, the classification of examinees relies heavily on the information from the response time. Without the response time information, MRM is unable to classify examinees into the RG class accurately because of its small class size, which in turn causes the convergence problems.
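The three criteria of Section 3.2 can be written down directly. The sketch below assumes NumPy; the function name `summarize_estimates` and the four example values are hypothetical and are not taken from the simulation output.

```python
import numpy as np

def summarize_estimates(estimates, true_value):
    """Return (mean, SD, MSE) of replicate estimates of a single parameter."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()                        # compared with true_value to judge empirical bias
    sd = est.std(ddof=1)                     # divides by (number of replications - 1)
    mse = np.mean((est - true_value) ** 2)   # average squared distance from the true value
    return mean, sd, mse

# Illustrative values only: four made-up estimates of a parameter whose true value is 0.5.
mean, sd, mse = summarize_estimates([0.4, 0.5, 0.6, 0.7], true_value=0.5)
```

The empirical bias is then `mean - true_value`; applying the function column by column to a replications-by-parameters array would give one row of a summary table per parameter.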
The percentages of replications that converge under the MRM fitting for the different sample sizes are reported in Table 3. The percentage of convergence increases as the sample size becomes larger. This is reasonable because the latent classes are large enough to be distinguished in a large sample. Only those replications that converge are used to summarize the simulation results. More specifically, to screen out estimation that fails to converge, we retain only those replications whose item difficulty estimates all lie between -5 and 5. For the MRM fitting, we summarize the estimation results from a total of 100 replications that satisfy this criterion.

The estimation performance for the proportions of the three classes, the ability mean of the HARF class, and the ability variances of the SB and HARF classes is summarized in Table 4. Comparing the means of the estimates with the true parameter values, the empirical bias decreases as the sample size increases. The SD and MSE show the same pattern: estimation becomes more precise as the sample size increases, as expected. Based on Table 4, MRM does not recover the proportions of the three classes, πRG, πSB, and πHARF, well, especially when the sample size is only 250. Moreover, under MRM the estimated ability mean of the HARF class does not seem to approach its true value even when the sample size increases to 2000. Comparing the results of MRM-RT with those of MRM, the empirical bias and the SD are generally smaller for the former. That is, the response time information helps capture the test-taking behavior of the examinees in the RG class and consequently improves the estimation with respect to bias and relative efficiency in finite samples.

The estimation performance for the item difficulty parameters of the SB and HARF classes at each sample size is summarized in Tables 5 to 12. The empirical bias, SD, and MSE all decrease as the sample size increases. In addition, the estimates from MRM-RT are better than those from MRM with respect to both SD and MSE. For example, Table 7 shows that the SD and MSE of each item under MRM-RT are smaller than those under MRM.
However, the empirical bias from MRM-RT is not always smaller than that from MRM, especially in small samples. More specifically, Table 5 shows that, for samples of size 250, the empirical bias from MRM is smaller than that from MRM-RT for items 4, 10, 12, 13, 14, 16, 17, 22, and 24. When the sample size increases to 1000, the empirical bias from MRM-RT is smaller than that from MRM for almost all items. For example, Table 9 shows that only item 1 has a smaller empirical bias under MRM than under MRM-RT, and Table 12 shows that the empirical bias is smaller under MRM-RT than under MRM for all items. As another example, the difficulty parameter of item 25 for the HARF class is 1.9; its mean estimates under MRM-RT are 2.29, 2.09, 1.94, and 1.95 as the sample size increases from 250 to 2000, whereas its mean estimates under MRM are 2.54, 2.32, 2.32, and 2.19. In other words, the empirical bias under MRM-RT becomes much smaller than that under MRM as the sample size increases.

Table 3: Percentages of Replications that Converge under MRM Fittings.

Sample size   Percent passing the (-5, 5) criterion
250           7.45%
500           14%
1000          17%
2000          27%

In addition, under MRM-RT the SD and MSE of the 10 DIF items for the HARF class are clearly larger than those of the other items when the sample size is small. As the sample size increases, the SD and MSE no longer differ between these 10 DIF items and the others. For example, Table 6, with a sample size of 250, shows that the SD and MSE are 0.55 and 0.52 for item 19 and 0.45 and 0.25 for item 18. Table 10, with a sample size of 1000, shows that the SD and MSE are 0.29 and 0.14 for item 19 and 0.27 and 0.08 for item 18. That is, the stability of the mean item difficulty estimates under MRM increases as the sample size increases. Under MRM, the item difficulty parameters of the classes with larger proportions can be estimated when the sample size is 1000, and the empirical bias, SD, and MSE of their estimates clearly decrease. Likewise, the empirical bias, SD, and MSE of the difficulty parameter estimates of the DIF items under MRM-RT clearly decrease when the sample size is 1000.

Overall, these results show that, using MLR estimation under the mixture SEM framework, MRM-RT recovers the parameters well and better than MRM. The advantage of MRM-RT over MRM is especially large with a small sample size, because the response time helps disregard (or account for) the responses of the examinees in the RG class and thus yields better estimates of the item difficulty parameters.
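The MSE values quoted above can be read through the usual decomposition into squared empirical bias and variance, which follows from the definitions in Section 3.2. As a rough check, the block below applies it to the rounded MRM-RT values for items 19 and 18 of the HARF class at a sample size of 250 (true values 0.6 and 1.0, mean estimates 1.08 and 1.22 in Table 6). Here R denotes the number of replications, and the small gap for item 19 is presumably due to rounding of the tabulated mean and SD.

```latex
\begin{aligned}
\mathrm{MSE} &= \frac{1}{R}\sum_{r=1}^{R}\bigl(\hat b_r - b\bigr)^2
  = \bigl(\bar b - b\bigr)^2 + \frac{R-1}{R}\,\mathrm{SD}^2, \qquad R = 100,\\
\text{item 19:}\quad & (1.08 - 0.6)^2 + 0.99 \times 0.55^2 \approx 0.230 + 0.299 \approx 0.53
  \quad (\text{reported } 0.52),\\
\text{item 18:}\quad & (1.22 - 1.0)^2 + 0.99 \times 0.45^2 \approx 0.048 + 0.200 \approx 0.25
  \quad (\text{reported } 0.25).
\end{aligned}
```

On this reading, the larger MSE of the DIF item 19 at this sample size comes mainly from its squared bias (about 0.23, against about 0.05 for item 18) and only partly from its larger SD.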

Table 4: Mean, SD, and MSE of the Estimates for Three Class Sizes, Mean Ability of HARF Class, and Ability Variances of SB and HARF Classes, under MRM-RT and MRM Fittings.

Model    N      Statistic   πRG       πSB     πHARF   σ²θ,SB   µθ,HARF   σ²θ,HARF
True parameter value        0.15      0.55    0.3     1        0.5       0.65

MRM-RT   250    mean        0.144     0.493   0.364   0.899    0.603     0.769
                SD          0.023     0.114   0.115   0.195    0.447     0.235
                MSE         0.00058   0.016   0.017   0.048    0.211     0.069
         500    mean        0.15      0.512   0.339   0.93     0.519     0.706
                SD          0.015     0.097   0.095   0.167    0.28      0.169
                MSE         0.00024   0.011   0.011   0.033    0.079     0.032
         1000   mean        0.149     0.538   0.313   0.991    0.494     0.662
                SD          0.011     0.057   0.058   0.117    0.272     0.114
                MSE         0.00011   0.003   0.004   0.014    0.074     0.013
         2000   mean        0.149     0.543   0.308   0.989    0.511     0.662
                SD          0.008     0.045   0.044   0.08     0.219     0.086
                MSE         0.00007   0.002   0.002   0.007    0.048     0.008

MRM      250    mean        0.142     0.458   0.4     0.968    0.52      1.209
                SD          0.027     0.219   0.22    0.531    0.735     0.907
                MSE         0.00079   0.056   0.056   0.283    0.541     1.136
         500    mean        0.15      0.425   0.424   0.887    0.463     0.972
                SD          0.019     0.192   0.193   0.392    0.647     0.407
                MSE         0.00035   0.045   0.045   0.166    0.42      0.269
         1000   mean        0.149     0.462   0.389   0.885    0.502     0.936
                SD          0.012     0.164   0.163   0.295    0.619     0.359
                MSE         0.00016   0.043   0.042   0.100    0.384     0.211
         2000   mean        0.149     0.464   0.387   0.895    0.554     0.787
                SD          0.01      0.164   0.163   0.336    0.623     0.37
                MSE         0.00010   0.035   0.037   0.124    0.392     0.156

RG = rapid-guessing; SB = solution behavior class; HARF = high ability and/or respond with familiarity; SD = standard deviation; MSE = mean squared error.

Table 5: Mean, SD, and MSE of Difficulty Parameter Estimates for the SB Class with Sample Size of 250, under MRM-RT and MRM Fittings.

                    MRM-RT                  MRM
item   Parameter   mean    SD     MSE     mean    SD     MSE
1      -2.4        -2.43   0.30   0.09    -2.47   0.36   0.13
2      -2.2        -2.35   0.40   0.18    -2.41   0.71   0.54
3      -2.0        -2.22   0.50   0.30    -2.35   0.72   0.64
4      -1.8        -1.92   0.36   0.14    -1.89   0.71   0.51
5      -1.6        -1.73   0.40   0.17    -1.80   0.66   0.47
6      -1.4        -1.55   0.33   0.13    -1.55   0.71   0.52
7      -1.2        -1.41   0.39   0.20    -1.68   0.89   1.01
8      -1.0        -1.16   0.35   0.15    -1.22   0.43   0.24
9      -0.8        -1.01   0.43   0.22    -1.04   0.60   0.42
10     -0.6        -0.78   0.38   0.17    -0.70   0.51   0.27
11     -0.4        -0.61   0.48   0.27    -0.67   0.74   0.62
12     -0.2        -0.31   0.32   0.11    -0.26   0.56   0.32
13     0.0         -0.16   0.29   0.11    -0.02   0.61   0.36
14     0.2         0.07    0.32   0.12    0.14    0.65   0.42
15     0.4         0.23    0.42   0.20    0.10    0.60   0.45
16     0.6         0.38    0.43   0.23    0.43    0.55   0.33
17     0.8         0.65    0.33   0.13    0.83    0.56   0.31
18     1.0         0.86    0.27   0.09    0.83    0.59   0.37
19     1.2         0.99    0.42   0.22    0.99    0.73   0.57
20     1.4         1.19    0.43   0.23    1.17    0.58   0.38
21     1.6         1.48    0.37   0.15    1.38    0.77   0.63
22     1.8         1.68    0.36   0.14    1.84    0.56   0.31
23     2.0         1.83    0.50   0.28    1.80    0.59   0.38
24     2.2         2.09    0.35   0.13    2.12    0.69   0.47
25     2.4         2.22    0.53   0.31    2.10    0.91   0.92

SD = standard deviation; MSE = mean squared error.

Table 6: Mean, SD, and MSE of Difficulty Parameter Estimates for the HARF Class with Sample Size of 250, under MRM-RT and MRM Fittings.

                    MRM-RT                  MRM
item   Parameter   mean    SD     MSE     mean    SD     MSE
1      -2.4        -2.43   0.30   0.09    -2.47   0.36   0.13
2      -2.2        -2.01   0.54   0.32    -2.07   1.06   1.14
3      -2.3        -2.15   0.65   0.45    -1.92   0.99   1.11
4      -1.8        -1.63   0.52   0.30    -1.63   0.98   0.99
5      -1.6        -1.38   0.50   0.29    -1.36   0.94   0.93
6      -1.4        -1.20   0.49   0.27    -1.26   0.97   0.95
7      -1.5        -1.24   0.48   0.30    -1.09   1.00   1.15
8      -1.0        -0.83   0.45   0.23    -0.80   0.92   0.88
9      -1.1        -0.83   0.47   0.29    -0.78   1.05   1.19
10     -0.6        -0.41   0.47   0.25    -0.52   1.18   1.38
11     -0.7        -0.48   0.47   0.26    -0.37   1.05   1.20
12     -0.2        -0.03   0.46   0.24    -0.07   0.99   0.98
13     0.0         0.26    0.44   0.26    0.18    1.08   1.19
14     0.2         0.37    0.45   0.23    0.28    0.92   0.84
15     0.0         0.31    0.49   0.33    0.48    1.11   1.44
16     0.2         0.52    0.47   0.31    0.54    0.98   1.07
17     0.8         1.03    0.46   0.26    1.00    0.85   0.75
18     1.0         1.22    0.45   0.25    1.31    0.99   1.07
19     0.6         1.08    0.55   0.52    1.17    1.03   1.37
20     0.8         1.32    0.44   0.47    1.51    1.05   1.59
21     1.6         1.78    0.46   0.24    2.00    0.99   1.12
22     1.8         2.00    0.48   0.27    1.98    1.02   1.06
23     1.5         1.87    0.49   0.37    2.12    1.01   1.39
24     2.2         2.43    0.49   0.30    2.63    0.97   1.12
25     1.9         2.29    0.47   0.37    2.54    1.03   1.46

SD = standard deviation; MSE = mean squared error.







4. Applied Data Analysis

In this section, the real data of Meyer (2010) are reanalyzed with mixture SEM. Meyer (2010) compared one- and two-class models and showed that the two-class model fitted better. However, he suggested the possibility of adding one more class to account for some members of the rapid-guessing class who seem to spend more response time than the others in the same class but less response time than those in the solution behavior class. Under the current mixture SEM framework, the model can be easily extended to a three-class model and readily estimated with MLR. In this study, the two- and three-class models are compared to examine whether this additional class is needed to better characterize test-taking behavior, by investigating the characteristics of each class of examinees.

4.1. Data

The original data were taken from a Spring 2004 administration of the Information Literacy Test (ILT). Participants were 524 college sophomores who completed the computerized test. The test has 60 items. Cronbach's alpha was 0.88 for this test (Meyer, 2010). The response data are dichotomous, and response time is measured in seconds. Meyer (2010) considered two types of test-taking behavior in the data, namely rapid-guessing and solution behavior. However, his results showed that the two-class model could not account for the much longer response time of some members of the rapid-guessing class. Thus, we first consider a three-class model with the rapid-guessing class and two solution behavior classes, denoted SB 1 and SB 2. Second, the two-, three-, and four-class models are compared using the Akaike information criterion (AIC; Akaike, 1987), the Bayesian information criterion (BIC; Schwarz, 1978), and the sample-size adjusted BIC (Sclove, 1987):

AIC = -2 log(L(u, t*)) + 2k, (17)

BIC = -2 log(L(u, t*)) + k log(N), (18)

Adjusted BIC = -2 log(L(u, t*)) + k log((N + 2)/24), (19)

where N is the number of examinees in the data and k is the number of free parameters, obtained by adding the number of means and variances of the item response times, the number of item difficulty parameters, and the number of means and variances of the ability distributions over all classes, plus the number of mixing proportions minus one. The preferred model is the one with the smallest AIC, BIC, and adjusted BIC. Finally, we explore the characteristics of each class of examinees.

4.2. Results

As shown in Table 13, all fit indices indicate that the three-class model fits better than the two-class and four-class models. Table 14 shows the estimated proportion of each class in the three-class model. In the three-class model, the estimated size of the rapid-guessing class is smaller than the estimate of 15% in Meyer (2010).

Table 13: Model Fit Indices of the Two-Class, Three-Class, and Four-Class Models.

               two-class    three-class   four-class
AIC            78772.401    73850.126     75974.571
BIC            80059.371    75912.688     78812.725
Adjusted BIC   79100.750    74376.354     76698.678

Table 14: Estimates of Class Sizes of the Three-Class Model.

class   proportion   SE
RG      0.095        0.013
SB 1    0.453        0.035
SB 2    0.452        0.036

RG = rapid-guessing class; SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.
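Equations (17) to (19) are simple enough to check against Table 13. The sketch below uses only the Python standard library; the function names are hypothetical. Because the log-likelihoods and parameter counts are not reported in this section, the code back-calculates an implied k from each reported AIC and BIC pair (using BIC - AIC = k(log N - 2)) and then recomputes the three indices, so the implied values should be read as a consistency check rather than as reported results.

```python
import math

def information_criteria(loglik, k, n):
    """AIC, BIC, and sample-size adjusted BIC as in Equations (17) to (19)."""
    aic = -2.0 * loglik + 2.0 * k
    bic = -2.0 * loglik + k * math.log(n)
    adj_bic = -2.0 * loglik + k * math.log((n + 2) / 24)
    return aic, bic, adj_bic

def implied_k(aic, bic, n):
    """Parameter count implied by a reported AIC and BIC: BIC - AIC = k (log n - 2)."""
    return (bic - aic) / (math.log(n) - 2.0)

# Reported (AIC, BIC, adjusted BIC) from Table 13; N = 524 examinees.
table13 = {
    "two-class": (78772.401, 80059.371, 79100.750),
    "three-class": (73850.126, 75912.688, 74376.354),
    "four-class": (75974.571, 78812.725, 76698.678),
}
N = 524

for name, (aic, bic, adj_bic) in table13.items():
    k = round(implied_k(aic, bic, N))      # back-calculated, not reported in the text
    loglik = (2.0 * k - aic) / 2.0         # log-likelihood implied by AIC and k
    print(name, k, information_criteria(loglik, k, N))

# Model with the smallest BIC among the three candidates.
best = min(table13, key=lambda m: table13[m][1])
```

Recomputing the adjusted BIC from the implied k is a quick way to confirm that the (N + 2)/24 form in Equation (19) is the one behind the tabulated values.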

Table 15: Estimates of Mean Response Time for SB 1 and SB 2.

       SB 1              SB 2                     SB 1              SB 2
item   estimate   SE     estimate   SE     item   estimate   SE     estimate   SE
1      22.10      0.64   16.19      0.52   31     13.66      0.49   10.24      0.26
2      19.80      0.59   15.28      0.46   32     10.03      0.27   8.14       0.25
3      43.88      1.23   28.52      0.97   33     61.34      1.84   49.56      1.04
4      16.77      0.52   12.25      0.34   34     55.45      1.83   41.04      1.48
5      15.45      0.63   10.09      0.29   35     50.13      1.70   31.19      1.37
6      19.01      0.86   12.41      0.35   36     23.42      0.75   13.47      0.63
7      10.43      0.37   7.55       0.20   37     58.35      1.93   34.86      1.71
8      13.14      0.50   9.90       0.28   38     63.18      2.91   34.10      1.81
9      25.12      0.80   16.99      0.53   39     26.38      0.95   17.08      0.68
10     23.07      0.85   15.09      0.36   40     45.78      1.97   28.19      1.18
11     14.02      0.43   10.11      0.29   41     32.20      1.32   20.78      0.83
12     21.20      0.89   14.66      0.37   42     18.53      0.67   12.13      0.39
13     34.34      0.86   23.16      0.79   43     29.80      0.86   18.99      0.70
14     14.61      0.47   10.06      0.30   44     43.66      1.44   26.73      1.15
15     40.92      1.31   25.99      0.86   45     34.65      1.77   19.50      0.96
16     45.00      1.67   25.72      1.26   46     22.91      0.83   16.91      0.80
17     48.84      1.51   34.89      0.73   47     17.83      0.86   10.61      0.51
18     13.18      0.36   10.25      0.29   48     48.86      2.25   26.36      1.61
19     8.68       0.31   6.24       0.14   49     26.17      0.89   17.18      0.64
20     24.92      0.77   17.52      0.58   50     41.66      1.42   25.13      1.58
21     19.43      0.68   13.95      0.36   51     13.59      0.42   10.32      0.30
22     31.97      1.34   20.58      0.51   52     17.30      0.55   12.24      0.42
23     47.03      1.83   27.27      1.17   53     21.72      0.56   14.73      0.60
24     48.70      1.75   24.93      1.55   54     9.14       0.27   7.02       0.20
25     30.56      1.10   18.70      0.69   55     17.77      0.48   12.40      0.47
26     21.11      0.89   13.57      0.56   56     29.59      0.80   19.57      0.82
27     23.88      1.10   14.67      0.62   57     16.85      0.52   12.18      0.48
28     10.97      0.44   7.63       0.26   58     19.84      0.60   13.21      0.52
29     12.83      0.41   9.46       0.31   59     14.56      0.44   10.12      0.36
30     12.27      0.37   9.09       0.23   60     16.87      0.49   11.00      0.43

SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.

As shown in Table 15, the mean response times of all items are smaller for SB 2 than for SB 1 in the three-class model. The estimates of the item difficulty parameters for SB 1 and SB 2 are reported in Table 16. Given the magnitudes of the standard errors of these difficulty estimates, the differences in the difficulty parameter estimates between the two classes might not be statistically significant. For example, the estimates of the difficulty parameter of item 5 are -0.75 and -1.269 for SB 1 and SB 2, respectively, but their standard errors are 0.149 and 0.506. If the two estimates are treated as independent (a simplifying assumption), the difference of 0.519 has a standard error of about 0.53, which gives a z statistic of about 0.98. In other words, the item difficulty parameters in the two solution behavior classes might not differ significantly. We therefore next consider a more restricted model that constrains the item difficulty parameters of the two solution behavior classes to be equal, in order to test whether some items indeed exhibit DIF between the two latent classes. The fit indices reported in Table 17 indicate that the three-class model with the equality constraint on the item parameters of the two solution behavior classes fits better than the unconstrained one.

To assess the absolute rather than the relative fit of the three-class model, we compare the observed and expected probabilities of a correct answer on each item for each of the three classes. The examinees are first classified into their most likely class based on their posterior probabilities of belonging to each of the three classes. For each class of examinees, the observed and expected mean probabilities of a correct answer on each item are computed and plotted in Figure 3. Figure 3 shows that the observed and expected mean probabilities of a correct answer are very close for SB 1 and SB 2 in the three-class model. For RG, however, the observed mean probabilities of a correct answer differ considerably from the expected probability of 0.25 and appear to vary across items.
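The absolute-fit comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used for Figure 3: it assumes NumPy, a Rasch model with normally distributed ability within each class, Gauss-Hermite quadrature for the expected probability, and hypothetical function names and toy inputs.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def expected_p_correct(b, mu, var, n_nodes=41):
    """Rasch probability of a correct answer averaged over theta ~ N(mu, var)."""
    if var == 0:
        # Degenerate ability distribution, e.g. the RG class with theta fixed at mu.
        return float(1.0 / (1.0 + np.exp(-(mu - b))))
    nodes, weights = hermegauss(n_nodes)      # quadrature rule for the weight exp(-x^2 / 2)
    theta = mu + np.sqrt(var) * nodes
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return float(np.sum(weights * p) / np.sum(weights))

def observed_p_correct(responses, posterior):
    """Proportion correct per item within each modal (most likely) class.

    responses: examinees x items array of 0/1 scores.
    posterior: examinees x classes array of posterior class probabilities.
    A class with no assigned examinees yields NaN for its row.
    """
    responses = np.asarray(responses, dtype=float)
    posterior = np.asarray(posterior, dtype=float)
    modal = np.argmax(posterior, axis=1)
    return np.vstack([responses[modal == c].mean(axis=0)
                      for c in range(posterior.shape[1])])

# With ability fixed at 0 and b = ln 3 (about 1.099), the expected probability is one in four.
p_rg = expected_p_correct(np.log(3), mu=0.0, var=0.0)

# Toy example: three examinees, two items, two classes.
obs = observed_p_correct([[1, 0], [1, 1], [0, 0]],
                         [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]])
```

Plotting each row of the observed proportions against the corresponding expected probabilities, item by item, would give a display of the kind described for Figure 3.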
Therefore, we further relax the assumption that the probability of a correct answer for RG is 0.25 for all items and allow the RG item difficulty parameters to be freely estimated. Table 18 shows that the three-class model without the equality constraint on the item difficulty parameters for RG fits better than the restricted model. Moreover, as shown in Table 19, a three-class model is again preferred to a two-class model when the item difficulty parameters are allowed to vary for the RG class. The observed and expected mean probabilities of a correct answer for each of the three classes of examinees are plotted in Figure 4. According to Figure 4, the observed mean probabilities of a correct answer are very close to the expected ones on all items for SB 1, SB

Table 16: Estimates of Difficulty Parameters for SB 1 and SB 2.

       SB 1               SB 2                      SB 1               SB 2
item   estimate   SE      estimate   SE      item   estimate   SE      estimate   SE
1      -3.366     0.355   -3.366     0.355   31     -2.01      0.204   -1.704     0.502
2      -2.798     0.279   -2.85      0.541   32     -3.653     0.416   -3.454     0.559
3      -0.474     0.15    -0.129     0.484   33     2.098      0.209   2.367      0.547
4      -1.484     0.175   -1.519     0.504   34     -2.381     0.238   -2.106     0.511
5      -0.75      0.149   -1.269     0.506   35     -0.408     0.146   -0.74      0.502
6      -3.291     0.355   -3.431     0.546   36     -4.452     0.581   -3.73      0.602
7      -3.36      0.35    -4.409     0.673   37     0.408      0.145   0.062      0.498
8      -2.666     0.264   -2.62      0.502   38     -1.102     0.159   -0.695     0.489
9      -1.014     0.155   -1.188     0.486   39     -2.001     0.204   -2.5       0.503
10     -2.722     0.266   -2.855     0.546   40     -3.123     0.318   -2.111     0.51
11     -2.795     0.276   -3.131     0.56    41     -1.732     0.192   -1.387     0.49
12     -0.471     0.149   -0.752     0.485   42     -3.47      0.37    -3.72      0.597
13     0.892      0.151   0.831      0.494   43     -2.286     0.225   -1.379     0.484
14     -2.077     0.214   -1.99      0.527   44     -3.236     0.339   -2.07      0.508
15     3.445      0.372   3.033      0.601   45     -3.347     0.354   -2.786     0.526
16     1.451      0.176   1.401      0.507   46     -2.096     0.215   -1.878     0.495
17     -1.082     0.165   -0.578     0.497   47     -2.058     0.213   -2.147     0.505
18     -1.939     0.2     -1.672     0.51    48     -0.439     0.146   -0.497     0.492
19     -3.135     0.318   -3.484     0.579   49     -0.795     0.153   -0.858     0.5
20     -1.024     0.16    -1.026     0.503   50     -0.627     0.156   -0.461     0.485
21     -1.907     0.21    -2.146     0.504   51     -2.551     0.247   -1.964     0.503
22     0.599      0.146   0.776      0.488   52     -0.304     0.144   -0.694     0.507
23     -0.29      0.144   0.056      0.487   53     -2.613     0.265   -1.867     0.499
24     0.123      0.14    0.131      0.486   54     -0.833     0.154   -0.861     0.501
25     -1.463     0.174   -1.789     0.518   55     -3.054     0.309   -2.838     0.544
26     -0.315     0.146   -0.721     0.493   56     -1.207     0.162   -0.783     0.484
27     0.797      0.15    0.843      0.507   57     -1.087     0.161   -1.074     0.498
28     -2.858     0.346   -2.394     0.531   58     -0.519     0.147   -0.941     0.496
29     -0.426     0.148   -0.681     0.488   59     -1.767     0.196   -2.041     0.525
30     -3.949     0.462   -2.912     0.535   60     -0.681     0.148   -0.674     0.489

SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.

Table 17: Fit Indices for Three-Class Models with and without Equality Constraints on Difficulty Parameters for SB 1 and SB 2.

               without      with
AIC            73850.126    73848.904
BIC            75912.688    75660.038
Adjusted BIC   74376.354    74310.984

Figure 3: Observed and expected mean response probability of getting a correct answer for each latent class of examinees classified using the greatest posterior probability.

2, as well as for RG. Therefore, the resulting three-class model provides a reasonable fit to the data. Because the three-class model without the equality constraint on the RG difficulty parameters across items is preferred by the fit indices, we conclude that the examinees in RG do not simply respond randomly or guess on all items. It is possible that they try to solve some of the items but guess on others. In addition, given the equality constraint on the item difficulty parameters of the two solution behavior classes, the items are considered DIF-free between the SB 1 and SB 2 classes. The estimates of the item difficulty parameters with the equality constraint for SB 1 and SB 2, and those for RG, are reported in Table 20. According to Table 21, the estimated class sizes for RG, SB 1, and SB 2 are 0.121, 0.421, and 0.458, respectively.

Table 18: Fit Indices for Three-Class Models with and without Equality Constraint on Difficulty Parameters for RG.

               without      with
AIC            73067.466    73848.904
BIC            75134.290    75660.038
Adjusted BIC   73594.781    74310.984

Table 19: Fit Indices of Two-Class and Three-Class Models without Equality Constraint on Difficulty Parameters for RG.

               two-class    three-class
AIC            77718.044    73067.466
BIC            79264.966    75134.290
Adjusted BIC   78112.715    73594.781

Next, we investigate the characteristics of the RG, SB 1, and SB 2 classes. Table 22 shows that the mean ability of the SB 1 class is slightly higher than that of the SB 2 class, where the mean ability of the SB 2 class is fixed at 0 for identification. In addition, the mean ability of the RG class is clearly lower than those of the two SB classes. However, the low mean ability of RG is of less interest because, if guessing behavior is assumed for examinees in the RG

Figure 4: Observed and expected mean response probability of getting a correct answer for each latent class without the equality constraint on difficulty parameters for RG.

Table 20: Estimates of Difficulty Parameters for RG and for Both Solution Behavior Classes, SB 1 and SB 2.

       RG                 SB 1 & 2                  RG                 SB 1 & 2
item   estimate   SE      estimate   SE      item   estimate   SE      estimate   SE
1      -2.780     0.187   -2.780     0.187   31     -0.460     0.288   -1.693     0.138
2      -0.912     0.298   -2.649     0.190   32     -0.756     0.292   -3.327     0.253
3      1.413      0.332   -0.146     0.107   33     1.892      0.382   2.430      0.162
4      -0.105     0.278   -1.314     0.123   34     -0.388     0.283   -2.034     0.153
5      0.250      0.283   -0.817     0.113   35     1.311      0.331   -0.434     0.109
6      -0.756     0.293   -3.210     0.240   36     -0.175     0.283   -4.108     0.360
7      -1.075     0.303   -3.608     0.286   37     0.466      0.291   0.416      0.105
8      -0.532     0.290   -2.458     0.178   38     0.107      0.288   -0.720     0.113
9      0.621      0.295   -0.940     0.113   39     -0.317     0.283   -2.123     0.157
10     -0.388     0.281   -2.615     0.189   40     0.322      0.289   -2.373     0.175
11     -1.250     0.313   -2.756     0.197   41     0.699      0.303   -1.427     0.127
12     0.104      0.279   -0.454     0.110   42     -0.178     0.275   -3.688     0.302
13     2.204      0.428   1.014      0.113   43     0.542      0.284   -1.626     0.138
14     -0.246     0.281   -1.873     0.145   44     -0.175     0.280   -2.400     0.175
15     2.198      0.422   3.373      0.238   45     0.621      0.293   -2.961     0.216
16     1.892      0.386   1.591      0.128   46     0.178      0.282   -1.817     0.143
17     1.413      0.335   -0.677     0.112   47     0.104      0.275   -1.951     0.150
18     0.395      0.292   -1.693     0.137   48     1.213      0.322   -0.298     0.109
19     -1.161     0.314   -3.210     0.240   49     0.699      0.296   -0.657     0.110
20     0.542      0.289   -0.883     0.114   50     2.839      0.536   -0.444     0.109
21     -0.105     0.281   -1.892     0.145   51     0.249      0.283   -2.056     0.152
22     1.029      0.308   0.853      0.113   52     0.178      0.284   -0.318     0.108
23     1.314      0.331   0.050      0.107   53     0.322      0.286   -2.078     0.156
24     1.759      0.377   0.255      0.104   54     -0.535     0.290   -0.667     0.109
25     0.469      0.290   -1.471     0.129   55     -1.161     0.310   -2.756     0.201
26     0.699      0.295   -0.337     0.110   56     1.119      0.313   -0.861     0.116
27     1.413      0.337   0.982      0.113   57     0.942      0.307   -0.940     0.116
28     -0.034     0.285   -2.429     0.178   58     -0.107     0.285   -0.554     0.109
29     0.696      0.297   -0.385     0.110   59     -0.317     0.286   -1.780     0.140
30     -0.532     0.289   -3.210     0.241   60     0.696      0.300   -0.554     0.109

RG = rapid-guessing class; SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.

class on some items, their responses do not depend solely on ability, and their ability therefore might not be appropriately estimated.

Table 21: Estimates of Class Sizes of the Three-Class Model with the Equality Constraint on Difficulty Parameters between SB 1 and SB 2 and without the Equality Constraint on Difficulty Parameters for RG.

class   proportion   SE
RG      0.121        0.016
SB 1    0.421        0.034
SB 2    0.458        0.036

RG = rapid-guessing class; SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.

Table 22: Estimates of Mean Ability for RG, SB 1, and SB 2.

class   estimated mean ability   SE
RG      -1.705                   0.154
SB 1    0.138                    0.067
SB 2    0 (fixed)

RG = rapid-guessing class; SB 1 = solution behavior class 1; SB 2 = solution behavior class 2.

Figure 5 shows the estimates of item difficulty and mean response log-time for RG, SB 1, and SB 2. Figure 5 indicates that the mean response time of the SB 2 class is shorter than that of the SB 1 class. For example, the mean response times of item 12 are 21.20 and 14.66 for the SB 1 and SB 2 classes, respectively. The results therefore suggest that examinees who have slightly higher ability and spend more time on the items are more likely to belong to the SB 1 class. The estimated class size of 0.121 for the rapid-guessing class is again smaller than the estimate of 15% in Meyer (2010). This is possibly because short response time is the main indicator of membership in the rapid-guessing class in Meyer (2010), whereas some members of the original rapid-guessing class are now classified as the faster respondents of solution behavior class 2 in the three-class model. In summary, we can label the three types

of examinees in the ILT data as the rapid-guessing, the solution behavior, and the faster respondents classes. The estimates of the item difficulty parameters of the solution behavior class are overall very similar to those of Meyer (2010). Furthermore, the mean response log-time for the solution behavior class in Meyer (2010) falls between the estimated mean response log-times of SB 1 and SB 2 in the present analysis.

Figure 5: Estimates of difficulty parameters and mean response log-time for rapid-guessing class (RG), solution behavior class 1 (SB 1), and solution behavior class 2 (SB 2).