Multilevel Higher-Order Item Response Model Using Markov Chain Monte Carlo Estimation


National Taichung University of Education
Graduate Institute of Educational Measurement and Statistics
Doctoral Dissertation

Multilevel Higher-Order Item Response Model Using Markov Chain Monte Carlo Estimation

Advisor: Dr. Bor-Chen Kuo
Graduate: Hsiao-Chien Tseng

January 2014

Abstract

The higher-order item response framework specifies the overall ability and multiple domain abilities within a single model. The result is a more parsimonious model for the joint distribution of multidimensional abilities, and one that arises naturally in many real testing situations. Many studies have demonstrated that incorporating students' background variables, such as gender, age, race, and grade level, into the estimation process can lead to unbiased and more precise ability estimates. Several multilevel models based on unidimensional or multidimensional item response theory have been developed for this purpose. So far, however, no study has combined the higher-order item response model with students' background information. The aim of this study is to propose a multilevel higher-order item response model in which students' background variables are treated as regressors of the overall ability. A Markov chain Monte Carlo (MCMC) algorithm is used to calibrate the parameters of the proposed model. The usefulness and feasibility of the model are verified in two experiments based on simulated data and one experiment based on real data.

In the first experiment, the data are generated from the proposed multilevel higher-order item response model with continuous background variables. The goal is to study the influence of the correlation between the background variables and the overall ability. The results show that the models that include the background variables are relatively efficient: the multilevel higher-order (MHO-IRT), multilevel multidimensional (MM-IRT), and multilevel unidimensional (MU-IRT) item response models provide more accurate estimates than the unidimensional (U-IRT), multidimensional (M-IRT), and higher-order (HO-IRT) item response models. In addition, smaller root mean square errors (RMSEs) of the parameters are obtained with longer tests, larger sample sizes, and higher correlations between the overall ability and the background variables.

In the second experiment, the data are generated from the proposed multilevel higher-order item response model with dichotomous background variables. The goal is to explore how incorporating the background variables affects the estimation of group statistics. Compared with the U-IRT, M-IRT, and HO-IRT models, the models that include the background variables are again relatively efficient. The RMSEs of the group means and standard deviations decrease as the test length and sample size increase. The differences between the models that include and exclude the background variables are larger when estimating the population standard deviations.

In the third experiment, the AIC, BIC, and DIC indices are applied to find the best-fitting model for the TASA 2007 fourth-grade mathematics assessment data. All three indices indicate that the MHO-IRT model is the most suitable for the TASA data.

In summary, the MHO-IRT approach has important implications for parameter estimation. The MHO-IRT model fits the design of large-scale assessments and, more importantly, provides efficient estimates of the parameters. Better estimates are obtained with longer tests, larger sample sizes, and higher correlations between the overall ability and the background variables.

Keywords: higher-order item response model, Markov chain Monte Carlo, item response theory, multilevel model.

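Chinese Abstract

Evaluating students' learning outcomes and performance often requires an overall score that summarizes learning as a whole, together with separate scores on several facets that students can use to diagnose their own learning. The higher-order item response model provides not only the higher-order overall abilities but also the domain abilities for each subdomain, and is therefore closer to the structure of real tests. In recent years, many models have been developed on a multilevel basis. They estimate parameters from the available data, including students' item responses and background variables such as gender, age, and educational level. Including background variables makes the estimation more precise and yields more accurate parameter estimates. Multilevel models with background variables have so far been applied only to unidimensional and multidimensional models; no study has applied them to the higher-order item response model. This study therefore develops a Markov chain Monte Carlo estimation method for the multilevel higher-order item response model. The model incorporates background variables into the overall ability, is estimated by MCMC, and is evaluated with simulated and empirical data.

Different research designs are used to evaluate the proposed method. Experiment one uses continuous background variables to examine how background variables with different degrees of correlation improve parameter estimation. Experiment two generates discrete background variables to examine how the proposed model improves the estimation of group and subgroup parameters. Experiment three examines the proposed model with empirical data from the Taiwan Assessment of Student Achievement (TASA).

Experiment one shows that including background variables improves estimation precision: the three methods that include background variables (MHO-IRT, MM-IRT, and MU-IRT) are more precise than those that do not (HO-IRT, M-IRT, and U-IRT). In addition, increasing the test length, the number of examinees, and the correlation between the background variables and ability all improve the precision of parameter estimation.

Experiment two uses discrete background variables to examine how including them affects the recovery of group parameters. In recovering group means, all methods perform similarly. Models that include background variables are more precise than those that do not, but the differences among estimation models such as HO-IRT, M-IRT, and U-IRT are small. Models that include background variables also perform better when the background variable involves larger group differences, and the improvement is greater for group ability standard deviations than for group ability means.

Finally, the TASA 2007 fourth-grade mathematics data serve as the empirical data, and three model-fit indices are used for evaluation. All three indices indicate that the proposed MHO-IRT model fits the empirical data best, followed by the HO-IRT model without background variables, which suggests that higher-order item response models fit the empirical data better. The results show that the MHO-IRT model provides more precise parameter estimates and better matches data from large-scale assessments. When the correlation between the included background variables and ability is higher, the MHO-IRT model estimates more precisely; increasing the test length and the sample size also improves estimation precision.

Keywords: higher-order item response theory, Markov chain Monte Carlo, item response theory, multilevel model.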

TABLE OF CONTENTS

Abstract
Chinese Abstract
Table of Contents
Lists of Figures
Lists of Tables

CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Aims of this study
1.3 Significance and contribution

CHAPTER 2 LITERATURE REVIEW
2.1 Item Response Models
2.1.1 Unidimensional Item Response Models
2.1.2 Multidimensional Item Response Models
2.1.3 Higher-Order Item Response Model
2.2 Multilevel IRT Models
2.2.1 Multilevel Unidimensional IRT Model
2.2.2 Multilevel Multidimensional IRT Model
2.3 Model fit

CHAPTER 3 A MULTILEVEL HIGHER-ORDER ITEM RESPONSE MODEL
3.1 Model Specification
3.2 Parameter Estimation
3.2.1 Prior Distributions
3.2.2 Joint and Conditional Posterior Distributions
3.2.3 Parameter Estimation
3.3 Markov Chain Monte Carlo Methods for Other IRT Models
3.3.1 MCMC method for multilevel unidimensional 3PL model
3.3.2 MCMC method for multilevel multidimensional 3PL model
3.3.3 MCMC method for multidimensional 3PL model

CHAPTER 4 EXPERIMENTS
4.1 Experiment one
4.1.1 Experiment Design
4.1.2 Data Generation
4.2 Experiment two
4.2.1 Experiment Design
4.2.2 Data Generation
4.2.3 Analysis
4.3 Experiment three

CHAPTER 5 RESULTS
5.1 Experiment one: Model parameter recovery with continuous background variables
5.1.1 Overall ability estimates
5.1.2 Domain ability estimates
5.1.3 Regression parameter estimates
5.1.4 Item parameter estimates
5.2 Experiment two: Model parameter recovery with discrete background variables
5.2.1 Overall ability estimates
5.2.2 Domain ability estimates
5.3 Experiment three: Model fit in real data

CHAPTER 6 CONCLUSION AND DISCUSSION

REFERENCES

Lists of Figures

Figure 2-1 A HO-IRT model applied to a D-domain test
Figure 3-1 Multilevel HO-IRT method
Figure 4-1 Simulated Data: Markov chains for selected beta, overall and domain ability, and lambda parameters
Figure 4-2 Simulated Data: Markov chains for selected discrimination, difficulty and guessing item parameters
Figure 4-3 Simulated Data: Estimated autocorrelation function for selected beta, overall and domain ability, and lambda parameters
Figure 4-4 Simulated Data: Estimated autocorrelation function for selected discrimination, difficulty and guessing item parameters
Figure 4-5 Participants in TASA
Figure 5-1 Scatter plots of true and estimated overall abilities: N=1000, n=20, correlation=0.7
Figure 5-2 Scatter plots of true and estimated overall abilities: N=4000, n=20, correlation=0.7
Figure 5-3 Scatter plots of true and estimated domain abilities: N=1000, n=20, correlation=0.7
Figure 5-4 Scatter plots of true and estimated domain abilities: N=4000, n=20, correlation=0.7

Lists of Tables

Table 4-1. The cases examined in the simulation studies
Table 4-2. The setting of manipulated variables
Table 4-3. Means and standard deviations used to generate the simulated dataset
Table 4-4. The test booklet design for the TASA 2007 fourth-grade mathematics assessment
Table 4-5. The questionnaire item from TASA 2007 fourth-grade
Table 5-1. RMSE of overall ability estimates for four models
Table 5-2. Correlation between true overall abilities and estimated overall abilities
Table 5-3. RMSE of domain ability estimates for four models
Table 5-4. Correlation between true domain abilities and estimated domain abilities
Table 5-5. RMSE of β estimates for three models
Table 5-6. RMSE of λ estimates for two models
Table 5-7. RMSE of item parameter estimates for three models when N=1000
Table 5-8. RMSE of item parameter estimates for three models when N=4000
Table 5-9. RMSE of group mean in overall ability
Table 5-10. RMSE of group standard deviation in overall ability
Table 5-11. RMSE of subgroup mean in overall ability
Table 5-12. RMSE of subgroup standard deviation in overall ability
Table 5-13. RMSE of group mean in domain ability
Table 5-14. RMSE of group standard deviation in domain ability
Table 5-15. RMSE of subgroup mean in domain ability
Table 5-16. RMSE of subgroup standard deviation in domain ability
Table 5-17. The overall ability estimates based on the real data
Table 5-18. The domain ability estimates based on the real data
Table 5-19. Model selection indices in each model

CHAPTER 1 INTRODUCTION

1.1. Motivation

Unidimensional item response theory models, which report overall abilities indicating students' levels of achievement in broadly defined content domains, are widely implemented in educational assessment. Although there has been an increasing demand for assessments that directly assist students and teachers in understanding and responding to key strengths and weaknesses in specific content domains, the need for large-scale assessments that are more formative and informative in design remains largely unfulfilled. As previous studies have reported, test items usually require more than one ability to answer correctly (Ansley & Forsyth, 1985; Reckase & McKinley, 1991). For this reason, considerable attention has recently been devoted to item response theory (IRT) models that include more than one ability, the so-called multidimensional IRT (MIRT) models. Various procedures exist for applying multidimensional IRT (Ackerman, 1996; Adams, Wilson, & Wang, 1997; Davey, Oshima, & Lee, 1996; Luecht, 1996; Reckase, 1997; Roussos & Stout, 1996; Stout, Habing, Douglas, Kim, Roussos, & Zhang, 1996; van der Linden, 1996).

Tests comprising different domains are common in many large-scale assessments. For example, in the Programme for International Student Assessment (PISA), mathematics literacy includes the overall ability and four subject domain abilities (quantity, space and shape, change and relationships, and uncertainty) (OECD, 2005). The overall and domain abilities complement each other and thus enhance the utility of large-scale assessments. As an indicator of an examinee's overall ability, the summative overall test ability is often useful for important decisions such as graduation, college admission, promotion to the next grade level, licensure and certification, and fulfilling accountability requirements. Abilities from specific content or construct areas supplement the overall test ability by allowing a finer-grained analysis of the examinee's ability. This diagnostic information not only identifies the examinee's relative strengths and weaknesses, but also helps assess and inform teachers' instruction in practical settings.

Recent years have therefore witnessed increased interest in the higher-order item response theory (HO-IRT) model (de la Torre & Song, 2009; de la Torre & Hong, 2010; Song, 2007). The HO-IRT is a new approach that estimates students' overall and domain-specific abilities simultaneously. It is an integrated model that can better capture the structure of multiple-component tests and provide efficient estimation of abilities at multiple levels. In addition, the HO-IRT approach can enhance the validity and usefulness of a given test by providing diagnostic subscale estimates in addition to an overall scale estimate (de la Torre & Song, 2009).

At present, several methods are available that aim to provide more precise and reliable ability estimates. The core idea shared by these methods is to incorporate background information into the estimation process, not only of the abilities but also of the item parameters (e.g., Ackerman & Davey, 1991; Kahraman & Kamata, 2004; Mislevy, 1987; Mislevy & Sheehan, 1989; de la Torre, 2009). Examples of this background information include demographic variables, such as sex, age, and race, and educational variables, such as grade level, courses taken, or performance on the overall test or on other subtests. Many studies have shown that incorporating student demographic and educational variables into the estimation process can lead to unbiased estimates of population parameters (such as population means, standard deviations, percentages in levels, and percentile points), more precise ability estimates, and consistent parameter estimates (Mislevy, 1984; Mislevy, 1987; Mislevy & Sheehan, 1989).
The theory and use of this method were first developed for the analyses of the 1983–84 US National Assessment of Educational Progress (NAEP) data (Mislevy, 1991; Mislevy, Beaton, Kaplan, & Sheehan, 1992; Beaton & Gonzalez, 1995), based on Rubin's (1987) work on multiple imputation. The Trends in International Mathematics and Science Study (TIMSS) has also applied the method by including the examinees' background variables obtained from their survey responses (Mislevy, 1984; Mislevy, Johnson, & Muraki, 1992; Gonzalez, Galia, & Li, 2004). Adams, Wilson, and Wu (1997) show how structural Rasch models can be viewed both as nonlinear multilevel models and as structural models that regress the latent ability variables on several demographic variables. Better ability estimates can be obtained by using models that simultaneously incorporate the various sources of background information. One such model is the mixed-coefficient multinomial logit model (Adams & Wu, 2006), which can be used in conjunction with item response and latent regression models. Using information from other subtests, some researchers have used a multidimensional IRT framework to obtain more precise ability estimates for test domains or subsections (de la Torre & Patz, 2005; Wang, Chen, & Cheng, 2004; Yao & Boughton, 2007; de la Torre, 2009). Such an integrated framework makes it possible to incorporate both the background variables, as covariates, and the correlation structure of the abilities into the estimation process.

However, these methods are based on unidimensional IRT (UIRT) or MIRT models, whereas the test structure of large-scale assessments more closely resembles the HO-IRT model. This study therefore proposes a multilevel HO-IRT (MHO-IRT) model. The MHO-IRT model considers both the structure of the abilities and the background information. An examinee's performance in each domain is accounted for by a single domain-specific ability, and the correlations among domain abilities are accounted for by positing a higher-order ability that can be viewed as the examinee's overall ability. The model uses these different sources of information to provide better ability estimates. Moreover, the literature indicates that Bayesian methods and Markov chain Monte Carlo (MCMC) methods can lead to better estimates of the item parameters.
Three studies indicate that MCMC makes it possible to estimate the joint posterior distribution of the item parameters and yields better estimates (Jones & Nediak, 2000; Patz & Junker, 1999a; Yao, Patz, & Hanson, 2002). Furthermore, the MCMC algorithm is flexible enough to be implemented in more complex models. In this dissertation, the parameters of the MHO-IRT model are calibrated by an MCMC algorithm in a hierarchical Bayesian framework.

1.2. Aims of this study

The major aims of this study are as follows: (a) to propose a multilevel HO-IRT (MHO-IRT) model that incorporates background variables; (b) to develop an MCMC algorithm that simultaneously estimates the overall ability, domain abilities, item parameters, and latent regression coefficients of the proposed MHO-IRT model; and (c) to investigate the performance of the proposed model by comparing it with other multilevel IRT models in simulated and real data experiments.

1.3. Significance and contribution

As discussed above, this dissertation makes two major contributions to educational measurement and research. First, a multilevel HO-IRT model is proposed that incorporates background variables into the HO-IRT model. Incorporating the background variables can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent model parameter estimates. Second, an MCMC algorithm is proposed that estimates the overall ability, domain abilities, item parameters, and latent regression coefficients simultaneously. The proposed MHO-IRT model is an integrated model that can better capture the test structure and provide efficient estimates. The MCMC procedure developed to estimate the parameters of this complex model can also be used in conjunction with various item response models. Moreover, this dissertation compares the proposed model with five other models, and the experimental results indicate the situations in which the proposed model is appropriate.

CHAPTER 2 LITERATURE REVIEW

This chapter first discusses a variety of dichotomous IRT, MIRT, and HO-IRT models. The principle of the multilevel method is then presented and implemented in unidimensional and multidimensional item response theory models, together with the corresponding Markov chain Monte Carlo algorithms. Finally, model fit is described.

2.1. Item Response Models

2.1.1. Unidimensional Item Response Models

Item response theory now contains a large family of models. The simplest of these is the Rasch (1960) model, also known as the one-parameter logistic (1PL) model. For the Rasch model, the dependent variable is the dichotomous response of a particular person to a specified item. The 1PL function gives the prediction as follows:

$$P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)} \tag{2.1.1}$$

where $P(X_{ij} = 1)$ is the probability that examinee $i$ answers item $j$ correctly; $b_j$ is the difficulty parameter of item $j$; and $\theta_i$ is the $i$th examinee's ability parameter for the administered test.

In the two-parameter logistic (2PL) model, item discrimination is included in the measurement model, so that two parameters represent the item properties. Both the item difficulty, $b_j$, and the item discrimination, $a_j$, appear in the exponential form of the logistic model (Birnbaum, 1968):

$$P(X_{ij} = 1 \mid \theta_i, a_j, b_j) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]} \tag{2.1.2}$$

Notice that the item discrimination is a multiplier of the difference between the trait level and the item difficulty. Item discriminations are related to the biserial correlations between item responses and total scores.
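As a small illustration of (2.1.1) and (2.1.2), the following sketch evaluates both response functions for a single examinee and item. The function names and the numerical values are hypothetical and chosen only for illustration; they do not come from the dissertation.

```python
import math


def prob_2pl(theta, a, b):
    """P(X_ij = 1 | theta_i, a_j, b_j) under the 2PL model, equation (2.1.2)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def prob_1pl(theta, b):
    """Rasch / 1PL model, equation (2.1.1): the 2PL with a_j fixed at 1."""
    return prob_2pl(theta, 1.0, b)


# Hypothetical abilities and item parameters, for illustration only.
for theta in (-1.0, 0.0, 1.0):
    p1 = prob_1pl(theta, b=0.0)
    p2 = prob_2pl(theta, a=2.0, b=0.0)
    print(f"theta={theta:+.1f}  1PL={p1:.3f}  2PL(a=2)={p2:.3f}")
```

With a discrimination above 1, the 2PL curve is steeper around the item difficulty, so an examinee above the difficulty has a higher success probability than under the 1PL and one below it has a lower probability.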

When a third parameter, the guessing parameter $c_j$, is added to the 2PL model, it becomes the three-parameter logistic (3PL) IRT model (Lord, 1980):

$$P(X_{ij} = 1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1 - c_j)\,\frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]} \tag{2.1.3}$$

where $c_j$ represents the probability that an examinee endorses item $j$ when his or her ability level approaches the low extreme.

Patz and Junker (1999a) described a general Markov chain Monte Carlo strategy, based on Metropolis-Hastings sampling, for Bayesian inference in complex item response theory settings. They demonstrated the basic MCMC methodology using the two-parameter logistic (2PL) model. Patz and Junker (1999b) extended this methodology to address issues such as non-response, designed missingness, multiple raters, guessing behavior, and partial credit (polytomous) test items.

The MCMC algorithm for the unidimensional 3PL model is described in the following. In a Bayesian framework, the prior distributions of the ability and item parameters can be expressed as (Patz & Junker, 1999a):

$$\theta_i \sim N(0, 1) \tag{2.1.4}$$

$$a_j \sim \log N(0.6, 1.13) \tag{2.1.5}$$

$$b_j \sim N(0, 1) \tag{2.1.6}$$

$$c_j \sim \mathrm{Beta}(4, 16) \tag{2.1.7}$$

where $a_j$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$, and $IP = \{a, b, c\}$ denotes the set of item parameters. Under this formulation, the marginal distribution of the ability is the standard normal distribution. The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$, can be expressed as

$$P(\boldsymbol{\theta}, IP \mid \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}, IP)\,P(\boldsymbol{\theta}, IP) = P(\mathbf{X} \mid \boldsymbol{\theta}, IP)\,P(\boldsymbol{\theta})\,P(IP) \tag{2.1.8}$$

where the last equality uses the prior independence of $\boldsymbol{\theta}$ and $IP$.
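The unnormalized posterior in (2.1.8) is the quantity that a Metropolis-Hastings sampler of the kind Patz and Junker describe evaluates repeatedly. The sketch below computes its logarithm for the 3PL model under the priors (2.1.4) to (2.1.7) and performs one random-walk update of a single examinee's ability. It is an illustration under stated assumptions, not the dissertation's implementation: the function names, the step size, and the toy data are hypothetical, and the second argument of the lognormal prior is read here as the variance of log(a), which the text does not specify.

```python
import math
import random


def prob_3pl(theta, a, b, c):
    """P(X_ij = 1) under the 3PL model, equation (2.1.3)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_lik_person(x_i, theta, items):
    """Log-likelihood of one examinee's 0/1 responses; items holds (a, b, c) tuples."""
    total = 0.0
    for x, (a, b, c) in zip(x_i, items):
        # Clamp to avoid log(0) when the logistic term saturates.
        p = min(max(prob_3pl(theta, a, b, c), 1e-12), 1.0 - 1e-12)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


def log_prior_theta(theta):
    """N(0, 1) prior (2.1.4), up to an additive constant."""
    return -0.5 * theta * theta


def log_prior_item(a, b, c, var_log_a=1.13):
    """Priors (2.1.5)-(2.1.7), up to additive constants.

    ASSUMPTION: 1.13 is treated as the variance of log(a).
    """
    if a <= 0.0 or not 0.0 < c < 1.0:
        return -math.inf
    lp_a = -math.log(a) - (math.log(a) - 0.6) ** 2 / (2.0 * var_log_a)
    lp_b = -0.5 * b * b
    lp_c = 3.0 * math.log(c) + 15.0 * math.log(1.0 - c)  # Beta(4, 16) kernel
    return lp_a + lp_b + lp_c


def log_posterior(X, thetas, items):
    """Unnormalized log of (2.1.8): log-likelihood plus log-priors."""
    lp = sum(log_prior_item(a, b, c) for a, b, c in items)
    for x_i, theta in zip(X, thetas):
        lp += log_lik_person(x_i, theta, items) + log_prior_theta(theta)
    return lp


def mh_update_theta(x_i, theta, items, rng, step_sd=1.0):
    """One random-walk Metropolis step for a single examinee's ability."""
    proposal = rng.gauss(theta, step_sd)
    log_ratio = (log_lik_person(x_i, proposal, items) + log_prior_theta(proposal)
                 - log_lik_person(x_i, theta, items) - log_prior_theta(theta))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return theta


# Hypothetical toy data: two examinees, three items.
rng = random.Random(0)
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
X = [[1, 1, 0], [0, 0, 0]]
thetas = [0.0, 0.0]
for _ in range(500):
    thetas = [mh_update_theta(x_i, th, items, rng) for x_i, th in zip(X, thetas)]
print(log_posterior(X, thetas, items))
```

Working on the log scale avoids numerical underflow when many item responses are multiplied together, and the normal random-walk proposal is symmetric, so no proposal-density correction appears in the acceptance ratio.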

The full conditional distributions of $\boldsymbol{\theta}$ and $IP$ are derived as follows:

$$P(\boldsymbol{\theta} \mid IP, \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}, IP)\,P(\boldsymbol{\theta}) \tag{2.1.9}$$

$$P(IP \mid \boldsymbol{\theta}, \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}, IP)\,P(IP) \tag{2.1.10}$$

where $P(\boldsymbol{\theta})$ is the $N(0, 1)$ prior in (2.1.4). At iteration $t$, the outline of the MCMC algorithm is as follows.

1. Because $\boldsymbol{\theta}$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\theta_i^*$ from $N\big(\theta_i^{(t-1)}, \sigma^2_{\theta^{(t-1)}}\big)$ and accept $\theta_i^*$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \theta_i^*, IP^{(t-1)})\,P(\theta_i^*)}{P(\mathbf{X} \mid \theta_i^{(t-1)}, IP^{(t-1)})\,P(\theta_i^{(t-1)})},\ 1\right\} \tag{2.1.11}$$

2. The item parameters $IP = \{a, b, c\}$ are then sampled in three separate parts. First, draw the candidate value $a_j^*$ from $N(a_j^{(t-1)}, 1)$ and accept $a_j^*$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^*, b_j^{(t-1)}, c_j^{(t-1)}\})\,P(IP\{a_j^*, b_j^{(t-1)}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^{(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})\,P(IP\{a_j^{(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{2.1.12}$$

Second, draw the candidate value $b_j^*$ from $N(b_j^{(t-1)}, 1)$ and accept $b_j^*$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^{(t)}, b_j^*, c_j^{(t-1)}\})\,P(IP\{a_j^{(t)}, b_j^*, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^{(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})\,P(IP\{a_j^{(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{2.1.13}$$

Third, draw the candidate value $c_j^*$ from $N(c_j^{(t-1)}, 1)$ and accept $c_j^*$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^{(t)}, b_j^{(t)}, c_j^*\})\,P(IP\{a_j^{(t)}, b_j^{(t)}, c_j^*\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(t)}, IP\{a_j^{(t)}, b_j^{(t)}, c_j^{(t-1)}\})\,P(IP\{a_j^{(t)}, b_j^{(t)}, c_j^{(t-1)}\})},\ 1\right\} \tag{2.1.14}$$

2.1.2. Multidimensional Item Response Models

Many assessments are designed to report not only overall ability but also domain abilities on a few domains or subskills, with a certain number of items in each domain. Multidimensional IRT (MIRT) models use two or more parameters, together with their covariance structure, to represent each person's trait level. The Multidimensional One-Parameter Logistic Model (M1PLM; McKinley & Reckase, 1982) can be expressed as:

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, b_j) = \frac{\exp(\boldsymbol{\theta}_i'\mathbf{1} - b_j)}{1 + \exp(\boldsymbol{\theta}_i'\mathbf{1} - b_j)} \tag{2.1.15}$$

where $P(X_{ij} = 1)$ is the probability of a correct response; $\boldsymbol{\theta}_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{ip})'$ refers to the $p$-dimensional abilities; $b_j$ is the difficulty parameter of item $j$; and $\mathbf{1}$ is a $p \times 1$ vector of 1's.

In addition to the M1PLM, Adams, Wilson, and Wang (1997) proposed the multidimensional random coefficients multinomial logit model (MRCMLM) for Rasch family models. As a member of the exponential family of distributions, the MRCMLM can be viewed as a generalized linear mixed model (De Boeck & Wilson, 2004; McCulloch & Searle, 2001; Rijmen, Tuerlinckx, De Boeck, & Kuppens, 2003; Wang & Wilson, 2005a; Wang & Wilson, 2005b). The Rasch testlet model (Wang & Wilson, 2005b), the linear logistic test model (LLTM; Fischer, 1973), the rating scale model (RSM; Andrich, 1978), and the partial credit model (PCM; Masters, 1982) are all special cases of the MRCMLM. The model can be expressed as

$$P(X_{ijk} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\xi}) = \frac{\exp(\mathbf{b}_{jk}'\boldsymbol{\theta}_i + \mathbf{a}_{jk}'\boldsymbol{\xi})}{\sum_{u=1}^{O_j} \exp(\mathbf{b}_{ju}'\boldsymbol{\theta}_i + \mathbf{a}_{ju}'\boldsymbol{\xi})} \tag{2.1.16}$$

where $P(X_{ijk} = 1)$ is the probability that examinee $i$ responds to item $j$ in category $k$; $O_j$ is the number of categories of item $j$; $\boldsymbol{\xi}$ is a vector of item difficulty parameters; $\mathbf{b}_{jk}$ is a score vector given to category $k$ of item $j$ across the $p$ abilities; and $\mathbf{a}_{jk}$ is a design vector given to category $k$ of item $j$ that combines the elements of $\boldsymbol{\xi}$ into a linear relationship. The commercial computer program ConQuest (Wu, Adams, & Wilson, 1998) can be used to calibrate the parameters of the MRCMLM.
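The category probabilities in (2.1.16) have the form of a softmax over one linear predictor per category. The sketch below computes them for a single item and checks that a dichotomous Rasch item is recovered as a special case. The function name, the argument layout, and the numerical values are hypothetical and serve only as an illustration; this is not how ConQuest is implemented.

```python
import math


def mrcml_category_probs(theta, B, A, xi):
    """Category probabilities for one item under (2.1.16).

    theta: list of p abilities
    B:     one score vector (length p) per response category
    A:     one design vector (length q) per response category
    xi:    list of q item parameters
    """
    scores = [
        sum(b * t for b, t in zip(b_row, theta)) + sum(a * x for a, x in zip(a_row, xi))
        for b_row, a_row in zip(B, A)
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical dichotomous Rasch item: categories 0 and 1, one ability,
# one item parameter (difficulty 0.3 enters with a negative design weight).
probs = mrcml_category_probs(theta=[0.8], B=[[0.0], [1.0]], A=[[0.0], [-1.0]], xi=[0.3])
print(probs)
```

In this special case the probability of category 1 equals the 1PL probability with ability 0.8 and difficulty 0.3, which is how the scoring and design vectors let one expression cover several Rasch family models.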

(19) When the discrimination is included in the model, Equation (2.1.17) will be the Multidimensional Two-parameter Logistic Model (M2PLM; Reckase, 1997). The function is defined by the following:. P( X ij  1 |  i , a j , b j )  '. exp(a 'j i  b j ) 1  exp(a 'j i  b j ). (2.1.17). where a'j is a 1  p vector of discrimination parameters for item j . Reckase (1985) proposed a multidimensional IRT model as an extension of the 3PL. In his original formulation, a single item can measure two or more abilities. Extending the 3PL model to a multidimensional context, Reckase (1997) formulated linear logistic multidimensional model as:. P( X ij  1 |  i , a j , b j , c j )  c j  (1  c j ) '. exp(a 'j i  b j ) 1  exp(a 'j i  b j ). (2.1.18). where c j is the guessing parameters for item j . 2.1.3. Higher-Order Item Response Model HO-IRT model was developed for simultaneous estimation of the overall and. domain abilities. In the proposed HO-IRT model, a test is viewed as consisting of several unidimensional sub domains. That is, a single domain-specific ability  i(d ) accounts for examinee i ’s performance on domain d , where d  1, 2 ,..., D . When different domains measure the same ability, the entire test is deemed unidimensional. The correlations between different domain abilities can be accounted for by posting a higher-order ability i that is viewed as the examinee’s overall ability. Specifically, the domain abilities are expressed as linear functions of the overall ability (de la Torre & Song, 2009)..  i( d )  ( d ) i   id. (2.1.19). where  ( d ) is the latent regression coefficient, and id is the error term that is assumed to be normally distributed with a mean of zero and variance of 1  (  ( d ) ) 2 , and  ( d )  1 . By imposing these constraints, the marginal distribution of  i(d ) is. 9.

guaranteed to follow the same distribution as $\theta_i$, namely, the standard normal distribution $N(0,1)$. It is also assumed that the domain-level abilities are independent of each other given the overall ability. The correlation between the overall and domain abilities is given by $\lambda^{(d)}$, whereas the correlation between the abilities of domains $d$ and $d'$ is $\lambda^{(d)}\lambda^{(d')}$. Although $\lambda^{(d)}$ can be negative, it is expected to be non-negative in most educational applications, where domain abilities are positively correlated with the overall ability.

Figure 2-1 A HO-IRT model applied to a D-domain test. Adapted from "A higher-order item response model: development and application" by Song, H., 2007, doctoral dissertation, The State University of New Jersey.

The diagrammatic representation of the HO-IRT model is given in Figure 2-1. The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the specific item parameters $IP_j^{(d)}$ via IRT models. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent regression parameter $\lambda^{(d)}$.

2.2. Multilevel IRT Models

Several methods are currently available for improving the estimation of domain abilities. The core idea shared by these methods is to incorporate background variables into the estimation process for improving the estimation of item parameters and person

abilities (e.g., Mislevy, 1987; Mislevy & Sheehan, 1989; Adams, Wilson & Wu, 1997; de la Torre, 2009; von Davier & Sinharay, 2010). Background variables include examinees' demographic and educational background variables, examinees' performance on the overall test or on other subtests, and the correlation structure of the underlying abilities that are best estimated by the IRT scale scores (de la Torre, 2003; Wainer et al., 2001). Two kinds of models are included in this section: the multilevel unidimensional IRT model (Mislevy, Johnson, & Muraki, 1992) and the multilevel multidimensional IRT model (de la Torre, 2009).

2.2.1. Multilevel Unidimensional IRT Model

Several methods are currently available that intend to provide more precise and reliable estimates by incorporating background variables. Research evidence has shown that incorporating student demographic and educational variables in the estimation process can lead to unbiased estimates of population parameters, more precise ability estimates, and consistent parameter estimates (Mislevy, 1984; Mislevy, 1987; Mislevy & Sheehan, 1989). A hierarchical modeling framework allows different models to be specified at the different levels of the hierarchy. IRT models are an example of such an approach, since they integrate two models specified at two levels. At the first level is the item response function that relates the examinee's ability and the item characteristics to the probability of a particular response; at the second level is the distribution function that characterizes how the ability is distributed in the population. One can view the former as modeling the within-person variability and the latter as modeling the between-person variability (Adams, Wilson, & Wu, 1997). This idea was implemented in the scaling process for the National Assessment of Educational Progress (NAEP) (Mislevy, Johnson, & Muraki, 1992; Gonzalez, Galia, & Li, 2004).

The NAEP scaling approach was originally devised for reporting population abilities on the overall test or test domains (Mislevy, Johnson, & Muraki, 1992). Instead of estimating ability for individual examinees, NAEP generates consistent estimates of population characteristics using marginal estimation techniques. The basic idea of the NAEP scaling procedure is to improve ability estimation by

incorporating the ancillary information from background surveys, an approach known as plausible values methodology. Plausible values methodology was developed as a way to address this issue by using all available data to estimate directly the characteristics of student populations and subpopulations, and then generating multiple imputed scores, called plausible values, from these distributions that can be used in analyses with standard statistical software. A detailed review of plausible values methodology is given in Mislevy (1991).

Suppose a sample statistic $t(\boldsymbol{\theta}, \mathbf{Y})$ is used for estimating a corresponding population parameter $T$, where $\boldsymbol{\theta}$ represents the latent ability values for all sampled examinees, and $\mathbf{Y}$ represents the students' background variables. By treating $\boldsymbol{\theta}$ as missing (Rubin, 1987), $t(\boldsymbol{\theta}, \mathbf{Y})$ can be evaluated through multiple imputations, and the resultant values are plausible values. The estimate of $t(\boldsymbol{\theta}, \mathbf{Y})$ by its expectation conditional on the observed data $(\mathbf{X}, \mathbf{Y})$ is

$$t^{*}(\mathbf{X}, \mathbf{Y}) = E[t(\boldsymbol{\theta}, \mathbf{Y}) \mid \mathbf{X}, \mathbf{Y}] = \int t(\boldsymbol{\theta}, \mathbf{Y})\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\, d\boldsymbol{\theta} \tag{2.2.1}$$

where $\mathbf{X}$ represents the responses of all sampled examinees to the test items, and $\boldsymbol{\theta}$ is a vector of unknown abilities. In IRT measurement models, closed-form solutions for this equation are not available. Instead, the integral can be approximated by a Monte Carlo procedure that randomly draws from the conditional distributions $p(\theta \mid \mathbf{x}_i, \mathbf{y}_i)$ for each sampled examinee $i$. The posterior distribution $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})$ is obtained from Bayes' theorem and the IRT model:

$$P(\theta \mid \mathbf{x}_i, \mathbf{y}_i) \propto P(\mathbf{x}_i \mid \theta, \mathbf{y}_i)\, p(\theta \mid \mathbf{y}_i) = P(\mathbf{x}_i \mid \theta)\, p(\theta \mid \mathbf{y}_i) \tag{2.2.2}$$

where $\mathbf{x}_i$ is the vector of observed responses of examinee $i$ to the test items, and $\mathbf{y}_i$ is the vector of observed background variables. The item parameters are assumed to be known. Assume that $p(\theta \mid \mathbf{y}_i)$ is normal, and that $\theta$ is a linear function of the background variables $\mathbf{y}_i$ and their interactions, denoted by $\mathbf{y}^{c}$:

$$\theta = \boldsymbol{\Gamma}'\mathbf{y}^{c} + \varepsilon \tag{2.2.3}$$

where $\varepsilon$ is assumed to be normally distributed with mean 0 and variance $\Sigma$. $\boldsymbol{\Gamma}$ and $\Sigma$ are parameters that can be estimated through maximum likelihood and Bayesian estimation procedures (see Mislevy, Johnson, & Muraki, 1992, p. 140 for details). The normalized likelihood is used for the estimation of $\boldsymbol{\Gamma}$ and $\Sigma$ and for the generation of plausible values.

2.2.2. Multilevel Multidimensional IRT Model

de la Torre and Patz (2005) devised a method to improve the estimation of domain abilities by incorporating the correlation structure of the abilities. de la Torre (2009) proposed a general framework for ability estimation in which both the background variables (covariates) and the correlation structure of the abilities can be incorporated in the estimation process. The extension of the 3PL model to the multidimensional context (Reckase, 1997) is given by

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, b_j, c_j) = c_j + (1-c_j)\,\frac{\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)}{1+\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)} \tag{2.2.4}$$

where $P(X_{ij}=1)$ is the probability of a correct response; $\boldsymbol{\theta}_i = \{\theta_{i1}, \theta_{i2}, \ldots, \theta_{ip}\}$ refers to the $p$-dimensional abilities; $\mathbf{a}_j'$ is a $1 \times p$ vector of discrimination parameters for item $j$; $b_j$ is the difficulty parameter for item $j$; and $c_j$ is the guessing parameter for item $j$. The prior distributions of the ability parameters are given below. For examinee $i$ with ability $\boldsymbol{\theta}_i$,

$$\boldsymbol{\theta}_i \mid \boldsymbol{\Xi}, \boldsymbol{\Sigma}, \mathbf{Y}_i \sim MVN(\boldsymbol{\Xi}'\mathbf{Y}_i, \boldsymbol{\Sigma}) \tag{2.2.5}$$

$$\boldsymbol{\Sigma} \sim \text{Inv-Wishart}_{v_0}(\boldsymbol{\Lambda}_0^{-1}) \tag{2.2.6}$$

$$\boldsymbol{\Xi} \sim (-\infty, \infty) \tag{2.2.7}$$

that is, $\boldsymbol{\Xi}$ is given an improper flat prior. Parameters were estimated using MCMC. The following is an outline of the MCMC algorithm.

Iteration 0:

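1. Assign the following initial values to the parameters: $\boldsymbol{\Xi} = \mathbf{0}$, $\boldsymbol{\Sigma} = \mathbf{I}$, and $\boldsymbol{\Theta}$, random draws from $MVN(\mathbf{0}, \mathbf{I})$.

Iteration t:

2. For the regression parameters, the full conditional distribution of $\boldsymbol{\Xi}$, $p(\boldsymbol{\Xi} \mid \boldsymbol{\Theta}, \boldsymbol{\Sigma}, \mathbf{Y})$, given an improper prior, is the matrix-normal $MVN((\mathbf{Y}'\mathbf{Y})^{-1}\mathbf{Y}'\boldsymbol{\Theta},\ \boldsymbol{\Sigma} \otimes (\mathbf{Y}'\mathbf{Y})^{-1})$. This allows $\boldsymbol{\Xi}^{(t)}$ to be sampled directly from $p(\boldsymbol{\Xi} \mid \boldsymbol{\Theta}^{(t-1)}, \boldsymbol{\Sigma}^{(t-1)}, \mathbf{Y})$.

3. The full conditional distribution of $\boldsymbol{\Sigma}$, $p(\boldsymbol{\Sigma} \mid \boldsymbol{\Theta}, \boldsymbol{\Xi}, \mathbf{Y})$, is an $\text{Inv-Wishart}_{v_I}(\boldsymbol{\Lambda}_I^{-1})$, where $v_I = v_0 + I$ and $\boldsymbol{\Lambda}_I = \boldsymbol{\Lambda}_0 + \sum_{i=1}^{I}(\boldsymbol{\theta}_i - \boldsymbol{\Xi}'\mathbf{Y}_i)(\boldsymbol{\theta}_i - \boldsymbol{\Xi}'\mathbf{Y}_i)'$ (Gelman, Carlin, Stern, & Rubin, 2003). Therefore, $\boldsymbol{\Sigma}^{(t)}$ can be sampled directly from $p(\boldsymbol{\Sigma} \mid \boldsymbol{\Xi}^{(t)}, \boldsymbol{\Theta}^{(t-1)}, \mathbf{Y})$.

4. Finally, since $\boldsymbol{\Theta}$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\boldsymbol{\theta}_i^{*}$ from $MVN(\boldsymbol{\theta}_i^{(t-1)}, \boldsymbol{\Sigma}_c)$, where $\boldsymbol{\Sigma}_c$ is the fixed scale of the candidate-generating distribution. Accept the draw with probability

$$\alpha(\boldsymbol{\theta}_i^{(t-1)}, \boldsymbol{\theta}_i^{*}) = \min\left\{\frac{p(\mathbf{x}_i \mid \boldsymbol{\theta}_i^{*})\, p(\boldsymbol{\theta}_i^{*} \mid \boldsymbol{\Xi}^{(t)}, \boldsymbol{\Sigma}^{(t)}, \mathbf{y}_i)}{p(\mathbf{x}_i \mid \boldsymbol{\theta}_i^{(t-1)})\, p(\boldsymbol{\theta}_i^{(t-1)} \mid \boldsymbol{\Xi}^{(t)}, \boldsymbol{\Sigma}^{(t)}, \mathbf{y}_i)},\ 1\right\} \tag{2.2.8}$$

To sample from $p(\boldsymbol{\Xi} \mid \boldsymbol{\Theta}, \boldsymbol{\Sigma}, \mathbf{Y})$ in vector format, one can use the following algorithm:

$$\mathrm{Vec}(\boldsymbol{\Xi}) = [\boldsymbol{\Sigma} \otimes (\mathbf{Y}'\mathbf{Y})^{-1}]^{1/2}\,\mathbf{Z} + \mathrm{Vec}((\mathbf{Y}'\mathbf{Y})^{-1}\mathbf{Y}'\boldsymbol{\Theta}) \tag{2.2.9}$$

where $\mathbf{Z}_{(DP) \times 1} \sim MVN(\mathbf{0}, \mathbf{I})$, and $\mathrm{Vec}(\cdot)$ stacks the column vectors of its argument. Gelman, Carlin, Stern, and Rubin (1995, p. 409) provide an alternative method of sampling from this full conditional distribution that avoids the use of the Kronecker product. By recasting the matrices, where $\mathbf{y}_i$ is the background variables vector of examinee $i$ (i.e., the transpose of the $i$th row of the design matrix $\mathbf{Y}$), one can sample from

$$\boldsymbol{\Xi}^{*} \sim MVN\left((\mathbf{Y}^{*\prime}\boldsymbol{\Sigma}^{*-1}\mathbf{Y}^{*})^{-1}\mathbf{Y}^{*\prime}\boldsymbol{\Sigma}^{*-1}\boldsymbol{\Theta}^{*},\ (\mathbf{Y}^{*\prime}\boldsymbol{\Sigma}^{*-1}\mathbf{Y}^{*})^{-1}\right) \tag{2.2.10}$$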
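The direct draw in Equation (2.2.9) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the study: the function name `draw_xi`, the array layout (abilities as an I x D array, design matrix as I x P), and the choice of a Cholesky factor as the matrix square root are all hypothetical choices made for the example.

```python
import numpy as np

def draw_xi(theta, Y, sigma, rng):
    """Draw Xi (P x D) from its matrix-normal full conditional, Eq. (2.2.9).

    theta : (I, D) array of current abilities (assumed layout)
    Y     : (I, P) design matrix of background variables (assumed layout)
    sigma : (D, D) current covariance matrix of the abilities
    rng   : numpy.random.Generator
    """
    n_cov, n_dim = Y.shape[1], theta.shape[1]
    yty_inv = np.linalg.inv(Y.T @ Y)
    xi_hat = yty_inv @ Y.T @ theta          # conditional mean, P x D
    cov = np.kron(sigma, yty_inv)           # covariance of Vec(Xi), column-stacked
    root = np.linalg.cholesky(cov)          # one valid choice of matrix square root
    z = rng.standard_normal(n_cov * n_dim)  # Z ~ MVN(0, I)
    vec_xi = root @ z + xi_hat.reshape(-1, order="F")
    return vec_xi.reshape((n_cov, n_dim), order="F")
```

Column-major (`order="F"`) reshaping is used so that the stacking matches the Kronecker ordering of the covariance. For large designs the explicit Kronecker product becomes expensive, which is the motivation for the alternative in Equation (2.2.10).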

They also suggested the use of matrix factorization to avoid the inversion of large matrices in this algorithm.

In summary, both kinds of approach incorporate background variables into the estimation process in order to obtain more precise and reliable domain abilities. However, those methods are based on unidimensional or multidimensional IRT models. None of these methods estimates the overall ability together with the domain abilities. This study proposes a multilevel higher-order IRT estimation method.

2.3. Model fit

There is usually uncertainty about the appropriate error structure and the predictor variables to include in models. Adding more parameters may improve fit, but possibly at the expense of identifiability and generalizability. Model selection criteria assess whether improvements in fit measures such as likelihoods, deviances, or error sums of squares justify the inclusion of extra parameters in a model. Classical and Bayesian model choice methods may both involve comparison either of measures of fit to the current data or of cross-validatory fit to out-of-sample data. For example, the deviance statistics of generalized linear models (with Poisson, normal, binomial, or other exponential family outcomes) follow standard densities for comparisons of models nested within one another, at least approximately in large samples (McCullagh and Nelder, 1989). Penalised measures of fit (Akaike, 1973) may be used, involving an adjustment to the model log-likelihood or deviance to reflect the number of parameters in the model (Congdon, 2003).

In this dissertation, three criteria were used to assess model fit: (1) Akaike's information criterion, AIC (Congdon, 2003); (2) the Bayesian information criterion, BIC (Congdon, 2003); and (3) the deviance information criterion, DIC (Spiegelhalter, Best & Carlin, 1998). Let $L$ denote the likelihood and $D$ the deviance of a model involving $p$ parameters.

The deviance may be simply defined as minus twice the log likelihood,

$D = -2\log L$. Then, to allow for the number of parameters, one may use criteria such as the Akaike Information Criterion (AIC), expressed as

$$AIC(\text{Model}) = D(\hat{\theta}) + 2p \tag{4.1.5}$$

where $D(\hat{\theta})$ is the deviance evaluated at the parameter estimates. So when the AIC is used to compare models, an increase in likelihood and reduction in deviance is offset by a greater penalty for more complex models. Another criterion used generally as a penalized fit measure, though also justified as an asymptotic approximation to the Bayesian posterior probability of a model, is the Schwarz Information Criterion (Schwarz, 1978). This is also often called the Bayes Information Criterion. Depending on the simplifying assumptions made, it may take different forms, but the most common version, for a sample of size $N$, is

$$BIC(\text{Model}) = D(\hat{\theta}) + p\log(N) \tag{4.1.6}$$

Spiegelhalter, Best and Carlin (1998) developed a Bayesian alternative to both AIC and BIC, based on the deviance and called DIC. This criterion is more satisfactory than the two former alternatives because it takes into account the prior information and gives a natural penalization factor to the log-likelihood:

$$DIC(\text{Model}) = \bar{D}(\theta) + p_D \tag{4.1.7}$$

Here $\bar{D}(\theta)$ is the posterior mean of the deviance over the MCMC draws, and $p_D$, the effective number of parameters, is the difference between the posterior mean deviance and the deviance evaluated at the posterior means of the parameters, $p_D = \bar{D}(\theta) - D(\bar{\theta})$.
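As a small worked illustration of the three criteria, the sketch below computes AIC, BIC, and DIC from deviance values. The function names are hypothetical, and the DIC function assumes the usual definition of the effective number of parameters from Spiegelhalter et al., namely the posterior mean deviance minus the deviance at the posterior means of the parameters.

```python
import math

def aic(deviance, n_params):
    """AIC = D + 2p, with D = -2 log L."""
    return deviance + 2 * n_params

def bic(deviance, n_params, n_obs):
    """BIC = D + p log(N)."""
    return deviance + n_params * math.log(n_obs)

def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from MCMC deviance draws.

    deviance_draws: deviance evaluated at each retained MCMC draw
    deviance_at_posterior_mean: deviance at the posterior means of the parameters
    """
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d, p_d
```

For all three criteria, smaller values indicate a better trade-off between fit and complexity, so competing models are ranked by the same rule.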

CHAPTER 3 A Multilevel Higher-Order Item Response Model

3.1. Model Specification

In this chapter, a multilevel higher-order item response model is proposed to combine the higher-order item response model with background variables. In this model, a test is viewed as consisting of several unidimensional subtest domains. That is, a single domain-specific ability $\theta_i^{(d)}$ accounts for examinee $i$'s performance on domain $d$, where $d = 1, 2, \ldots, D$. The overall ability is assumed to be normally distributed; that is, students are assumed to have been sampled from a normal population with mean $\mu$ and variance $\sigma^2$:

$$f(\theta; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2}\exp\left[-\frac{(\theta-\mu)^2}{2\sigma^2}\right] \tag{3.1.1}$$

or equivalently

$$\theta = \mu + E \tag{3.1.2}$$

where $E \sim N(0, \sigma^2)$. Adams et al. (1997) discuss how a natural extension of (3.1.2) is to replace the mean $\mu$ with the regression model $\mathbf{Y}_i\boldsymbol{\beta}$, where $\mathbf{Y}_i$ is a vector of background variables with fixed and known values for student $i$, and $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients. For example, $\mathbf{Y}_i$ could consist of student variables such as gender or socio-economic status. The population model for student $i$ then becomes

$$\theta_i = \mathbf{Y}_i\boldsymbol{\beta} + E_i \tag{3.1.3}$$

where the $E_i$ are assumed to be independently and identically normally distributed with mean zero and variance $\sigma^2$. The correlations between different domain abilities can be accounted for by positing a higher-order ability $\theta_i$ that is viewed as the examinee's overall ability.

Specifically, the domain abilities are expressed as linear functions of the overall ability:

$$\theta_i^{(d)} = \lambda^{(d)}\theta_i + \varepsilon_{id} \tag{3.1.4}$$

where $\lambda^{(d)}$ is the latent regression coefficient, and $\varepsilon_{id}$ is the error term that is assumed to be normally distributed with a mean of zero and a variance of $1-(\lambda^{(d)})^2$.

Figure 3-1 Multilevel HO-IRT method. Observed variables are in boxes; the remaining variables are to be estimated. Adapted from "A higher-order item response model: development and application" by Song, H., 2007, doctoral dissertation, The State University of New Jersey.

The diagrammatic representation of the MHO-IRT model is given in Figure 3-1. The first level of the figure shows the response of examinee $i$ to the $j$th item in domain $d$. On the second level, an examinee's domain-level responses are linked to the examinee's domain-specific ability $\theta_i^{(d)}$ and the specific item characteristics $IP_j^{(d)} = \{a_j^{(d)}, b_j, c_j\}$ via IRT models, where $a_j^{(d)}$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$. On the third level of the figure, the examinee's domain ability is related to his or her overall ability $\theta_i$ by the latent

regression parameter $\lambda^{(d)}$. On the fourth level, the examinee's overall ability is related to his or her background variables $Y_{ni}$ by the latent regression parameter $\beta_n$.

3.2. Parameter Estimation

For this study, the model parameters were estimated using MCMC methods. The proposed procedure, which uses simultaneous estimation and background variables, was compared to procedures that estimate abilities one at a time or ignore the background variables. In addition, although this study focuses on the three-parameter logistic (3PL) model, the framework was formulated such that other item response models can be used in its place.

3.2.1. Prior Distributions

The prior distributions of the ability, item, and latent regression parameters are given below. In a hierarchical Bayesian framework, the model can be expressed as:

$$\beta_n \sim U(0, 1) \tag{3.2.1}$$

$$\theta_i \mid \boldsymbol{\beta}, \mathbf{Y}_i \sim N\left(\boldsymbol{\beta}\mathbf{Y}_i,\ 1-\sum_{n}\beta_n^2\right) \tag{3.2.2}$$

$$\lambda^{(d)} \sim U(0, 1.0) \tag{3.2.3}$$

$$\theta_i^{(d)} \mid \theta_i, \lambda^{(d)} \sim N\left(\lambda^{(d)}\theta_i,\ 1-(\lambda^{(d)})^2\right) \tag{3.2.4}$$

$$a_j^{(d)} \sim \log N(0, 1) \tag{3.2.5}$$

$$b_j \sim N(0, 1) \tag{3.2.6}$$

$$c_j \sim Beta(4, 16) \tag{3.2.7}$$

where $\mathbf{Y}_i = \{Y_{i1}, Y_{i2}, \ldots, Y_{in}\}$ is the vector of the $n$ observable background variables of examinee $i$; $\boldsymbol{\beta} = \{\beta_1, \beta_2, \ldots, \beta_n\}$ is the vector of regression parameters between the overall ability and the background variables; $\lambda^{(d)}$ is the latent regression parameter between the overall and domain abilities; and $a_j^{(d)}$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$. Using this formulation, the marginal distribution of each domain ability can be shown to be the standard normal distribution.
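The hierarchy above can be made concrete by simulating from it. The following is a minimal sketch, not the generating code used in the study. It assumes independent standard normal background variables, an equal number of items in every domain, and a unidimensional 3PL response function within each domain of the form $c_j + (1-c_j)\,\mathrm{logistic}[a_j(\theta_i^{(d)} - b_j)]$; the function name `simulate_mho_irt` and its arguments are hypothetical.

```python
import numpy as np

def simulate_mho_irt(n_examinees, beta, lam, items_per_domain, rng):
    """Simulate background variables, abilities, and 0/1 responses.

    beta : regression coefficients of overall ability on Y (sum of squares < 1)
    lam  : one latent regression coefficient per domain, each in (0, 1)
    """
    beta = np.asarray(beta, dtype=float)
    lam = np.asarray(lam, dtype=float)
    resid_var = 1.0 - np.sum(beta ** 2)
    if resid_var <= 0:
        raise ValueError("sum of squared beta must be below 1")

    # Level 4: background variables (assumed standard normal and independent)
    Y = rng.standard_normal((n_examinees, beta.size))
    # Level 3: overall ability, N(beta * Y, 1 - sum beta^2)
    theta = Y @ beta + rng.normal(0.0, np.sqrt(resid_var), n_examinees)
    # Level 2: domain abilities, N(lambda * theta, 1 - lambda^2)
    eps = rng.standard_normal((n_examinees, lam.size)) * np.sqrt(1.0 - lam ** 2)
    theta_d = theta[:, None] * lam + eps

    # Item parameters drawn from the priors (3.2.5)-(3.2.7)
    shape = (lam.size, items_per_domain)
    a = rng.lognormal(0.0, 1.0, shape)
    b = rng.normal(0.0, 1.0, shape)
    c = rng.beta(4.0, 16.0, shape)

    # Level 1: 3PL response probabilities, array of shape (examinee, domain, item)
    logit = a[None] * (theta_d[:, :, None] - b[None])
    prob = c + (1.0 - c) / (1.0 + np.exp(-logit))
    X = (rng.random(prob.shape) < prob).astype(int)
    return Y, theta, theta_d, X
```

With standardized background variables, the simulated overall and domain abilities both have unit marginal variance, which is a quick numerical check of the claim that the marginal distributions are standard normal.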

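3.2.2. Joint and Conditional Posterior Distributions

Let $\mathbf{X}$ be the matrix of item responses; $\theta$ be the overall ability parameters; $\mathbf{Y}$ be the matrix of background variables; $\boldsymbol{\theta}^{(d)} = \{\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(D)}\}$ represent the domain ability parameters; $IP$ represent the item parameters; and $\boldsymbol{\lambda} = \{\lambda^{(1)}, \lambda^{(2)}, \ldots, \lambda^{(D)}\}$ be the vector of latent regression parameters between the overall and domain abilities. The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$ and $\mathbf{Y}$, can be expressed as

$$P(\theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \boldsymbol{\lambda}, IP \mid \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \boldsymbol{\lambda}, IP, \mathbf{Y})\, P(\theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \boldsymbol{\lambda}, IP, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \theta, \boldsymbol{\lambda})\, P(\theta \mid \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta})\, P(\boldsymbol{\lambda})\, P(IP) \tag{3.2.8}$$

As this joint posterior distribution does not have a known form, draws cannot be obtained from it directly. Instead, draws can be taken from the full conditional distributions of $\boldsymbol{\beta}$, $\theta$, $\boldsymbol{\lambda}$, $\boldsymbol{\theta}^{(d)}$, and $IP$. The joint posterior distribution can be approximated by taking large numbers of draws from these full conditional distributions, which are derived as follows:

$$P(\boldsymbol{\beta} \mid \theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\lambda}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\theta \mid \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta}) \tag{3.2.9}$$

$$P(\theta \mid \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \boldsymbol{\lambda}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\boldsymbol{\theta}^{(d)} \mid \theta, \boldsymbol{\lambda})\, P(\theta \mid \boldsymbol{\beta}, \mathbf{Y}) \tag{3.2.10}$$

$$P(\boldsymbol{\lambda} \mid \theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\boldsymbol{\theta}^{(d)} \mid \theta, \boldsymbol{\lambda})\, P(\boldsymbol{\lambda}) \tag{3.2.11}$$

$$P(\boldsymbol{\theta}^{(d)} \mid \theta, \boldsymbol{\beta}, \boldsymbol{\lambda}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \theta, \boldsymbol{\lambda}) \tag{3.2.12}$$

$$P(IP \mid \theta, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \boldsymbol{\lambda}, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(IP) \tag{3.2.13}$$

3.2.3. Parameter Estimation

Parameters for all the conditions were estimated using MCMC; the following is an outline of the MCMC algorithm. In this study, six models are compared. For the MHO-IRT model, at iteration $t$:

1. Draw the candidate values $\boldsymbol{\beta}^{*}$ from $N(\boldsymbol{\beta}^{(t-1)}, \sigma^2_{\boldsymbol{\beta}^{(t-1)}})$, and accept $\boldsymbol{\beta}^{*}$ with probability

$$\min\left\{\frac{P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{*}, \mathbf{Y})\, P(\boldsymbol{\beta}^{*})}{P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{(t-1)}, \mathbf{Y})\, P(\boldsymbol{\beta}^{(t-1)})},\ 1\right\} \tag{3.2.14}$$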

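2. Since $\theta$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\theta^{*}$ from $N(\theta^{(t-1)}, \sigma^2_{\theta^{(t-1)}})$, and accept $\theta^{*}$ with probability

$$\min\left\{\frac{P(\boldsymbol{\theta}^{(d)(t-1)} \mid \theta^{*}, \boldsymbol{\lambda}^{(t-1)})\, P(\theta^{*} \mid \boldsymbol{\beta}^{(t)}, \mathbf{Y})}{P(\boldsymbol{\theta}^{(d)(t-1)} \mid \theta^{(t-1)}, \boldsymbol{\lambda}^{(t-1)})\, P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{(t)}, \mathbf{Y})},\ 1\right\} \tag{3.2.15}$$

3. For $\lambda^{(d)}$, draw the candidate values $\lambda^{(d)*}$ from $N(\lambda^{(d)(t-1)}, \sigma^2_{\lambda^{(d)(t-1)}})$, and accept $\lambda^{(d)*}$ with probability

$$\min\left\{\frac{P(\boldsymbol{\theta}^{(d)(t-1)} \mid \theta^{(t)}, \boldsymbol{\lambda}^{*})\, P(\boldsymbol{\lambda}^{*})}{P(\boldsymbol{\theta}^{(d)(t-1)} \mid \theta^{(t)}, \boldsymbol{\lambda}^{(t-1)})\, P(\boldsymbol{\lambda}^{(t-1)})},\ 1\right\} \tag{3.2.16}$$

4. For $\boldsymbol{\theta}^{(d)}$, draw the candidate values $\boldsymbol{\theta}^{(d)*}$ from $N(\theta_i^{(d)(t-1)}, \sigma^2_{\theta_i^{(d)(t-1)}})$, and accept $\boldsymbol{\theta}^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)*}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)*} \mid \theta^{(t)}, \boldsymbol{\lambda}^{(t)})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t-1)}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)(t-1)} \mid \theta^{(t)}, \boldsymbol{\lambda}^{(t)})},\ 1\right\} \tag{3.2.17}$$

5. Finally, the item parameters $IP = \{a^{(d)}, b, c\}$ are sampled in three separate parts. First, draw the candidate values $a_j^{(d)*}$ from $N(a_j^{(d)(t-1)}, 1)$, and accept $a_j^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.2.18}$$

Second, draw the candidate values $b_j^{*}$ from $N(b_j^{(t-1)}, 1)$, and accept $b_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.2.19}$$

Third, draw the candidate values $c_j^{*}$ from $N(c_j^{(t-1)}, 1)$, and accept $c_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.2.20}$$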
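Each step of the algorithm is a Metropolis-Hastings update with a symmetric normal proposal, so one update illustrates the pattern for all of them. The sketch below implements the domain-ability update of Equation (3.2.17) for a single examinee and a single domain, working on the log scale for numerical stability. It is an illustration only: the function names, the argument layout, the tuning parameter `step_sd`, and the 3PL form $c_j + (1-c_j)\,\mathrm{logistic}[a_j(\theta^{(d)} - b_j)]$ are assumptions of the example.

```python
import numpy as np

def log_lik_3pl(x, theta_d, a, b, c):
    """Log-likelihood of one examinee's 0/1 responses x on one domain (3PL)."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta_d - b)))
    return np.sum(x * np.log(p) + (1 - x) * np.log1p(-p))

def log_prior_domain(theta_d, theta, lam):
    """Log density of N(lam * theta, 1 - lam^2), up to an additive constant."""
    return -0.5 * (theta_d - lam * theta) ** 2 / (1.0 - lam ** 2)

def mh_update_domain_ability(theta_d_old, x, a, b, c, theta, lam, step_sd, rng):
    """One random-walk Metropolis-Hastings update; returns (value, accepted)."""
    candidate = rng.normal(theta_d_old, step_sd)
    log_ratio = (
        log_lik_3pl(x, candidate, a, b, c) + log_prior_domain(candidate, theta, lam)
        - log_lik_3pl(x, theta_d_old, a, b, c) - log_prior_domain(theta_d_old, theta, lam)
    )
    # Accept with probability min{ratio, 1}, evaluated on the log scale
    if np.log(rng.random()) < log_ratio:
        return candidate, True
    return theta_d_old, False
```

Because the proposal is symmetric, the proposal densities cancel and only the likelihood and prior terms of the full conditional remain in the ratio. The updates for $\boldsymbol{\beta}$, $\theta$, $\lambda^{(d)}$, and the item parameters follow the same template with the corresponding terms from Equations (3.2.9) to (3.2.13).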

3.3. Markov Chain Monte Carlo Methods for Other IRT Models

In the simulation experiments of this study, the performance of the proposed MHO-IRT model is compared to that of the multilevel U-IRT/M-IRT 3PL models and the U-IRT/M-IRT 3PL models, all estimated with MCMC. The models and parameter estimation procedures are described in the following sections.

3.3.1. MCMC method for multilevel unidimensional 3PL model

The multilevel unidimensional 3PL model is an extension of the 3PL model. The three-parameter logistic (3PL) IRT model is as follows:

$$P(X_{ij}=1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1-c_j)\,\frac{\exp[a_j(\theta_i - b_j)]}{1+\exp[a_j(\theta_i - b_j)]} \tag{3.3.1}$$

where $P(X_{ij}=1)$ is the probability of a correct response; $a_j$ is the discrimination parameter of item $j$; $b_j$ is the difficulty parameter of item $j$; $c_j$ represents the probability that an examinee at the low extreme of the ability scale endorses item $j$; and $\theta_i$ is the $i$th examinee's ability parameter for the administered test. A natural extension is to replace the mean $\mu$ with the regression model $\mathbf{Y}_i\boldsymbol{\beta}$, where $\mathbf{Y}_i$ is a vector of background variables with fixed and known values for student $i$, and $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients. The population model for student $i$ then becomes

$$\theta_i = \mathbf{Y}_i\boldsymbol{\beta} + E_i \tag{3.3.2}$$

The prior distributions of the ability, item, and latent regression parameters are given below. In a Bayesian framework, the model can be expressed as (Patz & Junker, 1999a):

$$\beta_n \sim U(0, 1) \tag{3.3.3}$$

$$\theta_i \mid \boldsymbol{\beta}, \mathbf{Y}_i \sim N\left(\boldsymbol{\beta}\mathbf{Y}_i,\ 1-\sum_{n}\beta_n^2\right) \tag{3.3.4}$$

$$a_j \sim \log N(0.6, 1.13) \tag{3.3.5}$$

$$b_j \sim N(0, 1) \tag{3.3.6}$$

$$c_j \sim Beta(4, 16) \tag{3.3.7}$$

where $\mathbf{Y}_i = \{Y_{i1}, Y_{i2}, \ldots, Y_{in}\}$ is the vector of the $n$ observable background variables of examinee $i$; $\boldsymbol{\beta} = \{\beta_1, \beta_2, \ldots, \beta_n\}$ is the vector of regression parameters between the overall ability and the background variables; and $a_j$, $b_j$, and $c_j$ are the discrimination, difficulty, and guessing parameters of item $j$. Using this formulation, the marginal distribution of the ability can be shown to be the standard normal distribution.

Let $\mathbf{X}$ be the matrix of item responses; $\theta$ be the overall ability parameters; $\mathbf{Y}$ be the matrix of background variables; and $IP$ represent the item parameters. The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$ and $\mathbf{Y}$, can be expressed as

$$P(\theta, \boldsymbol{\beta}, IP \mid \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \theta, \boldsymbol{\beta}, IP, \mathbf{Y})\, P(\theta, \boldsymbol{\beta}, IP, \mathbf{Y}) \propto P(\mathbf{X} \mid \theta, IP)\, P(\theta \mid \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta})\, P(IP) \tag{3.3.8}$$

As this joint posterior distribution does not have a known form, draws cannot be obtained from it directly. Instead, draws can be taken from the full conditional distributions of $\boldsymbol{\beta}$, $\theta$, and $IP$. The joint posterior distribution can be approximated by taking large numbers of draws from these full conditional distributions, which are derived as follows:

$$P(\boldsymbol{\beta} \mid \theta, IP, \mathbf{X}, \mathbf{Y}) \propto P(\theta \mid \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta}) \tag{3.3.9}$$

$$P(\theta \mid \boldsymbol{\beta}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \theta, IP)\, P(\theta \mid \boldsymbol{\beta}, \mathbf{Y}) \tag{3.3.10}$$

$$P(IP \mid \theta, \boldsymbol{\beta}, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \theta, IP)\, P(IP) \tag{3.3.11}$$

Parameters for all the conditions were estimated using MCMC; the following is an outline of the MCMC algorithm. For the MU-IRT model, at iteration $t$:

1. Draw the candidate values $\boldsymbol{\beta}^{*}$ from $N(\boldsymbol{\beta}^{(t-1)}, \sigma^2_{\boldsymbol{\beta}^{(t-1)}})$, and accept $\boldsymbol{\beta}^{*}$ with probability

$$\min\left\{\frac{P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{*}, \mathbf{Y})\, P(\boldsymbol{\beta}^{*})}{P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{(t-1)}, \mathbf{Y})\, P(\boldsymbol{\beta}^{(t-1)})},\ 1\right\} \tag{3.3.12}$$

2. Since $\theta$ has independent components, the sampling can be done one examinee at a time. For examinee $i$, sample $\theta^{*}$ from $N(\theta^{(t-1)}, \sigma^2_{\theta^{(t-1)}})$, and accept $\theta^{*}$ with probability

$$\min\left\{\frac{P(\theta^{*} \mid \boldsymbol{\beta}^{(t)}, \mathbf{Y})\, P(\mathbf{X} \mid \theta^{*}, IP^{(t-1)})}{P(\theta^{(t-1)} \mid \boldsymbol{\beta}^{(t)}, \mathbf{Y})\, P(\mathbf{X} \mid \theta^{(t-1)}, IP^{(t-1)})},\ 1\right\} \tag{3.3.13}$$

3. Finally, the item parameters $IP = \{a, b, c\}$ are sampled in three separate parts. First, draw the candidate values $a_j^{*}$ from $N(a_j^{(t-1)}, 1)$, and accept $a_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{*}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{*}, b_j^{(t-1)}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.14}$$

Second, draw the candidate values $b_j^{*}$ from $N(b_j^{(t-1)}, 1)$, and accept $b_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{(t)}, b_j^{*}, c_j^{(t-1)}\})\, P(IP\{a_j^{(t)}, b_j^{*}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.15}$$

Third, draw the candidate values $c_j^{*}$ from $N(c_j^{(t-1)}, 1)$, and accept $c_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{(t)}, b_j^{(t)}, c_j^{*}\})\, P(IP\{a_j^{(t)}, b_j^{(t)}, c_j^{*}\})}{P(\mathbf{X} \mid \theta^{(t)}, IP\{a_j^{(t)}, b_j^{(t)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(t)}, b_j^{(t)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.16}$$

3.3.2. MCMC method for multilevel multidimensional 3PL model

The multilevel multidimensional 3PL model is an extension of the M3PL model. The M3PL IRT model is as follows:

$$P(X_{ij}=1 \mid \boldsymbol{\theta}_i, \mathbf{a}_j, b_j, c_j) = c_j + (1-c_j)\,\frac{\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)}{1+\exp(\mathbf{a}_j'\boldsymbol{\theta}_i - b_j)} \tag{3.3.17}$$

A natural extension is to replace the mean $\mu$ with the regression model $\mathbf{Y}_i\boldsymbol{\beta}$, where $\mathbf{Y}_i$ is a vector of background variables with fixed and known values for student $i$, and $\boldsymbol{\beta}$ is the corresponding vector of regression coefficients. The population model for student $i$ then becomes

$$\boldsymbol{\theta}_i = \mathbf{Y}_i\boldsymbol{\beta} + \mathbf{E}_i \tag{3.3.18}$$

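The prior distributions of the ability, item, and latent regression parameters are given below. In a Bayesian framework, the model can be expressed as:

$$\beta_n \sim U(0, 1) \tag{3.3.19}$$

$$\boldsymbol{\theta}_i \mid \boldsymbol{\beta}, \mathbf{Y}_i \sim MVN(\boldsymbol{\beta}\mathbf{Y}_i, \boldsymbol{\Sigma}) \tag{3.3.20}$$

$$\boldsymbol{\Sigma} \sim \text{Inv-Wishart}_{v_0}(\boldsymbol{\Lambda}_0^{-1}) \tag{3.3.21}$$

$$a_j \sim \log N(0.6, 1.13) \tag{3.3.22}$$

$$b_j \sim N(0, 1) \tag{3.3.23}$$

$$c_j \sim Beta(4, 16) \tag{3.3.24}$$

where $\boldsymbol{\beta}\mathbf{Y}_i$ and $\boldsymbol{\Sigma}$ are the mean vector and the common (i.e., undifferentiated by examinee) covariance matrix of the multivariate normal distribution, respectively; $v_0$ is the degrees of freedom; and $\boldsymbol{\Lambda}_0^{-1}$ is the $D \times D$ symmetric positive-definite scale matrix of the inverse-Wishart distribution. The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$ and $\mathbf{Y}$, can be expressed as

$$P(\boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, IP \mid \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, IP, \mathbf{Y})\, P(\boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, IP, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta})\, P(\boldsymbol{\Sigma})\, P(IP) \tag{3.3.25}$$

As this joint posterior distribution does not have a known form, draws cannot be obtained from it directly. Instead, draws can be taken from the full conditional distributions of $\boldsymbol{\Sigma}$, $\boldsymbol{\beta}$, $\boldsymbol{\theta}^{(d)}$, and $IP$. The joint posterior distribution can be approximated by taking large numbers of draws from these full conditional distributions, which are derived as follows:

$$P(\boldsymbol{\Sigma} \mid \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\Sigma}) \tag{3.3.26}$$

$$P(\boldsymbol{\beta} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}, \mathbf{Y})\, P(\boldsymbol{\beta}) \tag{3.3.27}$$

$$P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}, IP, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}, \mathbf{Y}) \tag{3.3.28}$$

$$P(IP \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, \boldsymbol{\beta}, \mathbf{X}, \mathbf{Y}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(IP) \tag{3.3.29}$$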

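Parameters for all the conditions were estimated using MCMC; the following is an outline of the MCMC algorithm. For the MM-IRT model, at iteration $t$:

1. Draw the candidate values $\boldsymbol{\beta}^{*}$ from $N(\boldsymbol{\beta}^{(t-1)}, \sigma^2_{\boldsymbol{\beta}^{(t-1)}})$, and accept $\boldsymbol{\beta}^{*}$ with probability

$$\min\left\{\frac{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}^{*}, \mathbf{Y})\, P(\boldsymbol{\beta}^{*})}{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, \boldsymbol{\beta}^{(t-1)}, \mathbf{Y})\, P(\boldsymbol{\beta}^{(t-1)})},\ 1\right\} \tag{3.3.30}$$

2. For $\boldsymbol{\Sigma}$, draw the candidate values $\boldsymbol{\Sigma}^{*}$ from $N(\boldsymbol{\Sigma}^{(t-1)}, \sigma^2_{\boldsymbol{\Sigma}^{(t-1)}})$, and accept $\boldsymbol{\Sigma}^{*}$ with probability

$$\min\left\{\frac{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}^{*}, \boldsymbol{\beta}^{(t)}, \mathbf{Y})\, P(\boldsymbol{\Sigma}^{*})}{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}^{(t-1)}, \boldsymbol{\beta}^{(t)}, \mathbf{Y})\, P(\boldsymbol{\Sigma}^{(t-1)})},\ 1\right\} \tag{3.3.31}$$

3. For $\boldsymbol{\theta}^{(d)}$, draw the candidate values $\boldsymbol{\theta}^{(d)*}$ from $N(\theta_i^{(d)(t-1)}, \sigma^2_{\theta_i^{(d)(t-1)}})$, and accept $\boldsymbol{\theta}^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)*}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)*} \mid \boldsymbol{\Sigma}^{(t)}, \boldsymbol{\beta}^{(t)}, \mathbf{Y})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t-1)}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)(t-1)} \mid \boldsymbol{\Sigma}^{(t)}, \boldsymbol{\beta}^{(t)}, \mathbf{Y})},\ 1\right\} \tag{3.3.32}$$

4. Finally, the item parameters $IP = \{a^{(d)}, b, c\}$ are sampled in three separate parts. First, draw the candidate values $a_j^{(d)*}$ from $N(a_j^{(d)(t-1)}, 1)$, and accept $a_j^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.33}$$

Second, draw the candidate values $b_j^{*}$ from $N(b_j^{(t-1)}, 1)$, and accept $b_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.34}$$

Third, draw the candidate values $c_j^{*}$ from $N(c_j^{(t-1)}, 1)$, and accept $c_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.35}$$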

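3.3.3. MCMC method for multidimensional 3PL model

The M3PL IRT model is given in Equation (3.3.17). The MCMC method for the multidimensional 3PL model is similar to that for the multilevel multidimensional 3PL model; the only difference is that the background variables are not included. The prior distribution of the ability is given below. In a Bayesian framework, the model can be expressed as:

$$\boldsymbol{\theta}_i \sim MVN(\mathbf{0}, \boldsymbol{\Sigma}) \tag{3.3.36}$$

where $\mathbf{0}$ and $\boldsymbol{\Sigma}$ are the mean vector and the common (i.e., undifferentiated by examinee) covariance matrix of the multivariate normal distribution, respectively; $v_0$ is the degrees of freedom; and $\boldsymbol{\Lambda}_0^{-1}$ is the $D \times D$ symmetric positive-definite scale matrix of the inverse-Wishart distribution. The joint posterior distribution of the parameters, given the observed item responses $\mathbf{X}$, can be expressed as

$$P(\boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, IP \mid \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, IP) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma})\, P(\boldsymbol{\Sigma})\, P(IP) \tag{3.3.37}$$

As this joint posterior distribution does not have a known form, draws cannot be obtained from it directly. Instead, draws can be taken from the full conditional distributions of $\boldsymbol{\Sigma}$, $\boldsymbol{\theta}^{(d)}$, and $IP$. The joint posterior distribution can be approximated by taking large numbers of draws from these full conditional distributions, which are derived as follows:

$$P(\boldsymbol{\Sigma} \mid \boldsymbol{\theta}^{(d)}, IP, \mathbf{X}) \propto P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma})\, P(\boldsymbol{\Sigma}) \tag{3.3.38}$$

$$P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}, IP, \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}) \tag{3.3.39}$$

$$P(IP \mid \boldsymbol{\Sigma}, \boldsymbol{\theta}^{(d)}, \mathbf{X}) \propto P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)}, IP)\, P(IP) \tag{3.3.40}$$

Parameters for all the conditions were estimated using MCMC; the following is an outline of the MCMC algorithm. For the M-IRT model, at iteration $t$:

1. For $\boldsymbol{\Sigma}$, draw the candidate values $\boldsymbol{\Sigma}^{*}$ from $N(\boldsymbol{\Sigma}^{(t-1)}, \sigma^2_{\boldsymbol{\Sigma}^{(t-1)}})$, and accept $\boldsymbol{\Sigma}^{*}$ with probability

$$\min\left\{\frac{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}^{*})\, P(\boldsymbol{\Sigma}^{*})}{P(\boldsymbol{\theta}^{(d)} \mid \boldsymbol{\Sigma}^{(t-1)})\, P(\boldsymbol{\Sigma}^{(t-1)})},\ 1\right\} \tag{3.3.41}$$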

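2. For $\boldsymbol{\theta}^{(d)}$, draw the candidate values $\boldsymbol{\theta}^{(d)*}$ from $N(\theta_i^{(d)(t-1)}, \sigma^2_{\theta_i^{(d)(t-1)}})$, and accept $\boldsymbol{\theta}^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)*}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)*} \mid \boldsymbol{\Sigma}^{(t)})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t-1)}, IP^{(t-1)})\, P(\boldsymbol{\theta}^{(d)(t-1)} \mid \boldsymbol{\Sigma}^{(t)})},\ 1\right\} \tag{3.3.42}$$

3. Finally, the item parameters $IP = \{a^{(d)}, b, c\}$ are sampled in three separate parts. First, draw the candidate values $a_j^{(d)*}$ from $N(a_j^{(d)(t-1)}, 1)$, and accept $a_j^{(d)*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)*}, b_j^{(t-1)}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t-1)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.43}$$

Second, draw the candidate values $b_j^{*}$ from $N(b_j^{(t-1)}, 1)$, and accept $b_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{*}, c_j^{(t-1)}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t-1)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.44}$$

Third, draw the candidate values $c_j^{*}$ from $N(c_j^{(t-1)}, 1)$, and accept $c_j^{*}$ with probability

$$\min\left\{\frac{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{*}\})}{P(\mathbf{X} \mid \boldsymbol{\theta}^{(d)(t)}, IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})\, P(IP\{a_j^{(d)(t)}, b_j^{(t)}, c_j^{(t-1)}\})},\ 1\right\} \tag{3.3.45}$$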

CHAPTER 4. Experiments

To evaluate how well the multilevel higher-order item response model estimates parameters from data with or without background variables, a variety of combinations of generating model and fitting model were examined. The response data were generated from two different models to serve different purposes. In experiment one, the data were generated from the multilevel higher-order item response (MHO-IRT) model, in which the correlation between the background variables and the overall ability was 0, 0.35, or 0.7 and the background variables were continuous. This experiment examines how the strength of that correlation influences parameter estimation. In experiment two, the data were generated from the higher-order item response (HO-IRT) model and the background variables were dichotomous. This experiment shows how incorporating the background variables improves the estimation of group statistics. Finally, experiment three uses real data from the TASA 2007 fourth-grade mathematics assessment to check how well the proposed models fit the data.

4.1. Experiment one

4.1.1. Experiment Design

A simulation study was conducted to investigate the feasibility of the HO-IRT model incorporating student background variables in the estimation process and to show how the estimates obtained from this model are affected by different factors. The response data were generated from the multilevel higher-order item response (MHO-IRT) model, in which the correlation between the background variables and the overall ability was 0, 0.35, or 0.7 and the background variables were continuous. The combinations are shown in Table 4-1, where an asterisk marks the cases examined.

Table 4-1 The cases examined in the simulation studies

                    Fitted Model
Parameter           U-IRT   MU-IRT   M-IRT   MM-IRT   HO-IRT   MHO-IRT
Overall ability     *       *                         *        *
Domain ability                       *       *        *        *

The generated data were analyzed using the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel multidimensional item response model (MM-IRT), multidimensional item response model (M-IRT), multilevel unidimensional item response model (MU-IRT), and unidimensional item response model (U-IRT). To examine how the estimates are affected by different factors, five factors with varied conditions were considered in the simulation study (de la Torre, 2009): (1) the generating models, the MHO-IRT and HO-IRT models; (2) the data fitting model, the six models described above; (3) the sample size, 1,000 and 4,000; (4) the correlation between the background variables and the overall ability, 0, 0.35, and 0.7; (5) the test length in each domain, 5, 10, and 20 items. The item discrimination parameters were drawn from log N(0.6, 1.13). The item difficulties were drawn from N(0, 1), and the guessing parameters were drawn from Beta(4, 16).
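The item parameter distributions stated above can be sampled as in the following minimal sketch, which uses only the Python standard library. The function name `draw_item_parameters` is hypothetical, and the sketch assumes that 0.6 and 1.13 in log N(0.6, 1.13) are the mean and standard deviation on the log scale; if 1.13 is instead a variance, its square root should be passed as `log_sd`.

```python
import random


def draw_item_parameters(n_items, rng, log_mean=0.6, log_sd=1.13):
    """Draw (a, b, c) for each item from the distributions quoted in the text."""
    items = []
    for _ in range(n_items):
        a = rng.lognormvariate(log_mean, log_sd)  # discrimination, always > 0
        b = rng.gauss(0.0, 1.0)                   # difficulty ~ N(0, 1)
        c = rng.betavariate(4, 16)                # guessing ~ Beta(4, 16), mean 0.2
        items.append((a, b, c))
    return items


# A large draw is used here only to check the sampling distributions.
rng = random.Random(2014)
items = draw_item_parameters(2000, rng)
mean_b = sum(b for _, b, _ in items) / len(items)
mean_c = sum(c for _, _, c in items) / len(items)
```

For a simulated test, `n_items` would be set to the test length of the condition (5, 10, or 20 items per domain) rather than the 2,000 items drawn above.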

Table 4-2 The Setting of Manipulated variables

Manipulated variables                          Setting
Fitted models                                  MHO-IRT, HO-IRT, MM-IRT, M-IRT, MU-IRT, U-IRT
Number of domains                              2
Correlation between the background
variables and the overall ability              0, 0.35, and 0.7
Test lengths in each domain                    5, 10, and 20 items
Sample size                                    1,000 and 4,000

Each simulated data set contained 1,000 or 4,000 simulated students, and each simulation condition used 30 replications. Fully crossing the levels of these five factors yielded 108 conditions. The manipulated variables are shown in Table 4-2.

4.1.2. Data generation

The simulation study is divided into two experiments: the first investigates model parameter recovery with continuous background variables, and the second investigates model parameter recovery with dichotomous background variables. The data generation process of experiment one was as follows: (1) the overall ability parameters and the background variables were randomly generated from the multivariate normal distribution described in equation 4.1.1; (2) the domain ability parameters were generated by multiplying the overall ability by the corresponding factor loadings and then adding residuals drawn from independent distributions, according to equation (3.1.4); (3) the item difficulties were drawn from N(0, 1), the item discrimination parameters were drawn from log N(0.6, 1.13), and the guessing parameters were drawn from Beta(4, 16); (4) given the item and person parameters, the probabilities of the item responses were computed according to the MHO-IRT model; and (5) the cumulative probability was computed and compared to a number randomly generated from the uniform (0, 1) distribution. If the random number was less than or equal to the cumulative probability, the simulated item

response was recorded as endorsing that item. (6) The number of examinees was N = 1,000 or 4,000. The formula for generating the overall ability parameters and the background variables is given below (de la Torre, 2003):

$$\begin{pmatrix} \theta \\ \mathbf{Y} \end{pmatrix} \sim \mathrm{MVN}\left( \begin{pmatrix} 0 \\ \mathbf{0} \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\theta\theta} & \boldsymbol{\Sigma}_{\theta Y} \\ \boldsymbol{\Sigma}_{Y\theta} & \boldsymbol{\Sigma}_{YY} \end{pmatrix} \right). \tag{4.1.1}$$

$\boldsymbol{\Sigma}_{YY}$ is set to the identity matrix $\mathbf{I}$. Using the properties of conditional distributions of a MVN distribution (Johnson & Wichern, 1997; Mardia, Kent, & Bibby, 1979) together with $\boldsymbol{\Sigma}_{YY} = \mathbf{I}$, the conditional distribution of $\theta$ given $\mathbf{Y} = \mathbf{y}$ is

$$\theta \mid \mathbf{Y} \sim \mathrm{MVN}\left( \boldsymbol{\Sigma}_{\theta Y}\mathbf{Y},\ \boldsymbol{\Sigma}_{\theta\theta} - \boldsymbol{\Sigma}_{\theta Y}\boldsymbol{\Sigma}_{Y\theta} \right). \tag{4.1.2}$$

4.2. Experiment two

4.2.1. Experiment Design

In experiment two, the data were generated from the higher-order item response (HO-IRT) model, whose means and standard deviations are presented in Table 4-3, and the background variables were dichotomous. The generated data were analyzed using the multilevel higher-order item response model (MHO-IRT), higher-order item response model (HO-IRT), multilevel multidimensional item response model (MM-IRT), multidimensional item response model (M-IRT), multilevel unidimensional item response model (MU-IRT), and unidimensional item response model (U-IRT), respectively. The simulation study was conducted to investigate the feasibility of the HO-IRT model incorporating student background variables in the estimation process and to show how the estimates obtained from this model are affected by different factors. Consequently, five factors with varied conditions were considered in the simulation study (de la Torre, 2009): (1) the generating models, the MHO-IRT and HO-IRT models; (2) the data fitting model, the six models described above; (3) the sample size, 1,000 and 4,000; (4) the correlation between the background variables and the overall ability, 0, 0.35, and 0.7; (5) the test length in each domain, 5, 10, and 20 items. The item discrimination parameters were drawn from log N(0.6, 1.13). The item