Detecting Differential Item Functioning in a Framework of Cognitive Diagnostic Measurement


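(1) Doctoral dissertation, Department of Educational Psychology and Counseling, National Taiwan Normal University. Advisors: Dr. Po-Hsi Chen and Dr. Hsueh-Chih Chen. Detecting Differential Item Functioning in a Framework of Cognitive Diagnostic Measurement. Author: Su-Pin Hung. October 2012 (Republic of China year 101).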

(2) Department of Educational Psychology and Counseling, National Taiwan Normal University Doctoral Dissertation. Advisors: Dr. Po-Hsi Chen and Dr. Hsueh-Chih Chen. DETECTING DIFFERENTIAL ITEM FUNCTIONING IN A FRAMEWORK OF COGNITIVE DIAGNOSTIC MEASUREMENT. Su-Pin Hung. October, 2012.

(3) DEDICATION. To my lovely daughter, Bonnie, and my dearest husband, Hung-Yu.

(4) ACKNOWLEDGMENTS. I would like to express my deepest appreciation to my advisor, Dr. Po-Hsi Chen. Without his persistent encouragement and support, this dissertation would not have been possible. During my graduate study, he made endless efforts to guide me toward becoming a successful scholar: supporting my participation in many national and international conferences, spending his time reading and revising drafts of my papers, and giving me sincere advice about my future career. I would also like to thank my co-advisor, Dr. Hsueh-Chih Chen. He always gave me ample flexibility and freedom in doing my research. His humor and understanding made it possible for me to work in a happy and relaxed atmosphere. I would like to thank Dr. Shun-Wen Chang and Dr. Sieh-Hwa Lin. I received most of my training in IRT from their courses, which gave me a solid foundation in educational measurement. I would like to thank my committee members, Dr. Wen-Chung Wang, Dr. Bor-Chen Kuo and Chih-Chien Yang. Dr. Wen-Chung Wang is my scholarly role model; I will never forget his patience and useful suggestions when I was struggling with my dissertation topic. He always generously shared his thoughts and related papers and welcomed any research discussion. Dr. Bor-Chen Kuo and Chih-Chien Yang offered many substantial suggestions that helped perfect this dissertation. I would like to thank my colleagues in Lab 506 and my testing gang members Joe and Scott for their help and care in my study and personal life. Without them, I could not have been so happy during these years. Finally, I would like to thank my parents, my sisters, my father-in-law, my mother-in-law and my husband, who have always supported me and my decisions, allowing me to pursue my ideal life. I also want to thank my pet, Choppy, who always kept me company while I was writing this dissertation.

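(5) Detecting Differential Item Functioning in a Framework of Cognitive Diagnostic Measurement. Su-Pin Hung. Testing for differential item functioning (DIF) has come to be regarded as an important procedure in test development. As cognitive diagnostic assessment continues to receive attention in both applied and methodological research, DIF issues within the cognitive diagnostic measurement framework cannot be ignored. This study has three purposes. First, it proposes a model-based DIF detection method for handling compensatory and non-compensatory data within the cognitive diagnostic assessment framework. Second, it focuses on an issue that previous DIF research in the cognitive diagnostic measurement framework has overlooked, namely tests that are contaminated by biased items. Finally, it examines more systematically the factors that may affect the performance of DIF detection methods and builds these factors into the simulation design. A Markov chain Monte Carlo algorithm was used to estimate the parameters of each of the two proposed models, and parameter recovery was compared. The study also examined, under different testing conditions, the Type I error rates and statistical power of the model-based DIF detection method and of the nonparametric MH and LR DIF detection methods. In addition, a purification procedure was added to the MH and LR methods, and the study investigated whether item purification improves DIF detection performance. Finally, the fourth-grade mathematics assessment of the Trends in International Mathematics and Science Study (TIMSS) 2007 was used as an example to illustrate how the proposed DIF detection method can be applied in practice. The results showed that parameter recovery for the two proposed model-based DIF detection methods was very good. In the comparison of DIF detection methods, the model-based method outperformed MH and LR in both Type I error control and statistical power under the same testing conditions. Moreover, the simulation results showed that, with cognitive diagnostic measurement data, conducting DIF detection on contaminated items without first applying a purification procedure impairs detection and leads to erroneous conclusions.

(6) Adding the purification procedure helps improve the Type I error control and statistical power of the MH and LR methods under specific conditions. However, even with purification, these two methods do not resolve the Type I error inflation that arises when the mean ability distributions of the groups differ greatly. Finally, the study also found that, compared with MH and LR, the proposed model-based DIF detection method allows a finer-grained interpretation of DIF results and offers the prospect of identifying possible causes of DIF through model extensions. Keywords: cognitive diagnostic measurement, restricted higher-order reparameterized DINA model, restricted higher-order reparameterized DINO model, differential item functioning detection.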

(7) DETECTING DIFFERENTIAL ITEM FUNCTIONING IN A FRAMEWORK OF COGNITIVE DIAGNOSTIC MEASUREMENT. Su-Pin Hung. ABSTRACT. Detection of differential item functioning (DIF) has been recognized as an important procedure, especially in test development. As cognitive diagnostic measurements (CDMs) continue to receive attention in both applied and methodological studies, DIF-related issues in the framework of CDMs remain a concern. This study had three objectives: first, to propose a model-based DIF detection method for compensatory and non-compensatory cognitive diagnostic data; second, to address the contaminated matching criterion issue, which has been overlooked in past DIF studies within the CDM framework; and third, to investigate additional factors that may affect DIF detection methods and to introduce them into the simulation design. An MCMC algorithm employing Gibbs sampling was used to estimate the two proposed models, and a simulation study was conducted to examine model recovery, Type I error rates, and power under different testing conditions. For DIF detection, the model-based method was also compared with the MH and LR methods. Furthermore, the purification procedure was

(8) applied to the MH and LR methods, which were then compared with the model-based method to investigate the effectiveness of the DIF detection methods. Finally, the TIMSS 2007 fourth-grade mathematics assessment was used to illustrate the implementation of the new method. Parameter recovery for the proposed models was good. The simulation results for the comparison of DIF methods appeared to confirm that the model-based method outperformed the MH and LR methods in Type I error control and power under comparable testing conditions. Moreover, the results revealed that a biased matching criterion may also determine the effectiveness of DIF detection in a framework of cognitive diagnostic measurement. The purification procedure could improve the Type I error rates and power of MH and LR under specific circumstances. Finally, the model-based method had the strength of allowing a more elaborate interpretation of results than the other DIF methods. KEY WORDS: cognitive diagnostic measurement, restricted higher-order reparameterized DINA model, restricted higher-order reparameterized DINO model, differential item functioning.

(9) TABLE OF CONTENTS
ACKNOWLEDGMENTS ... ii
ABSTRACT ... v
TABLE OF CONTENTS ... vii
LIST OF TABLES ... ix
LIST OF FIGURES ... xii
CHAPTER 1 INTRODUCTION ... 1
1.1 Motivation ... 1
1.2 Significance and Contribution ... 5
CHAPTER 2 LITERATURE REVIEW ... 6
2.1 Characteristics of Cognitive Diagnostic Models ... 6
2.1.1 Non-Compensatory Cognitive Diagnostic Models ... 7
2.1.2 Compensatory Cognitive Diagnostic Model ... 11
2.1.3 Cognitive Diagnostic Model in a Higher-Order Structure ... 15
2.2 Differential Item Function in Cognitive Diagnostic Context ... 18
2.2.1 Non-Parametric DIF Methods ... 19
2.2.2 Parametric DIF Methods ... 23
2.2.3 Previous DIF Detection Methods in CDM Framework ... 30
2.3 Matching Variable Issues in CDM ... 32
2.5 Hypothesis and Research Questions ... 36
CHAPTER 3 METHOD ... 39
3.1 Study 1: The Parameter Recovery of RHO-RDINA and RHO-RDINO Model ... 39
3.1.1 The Simulation Design ... 39
3.1.2 Data Simulation Procedures ... 46
3.1.3 Recovery Analysis of the RHO-RDINA and RHO-RDINO Model ... 48
3.2 Study 2: Comparing the Effectiveness of Traditional DIF Methods with Purification Procedure and Model Based Method within CDM Framework ... 50
3.2.1 The Simulation Design ... 50
3.2.2 Data Analysis ... 54
3.3 Study 3: Real Data Example with DIF Detection in a Framework of CDM ... 56
3.3.1 Data Description ... 57
CHAPTER 4 RESULTS ... 62

(10) 4.1 Study 1: Parameter Recovery of the RHO-DINA and RHO-DINO Models ... 63
4.1.1 Recovery of Higher-Level Parameters ... 63
4.1.2 Recovery of Lower-Level Parameters ... 70
4.2 Study 2: Comparing the Effectiveness of Traditional DIF Methods with Purification Procedure and Model Based Method within CDM Framework ... 76
4.2.1 Type I Error study ... 76
4.2.2 Power Study ... 104
4.3 Study 3: Real Data Application ... 131
CHAPTER 5 DISCUSSION AND CONCLUSION ... 140
5.1 Summary of Simulation Study Results ... 141
5.2 Limitations and Future Studies ... 146
Reference ... 149

(11) LIST OF TABLES
Table 2.1 Contingency Table for Mantel–Haenszel DIF Statistic ... 19
Table 3.1 The Q-Matrix Structure for 20 items ... 42
Table 3.2 DIF Pattern Manipulation ... 46
Table 3.3 Attributes descriptions from the TIMSS 2007 framework for fourth grade mathematics ... 60
Table 3.4 TIMSS 2007 Fourth Grade Mathematics Q-matrix ... 61
Table 4.1 Bias and RMSEs of Attribute Difficulty, A, and Discrimination, γ over 25 Replications with RHO-RDINA Model ... 65
Table 4.2 Bias and RMSEs of Attribute Difficulty, A, and Discrimination, γ over 25 Replications with RHO-RDINO Model ... 66
Table 4.3 Percent of RHO-RDINA Correct Classification by Attribute and Vector ... 69
Table 4.4 Percent of RHO-RDINO Correct Classification by Attribute and Vector ... 69
Table 4.5 Estimates of Guessing and Slip Parameters with RHO-RDINA, over 25 Replications ... 71
Table 4.6 Estimates of Guessing and Slip Parameters with RHO-RDINO, over 25 Replications ... 72
Table 4.7 Estimates of DIF-g and DIF-s Parameters with RHO-RDINA, over 25 Replications ... 74
Table 4.8 Estimates of DIF-g and DIF-s Parameters with RHO-RDINO, over 25 Replications ... 75
Table 4.9 Type I error rates of DIF with Model Based Method Data Derived from RHO-RDINA Model ... 77
Table 4.10 Factorial ANOVA for Type I error rates in DIF-s and DIF-g with RHO-RDINA model ... 79
Table 4.11 Marginal Means and Ranges of Type I Errors of DIF-s and DIF-g with RHO-RDINA model ... 80
Table 4.12 Type I error rates of DIF with Model Based Method Data Derived from RHO-RDINO Model ... 82
Table 4.13 Factorial ANOVA for Type I error rate in DIF-s and DIF-g with RHO-RDINO model ... 83
Table 4.14 Marginal Means and Ranges of Type I Errors of DIF-s and DIF-g with RHO-RDINO model ... 85
Table 4.15 Type I error of DIF with MH methods Data Derived from RHO-RDINA Model ... 86
Table 4.16 Marginal Means and Ranges of Type I Errors for MH and MH-P with Data Derived from RHO-RDINA model ... 87

(12) Table 4.17 Factorial ANOVA for Type I error rate with MH Method Data Derived RHO-RDINA Model ... 89
Table 4.18 Type I error of DIF with MH Method Data Derived from RHO-RDINO Model ... 91
Table 4.19 Marginal Means and Ranges of Type I Errors for MH and MH-P with Data Derived from RHO-RDINO model ... 92
Table 4.20 Factorial ANOVA for Type I error Rate with MH method Data Derived RHO-RDINO Model ... 94
Table 4.21 Type I error of DIF with LR methods Data Derived from RHO-RDINA Model ... 95
Table 4.22 Marginal Means and Ranges of Type I Errors for LR and LR-P with Data Derived from RHO-RDINA model ... 96
Table 4.23 Factorial ANOVA for Type I error rate with LR method Data Derived RHO-RDINA Model ... 98
Table 4.24 Type I error of DIF with LR Methods Data Derived from RHO-RDINO Model ... 100
Table 4.25 Marginal Means and Ranges of Type I Errors for LR and LR-P with Data Derived from RHO-RDINO model ... 101
Table 4.26 Factorial ANOVA for Type I error Rate with LR Method Data Derived RHO-RDINO Model ... 102
Table 4.27 Factorial ANOVA for Power Rate in DIF-s and DIF-g with RHO-RDINA Model ... 105
Table 4.28 Factorial ANOVA for Power Rate in DIF-s and DIF-g with RHO-RDINO Model ... 107
Table 4.29 Power of DIF with Model Based Method Data Derived from RHO-RDINA Model ... 109
Table 4.30 Power of DIF with Model Based Method Data Derived from RHO-RDINO Model ... 110
Table 4.31 Marginal Means and Ranges of Power of DIF-s and DIF-g with RHO-RDINA Model ... 111
Table 4.32 Marginal Means and Ranges of Power of DIF-s and DIF-g with RHO-RDINO Model ... 111
Table 4.33 Power of DIF with MH methods Data Derived from RHO-RDINA Model ... 115
Table 4.34 Marginal Means and Ranges of Power Rates for MH and MH-P with Data Derived from RHO-RDINA Model ... 116
Table 4.35 Factorial ANOVA for Power Rate with MH Method Data Derived RHO-RDINA Model ... 118

(13) Table 4.36 Power of DIF with MH method Data Derived from RHO-RDINO Model ... 119
Table 4.37 Marginal Means and Ranges of Power Rates for MH and MH-P with Data Derived from RHO-RDINO Model ... 120
Table 4.38 Factorial ANOVA for Power rate with MH method Data Derived RHO-RDINO Model ... 121
Table 4.39 Power of DIF with LR method Data Derived from RHO-RDINA Model ... 123
Table 4.40 Marginal Means and Ranges of Power Rates for LR and LR-P with Data Derived from RHO-RDINA Model ... 124
Table 4.41 Factorial ANOVA for Power rate with LR Method Data Derived RHO-RDINA Model ... 126
Table 4.42 Power of DIF with LR methods Data Derived from RHO-RDINO Model ... 127
Table 4.43 Marginal Means and Ranges of Power Rates for LR and LR-P with Data Derived from RHO-RDINO Model ... 128
Table 4.44 ANOVA for Power rate with LR Method Data Derived RHO-RDINO Model ... 129
Table 4.45 Information Criteria for Model Comparison Between RHO-RDINA model and RHO-RDINO Model ... 132
Table 4.46 TIMSS Item Parameters ... 133
Table 4.47 DIF Detection based on the Three Methods ... 135
Table 4.48 Selected DIF Items ... 139

(14) LIST OF FIGURES
Figure 3.1 Simulation Design of study 1 ... 40
Figure 3.2 Simulation Design of Study 2 ... 51

(15) CHAPTER 1 INTRODUCTION

1.1 Motivation

Assessments are developed to inform instruction and learning and to serve as an aid in clinical diagnosis. Under classical test theory, practitioners adopt either norm referencing or criterion referencing to interpret the information that assessments provide. In many well-developed educational and psychological assessments, the observed sum score is taken to represent an examinee’s ability in the assessed domain, and the diagnosis is determined by a cut score set on the basis of theory and practical experience. With the development of statistical methods, researchers have begun to explore diagnosis in terms of latent traits (e.g., the factor analysis approach, item response models). Although unidimensional item response models (IRMs) are useful for scaling and ordering examinees on a latent proficiency continuum, they do not allow evaluation of students’ specific strengths and weaknesses that can be used to facilitate learning and instruction (Torre, Hong & Deng, 2010). Moreover, examinees who obtain the same sum score may not have the same skill or attribute mastery patterns. In contrast to conventional item response models, cognitive diagnostic models (CDMs) place more emphasis on diagnostic information, using a function in which the underlying latent variables are developed specifically to identify the presence or absence of multiple finer-grained skills in a particular domain. Attribute mastery diagnosis has recently been emphasized in many countries. In the US, the No Child Left Behind policy has contributed to the development of standard setting and of CDMs. Diagnostic results also benefit both students and teachers. For teachers, diagnostic results can help arrange appropriate

(16) curricula; for students, the individual attribute mastery profile alerts them when they have not mastered all required attributes. In addition, many international academic surveys (e.g., PISA, TIMSS) have been developed to assess the basic competency of students and to compare subgroups. The results of such large-scale tests carry information that guides practitioners in remedying their teaching and curriculum design. It is therefore worth noting the problem of test equality between different subgroups, or between majority and minority groups. Checking for differential item functioning (DIF) has been treated as a necessary procedure when a test is developed. Different DIF detection methods have been proposed and applied in practical situations. A number of simulation studies have been designed to compare the effectiveness of DIF methods (e.g., nonparametric and parametric approaches) and to investigate factors that may affect DIF detection in specific item response models (Finch & French, 2007; French & Maller, 2007; Fidalgo, Mellenbergh & Muniz, 2000; Holland & Thayer, 1988; Lord, 1980; Li & Stout, 1996; Mantel & Haenszel, 1959; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Swaminathan & Rogers, 1990; Shih & Wang, 2009; Thissen, Steinberg & Wainer, 1993; Narayanan & Swaminathan, 1996; Wang & Su, 2004a; Wang & Su, 2004b). However, only a few studies have investigated DIF issues within the CDM context. For example, Gierl, Zheng, and Cui (2008) used SIBTEST to detect ADF (attribute differential function) with a specific CDM, the attribute hierarchy model. In addition, Zhang’s (2006) dissertation compared the effectiveness of two widely applied nonparametric DIF detection procedures, MH and SIBTEST, under the DINA (deterministic input, noisy-and-gate) model. In Zhang’s study, several possible DIF patterns were simulated and two matching variables (total raw score and profile score) were compared, in which

(17) the profile score outperformed the total raw score. Recently, another dissertation addressed the parametric DIF method within the CDM framework. Li (2008) modified the higher-order DINA model and proposed a model-based DIF and DAF detection procedure in the framework of the higher-order DINA model, which can detect DIF and DAF simultaneously. However, these past studies concluded that using the test total score as the matching variable was the only reason for the poorer DIF detection results, without suspecting that a contaminated matching criterion might lead to the same result. Hence, it remains an open question whether, with a purification procedure, the test total score could work equally well as a matching variable within the framework of CDMs. Although the above-mentioned studies contribute to this area, some common issues remain unsolved when DIF detection procedures are applied in CDMs. First, although the previous studies aimed to introduce popular DIF detection methods into CDMs, how several factors (e.g., test length, DIF percentage, DIF magnitude, sample size) affect the results of DIF detection remains unclear in these studies. Since DIF studies in IRT have found that DIF magnitude (e.g., French & Maller, 2007; Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993), test length (e.g., French & Maller, 2007; Finch & French, 2007; Shih & Wang, 2009; Narayanan & Swaminathan, 1996), DIF patterns (e.g., Su & Wang, 2004a; Shih & Wang, 2009) and percentage of DIF items (e.g., Fidalgo, Mellenbergh and Muniz, 2000; Finch & French, 2007; French & Maller, 2007; Shih & Wang, 2009) are important factors that influence the Type I error rate and power of DIF detection, these factors should be considered when conducting DIF analysis in CDMs. Second, previous studies have focused only on non-compensatory models (e.g., DINA model, Zhang, 2006; HO-DINA model, Li, 2008; AHM, Gierl, Zheng & Cui, 2008). It is worth noting that the DINA model is a non-compensatory model, meaning

(18) that examinees need to master all attributes required by an item; otherwise, they have a lower probability of answering that item correctly. However, this strong assumption may not be appropriate in real situations. One limitation of the DINA model is that it does not further differentiate among respondents who have not mastered at least one attribute. Although a number of studies have applied the DINA model to real data (e.g., de la Torre, 2009; de la Torre & Douglas, 2004, 2008; Henson, Templin, & Willse, 2009; Templin, Henson, & Douglas, 2006), only a few have applied the DINO model. Since Lee, Park and Taylan (2011) suggested that allowing for alternative or multiple strategies for solving an item may better explain students’ responses, it seems worth investigating DIF detection when compensatory CDMs are applied. Third, the previous studies did not deal with the issue of contaminated matching criteria. Because an internal matching criterion may be contaminated, it may not be appropriate for directly detecting DIF. If an invalid matching criterion is used, the results of DIF detection will be suspect. Many studies have focused on this issue and proposed strategies to solve it in the framework of item response theory (e.g., Candell & Drasgow, 1988; Fidalgo, Mellenbergh and Muniz, 2000; French & Maller, 2007; Holland & Thayer, 1988; Shih & Wang, 2009; Wang & Yeh, 2003; Wang & Su, 2004a; Wang & Su, 2004b). A new idea, named DIF-free-then-DIF, has also been proposed by Wang (2008). The central idea is that it is very important to find a set of clean (i.e., DIF-free) items to serve as the matching criterion. The same situation may well occur in the CDM context. However, all of the CDM-based DIF studies reviewed above overlooked the purification issue. In practice, one cannot know in advance which items are DIF-free. Thus, item purification procedures cannot be neglected in DIF detection. Nevertheless, to date no studies have addressed this issue

(19) in the framework of CDMs.

1.2 Significance and Contribution

As mentioned above, several important factors (e.g., DIF magnitude, test length, DIF amounts and DIF patterns) have been overlooked in past DIF detection studies within the framework of CDMs. The present study aims to investigate DIF-related issues from a broader perspective. Thus, more factors that may affect the effectiveness of DIF detection are considered in this dissertation. In addition, the parametric DIF approach is considered more efficient in application; however, previous studies have proposed only non-compensatory models to detect DIF. The present dissertation therefore proposes two modified reparameterized models, one compensatory and one non-compensatory, to detect DIF directly. Furthermore, since a contaminated matching criterion will produce invalid results in DIF analysis, this dissertation introduces the purification procedure into the framework of CDMs. In addition to the model-based DIF detection method proposed in the study, two widely used DIF detection methods, the Mantel-Haenszel (MH) and logistic regression (LR) methods, are also used. Multiple detection methods are used so that agreement and discrepancy among the outcomes can be compared under various test conditions. Using datasets generated to reflect various DIF conditions, the Type I error rate and power of the detection methods are investigated. Finally, in order to compare and evaluate the performance of purification procedures in building a common metric for DIF analysis, a dataset from the TIMSS 2007 fourth-grade mathematics assessment is used to demonstrate the detection of gender differential item functioning.
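To make the role of purification concrete, the following is a minimal Python sketch, not the dissertation's implementation. It computes the Mantel-Haenszel chi-square with the usual continuity correction for one dichotomous item, then repeats the analysis with a matching score built only from currently unflagged items (plus the studied item) until the flagged set stops changing. The function names, the 0/1 group coding, the critical value 3.841 (chi-square, df = 1, alpha = .05) and the iteration cap are assumptions made for illustration.

```python
import numpy as np

CRITICAL_CHI2 = 3.841  # assumed cutoff: chi-square, df = 1, alpha = .05


def mh_chi_square(item, group, matching):
    """MH chi-square (continuity corrected) for one 0/1 item.

    group: 0 = reference, 1 = focal; matching: score used to form strata.
    """
    item, group, matching = map(np.asarray, (item, group, matching))
    sum_a = sum_expected = sum_var = 0.0
    for level in np.unique(matching):
        in_stratum = matching == level
        total = in_stratum.sum()
        if total < 2:
            continue  # a stratum with one examinee carries no information
        n_ref = np.sum(in_stratum & (group == 0))
        n_foc = total - n_ref
        n_correct = item[in_stratum].sum()
        n_wrong = total - n_correct
        # Observed and expected correct responses in the reference group.
        sum_a += np.sum(item[in_stratum & (group == 0)])
        sum_expected += n_ref * n_correct / total
        sum_var += n_ref * n_foc * n_correct * n_wrong / (total**2 * (total - 1))
    if sum_var == 0:
        return 0.0
    return (abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var


def mh_with_purification(responses, group, max_iter=10):
    """Flag DIF items, re-matching on unflagged items until the set is stable."""
    responses = np.asarray(responses)
    n_items = responses.shape[1]
    flagged = set()  # first pass therefore matches on the full total score
    for _ in range(max_iter):
        new_flagged = set()
        for i in range(n_items):
            # Matching score: unflagged items plus the studied item itself.
            keep = [j for j in range(n_items) if j not in flagged or j == i]
            matching = responses[:, keep].sum(axis=1)
            if mh_chi_square(responses[:, i], group, matching) > CRITICAL_CHI2:
                new_flagged.add(i)
        if new_flagged == flagged:
            break
        flagged = new_flagged
    return flagged


# Hand-checkable case: one stratum, 8/10 reference vs. 2/10 focal correct.
item = np.array([1] * 8 + [0] * 2 + [1] * 2 + [0] * 8)
group = np.array([0] * 10 + [1] * 10)
matching = np.zeros(20, dtype=int)
print(round(mh_chi_square(item, group, matching), 2))  # 4.75, above 3.841
```

In a simulation of the kind described above, the Type I error rate would be the proportion of DIF-free items that such a procedure flags across replications, and power the proportion of true DIF items it flags.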

(20) CHAPTER 2 LITERATURE REVIEW

This chapter first discusses the common features of CDMs along with some widely applied CDMs. Second, the concept of differential item functioning (DIF) and related issues are reviewed and then applied to the cognitive diagnostic context. Finally, the problematic aspects of present DIF detection procedures in cognitive diagnostic measurement are presented.

2.1 Characteristics of Cognitive Diagnostic Models

The purpose of cognitive diagnostic models is to classify examinees into latent categories based on an array of binary attributes, that is, a vector of latent variables indicating mastery of a finite set of skills under diagnosis. A number of cognitive diagnostic models have been developed to meet different diagnostic demands, and they can be estimated with different estimation methods (e.g., expectation maximization, EM, or Markov chain Monte Carlo, MCMC) and software (see the more detailed classification in Rupp, Templin & Henson, 2010). These CDMs share some common features. Further, some features of CDMs can also be found in item response models (IRMs) and multidimensional factor analysis models, such as their multidimensional nature, confirmatory nature, complexity of loading structure, and ability to handle dichotomous and polytomous response data (Rupp & Templin, 2008; as reviewed in Rupp, Templin & Henson, 2010). However, CDMs differ from IRMs and multidimensional factor analysis models, which assume an underlying continuous latent trait and aim to locate examinees in the latent space. CDMs, on the other hand, aim to provide more information about exactly what each examinee has or has not mastered on a set of cognitive attributes. Thus, the categorical nature of the latent predictor variables, the criterion-referenced interpretations, and the diagnostic nature of those interpretations are unique to CDMs.
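As a small illustration of what classifying examinees into latent categories means, the Python sketch below enumerates all 2^K binary attribute profiles for K = 3 and evaluates the DINA item response function described in Section 2.1.1 (Equations 2.1 and 2.4) for one hypothetical item. The function names, the Q-matrix row, and the slip and guessing values are assumptions chosen for illustration only.

```python
from itertools import product

import numpy as np


def attribute_profiles(n_attributes):
    """All 2^K binary attribute mastery profiles, one profile per row."""
    return np.array(list(product([0, 1], repeat=n_attributes)))


def dina_eta(alpha, q):
    """Latent response (Eq. 2.1): 1 only if every required attribute is mastered."""
    # 0**0 evaluates to 1, so attributes the item does not require are ignored.
    return np.prod(alpha ** q, axis=-1)


def dina_prob(alpha, q, slip, guess):
    """P(X = 1 | eta) = (1 - slip)^eta * guess^(1 - eta)  (Eq. 2.4)."""
    eta = dina_eta(alpha, q)
    return (1 - slip) ** eta * guess ** (1 - eta)


profiles = attribute_profiles(3)
q_row = np.array([1, 1, 0])  # hypothetical item requiring attributes 1 and 2
probs = dina_prob(profiles, q_row, slip=0.1, guess=0.2)
for profile, p in zip(profiles, probs):
    print(profile, round(float(p), 2))
```

With these illustrative values, the two profiles that master both required attributes receive a success probability of 0.9 (one minus the slip value), and the remaining six receive 0.2 (the guessing value).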

(21) There are three essential components in any CDM. The Q-matrix specifies which attributes are measured by each item. A Q-matrix traditionally contains items (indexed by $i$) in the rows and attributes (indexed by $k$) in the columns. Its entries are 1s and 0s indicating whether or not an attribute is measured by an item (e.g., $q_{ik} = 1$). The Q-matrix is the quintessential component in any CDM because it represents the operationalization of the substantive theory that has given rise to the design of the diagnostic assessment (Rupp, Templin & Henson, 2010). The Q-matrix is specified by a set of domain experts. The alpha-matrix defines the attribute mastery profiles of examinees. An alpha-matrix traditionally contains examinees in the rows and attributes in the columns. The latent response $\eta_{ij}$, which is an indicator, depends on the Q-matrix and the alpha-matrix. What is called an “attribute” here is also labeled elsewhere as a latent characteristic, a latent trait, an element of process, or a skill. Based on the interaction between the attributes required by an item and how examinees use those attributes on the item, CDMs fall into two categories: non-compensatory and compensatory.

2.1.1 Non-Compensatory Cognitive Diagnostic Models

Non-compensatory cognitive diagnostic models reflect the assumption that a deficit in one latent variable cannot be compensated for by a surplus in a different latent variable. Many cognitive diagnostic models are formulated with a conjunctive rule. Among these are the DINA (deterministic input, noisy-and-gate) model of Haertel (1989), the NIDA (noisy inputs, deterministic-and-gate) model of Junker and Sijtsma (2001), the reparameterized unified model (RUM) of Hartz (2002), and the conjunctive MCLCM of Maris (1999). Because the DINA and NIDA models are simpler than the other non-compensatory conjunctive models, the others can be seen as extensions of these two. For this reason, these two non-compensatory models

(22) are discussed in more detail below.

DINA model. The deterministic-input, noisy-and-gate (DINA) model (see Haertel, 1989) is a discrete latent variable CDM with a conjunctive condensation rule; that is, from a deterministic perspective, a respondent has to master all required attributes to obtain a score on a particular item. Given the jth examinee's attribute vector $\alpha_j$ and the ith row of the Q-matrix, the conjunctive kernel that creates the variable $\eta_{ij}$ is called an and-gate because it functions like an output summary that represents a deterministic prediction of task performance from each examinee's knowledge and performance state. The element of the Q-matrix, $q_{ik}$, takes the value 1 or 0 and indicates whether or not attribute k is required by item i. The mastery status $\alpha_{jk} = 1$ or 0 indicates whether or not examinee j has mastered attribute k. This can be expressed as below, where the deterministic latent response $\eta_{ij}$ denotes whether examinee j has mastered all attributes required by item i:

$$\eta_{ij} = \prod_{k=1}^{K} \alpha_{jk}^{q_{ik}} \tag{2.1}$$

Thus, the model separates respondents into two groups for each item: examinees who possess all attributes needed for a successful response to the item, and examinees who lack at least one of the attributes measured by the item.

The model is considered stochastic, since the observed response $X_{ij}$ is not completely consistent with the latent response $\eta_{ij}$. The DINA model allows for the possibility that respondents who have mastered all measured attributes (i.e., $\eta_{ij} = 1$) “slip” and answer an item incorrectly, as well as the possibility that respondents who have not mastered at least one of the measured attributes (i.e., $\eta_{ij} = 0$) make a “lucky guess” and answer the item correctly. The probabilistic relation is

(23) governed by two “noisy” parameters unique to each item: $s_i$, a slip parameter, and $g_i$, a guessing parameter. Specifically, if $\eta_{ij} = 1$, respondents should respond correctly to the item unless they “slip”, so $s_i$ is the probability that an examinee classified in the master group answers the item incorrectly. However, if $\eta_{ij} = 0$, respondents should answer the item incorrectly, so $g_i$ can be interpreted as the probability that an examinee classified in the non-master group answers the item correctly due to “guessing”.

$$s_i = P(X_{ij} = 0 \mid \eta_{ij} = 1) \tag{2.2}$$

$$g_i = P(X_{ij} = 1 \mid \eta_{ij} = 0) \tag{2.3}$$

Given $s_i$ and $g_i$, the item response function can be written as below:

$$P(X_{ij} = 1 \mid \eta_{ij}) = (1 - s_i)^{\eta_{ij}} \, g_i^{1 - \eta_{ij}} \tag{2.4}$$

That is, the probability of a correct response to item i takes only two values: $g_i$ for any examinee j who lacks one or more of the attributes measured by item i (i.e., $\eta_{ij} = 0$), and $1 - s_i$ for any examinee j who has mastered all attributes measured by item i (i.e., $\eta_{ij} = 1$). The DINA model provides one slipping parameter and one guessing parameter per item, with equality constraints across attributes. Consequently, the number of parameters is not influenced by the number of attributes measured on the test. Furthermore, because of its conjunctive nature, the DINA model cannot differentiate among respondents who lack one attribute and those who lack more than one of the measured attributes. In addition, mastering more attributes than are required for item i does not raise the probability of a correct response; the model is therefore called a non-compensatory model.

NIDA model. The NIDA model, namely the noisy-inputs, deterministic-and-gate model (e.g., Junker & Sijtsma, 2001), is another non-compensatory cognitive diagnostic model with a conjunctive condensation rule. Like the DINA model, aberrant responses

(24) are also modeled in the NIDA model. However, unlike the DINA model, the guessing and slipping parameters are estimated at the attribute level, with equality constraints across items. Thus, in contrast to the DINA model, the number of estimated parameters increases with the number of attributes but is not influenced by the number of items on the test. The slipping parameter ($s_k$) and guessing parameter ($g_k$) are defined at the attribute level with subscript k; the attribute mastery indicator $\alpha_{jk}$ is defined at the level of attribute k for any examinee j; and the latent response variable $\eta_{ijk}$ indicates whether or not examinee j's performance on item i is consistent with possessing attribute k. This, in turn, leads to the definition of the slipping and guessing parameters as follows:

$$s_k = P(\eta_{ijk} = 0 \mid \alpha_{jk} = 1) \tag{2.4}$$

$$g_k = P(\eta_{ijk} = 1 \mid \alpha_{jk} = 0) \tag{2.5}$$

That is, slipping on attribute k amounts to the incorrect application of the attribute even though the attribute has been mastered. Similarly, “guessing” means the correct application of the attribute even though the attribute has not been mastered. The formula for a correct response in the NIDA model is written as:

$$P(X_{ij} = 1 \mid \alpha_j, s, g) = \prod_{k=1}^{K} \left[ (1 - s_k)^{\alpha_{jk}} \, g_k^{1 - \alpha_{jk}} \right]^{q_{ik}} \tag{2.6}$$

When $q_{ik} = 0$, the product term for that attribute becomes 1, which means the attribute is irrelevant to the item. Whenever the attribute is measured, $q_{ik} = 1$ and the product term becomes relevant. In that situation, there are two possible results. If the attribute has been mastered, the probability of a correct latent response for that attribute is $1 - s_k$, and if the attribute has not been mastered, this probability is $g_k$. These

(25) attribute-wise contributions are then multiplied over all attributes, resulting in the total probability of a correct response for each item. There are two differences between the DINA model and the NIDA model. First, in the DINA model the slipping and guessing parameters characterize the diagnostic value of an item, whereas in the NIDA model they characterize the diagnostic value of an attribute. Second, the DINA model cannot differentiate among respondents who have not mastered at least one attribute, whereas the NIDA model overcomes this limitation.

2.1.2 Compensatory Cognitive Diagnostic Model

Compensatory CDMs have become popular more recently than conjunctive models and are widely applied in medical and psychological diagnosis (e.g., Templin & Henson, 2006). Compensatory CDMs include the disjunctive MCLCM and compensatory MCLCM of Maris (1999), the DINO (deterministic input, noisy or gate) model of Templin and Henson (2006), and the NIDO (noisy input, deterministic or gate) model of Templin, Henson, and Douglas (2006). The disjunctive MCLCM can be viewed as an extension of the NIDO model, but the model is not identifiable. Thus, the DINO and NIDO models are described here as counterparts of the DINA and NIDA models.

DINO model. The deterministic input, noisy-or-gate (DINO) model (e.g., Templin & Henson, 2006) is the compensatory analog of the DINA model. As with the DINA model, slipping and guessing parameters are modeled at the item level. In contrast to the DINA model, the DINO model uses a disjunctive condensation rule. The first component is the latent response variable $\omega_{ij}$, which

(26) denotes whether or not examinee j has mastered at least one of the attributes measured by item i. This is called an or-gate. If an attribute is not measured by item i, then $q_{ik} = 0$, which implies that $(1 - \alpha_{jk})^{q_{ik}} = 1$. When an attribute is measured by item i, the factor depends on whether $\alpha_{jk} = 1$ or 0, that is, on whether person j does or does not possess attribute k. The latent response variable $\omega_{ij} = 1$ occurs only when the product term equals 0, which requires person j to possess at least one attribute measured by the item. Therefore, any one attribute can completely compensate for the lack of all others in raising the probability of a correct response.

$$\omega_{ij} = 1 - \prod_{k=1}^{K} (1 - \alpha_{jk})^{q_{ik}} \tag{2.7}$$

Apart from the latent response variable $\omega_{ij}$, the DINO model has two further components, the slipping parameter and the guessing parameter. As with the DINA model, these parameters are the stochastic elements that introduce noise into the or-gate, as shown in formula 2.8 and formula 2.9. In the DINO model, the slipping and guessing parameters are estimated for every item, with equality restrictions across attributes.

$$g_i = P(X_{ij} = 1 \mid \omega_{ij} = 0) \tag{2.8}$$

$$s_i = P(X_{ij} = 0 \mid \omega_{ij} = 1) \tag{2.9}$$

The DINO model for a correct response to an item can be formulated as follows:

$$P(X_{ij} = 1 \mid \omega_{ij}) = (1 - s_i)^{\omega_{ij}} \, g_i^{1 - \omega_{ij}} \tag{2.10}$$

where P is the probability of a correct response for item i, $X_{ij}$ is the observed

(27) response for person j on item i, $\omega_{ij}$ is the latent response variable for person j on item i, $1 - s_i$ is the probability of not slipping on item i, and $g_i$ is the probability of guessing on item i. Note that the response function looks like the one used in the DINA model on the surface; however, the interpretation of the two parameters differs from that in the DINA model. In the DINO model, the guessing parameter ($g_i$) represents the probability of obtaining a score when all measured attributes are absent, while the slipping parameter ($s_i$) represents the probability of not obtaining a score on an item when at least one measured attribute is present.

NIDO model. The noisy input, deterministic-or-gate (NIDO) model (e.g., Templin, 2006; cited in Rupp, Templin & Henson, 2010) is the compensatory analog of the NIDA model. In the NIDO model, the response behavior is modeled at the attribute level with equality constraints across items. As described above, the restriction of the DINO model is that it cannot distinguish between respondents who have mastered only one attribute and those who have mastered more of the attributes measured by an item. Just as the NIDA model provides a finer distinction than the DINA model, the NIDO model provides a finer distinction than the DINO model (Rupp, Templin & Henson, 2010). To build the NIDO model, two components are required. The first is the intercept parameter, $\lambda_{\cdot,0,(k)}$, and the second is the slope parameter, $\lambda_{\cdot,1,(k)}$, which multiplies $\alpha_{jk}$; both parameters are estimated at the attribute level. The NIDO model can be formulated as follows.

(28)

$$P(X_{ij} = 1 \mid \alpha_j) = \frac{\exp\left[\sum_{k=1}^{K} \left(\lambda_{\cdot,0,(k)} + \lambda_{\cdot,1,(k)} \alpha_{jk}\right) q_{ik}\right]}{1 + \exp\left[\sum_{k=1}^{K} \left(\lambda_{\cdot,0,(k)} + \lambda_{\cdot,1,(k)} \alpha_{jk}\right) q_{ik}\right]} \tag{2.11}$$

The structure of formula 2.11 comes from logistic regression analysis, where P is the probability, $X_{ij}$ is the observed response for person j on item i, $q_{ik}$ is the indicator from the Q-matrix showing whether attribute k is measured by item i, $\alpha_{jk}$ is the attribute mastery indicator for person j on attribute k, $\lambda_{\cdot,0,(k)}$ is the intercept parameter for attribute k, and $\lambda_{\cdot,1,(k)}$ is the slope parameter for attribute k. The first subscript, a dot, stands for the item to which the parameter corresponds, and the second subscript represents the level of the parameter. In the NIDO model, if $q_{ik} = 0$, the contribution of attribute k to the kernel is 0 (i.e., it does not contribute at all). If $q_{ik} = 1$, the contribution depends on whether $\alpha_{jk}$ is 0 or 1. When $\alpha_{jk} = 0$, the contribution to the kernel is $\lambda_{\cdot,0,(k)}$; when $\alpha_{jk} = 1$, the contribution is $\lambda_{\cdot,0,(k)} + \lambda_{\cdot,1,(k)}$, which means that the probability of a positive response increases with the number of mastered attributes measured by the item. It is worth noting that, just like the NIDA model, the NIDO model cannot be used to assess item quality, because all parameters are specified at the attribute level and constrained to equality across items.

Beyond the above models, some researchers have proposed more general forms to represent real cognitive diagnostic processes. For example, the general diagnostic model (GDM; von Davier, 2005), the log-linear cognitive diagnostic model (LCDM; Henson, Templin & Willse, 2009), and the generalized DINA model (G-DINA; de la Torre, 2011) have all been proposed as offering more flexibility in model comparison.
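As an illustration, the response functions of the four models described in this section (DINA, NIDA, DINO, and NIDO) can be sketched in Python. This is a minimal sketch, not code from the studies cited: the function names, the argument order, and the parameter values are hypothetical, and attribute profiles and Q-matrix rows are assumed to be plain lists of 0s and 1s.

```python
import math

def dina_prob(alpha, q, slip, guess):
    """DINA: eta = 1 only if every attribute required by the item is mastered."""
    eta = int(all(a == 1 for a, qk in zip(alpha, q) if qk == 1))
    return (1 - slip) ** eta * guess ** (1 - eta)

def dino_prob(alpha, q, slip, guess):
    """DINO: omega = 1 if at least one required attribute is mastered."""
    omega = int(any(a == 1 for a, qk in zip(alpha, q) if qk == 1))
    return (1 - slip) ** omega * guess ** (1 - omega)

def nida_prob(alpha, q, slips, guesses):
    """NIDA: product of attribute-level terms over the measured attributes."""
    p = 1.0
    for a, qk, s, g in zip(alpha, q, slips, guesses):
        if qk == 1:  # attributes with q_ik = 0 contribute a factor of 1
            p *= (1 - s) if a == 1 else g
    return p

def nido_prob(alpha, q, intercepts, slopes):
    """NIDO: logistic function of the summed attribute-level kernel."""
    kernel = sum((l0 + l1 * a) * qk
                 for a, qk, l0, l1 in zip(alpha, q, intercepts, slopes))
    return 1.0 / (1.0 + math.exp(-kernel))

# Hypothetical examinee and item: three attributes, the item measures the first two.
alpha = [1, 0, 1]
q = [1, 1, 0]
print(dina_prob(alpha, q, slip=0.1, guess=0.2))
print(dino_prob(alpha, q, slip=0.1, guess=0.2))
print(nida_prob(alpha, q, slips=[0.1] * 3, guesses=[0.2] * 3))
print(nido_prob(alpha, q, intercepts=[-1.0] * 3, slopes=[2.0] * 3))
```

For this examinee, who lacks one of the two required attributes, the conjunctive DINA rule returns the guessing probability while the disjunctive DINO rule returns one minus the slip probability, which shows how the two condensation rules differ on the same profile.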

(29) 2.1.3 Cognitive Diagnostic Model in a Higher-Order Structure

The main purpose of cognitive diagnostic models is to provide information on the attributes mastered by examinees. In many applications, examinations are required to report attribute mastery profiles and an overall latent trait simultaneously. To meet this demand, de la Torre and Douglas (2004) extended the DINA model to include higher order latent traits that specify the joint distribution of the binary attributes. The HO-DINA model treats the attributes as arising from a broadly defined latent trait resembling the θ of an item response model. This hierarchy, in which the item responses X are independent given α and the components of α are independent given θ, is natural in conjunctive models for cognitive diagnosis. The probability model of α conditional on θ can be represented as below:

$$P(\boldsymbol{\alpha} \mid \boldsymbol{\theta}) = \prod_{k=1}^{K} P(\alpha_k \mid \boldsymbol{\theta}) \tag{2.12}$$

The model can be formulated as a logistic regression function with latent covariates θ:

$$P(\alpha_k \mid \boldsymbol{\theta}) = \frac{\exp(\lambda_{0k} + \boldsymbol{\lambda}_k' \boldsymbol{\theta})}{1 + \exp(\lambda_{0k} + \boldsymbol{\lambda}_k' \boldsymbol{\theta})} \tag{2.13}$$

In many applications, the latent trait is a unidimensional construct that is normally distributed with mean 0 and variance 1; in some situations, however, the latent trait can be multidimensional. In the case of a multidimensional examination, a structured factor loading matrix would be used, where $\boldsymbol{\lambda}_k$ denotes the factor loading vector corresponding to $\alpha_k$. De la Torre and Douglas (2004) stated that the broadly defined latent traits were assumed to consist of a small number of dimensions. Because unidimensionality is assumed in many educational and cognitive diagnostic assessments, in the present dissertation, a unidimensional ability, θ, is assumed, and

(30) the attributes, α, are assumed to be independent conditional on θ. The basic assumption of the higher order structure commonly holds in educational examinations (e.g., the fraction subtraction example; the Wechsler Intelligence Scale for Children, WISC), in which the acquired attributes are related to one broadly defined construct of general intelligence or aptitude. The lower level of the HO-DINA model is the same as the DINA model, and the higher order level is the same as the 2PL model. At the higher order level, the discrimination parameter can be either constrained to be equal or estimated freely across attributes.

De la Torre and Douglas (2004) stated that using a higher order trait to model the joint distribution of α has several advantages. For instance, it greatly reduces the complexity of the saturated model in cases where it is reasonable to view the examination as measuring one or perhaps two general abilities in addition to the specific knowledge states that comprise α. Additionally, DeCarlo (2011) reparameterized the HO-DINA model as a logistic regression model with latent classes and used the term HO-RDINA (higher order reparameterized DINA) model to distinguish it from the HO-DINA model. Specifically, $g_i$ can be replaced by a function that gives a value between zero and one. For example,

$$P(X_{ij} = 1 \mid \eta_{ij} = 0) = \frac{\exp(f_i)}{1 + \exp(f_i)}$$

gives the probability that an examinee answers an item correctly without having the requisite skills, which can also be interpreted as a false alarm. In the HO-RDINA model, the parameters in the lower level are rewritten simply by using the logit function, $\log[p/(1 - p)]$:

$$\operatorname{logit} P(X_{ij} = 1 \mid \eta_{ij} = 0) = f_i$$

The parameter $f_i$ is the log odds of a false alarm. Similarly, the log odds of a hit is

(31)

$$\operatorname{logit} P(X_{ij} = 1 \mid \eta_{ij} = 1) = f_i + d_i$$

The parameter $d_i$ indicates how well item i detects the presence versus absence of the requisite skill set. According to the above two equations, the reparameterized DINA model can be written as follows:

$$\operatorname{logit} P(X_{ij} = 1 \mid \eta_{ij}) = f_i + d_i \eta_{ij} \tag{2.14}$$

Formula 2.14 is a logistic model with latent classes, with the items serving as detectors of the skill sets. The parameter $f_i$ gives the log odds of a false alarm, and $d_i$ reflects how well the item discriminates between examinees with and without the requisite skills (other discrimination indices have also been proposed). Including the vector of latent skills α in formula 2.14 gives

$$\operatorname{logit} P(X_{ij} = 1 \mid \alpha_j) = f_i + d_i \, h(\alpha_j, q_i)$$

where the function $h(\alpha_j, q_i) = \prod_{k=1}^{K} \alpha_{jk}^{q_{ik}}$ denotes the operation of multiplying the elements of $\alpha_j$, each raised to the power $q_{ik}$. By assuming local independence of the conditional response probabilities and a higher order structure for $P(\alpha)$, the complete HO-RDINA model is formulated. The higher order model for the skills follows the assumption proposed by de la Torre and Douglas (2004) described above. If the discrimination parameters are restricted to be equal across all skills, the model is referred to as the restricted higher order RDINA, or RHO-RDINA, and can be represented as

$$\operatorname{logit} P(\alpha_k \mid \theta) = \alpha_k \theta - \beta_k$$

where $\alpha_k$ on the right-hand side is the skill discrimination parameter (not the attribute indicator) and $\beta_k$ is the difficulty parameter of attribute k. Furthermore, along with the higher order structure assumption of

(32) HO-RDINA, by analogy, the deterministic input, noisy-or-gate (DINO) model can also be extended to include higher order latent traits that specify the joint distribution of the binary attributes. Thus, the compensatory DINO model is extended to a higher order structure, and the HO-DINO model is formulated. The two higher order cognitive models can be adjusted to detect differential item functioning by adding a group indicator to the lower level of the HO-DINA and HO-DINO models. This will be described in a later section.

2.2 Differential Item Functioning in Cognitive Diagnostic Context

No matter what kind of statistical model researchers use to analyze data, detecting differential item functioning (DIF) has become a necessary procedure for attaining test fairness and making educational and psychological tests valid. Traditionally, DIF is said to exist when examinees of the same ability are more or less likely to give the correct response in one group (the focal group) than in another group (the reference group). Several statistical techniques have been developed to detect DIF, and they can be separated into two major approaches. One is the parametric approach, which assumes a specific item response model; examples are the item response theory (IRT) based chi-square test (Lord, 1980) and the IRT-based likelihood ratio test (Thissen, Steinberg & Wainer, 1993). The other is the nonparametric approach, which includes the Mantel-Haenszel method (M-H; Holland & Thayer, 1988; Mantel & Haenszel, 1959), the Simultaneous Item Bias Test (SIBTEST; Shealy & Stout, 1993), and the logistic regression method (Swaminathan & Rogers, 1990), and which requires neither a specific form of item response function nor a large sample size. In this dissertation, only the M-H method and the logistic regression method are used, in conjunction with a parametric DIF detection method within the CDM context.

(33) 2.2.1 Non-Parametric DIF Methods

Mantel-Haenszel method. The Mantel–Haenszel (MH) method (Holland & Thayer, 1988; Mantel & Haenszel, 1959) is one of the most popular DIF detection procedures and can be extended to deal with polytomous items (i.e., the Mantel method, Mantel, 1963; the generalized M-H method, Mantel & Haenszel, 1959). In the MH procedure, the total raw score usually serves as the matching variable. The matching variable is used to divide the range of scores into K strata in order to compare correct versus incorrect responses for the reference and focal groups. Conditional on the total test score, the responses to the studied item from examinees in the reference and focal groups can be arranged into K 2×2 contingency tables. Table 2.1 is an example table for score group k (k = 1, …, K) on the studied item. Rrk and Wrk are the counts of right and wrong responses, respectively, in the reference group at score level k; Rfk and Wfk are the counts of right and wrong responses, respectively, in the focal group at score level k.

Table 2.1 Contingency Table for Mantel–Haenszel DIF Statistic

              Score on studied item
Group         1         0         Total
R             Rrk       Wrk       Nrk
F             Rfk       Wfk       Nfk
Total         Rtk       Wtk       Ntk

Note. Rrk, Wrk, Rfk, and Wfk are the counts within a cell; Nrk, Nfk, Rtk, Wtk, and Ntk are the marginal totals.

Therefore, the null hypothesis for MH DIF detection can be expressed as

$$H_0: \frac{R_{rk}}{W_{rk}} = \frac{R_{fk}}{W_{fk}} \tag{2.12}$$

This means that the odds of getting the item correct in the focal group are the same as those

(34) in the reference group at a given level of the matching variable. The MH chi-square is then computed as a chi-square with a single degree of freedom over the K 2×2 contingency tables. The MH chi-square statistic is shown below:

$$\text{MH chi-square} = \frac{\left[\left|\sum_{k=1}^{K} \left(R_{rk} - E(R_{rk})\right)\right| - .5\right]^2}{\sum_{k=1}^{K} \mathrm{Var}(R_{rk})} \tag{2.13}$$

where

$$E(R_{rk}) = \frac{N_{rk} R_{tk}}{N_{tk}} \tag{2.14}$$

and

$$\mathrm{Var}(R_{rk}) = \frac{N_{rk} N_{fk} R_{tk} W_{tk}}{N_{tk}^2 (N_{tk} - 1)} \tag{2.15}$$

A significant MH chi-square indicates that the probability of a correct answer to the item differs between the two groups after matching on the K score levels. In addition, $\alpha_{MH}$ can be computed to represent the ratio of the odds that a member of the reference group answers the studied item correctly to the odds that a matched member of the focal group does the same. The value of $\alpha_{MH}$ can be seen as a measure of the DIF effect size. When $\alpha_{MH} = 1$, there is no difference in the performance of the two groups on the studied item across the matched score levels.

$$\alpha_{MH} = \frac{\sum_{k=1}^{K} R_{rk} W_{fk} / N_{tk}}{\sum_{k=1}^{K} W_{rk} R_{fk} / N_{tk}} \tag{2.16}$$

Holland and Thayer (1988) introduced a logarithmic transformation of the odds ratio $\alpha_{MH}$ to make the effect size scale symmetric:

$$\Delta\alpha_{MH} = -2.35 \ln(\alpha_{MH}) \tag{2.17}$$

A value of zero indicates no DIF, a positive value indicates that the item favors the focal group, and a negative value indicates that the item favors the reference group.
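The computations in formulas 2.13 to 2.17 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the input layout (one tuple of counts per score level) are hypothetical, the absolute value inside the chi-square follows the usual continuity-corrected form, strata with fewer than two examinees are skipped, and the data are assumed to be non-degenerate so that no denominator is zero.

```python
import math

def mantel_haenszel(tables):
    """Return (MH chi-square, alpha_MH, delta alpha_MH).

    tables: list of (R_rk, W_rk, R_fk, W_fk) counts, one tuple per score level k.
    """
    sum_obs = sum_exp = sum_var = 0.0
    num = den = 0.0
    for r_r, w_r, r_f, w_f in tables:
        n_r, n_f = r_r + w_r, r_f + w_f      # group totals N_rk, N_fk
        r_t, w_t = r_r + r_f, w_r + w_f      # column totals R_tk, W_tk
        n_t = n_r + n_f                      # stratum total N_tk
        if n_t < 2:
            continue                         # assumption: skip uninformative strata
        sum_obs += r_r
        sum_exp += n_r * r_t / n_t                                   # formula 2.14
        sum_var += n_r * n_f * r_t * w_t / (n_t ** 2 * (n_t - 1))    # formula 2.15
        num += r_r * w_f / n_t
        den += w_r * r_f / n_t
    chi_square = (abs(sum_obs - sum_exp) - 0.5) ** 2 / sum_var       # formula 2.13
    alpha_mh = num / den                                             # formula 2.16
    delta = -2.35 * math.log(alpha_mh)                               # formula 2.17
    return chi_square, alpha_mh, delta

# Hypothetical counts for two score levels; the reference group has higher odds.
chi_square, alpha_mh, delta = mantel_haenszel([(30, 10, 10, 10), (25, 15, 12, 18)])
print(chi_square, alpha_mh, delta)
```

With these made-up counts the odds ratio exceeds 1, so the transformed effect size is negative, matching the sign convention stated above for an item that favors the reference group.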

(35) Educational Testing Service (ETS) classifies DIF into three levels based on $\Delta\alpha_{MH}$ (Dorans & Holland, 1993):
1. Negligible DIF, when the chi-square is not significant and $|\Delta\alpha_{MH}| < 1$.
2. Intermediate DIF, when the chi-square is significant and $1 < |\Delta\alpha_{MH}| < 1.5$.
3. Large DIF, when the chi-square is significant and $|\Delta\alpha_{MH}| \geq 1.5$.
The ETS standards were used in the real data application to assist with the interpretation of the flagged DIF items.

Logistic Regression. The logistic regression procedure (LR; Swaminathan & Rogers, 1990) is another commonly used method for DIF detection. The LR procedure is popular because of several advantages. First, it is easy to program and can be run with common statistical software (e.g., SAS procedures, SPSS, and the free software R). Second, LR has the capacity to detect nonuniform DIF. Finally, LR allows the flexibility to construct more complex models, such as conditioning or matching on more than one ability measure in DIF detection (Mazor, Kanjee, & Clauser, 1995). As outlined by Swaminathan and Rogers (1990), the LR model for DIF detection is:

$$P(u_i = 1 \mid \theta, g) = \frac{e^{\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)}}{1 + e^{\beta_0 + \beta_1 \theta + \beta_2 g + \beta_3 (\theta g)}} \tag{2.18}$$

where $P(u_i = 1)$ is the probability of person i responding correctly to the item, θ represents ability, g represents the group identifier, and θg represents the interaction term. In the model, ability is generally the total test score, and the group is typically coded as 0 or 1 to indicate group membership. $\beta_0$ represents the intercept, and $\beta_j$ (j = 1, 2, 3) represents the ability, group, and ability-by-group interaction weights, respectively.
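To make the roles of the group and interaction weights in formula 2.18 concrete, the model can be sketched as below. This is only an illustrative sketch: the function names and the coefficient values are hypothetical, and no parameters are estimated from data here.

```python
import math

def lr_dif_prob(theta, group, b0, b1, b2, b3):
    """Probability of a correct response under the LR model of formula 2.18."""
    z = b0 + b1 * theta + b2 * group + b3 * theta * group
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def group_gap(theta, b0, b1, b2, b3):
    """Log-odds difference between group 1 and group 0 at a given ability."""
    p1 = lr_dif_prob(theta, 1, b0, b1, b2, b3)
    p0 = lr_dif_prob(theta, 0, b0, b1, b2, b3)
    return logit(p1) - logit(p0)

# Hypothetical coefficients, chosen only for illustration.
uniform = dict(b0=-0.5, b1=1.0, b2=0.6, b3=0.0)      # group weight only
nonuniform = dict(b0=-0.5, b1=1.0, b2=0.0, b3=0.4)   # interaction weight only

for theta in (-1.0, 0.0, 1.0):
    print(theta, group_gap(theta, **uniform), group_gap(theta, **nonuniform))
```

When only the group weight is nonzero, the log-odds gap between groups is the same at every ability level, which corresponds to uniform DIF; when the interaction weight is nonzero, the gap changes with ability, which corresponds to nonuniform DIF.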

(36) Detecting DIF with LR is a process of comparing three nested models for each item and testing the change in fit as terms are eliminated. The full model (2.18) is compared with the reduced model (R1), which lacks the interaction term. The reduced model (R1) is in turn compared with the further reduced model (R2), which includes only the ability term. These models are compared with a likelihood ratio test based on the difference in their −2 log-likelihood statistics. The log-likelihood statistic ($G^2$) is a goodness-of-fit measure used as the criterion for model comparison. A significant group term indicates uniform DIF, whereas a significant interaction term indicates nonuniform DIF. The $R^2$ values at each step can be compared, and their difference, $\Delta R^2$, is an effect size measure indicating the magnitude of DIF. The estimated effect size $\Delta R^2$ can be classified into three levels (Jodoin & Gierl, 2001):
1. Negligible DIF, where $\Delta R^2 < .035$.
2. Moderate DIF, where $.035 \leq \Delta R^2 \leq .070$.
3. Large DIF, where $\Delta R^2 > .070$.

Simultaneous Item Bias Test Method. The simultaneous item bias test (SIBTEST; Shealy & Stout, 1993) is another popular non-parametric method for DIF detection. Like the M-H method, the SIBTEST provides an estimate of the DIF effect size. Unlike the M-H method, which can only detect DIF item by item, the SIBTEST can be used to detect whether DIF is present in one or more items simultaneously. The SIBTEST was initially developed to detect unidirectional uniform DIF, but it can also be applied to detect nonuniform DIF (Li & Stout, 1996). When applying the SIBTEST in DIF detection, the test needs to be divided into two subtests: one is the suspected subtest, which contains item(s) that may exhibit

(37) differential functioning, and the other is assumed to be the DIF-free subtest. The score on the DIF-free subtest serves as the matching variable. A weighted mean difference in item or subtest performance between the focal group and the reference group, $\beta_{UNI}$, is estimated, and this difference is then tested statistically:

$$\hat{\beta}_{UNI} = \sum_{k=0}^{K} p_k \left(\bar{Y}_{rk} - \bar{Y}_{fk}\right) \tag{2.18}$$

where $p_k$ denotes the proportion of focal group examinees with score k on the DIF-free subtest, and $\bar{Y}_{rk} - \bar{Y}_{fk}$ is the true score mean difference on the studied item between reference group and focal group examinees attaining a score of k (k = 0, …, K) on the DIF-free subtest. The test statistic for $\beta_{UNI}$ is given by

$$\beta_{UNI} = \frac{\hat{\beta}_{UNI}}{\hat{\sigma}(\hat{\beta}_{UNI})} \tag{2.19}$$

where $\hat{\sigma}(\hat{\beta}_{UNI})$ is the estimated standard error of $\hat{\beta}_{UNI}$. Shealy and Stout (1993) demonstrated that this statistic is approximately distributed as a standard normal variable under the null hypothesis. The estimated effect size $\hat{\beta}$ can be classified into three levels (Roussos and Stout, 1996):
1. Negligible DIF, where $|\hat{\beta}| < .059$ and the null hypothesis is rejected.
2. Moderate DIF, where $.059 \leq |\hat{\beta}| < .088$ and the null hypothesis is rejected.
3. Large DIF, where $|\hat{\beta}| \geq .088$ and the null hypothesis is rejected.

2.2.2 Parametric DIF Methods

The parametric DIF methods assume a specific item response model and

(38) compare item parameters from different groups. In this dissertation, two IRT-based DIF detection methods are briefly described.

Lord's Chi-Square. In the IRT framework, DIF can be identified when the item characteristic curve (ICC) differs between the reference and focal groups. The ICC is an S-shaped curve describing the relationship between the probability of a correct response to an item and examinee ability. Since the ICC is completely determined by the item parameters, comparing ICCs is the same as comparing item parameters from the two groups. Lord (1980) proposed a chi-square statistic to detect DIF in which the item parameters of the IRT model are estimated separately for the reference and focal groups. The item parameters must then be placed on the same metric by means of some linking strategy in order to test the equality of item parameters between the reference and focal groups. For example, if the Rasch model is used, the null hypothesis for Lord's chi-square statistic is:

$$H_0: b_f = b_r \tag{2.20}$$

Lord's chi-square statistic can be computed by

$$\chi^2 = (b_{diff})' \, \Sigma^{-1} (b_{diff}) \tag{2.21}$$

where $b_{diff} = b_f - b_r$ and $\Sigma$ is the variance-covariance matrix of the differences between the parameter estimates. The statistic $\chi^2$ is asymptotically distributed as a chi-square with p degrees of freedom, where p is the number of parameters being compared.

Likelihood Ratio Test for DIF. The likelihood ratio test for DIF was proposed by Thissen, Steinberg, and Wainer (1993). When applying this method, the likelihoods of two nested models (the compact and augmented models) are compared. In the compact model, item parameters for all items are assumed to be the same for

(39) reference and focal groups. In the augmented model, item parameters for the studied items are not constrained to be equal for the reference and focal groups, while the remaining items are constrained to be the same for both groups. The likelihood ratio statistic can be calculated by

$$G^2 = -2 \log\left(\frac{L_C}{L_A}\right) \tag{2.22}$$

where $L_C$ and $L_A$ are the likelihoods of the compact model and the augmented model, respectively. $G^2$ is distributed as a chi-square with p degrees of freedom, where p is the difference in the number of parameters estimated in the augmented and compact models. In addition, the remaining items serve as an anchor set to link the metrics of the focal and reference groups. DIF is then checked item by item.

Modified Higher-Order DINA Model. Li (2008) proposed a modified HO-DINA model by making some adjustments to the HO-DINA model. In the modified HO-DINA, the group indicator is used as an examinee covariate in the upper level of the higher order DINA model. Thus, the attribute level specification of the HO-DINA model can be rewritten as:

$$\operatorname{logit}[P(\alpha_k \mid \theta_j)] = a(\theta_j + \Delta t \, I_j) - (\beta_k + \gamma_k I_j) \tag{2.23}$$

where $I_j$ is the group indicator, which takes a value of 0 if examinee j belongs to the reference group and 1 if examinee j belongs to the focal group; a is a uniform discrimination parameter, that is, it is fixed to be the same across attributes and groups; $\beta_k$ is the difficulty parameter of attribute k for the reference group; and $\gamma_k$ is the difference in difficulty for attribute k between the reference and focal groups, which represents the amount of uniform DIF for attribute k. A positive sign for γ indicates that the attribute favors the reference group, and a negative sign indicates that the attribute favors

(40) the focal group. $\theta_j$ is the general ability for member j of the reference group, and $\Delta t$ is the mean difference in ability between the reference and focal groups. Furthermore, the parameter is adjusted using Chaimonkol's (2005) method for multilevel logistic regression to solve the problem of model identification. The modified HO-DINA model can detect DAF and DIF simultaneously. With this model, DAF can be detected by examining whether $\gamma_k^{adj} = 0$, where $\gamma_k$ is adjusted by $\gamma_k^{adj} = \gamma_k - \bar{\gamma}$.

Model based DIF Detection with RHO-RDINA and RHO-RDINO Model. Because researchers may not wish to assume that a higher order θ determines the response patterns of test takers, the modified HO-DINA model (Li, 2008) can be extended into a more general form. For example, the group indicator can be added at the item level, at the attribute level, or at both. Hence, in the present dissertation the more general formula below is proposed to detect DIF. The general formula can be decomposed into two parts, the item level and the attribute level. When the group indicator is added at the item level only, one can detect DIF only. When the group indicator is added at the attribute level only, one can detect DAF only. If the group indicator is added at both the item and attribute levels, one can detect DAF and DIF simultaneously. In addition, the item discrimination parameter can be estimated by attribute for both the focal and reference groups. Furthermore, in order to compare the results with previous DIF studies within the IRT framework, the estimated parameters are reparameterized on a logit scale. Hence, the item level model can be rewritten as below:

$$P(X_{ij} = 1 \mid \eta_{ij}) = \left[1 - \frac{\exp(s_i + s_{di})}{1 + \exp(s_i + s_{di})}\right]^{\eta_{ij}} \left[\frac{\exp(g_i + g_{di})}{1 + \exp(g_i + g_{di})}\right]^{1 - \eta_{ij}} \tag{2.24}$$

where $s_{di}$ is the deviation from the mean slipping parameter for group g on item i;

(41) $g_{di}$ is the deviation from the mean guessing parameter for group g on item i; and $\eta_{ij}$ is defined as in formula 2.1. In the DINA model, $\eta_{ij}$ can be regarded as an ability variable with two levels: $\eta_{ij} = 1$ indicates that examinee j has mastered all attributes required by item i, and $\eta_{ij} = 0$ indicates that examinee j lacks at least one attribute required by item i. When the group indicator is absent, the model reduces to a reparameterized DINA model. The higher level of the HO-DINA model can be rewritten as

$$\operatorname{logit}[P(\alpha_k \mid \theta_j)] = r_k(\theta_j + \gamma_d) - \beta_k \tag{2.25}$$

where $\theta_j$ is the general ability of examinee j of the reference group; $\gamma_d$ is the mean difference in ability between the reference and focal groups; $\beta_k$ is the difficulty parameter of attribute k (if a subscript "d" is added to it, the model can be used to estimate DAF); and $r_k$ is the item discrimination parameter. As noted above, in this model $r_k$ is a common discrimination parameter for all attributes in both the focal and reference groups. The restricted discrimination parameter version presented here can also be found in the model of de la Torre and Douglas (2004), and it has been fitted to real data sets, where it yielded better model-data fit than the nonrestricted HO-DINA model (e.g., DeCarlo, 2011). The modified RHO-RDINA model presented above is a special case of the generalized formula. By analogy with the modified RHO-RDINA model, the modified RHO-RDINO model can be written as

$$P(X_{ij} = 1 \mid \omega_{ij}) = \left[1 - \frac{\exp(s_i + s_{di})}{1 + \exp(s_i + s_{di})}\right]^{\omega_{ij}} \left[\frac{\exp(g_i + g_{di})}{1 + \exp(g_i + g_{di})}\right]^{1 - \omega_{ij}} \tag{2.26}$$

In this model, $s_{di}$ is the deviation from the mean slipping parameter for group g on

(42) item i; $g_{di}$ is the deviation from the mean guessing parameter for group g on item i, defined as in the modified RHO-RDINA model; and $\omega_{ij}$ is defined as in formula 2.7. In the DINO model, $\omega_{ij}$ can be regarded as an ability variable with two levels: $\omega_{ij} = 1$ indicates that examinee j has mastered at least one attribute required by item i, and $\omega_{ij} = 0$ indicates that examinee j has mastered none of the attributes required by item i. Therefore, any one attribute can completely compensate for the lack of all the others and raise the probability of a correct response. The higher level of the modified RHO-RDINO model is given by formula 2.25.

The modified HO-DINA model proposed by Li (2008) is appealing because it detects DIF and DAF with a parametric approach. However, the model is hard to apply in real situations. First, the model of Li (2008) does not constrain any item or attribute parameter to be equal across groups. Therefore, the number of parameters to be estimated increases with test length and with the complexity of the Q-matrix. The most serious problem is the aberrant pattern that appeared in a real data application: the DAF and DIF results were inconsistent. That is, the items flagged as DIF items did not belong to the attribute flagged as showing DAF, and the attribute flagged as showing DAF contained no DIF items. This result may be due to a sample size that was insufficient for estimating such a complex model. Second, because the purpose of DIF detection is to assure quality at the item level, it is more reasonable to assume the same attribute parameters across groups. Admittedly, some researchers have used DAF detection to test different assumptions about cognitive learning strategies by assuming different attribute hierarchy structures (e.g., Gierl, Zheng & Cui, 2008); however, this is beyond the scope of this dissertation. Hence, unlike the modified model, the present study adds the group indicator at the lower (item) level.
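
The item-level models in formulas 2.24 and 2.26 can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the estimation procedure used in this study: the function names (logistic, eta_dina, omega_dino, p_correct), the Q-matrix row, and all numeric values are hypothetical, and the deviations s_di and g_di are assumed to be supplied by the caller for the examinee's own group.

```python
import math


def logistic(x):
    """Inverse logit: maps a logit-scale parameter to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def eta_dina(alpha, q_row):
    """DINA latent response: 1 only if every required attribute is mastered."""
    return int(all(a == 1 for a, q in zip(alpha, q_row) if q == 1))


def omega_dino(alpha, q_row):
    """DINO latent response: 1 if at least one required attribute is mastered."""
    return int(any(a == 1 for a, q in zip(alpha, q_row) if q == 1))


def p_correct(latent, s_i, g_i, s_di=0.0, g_di=0.0):
    """Probability of a correct response with logit-scale slip and guess.

    latent is eta_ij (DINA) or omega_ij (DINO); s_di and g_di are the
    group deviations (assumed here to be those of the examinee's group).
    With both deviations at 0 this is the reparameterized model without
    a group indicator.
    """
    slip = logistic(s_i + s_di)
    guess = logistic(g_i + g_di)
    return (1.0 - slip) ** latent * guess ** (1 - latent)


# Hypothetical item requiring attributes 1 and 3 out of three attributes.
q_row = [1, 0, 1]
alpha = [1, 0, 0]  # this examinee has mastered attribute 1 only

eta = eta_dina(alpha, q_row)      # a non-master under DINA
omega = omega_dino(alpha, q_row)  # a master under DINO

# Hypothetical logit-scale parameters and group deviations.
s_i, g_i = -2.0, -1.5
print("DINA:", p_correct(eta, s_i, g_i, s_di=0.3, g_di=0.3))
print("DINO:", p_correct(omega, s_i, g_i, s_di=0.3, g_di=0.3))
```

The same examinee is a non-master under the DINA rule and a master under the DINO rule, so the guessing term governs the first probability and the slipping term governs the second.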

(43) It is more reasonable and logical to detect DIF at the item level. Furthermore, so that the results can be compared with those of previous DIF studies within the IRT framework, the estimated parameters are reparameterized on a logit scale. Hence, the modified RHO-RDINA and RHO-RDINO models are proposed to detect DIF in the present study.

By definition, in the DINA model the conditional probability of a correct response to item i is $1 - s_i$ when $\eta_{ij} = 1$ and $g_i$ when $\eta_{ij} = 0$. To obtain an estimate of DIF, $s_{di}$ and $g_{di}$ are estimated for each group, and the corresponding $100(1 - \alpha)\%$ confidence interval is computed to determine whether the two noise parameters of one group differ significantly from those of the other group (i.e., whether the interval contains 0). This procedure gives results similar to those of a Wald test. However, with a sufficiently large sample, even a trivial difference will be detected as statistically significant. Thus, to quantify the magnitude of DIF contamination, the mean item difficulty difference (MIDD; Wang & Yeh, 2003) between the reference and focal groups is used as an index to identify DIF items. The MIDD is directly related to the signed area measure proposed by Raju (Wang & Su, 2004a). Applying the concept of MIDD in the framework of CDMs yields

$$\text{MIDD} = \bar{s}_R - \bar{s}_F \tag{2.27}$$

where $\bar{s}_R$ and $\bar{s}_F$ denote the mean item slipping of the reference and focal groups, respectively. The same idea can be applied to the mean item guessing of the reference and focal groups. Wang (2008) pointed out that a DIF magnitude of 0.5 logits can be treated as a cut-off point for identifying DIF. Because the item parameters are reparameterized on a logit scale, the empirical cut-off value of 0.5 logits can be used to determine whether an item exhibits significant DIF. In the modified RHO-RDINA model, a positive value of $s_{di}$ indicates that the

(44) item favors the focal group for examinees who have mastered all attributes required by item i, and a positive value of $g_{di}$ indicates that the item favors the reference group for examinees who lack at least one of the attributes required by item i. In the modified RHO-RDINO model, a positive value of $s_{di}$ indicates that the item favors the focal group for examinees who have mastered at least one of the attributes required by item i, and a positive value of $g_{di}$ indicates that the item favors the reference group for examinees who have mastered none of the attributes required by item i. Thus, there are four combinations of $s_{di}$ and $g_{di}$:

1. Both $s_{di}$ and $g_{di}$ are positive; that is, the item favors the focal group for masters but favors the reference group for non-masters.
2. Both $s_{di}$ and $g_{di}$ are negative; that is, the item favors the reference group for masters but favors the focal group for non-masters.
3. $s_{di}$ is positive and $g_{di}$ is negative; that is, the item favors the focal group for both masters and non-masters.
4. $s_{di}$ is negative and $g_{di}$ is positive; that is, the item favors the reference group for both masters and non-masters.

According to the definition of Li (2008), combinations 1 and 2 indicate non-uniform DIF, and combinations 3 and 4 indicate uniform DIF.

2.2.3 Previous DIF Detection Methods in CDM Framework

The performance of the above-mentioned DIF detection methods under various manipulated conditions, in terms of the rates of correct identification (power) and incorrect identification (Type I error) of DIF, has been examined in the framework of item response models (i.e., with items generated from the Rasch, 2PL, and 3PL models). For instance, several factors have been found to affect the results of DIF detection, including the power