1.1 Motivation
Assessments are developed to inform instruction and learning and to serve as an auxiliary tool in clinical diagnosis. Under classical test theory, practitioners adopt either norm referencing or criterion referencing to interpret the information that assessments provide. In many well-developed educational and psychological assessments, the observed sum score is taken to represent examinees’ ability in the assessed domain. The diagnosis is determined by a cut score, which is set on the basis of theory and practical experience. With the continuing development of statistical methods, researchers have begun to explore diagnosis based on latent traits (e.g., the factor analysis approach, item response models).
Although unidimensional item response models (IRMs) are useful for scaling and ordering examinees on a latent proficiency continuum, they do not allow evaluation of students’ specific strengths and weaknesses, which could be used to facilitate learning and instruction (de la Torre, Hong, & Deng, 2010). Moreover, examinees who obtain the same sum score may not share the same skill or attribute mastery pattern. In contrast to conventional item response models, cognitive diagnostic models (CDMs) place more emphasis on diagnostic information: their underlying latent variables are defined specifically to identify the presence or absence of multiple fine-grained skills in a particular domain.
Attribute mastery diagnosis has received increasing emphasis in many countries in recent years. In the US, the No Child Left Behind policy has contributed to the development of standard setting and of CDMs. Diagnostic results benefit both students and teachers. For teachers, they can help in arranging an appropriate curriculum; for students, the individual attribute mastery profile alerts them when they have not mastered all required attributes. In addition, many international academic surveys (e.g., PISA, TIMSS) are developed to assess students’ basic competencies and to compare subgroups. The results of such large-scale tests carry informative meaning that guides practitioners in remedying their teaching and curriculum design. It is therefore worth noting the problem of test fairness between subgroups, such as majority and minority groups.
Differential item functioning (DIF) analysis has been treated as a necessary procedure in test development. Different DIF detection methods have been proposed and applied in practice. A number of simulation studies have been designed to compare the effectiveness of DIF methods (e.g., nonparametric and parametric approaches) and to investigate factors that may affect DIF detection in specific item response models (Finch & French, 2007; French & Maller, 2007; Fidalgo, Mellenbergh, & Muniz, 2000; Holland & Thayer, 1988; Lord, 1980; Li & Stout, 1996; Mantel & Haenszel, 1959; Rogers & Swaminathan, 1993; Shealy & Stout, 1993; Swaminathan & Rogers, 1990; Shih & Wang, 2009; Thissen, Steinberg, & Wainer, 1993; Narayanan & Swaminathan, 1996; Wang & Su, 2004a; Wang & Su, 2004b). However, only a few studies have investigated DIF within the CDM context. For example, Gierl, Zheng, and Cui (2008) used SIBTEST to detect ADF (attribute differential functioning) with a specific CDM, the attribute hierarchy method (AHM). In a dissertation, Zhang (2006) used two widely applied nonparametric DIF detection procedures, Mantel-Haenszel (MH) and SIBTEST, and compared their effectiveness under the DINA (deterministic input, noisy "and" gate) model. In Zhang’s study, several possible DIF patterns were simulated and two matching variables (total raw score and profile score) were compared; the profile score outperformed the total raw score. More recently, another dissertation addressed parametric DIF detection within CDMs: Li (2008) modified the higher-order DINA model and proposed a model-based procedure, within the higher-order DINA framework, that detects DIF and differential attribute functioning (DAF) simultaneously. However, these studies attributed the poorer DIF detection obtained with the test total score solely to the choice of matching variable, without considering that a contaminated matching variable may lead to the same result. It is therefore worth asking whether, with a purification procedure, the test total score could work equally well as a matching variable within the CDM framework.
Although the above-mentioned studies have contributed to this area, some common issues remain unresolved when DIF detection procedures are applied in CDMs. First, although previous studies aimed to introduce popular DIF detection methods into CDMs, how several factors (e.g., test length, DIF percentage, DIF magnitude, sample size) affect DIF detection remains unclear in these studies. DIF studies in IRT have found that DIF magnitude (e.g., French & Maller, 2007; Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993), test length (e.g., French & Maller, 2007; Finch & French, 2007; Shih & Wang, 2009; Narayanan & Swaminathan, 1996), DIF patterns (e.g., Wang & Su, 2004a; Shih & Wang, 2009), and percentage of DIF items (e.g., Fidalgo, Mellenbergh, & Muniz, 2000; Finch & French, 2007; French & Maller, 2007; Shih & Wang, 2009) are important factors that influence the Type I error rate and power of DIF detection. These factors should therefore be considered when conducting DIF analysis in CDMs.
Second, previous studies have focused only on non-compensatory models (e.g., the DINA model, Zhang, 2006; the HO-DINA model, Li, 2008; the AHM, Gierl, Zheng, & Cui, 2008). The DINA model is non-compensatory, meaning that an examinee must master all attributes required by an item; otherwise, he or she has a lower probability of answering that item correctly. This strong assumption may not be appropriate in real situations. One limitation of the DINA model is that it does not differentiate among respondents who lack at least one required attribute. Although a number of studies have applied the DINA model to real data (e.g., de la Torre, 2009; de la Torre & Douglas, 2004, 2008; Henson, Templin, & Willse, 2009; Templin, Henson, & Douglas, 2006), only a few have applied the DINO (deterministic input, noisy "or" gate) model. Since Lee, Park, and Taylan (2011) suggested that allowing for alternative or multiple strategies for solving an item may better explain students’ responses, it seems worth investigating DIF detection when compensatory CDMs are applied.
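For reference, the contrast between the two model families can be written in the usual slip-and-guess notation. The symbols below are the conventional ones from the CDM literature and are assumed here for illustration; they are not taken from this chapter. Let $\alpha_{ik}$ indicate whether examinee $i$ has mastered attribute $k$, $q_{jk}$ indicate whether item $j$ requires attribute $k$, and $s_j$ and $g_j$ denote the slip and guessing parameters of item $j$.

```latex
% DINA: the examinee must hold every required attribute (non-compensatory)
\eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\eta_{ij}} \, g_j^{\,1 - \eta_{ij}}

% DINO: holding at least one required attribute suffices (compensatory)
\omega_{ij} = 1 - \prod_{k=1}^{K} (1 - \alpha_{ik})^{q_{jk}}, \qquad
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = (1 - s_j)^{\omega_{ij}} \, g_j^{\,1 - \omega_{ij}}
```

Under the DINA model every examinee with $\eta_{ij} = 0$ has the same success probability $g_j$, regardless of how many required attributes are missing, which is the limitation noted above.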
Third, previous studies did not deal with the issue of contaminated matching criteria. Because an internal matching criterion may be contaminated by DIF items, it may not be appropriate for detecting DIF directly. If an invalid matching criterion is used, the results of DIF detection will be suspect. Many studies have focused on this issue and proposed strategies to address it within the framework of item response theory (e.g., Candell & Drasgow, 1988; Fidalgo, Mellenbergh, & Muniz, 2000; French & Maller, 2007; Holland & Thayer, 1988; Shih & Wang, 2009; Wang & Yeh, 2003; Wang & Su, 2004a; Wang & Su, 2004b). A new approach named DIF-free-then-DIF has also been proposed by Wang (2008). Its central idea is that it is very important to find a set of clean (i.e., DIF-free) items to serve as the matching criterion. The same situation is likely to arise in the CDM context. However, the CDM studies reviewed above overlooked purification when dealing with DIF. In practice, one cannot know in advance which items are DIF free. Thus, item purification procedures cannot be neglected in DIF detection. Nevertheless, to date no study has addressed this issue in the framework of CDMs.
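The idea of purification can be illustrated with a minimal sketch based on the Mantel-Haenszel chi-square with the total score as matching variable. This is a generic illustration under stated assumptions, not the procedure developed in this dissertation: the function names (`mh_chi2`, `mh_with_purification`), the .05 critical value, the iteration cap, and the choice to always keep the studied item in the matching score are all assumptions made for the example.

```python
import numpy as np

CHI2_CRIT_05 = 3.841  # chi-square critical value, df = 1, alpha = .05 (assumed level)


def mh_chi2(item, group, match):
    """Mantel-Haenszel chi-square (continuity corrected) for one item.

    item:  0/1 responses; group: 0 = reference, 1 = focal;
    match: matching score used to stratify examinees.
    """
    sum_a = sum_e = sum_v = 0.0
    for s in np.unique(match):
        k = match == s                      # examinees in this score stratum
        n = k.sum()
        n_ref = (k & (group == 0)).sum()
        n_foc = n - n_ref
        if n < 2 or n_ref == 0 or n_foc == 0:
            continue                        # stratum carries no information
        m1 = item[k].sum()                  # correct responses in stratum
        m0 = n - m1                         # incorrect responses in stratum
        sum_a += item[k & (group == 0)].sum()
        sum_e += n_ref * m1 / n
        sum_v += n_ref * n_foc * m1 * m0 / (n ** 2 * (n - 1))
    if sum_v == 0:
        return 0.0
    return max(0.0, abs(sum_a - sum_e) - 0.5) ** 2 / sum_v


def mh_with_purification(resp, group, max_iter=10):
    """Flag DIF items, iteratively removing flagged items from the matching score."""
    n_items = resp.shape[1]
    flagged = np.zeros(n_items, dtype=bool)  # first pass uses the full total score
    for _ in range(max_iter):
        new_flagged = np.zeros(n_items, dtype=bool)
        for j in range(n_items):
            keep = ~flagged
            keep[j] = True                   # studied item stays in the matching score
            match = resp[:, keep].sum(axis=1)
            new_flagged[j] = mh_chi2(resp[:, j], group, match) > CHI2_CRIT_05
        if np.array_equal(new_flagged, flagged):
            break                            # flagged set is stable
        flagged = new_flagged
    return flagged
```

Here `resp` is an examinee-by-item 0/1 array and `group` a 0/1 vector of the same length as the number of rows. The first pass is the ordinary unpurified analysis; later passes recompute the matching score from the items not flagged in the previous pass.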
1.2 Significance and Contribution
As mentioned above, several important factors (e.g., DIF magnitude, test length, percentage of DIF items, and DIF patterns) have been overlooked in past DIF detection studies within the framework of CDMs. The present study aims to investigate DIF-related issues from a broader perspective. Thus, more factors that may affect the effectiveness of DIF detection are considered in this dissertation. In addition, the parametric DIF approach is considered more efficient in application, yet previous studies have proposed only non-compensatory models to detect DIF. The present dissertation therefore proposes two modified reparameterized models, compensatory and non-compensatory, to detect DIF directly. Furthermore, because a contaminated matching criterion yields invalid results in DIF analysis, this dissertation introduces the purification procedure into the framework of CDMs. In addition to the model-based DIF detection method proposed in the study, two widely used DIF detection methods, the Mantel-Haenszel (MH) and logistic regression (LR) methods, are also used. Multiple detection methods are used so that agreement and discrepancy among their outcomes can be compared under various test conditions. Using datasets generated to reflect various DIF conditions, the Type I error rate and power of the detection methods are investigated. Finally, in order to compare and evaluate the performance of purification procedures in building a common metric for DIF analysis, a dataset from the TIMSS 2007 fourth-grade mathematics assessment is used to demonstrate gender differential item functioning.