The Utility of Person-fit for Detecting Faking: Simulation and Application of Nonparametric Item Response Theory

Chapter 1 Introduction

1.4 Glossary


National Chengchi University (國立政治大學)

1.4.2 Nonparametric item response theory

Nonparametric item response theory (NIRT) is one of the families of item response theory models and is suitable when the item scale is ordinal. The only assumption of this model is that the relation between score and ability is ordinal. The difference between NIRT and parametric item response theory (PIRT) is that the item response functions do not have to follow a monotone logistic function (Sijtsma, 2005).

1.4.3 Person-fit

Person-fit concerns unusual response patterns (Meijer & Sijtsma, 2001). It can be indicated by the level of discrepancy between the assumed IRT model or item-score pattern and the individual response pattern (LaHuis & Copeland, 2009; Meijer & Sijtsma, 2001). Person-fit statistics are used to detect whether an examinee has an unusual item response pattern and to distinguish such examinees from normal respondents (Karabatsos, 2003).


Chapter 2 Literature Review

This chapter reviews the relevant theories and previous studies in five sections. The first section introduces the concept and indicators of person-fit. The second reviews studies on faking detection via person-fit. The third explains the role of sample size, the sample size requirements of this study, and the advantages of NIRT with respect to sample size. The fourth compares PIRT and NIRT. Lastly, the fifth illustrates the methods of non-parametric estimation and presents the theory and model of NIRT.

2.1 Person-fit

Person-fit refers to the level of consistency between the assumed IRT model or item-score pattern and the individual response pattern (LaHuis & Copeland, 2009; Meijer & Sijtsma, 2001). It concerns atypical response patterns (Meijer & Sijtsma, 2001), and person-fit methodology can be used to test for invalid responses (Emons, 2008).

Person-fit can be analyzed with parametric and non-parametric estimates. A parametric estimate measures the difference between the test scores and the values predicted by the assumed model. Conversely, the person-fit value of a non-parametric estimate is not based on the parameters of the test response model, but is calculated from the responses of all test takers (Karabatsos, 2003). Both methods are commonly discussed for dichotomous items (Karabatsos, 2003; Meijer & Sijtsma, 2001; van Krimpen-Stoop & Meijer, 2002). However, limited research has analyzed both methods for polytomous items (Glas & Dagohoy, 2007), especially the non-parametric estimate. Most studies have focused on parametric estimates in cognitive tests (Dagohoy, 2005; van Krimpen-Stoop & Meijer, 2002) and non-cognitive tests (Reise & Widaman, 1999; Zickar, Gibby, & Robie, 2004).

Several statistics have been proposed to measure person-fit. To our knowledge, Karabatsos (2003) offers the most complete review, integrating 36 indicators. These statistics are divided into parametric and nonparametric methods and are shown in Table 1. Although there is more than one index for investigating person-fit, the indices are based on different theories and each may have merits under certain conditions. St-Onge et al. (2011) asserted that lz is an IRT-based index that is used in studies because of its relatively high detection rate, while U3^p belongs to the group-based person-fit statistics, which are superior in accuracy and widely used.

With regard to the relative power of nonparametric and parametric methods, some comparative studies have been conducted. St-Onge et al. (2011) considered the range of item difficulty, discrimination, and test length, and compared two parametric person-fit indicators, l_z and ECI2z, with two nonparametric person-fit indicators, H^T and U3, on the detection rate for aberrant response patterns. They found that the nonparametric indices, H^T and U3, had higher detection rates than the two parametric methods for a fixed aberrance rate of 0.6 under the simulated cheating condition.

Also, a number of studies indicate that U3 and the Guttman error may have lower detection rates; however, Emons (2008) provided evidence from a simulation study that the detection rates are similar for the Guttman error and l. This means that nonparametric methods can have power similar to that of PIRT and can even outperform parametric indices.

Among the nonparametric indices, the Guttman error is an effective method for detecting misfit, although it is easily affected by the total score, especially for examinees with very high or very low scores. Karabatsos (2003) stated that, for dichotomous items, H^T, C, MCI, and U3 performed best in detecting aberrant patterns. It should be noted that all four statistics are nonparametric. Furthermore, after comparing the detection rates of 36 person-fit statistics, Karabatsos (2003) stated that U3 is one of the four strongest indicators.

Accordingly, each index is favored under certain conditions. Given the comparative purpose of this study, suitable nonparametric and parametric indices are selected to investigate their efficiency under the factors considered in this study.


According to previous studies, lz is one of the most commonly used indicators, and the Guttman error and U3^p are among the stronger nonparametric person-fit indices. Three indicators related to this study, lz, Guttman errors, and U3^p, are therefore introduced here. For the other indicators, see Karabatsos (2003). The three indicators are as follows.


Table 1. Person-fit statistics

Non-Parametric Person-Fit Statistics (12)
    G (Guttman, 1944, 1950)
    Gⁿ (van der Flier, 1977)
    G^p (Molenaar, 1991)
    GN (Emons, 2008)
    rpbis (Donlon & Fischer, 1968)
    C (SCI) (Sato, 1975)
    … Sijtsma & Meijer, 1992)
    A, D, E_i (Kane & Brennan, 1980)

Parametric Person-Fit Statistics (25)
    U (Wright & Stone, 1979)
    ZU (Wright, 1980)
    lnU (Wright & Stone, 1979)
    W (Wright, 1980)
    ZW (Wright, 1980)
    ECI4z, ECI6z (Tatsuoka, 1984)
    l (Levine & Rubin, 1979)
    lzm (Drasgow, Levine, & McLaughlin, 1991)
    lz (Drasgow, Levine, & Williams, 1985)
    M (Molenaar & Hoijtink, 1990)
    M(p-value) (Bedrick, 1997)

Item-Grouping Person-Fit Statistics (5)
    D(θ) (Trabin & Weiss, 1983)
    UB (Smith, 1986)
    ZUB (Smith, 1986)
    lnUB (through Wright & Stone, 1979)

Source:
Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty-six person-fit statistics. Applied Measurement in Education, 16, 277-298.
Huang, T. W. (2006). Aberrant response diagnoses by the Beyond-Ability-Surprise index (B*) and the Within-Ability Concern index (W*). Proceedings of 2006 Hawaii International Conference on Education, Honolulu, Hawaii, pp. 2853-2865.
Emons, W. H. M., Meijer, R. R., & Sijtsma, K. (2002). Comparing simulated and theoretical sampling distributions of the U3 person-fit statistic. Applied Psychological Measurement, 26(1), 88-108.


The lz index is one of the most commonly used indicators; it represents the joint probability of the responses given the estimated ability. In line with the concept of person-fit, most indices can be expressed as formula 1 (Snijders, 2001).

[Formula 1]

If the person-fit statistic is required to have expectation 0, it is expressed in the form of formula 2, which is the centered form.

[Formula 2]

Most studies construct the indices from the log-likelihood function, shown as formula 3. The log-likelihood form of V, abbreviated as l, assesses person-fit via the likelihood given the item parameters (e.g., under a one-, two-, or three-parameter model) and the estimated theta (St-Onge et al., 2011). The index was originally proposed by Levine and Rubin (1979) and was further developed by Drasgow, Levine, and Williams (1985) and Drasgow, Levine, and McLaughlin (1991).


items. However, the calculation of l raises two major problems. First, since l is not a standardized score, researchers have to rely on θ to classify individuals as normal or aberrant; that is, the statistic is not independent of θ. Second, researchers need a null distribution for fitting response behavior in order to classify an item response pattern as aberrant. To address these two problems, theta dependence and an unknown sampling distribution, Drasgow et al. (1985) proposed a method for standardizing l. The method reduces the interference of theta, and the resulting statistic is assumed to be asymptotically standard normally distributed (Meijer & van Krimpen-Stoop, 2000). That is, the standardized value of l, lz, is not influenced by θ. The calculation of lz is given in formula 4, where E(l) is the expected value of l and Var(l) is the variance of l. A negative lz indicates insufficient fit: the response pattern differs from the values predicted jointly by item difficulty and individual ability. Large negative values indicate a poor fit (Armstrong, Stoumbos, Kung, & Shi, 2007; Meijer, 2003). Since the conditional null distribution of lz is standard normal when the test is long enough, the critical value is generally set at -2 (Ferrando & Lorenzo, 2000).
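As a rough illustration of the standardization described above, the following Python sketch computes l, E(l), and Var(l) for one dichotomous response pattern and returns lz = (l - E(l)) / sqrt(Var(l)). It assumes a two-parameter logistic model with known item parameters and a known θ, and it uses the usual closed forms for the expectation and variance of l. The function name `lz_statistic` and the item values are hypothetical and are not taken from the studies cited.

```python
import math


def lz_statistic(responses, discriminations, difficulties, theta):
    """Standardized log-likelihood person-fit value for 0/1 responses.

    Assumption: a two-parameter logistic model with known item
    parameters and a known (or previously estimated) theta.
    """
    log_lik = 0.0    # l: log-likelihood of the observed pattern
    expected = 0.0   # E(l) under the model
    variance = 0.0   # Var(l) under the model
    for x, a, b in zip(responses, discriminations, difficulties):
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        q = 1.0 - p
        log_lik += x * math.log(p) + (1 - x) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    # Note: variance is zero if every item has p = 0.5; not handled here.
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical five-item test, items ordered from easy to difficult.
a = [1.0, 1.0, 1.0, 1.0, 1.0]
b = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = lz_statistic([1, 1, 1, 0, 0], a, b, theta=0.0)
reversed_pattern = lz_statistic([0, 0, 1, 1, 1], a, b, theta=0.0)
print(consistent, reversed_pattern)
```

With these hypothetical values, the model-consistent pattern yields a positive lz, while the reversed pattern falls below the conventional cutoff of -2 mentioned above. With only five items the normal approximation is of course rough; the example is meant only to show the arithmetic.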


Molenaar (1991), building on NIRT, suggested using the number of Guttman errors as the indicator of person-fit. The value is obtained by counting the pairs in which a more difficult item (or item step) was passed while an easier item (or item step) was failed. A large value indicates a poor person-fit (Emons, 2008). The indicator can be applied to dichotomous (denoted as G) and polytomous cases (denoted as G^p). In the dichotomous case, for example, suppose there are five items ordered from easy to difficult as A, B, C, D, and E. The response pattern of examinee J is 10101, where 1 is correct and 0 is incorrect. The number of Guttman errors is then three (the violating pairs are BC, BE, and DE). For examinee K, the response pattern is 00111, so G is six (the violating pairs are AC, AD, AE, BC, BD, and BE).
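The counting rule in this example can be sketched in a few lines of Python. The function name `guttman_errors` is hypothetical, and the sketch assumes that the 0/1 responses are already ordered from the easiest to the most difficult item.

```python
def guttman_errors(responses):
    """Count Guttman errors for 0/1 responses ordered easy to difficult.

    An error is a pair of items in which the easier item was failed (0)
    and the more difficult item was passed (1).
    """
    errors = 0
    failed_easier = 0  # number of easier items failed so far
    for x in responses:
        if x == 1:
            # Each earlier failure forms one error pair with this pass.
            errors += failed_easier
        else:
            failed_easier += 1
    return errors


print(guttman_errors([1, 0, 1, 0, 1]))  # examinee J: 3 (BC, BE, DE)
print(guttman_errors([0, 0, 1, 1, 1]))  # examinee K: 6
```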

For the polytomous scale, the calculation of the Guttman errors is based on all possible item-step pairs in which the easier step was failed and the more difficult step was passed. For example, suppose there are two items with three item steps each. We denote the item-step difficulties as π_{jx_j} and let π̂_{jx_j} be the sample estimate; π_{jx_j} represents the probability of obtaining at least score x_j on item j, that is, of passing step x_j. The three item-step difficulties of item 1 are π11 = .90, π12 = .50, and π13 = .20. Similarly, the three step difficulties of item 2 are set as π21 = .80, π22 = .70, and π23 = .30. The easiest of these six steps is step 1 of item 1, with a passing rate of 90%. The next is step 1 of item 2, with a passing rate of 80%. Step 2 of item 2 is the third easiest, with a passing rate of 70%. The most difficult is step 3 of item 1, with a passing rate of just 20%. Suppose a respondent scores 3 on item 1 and 1 on item 2. This means the respondent passed step 3 of item 1 but failed steps 2 and 3 of item 2, which are easier. In this case, all item-step pairs in which the easier step was failed and the more difficult step was passed are counted. This count is the number of Guttman errors for the polytomous scale (Emons, 2008).
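The two-item example can be worked through with a short Python sketch. It assumes that a score of x on an item means steps 1 through x were passed and the remaining steps were failed; the function name `guttman_errors_polytomous` and its argument layout are hypothetical.

```python
def guttman_errors_polytomous(step_pass_rates, scores):
    """Count Guttman errors over item steps.

    step_pass_rates[j][x - 1] is the passing rate of step x of item j;
    scores[j] is the respondent's score on item j (0 .. number of steps).
    Assumption: a score of x means steps 1..x were passed.
    """
    steps = []
    for rates, score in zip(step_pass_rates, scores):
        for x, rate in enumerate(rates, start=1):
            passed = 1 if score >= x else 0
            steps.append((rate, passed))
    # Order the item steps from easiest (highest passing rate) to most
    # difficult. Ties are left in their original order.
    steps.sort(key=lambda step: step[0], reverse=True)

    errors = 0
    failed_easier = 0
    for _, passed in steps:
        if passed:
            errors += failed_easier
        else:
            failed_easier += 1
    return errors


pass_rates = [[0.90, 0.50, 0.20],   # item 1
              [0.80, 0.70, 0.30]]   # item 2
print(guttman_errors_polytomous(pass_rates, [3, 1]))  # 3
```

For the respondent who scores 3 on item 1 and 1 on item 2, the ordered step scores are 1, 1, 0, 1, 0, 1, which gives three error pairs: the failed .70 step with the passed .50 and .20 steps, and the failed .30 step with the passed .20 step.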

The Guttman errors can be expressed as formula (7) for both dichotomous and polytomous tests. Let y denote the item-step score variable, scored 1 if the item step is passed and 0 otherwise, with the JM item steps indexed from easiest (h = 1) to most difficult (h = JM). When M in (7) equals 1, the statistic gives the Guttman errors for dichotomous items (Meijer & Sijtsma, 2001). However, the flaw of the Guttman error is that it depends on the total score.

G^p = \sum_{h=1}^{JM-1} \sum_{k=h+1}^{JM} (1 - y_h) \, y_k \quad (7)


To make values of G comparable, the statistic is normed by the maximum number of Guttman errors possible given the item-step difficulties and the total score. The normed Guttman error is denoted by G_N^p and is given in formula (8), where X+ represents the total score.

G_N^p = \frac{G^p}{\max(G^p \mid X_+)} \quad (8)
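For the dichotomous special case only, the norming in formula (8) can be sketched as follows. With J items and total score X+, the largest possible number of Guttman errors is reached when the X+ most difficult items are passed, which gives X+ (J - X+) error pairs. The function names are hypothetical, and returning 0 when no error is possible (X+ = 0 or X+ = J) is an assumption of this sketch; the polytomous maximum is more involved and is not covered here.

```python
def guttman_errors(responses):
    """Guttman errors for 0/1 responses ordered easy to difficult."""
    errors = 0
    failed_easier = 0
    for x in responses:
        if x == 1:
            errors += failed_easier
        else:
            failed_easier += 1
    return errors


def normed_guttman_errors(responses):
    """G divided by its maximum given the total score (dichotomous case)."""
    n_items = len(responses)
    total = sum(responses)
    max_errors = total * (n_items - total)
    if max_errors == 0:
        # All items passed or all failed: no error is possible.
        return 0.0
    return guttman_errors(responses) / max_errors


print(normed_guttman_errors([1, 0, 1, 0, 1]))  # 3 / 6 = 0.5
print(normed_guttman_errors([0, 0, 1, 1, 1]))  # 6 / 6 = 1.0
```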

The Mokken scaling program (MSP) and the mokken package in R can calculate the Guttman errors for each subject, which can be saved to a separate file for researchers to use (van Schuur, 2003; van der Ark, personal communication, November, 2011).

2.1.3 U3^p

Considering the flaw of the Guttman errors, van der Flier (1980) developed U3 as an alternative nonparametric person-fit statistic (Meijer, Molenaar, & Sijtsma, 1994). Conceptually, it is based on a log ratio that relates each examinee's response vector to the group's proportions of correct responses (St-Onge et al., 2011). U3 takes into account both the value and the order of the item difficulties. Consider a dichotomous scale with J items and item scores X_j (j = 1, …, J), where 1 represents a correct answer and 0 a wrong answer. The response pattern forms a random vector, denoted as X. The symbol X+ represents the unweighted total score, X_+ = \sum_{j=1}^{J} X_j.

Let π_j (j = 1, …, J) be the proportion of correct responses on item j in the population, and let its sample estimate be denoted as π̂_j. The π̂_j are assumed to be ordered as π̂_1 ≥ π̂_2 ≥ … ≥ π̂_J. Accordingly, the value of U3 is obtained by formula (9), where W(X) = \sum_{j=1}^{J} X_j \log[\hat{\pi}_j / (1 - \hat{\pi}_j)].

U3(X) = \frac{\sum_{j=1}^{X_+} \log\frac{\hat{\pi}_j}{1 - \hat{\pi}_j} - W(X)}{\sum_{j=1}^{X_+} \log\frac{\hat{\pi}_j}{1 - \hat{\pi}_j} - \sum_{j=J-X_++1}^{J} \log\frac{\hat{\pi}_j}{1 - \hat{\pi}_j}} \quad (9)

Under a fixed X+, all terms in formula (9) are constant except W(X). The value ranges from 0 to 1, where 0 is obtained if and only if the respondent's item-score pattern is a perfect Guttman pattern (Emons, 2002). A larger value indicates that the pattern deviates further from the perfect Guttman pattern.
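A minimal Python sketch of the dichotomous U3 is given below, using its usual definition: the weighted score W(X) is compared with its maximum (the X+ easiest items passed) and its minimum (the X+ most difficult items passed) for the same total score. The function name `u3` and the item proportions are hypothetical, and returning 0 for all-correct or all-wrong patterns, where the statistic is undefined, is an assumption of this sketch.

```python
import math


def u3(responses, proportions_correct):
    """U3 for 0/1 responses given the proportion correct of each item.

    Assumption: every proportion lies strictly between 0 and 1.
    """
    # Order items from easiest (highest proportion correct) to hardest.
    pairs = sorted(zip(proportions_correct, responses),
                   key=lambda pair: pair[0], reverse=True)
    logits = [math.log(p / (1.0 - p)) for p, _ in pairs]
    scores = [x for _, x in pairs]

    n_items = len(scores)
    total = sum(scores)
    w = sum(logit * x for logit, x in zip(logits, scores))
    w_max = sum(logits[:total])            # perfect Guttman pattern
    w_min = sum(logits[n_items - total:])  # fully reversed pattern
    if w_max == w_min:
        # Total score 0 or J: the ratio is undefined; return 0 here.
        return 0.0
    return (w_max - w) / (w_max - w_min)


props = [0.9, 0.8, 0.6, 0.4, 0.2]
print(u3([1, 1, 1, 0, 0], props))  # perfect Guttman pattern
print(u3([1, 0, 1, 0, 1], props))  # some misfit
print(u3([0, 0, 1, 1, 1], props))  # reversed Guttman pattern
```

The perfect Guttman pattern sits at the lower bound of 0 and the fully reversed pattern at the upper bound of 1, with the mixed pattern in between.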

U3 can be used for polytomous items as well. The statistic is defined as formula (11), which additionally involves the number of response categories, M. The value, denoted as U3^p, represents the sum of the log odds of the item-step difficulties of the steps that were passed. The normed W(X) is used to obtain the generalized value of U3^p, denoted

means no misfit, and 1 indicates extreme misfit. The max(W|X+) in formula (12) occurs when the X+ easiest item steps are passed. Formula (13) expresses the condition for max(W|X+) (Emons, 2008),

where M is the number of response categories, J is the number of items, and X+ is the total score.

2.2 Person-fit and faking

Response sets may appear in personality tests. A response set means that subjects tend to respond in a particular manner, such as central tendency, guessing, or being affected by social desirability (Guo, 1985). Among these causes of response sets, the tendency toward social desirability receives the most attention from test experts (Chen et al., 2004). Generally, faking is defined as responding in a socially desirable direction (Lai, 2010).

Two common methods are used to examine faking. The first uses forced-choice questions in the item arrangement. This method sets the level of social desirability of each alternative to be equal, so that examinees cannot be influenced by social desirability. The second uses a social desirability scale, which measures social desirability directly. Its items consist of two kinds of descriptions: good behaviors, as defined by the culture, that most people cannot achieve, and bad behaviors that most people engage in (Marlowe & Crowne, 1960). If the total score on this scale exceeds a given value, the examinee is judged to be faking, which implies that the other items of the test are also faked (Lai, 2010). However, forced-choice questions are difficult to construct. If the examinee displays both characteristics, being forced to choose only one may lead to an untrue answer instead. By comparison, the social desirability scale is easier and more convenient to implement (Chen et al., 2004). Nevertheless, the goal of both methods is to detect atypical response patterns and to eliminate them.

The concept behind the traditional methods for detecting faking is similar to the construct of person-fit. Person-fit statistics reflect whether responses vary as the model predicts when item difficulty increases, for example, whether the probability of endorsement decreases as the difficulty varies (LaHuis & Copeland, 2009). A difference between the values predicted by the model and the data may reflect a lack of motivation or faking in the test. Such a discrepancy affects the validity of the interpretation of the test and its results (Emons, 2008). Compared with the traditional methods, one advantage of using person-fit to detect faking is that we can explore how faking affects the measurement properties of the items and the scale. Prior studies focused on the implications of faking, such as exploring the difference in faking between employees and applicants (Lai, 2010) and determining how much faking is reduced when participants are placed in “easy to fake” and “not easy to fake” contexts (LaHuis & Copeland, 2009). In contrast, this study focuses on measurement properties.

A number of possible reasons for aberrant patterns have been discussed. They include cheating, creative responding, guessing, careless responding, and random responding (Meijer et al., 1994; Karabatsos, 2003), as well as carelessness and inattention, a tendency to choose extreme response options, and reversed scoring (Emons, 2008). Reise (2000) stated that person misfit can be investigated from two perspectives. The first perspective considers misfit as a systematic deviation from the normal response. From this point of view, misfit originates from measurement error, such as answering the wrong item or misunderstanding the item description. Carelessness and inattention belong to this category. The second perspective considers misfit as a systematic difference among groups; that is, different levels of misfit result from different levels of faking. From this viewpoint, researchers have to correct for the effect of faking to obtain the true ability.

LaHuis and Copeland (2009) extended the second perspective by proposing that the discrepancy between true ability and scores reflects faking and can be predicted by other variables, such as honesty and job desirability. In a particular context, the systematic bias in the response curve corresponds to spurious responses that arise because the individual is trying to achieve a higher score on certain items.

The causes of atypical responses can be discussed separately for achievement tests and psychological tests. In psychological tests, faking may have two possible causes. One is that it occurs unconsciously because the individual has adopted a false personality; the other is that the respondent fakes on purpose. Faking in psychological tests is defined as the second kind, which is also the target of the social desirability scale (Lai, 2010). The motivation to fake may show itself in several ways, such as choosing extreme response options and reversed scoring. These differ from the causes in achievement tests, in which atypical responses can be explained by cheating, creative responding, and guessing. As for random responding, techniques for detecting invalid surveys can be adopted. Therefore, it can be assumed and tested that atypical responses in psychological (personality) tests arise mainly from the motivation to fake, although there may be other minor factors.

Person-fit as an indicator of faking has been discussed in prior studies (Schmitt et al., 1999; Zickar & Drasgow, 1996). LaHuis and Copeland (2009) used multilevel logistic regression to generate person-fit values and to investigate their relationship with faking in simulated dichotomous and polytomous data. They found that person-fit is an effective predictor of the motivation to fake. However, their study used a parametric method with 1000 examinees and 20 items. In addition, the data were simulated from the graded response model (GRM) and did not show adequate model fit. Chernyshenko et al. (2001) suggested that a nonparametric model could achieve a better fit than the GRM in personality tests. Emons (2008) investigated the relationship between G^p and aberrant responses. That study was conducted under the nonparametric model with 1000 examinees. A small sample was not the focus of that study, so the advantage of the nonparametric model was not demonstrated. Zickar and Drasgow (1996) compared the effectiveness of person-fit and the social desirability scale for detecting faking and indicated that person-fit is more effective than the social desirability scale. A total of 48,725 participants were involved in that study, which used parametric person-fit.


The studies mentioned above show that using person-fit as a predictor is feasible both theoretically and practically. Prior studies explored this relation and compared person-fit with social desirability. However, these studies used parametric models and large sample sizes. The advantages of NIRT, such as its suitability for small samples and its distribution-free nature, have not been discussed. This
