Chapter 2. Literature Review

2.1 Statistical analysis: Virtues of Rasch modelling over CTT and IRT

1 Classical Test Theory

Classical test theory (CTT) is the body of classical psychometric theory that aims to understand and improve the reliability of psychological tests; it stands in contrast to the more modern psychometric theories collectively known as Item Response Theory (IRT). For decades, CTT has been the mainstay of measurement in disciplines ranging from psychology to economics to education. Having evolved since Binet created his intelligence test in the early 1900s, CTT is regarded as a simple, robust model (Coaley, 2009). Based on Novick’s (1966) foundational formulation, CTT predicts the outcomes of these tests, such as the difficulty of items or the ability of test-takers. Mathematically, the theory is grounded in the idea that a person’s observed (or obtained) score on a test is the sum of a true score (an error-free score) and an error score. The relationship between these three elements is often formulated as:

Observed Score = True Score + Error; or

X = T + E.

CTT can therefore be viewed as true score theory, where a person’s true score (T) is a hypothetical quantity: the average score that would be obtained if the person were to complete the same test an infinite number of times. The main concern is to quantify the random error (E) and, in test creation, to minimize it so that the observed score (X) approaches the true score.
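The decomposition X = T + E can be illustrated with a minimal simulation. The sketch below is only an illustration under stated assumptions: the true score of 50, the error standard deviation of 5, and the normally distributed error are hypothetical choices (CTT itself only requires that the error be random with a mean of zero), and the function name is invented for this example.

```python
import random
import statistics

def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Draw observed scores X = T + E, with E assumed Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_administrations)]

# Hypothetical values chosen only for illustration.
scores = simulate_observed_scores(true_score=50.0, error_sd=5.0, n_administrations=10_000)

# Averaging over many administrations cancels most of the random error,
# so the mean observed score lies close to the true score of 50.
mean_observed = statistics.fmean(scores)
print(round(mean_observed, 2))
```

Any single observed score may lie several points away from 50, which is why a single administration is treated as an error-laden estimate of T.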

According to Traub’s (1997) historical analysis, CTT was the result of an evolution of three concepts: (1) the recognition that errors are intrinsic to measurements, (2) the realization that this error is a random variable, and (3) the conception of correlation and how to measure it. Charles Spearman in 1904 started the evolution by figuring out how to correct a correlation coefficient for attenuation due to measurement error and how to obtain the index of reliability needed in making the correction (Traub, 1997).

In test development and validation, items require analysis. CTT item analysis is most commonly achieved using descriptive statistics and involves calculating the item mean and item variability. In this framework, more effective items have both higher variability and item means closer to the center of the distribution of the item scores.

CTT can investigate an item by analyzing its distractors, difficulty, discrimination, and total correlations (Coaley, 2009, pp. 35-40).

Distractor analysis evaluates and compares how frequently each answer option is selected. Ideally, the distractor options are chosen more or less equally by the test-takers who answered the item incorrectly. Difficulty analysis produces a difficulty indicator, or p value, which represents the proportion of test-takers who answered the item correctly; it is calculated by simply dividing the number of people who answered the item correctly by the total number of people who answered it. A high p value approaching 1 indicates that most people got the item correct, suggesting that the item is too easy; conversely, a p value close to 0 suggests that the item is too difficult. A p value of 0.5 indicates moderate difficulty, and items at this level are better able to discriminate among test-takers. A well-balanced test will have items representing a range of difficulties (0.2-0.8), but their mean p value should be close to 0.5. Difficulty analysis can also be extended to determine whether an item exhibits bias towards any group of test-takers; this is done by comparing the groups’ total correct scores on the item.
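The difficulty and distractor calculations described above can be sketched as follows. The response data, the keyed answer, and the function names are hypothetical and serve only to illustrate the arithmetic.

```python
from collections import Counter

# Hypothetical responses of ten test-takers to one four-option item.
responses = ["B", "A", "B", "C", "B", "D", "B", "B", "A", "C"]
key = "B"  # assumed keyed (correct) answer

def item_difficulty(responses, key):
    """p value: proportion of test-takers who chose the keyed answer."""
    return sum(r == key for r in responses) / len(responses)

def distractor_counts(responses, key):
    """Frequency with which each incorrect option was chosen."""
    return Counter(r for r in responses if r != key)

p = item_difficulty(responses, key)            # 5 correct out of 10 -> 0.5
distractors = distractor_counts(responses, key)

# Flag whether the item falls inside the 0.2-0.8 range mentioned in the text.
within_range = 0.2 <= p <= 0.8

print(p, dict(distractors), within_range)
```

In this made-up example the three distractors are chosen 2, 2, and 1 times, which is close to the roughly even spread described as ideal.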

Discrimination analysis determines whether the response on one item is related to the responses on all of the others. This can help identify which items are effectively measuring the trait (or dimension) under investigation. People who score well on the test overall are expected to be more likely to answer an item correctly, while lower scorers should be less likely to do so. If, however, lower scorers answer an item correctly more often than higher scorers (negative discrimination) or just as often (zero discrimination), this is a red flag suggesting that the item is measuring a different trait or dimension. Item discrimination is often calculated by comparing the top and bottom 27% of the distribution of scores.

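Specifically, discrimination, or d, is found by subtracting the proportion of people getting the item right in the low group (Pl/Nl) from that in the high group (Ph/Nh), where P is the number of people in a group who answered the item correctly and N is the size of that group: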

d = Ph/Nh - Pl/Nl.

Items that discriminate well are easier for the higher group, and thus have large, positive values of d. Items with a negative value are easier for lower scorers and should be removed.
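A minimal sketch of the discrimination index, d = Ph/Nh - Pl/Nl, is given below. The data and the function name are hypothetical; rounding the group size to the nearest whole person and breaking ties in total score by list order are simplifying assumptions of this sketch rather than part of the method.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """d = Ph/Nh - Pl/Nl, using the top and bottom `fraction` of total scores."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank test-takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:group_size], order[-group_size:]
    p_high = sum(item_scores[i] for i in high) / len(high)
    p_low = sum(item_scores[i] for i in low) / len(low)
    return p_high - p_low

# Hypothetical data: ten test-takers, one item scored 0 (wrong) or 1 (right).
total_scores = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
item_scores = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

# With 10 people, 27% rounds to groups of 3: high group 3/3, low group 1/3.
d = discrimination_index(item_scores, total_scores)

# An item answered correctly mainly by low scorers yields a negative d.
d_reversed = discrimination_index(item_scores[::-1], total_scores)

print(round(d, 3), round(d_reversed, 3))
```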

Total correlation analysis is another method of evaluating the discriminability of an item; it involves determining the correlation between scores on the item and the total score on the measure. Items with high positive item-total correlations are more clearly related to the trait or dimension being measured. These items exhibit more variability than those with lower correlations, which indicates a better ability to discriminate between high and low levels of the trait. A negative value suggests that an item is negatively related to the other items on the measure, and that it is measuring a different trait.
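Item-total correlations can be sketched with a plain Pearson correlation, as below. The response matrix and function names are hypothetical. Note that this simple version leaves each item inside the total score it is correlated with; the common variant that excludes the item from the total is not shown here.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def item_total_correlations(matrix):
    """Rows are test-takers, columns are items scored 0/1."""
    totals = [sum(row) for row in matrix]
    n_items = len(matrix[0])
    return [pearson([row[j] for row in matrix], totals) for j in range(n_items)]

# Hypothetical responses: six test-takers, three items.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]

r = item_total_correlations(matrix)
# The first item tracks the total score closely; the third is negatively
# related to it, the pattern the text treats as a warning sign.
print([round(v, 2) for v in r])
```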

Although CTT was the dominant mode of analysis in the social sciences for decades and still remains very widely used, Hambleton, Swaminathan, and Rogers (1991, pp. 4-5) pointed out almost 30 years ago that there are four important weaknesses of CTT.

The first involves its definition of reliability as “the correlation between test scores on parallel forms of a test,” which is problematic because there is no consensus as to what parallel tests are. Another issue is related to its conception of standard error, which is assumed to be the same for all test-takers. Unfortunately, this assumption is difficult to accept given that scores on any test are unequally precise measures for examinees of different ability. The third shortcoming of CTT that Hambleton and colleagues identified is that examinee characteristics and test characteristics cannot be separated, which means that they can only be interpreted with reference to each other. Finally, CTT is test-oriented, i.e., based on the sum of all items. Since this approach is not oriented to individual items, CTT is unable to make predictions on how well a test-taker or even group of test-takers might do on a particular test item.
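The second weakness can be made concrete with the standard CTT formula for the standard error of measurement, SEM = SD x sqrt(1 - reliability), which is not stated in the passage above and is brought in here only for illustration. The scores, the reliability of 0.84, and the use of the population standard deviation are all hypothetical assumptions of this sketch.

```python
import math
import statistics

def standard_error_of_measurement(observed_scores, reliability):
    """SEM = SD * sqrt(1 - reliability), the usual CTT formula."""
    sd = statistics.pstdev(observed_scores)  # population SD, an assumption
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical observed scores and reliability coefficient.
scores = [12, 15, 18, 20, 22, 25, 28]
sem = standard_error_of_measurement(scores, reliability=0.84)

# The same +/- 1 SEM band is attached to every examinee, whatever the score.
bands = [(x - sem, x + sem) for x in scores]
print(round(sem, 2))
```

Because the band width is identical for the lowest and the highest scorer, the calculation offers no way to express that a test may measure some ability levels more precisely than others.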
