
Scale Validity and Reliability

A series of processes were followed to assure the validity of the scale, including the content validity covering the biology curriculum, the construct validity of six biology domains, the validity study of items by experts, and the inspection of the infit and outfit of items in the model.

The adaptive test developed in this study is intended for junior secondary school students to assess their biology competency. Consequently, the biology curriculum issued by the Indonesian government (Ministry of National Education, 2006a) for junior secondary schools was used. The curriculum covers grades 7, 8, and 9 and forms the basis of the assessment framework of this study. Thus, the content of the scale is valid in terms of coverage and covers six biology domains, as mandated in the curriculum.

After establishing the content validity in terms of coverage, the items within the scales were constructed based on the official biology curriculum to measure students’ abilities in different biology domains. Biology teachers were involved in the item construction, and the construction process capitalized on the professional knowledge of the teachers who dealt with the subject’s teaching and learning processes and its assessment on a daily basis. Involving teachers in the item construction process ensured the alignment between item difficulty and student ability.

The validity of the test was further strengthened by consulting biology subject experts before the items were used for data collection. A team of subject experts from the Biology Department of a local university was formed to analyze the items and confirm that the items were valid to use for data collection.

Finally, the measurement validity was established by inspecting the infit and outfit statistics of the items in the proposed measurement model. ConQuest software (Adams et al., 2012) was used, and the analysis was run repeatedly with different settings, including the number of iterations and nodes, to find the model that fit the largest number of items within the scale. When most items are aligned with the model, the scale can be considered valid. A mean-square (MNSQ) infit and outfit within the range of 0.80 to 1.20 was used as the criterion for goodness of fit (OECD, 2012a). Four items (1, 16, 278, and 287) did not satisfy this criterion and were removed from the scale.

The distribution of the weighted MIRT items across the six biology domains is shown in Figure 9.
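As an illustration of the fit criterion, the following minimal sketch computes infit and outfit mean squares for one dichotomous item under a simple Rasch model and checks them against the 0.80 to 1.20 range. It uses the textbook residual-based formulas and is not the procedure that ConQuest implements; the function names, ability values, and responses are hypothetical.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one item.

    responses: 0/1 scores, one per person
    thetas:    person ability estimates (assumed known here)
    b:         item difficulty (assumed known here)
    """
    sum_sq_resid = 0.0  # sum of squared residuals
    sum_var = 0.0       # sum of model variances
    sum_z2 = 0.0        # sum of squared standardized residuals
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        var = p * (1.0 - p)
        sq_resid = (x - p) ** 2
        sum_sq_resid += sq_resid
        sum_var += var
        sum_z2 += sq_resid / var
    infit = sum_sq_resid / sum_var     # information-weighted mean square
    outfit = sum_z2 / len(responses)   # unweighted mean square
    return infit, outfit

def within_fit_range(mnsq, low=0.80, high=1.20):
    """Check a mean square against the 0.80 to 1.20 criterion."""
    return low <= mnsq <= high

# Hypothetical data: five persons answering one item of difficulty 0.0.
thetas = [-1.5, -0.5, 0.0, 0.5, 1.5]
responses = [0, 0, 1, 1, 1]
infit, outfit = item_fit(responses, thetas, b=0.0)
keep_item = within_fit_range(infit) and within_fit_range(outfit)
```

Under this rule, an item is retained only when both statistics fall inside the range.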

Figure 9 shows the Wright map (Wilson, 2011) of the biology domain scales: 1) Biology and Research scale; 2) Botany scale; 3) Zoology scale; 4) Human Being scale; 5) Anatomy Function scale; and 6) Ecosystem scale. The domain 3 (Zoology) scale contains the most difficult items, as an ability value of about 2.6 is required to answer its most difficult items correctly. In contrast, the domain 2 (Botany) scale has the easiest items, for which test takers need an ability value of -1.8 to answer the easiest items correctly. Based on this distribution, three patterns emerged, in which the item distributions for the domain 2 (Botany) and domain 5 (Anatomy Function) scales were much more concentrated; for domain 2, the range was from -1.8 to 0.8.

In general, the distribution of the items within these six scales is reasonably normal. Thus, all of the above processes support the validity of the MIRT scales.

Figure 9. Wright Map of the Biology MIRT Scales

The reliability of the scales was assessed by examining the covariance, correlation matrix, and variance of the MIRT scales; the item fit (MNSQ), item difficulty, and reliability; and the descriptive statistics of scale ability.

Table 9 shows the covariance, correlation matrix, and reliability values across the six MIRT scales. As shown in Table 9, the covariance of the MIRT scales ranges from 0.230 (in domain 5) to 0.611 (in domain 4). The correlations range from 0.856 to 0.956, indicating that the scales are strongly correlated.

Table 9

Covariance, Correlation Matrix, and Variance of the MIRT Scales

Scale 1 2 3 4 5 6

Note. Upper triangle is the covariance; lower triangle is the correlation matrix.
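The relationship between the two triangles of Table 9 can be sketched as follows: each correlation is the covariance divided by the product of the two scales' standard deviations. The function name and the 2 x 2 matrix below are hypothetical and are not values from Table 9.

```python
import math

def cov_to_corr(cov):
    """Convert a square variance-covariance matrix to a correlation matrix."""
    n = len(cov)
    # Standard deviations are the square roots of the diagonal variances.
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

# Hypothetical variance-covariance matrix for two scales (illustration only).
cov = [[0.50, 0.45],
       [0.45, 0.50]]
corr = cov_to_corr(cov)  # off-diagonal entries are the correlation
```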

Table 10 shows the item fit (MNSQ), item difficulty, and reliability of each scale. Adams and Khoo (1996) considered an MNSQ value between 0.75 and 1.33 to be a good fit, while PISA applied a narrower range of 0.8 to 1.2 (OECD, 2012a). As shown in Table 10, the MNSQs across the six biology domains fall within the narrower range, while the item difficulty ranges from a low of -2.87 (Botany) to a high of 2.60 (Biology and Research). The reliability was assessed using Rasch reliability (EAP/PV). The Rasch reliability across the six domains ranged from 0.807 to 0.866. According to these values, the scales are reliable (Adams & Khoo, 1996).
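For readers unfamiliar with EAP-based reliability, one common definition is the variance of the EAP ability estimates divided by the total latent variance, where the total is the EAP variance plus the mean posterior variance. The sketch below assumes that definition; the source does not state how ConQuest computed the reported values, and the function name and inputs are hypothetical.

```python
from statistics import fmean, pvariance

def eap_reliability(eap_estimates, posterior_variances):
    """Reliability as explained variance over total latent variance.

    eap_estimates:       EAP ability estimate for each person
    posterior_variances: posterior variance of each person's estimate
    """
    var_eap = pvariance(eap_estimates)          # variance explained by the estimates
    mean_post_var = fmean(posterior_variances)  # average measurement uncertainty
    return var_eap / (var_eap + mean_post_var)

# Hypothetical estimates for three persons (illustration only).
reliability = eap_reliability([-1.0, 0.0, 1.0], [0.2, 0.2, 0.2])
```

Smaller posterior variances relative to the spread of the estimates push the value toward 1.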

Table 10

Scale Psychometric: MNSQ Range, Item Difficulty Range, and Rasch Reliability

Scale                   MNSQ range     Item difficulty    Reliability
                        (Min; Max)     (Min; Max)
Biology and Research    0.86; 1.15     -2.48; 2.60        0.846
Botany                  0.88; 1.12     -2.87; 2.34        0.807
Zoology                 0.80; 1.14     -1.98; 2.05        0.829
Human Being             0.84; 1.20     -1.61; 2.56        0.866
Anatomy Function        0.91; 1.12     -1.88; 1.93        0.807
Ecosystem               0.85; 1.19     -2.04; 1.95        0.852

Table 11 shows descriptive statistics of the scale ability across the six biology domains. The table comprises the minimum and maximum ability, mean, standard deviation, skewness, and kurtosis statistics. Both the lowest and the highest ability were in the Zoology domain. The mean ranged from -0.545 to 0.612, while the SD ranged from 0.454 to 0.811. Skewness and kurtosis values between -2 and 2 were taken to indicate an approximately normal distribution. As shown in Table 11, the skewness and kurtosis of ability across the six domains fall within this range, which indicates that the scales are approximately normally distributed.
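The normality check described above can be sketched as follows. The example uses plain moment-based skewness and excess kurtosis; the source does not say which estimator was used, so the formulas, function names, and data are assumptions for illustration only.

```python
from statistics import fmean

def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurt = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurt

def roughly_normal(skew, excess_kurt, bound=2.0):
    """Apply the -2 to 2 rule of thumb to both statistics."""
    return abs(skew) <= bound and abs(excess_kurt) <= bound

# Hypothetical ability estimates for one domain (illustration only).
abilities = [-1.2, -0.4, -0.1, 0.0, 0.3, 0.5, 1.1]
skew, kurt = skewness_and_excess_kurtosis(abilities)
is_roughly_normal = roughly_normal(skew, kurt)
```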

Table 11

Descriptive Statistics of Person Ability in the Domains

Domain compared. Figure 10 shows the RMSEs of MLE across the six biology domains. Details of the RMSE of MLE are given in Appendix B. Overall, the RMSEs of MLE fluctuated strongly early in the test for all domains, i.e., when 15 or fewer items had been administered. The RMSEs of MLE for all domains gradually decreased and stabilized as more items were administered. For all domains, the RMSE began at around 0.6, jumped to around 3.05 for the second item, and declined to around 1.5 for the third item. The instability
