MCAT Procedures and Performance Evaluation

Setting up an adaptive test involves four steps: initialization (how the test starts), item selection (how items are administered to test takers), ability estimation (how the proficiency of test takers is estimated), and a stopping rule (how the test ends) (van der Linden & Glas, 2002; Wainer, 2000).

In the initialization step, most CAT practices assume an initial ability estimate of 0 for the test taker (Thomson & Weiss, 2012). Thomson and Weiss (2012) suggested using a randomization technique instead, such as drawing the starting point from a range of -0.5 to +0.5 on the item difficulty scale. Another suggestion is to use background information about the students, which can be gathered from classroom tests or pre-tests conducted prior to CAT initialization (Kustiyahningsih & Cahyani, 2013).

MCAT largely follows these existing procedures: the initial ability (θ) estimate is assumed to be 0. This initial ability estimate has been used in MCAT systems such as the Chinese proficiency test (Wang et al., 2011). The mean of the prior distribution of ability is set at µ = (0, 0), and the standard deviation (SD) is set at 1 (Segall, 1996; van der Linden & Glas, 2002; Wainer, 2000).
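As a minimal sketch of the starting options described above, the Python snippet below returns either a fixed start at 0 or a random start drawn from -0.5 to +0.5, together with a prior with mean 0 and standard deviation 1 on each dimension. The function names `initial_theta` and `default_prior`, their parameters, the use of a uniform draw on the ability scale, and the uncorrelated prior are assumptions made for illustration; they are not taken from the cited sources.

```python
import numpy as np

def initial_theta(n_dims, rng=None, half_width=0.5):
    """Return the starting ability vector.

    With no random generator, every test taker starts at 0 on each dimension.
    With a generator, the start is drawn uniformly from [-half_width, half_width]
    (an assumed reading of the randomization suggestion).
    """
    if rng is None:
        return np.zeros(n_dims)
    return rng.uniform(-half_width, half_width, size=n_dims)

def default_prior(n_dims):
    """Prior mean of 0 and SD of 1 per dimension (dimensions assumed uncorrelated)."""
    return np.zeros(n_dims), np.eye(n_dims)
```

A randomized start means that test takers do not all receive the same first item, at the cost of a slightly noisier first step.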

In the item selection procedure, Segall (1996) suggested that items can be administered on the basis of item information. The provisional ability estimate obtained after the response to the j-th item is used to evaluate the item information function.

Continuing this line of work, van der Linden and Glas (2002) suggested that item selection should rely on a maximum information criterion and Owen’s approximate Bayes procedure. Reckase (2009) added that, with this procedure, CAT estimates the location of the examinee’s ability in a coordinate system.

Segall (1996) suggested that Maximum Likelihood Estimation (MLE) should be used to produce a conditional distribution of ability estimates in combination with the Fisher information matrix. Moreover, Wang et al. (2011) stated that two methods widely used in MCAT are maximum Fisher information and maximum expected precision information.

Based on the information gathered by these methods, the next item is administered to the test taker. This suggests that the maximum Fisher information is applicable in most CAT item selections.
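The selection step can be illustrated with a short Python sketch. It assumes a compensatory multidimensional two-parameter logistic (M2PL) model and picks the unused item that maximizes the determinant of the accumulated information matrix plus the inverse prior covariance. The model choice, the determinant criterion, and all names (`m2pl_prob`, `item_information`, `select_next_item`) are assumptions for illustration, not the exact procedure of any cited study.

```python
import numpy as np

def m2pl_prob(theta, a, d):
    """Probability of a correct response under a compensatory M2PL model."""
    return 1.0 / (1.0 + np.exp(-(a @ theta + d)))

def item_information(theta, a, d):
    """Fisher information matrix of one item at ability vector theta."""
    p = m2pl_prob(theta, a, d)
    return p * (1.0 - p) * np.outer(a, a)

def select_next_item(theta_hat, A, D, administered, prior_cov):
    """Return the index of the unused item with the largest determinant criterion.

    A is an (n_items, n_dims) array of discriminations, D an (n_items,) array of
    intercepts, and administered a collection of item indices already used.
    """
    # Information accumulated so far, evaluated at the provisional estimate.
    base = np.linalg.inv(prior_cov)
    for j in administered:
        base = base + item_information(theta_hat, A[j], D[j])

    best_item, best_det = None, -np.inf
    for j in range(len(D)):
        if j in administered:
            continue
        det = np.linalg.det(base + item_information(theta_hat, A[j], D[j]))
        if det > best_det:
            best_item, best_det = j, det
    return best_item
```

With a two-dimensional bank in which one item on the first dimension has already been given, the criterion tends to favor an item that loads on the second dimension, since that is where the information matrix is weakest.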

For ability estimation, maximum likelihood estimation (MLE) has been used in CAT (van der Linden & Glas, 2002); however, its weakness is that it requires a mixed pattern of right and wrong answers to produce a precise estimate (Thomson & Weiss, 2012). As a result, MLE is combined with Bayesian expected a posteriori (EAP) estimation, especially when only a small number of items is administered. Maximum a posteriori (MAP) estimation is much more appropriate for a large number of items (Babcock & Weiss, 2012; Reckase, 2009). Chen (2006) added that MAP is appropriate for MCAT systems that test a larger number of domains. MCAT also adopts these existing MLE, EAP, and MAP procedures for its ability estimation. An example of an MCAT that uses a MAP estimator is the MCAT for the Chinese proficiency test (Wang et al., 2011).
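To make the difference between the estimators concrete, the following Python sketch computes EAP and MAP estimates on a grid for a two-dimensional test. It assumes a compensatory M2PL model and an uncorrelated standard normal prior; the grid approach and the names `grid_posterior`, `eap_estimate`, and `map_estimate` are illustrative assumptions rather than the method of any cited system.

```python
import numpy as np

def grid_posterior(responses, A, D, n_points=41, bound=4.0):
    """Return grid points and normalized posterior weights for two dimensions.

    responses is a 0/1 vector, A an (n_items, 2) array of discriminations,
    and D an (n_items,) array of intercepts.
    """
    axis = np.linspace(-bound, bound, n_points)
    g1, g2 = np.meshgrid(axis, axis, indexing="ij")
    points = np.column_stack([g1.ravel(), g2.ravel()])      # (n_grid, 2)

    p = 1.0 / (1.0 + np.exp(-(points @ A.T + D)))           # (n_grid, n_items)
    u = np.asarray(responses, dtype=float)
    log_lik = (u * np.log(p) + (1.0 - u) * np.log(1.0 - p)).sum(axis=1)
    log_prior = -0.5 * (points ** 2).sum(axis=1)             # N(0, I) up to a constant

    log_post = log_lik + log_prior
    weights = np.exp(log_post - log_post.max())              # subtract max for stability
    return points, weights / weights.sum()

def eap_estimate(points, weights):
    """Posterior mean of ability."""
    return weights @ points

def map_estimate(points, weights):
    """Grid point with the highest posterior weight."""
    return points[np.argmax(weights)]
```

For a response pattern with only correct answers, the likelihood alone has no finite maximum, whereas both estimates above stay finite because the prior pulls them toward 0.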

For the stopping rule, there are three types: the SE criterion, the test length, and the test time (van der Linden & Glas, 2002; Wainer, 2000; Wang et al., 2012). For the SE criterion, a simulation study determines what error level is acceptable for the system, and the chosen criterion affects the number of items administered (van der Linden & Glas, 2002; Wainer, 2000). Test length is divided into two types: fixed and varying. A fixed-length test administers a specified number of items to the test taker (Reckase, 2009; Yao, 2012). In contrast, a varying-length test stops when the desired precision or confidence level has been reached, so the test length differs from test taker to test taker (Wainer, 2000; Yao, 2012). The test time rule runs the test for a set period of time; when the time is over, the test ends. In addition, Chae et al. (2000) addressed other CAT stopping rules: the item bank is exhausted, the ability measure is far enough from the pass-fail criterion, or the test taker is exhibiting off-test behavior. Early CAT applications used a varying test length because it was in line with the intention to produce better precision for test taker ability estimates (Babcock & Weiss, 2012; Wainer, 2000; Yao, 2012); however, in some practices this prolonged the test time because the test taker was assumed to be a high performer. A longer test also affected item exposure, because it left fewer unused items in the item bank. As a result, scholars were more inclined to use a fixed test length as a stopping rule (Yao, 2012). After all procedures have been completed, the performance of the adaptive test must be evaluated to determine whether or not the system functions effectively.
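The three main stopping rules can be combined in a single check, as in the Python sketch below. The function name `stop_reason` and the threshold values (an SE target of 0.3, a maximum of 30 items, and a one-hour limit) are hypothetical placeholders chosen only for illustration.

```python
def stop_reason(se_per_dim, n_administered, elapsed_seconds,
                se_target=0.3, max_items=30, time_limit_seconds=3600.0):
    """Return which stopping rule fired, or None if the test should continue.

    se_per_dim holds the current standard error for each ability dimension.
    All default thresholds are illustrative assumptions.
    """
    if all(se <= se_target for se in se_per_dim):
        return "se"          # precision criterion reached (varying length)
    if n_administered >= max_items:
        return "length"      # fixed or maximum test length reached
    if elapsed_seconds >= time_limit_seconds:
        return "time"        # time limit reached
    return None
```

Setting `se_target` to 0 turns this into a pure fixed-length test, since the precision rule can then never fire before `max_items` is reached.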

Figure 2 shows the MCAT procedure used in Wang et al. (2011, p. 247). The procedure begins with an initial estimate of the participant’s ability. Then, the system calibrates and estimates the participant’s ability as soon as the response to the administered item is received. The CAT system then selects the next item, the one closest to the participant’s ability, for the participant to answer. The test ends when the stopping criteria have been met. Otherwise, the process of administering items, calibration, and ability estimation continues.

Figure 2. MCAT Process
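The loop in this procedure can be sketched in Python for a simulated test taker. To keep the example short, it is deliberately simplified to one dimension with a 2PL model, EAP estimation on a grid, maximum-information selection, and a combined SE and length stopping rule. These simplifications, the function names, and the default thresholds are assumptions for illustration and do not reproduce the system of Wang et al. (2011).

```python
import numpy as np

def prob_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def run_cat(true_theta, a, b, rng, se_target=0.35, max_items=20):
    """Simulate one adaptive test; return (theta_hat, se, administered items)."""
    grid = np.linspace(-4.0, 4.0, 81)
    log_post = -0.5 * grid ** 2                 # N(0, 1) prior on the grid
    theta_hat, se = 0.0, 1.0                    # start at 0 with prior SD 1
    available = np.ones(len(a), dtype=bool)
    administered = []

    while se > se_target and len(administered) < max_items and available.any():
        # Select the unused item with maximum information at the current estimate.
        p_hat = prob_2pl(theta_hat, a, b)
        info = np.where(available, a ** 2 * p_hat * (1.0 - p_hat), -np.inf)
        j = int(np.argmax(info))

        # Simulate the response from the true ability.
        correct = rng.random() < prob_2pl(true_theta, a[j], b[j])

        # Update the posterior, then the EAP estimate and its SE.
        p_grid = prob_2pl(grid, a[j], b[j])
        log_post = log_post + np.log(p_grid if correct else 1.0 - p_grid)
        weights = np.exp(log_post - log_post.max())
        weights /= weights.sum()
        theta_hat = float(weights @ grid)
        se = float(np.sqrt(weights @ (grid - theta_hat) ** 2))

        available[j] = False
        administered.append(j)

    return theta_hat, se, administered
```

Calling `run_cat(1.0, a, b, np.random.default_rng(0))` with arrays `a` and `b` of item discriminations and difficulties returns the final estimate, its SE, and the order in which items were given.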

Most CAT evaluations focus on the precision of ability estimation, item selection procedures, and stopping rules (Reckase, 2009; Wainer, 2000; Wang et al., 2011; Wang, Chang, & Boughton, 2012; Yao, 2012). This evaluation ensures that the system measures ability accurately and reduces errors in precision. To achieve this, some scholars use data generated from simulations, while others use real data. Some use both real and generated data to observe how performance differs across simulations. This approach allows scholars to test the system under various conditions.

For ability estimation, scholars have used different methods, including maximum likelihood estimation (MLE), expected a posteriori (EAP) estimation, weighted likelihood estimation (WLE), and MAP (Reckase, 2009; Wainer, 2000). In addition, the root mean squared error (RMSE), bias, and standard error (SE) are three important indices examined to ensure that the system has fewer errors (Reckase, 2009; Wainer, 2000). In a previous study, Chen (2006) found that the RMSE of MAP is better than that of MLE and suggested that adaptive tests with multidimensional domains tend to obtain better estimates, in terms of RMSE, from MAP than from MLE. Wang et al. (2011) used MLE, EAP, and MAP to determine their MCAT ability estimation performance. The results indicated that MAP was also better for their system than MLE and EAP. Moreover, some scholars also examined bias and SE indices.

Babcock and Weiss (2012) suggested a bias criterion of 0.16, with the SE falling between 0.315 and 0.385. Wang, Hanson, and Lau (1999) suggested a lower bias criterion of below 0.01. The SE is used to assess the performance of all item information and fluctuations across the administered items. The formulas are shown below.
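As a hedged illustration of how bias and RMSE are commonly computed in a simulation study, the Python sketch below compares estimated abilities with the true abilities used to generate the data. It uses the standard definitions (mean error and root mean squared error per dimension); the function names are assumptions, and the definitions may differ in detail from those in the cited studies.

```python
import numpy as np

def bias(theta_hat, theta_true):
    """Mean of (estimate - true value) over test takers, per dimension."""
    diff = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return diff.mean(axis=0)

def rmse(theta_hat, theta_true):
    """Root mean squared error over test takers, per dimension."""
    diff = np.asarray(theta_hat, dtype=float) - np.asarray(theta_true, dtype=float)
    return np.sqrt((diff ** 2).mean(axis=0))
```

Rows are test takers and columns are ability dimensions, so each function returns one value per dimension that can be checked against a chosen criterion.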
