Effects of Block-Review and Rearrangement Computerized Adaptive Test (BRR-CAT) on Ability Estimation and Test Anxiety


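National Taiwan Normal University, Graduate Institute of Information and Computer Education
Doctoral Dissertation
Advisor: Dr. 何榮桂
Effects of Block-Review and Rearrangement Computerized Adaptive Test on Ability Estimation and Test Anxiety
Graduate student: 陳麗如
June 2009 (Republic of China year 98)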

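Abstract (translated from the Chinese original)

In a conventional computerized adaptive test (CAT), a block-review (BR) mechanism gives examinees the chance to correct misread questions, input errors, and calculation errors. This helps examinees draw on their partial knowledge and perform at their best. It also satisfies their demand for opportunities to review and change answers, and eases the frustration and anxiety caused by being unable to do so. However, after examinees change answers, the altered response pattern may become unreasonable, which in turn affects the re-estimation of ability. Is it feasible to apply a rearrangement procedure to the changed responses so that they form a reasonable response pattern? Does the rearranged response pattern increase the precision of ability re-estimation? Both are open problems for BR-CAT. This study proposes incorporating a response rearrangement procedure into BR-CAT, examines its feasibility, and investigates the effects of the block-review and rearrangement CAT (BRR-CAT) on the precision and efficiency of ability estimation and on test anxiety. Phase 1 was a simulation experiment comparing BRR-CAT, BR-CAT, and conventional CAT in terms of ability estimates, standard error of measurement, and test length. The results showed that, compared with conventional CAT, the ability estimates produced by the BRR-CAT and BR-CAT algorithms were closer to the true ability of examinees of middle and high ability, while the standard error (SE) did not differ significantly. In addition, when the stopping criterion was relatively lenient (e.g., SE ≤ .35), the test lengths of BRR-CAT and BR-CAT did not differ significantly from that of CAT. Phase 2 was an empirical experiment. Based on the simulation results, online testing systems for BRR-CAT, BR-CAT, and CAT were built, and response and test-anxiety data were collected from real examinees in order to compare the three methods on ability estimation and testing time and to examine examinees' test-anxiety responses. The analysis showed that the ability estimates and SE obtained with BRR-CAT did not differ significantly from those of CAT and BR-CAT, and that students reported lower worry and tension when taking BRR-CAT and BR-CAT.

Keywords: item response theory, computerized adaptive testing, block review, rearrangement procedure, test anxiety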

Abstract

The block-review computerized adaptive test (BR-CAT) offers a review-and-change mechanism that gives examinees the opportunity to check answers and correct key-in errors or mistakes after completing a block of items. It also helps examinees mitigate their test anxiety. However, changing answers can yield an unsuitable estimate in ability re-estimation if the new response pattern is unreasonable. Is it practicable to incorporate a rearrangement procedure into BR-CAT so as to arrange the new response pattern in a reasonable order and improve the precision of ability re-estimation? This study seeks a solution to this problem. The BRR-CAT algorithm, which incorporates the rearrangement procedure into BR-CAT, was designed. A two-phase experiment was carried out to investigate the precision and efficiency of BRR-CAT in ability estimation and the effect of BRR-CAT on examinees' test anxiety. In Phase 1 (the simulated experiment), compared with CAT, the estimates of BRR-CAT were closer to examinees' true ability, with an equal standard error (SE), for examinees of middle and high ability. Moreover, the precision of BRR-CAT and BR-CAT was equal. The efficiency of BRR-CAT did not decrease when a moderate SE (SE ≤ .35) was set as the stopping criterion. In Phase 2 (an empirical experiment), three versions of the testing system were implemented, and the response patterns and test anxiety records of 112 participants were collected. The analysis showed that the ability estimates and SE of BRR-CAT were equal to those of BR-CAT and CAT. Additionally, compared with CAT, participants reported lower worry and tenseness in a reviewable CAT environment such as BRR-CAT and BR-CAT.

Keywords: item response theory (IRT), computerized adaptive testing (CAT), block review, rearrangement procedure, test anxiety

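Acknowledgements (translated from the Chinese original)

"A moment of impulse." At a conference, a professor took back his business card upon learning that I was not a doctoral student, and I decided then and there to resume life as a student. On impulse I squeezed through the narrow gate of the entrance examination, quite unaware that I was entering the doctoral program of the Graduate Institute of Information and Computer Education, the program with the highest dropout rate at National Taiwan Normal University. Along the long road of study, the trials that came at me (the qualifying examination, TOEFL, coursework, SSCI journal submission, the research proposal, the experiments, and the writing of the dissertation) crashed against my confidence like stormy waves. Fortunately, with the attentive care, support, and encouragement of my advisor, Professor 何榮桂, I came through each difficulty safely. Looking back, more than three thousand days filled with pressure, laughter, and sweat have flowed away in an instant, yet the kindness of my teachers and the friendship of my classmates remain engraved in my heart.

"A lifetime of gratitude." For the completion of this research I first thank my advisor, Dean 何榮桂, for his inspiration and guidance throughout. Like a 7-11, he worked nonstop every day on heavy college administrative duties, yet regularly asked after the progress of the research. From settling the topic, collecting literature, and building the research framework to analyzing the data and drawing conclusions and suggestions, his patient teaching, careful corrections, and encouragement let me adjust my thinking and finish the dissertation on schedule. I offer him my sincerest gratitude. The committee members for the proposal and final oral defenses, Professors 孫永年, 孫春在, 吳正己, 邱瓊慧, 鄭海蓮, and 蔡清欉, took time from their busy schedules to review the work, corrected it carefully, pointed out omissions, and offered many valuable suggestions. These not only made the revised dissertation more complete and substantial; their different viewpoints also gave me more confidence in pursuing follow-up research. I thank them sincerely.

Next, I thank my fellow advisees 永進 and 文偉 for their great help in building the online adaptive testing system, and teacher 凌倩 of Xindian Senior High School, Taipei, and teacher 慧君 of Nangang Senior High School for assisting with scale data collection and online test administration, which allowed the experiment to be completed smoothly. I also thank the students who took part in the experiment; this dissertation came into being because of your earnest participation. Cooperative learning and warm friendship were also a driving force behind the research. My classmates and good friends 芳苓, 綿緣, and 瓊芳 supported and encouraged me through the qualifying examination, TOEFL, and the submission process, enabling me to meet the demanding program requirements. 信雄, 聖峰, 理薰, 弘川, 月妹, 紫珊, 瑞敏, 育榕, 國軒, 佩樺, 金粟, 康靈, and the senior students cheered me on at every moment, letting me forget the fatigue of research and work, keep correcting my mistakes, and improve day by day. During my studies, my colleagues Principal 坤明, Principal 素鄉, Principal 春光, Director 佳燕, Director 世民, 志坤, 美琴, 姿利, 瑞蘭, 品秀, 巨平, 曉芳, 文泳, 寶貴, 雅美, 明瑩, 梅芳, and Ms. 李錦 gave me the greatest support in school administration so that I could concentrate on completing my studies without worry; I extend my heartfelt thanks to them all. Finally, I thank my parents for years of nurturing and quiet support, and my grandmother, my younger sisters 盈秀, 慧娟, and 佩君, and my younger brother 孟宏 for their encouragement and support. Thank you to the many friends, benefactors, and family members who accompanied me along the way and kept me learning and growing.

"A lifelong regret." My maternal grandmother, who loved me most, passed away at 8:00 p.m. on August 26, 2006. The promise I had made, that she would have her photograph taken wearing a doctoral gown, can never be fulfilled. Before the dissertation was finished I always used the busyness of study and work as an excuse; not only did I seldom keep her company, I also failed to see her one last time. I dedicate this dissertation to my dearest grandmother.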

Table of Contents

List of Tables
List of Figures
Chapter One Introduction
  1.1 Background and Motivation
  1.2 Purpose
  1.3 Scope and Limitation
Chapter Two Literature Review
  2.1 Item Response Theory (IRT)
    2.1.1 Assumptions of IRT
    2.1.2 Item characteristic function
    2.1.3 Item parameter estimation
    2.1.4 Examinee ability estimation
    2.1.5 Item selection strategies
  2.2 Computerized Adaptive Testing (CAT)
    2.2.1 Starting
    2.2.2 Continuing
    2.2.3 Stopping
  2.3 Reviewable CAT
    2.3.1 Cognitive factor: partial knowledge
    2.3.2 Psychological factor: test anxiety
    2.3.3 Algorithms of reviewable CAT
  2.4 Test Anxiety
    2.4.1 Components of test anxiety
    2.4.2 Test anxiety scales
Chapter Three Method
  3.1 Phase 1: Simulated Experiment
    3.1.1 Participants
    3.1.2 Item bank
    3.1.3 Procedure
  3.2 Phase 2: Empirical Experiment
    3.2.1 Participants
    3.2.2 Instrument
    3.2.3 Procedure
Chapter Four Results and Discussion
  4.1 Phase 1: Simulated Experiment
    4.1.1 The precision of BRR-CAT
    4.1.2 The efficiency of BRR-CAT
  4.2 Phase 2: Empirical Experiment
    4.2.1 The precision of BRR-CAT
    4.2.2 The efficiency of BRR-CAT
    4.2.3 The effect of BRR-CAT on examinees' test anxiety
Chapter Five Conclusion and Suggestion
  5.1 Conclusion
  5.2 Suggestion
References
Appendix A
Appendix B

List of Tables

Table 3.1 Distribution of Participants in the Tryout of STAS (N=503)
Table 3.2 Distribution of Participants in Three Groups
Table 3.3 The Properties of Three Parameters in the Item Bank (Number of Items=123)
Table 3.4 Test Information of Thirteen θs
Table 3.5 Descriptive Statistics of Items in STAS (Number of Items=15, N=502)
Table 3.6 Item Analysis of STAS (N=502)
Table 3.7 Factor Analysis of STAS (Number of Items=15)
Table 3.8 Reliability Statistics of Four Factors
Table 4.1 The Frequency of the Numbers of Excluded Items in BRR-CAT (N=13000)
Table 4.2 MAD on Examinees' Ability Estimation in Three Groups (N=1000 in each θ)
Table 4.3 Mauchly's Test of Sphericity for MAD (df=2, n=30 in each θ)
Table 4.4 Repeated Measure ANOVA of MAD Using Lower-bound Correction (n=30 in each θ)
Table 4.5 Descriptive Statistics for SE of Examinees' Ability Estimation (Test Length=30 Items, N=1000 in each θ)
Table 4.6 Mauchly's Test of Sphericity for SE (df=2, n=30 in each θ)
Table 4.7 Repeated Measure ANOVA of SE Using Lower-bound Correction (n=30 in each θ)
Table 4.8 Descriptive Statistics of Test Length (SE ≤ .4, N=1000 in each θ)
Table 4.9 Descriptive Statistics of Test Length (SE ≤ .35, N=1000 in each θ)
Table 4.10 Descriptive Statistics of Test Length (SE ≤ .3, N=1000 in each θ)
Table 4.11 Descriptive Statistics of Test Length (SE ≤ .25, N=1000 in each θ)
Table 4.12 Mauchly's Test of Sphericity for Test Length (df=2, n=30 in each θ)
Table 4.13 Repeated Measure ANOVA of Test Length Using Lower-bound Correction (SE ≤ .4, n=30 in each θ)
Table 4.14 Repeated Measure ANOVA of Test Length Using Lower-bound Correction (SE ≤ .35, n=30 in each θ)
Table 4.15 Repeated Measure ANOVA of Test Length Using Lower-bound Correction (SE ≤ .3, n=30 in each θ)
Table 4.16 Repeated Measure ANOVA of Test Length Using Lower-bound Correction (SE ≤ .25, n=30 in each θ)
Table 4.17 Descriptive Statistics of Numbers of Three Change Types, Changed Answers, and Reviewed Items (Number of Participants=74)
Table 4.18 Descriptive Statistics of Participants' English Midterm Score, θ̂, and SE
Table 4.19 Homogeneity of Variances for θ̂ and SE
Table 4.20 ANCOVA of θ̂ and SE
Table 4.21 Descriptive Statistics of Participants' Test Time, Review Time, and Total Test-taking Time
Table 4.22 ANCOVA of Test-taking Time
Table 4.23 Descriptive Statistics of Four Factors in STAS (Number of Items=15, N=112)
Table 4.24 ANCOVA of Four Factors in STAS (Number of Items=15, N=112)

List of Figures

Figure 2.1 Three examples of ICC for 3-parameter logistic model
Figure 2.2 Typical likelihood function curve
Figure 2.3 Maximum likelihood curves of uncommon response patterns
Figure 2.4 Typical item information curve
Figure 2.5 The procedure of CAT
Figure 2.6 The procedure of BR-CAT
Figure 2.7 Rearrangement procedure (Papanastasiou, 2005)
Figure 2.8 The estimative order in rearrangement procedure (Papanastasiou, 2005)
Figure 3.1 Test information curve of item bank
Figure 3.2 The simulated procedure of CAT
Figure 3.3 The simulated procedure of BRR-CAT and BR-CAT
Figure 3.4 The pseudo code of the simulated procedure of BRR-CAT
Figure 3.5 Test information curve of the item bank
Figure 3.6 Screenshot of the answering step
Figure 3.7 Screenshot of the review and change procedure in BRR-CAT and BR-CAT
Figure 4.1 MAD on examinees' ability estimation
Figure 4.2 Means of SE on examinees' ability estimation
Figure 4.3 Means of test length (SE ≤ .4)
Figure 4.4 Means of test length (SE ≤ .35)
Figure 4.5 Means of test length (SE ≤ .3)
Figure 4.6 Means of test length (SE ≤ .25)
Figure 4.7 Means of test length (SE ≤ .4, .35, .3, and .25)

Chapter One Introduction

This study proposes an architecture for computerized adaptive testing (CAT) that supports review, answer change, and rearrangement within a block. The aim is a reasonable procedure and a precise algorithm for reviewable CAT that lowers examinees' test anxiety and thereby yields a more accurate estimate of examinees' ability. Chapter 1 introduces the background, motivation, purposes, and limitations of the research. Chapter 2 reviews related literature on item response theory (IRT), CAT, reviewable CAT algorithms, and test anxiety scales. Chapter 3 describes the participants' demographics and characteristics, the design of the Block-Review Rearrangement CAT (BRR-CAT) and Block-Review CAT (BR-CAT) algorithms, a test anxiety scale, and the detailed experimental procedures. Chapter 4 presents the analytic results on the effects of BRR-CAT on examinees' ability estimation and test anxiety and discusses them in light of the research questions. Chapter 5 offers conclusions and suggestions for future research.

1.1 Background and Motivation

With the development of information technology, CAT has become a practicable testing tool. For example, IRT-based CAT is a tailored test because it adjusts test item difficulty according to examinees' responses (Lord, 1980, p. 11). If an examinee answers a test item correctly, the following item will be more difficult than the previous one. Conversely, if the response is incorrect, the following item will be easier. IRT-based CAT is also a precise and efficient assessment tool because it can accurately estimate examinees' ability level, on the basis of statistical probability models, after only a few test items have been delivered (何榮桂, 1999). Because of these advantages, IRT-based CAT has become a popular evaluation tool in large-scale test-administration institutions. CAT is applied to ability tests, such as the Test of English as a Foreign Language (TOEFL), Graduate Record Examinations (GRE), Graduate Management Admission Test (GMAT), and Scholastic Assessment Tests (SAT) administered by Educational Testing Service (ETS) (http://www.ets.org/); the American College Test (ACT) administered by ACT, Inc. (http://www.act.org/); and the Test of Ability for Chinese Communication (TACC) administered by Waseda University in Japan (村上公一、砂岡和子、劉松, 2005). CAT is also applied to qualification tests, such as the National Council Licensure Examination for Practical/Vocational Nurses (NCLEX-PN) and the National Council Licensure Examination for Registered Nurses (NCLEX-RN) administered by the National Council of State Boards of Nursing, Inc. (NCSBN) (https://www.ncsbn.org/), and the Law School Admission Test (LSAT) administered by the Law School Admission Council (LSAC) (http://www.lsac.org/).

IRT-based CAT is an accurate and economical assessment tool, but it has one shortcoming: examinees are prohibited from reviewing answered test items or changing their choices (Vispoel, Hendrickson, & Bleiler, 2000). This constraint arises because examinees' ability is estimated immediately after they respond to an item, and the difficulty level of the following item is adjusted on the basis of that estimate (Lord, 1980, p. 12). A number of researchers, however, have claimed that this limitation leads to underestimation of examinees' ability, because examinees are deprived of the opportunity to check their answers even though changing answers has been shown to improve performance significantly (Lunz, Bergstrom, & Wright, 1992; Vispoel, 2000; Waddell & Blankenship, 1995). Moreover, the inaccessibility of previously answered items may aggravate examinees' test anxiety (Vispoel, 1998; 2000). According to Shermis and Lombard (1998), giving examinees flexible control over the testing procedure might help to reduce their test anxiety. Other studies have proposed pilot simulations in which the traditional CAT was replaced by a reviewable CAT. Researchers concluded that reviewing a fixed number of items in multiple sessions was necessary to reduce the complexity of designing a reviewable CAT procedure and to maintain the security of the item bank (Lunz et al., 1992; Vispoel, 1998; Waddell & Blankenship, 1995). In Vispoel's study (2000), examinees held positive attitudes toward reviewable CAT and voiced the need for it as a means of lowering their test anxiety. In short, a CAT with item review and change options is not only necessary to provide examinees with a fair environment but also beneficial for estimating their performance more precisely.

The Block-Review CAT (BR-CAT), which permits examinees to review items and change answers within a block, is a practical reviewable CAT. Providing examinees with a BR-CAT has two major advantages. First, BR-CAT allows examinees to review answers, which helps to lessen their test anxiety. Second, BR-CAT helps to improve examinees' performance because it provides opportunities to recheck and change answers (Stocking, 1997; Vispoel, 1998; 2000). After examinees change their answers, their ability is re-estimated on the basis of the new response pattern. However, changing answers can introduce error into ability re-estimation if the new response pattern is unreasonable. For example, if the kth item was changed from incorrect to correct during the review-and-change procedure and the (k+1)th item was answered correctly on the first pass, the (k+1)th item might be unsuitable because its difficulty level was lower than the new ability estimate. Similarly, if the kth item was changed from correct to incorrect and the (k+1)th item was answered incorrectly, the (k+1)th item might be unsuitable because its difficulty level was higher than the new ability estimate. The Block-Review and Rearrangement CAT (BRR-CAT) may be a solution to this problem: it incorporates into BR-CAT the rearrangement procedure, which arranges the new response pattern in a reasonable order (Papanastasiou, 2005). In addition, the practicality of BRR-CAT can be demonstrated not only at the theoretical level but also empirically, by investigating the effects of a reviewable CAT on examinees' ability estimation and test anxiety in an experiment.

1.2 Purpose

The purpose of the present study is to investigate the effects of the Block-Review Rearrangement CAT (BRR-CAT) on examinees' ability estimation and test anxiety. To achieve this purpose, first, the BRR-CAT algorithm is designed and its precision and efficiency in estimating examinees' ability are verified. Second, a BRR-CAT system is implemented. Finally, an experiment is carried out to evaluate the effects of BRR-CAT on examinees' ability estimation and test anxiety. The three research questions are as follows:

1. In a simulated and an empirical experiment, how do the algorithms of BRR-CAT, BR-CAT, and CAT differ in the precision of their estimates of examinees' ability?
2. In a simulated and an empirical experiment, how do BRR-CAT, BR-CAT, and CAT differ in the efficiency of their estimation of examinees' ability?
3. How does examinees' test anxiety differ when they take a BRR-CAT, a BR-CAT, and a CAT?

1.3 Scope and Limitation

1. The purpose of this study is to investigate the implementation of a BRR-CAT system; as a result, the focus is on the BRR-CAT algorithm. The management of the item bank and item analysis are outside the scope of this study.
2. The term CAT in this study refers to CAT based on the 3-parameter logistic model in IRT.
3. Examinees' illegitimate behaviors, such as answering strategies adopted in order to cheat (Wainer, 1993; Kingsbury, 1996), are not considered in this study. Interpretation of the research findings may therefore be limited.
4. The empirical experiment adopted a real item bank consisting of 123 items selected from the vocabulary and grammar sections of the High School English Ability Test (HSEAT) administered by the College Entrance Examination Center (CEEC). Applicability of the present findings is limited to the academic domain of English testing.

Chapter Two Literature Review

This study attempts to provide examinees with a reviewable CAT environment that incorporates an accurate algorithm and a reasonable procedure, and further to investigate the effects of reviewable CAT on examinees' ability estimation and test anxiety. This chapter surveys the topics related to the study. Section 2.1 introduces item response theory (IRT) models. Section 2.2 describes the CAT procedure based on IRT. Section 2.3 describes the algorithms for BR-CAT and the rearrangement procedure. The final section details and summarizes popular test anxiety scales.

2.1 Item Response Theory (IRT)

Item response theory (IRT) is the core of modern measurement (Wainer, Dorans, Flaugher, Green, Mislevy, Steinberg, & Thissen, 1990, p. 9). IRT has four advantages (Baker, 1992, p. 2; Ho, 1989). First, because item parameters are group invariant, the ability level of the examinee group does not affect the estimation of item parameters (Hambleton & Swaminathan, 1985, p. 11). Second, the particular test items answered by examinees do not affect the estimation of their latent ability (Hambleton & Swaminathan, 1985, p. 13). Third, examinees' latent ability can be estimated precisely because every examinee has an individual standard error of the estimated latent ability (Hambleton & Swaminathan, 1985, p. 9). Finally, IRT models are non-linear because they are based on the relationship between the probability that examinees answer a test item correctly and their latent ability; these models are therefore suited to measuring the majority of non-linear latent traits, such as intelligence (Hambleton & Swaminathan, 1985, p. 13).

IRT models are also practical because the essential principle of IRT is to describe the probability of answering a test item correctly by means of the item characteristic function (ICF) (Hambleton & Swaminathan, 1985, p. 25). However, before IRT models are applied to estimate examinees' latent ability or trait, the assumptions of the models have to be met (Hambleton & Swaminathan, 1985, p. 16). The details are as follows.

2.1.1 Assumptions of IRT

Three preliminary assumptions of IRT (unidimensionality, local independence, and nonspeededness), listed below, must be met for model-data fit (Hambleton & Swaminathan, 1985, p. 16).

1. Unidimensionality: Unidimensionality refers to the assumption that a single latent ability is consistently measured by every test item (Hambleton & Swaminathan, 1985, p. 17). In other words, examinees' test performance is explained by a single dominant factor measured by the set of test items, and the covariance between items is zero once that factor is accounted for. However, this assumption is not always satisfied because nonsystematic factors, such as test anxiety, motivation, personality, or cognitive skills, also affect test performance. Hambleton and Swaminathan (1985, p. 17) thus suggest that the assumption of unidimensionality can be met only when the test items focus on testing a single latent ability.

2. Local independence: The local independence assumption is the theoretical basis of the unidimensionality assumption and states that an examinee's responses to different items are independent of each other for a given level of ability (Hambleton & Swaminathan, 1985, p. 22). That is, examinees' performance on one item must not influence their responses to any other item. The dominant factor that influences examinees' responses is the latent ability on which the test items focus. In brief, if the unidimensionality assumption is satisfied, the local independence assumption will hold, but not vice versa (Hambleton & Swaminathan, 1985, p. 25).

3. Nonspeededness: Nonspeededness denotes the assumption that IRT models cannot be applied to speeded tests (Hambleton & Swaminathan, 1985, p. 30). Examinees' failure to answer test items correctly in a speeded test may be due either to insufficient time or to limited ability (Hambleton & Swaminathan, 1985, p. 30). That is, in a speeded test, both answering speed and latent ability affect examinees' test performance. Non-speeded tests therefore reflect latent ability more accurately because examinees have sufficient time to answer the test items (Hambleton & Swaminathan, 1985, p. 30).

In short, when the three assumptions above are satisfied for a data set, the model-data fit will often be sufficient, and the IRT models can be aptly applied to CAT. Conversely, if the assumptions are not met, the models do not fit and are not applicable to the data set.

2.1.2 Item characteristic function

The essential principle of IRT is to describe the probability of answering a test item correctly by means of the ICF. In other words, the ICF reflects the relationship between examinees' latent ability and their responses. One of the commonly used item response models for dichotomous responses is the 3-parameter logistic model (Hambleton & Swaminathan, 1985, p. 35; Lord, 1980, p. 12). The ICF of the 3-parameter logistic model is given in Formula 2.1 (Lord, 1980, p. 12). The plot of the function is called the item characteristic curve (ICC) (Lord, 1980, p. 12). Figure 2.1 provides three examples of ICCs for the 3-parameter logistic model. The horizontal axis represents the examinees' latent ability scale and the vertical axis the probability of answering a test item correctly. The b parameter of each of the three test items is located at the point on the ability scale where the slope of the corresponding curve (ICC1, ICC2, or ICC3) is at its maximum. The higher the a parameter, the steeper the ICC. The c parameter, which is the lower asymptote on the probability scale, indicates the probability of answering a test item correctly regardless of how low the examinee's ability is.

\[
p_{ij}(\theta_j) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-D a_i (\theta_j - b_i)}} \tag{2.1}
\]

where
D is the scaling factor, the constant 1.702;
e is the mathematical constant 2.71828;
i denotes item i;
j denotes examinee j;
\theta_j is the ability of examinee j;
a_i, b_i, and c_i are the item discrimination parameter, the item difficulty parameter, and the pseudo-chance level parameter of item i, respectively; and
p_{ij}(\theta_j) denotes the probability of a correct response to item i by examinee j, who has ability \theta_j.
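As a minimal sketch of how Formula 2.1 can be evaluated, the following Python snippet computes the 3-parameter logistic probability for three hypothetical items that reuse the parameter values listed for ICC1 to ICC3 in the caption of Figure 2.1. The function name `icf_3pl` and the dictionary `items` are assumptions introduced only for illustration; they are not part of the testing system described in this thesis.

```python
import math

D = 1.702  # scaling constant given with Formula 2.1


def icf_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3-parameter logistic model (Formula 2.1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical items reusing the (a, b, c) values listed for ICC1-ICC3 in Figure 2.1.
items = {
    "ICC1": (1.0, -1.0, 0.05),
    "ICC2": (1.0, 0.0, 0.05),
    "ICC3": (1.0, 1.0, 0.05),
}

for name, (a, b, c) in items.items():
    # Evaluate each curve at a low ability, at theta = b, and at a high ability.
    probs = [round(icf_3pl(theta, a, b, c), 3) for theta in (-3.0, b, 3.0)]
    print(name, probs)
```

At θ = b the exponent in Formula 2.1 is zero, so the probability equals c + (1 − c)/2, which is 0.525 for each of these items; as θ decreases, the value approaches the lower asymptote c.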

(19) p. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0. ICC1 ICC2 ICC3. -3.5 -3 -2.5 -2 -1.5 -1 -0.5 0. 0.5. 1. 1.5. 2. 2.5. 3. 3.5. θ Figure 2. 1 Three examples of ICC for 3-parameter logistic model ( ICC1: a=1, b=-1, c=.05; ICC2: a=1, b=0, c=.05; and ICC3: a=1, b=1, c=.05 ). 2.1.3 Item parameter estimation The accuracy of item parameter estimation is a factor to affect the estimation of examinees’ latent ability (Hambleton & Swaminathan, 1985, p125). The procedure of item parameter estimation comprises collecting examinees’ response patterns, selecting a method of item parameter estimation and gaining the item parameters (Hambleton & Swaminathan, 1985, p126). First, a large number of examinees’ response patterns could be collected by using paper-and-pencil test or CBT. Second, selecting an appropriate method to estimate item parameters is necessary. Finally, parameters of each test item could be gained by several estimation tools. As shown as Fig. 2.1, the parameters of a test item in 3-parameter logistic model are the item discrimination (a), the item difficulty (b) and the pseudo-chance level(c), respectively. The commonly used tools for item parameter estimation may adopt different methods.. 10.

For example, LOGIST adopts JMLE and MMLE, MicroCAT’s X-Calibrate adopts Bayesian parameter estimation (Assessment Systems Corporation, 1989; 1995; Ho & Hsu, 1989) and BILOG adopts MAP and EAP (Mislevy & Bock, 1989; 1993). Of the three aforementioned steps, selecting an appropriate method in particular affects the accuracy of item parameter estimation. The methods of item parameter estimation are detailed as follows.

The most frequently used methods for item parameter estimation are maximum likelihood estimation (MLE), Bayesian parameter estimation, joint maximum likelihood estimation (JMLE), marginal maximum likelihood estimation (MMLE), Bayesian modal or maximum a posteriori estimation (MAP) and Bayesian mean or expected a posteriori estimation (EAP) (Hambleton & Swaminathan, 1985, p. 142; Owen, 1969). If examinees’ ability is known, MLE and Bayesian parameter estimation are used to estimate the a, b and c parameters (Owen, 1969). On the other hand, if examinees’ ability is unknown, JMLE, MMLE, MAP or EAP are used to estimate the item parameters and the examinees’ latent ability at the same time (Hambleton & Swaminathan, 1985, p. 141). In practice, JMLE, MMLE, MAP and EAP are commonly used to estimate item parameters because the examinees’ ability is usually unknown when an estimation method is selected.

The JMLE procedure comprises three steps. First, initial values of the item parameters are assumed to be known. Second, the ability parameter is estimated from the known item parameters. Third, the item parameters and the examinees’ ability parameters are re-estimated iteratively until the ability parameter converges to a stable value (Lord, 1980, p. 86). In MMLE, it is assumed that examinees’ ability follows a certain distribution, such as the normal distribution. The estimation of the item parameters is based on an integral equation. The estimation of the ability parameter is the solution of the marginal maximum likelihood

function. Finally, the item parameters and the examinees’ ability parameters are re-estimated iteratively until the estimation errors of the item and ability parameters reach their minimum (Lord, 1980, p. 172). Lord (1980, p. 176) also indicated that no significant difference existed between the accuracy of JMLE and MMLE when the number of items used was about 40 and the number of examinees was from 1000 to 200. However, if few items were used (10 to 15), the estimation of item parameters in MMLE would be more accurate than in JMLE.

MAP and EAP are formulated on the basis of Thomas Bayes’s prior and posterior probability distributions (Bock & Aitkin, 1981). The procedure of these two methods is composed of three steps. First, the likelihood function with the posterior distribution is set up from the prior distribution of examinees’ ability and the maximum likelihood function. Second, the solution of this function is found. Third, the item parameters and the examinees’ ability parameters are re-estimated iteratively until the estimation errors of the item and ability parameters reach their minimum (Hambleton & Swaminathan, 1985, p. 142). The difference between MAP and EAP lies in the estimator taken from the posterior distribution. The estimator in MAP is the Bayes modal estimator, which is the mode of the posterior distribution, whereas the estimator in EAP is the expected value of the posterior distribution. Mislevy and Stocking (1989) indicated that EAP was more accurate than MAP.

2.1.4 Examinee ability estimation

Two of the most commonly used methods to estimate an examinee’s ability are maximum likelihood estimation (MLE) and the Bayesian procedure (Bejar & Weiss, 1979). They are reviewed respectively in the following section.

1. Maximum likelihood estimation (MLE): The MLE procedure comprises collecting examinees’ response patterns, adopting a likelihood function (Formula 2.2) and employing Newton-Raphson iteration (Formula 2.3) to obtain the maximum likelihood estimator of ability ($\hat{\theta}$) (Lord, 1980, p. 59). Figure 2.2 illustrates Formulas 2.2 and 2.3 with an example of a likelihood function curve for the 3-parameter logistic model. The highest point on this curve is the maximum of the likelihood function; the corresponding value of .5 on the ability scale is the maximum likelihood estimator of ability.

$$L(u_1, u_2, \ldots, u_n \mid \theta) = \prod_{i=1}^{n} P_i^{u_i}\, Q_i^{1-u_i} \qquad (2.2)$$

where
$L(u_1, u_2, \ldots, u_n \mid \theta)$ is the likelihood function;
$u_1, u_2, \ldots, u_n$ are the responses to items 1, 2, …, n ($u_i = 1$ if the item response is correct, and 0 if incorrect);
$P_i$ is the probability of a correct response to item i for an examinee with ability θ, which enters the product when $u_i = 1$;
$Q_i = 1 - P_i$ is the probability of an incorrect response to item i for an examinee with ability θ, which enters the product when $u_i = 0$; and
n is the number of items used.

$$\theta_{t+1} = \theta_t - h_t \qquad (2.3)$$

where

$$h_t = \frac{f'(\theta_t)}{f''(\theta_t)};$$

$$f'(\theta) = D \sum_{i=1}^{n} \frac{a_i\,(u_{ij} - P_{ij})(P_{ij} - c_i)}{P_{ij}\,(1 - c_i)};$$

$$f''(\theta) = D^2 \sum_{i=1}^{n} \frac{a_i^2\,(u_{ij} c_i - P_{ij}^2)(P_{ij} - c_i)\,Q_{ij}}{P_{ij}^2\,(1 - c_i)^2};$$

$\theta_t$ and $\theta_{t+1}$ are the t-th and the (t+1)-th estimators of ability, respectively;
D is the scaling factor, the constant 1.702;
e is the mathematical constant 2.71828;
i is item i;
j is examinee j;
$\theta_j$ is the ability of examinee j;
$a_i$, $b_i$ and $c_i$ are the item discrimination parameter, the item difficulty parameter and the pseudo-chance level parameter of item i, respectively;
$P_{ij}$ is the probability of a correct response to item i for examinee j;
$Q_{ij} = 1 - P_{ij}$ is the probability of an incorrect response to item i for examinee j; and
$u_{ij}$ is examinee j’s response to item i ($u_{ij} = 1$ if the item response is correct, and 0 if incorrect).
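The Newton-Raphson iteration of Formula 2.3 can be sketched as follows, using the first and second derivatives of the log-likelihood for the 3-parameter logistic model. The function names, the three-item example, the tolerance and the bound of ±6 on the ability scale are assumptions introduced for illustration; the bound is a simple safeguard that reports a diverging estimate rather than part of the formula itself.

```python
import math

D = 1.702  # scaling constant from Formula 2.1


def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def derivatives(theta, items, responses):
    """First and second derivatives of the log-likelihood with respect to theta."""
    f1 = 0.0
    f2 = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        q = 1.0 - p
        f1 += D * a * (u - p) * (p - c) / (p * (1.0 - c))
        f2 += D**2 * a**2 * (u * c - p**2) * (p - c) * q / (p**2 * (1.0 - c) ** 2)
    return f1, f2


def mle_theta(items, responses, theta0=0.0, tol=1e-6, max_iter=50, bound=6.0):
    """Newton-Raphson update theta <- theta - f'/f'' until the step is tiny."""
    theta = theta0
    for _ in range(max_iter):
        f1, f2 = derivatives(theta, items, responses)
        h = f1 / f2
        theta -= h
        if abs(theta) > bound:  # assumed safeguard: the estimate is running away
            raise ValueError("ability estimate diverged")
        if abs(h) < tol:
            return theta
    raise ValueError("no convergence within max_iter iterations")


# Hypothetical three-item test: two easier items correct, the hardest one wrong.
example_items = [(1.0, -1.0, 0.05), (1.0, 0.0, 0.05), (1.0, 1.0, 0.05)]
example_responses = [1, 1, 0]
print(round(mle_theta(example_items, example_responses), 3))
```

With a mixed response pattern the iteration settles on an interior maximum. With an all-correct pattern the steps keep pushing the estimate upward, so the sketch raises an error instead of returning a value.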

Figure 2.2 Typical likelihood function curve. Horizontal axis: θ, from −3.5 to 3.5; vertical axis: likelihood function value.

In the literature, the strengths and weaknesses of MLE are well recognized. MLE is easy to apply in practice. When the number of items used is sufficient (greater than 20), the estimate of examinees’ ability is unbiased. However, MLE has limitations under three circumstances: (1) uncommon response patterns, such as the one where examinees answer all the items correctly or incorrectly, prevent the MLE iteration from converging (Hambleton & Swaminathan, 1985, p. 91) (see Fig. 2.3); (2) unreasonable response patterns, such as the one where examinees correctly answer difficult items but fail to answer easy ones correctly, also prevent the MLE iteration from converging (Hambleton & Swaminathan, 1985, p. 91); and (3) an insufficient number of items can cause a great bias in ability estimation (Hambleton, Swaminathan & Rogers, 1991, p. 91).

Figure 2.3 Maximum likelihood curves of uncommon response patterns (one curve for answering all items correctly, one for answering all items incorrectly). Horizontal axis: θ, from −3 to 3; vertical axis: likelihood function value.

2. Bayesian procedure: Owen (1975) proposed the Bayesian procedure as a method of ability estimation that solves two limitations of MLE. The basic principle of the Bayesian procedure is that the posterior probability distribution of examinees’ ability is the product of the likelihood function and the prior probability distribution of examinees’ ability. In the Bayesian procedure, it is assumed that examinees’ ability follows a certain distribution, such as a normal distribution with a mean of 0.0 and a variance of 1.00 (Baker, 1992, p. 209; Owen, 1975; Wainer et al., 1990, p. 72). The examinees’ ability estimate is then obtained from this posterior distribution according to their correct or incorrect responses. Therefore, the Bayesian procedure can solve the problem of uncommon response patterns in MLE (Hambleton & Swaminathan, 1985, p. 91). However, when the number of items used is insufficient, the Bayesian procedure can cause a regression effect (Ho, 1989). That is, the examinees’ ability estimate will be less accurate because the estimator is pulled toward the mean of the prior distribution (Weiss, 1982).
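The principle that the posterior is the product of the likelihood and the prior can be illustrated numerically. The sketch below is not Owen’s (1975) closed-form updating procedure; it is a simple grid approximation of the posterior mean under a normal prior with mean 0 and variance 1, and the grid range of −4 to 4, the function names and the example items are assumptions made for this illustration.

```python
import math

D = 1.702  # scaling constant from Formula 2.1


def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def posterior_mean(items, responses, prior_mean=0.0, prior_sd=1.0, n_points=161):
    """Posterior mean and SD of ability on an evenly spaced grid from -4 to 4."""
    grid = [-4.0 + 8.0 * k / (n_points - 1) for k in range(n_points)]
    weights = []
    for theta in grid:
        # Unnormalized normal prior density at this grid point.
        weight = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        # Multiply by the likelihood of the observed response pattern.
        for (a, b, c), u in zip(items, responses):
            p = p3pl(theta, a, b, c)
            weight *= p if u == 1 else 1.0 - p
        weights.append(weight)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


example_items = [(1.0, -1.0, 0.05), (1.0, 0.0, 0.05), (1.0, 1.0, 0.05)]
print(posterior_mean(example_items, [1, 1, 1]))  # all correct: still a finite estimate
print(posterior_mean(example_items, [0, 0, 0]))  # all incorrect: still a finite estimate
```

Because the prior bounds the posterior, all-correct and all-incorrect patterns yield finite estimates, and with only three items those estimates stay close to the prior mean, which is the regression effect described above.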

2.1.5 Item selection strategies

Until the estimator of an examinee’s ability is accurate enough, the item selection strategy must select a proper unused item for the examinee to answer and thereby increase the efficiency of the test. The common item selection strategies are the maximum information and Bayesian strategies. The details are as follows.

1. Maximum item information selection: Maximum item information selection finds the item in the item bank that provides the most information at the current estimate of the examinee’s ability (Birnbaum, 1968). The information is given by the item information function (IIF) (Hambleton & Swaminathan, 1985, p. 91; Lord, 1980, p. 72) (Formula 2.4). The plot of this function is called the item information curve (IIC). Figure 2.4 provides an example of an IIC for the 3-parameter logistic model. As shown in Fig. 2.4, the same item provides different amounts of information for examinees with different abilities. In addition, the highest point on the IIC is the maximum item information of the item, which occurs where the ability is −0.2. That is, this item is suitable for an examinee with middle ability.

$$I_i(\hat{\theta}) = \frac{D^2 a_i^2\, Q_i(\hat{\theta})\,\bigl(P_i(\hat{\theta}) - c_i\bigr)^2}{(1 - c_i)^2\, P_i(\hat{\theta})} \qquad (2.4)$$

where
$\hat{\theta}$ is the estimator of an examinee’s ability;
$I_i(\hat{\theta})$ is the information of item i for an examinee with ability $\hat{\theta}$;
i is item i; and
$a_i$, $b_i$ and $c_i$ are the item discrimination parameter, the item difficulty parameter and the pseudo-chance level parameter of item i, respectively.
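Maximum item information selection can be sketched as follows. The sketch evaluates the item information of the 3-parameter logistic model item by item at the current ability estimate and returns the unused item with the largest value. The three-item bank and the function names are hypothetical.

```python
import math

D = 1.702  # scaling constant from Formula 2.1


def p3pl(theta, a, b, c):
    """3-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Information of a single 3PL item at ability theta."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return D**2 * a**2 * q * (p - c) ** 2 / ((1.0 - c) ** 2 * p)


def select_max_info(theta_hat, bank, used):
    """Index of the unused item that is most informative at theta_hat."""
    candidates = [k for k in range(len(bank)) if k not in used]
    return max(candidates, key=lambda k: item_information(theta_hat, *bank[k]))


# Hypothetical bank of (a, b, c) triples.
bank = [(1.0, -1.0, 0.05), (1.0, 0.0, 0.05), (1.0, 1.0, 0.05)]
first = select_max_info(0.0, bank, used=set())
second = select_max_info(0.0, bank, used={first})
print(first, second)
```

With a nonzero pseudo-chance level, an item is somewhat more informative for examinees slightly above its difficulty than for examinees the same distance below it, which is why the easier item is preferred over the harder one once the middle item has been used.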

Figure 2.4 Typical item information curve. Horizontal axis: θ, from −3.5 to 3.5; vertical axis: item information.

2. Bayesian selection: Bayesian selection includes four steps. First, Bayesian selection requires an assumption about the prior probability distribution of examinees’ ability, for example that ability follows a normal distribution with a mean of 0.0 and a variance of 1.00 (Owen, 1975; Wainer et al., 1990, p. 111; Baker, 1992, p. 194). The mean of the prior distribution also serves as the initial estimator of ability at the beginning of the test. Second, after the examinee answers a test item, the ability estimate and the variance of the posterior probability distribution are calculated. Third, these two values become the prior estimators of ability and variance, which are used to calculate the expected posterior variance for each unused item. Finally, the unused item with the minimum expected posterior variance is selected as the next item (Owen, 1975; Wainer et al., 1990, p. 112; Baker, 1992, p. 202).

IRT supposes that the relation between the performance of examinees on an item and their abilities can be plotted as an item characteristic curve (ICC). The higher

the examinee’s ability is, the higher the probability of a correct response is (Lord, 1980, p. 12). In brief, an examinee’s ability or latent trait can be predicted from his or her performance on a test and the item response model (Hambleton, Swaminathan, & Rogers, 1991, p. 7). Moreover, because of the group invariance of item parameters, the group to which examinees belong does not affect the item parameter estimation. That is, the test results of examinees are easy to explain and compare even when the examinees are in different ability groups.

2.2 Computerized Adaptive Testing (CAT)

In contrast to traditional CBT, computerized adaptive testing (CAT) is a more precise and efficient way to measure examinees’ ability. When an examinee answers the current item correctly, a more difficult item is selected as the next one; otherwise, an easier one is selected. The steps of ability estimation and item selection are repeated until the examinee’s ability estimate is precise enough (Ho, 1989; Wainer et al., 1990, p. 103). In this way, a more precise ability estimate can be expected and fewer items (and thus less time) are required in the CAT procedure. Based on IRT, the CAT procedure comprises the Starting, Continuing and Stopping steps (Wainer et al., 1990, p. 108). Following these three steps, a CAT system can be implemented in practice. Figure 2.5 shows the flowchart of the CAT procedure.
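The Starting, Continuing and Stopping steps can be put together in a small simulation. This is a sketch under stated assumptions rather than the system used in this study: the item bank is artificial, responses are simulated from a known true ability, items are chosen by maximum information, ability is estimated as a grid-based posterior mean with a normal prior (mean 0, variance 1), the posterior standard deviation stands in for the SE, and the test stops at SE ≤ .35 or after 30 items.

```python
import math
import random

D = 1.702  # scaling constant from Formula 2.1
GRID = [-4.0 + 0.05 * k for k in range(161)]  # assumed ability grid, -4 to 4


def p3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def information(theta, a, b, c):
    p = p3pl(theta, a, b, c)
    return D**2 * a**2 * (1.0 - p) * (p - c) ** 2 / ((1.0 - c) ** 2 * p)


def estimate(administered):
    """Posterior mean and SD on the grid; administered holds ((a, b, c), u) pairs."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for (a, b, c), u in administered:
            p = p3pl(theta, a, b, c)
            w *= p if u else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    sd = math.sqrt(sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total)
    return mean, sd


def run_cat(bank, true_theta, se_target=0.35, max_items=30, seed=0):
    rng = random.Random(seed)
    used, administered = [], []
    theta_hat, se = 0.0, float("inf")  # Starting: assume medium initial ability
    while se > se_target and len(used) < max_items:
        # Continuing: choose the unused item with maximum information at theta_hat.
        k = max((i for i in range(len(bank)) if i not in used),
                key=lambda i: information(theta_hat, *bank[i]))
        u = 1 if rng.random() < p3pl(true_theta, *bank[k]) else 0  # simulated answer
        used.append(k)
        administered.append((bank[k], u))
        theta_hat, se = estimate(administered)
    return theta_hat, se, used  # Stopping: SE target met or maximum length reached


# Artificial bank: 61 items with difficulties from -3 to 3, a = 1, c = 0.2.
bank = [(1.0, -3.0 + 0.1 * k, 0.2) for k in range(61)]
theta_hat, se, used = run_cat(bank, true_theta=1.0)
print(round(theta_hat, 2), round(se, 2), len(used))
```

The loop mirrors the three steps: an initial ability guess selects the first item, each response triggers re-estimation and a new selection, and the stopping rule ends the test.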

Figure 2.5 The procedure of CAT. Flowchart: Start; Starting (first item selection); Continuing (item presentation, the examinee responds to the item, collection of the examinee’s responses, estimation of the examinee’s latent trait or ability, then a check of whether the stopping criteria are satisfied; if not, item selection and a return to item presentation); Stopping (once the criteria are satisfied); Stop.

2.2.1 Starting

In the Starting step, the CAT system predetermines the level of examinees’ initial ability and selects the first test item for them. Methods for selecting the first item are as follows:

1. Medium-difficulty item: Selecting a medium-difficulty item as the first one is a commonly applied method because the majority of examinees are at a medium ability level. It is convenient, but it often results in overusing the items of middle difficulty.

Furthermore, if examinees are at the two ends of the θ scale, this selection method might waste examinees’ time on items that are not in their ability range (Wainer et al., 1990, p. 109).

2. Random entry: The CAT system randomly selects an item from the item bank as the first item. This method is good for improving test security because each item has an equal chance of being selected. In general, however, the CAT system randomly administers candidate items whose difficulty ranges from −.5 to .5, that is, items of middle difficulty.

3. Self-adaptive (SA): The self-adaptive method allows examinees to decide the difficulty level of the first test item according to their self-expected abilities. Examinees can also choose the difficulty level of the remaining items to be administered during the CAT procedure (Wise, Plake, Johnson, & Roos, 1992).

4. Referring to related data: Another method of selecting the first test item involves reference to related data. Examinees’ achievement, intelligence quotient or age are possible data to consider (Vispoel, 1998). This method is precise for examinees’ ability estimation because the difficulty level of the first item is close to the examinees’ performance in a similar domain or subject. However, collecting the examinees’ related data is difficult and increases the cost of testing.

Lord (1977) claimed that if examinees take enough test items, different methods of setting the examinees’ initial point produce no significant difference in the ability estimates. In other words, examinees’ ability estimation is stable if enough test items are taken, no matter which method is used.

2.2.2 Continuing

The Continuing step iterates between estimating the examinee’s ability and selecting the following item. After the examinee answers one item, the CAT system immediately estimates his or her ability from the responses. The CAT system then administers the next item according to the current ability estimate. The procedure is repeated until the ability estimate is precise enough, that is, until the standard error of the estimator is small enough.

1. Examinees’ ability estimation: The most often used methods for examinees’ ability estimation are maximum likelihood estimation (MLE) and Bayesian estimation. The former procedure is easy, but the ability estimate cannot converge if the examinee answers all items correctly or all items incorrectly (Hambleton & Swaminathan, 1985, p. 35). The latter procedure is more complex and less efficient. However, it can solve the problem of divergent estimation because it assumes prior information on the distribution of the examinees’ abilities (Ho & Hsu, 1989).

2. Item selection strategies: The common methods for the item selection step include maximum information selection and Bayesian strategies (Baker, 1992, p. 209). The key concept of the maximum information selection strategy is that each item reveals a different amount of information for examinees’ ability estimation. The relation between the information of an item and the examinees’ abilities can be plotted as an IIC (see Fig. 2.4). The more information a test item reveals, the more suitable it is as the next question. This method selects, as the next item, the item that reveals maximum information and has not yet been administered (Birnbaum, 1968). In contrast to the maximum information strategy, the Bayesian strategy is more

complicated. This method assumes prior information in the form of a normal distribution of examinees’ abilities. After the examinee answers an item, the mean and variance of the posterior distribution of ability are estimated, and these results then become the indices for selecting the next item. This method selects, as the next item, the item that yields the minimum posterior variance and has not yet been used (Owen, 1975).

2.2.3 Stopping

When a CAT is finished, examinees may have completed different numbers of test items. The constraint conditions for the Stopping step depend on the research purposes (洪碧霞、吳鐵雄, 1989). Four constraint conditions under which the testing stops are as follows:

1. Setting the maximum standard error (SE): Examinees finish a CAT when the SE of their ability estimate is less than the preset maximum SE. This constraint is often incorporated into the Bayesian selection strategy (Wainer et al., 1990, p. 114).

2. All suitable items were used: Examinees finish a CAT when no remaining item can increase the item information. This constraint is often incorporated into the maximum information selection strategy.

3. Setting the testing length: Examinees finish a CAT when the preset testing length is reached. This constraint is easy to implement and is therefore often incorporated into simulation research. The item exposure rate can be controlled, but the precision of examinees’ ability estimation may be unstable (Wainer et al., 1990, p. 114).

4. Setting the test time: The CAT is terminated when the test time runs out. This rule is often used for convenience, especially in test administration (洪碧霞、吳鐵雄, 1989). However, it is not fair to examinees who answer questions slowly.

The advantages of IRT-based CAT (Ho, 1989; Baker, 1992, p. 2) can be seen in the following three aspects. First, it is an effective and efficient measurement tool for examinees. With the help of information technology, examinees can take fewer items and obtain a more accurate estimate of ability. Second, owing to the group invariance of item parameters, the abilities of examinees in different groups are comparable because they are on the same scale. In other words, it is easier to explain the test results. Third, it is an economical tool for administration because it provides examinees with a standard testing procedure, improves the efficiency of mass data processing and reduces the cost of testing. Owing to the efficiency and effectiveness of IRT-based CAT, it is unsurprising that a growing number of language testing systems adopting the IRT-based CAT procedure are becoming available (Brown, 1997; Dunkel, 1997; 1999; Madsen & Larson, 1985).

2.3 Reviewable CAT

Unlike in CBT, examinees cannot review the answered items or change their answers in traditional CAT. This is because examinees’ ability is estimated immediately after they answer each test item. Forbidding examinees to review and change answers ensures the accuracy of ability estimation and controls the efficiency of testing (Hambleton & Swaminathan, 1985, p. 35; Wise, 1996).

In addition, forbidding review and answer changes avoids two cheating strategies, the Wainer strategy (Wainer, 1993) and the Kingsbury strategy (Kingsbury, 1996), which may otherwise occur and inflate examinees’ scores in a reviewable CAT. Examinees who adopt the Wainer strategy intentionally answer all the test items incorrectly so that they are assigned easier and easier items as the test proceeds. Then, on the review pass, they go back and change all the incorrect answers to correct ones in order to inflate their test scores. Examinees who adopt the Kingsbury strategy try to detect whether their response to a previous item was correct according to the difficulty level of the subsequent item (Kingsbury & Weiss, 1983). If the subsequent item is easier than the previous item, they go back and revise the incorrect answer to the previous item. However, the Wainer strategy and the Kingsbury strategy do not have any noticeable effect on inflating test scores. According to the findings of Gershon, Bergstrom and Stone (1995), the Wainer strategy only inflates the ability estimates of simulated examinees when the examinees correctly revise all reviewed items. Stocking (1997) also contends that the Wainer strategy loses its effect when examinees are only allowed to review items within a block (e.g., 5 items in a block). Likewise, the Kingsbury strategy does not inflate scores as much as was claimed, because examinees do not necessarily have the ability to distinguish the difficulty levels of test items (Vispoel, et al., 2001; Vispoel, Clough, & Bleiler, 2005). In brief, the effect of the Wainer strategy on the misestimation of examinees’ ability can be alleviated by using Block-Review, and the effect of the Kingsbury strategy is not pronounced enough to be of concern. Therefore, when examinees take a CAT that incorporates a review mechanism, the effects of both cheating strategies on ability estimation can be ignored.
The majority of examinees expressed their desire for a reviewable CAT environment (e.g., 87% in Vispoel, et al., 2000; 85% in Vispoel, 1998) even though

they will spend an additional 37% to 61% of time taking a reviewable CAT (Vispoel, 1998). Some research results also showed that a reviewable procedure in CAT benefits examinees in terms of partial knowledge and test anxiety. Aside from its positive effects on examinees’ test performance, a reviewable CAT does not lessen the accuracy of ability estimation. At least two simulation experiments have proposed algorithms to this end.

2.3.1 Cognitive factor: partial knowledge

Some researchers contended that examinees’ partial knowledge and recall can improve their performance in a reviewable testing procedure (Higgins, Russell, & Hoffmann, 2005; Papanastasiou, 2005; Parshall, Kalhn, & Davey, 2002, p. 34; Vispoel, et al., 2000). That is, when examinees use their partial knowledge to recall answers or to correct their key-in errors and miscalculations, more answers are changed from incorrect to correct than from correct to incorrect. Researchers also suggested that examinees’ performance after changing answers was closer to their real ability than their performance without changing (Stone & Lunz, 1994; Vispoel, et al., 2000; Wise, Roos, Plake, & Nebelsick-Gullett, 1994). In short, examinees’ partial knowledge helps improve their performance if they are allowed to change answers in a reviewable CAT procedure; therefore, the estimate of their ability will be closer to their maximum performance.

2.3.2 Psychological factor: test anxiety

In terms of the effect of test anxiety on performance, researchers indicated that examinees with high test anxiety perform poorly under pressure in traditional CAT

procedure (Vispoel, 1998; Vispoel, et al., 2000). In other words, forbidding the review and changing of answers may undermine high-anxiety examinees’ performance (McMorris & Leonard, 1976; Wise, et al., 1994). In contrast, examinees are more likely to demonstrate their level of competence if they can review and change answers in the test, because their test anxiety is alleviated to a more manageable level (Chapell & Overton, 1998; Shermis & Lombard, 1998). That is, when examinees perform at their maximum, their ability can be estimated as precisely as possible. Hence, a reviewable CAT procedure helps examinees lessen their test anxiety and do their best. In brief, a reviewable CAT is valuable and worth implementing.

2.3.3 Algorithms of reviewable CAT

Although the advantages of reviewable CAT have been shown in the literature, this mechanism is in fact quite rare in practice because of the complexity and difficulty of its implementation (Parshall, et al., 2002, p. 34). Nevertheless, Vispoel, et al. (2000) and Papanastasiou (2005) each proposed an algorithm for a reviewable CAT procedure and adopted it in simulation experiments. The two algorithms are detailed as follows.

1. Limiting answer review and change procedure (Block-review CAT; BR-CAT): Vispoel, et al. (2000) proposed the limiting answer review and change procedure, which allows reviewing and changing within successive m-item blocks. In contrast to the traditional CAT, the test items are grouped into n blocks. Figure 2.6 shows the flowchart of the limiting answer review and change procedure of CAT. In this procedure, examinees were only allowed to review and change answers within the

most recent block. If an examinee was answering the items in block i, he or she was not allowed to review the items in the previous blocks. There are two advantages of using the limiting answer review and change procedure. First, the problem of the Wainer strategy can be overcome. That is, the examinees’ cheating strategy does not have much effect when they are only allowed to review items within a block (Stocking, 1997; Vispoel, et al., 2000). Second, there was no significant difference in the accuracy of ability estimation between the limiting review procedure and the no-review procedure (Vispoel, et al., 2000). This suggests that an examinee can still obtain an accurate ability estimate by using a reviewable CAT. However, a serious problem may arise in the re-estimation of examinees’ ability in this procedure. Examinees’ response patterns may become unreasonable after they change answers; for example, examinees may change the answers to more difficult items to correct ones but change the answers to easier items to incorrect ones. Therefore, after answers are changed, the examinees’ ability should be re-estimated on the basis of a rearranged, more reasonable response pattern.
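The block restriction itself is simple to express. The sketch below assumes 0-based item positions and a fixed block size m; the function names are hypothetical and the sketch covers only the review constraint, not the re-estimation of ability after a change.

```python
def block_index(position, block_size):
    """Block number (0-based) that contains the item at the given position."""
    return position // block_size


def reviewable_positions(current_position, block_size):
    """Positions already administered within the examinee's current block."""
    start = block_index(current_position, block_size) * block_size
    return list(range(start, current_position + 1))


def can_change(current_position, target_position, block_size):
    """True only if the target item lies in the current block and was already seen."""
    return target_position in reviewable_positions(current_position, block_size)


# With 5-item blocks, an examinee at position 7 is in the second block (positions 5 to 9).
print(reviewable_positions(7, 5))
print(can_change(7, 3, 5))
```

Under this constraint an examinee who deliberately answers an earlier block incorrectly cannot return to it later, which is why block review limits the Wainer strategy.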

Figure 2.6 The procedure of BR-CAT.

2. Rearrangement procedure: Papanastasiou (2005) proposed the rearrangement procedure, which rearranges and skips certain items to better estimate the examinees’ abilities. For example, the rearrangement procedure allowed examinees to change up to five of their answers after they finished 30 items within the fixed allotted testing time. After examinees revised their answers, the rearrangement procedure for calculating the examinees’ final scores took place (see Fig. 2.7).

Compared with the traditional CAT, three types of answer changes trigger the rearrangement procedure in ability estimation. A Type 1 change involves changing an answer from incorrect to incorrect, and it makes no difference in ability estimation between the traditional CAT and the rearrangement procedure. A Type 2 change involves changing an answer from incorrect to correct, and it results in item skipping in the rearrangement procedure. For instance, if item i underwent a Type 2 change, the ability estimation would skip items i+1 to i+k−1 (1<k<4), which were answered correctly but were not included in the estimation because their difficulty levels were lower than the examinee’s ability. A Type 3 change involves changing an answer from correct to incorrect, and it also results in item skipping in the rearrangement procedure. For example, if item j underwent a Type 3 change, the ability estimation would ignore items j+1 to j+k−1 (1<k<4), which were answered incorrectly but were not included in the estimation because they were too difficult for the examinee (see Fig. 2.8). Moreover, Papanastasiou (2005) placed a constraint on the number of changed items: a maximum of 5 items could be changed in the rearrangement procedure, but most simulated examinees changed only about 2 items out of 30 so that the percentage of changed answers would match the 5.1% reported in the meta-analysis by Waddell and Blankenship (1995).

There are two problems in Papanastasiou’s (2005) study. First, real data are necessary to verify the practicability of the rearrangement procedure. Second, skipping items in ability estimation reduces the total test information, so additional items need to be included to overcome this problem.
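The three types of answer changes can be told apart from the correctness of the answer before and after the change. The sketch below covers only this classification, not the item-skipping rule of the rearrangement procedure; the function name and the use of None for a change that leaves a correct answer correct (a case not listed above) are assumptions made for illustration.

```python
def change_type(was_correct, is_correct):
    """Classify an answer change by correctness before and after the change."""
    if not was_correct and not is_correct:
        return 1  # Type 1: incorrect to incorrect, ability estimation unaffected
    if not was_correct and is_correct:
        return 2  # Type 2: incorrect to correct, leads to item skipping
    if was_correct and not is_correct:
        return 3  # Type 3: correct to incorrect, leads to item skipping
    return None   # correct to correct: not one of the three listed types


# Hypothetical changes as (correct before, correct after) pairs.
changes = [(False, False), (False, True), (True, False)]
print([change_type(before, after) for before, after in changes])
```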

Figure 2.7 Rearrangement procedure (Papanastasiou, 2005).

Figure 2.8 The estimative order in rearrangement procedure (Papanastasiou, 2005).

To summarize this section, the advantages of reviewable CAT and the related algorithms are recapitulated as follows. First, permitting examinees to review and change answers in a CAT procedure helps enhance test performance because they can exploit their partial knowledge or correct key-in errors and miscalculations (Vispoel, et al., 2000). Second, examinees usually desire review opportunities in the CAT procedure (Papanastasiou, 2005). It is reported that review

opportunities can alleviate examinees’ test anxiety (Vispoel, 1998; 2000; Vispoel, et al., 2000), which might otherwise hinder their normal level of test performance (Hancock, 2001). Third, in terms of the algorithms of reviewable CAT, the limiting answer review and change procedure can overcome the effects of the Wainer cheating strategy (Stocking, 1997) and maintain the accuracy of ability estimation (Vispoel, et al., 2000). However, it increases examinees’ testing time (Vispoel, 1998) and can result in unreasonable response patterns. The other procedure, the rearrangement procedure, gave a better estimate of examinees’ ability and reduced the standard error of ability estimation (Papanastasiou, 2005). However, it has only been simulated with a fixed-length (30-item) adaptive test. In the present study, therefore, the algorithms of the limiting review and rearrangement procedures were combined and improved to create the Block-Review Rearrangement CAT (BRR-CAT), with a view to its practicability, effectiveness and efficiency.

2.4 Test Anxiety

Test anxiety refers to examinees’ tenseness or uneasiness induced by a fear of failing a test (Sarason, 1980). According to state-trait anxiety theory, test anxiety includes state anxiety and trait anxiety (Sarason, 1980). The former describes examinees’ feelings of nervousness and uneasiness when they are taking a specific test, whereas the latter denotes a long-term personality feature related to test-taking situations (Deffenbacher, 1980). The components of test anxiety and the commonly used test anxiety scales are summarized as follows.

2.4.1 Components of test anxiety

Spielberger (1980) proposes that the components of test anxiety are worry and emotionality. Worry is similar to trait anxiety. It is a feature of personality, as when examinees worry about failing a test. Emotionality, on the other hand, is similar to state anxiety. It is the examinees’ felt experience, such as nervousness or uneasiness. Shermis and Lombard (1998) indicated that an appropriate level of anxiety helps examinees keep their mind on the test, enhance their motivation, improve their effectiveness and develop positive self-expectations. However, examinees with high test anxiety usually performed more poorly than those with low test anxiety (Hancock, 2001; Williams, 1996). Moreover, compared with emotionality, worry has a more negative effect on examinees’ performance (Deffenbacher, 1980; McMorris & Leonard, 1976; Sarason, 1980). Besides worry and emotionality, examinees’ computer experience and the design of the CBT, such as a limit on test time and a ban on reviewing or changing answers, might cause examinees to fail to answer further questions and thus increase their test anxiety. Therefore, they may perform more poorly than expected when taking a CBT (Shermis & Lombard, 1998; Vispoel, 1998; 溫福星, 1994).

2.4.2 Test anxiety scales

Four commonly used test anxiety scales are the Test Anxiety Scale (TAS) (Sarason, 1980, p. 9), the Test Anxiety Inventory (TAI) (Spielberger, 1980), the FRIEDBEN Test Anxiety Scale (FTA) (Friedman & Bendas-Jacob, 1997) and the Cognitive Test Anxiety Scale (CTAS) (Cassady & Johnson, 2002). Except for the earlier TAS (Sarason, 1980, p. 9), the other three are Likert scales. They are reviewed in detail in the following

(43) section.

1. Test Anxiety Scale (TAS): Sarason (1980, p. 9) designed the 37-item TAS based on the effects of test anxiety on examinees’ performance and cognitive processes. The TAS was composed of 37 true/false items (30 positive and 7 negative). The test-retest reliability ranged from .80 to .87 (7 weeks). The target population was people older than 12 years. For scoring, one point is awarded when “true” is answered for a positive item or “false” is answered for a negative item (see Fig. 2.9). Twelve and twenty points are the benchmarks for classifying levels of test anxiety. That is, those who score lower than 12 points are classified as having low-level test anxiety, those who score higher than 20 points as having high-level test anxiety, and those who score in between as having middle-level test anxiety (Sarason, 1978). The following are one positive and one negative item, respectively.

An example of a positive item: If I were to take an intelligence test, I would worry a great deal before taking it.

An example of a negative item: If I knew I was going to take an intelligence test, I would feel confident and relaxed.

2. Test Anxiety Inventory (TAI): Spielberger (1980) designed the 20-item Test Anxiety Inventory (TAI) around the two components of test anxiety, worry and emotionality. The scale was composed of 8 items in each of the worry (TAI-W) and emotionality (TAI-E) subscales, plus 4 integrated items (worry and emotionality). Spielberger’s TAI was a 4-point Likert scale. The test-retest reliability was .80 (2 weeks) and .811 (four

(44) weeks). The internal consistency reliability (Cronbach's coefficient alpha) ranged from .92 to .96 for the TAI, from .83 to .91 for TAI-W and from .85 to .91 for TAI-E across different age groups. The target population was people older than 15 years. The scoring method converted raw scores to percentile ranks based on the norms for high school students and undergraduates. The higher the percentile rank, the higher the examinee’s level of test anxiety (Spielberger, 1980). The following are two example items from the TAI-W and TAI-E subscales, respectively.

An example item in TAI-W: Even when I am well prepared for a test, I feel very anxious about it.

An example item in TAI-E: During tests I feel very tense.

3. FRIEDBEN Test Anxiety Scale (FTA): Friedman and Bendas-Jacob (1997) created the FRIEDBEN Test Anxiety Scale (FTA) based on the dimensions of test anxiety. It was a 23-item, 4-point Likert scale composed of social derogation, cognitive obstruction and tenseness subscales. The social derogation subscale was composed of 8 items, which were statements about worries of being socially belittled by peers, parents or teachers after failing a test. The cognitive obstruction subscale was composed of 9 items, which were statements about poor concentration, failure to recall and difficulty finding solutions when taking a test. The tenseness subscale was composed of 6 items, which were statements about physical and psychological discomfort. The target subjects included 1194 junior high school students and 1422 high school students. Cronbach's coefficient alpha for the scores on the FTA full scale and the three subscales was .91, .86, .85 and .81, respectively. A high score signified a high level of test anxiety. The

(45) following are example items in these three subscales respectively.

Example item in social derogation subscale: If I fail a test I am afraid I shall be rated as stupid by my friends.

Example item in cognitive obstruction subscale: In a test I feel like my head is empty, as I have forgotten all I have learned.

Example item in tenseness subscale: During a test I keep moving uneasily in my chair.

4. Cognitive Test Anxiety Scale (CTAS): Cassady and Johnson (2002) established the Cognitive Test Anxiety Scale (CTAS), which focuses on the cognitive dimension of test anxiety, given that cognitive test anxiety exerts a stable and negative impact on academic performance measures. Items related to emotionality or bodily symptoms were not included in the CTAS because the cognitive dimension is of primary importance in determining examinees’ test performance. It was a 27-item, 4-point Likert scale. The target subjects included 168 undergraduates. Cronbach's coefficient alpha for the CTAS scores was .91. The results showed that higher levels of cognitive test anxiety were associated with significantly lower test scores (Cassady & Johnson, 2002). The following are example items in the CTAS.

An example of a positive item: I lose sleep over worrying about examinations.

An example of a negative item: I have less difficulty than the average college student in getting test instructions straight.
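The TAS scoring rule described in item 1 above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not Sarason's published scoring key: the function names and the answer-key representation are hypothetical, the upper benchmark is taken to be 20 points in line with the stated benchmarks of twelve and twenty, and scores of exactly 12 or 20 are assumed to fall in the middle band.

```python
from typing import Sequence

LOW_CUTOFF = 12   # scores below this benchmark: low-level test anxiety
HIGH_CUTOFF = 20  # scores above this benchmark: high-level (assumed reading)


def score_tas(responses: Sequence[bool], is_positive: Sequence[bool]) -> int:
    """Count one point for "true" on a positive item or "false" on a negative item.

    responses[i]   -- True if the examinee answered "true" to item i
    is_positive[i] -- True if item i is positively keyed (hypothetical key format)
    """
    if len(responses) != len(is_positive):
        raise ValueError("responses and answer key must have the same length")
    return sum(1 for answer, positive in zip(responses, is_positive)
               if answer == positive)


def classify_tas(score: int) -> str:
    """Map a TAS raw score to a low, middle or high anxiety level."""
    if score < LOW_CUTOFF:
        return "low"
    if score > HIGH_CUTOFF:
        return "high"
    return "middle"  # assumption: boundary scores 12 and 20 count as middle


# Hypothetical key with the item counts reported for the TAS:
# 30 positive items followed by 7 negative items.
key = [True] * 30 + [False] * 7
all_true = [True] * 37
print(classify_tas(score_tas(all_true, key)))
```

Answering "true" to every item earns a point on each of the 30 positive items and none on the 7 negative items, so this hypothetical examinee falls in the high band.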

(46) In summary, test anxiety is expected to be alleviated if a review function can be integrated into traditional CAT. Test anxiety is examinees’ fear of failing a test. When examinees’ test anxiety is overly high, it may induce tenseness, worry and uneasiness and impair their performance on tests. If a traditional CAT incorporates review options, it can not only alleviate examinees’ test anxiety but also meet their expectation of being able to change answers during the test (Vispoel et al., 2000). The TAS (Sarason, 1978), TAI (Spielberger, 1980), FTA (Friedman & Bendas-Jacob, 1997) and CTAS (Cassady & Johnson, 2002) are commonly used tools for measuring test anxiety. However, when examinees’ test anxiety is induced by the frustration of being unable to review answers in CAT, their performance may be affected by the worry and emotionality components of test anxiety (O'Neil & Richardson, 1980); therefore, the TAI (Spielberger, 1980) and the tenseness subscale of the FTA (Friedman & Bendas-Jacob, 1997) are more appropriate for measuring examinees’ test anxiety. In addition, the item statements in a test anxiety scale have to be easy for the target population of the present study to understand. The test anxiety scale also has to be rendered in equivalent Chinese for the participants in the formal experiment.