A Comparison of the Numbers of Parallel Tests in Simultaneous Assembly (同時編製平行測驗數之比較)


National Science Council, Executive Yuan: Research Project Report


The Comparisons of the Number of Parallel Tests Constructed in Simultaneous Test Assembly

NSC 90-2413-H-011-001

August 1, 2001 to July 31, 2002 (ROC 90.08.01 to 91.07.31), 鄭海蓮

Center for Teacher Education, National Taiwan University of Science and Technology

I. Chinese Abstract

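This project used the method of simulated annealing and item response theory to set up automated procedures for assembling parallel tests, in two forms: simultaneous assembly and sequential assembly. An item bank of 3,000 items was simulated with the three-parameter model. The numbers of parallel tests assembled were 2, 4, and 8, with 80, 40, and 20 items per test, respectively. The results overcome the shortcoming of sequential assembly reported in the literature: the item characteristic curves of the assembled parallel tests were almost identical to those of the target test, and as many as eight tests could be assembled sequentially. The results show how simulated annealing can be applied and how it performs. The advance in sequential assembly in particular can serve as a basis for future needs in testing practice and research.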

Keywords: simultaneous assembly, sequential assembly, parallel tests, method of simulated annealing, item response theory

Abstract

This study applied the method of simulated annealing to the automated assembly of parallel tests under item response theory. Two procedures were established: simultaneous assembly and sequential assembly. An item bank of 3,000 items was simulated with the three-parameter model. Two, four, and eight parallel tests were constructed, with 80, 40, and 20 items per test, respectively. The tests assembled sequentially had almost coincident item characteristic curves, a result quite different from previous studies. The study clarifies how simulated annealing performs in test construction and shows that it can carry out both simultaneous and sequential assembly. The procedures and system developed in the study can be applied in testing practice and research where multiple parallel tests are needed.


Keywords: simultaneous assembly, sequential assembly, parallel tests, method of simulated annealing, item response theory.

II. Background and Purpose

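Among current methods for the automated assembly of parallel tests (also called alternate forms), more rely on item response theory (IRT) than on classical test theory (CTT). Item selection under IRT is driven mainly by the item and test information functions. The item parameters and latent trait parameters in these functions make the number of variables that the assembly procedure must handle grow rapidly. As a result, common automated assembly procedures in the linear programming (LP) family, such as binary programming or integer programming, run into combinatorial explosion and may fail to compute a solution. Heuristics are therefore often added to the core LP logic to ease the computational load, but these heuristics have tended to become more and more complex in order to meet the demands of current test assembly practice.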
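For reference, the standard three-parameter logistic (3PL) forms of the item response function, the item information function, and the test information function can be written as below. The notation is the conventional one and is not taken from this report: a_i, b_i, and c_i are the discrimination, difficulty, and guessing parameters of item i, θ is the latent trait, n is the test length, and D is a scaling constant (commonly 1 or 1.7; the report does not say which it uses).

```latex
P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\left[-D a_i (\theta - b_i)\right]}
\qquad
I_i(\theta) = D^2 a_i^2 \, \frac{1 - P_i(\theta)}{P_i(\theta)}
              \left[\frac{P_i(\theta) - c_i}{1 - c_i}\right]^2
\qquad
I(\theta) = \sum_{i=1}^{n} I_i(\theta)
```

Because the test information is a sum over items, every candidate item contributes a full curve over θ, which is one way to see why the number of quantities to be handled grows quickly with bank size.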

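Apart from the LP family, the method of simulated annealing (SA) needs no heuristics and still gives good assembly results (Jeng, 1994; Jeng and Shih, 1997; 鄭海蓮 and 潘靖英, 民 87 [1998]; 鄭海蓮, 民 89 [2000]). This is because the core logic of SA can itself accommodate all kinds of constraints and includes a random component, which simplifies the computation and reduces the computational burden. SA first defines a space of feasible solutions (the solution space) and then searches that space for the best solution. Compared with the optimal solutions sought by LP methods, SA results are not necessarily better, and they vary slightly from run to run because of the random component. SA nevertheless offers an efficient route to a near-optimal solution, and it always returns a solution.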
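The random component of SA is conventionally implemented with the Metropolis acceptance rule (Metropolis et al., 1953). The following is a sketch in generic notation rather than the report's own formulation: ΔC is the change in cost caused by a proposed move, T_k is the temperature at step k, and α is a cooling rate. Geometric cooling is only one common schedule and is an assumption here.

```latex
\Pr(\text{accept}) =
\begin{cases}
1, & \Delta C \le 0,\\
\exp\left(-\Delta C / T_k\right), & \Delta C > 0,
\end{cases}
\qquad T_{k+1} = \alpha\, T_k, \quad 0 < \alpha < 1 .
```

Moves that worsen the cost are accepted with a probability that shrinks as the temperature falls, which is why repeated runs can end in slightly different solutions.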

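Besides the choice of theory (IRT or CTT) and the choice of procedure (LP or SA), assembling parallel tests also requires a choice between simultaneous assembly and sequential assembly. It is generally held that simultaneous assembly with LP makes the combinatorial optimization problem even harder. This can be seen in the earlier literature on automated assembly of parallel tests with LP, where the two parallel tests assembled (simultaneously) contained very few items (Boekkooi-Timminga, 1986, 1987, 1990). With heuristics added, the numbers of parallel tests and of items increased considerably. For example, the item bank of Armstrong, Jones, and Wang (1994) had 567 items in 4 content areas (185, 185, 95, and 102 items); 10, 10, 5, and 5 items were selected from the 4 areas to form 30-item parallel tests, and 3 and 6 parallel tests were assembled. Stocking, Swanson, and Pearlman (1993) applied the weighted deviations model (WDM) to assemble 70-item tests covering 3 content areas from a verbal item bank of 1,538 items, and 65-item tests from a mathematics item bank of 853 items, covering …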


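… an existing test as the target test, and then selects from the item bank the best set of items to approximate this target test, that is, to maximize the parallelism or minimize the difference between the two. Sequential assembly eases the computational load of simultaneous assembly, but it uses up the good items in the bank early, and this worsens as the number of tests or the number of items grows. This is the main problem of sequential assembly.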

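The purpose of this study is to explore further the potential and limits of SA in the automated assembly of parallel tests, along two lines: a comparison among assemblies of multiple parallel tests, and a comparison between simultaneous and sequential assembly. The computational logic of SA has been applied successfully in related fields (Metropolis et al., 1953; Wong, Leong, and Liu, 1988), always as an optimization that seeks a maximum or a minimum. Its application to test assembly began with Jeng (1994) and continued through Jeng and Shih (1997) and 鄭海蓮 and 潘靖英 (民 87 [1998]) to 鄭海蓮 (民 89 [2000]). Its potential and limits in test assembly need further development so that testing practice has more options to choose from and apply.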

III. Method and Assembly Results

(1) Simulated Item Bank

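The item bank follows the simulation method of 鄭海蓮 and 潘靖英 (民 87 [1998]): 3,000 multiple-choice items were simulated under the IRT three-parameter model, with item difficulty in the range (-3.0, 3.0), discrimination in the range (0, 2.0), and guessing in the range (0, 0.25).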
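A bank with these parameter ranges could be simulated along the following lines. This is a minimal sketch, not the project's program: the report gives only the ranges, so drawing each parameter from a uniform distribution is an assumption, and the function name `simulate_item_bank` and the seed are hypothetical.

```python
import random

def simulate_item_bank(n_items=3000, seed=0):
    """Draw 3PL item parameters within the ranges stated in the report.

    Uniform sampling is an assumption; the report states only the ranges.
    """
    rng = random.Random(seed)
    bank = []
    for _ in range(n_items):
        bank.append({
            "a": rng.uniform(0.0, 2.0),    # discrimination, range (0, 2.0)
            "b": rng.uniform(-3.0, 3.0),   # difficulty, range (-3.0, 3.0)
            "c": rng.uniform(0.0, 0.25),   # guessing, range (0, 0.25)
        })
    return bank

bank = simulate_item_bank()
print(len(bank))  # 3000
```

Fixing the seed makes the simulated bank reproducible, which is convenient when assembly procedures are compared on the same bank.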

(2) Numbers of Parallel Tests and Items

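Three sets of parallel tests were assembled. In every case the total number of items required is 160 of the 3,000 items in the bank (160/3000, about 5.3%).

1. Two parallel tests of 80 items each.

2. Four parallel tests of 40 items each.

3. Eight parallel tests of 20 items each.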

(3) Automated Assembly Procedures

1. Simultaneous assembly procedure

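Simultaneous assembly follows the assembly procedure, cost function, and basis of comparison of Jeng and Shih (1997) and 鄭海蓮 and 潘靖英 (民 87 [1998]). The three assembly cases are described in items a to c.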
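The general shape of a pairwise SA procedure of this kind can be sketched as follows. The report does not reproduce the cost function or the move set of the cited studies, so everything specific here is an assumption for illustration: the cost is the summed absolute gap between the two test information curves on a 20-point ability grid from -3 to 3, a move swaps one test item with an unused bank item, cooling is geometric, and the names `assemble_pair`, `item_info`, and `cost` are hypothetical.

```python
import math
import random

THETAS = [-3.0 + 6.0 * k / 19 for k in range(20)]  # assumed 20-point ability grid

def item_info(item, theta):
    """3PL item information for an (a, b, c) tuple, logistic metric (D = 1)."""
    a, b, c = item
    p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def cost(curve1, curve2):
    """Assumed cost: summed absolute gap between two test information curves."""
    return sum(abs(x - y) for x, y in zip(curve1, curve2))

def assemble_pair(bank, length, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Anneal two disjoint tests of `length` items toward equal information."""
    rng = random.Random(seed)
    rows = [[item_info(it, th) for th in THETAS] for it in bank]
    order = list(range(len(bank)))
    rng.shuffle(order)
    tests = [order[:length], order[length:2 * length]]
    unused = order[2 * length:]
    curves = [[sum(rows[i][k] for i in t) for k in range(len(THETAS))] for t in tests]
    current = cost(curves[0], curves[1])
    best = (current, [list(tests[0]), list(tests[1])])
    temp = t0
    for _ in range(steps):
        w = rng.randrange(2)                # which test to perturb
        pos = rng.randrange(length)         # item leaving that test
        upos = rng.randrange(len(unused))   # item entering from the unused pool
        out_i, in_i = tests[w][pos], unused[upos]
        trial = [v + rows[in_i][k] - rows[out_i][k] for k, v in enumerate(curves[w])]
        new = cost(trial, curves[1 - w])
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if new <= current or rng.random() < math.exp((current - new) / temp):
            tests[w][pos], unused[upos] = in_i, out_i
            curves[w], current = trial, new
            if current < best[0]:
                best = (current, [list(tests[0]), list(tests[1])])
        temp *= cooling
    return best

demo_rng = random.Random(1)
demo_bank = [(demo_rng.uniform(0.0, 2.0), demo_rng.uniform(-3.0, 3.0),
              demo_rng.uniform(0.0, 0.25)) for _ in range(300)]
final_cost, (test1, test2) = assemble_pair(demo_bank, length=20)
print(len(test1), len(test2))  # 20 20
```

The demo uses a 300-item bank and 20-item tests only to keep the run short. A cost that looks only at the gap between the two curves ignores the level of information, so a practical cost function would probably need further terms or constraints.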

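a. Two parallel tests of 80 items each: the two closest 80-item tests were selected from the 3,000-item bank; the result is shown in Figure 1. As the literature cited above leads one to expect, the two parallel tests are very close. This result serves mainly as a baseline for comparison with the other two cases.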


Figure 1. Result of simultaneously assembling two 80-item parallel tests. (Axes: ability level, information; curves: test1 information (80), test2 information (80).)

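b. Four parallel tests of 40 items each: following the procedure above, two parallel tests of 1,000 items each were first assembled from the 3,000-item bank and became two new item banks. Within each 1,000-item bank the same procedure was used to assemble two 40-item parallel tests, giving four 40-item tests in all; the result is shown in Figure 2. Figure 2 shows that although the two 1,000-item banks/tests are very close, the tests derived from them are close only in pairs, and the 40-item tests from different 1,000-item banks differ considerably. This result was also expected.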

[Figure 2. Axis: information; curves: test1 information (1000), test2 information (1000), test3 information (40), test4 information (40), test5 information (40), test6 information (40).]


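c. Eight parallel tests of 20 items each: following the procedure above, two parallel tests of 1,000 items each were first assembled from the 3,000-item bank and became two new item banks. Within each 1,000-item bank the same procedure was used to assemble two 200-item parallel tests, which again became new banks, giving four 200-item tests/banks. Finally, the same procedure was used within each 200-item bank to assemble two 20-item tests, giving eight 20-item parallel tests in all; the result is shown in Figure 3. The parallel tests are again close in pairs, but the differences between tests from different banks are larger than in the case of four 40-item tests.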

Figure 3. Result of simultaneously assembling eight 20-item parallel tests. (Axes: ability level, information; curves: test7 to test14 information (20) and the average information.)
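The level-by-level splitting used in cases b and c can be sketched as a short recursion. To keep the example small, `split_pair` is a random stand-in for the SA pair assembler and does not try to make the two halves parallel. The function names and the use of item indices in place of items are assumptions for illustration.

```python
import random

def split_pair(pool, size, rng):
    """Stand-in for the SA pair assembler: two disjoint subsets of `size` items."""
    picked = rng.sample(pool, 2 * size)
    return picked[:size], picked[size:]

def hierarchical_assembly(pool, sizes, rng):
    """Split `pool` in pairs, level by level; `sizes` gives the subset size per level."""
    if not sizes:
        return [pool]
    first, second = split_pair(pool, sizes[0], rng)
    return (hierarchical_assembly(first, sizes[1:], rng)
            + hierarchical_assembly(second, sizes[1:], rng))

rng = random.Random(0)
bank_ids = list(range(3000))
four_tests = hierarchical_assembly(bank_ids, [1000, 40], rng)        # case b
eight_tests = hierarchical_assembly(bank_ids, [1000, 200, 20], rng)  # case c
print(len(four_tests), len(four_tests[0]))    # 4 40
print(len(eight_tests), len(eight_tests[0]))  # 8 20
```

Each level doubles the number of tests, and tests from different branches are never compared with each other directly, which is consistent with the pairwise closeness and cross-bank differences reported for Figures 2 and 3.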

2. Sequential assembly

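The simultaneous procedure above yields parallel tests that are close only in pairs. Tests assembled from different sub-banks differ more as the number of tests grows, and the method can produce only an even number of tests. Sequential assembly instead first uses SA to separate out two of the required tests. If more than two are needed, the first test is held fixed, with no further swapping, and is compared with a third test, whose items are swapped by SA until the third test is close to the first. Whenever three or more tests are required, the additional tests are all produced in this way. The basis of comparison is again that of Jeng and Shih (1997) and 鄭海蓮 and 潘靖英 (民 87 [1998]). The assembly cases are described in items a to c.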

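a. Two parallel tests of 80 items each: this step is in fact the same as in simultaneous assembly, that is, two 80-item parallel tests are assembled simultaneously from the 3,000-item bank. The result is shown in Figure 4; as in Figure 1, the two tests are very close.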


Figure 4. Result of sequentially assembling two 80-item parallel tests. (Axes: ability level, information; curves: test1 information (80), test2 information (80).)

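b. Four parallel tests of 40 items each: two 40-item parallel tests were first assembled simultaneously from the 3,000-item bank, and the first was then held fixed. The items of the third test were selected from the remaining 2,920 items to approximate the first test as closely as possible. In the same way, the items of the fourth test were selected from the remaining 2,880 items to approximate the first test as closely as possible. The result for the four sequentially assembled parallel tests is shown in Figure 5; all four tests are very close to one another.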

[Figure 5. Axis: information; curves: test1 information (40), test2 information (40), test3 information (40), test4 information (40).]


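c. Eight parallel tests of 20 items each: two 20-item parallel tests were first assembled simultaneously from the 3,000-item bank, and the first was then held fixed. The third through eighth tests were then selected one at a time, each to approximate the first test as closely as possible, from the remaining 2,960, 2,940, 2,920, 2,900, 2,880, and 2,860 items, respectively. The result for the eight sequentially assembled parallel tests is shown in Figure 6; all eight tests are very close to one another. The eight-test result looks less good than the four-test result, but this is mainly because each test has only 20 items, which makes the values fluctuate more; it does not mean that SA is unsuitable.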

Figure 6. Result of sequentially assembling eight 20-item parallel tests. (Axes: ability level, information; curves: test1 to test8 information (20).)
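The sequential procedure in items a to c can be sketched as follows. This is an illustration under stated assumptions, not the project's program: the first test is simply passed in as the fixed target (in the report it comes from a simultaneous run), the cost is the summed absolute gap between a candidate's information curve and the target's on an assumed 20-point ability grid, and the names `info_row`, `match_target`, and `sequential_assembly` are hypothetical.

```python
import math
import random

THETAS = [-3.0 + 6.0 * k / 19 for k in range(20)]  # assumed ability grid

def info_row(item):
    """3PL information of one (a, b, c) item at every grid point (D = 1)."""
    a, b, c = item
    row = []
    for th in THETAS:
        p = c + (1.0 - c) / (1.0 + math.exp(-a * (th - b)))
        row.append(a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2)
    return row

def match_target(target, rows, pool, length, rng, steps=4000, t0=1.0, cooling=0.999):
    """Anneal one test drawn from `pool` toward a fixed target information curve."""
    pool = list(pool)
    rng.shuffle(pool)
    test, unused = pool[:length], pool[length:]
    curve = [sum(rows[i][k] for i in test) for k in range(len(THETAS))]
    gap = sum(abs(x - y) for x, y in zip(curve, target))
    temp = t0
    for _ in range(steps):
        pos, upos = rng.randrange(length), rng.randrange(len(unused))
        out_i, in_i = test[pos], unused[upos]
        trial = [v + rows[in_i][k] - rows[out_i][k] for k, v in enumerate(curve)]
        new_gap = sum(abs(x - y) for x, y in zip(trial, target))
        if new_gap <= gap or rng.random() < math.exp((gap - new_gap) / temp):
            test[pos], unused[upos] = in_i, out_i
            curve, gap = trial, new_gap
        temp *= cooling
    return test, unused, gap

def sequential_assembly(bank, first_test, n_tests, seed=0):
    """Hold `first_test` fixed as the target and build the other tests one by one."""
    rng = random.Random(seed)
    rows = [info_row(it) for it in bank]
    target = [sum(rows[i][k] for i in first_test) for k in range(len(THETAS))]
    taken = set(first_test)
    pool = [i for i in range(len(bank)) if i not in taken]
    tests, gaps = [list(first_test)], [0.0]
    while len(tests) < n_tests:
        # Each new test is drawn only from items no earlier test has used.
        test, pool, gap = match_target(target, rows, pool, len(first_test), rng)
        tests.append(test)
        gaps.append(gap)
    return tests, gaps

demo_rng = random.Random(2)
demo_bank = [(demo_rng.uniform(0.0, 2.0), demo_rng.uniform(-3.0, 3.0),
              demo_rng.uniform(0.0, 0.25)) for _ in range(400)]
tests, gaps = sequential_assembly(demo_bank, first_test=list(range(20)), n_tests=4)
print([len(t) for t in tests])  # [20, 20, 20, 20]
```

Because every later test is matched to the same fixed target, the number of tests does not have to be even, and the pool shrinks by one test length per step, as in the 2,960, 2,940, ... sequence of item c.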

IV. Discussion

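Because the simultaneous assembly method is built on Jeng and Shih (1997) and 鄭海蓮 and 潘靖英 (民 87 [1998]), it can assemble only two tests at a time. To assemble more tests, the same procedure must be run more than once: sub-banks of the required size are first selected from the parent bank, and tests of the final length are then assembled simultaneously, in pairs, from the sub-banks. The results and figures above show that the quality of the parallel tests produced this way declines as the number of tests increases, and that the method can produce only an even number of tests.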

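The sequential assembly method is also built on Jeng and Shih (1997) and 鄭海蓮 and 潘靖英 (民 87 [1998]). It takes the first test selected as the target test and then selects the remaining tests one at a time by the same procedure. It can produce any number of tests, including odd numbers, and, most importantly, each subsequent test is very close to the target test. Provided the item bank is large enough, the later tests therefore do not get worse as more tests are produced. To confirm the performance of sequential assembly, the same process was run automatically 50 times in succession; the results are shown in the following table: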

Table 1. Comparison of the best results from 50 consecutive automated runs of sequential assembly

Measure                                                     Four 40-item tests   Eight 20-item tests
Best item information averaged across items in the test     2.493722             2.353518
Worst item information averaged across items in the test    1.722706             1.601299
Best total difference                                       0.003560             0.004468
Worst total difference                                      0.015721             0.019662
Average item information by 50 runs                         2.044774             2.021081
Average total difference by 50 runs                         0.009891             0.009914
Total simulated time by 50 runs                             2176 sec             4464 sec

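The table shows that sequential assembly with SA performs very well. Even the worst total differences of the resulting parallel tests, 0.015721 for four 40-item tests and 0.019662 for eight 20-item tests, correspond to almost identical item characteristic curves (ICCs), and execution is fast.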

V. Self-Evaluation of Project Results

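Research on the automated assembly of parallel tests, especially the simultaneous assembly of multiple parallel tests, is of academic interest and also of considerable practical value. Examination programs at all levels that are administered several times a year are being introduced, and the item banks they require are being built. These banks are usually very large so that they can meet the needs of large-scale examinations and cope with item depletion and item exposure. As a bank grows, however, assembling tests becomes harder, and computerized assembly is needed to supplement the limits of manual assembly and even to avoid bias. The results of this project are summarized as follows: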

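1. They show how simulated annealing is applied and how it performs.

2. They show how to set up procedures for simultaneous or sequential assembly.

3. They show how increasing the number of parallel tests affects test quality.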


References

Armstrong, R. D., Jones, D. H., and Wang, Z. (1994). Automated parallel test construction using classical test theory. Journal of Educational Statistics, Vol. 19, No. 1, 73-90.

Boekkooi-Timminga, E. (1986). Algorithms for the construction of parallel tests by zero-one programming. Research Report 86-7. Enschede: Department of Education, University of Twente, The Netherlands.

Boekkooi-Timminga, E. (1987). Simultaneous test construction by zero-one programming. Methodika, 1, 101-112.

Boekkooi-Timminga, E. (1990). The construction of parallel tests from IRT-based item banks. Journal of Educational Statistics, Vol. 15, No. 2, 129-145.

Jeng, H. (1994). The application of simulated annealing in automated construction of parallel tests. Technical Report, Abteilung Angewandte Psychologie, Psychologisches Institut der Universität Zürich, Schweiz.

Jeng, H., and Shih, S. (1997). A comparison of pairwise and group selections of items using simulated annealing in automated construction of parallel tests. Psychological Testing, Vol. 44, No. 2, 195-212.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1092.

Stocking, M. L., Swanson, L., and Pearlman, M. (1993). Application of an automated item selection method to real data. Applied Psychological Measurement, Vol. 17, No. 2, 167-176.

Wong, D. F., Leong, H. W., and Liu, C. L. (1988). Simulated annealing for VLSI design. Boston: Kluwer Academic Publishers.

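鄭海蓮 and 潘靖英 (民 87 [1998]). A study applying automated parallel test assembly methods to the Joint College Entrance Examination (in Chinese). National Science Council research project report, Project No. NSC87-2413-H-011-001.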

鄭海蓮 (民 89 [2000]). A comparison of using classical item indices and IRT item parameters in automated procedures for assembling parallel tests (in Chinese). 中國測驗學會測驗年刊, 47(1), 47-55.