A Comparison of the Linking Effects of BIB, PBIB, and NEAT Designs in Polytomously Scored Tests

Chapter 5 Conclusions and Recommendations

Section 1 Conclusions

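This study compared five BIB designs, three PBIB designs, and three NEAT designs under four sample sizes (2,800, 5,600, 7,560, and 19,880 examinees). Horizontal equating was carried out by concurrent calibration in order to assess the linking effectiveness of each design for polytomously scored tests. The conclusions are as follows: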

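1. Increasing the number of examinees reduces the estimation error of the item discrimination and category threshold parameters, but has little effect on the estimation of examinee ability parameters.

2. When the sample size is 5,600 or larger, the BIB designs give the best estimates of the item discrimination and category threshold parameters, the NEAT designs come second, and the PBIB designs perform worst.

3. For examinee ability estimation, the NEAT designs link better than the PBIB and BIB designs, but the difference in error between the PBIB and BIB designs is extremely small.

4. With the total number of examinees held constant, the estimation errors of the examinee ability, item discrimination, and category threshold parameters all increase as the number of item blocks in the item pool increases.

5. In the PBIB designs, with the item pool size, the sample size, and the number of items per block held fixed, the estimation errors of the examinee ability, item discrimination, and category threshold parameters decrease as the number of blocks per booklet increases.

6. The PBIB3 design was included to examine the linking effectiveness of the booklet design used for the NAEP writing assessment. Its errors in the estimates of the examinee ability, item discrimination, and category threshold parameters were much larger than those of the other designs; that is, its estimates were imprecise.

7. The results for the BIB3 and BIB4 designs show that the number of items per block is an important factor in equating precision: increasing the number of items per block effectively reduces the parameter estimation error.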
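To make the parameters named in these conclusions concrete, the following is a minimal sketch of how one item discrimination parameter and a set of ordered category thresholds determine response probabilities. It assumes a graded response model of the Samejima type with six response categories; the function name `grm_category_probs` and all numeric values are hypothetical and are not taken from the study.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under an assumed graded response model.

    theta      -- examinee ability
    a          -- item discrimination
    thresholds -- ordered category thresholds (five thresholds give six categories)
    """
    # Cumulative probability of responding in category k or higher;
    # it is 1 for the lowest category and 0 beyond the highest one.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


# Hypothetical item: discrimination 1.2 and five evenly spaced thresholds.
probs = grm_category_probs(theta=0.0, a=1.2, thresholds=[-2.0, -1.0, 0.0, 1.0, 2.0])
```

Under this sketch, an error in the discrimination or in any threshold shifts every category probability for the item, which is one way to see why the precision of these estimates matters for linking.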

Section 2 Recommendations

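1. Under a NEAT design, releasing any one booklet also releases the anchor items. Under a BIB design, combinations of booklets and item blocks that satisfy the design constraints are hard to find, which often increases the number of times each block must be repeated and therefore the number of booklets. In practice, human and financial resources are limited, and this study found that the PBIB and BIB designs differ very little in the error of examinee ability estimates. When an alternative linking design is needed, a PBIB design is therefore a reasonable choice.

2. The PBIB3 and BIB3 designs both examine the linking effectiveness of booklet designs for writing assessment. The results show that these designs yield much larger errors in the estimates of the examinee ability, item discrimination, and category threshold parameters than the other designs; that is, their estimates are less precise. The experimental results also show that increasing the number of items within a block effectively reduces the parameter estimation error. The researcher therefore recommends analytic scoring for writing assessments, so that each independently scored trait can be treated as a separate polytomously scored item, which should effectively reduce the estimation error.

3. This study considered only normally distributed examinee ability. Future work could examine linking effectiveness under other ability distributions, such as positively or negatively skewed distributions.

4. This study used six response categories. Future work could examine linking effectiveness with different numbers of response categories.

5. This study performed only horizontal equating. Future work could examine the effectiveness of vertical equating.

6. All parameters in this study were estimated in MULTILOG with the software's default settings. Future work could examine how other estimation options affect the parameter estimates.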
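The design constraints mentioned in the first recommendation can be illustrated with a small check. The sketch below assumes the usual definition of a balanced incomplete block design applied to booklets: every booklet holds the same number of item blocks, every item block appears in the same number of booklets, and every pair of item blocks shares a booklet equally often. The function `is_balanced` and the seven-booklet example are hypothetical illustrations, not the designs used in this study.

```python
from collections import Counter
from itertools import combinations


def is_balanced(booklets):
    """Check the BIB conditions for a list of booklets (each a set of item blocks)."""
    blocks = sorted({blk for booklet in booklets for blk in booklet})
    # How many booklets each item block appears in.
    block_counts = Counter(blk for booklet in booklets for blk in booklet)
    # How many booklets each pair of item blocks shares.
    pair_counts = Counter(
        pair for booklet in booklets for pair in combinations(sorted(booklet), 2)
    )
    same_size = len({len(booklet) for booklet in booklets}) == 1
    same_reps = len(set(block_counts.values())) == 1
    # Counter returns 0 for a pair that never co-occurs, so missing pairs are caught.
    lambdas = {pair_counts[pair] for pair in combinations(blocks, 2)}
    same_lambda = len(lambdas) == 1 and min(lambdas) > 0
    return same_size and same_reps and same_lambda


# Classic design with 7 item blocks, 7 booklets, 3 blocks per booklet,
# in which every pair of blocks shares exactly one booklet.
design = [
    {1, 2, 3}, {1, 4, 5}, {1, 6, 7},
    {2, 4, 6}, {2, 5, 7}, {3, 4, 7}, {3, 5, 6},
]
balanced = is_balanced(design)
# Dropping one booklet breaks the balance conditions.
balanced_without_last = is_balanced(design[:-1])
```

Relaxing the equal pair-count condition, so that pairs may co-occur at a small number of different rates, is what allows a partially balanced design to use fewer booklets.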

