Conclusions and Suggestions for Improvement

Section 1: Conclusions

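This study examines how sample size and the number of vertical anchor items affect equating and linking under the balanced incomplete block (BIB) and non-equivalent groups with anchor test (NEAT) designs, in both horizontal and vertical test equating.

The study compares four BIB designs and three NEAT designs under three sample sizes (5,460, 7,500, and 10,000 examinees) and, for vertical equating, three numbers of anchor items (3, 6, and 9). The conclusions are as follows: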

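1. In horizontal equating, regardless of sample size, the BIB design equates better than the NEAT design as judged by the risk values for the item discrimination, item difficulty, and item guessing parameters, whereas the NEAT design equates better than the BIB design as judged by the risk values for examinee ability.

2. In vertical equating, regardless of sample size and number of anchor items, the BIB design equates better than the NEAT design as judged by the risk values for the item discrimination parameter, whereas the NEAT design equates better than the BIB design as judged by the risk values for examinee ability, the item difficulty parameter, and the item guessing parameter.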

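3. The larger the sample, the more precise the parameter estimates; sample size has a smaller effect on examinee ability estimates and a larger effect on item parameter estimates. However, the estimates differ little between 7,500 and 10,000 examinees, so when testing cost is a concern, a sample of 7,500 examinees achieves roughly the same results as a sample of 10,000.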

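4. The more anchor items shared across grades, the more precise the parameter estimates, but precision differs little between 6 and 9 anchor items. A test can therefore use 6 anchor items, which reduces the number of anchor items between grades and so increases the number of items available to the item bank.

5. Across the three item-bank sizes (7, 9, and 13 item blocks), the equating results show that, for the same sample size and number of anchor items, the risk values for examinee ability and for the item parameters increase as the number of item blocks increases.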
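To make the notion of a balanced block arrangement concrete, the following is a minimal sketch of how item blocks can be assigned to booklets so that every pair of blocks appears together equally often. It is an illustration only: the cyclic construction, the base sets `(0, 1, 3)` and `(0, 1, 3, 9)`, and the function names are assumptions chosen for the example, not the block configurations used in this study. The 9-block case is omitted because it does not follow from this simple cyclic construction.

```python
from collections import Counter
from itertools import combinations


def cyclic_bib(num_blocks, base):
    """Build booklets by shifting a base set of block indices modulo num_blocks.

    Booklet i contains the blocks (b + i) mod num_blocks for each b in base.
    Whether the result is balanced depends on the base set (hypothetical here).
    """
    return [sorted((b + shift) % num_blocks for b in base)
            for shift in range(num_blocks)]


def is_balanced(num_blocks, booklets):
    """Return True if every pair of blocks shares the same number of booklets."""
    counts = Counter()
    for booklet in booklets:
        counts.update(combinations(booklet, 2))
    all_pairs = combinations(range(num_blocks), 2)
    return len({counts[pair] for pair in all_pairs}) == 1


# Illustrative arrangements for 7 and 13 blocks (assumed base sets).
booklets_7 = cyclic_bib(7, (0, 1, 3))
booklets_13 = cyclic_bib(13, (0, 1, 3, 9))
print(is_balanced(7, booklets_7), is_balanced(13, booklets_13))
```

A base set such as `(0, 1, 2)` fails the same check, which shows that balance is a property of the arrangement rather than of the block count alone.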

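Section 2: Suggestions for Improvement

In this simulation study, the common factors for horizontal and vertical equating were limited to three sample sizes (5,460, 7,500, and 10,000 examinees), three numbers of item blocks (7, 9, and 13), and a normal distribution of examinee ability; for vertical equating, three numbers of vertical anchor items (3, 6, and 9) were also used to compare the BIB and NEAT equating designs. The suggestions below address the limitations of this study and are offered for the reference of future researchers.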

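1. This study considered only one distribution of examinee ability and item parameters. Future research could compare equating results under different parameter distributions.

2. This study considered only three numbers of item blocks, three numbers of anchor items, and three sample sizes. Future research could examine equating results under other numbers of item blocks, numbers of anchor items, and sample sizes.

3. For the four BIB designs and three NEAT designs in this study, only one configuration of item blocks was used. Future research could examine the equating results of other configurations.

4. The BIB and NEAT equating designs in this study simulated only dichotomously scored response patterns. Future research could compare the equating results of the BIB and NEAT designs under polytomous scoring.

5. This study examined only two equating situations, horizontal and vertical, and did not investigate equating across different years. Future research could therefore compare equating results for tests of the same grade in different years and for tests of different grades in different years.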
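As a point of reference for the dichotomous simulation mentioned in suggestion 4, the sketch below generates 0/1 responses from a three-parameter logistic (3PL) model with normally distributed abilities. It is a hedged illustration: the 3PL form is assumed from the discrimination, difficulty, and guessing parameters named in the conclusions, the scaling constant D = 1.7 is omitted, and the item parameter values, sample size, and function names are hypothetical rather than those of this study.

```python
import math
import random


def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response: c + (1 - c) / (1 + exp(-a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(thetas, items, rng):
    """Return one row of 0/1 responses per examinee, one column per item."""
    return [
        [1 if rng.random() < p_correct_3pl(theta, a, b, c) else 0
         for (a, b, c) in items]
        for theta in thetas
    ]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Abilities drawn from a standard normal distribution (assumed mean 0, SD 1).
thetas = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Hypothetical (discrimination, difficulty, guessing) triples.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]

responses = simulate_responses(thetas, items, rng)
```

Extending such a simulation to polytomous scoring would require a different response model, which is the point of the suggestion above.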

