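Using the Evaluation Method of the Chinese Computerized Adaptive Assessment System

Chapter 4 Results and Discussion

Section 4 Using the Evaluation Method of the Chinese Computerized Adaptive Assessment System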

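This study uses empirical data to simulate the workflow of the computerized adaptive testing (CAT) system in order to evaluate how well the system performs. The empirical data provide each examinee's complete response pattern. A CAT simulation on these data then shows, for the basic-level Chinese proficiency item bank, how the RMSE relative to administering the complete item bank changes with test length under different ability estimation methods. The results for the four tests (A1 and A2 listening and reading) are described below for each estimation method.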
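One way to write the RMSE used in this comparison is given below. It assumes the RMSE measures the gap between each examinee's ability estimate after L items and the estimate obtained from the complete item bank. The symbols N (number of examinees), L (test length), and the superscripts are notation introduced here for illustration and do not come from the original text.

```latex
\mathrm{RMSE}(L) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\theta}_i^{(L)} - \hat{\theta}_i^{(\mathrm{full})}\right)^{2}}
```

Under this reading, RMSE(L) falls to 0 when L equals the full bank size, which matches the downward trend reported for all three estimators.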

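Figure 4-12 shows how the RMSE relative to the complete test changes with test length under the three ability estimation methods. The upper-left panel gives the results for the A1 listening test, the upper-right panel for the A1 reading test, the lower-left panel for the A2 listening test, and the lower-right panel for the A2 reading test. Whichever estimation method is used, the RMSE falls more and more clearly as the number of administered items accumulates. For the A1 listening test, the RMSE under MLE stays above 1 until 15 items have been administered and drops below 0.4 once the total reaches 31 items. Under MAP, the RMSE stays above 1 until 5 items have been administered and drops below 0.4 at 19 items. Under EAP, the RMSE is below 1 at every test length and falls below 0.4 at 6 items. The other three tests show the same pattern: of the three ability estimation methods, EAP yields the lowest RMSE overall. This result is similar to that of 陳柏熹 (2006), so EAP is the recommended estimation method for the system's adaptive function.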
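The comparison above can be illustrated with a minimal sketch of EAP estimation and the RMSE-by-test-length calculation. The excerpt does not state which IRT model or prior the system uses, so this sketch assumes a Rasch model, a standard normal prior, and an 81-point grid on [-4, 4]. The function names (`p_correct`, `eap_estimate`, `rmse_by_length`) are hypothetical, and items are taken in a fixed order rather than selected adaptively, which simplifies the real simulation.

```python
import numpy as np

# Assumed quadrature grid for the ability scale (not from the original study).
NODES = np.linspace(-4.0, 4.0, 81)


def p_correct(theta, b):
    """Rasch probability of a correct response (assumed model)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))


def eap_estimate(responses, b):
    """EAP ability estimate: posterior mean under a standard normal prior."""
    responses = np.asarray(responses, dtype=float)
    b = np.asarray(b, dtype=float)
    prior = np.exp(-0.5 * NODES**2)                  # unnormalized N(0, 1)
    p = p_correct(NODES[:, None], b[None, :])        # shape (nodes, items)
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    post = prior * like
    return float(np.sum(NODES * post) / np.sum(post))


def rmse_by_length(resp_matrix, b, lengths):
    """RMSE between the estimate after the first L items and the full-test estimate."""
    b = np.asarray(b, dtype=float)
    full = np.array([eap_estimate(r, b) for r in resp_matrix])
    result = {}
    for length in lengths:
        part = np.array([eap_estimate(r[:length], b[:length]) for r in resp_matrix])
        result[length] = float(np.sqrt(np.mean((part - full) ** 2)))
    return result
```

Because the prior keeps the posterior mean finite even for all-correct or all-incorrect patterns, an EAP estimate exists after a single item, whereas the MLE has no finite value for such patterns. This is one plausible reason for the lower RMSE of EAP at short test lengths, though the original text does not give an explanation.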

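[Figure 4-12. RMSE by test length for the EAP, MAP, and MLE estimation methods. Panels: A1 listening (upper left), A1 reading (upper right), A2 listening (lower left), A2 reading (lower right).]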

Chapter 5 Conclusions and Future Research Directions

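[...] combined with the Chinese computerized adaptive assessment system for test administration. Through this computerized adaptive assessment system, researchers can [...]

[...] immediately view the learning outcome report and receive feedback. On the administrator side, in addition to the item bank editing interface, there is a menu interface for assigning test papers. Through the item bank management interface, administrators can add or revise items in the item bank at any time. Through the test paper management interface, test functions can be configured for different user needs. For example, when items from a different test mode are used, the matching mode can be selected from the menu, or two or more examinee groups in the same region can be tested at different times.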

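In addition, for adaptive testing, the EAP ability estimation method is recommended.

Section 2 Future Research Directions

System implementation and actual test administration yielded considerable practical experience, which informs the following directions and suggestions for future research.

I. Development of Chinese Proficiency Tests

1. Tests covering the full range of examinee proficiency levels (six levels)

This study developed only the A-level Chinese comprehension tests. Future work could develop a more complete set of Chinese proficiency tests (B and C levels). Doing so requires attention to how standard setting procedures and methods are used to establish the cut scores that define the passing thresholds.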

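2. Using multimedia technology to develop more diverse item types and improve test authenticity and validity

The current system focuses on listening and reading tests, and its items are mainly multiple-choice. Under the CEFR framework, however, item types should not be restricted in this way (Figure 5-1). Any new item type would require re-evaluating model fit to confirm which measurement model is appropriate.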

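Figure 5-1 CEFR-based Chinese proficiency test

(2) Add Pinyin and Zhuyin (注音符號) input methods so that examinees can choose for themselves.

(3) Build recording and writing modules to handle more diverse item types.

(4) Build recording and writing modules so that examinees can take computerized Chinese proficiency tests of different types online, including listening, speaking, reading, and writing. How to score and assess these tests without human raters remains an open problem.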

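2. System implementation: The core adaptive testing program requires relatively complex matrix operations. It is currently written in PHP; rewriting it in C/C++ could be tried to improve performance and reduce computation time.

3. Computerized adaptive testing settings: The adaptive function currently supports only one initial setting; other initial settings, such as a random start, could be added. For item selection, only the maximum information method has been implemented; other strategies, such as selecting the item whose shifted difficulty is closest to the current ability estimate (最接近偏移難度法), could be added to meet users' testing needs and to give other researchers more experimental design options. Exposure control should also be considered so that use of the item bank better matches requirements.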
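The item selection ideas above can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes a 2PL model, and the exposure control shown is the "randomesque" approach (choose at random among the k most informative unused items), which is swapped in here as one common option rather than taken from the original text. All function and parameter names are hypothetical.

```python
import numpy as np


def item_information(theta, a, b):
    """Fisher information of 2PL items at ability theta (assumed model)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)


def _masked_information(theta, a, b, administered):
    """Item information with already administered items set to -inf."""
    info = item_information(theta, a, b)
    used = np.array(sorted(administered), dtype=int)
    info[used] = -np.inf
    return info


def select_max_info(theta, a, b, administered):
    """Maximum information selection: the most informative unused item."""
    return int(np.argmax(_masked_information(theta, a, b, administered)))


def select_randomesque(theta, a, b, administered, k, rng):
    """Exposure control sketch: random pick among the k most informative unused items."""
    info = _masked_information(theta, a, b, administered)
    top = np.argsort(info)[::-1][:k]
    top = top[np.isfinite(info[top])]      # drop administered items
    return int(rng.choice(top))
```

A random start, as suggested above, could be obtained by calling `select_randomesque` with a large `k` for the first item and switching to `select_max_info` afterwards. Spreading selections over several near-optimal items trades a little measurement precision for lower exposure of the most informative items.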

References


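中國漢語水平考試 (2012). Chinese Proficiency Test (HSK). Retrieved May 19, 2012, from http://www.hsk.org.cn/index.aspx

白樂桑、張麗 (2008). Insights and impetus from the new concepts of the Common European Framework of Reference for Languages for Chinese language teaching: Chinese language teaching at a crossroads. 世界漢語教學, 3, 58-74.

多媒體英語學會 (2007). Common European Framework of Reference for Languages (Chinese translation). Kaohsiung: 和遠.

何榮桂 (2006). A study of international trends in computerized testing. Paper presented at the Symposium on Trends in Computerized Testing and Computerized National Examinations, May 29, 2006, Taipei.

余慕薌 (2008). APEC second foreign language standards and their evaluation: Trends, opportunities, and implications (Part 2). APEC 通訊, 103, 15-16.

國家華語測驗推動工作委員會 (2012b). Children's Chinese proficiency test. Retrieval date: 2012

楊振升、洪淑萍 (2002). Basic competence indicators and their transformation: The language learning area as an example. 教育研究月刊, 96, 23-33.

實用漢語水平認定考試 (2012). Test of Practical Chinese (C.TEST). Retrieved May 19, 2012, from http://www.c-test.org.cn/index.asp

蔡雅薰 (2009). Constructing the principles for grading Chinese language teaching materials. Taipei County: 正中.

錢永財 (2006). A simulation study of computerized adaptive testing using the a-neighbor method (a-鄰近法) as the item selection strategy. Unpublished master's thesis, Graduate Institute of Educational Measurement and Statistics, National Taichung University of Education, Taichung.

錢永財、劉家惠、郭伯臣 (2005). A comparison of item exposure rates in computerized adaptive testing under a-neighbor (a-鄰近法) item selection. Paper presented at the 2005 Conference on Educational and Psychological Testing, National Chengchi University, Taipei.

藍珮君 (2007). The correspondence between the basic Chinese proficiency test and the Common European Framework. Paper presented at the Third International Forum on Chinese Language Teaching, December 1-2, 2007, National Taiwan Normal University.

籃玉如 (2009). Design concepts and practice of integrating information technology into Chinese language teaching. Paper presented at the 6th International Conference on Internet Chinese Education, June 19-21, 2009, Taipei.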

English-language references

Ackerman, T. A. (1991). The use of unidimensional parameter estimates of multidimensional items in adaptive testing. Applied Psychological Measurement, 13, 113-127.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.

Allen, N. L., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 technical report. Washington, DC: National Center for Educational Statistics.

Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (chaps. 17-20). Reading, MA: Addison-Wesley.

Boar, B. H. (1984). Application prototyping: A requirements definition strategy for the '80s. John Wiley & Sons, New York.

Bock, R. D. & Mislevy, R. J. (1982). Adaptive EAP Estimation of Ability in A Microcomputer Environment. Applied Psychological Measurement, 6, 431-444.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bose, R. C. & Nair, K. R. (1939). Partially balanced incomplete block designs. Sankhya, 4, 337-372.

Brennan, R. L., & Kolen, M. J. (1987). Some practical issues in equating. Applied Psychological Measurement, 11, 279-290.

Brennan, R. L. (2008). A Discussion of Population Invariance. Applied Psychological Measurement, 32(1), 102-114.

Chang, H., Qian, J., & Ying, Z. (2001). a-Stratified Multistage Computerized Adaptive Testing with b-Blocking. Applied Psychological Measurement, 25, 333-341.

College Board (2012a). Chinese with Listening. Retrieved May 20, 2012, from http://www.collegeboard.com/student/testing/sat/lc_two/chinese/chinese.html?chinese

College Board (2012b). Chinese language and culture. Retrieved May 7, 2012, from http://www.collegeboard.com/student/testing/ap/sub_chineselang.html

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement: Issues and Practice, 10, 37-45.

Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge, UK: Cambridge University Press.

Dorans, N. J. & Holland, P. W. (2000). Linking Scores from Multiple Instruments. Evaluation of National and State Assessments of Evaluation. Board on Educational Testing and Assessment. Washington, DC: National Academy Press.

Dorans, N. J. & Liu, J. (2008). Anchor Test Type and Population Invariance: An Exploration Across Subpopulations and Test Administrations. Applied Psychological Measurement, 32(1), 81-97.

Haebara, T. (1980). Equating Logistic Ability Scales by a Weighted Least Squares Method. Japanese Psychological Research, 22, 144-149.

Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston, MA: Kluwer-Nijhoff.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: SAGE.

Hung, P. H. (1988). Application of Computerized Adaptive Testing to The University Entrance Exam of Taiwan, R. O. C. Unpublished doctoral dissertation, University of Minnesota, Minnesota.

Kang, T., & Cohen, A. S. (2007). IRT Model Selection Methods for Dichotomous Items. Applied Psychological Measurement, 31(4), 331-358.

Kao, C. W., Kim, S., & Hatrak, N. (2005). Scale drift study for a large-scale English proficiency test. Paper presented at the annual meeting of the Northeastern Educational Research Association (NERA) held between October 19 and 21, 2005 in Kerhonkson, N.Y.

Kecker, G., & Eckes, T. (2007). Linking the TestDaF to the CEFR: The case of writing proficiency. Paper presented at the Fourth Annual Conference of EALTA. Retrieved August 4, 2009, from http://www.ealta.eu.org/conference/2007/docs/pres_sunday/Kecker&Eckes.pdf

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M. J. & Brennan, R. L. (1995). Test Equating: Methods and Practices. New York: Springer-Verlag.

Kolen, M. J. & Brennan, R. L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York: Springer-Verlag.

Kuo, B.-C., Tseng, H.-C., and Shih, S.-C. (2013). A Computerized Adaptive Testing System for Undergraduate Level Chinese Reading Proficiency. Turkish Online Journal of Educational Technology. (Accepted).

Leonard, T., & Hsu, J. S. J. (1999). Bayesian methods. New York: Cambridge University Press.

Li, F., Cohen, A. S., Kim, S-H., & Cho, S-J. (2009). Model Selection Methods for Mixture Dichotomous IRT Models. Applied Psychological Measurement, 33(5), 353-373.

Lord, F. M. (1977). Practical Applications of Item Characteristic Curve Theory. Journal of Educational Measurement, 14, 117-138.

Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Wingersky, M. S. (1984). Comparing IRT true-score and equipercentile observed score "equatings." Applied Psychological Measurement, 8, 452-461.

Marco, G., Petersen, N., & Stewart, E. (1979). A test of the adequacy of curvilinear score equating models. Paper presented at the Computerized Adaptive Testing Conference, Minneapolis, MN.

Martin, M. O., Mullis, I. V.S., & Chrostowski, S. J. (Eds). (2004). TIMSS 2003 Technical Report. Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation, and Educational Policy.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.

McBride, J. R. & Martin, J. T. (1983). Reliability and Validity of Adaptive Ability Tests in a Military Setting. In D. J. Weiss (Ed.), New Horizons in Testing: Latent Trait Test Theory and Computerized Adaptive Testing (pp. 223-236). New York: Academic Press.

Mislevy, R. J. & Bock, R. D. (1990). PC-BILOG-Item analysis and test scoring with binary logistic models [Computer software]. Mooresville, IN: Scientific Software.

Mislevy, R. J. & Sheehan, K. M. (1987). Marginal estimation procedures. In A. E. Beaton (Ed.), The NAEP 1983-1984 Technical Report (Report No. 15-TR-20). Princeton, NJ: Educational Testing Service.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16(2), 159-176.

Muraki, E., & Bock, R. D. (1991). PARSCALE: Parameter scaling of rating data [Computer software]. Chicago: Scientific Software International, Inc.

Petersen, N. S. (2008). A Discussion of Population Invariance of Equating. Applied Psychological Measurement, 32(1), 98-101.

NCACLS (2012). National Preparation Test for SAT Subject Test in Chinese with Listening. Retrieved May 7, 2012, from http://www.scccs.net/events/event34/SATII/2010SATII.pdf

Nemhauser, G. L., & Wolsey, L. A. (1999). Integer and combinatorial optimization. New York: John Wiley.

OECD (2005). PISA 2003 Technical Report. Paris: OECD.

OECD (2009). PISA 2006 Technical Report. Paris: OECD.

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8(2), 135-156.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, Norming, and Equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.

Puhan, G. (2007). Scale drift in equating on a test that employs cut scores. RR-07-34, Educational Testing Service, Princeton, New Jersey.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Chicago: University of Chicago Press.

Rust, K. F., & Johnson, E. G. (1992). Sampling and weighting the national assessment. Journal of Educational Statistics, Special Issue: National Assessment of Educational Progress, 17(2), 111-129.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

Stocking, M. L. & Lord, F. M. (1983). Developing a Common Metric in Item Response Theory. Applied Psychological Measurement, 7(2), 201-211.

Stocking, M. L. (1994). Three Practical Issues for Modern Adaptive Testing Item Pools. Educational Testing Service, Princeton, N. J.

Tannenbaum, R. J., & Wylie, E. C. (2005). Mapping English proficiency test scores onto the Common European Framework (TOEFL Research Rep. No. RR-80). Retrieved August 4, 2012, from http://www.ets.org/Media/Research/pdf/RR-05-18.pdf

Taylor, L. (2004). IELTS, Cambridge ESOL examination and the Common European Framework. Research Notes, 18, 2-3. University of Cambridge, ESOL Examinations.

Tianyou, W. (2005). An Alternative Continuization Method to the Kernel Method in von Davier, Holland and Thayer's (2004) Test Equating Framework.

Tozer, M. (1987). The joy of strength and movement: A centennial appreciation of Edward Thring. Physical Education Review, 10(1), 58-63.

U.S. Department of State (2006). National security language initiative. Retrieved May 21, 2009, from http://merln.ndu.edu/archivepdf/nss/state/58733.pdf

van der Linden, W. J., Veldkamp, B. P., & Carlson, J. E. (2004). Optimizing balanced incomplete block designs for educational assessments. Applied Psychological Measurement, 28, 317-331.

von Davier, A. A., & Liu, M. (2008). Population Invariance of Test Equating and Linking: Theory Extension and Applications Across Exams. Applied Psychological Measurement, 32(1), 9-10.

von Davier, A. A., & Wilson, C. (2008). Investigating the population sensitivity assumption of Item Response Theory true-score equating across two subgroups of examinees and two test formats. Applied Psychological Measurement, 32(1), 11-26.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004). The kernel method of test equating. New York: Springer.

Wang, H.-P., Kuo, B.-C., Tsai, Y.-H., and Liao, C.-H. (2012). A CEFR-based Computerized Adaptive Testing System for Chinese Proficiency. Turkish Online Journal of Educational Technology, 11(4), 1-12.

Wang, T., & Vispoel, W. P. (1998). Properties of Ability Estimation Methods in Computerized Adaptive Testing. Journal of Educational Measurement, 35, 109-135.

Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson & S. L. Hershberger (Eds.), The new rules of measurement. Mahwah, NJ: Lawrence Erlbaum Associates.

Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). Acer ConQuest. Melbourne, Victoria, Australia: Australian Council for Educational Research press.

Yang, W.-L., & Gao, R. (2008). Invariance of Score Linkings Across Gender Groups for Forms of a Testlet-Based College-Level Examination Program Examination. Applied Psychological Measurement, 32, 45-61.

Yates, F. (1936). A new method of arranging variety trials involving a large number of varieties. J. Agric. Sci., 26, 424-455.

Yi, Q., Harris, D. J., & Gao, X. (2008). Invariance of Equating Functions Across Different Subgroups of Examinees Taking a Science Achievement Test. Applied Psychological Measurement, 32(1), 62-80.

Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (2003). BILOG-MG. Scientific Software International.

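Appendix 1 Basic-Level Chinese Comprehension Ability Indicators and Assessed Attributes

I. A1-Level Chinese Ability Indicators and Assessed Attributes

Assessable Chinese abilities and the corresponding Chinese ability indicators (indicator descriptions)

Comprehension (8)

Aural comprehension (2)

A1.2.1.1 Can follow speech that is slow and carefully articulated
A1.2.1.2 Can understand short, simple, slowly delivered instructions

Visual comprehension (6)

A1.2.2.1 Can understand very short, simple texts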
