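Recommendations - Conclusions and Recommendations - Estimation and Investigation of Population Invariance of Equating in Large-Scale Tests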

Chapter 5 Conclusions and Recommendations

Section 2 Recommendations

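This study used simulation experiments to examine, under a range of conditions, whether test data still satisfy the population invariance property after the equating estimation procedure. Because the data were simulated from a normal distribution, the two tests were given the same number of items and the same number of examinees, and item difficulty was set to be the same across tests, the simulated conditions may differ somewhat from real testing situations. The following recommendations address the limitations of this study and are offered for the reference of future researchers: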

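1. The ability distribution of the simulated data in this study was normal, but real data are not necessarily normally distributed. Future studies could repeat the investigation with other ability distributions.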

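2. This study used the NEAT (non-equivalent groups with anchor test) design for equating. Future work could extend the analysis to more complex equating designs, such as the BIB (balanced incomplete block) design or the PBIB (partially balanced incomplete block) design.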

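3. This study examined only natural variables and ability variables, so the manipulated variables were the total number of examinees, the subgroup proportion, the test length, and the ability gap between subgroups. Future studies could add item parameter variables, the proportion of anchor items, or other variables.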

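4. This study used the test characteristic curve method and the fixed item parameter method for equating. Future studies could use other equating methods, such as kernel equating.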

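5. When evaluating equating results, researchers usually consider only the error between the true and estimated parameter values and do not take group error into account. We recommend that test analysts, in addition to comparing parameter estimation errors, also consider group error when evaluating equating results; doing so would make equating practice more complete.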

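6. The three group-error measures REMSD, RESD, and RMSD serve different purposes, and future researchers can choose among them according to need. REMSD summarizes the overall group error, which makes it convenient for checking whether population invariance holds, but it provides no further detail. RESD gives the group error for each subgroup, so it shows both each subgroup's error and how the errors of the two subgroups relate to each other. RMSD gives the group error at each score point; the closer the group errors are across score points, the fairer the equating result.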

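7. Studies using empirical data should ideally apply all three measures of population invariance so that equating results are evaluated more rigorously. Simulation studies involve many replications, and RMSD is inconvenient there because the results for every simulated data set must be listed; in that case REMSD can be used to check the overall group error, or RESD to check the subgroup group error.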
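The three group-error measures discussed above can be illustrated with a short sketch. This chapter does not restate their formulas, so the code assumes the definitions commonly attributed to Dorans and Holland (2000): RMSD(x) is the weighted root mean square difference between each subgroup's equated score and the total-group equated score at raw score x, divided by the total-group standard deviation of scores on the reference form; RESD applies the same idea to one subgroup, averaged over the raw-score distribution; REMSD pools the subgroups using their weights. Taking every expectation over a single raw-score distribution is an assumption of this sketch, and the function names, argument names, and toy numbers are hypothetical rather than taken from the thesis.

```python
import math
from typing import Dict, List, Sequence


def _mean_sq_diff(sub: Sequence[float], total: Sequence[float],
                  probs: Sequence[float]) -> float:
    """Expected squared difference between a subgroup conversion and the
    total-group conversion, weighted by the raw-score probabilities."""
    return sum(p * (s - t) ** 2 for p, s, t in zip(probs, sub, total))


def rmsd(total_eq: Sequence[float], sub_eq: Dict[str, Sequence[float]],
         weights: Dict[str, float], sd_total: float) -> List[float]:
    """RMSD at every raw-score point (one value per score)."""
    values = []
    for x, e_total in enumerate(total_eq):
        ss = sum(w * (sub_eq[g][x] - e_total) ** 2 for g, w in weights.items())
        values.append(math.sqrt(ss) / sd_total)
    return values


def resd(total_eq: Sequence[float], one_sub_eq: Sequence[float],
         score_probs: Sequence[float], sd_total: float) -> float:
    """RESD for a single subgroup (one value per subgroup)."""
    return math.sqrt(_mean_sq_diff(one_sub_eq, total_eq, score_probs)) / sd_total


def remsd(total_eq: Sequence[float], sub_eq: Dict[str, Sequence[float]],
          weights: Dict[str, float], score_probs: Sequence[float],
          sd_total: float) -> float:
    """REMSD: one overall value pooling all subgroups and score points."""
    pooled = sum(w * _mean_sq_diff(sub_eq[g], total_eq, score_probs)
                 for g, w in weights.items())
    return math.sqrt(pooled) / sd_total


if __name__ == "__main__":
    # Hypothetical three-point score scale, not data from this study.
    total = [0.0, 1.0, 2.0]
    subs = {"M": [0.0, 1.2, 2.0], "F": [0.0, 0.8, 2.0]}
    w = {"M": 0.5, "F": 0.5}          # subgroup proportions (assumed)
    probs = [0.25, 0.5, 0.25]         # raw-score distribution (assumed)
    sd = 2.0                          # total-group score SD (assumed)
    print("RMSD :", rmsd(total, subs, w, sd))
    print("RESD :", {g: resd(total, subs[g], probs, sd) for g in subs})
    print("REMSD:", remsd(total, subs, w, probs, sd))
```

Under these assumptions the squared REMSD equals the weighted sum of the squared RESD values, which matches the division of labor described above: REMSD gives one overall figure, RESD one figure per subgroup, and RMSD one figure per score point.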


Appendix 1 Ability Parameter Estimation Errors for Different Variables

Appendix 2 Transformed Scale Scores in Experiment 3

Appendix Table 2-1 Transformed scale scores for the total group

Raw     TCC                             FIX
score   S1      S2      S3      S4      S1      S2      S3      S4
0       0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
1       1.000   1.107   1.079   1.060   1.000   1.098   1.069   1.031
2       2.000   2.215   2.158   2.119   2.000   2.195   2.139   2.062
3       3.000   3.322   3.236   3.179   3.000   3.293   3.208   3.093
4       4.000   4.430   4.315   4.239   4.000   4.391   4.278   4.124
5       5.000   5.537   5.394   5.298   5.000   5.489   5.347   5.155
6       6.000   6.625   6.475   6.361   6.000   6.586   6.416   6.186
7       7.000   7.500   7.346   7.188   7.000   7.684   7.239   7.058
8       8.000   8.379   8.165   7.925   8.000   8.464   8.078   7.802
9       9.000   9.283   8.984   8.696   9.000   9.299   8.920   8.605
10      10.000  10.208  9.808   9.516   10.000  10.179  9.762   9.455
11      11.000  11.148  10.636  10.378  11.000  11.089  10.603  10.336
12      12.000  12.100  11.469  11.270  12.000  12.021  11.444  11.238
13      13.000  13.059  12.307  12.178  13.000  12.971  12.289  12.154
14      14.000  14.023  13.153  13.092  14.000  13.935  13.139  13.080
15      15.000  14.989  14.007  14.007  15.000  14.911  13.997  14.013
16      16.000  15.958  14.873  14.923  16.000  15.897  14.866  14.952
17      17.000  16.935  15.750  15.845  17.000  16.892  15.748  15.898
18      18.000  17.923  16.642  16.780  18.000  17.896  16.644  16.850
19      19.000  18.925  17.550  17.737  19.000  18.909  17.557  17.812
20      20.000  19.943  18.476  18.723  20.000  19.930  18.487  18.785
21      21.000  20.976  19.422  19.742  21.000  20.960  19.435  19.773
22      22.000  22.021  20.390  20.793  22.000  21.996  20.405  20.781
23      23.000  23.072  21.381  21.872  23.000  23.035  21.397  21.812
24      24.000  24.118  22.399  22.966  24.000  24.067  22.416  22.868
25      25.000  25.143  23.445  24.060  25.000  25.083  23.464  23.941
26      26.000  26.132  24.523  25.128  26.000  26.072  24.546  25.016
27      27.000  27.072  25.643  26.153  27.000  27.028  25.672  26.074
28      28.000  27.969  26.847  27.166  28.000  27.957  26.875  27.137
29      29.000  28.872  28.254  28.319  29.000  28.895  28.264  28.341
30      30.000  30.000  30.000  30.000  30.000  30.000  30.000  30.000

Note: S1 denotes the total group for test form S1.

Appendix Table 2-2 Transformed scale scores for the male subgroup

Raw     TCC                             FIX
score   S1_M    S2_M    S3_M    S4_M    S1_M    S2_M    S3_M    S4_M
0       0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
1       0.980   1.106   1.097   1.055   1.019   1.117   1.076   1.050
2       1.960   2.211   2.194   2.109   2.037   2.234   2.153   2.100
3       2.940   3.317   3.291   3.164   3.056   3.352   3.229   3.150
4       3.920   4.423   4.388   4.218   4.074   4.469   4.306   4.199
5       4.900   5.528   5.484   5.273   5.093   5.586   5.382   5.249
6       5.879   6.664   6.633   6.327   6.112   6.703   6.459   6.299
7       7.025   7.605   7.569   7.374   7.139   7.800   7.355   7.240
8       8.198   8.500   8.409   8.201   8.173   8.600   8.220   8.075
9       9.249   9.398   9.217   9.019   9.194   9.449   9.075   8.915
10      10.249  10.305  10.012  9.854   10.209  10.336  9.921   9.775
11      11.230  11.222  10.804  10.710  11.219  11.250  10.763  10.655
12      12.208  12.152  11.600  11.580  12.225  12.185  11.603  11.551
13      13.188  13.095  12.404  12.456  13.228  13.137  12.445  12.456
14      14.174  14.054  13.221  13.335  14.226  14.104  13.292  13.366
15      15.163  15.029  14.054  14.215  15.221  15.082  14.147  14.280
16      16.154  16.021  14.904  15.098  16.213  16.072  15.013  15.195
17      17.145  17.027  15.774  15.992  17.199  17.073  15.890  16.112
18      18.132  18.048  16.664  16.903  18.181  18.084  16.781  17.034
19      19.112  19.081  17.575  17.838  19.156  19.108  17.687  17.964
20      20.081  20.126  18.505  18.802  20.124  20.145  18.608  18.906
21      21.035  21.180  19.456  19.797  21.083  21.196  19.547  19.867
22      21.974  22.239  20.429  20.826  22.035  22.259  20.505  20.854
23      22.904  23.295  21.424  21.888  22.980  23.329  21.484  21.873
24      23.834  24.332  22.444  22.978  23.922  24.393  22.487  22.929
25      24.771  25.328  23.489  24.082  24.866  25.426  23.517  24.021
26      25.719  26.270  24.562  25.177  25.814  26.400  24.575  25.130
27      26.677  27.157  25.669  26.231  26.763  27.302  25.668  26.221
28      27.652  28.016  26.844  27.256  27.718  28.152  26.824  27.281
29      28.697  28.911  28.195  28.366  28.726  29.016  28.138  28.412
30      30.000  30.000  30.000  30.000  30.000  30.000  30.000  30.000

Note: S1_M denotes the male subgroup for test form S1.

Appendix Table 2-3 Transformed scale scores for the female subgroup

Raw     TCC                             FIX
score   S1_F    S2_F    S3_F    S4_F    S1_F    S2_F    S3_F    S4_F
0       0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
1       0.979   1.042   0.977   1.012   0.989   1.107   1.038   1.030
2       1.958   2.084   1.953   2.024   1.978   2.214   2.076   2.061
3       2.937   3.126   2.930   3.036   2.967   3.321   3.114   3.091
4       3.915   4.168   3.907   4.048   3.956   4.428   4.152   4.122
5       4.894   5.210   4.884   5.060   4.945   5.535   5.190   5.152
6       5.873   6.251   5.860   6.072   5.935   6.642   6.228   6.183
7       6.800   7.208   6.845   6.843   6.922   7.749   7.110   7.004
8       7.715   8.135   7.719   7.522   7.912   8.503   7.957   7.670
9       8.707   9.096   8.594   8.284   8.905   9.336   8.800   8.431
10      9.747   10.079  9.486   9.122   9.897   10.206  9.639   9.260
11      10.808  11.073  10.389  10.020  10.891  11.100  10.477  10.130
12      11.868  12.070  11.299  10.962  11.885  12.014  11.318  11.026
13      12.916  13.063  12.213  11.930  12.880  12.943  12.165  11.940
14      13.946  14.047  13.128  12.907  13.877  13.885  13.019  12.868
15      14.958  15.020  14.040  13.881  14.876  14.840  13.884  13.806
16      15.955  15.984  14.947  14.845  15.877  15.806  14.761  14.755
17      16.941  16.942  15.845  15.799  16.881  16.782  15.652  15.714
18      17.924  17.898  16.736  16.750  17.889  17.770  16.558  16.683
19      18.910  18.856  17.622  17.707  18.902  18.770  17.480  17.665
20      19.906  19.819  18.511  18.680  19.919  19.785  18.419  18.661
21      20.917  20.789  19.408  19.676  20.943  20.815  19.377  19.675
22      21.947  21.767  20.321  20.699  21.973  21.863  20.355  20.710
23      22.995  22.759  21.256  21.748  23.008  22.930  21.356  21.768
24      24.055  23.767  22.216  22.819  24.044  24.012  22.383  22.850
25      25.118  24.794  23.207  23.903  25.075  25.102  23.439  23.949
26      26.172  25.830  24.240  24.985  26.092  26.182  24.528  25.048
27      27.219  26.850  25.332  26.054  27.096  27.226  25.659  26.128
28      28.274  27.824  26.532  27.133  28.117  28.192  26.862  27.201
29      29.336  28.771  27.983  28.334  29.243  29.065  28.236  28.365
30      30.000  30.000  30.000  30.000  30.000  30.000  30.000  30.000

Note: S1_F denotes the female subgroup for test form S1.
