A Study of the Effects of Scoring Rubrics and Scoring Procedures on Rater Consistency in Mathematics Performance Assessments


Keywords: rater consistency, holistic scoring rubrics, mathematics performance assessments, generalizability theory, many-facet Rasch model

An investigation of scoring and score resolution procedures on rater reliability of mathematics performance-based assessments

(An investigation of score resolution procedures on results of performance assessments and a comparison of three rater reliability detection approaches)

Lily Chang & Sujen Lo

Department of Educational Psychology and Counseling, National Pingtung University of Education

The two major purposes of this study were: (1) to compare four rater reliability detection methods, i.e., agreement index (percentage of agreement), correlation, generalizability theory, and many-facet Rasch analysis; and (2) to examine the effects of three score resolution methods on results of performance assessment.

Participants were 438 sixth-grade students from 13 classes in eight elementary schools and six full-time elementary school teachers. All six teachers were part-time graduate students with 5 to 10 years of teaching experience and no prior experience in designing or grading performance assessment tasks.

The instrument, developed by the researchers, comprised 15 paper-and-pencil mathematics performance assessment (PA) tasks measuring fraction concepts, distributed across three test booklets. The researchers also constructed a generalized holistic scoring rubric with a 7-point scale (0-6). Each rater graded the papers of six classes (about 200 papers), and each task was rated independently by two to four raters. Rater consistency was analyzed and compared using the agreement index (percentage of agreement), correlation, generalizability theory, and the many-facet Rasch model.


Introduction

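Validity refers to how valid test results or scores are, that is, the degree to which a test provides accurate information for making appropriate interpretations and decisions. When examining the reliability and validity of performance assessment results, we can gather six kinds of evidence (content, response processes, internal structure, score generalizability, relations to external variables, and testing consequences) to support the soundness of interpretations of the results or of decisions based on them (Messick, 1994, 1995; AERA, APA, & NCME, 1999). Establishing validity begins at the assessment construction stage and continues through administration, interpretation of results, and even the stage of consequential impact; it is a continuous process of gathering evidence.

Traditional approaches to the validity of achievement test results relied mostly on content-based evidence. Once the unified concept centered on construct validity was proposed, achievement test results came to be held to the same standards as other tests. With the rise of performance assessment, whose constructs are complex, whose purposes are multiple, and whose influences on student performance are numerous, the evidence gathered in validating performance assessment results has become broader and more varied than before. In building this evidence, however, we still largely follow the methods used for non-achievement tests. At the construction stage, for example, we gather evidence related to assessment content and response processes; when the construct is complex (e.g., problem solving), we examine its internal structure not only through internal consistency but also through factor analysis (both exploratory and confirmatory). After administration, we gather evidence of score generalizability (reliability) and of relations to external variables, or evidence of consequences.

Because performance assessments are scored "subjectively," we also examine the validity and reliability of the scoring rubrics along the way. The validity of scoring rubrics and scoring procedures, however, is still examined mostly through expert review. Although scoring rubrics and procedures do not directly involve the substantive content of the assessment, they are, according to Clauser (2000), more directly tied to score interpretation than any other factor affecting validity. Dunbar, Koretz, and Hoover (1991) likewise noted that unreliable scoring erodes public trust in scores and adds to the reliability equation a component that is absent from tests that can be scored objectively.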

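Because the consistency of raters' scores is a key element in public acceptance of performance assessment (張麗麗, 2002), and because it directly affects the accuracy of interpretations of assessment results, past research on scoring has focused mostly on rater consistency (Becker, Hess, & Gibney, 1993; Blok, 1985; Engelhard, 1992, 1994; Lane & Sabers, 1989; Lunz, Wright, & Linacre, 1990). Research to date agrees that there are two principal ways to reduce rater error or raise rater consistency: first, develop valid scoring rubrics; second, plan rigorous rater training (including concrete anchor papers and explicit scoring procedures). Recent studies do support the claim that these two measures can raise rater consistency or reduce the variance due to rater error to a very small proportion (吳欣黛, 1998; 林敬修, 2003; Gao, 1996;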


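[...] if questions or difficulties arose, the researchers provided additional explanation and discussion as needed. To avoid rater effects (e.g., the halo effect), all examinee background information was removed from the papers, and scores were recorded on a separate record sheet for each item.

Table 2. Rater scoring design
Rater codes: A B C D E F

Class code   Rater marks
01-0102      V V
02-0311      V V
03-0406      V V
04-0207      V V
05-0605      V V V V
06-0315      V V V V
07-0107      V V
08-0502      V V V V
09-0707      V V V V
10-0210      V V V V
11-0503      V V
12-0812      V V
13-0306      V V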

Data Analysis Methods

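The researchers examined and compared rater consistency in four ways: percentage of agreement, correlation coefficients (Spearman rank correlation), generalizability theory, and the many-facet Rasch model (Linacre, 1994). The method used to examine rater consistency was chosen according to the specific research question. Only generalizability theory and the many-facet Rasch model are outlined below.

Generalizability Theory

Generalizability theory extends the true-score model of classical test theory. It emphasizes the "reliability" or "dependability" of measurement results: a measurement is a sample from a universe of admissible observations, observations that a decision maker is willing to treat as interchangeable when making a decision (Brennan, 1992, 2001; Shavelson & Webb, 1991). Under this framework, we assume that any rater is a random sample from, or an interchangeable member of, the universe of generalization. This universe consists of all possible raters to whom the researcher wishes to generalize, and the mean of all scores in the universe is the universe true score.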
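As a rough illustration of how this framework turns variance components into coefficients, the sketch below computes the generalizability and dependability coefficients for a fully crossed p×t×r (examinee × task × rater) random design. The formulas are the standard ones for that design (see Shavelson & Webb, 1991), not necessarily the exact computation used in this study; the class and function names are hypothetical, and the variance components are the "Raters 1,2" column of Table 7.

```python
from dataclasses import dataclass


@dataclass
class VarianceComponents:
    """Estimated variance components for a crossed p x t x r design."""
    p: float       # examinees (object of measurement)
    t: float       # tasks
    r: float       # raters
    pt: float      # examinee x task
    pr: float      # examinee x rater
    tr: float      # task x rater
    ptr_e: float   # three-way interaction confounded with error


def relative_error(vc, n_tasks, n_raters):
    """Relative error variance: only effects that interact with examinees."""
    return vc.pt / n_tasks + vc.pr / n_raters + vc.ptr_e / (n_tasks * n_raters)


def absolute_error(vc, n_tasks, n_raters):
    """Absolute error variance: every effect except the examinee effect."""
    return (relative_error(vc, n_tasks, n_raters)
            + vc.t / n_tasks + vc.r / n_raters
            + vc.tr / (n_tasks * n_raters))


def g_coefficient(vc, n_tasks, n_raters):
    """Generalizability coefficient (relative decisions)."""
    return vc.p / (vc.p + relative_error(vc, n_tasks, n_raters))


def phi_coefficient(vc, n_tasks, n_raters):
    """Dependability coefficient (absolute decisions)."""
    return vc.p / (vc.p + absolute_error(vc, n_tasks, n_raters))


def tasks_needed(coefficient, vc, n_raters, target=0.80, max_tasks=100):
    """Smallest number of tasks whose coefficient reaches the target."""
    for n_tasks in range(1, max_tasks + 1):
        if coefficient(vc, n_tasks, n_raters) >= target:
            return n_tasks
    return None


# Variance components reported for Raters 1,2 in Table 7.
raters_1_2 = VarianceComponents(p=1.805, t=1.071, r=0.0, pt=2.385,
                                pr=0.0, tr=0.005, ptr_e=0.190)

print(tasks_needed(g_coefficient, raters_1_2, n_raters=1))
```

Under these assumptions, one rater needs six tasks for the generalizability coefficient to reach .80, which agrees with the corresponding entry in Table 7. Passing `phi_coefficient` instead runs the same search for absolute decisions.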


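Dependability coefficient: $\Phi = \dfrac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta}$, where $n'_r$ is the number of raters in the decision study and $n'_s$ is the [...] of "scoring rubrics" in the decision study.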

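Many-Facet Rasch Model

Linacre (1994) extended the Rasch model to provide a framework for judging scoring quality in performance assessments. The many-facet Rasch model can include multiple facets, for example item difficulty, examinee ability, ...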


Table 4. Means and standard deviations of the scores assigned by each rater on each item
Columns: Item; All raters (n=438); Rater A (n=205); Rater B (n=205); Rater C (n=195); Rater D (n=195); Rater E (n=199); Rater F (n=199). Each cell reports Mean (sd).


Table 5. Percentage of scoring disagreement for each rater pair
Cells under discrepancy scores 0 to 6 are N(%).

Item  Pair  0          1         2        3       4       5       6       Disagreement %  Disagreement >1 %
1     A,B   124(60.5)  56(27.3)  16(7.8)  3(1.5)  3(1.5)  1(0.5)  2(1.0)  39.5            12.2
1     C,D   129(66.8)  54(27.7)  8(4.1)   2(1.0)  (0)     (0)     (0)     33.2            5.5
1     E,F   130(65.7)  51(25.8)  14(7.1)  2(1.0)  1(0.5)  (0)     (0)     34.3            8.5
1     D,F   57(58.8)   30(30.9)  9(9.3)   1(1.0)  (0)     (0)     (0)     41.2            10.3

2     A,B   155(75.6)  37(18.0)  10(4.9)  3(1.5)  (0)     (0)     (0)     24.4            6.4
2     C,D   145(74.4)  31(15.9)  14(7.2)  4(2.1)  1(0.5)  (0)     (0)     25.6            9.7
2     E,F   164(82.8)  27(13.6)  7(3.5)   (0)     (0)     (0)     (0)     17.2            3.6
2     D,F   72(74.2)   19(19.6)  4(4.1)   2(2.1)  (0)     (0)     (0)     25.8            6.2

3     A,B   176(85.9)  23(11.2)  6(2.9)   (0)     (0)     (0)     (0)     14.1            2.9
3     C,D   151(77.4)  36(18.5)  6(3.1)   2(1.0)  (0)     (0)     (0)     22.6            4.1
3     E,F   175(88.4)  17(8.6)   6(3.0)   (0)     (0)     (0)     (0)     11.6            3.0
3     D,F   74(76.3)   17(17.5)  5(5.2)   1(1.0)  (0)     (0)     (0)     23.7            6.2

4     A,B   168(82.0)  34(16.6)  1(0.5)   2(1.0)  (0)     (0)     (0)     18.0            1.5
4     C,D   148(73.3)  38(19.5)  8(4.1)   5(2.6)  (0)     (0)     (0)     26.7            6.7
4     E,F   163(81.9)  31(15.6)  4(2.0)   1(0.5)  (0)     (0)     (0)     18.1            2.5
4     D,F   74(76.3)   16(16.5)  5(5.2)   2(2.1)  (0)     (0)     (0)     23.7            7.3


(Continued from previous page.) Cells under discrepancy scores 0 to 6 are N(%). Each block of four rows is one item, in column order.

Pair  0          1         2        3       4       5    6    Disagreement %  Disagreement >1 %
A,B   167(81.5)  35(17.1)  1(0.5)   (0)     1(0.5)  (0)  (0)  18.5            0.5
C,D   145(74.4)  36(18.5)  11(5.6)  2(1.0)  1(0.5)  (0)  (0)  25.6            7.1
E,F   148(74.4)  46(23.1)  5(2.5)   (0)     (0)     (0)  (0)  25.6            2.5
D,F   71(73.2)   21(21.6)  3(3.1)   2(2.1)  (0)     (0)  (0)  26.8            5.2

A,B   147(71.7)  35(17.1)  19(9.3)  4(2.0)  (0)     (0)  (0)  28.3            1.0
C,D   154(79.0)  41(21.0)  (0)      (0)     (0)     (0)  (0)  21.0            0.0
E,F   158(79.4)  39(19.6)  1(0.5)   1(0.5)  (0)     (0)  (0)  20.6            1.0
D,F   72(74.2)   22(22.7)  2(2.1)   1(1.0)  (0)     (0)  (0)  25.8            3.1

A,B   163(79.5)  38(18.5)  4(2.0)   (0)     (0)     (0)  (0)  20.5            2.0
C,D   167(85.6)  26(13.3)  2(1.0)   (0)     (0)     (0)  (0)  14.4            1.0
E,F   137(68.8)  53(26.6)  5(2.5)   4(2.0)  (0)     (0)  (0)  31.2            4.5
D,F   64(66.0)   29(29.9)  3(3.1)   1(1.0)  (0)     (0)  (0)  34.0            4.1

A,B   169(82.4)  33(16.1)  3(1.5)   (0)     (0)     (0)  (0)  17.6            1.5
C,D   151(77.4)  40(20.5)  3(1.5)   1(0.5)  (0)     (0)  (0)  22.6            2.0
E,F   156(78.4)  36(18.1)  4(2.0)   3(1.5)  (0)     (0)  (0)  21.6            3.5
D,F   58(59.8)   33(34.0)  2(2.1)   4(4.1)  (0)     (0)  (0)  40.2            6.2

Item 13 Item 14 Item 15
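The two summary rows of Table 5 follow directly from the discrepancy counts. The sketch below shows one way to derive them; the function names are hypothetical, and the discrepancy is assumed to be the absolute difference between the two raters' scores on the same paper.

```python
from collections import Counter


def discrepancy_counts(ratings_1, ratings_2, max_score=6):
    """Count papers at each absolute score difference (0..max_score)."""
    diffs = Counter(abs(a - b) for a, b in zip(ratings_1, ratings_2))
    return [diffs.get(d, 0) for d in range(max_score + 1)]


def disagreement_summary(counts):
    """Percentages of exact agreement, any disagreement, and disagreement > 1."""
    total = sum(counts)

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "exact_agreement": pct(counts[0]),
        "disagreement": pct(total - counts[0]),
        "disagreement_gt1": pct(sum(counts[2:])),
    }


# Discrepancy counts for Item 1, rater pair A,B, as reported in Table 5.
item1_ab = [124, 56, 16, 3, 3, 1, 2]
print(disagreement_summary(item1_ab))
```

For Item 1 and pair A,B this gives 60.5% exact agreement, 39.5% disagreement, and 12.2% disagreement greater than one point, matching the table.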


Table 7. Results of the "p×t×r" (examinee × item × rater) generalizability analysis

Raters    Effect  df    V.C.   SE     V.C.%
1,2       p       204   1.805  0.194  33.08
1,2       t       14    1.071  0.384  19.63
1,2       r       1     0.0    0.0    0.0
1,2       pt      2856  2.385  0.066  43.71
1,2       pr      204   0.0    0.001  0.001
1,2       tr      14    0.005  0.002  0.001
1,2       ptr,e   2856  0.190  0.005  0.035

3,4       p       194   1.678  0.185  33.04
3,4       t       14    0.965  0.346  19.00
3,4       r       1     0.0    0.0    0.0
3,4       pt      2716  2.234  0.632  43.99
3,4       pr      194   0.0    0.001  0.0
3,4       tr      14    0.006  0.002  0.0
3,4       ptr,e   2716  0.193  0.005  0.038

5,6       p       198   1.135  0.129  25.67
5,6       t       14    0.919  0.329  20.79
5,6       r       1     0.0    0.0    0.0
5,6       pt      2772  2.22   0.062  50.21
5,6       pr      198   0.0    0.001  0.0
5,6       tr      14    0.001  0.001  0.0
5,6       ptr,e   2772  0.146  0.039  0.033

1,2,3,4   p       63    2.136  0.403  37.68
1,2,3,4   t       14    0.965  0.355  17.02
1,2,3,4   r       3     0.0    0.0    0.0
1,2,3,4   pt      882   2.371  0.115  41.82
1,2,3,4   pr      189   0.0    0.001  0.0
1,2,3,4   tr      42    0.006  0.002  0.001
1,2,3,4   ptr,e   2646  0.197  0.005  0.035

3,4,5,6   p       96    1.198  0.192  26.54
3,4,5,6   t       14    0.960  0.348  21.27
3,4,5,6   r       3     0.003  0.002  0.0
3,4,5,6   pt      1344  2.169  0.853  48.05
3,4,5,6   pr      288   0.0    0.001  0.0
3,4,5,6   tr      42    0.006  0.001  0.001
3,4,5,6   ptr,e   4032  0.178  0.004  0.039

                                                      Raters 1,2  Raters 3,4  Raters 5,6  Raters 1,2,3,4  Raters 3,4,5,6
Generalizability coefficient ($E\rho^2$)              .421        .409        .324        .454            .338
Dependability coefficient ($\Phi$)                    .337        .330        .257        .376            .265
Tasks needed to reach $E\rho^2$ = .80 ($n'_r$ = 1)    6           6           9           5               8
Tasks needed to reach $\Phi$ = .80 ($n'_r$ = 1)       8           8           12          7               11
Tasks needed to reach $E\rho^2$ = .80 ($n'_r$ = 2)    6           6           8           5               8
Tasks needed to reach $\Phi$ = .80 ($n'_r$ = 2)


Table 8. Rater severity parameter estimates under the many-facet "common scale" model

| Obsvd   Obsvd   Obsvd    Fair-M | Model         | Infit        Outfit      | Estim. | Exact Agree.  |                 |
| Score   Count   Average  Avrage | Measure  S.E. | MnSq  ZStd   MnSq  ZStd  | Discrm | Obs %  Exp %  | N raters        |
| 6526    2985    2.2      1.29   | .49      .01  | 1.01   .4     .98   -.4  | 1.02   | 78.9   37.7   | 1 A             |
| 6504    2985    2.2      1.28   | .50      .01  | 1.02   .6     .94  -1.4  | 1.01   | 78.6   37.8   | 2 B             |
| 5496    2805    2.0      1.25   | .51      .01  |  .98  -.6     .89  -2.3  |  .99   | 78.2   39.5   | 3 C             |
| 5584    2805    2.0      1.29   | .49      .01  | 1.01   .4    1.00    .0  |  .96   | 75.5   39.4   | 4 D             |
| 4706    2850    1.7      1.18   | .54      .01  | 1.05  1.5     .95   -.9  | 1.00   | 79.5   40.8   | 5 E             |
| 4632    2850    1.6      1.15   | .55      .01  | 1.06  1.9     .92  -1.6  | 1.01   | 77.9   40.9   | 6 F             |
| 5574.7  2880.0  1.9      1.24   | .51      .01  | 1.02   .7     .95  -1.1  |        |               | Mean (Count: 6) |
| 754.8   76.5     .2       .06   | .02      .00  |  .03   .8     .04    .8  |        |               | S.D. (Populn)   |
| 826.8   83.8     .2       .06   | .03      .00  |  .03   .9     .04    .9  |        |               | S.D. (Sample)   |

Model, Populn: RMSE .01 Adj (True) S.D. .02 Separation 1.52 Reliability (not inter-rater) .70
Model, Sample: RMSE .01 Adj (True) S.D. .02 Separation 1.72 Reliability (not inter-rater) .75
Model, Fixed (all same) chi-square: 19.3 d.f.: 5 significance (probability): .00
Model, Random (normal) chi-square: 4.0 d.f.: 4 significance (probability): .41
Rater agreement opportunities: 18225 Exact agreements: 14205 = 77.9% Expected: 7186.0 = 39.4%

Table 9. Rater severity parameter estimates under the many-facet Rasch "rater-scale" model

| Obsvd   Obsvd   Obsvd    Fair-M | Model         | Infit        Outfit      | Estim. | Exact Agree.  |                 |
| Score   Count   Average  Avrage | Measure  S.E. | MnSq  ZStd   MnSq  ZStd  | Discrm | Obs %  Exp %  | N raters        |
| 6526    2985    2.2      1.27   | .49      .01  | 1.01   .2     .98   -.4  | 1.00   | 78.9   38.0   | 1 A             |
| 6504    2985    2.2      1.27   | .49      .01  | 1.01   .4     .94  -1.4  | 1.01   | 78.6   37.9   | 2 B             |
| 5496    2805    2.0      1.26   | .50      .01  |  .99  -.2     .90  -2.2  | 1.02   | 78.2   38.8   | 3 C             |
| 5584    2805    2.0      1.29   | .48      .01  | 1.03   .8    1.00    .0  |  .99   | 75.5   38.6   | 4 D             |
| 4706    2850    1.7      1.18   | .55      .01  | 1.04  1.3     .94  -1.1  |  .99   | 79.5   41.1   | 5 E             |
| 4632    2850    1.6      1.16   | .58      .01  | 1.05  1.5     .90  -1.8  |  .99   | 77.9   41.3   | 6 F             |
| 5574.7  2880.0  1.9      1.24   | .52      .01  | 1.02   .7     .94  -1.2  |        |               | Mean (Count: 6) |
| 754.8   76.5     .2       .05   | .04      .00  |  .02   .6     .04    .8  |        |               | S.D. (Populn)   |
| 826.8   83.8     .2       .05   | .04      .00  |  .02   .7     .04    .8  |        |               | S.D. (Sample)   |

Model, Populn: RMSE .01 Adj (True) S.D. .03 Separation 2.59 Reliability (not inter-rater) .87
Model, Sample: RMSE .01 Adj (True) S.D. .04 Separation 2.88 Reliability (not inter-rater) .89
Model, Fixed (all same) chi-square: 45.3 d.f.: 5 significance (probability): .00
Model, Random (normal) chi-square: 4.5 d.f.: 4 significance (probability): .34
Rater agreement opportunities: 18225 Exact agreements: 14205 = 77.9% Expected: 7162.6 = 39.3%
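For orientation, the two specifications compared in Tables 8 and 9 can be written as below. This assumes the standard many-facet formulation associated with Linacre (1994); the paper's own equations are not part of this excerpt, so the symbols are conventional rather than quoted. Here $P_{nijk}$ is the probability that examinee $n$ receives category $k$ rather than $k-1$ on item $i$ from rater $j$, $B_n$ is examinee ability, $D_i$ is item difficulty, $C_j$ is rater severity, and $F_k$ (or $F_{jk}$) is the step difficulty of category $k$.

```latex
% "Common scale" (Fk) model: one set of step difficulties shared by all raters
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k

% "Rater-scale" (Fjk) model: each rater j has its own set of step difficulties
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_{jk}
```

The only difference is the subscript on the step term, which is why the two models can give nearly identical severity estimates when raters use the rating categories in similar ways.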

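Rater Effects

To determine whether rater effects were present (e.g., central tendency, severity, or leniency effects), the researchers present each rater's frequency distribution across the rating categories (Table 10). The table shows that raters' scores were on the low side, concentrated mostly at 0 and 1, which is of course related to the items being difficult for the examinees. Raters A, B, C, and D assigned a score of 6 more often (13% to 16%) than raters E and F (8% to 9%), indicating that raters E and F scored relatively low. In addition, although the mean examinee ability estimates for the categories in Table 10 are all ordered, the intervals between categories are too narrow, suggesting that categories might be combined to increase their discriminating power. The step estimates are not ordered, indicating that most categories are never the most probable category. Possible causes include a frequency distribution that is not the ideal rectangular distribution, data that violate the unidimensionality assumption, or the particular way the Rasch model estimates each step (adjacent categories are estimated dichotomously). Note, however, that disordered step estimates do not mean that the measured construct lacks an ordered, hierarchical structure (see Andrich, 2004; Linacre, 1999, 2004; Shaw, Wright,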


Rater × Item Interaction

The researchers used the "common scale" model to examine whether raters and items interact, that is, whether raters differ in scoring severity across items. Table 11 lists the rater severity contrasts that reached significance by t test. For example, raters B and D differed significantly in severity on item 5: rater B (item difficulty 1.14 logits) scored more severely than rater D (item difficulty .87) (t = 2.30, p = .02). Overall, of the 225 comparisons (each pair of raters compared on 15 items), only 10 (4.4%) reached significance, indicating no marked bias among raters scoring the same items.

Figure 1. Comparison of "examinee-rater-item" estimates under the many-facet "common scale" model and the "rater-scale" model
[Panel 1, "common scale" model: vertical ruler with columns Measr (-4 to 2 logits), +person (* = 4), -raters, -ITEMS, and PA (categories 0 to 6).]
[Panel 2: vertical ruler with columns Measr (-4 to 2 logits), +person (* = 4), -ITEMS, -raters, and S.1 to S.6 (categories 0 to 6 for each rater scale).]
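The pairwise contrasts reported in Table 11 can be approximated from the two target measures and their standard errors. The sketch below assumes the usual construction, in which the joint standard error is the square root of the summed squared standard errors; the function name is hypothetical and the degrees of freedom are not computed.

```python
import math


def severity_contrast(measure_1, se_1, measure_2, se_2):
    """Return (contrast, joint standard error, t) for two logit measures."""
    contrast = measure_1 - measure_2
    joint_se = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return contrast, joint_se, contrast / joint_se


# Item 5 as scored by rater B (1.14, S.E. .09) and rater D (.87, S.E. .08).
contrast, joint_se, t_value = severity_contrast(1.14, 0.09, 0.87, 0.08)
print(round(contrast, 2), round(joint_se, 2), round(t_value, 2))
```

With these two-decimal inputs the contrast is .27 and the joint standard error is about .12, as in Table 11. The t value comes out near 2.24 rather than the tabled 2.30, presumably because the published measures and standard errors are rounded.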


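Table 10. Distribution of each rater's ratings across the scale categories and step estimates
[Table body not recovered; it spans two pages, with one panel per rater and step summaries labeled Mean, Modal, and Median.]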

Table 11. Analysis of rater × item interaction

| Target  | Target        Obs-Exp  Context  | Target        Obs-Exp  Context  | Target    Joint                     |
| Nu  IT  | Measr  S.E.   Average  N rater  | Measr  S.E.   Average  N rater  | Contrast  S.E.   t      d.f.  Prob.  |
| 5   15  | 1.14   .09    -.10     2 B      |  .87   .08     .00     4 D      |  .27      .12    2.30   384   .0222  |
| 5   15  | 1.14   .09    -.10     2 B      |  .87   .09    -.14     5 E      |  .27      .13    2.05   387   .0406  |
| 1   11  | -.21   .04    -.18     2 B      | -.36   .05    -.21     4 D      |  .15      .06    2.34   384   .0198  |
| 4   14  |  .02   .05    -.16     2 B      | -.12   .05    -.58     6 F      |  .14      .07    2.15   387   .0325  |
| 4   14  |  .02   .05    -.15     4 D      | -.12   .05    -.23     6 F      |  .14      .07    2.06   375   .0397  |
| 8   23  |  .10   .05     .22     2 B      |  .24   .05    -.40     3 C      | -.14      .07   -2.02   384   .0440  |
| 10  25  |  .17   .05     .18     2 B      |  .32   .05    -.40     4 D      | -.15      .07   -2.03   384   .0435  |
| 8   23  |  .10   .05     .22     2 B      |  .26   .06    -.72     5 E      | -.16      .07   -2.21   387   .0278  |
| 8   23  |  .13   .05     .16     1 A      |  .33   .06    -.85     6 F      | -.20      .08   -2.61   387   .0094  |
| 8   23  |  .10   .05     .22     2 B      |  .33   .06    -.85     6 F      | -.23      .08   -3.01   387   .0028  |

Note: Only entries with t values exceeding 2 in magnitude are listed.

Item Fit and Separation of Examinees

Because the generalizability analysis showed that the examinee-by-item interaction accounted for a high proportion of variance, more items are needed to improve the generalizability of the results. The researchers examined how well the items fit the model in order to judge whether examinees and items were adequate for defining the latent construct (i.e., whether they fit the model and whether the items separate examinees effectively). Tables 12 and 13 show item fit under the "common scale" and "rater-scale" models, respectively. Because the ZSTD fit index is sensitive to large samples, the researchers relied mainly on the infit and outfit mean squares (MNSQ) to judge item fit. For rating scales, reasonable mean-square values fall between 0.60 and 1.40 (Wright & Linacre, 1994). Values that are too large indicate that the data violate the unidimensionality assumption; values that are too small reflect too much redundancy or predictability in the data.

Results under the "common scale" and "rater-scale" models were very close. Except for item 12, whose outfit mean square slightly exceeded 1.40, and item 15, whose values fell below 0.6, all items were within the acceptable range. In addition, item separation reliability reached 1.00 under both models, indicating that the examinees separate the items effectively and that item difficulty locations do not change with different examinees. Finally, under both models examinee separation reliability was .89 (separation index 2.78 in both), indicating that the items separate examinees effectively; if examinees were given other items measuring the same construct, their ability locations would remain quite stable.

Table 12. Item estimates under the "common scale" model

| Obsvd   Obsvd   Obsvd    Fair-M | Model         | Infit        Outfit      | Estim. |                  |
| Score   Count   Average  Avrage | Measure  S.E. | MnSq  ZStd   MnSq  ZStd  | Discrm | Nu ITEMS         |
| 2883    1152    2.5      2.12   |  -.28    .02  | 1.35  7.7    1.29   4.4  | 1.14   | 1 11             |
| 1794    1152    1.6       .98   |   .12    .02  |  .85 -3.2     .71  -4.4  | 1.17   | 2 12             |
| 1325    1152    1.2       .65   |   .33    .02  |  .76 -4.9     .85  -1.9  |  .87   | 3 13             |
| 2209    1152    1.9      1.35   |  -.05    .02  |  .80 -5.0     .98   -.3  |  .72   | 4 14             |
| 452     1152     .4       .22   |  1.00    .04  | 1.09  1.0     .84  -1.3  | 1.09   | 5 15             |
| 2358    1152    2.0      1.51   |  -.10    .02  | 1.16  3.5    1.04    .6  |  .99   | 6 21             |
| 2858    1152    2.5      2.09   |  -.27    .02  |  .93 -1.7    1.11   1.7  |  .87   | 7 22             |
| 1606    1152    1.4       .84   |   .20    .02  | 1.07  1.5     .77  -3.1  | 1.27   | 8 23             |
| 3411    1152    3.0      2.83   |  -.46    .02  | 1.31  6.9    1.19   3.1  |  .98   | 9 24             |
| 1488    1152    1.3       .76   |   .25    .02  |  .92 -1.6     .79  -2.9  | 1.06   | 10 25            |
| 5311    1152    4.6      5.09   | -1.18    .02  |  .78 -4.5     .82  -2.5  |  .79   | 11 31            |
| 2759    1152    2.4      1.97   |  -.24    .02  | 1.24  5.3    1.45   6.4  |  .75   | 12 32            |
| 2229    1152    1.9      1.37   |  -.05    .02  | 1.32  6.7    1.03    .5  | 1.49   | 13 33            |
| 2024    1152    1.8      1.18   |   .03    .02  |  .72 -6.8     .78  -3.3  |  .90   | 14 34            |
| 741     1152     .6       .34   |   .70    .03  |  .56 -7.8     .54  -5.3  |  .88   | 15 35            |
| 2229.9  1152.0  1.9      1.55   |   .00    .02  |  .99  -.2     .95   -.6  |        | Mean (Count: 15) |
| 1136.5  .0      1.0      1.17   |   .49    .00  |  .24  5.1     .23   3.3  |        | S.D. (Populn)    |
| 1176.4  .0      1.0      1.21   |   .50    .00  |  .25  5.3     .24   3.4  |        | S.D. (Sample)    |

Model, Populn: RMSE .02 Adj (True) S.D. .49 Separation 21.99 Reliability 1.00
Model, Sample: RMSE .02 Adj (True) S.D. .50 Separation 22.77 Reliability 1.00
Model, Fixed (all same) chi-square: 5667.9 d.f.: 14 significance (probability): .00
Model, Random (normal) chi-square: 14.0 d.f.: 13 significance (probability): .38

Table 13. Item estimates under the "rater-scale" model

| Obsvd   Obsvd   Obsvd    Fair-M | Model         | Infit        Outfit      | Estim. |                  |
| Score   Count   Average  Avrage | Measure  S.E. | MnSq  ZStd   MnSq  ZStd  | Discrm | Nu ITEMS         |
| 2883    1152    2.5      2.12   |  -.28    .02  | 1.35  7.7    1.29   4.4  | 1.15   | 1 11             |
| 1794    1152    1.6       .98   |   .12    .02  |  .85 -3.3     .71  -4.4  | 1.17   | 2 12             |
| 1325    1152    1.2       .65   |   .33    .02  |  .76 -5.0     .84  -2.0  |  .88   | 3 13             |
| 2209    1152    1.9      1.36   |  -.04    .02  |  .79 -5.0     .97   -.4  |  .72   | 4 14             |
| 452     1152     .4       .22   |  1.00    .04  | 1.10  1.0     .83  -1.4  | 1.10   | 5 15             |
| 2358    1152    2.0      1.51   |  -.10    .02  | 1.16  3.5    1.04    .5  |  .99   | 6 21             |
| 2858    1152    2.5      2.09   |  -.27    .02  |  .93 -1.7    1.09   1.5  |  .86   | 7 22             |
| 1606    1152    1.4       .84   |   .20    .02  | 1.07  1.4     .77  -3.2  | 1.26   | 8 23             |
| 3411    1152    3.0      2.83   |  -.46    .02  | 1.31  7.0    1.20   3.2  |  .97   | 9 24             |
| 1488    1152    1.3       .76   |   .25    .02  |  .92 -1.6     .78  -2.9  | 1.06   | 10 25            |
| 5311    1152    4.6      5.08   | -1.18    .02  |  .78 -4.6     .82  -2.6  |  .80   | 11 31            |
| 2759    1152    2.4      1.97   |  -.24    .02  | 1.24  5.3    1.45   6.5  |  .75   | 12 32            |
| 2229    1152    1.9      1.38   |  -.05    .02  | 1.32  6.7    1.03    .4  | 1.49   | 13 33            |
| 2024    1152    1.8      1.18   |   .03    .02  |  .72 -6.9     .78  -3.4  |  .89   | 14 34            |
| 741     1152     .6       .34   |   .70    .03  |  .55 -7.9     .55  -5.3  |  .88   | 15 35            |
| 2229.9  1152.0  1.9      1.55   |   .00    .02  |  .99  -.2     .94   -.6  |        | Mean (Count: 15) |
| 1136.5  .0      1.0      1.17   |   .49    .00  |  .24  5.2     .23   3.3  |        | S.D. (Populn)    |
| 1176.4  .0      1.0      1.21   |   .50    .00  |  .25  5.3     .24   3.4  |        | S.D. (Sample)    |

Model, Populn: RMSE .02 Adj (True) S.D. .49 Separation 22.00 Reliability 1.00
Model, Sample: RMSE .02 Adj (True) S.D. .50 Separation 22.77 Reliability 1.00
Model, Fixed (all same) chi-square: 5651.9 d.f.: 14 significance (probability): .00
Model, Random (normal) chi-square: 14.0 d.f.: 13 significance (probability): .38
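The separation indices and reliabilities in the summary lines of Tables 8, 9, 12, and 13 are linked by a standard Rasch relationship: reliability equals G²/(1 + G²), where G is the separation index. The sketch below applies it to the reported values; the function names are hypothetical.

```python
import math


def reliability_from_separation(separation):
    """Separation reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability):
    """Inverse relationship: G = sqrt(R / (1 - R)), for R < 1."""
    return math.sqrt(reliability / (1 - reliability))


# Separation indices as reported in the tables and text.
for label, g in [("raters, common scale (Populn)", 1.52),
                 ("raters, rater-scale (Populn)", 2.59),
                 ("examinees", 2.78),
                 ("items, common scale (Populn)", 21.99)]:
    print(label, round(reliability_from_separation(g), 2))
```

These reproduce the reported reliabilities of .70, .87, .89, and 1.00. Because reliability saturates near 1.00, the separation index is the more informative figure for the item facet.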


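Comparison of Different Score Resolution Methods

The researchers examined how three methods of resolving discrepant scores affect examinee ability estimates: taking the mean of the two original raters' scores (Method A); taking the mean of three raters' scores, namely the two original raters and a third rater who scored the discrepant paper (Method B); and replacing the discrepant scores with the third rater's score (Method C). Table 14 is the correlation matrix of examinee ability estimates under the three resolution methods and under the many-facet Rasch models. The results show that (1) examinee ability estimates under the two many-facet Rasch models correlate as high as 1.00; (2) examinee total scores under the three resolution methods correlate as high as .994; and (3) many-facet Rasch ability estimates correlate .82 with raw total scores under the different resolution methods. Although this shows that the scores obtained by the different methods do not differ, it also shows that some difference remains between raw-score estimates and Rasch estimates. Because the proportion of discrepant scores (score differences greater than 2 points) in this study was quite low (ranging from .004% to 6.6% across the 15 items), the scores obtained under the different resolution methods did not differ appreciably.

Table 14. Correlations among examinee ability estimates under the three score resolution methods and under the many-facet Rasch models

                                     Fk model  Fjk model  Resolution A  Resolution B  Resolution C
Facet ability estimates, Fk model
  Pearson correlation                1.000     1.000**    .823**        .823**        .828**
  Sig. (2-tailed)                    .         .000       .000          .000          .000
  N                                  430       430        430           430           430
Facet ability estimates, Fjk model
  Pearson correlation                1.000**   1.000      .823**        .823**        .828**
  Sig. (2-tailed)                    .000      .          .000          .000          .000
  N                                  430       430        430           430           430
Resolution A
  Pearson correlation                .823**    .823**     1.000         1.000**       .994**
  Sig. (2-tailed)                    .000      .000       .             .000          .000
  N                                  430       430        438           438           438
Resolution B
  Pearson correlation                .823**    .823**     1.000**       1.000         .994**
  Sig. (2-tailed)                    .000      .000       .000          .             .000
  N                                  430       430        438           438           438
Resolution C
  Pearson correlation                .828**    .828**     .994**        .994**        1.000
  Sig. (2-tailed)                    .000      .000       .000          .000          .
  N                                  430       430        438           438           438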
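The three resolution rules can be sketched as follows. The function name and signature are hypothetical. Two details are assumptions rather than statements from the text: a pair of scores counts as discrepant when the difference exceeds `threshold` (set here to 2, following the definition of discrepant scores given above), and non-discrepant papers simply receive the mean of the two original scores.

```python
def resolved_score(rater_1, rater_2, rater_3=None, method="A", threshold=2):
    """Final score for one paper under resolution method A, B, or C."""
    original_mean = (rater_1 + rater_2) / 2
    discrepant = abs(rater_1 - rater_2) > threshold
    # Assumption: papers without a discrepancy keep the two-rater mean.
    if not discrepant or rater_3 is None:
        return original_mean
    if method == "A":   # mean of the two original raters
        return original_mean
    if method == "B":   # mean of the two original raters and the third rater
        return (rater_1 + rater_2 + rater_3) / 3
    if method == "C":   # third rater's score replaces the discrepant scores
        return float(rater_3)
    raise ValueError(f"unknown method: {method!r}")


# One discrepant paper scored 1 and 5 by the original raters and 4 by a third.
for method in ("A", "B", "C"):
    print(method, resolved_score(1, 5, 4, method=method))
```

Because only discrepant papers are affected, total scores under the three rules can differ only on the small share of papers that needed a third rating, which is consistent with the near-perfect correlations in Table 14.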


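[...] examinee ability is an adjusted score; that is, when the data fit the model, it is "test-free" and "rater-free," and can therefore approximate the examinee's true score (or universe score). The model also makes it feasible to investigate various rater effects. Finally, the three methods of resolving discrepant scores used in this study had little effect on examinee ability estimates: raw total scores under the different methods correlated close to 1.00 (.994). This is likely because the proportion of discrepant scores was very low, which suggests that rater training was effective. Their correlation with ability estimates under the many-facet Rasch model, however, reached only .80, indicating that slight differences remain between ability estimates based on raw scores and those obtained from the Rasch model.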

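References

吳欣黛 (1998). Authenticity and directness of performance assessment with respect to validity. Master's thesis, Graduate Institute of Elementary Education, National Taipei Teachers College.

林敬修 (2003). A generalizability theory analysis of factors related to the reliability of elementary school mathematics performance assessment. Unpublished thesis, Graduate Institute of Educational Psychology and Counseling, National Pingtung Teachers College.

張麗麗 (2002). An analysis of the reliability and validity of portfolio assessment: The case of elementary school writing portfolios. 教育與心理研究, 25, 1-34.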

AERA, APA, & NCME (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Andrich, D. (1988). Rasch models for measurement. Newbury Park: SAGE.

Andrich, D. (2004). Understanding resistance to the data-model relationship in Rasch's paradigm: A reflection for the next generation. In E. V. Smith and R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 167-200). Maple Grove, Minnesota: JAM Press.

Becker, M., Hess, R. K., & Gibney, V. (1993, April). Large-scale assessment in writing: Factors influencing scaling or writer's performance. Paper presented at the annual meeting of the National Council of Measurement in Education, Atlanta, GA.

Blok, H. (1985). Estimating the reliability, validity, and invalidity of essay ratings. Journal of Educational Measurement, 22, 41-52.

Brennan, R. L. (1992). Elements of generalizability theory. Iowa City, IA: American College Testing Program.

Brennan, R. L. (2001). Generalizability theory. New York: Springer.

Clauser, B. E. (2000). Recurrent issues and recent advances in scoring performance assessment. Applied Psychological Measurement, 24(4), 310-324.

Dunbar, S. B., Koretz, D. M., & Hoover, H. D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-303.


Rasch model. Applied Measurement in Education, 5, 171-191.

Engelhard, G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93-112.

Gao, X. (1996). Sampling variability and generalizability of work keys listening and writing scores (ACT research report series 96-1). Iowa City, Iowa: ACT.

Gao, X., Shavelson, R. J., & Baxter, G. P. (1994). Generalizability of large-scale performance assessments in science. Applied Measurement in Education, 7(4), 323-342.

Lane, S., Liu, M., Ankenmann, R. D., & Stone, C. A. (1996). Generalizability and validity of a mathematics performance assessment. Journal of Educational Measurement, 33(1), 71-92.

Lane, S., & Sabers, D. (1989). Use of generalizability theory for estimating the dependability of a scoring system for sample essays. Applied Measurement in Education, 2, 195-205.

Linacre, J. M. (1994). Many-facet Rasch measurement. Chicago, IL: MESA Press.

Linacre, J. M. (1999). Category disordering vs. step (thresholds) disordering. Rasch Measurement Transactions, 13(1), 675.

Linacre, J. M. (2004). Optimizing rating scale category effectiveness. In E. V. Smith and R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 258-278). Maple Grove, Minnesota: JAM Press. Chicago, IL: John M. Lincare.

Linn, R. L., Burton, E., DeStefano, L., & Hanson, M. (1996). Generalizability of New Standard Project 1993 pilot study tasks in mathematics. Applied Measurement in Education, 9(3), 201-214.

Lunz, M. E., Wright, B. D., & Linacre, J. M. (1990). Measuring the impact of judge severity on examination scores. Applied Measurement in Education, 3, 331-345.

McBee, M. M., & Barnes, L. B. (1998). The generalizability of a performance assessment measuring achievement in eighth-grade mathematics. Applied Measurement in Education, 11(2), 179-194.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13-23.

Messick, S. (1995). Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Shavelson, R. J., Baxter, G. P., Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30(3), 215-232.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE.

Shaw, F., Wright, B., & Linacre, J. M. (1992). Rasch Measurement Transactions, 16(2), 225.


Measurement Transactions, 8(3), 370.