The Application of RaschGSP IRT Theory for Large Data Sets in Educational Measurement


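National Taichung University of Education, Graduate Institute of Educational Information and Measurement, doctoral dissertation. Advisors: Dr. Tian-Wei Sheu and Dr. Masatake Nagai. The Application of RaschGSP IRT Theory for Large Data Sets in Educational Measurement. Graduate student: Nguyen Phung Tuyen. June 2015 (year 104 of the Republic of China).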


National Taichung University of Education
Graduate Institute of Educational Information and Measurement
Ph.D. Dissertation

Dissertation Advisors: Prof. Tian-Wei Sheu, Prof. Masatake Nagai

The Application of RaschGSP IRT Theory for Large Data Sets in Educational Measurement

By Nguyen Phung Tuyen

Taiwan, June 2015


Acknowledgements

First, I would like to express my sincere gratitude to my advisors, Prof. Tian-Wei Sheu and Prof. Masatake Nagai, who taught me the knowledge, ways of thinking, and methods of research, encouraged me in my research work, and supported me throughout the writing of this dissertation.

Next, I sincerely thank the members of the doctoral committee: Prof. Chin-Tsai Lin, Prof. Jiang-Long Lin, Prof. Kun-Li Wen, Prof. Chaang-Yung Kung, and Prof. Jung-Chin Liang. I deeply respect their opinions, encouragement, and valuable contributions to this dissertation.

I also thank my professors at the Graduate Institute of Educational Information and Measurement, represented by Prof. Bor-Chen Kuo, Dean of the College of Education, National Taichung University of Education. A specialist in item response theory, he taught me a great deal about item response theory, cognitive diagnostic models, and related topics. I thank Prof. Hui-Chung Ho for her encouragement, support, and help, which allowed me to complete the dissertation early.

To my classmates at the National Taichung University of Education, Dr. Jian-Wei Tzeng, Dr. Ching-Pin Tsai, Duc-Hieu Pham, Phuoc-Hai Nguyen, Hei-Ju Chen, …, who volunteered their help in designing the MATLAB program for this dissertation and offered comments and active support even though they were busy with their own academic work: I thank them very much.

I gratefully acknowledge the support and assistance of the teachers and colleagues at the Kien Giang Teachers' Training College, Vietnam, who took care of me and my family during my studies.

Finally, I appreciate the love of my parents, my father-in-law and mother-in-law, and my family members, who prayed that I could complete this dissertation early. I especially thank my wife, whose patient love enabled me to complete this study.

Academic research requires the support of many people. I express my great appreciation to all those who offered me their assistance and words of encouragement during the completion of this study. Thank you all from the bottom of my heart!

Abstract

As the number of students per class continues to decline, a good educational assessment method must not only satisfy important criteria such as validity, reliability, and feasibility but also be flexible: it should handle small samples as well as large samples so that results are consistent in all cases. The purpose of this paper is to combine Grey System Theory, RaschGSP IRT theory, and Receiver Operating Characteristic (ROC) analysis to build new assessment methods suited to the context of shrinking class sizes. This dissertation focuses on assessing the effectiveness of the proposed methods in handling both small and large samples of educational tests. The research approach includes a theoretical study and an experimental study that compare the results produced by the proposed methods with those of previous models in terms of validity and reliability. The findings are as follows:

(1) New assessment methods are proposed. They consist of a method to evaluate the difficulty of questions and the ability of students, a method to evaluate the quality of a test and its suitability, a method to evaluate the ability level of a class, and a method to establish the standard of a test. All of them handle small samples well and can also be applied to large samples.

(2) The proposed methods agree with the previous models. The experimental results were compared with the results produced by the previous models, and the comparison showed high agreement between the models.

(3) The proposed methods have potential for diverse assessment. They can be considered for large samples because their assessment results are similar to those of previous models. At the same time, because they handle small samples better, they are promising for contexts in which the number of students per class is declining.

Theoretical contributions and general implications of the findings are discussed.

Keywords: Assessment method, Grey System Theory, Large sample, RaschGSP IRT, ROC, Small sample.

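Chinese Abstract

Facing the trend toward smaller classes, a good instructional assessment method must not only meet important conditions such as reliability, validity, and feasibility, but must also be flexible. Whether applied to small or large samples, an assessment method must yield consistent results. The purpose of this study is to combine grey system theory, RaschGSP IRT theory, and receiver operating characteristic (ROC) analysis to propose new assessment methods suited to the analysis of small classes. The dissertation focuses on examining the effectiveness of the proposed methods for both small and large samples in instructional assessment. The research approach comprises theoretical analysis and empirical study, and aims to demonstrate the reliability and validity of the new methods by comparing their results with those obtained by earlier methods. The results are as follows:

(1) New assessment methods are proposed, comprising a method for evaluating item difficulty and student ability; a method for evaluating test quality and test suitability; a method for evaluating the ability level of a class; and a method for setting test standards. The results show that the new methods not only handle small samples well but are equally applicable to large samples.

(2) Applicability of the new assessment methods: comparing the empirical results of the new methods with the results of earlier assessment methods shows that the two perform consistently.

(3) The new assessment methods have potential for diverse applications: the empirical results show that, applied to large samples, the new methods agree with the results of earlier methods, so they are also applicable to large samples. In addition, because the new methods are superior in handling small samples, they can also be applied in the future to assessing students in small classes.

Keywords: assessment method, small sample, large sample, grey system theory, RaschGSP IRT, ROC.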

Summary

Purpose
The purpose of this study is to propose new assessment methods, based on the combination of Grey System Theory, RaschGSP IRT theory, and ROC analysis, that handle both small and large samples.

Method
The research approach adopted in this dissertation includes a theoretical study and an experimental study. It aims to propose new assessment methods and to compare the results produced by the proposed methods with those produced by the previous models in terms of validity and reliability.

Results
(1) New assessment methods are proposed, including a method to evaluate the difficulty of questions and the ability of students, a method to evaluate the quality of a test and its suitability, a method to evaluate the ability level of a class, and a method to establish the standard of a test. All of them handle small samples well and can also be applied to large samples.
(2) The proposed methods are consistent with the previous models: the experimental results were compared with the results produced by the previous models, and the comparison showed very high agreement between the models.
(3) The proposed methods have potential for diverse assessment. They can be considered for large samples because their assessment results are similar to those of previous models. At the same time, because they handle small samples better, they are promising for contexts in which the number of students per class is declining.

Conclusion
The proposed methods can be applied to educational measurement and solve the problem at hand: effectively assessing tests as the number of students per class declines.

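Chinese Summary

Purpose
The purpose of this study is to combine grey system theory, RaschGSP IRT theory, and receiver operating characteristic (ROC) analysis to propose new assessment methods and to show that they can handle both small and large samples.

Method
The research approach of this dissertation comprises a theoretical study and an empirical study. It aims to propose new assessment methods and to compare the results obtained with the new methods against those obtained with earlier methods, in order to confirm the correctness and reliability of the new methods.

Results
(1) The new assessment methods constructed in this dissertation comprise: a method for evaluating item difficulty and student ability; a method for evaluating test quality and test suitability; a method for evaluating the ability level of a class; and a method for setting test standards. The results show that all of these methods not only handle small samples in practice but are equally applicable to large samples.
(2) The new methods are shown to agree with earlier methods: comparing the empirical results with those obtained by earlier methods yields the same outcome, showing that the old and new methods perform consistently.
(3) The new methods have potential for diverse applications. Their results are consistent with those of earlier methods, showing that they are suitable for large samples; at the same time, because of their superiority in handling small samples, they have potential for use in small-class measurement.

Conclusion
The new assessment methods proposed in this dissertation can be applied to educational testing and can effectively solve the problem of conducting assessment in small classes.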

Table of Contents

Acknowledgements
Abstract
Chinese Abstract
Summary
Chinese Summary
Table of Contents
List of Tables
List of Figures
Notations

Chapter 1 Introduction
  1.1 Background and Motivation of Research
  1.2 Research Problem
  1.3 Research Purpose and Objectives
  1.4 Research Questions
  1.5 Object and Scope of Research
  1.6 Research Method and Research Flowchart
  1.7 Explanations of Terms
  1.8 Significance of Research
  1.9 Overview of This Paper

Chapter 2 Literature Review
  2.1 Brief Introduction of Sample Size - Large Sample, Small Sample and Relationship between Them
    2.1.1 Brief Introduction of Sample Size
    2.1.2 Relationship between Large and Small Samples
    2.1.3 Minimum Size of Sample for Statistics
    2.1.4 Necessity of Nonparametric Statistical Methods
  2.2 Some Theories Commonly Applied to Large Samples in Educational Measurement
    2.2.1 Classical Test Theory
    2.2.2 Rasch Model versus Thurstone Model
    2.2.3 Item Response Theory
  2.3 Some Theories Commonly Applied to Small Samples in Educational Measurement
    2.3.1 Student-Problem Chart Analysis
    2.3.2 Grey System Theory
    2.3.3 GSP Chart and RaschGSP
    2.3.4 Receiver Operating Characteristic
    2.3.5 Standard Setting for Tests
  2.4 From the Development and Transformation of IRT Researches to RaschGSP IRT Theory

Chapter 3 Methodology
  3.1 Research Design
  3.2 RaschGSP IRT Theory
    3.2.1 Determination and Verification of Data
    3.2.2 Normalization of Data
    3.2.3 Logistic Regression for RaschGSP IRT
  3.3 Proposing to Apply GSP Chart for Evaluating the Difficulty of Questions and the Ability of Students
  3.4 Proposing to Apply RaschGSP IRT Theory for Evaluating the Difficulty of Questions and the Ability of Students
  3.5 Proposing to Apply RaschGSP IRT Theory for Evaluating Quality of Test and Its Suitability
  3.6 Proposing to Apply RaschGSP IRT Theory for Evaluating Ability Level of Class
  3.7 Proposing Method of Test Standard Setting Based on the Combination of ROC Analysis and Prediction Models T-GMs

Chapter 4 Applications, Results and Discussions
  4.1 Application of RaschGSP IRT Theory in Evaluating the Difficulty of Questions and the Ability of Students
  4.2 Application of RaschGSP IRT Theory in Evaluating the Quality of Test and Its Suitability
  4.3 Application of RaschGSP IRT Theory in Evaluating the Ability Level of Class
  4.4 Establishment of the Standard of Test in Evaluating the Academic Achievement of Students

Chapter 5 Conclusions and Recommendations
  5.1 Conclusions
  5.2 Recommendations

References

Appendixes
  Appendix 1
  Appendix 2
  Appendix 3
  Vita
  Awards
  Proceedings of Scholarly Works
    Journal papers
    Conference papers

List of Tables

Table 2-1 S-P chart
Table 2-2 GSP chart
Table 2-3 Confusion matrix
Table 2-4 Schedule of development and transformation of the IRT research
Table 2-5 Researches related with diagram in Figure 2-8
Table 3-1 Evaluation result for questions difficulty
Table 3-2 Difficulty of questions (example 3.2)
Table 3-3 Ability of students (example 3.2)
Table 3-4 Test discrimination 0.5 of test for ten classes
Table 3-5 Test intermediate values 0.5 of ten classes and whole data
Table 3-6 Measurement data for ability score of students in the previous five semesters and test score
Table 3-7 Modeling the predicted values for English ability score of students in six semesters
Table 3-8 Hypothetical data for sensitivity and specificity at various cut scores
Table 4-1 Difficulty of questions evaluated by proposed method of data set S1
Table 4-2 Difficulty of questions evaluated by IRT 1PL model of data set S1
Table 4-3 Difficulty of questions in set of data {S2}
Table 4-4 Ability score of students in set of data {S2}
Table 4-5 Test discrimination 0.5 for ten classes in data set S1
Table 4-6 Test discrimination 0.5 for three classes in data set S2
Table 4-7 Test intermediate values 0.5 for ten classes in data set S1
Table 4-8 Test intermediate value 0.5 for three classes in data set S2
Table 4-9 Measurement data for final score of Math in the previous five tests and the total test score of current test (a part) of S1
Table 4-10 Model the predicted values for Math test score in six tests (a part) of S1
Table 4-11 Hypothetical data for sensitivity and specificity at various cut scores (a part) of S1
Table 4-12 Measurement data for final score of Math in the previous five tests and total test score (a part) of S2
Table 4-13 Model the predicted values for Math test score in six tests (a part) of S2
Table 4-14 Hypothetical data for sensitivity and specificity at various cut scores (a part) of S2
Table 5-1 Comparison of IRT models and RaschGSP IRT model

List of Figures

Fig. 1-1 Research flowchart of dissertation
Fig. 2-1 The 68–95–99.7 rule for Normal distributions
Fig. 2-2 ICC for three different items in Rasch model
Fig. 2-3 A continuous response process on a partitioned continuum
Fig. 2-4 A three-parameter logistic model item characteristic curve
Fig. 2-5 One-parameter ICC of real data outputted from BILOG-MG3 software
Fig. 2-6 Student diagnostic analysis
Fig. 2-7 Problem diagnostic analysis
Fig. 2-8 Diagram of development and transformation of the IRT researches
Fig. 3-1 (a) Flowchart of research design
Fig. 3-1 (b) Flowchart of research design
Fig. 3-2 Test results of the students plotted against the pass-fail categories
Fig. 3-3 Test results plotted against probability of allocation to pass-fail categories
Fig. 3-4 Logistic regression curve of test results
Fig. 3-5 RaschGSP curve for students of class with high ability level
Fig. 3-6 RaschGSP curve for students of class with low ability level
Fig. 3-7 RaschGSP curve for students in case of test with high discrimination
Fig. 3-8 RaschGSP curve for students in case of test with low discrimination
Fig. 3-9 Test discrimination for each class represented by 0.5 is comparable to each other
Fig. 3-10 Test intermediate values of three classes are comparable to each other
Fig. 3-11 Flowchart for the process of proposed method applying GSP
Fig. 3-12 Person Map for score of students, it is modeled according to WINSTEPS
Fig. 3-13 Flowchart for applying RaschGSP IRT to evaluate difficulty of questions and ability of students
Fig. 3-14 RaschGSP IRT curve for problems in example 3.2
Fig. 3-15 RaschGSP IRT curve for students in example 3.2
Fig. 3-16 Flowchart for evaluating the quality of test
Fig. 3-17 Family of RaschGSP curve represents data sets Dk
Fig. 3-18 Two RaschGSP curves for two classes and one for whole data
Fig. 3-19 Flowchart for evaluating the ability level of a class
Fig. 3-20 RaschGSP curves represent data sets Dk (k = 1, 2, ..., 10)
Fig. 3-21 Two RaschGSP curves for two classes and one for whole data
Fig. 3-22 Flowchart for algorithm of test standard setting method
Fig. 3-23 ROC graph and its AUC
Fig. 4-1 Diagram of application of RaschGSP IRT in the practice
Fig. 4-2 RaschGSP IRT curve for problems of data set S1
Fig. 4-3 RaschGSP IRT curve for students of data set S1
Fig. 4-4 Person-Map for ability score of students evaluated by proposed method of data set S1
Fig. 4-5 Person-Map for ability score of students evaluated by IRT 1PL of data set S1
Fig. 4-6 RaschGSP IRT curve for problems of data set S2
Fig. 4-7 RaschGSP IRT curve for students of data set S2
Fig. 4-8 Family of RaschGSP IRT curves represents data set S1
Fig. 4-9 RaschGSP IRT curves for classes number 3 and 8 and whole data
Fig. 4-10 Family of RaschGSP IRT curve represents data set S2
Fig. 4-11 RaschGSP IRT curves represent data sets S1
Fig. 4-12 RaschGSP IRT curves for classes numbered 5 and 6 and for the whole data
Fig. 4-13 RaschGSP IRT curves represents data set S2
Fig. 4-14 ROC graph and area under the ROC curve (AUC) of S1
Fig. 4-15 ROC graph and area under the ROC curve (AUC) of S2

Notations

x0: In GRA system, reference vector
xi: In GRA system, inspected vector
xij: In S-P chart, item response result of student i for item j, i = 1, 2, ..., m; j = 1, 2, ..., n
AGO: In GM calculation, Accumulated Generating Operation
AUC: In ROC analysis, Area under the ROC curve
CPj: In S-P chart, Caution index for problem j
CSi: In S-P chart, Caution index for student i
CTT: Classical Test Theory
GM: Grey Prediction Model
GPj: In GSP chart, localized grey relational grade of the j-th problem (LGRG-P)
GRA: Grey Relational Analysis
GSi: In GSP chart, localized grey relational grade of the i-th student (LGRG-S)
GSP: In GSP chart, Grey Student Problem
GST: Grey System Theory
IAGO: In GM calculation, Inverse AGO
ICC: Item Characteristic Curve
IRT: Item Response Theory
J: In ROC analysis, Youden index
MAPE: In T-GM calculation, Mean Absolute Percentage Error
O: In RaschGSP IRT theory, ability level of class only based on test result
PM: Person Map, modeled according to WINSTEPS software
Pj: In S-P chart, problem number, j = 1, 2, ..., n
ROC: Receiver Operating Characteristic
Se: In ROC analysis, Sensitivity
Si: In S-P chart, student number, i = 1, 2, ..., m
Sp: In ROC analysis, Specificity
S-P: In S-P chart, Student-Problem
T-GM: Taylor approximation in Grey Prediction Model
X: S-P chart matrix
0.5: In RaschGSP IRT theory, test discrimination by a class
0: In RaschGSP IRT theory, test discrimination by whole large data
k: In RaschGSP IRT theory, test discrimination by the k-th class
0.5: In RaschGSP IRT theory, test intermediate value of a class
0: In RaschGSP IRT theory, test intermediate value of whole large data
k: In RaschGSP IRT theory, test intermediate value of the k-th class
0i: In GRA system, absolute value of the difference between x0 and xi
0i: In GRA system, localized grey relational grade (LGRG)
s(x): In RaschGSP IRT theory, localized grey relational grade for student bound by condition x
p(x): In RaschGSP IRT theory, localized grey relational grade for problem bound by condition x

Chapter 1 Introduction

This chapter gives an overview of the dissertation. It first presents the background and motivation of the research, which concern the demand for an effective assessment method in the present context. The combination of GSP chart and RaschGSP is therefore proposed as the basis for new assessment analysis methods, and the research purpose is then discussed. The purpose covers three points: proposing new assessment methods in educational measurement; explaining and illustrating how to solve the problems facing the assessment of learning outcomes when the number of students in a class is declining; and comparing the new methods with commonly used methods to show the contribution of the research. Next, the research questions are posed, and the study proposes applying the new methods to both small and large statistical data. The research method and research flowchart are established, followed by the scope of the research. Finally, to improve readability, the terms related to the study are explained, the significance of the work is pointed out, and an overview of the paper is presented.

1.1 Background and Motivation of Research

Assessment of student learning outcomes provides feedback on both the students' learning process and the teachers' teaching process. It is the basis for determining the degree to which students have achieved the goals of an educational program and how successful the teachers have been in teaching. Assessment of student learning outcomes not only provides feedback but also adjusts the whole teaching process (Bui, 2007). In fact, "Educational assessment seeks to determine how well students are learning and is an integral part of the quest for improved education. It provides feedback to students, educators, parents, policy makers, and the public about the effectiveness of educational services" (Pellegrino, Chudowsky, & Glaser, 2001).

Although assessing students' learning outcomes plays such an important role, it has not been properly respected in teaching practice or in the pedagogy curriculum, for several reasons: assessment of academic achievement is a complex process, and academic achievement is commonly measured by examinations or continuous assessment, but there is no general agreement on how it is best tested or which aspects are most important (Ward, Stoker, & Murray-Ward, 1996). Therefore, choosing the kind of assessment most appropriate to the purpose, content, and time has become essential in the teaching and learning process. The assessment process will be reliable, and will yield good results, if the assessment tools used are effective and appropriate. Whether within a single school or on a wider scale, students' academic achievement is evaluated frequently so that feedback reaches teaching in time; this requires evaluation results to be obtained quickly, yet accurately and objectively. Thus, the design of highly effective assessment methods is of great interest. An important requirement for assessment methods is that they be valid, reliable, and practicable to meet the needs of teaching (Pulakos, 2005).

Looking back, many assessment models built on a rigorous mathematical basis have been applied in many countries around the world. They can be briefly summarized as follows.

In the classroom, the Student-Problem chart (S-P chart) is an evaluation method that effectively evaluates the results of students' learning and has been in use for many years. The main purpose of the S-P chart is to obtain diagnostic data for each student, so that teachers can give each student better academic advice based on the analyzed data (D. McArthur, 1983; Sato, 1974; Wu, 1998; Yu, 2011).

In 1982, Deng proposed grey system theory, in which grey relational analysis is an effective mathematical tool: it measures the degree of similarity or difference between two sequences based on the grade of relationship between them. In addition, the grey prediction model (GM) is a useful tool for prediction in many fields, and it has recently been applied in educational measurement (Deng, 1989; Liu, Dang, & Fang, 2010; Wen, Chao, Chang, Chen, & Wen, 2009).

To overcome the weakness of the S-P chart, which processes only dichotomous data, Nagai proposed the Grey Student-Problem (GSP) chart in 2010. The GSP chart combines the S-P chart with grey system theory to analyze S-P chart data more specifically. With GSP chart analysis, the uncertainty factors in a study are analyzed clearly (Sheu, Tsai, et al., 2013).

In 1960, the Rasch model was proposed for analyzing test data to assess an examinee's level of ability in a particular domain such as math or reading. The aim of this model is to measure each examinee's level of a latent trait that underlies his or her scores on the items of a test (Karabatsos, 2001; McArthur, 1987; Tennant & Conaghan, 2007).

Alongside the Rasch model, the concept of item response theory (IRT) became known during the 1960s and 1970s. The purpose of IRT is to provide a framework for evaluating how well assessments work and how well individual items on assessments work. The most common application of IRT is in education, where it is used for developing and designing exams, building item banks for exams, and equating the difficulties of items for successive versions of exams (Hambleton, Swaminathan, & Rogers, 1991).

Both the Rasch model and IRT are suited to the analysis of large data sets, and they need computer support for processing data. In practice, the parameters of IRT are estimated by computer programs because of the vast number of parameters that must be estimated, and BILOG-MG is one of the most popular software packages (Du Toit, 2003).

Nagai creatively applied the view of the Rasch model to the GSP chart in 2010 to form RaschGSP theory. This method has been used to judge uncertain factors (Sheu, Tzeng, Liang, Wang, & Nagai, 2012). It makes problem analysis more specific and clear, so that students and questions can be classified through the test (Tzeng, Sheu, Liang, Wang, & Nagai, 2012a).
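For reference, the dichotomous Rasch model discussed above is usually written in the following textbook form. The symbols are the conventional ones from the IRT literature and are an assumption here, not notation taken from this dissertation: \theta_i is the ability of examinee i, b_j is the difficulty of item j, and X_{ij} is the scored response (1 for correct, 0 for incorrect).

```latex
\[
  P\left(X_{ij} = 1 \mid \theta_i, b_j\right)
  = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
\]
```

Under this form, an examinee whose ability equals the item difficulty (\theta_i = b_j) answers correctly with probability 0.5, which is why ability and difficulty can be placed on one common scale.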
The receiver operating characteristic (ROC) curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields, and it was soon introduced to psychology to account for the perceptual detection of stimuli (Swets, 1996). ROC analysis is now widely recognized as the best technique for measuring the quality of diagnostic information and diagnostic decisions, because it separates discrimination capacity from decision-threshold effects (Hajian-Tilaki, 2013; Metz, Herman, & Roe, 1998).
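As a rough illustration of how ROC quantities relate to a cut score on a test, the sketch below computes sensitivity (Se), specificity (Sp), and the Youden index J = Se + Sp - 1 at several candidate cut scores. The function names (roc_table, best_cut) and the scores and mastery labels are hypothetical and invented for illustration only. This is not the standard-setting procedure of the dissertation, which also involves the T-GM prediction models.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float, float, float]  # (cut score, Se, Sp, Youden J)


def roc_table(scores: Sequence[float], mastered: Sequence[bool],
              cuts: Sequence[float]) -> List[Row]:
    """For each cut, treat score >= cut as a predicted 'pass' and compute
    sensitivity, specificity, and the Youden index J = Se + Sp - 1."""
    n_pos = sum(1 for m in mastered if m)
    n_neg = len(mastered) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one student in each group")
    rows: List[Row] = []
    for cut in cuts:
        # True positives: true masters at or above the cut.
        tp = sum(1 for s, m in zip(scores, mastered) if m and s >= cut)
        # True negatives: non-masters below the cut.
        tn = sum(1 for s, m in zip(scores, mastered) if not m and s < cut)
        se = tp / n_pos
        sp = tn / n_neg
        rows.append((cut, se, sp, se + sp - 1.0))
    return rows


def best_cut(rows: Sequence[Row]) -> Row:
    """Return the row with the largest Youden index."""
    return max(rows, key=lambda r: r[3])


# Hypothetical test scores and reference mastery labels (illustration only).
scores = [35, 42, 48, 55, 58, 61, 67, 72, 80, 91]
mastered = [False, False, False, True, False, True, True, True, True, True]

rows = roc_table(scores, mastered, cuts=[40, 50, 60, 70])
best = best_cut(rows)
for cut, se, sp, j in rows:
    print(f"cut={cut}: Se={se:.3f} Sp={sp:.3f} J={j:.3f}")
print("cut with largest J:", best[0])
```

Maximizing J is only one common way to pick a cut score from an ROC analysis; other criteria weight false passes and false fails differently.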

The number of students per class is declining in some countries, such as Japan and Taiwan, mainly because of the declining birth rate and strategies for enhancing education quality. Assessment methods therefore have to be in accord with this new context. As introduced above, the S-P chart analysis method has been popular for many years; however, it is usually applied to small samples. Meanwhile, assessment models such as the Rasch model and IRT are applied to evaluate the parameters of questions and the ability of students, and their results are clear and specific, but the samples that satisfy their assumptions are always large statistical data sets.

1.2 Research Problem

Because assessments of academic achievement take place regularly throughout the teaching process, both within the classroom and on a larger scale, assessment methods and tools are required to be simple and easy to apply, yet objective and accurate, so that they save time and effort while still achieving the expected results. At the same time, they must solve the problem facing us: assessing tests effectively as the number of students in a class declines. It is hypothesized that it is possible to design assessment methods, combining grey system theory, RaschGSP IRT theory, and ROC analysis, that not only handle small statistical data sets well but can also be applied to large ones. These new methods are expected to be valid, reliable, and effective. They could support the assessment of tests under the condition that the number of students in a class becomes increasingly smaller.

1.3 Research Purpose and Objectives

Based on the above research motivation, the purpose of this study is to propose new assessment methods, based on the combination of grey system theory, RaschGSP IRT theory, and ROC analysis, that handle both small and large samples. The research objectives are as follows:

(1) Apply RaschGSP IRT theory to propose a new assessment method that can evaluate the difficulty of questions and estimate the ability of students.
(2) Propose a method, based on RaschGSP IRT theory, to evaluate the quality of a test and its suitability.
(3) Propose a method, based on RaschGSP IRT theory, to evaluate the ability level of a class.
(4) Build a method of test standard setting based on the combination of ROC analysis and the prediction models T-GMs.

The methods established above are all applied to both small and large samples, and are shown to be valid and reliable by applying them in practice while comparing them with previous models. To solve the problem mentioned above, this study conducts experiments in which the proposed methods are applied to various sample sizes (small or large), in such a way that the obtained results are always reliable and in accord with the results of the previous models.

1.4 Research Questions

This study is performed to solve the urgent problem mentioned above, so new assessment methods are proposed. The research questions that need to be answered are:
(1) What benefits do the new assessment methods, proposed from the combination of grey system theory, RaschGSP IRT theory, and ROC analysis, bring to educational measurement?
(2) Are these new assessment methods valid and reliable?
(3) How can the proposed methods be applied to assess tests in the context of an increasingly declining number of students in a class?

1.5 Object and Scope of Research

The object of research is the set of new assessment methods based on the combination of grey system theory, RaschGSP IRT theory, and ROC analysis to solve the urgent problem.

Scope of content: Proposing a completely new assessment method is a very large undertaking, so this study only proposes assessment methods that focus on evaluating the difficulty of questions and the ability of students, evaluating the quality of a test, evaluating the ability level of a class through the test, and establishing a standard for a test.

Scope of research space: educational dichotomous tests, the questions and their content in a test, students in a class, and students of the same grade.

1.6 Research Method and Research Flowchart

[Flowchart: START; Research Motivation; Determination of Research Purpose; Literature Review; Determination of Research Content; Determination of Research Method; Data Collection; Data Processing, Results and Discussions; decision "Are research questions answered?" with branches No and Yes; Conclusions; STOP]

Fig. 1-1 Research flowchart of dissertation

The dissertation was based mainly on the quantitative, theoretical, experimental, comparative, and analytical methods. In general, it was conducted according to the research flowchart above (Fig. 1-1).

1.7 Explanations of Terms

“Student numbers increasingly fewer in a class”: According to Finn (2002) and Hertling, Leonard, Lumsden, and Smith (2000), class size in public schools averaged about 25 students and was reduced to small classes of 15 to 18 students. Project STAR, a leading study from Tennessee, defined small classes as those with 13-17 students. In Taiwan, there are now many places with 12 or even 9 students in a class. Therefore, “Student numbers increasingly fewer in a class” indicates that the number of students in a class is reduced, that is, the phenomenon of a small class.

“Small sample”: In this dissertation, the item response theory one-parameter logistic model is compared with the proposed method. That model requires a large sample: approximately at least 250 examinees are required for the one-parameter logistic model (Hulin, Lissak, & Drasgow, 1982; Kirisci, Hsu, & Yu, 2001). In other words, a limitation of IRT is the assumption that the sample size must be at least 250 for the data to fit the model. Therefore, a statistical sample with fewer than 250 examinees is considered a small sample.

“Large sample”: From the same point of view, a statistical sample with at least 250 examinees is considered a large sample.

“Grey system theory”: Grey system theory was proposed by Deng in 1982. As far as information is concerned, systems that lack information, such as structure message, operation mechanism, and behavior document, are referred to as grey systems. Its theoretical models are mainly applied to information systems that are unclear or incomplete, using relational analysis, prediction, decision, and other methods to explore the entire system (Deng, 1982, 1989).
“ROC”: Receiver operating characteristic (ROC) is an analytical and diagnostic method whose curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields and was soon introduced to psychology to account for the perceptual detection of stimuli (Swets, 1996).

“Test intermediate value of a class”: The abscissa of the intersection point between the RaschGSP IRT curve and the straight line $y = 0.5$, defined in definition 3.4.

“Ability level of a class”: The ability level of a student class based only on the test result (abbreviated as ability level of class) is determined by the ratio O of the portion of students getting high test scores to the portion of students getting low test scores in that class, as defined in definition 3.5.

“Test discrimination by a class”: The slope of the tangent to the RaschGSP IRT curve, defined in definition 3.6.

“Quality of test and its suitability”: A new criterion determined based on the test discrimination by class.

1.8 Significance of Research

The contributions of this research can be summarized as follows:
(1) The study proposes a system of four new assessment methods: a method to evaluate the difficulty of questions and the ability of students, a method to evaluate the quality of a test and its suitability, a method to evaluate the ability level of a class, and a method to set the standard for a test. All of them not only handle small samples well but can also be applied to large samples.
(2) The proposed methods are considered nonparametric statistical methods; their mathematical formulas and algorithms are simple, so they are easy to understand and highly efficient.
(3) The experimental results of the proposed methods were compared with the results of previous models, showing that the agreement between the models was high.
(4) The new assessment methods are considered suitable for large samples because their assessment results are similar to the results of previous models. On the other hand, because they handle small samples better, they have the prospect of being applied in the context of increasingly declining student numbers in a class.

1.9 Overview of This Paper

This paper comprises five chapters. Chapter 1 introduces the background and focus of the paper, including the research motivation and research problem; it then poses the research purpose and the research questions to be answered, and finally presents the research scope and research method. Chapter 2 describes the relevant literature and basic theory for the dissertation, including an overview of statistical samples in educational measurement, some theories applied to large samples, and some theories applied to small samples related to the proposal of this research. Chapter 3 presents the methodology: the basic theory of the proposed method, the research framework including the proposed methods for large data in detail, the instrument, and the application scope of the methods. Chapter 4 presents in detail the application of the proposed methods, the results and findings, and the discussion of the research results. Chapter 5 gives the final conclusions with suggestions for future research.


Chapter 2 Literature Review

This chapter covers the relevant literature and basic theory for the dissertation, arranged and refreshed for readability and understandability. It includes an overview of statistical samples in educational measurement, some theories applied to large samples, and some theories applied to small samples related to the proposal of this study.

2.1 Brief Introduction of Sample Size - Large Sample, Small Sample and Relationship between Them

2.1.1 Brief Introduction of Sample Size

In scientific research in general and educational measurement in particular, the statistical sample size plays a very important role because of two opposing issues: the first is statistical power, and the second includes cost, effort, and data collection conditions. Statistical power is the probability that a statistical test will indicate a significant difference when there truly is one. Statistical power is analogous to the sensitivity of a diagnostic test, and one could mentally substitute the word “sensitivity” for the word “power” during statistical discussions. The second issue is that some measurements contain a large amount of information concerning the parameter of interest, while others may contain little or none. Since the product of research is information, its “purchase” is expected to be at minimum cost. Therefore, the sample size should be estimated when planning a study. The process of estimating sample size depends on some assumptions and parameters, namely the following five basic elements: modeling research, sampling variation and variance, effect size, significance, and power (Eng, 2003; Florey, 1993; Henry, 1990; Noordzij et al., 2010).

The ever-increasing demand for research has created a need for an efficient method of determining the sample size needed to be representative of a given population. Krejcie and Morgan (1970) published a formula for determining sample size.
Their calculation is easy to consult and could have been constructed using the following general formula (Krejcie & Morgan, 1970).

In the case of unknown population size:

$$ n = \frac{Z_{1-\alpha/2}^{2}}{d^{2}}\, P(1 - P) \tag{2-1} $$

where $P$ is the population proportion, $d$ is the confidence limit around the point estimate, and $Z$ is the Z-score corresponding to the expected statistical significance.

In the case of known population size (less than 10,000), the sample size is calibrated:

$$ N_c = \frac{n}{1 + \dfrac{n}{N}} \tag{2-2} $$

where $N$ is the population size and $n$ is the sample size calculated in (2-1).

In specific cases the formulae are different; all can be seen in (Bartlett, Kotrlik, & Higgins, 2001; Cochran, 1977; Desmond & Glover, 2002; Hayes & Bennett, 1999; Kerry & Bland, 1998; Kerry & Bland, 1998).

2.1.2 Relationship between Large and Small Samples

Although it is difficult to draw a clear-cut line of demarcation between large and small samples, statisticians normally agree that a sample is regarded as large only if its size exceeds 30. After World War II, for doctoral dissertations and most other purposes, the proper sample size when comparing groups was 30 cases per group. The number 30 arose from the understanding that with fewer than 30 cases one was dealing with “small” samples that required specialized handling with “small sample statistics” instead of the accepted critical-ratio approach (Cohen, 1990). Hogg, Tanis, and Rao (1977) wrote that a sample size of less than 25 or 30 would be considered small, and anything more than that would be considered large. The reason is that when the sample size is more than 30, Student’s t-distribution approximates the normal distribution. This study agrees with this judgment: a sample whose size is less than 30 is considered a small sample.
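Formulas (2-1) and (2-2) from Section 2.1.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the `confidence` argument are assumptions introduced here, and the Z-score is taken from the standard normal distribution in Python's standard library.

```python
import math
from statistics import NormalDist


def sample_size_unknown_population(p: float, d: float, confidence: float = 0.95) -> float:
    """Formula (2-1): n = Z^2_{1-alpha/2} * P * (1 - P) / d^2."""
    alpha = 1.0 - confidence
    # Z-score of the standard normal distribution at 1 - alpha/2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z ** 2 * p * (1.0 - p) / d ** 2


def sample_size_known_population(n: float, population: int) -> float:
    """Formula (2-2): N_c = n / (1 + n / N)."""
    return n / (1.0 + n / population)


# Hypothetical inputs: P = 0.5, d = 0.05, 95% confidence, population N = 1000
n = sample_size_unknown_population(p=0.5, d=0.05)
n_c = sample_size_known_population(n, population=1000)
print(math.ceil(n), math.ceil(n_c))
```

Rounding up with `math.ceil` is a choice made for this sketch, so that the reported sample size never falls below the computed requirement.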

2.1.3 Minimum Size of Sample for Statistics

For a normal distribution, about 68% of values lie within one standard deviation (σ) of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations. This fact is known as the 68-95-99.7 (empirical) rule or the 3-sigma rule (Govindaraju & Lai, 2004; Maronna, Martin, & Yohai, 2006; Pukelsheim, 1994). It is illustrated in Figure 2-1 (Moore, 2010).

Note: Adapted from “The basic practice of statistics” Fifth ed., (p. 75) by Moore, 2010, United States of America: Palgrave Macmillan.

Fig. 2-1 The 68–95–99.7 rule for normal distributions

In mathematical notation, these facts can be expressed as follows, where $x$ is an observation from a normally distributed random variable, $\mu$ is the mean of the distribution, and $\sigma$ is its standard deviation:

$$ \Pr(\mu - \sigma \le x \le \mu + \sigma) \approx 0.683 $$

$$ \Pr(\mu - 2\sigma \le x \le \mu + 2\sigma) \approx 0.954 $$

$$ \Pr(\mu - 3\sigma \le x \le \mu + 3\sigma) \approx 0.997 $$

where Pr denotes probability.
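The three probabilities above can be checked numerically. The sketch below uses the standard identity that, for a normal variable, the probability of falling within $k$ standard deviations of the mean equals $\operatorname{erf}(k/\sqrt{2})$; the function name `prob_within` is an assumption introduced for illustration.

```python
import math


def prob_within(k: float) -> float:
    """Pr(mu - k*sigma <= x <= mu + k*sigma) for a normally distributed x."""
    return math.erf(k / math.sqrt(2.0))


# Print the coverage for one, two, and three standard deviations
for k in (1, 2, 3):
    print(f"within {k} sigma: {prob_within(k):.4f}")
```

The result does not depend on $\mu$ or $\sigma$, which is why the rule can be stated once for every normal distribution.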

Now, a minimum number of subjects must be determined in order to efficiently acquire trustworthy data for a normal distribution; the number proposed is eleven, as derived below (Yamaguchi et al., 2006). The following expansion uses the nature of the standard deviation.

When $\bar{x}, \sigma \ge 0$ and $d_i = x_i - \bar{x}$, the 3-sigma range is

$$ \bar{x} - 3\sigma \le x_i \le \bar{x} + 3\sigma \tag{2-3} $$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $n$ is the number of subjects considered. The estimated standard deviation is

$$ \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{2-4} $$

It follows that

$$ \sigma^2 = \frac{1}{n}\sum_{i=1}^{n} d_i^2 \;\Longrightarrow\; n\sigma^2 = \sum_{i=1}^{n} d_i^2 \tag{2-5} $$

The minimum number of subjects for a normal distribution is estimated using (2-5).

(1) Consider the case of $n \le 9$. Let $n = 9$. The left side of (2-5) is equal to

$$ 9\sigma^2 = (3\sigma)^2 \tag{2-6} $$

Substituting (2-6) into (2-5) and taking any $x_k, d_k$ with $1 \le k \le 9$, (2-5) can be rewritten as

$$ (3\sigma)^2 = \sum_{i=1}^{n} d_i^2 \;\Longrightarrow\; \frac{(3\sigma)^2}{d_k^2} = \frac{\sum_{i=1}^{n} d_i^2}{d_k^2} = 1 + \frac{1}{d_k^2}\sum_{\substack{i=1 \\ i \ne k}}^{n} d_i^2 $$

Because $\sum_{i=1,\, i \ne k}^{n} d_i^2 \ge 0$,

$$ \frac{(3\sigma)^2}{d_k^2} \ge 1 \;\Longrightarrow\; (3\sigma)^2 \ge d_k^2 \;\Longrightarrow\; 3\sigma \ge |d_k| $$

Since $d_k = x_k - \bar{x}$, this gives

$$ 3\sigma \ge |x_k - \bar{x}| \tag{2-7} $$

Formula (2-7) shows that every $x_k$ satisfies (2-3), so outliers cannot be detected when $n \le 9$. The standard deviation can be computed in this case, but it cannot be determined whether there are outliers (messy data) outside the $3\sigma$ range.

(2) Consider the case of $10 \le n$. Let $n = 10$; the left side of (2-5) is equal to $10\sigma^2$. Expanding (2-5) by the same procedure as in case (1),

$$ \frac{10\sigma^2}{d_k^2} \ge 1 \;\Longrightarrow\; 10\sigma^2 \ge d_k^2 \;\Longrightarrow\; \sqrt{10}\,\sigma \ge |d_k| \;\Longrightarrow\; \sqrt{10}\,\sigma \ge |x_k - \bar{x}| $$

Because $3 < \sqrt{10}$ is always true, comparison with (2-7) shows that it is not necessary for all $x_k$ to satisfy (2-3). It is therefore possible to detect $x_k$ as an outlier if it falls outside the $3\sigma$ range of the other data. Of course, the number of subjects can be larger if all of them satisfy (2-3) for a normal distribution.

Because the denominator on the right side of (2-4) is $n - 1$, $n = 10$ corresponds to case (1) and $n = 11$ corresponds to case (2). The conclusion is that the minimum number of subjects required is eleven for a normal distribution in the range of the 3-sigma rule.

2.1.4 Necessity of Nonparametric Statistical Methods

Statistical science usually tends to focus on what are called parametric statistics. These techniques are termed parametric because they focus on specific parameters of the population, commonly the mean and variance. In order to use these techniques, the following assumptions regarding the nature of the population from which the data are drawn must be satisfied (Pett, 1997; Tomkins, 2006):
(a) Normal distribution of the dependent variable
(b) A certain level of measurement: interval data
(c) Adequate sample size (more than 30 recommended per group)
(d) Independence of observations, except with paired data
(e) Observations for the dependent variable have been randomly drawn
(f) Equal variance among sample populations
(g) Hypotheses usually made about numerical values, especially the mean

In the practice of measurement, and of educational measurement specifically, one or all of these parametric assumptions are often violated. In many cases, the solution is another group of tests for statistical inference that do not make strict assumptions about the population, known as nonparametric (distribution-free) statistics (Gibbons & Chakraborti, 2011; Siegel, 1957). This study proposes a new assessment method, considered a nonparametric statistical method and named RaschGSP IRT, and applies it to educational measurement to solve the urgent problem facing us.

2.2 Some Theories Commonly Applied to Large Samples in Educational Measurement

2.2.1 Classical Test Theory

Classical test theory (CTT) is the foundational theory of the measurement of mental abilities. At its core, CTT describes the relationship between observed composite scores on a test and a presumed but unobserved “true” score for an examinee. CTT is called “classical” because it is thought to be the first operational use of mathematics to characterize this relationship (Gulliksen, 2013). The classical approach assumes that the raw score (test score) $X$ obtained by any one individual is made up of a true component (true score) $T$ and a random error component (error score) $E$:

$$ X = T + E \tag{2-8} $$

Because there are two unknowns in the equation for each examinee, the equation cannot be solved unless some simplifying assumptions are made. The assumptions of the classical test model are that (a) true scores and error scores are uncorrelated, (b) the average error score in the population of examinees is equal to zero, and (c) error scores on parallel tests are uncorrelated. In this formulation, where error scores are defined, the true score is the difference between the test score and the error score. The true score is easily shown to be the expected test score across parallel forms (Cappelleri, Jason Lundy, & Hays, 2014; Fan, 1998; Güler, Uyanık, & Teker, 2014; Hambleton & Jones, 1993; Wiberg, 2004). The primary outcomes that can be obtained through the analysis of the model include the ability of the examinee, item difficulty, and item discriminating power.

The advantages of many classical test models are that they are based on relatively weak assumptions (i.e., assumptions that are easy to meet in real test data) and that they are well known and have a long track record. On the other hand, both person parameters (i.e., true scores) and item parameters (i.e., item difficulty and item discrimination) are dependent on the test and the examinee sample, respectively, and these dependencies can limit the utility of the person and item statistics in practical test development work and complicate any analyses (Hambleton & Jones, 1993; Novick, 1966).

2.2.2 Rasch Model versus Thurstone Model

The Rasch model was named after the Danish mathematician Georg Rasch. The model shows what should be expected in responses to items (also called questions or problems) if measurement is to be achieved (McArthur, 1987; G Rasch, 1960; Tennant & Conaghan, 2007).
The model assumes that, if the following assumptions are satisfied, the probability of a given respondent affirming an item is a logistic function of the relative distance between the item location and the examinee location on a linear scale:

(1) The first assumption is that only one term or quantity ($\theta_i$) is necessary to characterize an individual or, to put it another way, an individual’s ability is “unidimensional”. Likewise, every item has only one characteristic, its difficulty ($b$).
(2) The second is that the relative difficulty of the items in a test is the same for all individuals. Thus, the ratio of the probability of a pass on item i to the probability of a pass on item j depends only on the difficulty values of these two items.
(3) The third concerns what is known as the “local independence” assumption. This says that, for any individual, the response to an item is completely independent of his or her response to any other item (Choppin, 1983a; Goldstein, 1979).

Georg Rasch first announced this model for analyzing the responses of answerers to obtain an objective interval scale that can measure the latent trait of an answerer (Cano & Hobart, 2011; Choppin, 1982; G Rasch, 1960; W. C. Wang, 2004). Let $P(\theta)$ be the probability that an examinee with latent trait $\theta'$ affirms an item with difficulty $\beta'$, so that $1 - P(\theta)$ is the probability that the examinee does not affirm that item. The Rasch model states that the odds of success are:

$$ \text{odds} = \frac{P(\theta)}{1 - P(\theta)} = \frac{\theta'}{\beta'} $$

Taking the natural logarithm of both sides gives the logistic regression form:

$$ \ln(\text{odds}) = \ln\frac{P(\theta)}{1 - P(\theta)} = \ln\frac{\theta'}{\beta'} $$

Hence:

$$ \ln\frac{P(\theta)}{1 - P(\theta)} = \theta - \beta $$

where $\theta = \ln(\theta')$ and $\beta = \ln(\beta')$. Continuing the transformation:

$$ \exp(\theta - \beta) = \frac{P(\theta)}{1 - P(\theta)} $$

Therefore, the mathematical formula of the item characteristic curve is:

$$ P(\theta) = \frac{\exp(\theta - \beta)}{1 + \exp(\theta - \beta)} \tag{2-9} $$
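As a minimal sketch of formula (2-9), the following Python code evaluates the Rasch probability and confirms that its log-odds equal the difference between ability and difficulty. The function names and the example values of ability and difficulty are assumptions chosen for illustration only.

```python
import math


def rasch_probability(theta: float, beta: float) -> float:
    """Formula (2-9): P(theta) = exp(theta - beta) / (1 + exp(theta - beta))."""
    # Algebraically equivalent form that avoids computing exp twice
    return 1.0 / (1.0 + math.exp(-(theta - beta)))


def log_odds(p: float) -> float:
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical item of difficulty 0.5 answered by examinees of three abilities
for theta in (-2.0, 0.0, 2.0):
    p = rasch_probability(theta, beta=0.5)
    print(f"theta={theta:+.1f}  P={p:.3f}  log-odds={log_odds(p):+.3f}")
```

When ability equals difficulty the probability is exactly 0.5, which is the point where the item characteristic curve crosses the line $y = 0.5$.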

In other words, for the Rasch model, the probability of a correct response by a student is a logistic function of the difference between that student’s ability and the item difficulty (Baker, 2001; Choppin, 1976, 1983b; Georg Rasch, 1961). The relation between the latent trait (theta) and the correct response probability is described by an item characteristic curve (ICC), as shown in Figure 2-2.

Note. Adapted from “Rasch Measurement Theory and Application in Education and Psychology,” by Wang, 2004, Journal of Education & Psychology 27(4), p. 644.

Fig. 2-2 ICC for three different items in Rasch model

The Rasch model has a descriptive function and a predictive function. In its descriptive function, the model can clearly explain the relationship between a student’s ability and item difficulty, the difference between students, and the difference between items. In its predictive function, the model can predict the probability that a student with a specified ability answers a specific item correctly (Wang, 2004).

The Thurstone cumulative probability model (Andrich, 1995)

The derivation of the Thurstone model is based on the plausible assumption of a single continuous response process across a continuum, which Thurstone originally assumed to be normally distributed. If $\tau_{1i}, \tau_{2i}, \ldots, \tau_{xi}, \ldots, \tau_{mi}$ are $m$ ordered thresholds of some item $i$ dividing the continuum into $m + 1$ categories, then the person is classified into a category depending on the realization of this process. This formulation is shown in Figure 2-3. Because the double exponential distribution is tractable and, with a scaling constant, virtually indistinguishable from the normal distribution, it is now often assumed that the response process is given by the double exponential distribution rather than the normal distribution.

[Fig. 2-3 shows a continuum partitioned by the thresholds τ1 to τ6, with the region P{y > τ4} marked.]

Note. Adapted from “Distinctive and incompatible properties of two common classes of IRT models for graded responses” by Andrich, 1995, Applied Psychological Measurement, 19(1), p.104.

Fig. 2-3 A continuous response process on a partitioned continuum

Thus, if $y_{pi}$ is a random continuous process on the continuum about the location $\beta_p$ of person $p$, and if successive categories of item $i$ are denoted by successive integers $x_{pi}$, then an outcome $\tau_{xi} \le y_{pi} < \tau_{(x+1)i}$ leads to the outcome $x_{pi}$, with $x_{pi} = 0$ if $y_{pi} < \tau_{1i}$, and $x_{pi} = m$ if $y_{pi} \ge \tau_{mi}$. Formally,

$$ P(y_{pi} \ge \tau_{xi}) = \int_{\tau_{xi}}^{\infty} \frac{\exp(y_{pi} - \beta_p)}{[1 + \exp(y_{pi} - \beta_p)]^2}\, dy_{pi} = \frac{\exp(\beta_p - \tau_{xi})}{1 + \exp(\beta_p - \tau_{xi})} \tag{2-10} $$

An example is shown in Fig. 2-3 with $\tau_x = \tau_4$.

Rasch model versus Thurstone model: The distinctive features of the processes characterized by the Rasch and Thurstone models when applied to graded responses in IRT are as follows.

In the Rasch model:
(1) the person has a single location;
(2) the source of the final distribution of observed locations resides entirely in the instrument;
(3) the probability of a response in any category depends on the location of all thresholds, not just the locations of the thresholds bounding the category;
(4) the joining assumption does not hold;
(5) the variance of this distribution is inversely related to the number of thresholds;
(6) the precision of an estimate of the single location increases with an increase in the number of thresholds;
(7) the estimate of the location of the person is separated from the location of the thresholds (Andrich, 1995).

In the Thurstone model:
(1) the person also has a single location; but
(2) the source of the distribution of observed locations resides entirely in the person;
(3) the probability of a response in a category depends on the location of only the thresholds bounding the category;
(4) the joining assumption holds;
(5) the variance of this distribution is unrelated to the number of thresholds;
(6) the precision of each observed location in the empirical distribution increases with the number of thresholds, but the precision of the estimate of the single location parameter of the person is not improved;
(7) the estimate of the single location of the person cannot be separated explicitly from the location of the thresholds (Andrich, 1995).

2.2.3 Item Response Theory

Item response theory (IRT) is a general statistical theory about examinee performance on items and tests and how that performance relates to the abilities that are measured by the items in the test. Item responses can be discrete or continuous and can be dichotomously or polytomously scored; item score categories can be ordered or unordered. There can be one ability or many abilities underlying test performance, and there are many models in which the relationship between item responses and the underlying ability or abilities can be specified. Within the general IRT framework, many models have been formulated and applied to real test data (Hambleton & Jones, 1993). This section considers the models that (1) assume a single ability underlies test performance, (2) can be applied to dichotomously scored data, and (3) assume the relationship between item performance and ability is given by a one-, two-, or three-parameter logistic function.

Fig. 2-4 shows the general form of item characteristic functions for the three-parameter logistic model (Hambleton et al., 1991; Sijtsma & Junker, 2006). Item characteristic functions are generated from the expression:

$$ P_i(\theta) = c_i + \frac{1 - c_i}{1 + \exp\{-D a_i(\theta - b_i)\}}, \quad i = 1, 2, \ldots, n \tag{2-11} $$

where $n$ is the number of items in the test and $P_i(\theta)$ gives the probability of a correct response to item $i$ as a function of ability $\theta$. The $c$ parameter in the model is the height of the lower asymptote of the ICC and is introduced into the model to account for the performance of low-ability examinees on multiple-choice test items. This parameter is not needed in the model with free-response data. The $b$ parameter is the point on the ability scale where an examinee has a $(1 + c)/2$ probability of a correct answer. The $a$ parameter is proportional to the slope of the ICC at the point $b$ on the ability scale. In general, the steeper the slope, the higher the $a$ parameter. The item parameters $b$, $a$, and $c$ are referred to as the item difficulty, item discrimination, and pseudo-guessing parameters, respectively. The $D$ in the model is simply a scaling factor, $D = 1.702$ (Baker & Kim, 2004; Baylor et al., 2011; Camilli, 1994; Embretson & Reise, 2013; Hambleton et al., 1991; Harvey & Hammer, 1999; Hays, Morales, & Reise, 2000).

Note. Adapted and modified from “Fundamentals of item response theory” (p.18), by Hambleton et al., 1991, Sage.

Fig. 2-4 A three-parameter logistic model item characteristic curve
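The roles of the three item parameters in formula (2-11) can be illustrated with a short Python sketch. The function names and the example parameter values are assumptions made here for illustration; the slope is approximated by a central finite difference rather than derived analytically.

```python
import math

D = 1.702  # scaling factor from formula (2-11)


def three_pl(theta: float, a: float, b: float, c: float) -> float:
    """Formula (2-11): P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def slope_at(theta: float, a: float, b: float, c: float, h: float = 1e-6) -> float:
    """Central finite-difference approximation of the ICC slope at theta."""
    return (three_pl(theta + h, a, b, c) - three_pl(theta - h, a, b, c)) / (2.0 * h)


# Hypothetical item: discrimination 1.2, difficulty 0.3, pseudo-guessing 0.2
a, b, c = 1.2, 0.3, 0.2
print(three_pl(b, a, b, c))   # probability at theta = b, i.e. (1 + c) / 2
print(slope_at(b, a, b, c))   # slope at theta = b, i.e. D * a * (1 - c) / 4
```

Setting $c = 0$ gives the two-parameter model, and additionally fixing $a$ to a common value gives the one-parameter model used later for comparison.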

IRT has become one of the most popular scoring frameworks for measurement data. IRT models are used frequently in computerized adaptive testing, cognitively diagnostic assessment, and test equating. One of the programs that exist for this purpose is BILOG-MG, which has proven particularly useful and reliable over recent decades for many applications (Rupp, 2003). In order to fit the IRT models estimable with BILOG-MG, experimental data have to satisfy three assumptions: local independence, monotonicity, and unidimensionality (Du Toit, 2003; Rupp, 2003). For the estimation of IRT model parameters in BILOG-MG, the degree of bias and estimation error in the parameter estimates depends on factors such as the number of parameters, the number of examinees, and the test length. The influence of these factors decreases as the number of examinees increases for a fixed number of items. If any general guidelines can be given, it appears that for tests with between 20 and 50 items, approximately at least 250 examinees are required for the one-parameter logistic model and the two-parameter logistic model, and approximately at least 500, maybe even 1,000, examinees are required for the three-parameter logistic model. The graded response model will achieve stable parameter estimates (Drasgow, 1989; Harwell & Janosky, 1991; Hulin, Lissak, & Drasgow, 1982; Kirisci, Hsu, & Yu, 2001; Lord, 1968; Reise & Yu, 1990; Seong, 1990; Stone, 1992; Yen, 1987).

Advantages and disadvantages of item response theories: From the summary of the IRT models above, their advantages and disadvantages can be drawn. The main advantages are:
Assessment - Through the model parameters, the contribution of each item to the precision of the total test score can be assessed, and the precision of measurement can be estimated at each level of ability and for each examinee;
Explanation - Graphical illustrations are helpful for managers (Fig. 2-5);
Equating - It is good for tests where a core of items is administered but different groups get different subsets (e.g., cross-cultural testing, computer adapted testing).

Fig. 2-5 One-parameter ICC of real data output from BILOG-MG3 software

However, there are also some disadvantages:
Strict assumptions - The assumptions of each model are very strict, and it is very hard to collect data that satisfy them;
Large sample size - The minimum sample size reaches 250, which is difficult to achieve in practice;
Complication - The models are complex and difficult to understand.

2.3 Some Theories Commonly Applied to Small Samples in Educational Measurement

2.3.1 Student-Problem Chart Analysis

The S-P chart is a method that can analyze, process, and arrange data in a defined order; it is a very useful tool for diagnosing the learning state of students and the quality of problems (also called questions or items) (Chang, Yang, Shih, & Li, 2008; Ho, 1989; Tsai, Sheu, Tzeng, Chen, & Nagai, 2013; Wu, 1998; You & Yu, 2006). The S-P chart gives the student-problem matrix structure described in definition 2.1 (Nguyen, Nguyen, Pham, Tsai, & Nagai, 2013; Tsai et al., 2013).

Definition 2.1: (The S-P chart matrix) Let $X = [x_{ij}]_{m \times n}$ be the S-P chart matrix, where $i = 1, 2, \ldots, m$ is the order of student, $j = 1, 2, \ldots, n$ is the order of question, $m, n \in N$, and

$$ x_{ij} = \begin{cases} 0, & \text{if answer is wrong} \\ 1, & \text{if answer is right} \end{cases} \tag{2-12} $$

The caution indexes for student and problem help to diagnose the learning status of each student and the quality of each problem as follows (Chen, Lai, & Liu, 2005; D'Costa, 1993; Sato, 1974, 1980):

Caution index for student (CS):

$$ CS_i = 1 - \frac{\sum_{j=1}^{n} (x_{ij})(x_{\cdot j}) - (x_{i\cdot})(\bar{x})}{\sum_{j=1}^{l} (x_{\cdot j}) - (x_{i\cdot})(\bar{x})} \tag{2-13} $$

$$ \text{where } \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_{\cdot j} \text{ and } l = x_{i\cdot} = \sum_{j=1}^{n} x_{ij} \tag{2-14} $$

Caution index for problem (CP):

$$ CP_j = 1 - \frac{\sum_{i=1}^{m} (x_{ij})(x_{i\cdot}) - (x_{\cdot j})(\bar{x}')}{\sum_{i=1}^{l'} (x_{i\cdot}) - (x_{\cdot j})(\bar{x}')} \tag{2-15} $$

$$ \text{where } \bar{x}' = \frac{1}{m}\sum_{i=1}^{m} x_{i\cdot} \text{ and } l' = x_{\cdot j} = \sum_{i=1}^{m} x_{ij} \tag{2-16} $$

The specific description of the S-P chart is shown in Table 2-1 (McArthur, 1983; Sheu, Pham, Nguyen, & Nguyen, 2013b). Based on CS and the rate of problems answered correctly by a student, students are diagnosed and classified. Similarly, based on CP and the rate of students answering a problem correctly, problems are also diagnosed and classified (Chen et al., 2005; Yih & Lin, 2010; Yu, 2011). The specific description of classification in the S-P chart is presented in Figures 2-6 and 2-7.
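The caution indexes in (2-13) to (2-16) can be sketched in Python as follows. This is an illustrative sketch, not the dissertation's program: the function name is hypothetical, the sums up to $l$ and $l'$ are assumed to run over totals sorted in descending order (the usual S-P chart arrangement), and returning `None` when the denominator is zero (all answers right or all wrong) is a convention chosen here.

```python
from typing import List, Optional, Tuple


def caution_indexes(x: List[List[int]]) -> Tuple[List[Optional[float]], List[Optional[float]]]:
    """Return (CS, CP) for a 0/1 matrix x with m student rows and n problem columns."""
    m, n = len(x), len(x[0])
    row_tot = [sum(row) for row in x]                              # x_i. (student scores)
    col_tot = [sum(x[i][j] for i in range(m)) for j in range(n)]   # x_.j (correct answers per problem)
    mean_col = sum(col_tot) / n                                    # x-bar in (2-14)
    mean_row = sum(row_tot) / m                                    # x-bar' in (2-16)
    # Assumption: the "first l" terms are taken from totals in descending order
    col_desc = sorted(col_tot, reverse=True)
    row_desc = sorted(row_tot, reverse=True)

    def index(observed: float, count: int, ordered: List[int], mean: float) -> Optional[float]:
        denom = sum(ordered[:count]) - count * mean
        if abs(denom) < 1e-12:
            return None  # undefined for all-right or all-wrong patterns
        return 1.0 - (observed - count * mean) / denom

    cs = [index(sum(x[i][j] * col_tot[j] for j in range(n)), row_tot[i], col_desc, mean_col)
          for i in range(m)]
    cp = [index(sum(x[i][j] * row_tot[i] for i in range(m)), col_tot[j], row_desc, mean_row)
          for j in range(n)]
    return cs, cp


# Hypothetical data: three students follow a perfect pattern, the fourth answers
# only the problem that the other three all missed.
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1]]
cs, cp = caution_indexes(data)
print(cs)
print(cp)
```

In this example the regular response patterns have an index of zero, while the unusual student and the unusual problem stand out with a clearly larger index, which is the behavior the diagnosis relies on.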
