
Discussion and Conclusion

8.1 Summary

This work presents an adaptive test environment to help learners of English as a foreign language improve their understanding. We propose a personalized automatic quiz generation model that generates multiple-choice questions of varying difficulty and selects questions according to a student's estimated proficiency level and the unclear concepts behind incorrect responses. We also present a reading difficulty estimation designed for learners of English as a foreign language. Using the Bayesian Information Criterion (BIC), we investigate the optimal combination of features for improving reading difficulty estimation; these features were extracted and fed into a linear regression model to estimate the reading level of a document. Finally, a novel and interpretable statistical ability estimation is presented based on the quantiles of acquisition grade distributions and Item Response Theory, and it treats long-term observation as part of a student's estimated ability. The results of the empirical study showed:

(1) The proposed personalized design with the appropriate instructional scaffolding …

(2) Students using the proposed personalized method corrected more of the questions they had previously answered incorrectly than students in the traditional test environment did.

(3) The questionnaire results showed that the proposed personalized method can identify the knowledge that students need to improve and can help students understand their strengths and weaknesses; furthermore, most subjects agreed that the proposed system is functional and of good quality.

The proposed reading difficulty model not only employs the established lexical and syntactic complexity features, but also introduces meaningful new features such as the word and grammar acquisition grade distributions, word sense, and co-referential relations. The results of the evaluations showed:

(4) Among the representative features of the proposed reading difficulty estimation, the word acquisition grade distributions play a particularly important role for reading materials written for learners of English as a foreign language.

(5) The results of the proposed reading difficulty estimation were better than those of previous work.
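As a concrete illustration of the BIC-driven feature selection and linear regression summarized above, the following minimal sketch searches feature subsets for a readability model and keeps the subset with the lowest Bayesian Information Criterion. The feature names and the exhaustive search are illustrative assumptions, not the exact procedure or feature set used in this work.

import itertools
import numpy as np

def bic_of_linear_fit(X, y):
    # Ordinary least squares with an intercept; BIC = n*ln(RSS/n) + k*ln(n),
    # where k counts the fitted coefficients.
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]
    return n * np.log(rss / n) + k * np.log(n)

def best_feature_subset(features, grade_labels):
    # `features` maps a feature name (e.g. a hypothetical
    # 'word_acquisition_grade' column) to a 1-D array of document-level values.
    best_bic, best_subset = np.inf, ()
    names = list(features)
    for r in range(1, len(names) + 1):
        for subset in itertools.combinations(names, r):
            X = np.column_stack([features[name] for name in subset])
            bic = bic_of_linear_fit(X, grade_labels)
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return best_subset, best_bic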

This work develops a statistical and interpretable method of ability estimation that captures the progression of learning over time in a Web-based test environment. Moreover, it provides an explainable interpretation of the statistical measurement based on the quantiles of acquisition grade distributions and Item Response Theory. The results of the simulation demonstrated:

(6) The proposed ability estimation based on the grade distributions was robust, especially when the responses were uncertain.

(7) The proposed ability estimation was more accurate than the other ability estimations and can provide a better understanding of student competence.

(8) The empirical results revealed that the correlations between estimated abilities that incorporate the testing history were higher than the values obtained when only the responses of the current test were considered. Moreover, students estimated as advanced graders showed significantly higher post-test scores and better responses than those estimated as basic graders.
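Because the estimated ability incorporates the testing history (finding (8) above), one simple way to picture the idea is as a weighted blend of the ability suggested by the current test and the running estimate carried over from earlier tests. The sketch below is a simplified illustration only; the blending weight and function names are hypothetical placeholders, not the values calibrated in this work.

def update_ability(previous_estimate, current_estimate, weight_on_current=0.6):
    # Blend the ability inferred from the current test with the estimate
    # accumulated over earlier tests, so a single noisy test shifts the
    # result less abruptly. The weight here is a placeholder value.
    if previous_estimate is None:
        return current_estimate          # first test: nothing to blend yet
    return (weight_on_current * current_estimate
            + (1.0 - weight_on_current) * previous_estimate)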

8.2 Contribution

To our knowledge, this work is the first empirical study to analyze student performance with automatically generated questions and a personalized test strategy. Table 16 provides a comparison of the proposed system with previous test environments. In the traditional test environment, examinees in the same grade or class usually take the same tests, prepared in advance by experts. In the adaptive test environment (e.g., Barla et al., 2010), tests are likewise prepared beforehand by experts, but examinees in the same grade or class can receive different tests depending on their ability. With automatic question generation (e.g., Mitkov & Ha, 2003), tests save both time and production costs; nevertheless, they are usually not designed for any particular test purpose. In our method, questions and their difficulties are not only generated automatically but are also assigned to examinees according to their abilities and their previous mistakes. The examinee's performance is recorded in the system, and the concepts behind incorrectly answered questions are reincorporated into future tests. Abilities are estimated from the current responses together with the testing history. Additionally, the estimated proficiency level in this study corresponds to an explicit grade level in a school, whereas the ability estimate in the traditional adaptive test environment is a point on an implicit scale. The proposed approach thus retains the advantages of the adaptive test environment and of automatic question generation. It offers students an effective way to measure their understanding automatically and to clear up their incorrect concepts; moreover, it reduces teachers' burden of question generation, leaving teachers more time to teach and assist students.

Table 16 Comparison of different test environments.

Test environment                           Automation   Personalization
The traditional test environment           No           No
The adaptive test environment              No           Yes
The automatic question test environment    Yes          No
The proposed adaptive test environment     Yes          Yes
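The test cycle described above, in which items are assigned by estimated grade and the concepts behind wrong answers are reincorporated, can be sketched as follows. The data structures, field names, and the one-grade-level threshold are illustrative placeholders rather than the system's actual implementation.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Question:
    concept: str
    difficulty: float        # acquisition grade level of the tested point

@dataclass
class Student:
    estimated_grade: float
    missed_concepts: Set[str] = field(default_factory=set)
    test_length: int = 10

def next_test(question_bank: List[Question], student: Student) -> List[Question]:
    # Re-ask concepts behind previously incorrect answers, then top up with
    # fresh items within one grade level of the current ability estimate.
    review = [q for q in question_bank if q.concept in student.missed_concepts]
    fresh = [q for q in question_bank
             if q.concept not in student.missed_concepts
             and abs(q.difficulty - student.estimated_grade) <= 1.0]
    return (review + fresh)[:student.test_length]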

This work is the first to mathematically draw connections among question difficulties, ability estimation, and the acquisition grade distributions. Under this idea, for example, a student's estimated ability is reported as grade level six because he or she answered correctly 90 percent of the items in a test whose difficulties are normally distributed around level six, and this behavior corresponds to 80 percent of the population (assuming s = 90% and r = 80%). Unlike traditional approaches, which focus on a norm-referenced item parameter scale for an individual item, the ability estimated by the proposed method is explainable: its scale is based on the school grade at which most people acquire the tested knowledge. In addition, in the proposed method an examinee's ability is estimated from all responses to the questions in a test; in contrast, in the traditional approaches the ability is determined question by question. This point is similar to Classical Test Theory (Crocker & Algina, 1986), which treats all responses in a test as an examinee's observed score. However, the result of Classical Test Theory is sample-dependent, whereas the estimate from the proposed method is stable because it is based on the acquisition grade distributions. Moreover, our estimated ability is obtained from a weighted combination of an examinee's current performance and his or her historical data; the more historical data are available, the more accurately the ability can be estimated. This characteristic retains the advantage of BME (Bock & Mislevy, 1982; Baker, 1993; Lee, 2012), which considers the successive change in the ability level within a learning session, while achieving more accurate results than BME. Finally, the experimental sample in this study was drawn from a student population with varied abilities, whereas the parameters in Lee's research (2012) were estimated on a student population with a similar knowledge level. Even though the characteristics of Item Response Theory are robust enough to use the same student population without losing generality, it would be better to acquire parameters from different student populations.
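To make the role of the quantile parameters more concrete, the fragment below reads a grade off an acquisition grade distribution: it returns the grade by which a fraction r of the population has acquired the tested knowledge, assuming the acquisition grades are normally distributed. It is a simplified illustration of the (s, r) construction discussed above, not the full estimator; the mean, standard deviation, and the normality assumption are placeholders.

from scipy.stats import norm

def grade_at_population_quantile(mean_grade, sd_grade, r):
    # Inverse CDF of the assumed normal acquisition grade distribution:
    # the grade by which a proportion r of the population has acquired
    # the knowledge being tested.
    return norm.ppf(r, loc=mean_grade, scale=sd_grade)

# Example: with an acquisition distribution centred on grade 6 (sd = 1.5,
# both illustrative), 80 percent of the population has acquired the
# knowledge by roughly grade 7.3.
print(grade_at_population_quantile(6.0, 1.5, 0.80))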

Several implications can be drawn from this study if learners were to study English in this learning environment. First, it would provide a personalized learning environment. Students with different abilities could practice adaptive exercises of appropriate difficulty and repeatedly revisit unclear concepts. This could serve as a qualitative guideline for identifying students' current learning status and for providing instructional support, which could in turn strengthen what students have not yet acquired. For example, once a student's estimated ability is determined, the student can understand his or her learning status, because the ability is estimated from the difficulty levels of the words he or she has acquired; it is easier for students to see the extent of their proficiency at the different levels. Moreover, because the system records students' behavior, teachers can use this information to clear up the misunderstandings that students have. Second, it would remove the barrier of physical academic textbooks: with online resources available and updated every day, learners would be able to learn something new whenever they want. Finally, the framework of this system could serve a quantitative purpose, adapting to different learning environments by offering flexible measurement, since the two parameters r and s can be set to different values under various conditions. A good example is native speakers versus second-language learners. In this way, teachers could adjust the parameters of the proposed ability estimation to the test purpose, defining a qualified ability as corresponding to the age by which a certain percentage of the population has acquired the knowledge.

8.3 Limitations

The limitations of our evaluations leave ample room for future research.

One limitation concerns the difficulty of the reading comprehension questions used in the study. To develop this question type, we only took the predicted value from the proposed reading difficulty estimation into consideration. Other criteria should also be identified for the characteristics of anaphoric relations, e.g., forward versus backward reference, or the most frequent mistakes in coreference resolution.

Another limitation of the current research is the limited set of question types, even though vocabulary, grammar, and reading comprehension (referential) questions were proposed in this study. It would be desirable to see more types of generated questions in future work. Moreover, because of the limited number of question types, it is difficult to identify the causes of students' incorrect responses to reading comprehension questions. Although these questions are classified into various difficulties, this may be insufficient for investigating students' understanding. One possible solution is to observe and learn from data; however, this requires researchers or students to label and define such a resource.

Future work should evaluate the personalized questions against additional criteria. Even though these questions were evaluated with empirical data, e.g., a questionnaire, the quality of the generated questions could be examined further. For example, the generated questions could be evaluated by a representative sample of experts: is a generated question acceptable? One criterion is psychometric reliability: how well does performance on a question correlate with performance on other questions of the same difficulty? Another idea is to design a way to filter out invalid generated questions automatically.

Another limitation is that the distribution of item difficulties in a test was assumed to be normal. Even though teachers usually design a combination of question difficulties in a test that resembles a normal distribution, some questions are generated uniformly. One possible solution is to take into consideration the item discrimination parameter and the guessing parameter described in the three-parameter logistic model of Item Response Theory; the item characteristic curve could then accurately model the probability of a correct response as a function of the examinee's ability and the item parameters. This concern would be desirable to address in the future.
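For reference, the three-parameter logistic item characteristic curve mentioned above has the standard textbook form (Hambleton & Swaminathan, 1985) sketched below; this is a direct transcription of the general model, not code from the proposed system.

import math

def three_pl_probability(theta, a, b, c):
    # Probability that an examinee with ability theta answers the item
    # correctly, given discrimination a, difficulty b, and guessing
    # parameter c (the lower asymptote).
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))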

An additional limitation is that this approach focuses only on English learning. The personalized framework may be applied to other language-learning fields, but other disciplines, such as mathematics, would require the framework to be redesigned.

8.4 Future applications

One possible use is human-assisted machine generation of personalized questions, for example with a human editing or selecting among automatically generated candidate questions, thereby reducing the human effort currently required to compose questions and producing them more systematically. Further research might extend the framework toward fully automatic use. One direction for further development is to design an automatic evaluation of generated questions: if a generated question is reported as unacceptable, it should be removed and used to improve the algorithm.

Another potential application is adaptive testing based on Big Data (Long & Siemens, 2011). With the emergence of abundant online learning materials and electronic textbooks, it is highly practical to employ the proposed framework of personalized automatic question generation in the future. We can imagine a scenario in which a learner of English as a foreign language reads up-to-date news and immediately takes a test for self-evaluation. We look forward to fast adoption of such learning environments and hope that students and teachers will benefit from this work.

References

[1] Agarwal, M. & Mannem, P. (2011). Automatic gap–fill question generation

from text books. Proceedings of the 6th Workshop on Innovative Use of NLP

for Building Educational Applications, 56–64.

[2] Agarwal, M., Shah, R., & Mannem, P. (2011). Automatic question generation

using discourse cues. Proceedings of the 6th Workshop on Innovative Use of

NLP for Building Educational Applications, 1–9.

[3] Alario, F. X., Ferrand, L., Laganaro, M., New, B., Frauenfelder, U. H. & Segui,

J. (2005). Predictors of picture naming speed. Behavior Research Methods,

Instruments, & Computers, 36, 140-155.

[4] Anderson, R. C., & Biddle, W. B. (1975). On asking people questions about

what they are reading. Psychology of learning and motivation, 9, 89-132.

[5] Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New

York: ACM Press.

[6] Baker, F. B. (1993). Equating tests under the nominal response model. Applied

Psychological Measurement, 17, 239-251.

[7] Bates, E. (2003). On the nature and nurture of language. Retrieved November 24,

2011, from http://crl.ucsd.edu/bates/papers/pdf/bates-inpress.pdf

[8] Barla, M., Bielikova, M., Ezzeddinne, A. B., Kramar, T., Simko, M. & Vozar, O.

(2010). On the impact of adaptive test question selection for learning efficiency.

Computers & Education, 55(2), 846–857.

[9] Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based

approach. Computational Linguistics, 34(1), 1-34.

[10] Brown, R. G. (2004). Smoothing, forecasting and prediction of discrete time

series. New York: Dover Publications.

[11] Brown, J., Frishkoff, G. & Eskenazi, M. (2005). Automatic question

generation for vocabulary assessment. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural

Language Processing, 819-826.

[12] Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a

microcomputer environment. Applied Psychological Measurement, 6, 431-444.


[15] Brysbaert, M., Wijnendaele, I. V., & Deyne, S. D. (2000). Age-of-acquisition

effects in semantic processing tasks. Acta Psychologica, 104(2), 215-226.

[16] Carroll, J. B. & White, M. N. (1973). Word frequency and age of acquisition as determiners of picture-naming latency. Quarterly Journal of Experimental Psychology, 25(1), 85-95.

[17] Chali, Y. & Hasan, S. A. (2012). Towards Automatic Topical Question

Generation. Proceedings of the 24th International Conference on

Computational Linguistics, 475–492.

[18] Chen, C. M., Lee, H. M., & Chen, Y. H. (2005). Personalized e–learning

system using item response theory. Computers & Education, 44(3), 237–255.

[19] Chen, C. M., & Chung, C. J. (2008). Personalized mobile English vocabulary

learning system based on item response theory and learning memory cycle.

Computers & Education, 51(2), 624–645.

[20] Chen, C. Y., Ko, M. H., Wu, T. W. & Chang, J. S. (2005). FAST – Free

Assistant of Structural Tests. Proceedings of the Computational Linguistics and

Speech Processing (ROCLING 2005).

[21] Chen, W., & Mostow, J. (2011). A Tale of Two Tasks: Detecting Children’s

Off-Task Speech in a Reading Tutor. Proceedings of the 12th Annual

Conference of the International Speech Communication Association, 1621-1624.

[22] Chen, W., Aist, G., & Mostow, J. (2009). Generating Questions

Automatically from Informational Text. Proceedings of AIED 2009 Workshop

on Question Generation, 17-24.

[23] Crocker, L., & Algina, J. (1986). Introduction to classical and modern test

theory. New York: Holt, Rinehart & Winston.

[24] Coleman, M. and Liau, T. L. (1975). A computer readability formula designed

for machine scoring. Journal of Applied Psychology, 60(2):283–284.

[25] Collins-Thompson, K. and Callan, J. (2004). A Language Modeling Approach to Predicting Reading Difficulty. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2004).

[26] Collins-Thompson, K., Bennett, P. N., White, R. W., Chica, S., & Sontag, D. (2011). Personalizing Web Search Results by Reading Level. Proceedings of CIKM 2011.

[27] Copestake, A., Flickinger, D., Pollard, C. & Sag, I. A. (2005). Minimal

Recursion Semantics: An Introduction, Research on Language and

Computation, 3, 281-332.

[28] Dale, E. and Chall, J. S. (1948). A Formula for Predicting Readability.

Educational Research Bulletin, 27(1).

[29] David, H. A. & Nagaraja, H. N. (2003), Order statistics. Marblehead, MA:

Wiley.

[30] Davies, R., Barbón, A., & Cuetos, F. (2013). Lexical and semantic

age-of-acquisition effects on word naming in Spanish. Memory & Cognition,

41(2), 297-311.

[31] EL-Manzalawy, Y. and Honavar, V. (2005). {WLSVM}: Integrating LibSVM

into Weka Environment. Retrieved November 24, 2011, from

http://www.cs.iastate.edu/~yasser/wlsvm/

[32] Elo, A. (1978). The rating of chessplayers, past and present. New York: Arco

Publishers.

[33] Embretson, S., & Reise, S. (2000). Item response theory for psychologists.

New Jersey, USA: Lawrence Erlbaum.

[34] Fehr, C. N., Davison, M. L., Graves, M. F., Sales, G. C., Seipel, B., &

Sekhran-Sharma, S. (2012). The effects of individualized, online vocabulary

instruction on picture vocabulary scores: an efficacy study, Computer Assisted

Language Learning, 25(1), 87–102.

[35] Feng, L., Jansche, M., Huenerfauth, M., and Elhadad, N. (2010). A

Comparison of Features for Automatic Readability Assessment. Proceedings of the International Conference on Computational Linguistics, 276-284.

[36] Flesch, R. (1948). A new readability yardstick. Journal of applied psychology,

32(3), 221-233.

[37] Gronlund, N. (1993). How to make achievement tests and assessments. New

York: Allyn and Bacon.

[38] Gunning, R. (1952). The technique of clear writing. McGraw-Hill.

[39] Hambleton, R. K., & Swaminathan, H. (1985). Item response theory:

principles and applications. Boston, MA: Kluwer-Nijhoff.

[40] Heilman, M., Collins-Thompson, K., Callan, J., and Eskenazi, M. (2007).

Combining Lexical and Grammatical Features to Improve Readability Measures

for First and Second Language Texts. Proceedings of the Human Language

Technology Conference, 460-467.

[41] Heilman, M., Collins-Thompson, K. and Eskenazi, M. (2008). An Analysis of

Statistical Models and Features for Reading Difficulty Prediction. Proceedings

of the Third ACL Workshop on Innovative Use of NLP for Building Educational

Applications, 71–79.

[42] Heilman, M. & Smith, N. A. (2009). Question generation via overgenerating

transformations and ranking. Technical Report CMU–LTI–09–013, Language Technologies Institute, Carnegie Mellon University. Retrieved from http://www.cs.cmu.edu/~nasmith/papers/heilman+smith.tr09.pdf.

[43] Heilman, M. & Smith, N. A. (2010). Good question! statistical ranking for

question generation. Proceedings of the North American Chapter of the

Association for Computational Linguistics: Human Language Technologies,

609–617.

[44] Hsiao, H. S., Chang, C. S., Chen, C. J., Wu, C. H. & Lin, C. Y. (2013). The

influence of Chinese character handwriting diagnosis and remedial instruction

system on learners of Chinese as a foreign language, Computer Assisted

Language Learning, DOI: 10.1080/09588221.2013.818562.

[45] Ho, H. and Huong, C. (2011). A Multiple Aspects Quantitative Indicator for

Ability of English Vocabulary: Vocabulary Quotient. Journal of Educational

Technology Development and Exchange, 4(1), 15-26.

[46] Huang, S. X. (1996). A content-balanced adaptive testing algorithm for computer-based training systems. Intelligent Tutoring Systems, 306–314.

[47] Izura, C., & Ellis, A. W. (2002). Age of acquisition effects in word

recognition and production in first and second languages. Psicologica, 23, 245-281.

[48] Johns, T. F., Hsingchin, L., & Lixun, W. (2008). Integrating corpus-based

CALL programs in teaching English through children's literature, Computer

Assisted Language Learning, 21(5), 483–506.

[49] Kate, R. J., Luo, X., Patwardhan, S., Franz, M., Florian, R., Mooney, R. J.,

Roukos, S., Welty, C. (2010). Learning to Predict Readability using Diverse

Linguistic Features. Proceedings of the 23rd International Conference on Computational Linguistics, 546-554.

[50] Kidwell, P., Lebanon, G., and Collins-Thompson, K. (2009). Statistical

estimation of word acquisition with application to readability prediction. Proceedings of Empirical Methods in Natural Language Processing, 900-909.

[51] Kidwell, P., Lebanon, G., & Collins-Thompson, K. (2011). Statistical

Estimation of Word Acquisition With Application to Readability Prediction. Journal of

the American Statistical Association, 106(493), 21-30.

[52] Kincaid, J. P., Fishburne, R. P., Jr., Rogers, R. L., & Chissom, B. S. (1975). Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Branch Report. Virginia: National Technical Information Service.

[53] Kintsch, W. (1998). Comprehension: A Paradigm for Cognition. Cambridge: Cambridge University Press.

[54] Kireyev, K. & Landauer, T. K. (2011). Word maturity: computational modeling of word knowledge. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 299–308.

[55] Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing.

Proceedings of the 41st Meeting of the Association for Computational Linguistics, 423-430.

[56] Klinkenberg, S., Straatemeier, M., & van der Maas, H.L.J. (2011). Computer

adaptive practice of Maths ability using a new item response model for on the

fly ability and difficulty estimation. Computers & Education, 57(2), 1813-1824.

[57] Kuo, C. H., Wible, D., Chen, M. C., Sung, L. C., Tsao, N. L. & Chio, C. L.

(2002). Design and implementation of an intelligent Web–based interactive

language learning system. Journal of Educational Computing Research, 27(3),

785–788.

[58] Lin, Y. C., Sung, L. C. & Chen, M. C. (2007). An automatic multiple-choice

question generation scheme for English adjective understanding. Proceedings of

the 15th International Conference on Computers in Education, 137-142.

[59] Lee, J. & Seneff, S. (2007). Automatic generation of cloze items for

prepositions. Proceedings of INTERSPEECH 2007, 2173–2176.

[60] Lee, Y. J. (2012). Developing an efficient computational method that

estimates the ability of students in a Web-based learning environment. Computers & Education, 58(1), 579-589.

[61] Levy, R. & Andrew, G. (2006). Tregex and Tsurgeon: tools for querying and manipulating tree data structures. Proceedings of the 5th International Conference on Language Resources and Evaluation.


[63] Liu, C. L., Wang, C. H., Gao, Z. M., & Huang, S. M. (2005). Applications of

lexical information for algorithmically composing multiple–choice cloze items.

Proceedings of the Second Workshop on Building Educational Applications

Using Natural Language Processing, 1–8.

[64] Liu, M., Calvo, R. A., Aditomo, A., & Pizzato, L. A. (2012). Using

Wikipedia and conceptual graph structures to generate questions for academic

writing support. IEEE Transactions on Learning Technologies, 5(3), 251-263.

[65] Long, P. and Siemens, G. (2011). Penetrating the Fog: Analytics in Learning

and Education. Educause Review, 46(5), 31-40.

[66] Lou, B. and Guy, A. (1998). The BNC handbook: exploring the British National Corpus. Edinburgh: Edinburgh University Press.

[67] Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
