8.1 Summary
This work presents an adaptive test environment that helps learners of English
as a foreign language improve their understanding. We propose a personalized
automatic quiz generation model that generates multiple-choice questions of
varying difficulty and selects questions according to a student's estimated
proficiency level and the unclear concepts behind his or her incorrect
responses. We also present a reading difficulty estimation method designed for
learners of English as a foreign language. Using the Bayesian Information
Criterion (BIC), we investigate the optimal combination of features for
improving reading difficulty estimation; the selected features are fed to a
linear regression model that estimates the reading level of a document.
Finally, we present a novel and interpretable statistical ability estimation
based on the quantiles of acquisition grade distributions and Item Response
Theory, which incorporates long-term observations into a student's estimated
ability. The empirical study showed the following:
(1) The proposed personalized design with the appropriate instructional
scaffolding [...]
(2) Students taught with the proposed personalized method corrected more of
the questions they had answered incorrectly than students in the
traditional test environment did.
(3) The questionnaire results showed that the proposed personalized method
can identify the knowledge students still need to improve and help them
understand their strengths and weaknesses; furthermore, most subjects
agreed that the proposed system offers sound functionality and quality.
The proposed reading difficulty model not only employs the complexity of
lexical and syntactic features but also introduces meaningful new features
such as the word and grammar acquisition grade distributions, word senses,
and co-referential relations. The evaluations showed the following:
(4) Among the representative features of the proposed reading difficulty
estimation, the word acquisition grade distributions play a particularly
important role for reading materials written for learners of English as
a foreign language.
(5) The proposed reading difficulty estimation outperformed previous work.
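To make the BIC-driven feature selection described at the beginning of this
chapter concrete, the following is a minimal sketch, assuming a feature matrix
X (one column per candidate feature, e.g., lexical, syntactic, or
acquisition-grade statistics) and gold reading levels y; all names are
illustrative and not the thesis's actual implementation.

```python
import numpy as np

def bic_of_fit(X, y):
    """Fit ordinary least squares and return the model's BIC."""
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    rss = float(np.sum((y - Xb @ beta) ** 2))
    # BIC for Gaussian errors (up to a constant):
    # n * ln(RSS / n) + (number of parameters) * ln(n)
    return n * np.log(rss / n) + (k + 1) * np.log(n)

def forward_select(X, y):
    """Greedily add the feature that lowers BIC most; stop when none helps."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], np.inf
    while remaining:
        score, j = min((bic_of_fit(X[:, chosen + [j]], y), j)
                       for j in remaining)
        if score >= best:
            break                                   # no remaining feature improves BIC
        best = score
        chosen.append(j)
        remaining.remove(j)
    return chosen
```

The selected columns would then be used to fit the final linear regression
that maps a document's features to a grade level.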
This work also develops a statistical and interpretable method of ability
estimation that captures the progression of learning over time in a Web-based
test environment. Moreover, it provides an explainable interpretation of the
statistical measurement based on the quantiles of acquisition grade
distributions and Item Response Theory. The simulation demonstrated the
following:
(6) The proposed ability estimation based on the grade distributions was
robust, especially when the responses were uncertain.
(7) The proposed ability estimation was more accurate than the other
ability estimations and provides a better understanding of student
competence.
(8) The empirical results revealed that the correlations between estimated
abilities incorporating the testing history were higher than those based
only on the responses at the current test. Moreover, students estimated
as advanced graders showed significantly higher post-test scores and
better responses than those estimated as basic graders.
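The weighted use of testing history behind finding (8) can be illustrated
with a minimal exponential-smoothing sketch in the spirit of Brown (2004);
the weight w is an assumption for illustration, not a value estimated in this
work.

```python
def update_ability(history_estimate, current_estimate, w=0.6):
    """Exponentially smoothed ability: weight w on the current test and
    (1 - w) on the estimate accumulated from earlier tests."""
    return w * current_estimate + (1 - w) * history_estimate

# e.g., a learner previously estimated at grade 5.0 who now tests at 6.0
# would be placed at 0.6 * 6.0 + 0.4 * 5.0 = 5.6
```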
8.2 Contribution
To our knowledge, this work is the first empirical study to analyze student
performance with automatically generated questions and a personalized test
strategy. Table 16 provides a comparison of the proposed system with previous
test environments. In the traditional test environment, examinees in the same
grade or class usually take the same tests, prepared beforehand by experts.
In the adaptive test environment (e.g., Barla et al., 2010), tests are
likewise made beforehand by experts, but examinees in the same grade or class
can receive different tests depending on their ability. With automatic
question generation (e.g., Mitkov & Ha, 2003), tests save both time and
production costs; nevertheless, they are usually not designed for any
particular test purpose. In our method, questions and their difficulties are
not only generated automatically but also assigned to examinees according to
their abilities and previous mistakes. Each examinee's performance is
recorded in the system, and the concepts behind incorrectly answered
questions are reincorporated into future tests; abilities are estimated from
the current responses together with the testing history. Additionally, the
estimated proficiency level in this study corresponds to an explicit grade
level in a school, whereas the ability estimation in the traditional adaptive
test environment yields a point on an implicit scale. Our approach thus
retains the advantages of both the adaptive test environment and automatic
question generation. It offers students an effective way to automatically
measure their understanding and clear up their misconceptions; moreover, it
reduces the teachers' burden of question generation, leaving them more time
to teach and assist students.
Table 16 Comparison of different test environments.

Test environment                           Automation   Personalization
The traditional test environment           No           No
The adaptive test environment              No           Yes
The automatic question test environment    Yes          No
The proposed adaptive test environment     Yes          Yes
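A hypothetical sketch of the selection policy described above: questions are
drawn near the learner's estimated grade level, and concepts behind earlier
incorrect answers are reinserted into future tests. The Question record and
its fields are illustrative, not the system's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Question:
    concept: str      # e.g., a target word or grammar point
    difficulty: int   # grade level of the question

def select_test(pool, ability, missed_concepts, size=10):
    """Prefer questions on previously missed concepts, then fill the rest of
    the test with items whose difficulty is closest to the estimated ability."""
    review = [q for q in pool if q.concept in missed_concepts]
    fresh = [q for q in pool if q.concept not in missed_concepts]
    fresh.sort(key=lambda q: abs(q.difficulty - ability))
    return (review + fresh)[:size]
```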
This work is the first to mathematically draw connections among question
difficulties, ability estimation, and the acquisition grade distributions.
Through this idea, for example, a student's estimated ability corresponds to
grade level six because he or she answered 90 percent of the items correctly
in a test whose difficulty levels are normally distributed around level six,
and this behavior matches 80 percent of the population (assuming s = 90% and
r = 80%). Unlike traditional approaches, which focus on a norm-referenced
item parameter scale for an individual item, the ability estimated by the
proposed method is explainable: the ability scale is grounded in the school
grade at which most people acquire the tested knowledge. In addition, in the
proposed method an examinee's ability is estimated from the responses to all
questions in a test, whereas in traditional approaches the ability is updated
from individual questions. This point is similar to Classical Test Theory
(Crocker & Algina, 1986), which treats all responses in a test as an
examinee's observed score. However, the result of Classical Test Theory is
sample-dependent; in contrast, the estimate from the proposed method is
stable because it is based on the acquisition grade distributions. Moreover,
our estimated ability is obtained from a weighted combination of an
examinee's current performance and his or her historical data, and more
historical data allows the ability to be estimated more accurately. This
characteristic retains the advantage of BME (Bock & Mislevy, 1982; Baker,
1993; Lee, 2012), which considers the successive change in the ability level
within a learning session, while achieving more accurate results than BME.
Finally, the experimental sample in this study was drawn from a student
population with varied abilities, whereas the parameters in Lee's research
(2012) were estimated on a student population with a similar knowledge level.
Even though the characteristics of Item Response Theory are robust enough to
use the same student population without losing generality, it is preferable
to estimate parameters from different student populations.
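A heavily simplified sketch of the construction above, assuming each tested
concept comes with an acquisition-grade CDF (the fraction of the population
that has acquired it by each grade); s is the observed proportion of correct
answers and r the population quantile. This illustrates the idea only and is
not the estimator derived in the thesis.

```python
import numpy as np

def estimated_grade(acq_cdfs, correct, s=0.9, r=0.8, grades=range(1, 13)):
    """Return the smallest grade by which at least a fraction r of the
    population has acquired the concepts the learner answered correctly,
    provided the learner's correctness rate reaches s."""
    if np.mean(correct) < s:
        return None                     # target correctness not reached
    # keep the acquisition CDFs of the concepts answered correctly
    mastered = [cdf for cdf, ok in zip(acq_cdfs, correct) if ok]
    for g in grades:
        if np.mean([cdf[g] for cdf in mastered]) >= r:
            return g
    return max(grades)
```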
Several implications can be drawn from this study for learners who study
English in this learning environment. First, it provides a personalized
learning environment: students with different abilities can practice adaptive
exercises of appropriate difficulty and repeatedly revisit unclear concepts.
This can serve as a qualitative guideline for identifying students' current
learning status and providing instructional support, which in turn
strengthens what students have not yet acquired. For example, once a
student's estimated ability is determined, the student can understand his or
her learning status, because the ability is estimated from the difficulty
levels of the words he or she has acquired; this makes it easier for students
to see the extent of their proficiency at the different levels. Moreover,
because the system records students' behavior, teachers can use this
information to clear up misunderstandings that students have. Second, it
removes the barrier of physical academic textbooks: with online resources
available and updated every day, learners can learn something new whenever
they want. Finally, the framework could serve a quantitative purpose by
adapting to different learning environments with flexible measurement,
setting different values of the two parameters r and s under various
conditions; a good example is native speakers versus second language
learners. In this way, teachers can adjust the parameters of the proposed
ability estimation to the test purpose, defining a qualified ability in terms
of the age by which a certain percentage of a population has acquired the
knowledge.
8.3 Limitations
Limitations of our evaluations leave ample room for future research.
One limitation concerns the difficulty of the reading comprehension questions
used in the study. To develop this question type, we only took the predicted
value from the proposed reading difficulty estimation into consideration.
Other criteria should be identified for the characteristics of anaphoric
relations, e.g., forward versus backward reference, or the most frequent
mistakes in coreference resolution.
Another limitation of our current research is the limited set of question
types, even though vocabulary, grammar, and reading comprehension
(referential) questions were proposed in this study. It would be desirable to
see more types of generated questions in future work. Moreover, because of
the limited number of question types, it is difficult to identify students'
incorrect responses in reading comprehension questions. Although these
questions are classified into various difficulties, this could be
insufficient to investigate students' understanding. One possible solution is
to observe and learn from data; however, this requires researchers or
students to label and define such a resource.
Future work should evaluate the personalized questions on additional
criteria. Even though these questions were evaluated with empirical data,
e.g., a questionnaire, the quality of generated questions could be examined
further. For example, the generated questions could be evaluated by a
representative sample of experts: is a generated question acceptable? One
criterion is psychometric reliability: how well does performance on a
question correlate with performance on other questions with the same
difficulty (see the sketch below)? Another idea is to design a way to filter
out invalid generated questions automatically.
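As one concrete form of the reliability criterion just mentioned, the sketch
below correlates the responses to a single generated question with the total
score on the other questions; the 0/1 response matrix (students x questions),
assumed to be restricted to items of the same difficulty, is an illustrative
input rather than part of this work.

```python
import numpy as np

def item_total_correlation(responses, item):
    """Correlate one item's 0/1 responses with the rest-of-test total score
    over the remaining items (assumed to share the item's difficulty)."""
    rest_total = np.delete(responses, item, axis=1).sum(axis=1)
    return np.corrcoef(responses[:, item], rest_total)[0, 1]
```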
Another limitation is that the distribution of item difficulties of questions
in a test was assumed to be normal. Even though teachers usually design the
combination of question difficulties in a test to be approximately normal,
some questions are generated uniformly. One possible solution is to take into
consideration the item discrimination parameter and the guessing parameter of
the three-parameter logistic model of Item Response Theory, whose item
characteristic curve models the probability of a correct response as a
function of an examinee's ability and the item parameters (see the sketch
below). This concern would be desirable to address in future work.
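For reference, the item characteristic curve of the standard
three-parameter logistic model mentioned above is
P(correct | theta) = c + (1 - c) / (1 + e^(-a(theta - b))), with
discrimination a, difficulty b, and guessing parameter c; a direct
transcription follows.

```python
import math

def icc_3pl(theta, a, b, c):
    """Probability of a correct response at ability theta under the 3PL
    model: a guessing floor c plus a logistic rise governed by
    discrimination a around difficulty b."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```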
A further limitation is that this approach focuses only on English learning.
The personalized framework may be applied to other language learning fields,
but other disciplines, such as mathematics, would need to be redesigned.
8.4 Future applications
One possible use is human-assisted machine generation of personalized
questions, for example, with a human editing or selecting among candidate
questions generated automatically, thereby reducing the human effort
currently required to compose questions and producing them more
systematically. Further research might extend the framework toward fully
automatic use. One direction for further development is to design an
automatic evaluation of generated questions: if a generated question is
reported as unacceptable, it should be removed and used to improve the
algorithm.
Another potential application is adaptive testing based on Big Data (Long &
Siemens, 2011). With the emergence of abundant online learning materials and
electronic textbooks, it is highly practical to employ the proposed framework
of personalized automatic question generation in the future. We can imagine a
scenario in which a learner of English as a foreign language reads up-to-date
news and immediately takes a test for self-evaluation. We look forward to a
fast adoption of this learning environment and hope that students and
teachers will benefit from this work.
References
[1] Agarwal, M. & Mannem, P. (2011). Automatic gap–fill question generation
from text books. Proceedings of the 6th Workshop on Innovative Use of NLP
for Building Educational Applications, 56–64.
[2] Agarwal, M., Shah, R., & Mannem, P. (2011). Automatic question generation
using discourse cues. Proceedings of the 6th Workshop on Innovative Use of
NLP for Building Educational Applications, 1–9.
[3] Alario, F. X., Ferrand, L., Laganaro, M., New, B., Frauenfelder, U. H. & Segui,
J. (2005). Predictors of picture naming speed. Behavior Research Methods,
Instruments, & Computers, 36, 140-155.
[4] Anderson, R. C., & Biddle, W. B. (1975). On asking people questions about
what they are reading. Psychology of learning and motivation, 9, 89-132.
[5] Baeza-Yates, R. & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New
York: ACM Press.
[6] Baker, F. B. (1993). Equating tests under the nominal response model. Applied
Psychological Measurement, 17, 239-251.
[7] Bates, E. (2003). On the nature and nurture of language. Retrieved November 24,
2011, from http://crl.ucsd.edu/bates/papers/pdf/bates-inpress.pdf
[8] Barla, M., Bielikova, M., Ezzeddinne, A. B., Kramar, T., Simko, M. & Vozar, O.
(2010). On the impact of adaptive test question selection for learning efficiency.
Computers & Education, 55(2), 846–857.
[9] Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based
approach. Computational Linguistics, 34(1), 1-34.
[10] Brown, R. G. (2004). Smoothing, forecasting and prediction of discrete time
series. New York: Dover Publications.
[11] Brown, J., Frishkoff, G. & Eskenazi, M. (2005). Automatic question
generation for vocabulary assessment. Proceedings of the Conference on
Human Language Technology and Empirical Methods in Natural Language
Processing, 819-826.
[12] Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a
microcomputer environment. Applied Psychological Measurement, 6, 431-444.
[15] Brysbaert, M., Wijnendaele, I. V., & Deyne, S. D. (2000). Age-of-acquisition
effects in semantic processing tasks. Acta Psychologica, 104(2), 215-226.
[16] Carroll, J. B. & White, M. N. (1973). Word frequency and age of
acquisition as determiners of picture-naming latency. Quarterly Journal
of Experimental Psychology, 25(1), 85-95.
[17] Chali, Y. & Hasan, S. A. (2012). Towards Automatic Topical Question
Generation. Proceedings of the 24th International Conference on
Computational Linguistics, 475–492.
[18] Chen, C. M., Lee, H. M., & Chen, Y. H. (2005). Personalized e–learning
system using item response theory. Computers & Education, 44(3), 237–255.
[19] Chen, C. M., & Chung, C. J. (2008). Personalized mobile English vocabulary
learning system based on item response theory and learning memory cycle.
Computers & Education, 51(2), 624–645.
[20] Chen, C. Y., Ko, M. H., Wu, T. W. & Chang, J. S. (2005). FAST – Free
Assistant of Structural Tests. Proceedings of the Computational
Linguistics and Speech Processing (ROCLING 2005).
[21] Chen, W., & Mostow, J. (2011). A Tale of Two Tasks: Detecting Children's
Off-Task Speech in a Reading Tutor. Proceedings of the 12th Annual
Conference of the International Speech Communication Association,
1621-1624.
[22] Chen, W., Aist, G., & Mostow, J. (2009). Generating Questions
Automatically from Informational Text. Proceedings of AIED 2009 Workshop
on Question Generation, 17-24.
[23] Crocker, L., & Algina, J. (1986). Introduction to classical and modern test
theory. New York: Holt, Rinehart & Winston.
[24] Coleman, M. and Liau, T. L. (1975). A computer readability formula designed
for machine scoring. Journal of Applied Psychology, 60(2):283–284.
[25] Collins-Thompson, K. and Callan, J. (2004). A Language Modeling Approach
to Predicting Reading Difficulty. Proceedings of the Human Language
Technology Conference of the North American Chapter of the Association
for Computational Linguistics (HLT-NAACL 2004).
[26] Collins-Thompson, K., Bennett, P. N., White, R. W., Chica, S., & Sontag,
D. (2011). Personalizing Web Search Results by Reading Level.
Proceedings of CIKM 2011.
[27] Copestake, A., Flickinger, D., Pollard, C. & Sag, I. A. (2005). Minimal
Recursion Semantics: An Introduction, Research on Language and
Computation, 3, 281-332.
[28] Dale, E. and Chall, J. S. (1948). A Formula for Predicting Readability.
Educational Research Bulletin, 27(1).
[29] David, H. A. & Nagaraja, H. N. (2003), Order statistics. Marblehead, MA:
Wiley.
[30] Davies, R., Barbón, A., & Cuetos, F. (2013). Lexical and semantic
age-of-acquisition effects on word naming in Spanish. Memory & Cognition,
41(2), 297-311.
[31] El-Manzalawy, Y. and Honavar, V. (2005). WLSVM: Integrating LibSVM into
Weka Environment. Retrieved November 24, 2011, from
http://www.cs.iastate.edu/~yasser/wlsvm/
[32] Elo, A. (1978). The rating of chessplayers, past and present. New York: Arco
Publishers.
[33] Embretson, S., & Reise, S. (2000). Item response theory for
psychologists. New Jersey, USA: Lawrence Erlbaum.
[34] Fehr, C. N., Davison, M. L., Graves, M. F., Sales, G. C., Seipel, B., &
Sekhran-Sharma, S. (2012). The effects of individualized, online
vocabulary instruction on picture vocabulary scores: an efficacy study.
Computer Assisted Language Learning, 25(1), 87–102.
[35] Feng, L., Jansche, M., Huenerfauth, M., and Elhadad, N. (2010). A
Comparison of Features for Automatic Readability Assessment. Proceedings
of the International Conference on Computational Linguistics, 276-284.
[36] Flesch, R. (1948). A new readability yardstick. Journal of applied psychology,
32(3), 221-233.
[37] Gronlund, N. (1993). How to make achievement tests and assessments. New
York: Allyn and Bacon.
[38] Gunning, R. (1952). The technique of clear writing. New York:
McGraw-Hill.
[39] Hambleton, R. K., & Swaminathan, H. (1985). Item response theory:
principles and applications. Boston, MA: Kluwer-Nijhoff.
[40] Heilman, M., Collins-Thompson, K., Callan, J., and Eskenazi, M. (2007).
Combining Lexical and Grammatical Features to Improve Readability Measures
for First and Second Language Texts. Proceedings of the Human Language
Technology Conference, 460-467.
[41] Heilman, M., Collins-Thompson, K. and Eskenazi, M. (2008). An Analysis of
Statistical Models and Features for Reading Difficulty Prediction. Proceedings
of the Third ACL Workshop on Innovative Use of NLP for Building Educational
Applications, 71–79.
[42] Heilman, M. & Smith, N. A. (2009). Question generation via
overgenerating transformations and ranking. Technical Report
CMU–LTI–09–013, Language Technologies Institute, Carnegie Mellon
University. Retrieved from
http://www.cs.cmu.edu/~nasmith/papers/heilman+smith.tr09.pdf.
[43] Heilman, M. & Smith, N. A. (2010). Good question! statistical ranking for
question generation. Proceedings of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies,
609–617.
[44] Hsiao, H. S., Chang, C. S., Chen, C. J., Wu, C. H. & Lin, C. Y. (2013). The
influence of Chinese character handwriting diagnosis and remedial instruction
system on learners of Chinese as a foreign language, Computer Assisted
Language Learning, DOI: 10.1080/09588221.2013.818562.
[45] Ho, H. and Huong, C. (2011). A Multiple Aspects Quantitative Indicator for
Ability of English Vocabulary: Vocabulary Quotient. Journal of Educational
Technology Development and Exchange, 4(1), 15-26.
[46] Huang, S. X. (1996). A content-balanced adaptive testing algorithm for
computer-based training systems. Intelligent Tutoring Systems, 306–314.
[47] Izura, C., & Ellis, A. W. (2002). Age of acquisition effects in word
recognition and production in first and second languages. Psicologica,
23, 245-281.
[48] Johns, T. F., Hsingchin, L., & Lixun, W. (2008). Integrating
corpus-based CALL programs in teaching English through children's
literature. Computer Assisted Language Learning, 21(5), 483–506.
[49] Kate, R. J., Luo, X., Patwardhan, S., Franz, M., Florian, R., Mooney,
R. J., Roukos, S., & Welty, C. (2010). Learning to Predict Readability
using Diverse Linguistic Features. Proceedings of the 23rd International
Conference on Computational Linguistics, 546-554.
[50] Kidwell, P., Lebanon, G., and Collins-Thompson, K. (2009). Statistical
estimation of word acquisition with application to readability
prediction. Proceedings of Empirical Methods in Natural Language
Processing, 900-909.
[51] Kidwell, P., Lebanon, G., & Collins-Thompson, K. (2011). Statistical
Estimation of Word Acquisition With Application to Readability
Prediction. Journal of the American Statistical Association, 106(493),
21-30.
[52] Kincaid, J. P., Fishburne, R. P., Rogers, R. L., & Chissom, B. S.
(1975). Derivation of New Readability Formulas (Automated Readability
Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted
Personnel. Branch Report. Virginia: National Technical Information
Service.
[53] Kintsch, W. (1998). Comprehension: A Paradigm for Cognition. Cambridge:
Cambridge University Press.
[54] Kireyev, K. & Landauer, T. K. (2011). Word maturity: computational
modeling of word knowledge. Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics, 299–308.
[55] Klein, D. and Manning, C. D. (2003). Accurate Unlexicalized Parsing.
Proceedings of the 41st Meeting of the Association for Computational
Linguistics, 423-430.
[56] Klinkenberg, S., Straatemeier, M., & van der Maas, H.L.J. (2011). Computer
adaptive practice of Maths ability using a new item response model for on the
fly ability and difficulty estimation. Computers & Education, 57(2), 1813-1824.
[57] Kuo, C. H., Wible, D., Chen, M. C., Sung, L. C., Tsao, N. L. & Chio, C. L.
(2002). Design and implementation of an intelligent Web–based interactive
language learning system. Journal of Educational Computing Research, 27(3),
785–788.
[58] Lin, Y. C., Sung, L. C. & Chen, M. C. (2007). An automatic multiple-choice
question generation scheme for English adjective understanding. Proceedings of
the 15th International Conference on Computers in Education, 137-142.
[59] Lee, J. & Seneff, S. (2007). Automatic generation of cloze items for
prepositions. Proceedings of INTERSPEECH 2007, 2173–2176.
[60] Lee, Y. J. (2012). Developing an efficient computational method that
estimates the ability of students in a Web-based learning environment.
Computers & Education, 58(1), 579-589.
[61] Levy, R. & Andrew, G. (2006). Tregex and Tsurgeon: tools for querying
and manipulating tree data structures. Proceedings of the 5th
International Conference on Language Resources and Evaluation.
[63] Liu, C. L., Wang, C. H., Gao, Z. M., & Huang, S. M. (2005). Applications of
lexical information for algorithmically composing multiple–choice cloze items.
Proceedings of the Second Workshop on Building Educational Applications
Using Natural Language Processing, 1–8.
[64] Liu, M., Calvo, R. A., Aditomo, A., & Pizzato, L. A. (2012). Using
Wikipedia and conceptual graph structures to generate questions for academic
writing support. IEEE Transactions on learning technologies, 5(3), 251-263.
[65] Long, P. and Siemens, G. (2011). Penetrating the Fog: Analytics in Learning
and Education. Educause Review, 46(5), 31-40.
[66] Aston, G. & Burnard, L. (1998). The BNC handbook: exploring the British
National Corpus. Edinburgh: Edinburgh University Press.
[67] Manning, C. D. & Schütze, H. (1999). Foundations of statistical natural
language processing. Cambridge, MA: MIT Press.