
Department of Information Management
College of Management
National Taiwan University
Doctoral Dissertation

個人化電腦輔助出題於英文學習之研究
Personalized Computer-aided Question Generation for English Language Learning

黃意婷 Yi-Ting Huang

Advisor: Yeali S. Sun, Ph.D.

June 2015

This dissertation is submitted to the Department of Information Management, College of Management, National Taiwan University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Graduate student: Yi-Ting Huang, June 2015


Acknowledgements

I would first like to thank my advisor, Prof. Yeali S. Sun, for her dedicated guidance, her unconditional support and tolerance, and for letting me freely try different research methods and gain knowledge of my own through practice. I am also grateful to Prof. Jack Mostow of Carnegie Mellon University for his guidance during my visiting program; his rigorous and tireless scholarship is a model for my own learning.

I sincerely thank the members of my oral examination committee, including Prof. David Wible of National Central University and the professors from National Taiwan University, for their corrections and suggestions, which made this dissertation more complete.

I also thank the many people who helped with this work: the IWiLL team, the teachers and students who participated in the 18th, 19th and 21st IWiLL news reading activities and helped with the experiments and data collection, my labmates for their help with the research, and the colleagues at Academia Sinica for their technical support. Finally, I thank my friends and my beloved family for their companionship and encouragement; I dedicate this work to them.

Abstract (in Chinese)

Title: Personalized Computer-aided Question Generation for English Language Learning
Author: Yi-Ting Huang
Advisor: Yeali S. Sun, Ph.D.

In recent years, research on computer-aided question generation, which combines techniques from Natural Language Processing with methods from Computational Linguistics, has received growing attention in the field of Computer-assisted Language Learning. To provide learners of English as a second or foreign language with self-assessment, this study proposes a personalized approach that estimates the difficulty of learning materials and the proficiency of students, and applies it to computer-aided question generation. For reading difficulty estimation, rich linguistic features and language acquisition grade distributions are considered with respect to the characteristics of second language learning, yielding a reading difficulty analysis suited to second language learners. For ability estimation, Item Response Theory is combined with the acquisition grade distributions, and an examinee's long-term test results are considered to estimate the student's actual proficiency. For question generation, the interaction among vocabulary, grammar and reading ability is considered, and methods are proposed for generating vocabulary, grammar and reading questions at different difficulty levels; the estimated ability of each student is used to select reading materials and test items matching the student's level, and test results inform the next round of personalized question generation. Experimental results show that the proposed reading difficulty estimation and ability estimation are more accurate than previous related work; moreover, with the assistance of the personalized computer-aided question generation system, learners repeated fewer mistakes and showed clear progress.

Keywords: computer-aided question generation, reading difficulty estimation, ability estimation, Item Response Theory, computer-assisted language learning

THESIS ABSTRACT

Personalized Computer-aided Question Generation for English Language Learning

By Yi-Ting Huang DOCTOR OF PHILOSOPHY

DEPARTMENT OF INFORMATION MANAGEMENT NATIONAL TAIWAN UNIVERSITY

June 2015

ADVISER: Dr. Yeali S. Sun

In recent years, there has been increasing attention to computer-aided question

generation in the field of computer assisted language learning and Natural Language

Processing (NLP). However, the previous related work often provides examinees with

an exhaustive amount of questions that are not designed for any specific testing pur-

pose. In this study, we present a personalized automatic quiz generation that generates

multiple–choice questions at various difficulty levels and categories, including gram-

mar, vocabulary, and reading comprehension. We also design a reading difficulty esti-

mation to predict the readability of a reading material, for learners taking English as a

foreign language. The proposed reading difficulty estimation is based not only on the

complexity of lexical and syntactic features, but also on several novel concepts, in-


cluding the word and grammar acquisition grade distributions from several sources,

word sense from WordNet, and the implicit relations between sentences. Moreover, we

combine the proposed question generation with a quiz strategy for estimating a stu-

dent’s ability and question selection. We develop a statistical and interpretable ability

estimation. This method captures the succession of learning over time and provides an

explainable interpretation of a statistical measurement, based on the quantiles of acqui-

sition distributions and Item Response Theory (IRT). The concepts behind incorrectly

answered questions are reincorporated into future tests in order to improve the weak-

nesses of examinees. The results showed that the proposed second language reading difficulty estimation outperforms first language reading difficulty estimations, and that the proposed ability estimation is more accurate and robust than other ability estimations. In an empirical study, the results showed that the subjects using the personalized automatic quiz generation corrected their mistakes more frequently than those using computer–aided question generation alone. Moreover, these subjects demonstrated the most progress between the pre–test and post–test and correctly answered more difficult questions.


Keywords: computer-aided question generation, reading difficulty estimation, ability estimation, Item Response Theory, computer-assisted language learning.


Table of Contents

Oral Examination Committee Certification ... i

Acknowledgements ... ii

Abstract (in Chinese) ... iii

THESIS ABSTRACT ... iv

Table of Contents ... vii

List of Tables ... ix

List of Figures ... x

Chapter 1 Introduction ... 1

1.1 Background ... 1

1.2 Research problem ... 4

1.3 Research purpose ... 5

Chapter 2 Related Work ... 12

2.1 Question generation ... 12

2.1.1 Computer-aided question generation for language learning ... 12

2.1.2 Question generation in natural language processing ... 16

2.1.3 The importance of the generated questions ... 19

2.2 Personalization ... 21

2.2.1 Reading difficulty estimation ... 21

2.2.2 Ability estimation ... 25

Chapter 3 Computer-aided Question Generation ... 29

3.1 Vocabulary question generation ... 34

3.2 Grammar question generation ... 37

3.3 Comprehension question generation ... 41

Chapter 4 Personalization ... 47

4.1 Reading difficulty estimation ... 47

4.1.1 Baseline features ... 49

4.1.2 The word acquisition grade distributions features ... 51

4.1.3 Frequency features ... 53

4.1.4 Parse features ... 55

4.1.6 Semantic features ... 60

4.1.7 Relation features ... 61

4.1.8 Regression model ... 63

4.2 Ability estimation ... 65

4.3 Quiz Selection ... 69

Chapter 5 Evaluation on reading difficulty estimation ... 72

5.1 Data set ... 72

5.2 Metrics ... 73

5.3 Evaluation of the features ... 75

5.4 Optimal model selection ... 80

5.5 Reading difficulty estimation as classification ... 89

Chapter 6 Simulation on ability estimation ... 94

6.1 Setting ... 94

6.2 The characteristics of the proposed ability estimation ... 97

6.3 The comparison with other ability estimations ... 103

Chapter 7 An empirical Study ... 107

7.1 System and materials ... 107

7.2 Participants and procedure... 110

7.3 The performance of the proposed ability estimation with the empirical data ……….. ... 112

7.4 Student performance ... 118

7.5 Unclear concept enhancement ... 124

7.6 User satisfaction ... 125

Chapter 8 Discussion and Conclusion ... 130

8.1 Summary ... 130

8.2 Contribution ... 133

8.3 Limitations ... 138

8.4 Future applications ... 140

References ... 142


List of Tables

Table 1 Design of personalized questions with different question types: vocabulary,

grammar, and reading comprehension questions. ... 30

Table 2 Distractor templates were referred by a grammar textbook and an expert in order to ensure the disambiguation of distractors. ... 39

Table 3 Results of RMSE and correlation among different feature categories. ... 76

Table 4 Results of the optimal model selection. ... 83

Table 5 Results of the optimal model selection. ... 89

Table 6 Comparison between the estimations. ... 93

Table 7 The results of convergence point and RMSE (each row represents the degree of difference between the initial ability and the actual ability, and each column represents the number of time periods considered by the exponential weight of the current ability) ... 100

Table 8 The results of RMSE between MLE, Lee (2012) and the proposed ability estimation ... 106

Table 9 The correlation result between the estimated ability and the post-test in the control group and the experimental group ... 113

Table 10 The mean post-test score of the subjects in different estimated ability groups between both groups and the result of ANOVA ... 116

Table 11 The equations among question types represent that the log odds ratio of the observation that the student i correctly answers item j is in class 1 or the student incorrectly answers item j is in class 0. ... 118

Table 12 The results of the pretest and post-test between the control group and the experimental group ... 121

Table 13 Contingency tables for the number of correctly answered questions per difficulty level in the pretest and post-test. ... 123

Table 14 The mean and standard deviation of rectification rate. ... 125

Table 15 Questionnaire results. ... 126

Table 16 Comparison of different test environments. ... 134


List of Figures

Figure 1 The architecture of the personalized computer-aided question generation. ... 9

Figure 2 A paragraph and example generated questions: the bolded words represent stems, the bold italics are answers and the other plausible choices in the questions are called distractors. ... 34

Figure 3 The parse structure of the sentence "Many of the original Halloween traditions have developed today into fun activities for children". ... 41

Figure 4 A table of a database in the implemented system captures the incorrectly answered concepts of a student. ... 71

Figure 5 The performance of a selected model. ... 87

Figure 6 The changes in the estimated ability computed from the proposed method for the different weights (n=1, n=3, n=6, n=12) ... 102

Figure 7 Snapshots of the system: (a) An example of a given reading material from a news website; (b) An example of vocabulary items; (c) An example of grammar items; (d) An example of reading comprehension items; (e) An example of a score result with explicit warning. ... 109

Figure 8 The charts on the percentage value vary from strongly agree to strongly disagree for item six (upper left), item seven (upper right), item eight (lower left) and item nine (lower right). ... 129


Chapter 1 Introduction

1.1 Background

For many years, educational assessment has played an important role in teaching

and learning (Gronlund, 1993). It can evaluate the effectiveness of teaching, diagnose

the state of learning, and help the development of students’ learning (Chen, Lee, &

Chen, 2005; Chen & Chung, 2008; Johns, Hsingchin, & Lixun, 2008; Barla et al.,

2010). With the development of computers and the Internet, Computer Adaptive Test-

ing (CAT) is now a developing way to administer tests adapting to learners’ knowledge

or competence in language learning (Troubley, Heireman, & Walle, 1996). Based on

adaptive tests, examinees’ abilities can be more accurately measured by fewer suitable

questions (Weiss & Kingsbury, 1984; Van der Linden & Glas, 2000); moreover, student

performance has also been demonstrated to improve (Barla et al., 2010). CAT can not

only provide questions but also be combined with scaffolding hints and instructional

feedback (Feng, Heffernan & Koedinger, 2010). This facilitates students learning and

helps them acquire knowledge with external help. However, when a great number of questions are needed, there is often a shortage in assessment resources because it is time–consuming and cost–intensive for human

experts to manually produce questions.

In recent years, there has been increasing attention to computer-aided question

generation (also called automatic question generation or automatic quiz generation) in

the field of e-learning and Natural Language Processing (NLP). It is useful in multiple

subareas and has been proposed for use in generating instruction in tutoring systems

(Mostow & Chen, 2009), assessing domain knowledge (Mitkov & Ha, 2003), evaluat-

ing language proficiency (Brown, Frishkoff, & Eskenazi, 2005), assisting academic

writing (Liu, Calvo, Aditomo, & Pizzato, 2012) and question answering (Pasca, 2011).

In order to make the learning environment more effective and efficient, many re-

searchers have been exploring the possibility of an automatic question generation in

various contexts. For example, a wide variety of applications, such as Linguistics

(Mitkov, Ha, & Karamanis, 2006) and Biology (Agarwal & Mannem, 2011), identified

the important concepts in textbooks and generated multiple-choice questions and

gap-fill questions. In the domain of language learning, a growing number of studies

(Turney, 2001; Turney, Littman, Bigham, & Shnayder, 2003; Liu, Wang, Gao, &


Huang, 2005; Sumita et al., 2005; Lee & Seneff, 2007; Lin, Sung, & Chen, 2007; Pino,

Heilman, & Eskenazi, 2008; Smith, Avinesh, & Kilgarriff,, 2010) are now available to

not only drills and exercise, including vocabulary, grammar, reading questions, but also

formal exams, including SAT (Scholastic Aptitude Test) analogy questions and TOEFL

(Test of English as a Foreign Language) synonym task. To support academic writing,

Liu et al. (2012) used Wikipedia and the conceptual graph structures of research papers

and generated specific trigger questions for supporting literature review writing.

Several studies have addressed the benefit of automatic question generation in facilitating learning and teaching. The use of computer-aided question generation for educational purposes was motivated by research on reading comprehension, which consistently found that assessment is helpful in learning and enhances learners' retention of material (Anderson & Biddle, 1975). Mitkov et al. (2006) demonstrated that computer–aided question generation was more time–efficient than manual labor. Turney et al. (2003) showed that the generated SAT and TOEFL questions are comparable to those generated

by experts. Liu et al. (2012) found that the generated trigger questions were more use-

ful than manual generic questions and that the questions could prompt students to re-


flect on key concepts, because the questions were generated based on what students

read. With the advantage of automatic question generation, students can practice with-

out waiting for a teacher to compose a quiz, and teachers can spend more time on

teaching; moreover, besides evaluating students’ understanding, automatic question

generation can be designed with additional functions.

1.2 Research problem

Recent theories on learning have focused increasing attention on understanding

and measuring student ability. There is now general consensus over Vygotsky’s (1978)

observation that a learner’s ability in the Zone of Proximal Development (ZPD)—the

difference between a learner’s actual ability and his or her potential development—can

progress well with external help. Instructional scaffolding (Wood, Bruner, & Ross,

1976), closely related to the concept of ZPD, suggests that appropriate support during

the learning process helps learners achieve their learning goals. However, effective in-

structional support requires identifying students’ prior knowledge, tailoring assistance

to meet their initial needs, and then removing this aid when they acquire sufficient


knowledge.

Even though previous studies in the field of computer-aided question generation

automatically generate all possible questions based on their proposed approach in an

attempt to reduce the time and monetary cost of manual question generation, such an exhaustive list of questions is inappropriate for language learning, because it can lead to redundant, over–simplistic test questions that are unsuitable for evaluating student progress. Moreover, it is hard to achieve a meaningful testing purpose and maximize exam-

inees’ learning outcomes because the personalized design (Fehr et al., 2012; Hsiao,

Chang, Chen, Wu, & Lin, 2013; Wu, Su, & Liu, 2013) is still critically lacking.

1.3 Research purpose

This work is intended to provide personalized computer-aided question genera-

tion on formative assessment to assess students’ receptive skills in English as a foreign

or second language. It generates three question types, including vocabulary, grammar

and reading comprehension, and differs from previous studies in the way learners’

language proficiency levels are considered in the generating process and questions are


generated with difficulties. The definition of “personalization” refers to the adjustment

to learner needs by matching the difficulty of questions to their knowledge level. In

other words, questions are generated based on an individual’s ability even though stu-

dents read the same learning material.

This work, the personalized computer-aided question generation, is based on a concept related to the age of acquisition (AOA). The basic idea of age of acquisition is

the age at which a word, a concept, even specific knowledge is acquired. For instance,

people learn some words such as “dog” and “cat” before others such as “calculus” and

“statistics”. Numerous studies in psychology and cognitive science have shown its positive influence on brain processes, such as object recognition (Urooj et al.,

2013), object naming (Carrolla & Whitea, 1973; Morrison, Ellis, & Quinlan, 1992;

Alario, Ferrand, Laganaro, New, Frauenfelder, & Segui, 2005; Davies, Barbón, &

Cuetos, 2013), and language learning (Brysbaert, Wijnendaele, & Deyne, 2000;

McDonald, 2000; Izura & Ellis, 2002; Zevin & Seidenberg, 2002). Today, with the

vast amount of content available from the web and other digital resources, this

concept can be realized with advanced technology, Information Retrieval (Baeza-Yates


& Ribeiro-Neto, 1999; Manning, Raghavan, & Schütze, 2008) and Natural Language

Processing (Manning & Schütze, 1999), which counts word frequency and calculates

the probability that a word is acquired at a certain school grade when given a

group of documents. With a large enough resource, such as an extensive collection of

all learning materials which people read and learn, the acquisition grade distributions

can be computed and implemented. For example, based on textbooks authored specifi-

cally for students in grade level six, questions can be generated from concepts in these textbooks; depending on whether a student answers them correctly, the student can be said to either have or lack the skills of grade level six. This implies that learning materials, such as textbooks, are written with the intent to represent what learners at a certain grade level learn and acquire. Two studies related to this concept are a reada-

bility prediction (Kidwell, Lebanon & Collins-Thompson, 2011), which mapped a

document to a numerical value corresponding to a grade level based on the distribution

of acquisition age, and a word difficulty estimation (Kireyev & Landauer, 2011), which

modeled language acquisition with Latent Semantic Analysis to compute the degree of

knowledge of words at different learning stages.
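The acquisition grade distribution idea can be made concrete with a small sketch. The following Python fragment is illustrative only (the corpus layout, tokenization and function name are assumptions, not the procedure used later in this thesis): it estimates, for each word, the share of its occurrences observed at each school grade in a graded collection of learning materials.

```python
# A minimal sketch: estimate the probability that a word belongs to each school
# grade from graded reading materials (layout and names are illustrative).
from collections import Counter, defaultdict

def acquisition_grade_distribution(graded_texts):
    """graded_texts: dict mapping grade level -> list of documents (strings)."""
    counts = defaultdict(Counter)          # grade -> word -> frequency
    for grade, docs in graded_texts.items():
        for doc in docs:
            counts[grade].update(doc.lower().split())

    words = set().union(*(c.keys() for c in counts.values()))
    dist = {}
    for word in words:
        per_grade = {g: counts[g][word] for g in counts}
        total = sum(per_grade.values())
        # P(grade | word): the share of the word's occurrences seen at each grade
        dist[word] = {g: n / total for g, n in per_grade.items()}
    return dist

# Example: "dog" is spread across grades, "calculus" concentrates at grade 6.
corpus = {1: ["the dog and the cat"], 6: ["calculus and statistics", "the dog"]}
print(acquisition_grade_distribution(corpus)["dog"])   # {1: 0.5, 6: 0.5}
```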


In response to the personalized design based on the acquisition grade distributions,

we propose a personalized automatic quiz generation to generate multiple–choice

questions with varying difficulty, a reading difficulty estimation to predict the difficul-

ty level of an article for English as foreign language learners, as well as an interpreta-

ble and statistical ability estimation to estimate a student’s ability with inherent ran-

domness in the acquisition process, specifically in the Web-based learning environment,

as shown in Figure 1.

The purpose of personalized testing is to not only measure the achievement per-

formance of students, but also help them improve their own learning process and cor-

rect their mistakes by understanding what they have learned and have not learned yet.

Through this approach, students can read any materials online and then do more exer-

cises to understand their strengths and to improve their weaknesses, as a strategy to

guide them to language acquisition.


Figure 1 The architecture of the personalized computer-aided question generation.

The main research questions addressed in this study are:

(1) Does the proposed personalized design with the appropriate instructional

scaffolding help students advance their learning progress?

(2) Does the proposed personalized question selection help students correct their

unclear concept?

(3) What are students' perceptions and experiences of the proposed personalized

computer-aided question generation?

We also conduct simulation and empirical evaluations to investigate the properties of the proposed reading difficulty estimation and ability estimation:

(4) What are the representative features of the proposed reading difficulty esti-

mation in English as a foreign or second language?

(5) How is the performance of the proposed reading difficulty estimation com-

pared with other reading difficulty estimations?

(6) What are the characteristics of the proposed ability estimation based on the

quantiles of acquisition grade distributions and item response theory?

(7) How is the performance of the proposed ability estimation compared with

the other ability estimations?

(8) How is the performance of the proposed ability estimation with the empirical

data in a Web-based learning environment?

The rest of this article is organized as follows. Chapter 2 describes related work.

In Chapter 3, we present the design of automatic quiz generation and the mechanism

for assigning question difficulty. Chapter 4 outlines the personalization framework,

consisting of reading difficulty estimation, ability estimation and quiz selection. In

Chapter 5 and Chapter 6, we present simulation evaluations of reading difficulty esti-

mation and ability estimation respectively. Chapter 7 evaluates the effectiveness of


personalized computer-aided question generation in the empirical study. Finally, Chapter 8 concludes with contributions, limitations, and potential applications.










Chapter 2 Related Work

In this chapter, the background of question generation is presented, including

computer-aided question generation for educational purposes and question generation in natural language

processing. Next, the related work of reading difficulty estimation is also introduced.

Finally, a modern theory of testing, Item Response Theory, will be discussed.

2.1 Question generation

2.1.1 Computer-aided question generation for language learning

Computer-aided question generation is the task of automatically generating ques-

tions, which consists of a stem, a correct answer and distractors, when given a text.

These generated questions can be used as an efficient tool for measurement and diag-

nostics. The first computer–aided question generation was proposed by Mitkov and Ha

(2003). Multiple–choice questions are automatically generated by three components:

term extraction, distractor selection and question generation. First, noun phrases are


extracted as answer candidates and sorted by term extraction. The more frequent a

term appears, the more important the term becomes. The terms with higher term fre-

quency consequently serve as answers to the generated questions. Next, WordNet

(Miller, Beckwith, Fellbaum, Gross, & Miller, 1990) is consulted by the distractor se-

lection in order to capture the semantic relation between each incorrect choice and the

correct answer. Finally, the generated questions are formed by predefined syntactic

templates. Most of the following studies are based on such system architecture.

A growing number of studies now shed light on the domain of English language learning, such as vocabulary, grammar and comprehension, because these question generation systems analyze linguistic characteristics to help produce items, much as experts do. In vocabulary assessment, Liu et al.

(2005) investigated word sense disambiguation to generate vocabulary questions in

terms of a specific word sense, and considered the background knowledge of first lan-

guage of test-takers to select distractors. Lin et al. (2007) analyzed the semantics of

words and developed an algorithm to select substitute-word candidates from WordNet (Miller et al., 1990), filtered by web corpus searching. They presented adjective–


noun pair questions, including collocation, antonym, synonym and similar word ques-

tions in order to test students' understanding of semantics. Turney (2003) used a standard

supervised machine learning approach with feature vectors based on the frequencies of

patterns in a large corpus to automatically recognize analogies, synonyms, antonyms,

and associations between words, and then transformed those word pairs into multiple–

choice SAT (Scholastic Assessment Tests) analogy questions, TOEFL synonym ques-

tions and ESL (English as second language) synonym–antonym questions.

In grammar assessment, Chen et al. (2005) focused on automatic grammar quiz

generation. Their FAST system analyzed items from the TOEFL test and collected

documents from Wikipedia to generate grammar questions using a part–of–speech

tagger and predefined templates. Lee and Seneff (2007) specifically discussed an algorithm to generate questions about prepositions in language learning. They proposed two novel distractor selection methods: one applies a collocation–based method and the other uses deletion errors in a non-native corpus.

In reading comprehension assessment, the MARCT system (Yang et al., 2005)

designed three question types, including true-false question, numerical information


question and not-in-the-list questions. In the true-false question generation, they re-

placed words in a sentence, extracted from an article on the Internet, with the syno-

nyms or antonyms by using WordNet (Miller et al., 1990). In the numerical infor-

mation question generation, they listed some specific trigger words, such as “kilo-

gram”, “square foot”, and “foot”, corresponding to some predefined templates, such as

“what is the weight of”, “how large”, and “how tall”. In the not-in-the-list question

generation, they used terms listed in Google Sets to identify the question type and se-

lect distractors. Unlike previous methods, Mostow and Jang (2012) designed different

types of distractors to diagnose the cause of comprehension failure: ungrammatical, nonsensical, and plausible distractors. In particular, the plausible distractors considered the context in reading materials. They used a Naïve Bayes formula to score the relevance to the context of the paragraph and to words earlier in the sentence. A student's com-

prehension is judged by not only evaluating one’s vocabulary knowledge but also test-

ing the ability to decide which word is consistent with the surrounding context.


2.1.2 Question generation in natural language processing

Question generation has been a primary concern of the natural language pro-

cessing community through the question generation workshop and the shared task in

2010 (QGSTEC 2010; Rus et al., 2010). It is an important task in many different ap-

plications including automated assessment, dialogue systems (Piwek, Prendinger,

Hernault & Ishizuka, 2008), intelligent tutoring systems (Chen & Mostow, 2011), and

search interfaces (Pasca, 2011). The aim of the task is to generate a series of questions

based on the raw text from sentences or paragraphs. The question types include why, who, when, where, what, which, how many/long and yes/no questions. Generally,

the procedure of question generation task can be characterized in three components:

content selection, the identification of a question type, and question formulation. First,

the content selection identifies which part of the given text is worthy of being generat-

ed as a question. When the content is given, the identification will determine the ques-

tion type. Finally, the question formulation transforms the content into a question.

Many generation approaches to wh-questions have been developed, inclusive of


template-based (Chen, Aist, & Mostow, 2009; Mostow & Chen, 2009), syntactic-based

(Heilman & Smith, 2009; Heilman & Smith, 2010), semantic-based (Mannem, Prasad

& Joshi, 2010; Yao, Bouma, & Zhang, 2012), and discourse-based approach (Prasad

and Joshi, 2008; Agarwal, Shah & Mannem, 2011). To identify question type, both of

template-based and syntactic-based approaches focused on lexical information and the

syntactic structure of a single sentence and transformed them into questions. Chen et al.

(2009) enumerated words with conditional context, temporal context and modality ex-

pression, such as “if”, “after”, and “will”, as criteria for selecting questioning indica-

tors. Based on these indicators, they defined six specific rules to transform the in-

formative sentence into questions, like “What would happen if <x>?” in conditional

context, “When would <x>?” in temporal context and “What <auxiliary-verb> <x>?”

in linguistic modality. On the other hand, Heilman and Smith (2009) analyzed the

structures of sentences and proposed general–purpose rules using part-of-speech (POS)

tags and category labels. Their question generation, which produces derived sentences from complex sentences and transforms declarative sentences into questions, can generate more grammatical and readable questions rather than unnatural or senseless ones.
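To illustrate the template-based style of question generation described above, here is a minimal, hypothetical sketch; the indicator patterns and templates are invented for illustration and are not Chen et al.'s actual rule set.

```python
# A minimal sketch of template-based question generation: indicator words
# trigger a matching question template (patterns here are illustrative).
import re

RULES = [
    # (regex over the source sentence, question template)
    (re.compile(r"^If (?P<x>[^,]+), .*", re.I), "What would happen if {x}?"),
    (re.compile(r"^After (?P<x>[^,]+), .*", re.I), "When would {x}?"),
]

def generate_questions(sentence):
    """Return template questions whose indicator pattern matches the sentence."""
    out = []
    for pattern, template in RULES:
        m = pattern.match(sentence)
        if m:
            out.append(template.format(x=m.group("x").strip()))
    return out

print(generate_questions("If the lantern goes out, Jack must walk in the dark."))
# -> ['What would happen if the lantern goes out?']
```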

Since inter-sentential causal relations can also be identified by a semantic parser,

such as a semantic role labeler, semantic-based question generations made use of the

additional information of the semantic role labeling along with the marked relations.

Mannem, Prasad and Joshi (2010) used the predicate argument structures along with

semantic roles to identify important aspects of paragraphs. For instance, the label

“ARGM-CAU” can be seen as a cause clause marker. When the marker is recognized, a

corresponding question type, like “why”, will be generated. Similarly, with semantic

information, MrsQG system (Yao et al., 2012) transformed declarative sentences into

the Minimal Recursion Semantics (MRS, Copestake, Flickinger, Pollard, & Sag, 2005),

a theory of semantic representation of natural language sentences. And then MRS rep-

resentations of declarative sentences were mapped to interrogative sentences.

Cross-sentence information, such as discourse relation, has been particularly in-

fluential in contributing insights into question generation in the recent year. Prasad and

Joshi (2008) firstly used causal relations in the Penn Discourse Treebank (PDTB; Pra-

sad et al., 2008) as content selection trigger. They found that the PDTB causal relations


can be seen as providing the source for 71% of the why-questions in their experiment

settings. This implied the potential of the PDTB and inspired subsequent research to follow this concept. For example, Agarwal et al. (2011) used explicit discourse connectives, such as "because", "when", "although" and "for example", to select content for question formation, and constructed questions by disambiguating the sense of the discourse connectives, identifying the question type, and applying syntactic transformations to the content.

These techniques from the field of question generation may facilitate the development of question generation in various forms of question types. However, unlike the research directly aimed at generating questions for educational purposes, this line of work focuses only on generating questions from the given context and does not involve distractor selection.

2.1.3 The importance of the generated questions

While much work pertaining to computer–aided question generation has focused on the procedure of question generation and distractor selection, little work has analyzed the quality of the generated questions. Heilman and Smith (2010) used linguistic features, such as the number of tokens or noun phrases in a question, source

sentence and answer phrase, the score from the n-gram language model, and the pres-

ence of questioning words or negative words, to statistically rank the quality of gener-

ated questions. Agarwal and Mannem (2011) considered lexical and syntactic features,

like the similarity between sentence and the title of a given text, the presence of abbre-

viation, discourse connective and superlative adjective, to select the most informative

sentences from a document, and generated questions on them. Chali and Hasan (2012)

considered that questions associated with the main topics of the given content should be generated first, so they used Latent Dirichlet Allocation (LDA) to identify sub-topics closely related to the original topic, then applied the Extended String Subsequence Kernel (ESSK) to calculate their similarity with the questions, and computed the syntactic correctness of the questions with a tree kernel. Although these output

questions were improved by considering linguistic features, these studies still did not

take examinees into consideration.


2.2 Personalization

2.2.1 Reading difficulty estimation

Reading difficulty (also called readability) is often used to estimate the reading

level of a document, so that readers can choose appropriate material for their skill level.

Heilman, Collins-Thompson, Callan and Eskenazi (2007) described reading difficulty

as a function of mapping a document to a numerical value corresponding to a difficulty

or grade level. A list of features extracted from the document usually acts as the inputs

of this function, while one of the ordered difficulty grade levels is the output corre-

sponding to a reader’s reading skill.

Early related work on estimating reading difficulty only used a few simple fea-

tures to measure lexical complexity, such as word frequency or the number of syllables

per word. Because they took fewer features into account, most studies made assump-

tions on what variables affected readability, and then based their difficulty metrics on

these assumptions. One example is the Dale-Chall model (Dale and Chall 1948), which

determined a list of 3,000 commonly known words and then used the percentage of words outside this list to measure difficulty. Another example is the Lexile framework (Stenner, 1996), which used the mean log word frequency as a feature to measure lexical com-

plexity. Using word frequency to measure lexical difficulty assumes that a more fre-

quent word is easier for readers. Although this assumption seems fair, since a widely

used word has a higher probability of being seen and absorbed by readers, the assumption does not always hold, given the numerous differences in the words acquired by different language learners. This method is also susceptible to the diverse word frequency

rates found in various corpora.
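As an illustration of such frequency-based lexical measures, the following sketch computes a document's mean log word frequency against a reference corpus; the reference counts and add-one smoothing are assumptions for the example, not a specific published formula.

```python
# A minimal sketch of a frequency-based lexical feature: the mean log relative
# frequency of a document's tokens with respect to a reference corpus.
import math
from collections import Counter

def mean_log_word_frequency(document, reference_counts, total):
    """Rarer words (low reference frequency) pull the score down,
    signalling harder text."""
    tokens = document.lower().split()
    return sum(math.log((reference_counts.get(t, 0) + 1) / (total + 1))
               for t in tokens) / len(tokens)

reference = Counter({"the": 500, "dog": 60, "calculus": 2})
total = sum(reference.values())
print(mean_log_word_frequency("the dog", reference, total))
print(mean_log_word_frequency("the calculus", reference, total))  # lower (harder)
```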

More recent approaches have started to take n-gram language models into con-

sideration to assess lexical complexity, which can measure difficulty more accurately.

Collins-Thompson and Callan (2004) used the smoothed unigram language model to

measure the lexical difficulty of a given document. For each document, they generated

language models by levels of readability, and then calculated likelihood ratios to assign

the level of difficulty; in other words, the predicted value is the level with the highest

likelihood ratio of the document. Similarly, Schwarm and Ostendorf (2005) also uti-

lized statistical language models to classify documents based on reading difficulty lev-

el, and they found that trigram models are more accurate than bigram and unigram


ones.
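A minimal sketch of this language-model approach, assuming add-one smoothed unigram models and toy training data (illustrative, not Collins-Thompson and Callan's exact implementation), looks as follows: one model is trained per difficulty level and a new document is assigned the level whose model scores it highest.

```python
# A minimal sketch: one smoothed unigram model per level; a document gets the
# level with the highest log-likelihood (training data is illustrative).
import math
from collections import Counter

def train_unigram(docs):
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    # add-one smoothed log probability
    return lambda w: math.log((counts.get(w, 0) + 1) / (total + vocab))

def predict_level(document, models):
    """models: dict level -> smoothed unigram log-probability function."""
    tokens = document.lower().split()
    scores = {lvl: sum(lp(w) for w in tokens) for lvl, lp in models.items()}
    return max(scores, key=scores.get)

models = {1: train_unigram(["the dog ran", "a cat sat"]),
          6: train_unigram(["statistics and calculus", "the theorem holds"])}
print(predict_level("the dog sat", models))             # -> 1
print(predict_level("calculus of statistics", models))  # -> 6
```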

In addition to using fairly basic measures to calculate lexical complexity, prior

studies often only calculated the mean number of words per sentence to estimate

grammatical readability. Using sentence length to measure grammatical difficulty as-

sumes that a shorter sentence is syntactically simpler than a longer one. However, long

sentences are not always more difficult than shorter sentences. In response, more re-

cent approaches have started to consider the structure of sentences when measuring

grammatical complexity and making use of increasingly precise parser accuracy rates.

These studies usually considered more grammatical features such as parse features

per sentence in order to make a more accurate difficulty prediction. Schwarm and Os-

tendorf (2005) employed four grammatical features derived from syntactic parsers.

These features included the average parse tree height, the average number of noun

phrases, the average number of verb phrases, and the average number of subsidiary

conjunctions to assess a document’s readability. Similarly, Heilman et al. (2008) used

grammatical features extracted from automatic context-free grammar parse trees of

sentences, and then computed the relative frequencies of partial syntactic derivations.


In their model, the more frequent sub-trees are viewed as less difficult for readers.

These approaches have investigated the effect of the sentence structures; however, few

studies have examined the effect of grammar acquisition grade distributions for language learners.

The majority of research on reading difficulty has focused on documents written

for native readers (also called first language), and comparatively little work (Heilman

et al., 2007) has been done on the difficulties of documents written for second lan-

guage learners. Second language learners acquire a second language in a way distinct from native speakers. As Bates (2003) pointed out, there are wide differences in

the learning timelines and processing times between native and non-native readers;

first language learners learn all grammar rules before formal education, whereas sec-

ond language learners learn grammatical structures and vocabulary simultaneously and

incrementally. Almost all first-language reading difficulty estimations focus on vocab-

ulary features, while second-language reading difficulty estimations especially empha-

size grammatical difficulty (Heilman et al., 2007). Wan, Li and Xiao (2010) found that

college students in China still have difficulty reading English documents written for


native readers, even though they have learned English over a long period of time.

These studies indicate that it is unsuitable to apply a first-language reading difficulty

estimation directly; instead, second-language reading difficulty estimation must be de-

veloped.

2.2.2 Ability estimation

Item Response Theory (Embretson & Reise, 2000) is a modern theory of testing

that examines the relationship between an examinee’s responses and items related to

abilities measured by the items in the test. One of the interesting characteristics of Item

Response Theory is that an ability parameter and item parameters are invariant, while

these parameters in Classical Test Theory (CTT) vary by sample (Crocker & Algina,

1986). Three well-known ability estimations proposed by Item Response Theory are

maximum likelihood estimation (MLE), maximum a posteriori (MAP) and expected a

posteriori (EAP). The procedure of MLE, an iterative process, is to find the maximum

likelihood of a response to each item for an examinee. However, when an examinee answers all items correctly or all incorrectly, the maximum likelihood estimate diverges at some point during the estimation iteration (Hambleton & Swaminathan, 1985). One possible

solution to this problem involves using MAP (Baker, 1993) and EAP (Bock & Mislevy,

1982), which are variants of Bayes Modal Estimation (BME) and incorporate prior in-

formation into the likelihood function. Prior distributions can protect against outliers

that may have negative influence on ability estimation. For example, Barla et al. (2010)

employed EAP to score each examinee’s ability for each test.
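For illustration, the following sketch shows maximum likelihood ability estimation under a simple Rasch (one-parameter logistic) model using a grid search; the item difficulties and search grid are assumptions for the example, and this is background only, not the ability estimation proposed later in this thesis.

```python
# A minimal sketch of MLE ability estimation under a Rasch model.
import math

def p_correct(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_ability(responses, difficulties, grid=None):
    """responses: list of 0/1; difficulties: item difficulty per response."""
    grid = grid or [x / 10 for x in range(-40, 41)]      # theta in [-4, 4]
    def log_lik(theta):
        return sum(math.log(p_correct(theta, b)) if r else
                   math.log(1 - p_correct(theta, b))
                   for r, b in zip(responses, difficulties))
    return max(grid, key=log_lik)

# An examinee who answers the easy items but misses the hard ones.
print(mle_ability([1, 1, 1, 0, 0], [-2.0, -1.0, 0.0, 1.5, 2.0]))
```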

Even though Item Response Theory has been used for decades, the estimation

procedure of Item Response Theory is computation-intensive. Until recently, with the

rapid development of the computer industry, Item Response Theory has been increas-

ingly used in e-learning applications as an offline service. However, Item Response

Theory has till now had little application in Web-based learning environments, which

is unfortunate because a real-time and online assessment would be more desirable.

Fortunately, Lee (2012) proposed an alternative computational approach in which a

Gaussian fitting to the posterior distribution of the estimated ability could more effi-

ciently approximate that determined by the conventional BME approach.

In a Web-based learning environment, Computerized Adaptive Testing is usually


seen as a part of a component in the environment, providing learners with a combina-

tion of practice and measurement. But Klinkenberg et al. (2011) noted that the Item

Response Theory was designed for measurement only, the reason being that the pa-

rameters of items had to be pre-calibrated in advance before items were used in a test.

Generally, during the item calibration, an item should be taken by a large number of

people, ideally between 200 and 1,000, in order to estimate reliable parameters for

the items (Wainer & Mislevy, 1990; Huang, 1996). This procedure is very costly and

time-consuming, and also less beneficial for learning environments. It is especially

impractical because the calibration had to be conducted repeatedly in order to get ac-

curate norm referenced item parameters. Alternatively, Klinkenberg et al. (2011) in-

troduced a new ability estimation based on Elo’s (1978) rating system and an explicit

scoring rule. Elo’s rating system was developed for chess competitions and used to es-

timate the relative ability of a player. With this method, pre-calibration was no longer

required, and the ability parameter was updated depending on the weighted difference

between the response and the expected response. This method was employed in a

Web-based monitoring system, called a computerized adaptive practice (CAP) system,


and designed for monitoring arithmetic in primary education.
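The Elo-style update can be sketched as follows; the K factor and the logistic expected score are illustrative assumptions rather than Klinkenberg et al.'s exact scoring rule.

```python
# A minimal sketch: the ability estimate moves by a step proportional to the
# difference between the observed response and the expected response.
import math

def expected_score(theta, beta):
    """Expected probability of a correct answer for ability theta, item beta."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

def elo_update(theta, beta, correct, k=0.4):
    """Return the updated ability after one response (correct: 1 or 0)."""
    return theta + k * (correct - expected_score(theta, beta))

theta = 0.0
for beta, correct in [(-1.0, 1), (0.5, 1), (1.5, 0)]:
    theta = elo_update(theta, beta, correct)
print(round(theta, 3))   # a small positive ability after two correct responses
```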

Although much work has been done thus far, there are still some problems that

have attracted little attention. First, although every exercise performed by a student is

recorded in most of the Web-based learning environments listed above, the ability es-

timations of Item Response Theory only consider test responses at the time of testing,

rather than incorporating a testing history. Moreover, the result of estimating an exam-

inee's ability is often defined in terms of a norm-referenced value, whose interpretation in most ability estimations is simply a number or a label. For example, a student with a specific ability, such as level six, has a large proportion of knowledge similar to other students at grade level six. Unfortunately, as this

definition is qualitative rather than quantitative, this approach cannot provide a quanti-

tative result in terms of a student’s understanding.


Chapter 3 Computer-aided Question Generation

How can personalized questions be generated for different question types? In this chapter, we describe in turn the constraints on vocabulary questions, grammar questions, and reading comprehension questions.

Table 1 summarizes how to define the question difficulty and how distractors are

selected, and Figure 2 shows four questions (also called items) generated from a document.

Table 1 Design of personalized questions with different question types: vocabulary, grammar, and reading comprehension questions.

How to define question difficulty?
• Vocabulary question: a graded word list
• Grammar question: grammar frequency
• Independent referential question / Overall referential question: reading difficulty estimation

How to select a target sentence (with answer)?
• Vocabulary question: a word
• Grammar question: a sentence
• Independent referential question / Overall referential question: a referent (not a singleton)

Stem template
• Vocabulary question: In the sentence "… ______ …", the blank can be:
• Grammar question: In the Sentence, "… ______ …", the blank can be filled in:
• Independent referential question: The word "[target word]" in this sentence "[target sentence]" refer to:
• Overall referential question: Which of the following statement is TRUE?

Distractor candidate source
• Vocabulary question: words from a graded word list
• Grammar question: grammar patterns defined by a grammar book
• Independent referential question / Overall referential question: other noun phrases (common nouns or proper nouns) in the given article

Distractor selection
• Vocabulary question: word difficulty, part-of-speech, word length, Levenshtein distance
• Grammar question: disambiguation
• Independent referential question / Overall referential question: non-anaphora, not pronoun, number, gender

Document

Halloween, which falls on October 31, is one of the most unusual and fun holi-

days in the United States. It is also one of the scariest! It is associated with ghosts,

skeletons, witches, and other scary images. …Many of the original Halloween

traditions have developed today into fun activities for children. The most popular

one is "trick or treat." On Halloween night, children dress up in costumes and go to

visit their neighbors. When someone answers the door, the children cry out, "trick or

treat!" What this means is, "Give us a treat, or we'll play a trick on you!"… This tra-

dition comes from an old Irish story about a man named Jack who was very stingy.


He was so stingy that he could not enter heaven when he died. But he also could not

enter hell, because he had once played a trick on the devil. All he could do was walk

the earth as a ghost, carrying a lantern…

Quiz

1. In the sentence "It is __________ with ghosts, skeletons, witches, and other

scary images.", the blank can be:

(1) distributed (2) associated (3) contributed (4) illustrated

2. In the Sentence, "Many of the original Halloween traditions __________ today

into fun activities for children.", the blank can be filled in:

(1) have developed (2) have developing (3) is developed (4) develop

3. The word “he” in this sentence “All he could do was walk the earth as a ghost,

carrying a lantern” refer to:

(1) ghost (2) devil (3) witch (4) Jack

4. Which of the following statement is TRUE?

(1) On Halloween night, neighbors dress up in costumes and go to visit their children.

(2) What this means is, "Give us a trick, or we'll play a treat on you!"


(3) But the devil also could not enter hell, because he had once played a trick on the

witch.

(4) Jack was so stingy that he could not enter heaven when he died.

Figure 2 A paragraph and example generated questions: the bolded words represent

stems, the bold italics are answers and the other plausible choices in the questions are

called distractors.

3.1 Vocabulary question generation

The difficulty of a vocabulary question is determined by the difficulty of the cor-

rect answer. We assume that if a student selects the correct answer, he/she probably understood the question stem and distinguished the correct answer from the distractors. Here, the difficulty of a word refers to word acquisition, the temporal process by which learners learn the meaning, understanding and usage of new words. For most English as a foreign language learners, the acquisition grade distributions of different words can be inferred from textbooks or from a word list made by experts, because such learners learn the foreign language through the materials they study, not the environment they live in. In this study, the word difficulty is de-


termined by a word list made by an education organization. We adopted a wordlist

from the College Entrance Examination Center (CEEC) of Taiwan

(http://www.ceec.edu.tw/research/paper_doc/ce37/5.pdf ). It contains 6,480 words in

English, divided into six levels, which represent the grade in which a word should be

taught, as the word acquisition grade distributions. For each word from the given text,

we identify its difficulty by first referencing its difficulty level from within the word

list. When given the vocabulary proficiency level of a student, words with the same

difficulty level in the given document are selected as the basis to form test questions.

In the distractor selection, we also consult the same graded word list as the source

of distractor candidates. The distractors were selected by the following criteria: word

difficulty, part-of-speech (POS), word length and Levenshtein distance.

• Word difficulty: Distractors are selected with equal difficulty for two reasons. One is personalization: a student receives generated questions whose difficulty is the same as the student's proficiency level. The other is familiarity: choices must be familiar to students; otherwise the correct answer may be chosen simply because it is the only word the student knows.


• Part-of-speech (POS): Distractors have the same POS as the answer because

this keeps the target sentence grammatical while the distractor remains semantically inconsistent with the context of the target sentence. In this way, students are tested on lexical knowledge and comprehension instead of syntax. We use the Stanford POS

Tagger (Toutanova, Klein, Manning, & Singer, 2003) to identify words as

nouns, verbs, adjectives, or adverbs.

• Word length and Levenshtein distance: Distractors are ranked by the smallest word-length difference from the correct answer and by the Levenshtein distance of changing the prefix or suffix of a distractor into the correct answer. According to Perfetti and Hart (2002), even high-skilled students are easily confused when words share phonological forms with other homophones. We try to capture grapheme–phoneme similarity by considering word length and Levenshtein distance.

The first question in Figure 2 is a vocabulary question. When a knowledge level

four student is given, difficulty level four words, e.g. “associate”, are identified by the

graded word list. The sentence containing the word, “It is associated with ghosts, skel-


etons, witches, and other scary images”, is then extracted to form a question and take

“associate” as the correct answer. We also consult the same word list to select distrac-

tors which have the same difficulty (level 4) and part-of-speech (verb), and the smallest distance in word length (9) and Levenshtein distance (distributed: 6, illustrated: 7, contributed: 7).
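A minimal sketch of this distractor ranking is given below; the graded word list and POS tags are assumed to be available (e.g., from the CEEC list and the Stanford POS Tagger), and the ranking key is an illustrative reading of the criteria above.

```python
# A minimal sketch of vocabulary distractor ranking by word length difference
# and Levenshtein distance, after filtering by difficulty level and POS.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def rank_distractors(answer, candidates, level, pos, word_info):
    """word_info: word -> (difficulty level, POS); keeps same-level, same-POS
    candidates and ranks them by word-length difference, then edit distance."""
    pool = [w for w in candidates
            if w != answer and word_info.get(w) == (level, pos)]
    return sorted(pool, key=lambda w: (abs(len(w) - len(answer)),
                                       levenshtein(w, answer)))

info = {w: (4, "VBN") for w in
        ["associated", "distributed", "illustrated", "contributed"]}
# prints the candidates ranked by length difference and edit distance
print(rank_distractors("associated", list(info), 4, "VBN", info))
```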

3.2 Grammar question generation

The difficulty of a grammar question, similar to that of a vocabulary question, is determined by the difficulty of the grammar pattern of the correct answer. Un-

fortunately, unlike the aforementioned word list, there is no predefined grammar diffi-

culty measure available. In addition, second language learners usually learn grammati-

cal structures simultaneously and incrementally, while native speakers have learned all

grammar rules before formal education. Second language learning materials are organized according to a well-thought-out learning plan. Thus, we assigned the difficulty of a

grammar pattern based on the grade level of the textbook, which represents the gram-

mar acquisition grade distributions.


The difficulty of a grammar pattern is identified by the grade level of the textbook in which it frequently appears, representing the grammar acquisition grade distributions.

We manually predefined 44 grammar patterns from a grammar textbook for Taiwan

high school students and automatically calculated the rate of occurrence of grammar

patterns in a set of English textbooks. First, we used Stanford Parser (Klein and Man-

ning, 2002) to produce constituent structure trees of sentences. Next, Tregex (Levy & Andrew, 2006), a searching tool for matching patterns in trees, was used to recognize the instances of the target grammar patterns in the set of textbooks. Finally, we counted the frequencies of the syntactic grammar patterns in the corpus. This corpus

contains 342 articles written by different authors and collected from five different pub-

lishers (including The National Institute for Compilation and Translation, Far East Book

Company, Lungteng Cultural Company, San Min Book Company, and Nan-I Publish-

ing Company).
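The difficulty assignment itself can be sketched as follows, assuming the per-grade pattern counts have already been obtained with the Stanford Parser and Tregex; the counts and the choice of the grade with the highest occurrence rate are illustrative.

```python
# A minimal sketch: assign a grammar pattern's difficulty from its occurrence
# rate per textbook grade (counts below are hypothetical).
def grammar_difficulty(pattern_counts, sentence_counts):
    """pattern_counts / sentence_counts: dicts grade -> counts for one pattern.
    The pattern's difficulty is the grade where it occurs most densely."""
    rates = {g: pattern_counts.get(g, 0) / sentence_counts[g]
             for g in sentence_counts}
    return max(rates, key=rates.get)

# Hypothetical counts of "present perfect tense" matches in graded textbooks.
matches = {1: 120, 2: 80, 3: 40}
sentences = {1: 2000, 2: 2100, 3: 1900}
print(grammar_difficulty(matches, sentences))   # -> 1
```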

For generating grammar distractors, we also consult the same grammar textbook

and manually predefine distractor templates. These templates must also ensure that they contain no ambiguous choices, because sometimes more than one grammar pattern could be a correct answer in a sentence. For example, in the stem of the second question in Figure 2, the distractor "develop" would be consistent with the syntax of the target sentence regardless of the global context. Thus, we referred to the grammar textbook and an expert for

designing distractor templates for each grammar pattern (examples shown in Table 2).

Table 2 Distractor templates were designed with reference to a grammar textbook and an expert in order to ensure the disambiguation of distractors.

level | function name | example answer | distractor 1 | distractor 2 | distractor 3
1 | PerfectTense | has grown | have growing | have been grown | had grown
1 | OnetheOther | one…the other | one…another | one…other | one…the others
2 | TooAdjectiveTo | too happy to | too happy that | too happiest to | none of the above
2 | soThat | so heavy | so heavier | so heaviest | none of the above
2 | PastPerfectTense | had taken | had had taken | have taken | had been taken
3 | prepVing | in helping | in being help | in helped | in being helping
4 | GernudasObject | avoid taking | avoid to taking | avoid to take | avoid to took
5 | Passive | is used | is using | used | will be using
6 | RememberLike | remember to take | remembering to take | remember to taking | none of the above
6 | ModelAuxiliary | may have driven | may have driving | may has drived | may be drived

The second question in Figure 2 is a grammar question. The target testing purpose

in the second question is “present perfect tense”, which is taught in the first grade. The

original sentence is “Many of the original Halloween traditions have developed today

into fun activities for children”. The parse structure of the original sentence is in Figure

3. The grammar pattern of this parse structure can be automatically identified by the

Tregex patterns: /S.?/ < (VP < (/VB.?/ << have|has|haven't|hasn't)): /S.?/ < (VP < (VP <

VBN)). When a grammar pattern is recognized (the green part of the parse tree in Figure 3), the difficulty degree of the grammar question is assigned based on the matched grammar pattern.

Figure 3 The parse structure of the sentence “Many of the original Halloween tradi-

tions have developed today into fun activities for children”.

3.3 Comprehension question generation

The difficulty of the reading comprehension questions is based on the reading

level of the reading materials themselves. We assume that an examinee correctly an-

swers a reading comprehension question because he/she could understand the whole

story. The difficulty level of an article is correlated with the interaction between the


lexical, syntactic and semantic relations of the text and the reader's cognitive aptitudes.

Research on estimating the reading level of a given document has increased noticeably in recent years. Most past literature was designed for first language learners, but the learning timeline and processing of first language learners and second language learners are different. In this study, we adopt the reading difficulty estimation [6] designed for English as a second language learners to identify the difficulty of reading ma-

terials, as a difficulty measure for the reading comprehension questions.

Reading comprehension relies on a highly complicated set of cognitive processes (Nation & Angell, 2006). Among these processes, resolving anaphora and building a coherent knowledge representation, as in the construction-integration model (Kintsch, 1998), are key. Thus, in this work, we focus on relations between sentences to gener-

ate two kinds of meaningful reading questions based on noun phrase coreference resolu-

tion. Similar to Mitkov and Ha (2003), who extracted nouns and noun phrases as im-

portant terminology in reading material, we also focus on the interaction of noun

phrases as the test purpose. The purpose of noun phrase coreference resolution is to de-

termine whether two expressions refer to the same entity in real life. An example is ex-


cerpted from Figure 2 ("This tradition … on the devil"). It is easy to see that "Jack" refers to "a man" because of the semantic relationship between the sentences. The following occurrences of "he" are more difficult to judge as referring to "Jack" or "the devil" when examinees do not

clearly understand the meaning of the context in the document. This information is used

in this work to generate reading comprehension questions, in order to examine whether

learners really understand the relationship between nouns in the given context.

There are two question types generated in the reading comprehension questions: an

independent referential question for the single concept test purpose and an overall ref-

erential question for overall comprehension test purpose. When a noun phrase is select-

ed as a target word in the stem question, it should have an anaphoric relation with the

other noun phrase. In the first type, a noun phrase (a pronoun, a common noun or a

proper noun) is selected as the target word in the stem question, a noun phrase (a common noun or a proper noun) with the same anaphoric relation is chosen as the correct

answer and other noun phrases (common nouns or proper nouns) will be determined as

the distractors. In the second type, the same technique of the question generation applies

to a sentence level. We regenerate new sentences as choices by replacing a noun (a


pronoun, a common noun or a proper noun) with an anaphoric noun (a common noun or

a proper noun) as the correct answer and substituting a noun with a non-anaphoric noun

as distractors.

The distractors should satisfy the following constraints:

• Non-anaphoric relation: Distractors should have non-anaphoric relations. The anaphoric and non-anaphoric relations can be identified by the Stanford Coref-

erence system (Raghunathan, Lee, Rangarajan, Chambers, Surdeanu, Jurafsky,

and Manning 2010).

• Not pronoun: A pronoun is a replacement for a noun and depends on an antecedent (a common noun or a proper noun). Thus, distractors should be common nouns or proper nouns in order to have a clear test purpose.

• Number: Distractors should have the same number attribute (singular, plural or unknown) in order to keep the sentence grammatical. For example, "devil" in Figure 2 is singular; the number attribute of a distractor should be the

same. If not, an unacceptable distractor (a plural noun or a collective noun)

could violate the subject-verb agreement. The number attributes were given by


the Stanford Coreference system (Raghunathan, Lee, Rangarajan, Chambers,

Surdeanu, Jurafsky, and Manning 2010), based on a dictionary, POS tags and

Named Entity Recognizer (NER) tool.

• Gender: Distractors should have the same gender attributes (male, female, neu-

tral or unknown) in order to keep the sentence semantically plausible. For example, "Jack" in Figure 2 is male; the gender attribute of a distractor should be "male" or "neutral" rather than "female"; otherwise, students could answer the

question directly instead of reading the passages. The gender attributes were as-

signed by the Stanford Coreference system (Raghunathan, Lee, Rangarajan,

Chambers, Surdeanu, Jurafsky, and Manning 2010), which is from static lexi-

cons.

The third question in Figure 2 is an independent referential question, which assesses one's understanding of the concept of an entity involved in the sentences. The word "he" in

the original sentence “All he could … a lantern” refers to “Jack”, the distractors

“ghost”, “devil”, and “witch” have non-anaphoric relation, not pronouns, and are “sin-

gular" and "neutral". The fourth question in Figure 2 is an overall referential question,


which contains more than one concept that needs to be understood. The correct answer

is from the sentence “He was so stingy … died,” and the word “He” is replaced with

"Jack" because they have a referential relation. One of the distractors comes from "But he also could not … devil," where the word "he" refers to "Jack" rather than "devil"; we replace it with a non-anaphoric noun to form a distractor. This approach further examines the connection of concepts in the given learning material.
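A minimal sketch of the distractor constraints above is given below; the mention attributes (coreference cluster, number, gender) are assumed to come from a coreference system such as the one cited in the text, and the data layout is an illustrative assumption.

```python
# A minimal sketch: filter noun-phrase candidates so distractors are
# non-anaphoric to the target, not pronouns, and agree in number and gender.
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    is_pronoun: bool
    number: str        # "singular" | "plural" | "unknown"
    gender: str        # "male" | "female" | "neutral" | "unknown"
    cluster: int       # coreference cluster id

def select_distractors(target, answer, mentions):
    return [m.text for m in mentions
            if m.cluster != target.cluster               # non-anaphoric relation
            and not m.is_pronoun                         # not a pronoun
            and m.number == answer.number                # number agreement
            and m.gender in (answer.gender, "neutral")]  # gender constraint

he   = Mention("he", True, "singular", "male", cluster=1)
jack = Mention("Jack", False, "singular", "male", cluster=1)
cand = [Mention("ghost", False, "singular", "neutral", 2),
        Mention("devil", False, "singular", "neutral", 3),
        Mention("neighbors", False, "plural", "unknown", 4)]
print(select_distractors(he, jack, cand))   # -> ['ghost', 'devil']
```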


Chapter 4 Personalization

In this chapter, the personalized quiz strategy based on automatic quiz generation

is presented. This personalized quiz strategy aims to achieve the following three pur-

poses: first, we not only build a model to estimate reading difficulty, but also investi-

gate the optimal combination of features for improving reading difficulty estimation.

Next, an examinee's grade level is estimated by considering both the test responses and his or her historical data; in contrast, previous work only considered the current test responses. Finally, questions are selected according to not only the corresponding difficulties but also the examinee's unclear concepts behind previous incorrect responses. A student's previous mistakes are recorded and considered in advance in order to confirm whether he or she has learned from them. Through this iterative practice, students' understanding is enhanced by absorbing many different reading materials.
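A minimal sketch of this quiz selection strategy, under assumed data structures (a question pool with difficulty and concept labels, and a set of previously missed concepts), could look like this; it is an illustration of the idea, not the implemented system.

```python
# A minimal sketch: prefer questions testing previously missed concepts, then
# questions whose difficulty is closest to the estimated ability.
def select_quiz(questions, ability, missed_concepts, size=10):
    """questions: list of dicts with 'difficulty' and 'concept' keys."""
    def priority(q):
        concept_bonus = 0 if q["concept"] in missed_concepts else 1
        return (concept_bonus, abs(q["difficulty"] - ability))
    return sorted(questions, key=priority)[:size]

pool = [{"id": 1, "difficulty": 2, "concept": "present perfect tense"},
        {"id": 2, "difficulty": 4, "concept": "passive voice"},
        {"id": 3, "difficulty": 4, "concept": "present perfect tense"}]
chosen = select_quiz(pool, ability=4,
                     missed_concepts={"present perfect tense"}, size=2)
print([q["id"] for q in chosen])   # -> [3, 1]
```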

4.1 Reading difficulty estimation

As mentioned above, almost all past literature was designed for native readers,
