In this chapter, the background of question generation is presented, including computer-aided question generation for educational purposes and question generation in natural language processing. Next, related work on reading difficulty estimation is introduced. Finally, a modern theory of testing, Item Response Theory, is discussed.
2.1 Question generation
2.1.1 Computer-aided question generation for language learning
Computer-aided question generation is the task of automatically generating questions, each consisting of a stem, a correct answer and distractors, from a given text. These generated questions can be used as an efficient tool for measurement and diagnostics. The first computer-aided question generation system was proposed by Mitkov and Ha (2003). Multiple-choice questions are automatically generated by three components: term extraction, distractor selection and question generation. First, noun phrases are extracted as answer candidates and sorted by term extraction; the more frequently a term appears, the more important it is considered. The terms with the highest term frequency consequently serve as answers to the generated questions. Next, WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990) is consulted during distractor selection in order to capture the semantic relation between each incorrect choice and the correct answer. Finally, the generated questions are formed by predefined syntactic templates. Most subsequent studies are based on this system architecture.
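To make the term extraction step concrete, the following is a minimal Python sketch of frequency-based answer ranking in the spirit of Mitkov and Ha (2003); the noun-phrase chunking itself is assumed to be done by an external tool, and the candidate list is purely illustrative.

```python
from collections import Counter

def rank_answer_candidates(noun_phrases, top_k=3):
    """Rank candidate terms by raw frequency: the more often a term
    appears, the more important it is assumed to be."""
    counts = Counter(phrase.lower() for phrase in noun_phrases)
    return [term for term, _ in counts.most_common(top_k)]

# Candidate noun phrases as a shallow chunker might return them (toy data).
candidates = ["cell", "cell membrane", "nucleus", "cell", "organelle", "cell"]
print(rank_answer_candidates(candidates))  # -> ['cell', 'cell membrane', 'nucleus']
```

The top-ranked terms would then serve as answers, with WordNet supplying semantically related distractors as described above.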
A growing number of studies now shed light on the domain of English language learning, covering vocabulary, grammar and comprehension. In these question generation systems, linguistic characteristics are analyzed to help produce items, much as human experts do. In vocabulary assessment, Liu et al. (2005) investigated word sense disambiguation to generate vocabulary questions targeting a specific word sense, and considered the first-language background knowledge of test-takers to select distractors. Lin et al. (2007) analyzed the semantics of words, developed an algorithm to select substitute-word candidates from WordNet (Miller et al., 1990), and filtered them by web corpus searching. They presented adjective-noun pair questions, including collocation, antonym, synonym and similar-word questions, in order to test students' semantic understanding. Turney (2003) used a standard supervised machine learning approach with feature vectors based on the frequencies of patterns in a large corpus to automatically recognize analogies, synonyms, antonyms, and associations between words, and then transformed those word pairs into multiple-choice SAT (Scholastic Assessment Test) analogy questions, TOEFL synonym questions and ESL (English as a second language) synonym-antonym questions.
In grammar assessment, Chen et al. (2005) focused on automatic grammar quiz generation. Their FAST system analyzed items from the TOEFL test and collected documents from Wikipedia to generate grammar questions using a part-of-speech tagger and predefined templates. Lee and Seneff (2007) discussed an algorithm to generate questions about prepositions for language learning. They proposed two novel distractor selection methods: one applies a collocation-based method, and the other exploits deletion errors found in a non-native corpus.
In reading comprehension assessment, the MARCT system (Yang et al., 2005) designed three question types: true-false questions, numerical-information questions and not-in-the-list questions. In true-false question generation, words in a sentence extracted from an article on the Internet are replaced with synonyms or antonyms drawn from WordNet (Miller et al., 1990). In numerical-information question generation, specific trigger words such as “kilogram”, “square foot”, and “foot” are matched to predefined templates such as “what is the weight of”, “how large”, and “how tall”. In not-in-the-list question generation, terms listed in Google Sets are used to identify the question type and select distractors. Unlike previous methods, Mostow and Jang (2012) designed different types of distractors to diagnose the cause of comprehension failure: ungrammatical, nonsensical, and plausible distractors. In particular, the plausible distractors take the context of the reading material into account. They used a Naïve Bayes formula to score a candidate's relevance to the context of the paragraph and to the words earlier in the sentence. A student's comprehension is thus judged not only by evaluating vocabulary knowledge but also by testing the ability to decide which word is consistent with the surrounding context.
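As a rough illustration of this kind of scoring, the sketch below computes a Naïve Bayes style relevance score for a candidate word given surrounding context words. The factorization and the toy probability tables are assumptions made for illustration only, not the published model of Mostow and Jang (2012).

```python
import math

def nb_relevance(word, context_words, p_word, p_ctx_given_word, floor=1e-6):
    """Naive Bayes style score: log P(w) + sum of log P(c | w) over the
    context words c; higher scores indicate a more plausible fit."""
    score = math.log(p_word.get(word, floor))
    for c in context_words:
        score += math.log(p_ctx_given_word.get((c, word), floor))
    return score

# Toy tables; a real system would estimate these from a corpus.
p_word = {"river": 0.010, "piano": 0.008}
p_ctx = {("water", "river"): 0.2, ("water", "piano"): 0.001}
print(nb_relevance("river", ["water"], p_word, p_ctx))  # fits the context
print(nb_relevance("piano", ["water"], p_word, p_ctx))  # fits poorly
```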
2.1.2 Question generation in natural language processing
Question generation has received considerable attention from the natural language processing community through the question generation workshop and the shared task in 2010 (QGSTEC 2010; Rus et al., 2010). It is an important task in many different applications, including automated assessment, dialogue systems (Piwek, Prendinger, Hernault & Ishizuka, 2008), intelligent tutoring systems (Chen & Mostow, 2011), and search interfaces (Pasca, 2011). The aim of the task is to generate a series of questions based on raw text from sentences or paragraphs. The question types include why, who, when, where, what, which, how many/long and yes/no questions. Generally, the question generation task can be characterized as three components: content selection, identification of the question type, and question formulation. First, content selection identifies which part of the given text is worth turning into a question. Given that content, the question type is then determined. Finally, question formulation transforms the content into a question.
Many generation approaches to wh-questions have been developed, including template-based (Chen, Aist, & Mostow, 2009; Mostow & Chen, 2009), syntactic-based (Heilman & Smith, 2009; Heilman & Smith, 2010), semantic-based (Mannem, Prasad & Joshi, 2010; Yao, Bouma, & Zhang, 2012), and discourse-based approaches (Prasad & Joshi, 2008; Agarwal, Shah & Mannem, 2011). To identify the question type, both template-based and syntactic-based approaches focus on lexical information and the syntactic structure of a single sentence, which they transform into questions. Chen et al. (2009) enumerated words marking conditional context, temporal context and modality, such as “if”, “after”, and “will”, as criteria for selecting questioning indicators. Based on these indicators, they defined six specific rules to transform an informative sentence into questions, such as “What would happen if <x>?” for conditional context, “When would <x>?” for temporal context and “What <auxiliary-verb> <x>?” for linguistic modality. On the other hand, Heilman and Smith (2009) analyzed the structures of sentences and proposed general-purpose rules using part-of-speech (POS) tags and category labels. Their question generation, which derives simpler sentences from complex ones and transforms declarative sentences into questions, can produce more grammatical and readable questions rather than unnatural or senseless ones.
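A minimal sketch of such template-based transformation is given below. The two rules follow the example templates quoted above; the naive regular-expression matching is a simplification, since Chen et al. (2009) conditioned on richer context.

```python
import re

# (indicator pattern, question template) pairs for two of the contexts
# named above; the original work defined six such rules.
TEMPLATES = [
    (r"^if\s+(?P<x>.+?),", "What would happen if {x}?"),   # conditional
    (r"^after\s+(?P<x>.+?),", "When would {x}?"),          # temporal
]

def generate_questions(sentence):
    questions = []
    for pattern, template in TEMPLATES:
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            questions.append(template.format(x=match.group("x")))
    return questions

print(generate_questions("If the temperature rises, the ice will melt."))
# -> ['What would happen if the temperature rises?']
```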
Since inter-sentential causal relations can be identified by a semantic parser such as a semantic role labeler, semantic-based question generation approaches make use of the additional information from semantic role labeling along with the marked relations. Mannem, Prasad and Joshi (2010) used predicate-argument structures along with semantic roles to identify important aspects of paragraphs. For instance, the label “ARGM-CAU” can be treated as a cause-clause marker; when this marker is recognized, a corresponding question type, such as “why”, is generated. Similarly, the MrsQG system (Yao et al., 2012) transformed declarative sentences into Minimal Recursion Semantics (MRS; Copestake, Flickinger, Pollard, & Sag, 2005), a theory of semantic representation for natural language sentences, and then mapped the MRS representations of declarative sentences to interrogative sentences.
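The mapping from semantic-role labels to question types can be illustrated with a small sketch. The input format here, a flat list of (label, span) pairs, is a simplifying assumption; real semantic role labelers emit PropBank-style frames per predicate.

```python
# Cause, temporal and locative modifiers trigger why/when/where questions.
QUESTION_TYPE_BY_LABEL = {
    "ARGM-CAU": "why",
    "ARGM-TMP": "when",
    "ARGM-LOC": "where",
}

def question_types(srl_frame):
    """Return the question types licensed by the labeled arguments."""
    return [QUESTION_TYPE_BY_LABEL[label]
            for label, _span in srl_frame
            if label in QUESTION_TYPE_BY_LABEL]

frame = [("ARG0", "the river"), ("ARGM-CAU", "because of heavy rain")]
print(question_types(frame))  # -> ['why']
```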
Cross-sentence information, such as discourse relations, has been particularly influential in question generation in recent years. Prasad and Joshi (2008) first used causal relations in the Penn Discourse Treebank (PDTB; Prasad et al., 2008) as a content selection trigger. They found that PDTB causal relations could provide the source for 71% of the why-questions in their experimental setting. This demonstrated the potential of the PDTB and prompted subsequent research to follow this concept. Agarwal et al. (2011), for example, used explicit discourse connectives, such as “because”, “when”, “although” and “for example”, to select content for question formation, and constructed questions by disambiguating the sense of the discourse connectives, identifying the question type and applying syntactic transformations to the content.
These techniques from the field of question generation may facilitate the development of question generation for various question types. However, unlike the research directly aimed at generating questions for educational purposes, this line of work focuses only on generating questions from the given context; these related studies do not involve distractor selection.
2.1.3 The importance of the generated questions
While much work pertaining to computer-aided question generation has focused on the procedure of question generation and distractor selection, little work has analyzed the quality of the generated questions. Heilman and Smith (2010) used linguistic features, such as the number of tokens or noun phrases in the question, source sentence and answer phrase, the score from an n-gram language model, and the presence of questioning words or negative words, to statistically rank the quality of generated questions. Agarwal and Mannem (2011) considered lexical and syntactic features, such as the similarity between a sentence and the title of a given text and the presence of abbreviations, discourse connectives and superlative adjectives, to select the most informative sentences from a document and generate questions from them. Chali and Hasan (2012) argued that questions associated with the sub-topics of the given content should be generated first; they used Latent Dirichlet Allocation (LDA) to identify the sub-topics closely related to the original topic, applied the Extended String Subsequence Kernel (ESSK) to calculate the sub-topics' similarity to the questions, and computed the syntactic correctness of the questions with a tree kernel. Although these output questions were improved by considering linguistic features, these studies still did not take examinees into consideration.
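To make the feature-based ranking concrete, the sketch below extracts surface features of the kind listed above for a question/source pair. Here `lm_score` is a hypothetical stand-in for a real n-gram language model; a statistical ranker (e.g. logistic regression) would be trained on such feature vectors against human quality judgments.

```python
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}
NEGATIVE_WORDS = {"not", "no", "never", "none"}

def question_features(question, source_sentence, lm_score=lambda s: 0.0):
    """Surface features for ranking question quality (illustrative set)."""
    q_tokens = question.lower().split()
    return {
        "num_tokens_question": len(q_tokens),
        "num_tokens_source": len(source_sentence.split()),
        "lm_score": lm_score(question),          # n-gram LM fluency proxy
        "has_wh_word": any(t in WH_WORDS for t in q_tokens),
        "has_negation": any(t in NEGATIVE_WORDS for t in q_tokens),
    }

print(question_features("What did the cat chase?", "The cat chased the mouse."))
```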
2.2 Personalization
2.2.1 Reading difficulty estimation
Reading difficulty (also called readability) is often used to estimate the reading level of a document, so that readers can choose material appropriate for their skill level. Heilman, Collins-Thompson, Callan and Eskenazi (2007) described reading difficulty as a function mapping a document to a numerical value corresponding to a difficulty or grade level. A list of features extracted from the document usually acts as the input to this function, while the output is one of the ordered difficulty grade levels corresponding to a reader's reading skill.
Early work on estimating reading difficulty used only a few simple features to measure lexical complexity, such as word frequency or the number of syllables per word. Because they took few features into account, most studies made assumptions about which variables affect readability, and then based their difficulty metrics on these assumptions. One example is the Dale-Chall model (Dale & Chall, 1948), which determined a list of 3,000 commonly known words and then used the percentage of words outside this list as a measure of difficulty. Another is the Lexile framework (Stenner, 1996), which used the mean log word frequency as a feature to measure lexical complexity. Using word frequency to measure lexical difficulty assumes that a more frequent word is easier for readers. Although this assumption seems fair, since a widely used word has a higher probability of being seen and absorbed by readers, it does not always hold given the numerous differences in the words acquired by different language learners. This method is also susceptible to the diverse word frequency rates found in various corpora.
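The two classic lexical measures discussed above are easy to state compactly. In the sketch below, `FAMILIAR_WORDS` and `CORPUS_FREQ` are tiny toy stand-ins for the Dale-Chall 3,000-word list and a large corpus frequency table.

```python
import math

FAMILIAR_WORDS = {"the", "cat", "sat", "on", "mat"}
CORPUS_FREQ = {"the": 10000, "cat": 120, "sat": 80, "on": 5000, "mat": 40}

def pct_unfamiliar(tokens):
    """Dale-Chall style: share of words outside the familiar-word list."""
    return sum(t not in FAMILIAR_WORDS for t in tokens) / len(tokens)

def mean_log_freq(tokens, unseen=1):
    """Lexile style: mean log corpus frequency of the document's words."""
    return sum(math.log(CORPUS_FREQ.get(t, unseen)) for t in tokens) / len(tokens)

tokens = "the cat sat on the ancient mat".split()
print(pct_unfamiliar(tokens))   # 'ancient' is the only unfamiliar word
print(mean_log_freq(tokens))    # rarer words pull the mean down
```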
More recent approaches have started to take n-gram language models into consideration to assess lexical complexity, which can measure difficulty more accurately. Collins-Thompson and Callan (2004) used smoothed unigram language models to measure the lexical difficulty of a given document. They trained a language model for each readability level, and then calculated likelihood ratios to assign a level of difficulty; in other words, the predicted value is the level with the highest likelihood ratio for the document. Similarly, Schwarm and Ostendorf (2005) also utilized statistical language models to classify documents by reading difficulty level, and found that trigram models are more accurate than bigram and unigram ones.
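The per-level language-model approach can be sketched as follows. For simplicity, the sketch assigns the level whose smoothed unigram model gives the document the highest log-likelihood, rather than the likelihood-ratio formulation of Collins-Thompson and Callan (2004); the per-level training texts are toy data.

```python
import math
from collections import Counter

def train_unigram(texts, alpha=1.0):
    """Laplace-smoothed unigram model trained on texts of one grade level."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total, vocab = sum(counts.values()), len(counts) + 1  # +1 for unseen
    return lambda w: (counts.get(w, 0) + alpha) / (total + alpha * vocab)

def predict_level(document, models):
    """Assign the level whose model maximizes the document log-likelihood."""
    words = document.lower().split()
    loglik = lambda m: sum(math.log(m(w)) for w in words)
    return max(models, key=lambda level: loglik(models[level]))

models = {
    "grade 3": train_unigram(["the cat sat on the mat"]),
    "grade 9": train_unigram(["the committee ratified the amendment"]),
}
print(predict_level("the cat sat", models))  # -> 'grade 3'
```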
In addition to using fairly basic measures to calculate lexical complexity, prior studies often estimated grammatical readability simply from the mean number of words per sentence. Using sentence length to measure grammatical difficulty assumes that a shorter sentence is syntactically simpler than a longer one. However, long sentences are not always more difficult than short ones. In response, more recent approaches have started to consider sentence structure when measuring grammatical complexity, taking advantage of increasingly accurate parsers. These studies usually consider more grammatical features, such as parse features per sentence, in order to make a more accurate difficulty prediction. Schwarm and Ostendorf (2005) employed four grammatical features derived from syntactic parsers to assess a document's readability: the average parse tree height, the average number of noun phrases, the average number of verb phrases, and the average number of subordinating conjunctions. Similarly, Heilman et al. (2008) used grammatical features extracted from automatic context-free grammar parse trees of sentences, and then computed the relative frequencies of partial syntactic derivations. In their model, the more frequent sub-trees are viewed as less difficult for readers. These approaches have investigated the effect of sentence structure; however, few studies have examined the grade distributions of grammar acquisition among language learners.
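A minimal sketch of such parse-based features follows; parsing itself is assumed to have been done already, and the bracketed trees are read with NLTK's Tree class. The feature set mirrors the one named above for Schwarm and Ostendorf (2005).

```python
from nltk import Tree  # only used to read bracketed parse trees

def parse_features(bracketed_parses):
    """Average parse-tree height and NP/VP counts per sentence."""
    trees = [Tree.fromstring(p) for p in bracketed_parses]
    avg = lambda f: sum(f(t) for t in trees) / len(trees)
    count = lambda t, label: sum(1 for s in t.subtrees() if s.label() == label)
    return {
        "avg_height": avg(lambda t: t.height()),
        "avg_noun_phrases": avg(lambda t: count(t, "NP")),
        "avg_verb_phrases": avg(lambda t: count(t, "VP")),
    }

print(parse_features(["(S (NP (DT the) (NN cat)) (VP (VBD sat)))"]))
```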
The majority of research on reading difficulty has focused on documents written for native (first-language) readers, and comparatively little work (Heilman et al., 2007) has been done on the difficulty of documents written for second-language learners. Second-language learners acquire a language in a distinctly different way from native speakers. As Bates (2003) pointed out, there are wide differences in the learning timelines and processing times of native and non-native readers: first-language learners learn all grammar rules before formal education, whereas second-language learners learn grammatical structures and vocabulary simultaneously and incrementally. Almost all first-language reading difficulty estimations focus on vocabulary features, while second-language reading difficulty estimations especially emphasize grammatical difficulty (Heilman et al., 2007). Wan, Li and Xiao (2010) found that college students in China still have difficulty reading English documents written for native readers, even though they have studied English over a long period of time. These studies indicate that it is unsuitable to apply first-language reading difficulty estimation directly; instead, second-language reading difficulty estimation must be developed.
2.2.2 Ability estimation
Item Response Theory (Embretson & Reise, 2000) is a modern theory of testing that examines the relationship between an examinee's responses to test items and the abilities measured by those items. One interesting characteristic of Item Response Theory is that the ability parameter and item parameters are invariant, whereas the corresponding parameters in Classical Test Theory (CTT) vary by sample (Crocker & Algina, 1986). Three well-known ability estimation methods in Item Response Theory are maximum likelihood estimation (MLE), maximum a posteriori (MAP) and expected a posteriori (EAP). MLE is an iterative procedure that finds the ability maximizing the likelihood of an examinee's responses to the items. However, when an examinee answers all items correctly or all incorrectly, the likelihood has no finite maximum, and MLE fails to converge to a point during the estimation iterations (Hambleton & Swaminathan, 1985). One possible solution to this problem is to use MAP (Baker, 1993) or EAP (Bock & Mislevy, 1982), which are variants of Bayes Modal Estimation (BME) and incorporate prior information into the likelihood function. Prior distributions can protect against outliers that may negatively influence ability estimation. For example, Barla et al. (2010) employed EAP to score each examinee's ability for each test.
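To make the Bayesian estimation concrete, the following is a minimal sketch of EAP under a two-parameter logistic (2PL) model, evaluated on a fixed quadrature grid with a standard-normal prior; the item parameters and responses are toy values.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap(responses, items, lo=-4.0, hi=4.0, n=81):
    """Posterior mean of ability over a grid, with an N(0,1) prior."""
    step = (hi - lo) / (n - 1)
    num = den = 0.0
    for k in range(n):
        theta = lo + k * step
        post = math.exp(-0.5 * theta * theta)  # prior (constant dropped)
        for u, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            post *= p if u == 1 else 1.0 - p   # multiply in the likelihood
        num += theta * post
        den += post
    return num / den

items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]  # (discrimination, difficulty)
print(eap([1, 1, 0], items))                   # estimated ability
```

Because of the prior, the posterior mean remains finite even for all-correct or all-incorrect response patterns, which is precisely the case in which MLE fails to converge.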
Even though Item Response Theory has been used for decades, its estimation procedure is computation-intensive. Only recently, with the rapid development of the computer industry, has Item Response Theory been increasingly used in e-learning applications, and then only as an offline service. It has so far seen little application in Web-based learning environments, which is unfortunate because real-time, online assessment would be more desirable. Fortunately, Lee (2012) proposed an alternative computational approach in which a Gaussian fit to the posterior distribution of the estimated ability can efficiently approximate the estimate determined by the conventional BME approach.
In a Web-based learning environment, Computerized Adaptive Testing usually serves as one component, providing learners with a combination of practice and measurement. However, Klinkenberg et al. (2011) noted that Item Response Theory was designed for measurement only, because item parameters must be calibrated before the items are used in a test. Generally, during item calibration an item should be taken by a large number of people, ideally between 200 and 1,000, in order to estimate reliable parameters (Wainer & Mislevy, 1990; Huang, 1996). This procedure is costly and time-consuming, and of little benefit to learning environments. It is especially impractical because calibration has to be conducted repeatedly in order to obtain accurate norm-referenced item parameters. Alternatively, Klinkenberg et al. (2011) introduced a new ability estimation based on Elo's (1978) rating system and an explicit scoring rule. Elo's rating system was developed for chess competitions and is used to estimate the relative ability of a player. With this method, pre-calibration is no longer required: the ability parameter is updated according to the weighted difference between the observed response and the expected response. This method was employed in a Web-based monitoring system, called a computerized adaptive practice (CAP) system, designed to monitor arithmetic in primary education.
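The following is a minimal sketch of an Elo-style update of this kind; the step size and the symmetric item update are illustrative assumptions rather than the exact CAP formulation, which additionally weights the update (e.g., by response time and rating uncertainty).

```python
import math

def expected_correct(ability, difficulty):
    """Logistic expectation of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def elo_update(ability, difficulty, correct, k=0.4):
    """Move the ability by the weighted difference between the observed
    response (1 or 0) and the expected response; the item rating moves
    in the opposite direction, as in chess Elo."""
    delta = k * ((1.0 if correct else 0.0) - expected_correct(ability, difficulty))
    return ability + delta, difficulty - delta

theta, beta = 0.0, 0.2            # learner ability, item difficulty
theta, beta = elo_update(theta, beta, correct=True)
print(theta, beta)                # ability rises, item difficulty falls
```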
Although much work has been done thus far, some problems have attracted little attention. First, although every exercise a student performs is recorded in most of the Web-based learning environments listed above, the ability estimations of Item Response Theory only consider the responses given at the time of testing, rather than incorporating the testing history. Moreover, the result of estimating an examinee's ability is usually a norm-referenced value, interpreted in most ability estimations as a number or a label. For example, a student with a specific ability, such as level six, has a large proportion of knowledge in common with other students at grade level six. Unfortunately, as this definition is qualitative rather than quantitative, this approach cannot provide a quantitative account of a student's understanding.