
In this chapter, the background of question generation is presented, including computer-aided question generation for educational purposes and question generation in natural language processing. Next, related work on reading difficulty estimation is introduced. Finally, a modern theory of testing, Item Response Theory, is discussed.

2.1 Question generation

2.1.1 Computer-aided question generation for language learning

Computer-aided question generation is the task of automatically generating questions, each consisting of a stem, a correct answer and distractors, from a given text. These generated questions can serve as an efficient tool for measurement and diagnostics. The first computer-aided question generation system was proposed by Mitkov and Ha (2003). Multiple-choice questions are automatically generated by three components: term extraction, distractor selection and question generation. First, noun phrases are extracted as answer candidates and ranked by term extraction; the more frequently a term appears, the more important it is considered, so terms with higher term frequency serve as answers to the generated questions. Next, WordNet (Miller, Beckwith, Fellbaum, Gross, & Miller, 1990) is consulted during distractor selection in order to capture the semantic relation between each incorrect choice and the correct answer. Finally, the generated questions are formed with predefined syntactic templates. Most of the subsequent studies are based on this system architecture.
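
As a rough illustration of this three-component architecture, the sketch below ranks candidate answer terms by frequency, picks distractors from a small hand-made relatedness table standing in for WordNet, and fills a predefined cloze template. The function names, the toy relatedness table and the template are illustrative assumptions, not Mitkov and Ha's actual implementation.

```python
from collections import Counter
import re

# Toy stand-in for WordNet-based distractor selection: for each answer term,
# list semantically related terms that could serve as plausible distractors.
RELATED_TERMS = {
    "photosynthesis": ["respiration", "fermentation", "transpiration"],
}

def select_answers(text, candidate_terms, top_k=1):
    """Rank candidate noun phrases by raw frequency in the text (term extraction)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    ranked = sorted(candidate_terms, key=lambda t: counts[t], reverse=True)
    return ranked[:top_k]

def generate_item(sentence, answer):
    """Form a cloze-style stem from a predefined template and attach distractors."""
    stem = re.sub(re.escape(answer), "_____", sentence, flags=re.IGNORECASE)
    distractors = RELATED_TERMS.get(answer, [])[:3]
    return {"stem": stem, "answer": answer, "distractors": distractors}

if __name__ == "__main__":
    text = ("Photosynthesis converts light energy into chemical energy. "
            "Photosynthesis occurs in the chloroplasts of plant cells.")
    answer = select_answers(text, ["photosynthesis", "chloroplasts"])[0]
    print(generate_item("Photosynthesis occurs in the chloroplasts of plant cells.", answer))
```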

A growing number of studies now shed light on the domain of English language learning, covering vocabulary, grammar and comprehension. This is because these question generation systems analyze linguistic characteristics to help produce items, much as human experts do. In vocabulary assessment, Liu et al. (2005) investigated word sense disambiguation to generate vocabulary questions targeting a specific word sense, and considered the first-language background knowledge of test-takers when selecting distractors. Lin et al. (2007) analyzed the semantics of words and developed an algorithm that selects substitute-word candidates from WordNet (Miller et al., 1990) and filters them by web corpus searching. They presented adjective-noun pair questions, including collocation, antonym, synonym and similar-word questions, in order to test students' semantic understanding. Turney (2003) used a standard supervised machine learning approach with feature vectors based on the frequencies of patterns in a large corpus to automatically recognize analogies, synonyms, antonyms, and associations between words, and then transformed those word pairs into multiple-choice SAT (Scholastic Assessment Test) analogy questions, TOEFL synonym questions and ESL (English as a second language) synonym-antonym questions.

In grammar assessment, Chen et al. (2005) focused on automatic grammar quiz generation. Their FAST system analyzed items from the TOEFL test and collected documents from Wikipedia to generate grammar questions using a part-of-speech tagger and predefined templates. Lee and Seneff (2007) specifically discussed algorithms for generating questions about prepositions in language learning. They proposed two novel distractor selection methods: one applies a collocation-based approach, and the other exploits deletion errors in a non-native corpus.

In reading comprehension assessment, the MARCT system (Yang et al., 2005) designed three question types: true-false questions, numerical information questions and not-in-the-list questions. For true-false question generation, they replaced words in a sentence, extracted from an article on the Internet, with synonyms or antonyms by using WordNet (Miller et al., 1990). For numerical information question generation, they listed specific trigger words, such as "kilogram", "square foot", and "foot", corresponding to predefined templates, such as "what is the weight of", "how large", and "how tall". For not-in-the-list question generation, they used terms listed in Google Sets to identify the question type and select distractors. Unlike previous methods, Mostow and Jang (2012) designed different types of distractors to diagnose the cause of comprehension failure, including ungrammatical, nonsensical, and plausible distractors. In particular, the plausible distractors took the context of the reading materials into account: a Naïve Bayes formula was used to score relevance to the context of the paragraph and to words earlier in the sentence. A student's comprehension is thus judged not only by evaluating vocabulary knowledge but also by testing the ability to decide which word is consistent with the surrounding context.

2.1.2 Question generation in natural language processing

Question generation has been a primary concern of the natural language processing community, notably through the question generation workshop and shared task in 2010 (QGSTEC 2010; Rus et al., 2010). It is an important task in many different applications, including automated assessment, dialogue systems (Piwek, Prendinger, Hernault & Ishizuka, 2008), intelligent tutoring systems (Chen & Mostow, 2011), and search interfaces (Pasca, 2011). The aim of the task is to generate a series of questions from raw text, at the sentence or paragraph level. The question types include why, who, when, where, what, which, how many/long and yes/no questions. Generally, the question generation task can be characterized by three components: content selection, identification of the question type, and question formulation. First, content selection identifies which part of the given text is worth turning into a question. Given the selected content, the second component determines the question type. Finally, question formulation transforms the content into a question.

Many approaches to generating wh-questions have been developed, including template-based (Chen, Aist, & Mostow, 2009; Mostow & Chen, 2009), syntactic-based (Heilman & Smith, 2009; Heilman & Smith, 2010), semantic-based (Mannem, Prasad & Joshi, 2010; Yao, Bouma, & Zhang, 2012), and discourse-based approaches (Prasad and Joshi, 2008; Agarwal, Shah & Mannem, 2011). To identify the question type, both template-based and syntactic-based approaches focus on the lexical information and syntactic structure of a single sentence and transform it into questions. Chen et al. (2009) enumerated words marking conditional context, temporal context and modality, such as "if", "after", and "will", as criteria for selecting questioning indicators. Based on these indicators, they defined six specific rules to transform an informative sentence into questions, such as "What would happen if <x>?" for conditional context, "When would <x>?" for temporal context and "What <auxiliary-verb> <x>?" for linguistic modality. On the other hand, Heilman and Smith (2009) analyzed the structures of sentences and proposed general-purpose rules using part-of-speech (POS) tags and syntactic category labels. Their question generation, which derives simpler sentences from complex ones and then transforms the declarative sentences into questions, produces questions that are more grammatical and readable rather than unnatural or senseless.
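
A minimal sketch of the template-based idea described above is given below: an indicator word triggers a question template that is filled with the remainder of the sentence. The indicator-to-template mapping and the crude clause splitting are simplified assumptions for illustration; Chen et al. (2009) define six richer rules over parsed text.

```python
import re

# Simplified mapping from questioning indicators to question templates,
# loosely modeled on the conditional/temporal/modality rules described above.
INDICATOR_TEMPLATES = {
    "if": "What would happen if {clause}?",
    "after": "When would {clause}?",
    "will": "What will {clause}?",
}

def generate_from_indicators(sentence):
    """Return template-based questions for each indicator found in the sentence."""
    questions = []
    for indicator, template in INDICATOR_TEMPLATES.items():
        match = re.search(rf"\b{indicator}\b\s+(.+)", sentence, flags=re.IGNORECASE)
        if match:
            clause = match.group(1).rstrip(".")
            questions.append(template.format(clause=clause))
    return questions

if __name__ == "__main__":
    print(generate_from_indicators("The ice will melt if the temperature rises."))
    # ['What would happen if the temperature rises?', 'What will melt if the temperature rises?']
```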

Since inter-sentential causal relations can also be identified by a semantic parser, such as a semantic role labeler, semantic-based question generation approaches make use of the additional information from semantic role labels along with the marked relations. Mannem, Prasad and Joshi (2010) used predicate-argument structures along with semantic roles to identify important aspects of paragraphs. For instance, the label "ARGM-CAU" can be treated as a cause-clause marker; when this marker is recognized, a corresponding question type, such as "why", is generated. Similarly, using semantic information, the MrsQG system (Yao et al., 2012) transformed declarative sentences into Minimal Recursion Semantics (MRS; Copestake, Flickinger, Pollard, & Sag, 2005), a theory of semantic representation of natural language sentences, and the MRS representations of declarative sentences were then mapped to interrogative sentences.
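
The sketch below illustrates the semantic-role-based selection of a question type: given a predicate-argument structure (here a hypothetical, hand-written SRL frame rather than the output of a real labeler), a modifier label such as ARGM-CAU triggers a why-question. The label-to-type table and the frame format are assumptions for illustration only.

```python
# Hypothetical SRL output: a lemmatized predicate plus labeled arguments,
# hand-written here instead of being produced by an actual semantic role labeler.
frame = {
    "predicate": "close",
    "arguments": {
        "ARG0": "the city",
        "ARG1": "the schools",
        "ARGM-CAU": "because of the storm",
    },
}

# Map semantic-role modifier labels to question types (illustrative subset).
LABEL_TO_QTYPE = {
    "ARGM-CAU": "why",
    "ARGM-TMP": "when",
    "ARGM-LOC": "where",
}

def questions_from_frame(frame):
    """Generate one question per modifier label whose question type is known."""
    args = frame["arguments"]
    core = f'{args.get("ARG0", "")} {frame["predicate"]} {args.get("ARG1", "")}'.strip()
    return [f"{LABEL_TO_QTYPE[label].capitalize()} did {core}?"
            for label in args if label in LABEL_TO_QTYPE]

print(questions_from_frame(frame))  # ['Why did the city close the schools?']
```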

Cross-sentence information, such as discourse relations, has been particularly influential in question generation in recent years. Prasad and Joshi (2008) first used causal relations in the Penn Discourse Treebank (PDTB; Prasad et al., 2008) as a content selection trigger. They found that PDTB causal relations could provide the source for 71% of the why-questions in their experimental setting. This demonstrated the potential of the PDTB and motivated subsequent research to follow this approach. For example, Agarwal et al. (2011) used explicit discourse connectives, such as "because", "when", "although" and "for example", to select content for question formation, and constructed questions by disambiguating the sense of the discourse connectives, identifying the question type and applying syntactic transformations to the content.
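
As a rough sketch of connective-triggered content selection, the code below scans sentences for explicit discourse connectives and pairs the selected content with a question type. The connective-to-type table is a simplified assumption; Agarwal et al. (2011) additionally disambiguate the connective sense and apply syntactic transformations.

```python
import re

# Illustrative mapping from explicit discourse connectives to question types.
CONNECTIVE_TO_QTYPE = {
    "because": "why",
    "when": "when",
    "although": "yes/no",
    "for example": "give an example",
}

def select_content(sentences):
    """Keep sentences containing an explicit connective and tag them with a question type."""
    selected = []
    for sent in sentences:
        for connective, qtype in CONNECTIVE_TO_QTYPE.items():
            if re.search(rf"\b{re.escape(connective)}\b", sent, flags=re.IGNORECASE):
                selected.append({"sentence": sent, "connective": connective, "qtype": qtype})
    return selected

print(select_content([
    "The match was cancelled because of heavy rain.",
    "The results were mixed.",
]))
```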

These techniques from the field of question generation may facilitate the development of question generation for various question types. However, unlike the research directly concerned with generating questions for educational purposes, these studies focused only on generating questions from the given content and did not address distractor selection.

2.1.3 The importance of the generated questions

While much work on computer-aided question generation has focused on the procedure of question generation and distractor selection, little work has analyzed the quality of the generated questions. One line of work used linguistic features, such as the number of tokens or noun phrases in the question, source sentence and answer phrase, the score from an n-gram language model, and the presence of questioning words or negative words, to statistically rank the quality of generated questions. Agarwal and Mannem (2011) considered lexical and syntactic features, such as the similarity between a sentence and the title of a given text and the presence of abbreviations, discourse connectives and superlative adjectives, to select the most informative sentences from a document and generated questions from them. Chali and Hasan (2012) argued that questions associated with the important topics of a text should be generated first, so they used Latent Dirichlet Allocation (LDA) to identify the sub-topics in the given content that are closely related to the original topic, then applied the Extended String Subsequence Kernel (ESSK) to calculate their similarity with the questions, and computed the syntactic correctness of the questions with a tree kernel. Although the output questions were improved by considering linguistic features, these studies still did not take examinees into consideration.
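
To make the feature-based ranking idea concrete, the sketch below extracts a few surface features of the kind listed above (token counts, presence of a wh-word or a negation) and combines them with hand-set weights into a quality score. The feature set and weights are illustrative assumptions; the published systems learn such weights from annotated data.

```python
import re

WH_WORDS = {"what", "who", "when", "where", "why", "which", "how"}
NEGATIONS = {"not", "no", "never"}

def features(question, source_sentence):
    """Extract a few surface features of a generated question and its source sentence."""
    q_tokens = re.findall(r"\w+", question.lower())
    s_tokens = re.findall(r"\w+", source_sentence.lower())
    return {
        "q_len": len(q_tokens),
        "s_len": len(s_tokens),
        "has_wh": int(bool(WH_WORDS & set(q_tokens))),
        "has_neg": int(bool(NEGATIONS & set(q_tokens))),
    }

# Hand-set weights standing in for a learned ranking model.
WEIGHTS = {"q_len": -0.05, "s_len": -0.02, "has_wh": 1.0, "has_neg": -0.5}

def score(question, source_sentence):
    f = features(question, source_sentence)
    return sum(WEIGHTS[name] * value for name, value in f.items())

candidates = [
    ("What converts light energy into chemical energy?",
     "Photosynthesis converts light energy into chemical energy."),
    ("Photosynthesis does not occur in animal cells?",
     "Photosynthesis occurs in the chloroplasts of plant cells."),
]
ranked = sorted(candidates, key=lambda qs: score(*qs), reverse=True)
print(ranked[0][0])
```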

2.2 Personalization

2.2.1 Reading difficulty estimation

Reading difficulty (also called readability) is often used to estimate the reading level of a document, so that readers can choose material appropriate to their skill level. Heilman, Collins-Thompson, Callan and Eskenazi (2007) described reading difficulty as a function mapping a document to a numerical value corresponding to a difficulty or grade level. A list of features extracted from the document usually acts as the input of this function, while the output is one of the ordered difficulty grade levels corresponding to a reader's reading skill.

Early work on estimating reading difficulty used only a few simple features to measure lexical complexity, such as word frequency or the number of syllables per word. Because they took few features into account, most studies made assumptions about which variables affect readability, and then based their difficulty metrics on

these assumptions. One example is the Dale-Chall model (Dale and Chall, 1948), which determined a list of 3,000 commonly known words and then used the percentage of words outside this list as a measure of difficulty. Another example is the Lexile measure (Stenner, 1996), which used the mean log word frequency as a feature to measure lexical complexity.

Using word frequency to measure lexical difficulty assumes that a more frequent word is easier for readers. Although this assumption seems fair, since a widely used word is more likely to be seen and absorbed by readers, it does not always hold, because different language learners acquire very different vocabularies. The method is also susceptible to the diverse word frequency rates found in different corpora.
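
A minimal sketch of these two lexical metrics, under the assumption that a known-word list and a background word-frequency table are available, is shown below; the tiny word list and frequency table are placeholders, not the actual Dale-Chall list or a real corpus.

```python
import math
import re

# Placeholder resources: a tiny "known word" list and background corpus frequencies.
KNOWN_WORDS = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran"}
CORPUS_FREQ = {"the": 60000, "cat": 900, "sat": 400, "on": 30000,
               "mat": 120, "photosynthesis": 15}

def pct_unfamiliar(text):
    """Dale-Chall-style metric: share of tokens not in the known-word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    unfamiliar = [t for t in tokens if t not in KNOWN_WORDS]
    return len(unfamiliar) / len(tokens)

def mean_log_frequency(text, floor=1):
    """Mean log word frequency, with unseen words backed off to a small floor count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(math.log(CORPUS_FREQ.get(t, floor)) for t in tokens) / len(tokens)

text = "The cat sat on the mat during photosynthesis."
print(pct_unfamiliar(text), mean_log_frequency(text))
```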

More recent approaches have taken n-gram language models into consideration to assess lexical complexity, which allows difficulty to be measured more accurately. Collins-Thompson and Callan (2004) used smoothed unigram language models to measure the lexical difficulty of a given document. They trained a language model for each readability level and then calculated likelihood ratios to assign a difficulty level to each document; in other words, the predicted value is the level whose model gives the document the highest likelihood ratio. Similarly, Schwarm and Ostendorf (2005) also utilized statistical language models to classify documents by reading difficulty level, and they found that trigram models are more accurate than bigram and unigram ones.
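
The sketch below illustrates the model-per-level idea under simple add-one smoothing: one unigram model is estimated per grade level from level-labeled training text, and a new document is assigned the level whose model gives it the highest log-likelihood. The toy training data and the smoothing choice are assumptions for illustration, not the smoothing scheme used by Collins-Thompson and Callan (2004).

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class UnigramLevelClassifier:
    """One add-one-smoothed unigram model per readability level."""

    def __init__(self, training_texts_by_level):
        self.models = {}
        vocab = set()
        for level, texts in training_texts_by_level.items():
            counts = Counter(t for text in texts for t in tokenize(text))
            self.models[level] = counts
            vocab.update(counts)
        self.vocab_size = len(vocab)

    def log_likelihood(self, level, tokens):
        counts = self.models[level]
        total = sum(counts.values())
        return sum(math.log((counts[t] + 1) / (total + self.vocab_size)) for t in tokens)

    def predict(self, text):
        tokens = tokenize(text)
        return max(self.models, key=lambda lvl: self.log_likelihood(lvl, tokens))

clf = UnigramLevelClassifier({
    1: ["the cat sat on the mat", "a dog ran to the park"],
    5: ["photosynthesis converts solar radiation into chemical energy"],
})
print(clf.predict("the dog sat on the mat"))  # expected: 1
```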

In addition to using fairly basic measures of lexical complexity, prior studies often estimated grammatical readability using only the mean number of words per sentence. Using sentence length to measure grammatical difficulty assumes that a shorter sentence is syntactically simpler than a longer one. However, long sentences are not always more difficult than short ones. In response, more recent approaches have started to consider the structure of sentences when measuring grammatical complexity, taking advantage of increasingly accurate parsers. These studies usually considered more grammatical features, such as parse features per sentence, in order to make a more accurate difficulty prediction. Schwarm and Ostendorf (2005) employed four grammatical features derived from syntactic parses, namely the average parse tree height, the average number of noun phrases, the average number of verb phrases, and the average number of subordinating conjunctions, to assess a document's readability. Similarly, Heilman et al. (2008) used grammatical features extracted from automatic context-free grammar parse trees of sentences and computed the relative frequencies of partial syntactic derivations; in their model, more frequent sub-trees are viewed as less difficult for readers. These approaches have investigated the effect of sentence structure; however, few studies have examined how language learners' grammar acquisition varies across grade levels.
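
The sketch below computes parse-derived features of the kind just listed (tree height and counts of NP, VP and SBAR nodes) from a bracketed parse using `nltk.Tree`; the hard-coded parse string stands in for the output of an actual parser, and the choice of SBAR as a proxy for subordinate structure is an assumption.

```python
from nltk import Tree

# Hard-coded bracketed parse standing in for the output of a syntactic parser.
parse = Tree.fromstring(
    "(S (NP (DT The) (NN cat)) "
    "(VP (VBD slept) (SBAR (IN because) (S (NP (PRP it)) (VP (VBD was) (ADJP (JJ tired)))))) (. .))"
)

def parse_features(tree):
    """Per-sentence grammatical features: tree height and phrase-label counts."""
    labels = [t.label() for t in tree.subtrees()]
    return {
        "height": tree.height(),
        "num_NP": labels.count("NP"),
        "num_VP": labels.count("VP"),
        "num_SBAR": labels.count("SBAR"),
    }

print(parse_features(parse))
# {'height': 8, 'num_NP': 2, 'num_VP': 2, 'num_SBAR': 1}
```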

The majority of research on reading difficulty has focused on documents written for native (first language) readers, and comparatively little work (Heilman et al., 2007) has been done on the difficulty of documents written for second language learners. Second language learners acquire a language in a way distinct from native speakers. As Bates (2003) pointed out, there are wide differences in learning timelines and processing times between native and non-native readers: first language learners acquire grammar rules before formal education, whereas second language learners learn grammatical structures and vocabulary simultaneously and incrementally. Almost all first-language reading difficulty estimations focus on vocabulary features, while second-language reading difficulty estimations particularly emphasize grammatical difficulty (Heilman et al., 2007). Wan, Li and Xiao (2010) found that college students in China still have difficulty reading English documents written for native readers, even though they have studied English over a long period of time. These studies indicate that it is unsuitable to apply a first-language reading difficulty estimation directly; instead, second-language reading difficulty estimation must be developed.

2.2.2 Ability estimation

Item Response Theory (Embretson & Reise, 2000) is a modern theory of testing that examines the relationship between an examinee's responses to test items and the abilities measured by those items. One interesting characteristic of Item Response Theory is that the ability parameter and item parameters are invariant, whereas these parameters in Classical Test Theory (CTT) vary by sample (Crocker & Algina, 1986). Three well-known ability estimation methods proposed under Item Response Theory are maximum likelihood estimation (MLE), maximum a posteriori (MAP) and expected a posteriori (EAP). MLE is an iterative procedure that finds the ability maximizing the likelihood of the examinee's responses to the items. However, when an examinee answers all items correctly or all incorrectly, the estimate fails to converge to a finite point during the estimation iterations (Hambleton & Swaminathan, 1985). One possible solution to this problem involves using MAP (Baker, 1993), also known as Bayes Modal Estimation (BME), or EAP (Bock & Mislevy, 1982), both of which incorporate prior information into the likelihood function. Prior distributions can protect against outliers that may have a negative influence on ability estimation. For example, Barla et al. (2010) employed EAP to score each examinee's ability on each test.
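
To make the Bayesian estimation concrete, the sketch below computes an EAP ability estimate for a two-parameter logistic (2PL) model on a grid of ability values with a standard normal prior. The item parameters and responses are made-up examples, and the 2PL choice and grid approximation are assumptions for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def likelihood(theta, items, responses):
    """Likelihood of the observed response pattern at ability theta."""
    value = 1.0
    for (a, b), u in zip(items, responses):
        p = p_correct(theta, a, b)
        value *= p if u == 1 else (1.0 - p)
    return value

def eap_estimate(items, responses, grid_step=0.05):
    """Expected a posteriori ability under a standard normal prior (grid approximation)."""
    grid = [-4.0 + grid_step * i for i in range(int(8.0 / grid_step) + 1)]
    prior = [math.exp(-0.5 * t * t) for t in grid]          # unnormalized N(0, 1)
    post = [likelihood(t, items, responses) * w for t, w in zip(grid, prior)]
    total = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / total

# Made-up item parameters (discrimination a, difficulty b) and a response pattern.
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7)]
responses = [1, 1, 0]
print(round(eap_estimate(items, responses), 3))
```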

Even though Item Response Theory has been used for decades, its estimation procedure is computation-intensive. Only recently, with the rapid development of the computer industry, has Item Response Theory been increasingly used in e-learning applications, typically as an offline service. It has so far had little application in Web-based learning environments, which is unfortunate because real-time, online assessment would be more desirable. Fortunately, Lee (2012) proposed an alternative computational approach in which a Gaussian fit to the posterior distribution of the estimated ability can approximate the estimate obtained by the conventional BME approach more efficiently.

In a Web-based learning environment, Computerized Adaptive Testing is usually one component of the environment, providing learners with a combination of practice and measurement. However, Klinkenberg et al. (2011) noted that Item Response Theory was designed for measurement only, because item parameters have to be pre-calibrated before the items are used in a test. Generally, during item calibration an item should be taken by a large number of people, ideally between 200 and 1,000, in order to estimate reliable item parameters (Wainer & Mislevy, 1990; Huang, 1996). This procedure is costly and time-consuming, and of little benefit to learning environments. It is especially impractical because the calibration has to be conducted repeatedly in order to obtain accurate norm-referenced item parameters. Alternatively, Klinkenberg et al. (2011) introduced a new ability estimation based on Elo's (1978) rating system and an explicit scoring rule. Elo's rating system was developed for chess competitions and is used to estimate the relative ability of a player. With this method, pre-calibration is no longer required, and the ability parameter is updated according to the weighted difference between the observed response and the expected response. This method was employed in a Web-based monitoring system, called a computerized adaptive practice (CAP) system, designed for monitoring arithmetic in primary education.
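
A minimal sketch of an Elo-style update of this kind is shown below: the expected response is a logistic function of the difference between the learner's ability and the item's difficulty, and both parameters are moved by the weighted difference between the observed and expected response. The logistic form, the symmetric item update and the step size K are illustrative assumptions, not the exact CAP scoring rule.

```python
import math

def expected_score(ability, difficulty):
    """Expected probability of a correct response, Elo/Rasch-style logistic."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def elo_update(ability, difficulty, correct, k=0.4):
    """Update ability (and item difficulty) by the weighted prediction error."""
    error = (1.0 if correct else 0.0) - expected_score(ability, difficulty)
    return ability + k * error, difficulty - k * error

ability, difficulty = 0.0, 0.5
for correct in [True, True, False, True]:     # a learner's response history
    ability, difficulty = elo_update(ability, difficulty, correct)
print(round(ability, 3), round(difficulty, 3))
```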

Although much work has been done thus far, some problems have attracted little attention. First, although every exercise performed by a student is recorded in most of the Web-based learning environments listed above, the ability estimations of Item Response Theory only consider the responses given at the time of testing, rather than incorporating the testing history. Moreover, the estimate of an examinee's ability is usually a norm-referenced value, which most ability estimations express merely as a number or a label. For example, assigning a student a specific ability, such as level six, means that his knowledge largely resembles that of other students at grade level six. Unfortunately, as this interpretation is qualitative rather than quantitative, such an approach cannot provide a quantitative account of a student's understanding.
