
System architecture

The system works as follows. First, the teacher specifies the key, or a list of keys, to be processed. Suppose that the teacher (the user) wishes to teach or test the adjective sunny as used to describe personality. She enters sunny into the system as her chosen key. The system then finds words with a lexical distribution similar to that of sunny, such as rainy and windy. It does this by establishing that these potential distractors (PDs) and the key all co-occur with some set of other words (key and PD collocates, KPDCs) such as weather and climate.

Next, the system looks in the corpus for a word which co-occurs with the key, but never with the PDs. This word is termed the key only collocate (KOC). In this example it could conceivably be personality, which co-occurs with sunny but with no other weather adjective. A sentence that includes the KOC personality along with the key sunny is then selected from the corpus. All that remains is to delete the key from the sentence, and to supply the key, the distractors and the sentence to the student in an appropriate format, as shown in the “Cloze generation system architecture” diagram.

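[Figure: Cloze generation system architecture. The key sunny is passed to the Thesaurus module, which returns the potential distractors foggy, windy and snowy. The Diffs module compares sunny personality with foggy personality, windy personality and snowy personality. The Concordance module supplies the carrier sentence “Xiao Ming has a sunny personality”. The Text processing module outputs the item “Xiao Ming has a ____ personality (a) foggy (b) windy (c) snowy (d) sunny”.]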

Thus, the carrier sentence, the key and the three incorrect answers (distractors) are returned by the system. Subsequently, in the interactive mode, the teacher would be asked if they were satisfied with the item, whether they wanted to generate a new item using the same key, or whether they were happy with the sentence but would like to create a new set of distractors.
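The final text-processing step can be illustrated with a minimal Python sketch. It is not part of the system described here, which is applied manually; the function name make_cloze_item, the seed parameter and the option-labelling scheme are assumptions made purely for illustration.

```python
import random
import re


def make_cloze_item(sentence, key, distractors, seed=None):
    """Blank out the key in the carrier sentence and label shuffled options.

    Hypothetical helper: returns the gapped sentence, the labelled options,
    and the index of the correct option.
    """
    # Replace the first whole-word occurrence of the key with a gap.
    pattern = rf"\b{re.escape(key)}\b"
    gapped, n_replaced = re.subn(pattern, "____", sentence, count=1)
    if n_replaced == 0:
        raise ValueError(f"key {key!r} not found in carrier sentence")

    # Mix the key in with the distractors so its position is not predictable.
    options = [key, *distractors]
    random.Random(seed).shuffle(options)

    labelled = [f"({chr(ord('a') + i)}) {word}" for i, word in enumerate(options)]
    return gapped, labelled, options.index(key)


if __name__ == "__main__":
    gapped, labelled, answer = make_cloze_item(
        "Xiao Ming has a sunny personality",
        "sunny",
        ["foggy", "windy", "snowy"],
    )
    print(gapped)
    print(" ".join(labelled))
    print("Answer:", labelled[answer])
```

In an interactive setting, regenerating an item with the same key would amount to calling the function again with a different carrier sentence or a different set of distractors.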

Here is an example of a cloze item actually generated by our system.

(1) They have an enviable ____ of blue-chip clients.

Ans: investment infrastructure asset portfolio

The learner is asked to complete the underscored gap with one of the four answers given. The reader will agree that only the (key) answer portfolio is possible, and that if any of the three distractors were inserted, the sentence would become meaningless.

In this work, we use the Sketch Engine (SkE) suite of corpus query tools described by Kilgarriff et al. (2004), together with the ukWaC web corpus, to which it provides access.

It needs to be made clear at this point that our system is not computationally implemented. The procedure for deriving the carrier sentences and distractors currently involves the manual implementation of rules which will be automated when we have the necessary time and resources available; we have taken care to set the system up in such a way that it can be readily programmed.

We now describe each step of the algorithm used for generating cloze items in detail.

Thesaurus Module

The Thesaurus module of SkE outputs words which typically occur in the same context as the search term. We show below the SkE Thesaurus output for portfolio (the key for the cloze item presented at (1) above). The screenshot reveals that most of the words with similar distribution to portfolio are in fact not synonyms or near synonyms: only collection and package qualify in that regard. A number of the words, as one might expect, have to do with business and the world of investment, with investment itself and asset ranking high on the list. The presence of the word curriculum on the list reflects the fact that the term portfolio is now widely used in the education domain.

The three top-ranking list members, investment, infrastructure and asset, are noted and retained for use as potential distractors (PDs).
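The idea of ranking words by distributional similarity and keeping the top three can be sketched as follows. This is a simplified stand-in, not the SkE Thesaurus itself, which computes its own similarity score: here similarity is assumed to be the Jaccard overlap between collocate sets, and the collocate data, function names and the weather example are hypothetical toy values.

```python
# Toy collocate sets (hypothetical data, for illustration only).
TOY_COLLOCATES = {
    "sunny": {"weather", "climate", "day", "personality"},
    "rainy": {"weather", "climate", "day", "season"},
    "windy": {"weather", "climate", "day"},
    "foggy": {"weather", "day", "morning"},
    "wooden": {"table", "chair"},
}


def rank_by_shared_collocates(key, collocates):
    """Rank every other word by Jaccard overlap with the key's collocates."""
    key_set = collocates[key]
    scores = []
    for word, word_set in collocates.items():
        if word == key:
            continue
        union = key_set | word_set
        score = len(key_set & word_set) / len(union) if union else 0.0
        scores.append((word, score))
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


def top_potential_distractors(key, collocates, n=3):
    """Keep the n best-scoring words that share at least one collocate."""
    ranked = rank_by_shared_collocates(key, collocates)
    return [word for word, score in ranked[:n] if score > 0]


if __name__ == "__main__":
    print(top_potential_distractors("sunny", TOY_COLLOCATES))
```

Words that share no collocates with the key, such as wooden in the toy data, never become potential distractors, which mirrors the requirement that PDs and key be found with a common set of collocates.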

SkE Thesaurus entry for portfolio

Sketch Differences Module

We next consult the Sketch Differences display. The screenshot below shows sketch differences for portfolio and investment, in contexts where either can occur in the ukWaC corpus. Notice how the display divides the output into grammatical relations between keyword and collocate. The screenshot shows us that portfolio occurs 34 times in a PP_IN relation with excess, while investment occurs in this collocation 25 times. Typical contexts are “… an investment/ a portfolio in excess of n million dollars”.

Part of Sketch Differences entry for portfolio and investment

Of course, we are interested in situations where the two words do not share a collocate, and for this we glance down at the “portfolio only” patterns. Alongside each collocating word, in the Sketch Differences screenshot, is shown the frequency of the collocation (an underlined integer) and the salience (an index of the number of times portfolio occurs with the collocating word, as opposed to other words, given to one decimal place).

We now search for the collocate with the highest salience that appears only with portfolio (and never with investment). We apply the condition that the collocate must be a correctly spelled English word and not a proper name. Thus, the non-alphabetic item with a salience of 10.6 is rejected, as is harrah, a proper name (salience 9.6). The third-ranking collocate by salience (8.8), diversified, is selected and marked as a potential Key Only Collocate (KOC).

We next consider the second PD, infrastructure. The potential KOC diversified also does not occur in ukWaC in collocation with this PD, so it remains a candidate.

However, when we move on to consider the third PD, asset, we find that diversified assets does indeed occur in the corpus. This means that asset cannot be used as a distractor for the key portfolio in the context diversified portfolio.

We therefore go on to consider the collocate appearing only with portfolio with the fourth highest salience: this turns out to be enviable. This time, we find that the potential KOC does not occur in collocation with any of the PDs, so it is adopted as KOC.
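The selection procedure just described can be written down as a short Python sketch. It is an illustration of the manual rules under stated assumptions, not an existing implementation: the function names are hypothetical, the string "###" is a stand-in for the non-alphabetic item (whose actual form is not reproduced here), the small vocabulary set stands in for a proper dictionary check, and the PD collocate sets contain only the collocations mentioned in the text.

```python
# Key-only collocates of "portfolio", already ordered by descending salience.
RANKED_KEY_ONLY = [
    "###",          # stand-in for the non-alphabetic item (salience 10.6)
    "harrah",       # proper name (salience 9.6)
    "diversified",  # salience 8.8
    "enviable",     # fourth-highest salience
]

# Collocates attested with each potential distractor (toy subset).
PD_COLLOCATES = {
    "investment": {"excess"},
    "infrastructure": set(),
    "asset": {"diversified"},
}

# Hypothetical stand-in for a dictionary of correctly spelled common words.
VOCABULARY = {"diversified", "enviable", "excess"}


def is_valid_collocate(word, vocabulary):
    """Accept only alphabetic strings listed in the vocabulary."""
    return word.isalpha() and word in vocabulary


def select_koc(ranked_candidates, pd_collocates, vocabulary):
    """Return the most salient valid collocate that no PD co-occurs with."""
    for candidate in ranked_candidates:
        if not is_valid_collocate(candidate, vocabulary):
            continue  # non-alphabetic item or proper name
        if any(candidate in attested for attested in pd_collocates.values()):
            continue  # at least one PD also takes this collocate
        return candidate
    return None  # no usable key only collocate was found


if __name__ == "__main__":
    print(select_koc(RANKED_KEY_ONLY, PD_COLLOCATES, VOCABULARY))
```

As in the walk-through, a candidate that collocates with any one of the PDs is dropped and the next most salient candidate is tried, rather than replacing the offending PD.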

So far, we have decided on the key, as well as the three distractors. We have also established that we wish our carrier sentence to include the collocation enviable portfolio. The next step is to determine what the carrier sentence will be: we do this by consulting a concordance.

Concordance Module

The SkE concordancing software is equipped with a feature called GDEX (Husak et al., forthcoming), which favours sentences that are between 10 and 25 words long and contain only common words, along with some other related constraints. GDEX sorts the order in which concordance sentences are presented, so that optimal sentences appear first. This means that the sentences which are most likely to be selected for dictionary examples or cloze exercises appear conveniently at the beginning of the concordance display.
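To make the two criteria named above concrete, the following Python sketch scores sentences by length and by the proportion of common words, and sorts them accordingly. It is emphatically not GDEX: the scoring formula, the function names and the common-word list are assumptions for illustration, and GDEX applies further constraints that are not modelled here.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and extract alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", sentence.lower())


def readability_score(sentence, common_words, min_len=10, max_len=25):
    """Score = 1 if the length is in range, plus the fraction of common words."""
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    length_bonus = 1.0 if min_len <= len(tokens) <= max_len else 0.0
    common_fraction = sum(t in common_words for t in tokens) / len(tokens)
    return length_bonus + common_fraction


def rank_sentences(sentences, common_words):
    """Sort sentences so that the best-scoring ones come first.

    Python's sort is stable, so equally scored sentences keep corpus order.
    """
    return sorted(
        sentences,
        key=lambda s: readability_score(s, common_words),
        reverse=True,
    )
```

Because the criteria are treated as preferences rather than hard filters, a short but otherwise clean sentence can still be returned when no better candidate exists.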

From the concordance output from which the screenshot below is taken, we may now extract the sentence shown at (1) above. Note that if the user is dissatisfied with the first sentence, for any reason, they can be prompted to select the second or a subsequent sentence.

Part of SkE concordance entry for portfolio and enviable

BNC cloze example

In our experiments, we also generated (2), this time from the British National Corpus.

Again, the correct answer is intended to be portfolio.

(2) Albert E Sharp Fund Managers have launched AES European unit trust, which seeks long-term capital growth from a diversified _____ of European Securities.

Ans: asset portfolio stock holding

Unlike ukWaC, the corpus used to generate (1), the BNC does not contain any examples of the adjective diversified modifying any of the PDs. However, the concept of a “diversified holding of European Securities” does seem quite plausible; given two apparent possible answers, it is unlikely that many teachers would find (2) an acceptable cloze exercise.

The way in which the BNC was compiled means that it consists mostly of clean text with relatively little noise, while ukWaC contains a fair amount of duplication and non-textual data. This might be taken as a compelling argument for preferring the BNC as a source corpus. However, the GDEX software does a good job of ensuring that the most meaningful sentences from a ukWaC concordance are presented first.

What is more, if we posit that certain collocations have a vanishingly small chance of occurring – and that is the claim that one makes when setting the distractors for a cloze exercise – we should be using the very largest corpus available. The larger the corpus, the more exhaustive the evidence; and the less likely the system will be to generate unwanted correct distractors, such as holding in (2) above.
