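Research and Development of Question Answering Technology (3/3): A Study of Question Answering Systems over Heterogeneous Information Sources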

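National Science Council, Executive Yuan
Research Project Final Report

Project type: Individual project
Project number: NSC93-2213-E-002-009-
Project period: August 1, 2004 to July 31, 2005
Host institution: Department of Computer Science and Information Engineering, National Taiwan University
Principal investigator: Hsin-Hsi Chen (陳信希)
Project participants: Chuan-Jie Lin (林川傑), 曾郁淳
Report type: Full report
Availability: This project may be publicly queried

October 17, 2005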

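National Science Council, Executive Yuan
Subsidized Research Project Final Report

Research and Development of Question Answering Technology

Project type: Individual project
Project number: NSC 93-2213-E-002-009-
Project period: August 1, 2004 to July 31, 2005
Principal investigator: Hsin-Hsi Chen (陳信希)
Co-principal investigator:
Project participants: Chuan-Jie Lin (林川傑), 曾郁淳
Report type (submitted as required by the approved budget list): Full report

This report includes the following required attachments:
□ One report on an overseas business trip or study visit
□ One report on a business trip or study visit to Mainland China
□ One report on attendance at an international academic conference, with one copy of each paper presented
□ One overseas research report for an international cooperative research project

Availability: May be publicly queried immediately, except for industry-academia cooperative research
projects, projects for upgrading industrial technology and personnel training, controlled projects,
and the cases listed below

Host institution: Department of Computer Science and Information Engineering, National Taiwan University

October 17, 2005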

Chapter 1 Lessons in TREC QA Tasks

1. Introduction

Question Answering (QA) has become a hot research topic in recent years because of the very large
virtual database available on the Internet. The goal of QA is to find the exact answer, which meets
the user's need more precisely, in a huge unstructured database. Traditional information retrieval
(IR) systems cannot solve this problem. On the one hand, users have to find the answers by themselves
in the documents returned by IR systems. On the other hand, an answer may appear in any document,
even in one that is not relevant to the question. In this chapter, we present some of our results in
the TREC QA evaluation series.

2. Description of Our System at TREC8

Two approaches can be considered: keyword matching and template extraction. Keyword matching
postulates that the answering text contains most of the keywords of the question; in other words, it
carries enough information relevant to the question. Template extraction is a form of information
extraction. The contents of documents are represented as templates. To answer a question, we have to
select an appropriate template, fill it, and finally offer the answer. The major difficulties of
this approach are finding templates for the general domain and deciding which template can be applied
to answer a question.

Some other techniques are also useful. For example, to answer the questions

"Who ..." and "When ...", the identification of named entities like person names and

In our preliminary study, we adopt a keyword-matching strategy and expand the keyword set selected
from the question sentence with synonyms and morphological forms. The details are presented below.

The system consists of three major steps: (1) preprocessing the question sentences, (2) retrieving
the documents containing answers, and (3) retrieving the sentences containing answers.

2.1 Preprocessing the Question Sentences

Our main strategy is keyword matching. This approach has a drawback: the words used in the question
sentences may differ from those used in the sentences containing the answers. For example, verbs can
appear in different tenses, and synonyms can be used. Therefore, we have to make the necessary
changes and expansions in the question sentences.

First, parts of speech are assigned to the words in the question sentences. Then stop words are
removed. The remaining words are transformed into their canonical forms and selected as the keywords
of the question sentences. For each keyword, we find all of its synonyms in WordNet 1.6. These terms
form an expansion set for the keyword. If the keyword is a noun, a verb, an adjective, or an adverb,
all possible morphological forms of the words in the expansion set are also added to the set. Here
the morphological forms are the plural of a noun, the different tenses of a verb, and the comparative
and superlative of an adjective or an adverb. They are formed as follows:

noun AAA: AAAs | AA[s,z,sh]es

verb BBB: BBBs | BB[s,z,sh]es + BBBed BBBing | BB[e]d BB[e]ing

adjective or adverb CCC: CCCer CCCest | CC(y)ier CC(y)iest
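The suffix rules above can be sketched in a few lines of Python. This is only an illustration of one reading of the rules (for instance, that "BB[e]d" means a final "e" is dropped before "ed" and "ing"); the function name `morphological_forms` and the part-of-speech labels are assumptions, not part of the original system, and the rules remain deliberately naive.

```python
def morphological_forms(word, pos):
    """Return naive inflected forms of `word`, following the suffix rules above.

    `pos` is one of "noun", "verb", "adjective", "adverb" (hypothetical labels).
    """
    sibilant = word.endswith(("s", "z", "sh"))
    forms = set()
    if pos in ("noun", "verb"):
        # Plural / third person singular: add "es" after s, z, sh, else "s".
        forms.add(word + ("es" if sibilant else "s"))
    if pos == "verb":
        # Past tense and participle: drop a final "e" before "ed" / "ing".
        stem = word[:-1] if word.endswith("e") else word
        forms.update({stem + "ed", stem + "ing"})
    if pos in ("adjective", "adverb"):
        # Comparative and superlative: a final "y" becomes "i".
        stem = word[:-1] + "i" if word.endswith("y") else word
        forms.update({stem + "er", stem + "est"})
    return forms


print(sorted(morphological_forms("wash", "verb")))
```

Irregular forms and consonant doubling are not covered by these rules, so a real system would need an exception list on top of such a sketch.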

2.2 Retrieving the Documents Containing Answers

We employ a full-text retrieval system to find the documents that may contain the answers. The
purpose is to reduce the number of documents in which we have to search for answering sentences.
Each keyword of a question sentence is assigned a weight. Words tagged as NNP or NNPS, which denote
proper nouns, are assigned higher weights because they should appear in the answer. The score of a
document is computed as follows:





$$\mathrm{score}(D) = \sum_{t \in T} \mathrm{weight}(t)$$

where $D$ is a document and $T = \{\, t \mid t \text{ is a keyword},\ X \text{ is its expansion set},\ \exists x \in X : x \in D \,\}$.

A document that contains a keyword or any word in its expansion set earns the score of that keyword.
For example, consider Question 30:

<num> Number: 30

What are the Valdez Principles?

Its keywords are “Valdez” and “Principles”, and the expansion sets are [valdez/valdezes/] and
[principle/principles/rule/rules/precept/precepts/rationale/rationales/], respectively. If a document
contains “principles” and “rules” but neither “valdez” nor “valdezes”, its score is only the weight
of “Principles”.
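The Valdez example can be reproduced with a minimal sketch of the document score. The function name `score_document` and the whitespace tokenization are assumptions made for illustration; the weights 100 and 1 are the values reported for proper nouns and other keywords in Section 2.4.

```python
def score_document(doc_tokens, expansion_sets, weights):
    """Sum the weights of the keywords whose expansion set intersects the document."""
    doc = {token.lower() for token in doc_tokens}
    return sum(
        weights[keyword]
        for keyword, expansion in expansion_sets.items()
        if doc & expansion  # at least one word of the expansion set occurs in D
    )


expansion_sets = {
    "valdez": {"valdez", "valdezes"},
    "principles": {"principle", "principles", "rule", "rules",
                   "precept", "precepts", "rationale", "rationales"},
}
weights = {"valdez": 100, "principles": 1}

doc = "the new rules follow the principles adopted last year".split()
print(score_document(doc, expansion_sets, weights))  # 1: only "Principles" is matched
```

Because each keyword contributes at most once, a document that mentions both “principles” and “rules” still earns the weight of “Principles” only once.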

Those documents that have scores no less than the threshold are selected as the

answering documents. Threshold is set to the sum of weights of the words in the

original question sentence. Note that the removed words have no scores. If no

documents have scores greater than the threshold, we assume that no answers can be

2.3 Retrieving the Sentences Containing Answers

Finally, we examine each sentence in the answering documents. The sentences that contain the most
words of the expanded question sentence are retrieved, and the top five sentences are regarded as
the answers. If there are more than five possible answers, we randomly select five of them. To meet
the limit of 250 bytes, we truncate the sentences that exceed the limit. Conversely, if an answer is
shorter than the limit, we concatenate it with the following sentences.

2.4 Results and Discussions

The system was run on the 198 questions provided by the Q&A Track of TREC-8. The weights of
proper-noun keywords are set to 100, and the weights of the other keywords are set to 1. Among these
198 questions, 60 have answers. In total, 25 of them are correct, and 20 of the correct answers are
ranked first. Some examples follow.

<num> Number: 29

What is the brightest star visible from Earth?

Ans: In the year 296036, Voyager 2 will make its closest approach to

Sirius, the brightest star visible from Earth. Deep space is benign, so dust

and cosmic rays will erode Voyager 2 extraordinarily slowly. In a billion

or more years, Sagan said, "there w

<num> Number: 102

Who is the Voyager project manager?

Ans: Until December, Voyager 2 occasionally will glance at Neptune and

dark space to improve the accuracy of observations its cameras and

Norm Haynes. Pictures of empty space let engi

We examined the results of the formal run and found that the system can be improved in several
aspects:

(1) execution speed of the system

Owing to the long execution time required, 138 questions in the formal run have no answers. After
revising our algorithm and running the system again, we answered 136 questions. We evaluated this
run ourselves. In total, 62 of the answers are correct, and 42 of them are ranked first.

(2) anaphor resolution

The answering sentence may contain pronouns that refer to constituents in the previous sentences, so
we have to find their antecedents. Similarly, date expressions such as “today” have to be replaced
by an exact time.

(3) phrasal searching

Phrasal searching is helpful for some kinds of questions. For example, to answer the questions

<num> Number: 115

What is Head Start?

<num> Number: 40

Who won the Nobel Peace Prize in 1991?

the key phrases "head start" and "Nobel peace prize" are very useful to find the

3. Description of Our System at TREC9

3.1 QA Track

In TREC-8, we experimented with expanding questions by adding inflections of verbs and nouns, as
well as their synonyms (Lin and Chen, 1999). However, the performance was not as good as we
expected. This year we propose three models to see whether expansion is helpful. Model 1 is the Base
Model, in which only inflections are added. Model 2 adds synonyms from WordNet (Miller, 1990). Model
3 tries to resolve co-reference in a simple way. Each of them is described in detail in later
sections.

Besides, we select answers according to the named entity type that the question is likely to ask
for. Our QA system guesses the entity type of interest by looking at the question. The position of
the answer terms of interest is also important: if an answering sentence is longer than the length
limit, the final answer text has to include the actual answer. We also propose a method to implement
this idea. The algorithm is described later.

3.2 Model Description

3.2.1 Interested Entity Type

After taking a question as input, our system first guesses which entity type the question is
interested in. The method is simply rule-based. If the question starts with “who”, “when”, or
“where”, it may ask for a person name, a time/date expression, or a location name, respectively. If
it starts with “what” or “which”, or if it is a “Name a ...” type question, the system goes on to
look at the first noun after it. We collected

name, “person” for personal name, and so on.
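A rule-based guesser of this kind might look like the sketch below. The type labels, the small `NOUN_TYPES` table, and the shortcut of taking the first word found in that table (instead of the first noun identified by a part-of-speech tagger) are all assumptions made for illustration; the original noun list is not reproduced here.

```python
WH_TYPES = {"who": "PERSON", "when": "TIME", "where": "LOCATION"}

# Hypothetical noun-to-type table; the system's own collected list is not shown.
NOUN_TYPES = {
    "person": "PERSON",
    "city": "LOCATION",
    "country": "LOCATION",
    "year": "TIME",
}


def guess_entity_type(question):
    """Guess the entity type a question asks for, or None if no rule fires."""
    words = question.lower().rstrip("?.").split()
    if not words:
        return None
    if words[0] in WH_TYPES:
        return WH_TYPES[words[0]]
    if words[0] in ("what", "which", "name"):
        # Approximation: use the first following word that is in the noun table.
        for word in words[1:]:
            if word in NOUN_TYPES:
                return NOUN_TYPES[word]
    return None


print(guess_entity_type("Who won the Nobel Peace Prize in 1991?"))
```

Questions such as “What is Head Start?” fall through every rule and return `None`, which corresponds to the case where no entity type can be decided.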

3.2.2 Named Entity Extraction

Named entity extraction plays an important role in our experiments. It is used in deciding the
question focus, in question expansion, and in measuring the similarity between a document passage
and the question sentence.

For named entity extraction, we employ several named entity dictionaries, such as a gazetteer and a
collection of family names. Unlike plain look-up lists, these dictionaries also include other useful
information. For a personal name, we know whether it is a family name, a male first name, or a
female first name. For a country name, we can get its adjective form as well as the name of its
people. For other location names, the dictionaries also provide the names of the provinces or
countries they belong to. Organization names are accompanied by their abbreviations. We have not yet
used the information on types of personal names or on superior administrative divisions.

Time/date expressions are simply recognized by keywords (Sunday, January, etc.). The resolution of
expressions like “yesterday”, “last week”, and so on is still in progress. Other named entities such
as quantities and numbers are not handled yet.

3.2.3 Base Model - Question Expansion by Named Entity and Inflection Forms

In the Base Model, we first decide whether there is a named entity in the question sentence. If so,
we record its equivalents (e.g., the abbreviation of an organization name). Note that a named entity
can consist of more than one word. For the remaining words in the question sentence, we remove stop
words and attach the root form and all the inflected forms of each of

The next step is to segment documents into passages, which serve as comparison units. The document
set we use this year is the set of the 50 most relevant documents to the questions, which is
provided by NIST. In the Base Model, a passage is simply a sentence.

For each passage, we also identify the named entities in it, but their equivalents are not attached.
Inflections are not added either, because we have already introduced them on the question side.

Then we measure the similarity of the passage to the expanded question sentence. Each word (or
phrase) that occurs both in the passage and in the expanded question contributes a score to the
similarity. In our latest experiment, a named entity contributes 2 points and any other term
contributes 1 point. If the term occurs in the original question, its contribution is doubled.
Besides, if a word (or phrase) does not occur in the question but is of the entity type the question
is interested in, the FOCUS tag is set and the position of this word is recorded. When answers are
given, the words (or phrases) that are assigned the FOCUS tag are reported first. A passage with a
higher score is considered more likely to carry the answer and is ranked higher.
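The point scheme described above can be sketched as follows. The function name and the set-based representation of the question are assumptions, as is the choice to count each distinct passage term once; the FOCUS tagging is left out.

```python
def passage_similarity(passage_terms, expanded_question, original_question, named_entities):
    """Score a passage against the expanded question.

    A matched named entity is worth 2 points, any other matched term 1 point,
    and the points are doubled if the term occurs in the original question.
    Each distinct term is counted once (an assumption of this sketch).
    """
    score = 0
    for term in set(passage_terms):
        if term not in expanded_question:
            continue
        points = 2 if term in named_entities else 1
        if term in original_question:
            points *= 2
        score += points
    return score


original = {"voyager", "project", "manager"}
expanded = original | {"projects", "managers"}   # inflections added on the question side
entities = {"voyager"}
passage = ["voyager", "project", "managers", "said"]
print(passage_similarity(passage, expanded, original, entities))  # 4 + 2 + 1 = 7
```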

To meet the length restriction, we have to truncate passages longer than 250 bytes. We first decide
the focus center of each answering passage. We then truncate the characters more than 125 bytes
ahead of the center, and also the excess part at the end if the remaining passage is still longer
than 250 bytes. For passages assigned a FOCUS tag, the center is the average position of all the
named entities of interest found in the passage. For passages without one, the center is the

3.2.4 Model 2 – More Expansion by Synonyms

In addition to the basic structure of the Base Model, we also expand questions with the synonyms of
ordinary nouns and verbs, i.e., those that are not named entities. Synonyms are obtained by looking
them up in WordNet (Miller, 1990). We do so in order to capture answers that are expressed in
different terms.

3.2.5 Model 3 – Passage with Co-Reference Resolved

This model is also built on the Base Model, but we try to resolve the co-reference problem before
measuring the similarity to the question sentence. We propose a simple strategy: take the first
sentence as a passage. If the next sentence contains pronouns other than “it”, it is merged into the
previous passage. If the next sentence contains a phrase of the pattern “the A” and the word “A”
occurs in the previous passage, it is merged into the previous passage, too. This helps to resolve
anaphora as well as co-referential noun phrases.
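The merging strategy can be illustrated with a short sketch. The pronoun list, the regular-expression tokenizer, and the function names are assumptions; the original system's pronoun inventory is not given in the text.

```python
import re

# Hypothetical pronoun list; "it" is deliberately excluded, as in the text.
PRONOUNS = {"he", "she", "they", "him", "her", "them", "his", "their"}


def _words(text):
    return re.findall(r"[a-z]+", text.lower())


def _refers_back(sentence, previous_passage):
    """True if the sentence has a pronoun or a "the A" phrase with A in the passage."""
    words = _words(sentence)
    if PRONOUNS & set(words):
        return True
    previous_words = set(_words(" ".join(previous_passage)))
    return any(a == "the" and b in previous_words for a, b in zip(words, words[1:]))


def build_passages(sentences):
    """Merge each sentence into the previous passage when it seems to refer back."""
    passages = []
    for sentence in sentences:
        if passages and _refers_back(sentence, passages[-1]):
            passages[-1].append(sentence)
        else:
            passages.append([sentence])
    return [" ".join(passage) for passage in passages]


sentences = [
    "A probe flew past Neptune.",
    "The probe sent pictures.",
    "Engineers cheered.",
    "They celebrated.",
    "It was late.",
]
for passage in build_passages(sentences):
    print(passage)
```

Because almost any sentence with a pronoun is merged, passages can grow quickly on real text, which is consistent with the behavior of Model 3 discussed in the evaluation.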

3.3 Evaluation

Table 1 lists the results of our three models. We submitted three runs, one for each model, i.e.,
qantu01 for the Base Model, and so on. Each answer text is judged as Wrong, Correct, or Unsupported.
"Unsupported" means that the document associated with the answer text does not really support the
answer. The Strict Evaluation counts only Correct answers, and the Lenient Evaluation takes both
Correct and Unsupported ones as

Table 1. Results of Three Models in the QA Track at TREC-9

Run ID     Strict               Lenient              Strict (Debugged)
           MRR    Failed        MRR    Failed        MRR    Failed
qantu01    0.315  377 (55.3%)   0.348  354 (51.9%)   0.333  368 (55.0%)
qantu02    0.315  376 (55.1%)   0.341  354 (51.9%)   0.327  365 (53.5%)
qantu03    0.278  394 (57.8%)   0.309  370 (54.3%)   0.284  394 (57.8%)

Table 1 shows that about half of the questions were not answered. This is better than last year,
when we answered only one third of the questions correctly. On average, 24 more questions are
answered by unsupported documents.

Comparing the performance of the different models, the Base Model and Model 2 are almost the same,
and Model 3 is worse than the other two. Model 2 answered one more question than the Base Model did,
but the Base Model offered unsupported answers at higher ranks than Model 2 did in the Lenient
Evaluation. Model 3 is worse in both evaluations.

It seems that adding synonyms does not help much. It even slows the system down. The main difficulty
we met in QA is often paraphrase rather than synonymy alone. Therefore, it might be more effective
to tackle the paraphrase problem.

The reason why Model 3 performed badly may be its over-simplified co-reference resolution. For the
questions that Model 3 failed to answer but the other two runs answered, it was often the case that
the passages containing the answer texts had been expanded into large ones. Co-reference candidates
occur too frequently for simple concatenation of sentences to work.

But co-reference resolution is helpful for question answering. During the

with co-reference resolved. Integrating co-reference resolution into the system, or finding an
alternative way to tackle it, will be another important piece of future work.

4. Description of Our System at TREC10

In the past years, we participated in the 250-byte group. Our main strategy was to measure the
similarity score (or the informative score) of each candidate sentence to the question sentence. The
similarity score was computed by summing the weights of the question keywords that co-occur in the
sentence.

To meet the requirement of shorter answering texts introduced this year, we adapted our system and
experimented with a new strategy that focuses on named entities only. The similarity score is now
measured in terms of the distances to the question keywords in the same document. The MRR score is
0.145. Section 4.1 describes our work in the main task.

We also participated in the list task and the context task this year. In the list task, the
algorithm is almost the same as that in the main task, except that we have to avoid duplicate
answers and find new answers at the same time. The positions of the candidates in the answering
texts have to be considered. We discuss this in Section 4.2.

In the context task, the main issues are how to keep the context and how the answers to the previous
questions can help. In our strategy, the answers to the first question are kept when answering the
subsequent questions, but the answers to any other question (denoted by question i) are kept only if
question i has a co-referential relationship to its

4.1 Main Task

In the previous 250-byte task, we measured the similarity between the question sentence and each
sentence in the relevant documents, and reported the top 5 sentences that had the highest scores and
contained the question focus words. In our experiments, the real answer sometimes lay in a sentence
that is not so “similar” to the question. It becomes harder to extract, in this manner, a text that
is shorter than 50 bytes and contains the answer. Therefore, we experiment with another strategy,
which is “candidate-focused” rather than “sentence-focused”.

After reading a question, the system first decides its question type and keywords as usual. Every
named entity in the relevant documents then becomes an answer candidate. For each candidate, we find
its distances to the question keywords in the same document and sum up the reciprocals of these
distances. Each question keyword contributes only once, i.e., if a keyword occurs more than once,
only the occurrence nearest to the candidate contributes to the score. Moreover, we assign higher
weights to the keywords that are named entities. After all the candidates are scored, the top five
are proposed, together with the texts surrounding the candidates within 50 bytes. The texts are
extracted in such a way that the candidates are placed in the middle.

In our experiments, we found that if a question keyword immediately precedes or follows the
candidate, it dominates the score regardless of the other question keywords. To solve this problem,
we divide the distance by three, i.e., we take three words as the unit for measuring distance. The
scoring function is as follows:









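$$\mathrm{score}(x) = \sum_{t \in D \cap Q} \mathrm{weight}(t) \times \frac{1}{\min\limits_{pos_D(t)} \left\lceil \dfrac{\lvert pos_D(t) - pos_D(x) \rvert}{3} \right\rceil} \qquad (1)$$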

examined, t is a term occurring in both Q and D, and $pos_D(t)$ is one of the occurrence
positions of t in D.
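A minimal sketch of this candidate-focused scoring is given below. Rounding the distance up to whole three-word units is our reading of "three words as a unit" and is an assumption, as are the function name, the use of word indices as positions, and the example weights.

```python
import math


def candidate_score(candidate_pos, keyword_positions, weights, unit=3):
    """Sum weight(t) / distance over question keywords found in the document.

    Distances are measured in units of `unit` words (rounded up, an assumption),
    and only the occurrence of each keyword nearest to the candidate counts.
    """
    score = 0.0
    for keyword, positions in keyword_positions.items():
        distances = [
            math.ceil(abs(pos - candidate_pos) / unit)
            for pos in positions
            if pos != candidate_pos
        ]
        if distances:
            score += weights[keyword] / min(distances)
    return score


# Candidate at word 10; "nobel" occurs at words 9 and 30, "prize" at word 14.
positions = {"nobel": [9, 30], "prize": [14]}
weights = {"nobel": 2.0, "prize": 1.0}   # hypothetical weights
print(candidate_score(10, positions, weights))  # 2.0 / 1 + 1.0 / 2 = 2.5
```

With the three-word unit, keywords one, two, or three words away from the candidate all count as distance 1, so an immediately adjacent keyword no longer outweighs the others.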

The algorithms for deciding the question type and extracting named entities are the same as those of
last year, which were proposed in Lin and Chen (2000). If we cannot tell which question type a
question belongs to, or if the question type is not concerned with a named entity, we consider every
kind of entity as a candidate. To extract as many different answers as possible, we ignore the
answering texts whose named entity answers have already appeared in previous answering texts.

Two runs were submitted this year. In the first run, qntuam1, when the question keywords were
prepared, variants of ordinary words (inflections of verbs, plural forms of nouns, etc.) and of
named entities (adjective forms of country names, abbreviations of organization names, etc.) were
added to the keyword bag. Stems of keywords were also added with a lower weight. Note that no matter
how many variants or stems of a keyword are matched in a document, only one of them contributes to
the score. We select the one that contributes the highest score.

In the second run, qntuam2, the synonyms and explanations provided by WordNet (Fellbaum Ed., 1998)
are also added, with lower weights to reduce the noise. Moreover, if there are m words in an
explanation text and n of them occur in the document, the matching score of this explanation is
defined as $\frac{n}{m} \times \mathrm{weight}(e)$, where weight(e) is the weight of this
explanation.
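As a small illustration of the explanation matching score, the sketch below computes the fraction of explanation words found in a document and scales it by the explanation weight. The function name, the whitespace tokenization, the sample gloss, and the weight 0.5 are assumptions for illustration only.

```python
def explanation_score(explanation, doc_tokens, weight):
    """Return (n / m) * weight, where m is the number of words in the explanation
    and n is the number of those words that occur in the document."""
    words = explanation.lower().split()
    if not words:
        return 0.0
    doc = {token.lower() for token in doc_tokens}
    matched = sum(1 for word in words if word in doc)
    return matched / len(words) * weight


gloss = "a basic truth or law"                       # hypothetical explanation text
document = "the law reflects a simple truth".split()
print(explanation_score(gloss, document, weight=0.5))  # 3 of 5 words match
```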

4.2 List Task

The list task is new this year. A question specifies not only its information need but also the
number of answers required. Therefore, the system has to offer the specified number of different
answers. An example is Question 1:

Question 1: Name 20 countries that produce coffee.

In this case, the system is asked to provide the names of 20 different countries. Besides deciding
which countries produce coffee, the system also has to decide whether an answer is a duplicate,
i.e., whether two answers are identical to each other.

The main algorithm for this task is almost the same as that for the main task. The only difference
is that we extract the answering text in such a way that the candidate is located at the beginning.
In this way, if more than one answer appears in the same sentence, the previously proposed
candidates do not appear again in the subsequent answering texts. The algorithm of the main task
already ignores identical answers (those that are lexically identical), so we do nothing else to
check answer identity.

Two runs were submitted, configured in the same way as those in the main task. Their average
accuracies are 0.18 and 0.14, respectively.

4.3 Context Task

The context task is another new task this year. A series of questions is submitted, in which each
question is somewhat related to the previous ones. For example, in Question CTX1:

a. Which museum in Florence was damaged by a major bomb explosion in 1993?

b. On what day did this happen?

c. Which galleries were involved?

e. Where were these people located?

f. How much explosive was used?

Question CTX1a asks for the name of the museum. Question CTX1b goes on to ask for the date of the
event mentioned in Question CTX1a, so Question CTX1a and its answer are important keys to Question
CTX1b. Question CTX1c asks for more details about Question CTX1a but is irrelevant to Question
CTX1b. So is Question CTX1d. Question CTX1e, however, refers to both Question CTX1a and Question
CTX1d. We can draw the dependency graph of this series of questions as below:

CTX1a ←─┬─ CTX1b
        ├─ CTX1c
        ├─ CTX1d ←── CTX1e
        └─ CTX1f

If a question depends on one of its previous questions, the information related to that previous
question is obviously also important to the present question. Thus the system has to decide the
question dependency.

We propose a simple strategy to judge the dependency. Because the first question is the base
question of the series, every subsequent question is dependent on the first one. In addition, if a
question contains an anaphor, or a definite noun phrase whose head noun also appears in the previous
question, we postulate that this question is dependent on its previous question.

The next issue is how to use the dependency information, as well as the context information, in
finding answers. After answering a single question, the system has located some answer candidates
together with documents and segments of texts in

dependent questions, as well as the keywords of the question itself. Note that context information
can be transitive. In the above example, Question CTX1e consults both the information that Question
CTX1d itself owns and the information that Question CTX1d refers to, i.e., that of Question CTX1a.

In our experiment, we consider only the keywords and their weights as the context information.
Furthermore, we assign lower weights to the keywords in the context information so that the
importance of the more recent keywords is not underestimated. The answers to the previous question
keep their weights because they are new information. The question type is decided by the present
question.

An accompanying issue is how confident we are in an answer that is included in the context
information. We may find wrong answers to the preceding questions, and those errors may be
propagated to the subsequent questions. Moreover, do the five answers have the same weight? Should
we trust the higher-ranked answers more than the lower-ranked ones, or consider only the top one?

These issues are worth investigating but have not yet been implemented in this year's experiment.
We assign weights to the previous answers according to the following equation:



$$\mathrm{weight}(x) = \mathrm{weight\_NE}(x) \times \sqrt{\frac{6 - \mathrm{rank}(x)}{5}} \times \mathrm{weight\_PreAns}(x) \qquad (2)$$

where weight_NE(x) assigns higher weight if x is a named entity; rank(x) is the rank of x,

and weight_PreAns(x) is a discount to the previous answers because they may be wrong.

The square root part tries to assign higher weights to the higher-ranked answers.
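To make the effect of the rank term concrete, the sketch below evaluates the weight of a previous answer under one reading of Equation (2), in which the square-root factor is sqrt((6 - rank) / 5). That reading, the function name, and the numeric values chosen for weight_NE and weight_PreAns are assumptions, not values reported in the experiment.

```python
import math


def previous_answer_weight(rank, is_named_entity,
                           ne_weight=2.0, plain_weight=1.0, discount=0.5):
    """Weight of a previous answer: weight_NE * sqrt((6 - rank) / 5) * weight_PreAns.

    `ne_weight`, `plain_weight`, and `discount` are hypothetical values.
    `rank` runs from 1 (best) to 5.
    """
    base = ne_weight if is_named_entity else plain_weight
    return base * math.sqrt((6 - rank) / 5) * discount


for rank in range(1, 6):
    print(rank, round(previous_answer_weight(rank, is_named_entity=True), 3))
```

Under this reading the top-ranked answer keeps its full discounted weight, and the weight shrinks smoothly down to the fifth answer rather than dropping to zero.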

Because only relevant documents to the first questions are provided, and we do not

subsequent questions. Our solution is to search the same relevant set of the first

question.

We submitted one run this year. Its main algorithm followed the first run of the main task.

There is still no formal evaluation of this task. The MRR of our result over all 42 questions is
0.139. Four of the first questions are correctly answered. In each of these four series, the answer
to at least one of the subsequent questions is also found. Only one of the series is fully answered.

4.4 Discussion

Comparing the two runs of the main task and the two runs of the list task, we find that synonyms and
explanations introduce too much noise, so the performance becomes worse. However, paraphrase is an
important problem in question answering. An explanation provides only one of the possible
paraphrases, so more research on paraphrases is needed.

After investigating the results of the list task, we found a small bug in answer reporting. Although
duplicate answers were ignored, equivalent answers were not. In other words, the adjective forms of
country names were regarded as answers different from the original names, which produced redundancy
and lowered the performance.

This year, the answer types of many questions are not named entities. Many of the questions in the
main task are “definition” questions. For example,

Question 896: Who was Galileo?

In our system, we take only named entities as answer candidates, so we cannot answer questions of
this type, and the performance is considerably worse than that of last year.

The same problem occurred in the context task, too. Therefore, it is not clear whether our proposed
model for the context task is good or bad. Further investigation and

Chapter 2 Selection of Answer Candidates in Question Answering Using Information Fusion

1. Introduction

In recent years, question answering has become a popular research topic. Since 1999, the TREC QA
Tracks (Voorhees, 2001) have provided important evaluation test beds for developing question
answering systems. There have been 1,893 questions, together with the correct answers found in the
document set and the surrounding text of the answers.

The answer type is important information used by most teams in the TREC QA Tracks. QA systems first
analyze the input question and decide which type of answer is required. For example, if we know that
a question asks for a person, it is better to report a personal name as the answer.

Because answer types cannot be enumerated completely, it is impossible to list all of them and
design an answer candidate extractor for each. In this chapter, we propose three models that extract
answer candidates automatically from the corpus based on information fusion.

2. Answer Types and Candidates

Each participating team of TREC QA-Tracks has its own answer type classification.

Harabagiu et al. (2001) encoded 38 answer types in an ANSWER TAXONOMY. Hovy

et al. (2001) defined 140 types in the Webclopedia project. These answer types are

mostly named entities, such as persons, countries, dates, plants, etc. The participants

have to implement an answer candidate extractor, or a named entity identifier,

from TREC QA-Track questions with their possible answer types attached at the end:

Q971: How tall is the Gateway Arch in St. Louis, MO? [LENGTH]

Q998: What county is Phoenix, AZ in? [COUNTY]

Q1228: What is the melting point of gold? [TEMPERATURE]

When the answer type of a question is decided, a QA system finds all occurrences of terms that match
this answer type and considers them as answer candidates. The QA system then ranks these candidates
and proposes the most suitable ones.

Answer types can be divided into two classes: named entities and entity sets. For named entities, we
want to know the name given to a specific entity, such as “Canada” (a country), “Venus” (a planet),
or “Titanic” (a ship). For entity sets, what we want is a concept denoting a set of entities, such
as duck (a kind of bird), rose (a kind of flower), or dictionary (a kind of book).

Answer candidates of the first class are often identified by named entity recognizers for the
pre-classified answer types of each QA system. Candidates of the second class require some world
knowledge to capture. One possible resource is WordNet, which includes a hierarchy of entities. To
answer questions like “What kind of bird can …?”, any descendant of “bird” in WordNet can be
regarded as an answer candidate.
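Collecting all descendants of a concept as entity-set candidates can be sketched with a toy hyponym table. The `HYPONYMS` dictionary below is a hand-made stand-in for the WordNet hierarchy and does not reflect its actual contents; the function name is likewise an assumption.

```python
# Toy hyponym table standing in for the WordNet hierarchy (hypothetical content).
HYPONYMS = {
    "bird": ["duck", "owl"],
    "duck": ["mallard"],
    "flower": ["rose"],
}


def descendants(concept, hyponyms):
    """Return every concept reachable from `concept` through hyponym links."""
    found = []
    stack = list(hyponyms.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.append(node)
            stack.extend(hyponyms.get(node, []))
    return found


print(sorted(descendants("bird", HYPONYMS)))
```

A concept that is missing from the table, such as “birthstone” in the discussion that follows, simply yields no candidates, which is the coverage problem the text describes.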

In fact, not all questions can be classified into pre-defined answer types. In the named-entity
class, so many types of entities can be named that it is not easy to define all possible
named-entity sets, not to mention to design a system that identifies them all. In the entity-set
class, not all terms in the world are collected in WordNet (e.g., “birthstone” in the TREC
questions). Besides, the knowledge collected in WordNet (Fellbaum, 1998) consists of absolute
hypernymy/hyponymy relationships. For example, WordNet does not provide a relationship between
“habitat” and “mature tree” in the following example:

Q217: What is the habitat of the chickadee?

Ans: oak tree, mature tree, meadow, …

The hyponym of “habitat” in WordNet 1.7 is “habitation”, which has two hyponyms: “aerie, aery,
eyrie, eyry” and “lair, den”. The information that an oak tree can be a habitat may be collected in
some knowledge base, but we do not know where it is.

3. Information Fusion

In question answering, there may exist a single piece of text which offers the

information needed to answer a question. In such a case, we can extract the answer

directly from the text. For example:

Q894: How far is it from Denver to Aspen?

Ans: 204 miles

Text: Aspen is 204 miles from Denver.

In the above example, the single passage explicitly mentions a DISTANCE-QUANTITY whose end locations
are Aspen and Denver, which exactly matches question Q894.

However, a single sentence does not always contain sufficient information to answer a question. A
QA system may have to gather pieces of information scattered across different documents, or across
different parts of one document, in order to find the answer. Information fusion is the process of
handling pieces of information from different documents to answer a question. Sometimes the answer
is selected on the basis of multiple pieces of text. Sometimes more than one answer is found in the
corpus,

Here are some examples that information fusion has to deal with:

(1) From multiple passages to one answer

A question can be decomposed into two or more sub-questions. For example,

Q: Where was the first president of the United States born?

It can be decomposed into two sub-questions: “WHO was the first president of the United States” and
“WHERE is his birthplace”. The answer to the first sub-question may appear in many documents, while
the answer to the second sub-question appears in other documents.

(2) Contradictory answers

When different answers are reported, they may be contradictory. The most typical case is news
stories about the same event reported at different times. For example,

Q: Who murdered Mary?

D1 (in 1996): John was judged guilty for murdering Mary.

D2 (in 1997): The police found new evidence that Tom murdered Mary.

For QA systems, “John” and “Tom” are both valid answers judging from their surface texts. But there
is only one true answer to this question, which is “Tom”.

(3) Individual answers

Sometimes different answers are individually correct. For example,

Q378: Who is the emperor of Japan?

The name of any previous Japanese Ten-On (emperor) is considered a correct answer.

(4) Answers which have to be combined

quantity answers. For example,

Q: How many people were killed by cancer in Europe?

A QA system first finds out the death tolls in the European countries, and gives the

sum of these numbers as the answer.

The other case is summarization of multiple passages. Questions asking for

opinions, methods, status, or procedures often require longer answering passages. Texts

extracted from different documents contain redundant or novel information which has to

be removed or added before being reported to users.

Cases 2, 3, and 4 can be regarded as “answer fusion”, because the fusion is mainly performed on the answer part. In Case 1, information fusion is used to resolve question terms, which helps to detect correct answers in the next step.

4. Automatic Answer Candidate Selection

4.1. What-Question Type

For 5W1H questions, the targets of who-, where-, and when-questions are clearer than those of the other three. The answers to how-questions are not entities, so they are not a major focus of this paper. The following considers only what-questions and which-questions.

There are four cases of what-questions:

1. “What X VP?” or “N V what X?”

E.g. Q427: “What culture developed the idea of potlatch?”

E.g. Q934: “Material called linen is made from what plant?”

Answer candidates are those which are X’s, such as cultures or plants in these examples.

2. “What be the X-NP?”

Answer candidates are those which are X’s, such as chemical symbols in this

example.

3. “What” alone as a subject or an object, where the main verb is not a be-verb

E.g. Q552: “What caused the Lynmouth floods?”

Its answer type does not directly appear in the question.

4. DEFINITION questions

E.g. Q600: “What is typhoid fever?”

Answers to such questions are definitions or descriptions.

In this paper, we experimented on only the first and the second cases. For the fourth case, i.e., DEFINITION questions, no answer candidates are needed to answer a question. Instead, gloss information or definition patterns are more helpful. For the third case, one possible way to find answer candidates is to gather all the terms that occur as subjects (or objects) of the main verb. This remains future work and is not discussed in this paper.

For the first and the second cases, answer candidates are those which can be X’s. If Y is the answer to a question Q “What X does something?”, the information “Y is an X” and “Y does something” may not appear in the same passage, or even in the same document. Information fusion is needed to gather these pieces of information together in order to answer such questions.

Our idea of answer candidate selection by information fusion is as follows: find instances of X in a knowledge base, take each instance as Y, and check whether “Y does something”. If so, this instance is reported as an answer. The instance-finding procedure is described in

4.2. Question Focus

For a which-question or a what-question of the first or second case, our system first identifies its X part, which is referred to as the “question focus” by Harabagiu et al. (2000). We use this term with a slightly different meaning.

After syntactic parsing, if the word “what” or “which” alone is an NP, then the question is in the second case and our system extracts the noun phrase after the be-verb as its question focus. If “what” or “which” is in a noun phrase with other words, the question is in the first case and our system takes as its question focus the noun phrase that contains “what” or “which”, excluding the word “what” or “which” itself.

Because there is no guarantee that at least one instance of this question focus can be found in the knowledge base, we have to relax the focus if necessary. Other possible foci are the head noun phrase of the question focus, and the phrases that remain after removing a leading article, an attached prepositional phrase, or any modifier. If the question focus is of the form “kind of NP”, “type of NP”, “name of NP”, etc., a possible focus is the noun phrase after “of”.
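The relaxation of a question focus can be sketched as follows. This is only an illustration under simplifying assumptions: it works on plain strings rather than on a parse tree, it treats every word before the last one as a removable modifier, and it does not handle attached prepositional phrases. The names `candidate_foci`, `ARTICLES`, and `OF_HEADS` are hypothetical and do not come from our system.

```python
# Hypothetical word lists used only for this illustration.
ARTICLES = ("the ", "a ", "an ")
OF_HEADS = ("kind of ", "type of ", "name of ")


def _strip_prefix(phrase, prefixes):
    """Remove the first matching prefix (case-insensitive), if any."""
    lowered = phrase.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix):
            return phrase[len(prefix):]
    return phrase


def candidate_foci(focus):
    """Return the question focus followed by progressively relaxed foci."""
    foci = [focus]
    current = _strip_prefix(focus, ARTICLES)    # drop a leading article
    foci.append(current)
    current = _strip_prefix(current, OF_HEADS)  # "kind of NP" -> NP
    foci.append(current)
    words = current.split()
    while len(words) > 1:                       # drop leading modifiers one by one
        words = words[1:]
        foci.append(" ".join(words))
    # Remove duplicates while keeping the most specific focus first.
    return list(dict.fromkeys(foci))


print(candidate_foci("California's state bird"))
```

The foci are tried from the most specific to the most general, so that a more general focus is used only when no instance is found for a more specific one.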

In the following example, a question and its possible foci are demonstrated in

sequence:

Q254: What is California's state bird?

Foci: California's state bird

state bird

4.3. Corpus Candidates

DEFINITION Instances

In order to find instances of an entity set, we adopted DEFINITION patterns from

Ravichandran and Hovy (2002), and from Soubbotin (2001). DEFINITION questions

are a special group in question answering. Such a question asks for a definition of a

term, or a description of a specific person or entity.

Ravichandran and Hovy experimented on six question types, one of which is DEFINITION. They collected pairs of questions and corresponding answers as examples, and automatically learned their co-occurrence patterns in the knowledge base. Some example DEFINITION patterns are listed below:

<NAME> -LRB- <ANSWER> -RRB-

<NAME> and related <ANSWER>s

in which <NAME> denotes a question term, and <ANSWER> the corresponding answer part.

Soubbotin also used DEFINITION patterns, but constructed them manually. Some

examples are:

-COMMA-<NAME> is called <ANSWER>

We use definition patterns to find instances because, for the instances of an entity set, the name of the entity set acts like a definition of the instances. Unlike the use of these patterns in answering DEFINITION questions, this time the <ANSWER> part

<NAME> part as instances.

Syntactic information is integrated into these patterns. Since answers are mostly

entities, we forced the extracted <NAME> parts to be noun phrases (NP) or quantitative

phrases (QP). We extracted the minimal noun phrase if there is no other text to the left

or right of the <NAME> tag.

Equivalent Instances

In some cases, the name of the entity set is not the best definition of its instance. Moreover, it may not be an appropriate definition of the instance at all. For example, “oak tree” can be an instance of “habitat”, but the definition of “oak tree” is “a deciduous tree that has acorns and lobed leaves”.

To capture such instances, we further extracted equivalent entities in the knowledge base. That is, if any form of “A is B” appeared in the corpus, then we assumed that A could be an instance of B or, vice versa, that B could be an instance of A. Again, during extraction, A and B were each restricted to an NP or QP.
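The extraction of equivalent instances can be illustrated with a rough sketch. It assumes that each input is a single short sentence and uses a regular expression in place of the syntactic restriction to NP or QP, so it is much cruder than the parser-based extraction described above. The function name `equivalent_instances` is hypothetical.

```python
import re

# "A is B" with a be-verb; the lazy groups keep A as short as possible.
_COPULA = re.compile(r"^(.+?)\s+(?:is|are|was|were)\s+(.+?)[.!?]?$")
# A leading article is not part of the instance name.
_ARTICLE = re.compile(r"^(?:the|an?)\s+", re.IGNORECASE)


def equivalent_instances(sentence):
    """Return (instance, entity set) pairs in both directions for "A is B"."""
    match = _COPULA.match(sentence.strip())
    if match is None:
        return []
    a = _ARTICLE.sub("", match.group(1))
    b = _ARTICLE.sub("", match.group(2))
    # A may be an instance of B, or vice versa.
    return [(a, b), (b, a)]


print(equivalent_instances("The quail is the state bird."))
```

Both directions are kept because the surface form “A is B” alone does not tell which side names the entity set.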

4.4. Answer Candidates Selection Models

We experimented on three models to find answer candidates automatically. They are:

(1) Model A: Extracting Self-Evident NPs

If an NP’s head is the same as the question focus, it is regarded as an answer

candidate.

E.g. QFocus: artery

(2) Model B: Looking for WN Descendants

If a term is a descendant of the question focus in WordNet, it is considered as an

answer candidate.

E.g. QFocus: color

AnsCand: red

WN: red, redness

=> chromatic color, spectral color…

=> color, colour, …

(3) Model C: Extracting Corpus Candidates

If a term in the corpus matches one of the DEFINITION patterns, or an equivalent

relationship (A is B) is found, it is considered as an answer candidate.

E.g. QFocus: elephant

AnsCand: Loxodonta Africana

Pat: Loxodonta Africana (African elephants)
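As an illustration of Model A only, the following sketch selects self-evident noun phrases. It assumes that the noun phrases have already been extracted by a parser and approximates the head of an English noun phrase by its last word, which fails for phrases with post-modifiers. The function names are hypothetical.

```python
def head_word(noun_phrase):
    """Approximate the head of an English NP by its last word (an assumption)."""
    return noun_phrase.split()[-1].lower()


def self_evident_candidates(noun_phrases, foci):
    """Keep the noun phrases whose head equals one of the question foci."""
    focus_words = {focus.lower() for focus in foci}
    return [
        phrase
        for phrase in noun_phrases
        if phrase.strip() and head_word(phrase) in focus_words
    ]


phrases = ["pulmonary artery", "the heart", "blue blood"]
print(self_evident_candidates(phrases, ["artery"]))
```

Models B and C would replace the head comparison with a WordNet descendant lookup and a pattern match over the corpus, respectively.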

5. Experiments

5.1. Experiment Design

We used question sets provided by TREC QA-Tracks from TREC-9 to TREC-2000.

We chose what- and which-questions, but dropped those asking for persons, countries, cities, times, and quantities, because the answer candidates of these questions can be provided by a common named entity recognizer. After filtering out questions with no answer in the TREC QA-Tracks, 251 questions were selected for the experiment.

questions by ApplePie Parser, then decided its what-question type as described in Section

4.1, and extracted the focus part together with all of its sub-NPs. Errors produced by the parser were corrected by hand, so that the experiment focused only on answer candidate problems.

When implementing Model A, the top 1,000 documents retrieved for a given question served as a corpus from which self-evident noun phrases related to this question were extracted. Each noun phrase whose head was the same as one of the foci of the question was collected as an answer candidate. We also tested on a smaller corpus, only the top 100 documents, to see the coverage.

To evaluate Model B, we used the formal answers provided by the TREC QA-Tracks. For a given question, we checked whether one of the formal answers was a descendant of the question focus in WordNet. If so, this question was counted as “covered”, because all the WordNet descendant entries were regarded as answer candidates.

The corpus of Model C was created by querying Google. Each question focus was submitted as a query to Google; if the question focus had more than one word, the query was restricted to documents containing the whole phrase. We retrieved the first 10,000 sentences containing the question focus in the top 1,000 documents. Each sentence in the retrieved corpus was matched against the DEFINITION patterns described in Section 4.3. If a sentence matched, the noun phrase in the <NAME> part and all its sub-NPs were extracted as answer candidates. Equivalent relationships were also examined.

5.2. Results

The results are listed in Table 1. Self-Evident NPs (Model A) cover 62 questions in the top 100 documents, and 85 in the top 1,000 documents. WordNet Descendants (Model B) cover 59 questions. Google Candidates (Model C) cover 54 questions.

The fourth column lists the coverage of combined models, where “A+C” denotes the combination of Models A and C, and so on.

Table 1. Coverage of Models (in Numbers of Questions with Correct Answer Candidates)

Model                        Coverage   Combined Model   Coverage
Self-Evident NPs (top 100)   62         A+C              113
Self-Evident NPs (A)         85         B+C              92
WordNet Descendants (B)      59         A+B              120
Google Candidates (C)        54         A+B+C            137

5.3. Discussion

Interestingly, even though self-evident NPs alone have the largest coverage, the three models in fact cover different sets of questions. Therefore, they complement one another well. Many descendants in WordNet do not contain the same words as their ancestors, while many self-evident noun phrases, especially named entities, are not collected in WordNet. The Google candidates model can extract named entities or senses that are neither self-evident nor collected in WordNet. Some examples are listed below. Each of them was extracted by only one model.

Q254: What is California's state bird?

A: quail (WordNet Descendants)

A: the Hallmark card company (Self-Evident NPs)

Q355: What is the most expensive car in the world?

A: Bugatti Royale (Google Candidates)

The combination of Models A and C covers 113 questions, and the combination of Models B and C covers 92 questions. This means that Model C does improve the coverage of answer candidates compared with Model A or Model B alone. Finally, the combination of all three models covers 137 questions, which is more than half of the test questions.

The result of Model B does not exactly reflect the performance of WordNet, because the formal answers provided by TREC drop the words already occurring in the questions. For example, the answer to Q1256: “What is the only artery that carries blue blood from the heart to the lungs?” is “pulmonary artery”, which is a descendant of “artery”, but the given formal answer is “pulmonary”. Even so, what WordNet misses for this reason is the set of self-evident phrases, so the coverage of the combined model is not much affected.

There are several possible reasons for the low coverage of Corpus Candidates. One is that we only match patterns in the top 1,000 documents. Many extracted candidates are redundant, especially frequent entities.

The performance of the DEFINITION patterns in this experiment is not yet clear. Erroneous noun phrases were often extracted. For example, one of the apposition patterns, “<NAME> -COMMA- <ANSWER> -COMMA-”, is often confused with the conjunction case. Further investigation of these patterns is necessary.

6. Conclusion and Future Work

In this paper, we investigated the coverage of different answer candidate extraction

models. We also proposed a method to extract candidates from a large corpus. The

extraction was based on the idea of information fusion, and patterns were employed to

detect possible candidates.

The results of our experiments show that the three models, i.e., Self-Evident Noun

Phrases, WordNet Descendants, and Corpus Candidates, have their respective coverage,

and they can complement one another.

In the future, the extraction patterns of Corpus Candidates should be investigated more carefully. We will also try to find new patterns to capture the unanswered questions.

The detection of answer candidates is very time-consuming. It would be desirable to perform such detection at indexing time in the IR system, and to keep the relationships between foci and candidates in the index. This is the so-called QA-based indexing mentioned in

Chapter 3 Web as a Translation Aid for Query Processing and Answer Fusion in Multilingual Question-Answering Systems

1. Introduction

Question answering (QA) attracts much attention because huge heterogeneous data collections are available on the Internet. Figure 1 shows a typical multilingual QA system. A Chinese query is segmented and part-of-speech tagged. After query translation, both the original query and the translated query, e.g., Chinese and English queries, are sent to an information retrieval (IR) system. The IR system retrieves the relevant Chinese and English documents. According to the foci of the query, Chinese and English answers are extracted from the relevant documents. Finally, the answers are fused and reported. This paper will show how to use the web as an aid in query

[Figure 1 components: Chinese Query; Segmentation and POS Tagging; Query Translation; Thesauri; Information Retrieval System; Multilingual Document Collection; Relevant Documents; Named Entity Recognition; Answer Finding; Answer Fusion; Answer(s)]

2. The Web as a Translation Aid

After bilingual dictionary lookup, the out-of-vocabulary query terms are translated by using the web as a multilingual corpus. For example, the named entity 亨利‧杜南 in the Chinese query 亨利‧杜南是哪一國人? (What is Jean Henri Dunant’s nationality?) is an important query term, but it is not in the bilingual dictionary. Figure 2 shows a snapshot of a Google search, in which snippets are returned in ranked order. Figure 3 shows one of the snippets in which the corresponding English translation appears. Here, a snippet consists of title, type, body, and source fields. The following describes how to extract translation pairs from snippets.

Figure 2. A Snapshot after Google Search “亨利‧杜南”

[DOC] 香港紅十字會(青年及義工事務部)

檔案類型:Microsoft Word 2000 - HTML 版

... 為紀念本會創辦人亨利杜南先生(Jean Henri Dunant)的誕辰，每年的五月八日被定為世界紅十字日。世界各地的紅十字會均會以不同形式的活動慶祝世界紅十字日. 和推廣紅十字運動。經歷南亞海嘯後，紅十字會之救災及備災工作更為社會各界人仕關注及 ...

Figure 3. A Snippet Containing Translation of the Named Entity “亨利‧杜南”

The basic algorithm is as follows. The top-k snippets returned by Google are analyzed. For each snippet, we collect the runs of consecutive capitalized words and regard them as candidates. Then we count the total occurrences of each candidate in the k snippets, and sort the candidates by their frequencies. The candidates with the most occurrences are considered as the translation of the query term.
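The basic algorithm can be sketched as follows, assuming that the snippets have already been fetched and are given as plain strings. The regular expression for “consecutive capitalized words” and the function name `rank_translation_candidates` are assumptions made for this illustration, and the sample snippets in the usage lines are made up.

```python
import re
from collections import Counter

# A run of capitalized words separated by whitespace, e.g. "Jean Henri Dunant".
CAP_RUN = re.compile(r"[A-Z][A-Za-z.'-]*(?:\s+[A-Z][A-Za-z.'-]*)*")


def rank_translation_candidates(snippets, k=10):
    """Rank capitalized-word runs by their total occurrences in the top-k snippets."""
    top = snippets[:k]
    candidates = set()
    for snippet in top:
        candidates.update(m.group(0) for m in CAP_RUN.finditer(snippet))
    # Count every occurrence of each candidate over all k snippets.
    counts = Counter({c: sum(s.count(c) for s in top) for c in candidates})
    return counts.most_common()


# Made-up snippets for illustration only.
snippets = [
    "亨利杜南先生(Jean Henri Dunant)的誕辰",
    "Jean Henri Dunant 創辦紅十字會",
    "Red Cross 紅十字會",
]
print(rank_translation_candidates(snippets, k=3)[0])
```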

The above algorithm does not consider the distance between the query term and the corresponding candidate in a snippet. Intuitively, the larger the distance is, the less likely a candidate is to be the translation. We modify the basic algorithm as follows. We drop those candidates whose distances are larger than a predefined threshold. As a result, a snippet may not contribute any candidates. To collect enough candidates, say cnum, we may have to examine more than k snippets. Because cnum candidates may not always exist, we stop collecting when a maximum number (max) of snippets have been examined. Finally, the candidates are sorted by scores computed as follows.

$$score(qt, c_i) = \frac{freq(c_i)^2}{AvgDist(qt, c_i)^3}$$

where score(qt, c_i) denotes the score of a candidate c_i for a query term qt,

freq(c_i) denotes the frequency of c_i, and

AvgDist(qt, c_i) denotes the average distance between qt and c_i.

In this way, we prefer those candidates c_i that co-occur more frequently with the query term qt and have smaller average distances.
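The scoring step of the revised algorithm can be sketched as below. The sketch assumes that each occurrence of a candidate has already been paired with its distance to the query term; the unit of distance is not fixed here. The two exponents are left as parameters, and their default values, the clamping of very small average distances, and all names are assumptions of this illustration rather than a description of our implementation.

```python
from collections import defaultdict


def score_candidates(observations, max_dist, freq_exp=2.0, dist_exp=3.0):
    """Rank candidates by frequency and average distance to the query term.

    observations: iterable of (candidate, distance) pairs, one per occurrence.
    max_dist: occurrences farther than this threshold are dropped.
    freq_exp, dist_exp: exponents of the two factors (assumed defaults).
    """
    distances = defaultdict(list)
    for candidate, dist in observations:
        if dist <= max_dist:
            distances[candidate].append(dist)

    scores = {}
    for candidate, dists in distances.items():
        freq = len(dists)
        avg_dist = sum(dists) / freq
        # Clamp to 1.0 to avoid division by zero (an assumption of this sketch).
        scores[candidate] = freq ** freq_exp / max(avg_dist, 1.0) ** dist_exp
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up observations for illustration only.
obs = [("Jean Henri Dunant", 2), ("Jean Henri Dunant", 4),
       ("Red Cross", 3), ("Hong Kong", 50)]
print([name for name, _ in score_candidates(obs, max_dist=10)])
```

A candidate that appears often and close to the query term is ranked first, while a candidate that only appears beyond the threshold is not scored at all.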

3. Experiments

We adopt the 500 questions of the TREC 2002 QA track (Voorhees, 2002), and translate them into Chinese manually. There are 3,490 words in total in the 500 Chinese questions, of which 1,393 are unique. After bilingual dictionary lookup, 118 words are out of vocabulary. We use them to evaluate the performance of the proposed methods in

(1) #correct: total number of query terms being resolved correctly,

(2) AvgRank: average ranks of the correct candidates in the solved questions,

(3) Time: the time taken to find all the candidates.

Figures 4-6 show the results for these three metrics under different methods and values of cnum. The following six methods are tested, with cnum ranging from 10 to 100.

(1) method1: the basic algorithm in Section 2.

(2) max1000: at most 1,000 snippets (title and body) are explored in the revised algorithm.

(3) max500: at most 500 snippets (title and body) are explored in the revised algorithm.

(4) max1000_title: at most 1,000 snippets are explored, and only the title field of a snippet is used in the revised algorithm.

(5) max1000_quotes: at most 1,000 snippets (title and body) are explored, and the query term is quoted in the revised algorithm.

(6) livetrans: the online livetrans system (Cheng, et al., 2004) is used.

Figure 4 shows that the number of query terms resolved correctly by the revised algorithms (i.e., methods 2-5) increases as cnum increases. After cnum≧40, the #correct of the four methods, i.e., max1000_title > max1000_quotes >

Figure 4. Total Number of Query Terms Being Solved (vertical axis: #correct)

Figure 5. Average Ranks (vertical axis: AvgRank)

Figure 6. Time Taken

For the max1000_title method, the title field, which is a short summary of a snippet, contains important words. When terms in different languages appear in the title field, they are often translations of each other. For the max1000_quotes method, matching a quoted query in Google requires that all the query terms appear and that their order is unchanged. This is stricter than matching an unquoted query.

Figure 5 shows the metric of average rank. The baseline performs the worst. The two methods max500 and max1000 have the lowest average ranks, followed by max1000_quotes and max1000_title. As for time, Figure 6 shows that max500 takes less time than all the other methods. The online livetrans takes more time because it tries to retrieve relevant images in addition to text.

4. Extension to Answer Fusion

In a multilingual QA system, we submit a question to extract the plausible answers from a multilingual document collection. The same named entities may be reported in different languages. For example, for the Chinese question “1997 年擔任日本首相的是誰?” (Who was the Japanese Prime Minister in 1997?), Table 1 lists the first five answers from the English and Chinese document sets, respectively.

Table 1. Answers in Different Languages

Answers from English Documents    Answers from Chinese Documents
Yoshiro Mori                      森喜朗
Keizo Obuchi                      小淵惠三
Junichiro                         陳世昌
Mori                              橋本龍太郎
Ryutaro Hashimoto                 官房

In this example, 森喜朗, 小淵惠三, and 橋本龍太郎 denote the same persons as Yoshiro Mori, Keizo Obuchi, and Ryutaro Hashimoto, respectively. We can merge the two sets of answers in the following way.

(1) Pair every English answer E_i (1≦i≦5) with every Chinese answer C_j (1≦j≦5), generating 25 combinations.

(2) For a combination (E_i, C_j), submit E_i and C_j together to Google, and use a method similar to those specified in Section 2 to verify whether E_i and C_j appear near each other. If the combination has strong collocation, then delete (E_i, X) (where X≠C_j) and (X, C_j) (where X≠E_i), and try the remaining combinations.
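The two merging steps above can be sketched as follows. The web check is abstracted into a caller-supplied predicate `collocated`, which is a hypothetical stand-in for submitting the pair to a search engine and testing whether the two strings appear near each other; in the usage lines it is replaced by a lookup in a fixed set built from the example of Table 1.

```python
from itertools import product


def fuse_answers(english, chinese, collocated):
    """Pair English and Chinese answers that refer to the same entity.

    collocated(e, c) is a hypothetical predicate that returns True when
    the two answers show strong collocation.
    """
    merged = []
    used_english, used_chinese = set(), set()
    # Step (1): enumerate all combinations (E_i, C_j).
    for e, c in product(english, chinese):
        # Combinations sharing a member with an accepted pair are deleted.
        if e in used_english or c in used_chinese:
            continue
        # Step (2): accept the pair if it shows strong collocation.
        if collocated(e, c):
            merged.append((e, c))
            used_english.add(e)
            used_chinese.add(c)
    return merged


english = ["Yoshiro Mori", "Keizo Obuchi", "Junichiro", "Mori", "Ryutaro Hashimoto"]
chinese = ["森喜朗", "小淵惠三", "陳世昌", "橋本龍太郎", "官房"]
same = {("Yoshiro Mori", "森喜朗"), ("Keizo Obuchi", "小淵惠三"),
        ("Ryutaro Hashimoto", "橋本龍太郎")}
print(fuse_answers(english, chinese, lambda e, c: (e, c) in same))
```

Answers that are never paired, such as “Junichiro” in this example, remain as separate answers in their own language.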

Figure 7 shows an example of submitting “小淵惠三 Keizo Obuchi” to Google. The collocation is marked in red.

Figure 7. An Example of Submitting 小淵惠三 and Keizo Obuchi to Google

5. Conclusion

This paper employs the web as a live multilingual corpus to translate questions and

merge answers in different languages. The methods of quoted query terms

(max1000_quotes) and title only (max1000_title) have better coverage. The method

Chapter 4 Open-Domain Question Answering on Heterogeneous Data

1. Introduction

Question answering has become a hot research topic in computational linguistics in recent years. The QA Track in TREC has been held by NIST for two years, offering a new evaluation for this topic. Keyword matching was one of the major methods used by the participating groups (Moldovan, et al., 2000; Singhal, et al., 1999). Named entity information was also found to be important, especially when the question focus had been detected by hand-crafted rules. Many groups employed IR systems to reduce the number of documents to search for answers.

However, the TREC QA Track targets plain text collections only, and the collection is in English. Nowadays, data in different media have become more and more popular, and it is more valuable to provide users with information from heterogeneous data. Many issues arise if heterogeneous data are taken into account in a QA system:

(1) Where is the information to support answering?

In textual data, the information is in the text itself. Consider other kinds of data. For a table, the information lies not only in the texts in table cells, but also in the relationships among cells. This information has to be clarified before it can be used. For video programs, information is carried by the image, sound, speech, and captions of each frame. Finding answers in video programs is more challenging than in plain text.

Figure 1. Architecture of QA system

[Figure 1 components: Questions; Question-Focus; Q-type; Question Processing; Word Tagging and NE Extraction; Question Expansion; WordNet, etc.; IR System; Document Collection; Summarization; Table Interpretation; Video OCR; Answer Finding; Answers]

(2) What is the basic information unit?

In textual data, a basic unit is often defined as a sentence, a paragraph, or a passage segmented according to some linguistic information. There may be no such linguistic information in many other kinds of heterogeneous data. Therefore, the basic information unit has to be redefined for QA on heterogeneous data.

(3) What kinds of questions can be asked?

Since heterogeneous data carry more information than text, many other kinds of questions can be issued. For example, prices are often listed in tables, so comparison between price tables becomes possible. It is also possible to ask a question whose answer is embedded in a fragment of a film.

(4) How does a QA system measure similarity?

Most similarity measurements are based on lexical matching. We have to study different similarity measurements for heterogeneous data.

(5) How does a QA system present answers to users?

There are more informative ways of visualizing the answers. Comparative answers can be shown in a table. Answers found in films can be shown as film fragments.

In this paper, we first propose a QA system for English/Chinese text, and then extend it to handle some heterogeneous data, including summaries, tabular data, and video programs. The necessary adaptation to deal with these kinds of data is addressed. Section 2 depicts the core QA system; Sections 3, 4, 5, and 6 deal with the individual problems for plain text, summarization, tabular data, and video programs, respectively.

2. The Core QA System

The proposed QA system consists of three modules (Question-Focus Deciding, Question-Processing, and Answer-Finding) and an optional Question-Expansion module, together with an IR system to support the IR task. Its architecture is illustrated in Figure 1.

A question is issued by a user in natural language. The Question-Focus Deciding Module first decides the question focus of this question. “Question focus” here denotes the information of interest that the question requests, such as “person name”, “reason”, etc. Deciding question foci helps us locate answers more precisely. The construction of this module is described in detail in Section 2.1.

In the Question-Processing Module, the question sentence is first word-segmented (if necessary) and POS-tagged. The named entities in the question sentence are also identified. Only named entities, nouns, verbs, adjectives, and adverbs are kept as keywords.

The optional Question-Expansion Module adds synonyms, morphological inflections, abbreviations of organization names, or other information about locations as keywords. Newly added keywords contribute smaller weights than the original keywords.

Moreover, an IR system is employed to retrieve relevant documents in which to search for answers. The advantage of using IR results is that it reduces the number of documents we have to examine. The disadvantage is that the answer texts may not appear in the so-called relevant documents, in which case the answers can never be found. The IR model is described in Section 2.2. The IR results are passed to the Answer-Finding Module for finding answer texts.

The Answer-Finding Module searches each passage in the relevant documents and measures its similarity to the question sentence. If a passage contains more of the information in the question and matches the type of the question focus, then it is more likely to carry the answer information and is ranked higher. Section 2.3 shows in detail how the answers are extracted.

2.1 Question-Focus Deciding Module

The patterns of question sentences differ considerably between Chinese and English. In English, the 5W1H words are the main question words. Patterns containing these question words can be hand-coded (Moldovan, et al., 2000; Singhal, et al., 1999).

In Chinese, we do not yet have such information about question words and question patterns. For example, there are at least three ways to express “what”: “ ”, “ ”, and “ ”. There has been little research on questions in Chinese language processing. Chang (1997) analyzed questions in Chinese, and classified them into seven categories. But her classification was based on the functions of questions in discourse, not on the question foci.

In order to find all the possible question words, question patterns, and their mapping to the question foci, we conducted an experiment. All the questions in Academia Sinica Balanced Corpus (Chen, et al., 1996) were extracted. The question words and question foci were hand-tagged in these 16,851 sentences. We defined nine question foci, and one more category NOFOCUS for the questions which are functionally not requesting information. Appendix A lists the question foci and some examples of the hand-tagged questions.

Question-Focus Decision rules were trained by C4.5 (Quinlan, 1993). Question words and the terms preceding or following them were selected as features. Question words occurring fewer than 4 times and preceding (following) terms occurring fewer than 100 times were discarded during training. We obtained 200 rules with 81.5% correctness. Appendix B lists some of the Question-Focus Decision rules.

2.2 IR system for Question Answering

Our IR system is based on the vector space model (VSM). In English, index terms are words except stop words; in Chinese, index terms are bigrams of Chinese characters and English strings (if any).

After examining some retrieval results, we found it useful to integrate the Boolean model into the IR system for better QA performance. This is because every keyword in the question sentence is equally valuable in the QA task. The more keywords a document contains, the more likely it is that the answer can be found in that document. Therefore, we use the Boolean score as the first sorting key and the VSM score as the second key when ranking relevant documents.
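The two-key ranking can be sketched as follows. The sketch assumes that the Boolean score of a document is the number of question keywords it contains and that the VSM score has been computed elsewhere; both the data layout and the function name `rank_documents` are assumptions of this illustration.

```python
def rank_documents(keywords, documents):
    """Sort documents by Boolean score first and VSM score second.

    documents: list of (doc_id, set of index terms, vsm_score) tuples.
    """
    keyword_set = set(keywords)

    def sort_key(document):
        _doc_id, terms, vsm_score = document
        # Assumed Boolean score: number of question keywords in the document.
        boolean_score = len(keyword_set & terms)
        return (boolean_score, vsm_score)

    ranked = sorted(documents, key=sort_key, reverse=True)
    return [doc_id for doc_id, _terms, _score in ranked]


docs = [
    ("d1", {"denver", "aspen"}, 0.2),
    ("d2", {"denver"}, 0.9),
    ("d3", {"aspen", "denver", "miles"}, 0.5),
]
print(rank_documents(["denver", "aspen", "far"], docs))
```

In this made-up example, d2 has the highest VSM score but is ranked last because it contains fewer question keywords than d1 and d3.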

2.3 Answer-Finding Module

The Answer-Finding Module searches each passage in the relevant documents and measures its similarity to the question sentence. Similarity is defined as the sum of the weights contributed by the matched terms:

$$sim(Q, P) = \sum_{t \in Q \cap P} weight(t) \qquad (1)$$

contribute higher weights than the original question terms, because named entities often carry more information. The expanded terms contribute lower weights to reduce the noise that might be introduced.

A passage can be chosen as a sentence, a meaningful unit that carries the smallest piece of information, or a video segment, depending on the data type being processed. In Sections 3, 4, and 5, “passage” will be defined for the different media, respectively.

The answerable passages were ranked in the order of their similarity scores. In other words, those passages meeting the question focus were reported first.
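The similarity measure and the ranking of passages can be illustrated with a short sketch. It assumes English passages tokenized by a simple regular expression, and the weight values are made up: only their order (named entities above original question terms, expanded terms below) follows the description above. The function names are hypothetical.

```python
import re


def passage_similarity(term_weights, passage):
    """Sum the weights of the question terms that occur in the passage."""
    tokens = set(re.findall(r"\w+", passage.lower()))
    return sum(weight for term, weight in term_weights.items() if term in tokens)


def rank_passages(term_weights, passages):
    """Order passages by decreasing similarity to the question."""
    return sorted(
        passages,
        key=lambda passage: passage_similarity(term_weights, passage),
        reverse=True,
    )


# Hypothetical weights: named entities 2.0, original terms 1.0, expanded terms 0.5.
weights = {"denver": 2.0, "aspen": 2.0, "far": 1.0, "distance": 0.5}
passages = ["The distance is far.", "Aspen is 204 miles from Denver."]
print(rank_passages(weights, passages)[0])
```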

3. QA for Plain Text

The passages selected for plain text are sentences. We made experiments on both English and Chinese documents.

3.1 Experiment on English

In the English experiment, we followed the same setting as the QA Track in TREC-9 (Voorhees, 2000). There are 693 questions to be answered. We gave five 250-byte answers for each question. The metric is MRR (Mean Reciprocal Rank):

$$MRR = \frac{\sum_{i=1}^{N} r_i}{N} \qquad (2)$$

$$r_i = \begin{cases} 1 / rank_i & \text{if } rank_i > 0 \\ 0 & \text{if } rank_i = 0 \end{cases}$$

where rank_i is the rank of the first correct answer of the i-th question (0 if no correct answer is found), and N is the total number of questions. That is, if the first correct answer is at rank 1, the score is 1/1=1; if it is at rank 2, the score is 1/2=0.5, and so on. If no answer is found, the score is 0.
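The MRR computation can be written as a short function. The sketch assumes that a rank of 0 encodes “no correct answer among the returned answers”; the function name is hypothetical and the ranks in the usage line are made up.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from the rank of the first correct answer of each question.

    ranks: one integer per question; 0 means no correct answer was found.
    """
    if not ranks:
        return 0.0
    reciprocal = [1.0 / rank if rank > 0 else 0.0 for rank in ranks]
    return sum(reciprocal) / len(ranks)


# Four made-up questions: correct at rank 1, rank 2, not found, rank 5.
print(mean_reciprocal_rank([1, 2, 0, 5]))
```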

After evaluation by hand, the MRR is 0.348, and no answer was found for 354 (51.9%) questions.

3.2 Experiment on Chinese

In Chinese, the test data were collected from 6 news sites in Taiwan through the Internet. There are 17,877 documents in total (nearly 13MB), dated from January 1, 2001 to January 5, 2001.

In order to compare with the multi-document news summarization work in Section 4, we concatenated the articles on the same news event into one article, and took these event articles as our document collection for the experiment. After clustering, there are 3,146 events.

Questions were formulated by research assistants. We deliberately prepared two kinds of questions to see how these QA models work in different situations. One group consists of questions that are lexically similar to the answer texts, and the other of questions that are not. After filtering out the questions that had no answers in the test collection, there were 127 answer-like questions and 96 not-answer-like questions. Examples of these questions are listed in Appendix C.

We gave five sentences as answers for each question. After evaluation by hand, the MRR is 0.62, and no answer was found for 52 (23.3%) questions. Table 1 compares the two sets of questions.

Table 1. Plain-Text QA Results

Metric       All          Answer-like Questions   Not-Answer-like Questions
MRR          0.6243       0.6790                  0.5519
No Answer    52 (23.3%)   25 (19.7%)              27 (28.1%)

4. QA on Summarization

Summaries are a kind of data that can serve as a knowledge base for question answering. On the Internet, some web sites provide only summaries for retrieval. Besides, search engines return fragments of texts as summaries of the relevant documents. Multi-document summarization is also necessary to reduce users' reading time. Therefore, summaries are a good resource for finding information.

Many papers have touched on single document summarization (Hovy and Marcu, 1998a) and multiple document summarization (Chen and Huang, 1999; Chen and Lin, 2000; Mani and Bloedorn, 1997; Radev and McKeown, 1998). We employed our multi-document summarizer on the news articles describing the same event (i.e., in the same cluster).