

Chapter 4. Experiments and Analysis

4.2 Performance of Medical Question Answering

For the Concept-Verb-Concept (CVC) patterns, we take 8,729 medical abstracts from PubMed and extract the patterns using UMLS for concept identification. We obtain 951,678 distinct patterns from this training set. The details of the training results are described in Table 14.

Table 14. Training result for Concept-Verb-Concept patterns

Data source   Number of Abstracts   Distinct CVC Patterns   Clusters
PubMed        8,729                 951,678                 4,496
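
As an illustration of the pattern extraction step, the following sketch pulls a Concept-Verb-Concept triple out of a single abstract sentence. The UMLS_CONCEPTS dictionary and the VERBS set are toy stand-ins for UMLS concept identification and verb detection; they are not part of the actual system.

```python
import re

# Toy stand-in for UMLS concept identification: maps a term to a semantic type.
UMLS_CONCEPTS = {
    "aspirin": "Pharmacologic Substance",
    "fever": "Sign or Symptom",
    "sars": "Disease or Syndrome",
}

# Toy verb list; the real pipeline relies on POS tagging to find verbs.
VERBS = {"reduces", "causes", "treats", "increases"}

def to_concept(token):
    """Map a surface token to its UMLS semantic type, if known."""
    return UMLS_CONCEPTS.get(token)

def extract_cvc(sentence):
    """Extract Concept-Verb-Concept triples from one sentence (toy version)."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    patterns = []
    for i, tok in enumerate(tokens):
        if tok in VERBS and 0 < i < len(tokens) - 1:
            left, right = to_concept(tokens[i - 1]), to_concept(tokens[i + 1])
            if left and right:
                patterns.append((left, tok, right))
    return patterns

print(extract_cvc("Aspirin reduces fever in adult patients."))
# [('Pharmacologic Substance', 'reduces', 'Sign or Symptom')]
```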

We want to evaluate each component of our method. The precision of the question classifier is important because our strategy retrieves relevant documents according to the question type, and the strategy differs for each type. For definitional questions, we use a rule-based method to detect them and assign the tag “definition”. For the other types, the classifier assigns the tag according to the probability of the n-grams, clustered by the K-means algorithm, and the POS sequence for each type.
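
A minimal sketch of this two-stage classification, assuming hand-picked definition rules and illustrative feature weights (FEATURE_PROBS) in place of the K-means-clustered n-gram and POS statistics used in the thesis:

```python
import re

# Illustrative feature weights per question type; in the thesis these come
# from n-grams clustered by K-means and from POS sequences.
FEATURE_PROBS = {
    "diagnosis": {"symptom of": 0.8, "how do i know": 0.7},
    "therapy":   {"how to treat": 0.9, "treatment for": 0.8},
    "etiology":  {"what causes": 0.9, "why do": 0.6},
}

# Simple rules for definitional questions (assumed, not the thesis's rules).
DEFINITION_RULES = [r"^what is (a |an |the )?\w+\??$", r"^define\b"]

def classify_question(question):
    q = question.lower().strip()
    # Stage 1: rule-based detection of definitional questions.
    if any(re.search(rule, q) for rule in DEFINITION_RULES):
        return "definition"
    # Stage 2: assign the type whose features best match the question.
    scores = {
        qtype: sum(p for feat, p in feats.items() if feat in q)
        for qtype, feats in FEATURE_PROBS.items()
    }
    return max(scores, key=scores.get)

print(classify_question("What is SARS?"))            # definition
print(classify_question("What causes high fever?"))  # etiology
```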

We divide the method into three components: Question Classifier (QC), Query Expansion (QE), and Concept-Verb-Concept scoring (CVC). We take 55 questions from the testing corpus to evaluate each component. The contribution of each component can be seen in Table 15.

Table 15. MRR of each component

Components    MRR
QC+QE+CVC     0.63
QC+QE         0.58
QC+CVC        0.57

In Table 15, query expansion improves the MRR by about 0.06, which is consistent with other research on question answering (Wang et al. 2004; Niu et al. 2004). Query expansion supplies important patterns for retrieving more relevant documents for the given question. The CVC patterns improve the MRR by about 0.05. The main idea of the CVC patterns is to extract implicit medical information in the form of patterns using UMLS. The results show that domain-specific knowledge helps improve QA performance.
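
For reference, MRR is the mean of the reciprocal rank of the first correct answer over all evaluated questions. The sketch below is a generic implementation of the metric (not the thesis's evaluation script); a question with no correct answer in the returned list contributes zero.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """MRR over a list of per-question ranks; None means no correct answer."""
    scores = [1.0 / r if r else 0.0 for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Example: five questions answered at ranks 1, 2, 1, miss, 3.
print(round(mean_reciprocal_rank([1, 2, 1, None, 3]), 2))  # 0.57
```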

Table 16. MRR for each semantic categorization

            Number of Questions   MRR
Diagnosis   103                   0.62
Therapy     45                    0.67
Etiology    55                    0.62

We also take 203 FAQ questions, set aside from the collected FAQs, to evaluate the full method (QC+QE+CVC) by question type. The experimental results are shown in Table 16: the MRR is 0.62 for Diagnosis, 0.67 for Therapy, and 0.62 for Etiology. The Therapy type performs better than the other two types, because diagnostic and causal conditions are similar in many cases and the information in the given question is often insufficient for QA.

Additionally, we classify the testing questions using only the interrogative words, such as what, where, when, who, why, and how. This evaluation analyzes the intention of a question simply from its interrogative word. The results are shown in Table 17. The “when” type achieves an MRR of only 0.54: the medical literature usually contains little information relevant to “when” questions, which lowers the MRR for this type.

Table 17. MRR for the interrogative words

                      What   When   Who    Where   Why    How
Number of Questions   78     8      13     11      5      88
MRR                   0.63   0.54   0.65   0.64    0.66   0.64

For factoid questions, the interrogative word in the question can determine the answer type. From our observation, some medical questions are factoid.

For example, consider the question “What is the mortality rate of SARS?” To evaluate our method on factoid questions, we take 25 medical questions rewritten from TREC-8. The results are described in Table 18.

Table 18. MRR for TREC-factoid questions

Number of Questions   MRR
25                    0.62
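
As a purely hypothetical illustration of how an interrogative word could be mapped to an expected answer type for factoid questions (the mapping below is assumed, not taken from the thesis):

```python
# Assumed mapping from interrogative word to expected answer type.
ANSWER_TYPES = {
    "what": "definition or quantity",
    "when": "date or time",
    "who": "person or organization",
    "where": "location",
    "why": "reason",
    "how": "procedure or quantity",
}

def expected_answer_type(question):
    """Guess the answer type from the question's first word."""
    first_word = question.lower().split()[0]
    return ANSWER_TYPES.get(first_word, "unknown")

print(expected_answer_type("What is the mortality rate of SARS?"))
# definition or quantity
```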

For the information retrieval module, PubMed and Google are the document sources of our method. We count, for each question, whether the answering document comes from PubMed or from Google, and we also calculate the MRR for each source. The results are shown in Table 19.

Table 19. Result for each document source

                      PubMed   Google
Number of Questions   54       149
Percentage            27%      73%
MRR                   0.53     0.66

Another indicator is human effort (HE). We record the rank of the answer among the top five returned texts and compute the average human effort for each experiment, grouping the questions by question type. The results are given in Table 20; on average, users find the answer passage within the top two or three returned texts. Figure 5 plots recall over all types at the top five passages, reaching 79% recall at the fifth returned text. Figure 6 shows how recall grows with rank for each question type, reaching 79% for Diagnosis, 80% for Therapy, and 80% for Etiology at the top five returned texts.

Table 20. Human effort for each question type

Rank              Diagnosis   Therapy   Etiology   All Types
1                 48          24        27         99
2                 19          9         6          34
3                 9           3         6          18
4                 3           0         2          5
5                 3           0         3          6
No Answer         21          9         11         41
# of questions    103         45        55         203
HE per question   2.58        2.33      2.65       2.55
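
The human-effort and recall figures reported here can be reproduced from the rank counts in Table 20 if an unanswered question is assumed to count as rank 6 (one position past the last returned text); the sketch below makes that assumption explicit.

```python
def human_effort(rank_counts, no_answer, no_answer_rank=6):
    """Average rank per question; unanswered questions are assumed to count
    as rank 6 (one position past the last returned text)."""
    total = sum(rank_counts.values()) + no_answer
    weighted = sum(rank * count for rank, count in rank_counts.items())
    return (weighted + no_answer_rank * no_answer) / total

def recall_at(k, rank_counts, no_answer):
    """Fraction of questions answered within the top-k returned texts."""
    total = sum(rank_counts.values()) + no_answer
    answered = sum(count for rank, count in rank_counts.items() if rank <= k)
    return answered / total

# Diagnosis column of Table 20.
diagnosis = {1: 48, 2: 19, 3: 9, 4: 3, 5: 3}
print(round(human_effort(diagnosis, no_answer=21), 2))  # 2.58
print(round(recall_at(5, diagnosis, no_answer=21), 2))  # 0.8 (about 79%)
```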

Figure 5. Recall at the top five returned passages (all question types)

Figure 6. Recall at the top five returned passages for each question type (Diagnosis, Therapy, Etiology)

From our observations of the experiments, several factors decrease the performance:

• Incorrect POS tagging.

• Assigning the wrong category to the given question.

• Assigning a concept to a noun phrase that is not sufficient to explain its meaning.

According to the experimental results, we find that the knowledge in the ontology indeed improves QA performance in a specific domain. For the CVC patterns, we also use hypernym concepts to expand the concepts of the CVC pattern derived from the given question, so that more implicit medical information can be extracted; the experimental results show that this idea improves performance. On the other hand, query expansion using the relations in UMLS works effectively to retrieve more relevant documents from the search engines. We integrate medical resources from the Web, such as online medical literature and UMLS resources, into question answering. Natural language processing and information retrieval techniques are the keys to integrating them so that users can obtain answers from the huge amount of data.
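
A small sketch of the hypernym-based concept expansion idea for a CVC pattern; the HYPERNYMS dictionary is a toy stand-in for the UMLS concept hierarchy, and the function names are illustrative.

```python
# Toy stand-in for the UMLS concept hierarchy (child -> parent).
HYPERNYMS = {
    "Antibiotic": "Pharmacologic Substance",
    "Pharmacologic Substance": "Chemical",
    "Sign or Symptom": "Finding",
}

def expand_concepts(concept, max_levels=1):
    """Return the concept plus up to max_levels of hypernym ancestors."""
    expanded, current = [concept], concept
    for _ in range(max_levels):
        parent = HYPERNYMS.get(current)
        if parent is None:
            break
        expanded.append(parent)
        current = parent
    return expanded

def expand_cvc(pattern, max_levels=1):
    """Expand both concepts of a CVC pattern so more stored patterns match."""
    left, verb, right = pattern
    return [(l, verb, r)
            for l in expand_concepts(left, max_levels)
            for r in expand_concepts(right, max_levels)]

for p in expand_cvc(("Antibiotic", "treats", "Sign or Symptom")):
    print(p)
# ('Antibiotic', 'treats', 'Sign or Symptom')
# ('Antibiotic', 'treats', 'Finding')
# ('Pharmacologic Substance', 'treats', 'Sign or Symptom')
# ('Pharmacologic Substance', 'treats', 'Finding')
```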
