Related Work - Ontology-Based Medical Question Answering System

Many studies (Zhang et al. 2004; Niu et al. 2004; Soo et al. 2004; Wu et al. 2005) on specific-domain QA have been reported during the last decade. Specific-domain QA is usually divided into four steps: utilization of the domain ontology, question processing, document retrieval, and answer processing. The domain ontology is the knowledge source for specific-domain QA, and the strategy for extracting relevant information with it is the most important consideration. Zhang et al. (2004) use the concepts of the ontology to tag the question and the documents in order to measure the similarity between them. Niu et al. (2004) treat the ontology as a source of keyword expansion for the question in order to gain more information. Combining the ontology with Web resources is another trend in specific-domain QA. The system proposed by Soo et al. (2004) automatically integrates biological literature from the Web into the ontology. Wu et al. (2005) use medical FAQs from the Web as the data source for a medical QA system. In this thesis, we consider how to use the concepts of the ontology together with medical resources, i.e. medical FAQs and literature, to handle medical questions in question processing and document retrieval.

For question processing, most specific-domain QA systems adopt question classification as an essential component, because different types of questions call for different processing strategies. Researchers classify questions by identifying the format of the answer, such as the Yes/No format (Wu et al. 2005), the Description format (Wu et al. 2005; Zhang et al. 2004), and the NE (named entity) format (Zhang et al. 2004). Apart from question classification, how information is extracted from the given question by using the ontology is an important factor in the performance of document retrieval. In our study, the concept information and the syntactic relations in the given question are used to make document retrieval more effective. However, concept ambiguity arises during this processing. Navigli et al. (2005) provide a knowledge-based approach to word sense disambiguation. They propose the structural semantic interconnections (SSI) algorithm, which connects related senses in the form of a network; the relations in the network are those defined in WordNet. In our study, the frequency of co-occurrence in UMLS is used to identify the concept.
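The idea of choosing a concept by co-occurrence frequency can be sketched as follows. This is only a minimal illustration: the concept names and counts are invented for the example and are not taken from UMLS, and the helper names `cooccurrence` and `pick_concept` are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical co-occurrence counts between concept pairs.
# These numbers are illustrative only; they are NOT real UMLS data.
COOCCURRENCE: Dict[FrozenSet[str], int] = {
    frozenset({"cold_disease", "fever"}): 120,
    frozenset({"cold_temperature", "fever"}): 3,
    frozenset({"cold_disease", "cough"}): 95,
    frozenset({"cold_temperature", "cough"}): 1,
}


def cooccurrence(a: str, b: str) -> int:
    """Look up the (symmetric) co-occurrence count of two concepts."""
    return COOCCURRENCE.get(frozenset({a, b}), 0)


def pick_concept(candidates: List[str], context: List[str]) -> str:
    """Pick the candidate concept that co-occurs most often with the
    other concepts already identified in the question."""
    return max(
        candidates,
        key=lambda c: sum(cooccurrence(c, ctx) for ctx in context),
    )


# The ambiguous word "cold" next to "fever" and "cough".
best = pick_concept(["cold_temperature", "cold_disease"], ["fever", "cough"])
```

With these toy counts the disease reading is selected, because it co-occurs far more often with the surrounding concepts than the temperature reading does.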

For document retrieval, Zhang et al. (2004) use the Okapi function to score the question concepts and keywords when retrieving documents. Niu et al. (2004) show that the role information in the given question and documents is an important clue for matching relevant documents. Query expansion can also improve document retrieval performance. However, as Moldovan et al. (2002) note, 70% of the errors in handling QA are attributed to question classification, keyword selection, and query expansion, so making query expansion effective is important for document retrieval. Wang et al. (2004) propose Web-based unsupervised learning to transform question terms. They collect QA pairs from Quiz-Zone as a training corpus and align the question terms with the bigrams of answer passages returned by Google. The question terms include the question keywords and the question patterns extracted by rules. The authors calculate the logarithmic likelihood ratio (LLR) between question terms and bigrams of answer passages and choose the top-ranked bigram for each question term. These bigrams are taken as the transformations of the question terms. The experiments report an MRR of 0.69 when the search engine retrieves with the keywords and their expansions. However, the method still suffers from a sparseness problem. Hersh et al. (2000), on the other hand, use the relations in the UMLS Metathesaurus to expand the query. The relations include synonym, parent, child, and other relations. The authors consider the hierarchical relations an important clue for improving document retrieval performance.
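To make the LLR-based selection of a transformation bigram described for Wang et al. (2004) more concrete, the following is a minimal sketch. It assumes the standard log-likelihood ratio statistic over a 2x2 contingency table of QA-pair counts; the exact formulation used by the cited authors is not given here, and the counts, the example term and bigrams, and the function names are all hypothetical.

```python
import math
from typing import Dict, Tuple


def _xlogx(x: float) -> float:
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: pairs with both the question term and the bigram
    k12: pairs with the term but not the bigram
    k21: pairs with the bigram but not the term
    k22: pairs with neither
    """
    n = k11 + k12 + k21 + k22
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


def best_bigram(
    term: str,
    pair_counts: Dict[Tuple[str, str], int],
    term_counts: Dict[str, int],
    bigram_counts: Dict[str, int],
    total: int,
) -> str:
    """Return the answer-passage bigram with the highest LLR for a term."""

    def score(bigram: str) -> float:
        k11 = pair_counts.get((term, bigram), 0)
        k12 = term_counts[term] - k11
        k21 = bigram_counts[bigram] - k11
        k22 = total - k11 - k12 - k21
        return llr(k11, k12, k21, k22)

    return max(bigram_counts, key=score)


# Toy counts over 100 hypothetical QA pairs (illustrative only).
TOTAL_PAIRS = 100
term_counts = {"who invented": 20}
bigram_counts = {"was invented": 25, "of the": 60}
pair_counts = {
    ("who invented", "was invented"): 18,
    ("who invented", "of the"): 12,
}
chosen = best_bigram("who invented", pair_counts, term_counts,
                     bigram_counts, TOTAL_PAIRS)
```

In this toy data the frequent bigram "of the" co-occurs with the term exactly as often as chance predicts, so its LLR is zero, while "was invented" is strongly associated and is chosen. Note that LLR measures the strength of association regardless of direction, so a fuller version would also check that the observed co-occurrence exceeds the expected one.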

Zhang et al. (2004) constructed a specific-domain QA system on top of an ontology whose concepts are connected by links in the form of a network. The ontology they use is the Canadian Thesaurus of Construction Science and Technology. The authors tag the terms in documents collected from the Web with their parent categories according to the concepts of the ontology. For an input question, the system extracts the headword by identifying the first head noun in the question and tags the category of the identified word. The given questions are classified into four classes: definition, named entity, category, and keyword type. The authors use the Okapi function to weight the keywords in the passages of the documents, and they match categories between a passage and the given question by counting the categories they have in common. Finally, they combine the keyword weight and the number of shared categories with a linear function to rank the candidate passages. The results show an MRR of 0.6545, an improvement of 7.19% in performance. However, because the system does not adopt query expansion, recall in the IR module is reduced.
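The ranking scheme described above, a keyword weight combined linearly with a count of shared categories, might look roughly like the sketch below. It assumes the common Okapi BM25 form for the keyword weight and a hypothetical mixing parameter `alpha`; the actual Okapi variant and coefficients used by Zhang et al. are not stated in this section, and the toy corpus and category labels are invented.

```python
import math
from collections import Counter
from typing import List, Set


def bm25_score(query: List[str], passage: List[str],
               corpus: List[List[str]],
               k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi BM25 score of one tokenized passage for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(passage)
    score = 0.0
    for term in set(query):
        df = sum(1 for doc in corpus if term in doc)
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed IDF that stays positive for very common terms.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(passage) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def combined_score(query: List[str], query_cats: Set[str],
                   passage: List[str], passage_cats: Set[str],
                   corpus: List[List[str]], alpha: float = 0.7) -> float:
    """Linear combination of keyword weight and shared-category count.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    shared = len(query_cats & passage_cats)
    return alpha * bm25_score(query, passage, corpus) + (1 - alpha) * shared


# Toy passages and ontology categories (illustrative only).
corpus = [
    ["concrete", "curing", "time"],
    ["steel", "beam", "load"],
    ["concrete", "mix", "ratio"],
]
query = ["concrete", "curing"]
with_category = combined_score(query, {"materials"}, corpus[0],
                               {"materials"}, corpus)
without_category = combined_score(query, {"materials"}, corpus[0],
                                  {"structures"}, corpus)
```

A passage that shares a category with the question receives a higher combined score than the same passage tagged with an unrelated category, which is the effect the linear combination is meant to achieve.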

The Medical Question Answering system (MQA) is presented by Niu et al. (2004). It uses the PICO format presented by Straus et al. (2000) to handle the given medical question, and WordNet⁶ and UMLS as the knowledge bases. WordNet is used for common keyword expansion and UMLS for domain-specific keyword expansion. The roles of the PICO format are extracted from the questions. The authors match the roles between the question and the medical documents in order to locate possible answers. The PICO format is considered important information in medical texts because the roles in the format construct the meaning of the text.

6 WordNet http://wordnet.princeton.edu/

Additionally, the ontology is the most important resource in most specific-domain QA systems. Soo et al. (2004) propose an agent that extracts knowledge from biological literature. The authors integrate knowledge resources such as WordNet, MeSH⁷ (Medical Subject Headings), and GO⁸ (Gene Ontology), and develop a system that automatically performs semantic annotation of the biological literature in order to enrich the domain knowledge in the ontology. For an input query, the system infers the answers by using pattern matching and sentence parsing. The evaluation shows that, compared with keyword-based search, ontology-based knowledge extraction improves recall from 48.1% to 85.2% and precision from 61.9% to 74.2%.

FAQs are also considered good material for constructing a medical QA system because the answers are maintained by domain experts. Wu et al. (2005) use an FAQ retrieval system to collect medical FAQ pairs and adopt the medical ontology proposed by Yeh et al. (2004). The structure of the ontology is based on WordNet and HowNet⁹. The authors divide each topic into two parts: the question part and the answer part. Three aspects are investigated separately for the question part, i.e. the question stem for the interrogative word, the distance between keyword concepts in the ontology, and the vector space representation of the FAQ questions and the input query. Two aspects are investigated separately for the answer part, i.e. the relations and the paragraph cluster. The relations in the ontology are identified for the answers of the FAQ.

7 Medical Subject Headings (MeSH) http://www.nlm.nih.gov/mesh/meshhome.html

8 Gene Ontology (GO) http://www.geneontology.org/

9 HowNet http://www.keenage.com/

For the paragraph cluster, Wu et al. (2005) segment the FAQ answers into paragraphs and cluster them by using latent semantic analysis (LSA) and the K-means algorithm. The authors calculate the similarity for each aspect with a conditional probability function and combine those values with a probabilistic mixture model. The EM algorithm is employed to optimize the mixing weights in the model. The answer formats are classified into three groups. The Set type means that the answer to the given question is an enumeration. The Description type is an explanation for the given question. The Boolean type corresponds to Yes/No questions. The experimental results in terms of the 11-AvgP metric are 0.6643 for the Boolean type, 0.6732 for the Set type, and 0.6327 for the Description type.
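The use of EM to optimize mixing weights can be illustrated with a generic sketch. It assumes that each training instance already has a likelihood under each aspect (component) and that only the mixing weights are learned; the specific model of Wu et al. is not reproduced here, and the numbers and the function name `em_mixing_weights` are hypothetical.

```python
from typing import List


def em_mixing_weights(likelihoods: List[List[float]],
                      iterations: int = 50) -> List[float]:
    """Estimate mixture weights by EM with fixed component likelihoods.

    likelihoods[i][k] is the likelihood of training instance i under
    component k. Every row is assumed to contain at least one positive
    value so that the mixture probability is never zero.
    """
    n_components = len(likelihoods[0])
    weights = [1.0 / n_components] * n_components
    for _ in range(iterations):
        totals = [0.0] * n_components
        for row in likelihoods:
            mix = sum(w * p for w, p in zip(weights, row))
            # E-step: responsibility of each component for this instance.
            for k, (w, p) in enumerate(zip(weights, row)):
                totals[k] += w * p / mix
        # M-step: new weight is the average responsibility.
        weights = [t / len(likelihoods) for t in totals]
    return weights


# Three instances scored by three hypothetical aspects (toy values).
toy_likelihoods = [
    [0.9, 0.1, 0.2],
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.2],
]
weights = em_mixing_weights(toy_likelihoods)
```

Because the responsibilities of each instance sum to one, the weights remain a valid probability distribution after every iteration, and the component that explains the training instances best receives the largest weight.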

For answering definitional questions, Hovy et al. (2001) use WordNet to assist the QA system. More recently, Xu et al. (2004) consider linguistic features to be important clues for extracting definitions from documents. With the growth of the Web, Hildebrandt et al. (2004) use surface patterns to collect definitions from the Web and integrate them into a knowledge database in order to answer this type of question. In this thesis, we use the definition database from UMLS to answer definitional questions. If a definition is not found there, an online dictionary is queried to answer the question, and the result is added to the definition database at the same time.

Xu et al. (2004) use linguistic features to extract definitional information from documents. They use five types of ranked features to handle definitional questions, in the following order: appositives, copulas, structured patterns, relations, and propositions. They also establish a question profile for definitions from many sources, such as WordNet glossaries, the Merriam-Webster dictionary, the Columbia Encyclopedia, and Google. They calculate the similarity of the given question according to the question profile, and the top ten features are selected for the given question by using the similarity score. The five ranked features and the top ten features are used to extract definitions from the documents. The experiment reports an F-score of 0.555.

Hildebrandt et al. (2004), on the other hand, answer definitional questions by using multiple knowledge sources on the Web. They collect definitional answers by using surface patterns and normalize them into a database. If the answer cannot be found in the collected data, the question is converted into a string and used to query an online dictionary or a document retriever. In our study, we first detect definitional questions by using simple patterns and use the UMLS ontology to answer this type of question. If the definition is not found in UMLS, we convert the question into a single noun phrase and retrieve the definition from a Web dictionary.
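The detect-then-look-up flow for definitional questions could be sketched as follows. The patterns, the in-memory dictionary standing in for the UMLS definition database, the stand-in for the Web dictionary, and the sample definition texts are all assumptions made for illustration; they are not the patterns or data used in the thesis.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical surface patterns for definitional questions.
DEFINITION_PATTERNS = [
    re.compile(r"^what\s+(?:is|are)\s+(?:an?\s+|the\s+)?(?P<term>.+?)\??$", re.I),
    re.compile(r"^define\s+(?P<term>.+?)\??$", re.I),
    re.compile(r"^what\s+does\s+(?P<term>.+?)\s+mean\??$", re.I),
]


def extract_term(question: str) -> Optional[str]:
    """Return the noun phrase being asked about, or None if the
    question does not look definitional."""
    text = question.strip()
    for pattern in DEFINITION_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("term").strip().lower()
    return None


def answer_definition(question: str,
                      definitions: Dict[str, str],
                      web_lookup: Callable[[str], Optional[str]]
                      ) -> Optional[str]:
    """Answer from the local definition store, falling back to an
    external dictionary and caching whatever it returns."""
    term = extract_term(question)
    if term is None:
        return None
    if term in definitions:
        return definitions[term]
    found = web_lookup(term)
    if found is not None:
        definitions[term] = found  # expand the definition database
    return found


# Stand-ins for the UMLS definition database and a Web dictionary.
local_defs = {"hypertension": "Persistently high arterial blood pressure."}
fake_web = {"aneurysm": "A localized dilation of a blood vessel wall."}
answer = answer_definition("What is an aneurysm?", local_defs, fake_web.get)
```

After the call, the fallback definition has been added to the local store, so asking the same question again no longer needs the external lookup.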

In relation to this prior work on specific-domain QA, we focus on two problems: converting the given question into syntactic relations with concept identification by using UMLS, and integrating the medical literature from PubMed as the document source, so that relevant passages or documents can be matched by mixing the TF-IDF score with the Concept-Verb-Concept score.
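As a rough illustration of mixing a TF-IDF score with a Concept-Verb-Concept score, consider the sketch below. This section does not define either score precisely, so everything here is an assumption: the Concept-Verb-Concept score is taken to be the fraction of question triples found in the passage, the triples are supplied pre-extracted, and the mixing weight `lam` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (concept, verb, concept)


def tfidf_score(query: List[str], passage: List[str],
                corpus: List[List[str]]) -> float:
    """Sum of TF-IDF weights of the query terms in one passage."""
    if not passage:
        return 0.0
    tf = Counter(passage)
    score = 0.0
    for term in set(query):
        df = sum(1 for doc in corpus if term in doc)
        if df:
            score += (tf[term] / len(passage)) * math.log(len(corpus) / df)
    return score


def cvc_score(question_triples: Set[Triple],
              passage_triples: Set[Triple]) -> float:
    """Fraction of question triples that also occur in the passage
    (an assumed definition, used here only for illustration)."""
    if not question_triples:
        return 0.0
    return len(question_triples & passage_triples) / len(question_triples)


def mixed_score(query: List[str], question_triples: Set[Triple],
                passage: List[str], passage_triples: Set[Triple],
                corpus: List[List[str]], lam: float = 0.5) -> float:
    """Weighted mix of the two scores; lam is a hypothetical weight."""
    return (lam * tfidf_score(query, passage, corpus)
            + (1 - lam) * cvc_score(question_triples, passage_triples))


# Two passages with the same words but different relations (toy data).
passage_a = ["aspirin", "reduce", "fever"]
passage_b = ["fever", "reduce", "aspirin"]
corpus = [passage_a, passage_b, ["insulin", "treat", "diabetes"]]
query = ["aspirin", "reduce", "fever"]
q_triples = {("aspirin", "reduce", "fever")}
score_a = mixed_score(query, q_triples, passage_a,
                      {("aspirin", "reduce", "fever")}, corpus)
score_b = mixed_score(query, q_triples, passage_b,
                      {("fever", "reduce", "aspirin")}, corpus)
```

The two toy passages are indistinguishable by TF-IDF alone, since they contain the same words; the relation-level score is what separates the passage that expresses the same relation as the question from the one that reverses it.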
