The Proposed Methods and Results - Application and Study of Information Extraction Techniques for Biomedical Literature (II)

3.1 The proposed named entity recognition

In this project, we address the recognition of protein entities in the PubMed corpus in order to facilitate the automatic construction of protein interaction databases. To mine more features relevant to protein entities, we assembled a domain-specific protein corpus, SRC (SwissProt_Ref Corpus), extracted from SwissProt reference articles, and tagged it using the SRC entry collection. The core NER task is approached with two empirical strategies. The first is a rule-based strategy that exploits pattern information mined from SRC. Experimental results show that the derived patterns are useful for NER even though they are fewer in number than the rules used in two popular systems, Kex and Yapex. The second is a concise HMM-based strategy with a back-off scheme to overcome data sparseness. Experimental results on both the GENIA corpus and the domain-specific SRC show that this approach achieves a promising 77% F-score under strict annotation, which indicates that the approach is portable and competitive.

In addition, this project addresses the recognition of entities that appear in coordination variants. To resolve such term variants, we present a method based on heuristic rules combined with clustering strategies. Experimental results on GENIA corpus 3.0 demonstrate the feasibility of the proposed approach, with 88.51% recall and 57.04% precision.
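The report gives recall and precision for coordination variants but no combined score. As a quick check, and assuming that "F-score" elsewhere in this report means the balanced F-measure (the harmonic mean of precision and recall), the two figures can be combined as sketched below. The function name `f1_score` is illustrative and not taken from the project.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported above for coordination variants on GENIA 3.0.
f1 = f1_score(precision=0.5704, recall=0.8851)
print(f"F1 = {f1:.4f}")
```

Under that assumption the combined score is roughly 69%, which reflects the gap between the high recall and the much lower precision.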

For a detailed description of the proposed method, please refer to the attached conference paper presented at NLDB 2005, Alicante, Spain.

3.2 The proposed anaphora resolution for biomedical literature

In this project, a resolution procedure, shown in Figure 3.1, is presented for tackling both nominal and pronominal anaphora in biomedical literature by using morphological, syntactic and semantic clues. For nominal anaphora resolution, the semantic association between an anaphor and its antecedents is predicted with semantic lexicons mined from UMLS and WordNet. For unknown entities, the semantic association is discovered by mining the search results returned by PubMed, the search engine for the MEDLINE database. For pronominal anaphora, the semantic type coercion of the pronominal anaphor is carried out with semantically tagged SA/AO patterns, which were collected in advance from the GENIA 3.02p corpus. Instead of choosing by hand the feature set used for salience grading in antecedent selection, the presented resolution selects it with a genetic algorithm. Experimental results on the MedStract evaluation corpus show that the presented resolution is promising, with a 92% F-score for pronominal anaphora and a 78% F-score for nominal anaphora.
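Salience grading for antecedent selection can be pictured as a weighted sum over the clues that fire for each candidate. The following is only a minimal sketch under stated assumptions: the feature names, the weights and the example candidates are hypothetical, and in the report the feature set is selected by a genetic algorithm rather than fixed by hand as here.

```python
from typing import Dict, List

# Hypothetical feature weights, for illustration only.
WEIGHTS: Dict[str, float] = {
    "number_agreement": 2.0,
    "semantic_type_match": 3.0,
    "same_sentence": 1.0,
}


def salience(features: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Sum the weights of the features that fire for one candidate."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def select_antecedent(candidates: List[dict], weights: Dict[str, float]) -> str:
    """Return the text of the candidate with the highest salience score."""
    best = max(candidates, key=lambda c: salience(c["features"], weights))
    return best["text"]


# Made-up candidates for an anaphor such as "this protein".
candidates = [
    {"text": "the patients",
     "features": {"number_agreement": False, "same_sentence": True}},
    {"text": "IL-2",
     "features": {"number_agreement": True, "semantic_type_match": True}},
]
print(select_antecedent(candidates, WEIGHTS))
```

In this toy case the candidate that agrees in number and semantic type outscores the one that is merely closer in the text.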

For a detailed description of the proposed method, please refer to the attached conference paper presented at IJCNLP 2005, Jeju Island, South Korea.

3.3 The proposed relation recognition from biomedical literature

In this project, the interactions between protein pairs are addressed. The SWISS-PROT database is used as our lexicon to identify protein entities in the corpus by a maximum matching procedure. Through corpus preprocessing, protein pairs are formed and then processed by the proposed extraction method. As shown in Figure 3.2, the proposed relation extraction is divided into two stages. In the first stage, a set of predefined patterns mined from the training corpus is employed to recognize relations in the test sentences. In the second stage, a classifier based on the Naive Bayes model classifies each protein pair into one of two classes, “yes” or “no”, by using a rich set of features that are verified with the Chi-Square test. The predefined features are described in detail in TABLE 1.
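To make the feature verification step concrete, the sketch below computes the Pearson chi-square statistic for one binary feature against the two classes. It assumes a plain 2x2 contingency table and a significance level of 0.05; the report does not state which variant of the test or which threshold was used, and the counts are made up.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: feature present, class "yes"    b: feature present, class "no"
    c: feature absent,  class "yes"    d: feature absent,  class "no"
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Critical value of chi-square with 1 degree of freedom at p = 0.05.
CRITICAL_0_05 = 3.841

# Made-up counts for illustration only.
stat = chi_square_2x2(a=30, b=10, c=20, d=40)
print(stat, stat > CRITICAL_0_05)
```

A feature whose statistic exceeds the critical value is statistically associated with the class label and would be kept as a candidate feature.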

To select the best features, we combine the presented classifier with a genetic algorithm. TABLE 2 shows that a 74% F-score is obtained with the selected features, which is indeed better than the result obtained with all features. TABLE 3 shows the impact of each feature on the training data; it reveals that the reference similarity feature plays a critical role in interaction extraction. In addition, the recognition performance is evaluated on two corpora, “Corpus1” and “Corpus2”, with the best feature set selected by the genetic algorithm. (“Corpus1” contains 155 Medline abstracts, and “Corpus2” contains 100 abstracts collected from the references listed in DIP.) The experimental results are displayed in TABLE 4 and TABLE 5, respectively. A 61% F-score is achieved on both corpora, showing that the two-stage method is feasible for relation extraction.
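Feature selection with a genetic algorithm can be sketched as a search over bit masks, one bit per feature. The code below is a generic illustration, not the configuration used in the project: the population size, mutation rate, one-point crossover and the toy fitness function are all assumptions. In practice the fitness would presumably be the classifier's score on held-out data.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def ga_select(n_features: int, fitness: Callable[[Mask], float],
              pop_size: int = 20, generations: int = 30,
              mutation_rate: float = 0.1, seed: int = 0) -> Mask:
    """Return the best feature mask found (needs n_features >= 2)."""
    rng = random.Random(seed)
    # Start from the "all features" mask plus random masks.
    population: List[Mask] = [tuple([1] * n_features)]
    while len(population) < pop_size:
        population.append(tuple(rng.randint(0, 1) for _ in range(n_features)))
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]    # elitism: best half survives
        children: List[Mask] = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randint(1, n_features - 1)   # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            for i in range(n_features):            # bit-flip mutation
                if rng.random() < mutation_rate:
                    child[i] = 1 - child[i]
            children.append(tuple(child))
        population = survivors + children
    return max(population, key=fitness)


# Toy fitness standing in for classifier performance:
# features 0, 2 and 3 help, the others hurt slightly.
USEFUL = {0, 2, 3}


def toy_fitness(mask: Mask) -> float:
    return sum(1.0 if i in USEFUL else -0.5 for i, bit in enumerate(mask) if bit)


best = ga_select(n_features=6, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because the all-features mask is in the initial population and the best half always survives, the returned mask can never score worse than using all features, which mirrors the comparison reported in TABLE 2.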

For a detailed description of the proposed method, please refer to the master's thesis by Hsiao-Ju Shih, Institute of Computer Science and Engineering, National Chiao Tung University, 2006.

3.4 The proposed domain-specific question answering

The proposed QA processing is shown in Fig. 3.3. A given question is first checked to determine whether it is definitional. If it is, the definitional strategy is applied to process the question. Otherwise, a Naïve-Bayes classifier is employed to classify the question into one of three target types. In addition, we use ontology-based expansion to expand the query terms in order to increase recall. Finally, we score the returned texts by considering both TF-IDF and the extracted concept patterns. Details of the implementation steps are described in the following subsections.

3.4.1 Rule-based approach for identifying definitional question

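Among the 910 pairs of collected FAQs, 108 questions were manually classified as definitional. We parse these questions and analyze their sentence structure. 88% of the definitional questions are parsed into one of the following two structures.

Grammar: [Question Word + Be + Noun Phrase]

Question Word: What | Who

Be: is | are | was | were | be

Noun Phrase: ((Term1) (Term2) (Term3)…headword) | ((Term1 (Term2 (Term3 (…)))) headword)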

The headword is the most important word of the noun phrase in the parse tree. We can then use the noun phrase to search for definitions in UMLS. The rules used to recognize definitional questions are as follows:

(i). The length of the POS sequence is less than or equal to four.

(ii). The question has the form [“What or Who” + “be” + NP], that is, it is identified as structure 1 or structure 2.

(iii). The question contains only one NP.

(iv). There are no prepositions in NP.
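Rules (i) to (iv) can be sketched as a small predicate over a chunked question. This is an illustration under several assumptions: the input is already split into chunks by some parser, the chunk labels (`WH`, `BE`, `NP`, `PUNCT`) are hypothetical, rule (i) is read as a limit on the chunk-level sequence, and prepositions are recognised by the Penn Treebank tags `IN` and `TO`.

```python
from typing import List, Tuple

# (chunk label, surface text, POS tags inside the chunk); labels are hypothetical.
Chunk = Tuple[str, str, List[str]]

QUESTION_WORDS = {"what", "who"}
BE_FORMS = {"is", "are", "was", "were", "be"}
PREPOSITION_TAGS = {"IN", "TO"}


def is_definitional(chunks: List[Chunk]) -> bool:
    # Rule (i): the sequence has at most four elements.
    if len(chunks) > 4 or len(chunks) < 3:
        return False
    # Rule (ii): [What|Who] + [be] + NP at the start of the question.
    first, second, third = chunks[:3]
    if first[1].lower() not in QUESTION_WORDS:
        return False
    if second[1].lower() not in BE_FORMS or third[0] != "NP":
        return False
    # Rule (iii): the question contains exactly one NP.
    noun_phrases = [c for c in chunks if c[0] == "NP"]
    if len(noun_phrases) != 1:
        return False
    # Rule (iv): no preposition inside the NP.
    return not PREPOSITION_TAGS.intersection(noun_phrases[0][2])


q1 = [("WH", "What", ["WP"]), ("BE", "is", ["VBZ"]),
      ("NP", "diabetes mellitus", ["NN", "NN"]), ("PUNCT", "?", ["."])]
q2 = [("WH", "What", ["WP"]), ("BE", "is", ["VBZ"]),
      ("NP", "the treatment of asthma", ["DT", "NN", "IN", "NN"]),
      ("PUNCT", "?", ["."])]
print(is_definitional(q1), is_definitional(q2))
```

The second question is rejected by rule (iv) because its noun phrase contains a preposition, so it would be passed on to the question classifier instead.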

In the experiment, we took 40 definitional questions from TREC-9 to evaluate the definitional rules. The rules detected 36 of them, which is an accuracy of 90% on the test data. Some errors resulted from incorrect parse trees or tags.

3.4.2 Naïve-Bayes classifier for classifying questions of other types

A Naïve-Bayes classifier is used to classify the non-definitional questions into three pre-defined types: diagnosis, therapy and etiology. We collect 8,729 medical documents classified by PubMed as the training data, and then filter out stop words and medical proper nouns found in UMLS. The remaining unigrams (single words) and bigrams (two adjacent words) are clustered into 18 groups by a standard K-means algorithm. In addition, we extract the POS sequence from the classified questions and use it as one feature for our classifier.

We follow Bayes' theorem (defined by Equation (1)) to train the question classifier on the gram and POS-sequence features. Each question is assigned exactly one question type. In the testing phase, we randomly take 453 questions from the remaining FAQs. The classifier achieves 85% precision and 86% recall for diagnosis, 84% precision and 94% recall for therapy, and 82% precision and 88% recall for etiology.
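A minimal sketch of such a classifier is given below, assuming a multinomial Naïve-Bayes model with add-one smoothing over unigram, bigram and POS-sequence features. The smoothing choice, the feature encoding and the three toy training questions are assumptions made for illustration, and the K-means grouping of grams used in the project is omitted.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def features(tokens: List[str], pos_tags: List[str]) -> List[str]:
    """Unigrams, bigrams and the whole POS sequence as one feature."""
    unigrams = [f"uni={t}" for t in tokens]
    bigrams = [f"bi={a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams + ["pos=" + "-".join(pos_tags)]


class NaiveBayes:
    def __init__(self) -> None:
        self.class_counts: Counter = Counter()
        self.feat_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: set = set()

    def train(self, examples: List[Tuple[List[str], str]]) -> None:
        for feats, label in examples:
            self.class_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)

    def predict(self, feats: List[str]) -> str:
        total = sum(self.class_counts.values())
        best_label, best_score = "", -math.inf
        for label, count in self.class_counts.items():
            # Add-one smoothing over the feature vocabulary.
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training questions, invented for illustration.
train = [
    (features(["how", "to", "treat", "asthma"], ["WRB", "TO", "VB", "NN"]), "therapy"),
    (features(["what", "causes", "asthma"], ["WP", "VBZ", "NN"]), "etiology"),
    (features(["how", "to", "detect", "asthma"], ["WRB", "TO", "VB", "NN"]), "diagnosis"),
]
clf = NaiveBayes()
clf.train(train)
predicted = clf.predict(features(["what", "causes", "diabetes"], ["WP", "VBZ", "NN"]))
print(predicted)
```

The test question shares the words "what causes" and the POS sequence with the etiology example, so that class receives the highest posterior score.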

3.4.3 Concept identification

Concept identification is performed with the help of UMLS for each medical phrase in the question, so as to transform the NP-Verb-NP pattern into a CVC (concept-verb-concept) pattern. Since UMLS has a multi-node structure, in which one phrase may correspond to several concepts, concept disambiguation is necessary. We use the co-occurrence information in UMLS, and the concept probability function is defined by equation (3). We then use the association function defined by equation (2) to determine which concepts are most likely to be associated in the sentence. The steps of concept identification are summarized as follows.

Algorithm for Concept Identification

IF the question contains only one noun phrase

THEN get all concepts for the noun phrase from UMLS

OTHERWISE

(i). Identify all concepts for the noun phrases

(ii). Calculate the probability of all concepts of the noun phrases according to their co-occurrence in UMLS

(iii). Calculate the association value by equation (2), choose the most probable concept and assign it to the noun phrase

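$$\mathrm{Association}(X_r, Y_h) = P(X_r \mid Y_h) \cdot P(Y_h \mid X_r) \qquad (2)$$

where the probabilities are computed from the following co-occurrence counts in UMLS:

freq(Xr, *): the number of times any concept in UMLS co-occurs with concept Xr

freq(Xr, Yh): the number of times concept Xr co-occurs with concept Yh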
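The disambiguation step can be sketched as choosing the pair of candidate concepts with the highest association value. The code below assumes that the association value is the product of the two conditional probabilities P(Xr | Yh) and P(Yh | Xr), and that each conditional probability is a co-occurrence count divided by freq(·, *); the exact form of equation (3) is an assumption here. The concept names and counts are invented and are not real UMLS data.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical symmetric co-occurrence counts between concepts.
COOC: Dict[frozenset, int] = {
    frozenset({"Cold (temperature)", "Fever"}): 2,
    frozenset({"Common cold", "Fever"}): 40,
    frozenset({"Common cold", "Cough"}): 30,
    frozenset({"Cold (temperature)", "Frostbite"}): 25,
}


def freq(x: str, y: str) -> int:
    """freq(x, y): how often concept x co-occurs with concept y."""
    return COOC.get(frozenset({x, y}), 0)


def freq_any(x: str) -> int:
    """freq(x, *): how often concept x co-occurs with any concept."""
    return sum(n for pair, n in COOC.items() if x in pair)


def cond_prob(y: str, x: str) -> float:
    """P(y | x) = freq(x, y) / freq(x, *)  (assumed form)."""
    total = freq_any(x)
    return freq(x, y) / total if total else 0.0


def association(x: str, y: str) -> float:
    return cond_prob(x, y) * cond_prob(y, x)


def disambiguate(cands_a: List[str], cands_b: List[str]) -> Tuple[str, str]:
    """Pick the concept pair with the highest association value."""
    return max(product(cands_a, cands_b), key=lambda p: association(*p))


pair = disambiguate(["Cold (temperature)", "Common cold"], ["Fever"])
print(pair)
```

With these toy counts the phrase "cold" is resolved to the illness sense because it co-occurs with "Fever" far more often than the temperature sense does.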

Symbols used in Equation (1):

C = {diagnosis, therapy, etiology}

Fi = {unigram, bigram, POS sequence}

Symbols used in equation (4):

freq(Verb, CB) = the co-occurrence count for (Verb, CB)

freq(CA, Verb) = the co-occurrence count for (CA, Verb)

freq(CA, Verb, CB) = the co-occurrence count for (CA, Verb, CB)

The extracted CVC patterns are used to score the answer texts during information retrieval. In the training phase, we use 400 medical terms from UMLS as keywords to query PubMed and collect 8,729 medical abstracts as training material. All noun phrases preceding and following the key verbs in the abstracts are extracted. If a noun phrase is a pronoun, the noun phrase that precedes or follows the pronoun is extracted instead. The noun phrases are then combined with their verb into NP-Verb-NP patterns, which are transformed into CVC patterns.

For the verbs in CVC patterns, we use the WordNet synsets of each verb to cluster the CVC patterns into 4,496 groups, and we then weight each CVC pattern by equation (4).

At run time, we use the CVC pattern extracted from the given question to retrieve the stored CVC patterns from the training result, and use the relevant CVC patterns to score the answer texts returned by the search engine.
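The co-occurrence counts on which the CVC weighting relies can be collected in a single pass over the extracted patterns. The sketch below only gathers freq(CA, Verb, CB), freq(CA, Verb) and freq(Verb, CB); it does not attempt to reproduce the weighting formula of equation (4). The concept labels in the example are made up.

```python
from collections import Counter
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (concept A, verb, concept B)


def count_cvc(triples: Iterable[Triple]) -> Tuple[Counter, Counter, Counter]:
    """Count full patterns and their left and right parts."""
    full, left, right = Counter(), Counter(), Counter()
    for ca, verb, cb in triples:
        full[(ca, verb, cb)] += 1   # freq(CA, Verb, CB)
        left[(ca, verb)] += 1       # freq(CA, Verb)
        right[(verb, cb)] += 1      # freq(Verb, CB)
    return full, left, right


# Made-up CVC patterns for illustration.
patterns = [
    ("Drug", "treat", "Disease"),
    ("Drug", "treat", "Disease"),
    ("Drug", "treat", "Symptom"),
    ("Virus", "cause", "Disease"),
]
full, left, right = count_cvc(patterns)
print(full[("Drug", "treat", "Disease")], left[("Drug", "treat")], right[("treat", "Disease")])
```

Grouping verbs by WordNet synset before counting, as the project does, would merge patterns whose verbs are synonyms into the same keys.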

3.4.4 Ontology-based Query Expansion

The query expansion is done with the synonyms and hierarchical relations in the UMLS Metathesaurus. The expansion strategy is as follows:

For each medical term in the query

(i). Add its synonym variants in UMLS to the query

(ii). Add its parent terms in UMLS to the query

(iii). Add its child terms in UMLS to the query

(iv). Add terms linked by other relations defined in UMLS to the query

3.4.5 Retrieval procedure and ranking

In the proposed QA, we use PubMed as the primary information retrieval platform and Google as the secondary platform. PubMed is queried first to retrieve the relevant medical texts, if any exist. If none are found, Google is queried to retrieve snippets according to the keywords from the given question.

The answer texts are scored by equation (5), which is based on TF-IDF.

Besides the TF-IDF rank, we also compute a rank for each answer text from its CVC patterns, by scoring the degree to which CVC patterns are shared between the question and the answer text.
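One way to picture how the two signals could work together is a weighted sum of a TF-IDF score and the number of shared CVC patterns. This is a sketch under stated assumptions: the TF-IDF form used here (term frequency times log(N / document frequency)) is a common textbook variant and not necessarily equation (5), and the weight `alpha` and the additive combination are hypothetical, since the report does not say how the two ranks are combined.

```python
import math
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]          # (concept A, verb, concept B)
Text = Tuple[List[str], Set[Triple]]   # (tokens, CVC patterns found in the text)


def tf_idf_score(query: List[str], doc: List[str], corpus: List[List[str]]) -> float:
    """Sum over query terms of tf(term, doc) * log(N / df(term))."""
    n_docs = len(corpus)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df:
            score += doc.count(term) * math.log(n_docs / df)
    return score


def rank_texts(query: List[str], question_cvc: Set[Triple],
               texts: List[Text], alpha: float = 1.0) -> List[int]:
    """Return the indices of the texts, best first."""
    corpus = [tokens for tokens, _ in texts]
    scores = [
        tf_idf_score(query, tokens, corpus) + alpha * len(question_cvc & cvcs)
        for tokens, cvcs in texts
    ]
    return sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)


# Made-up answer texts for illustration.
texts: List[Text] = [
    (["aspirin", "is", "a", "drug"], set()),
    (["aspirin", "treats", "headache"], {("Drug", "treat", "Symptom")}),
    (["headache", "is", "common"], set()),
]
order = rank_texts(["aspirin", "headache"], {("Drug", "treat", "Symptom")}, texts)
print(order)
```

The text that both matches the query terms and shares a CVC pattern with the question is ranked first.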

3.4.6 Results and Analysis

Two indicators are used to measure the performance of our method. One is the Mean Reciprocal Rank (MRR); the other is the Human Effort (HE). HE is defined as the rank of the first returned passage in which the user finds the answer.
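Both indicators can be computed from the rank of the first passage that contains the answer. The sketch below uses the standard definition of MRR; for HE it assumes that the per-question ranks are averaged over the answered questions, which is one possible reading of the definition above. The relevance judgments are made up.

```python
from typing import List, Optional


def first_correct_rank(relevance: List[bool]) -> Optional[int]:
    """1-based rank of the first passage that contains the answer, else None."""
    for rank, ok in enumerate(relevance, start=1):
        if ok:
            return rank
    return None


def mean_reciprocal_rank(runs: List[List[bool]]) -> float:
    """Average of 1/rank over questions; unanswered questions contribute 0."""
    total = 0.0
    for rel in runs:
        r = first_correct_rank(rel)
        total += 1.0 / r if r else 0.0
    return total / len(runs)


def human_effort(runs: List[List[bool]]) -> float:
    """Average first-correct rank over answered questions (assumed reading of HE)."""
    ranks = [r for r in map(first_correct_rank, runs) if r is not None]
    return sum(ranks) / len(ranks) if ranks else float("nan")


# One list per question: True marks a returned passage containing the answer.
runs = [
    [True, False],
    [False, True],
    [False, False, False, True],
]
print(mean_reciprocal_rank(runs), human_effort(runs))
```

A higher MRR and a lower HE both mean that the user has to read fewer passages before reaching the answer.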

Table 6 shows the experimental results on 55 questions from the testing corpus; the proposed question classification (QC), query expansion (QE) and CVC pattern ranking indeed improve the QA performance. Table 7 shows the experimental results on 203 set-aside FAQ questions of different types. Table 8 shows the experimental results grouped by interrogative word. Table 9 shows the results in terms of Human Effort (HE): the answer passage appears within the top 2 (or top 3) of the texts returned by the proposed QA.

Some errors are attributed to the following causes:

(1) Incorrect POS tagging.

(2) Assignment of the wrong category to the given question.

(3) Assignment of an inappropriate concept to a noun phrase.

For a detailed description of the proposed method, please refer to the master's thesis by Li-Hong Huang, Institute of Computer Science and Engineering, National Chiao Tung University, 2006.
