Table 12 summarizes the results of enhancing the performance of protein and gene name recognizers with the filtering and integration strategies. We propose a fully automatic method for mining collocates from scientific texts in the protein and gene domains, and use the extracted collocates to improve the precision of protein/gene name recognition. When collocates are incorporated, the precision of Yapex increases from 70.90% to 85.84% at a small cost in recall (a decrease of only 2.44 percentage points, from 69.53% to 67.09%). When the integration-only approach is adopted (i.e., -filtering, +integration), the F-score of the Yapex-based (ABGene-based) integration is lower than that of the filtering-only approach (i.e., +filtering, -integration). This shows that collocation learning is useful, and that the benefit of integration depends on the performance of the individual NE recognizers. When the filtering and integration strategies are employed together (i.e., +filtering, +integration), the Yapex-based integration with KeX achieves a 7.83-point F-score increase over the pure Yapex method (i.e., -filtering, -integration). The ABGene-based integration with Idgene shows a 10.18-point F-score increase over the pure ABGene method.
Table 12. Summary of experimental results for enhancing the performance of protein and gene name recognizers

Strategy             -filtering     +filtering     -filtering     +filtering
                     -integration   -integration   +integration   +integration
Protein  Precision   70.90%         85.84%         61.98%         79.40%
Protein  Recall      69.53%         67.09%         77.52%         76.69%
Protein  F-Score     70.22%         76.47%         69.75%         78.05%
Gene     Precision   55.87%         70.08%         54.29%         69.99%
Gene     Recall      74.56%         71.89%         84.47%         80.81%
Gene     F-Score     65.22%         70.99%         69.38%         75.40%
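The gains quoted for Table 12 can be rechecked with a few lines of arithmetic. The sketch below copies the table values and recomputes the F-score differences. It also computes two ways of combining precision and recall: the reported F-score column agrees, to rounding, with their arithmetic mean, whereas the harmonic mean (the usual F1) gives slightly lower values. This is an observation from the numbers only; the function names are illustrative and not part of the original systems.

```python
# Consistency check on Table 12 (values copied from the table).
# Each entry: (precision, recall, reported F-score), all in percent.
TABLE_12 = {
    ("protein", "-filtering, -integration"): (70.90, 69.53, 70.22),
    ("protein", "+filtering, -integration"): (85.84, 67.09, 76.47),
    ("protein", "-filtering, +integration"): (61.98, 77.52, 69.75),
    ("protein", "+filtering, +integration"): (79.40, 76.69, 78.05),
    ("gene", "-filtering, -integration"): (55.87, 74.56, 65.22),
    ("gene", "+filtering, -integration"): (70.08, 71.89, 70.99),
    ("gene", "-filtering, +integration"): (54.29, 84.47, 69.38),
    ("gene", "+filtering, +integration"): (69.99, 80.81, 75.40),
}


def harmonic_f(precision, recall):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def arithmetic_mean(precision, recall):
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def gain(entity, strategy, baseline="-filtering, -integration"):
    """Difference in reported F-score, in percentage points, over the baseline."""
    return TABLE_12[(entity, strategy)][2] - TABLE_12[(entity, baseline)][2]


if __name__ == "__main__":
    for (entity, strategy), (p, r, reported) in TABLE_12.items():
        print(f"{entity:8s} {strategy:26s} reported={reported:.2f} "
              f"mean={arithmetic_mean(p, r):.3f} harmonic={harmonic_f(p, r):.3f}")
    print("protein gain:", round(gain("protein", "+filtering, +integration"), 2))
    print("gene gain:   ", round(gain("gene", "+filtering, +integration"), 2))
```

The two gains printed at the end correspond to the 7.83-point and 10.18-point increases quoted in the text.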
The main benefits of our method are: (1) the collocates used in the filtering strategies are derived from the training corpus rather than from intuition, which yields a more complete set than one identified by human experts; and (2) the combination of the filtering and integration strategies performs better than the original protein/gene name taggers. The main drawback of our method is that it cannot solve the problem of false negatives; further linguistic techniques need to be investigated to recover them. In addition, the performance of the integration strategies relies on the performance of the selected taggers, as shown in Table 12.
This tendency is consistent across gene and protein named entity extraction. We expect that the methodologies can be easily extended to other domains, such as drugs and diseases; this will be verified in future work. The protein (or gene) collocates extracted from the domain corpus are also important keywords for pathway discovery, so a systematic path from basic named entity recognition to the discovery of complex relationships can be explored.
Although relation extraction involves more complex issues, such as related objects, pathway direction and dependency relations, the correct recognition of gene/protein names is the most basic task, and our methods can help with it. The frequency, average distance, standard deviation and t-score values can serve as features for machine learning approaches to tagging protein/gene names; this will be studied in future work. The experimental systems adopted in this paper are rule-based. The effects of combining different types of protein/gene name taggers, e.g., rule-based and corpus-based, will be investigated in the future.
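To make the four collocation statistics concrete, the following is a minimal sketch of how they might be computed for words that co-occur with tagged names inside a fixed window. It follows the textbook mean/variance and t-test formulations in Manning and Schutze (1999), applied to windowed co-occurrence as an approximation; the exact computation used in our experiments may differ. The function name `collocate_features`, the window size, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def collocate_features(tokens, is_name, window=5):
    """Collect per-word collocation statistics around tagged name tokens.

    tokens  : list of str
    is_name : parallel list of bool, True where the token is a protein/gene name
    Returns {word: {"frequency", "mean_distance", "std_dev", "t_score"}}.
    """
    n = len(tokens)
    word_count = Counter(tokens)
    name_count = sum(is_name)

    # Signed offsets of each candidate collocate relative to a name token.
    offsets = defaultdict(list)
    for i, flagged in enumerate(is_name):
        if not flagged:
            continue
        for d in range(-window, window + 1):
            j = i + d
            if d == 0 or j < 0 or j >= n or is_name[j]:
                continue
            offsets[tokens[j]].append(d)

    features = {}
    for word, ds in offsets.items():
        freq = len(ds)
        mean = sum(ds) / freq
        # Sample variance of the offsets (0.0 when only one observation).
        var = sum((d - mean) ** 2 for d in ds) / (freq - 1) if freq > 1 else 0.0
        # t-test against independence: observed vs. expected co-occurrence rate,
        # with the sample variance approximated by the observed rate.
        observed = freq / n
        expected = (word_count[word] / n) * (name_count / n)
        t_score = (observed - expected) / math.sqrt(observed / n)
        features[word] = {
            "frequency": freq,
            "mean_distance": mean,
            "std_dev": math.sqrt(var),
            "t_score": t_score,
        }
    return features


# Toy example (invented sentence, names marked by hand).
tokens = "the kinase phosphorylates p53 and the kinase binds mdm2".split()
is_name = [t in {"p53", "mdm2"} for t in tokens]
features = collocate_features(tokens, is_name, window=2)
print(features["kinase"])
```

Each resulting dictionary could be flattened into a feature vector for a corpus-based tagger. A low standard deviation indicates that the collocate tends to appear at a fixed position relative to the name.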
In the second study of annotating multiple types of biological entities, we introduced the use of existing taggers and presented a way to collect common substrings shared by entities.
Due to lack of time, the models were not well tuned with respect to the two parameters, C and gamma, that influence the capability of the models. Furthermore, not all of the provided training instances were used to train the model; the effect of using the full training set is worth investigating.
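One common way to tune C and gamma is the coarse grid search over exponentially growing values recommended by Hsu, Chang and Lin (2003). The sketch below shows only the search loop. The `evaluate` callback is a hypothetical stand-in for whatever cross-validation score an SVM trainer returns for a given (C, gamma) pair, and `toy_score` is an invented surface used purely to make the example runnable.

```python
import math
from itertools import product


def grid_search(evaluate,
                c_exponents=range(-5, 16, 2),
                gamma_exponents=range(-15, 4, 2)):
    """Return (best_score, best_C, best_gamma) over a power-of-two grid.

    evaluate(C, gamma) is assumed to return a score where higher is better,
    e.g. cross-validation accuracy of an RBF-kernel SVM.
    """
    best = None
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = evaluate(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Invented score surface peaking at C = 2^3, gamma = 2^-5 (illustration only)."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_score, best_c, best_gamma = grid_search(toy_score)
print(best_c, best_gamma)
```

In practice a coarse pass like this is usually followed by a finer grid around the best cell.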
How to deal with data imbalance is another important issue; solving it would facilitate further evaluation of feature effectiveness. We believe our approach leaves considerable room for improvement and may perform better with more tuning time.
For the last application of extracting GeneRIF from biological documents, we proposed an automatic approach to locate the GeneRIF sentence in an abstract with the assistance of SVMs, reducing the human effort in updating and maintaining the GeneRIF field in the LocusLink database.
We have to admit that the 139 abstracts provided in TREC 2003 are too few to compare the performance of the models reliably, and the results based on these 139 abstracts may be slightly biased. Our next step is to measure cross-validation performance using the Dice coefficient.
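For reference, the classic Dice coefficient between a candidate sentence and a reference GeneRIF can be sketched as follows. Lowercasing and whitespace tokenization are simplifying assumptions made here; an official evaluation may normalize text differently (for example with stemming or stopword removal). The example sentences are invented.

```python
def dice(candidate, reference):
    """Classic Dice coefficient between the word sets of two strings.

    Dice = 2 * |X & Y| / (|X| + |Y|), where X and Y are the sets of
    lowercased, whitespace-separated tokens (an assumed normalization).
    """
    x = set(candidate.lower().split())
    y = set(reference.lower().split())
    if not x and not y:
        return 1.0  # two empty strings are treated as identical here
    return 2 * len(x & y) / (len(x) + len(y))


def mean_dice(pairs):
    """Average Dice over (candidate, reference) pairs, e.g. one held-out fold."""
    scores = [dice(c, r) for c, r in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Invented example sentences.
score = dice("BRCA1 regulates DNA repair",
             "BRCA1 is required for DNA repair")
print(score)
```

Averaging `mean_dice` over the held-out folds of a k-fold split would give the cross-validation figure described above.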
Syntactic information is worth exploring, since sentences describing gene functions may share common structural patterns. How the weighting scheme affects performance is also of interest. We are currently trying to obtain a weighting scheme that best distinguishes GeneRIF sentences from non-GeneRIF sentences without classifiers.
Acknowledgements
This research was supported in part by the National Science Council under contracts NSC-91-2213-E-002-088 and NSC-92-2213-E-002-022. We wish to thank Dr. Lorrie Tanabe and Dr. W. John Wilbur of NCBI, NLM, NIH, and Dr. George Demetriou of the Department of Computer Science, University of Sheffield, who kindly provided resources for this work.
References
Adamic L.A., Wilkinson D., Huberman B.A. and Adar E. (2002) A Literature Based Method for Identifying Gene-Disease Connections. IEEE Computer Society Bioinformatics Conference (CSB'02) 2002; 109-117.
BIOSIS organization (1999). Biomedical Literature Searching: A Comparison of BIOSIS Previews, EMBASE, and MEDLINE. BIOSIS Evolutions 1999; 6(3): 1, 4-7.
Blaschke C., Andrade M.A., Ouzounis C. and Valencia A. (1999) Automatic Extraction of Biological Information from Scientific Text: Protein-Protein Interactions. Proceedings of 7th International Conference on Intelligent Systems for Molecular Biology 1999; 60-67.
Bhalotia G., Nakov P.I., Schwartz A.S., and Hearst M.A. (2003) BioText Team Report for the TREC 2003 Genomics Track. TREC 2003 work notes 2003; 158-166.
Brill E. (1994) Some Advances in Transformation-Based Part of Speech Tagging. Proceedings of the National Conference on Artificial Intelligence. AAAI Press; 1994, p. 722-727.
Burges C. (1998) A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2: 121-167.
Chen H.H., Ding Y.W. and Tsai S.C. (1998) Named Entity Extraction for Information Retrieval. Computer Processing of Oriental Languages. Special Issue on Information Retrieval on Oriental Languages 1998; 12(1): 75-85.
Chang J.T., Schutze H. and Altman R.B. (2004) GAPSCORE: Finding Gene and Protein Names One Word at a Time. Bioinformatics 2004; 20(2): 216-225.
Chang Y.C., Hsu I.H. and Chou L.Y. (2002) Graphical Features Selection Method. Intelligent Data Engineering and Automated Learning, Edited by H. Yin, N. Allinson, R. Freeman, J. Keane, and S. Hubbard, 2002.
Chen H.H. and Lee J.C. (1996) Identification and Classification of Proper Nouns in Chinese Texts. Proceedings of 16th International Conference on Computational Linguistics 1996; 222-229.
Collier N., Nobata C. and Tsujii J.I. (2000) Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of 18th International Conference on Computational Linguistics 2000; 201-207.
Collier N., Park H.S., Ogata N., Tateishi Y., Nobata C. and Ohta T. (1999) The GENIA project: Corpus-based Knowledge Acquisition and Information Extraction from Genome Research Papers. Proceedings of the Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL’99) 1999, June.
Craven M. and Kumlien J. (1999) Constructing Biological Knowledge Bases by Extracting Information from Text Sources. Proceedings of 7th International Conference on Intelligent Systems for Molecular Biology 1999; 77-86.
DARPA (1998) Proceedings of 7th Message Understanding Conference 1998.
Dudoit S., Yang Y.H., Callow M.J. and Speed T.P. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. J. Amer. Statis. Assoc. 2002; 97:77-86.
Fan J.W. (2003) Information Retrieval and Extraction for the Chinese Gene Variation Database (CGVdb). Unpublished Master Thesis; 2003.
Fox C. (1992) Lexical Analysis and Stoplists. In: Frakes W.B. and Baeza-Yates R. editors. Information Retrieval: Data Structures and Algorithms. Prentice Hall; 1992; 102-130.
Friedman C., Kra P., Yu H., Krauthammer M. and Rzhetsky A. (2001) GENIES: A Natural Language Processing System for the Extraction of Molecular Pathways from Journal Articles. Bioinformatics 2001; 17(S1): 74-82.
Fukuda K., Tsunoda T., Tamura A. and Takagi T. (1998) Toward Information Extraction: Identifying Protein Names from Biological Papers. Proceedings of Pacific Symposium on Biocomputing 1998; 707-718.
GENIA project. http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.
Hanisch D., Fluck J., Mevissen H.T. and Zimmer R. (2003) Playing Biology's Name Game: Identifying Protein Names in Scientific Text. Proceedings of the Pacific Symposium on Biocomputing 2003; 403-414.
Hersh W. and Bhupatiraju R.T. (2003) TREC Genomics Track Overview. Proceedings of TREC 2003.
Hirschman L., Park J.C., Tsujii J., Wong L. and Wu C.H. (2002) Accomplishments and Challenges in Literature Data Mining for Biology. Bioinformatics 2002; 18(12): 1553-1561.
Hou W.J. and Chen H.H. (2002) Extracting Biological Keywords from Scientific Text. Proceedings of 13th International Conference on Genome Informatics 2002; 571-573.
Hou W.J. and Chen H.H. (2003) Enhancing Performance of Protein Name Recognizers Using Collocation. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003; 25-32.
Hou W.J., Teng C.Y., Lee C. and Chen H.H. (2003) SVM Approach to GeneRIF Annotation. Proceedings of TREC 2003.
Hsu C.W., Chang C.C. and Lin C.J. (2003) A Practical Guide to Support Vector Classification. http://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html.
Humphreys K., Demetriou G. and Gaizauskas R. (2000) Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures. Proceedings of Pacific Symposium on Biocomputing 2000; 5: 502-513.
Jelier R., Schuemie M., Eijk C.V.E., Weeber M., Mulligen E.V., Schijvenaars B., Mons B. and Kors J. (2003) Searching for geneRIFs: concept-based query expansion and Bayes classification. Proceedings of TREC 2003; 167-174.
Joachims T. (1998) Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of ECML-98 1998; 137-142.
Kazama J., Makino T., Ohta Y. and Tsujii J. (2002) Tuning Support Vector Machines for Biomedical Named Entity Recognition. Proceedings of the ACL 2002 workshop on NLP in the Biomedical Domain 2002; 1-8.
Krauthammer M., Rzhetsky A., Morozov P. and Friedman C. (2000) Using BLAST for Identifying Gene and Protein Names in Journal Articles. Gene 2000; 259(1-2): 245-252.
Lee K.J., Hwang Y.S. and Rim H.C. (2003) Two-Phase Biomedical NE Recognition based on SVMs. Proceedings of the ACL 2003 Workshop on NLP in Biomedicine 2003; 33-40.
Manning C.D. and Schutze H. (1999) Foundations of Statistical Natural Language Processing. The MIT Press; 1999.
Marcotte E.M., Xenarios I. and Eisenberg D. (2001) Mining Literature for Protein-protein Interactions. Bioinformatics 2001; 17(4): 359-363.
Morgan A., Hirschman L., Yeh A. and Colosimo M. (2003) Gene Name Extraction Using FlyBase Resources. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003; 1-8.
Ng S.K. and Wong M. (1999) Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts. Proceedings of 10th International Conference on Genome Informatics 1999; 104-112.
Olsson F., Eriksson G., Franzen K., Asker L. and Liden P. (2002) Notions of Correctness when Evaluating Protein Name Taggers. Proceedings of the 19th International Conference on Computational Linguistics 2002; 765-771.
Ono T., Hishigaki H., Tanigami A. and Takagi T. (2001) Automated Extraction of Information on Protein-Protein Interactions from the Biological Literature. Bioinformatics 2001; 17(2): 155-161.
Park J.C., Kim H.S. and Kim J.J. (2001) Bidirectional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorial Grammar. Proceedings of Pacific Symposium on Biocomputing 2001; 6: 396-407.
Pearson H. (2001) Biology’s Name Game. Nature 2001; 411: 631-632.
Pruitt K.D., Katz K.S., Sicotte H. and Maglott D.R. (2000) Introducing RefSeq and LocusLink: Curated Human Genome Resources at the NCBI. Trends Genet 2000; 16(1): 44-47.
Ratnaparkhi A. (1998) Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD Thesis, University of Pennsylvania; 1998.
Rindflesch T.C., Tanabe L., Weinstein J.N. and Hunter L. (2000) EDGAR: Extraction of Drugs, Genes, and Relations from Biomedical Literature. Proceedings of Pacific Symposium on Biocomputing 2000; 5: 517-528.
Sekimizu T., Park H.S. and Tsujii J. (1998) Identifying the Interaction Between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts. Genome Informatics 1998; 62-71.
Takeuchi K. and Collier N. (2003) Bio-Medical Entity Extraction using Support Vector Machines. Proceedings of the ACL 2003 workshop on NLP in Biomedicine 2003; 57-64.
Tanabe L. and Wilbur W.J. (2002) Tagging Gene and Protein Names in Biomedical Text. Bioinformatics 2002; 18(8): 1124-1132.
Thomas J., Milward D., Ouzounis C., Pulman S. and Carroll M. (2000) Automatic Extraction of Protein Interactions from Scientific Abstracts. Proceedings of Pacific Symposium on Biocomputing 2000; 5: 538-549.
TREC 2003 Genomics Track. http://medir.ohsu.edu/~genomics/.
Tsuruoka Y. and Tsujii J. (2003) Boosting Precision and Recall of Dictionary-based Protein Name Recognition. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003; 41-48.
Wong L. (2001) PIES, a Protein Interaction Extraction System. Proceedings of Pacific Symposium on Biocomputing 2001; 6: 520-531.
Yamamoto K., Kudo T., Konagaya A. and Matsumoto Y. (2003) Protein Name Tagging for Biomedical Annotation in Text. Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine 2003; 65-72.
Appendix A: Terms suggested by an expert
dephosphorylat (-e, -ed, -es, -ing, -ion, -ory), effect (-, -ed, -ing, -s), transduc (-e, -ed, -es, -ing, -tion, -tor, -tory), trigger (-, -ed, -ing, -s)