Discovering Larger Semantic Structures: Events

Just as the supervised training of event extractors is more difficult than that of relation extractors, the unsupervised discovery of multi-argument structures such as event templates is more challenging than relation discovery. Not every description of an event will provide information on all the event arguments, and the arguments that are present may be scattered over several sentences, so we may need to build a separate set of contexts for each argument.

One property of news stories that we can take advantage of is that each article, or at least the first few sentences of each article, generally describes a single event, so the names in (the beginning of) an article correspond to (a subset of) the event arguments.

Shinyama and Sekine [SS06] addressed the problem of missing arguments in a single article by taking 12 parallel feeds from different news sources, using bag-of-words overlap (including names) to identify stories about the same event, and putting these articles into a cluster. Names which appeared early and often in multiple articles were likely to be the primary arguments of the event. Then they built metaclusters out of clusters representing the same type of event. Two event clusters were placed in a metacluster based on finding corresponding names in the two clusters with common contexts; these then define the arguments and extraction patterns for this type of event.
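The first two steps (grouping articles by word overlap, then looking for names that recur early across a cluster) can be illustrated with a minimal sketch. This is not the procedure of [SS06]: the Jaccard measure, the greedy single-link grouping, the threshold of 0.3, the 30-token window, and the use of capitalized tokens as a stand-in for a name tagger are all assumptions made for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Set of lowercased word tokens; names are kept as ordinary words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Overlap of two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster_articles(articles, threshold=0.3):
    """Greedy single-link grouping (an assumed strategy): an article joins
    the first cluster holding any article whose overlap reaches the threshold."""
    bags = [bag_of_words(a) for a in articles]
    clusters = []  # each cluster is a list of article indices
    for i, bag in enumerate(bags):
        for cluster in clusters:
            if any(jaccard(bag, bags[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def rank_names(articles, cluster, window=30):
    """For each capitalized token, count the articles in the cluster that
    mention it within their first `window` tokens. Capitalization is only
    a crude proxy for names: sentence-initial words are counted too."""
    counts = Counter()
    for i in cluster:
        early = articles[i].split()[:window]
        names = {tok.strip(".,;:'\"()") for tok in early if tok[:1].isupper()}
        counts.update(n for n in names if n)
    return counts.most_common()

# Invented example articles, not taken from any corpus.
articles = [
    "Acme agreed to buy Globex for two billion dollars on Monday.",
    "On Monday Acme agreed to buy Globex for about two billion dollars.",
    "Heavy rain flooded the streets of Springfield overnight.",
]
clusters = cluster_articles(articles)   # [[0, 1], [2]]
name_counts = dict(rank_names(articles, clusters[0]))
```

In this toy input the two acquisition stories fall into one cluster, and "Acme" and "Globex" are each found early in both of them, which is the kind of evidence the paragraph above describes for picking primary arguments. The metaclustering step, which compares the contexts of names across clusters, is not covered by this sketch.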

7.4 Evaluation

Evaluating unsupervised extraction is inherently problematic. When we developed hand-coded, supervised, or semi-supervised extraction systems, we declared what entity and relation classes we wanted to extract; if something different was extracted, that counted as an error. When we perform unsupervised extraction, we in effect ask the data to tell us what the good classes are. If we independently (through manual data analysis) come up with a different set of classes for the same data, it may not be easy to say who is right, or whether there are multiple correct analyses.

Nonetheless, an approach based on a separate gold standard can be useful. [CJTN05], for example, tested on an ACE corpus and evaluated by aligning the resulting classes against the ACE relation types. [RF07] and [KD08] created their own keys by hand from subsets of their corpora.
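As an illustration of what aligning induced classes against a gold standard can look like, here is a minimal sketch that maps each induced cluster to the gold class most frequent among its members and reports the fraction of instances that agree with that mapping (often called purity). The many-to-one mapping and the purity score are assumptions chosen for simplicity; the cited papers may align and score differently. The function names and the labels TYPE_A and TYPE_B are hypothetical.

```python
from collections import Counter, defaultdict

def align_clusters(induced, gold):
    """Map each induced cluster id to the gold class most common among its
    members. Both arguments are dicts keyed by instance id; instances
    missing from the gold key are ignored."""
    members = defaultdict(list)
    for inst, cluster_id in induced.items():
        if inst in gold:
            members[cluster_id].append(gold[inst])
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in members.items()}

def purity(induced, gold):
    """Fraction of shared instances whose gold label equals the majority
    gold label of the induced cluster they were placed in."""
    mapping = align_clusters(induced, gold)
    shared = [inst for inst in induced if inst in gold]
    if not shared:
        return 0.0
    correct = sum(1 for inst in shared if mapping[induced[inst]] == gold[inst])
    return correct / len(shared)

# Invented toy data: five entity pairs, two induced clusters, two gold types.
induced = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1}
gold = {"p1": "TYPE_A", "p2": "TYPE_A", "p3": "TYPE_B",
        "p4": "TYPE_B", "p5": "TYPE_B"}
mapping = align_clusters(induced, gold)   # {0: "TYPE_A", 1: "TYPE_B"}
score = purity(induced, gold)             # 4 of 5 instances agree
```

A many-to-one mapping of this kind rewards systems that split a gold class into several clusters, so in practice it would usually be reported alongside a measure that penalizes over-splitting.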

More indirect evaluations are also possible. To evaluate its argument clusters, [KD08] measured how well they aligned with WordNet synsets. [YHRM11] used the relation phrase clusters as features for a relation classifier trained using distant supervision, and reported some performance improvements.

Chapter 8

Other Domains

What sort of texts are good candidates for information extraction? Basically, good candidates are domains in which there are large volumes of text that express a common set of semantic relations, and where there is a strong incentive to search, build databases, collect statistics, or otherwise mine this information.

News. The examples we have used so far involve general news, including political, international, and business news. The Web has made such news available from thousands of sources in large volume, and the ability to search or react rapidly to such information is of interest to many large businesses and governments.

Quite a number of such systems have been deployed. For example, the Europe Media Monitor’s NewsExplorer¹ gathers news from across Europe, clusters related news stories, and extracts names, locations, general person-person relations, and event types. OpenCalais from Thomson Reuters² extracts a range of entity, relation, and event types from general and business news. The GATE system from the University of Sheffield has been used in a number of business intelligence applications [MYKK05]. For the most part these deployed applications have used hand-crafted lists of terms and regular expressions, rather than corpus-trained approaches.

Two other domains which meet the criteria for IE are medical records and the scientific literature.

Medical records have been a target area for information extraction for several decades [SFLmotLSP87]. Hospitals produce very large amounts of patient data, and a significant portion of this is in text form. IE could improve access to crucial medical information in time-critical situations. Furthermore, medical research and monitoring of overall patient care require analysis of this data, in order to observe relationships between diagnosis and treatment, or between treatment and outcome, and this in turn has required a manual review of this textual data, and in many cases the manual assignment of standardized diagnosis and treatment codes. Automatically transforming this text into standardized fields and categories based on medical criteria can greatly reduce the manual effort required. The push for electronic health records (EHR) has increased both the need for and the potential impact of medical text analysis.

A number of implemented systems have already demonstrated the feasibility of such applications for specialized medical reports [MSKSH08, FSLH04]. However, progress in clinical text analysis has been slower than for other IE tasks [CNH⁺11], for a number of reasons. Medical records are sensitive and have to be carefully anonymized; this has made it difficult to obtain large amounts of data or to share data between sites. Few standard test sets are available for clinical data. More generally, until recently much of the work on EHR has been done locally by individual medical centers, leading to a lack of standardization of EHR.

¹ http://emm.newsexplorer.eu

² http://www.opencalais.com

Only in the past few years have shared evaluations of IE for clinical data developed along the lines of MUC and ACE. Several of these Challenges in NLP for Clinical Data have been organized in connection with Informatics for Integrating Biology and the Bedside.³ The specific tasks are quite different from year to year. For example, the 2009 task involved the extraction of information on medication; the 2010 task was quite general, involving the extraction of problems, tests, and treatments from discharge summaries (the extended reports prepared at the end of a patient’s hospital stay).

Biomedical literature. In our introduction, we noted Zellig Harris’s vision for using a process similar to fine-grained information extraction to index scientific journal articles [Har58]. As robust extraction technology has caught up in the last few years with this vision, there has been renewed interest in extracting information from the scientific literature. One particular area has been biomedicine and genomics, where the very rapid growth of the field has overwhelmed the researcher seeking to keep current with the literature. The goal for NLP has been to automatically identify the basic entities (genes and proteins) and reports of their interaction, and build a data base to index the literature. To address this goal, a number of annotated corpora have been developed.⁴ These have been used in turn for open, multi-site evaluations of biomedical named entity and relation extraction.

³ See www.i2b2.org/NLP

⁴ See for example the resources of the GENIA project, http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/

Bibliography

[Abn08] Steven Abney. Semisupervised Learning for Computational Linguists. Chapman and Hall, 2008.

[AG00] Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In DL ’00: Proceedings of the fifth ACM conference on Digital libraries, pages 85–94, New York, NY, USA, 2000. ACM.

[AHB⁺93] Douglas Appelt, Jerry Hobbs, John Bear, David Israel, and Mabry Tyson. FASTUS: A finite-state processor for information extraction from real-world text. In Proceedings of IJCAI-93, pages 1172–1178, Chambery, France, August 1993.

[Ahn06] David Ahn. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, pages 1–8, Sydney, Australia, July 2006. Association for Computational Linguistics.

[BHAG05] Markus Becker, Ben Hachey, Beatrice Alex, and Claire Grover. Optimising selective sampling for bootstrapping named entity recognition. In Proceedings of the ICML-2005 Workshop on Learning with Multiple Views, 2005.

[BM05] Razvan Bunescu and Raymond Mooney. Subsequence kernels for relation extraction. In Proceedings of the 19th Conference on Neural Information Processing Systems (NIPS), Vancouver, BC, December 2005.

[BMSW97] Daniel Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., 1997.

[Bor99] Andrew Borthwick. A Maximum Entropy Approach to Named Entity Recognition. PhD thesis, Dept. of Computer Science, New York University, 1999.

[Bri98] Sergey Brin. Extracting patterns and relations from the world-wide web. In Proceedings of the 1998 International Workshop on the Web and Databases at the 6th International Conference on Extending Database Technology, EDBT 98, pages 172–183, 1998.

[BSAG98] Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.

[CA05] Massimiliano Ciaramita and Yasemin Altun. Named-entity recognition in novel domains with external lexical knowledge. In Advances in Structured Learning for Text and Speech Processing Workshop, 2005.

[CJ09] Zheng Chen and Heng Ji. Language specific issue and feature exploration in Chinese event extraction. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 209–212, Boulder, Colorado, June 2009. Association for Computational Linguistics.

[CJH09] Zheng Chen, Heng Ji, and Robert Haralick. A pairwise event coreference model, feature impact and evaluation for event coreference resolution. In Proceedings of the Workshop on Events in Emerging Text Types, pages 17–22, Borovets, Bulgaria, September 2009. Association for Computational Linguistics.

[CJTN05] Jinxiu Chen, Donghong Ji, Chew Lim Tan, and Zhengyu Niu. Unsupervised feature selection for relation extraction. In Second International Joint Conference on Natural Language Processing – Companion Volume, pages 262–267, Jeju Island, Republic of Korea, October 2005.

[CK99] Mark Craven and Johan Kumlien. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99), pages 77–86. AAAI Press, 1999.

[CNH⁺11] Wendy Chapman, Prakash Nadkarni, Lynette Hirschman, Leonard D’Avolio, Guergana Savova, and Ozlen Uzuner. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18:540–543, 2011.

[CS99] Michael Collins and Yoram Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[DeJ82] Gerald DeJong. An overview of the FRUMP system. In Wendy Lehnert and Martin Ringle, editors, Strategies for Natural Language Processing, pages 149–176. Lawrence Erlbaum, Hillsdale, NJ, 1982.

[FBH07] Donghui Feng, Gully Burns, and Eduard Hovy. Extracting data records from unstructured biomedical full text. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 837–846, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

[FSLH04] C. Friedman, L. Shagina, Y. Lussier, and G. Hripcsak. Automated encoding of clinical documents based on natural language processing. Journal of the American Medical Informatics Association, 11:392–402, 2004.

[GS96] Ralph Grishman and Beth Sundheim. Message Understanding Conference-6: a brief history. In Proceedings of the 16th International Conference on Computational Linguistics, pages 466–471, Copenhagen, Denmark, 1996.

[Har58] Zellig Harris. Linguistic transformations for information retrieval. In Proceedings of the International Conference on Scientific Information, Washington, D.C., 1958. National Academy of Sciences-National Research Council.

[Har68] Zellig Harris. Mathematical Structures of Language. Interscience, 1968.

[HSG04] Takaaki Hasegawa, Satoshi Sekine, and Ralph Grishman. Discovering relations among named entities from large corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 415–422, Barcelona, Spain, July 2004.

[JG08] Heng Ji and Ralph Grishman. Refining event extraction through cross-document inference. In Proceedings of ACL-08: HLT, pages 254–262, Columbus, Ohio, June 2008. Association for Computational Linguistics.

[JG11] Heng Ji and Ralph Grishman. Knowledge base population: Successful approaches and challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1148–1158, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[JZ07] Jing Jiang and ChengXiang Zhai. A systematic exploration of the feature space for relation extraction. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 113–120, Rochester, New York, April 2007. Association for Computational Linguistics.

[Kam04] Nanda Kambhatla. Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction. In The Companion Volume to the Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, pages 178–181, Barcelona, Spain, July 2004. Association for Computational Linguistics.

[KD08] Stanley Kok and Pedro Domingos. Extracting semantic networks from text via relational clustering. In Proceedings of the Nineteenth European Conference on Machine Learning, pages 624–639, Antwerp, Belgium, 2008. Springer.

[LG10a] Shasha Liao and Ralph Grishman. Filtered ranking for bootstrapping in event extraction. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 680–688, Beijing, China, August 2010. Coling 2010 Organizing Committee.

[LG10b] Shasha Liao and Ralph Grishman. Using document level cross-event inference to improve event extraction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 789–797, Uppsala, Sweden, July 2010. Association for Computational Linguistics.

[LG11] Shasha Liao and Ralph Grishman. Acquiring topic features to improve event extraction: in pre-selected and balanced collections. In Proceedings of the Conference on Recent Advances in Natural Language Processing, Hissar, Bulgaria, September 2011.

[MBSJ09] Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011, Suntec, Singapore, August 2009. Association for Computational Linguistics.

[MC08] Tara McIntosh and James R. Curran. Weighted mutual exclusion bootstrapping for domain independent lexicon and template acquisition. In Proceedings of the Australasian Language Technology Workshop, Hobart, Australia, 2008.

[McI10] Tara McIntosh. Unsupervised discovery of negative categories in lexicon bootstrapping. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 356–365, Cambridge, MA, October 2010. Association for Computational Linguistics.

[MGM98] Andrei Mikheev, Claire Grover, and Marc Moens. Description of the LTG system used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.

[MSKSH08] S.M. Meystre, G.K. Savova, K.C. Kipper-Schuler, and J.F. Hurdle. Extracting information from textual documents in the electronic health record: a review of recent research. In Yearbook of Medical Informatics, pages 128–144. Schattauer, Stuttgart, 2008.

[MYKK05] Diana Maynard, Milena Yankova, Alexandros Kourakis, and Antonis Kokossis. Ontology-based information extraction for market monitoring and technology watch. In Proceedings of the Workshop on End User Aspects of the Semantic Web, 2nd European Semantic Web Conference, Heraklion, Crete, 2005.

[NM11] Truc Vien T. Nguyen and Alessandro Moschitti. End-to-end relation extraction using distant supervision from external semantic repositories. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 277–282, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.

[NS07] David Nadeau and Satoshi Sekine. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26, 2007.

[PR09] Siddharth Patwardhan and Ellen Riloff. A unified model of phrasal and sentential evidence for information extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 151–160, Singapore, August 2009. Association for Computational Linguistics.

[RF07] Benjamin Rosenfeld and Ronen Feldman. Clustering for unsupervised relation identification. In Proceedings of the Conference on Information and Knowledge Management (CIKM), pages 411–418, 2007.

[Ril96] Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), pages 1044–1049, 1996.

[RR09] Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado, June 2009. Association for Computational Linguistics.

[SDW01] Tobias Scheffer, Christian Decomain, and Stefan Wrobel. Mining the web with active hidden Markov models. In Data Mining, IEEE International Conference on, pages 645–646, Los Alamitos, CA, USA, 2001. IEEE Computer Society.

[SFLmotLSP87] Naomi Sager, Carol Friedman, Margaret Lyman, and members of the Linguistic String Project. Medical Language Processing: Computer Management of Narrative Data. Addison-Wesley Pub. Co., Reading, Mass., 1987.

[SG05] Mark Stevenson and Mark A. Greenwood. A semantic approach to IE pattern induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, June 2005. Association for Computational Linguistics.

[SG10] Ang Sun and Ralph Grishman. Semi-supervised semantic pattern discovery with guidance from unsupervised pattern clusters. In Coling 2010: Posters, pages 1194–1202, Beijing, China, August 2010. Coling 2010 Organizing Committee.

[SN04] Satoshi Sekine and Chikashi Nobata. Definition, dictionary and tagger for extended named entities. In Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal, 2004.

[SS06] Yusuke Shinyama and Satoshi Sekine. Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 304–311, New York City, USA, June 2006. Association for Computational Linguistics.

[SZS⁺04] Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Main Volume, pages 589–596, Barcelona, Spain, July 2004.

[TKS03] Erik F. Tjong Kim Sang. Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Sixth Conference on Natural Language Learning 2002, pages 142–147, 2003.

[TKSDM03] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, 2003.

[Yan03] Roman Yangarber. Counter-training in discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL’03), Sapporo, Japan, July 2003. Association for Computational Linguistics.

[YE07] Alexander Yates and Oren Etzioni. Unsupervised resolution of objects and relations on the web. In Proceedings of HLT-NAACL, 2007.

[YGTH00] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Automatic acquisition of domain knowledge for information extraction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, Germany, August 2000.

[YHRM11] Limin Yao, Aria Haghighi, Sebastian Riedel, and Andrew McCallum. Structured relation discovery using generative models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1456–1466, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.

[YLG02] Roman Yangarber, Winston Lin, and Ralph Grishman. Unsupervised learning of generalized names. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), 2002.

[ZAR03] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. Kernel methods for relation extraction. J. Machine Learning Research, 3:1083–1106, 2003.

[ZG05] Shubin Zhao and Ralph Grishman. Extracting relations with integrated information using kernel methods. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 419–426, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[ZSW⁺05] Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. In Second International Joint Conference on Natural Language Processing, pages 378–389, Jeju Island, Republic of Korea, October 2005.

[ZSZZ05] GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 427–434, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.

[ZZJZ07] GuoDong Zhou, Min Zhang, DongHong Ji, and QiaoMing Zhu. Tree
