
21.2 Diversity of IE Tasks


The Message Understanding Conferences led to an increased interest in the IE task and the creation of additional IE data sets. Researchers began to work on IE problems for new domains and focused on different aspects of the information extraction problem. In the following sections, we outline some of the fundamental distinctions that cut across different information extraction tasks.

21.2.1 Unstructured vs. Semi-Structured Text

Historically, most natural language processing systems have been designed to process unstructured text, which consists of natural language sentences. In contrast to structured data where the semantics of the data is defined by its organization (e.g., database entries), the meaning of unstructured text depends entirely on linguistic analysis and natural language understanding.


Professor John Skvoretz, U. of South Carolina, Columbia, will present a seminar entitled “Embedded Commitment,” on Thursday, May 4th from 4-5:30 in PH 223D.

FIGURE 21.1: Example of an unstructured seminar announcement

Examples of unstructured text include news stories, magazine articles, and books.3 Figure 21.1 shows an example of a seminar announcement that is written as unstructured text.

Semi-structured text consists of natural language that appears in a document where the physical layout of the language plays a role in its interpretation.

For example, consider the seminar announcements depicted in Figure 21.2.

The reader understands that the speaker is Laura Petitte, who is from the Department of Psychology at McGill University, because seminar speakers and their affiliations typically appear at the top of a seminar announcement.

If McGill University had appeared below Baker Hall 355 in the announcement, then we would assume that the seminar takes place at McGill University.

Several IE data sets have been created specifically to handle domains that often include semi-structured text, such as seminar announcements, job postings, rental ads, and resumes. To accommodate semi-structured text, IE systems typically rely less on syntactic parsing and more on positional features that capture the physical layout of the words on the page.
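To make the idea of positional features concrete, the following sketch computes a few simple layout features for the lines of the first announcement in Figure 21.2. The feature names, the thresholds, and the Python representation are illustrative assumptions, not the feature set of any particular published system.

    # Illustrative layout features for semi-structured text; the feature
    # names are hypothetical, not taken from any published IE system.
    def layout_features(lines):
        """Return one feature dictionary per line, describing physical
        position on the page rather than syntactic structure."""
        features = []
        for i, line in enumerate(lines):
            stripped = line.strip()
            first_token = stripped.split(" ")[0] if stripped else ""
            features.append({
                "line_index": i,                                  # absolute position from the top
                "relative_position": i / max(len(lines) - 1, 1),  # 0.0 = top, 1.0 = bottom
                "in_top_quarter": i < max(1, len(lines) // 4),
                "has_field_label": first_token.endswith(":"),     # e.g., "Name:", "Time:"
                "indentation": len(line) - len(line.lstrip(" ")),
                "num_tokens": len(stripped.split()),
            })
        return features

    announcement = [
        "Laura Petitte",
        "Department of Psychology",
        "McGill University",
        "Thursday, May 4, 1995",
        "12:00 pm",
        "Baker Hall 355",
    ]
    for line, feats in zip(announcement, layout_features(announcement)):
        print(line, feats)

A statistical extractor could combine such layout features with the words themselves to learn, for instance, that speaker names tend to occur near the top of an announcement.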

21.2.2 Single-Document vs. Multi-Document IE

Originally, information extraction systems were designed to locate domain-specific information in individual documents. Given a document as input, the IE system identifies and extracts facts relevant to the domain that appear in the document. We will refer to this task as single-document information extraction.

The abundance of information available on the Web has led to the creation of new types of IE systems that seek to extract facts from the Web or other very large text collections (e.g., (Brin 1998; Fleischman, Hovy, and Echihabi 2003; Etzioni, Cafarella, Popescu, Shaked, Soderland, Weld, and Yates 2005; Pasca, Lin, Bigham, Lifchits, and Jain 2006; Pasca 2007; Banko, Cafarella, Soderland, Broadhead, and Etzioni 2007)). We will refer to this task as multi-document information extraction.

Single-document IE is fundamentally different from multi-document IE, although both types of systems may use similar techniques. One distinguishing issue is redundancy.

3These text forms can include some structured information as well, such as publication dates and author by-lines. But most of the text in these genres is unstructured.

Laura Petitte
Department of Psychology
McGill University
Thursday, May 4, 1995
12:00 pm
Baker Hall 355

Name: Dr. Jeffrey D. Hermes
Affiliation: Department of AutoImmune Diseases
    Research & Biophysical Chemistry
    Merck Research Laboratories
Title: “MHC Class II: A Target for Specific Immunomodulation of the Immune Response”
Host/e-mail: Robert Murphy, murph@a.crf.cmu.edu
Date: Wednesday, May 3, 1995
Time: 3:30 p.m.
Place: Mellon Institute Conference Room
Sponsor: MERCK RESEARCH LABORATORIES

FIGURE 21.2: Examples of semi-structured seminar announcements

A single-document IE system must extract domain-specific information from each document that it is given. If the system fails to find relevant information in a document, then that is an error. This task is challenging because many documents mention a fact only once, and the fact may be expressed in an unusual or complex linguistic context (e.g., one requiring inference). In contrast, multi-document IE systems can exploit the redundancy of information in their large text collections. Many facts will appear in a wide variety of contexts, so the system usually has multiple opportunities to find each piece of information. The more often a fact appears, the greater the chance that it will occur at least once in a linguistically simple context that will be straightforward for the IE system to recognize.4
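As a rough illustration of how this redundancy can be exploited, the sketch below pools candidate extractions from many documents and keeps the value that recurs most often. The extraction pattern and the example sentences are invented for the illustration.

    # Redundancy-based ranking of candidate facts; the pattern and the
    # documents are invented for illustration.
    import re
    from collections import Counter

    BIRTHPLACE = re.compile(r"(\w+) was born in (\w+)")

    def candidate_facts(documents):
        """Collect (person, birthplace) candidates from every document."""
        for doc in documents:
            for person, place in BIRTHPLACE.findall(doc):
                yield (person, place)

    docs = [
        "Mozart was born in Salzburg in 1756.",
        "The composer Mozart was born in Salzburg.",
        "Some pages claim Mozart was born in Vienna.",  # a noisy extraction
    ]
    counts = Counter(candidate_facts(docs))
    print(counts.most_common(1)[0])  # (('Mozart', 'Salzburg'), 2) -- the best-supported fact

A single-document system has no such recourse: if its one opportunity to extract a fact is phrased in a difficult way, the fact is simply missed.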

Multi-document IE is sometimes referred to as “open-domain” IE because the goal is usually to acquire broad-coverage factual information, which will likely benefit many domains. In this paradigm, it doesn’t matter where the information originated. Some open-domain IE systems, such as KnowItAll (Etzioni, Cafarella, Popescu, Shaked, Soderland, Weld, and Yates 2005) and TextRunner (Banko, Cafarella, Soderland, Broadhead, and Etzioni 2007), have addressed issues of scale to acquire large amounts of information from the Web. One of the major challenges in multi-document IE is cross-document coreference resolution: when are two documents talking about the same entities? Some researchers have tackled this problem (e.g., (Bagga and Baldwin 1998; Mann and Yarowsky 2003; Gooi and Allan 2004; Niu, Li, and Srihari 2004; Mayfield, Alexander, Dorr, Eisner, Elsayed, Finin, Fink, Freedman, Garera, McNamee, Mohammad, Oard, Piatko, Sayeed, Syed, Weischedel, Xu, and Yarowsky 2009)), and in 2008 the ACE evaluation expanded its focus to include cross-document entity disambiguation (Strassel, Przybocki, Peterson, Song, and Maeda 2008).

4This issue parallels the difference between single-document and multi-document question answering (QA) systems. Light et al. (Light, Mann, Riloff, and Breck 2001) found that the performance of QA systems in TREC-8 was directly correlated with the number of answer opportunities available for a question.
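The following toy sketch illustrates the cross-document coreference problem by grouping documents that mention the same name according to the overlap of the surrounding words. The similarity measure, the threshold, and the stopword list are ad hoc choices for the example and are not the method of any of the systems cited above.

    # Toy cross-document coreference: cluster documents mentioning the same
    # name by the Jaccard overlap of their (non-stopword) context words.
    STOPWORDS = {"a", "the", "on", "for", "of", "and"}  # tiny illustrative list

    def context_words(doc, name):
        """Bag of words from the document, excluding the name and stopwords."""
        words = {w.lower().strip(".,") for w in doc.split()}
        return words - set(name.lower().split()) - STOPWORDS

    def cluster_mentions(docs, name, threshold=0.15):
        """Greedily assign each document to the first sufficiently similar cluster."""
        clusters = []  # each cluster is a list of (doc_id, context-word set) pairs
        for doc_id, doc in enumerate(docs):
            words = context_words(doc, name)
            for cluster in clusters:
                rep = cluster[0][1]  # compare against the cluster's first member
                jaccard = len(words & rep) / max(len(words | rep), 1)
                if jaccard >= threshold:
                    cluster.append((doc_id, words))
                    break
            else:
                clusters.append([(doc_id, words)])
        return [[doc_id for doc_id, _ in cluster] for cluster in clusters]

    docs = [
        "John Smith pitched a shutout for the Red Sox on Friday.",
        "Red Sox starter John Smith struck out nine batters.",
        "Professor John Smith published a new paper on proteins.",
    ]
    print(cluster_mentions(docs, "John Smith"))  # [[0, 1], [2]]: the two baseball stories cluster together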

21.2.3 Assumptions about Incoming Documents

The IE data sets used in the Message Understanding Conferences consist of documents related to the domain, but not all of the documents mention a relevant event. The data sets were constructed to mimic the challenges that a real-world information extraction system must face, where a fundamental part of the IE task is to determine whether a document describes a relevant event, as well as to extract information about the event. In the MUC-3 through MUC-7 IE data sets, only about half of the documents describe a domain-relevant event that warrants information extraction.

Other IE data sets make different assumptions about the incoming documents. Many IE data sets consist only of documents that describe a relevant event. Consequently, the IE system can assume that each document contains information that should be extracted. This assumption of relevant-only documents allows an IE system to be more aggressive about extracting information because the texts are known to be on-topic. For example, if an IE system is given stories about bombing incidents, then it can extract the name of every person who was killed or injured, and in most cases they will be victims of a bombing. If, however, irrelevant stories are also given to the system, then it must further distinguish between people who are bombing victims and people who were killed or injured in other types of events, such as robberies or car crashes.
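A minimal sketch of this filtering step is shown below: a crude keyword test stands in for a trained relevance classifier, and only documents that pass it are handed to the extractor. The cue words and example reports are invented for the illustration.

    # Gate the extractor with a relevance check; the keyword test is a crude
    # stand-in for a trained document classifier.
    BOMBING_CUES = {"bomb", "bombing", "explosion", "detonated", "blast"}

    def is_relevant(document):
        words = {w.lower().strip(".,") for w in document.split()}
        return bool(words & BOMBING_CUES)

    reports = [
        "A car bomb exploded near the embassy, killing two guards.",
        "Two people were injured in a highway crash on Tuesday.",
    ]
    relevant = [doc for doc in reports if is_relevant(doc)]
    print(len(relevant))  # 1 -- only the bombing story would be passed on for extraction

When relevance cannot be assumed, this kind of document-level filtering is what keeps the extractor from labeling every killed or injured person as a bombing victim.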

Some IE data sets further make the assumption that each incoming document contains only one event of interest. We will refer to these as single-event documents. The seminar announcements, corporate acquisitions, and job postings IE data sets only contain single-event documents. In contrast, the MUC data sets and some others (e.g., rental ads and disease outbreaks) allow that a single document may describe multiple events of interest. If the IE system can assume that each incoming document describes only one relevant event, then all of the extracted information can be inserted in a single output template.5 If multiple events are discussed in a document, then the IE system must perform discourse analysis to determine how many different events are being reported and to associate each piece of extracted information with the appropriate event template.

5Note that coreference resolution of entities is still an issue, however.
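As a small illustration of the difference, the sketch below represents the output as templates; the field names follow the seminar announcements in Figure 21.2, but the data structure itself is a hypothetical simplification.

    # A single-event document fills one template; a multi-event document
    # requires one template per event. Field names are illustrative.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class SeminarTemplate:
        speaker: Optional[str] = None
        location: Optional[str] = None
        start_time: Optional[str] = None

    # Single-event document: every extracted value goes into the same template.
    single = SeminarTemplate(speaker="Laura Petitte",
                             location="Baker Hall 355",
                             start_time="12:00 pm")

    # Multi-event document: discourse analysis must decide how many events
    # there are and attach each extracted value to the right template.
    multi: List[SeminarTemplate] = [
        SeminarTemplate(speaker="Laura Petitte", start_time="12:00 pm"),
        SeminarTemplate(speaker="Dr. Jeffrey D. Hermes", start_time="3:30 p.m."),
    ]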

