
21.4 Learning-based Approaches to IE


As we discussed in Section 21.3, early information extraction systems used hand-crafted patterns and rules, often encoded in cascaded finite-state transducers. Hand-built IE systems were effective, but manually creating the patterns and rules was extremely time-consuming. For example, it was estimated that it took approximately 1500 person-hours of effort to create the patterns used by the UMass MUC-4 system (Riloff 1993; Lehnert, Cardie, Fisher, McCarthy, Riloff, and Soderland 1992).

Consequently, researchers began to use statistical techniques and machine learning algorithms to automatically create IE systems for new domains. In the following sections, we overview four types of learning-based IE methods: supervised learning of patterns and rules, supervised learning of sequential IE classifiers, weakly supervised and unsupervised learning methods for IE, and learning-based methods for more global, discourse-oriented approaches to IE.

21.4.1 Supervised Learning of Extraction Patterns and Rules

Supervised learning methods originally promised to dramatically reduce the knowledge engineering bottleneck required to create an IE system for a new domain. Instead of painstakingly writing patterns and rules by hand, knowledge engineering could be reduced to the manual annotation of a collection of training texts. The hope was that a training set could be annotated in a matter of weeks, and nearly anyone with knowledge of the domain could do the annotation work.6 As we will acknowledge in Section 21.4.3, manual annotation is itself a substantial endeavor, and a goal of recent research efforts is to eliminate this bottleneck as well. But supervised learning methods were an important first step toward automating the creation of information extraction systems.

6 In contrast, creating IE patterns and rules by hand typically requires computational linguists who understand how the patterns or rules will be integrated into the NLP system.

The earliest pattern learning systems used specialized techniques, sometimes coupled with small amounts of manual effort. AutoSlog (Riloff 1993) and PALKA (Kim and Moldovan 1993) were the first IE pattern learning systems. AutoSlog (Riloff 1993; Riloff 1996a) matches a small set of syntactic templates against the text surrounding a desired extraction and creates one (or more) lexico-syntactic patterns by instantiating the templates with the corresponding words in the sentence. A "human in the loop" must then manually review the patterns to decide which ones are appropriate for the IE task. PALKA (Kim and Moldovan 1993) uses manually defined frames and keywords that are provided by a user and creates IE patterns by mapping clauses containing the keywords onto the frame's slots. The patterns are generalized based on the semantic features of the words.
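To make the instantiation step concrete, the following minimal Python sketch mimics how a template such as "<subject> passive-verb" or "active-verb <direct-object>" could be filled in from a clause containing an annotated extraction. The Clause structure, the two templates shown, and the function name are illustrative inventions, not AutoSlog's actual implementation.

from dataclasses import dataclass

@dataclass
class Clause:
    """A simplified clause analysis: one verb plus its arguments."""
    subject: str   # head noun of the subject NP
    verb: str      # main verb
    dobj: str      # head noun of the direct object NP, or "" if none
    passive: bool  # whether the verb is in the passive voice

def instantiate_patterns(clause: Clause, target: str) -> list:
    """Instantiate syntactic templates with the clause's words, anchored
    on the target extraction (the annotated role filler)."""
    patterns = []
    if clause.subject == target:
        voice = "passive" if clause.passive else "active"
        patterns.append(f"<subject> {clause.verb} ({voice})")
    if clause.dobj == target:
        patterns.append(f"{clause.verb} <direct-object>")
    return patterns

# "terrorists bombed the embassy", with "embassy" annotated as a target:
clause = Clause(subject="terrorists", verb="bombed", dobj="embassy", passive=False)
print(instantiate_patterns(clause, "embassy"))  # ['bombed <direct-object>']

A learned pattern such as "bombed <direct-object>" would then be shown to the human reviewer described above.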

Several systems use rule learning algorithms to automatically generate IE patterns from annotated text corpora. LIEP (Huffman 1996) creates candidate patterns by identifying syntactic paths that relate the role fillers in a sentence. The patterns that perform well on training examples are kept, and as learning progresses they are generalized to accommodate new training examples by creating disjunctions of terms. CRYSTAL (Soderland, Fisher, Aseltine, and Lehnert 1995) learns extraction rules using a unification-based covering algorithm. CRYSTAL's rules are "concept node" structures that include lexical, syntactic, and semantic constraints. WHISK (Soderland 1999) was an early system specifically designed to be flexible enough to handle structured, semi-structured, and unstructured texts. WHISK learns regular expression rules that consist of words, semantic classes, and wildcards that match any token. (LP)² (Ciravegna 2001) induces two different kinds of IE rules: tagging rules to label instances as desired extractions, and correction rules to correct mistakes made by the tagging rules. Freitag created a rule-learning system called SRV (Freitag 1998b) and later combined it with a rote learning mechanism and a Naive Bayes classifier to explore a multi-strategy approach to IE (Freitag 1998a).
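The flavor of a WHISK-style rule can be illustrated with a toy matcher over token sequences, where a rule mixes literal words, semantic-class symbols, and wildcards. The rule syntax, the tiny semantic lexicon, and the matcher below are invented for exposition and are much simpler than WHISK's actual rule language.

from typing import Optional

# Rule elements: a literal word, a semantic class such as "@WEAPON",
# a wildcard "*" (matches any single token), or "(SLOT)" marking the
# token to extract.
SEMANTIC_LEXICON = {"@WEAPON": {"bomb", "bombs", "grenade"}}  # illustrative

def match_rule(rule: list, tokens: list) -> Optional[str]:
    """Try the rule at each start position; return the token captured
    by the (SLOT) element, or None if the rule never matches."""
    for start in range(len(tokens) - len(rule) + 1):
        captured = None
        for elem, tok in zip(rule, tokens[start:start + len(rule)]):
            if elem == "*":
                continue                       # wildcard: any token
            if elem.startswith("@"):
                if tok not in SEMANTIC_LEXICON.get(elem, set()):
                    break                      # semantic class mismatch
            elif elem.startswith("("):
                captured = tok                 # capture the slot filler
            elif elem != tok:
                break                          # literal word mismatch
        else:
            return captured
    return None

tokens = "commandos launched two highpower bombs".split()
print(match_rule(["launched", "*", "*", "(WEAPON)"], tokens))  # bombs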

Relational learning methods have also been used to learn rule-like structures for IE (e.g., Roth and Yih 2001; Califf and Mooney 2003; Bunescu and Mooney 2004; Bunescu and Mooney 2007). RAPIER (Califf and Mooney 1999; Califf and Mooney 2003) uses relational learning methods to generate IE rules, where each rule has a pre-filler, filler, and post-filler component. Each component is a pattern that consists of words, POS tags, and semantic classes. Roth and Yih (Roth and Yih 2001) propose a knowledge representation language for propositional relations and create a two-stage classifier that first identifies candidate extractions and then selects the best ones. Bunescu and Mooney (Bunescu and Mooney 2004) use Relational Markov Networks to represent dependencies and influences across entities and extractions.

IE pattern learning methods have also been developed for related applications such as question answering (Ravichandran and Hovy 2002), where the goal is to learn patterns for specific types of questions that involve relations between entities (e.g., identifying the birth year of a person).

21.4.2 Supervised Learning of Sequential Classifier Models

An alternative approach views information extraction as a classification problem that can be tackled using sequential learning models. Instead of using explicit patterns or rules to extract information, a machine learning classifier is trained to sequentially scan text from left to right and label each word as an extraction or a non-extraction. A typical labeling scheme is called IOB, where each word is classified as an 'I' if it is inside a desired extraction, 'O' if it is outside a desired extraction, or 'B' if it is the beginning of a desired extraction. The sentence below has been labeled with IOB tags corresponding to phrases that should be extracted as facts about a bombing incident.

Alleged/B guerrilla/I urban/I commandos/I launched/O two/B highpower/I bombs/I against/O a/B car/I dealership/I in/O downtown/O San/B Salvador/I this/B morning/I.

In the example above, the IOB tags indicate that five phrases should be extracted: "Alleged guerrilla urban commandos", "two highpower bombs", "a car dealership", "San Salvador", and "this morning". Note that the 'B' tag is important to demarcate where one extraction begins and another one ends, particularly when two extractions are adjacent. For example, if only 'I' and 'O' tags were used, then "San Salvador" and "this morning" would run together and appear to be a single extraction. Depending on the learning model, a different classifier may be trained for each type of information to be extracted (e.g., one classifier might be trained to identify perpetrator extractions, and another classifier may be trained to identify location extractions). Or a single classifier can be trained to produce different types of IOB tags for the different kinds of role fillers (e.g., B-perpetrator and B-location) (Chieu and Ng 2002).
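Once a classifier has produced IOB tags, the tags must be decoded back into text spans. Here is a minimal sketch of such a decoder (the helper name is ours):

def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into extracted phrases: 'B' starts a new
    span, 'I' extends the current one, and 'O' closes it."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                   # beginning of a new extraction
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:     # continue the current extraction
            current.append(token)
        else:                            # 'O' (or a stray 'I'): close span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ("Alleged guerrilla urban commandos launched two highpower bombs "
          "against a car dealership in downtown San Salvador this morning").split()
tags = ["B", "I", "I", "I", "O", "B", "I", "I", "O", "B", "I", "I",
        "O", "O", "B", "I", "B", "I"]
print(iob_to_spans(tokens, tags))
# ['Alleged guerrilla urban commandos', 'two highpower bombs',
#  'a car dealership', 'San Salvador', 'this morning']

Note that the decoder recovers "San Salvador" and "this morning" as two separate spans precisely because of the second 'B' tag.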

A variety of sequential classifier models have been developed using Hidden Markov Models (Freitag and McCallum 2000; Yu, Guan, and Zhou 2005; Gu and Cercone 2006), Maximum Entropy Classifiers (Chieu and Ng 2002), Conditional Random Fields (Peng and McCallum 2004; Choi, Cardie, Riloff, and Patwardhan 2005), and Support Vector Machines (Zelenko, Aone, and Richardella 2003; Finn and Kushmerick 2004; Li, Bontcheva, and Cunningham 2005; Zhao and Grishman 2005). Freitag and McCallum (Freitag and McCallum 2000) use Hidden Markov Models and develop a method to automatically explore different structures for the HMM during the learning process. Gu and Cercone (Gu and Cercone 2006) use HMMs in a two-step IE process: one HMM retrieves relevant text segments that likely contain a filler, and a second HMM identifies the words to be extracted in these text segments. Finn and Kushmerick (Finn and Kushmerick 2004) also use a two-step IE process but in a different way: one SVM classifier identifies start and end tags for extractions, and a second SVM looks at tags that were orphaned (i.e., a start tag was found without a corresponding end tag, or vice versa) and tries to identify the missing tag. The second classifier aims to improve IE recall by producing extractions that otherwise would have been missed. Yu et al. (Yu, Guan, and Zhou 2005) created a cascaded model of HMMs and SVMs. In the first pass, an HMM segments resumes into blocks that represent different types of information. In the second pass, HMMs and SVMs extract information from the blocks, with different classifiers trained to extract different types of information.
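As one example of the machinery involved, an HMM-based extractor assigns the most likely IOB tag sequence using Viterbi decoding. The sketch below assumes start, transition, and emission probabilities (start_p, trans_p, emit_p) that a real system would estimate from the annotated training corpus; the smoothing constant for unseen words is an arbitrary illustrative choice.

import math

TAGS = ["B", "I", "O"]

def viterbi(tokens, start_p, trans_p, emit_p):
    """Return the most likely tag sequence under a first-order HMM.
    start_p[t], trans_p[prev][t], and emit_p[t][word] are probabilities
    estimated from IOB-annotated training data."""
    def logp(p):  # guard against zero probabilities
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: best log-probability of any tag sequence for
    # tokens[:i+1] that ends in tag t; back[i][t]: its predecessor tag.
    best = [{t: logp(start_p[t]) + logp(emit_p[t].get(tokens[0], 1e-6))
             for t in TAGS}]
    back = [{}]
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in TAGS:
            prev = max(TAGS, key=lambda p: best[i - 1][p] + logp(trans_p[p][t]))
            best[i][t] = (best[i - 1][prev] + logp(trans_p[prev][t])
                          + logp(emit_p[t].get(tokens[i], 1e-6)))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(TAGS, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

Giving the O-to-I transition zero probability enforces the constraint that every extraction must begin with a 'B' tag.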

The chapter on Fundamental Statistical Techniques in this book explains how to create classifiers and sequential prediction models using supervised learning techniques.

21.4.3 Weakly Supervised and Unsupervised Approaches

Supervised learning techniques substantially reduced the manual effort required to create an IE system for a new domain. However, annotating training texts still requires a substantial investment of time, and annotating documents for information extraction can be deceptively complex (Riloff 1996b). Furthermore, since IE systems are domain-specific, annotated corpora cannot be reused: a new corpus must be annotated for each domain.

To further reduce the knowledge engineering required to create an IE system, several methods have been developed in recent years to learn extraction patterns using weakly supervised and unsupervised techniques. AutoSlog-TS (Riloff 1996b) is a derivative of AutoSlog that requires as input only a preclassified training corpus in which texts are identified as relevant or irrelevant with respect to the domain but are not annotated in any other way. AutoSlog-TS's learning algorithm is a two-step process. In the first step, AutoSlog's syntactic templates are applied exhaustively to the training corpus, which generates a large set of candidate extraction patterns. In the second step, the candidate patterns are ranked based on the strength of their association with the relevant texts. Ex-Disco (Yangarber, Grishman, Tapanainen, and Huttunen 2000) took this approach one step further by eliminating the need for a preclassified text corpus. Ex-Disco uses a small set of manually defined seed patterns to partition a collection of unannotated texts into relevant and irrelevant sets. The pattern learning process is then embedded in a bootstrapping loop where (1) patterns are ranked based on the strength of their association with the relevant texts, (2) the best pattern(s) are selected and added to the pattern set, and (3) the corpus is re-partitioned into new relevant and irrelevant sets. Both AutoSlog-TS and Ex-Disco produce IE patterns that performed well in comparison to pattern sets used by previous IE systems. However, the ranked pattern lists produced by these systems still need to be manually reviewed.7

7 The human reviewer discards patterns that are not relevant to the IE task and assigns an event role to the patterns that are kept.

Stevenson and Greenwood (Stevenson and Greenwood 2005) also begin with seed patterns and use semantic similarity measures to iteratively rank and select new candidate patterns based on their similarity to the seeds. Stevenson and Greenwood use predicate-argument structures as the representation for their IE patterns, as did Surdeanu et al. (Surdeanu, Harabagiu, Williams, and Aarseth 2003) and Yangarber (Yangarber 2003) in earlier work. Sudo et al. (Sudo, Sekine, and Grishman 2003) created an even richer subtree model representation for IE patterns, where an IE pattern can be an arbitrary subtree of a dependency tree. The subtree patterns are learned from relevant and irrelevant training documents. Bunescu and Mooney (Bunescu and Mooney 2007) developed a weakly supervised method for relation extraction that uses Multiple Instance Learning (MIL) techniques with SVMs and string kernels.

Meta-bootstrapping (Riloff and Jones 1999) is a bootstrapping method that simultaneously learns information extraction patterns and generates noun phrases that belong to a semantic class. Given a few seed nouns that belong to a targeted semantic class, the meta-bootstrapping algorithm iteratively learns a new extraction pattern and then uses the learned pattern to hypothesize additional nouns that belong to the semantic class. The patterns learned by meta-bootstrapping are more akin to named entity recognition patterns than event role patterns, however, because they identify noun phrases that belong to general semantic classes, irrespective of any events.
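A schematic of the underlying bootstrapping loop appears below; the scoring function is a simplified stand-in for the metrics in the original paper, and the data structures are invented for exposition.

def meta_bootstrap(seeds, corpus_patterns, iterations=10):
    """Alternately grow a semantic lexicon and an extraction pattern set.
    corpus_patterns maps each candidate pattern to the set of noun
    phrases it extracts somewhere in the corpus."""
    lexicon = set(seeds)
    chosen = []
    for _ in range(iterations):
        # Score each unused pattern by how many known category members
        # it extracts (a simplified stand-in for the original scoring).
        candidates = [p for p in corpus_patterns if p not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda p: len(corpus_patterns[p] & lexicon))
        if not corpus_patterns[best] & lexicon:
            break                          # no remaining pattern fits the class
        chosen.append(best)
        lexicon |= corpus_patterns[best]   # hypothesize new class members
    return chosen, lexicon

The "meta" level of the original algorithm adds a further filter, retaining only the most reliable hypothesized nouns after each iteration, which reduces semantic drift.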

Recently, Phillips and Riloff (Phillips and Riloff 2007) showed that bootstrapping methods can be used to learn event role patterns by exploiting role-identifying nouns as seeds. A role-identifying noun is a word that, by virtue of its lexical semantics, identifies the role that the noun plays with respect to an event. For example, the word kidnapper, by definition, refers to the agent of a kidnapping event. By using role-identifying nouns as seeds, the Basilisk bootstrapping algorithm (Thelen and Riloff 2002) can be used to learn both event extraction patterns and additional role-identifying nouns.

Finally, Shinyama and Sekine (Shinyama and Sekine 2006) have developed an approach for completely unsupervised learning of information extraction patterns. Given texts for a new domain, relation discovery methods are used to preemptively learn the types of relations that appear in domain-specific documents. The On-Demand Information Extraction (ODIE) system (Sekine 2006) accepts a user query for a topic, dynamically learns IE patterns for salient relations associated with the topic, and then applies the patterns to fill in a table with extracted information related to the topic.

21.4.4 Discourse-oriented Approaches to IE

Most of the IE systems that we have discussed thus far take a relatively localized approach to information extraction. The IE patterns or classifiers focus only on the local context surrounding a word or phrase when making an extraction decision. Recently, some systems have begun to take a more global view of the extraction process. Gu and Cercone (Gu and Cercone 2006) and Patwardhan and Riloff (Patwardhan and Riloff 2007) use classifiers to first identify the event-relevant sentences in a document and then apply an IE system to extract information from those relevant sentences.

Finkel et al. (Finkel, Grenager, and Manning 2005) impose penalties in their learning model to enforce label consistency among extractions from different parts of a document. Maslennikov and Chua (Maslennikov and Chua 2007) use dependency and RST-based discourse relations to connect entities in different clauses and find long-distance dependency relations.

Finally, as we discussed in Section 21.3.5, IE systems that process multiple-event documents need to generate multiple templates. Template generation for multiple events is extremely challenging, and only a few learning systems have been developed to automate this process for new domains. WRAP-UP (Soderland and Lehnert 1994) was an early supervised learning system that uses a collection of decision trees to make a series of discourse decisions to automate the template generation process. More recently, Chieu et al. (Chieu, Ng, and Lee 2003) developed a system called ALICE that generates complete templates for the MUC-4 terrorism domain (MUC-4 Proceedings 1992). ALICE uses a set of classifiers that identify extractions for each type of slot and a template manager to decide when to create a new template. The template manager uses general-purpose rules (e.g., a conflicting date will spawn a new template) as well as automatically derived "seed words" that are associated with different incident types to distinguish between events.
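A template manager of this kind can be sketched as a simple decision procedure. The conflicting-date rule follows the example just given; the dictionary-based template representation and the function name are illustrative, not ALICE's actual design.

def assign_to_template(extraction, templates):
    """Place an extraction (e.g., {'slot': 'date', 'value': '12 Jan'})
    into a compatible event template, or spawn a new one on conflict."""
    slot, value = extraction["slot"], extraction["value"]
    for template in templates:
        existing = template.get(slot)
        if existing is None or existing == value:
            template[slot] = value        # compatible: merge into template
            return template
    new_template = {slot: value}          # conflicts everywhere: new event
    templates.append(new_template)
    return new_template

templates = []
assign_to_template({"slot": "date", "value": "12 Jan"}, templates)
assign_to_template({"slot": "perp", "value": "guerrillas"}, templates)
assign_to_template({"slot": "date", "value": "15 Jan"}, templates)  # conflict
print(len(templates))  # 2: the conflicting date spawned a second template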

