
IE with Cascaded Finite-State Transducers

In Chapter 21 (pages 8-16)

Probably the most important idea that emerged in the course of the MUC evaluations was the decomposition of the IE process into a series of subproblems that can be modeled with “cascaded finite-state transducers” (Lehnert, Cardie, Fisher, Riloff, and Williams 1991; Hobbs, Appelt, Bear, Israel, and Tyson 1992; Hobbs, Appelt, Bear, Israel, Kameyama, Stickel, and Tyson 1997; Joshi 1996; Cunningham, Maynard, Bontcheva, and Tablan 2002). A finite-state automaton reads a sequence of elements one element at a time; each element transitions the automaton into a new state, based on the type of element it is, e.g., the part of speech of a word. Some states are designated as final, and a final state is reached when the sequence of elements matches a valid pattern. In a finite-state transducer, an output entity is constructed when final states are reached, e.g., a representation of the information in a phrase. In a cascaded finite-state transducer, there are different finite-state transducers at different stages. Earlier stages will package a string of elements into something the next stage will view as a single element.
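The cascading idea can be captured in a minimal sketch: each stage is a function that reads the previous stage's output and emits a coarser sequence, which the next stage treats as single elements. The stage functions and the multiword lexicon below are illustrative only, not part of any described system.

```python
# A minimal sketch of a cascade: each stage consumes a sequence of
# elements and emits a coarser one for the next stage.

def stage_words_to_tokens(text):
    """Stage 0: split raw text into word tokens."""
    return text.split()

def stage_tokens_to_multiwords(tokens,
                               multiwords=frozenset({("set", "up"), ("joint", "venture")})):
    """Stage 1: package known multiword units into single elements."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in multiwords:
            out.append(tokens[i] + "_" + tokens[i + 1])  # one element downstream
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def cascade(text, stages):
    """Run the stages in order; each consumes the previous stage's output."""
    result = text
    for stage in stages:
        result = stage(result)
    return result

pipeline = [stage_words_to_tokens, stage_tokens_to_multiwords]
print(cascade("they set up a joint venture", pipeline))
# ['they', 'set_up', 'a', 'joint_venture']
```

Later stages in a real cascade would consume these packaged elements in the same way, one per transition.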

In the typical system, the earlier stages recognize smaller linguistic objects and work in a largely domain-independent fashion. They use purely linguistic knowledge to recognize portions of the syntactic structure of a sentence that linguistic methods can determine reliably, requiring relatively little modification or augmentation as the system is moved from domain to domain. The later stages take these linguistic objects as input and find domain-dependent patterns within them. In a typical IE system, there are five levels of processing:

1. Complex Words: This includes the recognition of multiwords and proper name entities, such as people, companies, and countries.

2. Basic Phrases: Sentences are segmented into noun groups, verb groups, and particles.

3. Complex Phrases: Complex noun groups and complex verb groups are identified.


4. Domain Events: The sequence of phrases produced at Level 3 is scanned for patterns of interest to the application, and when they are found, semantic structures are built that encode the information about entities and events contained in the pattern.

5. Merging Structures: Semantic structures from different parts of the text are merged if they provide information about the same entity or event.

This process is sometimes called template generation, and is a complex process not done by a finite-state transducer.

As we progress through the five levels, larger segments of text are analyzed and structured. In each of stages 2 through 4, the input to the finite-state transducer is the sequence of chunks constructed in the previous stage. The GATE project (Cunningham, Maynard, Bontcheva, and Tablan 2002) is a widely used toolkit that provides many of the components needed for such an IE pipeline.

This decomposition of the natural-language problem into levels is essential to the approach. Many systems have been built to do pattern matching on strings of words. The advances in information extraction have depended crucially on dividing that process into separate levels for recognizing phrases and recognizing patterns among the phrases. Phrases can be recognized reliably with purely syntactic information, and they provide precisely the elements that are required for stating the patterns of interest.

In the next five sections we illustrate this process on the Bridgestone Sports text.

21.3.1 Complex Words

The first level of processing identifies multiwords such as “set up”, “trading house”, “new Taiwan dollars”, and “joint venture”, and company names like “Bridgestone Sports Co.” and “Bridgestone Sports Taiwan Co.”. The names of people and locations, dates, times, and other basic entities are also recognized at this level. Languages in general are very productive in the construction of short, multiword fixed phrases and proper names; specialized microgrammars recognize them at this level.

Some names can be recognized by their internal structure. A common pattern for company names is “ProperName ProductName”, as in “Acme Widgets”. Others can only be recognized by means of a table. Internal structure cannot tell us that IBM is a company and DNA is not. It is also sometimes possible to recognize the types of proper names by the context in which they occur. For example, in the sentences below:

(a) XYZ’s sales

(b) Vaclav Havel, 53, president of the Czech Republic

we might not know that XYZ is a company and Vaclav Havel is a person, but the immediate context establishes that. These can be given an underspecified representation that is resolved by later stages.
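The three routes to classifying a proper name, table lookup, internal structure, and immediate context, might be sketched as follows. The table entries, the internal-structure regex, and the context rules here are invented for illustration and are far simpler than any real named-entity recognizer.

```python
import re

KNOWN_COMPANIES = {"IBM"}  # route 1: table lookup (illustrative entries only)

def classify_name(name, context=""):
    """Classify a proper name, falling back to an underspecified label."""
    if name in KNOWN_COMPANIES:
        return "company"
    # Route 2: internal structure, e.g. the "ProperName ProductName" pattern.
    if re.fullmatch(r"[A-Z][a-z]+ (Widgets|Sports|Motors)", name):
        return "company"
    # Route 3: immediate context, e.g. "XYZ's sales" or "Name, 53, president".
    if re.search(re.escape(name) + r"'s sales", context):
        return "company"
    if re.search(re.escape(name) + r", \d+, ", context):
        return "person"
    return "underspecified"  # to be resolved by later stages

print(classify_name("XYZ", "XYZ's sales"))        # company
print(classify_name("Vaclav Havel",
                    "Vaclav Havel, 53, president of the Czech Republic"))  # person
print(classify_name("ABC"))                        # underspecified
```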

21.3.2 Basic Phrases

The problem of syntactic ambiguity in natural language is AI-complete. That is, we will not have systems that reliably parse English sentences correctly until we have encoded much of the real-world knowledge that people bring to bear in their language comprehension. For example, noun phrases cannot be reliably identified because of the prepositional phrase attachment problem. However, certain syntactic constructs can be identified with reasonable reliability. One of these is the noun group, which is the head noun of a noun phrase together with its determiners and other left modifiers (these are sometimes called “base NPs”). Another is what we are calling the “verb group”, that is, the verb together with its auxiliaries and any intervening adverbs. Moreover, an analysis that identifies these elements gives us exactly the units we most need for subsequent domain-dependent processing. The task of identifying these simple noun and verb groups is sometimes called “syntactic chunking”. The basic phrases in the first sentence of text (1) are as follows, where “Company Name” and “Location” are special kinds of noun group that would be identified by named entity recognition:

Company Name: Bridgestone Sports Co.

Verb Group: said

Noun Group: Friday

Noun Group: it

Verb Group: had set up

Noun Group: a joint venture

Preposition: in

Location: Taiwan

Preposition: with

Noun Group: a local concern

Conjunction: and

Noun Group: a Japanese trading house

Verb Group: to produce

Noun Group: golf clubs

Verb Group: to be shipped

Preposition: to

Location: Japan
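A toy chunker over part-of-speech-tagged input illustrates how a list of basic phrases like the one above might be produced. The tagset and the grouping rules below are simplified assumptions, not the grammar of any actual system.

```python
# A toy syntactic chunker: a noun group is a determiner/adjective run
# ending in nouns; a verb group is auxiliaries and adverbs plus verbs.
# Pronouns (PRP) form their own noun group. Other tags pass through.

def chunk(tagged):
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        word, tag = tagged[i]
        if tag in ("DT", "JJ", "NN", "PRP"):
            if tag == "PRP":                      # pronoun: one-word noun group
                j = i + 1
            else:
                j = i
                while j < n and tagged[j][1] in ("DT", "JJ", "NN"):
                    j += 1
            chunks.append(("NounGroup", " ".join(w for w, _ in tagged[i:j])))
            i = j
        elif tag in ("AUX", "VB"):
            j = i
            while j < n and tagged[j][1] in ("AUX", "RB", "VB"):
                j += 1
            chunks.append(("VerbGroup", " ".join(w for w, _ in tagged[i:j])))
            i = j
        else:                                     # prepositions, conjunctions, ...
            chunks.append((tag, word))
            i += 1
    return chunks

sent = [("it", "PRP"), ("had", "AUX"), ("set_up", "VB"),
        ("a", "DT"), ("joint_venture", "NN"), ("in", "IN"), ("Taiwan", "NN")]
print(chunk(sent))
# [('NounGroup', 'it'), ('VerbGroup', 'had set_up'),
#  ('NounGroup', 'a joint_venture'), ('IN', 'in'), ('NounGroup', 'Taiwan')]
```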

Noun groups can be recognized by a relatively simple finite-state grammar encompassing most of the complexity that can occur in English noun groups (Hobbs et al., 1992), including numbers, numerical modifiers like “approximately”, other quantifiers and determiners, participles in adjectival position, comparative and superlative adjectives, conjoined adjectives, and arbitrary orderings and conjunctions of prenominal nouns and noun-like adjectives. Thus, among the noun groups that can be recognized are:

“approximately 5 kg”

“more than 30 people”

“the newly elected president”

“the largest leftist political force”

“a government and commercial project”

The principal ambiguities that arise in this stage are due to noun-verb ambiguities. For example, “the company names” could be a single noun group with the head noun “names”, or it could be a noun group “the company” followed by the verb “names”. One can use a lattice representation to encode the two analyses and resolve the ambiguity in the stage for recognizing domain events.
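The lattice idea can be sketched as keeping both chunkings of the ambiguous span and letting a later stage choose between them. The representation and the resolution heuristic below are illustrative assumptions.

```python
# Sketch of a lattice for a noun-verb ambiguity: each path through the
# lattice is one possible list of (type, text) chunks.

def noun_verb_lattice(words, ambiguous_word):
    """Return both analyses of a span like 'the company names'."""
    reading_a = [("NounGroup", " ".join(words))]            # head noun = 'names'
    reading_b = [("NounGroup", " ".join(words[:-1])),       # 'the company'
                 ("VerbGroup", ambiguous_word)]             # verb 'names'
    return [reading_a, reading_b]

def resolve(lattice, followed_by_noun_group):
    # Later, the domain-event stage picks the reading that fits a pattern;
    # here, if a noun group follows, 'names' is more likely the verb.
    return lattice[1] if followed_by_noun_group else lattice[0]

lattice = noun_verb_lattice(["the", "company", "names"], "names")
print(resolve(lattice, followed_by_noun_group=True))
# [('NounGroup', 'the company'), ('VerbGroup', 'names')]
```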

Verb groups (and predicate adjective constructions) can be recognized by an even simpler finite-state grammar that, in addition to chunking, also tags them as Active Voice, Passive Voice, Gerund, and Infinitive. Verbs are sometimes locally ambiguous between active and passive senses, as the verb “kidnapped” in the following two sentences:

“Several men kidnapped the mayor today.”

“Several men kidnapped yesterday were released today.”

These cases can be tagged as Active/Passive, and the domain-event stage can later resolve the ambiguity. Some work has also been done to train a classifier to distinguish between active voice and “reduced” passive voice constructions (Igo and Riloff 2008).

The breakdown of phrases into nominals, verbals, and particles is a linguistic universal. Whereas the precise parts of speech that occur in any language can vary widely, every language has elements that are fundamentally nominal in character, elements that are fundamentally verbal or predicative, and particles or inflectional affixes that encode relations among the other elements (Croft 1991).

21.3.3 Complex Phrases

Some complex noun groups and verb groups can be recognized reliably on the basis of domain-independent, syntactic information. For example:

• the attachment of appositives to their head noun group

“The joint venture, Bridgestone Sports Taiwan Co.,”

• the construction of measure phrases

“20,000 iron and ‘metal wood’ clubs a month”

• the attachment of “of” and “for” prepositional phrases to their head noun groups

“production of 20,000 iron and ‘metal wood’ clubs a month”

• noun group conjunction

“a local concern and a Japanese trading house”

In the course of recognizing basic and complex phrases, domain-relevant entities and events can be recognized and the structures for these can be constructed. In the sample joint-venture text, entity structures can be constructed for the companies referred to by the phrases “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”, and “Bridgestone Sports Taiwan Co.” Information about nationality derived from the words “local” and “Japanese” can be recorded. Corresponding to the complex noun group “The joint venture, Bridgestone Sports Taiwan Co.,” the following relationship structure can be built:

Relationship: TIE-UP

Entities: –

Joint Venture Company: “Bridgestone Sports Taiwan Co.”

Activity: –

Amount: –

Corresponding to the complex noun group “production of 20,000 iron and ‘metal wood’ clubs a month”, the following activity structure can be built up:

Activity: PRODUCTION

Company: –

Product: “iron and ‘metal wood’ clubs”

Start Date: –
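The TIE-UP and PRODUCTION structures above might be represented as simple record types. The field names below follow the figures, but the classes themselves are an assumption for illustration; unfilled slots (“–”) become `None`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative record types for the template structures in the figures.

@dataclass
class TieUp:
    entities: List[str] = field(default_factory=list)
    joint_venture_company: Optional[str] = None
    activity: Optional[str] = None
    amount: Optional[str] = None

@dataclass
class Production:
    company: Optional[str] = None
    product: Optional[str] = None
    start_date: Optional[str] = None

tie_up = TieUp(joint_venture_company="Bridgestone Sports Taiwan Co.")
activity = Production(product="iron and 'metal wood' clubs")
print(tie_up.joint_venture_company, "/", activity.product)
```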

Complex verb groups can also be recognized in this stage. Consider the following variations:

“GM formed a joint venture with Toyota.”

“GM announced it was forming a joint venture with Toyota.”

“GM signed an agreement forming a joint venture with Toyota.”

“GM announced it was signing an agreement to form a joint venture with Toyota.”

Although these sentences may differ in significance for some applications, often they would be considered equivalent in meaning. Rather than defining each of these variations, with all their syntactic variants, at the domain event level, the user should be able to define complex verb groups that share the same significance. Thus, “formed”, “announced it was forming”, “signed an agreement forming”, and “announced it was signing an agreement to form” may all be equivalent, and once they are defined to be so, only one domain event pattern needs to be expressed. Verb group conjunction, as in

“Terrorists kidnapped and killed three people.”

can be treated as a complex verb group as well.

21.3.4 Domain Events

The next stage is recognizing domain events, and its input is a list of the basic and complex phrases recognized in the earlier stages, in the order in which they occur. Anything that was not identified as a basic or complex phrase in a previous stage can be ignored in this stage; this can be a significant source of robustness.

Identifying domain events requires a set of domain-specific patterns both to recognize phrases that correspond to an event of interest and to identify the syntactic constituents that correspond to the event’s role fillers. In early information extraction systems, these domain-specific “extraction patterns” were defined manually. In Sections 21.4.1 and 21.4.3, we describe a variety of learning methods that have subsequently been developed to automatically generate domain-specific extraction patterns from training corpora.

The patterns for events of interest can be encoded as finite-state machines, where state transitions are effected by phrases. The state transitions are driven off the head words in the phrases. That is, each pair of relevant head word and phrase type—such as “company-NounGroup” and “formed-PassiveVerbGroup”—has an associated set of state transitions. In the sample joint-venture text, the domain event patterns

<Company/ies> <Set-up> <Joint-Venture> with <Company/ies>

and

<Produce> <Product>

would be instantiated in the first sentence, and the patterns

<Company> <Capitalized> at <Currency>

and

<Company> <Start> <Activity> in/on <Date>

in the second. These four patterns would result in the following four structures being built:

Relationship: TIE-UP

Entities: “Bridgestone Sports Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company: –

Activity: –

Amount: –

Activity: PRODUCTION

Company: –

Product: “golf clubs”

Start Date: –

Relationship: TIE-UP

Entities: –

Joint Venture Company: “Bridgestone Sports Taiwan Co.”

Activity: –

Amount: NT$20000000

Activity: PRODUCTION

Company: “Bridgestone Sports Taiwan Co.”

Product: –

Start Date: DURING: January 1990

The third of these is an augmentation of the TIE-UP structure discovered in the complex phrase phase.
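The head-word-driven state transitions described above might be sketched as follows for the first TIE-UP pattern. The state machine, the set-up lexicon, and the chunk format are simplified assumptions; a real system would compile many such patterns and handle far more variation.

```python
# Sketch of matching <Company> <Set-up> <Joint-Venture> with <Company>
# over a chunk list, with transitions keyed on (head word, phrase type).
# Irrelevant chunks (e.g. "in Taiwan") simply leave the state unchanged,
# which is one source of the robustness noted in the text.

SET_UP_VERBS = {"set_up", "formed", "established"}

def match_tie_up(chunks):
    state, result = 0, {"Entities": []}
    for ctype, text in chunks:
        head = text.split()[-1]                  # head word of the phrase
        if state == 0 and ctype == "CompanyName":
            result["Entities"].append(text); state = 1
        elif state == 1 and ctype == "VerbGroup" and head in SET_UP_VERBS:
            state = 2
        elif state == 2 and ctype == "NounGroup" and "venture" in text:
            state = 3
        elif state == 3 and (ctype, text) == ("Preposition", "with"):
            state = 4
        elif state == 4 and ctype in ("CompanyName", "NounGroup"):
            result["Entities"].append(text)      # stay in state 4 for conjuncts
    return result if state == 4 else None

chunks = [("CompanyName", "Bridgestone Sports Co."), ("VerbGroup", "had set_up"),
          ("NounGroup", "a joint venture"), ("Preposition", "in"),
          ("Location", "Taiwan"), ("Preposition", "with"),
          ("NounGroup", "a local concern"), ("Conjunction", "and"),
          ("NounGroup", "a Japanese trading house")]
print(match_tie_up(chunks))
```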

Certain kinds of “pseudo-syntax” can be done at this stage, including recognizing relative clauses and conjoined verb phrases, as described in Hobbs et al. (1997).

Many subject-verb-object patterns are of course related to each other. The sentence:

“GM manufactures cars.”

illustrates a general pattern for recognizing a company’s activities. But the same semantic content can appear in a variety of ways, including

“Cars are manufactured by GM.”

“. . . GM, which manufactures cars. . .”

“. . . cars, which are manufactured by GM. . .”

“. . . cars manufactured by GM . . .”

“GM is to manufacture cars.”

“Cars are to be manufactured by GM.”

“GM is a car manufacturer.”

These are all systematically related to the active voice form of the sentence.

Therefore, there is no reason a developer should have to specify all the variations. A simple tool would be able to generate all of the variants of the pattern from the simple active voice Subject-Verb-Object form. It would also allow adverbials to appear at appropriate points. These transformations would be executed at compile time, producing the more detailed set of patterns, so that at run time there is no loss of efficiency.
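Such a compile-time expansion tool might be sketched as follows. The variant inventory mirrors the examples above, but the function and its naive participle morphology are assumptions; a real tool would use proper morphological generation and a richer pattern language.

```python
# Sketch of compile-time expansion of one active-voice S-V-O pattern
# into its systematically related variants.

def expand_svo(subj, verb, obj):
    # Naive past participle: strip trailing 'e', add 'ed' (assumption;
    # fails on irregular verbs like 'make' -> 'made').
    part = verb.rstrip("e") + "ed"
    return [
        f"{subj} {verb}s {obj}",                  # active
        f"{obj} are {part} by {subj}",            # passive
        f"{subj}, which {verb}s {obj}",           # active relative clause
        f"{obj}, which are {part} by {subj}",     # passive relative clause
        f"{obj} {part} by {subj}",                # reduced passive relative
        f"{subj} is to {verb} {obj}",             # active 'is to' form
        f"{obj} are to be {part} by {subj}",      # passive 'are to be' form
    ]

for variant in expand_svo("GM", "manufacture", "cars"):
    print(variant)
```

Because the expansion happens once, when the patterns are compiled, the run-time matcher sees only ordinary patterns and pays no efficiency cost.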

This feature is not merely a clever idea for making a system more convenient to author. It rests on the fundamental idea that underlies generative transformational grammar, but is realized in a way that does not impact the efficiency of processing.

In recent years, full-sentence parsing has improved, in large part through the use of statistical techniques. Consequently, some IE systems have begun to rely on full parsers rather than shallow parsing techniques.

21.3.5 Template Generation: Merging Structures

The first four stages of processing all operate within the bounds of single sentences. The final level of processing operates over the whole text. Its task is to see that all the information collected about a single entity, relationship, or event is combined into a unified whole. This is one of the primary ways that the problem of coreference is dealt with in information extraction, including both NP coreference (for entities) and event coreference. One event template is generated for each event, which coalesces all of the information associated with that event. If an input document discusses multiple events of interest, then the IE system must generate multiple event templates. Generating multiple event templates requires additional discourse analysis to (a) correctly determine how many distinct events are reported in the document, and (b) correctly assign each entity and object to the appropriate event template.

Among the criteria that need to be taken into account in determining whether two structures can be merged are the internal structure of the noun groups, nearness along some metric, and the consistency, or more generally, the compatibility of the two structures.

In the analysis of the sample joint-venture text, we have produced three activity structures. They are all consistent because they are all of type PRODUCTION and because “iron and ‘metal wood’ clubs” is consistent with “golf clubs”. Hence, they are merged, yielding:

Activity: PRODUCTION

Company: “Bridgestone Sports Taiwan Co.”

Product: “iron and ‘metal wood’ clubs”

Start Date: DURING: January 1990

Similarly, the two relationship structures that have been generated are consistent with each other, so they can be merged, yielding:

Relationship: TIE-UP

Entities: “Bridgestone Sports Co.”

“a local concern”

“a Japanese trading house”

Joint Venture Company: “Bridgestone Sports Taiwan Co.”

Activity: –

Amount: NT$20000000
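The slot-by-slot merging test might be sketched as follows. The substring-based compatibility check is a stand-in assumption for the richer consistency test the text describes, which would need a domain ontology to relate, e.g., “golf clubs” to “iron and ‘metal wood’ clubs”.

```python
# Sketch of template merging: two structures merge if every slot is
# compatible; an unfilled slot (None, shown as '-' in the figures) is
# compatible with anything.

def compatible(a, b):
    """Stand-in consistency test: unfilled, equal, or substring-related."""
    return a is None or b is None or a == b or a in b or b in a

def merge(s1, s2):
    """Merge two template dicts slot by slot, or return None on a clash."""
    merged = {}
    for slot in s1.keys() | s2.keys():
        v1, v2 = s1.get(slot), s2.get(slot)
        if not compatible(v1, v2):
            return None                           # structures cannot be merged
        # Keep the filled value; if both are filled, keep the longer one.
        merged[slot] = v1 if v2 is None else v2 if v1 is None else max(v1, v2, key=len)
    return merged

a1 = {"Activity": "PRODUCTION", "Company": None,
      "Product": "golf clubs", "Start Date": None}
a2 = {"Activity": "PRODUCTION", "Company": "Bridgestone Sports Taiwan Co.",
      "Product": None, "Start Date": "DURING: January 1990"}
print(merge(a1, a2))
```

Two structures of different types (say, PRODUCTION and TIE-UP in the same slot) fail the compatibility test and are left unmerged.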

The entity and event coreference problems are very hard, and constitute active and important areas of research. Coreference resolution was a task in the later MUC evaluations (MUC-6 Proceedings 1995; MUC-7 Proceedings 1998), and has been a focus of the ACE evaluations. Many recent research efforts have applied machine learning techniques to the problem of coreference resolution (e.g., Dagan and Itai 1990; McCarthy and Lehnert 1995; Aone and Bennett 1996; Kehler 1997; Cardie and Wagstaff 1999; Harabagiu, Bunescu, and Maiorana 2001; Soon, Ng, and Lim 2001; Ng and Cardie 2002; Bean and Riloff 2004; McCallum and Wellner 2004; Yang, Su, and Tan 2005; Haghighi and Klein 2007).

Some attempts to automate the template generation process will be discussed in Section 21.4.4.
