The relations in ACE relation extraction are very general and appear across a wide range of documents. On the other hand, many of the event types involved in event extraction and scenario template extraction are associated with specific topics. This provides an opportunity for semi-supervised training.
The observation was first made by Riloff [Ril96] in connection with the MUC-3 task (terrorist incidents) that if one takes a collection of relevant documents and a collection of irrelevant documents (those not about terrorism) and computes the frequency of particular predicates in the two collections, those predicates which appear with a relatively higher frequency in the relevant documents are (in general) predicates relevant to the task. Using the scoring function
score(P) = (RelFreq / Freq) × log(Freq)
(where Freq is the frequency of the pattern and RelFreq is its frequency in relevant documents) applied to about 750 terrorism-relevant and 750 irrelevant articles, the top-ranked patterns were
Yangarber [YGTH00] extended this to a semi-supervised (bootstrapping) procedure: a seed set of patterns was used to identify some relevant documents; these documents were used to identify a few additional patterns, and so on. The initial procedure, which was applied to the MUC-6 (management succession) task, did not have any natural stopping point. By training competing learners concurrently for different event types, it was possible to improve the learners for individual topics and reach a natural stopping point where all documents were assigned to a topic [Yan03].
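The scoring function and the bootstrapping loop can be illustrated with a minimal Python sketch. It rests on several assumptions that are not in the text: each document is represented as a list of pattern strings, Freq is counted over both collections together, the natural logarithm is used (the base only rescales the scores), and the names riloff_scores and bootstrap are hypothetical. The selection rule (add the single best-scoring new pattern per round) and the max_rounds cap are simplifications for illustration, not the published procedure.

```python
import math
from collections import Counter


def riloff_scores(relevant_docs, irrelevant_docs):
    """Score each pattern by (RelFreq / Freq) * log(Freq).

    Assumption: a document is an iterable of pattern strings, and Freq is
    the pattern's frequency over relevant and irrelevant documents together.
    """
    rel_freq = Counter(p for doc in relevant_docs for p in doc)
    freq = rel_freq + Counter(p for doc in irrelevant_docs for p in doc)
    return {p: (rel_freq[p] / f) * math.log(f) for p, f in freq.items()}


def bootstrap(docs, seed_patterns, max_rounds=10):
    """Grow a pattern set from seeds (illustrative selection rule only)."""
    patterns = set(seed_patterns)
    for _ in range(max_rounds):
        # A document counts as relevant if it contains any accepted pattern.
        relevant = [d for d in docs if patterns & set(d)]
        irrelevant = [d for d in docs if not patterns & set(d)]
        scores = riloff_scores(relevant, irrelevant)
        # Only new patterns with a positive score are candidates.
        candidates = [p for p in scores if p not in patterns and scores[p] > 0]
        if not candidates:
            break
        patterns.add(max(candidates, key=lambda p: (scores[p], p)))
    return patterns


docs = [
    ["resign", "succeed"],
    ["resign", "succeed"],
    ["succeed", "appoint"],
    ["succeed", "appoint"],
    ["report", "fell"],
    ["report", "fell"],
]
learned = bootstrap(docs, {"resign"})
```

Note that a pattern occurring only once always scores zero, because log(1) = 0; the formula therefore favors patterns that are both frequent and concentrated in the relevant documents.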
Stevenson and Greenwood [SG05] demonstrated an alternative approach to generalizing event extraction patterns using WordNet, again applied to the MUC-6 task. Liao [LG10a] combined these two approaches, using WordNet to expand patterns but filtering them using document relevance criteria, showing gains both on MUC-6 and three of the ACE event types.
6.4 Evaluation
As with any IE task, evaluating event or scenario templates begins with preparation of an annotated reference corpus. An evaluation is only as good as its reference corpus, and the quality of the reference corpus depends in turn on the precision with which the events are defined. In some technical domains, events and their arguments can be precisely defined.
This is also true for some everyday events – being born, married, divorced, dying, starting a job, being convicted of a crime. On the other hand, some ACE event classes, such as attack and meet, have fuzzy boundaries, so the inter-annotator agreement is quite low.
Devising an intuitive and effective metric for comparing a system output with the reference annotation is difficult. The system output and reference consist of multiple events / templates; each event will have multiple arguments, and each argument will have a role.
There may be missing or spurious events, missing or spurious arguments, and correct arguments with incorrect roles.
For both MUC and ACE there was a need to have a single event/template score. This was obtained, roughly speaking, by searching for the best possible alignment of templates and template slots (MUC) or events and event arguments (ACE). ACE provided a different penalty for each type of error (wrong event type, missing argument, spurious argument, incorrect role, ...). The resulting scores are not always easy to understand, especially when small changes in output alter the optimal alignment.
The alternative is to have two (or more) separate evaluation metrics. For MUC, a fair amount of the research focused on slot filling and did not address the consolidation task.
Accordingly, the evaluation metric counted the [slot name, slot value] pairs, classified them as correct, missing, and spurious, and computed recall and precision in the usual way. For ACE, separate measurements can be made of event classification and argument extraction.
Event classification is judged based on finding correct [trigger word, event type] pairs. Argument extraction (similar to MUC slot filling) is based on finding correct [event type, role, event argument] triples. In judging correct event arguments, we either use reference entities or align the system entities with reference entities.
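As a rough illustration of the separate-metric approach, the following Python sketch classifies system items as correct, missing, or spurious and computes recall and precision. It works equally for [slot name, slot value] pairs and for [event type, role, event argument] triples. The function name score_extraction and the sample data are hypothetical, and the sketch assumes that arguments have already been mapped to reference entities so that exact matching is meaningful.

```python
def score_extraction(system, reference):
    """Compare system and reference items (tuples) by exact match."""
    system, reference = set(system), set(reference)
    correct = system & reference      # found and right
    spurious = system - reference     # found but not in the reference
    missing = reference - system      # in the reference but not found
    precision = len(correct) / len(system) if system else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return {
        "correct": len(correct),
        "missing": len(missing),
        "spurious": len(spurious),
        "precision": precision,
        "recall": recall,
    }


# Hypothetical [event type, role, event argument] triples.
reference = {
    ("Attack", "Attacker", "rebels"),
    ("Attack", "Target", "convoy"),
    ("Attack", "Place", "Kabul"),
}
system = {
    ("Attack", "Attacker", "rebels"),
    ("Attack", "Target", "rebels"),   # wrong argument for this role
    ("Attack", "Place", "Kabul"),
    ("Meet", "Entity", "ministers"),  # spurious event argument
}
result = score_extraction(system, reference)
```

Under exact matching, an argument assigned to the wrong slot is counted twice, once as spurious and once as missing. This is one reason the single-score ACE metric used distinct penalties for each error type.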
Chapter 7
Unsupervised Information Extraction
In all the methods we have discussed up to now, we have assumed that we knew in advance what semantic classes of entities, relations, or events we wanted to extract. This information was either manually incorporated into rules or expressed through supervision – either fully supervised or semi-supervised learning.
In this chapter we shall consider what we can learn from the texts themselves, without any supervision. Given some texts and some entity types, can we identify the relations, or at least the most common relations, between these types of entities? Going a step further, can we identify both a set of relations and a set of types of entities (word classes) which appear as the arguments of these relations?
If we consider the general language (of news, magazines, literature) the word classes will be correspondingly general (people, animates, places, ...) and have fuzzy boundaries (because of metaphorical usages, for example). However, if we narrow the domain and in particular focus on scientific and technical domains, the word classes will be more specialized and the constraints sharper. Harris [Har68] dubbed these science sublanguages and observed that just as the general language has syntactic word class constraints
The cat is sleeping.
* The cat are sleeping.
(where the second sentence is ungrammatical in the general language) the sublanguage has sublanguage word constraints
The potassium entered the cell.
* The cell entered the potassium.
(so the second sentence is ungrammatical in the sublanguage). Ideally, unsupervised IE should be able to discover such argument constraints and thus discover the information structures of the domain.
7.1 Discovering Relations
Most of the work has focussed on discovering relations. The basic idea is fairly simple:
we gather a large set of triples (potential relation instances), each consisting of a pair of arguments (entities) and a context – either the word sequence between the arguments or the dependency path connecting the arguments. The arguments must appear in the same sentence, at most a fixed distance apart; typically only examples with named arguments are collected. Our goal is to group these triples into sets which represent the same semantic relation.
To this end, we define a similarity measure between triples, based on whether the arguments match and on how similar the contexts are. Linking triples which involve the same pair of arguments captures the notion that if the same pair of names appears together in two different contexts they are likely to represent the same relation (this same notion was the basis for semi-supervised relation extraction). Using the similarity measure, we gather the triples into clusters, using some standard clustering procedure. If the procedure has been successful, the triples which fall into the same cluster will convey (approximately) the same relation.
If the procedure is provided with a standard set of entity types and an entity tagger, we will treat each pair of argument types separately. Using the standard MUC types, for example, we would gather PERSON-PERSON relations, PERSON-ORGANIZATION relations, etc. separately.
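The general scheme can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the Triple record, the equal weighting of argument match and context overlap, the use of Jaccard overlap on context words, the threshold of 0.5, and the naive single-link merging loop. None of these choices is taken from a particular published system.

```python
from collections import namedtuple

# A potential relation instance: two arguments and the words between them.
Triple = namedtuple("Triple", ["arg1", "arg2", "context"])


def similarity(a, b, arg_weight=0.5):
    """Combine argument match with Jaccard overlap of context words."""
    arg_match = 1.0 if (a.arg1, a.arg2) == (b.arg1, b.arg2) else 0.0
    words_a, words_b = set(a.context), set(b.context)
    union = words_a | words_b
    overlap = len(words_a & words_b) / len(union) if union else 0.0
    return arg_weight * arg_match + (1.0 - arg_weight) * overlap


def cluster(triples, threshold=0.5):
    """Naive single-link agglomerative clustering (quadratic or worse)."""
    clusters = [[t] for t in triples]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(similarity(a, b) >= threshold
                       for a in clusters[i] for b in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


triples = [
    Triple("IBM", "Lotus", ("acquired",)),
    Triple("IBM", "Lotus", ("bought",)),
    Triple("Google", "YouTube", ("acquired",)),
    Triple("Obama", "Hawaii", ("was", "born", "in")),
]
groups = cluster(triples)
```

In this toy input, "acquired" and "bought" end up in the same cluster only because they share the argument pair IBM / Lotus, which is the linking notion described above. With an entity tagger available, the same function would simply be run once per pair of argument types.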
Hasegawa et al. [HSG04] report an early experiment of this type. They collected name pairs (PERSON-GPE and COMPANY-COMPANY) from one year of newspaper articles using rather stringent criteria: the names could be at most 5 words apart, and each pair had to appear at least 30 times. This yielded a small number of pairs with extensive evidence about each pair, and made possible a relatively simple clustering strategy. An initial cluster was built by taking all the contexts for a particular name pair; this assumes that each pair is only connected by a single relation. Then clusters with any context words in common were merged. This produced a small number of high-quality clusters, such as ones representing merger and parent relations between companies. Chen et al. [CJTN05]
refined several aspects of this approach, automatically selecting weights for words in the context and choosing the number of clusters to generate.
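The clustering strategy as summarized in this paragraph can be sketched roughly as below. This is a simplified reading of the description, not the authors' implementation: the function name cluster_name_pairs is hypothetical, the input is assumed to be already restricted to names at most 5 words apart, and the merge step treats every context word as equally informative (a practical system would need to filter or down-weight very common words).

```python
from collections import defaultdict


def cluster_name_pairs(instances, min_count=30):
    """instances: iterable of (name1, name2, context_words) occurrences."""
    contexts = defaultdict(set)
    counts = defaultdict(int)
    for name1, name2, words in instances:
        pair = (name1, name2)
        counts[pair] += 1
        contexts[pair].update(words)  # one initial cluster per name pair

    # Keep only pairs with enough evidence.
    pairs = [p for p in contexts if counts[p] >= min_count]

    # Union-find over name pairs.
    parent = {p: p for p in pairs}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    # Merge any two pairs whose contexts share a word.
    first_pair_with_word = {}
    for p in pairs:
        for w in contexts[p]:
            if w in first_pair_with_word:
                parent[find(p)] = find(first_pair_with_word[w])
            else:
                first_pair_with_word[w] = p

    clusters = defaultdict(list)
    for p in pairs:
        clusters[find(p)].append(p)
    return list(clusters.values())


# Toy data with a lowered frequency cutoff (min_count=2 instead of 30).
instances = (
    [("A", "B", ["merged", "with"])] * 2
    + [("C", "D", ["merged"])] * 2
    + [("E", "F", ["parent", "of"])] * 2
    + [("G", "H", ["parent"])]  # too rare, dropped by the cutoff
)
result = cluster_name_pairs(instances, min_count=2)
```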
Zhang et al. [ZSW+05] observed for the same news corpus that 10 to 25% of the name pairs (depending on the frequency of the pairs) were connected by more than one relation.
For example, a story about A acquiring B might be followed by a story in which A was the corporate parent of B. Accordingly they clustered instances of name pairs separately, allowing distinct instances of the same pair to appear in separate clusters. By doing so, and by using a parse-tree-based measure of context similarities (rather than a simple bag of words) they obtained some performance improvements.
Rosenfeld and Feldman [RF07], also using a corpus of news stories, demonstrated that this clustering could be performed without classifying types of names. This should make it easier to port the procedure to new domains which may have new entity types. Using a rich set of features based on matching parts of the context token sequence, they compared the effectiveness of different clustering procedures.
The task becomes somewhat more challenging when we move from single news sources (used in all the experiments just described) to the Web. The scale is larger, and there is a greater variety in entity names and relation contexts. The larger scale requires more efficient clustering algorithms. Typical simple clustering algorithms, such as hierarchical agglomerative clustering, require time O(n²) to cluster n items, since each item must be compared (at least once) to every other item. This is manageable for small collections – thousands of name pairs – but not for the Web (millions or billions of pairs). However, if the connections (shared features) between items are sparse (argument pairs share only a few relation phrases, for example) and the final clusters will be relatively small (compared to the total number of clusters), it is possible to build a more efficient procedure by iterating over the shared features.
Yates and Etzioni [YE07] described such a procedure, which takes time O(kn log n) where k is the maximum cluster size. They applied it to a set of 10,000 frequently appearing relation phrases collected from the Web. To address the problem of name variants in Web data, they also clustered argument strings based on string similarity and shared contexts, thus identifying many synonymous names. The two clustering phases are mutually beneficial:
identifying synonymous arguments means that we have more shared contexts with which to identify synonymous relation phrases, and vice versa.
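The idea of iterating over shared features rather than over all item pairs can be illustrated with an inverted index. The Python sketch below is not the algorithm of [YE07]; it only shows how sparsity lets us skip most comparisons. The function name candidate_pairs, the min_shared threshold, and the max_bucket cutoff for very common features are all assumptions introduced for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(item_features, min_shared=1, max_bucket=50):
    """Return {(item_a, item_b): number of shared features}.

    Only items that co-occur under at least one feature are ever compared,
    so the cost depends on bucket sizes rather than on all n*(n-1)/2 pairs.
    """
    # Inverted index: feature -> items having that feature.
    index = defaultdict(list)
    for item, features in item_features.items():
        for f in set(features):
            index[f].append(item)

    shared = defaultdict(int)
    for items in index.values():
        if len(items) > max_bucket:
            continue  # skip uninformative, very common features
        for a, b in combinations(sorted(items), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Items are relation phrases; features are the argument pairs they connect.
phrases = {
    "acquired": {("IBM", "Lotus"), ("Google", "YouTube")},
    "bought": {("IBM", "Lotus"), ("Google", "YouTube")},
    "was born in": {("Obama", "Hawaii")},
}
pairs = candidate_pairs(phrases, min_shared=2)
```

The same function could be applied in the other direction, with argument strings as items and relation phrases as features, which mirrors the two mutually reinforcing clustering phases described above.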