
NAACL HLT Demonstration Program, pages 25–26, Rochester, New York, USA, April 2007. © 2007 Association for Computational Linguistics

TextRunner: Open Information Extraction on the Web

Alexander Yates, Michele Banko, Matthew Broadhead, Michael Cafarella, Oren Etzioni, Stephen Soderland

University of Washington
Computer Science and Engineering
Box 352350
Seattle, WA 98195-2350

{ayates,banko,hastur,mjc,etzioni,soderlan}@cs.washington.edu

1 Introduction

Traditional information extraction systems have focused on satisfying precise, narrow, pre-specified requests from small, homogeneous corpora. In contrast, the TextRunner system demonstrates a new kind of information extraction, called Open Information Extraction (OIE), in which the system makes a single, data-driven pass over the entire corpus and extracts a large set of relational tuples, without requiring any human input (Banko et al., 2007). TextRunner is a fully implemented, highly scalable example of OIE. TextRunner’s extractions are indexed, allowing a fast query mechanism.

Our first public demonstration of the TextRunner system shows the results of performing OIE on a set of 117 million web pages. It demonstrates the power of TextRunner in terms of the raw number of facts it has extracted, as well as its precision using our novel assessment mechanism. It also shows the ability to automatically determine synonymous relations and objects using large sets of extractions. We have built a fast user interface for querying the results.

2 Previous Work

The bulk of previous information extraction work uses hand-labeled data or hand-crafted patterns to enable relation-specific extraction (e.g., (Culotta et al., 2006)). OIE seeks to avoid these requirements for human input.

Shinyama and Sekine (2006) describe an approach to “unrestricted relation discovery” that does away with many of the requirements for human input. However, it requires clustering of the documents used for extraction, and thus scales in quadratic time in the number of documents. It does not scale to the size of the Web.
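To make the quadratic cost concrete, the sketch below counts document pairs for the two corpus sizes mentioned in this paper. It assumes, purely for illustration, a clustering step that compares every pair of documents; the function name `pairwise_comparisons` is hypothetical, and the approach of Shinyama and Sekine may organize its work differently.

```python
def pairwise_comparisons(n_documents: int) -> int:
    """Number of unordered document pairs, n * (n - 1) / 2."""
    return n_documents * (n_documents - 1) // 2


# Corpus sizes taken from this paper: the 9 million document test subset
# and the 117 million page full corpus.
test_pairs = pairwise_comparisons(9_000_000)
full_pairs = pairwise_comparisons(117_000_000)

# The corpus grows 13-fold, but the number of pairs grows about 169-fold.
growth = full_pairs / test_pairs
```

Under that assumption the full corpus yields several quadrillion pairs, whereas a single pass touches each document once.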

For a full discussion of previous work, please see (Banko et al., 2007), or see (Yates and Etzioni, 2007) for work relating to synonym resolution.

3 Open IE in TextRunner

OIE presents significant new challenges for information extraction systems, including:

Automation of relation extraction, which in traditional information extraction uses hand-labeled inputs.

Corpus Heterogeneity on the Web, which makes tools like parsers and named-entity taggers less accurate because the corpus is different from the data used to train the tools.

Scalability and efficiency of the system. Open IE systems are effectively restricted to a single, fast pass over the data so that they can scale to huge document collections.

In response to these challenges, TextRunner includes several novel components, which we now summarize (see (Banko et al., 2007) for details).

1. Single Pass Extractor

The TextRunner extractor makes a single pass over all documents, tagging sentences with part-of-speech tags and noun-phrase chunks as it goes. For each pair of noun phrases that are not too far apart, and subject to several other constraints, it applies a classifier described below to determine whether or not to extract a relationship. If the classifier deems the relationship trustworthy, a tuple of the form $t = (e_i, r_j, e_k)$ is extracted, where $e_i$ and $e_k$ are entities and $r_j$ is the relation between them. For example, TextRunner might extract the tuple (Edison, invented, light bulbs).

On our test corpus (a 9 million document subset of our full corpus), it took less than 68 CPU hours to process the 133 million sentences. The process is easily parallelized, and took only 4 hours to run on our cluster.
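The pass described above can be sketched as follows. This is a minimal illustration rather than TextRunner's implementation: the chunk representation, the `MAX_GAP` threshold, and the `toy_classifier` stand-in for the trained classifier are all assumptions made for the example.

```python
from typing import Callable, Iterator, List, Tuple

# A sentence is assumed to arrive already tagged and chunked as a list of
# (text, tag) pairs: tag "NP" marks a noun-phrase chunk, and any other tag
# is the part-of-speech tag of a single word.
Chunk = Tuple[str, str]
Extraction = Tuple[str, str, str]

MAX_GAP = 4  # hypothetical limit on the number of words between two noun phrases


def candidate_tuples(sentence: List[Chunk]) -> Iterator[Tuple[Extraction, List[str]]]:
    """Yield ((e_i, r_j, e_k), POS tags of r_j) for noun-phrase pairs that are close together."""
    np_positions = [i for i, (_, tag) in enumerate(sentence) if tag == "NP"]
    for a, i in enumerate(np_positions):
        for k in np_positions[a + 1:]:
            between = sentence[i + 1:k]
            if not between or len(between) > MAX_GAP:
                continue
            relation = " ".join(text for text, _ in between)
            tags = [tag for _, tag in between]
            yield (sentence[i][0], relation, sentence[k][0]), tags


def extract(sentences: List[List[Chunk]],
            is_trustworthy: Callable[[List[str]], bool]) -> List[Extraction]:
    """Visit every sentence once and keep the candidates the classifier accepts."""
    results = []
    for sentence in sentences:
        for candidate, tags in candidate_tuples(sentence):
            if is_trustworthy(tags):
                results.append(candidate)
    return results


def toy_classifier(tags: List[str]) -> bool:
    """Stand-in for the trained classifier: accept relations that are a single verb."""
    return len(tags) == 1 and tags[0].startswith("VB")


sentence = [("Edison", "NP"), ("invented", "VBD"), ("light bulbs", "NP")]
tuples = extract([sentence], toy_classifier)
```

Because each sentence is processed independently, sentences can be divided among machines, which is consistent with the reported drop from under 68 CPU hours to 4 hours of elapsed time on a cluster. The CPU figure also works out to under roughly 1.8 milliseconds per sentence (68 × 3600 s / 133 million).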

2. Self-Supervised Classifier

While full parsing is too expensive to apply to the Web, we use a parser to generate training examples for extraction. Using several heuristic constraints, we automatically label a set of parsed sentences as trustworthy or untrustworthy extractions (positive and negative examples, respectively). The classifier is trained on these examples, using features such as the part-of-speech tags on the words in the relation. The classifier is then able to decide whether a sequence of POS-tagged words is a correct extraction with high accuracy.
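The self-supervised setup can be illustrated with a small sketch. Everything specific here is an assumption: the section does not name the classifier model, so a Naive Bayes model over POS tags is used only as a simple stand-in, and `heuristic_label` is a hypothetical placeholder for the parser-based constraints, which are not spelled out in this paper.

```python
import math
from collections import Counter
from typing import List, Tuple

Example = Tuple[List[str], bool]  # (POS tags of the relation words, label)


def heuristic_label(parse_ok: bool, crosses_clause: bool) -> bool:
    """Hypothetical labeling rule standing in for the parser-based constraints."""
    return parse_ok and not crosses_clause


class PosNaiveBayes:
    """Multinomial Naive Bayes over POS tags with add-one smoothing."""

    def fit(self, examples: List[Example]) -> "PosNaiveBayes":
        self.counts = {True: Counter(), False: Counter()}
        self.label_counts = Counter()
        for tags, label in examples:
            self.label_counts[label] += 1
            self.counts[label].update(tags)
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        return self

    def _log_score(self, tags: List[str], label: bool) -> float:
        total = sum(self.counts[label].values())
        score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for tag in tags:
            # The extra +1 in the denominator reserves mass for unseen tags.
            score += math.log((self.counts[label][tag] + 1) / (total + len(self.vocab) + 1))
        return score

    def predict(self, tags: List[str]) -> bool:
        return self._log_score(tags, True) > self._log_score(tags, False)


# Toy training set: labels come from the heuristic, not from human annotators.
training = [
    (["VBD"], heuristic_label(True, False)),
    (["VBZ", "IN"], heuristic_label(True, False)),
    (["CC", "DT"], heuristic_label(True, True)),
    ([",", "WDT", "MD"], heuristic_label(False, False)),
]
model = PosNaiveBayes().fit(training)
verb_like = model.predict(["VBD", "IN"])
conjunction_like = model.predict(["CC", "DT"])
```

The point of the design is that the expensive parser runs only on the training sample, while the trained model needs nothing beyond POS tags at extraction time.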

3. Synonym Resolution

Because TextRunner has no pre-defined relations, it may extract many different strings representing the same relation. Also, as with all information extraction systems, it can extract multiple names for the same object. The Resolver system performs an unsupervised clustering of TextRunner’s extractions to create sets of synonymous entities and relations. Resolver uses a novel, unsupervised probabilistic model to determine the probability that any pair of strings is co-referential, given the tuples that each string was extracted with (Yates and Etzioni, 2007).
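The intuition that two strings are likely synonyms when they appear in the same tuples can be sketched as below. This is not Resolver's probabilistic model: Jaccard overlap with a fixed `threshold` is swapped in as a much simpler similarity measure, the function names are hypothetical, and the data is a toy example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Extraction = Tuple[str, str, str]


def contexts(tuples: List[Extraction]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each first-argument string to the (relation, object) pairs seen with it."""
    ctx = defaultdict(set)
    for e1, rel, e2 in tuples:
        ctx[e1].add((rel, e2))
    return ctx


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_synonyms(tuples: List[Extraction], threshold: float = 0.5) -> List[List[str]]:
    """Merge strings whose extraction contexts overlap enough (union-find clustering)."""
    ctx = contexts(tuples)
    parent = {s: s for s in ctx}

    def find(s: str) -> str:
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for a, b in combinations(sorted(ctx), 2):
        if jaccard(ctx[a], ctx[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for s in ctx:
        groups[find(s)].add(s)
    return sorted(sorted(group) for group in groups.values())


toy_tuples = [
    ("Edison", "invented", "light bulbs"),
    ("Thomas Edison", "invented", "light bulbs"),
    ("Edison", "was born in", "Ohio"),
    ("Thomas Edison", "was born in", "Ohio"),
    ("Newton", "invented", "calculus"),
]
clusters = cluster_synonyms(toy_tuples)
```

The sketch compares every pair of strings, which is acceptable for a toy example but would need blocking or a similar shortcut on a large extraction set.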

4. Query Interface

TextRunner builds an inverted index of the extracted tuples, and spreads it across a cluster of machines. This architecture supports fast, interactive, and powerful relational queries. Users may enter words in a relation or entity, and TextRunner quickly returns the entire set of extractions matching the query. For example, a query for “Newton” will return tuples like (Newton, invented, calculus). Users may opt to query for all tuples matching synonyms of the keyword input, and may also opt to merge all tuples returned by a query into sets of tuples that are deemed synonymous.
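A single-machine sketch of an inverted index over tuples is given below. The class name `TupleIndex`, the lowercase whitespace tokenization, and the all-words-must-match query semantics are assumptions for illustration; the distribution across a cluster and the synonym options described above are left out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Extraction = Tuple[str, str, str]


class TupleIndex:
    """Maps each lowercased word to the ids of the tuples that contain it."""

    def __init__(self) -> None:
        self.tuples: List[Extraction] = []
        self.postings: Dict[str, Set[int]] = defaultdict(set)

    def add(self, extraction: Extraction) -> None:
        tuple_id = len(self.tuples)
        self.tuples.append(extraction)
        for field in extraction:  # index entity and relation words alike
            for word in field.lower().split():
                self.postings[word].add(tuple_id)

    def query(self, text: str) -> List[Extraction]:
        """Return the tuples containing every word of the query, in insertion order."""
        words = text.lower().split()
        if not words:
            return []
        ids = set.intersection(*(self.postings.get(word, set()) for word in words))
        return [self.tuples[i] for i in sorted(ids)]


index = TupleIndex()
index.add(("Newton", "invented", "calculus"))
index.add(("Edison", "invented", "light bulbs"))

newton_hits = index.query("Newton")
invented_hits = index.query("invented")
```

A lookup costs a handful of set operations regardless of how many tuples are stored, which is what makes interactive querying feasible. One plausible way to spread such an index over machines is to partition the postings by word, though the paper does not say how its index is partitioned.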

4 Experimental Results

On our test corpus of 9 million Web documents, TextRunner extracted 7.8 million well-formed tuples. On a randomly selected subset of 400 tuples, 80.4% were deemed correct by human reviewers.

We performed a head-to-head comparison with a state-of-the-art traditional information extraction system, called KnowItAll (Etzioni et al., 2005). On a set of ten high-frequency relations, TextRunner found nearly as many correct extractions as KnowItAll (11,476 versus 11,631), while reducing the error rate of KnowItAll by 33% (from 18% to 12%).
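The two comparison figures can be checked with a few lines of arithmetic. The variable and function names are illustrative, and the assignment of the smaller count to TextRunner follows from the phrase “nearly as many”.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate drops relative to its starting value."""
    return (before - after) / before


# Error rates reported for KnowItAll (18%) and TextRunner (12%).
error_reduction = relative_reduction(0.18, 0.12)

# Correct extractions: the smaller count as a fraction of the larger.
recall_ratio = 11_476 / 11_631
```

So the 33% figure is a relative reduction (6 percentage points out of 18), and the smaller extraction count is about 98.7% of the larger one.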

Acknowledgements

This research was supported in part by NSF grants IIS-0535284 and IIS-0312988, DARPA contract NBCHD030010, and ONR grant N00014-05-1-0185, as well as gifts from Google, and was carried out at the University of Washington’s Turing Center.

References

M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. 2007. Open Information Extraction from the Web. In IJCAI.

A. Culotta, A. McCallum, and J. Betz. 2006. Integrating Probabilistic Extraction Models and Relational Data Mining to Discover Relations and Patterns in Text. In HLT-NAACL.

O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. Popescu, T. Shaked, S. Soderland, D. Weld, and A. Yates. 2005. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1):91–134.

Y. Shinyama and S. Sekine. 2006. Preemptive Information Extraction Using Unrestricted Relation Discovery. In HLT-NAACL.

A. Yates and O. Etzioni. 2007. Unsupervised Resolution of Objects and Relations on the Web. In NAACL-HLT.
