5 List Extractor - Unsupervised Named-Entity Extraction from the Web: An Experimental Study∗

We now present the third method for increasing KNOWITALL’s recall, the List Extractor (LE). Where the methods described earlier extract information from unstructured text on Web pages, LE uses regular page structure to support extraction. LE locates lists of items on Web pages, learns a wrapper on the fly for each list, automatically extracts items from these lists, then sorts the items by the number of lists in which they appear.
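The final ranking step can be illustrated with a minimal sketch. It assumes each extracted list is already available as a Python list of strings; the function name `rank_by_list_count` and the sample data are hypothetical and are not taken from the LE implementation.

```python
from collections import Counter

def rank_by_list_count(lists):
    """Rank items by the number of distinct lists that contain them."""
    counts = Counter()
    for items in lists:
        # Count each item at most once per list.
        counts.update(set(items))
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical lists, as if extracted from three different pages.
example_lists = [
    ["Paris", "Rome", "London"],
    ["Paris", "Rome", "Hilton"],
    ["Paris", "Tokyo"],
]
ranking = rank_by_list_count(example_lists)
```

Items that occur in many independent lists rise to the top, while one-off entries such as "Hilton" sink to the bottom.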

LE locates lists by querying search engines with sets of items extracted by the baseline KNOWITALL (e.g., LE might query Google with “London” “Paris” “New York” “Rome”). LE leverages the fact that many informational pages are generated from databases and therefore have a distinct but regular and easy-to-learn structure. Our implementation of LE combines ideas from previous work on wrapper induction to learn wrappers quickly (in under a second of CPU time per document) and autonomously (unlike much of the work on wrapper induction, LE is unsupervised).

5.1 Background and Related Work

One of the first applications of wrapper learning appeared in [16], which describes an agent that queried online stores with known product names and looked for regularities in the resulting pages in order to build e-commerce wrappers. In [24], Kushmerick generalized how to automatically learn wrappers for information extraction, and presented wrappers as regular expressions with some kind of structure or constraints. The idea is that given a fully labeled training set of sample extractions from documents, one can learn a wrapper or patterns of words that precede and follow the extracted terms. In addition to the prefixes and suffixes, there is also a notion of heads and tails, which are points that delimit the context to which the extraction pattern applies.

The base algorithm for wrapper induction is fairly straightforward. Given fully labeled texts (or oracles) in which negative examples are those parts without labels, iterate over all possible patterns to find the best heads, tails, prefixes, and suffixes that match all the training data, and use these for extraction. The complexity and accuracy depend on the expressiveness of the expressions (e.g., wildcards, semantic/synonym matches), the amount of data to learn from, and the level of structure in the documents.

Cohen in [11] extended the notion of wrapper induction by generalizing how to automatically learn rules to include linear regular expressions as well as hierarchical paths (DOM parse) in an HTML document.

Cohen also explored how to use these wrappers to automatically extract arbitrary lists of related items from Web pages for other purposes [10]. We borrow both of these ideas in our implementation, but differ in how our wrapper is trained, used, and measured experimentally.

Perhaps the work that most resembles LE is Google Sets, which is an interface provided by Google that functionally appears almost identical to LE. The input to Google Sets is several words, and the output is a list of up to 100 tokens that are found in lists on the Web. Since we do not know how Google Sets is implemented and cannot get unlimited results from their interface, we are unable to compare the two systems.

5.2 Problem Definition and Characteristics

The inputs to LE include the name of a class and a set of positive seeds. The output is a set of candidate tokens for the given class that are found on Web pages containing lists of instances, where the list includes a subset of the positive seeds. We take advantage of the repetition of information on the Web by being highly selective about which documents we choose to extract from. In particular, we want documents that contain many known positive examples and that exhibit a high amount of structure from which we can infer new examples. It is reasonable to assume that this structure exists for many classes, since many professional Web sites are automatically generated from databases.

We do not have negative examples, so any learning procedure we use will have to rely on positive examples only. This means that as we carve out a space that we believe separates the positive instances from the negative ones, we need to make some assumptions or apply some domain-specific heuristics to create a precise information extractor. This is done by analyzing the HTML structure of a document. In particular, we localize our learning to specific blocks of HTML, and strongly favor complex hypotheses over less restrictive ones. It is better to under-generalize than to over-generalize. The intuition is that under-generalizing may result in false negatives for a given document, but that the missed opportunities on one document are likely to appear again on other documents.

5.3 Algorithm

Now we will discuss the online wrapper induction algorithm outlined in Figure 14. The input to this algorithm is a set of positive examples (seedExamples at line 1). The output is a list of tokens (extractions).

The first step is to use the seed examples to obtain a set of documents as shown in line 2. This is currently done by selecting some number of random positive seeds to combine in a query to a search engine such as Google. One can imagine more sophisticated ways of selecting seeds such as grouping popular or rare instances together (assuming like-popularity instances are found together), or grouping seeds alphabetically since lists are often alphabetical on the Web.
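As a minimal sketch of the random seed selection described above, the following builds a quoted query string from randomly drawn seeds. The function name `build_query`, the default of four seeds, and the quoting style are assumptions for illustration; the step that submits the query to a search engine is deliberately left out.

```python
import random

def build_query(seeds, k=4, rng=None):
    """Draw k distinct seeds at random and join them as quoted phrases."""
    if rng is None:
        rng = random.Random()
    chosen = rng.sample(list(seeds), k)
    # Quote each seed so that multi-word names stay together as one phrase.
    return " ".join(f'"{seed}"' for seed in chosen)

# Hypothetical seed set; a fixed random generator makes the draw repeatable.
seeds = ["London", "Paris", "New York", "Rome", "Tokyo", "Berlin"]
query = build_query(seeds, k=4, rng=random.Random(0))
```

The more sophisticated strategies mentioned in the text (grouping by popularity or alphabetically) would only change how `chosen` is selected.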

We apply the learning and extraction to each document individually. Within a document we further partition the space based on the HTML tags. This is done by creating a subtree (or single HTML block from the whole document) for every set of composite tags (such as <table>, <select>, <td>, etc.) that have a start and end tag and more text and tags in between. Once we have selected an HTML block or subtree of the parsed HTML, we first identify all the positive seeds within that block that were used as search words. We may add a threshold to skip a block and continue with the next one if not enough seeds are found.
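The partitioning step can be sketched with Python's standard `html.parser` module. This is an illustration under simplifying assumptions, not the parser used by LE: the class `BlockCollector`, the function `blocks_with_seeds`, and the `min_seeds` threshold are hypothetical names, and seeds are located by plain substring search in the text of each block.

```python
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    """Collects one block per element that has both a start and an end tag."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of (tag, text pieces seen so far)
        self.blocks = []       # finished (tag, text) pairs, innermost first

    def handle_starttag(self, tag, attrs):
        self.open_blocks.append((tag, []))

    def handle_data(self, data):
        # Text belongs to every block that is currently open.
        for _, pieces in self.open_blocks:
            pieces.append(data)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.open_blocks):
            return  # stray end tag: ignore it
        while self.open_blocks:
            open_tag, pieces = self.open_blocks.pop()
            if open_tag == tag:
                self.blocks.append((tag, "".join(pieces)))
                break  # unclosed inner tags popped before this are dropped

def blocks_with_seeds(html, seeds, min_seeds=2):
    """Return (tag, seeds found) for every block with at least min_seeds seeds."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    result = []
    for tag, text in collector.blocks:
        found = [seed for seed in seeds if seed in text]
        if len(found) >= min_seeds:
            result.append((tag, found))
    return result

page = ("<html><body><table><tr><td>Italy</td><td>Japan</td></tr></table>"
        "<p>Dog</p></body></html>")
candidate_blocks = blocks_with_seeds(page, ["Italy", "Japan"])
```

Blocks that contain too few seeds, such as the individual `<td>` cells and the `<p>` element here, are skipped, which corresponds to the threshold mentioned above.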

At this point we apply the learning to induce a wrapper.

A prefix is some pattern that precedes a token (the seeds in our example). In order to learn the best prefix pattern for a given block, we consider all the keywords in that block, and find some pattern that maximally matches all of them. Generally we consider 3 to 10 keywords in a block to learn from (more discussion of this later). One option is to build a prefix that matches as many exact characters as possible for each keyword, starting from the token and going outwards to the left. A more flexible option is to increase expressiveness and have wildcards, Boolean characteristics, or semantic/synonym options in the matching, similar to Perl regular expressions. The former option is too specific to generalize well in almost any context, and the latter is complicated and requires many training examples (probably best for free text with many labeled examples). We chose a compromise that we believe will work well in the Web domain. First we require that all characters match up until the first HTML tag. For example, <center>hot Tucson</center> and <td>hot Phoenix</td> would have a prefix “hot ”. If the text matches up to a tag, then we check if the tags match. In this case we do not require that the whole tag match; we just require that the tag type be the same, even though the attributes may differ. This means that for an <a...> tag, two keywords might have a different “href=...” but still match. The only exception is a text block (text between tags): it must match exactly among all keywords in order to be included in the prefix. Some sample wrappers look like (<td><a>TOKEN motels</a></td>) and (//   TOKEN   //). The best prefix is generally considered to be the longest matching prefix. To learn a suffix, we apply the same idea outwards to the right of the token.
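The prefix rule described above can be sketched as follows. This is one possible reading of the rule, not the LE implementation: the function names are hypothetical, each keyword is located by its first occurrence in the block, and tags are reduced to their type before comparison so that differing attributes do not prevent a match.

```python
import re

PIECE = re.compile(r"<[^>]*>|[^<]+")            # a tag, or a run of text
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z0-9]+)")

def normalise(piece):
    """Reduce a tag to its type (<a href=...> becomes <a>); keep text as is."""
    match = TAG_NAME.match(piece)
    if match:
        return f"<{match.group(1)}{match.group(2).lower()}>"
    return piece

def left_pieces(block, keyword):
    """Pieces to the left of the keyword's first occurrence, nearest first."""
    start = block.index(keyword)
    return [normalise(p) for p in PIECE.findall(block[:start])][::-1]

def common_suffix(strings):
    """Longest string that every input ends with."""
    shortest = min(strings, key=len)
    n = 0
    while n < len(shortest) and all(s[-1 - n] == shortest[-1 - n] for s in strings):
        n += 1
    return shortest[len(shortest) - n:]

def find_best_prefix(block, keywords):
    """Longest prefix shared by all keywords, walking left from the token."""
    contexts = [left_pieces(block, k) for k in keywords]
    prefix = []  # collected nearest-first, reversed at the end
    for depth, pieces in enumerate(zip(*contexts)):
        if depth == 0 and not any(p.startswith("<") for p in pieces):
            # Text next to the token: keep only the shared characters.
            shared = common_suffix(pieces)
            if shared:
                prefix.append(shared)
            if any(p != shared for p in pieces):
                break
        elif all(p == pieces[0] for p in pieces):
            # Same tag type, or an identical text block between tags.
            prefix.append(pieces[0])
        else:
            break
    return "".join(reversed(prefix))

text_prefix = find_best_prefix(
    "<center>hot Tucson</center><td>hot Phoenix</td>", ["Tucson", "Phoenix"])
tag_prefix = find_best_prefix(
    '<td><a href="/it">Italy</a></td><td><a href="/jp">Japan</a></td>',
    ["Italy", "Japan"])
```

In the first call the shared text “hot ” is kept and matching stops at the differing tags; in the second the two `<a>` tags match despite different `href` attributes. A suffix learner would mirror this logic to the right of the token.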

LISTEXTRACTOR(seedExamples)
  documents = searchForDocuments(seedExamples)
  For each document in documents
    parseTree = ParseHTML(document)
    For each subtree in parseTree
      keyWords = findAllSeedsInTree(subtree)
      prefix = findBestPrefix(keyWords, subtree)
      suffix = findBestSuffix(keyWords, subtree)
      Add to wrapperTree from createWrapper(prefix, suffix)
    For each goodWrapper in wrapperTree
      Find extractions using goodWrapper
  Return list of extractions

Figure 14: High-level pseudocode for List Extractor

Once a wrapper is learned, we add it to a wrapper tree. The wrapper tree is a hierarchical structure that resembles the HTML structure. Each wrapper in the wrapper tree corresponds to blocks that subsume or contain other wrappers and their blocks. This can be useful for later analysis and comparison of wrappers for a given document in order to choose which wrappers to apply. One heuristic would be to only apply wrappers that are at the leaves (i.e. smallest HTML block with several keywords). Another heuristic would be to apply a wrapper only if it did not generalize any further than its children. After all the wrappers have been constructed and added to the tree, we select the best ones according to such a measure (initialized with defaults or learned in some way) and apply them to get extractions. Applying a wrapper simply means to find other sequences in the block that match the pattern completely, and then to extract the specified token.
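A wrapper tree and the leaf heuristic can be sketched as below. The `WrapperNode` class and the helper functions are hypothetical, and the sample tree reuses the nesting and keyword counts of the five wrappers in Figure 15; the sketch assumes that wrappers matching fewer than a minimum number of keywords are never added to the tree.

```python
from dataclasses import dataclass, field

@dataclass
class WrapperNode:
    name: str
    first_line: int        # first line of the HTML block covered
    last_line: int         # last line of the HTML block covered
    keywords_matched: int  # number of seed keywords the wrapper matches
    children: list = field(default_factory=list)

def prune(node, min_keywords):
    """Copy the tree, dropping subtrees whose root matches too few keywords."""
    kept = [prune(child, min_keywords)
            for child in node.children
            if child.keywords_matched >= min_keywords]
    return WrapperNode(node.name, node.first_line, node.last_line,
                       node.keywords_matched, kept)

def leaves(node):
    """Names of the wrappers at the leaves (the smallest remaining blocks)."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

# Nesting of the wrappers in Figure 15: w1 > w2 > w3 > {w4, w5}.
w4 = WrapperNode("w4", 5, 5, 2)
w5 = WrapperNode("w5", 6, 6, 2)
w3 = WrapperNode("w3", 4, 7, 4, [w4, w5])
w2 = WrapperNode("w2", 2, 12, 4, [w3])
w1 = WrapperNode("w1", 1, 13, 4, [w2])

leaves_min2 = leaves(prune(w1, min_keywords=2))
leaves_min3 = leaves(prune(w1, min_keywords=3))
```

With a threshold of two keywords the leaves are w4 and w5; raising the threshold to three leaves only w3, the wrapper that covers the whole table.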

5.4 Example and Parameters

We consider a relatively simple example in Figure 15 in order to see how the algorithm works, and to illustrate the effects of different parameters on precision, recall, overfitting, and generalization. On top we have the 4 seeds used to search and retrieve the HTML document, and below we have the 5 wrappers learned from at least 2 keywords and their bounding lines in the HTML.

The first wrapper, w1, is learned for the whole HTML document, and matches all 4 keywords; w2 is for the body, and is identical to w1, except for the context; w3 has the same wrapper pattern as w1 and w2, contains all keywords, but has a noticeably different and smaller context (just the single table block); w4 is interesting because here we see an example of overfitting. The suffix is too long and will not extract France. We see a similar problem in w5, where the prefix is too long and will not extract Israel.

It is easy to see that the best wrapper is w3; w4 and w5 are too specific; while w2 and w1 are too general.
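To make the comparison concrete, the sketch below applies the wrappers of Figure 15 to the HTML of that figure. The function `apply_wrapper` is a hypothetical illustration that treats a wrapper as a literal prefix and suffix around a tag-free token; it is not the matching code used by LE.

```python
import re

def apply_wrapper(block, prefix, suffix):
    """Return every tag-free token enclosed by the literal prefix and suffix."""
    # Lookarounds let neighbouring matches share delimiter text.
    pattern = re.compile(
        "(?<=" + re.escape(prefix) + ")([^<>]+?)(?=" + re.escape(suffix) + ")")
    return pattern.findall(block)

# The HTML of Figure 15, keyed by the line numbers used in the figure.
LINES = {
    1: "<html>", 2: "<body>", 3: "My favorite countries:", 4: "<table>",
    5: "<tr><td><a>Italy</a></td><td><a>Japan</a></td><td><a>France</a></td></tr>",
    6: "<tr><td><a>Israel</a></td><td><a>Spain</a></td><td><a>Brazil</a></td></tr>",
    7: "</table>", 8: "My favorite pets:", 9: "<table>",
    10: "<tr><td><a>Dog</a></td><td><a>Cat</a></td><td><a>Alligator</a></td></tr>",
    11: "</table>", 12: "</body>", 13: "</html>",
}

def block(first, last):
    """Join the figure's lines first..last into one HTML block."""
    return "\n".join(LINES[i] for i in range(first, last + 1))

w1_out = apply_wrapper(block(1, 13), "<td><a>", "</a></td>")
w3_out = apply_wrapper(block(4, 7), "<td><a>", "</a></td>")
w4_out = apply_wrapper(block(5, 5), "<td><a>", "</a></td><td><a>")
w5_out = apply_wrapper(block(6, 6), "</a></td><td><a>", "</a></td>")
```

Here w3 returns the six countries, w1 additionally returns the three pets, w4 misses France, and w5 misses Israel.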

There are a few heuristics one can apply to prefer wrappers such as w3 over the others. One is to force most or all keywords to match (in our case, forcing 3 or 4 words to match rather than 2 would not have allowed w4 or w5). Another is to only consider leaf wrappers. In the case of having at least 2 words match for a wrapper, this would not help since we would select w4 and w5. However, if we combine selecting leaf wrappers with matching many keywords, we would eliminate w4 and w5 and be left with w3, which is optimal. The intuition is that generally as we go up the wrapper tree, we generalize our wrappers to a larger part of the document, which is more prone to errors. If we do not force many keywords to match, we get smaller leaves and may be more precise lower in the tree, but miss out on some of the structure and get fewer extractions.

Keywords: Italy, Japan, Spain, Brazil

1 <html>
2 <body>
3 My favorite countries:
4 <table>
5 <tr><td><a>Italy</a></td><td><a>Japan</a></td><td><a>France</a></td></tr>
6 <tr><td><a>Israel</a></td><td><a>Spain</a></td><td><a>Brazil</a></td></tr>
7 </table>
8 My favorite pets:
9 <table>
10 <tr><td><a>Dog</a></td><td><a>Cat</a></td><td><a>Alligator</a></td></tr>
11 </table>
12 </body>
13 </html>

Wrappers (at least 2 keywords match):

w1 (1 - 13): <td><a>TOKEN</a></td>
w2 (2 - 12): <td><a>TOKEN</a></td>
w3 (4 - 7): <td><a>TOKEN</a></td>
w4 (5 - 5): <td><a>TOKEN</a></td><td><a>
w5 (6 - 6): </a></td><td><a>TOKEN</a></td>

Figure 15: Example HTML with learned wrappers. LE selects wrapper w3 that covers the table from lines 4 to 7 and extracts all the country names without errors. Other wrappers either over-generalize or under-generalize.

Below is a list of some parameters to consider when using this algorithm:

1. Number of keywords to match in a block

2. Selection of wrappers from the wrapper tree (leaves, all, other)

3. Length/complexity of prefix/suffix/both

4. Number of search words to use for retrieving documents

5. Selection of keywords for searching (random, alphabetical, popular/rare together/apart)

5.5 Results

We measured LE on three classes, running it with varying numbers of seeds and queries. We left all parameters at their default values (meaning the wrappers were fairly selective) and searched for documents using 4 randomly drawn seeds at a time. A sample of the results is shown in Experiment 9.

Class      Seeds  Queries  Extractions  Correct  % Correct
City       3,000    9,000      190,000   90,000        47%
Film         300    9,000       31,000   24,500        79%
Scientist     50    5,000       65,000   15,000        23%
City           5        1        6,000    4,000        66%

Experiment 8: Results for LE. Seeds is the number of positive examples given as input. Queries is the number of times 4 tokens were randomly selected from the seeds to search for documents. Extractions is the total number of unique extractions. Correct is the number of extractions in the class before using the Assessor to boost precision. LE can find large numbers of extractions from relatively few queries.
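The percentage column can be recomputed from the other columns as Correct divided by Extractions. The short check below uses only the figures reported in the table; the variable names are illustrative, and the counts in the table appear to be rounded, so small differences from the reported percentages are to be expected.

```python
# (class, seeds, queries, extractions, correct) as reported in the table.
rows = [
    ("City", 3000, 9000, 190000, 90000),
    ("Film", 300, 9000, 31000, 24500),
    ("Scientist", 50, 5000, 65000, 15000),
    ("City", 5, 1, 6000, 4000),
]

# Percentage of extractions that are correct, before the Assessor is applied.
percent_correct = [100 * correct / extractions
                   for _, _, _, extractions, correct in rows]
rounded = [round(p) for p in percent_correct]
```

The first three rows reproduce the reported 47%, 79%, and 23%. For the last row the rounded counts give about 66.7%, against the 66% reported.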

As Experiment 9 shows, LE is very efficient at finding many correct extractions in a class. Given five seeds, it found about 4,000 correct extractions in under two minutes. This figure is in fact modest, since some lists were found on pages that contained over 18,000 correct city instances (so a well-chosen search query can retrieve much better documents). However, in all cases, there was also a significant amount of junk. Here are some of the reasons for this:

1. Airports, hotels, countries, and other junk are often listed with cities

2. Actors, musicians, and misspellings are often listed with movies

3. Famous people, random names, and other information are often listed with scientists

Intuitively this makes sense, as lists and HTML structure in general often group related things together. Scientists are particularly difficult since they fall into many more general categories.

5.6 Discussion and Future Extensions

Although the percentage correct in all categories may not look very promising, these results are actually quite good since cutting down the number of candidate tokens from the whole Web to the subsets above helps the Assessor. Also, there may be many items found in lists and other structures on the Web that are not found in free text by standard information extraction methods. For example, rare cities found on long HTML select lists will often not be found in free text.

There are quite a few extensions that could make LE work better. Finding more relevant documents and lists, perhaps through better selection of seeds, will probably help, since there are clearly thousands of lists still to be found in all the classes considered here. Making the wrappers more expressive and learning the best wrapper parameters for each class could help too. For example, movies could use more flexible matching, since the same title sometimes appears with its words in a slightly different order.
