Predicting Input Values - Beautiful Data

A large number of HTML forms have text inputs. In addition, some forms with select menus require values in their text inputs before any results can be retrieved.

We note that text inputs are typically used in two different ways. First, there are generic inputs that accept practically any reasonable value; the words entered in these inputs are used to retrieve all documents in a backend text database that contain those words.

Common examples of this case are searching for books by title or by author. Second, there are typed inputs. Such inputs accept only values from a well-defined finite set or datatype (e.g., zip codes), or from some continuous but well-defined datatype (e.g., dates or prices). Invalid entries in typed text boxes generally lead to error pages, and hence it is important to identify the correct data type. Badly chosen keywords in generic text boxes can still return some results, and hence the challenge lies in identifying a finite set of words that extracts a diverse set of result pages.

The two types of text inputs can be treated separately. In what follows, we first describe an algorithm to generate keywords for a generic input before considering the case of typed inputs.

Generic text inputs

Before we describe how good candidate keywords can be identified for generic inputs, let us consider (and dismiss) a possible alternative. Conceivably, we could have designed word lists in various domains to enter into text inputs and tried to match each text input with the best-fitting word list. However, we quickly realized that there are far too many concepts and far too many domains. Furthermore, for generic inputs, even if we identified inputs in two separate forms that correspond to the same concept in the same domain, it is not necessarily the case that the same set of keywords will work on both sites. The best keywords often turn out to be very site-specific. Since our goal was to scale to millions of forms and multiple languages, we required a simple, efficient, and fully automatic technique.

We adopt an iterative probing approach. At a high level, we assign an initial seed set of candidate keywords as values for the text input and construct a query template with the text box as the single binding input. We generate the corresponding form submissions, download the contents of the corresponding web pages, and extract additional keywords from the resulting documents. The extracted keywords are then used to update the candidate values for the text box. We repeat the process until either we are unable to extract further keywords or we have reached an alternate stopping condition, e.g., a sufficient number of candidate keywords. On termination, a subset of the candidate keywords is chosen as the set of values for the text box.
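As a rough illustration of the probing loop described above, the following Python sketch grows a candidate set from a seed set. The `submit` callable (keyword in, result-page text out), the crude tokenizer, the toy document list, and the limits `max_keywords` and `max_rounds` are all assumptions made for this example; they are not taken from the system described in the text.

```python
import re
from typing import Callable, Iterable


def extract_words(text: str) -> set[str]:
    """Lowercased alphabetic tokens of length >= 3 (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def iterative_probe(
    seeds: Iterable[str],
    submit: Callable[[str], str],  # hypothetical: keyword -> text of the result page
    max_keywords: int = 50,        # assumed stopping condition
    max_rounds: int = 10,          # assumed safety limit
) -> set[str]:
    """Grow a candidate keyword set by probing a form with one text input."""
    candidates = set(seeds)
    probed: set[str] = set()
    for _ in range(max_rounds):
        pending = sorted(candidates - probed)
        # Stop when nothing new was extracted or enough candidates exist.
        if not pending or len(candidates) >= max_keywords:
            break
        for keyword in pending:
            probed.add(keyword)
            candidates |= extract_words(submit(keyword))
    return candidates


# Toy stand-in for a form backed by a small text database.
DOCS = [
    "python data analysis",
    "data visualization with charts",
    "charts and maps",
    "cooking pasta recipes",
]


def toy_submit(keyword: str) -> str:
    """Return every toy document that contains the keyword as a whole word."""
    return "\n".join(d for d in DOCS if keyword in d.split())


found = iterative_probe(["python"], toy_submit)
```

In this toy run the seed `python` eventually reaches the documents about charts and maps, but never the cooking document, which shares no words with the others. This is one reason a final selection step that favors diverse results is useful.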

Iterative probing has been proposed in the past as a means to retrieve documents from a text database (Barbosa 2004, Callan 2001, Ipeirotis 2002, Ntoulas 2005). However, these approaches had the goal of achieving maximum coverage of specific sites. As a consequence, they employ site-aware techniques, and the approaches are not applicable across all domains.

At a high level, we customize iterative probing as follows:

• To determine whether the text input is in fact a generic input, we apply the informativeness test on the template in the first iteration using the initial candidate set. Our results indicate that generic text inputs are likely to be deemed informative, but other inputs are not.

• To select the seed set of candidate values, we analyze the contents of the web page that has the form. We select words from a page by identifying the words most relevant to its contents. Any reasonable word scoring measure, e.g., the popular TF-IDF measure (Salton 1983), can be used to select the top few words on the form page.

• To select new candidate values at the end of each iteration, we consider the set of all words found on all form submission pages analyzed for the template. We exclude words that appear on too many pages, since they are likely to be part of the boilerplate HTML that appears on every page. We also exclude words that appear on only one page, since they are likely to be nonsensical or idiosyncratic words that are not representative of the contents of the form site.

• To select the final set of values for the text input, we consider all the candidate values extracted from the form page or the submission pages and select from the set in the order of their ability to retrieve the most diverse content (by analyzing the content of the pages resulting from form submissions).
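The last two steps in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `max_fraction` cutoff for "too many pages" is invented, and diversity is approximated by counting result records not yet retrieved by an earlier keyword, whereas the text says only that the content of the result pages is analyzed. Both function names are hypothetical.

```python
from collections import Counter


def filter_candidates(pages: list[set[str]], max_fraction: float = 0.8) -> set[str]:
    """Keep words that appear on more than one page but not on too many.

    `pages` holds the word set of each form submission page; `max_fraction`
    is an assumed cutoff for what counts as boilerplate.
    """
    page_count = Counter(word for page in pages for word in page)
    limit = max_fraction * len(pages)
    return {word for word, n in page_count.items() if 1 < n <= limit}


def select_diverse(results: dict[str, set[str]], k: int) -> list[str]:
    """Greedily pick up to k keywords, each adding the most not-yet-seen records.

    `results` maps a keyword to the identifiers of the records its submission
    returned (an assumed proxy for page content).
    """
    chosen: list[str] = []
    covered: set[str] = set()
    remaining = dict(results)
    while remaining and len(chosen) < k:
        # Sorting first makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda w: len(remaining[w] - covered))
        if not remaining[best] - covered:
            break  # no remaining keyword retrieves anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


pages = [
    {"home", "search", "jazz", "blues"},
    {"home", "search", "jazz"},
    {"home", "search", "blues", "zydeco"},
    {"home", "search", "rock"},
    {"home", "search", "rock", "jazz"},
]
kept = filter_candidates(pages)

picks = select_diverse(
    {"jazz": {"r1", "r2", "r3"}, "blues": {"r2", "r3"}, "rock": {"r4"}}, k=2
)
```

Here `home` and `search` are dropped as boilerplate because they appear on every page, and `zydeco` is dropped because it appears on one page only. In the selection step, `blues` is skipped because everything it retrieves is already covered by `jazz`.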

We note that placing a single maximum limit on the number of keywords per text input is unreasonable because the contents of form sites vary widely, from a few results to tens to millions. We use a back-off scheme to address this problem. We start with a small maximum limit per form. Over time, we measure the amount of search engine traffic that is affected by the generated URLs. If the number of queries affected is high, then we increase the limit for that form and restart the probing process.
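One way to picture the back-off scheme is as a rule that raises the per-form keyword limit only when the generated URLs prove useful. In the sketch below, the function name, the traffic threshold, the growth factor, and the hard cap are all hypothetical values chosen for illustration; the text does not specify them.

```python
def next_keyword_limit(
    current_limit: int,
    queries_affected: int,
    traffic_threshold: int = 1000,  # assumed: what counts as "high" traffic
    growth: int = 2,                # assumed: multiplicative increase
    hard_cap: int = 10_000,         # assumed: upper bound per form
) -> int:
    """Return the keyword limit to use when probing of a form is restarted."""
    if queries_affected >= traffic_threshold:
        return min(current_limit * growth, hard_cap)
    return current_limit


limit = next_keyword_limit(current_limit=10, queries_affected=5000)
```

A form whose URLs affect few queries keeps its small limit, so probing effort concentrates on the forms that matter to search traffic.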

Our experimental analyses indicate that iterative probing as outlined here is effective in selecting input values for generic inputs. The corresponding form submissions are able to expose a large number of records in the underlying database. Interestingly, we found that text inputs and select menus in the same form often expose different parts of the underlying data. We were also able to establish that a web crawler can, over time, expose more deep-web content starting with the URLs generated by our system.

Typed text inputs

Our work indicates that there are relatively few types that, if recognized, can be used to index many domains, and therefore appear in many forms. For example, a zip code is used as an input in many domains, including store locators, used cars, public records, and real estate. Likewise, a date is often used as an input in many domains, such as events and article archives.

To utilize this observation, we build on two ideas. First, a typed text input will produce reasonable result pages only with type-appropriate values. We use this to set up informativeness tests using known values for popular types. We consider finite and continuous types. For finite types (e.g., zip codes and state abbreviations in the U.S.), we can test for informativeness using a sampling of the known values. For continuous types, we can test using sets of uniformly distributed values corresponding to different orders of magnitude.
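The following Python sketch illustrates how such test values might be generated and checked. It rests on assumptions: the informativeness test is not defined in this section, so the sketch treats an input as informative when at least half of the probed values yield distinct result pages, and `submit`, `toy_zip_form`, and the sample zip codes are hypothetical stand-ins.

```python
import hashlib
import random
from typing import Callable, Sequence


def sample_known(values: Sequence[str], k: int, seed: int = 0) -> list[str]:
    """Sample k distinct known values of a finite type (e.g., zip codes)."""
    return random.Random(seed).sample(list(values), k)


def magnitude_samples(low_exp: int, high_exp: int, per_decade: int = 3) -> list[float]:
    """Values spread uniformly within each order of magnitude 10**e to 10**(e+1)."""
    values: list[float] = []
    for e in range(low_exp, high_exp):
        lo, hi = 10.0 ** e, 10.0 ** (e + 1)
        step = (hi - lo) / per_decade
        values.extend(lo + i * step for i in range(per_decade))
    return values


def is_informative(
    values: Sequence[object],
    submit: Callable[[str], str],        # hypothetical: value -> result page text
    min_distinct_fraction: float = 0.5,  # assumed threshold
) -> bool:
    """Assumed test: enough probed values lead to distinct result pages."""
    if not values:
        return False
    signatures = {
        hashlib.sha256(submit(str(v)).encode("utf-8")).hexdigest() for v in values
    }
    return len(signatures) / len(values) >= min_distinct_fraction


def toy_zip_form(value: str) -> str:
    """Toy typed input: anything but five digits leads to an error page."""
    if len(value) == 5 and value.isdigit():
        return f"Stores near {value}"
    return "Error: invalid zip code"


KNOWN_ZIPS = ["10001", "60614", "94043", "73301", "02139"]
zip_ok = is_informative(sample_known(KNOWN_ZIPS, 3), toy_zip_form)
words_ok = is_informative(["apple", "banana", "cherry", "date"], toy_zip_form)
```

With the toy form, sampled zip codes each produce a different page, while arbitrary words all collapse onto the same error page, so only the zip code type passes the test.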

Second, popular types in forms can be associated with distinctive input names. We can use such a list of input names, either manually provided or learned over time (e.g., as in [Doan 2001]), to select candidate inputs on which to apply our informativeness tests.
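A manually provided list of input names could be as simple as a table of patterns per type. The patterns and type labels below are hypothetical examples, not the list used by the authors; the sketch shows only how input names might narrow down which informativeness tests to run.

```python
import re

# Hypothetical hand-written patterns; a real list could also be learned over time.
TYPE_NAME_PATTERNS = {
    "zip_code": re.compile(r"zip|postal", re.IGNORECASE),
    "date": re.compile(r"date|day|month|year", re.IGNORECASE),
    "price": re.compile(r"price|cost|amount", re.IGNORECASE),
}


def candidate_types(input_name: str) -> list[str]:
    """Return the types whose name patterns match the form input's name."""
    return [
        type_name
        for type_name, pattern in TYPE_NAME_PATTERNS.items()
        if pattern.search(input_name)
    ]


matches = candidate_types("ZipCode")
```

An input named `q` matches no pattern and would be handled as a possible generic input instead. Name matching proposes candidates only; the informativeness test still decides whether the type actually fits.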

