Automatic wrapper generator for integrated IR

SELECT Title and URL FROM Yahoo and Altavista WHERE Keyword =

4.6 Automatic wrapper generator for integrated IR

This approach presents a design for an automatic, XML-based framework for generating wrappers rapidly. Wrappers created with the framework support a unified interface for a meta-search information retrieval system built on the Internet Search Service and the CORBA standard. Because CORBA and XML are compatible, a user can quickly and easily develop information-gathering applications, such as a meta-search engine or any other tool that retrieves from information sources. The design makes two main contributions: a method of wrapper generation that is fast, simple, and efficient, and a wrapper generator that is CORBA- and XML-compliant and supports a unified interface.

The effort has mainly gone into designing wrappers that translate the results returned by a source into a specific representation for queries from the mediator. For a retrieval application developer faced with multiple information sources, it is important that the available retrieval applications share a uniform programming interface. For this reason, we propose an integrated information retrieval methodology with a unified interface, as shown in [18]. Its flexible architecture provides a unified programming interface and an information retrieval application for querying a variety of sources. We use an IIR (Integrated Information Retrieval) service based on COSS (Common Object Services Specification) of CORBA (Common Object Request Broker Architecture). The metadata of the sources is defined by a DTD (Document Type Definition) of XML (eXtensible Markup Language). With this system, an information retrieval developer can easily design applications or agents that collect the desired information through a high-level uniform programming interface.

The proposed architecture is well suited to the information retrieval task. Because it must handle multiple sources, however, a supporting framework is necessary. The framework must ensure that information retrieval application developers can generate wrappers simply and quickly, and that the generated wrappers are both XML- and CORBA-compliant. XML compliance enables data exchange between different information sources, and CORBA compliance enables communication between heterogeneous systems. In practice, by using this framework with an SQL-like high-level query scheme, the user or the client program (e.g., an information retrieval application) can extract information from a variety of sources.

Most current wrapper development methods make it difficult to design query and extraction rules, because doing so requires a good knowledge of web documents and of the rule syntax. Wrapper implementers find designing such rules difficult and tedious. In many systems on the Internet, the returned information is designed for a human user, not for a program. A wrapper is also an important software component between the information retrieval system and the information source. A well-defined wrapper with a uniform communication interface improves the performance of a heterogeneous information retrieval system, but writing this kind of wrapper increases the workload of the wrapper programmer. The solution proposed here is a framework that automatically generates XML-based wrappers with a CORBA-based unified interface. Within this framework, the XML data model is used to express the metadata of information sources, and the results are also output in XML format. CORBA, an open system model that supports communication between software components in distributed environments, is used to define the uniform interface for the meta-search system proposed in [18][19]. An information retrieval application can therefore use the CORBA standard to communicate with a variety of wrappers and acquire the results through a standardized object model. In addition, because XML is now a popular standard for representing and exchanging data, the wrapper programmer has no need to learn a new extraction language to generate wrappers.

A wrapper is a software component that encapsulates an information source. Its main objective is to serve as the interface between the client program and the information source. Because information sources are heterogeneous, it is best for wrappers to expose a uniform interface. A typical uniform-interface wrapper has three tasks. First, it receives user requests from a mediator and translates them into a query string (typically a URL for web information sources) or a query command (typical for an RDBMS) acceptable to the information source. Second, it retrieves Internet or intranet documents and extracts the desired information from them according to extraction rules provided by the user. Last, it stores the desired information in a specific form and provides an interface that allows a user or client to retrieve the data in a high-level, structured way.

The typical architecture of a wrapper that supports the system of [18] is shown in Figure 4.11. It has seven components: (1) Query String Translator, (2) Parameter Encapsulator, (3) Document Parser, (4) Information Extractor, (5) Result Packer, (6) Network Transmission Interface, and (7) Server Skeleton Interface. Each component is a stand-alone Java class and can be developed and replaced independently. All the data passing through these components are created as XML DOM objects, so the wrapper is developed with the standard DOM API. Because XML is a structured and meaningful data format, each component can easily interpret the data it receives and handle it in an explicit and precise way. Our automatic framework for wrapper generation builds on these advantages of XML.
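The idea of independent components that exchange DOM objects can be illustrated with a minimal Java sketch. It is not the thesis implementation: the class name `WrapperPipelineSketch`, the helper names, and the choice to model each component as a function from one DOM `Document` to another are assumptions made only for illustration.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: every wrapper stage maps one DOM Document to another,
// so a stage can be replaced without touching the others.
final class WrapperPipelineSketch {

    // Creates an empty DOM document with the given root element.
    static Document newDocument(String rootName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement(rootName));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // A placeholder stage that only records its name under the root element.
    // A real stage would translate, fetch, parse, extract, or pack instead.
    static UnaryOperator<Document> tracingStage(String name) {
        return doc -> {
            Element step = doc.createElement("stage");
            step.setAttribute("name", name);
            doc.getDocumentElement().appendChild(step);
            return doc;
        };
    }

    // Runs the stages in order, handing each one the previous stage's output.
    static Document run(Document input, List<UnaryOperator<Document>> stages) {
        Document current = input;
        for (UnaryOperator<Document> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Document out = run(newDocument("request"), List.of(
                tracingStage("QueryStringTranslator"),
                tracingStage("ParameterEncapsulator"),
                tracingStage("InformationExtractor")));
        System.out.println(out.getElementsByTagName("stage").getLength());
    }
}
```

Because every stage shares the same signature, replacing one component amounts to substituting one element of the list.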

Figure 4.11: Basic architecture of XML-based wrapper.

We now discuss the architecture of the XML-based wrapper shown in Figure 4.11 in more detail. On the right of the figure is the Network Transmission Interface, the gateway between the wrapper and the information source. Through it, the wrapper sends the appropriate query string to an individual information source and retrieves the results. On the left is the Server Skeleton Interface, which is mapped to the stub of the client program according to the CORBA standard. Through this IDL-defined interface, the wrapper receives an SQL-like request (e.g., SELECT URL FROM ALTAVISTA WHERE KEYWORD=wrapper) from the Client Stub of the client application. The Query String Translator parses the user request, extracts parameters such as URL, ALTAVISTA, and KEYWORD=wrapper, and packs them into a DOM object. It then passes the object to the Parameter Encapsulator, whose main task is to encapsulate the parameters into an appropriate query string to be sent to a specified search engine or information source.
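The translation step performed by the Query String Translator might look like the following Java sketch. The grammar it accepts (one WHERE condition, lists separated by commas or "and") and the element names `query`, `field`, `source`, and `condition` are assumptions for illustration; the source does not specify the request grammar or the DOM layout.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch of a Query String Translator: parses an SQL-like
// request and packs its parts into a DOM object.
final class QueryStringTranslatorSketch {

    // Assumed grammar: SELECT <fields> FROM <sources> WHERE <name>=<value>
    private static final Pattern REQUEST = Pattern.compile(
            "\\s*SELECT\\s+(.+?)\\s+FROM\\s+(.+?)\\s+WHERE\\s+(\\w+)\\s*=\\s*(.+?)\\s*",
            Pattern.CASE_INSENSITIVE);

    // Field and source lists may be separated by commas or by the word "and".
    private static final Pattern SEPARATOR =
            Pattern.compile("\\s*,\\s*|\\s+and\\s+", Pattern.CASE_INSENSITIVE);

    static Document translate(String request) {
        Matcher m = REQUEST.matcher(request);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported request: " + request);
        }
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("query");
        doc.appendChild(root);
        // One <field> element per requested field, normalized to upper case.
        for (String field : SEPARATOR.split(m.group(1))) {
            appendText(doc, root, "field", field.toUpperCase(Locale.ROOT));
        }
        // One <source> element per information source.
        for (String source : SEPARATOR.split(m.group(2))) {
            appendText(doc, root, "source", source.toUpperCase(Locale.ROOT));
        }
        // The condition keeps its value verbatim; only the name is normalized.
        Element condition = appendText(doc, root, "condition", m.group(4));
        condition.setAttribute("name", m.group(3).toUpperCase(Locale.ROOT));
        return doc;
    }

    private static Element appendText(Document doc, Element parent, String tag, String text) {
        Element e = doc.createElement(tag);
        e.setTextContent(text);
        parent.appendChild(e);
        return e;
    }
}
```

Calling `translate("SELECT URL FROM ALTAVISTA WHERE KEYWORD=wrapper")` yields a document whose root holds one `field`, one `source`, and one `condition` element, which the next component can read through the DOM API without re-parsing the request string.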

For example, the base query string for AltaVista (http://www.altavista.com) has the form http://www.altavista.com/cgi/query?q=DT…, and all the parameters (e.g., KEYWORD=wrapper) are appended to that string in the appropriate way. The Network Transmission Interface sends the query string with its parameters to the specified search engine (e.g., AltaVista), then retrieves the results and passes the data to the Document Parser.
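Appending parameters to a base query string can be sketched in Java as follows. The base URL `http://search.example/query` and the parameter names `q` and `kl` used in the usage note are placeholders, not the actual AltaVista query format, which the source gives only in abbreviated form.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of a Parameter Encapsulator: appends URL-encoded
// name=value pairs to a base query string.
final class ParameterEncapsulatorSketch {

    static String encapsulate(String baseUrl, Map<String, String> parameters) {
        StringJoiner pairs = new StringJoiner("&");
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            pairs.add(encode(p.getKey()) + "=" + encode(p.getValue()));
        }
        if (pairs.length() == 0) {
            return baseUrl; // nothing to append
        }
        // Choose the joining character based on what the base string already has.
        String separator;
        if (baseUrl.endsWith("?") || baseUrl.endsWith("&")) {
            separator = "";
        } else if (baseUrl.contains("?")) {
            separator = "&";
        } else {
            separator = "?";
        }
        return baseUrl + separator + pairs;
    }

    // Encodes one name or value as application/x-www-form-urlencoded (UTF-8).
    private static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}
```

With an insertion-ordered map holding `q=wrapper generator` and `kl=en`, `encapsulate("http://search.example/query", params)` produces a single string that the Network Transmission Interface could send as-is; a real wrapper would take the base string and parameter names from the source's metadata.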

The Document Parser then parses the content into an XML parse tree, and the Information Extractor extracts the desired fields according to user-defined extraction rules. The returned AltaVista document contains four major fields (URL, TITLE, DESCRIPTION, and RELATED PAGE); for the request above, the Information Extractor extracts every URL that contains the keyword 'wrapper'. Finally, the Result Packer packs all the information needed by the user into a standard format and returns it to the Client Stub, which is generated by the CORBA IDL compiler.
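The parse-then-extract step can be sketched with the standard Java DOM and XPath APIs. Two assumptions are made here that the source does not state: the extraction rule is written as an XPath expression, and the retrieved page has already been converted to well-formed XML. The element names `results`, `hit`, and `url` are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical sketch of the Document Parser and Information Extractor.
final class InformationExtractorSketch {

    // Document Parser: turns well-formed XML text into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Information Extractor: applies one rule (assumed to be XPath) and
    // returns the trimmed text of every matching node.
    static List<String> extract(Document doc, String rule) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(rule, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid extraction rule: " + rule, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<results>"
                + "<hit><url>http://a.example/wrapper.html</url><title>Wrappers</title></hit>"
                + "<hit><url>http://b.example/index.html</url><title>Other</title></hit>"
                + "</results>";
        // Keep only the URLs whose text contains the keyword.
        System.out.println(extract(parse(xml), "//hit/url[contains(., 'wrapper')]"));
    }
}
```

Expressing the rule as data rather than code is what would let a generator produce a new wrapper by supplying a different rule set for each information source.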

To support the uniform interface of the integrated information retrieval environment proposed in [18], the generated wrappers have to conform to the IIR (Integrated Information Retrieval) interfaces. There are five IIR interfaces: InformationRetriever, Wrapper, MetaData, Collector, and Iterator. The first of these is defined as follows:

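interface InformationRetriever {
    // Returns an object reference to the MetaData object.
    MetaData Get_meta(in QuerySourceType qsType);
    // Prepares a query for the named source and returns its Wrapper.
    Wrapper prepare(in ParameterList pl, in QuerySourceName qsName,
                    in QueryLanguageType qlType);
};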

The InformationRetriever interface has two methods, Get_meta() and prepare(). Get_meta() obtains an object reference to a MetaData object. prepare() prepares a query request with the appropriate parameters for a specified information source and then obtains an object reference to a Wrapper object, which starts the query process.
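The call sequence implied by these two methods, together with the Query() method of the Wrapper interface defined next, can be sketched in plain Java. This is not code generated by a CORBA IDL compiler: the Java interfaces are simplified stand-ins, the members `sourceNames()` and `items()` are hypothetical because this section does not define MetaData and Collector, QueryInvalid is modelled as an unchecked exception for brevity, and the in-memory retriever merely replaces a remote object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical plain-Java stand-ins for the IIR interfaces.
final class IirClientSketch {

    interface MetaData { List<String> sourceNames(); }   // member is hypothetical
    interface Collector { List<String> items(); }        // member is hypothetical
    interface Wrapper { Collector query(); }              // mirrors Query()

    interface InformationRetriever {
        MetaData getMeta(String querySourceType);         // mirrors Get_meta()
        Wrapper prepare(Map<String, String> parameterList,
                        String querySourceName, String queryLanguageType);
    }

    static final class QueryInvalid extends RuntimeException {
        QueryInvalid(String message) { super(message); }
    }

    // Toy retriever backed by a map from source name to known URLs.
    static InformationRetriever inMemory(Map<String, List<String>> urlsBySource) {
        return new InformationRetriever() {
            @Override
            public MetaData getMeta(String querySourceType) {
                List<String> names = new ArrayList<>(urlsBySource.keySet());
                return () -> names;
            }

            @Override
            public Wrapper prepare(Map<String, String> parameterList,
                                   String querySourceName, String queryLanguageType) {
                // The returned Wrapper does no work until query() is invoked.
                return () -> {
                    String keyword = parameterList.get("KEYWORD");
                    if (keyword == null) {
                        throw new QueryInvalid("missing KEYWORD");
                    }
                    List<String> hits = urlsBySource.get(querySourceName).stream()
                            .filter(url -> url.contains(keyword))
                            .collect(Collectors.toList());
                    return () -> hits;
                };
            }
        };
    }

    // Mediator-side sequence: Get_meta, then prepare, then Query.
    static List<String> search(InformationRetriever retriever, String sourceType,
                               String sourceName, String keyword) {
        MetaData meta = retriever.getMeta(sourceType);
        if (!meta.sourceNames().contains(sourceName)) {
            throw new QueryInvalid("unknown source: " + sourceName);
        }
        Wrapper wrapper = retriever.prepare(
                Map.of("KEYWORD", keyword), sourceName, "SQL-like");
        return wrapper.query().items();
    }
}
```

The sketch keeps preparation and execution separate, as the IDL does: prepare() only binds parameters to a source, and retrieval starts when the mediator invokes the query method on the returned wrapper.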

The Wrapper interface is defined as follows:

interface Wrapper {
    // Starts the retrieval process; returns a Collector reference.
    Collector Query() raises(QueryProcessingError, QueryInvalid);
};

The Wrapper interface allows a mediator to start the retrieval process by invoking its Query() method. InformationRetriever dispatches an appropriate wrapper to handle a specified information source according to the information obtained from Get_meta(). The MetaData interface is defined as follows:

interface MetaData {

In document: A Study on Email-based Mobile Agent Runtime Environment (pp. 89-93)