
Boolean QL_Available(in QuerySourceType qsType);

Boolean Registry(in QuerySourceMeta metadata);

Boolean Unregistry(in QuerySourceName qsName);

Boolean Replace(in QuerySourceName qsName, in QuerySourceName metadata);

QuerySourceMeta get(in QueryLanguageType ql_type);

}

The MetaData interface makes data retrieval robust across different information sources. By assigning each field of MetaData appropriately, the client can obtain the format of a query request and the schema of the result. For details of MetaData, please refer to [15].

The last two interfaces, Collector and Iterator, are responsible for collecting information from sources. They are explained as follows:

Interface Collector {

MetaData GetMeta(in QueryLanguageType qsType);

Readonly attribute long Result_size;

As may now be seen, a client program has two methods of retrieving the desired data. The first is random retrieval, which uses Result_size and retrieve_element_at(). The second is sequential retrieval, which uses create_iterator(). The latter creates an Iterator object, which provides a simple interface for data retrieval. For detailed information about the unified interface for the integrated information environment, please refer to Section 4.4.
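The two retrieval styles can be illustrated with a minimal Python sketch. It is not the IDL-generated code: the names Result_size, retrieve_element_at, and create_iterator come from the text above, while the Iterator methods has_next and next, the in-memory list, and the sample values are assumptions made only for illustration.

```python
class Iterator:
    """Sequential cursor over results (has_next/next are assumed names)."""

    def __init__(self, results):
        self._results = results
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._results)

    def next(self):
        item = self._results[self._pos]
        self._pos += 1
        return item


class Collector:
    """In-memory stand-in for the Collector interface."""

    def __init__(self, results):
        self._results = list(results)

    @property
    def Result_size(self):
        # Modeled as a read-only attribute, as in the IDL fragment.
        return len(self._results)

    def retrieve_element_at(self, index):
        return self._results[index]

    def create_iterator(self):
        return Iterator(self._results)


collector = Collector(["doc-a", "doc-b", "doc-c"])  # hypothetical results

# Random retrieval: the client chooses the index itself.
random_order = [collector.retrieve_element_at(i)
                for i in range(collector.Result_size)]

# Sequential retrieval: the Iterator keeps track of the position.
it = collector.create_iterator()
sequential_order = []
while it.has_next():
    sequential_order.append(it.next())
```

Both loops visit the same elements; the iterator simply spares the client from managing indices.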

To generate a wrapper for a specified search engine or information source correctly, the wrapper implementer must provide adequate information to the wrapper generator in the following forms: a Query Template file, a Pattern file, and a Schema file. The role of each in our wrapper generation framework is shown in Figure 4.12. Within the framework, wrapper generation has three phases: (1) Query Translation, (2) Document Retrieval, and (3) Result Translation.

Figure 4.12: An overview of the wrapper generator.

In Figure 4.12, we consider the developed information retrieval application. The application communicates with the wrapper through the interface defined by the IDL files. In the Query Translation phase, the wrapper receives the user's high-level SQL-like queries from the application via the CORBA standard and translates them into a format acceptable to the specified search engine or information source. To do so, the wrapper generator must know the query command patterns of that source. The wrapper implementer must therefore provide the query format in the QUERYTEMPLATE file.
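As a rough illustration of the Query Translation phase, the following Python sketch maps a very small SQL-like query onto an HTTP query string. The template layout (the querytemplate, base, and param elements), the example URL, and the restricted WHERE syntax are all assumptions made for this sketch; the actual QUERYTEMPLATE format is not specified here.

```python
import re
import xml.dom.minidom
from urllib.parse import urlencode

# Hypothetical QUERYTEMPLATE content: maps query fields to URL parameters.
QUERY_TEMPLATE = """<querytemplate>
  <base>http://search.example.org/find</base>
  <param field="title" name="q"/>
  <param field="author" name="au"/>
</querytemplate>"""


def load_template(xml_text):
    """Read the base URL and the field-to-parameter mapping."""
    dom = xml.dom.minidom.parseString(xml_text)
    base = dom.getElementsByTagName("base")[0].firstChild.data.strip()
    mapping = {p.getAttribute("field"): p.getAttribute("name")
               for p in dom.getElementsByTagName("param")}
    return base, mapping


def translate(query, base, mapping):
    """Translate "... WHERE f = 'v' AND g = 'w'" into a URL."""
    where = query.split(" WHERE ", 1)[1]
    conditions = re.findall(r"(\w+)\s*=\s*'([^']*)'", where)
    params = [(mapping[field], value) for field, value in conditions]
    return base + "?" + urlencode(params)


base, mapping = load_template(QUERY_TEMPLATE)
url = translate(
    "SELECT title, url FROM source "
    "WHERE title = 'wrapper generation' AND author = 'Lee'",
    base, mapping)
```

Keeping the mapping in the template rather than in code is what lets the same translation routine serve different sources.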

In the Document Retrieval phase, the wrapper uses the QUERYTEMPLATE file to build the appropriate query string or command and attempts to retrieve documents from the specified information source. If the information provided by the user is correct and the network connection is good enough, the desired documents are quickly retrieved into local data storage.

Most documents returned from information sources are either semi-structured or unstructured. One of the responsibilities of the wrapper in this phase is to parse the documents, extract the desired information, and then store it in XML DOM objects. The wrapper implementer must provide in advance a PATTERN file that regulates how the specified data fields are handled during document parsing. In addition, for proper management of the data in the specified fields, detailed information about the data type and variable name of the fields of interest to the user has to be provided in the SCHEMA part of the PATTERN file.

The QUERYTEMPLATE, PATTERN, and SCHEMA files are all written in XML syntax.
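The pattern-driven extraction described above can be sketched as follows. The layout of the PATTERN content (field elements with name, type, begin, and end attributes) and the sample page are hypothetical; the sketch only shows the general idea of matching delimiters around each field of interest and storing the result in a DOM object.

```python
import re
import xml.dom.minidom

# Hypothetical PATTERN content: each field is located by the text before
# and after it; name and type play the role of the SCHEMA part.
PATTERN_FILE = """<pattern>
  <field name="title" type="string" begin="&lt;b&gt;" end="&lt;/b&gt;"/>
  <field name="price" type="float" begin="Price: " end=" USD"/>
</pattern>"""

# Hypothetical semi-structured document returned by a source.
PAGE = "<html><body><b>XML in a Nutshell</b> Price: 39.95 USD</body></html>"


def extract(page, pattern_xml):
    """Match each field pattern against the page and build a DOM record."""
    spec = xml.dom.minidom.parseString(pattern_xml)
    out = xml.dom.minidom.Document()
    record = out.createElement("record")
    out.appendChild(record)
    for field in spec.getElementsByTagName("field"):
        regex = (re.escape(field.getAttribute("begin")) + "(.*?)"
                 + re.escape(field.getAttribute("end")))
        match = re.search(regex, page, re.DOTALL)
        if match is None:
            continue  # field not present in this document
        node = out.createElement(field.getAttribute("name"))
        node.setAttribute("type", field.getAttribute("type"))
        node.appendChild(out.createTextNode(match.group(1).strip()))
        record.appendChild(node)
    return out


dom = extract(PAGE, PATTERN_FILE)
title = dom.getElementsByTagName("title")[0].firstChild.data
price = float(dom.getElementsByTagName("price")[0].firstChild.data)
```

Only the delimiters around each field need to be known; the rest of the page structure is ignored.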

Following the Document Retrieval phase, the information of interest to the user is stored in the wrapper's data structure in the form of XML DOM objects. In the Result Translation phase, the client program can obtain the information from these objects quickly and simply via the CORBA standard. One of the responsibilities of the wrapper in this phase is to prepare the results extracted from the DOM objects in an appropriate format and send them back to the client program via CORBA.

The interface between the wrapper and the client program is also an important part of the framework. A concise and standard interface is needed here, so we adopt the IDL (Interface Definition Language) of the CORBA standard to define it. The CORBA standard is language-independent, so a wrapper generated within the framework can communicate with the client program regardless of programming language or operating system. That is, a wrapper generated within the framework is an XML-based and CORBA-enabled component on the network. In this way, wrappers fully support the integrated heterogeneous information retrieval system we propose in [18].

As discussed earlier, most information sources constantly change their content or even their structure. Wrapper code is tightly coupled with the structure of a specified information source. If the information source changes its document structure, the wrapper implementer must also change the wrapper code. Such constant modification is both time-consuming and tedious. Consequently, an information retrieval application developer welcomes an automatic wrapper generation system that decreases the workload. Unlike what is proposed in other works, the XML-based wrapper generation framework we present is automatic and consequently answers this need. Not only is it XML-compliant and CORBA-enabled, but the generating procedure is also simple and fast. The workflow for the framework is shown in Figure 4.13.

Figure 4.13: The workflow of the proposed wrapper generation.

There are many editing tools for XML and IDL files. The wrapper implementer must select the appropriate tools to define and prepare the QUERY TEMPLATE, PATTERN, SCHEMA, and IDL files shown in Figure 4.13. Consider the Wrapper Generation phase shown in this figure. First, an IDL compiler compiles a user-defined IDL file and produces a client stub and a server skeleton for the interface between the client application and the generated CORBA-based wrapper. The information retriever uses the client stub to develop a client program that communicates with the wrapper in an appropriate way. The server skeleton, together with the QUERY TEMPLATE, PATTERN, and SCHEMA files, is required in the wrapper generation procedure. Employing these user-defined files, the generator selects the appropriate components stored in the Component Repository to produce the desired wrapper. In the Runtime phase, the client program can invoke the new wrapper to retrieve the requested information via the CORBA interface defined in the IDL file.

The architecture of our wrapper generator is shown in Figure 4.14. Three XML-formed files, QUERY TEMPLATE, SCHEMA, and PATTERN, are required. An XML parser parses them and stores the desired information into DOM objects as defined by W3C. That is, following the XML Parsing phase, three DOM objects are produced. Then, the Query Rules Analysis phase creates the Query Transfer Rules and the Query String according to the DOM object generated from the QUERY TEMPLATE file. The Data Extract Rules are generated in the Data Extract Rule Analysis phase according to the DOM object generated from the SCHEMA file. The Pattern Sample is produced by the Pattern Analysis phase according to the DOM object generated from the PATTERN file. The Code Generator shown in Figure 4.13 then chooses suitable components from the Component Repository in accordance with all this information (Query Transfer Rules, Query String, Data Extract Rules, and Pattern Sample).

Finally, the Binder binds the components chosen by the Code Generator and the Server Skeleton code fragments produced by the IDL compiler into the desired new wrapper.
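The generation pipeline can be summarized in a toy Python sketch. Everything concrete in it is hypothetical: the XML attributes, the component names in the repository, and the skeleton name are placeholders, and the analysis step is reduced to reading a few attributes. It is meant only to show the order of the stages (parse, analyze, select components, bind), not the actual generator.

```python
import xml.dom.minidom

# Hypothetical Component Repository: (role, kind) -> component name.
COMPONENT_REPOSITORY = {
    ("fetch", "http"): "HttpFetcher",
    ("fetch", "file"): "FileFetcher",
    ("parse", "html"): "HtmlPatternMatcher",
    ("parse", "text"): "TextPatternMatcher",
}


def parse_specs(specs):
    """XML Parsing phase: one DOM object per specification file."""
    return {name: xml.dom.minidom.parseString(text)
            for name, text in specs.items()}


def analyze(doms):
    """Analysis phases, reduced here to reading a few assumed attributes."""
    return {
        "protocol": doms["QUERYTEMPLATE"].documentElement.getAttribute("protocol"),
        "format": doms["PATTERN"].documentElement.getAttribute("format"),
        "fields": [f.getAttribute("name")
                   for f in doms["SCHEMA"].getElementsByTagName("field")],
    }


def generate(rules):
    """Code Generator: choose components that match the analyzed rules."""
    return [COMPONENT_REPOSITORY[("fetch", rules["protocol"])],
            COMPONENT_REPOSITORY[("parse", rules["format"])]]


def bind(components, rules, skeleton):
    """Binder: combine the chosen components with the server skeleton."""
    return {"skeleton": skeleton, "components": components,
            "fields": rules["fields"]}


specs = {  # hypothetical file contents
    "QUERYTEMPLATE": '<querytemplate protocol="http"/>',
    "SCHEMA": '<schema><field name="title"/><field name="price"/></schema>',
    "PATTERN": '<pattern format="html"/>',
}
rules = analyze(parse_specs(specs))
wrapper = bind(generate(rules), rules, skeleton="Collector_skel")
```

Because component selection is driven entirely by the specification files, a change in the information source requires editing those files and regenerating, rather than rewriting wrapper code by hand.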


Figure 4.14: Architecture of automatic wrapper generator.

In summary, the usual method for extracting information fields of interest in the framework is pattern matching. Using it, a wrapper implementer can quickly and easily prepare wrapper specifications for a specified information source. This differs from previous approaches in that a wrapper generated in our framework focuses on extracting fields of interest from returned documents and not on analyzing their content. An implementer is not required to understand the whole structure of a specified information source. Consequently, the time and cost of generating a new wrapper are greatly reduced.

All the data structures, including the imported files and the representation of objects, are XML-compliant. Since XML is a widely popular standard nowadays, most developers are