
CHAPTER 2 RELATED WORK

2.2 Previous Systems Developed

2.2.1 Unisearch

The Consortium on Core Electronic Resources in Taiwan (CONCERT) [28] provides plentiful online databases, but a user may not know which one to use. Because each resource has its own search interface, searching becomes even harder: the user must learn how to use every type of interface. Unisearch was created to make CONCERT resources simpler to use (Figure 2-2).

Figure 2-2: Unisearch provides a single interface for multiple resources

Unisearch is a metasearcher specially designed to take into account the constraints of CONCERT resources:

1. Although protocol standards for distributed search exist, such as Z39.50, many resources do not provide such a facility or require the purchase of a separate module to comply with the desired protocol. Relying on a well-known protocol standard of the library community is therefore not feasible.

2. Unisearch is aimed at satisfying CONCERT needs, and its scale was not sufficient to bring resource providers together to agree on a common communication protocol.

3. The resources were not free, so access policies must be respected, or an agreement must be reached with each resource provider if special access privileges are needed.

4. Usage statistics are important and should not be distorted by the use of Unisearch.

5. The interface of a resource may change.

Issues 1 and 2 constrain the design to rely only on HTML and the HTTP protocol. Unlike Z39.50, these standards place few constraints on how a search is performed, so the interface of every resource needs to be analyzed. A wrapper is created for every resource, which transforms the user’s query into a form acceptable to the target resource. Resources have different search capabilities (Table 2-1).

Table 2-1: Search fields supported by different resources (resources: EI Cpx, SDOS, IEL, WOS; search fields: Abstract, Author, Journal Name, Title, ISSN/ISBN, Year of)

Most resources support an advanced search that allows users to create complex queries in a single command-line format. Translating queries into the command-line format is simpler than mapping them to HTML controls; some examples are given in Table 2-2.

If the command-line format is not supported, the search terms need to be mapped to the respective HTML controls. A query template mechanism is used to keep resource-specific definitions out of the program code. An example of a template is given in Figure 2-3.

Field name; database names: Ovid, Ebsco, Proquest, Ei CPX

Abstract: ab, AB, ABS(customer delight), AB

Author: au, AU, AUTHOR(Gertrude Enders Huntington), AU(Michael Kinsley)

Journal Name: jn, ST

Source: SO, SO(chicago tribune), JO(computing)

Title: ti, TI, TITLE(Future), TI(future AND career)

Table 2-2: Advanced search in the command-line format

Figure 2-3: Query template example
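The query translation described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the Unisearch implementation: the profile layout, the `translate` function, the resource names `resource_a` and `resource_b`, and the use of `AND` to join fields are all hypothetical. Only the `TAG(terms)` shape is borrowed from the Table 2-2 examples such as TI(future AND career).

```python
# Hypothetical resource profiles: each maps a generic field name to the
# format string that a target resource is assumed to accept.
PROFILES = {
    "resource_a": {"title": "TI({})", "author": "AU({})", "abstract": "AB({})"},
    "resource_b": {"title": "TITLE({})", "author": "AUTHOR({})"},
}


def translate(query, resource):
    """Translate {field: terms} into one command-line query string.

    Fields the resource does not support are skipped and reported,
    because resources differ in their search capabilities.
    """
    profile = PROFILES[resource]
    parts, unsupported = [], []
    for field, terms in query.items():
        if field in profile:
            parts.append(profile[field].format(terms))
        else:
            unsupported.append(field)
    # Joining fields with AND is an assumption made for this sketch.
    return " AND ".join(parts), unsupported


query = {"title": "future AND career", "abstract": "customer delight"}
print(translate(query, "resource_a"))
print(translate(query, "resource_b"))
```

Keeping the per-resource strings in data rather than in code mirrors the stated goal of the template mechanism: when a resource changes, only its profile entry needs to be edited.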

Issues 3 and 4 imply that resource providers should know who is using a resource even when the search is submitted through a metasearch system. Most resources identify users by the IP address of the user’s browser. In a traditional server-based solution, the server connects to the target resource, so the resource sees the server’s IP address rather than the user’s, which defeats the access control and distorts the usage statistics of the target resource. Unisearch instead opens the connection from the user’s computer, not from the server, and so avoids the problem.

Issue 5 requires Unisearch to adopt strategies that simplify adaptation to changes. When a resource changes, adaptation is achieved by modifying the template (Figure 2-3) and the profile of the resource rather than the program itself, which separates the volatile part from the code and makes updates convenient. This information is kept on a server, and the user’s client retrieves it every time it performs a search.

Figure 2-4 shows the Unisearch architecture. When a user starts the Unisearch system, the User Identificator module identifies the user and provides a list of the databases that the user may use. The Database Selection Interface is then displayed (Figure 2-5), in which the user selects the target resources to search. The Search Interface lets the user enter the query (Figure 2-6).

Figure 2-4: Unisearch Architecture (modules shown: User Identificator, Database Selection Interface, Search Interface, Process Controller, Search Dispatcher, Query Constructor, Database Connector, Database 1, Database 2, Database 3, ..., Database n, Search Results)

Figure 2-5: Selection of database sources for Unisearch

Figure 2-6: Search interface for Unisearch

The Process Controller module creates several instances of the Search Dispatcher object, one for each selected database. The Search Dispatcher loads the target resource profile and uses it to translate the user’s query into the command-line format accepted by the target resource. The Database Connector module loads the query template and uses it to send the translated query to the respective target resource. Results are displayed as shown in Figure 2-7.
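The control flow in this paragraph can be sketched as follows. The class and function names echo the module names in Figure 2-4, but the code is a hypothetical stand-in: the profile format is assumed, running the dispatchers concurrently in a thread pool is an assumption, and no real HTTP request is sent.

```python
from concurrent.futures import ThreadPoolExecutor


class SearchDispatcher:
    """Handles one target resource (hypothetical stand-in for the real module)."""

    def __init__(self, resource, profile):
        self.resource = resource
        self.profile = profile  # assumed layout: {"title": "TI({})", ...}

    def run(self, field, terms):
        # Translate the query using the resource profile.
        command = self.profile[field].format(terms)
        # A real Database Connector would send `command` over HTTP here;
        # this sketch only returns what would be sent.
        return self.resource, command


def process_controller(selected, field, terms):
    """Create one dispatcher per selected database and run them concurrently."""
    dispatchers = [SearchDispatcher(name, prof) for name, prof in selected.items()]
    with ThreadPoolExecutor(max_workers=len(dispatchers)) as pool:
        futures = [pool.submit(d.run, field, terms) for d in dispatchers]
        # Collect results in the order the databases were selected.
        return [f.result() for f in futures]


selected = {
    "resource_a": {"title": "TI({})"},
    "resource_b": {"title": "TITLE({})"},
}
results = process_controller(selected, "title", "future")
print(results)
```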

Figure 2-7: Return of search results using Unisearch

2.2.2 Virtual Union Catalog System

Interlibrary loan helps libraries compensate for the deficiencies of their collections, but without a unified catalog it is very inconvenient: users need to search the online public access catalog of every potential library. A unified catalog can be constructed by periodically harvesting information about the holdings of the participating libraries, but the effort needed is great, and users will occasionally miss items that changed between two harvests.

The Virtual Union Catalog System (VUCS) is a metasearcher designed to solve the latency problem and to remove the periodic update effort of a unified catalog.

2.3 Related Technologies

2.3.1 Extensible Markup Language

XML is a markup language for documents containing structured information. Structured information contains both content and some indication of what role that content plays, so it identifies structures in a document. XML does not define the semantics or the tags; it is a metalanguage, i.e. a language for describing other languages, which lets you design your own customized markup languages for limitless different types of documents.

2.3.2 Document Object Model

The W3C Document Object Model (DOM) is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents.

The goal of DOM is to define a programmatic interface for markup languages such as XML and HTML. The DOM architecture is divided into modules. Each module addresses a particular domain. Domains covered by the current DOM API are XML, HTML, Cascading Style Sheets (CSS), and tree events. The Core DOM provides a low-level set of objects that can represent any structured document. While by itself this interface is capable of representing any HTML or XML document, the core interface is a compact and minimal design for manipulating a document’s contents.

Depending upon the DOM’s usage, the core DOM interface may not be convenient or appropriate for all users. The HTML and XML specifications provide additional, higher level interfaces that are used with the core specification to provide a more convenient view into a document. These specifications consist of objects and methods that provide easier and more direct access into the specific types of documents.

The Document Object Model provides a standard set of objects for representing HTML and XML documents, a standard model of how these objects can be combined, and a standard interface for accessing and manipulating them.
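As a small illustration of accessing and updating a document through the DOM, the sketch below uses Python's standard `xml.dom.minidom` module. The markup (`catalog`, `record`, `title`, `year`) and the values are made up for this example.

```python
from xml.dom.minidom import parseString

# A small document in a made-up markup; the tag names are illustrative only.
doc = parseString(
    "<catalog><record id='r1'><title>Future</title></record></catalog>"
)

# Access: walk the tree through the standard DOM interface.
title = doc.getElementsByTagName("title")[0]
print(title.firstChild.data)

# Update content: replace the data of the text node.
title.firstChild.data = "Future and Career"

# Update structure: create a new element and attach it to the record.
record = doc.getElementsByTagName("record")[0]
year = doc.createElement("year")
year.appendChild(doc.createTextNode("2004"))
record.appendChild(year)

print(doc.documentElement.toxml())
```

The same calls (`getElementsByTagName`, `createElement`, `appendChild`) are defined by the Core DOM, which is why equivalent code looks almost identical in other languages with a DOM binding.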

2.3.3 Regular Expressions

Regular expressions are a powerful pattern language for text processing, used widely in many applications. Their main use is to find, within a given string, the parts that match rules expressed in this language. Regular expressions can be regarded as a generalized form of substrings: alphanumeric characters retain their literal meaning, while certain other characters become special and allow more general “substrings” to be matched. For instance, foo is a simple regular expression that matches only the text foo. The power of regular expressions comes from metacharacters such as “.”, which matches any single character, so “foo.” matches “food”, “fool”, and “foot”. To use a metacharacter as an ordinary character, escape it with “\”; for example, “foo\.” matches only “foo.”.
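The behaviour of the “.” metacharacter and of escaping can be checked with Python's standard `re` module; the word list is chosen only for illustration.

```python
import re

words = ["food", "fool", "foot", "foo", "foo."]

# "." is a metacharacter: "foo." means "foo" followed by any one character.
any_char = re.compile(r"foo.")
print([w for w in words if any_char.fullmatch(w)])
# -> ['food', 'fool', 'foot', 'foo.']

# Escaped with "\", the dot matches only a literal ".".
literal_dot = re.compile(r"foo\.")
print([w for w in words if literal_dot.fullmatch(w)])
# -> ['foo.']
```

Note that the bare word foo is rejected by both patterns, because each requires a fourth character.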

2.3.4 Dynamic HTML

Dynamic HTML (DHTML) builds upon existing HTML standards to expand the possibilities of Web page design, presentation, and interaction. The basic notion behind DHTML is to allow any element of a page to be changeable at any time.

Without it, any modification requires a round trip to the server: the client sends a request, the server reconstructs the entire page with the modifications, and then sends everything back to the client. While workable, this process is quite slow, as it burdens both the network and the server. With long delays between a user’s action and the on-screen response, building effective Web-based applications is very difficult.

DHTML allows modifications to occur entirely on the client side, so page modifications can appear immediately after a trigger such as a user selection. Achieving this is more a matter of scripting than of HTML, the markup language. DHTML describes the abstract concept of breaking a Web page into manipulable elements and exposing those elements to a scripting language that can perform the manipulations.

DHTML itself is not a language. In practice, one programs DHTML by combining HTML, Cascading Style Sheets (CSS), and JavaScript. To let JavaScript work with HTML and CSS, the Document Object Model describes each page element and its modifiable characteristics in an object-oriented fashion.

2.3.5 Component Object Model

The Component Object Model (COM) is a Microsoft technology for building software components. COM is the underlying architecture that forms the foundation for higher-level software services, like those provided by OLE (Object Linking and Embedding). OLE services span various aspects of commonly needed system functionality, including compound documents, custom controls, interapplication scripting, data transfer, and other software interactions. These services provide distinctly different functionality to the user. However, they share a fundamental requirement for a mechanism that allows binary software components, derived from any combination of pre-existing customers’ components and components from different software vendors, to connect to and communicate with each other in a well-defined manner. This mechanism is supplied by COM, which has the following characteristics:

• Defines a binary standard for component interoperability

• Is independent of any programming language

• Is extensible by developers in a consistent manner

• Uses a single programming model for components to communicate within the same process

• Provides rich error and status reporting

• Allows dynamic loading and unloading of components

The Component Object Model defines several fundamental concepts. These include:

• A binary standard for invoking functions between components.

• A provision for strongly typed groupings of functions into interfaces.

• A base interface providing: (1) a way for components to dynamically discover the interfaces implemented by other components; (2) reference counting, which allows components to track their own lifetime and delete themselves when appropriate.

• A mechanism to identify components and their interfaces uniquely, worldwide.

• A “component loader” to set up and help manage component interactions.
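The two services of the base interface, interface discovery and reference counting, can be pictured with a toy model. The sketch below is plain Python and is not the COM API: real COM exposes these services through the IUnknown methods QueryInterface, AddRef, and Release on binary interfaces. The class `Component` and the interface names `ISearch` and `IPrint` are hypothetical.

```python
class Component:
    """Toy model of a COM-style base interface (not the real COM API)."""

    def __init__(self, interfaces):
        self._interfaces = interfaces  # assumed layout: {interface_id: implementation}
        self._refs = 1                 # the creator holds the first reference
        self.alive = True

    def query_interface(self, interface_id):
        """Let a client discover at run time whether an interface is supported."""
        impl = self._interfaces.get(interface_id)
        if impl is not None:
            self.add_ref()             # each interface handed out counts as a reference
        return impl

    def add_ref(self):
        self._refs += 1
        return self._refs

    def release(self):
        self._refs -= 1
        if self._refs == 0:
            self.alive = False         # the component "deletes itself"
        return self._refs


searcher = Component({"ISearch": lambda terms: "searching for " + terms})
search = searcher.query_interface("ISearch")
print(search("future"))
print(searcher.query_interface("IPrint"))  # unsupported interface: None
searcher.release()  # the client drops its interface reference
searcher.release()  # the creator drops its reference
print(searcher.alive)
```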

2.3.6 Script Technologies

Script languages are complete but simple programming languages that can be used to create applications within the context they are designed for. A script language is complete in that it provides a set of tools for accomplishing any reasonable task in the target context, such as the browser or the operating system. It is simple enough that the learning curve is not too steep, so script languages provide quick solutions.

Scripting languages are interpreted rather than compiled. A scripting environment provides a runtime engine (often called a parser) that processes instructions on the fly.

In contrast, other programming languages (e.g. C++) must be compiled into a set of machine instructions to become executable. A compiled language requires a far more complex development environment but executes faster.

On Windows, script engines work closely with COM, although a script programmer does not need to understand COM unless he/she wants to extend the script language itself with more objects. Microsoft calls these services scripting technologies. Each software component that complies with COM’s set of rules is a COM object. The functionality that each COM object makes available externally is organized into groups called interfaces. Automation objects are a special type of COM object that make their interfaces available to script engines through a basic interface called IDispatch. A scripting technology is thus a service that a component makes available to scripts through automation.

Scripting technologies provide access to a set of functions that are expressed in terms of objects, methods, and properties. The set of functions is often referred to as an object model. An object model is a hierarchy of logically related objects, each with a set of methods and properties. A script then can access those objects and invoke their methods and properties through the automation object’s interface.

2.3.7 Web Interface Definition Language

WIDL was introduced in section 2.1.2.1. Here we introduce the nuts and bolts of the language.

The WIDL definition is stored in an ASCII file, which is utilized by client programs at runtime to determine both the location of the service (URL) and the structure of documents that contain the desired data. Client programs access WIDL definitions from local files, naming services such as LDAP, HTTP servers or other URL access schemes, allowing centralized management of WIDL files. Unlike the way CORBA and DCE IDL are normally used, WIDL is interpreted at runtime. As a result, Service, Condition, and Variable definitions within WIDL files can be administered without requiring modification of client code. This usage model supports application-to-application linkages that are more robust and maintainable than if they were coded by hand.

There are three models for WIDL management:

• Client side: WIDL definitions are collocated with a client program

• Naming service: WIDL definitions are returned from directory services

• Server side: WIDL definitions are collocated with or embedded within Web documents

Except for being expressed in XML, WIDL specifications closely correlate to existing IDLs. One significant difference is the notion of a WIDL record. A WIDL service may specify input or output variables within a particular interface.

The Web Interface Definition Language (WIDL) consists of six XML tags:

• <WIDL> defines an interface, which can contain multiple services and bindings

• <SERVICE> defines a service, which consists of input and output bindings

• <BINDING> defines a binding, which specifies input and output variables, as well as conditions for successful completion of a service

• <VARIABLE> defines input, output, and internal variables used by a service to submit HTTP requests and to extract data from HTML/XML documents

• <CONDITION> defines success and failure conditions for the binding of output variables; specifies error

• <REGION> defines a region within an HTML/XML document; useful for extracting regular result sets which vary in size, such as the output of a search engine, or news

One of the most important features of WIDL is the capability to reliably extract specific data elements from Web documents and map them to output parameters. Two candidate technologies for data extraction are pattern matching with regular expressions and pattern matching with tag patterns. Regular expressions are well suited to raw text files and poorly structured HTML documents. Tag patterns instead rely on the tag structure of the document and require the document to be parsed. The parsed document structure exposes relationships between document objects, enabling elements of a document to be accessed through an object model, as described in section 2.3.2. Using an object model, an absolute reference to an element of an HTML document might be specified as:

doc.p[0].text

This reference would retrieve the text of the first paragraph of a given document.
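How such a reference could be resolved against a parsed document is sketched below. This is not WIDL's resolver: the `resolve` function is hypothetical, it handles only references of the form doc.<tag>[<index>].text, and it assumes the page is well-formed enough to be parsed by Python's `xml.dom.minidom`.

```python
import re
from xml.dom.minidom import parseString


def text_of(node):
    """Concatenate all text nodes below a node, ignoring inline markup."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def resolve(document, reference):
    """Resolve a reference of the assumed form doc.<tag>[<index>].text."""
    match = re.fullmatch(r"doc\.(\w+)\[(\d+)\]\.text", reference)
    if match is None:
        raise ValueError("unsupported reference: " + reference)
    tag, index = match.group(1), int(match.group(2))
    element = document.getElementsByTagName(tag)[index]
    return text_of(element)


page = parseString(
    "<html><body><p>First <b>paragraph</b>.</p><p>Second.</p></body></html>"
)
print(resolve(page, "doc.p[0].text"))
```

Because the text is collected from the parsed tree, the <b> tags inside the first paragraph do not disturb the result, whereas a regular expression written against the raw markup would have to anticipate them.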

From both a development and an administrative point of view, pattern matching with regular expressions is more labor intensive to establish and maintain. Regular expressions are difficult to construct and prone to breakage as document structures change. For instance, the addition of formatting tags around data elements in HTML documents could easily derail the search for a pattern. An object model, on the other hand, can see through many such changes.

The <VARIABLE> element is used to describe input and output binding parameters. Common attributes are:

• NAME: Required. Identifier for calling programs.

• TYPE: Required. Specifies both the data type and dimension (for arrays) of the variable.

• REFERENCE: Optional. Specifies an object reference to extract data from the HTML, XML, or text document returned as the result of a service invocation.

• MASK: Optional. Masks permit the use of pattern matching and token collection to easily strip away unwanted labels and other text surrounding target data items.

The <REGION> element is used in output bindings to define targeted subregions of a document. This is useful for services that return variable-length arrays of information in structures that can be located between well-known elements of a page.

Regions are critical for poorly designed documents where it is otherwise impossible to differentiate between desired data elements (for instance, story links on a news page) and elements that also match the search criteria. The attributes of <REGION> are:

• NAME: Required. Specifies the name for a region. This name can then be used as the root of an object reference. For instance, a region named foo can be used in object references such as:

foo.p[0].text

• START: Required. An object reference that determines the beginning of a region.

• END: Required. An object reference that determines the end of a region.

<WIDL>
  <SERVICE NAME="TechWeb" METHOD="GET"
           URL="http://www.techWeb.com/" OUTPUT="techWebOut">
    <BINDING NAME="techWebOut">
      <REGION NAME="tops" START="doc.font['Last?Updated*']"
              END="doc.b['For?more*']" />
    </BINDING>
  </SERVICE>
</WIDL>

Figure 2-8: Extraction of data elements with regions

Figure 2-8 demonstrates the use of regions in a news service, where the number of news stories varies from day to day. Regions permit the extraction of data elements relative to other features of a document. The tops region begins with a font object that matches the pattern ‘Last?Updated*’ and ends with a b object that matches ‘For?more*’.

Variable references into the tops region collect arrays of anchors and anchor text, regardless of the fact that the sizes of the arrays change throughout the day. The object references within tops are vastly simplified by the processing already provided by the region definition:

tops.a[].text

tops.a[].href
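The effect of the tops region and of the two references above can be imitated with Python's standard library. The sketch is not a WIDL engine: the class `RegionAnchors` and the sample page are hypothetical, and it assumes that `?` and `*` in the patterns behave like shell-style wildcards, which is how `fnmatch` treats them.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # tags with no end tag


class RegionAnchors(HTMLParser):
    """Collect anchors between a <font> start marker and a <b> end marker."""

    def __init__(self, start_pattern, end_pattern):
        super().__init__()
        self.start_pattern, self.end_pattern = start_pattern, end_pattern
        self.open_tags = []     # stack of currently open tag names
        self.in_region = False
        self.anchors = []       # list of [href, text] pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)
        if tag == "a" and self.in_region:
            self.anchors.append([dict(attrs).get("href"), ""])

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching open tag.
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        current = self.open_tags[-1] if self.open_tags else None
        text = data.strip()
        if current == "font" and fnmatchcase(text, self.start_pattern):
            self.in_region = True
        elif current == "b" and fnmatchcase(text, self.end_pattern):
            self.in_region = False
        elif current == "a" and self.in_region and self.anchors:
            self.anchors[-1][1] += data


page = """
<font>Last Updated 10:30</font>
<a href="/story1">Story one</a>
<a href="/story2">Story two</a>
<b>For more news</b>
<a href="/about">About</a>
"""
parser = RegionAnchors("Last?Updated*", "For?more*")
parser.feed(page)
texts = [text for _, text in parser.anchors]   # analogous to tops.a[].text
hrefs = [href for href, _ in parser.anchors]   # analogous to tops.a[].href
print(texts)
print(hrefs)
```

The anchor after the end marker is ignored, and the two lists grow or shrink with the number of stories in the region without any change to the patterns.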

Object References

The default object model used by WIDL provides object references for accessing elements and properties of HTML and XML documents. This model is based on the DOM object model in JavaScript, but without the JavaScript method definitions.
