Enhancing semantic digital library query using a content and service inference model (CSIM)

Su-Hsien Huang a,*, Hao-Ren Ke b,1, Wei-Pang Yang a,1

a Department of Computer and Information Science, National Chiao Tung University, 1001, Ta-Hsueh Road, Hsinchu, Taiwan, ROC

b Library of National Chiao Tung University, 1001, Ta-Hsueh Road, Hsinchu, Taiwan, ROC

Received 2 August 2002; accepted 13 April 2004

Available online 15 June 2004

Abstract

Although digital library (DL) information is increasingly annotated using metadata, semantic queries with respect to the structure of metadata have seldom been addressed. Correlating the two important aspects of a DL, content and services, can generate additional semantic relationships. This study proposes a content and service inference model (CSIM) to derive 15 relationships between content and services, and defines functions to manipulate these relationships. Adding the manipulation functions to query predicates facilitates the description of the structural semantics of DL content. Moreover, in searches for DL services, inferences concerning CSIM relationships can be made to reuse DL service components. Promising experimental results demonstrate that CSIM outperforms the conventional keyword-based method in both content and service queries. Applying CSIM in a DL significantly improves semantic queries and alleviates the administrative load when developing novel DL services such as DL query interfaces, library resource planning, and virtual union catalog systems.

Keywords: Semantic query; Content and service inference model; Digital library; Information retrieval; Metadata

1. Introduction

Rapid developments in information technology have accelerated worldwide access to information. A well-designed architecture is required to coordinate effectively and efficiently the dissemination of a large amount of information over the Internet (Grossman, Qin, & Xu, 1995; Monch & Drobnik, 1998; Nikolaou & Marazakis, 1998). Digital libraries (DL) have received considerable attention in recent years. A DL represents an Internet-based architecture that can access various kinds of information from anywhere. One major DL activity is to find information; however, many DL use keyword-based searches, which rely on keyword matching and constitute non-semantic means of retrieving information. Keyword-based searches do not consider the multiple senses of a query term; for instance, the sense of the query term "JAVA" is ambiguous because the query system cannot distinguish whether the user's interest is in coffee, a programming language or an island in Indonesia. Therefore, digital libraries that employ typical keyword-based searches have begun to be adapted to allow information to be retrieved more semantically.

* Corresponding author. Tel.: +886-3-5131554; fax: +886-3-5718925.

E-mail addresses: sshuang@cis.nctu.edu.tw (S.-H. Huang), claven@lib.nctu.edu.tw (H.-R. Ke), wpyang@cis.nctu.edu.tw (W.-P. Yang).

1 Tel.: +886-3-5131554; fax: +886-3-5718925.

Information Processing and Management 41 (2005) 891–908

A response to a semantic query attempts to determine the precise meaning of a query term according to context. Substantial work has been conducted on automatically determining the correct sense of a polysemous word, including by referencing machine-readable dictionaries such as WordNet (http://www.cogsci.princeton.edu/~wn/) (Allan & Raghavan, 2002; Miller et al., 2002). Another way of semantically retrieving information is to add related concepts (Lee, Kim, Kageura, & Choi, 2002) by referring to external auxiliaries, such as computing co-occurrences of terms (Chen, Chung, Marshall, & Yang, 1998), discovering implicit semantics using latent semantic analysis (Kolda & O'Leary, 1998), or analyzing corpora (Gauch & Wang, 1999). A semantic query can also be performed by collaborative information-filtering methods, including data mining or clustering the usage profiles of users with the same interest (Dai, 2001; Mostafa, Mukhopadhyay, Lan, & Palakal, 1997; Wu, 2001).

The senses of query terms obtained from external information (like metadata) support semantic queries. In digital libraries, abundant metadata extend semantic queries from a structural perspective. For instance, to retrieve information, a virtual union catalog system (VUCS) must determine which fields should be retrieved and integrated among heterogeneous data formats (such as Dublin Core, MARC, or other canonical formats). A human being is normally required to judge which schemas contain the same structure and semantics. Data in different formats may include synonyms or homonyms in their attributes, so relying on human intervention makes the process difficult and slow. Therefore, automatically exploiting comprehensive semantics through structural relationships is essential to identify the heterogeneity of schemas and expand the semantics of metadata. Although some popular markup languages like XML have a namespace facility to elucidate structural information, a mechanism for identifying the semantics of the structure among different attributes is unavailable. The MARIAN DL (Goncalves, France, & Fox, 2001) searches in an object-oriented fashion. Data are modeled as classes, and relationships as weighted links that represent structural relationships. In the MARIAN approach, the structural relationships among the content are derived from the class hierarchy. Heterogeneity is evaluated by computing the weights of corresponding links.

Sharing metadata architecture in distributed digital libraries drives the use of metadata to expand semantic relationships (Blanchi & Petrone, 2001). Open digital library (Fox, Suleman, & Luo, 2002), INRIA (Abiteboul, Benjelloun, & Milo, 2002) and 5SL (Goncalves & Fox, 2002) apply metadata to describe, derive and federate services in distributed DL. The first step in conducting a semantic query in such distributed DL is to determine which services must execute the query. An individual service may not alone suffice to respond completely to the entire query, so the required information must be collected from various services, or a single task must be executed separately by independent services. A resource-planning facility is required to generate a reasonable plan of queries in series to solve this problem. Considering the metadata of services, the induction process is related to the compatibility between the interfaces and the capabilities. Accordingly, comprehensive relationships among services must be derived to execute semantic queries.

Considerable attention has been paid to DL architectures that enable DL queries across distributed DL services using metadata (Paepcke et al., 1996) and that manage the interaction between the two most important elements of digital libraries, content and services (Lynch & Garcia-Molina, 1995; Monch & Drobnik, 1998). Content represents all the materials stored in a digital library, including texts, images and videos. A service is an application that interacts with users via an interface and can convert content into specific formats. Conventional keyword-based semantic queries cannot clarify two critical elements of DL: content schemas and service capabilities. Content schemas determine whether two pieces of content have the same format. Service capabilities are distinguished by comparing the functions of services. Metadata support semantic queries on content schemas and service capabilities in many ways. First, metadata that describe content schemas and service capabilities have comprehensive semantics regarding content and services. These semantics benefit the provision of accurate information in response to DL queries. Second, relationships between DL content and services can be derived from semantic information embedded in metadata; manipulating these relationships yields further semantics associated with the metadata. Third, metadata can be easily stored and indexed to support retrieval, since they have a formal structure. Metadata that describe semantic information about DL content and services motivate the derivation of semantic relationships among metadata. Given content and services, several questions may be raised. Does one type of content have the same format as another? Can two services perform the same task? Is one type of content produced (or manipulated) by a service? A model that formalizes the relationships between content and services is required to answer these questions.

This work seeks to formulate the structural relationships among metadata concerning DL content and services, to assist the extension of semantic DL queries. It proposes a content and service inference model (CSIM) to elucidate the interaction of metadata in terms of structural relationships. CSIM defines 15 relationships between DL content and services. Using CSIM, the result of a DL query is extended by embedding relationships into the query predicates. For example, for a user who wants to retrieve content with the format "Dublin Core", CSIM returns content with the format "Dublin Core" and also derives content through other relationships, such as content that can be translated into "Dublin Core" (the "Translatable" relationship) and content that carries the same semantics in a different format (the "Synonymous" relationship). Section 4.1 presents an example of a DL developer who desires to develop a virtual union catalog system and requests the functionality of the intended service using CSIM. The response recommends a list that connects the existing services required to perform the VUCS service: an extracting service that extracts data from structured documents, a translation service that translates a native data format into the canonical one, and an integration service that combines distributed extracting services. The recommendation can be implemented using DL componentization technology (Suleman & Fox, 2002). Restated, DL designers can develop a new service that involves existing service components and translation rules, to yield the derived service list. In this manner, CSIM reuses DL components and reduces the administrative load of maintaining a DL. Applying CSIM to a digital library makes DL queries semantic and effective. Moreover, CSIM can be applied to DL query interfaces, virtual union catalog systems and library resource-planning systems. A series of experiments was conducted to demonstrate that, because of the extended CSIM relationships, CSIM outperformed the conventional keyword-based method in both content and service queries.

The rest of this paper is organized as follows. Section 2 formally deﬁnes content and services in digital libraries, and analyzes the relationships between them. The algorithms that manipulate these relationships are also deﬁned. Section 3 elaborates semantic queries that leverage CSIM to retrieve DL content and services. Section 4 presents the experiments that elucidate the feasibility of CSIM, and demonstrates a prototype system. Section 5 draws conclusions and highlights areas for future work on CSIM.

2. Content and service inference model (CSIM)

Content and services are two integral aspects of digital libraries. Content is the materials stored in digital libraries, which can be produced and processed by services. The types of content include Web pages, library holding records, and multimedia data (like texts, images, and videos). A service is a software application that transforms one type of content item into another. A service possesses specific functions and interacts with users via input and output interfaces. In this paper, both services and content are represented via metadata to facilitate the interaction between them. This paper proposes a novel framework called the content and service inference model (CSIM). CSIM infers relationships between content and services from their metadata. CSIM can raise DL queries to a semantic level by using content semantics, service capabilities, and the relationships between content and services.

Typically, the metadata of content includes an identifier, a data schema, a presentation format, and a set of semantic description items. The metadata of a service includes an identifier and a statement of its capabilities. Section 2.1 formally defines the metadata of content and services. A total of 15 relationships between content and services are defined, as illustrated in Fig. 1. The relationships follow from the structure of the metadata defined in Section 2.1, together with the semantic auxiliaries (the five semantic tables defined in Section 2.1). These relationships are directional and are categorized into four types:

Service to Service. Two services can relate to each other according to their capabilities and input/output interfaces. Eight relationships of this type are defined: Identical, Inclusive, Homonymous, Synonymous, Replaceable, Translatable, Combinable, and Combine. For example, two services are "Synonymous" if they possess the same capabilities but have different input/output interfaces; one service is "Inclusive" of another if it contains more capabilities than the latter.

Content to Content. Two pieces of content relate to each other based on their semantics and schema elements. Five relationships of this type are defined: Identical, Homonymous, Synonymous, InheritFrom, and Translatable. For example, two pieces of content are "Identical" if they have identical semantics and format.

Service to Content. A service can produce content. One relationship of this type is defined: Produce. For example, a WebPAC system may produce a data set in the Dublin Core format.

Content to Service. Content can be manipulated by a service. One relationship of this type is defined: ManipulatedBy. For example, various data sets can be manipulated by a virtual union catalog system to produce an integrated data set.

The 15 relationships in CSIM were proposed after thoroughly examining all possible relationships between content and services under the CSIM data model and hypothesis. Section 2.1 formally defines content, services and their constituents, and Section 2.2 formally defines these relationships.

2.1. Basic deﬁnition

In the following, the metadata of content and services² is formally defined.

Deﬁnition 1 (Content). A piece of Content, C, is a quadruple {Id, Schema, Presentations, Semantics} (McCray, Gallagher, & Flannick, 1999) where,

Fig. 1. Relationships between content and services. [Figure labels: Service to Service: Identical, Inclusive, Homonymous, Synonymous, Replaceable, Translatable, Combinable, Combine. Content to Content: Identical, Homonymous, Synonymous, InheritFrom, Translatable. Content to Service: ManipulatedBy. Service to Content: Produce.]

2 Hereafter, "content" and "service" are used as shorthand for "the metadata of content" and "the metadata of service", respectively.

1. Id is the identifier of C.

2. Schema is the schema of C. Schema is a quadruple {Ent, Incs, Attributes, Associations}, where

2.1. $Ent \subseteq Names$ is the name of a content schema.

2.2. $Incs \subseteq (Names, Names)$. Each pair $(e_1, e_2) \in Incs$ indicates that $e_1$ is a subtype of $e_2$. Incs is stored in CIT (defined below) and assumed to be acyclic.

2.3. $Attributes \subseteq Names$ is the set of attribute names.

2.4. $Associations \subseteq$ (Association_name, Entity_name1, Entity_name2, Cardinality1, Cardinality2) is the association set, which indicates the cardinality between two entities.

3. $Presentations \subseteq Names$ is the set of the presentation interfaces of content C.

4. $Semantics \subseteq CST$ is the set of semantics of content C. CST is defined below.

Definition 2 (Service). A Service, S, is a quadruple {Id, Capabilities, Outputs, Inputs}, where

1. Id is the identifier of S.

2. $Capabilities \subseteq SCT$ is the set of service capabilities. SCT is defined below.

3. $Outputs \subseteq Names$ are the output schemas of service S.

4. $Inputs \subseteq Names$ are the input schemas accepted by S.
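Definitions 1 and 2 can be pictured as plain record types. The following is a minimal Python sketch, not the authors' implementation: the class and field names mirror the quadruples above, while the example values (`dc`, `webpac`, the attribute names) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Schema:
    ent: str                                          # Ent: name of the content schema
    incs: FrozenSet[Tuple[str, str]] = frozenset()    # (subtype, supertype) pairs
    attributes: FrozenSet[str] = frozenset()          # attribute names
    # (association_name, entity1, entity2, cardinality1, cardinality2)
    associations: FrozenSet[Tuple[str, str, str, str, str]] = frozenset()


@dataclass(frozen=True)
class Content:
    id: str
    schema: Schema
    presentations: FrozenSet[str] = frozenset()       # presentation interfaces
    semantics: FrozenSet[str] = frozenset()           # terms drawn from CST


@dataclass(frozen=True)
class Service:
    id: str
    capabilities: FrozenSet[str] = frozenset()        # terms drawn from SCT
    outputs: FrozenSet[str] = frozenset()             # output schema names
    inputs: FrozenSet[str] = frozenset()              # input schema names


# Hypothetical instances, for illustration only.
dc = Content(
    "c1",
    Schema("DublinCore", attributes=frozenset({"title", "creator"})),
    presentations=frozenset({"HTML"}),
    semantics=frozenset({"Dublin Core"}),
)
webpac = Service(
    "s1",
    capabilities=frozenset({"Catalog_System"}),
    outputs=frozenset({"DublinCore"}),
    inputs=frozenset({"MARC"}),
)
```

Frozen dataclasses with frozensets are used here so that metadata records are hashable and can be placed in sets, which suits the set-based relationship definitions that follow.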

Additional data structures are required for the formal deﬁnition of CSIM.

1. Content Semantics Table (CST). CST contains ontological terms to identify the semantics of content. Dublin Core and MARC are two examples of ontological terms in CST. The ontology in CST is hierarchical; an ascendant term covers the semantics of a descendant term.

2. Content Inheritance Table (CIT). CIT maintains the schema hierarchy of content (the Incs attribute of Schema in Definition 1). For example, the fact that schema A is a subtype of schema B can be represented as $(A, B)$ in CIT.

3. Service Capability Table (SCT). SCT contains ontological terms to define possible service functionalities. The ontology in SCT is hierarchical; an ascendant term has more general capability than a descendant term. For example, a service that uses CORBA to implement a virtual union catalog system will have two capabilities, CORBA_Distributed_System and Virtual_Union_Catalog_System, where CORBA_Distributed_System and Virtual_Union_Catalog_System are the descendant terms of Distributed_System and Catalog_System respectively.

4. Translation Rule Table (TRT). TRT stores the rules for translating between two pieces of content. A rule R for translating between two pieces of content is expressed as a quadruple {Id, FromSchema, ToSchema, Rules}, where

4.1. Id is the identifier of R.

4.2. $FromSchema \subseteq Names$ is the name of the source schema.

4.3. $ToSchema \subseteq Names$ is the name of the target schema.

4.4. $Rules \subseteq$ (FromAttributeName, ToAttributeName, TranslationRule) are the rules for translating between specific attributes.

5. Content and Service Repository (CSR). CSR is a repository that stores the metadata of content and services, including the access methods for the CSIM metadata framework to retrieve content and services.
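As a rough illustration of these auxiliary structures, the sketch below models SCT as a child-to-parent map, CIT as a set of (subtype, supertype) pairs, and TRT as a list of rule records. The representation is an assumption made for this example; the SCT term names come from the text, whereas `QualifiedDC`, the rule `t1` and the helper `covers` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

# SCT as child -> parent (None marks a root term); terms taken from the text.
SCT: Dict[str, Optional[str]] = {
    "Distributed_System": None,
    "Catalog_System": None,
    "CORBA_Distributed_System": "Distributed_System",
    "Virtual_Union_Catalog_System": "Catalog_System",
}

# CIT as (subtype, supertype) schema pairs; the pair below is hypothetical.
CIT: Set[Tuple[str, str]] = {("QualifiedDC", "DublinCore")}


@dataclass(frozen=True)
class TranslationRule:
    id: str
    from_schema: str
    to_schema: str
    # (from_attribute, to_attribute, translation_rule)
    rules: Tuple[Tuple[str, str, str], ...]


# TRT with one illustrative rule (hypothetical content).
TRT: List[TranslationRule] = [
    TranslationRule("t1", "MARC", "DublinCore", (("245a", "title", "copy"),)),
]


def covers(ancestor: str, term: str, hierarchy: Dict[str, Optional[str]]) -> bool:
    """Return True if `ancestor` equals `term` or lies above it in the hierarchy."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = hierarchy.get(current)
    return False
```

With this layout, checking whether an ascendant term covers a descendant term is a walk up the parent chain, which reflects the hierarchical ontology described for CST and SCT.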

2.2. Relationships between content and services

As mentioned above, 15 directional relationships of four types exist between content and services: content to content, service to service, content to service, and service to content. This section formally defines each relationship.

Definition 3 (Content to Content). Given two pieces of content $C_1 = \{Id_1, Schema_1, Presentations_1, Semantics_1\}$ and $C_2 = \{Id_2, Schema_2, Presentations_2, Semantics_2\}$, the possible relationships between $C_1$ and $C_2$ are as follows.

1. $Identical(C_1, C_2)$ iff $Schema_1 = Schema_2$ and $Semantics_1 = Semantics_2$.

2. $Homonymous(C_1, C_2)$ iff $Schema_1 = Schema_2$ but $Semantics_1 \neq Semantics_2$.

3. $Synonymous(C_1, C_2)$ iff $Schema_1 \neq Schema_2$ but $Semantics_1 = Semantics_2$.

4. $InheritFrom(C_1, C_2)$ iff $(Schema_1, Schema_2) \in Incs$. $C_1$ and $C_2$ have the InheritFrom relationship if and only if the schema pair $(Schema_1, Schema_2)$ exists in CIT. Namely, the schema of $C_1$ inherits from that of $C_2$. Given a sequence of content $C_1$ to $C_n$, such that any two consecutive pieces of content have the InheritFrom relationship, these pieces of content exhibit the transitive property.

5. $Translatable(C_1, C_2)$. $C_1$ and $C_2$ have the Translatable relationship if $C_1$ can be translated into $C_2$ by means of specific translation rules. In other words, $C_1$ and $C_2$ are translatable if and only if there exists a translation rule $T \in TRT$ such that $T.FromSchema = Schema_1$ and $T.ToSchema = Schema_2$. Moreover, given a sequence of content $C_1$ to $C_n$, such that any two consecutive pieces of content have the Translatable relationship, these pieces of content exhibit the transitive property.
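The five content-to-content relationships can be expressed as small predicates. The sketch below is illustrative only: it reduces a piece of content to a schema name and a set of semantic terms, treats CIT and TRT as sets of directed (from, to) schema pairs, and realizes the transitive property by a reachability search. The helper `reachable` and all example schema names are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

Edges = Set[Tuple[str, str]]


@dataclass(frozen=True)
class Content:
    id: str
    schema: str                  # schema name only, for brevity
    semantics: FrozenSet[str]


def reachable(src: str, dst: str, edges: Edges) -> bool:
    """True if dst can be reached from src over one or more directed edges."""
    seen: Set[str] = set()
    stack = [src]
    while stack:
        node = stack.pop()
        for a, b in edges:
            if a == node and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def identical(c1: Content, c2: Content) -> bool:
    return c1.schema == c2.schema and c1.semantics == c2.semantics


def homonymous(c1: Content, c2: Content) -> bool:
    return c1.schema == c2.schema and c1.semantics != c2.semantics


def synonymous(c1: Content, c2: Content) -> bool:
    return c1.schema != c2.schema and c1.semantics == c2.semantics


def inherit_from(c1: Content, c2: Content, cit: Edges) -> bool:
    # Direct or transitive: (Schema1, Schema2) is in CIT or follows by chaining.
    return reachable(c1.schema, c2.schema, cit)


def translatable(c1: Content, c2: Content, trt: Edges) -> bool:
    # Direct or transitive: a chain of translation rules leads from Schema1 to Schema2.
    return reachable(c1.schema, c2.schema, trt)


# Hypothetical data for illustration.
cit = {("QualifiedDC", "DublinCore")}
trt = {("MARC", "MARCXML"), ("MARCXML", "DublinCore")}
dc = Content("c1", "DublinCore", frozenset({"bibliographic"}))
marc = Content("c2", "MARC", frozenset({"bibliographic"}))
qdc = Content("c3", "QualifiedDC", frozenset({"bibliographic"}))
```

In this example, `marc` and `dc` are synonymous, `marc` is translatable into `dc` through the two-step chain in `trt`, and `qdc` inherits from `dc`.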

Definition 4 (Service to Service). Given two services $S_1 = \{Id_1, Capabilities_1, Outputs_1, Inputs_1\}$ and $S_2 = \{Id_2, Capabilities_2, Outputs_2, Inputs_2\}$, the possible relationships between $S_1$ and $S_2$ include the following.

1. $Identical(S_1, S_2)$ iff $Capabilities_1 = Capabilities_2$.

2. $Inclusive(S_1, S_2)$ iff $Capabilities_1 \supseteq Capabilities_2$.

3. $Homonymous(S_1, S_2)$ iff $Outputs_1 = Outputs_2$, $Inputs_1 = Inputs_2$, but $Capabilities_1 \nsupseteq Capabilities_2$.

4. $Synonymous(S_1, S_2)$ iff $Outputs_1 \neq Outputs_2$ or $Inputs_1 \neq Inputs_2$, but $Identical(S_1, S_2)$.

5. $Replaceable(S_1, S_2)$ iff $Inclusive(S_1, S_2)$, $Outputs_1 = Outputs_2$ and $Inputs_1 = Inputs_2$.

6. $Translatable(S_1, S_2)$ iff $Inclusive(S_1, S_2)$ and $\exists\, T_1, T_2, T_3, T_4$ in TRT such that $Translate(Outputs_1, \{T_1\}) = Translate(Outputs_2, \{T_2\})$ and $Translate(Inputs_1, \{T_3\}) = Translate(Inputs_2, \{T_4\})$.

7. $Combinable(S_1, S_2)$ iff $Inputs_1 = Outputs_2$.

8. $Combine(S_c, \{S_i\})$ combines a set of services $\{S_i\}$ $(1 \le i \le n)$, where $Combinable(S_i, S_{i-1})$ holds for every $i$ with $2 \le i \le n$ (that is, $Outputs_1 = Inputs_2$, $Outputs_2 = Inputs_3$, $\ldots$, $Outputs_{n-1} = Inputs_n$), into a new service $S_c = \{Id_c, Capabilities_c, Outputs_c, Inputs_c\}$, where

8.1. $Id_c$ is the identifier of $S_c$.

8.2. $Capabilities_c = Capabilities_1 \cup Capabilities_2 \cup \cdots \cup Capabilities_n$.

8.3. $Outputs_c = Outputs_n$.

8.4. $Inputs_c = Inputs_1$.

Given a sequence of services $S_1$ to $S_n$, a new service $S_c$ can be generated when any two successive services $S_i$ and $S_{i-1}$ have the Combinable relationship. The new service has a new Id, and its capabilities include all the capabilities of $S_1$ to $S_n$. The new service has the same input interface as the first service $(S_1)$, and the same output interface as the final service $(S_n)$.
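A few of the service-to-service relationships, including Combine, can be sketched as follows. This is an illustrative sketch under stated assumptions: capabilities, inputs and outputs are plain sets of names; Inclusive is read as "the first service's capabilities contain the second's", following the prose description of "Inclusive" in Section 2; Homonymous and Translatable are omitted. The service names and capability terms in the example are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Service:
    id: str
    capabilities: FrozenSet[str]
    outputs: FrozenSet[str]
    inputs: FrozenSet[str]


def identical(s1: Service, s2: Service) -> bool:
    return s1.capabilities == s2.capabilities


def inclusive(s1: Service, s2: Service) -> bool:
    # Assumption: s1 is inclusive of s2 when it has at least s2's capabilities.
    return s1.capabilities >= s2.capabilities


def synonymous(s1: Service, s2: Service) -> bool:
    return identical(s1, s2) and (s1.outputs != s2.outputs or s1.inputs != s2.inputs)


def replaceable(s1: Service, s2: Service) -> bool:
    return inclusive(s1, s2) and s1.outputs == s2.outputs and s1.inputs == s2.inputs


def combinable(s1: Service, s2: Service) -> bool:
    # Combinable(S1, S2) iff Inputs1 = Outputs2.
    return s1.inputs == s2.outputs


def combine(new_id: str, chain: List[Service]) -> Service:
    """Combine S1..Sn into one service; each S_i must accept the output of S_{i-1}."""
    if not chain:
        raise ValueError("chain must not be empty")
    for prev, nxt in zip(chain, chain[1:]):
        if not combinable(nxt, prev):
            raise ValueError(f"output of {prev.id} does not match input of {nxt.id}")
    caps = frozenset().union(*(s.capabilities for s in chain))
    return Service(new_id, caps, chain[-1].outputs, chain[0].inputs)


# Hypothetical services for illustration.
extract = Service("extract", frozenset({"Extraction"}),
                  outputs=frozenset({"MARC"}), inputs=frozenset({"HTML"}))
translate = Service("translate", frozenset({"Translation"}),
                    outputs=frozenset({"DublinCore"}), inputs=frozenset({"MARC"}))
vucs = combine("vucs", [extract, translate])
```

The combined service `vucs` takes the input interface of the first service and the output interface of the last, and its capabilities are the union of the capabilities in the chain.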

Definition 5 (Service to Content). Given a service $S = \{Id_s, Capabilities_s, Outputs_s, Inputs_s\}$ and content $C = \{Id_c, Schema_c, Presentations_c, Semantics_c\}$, S and C have the $Produce(C, S)$ relationship iff $Schema_c = Outputs_s$ or $Translatable(Outputs_s, Schema_c)$. In other words, this definition determines whether S can produce C.

Definition 6 (Content to Service). Given content $C = \{Id_c, Schema_c, Presentations_c, Semantics_c\}$ and a service $S = \{Id_s, Capabilities_s, Outputs_s, Inputs_s\}$, C and S have the $ManipulatedBy(C, S)$ relationship iff $Schema_c = Inputs_s$ or $Translatable(Schema_c, Inputs_s)$. In other words, this relationship determines whether C can be manipulated by S.
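Definitions 5 and 6 can be sketched as two predicates over schema names. In this illustrative sketch, the equality between a schema and a service's outputs or inputs is interpreted as membership in a set of schema names, and translation is taken as direct or transitive over (from, to) pairs; both readings are assumptions of the sketch, as are the function names and example schemas.

```python
from typing import Iterable, Set, Tuple

Edges = Set[Tuple[str, str]]


def translatable(src: str, dst: str, trt: Edges) -> bool:
    """Direct or transitive translation from schema src to schema dst."""
    seen: Set[str] = set()
    stack = [src]
    while stack:
        node = stack.pop()
        for a, b in trt:
            if a == node and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def produce(content_schema: str, service_outputs: Iterable[str], trt: Edges) -> bool:
    """The service can produce the content: an output schema equals the content
    schema or can be translated into it."""
    return any(out == content_schema or translatable(out, content_schema, trt)
               for out in service_outputs)


def manipulated_by(content_schema: str, service_inputs: Iterable[str], trt: Edges) -> bool:
    """The content can be manipulated by the service: the content schema equals
    an input schema or can be translated into one."""
    return any(inp == content_schema or translatable(content_schema, inp, trt)
               for inp in service_inputs)


# Hypothetical translation rule: MARC records can be translated into Dublin Core.
trt = {("MARC", "DublinCore")}
```

Note the asymmetry: Produce translates from the service's output toward the content schema, whereas ManipulatedBy translates from the content schema toward the service's input.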

2.3. Manipulating operations

CSIM applies manipulating operations to the 15 relationships of the four types above. Manipulating operations are of two types: p operations and P operations. p operations assess the relationship between the given content and service. If a specific relationship exists, the operation returns TRUE; otherwise it returns FALSE. P operations return the corresponding content or services satisfying the specified relationship.

p operations check whether the given content and service have the relationship specified in the operation. A total of five p operations are defined:

• $p_{Schemas}$: Given two pieces of content, A and B, $A \; p_{Schemas} \; B$ refers to CIT and returns TRUE if A and B have the InheritFrom(A, B) relationship.

• $p^{r}_{Schemas}$: Given two pieces of content, A and B, $A \; p^{r}_{Schemas} \; B$ refers to TRT and returns TRUE if A and B have the Translatable(A, B) relationship.

• $p_{Semantics}$: Given two pieces of content, A and B, $A \; p_{Semantics} \; B$ refers to CST and returns TRUE if A and B have the Identical(A, B) relationship.

• $p_{Capabilities}$: Given two services A and B, $A \; p_{Capabilities} \; B$ refers to SCT and returns TRUE if A and B have the Inclusive(A, B) relationship.

• $p^{r}_{Capabilities}$: Given a service A and a set of services Bs, $A \; p^{r}_{Capabilities} \; Bs$ refers to SCT and returns TRUE if the Inclusive(A, Bs) relationship holds, or if one of the Combinable(A, Bs), Replaceable(A, Bs) or Translatable(A, Bs) relationships holds. In other words, $A \; p^{r}_{Capabilities} \; Bs$ determines whether service A can be replaced by a series of services Bs according to one of the following four conditions:

(a) Bs contain all the capabilities of A;

(b) A can be combined from a set of services, Bs;

(c) A can be replaced by a set of services, Bs;

(d) Bs contain all the capabilities of A, and Bs can also be translated into the same input and output schemas as A.

The algorithms corresponding to the five p operations are available at http://www.data-base.cis.nctu.edu.tw/.

P operations return the content or services conforming to the specified relationship. Four categories of P operations exist: $P^{c}$, $P^{s}$, $P^{sc}$ and $P^{cs}$, with respect to the four types of relationships defined in Section 2.2.

$P^{c}$ operations return the content that conforms to the specified relationship.

• $P^{c}_{Translatable}$: Given content A, $P^{c}_{Translatable}$ determines the content that can be translated into A by direct or transitive translations.

• $P^{c}_{InheritFrom}$: Given content A, $P^{c}_{InheritFrom}$ determines the content that is inherited from A by direct or transitive inheritance.

• $P^{c}_{Identical}$: Given content A, $P^{c}_{Identical}$ determines the content that satisfies the Identical relationship with A.

• $P^{c}_{Homonymous}$: Given content A, $P^{c}_{Homonymous}$ determines the content that satisfies the Homonymous relationship with A.

• $P^{c}_{Synonymous}$: Given content A, $P^{c}_{Synonymous}$ determines the content that satisfies the Synonymous relationship with A.

$P^{s}$ operations return the services that exhibit the specified relationship.

• $P^{s}_{Identical}$: Given a service A, $P^{s}_{Identical}$ determines the services that exhibit the Identical relationship with A.

• $P^{s}_{Inclusive}$: Given a service A, $P^{s}_{Inclusive}$ determines the services that exhibit the Inclusive relationship with A.

• $P^{s}_{Homonymous}$: Given a service A, $P^{s}_{Homonymous}$ determines the services that exhibit the Homonymous relationship with A.

• $P^{s}_{Synonymous}$: Given a service A, $P^{s}_{Synonymous}$ determines the services that exhibit the Synonymous relationship with A.

• $P^{s}_{Replaceable}$: Given a service A, $P^{s}_{Replaceable}$ determines the services that exhibit the Replaceable relationship with A.

• $P^{s}_{Translatable}$: Given a service A, $P^{s}_{Translatable}$ determines the services that exhibit the Translatable relationship with A.

• $P^{s}_{Combinable}$: Given a service A, $P^{s}_{Combinable}$ determines the services that exhibit the Combinable relationship with A.

One $P^{sc}$ operation, $P^{sc}_{Produce}$, is defined. Given a service S, $P^{sc}_{Produce}$ returns the content that satisfies the Produce relationship with S.

One $P^{cs}$ operation, $P^{cs}_{ManipulatedBy}$, is defined. Given content C, $P^{cs}_{ManipulatedBy}$ returns all the services that satisfy the ManipulatedBy relationship with C.

The algorithms corresponding to the P operations are available at http://www.data-base.cis.nctu.edu.tw/.
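Whereas a p operation answers TRUE or FALSE for a given pair, a P operation returns a set. The sketch below illustrates this for the two transitive content operations; it is not the authors' algorithm (which is only referenced by URL above). The repository is assumed to be a mapping from content identifier to schema name, CIT and TRT are assumed to be sets of directed (from, to) schema pairs, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Edges = Set[Tuple[str, str]]


def sources_reaching(target: str, edges: Edges) -> Set[str]:
    """All schema names with a directed path of length >= 1 to `target`."""
    found: Set[str] = set()
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if b == node and a not in found:
                found.add(a)
                frontier.append(a)
    return found


def P_c_translatable(a_schema: str, repo: Dict[str, str], trt: Edges) -> Set[str]:
    """Ids of content that can be translated into `a_schema`, directly or transitively."""
    sources = sources_reaching(a_schema, trt)
    return {cid for cid, schema in repo.items() if schema in sources}


def P_c_inherit_from(a_schema: str, repo: Dict[str, str], cit: Edges) -> Set[str]:
    """Ids of content whose schema inherits from `a_schema`, directly or transitively."""
    subtypes = sources_reaching(a_schema, cit)
    return {cid for cid, schema in repo.items() if schema in subtypes}


# Hypothetical repository and tables for illustration.
repo = {"c1": "MARC", "c2": "MARCXML", "c3": "DublinCore", "c4": "QualifiedDC"}
trt = {("MARC", "MARCXML"), ("MARCXML", "DublinCore")}
cit = {("QualifiedDC", "DublinCore")}
```

Walking the edges backwards from the target schema computes the whole answer set in one traversal, rather than testing every repository entry separately with the corresponding p operation.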

Fig. 2 depicts the architecture used to apply CSIM in DL queries. A query interface receives semantic queries and dispatches the queries to the CSIM Engine. The CSIM Engine parses the query predicates and applies the algorithms described above to solve semantic DL queries by referring to the Registry Authority. The Registry Authority contains access methods for all content and service metadata to expedite access to the metadata. Furthermore, the Registry Authority comprises a set of ontological tables. These ontological tables present a common vocabulary used for the ontological hierarchies that define content semantics and service capabilities. Index structures for CST, CIT, SCT and TRT are stored in the Content and Service Repository (CSR), to accelerate metadata retrieval. The ontology proposed herein consists of a vocabulary with a hierarchical structure, and an ascendant ontological concept implies all descendant concepts.

[Fig. 2. Component labels: Query Interface; CSIM Engine; Registry Authority; CST, CIT, SCT, TRT; Content & Service Repository; Content; Service.]

3. Semantic digital library query

Applying CSIM to digital libraries supports powerful semantic queries in content and service retrieval. In CSIM, content and service semantics can be elaborated more finely than in conventional keyword-based approaches by adding the relationships defined in the previous section. Using this abundant semantic information, CSIM accurately retrieves results and derives alternative answers that conventional approaches cannot. For example, in response to a query for content with particular semantics, CSIM can retrieve content not only in the same schema hierarchy and with identical semantics, but also in a different format, such as synonymous content. Content with various schemas that are translatable into a single schema can also be retrieved. Furthermore, in a semantic service query, CSIM can infer a list of recommendations suggesting that a user concatenate available services to create the desired service, using the combinable and translatable relationships.

3.1. Query language

The query language for CSIM is SQL-like. It consists of three main clauses.

1. Select Clause: The select clause contains the attributes of the content or service to be retrieved. The mode of attributes in a query can be set to EXACT or AMBIGUOUS. An EXACT query returns the attributes that exactly satisfy the given predicate without being translated or combined. For example, to retrieve exactly the content with the data format "Dublin Core", the query statement can be set to: Select EXACT C.Id From Content C where C.Schema = "Dublin Core". An AMBIGUOUS query recommends answers that have the same semantics as the specified attributes. For the preceding example, the query can be set to: Select AMBIGUOUS C.Id From Content C where C.Schema = "Dublin Core". The query returns three types of answer: (1) content with the data format "Dublin Core"; (2) content that is not in the "Dublin Core" format but can be translated into it; (3) content that is not in the "Dublin Core" format but is in the same hierarchy as "Dublin Core" in CIT. When the attribute mode is absent, the query defaults to EXACT.

2. From Clause: This clause specifies the content or services from which a user seeks information. For example, a user may want to retrieve information from one piece of content C and two services S1 and S2.

3. Where Clause: This clause states conditional expressions that consist of the content or services given in the From Clause. In this clause, a set of Boolean operators (NOT, AND, OR) and the set of relationships defined in Section 2 are applied. For example, if a user wants to retrieve a service S1 whose input is produced by another service S2 and whose output data schema is Dublin Core, then the Where Clause is: S1.Input = ManipulatedBy(S2) AND S1.Output = "Dublin Core". The syntax of a conditional expression in this clause resembles that of conventional SQL.
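The three-clause structure, including the default EXACT mode, can be illustrated with a very small parser. This is only a sketch: the paper does not specify a grammar, so the regular expression, the `Query` record and the `parse` function are assumptions made here, and the Where predicate is kept as an unparsed string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    mode: str         # "EXACT" or "AMBIGUOUS"
    attributes: str   # text of the Select clause
    sources: str      # text of the From clause
    predicate: str    # text of the Where clause, left unparsed


_PATTERN = re.compile(
    r"^\s*select\s+(?:(exact|ambiguous)\s+)?(.+?)\s+from\s+(.+?)\s+where\s+(.+?)\s*$",
    re.IGNORECASE | re.DOTALL,
)


def parse(statement: str) -> Query:
    """Split a CSIM-style statement into its Select, From and Where parts."""
    match = _PATTERN.match(statement)
    if match is None:
        raise ValueError("expected: Select [EXACT|AMBIGUOUS] ... From ... where ...")
    mode = (match.group(1) or "EXACT").upper()   # absent mode defaults to EXACT
    return Query(mode, match.group(2), match.group(3), match.group(4))


q = parse('Select AMBIGUOUS C.Id From Content C where C.Schema = "Dublin Core"')
```

Here `q.mode` is "AMBIGUOUS"; the same statement without a mode keyword parses with `mode` set to "EXACT", matching the default described above.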

3.2. Semantic query

Semantic queries are based on the relationships between content and services in CSIM. Semantic queries are of two types: EXACT queries and AMBIGUOUS queries (Fig. 3). An EXACT query requests the services or content that precisely satisfy the given predicates, without inferring other relationships. An AMBIGUOUS query returns the services or content that can be inferred from the translatable or combinable relationships, as well as the services or content with the same semantics as the specified attributes. Notably, an AMBIGUOUS query yields more results but takes more time to respond. Both content and service queries are illustrated by "Basic" and "Advanced" queries. "Basic" queries are standard SQL statements and can be used in a conventional query interface. "Advanced" queries contain CSIM manipulation functions in their query predicates, which are able to derive more sophisticated semantic relationships.

3.2.1. Content query

A content query inquires about the content meeting the requirement specified in the query. A user can specify an EXACT or AMBIGUOUS query. An EXACT query returns the content that entirely satisfies the query, which means no inference is employed to obtain the result. An AMBIGUOUS query returns all the content that can have the same semantics specified in the query. Here, the term "can" means that the content may be translated into, or inherited from, the target content.
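The difference between the two modes can be sketched as follows. The sketch follows the prose description (EXACT uses no inference; AMBIGUOUS adds content that may be translated into, or inherited from, the target schema) rather than reproducing the paper's algorithm. The repository layout (content id mapped to schema name), the edge-set form of CIT and TRT, and all names are assumptions of this example.

```python
from typing import Dict, Set, Tuple

Edges = Set[Tuple[str, str]]


def sources_reaching(target: str, edges: Edges) -> Set[str]:
    """All schema names with a directed path of length >= 1 to `target`."""
    found: Set[str] = set()
    frontier = [target]
    while frontier:
        node = frontier.pop()
        for a, b in edges:
            if b == node and a not in found:
                found.add(a)
                frontier.append(a)
    return found


def content_query(schema: str, repo: Dict[str, str], cit: Edges, trt: Edges,
                  ambiguous: bool = False) -> Set[str]:
    """Return ids of content matching `schema`; widen the match when ambiguous."""
    # EXACT: only content whose schema equals the requested one.
    result = {cid for cid, s in repo.items() if s == schema}
    if ambiguous:
        # AMBIGUOUS: add content translatable into, or inheriting from, the schema.
        related = sources_reaching(schema, trt) | sources_reaching(schema, cit)
        result |= {cid for cid, s in repo.items() if s in related}
    return result


# Hypothetical repository and tables for illustration.
repo = {"c1": "DublinCore", "c2": "MARC", "c3": "QualifiedDC", "c4": "EAD"}
cit = {("QualifiedDC", "DublinCore")}
trt = {("MARC", "DublinCore")}
exact = content_query("DublinCore", repo, cit, trt)
ambiguous = content_query("DublinCore", repo, cit, trt, ambiguous=True)
```

With this data, the EXACT query returns only `c1`, while the AMBIGUOUS query also returns `c2` (translatable) and `c3` (inherited); `c4` is unrelated and is never returned.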

Basic content query

Example. Determine the content with the schema of "Dublin Core".

Query Statement: Select C.Id From Content C where C.Schema = "Dublin Core".

Algorithm: ContentQuery(Attributes A, Content C){

1. Locate content c. Let $c \in CSR$, and $c.Id = C.Id$, $c.Schema \; p_{Schemas} \; C.Schema$, $c.Presentations = C.Presentations$, and $c.Semantics \; p_{Semantics} \; C.Semantics$;

2. If $A \in$ AMBIGUOUS, for each $r \in P^{c}_{Translatable}(r, C)$, $c \leftarrow c \cup r$; (The symbol "$\leftarrow$" indicates "assign the value".)

3. For each $k \in c$, return $k.A$.

}

[Fig. 3. Flowchart labels. Content Query: Exact or Ambiguous Query? Exact Query: Result. Ambiguous Query: InheritFrom? Translatable? Service Query: Exact or Ambiguous Query? Exact Query: Result. Ambiguous Query: Combinable? Translatable?]

Advanced content query

Example. Determine the content inherited from the ‘‘Dublin Core’’ schema.

Query Statement: Select C1.Id From Content C1, C2 Where C2.Schema = "Dublin Core" and InheritFrom(C1, C2).

Algorithm: AdvancedContentQuery(Attributes A, Contents C, Relationship R) {

1. Locate content $c$. Let $c \in \mathrm{CSR}$, and $c.\mathrm{Id} = C.\mathrm{Id}$, $c.\mathrm{Schema} \;\pi_{\mathrm{Schema}}\; C.\mathrm{Schema}$, $c.\mathrm{Presentations} = C.\mathrm{Presentations}$, and $c.\mathrm{Semantics} \;\pi_{\mathrm{Semantics}}\; C.\mathrm{Semantics}$;

2. If $A \in \mathrm{AMBIGUOUS}$, for each $r \in \Pi^{c}_{\mathrm{Translatable}}(r, C)$, $c \leftarrow c \cup r$;

3. For each $i \in \Pi(R)$, $c \leftarrow c \cup \mathrm{ContentQuery}(\mathrm{Id}, i)$;

4. For each $k \in c$, return $k.A$.

}

3.2.2. Service query

A service query inquires about the services meeting the requirements specified in the query. A user can specify the query to be EXACT or AMBIGUOUS. An EXACT query returns the services that entirely satisfy the query. An AMBIGUOUS query determines all the services that can have the capabilities specified in the query. Here the term ‘‘can have’’ implies that the services can be concatenated or translated into the target service.
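A service query can be sketched in the same way. The example below is a minimal illustration under stated assumptions: the repository `SERVICES`, its capability names, and the function `service_query` are hypothetical, and only the concatenation of services is modeled (translation between input and output schemas is left out). The exhaustive search over groups is chosen for clarity, not efficiency.

```python
from itertools import combinations

# Toy service repository: service id -> set of capabilities (hypothetical).
SERVICES = {
    "s1": {"Extract"},
    "s2": {"Translate"},
    "s3": {"Integrate"},
    "s4": {"Extract", "Translate", "Integrate"},
}


def service_query(wanted, ambiguous=False, max_size=3):
    """Return a list of recommendation lists (tuples of service ids)."""
    wanted = set(wanted)
    # EXACT: single services that already offer every wanted capability.
    results = [(sid,) for sid, caps in sorted(SERVICES.items()) if wanted <= caps]
    if ambiguous:
        # AMBIGUOUS: also recommend groups of services whose combined
        # capabilities cover the request although no member does so alone.
        for size in range(2, max_size + 1):
            for group in combinations(sorted(SERVICES), size):
                if any(wanted <= SERVICES[s] for s in group):
                    continue
                combined = set().union(*(SERVICES[s] for s in group))
                if wanted <= combined:
                    results.append(group)
    return results


request = {"Extract", "Translate", "Integrate"}
print(service_query(request))
print(service_query(request, ambiguous=True))
```

A recommendation list with one item corresponds to an exact match; a list with several items suggests services that could be combined to provide the requested capabilities.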

Basic service query

Example. Determine the services with the service capability ‘‘Catalog_System’’.

Query Statement: Select S.Id From Service S Where S.Capabilities = "Catalog_System".

Algorithm: ServiceQuery(Attributes A, Services S) {

1. Locate service $s$. Let $s \in \mathrm{CSR}$, $s.\mathrm{Id} = S.\mathrm{Id}$, and $s.\mathrm{Capabilities} \;\pi_{\mathrm{Capabilities}}\; S.\mathrm{Capabilities}$;

2. If $A \in \mathrm{AMBIGUOUS}$, for each $r \in \Pi^{s}_{\mathrm{Translatable}}(r, C.\mathrm{Schema})$, $c \leftarrow c \cup r$;

3. For each $k \in c$, return $k.A$.

}

Advanced service query

Example. Determine the services that have the same capabilities as ‘‘Catalog_System’’.

Query Statement: Select S1.Id From Service S1, S2 Where S2.Capabilities = "Catalog_System" and Inclusive(S1, S2).

Algorithm: AdvancedServiceQuery(Attributes A, Services S, Relationship R) {

1. Locate service $s$. Let $s \in \mathrm{CSR}$, $s.\mathrm{Id} = S.\mathrm{Id}$, and $s.\mathrm{Capabilities} \;\pi_{\mathrm{Capabilities}}\; S.\mathrm{Capabilities}$;

2. If $A \in \mathrm{AMBIGUOUS}$, locate service $rs \in \mathrm{CSR}$ where $S.\mathrm{Capabilities} \;\pi^{r}_{\mathrm{Capabilities}}\; rs.\mathrm{Capabilities}$; $c \leftarrow c \cup rs$;

3. For each $i \in R$, $c \leftarrow c \cup \mathrm{ServiceQuery}(\mathrm{id}, i)$;

4. For each $k \in c$, return $k.A$.

}


3.3. Ranking function

A result of a CSIM semantic query can be classiﬁed into one of the following types:

1. Exact match. The result conforms to the query predicate without additional translation, inheritance, or combination.

2. Ambiguous match. The result is a recommendation that may not completely satisfy all query predi-cates, but can satisfy the predicates by translating, inheriting, or combining available services or con-tent.

Because the results of a query may not totally fulfill the user’s requirements, a ranking function is proposed to evaluate the fitness of the results of a query. The ranking function W is separated into WContent and WService, and is defined as follows:

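$$
W_{\mathrm{Content}}(\text{Content } A, \text{Content } \mathit{Results}) =
\begin{cases}
1 & \text{if } A \;\pi_{\mathrm{Schemas}}\; \mathit{Results} \text{ and } \mathrm{Num}(\mathit{Results}) = 1 \\
\prod_i (1 - T_i) & \text{if } A \;\pi^{r}_{\mathrm{Schemas}}\; \mathit{Results} \text{ and } \mathrm{Num}(\mathit{Results}) > 1
\end{cases}
\qquad (1)
$$

$$
W_{\mathrm{Service}}(\text{Service } A, \text{Service } \mathit{Results}) =
\begin{cases}
1 & \text{if } A \;\pi_{\mathrm{Capabilities}}\; \mathit{Results} \text{ and } \mathrm{Num}(\mathit{Results}) = 1 \\
\dfrac{\sum_i W\_\mathit{Results}_i \times \prod_i (1 - T_i)}{\mathrm{Num}(\mathit{Results}) - \mathrm{Num}(T_i)} & \text{if } A \;\pi^{r}_{\mathrm{Capabilities}}\; \mathit{Results} \text{ and } \mathrm{Num}(\mathit{Results}) > 1
\end{cases}
\qquad (2)
$$

where

$W\_\mathit{Results}_i$: $\dfrac{\mathrm{Num}(\mathit{Results}_i.\mathrm{Capabilities})}{\mathrm{Num}(A.\mathrm{Capabilities})} \times WS_{\mathit{Result}_i}$;

$WS_{\mathit{Result}_i}$: Service weight of $\mathit{Result}_i$. A larger weight means that the service is easier to compose into the result;

$T_i$: Overhead of the translation rules used to produce Results;

$\mathrm{Num}(\mathit{Results})$: The number of services and translation rules. The smaller the number of Results, the higher the rank of the result.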

Ranking functions WContent and WService are proposed to rank content and services, respectively. The ranking follows the number of semantic concepts or capabilities that meet the query predicates, and the amount of content and services that are combined to yield the result.

Example. Assume a service A with five capabilities; we apply CSIM in the semantic query and find that A can be obtained by concatenating three services S1, S2, and S3 with two translations T1 and T2. Each of the three services provides three of the service capabilities of A, and the union of their capabilities includes all the service capabilities of A. If all of these services have WS = 1 and the two translation rules have 10% and 20% overhead, respectively, what is the rank of this concatenation?

Result. In this example, W_Results1, W_Results2 and W_Results3 are each 3/5 = 0.6. T1 and T2 are 0.1 and 0.2, respectively. Applying Eq. (2) yields the rank of this concatenation as $(0.6 \times 1 + 0.6 \times 1 + 0.6 \times 1) \times ((1 - 0.1) \times (1 - 0.2)) / (5 - 2) = 0.432$.
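The arithmetic of this example can be checked with a short Python sketch of Eq. (2). The function names `w_results` and `w_service` are hypothetical, and the sketch assumes that a single result with no translation is an exact match with rank 1.

```python
from math import prod


def w_results(num_result_caps, num_query_caps, service_weight=1.0):
    """W_Results_i: share of the requested capabilities times the service weight."""
    return num_result_caps / num_query_caps * service_weight


def w_service(result_weights, translation_overheads):
    """Rank of a service result following Eq. (2).

    result_weights: W_Results_i of each service in the result.
    translation_overheads: overhead T_i of each translation rule used.
    """
    # Num(Results) counts both the services and the translation rules.
    num_results = len(result_weights) + len(translation_overheads)
    if num_results == 1:
        return 1.0  # assumed here to be an exact match
    penalty = prod(1 - t for t in translation_overheads)
    return sum(result_weights) * penalty / (num_results - len(translation_overheads))


# Three services, each covering 3 of the 5 capabilities of A, with WS = 1,
# joined by two translation rules with 10% and 20% overhead.
weights = [w_results(3, 5) for _ in range(3)]
rank = w_service(weights, [0.1, 0.2])
print(round(rank, 3))
```

The denominator equals the number of services in the concatenation, so a result assembled from more services ranks lower, and each translation rule lowers the rank further through its overhead.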


4. Experiments

4.1. Experimental set-up and approach

A series of experiments were performed in the digital library of National Chiao Tung University (NCTUDL, http://www.lib.nctu.edu.tw/) to demonstrate the feasibility of CSIM. These experiments involved service and content queries, which explored the structural relationship of the metadata of services and content.

The subjects of service queries were the 81 services in NCTUDL. The services were divided into six categories: Tutorial (TU), Query (QU), Service (SE), Database (DB), Journal (JO), and Holding (HO). The metadata of each service was provided by experts and contained keywords that delineated its service capabilities. Experiments on service queries compared the conventional keyword-based approach, the CSIM approach, and the CSIM approach with service inference. The keyword-based approach returned the services whose capability descriptions exactly matched the specified keywords. The CSIM approach involved the AMBIGUOUS semantic query without enabling the $\Pi^{s}_{\mathrm{Combinable}}$ manipulation function; this approach referred to related ontological terms for service capabilities and returned answers that could be inherited or translated into the desired services. CSIM with service inference involved the AMBIGUOUS semantic query with the $\Pi^{s}_{\mathrm{Combinable}}$ manipulation function enabled. This approach returned the answers of the CSIM approach and, in addition, a recommendation list showing how the required service could be composed from existing services. For a digital library that decomposed services into reusable components, CSIM with inference advised the system developer to construct new services by combining the existing components. In this manner, considerable development effort could be saved.

To evaluate content queries, a virtual union catalog system called VUCS@NCTUDL was developed, which harvested over 20 WebPAC systems of Taiwanese libraries and integrated the results (Huang et al., 2000). The variety of data schemas in the WebPAC systems complicated the mapping of schema attributes in response to a content query. To handle this issue, conventional VUCS systems involved user intervention in the design phase, or stored mapping information using metadata. This design strategy led to a much less flexible system and required user intervention in the system design. CSIM instead referred to the ontology stored in the Content Inheritance Table (CIT) to extend the hierarchical relationships between attributes of schemas. Furthermore, CSIM used the ontology of semantics stored in the Content Semantics Table (CST) to solve the problems of attribute semantics (such as those involving Synonymous and Homonymous relationships). The experiments retrieved four attributes (title, subject, author, and publisher) of the content satisfying the query. The keyword-based approach returned the content that contained the specified keywords within these fields (not all WebPAC systems contain all of these fields). The CSIM approach employed the ontological tables (both CST and CIT) to map various attributes of schemas into the same attribute if they were at the same level of the ontological hierarchy (in Fig. 8, ontology area). Additionally, the CSIM approach exploited the ontological tables and TRT to convert heterogeneous attributes into content suitable for the requested attributes. The keywords used in content queries were randomly generated by VUCS@NCTUDL. The top-k answers were calculated to average the performance over the four attributes.
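The contrast between the two approaches for content queries can be illustrated with a small Python sketch. It is only a simplified stand-in for the ontological tables: the dictionary `CST_SYNONYMS`, the sample `RECORDS`, and both search functions are hypothetical, and the sketch models synonymous field names only (no inheritance hierarchy and no translation rules).

```python
# Hypothetical stand-in for the Content Semantics Table (CST):
# local field name -> canonical attribute it is synonymous with.
CST_SYNONYMS = {
    "author": "author",
    "creator": "author",
    "title": "title",
    "name": "title",
}

# Records harvested from two catalogs that label their fields differently.
RECORDS = [
    {"id": "r1", "author": "Huang", "title": "Metadata"},
    {"id": "r2", "creator": "Huang", "name": "Digital libraries"},
]


def keyword_search(attribute, keyword):
    """Match only records that have a field literally named `attribute`."""
    return [r["id"] for r in RECORDS if keyword in r.get(attribute, "")]


def semantic_search(attribute, keyword):
    """Match any field whose canonical attribute equals `attribute`."""
    hits = []
    for record in RECORDS:
        for field, value in record.items():
            if CST_SYNONYMS.get(field) == attribute and keyword in value:
                hits.append(record["id"])
                break  # one matching field is enough for this record
    return hits


print(keyword_search("author", "Huang"))
print(semantic_search("author", "Huang"))
```

The keyword-based search misses the second record because its field carries a different name, whereas the lookup through the synonym table retrieves both.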

4.2. Experimental metrics

Two metrics, Accuracy and Coverage, were used to evaluate both the service and content queries and thereby elucidate the effectiveness of CSIM. Accuracy is the proportion of the returned answers that are correct. Coverage is the proportion of all the correct answers that are returned. For the service query, the Accuracy ($A_m(s)$) and Coverage ($C_m(s)$) are defined as


$$
A_m(s) = \frac{\left| F_s \cap F_s^{\mathrm{ideal}} \right|}{\left| F_s \right|},
\qquad
C_m(s) = \frac{\left| F_s \cap F_s^{\mathrm{ideal}} \right|}{\left| F_s^{\mathrm{ideal}} \right|}
$$

where $|\cdot|$ denotes the total number of services in a set, $m$ indicates the method examined (keyword-based, CSIM, or CSIM with inference), and $s$ represents the examined service category. In these formulas, $F_s$ represents the services in each service category returned by the query; $F_s^{\mathrm{ideal}}$ represents all the services in category $s$, the services in concept $s$, and the services returned by CSIM with inference whose ranks exceed the threshold $T_{\mathrm{service}}$:

$$
F_s^{\mathrm{ideal}} = \{\text{services of category } s\} \cup \{\text{services of concept } s\} \cup \{\text{services returned by CSIM with inference whose ranks exceed } T_{\mathrm{service}}\}
$$

The threshold $T_{\mathrm{service}}$ discards the results that are obtained by cascading too many translation rules between the input and output interfaces. The aim of $T_{\mathrm{service}}$ is to control the efficiency of the service query. For the content query, the Accuracy ($A_k(c)$) and Coverage ($C_k(c)$) are defined as

$$
A_k(c) = \frac{\left| F_c \cap F_c^{\mathrm{ideal}} \right|}{\left| F_c \right|},
\qquad
C_k(c) = \frac{\left| F_c \cap F_c^{\mathrm{ideal}} \right|}{\left| F_c^{\mathrm{ideal}} \right|}
$$

where $|\cdot|$ denotes the total amount of content in a set, $k$ indicates the method examined (keyword-based, CSIM, or CSIM with inference), and $c$ represents the top-k content returned by VUCS@NCTUDL. In these formulas, $F_c$ represents the response to a content query, and $F_c^{\mathrm{ideal}}$ represents all the content returned by the keyword-based approach together with the content returned by CSIM whose ranks exceed the threshold $T_{\mathrm{content}}$:

$$
F_c^{\mathrm{ideal}} = \{\text{content returned by the keyword-based approach}\} \cup \{\text{content returned by CSIM whose ranks exceed } T_{\mathrm{content}}\}
$$

The threshold $T_{\mathrm{content}}$ discards the results that are obtained by cascading too many translation rules between two pieces of content. The aim of $T_{\mathrm{content}}$ is to control the efficiency of the content query.

The percentage of the improvement of CSIM over the keyword-based approach was used to demonstrate the performance of CSIM in content query. The Improvement is deﬁned as:

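$$
\mathrm{Improvement}_{\mathrm{Accuracy}} = \frac{A_m^{\mathrm{CSIM}}(c) - A_m^{\mathrm{Keyword\text{-}based}}(c)}{A_m^{\mathrm{Keyword\text{-}based}}(c)}
$$

$$
\mathrm{Improvement}_{\mathrm{Coverage}} = \frac{C_m^{\mathrm{CSIM}}(c) - C_m^{\mathrm{Keyword\text{-}based}}(c)}{C_m^{\mathrm{Keyword\text{-}based}}(c)}
$$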
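The three metrics can be computed directly from answer sets, as the following Python sketch shows. The function names and the sample sets are hypothetical; the data are invented for illustration and are not taken from the experiments.

```python
def accuracy(returned, ideal):
    """Share of the returned answers that belong to the ideal answer set."""
    returned, ideal = set(returned), set(ideal)
    return len(returned & ideal) / len(returned)


def coverage(returned, ideal):
    """Share of the ideal answer set that appears among the returned answers."""
    returned, ideal = set(returned), set(ideal)
    return len(returned & ideal) / len(ideal)


def improvement(csim_value, keyword_value):
    """Relative gain of CSIM over the keyword-based approach."""
    return (csim_value - keyword_value) / keyword_value


# Invented answer sets for one query.
ideal = {"a", "b", "c", "d"}
keyword_answers = {"a", "b", "x", "y"}
csim_answers = {"a", "b", "c", "x"}

acc_keyword = accuracy(keyword_answers, ideal)
acc_csim = accuracy(csim_answers, ideal)
cov_keyword = coverage(keyword_answers, ideal)
cov_csim = coverage(csim_answers, ideal)

print(acc_keyword, acc_csim, improvement(acc_csim, acc_keyword))
print(cov_keyword, cov_csim, improvement(cov_csim, cov_keyword))
```

Accuracy and Coverage share the same numerator and differ only in the denominator, which is why an approach can improve one without improving the other.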

4.3. Experiment results

Figs. 4 and 5 present the Accuracy and Coverage of the service query. Only one translation is allowed between the input/output schemas of two services. Both figures indicate that the CSIM approach outperforms the keyword-based approach. In all service categories, the keyword-based approach has poor Accuracy and Coverage because these services do not explicitly contain the searched keywords in their capabilities. The CSIM approach dramatically improves the performance with regard to both Accuracy and Coverage. This finding shows that exploiting concept-based approaches (like CSIM) is useful for semantic DL queries. Notably, the CSIM approach with inference outperforms pure CSIM in the ‘‘HO’’ category because the former approach can recommend that users combine existing services to generate additional ones, such as the union of the WebPAC system and the electronic journal databases. This result may encourage libraries to spend less effort on developing add-on services by integrating available ones.

For content query, Figs. 6 and 7 plot the improvement of the average Accuracy and Coverage with respect to the queries in the four attributes (title, subject, author and publisher). The Top-K on the X-axis means that the top-k books returned by VUCS are selected as the result. To avoid too many translations between two pieces of content, only one translation is allowed. In Fig. 6, CSIM improves Accuracy by 1.2 times on average relative to the keyword-based approach because CSIM refers to CIT and TRT to obtain more conceptually related attributes of schemas. For example, the ‘‘author’’ field in NCTUDL may appear in other WebPAC systems under a different name such as creator; in this case, the keyword-based approach cannot retrieve the correct results. In Fig. 7, the improvement of Coverage indicates that CSIM outperforms the keyword-based approach by 8–10%. The Coverage improvement grows as the number of returned answers increases, because CSIM refers to CST to return those attributes with the same semantics but a different format (like Synonyms).

Fig. 5. Coverage of service query.

Fig. 6. Accuracy improvement of content query.

In summary, the performance of semantic DL queries with CSIM is highly promising. The CSIM model represents a significant improvement in both service and content queries. The CSIM model not only enhances the Accuracy and Coverage of a DL query but also suggests how librarians can integrate available components into a desired service.

4.4. Prototype system

A prototype system called CSIM@NCTUDL has been implemented. Fig. 8 illustrates the user interface of this system. This system supports advanced semantic DL queries for users to retrieve content and services in NCTUDL. Moreover, librarians can consult this system to determine reusable components in NCTUDL before they start establishing new services. CSIM@NCTUDL includes four main areas.

1. Selection area: This area allows the user to specify which of the three query types is to be issued: service query, content query, or compound query. A compound query allows the user to retrieve content or services with both content and service manipulation functions.

2. Edit area: This area allows the user to specify the query predicates, including input/output schemas and semantics of content, and service capabilities. They are selected from the ontology area.

3. Ontology area: This area contains all ontological hierarchies defined in Section 2. The ontology is developed by domain experts.

4. Result area: This area displays the results that satisfy the predicates in the edit area. Each result is expressed as a recommendation list with one or more items. A list with one item indicates that the item exactly matches the specified predicates; on the other hand, a list with more than one item indicates that the users can combine these items to obtain a new content/service combination that satisfies the specified predicates. The ranking of each result is also given. A numeral in front of each service of the recommendation list denotes its fitness for the specified service capabilities. The result area presents a set of recommendation lists to advise the user; nevertheless, the system leaves the task of confirming the feasibility of the recommendation list to the user.

The example illustrated in Fig. 8 conducts a query to retrieve VUCS services. The input and output interfaces are specified as ‘‘holding’’ format. The ‘‘holding’’ format, which is the top item in the content schema ontology, means that the service to be returned should handle any kind of library holdings. The capabilities are specified as ‘‘Integration’’ and ‘‘Query’’, each of which is the top item in one capability ontology. Consequently, the result area shows not only the VUCS@NCTU service (Index 1 of the result area in Fig. 8), but also a recommendation list that suggests another virtual union catalog system (Index 3 of the result area in Fig. 8) by combining three services: an extractor for structured documents (Huang et al., 2000), a translation service that translates native data schemas into a canonical schema (Ke, Huang, & Yang, 2001), and an integration service to combine distributed extractors.

5. Conclusions and future research

This work presents a novel content and service inference model (CSIM), which defines 15 relationships between content and services to handle semantic DL queries. It enumerates these relationships and presents manipulation functions, the $\pi$ and $\Pi$ operations, to realize them. $\pi$ operations return TRUE or FALSE for a specific relationship, and $\Pi$ operations determine all the content or services that exhibit a specific relationship. The proposed semantic DL query applies CSIM and comprises two query types. An exact query returns the answers that exactly match the predicate, and an ambiguous query returns recommendations that can be inherited, translated or combined from available content or services, as well as those that match the exact query.

CSIM was applied to the digital library of National Chiao Tung University (NCTUDL). A virtual union catalog system (VUCS@NCTUDL) and CSIM@NCTUDL were constructed (Huang et al., 2000). Experimental findings indicate that CSIM outperforms the conventional keyword-based approach in handling DL queries. An ambiguous DL query with CSIM recommends additional results beyond those of the conventional keyword-based approach, improving the Accuracy and Coverage of both content and service retrieval. Applying CSIM to digital library queries reveals that the administrative load can be reduced when new services and content are to be developed. Digital library designers can generate new content and services from those available, by considering the recommendations in response to an ambiguous query. Metadata and translation rules are applied to translate and reuse content and services to support the design of an object-oriented or component-based digital library.

CSIM can be used in DL applications. With respect to a DL resource-planning system, the inference capability of CSIM shows librarians which available components can be reused to construct new DL services easily. Furthermore, various data fields can be categorized into semantic hierarchies to simplify the transformation between them, facilitating the combination of various data fields. Such a combination frequently arises when union systems are created. The experiments presented herein have demonstrated that CSIM outperforms the conventional keyword-based approach in a virtual union catalog system.

CSIM supports semantic DL queries. In most digital libraries, services are distributed on the Web, making the mapping between systems loose and large. Users may want to find the services that meet their needs. Metadata can be used to describe services, and CSIM can be applied in semantic searches to facilitate such a search. Users can specify the input type of a service (such as library holdings), output type (such as Web pages) and capabilities (such as search systems). This semantic assignment and search bridges the gap between the aims of the user and the service capabilities. However, CSIM has an efficiency problem when it generates all possible responses to an ambiguous query, especially when cascaded translations occur between different content. Properly chosen thresholds $T_{\mathrm{content}}$ and $T_{\mathrm{service}}$ govern the performance of CSIM. Future research will focus on accommodating broader semantics and developing new schemes to yield more knowledge from metadata with abundant semantics. In this manner, digital library queries can be further improved. As well as examining the optimization of CSIM operations, our future work will develop more efficient indexing mechanisms for accessing related data structures.

Acknowledgements

The authors would like to thank the National Science Council of the Republic of China, Taiwan, for ﬁnancially supporting this research under contract no. NSC90-2213-E-009-082.

References

Abiteboul, S., Benjelloun, O., & Milo, T. (2002). Web services and data integration. In Proceedings of the third international conference on web information and system engineering (pp. 3–6). Singapore.

Allan, J., & Raghavan, H. (2002). Using part-of-speech patterns to reduce query ambiguity. In Proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval (pp. 303–314). Tampere, Finland.

Blanchi, C., & Petrone, J. (2001). Distributed interoperable metadata registry. D-Lib Magazine, December, 7(12). Available: http://www.dlib.org/dlib/december01/blanchi/12blanchi.html.

Chen, H. C., Chung, Y. M., Marshall, R., & Yang, C. C. (1998). An intelligent personal spider (agent) for dynamic Internet/Intranet searching. Decision Support Systems, 23(1), 41–58.

Dai, Y. M. (2001). A data mining system for mining library borrowing history records. Master Thesis of National Chiao-Tung University.

Fox, E. A., Suleman, H., & Luo, M. (2002). Building digital libraries made easy: toward open digital libraries. In Proceedings of the ﬁfth international conference on Asian digital libraries (ICADL2002) (pp. 14–24). Singapore.

Gauch, S., & Wang, J. (1999). A corpus analysis approach for automatic query expansion and its extension to multiple databases. ACM Transactions on Information Systems, 17(3), 250–269.

Goncalves, M. A., & Fox, E. A. (2002). A language for declarative speciﬁcation and generation of digital libraries. In Proceedings of second joint ACM/IEEE-CS joint conference on digital libraries (JCDL’2002) (pp. 263–272). Portland.

Goncalves, M. A., France, R. K., & Fox, E. A. (2001). MARIAN: ﬂexible interoperability for federated digital libraries. In Proceedings of 5th European conference of research and advanced technology for digital libraries (ECDL-01) (pp. 173–186). Darmstadt, Germany.

Grossman, R., Qin, X., & Xu, W. (1995). An architecture for a scalable, high-performance digital library, mass storage systems. In Proceedings of the fourteenth IEEE symposium (pp. 11–14). Monterey, California.

Huang, S. S., Ke, H. R., & Yang, W. P. (2000). Information extraction for documents with common structure. In The third international conference of Asian digital library (ICADL2000) (pp. 105–112). Seoul, Korea.

Ke, H. R., Huang, S. S., & Yang, W. P. (2001). The study of interoperability of digital libraries with metadata. University Library Journal, 5(1), 49–78.

Kolda, T. G., & O’Leary, D. P. (1998). A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4), 322–346.

Lee, K. S., Kim, D. W., Kageura, K., & Choi, K. S. (2002). A workbench for acquiring semantic information and constructing dictionary for compound noun analysis. ICADL, Lecture Notes in Computer Science, 2555, 315–327.

Lynch, C., & Garcia-Molina, H. (1995). Interoperability, scaling and the digital libraries research agenda. Information Infrastructure Technology and Applications (IITA) Digital Libraries workshop. Available: http://www-diglib.stanford.edu/diglib/pub/reports/iita-dlw/main.html.

McCray, A. T., Gallagher, M. E., & Flannick, M. A. (1999). Extending the role of metadata in a digital library system. In Proceedings of the IEEE research and technology advances in digital libraries, 1999 (ADL99) (pp. 190–199). Baltimore.

Miller, G. A., Fellbaum, C., Tengi, R., Wolff, S., Wakefield, P., & Langone, H. (2002). Wordnet: a lexical database for the English language. Available: http://www.cogsci.princeton.edu/~wn/.