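Literature Review - Intelligent Information Integration in Heterogeneous Data Warehousing and Data Mining: Models, Architecture, and Performance Evaluation, Applying Ontology, Archetype Schema, and Nomenclature Structure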

2.1 XML Query Capability

Benchmarking XML data management systems involves many factors, and designing a comprehensive set of queries to test an XML database’s performance is central among them. An XML query language should capture all the characteristics of an XML document, and the functionalities it provides influence query performance. The W3C XML Query Language working group (Chamberlin, Fankhauser, Marchiori, & Robie, 2003) lists 20 “must have” functionalities for an XML query language, as Table shows.

Some of the expected functionalities may affect the efficiency of the system significantly.

XQuery meets all of the requirements except F12 and F16, and it has become a standard query language for testing the performance of XML data management systems. Generally speaking, queries used to benchmark XML databases fall into several categories: Match, Join, Navigation, Casting, Reconstruction, and Update.

Match queries mainly test the database’s ability to handle simple string lookups with a fully specified path.

Join queries fall into two groups: Join on References and Join on Values. References are an important part of XML because they allow richer relationships than hierarchical structure alone. Join on References queries test whether the query optimizer can take advantage of references when joining. Join on Values queries, by contrast, join on the basis of values and test the database’s ability to handle large intermediate results.

Navigation queries investigate how well the query processor can optimize path expressions and avoid traversing irrelevant parts of the tree.

Strings are the basic data type in XML documents, so it is often necessary to cast strings to another data type that carries more semantics. Casting queries challenge the database’s ability to cast between data types.

Reconstruction queries attempt to reconstruct the original document from its fragments stored in the database. Update queries add, delete, and modify elements in the XML document, and so test the database’s ability to manage XML documents. Furthermore, other XML query functionalities such as sorting, ordered access, text search, and aggregation should also be captured in the benchmark query set.
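As a rough illustration of several of these categories, the following Python sketch runs Match, Join on References, Casting, and Navigation style lookups over a tiny in-memory document using the standard library's xml.etree.ElementTree. The document, its element names, and its attribute names are invented for illustration and are not taken from any benchmark; a real benchmark would express such queries in XQuery against a database.

```python
import xml.etree.ElementTree as ET

# Hypothetical auction-style document. Element and attribute names are
# illustrative assumptions, not taken from any benchmark schema.
DOC = """
<site>
  <items>
    <item id="i1"><name>lamp</name><price>12.50</price></item>
    <item id="i2"><name>desk</name><price>80.00</price></item>
  </items>
  <auctions>
    <auction><itemref item="i2"/><bid>95.00</bid></auction>
    <auction><itemref item="i1"/><bid>10.00</bid></auction>
  </auctions>
</site>
"""

root = ET.fromstring(DOC)

# Match: simple string lookup along a fully specified path.
match = [n.text for n in root.findall("./items/item/name") if n.text == "desk"]

# Join on References: follow each auction's reference to its item element.
items_by_id = {item.get("id"): item for item in root.findall("./items/item")}
ref_join = [
    (items_by_id[auction.find("itemref").get("item")], auction)
    for auction in root.findall("./auctions/auction")
]
joined_names = [(item.findtext("name"), auction.findtext("bid"))
                for item, auction in ref_join]

# Casting: the stored values are strings, so they are cast to float
# before the numeric comparison between bid and price.
above_price = [
    item.findtext("name")
    for item, auction in ref_join
    if float(auction.findtext("bid")) > float(item.findtext("price"))
]

# Navigation: visit only the bid elements instead of handling every node.
bids = [float(b.text) for b in root.iter("bid")]

if __name__ == "__main__":
    print(match, joined_names, above_price, bids)
```

Each list corresponds to one category, which mirrors the idea that a benchmark query should isolate a single functionality so that its cost can be attributed clearly.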

2.2 XML Benchmarks

XMark, XMach-1 and XOO7 are three benchmarks available today that can be used to evaluate certain aspects of XML database systems.

2.2.1 XMark

XMark (Schmidt, Waas, Kersten, Carey, Manolescu, & Busse, 2002) is a single-user benchmark. Its data model is an Internet auction site, and its database consists of one large XML document containing text and non-text data. XMark makes rich use of references in the data, such as the item IDREF in an auction element that points to the ID of an item element. The text data are drawn from the 17000 most frequently occurring words in Shakespeare’s plays. The standard data size is 100MB at a scaling factor of 1.0, and users can scale the data size in factors of 10 from this standard size. However, XMark has no support for XML Schema. In its operation model, 20 XQuery challenges are designed to cover the essentials of XML query processing, as Table shows. No update operations are specified in XMark.
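The relation between scaling factor and document size can be sketched as simple arithmetic. The snippet below assumes that size grows linearly with the scaling factor from the 100MB standard and that users move in tenfold steps; the function names are hypothetical and are not part of the XMark tooling.

```python
STANDARD_SIZE_MB = 100  # document size at scaling factor 1.0, as stated in the text


def document_size_mb(scaling_factor: float) -> float:
    """Approximate document size, assuming size is linear in the factor."""
    if scaling_factor <= 0:
        raise ValueError("scaling factor must be positive")
    return STANDARD_SIZE_MB * scaling_factor


def tenfold_factors(smallest: float, steps: int) -> list[float]:
    """Scaling factors that grow by a factor of 10 per step (assumed reading)."""
    return [smallest * 10 ** i for i in range(steps)]


if __name__ == "__main__":
    for factor in tenfold_factors(0.1, 4):
        print(f"factor {factor:g}: about {document_size_mb(factor):g} MB")
```

Under these assumptions, four tenfold steps starting at a factor of 0.1 span roughly 10MB to 10GB.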

2.2.2 XMach-1

XMach-1 (Böhme & Rahm, 2001) is a scalable multi-user benchmark. Its main objective is to stress-test XML systems under a multi-user workload. The data model of XMach-1 is designed for B2B applications and considers text documents and catalog data. It assumes that the data files exchanged will be small. It supports DTD only and does not consider XML Schema for optimization. The operation model of XMach-1 consists of eight queries and three update operations, shown in Table.

Queries specified in XMach-1 cover typical database functionality (join, aggregation, sort) as well as information retrieval and XML-specific features (document assembly, navigation, element access). Update operations cover inserting and deleting documents as well as changing attribute values. Some queries combine several query functionalities, which makes it hard to analyze an experimental result and ascertain which feature leads to the observed performance. Notably, the three update operations defined in XMach-1 are unique among the XML benchmarks reviewed here.
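To make the idea of a multi-user workload that mixes queries and updates concrete, here is a minimal Python sketch. It is not the XMach-1 driver: the InMemoryStore class, the operation mix, and the document contents are all hypothetical stand-ins chosen so that the example runs without a database.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InMemoryStore:
    """Hypothetical stand-in for an XML database (not part of XMach-1)."""

    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def insert(self, doc_id, text):
        with self._lock:
            self._docs[doc_id] = text

    def delete(self, doc_id):
        with self._lock:
            self._docs.pop(doc_id, None)

    def search(self, word):
        # Naive text search over all stored documents.
        with self._lock:
            return [d for d, t in self._docs.items() if word in t]


def user_session(store, user_id, rounds):
    """One simulated user: insert and search each round, delete on even rounds."""
    completed = 0
    for i in range(rounds):
        doc_id = f"u{user_id}-d{i}"
        store.insert(doc_id, f"<doc><title>order {i}</title></doc>")
        store.search("order")
        completed += 2
        if i % 2 == 0:
            store.delete(doc_id)
            completed += 1
    return completed


def run_workload(users, rounds):
    """Run all user sessions concurrently; return operations, seconds, store."""
    store = InMemoryStore()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        totals = list(pool.map(lambda u: user_session(store, u, rounds), range(users)))
    elapsed = time.perf_counter() - start
    return sum(totals), elapsed, store


if __name__ == "__main__":
    total, elapsed, _ = run_workload(users=4, rounds=10)
    print(f"{total} operations in {elapsed:.4f} s")
```

Because each session mixes inserts, searches, and deletes, a single elapsed time cannot be attributed to one operation type, which is the same attribution difficulty noted above for queries that combine several functionalities.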

2.2.3 XOO7

XOO7 (Li, Bressan, Dobbie, Lacroix, Lee, Nambiar, & Wadhwa, 2001) is an XML version of the OO7 benchmark, which was designed to test the efficiency of object-oriented DBMSs.

XOO7 is a single-user benchmark for XML management systems (XMLMS) that focuses on the query processing aspect of XML. The data model of XOO7 comes from the OO7 benchmark by mapping the OO7 schema and data set to XML. The data of XOO7 model no specific application domain; they are based on a generic description of complex objects using component-of relationships. XOO7 also proposes three databases of varying size: small, medium, and large.

It supports DTD only. In its operation model, XOO7 provides relational, document, and navigational queries that are specific and critical for XML database applications. XOO7 contains a large number of queries that test primitive features, and each query covers only a few features. Table displays the queries adopted in XOO7.

Compared with the other two benchmarks, XOO7 clearly has the highest ratio, which stresses its data-centric focus. However, some queries focus on the same functionality. As in XMark, no update operations are specified in XOO7.

2.3 XML Benchmarks Comparison

A comparison of the key features of these main XML benchmarks against this research is given in Table 1. The key features include application focus, evaluation scope, and database and workload characteristics.

Table 1: Comparison of Benchmarks over Workload Characteristics

Feature | XMark | XMach-1 | XOO7 | This Research
Evaluation Scope | Query Processor | DBMS | Query Processor | Heterogeneous Information Integration
Application Domain | E-Commerce | E-Commerce | Generic | Generic
Data Model | | | |
Document Size | 10MB~10GB | 16KB | Unknown | Various
Data Heterogeneity | | | |

benchmark by query functionality.

Compared to other XML benchmarks, XMark provides a concise and comprehensive set of queries. However, it does not provide update operations to manipulate XML documents. XMach-1 defines only a small number of XML queries, each covering multiple functions, together with update operations against which system performance is measured. XOO7 maps the original queries of OO7 into XML and adds some XML-specific queries. In general, XMach-1, XMark, and XOO7 each cover only a subset of the XML query requirements. In this research, we propose a generic workload model. To cover the full functionality of XML query processing, we combine the queries of these three XML benchmarks and integrate them into ten types of queries. An intelligent information integration system, however, is generally used to query data and does not provide data manipulation functions. Therefore, the query model in this research does not support update operations.

Table 2: Comparison of Benchmarks over Query Functionalities

Query Functionality | XMark | XMach-1 | XOO7 | This Research
Update Operation | | ✓ | |

2.4 Ontology

Ontologies play an important role in integration as a means of formally defining terms for communication. They aim at capturing domain knowledge in a generic way and provide a commonly agreed understanding of a domain, which may be reused, shared, and operationalized across applications and groups.

A good ontology should represent domain-specific knowledge explicitly. The question is how we know that an ontology is good. The answer is an ontology benchmark. There are plenty of benchmark studies in other fields, such as databases or compilers. However, there are no specific benchmark studies or tools for evaluating ontology-based applications. In fact, there is still no guideline for evaluating ontologies and related technologies.

In this section, we first introduce the role of ontologies in intelligent information integration. We then discuss a major inference task, which is the main operation of an ontology benchmark. Finally, we review ontology-related benchmark work.

2.4.1 Ontology and Intelligent Information Integration

Traditional integration approaches use inexpressive models, such as database schemas or XML trees, to integrate heterogeneous data sources, which causes many semantic heterogeneity problems. Ontologies provide much richer modeling means, with classes and properties organized into an is-a hierarchy and enriched with axioms and relations that can be processed by inference (Maier, Aguado, Bernaras, Laresgoiti, Pedinaci, Pena, & Smithers, 2003). In almost all ontology-based integration approaches, ontologies are used for the explicit description of the information source semantics. With respect to the integration of data sources, they can be used for the identification and association of semantically corresponding information concepts.

Some approaches use ontologies not only for content explication, but also either as a global query model or for the verification of the (user-defined or system-generated) integration description (Wache, Vögele, Visser, Stuckenschmidt, Schuster, Neumann, & Hübner, 2001). Ontologies are usually expressed in a logic-based language, so that fine, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations. Therefore, ontologies not only have the expressiveness needed to model the data in the sources, but their reasoning ability can also help in selecting the sources that are relevant for a query of interest and in specifying the extraction process.
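The use of an is-a hierarchy to select relevant sources can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the class names, the source names, and the annotation of each source with a single concept are all hypothetical, and a real system would use an ontology language and a reasoner rather than a dictionary.

```python
# Hypothetical is-a hierarchy: child class -> parent class.
IS_A = {
    "Laptop": "Computer",
    "Desktop": "Computer",
    "Computer": "Product",
    "Book": "Product",
}

# Hypothetical data sources, each annotated with the concept it describes.
SOURCES = {
    "vendor_a.xml": "Laptop",
    "vendor_b.db": "Desktop",
    "library.xml": "Book",
}


def is_subclass_of(cls, ancestor):
    """Walk up the is-a hierarchy from cls and report whether ancestor is reached."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = IS_A.get(cls)
    return False


def relevant_sources(query_concept):
    """Sources whose concept equals the query concept or is a subclass of it."""
    return sorted(
        name
        for name, concept in SOURCES.items()
        if is_subclass_of(concept, query_concept)
    )


if __name__ == "__main__":
    print(relevant_sources("Computer"))
```

A query phrased in terms of Computer reaches both vendor sources even though neither is annotated with that exact term, which is the kind of semantic correspondence that a purely structural model cannot supply.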

2.4.2 Ontology and Benchmark

To the best of our knowledge, the benchmark presented here is the first one for ontology-based intelligent information integration. The ontology benchmark model in this research differs from database benchmarks such as the Wisconsin benchmark, the OO7 benchmark, and the BUCKY benchmark. These are all DBMS-oriented storage benchmarks and include no inference ability.

In this research, the ontology workload model is applied to an intelligent information integration system, and we focus on the inference ability of the ontology. Ontology and XML are often found together and are often confused. XML is a standard for marking up documents, that is, for adding additional information, called metadata, to them. The purpose of XML is to tag textual information with additional structure that enables it to be “understood” and exchanged by programs. However, XML tags still require humans to interpret their meaning. XML benchmarks therefore focus only on the structural and syntactic evaluation of systems and do not address semantics. An ontology benchmark, on the other hand, is devoted to capturing the semantic expressions in the system. Thus, ontology and XML are complementary technologies: an ontology provides the meaning for XML standards, and XML provides a valuable medium for information exchange between programs that share the same ontology.

As mentioned above, there is still no guideline for evaluating ontology-based applications. Horrocks and Patel-Schneider (1998) benchmark description logic systems, or so-called knowledge bases. Description logics (DLs) are a family of knowledge representation languages that can be used to represent the knowledge of an application domain in a structured and formally well-understood way. Description logic systems provide their users with various inference capabilities that deduce implicit knowledge from the explicitly represented knowledge. Horrocks and Patel-Schneider evaluate the reasoning algorithms of description logics.

The terminological part (Tbox) is a set of axioms describing the structure of the domain. The assertional part (Abox) is a set of axioms describing a concrete situation (Horrocks, 2002). Both relate to this research: in an intelligent information integration system, the ontology can be viewed as the Tbox and the heterogeneous data as the Abox. However, the logic described is only a subset of ontology languages such as DAML+OIL and OWL. DAML+OIL and OWL can be seen as equivalent to a very expressive description logic. They provide more constructors and allow more axioms than traditional description logics. Therefore, the inference services of an ontology are more complex than those of traditional description logic systems.
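The division of labor between Tbox and Abox, and the idea of deducing implicit knowledge from explicit knowledge, can be illustrated with a deliberately small Python sketch. It handles only subclass axioms and class-membership assertions, which is far less than what DAML+OIL or OWL reasoners support; the class and individual names are hypothetical.

```python
# Tbox: structural axioms, here only (subclass, superclass) pairs.
TBOX = [
    ("Professor", "Faculty"),
    ("Faculty", "Person"),
    ("Student", "Person"),
]

# Abox: assertions about concrete individuals, here (individual, class) pairs.
ABOX = [
    ("alice", "Professor"),
    ("bob", "Student"),
]


def superclasses(cls, tbox):
    """All classes reachable from cls through subclass axioms (transitive closure)."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in tbox:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def infer_types(abox, tbox):
    """Asserted plus inferred class memberships for every individual."""
    types = {}
    for individual, cls in abox:
        types.setdefault(individual, set()).update({cls} | superclasses(cls, tbox))
    return types


def instances_of(cls, abox, tbox):
    """Instance retrieval: individuals that belong to cls, explicitly or implicitly."""
    return sorted(i for i, ts in infer_types(abox, tbox).items() if cls in ts)


if __name__ == "__main__":
    print(instances_of("Person", ABOX, TBOX))
```

Neither individual is asserted to be a Person, yet both are retrieved for that class. A storage-oriented benchmark would not exercise this step, whereas an inference-oriented workload measures exactly this kind of derivation.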
