
2.4 Matching Techniques

2.4.4 Matching Systems

Many matching systems have emerged during the last decade. For example, DELTA (Data Element Tool-based Analysis) [8] is a system that semi-automatically discovers attribute correspondences among database schemas by using textual similarities between data element definitions. Cupid uses linguistic and structural schema matching techniques and computes similarity coefficients with the help of specific external thesauri.

The matching algorithm in Cupid consists of three phases. The first phase computes linguistic similarity coefficients between schema element names using matchers such as common prefix and suffix tests. The second phase computes structural similarity coefficients by measuring the similarity between the contexts of the elements. The third phase aggregates the results of the linguistic and structural matching through a weighted sum and generates a final alignment. In the following paragraphs, we introduce two other matching systems, GLUE and COMA.
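The first and third Cupid phases described above can be illustrated with a minimal Python sketch. It is not Cupid's actual implementation: the function names, the normalization by the longer name, and the default weight of 0.5 are assumptions made for illustration, and the structural coefficient is simply taken as an input.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def linguistic_similarity(a: str, b: str) -> float:
    """Toy name matcher based on common prefix and suffix tests.

    Assumption: the score is the longer of the shared prefix and the
    shared suffix, divided by the length of the longer name.
    """
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    prefix = common_prefix_len(a, b)
    suffix = common_prefix_len(a[::-1], b[::-1])
    return max(prefix, suffix) / max(len(a), len(b))


def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted sum of a linguistic and a structural coefficient.

    Assumption: w_struct is a hypothetical weight in [0, 1].
    """
    return w_struct * ssim + (1.0 - w_struct) * lsim


if __name__ == "__main__":
    lsim = linguistic_similarity("CustomerName", "CustName")
    # The structural coefficient 0.8 is a made-up input for this demo.
    print(weighted_similarity(lsim, 0.8))
```

A larger w_struct lets the context of an element outweigh its name, which helps when two schemas use unrelated naming conventions.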

Doan and colleagues [14] developed GLUE, a system that uses machine learning techniques to find mappings between two ontologies. For each concept in one ontology, GLUE finds the most similar concept in the other ontology. In contrast to other machine learning approaches, which use only a single similarity measure, GLUE uses probabilistic definitions of several practical similarity measures. Furthermore, GLUE uses multiple learning strategies, each of which exploits a different type of information, either in the data instances or in the taxonomic structure of the ontologies.

GLUE relies heavily on instances, which are used to compute the joint probability distribution of two concepts A and B, consisting of P(A, B), P(A, ¬B), P(¬A, B), and P(¬A, ¬B), where, for example, the term P(A, ¬B) is the probability that an instance belongs to concept A but does not belong to concept B. The joint probability distribution of the concepts is used in the similarity measures.

Thereby, the application can use the joint distribution to compute any suitable similarity measure. Because GLUE computes the joint probability distribution rather than estimating specific similarity values directly, it has the advantage that it can work with any similarity function that has an appropriate probabilistic interpretation.
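As an illustration of how a similarity measure can be derived from the joint distribution, the following sketch computes the four probabilities by counting and then evaluates the Jaccard coefficient, P(A, B) / (P(A, B) + P(A, ¬B) + P(¬A, B)). It rests on a simplifying assumption: the membership of every instance in both concepts is already known, whereas GLUE has to estimate it with its learners. The function names and dictionary keys are hypothetical.

```python
def joint_distribution(universe: set, a: set, b: set) -> dict:
    """Joint distribution of concepts A and B, estimated by counting.

    Assumption: a and b are subsets of universe, and membership of
    every instance in both concepts is known in advance.
    """
    n = len(universe)
    if n == 0:
        raise ValueError("universe must not be empty")
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = n - both - only_a - only_b
    return {
        "A,B": both / n,
        "A,notB": only_a / n,
        "notA,B": only_b / n,
        "notA,notB": neither / n,
    }


def jaccard_similarity(p: dict) -> float:
    """Jaccard coefficient P(A and B) / P(A or B) from the joint distribution."""
    union = p["A,B"] + p["A,notB"] + p["notA,B"]
    return p["A,B"] / union if union > 0 else 0.0


if __name__ == "__main__":
    universe = set(range(10))
    p = joint_distribution(universe, a={0, 1, 2, 3}, b={2, 3, 4})
    print(p, jaccard_similarity(p))
```

Any other measure with a probabilistic interpretation could replace jaccard_similarity here without changing joint_distribution.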

GLUE has three kinds of learners: a content learner, a name learner, and a meta-learner. The content learner uses a text classification method called Naive Bayes learning. The name learner is similar to the content learner but uses the full name of the instance instead of its content. The meta-learner combines the predictions of the other two learners. The architecture of GLUE is composed of three main modules (Figure 2.3 [14]):

• Distribution Estimator: Once two ontologies and their data instances have been passed to the Distribution Estimator, it applies machine learning techniques to compute the joint probability distribution for every pair of concepts.

• Similarity Estimator: The Similarity Estimator uses the results (probability values) from the Distribution Estimator to compute a similarity value for each pair of concepts. The output is a similarity matrix, which is fed to the Relaxation Labeler.

• Relaxation Labeler: The Relaxation Labeler contains domain-specific constraints and heuristic knowledge that support the discovery of the most appropriate mapping. By taking the domain-specific constraints and heuristic knowledge into account, the Relaxation Labeler can find a mapping that not only has high similarity but also satisfies the domain constraints and the common knowledge. This mapping is the output of GLUE.

Figure 2.3: The GLUE Architecture

Do and Rahm designed a schema matching system, COMA [13]. Based on the idea that a single technique is unlikely to achieve high match accuracy for a large variety of schemas, COMA uses a composite match approach to combine match techniques comprehensively and effectively. Moreover, COMA is a generic match system supporting different applications and multiple schema types, such as XML and relational schemas. An extended system, COMA++ [4], introduced in 2005, also supports mapping between database schemas and ontologies.

Match processing in COMA consists of one or more iterations. Each iteration consists of three phases (Figure 2.4 [13]):

• After the two input data structures are converted to graphs, the user can use the Feedback matcher to capture match and mismatch information, including corrected match results from the previous match iteration.

Figure 2.4: Match processing in COMA

• In this step, multiple independent matchers chosen from the matcher library are executed. With k matchers, m S1 elements, and n S2 elements, the result is a k × m × n cube of similarity values.

• Finally, the matcher results stored in the similarity cube are combined through the aggregation of matcher-specific results and the selection of match candidates.
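The last two phases can be sketched in Python as follows. This is a minimal illustration, not COMA's implementation: averaging over matchers and keeping the best candidate per S1 element above a threshold are only two possible aggregation and selection strategies, and the function names and the threshold value are assumptions.

```python
def aggregate(cube: list) -> list:
    """Collapse a k x m x n similarity cube into an m x n matrix.

    Assumption: matcher-specific values are combined by a plain average.
    """
    k = len(cube)
    m = len(cube[0])
    n = len(cube[0][0])
    return [
        [sum(cube[x][i][j] for x in range(k)) / k for j in range(n)]
        for i in range(m)
    ]


def select_candidates(matrix: list, threshold: float = 0.5) -> list:
    """Keep, for each S1 element, its best S2 element if it reaches the threshold.

    Returns (s1_index, s2_index, similarity) triples.
    """
    candidates = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            candidates.append((i, j, row[j]))
    return candidates


if __name__ == "__main__":
    # Two hypothetical matchers (k = 2), two S1 and two S2 elements.
    cube = [
        [[0.8, 0.2], [0.1, 0.6]],
        [[0.6, 0.4], [0.3, 0.2]],
    ]
    print(select_candidates(aggregate(cube)))
```

Keeping the full cube until this step means that the aggregation strategy can be changed without rerunning the individual matchers.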

Currently, the matchers in COMA are classified into three kinds: simple, hybrid, and reuse-oriented matchers. Simple matchers consist of string-based matching techniques, and hybrid matchers combine simple matchers and other hybrid matchers to obtain more accurate similarity values. The reuse-oriented matcher is a new matcher introduced by COMA; it is based on the observation that many schemas to be matched are similar to previously matched schemas. The reuse-oriented matcher operates as follows:

Given two match results, match1: S1 ↔ S2 and match2: S2 ↔ S3, that share schema S2, the MatchCompose operation derives a new match result, match: S1 ↔ S3.

Note that the MatchCompose operation computes the similarity using the Average strategy: given n as the similarity between a and b, and m as the similarity between b and c, the similarity between a and c is (n + m)/2.
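The composition can be sketched in Python as follows, with each match result represented as a dictionary from element pairs to similarity values. The representation and the function name are assumptions made for illustration. The text does not say what happens when a and c are connected through several elements of S2; this sketch keeps the highest averaged value in that case, which is also an assumption.

```python
def match_compose(match1: dict, match2: dict) -> dict:
    """Derive S1 <-> S3 correspondences from S1 <-> S2 and S2 <-> S3.

    match1 maps (a, b) pairs to similarities, match2 maps (b, c) pairs
    to similarities. Two correspondences are composed when they share
    the same S2 element b, using the Average strategy (n + m) / 2.
    """
    result = {}
    for (a, b1), n in match1.items():
        for (b2, c), m in match2.items():
            if b1 != b2:
                continue
            sim = (n + m) / 2
            # Assumption: with several connecting paths, keep the best one.
            if sim > result.get((a, c), 0.0):
                result[(a, c)] = sim
    return result


if __name__ == "__main__":
    # Hypothetical element names for three schemas S1, S2, and S3.
    match1 = {("S1.custName", "S2.name"): 0.5}
    match2 = {("S2.name", "S3.fullName"): 0.7}
    print(match_compose(match1, match2))
```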

Chapter 3

Preliminaries

Since our work aims at mapping between relational databases and ontologies, in this chapter we first introduce the preliminaries of relational databases and OWL.

3.1 Relational Database

A database is a collection of related data, or relations, grouped together as a single file. How these relations are related to each other depends on the database model that is applied. The primary database models are as follows:

• Hierarchical Model: Relations are related in a parent/child tree, with each child relation having at most one parent relation.

• Network Model: Similar to the hierarchical model, but each relation can have more than one parent relation.

• Relational Model: Relations are related to each other by sharing a common attribute. The next section describes the relational model in detail.
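To make the idea of relating relations through a common attribute concrete, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are hypothetical and serve only as an illustration.

```python
import sqlite3


def employees_with_departments() -> list:
    """Join two relations on their shared attribute dept_id."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)"
        )
        conn.execute(
            "CREATE TABLE employee ("
            "emp_id INTEGER PRIMARY KEY, name TEXT, "
            "dept_id INTEGER REFERENCES department(dept_id))"
        )
        conn.executemany(
            "INSERT INTO department VALUES (?, ?)",
            [(1, "Research"), (2, "Sales")],
        )
        conn.executemany(
            "INSERT INTO employee VALUES (?, ?, ?)",
            [(10, "Alice", 1), (11, "Bob", 2)],
        )
        # The common attribute dept_id relates the two relations.
        return conn.execute(
            "SELECT e.name, d.name FROM employee e "
            "JOIN department d ON e.dept_id = d.dept_id "
            "ORDER BY e.emp_id"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(employees_with_departments())
```

Unlike the hierarchical and network models, no parent/child link is stored here; the relationship exists only through the matching values of dept_id.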

In the past, storing data in a database system always meant adapting the data to either the hierarchical or the network model mentioned above. This changed with the work of Edgar F. Codd, whose study of the relational model [12] fundamentally revolutionized data storage and access. Today, relational databases have become the most popular way to store data for nearly any application, especially on the Web.