Matchers in the Approach - Mapping Relational Databases to Ontologies: A Method Combining Rich Semantics and Mapping Consistency

Before describing the method for mapping foreign keys to object properties, we first discuss the matchers and the sources of similarity used in this approach.

4.4.1 Matchers

Various matching techniques were introduced in Chapter 2, in particular string-based and WordNet-based matchers. Mapping is usually a complex process, and a single kind of matcher often cannot solve the mapping problem: different situations arise when matching different information structures, and a matcher is frequently designed for a specific application. For this reason, some matching systems use composite or hybrid matchers. A composite matcher combines the results of multiple single matchers, whereas a hybrid matcher incorporates the features of several matchers into one compound matcher. Furthermore, most of the time spent on a mapping problem goes into computing the similarities between entities, so an efficient matcher is required. In this approach, we use a hybrid matcher that combines the Lin and I-Sub measures. These two measures are complementary: Lin can be used as a WordNet-based matcher and I-Sub as a string-based matcher, and the design of I-Sub is itself derived from the intuitions behind Lin. We now explain this hybrid matcher.

In the field of information retrieval (IR), the information content of a word w in a document can be represented by the following formula, where P(w) is the probability of encountering w.

$$IC(w) = \log \frac{1}{P(w)}$$
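As a small illustration of this formula, the sketch below estimates P(w) as a relative frequency in a toy word list and computes IC(w) with the natural logarithm. The function name `information_content`, the sample sentence, and the frequency-based estimate of P(w) are assumptions made for illustration; the text does not specify how P(w) is estimated or which logarithm base is used.

```python
import math
from collections import Counter


def information_content(word: str, counts: Counter) -> float:
    """Return IC(w) = log(1 / P(w)), with P(w) estimated as a relative frequency."""
    total = sum(counts.values())
    if counts[word] == 0:
        raise ValueError(f"{word!r} does not occur, so P(w) = 0 and IC is undefined")
    probability = counts[word] / total
    return math.log(1.0 / probability)


# Toy "document" (assumption): 8 words, "the" occurs 3 times, "cat" once.
counts = Counter("the cat sat on the mat the end".split())

# A rarer word carries more information than a frequent one.
print(information_content("cat", counts) > information_content("the", counts))  # True
```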

Intuitively, two concepts are similar if they share information content. Resnik [34] therefore uses the following equation to evaluate the similarity between two concepts.

Definition 4.4.1 (Resnik Similarity).

$$\mathrm{Sim}(c_1, c_2) = \max_{c \in \mathrm{Sup}(c_1, c_2)} \left[ -\log P(c) \right]$$

where P(c) is the probability of encountering an instance of concept c and Sup(c₁, c₂) is the set of concepts that subsume both c₁ and c₂.
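The definition can be sketched on a toy taxonomy as follows. The taxonomy `PARENT`, the probabilities in `P`, and the function names are hypothetical values chosen for illustration, and the sketch assumes that a concept counts as a superconcept of itself.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "animal": "entity",
    "dog": "animal",
}

# Hypothetical probabilities of encountering an instance of each concept.
# An instance of a child is also an instance of its parent, so the root has P = 1.
P = {"entity": 1.0, "vehicle": 0.4, "car": 0.25, "bicycle": 0.15,
     "animal": 0.6, "dog": 0.6}


def superconcepts(concept: str) -> set:
    """Return the concept together with all of its ancestors."""
    result = set()
    while concept is not None:
        result.add(concept)
        concept = PARENT[concept]
    return result


def resnik_similarity(c1: str, c2: str) -> float:
    """Maximum information content over all concepts subsuming both c1 and c2."""
    shared = superconcepts(c1) & superconcepts(c2)
    return max(-math.log(P[c]) for c in shared)


print(round(resnik_similarity("car", "bicycle"), 3))  # 0.916, via "vehicle"
```

Concepts that share only the root, such as "car" and "dog" here, receive a similarity of zero because the root carries no information.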

If all the words in the WordNet tree are regarded as a document, and each word in the tree is seen as a concept, then the similarity of any two words can be determined from the information content they share in the WordNet tree.

Since Resnik's equation considers only the commonality between two concepts, Lin [25] proposed a similarity measure that normalizes Resnik's equation, based on three intuitions. He describes these intuitions as follows.

• Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.

• Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.

• Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

The commonality or difference between two concepts can be quantified using their information contents; hence Lin extended Resnik's equation to the following formula:

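$$\mathrm{Sim}_{lin}(c_1, c_2) = \frac{2 \cdot IC(LCS)}{IC(c_1) + IC(c_2)}$$

where LCS is the least common subsumer of c₁ and c₂.

Stoilos et al. [36] proposed a string-based matcher called I-Sub. The main idea of I-Sub comes from the Lin measure: the similarity between two entities is related to their commonalities as well as to their differences. Hence the similarity of two strings s₁ and s₂ is defined in [36] by the following equation:

$$\mathrm{Sim}(s_1, s_2) = \mathrm{Comm}(s_1, s_2) - \mathrm{Diff}(s_1, s_2) + \mathrm{winkler}(s_1, s_2)$$

where Comm(s₁, s₂) measures the commonality of the two strings, Diff(s₁, s₂) measures their difference, and winkler(s₁, s₂) is the improvement of the result obtained with Winkler's method.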

Note that in our approach we omit the term Diff(s₁, s₂), because, unlike entity names in OWL ontologies, entity names in relational databases are often incomplete or abbreviated. Furthermore, I-Sub was designed for mapping between OWL ontologies. Therefore, when mapping between a relational database and an OWL ontology, the Diff(s₁, s₂) term would hide the similarity of some entity pairs that should score highly.
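A minimal sketch of a commonality-only string score in the spirit of I-Sub with the difference term left out is shown below. It assumes that the commonality is twice the total length of the common substrings, found by repeatedly removing the longest common substring, divided by the sum of the two string lengths, which is our reading of [36]. Any further correction terms of I-Sub are also left out for brevity, and the function names and the `min_len` parameter are hypothetical.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def commonality(s1: str, s2: str, min_len: int = 1) -> float:
    """Commonality-only score in [0, 1]; the difference term is deliberately omitted."""
    s1, s2 = s1.lower(), s2.lower()
    total_len = len(s1) + len(s2)
    if total_len == 0:
        return 0.0
    shared = 0
    while True:
        common = longest_common_substring(s1, s2)
        if not common or len(common) < min_len:
            break
        shared += len(common)
        # Remove the matched substring from both strings and search again.
        s1 = s1.replace(common, "", 1)
        s2 = s2.replace(common, "", 1)
    return 2.0 * shared / total_len


print(commonality("trans", "traffic"))  # 0.5: only "tra" (3 characters) is shared
```

Because the unmatched remainders ("ns" and "ffic" in this example) are not penalized, an abbreviated database name keeps the full credit for the part it shares with the ontology name.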

We now use the two strings “hasTransportation” and “traffic” as an example to describe how the hybrid matcher combining the Lin and I-Sub measures computes similarity in our approach. Before the similarity of the two strings is calculated, they must first be tokenized, as described below.

• Tokenization: In information retrieval, a string is treated as a set of words, also called a bag of words. In computer science, moreover, the names of variables or real-world entities are often composed of multiple words. Hence, to capture the meaning of a string more accurately, we tokenize it by segmenting it into a sequence of tokens. In this approach, tokenization consists of four steps:

– Replacing punctuation characters with blanks.

– Treating an upper-case letter as the beginning of a word.

– Removing digits.

– Removing stopwords.
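The four steps above can be sketched as follows. The stopword list and the exact regular expressions are assumptions made for illustration (the text does not list the stopwords used), and the underscore is treated as punctuation.

```python
import re

# Hypothetical stopword list; the actual list used in the approach is not given.
STOPWORDS = frozenset({"has", "is", "of", "the", "a", "an", "and", "to", "in"})


def tokenize(name: str) -> list:
    """Split an entity name into lower-case tokens using the four steps."""
    # 1. Replace punctuation characters (including "_") with blanks.
    text = re.sub(r"[^\w\s]|_", " ", name)
    # 2. Treat an upper-case letter that follows a lower-case letter or a digit
    #    as the beginning of a new word.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # 3. Remove digits.
    text = re.sub(r"\d+", "", text)
    # 4. Remove stopwords.
    return [token for token in text.lower().split() if token not in STOPWORDS]


print(tokenize("hasTransportation"))  # ['transportation']
print(tokenize("traffic"))            # ['traffic']
```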

After tokenization, “hasTransportation” becomes “transportation” and “traffic” is unchanged. We then use our hybrid matcher to compute the similarity between these two strings. Both “transportation” and “traffic” can be found in WordNet; their information contents are 9.054 and 10.170 respectively, and the information content of their LCS is 5.894. Therefore Sim_lin(transportation, traffic) = 2 × 5.894 / (9.054 + 10.170) ≈ 0.613. Note that all information contents are computed in advance [28]. If the example is changed to “hasTrans” and “traffic”, “hasTrans” becomes “trans” after tokenization. In this case I-Sub computes the similarity, since “trans” cannot be found in WordNet.
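The dispatch between the two measures can be sketched as below, using the information contents quoted in the example. The lookup tables `IC` and `LCS_IC`, the function names, and the fallback to the string matcher when no LCS entry is available are assumptions; `string_fallback` uses Python's `difflib` ratio purely as a stand-in and is not the I-Sub measure.

```python
from difflib import SequenceMatcher

# Precomputed information contents, taken from the worked example in the text.
IC = {"transportation": 9.054, "traffic": 10.170}
# Information content of the least common subsumer for each known pair.
LCS_IC = {frozenset(("transportation", "traffic")): 5.894}


def lin_similarity(ic_c1: float, ic_c2: float, ic_lcs: float) -> float:
    """Lin measure: 2 * IC(LCS) / (IC(c1) + IC(c2))."""
    return 2.0 * ic_lcs / (ic_c1 + ic_c2)


def string_fallback(t1: str, t2: str) -> float:
    """Stand-in string matcher (difflib ratio); replace with I-Sub in practice."""
    return SequenceMatcher(None, t1, t2).ratio()


def hybrid_similarity(t1: str, t2: str, string_matcher=string_fallback) -> float:
    """Use Lin when both tokens are known to WordNet, otherwise a string matcher."""
    if t1 in IC and t2 in IC:
        # Identical tokens are their own least common subsumer.
        lcs_ic = IC[t1] if t1 == t2 else LCS_IC.get(frozenset((t1, t2)))
        if lcs_ic is not None:
            return lin_similarity(IC[t1], IC[t2], lcs_ic)
    return string_matcher(t1, t2)


print(round(hybrid_similarity("transportation", "traffic"), 3))  # 0.613
```

A call such as `hybrid_similarity("trans", "traffic")` takes the string-based branch, because “trans” has no entry in the information content table.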

4.4.2 Source of Computing the Similarity

To improve mapping performance, many studies make full use of the information related to the mapped entities. In general, this information can be summarized as follows.

• Local similarity: Some of the information used to describe an entity is local information. For instance, an entity is usually given a name that conveys its meaning, and this name is the local description of the entity. Hence, many approaches use matchers to compute the similarity of entities from their local descriptions.

• Internal similarity: Similar entities usually have similar internal structures. In other words, two entities with similar internal structures are likely to match each other. For instance, we can compute the internal similarity from the datatypes of two entities, such as “int” and “float”, and the result then influences the final similarity measure between the two entities.

• External (relational) similarity: As with internal similarity, two entities are likely to match each other if their neighbors are similar. We compute the similarity of their neighbors as the external similarity, which then influences the final similarity measure between the two entities.
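One possible way to bring the three sources together is sketched below. The text does not state how the sources are aggregated, so the weighted average, the weights, the datatype compatibility table, and the best-match averaging over neighbors are all assumptions made for illustration.

```python
# Hypothetical compatibility scores between datatypes (symmetric).
DATATYPE_SIM = {
    frozenset(("int", "float")): 0.8,
    frozenset(("int", "string")): 0.1,
}


def internal_similarity(type1: str, type2: str) -> float:
    """Similarity of two entities judged by their datatypes."""
    if type1 == type2:
        return 1.0
    return DATATYPE_SIM.get(frozenset((type1, type2)), 0.0)


def external_similarity(neighbors1, neighbors2, local_sim) -> float:
    """Average, over the first entity's neighbors, of the best match among the second's."""
    if not neighbors1 or not neighbors2:
        return 0.0
    best = [max(local_sim(a, b) for b in neighbors2) for a in neighbors1]
    return sum(best) / len(best)


def combined_similarity(local: float, internal: float, external: float,
                        weights=(0.5, 0.2, 0.3)) -> float:
    """Weighted average of the local, internal, and external similarities."""
    w_local, w_internal, w_external = weights
    total = w_local + w_internal + w_external
    return (w_local * local + w_internal * internal + w_external * external) / total


def exact(a: str, b: str) -> float:
    """Trivial local matcher used only for this demonstration."""
    return 1.0 if a == b else 0.0


score = combined_similarity(
    local=0.6,
    internal=internal_similarity("int", "float"),
    external=external_similarity(["city"], ["city", "road"], exact),
)
print(0.0 <= score <= 1.0)  # True
```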
