Before describing the method for mapping foreign keys to object properties, we first discuss the matchers and the sources used to compute similarity in this approach.
4.4.1 Matchers
Various matching techniques were introduced in Chapter 2, in particular the string-based and WordNet-string-based matchers. Mapping is usually a complex process, and a single kind of matcher often cannot solve the mapping problem: different situations arise when matching different information structures, and a matcher is frequently designed for a specific application. For this reason, some matching systems use composite or hybrid matchers. A composite matcher combines the results of multiple single matchers, whereas a hybrid matcher incorporates the features of several matchers into one compound matcher. Furthermore, most of the time spent on the mapping problem goes into computing the similarities between entities, so an efficient matcher is required. In this approach, we use a hybrid matcher combining the Lin and I-Sub measures. These two measures are complementary: Lin can be used as a WordNet-based matcher and I-Sub as a string-based matcher, and the design of I-Sub derives from the intuitions behind Lin. We now explain this hybrid matcher.
In the field of information retrieval (IR), the information content of a word w in a document can be represented by the following formula.
$$IC(w) = \log\left(\frac{1}{P(w)}\right)$$
Intuitively, two concepts are similar if they share some information content; hence Resnik [34] uses the following equation to evaluate the similarity between two concepts.
Definition 4.4.1 (Resnik Similarity).
$$Sim(c_1, c_2) = \max_{c \in Sup(c_1, c_2)} \left[ -\log P(c) \right],$$
where $P(c)$ is the probability of encountering an instance of concept $c$ and $Sup(c_1, c_2)$ is the set of superconcepts of $c_1$ and $c_2$.
If all words in the WordNet tree are regarded as a document, the similarity of two words can be determined by their common information content. Moreover, since a word in the WordNet tree can be seen as a concept, the similarity of any two words can be derived from their common information content in the WordNet tree.
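The information content formula and Definition 4.4.1 can be illustrated with a small sketch. The taxonomy, the concept names, and the probabilities below are hypothetical values chosen for illustration only; they are not taken from WordNet. The sketch also assumes that a concept counts as a superconcept of itself.

```python
import math
from typing import Dict, Optional, Set

# Hypothetical toy taxonomy: each concept maps to its direct parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "bicycle": "vehicle",
    "fruit": "entity",
}

# Hypothetical probabilities of encountering an instance of each concept.
# P(c) grows towards the root, which covers every instance (P = 1).
P: Dict[str, float] = {
    "entity": 1.0,
    "vehicle": 0.25,
    "car": 0.125,
    "bicycle": 0.0625,
    "fruit": 0.5,
}


def information_content(concept: str) -> float:
    """IC(c) = log(1 / P(c))."""
    return math.log(1.0 / P[concept])


def superconcepts(concept: str) -> Set[str]:
    """Return the concept itself together with all of its ancestors."""
    result: Set[str] = set()
    current: Optional[str] = concept
    while current is not None:
        result.add(current)
        current = PARENT[current]
    return result


def resnik_similarity(c1: str, c2: str) -> float:
    """Maximum information content over the shared superconcepts of c1 and c2."""
    shared = superconcepts(c1) & superconcepts(c2)
    return max(information_content(c) for c in shared)
```

In this toy taxonomy, "car" and "bicycle" share the informative superconcept "vehicle", whereas "car" and "fruit" share only the root, whose information content is zero.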
Since Resnik's equation only considers the commonality between two concepts, Lin [25] proposed a similarity measure that normalizes Resnik's equation based on three intuitions. He describes these intuitions as follows.
• Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.
• Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.
• Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.
Commonality and difference between two concepts can be quantified using the information contents of the two concepts; hence Lin extended Resnik's equation to the following formula:
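$$Sim_{lin}(c_1, c_2) = \frac{2 \cdot IC(LCS)}{IC(c_1) + IC(c_2)},$$
where LCS is the least common subsumer of $c_1$ and $c_2$.
Stoilos et al. [36] proposed a string-based matcher called I-Sub. The main idea of I-Sub comes from the Lin measure: the similarity between two entities is related to their commonalities as well as to their differences. Hence the following equation is defined:
$$Sim(s_1, s_2) = Comm(s_1, s_2) - Diff(s_1, s_2) + winkler(s_1, s_2),$$
where $Comm(s_1, s_2)$ measures the commonality of the two strings, $Diff(s_1, s_2)$ measures their difference, and $winkler(s_1, s_2)$ is the improvement of the result using Winkler's method.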
Note that in our approach we ignore the term Diff(s1, s2). Unlike the names of entities in OWL ontologies, the names of entities in relational databases are often incomplete and abbreviated. Furthermore, I-Sub was designed for mapping between OWL ontologies. Therefore, when mapping between a relational database and OWL ontologies, the Diff(s1, s2) term would lower the similarity of some entity pairs whose similarity should be high.
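As a rough illustration of a string measure that keeps only the commonality part and omits the difference term, the sketch below repeatedly removes the longest common substring from both strings and relates the total removed length to the combined length of the two strings. This is a simplified reading of the commonality idea in I-Sub, not the reference implementation: the function names are hypothetical, substrings of any length are counted, and the Winkler correction is left out.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b ('' if none)."""
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common suffix of a[:i-1] and b[:j].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def commonality(s1: str, s2: str) -> float:
    """2 * (total length of removed common substrings) / (len(s1) + len(s2))."""
    s1, s2 = s1.lower(), s2.lower()
    total_len = len(s1) + len(s2)
    if total_len == 0:
        return 0.0
    shared = 0
    while True:
        sub = longest_common_substring(s1, s2)
        if not sub:
            break
        shared += len(sub)
        # Remove one occurrence of the common substring from each string.
        s1 = s1.replace(sub, "", 1)
        s2 = s2.replace(sub, "", 1)
    return 2.0 * shared / total_len
```

With this sketch, an abbreviated name such as "trans" still shares its whole length with "transportation", and no penalty is applied for the unmatched remainder "portation"; this is the behaviour that motivates dropping the difference term for database names.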
We now use the two strings “hasTransportation” and “traffic” as an example to describe how the hybrid matcher combining the Lin and I-Sub measures computes similarity in our approach. Before the similarity of these two strings is calculated, they must be tokenized as described below.
• Tokenization: In information retrieval, a string is treated as a set of words, also called a bag of words. In addition, computer scientists tend to use multiple words to name variables or real-world entities. Hence, to obtain a more accurate meaning of a string, we tokenize it by segmenting it into a sequence of tokens. In this approach, we apply four tokenizing steps:
– Replacing punctuation characters with blanks.
– Treating an upper-case letter as the beginning of a word.
– Removing digits.
– Removing stopwords.
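The four tokenizing steps can be sketched as follows. The stopword list is a small hypothetical one chosen for illustration, the function name `tokenize` is an assumption, and lower-casing the tokens is an added convenience that is not one of the four steps.

```python
import re
from typing import List

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"has", "is", "of", "the", "a", "an", "in", "to", "for", "by"}


def tokenize(name: str) -> List[str]:
    """Split an entity name into lower-case tokens using the four steps."""
    # Step 1: replace punctuation characters (including '_') with blanks.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", name)
    # Step 2: treat an upper-case letter that follows a lower-case letter
    # or a digit as the beginning of a new word.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Step 3: remove digits.
    text = re.sub(r"\d", "", text)
    # Step 4: remove stopwords (tokens are lower-cased for the comparison).
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]
```

For example, `tokenize("hasTransportation")` yields `["transportation"]`, because "has" is in the assumed stopword list, while `tokenize("traffic")` leaves the single token unchanged.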
After tokenizing, “hasTransportation” becomes “transportation” and “traffic” does not change. We then use our hybrid matcher to compute the similarity between these two strings. Since both “transportation” and “traffic” can be found in WordNet, the Lin measure is applied: their information contents are 9.054 and 10.170 respectively, and the information content of their LCS is 5.894. Therefore Simlin(transportation, traffic) = 2 · 5.894 / (9.054 + 10.170) ≈ 0.613. Note that all information contents are computed in advance [28]. If the example is changed to “hasTrans” and “traffic”, “hasTrans” becomes “trans” after tokenizing. In this case I-Sub computes the similarity, since “trans” cannot be found in WordNet.
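The dispatch between the two measures can be sketched as below. The lookup tables hold only the information contents quoted in this example and are hypothetical stand-ins for the precomputed WordNet data; the function names are assumptions. For the string-based branch, the sketch uses `difflib.SequenceMatcher` from the Python standard library purely as a placeholder for I-Sub, so its scores are not I-Sub scores.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, FrozenSet, Optional, Tuple

# Information contents quoted in the example; a real system would read them
# from precomputed WordNet data instead of a hand-written table.
IC: Dict[str, float] = {"transportation": 9.054, "traffic": 10.170}

# Information content of the least common subsumer for each known word pair
# (hypothetical lookup table).
LCS_IC: Dict[FrozenSet[str], float] = {
    frozenset({"transportation", "traffic"}): 5.894,
}


def lin_similarity(w1: str, w2: str) -> Optional[float]:
    """Lin similarity, or None when a word or its LCS is not in the tables."""
    if w1 not in IC or w2 not in IC:
        return None
    if w1 == w2:
        return 1.0
    lcs_ic = LCS_IC.get(frozenset({w1, w2}))
    if lcs_ic is None:
        return None
    return 2.0 * lcs_ic / (IC[w1] + IC[w2])


def placeholder_string_similarity(s1: str, s2: str) -> float:
    """Stand-in for the string-based measure (not I-Sub itself)."""
    return SequenceMatcher(None, s1, s2).ratio()


def hybrid_similarity(
    w1: str,
    w2: str,
    string_similarity: Callable[[str, str], float] = placeholder_string_similarity,
) -> Tuple[str, float]:
    """Use Lin when both words are known, otherwise fall back to strings."""
    sim = lin_similarity(w1, w2)
    if sim is not None:
        return "lin", sim
    return "string", string_similarity(w1, w2)


method, value = hybrid_similarity("transportation", "traffic")
print(method, round(value, 3))  # lin 0.613
```

Calling `hybrid_similarity("trans", "traffic")` takes the string-based branch, because "trans" is missing from the information content table, which mirrors the fallback described above.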
4.4.2 Source of Computing the Similarity
To improve mapping performance, many studies make full use of the information related to the mapped entities. In general, this information can be summarized as follows.
• Local similarity: Some of the information used to describe an entity is local information. For instance, we usually give an entity a name that conveys its meaning, and this name is the local description of the entity. Hence, many approaches use matchers to compute the similarity of entities from their local descriptions.
• Internal similarity: Similar entities usually have similar internal structures. In other words, if two entities have similar internal structures, they are more likely to match. For instance, we can compute the internal similarity from the datatypes of two entities, such as “int” and “float”, and the result can then influence the final similarity measure between the two entities.
• External (relational) similarity: As with internal similarity, two entities are more likely to match if their neighbors are similar. We compute the similarity of their neighbors as the external similarity, which then influences the final similarity measure between the two entities.