Since the development of Web technologies, the number of Web pages has increased rapidly. According to Google, there are currently well over 15.5 billion Web pages [32]. These Web pages contain all kinds of information; however, the vast majority of this information is available only in a human-understandable format such as HTML. A more precise format is the eXtensible Markup Language (XML). Although XML provides a set of self-defined metadata tags to describe the semantics of Web data, it does not define the meaning of the tags. As a consequence, Web content can be accessed only at the syntactic level, and software agents or machines cannot efficiently understand and process this kind of data.
To address this problem, Tim Berners-Lee proposed the vision of the Semantic Web in 2001 [7] as follows: “The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” To meet the requirements of the Semantic Web, the World Wide Web Consortium (W3C) and other organizations have been working on specifying and developing standard languages. The Semantic Web will be built on the standard layers shown in Figure 2.1 [22]. In the next section, we present ontologies, which are the core component of these layers.
Figure 2.1: Semantic Web Layers
2.1.1 Ontology
The term “ontology” comes from philosophy, and there are plenty of definitions; the most popular is “An ontology is a formal, explicit specification of a shared conceptualization” [17]. Ontology is a key enabling technology for the Semantic Web. Unlike XML, ontologies provide not only syntax but also semantics. With ontologies, we can build a controlled vocabulary of concepts, each with explicitly defined and machine-understandable semantics. Thereby, people and machines can communicate precisely.
For this reason, interest in ontologies has increased during the last decade. The area of applicability for ontologies is wide: information retrieval and extraction, information systems design and enterprise integration, natural language processing, database design, and conceptual modeling. In the Semantic Web area, a formal knowledge representation model is also called an ontology.
However, constructing ontologies for domain knowledge is still tedious, time-consuming and error-prone. Although tools such as Protege [38] support manual editing of ontologies, fully automatic ontology building remains in the distant future. To make ontology construction more efficient, Maedche and Staab presented an ontology-learning framework [27] that encompasses ontology import, extraction, pruning, refinement, and evaluation. An overview of the framework is as follows:
• Merging existing structures or defining mapping rules between these structures allows importing and reusing existing ontologies.
• Ontology extraction models major parts of the target ontology, with learning support fed from Web documents.
• The target ontology’s rough outline, which results from import, reuse, and extraction, is pruned to better fit the ontology to its primary purpose.
• Ontology refinement profits from the pruned ontology but completes the ontology at a fine granularity (in contrast to extraction).
• The target application serves as a measure for validating the resulting ontology.
Furthermore, to provide a consistent format for modeling ontologies, the W3C has defined OWL (Web Ontology Language) [29] as the standard for Web ontologies. OWL is an ontology language for defining and conceptualizing Web content. We can conceptualize or model domain knowledge through OWL features such as classes and properties. Currently, OWL is built on top of RDF (Resource Description Framework) and is divided into three sub-languages of increasing expressiveness, namely OWL Lite, OWL DL and OWL Full. The details of OWL are given in Chapter 3.
2.2 Deep Annotation
As mentioned above, there are well over 15.5 billion Web pages [32]. Annotating these Web pages with OWL is a key step toward the goal of the Semantic Web. However, there is still a lack of automatic tools for annotating them, especially for dynamic Web pages. Dynamic Web pages, most of whose content comes from underlying databases, are also called the deep Web. In contrast, static Web pages are called the surface Web.
A July 2000 white paper [10] estimated that there were 43,000-96,000 deep Web sites and, informally, 7,500 terabytes of data, which is 500 times more than the surface Web.
However, the content of the deep Web can hardly be reached by search engines. To address this, deep annotation [18] has emerged. Deep annotation is a framework for providing semantic annotations for the underlying databases of the deep Web. This framework involves three parties: the Database and Web Site Provider, the Annotator, and the Query Party (see Figure 2.2 [18]). It assumes that many Web sites will in fact participate in the Semantic Web and will support the sharing of information. Hence, Web site providers will give their users information proper, information structure and information context.
Figure 2.2: An Architecture for Deep Annotation
The details of the architecture are as follows:
• Database and Web site providers mark up the dynamic Web pages according to the information structures of the database.
• The annotator produces client-side annotations conforming to the client ontology either via the marked HTML documents or database schemas.
• The annotator publishes the client ontology and the mapping rules derived from annotations.
• The annotator assesses and refines the mapping using certain guidelines.
• The query party uses the mappings to map the data into its own information structures, such as ontologies, and/or to migrate the data into its own repository.
2.3 Mapping between Relational Databases and OWL Ontologies
Because of the requirements of the Semantic Web and the problem of the deep Web, much research has focused on the mapping between relational databases and OWL ontologies [2] [24] [37] [1] [19]. In this field, mapping is the process of finding the semantic correspondences between entities or elements of the relational database and those of the OWL ontologies.
Currently there are some solutions and tools that deal with this problem. They can be classified into two groups: methods for creating ontologies from an existing relational database, and methods for mapping a relational database to already existing OWL ontologies. In the former case, if the generated ontologies are eventually mapped to domain-related ontologies, we believe that more semantic information is lost relative to the original database. For this reason, our work focuses on the latter case. In the next two sections, we discuss these two kinds of methods.
2.3.1 Extracting OWL ontologies from Relational Databases
Extracting OWL ontologies from relational databases is a form of reverse engineering, which is the process of analyzing an existing system; identifying system components, abstractions, and interrelationships; and creating representations of them. There is plenty of work on the reverse engineering of relational databases. However, most of it focuses on extracting E-R (entity-relationship) and object models, and the semantics obtained by these methods cannot fully meet the requirements of constructing ontologies. Only in recent years have a few approaches considered ontologies as the target of reverse engineering.
In this section, we introduce some studies in this field.
Astrova [2] presented a method to extract ontologies from a relational database. This method consists of two main processes: it first analyzes information such as key, data and attribute correlations to extract a conceptual schema, which expresses the semantics of the relational database, and then transforms this schema into a semantically equivalent ontology. Finally, data are migrated from the database to the ontology. Transforming the relational schema into a semantically equivalent ontology proceeds as follows:
1. Classification of relations: Depending on their features, relations can be classified into three categories:
• Base relations: If a relation is independent of any other relation in a relational database schema, it is a base relation.
• Dependent relations: If the primary key of a relation depends on another relation’s primary key, it is a dependent relation.
• Composite relations: A composite relation is a relation that is neither base nor dependent and whose primary key is composed of other relations’ primary keys.
2. Mapping: Key, data and attribute correlations can be described between two relations. Each kind of correlation has four types: equality, inclusion, overlap and disjointedness. By combining these types of correlation with the mapping constraints, it can be determined how to extract an ontology from a relational database schema.
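The classification in step 1 can be illustrated with a minimal Python sketch. It is not Astrova's implementation: the `Relation` structure, the example schema and the concrete decision rule (a relation is dependent when part of its primary key is a foreign key, and composite when its whole primary key consists of foreign keys to at least two relations) are assumptions that give one possible reading of the three definitions above.

```python
from dataclasses import dataclass, field


@dataclass
class Relation:
    """Hypothetical description of a table: name, key columns, foreign keys."""
    name: str
    primary_key: tuple
    # foreign-key column -> name of the referenced relation
    foreign_keys: dict = field(default_factory=dict)


def classify(rel):
    """Classify a relation as 'base', 'dependent' or 'composite' (assumed rule)."""
    fk_in_pk = [col for col in rel.primary_key if col in rel.foreign_keys]
    if not fk_in_pk:
        # The primary key does not rely on any other relation.
        return "base"
    referenced = {rel.foreign_keys[col] for col in fk_in_pk}
    if len(fk_in_pk) == len(rel.primary_key) and len(referenced) >= 2:
        # The whole primary key is made of other relations' primary keys.
        return "composite"
    # Part of the primary key is borrowed from another relation.
    return "dependent"


# Illustrative schema (all names are invented for this example).
schema = [
    Relation("Student", ("student_id",)),
    Relation("Course", ("course_id",)),
    Relation("Transcript", ("student_id", "term"), {"student_id": "Student"}),
    Relation("Enrollment", ("student_id", "course_id"),
             {"student_id": "Student", "course_id": "Course"}),
]
labels = {rel.name: classify(rel) for rel in schema}
print(labels)
```

Under these assumptions, a composite relation such as `Enrollment` would typically become a relationship between two classes, whereas base relations become classes of their own.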
Since the relational database schema often has little explicit semantics [33], additional “hidden” semantics (e.g. inheritance) can be discovered by analyzing the tuples in the relational database. However, this analysis is very time-consuming when the number of tuples in the database is large.
Astrova also presented a method [3] that constructs an ontology by analyzing HTML forms to extract a form model schema, transforming the form model schema into an ontology, and creating ontological instances from the data contained in the pages. The drawback of this approach is that it offers no way to identify inheritance relationships, which are a significant aspect of ontology construction. To overcome this drawback, Benslimane proposed an approach [9] to acquire ontologies from data-intensive Web sites. The main idea of this approach is that users often query databases through HTML forms and the query results are often returned as HTML tables. Thus, the data in HTML forms are often structured data, and important information can be obtained by analyzing an HTML form. Besides, this method uses an enriched relational schema instead of simply using the relational schema constructed from the database.
The processes of this method are as follows:
1. Extract forms schema by analyzing HTML pages. It uses several identification rules and translation rules to identify the form unit and generate the XML-schema.
2. Restructure and enrich the relational schema through the semantics of the forms schema. In this step, the resulting relational schema closely resembles the structure of the underlying database, but it has additional inclusion dependencies and constraints.
3. Construct an OWL ontology from the enriched relational schema using a set of transformation rules. These rules construct classes, properties, and inheritance from the semantic similarities between the relational schema and the ontology (OWL).
Man Li [24] extracted an ontology from a relational database using the E-R model. This approach defined twelve rules for extracting an ontology from the relational database schema and used these rules to create the ontology.
Trinh [37] proposed a tool named RDB2ONT that creates an ontology from a relational database. The tool uses an ontology to describe the relational database and converts the information in the database into this ontology.
2.3.2 Mapping Relational Databases to Existing OWL Ontologies
In this section, we discuss some approaches that directly map relational databases to OWL ontologies. These approaches assume that domain-related ontologies and legacy relational databases already exist. Databases and ontologies differ in domain level or size, and the modeling criteria used for designing databases also differ from those used for designing ontology models. Thus, compared with the approaches that extract ontologies from databases, mapping approaches are rarer and more complex.
Borgida et al. [1] proposed a method that assists users in specifying and inferring mapping formulas between relational databases and OWL ontologies. Based on simple correspondences between relational database schemas and OWL ontologies, complex formulas expressing the semantic mapping can be found. However, this method has the disadvantage that the database must be based on ER design principles, and it cannot extract classes that can be separated by the fields within a table. Besides, since only the schema is taken into account, without considering instances, it is hard to match precisely.
Hu and Qu presented an approach [19] that uses virtual documents based on the TF/IDF model to discover simple mappings between the relational database schema and OWL ontologies. Besides, this approach also finds subsumption relationships, called contextual mappings, which can be directly translated to conditional mappings or view-based mappings [11]. An overview of this approach is as follows:
1. Classifying entity types: This is a preprocessing step. It classifies the entities in the relational schema and the ontology into four different groups to limit the search space of candidate mappings. Besides, it reconciles the different characteristics of the relational schema and the ontology.
2. Discovering simple mappings: This step first constructs virtual documents for the entities in the relational schema and the ontology to capture their implicit semantic information. Then, it discovers simple mappings between entities by calculating the confidence measures between virtual documents via the TF/IDF model.
3. Validating mapping consistency: This phase uses mappings between relations and classes to validate the consistency of mappings between attributes and properties.
It considers the compatibility between data types of attributes and properties as well. In addition, some inference rules are also integrated in this process.
4. Constructing contextual mappings: This phase operates on the mappings between relations and classes found in the previous phases and supplies them with sample instances. It constructs a set of contextual mappings, which indicate the conditions under which they can be transformed into view-based mappings with selection conditions.
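The second step, discovering simple mappings from virtual documents, can be sketched in Python as follows. This is only an illustration and not the implementation of [19]: the virtual documents are hand-written token lists, the weighting tf × (1 + ln(N/df)) is one common TF/IDF variant chosen here as an assumption, and all entity names are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: entity name -> list of tokens. Returns entity name -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: in how many virtual documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * (1.0 + math.log(n_docs / df[term]))
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical virtual documents: an entity's own name plus its neighbours' names.
relational = {"author": ["author", "name", "email"],
              "paper": ["paper", "title", "year"]}
ontology = {"Person": ["person", "name", "email"],
            "Publication": ["publication", "title", "year"]}

vectors = tfidf_vectors({**relational, **ontology})
best = {
    rel: max(ontology, key=lambda cls: cosine(vectors[rel], vectors[cls]))
    for rel in relational
}
print(best)
```

In this toy input the relation `author` shares the terms `name` and `email` only with the class `Person`, so that class receives the highest confidence; a real system would additionally apply a threshold before accepting a mapping.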
Furthermore, some studies address related problems from other perspectives. For instance, Dou et al. [15] describe a general framework for integrating databases with ontologies via a first-order ontology language, Web-PDDL. Barrasa et al. [5] designed a language, R2O, to express complex mappings between relational database schemas and ontologies.
2.4 Matching Techniques
After introducing the mapping between relational databases and OWL ontologies, we now discuss some matching techniques that are very helpful in this field. Matching is the process that takes two data structures, such as schemas or ontologies, as input and calculates the similarity relationships between their entities or elements as output. It plays an important role in many application domains, such as the Semantic Web, schema/ontology integration, data warehouses, e-commerce, query mediation, and information retrieval.
Many matching approaches have been proposed so far. In this section, we introduce some of them, including string-based matching, WordNet-based matching, graph-based matching and some matching systems.
2.4.1 String-Based Matching
String-based matching is an element-level matching technique that takes information such as the local descriptions of two elements as input and returns their similarity measure as output. This technique is widely used in various schema matching systems [26] [13]. In this section, some popular string-based matchers are discussed.
• Edit distance: It determines the distance between two strings based on the minimal number of edit operations (insert, delete, replace) needed to transform one string into the other. There are various edit-distance matchers; for instance, the well-known Levenshtein distance [23] is the minimum number of insertions, deletions, and substitutions of characters required to transform one string into the other.
• Jaro measure: The Jaro measure was defined for matching proper names that may contain similar spelling mistakes [20]. It is not based on an edit-distance model, but on the number and proximity of the common characters between two strings. It is defined as follows: let s[i] ∈ com(s, t) if and only if ∃j ∈ [i − min(|s|, |t|)/2, i + min(|s|, |t|)/2] such that s[i] = t[j], and let transp(s, t) be the elements of com(s, t) that occur in a different order in s and t. Then
\[ \sigma_{\mathrm{Jaro}}(s,t) = \frac{1}{3} \times \left( \frac{|com(s,t)|}{|s|} + \frac{|com(t,s)|}{|t|} + \frac{|com(s,t)| - |transp(s,t)|}{|com(s,t)|} \right) \]
• N-gram similarity: The n-gram similarity is also often used for comparing strings. It computes the number of common n-grams, i.e., sequences of n characters, between them. For instance, the three-grams of the string article are: art, rti, tic, icl, cle. It is defined as follows: let ngram(s, n) be the set of substrings of s of length n. The n-gram similarity is a similarity σ such that:
\[ \sigma_{n\text{-gram}}(s,t) = \frac{|ngram(s,n) \cap ngram(t,n)|}{\min(|s|,|t|) - n + 1} \]
• Token-based distances: This technique comes from information retrieval and considers a string as a (multi)set of words (also called a bag of words). A very common measure is TF/IDF, and many systems use measures based on it. In this model, each string is represented as a vector containing weights for each term (word) of a global corpus. The similarity of two strings is then determined with the cosine measure.
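The first three matchers above can be sketched in a few lines of Python. The sketch is illustrative only. The Levenshtein and n-gram functions follow the definitions given above. The Jaro function is an assumption in the sense that it follows the commonly used formulation (a match window of max(|s|, |t|)/2 − 1 characters, one-to-one character matching, and half of the out-of-order matches counted as transpositions), which differs slightly from the window used in the definition above.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = curr
    return prev[-1]


def ngrams(s, n):
    """Set of all substrings of s of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(s, t, n=3):
    """Common n-grams, normalised by the n-gram count of the shorter string."""
    denom = min(len(s), len(t)) - n + 1
    if denom <= 0:
        return 0.0
    return len(ngrams(s, n) & ngrams(t, n)) / denom


def jaro(s, t):
    """Jaro measure in its common formulation (see the assumptions above)."""
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, cs in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == cs:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_common = [c for c, hit in zip(s, s_hit) if hit]
    t_common = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


print(levenshtein("kitten", "sitting"))  # 3
print(sorted(ngrams("article", 3)))      # the five three-grams listed above
print(round(jaro("MARTHA", "MARHTA"), 4))
```

The edit distance grows with string length, so in practice it is usually normalised (for example by the length of the longer string) before being combined with similarity measures in the range [0,1].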
2.4.2 WordNet
String-based matchers can only achieve syntactic matching. To achieve semantic matching, we need other tools such as WordNet. WordNet [31] is a large lexical database of English in which words are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Each synset has a gloss that defines the concept it represents. Synsets are connected to each other through explicit semantic relations, such as hypernymy and hyponymy for nouns and hypernymy and troponymy for verbs. These relations constitute kind-of and part-of hierarchies. For example, the word author has the synonym writer, and together they constitute a synset whose gloss is “writes (books or stories or articles or the like) professionally (for pay)”. The word communicator is a hypernym of this synset; hence, author, writer and communicator constitute a hierarchy. A matcher based on WordNet can therefore be designed using the following rules. Given two words s and t,
• t ⊑ s, if t is a hyponym or meronym of s
• t ⊒ s, if t is a hypernym or holonym of s
• t ≡ s, if they are connected by a synonymy relation or they belong to one synset.
• t ⊥ s, if they are connected by an antonymy relation or they are siblings in the part-of hierarchy.
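The rules above can be sketched as a small Python function. To keep the example self-contained, it does not query WordNet itself: the lexicon below is a tiny hand-built stand-in, and its synset identifiers, hypernym links and antonym pair are assumptions made for illustration. Only synonymy, hypernymy/hyponymy and antonymy are covered; meronymy, holonymy and the sibling rule are omitted.

```python
# Toy lexicon (hypothetical data standing in for WordNet).
SYNSET_OF = {
    "author": "writer.n", "writer": "writer.n",
    "communicator": "communicator.n",
    "person": "person.n",
    "hot": "hot.a", "cold": "cold.a",
}
HYPERNYM_OF = {"writer.n": "communicator.n", "communicator.n": "person.n"}
ANTONYM_PAIRS = {frozenset({"hot.a", "cold.a"})}


def ancestors(synset):
    """All synsets reachable by following hypernym links upwards."""
    found = []
    while synset in HYPERNYM_OF:
        synset = HYPERNYM_OF[synset]
        found.append(synset)
    return found


def relation(t, s):
    """Semantic relation of word t with respect to word s."""
    a, b = SYNSET_OF.get(t), SYNSET_OF.get(s)
    if a is None or b is None:
        return "unknown"
    if a == b:
        return "equivalent"      # t ≡ s: same synset
    if b in ancestors(a):
        return "less general"    # t ⊑ s: t is a hyponym of s
    if a in ancestors(b):
        return "more general"    # t ⊒ s: t is a hypernym of s
    if frozenset({a, b}) in ANTONYM_PAIRS:
        return "disjoint"        # t ⊥ s: antonyms
    return "unknown"


for pair in [("author", "writer"), ("author", "communicator"),
             ("person", "writer"), ("hot", "cold")]:
    print(pair, relation(*pair))
```

A matcher built on the real WordNet database would replace the three dictionaries with lookups in the lexical database and would also have to deal with words that belong to several synsets.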
2.4.3 Graph Matching
Graph matching is usually based on the is-a or part-of hierarchy of the relations in the graph. Many researchers have conducted studies in this field. Do et al. [13] calculate the similarity between internal nodes in a graph based on the similarity of their child nodes. In other words, the similarity between two classes is evaluated using the similarity of their child nodes rather than of terminal nodes.
Melnik et al. [30] presented a generic graph matching algorithm, Similarity Flooding (SF), that uses fixpoint computation to determine corresponding nodes between two graphs. The principle of the algorithm is that the similarity between two nodes must depend on the similarity between their neighbor nodes. The algorithm proceeds as follows:
1. First, the two given data structures S1 and S2 are translated into graphs G1 and G2.
2. Then, a matching technique such as a string-based matcher is used to obtain an initial mapping between G1 and G2.
3. Based on the assumption that whenever any two nodes in G1 and G2 are found to be similar, the similarity of their adjacent elements increases, this step runs a number of iterations, each of which recomputes the similarity between the nodes of G1 and G2, until a fixpoint has been reached, i.e. the similarities of all nodes stabilize.
4. Finally, depending on the chosen selection strategy, filters select the best mappings, which are then manually reviewed.
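Steps 2 and 3 can be illustrated with a simplified Python sketch. It is not the implementation of [30]. The following points are assumptions: graphs are given as lists of labelled edges, every node pair distributes its similarity equally among its neighbours in the pairwise connectivity graph, the update rule re-adds the initial similarities in every iteration so that they keep influencing the result, values are normalised by the maximum, and the initial similarities are invented numbers standing in for the output of a string matcher.

```python
import math
from collections import defaultdict


def similarity_flooding(edges1, edges2, initial, max_iter=100, eps=1e-6):
    """Simplified fixpoint propagation over pairs of nodes from two graphs.

    edges1, edges2: lists of (source, label, target) triples.
    initial: {(node1, node2): similarity} from an initial matcher.
    """
    # Pairwise connectivity graph: (a, b) and (a2, b2) are neighbours when
    # a -> a2 in graph 1 and b -> b2 in graph 2 carry the same edge label.
    neighbours = defaultdict(set)
    for a, label1, a2 in edges1:
        for b, label2, b2 in edges2:
            if label1 == label2:
                neighbours[(a, b)].add((a2, b2))
                neighbours[(a2, b2)].add((a, b))

    sigma0 = {pair: initial.get(pair, 0.0) for pair in neighbours}
    sigma = dict(sigma0)
    for _ in range(max_iter):
        # Mass available for propagation: initial plus current similarity.
        tau = {pair: sigma0[pair] + sigma[pair] for pair in sigma}
        nxt = dict(tau)
        for pair, nbrs in neighbours.items():
            share = tau[pair] / len(nbrs)
            for nb in nbrs:
                nxt[nb] += share
        top = max(nxt.values(), default=0.0)
        if top == 0.0:
            break
        nxt = {pair: value / top for pair, value in nxt.items()}
        delta = math.sqrt(sum((nxt[p] - sigma[p]) ** 2 for p in nxt))
        sigma = nxt
        if delta < eps:  # similarities have stabilised (fixpoint)
            break
    return sigma


def best_match(sigma, node):
    """Node of the second graph with the highest similarity to `node`."""
    candidates = {b: value for (a, b), value in sigma.items() if a == node}
    return max(candidates, key=candidates.get)


# Hypothetical input: a relation and a class, each with two attributes.
g1 = [("Person", "attr", "name"), ("Person", "attr", "email")]
g2 = [("Author", "attr", "fullname"), ("Author", "attr", "mail")]
initial = {("name", "fullname"): 0.6, ("email", "mail"): 0.8,
           ("name", "mail"): 0.1, ("email", "fullname"): 0.1}

sigma = similarity_flooding(g1, g2, initial)
for node in ("Person", "name", "email"):
    print(node, "->", best_match(sigma, node))
```

In this example the pair (Person, Author) starts with no initial similarity and obtains its score only from its attribute pairs, which is the neighbour-propagation effect described in step 3. The final filtering of step 4 is reduced here to picking the best candidate per node.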
SF is flexible and extensible because it is a generic approach. However, its effectiveness and fitness for mapping different data structures, such as a database and an ontology, are unknown.
The previous approaches compute coefficients in the range [0,1] between labels instead of computing semantic relations such as subsumption or equivalence. Hence, Giunchiglia and Yatskevich [16] proposed a method, S-Match, that computes semantic relations (not similarities) between nodes, based on the idea that mappings should be calculated between the concepts (not the labels) assigned to nodes. With the help of WordNet, this method proceeds as follows:
1. By using WordNet, all concepts denoted by all labels in two graphs are computed.