2. LITERATURE REVIEW
2.3 Semantic Similarity
2.3.1 Overview
Semantic similarity is an important research topic in information retrieval and information integration. Several approaches to modeling semantic similarity compute the semantic distance between two words or two documents from definitions within an ontology.
The study of semantic similarity between words has long been a part of natural language processing and is a generic issue in a variety of applications in computational linguistics and artificial intelligence, both in the academic community and in industry (Li et al. 2003). Since the first studies on interoperating information systems, progress has been made concerning syntactic heterogeneities (i.e., data types and formats) and structural heterogeneities (i.e., schematic integration, query languages, and interfaces) (Sheth 1992; Andrea and Egenhofer, 2003). Because the number of keywords keeps increasing and the relations among contexts are complex, the technology needed to deal successfully with these issues must focus on the semantics underlying the data used by the interoperating information systems.
Recent investigations in information retrieval and data integration have emphasized the use of ontologies and semantic similarity functions as a mechanism for comparing objects that can be retrieved or integrated across heterogeneous repositories (Guarino et al., 1999; Voorhees, 1998; Smeaton and Quigley, 1996; Lee et al., 1993). An ontology is a kind of knowledge representation that describes concepts by definitions that are sufficiently detailed to capture the semantics of a specific domain. In other words, an ontology can be seen as a view of the real world, and it supports intentional queries regarding the content of a database. It also reflects the relevance of data by providing a declarative description of semantic information independent of the data representation (Andrea and Egenhofer, 2003).
In addition, an information system can be seen as a collection of processes that produce services, i.e., sets of related activities that transform input information into an output to accomplish a well-defined objective. Several studies have therefore applied semantic similarity to integrate multiple heterogeneous systems and processes, relying on the data integration capability of semantic similarity analysis methods (Castano and De Antonellis 1995a, 1995b, 1997; Sciore et al. 1994).
In complex organizations, the number of processes to be integrated can be large, and computer-based techniques are required to perform the integration analysis in a systematic and semi-automatic way. Process integration can be performed following an information processing viewpoint (Galbraith 1973). Accordingly, Castano and De Antonellis (1998) propose a framework for expressing semantic relationships between the processes of multiple information systems. By means of process modeling and process semantic similarity analysis, they successfully integrate the processes of different information systems.
Given these successful integration applications of semantic similarity, this study applies semantic similarity analysis to re-engineer the cross-organizational processes between A/E and GC in a D/B project team.
2.3.2 Basic concepts for semantic similarity analysis
Evaluation of process similarity is based on the terminological relationships between names specified in process descriptors. Performing a similarity-based analysis on a large number of process descriptors with an uncontrolled vocabulary is a difficult task and requires techniques for comparing names that appear in different process descriptors (Castano and De Antonellis, 1998). The major problems of a name-based approach arise because different processes can be characterized by the same or semantically similar entities and operations, which do not necessarily have the same name.
In fact, as Furnas (1987) pointed out, the probability of two designers picking the same name for a given entity is very low (Castano and De Antonellis, 1998). Following recent proposals on semantic heterogeneity in multidatabase systems, a semantic dictionary is introduced, in which the names of semantically similar entities and operations are grouped under the same concept (Bright, Hurson and Pakzad 1994; Hammer and McLeod 1993; Castano and De Antonellis, 1998). The semantic dictionary makes it possible to manage duplicate process descriptors that result from using different names to denote the same entity or operation. The dictionary is organized into concept hierarchies by means of the generalization and aggregation abstraction mechanisms (Castano and De Antonellis, 1998).
The architecture of the semantic dictionary is shown in Figure 2.2. A concept hierarchy is defined in the dictionary, where the concepts at the top of the hierarchy are either more general than the lower-level concepts (generalization abstraction) or are composed of them (aggregation abstraction). Each concept at the bottom level of the hierarchy has an associated cluster of names corresponding to entities and operations that are semantically similar (Castano and De Antonellis, 1998). The names contained in the clusters are the entity and operation names used in process descriptors.
Figure 2.2 The architecture of the semantic dictionary
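The dictionary organization described above can be sketched as a small data structure. The following Python fragment is only an illustration under stated assumptions: the class name Concept, the helper functions build_index and ancestors, and the example concepts and names are hypothetical and are not taken from Castano and De Antonellis (1998).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    """A node in one concept hierarchy of the semantic dictionary (hypothetical class)."""
    label: str
    parent: Optional["Concept"] = None       # generalization or aggregation link to the upper concept
    names: set = field(default_factory=set)  # name cluster; non-empty only for bottom-level concepts


def build_index(concepts):
    """Map each entity/operation name to the bottom-level concept whose cluster contains it."""
    return {name: concept for concept in concepts for name in concept.names}


def ancestors(concept):
    """Return the concept itself followed by every concept above it in the hierarchy."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = concept.parent
    return chain


# Hypothetical example data, invented for illustration only.
document = Concept("Document")
drawing = Concept("Drawing", parent=document, names={"drawing", "plan", "blueprint"})
spec = Concept("Specification", parent=document, names={"spec", "specification"})

index = build_index([document, drawing, spec])
```

In this sketch, two different names such as "plan" and "blueprint" resolve to the same bottom-level concept, which is how the dictionary groups duplicate descriptors that use different vocabulary.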
In the semantic dictionary, concepts are connected by means of hierarchical links. In addition, a predefined link exists between a bottom-level concept and each name in the cluster associated with this concept. To evaluate affinity operationally, an affinity strength δ is assigned to the hierarchical links in the semantic dictionary. Two names in process descriptors can have affinity if they refer to a common concept in the dictionary hierarchies, that is, if a path of length l, with l ≥ 1, denoted by the symbol "→l", exists between them in the semantic dictionary. The affinity of two names coincides with the strength of the path between them. The strength of a path "→l" is computed using the monotonic function shown in Equation 2.4 (Castano and De Antonellis, 1998).
[Equation 2.4]
In Equation 2.4, δ is the affinity strength, usually set to 0.8 based on the experiments of Castano and De Antonellis (1998); l is the semantic length between ni and nj in the semantic dictionary; and Ck is concept link k in the semantic dictionary.
Based on the concept of the semantic dictionary and the affinity function, further process similarity functions, such as the functional similarity function and the information similarity function, can be derived.
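The path-based affinity described above can be illustrated with a short Python sketch. It rests on assumptions that the reader should check against Equation 2.4: the path strength is taken to be δ raised to the path length l (one plausible monotonic function of l), every link, including the predefined link between a name and its bottom-level concept, is assumed to count as one step with the same strength δ, and names with no connecting path are given affinity 0. The function names and the dictionary fragment are hypothetical.

```python
from collections import deque

DELTA = 0.8  # affinity strength per link; value the text attributes to Castano and De Antonellis (1998)


def undirected(edges):
    """Build an adjacency map from (a, b) link pairs, treating each link as bidirectional."""
    links = {}
    for a, b in edges:
        links.setdefault(a, set()).add(b)
        links.setdefault(b, set()).add(a)
    return links


def path_length(links, start, goal):
    """Breadth-first search: number of links on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in links.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def affinity(links, name_i, name_j, delta=DELTA):
    """ASSUMED form of the path strength: delta ** l; 0.0 when no path exists."""
    length = path_length(links, name_i, name_j)
    return 0.0 if length is None else delta ** length


# Hypothetical dictionary fragment: lowercase strings are names, capitalized strings are concepts.
links = undirected([
    ("plan", "Drawing"), ("blueprint", "Drawing"),
    ("spec", "Specification"),
    ("Drawing", "Document"), ("Specification", "Document"),
    ("invoice", "Payment"),
])
```

Under these assumptions, two names in the same cluster are two links apart, so their affinity is 0.8 squared, or 0.64, while names whose concepts only meet higher in the hierarchy receive a lower value because the path between them is longer.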