
Building Topic Maps Using a Text Mining Approach

Hsin-Chang Yang1 and Chung-Hong Lee2

1 Department of Information Management, Chang Jung University, Tainan, 711, Taiwan
2 Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

Abstract. The topic map standard (ISO 13250) has gradually been recognized as an emerging standard for information exploration and knowledge organization in the web era. One advantage of topic maps is that they enable a user to navigate and access the documents he wants in an organized manner, rather than browsing through hyperlinks that are generally unstructured and often misleading. Nowadays, topic maps are generally constructed manually by domain experts or users, since the functionality and feasibility of automatically generated topic maps still remain unclear. In this work we propose a semi-automatic scheme to construct topic maps. We first apply a text mining process to a corpus of information resources to identify topics and discover the relations among these topics. Necessary components of topic maps, such as topics, topic types, topic occurrences, and topic associations, may be automatically revealed by our method.

1 Introduction

Topic maps provide a general, powerful, and user-oriented way of navigating the information resources under consideration in any specific domain. A topic map provides a uniform framework that not only identifies important subjects from a set of information resources and specifies the information resources that are semantically related to a subject, but also explores the relations among these subjects. When a user needs to find some specific information in a pool of information resources, he only needs to examine the topic map of this pool and select the topic he finds interesting; the topic map will then show him the information resources related to this topic, as well as its related topics. He will also recognize the relationships among these topics and the roles the topics play in such relationships. With the help of topic maps, we no longer have to browse through a set of hyperlinked documents and hope that we may eventually reach the information we need in a finite amount of time, even when we know nothing about where we should start. Nor do we have to gather some words and hope that they perfectly symbolize the idea we are interested in, and are interpreted well by a search engine, to obtain a reasonable result. Topic maps provide us a way to navigate and organize information, as well as to create and maintain knowledge in an infoglut.

N. Zhong et al. (Eds.): ISMIS 2003, LNAI 2871, pp. 307–314, 2003.

To construct a topic map for a set of information resources, human intervention is unavoidable at the present time. We need human effort in tasks such as selecting topics, identifying their occurrences, and revealing their associations. Such a need is acceptable only when the topic maps are used merely for navigation purposes and the volume of the information resources is considerably small. However, a topic map should not only be a 'topic navigation map'. Besides, the volume of information resources under consideration is generally large enough to prevent manual construction of topic maps. To expand the applicability of topic maps, some kind of automatic process should be involved during their construction. The degree of automation in such a construction process may differ for different users with different needs. One may only need a friendly interface to automate the topic map authoring process, while another may try to automatically identify every component of a topic map for a set of information resources from the ground up. In this work, we recognize the importance of topic maps not only as a navigation tool but also as a desirable scheme for knowledge acquisition and representation. Accordingly, we try to develop a scheme based on a proposed text mining approach to automatically construct topic maps for a set of information resources. Our approach works in the opposite direction to the navigation task performed by a topic map to obtain information: we extract knowledge from a corpus of documents to construct a topic map. Although the proposed approach cannot yet fully construct topic maps automatically, it still seems promising for developing a fully automatic scheme for topic map construction.

2 Related Work

The topic map standard is an emerging standard, so there are few works about it. Most of the early works on topic maps focus on providing introductory materials [5,1]. Few of them are devoted to automatic construction of topic maps. Two works that addressed this issue were reported in [6] and [4]. Rath [6] discussed a framework for automatic generation of topic maps according to a so-called 'topic map template' and a set of generation rules. The structural information of topics is maintained in the template. They used a generator to interpret the generation rules and extract the necessary information that fulfills the template to create the topic map. However, both the rules and the template have to be constructed explicitly and probably manually. Moore [4] discussed topic map authoring and how software may support it. He argued that the automatic generation of topic maps is a useful first step in the construction of a production topic map; however, the real value of a topic map comes through the involvement of people in the process. This argument is true only if the knowledge contained in the topic maps can be obtained solely by human effort. A fully automatic generation process is possible only when such knowledge can be discovered from the underlying set of information resources through an automated process, which is generally known as knowledge discovery from texts, or text mining.

3 The Text Mining Process

Before we can create topic maps, we should first perform a text mining process on the set of information resources to reveal the relationships among them. Here we only consider information resources that can be represented as regular texts. Examples of such resources are web pages, ordinary books, technical specifications, manuals, etc. The set of information resources is collectively known as the corpus, and an individual resource is referred to as a document in the following text. To reveal the relationships between documents, the popular self-organizing map (SOM) [3] algorithm is applied to the corpus to cluster documents. We adopt the vector space model [7] to transform each document in the corpus into a binary vector. These document vectors are used as input to train the map. We then apply two kinds of labeling processes to the trained map and obtain two feature maps, namely the document cluster map (DCM) and the word cluster map (WCM). In the document cluster map, each neuron represents a document cluster that contains several similar documents with high word co-occurrence. In the word cluster map, each neuron represents a cluster of words that reveal the general concept of the corresponding document cluster associated with the same neuron in the document cluster map.
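The training and labeling steps above can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names, the Gaussian neighborhood function, and the decay schedules are hypothetical choices.

```python
import numpy as np

def train_som(doc_vectors, rows=8, cols=8, epochs=100, gain=0.4, seed=0):
    """Minimal SOM: binary document vectors in, trained weight grid out."""
    rng = np.random.default_rng(seed)
    n_terms = doc_vectors.shape[1]
    weights = rng.random((rows * cols, n_terms))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(epochs):
        lr = gain * (1 - t / epochs)                 # decaying training gain
        radius = max(rows, cols) / 2 * (1 - t / epochs) + 1e-9
        for x in doc_vectors:
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * radius ** 2))      # Gaussian neighborhood
            weights += lr * h[:, None] * (x - weights)
    return weights, grid

def label_maps(doc_vectors, weights):
    """DCM: each document -> its best-matching neuron.
       WCM: each word -> the neuron with the largest weight on it."""
    dcm = {}
    for d, x in enumerate(doc_vectors):
        bmu = int(np.argmin(((weights - x) ** 2).sum(axis=1)))
        dcm.setdefault(bmu, []).append(d)
    wcm = {}
    for n in range(doc_vectors.shape[1]):
        neuron = int(np.argmax(weights[:, n]))
        wcm.setdefault(neuron, []).append(n)
    return dcm, wcm
```

Each document thus labels exactly one neuron, and each word labels exactly one neuron, giving the two feature maps used in the rest of the process.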

The text mining process described above provides us a way to reveal the relationships between the topics of the documents. We introduce here a method to identify topics and the relationships between them. The method also arranges these topics in a hierarchical manner according to their relationships. As mentioned before, a neuron in the DCM represents a cluster of documents containing words that often co-occur in these documents. Moreover, documents associated with neighboring neurons contain similar sets of words. Thus we may construct a super-cluster by combining neighboring neurons. To form a super-cluster, we first define the distance between two clusters:

D(i, j) = H(||G_i − G_j||), (1)

where i and j are the neuron indices of the two clusters and G_i is the two-dimensional grid location of neuron i. ||G_i − G_j|| measures the Euclidean distance between the two coordinates G_i and G_j. H(x) is a bell-shaped function that has its maximum value at x = 0. We also define the dissimilarity between two clusters:

D̂(i, j) = ||w_i − w_j||, (2)

where w_i denotes the synaptic weight vector of neuron i. We may then compute the supporting cluster similarity S_i of a neuron i from its neighboring neurons by

S_i = Σ_{j∈B_i} S(i, j), where S(i, j) = F(doc(j)) D(i, j) / D̂(i, j), (3)

where doc(j) is the number of documents associated with neuron j in the document cluster map and B_i is the set of neuron indices in the neighborhood of neuron i. The function F : R+ → R+ is a monotonically increasing function. A dominating neuron is a neuron which has a locally maximal supporting cluster similarity. We may select dominating neurons by the following algorithm:

Step 1. Find the neuron with the largest supporting cluster similarity. Select this neuron as a dominating neuron.
Step 2. Eliminate its neighboring neurons so that they will not be considered as dominating neurons.
Step 3. If there is no neuron left or the number of dominating neurons exceeds a predetermined value, stop. Otherwise go to Step 1.
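The three steps above, together with the super-cluster assignment of Eq. 4, can be sketched as follows. This is a hypothetical illustration: the fixed elimination radius and the `support`/`grid` inputs are our assumptions, not the authors' code.

```python
import numpy as np

def select_dominating(support, grid, radius=2.0, max_count=None):
    """Greedily take the neuron with the largest supporting cluster
    similarity, then eliminate every neuron within `radius` of it."""
    alive = set(range(len(support)))
    dominating = []
    while alive and (max_count is None or len(dominating) < max_count):
        best = max(alive, key=lambda i: support[i])
        dominating.append(best)
        # drop `best` and its grid neighborhood from further consideration
        alive = {i for i in alive
                 if np.linalg.norm(grid[i] - grid[best]) > radius}
    return dominating

def assign_super_clusters(grid, dominating):
    """Eq. (4): each neuron joins the super-cluster of the nearest
    dominating neuron on the two-dimensional grid."""
    return {i: min(dominating,
                   key=lambda k: np.linalg.norm(grid[i] - grid[k]))
            for i in range(len(grid))}
```

The greedy loop guarantees that two dominating neurons are never closer than the elimination radius, so the resulting super-clusters tile the map.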

A dominating neuron may be considered the centroid of a super-cluster, which contains several clusters. We assign every cluster to some super-cluster by the following method. The ith cluster (neuron) is assigned to the kth super-cluster if

D(i, k) = min_l D(i, l), where l ranges over the super-clusters. (4)

A super-cluster may be thought of as a category that contains several sub-categories. Let C_k denote the set of neurons that belong to the kth super-cluster, or category. The category topics are selected from the words associated with these neurons in the WCM. For the neurons j ∈ C_k, we select the n*th word as the category topic if

Σ_{j∈C_k} w_{jn*} = max_{1≤n≤N} Σ_{j∈C_k} w_{jn}. (5)

Eq. 5 selects the word that is most important to a super-cluster, since the components of the synaptic weight vector of a neuron reflect how willing the neuron is to learn the corresponding input data, i.e. words.

The topics selected by Eq. 5 form the top layer of the category hierarchy. To find the descendants of these topics in the hierarchy, we may apply the above process to each super-cluster and obtain a set of sub-categories. These sub-categories form the new super-clusters on the second layer of the hierarchy. The category structure can then be revealed by recursively applying the same category generation process to each newly found super-cluster. We decrease the size of the neighborhood used in selecting dominating neurons when we try to find the sub-categories.
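The topic selection of Eq. 5 amounts to a column-sum and argmax over the synaptic weights of a category's neurons; a minimal sketch, assuming a weight matrix with one row per neuron and one column per word (names are ours):

```python
import numpy as np

def category_topic(weights, members, vocab):
    """Eq. (5): the category topic is the word whose synaptic weight,
    summed over the category's neurons, is the largest."""
    summed = weights[list(members)].sum(axis=0)   # one total per word
    return vocab[int(np.argmax(summed))]
```

Applying this first to the top-level super-clusters and then, recursively, to the super-clusters found inside each category yields the layered topic hierarchy described above.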

4 Automatic Topic Maps Construction

The text mining process described in Sec. 3 reveals the relationships between documents and words. Furthermore, it may identify the topics in a set of documents, reveal the relationships among the topics, and arrange the topics in a hierarchical manner. The result of such a text mining process can be used to construct topic maps. We will discuss the steps of topic map construction in the following subsections.

4.1 Identifying Topics and Topic Types

The topics in the constructed topic map can be selected as the topics identified by Eq. 5. All identified topics in every layer of the hierarchy can be used as topics. Since topics in different layers of the hierarchy represent different levels of significance, we may constrain the significance of topics in the topic map by limiting the depth of the hierarchy from which we select topics. If we only use topics from higher layers, the number of topics is small, but the topics represent more important concepts. The significance level can be set explicitly at the beginning of the construction process or determined dynamically during it. One way to determine the number of topics is by the size of the self-organizing map.

The topic types can also be determined from the constructed hierarchy. As mentioned before, a topic on a higher layer of the hierarchy represents a more important concept than those on lower layers. For a parent-child relationship between two concepts on adjacent layers, the parent topic should represent an important concept of its child topic. Therefore, we may use the parent topic as the type of its child topics. Such usage also fulfills the requirement of the topic map standard that a topic type is itself a topic.
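The parent-as-type rule reduces to a simple traversal of the hierarchy; a sketch, assuming a hypothetical parent-to-children dictionary representation of the hierarchy:

```python
def topic_types(hierarchy):
    """Assign each topic its type: the parent topic one layer above it.
    `hierarchy` maps a parent topic to the list of its child topics."""
    types = {}
    for parent, children in hierarchy.items():
        for child in children:
            types[child] = parent
    return types
```

Top-layer topics receive no type under this rule, which is consistent with their being the most general subjects of the corpus.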

4.2 Identifying Topic Occurrences

The occurrences of an identified topic are easy to obtain after the text mining process. Since a topic is a word labeled to a neuron in the WCM, its occurrences can be assigned as the documents labeled to the same neuron in the DCM. That is, if a topic τ is labeled to neuron A in the WCM, the occurrences of τ should be the documents labeled to the same neuron A in the DCM. For example, if the topic 'text mining' was labeled to the 20th neuron in the WCM, all documents labeled to the 20th neuron in the DCM should be occurrences of this topic. Furthermore, we may create more occurrences of this topic by also including documents labeled to lower levels of the hierarchy. For example, if neuron 20 in the above example is located on the second level of a topic hierarchy, we may also allow the clusters of documents associated with topics below this level to be occurrences of this topic. Another approach is to use the DCM directly, such that we also include the documents associated with the neighboring neurons of the neuron associated with a topic as its occurrences.
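The WCM-to-DCM lookup described above can be sketched as follows, assuming the dictionary form of the two maps; the optional `neighbors` callback stands in for the neighborhood-widening variant (all names are our own):

```python
def topic_occurrences(topic, wcm, dcm, neighbors=None):
    """A topic labeled to neuron A in the WCM gets as occurrences the
    documents labeled to the same neuron A in the DCM; optionally widen
    the set to the neighboring neurons of A as well."""
    neuron = next(n for n, words in wcm.items() if topic in words)
    occurrences = list(dcm.get(neuron, []))
    for n in (neighbors(neuron) if neighbors else []):
        occurrences.extend(dcm.get(n, []))
    return occurrences
```

With `neighbors=None` this is exactly the one-neuron rule; passing a grid-neighborhood function reproduces the wider DCM-based variant.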

4.3 Identifying Topic Associations

The associations among topics can be identified in two ways in our method. The first is to use the developed hierarchy structure among topics. A topic is associated with another if there exists a path between them. We should limit the lengths of such paths to avoid establishing associations between pairs of unrelated topics. For example, if we limit the length to 1, only topics that are direct parents and children are associated with the topic under consideration. The type of such associations is essentially an instance-class association. The second way to identify topic associations simply examines the WCM and finds the associations. To establish associations for a topic τ, we first find the neuron A that τ is labeled to. We then establish associations between τ and every topic associated with some neighboring neuron of A. The neighboring neurons are selected from a neighborhood of A whose size is set arbitrarily by the creator. Obviously, a large neighborhood will create many associations. We should at least create associations between τ and the other topics associated with the same neuron A, since they are considered very related topics in the text mining process. The association types are not easy to reveal by this method, since we do not fully reveal the semantic relations among neurons after the text mining process. An alternative method to determine the association type between two topics is to use the semantic relations defined in a well-developed ontology such as WordNet [2].
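The two ways of identifying associations can be sketched as follows, again over the hypothetical dictionary forms of the hierarchy and the WCM used earlier (a sketch, not the authors' implementation):

```python
def hierarchy_associations(hierarchy):
    """First way, with path length limited to 1: associate each topic
    with its direct parent (an instance-class association)."""
    pairs = set()
    for parent, children in hierarchy.items():
        for child in children:
            pairs.add((parent, child))
    return pairs

def wcm_associations(topic, wcm, neighbors):
    """Second way: associate `topic` with every topic labeled to the
    same WCM neuron, then with topics on the neighboring neurons."""
    neuron = next(n for n, words in wcm.items() if topic in words)
    related = [w for w in wcm[neuron] if w != topic]
    for n in neighbors(neuron):
        related.extend(wcm.get(n, []))
    return related
```

The size of the neighborhood passed to `wcm_associations` plays the role of the creator-chosen neighborhood of A: the larger it is, the more associations are created.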

5 Experimental Results

We performed the experiments on a set of news articles collected from a Chinese newswire site. Two corpora were constructed in our experiments. The first corpus (CORPUS-1) contains 100 news articles posted during Aug. 1-3, 1996. The second corpus (CORPUS-2) contains 3268 documents posted during Oct. 1-9, 1996. A word extraction process was applied to the corpora to extract Chinese words. In total, 1475 and 10937 words were extracted from CORPUS-1 and CORPUS-2, respectively. To reduce the dimensionality of the feature vectors, we discarded those words that occur only once in a document. We also discarded the words that appear in a manually constructed stoplist, reducing the number of words to 563 and 1976 for CORPUS-1 and CORPUS-2, respectively. Reduction rates of 62% and 82% were thus achieved for the two corpora.
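The vocabulary reduction step can be sketched as follows. This is an illustration under one reading of the rule (a word survives if it occurs more than once in at least one document); the function names and the stoplist handling are our assumptions:

```python
from collections import Counter

def build_vocabulary(tokenized_docs, stoplist):
    """Keep words that occur more than once within some document and
    are not in the stoplist (the dimensionality reduction above)."""
    keep = set()
    for doc in tokenized_docs:
        counts = Counter(doc)
        keep |= {w for w, c in counts.items() if c > 1}
    return sorted(keep - set(stoplist))

def binary_vectors(tokenized_docs, vocab):
    """Vector space model with binary weights: component n is 1 if the
    nth vocabulary word appears in the document, 0 otherwise."""
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for doc in tokenized_docs:
        v = [0] * len(vocab)
        for w in set(doc):
            if w in index:
                v[index[w]] = 1
        vectors.append(v)
    return vectors
```

The resulting binary vectors are the inputs used to train the self-organizing map described in Sec. 3.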

To train on CORPUS-1, we constructed a self-organizing map that contains 64 neurons in an 8×8 grid format. The number of neurons was determined experimentally such that a good clustering could be achieved. Each neuron in the map contains 563 synapses. The initial training gain is set to 0.4 and the maximal training time to 100. These settings were also determined experimentally: we tried gain values ranging from 0.1 to 1.0 and training time settings ranging from 50 to 200, and simply adopted the setting that achieved the most satisfying result. After training, we labeled the map with documents and words respectively and obtained the DCM and the WCM for CORPUS-1. The same process was applied to CORPUS-2 with a 20×20 neuron map to obtain the DCM and the WCM for CORPUS-2.


Fig. 1. The category hierarchies of CORPUS-1


Fig. 2. Part of the topic map of CORPUS-1

We applied the hierarchy generation process, after the clustering process, to the DCM to obtain the category hierarchies. In Figure 1 we show the overall hierarchies developed from CORPUS-1. The topic maps of the two corpora can then be generated based on the hierarchies. We show part of the overall topic map for CORPUS-1 in Fig. 2. In the figure we use the same two-layer view of topic maps that separates the topics and their occurrences. The occurrences of a topic are shown collectively in the same box. We use the subjects of the articles to refer to the articles; a subject is a string starting with '@' in Fig. 2.

6 Conclusions

In this work we present a novel approach for semi-automatic topic map construction. The approach starts by applying a text mining process to a set of information resources. Two feature maps, namely the document cluster map and the word cluster map, are created by the text mining process. We then apply a category hierarchy development process to reveal the hierarchical structure of the document clusters. Some topics are also identified by this process to indicate the general subjects of the clusters located in the hierarchy. We may then create topic maps automatically according to the two maps and the developed hierarchy. Although our method may not identify all kinds of components needed to construct a topic map, our approach seems promising, since the text mining process achieves satisfactory results in revealing implicit topics and their relationships.

References

1. M. Biezunski. Introduction to topic mapping. In SGML Europe 1997 GCA Conference, Barcelona, Spain, 1997.
2. C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.
3. T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 1997.
4. G. Moore. Topic map technology – the state of the art. In XML Europe 2000, Paris, France, 2000.
5. S. Pepper. Navigating haystacks, discovering needles. Markup Languages: Theory and Practice, 1:41–68, 1999.
6. H. H. Rath. Technical issues on topic maps. In Proceedings of Metastructures 99 Conference, GCA, 1999.
7. G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
