TOSOM: A Topic-Oriented Self-Organizing Map
for Text Organization
Hsin-Chang Yang, Chung-Hong Lee, and Kuo-Lung Ke
Abstract—The self-organizing map (SOM) is a well-known neural network model with a wide range of applications. Its main characteristics are twofold: dimension reduction and topology preservation. A SOM maps a high-dimensional data space to a low-dimensional space while preserving the topological relations among the data. These characteristics make the SOM a common choice for data clustering and visualization tasks. Its main disadvantage, however, is that the number and structure of neurons must be known prior to training, and both are difficult to determine. Several schemes have been proposed to tackle this deficiency, for example the growing/expandable SOM, the hierarchical SOM, and the growing hierarchical SOM. These schemes can dynamically expand the map, and even generate hierarchical maps, during training, and encouraging results have been reported. Basically, these schemes adapt the size and structure of the map according to the distribution of the training data; that is, they are data-driven or data-oriented SOM schemes. In this work, a topic-oriented SOM scheme suitable for document clustering and organization is developed. The proposed SOM automatically adapts the number of neurons as well as the structure of the map according to identified topics. Unlike data-oriented SOMs, our approach expands the map and generates the hierarchies according to both the topics and the characteristics of the neurons. Preliminary experiments give promising results and demonstrate the plausibility of the method.
Keywords—Self-Organizing Map, Topic Identification, Learning Algorithm, Text Clustering.
I. INTRODUCTION
MANY artificial neural network models have been proposed to model the learning mechanism of the brain. One of the most widely applied is the self-organizing map (SOM) model [1]. The SOM learns from high-dimensional data and maps them onto a low-dimensional (often two-dimensional) map in a topology-preserving manner. That is, data close together in the high-dimensional space will also be close in the mapped low-dimensional space. Such abilities have led to the SOM being widely applied in data visualization and clustering tasks. To date, more than 7,700 papers on the SOM have been published, according to the bibliography¹ compiled by the original SOM team [2]. Although the SOM is easy to use and effective in data clustering, its learning algorithm has three main disadvantages. First, the training time of the SOM is often long since the input
H.-C. Yang is with the Department of Information Management, National University of Kaohsiung, Kaohsiung 811, Taiwan (corresponding author, phone: +886-7-5919512; e-mail: [email protected]).
C.-H. Lee is with the Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 807, Taiwan (e-mail: [email protected])
K.-L. Ke is with the Institute of Information Management, National University of Kaohsiung, Kaohsiung 811, Taiwan
¹ Accessible from http://www.cis.hut.fi/research/refs/
data is often high-dimensional and the SOM requires repeated passes over the input data. Second, the size of the map is fixed: the number of neurons in the map must be determined a priori, yet there is no gold standard for determining this number. Third, the original SOM can reveal only lateral correlations among data, not hierarchical ones. Several schemes have been proposed to remedy these deficiencies. To overcome the fixed-structure problem, the map is often made expandable during training. That is, the map initially contains few neurons and expands in later training stages according to the distribution of the data. There are two major approaches to the lack of hierarchical relationships. The first applies agglomerative or divisive hierarchical clustering to the two-dimensional map to generate hierarchies. The second retrains each cluster in the map using a lower-level map to obtain the hierarchical structure. These approaches to map expansion or hierarchy generation all rely on the characteristics of the data to decide when and how to expand the map. Since the SOM clusters similar data together, many similar data items will be assigned to a few neurons. Thus we can decide whether to expand the map by examining the distribution of data on the map. We call such approaches data-driven or data-oriented approaches. Although data-oriented approaches may effectively generate an adequate number of clusters or even hierarchies, they provide no insight into the meaning of the data, let alone guide the training process.
The SOM is often used for text document clustering and categorization. Text categorization concerns classifying documents into categories according to their contents, characteristics, and properties. When documents are properly categorized, the documents in a cluster should share a common theme. Conversely, text categorization requires well-defined category structures. Such structures are often constructed manually and are often inconsistent. Automatic schemes for generating categorization structures are therefore needed for categorizing large, dynamic repositories.
In this work, we propose a novel self-organizing map algorithm based on the topics of documents rather than on the documents themselves. The core difference between our method and traditional data-oriented SOMs lies in the intervention of topics in the learning process. The topics of each cluster are identified continuously during training. These topics then serve as the basis for expanding maps and hierarchies. Using such higher-level knowledge during training should provide deeper insight into the underlying documents and better guidance for the training. Moreover, our method naturally generates a categorization structure with identified themes and
World Academy of Science, Engineering and Technology Vol:4 2010-05-29
865 International Scholarly and Scientific Research & Innovation 4(5) 2010
hierarchical structure. Thus it is also well suited to tasks involving text documents, such as text categorization, text clustering, and text hierarchy generation.
The remainder of this paper is divided into four sections. Sec. II describes work related to our research. Sec. III introduces the proposed topic-oriented self-organizing map algorithm. Sec. IV gives the experimental results. Finally, the last section gives conclusions and discussion.
II. RELATED WORK
We briefly review some related work here.
A. Adaptable SOM
To overcome the two major disadvantages of the traditional SOM [3], [1], namely its static map structure and its lack of hierarchical relationships, many models have been proposed. Some models use an adaptable map size, e.g. the growing grid model [4]. Another approach is to use hierarchically arranged maps, such as the hierarchical feature map [5] and the tree-structured self-organizing map [6]. Hybrid approaches have also been developed, e.g. the growing hierarchical self-organizing map (GHSOM) [7], [8], [9].
B. Text Organization Based on SOM
Here text organization refers to the effort involved in organizing a corpus of texts into some predefined structure. Typical text organization processes include text clustering, text categorization, and text structure generation. Text organization research has been active for several decades, and many methods have been proposed to address this problem. Here we mention some works that make use of the SOM. The SOM has been widely used in text clustering; a famous example is the WEBSOM project [10]. Liu et al. [11] proposed ConSOM, which uses concepts along with traditional features to guide the learning process. Their method demonstrated better results than the traditional SOM owing to its semantic sensitivity.
III. TOPIC-ORIENTED SOM ALGORITHM
In this work we develop a new training method for the SOM. The core of the proposed approach is the identification of topics, which are used to guide the training process as well as to change the structure of the map. Fig. 1 depicts the flowchart of our method. The steps are explained in detail below.
A. Preprocessing
[Figure: flowchart with the boxes Text documents, Preprocessing, Document vectors, SOM training, Topic identification, Lateral expansion?, Hierarchical expansion?, Lateral expansion, Hierarchical expansion, and Stop training, with Yes/No branches on the two decision boxes.]
Fig. 1. Flowchart of the proposed method.
We need to transform the text documents into vectors before training, since the input to the SOM must be in vector form. In this step, common procedures for processing text documents, such as word segmentation, stopword elimination, and stemming, are first applied to obtain a set of keywords that describe the contents of a document. All keywords are collected into the vocabulary of the corpus. A document is then transformed into a vector according to the keywords it contains. Let document $d_j = \{k_{ij} \mid 1 \le i \le n_j\}$, $1 \le j \le N$, where $N$ is the number of documents, $n_j$ is the number of distinct keywords in $d_j$, and $k_{ij}$ represents the $i$th keyword in $d_j$. The vocabulary, denoted by $V$, is just the union of all $d_j$, i.e. $V = \bigcup_j d_j = \{k_i \mid 1 \le i \le |V|\}$. A document is encoded into a binary vector according to the keywords that occur in it. When a keyword $k_i$ occurs in the document, the $i$th element of the vector has value 1; otherwise, the element has value 0. With this scheme, a document $d_j$ is encoded into a binary vector $\mathbf{d}_j$.
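The binary encoding described in the preprocessing step can be illustrated with a minimal Python sketch. It assumes that word segmentation, stopword elimination, and stemming have already produced one keyword set per document. The function names, the toy keywords, and the choice to index the vocabulary in sorted order are illustrative assumptions, not part of the original method.

```python
from typing import List, Set


def build_vocabulary(documents: List[Set[str]]) -> List[str]:
    """Return the vocabulary V as the union of all keyword sets d_j."""
    vocabulary: Set[str] = set()
    for keywords in documents:
        vocabulary |= keywords
    # Sorting fixes an arbitrary but reproducible index i for each keyword k_i.
    return sorted(vocabulary)


def encode_binary(keywords: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode one document as a |V|-dimensional 0/1 vector."""
    return [1 if k in keywords else 0 for k in vocabulary]


# Toy keyword sets (hypothetical output of the preprocessing procedures).
documents = [
    {"neural", "network", "map"},
    {"text", "cluster", "map"},
    {"text", "topic"},
]
vocabulary = build_vocabulary(documents)
vectors = [encode_binary(d, vocabulary) for d in documents]

for keywords, vector in zip(documents, vectors):
    print(sorted(keywords), vector)
```

Each vector has one element per vocabulary keyword, so its length grows with the size of the corpus vocabulary; this is the high dimensionality that makes SOM training slow.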
B. SOM Training
The document vectors are used to train a map with the classical SOM algorithm [3]. The initial training is performed on a small map, say a $2 \times 2$ map, called the initial SOM. The initial SOM is expanded as training proceeds. When the training process is complete, we perform a labeling process on the trained map to establish the association between each document and one of the neurons. The labeling process is as follows. The feature vector of document $d_j$, $\mathbf{d}_j = (x_{ji})$, $1 \le i \le |V|$, $1 \le j \le N$, is compared to the synaptic weight vector of every neuron in the map. We then label $d_j$ to the $l$th neuron if its synaptic weight vector is closest to $\mathbf{d}_j$, i.e. $\|\mathbf{d}_j - \mathbf{w}_l\| = \min_m \|\mathbf{d}_j - \mathbf{w}_m\|$, where $\mathbf{w}_m = (w_{mi})$, $1 \le i \le |V|$, is the synaptic weight vector of neuron $m$. After the labeling process, each document is labeled to some neuron or, from a different point of view, each neuron is labeled by a set of documents. We record the labeling result and obtain the document cluster map (DCM). In the DCM, each neuron is labeled by a list of documents that are considered similar and belong to the same cluster.
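The training and labeling steps can be sketched in a few lines of Python. This is a minimal illustration of classical online SOM training on a 2 × 2 map followed by nearest-neuron labeling. The Gaussian neighbourhood, the linear decay schedules, the parameter values, and all function names are assumptions made for the example; the paper does not specify them, and the map expansion steps are not shown.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance ||a - b||."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_matching_unit(x: Vector, weights: List[Vector]) -> int:
    """Index l of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda m: distance(x, weights[m]))


def train_som(data: List[Vector], rows: int = 2, cols: int = 2,
              epochs: int = 50, lr0: float = 0.5, radius0: float = 1.0,
              seed: int = 0) -> List[Vector]:
    """Classical online SOM training on a rows x cols grid (assumed settings)."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in grid]
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius decay linearly (assumed schedule).
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            winner = best_matching_unit(x, weights)
            for m, pos in enumerate(grid):
                # Gaussian neighbourhood measured on the grid, not in input space.
                g2 = (pos[0] - grid[winner][0]) ** 2 + (pos[1] - grid[winner][1]) ** 2
                h = math.exp(-g2 / (2.0 * radius ** 2))
                weights[m] = [w + lr * h * (xi - w) for w, xi in zip(weights[m], x)]
    return weights


def label_documents(data: List[Vector], weights: List[Vector]) -> Dict[int, List[int]]:
    """Build the DCM: neuron index -> indices of the documents labeled to it."""
    dcm: Dict[int, List[int]] = {m: [] for m in range(len(weights))}
    for j, x in enumerate(data):
        dcm[best_matching_unit(x, weights)].append(j)
    return dcm


# Toy binary document vectors (hypothetical).
docs = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]
weights = train_som(docs)
dcm = label_documents(docs, weights)
print(dcm)
```

The dictionary returned by `label_documents` plays the role of the DCM: every document index appears under exactly one neuron, and neurons may be left empty.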
We will also construct the keyword cluster map (KCM)
hierarchically expanded. Since our method incorporates various text mining approaches in training, it is feasible to apply it to text documents. The experimental results suggest that our method is adequate for developing a structure that can be used for text categorization and organization.
ACKNOWLEDGMENT
This research was funded by the National Science Council under grant NSC 98-2221-E-390-040.
REFERENCES
[1] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 1997.
[2] M. Pöllä, T. Honkela, and T. Kohonen, “Bibliography of self-organizing map (SOM) papers: 2002-2005 addendum,” Information and Computer Science, Helsinki University of Technology, Espoo, Finland, Tech. Rep. TKK-ICS-R24, 2009.
[3] T. Kohonen, “Self-organizing formation of topologically correct feature maps,” Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.
[4] B. Fritzke, “Growing grid - a self-organizing network with constant neighborhood range and adaption strength,” Neural Processing Letter, vol. 2, no. 5, pp. 9–13, 1995.
[5] R. Miikkulainen, “Script recognition with hierarchical feature maps,” Connection Science, vol. 2, pp. 83–101, 1990.
[6] P. Koikkalainen, “Tree structured self-organizing maps,” in Kohonen Maps, E. Oja and S. Kaski, Eds. Amsterdam, Netherlands: Elsevier, 1999, pp. 121–130.
[7] A. Rauber, M. Dittenbach, and D. Merkl, “Towards automatic content-based organization of multilingual digital libraries: An English, French and German view of the Russian information agency Nowosti news,” in Proceedings of the Third All-Russian Scientific Conference on Digital Libraries: Advanced Methods And Technologies, Digital Collections, September 11-13 2001, pp. 11–13.
[8] A. Rauber, D. Merkl, and M. Dittenbach, “The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data,” IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331–1341, 2002.
[9] M. Dittenbach, A. Rauber, and D. Merkl, “Recent advances with the growing hierarchical self-organizing map,” in Advances in Self-Organizing Maps, N. Allinson, Y. Ahujun, L. Allinson, and J. Slack, Eds. Lincoln, England: Springer, 2001, pp. 140–145.
[10] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen, “WEBSOM–Self-organizing maps of document collections,” Neurocomputing, vol. 21, pp. 101–117, 1998.
[11] Y. Liu, X. Wang, and C. Wu, “ConSOM: A conceptional self-organizing map model for text clustering,” Neurocomputing, vol. 71, no. 4-6, pp. 857–862, 2008.
[12] G. A. Miller, “WordNet: A lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[13] T. Pedersen, S. Patwardhan, and J. Michelizzi, “WordNet::Similarity - measuring the relatedness of concepts,” in HLT-NAACL 2004: Demonstration Papers, D. M. Susan Dumais and S. Roukos, Eds. Boston, Massachusetts, USA: Association for Computational Linguistics, May 2 - May 7 2004, pp. 38–41.
[14] C. H. Lee and H. C. Yang, “A Web text mining approach based on self-organizing map,” in Proceedings of the ACM CIKM’99 2nd Workshop on Web Information and Data Management, Kansas City, Missouri, 1999, pp. 59–62.
[15] H. C. Yang and C. H. Lee, “A text mining approach on automatic generation of Web directories and hierarchies,” Expert Systems with Applications, vol. 27, no. 4, pp. 645–663, 2004.