
A Text Mining Approach on Automatic Generation of Web Directories and Hierarchies

Hsin-Chang Yang, Department of Information Management, Chang Jung University
Chung-Hong Lee, Department of Electrical Engineering, National Kaohsiung University of Applied Sciences

Abstract

The World Wide Web (WWW) has been recognized as an indispensable source of information for the information retrieval and knowledge discovery communities. A tremendous amount of knowledge is recorded in various types of media, producing an enormous number of web pages. Retrieving the required information from the WWW is therefore an arduous task. Different schemes for retrieving web pages have been used by the WWW community. One of the most widely used schemes is to traverse predefined web directories to reach a user's goal. These web directories are compiled or classified collections of web pages and are usually organized into hierarchical structures. The classification of web pages into proper directories and the organization of directory hierarchies are generally performed by human experts. In this work, we present a corpus-based method that applies text mining techniques to a corpus of web pages to automatically create web directories and organize them into hierarchies. The method is based on the self-organizing map learning algorithm and requires no human intervention during the construction of the web directories and hierarchies. Experiments show that our method can produce comprehensible and reasonable web directories and hierarchies.

Correspondence: Department of Information Management, Chang Jung University, 711, Tainan, Taiwan. Email addresses: hcyang@mail.cju.edu.tw (Hsin-Chang Yang), leechung@mail.ee.kuas.edu.tw (Chung-Hong Lee).

Preprint submitted to Elsevier Science, 28 April 2004.

1. Introduction

The World Wide Web (WWW, or simply the web) has become a popular resource for finding almost any kind of information. This popularity arises from the following properties of the web. First, it is a universal repository of knowledge: almost any kind of knowledge can be found there. Second, users pay little to access the web and are therefore encouraged to use it. Third, there is no central editorial board, so anyone can contribute resources to the web. These properties stimulate users both to retrieve information from the web and to add information to it, making the web grow at an incredible speed and making users rely on it to obtain the information they need.

Information finding is thus a serious problem for the web, since most users find it hard to obtain the information they need using current information retrieval strategies. Two kinds of strategies are adopted by the web community, namely searching and browsing. The former is a purposeful action: a user has to express his need before the retrieval process is performed. The performance of a searching system is a direct result of its indexing strategy and ranking mechanism. Contrary to searching, when a user browses the web his goal may change during the browsing process. Users evaluate the performance of browsing by the number of steps needed to reach their goal (web) pages from the starting pages. Such evaluation is influenced by the selection of the starting page and by the link structures (i.e., the hyperlinks) in the pages that participate in the browsing process. Since the link structures may be considered static during browsing, the selection of the starting pages plays the most important role when a user tries to reach his goal in minimum time. Therefore, many commercial and academic web sites actively collect web pages and sort them into web directories to provide users with starting points for the browsing process.

Web directories are categories of semantically similar web pages and are often organized in a hierarchical manner. We usually assign a label to each directory to identify its theme or subject. In a web directory hierarchy, the root directory generally conveys the most general subject or concept of all the web pages in the hierarchy, and a directory in this hierarchy should cover the common subjects or concepts of all its sub-directories. When a user tries to find some web pages, he may start from the hierarchy whose theme is closest to his information need. At the same time a set of sub-directories is presented to the user, who may traverse further into one of the sub-directories whenever necessary. For example, a user may find the directory labeled 'culture' too coarse when he intends to retrieve a web page describing 'poetry of China'. He may then traverse into the sub-directories and reach his goal directory in the following order: 'regional' → 'Asia' → 'China' → 'literature' → 'poetry'. The user may either find his goal page in this directory, or he can select one page from this directory and traverse the hyperlinks in this page to eventually reach the goal page.

Most users find web directories useful when they have only a general idea of their goal in mind or cannot clearly express their information need as keywords for a search engine. Therefore, many portal sites create web directories to provide users with starting points for the browsing process. Most existing web directories were created manually by human specialists. One famous example is the web directory hierarchy of Yahoo!. Many domain experts spend a great amount of effort creating such hierarchies. Although the result of their efforts has been widely recognized as successful, the applicability of manual construction of web directories is still limited. Such limitation is mainly caused by the gigantic number of web pages that have been and are being produced. It is not practical to expect that human experts can assign every document to the right directory in an acceptable amount of time. Therefore, semi-automatic or fully automatic creation of web directories has attracted the attention of many researchers and practitioners.

There are three tasks to achieve when we try to create web directories automatically from a corpus of web pages. The first is to create a set of web directories. The second is to discover the structure of these directories. The last is to assign web pages to proper directories. All these tasks are difficult to achieve, especially the first two, because we need insight into the underlying semantic structure of a language to create the directories and their structure. Unfortunately, such insight is difficult to reveal by an automated process. Thus only a few works have focused on the fully automatic creation of web directories.

In this work we present a method to automatically create web directories and their structure. We apply a text mining process to a corpus of web pages to identify important topics in the corpus, and we also develop a hierarchy of these topics. The identified topics are used as the names of the web directories, while the structure of the directories is modelled by the developed hierarchy. The proposed web directory creation process consists of three phases. In the first phase we apply a text mining process to discover the relationships among documents and words. We first construct a self-organizing map (SOM) neural network (Kohonen, 1997) to cluster the web pages. At the end of this phase, web pages of similar topics are clustered around nearby neurons. We then analyze the network and obtain two feature maps, namely the document cluster map (DCM) and the word cluster map (WCM). In the second phase we analyze the WCM and identify a set of topics from the map. These topics are used as the names of the web directories. In the third phase we apply a recursive process to develop hierarchies of the identified topics and create the hierarchical structure of the web directories. The web pages are naturally assigned to their corresponding directories after the text mining process.

The paper is organized as follows. In Sec. 2 we review related work on web directory creation.

We then describe the text mining process in detail in Sec. 3. The second and third phases of the web directory creation process are presented in Sec. 4. In Sec. 5 we show some experimental results, and in Sec. 6 we give some conclusions.

2. Related Work

Most of the works related to web directories concern the categorization of web pages using some existing web directories and their structure, and they fall into the general domain of text categorization. Most text categorization research has focused on developing methods for categorization. Examples are the Bayesian independence classifier (Lewis, 1992), decision trees (Apte et al., 1994), linear classifiers (Lewis et al., 1996), context-sensitive learning (Cohen and Singer, 1996), learning by combining classifiers (Larkey and Croft, 1996), and instance-based learning (Lam et al., 1999). Based on the chosen categorization method, text categorization or classification systems usually categorize documents according to some predefined category hierarchy. A famous example is the work by the CMU text learning group (Grobelnik and Mladenić, 1998), which used the Yahoo! hierarchy to categorize documents. Another usage of a predefined category hierarchy is to browse and search the results of a retrieval strategy; an example is the Cat-a-Cone system developed by Hearst and Karadi (1997). Feldman et al. (1998) combined keyword distributions and a keyword hierarchy to perform a range of data mining operations on documents.

In this work we apply a text mining process to identify topics as web directory names. Such work is in essence similar to research on topic identification or theme generation for text documents. Salton and Singhal (1994) generated a text relationship map among text excerpts and recognized all groups of three mutually related excerpts; a merging process is applied iteratively to these groups to finally obtain the theme (or a set of themes) of all text excerpts. Clifton and Cooley (1999) used traditional data mining techniques to identify topics in a text corpus. They used a hypergraph partitioning scheme to cluster frequent item sets; a topic is represented as the set of named entities of the corresponding cluster. Ponte and Croft (1997) applied dynamic programming techniques to segment text into relatively small segments, which can then be used for topic identification. Lin (1995) used a knowledge-based concept counting paradigm to identify topics through the WordNet hierarchy. Hearst and Plaunt (1993) argued that the advent of full-length documents should be accompanied by the need for subtopic identification. They developed techniques for detecting subtopics and performed experiments using sequences of locally concentrated discussions rather than full-length documents. All these works may, to some extent, identify topics of documents that can be used as category themes for text categorization.

However, they either rely on a predefined category hierarchy (e.g., Lin (1995)) or do not reveal the hierarchy at all. Our text mining process also develops a hierarchical structure from the identified topics.

Recently, researchers in the text categorization field have proposed methods for automatically developing category hierarchies. McCallum and Nigam (1999) used a bootstrapping process to generate new terms from a set of human-provided keywords; human intervention is still required in their work. Probabilistic methods have been widely used to exploit hierarchies. Weigend et al. (1999) proposed a two-level architecture for text categorization. The first level of the architecture predicts the probabilities of the meta-topic groups, which are groups of topics. This allows the individual models for each topic on the second level to focus on finer discrimination within the group. They used a supervised neural network to learn the hierarchy, where topic classes were provided and already assigned. A different probabilistic approach by Hofmann (1999) used an unsupervised learning architecture called the Cluster-Abstraction Model to organize groups of documents in a hierarchy.

The self-organizing map algorithm has been applied to document clustering in several works (for example, Kaski et al. (1998); Rauber and Merkl (1999); Rizzo et al. (1998)). The WEBSOM project (Kaski et al., 1998) uses self-organizing maps as a preprocessing stage for encoding documents. Such maps are, in turn, used to automatically organize (i.e., cluster) documents according to the words that they contain. When the documents are organized on a map in such a way that nearby locations contain similar documents, exploration of the collection is facilitated by the intuitive neighborhood relations, so users can easily navigate a word category map and zoom in on groups of documents related to a specific group of words. Rauber and Merkl (1999) used the self-organizing map to cluster text documents. They labeled each neuron with a set of keywords selected from the input vectors mapped to this neuron; the keywords contributing the least quantization error were selected. These keywords together form the label of a cluster, rather than a single term for a category. Moreover, the hierarchical structure among these keywords was not revealed. Rizzo et al. (1998) used the SOM to cluster hypertext documents for interactive browsing and document searching.

In the field of web directory generation, one report that is close to our work was presented by Sato and Sato (1999). In their work, web directories are generated for rather specific domains such as aquariums, zoos, and museums. Given a word, a name collector collects all proper names related to the word. A content editor then collects web pages related to each instance of the proper names and generates a digest page for that instance. Finally, an organizer generates a table of contents for the directory. Although their method needs no human intervention, the generated structure (that is, the table of contents) is rather primitive. Besides, it is hard to evaluate their results for such a limited set of domains.

3. The Text Mining Process

To obtain the web directory hierarchies, we first perform a clustering process on the corpus. We then apply a web directory generation process to the clustering result and obtain the web directory hierarchies. This section describes the clustering process. We start from the preprocessing steps, followed by the clustering process based on the SOM learning algorithm. Two labeling processes are then applied to the learned result. After the labeling processes, we obtain two feature maps which characterize the relationships among documents and among words, respectively. The web directory hierarchy generation process, described in the next section, is then applied to these two maps to develop the web directory hierarchies.

3.1 Document Encoding and Preprocessing

Our approach begins with a standard practice in information retrieval (Salton and McGill, 1983), i.e., the vector space model, which encodes documents as vectors in which each element corresponds to a different indexed term. In this work the corpus contains a set of Chinese news web pages from CNA (Central News Agency, http://www.cna.com.tw). For simplicity and generality, we denote a web page as a document hereafter. First we apply an extractor to filter out HTML tags and extract index terms from the documents. Traditionally there are two schemes for extracting terms from Chinese text: a character-based scheme and a word-based scheme (Huang and Robertson, 1997). We adopt the second scheme because individual Chinese characters generally carry no context-specific meaning. A word in Chinese is composed of two or more Chinese characters, so the terms 'word' and 'phrase' are often used interchangeably for Chinese. After extraction, the words are used as index terms to encode the documents. Unlike the traditional vector space model, we do not adopt any term weighting; instead we use a binary vector to represent a document. A value of 1 for an element in a vector indicates the presence of the corresponding word in the document; a value of 0 indicates its absence. We decided to use the binary vector scheme for the following reasons. First, we cluster documents according to the co-occurrence of words, which is irrelevant to the weights of the individual words.

Second, our preliminary experiments showed no advantage in the clustering result when term weighting schemes were used (the classical tf and tf·idf schemes were tried). As a result, we believe the binary scheme is adequate for our needs.

A problem with this encoding method is that if the vocabulary is very large, the dimensionality of the vectors is also high. In practice, the resulting dimensionality of the space is often very large, since the number of dimensions is determined by the number of distinct index terms in the corpus. In general, feature spaces on the order of 1,000 to 100,000 dimensions are common even for reasonably small collections of documents. As a result, techniques for controlling the dimensionality of the vector space are required. Such a problem could be alleviated, for example, by eliminating some of the most common and some of the rarest indexed terms in the preprocessing stage. Several other techniques may also be used to tackle the problem, such as multidimensional scaling (Cox and Cox, 1994), principal component analysis (Jolliffe, 1986), and latent semantic indexing (Deerwester et al., 1990). In information retrieval several techniques are widely used to reduce the number of index terms. Unfortunately, these techniques are not fully applicable to Chinese documents. For example, stemming is generally not necessary for Chinese text. On the other hand, we can use stop words and a thesaurus to reduce the number of index terms. In this work, we manually constructed a stop list to filter out meaningless words in the text. We did not use a thesaurus simply because one was not available.

A document in the corpus typically contains about 100-200 characters. We discard over-lengthy (more than 250 words) and duplicated documents for a better result. This step can be omitted because it only affects less than 2% of the documents in the corpus; keeping these documents would only slightly increase the number of index terms and the processing time. However, if the corpus contained many duplicated or over-lengthy documents, the clustering result could deteriorate.
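The preprocessing and encoding steps of this subsection can be summarized in a short sketch. The following Python fragment is illustrative only and is not the authors' implementation: it assumes the documents have already been segmented into words by a Chinese word extractor, and the stop list and length limit are stand-ins for the manually constructed resources described above.

```python
import numpy as np

def build_binary_vectors(documents, stopwords, max_len=250):
    """Encode documents as binary term vectors (a sketch of Sec. 3.1).

    documents : list of token lists (a real system would use a Chinese
                word extractor here; we assume pre-tokenized input).
    stopwords : set of words to filter out (hypothetical stop list).
    max_len   : documents longer than this are discarded, as in the paper.
    """
    # Discard over-lengthy and duplicated documents.
    seen, kept = set(), []
    for tokens in documents:
        key = tuple(tokens)
        if len(tokens) <= max_len and key not in seen:
            seen.add(key)
            kept.append([t for t in tokens if t not in stopwords])

    # Vocabulary of remaining index terms; a real system would also drop
    # the rarest and most common terms to control dimensionality.
    vocab = sorted({t for tokens in kept for t in tokens})
    index = {w: n for n, w in enumerate(vocab)}

    # Binary document vectors: x_in = 1 iff word n occurs in document i.
    X = np.zeros((len(kept), len(vocab)), dtype=np.float32)
    for i, tokens in enumerate(kept):
        for t in tokens:
            X[i, index[t]] = 1.0
    return X, vocab

# Toy usage with pre-tokenized "documents".
docs = [["neural", "network", "learning"], ["network", "web", "page"]]
X, vocab = build_binary_vectors(docs, stopwords={"the"})
```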

3.2 Document Clustering Using Self-Organizing Maps

In this subsection we describe how to organize documents and words into clusters according to their co-occurrence similarities. The documents in the corpus are first encoded into a set of vectors as described in Sec. 3.1. We intend to organize these documents into a set of clusters such that similar documents fall into the same cluster or nearby clusters. The unsupervised learning algorithm of SOM networks (Kohonen, 1997) meets our needs. The SOM algorithm organizes a set of high-dimensional vectors into a two-dimensional map of neurons according to the similarities among the vectors. Similar vectors, i.e., vectors with small distances, map to the same or nearby neurons after the training (or learning) process. That is, the similarity between vectors in the original space is preserved in the mapped space. By applying the SOM algorithm to the web page vectors, we actually perform a clustering process on the web pages. Here a neuron in the map can be treated as a cluster, and similar documents fall into the same or neighboring neurons (clusters). Moreover, the similarity of two clusters can be measured by the geometrical distance between their corresponding neurons. To decide the cluster to which a document or a word belongs, we apply two labeling processes, one to the documents and one to the words. After the document labeling process, each document is associated with a neuron in the map. We record these associations and form the document cluster map (DCM). In the same manner, we label each word to some neuron in the map and form the word cluster map (WCM). We then use these two cluster maps to generate the web directory hierarchies.

We now define some notation and describe the training process. Let x_i = {x_in | 1 ≤ n ≤ N}, 1 ≤ i ≤ M, be the encoded vector of the i-th document in the corpus, where N is the number of indexed terms and M is the number of documents. We use these vectors as the training inputs to the SOM network. The network consists of a regular grid of neurons, each of which has N synapses. Let w_j = {w_jn | 1 ≤ n ≤ N}, 1 ≤ j ≤ J, be the synaptic weight vector of the j-th neuron in the network, where J is the number of neurons. We train the network by the SOM algorithm:

Step 1. Randomly select a training vector x_i from the corpus.
Step 2. Find the neuron j whose synaptic weight vector w_j is the closest to x_i, i.e.,

  \|x_i - w_j\| = \min_{1 \le k \le J} \|x_i - w_k\|.   (1)

Step 3. For each neuron l in the neighborhood of neuron j, update its synaptic weights by

  w_l^{new} = w_l^{old} + \alpha(t)(x_i - w_l^{old}),   (2)

where \alpha(t) is the training gain at time stamp t.
Step 4. Increase the time stamp t. If t reaches the preset maximum training time T, halt the training process; otherwise decrease \alpha(t) and the neighborhood size, and go to Step 1.

The training process stops after time T, which should be sufficiently large that every vector may be selected as a training input several times. The training gain and the neighborhood size both decrease as t increases.
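The four training steps translate directly into a small loop. The sketch below is a minimal illustration rather than the exact training code: the linear decay schedules for the gain α(t) and the neighborhood radius, as well as the random weight initialization, are assumptions, since the text only states that both quantities decrease with t.

```python
import numpy as np

def train_som(X, map_side=8, T=100, alpha0=0.4, seed=0):
    """Train a SOM on binary document vectors X (M x N), following
    Steps 1-4 of Sec. 3.2.  Returns the J x N synaptic weight matrix."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    J = map_side * map_side
    W = rng.random((J, N))                                  # synaptic weights w_j
    # 2-D grid coordinates of every neuron, used for the neighborhoods.
    grid = np.array([(j % map_side, j // map_side) for j in range(J)])

    for t in range(T):
        alpha = alpha0 * (1.0 - t / T)                      # decreasing gain alpha(t)
        radius = (map_side / 2.0) * (1.0 - t / T) + 1.0     # shrinking neighborhood
        x = X[rng.integers(M)]                              # Step 1: random training vector
        j = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # Step 2: winning neuron (Eq. 1)
        # Step 3: update every neuron l in the neighborhood of neuron j (Eq. 2).
        near = np.linalg.norm(grid - grid[j], axis=1) <= radius
        W[near] += alpha * (x - W[near])
    return W                                                # Step 4 is the loop bound T
```

With an 8 × 8 map and an initial gain of 0.4 this mirrors the parameter ranges reported later in Sec. 5, although the exact schedules there were tuned experimentally.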

3.3 Mining Document and Word Associations

When the document clustering process is accomplished, we perform a labeling process on the trained network to establish an association between each document and one of the neurons. The labeling process is as follows. Each document's feature vector x_i, 1 ≤ i ≤ M, is compared to the synaptic weight vectors of every neuron in the map, and we label the i-th document to the j-th neuron if they satisfy Eq. 1. After the labeling process, each document is labeled to some neuron or, from a different point of view, each neuron is labeled by a set of documents. We record the labeling result and obtain the DCM. In the DCM, each neuron is labeled by a list of documents which are considered similar and belong to the same cluster.

We explain here why the SOM algorithm performs a clustering process. In the labeling process, documents which contain many co-occurring words map to the same or neighboring neurons. Since the number of neurons is usually much smaller than the number of documents in the corpus, multiple documents may be labeled to the same neuron, so a neuron forms a document cluster. Also, neighboring neurons represent document clusters of similar meaning, i.e., of high word co-occurrence frequency in our context. On the other hand, it is possible that some neurons are not labeled by any document. We call these neurons the unlabeled neurons. Unlabeled neurons appear in two situations: when the number of documents is considerably small compared to the number of neurons, or when the corpus contains so many conceptually similar documents that a great part of the documents fall into a small set of neurons. However, unlabeled neurons do not diminish the result of the clustering, since they do not affect the similarity measurement between any pair of clusters.

We construct the WCM by labeling each neuron in the trained network with certain words. Such labeling is achieved by examining the neurons' synaptic weight vectors, and is based on the following observation. Since we adopt a binary representation for the document feature vectors, ideally the trained map should consist of synaptic weight vectors with component values near either 0 or 1. Since a value of 1 in a document vector represents the presence of the corresponding word in that document, a component with a value near 1 in a synaptic weight vector shows that the neuron has recognized the importance of the word and has tried to 'learn' it. According to this interpretation, we design the following word labeling process. For the weight vector w_j of the j-th neuron, if its n-th component exceeds a predetermined threshold, the word corresponding to that component is labeled to this neuron. The threshold should be a real value near 1. By virtue of the SOM algorithm, a neuron may be labeled by several words which often co-occur in a set of documents, making the neuron a word cluster.

This labeling method may not label every word in the corpus. We call such words the unlabeled words. Unlabeled words occur when several neurons compete for a word during the training process. The competition often results in imperfect convergence of the weights; as a result, some words may not be learned well, i.e., their corresponding components may not have values near 1 in any neuron's weight vector. We solve this problem by examining all the neurons in the map and labeling each unlabeled word to the neuron with the largest value of the corresponding component. That is, the n-th word is labeled to the j-th neuron if

  w_{jn} = \max_{1 \le k \le J} w_{kn}.   (3)

Note that we ignore the unlabeled neurons in Eq. 3.

The WCM autonomously clusters words according to their co-occurrence similarity. Words that tend to occur in the same documents are mapped to the same or neighboring neurons in the map. For example, the translated Chinese words for "neural" and "network" often occur together in a document. They map to the same neuron, or to neighboring neurons, because their corresponding components in the encoded document vectors are both set to 1, so a neuron tries to learn these two words simultaneously. On the contrary, words that do not co-occur in the same documents map to distant neurons. Thus we can reveal the relationship between two words from their corresponding neurons in the WCM.
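Both labeling processes operate only on the trained weight matrix and the document vectors. The sketch below follows the description above: each document is labeled to its closest neuron (Eq. 1), each word is labeled to the neurons whose corresponding weight component exceeds a threshold, and any word left unlabeled falls back to the labeled neuron with the largest component (Eq. 3). The threshold value 0.9 matches the value reported in Sec. 5; the data structures are otherwise our own illustrative choices.

```python
import numpy as np

def build_cluster_maps(X, W, threshold=0.9):
    """Build the document cluster map (DCM) and word cluster map (WCM)
    from document vectors X (M x N) and trained weights W (J x N)."""
    J, N = W.shape
    dcm = {j: [] for j in range(J)}       # neuron index -> document indices
    wcm = {j: [] for j in range(J)}       # neuron index -> word indices

    # DCM: label each document to the neuron closest to its vector (Eq. 1).
    for i, x in enumerate(X):
        j = int(np.argmin(np.linalg.norm(W - x, axis=1)))
        dcm[j].append(i)

    labeled = [j for j in range(J) if dcm[j]] or list(range(J))

    # WCM: label word n to every neuron whose n-th weight exceeds the threshold.
    for n in range(N):
        hits = [j for j in range(J) if W[j, n] >= threshold]
        if not hits:
            # Unlabeled word: fall back to the labeled neuron with the
            # largest weight for this word (Eq. 3, unlabeled neurons ignored).
            hits = [max(labeled, key=lambda j: W[j, n])]
        for j in hits:
            wcm[j].append(n)
    return dcm, wcm
```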

4. Automatic Generation of Web Directories

After the text mining process, each neuron in the DCM and in the WCM represents a document cluster and a word cluster, respectively. Such clusters can be considered as categories of the underlying corpus in a text categorization task, and a category forms a web directory when the corpus consists of web pages. In this section we describe methods for finding two kinds of implicit knowledge in these categories and use them for automatic web directory generation. First, we present a method to reveal the hierarchical structure among the categories. Such hierarchies are used to organize web directories in a human-comprehensible way for easy browsing. Second, a category theme identification method is developed to find a label for each category so that it can be easily interpreted by humans. These labels are used as the names of the corresponding web directories. After the web directory hierarchy generation process, classification of web pages into proper web directories is also achieved by our method.

4.1 Generation of Directory Hierarchies

To obtain a directory hierarchy, we first cluster the documents with the SOM using the method described in Sec. 3 and obtain the DCM and the WCM. As mentioned before, a neuron in the DCM represents a cluster of documents, so we use 'cluster' and 'neuron' interchangeably when we refer to the trained map. A cluster here also represents a category in text categorization terminology, so in the following we also use 'cluster' and 'category' interchangeably without ambiguity. Documents which are labeled to the same neuron, or to neighboring neurons, usually contain words that often co-occur in these documents. By virtue of the SOM algorithm, the synaptic weight vectors of neighboring neurons differ less than those of distant neurons. That is, similar document clusters correspond to neighboring neurons in the DCM. Thus we may generate a cluster of similar clusters, or a super-cluster, by congregating neighboring neurons. This essentially creates a two-level hierarchy in which the parent node is the constructed super-cluster and the child nodes are the clusters that compose it. The hierarchy generation process can then be applied to every super-cluster to establish the next level of the hierarchy. The overall hierarchy is established iteratively by this top-down approach until a stop criterion is satisfied.

To form a super-cluster, we first define the distance between two clusters:

  D(i, j) = \|G_i - G_j\|,   (4)

where i and j are the neuron indices of the two clusters and G_i is the two-dimensional grid coordinate of neuron i in the map. For a square formation of neurons, G_i = (i mod J^{1/2}, i div J^{1/2}). \|G_i - G_j\| measures the Euclidean distance between the two coordinates G_i and G_j. We also define the dissimilarity between two clusters:

  \bar{D}(i, j) = \|w_i - w_j\|,   (5)

where w_i is the synaptic weight vector of neuron i. We then try to find some significant clusters among all clusters to act as super-clusters. Since neighboring clusters in the map represent similar clusters, we can determine the significance of a cluster by examining the overall similarity contributed by its neighboring clusters. This similarity, namely the supporting cluster similarity, is defined by

  S_i = \frac{1}{|B_i|} \sum_{j \in B_i} \frac{doc(i)\,doc(j)}{F(D(i, j), \bar{D}(i, j))},   (6)

where doc(i) is the number of documents associated with neuron i (i.e., cluster i) in the DCM and B_i is the set of neuron indices in the neighborhood of neuron i. Note that we ignore the unlabeled neurons here. The function F : R^+ → R^+ is a monotonically increasing function which takes D(i, j) and \bar{D}(i, j) as arguments.

The super-clusters are developed from a set of dominating clusters in the map. A dominating cluster is a cluster which has locally maximal supporting cluster similarity. We may select all dominating clusters in the map by the algorithm shown in Figure 1.

Step 1. Find the cluster which has the largest supporting cluster similarity among the clusters under consideration. This cluster is a dominating cluster.
Step 2. Eliminate its neighboring clusters so that they will not be considered as dominating clusters.
Step 3. If (1) there is no cluster left, or (2) the number of dominating clusters N_D exceeds a predetermined value, or (3) the level of the hierarchy exceeds a predetermined depth Δ, stop the process. Otherwise decrease the neighborhood size used in Step 2 and go to Step 1.

Fig. 1. The super-cluster generation process algorithm.
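The supporting cluster similarity of Eq. 6 and the selection loop of Figure 1 can be sketched as follows. The weighting function F, the neighborhood radii, and their decay are left open in the paper, so the concrete choices below (a product-based F, fixed radii, and a simple shrink factor) are illustrative assumptions only.

```python
import numpy as np

def supporting_similarity(i, dcm, W, grid, radius=1.5,
                          F=lambda d, dw: 1.0 + d * dw):
    """Supporting cluster similarity S_i of Eq. 6.  F is an illustrative
    monotonically increasing weighting function, not the authors' choice."""
    doc = lambda j: len(dcm[j])
    B = [j for j in dcm
         if j != i and doc(j) > 0                         # ignore unlabeled neurons
         and np.linalg.norm(grid[j] - grid[i]) <= radius] # neighborhood B_i
    if not B:
        return 0.0
    total = sum(doc(i) * doc(j) /
                F(np.linalg.norm(grid[j] - grid[i]),      # D(i, j): grid distance
                  np.linalg.norm(W[j] - W[i]))            # D-bar(i, j): weight dissimilarity
                for j in B)
    return total / len(B)

def dominating_clusters(dcm, W, grid, radius=1.5, elim_radius=2.0, max_clusters=10):
    """Figure 1: repeatedly pick the locally dominating cluster and
    eliminate its neighbors from further consideration."""
    remaining = {j for j in dcm if dcm[j]}                # labeled neurons only
    S = {j: supporting_similarity(j, dcm, W, grid, radius) for j in remaining}
    dominating = []
    while remaining and len(dominating) < max_clusters:
        k = max(remaining, key=lambda j: S[j])            # Step 1: largest S_i
        dominating.append(k)
        # Step 2: eliminate the neighbors of k (including k itself).
        remaining -= {j for j in remaining
                      if np.linalg.norm(grid[j] - grid[k]) <= elim_radius}
        elim_radius = max(1.0, elim_radius * 0.9)         # Step 3: shrink neighborhood
    return dominating
```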

Fig. 2. (a) A two-level hierarchy comprises a super-cluster as the root node and several clusters as child nodes. (b) The dominating cluster k is selected and used as a super-cluster; its neighboring clusters compose the super-cluster. Only one possible construction of the hierarchy is shown.

The algorithm finds dominating clusters among all the clusters under consideration and creates one level of the hierarchy. The overall process is depicted in Figure 2. As shown in the figure, we may develop a two-level hierarchy by using a super-cluster (or dominating cluster) as the parent node and the clusters which are similar to the super-cluster as the child nodes. In Figure 2, super-cluster k corresponds to the dominating cluster k, and the neighboring clusters of cluster k are used as the child nodes in the hierarchy.

A dominating cluster is the centroid of a super-cluster, which contains several child clusters. We use the neuron index of a neuron as the index of the cluster associated with it; for consistency, the index of a dominating cluster is also used as the index of its corresponding super-cluster. The child clusters of a super-cluster are found by the following rule: cluster i belongs to super-cluster k if D(i, k) = min_l D(i, l), where l ranges over the super-clusters.

The above process creates a two-level hierarchy of categories. In the following we show how to obtain the overall hierarchy. A super-cluster may be thought of as a category which contains several sub-categories. In the first application of the super-cluster generation process (denoted STAGE-1), we obtain a set of super-clusters. Each super-cluster is used as the root node of a hierarchy, so the number of generated hierarchies equals the number of super-clusters obtained in STAGE-1. Note that we could put these super-clusters under one root node and obtain a single hierarchy by setting a large neighborhood in Step 2 of Figure 1. However, we see no advantage in doing so, because the corpus generally contains documents on a wide variety of themes, and trying to put all the different themes under a single general theme would be meaningless.

To find the children of the root nodes obtained in STAGE-1, we apply the super-cluster generation process to each super-cluster (STAGE-2). Notice that in STAGE-2 we only consider clusters which belong to the same super-cluster. A set of sub-categories is obtained for each hierarchy; these sub-categories are used as the third level of the hierarchies. The overall category hierarchy is then revealed by recursively applying the same process to each newly found super-cluster (STAGE-n). We decrease the size of the neighborhood used in selecting dominating clusters as the super-cluster generation process proceeds. This produces a reasonable number of levels for the hierarchies, as we will discuss later.

In the following we show how the text categorization task is achieved. After the hierarchy generation process, each neuron in the trained map is associated with a leaf node in one of the generated hierarchies. Since a neuron corresponds to a document cluster in the DCM, the developed hierarchies naturally perform a categorization of the documents in the training corpus. We may categorize new documents as follows. An incoming document T with document vector x_T is compared to all neurons in the trained map to find the document cluster to which it belongs. The neuron whose synaptic weight vector is closest to x_T is selected, and the incoming document is categorized into the category whose leaf node that neuron is associated with. This is depicted in Figure 3. In the figure, document T is closest to neuron i, which is in the third level of the first hierarchy; therefore T is categorized into the category which neuron i represents. Its parent categories, as well as any child categories, are easily obtained.
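The child-assignment rule and the categorization of a new page are both nearest-neighbor rules over quantities we already have. The sketch below is illustrative; `leaf_of` is a hypothetical mapping from a neuron index to its node in one of the generated hierarchies, not a structure defined in the paper.

```python
import numpy as np

def assign_children(clusters, super_clusters, grid):
    """Cluster i joins the super-cluster k that minimizes the grid
    distance D(i, k), as stated in the rule above."""
    children = {k: [] for k in super_clusters}
    for i in clusters:
        k = min(super_clusters, key=lambda s: np.linalg.norm(grid[i] - grid[s]))
        children[k].append(i)
    return children

def categorize(x_T, W, leaf_of):
    """Categorize a new document vector x_T: find the closest neuron
    (as in Eq. 1) and return the hierarchy leaf it is associated with."""
    j = int(np.argmin(np.linalg.norm(W - x_T, axis=1)))
    return leaf_of[j]
```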

Fig. 3. The text categorization process.

Through this approach, the task of text categorization is accomplished naturally by the category hierarchy generation process.

The neighborhood B_i used in calculating the supporting cluster similarity of a cluster i may be chosen arbitrarily. Two common choices are a circular neighborhood and a square neighborhood. In our experiments, the shape of the neighborhood is not crucial; it is the size of the neighborhood, denoted by N_c1, that matters. Different neighborhood sizes may result in different selections of dominating clusters. A small neighborhood may not capture the necessary support from similar clusters. On the other hand, without proper weighting, a large N_c1 will incorporate support from distant clusters which may not be similar to the cluster under consideration. Besides, large neighborhoods have the disadvantage of costing more computation time.

Another use of a neighborhood is to eliminate similar clusters in the super-cluster generation process. In each stage of the process, this neighborhood size, denoted by N_c2, has a direct influence on the number of dominating clusters. A large neighborhood eliminates many clusters and results in fewer dominating clusters; conversely, a small neighborhood produces a large number of dominating clusters. We must decrease the neighborhood size as the process proceeds, because the number of neurons under consideration also decreases.

In Step 3 of the super-cluster generation process algorithm we set three stop criteria. The first criterion stops the search for super-clusters if there is no neuron left for selection. This is a basic criterion, but we need the second criterion, which limits the number of dominating clusters, to constrain the breadth of the hierarchies. Without the second criterion we may obtain shallow hierarchies with too many categories in each level if the neighborhood size is small. An extreme case occurs when the neighborhood size is 0: Step 2 of the algorithm then eliminates no clusters, so every cluster is selected as a dominating cluster and we obtain J single-level hierarchies. Determining an adequate neighborhood size as well as a proper number of dominating clusters is thus crucial for obtaining an acceptable result. The third criterion constrains the depth of a hierarchy. If we allow a hierarchy to have a large depth, we will obtain a set of 'slimy' hierarchies.

Notice that setting a large depth may have no effect, because the neighborhood size and the number of dominating clusters may already satisfy the other stop criteria. An ad hoc heuristic rule used in our experiments determines the maximum depth d as follows:

  Find the first d such that \frac{J}{K^{2d}} \le 1.0, for d = 1, 2, \ldots,   (7)

where K is the dimension of the neighborhood, defined as a ratio to the map's dimension. For example, if the map contains an 8 × 8 grid of neurons, K = 4 means that the dimension of the neighborhood is 1/4 of the map's dimension, i.e., 2 in this case. The depth d which satisfies Eq. 7 is then 2.

Notice that there may exist some 'spare' clusters which are not used in any hierarchy after the hierarchy generation process. These clusters are not significant enough to be the dominating cluster of a super-cluster in any stage of the process. Although we could extend the depths in the hierarchy generation process to enclose all clusters in the hierarchies, sometimes we may decide not to do so because we want a higher document-cluster ratio, that is, a cluster should contain a significant number of documents. For example, if all clusters contain very few documents, it is not wise to use all of them in the hierarchies, because we may obtain a set of hierarchies which each contain many nodes without much information. To avoid producing such over-sized hierarchies, we adopt a different approach: when the hierarchies have been created and there still exist some spare neurons, we simply assign each spare neuron to its nearest neighbor. This in effect merges the document clusters associated with these spare clusters into the hierarchies. The merging process is necessary to achieve a reasonable document-cluster ratio.

4.2 Generation of Directories

In Sec. 4.1 we showed how to obtain the category hierarchies from the DCM. In each hierarchy, a leaf node represents an individual cluster as well as a directory. In this subsection we present a method that assigns a label to each directory in a hierarchy and thereby creates a human-interpretable web hierarchy. These labels should reflect the themes of their associated nodes, that is, directories. Moreover, the label of a parent directory in a hierarchy should represent a common theme of its child directories. Traditionally a label contains only a word or a simple phrase to allow easy comprehension by humans. In this work, we identify the cluster themes, i.e., the directory labels, by examining the WCM. As mentioned before, a neuron i in the DCM represents a cluster and includes a set of similar documents. In the meantime, the same neuron in the WCM contains a set of words that often co-occur in that set of documents.

Since neighboring neurons in the DCM contain similar documents, some significant words should occur often in these documents. The word that the neighboring neurons try hardest to learn should be the most important word and, as a result, the theme of the cluster. Thus we may find the most significant word of a cluster by examining the synaptic weight vectors of its corresponding neuron and the neighboring neurons. In the following we show how to find the cluster themes. Let C_k denote the set of clusters that belong to super-cluster k. The cluster theme of this super-cluster is selected from the words in the WCM that are associated with the clusters in C_k. For all clusters j ∈ C_k, we select the n*-th word as the cluster theme of super-cluster k if

  \sum_{j \in C_k} \frac{w_{jn^*}}{G(j, k)} = \max_{1 \le n \le N} \sum_{j \in C_k} \frac{w_{jn}}{G(j, k)}.   (8)

Here G : R^+ ∪ {0} → R^+ is a monotonically non-decreasing weighting function; G(j, k) increases when the distance between neuron j and neuron k increases. Eq. 8 selects the word that is most important to a super-cluster, since the weight of a synapse in a neuron reflects the willingness of the neuron to learn the corresponding input, i.e., a word in our case. We apply Eq. 8 in each stage of the super-cluster generation process, so when a new super-cluster is found, its theme is also determined. In STAGE-1, the selected themes label the root nodes of every directory hierarchy. Likewise, the themes selected in STAGE-n are used as the labels of the n-th level nodes in each hierarchy. If a word has been selected in a previous stage, it is not a candidate theme in the following stages.

In Eq. 8 the function G weights each neuron according to its distance to the super-cluster's centroid: neurons near the centroid contribute more to the overall word significance. However, a simple constant function produced results similar to several other weighting functions in our experiments. Table 1 lists the results for different weighting functions. We can observe that the identified themes are considerably similar for the different functions; there is no difference among them when the factors are properly set.

Table 1. Comparison of the themes generated by different weighting functions. The parameters a, b, and α are all positive real values. Each function is compared to the constant weighting function, and the entries are the percentages of matched themes for various factor values. The factor b was set to 1.0 for all linear functions.

  Weighting function                 Factor    1.0      0.5      0.1      0.01     0.001
  Linear, G(x) = ax + b              a         91.43    92.86    97.14    100      100
  Exponential, G(x) = e^{αx}         α         92.86    92.86    97.14    100      100
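Theme selection (Eq. 8) is a weighted vote over the synaptic weights of the clusters that form a super-cluster. The sketch below uses the linear weighting G(x) = ax + b from Table 1 with illustrative factor values; excluding previously selected words follows the rule stated above, and for simplicity every word is scored, whereas the paper restricts candidates to words labeled to the clusters in C_k.

```python
import numpy as np

def select_theme(Ck, k, W, grid, vocab, used, a=0.01, b=1.0):
    """Pick the theme word n* of super-cluster k from its member clusters Ck
    (Eq. 8).  G is the linear weighting G(x) = a*x + b of Table 1."""
    G = lambda j: a * np.linalg.norm(grid[j] - grid[k]) + b
    scores = np.zeros(W.shape[1])
    for j in Ck:
        scores += W[j] / G(j)            # sum_j w_jn / G(j, k) for every word n
    for n in used:                       # words chosen at earlier stages are excluded
        scores[n] = -np.inf
    n_star = int(np.argmax(scores))
    used.add(n_star)
    return vocab[n_star]
```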

5. Experimental Results

We built a system named AGENDA (Automatically GENerated Directory Access) to implement the proposed method and perform the experiments. We describe the details of the experiments in the following.

We applied our method to the Chinese news articles posted daily on the web by CNA (Central News Agency). Two corpora were constructed in our experiments. The first corpus (CORPUS-1) contains 100 news articles posted on Aug. 1, 2, and 3, 1996. The second corpus (CORPUS-2) contains 3268 documents posted during Oct. 1 to Oct. 9, 1996. As mentioned in Sec. 3.1, we discard over-lengthy and duplicated documents. A word extraction process is applied to the corpora to extract Chinese words; 1475 and 10937 words are extracted from CORPUS-1 and CORPUS-2, respectively. To reduce the dimensionality of the feature vectors we discarded those words which occur only once in a document, as well as the words appearing in a manually constructed stop list. This reduces the number of words to 563 and 1976 for CORPUS-1 and CORPUS-2, respectively, i.e., reduction rates of 62% and 82% for the two corpora.

To train on CORPUS-1, we constructed a self-organizing map which contains 64 neurons in an 8 × 8 grid. The number of neurons was determined experimentally such that a good clustering could be achieved. Each neuron in the map contains 563 synapses. The initial training gain is set to 0.4 and the maximal training time is set to 100. These settings were also determined experimentally: we tried gain values ranging from 0.1 to 1.0 and training time settings ranging from 50 to 200, and simply adopted the setting which achieved the most satisfying result. After training we label the map with documents and with words, respectively, by the methods described in Sec. 3.3, and obtain the DCM and the WCM for CORPUS-1. The same process is applied to CORPUS-2 to obtain its DCM and WCM. Table 2 lists the settings of the SOM used in training CORPUS-1 and CORPUS-2.

The resulting DCMs and WCMs for the two corpora are depicted in Figures 4 and 5. Each grid cell in the maps represents a neuron. Starting from neuron 1 in the upper-left corner of the map, the neuron index increases row by row toward the lower-right corner, which has an index value of J. In each grid cell of Figure 4(a) we list the subjects of the news articles associated with that cell. Similarly, we list the words belonging to the same cluster in the same grid cell of Figure 5(a). We do not directly list the subjects in Figure 4(b), or the words in Figure 5(b), due to space limitations. To show the effectiveness of the clustering process we select several clusters and list the documents labeled to these clusters in Figure 6. The neuron indices are shown at the top of each cluster of documents and are the same as in Figure 4(b).

Table 2. Training parameters used in our experiments.

                                    CORPUS-1   CORPUS-2
  Number of neurons                 64         400
  Number of synapses per neuron     563        1976
  Initial training gain             0.4        0.4
  Maximal training time             200        500



Fig. 4. The document cluster maps for (a) CORPUS-1 and (b) CORPUS-2. In (b) we only show the number of documents associated with each neuron. The starting neuron index of each row is shown on the left of the map.

We can observe that documents labeled to the same cluster are similar in context. For example, cluster 1 contains articles related to Mr. Tang Shubei, who was the executive vice-chairman of the Association for Relations Across the Taiwan Strait, and neuron 400 contains articles related to Chia-Yi, a city located in southern Taiwan. Moreover, neurons 1, 2, and 3, which are located close together in the map, all contain articles related to Mainland China.





Fig. 5. The word cluster maps for (a) CORPUS-1 and (b) CORPUS-2. In (b) we only show the number of words associated with each cluster due to space limitations.

Similarly, the word clusters of CORPUS-2 are shown in Figure 7; we only show the word clusters of those neurons shown in Figure 6. It is clear that the word clusters effectively capture the general ideas of the underlying document clusters. For example, the word cluster of neuron 1 contains 9 words which are closely related to the documents associated with the same neuron in Figure 6. Through the above examples it is clear that the DCM and the WCM successfully cluster the documents and the words, respectively.

In Table 3 we list some statistics of the DCMs and the WCMs. We can see that the number of unlabeled words is usually high. The cause of this may be explained as follows. The corpus contains many similar documents which share many common words. By virtue of the SOM algorithm, neighboring neurons compete for these words. Thus the synaptic weights of a neuron will 'vibrate' if the neuron competes for the same word with neighboring neurons.

Fig. 6. Selected clusters and the documents labeled to them. We only show the subjects of the documents; every document contains a subject line beginning with '@'.

As a result, the synaptic weights will converge imperfectly. Such imperfect convergence produces many unlabeled words if we use a high threshold value. However, we used the two-stage labeling process described in Sec. 3.3 to overcome this problem. Although lower threshold values tend to decrease the number of unlabeled words, we suggest not using an extremely small threshold value (e.g., less than 0.5), because many insignificant words would then be included in the word clusters. Figure 8 depicts the relationship between the number of unlabeled words and the threshold value. In all the experiments, we set the threshold value to 0.9.

After the clustering process, we applied the hierarchy generation process to the DCM to obtain the directory hierarchies. In our experiments we limited
