Multilingual Hierarchy Generation and Alignment Using Self-Organizing Maps


Hsin-Chang Yang

Department of Information Management, National University of Kaohsiung

Kaohsiung, Taiwan 811

Email: [email protected]

Chung-Hong Lee

Department of Electrical Engineering, National Kaohsiung University of Applied Sciences

Kaohsiung, Taiwan

Email: [email protected]

Heng-Tzu Tsai

Department of Information Management, National University of Kaohsiung

Kaohsiung, Taiwan 811

Abstract—Multilingual information retrieval has attracted considerable attention in recent years owing to the rapid growth of multilingual Web pages. Retrieving documents written in a language other than that of the query is difficult unless the relationships among entities of different languages are known. In this work, we develop a method based on self-organizing maps to organize documents into hierarchies. Once the hierarchies of the individual languages have been built, an alignment process is applied to them to discover the associations between entities of different languages. The method was applied to a parallel corpus of Chinese and English documents and obtained promising results.

I. INTRODUCTION

Nowadays many documents are stored as Web pages. The number of Web pages has exceeded several billion and is still growing. Although these pages are expressed in standard description languages such as HTML or XML, their contents are written in different languages, generally the native languages of their authors. Meanwhile, most users tend to query Web pages in their own native language, while the pages may be written in various languages. Only pages that are written in, or contain, the same language as the query will be retrieved unless some kind of translation or mapping is performed between the query and the Web pages. Such translation, however, is a difficult task. Users are thus forced to give up their familiar languages and use the one that suits the target pages, which is inconvenient for users who have limited ability to express their needs in non-native languages. To remedy this inconvenience, cross-lingual or multilingual information retrieval (MLIR) techniques have been developed for decades. Traditionally, machine translation schemes were adopted to translate queries and Web pages into the same language to perform MLIR. Unfortunately, there is still no well-recognized scheme that provides precise translation between two languages. A different approach is to match queries and documents directly without a priori translation. This approach is also difficult since it requires a measure of the semantic relatedness between queries and documents, and such measures generally cannot be defined explicitly, even with human intervention. We therefore need an automated process to discover the relationships between different languages. Such a process is often called multilingual text mining (MLTM).

In this work, we develop an MLTM technique that creates multilingual hierarchies and aligns them automatically. Our method applies the self-organizing map (SOM) model [1] to cluster a set of parallel texts. We then develop a method to structure the trained maps into hierarchies, together with an alignment algorithm that aligns hierarchies of different languages according to their semantic relatedness. After the alignment, the relationships between multilingual text documents as well as between multilingual linguistic terms can be discovered. We performed experiments on a set of Chinese-English parallel texts. The evaluation results suggest that our approach is plausible for constructing and aligning multilingual hierarchies, as well as for tackling MLIR tasks.

II. RELATED WORK

We briefly review work in two related fields. Sec. II-A reviews work on MLTM, and Sec. II-B discusses work on hierarchy construction and alignment.

A. Related work in multilingual text mining

Many multilingual text mining techniques, especially machine learning approaches, are based on comparable or parallel corpora, as is this work. Chau and Yeh [2] generated fuzzy membership scores between Chinese and English terms and clustered them to perform MLIR. Lee and Yang [3] also used the SOM to train on a set of Chinese-English parallel corpora and generate two feature maps, from which the relationships between bilingual documents and terms are discovered. Rauber et al. [4] applied the GHSOM to cluster multilingual corpora containing Russian, English, German, and French documents. However, they translated all documents into one language before training, and thus actually performed monolingual text mining.

B. Related work on hierarchy creation and alignment

In knowledge discovery research, data, especially textual data, are often organized into hierarchies since hierarchies are well perceived by humans. In one of the early works, Han and Fu [5] developed a technique for refining existing concept hierarchies and generating new ones. However, their method can only be applied to numerical attributes. McCallum and Nigam [6] used a bootstrapping process to generate new terms from a set of human-provided keywords. This approach was then applied to portal site construction [7]. Human intervention is still required in their work. Probabilistic methods have been widely used in exploiting hierarchy. Weigend et al. [8] proposed a two-level architecture for text categorization. The first level of the architecture predicts the probabilities of the meta-topic groups, which are groups of topics. This allows the individual models for each topic on the second level to focus on finer discrimination within the group. They used a supervised neural network to learn the hierarchy, where topic classes were provided and already assigned. A different probabilistic approach by Hofmann [9] used an unsupervised learning architecture called the Cluster-Abstraction Model to organize groups of documents in a hierarchy. Liu and Yang [10] used the link information within Web pages to build a topic hierarchy of a Website. Their method relies on the precise analysis and specification of link features, which is impractical for ordinary texts. An approach close to ours was proposed by Chuang and Chien [11], which uses the results returned by search engines for feature extraction and applies a modified hierarchical agglomerative clustering algorithm to organize short text segments into shallow hierarchies. Their work resembles ours in two aspects: first, a hierarchical clustering approach is used; second, it creates shallow, multi-way hierarchies instead of binary ones. However, our method does not rely on search engines for feature extraction. Moreover, our approach can simultaneously identify themes and organize documents using only the contents of the textual documents.

Alignment of concept hierarchies resembles the task of ontology mapping or alignment since a concept hierarchy is similar to an ontology in both structure and function. Many works on ontology alignment or mapping have been proposed [12], [13], [14]. According to Noy and Stuckenschmidt's classification [12], ontology mapping discovery schemes can be based on a common reference ontology, lexical similarity, structural similarity, user input, external resources such as WordNet's annotation, or prior matching. Our method does not fit neatly into this classification; it may be considered a combined scheme that uses lexical and structural similarities. A work close to ours is the HICAL system [15], which generates alignment rules for concepts using a statistical method in a top-down manner. It first examines the similarity of the root concepts of distinct hierarchies and generates a rule if these concepts are similar. This process is recursively applied to lower-level concepts to obtain more specific rules. The method is similar to ours in its use of the data instances within the concepts of the hierarchies. However, the alignment algorithm is significantly different.

III. MULTILINGUAL HIERARCHY GENERATION

To obtain the hierarchies, we first perform a clustering process on a set of parallel documents. We then apply a hierarchy generation process to the clustering result and obtain the hierarchies. We will start from the preprocessing steps.

A. Document preprocessing

In this work, we first used a common segmentation program to extract candidate keywords from the English documents. Stopword elimination and stemming were then applied to reduce the number of keywords. The keywords of all documents are collected to build the English vocabulary, denoted $V_E$. A document is encoded into a binary vector according to the keywords that occur in it: an element of the vector is 1 if the corresponding keyword occurs in the document and 0 otherwise. With this scheme, a document $E_j$ is encoded into a binary vector $\mathbf{E}_j$ whose size (number of elements) equals the size of the vocabulary, i.e. $|V_E|$.
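
The binary encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `build_vocabulary` and `encode_binary` and the sample keyword lists are hypothetical, and segmentation, stopword removal, and stemming are assumed to have been done already.

```python
def build_vocabulary(docs):
    """Collect every distinct keyword, in first-seen order, into a vocabulary list."""
    vocab = []
    seen = set()
    for keywords in docs:
        for kw in keywords:
            if kw not in seen:
                seen.add(kw)
                vocab.append(kw)
    return vocab


def encode_binary(keywords, vocab):
    """Return a 0/1 vector with one element per vocabulary entry."""
    present = set(keywords)
    return [1 if term in present else 0 for term in vocab]


# Hypothetical keyword lists, assumed already stemmed and stopword-filtered.
english_docs = [
    ["textil", "fiber", "manufactur"],
    ["fiber", "nylon"],
]
V_E = build_vocabulary(english_docs)
vectors = [encode_binary(doc, V_E) for doc in english_docs]
```

Every vector has exactly `len(V_E)` elements, matching the statement that the vector size equals the vocabulary size.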

Segmenting Chinese text is more difficult than segmenting English text because a run of consecutive characters must be separated into words. Several segmentation schemes exist for Chinese. We adopt the segmentation program developed by the CKIP team of Academia Sinica [16]. Unlike English, Chinese words generally require no stemming. Stopword elimination could be applied to Chinese words as in English, but no standard stopword list is available. In this work we omit this step since we select only nouns as keywords. As in the English case, the selected keywords are collected to build the Chinese vocabulary $V_C$. Each Chinese document $C_j$ is encoded into a vector $\mathbf{C}_j$ in the same manner as the English documents, and the size of $\mathbf{C}_j$ equals the size of the vocabulary, i.e. $|V_C|$.

B. Self-organizing map clustering and labeling

Once the documents are encoded into vectors as described in Sec. III-A, we apply the SOM training algorithm to train maps using these vectors. When training is complete, we perform a labeling process on the trained network to associate each document with one of the neurons. The labeling process is as follows. The feature vector $\mathbf{x}_i$ of document $D_i$, $1 \le i \le M$, where $M$ is the number of documents, is compared with the synaptic weight vector of every neuron in the map. We label $D_i$ to the $j$th neuron if that neuron's synaptic weight vector is the closest to $\mathbf{x}_i$. After the labeling process, each document is labeled to some neuron or, from a different point of view, each neuron is labeled by a set of documents. We record the labeling result as the document cluster map (DCM). In the DCM, each neuron is labeled by a list of documents that are considered similar and belong to the same cluster.
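
The document labeling step can be sketched as below. The sketch assumes that "closest" means smallest Euclidean distance and that the trained weight vectors are available as plain lists; `label_documents` and the toy weights are hypothetical names and values chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_documents(doc_vectors, weights):
    """Build a document cluster map: neuron index -> list of document indices."""
    dcm = {j: [] for j in range(len(weights))}
    for i, x in enumerate(doc_vectors):
        # The winning neuron is the one whose weight vector is closest to x.
        winner = min(range(len(weights)), key=lambda j: euclidean(x, weights[j]))
        dcm[winner].append(i)
    return dcm


# Toy map with two neurons and a three-word vocabulary (illustrative values only).
weights = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
docs = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
dcm = label_documents(docs, weights)
```

Each document index appears under exactly one neuron, so the map can be read either as document-to-neuron or as neuron-to-documents.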

We also construct the keyword cluster map (KCM) by labeling each neuron in the trained network with certain keywords. This labeling is achieved by examining the neurons' synaptic weight vectors. For the weight vector $\mathbf{w}_j$ of the $j$th neuron, if its $n$th component exceeds a predetermined threshold, the corresponding word, i.e. the $n$th word in the vocabulary, is labeled to this neuron. By virtue of the SOM algorithm, a neuron may be labeled by several keywords that often co-occur in a set of documents, making the neuron a keyword cluster.
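
A minimal sketch of the keyword labeling follows. The threshold value of 0.5 and the name `label_keywords` are assumptions made for the example; the text only states that the threshold is predetermined.

```python
def label_keywords(weights, vocab, threshold):
    """Build a keyword cluster map: neuron index -> keywords whose weight exceeds threshold."""
    kcm = {}
    for j, w in enumerate(weights):
        kcm[j] = [vocab[n] for n, value in enumerate(w) if value > threshold]
    return kcm


# Toy vocabulary and weights (illustrative values only).
vocab = ["fiber", "nylon", "travel"]
weights = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
kcm = label_keywords(weights, vocab, threshold=0.5)
```

Neuron 0 receives two keywords here because both of their weight components are high, which is how co-occurring keywords end up in the same keyword cluster.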

C. Hierarchy generation

To generate a hierarchy, we should first cluster documents by the SOM using the method described in Sec. III-B to obtain the DCM and the KCM. We may generate a cluster of similar clusters, or a super cluster, by aggregating neighboring neurons. This should essentially create a two-level hierarchy such that the parent node is the constructed super cluster and the child nodes are the clusters that compose the super cluster. The hierarchy generation process can be further applied to every super cluster to establish the next higher level of this hierarchy. The overall hierarchy can then be established iteratively until a stop criterion is satisfied.

Here we use a top-down approach to generate hierarchies based on the construction of super clusters. To form a super cluster, we first define the distance between two clusters in the map:

$$D(i, j) = \|G_i - G_j\|, \qquad (1)$$

where $i$ and $j$ are the neuron indices of the two clusters and $G_i$ is the two-dimensional grid coordinate of neuron $i$ in the map. For a square formation of $J$ neurons, $G_i = (i \bmod \sqrt{J},\ i \ \mathrm{div}\ \sqrt{J})$. $\|G_i - G_j\|$ measures the Euclidean distance between the two coordinates $G_i$ and $G_j$. We also define the dissimilarity between two clusters:

$$\mathcal{D}(i, j) = \|\mathbf{w}_i - \mathbf{w}_j\|, \qquad (2)$$

where $\mathbf{w}_i$ is the synaptic weight vector of neuron $i$. We now look for significant clusters among all clusters to serve as super clusters. Since neighboring clusters in the map represent similar clusters, we can determine the significance of a cluster by examining the overall similarity contributed by its neighboring clusters. This similarity, namely the aggregated cluster similarity, is defined by

$$S_i = \frac{1}{|B_i|} \sum_{j \in B_i} \frac{\mathrm{doc}(i)\,\mathrm{doc}(j)}{F(D(i, j), \mathcal{D}(i, j))}, \qquad (3)$$

where $\mathrm{doc}(i)$ is the number of documents associated with neuron $i$, i.e. cluster $i$, in the DCM and $B_i$ is the set of neuron indices in the neighborhood of neuron $i$. The function $F : \mathbb{R}^+ \to \mathbb{R}^+$ is a monotonically increasing function which takes $D(i, j)$ and $\mathcal{D}(i, j)$ as arguments.
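
The three quantities above can be computed as in the following sketch. Several choices here are assumptions, since the text leaves them open: the summand is read as doc(i)doc(j) divided by F, the sum is normalized by the number of neighbors, F(a, b) = a + b is used as one example of a monotonically increasing function, and the neighborhood of neuron i is taken to be all other neurons within a grid radius.

```python
import math


def grid_coord(i, side):
    """Grid coordinate G_i of neuron i on a side x side map."""
    return (i % side, i // side)


def grid_distance(i, j, side):
    """Eq. (1): Euclidean distance between grid coordinates."""
    (xi, yi), (xj, yj) = grid_coord(i, side), grid_coord(j, side)
    return math.hypot(xi - xj, yi - yj)


def dissimilarity(i, j, weights):
    """Eq. (2): Euclidean distance between synaptic weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(weights[i], weights[j])))


def neighborhood(i, side, radius):
    """Assumed definition of B_i: other neurons within `radius` on the grid."""
    return [j for j in range(side * side)
            if j != i and grid_distance(i, j, side) <= radius]


def aggregated_similarity(i, weights, doc_counts, side, radius,
                          F=lambda d, s: d + s):
    """Eq. (3) under the assumptions stated in the lead-in."""
    B_i = neighborhood(i, side, radius)
    if not B_i:
        return 0.0
    total = sum(doc_counts[i] * doc_counts[j]
                / F(grid_distance(i, j, side), dissimilarity(i, j, weights))
                for j in B_i)
    return total / len(B_i)


# Toy 2 x 2 map with one-dimensional weights (illustrative values only).
weights = [[0.0], [1.0], [0.0], [1.0]]
doc_counts = [2, 1, 1, 0]
S = [aggregated_similarity(i, weights, doc_counts, side=2, radius=1.0)
     for i in range(4)]
```

Because neighbors are at grid distance at least 1, the example F never returns zero, so the division is safe; a different F would need its own guard.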

The super clusters are developed from a set of dominated clusters in the map. A dominated cluster is a cluster that has locally maximal aggregated cluster similarity. All dominated clusters in the map can be selected by the following algorithm:

Step 1 Find the cluster with the largest aggregated cluster similarity among all clusters under consideration. This cluster is a dominated cluster.

Step 2 Eliminate its neighboring clusters so that they will not be considered as dominated clusters in succeeding steps.

Step 3 Stop the process if

1) there is no cluster left, or

2) the number of dominated clusters $N_D$ exceeds a predetermined value, or

3) the level of the hierarchy generated so far exceeds a predetermined depth $\sigma$.

Otherwise, decrease the neighborhood size used in Step 2 and go to Step 1.
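
The selection loop can be sketched as a greedy procedure. The parameter names `max_dominated` and `shrink` are hypothetical, the amount by which the neighborhood shrinks per iteration is an assumption (the text does not give it), and the depth criterion is left to the caller because it concerns the enclosing stage rather than this loop. Ties in similarity are broken by the lower neuron index, which is also an assumption.

```python
import math


def grid_distance(i, j, side):
    """Euclidean distance between the grid coordinates of neurons i and j."""
    return math.hypot(i % side - j % side, i // side - j // side)


def select_dominated(scores, side, radius, max_dominated, shrink=0.5):
    """Greedily pick clusters with maximal aggregated similarity.

    scores[i] is the aggregated cluster similarity S_i of neuron i.
    """
    candidates = set(range(len(scores)))
    dominated = []
    while candidates and len(dominated) < max_dominated:
        # Step 1: largest similarity among the clusters still under consideration.
        best = max(candidates, key=lambda i: (scores[i], -i))
        dominated.append(best)
        candidates.discard(best)
        # Step 2: eliminate the neighbors of the selected cluster.
        candidates -= {j for j in candidates
                       if grid_distance(best, j, side) <= radius}
        # Step 3: shrink the neighborhood before the next round.
        radius = max(radius - shrink, 0.0)
    return dominated


# Toy 3 x 3 map (illustrative similarity values only).
scores = [5.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0]
picked = select_dominated(scores, side=3, radius=1.5, max_dominated=10)
```

In the toy run neuron 1 has the second-highest score but is never selected, because it lies in the neighborhood of neuron 0 and is eliminated in the first round.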

The child clusters of a super cluster are found by the following rule: cluster i belongs to super cluster k if D(i, k) < D(i, l) for every other super cluster l.
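
The assignment rule can be sketched as a nearest-super-cluster search. The distance is passed in as a function so that either the grid distance or the weight-space dissimilarity can be used; the example below assumes grid distance. The rule uses a strict inequality and so does not say what happens on ties; this sketch simply keeps the first super cluster in the list, which is an assumption. The name `assign_children` is hypothetical.

```python
import math


def grid_distance(i, j, side):
    """Euclidean distance between the grid coordinates of neurons i and j."""
    return math.hypot(i % side - j % side, i // side - j // side)


def assign_children(clusters, super_clusters, distance):
    """Map each super cluster to the ordinary clusters that are nearest to it."""
    children = {k: [] for k in super_clusters}
    for i in clusters:
        if i in children:
            continue  # a super cluster is not its own child
        nearest = min(super_clusters, key=lambda k: distance(i, k))
        children[nearest].append(i)
    return children


# Toy 3 x 3 map with super clusters at neurons 0 and 5 (illustrative only).
side = 3
children = assign_children(
    clusters=range(side * side),
    super_clusters=[0, 5],
    distance=lambda i, k: grid_distance(i, k, side),
)
```

Applying the same function again to the children of one super cluster, with a new set of dominated clusters chosen among them, gives the next level of the hierarchy.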

The above process creates a two-level hierarchy. In the following we will show how to obtain the overall hierarchy. In the first application of the super cluster generation process (denoted by STAGE-1), we obtain a set of super clusters. We aggregate these super clusters under one pseudo root node and form the first and second levels of the hierarchy. To find the children of the super clusters obtained in STAGE-1, we may apply the super cluster generation process on each super cluster (STAGE-2). Notice that in STAGE-2 we only consider clusters which belong to the same super cluster. A set of child nodes will be obtained and be used as the third level of the hierarchy. The overall hierarchy can then be revealed by recursively applying the same process to each new-found super cluster (STAGE-n). We decrease the size of neighborhood in selecting dominated clusters when the super cluster generation process proceeds.

In the generated hierarchy, a node represents an individual neuron which is associated with a cluster of documents. We also assign labels to each node to exhibit the themes of the clusters. The labels of a node are those keywords associated with the same neuron in the KCM. In this regard, a node in the hierarchy is labeled by a document cluster as well as a keyword cluster. Nodes in higher levels should have more general coverage than those in lower levels. We will use these labels to align hierarchies, which will be discussed in the next section.

IV. MULTILINGUAL HIERARCHY ALIGNMENT

In this section we describe a method to align two hierarchies generated using the method described in Sec. III-C. Let $H_1 = \{V_1, E_1\}$ and $H_2 = \{V_2, E_2\}$ be the hierarchies generated from documents of two different languages, where $V_1$ and $E_1$ denote the sets of vertices (nodes) and edges, respectively, of $H_1$. Likewise, $V_2$ and $E_2$ denote the sets of vertices and edges, respectively, of $H_2$. We first introduce an alignment method to map nodes between $H_1$ and $H_2$; that is, a mapping $M : V_1 \to V_2$ will be derived.

A. Hierarchy alignment

We want to map a node $C_k \in V_1$ to some node $E_l \in V_2$. Two nodes are considered related if they have similar themes, and the theme of a node can be determined from the documents labeled to it. We can therefore associate two nodes according to their corresponding document clusters. Since we use parallel corpora to train the SOM, the correspondence between a document and its counterpart in the other language is known a priori. We use these correspondences to associate document clusters of different languages.

We use a voting scheme to calculate the likelihood of associating $C_k$ in $H_1$ with $E_l$ in $H_2$. For each pair of documents $C_i$ and $C_j$ in $C_k$, we find the nodes to which their counterparts $E_i$ and $E_j$ are labeled in $H_2$. Let these nodes be $E_p$ and $E_q$, respectively. We then find the shortest path between them in $H_2$ and add a score of 1 to both $E_p$ and $E_q$. Meanwhile, a score of $\frac{1}{\mathrm{dist}(C_i, C_j) - 1}$ is added to every other cluster on this path, where the function $\mathrm{dist}(C_i, C_j)$ returns the length of the shortest path between the nodes of $E_i$ and $E_j$ in $H_2$. The same scheme is applied to all pairs of documents in $C_k$, and the cumulative scores of all nodes in $H_2$ are calculated. We associate $C_k$ with the node $E_l$ that has the highest score. When several nodes tie for the highest score, we add to each the largest score among its adjacent nodes to break the tie. If a tie remains, an arbitrary selection can be made; one possible choice is the node in the same layer as $C_k$, or the nearest one, since it may have similar coverage of themes. The above process defines the mapping $M$ between $H_1$ and $H_2$.
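
The voting scheme can be sketched as follows, with $H_2$ given as an undirected adjacency list. Several points are assumptions: each intermediate node on a path of length dist is assumed to receive 1/(dist − 1), two counterparts that fall in the same node each contribute their own score of 1 to it, and the tie-breaking rules are replaced by taking the first maximum. The names `shortest_path`, `vote`, and `align` and the toy tree are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the node list from start to goal inclusive."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def vote(cluster_docs, counterpart_node, adj):
    """Accumulate scores over all H2 nodes for one cluster C_k of H1."""
    scores = {node: 0.0 for node in adj}
    for ci, cj in combinations(cluster_docs, 2):
        p, q = counterpart_node[ci], counterpart_node[cj]
        path = shortest_path(adj, p, q)
        dist = len(path) - 1          # path length in edges
        scores[p] += 1.0
        scores[q] += 1.0
        for node in path[1:-1]:       # empty when dist <= 1
            scores[node] += 1.0 / (dist - 1)
    return scores


def align(cluster_docs, counterpart_node, adj):
    """Return the H2 node with the highest cumulative score."""
    scores = vote(cluster_docs, counterpart_node, adj)
    return max(scores, key=scores.get)


# Toy H2: root R with children A and B; A has a child A1 (illustrative only).
adj = {"R": ["A", "B"], "A": ["R", "A1"], "B": ["R"], "A1": ["A"]}
# Documents 1..3 of C_k and the H2 nodes holding their parallel counterparts.
counterpart_node = {1: "A1", 2: "A1", 3: "B"}
scores = vote([1, 2, 3], counterpart_node, adj)
best = align([1, 2, 3], counterpart_node, adj)
```

In the toy run node A1 wins because two of the three counterparts are labeled to it, while the nodes between A1 and B only collect fractional scores.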

B. Association discovery

After establishing the mapping $M$ from $H_1$ to $H_2$, the associations between documents in different languages can be defined by this mapping. Let $L_1$ and $L_2$ denote the languages of $H_1$ and $H_2$, respectively. A document $C_i$ in $L_1$ is associated with a document $E_j$ in $L_2$ if their corresponding nodes in $H_1$ and $H_2$, respectively, are associated. Likewise, keyword associations are created according to the mapping: a keyword labeled to node $k$ in $H_1$ is associated with a keyword labeled to node $l$ in $H_2$ if $C_k$ and $E_l$ are associated.

Associations between documents and keywords, in either language, are also easily defined by the mapping $M$. When $C_k$ is associated with $E_l$, all documents and keywords labeled to these two nodes are associated. This includes the association between a document in $L_1$ and a keyword in $L_1$, between a document in $L_2$ and a keyword in $L_2$, between a document in $L_1$ and a keyword in $L_2$, and between a document in $L_2$ and a keyword in $L_1$.
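
Given a node mapping, the cross-lingual document and keyword associations can be enumerated as in this sketch. The dictionaries stand in for the DCM and KCM contents of each hierarchy; the function name `discover_associations`, the node names, and the sample labels are hypothetical.

```python
def discover_associations(mapping, docs1, docs2, keys1, keys2):
    """Enumerate cross-lingual document pairs and keyword pairs implied by the mapping.

    mapping: node of H1 -> associated node of H2
    docs1, docs2: node -> documents labeled to it
    keys1, keys2: node -> keywords labeled to it
    """
    doc_pairs, keyword_pairs = [], []
    for k, l in mapping.items():
        doc_pairs += [(c, e) for c in docs1.get(k, []) for e in docs2.get(l, [])]
        keyword_pairs += [(a, b) for a in keys1.get(k, []) for b in keys2.get(l, [])]
    return doc_pairs, keyword_pairs


# Illustrative labels only.
mapping = {"C_k": "E_l"}
docs1 = {"C_k": ["C1", "C3"]}
docs2 = {"E_l": ["E1", "E2"]}
keys1 = {"C_k": ["纖維"]}
keys2 = {"E_l": ["fiber", "nylon"]}
doc_pairs, keyword_pairs = discover_associations(mapping, docs1, docs2, keys1, keys2)
```

The document-to-keyword associations follow the same pattern by pairing the document list of one node with the keyword list of the same node or of its associated node.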

V. EXPERIMENTAL RESULT

We constructed the bilingual parallel corpus by collecting parallel documents from the Sinorama corpus, which contains segments of bilingual articles from Sinorama magazine. Each document is a segment of an article. The corpus contains 976 parallel documents. This is a relatively small corpus that we used mainly for explanatory purposes. Each Chinese document was segmented into a set of keywords through the segmentation program developed by the CKIP team of Academia Sinica, which is also a part-of-speech tagger. We selected only nouns and discarded stopwords, which gave a Chinese vocabulary of size 3436. For the English documents, a common segmentation program and part-of-speech tagger were used to convert them into keywords. Stopwords were also removed, and Porter's stemming algorithm was used to obtain the stems of the English keywords. This gave an English vocabulary of size 3711. These vocabularies were then used to convert each document into a vector, as described in Sec. III-A. To train on the corpus, we constructed a self-organizing map of 100 neurons arranged in a 10 × 10 grid. Each neuron has 3436 synapses when training on Chinese documents and 3711 synapses when training on English documents. After training, we labeled the maps with documents and keywords using the methods described in Sec. III-B and obtained the DCMs and the KCMs for both languages.

After the clustering process, we applied the hierarchy generation process to the DCMs to obtain the hierarchies. In our experiments we limited the number of dominated clusters to 10 and the depth of the hierarchies to 3. Figure 1 shows part of the Chinese hierarchy developed from the corpus. Each leaf node in a hierarchy represents a cluster in the corresponding DCM. The parent node of a set of child nodes in level n of a hierarchy represents a super cluster found in STAGE-(n-1). In this example, the root node is one of the six dominated clusters found in STAGE-1. This node has 4 children, which are the 4 dominated clusters obtained in STAGE-2. These child nodes comprise the third level of the hierarchy. Likewise, the child nodes of the third-level nodes are the dominated clusters found in STAGE-3. The keywords in the KCM are used to label every node in the hierarchies. We merge all keywords that belong to clusters that are not dominated clusters into their nearest dominated clusters, as described in Sec. III-C. The example shows that the generated hierarchy comprises similar clusters with related keywords. We omit the rest of the hierarchies due to space limitations.

A. Hierarchy alignment

After generating the bilingual hierarchies, we applied the hierarchy alignment algorithm described in Sec. IV-A. The quality of the alignment is measured by the number of parallel documents in each pair of associated clusters. For example, let Chinese cluster $C_k$ be associated with English cluster $E_l$, where $C_k$ contains documents $C_1$, $C_3$, and $C_5$ and $E_l$ contains documents $E_1$, $E_2$, $E_3$, and $E_4$. Here $C_i$ and $E_i$ are a pair of parallel documents. In this case, the number of parallel document pairs in this pair of clusters is 2, namely $C_1/E_1$ and $C_3/E_3$. The total number of such parallel document pairs over all associated clusters is divided by the total number of parallel document pairs in the training set. We obtained ratios of 0.63 and 0.58 for Corpus-1 and Corpus-2, respectively. This means that about 60% of the parallel documents fall in associated clusters.
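
The evaluation ratio can be computed as in the sketch below, which reuses the worked example from the text. Documents are identified by their pair index, so $C_i$ and $E_i$ share the index $i$. The function name `parallel_ratio` and the total of 5 pairs are hypothetical values chosen only to make the example concrete.

```python
def parallel_ratio(mapping, docs1, docs2, total_pairs):
    """Fraction of parallel document pairs that land in associated clusters.

    mapping: node of H1 -> associated node of H2
    docs1, docs2: node -> indices of the documents labeled to it
    """
    hits = 0
    for k, l in mapping.items():
        # A pair counts when both halves share the same index i.
        hits += len(set(docs1.get(k, [])) & set(docs2.get(l, [])))
    return hits / total_pairs


# C_k holds C1, C3, C5 and E_l holds E1..E4, as in the example in the text.
ratio = parallel_ratio(
    mapping={"C_k": "E_l"},
    docs1={"C_k": [1, 3, 5]},
    docs2={"E_l": [1, 2, 3, 4]},
    total_pairs=5,  # hypothetical corpus size for the toy example
)
```

The two hits are the pairs with indices 1 and 3, which matches the count of 2 given in the example.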

[Figure 1: tree diagram of a Chinese hierarchy shown in three parts labeled A, B, and C. Recoverable node keywords (English translations) include fiber, textile, nylon, metal, manufacturer, product, patent, industry, domestic, foreign, Dongguan, Taiwanese businessman, mainland China, Taiwan, Derlon, enterprise, environment, security, US dollar, regulation, trade, South Africa, Palau, local, afterthought, and travel agency.]

Fig. 1. One of the generated Chinese hierarchies. English translations are enclosed in parentheses for reference.

VI. CONCLUSIONS

In this work, we proposed an automatic method to generate and align multilingual hierarchies. Our method applies the self-organizing map model to cluster bilingual documents and creates two feature maps for each language. A hierarchy generation process is then applied to these maps to create bilingual hierarchies. These hierarchies are then aligned according to the relatedness of their nodes. The associations between nodes are determined by the documents associated with them. We conducted experiments on two sets of corpora and obtained promising results.

ACKNOWLEDGMENT

This work was supported by the National Science Council under grant NSC 97-2221-E-390-017.

REFERENCES

[1] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 2001.

[2] R. Chau and C. H. Yeh, “A multilingual text mining approach to web cross-lingual text retrieval,” Knowledge-Based Systems, vol. 17, no. 5-6, pp. 219–227, 2004.

[3] C. H. Lee and H. C. Yang, “A multilingual text mining approach based on self-organizing maps,” Applied Intelligence, vol. 18, no. 3, pp. 295–310, 2003.

[4] A. Rauber, D. Merkl, and M. Dittenbach, “The growing hierarchical self-organizing map: exploratory analysis of high-dimensional data,” IEEE Transactions on Neural Networks, vol. 13, no. 6, pp. 1331–1341, 2002.

[5] J. Han and Y. Fu, “Dynamic generation and refinement of concept hierarchies for knowledge discovery in database,” in Proceedings of AAAI’94 Workshop on Knowledge Discovery in Database. AAAI Press, 1994, pp. 157–168.

[6] A. McCallum and K. Nigam, “Text classification by bootstrapping with keywords, EM and shrinkage,” in Proceedings of ACL’99 Workshop for Unsupervised Learning in Natural Language Processing. Morgan Kaufmann, 1999, pp. 52–58.

[7] A. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automating the construction of internet portals with machine learning,” Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.

[8] A. S. Weigend, E. D. Wiener, and J. O. Pedersen, “Exploiting hierarchy in text categorization,” Information Retrieval, vol. 1, no. 3, pp. 193–216, 1999.

[9] T. Hofmann, “The cluster-abstraction model: unsupervised learning of topic hierarchies from text data,” in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 1999, pp. 682–687.

[10] N. Liu and C. C. Yang, “A link classification based approach to website topic hierarchy generation,” in Proceedings of WWW 2007. Association for Computing Machinery, 2007, pp. 1127–1128.

[11] S. L. Chuang and L. F. Chien, “Taxonomy generation for text segments: A practical web-based approach,” ACM Transactions on Information Systems, vol. 23, no. 4, pp. 363–396, 2005.

[12] N. F. Noy and H. Stuckenschmidt, “Ontology alignment: an annotated bibliography,” in Semantic Interoperability and Integration, Y. Kalfoglou, W. M. Schorlemmer, A. P. Sheth, S. Staab, and M. Uschold, Eds. Schloss Dagstuhl, Germany: IBFI, 2005.

[13] Y. Kalfoglou and W. M. Schorlemmer, “Ontology mapping: The state of the art,” in Semantic Interoperability and Integration, Y. Kalfoglou, W. M. Schorlemmer, A. P. Sheth, S. Staab, and M. Uschold, Eds. Schloss Dagstuhl, Germany: IBFI, 2005.

[14] N. Choi, I. Y. Song, and H. Han, “A survey on ontology mapping,” SIGMOD Record, vol. 35, no. 3, pp. 34–41, 2006.

[15] R. Ichise, H. Takeda, and S. Honiden, “Rule induction for concept hierarchy alignment,” in Proceedings of the Workshop on Ontology Learning at the 17th International Joint Conference on Artificial Intelligence (IJCAI). CEUR-WS.org, 2001.

[16] K. J. Chen and M. H. Bai, “Unknown word detection for Chinese by a corpus-based learning method,” International Journal of Computational Linguistics and Chinese Language Processing, vol. 3, no. 1, pp. 27–44,