A Multilingual Hierarchy
Mapping Method Based on
GHSOM
Hsin-Chang Yang Associate Professor
Department of Information Management
Outline
Introduction
Document Processing and Clustering by GHSOM
Association Discovery
Experimental Result
Conclusions
Introduction
Most search engines provide
only a monolingual search interface.
It would be convenient for users to express their queries in a familiar language and search documents in
other languages.
Cross-lingual or multilingual information retrieval
Translate the queries or the
documents into another language
Easy and convenient
Imprecise with current machine translation systems
Match queries and documents directly
Direct match of semantics
Difficult to match semantics; requires schemes for discovering
semantic relatedness between languages
Multilingual text mining
Discovering semantic relationships
between linguistic entities of different languages
In this work, we develop an
MLTM (multilingual text mining) scheme based on GHSOM.
System Architecture
[Figure: system architecture. Parallel corpora (Chinese documents, English documents) -> preprocessing -> Chinese document vectors, English document vectors -> Train by GHSOM -> hierarchy of monolingual documents, hierarchy of bilingual documents -> association discovery -> document associations, keyword associations, document/keyword associations (the MLTM process); a query is answered with a retrieval result.]
Document Processing and
Clustering by GHSOM
GHSOM was proposed by Rauber et al. to provide the SOM with
capabilities of dynamic map
expansion and hierarchy construction.
GHSOM has been applied to expertise
management, failure detection, and multilingual information retrieval.
We used GHSOM to organize multilingual documents into hierarchies.
Document Processing and
Clustering by GHSOM
A typical structure of GHSOM
[Figure: a GHSOM hierarchy with three layers: Layer 0, Layer 1, Layer 2.]
Document Processing and
Clustering by GHSOM
Document preprocessing
word segmentation
stemming
stopword elimination
keyword selection
Document encoding
A document Dj is encoded into a vector
$D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where V is the vocabulary
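The encoding step can be illustrated with a minimal sketch. The slide does not state the exact tf-idf weighting, so this example assumes raw term counts for tf and log(N / df) for idf; the function name `tfidf_vectors` and the toy documents are hypothetical, and the input is assumed to be already segmented, stemmed, and stopword-filtered.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenised documents as tf-idf vectors over a shared vocabulary.

    Assumption (not from the slides): tf is the raw term count and
    idf is log(N / df), where N is the number of documents.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # missing terms count as 0
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, vectors

# Toy usage with two already-preprocessed documents.
vocab, vectors = tfidf_vectors([["som", "map", "map"], ["som", "text"]])
```

A term that occurs in every document gets weight 0 under this weighting, so only discriminative keywords contribute to the vectors passed on to GHSOM training.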
Document Processing and
Clustering by GHSOM
Document clustering
Document vectors were trained by
GHSOM.
Two hierarchies were constructed, one for the English documents and one for the Chinese documents.
Document labelling
[Figure: Chinese hierarchy (clusters C1 to C5, Ck) and English hierarchy (clusters E1 to E5, Ep, Eq).]
Association Discovery
The constructed hierarchies reveal document and keyword associations for individual languages.
However, associations between
documents or keywords of different languages are much more difficult to find
because there is no direct mapping
Association Discovery
Finding Associations
The goal is to associate a Chinese keyword cluster with an English keyword cluster.
This is an instance of the general ontology alignment problem.
A Chinese keyword cluster is
considered to be related to an English one if they represent the same theme.
The theme of a keyword cluster can be determined from the documents labelled to the same neuron.
Association Discovery
Thus we could associate two clusters according to their corresponding
document clusters.
Parallel corpora were used, so
the correspondence between documents of
different languages is known a priori.
To associate a Chinese cluster Ck with
some English cluster El, we use a voting
scheme to calculate the likelihood of
Association Discovery
Voting for best-matched cluster
1. For each pair of Chinese documents Ci and
Cj in Ck, find the clusters in the English hierarchy
to which their English counterparts Ei and Ej
are labelled. Let
these clusters be Ep and Eq.
2. Find the shortest path between Ep and Eq in
the English hierarchy.
3. Add 1 to Ep and Eq. Add 1/(dist(Ci, Cj)-1) to all
other clusters in the path.
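The three voting steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the English hierarchy is assumed to be a tree stored as a hypothetical child-to-parent dictionary, the distance in step 3 is taken to be the number of edges on the path between Ep and Eq, and when Ep and Eq coincide the cluster receives a single vote. All function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def path_to_root(node, parent):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while parent.get(node) is not None:
        node = parent[node]
        path.append(node)
    return path

def shortest_path(a, b, parent):
    """Shortest path between two nodes of a tree, via their lowest common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    index_in_b = {n: i for i, n in enumerate(up_b)}
    for i, n in enumerate(up_a):
        if n in index_in_b:
            # n is the lowest common ancestor: go up from a, then down to b.
            return up_a[: i + 1] + up_b[: index_in_b[n]][::-1]
    raise ValueError("nodes are not in the same tree")

def vote(chinese_docs, english_label, parent):
    """Score English clusters for one Chinese cluster Ck.

    chinese_docs: ids of the documents in Ck.
    english_label: doc id -> English cluster of its translated counterpart.
    parent: English hierarchy as child -> parent (root maps to None).
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_docs, 2):
        ep, eq = english_label[ci], english_label[cj]
        path = shortest_path(ep, eq, parent)
        for endpoint in {ep, eq}:      # assumption: one vote if Ep == Eq
            scores[endpoint] += 1.0
        inner = path[1:-1]             # clusters strictly between Ep and Eq
        if inner:
            dist = len(path) - 1       # assumption: distance = edges on the path
            for node in inner:
                scores[node] += 1.0 / (dist - 1)
    return dict(scores)

# Toy hierarchy: root has children A and B; A has children A1 and A2.
parent = {"root": None, "A": "root", "B": "root", "A1": "A", "A2": "A"}
english_label = {"c1": "A1", "c2": "A1", "c3": "B"}
scores = vote(["c1", "c2", "c3"], english_label, parent)
best = max(scores, key=scores.get)
```

Under these assumptions the intermediate clusters on a path share a total of one vote, so a cluster that many counterpart documents fall into directly will outscore clusters that are merely passed through.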
Association Discovery
We associate Ck with El when El has
the highest score.
An example:
[Figure: English hierarchy with cluster scores 0.83, 2, 1.33, 2, 2, 0, 0.83.]
Association Discovery
Document associations
Chinese document Ci is associated with
English document Ej if their
corresponding clusters are associated.
Keyword associations
A Chinese keyword labelled to neuron k in the Chinese hierarchy will be
associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated.
Association Discovery
Document-keyword associations
When Ck is associated with El, all
documents and keywords labelled to these two neurons are associated.
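Once cluster associations are known, the three kinds of association follow mechanically. The sketch below assumes a hypothetical data layout in which each neuron id maps to the lists of documents and keywords labelled to it; the function name and the dictionaries are illustrative, not part of the original system.

```python
def derive_associations(cluster_map, zh_docs, en_docs, zh_keywords, en_keywords):
    """Expand cluster-level associations into item-level associations.

    cluster_map: Chinese neuron id -> associated English neuron id.
    zh_docs / en_docs: neuron id -> documents labelled to that neuron.
    zh_keywords / en_keywords: neuron id -> keywords labelled to that neuron.
    """
    doc_pairs, keyword_pairs, doc_keyword_pairs = [], [], []
    for ck, el in cluster_map.items():
        c_docs, e_docs = zh_docs.get(ck, []), en_docs.get(el, [])
        c_kws, e_kws = zh_keywords.get(ck, []), en_keywords.get(el, [])
        # Document associations: every Chinese document with every English document.
        doc_pairs += [(c, e) for c in c_docs for e in e_docs]
        # Keyword associations: every Chinese keyword with every English keyword.
        keyword_pairs += [(c, e) for c in c_kws for e in e_kws]
        # Document-keyword associations across the two languages.
        doc_keyword_pairs += [(d, k) for d in c_docs for k in e_kws]
        doc_keyword_pairs += [(d, k) for d in e_docs for k in c_kws]
    return doc_pairs, keyword_pairs, doc_keyword_pairs

# Toy usage: Chinese neuron "C1" is associated with English neuron "E2".
docs, kws, doc_kws = derive_associations(
    {"C1": "E2"},
    zh_docs={"C1": ["zh_doc_1"]},
    en_docs={"E2": ["en_doc_7", "en_doc_9"]},
    zh_keywords={"C1": ["經濟"]},
    en_keywords={"E2": ["economy"]},
)
```

In a retrieval setting, a query keyword in one language could then be looked up in `keyword_pairs` or `doc_keyword_pairs` to reach keywords and documents in the other language.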
Experimental Result
Sinorama parallel corpora were used
Each Chinese article was faithfully translated into English.
Our corpus contains 976 parallel documents.
We have a Chinese vocabulary of size 3436 and English vocabulary of size 3711.
Each document is transformed into a vector. We used the GHSOM program developed by Rauber’s team to train the bilingual vectors.
Experimental Result
Performance Evaluation
The mean inter-document path length between each pair of documents in Ck or Ek:
$$P_k = \frac{1}{|C_k|^2} \sum_{C_i, C_j \in C_k} \mathrm{dist}(C_i, C_j)$$
The quality of the bilingual hierarchies can then be measured by the average of all Pk,
denoted by $\bar{P}$, over the entire hierarchy.
Experimental Result
We computed the average value of $\bar{P}$ over 100 trainings. We obtained a
value of 2.39.
Conclusions
We proposed a text mining method to extract associations between
multilingual texts and keywords.
GHSOM performs well in clustering and organizing documents.
The discovered associations seem plausible for MLIR and other