Multilingual Information
Retrieval using GHSOM
Hsin-Chang Yang Associate Professor
Department of Information Management
Outline
Introduction
Document Processing and Clustering by GHSOM
Association Discovery and MLIR
Experimental Result
Introduction
Most search engines provide
only a monolingual search interface.
It would be convenient for users to express their queries in a familiar language and search documents in
other languages.
Cross-lingual or multilingual information retrieval
Translate the queries or the documents into another language
Easy and convenient
Some kind of machine translation engine must be used
Modern machine translation systems are imprecise
Match queries and documents directly
Direct match of semantics
Difficult to match semantics; schemes for discovering semantic relatedness between
languages are needed
Multilingual text mining
Discovering semantic relationships
between linguistic entities of different languages
In this work, we develop an
MLTM scheme based on GHSOM and apply it to the MLIR task.
System Architecture
[Figure: system architecture. MLTM process: parallel corpora of Chinese and English documents are preprocessed into document vectors, each set is trained by GHSOM into a hierarchy, and association discovery yields document, keyword, and document/keyword associations. MLIR process: a query is matched against these associations to produce the retrieval result.]
Document Processing and
Clustering by GHSOM
The growing hierarchical self-organizing map (GHSOM) was proposed by Rauber et al. to provide the self-organizing map (SOM) with
capabilities of dynamic map
expansion and hierarchy construction.
We used GHSOM to organize multilingual documents into hierarchies.
Document Processing and
Clustering by GHSOM
A typical structure of GHSOM
[Figure: GHSOM hierarchy showing Layer 0, Layer 1, and Layer 2]
Document Processing and
Clustering by GHSOM
Document preprocessing
word segmentation
stemming
stopword elimination
keyword selection
Document encoding
A document Dj is encoded into a vector
$D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where V is the vocabulary
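The encoding step can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are taken to be already segmented into keyword lists, tf is the raw count, and idf is log(N/df). The slides do not say which tf-idf variant was used, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode tokenized documents as tf-idf vectors over a shared vocabulary.

    Assumption: tf is the raw term count and idf is log(N / df); the
    slides do not specify the exact weighting variant.
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Toy corpus of two preprocessed documents (hypothetical keywords).
docs = [["ghsom", "cluster", "cluster"], ["ghsom", "retrieval"]]
vocab, vectors = tfidf_vectors(docs)
```

A keyword that occurs in every document, such as "ghsom" here, gets weight 0 under this variant, which is why keyword selection matters before encoding.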
Document Processing and
Clustering by GHSOM
Document clustering
Document vectors were trained by GHSOM.
Two hierarchies were constructed for English and Chinese documents
respectively.
Document labelling
[Figure: Chinese hierarchy (clusters C1 to C5, Ck) and English hierarchy (clusters E1 to E5, Ep, Eq), with keywords k1, k2, k5 illustrating keyword clusters and document clusters]
Association Discovery
The constructed hierarchies reveal document and keyword associations for individual languages.
However, associations between
documents or keywords of different languages are much more difficult to find
because there is no direct mapping
Association Discovery
Finding Associations
to associate a Chinese keyword cluster with an English keyword cluster
an instance of the general ontology alignment problem
A Chinese keyword cluster is
considered to be related to an English one if they represent the same theme.
the theme of a keyword cluster can be determined from the documents labelled to the same neuron
Association Discovery
Thus we could associate two clusters according to their corresponding
document clusters.
parallel corpora were used
the correspondence between documents of different languages is known a priori
To associate a Chinese cluster Ck with some English cluster El, we use a voting
scheme to calculate the likelihood of
Association Discovery
Vote for best-matched cluster
1. For each pair of Chinese documents Ci and
Cj in Ck, find the clusters to which
their English counterparts Ei and Ej
are labelled in the English hierarchy. Let these clusters be Ep and Eq.
2. Find the shortest path between Ep and Eq in
the English hierarchy.
3. Add 1 to both Ep and Eq. Add 1/(dist(Ci, Cj)-1)
to all other clusters in the path.
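The three voting steps can be sketched in Python. This is a rough illustration, not the authors' implementation. It assumes the English hierarchy is a tree stored as a child-to-parent dictionary, and it reads dist as the number of edges on the shortest path between Ep and Eq, so that the intermediate clusters share one vote. All function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def shortest_path(parent, a, b):
    """Return the clusters on the tree path from a to b, inclusive."""
    def ancestors(node):
        chain = [node]
        while parent.get(chain[-1]) is not None:
            chain.append(parent[chain[-1]])
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    seen = set(up_b)
    # Lowest common ancestor: first node above a that is also above b.
    lca = next(n for n in up_a if n in seen)
    left = up_a[:up_a.index(lca) + 1]
    right = up_b[:up_b.index(lca)]
    return left + right[::-1]

def vote(parent, chinese_cluster_docs, english_cluster_of):
    """Score English clusters for one Chinese cluster Ck.

    chinese_cluster_docs: ids of the documents in Ck.
    english_cluster_of: maps a document id to the English cluster that
    its parallel counterpart is labelled to.
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_cluster_docs, 2):
        ep, eq = english_cluster_of[ci], english_cluster_of[cj]
        path = shortest_path(parent, ep, eq)
        scores[ep] += 1.0
        scores[eq] += 1.0
        dist = len(path) - 1  # assumed meaning of dist: edges on the path
        for cluster in path[1:-1]:
            scores[cluster] += 1.0 / (dist - 1)
    return dict(scores)

# Toy English hierarchy: root -> E1, E2; E1 -> E3.
parent = {"root": None, "E1": "root", "E2": "root", "E3": "E1"}
english_cluster_of = {"d1": "E3", "d2": "E2", "d3": "E3"}
scores = vote(parent, ["d1", "d2", "d3"], english_cluster_of)
best = max(scores, key=scores.get)
```

When both counterparts fall in the same cluster, that cluster receives 2 and no other cluster is touched, so the division is never reached for paths without intermediate clusters.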
Association Discovery
We associate Ck with El when El has
the highest score.
An example
[Figure: English hierarchy annotated with example cluster scores 0.83, 2, 1.33, 2, 2, 0, 0.83]
Association Discovery
Document associations
Chinese document Ci is associated with
English document Ej if their
corresponding clusters are associated.
[Figure: a Chinese document linked to an English document]
Keyword associations
A Chinese keyword labelled to neuron k in the Chinese hierarchy will be
associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated.
[Figure: a Chinese keyword linked to an English keyword]
Association Discovery
Document-keyword associations
When Ck is associated with El, all
documents and keywords labelled to these two neurons are associated.
[Figure: Chinese document and English keyword, linked in both directions]
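A short Python sketch of how cluster-level associations expand into document, keyword, and document-keyword associations. The data layout (dictionaries with "docs" and "keywords" lists per neuron) and all names are assumptions made for illustration.

```python
from itertools import product

def derive_associations(cluster_assoc, zh_labels, en_labels):
    """Expand cluster-level associations into item-level associations.

    cluster_assoc: maps a Chinese cluster id to its associated English
    cluster id.
    zh_labels / en_labels: map a cluster id to a dict with "docs" and
    "keywords" lists (the items labelled to that neuron).
    """
    doc_pairs, kw_pairs, doc_kw_pairs = set(), set(), set()
    for ck, el in cluster_assoc.items():
        zh, en = zh_labels[ck], en_labels[el]
        doc_pairs.update(product(zh["docs"], en["docs"]))
        kw_pairs.update(product(zh["keywords"], en["keywords"]))
        # Cross-type links: documents of one language, keywords of the other.
        doc_kw_pairs.update(product(zh["docs"], en["keywords"]))
        doc_kw_pairs.update(product(en["docs"], zh["keywords"]))
    return doc_pairs, kw_pairs, doc_kw_pairs

# Hypothetical labels for one associated pair of clusters.
zh_labels = {"C1": {"docs": ["zh_doc1"], "keywords": ["zh_kw1", "zh_kw2"]}}
en_labels = {"E4": {"docs": ["en_doc1"], "keywords": ["en_kw1"]}}
doc_pairs, kw_pairs, doc_kw_pairs = derive_associations(
    {"C1": "E4"}, zh_labels, en_labels
)
```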
MLIR application
The documents associated with a query keyword q ∈ Q are retrieved according to the document-keyword associations.
Ranking:
$S_R(q, D_j) = S_C(q, D_j)\, S_K(q, D_j)$
takes into account the importance of q in a cluster as well as in a document
The Ranking
SC(q,Dj): cluster score, measures the
importance of the cluster that Dj belongs to
Eq is the English cluster associated with Cq, the Chinese
cluster that q is associated with. EDj is the document cluster associated with Dj in the English hierarchy
dist(Eq, EDj) measures the shortest path length
between Eq and EDj
$S_C(q, D_j) = \dfrac{1}{\mathrm{dist}(E_q, E_{D_j}) + 1}$
The Ranking
SK(q,Dj): document score, measures
the importance of q in Dj
the value of the element corresponding to
q in the document vector of Dj, i.e. the tf-idf weight of q in Dj
The ranking score of a Chinese
document in response to an English query keyword is calculated in
the same way, with the
languages of the query and document exchanged.
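The ranking can be illustrated with a small Python sketch. It assumes the cluster score decays as 1 / (shortest path length + 1), which is one reading of the cluster score definition, and takes the document score to be the weight of q in the document vector. The function names and the candidate values are hypothetical.

```python
def cluster_score(path_length):
    """S_C: assumed here to be 1 / (shortest path length + 1)."""
    return 1.0 / (path_length + 1)

def ranking_score(path_length, keyword_weight):
    """S_R = S_C * S_K, with S_K the weight of q in the document vector."""
    return cluster_score(path_length) * keyword_weight

def rank(candidates):
    """Sort (doc_id, path_length, keyword_weight) triples by descending S_R."""
    scored = [(doc, ranking_score(d, w)) for doc, d, w in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical candidates: (document, path length to Eq, weight of q).
candidates = [("D1", 0, 0.2), ("D2", 2, 0.9), ("D3", 1, 0.5)]
ranked = rank(candidates)
```

In this toy case D2 ranks first despite sitting two steps away from Eq, because its keyword weight outweighs the cluster penalty; the product form lets either factor dominate.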
Experimental Result
Sinorama parallel corpora were used
Each Chinese article was faithfully translated into English
Our corpus contains 10672 parallel documents.
We have a Chinese vocabulary of size 12941 and English vocabulary of size 13723.
Each document is transformed into a vector. We used the GHSOM program developed by Rauber’s team to train the bilingual vectors.
Experimental Result
An example Sinorama document
Experimental Result
We developed a simple search engine to evaluate the performance of our
method in MLIR.
Performance evaluation is based on classic recall and precision measures.
31 query words: 19 Chinese and 12 English
Relevant documents to query word q
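For reference, the classic precision and recall measures for a single query word can be computed as in the sketch below. The document ids are made up for illustration, and how the relevant set was determined in the experiment is not recoverable from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical retrieval result and relevance judgments for one query word.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
```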
Conclusions
We proposed a text mining method to extract associations between
multilingual texts and keywords.
GHSOM performs well in clustering and organizing documents.
The discovered associations seem plausible for MLIR and other
Thanks for your attention.