Multilingual Information Retrieval using GHSOM

(1)

Hsin-Chang Yang, Associate Professor

Department of Information Management

(2)

Outline

Introduction

Document Processing and Clustering by GHSOM

Association Discovery and MLIR Experimental Result

(3)

Introduction

Most search engines provide only a monolingual search interface.

It would be convenient for users to express their queries in a familiar language and search documents in other languages.

Cross-lingual or multilingual information retrieval

(4)

Translate the queries or the documents into another language

Easy and convenient

Some kind of machine translation engine must be used

Modern machine translation systems are still imprecise

Match queries and documents directly

Direct match of semantics

Difficult to match semantics; schemes are needed to discover semantic relatedness between languages

(5)

Multilingual text mining (MLTM)

Discovering semantic relationships between linguistic entities of different languages

In this work, we develop an MLTM scheme based on GHSOM and apply it to the multilingual information retrieval (MLIR) task.

(6)

System Architecture

[Figure: system architecture. MLTM process: parallel corpora (Chinese documents, English documents) → preprocessing → Chinese and English document vectors → train by GHSOM → hierarchy of Chinese documents and hierarchy of English documents → association discovery → document associations, keyword associations, and document/keyword associations. MLIR process: query → document/keyword associations → retrieval result.]

(7)

Document Processing and Clustering by GHSOM

The growing hierarchical self-organizing map (GHSOM) was proposed by Rauber et al. to give the self-organizing map (SOM) the capabilities of dynamic map expansion and hierarchy construction.

We used GHSOM to organize multilingual documents into hierarchies.
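The slides do not spell out how the map expands or when a new layer is created. As GHSOM is commonly described, both decisions compare quantization errors against thresholds. The sketch below illustrates that idea only; the names `tau1`, `tau2`, `should_grow_map`, and `units_to_expand`, as well as averaging over all units of a map, are assumptions made for illustration and are not taken from this talk.

```python
import math

def mean_quantization_error(weight, mapped_vectors):
    """Average Euclidean distance between a unit's weight vector and its mapped inputs."""
    if not mapped_vectors:
        return 0.0
    total = 0.0
    for vector in mapped_vectors:
        total += math.sqrt(sum((w - x) ** 2 for w, x in zip(weight, vector)))
    return total / len(mapped_vectors)

def should_grow_map(unit_errors, parent_error, tau1):
    """Dynamic map expansion: keep adding units while the map's mean error is too large."""
    map_error = sum(unit_errors) / len(unit_errors)
    return map_error >= tau1 * parent_error

def units_to_expand(unit_errors, layer0_error, tau2):
    """Hierarchy construction: indices of units that would receive a child map."""
    return [i for i, error in enumerate(unit_errors) if error > tau2 * layer0_error]

# Hypothetical errors for a map with four units.
unit_errors = [0.9, 0.2, 0.1, 0.4]
grow = should_grow_map(unit_errors, parent_error=1.0, tau1=0.5)
expand = units_to_expand(unit_errors, layer0_error=2.0, tau2=0.25)
```

In this toy setting the map stops growing sideways, and only the first unit is refined by a child map in the next layer.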

(8)

Document Processing and Clustering by GHSOM

[Figure: a typical structure of GHSOM, showing Layer 0, Layer 1, and Layer 2.]

(9)

Document Processing and Clustering by GHSOM

Document preprocessing

word segmentation

stemming

stopword elimination

keyword selection

Document encoding

A document D_j is encoded into a vector $D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where V is the vocabulary.
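The encoding step can be sketched as follows. The slides do not say which tf-idf variant was used, so this example assumes raw term frequency multiplied by log(N / df); the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Encode tokenized documents as tf-idf vectors over a fixed vocabulary."""
    n_docs = len(docs)
    vocab_set = set(vocabulary)
    # Document frequency: number of documents that contain each keyword.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocab_set)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vector = []
        for term in vocabulary:
            if df[term] == 0:
                vector.append(0.0)  # keyword never occurs in the corpus
            else:
                vector.append(tf[term] * math.log(n_docs / df[term]))
        vectors.append(vector)
    return vectors

# Toy documents after segmentation, stemming, and stopword elimination.
docs = [["som", "cluster", "som"], ["query", "cluster"], ["query", "retrieval"]]
vocabulary = ["som", "cluster", "query", "retrieval"]
vectors = tfidf_vectors(docs, vocabulary)
```

Each vector has one element per vocabulary entry, so every document of one language lives in the same |V|-dimensional space.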

(10)

Document Processing and Clustering by GHSOM

Document clustering

The GHSOM was trained on the document vectors.

Two hierarchies were constructed, one for the English documents and one for the Chinese documents.

[Figure: document labelling in the Chinese hierarchy (C_1 to C_5, C_k) and the English hierarchy (E_1 to E_5, E_p, E_q); each neuron holds a document cluster and a keyword cluster (k_1, k_2, k_5).]

(11)

Association Discovery

The constructed hierarchies reveal document and keyword associations within each individual language.

However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping

(12)

Association Discovery

Finding associations

The goal is to associate a Chinese keyword cluster with an English keyword cluster.

This is a kind of general ontology alignment problem.

A Chinese keyword cluster is considered to be related to an English one if they represent the same theme.

The theme of a keyword cluster can be determined by the documents labelled to the same neuron as the cluster.

(13)

Association Discovery

Thus we can associate two keyword clusters according to their corresponding document clusters.

Parallel corpora were used, so the correspondence between documents of different languages is known a priori.

To associate a Chinese cluster C_k with some English cluster E_l, we use a voting scheme to calculate the likelihood of

(14)

Association Discovery

Vote for best-matched cluster

1. For each pair of Chinese documents C_i and C_j in C_k, find the neuron clusters to which their English counterparts E_i and E_j are labelled in the English hierarchy. Let these clusters be E_p and E_q.

2. Find the shortest path between E_p and E_q in the English hierarchy.

3. Add 1 to both E_p and E_q. Add 1/(dist(C_i, C_j)-1) to all other clusters in the path.
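The three voting steps can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance in step 3 is taken to be the number of edges on the shortest path between E_p and E_q, so the clusters strictly inside the path share one vote; when E_p and E_q coincide, that cluster simply receives both votes. The hierarchy is assumed to be connected, and the names `shortest_path`, `vote`, and the toy hierarchy are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the list of nodes from start to goal."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for neighbour in adjacency[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    path = []
    node = goal
    while node is not None:
        path.append(node)
        node = previous[node]
    return path[::-1]

def vote(chinese_docs, english_cluster_of, adjacency):
    """Score English clusters for one Chinese cluster C_k."""
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_docs, 2):
        ep = english_cluster_of[ci]  # cluster of the English counterpart of ci
        eq = english_cluster_of[cj]
        path = shortest_path(adjacency, ep, eq)
        scores[ep] += 1.0
        scores[eq] += 1.0
        inner = path[1:-1]  # assumed: dist - 1 equals the number of inner clusters
        for cluster in inner:
            scores[cluster] += 1.0 / len(inner)
    return scores

# Toy English hierarchy: R has children A and B; A has child A1.
adjacency = {"R": ["A", "B"], "A": ["R", "A1"], "B": ["R"], "A1": ["A"]}
english_cluster_of = {"c1": "A1", "c2": "B", "c3": "A1"}
scores = vote(["c1", "c2", "c3"], english_cluster_of, adjacency)
best = max(scores, key=scores.get)
```

Here two of the three Chinese documents have counterparts in A1, so A1 collects the most votes and would be chosen as the associated English cluster.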

(15)

Association Discovery

We associate C_k with E_l when E_l has the highest score.

[Figure: an example of cluster scores in the English hierarchy (0.83, 2, 1.33, 2, 2, 0, 0.83).]

(16)

Association Discovery

Document associations

Chinese document C_i is associated with English document E_j if their corresponding clusters are associated.

Chinese document ↔ English document

Keyword associations

A Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if C_k and E_l are associated.

Chinese keyword ↔ English keyword

(17)

Association Discovery

Document-keyword associations

When C_k is associated with E_l, all documents and keywords labelled to these two neurons are associated.

Chinese document ↔ English keyword

English keyword ↔ Chinese document
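Once cluster associations are known, the three kinds of associations follow mechanically from the neuron labels. The sketch below assumes each neuron's documents and keywords are stored in plain dictionaries; the function name `derive_associations` and all identifiers in the example are hypothetical.

```python
def derive_associations(cluster_pairs, zh_docs, zh_keywords, en_docs, en_keywords):
    """cluster_pairs maps a Chinese neuron k to its associated English neuron l."""
    doc_doc, kw_kw, doc_kw = set(), set(), set()
    for k, l in cluster_pairs.items():
        for zh_doc in zh_docs.get(k, []):
            for en_doc in en_docs.get(l, []):
                doc_doc.add((zh_doc, en_doc))
            for en_kw in en_keywords.get(l, []):
                doc_kw.add((zh_doc, en_kw))  # Chinese document, English keyword
        for zh_kw in zh_keywords.get(k, []):
            for en_kw in en_keywords.get(l, []):
                kw_kw.add((zh_kw, en_kw))
            for en_doc in en_docs.get(l, []):
                doc_kw.add((en_doc, zh_kw))  # English document, Chinese keyword
    return doc_doc, kw_kw, doc_kw

# Toy labels: Chinese neuron 3 is associated with English neuron 7.
cluster_pairs = {3: 7}
zh_docs = {3: ["zh_doc_1"]}
zh_keywords = {3: ["zh_kw_a"]}
en_docs = {7: ["en_doc_1", "en_doc_2"]}
en_keywords = {7: ["en_kw_x"]}
doc_doc, kw_kw, doc_kw = derive_associations(
    cluster_pairs, zh_docs, zh_keywords, en_docs, en_keywords
)
```

The document-keyword pairs are the ones a retrieval step would consult, since they link a keyword in one language to documents in the other.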

(18)

MLIR application

The documents associated with a query keyword q ∈ Q are retrieved according to the document-keyword associations.

Ranking:

$S_R(q, D_j) = S_C(q, D_j)\, S_K(q, D_j)$

The score takes into account the importance of q both in a cluster and in a document.

(19)

The Ranking

S_C(q, D_j): cluster score, measures the importance of the cluster that D_j belongs to

$$S_C(q, D_j) = \frac{1}{1 + \delta(E_q, E_{D_j})}$$

C_q is the Chinese cluster that q is associated with, and E_q is the English cluster associated with C_q. E_{D_j} is the document cluster associated with D_j in the English hierarchy

$\delta(E_q, E_{D_j})$ measures the shortest path length between E_q and E_{D_j}

(20)

The Ranking

S_K(q, D_j): document score, measures the importance of q in D_j

the value of the element corresponding to q in the document vector of D_j, i.e. the tf-idf weight of q in D_j

The ranking score of a Chinese document in response to an English query keyword is calculated in the same way by exchanging the languages of the query and the document.
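The ranking can be illustrated with a short sketch. It assumes that the cluster score decays with the shortest path length between the query's cluster and the document's cluster as 1 / (1 + path length), and that the document score is supplied as a precomputed weight; both the exact form of the cluster score and the names used here (`rank_documents`, `path_length`, the toy values) are assumptions for illustration.

```python
def rank_documents(query_cluster, candidates, path_length):
    """candidates: list of (doc_id, doc_cluster, weight_of_q_in_doc)."""
    scored = []
    for doc_id, doc_cluster, s_k in candidates:
        # Assumed cluster score: closer clusters in the hierarchy score higher.
        s_c = 1.0 / (1.0 + path_length[(query_cluster, doc_cluster)])
        scored.append((doc_id, s_c * s_k))  # ranking score = cluster score * document score
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy shortest path lengths from the query's cluster E1 to each document cluster.
path_length = {("E1", "E1"): 0, ("E1", "E2"): 1, ("E1", "E3"): 3}
candidates = [("d1", "E1", 0.2), ("d2", "E2", 0.6), ("d3", "E3", 0.9)]
ranking = rank_documents("E1", candidates, path_length)
```

A document with a high keyword weight can still be ranked below one in a closer cluster, which is the trade-off the product of the two scores is meant to express.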

(21)

Experimental Result

The Sinorama parallel corpora were used.

Each Chinese article was faithfully translated into English.

Our corpus contains 10672 parallel documents.

We have a Chinese vocabulary of size 12941 and an English vocabulary of size 13723.

Each document is transformed into a vector. We used the GHSOM program developed by Rauber's team to train on the bilingual vectors.

(22)

Experimental Result

[Figure: an example Sinorama document.]

(23)

(24)

Experimental Result

(25)

Experimental Result

We developed a simple search engine to evaluate the performance of our method in MLIR.

Performance evaluation is based on the classic recall and precision measures.

31 query words were used: 19 Chinese and 12 English.

Relevant documents to query word q
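For a single query word, the classic measures can be computed as in the sketch below. How the relevant set was built for each query word is not recoverable from the slides, so the document identifiers here are purely hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Classic set-based precision and recall for one query word."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: four documents retrieved, three documents relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
```

Averaging these values over all query words gives one summary figure per measure.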

(26)

Experimental Result

(27)

Conclusions

We proposed a text mining method to extract associations between multilingual texts and keywords.

GHSOM performs well in clustering and organizing documents.

The discovered associations seem plausible for MLIR and other

(28)

Thanks for your attention.