
Multilingual Information Retrieval using GHSOM


Academic year: 2021


(1)

Multilingual Information Retrieval using GHSOM

Hsin-Chang Yang, Associate Professor

Department of Information Management

(2)


Outline

Introduction

Document Processing and Clustering by GHSOM

Association Discovery and MLIR

Experimental Result

(3)

Introduction

Most search engines provide only a monolingual search interface.

It would be convenient for users to express their queries in a familiar language and search documents in other languages.

Cross-lingual or multilingual information retrieval

(4)

Approach 1: translate the queries or the documents into another language

Easy and convenient

Some kind of machine translation engine must be used

Modern machine translation systems are still imprecise

Approach 2: match queries and documents directly

Direct match of semantics

Difficult to match semantics; requires schemes for discovering semantic relatedness between languages

(5)

Multilingual text mining (MLTM)

Discovering semantic relationships between linguistic entities of different languages

In this work, we develop an MLTM scheme based on GHSOM and apply it to the MLIR task.

(6)

System Architecture

[Figure: system architecture. MLTM process: parallel corpora (Chinese documents, English documents), preprocessing, Chinese and English document vectors, training by GHSOM, hierarchy of Chinese documents and hierarchy of English documents, association discovery, producing document associations, keyword associations, and document/keyword associations. MLIR process: query, retrieval result.]

(7)

Document Processing and Clustering by GHSOM

GHSOM (Growing Hierarchical Self-Organizing Map) was proposed by Rauber et al. to extend the SOM with dynamic map expansion and hierarchy construction.

We used GHSOM to organize multilingual documents into hierarchies.

(8)

Document Processing and Clustering by GHSOM

A typical structure of GHSOM

[Figure: a GHSOM with Layer 0, Layer 1, and Layer 2]

(9)

Document Processing and Clustering by GHSOM

Document preprocessing

word segmentation

stemming

stopword elimination

keyword selection

Document encoding

A document $D_j$ is encoded into a vector $D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where $V$ is the vocabulary
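The encoding step can be illustrated with a minimal sketch. The slides do not say which tf-idf variant was used, so this assumes raw term frequency multiplied by log(N/df); the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """Encode each tokenized document as a tf-idf vector over `vocabulary`.

    Assumption: weight = raw term frequency * log(N / document frequency).
    The original work may use a different tf-idf variant.
    """
    n_docs = len(docs)
    vocab_set = set(vocabulary)
    # Document frequency: number of documents containing each keyword.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocab_set)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([
            tf[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocabulary
        ])
    return vectors

# Toy input: documents are assumed to be already segmented, stemmed,
# and stripped of stopwords.
docs = [["map", "cluster", "map"], ["cluster", "query"]]
vocabulary = ["map", "cluster", "query"]
vectors = tfidf_vectors(docs, vocabulary)
```

Each vector has one element per selected keyword, so its length equals |V|. A keyword that occurs in every document receives weight 0 under this variant.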

(10)

Document Processing and Clustering by GHSOM

Document clustering

Document vectors were trained by GHSOM.

Two hierarchies were constructed, one for the English documents and one for the Chinese documents.

[Figure: document labelling on the Chinese hierarchy (clusters C1 to C5, Ck) and the English hierarchy (clusters E1 to E5, Ep, Eq); each neuron carries a keyword cluster (k1, k2, k5) and a document cluster]

(11)

Association Discovery

The constructed hierarchies reveal document and keyword associations within each individual language.

However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping

(12)

Association Discovery

Finding Associations

The goal is to associate a Chinese keyword cluster with an English keyword cluster.

This is an instance of the general ontology alignment problem.

A Chinese keyword cluster is considered to be related to an English one if they represent the same theme.

The theme of a keyword cluster can be determined from the documents labelled to the same neuron as the cluster

(13)

Association Discovery

Thus we could associate two clusters according to their corresponding document clusters.

Parallel corpora were used.

The correspondence between documents of different languages is known a priori.

To associate a Chinese cluster Ck with some English cluster El, we use a voting scheme to calculate the likelihood of

(14)

Association Discovery

Vote for best-matched cluster

1. For each pair of Chinese documents Ci and Cj in Ck, find the neuron clusters to which their English counterparts Ei and Ej are labelled in the English hierarchy. Let these clusters be Ep and Eq.

2. Find the shortest path between Ep and Eq in the English hierarchy.

3. Add 1 to the scores of both Ep and Eq. Add 1/(dist(Ci, Cj)-1) to every other cluster on the path.

(15)

Association Discovery

We associate Ck with the English cluster El that has the highest score.

[Figure: an example of vote scores on the English hierarchy (0.83, 2, 1.33, 2, 2, 0, 0.83)]
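The voting scheme can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the English hierarchy is modelled as a tree given by a hypothetical child-to-parent map, and dist(Ci, Cj), which the slides do not define, is taken to be the number of edges on the shortest path between Ep and Eq. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def path_between(parent, a, b):
    """Shortest path between nodes a and b in a tree (child -> parent map)."""
    def ancestors(node):
        chain = [node]
        while node in parent:
            node = parent[node]
            chain.append(node)
        return chain
    up_a, up_b = ancestors(a), ancestors(b)
    in_b = set(up_b)
    common = next(n for n in up_a if n in in_b)  # lowest common ancestor
    return up_a[:up_a.index(common) + 1] + up_b[:up_b.index(common)][::-1]

def vote(chinese_docs, english_cluster_of, parent):
    """Score English clusters for one Chinese cluster Ck.

    chinese_docs: documents labelled to Ck.
    english_cluster_of: Chinese document -> cluster of its English counterpart.
    """
    scores = defaultdict(float)
    for ci, cj in combinations(chinese_docs, 2):
        ep, eq = english_cluster_of[ci], english_cluster_of[cj]
        path = path_between(parent, ep, eq)
        scores[ep] += 1.0
        scores[eq] += 1.0
        dist = len(path) - 1  # assumption: dist = edges between Ep and Eq
        for node in path[1:-1]:  # intermediate clusters exist only if dist >= 2
            scores[node] += 1.0 / (dist - 1)
    return dict(scores)

# Toy English hierarchy: E3 and E4 are children of E1; E5 is a child of E2.
parent = {"E1": "root", "E2": "root", "E3": "E1", "E4": "E1", "E5": "E2"}
english_cluster_of = {"c1": "E3", "c2": "E4", "c3": "E3"}
scores = vote(["c1", "c2", "c3"], english_cluster_of, parent)
best = max(scores, key=scores.get)
```

Under this reading, the intermediate clusters on a path share a total vote of 1, so clusters lying between two strongly supported clusters still accumulate some score. When Ep and Eq coincide, the sketch applies step 3 literally and adds 1 twice to the same cluster.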

(16)

Association Discovery

Document associations

Chinese document Ci is associated with English document Ej if their corresponding clusters are associated.

Chinese document ↔ English document

Keyword associations

A Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated.

Chinese keyword ↔ English keyword

(17)

Association Discovery

Document-keyword associations

When Ck is associated with El, all documents and keywords labelled to these two neurons are associated.

Chinese document ↔ English keyword

English keyword ↔ Chinese document
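The three kinds of associations follow mechanically once cluster associations are known. The sketch below assumes each neuron's labels are stored in a hypothetical dictionary with "docs" and "keywords" lists; the data layout and names are illustrative only.

```python
def cross_language_associations(cluster_pairs, zh_labels, en_labels):
    """Expand associated cluster pairs (Ck, El) into pairwise associations.

    zh_labels / en_labels: cluster id -> {"docs": [...], "keywords": [...]}
    (a hypothetical layout for what is labelled to each neuron).
    """
    doc_doc, kw_kw, doc_kw = set(), set(), set()
    for ck, el in cluster_pairs:
        zh, en = zh_labels[ck], en_labels[el]
        # Document associations: Chinese document <-> English document.
        doc_doc.update((c, e) for c in zh["docs"] for e in en["docs"])
        # Keyword associations: Chinese keyword <-> English keyword.
        kw_kw.update((a, b) for a in zh["keywords"] for b in en["keywords"])
        # Document-keyword associations, in both directions.
        doc_kw.update((c, k) for c in zh["docs"] for k in en["keywords"])
        doc_kw.update((e, k) for e in en["docs"] for k in zh["keywords"])
    return doc_doc, kw_kw, doc_kw

zh_labels = {"C1": {"docs": ["zh_doc1"], "keywords": ["音樂"]}}
en_labels = {"E3": {"docs": ["en_doc1", "en_doc2"], "keywords": ["music"]}}
doc_doc, kw_kw, doc_kw = cross_language_associations(
    [("C1", "E3")], zh_labels, en_labels
)
```

Storing the associations as sets of pairs keeps lookups simple at retrieval time: the documents for a query keyword are the first elements of the `doc_kw` pairs whose second element is that keyword.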

(18)

MLIR application

The documents associated with a query keyword $q \in Q$ are retrieved according to the document-keyword associations.

Ranking:

$S_R(q, D_j) = S_C(q, D_j)\, S_K(q, D_j)$

The ranking score takes into account the importance of q in a cluster as well as in a document

(19)

The Ranking

$S_C(q, D_j)$: cluster score, measures the importance of the cluster that $D_j$ belongs to

$C_q$ is the Chinese cluster that q is associated with, and $E_q$ is the English cluster associated with $C_q$. $E_{D_j}$ is the document cluster associated with $D_j$ in the English hierarchy

$\delta(E_q, E_{D_j})$ measures the shortest path length between $E_q$ and $E_{D_j}$

$S_C(q, D_j) = \dfrac{1}{\delta(E_q, E_{D_j}) + 1}$

(20)

The Ranking

$S_K(q, D_j)$: document score, measures the importance of q in $D_j$

It is the value of the element corresponding to q in the document vector of $D_j$, i.e. its tf-idf weight

The ranking score of a Chinese document in response to an English query keyword is calculated in the same way, with the languages of the query and the document exchanged.
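A minimal sketch of the ranking step, assuming the cluster score takes the form 1 / (shortest path length + 1) and the document score is the weight of q in the document vector. The function names and the toy candidate values are hypothetical.

```python
def ranking_score(path_length, keyword_weight):
    """S_R = S_C * S_K for one (query keyword, document) pair.

    Assumption: S_C = 1 / (path_length + 1), where path_length is the
    shortest path length between the query's cluster and the document's
    cluster; S_K is the weight of q in the document vector.
    """
    cluster_score = 1.0 / (path_length + 1)
    document_score = keyword_weight
    return cluster_score * document_score

def rank_documents(candidates):
    """candidates: doc id -> (path_length, keyword_weight). Best first."""
    scored = {doc: ranking_score(p, w) for doc, (p, w) in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Toy candidates retrieved through document-keyword associations.
candidates = {"D1": (0, 0.4), "D2": (2, 0.9), "D3": (1, 0.5)}
ranking = rank_documents(candidates)
```

In this toy run D1 outranks D2 even though D2 has the larger keyword weight, because D1 sits in the query's own cluster (path length 0) while D2 is two steps away.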

(21)

Experimental Result

The Sinorama parallel corpora were used.

Each Chinese article was faithfully translated into English.

Our corpus contains 10672 parallel documents.

The Chinese vocabulary has 12941 entries and the English vocabulary has 13723 entries.

Each document is transformed into a vector. We used the GHSOM program developed by Rauber’s team to train the bilingual vectors.

(22)

Experimental Result

An example Sinorama document

(23)
(24)

Experimental Result

(25)

Experimental Result

We developed a simple search engine to evaluate the performance of our method in MLIR.

Performance evaluation is based on the classic recall and precision measures.

31 query words: 19 Chinese and 12 English

Relevant documents to query word q
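For reference, the classic precision and recall measures for a single query word can be computed as below. The document identifiers are made up for illustration and are not results from the experiment.

```python
def precision_recall(retrieved, relevant):
    """Classic precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall(
    ["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"]
)
```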

(26)

Experimental Result

(27)

Conclusions

We proposed a text mining method to extract associations between multilingual texts and keywords.

GHSOM performs well in clustering and organizing documents.

The discovered associations seem plausible for MLIR and other

(28)

Thanks for your attention.
