
A Multilingual Hierarchy Mapping Method Based on GHSOM


Academic year: 2021


(1)

A Multilingual Hierarchy Mapping Method Based on GHSOM

Hsin-Chang Yang, Associate Professor

Department of Information Management

(2)

Outline

Introduction

Document Processing and Clustering by GHSOM

Association Discovery

Experimental Result

Conclusions

(3)

Introduction

Most search engines provide only a monolingual search interface.

It would be convenient for users to express their queries in a familiar language and search for documents in other languages.

This is the task of cross-lingual or multilingual information retrieval (MLIR).

(4)

Translate the queries or the documents into another language

  Advantage: easy and convenient

  Drawback: imprecise with modern machine translation systems

Match queries and documents directly

  Advantage: direct match of semantics

  Drawback: semantics are difficult to match; schemes for discovering semantic relatedness between languages are needed

(5)

Multilingual text mining (MLTM)

Discovering semantic relationships between linguistic entities of different languages

In this work, we develop an MLTM scheme based on GHSOM.

(6)

System Architecture

[Figure: system architecture. Labels recoverable from the diagram: parallel corpora; Chinese documents and English documents; preprocessing; Chinese document vectors and English document vectors; train by GHSOM; hierarchy of monolingual documents; hierarchy of bilingual documents; association discovery; document associations, keyword associations, and document/keyword associations; query; retrieval result; MLTM process.]

(7)

Document Processing and Clustering by GHSOM

The growing hierarchical self-organizing map (GHSOM) was proposed by Rauber et al. to extend the self-organizing map (SOM) with dynamic map expansion and hierarchy construction.

It has been applied to expertise management, failure detection, and multilingual information retrieval.

We used GHSOM to organize multilingual documents into hierarchies.

(8)

Document Processing and Clustering by GHSOM

[Figure: a typical structure of GHSOM, with Layer 0, Layer 1, and Layer 2.]

(9)

Document Processing and Clustering by GHSOM

Document preprocessing

  word segmentation

  stemming

  stopword elimination

  keyword selection

Document encoding

A document $D_j$ is encoded into a vector $D_j = \{\text{tf-idf}_{ij}\}$, $1 \le i \le |V|$, where $V$ is the vocabulary.
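The encoding step can be illustrated with a short Python sketch. The slides do not say which tf-idf variant was used, so this example assumes raw term frequency multiplied by log(N / df); the function name `tfidf_vectors` and the toy documents are hypothetical, and the input is assumed to be already segmented, stemmed, and stopword-filtered.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode token lists as tf-idf vectors over a shared vocabulary.

    Assumed weighting: tf (raw count) * log(N / df). Other variants exist.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

# Toy example with two preprocessed documents.
vocab, vectors = tfidf_vectors([["map", "neuron", "map"], ["neuron", "cluster"]])
```

A term that occurs in every document (here `neuron`) gets weight 0 under this weighting, which is one reason keyword selection is done before encoding.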

(10)

Document Processing and Clustering by GHSOM

Document clustering

A GHSOM was trained on the document vectors.

Two hierarchies were constructed, one for the English documents and one for the Chinese documents.

[Figure: document labelling in the Chinese hierarchy and the English hierarchy, showing documents C1 to C5 and E1 to E5 and clusters Ck, Ep, and Eq.]
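Document labelling can be sketched as follows. The slides do not spell out the labelling rule, so this example assumes the usual SOM convention of assigning each document to the neuron whose weight vector is closest in Euclidean distance; `label_documents` and the toy vectors are hypothetical.

```python
def label_documents(doc_vectors, neuron_weights):
    """Return {doc_id: neuron_id}, picking the closest neuron for each document.

    Assumption: closeness is squared Euclidean distance to the neuron's
    weight vector, as in standard SOM best-matching-unit search.
    """
    labels = {}
    for doc_id, vec in doc_vectors.items():
        labels[doc_id] = min(
            neuron_weights,
            key=lambda n: sum((a - b) ** 2 for a, b in zip(vec, neuron_weights[n])),
        )
    return labels

# Toy example: two documents and two trained neurons.
labels = label_documents(
    {"d1": [1.0, 0.0], "d2": [0.0, 1.0]},
    {"n1": [0.9, 0.1], "n2": [0.2, 0.8]},
)
```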

(11)

Association Discovery

The constructed hierarchies reveal document and keyword associations within each language.

However, associations between documents or keywords of different languages are much more difficult to find because there is no direct mapping between the two hierarchies.

(12)

Association Discovery

Finding associations

The goal is to associate a Chinese keyword cluster with an English keyword cluster.

This is an instance of the general ontology alignment problem.

A Chinese keyword cluster is considered related to an English one if they represent the same theme.

The theme of a keyword cluster can be determined from the documents labelled to the same neuron.

(13)

Association Discovery

Thus we can associate two keyword clusters according to their corresponding document clusters.

Parallel corpora were used, so the correspondence between documents of different languages is known a priori.

To associate a Chinese cluster Ck with some English cluster El, we use a voting scheme to calculate the likelihood of such an association.

(14)

Association Discovery

Voting for the best-matched cluster

1. For each pair of Chinese documents Ci and Cj in Ck, find the clusters (neurons) in the English hierarchy to which their English counterparts Ei and Ej are labelled. Let these clusters be Ep and Eq.

2. Find the shortest path between Ep and Eq in the English hierarchy.

3. Add 1 to the scores of Ep and Eq. Add 1/(dist(Ci, Cj) - 1) to the score of every other cluster on the path.
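The voting steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the English hierarchy is assumed to be a connected undirected graph stored as an adjacency dictionary, dist(Ci, Cj) is assumed to be the number of edges on the shortest path between Ep and Eq, and a pair whose counterparts fall in the same cluster is assumed to add 2 to that cluster. The names `shortest_path`, `vote`, `adj`, and `english_label` are hypothetical.

```python
from collections import deque
from itertools import combinations

def shortest_path(adj, start, goal):
    """Breadth-first search; returns the nodes from start to goal (inclusive)."""
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

def vote(chinese_docs, english_label, adj):
    """Score every English cluster for one Chinese cluster Ck.

    chinese_docs: ids of the documents labelled to Ck.
    english_label: maps each document id to the English cluster of its
                   parallel counterpart (known from the parallel corpus).
    """
    scores = {node: 0.0 for node in adj}
    for ci, cj in combinations(chinese_docs, 2):
        ep, eq = english_label[ci], english_label[cj]
        path = shortest_path(adj, ep, eq)
        dist = len(path) - 1           # assumed meaning of dist(Ci, Cj)
        scores[ep] += 1.0
        scores[eq] += 1.0
        for mid in path[1:-1]:         # clusters strictly between Ep and Eq
            scores[mid] += 1.0 / (dist - 1)
    return scores

# Toy hierarchy: root has children A and B; A has children A1 and A2.
adj = {
    "root": ["A", "B"], "A": ["root", "A1", "A2"],
    "B": ["root"], "A1": ["A"], "A2": ["A"],
}
scores = vote(["d1", "d2", "d3"], {"d1": "A1", "d2": "A1", "d3": "A2"}, adj)
best = max(scores, key=scores.get)     # the English cluster associated with Ck
```

With this reading of dist, the dist - 1 intermediate clusters on a path share a total vote of 1, so long paths contribute less to each cluster they pass through.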

(15)

Association Discovery

We associate Ck with the English cluster El that receives the highest score.

[Figure: an example English hierarchy with cluster scores 0.83, 2, 1.33, 2, 2, 0, and 0.83.]

(16)

Association Discovery

Document associations

A Chinese document Ci is associated with an English document Ej if their corresponding clusters are associated.

Keyword associations

A Chinese keyword labelled to neuron k in the Chinese hierarchy is associated with an English keyword labelled to neuron l in the English hierarchy if Ck and El are associated.

(17)

Association Discovery

Document-keyword associations

When Ck is associated with El, all documents and keywords labelled to these two neurons are associated.
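Once cluster associations are known, the three kinds of association follow mechanically. The sketch below assumes the cluster associations and neuron labels are available as plain dictionaries; `derive_associations` and all sample identifiers are hypothetical.

```python
def derive_associations(cluster_pairs, zh_docs, en_docs, zh_keywords, en_keywords):
    """Expand cluster associations into document, keyword, and
    document-keyword associations.

    cluster_pairs: {Chinese cluster id: associated English cluster id}
    zh_docs, en_docs, zh_keywords, en_keywords: {cluster id: list of labels}
    """
    doc_pairs, keyword_pairs, doc_keyword_pairs = [], [], []
    for ck, el in cluster_pairs.items():
        cd, ed = zh_docs.get(ck, []), en_docs.get(el, [])
        cw, ew = zh_keywords.get(ck, []), en_keywords.get(el, [])
        doc_pairs += [(c, e) for c in cd for e in ed]
        keyword_pairs += [(c, e) for c in cw for e in ew]
        # Cross-lingual document-keyword links in both directions.
        doc_keyword_pairs += [(c, e) for c in cd for e in ew]
        doc_keyword_pairs += [(e, c) for e in ed for c in cw]
    return doc_pairs, keyword_pairs, doc_keyword_pairs

docs, keywords, mixed = derive_associations(
    {"C1": "E3"},
    zh_docs={"C1": ["zh_doc_7"]}, en_docs={"E3": ["en_doc_2", "en_doc_9"]},
    zh_keywords={"C1": ["音樂"]}, en_keywords={"E3": ["music"]},
)
```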

(18)

Experimental Result

The Sinorama parallel corpora were used.

Each Chinese article was faithfully translated into English.

Our corpus contains 976 parallel documents.

The Chinese vocabulary has 3,436 terms and the English vocabulary has 3,711 terms.

Each document is transformed into a vector. We used the GHSOM program developed by Rauber's team to train on the bilingual vectors.

(19)

Experimental Result

(20)


(21)
(22)

Experimental Result

Performance Evaluation

We measure the mean inter-document path length between each pair of documents in Ck or Ek:

$$P_k = \frac{1}{|C_k|^2} \sum_{C_i, C_j \in C_k} \mathrm{dist}(C_i, C_j)$$

The quality of the bilingual hierarchies can then be measured by the average of all $P_k$ over the entire hierarchy, denoted by $\bar{P}$.

(23)

Experimental Result

We computed the average value of $\bar{P}$ over 100 trainings and obtained a value of 2.39.

(24)

Conclusions

We proposed a text mining method to extract associations between multilingual texts and keywords.

GHSOM performs well in clustering and organizing documents.

The discovered associations seem plausible for MLIR and other

(25)
