
3.1 Previous Work for Cross-Language Document Retrieval

Cross-language information retrieval allows a user to formulate a query in one language and to retrieve documents written in others. The controlled vocabulary is the oldest and most traditional technique, widely used in libraries and documentation centers: documents are indexed manually with a fixed set of terms, which are also used to formulate queries. These terms can be indexed in multiple languages and maintained in a so-called thesaurus.

With the dictionary-based technique, queries are translated into the language in which documents may be found. The corpus-based technique analyzes large collections of existing texts and automatically extracts the information on which the translation is based. However, this technique tends to require the integration of linguistic constraints, because purely statistical extraction can introduce errors and thus degrade performance [24].

Latent Semantic Indexing (LSI) is a newer approach to multilingual information retrieval that allows a user to retrieve documents by concept and meaning rather than by pattern matching alone. In the following, we review these three approaches.

3.1.1 Dictionary-Based Approach

The number of public-domain and commercial dictionaries available on the Internet in multiple languages is increasing steadily. To cite a few examples: the Collins COBUILD English Language Dictionary and its series in major European languages, the Leo Online Dictionary, the Oxford Advanced Learner's Dictionary of Current English, and Webster's New Collegiate Dictionary.

Electronic monolingual and bilingual dictionaries provide a solid platform for developing multilingual applications: queries are translated into a language in which documents may be found. However, this technique sometimes achieves unsatisfactory results because of ambiguity. Many words have more than one translation, and the alternative translations can have very different meanings. Moreover, the scope of a dictionary is limited; in particular, it lacks the technical and topical terminology that is crucial for a correct translation. Nevertheless, the technique can be used to implement simple dictionary-based applications or can be combined with other approaches to overcome the drawbacks mentioned above.

Electronic dictionary-based query translation has achieved 40-60% of the effectiveness of monolingual retrieval [4] [29]. In [27], a multilingual search engine called TITAN was developed. Based on a bilingual dictionary, it translates queries from Japanese to English and from English to Japanese, helping Japanese users search the Web in their own language. However, this system again suffers from polysemy.
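To make the polysemy problem concrete, the following is a minimal sketch of dictionary-based query translation. The tiny bilingual dictionary and the sample query are invented for illustration; a real system would draw on a full electronic dictionary such as those listed above.

    # Minimal sketch of dictionary-based query translation (illustrative only).
    BILINGUAL_DICT = {
        "bank": ["banco", "orilla"],   # polysemy: financial institution vs. riverbank
        "medical": ["medico"],
        "image": ["imagen"],
    }

    def translate_query(query):
        """Replace each query word by all of its dictionary translations.

        Words missing from the dictionary are kept as-is, a common fallback
        for proper names and technical terms.
        """
        target_terms = []
        for word in query.lower().split():
            target_terms.extend(BILINGUAL_DICT.get(word, [word]))
        return target_terms

    print(translate_query("medical image bank"))
    # -> ['medico', 'imagen', 'banco', 'orilla']

Because every sense of an ambiguous word is added to the target query, precision drops, exactly as reported for TITAN above.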

3.1.2 Corpus-Based Technique

The corpus-based technique seems promising. It analyzes large collections of existing texts (corpora) and automatically extracts the information on which the translation is based. Corpora are collections of texts in electronic form that support applications such as spelling and grammar checkers, hyphenation routines, and lexicographic tools (word extractors, parsers, glossary tools).

Researchers use such corpora to evaluate the performance of their solutions, e.g., the TREC collections for cross-language retrieval. Examples of mono-, bi-, and multilingual corpora are the Brown Corpus, the Hansard Corpus, and the United Nations documents, respectively. The Hansard Corpus5 contains parallel texts in English and Canadian French collected over six years from the Canadian Parliament. The Brown Corpus consists of more than one million words of American English published in 1961 and is now available from ICAME6 (International Computer Archive of Modern English). Interested readers are referred to the survey of electronic corpora and related resources in [17].

In the United States, the WordNet project at Princeton has created a large network of English word senses related by semantic relations such as synonymy, part-whole, and is-a relations [19] [39]. Similar work, called EuroWordNet [21], has been launched in Europe. The EuroWordNet semantic taxonomies have been developed for Dutch, Italian, and Spanish and are planned to be extended to other European languages. Related European activities include ACQUILEX7 (Acquisition of Lexical Knowledge for Natural Language Processing Systems) and ESPRIT MULTILEX8 (Multi-Functional Standardized Lexicon for European Community Languages).

5 http://morph.ldc.upenn.edu/ldc/news/release/hansard.html

6 http://www.hd.uib.no/icame.html

7 http://www.cl.cam.ac.uk/Research/NL/acquilex

8 http://www.twente.research.ec.org/esp-syn/text/5304.html

The main problems associated with dictionary-based CLIR are (1) untranslatable search keys, owing to the limitations of general dictionaries; (2) the processing of inflected words; (3) phrase identification and translation; and (4) lexical ambiguity in the source and target languages. Untranslatable keys include new compound words, special terms, and cross-lingual spelling variants, i.e., equivalent words in different languages that differ slightly in spelling, particularly proper names and loanwords. In this dissertation, translation ambiguity refers to the introduction of irrelevant word senses during translation due to lexical ambiguity in the source and target languages.
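Problems (1) and (2) are often addressed with morphological normalization plus a pass-through fallback. In the sketch below, the suffix stripper and the toy dictionary are stand-ins for a proper stemmer (e.g., Porter's) and a real bilingual lexicon; they are invented for illustration.

    SUFFIXES = ("ies", "es", "ing", "ed", "s")

    def naive_stem(word):
        """Strip a common English suffix; a real system would use a
        morphological analyzer or a stemmer such as Porter's."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def lookup(word, dictionary):
        # Try the surface form, then the stemmed form, then pass the word
        # through untranslated (useful for proper names and loanwords).
        if word in dictionary:
            return dictionary[word]
        stem = naive_stem(word)
        if stem in dictionary:
            return dictionary[stem]
        return [word]

    toy_dict = {"document": ["dokument"]}
    print(lookup("documents", toy_dict))   # -> ['dokument']  (inflection handled)
    print(lookup("hansard", toy_dict))     # -> ['hansard']   (untranslatable key kept)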

The collection may contain parallel and/or comparable corpora. A parallel corpus is a collection that contains documents together with their translations. A comparable corpus is a document collection in which documents are aligned according to the similarity of the topics they address; document alignment deals with documents that cover similar stories, events, and so on. For instance, newspapers in different languages often describe the same political, social, and economic events, and some news agencies spend considerable effort translating such international articles, for example from English into their local languages (e.g., Spanish, Arabic). The resulting high-quality parallel corpora are an effective input for evaluating cross-language techniques.

Sheridan and Ballerini developed an automatic thesaurus construction based on a collection of comparable multilingual documents [62]. Using the information retrieval system Spider, this approach was tested on comparable German and Italian news articles (the SDA news collection) that address the same topics at the same time. They reported that German queries against Italian documents achieve about 32% of the best Spider performance on monolingual Italian retrieval when relevance feedback is used. Other experiments on English, French, and German are presented in [72]. Document alignment in that work was based on indicators such as proper nouns, numbers, and dates; alignment can also be based on term similarity, as in latent semantic analysis, which allows text to be mapped between documents in different languages.
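As an illustration of indicator-based alignment, the sketch below scores a candidate document pair by the overlap of language-independent indicators (here, capitalized tokens and numbers/dates). The extraction rules and the example sentences are invented; real systems use proper named-entity and date recognition.

    import re

    def indicators(text):
        """Collect capitalized tokens and numbers as alignment indicators."""
        proper_nouns = set(re.findall(r"\b[A-Z][a-z]+\b", text))
        numbers = set(re.findall(r"\b\d[\d./-]*\b", text))
        return proper_nouns | numbers

    def alignment_score(doc_a, doc_b):
        """Jaccard overlap of the two documents' indicator sets."""
        a, b = indicators(doc_a), indicators(doc_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    english = "Parliament met in Ottawa on 12.06.1990 to debate the budget."
    french = "Le Parlement s'est réuni à Ottawa le 12.06.1990 pour le budget."
    print(round(alignment_score(english, french), 2))   # shared: Ottawa, 12.06.1990

Pairs whose score exceeds a threshold are then treated as comparable documents.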

3.1.3 Indexing by Latent Semantic Analysis

In the previous approaches, the ambiguity of terms and their dependencies lead to poor results. Latent Semantic Indexing (LSI) is a newer approach to multilingual information retrieval that allows a user to retrieve documents by concept and meaning rather than by pattern matching alone. If the query words are not matched literally, this does not mean that no document is relevant; on the contrary, many relevant documents simply do not contain the query terms word for word. This is the problem of synonymy: a linguist, for example, will phrase a request differently from a computer scientist, and documents cannot contain every term that every user might submit. Using thesauri to overcome this issue remains ineffective, since expanding the query with unsuitable terms drastically decreases precision.

Latent semantic indexing is based on singular-value decomposition (SVD) [16]. Starting from the term-document matrix, terms and documents are ordered according to their degree of "semantic" neighborhood, so that closely related objects end up near each other. The result of the LSI analysis is a reduced model that describes term-term, document-document, and term-document similarity; the similarity between two objects can be computed as the cosine between their representative vectors.
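A minimal sketch of this computation with NumPy follows. The term-document matrix is a toy example and k = 2 latent dimensions are kept; real applications operate on large sparse matrices and keep a few hundred dimensions.

    import numpy as np

    # Rows are terms, columns are documents (toy counts).
    A = np.array([[2.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

    k = 2                                    # latent dimensions kept
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    docs_k = S_k @ Vt_k                      # documents in the reduced space

    def cosine(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    # Document-document similarity in the latent space.
    print(round(cosine(docs_k[:, 0], docs_k[:, 2]), 3))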

The results of the LSI approach have been compared with those of a term-matching method (SMART) on two standard document collections: MED (1033 medical documents and 30 queries) and CISI (1460 information science abstracts and 35 queries). It was shown that LSI yields better results than term matching.

Davis and Dunning [14] have applied LSI to cross-language text retrieval. Their experiments on the TREC collection achieved approximately 75% of the average precision of a monolingual system on the same material [15]. The collection contains about 173,000 Spanish newswire articles; 25 queries were translated manually from Spanish to English. Their results, reported in TREC-5, showed that dictionary-based query expansion alone yields approximately 50% of the average precision of the monolingual system. This degradation can be explained by the ambiguity of term translation using dictionaries.

This technique has been used by Oard [45] as the basis for multilingual filtering experiments, with encouraging results. The LSI representation of documents is "economical" in that it eliminates redundancy: it reduces the dimensionality of the document representation (or model) and alleviates polysemy as well. However, updating the representation matrices (adding new terms and documents) is time-consuming.
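A common low-cost workaround, usually called "folding in", projects a new document vector into the existing latent space instead of recomputing the SVD: d_k = S_k^{-1} U_k^T d. The sketch below reuses the toy matrix from the previous snippet. Folding in is cheap, but the latent space gradually drifts as documents accumulate, so the SVD must still be recomputed periodically.

    import numpy as np

    A = np.array([[2.0, 0.0, 1.0],   # same toy term-document matrix as above
                  [1.0, 0.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k = U[:, :k], np.diag(s[:k])

    def fold_in(doc_vector, U_k, S_k):
        """Project a raw term-count vector into the k-dimensional LSI space."""
        return np.linalg.inv(S_k) @ U_k.T @ doc_vector

    new_doc = np.array([1.0, 0.0, 2.0, 1.0])   # term counts of an unseen document
    print(fold_in(new_doc, U_k, S_k))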

3.2 Combining Text and Visual Features for Medical Images
