An Approach to Using the Web as a Live Corpus for Spoken Transliteration Name Access

(1)

Computational Linguistics and Chinese Language Processing

Vol. 11, No. 3, September 2006, pp. 183-200 183 © The Association for Computational Linguistics and Chinese Language Processing

[Received February 16, 2006; Revised June 23, 2006; Accepted July 10, 2006]

An Approach to Using the Web as a Live Corpus for

Spoken Transliteration Name Access

Ming-Shun Lin

∗

, Chia-Ping Chen

+

and Hsin-Hsi Chen

∗

Abstract

Recognizing transliteration names is challenging due to their flexible formulation and lexical coverage. In our approach, we employ the Web as a giant corpus. The patterns extracted from the Web are used as a live dictionary to correct speech recognition errors. The plausible character strings recognized by an Automated Speech Recognition (ASR) system are regarded as query terms and submitted to Google. The top N snippets are entered into PAT trees. The terms of the highest scores are selected. Our experiments show that the ASR model with a recovery mechanism can achieve 21.54% performance improvement compared with the ASR only model on the character level. The recall rate is improved from 0.20 to 0.42, and the MRR from 0.07 to 0.31. For collecting transliteration names, we propose a named entity (NE) ontology generation engine, called the XNE-Tree engine, which

produces relational named entities by a given seed. The engine incrementally extracts high co-occurring named entities with the seed. A total of 7,642 named entities in the ontology were initiated by 100 seeds. When the bi-character language model is combined with the NE ontology, the ASR recall rate and MRR are improved to 0.48 and 0.38, respectively.

1. Introduction

Named entities [MUC 1998], which denote persons, locations, organizations, etc., are common foci of searchers. Thompson and Dozier [1997] showed that named entity recognition (NER) could improve the performance of information retrieval systems. Capturing named entities is challenging due to their flexible formulation and novelty. The issues behind speech recognition make named entity recognition more challenging on the

∗_{Department of Computer Science and Information Engineering, National Taiwan University,} Taipei, 106 Taiwan

E-mail: {d91022 , hhchen}@csie.ntu.edu.tw

+_{Department of Computer Science and Engineering, National Sun Yat-Sen University,} 70, Lien-Hai Road, Kaohsiung, 804 Taiwan

(2)

184 Ming-Shun Lin et al.

spoken level than on the written level. This paper focuses on a special type of named entities, called transliteration names. They describe foreign people, places, etc. Spoken transliteration name recognition is useful for many applications. For example, cross language image retrieval via spoken queries aims to employ the latter in one language to retrieve images with captions in another language [Lin et al. 2004].

In the past, Appelt and Martin [1999] adapted the TextPro system to process transcripts generated by a speech recognizer. Miller et al. [2000] analyzed the effects of out-of-vocabulary errors and the loss of punctuation on name finding in automatic speech recognition. Huang and Waibel [2002] proposed an adaptive method of named entity extraction for the meeting understanding. Chen [2003] dealt with spoken cross-language access to image collections. The coverage of a lexicon is one of the major issues in spoken transliteration name access. Recently, researchers are interested in using the Web, which provides a huge collection of up-to-date data, as a corpus. Keller and Lapata [2003] employed the Web to obtain frequencies for bigrams that are unseen in a given corpus.

Named entities are important objects in web documents. Building named entity relationship chains from the web is an important task. Matsuo et al. [2004] found social networks of trust from related web pages. Google sets1_{extracts named entity from web pages}

by inputting a few named entities. For some emerging applications like personal name disambiguation [Fleischman and Hovy 2004] [Mann and Yarowsky 2003], social chain finding [Bekkerman and McCallum 2005] [Culotta et al. 2004] [Raghavan et al. 2004], etc., glossary-based representations of named entities are not enough. For collecting transliteration names and building a bi-character language model, we propose a named entity (NE) ontology generation engine, called the XNE-Tree engine. This engine produces relational named entities

by given a seed. The engine uses Google to incrementally extract high co-occurrence named entities from related web pages and those named entities have similar relational properties with the seed. In each iterative step, the seed will be replaced by its siblings or descendants, which form new seeds. In this way, the XNE-Tree engine will build a tree structure as follows

with the original seed as a root.

1_{http://labs.google.com/sets}

Seed

(3)

An Approach to Using the Web as a Live Corpus for 185 Spoken Transliteration Name Access

In this paper, we discuss using the Web as a live dictionary for recognizing spoken transliteration names and employ the fuzzy search capability of Google to retrieve relevant web page summaries. In section 2, we sketch the steps in our method. In section 3, we discuss using PAT trees to learn patterns from the Web dynamically and to correct recognition errors. Section 4 shows the experiments, which are the ASR model with/without the recovery mechanism. Section 5 presents the XNE-Tree named entity ontology engine and our

experimental results. In section 6, we make concluding remarks.

2. Spoken Transliteration Name Recognition System

The spoken transliteration name recognition system shown in Figure 1 accepts a speech signal denoting a foreign named entity and converts it into a character string. It is composed of the following four major stages. Stages (1) and (2) consist of the fundamental tasks in speech recognition. In the Stages (3) and (4), speech-to-text errors are corrected by using the Web.

Figure 1. Stage in transliteration name recognition

(1) First, we employ the HTK2_{and SRILM}3_{toolkits to build speech recognition models.}

For each speech signal, we use the model to get a syllable lattice.

(2) Then, the syllable lattice is mapped into a character lattice by using a mapping table. The mapping table is a syllable-to-character mapping. Top-N character strings are selected from the character lattice by using Viterbe algorithm and a bi-character model which is trained from a transliteration name corpus. Such character strings will be called ASR strings in the following.

(3) Next, each ASR string is regarded as a query and is submitted to a web search engine like Google. From the top-N search result, we select higher frequency patterns from a PAT tree structure. The PAT tree [Chien 1997] [Gonnet et al. 1992], which was derived from the Patricia

2_{http://htk.eng.cam.ac.uk/}

(4)

tree, can be employed to extract word boundary and key phrases automatically. Because we employ the PAT tree to extract patterns, the patterns will be called PAT candidates in the following. A PAT tree example, “湯姆漢克斯湯姆克魯斯喬治克魯尼” in MS950 encoding, is shown in Figure 2. The circles represent semi-infinite string numbers. The number above each circle denotes the length, which indicates the first different bit of the character strings recorded in the sub-trees. In this example, the longest patterns are for “克魯” and “湯姆” on nodes (7, 12) and (0, 5), with lengths of 33 and 34 bits, respectively. The second longest patterns are for “克”, ”魯”, ”姆” and “斯” on nodes (3, 7, 12), (8, 13), (1, 6) and (4, 9), with lengths of 16, 17, 18 and 18 bits, respectively.

Figure 2. An example of extracting longest length pattern and its frequency (4) Finally, the PAT candidates of all the ASR strings are merged together and ranked based on their number of occurrences and similarity scores. Candidates with the highest ranks are regarded as the recognition results for a spoken transliteration name.

Consider the example shown in Figure 3. The Chinese speech signal is a transliteration name, “湯姆克魯斯”, in Chinese, which denotes the name of the movie star “Tom Cruise.” The lattice shows different combinations of syllables. Each syllable corresponds to several Chinese characters. For example, “ke” is converted into “克”, “柯”, “科”, “可”, “喀”, “刻”, etc. The ASR strings “塔莫克魯斯”, “塔門克魯斯”, “塔莫柯魯斯”, etc. are selected from the

Document : 湯姆漢克斯湯姆克魯斯喬治克魯尼 Semi-infinite string: 0:湯姆漢克斯湯姆克魯斯喬治克魯尼 1:姆漢克斯湯姆克魯斯喬治克魯尼 2:漢克斯湯姆克魯斯喬治克魯尼 3:克斯湯姆克魯斯喬治克魯尼 4:斯湯姆克魯斯喬治克魯尼 5:湯姆克魯斯喬治克魯尼 6:姆克魯斯喬治克魯尼 7:克魯斯喬治克魯尼 8:魯斯喬治克魯尼 9:斯喬治克魯尼 10:喬治克魯尼 11:治克魯尼 12:克魯尼 13:魯尼 14:尼 MS950 encode : 湯:011011100110111 姆:010110011100011 漢:011011110010001 克:010100010100101 斯:011001011010111 湯:011011100110111 姆:010110011100011 克:010100010100101 魯:100110110110111 斯:011001011010111 喬:010101011010110 治:011011001011101 克:010100010100101 魯:100110110110111 尼:010111000011110 0 13 4 1 3 6 8 14 0 1 18 5 3 5 19 2 9 11 19 10 7 12 2 5 6 17 34 35 8 7

(5)

character lattice. Through Google fuzzy search using the query “塔莫克魯斯 ”, some summaries of Chinese web pages are obtained and shown in Figure 4. Although the common transliteration of “Tom Cruise” in Chinese is “湯姆克魯斯”, which is different from the query “塔莫克魯斯”, fuzzy matching using Google can still identify relevant snippets containing the correct transliteration. We will call this operation “recognition error recovery using the Web” in the following.

Figure 3. An example of recognizing a transliteration name: “

湯姆克魯斯

” (“Tom Cruise”) (1) ... 至於贏家部份，則還是湯姆漢克斯、湯姆克魯斯、喬治克魯尼這些老面孔，… (2) ... 第 76 分鐘，克魯斯換下梅開二度的維埃裏。 (3) ... 國際米蘭 (4-4-2)：豐塔納/科爾多瓦，布爾迪索，馬特拉齊，法瓦利/斯坦科維奇，貝隆，扎內蒂，埃姆雷/克魯斯，馬丁斯。 (4) ... 提起妮可即發火湯姆克魯斯“想殺死記者”. (5) ... 電影節最具看點的明星當然非妮可基德曼與湯姆克魯斯有望在水城的戲劇性重逢莫屬。

Figure 4. Summaries obtained through fuzzy search for the query “

塔莫克魯斯

” In the above examples, each partial matching part is enclosed in a rectangle symbol and the correct transliteration name is underlined. Summaries (1), (4) and (5) mention the movie star “湯姆克魯斯” (Tom Cruise) and summaries (2) and (3) mention a football star, “克魯斯” (Cruz). Figure 3 shows that PAT patterns like “聖塔克魯斯”, “湯姆克魯斯”, “姆克魯斯演”, etc. are proposed. After merging and ranking are performed, the possible recognition results are “湯姆克魯斯”, “洛普克魯茲”, “聖塔克魯茲”, etc.

Speech Input

湯姆克魯斯 Tom Cruise

Syllable Lattice Character_Lattice _StringsASR _EngineSearch PAT Patterns Recognition_Results

WebPages WebPages WebPages 聖塔克魯斯湯姆克魯斯姆克魯斯演克魯斯妮可傑米福克斯洛普克魯茲湯姆克魯斯洛普克魯茲聖塔克魯茲塔納克羅斯門將魯斯圖特洛克羅斯傑米福克斯湯姆漢克斯布魯斯威利塔莫克魯斯塔門克魯斯塔莫柯魯斯塔莫克魯絲塔莫可魯斯他莫克魯斯塔莫克魯茲他門克魯斯 … … … … … … … … … 他莫克魯斯塔摩柯路思納默科盧司那墨可陸絲拿模喀祿賜勒刻露茲樂恪如子爾儒玆耳茹治勞羅日門洛蒙孟 … … … … …

(6)

3. Recognition Error Recovery Using the Web

The error recovery module tries to select higher frequency patterns from the Web search results and substitute the speech recognition results of Stages 1 and 2 (shown in Section 2) with the pattern. In this approach, Web search results obtained with an ASR string are placed in a PAT tree, and PAT candidates are selected from the tree. Two points are worth noting. A PAT candidate should occur many times in the PAT tree and should be similar to the ASR string.

The frequency, Freq, of a PAT candidate can be computed easily based on the PAT tree structure. The similarity between a PAT candidate and an ASR string is modeled by edit distance, which is the minimum number of insertions, deletions and substitutions needed to transform one character string (ASR) into another string (PAT). The smaller number is, the more similar they are. The similarity score, between an ASR string and a PAT string, is the frequency of the PAT string minus their number of edit operations. Finally, the score of a PAT string relative to an ASR string is defined as follows:

( , ) ( ) tan ( , )

Score ASR PAT = ×α Freq PAT − ×β Dis ce ASR PAT . (1) It is computed through weighted merging of the frequency of the PAT string and by using the similarity between the ASR string and PAT string. This value determines if the ASR string will be replaced by the PAT string. In the above example, Freq(湯姆克魯斯)=43 and Distance (塔莫克魯斯, 湯姆克魯斯)=2.

4. Experimental Results

The speech input to the transliteration name recognition system is a Chinese utterance. We employed 51,111 transliteration names [Chen et al. 2003] to train the bi-character language model discussed in Section 2. In the experiments, the test data include 50 American state names, 29 names of movie stars from the 31st_{Annual People’s Choice Awards}

(http://www.pcavote.com), and 21 names of NBA stars from the 2005 NBA All Star Team (http://www.nba.com/allstar2005/). The 50 American state names are not very active on the Web. In contrast, the 50 names of stars are very active. The test set is different from the training data set, so it is an open test. Because there may be more than one transliteration for a foreign named entity, the answer keys are manually prepared. For example, “Arizona” has four possible transliterations in Chinese: “亞利桑納”, “亞歷桑納”, “亞利桑那”, and “亞歷桑 那”. On average, there are 1.9 Chinese transliterations for a foreign name in our test set. Appendix A lists the name test set and its answer keys. As explained in Section 2, the transliteration name recognition system is composed of four major stages. Stages 1 and 2 include the fundamental speech recognition tasks, and Stages 3 and 4 comprise the error

(7)

recovery task. To examine the effects of these two parts, we will evaluate them separately in the following two subsections.

4.1 Performance in the Error Recovery Task

We assume that correct syllables have been identified in the speech recognition task. We simulate this assumption by transforming all the characters in the answer keys into syllables. Then, in Stage 2, we map the syllable lattice to obtain a character lattice. A total of 50 ASR strings are extracted from the character lattice in Stage 2 and submitted to Google. Finally, the best 10 PAT candidates are selected. We use the MRR (Mean Reciprocal Rank) [Voorhees 1999] and recall rate to evaluate the performance. The MRR represents the average rank of the correctly identified transliteration names among in the proposed candidates and it is defined as follow: 1 1 M i i M R R r M = = ∑ , (2)

where M is the total number of test cases ; ri equals 1/ranki if ranki > 0 and ri is 0 if no answer

is found. The ranki is the rank of the first correct answer for the ith test case. That is, if the first

correct answer is ranked 1, then the score is 1/1; if it is ranked 2, the score is 1/2, and so on. The MRR value is between 0 and 1. The inverse of the MRR denotes the average position of the correct answer in the proposed candidate list. The higher the MRR value is, the better the performance is. The recall rate is the number of correct references divided by M. It indicates how many transliteration names are correctly recognized.

Table 1. Performance of models wo/with error recovery

Models Recall MRR

ASR only 0.79 0.50

ASR + Web 0.90 0.88

ASR/Pre-removed + Web 0.59 0.48

Table 1 summarizes the experimental results obtained with models without/with the error recovery procedure. With the “ASR only” model, the top 10 ASR strings produced in Stage 2 are regarded as answers. This model does not employ the error recovery procedure. The recall rate is 0.79 and the MRR is 0.50. That is, 79 of 100 transliteration names are recognized correctly, and they appear in the first 2 (=1/0.50) position. In contrast, the “ASR + Web” model utilizes the error recovery procedure. PAT candidates extracted from the Web are selected in Stage 4. The recall rate is 0.90 and the MRR is 0.88. A total of 90 transliteration names are recognized correctly, and they appear in the first 1.13 (=1/0.88) position on average. In other words, when they are recognized correctly, they are always the top 1. Compared with

(8)

the first model, the recall rate is increased 13.92%. As for the third model, i.e., the “ASR/Pre-removed + Web” model, we try to evaluate the error recovery ability. The correct transliteration names appearing in the set of ASR strings are removed. That is, all of the ASR strings contain at least one incorrect character. In such cases, the recall rate is 0.59 and the MRR is 0.48. This means that 59 transliteration names are recovered, and they appeared in the first 2.08 (=1/0.48) position on average. We further examine the number of errors produced by the “ASR/Pre-removed + Web” model to study the error tolerance when using the Web. Table 2 shows the lengths of the transliteration names (in the rows), and the number of matching characters (in the columns). For a transliteration name of length l, the number of matching characters is 0 to l. Each cell denotes how many strings belong to the specific category. For example, before error recovery, there are 6, 25, 90, 184, and 0 strings of length 4, which have 0, 1, 2, 3, and 4 characters matching the corresponding answer keys, respectively. After error recovery, there are 19, 52, 66, 62, and 106 strings of length 4, which have 0, 1, 2, 3, and 4 characters matching the answer keys, respectively. In other words, the recovery procedure corrects some wrong characters. The number of 1-character (2-character) errors decreased from 184 (90) to 62 (66), and total number of correct strings are increased from 0 to 106. Table 2. Distribution before/after error recovery

Before Error Recovery After Error Recovery Number of Matching Characters Number of Matching Characters Length of NEs _{0 1 2 3 4 5 6} _{0 1 2 3 4 5 6} 2 11 23 0 - - - - 13 21 0 - - - - 3 6 29 76 0 - - - 6 39 64 2 - - - 4 6 25 90 184 0 - - 19 52 66 62 106 - - 5 9 10 12 77 193 0 - 11 23 36 41 53 137 - 6 0 0 1 8 20 39 0 0 3 19 12 7 5 22

Table 3. Effects of error positions and string lengths

Error Positions Length=2 Length=3 Length=4 Length=5 Length=6 Total

Position 1 0 0 37 42 7 86 Position 2 0 2 35 42 4 83 Position 3 - 0 20 19 9 48 Position 4 - - 17 24 3 44 Position 5 - - - 14 3 17 Position 6 - - - - 1 1 Total 0 2 109 141 27 279

(9)

Table 3 shows the effects of the error position (in the rows) and the string length (in the columns). A total of 0, 2, 106, 137, and 22 utterances recover 1 character with length 2, 3, 4, 5, and 6, respectively. A total of 0, 0, 3, 4, and 5 utterances recover 2 characters with length 2, 3, 4, 5, and 6, respectively. No utterances can recover over 3 characters. The cell denotes how many strings can be recovered under the specific position and length. For example, a total of 37, 35, 20, and 17 errors for strings of length 4 appearing at positions 1, 2, 3, and 4, respectively, can be recovered by using the Web. In the experiments, 0% (=0/34), 1.80% (=2/111), 35.74% (=109/305), 46.84% (=141/301), and 39.71% (=27/68) of the strings of length 2, 3, 4, 5, and 6 can be recovered, respectively. The 34 is the number of the PAT candidates with length 2. Similarly, the 111, 305, 301, and 68 are the number of the PAT candidates with length 3, 4, 5, and 6. As for length, the longer strings facilitate better recovery than the shorter strings. Another results show that 30.82% (=86/279), 29.75% (=83/279), 17.20% (=48/279), 15.77% (=44/279), 6.09% (=17/279), and 0.36% (=1/279) of the strings with incorrect character appearing at positions 1, 2, 3, 4, 5 and 6 can be recovered, respectively. The 279 is the number of characters on which the 100 test data. Because the bi-character language model proceeds from the left side to the right side, the errors occurring at the beginning are easier to recover than those at the end.

4.2 Performance in the Speech Recognition Task

The set of 100 transliteration names discussed in Section 4.1 are spoken by 2 males and 1 female, so 300 transliteration names are recorded. We employ HTK and SRILM to get the best 100 syllable lattices (N-Best, N=100). The TCC-300 dataset for Mandarin is used to train the acoustic models. There are 417 HMM models, and each one has 39 feature vectors. The syllable accuracy is computed as follows: (M-I-D-S)/M * 100%, where M is the number of correct syllables; I, D, and S denote the number of insertion, deletion, and substitution errors, respectively. The syllable accuracy is 76.57%. To estimate the character recovery ability, we consider the correct character number. The accuracy of the ASR only and ASR+Web models on the character level are computed as follows, respectively:

1 1 ( ) ( , ) max ( ) ( ) M _Length _i _ij _i j toK i Length i

Word TestName Distance AnsKey ASR Word TestName = = − ∑ (3) and 1 1 ( ) ( , ) max ( ) ( ) M _Length _i _ij _i j toK i Length i

Word TestName Distance AnsKey PAT Word TestName

= =

−

∑ _, (4) where M is the total test number and K is the answer key number for test name i. A total of 50

(10)

ASR strings are extracted from the character lattice, and the best 50 PAT candidates are selected. Table 4 shows the character level results. The “ASR+Web” model achieves 21.54% better performance than the “ASR Only” model on average. Table 5 shows the word level results. The “ASR+Web” model using error recovery procedure improves the recall rate and the MRR of the “ASR Only” model from 0.20 and 0.07 to 0.42 and 0.31, respectively. In other words, the average ranks of the correct transliteration names move from the 14th_position

(=1/0.07) to the 3rd_{position (=1/0.31) after error recovery.}

Table 4. Performance on the character level

ASR Only (Character Level Accuracy) ASR + Web (Character Level Accuracy) Top 1 Top 2 Top 3 Top 4 Top 5 Top 1 Top 2 Top 3 Top 4 Top 5 38.01% 43.34% 47.30% 49.07% 50.93% 48.18% 54.01% 55.93% 58.03% 59.48%

Table 5. Performance on the word level

ASR Only (Word Level) ASR + Web (Word Level)

Recall MRR Recall MRR 0.20 0.07 0.42 0.31 Web fuzzy search produces useful patterns for error recovery. Our fault tolerance experiments show that longer transliteration names have stronger tolerance than shorter transliteration names and that the incorrect characters appearing at the beginning of a transliteration name are relatively easier to correct than those appearing at the end. Thus, the improvement in the character level accuracy is helpful for the recovery mechanism, and vice versa.

5. Re-training the Bi-Character Language Model

For collecting transliteration names to build a bi-character language model, we propose using a named entity (NE) ontology generation mechanism, called the XNE-Tree engine. Given a seed,

the engine incrementally extracts relational named entities with the seed from related web pages and the output is a tree structure. Each node in the structure is a named entity (NE).

5.1 A Named Entity Ontology Generation Engine

Recognizing a named entity and calculating the relational property score with a seed are two crucial tasks. Firstly, we submit the given seed to a search engine and select the top N returned snippets. Then, we use the suffix tree to extract possible patterns automatically. The patterns, which are extracted based on the global statistic, may be impacted by the frequency variance of patterns with the same substrings [Yang and Li 2002]. Because our aim is to generate named entities, most of the max-duplicated strings can be filtered out by using a named entity

(11)

recognition (NER) system [Chen et al. 1998]. The NER system will re-segment a candidate pattern to obtain some substrings and give each substring a part of speech (POS) and a possible name tag. If any substring is tagged as a location, an organization, or a person by using an NER-POS server [Chen et al. 1998], the candidate pattern is considered to be a named entity. Because prepositions frequently occur before/after a named entity, the suffix tree approach may introduce an incorrect boundary. Thus, we filter out substrings that have a preposition tag.

Secondly, we calculate a relational property score, called the Co-Occurrence Double-Check score (CODC, for each extracted name entity (denoted Yi) with a seed (denoted

X). We postulate that X and Yi have a strong relationship if we can find Yi from X (a forward

process) and find X from Yi (a backward process). The forward and backward processes form a

double check operation. CODC(X, Y) is defined as follows:

( @ ) ( @ ) log ( ) ( ) ( , ) f Y X f X Y f X f Y CODC X Y e α ⎛ _× ⎞ ⎜ ⎟ ⎝ ⎠ = , (5) where f(X@Yi) is the total number of occurrences of X in the top N snippets when query Yi is

submitted to the search engine. Similarly, f(Yi@X) is the total number of occurrences of Yi in

the top N snippets for query X; f(X) is the total number of occurrences of X in the top N snippets for query X, and f(Y) is the total number of occurrences of Y in the top N snippets of query Y. In each iterative step, Yi is added to a queue when the CODC(X,Yi) value is larger

than a threshold θ. Then, we get a new seed X from the queue. The CODC measure is best when α=0.15. The overall process is shown in Figure 5.

Q u e u e T e rm E x tra c tio n A n N E D o u b le C h e c k W ra p p e r N E R P O S T a g g in g N E O n to lo g y N e w S e e d D ic tio n a ry

(12)

5.2 Constructing a Named Entity Ontology

When building a bi-character language model, we choose 100 seeds, which are the same 100 utterances described in Section 4. Here, we set a condition to control generation of the ontology. Each initial seed can derive at most four layers, and no more than 15 children are allowed in the first layer. The maximum number of children of a named entity in layer i is bounded by the number in layer (i-1) multiplying a decreasing rate. In the experiments, we set the decreasing rate to be 0.7, so that at most 15, 15×0.7, 15×0.72_{, and 15×0.7}3_{children can be}

expanded by a named entity in layers 0-3, respectively. We set the threshold θ at 0.1. Those named entities with CODC scores larger than the predefined threshold are sorted, and a sufficient number of named entities are selected in a sequence for expansion. In this way, a total of 7,642 nodes are generated by the 100 seeds. We employ Touch-Graph (http://www.touchgraph.com) to represent the named entity ontology. Figure 6 shows an example by using “湯姆克魯斯” as a seed, which is a Mandarin transliteration name of an actor “Tom Cruise”, to build an ontology. To evaluate the performance, we consider the following four types.

(1) Named Entity (NE) type: In this case, the proposed candidate should be a named entity and should not have incorrect boundary. A personal name with a title or a first name with more than 4 characters is regarded as being correct. In contrast, patterns with a last name only are considered incorrect.

(2) Relational property of NE (RNE) type: For those acceptable strings in (1), which have the same relational property with the initial seed or its parents are considered to be correct. The remaining nodes are incorrect.

(3) Partial Named Entity (PNE) type: We relax the restriction on boundary errors specified in (1). Patterns consisting of partial named entities are regarded as being correct. The remaining nodes are incorrect.

(4) Relational property of PNE (RPNE) type: For those acceptable strings in (3), which have the same relational property with the initial seed or its parents are considered to be correct. The remaining nodes are incorrect.

Table 6 shows the performance in ontology generation. Of those 7,642 nodes, the error rates for the NE type, the RNE type, the PNE type, and the PRNE type are 19.60%, 34.20%, 12.62%, and 29.82%, respectively.

Table 6. Performance in ontology generation

Size of Seed Size of Ontology NE RNE PNE RPNE

(13)

Figure 6. A snapshot of the named entity ontology of “

湯姆克魯斯

” (“Tom Cruise”)

5.3 Combining the Bi-Character Language Model with the NE Ontology

In the previous experiments, we employed 51,111 transliteration names (BaselineTN) to build the bi-character language model. However, these transliteration names might not be active on the Web. We submitted these transliteration names to a search engine (i.e., Google). For a transliteration name, if the search engine does not return any web pages, we filter it out. Finally, we filter out 14,933 named entities and get 36,178 transliteration names (FilterTN) with this method. Refer to Table 6. Of the 7,642 named entities (Total-Ontology) reported by Xne-engine, 6,146 named entities (NE-Ontology) are of the correct NE type, and 5,023 named

entities (RNE-Ontology) are of the correct RNE type.

In the experiments, we consider word level accuracy only. Two basic transliteration name corpora, i.e., BaselineTN and FilterTN, are employed to build bi-character language models. In ideal case, correct syllables have been identified in the ASR (ASR_Perfect). Table 7 shows that FilterTN is a little better than BaselineTN. We further combine FilterTN with the NE ontology derived by the XNE-Tree engine to perform evaluation. In this way, we employ

the FilterTN+RNE-Ontology, FilterTN+NE-Ontology, and FilterTN+Total-Ontology to build bi-character language models. Table 7 summarizes the experimental results obtained with the language model with the NE ontology. The three models with the NE ontology outperform the baseline model. In particular, the NE ontology improve the recall rate and the MRR from 0.79 and 0.50 (BaselineTN) to 0.84 and 0.55 (FilterTN+RNE-Ontology), respectively. Table 8 lists the results obtained using both the NE ontology and error recovery procedure. The NE ontology is still helpful, in particular for the recall rate. In the best case, it improves the recall rate from 0.90 (BaselineTN) to 0.94 (FilterTN+RNE-Ontology). In summary, the model using NE ontology resources, the recall rate is improved 13.92%. On comparing the “FilterTN+RNE-Ontology” model with the ASR model without the error recovery procedure and NE ontology resources, the recall rate is improved 18.98%.

(14)

Table 7. Bi-character language models with the NE ontology but without error recovery.

ASR_Perfect Only (Word Level) Language Model Size of TN

Recall MRR BaselineTN 51,111 0.79 0.50 FilterTN 36,178 0.80 0.50 FilterTN + RNE-Ontology 41,201 0.84 0.55 FilterTN + NE-Ontology 42,324 0.83 0.57 FilterTN + Total-Ontology 43,820 0.82 0.57

Table 8. Bi-character language models with both the NE ontology and error recovery procedure

ASR_Perfect + Web (Word Level) Language Model Recall MRR BaselineTN 0.90 0.88 FilterTN 0.90 0.87 FilterTN + RNE-Ontology 0.94 0.88 FilterTN + NE-Ontology 0.93 0.88 FilterTN + Total-Ontology 0.93 0.90

Table 9. Combining the bi-character language model with the NE ontology without/with the error recovery procedure in ASR systems

ASR Only (Word Level) ASR+Web (Word Level) Language Model Recall MRR Recall MRR BaselineTN 0.20 0.07 0.42 0.31 FilterTN 0.20 0.06 0.41 0.32 FilterTN + RNE-Ontology 0.23 0.11 0.48 0.38 FilterTN + NE-Ontology 0.24 0.11 0.48 0.37 FilterTN + Total-Ontology 0.24 0.12 0.47 0.39

Table 9 summarizes the experimental results obtained with language models that use the NE ontology without/with error recovery procedure in the complete transliteration name ASR system. The system without the error recovery procedure (ASR Only), the NE ontology still improves the performance. Comparing the “FilterTN+RNE-Ontology” with BaselineTN, the recall rate is increased 15%. When the ASR system incorporates the error recovery procedure (ASR+Web), the recall rate is increased 14.28% (FilterTN+RNE-Ontology vs. BaselineTN).

(15)

6. Conclusions

In this study, we employ the Web as a giant corpus to correct transliteration name recognition errors. Web fuzzy search produces useful patterns for error recovery. In the ideal case, we input the correct syllable sequences, convert them into text strings, and test the recovery capability by using the Web corpus. On comparing with the model without the web recovery procedure, the recall rate is improved 13.92%. For collecting transliteration names beforehand, we propose using a named entity (NE) ontology generation engine, called the XNE-Tree engine.

The engine automatically creates named entity ontology for a given seed. In the experiments, a total of 7,642 named entities in the ontology were initiated by 100 seeds. After the language model for speech recognition combined the named entity ontology, the recall rate is improved 18.98%. With a complete transliteration name ASR system, the error recovery experiments show that the recall rate is increased from 0.20 to 0.42 and the MRR from 0.07 to 0.31. When the RNE-Ontology is incorporated, the recall rate and the MRR is increased 0.48 and 0.38, respectively. Thus, we conclude that the error recovery procedure and NE ontology can helpful to the ASR model.

Acknowledgements

This research was partially supported by the National Science Council, Taiwan, under contracts NSC94-2752-E-001-001-PAE and NSC95-2752-E-001-001-PAE.

References

Appelt, D. E., and D. Martin, “Named Entity Extraction from Speech: Approach and Results Using the TextPro System,” In Proceedings of DARPA Broadcast News Workshop, 1999, pp. 51-54.

Bekkerman, R., and A. McCallum, “Disambiguating Web Appearances of People in a Social Network,” In Proceedings of WWW, 2005, pp. 463-470.

Chen, H. H., “Spoken Cross-Language Access to Image Collection via Captions,” In Proceedings of 8th_{Eurospeech, 2003, pp. 2749-2752.}

Chen, H. H., C. H. Yang, and Y. Lin, “Learning Formulation and Transformation Rules for Multilingual Named Entities,” In Proceedings of the Association for Computational Linguistics on Multilingual and Mixed-language Named Entity Recognition, 2003, pp. 1-8.

Chen, H. H., Y. W. Ding, and S. C. Tsai, “Named Entity Extraction for Information Retrieval,” Computer Processing of Oriental Languages, Special Issue on Information Retrieval on Oriental Languages, 1998, pp. 75-85

Chien, L. F., “PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval,” In Proceedings of 20th ACM SIGIR Conference, 1997, pp. 50-58.

(16)

Culotta, A., R. Bekkerman, and A. McCallum, “Extracting Social Networks and Contact Information from email and the Web,” In Proceedings of the First Conference on Email and Anti-Spam, 2004.

Fleischman, M. B., and E. Hovy, “Multi-document Person Name Resolution,” In Proceedings of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004. (presentation order: second)

Gonnet, G. H., R. A. Baeza-Yates, and T. Snider, “New Indices for Text: PAT Trees and PAT Arrays,” In Information Retrieval Data Structures Algorithms, 1992, pp. 66-82.

Huang, F., and A. Waibel, “An Adaptive Approach of Name Entity Extraction for Meeting Application,” In Proceedings Of`Human Language Technology Conference, 2002. Keller, F., and M. Lapata, “Using the Web to Obtain Frequencies for Unseen Bigrams,”

Computational Linguistics, 2003, pp. 459-484.

Lin, W. C., M. S. Lin, and H. H. Chen, “Cross-Language Image Retrieval via Spoken Query,” In Proceedings of RIAO, 2004, pp. 524-536.

Mann, G. S., and D. Yarowsky, “Unsupervised Personal Name Disambiguation,” In Proceedings of Conference on Computational Natural Language Learning, 2003.

Matsuo, Y., H. Tomobe, K. Hasida, and M. Ishizuka, “Finding Social Network for Trust Calculation,” In Proceedings of 16th_{European Conference on Artificial Intelligence,}

2004, pp. 510-514.

Miller, D., S. Boisen, R. Schwartz, R. Stone, and R. Weischedel, “Named Entity Extraction from Noisy Input: Speech and OCR,” In Proceedings of 6th Applied Natural Language Processing Conference, 2000, pp. 316-324.

MUC Message Understanding Competition,

http://www.itl.nist.gov/iaui/894.02/related_projects/muc/index.html, 1998.

Raghavan, H., J. Allan, and A. McCallum, “An Exploration of Entity Models, Collective Classification and Relation Description,” In Proceedings of LinkKDD, 2004.

Thompson, P., and C. Dozier, “Name Searching and Information Retrieval,” In Proceedings of Second Conference on Empirical Methods in Natural Language Processing, 1997, pp. 134--140.

Yang, W., and X. Li, “Chinese Keyword Extraction Based on Max-Duplicated Strings of the Documents,” In Proceedings of 25th ACM SIGIR Conference, 2002, pp. 439-440. Voorhees, E., “The TREC-8 Question Answering Track Evaluation,” In Proceedings of the 8th

(17)

Appendix A

Transliteration

name Answer keys list Transliteration name Answer keys list

科羅拉多克羅拉多柯羅拉多科羅拉多喬治克隆尼喬治克隆尼喬治克龍尼喬治柯隆尼加利福尼亞加里福尼亞加利福尼亞加利弗_尼亞丹佐華盛頓丹佐華聖頓丹佐華盛頓喬治亞喬治亞喬治雅湯姆克魯斯湯姆克魯斯密西根密希根密西根強尼戴普強尼戴普阿拉斯加阿拉斯加湯姆漢克斯湯姆漢克斯北卡羅萊納北卡羅萊納北卡羅萊那 _{北卡羅來納北卡羅來那} *芮妮齊薇格芮妮齊薇格芮尼齊維格康乃狄克康乃迪克康乃狄克康迺迪克莎莉賽隆莎莉賽隆莎莉塞隆沙莉賽隆德拉瓦德拉瓦妮可基嫚妮可基嫚尼可基曼妮可基曼佛羅里達佛羅里達佛羅理達茱莉安摩爾朱利安摩爾茱莉安摩爾朱莉安摩爾南卡羅萊納南卡羅萊納南卡羅萊那 _{南卡羅來納南卡羅來那} 茱莉亞羅勃茲茱莉亞羅勃茲朱利亞羅伯茲朱利亞羅勃茲朱莉亞羅伯茲茱莉亞羅伯茲 *夏威夷夏威夷夏威宜威爾史密斯威爾史密斯愛荷華艾荷華愛何華愛荷華維果莫天森維果莫天森維果摩天森維果墨天森愛達荷艾達荷愛達荷麥特戴蒙麥特戴蒙伊利諾伊利諾伊立諾依利諾休傑克曼休傑克曼休杰克曼 *印地安那印地安那印地安納印弟安納托貝馬奎爾托貝馬奎爾堪薩斯坎薩斯堪薩斯鄔瑪舒嫚烏瑪舒曼鄔瑪舒曼肯塔基肯塔基琪拉奈特莉琪拉奈特莉奇拉奈特莉路易斯安那路易斯安納凱特貝琴薩凱特貝琴薩麻薩諸塞麻薩諸塞馬薩諸塞麻塞諸塞荷莉貝瑞荷莉貝瑞荷利貝瑞緬因緬因安潔莉娜裘莉安潔莉娜裘莉馬里蘭馬里蘭馬利蘭柴克巴爾夫柴克巴爾夫亞利桑那亞利桑納亞歷桑納亞利桑那 _亞歷桑那布萊德彼特布萊德比特布萊德彼特 _{布萊得比特布萊得彼特} 明尼蘇達明尼蘇達明尼蘇答金凱瑞金凱瑞密蘇里米蘇里密蘇里柯林法洛柯林法洛科林法洛密西西比密西西比裘德洛裘德羅裘德洛蒙大拿蒙大納蒙大那蒙大拿娜塔莉波曼娜塔利波曼娜塔莉波曼內布拉斯加內布拉斯加凱特溫絲蕾凱特溫斯雷凱特溫斯蕾 _{凱特溫絲雷凱特溫絲蕾} 阿拉巴馬阿拉巴馬 *珍妮佛嘉納珍妮佛嘉納北達科他北達科他北達科塔 *茱兒芭莉摩茱兒芭莉摩新罕布夏新漢布夏新罕布夏姚明姚明紐澤西紐澤西俠客歐尼爾俠克歐尼爾俠客歐尼爾 *新墨西哥新墨西哥凱文賈奈特凱文加奈特凱文賈奈特內華達內華達 *崔西麥格瑞_迪崔西麥格瑞迪崔西麥葛瑞迪紐約紐約柯比布萊恩柯比布萊恩科比布萊恩俄亥俄俄亥俄文斯卡特文斯卡特文思卡特奧克拉荷馬奧克拉荷馬奧克拉荷瑪奧克拉_河馬提姆鄧肯提姆鄧肯奧勒奧勒岡葛蘭特希爾格蘭特希爾葛蘭特希爾賓汐法尼亞賓希法尼亞賓西法尼亞賓夕凡勒布朗詹姆斯勒布朗詹姆斯勒布郎詹姆斯

(18)

200 Ming-Shun Lin et al. 尼亞 *羅德島羅德島艾倫艾弗森艾倫艾弗森艾倫埃弗森艾倫艾佛森阿肯色阿肯色阿肯瑟阿肯塞 *小歐尼爾小歐尼爾南達科他南達科塔南達科他南達柯塔拉希德華萊士拉希德華萊士拉西德華萊士田納西田納西田那西普林斯普林斯德克薩斯德克薩斯得克薩斯賈米森賈米森 *猶他猶他比盧普斯比魯普斯比盧普斯佛蒙特佛蒙特斯托賈科維奇斯托賈克維奇斯托賈科維奇斯托賈可_維奇維吉尼亞維基尼亞維吉尼亞維吉尼雅德克諾維茨基德克諾維茨基德克諾威茨基德克諾維_斯基華盛頓華聖頓華盛頓華勝頓班華萊士班華萊士班華勒斯西維吉尼亞西維基尼亞西維吉尼亞西維吉_尼雅卡梅隆安東尼卡梅隆安東尼卡麥隆安東尼威斯康辛威斯康辛威斯康新斯塔德邁爾斯塔德麥爾斯塔德邁爾斯塔達邁爾懷俄明懷俄明基里連科基里連科

* “印、島、兒、猶、小、芮” characters are not in training set and “尼、妮”, ” 辛、新” and “奇、琪” differentia of frequency is too high.