National Sun Yat-sen University Institutional Repository:Item 987654321/35125


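National Science Council, Executive Yuan: Research Project Final Report

A Study of a Spoken Input System for Chinese Short Messages in Common Noisy Environments

Project type: Individual project
Project number: NSC94-2213-E-110-061-
Project period: August 1, 2005 to July 31, 2006
Institution: Department of Computer Science and Engineering, National Sun Yat-sen University
Principal investigator: Chia-Ping Chen (陳嘉平)
Project participants: 王聖富, 曾俊翰 (Chun-Han Tseng), 陳泰宏 (Tai-Hung Chen)
Report type: Condensed report
Availability: This project is open to public access

October 23, 2006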


Abstract

This project has three main research results. The first concerns Chinese input methods: we evaluate the feasibility of simplified input methods on the basis of perplexity. The second is the automatic learning of Chinese grammar rules. Because we adopt the minimum description length principle, the learned grammar also compresses the text corpus. Both studies use the Academia Sinica Balanced Corpus (ASBC). In addition, for automatic speech recognition, we build a system based on hidden Markov models trained on the TCC300 corpus. It serves as the front end of a recognition system for spoken transliterated English names, whose back end uses the World Wide Web (WWW) and search engines to improve the recognition results.

Keywords: Chinese input methods, automatic grammar learning, context-free grammar, automatic speech recognition, search engine

Introduction

We study the problem of simplifying Chinese input methods to make them suitable for mobile devices [1]. To assess the feasibility of aggressively reducing the number of keystrokes per Chinese character, we compare three input modes: character-based, syllable-based, and first-symbol-based. Specifically, we use these linguistic units as token types and compare the resulting perplexities. With a language model trained on data from the ASBC corpus, the perplexity of a data set we collected from online chat and instant messages is 102.6 for the character-based model, 67.7 for the syllable-based model, and 16.3 for the first-symbol-based model. Arguing from the relation between the perplexity of a language model [3] and its number of "typical" sentences [2], we conclude that, on average, there are 6 to 7 candidate characters per first symbol in natural Chinese text.
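The figure of 6 to 7 characters per first symbol can be checked with a short calculation. The sketch below is one plausible reading of the argument, not necessarily the exact derivation in [1]: it assumes that a model with perplexity PP has roughly PP**n typical sequences of length n, so the number of character sequences per first-symbol sequence is approximately the ratio of the two perplexities raised to the power n. The function and variable names are illustrative only.

```python
# Test-set perplexities reported in the text for the three token types.
perplexity = {
    "character": 102.6,
    "syllable": 67.7,
    "first_symbol": 16.3,
}


def candidates_per_token(fine, coarse):
    """Approximate number of `fine` tokens per `coarse` token.

    Assumes a sequence of n tokens has about PP**n typical sequences,
    so the per-token ratio of typical sequences is PP_fine / PP_coarse.
    """
    return perplexity[fine] / perplexity[coarse]


def candidates_per_sequence(fine, coarse, length):
    """Approximate number of `fine` sequences per `coarse` sequence."""
    return candidates_per_token(fine, coarse) ** length


# Roughly 6.3 characters per first symbol, inside the 6-to-7 range.
chars_per_first_symbol = candidates_per_token("character", "first_symbol")
# Roughly 1.5 characters per syllable.
chars_per_syllable = candidates_per_token("character", "syllable")

print(f"characters per first symbol: {chars_per_first_symbol:.2f}")
print(f"characters per syllable:     {chars_per_syllable:.2f}")
print(f"sentences per 10-symbol input: "
      f"{candidates_per_sequence('character', 'first_symbol', 10):.3g}")
```

Under this reading, the candidate set grows geometrically with sentence length, which is why the first-symbol mode needs an efficient search while the syllable mode does not.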

We also study the problem of learning a context-free grammar from a corpus [4]. We investigate a technique based on the notion of the minimum description length of the corpus. The cost of a grammar is defined as the sum of the number of bits required to represent the grammar and the number of bits required to derive the corpus using that grammar. On the Academia Sinica Balanced Corpus (ASBC) [5] with part-of-speech tags, the overall cost, or description length, is reduced by as much as 14% relative to the initial cost. In addition to the experimental results, we give a novel analysis of the costs of two special context-free grammars: one derives only the set of strings in the corpus, and the other derives the set of arbitrary strings over the alphabet.

In joint work with National Taiwan University, we build a system for spoken transliteration name recognition [6]. The challenges in such a system are the uncertainty in how a name is transliterated and the error-prone recognition of spoken queries. To reduce recognition errors, we incorporate multiple hypotheses for a given query. To deal with the uncertainty in name transliteration, we use the World Wide Web as a live corpus.

Specifically, an HTK-based ASR system [7] outputs a syllable lattice for each spoken transliteration name. This lattice is post-processed under the minimum word-error-rate criterion to become a syllable grid [8][9]. Syllable grids are mapped to character grids, which are used to form queries to a search engine. The web page summaries returned by the search engine are passed to a pattern extraction module based on the PAT-tree [10], which outputs transliteration candidates. The hypothesis with the highest combined score is the recognition result. The experimental results show that long transliteration names are more robust than short ones, and that wrong characters at the beginning are easier to correct than [...]

Research Objectives

The goal of this research is to investigate simplified Chinese input methods for use on mobile devices, which must cope with the limited size and computing power of such devices. With more powerful handsets and faster data communication, mobile electronic devices appear to be the converging point for new information technologies and may come to replace their immobile counterparts. For that to happen, however, the user interfaces on these devices need significant overhauls. Take the instant message (IM) service as an example. Once confined to desktops and laptops, it has been running on mobile phones since the advent of 3G wireless networks. To input a text message, users have only a keypad that is limited in size and in the number of distinct keys. Since the set of potential texts is large, this constraint poses a severe challenge for a convenient and healthy interface.

From the perspective of source coding, we can view the Chinese input problem as representing each Chinese sentence (source) by a codeword of input symbols. Ideally, a source code has a high probability of being decodable and a low expected code length. Here, in addition, we require that the number of code symbols (the size of the alphabet set) should be as low as possible.

The scenario of our input scheme is as follows. When a user wants to input a sentence, he inputs the sequence of first Mandarin phonetic symbols of the characters in the sentence. Given the input sequence, the system outputs the most likely candidate sentences for the user to choose from.

Whether this approach is feasible depends on the entropy of the text (the source) and the entropy of the symbol sequence. It is certainly feasible if these entropies are similar in magnitude. Otherwise, there will be many sentences (exponentially many in the input size) for a given input symbol sequence. In that case, the system must be able to search efficiently for potential sentences and list the top candidates in order of probability for the user to choose from.
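To make the scenario concrete, the following sketch decodes a sequence of first phonetic symbols into ranked candidate sentences with a beam search over a bigram model. It is an illustration only: the candidate table, the bigram probabilities, the `FLOOR` value, and the function names are all hypothetical toy values, and beam search is swapped in here as one simple way to prune the search space, not the method used in the project.

```python
import math

# Hypothetical toy data: candidate characters for each first Mandarin
# phonetic symbol, and made-up bigram probabilities P(current | previous).
CANDIDATES = {
    "ㄋ": ["你", "那"],
    "ㄏ": ["好", "很"],
    "ㄇ": ["嗎", "忙"],
}
BIGRAM = {
    ("<s>", "你"): 0.6, ("<s>", "那"): 0.4,
    ("你", "好"): 0.7, ("你", "很"): 0.3,
    ("那", "好"): 0.5, ("那", "很"): 0.5,
    ("好", "嗎"): 0.8, ("好", "忙"): 0.2,
    ("很", "嗎"): 0.1, ("很", "忙"): 0.9,
}
FLOOR = 1e-6  # probability assumed for bigrams missing from the toy table


def log_p(prev, cur):
    """Log probability of `cur` following `prev` under the toy bigram."""
    return math.log(BIGRAM.get((prev, cur), FLOOR))


def decode(symbols, beam_width=8):
    """Return (sentence, log_prob) pairs, most probable first.

    At each input symbol every partial sentence is extended by every
    candidate character, and only the `beam_width` best are kept.
    A narrow beam is fast but may drop good candidates.
    """
    beams = [(0.0, "<s>", "")]  # (log prob, last character, sentence)
    for symbol in symbols:
        expanded = []
        for score, prev, sentence in beams:
            for char in CANDIDATES.get(symbol, []):
                expanded.append(
                    (score + log_p(prev, char), char, sentence + char)
                )
        expanded.sort(key=lambda beam: beam[0], reverse=True)
        beams = expanded[:beam_width]
    return [(sentence, score) for score, _, sentence in beams]


for sentence, score in decode(["ㄋ", "ㄏ", "ㄇ"])[:3]:
    print(sentence, round(math.exp(score), 3))
```

With two candidates per symbol, a three-symbol input already has eight possible sentences; the beam keeps the work per symbol bounded as the input grows.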

We also study the problem of learning a context-free grammar (CFG) from a corpus of part-of-speech tags. The CFG framework, although not expressive enough to cover all human languages, is a good enough approximation for many purposes. For a natural language, a decent CFG can derive most sentences in the language. Put differently, a sentence can be parsed with high probability by a parser based on the CFG. A set of grammatical rules may be used in place of n-gram language models; such a set may be small enough to be stored on a mobile device.

Literature Review

Common Chinese input methods fall into five groups: Pinyin, Pinzi, Complex, Handwriting, and Number. Pinyin methods represent a character by its Mandarin phonetic symbols; examples include the Syllable, Microsoft New Syllable, and Natural input methods. Pinzi methods compose a character from its parts, as in the Chang-Jie and Da-Yi methods. Complex methods combine the form, phoneme, and morpheme of a character, as in the Liu input method. Handwriting methods rely on character recognition. In Number methods, strokes are represented by numbers, and the user enters the stroke sequence of a character as a sequence of numbers.

Several works aim to improve the accuracy and efficiency of Pinyin input. In [11], a statistical approach combining a trigram language model and a segmentation model is proposed to improve conversion accuracy. In [12], a language model based on compression by partial match is reported to outperform modified Kneser-Ney smoothing. In [13], a scalar-quantized compact bigram is used on mobile phones to reduce the computational resources required.

Context-free grammars have been applied in many studies. In [14], an automatic speech recognition system uses a dynamic programming algorithm to recognize and parse spoken word strings of a context-free grammar in Chomsky normal form. CFGs are also used in software engineering. In [15], the components of source code that need to be renovated are recognized, and new code segments are generated from context-free grammars. In addition, since parsing outputs larger and less ambiguous meaning-bearing structures in a sentence, the design and implementation of a CFG can be crucial to the success of high-level natural language processing tasks such as question answering [16] and interactive voice response [17] systems. A detailed account of context-free grammars and other formal languages is given in [18].

Research Methods

We use bigram language models for the study of Chinese input methods. To estimate the parameters of the bigram language model, we use a maximum-likelihood-based estimator modified by smoothing and backing off. The maximum-likelihood estimate (MLE) is simply the relative frequency in the training set. To cope with bigrams unseen in the training set, we use the add-one smoothing scheme. On top of smoothing, we also incorporate a backoff scheme into the bigram language model.
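As a minimal sketch of the estimators named above, the following class computes both the maximum-likelihood estimate and the add-one (Laplace) smoothed estimate of a bigram probability. The class name, the sentence boundary tokens `<s>` and `</s>`, and the toy training sentences are assumptions made for illustration. The backoff scheme is left out because the text does not specify how it is combined with smoothing.

```python
from collections import Counter


class AddOneBigram:
    """Bigram model with MLE and add-one smoothed estimates."""

    def __init__(self, sentences):
        self.history = Counter()  # how often a token occurs as a history
        self.bigram = Counter()   # how often (previous, current) occurs
        self.vocab = {"</s>"}     # tokens that can be predicted
        for sentence in sentences:
            tokens = ["<s>"] + list(sentence) + ["</s>"]
            self.vocab.update(sentence)
            for prev, cur in zip(tokens, tokens[1:]):
                self.history[prev] += 1
                self.bigram[(prev, cur)] += 1

    def mle(self, prev, cur):
        """Relative frequency of `cur` after `prev`; 0 if never seen."""
        if self.history[prev] == 0:
            return 0.0
        return self.bigram[(prev, cur)] / self.history[prev]

    def prob(self, prev, cur):
        """Add-one estimate: one pseudo-count for every vocabulary item."""
        return (self.bigram[(prev, cur)] + 1) / (
            self.history[prev] + len(self.vocab)
        )


# Hypothetical toy training set of character sequences.
training = [["我", "很", "好"], ["你", "很", "好"], ["你", "好"]]
model = AddOneBigram(training)

print(model.mle("很", "好"), model.prob("很", "好"))  # seen bigram
print(model.mle("我", "好"), model.prob("我", "好"))  # unseen bigram
```

The unseen bigram has an MLE of zero but a small positive smoothed probability, which keeps the perplexity of test data finite.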

Once we have a language model P, we compute the perplexity of a test set and relate the perplexity to the number of typical sequences in the domain of the test data. The numbers of typical sequences are used to quantify the feasibility of the three input modes: first-symbol-based, syllable-based, and character-based. The data we use are the standard ASBC corpus and a self-collected CHAT text set.
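The perplexity computation can be sketched as follows, assuming a bigram model supplied as a function `cond_prob(prev, cur)`. Counting the end-of-sentence token as a predicted token is a convention chosen for this sketch, and the uniform model and test sentences are hypothetical. A uniform model is used because its perplexity is known in advance: it equals the vocabulary size.

```python
import math


def perplexity(sentences, cond_prob):
    """Perplexity of `sentences` under a bigram model.

    `cond_prob(prev, cur)` returns P(cur | prev). The result is
    2 ** (-average log2 probability per predicted token).
    """
    log_sum = 0.0
    count = 0
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            log_sum += math.log2(cond_prob(prev, cur))
            count += 1
    return 2 ** (-log_sum / count)


def uniform(prev, cur, vocab_size=16):
    """Hypothetical model: every token is equally likely."""
    return 1.0 / vocab_size


test_sentences = [["a", "b", "c"], ["b", "a"]]
pp = perplexity(test_sentences, uniform)
print(pp)  # equals the vocabulary size, 16, for the uniform model

# With perplexity pp, there are roughly pp ** n typical sequences
# of n tokens in the domain of the test data.
print(pp ** 3)
```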

In the automatic learning of a context-free grammar, we use the minimum description length criterion. For a given CFG, we define a cost that covers both specifying the grammatical rules and deriving the sentences in a corpus (ASBC).

We first analyze two special cases: the recursive CFG and the exhaustive CFG. Both cover the corpus in an obvious way, but their costs are quite different. With the cost function, we then investigate the relationship between the cost and the number of rules. We also use the learned grammars to parse the original set of sentences, and we have found some parse trees similar to known grammatical structures.
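The contrast between the two special grammars can be illustrated with a simplified cost function. This is a sketch under stated assumptions, not the cost defined in [4]: every grammar symbol is written with a fixed-length code, each derivation step is encoded as a uniform choice among the rules, the exhaustive CFG is taken to have one rule `S -> sentence` per distinct corpus sentence, and the recursive CFG is taken to have the rules `S -> a S` and `S -> a` for every tag `a`. The toy corpus of part-of-speech tags is hypothetical.

```python
import math


def bits_per_symbol(alphabet_size):
    """Fixed-length code over the tags, the start symbol S, and an
    end-of-rule marker (an assumption of this sketch)."""
    return math.log2(alphabet_size + 2)


def exhaustive_cost(corpus):
    """(grammar bits, derivation bits) for one rule per distinct sentence."""
    alphabet = {tag for sentence in corpus for tag in sentence}
    rules = {tuple(sentence) for sentence in corpus}
    # Each rule is written as S, its right-hand side, and an end marker.
    grammar = sum(len(rhs) + 2 for rhs in rules) * bits_per_symbol(len(alphabet))
    # Each sentence is derived by choosing one of the rules.
    derivation = len(corpus) * math.log2(len(rules))
    return grammar, derivation


def recursive_cost(corpus):
    """(grammar bits, derivation bits) for S -> a S | a over every tag a."""
    alphabet = {tag for sentence in corpus for tag in sentence}
    # "S -> a S" takes 4 symbols with the marker, "S -> a" takes 3.
    grammar = 7 * len(alphabet) * bits_per_symbol(len(alphabet))
    # Each token is derived by choosing one of the 2 * |alphabet| rules.
    tokens = sum(len(sentence) for sentence in corpus)
    derivation = tokens * math.log2(2 * len(alphabet))
    return grammar, derivation


# Hypothetical toy corpus of part-of-speech tag sequences.
corpus = [["N", "V", "N"], ["N", "V"], ["D", "N", "V", "N"], ["N", "V", "N"]]

for name, cost in [("exhaustive", exhaustive_cost), ("recursive", recursive_cost)]:
    grammar_bits, derivation_bits = cost(corpus)
    print(f"{name}: grammar {grammar_bits:.1f} + derivation "
          f"{derivation_bits:.1f} = {grammar_bits + derivation_bits:.1f} bits")
```

Under these assumptions the exhaustive grammar pays for a rule list that grows with the corpus but derives each sentence cheaply, while the recursive grammar has a fixed, small rule list but pays for every token in the derivation. A learned grammar aims for a lower total than either extreme.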

Results and Discussion

The result for the syllable-based mode supports the feasibility of the syllable-based approach. The search space of character sequences for a given syllable sequence is manageable, and fast search can be implemented without significant computational resources. For the first-symbol-based input mode, further research is required because the search space is enormous; it must be structured so that good candidates can be reached efficiently.

In the future, we plan to investigate the conversion of syllable and first-symbol strings into word strings for Chinese input. This will complete a prototype input interface to be deployed on mobile devices.

In learning a CFG automatically from a corpus, the rules proposed by heuristic bigram counting reduce the cost on the ASBC corpus by 14.0% of the initial cost, and parsing with the learned rules yields meaningful structures in some sentences.

Other kinds of CFG rules are not considered in this study. The candidate set of rules should be enlarged for more descriptive power, and we plan to extend the family of grammatical rules to a larger set for further cost reduction. In addition, we will use the Shannon code for the cost reduction.

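Self-Evaluation of Project Results

The research content largely matches the original proposal and meets the expected goals, such as the mobile phone input interface and the training of students. The academic and application value of the results makes them suitable for publication in academic journals (they have already been published at international conferences).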


References

[1] Chun-Han Tseng and Chia-Ping Chen, "Chinese Input Method Based on Reduced Mandarin Phonetic Alphabet", Proceedings of Interspeech 2006, pp. 733-736.

[2] T. M. Cover and J. A. Thomas, "Elements of Information Theory", Wiley, 1991.

[3] S. F. Chen and J. Goodman, "An Empirical Study of Smoothing Techniques for Language Modeling", Technical Report TR-10-98, Center for Research in Computing Technology, Harvard University, Cambridge, Massachusetts, 1998.

[4] Tai-Hung Chen, Chun-Han Tseng and Chia-Ping Chen, "Automatic Learning of Context-Free Grammar", Proceedings of ROCLING 2006.

[5] Chu-Ren Huang and Keh-Jiann Chen, Academia Sinica Balanced Corpus. See also http://www.sinica.edu.tw/SinicaCorpus/98-04.pdf

[6] Ming-Shun Lin, Chia-Ping Chen and Hsin-Hsi Chen, "An Approach of Using the Web as a Live Corpus for Spoken Transliteration Name Access", Proceedings of ROCLING 2005.

[7] S. Young et al., "The HTK Book".

[8] L. Mangu, E. Brill and A. Stolcke, "Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks", Computer Speech and Language 14(4), 2000, pp. 373-400.

[9] A. Stolcke, "SRILM - An Extensible Language Modeling Toolkit", Proceedings of ICSLP 2002, Denver, Colorado.

[10] Lee-Feng Chien, "PAT-Tree-Based Keyword Extraction for Chinese Information Retrieval", ACM SIGIR 1997.

[11] Zheng Chen and Kai-Fu Lee, "A New Statistical Approach to Chinese Pinyin Input", ACL-2000, The 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 3-6 October 2000.

[12] Jin Hu Huang and David Powers, "Adaptive Compression-based Approach for Chinese Pinyin Input", ACL SIGHAN Workshop, pp. 24-27.

[13] Feng Zhang, Zheng Chen, Mingjing Li and Guozhong Dai, "Chinese Pinyin Input Method for Mobile Phone", ISCSLP 2000.

[14] H. Ney, "Dynamic Programming Speech Recognition Using a Context-Free Grammar", Proceedings of ICASSP '87, pp. 69-72.

[15] Mark van den Brand, Alex Sellink and Chris Verhoef, "Generation of components for software renovation factories from context-free grammars", in Conference on Reverse Engineering, IEEE Computer Society, WCRE97, pp. 144-153.

[16] C. Yuan and C. Wang, "Parsing model for answer extraction in Chinese question answering system", Proceedings of IEEE NLP-KE '05, pp. 238-243.

[17] M. Balakrishna, D. Moldovan and E. K. Cave, "Automatic creation and tuning of context free grammars for interactive voice response systems", Proceedings of IEEE NLP-KE '05, pp. 158-163.

[18] J. E. Hopcroft, R. Motwani and J. D. Ullman, "Introduction to Automata Theory, Languages and Computation", Addison-Wesley, 2001.
