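Report on Attending an International Academic Conference: Integrated Application of Road Information and Traffic Control for Intelligent Transportation Systems, Subproject 1: Integrated Text-to-Speech and Language Identification for Mandarin, Min-nan, and Hakka in Traffic Reports (III)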

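Project number: NSC 95-2218-E-011-006

Project title: Integrated Application of Road Information and Traffic Control for Intelligent Transportation Systems, Subproject 1: Integrated Text-to-Speech and Language Identification for Mandarin, Min-nan, and Hakka in Traffic Reports (3/3)

Traveler: Hung-Yan Gu (古鴻炎)

Affiliation and position: Associate Professor, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology

Conference dates and location: 2006/12/13 ~ 2006/12/16, Singapore

Conference name: The 5th International Symposium on Chinese Spoken Language Processing

Paper presented: An Initial System for Integrated Synthesis of Mandarin, Min-nan, and Hakka Speech

I. Conference Attendance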

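Because the first day of the conference was devoted to tutorial sessions, and because I had my own courses to teach, I decided to fly to Singapore early on the morning of the second day. I arrived in Singapore at about 12:25, which was already very close to 13:30, the time slot in which my paper was scheduled to be presented, so I hurried to take a car to Singapore University (新加坡大學). After reaching the university, I spent more time taking the campus shuttle and looking for the building where the conference was held, so it was already 14:00 when I arrived at the venue. I was late, but at least I made it. I had been somewhat careless precisely because I had been there before. Still, it was inappropriate for the organizers to hold the meeting in a hard-to-find building on a hillside rather than in a building on level ground.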

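This conference, the 5th International Symposium on Chinese Spoken Language Processing, was the second to be held in Singapore. It brought together many experts and scholars working on Chinese speech processing, so I met a number of senior colleagues and friends from my own country who work on speech processing, as well as several internationally known researchers. The conference received 183 submissions from 18 countries and regions, of which 149 were accepted. My paper was assigned to the P1 session (Speech Analysis, Enhancement, Coding and Synthesis). In addition, I attended several other sessions, such as the P2 session (Topics in Spoken Language Processing), the L5 session (Speech Synthesis), and the SPE3 session (Multi-Lingual Corpus Development).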

II. Reflections on the Conference

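My current research focuses on speech synthesis, so at this conference I mostly attended sessions related to speech synthesis. What impressed me most was the direction proposed by K. Tokuda of using HMMs for flexible speech synthesis; researchers at Microsoft in China are already following this direction to synthesize Mandarin speech. In speech synthesis research, many prosody models have been proposed that can improve the naturalness of synthetic speech, but there has been little related work on fluency. I think the HMM-based approach to speech synthesis can be used to address naturalness and fluency at the same time.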

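Besides Mandarin, the languages used by Chinese or ethnic Chinese populations include Min-nan (Holo), Hakka, and other Sinitic languages, which also have many speakers in Taiwan and China. Yet papers on speech processing for Min-nan and Hakka are comparatively few, which indicates that Sinitic languages other than Mandarin have not received attention in proportion to their share of speakers. The paper I presented this time reflects on and studies, for multiple languages (Mandarin, Min-nan, and Hakka), integrated speech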

An Initial System for Integrated Synthesis of Mandarin, Min-nan, and Hakka Speech

Hung-Yan Gu¹, Yan-Zuo Zhou¹, and Huang-Liang Liau¹

¹Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan

{guhy, M9315058, M9215001}@mail.ntust.edu.tw

Abstract. In this study, an initial integrated speech synthesis system is built to synthesize Mandarin, Min-nan, and Hakka speech. By integration, we mean that a single model trained with Min-nan sentences is used to generate pitch-contours for all three languages, the same rules are used to generate syllable duration and amplitude values, and the same program module, which implements the TIPW method, is used to synthesize the speech waveforms of the three languages. In addition, each syllable of a language has just one recorded signal waveform, i.e., there is no opportunity for unit selection. Even under such restrictions, the synthetic speech signals still have a noticeable level of naturalness and signal clarity.

Keywords: cross-lingual speech synthesis, pitch contour, TIPW.

1 Introduction

In Taiwan, many languages are spoken, including Mandarin, Min-nan, Hakka, and others with smaller populations of speakers. Mandarin has been studied more extensively than the other languages because it is the official language. However, the successful construction of a synthesis model or system for Mandarin does not imply that the same modeling method can be directly applied to another language. Speech synthesis systems for the other languages are strongly desired because Mandarin is not the mother tongue of most people in Taiwan. In addition, some of the less widely spoken languages are at risk of disappearing.

If previously developed systems or previously collected speech data can only be applied to Mandarin, then further resources (effort and money) are still needed to study other languages. This situation becomes more severe if a corpus-based approach [1], [2] is adopted. In addition, prosody and timbre will be inconsistent among speech synthesis systems developed independently for different languages. Therefore, a better approach is to construct a more general system that can synthesize not only Mandarin but also Min-nan and Hakka speech. Such an approach, if successfully realized, not only saves resources but also yields much higher consistency among the speech synthesized for different languages. Another advantage is that an improvement made to one system component immediately benefits all the supported languages.

Mandarin, Min-nan, and Hakka are all syllable-prominent languages and are all tone languages. The number of different syllables (not distinguishing lexical tones) is 405 for Mandarin, 833 for Min-nan, and 783 for Hakka [3]. The numbers of different lexical tones are 5, 7, and 7 for Mandarin, Min-nan, and Hakka, respectively. Since the numbers 405, 833, and 783 are not large, the syllable is commonly chosen as the speech unit for synthesis processing. In our system, each syllable (tone not distinguished) of a language has only one recorded utterance. That is, no extra units are available for unit selection, and each syllable’s waveform must be manipulated to synthesize speech signals with the required prosodic characteristics.

In general, a speech synthesis system can be divided into three subsystems: (a) text analysis, (b) prosodic parameter generation, and (c) speech waveform synthesis [4], [5]. Our integrated system is divided in the same way. Its main processing flow is shown in Fig. 1. The first two processing blocks are for text analysis, the middle two blocks are for prosodic parameter generation, and the last two blocks are for signal waveform synthesis. Our system is an integrated system rather than a bundle of three independent systems for the three languages, because the program modules in the three subsystems are all shared in synthesizing the speech of the three languages. For example, the model for pitch-contour parameter generation is shared with (or adapted to) Mandarin and Hakka, although it is originally trained with Min-nan training sentences. The reasons why the program modules can be shared are explained in the following sections.

In the text analysis subsystem, the first block in Fig. 1, “Text Analysis A”, parses the inputted text to recognize tags and slices the text string into a sequence of Chinese-character or alphanumeric-syllable tokens. Then, in the second block (“Text Analysis B”), each Chinese-character token is looked up in a pronunciation dictionary to determine the pronunciation syllables of its characters. In the prosodic parameter generation subsystem, the pitch-contour parameters of a syllable are determined by a mixed model of ANN [6], [7] and SPC-HMM [8] in the third block of Fig. 1. The amplitude and duration parameters of a syllable are determined with a rule-based method [9], [10] in the fourth block of Fig. 1. In the signal waveform synthesis subsystem, a piece-wise linear time-warping function is first constructed in the fifth block of Fig. 1. Then, the TIPW method (time-proportioned interpolation of pitch waveform, an improved variant of PSOLA) [11] is used in the sixth block of Fig. 1 to synthesize speech signal waveforms. To show that the integrated system has noticeable performance in naturalness and signal clarity, we have set up a web page that provides example synthetic speech for the three languages (http://guhy.csie.ntust.edu.tw/hmtts/).

2 Text Analysis

In Min-nan and Hakka, there are still many spoken words whose corresponding ideographic characters are not known. Therefore, we decided that, in the inputted text, Chinese characters may be interleaved with syllables spelled in alphanumeric symbols. For example, a Min-nan word may be written as “cit4 tou5” followed by a Chinese character, so that its first two syllables are spelled in alphanumeric symbols. This decision implies that the inputted text must first be parsed into a sequence of Chinese-character and alphanumeric-syllable tokens. For example, a Chinese character followed by “mai3 ki3” is parsed into three tokens: the character, “mai3”, and “ki3”. The digit at the end of a syllable indicates the lexical tone of the syllable. This parsing is executed in the first block of Fig. 1.
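The token parsing described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `tokenize`, the use of the CJK Unified Ideographs range to detect Chinese characters, and the assumption that every spelled syllable is a run of letters ending in one tone digit are all choices made here for illustration.

```python
import re

# Assumptions (not from the paper): a Chinese character is any code point in
# the CJK Unified Ideographs block, and a spelled syllable is one or more
# Latin letters followed by exactly one tone digit.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+[0-9]")

def tokenize(text):
    """Split text into Chinese-character and alphanumeric-syllable tokens."""
    return TOKEN_RE.findall(text)

# The character below is an arbitrary placeholder, not a word from the paper.
print(tokenize("我mai3 ki3"))
```

Spaces and any other unmatched characters are simply skipped by this sketch; a real parser would also have to recognize the tags before tokenizing.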

In addition, we have defined several kinds of tags to carry necessary information. For example, the tag “@>2” may be placed between two sentences to indicate that the sentences after the tag will be synthesized as Hakka speech until another language-selection tag is encountered. Such a language-selection tag is needed because we intend that the sentences of an article may be synthesized alternately in different languages. Another kind of tag is “@>dxxx”. This tag may also be placed between two sentences, to change the speaking rate of the sentences after it. The part “xxx” in the tag represents three decimal digits that specify the average duration, in milliseconds, to which a syllable will be synthesized. Several other kinds of tags are also defined; the details are listed in Table 1. The parsing of such tags is also executed in the first block of Fig. 1.
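As a rough illustration of how the “@>” tags of Table 1 might be handled, the following sketch splits a text into segments and records the settings in force for each one. The function name `segment_with_settings`, the settings keys, and the default language are assumptions made here; the paper does not describe its parser at this level of detail.

```python
import re

LANGUAGES = {"0": "Min-nan", "1": "Mandarin", "2": "Hakka"}
SETTING_KEYS = {"d": "duration_ms", "t": "tone_hz", "v": "tract_percent"}

# "@>" followed by either a letter (d, t, v) plus three digits, or a language digit.
TAG_RE = re.compile(r"@>(?:([dtv])(\d{3})|([012]))")

def segment_with_settings(text, default_language="Mandarin"):
    """Return a list of (segment_text, settings) pairs.

    The default language is a hypothetical choice, not taken from the paper.
    """
    settings = {"language": default_language, "duration_ms": None,
                "tone_hz": None, "tract_percent": None}
    segments = []
    pos = 0
    for match in TAG_RE.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((chunk, dict(settings)))
        kind, value, language = match.groups()
        if language is not None:
            settings["language"] = LANGUAGES[language]
        else:
            settings[SETTING_KEYS[kind]] = int(value)
        pos = match.end()
    chunk = text[pos:].strip()
    if chunk:
        segments.append((chunk, dict(settings)))
    return segments

for segment, active in segment_with_settings("A. @>2 B. @>d250 C."):
    print(segment, active["language"], active["duration_ms"])
```

Each tag stays in force until a later tag of the same kind overrides it, which matches the behavior described for the language-selection tag.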

[Figure: processing flow. text input → text analysis A → text analysis B → syllable pitch-contour generation (ANN, HMM) → amplitude and duration value generation (rule-based) → time-warping mapping (phone boundary marks) → signal waveform synthesis (TIPW) → speech signal. Resources used: pronunciation dictionary; ANN and HMM model parameters; speech units (waveform, pitch peaks, phone boundaries).]

Fig. 1. Main processing flow.

Table 1. Defined tags and their meanings.

| Tag symbol | Explanation |
|---|---|
| @>x | Language selection; x may be 0, 1, or 2 (0: Min-nan, 1: Mandarin, 2: Hakka). |
| @>dxxx | Speaking rate; average syllable duration of xxx milliseconds. |
| @>txxx | Average tone height, in xxx Hz. |
| @>vxxx | Vocal tract extended (or shrunk) to xxx percent of its original length. |
| <, > | Word-constructing tags, e.g., <cit4 tou5> |
| * | Breath-break tag. |

After an inputted sentence is parsed into a sequence of tokens, the pronunciation syllables for each Chinese-character token are determined in the second block of Fig. 1. According to the language-selection tag, the corresponding pronunciation dictionaries are looked up to check whether a prefix of a token can be found in them. The dictionary consisting of longer words is tried before the dictionary consisting of shorter words. Currently, we have collected 55,000, 12,000, and 19,000 multi-syllabic words for Mandarin, Min-nan, and Hakka, respectively. Note that inputted text is usually composed in Mandarin written words. Therefore, dictionary lookup also plays the role of word translation. For example, the Mandarin word for “today” is translated to a Min-nan word that is pronounced “gin1a2 rit8”. As another example, the Mandarin word for “chopstick” is translated to a word that is pronounced “di7”. These examples also show that the words obtained after translation may be longer or shorter than the original words.
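Trying longer dictionary words before shorter ones amounts to a greedy longest-prefix match. The sketch below shows that idea only; the function name `lookup_words`, the single combined dictionary, the `max_len` limit, and the placeholder entries (Latin letters standing in for Chinese characters, with made-up syllables) are all hypothetical.

```python
def lookup_words(chars, dictionary, max_len=4):
    """Greedy longest-prefix match of a character string against a dictionary.

    Returns a list of (word, syllables) pairs; syllables is None when a
    character is not covered by any dictionary entry.
    """
    result = []
    i = 0
    while i < len(chars):
        # Try the longest candidate word first, then shorter ones.
        for n in range(min(max_len, len(chars) - i), 0, -1):
            word = chars[i:i + n]
            if word in dictionary:
                result.append((word, dictionary[word]))
                i += n
                break
        else:
            result.append((chars[i], None))  # no entry found for this character
            i += 1
    return result

# Placeholder dictionary: keys and syllables are invented for illustration.
toy_dictionary = {"AB": ["x1", "y2"], "ABC": ["p1", "q2", "r3"], "D": ["d4"]}
print(lookup_words("ABCD", toy_dictionary))
```

Because the value stored for a word is a syllable list of any length, the same lookup can return more or fewer syllables than the number of input characters, which is how a translated word can come out longer or shorter.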

After a word is found in a dictionary, or a segment of syllables bounded by the tags “<” and “>” is parsed out, we know the boundaries of the word and its constituent syllables. Then, the tone-sandhi rules of the currently selected language can be applied to the syllables of the word. This is executed in the second block of Fig. 1.

Note that different languages have very different tone-sandhi rules. For example, in Mandarin, if two adjacent syllables are both of the third tone, then the former one must have its tone changed to the second tone. As another example, consider the tone-sandhi rule in Min-nan that every syllable of a word except the word-final one must have its tone changed to its inflected tone.
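The two tone-sandhi rules mentioned above can be sketched as follows, operating on the list of tone numbers of one word. This is only an illustration: the function names are invented here, the handling of runs of three or more third tones is a simplification, and the inflected-tone table must be supplied by the caller because the paper does not list it (the table in the usage example is a placeholder, not the real Min-nan mapping).

```python
def mandarin_third_tone_sandhi(tones):
    """Change a third tone to the second tone when the next syllable is also third tone.

    The decision for each syllable is based on the original (unchanged) tones.
    """
    changed = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            changed[i] = 2
    return changed

def minnan_word_sandhi(tones, inflected):
    """Replace every tone except the word-final one by its inflected tone.

    `inflected` maps a lexical tone number to its inflected tone number.
    """
    if not tones:
        return []
    return [inflected[t] for t in tones[:-1]] + [tones[-1]]

print(mandarin_third_tone_sandhi([3, 3]))

# Placeholder mapping for illustration only.
placeholder_inflected = {1: 7, 7: 3}
print(minnan_word_sandhi([1, 7, 1], placeholder_inflected))
```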

3 Prosodic Parameter Generation

The prosodic parameters of a syllable include pitch-contour, duration, amplitude, and leading pause. The generation of prosodic parameter values plays a very important role because it determines the naturalness of the synthesized speech. Therefore, much effort has been devoted to investigating models (or methods) for generating prosodic parameter values [7], [8], [12], [13], [14].

Among these prosodic parameters, pitch-contour is the most important one for naturalness. Therefore, we have spent much effort investigating different kinds of models: HMM [8], ANN, and a mixed model of both [15]. In the third block of Fig. 1, a mixed model of HMM and ANN is used to generate pitch-contours. Here, model mixing means taking a weighted sum of the two pitch-contours generated by the HMM and the ANN, respectively. Note that in this study the pitch-contour models, HMM and ANN, are both trained with Min-nan spoken sentences. Then, through tone mapping, we adapt the Min-nan-trained mixed model to generate pitch-contours for Hakka and Mandarin. By sharing the pitch-contour model in this way, the effort of training pitch-contour models for the other languages is saved.
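The model mixing described above, a weighted sum of two pitch-contours, can be written in a few lines. The function name `mix_pitch_contours` and the default weight of 0.5 are assumptions made for this sketch; the paper does not state the weights it uses.

```python
def mix_pitch_contours(hmm_contour, ann_contour, hmm_weight=0.5):
    """Return the point-wise weighted sum of two equal-length pitch-contours.

    hmm_weight is the weight of the HMM contour; the ANN contour gets
    (1 - hmm_weight). The default value is an arbitrary assumption.
    """
    if len(hmm_contour) != len(ann_contour):
        raise ValueError("pitch-contours must have the same number of points")
    ann_weight = 1.0 - hmm_weight
    return [hmm_weight * h + ann_weight * a
            for h, a in zip(hmm_contour, ann_contour)]

print(mix_pitch_contours([1.0, 2.0], [3.0, 4.0]))
```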

In contrast to pitch-contour, duration and amplitude are thought to be minor factors for naturalness. Hence, only a rule-based method is used in the fourth block of Fig. 1 to generate their values.

3.1 Syllable Pitch Contour HMM

A syllable at the beginning of a sentence is usually uttered with a higher pitch than one at the end, i.e., the phenomenon of declination. To account for this phenomenon, we imagine that there are three prosodic states corresponding to sentence-initial, sentence-middle, and sentence-final positions. However, we do not know how to assign the syllables of a sentence to these states explicitly. Therefore, we treat these prosodic states as hidden and simulate them with the hidden states of a left-to-right hidden Markov model. Besides the prosodic states, the lexical tones of a syllable and of its adjacent syllables also have strong influences. Therefore, we take these lexical tones into account, and call such an HMM a syllable pitch-contour HMM (SPC-HMM).

The height and shape of a syllable’s pitch-contour are mainly influenced by the tone classes of the syllable and of its left and right adjacent syllables. Therefore, we combine the t-th syllable’s lexical tone and pitch-contour VQ (vector quantization) code with the lexical tones of its left and right adjacent syllables to define the t-th observation symbol, O_t, by Equation (1), where X_t is the lexical-tone number of the t-th syllable and V_t is the pitch-contour VQ code of the t-th syllable in a training sentence.

Before VQ encoding, the pitch-contour of each syllable from the training sentences is first normalized in time and then normalized in pitch height [8]. Time normalization means placing 16 measuring points equally spaced in time, so that a pitch-contour is represented as a vector of 16 frequency values (in log Hz scale), called a frequency vector. After time normalization, these frequency vectors must be normalized in pitch height to eliminate the influence of the speaker’s emotion at recording time. In total, we recorded 643 Min-nan training sentences comprising 3,696 syllables.
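The time normalization step can be sketched as a resampling of a syllable's pitch track to 16 equally spaced points in log Hz. The function name `time_normalize`, the use of linear interpolation, the natural logarithm, and the inclusion of both end points are assumptions of this sketch; pitch-height normalization is not shown because its method is only given by reference to [8].

```python
import math

def time_normalize(f0_hz, num_points=16):
    """Resample a voiced pitch track (Hz) to num_points log-Hz values.

    The measuring points are equally spaced in time between the first and
    last frames; values between frames are linearly interpolated (assumption).
    """
    if len(f0_hz) < 2:
        raise ValueError("need at least two pitch values")
    log_f0 = [math.log(f) for f in f0_hz]
    last = len(log_f0) - 1
    vector = []
    for k in range(num_points):
        pos = k * last / (num_points - 1)   # fractional frame position
        i = min(int(pos), last - 1)         # index of the left neighbour frame
        frac = pos - i
        vector.append((1.0 - frac) * log_f0[i] + frac * log_f0[i + 1])
    return vector

frequency_vector = time_normalize([100.0, 110.0, 125.0, 120.0, 105.0])
print(len(frequency_vector))
```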

Next, consider the generation of pitch-contours by using the SPC-HMM. When a sentence is inputted, it is first analyzed by the text-analysis components, so its pronunciation-syllable sequence is available. Then, we can partially encode the lexical tones of every three adjacent syllables (without the VQ code, V_t) according to Equation (1). Since each tone class has 8 codewords in its pitch-contour VQ codebook, each syllable of the sentence has 8 possible encoded observation symbols. Therefore, in the synthesis phase (or testing phase), a third axis, i.e., the index to the 8 candidate observation symbols, must be added to the time and state axes. The conventional two-dimensional (time and state) DP (dynamic programming) algorithm for speech recognition is thus extended to a three-dimensional DP algorithm and used to search for the most probable path [8]. According to the path found, the pitch-contour VQ codes of the syllables can then be decoded.
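A search over time, state, and candidate index can be sketched as a Viterbi-style dynamic program with a three-dimensional score table. This is a simplified illustration, not the algorithm of [8]: the function name `best_vq_path` and the input layout (log initial, transition, and per-candidate observation probabilities) are assumptions, and in this simplified form the candidate index affects only the observation score.

```python
import math

def best_vq_path(log_init, log_trans, log_obs):
    """Find the most probable (state, candidate) sequence.

    log_init[s]      : log probability of starting in state s
    log_trans[r][s]  : log probability of moving from state r to state s
    log_obs[t][s][k] : log probability that state s emits the k-th candidate
                       observation symbol of syllable t
    Returns (states, candidates), one entry per syllable.
    """
    T, S, K = len(log_obs), len(log_init), len(log_obs[0][0])
    neg_inf = -math.inf
    score = [[[neg_inf] * K for _ in range(S)] for _ in range(T)]
    back = [[[None] * K for _ in range(S)] for _ in range(T)]
    for s in range(S):
        for k in range(K):
            score[0][s][k] = log_init[s] + log_obs[0][s][k]
    for t in range(1, T):
        for s in range(S):
            # Best predecessor cell over all previous states and candidates.
            best, arg = neg_inf, None
            for r in range(S):
                for j in range(K):
                    value = score[t - 1][r][j] + log_trans[r][s]
                    if value > best:
                        best, arg = value, (r, j)
            for k in range(K):
                score[t][s][k] = best + log_obs[t][s][k]
                back[t][s][k] = arg
    # Pick the best final cell and trace the path backwards.
    s, k = max(((s, k) for s in range(S) for k in range(K)),
               key=lambda cell: score[T - 1][cell[0]][cell[1]])
    states, candidates = [s], [k]
    for t in range(T - 1, 0, -1):
        s, k = back[t][s][k]
        states.append(s)
        candidates.append(k)
    states.reverse()
    candidates.reverse()
    return states, candidates
```

The returned candidate indices play the role of the decoded pitch-contour VQ codes; a left-to-right model is obtained by setting the log probability of backward transitions to negative infinity.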

3.2 Syllable Pitch Contour ANN

The architecture of the artificial neural network used here is shown in Fig. 2. It is designed as a recurrent ANN in order to keep the prosodic state internally. The input layer of the ANN has 8 ports (28 bits) to receive contextual parameters. The hidden layer and the recurrent hidden layer are both set to 30 nodes according to experimental results. After a syllable’s contextual parameters are inputted and processed, a pitch-contour represented as a 16-dimensional frequency vector is outputted at the output layer.

[Figure: ANN with 8 contextual parameters as input and a 16-dimensional pitch-contour as output.]

Fig. 2. The architecture of the ANN studied here.
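To make the layer sizes concrete, the following sketch runs a forward pass through a small recurrent network with 28 input bits, 30 hidden nodes whose previous activations are fed back, and 16 outputs. It assumes an Elman-style recurrence, tanh hidden units, a linear output layer, and random untrained weights; none of these details are stated in the text, and the class name `RecurrentPitchANN` is hypothetical.

```python
import math
import random

class RecurrentPitchANN:
    """Elman-style recurrent network (assumed structure, untrained weights)."""

    def __init__(self, n_in=28, n_hidden=30, n_out=16, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.w_in = matrix(n_hidden, n_in)       # input -> hidden
        self.w_rec = matrix(n_hidden, n_hidden)  # previous hidden -> hidden
        self.w_out = matrix(n_out, n_hidden)     # hidden -> output
        self.n_hidden = n_hidden
        self.reset()

    def reset(self):
        """Clear the internal state at the start of a sentence."""
        self.context = [0.0] * self.n_hidden

    def step(self, bits):
        """Process one syllable's contextual bits; return a 16-point vector."""
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row_in, bits)) +
                      sum(w * c for w, c in zip(row_rec, self.context)))
            for row_in, row_rec in zip(self.w_in, self.w_rec)
        ]
        self.context = hidden  # kept for the next syllable
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w_out]

net = RecurrentPitchANN()
contour = net.step([1, 0] * 14)  # 28 placeholder input bits
print(len(contour))
```

Because the hidden activations are carried from one syllable to the next, the same input bits can produce different contours at different positions in a sentence, which is the sense in which the network keeps a prosodic state.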

Here, the contextual parameters, i.e., the inputs to the ANN, are selected to provide essential contextual information while lowering the quantity of required training sentences. In detail, the contextual parameters are listed in Table 2.

Because there are seven lexical tones in total in Min-nan, 3 bits are enough to represent them. The numbers of different syllable initials and finals are respectively
