
Chapter 5 Segment-based Manner-of-Articulation Recognizer


      

  

(5.2)

Figure 5.1: Schematic of parameter extraction for the segment-based manner-of-articulation recognizer.

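Here we use as parameters the logarithms of the normalized speech signal envelope and of the sub-band signal envelopes obtained in Chapter 2, $\log E_i$, $i = 0, \ldots, 6$. For each segment, four discrete Legendre polynomial coefficients, $\mathbf{a}_i$, $i = 0, \ldots, 6$, are computed from each of these seven log-envelope contours, giving 28 parameters in total.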

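The segment-based manner-of-articulation recognizer takes as input the segment parameters of the segment to be recognized and of one segment on each side, together with the length of each segment, for a total of 87 dimensions (3 × 28 coefficients + 3 lengths).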
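The feature construction described above can be sketched as follows. This is a minimal illustration, not the report's implementation: the function names are hypothetical, the discrete Legendre basis is assumed to be the orthonormal polynomial basis obtained by QR-factorizing a Vandermonde matrix on equally spaced points, and segment lengths are appended unnormalized.

```python
import numpy as np

def discrete_legendre_basis(num_samples, num_coeffs=4):
    """Orthonormal polynomial basis (degrees 0..num_coeffs-1) on equally spaced points."""
    if num_samples < num_coeffs:
        raise ValueError("segment is shorter than the number of coefficients")
    x = np.linspace(-1.0, 1.0, num_samples)
    vander = np.vander(x, num_coeffs, increasing=True)  # columns 1, x, x^2, x^3
    q, r = np.linalg.qr(vander)
    # Flip signs so that every basis polynomial has a positive leading coefficient.
    return q * np.sign(np.diag(r))

def segment_parameters(log_envelopes):
    """Map a (num_samples, 7) array of log-envelope contours to 7 x 4 = 28 coefficients."""
    log_envelopes = np.asarray(log_envelopes, dtype=float)
    basis = discrete_legendre_basis(log_envelopes.shape[0])
    return (basis.T @ log_envelopes).T.ravel()

def recognizer_input(prev_seg, cur_seg, next_seg):
    """Concatenate the coefficients of three segments and their lengths: 3 * 28 + 3 = 87."""
    segments = (prev_seg, cur_seg, next_seg)
    coeffs = [segment_parameters(s) for s in segments]
    lengths = np.array([float(len(s)) for s in segments])
    return np.concatenate(coeffs + [lengths])
```

Calling `recognizer_input` on three arrays of shape `(num_samples, 7)` returns a vector of length 87. Because the basis is orthonormal, the coefficients are plain inner products and do not require a separate least-squares solve.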

5.2 Recognition Results of the Segment-based Manner-of-Articulation Recognizer

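In the experiments we used the TIMIT corpus and operated the first-stage sample-based phone boundary detector at an operating point with a low miss-detection rate (set to 3% in the experiments below). The detector produced 80,123 segments in total, which means that each phone was split into 1.25 segments on average.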

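We used an RNN as the segment-based manner-of-articulation recognizer, with 87 neurons in the input layer, 90 neurons in the hidden layer, and 6 manner classes at the output. The resulting manner recognition rates are shown in Table 5.1.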
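For readers unfamiliar with this type of network, the forward pass of a recurrent classifier with the stated layer sizes can be sketched as below. This is only an illustration under assumptions: the weights are random and untrained, the hidden layer is assumed to be a simple Elman-style recurrence with tanh units, the output is assumed to be a softmax, and the class name `ElmanRNN` is hypothetical. The report does not give these details.

```python
import numpy as np

class ElmanRNN:
    """Untrained Elman-style RNN: 87 inputs, 90 recurrent hidden units, 6 outputs."""

    def __init__(self, n_in=87, n_hidden=90, n_out=6, seed=0):
        rng = np.random.default_rng(seed)
        self.w_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.w_rec = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.w_out = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.b_hidden = np.zeros(n_hidden)
        self.b_out = np.zeros(n_out)

    def forward(self, feature_sequence):
        """Return one posterior vector over the 6 manner classes per input segment."""
        hidden = np.zeros(self.w_rec.shape[0])
        posteriors = []
        for x in feature_sequence:
            # The hidden state feeds back into itself at every step.
            hidden = np.tanh(self.w_in @ x + self.w_rec @ hidden + self.b_hidden)
            logits = self.w_out @ hidden + self.b_out
            exp_logits = np.exp(logits - logits.max())
            posteriors.append(exp_logits / exp_logits.sum())
        return np.array(posteriors)
```

Feeding a sequence of 87-dimensional segment feature vectors yields one row of class posteriors per segment; the recognized manner is the index of the largest entry in each row.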

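For comparison, consider the frame-based manner-of-articulation recognizer in the paper by Prof. Chin-Hui Lee's group [17]. In [17], the MFCC parameters of the frame to be recognized and of the four frames on each side (9 frames, 108 dimensions in total) are used as the input to the manner recognizer, and an MLP neural network serves as the recognizer. Its frame-based manner recognition results are also listed in Table 5.1.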

Table 5.1: Recognition results of the segment-based manner-of-articulation recognizer

Pronunciation manner   Segment-based Recog. Rate (%)   Frame-based Recog. Rate (%) [17]
Fricative              79.4                            85.2
Stop                   78.8                            72.5
Glide                  68.0                            56.5
Vowel                  90.5                            89.0
Nasal                  75.4                            77.5
Silence                89.2                            92.2
Total                  83.3                            82.1

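Table 5.1 shows that for short phones such as stops and glides, the segment-based method improves the recognition rate substantially (stop: 72.5% to 78.8%; glide: 56.5% to 68.0%).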

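When the segment results are converted to a frame-based recognition rate, the segment-based manner recognizer achieves an overall frame recognition rate of 83.15%, which is also better than the result in [17] (82.1%).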
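One plausible way to perform such a conversion is to let every frame inherit the label of the segment that contains it and then score frame by frame. The sketch below assumes exactly that; the function name and the per-frame reference labels are hypothetical, and the report does not state how its conversion was carried out.

```python
import numpy as np

def segments_to_frame_accuracy(seg_lengths, seg_labels, ref_frame_labels):
    """Expand segment decisions to frames and compare with per-frame reference labels."""
    predicted = np.repeat(np.asarray(seg_labels), np.asarray(seg_lengths))
    reference = np.asarray(ref_frame_labels)
    if predicted.shape != reference.shape:
        raise ValueError("segment lengths must add up to the number of reference frames")
    return float(np.mean(predicted == reference))
```

For example, two segments of 3 and 2 frames labeled `vowel` and `stop`, scored against the reference `vowel, vowel, stop, stop, stop`, give 4 correct frames out of 5. Long segments therefore weigh more in the frame-based rate than in the segment-based one.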

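What surprised us is that parameters computed at a lower frequency resolution still gave the better result. The application of segment-based manner and place recognizers or detectors, that is, their use in detection-based ASR, therefore leaves room for further study.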


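[…] a corpus that companies must purchase from the association (the Association for Computational Linguistics and Chinese Language Processing). If the phone-like-unit boundary labels of the TCC-300 corpus can be licensed for release, they will be of great help to users of the TCC-300 corpus.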

Published papers

(1) You-Yu Lin, Yih-Ru Wang, “Sample-based Phone-like Unit Automatic Labeling in Mandarin Speech,” Proc. of ROCLING 2009, Taichung, ROC, pp. 137-149, Sept. 2009.

(2) You-Yu Lin, Yih-Ru Wang and Yuan-Fu Liao, “Phone Boundary Detection using Sample-based Acoustic Parameters,” Proc. of INTERSPEECH-2010, Makuhari, Japan, pp. 1397-1400, Sept. 2010.

(3) Yih-Ru Wang, “A Two-stage Sample-based Phone Boundary Detector using Segmental Similarity Features,” Proc. of INTERSPEECH-2011, Florence, Italy, pp. 413-416, Aug. 2011. (The content of this paper is not detailed in this report and is therefore attached as an appendix. In essence, after the phone boundaries have been detected, the paper uses segment parameters, as in Chapter 5, to achieve better phone boundary detection.)

Submitted papers

(4) Yih-Ru Wang, You-Yu Lin, “High-Resolution Phone Boundary Detection using Sample-based Acoustic Parameters,” submitted to IEEE Trans. on Audio, Speech, and Language Processing.


References

[1] F. Malfrère, O. Deroo, and T. Dutoit, “Phonetic alignment: Speech synthesis based vs. hybrid HMM/ANN,” in Proceedings of the International Conference on Spoken Language Processing, vol. IV, Sydney, NSW, Australia, 1998, pp. 1571–1574.

[2] D. T. Toledano, L. A. H. Gomez, and L. V. Grande, “Automatic phonetic segmentation,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 617–625, Nov. 2003.

[3] Jen-Wei Kuo and Hsin-Min Wang, “Minimum Boundary Error Training for Automatic Phonetic Segmentation,” in Proc. Ninth International Conference on Spoken Language Processing (Interspeech 2006 - ICSLP), Sept. 2006.

[4] J.-W. Kuo, H.-Y. Lo, and H.-M. Wang, “Improved HMM/SVM methods for automatic phoneme segmentation,” in Proc. Interspeech, Antwerp, Belgium, 2007, pp. 2057–2060.

[5] K.-S. Lee, “MLP-based phone boundary refining for a TTS database,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 981–989, 2006.

[6] Sorin Dusan and Lawrence Rabiner, “On the Relation between Maximum Spectral Transition Positions and Phone Boundaries,” in Proc. Interspeech 2006, pp. 17–21.

[7] J. Garofolo et al., “Documentation for the DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM,” Feb. 1993.

[8] G. Almpanidis, M. Kotti, and C. Kotropoulos, “Robust Detection of Phone Boundaries Using Model Selection Criteria With Few Observations,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 287–298, Feb. 2009.

[9] Sharlene A. Liu, “Landmark detection for distinctive feature-based speech recognition,” J. Acoust. Soc. Am., vol. 100, no. 5, pp. 3417–3430, Nov. 1996.

[10] H. Misra, S. Ikbal, H. Bourlard, and H. Hermansky, “Spectral entropy based feature for robust ASR,” in Proc. ICASSP 2004, pp. 193–196.

[11] Jia-lin Shen, Jeih-weih Hung, and Lin-shan Lee, “Robust Entropy-based Endpoint Detection for Speech Recognition in Noisy Environments,” in Proc. ICSLP 1998.

[12] Nico Tool Kit. Available: http://nico.nikkostrom.com

[13] L. Lao, X. Wu, L. Cheng, and X. Zhu, “Maximum weighted entropy clustering algorithm,” in Proceedings of the 2006 IEEE International Conference on Networking, Sensing and Control, pp. 1022–1025.

[14] B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Trans. Signal Processing, vol. 40, no. 12, pp. 3043–3054, Dec. 1992.

[15] B.-H. Juang, W. Chou, and C.-H. Lee, “Minimum classification error rate methods for speech recognition,” IEEE Trans. Speech and Audio Processing, vol. 5, no. 3, pp. 257–265, May 1997.

[16] Sin-Horng Chen and Yih-Ru Wang, “Vector Quantization of Pitch Information in Mandarin Speech,” IEEE Trans. on Communications, vol. 38, no. 9, pp. 1317–1320, Sept. 1990.

[17] Sabato Marco Siniscalchi, Jinyu Li, and Chin-Hui Lee, “A study on lattice rescoring with knowledge scores for automatic speech recognition,” in Proc. INTERSPEECH-2006, pp. 1319–1322.

[18] You-Yu Lin, “Phone Boundary Detection using Sample-based Speech Acoustic Parameters” (in Chinese), Master's thesis, National Chiao Tung University, 2010.

No.  Pinyin  Zhuyin    No.  Pinyin  Zhuyin    No.  Pinyin  Zhuyin
3    sh      ㄕ        10   h       ㄏ        17   l       ㄌ
4    r       ㄖ        11   j       ㄐ        18   b       ㄅ
5    z       ㄗ        12   q       ㄑ        19   p       ㄆ
6    c       ㄘ        13   x       ㄒ        20   m       ㄇ
7    s       ㄙ        14   d       ㄉ        21   f       ㄈ

Here, the null initial (INULL) is the default initial assigned when a syllable is pronounced with a final only.

Table 2. The 18 classes of Mandarin finals

No.  Pinyin  Zhuyin    No.  Pinyin  Zhuyin    No.  Pinyin  Zhuyin
1    FNULL1  Φ1        7    a_ng    ㄤ        13   e_ng    ㄥ
2    FNULL2  Φ2        8    o       ㄛ        14   e_n     ㄣ
3    a       ㄚ        9    ou      ㄡ        15   er      ㄦ
4    ai      ㄞ        10   e       ㄜ        16   yi      ㄧ
5    ao      ㄠ        11   eh      ㄝ        17   wu      ㄨ
6    a_n     ㄢ        12   ei      ㄟ        18   yu      ㄩ

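For the Zhuyin finals ㄢ, ㄤ, ㄥ, and ㄣ, this project subdivides the final at the phone-like-unit level so that the nasal coda forms a class of its own.

Table 3. The 2 classes of Mandarin nasal codas

No.  Pinyin
1    n_n
2    ng

Here, /n_n/ is the nasal coda of ㄢ and ㄣ, and /ng/ is the nasal coda of ㄥ and ㄤ.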


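Appendix 3

Yih-Ru Wang, “A Two-stage Sample-based Phone Boundary Detector using Segmental Similarity Features,” Proc. of INTERSPEECH-2011, Florence, Italy, pp. 413-416, Aug. 2011.

(The content of this paper is not detailed in this report and is therefore attached as an appendix. In essence, after the phone boundaries have been detected, the paper uses segment parameters, as in Chapter 5, to achieve better phone boundary detection.)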

A Two-Stage Sample-based Phone Boundary Detector using Segmental Similarity Features

Yih-Ru Wang

Institute of Communication, National Chiao Tung University, Hsinchu, Taiwan, ROC


Abstract

In this paper, a two-stage sample-based phone boundary detection algorithm is proposed. In the first stage, local sample-based acoustic parameters are used to pre-select phone boundary candidates. In the second stage, high-order statistics of the log-likelihood differences of the two adjacent speech segments around each boundary candidate are calculated to serve as a similarity measure for candidate verification. Experimental results on the TIMIT speech corpus showed that EERs of 8.6% and 7.6% were achieved for one-stage and two-stage sample-based phone boundary detection, respectively. Moreover, for the two-stage system, 42.1% and 81.9% of the detected boundaries were within 5-ms and 15-ms error tolerance of the manual labels.

Index Terms: phone boundary detection, similarity measure

1. Introduction

Automatic phonetic segmentation is a long-standing and basic problem in speech signal processing. Although much research has been done in the past [1], an automatic phonetic segmentation algorithm with high accuracy and precision remains a challenging task. When the text of the speech signal is unknown, the task becomes a phone boundary detection problem, which is more difficult than the phone boundary alignment problem. An accurate phone boundary detector is important for both speech processing engineering and linguistics.

In automatic boundary detection without knowing the text of the speech signal, the rate of acoustic signal change is the most important cue for decision making. In [2], the spectral transition measure, which is in fact the norm of delta MFCC, was used to find the phone boundaries; miss detection (MD) and false alarm (FA) rates of 15.4% and 22.0% were achieved on the TIMIT training data set. In [3], the model selection technique DISTBIC was used to perform phone boundary detection. DISTBIC first used the Kullback-Leibler (KL) distance to find the boundary candidates, and then employed the Bayesian information criterion (BIC) to further verify those candidates. MD and FA rates of 25.7% and 23.3% were achieved on the NTIMIT database. In our previous work [4], some sample-based […] with features extracted from the neighboring speech segments.

For obtaining the similarity measure, a more precise signal modeling method, the common component Gaussian mixture model (CCGMM) [5], is employed to model the speech signal. Some high-order statistics of the log-likelihood difference functions of the two neighboring segments, like mean, variance and skewness, can then be represented in terms of CCGMM coefficients [6]. These high-order statistics are used to calculate the similarity measure for improving boundary candidate verification.

The paper is organized as follows. In Section 2, the proposed sample-based phone boundary detection algorithm is discussed in detail. The performance of the two-stage system is examined by simulations discussed in Section 3. Some conclusions are given in the last section.

2. Two-stage Sample-based Phone Boundary Detector

In the proposed two-stage sample-based phone boundary detection algorithm, the speech signal is first processed sample by sample to extract sample-based acoustic parameters. Those local acoustic parameters are then used in the first stage to detect phone boundary candidates, and the speech signal is accordingly segmented into many acoustic segments. In the second stage, the high-order statistics of the log-likelihood difference of two neighboring segments are calculated to serve as the similarity measure of the two segments for verifying the boundary candidate. In the following subsections, we discuss these two stages in detail.

2.1. First-stage Boundary Candidate Detection

It is known that the spectrum of a speech signal is an effective cue for phone boundary detection. In this study, six sub-band signal envelopes are used. The input speech signal first passes through six band-pass filters with the following pass bands:

0.0–0.4 kHz, 0.8–1.5 kHz, 1.2–2.0 kHz, 2.0–3.5 kHz, 3.5–5.0 kHz, 5.0–8.0 kHz.

The energies of the above six sub-band signals were shown to be effective in speech landmark detection [7]. In the sample-based approach, the envelopes of those sub-band signals, $x_i[n]$, are extracted instead of their energies.

Detection of each sub-band signal envelope is realized by passing the complex analytic signals, $x_i[n] + j\,y_i[n]$, through a low-pass filter. The Hilbert-transformed signals, $y_i[n]$, in the analytic signals can be produced by

$$y_i[n] = x_i[n] * h[n], \quad \text{for } i = 1, \ldots, 6 \qquad (1)$$

where $h[n]$ is defined for odd $n$ […]. […] is also extracted. The cutoff frequency of the low-pass filter is set to 30 Hz.
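As an illustration of envelope extraction from an analytic signal, the sketch below builds the analytic signal with an FFT, takes its magnitude, and smooths it with a 30 Hz low-pass FIR filter. Several points are assumptions rather than details from the paper: the FFT construction replaces the time-domain Hilbert filter $h[n]$, the low-pass filter is applied to the magnitude, the windowed-sinc design with 801 taps is an arbitrary choice, and a 16 kHz sampling rate is assumed. All function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal x[n] + j*y[n], where y is the Hilbert transform of x."""
    n = len(x)
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep DC
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin
        gain[1:n // 2] = 2.0           # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)  # negative frequencies are zeroed

def lowpass_fir(cutoff_hz, fs_hz, num_taps=801):
    """Windowed-sinc low-pass filter normalized to unit gain at DC."""
    t = np.arange(num_taps) - (num_taps - 1) / 2
    taps = np.sinc(2.0 * cutoff_hz / fs_hz * t) * np.hamming(num_taps)
    return taps / taps.sum()

def subband_envelope(x_i, fs_hz=16000, cutoff_hz=30.0):
    """Envelope of one band-passed signal: |analytic signal| followed by low-pass smoothing."""
    magnitude = np.abs(analytic_signal(np.asarray(x_i, dtype=float)))
    return np.convolve(magnitude, lowpass_fir(cutoff_hz, fs_hz), mode="same")
```

For a steady tone the interior of the returned envelope is flat at the tone amplitude; the first and last few hundred samples are attenuated because `mode="same"` zero-pads the edges.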

The KL distance is used in probability theory to measure the similarity of two distributions. In this study, we use it to measure the similarity of two adjacent speech samples represented by six sub-band envelopes, $\{e_i[m];\ i = 1, \ldots, 6\}$ for $m = n$ and $n + 1$. The KL distance is implemented by first normalizing the six sub-band signal envelopes [8] by […]

Then, the sample-based KL distance is calculated by […] entropy defined by […] parts of speech signal.
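The exact normalization, KL distance, and entropy formulas are not preserved in this copy, so the sketch below uses common textbook forms as stand-ins: envelopes are normalized to sum to one across bands, the distance between adjacent samples is the symmetric KL divergence, and the spectral entropy is the Shannon entropy of the normalized envelopes. These choices, the function names, and the small constant `EPS` are all assumptions.

```python
import numpy as np

EPS = 1e-12  # guards against log(0) and division by zero

def normalize_envelopes(envelopes):
    """Scale each row of a (num_samples, num_bands) array so that the bands sum to one."""
    envelopes = np.asarray(envelopes, dtype=float)
    return envelopes / (envelopes.sum(axis=1, keepdims=True) + EPS)

def sample_kl_distance(normalized):
    """Symmetric KL divergence between the band distributions at samples n and n + 1."""
    p = normalized[:-1] + EPS
    q = normalized[1:] + EPS
    return np.sum((p - q) * np.log(p / q), axis=1)

def spectral_entropy(normalized):
    """Shannon entropy of the band distribution at each sample."""
    return -np.sum(normalized * np.log(normalized + EPS), axis=1)
```

With these definitions the distance is zero wherever the band distribution does not change and grows at spectral transitions, while the entropy reaches its maximum, $\log 6$ for six bands, when all bands carry equal weight.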

The similarity of the signals around the boundary candidate can also be a useful measure of signal change. For each boundary candidate, $c_j$, the feature vectors $\{E_i[n];\ i = 0, \ldots, 6\}$ in its two neighborhood windows $B_j^-$ and $B_j^+$ are assumed to be normally distributed. Here, $B_j^-$ and $B_j^+$ are defined as

$$B_j^- = [c_j - r_j,\ c_j - 1], \qquad B_j^+ = [c_j,\ c_j + r_j].$$

The KL distance at boundary candidate $c_j$ can be defined as the KL distance of the pdfs of the feature vectors in $B_j^-$ and $B_j^+$.
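The KL divergence between two multivariate normal distributions has a closed form, which the sketch below applies to the windows on either side of a candidate. The one-directional form, the window convention (`r` samples before the candidate and `r + 1` samples starting at it), and the small diagonal `ridge` added to keep the covariance invertible are assumptions made for illustration; the paper's own expression is not preserved here.

```python
import numpy as np

def gaussian_kl(mean0, cov0, mean1, cov1):
    """KL( N(mean0, cov0) || N(mean1, cov1) ) in closed form."""
    dim = mean0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mean1 - mean0
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - dim + logdet1 - logdet0)

def window_kl(features, c, r, ridge=1e-6):
    """KL distance between Gaussians fitted to the windows before and after candidate c."""
    features = np.asarray(features, dtype=float)
    dim = features.shape[1]
    stats = []
    for window in (features[c - r:c], features[c:c + r + 1]):
        mean = window.mean(axis=0)
        cov = np.cov(window, rowvar=False) + ridge * np.eye(dim)
        stats.append((mean, cov))
    (mean0, cov0), (mean1, cov1) = stats
    return gaussian_kl(mean0, cov0, mean1, cov1)
```

The ridge term matters in practice: normalized envelopes sum to a constant, so their sample covariance is singular or nearly so, and inverting it without regularization would fail.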

The normalized sub-band signal envelope, sample-based KL distance and spectral entropy and their delta terms are effective parameters for modeling the short-term spectral changing rate. They are used as the input features of the first-stage phone boundary pre-selection.

2.1.1. Sample-based boundary detection by neural networks

A boundary candidate pre-selection procedure is first used to reduce the amount of data to be processed in the subsequent boundary detection. The selected boundary candidates are those samples with a large speech signal changing rate. Thus, the sample-based KL distance is employed for boundary candidate pre-selection. A simple peak-picking method with a threshold is used to select as candidates all samples that satisfy the following constraints:

$$d_{KL}[n] > d_{KL}[n-1], \quad d_{KL}[n] > d_{KL}[n+1], \quad \text{and} \quad d_{KL}[n] > Th_d, \qquad (6)$$

where $Th_d$ is a threshold. The sequence of boundary candidates is denoted as $\{c_j;\ j = 1, \ldots, N_c\}$.
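Peak picking with a threshold can be written in a few lines. The sketch below keeps every sample whose KL distance is a strict local maximum and exceeds the threshold $Th_d$; the use of strict inequalities on both sides and the function name are assumptions.

```python
import numpy as np

def pick_boundary_candidates(d_kl, threshold):
    """Indices n where d_kl[n] is a strict local maximum and larger than the threshold."""
    d = np.asarray(d_kl, dtype=float)
    inner = d[1:-1]
    is_peak = (inner > d[:-2]) & (inner > d[2:]) & (inner > threshold)
    return np.flatnonzero(is_peak) + 1  # shift back to indices of the full array
```

For the sequence `[0, 1, 0, 3, 2, 5, 0]` and a threshold of 2, the peak at index 1 is rejected for being too small, and indices 3 and 5 are returned as candidates. The first and last samples are never selected because they lack a neighbor on one side.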

The average normalized sub-band envelope of the segment $[c_{k-1}, c_k]$ is defined by […] feature vector includes the following acoustic parameters,

(1) Features from current boundary candidates: […]

Lastly, two neural network-based classifiers, a multi-layer perceptron (MLP) and a recurrent neural network (RNN), are used to screen these phone boundary candidates.

2.2. Similarity measure of acoustic signal segments

After pre-selecting some boundary candidates in the first stage, we then verify each of them in the second stage. A new similarity measure based on CCGMM representation is introduced to calculate the distance of the two acoustic segments around a candidate for its verification.

For a speech segment $k$, the pdf of its acoustic feature […] distributions with common covariance matrix which is used as the basis of signal pdf; and $\{c_{kl};\ l = 1, \ldots, L\}$ are the coefficients of CCGMM [5].

Then, the un-symmetric KL (KL1) distances are calculated and used as the similarity measures of the two adjacent […] functions of the two segments. They can be approximated by [5] […] log-likelihood differences, some segment-based acoustic features calculated in the first stage are also used. In summary, the 30-dim features used for determining whether the candidate at time $c_k$ is a phone boundary are listed below:

(1) Output of first-stage RNN boundary detector,

(2) Features from two adjacent segments: […]

(3) Statistics of the log-likelihood differences of two adjacent segments, $[c_k, c_{k+1}]$ and $[c_{k-1}, c_k]$: […] second-stage phone boundary detector.

3. Experiment Results

The TIMIT speech corpus was used to evaluate the effectiveness of the proposed sample-based phone boundary detection algorithm. The numbers of phone boundaries in the training and testing parts of TIMIT were 172460 and 62465, respectively. The total numbers of samples were $2.27 \times 10^8$ and $8.29 \times 10^7$ for the training and testing data sets. On average, there were 12.2 phone boundaries per second, or one boundary per 1310 samples, for the training data.

First, the envelopes of six sub-band signals were found from the speech signal. Then, the sample-based KL distance and spectral entropy were extracted. The threshold value of the KL distance was chosen empirically to preselect the boundary candidates. The numbers of the resulting boundary candidates were 534189 and 194201 for the training and test data sets, respectively. In other words, only 0.85% of speech samples, or one out of 116 samples, were preselected as boundary candidates for the training data.

These boundary candidates were then screened by an MLP classifier and an RNN classifier in which all neurons in the hidden layer were fully fed back to themselves. The numbers of hidden neurons were empirically set to 75 and 80 for the MLP and RNN classifiers, respectively. These two neural networks were trained using the iterative target selection algorithm [4]. […] alarms)/(number of true boundaries + number of false alarms).

EERs of 11.6% and 8.6% were achieved for the MLP and RNN classifiers, respectively. Compared with the performance of [2], which is 15.4% MD and 22% FA rates for the TIMIT training data set, about 50% EER reduction was achieved. In order to check the accuracy of the detected boundaries, the normalized cumulative histogram of the absolute deviation between the automatically detected boundary and the manually labeled one is shown in Figure 2. As shown in the figure, about 70% of phone boundaries were correctly detected within 10-ms error tolerance.
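The equal error rate is the operating point at which the MD and FA rates coincide. The sketch below computes both rates from counts and then locates the EER on a threshold sweep by linear interpolation. The MD definition (misses over true boundaries) and the numerator of the FA rate are assumptions, since only the FA denominator survives in the text; the function names are hypothetical.

```python
import numpy as np

def md_fa_rates(num_true, num_hits, num_false_alarms):
    """MD = misses / true boundaries; FA = false alarms / (true boundaries + false alarms)."""
    md = (num_true - num_hits) / num_true
    fa = num_false_alarms / (num_true + num_false_alarms)
    return md, fa

def equal_error_rate(md_rates, fa_rates):
    """Interpolate the point where the MD and FA curves cross along a threshold sweep."""
    md = np.asarray(md_rates, dtype=float)
    fa = np.asarray(fa_rates, dtype=float)
    diff = md - fa
    crossings = np.flatnonzero(np.sign(diff[:-1]) != np.sign(diff[1:]))
    if crossings.size == 0:
        k = int(np.argmin(np.abs(diff)))   # no crossing: fall back to the closest point
        return float((md[k] + fa[k]) / 2)
    k = crossings[0]
    weight = diff[k] / (diff[k] - diff[k + 1])
    return float(md[k] + weight * (md[k + 1] - md[k]))
```

Raising the decision threshold trades false alarms for misses, so the two curves move in opposite directions and normally cross once. With MD rates `[0.0, 0.2]` and FA rates `[0.2, 0.0]` at two thresholds, the interpolated EER is 0.1.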

In order to compare the performance of the proposed systems with the state-of-the-art model-based approach, an HMM phone recognition system was implemented, with 61 3-state phone models. In addition, the minimum mean absolute boundary error (MMAE) criterion was used to realign the recognition result of the HMM system. The EER of the HMM system was 6.3%. The EER of the sample-based RNN phone boundary detector is about 20% higher than that of the HMM system. However, as shown in Figure 2, the inclusion rate of the RNN-detected boundaries is much higher than that of the HMM recognizer when the error tolerance is less than 30 ms. So, the RNN boundary detector is of higher precision.

In order to perform the 2nd-stage verification, a low threshold value was adopted for the first-stage RNN classifier.

The resulting MD rate is only 3.9%, while the FA rate is 17.7%. To calculate the similarity measure of speech segments found from stage 1, 64 Gaussian mixtures were used in the CCGMM. An RNN with 80 hidden neurons was used as the second-stage verification. The curve of MD rate vs. FA rate for the two-stage phone boundary detector was also shown in Figure 1. An EER of 7.6% was achieved. The performance was much better than that of the first-stage RNN detector. It can also be found from Figure 2 that 42.1% and 81.9% of boundaries detected at the EER operating point were within 5- and 15-ms error tolerance from manual labeling results. They were slightly higher than those of the first-stage RNN detector.

4. Conclusions

In this paper, a two-stage sample-based phone boundary
