
Chapter 3 Speech Modeling and Recognition

3.4 Recognition Procedure

Given the HMMs Λ and the observation sequence O = {o1, o2, …, oT}, the recognition stage computes the probability P(O|Λ) using an efficient method, the forward-backward procedure, which was introduced in the training stage.

Recall that the forward variable αt(i) = P(o1 o2 … ot, qt = Si | Λ) and the backward variable βt(i) = P(ot+1 ot+2 … oT | qt = Si, Λ) are computed recursively, given the initial conditions

    α1(i) = πi bi(o1),   1 ≤ i ≤ N                                  (3-54)

    βT(i) = 1,           1 ≤ i ≤ N                                  (3-55)

where N is the number of states. The joint probability of the observations and of being in state Si at time t is expressed as

    P(O, qt = Si | Λ) = αt(i) βt(i)                                 (3-56)

so that the total probability P(O|Λ) is obtained, for any t, by

    P(O|Λ) = ∑_{i=1}^{N} P(O, qt = Si | Λ) = ∑_{i=1}^{N} αt(i) βt(i)    (3-57)

which is employed in the speech recognition stage.
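As a sketch of how equations (3-54)–(3-56) combine, the following assumes a discrete-observation HMM whose per-frame emission likelihoods are already evaluated (a simplification of the Gaussian-mixture output densities used in this thesis); it computes P(O|Λ) by the forward recursion and checks it against the backward pass:

```python
import numpy as np

def forward_backward_prob(pi, A, B):
    """Compute P(O|Lambda) for an HMM.

    pi: (N,) initial state probabilities pi_i
    A:  (N, N) transition probabilities a_ij
    B:  (T, N) emission likelihoods b_i(o_t), one row per frame
    """
    T, N = B.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    # Initialization (3-54): alpha_1(i) = pi_i * b_i(o_1)
    alpha[0] = pi * B[0]
    # Forward recursion: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(o_t)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    # Initialization (3-55): beta_T(i) = 1
    beta[T - 1] = 1.0
    # Backward recursion: beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    # (3-56): sum_i alpha_t(i) beta_t(i) gives P(O|Lambda) for every t
    totals = (alpha * beta).sum(axis=1)
    return totals[0], alpha, beta
```

Because ∑i αt(i)βt(i) is the same for every t, the recognizer only needs the forward pass (P(O|Λ) = ∑i αT(i)); the backward pass here merely verifies the identity.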

Chapter 4

Experimental Results

Several speaker-independent recognition experiments are shown in this chapter.

The effects and performance of different front-end techniques are discussed in the experimental results. The corpus is described in section 4.1. The experiments are divided into two parts: the monophone-based HMM and the syllable-based HMM. The experimental results are shown in sections 4.2 and 4.3, respectively.

4.1 Corpus

The corpora employed in this thesis are TCC-300, provided by the Association for Computational Linguistics and Chinese Language Processing (ACLCLP), and the connected-digits database provided by the Speech Processing Lab of the Department of Communication Engineering, NCTU. These corpora are introduced below.

4.1.1

TCC-300

In the speaker-independent speech recognition experiments, the TCC-300 database from the Association for Computational Linguistics and Chinese Language Processing (ACLCLP) was used for monophone-based HMM training. TCC-300 is a collection of microphone speech databases produced by National Taiwan University (NTU), National Chiao Tung University (NCTU), and National Cheng Kung University (NCKU). In this thesis, the training corpus uses the speech database produced by National Chiao Tung University.

The speech signals were recorded under the conditions listed in Table 4-1. The speech is saved in the MAT file format, which records the speech waveform in PCM format and, in addition, records the conditions of the environment and the speaker in detail by adding an extra 4096-byte file header to the PCM data.

Table 4-1 The recording environment of the TCC-300 corpus produced by NCTU

File format       MAT
Microphone        Computer headset VR-2560 made by Taiwan Knowles
Sound card        Sound Blaster 16
Sampling rate     16 kHz
Sampling format   16 bits
Speaking style    read

The database provided by NCTU comprises paragraphs spoken by 100 speakers (50 males and 50 females). Each speaker read 10-12 paragraphs. The articles were selected from the balanced corpus of Academia Sinica, and each article contains several hundred words. These articles were divided into several paragraphs, and each paragraph includes no more than 231 words. Table 4-2 shows the statistics of the database.

Table 4-2 The statistics of the database TCC-300 (NCTU)

                              Males   Females   Total
Number of speakers            50      50        100
Number of syllables           75059   73555     148614
Number of files               622     616       1238
Time (hours)                  5.98    5.78      11.76
Maximum words in a paragraph  229     131       -
Minimum words in a paragraph  41      11        -

4.1.2

Connected-digits corpus

This connected-digits corpus is provided by the Speech Processing Lab of the Department of Communication Engineering, NCTU. All signals are stored in PCM format without a file header. The recording format of the waveform files is listed in Table 4-3. The database consists of 3-11 connected digits, such as “011415726”, “79110”, “347”, etc., spoken by 100 speakers (50 males and 50 females). The statistics of the database are shown in Table 4-4.

Table 4-3 Recording environment of the connected-digits

File format       PCM
Sampling rate     16 kHz
Sampling format   16 bits

Table 4-4 Statistics of the connected-digits database

                          Males   Females   Total
Number of speakers        50      50        100
Number of files           500     499       999
Maximum digits in a file  11      11        -
Minimum digits in a file  3       3         -

4.2 Monophone-based Experiment

The objective of this experiment is to evaluate the performance of different features based on monophone HMMs for speaker-independent speech recognition.

The phonetic transcription system SAMPA-T employed in this thesis and the training of the monophone-based HMMs are described in sections 4.2.1 and 4.2.2, respectively. The experimental results are shown in the last section.

4.2.1

SAMPA-T

SAMPA-T (Speech Assessment Methods Phonetic Alphabet - Taiwan), developed by Dr. Chiu-yu Tseng, Research Fellow of Academia Sinica, is employed for transcribing the database with a machine-readable phonetic transcription [23]. Table 4-5 and Table 4-6 compare the 21 consonants and 39 vowels of Chinese syllables in SAMPA-T with the Chinese phonetic alphabet, together with the types of pronunciation.

Table 4-5 The comparison table of 21 consonants of Chinese syllables between SAMPA-T and Chinese phonetic alphabets

Type        SAMPA  Phonetic alphabet    Type        SAMPA  Phonetic alphabet
plosive     b      ㄅ                   affricates  dj     ㄐ
            p      ㄆ                               tj     ㄑ
            d      ㄉ                               dz`    ㄓ
            t      ㄊ                               ts`    ㄔ
            g      ㄍ                               dz     ㄗ
            k      ㄎ                               ts     ㄘ
fricatives  f      ㄈ                   nasals      m      ㄇ
            h      ㄏ                               n      ㄋ
            s      ㄙ                   liquid      l      ㄌ
            s`     ㄕ
            sj     ㄒ
            Z`     ㄖ

Table 4-6 Comparison table of 39 vowels of Chinese syllables between SAMPA-T and Chinese phonetic alphabets

SAMPA  Phonetic    SAMPA  Phonetic    SAMPA  Phonetic
@n     ㄣ          aN     ㄤ          u@n    ㄨㄣ
i      ㄧ          @N     ㄥ          uai    ㄨㄞ
u      ㄨ          iE     ㄧㄝ        ua     ㄨㄚ
a      ㄚ          iai    ㄧㄞ        uaN    ㄨㄤ
o      ㄛ          iEn    ㄧㄢ        uei    ㄨㄟ
e      ㄝ          ia     ㄧㄚ        uo     ㄨㄛ
@      ㄜ          iaN    ㄧㄤ        y      ㄩ
@`     ㄦ          iau    ㄧㄠ        yE     ㄩㄝ
ai     ㄞ          in     ㄧㄣ        yEn    ㄩㄢ
ei     ㄟ          iN     ㄧㄥ        yn     ㄩㄣ
au     ㄠ          iou    ㄧㄡ        yoN    ㄩㄥ
ou     ㄡ          uan    ㄨㄢ        U      -
an     ㄢ          oN     ㄨㄥ        U`     -

p.s. U` is the null vowel for retroflexed vowels and U represents the null vowel for un-retroflexed vowels.

Each wave file corresponds to a transcription file. For example, part of a paragraph marked with Chinese phonetic alphabets and tones (1, 2, …, 5), as given in the database, is shown in Table 4-7. Table 4-8 shows the transcriptions of the words in Table 4-7 marked with SAMPA-T. For monophone-based HMM training, the word-level transcriptions, such as those shown in Table 4-8, are further converted to phone-level transcriptions, shown in Table 4-9, where the tones are neglected. It is noted that punctuation marks, such as commas and periods, are replaced with the notation “sil”, which denotes silence at that moment.

Table 4-7 A paragraph marked with Chinese phonetic alphabets 茶 味 有 苦 、 澀 、 嗆 、 薰 ,

ㄔㄚˊ ㄨㄟˋ ㄧㄡˊ ㄎㄨˇ 、 ㄙㄜˋ 、 ㄑㄧㄤˋ 、 ㄒㄩㄣ ,

由 其 中 才 能 品 味 出 茶 味 的 香 、 甘 、 生 津 ,

ㄧㄡˊ ㄑㄧˊ ㄓㄨㄥ ㄘㄞˊ ㄋㄥˊ ㄆㄧㄣˇ ㄨㄟˋ ㄔㄨ ㄔㄚˊ ㄨㄟˋ ㄉㄜ․ ㄒㄧㄤ 、 ㄍㄢ 、 ㄕㄥ ㄐㄧㄣ ,

同 樣 的 , 人 生 也 是 有 不 同 的 情 緒 ,

ㄊㄨㄥˊ ㄧㄤˋ ㄉㄜ․ , ㄖㄣˊ ㄕㄥ ㄧㄝˇ ㄕˋ ㄧㄡˇ ㄅㄨˋ ㄊㄨㄥˊ ㄉㄜ․ ㄑㄧㄥˊ ㄒㄩˋ

起 起 落 落 ,

ㄑㄧˊ ㄑㄧˇ ㄌㄨㄛˋ ㄌㄨㄛˋ ,

不 也 是 由 痛 苦 中 才 能 真 正 體 會 快 樂 是 什 麼 嗎 ?

ㄅㄨˋ ㄧㄝˇ ㄕˋ ㄧㄡˊ ㄊㄨㄥˋ ㄎㄨˇ ㄓㄨㄥ ㄘㄞˊ ㄋㄥˊ ㄓㄣ ㄓㄥˋ ㄊㄧˇ ㄏㄨㄟˋ ㄎㄨㄞˋ ㄌㄜˋ ㄕˋ ㄕㄜˊ ㄇㄛ․ ㄇㄚ․ ?

Table 4-8 Word-level transcriptions using SAMPA-T ts`a2 uei4 iou2 ku3, s@4, tjiaN4, sjyn1,

iou2 tji2 dz`oN1 tsai2 n@N2 pin3 uei4 ts`u1 ts`a2 uei4 d@5 sjiaN1, gan1, s`@N1 djin1,

toN2 iaN4 d@5, Z`@n2 s`@N1 iE3 s`U`4 iou3 bu4 toN2 d@5 tjiN2 sjy4, tji2 tji3 luo4 luo4,

bu4 iE3 s`U`4 iou2 toN4 ku3 dz`oN1 tsai2 n@N2 dz`@n1 dz`@N4 ti3 huei4 kuai4 l@4 s`U`4 s`@2 mo5 ma5?

Table 4-9 Phone-level transcriptions using SAMPA-T ts` a uei iou ku sil s @ sp tj iaN sp sj yn sil

iou tj i dz` oN ts ai n @N p in uei ts` u ts` a u ei d @ sj iaN sil g an s` @N dj in sil

t oN iaN d @ sil Z` @n s` @N iE s` U` iou b u t oN d @ tj iN sj y sil

tj i tj i l uo l uo sil

b u iE s` U` iou t oN k u dz` oN ts ai n @N dz` @n dz` @N t i h uei k uai l @ s` U` s` @ mo ma sil
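The word-level to phone-level conversion illustrated by Tables 4-8 and 4-9 can be sketched as follows. This is a hypothetical helper, not part of the thesis toolchain: the consonant inventory is taken from Table 4-5, tone digits are stripped, and punctuation becomes “sil”; the insertion of “sp” between words, which requires word-boundary information, is omitted here:

```python
# Hypothetical helper (not from the thesis): split SAMPA-T syllables into
# consonant + vowel, drop the tone digits, and map punctuation to "sil".
CONSONANTS = sorted(
    ["b", "p", "d", "t", "g", "k", "f", "h", "s", "s`", "sj", "Z`",
     "dj", "tj", "dz`", "ts`", "dz", "ts", "m", "n", "l"],
    key=len, reverse=True)  # try longest first, so "ts`" wins over "t"

def syllable_to_phones(syllable):
    syllable = syllable.rstrip("12345")        # the tones are neglected
    for c in CONSONANTS:
        if syllable.startswith(c) and len(syllable) > len(c):
            return [c, syllable[len(c):]]      # consonant + vowel
    return [syllable]                          # vowel-only syllable

def word_to_phone_level(line):
    """Convert a word-level line (Table 4-8 style) to phone level (Table 4-9)."""
    phones = []
    for token in line.replace(",", " , ").replace("?", " , ").split():
        if token == ",":
            phones.append("sil")               # punctuation becomes silence
        else:
            phones.extend(syllable_to_phones(token))
    return phones
```

For example, `word_to_phone_level("ts`a2 uei4")` yields `["ts`", "a", "uei"]`, matching the first entries of Table 4-9.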

4.2.2

Monophone-based HMM used on TCC-300

From the phonetic transcription defined in SAMPA-T, there are 21 consonants and 39 vowels in the Chinese dialect spoken in Taiwan. Hence, the total number of monophone-based HMMs is 62, including the 21 consonants, the 39 vowels, the silence model “sil”, and the short pause model “sp”, where “sp” denotes the short pause between two words. The number of states of each HMM is defined in Table 4-10 and the structure is shown in Fig.4-1. It is noted that the number of states here includes 2 null states, called the entry and exit nodes, which cannot produce any observations; the probability of staying in a null state is zero. The entry and exit nodes make the HMMs much easier to connect together without changing the parameters of the HMMs; for example, the word “樂” is a combination of the HMM “l” and the HMM “@”, as shown in Fig.4-2.

Besides, the short pause model “sp” used here is a so-called “tee-model”, which has a direct transition from the entry node to the exit node. The silence model has extra transitions from state 2 to state 4 and from state 4 to state 2 in order to make the model more robust by allowing individual states to absorb the various impulsive noises in the training data. The backward skip allows this to happen without committing the model to transit to the following word.
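The “sp” tee-model and the “sil” skip transitions can be illustrated with transition matrices in the convention where the first row/column is the entry null state and the last is the exit null state. The probability values below are made-up examples for illustration, not the trained values:

```python
import numpy as np

# Hypothetical transition matrices: state 1 is the entry null state, the last
# state is the exit null state; neither emits, and their self-loop probability
# is zero (first and last diagonal entries).

# "sp" tee-model (3 states): a direct entry -> exit transition skips the
# single emitting state entirely.
sp_trans = np.array([
    [0.0, 0.5, 0.5],   # entry: into the emitting state, or straight to exit
    [0.0, 0.6, 0.4],   # emitting state with a self-loop
    [0.0, 0.0, 0.0],   # exit null state
])

# "sil" (5 states, states 2-4 emitting) with the extra skips 2 -> 4 and 4 -> 2
sil_trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.3, 0.1, 0.0],   # 2 -> 4 forward skip
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.1, 0.0, 0.6, 0.3],   # 4 -> 2 backward skip delays leaving the model
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

# Sanity check: every non-exit row is a probability distribution
for A in (sp_trans, sil_trans):
    assert np.allclose(A[:-1].sum(axis=1), 1.0)
```

The nonzero entry at `sp_trans[0, 2]` is the tee transition, and `sil_trans[3, 1]` is the backward skip described above.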

Table 4-10 Definitions of HMM used in monophone-based experiment

Number of monophone-based HMMs                    62 (60 monophones, “sp” and “sil”)
Number of states of “sp”                          3 (first and last states are null)
Number of states of consonants (including “sil”)  5 (first and last states are null)
Number of states of vowels                        7 (first and last states are null)
Number of Gaussian mixtures per state             5

The training database is selected from TCC-300, where eight folders (F_NEWG1−F_NEWG4 and M_NEWG1−M_NEWG4) produced by NCTU are employed to train the monophone-based HMMs. The training database comprises 517 files spoken by 40 females and 515 files spoken by 40 males. All the MAT files are converted to the wave format prior to training. The Hidden Markov Model Toolkit (HTK), developed by the Cambridge University Engineering Department (CUED), is employed in this thesis since it provides sophisticated facilities for speech research.

Fig.4-1 HMM structure of (a) sp, (b) sil, (c) consonants and (d) vowels

Fig.4-2 (a) HMM structure of the word “樂(l@4),” (b) “l” and (c) “@”


4.2.3

Experiments

The parameters of front-end processing are set as in Table 4-11. The features adopted in the experiment are listed in Table 4-12. The flow chart of training the monophone-based HMMs is shown in Fig. 4-3. At the beginning, only the corpus and its corresponding Chinese phonetic alphabets are available; hence, it is essential to convert the Chinese phonetic alphabets to SAMPA-T before training. It is noted that there is no boundary information in this corpus. The six features selected in this thesis are based on LPC, MFCC, and PLP, which were introduced in Chapter 2.

In the training process, there is no rule for how many iterations of Baum-Welch re-estimation will yield the best model; consequently, it is necessary to test the recognition rate in order to find the best model.

Table 4-11 The parameters of front-end processing

Sampling frequency   16 kHz
Pre-emphasis filter  1 − 0.97z⁻¹
Hamming window       w(n) = 0.54 − 0.46 cos(2πn/(N−1)), 0 ≤ n ≤ N−1
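As an illustration of the front-end parameters in Table 4-11, the sketch below applies pre-emphasis, frame blocking, and Hamming windowing. The 25 ms frame length and 10 ms shift are assumed typical values, since the table specifies only the sampling rate, the pre-emphasis filter, and the window type:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=25, shift_ms=10, pre=0.97):
    """Front-end steps of Fig. 4-3: pre-emphasis, frame blocking, windowing.

    The 25 ms / 10 ms frame length and shift are assumed defaults; the thesis
    specifies only fs = 16 kHz and the pre-emphasis filter 1 - 0.97 z^-1.
    """
    # Pre-emphasis: y[n] = x[n] - 0.97 x[n-1]
    y = np.append(x[0], x[1:] - pre * x[:-1])
    L = int(fs * frame_ms / 1000)      # samples per frame (400 at 16 kHz)
    S = int(fs * shift_ms / 1000)      # frame shift (160 at 16 kHz)
    n_frames = 1 + max(0, (len(y) - L) // S)
    # Hamming window: w(n) = 0.54 - 0.46 cos(2*pi*n / (L - 1))
    w = np.hamming(L)
    frames = np.stack([y[i * S:i * S + L] * w for i in range(n_frames)])
    return frames
```

One second of 16 kHz speech thus yields 98 windowed frames of 400 samples each, which are then passed to the feature-extraction stage.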

Table 4-12 Six different features adopted in this thesis

Feature                                          Order   Number of filter banks   Energy
Linear Predictive Coefficients (LPC_39)          39      -                        √
Linear Predictive Coefficients (LPC_38)          38      -                        √
Linear Predictive Reflection Coefficients (RC)   39      -                        √
LPC Cepstrum Coefficients (LPCC)                 39      -                        √
Mel-Frequency Cepstral Coefficients (MFCC)       39      26                       √
Perceptual Linear Prediction Coefficients (PLP)  39      26                       √

Fig. 4-4 gives a 3-D view of the six features computed on the word “不久” (bu4 djiou3), showing the variations of the 39-dimensional (38-dimensional for LPC_38) vectors from frame 1 to frame 100, where time denotes the frame order. The highest curve is at the 13th element of the feature vectors (19th element for LPC_38) since this element is the energy term.

Fig.4-3 Flow chart of training the monophone-based HMMs

[Flow chart: corpus → front-end processing (pre-emphasis, frame blocking, Hamming window) → feature extraction → model training with transcriptions → 62 HMM models (b, p, d, …, U, sp, sil)]

The monophone-based HMMs are usually employed in Large Vocabulary Speech Recognition (LVSR). However, one of the factors that influence the recognition rate of LVSR is the language model, a statistical model which attempts to capture the regularities of natural language and improve performance by estimating the probability distribution of various linguistic units, such as words and sentences. If the recognition task is long paragraphs or articles, a language model should be trained. Nevertheless, the language model is not the key point of this thesis. Hence, the connected-digits corpus described in 4.1.2 is utilized for testing the monophone-based HMMs, which are trained by six different kinds of feature extraction methods; the connected-digits task needs only an uncomplicated grammar in which a sentence is an arbitrary permutation of digits.

Fig.4-4 3-D view of the variations of the feature vectors (a) LPC_38 (b) LPC_39 (c) RC (d) LPCC (e) MFCC (f) PLP
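One plausible way to express this digit-loop grammar is HTK's HParse notation, where `< >` denotes one or more repetitions and `[ ]` an optional item. The digit word names below follow the pinyin convention used in section 4.3, and the silence/short-pause placement is an assumption for illustration:

```
$digit = yi | er | san | si | wu | liu | qi | ba | jiu | ling;
( sil < $digit [sp] > sil )
```

This accepts any non-empty digit sequence bracketed by silence, with an optional short pause between digits, which matches the arbitrary-permutation requirement above.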

The experimental results are shown in Table 4-13 and Fig. 4-6. The total number of digits used in this experiment, denoted by T, is 8432. Three quantities must be considered in order to compute the recognition rate: the number of insertions (I), the number of deletions (D), and the number of substitutions (S). For example, the output sentence of the recognizer may be

i2 @`4 ba1 djiou3 sU4

while the actual sentence is

i2 ba1 liou3 sU4 san1

Fig.4-5 Flow chart of testing the performance of different features

[Flow chart: connected-digits corpus → front-end processing (pre-emphasis, frame blocking, Hamming window) → feature extraction → speech recognition using the 62 HMM models, grammar, and transcriptions of the test corpus → results analysis → recognition rate]

where “@`4” is an insertion error, “djiou3” is a substitution error, and “san1” is a deletion error.

Based on the definitions above, the performance of the different features can be examined through two measures, the Correct (%) and the Accuracy (%). The Correct (%) is computed by

    Correct (%) = (T − S − D) / T × 100              (4-1)

and the Accuracy (%) is defined as

    Accuracy (%) = (T − I − S − D) / T × 100         (4-2)

which means that the Accuracy (%) accounts not only for the deletion and substitution errors but also for the insertion errors. Hence, the Accuracy (%) is always lower than or equal to the Correct (%).
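A minimal sketch of how S, D, and I can be obtained by a minimum-edit-distance alignment of the recognized and reference digit strings, then plugged into (4-1) and (4-2); the tie-breaking and penalty details of a real scorer such as HTK's HResults may differ:

```python
def align_counts(ref, hyp):
    """Count (S, D, I) via a minimum-edit-distance alignment (a sketch)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    d = [[None] * (m + 1) for _ in range(n + 1)]
    d[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        d[i][0] = (i, 0, i, 0)                   # all deletions
    for j in range(1, m + 1):
        d[0][j] = (j, 0, 0, j)                   # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, S, D, I = d[i - 1][j - 1]
            if ref[i - 1] != hyp[j - 1]:
                c, S = c + 1, S + 1              # substitution
            cand = [(c, S, D, I)]
            cd, Sd, Dd, Id = d[i - 1][j]
            cand.append((cd + 1, Sd, Dd + 1, Id))  # deletion
            ci, Si, Di, Ii = d[i][j - 1]
            cand.append((ci + 1, Si, Di, Ii + 1))  # insertion
            d[i][j] = min(cand)
    return d[n][m][1:]                           # (S, D, I)

def correct_and_accuracy(ref, hyp):
    """Eq. (4-1) and (4-2): Correct ignores insertions, Accuracy counts them."""
    S, D, I = align_counts(ref, hyp)
    T = len(ref)
    return (T - S - D) / T * 100, (T - I - S - D) / T * 100
```

For the example sentences above (T = 5 for that single sentence rather than the full 8432), this gives S = D = I = 1, i.e. Correct = 60% and Accuracy = 40%.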

Table 4-13 Comparison of the Corr (%) and Acc (%) of different features

Number of    LPC_38       LPC_39       RC           LPCC         MFCC         PLP
iterations   Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc
1            73.7  40.1   69.4  38.3   82.0  46.0   87.3  67.1   89.6  67.2   90.0  67.9
2            75.2  47.5   72.8  47.2   83.7  54.3   88.9  70.8   91.2  72.2   91.6  73.7
3            76.4  51.3   74.8  49.4   84.0  56.1   89.2  71.6   92.0  74.2   92.2  75.6
4            77.8  54.1   75.9  50.6   83.7  54.5   89.6  71.5   92.4  74.4   92.9  76.5
5            78.7  55.9   76.4  52.3   83.6  54.0   89.6  71.8   92.8  75.7   93.2  77.3
6            79.7  56.6   76.8  53.9   83.6  54.2   89.6  71.7   92.9  76.3   93.3  78.0
7            80.1  57.3   76.9  54.2   83.6  54.3   89.5  71.9   93.0  77.1   93.4  78.6
8            80.6  58.2   77.1  54.5   83.7  54.4   89.5  71.9   93.0  77.4   93.4  78.7
9            81.0  59.0   77.0  54.4   83.9  54.7   89.7  72.1   93.1  77.7   93.4  78.7
10           81.1  59.1   77.1  54.3   83.9  55.0   89.6  72.1   93.1  78.1   93.4  78.8
11           81.2  59.2   77.1  54.3   84.1  55.4   89.6  72.1   93.2  78.2   93.3  78.9
12           81.2  59.2   76.8  54.0   84.2  56.0   89.6  72.2   93.2  78.3   93.2  78.9


From Fig. 4-6(a), the Correct (%) is the recognition rate without considering insertion errors, and the performance of connected-digits speaker-independent recognition based on monophone HMMs is ordered as

    PLP > MFCC > LPCC > RC > LPC_38 > LPC_39         (4-3)

where PLP performs better than all the other features from iteration 1 to iteration 12 of training. The performance of all the models almost saturates by iteration 12. The maximum Correct (%) of PLP appears at iteration 8 of training and then decreases as more training iterations are performed, which shows that PLP takes less time than the others to reach a good model in the training stage. Fig. 4-6(b) shows the recognition results when the insertion error is considered. The order of the performance is not the same as (4-3), especially for RC, which suggests that RC tends to insert words between two words more often than the other features.

Fig.4-6 Comparison of the different features (a) Correct (%) (b) Accuracy (%)

Comparisons of the different features through the average and the best Correct (%) and Accuracy (%) are shown in Fig. 4-7. The order of the performance from best to worst in this experiment is

    PLP > MFCC > LPCC > RC > LPC_38 > LPC_39         (4-4)

except for the max Accuracy (%), where RC is worse than LPC_38; hence, the order of the best performance of the six features is

    PLP > MFCC > LPCC > LPC_38 > RC > LPC_39         (4-5)

where PLP still performs better than the other features in monophone-based speaker-independent speech recognition.

Fig.4-7 Monophone-based HMM experiment (a) Average Correct (%) (b) Average Accuracy (%) (c) Max Correct (%) (d) Max Accuracy (%)


4.3 Syllable-based Experiments

The purpose of this experiment is to examine the performance of different features when applied to word-level HMM speaker-independent speech recognition. The word-level HMM is feasible when the recognition task is small; hence, the connected-digits corpus is employed to train the word-level HMMs, which are then utilized to recognize the connected-digits sentences in this thesis.

4.3.1

Syllable-based HMM used on connected-digits corpus

The connected-digits sentences are composed of arbitrary combinations of the digits “0, 1, 2, 3, 4, 5, 6, 7, 8, and/or 9”. Hence, the total number of word-level HMMs needed is 12, including the 10 digits, the silence model “sil”, and the short pause model “sp”. The number of states of the HMMs is defined in Table 4-14.

The training database is selected from the connected-digits database mentioned in 4.1.2: 800 files (400 spoken by 40 males and 400 by 40 females) are used to train the syllable-based HMMs. The remaining 199 files (99 spoken by 10 females and 100 by 10 males) are used for testing the performance of the different features in syllable-based HMM speaker-independent speech recognition.

Table 4-14 Definition of HMM used in syllable-based experiment

Number of syllable-based HMMs          12 (10 digits, “sp” and “sil”)
Number of states per HMM               8 (first and last states are null)
Number of Gaussian mixtures per state  4

4.3.2

Experiments

The parameters of front-end processing are set the same as in Table 4-11. The features adopted in the experiment, listed in Table 4-15, are the same as those used in the monophone-based experiments. The flow chart of training the syllable-based HMMs is shown in Fig. 4-8, where the digits “1, 2, 3, 4, 5, 6, 7, 8, 9, and 0” are denoted by “yi, er, san, si, wu, liu, qi, ba, jiu, and ling,” respectively. It is noted that the boundary information of this corpus is available. Therefore, the training procedure differs from that of the experiment in 4.2, which has no boundary information; the details of the difference between them have been introduced in section 3.3. In practice, boundary information is beneficial for training: the HMMs can be trained more precisely with it. In addition, the number of HMMs is smaller than in the previous experiment. The recognition results are therefore expected to be much higher than the results in 4.2.3.

Fig.4-9 shows the testing procedure of the syllable-based recognition.

Table 4-16 shows the experimental results of the different features on connected-digits speaker-independent recognition. The performance of the features is generally ordered from best to worst as

    PLP, MFCC > LPCC > RC > LPC_38 > LPC_39          (4-6)

where PLP and MFCC are comparable in maximum Acc (%), which is the major criterion for judging the performance of the models.

Table 4-15 Six different features adopted in this thesis

Feature                                          Order   Number of filter banks   Energy
Linear Predictive Coefficients (LPC_39)          39      -                        √
Linear Predictive Coefficients (LPC_38)          38      -                        √
Linear Predictive Reflection Coefficients (RC)   39      -                        √
LPC Cepstrum Coefficients (LPCC)                 39      -                        √
Mel-Frequency Cepstral Coefficients (MFCC)       39      26                       √
Perceptual Linear Prediction Coefficients (PLP)  39      26                       √

Fig.4-8 Flow chart of training the syllable-based HMMs

[Flow chart: connected-digits corpus with boundary information → front-end processing (pre-emphasis, frame blocking, Hamming window) → feature extraction → model training with transcriptions → 12 HMM models (yi, er, san, …, ling, sp, sil)]

Table 4-16 Comparison of the Corr (%) and Acc (%) of different features

Number of    LPC_38       LPC_39       RC           LPCC         MFCC         PLP
iterations   Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc    Corr  Acc
1            79.7  77.1   80.5  77.6   90.9  89.7   94.3  93.8   96.5  95.9   96.6  96.1
2            84.0  80.8   83.5  80.3   94.3  92.9   96.6  96.3   98.2  98.0   98.4  98.1
3            85.0  81.8   84.3  81.2   94.7  93.3   97.1  96.9   98.4  98.2   98.5  98.4
4            84.8  81.9   84.6  81.6   94.9  93.5   97.2  97.1   98.6  98.4   98.5  98.3
5            85.2  82.3   84.5  81.4   94.8  93.4   97.3  97.1   98.7  98.5   98.5  98.4
6            85.2  82.4   84.6  81.6   94.9  93.6   97.2  97.0   98.7  98.5   98.6  98.5
7            85.3  82.6   84.7  81.7   95.0  93.8   97.2  97.0   98.7  98.5   98.6  98.5
8            85.3  82.7   84.6  81.6   94.8  93.7   97.3  97.1   98.6  98.5   98.6  98.5
9            85.4  82.9   84.3  81.2   95.0  93.9   97.3  97.2   98.6  98.5   98.6  98.5
10           85.6  83.2   84.5  81.3   95.1  93.9   97.2  97.1   98.6  98.5   98.5  98.4

Fig.4-9 Flow chart of testing the syllable-based HMMs

[Flow chart: connected-digits corpus → front-end processing (pre-emphasis, frame blocking, Hamming window) → feature extraction → speech recognition using the 12 HMM models, grammar, and transcriptions of the test corpus → results analysis → recognition rate]

Comparisons of the different features through the average and the best Correct (%) and Accuracy (%) are shown in Fig. 4-11. In this case, PLP and MFCC are both good choices for connected-digits speaker-independent recognition due to their high recognition rates (Correct (%) and Accuracy (%)). The LPC_38 performs better than the LPC_39 since the sampling frequency of the speech is 16 kHz: from the guideline for selecting the filter order p introduced in Chapter 2, the value of p should be chosen as 18-20 to represent the characteristics of the filter. Hence, the recognition rate of LPC_38 (p=18) is higher than that of LPC_39. As for the reflection coefficients (RC), the performance is much better than that of LPC_38. From the results of the two experiments, it can be inferred that the RC is more suitable for representing the speech signal in small-vocabulary speech recognition.

Fig.4-10 Comparison of the different features (a) Correct (%) (b) Accuracy (%)
