
Chapter 2 Framework of the Acoustic-Phonetics and SONFIN Based System

2.4 Acoustic Characteristic Checking

2.4.3 Modelling Acoustic Characteristics

Here, we introduce how the formant trajectory model of each vowel category is constructed. We used histogram analysis to analyze these acoustic characteristics. The speech corpus used for the analysis is the training set of the TIMIT database; a detailed description of TIMIT is given in the next chapter. First, we used the SHR-based PDA to reveal the relationship between the fundamental frequency and the gender of the speaker, with the SHR threshold chosen as 0.6. Fig. 13 shows the distributions of fundamental frequency for men and women.

Fig. 13 The distribution of F0 for men and women
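As a concrete illustration of this analysis, the following is a minimal sketch of the histogram computation, assuming the per-token fundamental-frequency values have already been estimated by the SHR-based PDA (SHR threshold 0.6) and grouped by the speakers' gender labels in the TIMIT training set; the array names and bin settings are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_f0_histograms(f0_male, f0_female, bin_width_hz=10):
    """Plot the F0 distributions for male and female speakers (cf. Fig. 13)."""
    bins = np.arange(50, 400 + bin_width_hz, bin_width_hz)
    plt.hist(f0_male, bins=bins, alpha=0.5, label="male speakers")
    plt.hist(f0_female, bins=bins, alpha=0.5, label="female speakers")
    plt.xlabel("fundamental frequency (Hz)")
    plt.ylabel("number of vowel tokens")
    plt.legend()
    plt.show()

# Hypothetical usage: F0 values (Hz) estimated by the SHR-based PDA for
# vowel tokens spoken by male and female TIMIT speakers.
# plot_f0_histograms(f0_male_tokens, f0_female_tokens)
```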

From the analysis, a threshold of 140 Hz is chosen. Patterns whose fundamental frequency is below the threshold are assigned to the men-like category, and the rest to the women-like category. We then constructed the trajectory model for each vowel class, separately for the men-like and women-like categories. The models are derived from histogram analysis of the individual vowel categories and consist of three time-dependent sub-models sampled at 20%, 50%, and 80% of the vowel duration. In each sub-model, we chose pass-bands for F1 and F2, which indicate the frequency ranges in which formant candidates are allowed to occur; in addition, a stop-band is chosen in which the occurrence of formant candidates is prohibited for this vowel category. Take the vowel /iy/ of the men-like category as an example: in sub-model 1, the pass-band for F1 is chosen as 400~500 Hz, the pass-band for F2 as 1600~2500 Hz, and the stop-band as 700~1200 Hz. An illustration of the model of a specific vowel category is shown below.

Fig. 14 Model of /iy/
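To make the model concrete, below is a minimal sketch of how such a trajectory model could be stored and checked against measured formant candidates. The band limits repeat the /iy/ (men-like) sub-model 1 values quoted above; the data structure and function names are illustrative assumptions, not taken from the thesis.

```python
# One sub-model holds the F1/F2 pass-bands and the stop-band (all in Hz).
IY_MALE_MODEL = [
    # sub-model 1: sampled at 20% of the vowel duration
    {"F1_pass": (400, 500), "F2_pass": (1600, 2500), "stop": (700, 1200)},
    # sub-models 2 and 3 (50% and 80%) would be filled in the same way.
]

def in_band(freq, band):
    low, high = band
    return low <= freq <= high

def check_submodel(f1_candidate, f2_candidate, submodel):
    """True if the candidates are consistent with one sub-model: F1 and F2
    fall inside their pass-bands and neither falls inside the stop-band."""
    ok_pass = (in_band(f1_candidate, submodel["F1_pass"])
               and in_band(f2_candidate, submodel["F2_pass"]))
    ok_stop = not (in_band(f1_candidate, submodel["stop"])
                   or in_band(f2_candidate, submodel["stop"]))
    return ok_pass and ok_stop

# Example: formant candidates measured at 20% of the vowel duration.
print(check_submodel(450, 2100, IY_MALE_MODEL[0]))   # True
print(check_submodel(450, 1000, IY_MALE_MODEL[0]))   # False (F2 falls in the stop-band)
```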

Much acoustic evidence can be found in these models. For example, as shown in Fig. 15, the difference between /iy/ and /ih/ is reflected in the locations of the pass-bands and stop-bands. In Fig. 16, the temporal trajectories of the diphthongs /ay/ and /oy/ can be seen. These phenomena serve as the cues for acoustic characteristic checking.

Fig. 15 An illustration of /iy/ and /ih/

Fig. 16 An illustration of /ay/ and /oy/

Chapter 3 Experiments

3.1 Introduction

In the previous chapter, we described the structure of the proposed vowel recognition system in detail. In order to investigate the contribution and correlation of the different techniques we applied, several sets of experiments were conducted. In the first set of experiments, the narrow-band pattern-matching model proposed by Hillenbrand [5] was evaluated. The second set of experiments was designed and implemented with DCSCs as the feature set and SONFIN as the classifier; in these experiments, we tried to show the superior ability of the neural fuzzy inference network for this classification task. In the third set of experiments, we adopted the AE-DCSCs as the feature set and again SONFIN as the classifier; the improvement obtained by the feature enhancement techniques is shown there. Finally, in the last set of experiments, the acoustic-checking technique was applied to the system, and the experimental results show the contribution of this process.

For these experiments (except for the pattern-matching model), the following 101 features are first computed for each token. The first 45 features encode 15 DCTC trajectories over 100 ms centered at each vowel midpoint (20-ms frame length, 5-ms frame spacing, 15 DCTCs each encoded with three DCSCs, with a warping factor of two). This part of the features is intended to represent the relatively steady-state portion of the vowel token (i.e., only the central segment is encoded even when the duration of the vowel is greater than 100 ms). The next 56 features encode eight DCTC trajectories over 300 ms centered at each vowel midpoint (10-ms frame length, 2.5-ms frame spacing, eight DCTCs each encoded with seven DCSCs, with a time warping factor of four). This part of the features tries to represent the co-articulation phenomenon of the vowel token (i.e., the duration of the vowel is usually less than 300 ms, so the preceding and following consonants are included). Thus, these features include varying degrees of time-frequency resolution, based on the conjecture that different vowel pairs might be best discriminated with features varying with respect to these resolutions.
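As a rough illustration of how such features can be computed, the following sketch derives cosine-transform coefficients of each frame's log spectrum (DCTCs) and then expands each coefficient trajectory over time with a cosine series (DCSCs). It deliberately omits the nonlinear frequency and time warping used in the thesis and in Zahorian's work, and the frame counts and spectrum size are placeholder values.

```python
import numpy as np

def cosine_basis(n_points, n_terms):
    """Basis matrix whose k-th row is cos(pi * k * (t + 0.5) / n_points)."""
    t = (np.arange(n_points) + 0.5) / n_points
    return np.array([np.cos(np.pi * k * t) for k in range(n_terms)])

def dctc_per_frame(log_spectra, n_dctc):
    """DCTCs: cosine-transform coefficients of each frame's log spectrum.
    log_spectra: (n_frames, n_bins) array."""
    basis = cosine_basis(log_spectra.shape[1], n_dctc)
    return log_spectra @ basis.T            # (n_frames, n_dctc)

def dcsc_over_time(dctc_traj, n_dcsc):
    """DCSCs: cosine expansion of each DCTC trajectory over the frames."""
    basis = cosine_basis(dctc_traj.shape[0], n_dcsc)
    return basis @ dctc_traj                # (n_dcsc, n_dctc)

# Example for the first block of features: 15 DCTCs x 3 DCSCs = 45 features
# computed from frames spanning 100 ms around the vowel midpoint.
log_spectra = np.random.randn(21, 128)      # hypothetical 21 frames, 128 bins
features_45 = dcsc_over_time(dctc_per_frame(log_spectra, 15), 3).ravel()
print(features_45.shape)                    # (45,)
```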

3.2 Experiment Database

The TIMIT acoustic-phonetic speech corpus is used for all training, development, and performance evaluation experiments. This corpus is widely used throughout the world and provides a standard that permits direct comparison of experimental results obtained by different methodologies. The entire corpus consists of 10 sentences recorded from each of 630 speakers of American English. Two of the sentences (sa) are identical for all speakers. Five of the sentences (sx) for each speaker are drawn from a set of 450 phonetically compact sentences hand-designed at MIT; the emphasis behind these sentences is on covering a wide range of phonetic pairs, and each of the 450 (sx) sentences is spoken by seven different speakers. The final three sentences (si) for each speaker are chosen at random from the Brown Corpus and are unique to each speaker. The speakers in the corpus consist of males and females (at a ratio of roughly two to one) from eight predefined dialect regions of the United States.

For all experiments, the data is divided into distinct units known as the training set and the test set. The training set is used to estimate the parameters for each of the phonetic models to be used in the experiments.

The test set consists of the actual test data for the classification or recognition performance evaluation. Speakers from the training and test sets never overlap. This is important to ensure fair experimental conditions. Both sets are generally chosen to reflect a well balanced representation of the speakers in the corpus. Most of the training and test sets utilized in this work are selected specifically because they are identical to training and test sets used in other work. Therefore, the results can be directly compared to those obtained and reported in the literature.

The sets used in this thesis are listed along with some of their statistics in Table 1. The experiments are performed with vowels extracted from the TIMIT database. A total of 16 vowels are used, encompassing 13 monophthongal vowels /iy/, /ih/, /eh/, /ey/, /ae/, /aa/, /ah/, /ao/, /ow/, /uh/, /ux/, /er/, /uw/, and diphthongs /ay/, /oy/, /aw/.

Table 1 Data set used for the experiments

Data Set | Training Speakers | Test Speakers | Training Tokens | Test Tokens
SX       | 450               | 50            | 17040           | 1786

3.3 Experiment Results

Experiment Set I

In this set of experiments, the narrow-band pattern-matching model proposed by Hillenbrand [5] was evaluated on the TIMIT database.

Hillenbrand summarized the cues from research on human vowel perception and proposed the pattern-matching model algorithm, under the assumption that the human perception mechanism is a narrow-band pattern-matching procedure. Several experiments by Hillenbrand were used to verify this assumption. The original corpus in those experiments is the database recorded by Hillenbrand et al.; it consisted of 1668 /hVd/ utterances spoken by 139 speakers.
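As an illustration only, the following is a minimal sketch of a narrow-band pattern-matching classifier in this spirit: each vowel category is represented by stored narrow-band spectral templates sampled at a few points in the vowel, and a test token takes the label of the nearest template. This is a simplified stand-in, not the exact algorithm of [5].

```python
import numpy as np

def match_vowel(test_spectra, templates):
    """
    test_spectra: (n_slices, n_bins)  narrow-band spectra sampled at a few
                  points across the vowel (e.g. 20%, 50%, 80% of its duration)
    templates:    dict mapping vowel label -> array of shape
                  (n_templates, n_slices, n_bins)
    """
    best_label, best_dist = None, np.inf
    for label, temp in templates.items():
        # distance to the nearest stored template of this category
        dists = np.linalg.norm(temp - test_spectra, axis=(1, 2))
        if dists.min() < best_dist:
            best_label, best_dist = label, dists.min()
    return best_label

# Hypothetical usage with random data: 3 vowel categories, 5 templates each,
# 3 spectral slices, 64 narrow-band bins.
rng = np.random.default_rng(0)
templates = {v: rng.normal(size=(5, 3, 64)) for v in ["iy", "ih", "eh"]}
print(match_vowel(rng.normal(size=(3, 64)), templates))
```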

The accuracy rate originally reported by Hillenbrand was 91.4%. However, in experiment set I we used the TIMIT database, and the accuracy rate was 32.06%, which, as expected, was much lower.

Experiment Set II

In this set of experiments, we used DCSCs as the feature set and SONFIN as the classifier. Several parameters of SONFIN were adjusted so that SONFIN could find its optimal structure by itself. There are 21 rules in the best structure of SONFIN. The recognition rate of this structure is 72.72%, which is higher than other methodologies in the literature, including the 71.50% achieved by the system proposed by Zahorian et al. [13].

Experiment Set III

The acoustically enhanced features, AE-DCSCs, are adopted in experiment set III in order to evaluate the improvement brought by the feature modification. Again, the neural fuzzy classifier SONFIN is used and its parameters are adjusted. There are 14 rules in the best structure of SONFIN for the AE-DCSCs. The recognition rate of this structure is 74.41%, which is higher than that in experiment set II.

Experiment Set IV

Finally, the acoustic-checking procedure is applied to the results of the AE-DCSC and SONFIN structure. The threshold of the confidence factor is chosen as 0.4, and the experimental result shows a recognition rate of 74.75%.
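A minimal sketch of this decision flow is given below, assuming the SONFIN output scores, the token's fundamental frequency, and its formant track are available. `check_vowel_model` is a hypothetical helper standing in for the gender-dependent pass-band/stop-band check sketched in Chapter 2, and all names are illustrative rather than taken from the thesis.

```python
CONFIDENCE_THRESHOLD = 0.4
F0_THRESHOLD_HZ = 140.0

def classify_token(sonfin_scores, f0, formant_track, check_vowel_model):
    """sonfin_scores: dict mapping vowel label -> SONFIN output score."""
    best = max(sonfin_scores, key=sonfin_scores.get)
    if sonfin_scores[best] >= CONFIDENCE_THRESHOLD:
        return best                          # confident: keep SONFIN's decision
    # Ambiguous case: check the top-scoring candidates against the
    # gender-dependent formant trajectory models.
    gender = "men-like" if f0 < F0_THRESHOLD_HZ else "women-like"
    ranked = sorted(sonfin_scores, key=sonfin_scores.get, reverse=True)
    for vowel in ranked:
        if check_vowel_model(vowel, gender, formant_track):
            return vowel
    return best                              # fall back to SONFIN's decision
```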

Table 2 Experiment results and comparison

Feature set             | Method                      | Accuracy
Narrow-Band Spectra     | Pattern-Matching Model      | 32.06%
DCSCs                   | Partitioned Neural Networks | 71.50%
DCSCs                   | SONFIN                      | 72.72%
Acoustic-Enhanced-DCSCs | SONFIN                      | 74.41%
Acoustic-Enhanced-DCSCs | SONFIN + Acoustic Checking  | 74.75%

3.4 Discussion

These experiments show that our proposed system and the applied techniques perform better than previously reported approaches in the literature.

In the first set of experiments, we showed the limited feasibility of the perception model by Hillenbrand when applied to the TIMIT data. In the second set of experiments, we evaluated the classification ability of the neural fuzzy classifier SONFIN by comparing it to the system proposed by Zahorian et al. (1999) [13], which also used DCSCs as the feature set but partitioned neural networks as the classifier. The experiment shows that the accuracy rate of SONFIN, 72.72%, is higher than the 71.50% of the system proposed by Zahorian, an improvement of about 1.2%.

The experimental results show the powerful classification ability of SONFIN. It uses a simpler structure and is easier to train than the PNN, yet performs better. The reasons may be its ability to partition the input/output space and the effective inference of its rules. Each rule constructed in SONFIN may divide the output space into as many parts as there are output classes, and multiple rules are integrated via fuzzy inference, which partitions the output space carefully and precisely.
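The following sketch illustrates, under simplifying assumptions, the kind of TSK-style fuzzy inference that underlies this rule integration: Gaussian membership functions give each rule a firing strength, and the normalized strengths blend the rule consequents (singleton consequents here) into one output score per class. SONFIN itself additionally grows its rules and tunes all of these parameters online, which is not shown; the shapes and values below are illustrative only.

```python
import numpy as np

def fuzzy_inference(x, centers, sigmas, consequents):
    """
    x:           (n_inputs,)             input feature vector
    centers:     (n_rules, n_inputs)     Gaussian membership centers
    sigmas:      (n_rules, n_inputs)     Gaussian membership widths
    consequents: (n_rules, n_outputs)    singleton consequent per rule
    """
    # Firing strength of each rule = product of its membership values.
    memberships = np.exp(-((x - centers) ** 2) / (2.0 * sigmas ** 2))
    firing = memberships.prod(axis=1)                    # (n_rules,)
    weights = firing / (firing.sum() + 1e-12)            # normalized strengths
    return weights @ consequents                         # (n_outputs,)

# Tiny example: 3 rules, 2 inputs, 16 output scores (one per vowel class).
rng = np.random.default_rng(0)
y = fuzzy_inference(rng.normal(size=2),
                    rng.normal(size=(3, 2)),
                    np.ones((3, 2)),
                    rng.normal(size=(3, 16)))
print(y.shape)    # (16,)
```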

The modified feature set proved more representative in experiment set III: the recognition rate of 74.41% is 1.69% higher than that in experiment set II. The experimental result shows that the idea of enhancing the acoustic characteristics in the spectral domain is feasible; the acoustic-enhancement procedure emphasizes the spectral harmonics and balances them, as suggested in acoustic-phonetic research [5]. The acoustic-checking procedure is applied in experiment set IV, and the result shows that the recognition rate is raised to 74.75%. Thus, the potential and effectiveness of the proposed system were verified.

Chapter 4 Conclusion

In this thesis, we attempted to develop a more robust speaker-independent automatic recognizer for English vowels. We integrated spectral-shape-based features and acoustic characteristics in our system. Moreover, several techniques are applied in this work.

First of all, we modify the gross shape of the spectrum, which is then encoded into the feature set of the token. In this phase, we try to enhance the spectral peaks by eliminating the variation between harmonics and to balance the amplitude differences by a spectrum-level normalization process. The DCTC is adopted to encode the spectrum with a nonlinear frequency warping according to the characteristics of human perception. In order to represent the temporal cues, we use the DCSC to encode the trajectory of the spectrum. A suitable time warping can be adjusted to better preserve the information within a finite feature dimension.

A neural fuzzy inference classifier called SONFIN is adopted in the proposed system as the main recognizer. SONFIN has the ability to construct its optimal structure by itself and can self-adjust its parameters, such as the membership functions and the consequent parameters.

Experiments showed that SONFIN has a simpler structure and better performance.

Finally, a formant checking procedure is performed in the system. The procedure is used to distinguish ambiguous cases according to their acoustic evidence. If the confidence factor from the SONFIN classification is not high enough, the recognition result is taken as ambiguous, and its acoustic cues, such as the fundamental frequency and the formant trajectory, are evaluated and checked against the vowel models. This procedure provides another view of the token and yields a more accurate recognition result. Many experiments based on the popular acoustic-phonetic database were conducted, and the results showed that our proposed system performs much better.

Bibliography

[1] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Prentice-Hall International, NJ, 1993.

[2] A. M. A. Ali, J. van der Spiegel, and P. Mueller, “Acoustic-Phonetic Features for the Automatic Classification of Stop-Consonants”, IEEE Trans. Speech and Audio Processing, vol. 9, pp. 833-841, 2001.

[3] G. Peterson and H. L. Barney, ‘‘Control methods used in a study of the vowels,’’ J. Acoust. Soc. Am. 24, 175–184, 1952.

[4] J. Hillenbrand, L. A. Getty, M. J. Clark, and K. Wheeler, ‘‘Acoustic characteristics of American English vowels,’’ J. Acoust. Soc. Am. 97, 3099–3111, 1995.

[5] J. Hillenbrand and R. A. Houde, “A narrow band pattern-matching model of vowel perception”, J. Acoust. Soc. Am. 113(2), 1044-1055, 2003.

[6] L. C. W. Pols, L. J. van der Kamp, and, R. Plomp, ‘‘Perceptual and physical space of vowel sounds,’’ J. Acoust. Soc. Am. 46, 458–467, 1969.

[7] Z. B. Nossair and S. A. Zahorian, “Dynamic spectral shape features as acoustic correlates for initial stop consonants,” J. Acoust. Soc. Amer., vol. 89, pp. 2978–2991, 1991.

[8] Z. B. Nossair, P. L. Silsbee, and S. A. Zahorian, “Signal modeling enhancements for automatic speech recognition,” in Proc. ICASSP’95, pp. 824–827.

[9] L. Rudasi and S. A. Zahorian, “Text-independent talker identification with neural networks,” in Proc. ICASSP’91, pp. 389–392.

[10] L. Rudasi and S. A. Zahorian, “Text-independent speaker identification using binary-pair partitioned neural networks,” in Proc. IJCNN’92, pp. IV: 679–684.

[11] S. A. Zahorian and A. J. Jagharghi, “Spectral-shape features versus formants as acoustic correlates for vowels,” J. Acoust. Soc. Amer., vol. 94, pp. 1966–1982, 1993.

[12] S. A. Zahorian, D. Qian, and A. J. Jagharghi, “Acoustic-phonetic transformations for improved speaker-independent isolated word recognition,” in Proc. ICASSP’91, pp. 561–564.

[13] S. A. Zahorian and Z. B. Nossair, “A partitioned neural network approach for vowel classification using smoothed time/frequency features”, IEEE Trans. Speech and Audio Processing, vol. 7, pp. 414–425, 1999.

[14] C. F. Juang and C. T. Lin, “An on-line self-constructing neural fuzzy inference network and its application,” IEEE Trans. Fuzzy Syst., vol. 6, pp. 12–32, 1998.

[15] C. T. Lin and C. S. G. Lee, Neural Fuzzy Systems: A Neural-Fuzzy Synergism to Intelligent Systems. Englewood Cliffs, NJ: Prentice-Hall, 1996.

[16] C. T. Lin, Neural Fuzzy Control Systems with Structure and Parameter Learning. Singapore: World Scientific, 1994.

[17] W. D. Goldenthal, “Statistical trajectory models for phonetic recognition,” Ph.D. dissertation, Mass. Inst. Technol., Cambridge, MA, 1994.

[18] W. D. Goldenthal and J. R. Glass, “Modeling spectral dynamics for vowel classification,” in Proc. EUROSPEECH’93, pp. 289–292.

[19] H. Leung and V. Zue, “Some phonetic recognition experiments using artificial neural nets,” in Proc. ICASSP’88, pp. I: 422–425.

[20] H. Leung and V. Zue, “Phonetic classification using multi-layer perceptrons,” in Proc. ICASSP’90, pp. I: 525–528.

[21] H. M. Meng and V. Zue, “Signal representations for phonetic classification,” in Proc. ICASSP’91, pp. 285–288.

[22] H. Gish and K. Ng, “A segmental speech model with applications to word spotting,” in Proc. ICASSP’93, pp. II-447–II-450.

[23] M. S. Phillips, “Speaker independent classification of vowels and diphthongs in continuous speech,” in Proc. 11th Int. Cong. Phonetic Sciences, 1987, vol. 5, pp. 240–243.

[24] R. A. Cole and Y. K. Muthusamy, “Perceptual studies on vowels excised from continuous speech,” in Proc. ICSLP’92, pp. 1091–1094.

[25] B. S. Rosner and J. B. Pickering, “Vowel Perception and Production”, Oxford U.P., Oxford, 1994

[26] D. H. Klatt, ‘‘Prediction of perceived phonetic distance from critical-band spectra: A first step,’’ IEEE ICASSP, 1278–1281, 1982.

[27] S. K. Pal and S. Mitra, “Multilayer perceptron, fuzzy sets, and classification,” IEEE Trans. Neural Networks, vol. 3, pp. 683–697, 1992.

[28] S. Mitra and S. K. Pal, “Fuzzy multilayer perceptron, inferencing and rule generation,” IEEE Trans. Neural Networks, vol. 6, pp. 51–63, 1995.

[29] S. Mitra, R. K. De, and S. K. Pal, “Knowledge-Based Fuzzy MLP for Classification and Rule Generation,” IEEE Trans. Neural Networks, vol. 8, pp. 1338–1350, 1997.

[30] X. Sun, “Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio,” in Proc. ICASSP 2002, vol. 1, pp. 333–336, 2002.
