
Conclusions and Future Work

5.2 Future Work

Through the consideration of personalization and humanization, mobile communication devices can evolve to meet people's needs. DSR facilitates the creation of an exciting new set of applications and services combining voice and data. The proposed solutions for robust DSR systems are only the beginning in the development of human-machine interface services over wireless networks. Each of our proposed algorithms may be examined further for possible extensions and contributions. This section briefly outlines some directions for future work.

The goal of voice conversion is to control speech individuality or to add individual cues to speech processing algorithms. In this thesis, voice conversion technologies have been applied to convert the voice quality of hearing-impaired speech toward that of normal speech.

The key strategy is the detection and exploitation of characteristic features at the spectral and prosodic levels. When voice personality can be characterized and exploited more accurately, more technologies can be integrated into voice-controlled services. For example, as personal communication systems become pervasive in mobile financial transactions and information retrieval services, the utility of speaker identification and authentication based on voice individuality increases. A speaker identification system based on Gaussian mixture models for characterizing spectral shapes can attain high identification accuracy [49]. Further, the GMM framework allows direct integration with robust, well-developed speech recognition systems. In addition, voice conversion can be applied to computer-assisted language learning. A learning system needs to detect and correct errors by mining the speech signal for information about a learner's deviations from reference speakers' pronunciation. Recent research has found that a better solution for pronunciation learning should address not only phone articulation but also speech prosody.
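The GMM-based identification scheme of [49] can be illustrated with a minimal sketch: each speaker is represented by a diagonal-covariance Gaussian mixture over spectral feature vectors, and a test utterance is assigned to the speaker whose model gives the highest average per-frame log-likelihood. The function names and the two-speaker setup below are illustrative assumptions, not code from this thesis; model training (e.g., by the EM algorithm [43]) is omitted.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Average per-frame log-likelihood of feature vectors X (T x D)
    under a diagonal-covariance GMM with M components."""
    diff = X[:, None, :] - means[None, :, :]                 # (T, M, D)
    log_norm = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # log sum_m w_m N(x; mu_m, var_m) via log-sum-exp for stability
    log_w = np.log(weights)[None, :] + log_norm              # (T, M)
    frame_ll = np.logaddexp.reduce(log_w, axis=1)            # (T,)
    return frame_ll.mean()

def identify(X, speaker_models):
    """Return the speaker whose GMM scores the test features highest."""
    scores = {spk: gmm_log_likelihood(X, *model)
              for spk, model in speaker_models.items()}
    return max(scores, key=scores.get)
```

Because per-frame log-likelihoods are averaged, the decision is insensitive to utterance length, which is one reason the GMM framework integrates cleanly with frame-based speech recognition front-ends.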

In this thesis, a JSCD scheme that exploits the combined source and channel statistics as a priori information is proposed for channel error mitigation. The basic strategy is to exploit the large amount of residual redundancy in the DSR features. Similar analyses have indicated that substantial residual redundancy also exists in source-encoded speech parameters. Therefore, the proposed JSCD scheme can be applied to speech transmission systems to attain robust performance over wireless networks. The channel information considered in the decoding algorithm is the error statistics averaged over a training sequence. In real-world communication, however, the channel statistics also vary with time. Adaptively exploiting time-varying channel information is therefore an important issue for the design of JSCD algorithms.
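The core idea of combining a source prior with channel statistics can be sketched as a MAP decision over quantizer indices. The example below is a simplified illustration, not the thesis's decoder: it assumes a memoryless binary symmetric channel with a known bit-error rate, whereas the proposed scheme uses error statistics of a bursty channel, and it assumes a prior table estimated from the residual redundancy of the DSR features.

```python
import numpy as np

def map_decode(received_bits, source_prior, ber):
    """MAP decision for one quantizer index sent over a binary
    symmetric channel with bit-error rate `ber`.
    source_prior[i] is the a priori probability of index i
    (assumed nonzero), estimated from residual source redundancy."""
    n_bits = len(received_bits)
    best_idx, best_post = None, -np.inf
    for idx, prior in enumerate(source_prior):
        sent = [(idx >> b) & 1 for b in reversed(range(n_bits))]  # MSB first
        errors = sum(s != r for s, r in zip(sent, received_bits))
        # log P(received | sent) for a memoryless BSC
        log_lik = errors * np.log(ber) + (n_bits - errors) * np.log(1.0 - ber)
        post = np.log(prior) + log_lik                # log P(idx) + log-likelihood
        if post > best_post:
            best_idx, best_post = idx, post
    return best_idx
```

With a uniform prior this reduces to hard-decision (minimum-Hamming-distance) decoding; a skewed prior lets a likely source symbol override a received pattern that is one bit-flip away, which is exactly the error-mitigation effect the residual redundancy provides.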

Bibliography

[1] D. Goddeau, W. Goldenthal, and C. Weikart, "Deploying Speech Applications over the Web," in EUROSPEECH-1997, pp. 685-688, 1997.

[2] J. C. Junqua and J. P. Haton, "Robustness in Automatic Speech Recognition: Fundamentals and Applications," Kluwer Academic Publishers, Boston, 1996.

[3] H. C. Choi and R. W. King, "On the Use of Spectral Transformation for Speaker Adaptation in HMM based Isolated-Word Speech Recognition," Speech Communication, vol. 17, pp. 13-143, 1995.

[4] C. H. Lee, "On Stochastic Feature and Model Compensation Approaches to Robust Speech Recognition," Speech Communication, vol. 25, pp. 29-47, 1998.

[5] C. H. Lee, C. H. Lin and B. H. Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Trans. Signal Processing, vol. 39, pp. 806-814, May 1996.

[6] C. J. Leggetter and P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models," Computer Speech and Language, vol. 9, pp. 171-185, Apr. 1995.

[7] A. Sankar and C. H. Lee, "A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition," IEEE Trans. Speech and Audio Processing, vol. 4, pp. 190-202, May 1996.

[8] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proceedings ICASSP'98, pp. 285-288, 1998.

[9] N. Bi and Y. Qi, "Application of speech conversion to alaryngeal speech enhancement," IEEE Trans. Speech Audio Processing, vol. 5, pp. 97-105, 1997.

[10] M. Abe, S. Nakamura, K. Shikano and H. Kuwabara, "Voice conversion through vector quantization," in Proceedings ICASSP'88, pp. 655-658, 1988.

[11] Y. Stylianou, O. Cappe and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Processing, vol. 6, pp. 131-142, 1998.

[12] R. N. Ohde and D. J. Sharf, "Phonetic Analysis of Normal and Abnormal Speech," Merrill, New York, 1992.

[13] A. Bernard and A. Alwan, "Low-bitrate distributed speech recognition for packet-based and wireless communication," IEEE Trans. Speech and Audio Processing, vol. 10, no. 8, pp. 570-579, 2002.

[14] C. Boulis, M. Ostendorf, E. Riskin and S. Otterson, "Graceful degradation of speech recognition performance over packet-erasure networks," IEEE Trans. Speech and Audio Processing, vol. 10, no. 8, pp. 580-590, 2002.

[15] A. M. Peinado, V. Sanchez, J. L. Perez-Cordoba and A. Torre, "HMM-based channel error mitigation and its application to distributed speech recognition," Speech Communication, vol. 41, pp. 549-561, 2003.

[16] H. U. Reinhold and I. Valentin, "Soft features for improved distributed speech recognition over wireless networks," in Proc. Int. Conf. Spoken Language Processing, pp. 2125-2128, Jeju Island, Korea, 2004.

[17] T. Fingscheidt and P. Vary, "Softbit speech decoding: a new approach to error concealment," IEEE Trans. Speech and Audio Processing, vol. 9, no. 3, pp. 240-251, 2001.

[18] L. N. Kanal and A. R. K. Sastry, "Models for channels with memory and their applications to error control," in Proc. IEEE, vol. 66, pp. 724-744, 1978.

[19] E. N. Gilbert, "Capacity of a burst-noise channel," The Bell System Technical Journal, vol. 39, pp. 1253-1265, 1960.

[20] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, Aug. 1980.

[21] V. V. Digalakis, L. G. Neumeyer and M. Perakakis, "Quantization of cepstral parameters for speech recognition over the World Wide Web," IEEE Journal on Selected Areas in Communications, vol. 17, pp. 82-90, Jan. 1999.

[22] ETSI ES 202 212 v1.1.1. Distributed speech recognition; extended advanced front-end feature extraction algorithm; compression algorithms; back-end speech reconstruction algorithm, Nov. 2003.

[23] L. S. Lee, "Voice dictation of Mandarin Chinese," IEEE Signal Processing Magazine, pp. 63-101, 1997.

[24] L. R. Bahl, J. Cocke, F. Jelinek and J. Raviv, "Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate," IEEE Trans. Inform. Theory, vol. IT-20, pp. 284-287, 1974.

[25] S. Lin and D. J. Costello, "Error Control Coding," Prentice Hall, New Jersey, 2004.

[26] L. N. Kanal and A. R. K. Sastry, "Models for Channels with Memory and their Applications to Error Control," in Proc. IEEE, vol. 66, pp. 724-744, 1978.

[27] E. N. Gilbert, "Capacity of a Burst-noise Channel," The Bell System Technical Journal, vol. 39, pp. 1253-1265, 1960.

[28] W. Turin, "MAP Symbol Decoding in Channels with Error Bursts," IEEE Trans. Inform. Theory, vol. 47, no. 5, pp. 1832-1838, 2001.

[29] CoCentric System Studio Reference Design Kits, Mountain View, CA: Synopsys, Inc., 2003.

[30] J. Y. Chouinard, M. Lecours and G. Y. Delisle, "Estimation of Gilbert's and Fritchman's Models Parameters Using the Gradient Method for Digital Mobile Radio Channels," IEEE Trans. Veh. Technol., vol. 37, no. 3, pp. 158-166, Aug. 1988.

[31] R. J. McAulay and T. F. Quatieri, "Speech analysis-synthesis based on a sinusoidal representation," IEEE Trans. Acoust. Speech Sig. Process., vol. 34, pp. 744-754, 1986.

[32] I. Hochberg, H. Levitt and M. J. Osberger, "Speech of The Hearing Impaired: Research, Training, and Personnel Preparation," University Park Press, Maryland, 1983.

[33] R. Monsen, "Toward Measuring How Well Hearing-impaired Children Speak," Journal of Speech and Hearing Research, vol. 21, pp. 197-219, 1978.

[34] N. S. McGarr and K. S. Harris, "Articulatory control in deaf speakers," in I. Hochberg, H. Levitt and M. J. Osberger (Eds.), Speech of the Hearing Impaired, University Park Press, Baltimore, 1983.

[35] M. J. Osberger and H. Levitt, "The Effect of Timing Errors on the Intelligibility of Deaf Children's Speech," J. Acoust. Soc. Amer., vol. 66 (5), pp. 1316-1324, 1979.

[36] B. L. Chang, "The Perceptual Analysis of Speech Intelligibility of Students with Hearing Impairments," Bulletin of Special Education, vol. 18, pp. 53-78, 2000.

[37] B. G. Lin and Y. C. Huang, "An Analysis on The Hearing Impaired Students' Chinese Language Abilities and its Error Patterns," Bulletin of Special Education, vol. 15, pp. 109-129, 1997.

[38] X. S. Shen and M. Lin, "A perceptual study of Mandarin tones 2 and 3," Language and Speech, vol. 34(2), pp. 145-156, 1991.

[39] R. J. McAulay and T. F. Quatieri, "Sinusoidal Coding," in Speech Coding and Synthesis, Elsevier, Amsterdam, 1995.

[40] T. F. Quatieri and R. J. McAulay, "Shape Invariant Time-scale and Pitch Modification of Speech," IEEE Trans. Signal Processing, vol. 40 (3), pp. 497-510, 1992.

[41] A. V. Oppenheim and R. W. Schafer, "Discrete-time Signal Processing," Prentice Hall, New Jersey, 1989.

[42] Y. Stylianou, O. Cappe and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech Audio Processing, vol. 6, pp. 131-142, 1998.

[43] A. Dempster, N. Laird and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," J. Royal Stat. Soc., vol. 39, pp. 1-38, 1977.

[44] S. H. Chen and Y. R. Wang, "Vector quantization of pitch information in Mandarin speech," IEEE Trans. Communications, vol. 38, pp. 1317-1320, 1990.

[45] Y. Linde, A. Buzo and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. Communications, vol. 28, pp. 84-95, 1980.

[46] L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition," Prentice Hall, New Jersey, 1993.

[47] P. C. Lee, "A Study on Acoustic Characteristic of Mandarin Affricates of Hearing-impaired Speech," Bulletin of Special Education and Rehabilitation, vol. 7, pp. 79-112, 1999.

[48] R. A. Johnson and G. K. Bhattacharyya, "Statistics: Principles and Methods," John Wiley and Sons, New York, 1996.

[49] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Trans. Speech and Audio Processing, vol. 3, pp. 72-83, 1995.
