Future Work - Dynamic Quantization Techniques for Robust and Distributed Speech Recognition Systems

Although many issues of environmental noise and transmission errors have been investigated within the dynamic quantization framework, several important topics remain open for further research. Each of the approaches proposed in the five major chapters of this thesis could be studied further. The following list outlines some open issues in the dynamic quantization framework:

1. Extension of the quantization distortion measure to obtain discriminative representative codewords for speech recognition,

2. Better integration of the sources of uncertainty in the distributed speech recognition framework,

3. Joint optimization of dynamic quantization (source coding) and channel coding,

4. Combination of various front-end feature processing approaches to improve the accuracy of the speech recognition system.

Based on the results and techniques that we have investigated and developed, our current work on dynamic quantization could be extended in several directions.

In Chapter 3, we jointly considered compression and robustness, and the resulting integration can be applied to both robust and distributed speech recognition. Another interesting idea is to jointly consider compression and discrimination.

In Chapter 3, the hidden codebook on the vertical scale is derived from uniform, Laplacian and Gaussian distributions via the Lloyd-Max algorithm, which aims to minimize the overall quantization distortion. Every data point is treated with the same importance in the quantization process. However, some regions of the feature space may be more critical than others. A critical region has a smaller margin among HMM models, so even a small distortion for samples in such a region could cause recognition errors. Therefore, the samples in critical regions should be treated carefully to enlarge the margin among HMM models. In addition, quantization distortion in some features may matter more than distortion in others. The sensitivity to quantization distortion of different feature parameters should be integrated into the quantization distortion measure to optimize the recognition performance.
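One way to picture a distortion measure that does not treat every data point equally is a Lloyd-Max design with per-sample weights. The following is only a minimal sketch under stated assumptions: the function name `weighted_lloyd_max` and the `weights` argument are hypothetical, and how the weights would be derived (for example, from the margin among HMM models) is not specified in this thesis.

```python
import numpy as np

def weighted_lloyd_max(samples, weights, num_levels, num_iters=50):
    """Design a 1-D quantizer minimizing sum_i w_i * (x_i - q(x_i))**2.

    `weights` is a hypothetical per-sample importance; equal weights
    give the ordinary Lloyd-Max design on the empirical distribution.
    """
    samples = np.asarray(samples, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Initialize the codewords at evenly spaced quantiles of the data.
    probs = (np.arange(num_levels) + 0.5) / num_levels
    codewords = np.quantile(samples, probs)
    for _ in range(num_iters):
        # Nearest-neighbour condition: cell boundaries are the midpoints.
        boundaries = (codewords[:-1] + codewords[1:]) / 2.0
        cells = np.searchsorted(boundaries, samples)
        # Centroid condition: weighted mean of the samples in each cell.
        for k in range(num_levels):
            mask = cells == k
            total = weights[mask].sum()
            if total > 0:  # an empty cell keeps its previous codeword
                codewords[k] = np.dot(weights[mask], samples[mask]) / total
    return codewords
```

Raising the weight of samples in a critical region pulls the nearby codewords toward them, which reduces the distortion there at the cost of a larger distortion elsewhere.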

In Chapter 4, we jointly considered the uncertainty caused by both environmental noise and quantization errors. In Chapter 5, the reliability of the received feature vectors is considered in Viterbi decoding in the third stage of error concealment. For distributed speech recognition, it would be better to jointly consider all three sources of uncertainty: quantization distortion, environmental noise and transmission errors. The above uncertainty estimation is derived from the feature perspective. Alternatively, the reliability could be estimated with an entropy-based measure of the discriminating ability of a feature parameter in identifying the correct acoustic models [70, 71, 72]. The uncertainty or reliability estimated from the feature or model perspective could be further integrated into Viterbi decoding to improve the recognition performance.
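As a rough illustration of an entropy-based reliability measure, the sketch below maps the log-likelihoods of one observation under several acoustic models to a weight between 0 and 1. It is an assumption made for illustration and not the formulation of [70, 71, 72]: the function name `entropy_reliability`, the uniform prior over models, and the normalization by the maximum entropy are all hypothetical choices.

```python
import numpy as np

def entropy_reliability(log_likelihoods):
    """Return a weight in [0, 1]: 1 = highly discriminative, 0 = uninformative.

    `log_likelihoods` holds log p(x | model_k) for one observation x.
    A uniform prior over the models is assumed (hypothetical choice).
    """
    log_likelihoods = np.asarray(log_likelihoods, dtype=float)
    if log_likelihoods.size < 2:
        raise ValueError("at least two models are needed")
    # Posterior over models, computed with a numerically stable softmax.
    shifted = log_likelihoods - log_likelihoods.max()
    posterior = np.exp(shifted) / np.exp(shifted).sum()
    # Shannon entropy; zero-probability terms contribute nothing.
    nonzero = posterior > 0
    entropy = -np.sum(posterior[nonzero] * np.log(posterior[nonzero]))
    # Normalize by the maximum entropy log(N) of N models.
    return 1.0 - entropy / np.log(posterior.size)
```

A feature whose likelihoods are nearly equal across models has high entropy and therefore a low weight, while a feature that clearly favours one model has a weight close to 1.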

In the three-stage error concealment (EC) framework of Chapter 5, error detection is based on the characteristics of the HQ features. No channel coding scheme is applied to the encoded HQ symbols. If source coding and channel coding were considered jointly, the recall and precision rates of error detection could be further improved. In addition, with channel coding, soft-decision decoding at the receiver could offer channel reliability information for weighted Viterbi decoding.
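The idea of weighted Viterbi decoding can be sketched as follows, assuming that each frame comes with a reliability value in [0, 1] that scales its emission log-probability. The function name `weighted_viterbi` and its arguments are hypothetical, and how the reliability would be obtained from a soft-decision channel decoder is left open here. The emission log-probabilities are assumed to be finite.

```python
import numpy as np

def weighted_viterbi(log_init, log_trans, log_emit, reliability):
    """Best state path when frame t's emission score is scaled by reliability[t].

    log_init: (S,) initial log-probabilities.
    log_trans: (S, S), entry [i, j] = log P(next state j | state i).
    log_emit: (T, S) finite emission log-probabilities.
    reliability: (T,) hypothetical per-frame weights in [0, 1].
    """
    log_init = np.asarray(log_init, dtype=float)
    log_trans = np.asarray(log_trans, dtype=float)
    log_emit = np.asarray(log_emit, dtype=float)
    reliability = np.asarray(reliability, dtype=float)
    num_frames, num_states = log_emit.shape
    # Weight 0 removes a frame's acoustic evidence; weight 1 is standard Viterbi.
    scaled = reliability[:, None] * log_emit
    score = log_init + scaled[0]
    backptr = np.zeros((num_frames, num_states), dtype=int)
    for t in range(1, num_frames):
        candidates = score[:, None] + log_trans  # [previous state, next state]
        backptr[t] = np.argmax(candidates, axis=0)
        score = candidates[backptr[t], np.arange(num_states)] + scaled[t]
    # Trace the best path back from the best final state.
    path = [int(np.argmax(score))]
    for t in range(num_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

With this weighting, a frame that the channel decoder marks as unreliable contributes little acoustic evidence, so the decoded path through that frame is decided mainly by the transition probabilities and the neighbouring frames.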

In Chapter 6, the context-dependent quantization, which exploits speech correlation in the quantization process, improves the robustness against environmental noise and transmission errors. This is probably because changes in speech context provide additional information for human perception and speech recognition. The concept of context dependency could also be applied to other feature transformation methods. For example, the transformation in histogram equalization (HEQ) could depend not only on the order statistics of the current feature parameter but also on the left and right context parameters. The correlation of order statistics in consecutive frames could improve the robustness of the feature parameters.
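One possible reading of context-dependent HEQ is sketched below for a single feature dimension over one utterance. This is purely an illustrative assumption, since the thesis does not define the transformation: the function name `context_heq`, the mixing factor `context_weight`, and the choice of a standard normal reference distribution are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def context_heq(track, context_weight=0.25):
    """Equalize one feature track using order statistics plus left/right context.

    `context_weight` (hypothetical, between 0 and 0.5) is the share given to
    each neighbouring frame; 0 reduces to plain rank-based HEQ.
    """
    track = np.asarray(track, dtype=float)
    num_frames = len(track)
    # Empirical CDF from order statistics: (rank - 0.5) / T with rank in 1..T.
    ranks = np.argsort(np.argsort(track)) + 1
    cdf = (ranks - 0.5) / num_frames
    # Blend each frame's CDF value with its neighbours (edge frames are repeated).
    padded = np.pad(cdf, 1, mode="edge")
    blended = ((1.0 - 2.0 * context_weight) * cdf
               + context_weight * (padded[:-2] + padded[2:]))
    # Map the blended CDF value onto a standard normal reference distribution.
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in blended])
```

Blending the CDF values of adjacent frames smooths abrupt rank changes, which is one simple way to make use of the correlation of order statistics across consecutive frames.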

To the best of our knowledge, the above concept has not yet been reported in the literature. These directions are important and meaningful for research on robust and distributed speech recognition.

Bibliography

[1] Special section on “Speech Technology in Human-Machine Communication,” IEEE Signal Processing Magazine, vol. 22, no. 5, Sep. 2005.

[2] D. Pearce, “Enabling new speech driven services for mobile devices: An overview of the ETSI standards activities for distributed speech recognition front-ends,” Proc. Applied Voice Input/Output Soc. Conf., May 2000.

[3] V. V. Digalakis and L. G. Neumeyer and M. Perakakis, “Quantization of cepstral parameters for speech recognition over the world wide web,” IEEE Select. Areas Commun., vol. 17, no. 1, pp 82-90, Jan. 1999.

[4] J. -Y. Li, Bo Liu, R. -H. Wang and Li. -R. Dai, “A complexity reduction of ETSI advanced front-end for DSR,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, Apr. 2004.

[5] A. Agarwal, and Y. M. Cheng, “Two-Stage Mel-Warped Wiener Filter for Robust Speech Recognition,” Proc. ASRU99, 1999.

[6] J. -W. Hung and L. -S. Lee, “Comparative Analysis for Data-Driven Temporal Filters Obtained Via Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) In Speech Recognition,” Proc. Eurospeech, pp 1959-1962, 2001.

[7] S. Vuren and H. Hermansky, “Data-Driven Design of RASTA-Like Filters,” Proc. ICSLP, 1996.

[8] Ni-chun Wang, Jeih-weih Hung and Lin-shan Lee, “Data-driven temporal filters based on multi-eigenvectors for robust features in speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, 2003.

[9] S. Furui, “Cepstral Analysis Technique for Automatic Speaker Verification,” IEEE Trans. Acoust. Speech Signal Processing, 1981.

[10] O. Viikki and K. Laurila, “Noise Robust HMM-based Speech Recognition Using Segmental Cepstral Feature Vector Normalization,” in ESCA NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pp 107-110, 1997.

[11] J. Droppo and A. Acero and L. Deng, “Uncertainty Decoding with SPLICE for Noise Robust Speech Recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 57-60, 2002.

[12] J. Droppo, L. Deng, and A. Acero, “Evaluation of the SPLICE algorithm on the Aurora2 database (web update),” Proc. Eurospeech 2001, pp 217-220, Sep. 2001.

[13] J. A. Arrowood and M. A. Clements, “Using Observation Uncertainty In HMM Decoding,” Proc. ICSLP, 2002.

[14] H. Liao and M. J. F. Gales, “Joint Uncertainty Decoding for Noise Robust Speech Recognition,” Proc. Eurospeech, pp 3129-3132, 2005.

[15] N. B. Yoma and C. Molina and J. Silva and C. Busso, “Modeling, Estimating, and Compensating Low-Bit Rate Coding Distortion in Speech Recognition,” IEEE Trans. Speech, Audio Processing, vol. 14, no. 1, pp 246-255, Jan. 2006.

[16] J. A. Arrowood and M. Clements, “Extended Cluster Information Vector Quantization (ECI-VQ) for Robust Classification,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 889-892, May 2004.

[17] ETSI, “Speech Processing, Transmission and Quality Aspects (STQ), Distributed speech recognition; Extended advanced front-end feature extraction algorithm; Compression algorithms; Back-end speech reconstruction algorithm,” ES 202 212 V1.1.1 Recommendation, Nov. 2003.

[18] I. Kiss and P. Kapanen, “Robust feature vector compression algorithm for distributed speech recognition,” Proc. Eurospeech, pp 2183-2186, 1999.

[19] B. Milner and X. Shao, “Low Bit-rate Feature Vector Compression Using Transform Coding and Non-uniform Bit Allocation,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 129-132, Apr. 2003.

[20] Q. Zhu and A. Alwan, “An efficient and scalable 2D-DCT based feature coding scheme for remote speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 113-116, 2001.

[21] W. -H. Hsu and L. -S. Lee, “Efficient and Robust Distributed Speech Recognition (DSR) over Wireless Fading Channels: 2D-DCT Compression, Iterative Bit Allocation, Short BCH Code and Interleaving,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 69-72, 2004.

[22] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss recovery techniques for streaming audio,” IEEE Network Mag., pp 40-48, 1998.

[23] C. Boulis and M. Ostendorf and E. A. Riskin and S. Otterson, “Graceful Degradation of Speech Recognition Performance over Packet-Erasure Networks,” IEEE Trans. Speech, Audio Processing, vol. 10, no. 8, pp 580-590, Nov. 2002.

[24] Z. -H. Tan and P. Dalsgaard, “Channel error protection scheme for distributed speech recognition,” Proc. ICSLP 02, 2002.

[25] B. P. Milner and S. Semnani, “Robust speech recognition over IP networks,” Proc. ICASSP, 2000.

[26] L. Docio-Fernandez and C. Garcia-Mateo, “Distributed speech recognition over IP networks on the Aurora 3 database,” Proc. ICSLP, 2002.

[27] B. P. Milner and A. B. James, “Analysis and compensation of packet loss in distributed speech recognition using interleaving,” Proc. Eurospeech, 2003.

[28] B. Milner and A. James, “Robust Speech Recognition over Mobile and IP Networks in Burst-Like Packet Loss,” IEEE Trans. Speech Audio Processing, vol. 14, no. 1, pp 223-231, Jan. 2006.

[29] A. Gomez, A. M. Peinado, V. Sanchez, and A. J. Rubio, “A source model mitigation technique for DSR over lossy packet channels,” Proc. Eurospeech, 2003.

[30] A. Bernard and A. Alwan, “Low-bitrate distributed speech recognition for packet-based and wireless communication,” IEEE Trans. Speech, Audio Processing, vol. 10, no. 8, pp 570-579, Nov. 2002.

[31] A. Cardenal-Lopez and L. Docio-Fernandez and C. Garcia-Mateo, “Soft decoding strategies for distributed speech recognition over IP networks,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 49-52, May 2004.

[32] V. Ion and R. Haeb-Umbach, “A Unified Probabilistic Approach to Error Concealment for Distributed Speech Recognition,” Proc. Interspeech, pp 2853-2856, Sep. 2005.

[33] A. M. Gomez, A. Peinado, V. Sanchez, A. Rubio, “An integrated scheme for robust distributed speech recognition over lossy packet networks,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 857-860, Apr. 2007.

[34] V. Ion and R. Haeb-Umbach, “Multi-resolution soft features for channel-robust distributed speech recognition,” Proc. Interspeech, pp 594-597, Sep. 2007.

[35] Z. -H. Tan and P. Dalsgaard and B. Lindberg, “A subvector based error concealment algorithm for speech recognition over mobile networks,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, May 2004.

[36] A. M. Peinado and V. Sanchez and J. L. Perez-Cordoba and A. J. Rubio, “Efficient MMSE-Based Channel Error Mitigation Techniques Application to Distributed Speech Recognition Over Wireless Channels,” IEEE Trans. Wireless Communication, vol. 4, no. 1, pp 14-19, Jan. 2005.

[37] B. Delaney, “Increased robustness against bit errors for distributed speech recognition in wireless environments,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, 2005.

[38] A. Bernard and A. Alwan, “Channel noise robustness for low-bitrate remote speech recognition,” Proc. ICSLP, 2002.

[39] T. Endo, S. Kuroiwa, and S. Nakamura, “Missing feature theory applied to robust speech recognition on IP networks,” Proc. Eurospeech, 2003.

[40] H. K. Kim and R. V. Cox, “A bitstream-based front-end for wireless speech recognition on IS-136 communication systems,” IEEE Trans. Speech Audio Processing, vol. 9, no. 5, pp 558-568, 2001.

[41] H. G. Hirsch and D. Pearce, “The AURORA Experimental Framework for the Performance Evaluations of Speech Recognition Systems under Noisy Conditions,” ISCA ITRW ASR2000, Sep. 2000.

[42] K. K. Paliwal and S. So, “Scalable Distributed Speech Recognition Using Multi-Frame GMM-based Block Quantization,” Proc. ICSLP, 2004.

[43] S. Molau and M. Pitz and H. Ney, “Histogram based normalization in the acoustic feature space,” Proc. ASRU, 2001.

[44] A. de la Torre and A. M. Peinado and J. C. Segura and J. L. Perez-Cordoba and M. C. Benitez and A. J. Rubio, “Histogram equalization of speech representation for robust speech recognition,” IEEE Trans. Speech Audio Processing, vol. 13, no. 3, pp 355-366, May 2005.

[45] S. Chen and R. Gopinath, “Gaussianization,” Proc. Neural Information Processing Systems, pp 423-429, 2000.

[46] C. -Y. Wan and L. -S. Lee, “Histogram-based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition,” Proc. Interspeech, pp 957-960, Sep. 2005.

[47] S. P. Lloyd, “Least Squares Quantization in PCM,” IEEE Trans. Information Theory, vol. 28, pp 129-137, Mar. 1982.

[48] J. Max, “Quantizing for Minimum Distortion,” IRE Trans. Information Theory, vol. 6, no. 1, pp 7-12, Mar. 1960.

[49] F. Hilger and H. Ney, “Quantile-based histogram equalization for noise robust speech recognition,” Proc. Eurospeech, pp 1135-1138, 2001.

[50] C. -Y. Wan and L. -S. Lee, “Joint Uncertainty Decoding (JUD) with Histogram-based Quantization (HQ) for Robust and/or Distributed Speech Recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 125-128, May 2006.

[51] C. -Y. Wan and Yi Chen and L. -S. Lee, “Three-Stage Error Concealment for Distributed Speech Recognition (DSR) with Histogram-Based Quantization (HQ) under Noisy Environment,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 877-880, Apr. 2007.

[52] C. -Y. Wan and L. -S. Lee, “Histogram-based Quantization (HQ) for Robust and Distributed Speech Recognition,” IEEE Trans. Audio Speech and Language Processing, vol. 16, no. 4, pp 859-873, May 2008.

[53] L. Bahl and J. Cocke and F. Jelinek and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Trans. Inf. Theory, vol. 20, no. 2, pp 284-287, Mar. 1974.

[54] V. Sanchez and A. M. Peinado and J. L. Perez-Cordoba, “Low Complexity Channel Error Mitigation for Distributed Speech Recognition over Wireless Channels,” Proc. IEEE Int. Conf. Communications, pp 3619-3623, May 2003.

[55] J. -H. Chen, “Receiver design and simulation analysis of GPRS physical layer,” Master Thesis, National Taiwan University, Jun. 2001.

[56] C. -P. Chen and J. A. Bilmes, “MVA Processing of Speech Features,” IEEE Trans. Speech Audio Processing, vol. 15, no. 1, pp 257-270, Jan. 2007.

[57] J. -W. Hung and L. -S. Lee, “Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition,” IEEE Trans. Speech Audio Processing, vol. 14, no. 3, pp 808-832, May 2006.

[58] ITU-T (Telecommunication Standardization Sector, International Telecommunication Union), “Subjective Performance Assessment of Telephone-band and Wideband Digital Codecs, Annex D: Modified IRS Send and Receive Characteristics,” ITU-T Recommendation P.830, Feb. 1996.

[59] H. Hermansky, “TRAP-TANDEM: data-driven extraction of temporal features from speech,” Proc. ASRU, pp 255-260, 2003.

[60] Y. Linde and A. Buzo and R. Gray, “An Algorithm for Vector Quantizer Design,” IEEE Trans. Communications, vol. 28, no. 1, pp 84-95, Jan. 1980.

[61] C. -Y. Wan and Yi Chen and L. -S. Lee, “Context-dependent Quantization for Robust and/or Distributed Speech Recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, pp 4413-4416, Mar. 2008.

[62] M. Flickner and H. Sawhney and W. Niblack and J. Ashley and Q. Huang, and B. Dom, “Query by image and video content: the QBIC system,” IEEE Computer, Sep. 1995.

[63] John R. Smith and Shih-Fu Chang, “VisualSEEk: a fully automated content-based image query system,” ACM Multimedia, 1996.

[64] J. Chen and T. Tan and P. Mulhem and M. Kankanhalli, “An improved method for image retrieval using speech annotation,” Proceedings of the 9th International Conference on Multi-Media Modeling, 2003.

[65] Timothy J. Hazen and Brennan Sherry and Mark Adler, “Speech-based annotation and retrieval of digital photographs,” Proc. Interspeech, 2007.

[66] G. W. Furnas and S. Deerwester and S. T. Dumais and T. K. Landauer and R. Harshman and L. A. Streeter and K. E. Lochbaum, “Information retrieval using a singular value decomposition model of latent semantic structure,” Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1988.

[67] T. Hofmann, “Probabilistic latent semantic indexing,” Proc. ACM SIGIR Conf. R&D in Information Retrieval, 1999.

[68] M. J. Swain and D. H. Ballard, “Color indexing,” Int. Journal of Computer Vision, 1991.

[69] B. S. Manjunath and W. Y. Ma, “Texture features for browsing and retrieval of image data,” IEEE T-PAMI, Aug. 1996.

[70] Y. Chen and C. -Y. Wan and L. -S. Lee, “Entropy-Based Feature Parameter Weighting for Robust Speech Recognition,” Proc. IEEE Int. Conf. Acoust. Speech, Signal Processing, 2006.

[71] Y. Chen and C. -Y. Wan and L. -S. Lee, “Confusion-Based Entropy-Weighted Decoding for Robust Speech Recognition,” Proc. Interspeech, 2008.

[72] Y. Chen and C. -Y. Wan and L. -S. Lee, “Robust Speech Recognition By Properly Utilizing Reliable Frames And Segments In Corrupted Signals,” Proc. ASRU, pp 99-104, 2007.
