Chapter 5: Objective Non-Intrusive Speech Assessment Methods
5.4 Assessing Speech Quality by Combining Three Perceptual Features
5.4.1 Method
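We perform objective non-intrusive speech quality assessment using the three perceptual features of the speech signal obtained in the preceding three sections: intelligibility, naturalness, and fundamental-frequency distortion. The methods of those sections each yield a quantitative value for one feature. We then use the least-squares method to combine the effects of the three features on speech quality, which gives our final predicted speech quality score (pre_MOS).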
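The least-squares combination described above can be sketched as follows. This is a minimal illustration, not the exact procedure of this work: the linear form with an intercept term, the function names `fit_feature_weights` and `predict_mos`, and the synthetic demonstration data are all assumptions introduced here.

```python
import numpy as np

def fit_feature_weights(features, mos):
    """Least-squares fit of MOS ~ w0 + w1*f1 + w2*f2 + w3*f3.

    features: (n_samples, 3) array holding the intelligibility, naturalness,
    and fundamental-frequency distortion values (column order is an assumption).
    mos: (n_samples,) array of subjective MOS values.
    """
    features = np.asarray(features, dtype=float)
    mos = np.asarray(mos, dtype=float)
    # Prepend a column of ones so the fit includes an intercept (assumed).
    design = np.column_stack([np.ones(len(features)), features])
    weights, *_ = np.linalg.lstsq(design, mos, rcond=None)
    return weights

def predict_mos(features, weights):
    """Apply fitted weights to feature values to obtain pre_MOS."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([np.ones(len(features)), features])
    return design @ weights

# Synthetic, noise-free demonstration data (hypothetical, not from Supp. 23).
rng = np.random.default_rng(0)
demo_features = rng.uniform(0.0, 1.0, size=(20, 3))
true_weights = np.array([1.0, 2.0, 1.5, -0.8])
demo_mos = true_weights[0] + demo_features @ true_weights[1:]

weights = fit_feature_weights(demo_features, demo_mos)
pre_mos = predict_mos(demo_features, weights)
```

Because the demonstration data are generated from a known linear rule without noise, the fit recovers that rule; with real feature values and subjective scores the fit would only approximate the MOS.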
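We use the speech material from Experiments 1 and 3 of the Supp. 23 corpus and analyze the speech of four male and four female speakers under various degradation conditions, obtaining a predicted speech quality score for each. Finally, we compare our results with P.563, the ITU-T international standard for objective assessment, and express prediction performance as a correlation coefficient. The comparison results are given in Tables 5-2 through 5-5.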
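As an illustration of the performance measure, the sketch below computes a correlation coefficient between predicted and subjective scores. It assumes the Pearson correlation coefficient is meant; the function name `pearson_correlation` and the example values are hypothetical and are not taken from the tables.

```python
import numpy as np

def pearson_correlation(predicted, subjective):
    """Pearson correlation coefficient between two equal-length score lists."""
    predicted = np.asarray(predicted, dtype=float)
    subjective = np.asarray(subjective, dtype=float)
    # Remove the means, then normalize the cross term by both spreads.
    p = predicted - predicted.mean()
    s = subjective - subjective.mean()
    return float(np.sum(p * s) / np.sqrt(np.sum(p ** 2) * np.sum(s ** 2)))

# Illustrative values only (hypothetical, not measured data).
example_pre_mos = [1.0, 2.0, 3.0, 4.0]
example_mos = [1.5, 2.5, 3.5, 4.5]
r = pearson_correlation(example_pre_mos, example_mos)
```

A value close to 1 means the predicted scores follow the subjective scores closely up to a linear mapping, which is why a constant offset between the two example lists does not lower the coefficient.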
5.4.2 Results
c31 2.96 2.46 3.43 2.92 2.79 3.67
c19 3.29 2.77 3.09 3.79 3.14 3.22
c7 2.67 2.46 2.64 2.58 2.20 2.12
c46 4.38 3.65 4.03 4.54 4.43 3.48
c25 1.96 2.38 2.33 1.58 2.82 1.81
Chapter 6: Discussion of Results and Future Work
6.1 Discussion of Results
6.1.1 Objective Intrusive Speech Quality Assessment Method
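Our objective intrusive speech assessment method uses a perceptual auditory model together with the findings of earlier work [12] to locate an energy-distribution region in the rate and scale dimensions. This region covers rates from 32 Hz to 256 Hz and scales from 2 cyc/oct to 8 cyc/oct. We assess speech quality by computing the change in energy within this region between the clean speech and the degraded speech at each time point and accumulating it over time. The result is then mapped to MOS and compared with ITU-T P.862 (PESQ).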
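The region-energy comparison described above can be sketched as follows. This is only an illustration under several assumptions that the text does not specify: the auditory-model output is taken to be an array of shape (frames, scales, rates) already collapsed over frequency channels, the region bounds are treated as inclusive, energy is the sum of squared magnitudes, and the accumulated change is the sum over frames of the absolute energy difference. The function names are hypothetical.

```python
import numpy as np

RATE_RANGE_HZ = (32.0, 256.0)          # rate bounds stated in the text
SCALE_RANGE_CYC_PER_OCT = (2.0, 8.0)   # scale bounds stated in the text

def region_energy(cortical, rates_hz, scales_cpo):
    """Per-frame energy inside the rate-scale region.

    cortical: array of shape (n_frames, n_scales, n_rates) (assumed layout).
    rates_hz, scales_cpo: axis values for the rate and scale dimensions.
    """
    rates_hz = np.asarray(rates_hz, dtype=float)
    scales_cpo = np.asarray(scales_cpo, dtype=float)
    rate_mask = (rates_hz >= RATE_RANGE_HZ[0]) & (rates_hz <= RATE_RANGE_HZ[1])
    scale_mask = ((scales_cpo >= SCALE_RANGE_CYC_PER_OCT[0])
                  & (scales_cpo <= SCALE_RANGE_CYC_PER_OCT[1]))
    # Keep only the scales and rates that fall inside the region.
    region = np.abs(np.asarray(cortical))[:, scale_mask][:, :, rate_mask]
    return np.sum(region ** 2, axis=(1, 2))

def accumulated_energy_change(clean, degraded, rates_hz, scales_cpo):
    """Sum over frames of |region energy (degraded) - region energy (clean)|."""
    diff = (region_energy(degraded, rates_hz, scales_cpo)
            - region_energy(clean, rates_hz, scales_cpo))
    return float(np.sum(np.abs(diff)))
```

The accumulated value is zero when the degraded signal has the same region energy as the clean signal in every frame, and it grows with the size of the deviation; a separate mapping step would still be needed to turn it into a MOS estimate.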
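As Table 4-1 shows, for the four speakers of ITU-T Supp. 23 Experiment 1, the correlation coefficient between our predicted MOS and the true MOS reaches 0.928 at best, 0.795 at worst, and 0.868 on average; for PESQ the best is 0.923, the worst 0.866, and the average 0.902. Dividing our prediction result by the PESQ result gives about 96% on average (for the two averages, 0.868/0.902 ≈ 0.96), which indicates that the performance of our objective intrusive MOS prediction method is comparable to that of PESQ.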
6.1.2 Objective Non-Intrusive Speech Quality Assessment Method
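Our objective non-intrusive speech quality assessment method uses the perceptual auditory model to observe and analyze two stages of auditory perception. The aim is to extract, from the low-level perceptual response of the human ear, feature parameters that may influence a listener's high-level cognitive judgment of speech quality, and to use them for objective assessment. The three features are intelligibility, naturalness, and fundamental-frequency distortion.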
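We assess intelligibility with the spectro-temporal modulation index (STMI), which combines the spectral and temporal domains, over the most important perceptible modulation region of the speech signal (critical perceptible modulations). The range of this region is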
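Correlation coefficient
Average 0.802 0.826
Table 6-1: Performance comparison of the MOS predicted by P.563 and the MOS predicted by our method, expressed as correlation coefficients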
References
[1]. A. W. Rix, J. G. Beerends, D.-S. Kim, P. Kroon, and O. Ghitza, “Objective Assessment of Speech and Audio Quality-Technology and Applications,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 6, Nov. 2006.
[2]. “Perceptual evaluation of speech quality, an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001.
[3]. T. P. Barnwell, “Improved objective quality measures for low bit speech compression,” National Science Foundation, Final Technical Report, 1985.
[4]. S. R. Quackenbush, T. P. Barnwell, III, and M. A. Clements, “Objective Measures of Speech Quality,” Englewood Cliffs, NJ: Prentice-Hall, 1988.
[5]. C. Jin and R. Kubichek, “Vector quantization techniques for output-based objective speech quality,” IEEE Int. Conf. Acoust., Speech, Signal Process., Atlanta, GA, pp. 491-494, 1996.
[6]. P. Gray, M. P. Hollier, and R. E. Massara, “Non-intrusive speech quality assessment using vocal tract models,” Inst. Elect. Eng. Proc. Vis. Image Sig. Process., vol. 147, no. 6, pp. 493-501, 2000.
[7]. T. H. Falk and W.-Y. Chan, “Nonintrusive speech quality estimation using Gaussian mixture models,” IEEE Sig. Process. Letters, vol. 13, no. 2, pp. 108-111, 2006.
[8]. “Single Ended Method for Objective Speech Quality Assessment in Narrow-Band Telephony Applications,” ITU-T Rec. P.563, 2004.
[9]. A. Raja, R. M. A. Azad, C. Flanagan, and C. Ryan, “Real-Time, Non-intrusive Evaluation of VoIP,” EuroGP LNCS 4445, pp. 217-228, 2007.
[10]. L. Ding, Z. Lin, A. Radwan, M. S. El-Hennawey, and R. A. Goubran, “Non-intrusive single-ended speech quality assessment in VoIP,” Speech Communication, vol. 49, pp. 477-489, 2007.
[11]. “The E-model, a computational model for use in transmission planning,” ITU-T Rec. G.107, 2002.
[12]. D.-S. Kim, “A cue for objective speech quality estimation in temporal envelope representations,” IEEE Signal Processing Lett., vol. 11, no. 10, pp. 849-852, Oct. 2004.
[13]. A. Raake, “Does the content of speech influence its perceived sound quality?” in Proc. 3rd Int. Conf. on Language Resources and Evaluation, vol. 4, pp. 1170-1176, 2002.
[14]. D.-S. Kim, “ANIQUE: An auditory model for single-ended speech quality estimation,” IEEE Trans. Speech Audio Process., vol. 13, no. 5, pp. 821-831, Sep. 2005.
[15]. T. Chi, P. Ru, and S. A. Shamma, “Multiresolution spectrotemporal analysis of complex sounds,” J. Acoust. Soc. Am., vol. 118, no. 2, pp. 887-906, 2005.
[16]. T. Chi, Y. Gao, C. G. Guyton, P. Ru, and S. Shamma, “Spectro-temporal modulation transfer functions and speech intelligibility,” J. Acoust. Soc. Am., vol. 106, no. 5, pp. 2719-2732, 1999.
[17]. M. Elhilali, T. Chi, and S. A. Shamma, “A spectro-temporal modulation index (STMI) for assessment of speech intelligibility,” Speech Communication, vol. 41, no. 2-3, pp. 331-348, 2003.
[18]. A. Ratcliff, S. Coughlin, and M. Lehman, “Factors influencing ratings of speech naturalness in augmentative and alternative communication,” ISAAC, vol. 18, Mar. 2002.
[19]. W. Sanders, C. Gramlich, and A. Levine, “Naturalness of synthesized speech,” University-level computer-assisted instructions at Stanford: 1968-80, pp. 487-501.
[20]. J. G. Beerends, “Modelling cognitive effects that play a role in the perception of speech quality,” in Workshop “Speech Quality Assessment”, Bochum, Germany, pp. 1-9, 1994.
[21]. S. D. Voran, “Perception of temporal discontinuity impairments in coded speech - A proposal for objective estimators and some subjective test results,” Int. for Telecommunication Sciences, 2003.
[22]. L. Ding, A. Radwan, M. S. El-Hennawey, and R. A. Goubran, “Measurement of the effects of temporal clipping on speech quality,” IEEE Trans. on Instrumentation and Measurement, vol. 55, no. 4, Aug. 2006.
[23]. D. J. Klein, D. A. Depireux, J. Z. Simon, and S. A. Shamma, “Robust spectrotemporal reverse correlation for the auditory system: Optimizing stimulus design,” Journal of Computational Neuroscience, vol. 9, pp. 85-111, 2000.
[24]. N. Mesgarani, M. Slaney, and S. A. Shamma, “Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 3, May 2006.
[25]. “ITU-T coded-speech database,” 1998, Supp.23 to P series Rec., ITU-T.
[26]. “Methods for subjective determination of transmission quality,” 1996, ITU-T.
[27]. “Mapping function for transforming P.862 raw result scores to MOS-LQO,” ITU-T Rec. P.862.1, 2003.
[28]. N. F. Viemeister, “Temporal modulation transfer functions based on modulation thresholds,” J. Acoust. Soc. Am., vol. 66, pp. 1364-1380, 1979.
[29]. K. Hustad, “Intelligibility differences for three listener groups,” Journal of Speech and Hearing Research, vol. 41, pp. 744-752, 1998.