Hierarchical Extreme Learning Machine for Speech Signal Processing - NCCU Academic Hub


Department of Computer Science, College of Science, National Chengchi University
Doctoral Dissertation

Hierarchical Extreme Learning Machine for Speech Signal Processing
(多層次極限學習機於語音訊號處理上的應用)

Tassadaq Hussain (胡泰克)

Advisors: Yu Tsao, Ph.D. (曹昱) and Wen-Hung Liao, Ph.D. (廖文宏)

This dissertation is submitted for the degree of Doctor of Philosophy in Social Networks and Human-Centered Computing.

April 2020 (Republic of China year 109)

DOI:10.6814/NCCU202000466

Acknowledgements

First of all, I would like to express my sincerest gratitude to my advisor Dr. Yu Tsao for his time, patience, motivation, devotion, and immense encouragement throughout the duration of my Ph.D. study. I am indebted to him for the freedom he gave me in deciding my direction and for providing me with the resources required to carry out my research work. The completion of my Ph.D. study would not have been possible without his continuous professional guidance, assistance, endless support, and empathy.

I am also grateful to my co-advisor Prof. Wen-Hung Liao for his patience, continuous feedback, motivation, and invaluable advice. He has always helped me in conducting research, and his unconditional support and encouragement kept my progress on schedule.

Special thanks are most certainly due to Prof. Sabato Marco Siniscalchi from the University of Enna, Italy, for his insightful observations, valuable suggestions, professional advice, and critical remarks during my submissions, which in effect helped me shape my final dissertation.

I would also like to extend my gratitude to the Taiwan International Graduate Program on Social Network and Human-Centered Computing (TIGP-SNHCC) for the Ph.D. fellowship that has supported this research. I am also indebted to the faculty of Computer Science, National Chengchi University; the Institute of Information Science, Academia Sinica; and the Research Center for Information Technology Innovation, Academia Sinica, for their support throughout this training.

I would like to convey my gratitude to Prof. Mark Liao for his support, advice, and valuable suggestions throughout our Ph.D. study. He always motivated us and took care of us in every possible way amidst his busy schedule.

It would have been difficult to complete this work without the support and friendship provided by my Bio-ASP lab members during this Ph.D. I also express my gratitude to Dr. Syu-Siang Wang for his feedback and valuable in-depth discussions. I also deeply thank all of my TIGP-SNHCC friends and Bio-ASP lab members who have contributed to my research but are not listed here.

I am deeply indebted to my parents and family, and especially to my wife, for supporting and believing in me and for being my strongest backup during these years; their encouragement, continuous support, and, most importantly, patience have helped me to complete my research studies. Their continuous affectionate love, support, prayers, and sacrifices let me focus on what has been a hugely rewarding and enriching process that has led to the successful completion of my Ph.D. study. Last, but certainly not least, I would like to dedicate my thesis to my newborn daughter, who has made me stronger, better, and more fulfilled than I could ever have imagined.

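Chinese Abstract

Speech is the most effective and natural means of interaction between people. Over the past few decades, many topics in speech signal processing have been studied in depth; however, effectively improving human listening and machine recognition rates in real acoustic environments remains a difficult task. In recent years, voice-controlled personal assistant systems (e.g., Alexa and Google Home) have come into wide use and have reshaped the mode of human-machine interaction. In practical applications that often require distant-talking communication (e.g., audio data mining and voice-assisted applications), background noise severely degrades the quality and intelligibility of speech signals, so the ability to suppress noise is an important issue in practical environments. To address this issue, this dissertation first proposes a speech denoising framework that aims to (i) remove background noise from a single-channel speech signal effectively and quickly; (ii) extract clean speech features from noisy speech effectively under mismatched testing conditions (stationary and non-stationary noise and different SNR levels); and (iii) achieve excellent denoising performance even when the amount of training data is limited. Experimental results confirm that, compared with methods based on deep neural networks, the proposed HELM framework yields comparable or even better speech quality and intelligibility when the amount of training data is limited.

Besides noise, reverberation is another problem for speech. Reverberation generally refers to the collection of reflected sounds, and it can severely affect the performance of speech-related applications. In recent years, the strong regression capability of deep neural models has been shown to remove reverberation from speech effectively. A major drawback of deep neural models, however, is that they need a large number of reverberant-anechoic training utterance pairs, and such paired training data are usually not easy to obtain. Developing an algorithm that uses a small amount of training data has therefore become an important research topic. This dissertation uses HELM to address both the reverberation problem and the data requirement problem, while also exploiting the advantages of an ensemble learning framework. Experimental results show that, under both matched and mismatched testing conditions, the framework outperforms traditional methods and a recently proposed integrated deep and ensemble learning algorithm.

One limitation of speech enhancement methods is that they cannot achieve satisfactory performance under unseen acoustic conditions. In this dissertation, we attempt to address the effect of channel mismatch based on HELM, converting low-quality bone-conducted microphone speech into high-quality air-conducted microphone speech under real acoustic conditions. In addition to audio-only processing frameworks, we also apply the proposed method to multimodal learning to improve the overall performance of audio-only speech enhancement models. In this dissertation, we also propose an audio-visual speech enhancement system. The results confirm that, under different testing conditions, the audio-visual speech enhancement system provides better performance than the audio-only speech enhancement system. Another emerging research topic in deep learning is model compression, which further increases applicability. We propose novel model compression techniques that effectively reduce computational requirements. In the future, we expect that the compressed model can be implemented in hardware and combined with various speech applications.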

Abstract

Speech is the most effective and natural medium of communication in human-human interaction. In the past few decades, a great amount of research has been conducted on various aspects and properties of speech signal processing. However, improving intelligibility for both human listening and machine recognition in real acoustic conditions remains a challenging task. In recent years, voice-controlled personal assistant systems (such as Alexa, Google Home, and HomePod) have been widely used and have reshaped the human-machine interaction mode. In practical applications that often require distant-talking communication (e.g., audio data mining and voice-assisted applications), background noise can severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Therefore, noise suppression should be robust against changing noise conditions so that it can operate in real-time environments. To address this issue, this dissertation initially presents a speech denoising framework that aims (i) to remove background noise from a single-channel speech signal effectively and quickly, (ii) to extract clean speech features from their noisy counterparts effectively even under mismatched testing conditions (stationary and non-stationary noise and different SNR levels), and (iii) to attain optimal performance when the amount of training data is limited. The proposed framework offers a universal approximation capability through comparative measures. The experimental results demonstrate that the proposed framework can yield comparable or even better speech quality and intelligibility compared with conventional signal processing and deep neural-based approaches when the amount of training data is limited.

Besides noise, reverberation is yet another issue that can affect the learning effectiveness and robustness of distant-talking communication devices. Reverberation generally refers to the collection of reflected sounds that can significantly affect the performance of speech-related applications. In recent years, the approximation capabilities of deeper neural models have been exploited to study the reverberation effect. The outcome of these studies indicates that neural-based learning has strong regression capabilities and can achieve outstanding speech dereverberation results. However, deep neural models require a large amount of reverberant-anechoic training waveform pairs to achieve a reasonable performance improvement. Therefore, a data-driven solution is required that can achieve robust generalization performance for realistic reverberated conditions and can be optimized with a small amount of training data, or more precisely, adaptation data. Motivated by the promising performance achieved for speech denoising, this dissertation next addresses the reverberation issue and the data requirement while preserving the advantages of deep neural structures by leveraging an ensemble learning framework. Experimental results reveal that the proposed framework outperforms both traditional methods and a recently proposed integrated deep and ensemble learning algorithm in terms of standardized evaluation metrics under matched and mismatched testing conditions.

A common drawback of most modern speech enhancement (SE) approaches is that they are typically evaluated using simulated datasets, where training and testing conditions are generated in controlled environments.
Consequently, these approaches suffer from channel mismatch problems in unseen acoustic conditions and are unable to achieve satisfactory performance. In online learning, where data arrives from different channels and environments, an effective solution is required to address the channel mismatch problem. In this dissertation, we next address the impact of channel mismatch and propose an alternative SE system that converts low-quality bone-conducted microphone utterances into high-quality air-conducted microphone utterances in real acoustic conditions.

Although the effects of noise and reverberation using audio-only frameworks are well examined under diverse sets of synthetically generated conditions, such frameworks need to initially acquire a large amount of training data, covering as many environmental conditions as possible, to improve the robustness against unknown test conditions. Recent literature has exploited the great potential of auxiliary information in human-machine interactions. The data obtained from heterogeneous sensors and devices using the internet of things (IoT) can be useful for more robust inference, thereby providing further insights into multimodal learning. In addition to audio-only SE frameworks, multimodal learning has recently been adopted to improve the overall performance of audio-only SE models. The thesis later expands the audio-only paradigm of the SE framework and proposes an audio-visual SE system. The final results demonstrate that the incorporation of auxiliary information alongside audio can provide adequate performance enhancement over an audio-only SE system under different test conditions.

Another emerging focus of deep learning is to enable deep neural-based models to work in real-world applications. The problem with existing deep neural models is that they are computationally expensive and memory intensive, which limits their deployment on edge devices with low memory resources.
Based on the successful results of the audio-only and audio-visual SE frameworks, in this thesis we finally propose a joint audio-visual SE framework that applies model and data compression strategies in order to meet the computational demands and facilitate real-time predictions. The proposed framework demonstrates that the incorporation of visual information helps the framework retain most of the information lost by the audio-only framework, while the model compression lets the framework further reduce the computation requirement. The model compression brings the model closer to hardware implementation for multimodal environments while maintaining efficient regression ability.

Contents

Acknowledgements
Chinese Abstract
Abstract
Contents
List of Figures
List of Tables

1 SPEECH SIGNAL PROCESSING: AN OVERVIEW
  1.1 Background
    1.1.1 Speech Denoising
    1.1.2 Speech Dereverberation
    1.1.3 Channel Compensation
    1.1.4 Multimodal Speech Enhancement
  1.2 Motivation
  1.3 Research Challenges
  1.4 Contributions
  1.5 Dissertation Outline

2 ELM-BASED SPEECH DENOISING
  2.1 Overview
  2.2 Introduction
  2.3 Proposed Method
    2.3.1 Conventional Spectral Restoration Methods
    2.3.2 Data Driven Methods
    2.3.3 The ELM Model
  2.4 Experiments
    2.4.1 Experimental Setup
    2.4.2 Experimental Results
  2.5 Summary

3 ELM-BASED SPEECH DEREVERBERATION
  3.1 Overview
  3.2 Introduction
  3.3 Proposed Method
    3.3.1 Ensemble Learning for Speech Signal Processing
    3.3.2 HELM-based Speech Dereverberation System
    3.3.3 Highway HELM
    3.3.4 Residual HELM
    3.3.5 Ensemble HELM for Speech Dereverberation
  3.4 Experiments
    3.4.1 Experimental Setup
    3.4.2 Experimental Results
  3.5 Summary

4 ELM-BASED CHANNEL COMPENSATION
  4.1 Overview
  4.2 Introduction
  4.3 Proposed Method
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 HELM-based SE System
    4.4.3 Spectrogram Analysis
    4.4.4 Automatic Speech Recognition
    4.4.5 Sensitivity/Stability Towards the Training Data
  4.5 Summary

5 ELM-BASED MULTIMODAL SPEECH ENHANCEMENT
  5.1 Overview
  5.2 Introduction
  5.3 Proposed Method
    5.3.1 Audio-only SE System
    5.3.2 Audio-Visual SE System
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Experimental Results
  5.5 Summary

6 COMPRESSED MULTIMODAL SE
  6.1 Overview
  6.2 Introduction
  6.3 Proposed method for SE
    6.3.1 HELM-based multimodal System for SE
    6.3.2 Binarization and Quantization
  6.4 Experimental Evaluations
    6.4.1 Experimental Setup
    6.4.2 Feature Extraction
  6.5 Summary

7 CONCLUSION AND FUTURE WORK
  7.1 Conclusion
  7.2 Future Work

Bibliography
VITA

List of Figures

2.1 HELM architecture.
2.2 HELM-based speech enhancement architecture.
2.3 PESQ scores for ELM with different activation functions and numbers of hidden neurons.
2.4 PESQ, SDI, STOI, and SSNRI average scores for ELM and HELM configurations.
2.5 Spectrograms of an utterance (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.2976), (c) ELM (PESQ = 2.3018), and (d) HELM (PESQ = 2.5489) contaminated with babble noise.
2.6 Spectrograms of an utterance (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.4433), (c) ELM (PESQ = 2.5258), and (d) HELM (PESQ = 2.7345) contaminated with car noise.
2.7 PESQ score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
2.8 STOI score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
3.1 Residual block for DDAE.
3.2 Overall speech dereverberation architecture using (a) conventional HELM, (b) HELM(Hwy), and (c) HELM(Res).
3.3 Offline and online stages of the ensemble HELM (eHELM) dereverberation framework.
3.4 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) in the matched testing conditions.
3.5 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) in the mismatched testing conditions.
3.6 Amplitude envelopes of the fifth channel of: (a) Clean, (b) Reverb, (c) IDEAD (Res), and (d) eHELMD (Res). The reverberated utterance was at RT60 = 1.2 s.
3.7 Amplitude envelopes of the fifth channel of: (a) Clean, (b) Reverb, (c) IDEAD (Res), and (d) eHELMD (Res). The reverberated utterance was at RT60 = 1.2 s.
3.8 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) under the matched testing conditions for the MHINT.
3.9 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) under the mismatched testing conditions for the MHINT.
3.10 Performance of DDAE(Res), HRNN, HLSTM, HBLSTM, and HELM(Res) frameworks of PESQ, STOI, SRMR, FwSSNR, Cep, and LLR evaluation metrics at RT60s (a) 0.3 s, (b) 0.6 s, (c) 0.9 s, and (d) 1.2 s using different amounts of reverberated-anechoic training utterance pairs (i.e., 120, 300, 600, 1200, 2400, and 3600).
3.11 Average subjective listening scores of Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) for RT60 = 0.6 s, 0.7 s, and 1.0 s, of MHINT.
3.12 Average subjective listening scores of Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) for large room (RT60 = 0.7 s) of SimData with distance ∈ {Near, Far}, and for the four rooms of RealData (= Lecture, Meeting, Office, and Stairways) of REVERB challenge corpus.
4.1 HELM-based SE Architecture
4.2 Spectrograms of the enhanced test utterances using the (c) DDAE and (d) HELM of the (a) ACM and (b) BCM utterances. For each figure, the x-axis denotes the time in seconds, and the y-axis represents the frequency in Hertz.
4.3 Average PESQ scores for DDAE and HELM SE frameworks using different amounts of training data.
4.4 Average STOI scores for DDAE and HELM SE frameworks using different amounts of training data.
4.5 CER results of DDAE and HELM frameworks using different amounts of training data.
5.1 Audio-only HELM-based SE framework.
5.2 Proposed AVHELM SE framework.
5.3 Average PESQ scores over six noise types at different SNR levels.
5.4 Average HASPI and SSNRI scores over six noise types at different SNR levels.
5.5 Spectrograms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with noise applause at SNR = 2 dB.
5.6 Waveforms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with noise applause at SNR = 2 dB.
6.1 The HELM-based multimodal SE framework.
6.2 Performance comparison of different frameworks using HASPI and SSNRI evaluation metrics for six noise types averaged across different SNRs.
6.3 Performance comparison of HELMa and HELMav using HASPI and SSNRI evaluation metrics for different SNRs averaged across six noise types.

List of Tables

2.1 Aurora-4 Training set description
2.2 Aurora-4 Test set description
2.3 Single result abstracted from average objective evaluation scores of ELM [500] and HELM [200 200 500] configuration
2.4 Performance comparison of HELM frameworks using different window sizes
2.5 Objective evaluation scores of DDAE and HELM alongside traditional speech enhancement methods
3.1 Average PESQ scores of HELM, HELM(Hwy), HELM(Res) and Reverb speech under specific reverberated conditions.
3.2 Average PESQ scores of HELM, HELM(Hwy), and HELM(Res) with different context information.
3.3 Average PESQ scores of four HELM frameworks with the RS and KB schemes.
3.4 Average PESQ scores of ensemble HELM and IDEA frameworks in the matched testing conditions.
3.5 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions.
3.6 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the mismatched testing conditions.
3.7 Average PESQ scores of the RTA system and the EHELMD (Res) in the matched and mismatched testing conditions.
3.8 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions for the MHINT corpus.
3.9 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the mismatched testing conditions for the MHINT corpus.
3.10 Average performance comparison between different metric scores of the Wu-Wang, CDR, IDEA, and the ensemble HELM systems for the SimData.
3.11 Reverberation times (RT60s) and distance between the loudspeaker and the microphone for each room.
3.12 Average performance comparison between different metric scores of the Wu-Wang, CDR, IDEA, and the ensemble HELM systems for RealData.
4.1 Average PESQ and STOI scores of the unprocessed BCM speech and the HELM enhanced speech trained with the BCM/ACM utterance pairs.
4.2 Average PESQ and STOI scores of the unprocessed BCM speech and DDAE and HELM enhanced speech trained with the BCM/ACM(IE) utterance.
4.3 CERs of the original ACM and BCM test utterances and DDAE and HELM enhanced speech.
5.1 Average PESQ scores of logMMSE, AHELM, and AVHELM under matched and mismatched noise conditions.
6.1 Average PESQ scores of KLT, logMMSE, RPCA, HELMa, and HELMav processed speech signals under matched and mismatched noise conditions.
6.2 Average PESQ scores of HELMa and HELMav with binary and ternary weights under matched and mismatched noise conditions.
6.3 Average PESQ scores of HELMa and HELMav using 16-bit quantized input with real-valued and binary weights under matched and mismatched noise conditions.

Chapter 1
SPEECH SIGNAL PROCESSING: AN OVERVIEW

The goal of speech signal processing algorithms is to improve the intelligibility (the percentage of words correctly recognized by listeners) and the quality (the level of residual noise and reverberation in the signal) of a corrupted signal in adverse conditions [1]. In the past several decades, speech signal processing has attracted considerable research interest, owing to the wide dissemination of voice-based solutions for real-world applications, such as automatic speech recognition [2, 3, 4], speaker recognition [5] [6], speech coding [7], hearing aids [8] [9], and cochlear implants [10] [11]. As new applications are deployed, the definition of speech signal processing has broadened to include not only the classical noise reduction problem, but also signal separation and reverberation problems. In general, speech enhancement (SE) techniques can be categorized into three main groups, namely signal denoising, speech dereverberation, and channel compensation. In this dissertation, we mainly focus on SE techniques whose goal is to obtain high-quality speech from a low-quality version. First, this chapter provides a summary of the research work relating to three different types of SE techniques, namely speech denoising, speech dereverberation, and channel compensation. Next, the chapter discusses the work related to SE models proposed in more recent years using multimodal learning strategies. In addition, model-based compression and quantization techniques to reduce the computational costs are discussed. Finally, the key research challenges involved in designing a robust SE system and the contributions of this dissertation are briefly discussed.

1.1 Background

1.1.1 Speech Denoising

In real-world applications, background noise may diminish the quality and intelligibility of a speech signal acquired by a microphone to the point that it becomes useless for subsequent processing [1]. Several single-channel SE methods have been proposed in the past to address noise reduction tasks. However, the performance of SE in real acoustic environments is not always satisfactory, because improving intelligibility and quality concurrently is a challenging problem. A class of SE methods, termed spectral restoration, aims to design a filter or transformation that attenuates the noise components to generate clean speech. Notable techniques include the Wiener filter and its extensions [12, 13, 14], the minimum mean square error spectral estimator (MMSE) [15, 16, 17], the maximum a posteriori spectral amplitude estimator (MAPA) [18] [19], the maximum likelihood spectral amplitude estimator (MLSA) [20] [21], and generalized MAPA [22]. Another popular class of SE methods adopts speech models for SE. Notable examples include the harmonic model [23], the linear prediction (LP) model [24] [25], and the hidden Markov model (HMM) [26]. A common limitation of most of these conventional methods is that they rely on either the additive nature of the background noise or the statistical properties of speech and noise signals. As a consequence, these methods fail to properly counteract the non-stationary noise of real-world scenarios in unexpected acoustic conditions.

Rather than assuming an explicit model, methods based on nonlinear mapping have also been adopted to address noise reduction tasks. In such approaches, stereo training data is generally needed to learn a nonlinear mapping function between noisy and clean speech.

In the nonlinear mapping category, artificial neural networks (ANN) have been shown to be a viable solution to effectively address background noise issues [27] [28]. For example, in [29], a single hidden layer with 160 neurons was employed to estimate the instantaneous signal-to-noise ratio (SNR) level of the amplitude modulation spectrogram (AMS), and the noise was then suppressed according to the estimated SNRs of different channels. Alternatively, in [30, 31, 32], shallow ANNs were used to determine a mapping between the noisy and clean speech signals. Unfortunately, a lack of depth hindered comprehensive exploitation of the relationships between noisy and clean speech. By leveraging a greedy layer-wise unsupervised learning algorithm [33], often referred to as pretraining [34], deep neural networks (DNNs) can now be trained successfully, and the strong regression capabilities of deep models can be better explored. For example, deep/stacked denoising autoencoders (DDAEs) were used to model the relationship between clean and noisy features in [35] [36]. Deep recurrent neural networks and long short-term memory (LSTM) networks have also been adopted in feature enhancement [37] [38]. In [39], a deep belief network (DBN) with a restricted Boltzmann machine (RBM) was used to design a facial expression recognition (FER) system.
Akhtar et al. [40] further exploited the performance of neural networks by generating a K-support norm-based noise model to train neural networks. Meanwhile, convolutional neural networks, which have a better capability of modeling the local temporal-spectral structures of speech signals, have been adopted as a fundamental model for the SE task in [41], and a deeper convolutional neural network (DCNN) was used for hand gesture recognition in [42]. A common issue with ANN-based speech enhancers is degraded performance in the presence of unexpected noise. A simple yet effective solution to this problem is to cover many different types of noise in the training set, as proposed in [43]. In addition to ANNs, a generalized single-hidden-layer feedforward network (GSLFN) [44] has been proposed for regression problems, in which the traditional single-layer feedforward network (SLFN) is extended by exploiting polynomial functions of the inputs as output weights. In [45], the universal enhancing capabilities of deep models were investigated more thoroughly. In particular, the authors proposed a regression DNN-based SE framework by training a deep and wide neural network architecture on a large collection of heterogeneous training data with four noise types.

1.1.2 Speech Dereverberation

Reverberation refers to the collection of reflected sounds from surfaces (e.g., walls and objects) in an acoustic enclosure. It has been shown to severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Such deterioration can substantially affect the performance of speech-related applications, for instance ASR [46, 47, 48] and speaker identification systems [49, 50, 51]. It can also severely hamper speech reception performance for both normal and hearing-impaired listeners [52] [53]. In the last few decades, numerous approaches have been proposed to solve the reverberation problem.
The conventional speech dereverberation techniques can be categorized into three main groups [54]. The first group, referred to as source-model-based approaches, aims to separate the speech and reverberation based on prior information about clean speech structures and room reverberation effects. Notable algorithms belonging to this category include linear prediction (LP) methods [55, 56, 57], harmonic filtering techniques [58], and probabilistic models [59] [60]. The second group of algorithms is based on homomorphic transformation, in which the reverberated speech signals are analyzed in the cepstral domain so that the reverberation can simply be subtracted from the signal. Notable techniques include cepstral-based processing [61] and spectral subtraction [62]. The third group of algorithms comprises channel inversion methods, which employ inverse filtering to deconvolve the speech that was convolved with the room impulse response (RIR) during reverberation. Notable techniques include the minimum mean square error (MMSE) [63], least squares, beamforming [64], and matched filtering [65]. Recently, nonlinear spectral mapping approaches have been developed to address the reverberation problem. In these approaches, ANNs are generally used to 'learn' the mapping function between the reverberated and anechoic speech [66]. More recently, the universal approximation capabilities of deeper structures have been extensively studied [67]. The outcome of those studies indicates that deeper neural network structures enable strong learning capabilities, and that the reverberation problem can be handled with success. For example, DDAEs were adopted to reconstruct the anechoic speech signal from the reverberated signal in [46] [68]. In [69] [70], LSTM and deep recurrent neural network (DRNN)-based dereverberation systems were proposed to effectively reduce reverberation effects by leveraging the current as well as past frames.
In [71, 72, 73, 74, 75], DNN-based solutions were proposed to improve system performance by training a deeper framework to obtain a mapping from the reverberated speech signal to an anechoic one.

1.1.3 Channel Compensation

The differing acoustic characteristics of recording sensors in mobile and Internet of Things (IoT) devices can cause a major channel mismatch, which is another common problem in speech-related applications. In this dissertation, we focus on the channel mismatch problem by considering utterances recorded using two different microphones, i.e., an air-conducted microphone (ACM) and a bone-conducted microphone (BCM), as a representative channel mismatch condition. A number of filtering-based and probabilistic solutions have been proposed in the past to convert low-quality BCM utterances to high-quality ACM utterances. In [76], the BCM utterances were passed through a designed reconstruction filter to improve quality. In [77] and [78], BCM and ACM utterances were combined for SE and ASR in nonstationary noisy environments. In [79], a probabilistic optimum filter (POF)-based algorithm was used to estimate clean features from the combination of standard and throat microphone signals. Thang et al. [80] restored bone-conducted speech in noisy environments based on a modulation transfer function (MTF) and a linear prediction (LP) model. Later, Tajiri et al. [81] proposed a noise suppression technique based on nonnegative tensor factorization using a body-conducted microphone known as a nonaudible murmur (NAM) microphone.

1.1.4 Multimodal Speech Enhancement

Recent studies have shown that the visual modality carries important information, such as lip motions and mouth articulations, that can help discriminate similar speech sounds in noisy conditions [82, 83, 84]. Recently, several SE methods that integrate audio and visual information have been proposed. For example, in [85] [86], fully connected and convolutional neural network models were used to build audio-visual SE systems, which improved noise reduction performance compared with audio-only frameworks. In [87], the authors proposed a deep-learning-based framework to investigate the impact of the Lombard effect on the performance of an audio-visual SE system. In [88], a speech separation system was proposed that incorporated audio-visual information using a deep-network-based model.

More recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. Alongside the state-of-the-art performance achieved by deep-learning-based techniques in different classification and regression tasks, a considerable amount of research has been done on quantization-based model compression strategies to improve the computational efficiency of deep-learning-based systems for efficient online learning without degrading much of the system's overall performance [89, 90, 91].

1.2 Motivation

Traditional SE algorithms are generally derived from assumptions about the noise and reverberation signals. Unfortunately, such assumptions do not always hold in real-world conditions and thus may induce unwanted distortions in the reconstructed signals. Recent advances, on the other hand, have clearly demonstrated the great potential of deep neural models for speech signal processing. Despite the unmatched performance achieved by deep neural models, identifying a way to train deep learning models efficiently with limited resources remains a key issue. In addition, the robustness of neural models has received less attention in speech signal processing research. Indeed, deep models suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. If parallel noisy and clean speech is available in the target domain, model retraining can largely mitigate the degradation in performance. However, a huge parallel corpus is needed for meaningful retraining of gradient-based deep models. Thus, an effective front-end speech signal processing system that can handle attenuation and time delay effects is highly desired. The approaches proposed in the present dissertation fall into such a paradigm for speech signal processing; thereby, the rationale behind this dissertation lines up with gradient-based neural architectures.

1.3 Research Challenges

The improvement in intelligibility for both human listening and machine recognition is not always satisfactory in real acoustic conditions. In recent years, nonlinear spectral mapping-based solutions have been developed to address speech signal processing problems. Whilst deep-learning-based approaches can achieve outstanding results in speech signal processing, deep neural models have notable limitations: (1) their generalization performance suffers under mismatched training/test conditions, which can severely deteriorate system performance; (2) a multilayer architecture is considered as a whole and is trained and fine-tuned by several passes of backpropagation (BP) in order to achieve reasonable learning capabilities, a training scheme that is cumbersome and time-consuming; and (3) they require a large number of clean-noisy/reverberant-anechoic training waveform pairs covering as many environmental conditions as possible to improve robustness against unseen testing conditions, which may limit the deployment of deep neural model-based solutions in many real-world applications, especially on wearable or mobile client devices.

Although deep neural models have solved the slow gradient-based training and data augmentation problems [92] [93], training a deep neural model efficiently with limited resources remains a key issue. We support this concern with the following reasons. (1) An emerging research topic in deep learning is the investigation of new solutions for "few-shot learning" or "learning under low-resource conditions", that is, facilitating the use of deep models in real-world applications. Researchers have recently become aware that it is not always ideal to prepare a deep and universal model in the offline stage to handle diverse testing conditions in the online stage. As a result, deep models suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. By contrast, a model that can be trained efficiently with a small amount of training data is more favorable. (2) Computational cost is another consideration for applications. (3) In real-time situations, where the data arrives in a sequential stream and the environment is nonstationary and changes dynamically, an alternative option is required for online learning.

1.4 Contributions

To address the shortcomings of both conventional speech signal processing approaches (dynamically changing and nonstationary environments) and deep-learning-based approaches (data requirements), this dissertation focuses on alternative hierarchical extreme learning machine (HELM)-based solutions. Unlike traditional BP-based algorithms, the parameters of the ELM feature extraction layers are randomly specified and need not be fine-tuned, thereby providing an extremely fast training phase with good generalization performance and a universal approximation capability.
The proposed solutions have the key advantage of avoiding gradient-based optimization, so the parameters of the ELM can be optimized with a small amount of training data. To take advantage of the multilayer model, we employ a HELM for speech signal processing. Experimental evidence reported in the present dissertation indeed demonstrates that HELM-based solutions provide an extremely fast training phase with good generalization performance and a universal approximation capability when only a small amount of training data is available. The key goal is to devise data-driven models for speech signal processing that can be deployed efficiently by leveraging a small amount of training data and limited computational resources.

The main contributions of this dissertation are as follows:

• Initially, we exploited the unique and effective characteristics of the HELM model to construct a speech denoising framework. HELM extracts information in a multilayer manner, keeping all the advantages of deep models in approximating complicated functions and maintaining strong regression capabilities. The proposed solution has the key advantage of avoiding the cumbersome and time-consuming training process of BP-based fine-tuning. In overview, the proposed framework demonstrated that (i) HELMs are indeed a viable solution for extracting clean speech features from their noisy counterparts, and HELM-based SE is effective even when the testing data involves mismatched noise types and SNR levels; and (ii) when the amount of training data is limited, the proposed HELM-based SE algorithm outperforms algorithms based on conventional BP-based neural networks under different testing conditions.

• Next, an ensemble learning approach is devised to handle attenuation and time delay effects for speech dereverberation. The main focus of the proposed approach is to examine the effectiveness of combining HELM models by leveraging three mechanisms never before employed in HELMs: ensemble learning, residual structures, and highway structures. In addition, the objective of the proposed framework is to address the data requirement issue while preserving the advantages of deep neural structures. The goal is to construct a data-driven model that can be deployed efficiently by leveraging a small amount of training material and limited computational resources.

• In addition to noise and reverberation, we then study the effect of channel mismatch on enhancement performance. Channel mismatch is yet another common problem that can significantly degrade the quality of speech signals for both human and machine listeners. To address this issue, we present a HELM-based framework to convert low-quality bone-conducted utterances to high-quality air-conducted utterances. Compared with those of a traditional microphone, i.e., an ACM, the speech signals recorded with a BCM are robust against noise, although some high-frequency components may be missing. The experimental results verify that the proposed framework notably improves the original bone-conducted speech and outperforms the previous deep-learning-based SE framework in terms of standardized objective measures as well as automatic speech recognition (ASR) performance.

• Research has shown that the visual modality, such as lip movements and mouth articulations, carries important information that can help discriminate similar speech patterns in noisy conditions. Inspired by the success achieved for speech denoising by the conventional HELM, we build a joint audio-visual speech denoising framework that incorporates visual information alongside audio to deal with unseen noises under low SNR conditions. The proposed multimodal framework outperforms the conventional audio-only framework, exhibiting satisfactory performance in terms of standardized objective measures under matched and mismatched testing conditions. The results further confirm the applicability of HELM-based solutions using multimodal frameworks under challenging conditions and in low-resource environments.

• To facilitate the use of deep-learning-based models in real-world applications, the dissertation investigates the performance of the multimodal speech denoising framework under model compression strategies, namely binarization and quantization. The proposed audio-visual framework is trained using binary weights and quantized speech signals to cut down the computational requirements. The results demonstrate that the proposed framework with binarized weights and quantized data still works, with only a slight reduction in the overall performance of the system.

1.5 Dissertation Outline

This dissertation is organized as follows. Chapter 2 discusses the proposed ELM- and HELM-based speech denoising/enhancement frameworks. Chapter 3 formulates the problem of speech dereverberation and derives an ensemble learning approach to effectively recover anechoic speech from reverberated speech using HELM-based spectral mapping. Chapter 4 extends the HELM framework and provides a HELM-based channel compensation strategy. Chapter 5 extends the single-modality framework and adopts a multimodal learning approach to train the audio-visual framework to obtain enhanced speech. Chapter 6 describes the model compression technique based on multimodal learning. Finally, Chapter 7 summarizes this dissertation, highlights its research contributions, and provides insight for future work.

Chapter 2 ELM-BASED SPEECH DENOISING

2.1 Overview

In wireless telephony and audio data mining applications, it is desirable that noise suppression be robust against changing noise conditions and operate in real time (or faster). The learning efficiency and online computation of artificial neural networks are therefore critical factors in applications for speech enhancement tasks. To address these issues, we present an ELM framework aimed at the effective and fast removal of background noise from a single-channel speech signal, based on a set of randomly chosen hidden units and analytically determined output weights. Because feature learning with a shallow ELM may not be effective for natural signals such as speech, even with a large number of hidden nodes, HELM architectures are deployed by leveraging sparse autoencoders. In this manner, we not only keep all the advantages of deep models in approximating complicated functions and maintaining strong regression capabilities, but we also overcome the cumbersome and time-consuming nature of BP-based fine-tuning schemes, which are typically adopted for training deep neural architectures. The proposed ELM framework was evaluated on the Aurora-4 speech database. The Aurora-4 task provides relatively limited training data, and test speech data corrupted with both additive noise and convolutive distortions for matched and mismatched channels and SNR conditions. In addition, the task includes a subset of testing data involving noise types and SNR levels that are not seen in the training data. The experimental results indicate that when the amount of training data is limited, both ELM- and HELM-based speech enhancement techniques consistently outperform conventional BP-based shallow and deep learning algorithms in terms of standardized objective evaluations under various testing conditions. The content of this chapter has been published in [94].

2.2 Introduction

In this chapter, we propose an alternative speech enhancement framework based on the unique and effective characteristics of the ELM algorithm [95], namely extremely fast training, good generalization, and a universal approximation/classification capability. ELMs can play a key role in many machine learning applications, such as traffic sign recognition [96], gesture recognition [97], video tracking [97], object classification [98], data representation in big data [99], water distribution and wastewater collection [100], opal grading [101], nonlinear time-series modeling [102], and adaptive dynamic programming [103]. In [104], the authors also demonstrated that ELMs are suitable for a wide range of feature mapping applications, rather than just the classical ones. Moreover, to take advantage of multilayer models, we deploy a speech enhancement algorithm with HELMs. To the best of our knowledge, this is the first work that applies ELM and HELM to the speech enhancement task. To evaluate the noise reduction capability of ELM and HELM, we conducted a series of experiments on the standardized Aurora-4 noisy speech corpus [105]. Notably, the amount of training data in the Aurora-4 speech corpus is relatively limited in comparison to that used in [45]. Aurora-4 also provides a subset of the test data that allows an assessment in mismatched (SNR and channel) conditions. The contributions of our results are as follows: (i) we have demonstrated that ELMs are indeed a viable solution for extracting clean speech features from their noisy counterparts, and that ELM-based speech enhancement is effective even when the testing data involves noise types and SNR levels that are not seen in the training data; and (ii) when the amount of training data is limited, the proposed ELM speech enhancement algorithm outperforms algorithms based on more conventional BP-based neural networks under different testing conditions, in terms of the perceptual evaluation of speech quality (PESQ, a standardized speech quality evaluation metric) and the segmental signal-to-noise ratio improvement (SSNRI, a standardized objective speech quality evaluation metric).

The remainder of this chapter is organized as follows. Section 2.3 presents the ELM/HELM-based speech enhancement algorithms. Section 2.4 presents our experimental setup and results. Section 2.5 summarizes the chapter.

2.3 Proposed Method

In general, speech enhancement techniques can be categorized into two main groups, namely signal processing solutions and data-driven approaches. In the following sections, we discuss the underpinnings of both approaches by describing some prominent techniques in each group. First, the speech enhancement problem is introduced more formally through the spectral restoration method. Next, we briefly discuss key data-driven methods.

2.3.1 Conventional Spectral Restoration Methods

Speech enhancement algorithms involve a transformation of a noisy speech signal into the spectral domain to recover the desired clean signal. A noisy speech signal $y[n]$ is composed of a clean speech signal $x[n]$ and an additive noise signal $v[n]$,

$$y[n] = x[n] + v[n], \tag{2.1}$$

where $n$ is the time index. The noisy signal is converted into the short-time Fourier transform (STFT) domain to determine its frequency and phase components. In the STFT, the speech signal is divided into short frames using a window function $w[n]$. The corresponding STFT speech signal can be expressed as

$$Y[m, l] = X[m, l] + V[m, l], \tag{2.2}$$

where $Y[m, l]$, $X[m, l]$, and $V[m, l]$ are the $m$th frequency bins of the noisy speech, clean speech, and noise spectra of the $l$th frame, respectively, corresponding to frequency $\omega_m = 2\pi m / M$, $m = 0, 1, \ldots, M-1$. The aim of speech enhancement approaches is to restore $x[n]$ (or $X[m, l]$) from $y[n]$ (or $Y[m, l]$). For spectral restoration, a gain function $G[m, l]$ is estimated based on the computed a priori SNR and a posteriori SNR statistics. The enhanced speech spectrum, $\hat{X}[m, l]$, is obtained by filtering $Y[m, l]$ through $G[m, l]$. The phase of the noisy speech is copied and used as the phase of the enhanced speech. An inverse STFT (ISTFT) is applied to $\hat{X}[m, l]$, $m = 0, 1, \ldots, M-1$; $l = 1, 2, \ldots, L$, together with this phase, to obtain the enhanced speech $\hat{x}$. Some of the notable techniques mentioned in Chapter 1, namely MMSE [15, 16, 17], MAPA [18] [19], and MLSA [20] [21], are based on this approach.
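The spectral restoration pipeline described above (STFT, gain estimation, reuse of the noisy phase, ISTFT) can be sketched as follows. This is a minimal illustration and not the implementation evaluated in this dissertation. The Wiener-type gain $G = \xi / (1 + \xi)$, the maximum-likelihood a priori SNR estimate $\xi = \max(\gamma - 1, 0)$, the noise estimate taken from the first few frames, and all function names and parameter values are assumptions made only for the example.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """STFT with a periodic Hann analysis window; returns (bins, frames)."""
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[l * hop:l * hop + frame_len] * w
                       for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(S, frame_len=256, hop=128):
    """Overlap-add inverse; the periodic Hann window sums to one at 50% overlap."""
    frames = np.fft.irfft(S.T, n=frame_len, axis=1)
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for l, frame in enumerate(frames):
        out[l * hop:l * hop + frame_len] += frame
    return out

def wiener_enhance(y, noise_frames=6, floor=1e-3):
    """Gain-based enhancement: X_hat[m, l] = G[m, l] * Y[m, l]."""
    Y = stft(y)
    # Assumption: the first few frames contain noise only.
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    gamma = np.abs(Y) ** 2 / (noise_psd + 1e-12)   # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)              # ML estimate of the a priori SNR
    G = np.maximum(xi / (1.0 + xi), floor)         # Wiener-type gain with a floor
    X_hat = G * Y                                  # real gain, so the noisy phase is kept
    return istft(X_hat)
```

Because the gain is real-valued, multiplying it with the complex spectrum leaves the noisy phase untouched, which matches the phase-copying step described above. The MMSE, MAPA, and MLSA estimators differ from this sketch in how the gain is derived.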

2.3.2 Data-Driven Methods

Nonnegative Matrix Factorization

In nonnegative matrix factorization (NMF)-based speech enhancement, a speech data matrix $Y \in \mathbb{R}^{M \times L}$ with $M$ frequency bins and $L$ speech frames is projected to a space that is a linear combination of a set of vectors, i.e., $Y \approx WH$, where $W = [W_X \; W_V] \in \mathbb{R}^{M \times (p_x + p_v)}$ ($W_X$ and $W_V$ denote the basis matrices of speech and noise, respectively) and $H = [H_{\hat{X}}^T \; H_{\hat{V}}^T]^T \in \mathbb{R}^{(p_x + p_v) \times L}$ ($H_{\hat{X}}$ and $H_{\hat{V}}$ denote the estimated coefficient matrices of speech and noise, respectively). Here, $p_x, p_v \le \min(M, L)$ are the corresponding numbers of basis vectors for speech and noise. The NMF approximation is achieved by using one of two alternative minimization criteria: (1) the least squares criterion, which minimizes $\|Y - WH\|^2$ with respect to $W$ and $H$; and (2) the generalized Kullback-Leibler (KL) divergence, which minimizes $D(Y \,\|\, WH)$ [106, 107].

During the training stage, NMF is applied separately on clean and noisy data, in which the magnitude spectra of the speech ($|X[m, l]|$) and noise ($|V[m, l]|$) are computed. Subsequently, the Euclidean distance between the magnitude spectrum and the factored matrices is minimized by the following update rules [106]:

$$H \leftarrow H \otimes \frac{W^T Y}{W^T W H}, \qquad W \leftarrow W \otimes \frac{Y H^T}{W H H^T}, \tag{2.3}$$

where $\otimes$ and the fraction bar denote element-wise multiplication and division, respectively. In the enhancement stage, a spectral gain is estimated and the enhanced speech is obtained as

$$\hat{X}[m, l] = G[m, l] \, Y[m, l], \tag{2.4}$$

where the gain function $G[m, l]$ is formulated using a specific statistical model and optimality criterion.

Deep Denoising Autoencoder (DDAE)

Recently, deep denoising autoencoders (DDAEs) have demonstrated strong performance in the field of speech enhancement. A DDAE is trained on noisy-clean pairs to learn the statistical relationship between the clean and noisy speech signals [108]. The aim of the DDAE is to transform the noisy speech signal into clean speech by minimizing the reconstruction error between the predicted signal $\hat{X}$ and the reference clean signal $X$, such that

$$\theta^{*} = \arg\min_{\theta} \big( E(\theta) + \rho \, C(\theta) \big) \tag{2.5}$$

with

$$E(\theta) = \|\theta(Y) - X\|_F^2, \tag{2.6}$$

where $\rho$ is a constant that controls the tradeoff between the reconstruction accuracy and the regularization term $C(\theta)$ [36], and $\theta(Y)$ denotes the transformation function of the DDAE. During the training phase, a DDAE is trained in a greedy layer-wise manner and is then used to estimate clean speech from noisy speech signals as

$$\begin{aligned}
h_1(Y[l]) &= \sigma(W_1 Y[l] + b_1), \\
&\;\;\vdots \\
h_{D-1}(Y[l]) &= \sigma(W_{D-1} \, h_{D-2}(Y[l]) + b_{D-1}), \\
\hat{X}[l] &= W_D \, h_{D-1}(Y[l]) + b_D,
\end{aligned} \tag{2.7}$$

where $Y[l] = [\log(|Y[1, l]|) \ldots \log(|Y[m, l]|) \ldots \log(|Y[M, l]|)]^T$ and $\hat{X}[l] = [\log(|\hat{X}[1, l]|) \ldots \log(|\hat{X}[m, l]|) \ldots \log(|\hat{X}[M, l]|)]^T$ are the $l$th logarithm amplitude vectors of the input noisy speech and estimated clean speech, respectively, $\{W_1 \ldots W_D\}$ are the weight matrices, $\{b_1 \ldots b_D\}$ are the corresponding bias vectors, and $\log(|\hat{X}[m, l]|)$ is the logarithmic amplitude of the enhanced speech at the $m$th frequency bin of the $l$th frame. Furthermore, $\sigma$ is the vector-wise nonlinear activation function. The objective in Eq. (2.5) can be optimized using any unconstrained optimization algorithm. In particular, the Hessian-free algorithm was adopted in [109] for this purpose. During the enhancement phase, the ISTFT is applied to the magnitude spectrum together with the phase spectrum of the original signal to reconstruct the waveform [45] [108]. The difference between the DDAE [108] and the DNN [45] lies in the initialization and architecture design: the DDAE formulates the noise reduction (NR) task as an encoding-decoding process, whereas the DNN considers it a regression task. If the decoder part of the DDAE is also a multilayer perceptron (MLP), it becomes a fully connected regression model, the same as the one presented in [45].

2.3.3 The ELM Model

The ELM model was proposed by Huang et al. [95] for single-layer feedforward networks (SLFNs) to overcome the drawbacks of the BP algorithm. ELM provides an efficient and quick learning process that does not require massive fine-tuning of parameters [104].

Shallow ELM

The input weights and biases of the hidden layer in SLFNs can be chosen randomly to learn $N$ distinct observations [110]. Given $N$ distinct observations $(\mathbf{y}_i, \mathbf{x}_i)$, where $\mathbf{y}_i = [y_{i1}, y_{i2}, \ldots, y_{iJ}]^T \in \mathbb{R}^J$ and $\mathbf{x}_i = [x_{i1}, x_{i2}, \ldots, x_{iI}]^T \in \mathbb{R}^I$, the outputs of the SLFNs can be modeled as

$$f(\mathbf{y}_i) = \sum_{q=1}^{Q} \boldsymbol{\beta}_q \, \sigma(\mathbf{w}_q \cdot \mathbf{y}_i + b_q), \tag{2.8}$$

where $\sigma(\cdot)$ is the activation function, $\mathbf{w}_q = [w_{1q}, w_{2q}, \ldots, w_{Jq}]^T \in \mathbb{R}^J$ is the weight vector from the input nodes to the $q$th hidden node, $b_q$ is the bias of the $q$th hidden node, $\boldsymbol{\beta}_q = [\beta_{q1}, \beta_{q2}, \ldots, \beta_{qI}]^T \in \mathbb{R}^I$ is the weight vector from the $q$th hidden node to the output nodes, and $Q$ is the number of hidden neurons. A standard SLFN that approximates the $N$ observations with zero error satisfies

$$\sum_{i=1}^{N} \|f(\mathbf{y}_i) - \mathbf{x}_i\| = 0. \tag{2.9}$$

The above relation can be written compactly as

$$HB = X, \tag{2.10}$$

where

$$H = \begin{bmatrix} \sigma(\mathbf{w}_1 \cdot \mathbf{y}_1 + b_1) & \cdots & \sigma(\mathbf{w}_Q \cdot \mathbf{y}_1 + b_Q) \\ \vdots & \ddots & \vdots \\ \sigma(\mathbf{w}_1 \cdot \mathbf{y}_N + b_1) & \cdots & \sigma(\mathbf{w}_Q \cdot \mathbf{y}_N + b_Q) \end{bmatrix}_{N \times Q}, \quad
B = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \vdots \\ \boldsymbol{\beta}_Q^T \end{bmatrix}_{Q \times I}, \quad
X = \begin{bmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_N^T \end{bmatrix}_{N \times I}. \tag{2.10a}$$

The output weight matrix $B$ is computed as

$$B = H^{+} X, \tag{2.11}$$

where $H^{+}$ is the Moore-Penrose (MP) pseudoinverse of $H$, which can be calculated using orthogonal projection methods such as $H^{+} = (H^T H)^{-1} H^T$, where $H^T H$ should be nonsingular, or $H^{+} = H^T (H H^T)^{-1}$, where $H H^T$ should be nonsingular.

To solve the linear inverse problem arising at the ELM output, in this chapter we adopted the fast iterative shrinkage-thresholding algorithm (FISTA) [111], which is an extension of the gradient algorithm and offers better convergence properties for problems involving large amounts of data.
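Equations (2.8) to (2.11) can be illustrated with a short sketch: the hidden-layer parameters are drawn at random and only the output weights $B$ are solved for. The sketch below uses `numpy.linalg.pinv` to compute the pseudoinverse directly, which is a substitution for the FISTA solver adopted in this chapter. The function names, the uniform initialization range, and the sigmoid activation are assumptions made for the example.

```python
import numpy as np

def sigmoid(a):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def elm_fit(Y, X, n_hidden=64, seed=0):
    """Train a shallow ELM mapping rows of Y (N x J) to rows of X (N x I)."""
    rng = np.random.default_rng(seed)
    # Random input weights and biases; they are never fine-tuned.
    W = rng.uniform(-1.0, 1.0, size=(Y.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(Y @ W + b)            # hidden-layer output matrix, N x Q
    B = np.linalg.pinv(H) @ X         # Eq. (2.11): B = H^+ X, Q x I
    return W, b, B

def elm_predict(Y, W, b, B):
    """Evaluate Eq. (2.8) for every row of Y."""
    return sigmoid(Y @ W + b) @ B
```

Training reduces to one matrix factorization, which is where the speed advantage over iterative BP training comes from. For large $N$ or ill-conditioned $H$, a regularized or iterative solver such as the FISTA approach mentioned above would typically be preferred over an explicit pseudoinverse.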

Hierarchical ELM

Inspired by DNNs, where features are extracted using a multilayer framework with an unsupervised initialization, Tang et al. [97] extended ELM and proposed HELM for multilayer perceptrons (MLPs). The overall structure of the HELM model is illustrated in Fig. 2.1. The HELM framework comprises two stages, i.e., unsupervised feature extraction and supervised feature regression. In unsupervised feature extraction, high-level features are extracted using an ELM-based autoencoder by considering each layer as an autonomous layer. The input data is projected to the ELM feature space before feature extraction, in order to make use of information from the training data. The output of the unsupervised feature extraction stage can then be used as the input to the supervised ELM regression stage [97] for the final result, based on the learning from the two stages.

[Figure 2.1: HELM architecture. The input data passes through stacked sparse autoencoders (input weight, hidden weights) that form the unsupervised feature representation stage, followed by a supervised ELM regression layer (output weight).]
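The two-stage structure above can be sketched as stacked ELM autoencoders followed by an ELM regression layer. This is a simplified illustration under stated assumptions, not the configuration used in this dissertation: the autoencoder output weights are solved with an $\ell_2$ (ridge) penalty instead of a sparse solver, each layer's features are taken as the input multiplied by the transposed autoencoder weights, and all names, layer sizes, and the regularization constant `lam` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def ridge(H, T, lam):
    """Solve min_B ||H B - T||^2 + lam ||B||^2 in closed form."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ T)

def elm_autoencoder(A, n_hidden, rng, lam):
    """Learn weights that reconstruct A from a random hidden projection."""
    W = rng.uniform(-1.0, 1.0, size=(A.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(A @ W + b)
    return ridge(H, A, lam)           # shape: (n_hidden, input_dim)

def helm_fit(Y, X, layer_sizes=(32, 32), n_out_hidden=128, seed=0, lam=1e-3):
    rng = np.random.default_rng(seed)
    betas, A = [], Y
    # Stage 1: unsupervised feature extraction, one autonomous layer at a time.
    for n_hidden in layer_sizes:
        beta = elm_autoencoder(A, n_hidden, rng, lam)
        betas.append(beta)
        A = sigmoid(A @ beta.T)       # features passed to the next layer
    # Stage 2: supervised ELM regression on the extracted features.
    W = rng.uniform(-1.0, 1.0, size=(A.shape[1], n_out_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_out_hidden)
    B = ridge(sigmoid(A @ W + b), X, lam)
    return betas, W, b, B

def helm_predict(Y, model):
    betas, W, b, B = model
    A = Y
    for beta in betas:
        A = sigmoid(A @ beta.T)
    return sigmoid(A @ W + b) @ B
```

Each layer is solved once in closed form and then frozen, so no error is propagated back through the stack, which is the property that distinguishes this scheme from BP-based fine-tuning.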

Figure 2.2: HELM-based speech enhancement architecture. (Diagram blocks: Noisy Speech, STFT, Feature Extraction, HELM with Hidden Layer 1, Hidden Layer 2 and ELM Layer, Spectrum Recover, Phase, ISTFT, Enhanced Speech.)

ELM and HELM for SE

In this section, we describe the use of ELM and HELM as regression models to perform speech enhancement. Fig. 2.2 illustrates the system architecture of the proposed ELM/HELM-based speech enhancement approach. The main concept is to use an ELM/HELM model to transform noisy speech to clean speech. The overall system includes offline and online stages.

During the offline stage, a set of noisy-clean speech pairs is prepared. The noisy and clean speech signals are first converted into the frequency domain using the STFT to determine the frequency and phase components of the signal. The logarithm power spectra (LPS) of the noisy and clean speech spectra are then placed at the input and output sides of the ELM model, respectively. More specifically, the goal of the ELM/HELM system is to reconstruct the clean speech signal from the noisy speech by minimizing the reconstruction error, such that

\[
E = \| X - \hat{X} \|_F^2 \tag{2.12}
\]

where $\hat{X}$ is the estimated speech signal and $X$ is the reference clean speech signal. According to ELM theory [104], any continuous target function can be approximated as $\sum_{l=1}^{N} \| f(Y[l]) - \hat{X}[l] \| = 0$, where $Y[l]$ and $\hat{X}[l]$ are the $l$th logarithm amplitude vectors of the input noisy speech and estimated clean speech described in Section 2.3.2, respectively. The relationship in Eq. (2.8) can be written as

\[
f(Y[l]) = \sum_{q=1}^{Q} \beta_q \, \sigma(w_q \cdot Y[l] + b_q) \tag{2.13}
\]

where $w_q$ is the weight vector, $b_q$ is the bias, and $\beta_q$ is the output weight vector of the $q$th hidden node. The relation in Eq. (2.10) can be written compactly in matrix form as

\[
HB = \hat{X} \tag{2.14}
\]

where $H$ is the hidden layer output, $B$ is the output weight, and $\hat{X}$ is the estimated speech signal, given as

\[
H = \begin{bmatrix}
\sigma(w_1 \cdot Y[1] + b_1) & \cdots & \sigma(w_Q \cdot Y[1] + b_Q) \\
\vdots & \ddots & \vdots \\
\sigma(w_1 \cdot Y[N] + b_1) & \cdots & \sigma(w_Q \cdot Y[N] + b_Q)
\end{bmatrix}_{N \times Q}, \quad
B = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_Q^T \end{bmatrix}_{Q \times M}, \quad
\hat{X} = \begin{bmatrix} \hat{X}^T[1] \\ \vdots \\ \hat{X}^T[N] \end{bmatrix}_{N \times M}. \tag{2.14a}
\]

The corresponding output weight matrix for the estimated speech signal can be computed as

\[
\hat{B} = H^{+} \hat{X} \tag{2.15}
\]

where $H^{+}$ is the Moore-Penrose (MP) pseudoinverse of $H$ and is described in Section 2.3.3, $\hat{B}$ is the output weight matrix, and $\hat{X}$ is the estimated speech signal.

In the online stage, the noisy speech signals are first converted into LPS and phase parts. The noisy LPS features are transformed to obtain the enhanced ones by following the steps in Eqs. (2.13) and (2.14) for the ELM/HELM models ($H$ and $\hat{B}$) estimated in the offline stage. The phase of the noisy speech is used to prepare the phase of the enhanced speech. An ISTFT is applied to obtain the enhanced speech signals.

2.4. Experiments

In this section, we present our experimental setup and results.

2.4.1. Experimental Setup

Description of the Aurora–4 Database

The Aurora–4 [105] dataset was used to evaluate the performance of the proposed ELM-based speech enhancement algorithm. The Aurora–4 dataset includes speech data recorded at two sampling rates, 8 kHz and 16 kHz. The 16 kHz speech data was used in this chapter. Aurora–4 contains two training sets: clean and multicondition. Each set contains 7138 utterances, as shown in Table 2.1. In this chapter, we employed these two training sets to train the speech enhancement models (input data from the multicondition training set, output data from the clean training set). The multicondition training set was divided into two blocks, each consisting of 3569 utterances, where 893 were clean and the remaining 2676 were randomly contaminated with six different background noises at SNR levels varying from 10 to 20 dB. The first block of data was recorded using a Sennheiser microphone, and the second block was recorded using various microphones (so that the speech in the dataset contained interferences with two different channel conditions).

Table 2.1: Aurora–4 training set description.

| Training Set | Category | Description |
|---|---|---|
| Training Set 1 | Clean data | Clean speech with Sennheiser microphone (3569 utterances) |
| Training Set 2 | Multicondition data | Speech recorded with Sennheiser microphone (3569 utterances): no noise (893 utterances); speech contaminated with 6 different noises at 10–20 dB SNRs (2676 utterances) |
| | | Speech recorded with 18 different microphones (3569 utterances): no noise (893 utterances); speech contaminated with 6 different noises at 10–20 dB SNRs (2676 utterances) |

The testing set includes 4620 utterances, which were divided into 14 test sets, each containing 330 utterances. The entire set was used to test the performance [105] under different noise and channel conditions. The testing data includes six different noises, namely babble, car, restaurant, street, airport, and train, with both matched and mismatched channel conditions. The testing dataset was further classified into four larger groups, as shown in Table 2.2. Because Test Set 1 (Set A) contained clean speech only, the corresponding evaluation scores (PESQ, SSNRI, SDI, and STOI) are not included for comparison in the following discussion. From Table 2.2, it can be noted that Set B covered speech with additive noise, Set C covered speech with convolutive noise, and Set D contained speech with both additive and convolutive noises. Test Sets C and D contained clean and noisy test