
Hierarchical Extreme Learning Machine for Speech Signal Processing (多層次極限學習機於語音訊號處理上的應用) - NCCU Academic Hub (政大學術集成)


Academic year: 2021


國立政治大學理學院資訊科學系 博士論文
Department of Computer Science, National Chengchi University
Doctoral Dissertation

多層次極限學習機於語音訊號處理上的應用
Hierarchical Extreme Learning Machine for Speech Signal Processing

胡泰克 Tassadaq Hussain

指導教授︰曹昱 博士、廖文宏 博士
Advisors: Yu Tsao, Ph.D.; Wen-Hung Liao, Ph.D.

This dissertation is submitted for the degree of Doctor of Philosophy in Social Networks and Human Centered Computing.

中華民國 109 年 04 月
April 2020

DOI:10.6814/NCCU202000466

Acknowledgements

First of all, I would like to express my sincerest gratitude to my advisor Dr. Yu Tsao for his time, patience, motivation, devotion, and immense encouragement throughout the duration of my Ph.D. study. I am indebted to him for the freedom he gave me in deciding my direction and for providing me with the resources required to carry out my research work. The completion of my Ph.D. study would not have been possible without his continuous professional guidance, assistance, endless support, and empathy.

I am also grateful to my co-advisor Prof. Wen-Hung Liao for his patience, continuous feedback, motivation, and invaluable advice. He has always helped me in conducting research, and his unconditional support and encouragement kept my progress on schedule.

Special thanks are most certainly due to Prof. Sabato Marco Siniscalchi from the University of Enna, Italy, for his insightful observations, valuable suggestions, professional advice, and critical remarks during my submissions, which in effect helped me shape my final dissertation.

I would also like to extend my gratitude to the Taiwan International Graduate Program on Social Network and Human-Centered Computing (TIGP-SNHCC) for the Ph.D. fellowship which has supported this research. I am also indebted to the faculty of Computer Science, National Chengchi University; the Institute of Information Science, Academia Sinica; and the Research Center for Information Technology Innovation, Academia Sinica for their support throughout this training.

I would like to convey my gratitude to Prof. Mark Liao for his support, advice, and valuable suggestions throughout our Ph.D. study. He always motivated us and took care of us in every possible way amidst his busy schedule.

It would have been difficult to complete this work without the support and friendship provided by my BioASP lab members over this Ph.D. I also express my gratitude to Dr. Syu-Siang Wang for his feedback and valuable in-depth discussions. I also deeply thank all of my TIGP-SNHCC friends and BioASP lab members who have contributed to my research and are not listed here.

I am deeply indebted to my parents and family, and more specifically to my wife, for supporting and believing in me and for being my strongest backup during these years. Their encouragement, continuous support, and, most importantly, patience have helped me to complete my research studies. Their continuous affectionate love, support, prayers, and sacrifices led me to focus on what has been a hugely rewarding and enriching process that has led to the successful completion of my Ph.D. study. Last, but certainly not least, I would like to dedicate my thesis to my newborn daughter, who has made me stronger, better, and more fulfilled than I could ever have imagined.

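Chinese Abstract (中文摘要)

Speech is the most effective and natural means of human-to-human interaction. Over the past few decades, many topics in speech signal processing have been studied in depth, yet effectively improving human listening and machine recognition rates in real acoustic environments remains a difficult task. In recent years, voice-controlled personal assistant systems (e.g., Alexa and Google Home) have come into wide use and have reshaped the mode of human-machine interaction. In practical applications that often require distant-talking communication (e.g., audio data mining and voice-assisted applications), background noise can severely degrade the quality and intelligibility of speech signals, so the ability to suppress noise is an important issue in practical environments. To address it, this dissertation first proposes a speech denoising framework that aims to (i) remove background noise from single-channel speech signals effectively and quickly; (ii) extract clean speech features from noisy speech effectively under mismatched test conditions (stationary and non-stationary noise and different SNR levels); and (iii) achieve excellent denoising performance even when the amount of training data is limited. Experimental results confirm that, compared with deep-neural-network-based methods, the proposed HELM framework yields comparable or even better speech quality and intelligibility when the amount of training data is limited.

Besides noise, reverberation is another problem for speech. Reverberation generally refers to the collection of reflected sounds and can severely affect the performance of speech-related applications. In recent years, the strong regression capability of deep neural models has been shown to dereverberate speech effectively. A major drawback of deep neural models, however, is that they need a large number of reverberant-anechoic training speech pairs, and such large paired training sets are usually not easy to obtain. Developing an algorithm that uses a small amount of training data has therefore become an important research topic. This dissertation uses HELM to address both the reverberation problem and the data requirement problem, while also exploiting the advantages of an ensemble learning framework. Experimental results show that, under matched and mismatched test conditions, the framework outperforms traditional methods and a recently proposed integrated deep and ensemble learning algorithm.

One limitation of speech enhancement methods is that they cannot achieve satisfactory performance under unseen acoustic conditions. In this dissertation, we attempt to address the effect of channel mismatch with HELM, converting low-quality bone-conducted microphone speech into high-quality air-conducted microphone speech under real acoustic conditions. In addition to audio-only processing frameworks, we also apply the proposed method to multimodal learning to improve the overall performance of audio-only speech enhancement models, and we propose an audio-visual speech enhancement system. The results confirm that, under different test conditions, the audio-visual speech enhancement system performs better than the audio-only system. Another emerging research topic in deep learning is model compression to further increase applicability. We propose novel model compression techniques that effectively reduce computational requirements. In the future, we expect the compressed models to be implemented in hardware and combined with various speech applications.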

Abstract

Speech is the most effective and natural medium of communication in human-human interaction. In the past few decades, a great amount of research has been conducted on various aspects and properties of speech signal processing. However, improving intelligibility for both human listening and machine recognition in real acoustic conditions remains a challenging task. In recent years, voice-controlled personal assistant systems (such as Alexa, Google Home, and HomePod) have been widely used and have reshaped the human-machine interaction mode. In practical applications that often require distant-talking communication (e.g., audio data mining and voice-assisted applications), background noise can severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Therefore, it is desirable that noise suppression be robust against changing noise conditions so that it can operate in real-time environments. To address this issue, this dissertation initially presents a speech denoising framework which aims (i) to remove background noise from a single-channel speech signal effectively and quickly, (ii) to extract clean speech features from their noisy counterparts effectively even under mismatched testing conditions (stationary and non-stationary noise and different SNR levels), and (iii) to attain optimal performance when the amount of training data is limited. The proposed framework offers a universal approximation capability through comparative measures. The experimental results demonstrate that the proposed framework can yield comparable or even better speech quality and intelligibility compared with conventional signal processing-based and deep neural-based approaches when the amount of training data is limited.

Besides noise, reverberation is yet another issue that can affect the learning effectiveness and robustness of distant-talking communication devices. Reverberation generally refers to the collection of reflected sounds that can significantly affect the performance of speech-related applications. In recent years, the approximation capabilities of deeper neural models have been exploited to study the reverberation effect. The outcomes of these studies indicate that neural-based learning has strong regression capabilities and can achieve outstanding speech dereverberation results. However, deep neural models require a large amount of reverberant-anechoic training waveform pairs to achieve a reasonable performance improvement. Therefore, a data-driven solution is required that can achieve robust generalization performance for realistic reverberated conditions and can be optimized with a small amount of training data, or more precisely adaptation data. Motivated by the promising performance achieved for speech denoising, this dissertation next addresses the reverberation and data requirement issue while preserving the advantages of deep neural structures by leveraging an ensemble learning framework. Experimental results reveal that the proposed framework outperforms both traditional methods and a recently proposed integrated deep and ensemble learning algorithm in terms of standardized evaluation metrics under matched and mismatched testing conditions.
A common drawback of most modern speech enhancement (SE) approaches is that they are typically evaluated using simulated datasets, where training and testing conditions are generated in controlled environments. Consequently, these approaches suffer from channel mismatch problems in unseen acoustic conditions and are unable to achieve satisfactory performance. In online learning, where data arrives from different channels and environments, an effective solution is required to address the channel mismatch problem. This dissertation next addresses the impact of channel mismatch and proposes an alternative SE system which converts low-quality bone-conducted microphone utterances into high-quality air-conducted microphone utterances in real acoustic conditions.

Although the effects of noise and reverberation on audio-only frameworks have been well examined under diverse sets of synthetically generated conditions, such frameworks first need to acquire a large amount of training data, covering as many environmental conditions as possible, to improve robustness against unknown test conditions. Recent literature has exploited the great potential of auxiliary information in human-machine interactions. The data obtained from heterogeneous sensors and devices using the internet of things (IoT) can be useful for more robust inference, thereby providing further insights into multimodal learning. In addition to audio-only SE frameworks, multimodal learning has recently been adopted to improve the overall performance of audio-only SE models. The thesis later expands the audio-only paradigm of the SE framework and proposes an audio-visual SE system. The final results demonstrate that the incorporation of auxiliary information alongside audio can provide adequate performance enhancement over an audio-only SE system under different test conditions.

Another emerging focus of deep learning is to enable deep neural-based models to work in real-world applications. The problem with existing deep neural models is that they are computationally expensive and memory intensive, which limits their deployment on edge devices with low memory resources.
Based on the successful results of the audio-only and audio-visual SE frameworks, this thesis finally proposes a joint audio-visual SE framework that applies model and data compression strategies in order to meet the computational demands and facilitate real-time predictions. The proposed framework demonstrates that the incorporation of visual information helps the framework retain most of the information lost by the audio-only framework, while model compression lets the framework further reduce the computation requirement. Model compression also brings the model closer to hardware implementation for multimodal environments while retaining efficient regression ability.

Contents

Acknowledgements
中文摘要 (Chinese Abstract)
Abstract
Contents
List of Figures
List of Tables

1 SPEECH SIGNAL PROCESSING: AN OVERVIEW
  1.1 Background
    1.1.1 Speech Denoising
    1.1.2 Speech Dereverberation
    1.1.3 Channel Compensation
    1.1.4 Multimodal Speech Enhancement
  1.2 Motivation
  1.3 Research Challenges
  1.4 Contributions
  1.5 Dissertation Outline

2 ELM-BASED SPEECH DENOISING
  2.1 Overview
  2.2 Introduction
  2.3 Proposed Method
    2.3.1 Conventional Spectral Restoration Methods
    2.3.2 Data Driven Methods
    2.3.3 The ELM Model
  2.4 Experiments
    2.4.1 Experimental Setup
    2.4.2 Experimental Results
  2.5 Summary

3 ELM-BASED SPEECH DEREVERBERATION
  3.1 Overview
  3.2 Introduction
  3.3 Proposed Method
    3.3.1 Ensemble Learning for Speech Signal Processing
    3.3.2 HELM-based Speech Dereverberation System
    3.3.3 Highway HELM
    3.3.4 Residual HELM
    3.3.5 Ensemble HELM for Speech Dereverberation
  3.4 Experiments
    3.4.1 Experimental Setup
    3.4.2 Experimental Results
  3.5 Summary

4 ELM-BASED CHANNEL COMPENSATION
  4.1 Overview
  4.2 Introduction
  4.3 Proposed Method
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 HELM-based SE System
    4.4.3 Spectrogram Analysis
    4.4.4 Automatic Speech Recognition
    4.4.5 Sensitivity/Stability Towards the Training Data
  4.5 Summary

5 ELM-BASED MULTIMODAL SPEECH ENHANCEMENT
  5.1 Overview
  5.2 Introduction
  5.3 Proposed Method
    5.3.1 Audio-only SE System
    5.3.2 Audio-Visual SE System
  5.4 Experiments
    5.4.1 Experimental Setup
    5.4.2 Experimental Results
  5.5 Summary

6 COMPRESSED MULTIMODAL SE
  6.1 Overview
  6.2 Introduction
  6.3 Proposed method for SE
    6.3.1 HELM-based multimodal System for SE
    6.3.2 Binarization and Quantization
  6.4 Experimental Evaluations
    6.4.1 Experimental Setup
    6.4.2 Feature Extraction
  6.5 Summary

7 CONCLUSION AND FUTURE WORK
  7.1 Conclusion
  7.2 Future Work

Bibliography
VITA

List of Figures

2.1 HELM architecture.
2.2 HELM-based speech enhancement architecture.
2.3 PESQ scores for ELM with different activation functions and numbers of hidden neurons.
2.4 PESQ, SDI, STOI, and SSNRI average scores for ELM and HELM configurations.
2.5 Spectrograms of an utterance (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.2976), (c) ELM (PESQ = 2.3018), and (d) HELM (PESQ = 2.5489) contaminated with babble noise.
2.6 Spectrograms of an utterance (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.4433), (c) ELM (PESQ = 2.5258), and (d) HELM (PESQ = 2.7345) contaminated with car noise.
2.7 PESQ score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
2.8 STOI score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
3.1 Residual block for DDAE.
3.2 Overall speech dereverberation architecture using (a) conventional HELM, (b) HELM(Hwy), and (c) HELM(Res).
3.3 Offline and online stages of the ensemble HELM (eHELM) dereverberation framework.
3.4 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) in the matched testing conditions.
3.5 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) in the mismatched testing conditions.
3.6 Amplitude envelopes of the fifth channel of: (a) Clean, (b) Reverb, (c) IDEAD (Res), and (d) eHELMD (Res). The reverberated utterance was at RT60 = 1.2 s.
3.7 Amplitude envelopes of the fifth channel of: (a) Clean, (b) Reverb, (c) IDEAD (Res), and (d) eHELMD (Res). The reverberated utterance was at RT60 = 1.2 s.
3.8 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) under the matched testing conditions for the MHINT.
3.9 Average STOI, SRMR, FwSSNR, Cep, and LLR scores of Reverb, Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) under the mismatched testing conditions for the MHINT.
3.10 Performance of DDAE(Res), HRNN, HLSTM, HBLSTM, and HELM(Res) frameworks of PESQ, STOI, SRMR, FwSSNR, Cep, and LLR evaluation metrics at RT60s (a) 0.3 s, (b) 0.6 s, (c) 0.9 s, and (d) 1.2 s using different amounts of reverberated-anechoic training utterance pairs (i.e., 120, 300, 600, 1200, 2400, and 3600).
3.11 Average subjective listening scores of Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) for RT60 = 0.6 s, 0.7 s, and 1.0 s, of MHINT.
3.12 Average subjective listening scores of Wu-Wang, CDR, IDEAD (Res), and eHELMD (Res) for large room (RT60 = 0.7 s) of SimData with distance ∈ {Near, Far}, and for the four rooms of RealData (= Lecture, Meeting, Office, and Stairways) of REVERB challenge corpus.
4.1 HELM-based SE Architecture
4.2 Spectrograms of the enhanced test utterances using the (c) DDAE and (d) HELM of the (a) ACM and (b) BCM utterances. For each figure, the x-axis denotes the time in seconds, and the y-axis represents the frequency in Hertz.
4.3 Average PESQ scores for DDAE and HELM SE frameworks using different amounts of training data.
4.4 Average STOI scores for DDAE and HELM SE frameworks using different amounts of training data.
4.5 CER results of DDAE and HELM frameworks using different amounts of training data.
5.1 Audio-only HELM-based SE framework.
5.2 Proposed AVHELM SE framework.
5.3 Average PESQ scores over six noise types at different SNR levels.
5.4 Average HASPI and SSNRI scores over six noise types at different SNR levels.
5.5 Spectrograms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with noise applause at SNR = -2 dB.
5.6 Waveforms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with noise applause at SNR = -2 dB.
6.1 The HELM-based multimodal SE framework.
6.2 Performance comparison of different frameworks using HASPI and SSNRI evaluation metrics for six noise types averaged across different SNRs.
6.3 Performance comparison of HELMa and HELMav using HASPI and SSNRI evaluation metrics for different SNRs averaged across six noise types.

List of Tables

2.1 Aurora-4 Training set description
2.2 Aurora-4 Test set description
2.3 Single result abstracted from average objective evaluation scores of ELM [500] and HELM [200 200 500] configuration
2.4 Performance comparison of HELM frameworks using different window sizes
2.5 Objective evaluation scores of DDAE and HELM alongside traditional speech enhancement methods
3.1 Average PESQ scores of HELM, HELM(Hwy), HELM(Res) and Reverb speech under specific reverberated conditions.
3.2 Average PESQ scores of HELM, HELM(Hwy), and HELM(Res) with different context information.
3.3 Average PESQ scores of four HELM frameworks with the RS and KB schemes.
3.4 Average PESQ scores of ensemble HELM and IDEA frameworks in the matched testing conditions.
3.5 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions.
3.6 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the mismatched testing conditions.
3.7 Average PESQ scores of the RTA system and the EHELMD (Res) in the matched and mismatched testing conditions.
3.8 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the matched testing conditions for the MHINT corpus.
3.9 Average PESQ scores of ensemble HELM and IDEA frameworks with complex structures in the mismatched testing conditions for the MHINT corpus.
3.10 Average performance comparison between different metric scores of the Wu-Wang, CDR, IDEA, and the ensemble HELM systems for the SimData.
3.11 Reverberation times (RT60s) and distance between the loudspeaker and the microphone for each room.
3.12 Average performance comparison between different metric scores of the Wu-Wang, CDR, IDEA, and the ensemble HELM systems for RealData.
4.1 Average PESQ and STOI scores of the unprocessed BCM speech and the HELM enhanced speech trained with the BCM/ACM utterance pairs.
4.2 Average PESQ and STOI scores of the unprocessed BCM speech and DDAE- and HELM-enhanced speech trained with the BCM/ACM(IE) utterance.
4.3 CERs of the original ACM and BCM test utterances and DDAE- and HELM-enhanced speech.
5.1 Average PESQ scores of logMMSE, AHELM, and AVHELM under matched and mismatched noise conditions.
6.1 Average PESQ scores of KLT, logMMSE, RPCA, HELMa, and HELMav processed speech signals under matched and mismatched noise conditions.
6.2 Average PESQ scores of HELMa and HELMav with binary and ternary weights under matched and mismatched noise conditions.
6.3 Average PESQ scores of HELMa and HELMav using 16-bit quantized input with real-valued and binary weights under matched and mismatched noise conditions.

Chapter 1

SPEECH SIGNAL PROCESSING: AN OVERVIEW

The goal of speech signal processing algorithms is to improve the intelligibility (the percentage of words correctly recognized by listeners) and the quality (the level of residual noise and reverberation in that signal) of a corrupted signal in adverse conditions [1]. In the past several decades, speech signal processing has attracted considerable research interest, owing to the wide dissemination of voice-based solutions for real-world applications, such as automatic speech recognition [2, 3, 4], speaker recognition [5] [6], speech coding [7], hearing aids [8] [9], and cochlear implants [10] [11]. As new applications are deployed, the definition of speech signal processing has broadened to include not only the classical noise reduction problem, but also signal separation and reverberation problems. In general, speech enhancement (SE) techniques can be categorized into three main groups, namely speech denoising, speech dereverberation, and channel compensation. In this dissertation, we mainly focus on SE techniques where the goal is to obtain high-quality speech from a low-quality version. First, this chapter provides a summary of the research work relating to these three types of SE techniques. Next, the chapter discusses the work related to SE models proposed in more recent years using multimodal learning strategies. In addition, model-based compression and quantization techniques to reduce the computational costs are discussed. Finally, the key research challenges involved in designing a robust SE system and the contributions of this dissertation are briefly discussed.

1.1 Background

1.1.1 Speech Denoising

In real-world applications, the level of background noise may significantly diminish the quality and intelligibility of a speech signal acquired by a microphone to the point that it becomes useless for subsequent processing [1]. Several single-channel SE methods have been proposed in the past to address noise reduction tasks. However, the performance of SE in real acoustic environments is not always satisfactory, because improving intelligibility and quality concurrently is a challenging problem. A class of SE methods, termed spectral restoration, aims to design a filter or transformation that attenuates the noise components to generate clean speech. Notable techniques include the Wiener filter and its extensions [12, 13, 14], the minimum mean square error spectral estimator (MMSE) [15, 16, 17], the maximum a posteriori spectral amplitude estimator (MAPA) [18] [19], the maximum likelihood spectral amplitude estimator (MLSA) [20] [21], and generalized MAPA [22]. Another popular class of SE methods adopts speech models for SE. Notable examples include the harmonic model [23], the linear prediction (LP) model [24] [25], and the hidden Markov model (HMM) [26]. A common limitation of most of these conventional methods is that they rely on either the additive nature of the background noise or the statistical properties of speech and noise signals. As a consequence, these methods fail to properly counteract the non-stationary noise of real-world scenarios in unexpected acoustic conditions.

Rather than assuming an explicit model, methods based on non-linear mapping have also been adopted to address noise reduction tasks. In such approaches, stereo training data is generally needed to learn a non-linear mapping function between noisy and clean speech. In the non-linear mapping category, artificial neural networks (ANN) have been shown to be a viable solution to effectively address background noise issues [27] [28]. For example, in [29], a single-hidden-layer network with 160 neurons was employed to estimate the instantaneous signal-to-noise ratio (SNR) level of the amplitude modulation spectrogram (AMS), and then the noise was suppressed according to the estimated SNRs of different channels. Alternatively, in [30, 31, 32], shallow ANNs were used to determine a mapping between the noisy and clean speech signals. Unfortunately, a lack of depth hindered comprehensive exploitation of the relationships between noisy and clean speech. By leveraging a greedy layer-wise unsupervised learning algorithm [33], often referred to as pre-training [34], deep neural networks (DNNs) can now be trained successfully, and the strong regression capabilities of deep models can be better explored. For example, deep/stacked denoising autoencoders (DDAEs) were used to model the relationship between clean and noisy features in [35] [36]. Deep recurrent neural networks and long short-term memory (LSTM) networks have also been adopted in feature enhancement [37] [38].
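To make the spectral restoration idea from the start of this subsection concrete, the following is a minimal sketch of a Wiener-style gain applied per frequency bin. It is an illustration under stated assumptions, not the method of [12, 13, 14] or of this dissertation: the function names `wiener_gain` and `enhance_magnitudes` are hypothetical, the noise power per bin is assumed to be known in advance, and the a priori SNR is approximated by the simple estimate max(noisy power / noise power - 1, 0).

```python
def wiener_gain(noisy_power, noise_power, floor=0.0):
    """Per-bin Wiener-style gain G = xi / (1 + xi).

    xi is a crude a priori SNR estimate, max(P_noisy / P_noise - 1, 0).
    This estimator and the optional gain floor are illustrative assumptions.
    """
    gains = []
    for p_y, p_n in zip(noisy_power, noise_power):
        if p_n <= 0:
            gains.append(1.0)  # no noise estimated in this bin: pass it through
            continue
        xi = max(p_y / p_n - 1.0, 0.0)
        gains.append(max(xi / (1.0 + xi), floor))
    return gains


def enhance_magnitudes(noisy_mag, noise_power, floor=0.0):
    """Attenuate noisy spectral magnitudes with the gain defined above."""
    noisy_power = [m * m for m in noisy_mag]
    gains = wiener_gain(noisy_power, noise_power, floor)
    return [g * m for g, m in zip(gains, noisy_mag)]


if __name__ == "__main__":
    # Two bins: one dominated by speech, one at the noise level.
    print(enhance_magnitudes([2.0, 1.0], [1.0, 1.0]))
```

Bins whose power does not exceed the assumed noise power are driven to the gain floor, which hints at the limitation noted above: when the noise is non-stationary, a fixed noise power estimate no longer matches the signal and the gain becomes unreliable.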
In [39], a deep belief network (DBN) with a restricted Boltzmann machine (RBM) was used to design a facial expression recognition (FER) system. Akhtar et al. [40] further exploited the performance of neural networks by generating a K-support norm-based noise model to train neural networks. Meanwhile, convolutional neural networks, which have a better capability of modeling local temporal-spectral structures of speech signals, have been adopted as a fundamental model for the SE task in [41], and a deeper convolutional neural network (DCNN) structure was used for hand gesture recognition in [42]. A common issue with ANN-based speech enhancers is the degraded performance in the presence of unexpected noise. A simple yet effective solution to this problem is to cover many different types of noise in the training set, as proposed in [43]. In addition to ANNs, a generalized single hidden layer feed-forward network (GSLFN) [44] has been proposed for regression problems, in which the traditional single-layer feed-forward network (SLFN) is extended by exploiting the polynomial functions of inputs as output weights. In [45], the universal enhancing capabilities of deep models were more thoroughly investigated. In particular, the authors proposed a regression DNN-based SE framework by training a deep and wide neural network architecture using a large collection of heterogeneous training data with four noise types.

1.1.2 Speech Dereverberation

Reverberation refers to the collection of reflected sounds from surfaces (e.g., walls and objects) in an acoustic enclosure. It has been shown to severely deteriorate the quality and intelligibility of speech signals for both human and machine listeners. Such a deterioration can substantially affect the performance of speech-related applications, for instance, ASR [46, 47, 48] and speaker identification systems [49, 50, 51]. It can also severely hamper speech reception performance for both normal and hearing-impaired listeners [52] [53]. In the last few decades, numerous approaches have been proposed to solve the reverberation problem.
The conventional speech dereverberation techniques can be categorized into three main groups [54]. The first group, referred to as source-model-based approaches, aims to separate the speech and reverberation based on the prior information of clean speech structures and room reverberation effects. Notable algorithms belonging to this category include the linear prediction (LP) methods [55, 56, 57], harmonic filtering techniques [58], and probabilistic models [59] [60]. Another group of algorithms is based on homomorphic transformation, in which the reverberated speech signals are analyzed in the cepstral domain so that the reverberation can simply be subtracted from the signal. Notable techniques include cepstral-based processing [61] and spectral subtraction [62]. The third group of algorithms is based on channel inversion and employs inverse filtering to deconvolve the speech that was convolved with the room impulse response (RIR) during reverberation. Notable techniques include the minimum mean square error (MMSE) [63], least square, beamforming [64], and matched filtering [65]. Recently, nonlinear spectral mapping approaches have been developed to address the reverberation problem. For these approaches, ANNs are generally used to "learn" the mapping function between the reverberated and anechoic speech [66]. More recently, the universal approximation capabilities of deeper structures have been extensively studied [67]. The outcome of those studies points out that deeper structures of neural networks enable strong learning capabilities, and the reverberation problem can be handled with success. For example, DDAEs were adopted to reconstruct the anechoic speech signal from the reverberated signal in [46] [68]. In [69] [70], LSTM- and deep recurrent neural network (DRNN)-based dereverberation systems were proposed to effectively reduce the reverberation effects by leveraging the current as well as past frames.
In [71, 72, 73, 74, 75], DNN-based solutions have been proposed to improve the performance of the system by training a deeper framework to obtain a mapping from the reverberated speech signal to an anechoic one.

1.1.3 Channel Compensation

Different acoustic features of recording sensors in mobile and Internet of Things (IoT) devices can cause a major channel mismatch, which is another common problem in speech-related applications. In this dissertation, we focus on the channel mismatch problem by considering utterances recorded using two different microphones, i.e., an air-conducted microphone (ACM) and a bone-conducted microphone (BCM), as a representative channel mismatch condition. A number of filtering-based and probabilistic solutions have been proposed in the past to convert low-quality BCM utterances to high-quality ACM utterances. In [76], the BCM utterances were passed through a designed reconstruction filter to improve quality. In [77] and [78], BCM and ACM utterances were combined for SE and ASR in non-stationary noisy environments. In [79], a probabilistic optimum filter (POF)-based algorithm was used to estimate the clean features from the combination of standard and throat microphone signals. Thang et al. [80] restored bone-conducted speech in noisy environments based on a modulation transfer function (MTF) and a linear prediction (LP) model. Later, Tajiri et al. [81] proposed a noise suppression technique based on non-negative tensor factorization using a body-conducted microphone known as a non-audible murmur (NAM) microphone.

1.1.4 Multimodal Speech Enhancement

Recent studies have shown that the visual modality carries important information, such as lip motions and mouth articulations, that can help discriminate similar speech sounds in noisy conditions [82, 83, 84]. Recently, several SE methods that integrate audio and visual information have been proposed. For example, in [85] [86], fully-connected and convolutional neural network models were used to build an audio-visual SE system and successfully improved the noise reduction performance compared to audio-only frameworks. In [87], the authors proposed a deep learning-based framework to investigate the impact of the Lombard effect on the performance of the audio-visual SE system. In [88], a speech separation system was proposed that incorporated audio-visual information using a deep network-based model.

More recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. In addition to the state-of-the-art performance achieved by deep-learning-based techniques in different classification and regression tasks, a considerable amount of research has been done on quantization-based model compression strategies to improve the computational capability of deep-learning-based systems for efficient online learning without much degrading the system's overall performance [89, 90, 91].

1.2 Motivation

Traditional SE algorithms are generally derived based on some assumptions about the noise and reverberation signals. Unfortunately, such assumptions do not always hold in real-world conditions and thus may induce unwanted distortions in the reconstructed signals. Recent advances, on the other hand, have clearly demonstrated the great potential of deep neural models for speech signal processing. Despite the unmatched performance achieved by deep neural models, identifying a way to train deep learning models efficiently with limited resources remains a key issue. Moreover, the robustness issue has received less attention in research on speech signal processing with neural models. Indeed, deep models suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. If parallel noisy and clean speech is available in the target domain, model re-training can largely mitigate the degradation in performance. However, a huge parallel corpus is needed for a meaningful re-training of gradient-based deep models. Thus, an effective front-end speech signal processing system that can handle attenuation and time delay effects is highly desired. The approaches proposed in the present dissertation fall into such a paradigm for speech signal processing; thereby, the rationale behind this dissertation lines up with gradient-based neural architectures.

1.3 Research Challenges

The improvement in intelligibility for both human listening and machine recognition is not always satisfactory in real acoustic conditions. In recent years, nonlinear spectral mapping-based solutions have been developed to address speech signal processing problems. Whilst deep learning-based approaches can achieve outstanding results in speech signal processing, deep neural models have notable limitations: (1) their generalization performance under mismatched training/test conditions can severely deteriorate the system performance; (2) a multilayer architecture is considered as a whole and is trained by several passes of back-propagation (BP)-based fine-tuning in order to achieve reasonable learning capabilities, and such a training scheme is cumbersome and time-consuming; and (3) they require a large amount of clean-noisy/reverberant-anechoic training waveform pairs covering as many environmental conditions as possible to improve robustness against unseen testing conditions, which may limit the deployment of deep neural model-based solutions in many real-world applications, especially when operated on wearable or mobile client sides.

Although deep neural models have addressed the slow gradient-based training and data-augmentation problems [92] [93], training a deep neural model efficiently with limited resources remains a key issue. To validate our concern, we provide the following reasons: (1) An emerging research topic in deep learning is the investigation of new solutions for "few-shot learning" or "learning under low resource conditions", that is, solutions that help deep models work in real-world applications. Researchers have recently become aware that it is not always ideal to prepare a deep and universal model in the offline stage to handle diverse testing conditions in the online stage. As a result, deep models suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. In contrast, a model that can be trained efficiently with a small amount of training data is more favorable. (2) The computational costs are another consideration for applications. (3) In real-time situations, where the data arrives in a sequential stream and exhibits dynamically changing and non-stationary environments, an alternative option is required for online learning.

1.4 Contributions

To address the shortcomings of both conventional speech signal processing approaches (dynamically changing and non-stationary environments) and deep learning-based approaches (data requirement), this dissertation focuses on alternative hierarchical extreme learning machine (HELM)-based solutions. Unlike traditional BP-based algorithms, the parameters of the ELM feature extraction layers are randomly specified and need not be fine-tuned, thereby providing an extremely fast training phase with good generalization performance and a universal approximation capability.
The proposed solutions have the key advantage of avoiding gradient-based optimization, so the parameters of ELM can be optimized with a small amount of training data. To take advantage of the multi-layer model, we employ a HELM for speech signal processing. Experimental evidence reported in the present dissertation indeed demonstrates that HELM-based solutions provide an extremely fast training phase with good generalization performance and a universal approximation capability when only a small amount of training data is available. The key goal is to devise data-driven models for speech signal processing that can be deployed efficiently by leveraging a small amount of training data and limited computational resources.

The main contributions of this dissertation are as follows:

• Initially, we exploited the unique and effective characteristics of the HELM model to construct a speech denoising framework. HELM extracts information in a multi-layer manner, keeping all the advantages of deep models in the approximation of complicated functions and maintaining strong regression capabilities. The proposed solution has the key advantage of avoiding the cumbersome and time-consuming training process of BP-based fine-tuning. In overview, the proposed framework demonstrated that (i) HELMs are indeed a viable solution for extracting clean speech features from their noisy counterparts, and HELM-based SE is effective even when the testing data involves mismatched noise types and SNR levels; and (ii) when the amount of training data is limited, the proposed HELM-based SE algorithm outperforms the algorithms based on conventional BP-based neural networks under different testing conditions.

• Next, an ensemble learning approach is devised to handle attenuation and time-delay effects for speech dereverberation.
The main focus of the proposed approach is to examine the effectiveness of combining the HELM models by leveraging three mechanisms never before employed in HELMs: ensemble learning, residual, and highway structures. In addition, the objective of the proposed framework is to address the data requirement issue while preserving the advantages of deep neural structures. The goal is to construct a data-driven model that can be deployed efficiently by leveraging a small amount of training material and limited computational resources.

• In addition to noise and reverberation, we then study the effect of channel mismatch on the enhancement performance. Channel mismatch is yet another common problem that can significantly degrade the overall quality of speech signals for both human and machine listeners. To address this issue, we present a HELM-based framework to convert low-quality bone-conducted utterances to high-quality air-conducted utterances. Compared with a traditional microphone, i.e., an ACM, the speech signals recorded with a BCM are robust against noise, although some high-frequency components may be missing. The experimental results verify that the proposed framework notably improves the original bone-conducted speech and outperforms the previous deep learning-based SE framework in terms of standardized objective measures, as well as automatic speech recognition (ASR) performance.

• Research has shown that the visual modality, such as lip movements and mouth articulations, carries important information that can help discriminate similar speech patterns in noisy conditions. Inspired by the success achieved for speech denoising by the conventional HELM, we build a joint audio-visual speech denoising framework by incorporating the visual information alongside audio to deal with unseen noises under low SNR conditions.
The proposed multimodal framework outperforms the conventional audio-only framework by exhibiting a satisfactory performance in terms of standardized objective measures under matched and mismatched testing conditions. The results further confirm the applicability of HELM-based solutions using multimodal frameworks under challenging conditions and in low-resource environments.

• To facilitate deep learning-based models in real-world applications, the dissertation investigates the performance of the multimodal speech denoising framework when model compression strategies, namely binarization and quantization, are applied. The proposed audio-visual framework is trained using binary weights and quantized speech signals to cut down the computational requirement. The results demonstrate that the proposed framework with binarized weights and quantized data still works as usual, with the overall performance of the system only slightly reduced.

1.5 Dissertation Outline

This dissertation is organized as follows. Chapter 2 discusses the proposed ELM- and HELM-based speech denoising/enhancement frameworks. Chapter 3 formulates the problem of speech dereverberation and derives an ensemble learning approach to effectively recover anechoic speech from reverberated speech using HELM-based spectral mapping. Chapter 4 extends the HELM framework and provides a HELM-based channel compensation strategy. Chapter 5 extends the single-modality framework and adopts a multimodal learning approach to train the audio-visual framework to obtain enhanced speech. Chapter 6 describes the model compression technique based on multimodal learning. Finally, Chapter 7 summarizes this dissertation, highlights its research contributions, and provides insight for future work.

Chapter 2
ELM-BASED SPEECH DENOISING

2.1 Overview

In wireless telephony and audio data mining applications, it is desirable that noise suppression be robust against changing noise conditions and operate in real time (or faster). The learning efficiency and online computation of artificial neural networks are therefore critical factors in applications for speech enhancement tasks. To address these issues, we present an ELM framework, aimed at the effective and fast removal of background noise from a single-channel speech signal, based on a set of randomly chosen hidden units and analytically determined output weights. Because feature learning with shallow ELM may not be effective for natural signals, such as speech, even with a large number of hidden nodes, HELM architectures are deployed by leveraging sparse auto-encoders. In this manner, we not only keep all the advantages of deep models in approximating complicated functions and maintaining strong regression capabilities, but we also overcome the cumbersome and time-consuming features of BP-based fine-tuning schemes, which are typically adopted for training deep neural architectures. The proposed ELM framework was evaluated on the Aurora-4 speech database. The Aurora-4 task provides relatively limited training data, and test speech data corrupted with both additive noise and convolutive distortions for matched and mismatched channels and SNR conditions. In addition, the task includes a subset of testing data involving noise types and SNR levels that are not seen in the training data. The experimental results indicate that when the amount of training data is limited, both ELM- and HELM-based speech enhancement techniques consistently outperform the conventional BP-based shallow and deep learning algorithms, in terms of standardized objective evaluations, under various testing conditions. The content of this chapter has been published in [94].

2.2 Introduction

In this chapter, we propose an alternative speech enhancement framework based on the unique and effective characteristics of the ELM algorithm [95], namely extremely fast training, good generalization, and a universal approximation/classification capability. ELMs can play a key role in many machine learning applications, such as traffic sign recognition [96], gesture recognition [97], video tracking [97], object classification [98], data representation in big data [99], water distribution and wastewater collection [100], opal grading [101], nonlinear time-series modeling [102], and adaptive dynamic programming [103]. In [104], the authors have also demonstrated that ELMs are suitable for a wide range of feature mapping applications, rather than just the classical ones. Moreover, to take advantage of multi-layer models, we deploy a speech enhancement algorithm with HELMs. To the best of our knowledge, this is the first work that applies ELM and HELM to the speech enhancement task.
To evaluate the noise reduction capability of ELM and HELM, we conducted a series of experiments on the standardized Aurora-4 noisy speech corpus [105]. Notably, the amount of training data in the Aurora-4 speech corpus is relatively limited in comparison to that used in [45]. Aurora-4 also provides a subset of the test data that allows an assessment in mismatched (SNR and channel) conditions. The contributions of our results are as follows: (i) we have demonstrated that ELMs are indeed a viable solution for extracting clean speech features from their noisy counterparts, and ELM-based speech enhancement is effective even when the testing data involves noise types and SNR levels that are not seen in the training data; and (ii) when the amount of training data is limited, the proposed ELM speech enhancement algorithm outperforms the algorithms based on more conventional BP-based neural networks under different testing conditions, in terms of the perceptual evaluation of speech quality (PESQ, a standardized speech quality evaluation metric) and the segmental signal-to-noise ratio improvement (SSNRI, a standardized objective speech quality evaluation metric).

The remainder of this chapter is organized as follows. Section 2.3 presents the ELM/HELM-based speech enhancement algorithms. Section 2.4 presents our experimental setup and results. The summary of this chapter is given in Section 2.5.

2.3 Proposed Method

In general, speech enhancement techniques can be categorized into two main groups, namely signal processing solutions and data-driven approaches. In the following sections, we discuss the underpinnings of both approaches by describing some prominent techniques in both groups. First, the speech enhancement problem will be introduced more formally through the spectral restoration method. Next, we briefly discuss key data-driven methods.

2.3.1 Conventional Spectral Restoration Methods

Speech enhancement algorithms involve a transformation of a noisy speech signal into the spectral domain to recover the desired clean signal. A noisy speech signal $y[n]$ is composed of a clean speech signal $x[n]$ and an additive noise signal $v[n]$,

\[ y[n] = x[n] + v[n], \tag{2.1} \]

where $n$ is the time index. The noisy signal is converted into the short-time Fourier transform (STFT) domain to determine its frequency and phase components. In the STFT, the speech signal is divided into short frames using a window function $w[n]$. The corresponding STFT speech signal can be expressed as

\[ Y[m,l] = X[m,l] + V[m,l], \tag{2.2} \]

where $Y[m,l]$, $X[m,l]$, and $V[m,l]$ are the $m$th frequency bins of the noisy speech, clean speech, and noise spectra of the $l$th frame, respectively, corresponding to frequency $\omega_m$, where $\omega_m = 2\pi m/M$, $m = 0, 1, \ldots, M-1$. The aim of speech enhancement approaches is to restore $x[n]$ (or $X[m,l]$) from $y[n]$ (or $Y[m,l]$). For spectral restoration, a gain function $G[m,l]$ is estimated based on the computed a priori SNR and a posteriori SNR statistics. The enhanced speech, $\hat{X}[m,l]$, is obtained by filtering $Y[m,l]$ through $G[m,l]$. The phase of the noisy speech is copied and used as the phase of the enhanced speech. An inverse STFT (ISTFT) is applied to $\hat{X}[m,l]$, $m = 0, 1, \ldots, M-1$; $l = 1, 2, \ldots, L$, together with the phase, to obtain the enhanced speech $\hat{x}$. Some of the notable techniques mentioned in Chapter 1, namely MMSE [15, 16, 17], MAPA [18] [19], and MLSA [20] [21], are based on this approach.
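The pipeline of Eqs. (2.1) and (2.2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions and not the estimators cited above: the square-root Hann window, the 50% overlap, the Wiener-type gain with a maximum-likelihood a priori SNR estimate, and the assumption that the first few frames contain noise only are all illustrative choices, and the function names (`stft`, `istft`, `wiener_gain`, `enhance`) are hypothetical.

```python
import numpy as np

FRAME = 256          # frame length M (assumed)
HOP = FRAME // 2     # 50% overlap (assumed)
# Periodic square-root Hann window: analysis * synthesis windows overlap-add to 1.
WIN = np.sqrt(np.hanning(FRAME + 1)[:-1])

def stft(y):
    """Return Y[m, l]: rows are frequency bins m, columns are frames l.
    Only the M/2 + 1 non-redundant bins of a real signal are kept."""
    n_frames = 1 + (len(y) - FRAME) // HOP
    frames = np.stack([y[l * HOP:l * HOP + FRAME] * WIN for l in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(S):
    """Inverse STFT by windowed overlap-add."""
    frames = np.fft.irfft(S.T, n=FRAME, axis=1) * WIN
    out = np.zeros(HOP * (frames.shape[0] - 1) + FRAME)
    for l, frame in enumerate(frames):
        out[l * HOP:l * HOP + FRAME] += frame
    return out

def wiener_gain(Y, noise_frames=8):
    """Gain G[m, l] from the a posteriori SNR and an ML a priori SNR estimate.
    Assumes the first `noise_frames` frames contain noise only."""
    noise_psd = np.mean(np.abs(Y[:, :noise_frames]) ** 2, axis=1, keepdims=True)
    gamma = np.abs(Y) ** 2 / (noise_psd + 1e-12)   # a posteriori SNR
    xi = np.maximum(gamma - 1.0, 0.0)              # a priori SNR (ML estimate)
    return xi / (1.0 + xi)

def enhance(y):
    Y = stft(y)
    G = wiener_gain(Y)
    # G is real and non-negative, so G * Y keeps the phase of the noisy speech.
    return istft(G * Y)

# Toy signal: silence followed by a tone, corrupted by additive white noise.
rng = np.random.default_rng(0)
n = np.arange(16000)
x = np.where(n >= 2000, np.sin(2 * np.pi * 440 * n / 8000), 0.0)  # clean x[n]
y = x + 0.3 * rng.standard_normal(len(n))                          # y[n] = x[n] + v[n]
x_hat = enhance(y)
```

Because the gain is a real number between 0 and 1, multiplying it with `Y` changes only the magnitude of each bin, which is how the noisy phase ends up being reused in the reconstruction.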

2.3.2 Data Driven Methods

Nonnegative Matrix Factorization

In nonnegative matrix factorization (NMF)-based speech enhancement, a speech data matrix $Y \in \mathbb{R}^{M \times L}$ with $M$ frequency bins and $L$ speech frames is projected to a space that is a linear combination of a set of vectors, i.e., $Y \approx WH$, where $W = [W_X \; W_V] \in \mathbb{R}^{M \times (p_x + p_v)}$ ($W_X$ and $W_V$ denote the basis matrices of speech and noise, respectively) and $H = [H_{\hat{X}}^T \; H_{\hat{V}}^T]^T \in \mathbb{R}^{(p_x + p_v) \times L}$ ($H_{\hat{X}}$ and $H_{\hat{V}}$ denote the estimated coefficient matrices of speech and noise, respectively). Here, $p_x, p_v \le \min(M, L)$ are the corresponding numbers of basis vectors for speech and noise. The NMF approximation is achieved by using one of two alternative minimization criteria: (1) the least square criterion, which minimizes $\|Y - WH\|^2$ w.r.t. $W$ and $H$; and (2) the generalized Kullback-Leibler (KL) divergence, which minimizes $D(Y \| WH)$ [106, 107].

During the training stage, NMF is applied separately to clean speech and noise data, for which the magnitude spectra of the speech ($|X[m,l]|$) and noise ($|V[m,l]|$) are computed. Subsequently, the Euclidean distance between the magnitude spectrum and the factored matrices is minimized by the following update rules [106]:

\[ H \leftarrow H \otimes \frac{W^T Y}{W^T W H}, \qquad W \leftarrow W \otimes \frac{Y H^T}{W H H^T}, \tag{2.3} \]

where $\otimes$ and the fraction bar denote element-wise multiplication and division, respectively. In the enhancement stage, a spectral gain is estimated and the enhanced speech is obtained as

\[ \hat{X}[m,l] = G[m,l] \, Y[m,l], \tag{2.4} \]

where the gain function $G[m,l]$ is formulated using a specific statistical model and optimality criterion.

Deep Denoising Autoencoder (DDAE)

Recently, deep denoising autoencoders (DDAEs) have demonstrated strong performance in the field of speech enhancement. A DDAE is trained on noisy-clean pairs to learn the statistical relationship between the clean and noisy speech signals [108]. The aim of the DDAE is to transform the noisy speech signal into clean speech by minimizing the reconstruction error between the predicted signal $\hat{X}$ and the reference clean signal $X$, such that

\[ \theta^{*} = \arg\min_{\theta} \left( E(\theta) + \rho \, C(\theta) \right) \tag{2.5} \]

with

\[ E(\theta) = \| \theta(Y) - X \|_F^2, \tag{2.6} \]

where $\rho$ is a constant that controls the tradeoff between the reconstruction accuracy and the regularization term $C(\theta)$ [36], and $\theta(Y)$ denotes the transformation function of the DDAE. During the training phase, a DDAE is trained in a greedy layer-wise manner and is then used to estimate clean speech from noisy speech signals as

\[
\begin{aligned}
h_1(Y[l]) &= \sigma(W_1 Y[l] + b_1), \\
&\;\;\vdots \\
h_{D-1}(Y[l]) &= \sigma(W_{D-1} h_{D-2}(Y[l]) + b_{D-1}), \\
\hat{X}[l] &= W_D h_{D-1}(Y[l]) + b_D,
\end{aligned}
\tag{2.7}
\]

where $Y[l] = [\log(|Y[1,l]|) \ldots \log(|Y[m,l]|) \ldots \log(|Y[M,l]|)]^T$ and $\hat{X}[l] = [\log(|\hat{X}[1,l]|) \ldots \log(|\hat{X}[m,l]|) \ldots \log(|\hat{X}[M,l]|)]^T$ are the $l$th logarithm amplitude vectors of the input noisy speech and estimated clean speech, respectively, $\{W_1 \ldots W_D\}$ are the weight matrices, and $\{b_1 \ldots b_D\}$ are the corresponding bias vectors; $\hat{X}[l]$ is thus the logarithmic amplitude vector of the enhanced speech. Furthermore, $\sigma$ is the vector-wise non-linear activation function. The objective in Eq. (2.5) can be optimized by using any unconstrained optimization algorithm. In particular, the Hessian-free algorithm was adopted in [109] for this purpose. During the enhancement phase, the ISTFT is applied to the magnitude spectrum together with the phase spectrum from the original signal to reconstruct the waveform [45] [108]. The difference between DDAE [108] and DNN [45] lies in the initialization and architecture design, where DDAE formulates the noise reduction (NR) task as an encoding-decoding process, and DNN considers it as a regression task. If the decoder part of the DDAE is also a multi-layer MLP, then it becomes a fully-connected regression model, the same as the one presented in [45].

2.3.3 The ELM Model

The ELM model was proposed by Huang et al. [95] for single layer feed-forward networks (SLFNs) to overcome the drawbacks of the BP algorithm. ELM provides an efficient and quick learning process, which does not require the massive fine-tuning of parameters [104].

Shallow ELM

The input weights and biases of the hidden layer in SLFNs can be chosen randomly to learn $N$ distinct observations [110]. Given $N$ distinct observations $(y_i, x_i)$, where $y_i = [y_{i1}, y_{i2}, \ldots, y_{iJ}]^T \in \mathbb{R}^J$ and $x_i = [x_{i1}, x_{i2}, \ldots, x_{iI}]^T \in \mathbb{R}^I$, the outputs of the SLFNs can be modeled as

\[ f(y_i) = \sum_{q=1}^{Q} \beta_q \, \sigma(w_q \cdot y_i + b_q), \tag{2.8} \]

where $\sigma(\cdot)$ is the activation function, $w_q = [w_{1q}, w_{2q}, \ldots, w_{Jq}]^T \in \mathbb{R}^J$ is the weight vector from the input nodes to the $q$th hidden node, $b_q$ is the bias of the $q$th hidden node, $\beta_q = [\beta_{q1}, \beta_{q2}, \ldots, \beta_{qI}]^T \in \mathbb{R}^I$ is the weight vector from the $q$th hidden node to the output nodes, and $Q$ is the number of hidden neurons. A standard SLFN with $Q$ hidden nodes that approximates the $N$ observations with zero error satisfies

\[ \sum_{i=1}^{N} \| f(y_i) - x_i \| = 0. \tag{2.9} \]

The above relation can be written compactly as

\[ HB = X, \tag{2.10} \]

where

\[
H = \begin{bmatrix}
\sigma(w_1 \cdot y_1 + b_1) & \cdots & \sigma(w_Q \cdot y_1 + b_Q) \\
\vdots & & \vdots \\
\sigma(w_1 \cdot y_N + b_1) & \cdots & \sigma(w_Q \cdot y_N + b_Q)
\end{bmatrix}_{N \times Q},
\quad
B = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_Q^T \end{bmatrix}_{Q \times I},
\quad
X = \begin{bmatrix} x_1^T \\ \vdots \\ x_N^T \end{bmatrix}_{N \times I}.
\tag{2.10a}
\]

The output weight matrix $B$ is computed as

\[ B = H^{+} X, \tag{2.11} \]

where $H^{+}$ is the Moore-Penrose (MP) pseudoinverse of $H$, which can be calculated using orthogonal projection methods such as $H^{+} = (H^T H)^{-1} H^T$, where $H^T H$ should be non-singular, or $H^{+} = H^T (H H^T)^{-1}$, where $H H^T$ should be non-singular.

In order to solve the linear inverse problem arising at the ELM output, in this chapter we adopted the fast iterative shrinkage-thresholding algorithm (FISTA) [111], which is an extension of the gradient algorithm and offers better convergence properties for problems involving large amounts of data.
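To make Eqs. (2.8) to (2.11) concrete, the following NumPy sketch trains a shallow ELM on toy data. It is an illustration under stated assumptions: the output weights are obtained with the pseudoinverse of Eq. (2.11) through `numpy.linalg.pinv` rather than with FISTA, the sigmoid activation and Gaussian random weights are assumed choices, and the helper names (`elm_train`, `elm_predict`) and the toy regression task are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elm_train(Y, X, n_hidden, rng):
    """Fit a shallow ELM: random hidden layer, analytic output weights."""
    W = rng.standard_normal((Y.shape[1], n_hidden))  # input weights w_q (one column each), never tuned
    b = rng.standard_normal(n_hidden)                # hidden biases b_q, never tuned
    H = sigmoid(Y @ W + b)                           # N x Q hidden-layer output matrix, Eq. (2.10a)
    B = np.linalg.pinv(H) @ X                        # output weights, Eq. (2.11): B = H^+ X
    return W, b, B

def elm_predict(Y, W, b, B):
    """Evaluate f(y_i) of Eq. (2.8) for every row of Y."""
    return sigmoid(Y @ W + b) @ B

# Toy regression task (assumed): N = 500 observations, J = 3 inputs, I = 1 output.
rng = np.random.default_rng(0)
Y = rng.uniform(-1.0, 1.0, size=(500, 3))
X = np.sin(Y.sum(axis=1, keepdims=True))
W, b, B = elm_train(Y, X, n_hidden=50, rng=rng)
train_mse = np.mean((elm_predict(Y, W, b, B) - X) ** 2)
```

The only learned quantity is `B`, and it is obtained from a single linear least-squares solve, which is why no back-propagation pass appears anywhere in the sketch.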

Hierarchical ELM

Inspired by DNNs, where features are extracted using a multi-layer framework with an unsupervised initialization, Tang et al. [97] extended ELM and proposed HELM for multi-layer perceptrons (MLPs). The overall structure of the HELM model is illustrated in Fig. 2.1. The HELM framework comprises two stages, i.e., unsupervised feature extraction and supervised feature regression. In unsupervised feature extraction, high level features are extracted using an ELM-based autoencoder by considering each layer as an autonomous layer. The input data is projected to the ELM feature space before feature extraction, in order to make use of information from the training data. The output of the unsupervised feature extraction stage can then be used as the input to the supervised ELM regression stage [97] for the final result, based on the learning from the two stages.

[Figure 2.1: HELM architecture. Unsupervised feature representation: input data, input weight, and stacked sparse autoencoders with hidden weights (hidden layers). Supervised stage: ELM layer and output weight for regression.]
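The two-stage structure described above can be sketched as follows. This is a simplified illustration and not the implementation evaluated in this dissertation: the sparse autoencoders are replaced here by ridge-regularized ELM autoencoders so that every layer has a closed-form solution, the transposed autoencoder output weights are reused as the layer's encoder (a common ELM-autoencoder convention, assumed here), and all function names, layer sizes, and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def ridge(H, T, lam):
    """Closed-form minimizer of ||H B - T||^2 + lam * ||B||^2."""
    return np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ T)

def elm_autoencoder(F, n_hidden, lam, rng):
    """One unsupervised layer: output weights that reconstruct F from random features."""
    A = rng.standard_normal((F.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.standard_normal(n_hidden)                # random biases, never tuned
    H = sigmoid(F @ A + b)
    return ridge(H, F, lam)                          # shape (n_hidden, n_inputs)

def helm_train(Y, X, layer_sizes, n_elm, lam, rng):
    F, encoders = Y, []
    for q in layer_sizes:                 # stage 1: layer-wise, no back-propagation
        B_ae = elm_autoencoder(F, q, lam, rng)
        encoders.append(B_ae)
        F = sigmoid(F @ B_ae.T)           # transposed output weights act as the encoder
    A = rng.standard_normal((F.shape[1], n_elm))     # stage 2: random ELM layer
    b = rng.standard_normal(n_elm)
    B_out = ridge(sigmoid(F @ A + b), X, lam)        # supervised output weights
    return encoders, (A, b), B_out

def helm_predict(Y, encoders, elm_layer, B_out):
    F = Y
    for B_ae in encoders:
        F = sigmoid(F @ B_ae.T)
    A, b = elm_layer
    return sigmoid(F @ A + b) @ B_out

# Toy data (assumed): 200 observations, 8 input features, 2 regression targets.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 8))
X = np.tanh(Y[:, :2])
encoders, elm_layer, B_out = helm_train(Y, X, layer_sizes=(16, 16), n_elm=64, lam=1e-3, rng=rng)
X_hat = helm_predict(Y, encoders, elm_layer, B_out)
train_mse = np.mean((X_hat - X) ** 2)
```

Each layer is solved once and then frozen, so the stack is built front to back without any joint fine-tuning of the whole network.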

[Figure: block diagram with labels Noisy Speech, STFT, Feature Extraction, HELM (Hidden Layer 1, Hidden Layer 2, ELM Layer), Phase, Spectrum Recover, ISTFT, Enhanced Speech.]

Figure 2.2: HELM-based speech enhancement architecture.

ELM and HELM for SE

In this section, we describe the use of ELM and HELM as regression models for speech enhancement. Fig. 2.2 illustrates the system architecture of the proposed ELM/HELM-based speech enhancement approach. The main concept is to use an ELM/HELM model to transform noisy speech into clean speech. The overall system includes offline and online stages.

During the offline stage, a set of noisy-clean speech pairs is prepared. The noisy and clean speech signals are first converted into the frequency domain using the STFT to obtain the magnitude and phase components of the signal. The logarithm power spectra (LPS) of the noisy and clean speech are then placed at the input and output sides of the ELM model, respectively. More specifically, the goal of the ELM/HELM system is to reconstruct the clean speech signal from the noisy speech by minimizing the reconstruction error, such that

$$E = \| X - \hat{X} \|_F^2, \tag{2.12}$$

where $\hat{X}$ is the estimated speech signal and $X$ is the reference clean speech signal. According to ELM theory [104], any continuous target function can be approximated as $\sum_{l=1}^{N} \| f(Y[l]) - \hat{X}[l] \| = 0$, where $Y[l]$ and $\hat{X}[l]$ are the $l$th logarithm amplitude vectors of the input noisy speech and estimated clean speech described in Section 2.3.2, respectively. The relationship in Eq. (2.8) can be written as

$$f(Y[l]) = \sum_{q=1}^{Q} \beta_q \, \sigma(w_q \cdot Y[l] + b_q), \tag{2.13}$$

where $w_q$ is the weight vector, $b_q$ is the bias, and $\beta_q$ is the output weight vector of the $q$th hidden node. The relation in Eq. (2.10) can be written compactly in matrix form as

$$HB = \hat{X}, \tag{2.14}$$

where $H$ is the hidden layer output, $B$ is the output weight, and $\hat{X}$ is the estimated speech signal, given as

$$H = \begin{bmatrix} \sigma(w_1 \cdot Y[1] + b_1) & \cdots & \sigma(w_Q \cdot Y[1] + b_Q) \\ \vdots & \ddots & \vdots \\ \sigma(w_1 \cdot Y[N] + b_1) & \cdots & \sigma(w_Q \cdot Y[N] + b_Q) \end{bmatrix}_{N \times Q}, \quad B = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_Q^T \end{bmatrix}_{Q \times M}, \quad \hat{X} = \begin{bmatrix} \hat{X}^T[1] \\ \vdots \\ \hat{X}^T[N] \end{bmatrix}_{N \times M}. \tag{2.14a}$$

The corresponding output weight matrix for the estimated speech signal can be computed as

$$\hat{B} = H^{+} \hat{X}, \tag{2.15}$$

where $H^{+}$ is the Moore-Penrose (MP) pseudoinverse of $H$ and is described in Section 2.3.3, $\hat{B}$ is the output weight matrix, and $\hat{X}$ is the estimated speech signal.

In the online stage, the noisy speech signals are first converted into LPS and phase parts. The noisy LPS features are transformed into enhanced ones by following the steps in Eqs. (2.13) and (2.14) with the ELM/HELM models ($H$ and $\hat{B}$) estimated in the offline stage. The phase of the noisy speech is used as the phase of the enhanced speech. An ISTFT is then applied to obtain the enhanced speech signals.

2.4 Experiments

In this section, we present our experimental setup and results.

2.4.1 Experimental Setup

Description of the Aurora–4 Database

The Aurora–4 [105] dataset was used to evaluate the performance of the proposed ELM-based speech enhancement algorithm. The Aurora–4 dataset includes speech data recorded at two sampling rates, 8 kHz and 16 kHz. The 16 kHz speech data was used in this chapter. Aurora–4 contains two training sets: clean and multi-condition. Each set contains 7138 utterances, as shown in Table 2.1. In this chapter, we employed these two training sets to train the speech enhancement models (input data from the multi-condition training set, output data from the clean training set). The multi-condition training set was divided into two blocks, each consisting of 3569 utterances, where 893 were clean and the remaining 2676 were randomly contaminated with six different background noises at SNR levels varying from 10 to 20 dB. The first block of data was recorded using a Sennheiser

microphone, and the second block was recorded using various microphones (so that the speech in the dataset contained interference from two different channel conditions).

Table 2.1: Aurora–4 training set description.

| Training set | Category | Description |
|---|---|---|
| Training Set 1 | Clean data | Clean speech with Sennheiser microphone (3569 utterances) |
| Training Set 2 | Multi-condition data | Speech recorded with Sennheiser microphone (3569 utterances): no noise (893 utterances); speech contaminated with 6 different noises at 10-20 dB SNRs (2676 utterances) |
| | | Speech recorded with 18 different microphones (3569 utterances): no noise (893 utterances); speech contaminated with 6 different noises at 10-20 dB SNRs (2676 utterances) |

The testing set includes 4620 utterances, which were divided into 14 test sets, each containing 330 utterances. The entire set was used to test the performance [105] under different noise and channel conditions. The testing data includes six different noises, namely babble, car, restaurant, street, airport, and train, with both matched and mismatched channel conditions. The testing dataset was further classified into four larger groups, as shown in Table 2.2. Because Test Set 1 (Set A) contained clean speech only, the corresponding evaluation scores (PESQ, SSNRI, SDI, and STOI) are not included for comparison in the following discussion. From Table 2.2, it can be noted that Set B covered speech with additive noise, Set C covered speech with convolutive noise, and Set D contained speech with both additive and convolutive noises. Test Sets C and D contained clean and noisy test
