國立政治大學 ‧ National Chengchi University
Although deep neural models have largely overcome the problems of slow gradient-based training and data augmentation [92] [93], training a deep neural model efficiently with limited resources remains a key issue. We base this concern on the following reasons: (1) An emerging research topic in deep learning is the search for new solutions to "few-shot learning" or "learning under low-resource conditions." For real-world applications, researchers have recently recognized that it is not always ideal to prepare a deep, universal model in the offline stage to handle diverse testing conditions in the online stage. Such deep models suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. In contrast, a model that can be trained efficiently with a small amount of training data is more favorable. (2) Computational cost is another consideration for practical applications. (3) In real-time situations, where data arrive as a sequential stream from dynamically changing, non-stationary environments, an alternative suited to online learning is required.
1.4 Contributions
To address the shortcomings of both conventional speech signal processing approaches (dynamically changing and non-stationary environments) and deep learning-based approaches (data requirements), this dissertation focuses on alternative solutions based on the hierarchical extreme learning machine (HELM). Unlike traditional BP-based algorithms, the parameters of the ELM feature extraction layers are randomly specified and need not be fine-tuned, thereby providing an extremely fast training phase with good generalization performance and a universal approximation capability. The proposed solutions
have the key advantage of avoiding gradient-based training, so the parameters of the ELM can be optimized with a small amount of training data. To take advantage of the multilayer model, we employ a HELM for speech signal processing. Experimental evidence reported in this dissertation demonstrates that HELM-based solutions provide an extremely fast training phase with good generalization performance and a universal approximation capability when only a small amount of training data is available. The key goal is to devise data-driven models for speech signal processing that can be deployed efficiently by leveraging a small amount of training data and limited computational resources.
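The training principle described above (random, untuned hidden-layer parameters with output weights obtained without gradient descent) can be illustrated with a minimal single-hidden-layer ELM. This is only a sketch under stated assumptions, not the HELM used in this dissertation: the function names `train_elm` and `predict_elm`, the tanh activation, the uniform initialization range, the ridge term `reg`, and the toy regression data are all illustrative choices.

```python
import numpy as np


def train_elm(X, T, n_hidden=64, reg=1e-3, seed=0):
    """Fit a single-hidden-layer ELM and return (W, b, beta)."""
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are drawn at random and never updated
    # (the uniform range is an assumption made for this sketch).
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)
    # Output weights come from ridge-regularized least squares in
    # closed form: beta = (H^T H + reg * I)^{-1} H^T T.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Apply the fixed random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta


# Toy regression problem (hypothetical data, not speech features).
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1, keepdims=True))

W, b, beta = train_elm(X, T)
mse = float(np.mean((predict_elm(X, W, b, beta) - T) ** 2))
print(f"training MSE: {mse:.6f}")
```

Because the only learned quantity is `beta`, training reduces to solving one linear system, which is the source of the fast training phase noted above. A hierarchical variant would stack several such randomly initialized feature layers before the final regression; that stacking is not shown here.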
The main contributions of this dissertation are as follows:
• Initially, we exploited the unique and effective characteristics of the HELM model to construct a speech denoising framework. HELM extracts information in a multilayer manner, retaining the advantages of deep models in approximating complicated functions while maintaining strong regression capabilities. The proposed solution has the key advantage of avoiding the cumbersome and time-consuming training process of BP-based fine-tuning. Overall, the proposed framework demonstrated that (i) HELMs are indeed a viable solution for extracting clean speech features from their noisy counterparts, and HELM-based SE is effective even when the testing data involve mismatched noise types and SNR levels; and (ii) when the amount of training data is limited, the proposed HELM-based SE algorithm outperforms algorithms based on conventional BP-based neural networks under different testing conditions.
• Next, an ensemble learning approach is devised to handle attenuation and time-delay effects for speech dereverberation. The main focus of the proposed approach is to examine the effectiveness of combining HELM models by leveraging three
mechanisms never before employed in HELMs: ensemble learning, residual structures, and highway structures. In addition, the proposed framework aims to address the data requirement issue while preserving the advantages of deep neural structures. The goal is to construct a data-driven model that can be deployed efficiently by leveraging a small amount of training material and limited computational resources.
• In addition to noise and reverberation, we then study the effect of channel mismatch on enhancement performance. Channel mismatch is yet another common problem that can significantly degrade speech signals for both human and machine listeners. To address this issue, we present a HELM-based framework to convert low-quality bone-conducted utterances into high-quality air-conducted utterances. Compared with signals recorded with a traditional air-conducted microphone (ACM), speech signals recorded with a bone-conducted microphone (BCM) are robust against noise, although some high-frequency components may be missing. The experimental results verify that the proposed framework notably improves the original bone-conducted speech and outperforms the previous deep learning-based SE framework in terms of standardized objective measures as well as automatic speech recognition (ASR) performance.
• Research has shown that the visual modality, such as lip movements and mouth articulations, carries important information that can help discriminate similar speech patterns in noisy conditions. Inspired by the success of the conventional HELM in speech denoising, we build a joint audio-visual speech denoising framework that incorporates visual information alongside audio to deal with unseen noises under low-SNR conditions. The proposed multimodal framework outperforms the conventional audio-only framework in terms of standardized objective measures under matched and mismatched
testing conditions. The results further confirm the applicability of HELM-based solutions using multimodal frameworks under challenging conditions and in low-resource environments.
• To facilitate the use of deep learning-based models in real-world applications, the dissertation investigates the performance of the multimodal speech denoising framework under two model compression strategies, namely binarization and quantization. The proposed audio-visual framework is trained using binary weights and quantized speech signals to cut down the computational requirements. The results demonstrate that the proposed framework with binarized weights and quantized data remains functional, with only a slight reduction in overall system performance.
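The two compression strategies named in the last contribution can be sketched in a few lines. The text above does not specify the exact schemes, so everything here is an assumption made for illustration: weights are binarized by their sign and rescaled by the mean absolute value, and signals are quantized uniformly over [-1, 1]. The helper names `binarize_weights` and `quantize_uniform` are hypothetical.

```python
import numpy as np


def binarize_weights(W):
    """Map each weight to {-alpha, +alpha}, with alpha = mean(|W|).

    Sign binarization with a single scaling factor is an assumed
    scheme; the dissertation may use a different one.
    """
    alpha = float(np.mean(np.abs(W)))
    return alpha * np.where(W >= 0, 1.0, -1.0)


def quantize_uniform(x, n_bits=8):
    """Uniformly quantize values in [-1, 1] to 2**n_bits levels."""
    levels = 2 ** n_bits - 1
    x = np.clip(x, -1.0, 1.0)
    # Scale to [0, levels], round to the nearest integer code,
    # then map the codes back to [-1, 1].
    codes = np.round((x + 1.0) / 2.0 * levels)
    return codes / levels * 2.0 - 1.0


# Hypothetical weight matrix and test signal.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
W_bin = binarize_weights(W)

signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 16))
signal_q = quantize_uniform(signal, n_bits=4)

print("distinct weight magnitudes:", np.unique(np.abs(W_bin)).size)
print("max quantization error:", float(np.max(np.abs(signal_q - signal))))
```

After binarization each weight needs only one bit plus a shared scale, and after 4-bit quantization each sample takes one of at most 16 values, with an error bounded by half a quantization step. These reductions in storage and arithmetic cost are what make such strategies attractive when computational resources are limited.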