
National Chengchi University

Although deep neural models have addressed the problems of slow gradient-based training and data augmentation [92] [93], training a deep neural model efficiently with limited resources remains a key issue. We offer the following reasons for this concern: (1) An emerging research topic in deep learning is the investigation of new solutions for “few-shot learning” or “learning under low-resource conditions”. Researchers have recently become aware that, for real-world applications, it is not always ideal to prepare a deep and universal model in the offline stage to handle diverse testing conditions in the online stage. Deep models consequently suffer from a domain mismatch problem when the production environment differs significantly from the training conditions. By contrast, a model that can be trained efficiently with a small amount of training data is more favorable. (2) Computational cost is another consideration for applications. (3) In real-time situations, where data arrive as a sequential stream from dynamically changing and non-stationary environments, an alternative that supports online learning is required.

1.4 Contributions

To address the shortcomings of both conventional speech signal processing approaches (dynamically changing and non-stationary environments) and deep learning-based approaches (data requirements), this dissertation focuses on alternative solutions based on the hierarchical extreme learning machine (HELM). Unlike traditional BP-based algorithms, the parameters of the ELM feature extraction layers are randomly specified and need not be fine-tuned, thereby providing an extremely fast training phase with good generalization performance and a universal approximation capability. Because the proposed solutions avoid gradient-based optimization, the parameters of the ELM can be estimated with a small amount of training data. To take advantage of a multilayer model, we employ a HELM for speech signal processing. The experimental evidence reported in this dissertation demonstrates that HELM-based solutions retain the fast training phase, good generalization performance, and universal approximation capability when only a small amount of training data is available. The key goal is to devise data-driven models for speech signal processing that can be deployed efficiently by leveraging a small amount of training data and limited computational resources.
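As a rough illustration of why training without gradient-based fine-tuning is fast, the following minimal sketch fits a single-hidden-layer ELM: the hidden-layer weights are drawn at random and left fixed, and only the output weights are computed, in closed form, by regularized least squares. It is not the HELM architecture used in this dissertation. The function names `train_elm` and `predict_elm`, the tanh activation, the ridge penalty, the layer size, and the toy regression data are all assumptions made for illustration.

```python
import numpy as np

def train_elm(X, Y, n_hidden=64, ridge=1e-3, seed=0):
    """Fit a single-hidden-layer ELM (illustrative sketch, not the thesis model)."""
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are drawn once at random and never updated.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    # Assumption: tanh activation; other nonlinearities are equally possible.
    H = np.tanh(X @ W + b)
    # Output weights from ridge-regularized least squares, no backpropagation.
    beta = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the fixed random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta

# Toy regression problem (hypothetical data, unrelated to speech).
rng = np.random.default_rng(1)
X_train = rng.uniform(-1.0, 1.0, size=(200, 3))
Y_train = np.sin(X_train.sum(axis=1, keepdims=True))

W, b, beta = train_elm(X_train, Y_train)
train_mse = float(np.mean((predict_elm(X_train, W, b, beta) - Y_train) ** 2))
print(f"training MSE: {train_mse:.6f}")
```

Training here amounts to one matrix product and one linear solve, so its cost depends on the hidden-layer width rather than on a number of training epochs.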

The main contributions of this dissertation are as follows:

• First, we exploit the unique and effective characteristics of the HELM model to construct a speech denoising framework. HELM extracts information in a multilayer manner, retaining the advantages of deep models in approximating complicated functions while maintaining strong regression capabilities. A key advantage of the proposed solution is that it avoids the cumbersome and time-consuming process of BP-based fine-tuning. Overall, the proposed framework demonstrates that (i) HELMs are a viable solution for extracting clean speech features from their noisy counterparts, and HELM-based SE remains effective even when the testing data involve mismatched noise types and SNR levels; and (ii) when the amount of training data is limited, the proposed HELM-based SE algorithm outperforms algorithms based on conventional BP-based neural networks under different testing conditions.

• Next, an ensemble learning approach is devised to handle attenuation and time-delay effects for speech dereverberation. The main focus of the proposed approach is to examine the effectiveness of combining HELM models with three mechanisms not previously employed in HELMs: ensemble learning, residual structures, and highway structures. The framework also aims to address the data requirement issue while preserving the advantages of deep neural structures. The goal is to construct a data-driven model that can be deployed efficiently by leveraging a small amount of training material and limited computational resources.

• In addition to noise and reverberation, we study the effect of channel mismatch on enhancement performance. Channel mismatch is another common problem that can significantly degrade speech signals for both human and machine listeners. To address this issue, we present a HELM-based framework to convert low-quality bone-conducted utterances to high-quality air-conducted utterances. Compared with signals recorded by a traditional air-conducted microphone (ACM), speech signals recorded with a bone-conducted microphone (BCM) are robust against noise, although some high-frequency components may be missing. The experimental results verify that the proposed framework notably improves the original bone-conducted speech and outperforms the previous deep learning-based SE framework in terms of standardized objective measures as well as automatic speech recognition (ASR) performance.

• Research has shown that the visual modality, such as lip movements and mouth articulations, carries important information that can help discriminate similar speech patterns in noisy conditions. Inspired by the success of the conventional HELM in speech denoising, we build a joint audio-visual speech denoising framework that incorporates visual information alongside audio to deal with unseen noises under low-SNR conditions. The proposed multimodal framework outperforms the conventional audio-only framework in terms of standardized objective measures under matched and mismatched testing conditions. The results further confirm the applicability of HELM-based solutions using multimodal frameworks under challenging conditions and in low-resource environments.

• To facilitate the use of deep learning-based models in real-world applications, the dissertation investigates the performance of the multimodal speech denoising framework under two model compression strategies, namely binarization and quantization. The proposed audio-visual framework is trained using binary weights and quantized speech signals to cut down the computational requirements. The results demonstrate that the framework with binarized weights and quantized data remains functional, with only a slight reduction in overall performance.
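The two compression strategies named in the last contribution can be sketched as follows. This section does not specify the exact schemes, so the example assumes one common choice for each: sign binarization with a single scaling factor for the weights, and uniform quantization for a signal normalized to [-1, 1]. The function names `binarize_weights` and `quantize_uniform` and the bit width are hypothetical.

```python
import numpy as np

def binarize_weights(W):
    """Replace each weight by its sign, scaled by the mean absolute value.

    Assumption: sign-plus-scale binarization; the dissertation's exact
    scheme is not given in this section.
    """
    alpha = np.mean(np.abs(W))
    return alpha * np.where(W >= 0, 1.0, -1.0)

def quantize_uniform(x, n_bits=8):
    """Uniformly quantize a signal in [-1, 1] to 2**n_bits levels.

    Assumption: the signal is already normalized; values outside the
    range are clipped.
    """
    steps = 2 ** n_bits - 1
    x = np.clip(x, -1.0, 1.0)
    return np.round((x + 1.0) / 2.0 * steps) / steps * 2.0 - 1.0

# Toy weight matrix and signal (hypothetical values).
W = np.array([[0.5, -0.25], [-1.0, 0.25]])
W_bin = binarize_weights(W)

signal = np.linspace(-1.0, 1.0, 9)
signal_q = quantize_uniform(signal, n_bits=4)
max_error = float(np.max(np.abs(signal - signal_q)))
```

After binarization every weight has the same magnitude and differs only in sign, so each weight can be stored in one bit plus a shared scale. The quantization error is bounded by half a quantization step, which is consistent with the slight performance reduction reported above.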