國立政治大學 ‧ National Chengchi University
Chapter 6
COMPRESSED MULTIMODAL SE
6.1 Overview
Model compression, which aims to make deep models practical in real-world applications, has recently attracted considerable attention. Several model compression techniques have been proposed to reduce computational cost without significantly degrading performance. In this chapter, we propose a multimodal SE framework that uses HELM to improve on conventional HELM-based SE frameworks, which consider audio information only. Furthermore, we investigate the performance of the HELM-based multimodal SE framework when it is trained with binary weights and quantized input data to reduce the computational requirements. The experimental results show that the proposed multimodal SE framework outperforms the conventional HELM-based SE framework in terms of three standard objective evaluation metrics. The results also show that the performance of the proposed framework degrades only slightly when the model is compressed through weight binarization and input data quantization. The content of this chapter has been published in [160] and is to be published at APSIPA 2019.
6.2 Introduction
In real-world conditions, background noise can severely degrade the quality and intelligibility of speech signals, thereby limiting the development of speech-related applications [161, 162, 163, 164, 2, 155, 165]. Numerous signal-processing-based SE methods have been proposed to alleviate the background noise problem [166, 115, 12, 167]. While these methods have been applied to improve intelligibility for both human listening and machine recognition, the results have not always been satisfactory, especially under real acoustic conditions. Recently, approaches based on nonlinear spectral mapping have been proposed and confirmed to be effective in many SE tasks. In these approaches, the mapping function aims to transform noisy speech into clean speech and is generally realized by a machine-learning-based model. Several studies have investigated the potential of deep-learning-based models with fine-tuned parameters for SE. These approaches require a set of noisy and clean utterances to train the deep models [152]. For example, the authors of [45] and [108] proposed frameworks based on deep neural networks and the deep denoising autoencoder (DDAE) to perform SE in nonstationary noise conditions. In [168] and [169], convolutional neural networks were used to transform noisy logarithmic power spectra (LPS) features and complex spectral features, respectively, to their clean counterparts. Similarly, in [155, 170, 156], SE systems based on long short-term memory and recurrent neural networks were proposed to reduce noise effects. Although these deep-learning-based approaches have achieved state-of-the-art performance, they have the following limitations: (a) mismatched training and test conditions can severely degrade system performance, and (b) a large amount of training data is required to achieve satisfactory generalization performance, which may limit the applicability of these frameworks in real-world scenarios.
To overcome the limitations of both conventional signal-processing-based and deep-learning-based SE approaches, we proposed an alternative SE framework based on HELM in our previous work [94] [144]. The parameters of the HELM feature extraction layers do not need to be fine-tuned using BP-based algorithms, which makes the training phase extremely fast while retaining good generalization performance and general approximation capability.
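To illustrate why training without BP-based fine-tuning can be fast, the following is a minimal sketch of a basic single-hidden-layer extreme learning machine, not the hierarchical HELM of [94]. It assumes random, fixed hidden-layer weights, a tanh activation, and ridge-regularized least squares for the output weights; the function names, the hidden size, and the regularization constant are hypothetical choices made for illustration only.

```python
import numpy as np


def train_elm(X, Y, n_hidden=64, reg=1e-3, seed=0):
    """Fit a single-hidden-layer ELM (illustrative; not the HELM of [94]).

    X: (n_samples, n_features) inputs, e.g. noisy spectral features.
    Y: (n_samples, n_targets) targets, e.g. clean spectral features.
    """
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are drawn once and never updated,
    # so no backpropagation pass is needed.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)  # hidden activations, (n_samples, n_hidden)
    # Output weights in closed form (ridge regression):
    # beta = (H^T H + reg * I)^{-1} H^T Y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Apply the fixed hidden layer and the learned output weights."""
    return np.tanh(X @ W + b) @ beta
```

In this sketch the only learned quantity is the output weight matrix `beta`, obtained from a single linear solve; this is the property that the fast training phase described above relies on.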
Recent studies have shown that visual modalities, such as lip motions and mouth articulations, carry important information that can help distinguish similar speech sounds under noisy conditions [82, 83, 84]. Several audio-visual methods have recently been proposed to learn multimodal features for SE tasks using multimodal learning strategies. In [85, 86], feedforward and convolutional neural network models were used to build an audio-visual SE system, which improved the noise reduction performance compared with that of audio-only frameworks. In [88], a speech separation system was proposed that used a deep-learning-based model to combine audio-visual information. Meanwhile, Li et al. proposed a cross-modal student-teacher learning framework that fully utilizes audio-visual information to improve speech recognition performance under challenging conditions [171].
In this chapter, we extend our previously proposed HELM-based SE framework [94], which uses audio information only (and is thus termed HELMa), by incorporating a visual modality to further improve the SE performance. The proposed HELM-based audio-visual SE framework, termed HELMav, first processes the audio and visual modalities separately and then learns multimodal features and an output weight matrix. Alongside the state-of-the-art performance achieved by deep-learning-based techniques in various classification and regression tasks, a considerable amount of research has addressed quantization-based model compression strategies, which improve the computational efficiency of deep-learning-based systems and enable efficient online learning without substantially degrading overall system performance [89, 90, 91]. Motivated by the satisfactory performance of these model compression strategies for backpropagation-based methods, we employ binarization and quantization schemes to train the feedforward-only frameworks (HELMa and HELMav) efficiently using binary weights and quantized data. The experimental results demonstrate that introducing the visual modality improves performance over HELMa in terms of three standardized objective measures: the perceptual evaluation of speech quality (PESQ) [127], the hearing aid speech perception index (HASPI) [157], and the segmental signal-to-noise ratio improvement (SSNRI) [14]. The results also show that when the weights are binarized (restricted to +1 and -1) and the input data are quantized (the mantissa of each floating-point value is represented with fewer bits), the proposed framework still operates well and its overall SE performance is only marginally affected.
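The two compression operations mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the exact scheme used in this chapter: weights are binarized with a sign rule (zero mapped to +1), inputs are assumed to be 32-bit floats with a 23-bit mantissa, and the discarded mantissa bits are truncated rather than rounded. The function names and the `keep_bits` parameter are hypothetical.

```python
import numpy as np


def binarize_weights(W):
    """Restrict every weight to +1 or -1 by its sign (assumption: 0 -> +1)."""
    return np.where(np.asarray(W) >= 0, 1.0, -1.0).astype(np.float32)


def quantize_mantissa(x, keep_bits):
    """Keep only the `keep_bits` most significant float32 mantissa bits.

    Assumption: truncation toward zero; sign and exponent bits are untouched.
    Returns a float32 array with at least one dimension.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    x32 = np.ascontiguousarray(x, dtype=np.float32)
    bits = x32.view(np.uint32)            # reinterpret the raw IEEE 754 bits
    drop = 23 - keep_bits                 # number of low mantissa bits to clear
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)
```

With truncation, the quantized magnitude never exceeds the original, and the relative error of each normal value is below 2^(-keep_bits); a rounding variant would halve that bound at the cost of slightly more logic.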
The remainder of this chapter is organized as follows. Section 6.3 introduces the proposed HELMav SE system as well as the model binarization and input data quantization schemes. Section 6.4 presents the experimental setup and results. Section 6.5 provides concluding remarks.