National Chengchi University

Chapter 6

COMPRESSED MULTIMODAL SE

6.1 Overview

Recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. In this chapter, we propose a multimodal framework for SE by utilizing HELM to enhance the performance of conventional HELM-based SE frameworks that consider audio information only. Furthermore, we investigate the performance of the HELM-based multimodal SE framework trained using binary weights and quantized input data to reduce the computational requirements. The experimental results show that the proposed multimodal SE framework outperforms the conventional HELM-based SE framework in terms of three standard objective evaluation metrics. The results also show that the performance of the proposed multimodal SE framework is only slightly degraded when the model is compressed through model binarization and input data quantization. The content of this chapter has been published in [160] and is to be published in APSIPA 2019.


6.2 Introduction

In real-world conditions, background noise can severely degrade the quality and intelligibility of speech signals, thereby limiting the development of speech-related applications [161, 162, 163, 164, 2, 155, 165]. Numerous signal-processing-based SE methods have been proposed in the past to alleviate the background noise problem [166, 115, 12, 167]. While these methods have been applied to improve intelligibility for both human listening and machine recognition, the results have not always been satisfactory, especially under real acoustic conditions. Recently, approaches based on nonlinear spectral mapping have been proposed and confirmed to be effective in many SE tasks. The mapping function in these approaches aims to transform noisy speech into clean speech and is generally realized by a machine-learning-based model. Several studies have investigated the potential of deep-learning-based models with fine-tuned parameters for SE. For these approaches, a set of paired noisy and clean utterances is required to train the deep models [152]. For example, the authors of [45, 108] proposed frameworks based on deep neural networks and the deep denoising autoencoder (DDAE) to perform SE in non-stationary noise conditions. In [168] and [169], convolutional neural networks were used to transform noisy logarithmic power spectra (LPS) features and complex spectral features, respectively, into their clean counterparts. Similarly, in [155, 170, 156], SE systems based on long short-term memory and recurrent neural networks were proposed to reduce noise effects. Although these deep-learning-based approaches have achieved state-of-the-art performance, they have the following limitations: (a) mismatched training/test conditions can severely deteriorate the system performance, and (b) a large amount of training data is required to achieve satisfactory generalization performance, which may limit the applicability of these frameworks in real-world scenarios.

To overcome the limitations of both conventional signal-processing-based and deep-learning-based SE approaches, in our previous work [94, 144], we proposed an alternative SE framework by adopting an HELM framework. The parameters of the feature extraction layers of the HELM do not need to be fine-tuned using BP-based algorithms, thereby providing an extremely fast training phase with good generalization performance and general approximation capability.
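To illustrate why training without BP can be fast, the following is a minimal sketch of a basic single-hidden-layer extreme learning machine, not the hierarchical (HELM) variant used in this chapter. It assumes random, fixed hidden-layer weights and a ridge-regularized least-squares solution for the output weights; the function names `train_elm` and `predict_elm`, the `tanh` activation, and the default values of `n_hidden` and `reg` are illustrative choices and are not taken from [94, 144].

```python
import numpy as np

def train_elm(X, Y, n_hidden=64, reg=1e-3, seed=0):
    """Fit a basic ELM regressor (illustrative sketch, not the thesis's HELM).

    X: (n_samples, n_features) inputs, e.g. noisy spectral features.
    Y: (n_samples, n_outputs) targets, e.g. clean spectral features.
    """
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are drawn once at random and never updated.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    # Output weights have a closed form: beta = (H^T H + reg * I)^{-1} H^T Y.
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ Y)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the fixed random hidden layer followed by the learned output weights."""
    return np.tanh(X @ W + b) @ beta
```

Because the only learned parameters come from a single linear solve, no iterative gradient updates are involved in this sketch.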

Recent studies have shown that visual modalities, such as lip motions and mouth articulations, carry important information that can help distinguish similar speech sounds under noisy conditions [82, 83, 84]. Several audio-visual methods have been proposed recently to learn multimodal features for SE tasks using multimodal learning strategies. In [85, 86], feedforward and convolutional neural network models were used to build an audio-visual SE system, which successfully improved the noise reduction performance compared with that of audio-only frameworks. In [88], a speech separation system was proposed, which used a deep-learning-based model to combine audio-visual information. Meanwhile, Li et al. proposed a cross-modal student-teacher learning framework to fully utilize the audio-visual information to attain improved speech recognition performance under challenging conditions [171].

In this chapter, we extend our previously proposed HELM-based SE framework [94], which adopts audio information only (thus termed HELMa), by incorporating a visual modality to further improve the SE performance. The proposed HELM-based audio-visual SE framework, termed HELMav, first processes the audio and visual modalities separately and then learns multimodal features and an output weight matrix. In addition to the state-of-the-art performance achieved by deep-learning-based techniques in different classification and regression tasks, a considerable amount of research has been done on quantization-based model compression strategies to improve the computational efficiency of deep-learning-based systems for efficient online learning without substantially degrading the overall system performance [89, 90, 91]. Motivated by the satisfactory performance achieved by model compression strategies for back-propagation-based methods, we employ binarization and quantization schemes to train the feed-forward-only frameworks (HELMa and HELMav) for efficient learning using binary weights and quantized data. The experimental results demonstrate that the introduction of the visual modality can improve the performance compared with that of HELMa in terms of three standardized objective measures: the perceptual evaluation of speech quality (PESQ) [127], hearing aid speech perception index (HASPI) [157], and segmental signal-to-noise ratio improvement (SSNRI) [14]. The results also show that by binarizing the weights (limiting the weights to +1 and -1) and quantizing the input data (representing the mantissa of a floating-point number with fewer bits), the proposed framework still operates well and the overall SE performance of the system is only marginally affected.
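The two compression operations mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme used in this chapter: weights are binarized by their sign (with zero mapped to +1), and input values are treated as IEEE 754 single-precision numbers whose 23-bit mantissa is truncated to its leading `keep_bits` bits. The function names `binarize` and `quantize_mantissa`, the handling of zero, and the use of truncation rather than rounding are all hypothetical choices.

```python
import struct

def binarize(weights):
    """Map each weight to +1.0 or -1.0 by its sign (assumption: zero maps to +1.0)."""
    return [1.0 if w >= 0.0 else -1.0 for w in weights]

def quantize_mantissa(x, keep_bits):
    """Keep only the leading `keep_bits` of the 23 mantissa bits of a float32.

    Sign and exponent bits are left untouched; the dropped mantissa bits are
    set to zero (truncation, an assumption of this sketch).
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Mask that clears the lowest (23 - keep_bits) mantissa bits.
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

# Usage: a dot product between quantized inputs and binarized weights.
inputs = [quantize_mantissa(v, 4) for v in (0.8125, -1.75, 3.3)]
weights = binarize([0.42, -0.07, 1.3])
output = sum(x * w for x, w in zip(inputs, weights))
```

With weights restricted to +1 and -1, each product in the dot product reduces to a sign change, which is the source of the computational saving that such schemes aim for.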

The remainder of this chapter is organized as follows: Section 6.3 introduces the proposed HELMav SE system as well as the model binarization and input data quantization schemes. Section 6.4 presents the experimental setup and results. Section 6.5 provides concluding remarks.
