Overview - Application of Hierarchical Extreme Learning Machines to Speech Signal Processing

National Chengchi University (國立政治大學)

Chapter 6 COMPRESSED MULTIMODAL SE

6.1 Overview

Recently, model compression, which aims to facilitate the use of deep models in real-world applications, has attracted considerable attention. Several model compression techniques have been proposed to reduce computational costs without significantly degrading the achievable performance. In this chapter, we propose a multimodal framework for SE that uses HELM to enhance the performance of conventional HELM-based SE frameworks, which consider audio information only. Furthermore, we investigate the performance of the HELM-based multimodal SE framework when it is trained using binary weights and quantized input data to reduce the computational requirements. The experimental results show that the proposed multimodal SE framework outperforms the conventional HELM-based SE framework in terms of three standard objective evaluation metrics. The results also show that the performance of the proposed multimodal SE framework degrades only slightly when the model is compressed through model binarization and input data quantization. The content of this chapter has been published in [160] and is to be published in APSIPA 2019.


6.2 Introduction

In real-world conditions, background noise can severely degrade the quality and intelligibility of speech signals, thereby limiting the development of speech-related applications [161, 162, 163, 164, 2, 155, 165]. Numerous signal-processing-based SE methods have been proposed to alleviate the background noise problem [166, 115, 12, 167]. While these methods have been applied to improve intelligibility for both human listening and machine recognition, the results have not always been satisfactory, especially under real acoustic conditions. Recently, approaches based on nonlinear spectral mapping have been proposed and confirmed to be effective in many SE tasks. The mapping function in these approaches aims to transform noisy speech into clean speech and is generally realized by a machine-learning-based model. Several studies have investigated the potential of deep-learning-based models with fine-tuned parameters for SE. These approaches require a set of paired noisy and clean utterances to train the deep models [152]. For example, the authors of [45] and [108] proposed frameworks based on deep neural networks and the deep denoising autoencoder (DDAE) to perform SE in non-stationary noise conditions. In [168] and [169], convolutional neural networks were used to transform noisy logarithmic power spectra (LPS) features and complex spectral features, respectively, to their clean counterparts. Similarly, in [155, 170, 156], SE systems based on long short-term memory and recurrent neural networks were proposed to reduce noise effects. Although these deep-learning-based approaches have achieved state-of-the-art performance, they have the following limitations: (a) mismatched training/test conditions can severely deteriorate the system performance, and (b) a large amount of training data is required to achieve satisfactory generalization performance, which may limit the applicability of these frameworks in real-world scenarios.

To overcome the limitations of both conventional signal-processing-based and deep-learning-based SE approaches, in our previous work [94], [144] we proposed an alternative SE framework that adopts an HELM. The parameters of the feature extraction layers of the HELM do not need to be fine-tuned using BP-based algorithms, which yields an extremely fast training phase with good generalization performance and general approximation capability.
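To make the "no BP-based fine-tuning" idea concrete, the following is a minimal sketch of a basic single-hidden-layer extreme learning machine for regression, not the HELM of [94]. It assumes random, fixed hidden-layer weights and a ridge-regularized least-squares solution for the output weights. The function names (`train_elm`, `predict_elm`), the `tanh` activation, the hidden size, the regularization constant, and the toy data are all illustrative assumptions.

```python
import numpy as np


def train_elm(X, T, n_hidden=64, reg=1e-3, seed=0):
    """Train a basic ELM: random hidden layer, closed-form output weights.

    All hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Hidden-layer parameters are drawn once and never updated (no backpropagation).
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)
    # Output weights: beta = (H^T H + reg * I)^{-1} H^T T
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ T)
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Apply the fixed random hidden layer, then the learned output weights."""
    return np.tanh(X @ W + b) @ beta


# Toy regression problem standing in for a noisy-to-clean feature mapping.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
T = np.sin(X.sum(axis=1, keepdims=True))

W, b, beta = train_elm(X, T)
mse = float(np.mean((predict_elm(X, W, b, beta) - T) ** 2))
print("training MSE:", mse)
```

Because only `beta` is learned, and it is obtained from a single linear solve, training cost is dominated by one matrix factorization rather than by many gradient-descent epochs.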

Recent studies have shown that visual modalities, such as lip motions and mouth articulations, carry important information that can help distinguish similar speech sounds under noisy conditions [82, 83, 84]. Several audio-visual methods have recently been proposed to learn multimodal features for SE tasks using multimodal learning strategies. In [85, 86], feedforward and convolutional neural network models were used to build an audio-visual SE system, which improved the noise reduction performance compared with that of audio-only frameworks. In [88], a speech separation system was proposed that used a deep-learning-based model to combine audio-visual information. Meanwhile, Li et al. proposed a cross-modal student-teacher learning framework that fully utilizes audio-visual information to attain improved speech recognition performance under challenging conditions [171].

In this chapter, we extend our previously proposed HELM-based SE framework [94], which adopts audio information only (thus termed HELM_a), by incorporating a visual modality to further improve the SE performance. The proposed HELM-based audio-visual SE framework, termed HELM_av, first processes the audio and visual modalities separately and then learns multimodal features and an output weight matrix. In addition to the state-of-the-art performance achieved by deep-learning-based techniques in different classification and regression tasks, a considerable amount of research has been done on quantization-based model compression strategies, which improve the computational efficiency of deep-learning-based systems and enable efficient online learning without substantially degrading the overall system performance [89, 90, 91]. Motivated by the satisfactory performance achieved by model compression strategies for backpropagation-based methods, we employ binarization and quantization schemes to train the feedforward-only frameworks (HELM_a and HELM_av) for efficient learning using binary weights and quantized data. The experimental results demonstrate that introducing the visual modality improves the performance over that of HELM_a in terms of three standardized objective measures: the perceptual evaluation of speech quality (PESQ) [127], the hearing aid speech perception index (HASPI) [157], and the segmental signal-to-noise ratio improvement (SSNRI) [14]. The results also show that by binarizing the weights (limiting the weights to +1 and -1) and quantizing the input data (representing the mantissa bits in a single floating-point number with fewer bits), the proposed framework still operates well and the overall SE performance of the system is only marginally affected.
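The two compression steps named above can be illustrated with a short sketch. It is not the scheme evaluated in this chapter, which is specified in Section 6.3. It assumes sign-based binarization (with zero mapped to +1), 32-bit IEEE 754 inputs, and mantissa reduction by truncation rather than rounding. The function names `binarize` and `quantize_mantissa` and the parameter `keep_bits` are hypothetical.

```python
import numpy as np


def binarize(W):
    """Map every weight to +1 or -1 by its sign (assumption: zero maps to +1)."""
    return np.where(np.asarray(W) >= 0, 1.0, -1.0).astype(np.float32)


def quantize_mantissa(x, keep_bits):
    """Keep only the top `keep_bits` of the 23 float32 mantissa bits.

    The dropped low-order bits are zeroed (truncation); the sign and
    exponent bits are left untouched.
    """
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    x32 = np.ascontiguousarray(x, dtype=np.float32)
    drop = 23 - keep_bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    # Reinterpret the float bits as integers, clear the low bits, reinterpret back.
    return (x32.view(np.uint32) & mask).view(np.float32)


weights = np.array([[0.3, -1.2], [0.0, -0.01]])
features = np.array([1.0, 0.1, -3.14159])

print(binarize(weights))
print(quantize_mantissa(features, keep_bits=4))
```

With truncation, each quantized value keeps its sign and exponent, so the relative error for normalized values stays below 2^(-keep_bits); a rounding variant would halve that bound at the cost of slightly more logic.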

The remainder of this chapter is organized as follows: Section 6.3 introduces the proposed HELM_av SE system as well as the model binarization and input data quantization schemes. Section 6.4 presents the experimental setup and results. Section 6.5 provides concluding remarks.

