Figure 5.1: Audio-only HELM-based SE framework.

5.3.2 Audio-Visual SE System

In this section, we extend the AHELM framework with visual information and propose an HELM-based audio-visual SE (AVHELM) framework. In the AVHELM framework, the visual information is combined with the speech information to further improve the enhancement capability of the system. Fig. 5.2 illustrates the proposed AVHELM framework. In this system, the audio and visual features are processed independently by HELM sparse autoencoders (i.e., the unsupervised stage) to learn sparse representations of the noisy audio features and of the visual information. The outputs of the two modalities from the unsupervised stage are then combined to form an integrated input to the supervised regression stage, as shown in Fig. 5.2.

For AHELM, the relationship between the input (noisy) and the output (enhanced) can be written as:

$$ X = H(Y)\, B_{a} \tag{5.1} $$

where $H(Y)$ is the hidden-layer output matrix for the input noisy speech signal $Y$, $B_{a}$ is the output weight matrix of the audio-only HELM, and $X$ is the estimated clean speech signal, as shown in Fig. 5.1. Similarly, the estimated clean speech signal for the AVHELM framework can be computed by integrating the audio and visual information:

$$ X = [H(Y),\, H(V)]\, B_{av} \tag{5.2} $$

where $H(Y)$ and $H(V)$ are the hidden-layer output matrices for the audio and visual modalities, respectively, $B_{av}$ is the output weight matrix for the integrated audio-visual information, and $X$ is the estimated clean speech signal, as shown in Fig. 5.2.

Figure 5.2: Proposed AVHELM SE framework.
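The two regression equations can be illustrated with a minimal NumPy sketch. It assumes the output weights are obtained by ridge-regularized least squares, which is the usual closed form for ELM-type models; the text above does not state how the weights are solved. The function name `solve_output_weights`, the regularization value, and all matrix sizes are hypothetical, and random matrices stand in for the hidden-layer outputs of the unsupervised stage.

```python
import numpy as np

def solve_output_weights(H, X, reg=1e-3):
    """Ridge-regularized least squares: B = (H^T H + reg * I)^-1 H^T X.

    Assumption: this closed form is used only for illustration.
    """
    gram = H.T @ H + reg * np.eye(H.shape[1])
    return np.linalg.solve(gram, H.T @ X)

rng = np.random.default_rng(0)
n_frames = 200
H_audio = rng.standard_normal((n_frames, 40))   # stand-in for H(Y)
H_visual = rng.standard_normal((n_frames, 30))  # stand-in for H(V)
X_clean = rng.standard_normal((n_frames, 257))  # stand-in for clean targets

# Eq. (5.1): audio-only regression, X = H(Y) B_a
B_a = solve_output_weights(H_audio, X_clean)
X_hat_a = H_audio @ B_a

# Eq. (5.2): concatenate both modalities column-wise, X = [H(Y), H(V)] B_av
H_av = np.hstack([H_audio, H_visual])
B_av = solve_output_weights(H_av, X_clean)
X_hat_av = H_av @ B_av

err_a = np.mean((X_clean - X_hat_a) ** 2)
err_av = np.mean((X_clean - X_hat_av) ** 2)
print(B_a.shape, B_av.shape, err_a, err_av)
```

The only structural difference between the two systems in this sketch is the column-wise concatenation of the hidden-layer outputs, so the output weight matrix grows from 40 to 70 rows while the target dimension is unchanged.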

5.4 Experiments

5.4.1 Experimental Setup

Description of the Dataset

The audio and visual data were recorded and prepared by Hou et al. [85] based on the transcript of the Taiwan Mandarin hearing in noise test (TMHINT) sentences [158]. The dataset contains audio-visual recordings of 320 Mandarin utterances spoken by a native speaker. The recordings were made in a quiet room with sufficient lighting, with the speaker facing the camera. Each utterance is approximately 3-4 seconds long. The video was recorded at 30 frames per second (fps) at a resolution of 1920 × 1080 pixels. The audio was recorded at 48 kHz and resampled to 16 kHz for further processing. Among the 320 utterances, we randomly selected 100 utterances for the training set and 40 utterances for the testing set, with no overlap between the two sets.

We used nine stationary and non-stationary background noises to prepare the training and testing data: machine, pink noise, babble, baby cry, party crowd, applause, grocery store, restaurant, and vacuum cleaner. For the training set, the clean utterances were artificially contaminated with one stationary and four non-stationary noises (machine, babble, party crowd, restaurant, and vacuum cleaner) at five signal-to-noise ratios (SNRs) ∈ {-6, -3, 3, 6, 10} dB, generating 100 × 5 (noise types) × 5 (SNRs) = 2500 noisy training utterances.

To confirm the effectiveness of the proposed AVHELM system, two evaluation scenarios were adopted to design the testing sets: matched noise type and mismatched noise type. In the matched case, the clean testing utterances were contaminated with two noises seen in training, party crowd and babble, at SNRs ∈ {-6, -2, 0, 2, 6} dB; of these, -6 and 6 dB match the training SNRs, whereas -2, 0, and 2 dB do not. In the mismatched case, the clean testing utterances were contaminated with four noise types not seen in training, applause, baby cry, pink noise, and grocery store, at the same five SNRs.
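The construction of the noisy training set can be sketched as follows. This is a hedged illustration only: it assumes that the SNR is defined on the average power of the whole utterance and that the noise is looped or trimmed to the utterance length, neither of which is specified above. The helper `mix_at_snr` and the random stand-in signals are hypothetical.

```python
import itertools
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, clean.shape)  # repeat or trim the noise to the clean length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

# Training conditions listed in the text: 100 utterances x 5 noises x 5 SNRs.
train_noises = ["machine", "babble", "party crowd", "restaurant", "vacuum cleaner"]
train_snrs_db = [-6, -3, 3, 6, 10]
n_train_clean = 100
conditions = list(itertools.product(range(n_train_clean), train_noises, train_snrs_db))
n_train_noisy = len(conditions)

# Toy check with random stand-ins for a 3-second utterance and a 1-second noise clip.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000 * 3)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=-6)
achieved_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(n_train_noisy, round(achieved_db, 3))
```

Enumerating the conditions with `itertools.product` reproduces the count of 2500 noisy training utterances, and the toy check confirms that the scaled noise yields the requested SNR.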

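The proposed system was evaluated using three standard objective evaluation metrics: the perceptual evaluation of speech quality (PESQ), the hearing-aid speech perception index (HASPI), and the segmental SNR improvement (SSNRI).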
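Of the three metrics, SSNRI is simple enough to sketch directly. The following is a minimal illustration under stated assumptions, not the evaluation code used in this chapter: the segmental SNR is taken as the frame-wise SNR averaged over frames, each frame is clamped to a range of -10 to 35 dB (a common convention), and the frame length of 512 samples is chosen only to match the STFT setting. The function names are hypothetical.

```python
import numpy as np

def segmental_snr(clean, estimate, frame_len=512, lo_db=-10.0, hi_db=35.0):
    """Average of frame-wise SNRs in dB, each clamped to [lo_db, hi_db] (assumed range)."""
    n_frames = len(clean) // frame_len
    c = clean[: n_frames * frame_len].reshape(n_frames, frame_len)
    e = estimate[: n_frames * frame_len].reshape(n_frames, frame_len)
    eps = 1e-10  # avoids division by zero and log of zero in silent frames
    signal_energy = np.sum(c ** 2, axis=1) + eps
    error_energy = np.sum((c - e) ** 2, axis=1) + eps
    snr_db = 10 * np.log10(signal_energy / error_energy)
    return float(np.mean(np.clip(snr_db, lo_db, hi_db)))

def ssnri(clean, noisy, enhanced, **kwargs):
    """Segmental SNR improvement: SSNR of the enhanced signal minus SSNR of the noisy one."""
    return segmental_snr(clean, enhanced, **kwargs) - segmental_snr(clean, noisy, **kwargs)

# Toy signals: halving the noise amplitude should give about 20*log10(2) dB of improvement.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * rng.standard_normal(16000)
noisy = clean + noise
enhanced = clean + 0.5 * noise
improvement = ssnri(clean, noisy, enhanced)
print(round(improvement, 3))
```

PESQ and HASPI rely on perceptual and auditory models and are normally computed with their reference implementations, so they are not sketched here.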

Audio-visual Feature Extraction

The audio signals were processed using the short-time Fourier transform (STFT) with a frame length of 512 samples and a frame shift of 256 samples. We enlarged the input speech vector by including more context at the input layer. In this chapter, we concatenated the ±2 neighbouring speech vectors on the left and right of the central speech vector, similar to [85], generating log power spectrum (LPS) features of dimension 257 × 5 (= 257 × (2 × ws + 1), where ws is the contextual window size, and ws = 2 was used in our experiments).
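The audio feature dimensions can be checked with a short NumPy sketch. The frame length, frame shift, and context size follow the text; the Hann window, the edge padding at utterance boundaries, and the small constant inside the logarithm are assumptions, and the function names `lps_features` and `add_context` are hypothetical.

```python
import numpy as np

def lps_features(signal, frame_len=512, frame_shift=256):
    """Log power spectrum per frame; a 512-point real FFT gives 257 frequency bins."""
    window = np.hanning(frame_len)  # assumption: the window type is not stated in the text
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len]
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.log(np.abs(spectrum) ** 2 + 1e-10)

def add_context(features, ws=2):
    """Concatenate each frame with its ws left and ws right neighbours."""
    padded = np.pad(features, ((ws, ws), (0, 0)), mode="edge")  # assumption: edge padding
    n = features.shape[0]
    return np.hstack([padded[k : k + n] for k in range(2 * ws + 1)])

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # 1 second of stand-in audio at 16 kHz
lps = lps_features(speech)
context = add_context(lps, ws=2)
print(lps.shape, context.shape)
```

With ws = 2, each input vector holds 257 × 5 = 1285 values, and the middle block of 257 values is the LPS of the central frame.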

For the visual information, we used the same visual features as in [85], converting the video of each utterance into a sequence of images at a frame rate of 50 fps. The mouth region of each image was then detected using the Viola-Jones method [159] and cropped to a 16 × 24 pixel region, resulting in visual features of dimension 16 × 24 × 3 × 5, where 3 is the number of RGB channels and 5 is the number of visual frames in the context window (the central frame together with its left and right neighbours).

5.4.2 Experimental Results

AVHELM vs AHELM

In this section, we first evaluate the overall performance of the proposed AVHELM against AHELM. Table 5.1 compares the average PESQ scores of the proposed AVHELM and AHELM structures under matched and mismatched testing conditions. For a fair comparison, both frameworks were trained using 1000, 1000, and 8000 neurons ([1000 1000 8000]). For both HELM configurations, the sigmoid activation function was employed, with the regularization parameter set as in [94]. In our preliminary experiments, we found that with such a small amount of training data, neither the DNN-based nor the CNN-based AVSE system [85] performs well. To keep the focus on HELM-based SE systems, the CNN-based results are not included in this chapter. For AVHELM, each modality was processed independently by the unsupervised stage to convert low-level features into representative features. During the supervised stage, the representations learned from both modalities were integrated linearly to learn the multimodal transformation. Table 5.1 shows that the proposed AVHELM framework improved the speech quality (PESQ) over AHELM for both matched and mismatched (stationary and non-stationary) noise types. The performance of AVHELM and AHELM is further compared against the traditional logarithmic minimum mean square error (logMMSE) method [115]. Table 5.1 shows that the AVHELM and AHELM frameworks with similar configurations attained notably higher PESQ scores than logMMSE under all testing conditions except pink noise, where logMMSE, a powerful traditional method, retains its advantage under the mismatched stationary noise condition and outperforms both AHELM and AVHELM.

Table 5.1: Average PESQ scores of logMMSE, AHELM, and AVHELM under matched and mismatched noise conditions.

Condition     Noise type       logMMSE   AHELM    AVHELM
Matched       Babble           2.2138    2.2932   2.4066
Matched       Party crowd      2.1473    2.3163   2.4054
Mismatched    Applause         1.8963    2.3873   2.5104
Mismatched    Baby cry         1.9725    2.6812   2.7633
Mismatched    Grocery store    2.0986    2.3083   2.4948
Mismatched    Pink noise       2.5774    2.3639   2.5125
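The comparisons drawn from Table 5.1 can be verified with a few lines of Python. The scores are copied from the table; the variable names are hypothetical.

```python
# Average PESQ scores from Table 5.1, keyed by noise type.
pesq = {
    "babble":        {"logMMSE": 2.2138, "AHELM": 2.2932, "AVHELM": 2.4066},
    "party crowd":   {"logMMSE": 2.1473, "AHELM": 2.3163, "AVHELM": 2.4054},
    "applause":      {"logMMSE": 1.8963, "AHELM": 2.3873, "AVHELM": 2.5104},
    "baby cry":      {"logMMSE": 1.9725, "AHELM": 2.6812, "AVHELM": 2.7633},
    "grocery store": {"logMMSE": 2.0986, "AHELM": 2.3083, "AVHELM": 2.4948},
    "pink noise":    {"logMMSE": 2.5774, "AHELM": 2.3639, "AVHELM": 2.5125},
}

# PESQ gain of AVHELM over AHELM for each noise type, and its mean.
gains = {noise: s["AVHELM"] - s["AHELM"] for noise, s in pesq.items()}
mean_gain = sum(gains.values()) / len(gains)

# Best-scoring system for each noise type.
best = {noise: max(s, key=s.get) for noise, s in pesq.items()}

for noise in pesq:
    print(f"{noise:14s} gain={gains[noise]:.4f} best={best[noise]}")
print(f"mean gain of AVHELM over AHELM: {mean_gain:.4f}")
```

AVHELM scores higher than AHELM for every noise type, and logMMSE is the best-scoring system only for pink noise, which agrees with the discussion above.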

Table 5.1 compares the average PESQ performance of the two HELM systems under matched and mismatched testing conditions. However, the table does not reveal which modality contributes more when reconstructing the denoised speech signal at different SNR levels. Therefore, we plotted the average PESQ performance at each SNR level to further investigate the behavior of the two HELM frameworks. Fig. 5.3 presents the average PESQ scores over the six noise types at different SNR levels for AHELM and AVHELM. The results for unprocessed speech (denoted as Noisy) and logMMSE are also shown for comparison. We observe that the proposed AVHELM framework obtained a significant performance improvement at low SNRs (i.e., -6 and -2 dB). This suggests that the framework relies more on the visual information for guidance at low SNRs. In contrast, the visual information provides little additional help at high SNRs: the difference between the average PESQ scores of AHELM and AVHELM becomes smaller at high SNRs (2 dB and 6 dB), indicating that the audio modality plays the crucial role at high SNRs.

[Plot: PESQ (0 to 3) versus SNR in dB (-6, -2, 0, 2, 6) for Noisy, logMMSE, AHELM, and AVHELM.]

Figure 5.3: Average PESQ scores over six noise types at different SNR levels.

In addition to the PESQ scores, we also report the average HASPI and SSNRI results for the AVHELM and AHELM frameworks alongside logMMSE. Fig. 5.4 displays the average HASPI and SSNRI results over the six noise types at different SNR levels. Overall, the proposed AVHELM demonstrated better speech enhancement capability than AHELM and logMMSE, maintaining higher HASPI and SSNRI scores. Moreover, the improvement is more pronounced at low SNR levels than at high SNR levels, again confirming that the visual modality plays a crucial role when reconstructing a signal at low SNR levels.

Figure 5.4: Average HASPI and SSNRI scores over six noise types at different SNR levels.

To better appreciate the SE performance attained by the proposed model, we plotted the spectrograms of the enhanced speech signals produced by AHELM and AVHELM. Fig. 5.5 shows the spectrograms of a test utterance contaminated with a non-stationary noise (applause) at SNR = -2 dB. Fig. 5.5(c) and (d) display the spectrograms of the test utterance enhanced by the AHELM and AVHELM frameworks, respectively. The spectrograms of the clean and noisy speech signals are shown in Fig. 5.5(a) and (b) for comparison. From Fig. 5.5, we note that although neither AHELM nor AVHELM perfectly restores the clean speech under this very challenging condition (non-stationary noise, -2 dB SNR), both effectively suppress noise components in the noisy signal (Fig. 5.5(b)). Moreover, AVHELM suppresses noise components more effectively and yields better speech quality (PESQ = 2.7784) than AHELM (PESQ = 2.6496). In addition to the spectrograms, we also plotted the waveforms to visually inspect the speech processed by AHELM and AVHELM. Fig. 5.6(a), (b), (c), and (d) show the waveforms of the Clean, Noisy, AHELM, and AVHELM speech, respectively, where the test utterance is contaminated with applause noise at SNR = -2 dB. From the figure, we observe that the waveform of the denoised speech produced by AVHELM follows a pattern similar to that of the clean speech, with less distortion, even at low SNR (SNR = -2 dB), illustrating that AVHELM can more effectively restore clean speech from its noisy counterpart.

Figure 5.5: Spectrograms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with applause noise at SNR = -2 dB.

Figure 5.6: Waveforms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM. The test utterance was contaminated with applause noise at SNR = -2 dB.

5.5 Summary

In this chapter, a novel AVHELM framework is proposed to improve the performance of the conventional AHELM framework. The results demonstrate that incorporating the visual modality improves the system performance under both matched and mismatched (and both stationary and non-stationary) noise conditions at severe SNR levels when limited training data are available. To the best of our knowledge, this is the first work that successfully applies HELM to audio-visual speech enhancement. In future work, we aim to further enhance the system performance by considering noise- and SNR-aware training.

Chapter 6

COMPRESSED MULTIMODAL SE