Figure 5.1: Audio-only HELM-based SE framework.
5.3.2 Audio-Visual SE System
In this section, we extend the AHELM framework by incorporating visual information and propose an HELM-based audio-visual SE (AVHELM) framework. In the AVHELM framework, the visual information is combined with the speech information to further improve the enhancement capability of the system. Fig. 5.2 illustrates the proposed AVHELM framework. In this system, the audio and visual features are processed independently by the HELM sparse autoencoders (i.e., the unsupervised stage) to learn sparse representations of the noisy audio features and the visual information individually. The outputs of the two modalities from the unsupervised stage are subsequently combined to form an integrated input to the supervised regression stage, as shown in Fig. 5.2.
For AHELM, the relationship between the (noisy) input and the (enhanced) output can be written as:

$$X = H(Y)\,B_a \tag{5.1}$$

where $H(Y)$ is the hidden-layer output matrix for the input noisy speech signal $Y$, $B_a$ is the output weight matrix of the audio-only HELM, and $X$ is the estimated clean speech signal, as shown in Fig. 5.1. Similarly, the estimated clean speech signal for the AVHELM framework can be computed by integrating the audio and visual information:

$$X = [H(Y),\, H(V)]\,B_{av} \tag{5.2}$$

where $H(Y)$ and $H(V)$ are the hidden-layer output matrices of the audio and visual modalities, respectively, $B_{av}$ is the output weight matrix for the integrated audio-visual information, and $X$ is the estimated clean speech signal, as shown in Fig. 5.2.

Figure 5.2: Proposed AVHELM SE framework.
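Equations (5.1) and (5.2) differ only in the hidden matrix that multiplies the output weights. The NumPy sketch below illustrates this under simplifying assumptions: a single random sigmoid projection stands in for each modality's HELM sparse-autoencoder stack, the output weights are obtained by ridge-regularized least squares (a common ELM choice, not confirmed by the text), and all function names, dimensions, and the regularization constant are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_output(features, n_hidden, rng):
    """Random sigmoid projection; a stand-in for the unsupervised HELM stage."""
    w = rng.standard_normal((features.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

def solve_output_weights(h, target, reg=1e-3):
    """Ridge-regularized least squares: B = (H^T H + reg * I)^-1 H^T X."""
    gram = h.T @ h + reg * np.eye(h.shape[1])
    return np.linalg.solve(gram, h.T @ target)

# Toy data with hypothetical sizes: 200 frames per stream.
n_frames = 200
y_noisy = rng.standard_normal((n_frames, 40))   # noisy audio features Y
v_visual = rng.standard_normal((n_frames, 30))  # visual features V
x_clean = rng.standard_normal((n_frames, 40))   # clean training targets

h_y = hidden_output(y_noisy, 64, rng)   # H(Y)
h_v = hidden_output(v_visual, 32, rng)  # H(V)

# Eq. (5.1): audio-only estimate X = H(Y) B_a
b_a = solve_output_weights(h_y, x_clean)
x_hat_a = h_y @ b_a

# Eq. (5.2): audio-visual estimate X = [H(Y), H(V)] B_av
h_av = np.hstack([h_y, h_v])
b_av = solve_output_weights(h_av, x_clean)
x_hat_av = h_av @ b_av
```

The only structural change from the audio-only to the audio-visual case is the column-wise concatenation of the two hidden matrices, which enlarges the output weight matrix from 64 rows to 96 rows in this toy setting.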
5.4 Experiments
5.4.1 Experimental Setup
Description of the Dataset
The audio and visual data were recorded and prepared by Hou et al. [85] based on the transcripts of the Taiwan Mandarin hearing in noise test (TMHINT) sentences [158]. The dataset contains audio-visual recordings of 320 Mandarin utterances spoken by a native speaker. The recordings were made in a quiet room with sufficient lighting, and the speaker was filmed facing the camera. Each utterance is approximately 3–4 seconds long. The video was recorded at 30 frames per second (fps) at a resolution of 1920 × 1080 pixels. The audio was recorded at 48 kHz and resampled to 16 kHz for further processing. Among the 320 utterances, we randomly selected 100 utterances for the training set and 40 utterances for the testing set, with no overlap between the two sets. We used nine stationary and nonstationary background noises, namely machine, pink noise, babble, baby cry, party crowd, applause, grocery store, restaurant, and vacuum cleaner, to prepare the training and testing data. For the training set, the clean utterances were artificially contaminated with one stationary and four nonstationary noises (machine, babble, party crowd, restaurant, and vacuum cleaner) at five signal-to-noise ratios (SNRs) ∈ {−6, −3, 3, 6, 10 dB}, generating 100 × 5 (noise types) × 5 (SNRs) = 2500 noisy training utterances.
To confirm the effectiveness of the proposed AVHELM system, two evaluation scenarios were adopted to design the testing sets: matched noise types and mismatched noise types. In the matched case, the clean testing utterances were contaminated with two noise types seen in training, namely party crowd and babble, at SNRs ∈ {−6, −2, 0, 2, 6 dB}, which include both SNRs used in the training set and unseen SNRs. In the mismatched case, the clean testing utterances were contaminated with four noise types not seen in training, namely applause, baby cry, pink noise, and grocery store, at the same SNRs ∈ {−6, −2, 0, 2, 6 dB}.
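The contamination step described above can be sketched as follows. This is a minimal illustration of mixing a clean signal with noise at a target SNR, not the authors' data-preparation script: the function name `mix_at_snr`, the use of whole-utterance average power, and the random stand-in signals are all assumptions.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # stand-in for one second of 16 kHz speech
noise = rng.standard_normal(16000)  # stand-in for a background-noise recording

target_snr_db = 3.0                 # illustrative value
noisy = mix_at_snr(clean, noise, target_snr_db)

# Verify the achieved SNR from the added noise component.
added = noisy - clean
achieved_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean(added ** 2))

# Size of the training set as stated in the text: utterances x noise types x SNRs.
n_training = 100 * 5 * 5
```

Repeating `mix_at_snr` over every combination of clean utterance, noise type, and SNR level yields the noisy training and testing sets.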
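The proposed system was evaluated using three standard objective metrics: the perceptual evaluation of speech quality (PESQ), the hearing-aid speech perception index (HASPI), and the segmental SNR improvement (SSNRI).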
Audio-Visual Feature Extraction
The audio signals were processed using the short-time Fourier transform (STFT) with a frame length of 512 samples and a frame shift of 256 samples. We extended the input speech vector by including more context at the input layer. In this chapter, we used ±2 neighbouring speech vectors to the left and right of the central speech vector, similar to [85], generating log power spectrum (LPS) features of dimension 257 × 5 (= 257 × (ws × 2 + 1), where ws is the contextual window size, and ws = 2 was used in our experiments).
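The audio feature pipeline can be sketched as below. The frame length, frame shift, and context size follow the text; the Hann window, the small constant inside the logarithm, the edge-padding of the first and last frames, and the function name `lps_with_context` are assumptions made for illustration.

```python
import numpy as np

def lps_with_context(signal, frame_len=512, hop=256, ws=2):
    """Return LPS features with +/- ws context frames, shape (n_frames, 257 * (2*ws + 1))."""
    window = np.hanning(frame_len)  # window choice is an assumption
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)       # 512-point FFT gives 257 bins
    lps = np.log(np.abs(spectrum) ** 2 + 1e-12)  # log power spectrum
    # Repeat the first/last frame so that every frame has a full context window.
    padded = np.pad(lps, ((ws, ws), (0, 0)), mode="edge")
    return np.hstack([padded[k : k + n_frames] for k in range(2 * ws + 1)])

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)  # stand-in for one second of 16 kHz audio
features = lps_with_context(signal)
```

Each row of `features` holds five consecutive 257-bin LPS vectors, with the central frame in the middle block.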
For the visual information, we used the same visual features as in [85], converting the video of each utterance into a sequence of images at a frame rate of 50 fps. The mouth region of each image was subsequently detected using the Viola–Jones method [159] and cropped to a 16 × 24 pixel region, resulting in visual features of dimension 16 × 24 × 3 × 5, where 3 is the number of RGB channels and 5 is the number of visual vectors in the context window (the central vector plus its left and right neighbours).
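As a rough illustration of the visual feature layout, the sketch below stacks each cropped mouth image with its two left and two right neighbours. Mouth detection itself is omitted, random arrays stand in for the crops, and the edge-padding at utterance boundaries and the function name `stack_visual_context` are assumptions.

```python
import numpy as np

def stack_visual_context(mouth_crops, ws=2):
    """Map (n_frames, 16, 24, 3) crops to (n_frames, 16, 24, 3, 2*ws + 1)."""
    n_frames = mouth_crops.shape[0]
    pad = [(ws, ws)] + [(0, 0)] * (mouth_crops.ndim - 1)
    padded = np.pad(mouth_crops, pad, mode="edge")  # repeat boundary frames
    return np.stack(
        [padded[k : k + n_frames] for k in range(2 * ws + 1)], axis=-1
    )

crops = np.random.default_rng(0).random((50, 16, 24, 3))  # toy data: one second at 50 fps
visual = stack_visual_context(crops)
flat = visual.reshape(len(visual), -1)  # one 16*24*3*5 = 5760-dimensional vector per frame
```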
5.4.2 Experimental Results
AVHELM vs AHELM
In this section, we first evaluate the overall performance of the proposed AVHELM against AHELM. Table 5.1 compares the average PESQ results of the proposed AVHELM and AHELM structures under matched and mismatched testing conditions. For a fair comparison, the two frameworks were trained using 1000, 1000, and 8000 neurons ([1000 1000 8000]). For the two HELM configurations, the sigmoid activation function was employed with the regularization parameter equal to that used in [94]. In our preliminary experiments, we found that with such a small amount of training data, both the DNN- and CNN-based AVSE systems [85] cannot perform well. To focus on HELM-based SE systems, the CNN-based results are not included in this chapter. For the AVHELM, each modality was processed independently by the unsupervised stage to convert low-level features into representative features. During the supervised stage, the representations learned from both modalities were integrated linearly to learn the multimodal transformation. Table 5.1 shows that the proposed AVHELM framework improved the speech quality (PESQ) over AHELM for both matched and mismatched (stationary and nonstationary) noise types. The performance of AVHELM and AHELM is further compared against a traditional logarithmic minimum mean square error (logMMSE) [115] method. Table 5.1 shows that the AVHELM and AHELM frameworks with similar configurations clearly outperformed logMMSE under all testing conditions except pink noise, where logMMSE, as a powerful traditional method, retains its advantage under the mismatched stationary noise condition and performs better than both AHELM and AVHELM.

Table 5.1: Average PESQ scores of logMMSE, AHELM, and AVHELM under matched and mismatched noise conditions.
Condition    Noise type      logMMSE   AHELM    AVHELM
Matched      Babble          2.2138    2.2932   2.4066
Matched      Party crowd     2.1473    2.3163   2.4054
Mismatched   Applause        1.8963    2.3873   2.5104
Mismatched   Baby cry        1.9725    2.6812   2.7633
Mismatched   Grocery store   2.0986    2.3083   2.4948
Mismatched   Pink noise      2.5774    2.3639   2.5125
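The comparisons drawn from Table 5.1 can be checked with a few lines of plain Python. The scores below are copied from the table; the per-condition means and per-noise gains are derived quantities computed here for illustration and are not reported in the text, and the variable names are hypothetical.

```python
# (condition, logMMSE, AHELM, AVHELM) per noise type, copied from Table 5.1.
TABLE_5_1 = {
    "Babble":        ("matched",    2.2138, 2.2932, 2.4066),
    "Party crowd":   ("matched",    2.1473, 2.3163, 2.4054),
    "Applause":      ("mismatched", 1.8963, 2.3873, 2.5104),
    "Baby cry":      ("mismatched", 1.9725, 2.6812, 2.7633),
    "Grocery store": ("mismatched", 2.0986, 2.3083, 2.4948),
    "Pink noise":    ("mismatched", 2.5774, 2.3639, 2.5125),
}

def condition_means(condition):
    """Mean [logMMSE, AHELM, AVHELM] PESQ over the noise types of one condition."""
    rows = [scores for cond, *scores in TABLE_5_1.values() if cond == condition]
    return [sum(col) / len(col) for col in zip(*rows)]

matched = condition_means("matched")
mismatched = condition_means("mismatched")

# PESQ gain of AVHELM over AHELM for each noise type.
avhelm_gain = {
    noise: round(av - a, 4) for noise, (_, _, a, av) in TABLE_5_1.items()
}

# Noise types where logMMSE beats both HELM systems.
logmmse_wins = [
    noise for noise, (_, lm, a, av) in TABLE_5_1.items() if lm > max(a, av)
]

print(matched, mismatched)
print(avhelm_gain)
print(logmmse_wins)
```

Running the sketch confirms the two observations made above: the AVHELM gain over AHELM is positive for every noise type, and pink noise is the only case in which logMMSE scores higher than both HELM systems.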
Table 5.1 compares the average PESQ scores of the two HELM systems under matched and mismatched testing conditions. However, the table does not reveal which modality contributes more to reconstructing the denoised speech signal at different SNR levels. Therefore, we plotted the average PESQ scores at each SNR level to further investigate the behavior of the two HELM frameworks. Fig. 5.3 presents the average PESQ scores over the six noise types at different SNR levels for AHELM and AVHELM. The results for unprocessed speech (denoted as Noisy) and logMMSE are also shown for comparison. We observe that the proposed AVHELM framework obtained a clear performance improvement at low SNRs (i.e., −6 and −2 dB), indicating that the framework relies more on the visual information for guidance when the audio is heavily corrupted. However, the visual information provided little additional help at high SNRs: the difference between the average PESQ scores of AHELM and AVHELM becomes smaller at high SNRs (2 dB and 6 dB), indicating that the audio modality plays the crucial role at high SNRs.

[Figure 5.3 plot: PESQ (vertical axis, 0 to 3) versus SNR in dB (−6, −2, 0, 2, 6) for Noisy, LogMMSE, AHELM, and AVHELM.]
Figure 5.3: Average PESQ scores over six noise types at different SNR levels.
In addition to the PESQ scores, we also report the average HASPI and SSNRI results for the AVHELM and AHELM frameworks alongside logMMSE. Fig. 5.4 displays the average HASPI and SSNRI results over the six noise types at different SNR levels. Overall, the proposed AVHELM demonstrated better speech enhancement capability than AHELM and logMMSE by maintaining high HASPI and SSNRI scores. Moreover, the improvement is more pronounced at low SNR levels than at high SNR levels, again confirming that the visual modality plays a crucial role in reconstructing the signal at low SNR levels.
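For readers unfamiliar with SSNRI, the sketch below computes it as the segmental SNR of the enhanced signal minus that of the noisy signal. This follows a commonly used definition and is not necessarily the exact configuration used in these experiments: the 256-sample frames, the clamping range of −10 to 35 dB, and the synthetic signals are assumptions.

```python
import numpy as np

def segmental_snr(clean, processed, frame_len=256, lo=-10.0, hi=35.0):
    """Mean per-frame SNR in dB, with each frame value clamped to [lo, hi]."""
    n_frames = len(clean) // frame_len
    values = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - processed[seg]) ** 2)
        snr = 10.0 * np.log10((signal_energy + 1e-12) / (error_energy + 1e-12))
        values.append(np.clip(snr, lo, hi))
    return float(np.mean(values))

def ssnri(clean, noisy, enhanced):
    """Segmental SNR improvement of `enhanced` over `noisy`, in dB."""
    return segmental_snr(clean, enhanced) - segmental_snr(clean, noisy)

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
disturbance = rng.standard_normal(16000)
noisy = clean + 0.5 * disturbance      # synthetic noisy signal
enhanced = clean + 0.1 * disturbance   # synthetic output with 5x less residual noise
improvement_db = ssnri(clean, noisy, enhanced)
```

In this toy case the residual noise amplitude drops by a factor of five, so the improvement is 20·log10(5) dB, roughly 14 dB.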
Figure 5.4: Average HASPI and SSNRI scores over six noise types at different SNR levels.
To better appreciate the SE performance attained by the proposed model, we plotted the spectrograms of the enhanced speech signals produced by AHELM and AVHELM. Fig. 5.5 shows the spectrograms of a test utterance contaminated with nonstationary applause noise at SNR = 2 dB. Fig. 5.5(c) and (d) display the spectrograms of the test utterance enhanced by the AHELM and AVHELM frameworks, respectively. The spectrograms of the clean and noisy speech signals are shown in Fig. 5.5(a) and (b) for comparison. From Fig. 5.5, we note that although neither AHELM nor AVHELM perfectly restores the clean speech under this very challenging condition (nonstationary noise, 2 dB SNR), both effectively suppress the noise components of the noisy signal (Fig. 5.5(b)). Moreover, AVHELM suppresses noise components more effectively and yields better speech quality (PESQ = 2.7784) than AHELM (PESQ = 2.6496). In addition to the spectrogram plots, we also plotted the waveforms to visually inspect the speech processed by AHELM and AVHELM. Fig. 5.6(a), (b), (c), and (d) show the waveforms of the Clean, Noisy, AHELM, and AVHELM speech, respectively, where the test utterance is contaminated with applause noise at SNR = 2 dB. From the figure, we observe that the waveform of the denoised speech produced by AVHELM follows a pattern similar to that of the clean speech, with less distortion, even at low SNR (SNR = 2 dB), illustrating that AVHELM can more effectively restore clean speech from its noisy counterpart.
Figure 5.5: Spectrograms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM speech. The test utterance was contaminated with applause noise at SNR = 2 dB.
Figure 5.6: Waveforms of (a) Clean, (b) Noisy, (c) AHELM, and (d) AVHELM speech. The test utterance was contaminated with applause noise at SNR = 2 dB.
5.5 Summary
In this chapter, a novel AVHELM framework is proposed to improve the performance of the conventional AHELM framework. The results demonstrate that incorporating the visual modality improves system performance under both matched and mismatched (as well as stationary and nonstationary) noise conditions at severe SNR levels when limited training data are available. To the best of our knowledge, this is the first work that successfully applies HELM to audio-visual speech enhancement. In future work, we aim to further enhance the system performance by considering noise- and SNR-aware training.