2.4 Experiments
2.4.2 Experimental Results
National Chengchi University (國立政治大學)
speech intelligibility in human listening tests [114]. A higher STOI value indicates better speech intelligibility, and the score ranges from 0 to 1. In the following discussion, the scores across the testing sets of the Aurora–4 task are reported; the clean test set (Test Set 1 in Table 2.1) was excluded.
The speech signal was processed using a moving window with a size of 10 ms and a step of 5 ms. The Mel-frequency power spectrum (MFP) feature was then calculated for each speech frame. In this chapter, we used an 80-dimensional MFP feature.
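As a rough illustration of the framing step described above, the following sketch slices a waveform into 10 ms frames with a 5 ms step. The 16 kHz sampling rate and the helper name `frame_signal` are assumptions made for this example only, and the Mel filterbank stage that would turn each frame into an 80-dimensional MFP feature is omitted.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=10.0, step_ms=5.0):
    """Split a 1-D signal into overlapping frames (the tail is not padded)."""
    win = int(round(sample_rate * win_ms / 1000.0))    # samples per frame
    step = int(round(sample_rate * step_ms / 1000.0))  # samples between frame starts
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // step
    # Row i holds the samples [i * step, i * step + win).
    idx = np.arange(win)[None, :] + step * np.arange(n_frames)[:, None]
    return np.asarray(signal)[idx]

# Placeholder one-second waveform (assumed 16 kHz): sample values equal their index.
one_second = np.arange(16000, dtype=float)
frames = frame_signal(one_second)
print(frames.shape)
```

Under these assumptions each frame holds 160 samples and consecutive frames overlap by half a window.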
2.4.2 Experimental Results
ELM
In this section, the performance of ELM is investigated by varying the number of neurons (Q) in the hidden layer in Eq. (2.14) and the type of activation function. Fig. 2.3 shows the PESQ scores for ELM using three activation functions, namely the sigmoid (Sig), hyperbolic tangent (Tanh), and radial basis function (RBF), with different numbers of neurons (Q = 500, 1000, and 1500). To assess which of these activation functions performs best, we used the same set of training and test data throughout. From Fig. 2.3, we note that the PESQ values for all three activation functions increased monotonically with Q; that is, the RBF, Tanh, and sigmoid functions all returned consistent performance improvements as the number of neurons was increased. The sigmoid activation function achieved the best performance for every value of Q (PESQ = 2.3570, 2.3847, and 2.4151) when compared with RBF (PESQ = 1.4293, 1.5261, and 1.5881) and Tanh (PESQ = 2.2563, 2.2998, and 2.3106).
Thus, in the following experiments, the sigmoid function is used as the activation function for ELM.
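To make the roles of Q and the activation function concrete, the sketch below implements a generic regularized ELM regressor with a sigmoid hidden layer. Eq. (2.14) is not reproduced in this section, so the code assumes the commonly used formulation in which the input weights are random and fixed and the output weights are obtained by ridge regression. The function names `elm_fit` and `elm_predict`, the default value of `C`, and the toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, C=200.0, seed=0):
    """Fit a single-hidden-layer ELM (assumed standard regularized form).

    X: (n_samples, n_inputs) inputs, T: (n_samples, n_outputs) targets.
    n_hidden plays the role of Q; C is an assumed regularization constant.
    """
    rng = np.random.default_rng(seed)
    # Random input weights and biases are drawn once and never trained.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)                       # hidden-layer outputs
    # Ridge solution for the output weights: (H^T H + I/C)^-1 H^T T.
    A = H.T @ H + np.eye(n_hidden) / C
    beta = np.linalg.solve(A, H.T @ T)
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (not speech data), used only to exercise the code.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(400, 8))
T_toy = np.tanh(X_toy[:, :2])
model = elm_fit(X_toy, T_toy, n_hidden=50)
mse = float(np.mean((elm_predict(X_toy, model) - T_toy) ** 2))
print(f"training MSE on toy data: {mse:.4f}")
```

Swapping `sigmoid` for `np.tanh` or a radial basis function is the only change needed to compare activation functions in this sketch; only the linear solve depends on the training targets.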
[Plot: PESQ (vertical axis, 0 to 3) versus number of neurons (500, 1000, 1500) for the RBF, Tanh, and Sig activation functions]
Figure 2.3: PESQ scores for ELM with different activation functions and numbers of hidden neurons.
HELM versus ELM
When compared with ELM, HELM leverages hierarchical training to generate a sparse representation of the input data. The standard ELM, which sits on top of the hierarchical structure of HELM, then performs the regression. To study the performances of ELM and HELM closely, Table 2.3 presents the PESQ, SDI, STOI, and SSNRI scores of ELM and HELM for each test set. The learning accuracy of ELM/HELM depends on a user-specified regularization parameter, which needs to be selected carefully during experiments. In our experiments, we tried different values of the regularization parameter to determine its impact on the performance. Here, we only report the results for the best regularization parameter value (200) in our speech enhancement task. As displayed in Table 2.2, Set B, Set C, and Set D contain speech utterances with only additive noise, only convolutive noise, and both additive and convolutive noise, respectively. From Table 2.3, it can be noticed that ELM yielded higher PESQ and STOI values and lower SDI and SSNRI scores for Set B than for Set D, because Set D includes additional convolutive distortions from a mismatched channel. As Set C did not contain additive noise, its PESQ score is higher than those of Set B and Set D. In general, the same trend can be observed in the HELM results, while the overall performance of HELM is consistently better than that of ELM (higher PESQ, STOI, and SSNRI scores) across Sets B and D. However, compared with ELM, HELM attained a lower SDI score for additive noise (Set B) and a higher SDI score for Set C and Set D, because the channel mismatch increases the distortion index for the convolutive noise in Set C and Set D.
Table 2.3: Single result abstracted from average objective evaluation scores of ELM [500] and HELM [200 200 500] configuration
            ELM                                  HELM
Test Set    PESQ    SDI     STOI    SSNRI        PESQ    SDI     STOI    SSNRI
Set B       2.4070  0.4240  0.8110   8.7280      2.5410  0.3820  0.8300   9.2360
Set C       2.6900  1.5690  0.7990   7.4570      2.8270  1.7220  0.8170   7.3940
Set D       2.2510  1.2700  0.7500  11.2810      2.3680  1.4600  0.7670  11.3750
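The per-set scores in Table 2.3 can be related to the averages over the 13 test sets reported for the [500] and [200 200 500] configurations in Fig. 2.4. The sketch below weights each group by its number of test sets. The split of 6, 1, and 6 test sets for Sets B, C, and D is an assumption based on the usual Aurora-4 layout and is not stated in this section.

```python
# Scores copied from Table 2.3, ordered as (PESQ, SDI, STOI, SSNRI).
ELM = {
    "B": (2.4070, 0.4240, 0.8110, 8.7280),
    "C": (2.6900, 1.5690, 0.7990, 7.4570),
    "D": (2.2510, 1.2700, 0.7500, 11.2810),
}
HELM = {
    "B": (2.5410, 0.3820, 0.8300, 9.2360),
    "C": (2.8270, 1.7220, 0.8170, 7.3940),
    "D": (2.3680, 1.4600, 0.7670, 11.3750),
}
# ASSUMPTION: number of test sets in each group (6 + 1 + 6 = 13).
N_SETS = {"B": 6, "C": 1, "D": 6}

def weighted_average(scores):
    """Average each metric over all test sets, weighting groups by their size."""
    total = sum(N_SETS.values())
    return tuple(
        sum(N_SETS[group] * scores[group][i] for group in scores) / total
        for i in range(4)
    )

elm_avg = weighted_average(ELM)
helm_avg = weighted_average(HELM)
for name, avg in (("ELM", elm_avg), ("HELM", helm_avg)):
    print(name, " ".join(f"{value:.4f}" for value in avg))
```

Under this assumption the weighted averages agree with the corresponding Fig. 2.4 values to about three decimal places, which suggests that the figure reports a per-test-set average rather than a plain mean of the three rows.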
Fig. 2.4 shows the average results for the 13 test sets (Sets B, C, and D) across the four evaluation metrics, using different numbers of hidden neurons for the ELM ([500], [1000], [1500]) and HELM ([200 200 500], [200 200 1000], [200 200 1500]) configurations. For an impartial comparison with ELM, we used the same number of neurons in the regression stage (third layer) of HELM. Both ELM and HELM were tested against the same Aurora–4 testing dataset, using the sigmoid activation function. It can be seen that the HELM framework delivered clear improvements in terms of PESQ, STOI, and SSNRI, and maintained a stable performance as the number of neurons increased. For HELM, the SDI score improved slightly from 0.9824 to 0.9781 when the configuration changed from [200 200 500] to [200 200 1500], whereas for ELM it rose from 0.9027 to 0.9617 for an increase from [500] to [1500] hidden neurons. To determine the optimal size for the HELM hidden layers, we evaluated different configurations by changing the number of neurons in each layer. Experiments show that good results can be achieved by fixing the first two layers to the same number of hidden neurons and varying the number of hidden neurons in the third (ELM) layer. The configuration [200 200 X] was selected for HELM, where X denotes the number of hidden neurons in the third layer, because it achieved the best results with a low number of neurons compared with the other configurations in the speech enhancement experiments.
It can be concluded by examining Table 2.3 and Fig. 2.4 that ELM introduced less distortion for a low number of neurons, but its distortion index deteriorated steadily as the number of neurons increased. For HELM, in contrast, the distortion index first increased with the number of neurons and then began to decrease.
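[Figure 2.4 data: average scores over Sets B, C, and D]
Metric  ELM [500]  HELM [200 200 500]  ELM [1000]  HELM [200 200 1000]  ELM [1500]  HELM [200 200 1500]
PESQ    2.357      2.4834              2.3847      2.5471               2.4151      2.581
SDI     0.9027     0.9824              0.9381      0.9931               0.9617      0.9781
STOI    0.7817     0.8001              0.7812      0.8089               0.7831      0.8157
SSNRI   9.8086     10.0815             9.7638      10.0779              9.7914      10.1631
Figure 2.4: PESQ, SDI, STOI, and SSNRI average scores for ELM and HELM configurations.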
Spectrogram Analysis
A spectrogram graphically represents the salient patterns of a speech signal and is used to analyze the signal over time at various frequencies. To visually compare the speech enhancement performances of ELM and HELM, we plotted the spectrograms of the clean and noisy speech alongside those of each enhanced speech signal. Fig. 2.5 and Fig. 2.6 present the spectrograms of the same utterance contaminated with two different noise types (babble and car, respectively) for Test Set D with mismatched channel conditions. Fig. 2.5(a) and (b) show the spectrograms of the clean and noisy speech signals, respectively, with babble noise. Fig. 2.5(c) and (d) present the enhanced speech signals for ELM and HELM using the [1500] and [200 200 1500] configurations, respectively. We can observe that both ELM and HELM successfully reduced the noise components, and that HELM provides a better reconstructed speech signal than ELM. We also included the PESQ scores of the utterances in Fig. 2.5; the scores show that HELM improves speech quality more effectively than ELM. Moreover, Fig. 2.6(a) and (b) show the spectrograms of the corresponding clean and noisy speech, respectively, where the noisy speech is corrupted with car noise. Here, we note trends similar to those in Fig. 2.5. The HELM framework provides a higher PESQ (2.7345) than ELM (PESQ = 2.5258), whereas the noisy speech signal, which contained both additive and convolutive noise, had a PESQ of 2.4433.
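For readers who wish to produce comparable plots, a log-power spectrogram can be computed with a short-time Fourier transform. The sketch below is a minimal NumPy version. The FFT size, hop length, 16 kHz sampling rate, and the function name `log_spectrogram` are assumptions chosen for illustration and are not the settings used for Fig. 2.5 and Fig. 2.6.

```python
import numpy as np

def log_spectrogram(signal, n_fft=256, hop=128):
    """Return a log-power spectrogram of shape (frequency bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, Hann-windowed frames.
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # The small constant avoids log(0) in silent regions.
    return 10.0 * np.log10(power + 1e-10).T

# Synthetic check signal: one second of a 1 kHz tone at an assumed 16 kHz rate.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = log_spectrogram(tone)
peak_bin = int(np.argmax(spec.mean(axis=1)))
print("peak frequency (Hz):", peak_bin * fs / 256)
```

The resulting array can be passed to any image plotting routine, with time on the horizontal axis and frequency on the vertical axis.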
Figure 2.5: Spectrograms of an utterance contaminated with babble noise: (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.2976), (c) ELM (PESQ = 2.3018), and (d) HELM (PESQ = 2.5489).
Figure 2.6: Spectrograms of an utterance contaminated with car noise: (a) clean (PESQ = 4.6439), (b) noisy (PESQ = 2.4433), (c) ELM (PESQ = 2.5258), and (d) HELM (PESQ = 2.7345).
Deeper and Wider HELM
In the previous sections, we observed that HELM has superior capabilities as a regression model; an HELM-based regression model is therefore better suited to speech enhancement. To further scrutinize the HELM performance, we varied the size of the input speech vector by including more context at the input layer. In this manner, deeper HELM structures are introduced and their performances measured. In particular, we considered the following four configurations: HELM1 with 6000 hidden neurons (hierarchical structure [1000 1000 4000]), HELM2 with 10000 hidden neurons (hierarchical structure [1000 1000 8000]), HELM3 with 14000 hidden neurons (hierarchical structure [1000 1000 12000]), and HELM4 with 18000 hidden neurons (hierarchical structure [1000 1000 16000]). Table 2.4 lists the resulting enhancements for the following five HELM configurations: HELM (hierarchical structure [200 200 1500]), HELM1, HELM2, HELM3, and HELM4, where the input dimension is changed from 80 to ws × 80 (ws denotes the input window size) in order to include the neighboring speech vectors to the left and right of the center speech vector.
From Table 2.4, we can observe that HELM4 outperforms HELM, HELM1, HELM2, and HELM3 in terms of PESQ for a window size of 1 (ws = 1). However, HELM4 introduced more distortion (SDI = 1.1862) with less intelligibility (STOI = 0.8126) than the basic HELM configuration ([200 200 1500]) for the same window size. For ws = 1, the table shows only a small improvement in PESQ, from 2.5810 to 2.6040 (overall improvement of 0.023), when the total number of neurons increased from 1900 (HELM, [200 200 1500]) to 18000 (HELM4, [1000 1000 16000]). The gain across the same configurations was considerably larger for ws = 7, where the PESQ rose from 2.6547 to 2.7698 (overall improvement of 0.1151, roughly five times the gain at ws = 1). Similarly, for ws = 11 the PESQ rose from 2.5880 to 2.7687, an overall improvement of 0.1807. It is apparent from Table 2.4 that HELM demonstrated better speech enhancement capabilities when the size of the context window was increased.
However, the PESQ of the HELM frameworks dropped when the input window size increased beyond 7, except for HELM1, where the PESQ rose marginally from 2.7040 to 2.7060. Although increasing the window size improved the intelligibility (STOI) and SSNRI scores for the HELM frameworks, it also tended to introduce more distortion. The table shows that the best results in terms of PESQ were achieved by HELM4 (configuration [1000 1000 16000]) with an input window size of 7. It is worth mentioning that the deeper HELM structures with a wider context window (ws = 7) proved more effective in terms of speech quality (PESQ) and intelligibility (STOI) than with an even larger context (ws = 11), which degraded the performance by including irrelevant information.
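The widening of the input from 80 to ws × 80 dimensions can be sketched as follows. The text does not state how utterance boundaries are handled, so repeating the first and last frames at the edges is an assumption, as is the helper name `stack_context`.

```python
import numpy as np

def stack_context(features, ws):
    """Concatenate each frame with its (ws - 1) / 2 neighbors on each side.

    features: (n_frames, dim) array, e.g. dim = 80 for the MFP feature.
    Returns an array of shape (n_frames, ws * dim).
    """
    if ws < 1 or ws % 2 == 0:
        raise ValueError("ws must be a positive odd number")
    half = ws // 2
    n_frames = features.shape[0]
    # ASSUMPTION: edge frames are repeated to fill the context at boundaries.
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
    # Block k of the output holds the frame at offset (k - half) from the center.
    return np.concatenate([padded[k:k + n_frames] for k in range(ws)], axis=1)

# Random placeholder features standing in for 100 frames of 80-dimensional MFP.
mfp = np.random.default_rng(0).random((100, 80))
stacked = stack_context(mfp, ws=7)
print(stacked.shape)
```

With ws = 7 the input grows to 560 dimensions and with ws = 11 to 880, so the first HELM layer has to model substantially more inputs as the context widens.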
Table 2.4: Performance comparison of HELM frameworks using different window sizes
ws   Framework            PESQ    SDI     STOI    SSNRI
1    [200 200 1500]       2.5810  0.9780  0.8160  10.1600
1    [1000 1000 4000]     2.5669  1.1301  0.8105  10.0851
1    [1000 1000 8000]     2.5938  1.1364  0.8116  10.0928
1    [1000 1000 12000]    2.5979  1.1684  0.8124  10.0794
1    [1000 1000 16000]    2.6040  1.1862  0.8126  15.8031
7    [200 200 1500]       2.6547  1.0228  0.8191  10.9900
7    [1000 1000 4000]     2.7040  1.1405  0.8243  11.0340
7    [1000 1000 8000]     2.7440  1.1499  0.8297  11.0527
7    [1000 1000 12000]    2.7592  1.1450  0.8329  11.0753
7    [1000 1000 16000]    2.7698  1.1576  0.8345  15.7461
11   [200 200 1500]       2.5880  0.9930  0.8130  11.1300
11   [1000 1000 4000]     2.7060  1.1202  0.8250  11.2100
11   [1000 1000 8000]     2.7310  1.1616  0.8292  11.2324
11   [1000 1000 12000]    2.7585  1.1805  0.8320  11.2371
11   [1000 1000 16000]    2.7687  1.1525  0.8332  11.2656
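The PESQ gains between the smallest ([200 200 1500]) and largest ([1000 1000 16000]) configurations can be recomputed directly from Table 2.4. The snippet below is plain arithmetic on the tabulated values; the variable names are illustrative.

```python
# PESQ from Table 2.4 as (basic HELM [200 200 1500], HELM4 [1000 1000 16000]).
pesq = {
    1: (2.5810, 2.6040),
    7: (2.6547, 2.7698),
    11: (2.5880, 2.7687),
}

# Gain obtained by moving from the basic to the deepest configuration.
gains = {ws: round(deep - base, 4) for ws, (base, deep) in pesq.items()}
for ws, gain in gains.items():
    print(f"ws = {ws:2d}: PESQ gain = {gain:.4f}")

# How much larger the gain is with a context window of 7 than without context.
ratio = gains[7] / gains[1]
print(f"gain at ws = 7 relative to ws = 1: {ratio:.2f}x")
```

The gain from adding neurons is small without context and grows once neighboring frames are included, which is the main argument for combining deeper structures with a wider input window.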
HELM versus DDAE
In this section, we compare HELM against a conventional DDAE, for which we adopted a configuration similar to that reported in [108]. For the deeper structures, the autoencoder is trained using clean and multicondition data contaminated with six different background noises, as described in Section 2.4.1. We built four DDAE-based speech enhancement systems, namely DDAE1, DDAE2, DDAE3, and DDAE4, with 3, 5, 7, and 9 nonlinear layers, respectively, each layer having 2048 hidden neurons. The deeper structures of DDAE were compared with our deeper HELM configurations. Namely, HELM1 with 6000 hidden neurons was compared with DDAE1, which has a total of 6144 (= 2048 × 3) hidden neurons; HELM2 with 10000 hidden neurons was compared with DDAE2, which has 10240 (= 2048 × 5) hidden neurons; HELM3 with 14000 hidden neurons was compared with DDAE3, which has 14336 (= 2048 × 7) hidden neurons; and HELM4 with 18000 hidden neurons was compared with DDAE4, which has 18432 (= 2048 × 9) hidden neurons. The learning rate during the training of the DDAE frameworks was set to 0.0002, with a batch size of 5000. The number of epochs for each of the four DDAE structures was set to 70. Table 2.5 lists the speech enhancement results for these deeper HELM and DDAE configurations with an input context window size of 7, which was selected because it gave the highest PESQ score (Section 2.4.2). By examining Table 2.5, we can confirm that HELM outperforms DDAE in terms of PESQ and SSNRI. However, HELM produced higher distortion (SDI) and a lower intelligibility (STOI) score than the DDAE frameworks. The table shows that the performance of HELM is consistent (increasing gradually) in terms of PESQ, STOI, and SSNRI as the number of neurons grows, while the DDAE performance is inconsistent in terms of PESQ, SSNRI, and SDI as more layers and neurons are introduced. That is, PESQ, SSNRI, and SDI tend to degrade as the DDAE structure becomes larger. The table also shows that adding more layers and neurons to the DDAE structures (DDAE3 and DDAE4) does not guarantee good performance when the training data is limited; a sufficient amount of data is necessary for DDAE structures to generalize well. The HELM structures, on the other hand, showed a monotonically increasing performance for higher numbers of neurons.
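The pairing of HELM and DDAE systems by total hidden-neuron count can be checked with a few lines of arithmetic. The dictionaries below simply restate the configurations given in the text; the variable names are illustrative.

```python
# Hierarchical structures of the deeper HELM configurations.
helm_layers = {
    "HELM1": [1000, 1000, 4000],
    "HELM2": [1000, 1000, 8000],
    "HELM3": [1000, 1000, 12000],
    "HELM4": [1000, 1000, 16000],
}
# Number of nonlinear layers in each DDAE, each with 2048 hidden neurons.
ddae_depth = {"DDAE1": 3, "DDAE2": 5, "DDAE3": 7, "DDAE4": 9}
DDAE_WIDTH = 2048

helm_total = {name: sum(layers) for name, layers in helm_layers.items()}
ddae_total = {name: DDAE_WIDTH * depth for name, depth in ddae_depth.items()}

# Pair the systems in order (HELM1 with DDAE1, and so on) and compare sizes.
for helm_name, ddae_name in zip(helm_total, ddae_total):
    h, d = helm_total[helm_name], ddae_total[ddae_name]
    print(f"{helm_name}: {h:5d}  vs  {ddae_name}: {d:5d}  (difference {d - h})")
```

In every pair the DDAE has slightly more hidden neurons than its HELM counterpart, so the comparison does not favor HELM in terms of model size.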
Table 2.5: Objective evaluation scores of DDAE and HELM alongside traditional speech enhancement methods
Method PESQ SDI STOI SSNRI
KLT 2.4907 1.2438 0.8594 9.5737
MMSE 2.5600 1.5212 0.8549 3.2246
RPCA 2.5615 1.8178 0.8426 1.6268
DDAE1 2.6767 1.0456 0.8293 10.6733
DDAE2 2.6783 1.0581 0.8330 10.6100
DDAE3 2.6731 0.9686 0.8385 10.5776
DDAE4 2.6664 1.0858 0.8401 10.3242
HELM1 2.7040 1.1405 0.8243 11.0340
HELM2 2.7440 1.1499 0.8297 11.0527
HELM3 2.7592 1.1450 0.8329 11.0753
HELM4 2.7698 1.1576 0.8345 15.7461
In addition, both learning algorithms are compared against three different classes of speech enhancement algorithms to verify their performance in speech enhancement tasks: a conventional spectral restoration approach, for which we used an MMSE-based noise reduction technique [115]; a subspace-based KLT algorithm [116]; and noise reduction based on robust PCA (RPCA) [117]. It is evident that both learning algorithms attained a clear improvement over the traditional methods, with improved PESQ, SDI, and SSNRI scores. However, the intelligibility (STOI) of KLT is greater than that of both learning algorithms. The results in Table 2.5 demonstrate that HELM with deeper configurations (HELM1, HELM2, HELM3, and HELM4) outscored the KLT, MMSE, RPCA, and DDAE methods by a notable margin in terms of PESQ and SSNRI. These results further confirm the advantages of HELM for achieving a satisfactory noise reduction (NR) performance with relatively few training samples.
Sensitivity towards the Training Data
To analyze the sensitivity of the two learning algorithms (DDAE and HELM) to the amount of training data, we progressively decreased the number of training batch samples (TS), keeping 10% of the previous amount at each step. Initially, we used 150000 MFP spectral patches of the training samples, which were reduced to 15000, 1500, and finally 150 MFP patches. The number of epochs was also reduced as the size of the training data decreased. We used 70 epochs to train the DDAE frameworks on 150000 MFP patches, and reduced this to 40 epochs when the training data was cut to 10% (i.e., 15000 MFP patches). We further reduced the number of epochs to 30 when the training data was decreased to 1500 and 150 MFP patches. The purpose of this investigation is to evaluate the stability of each algorithm against the size of the training data. Fig. 2.7 and Fig. 2.8 summarize the behavior of the two learning algorithms in terms of PESQ and STOI, respectively, for ws = 7. Overall, the performance of both learning algorithms drops as the training data shrinks. However, the HELM frameworks retained a substantial level of performance, even when the training samples were reduced to 150 MFP patches.
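Read against the patch counts quoted above (150000, 15000, 1500, and 150), each reduction step keeps one tenth of the previous training set. The sketch below generates that schedule and pairs it with the DDAE epoch counts given in the text; the function name `training_schedule` is hypothetical.

```python
def training_schedule(initial=150000, factor=10, steps=4):
    """Return the training-set sizes, dividing by `factor` at every step."""
    return [initial // factor ** k for k in range(steps)]

sizes = training_schedule()
# DDAE epochs quoted in the text for each training-set size.
epochs = dict(zip(sizes, (70, 40, 30, 30)))

for size in sizes:
    share = 100.0 * size / sizes[0]
    print(f"TS{size}: {share:g}% of the full set, {epochs[size]} epochs")
```

The smallest condition therefore uses only 0.1% of the original training patches, which makes it a demanding test of how well each model generalizes from little data.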
On close examination, the graph in Fig. 2.7 shows an improvement in the performances of the DDAE frameworks when the size of the training samples (TS) was increased tenfold, from TS150 to TS1500 (from level ① to level ②): the PESQ score improved from 1.7588 to 2.1920 for DDAE1, from 1.8507 to 2.1053 for DDAE2, from 1.8096 to 1.956 for DDAE3, and from 1.6227 to 1.7798 for DDAE4. The same trend can be observed for the DDAE frameworks when the size of the training samples is increased from 15000 to 150000 MFP patches (level ③ to level ④). However, it can also be noted that the performances of DDAE3 and DDAE4 dropped rapidly as soon as the training data was reduced to 10% (from TS150000 to TS15000), i.e., from PESQ = 2.6731 to 2.2777 for DDAE3 and from PESQ = 2.6664 to 2.2413 for DDAE4, which clearly illustrates the sensitivity of deeper DDAE frameworks to the amount of training data. The performances of DDAE3 and DDAE4 degraded severely when the training data was reduced to 1% (from TS150000 to TS1500).
Figure 2.7: PESQ score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
On the other hand, HELM proved to be highly resilient against the reduction in the size of the training samples. The PESQ score for the HELM1 configuration rose from 1.9377 to 2.4706 when the size of the training samples was increased tenfold (from 150 to 1500 MFP patches), as shown in Fig. 2.7. Similarly, the PESQ score further improved from 2.6469 (15000 MFP patches) to 2.7040 (150000 MFP patches) for the last increment in the size of the training samples (level ③ to level ④). Furthermore, the PESQ score for HELM2 increased from 2.0122 to 2.7440 when the size of the training samples was increased from 150 to 150000 MFP patches. The deeper structures of HELM (HELM3 and HELM4) provided a steadier performance in terms of PESQ than the DDAE frameworks when the size of the training data was reduced.
Figure 2.8: STOI score for (a) DDAE1, (b) DDAE2, (c) DDAE3, (d) DDAE4, (e) HELM1, (f) HELM2, (g) HELM3 and (h) HELM4, using different amounts of training batch samples (TS).
We also measured the effect of the reduction of training samples on speech intelligibility, which measures the comprehensibility of the speech signal for the given conditions. Fig. 2.8 shows the intelligibility (STOI) of the test speech signals for each of the two learning algorithms with the limited training samples. The STOI score for the DDAE frameworks became very poor when the training samples were reduced to 150 MFP patches. In contrast, HELM again proved to be very stable, even for a training sample size of 150 MFP patches. The STOI dropped from 0.8293 to 0.6575 for DDAE1 (from level ④ to level ①), from 0.8330 to 0.6943 for DDAE2, from 0.8385 to 0.7009 for DDAE3, and from 0.8401 to 0.6547 for DDAE4. For HELM, the decrease was far less drastic: the STOI declined from 0.8243 to 0.7764 for HELM1, from 0.8297 to 0.7710 for HELM2, from 0.8329 to 0.7662 for HELM3, and from 0.8345 to 0.7572 for HELM4.
Although both learning algorithms maintained quality and intelligibility to some extent for the reduced training samples, the PESQ and STOI scores decreased far more for DDAE than for the HELM frameworks, which reveals the sensitivity of the DDAE frameworks to the amount of training data.
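The difference in sensitivity can be quantified as the relative STOI loss between the largest (150000 patches) and smallest (150 patches) training sets. The values below are the STOI scores quoted in the text; labeling the last entry HELM4 assumes that the score of 0.8345 refers to that configuration, as in Table 2.5.

```python
# STOI at 150000 and 150 MFP patches, as quoted in the text (ws = 7).
stoi = {
    "DDAE1": (0.8293, 0.6575),
    "DDAE2": (0.8330, 0.6943),
    "DDAE3": (0.8385, 0.7009),
    "DDAE4": (0.8401, 0.6547),
    "HELM1": (0.8243, 0.7764),
    "HELM2": (0.8297, 0.7710),
    "HELM3": (0.8329, 0.7662),
    "HELM4": (0.8345, 0.7572),  # assumed label, see the note above
}

# Fraction of the full-data STOI that is lost with the smallest training set.
rel_drop = {name: (full - small) / full for name, (full, small) in stoi.items()}
for name, drop in rel_drop.items():
    print(f"{name}: {100.0 * drop:.1f}% relative STOI loss")

worst_helm = max(v for k, v in rel_drop.items() if k.startswith("HELM"))
best_ddae = min(v for k, v in rel_drop.items() if k.startswith("DDAE"))
```

Every DDAE system loses a larger fraction of its STOI than any HELM system, which supports the conclusion that HELM is less sensitive to the amount of training data.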