Chapter 3 Methods and Materials
3.4 Rebalancing Imbalanced Dataset
3.4.1 Synthetic Minority Oversampling Technique with Validity
Corpus. A simple over- or under-sampling method is not good enough. Simple over-sampling increases both training and testing time when a (sparse) kernel-based classifier is used, and it can be replaced by a simple cost-sensitive method. Simple under-sampling, on the other hand, loses too much information. The trouble caused by removing samples is two-fold (as described in Section 3.1.2) because most of the majority samples are reliable and most of the minority samples are unreliable. This analysis leads to one conclusion: we can adopt either a cost-sensitive method or a finer synthetic sampling method.
The Synthetic Minority Over-sampling Technique (SMOTE) only partially meets our needs because it still synthesizes too many samples, which is costly for sparse kernel machines. We also noticed that SMOTE synthesizes new samples uniformly, with the aim of leaving the probability distribution unchanged.
Nonetheless, in emotion recognition we have a crucial clue, validity, which in other applications is not necessarily given or acquirable. Validity is a measure of how much credibility can be placed in a sample. If a sample is unreliable, it does not deserve more derivatives (synthetic samples). Taking validity into consideration, we modified the original SMOTE and named the result SMOTEV (SMOTE with validity).
Formulation
The original SMOTE selects a target sample $x_i$ and its $k$ nearest neighbors $x_j$ (kNN) and synthesizes $k$ samples, each on the line segment between a pair $(x_i, x_j)$, by
$$x_{\text{new}} = x_i + \alpha\,(x_j - x_i), \quad \text{where } \alpha \sim U(0, 1).$$
It has two shortcomings. First, it needs to compute the distance matrix in order to find the kNNs. Second, it may synthesize many unreliable samples. Adding unreliable samples prolongs training and testing time, and it might make learning more biased (by increasing the amount of falsification).
To avoid the distance matrix, SMOTEV takes a different approach. It first selects samples with 80% or higher validity to form a reference set. Next, all samples except those with 20% or lower validity become candidate target samples. New samples are synthesized at a random position along the line between a reference sample and a target. Taking validity into consideration, we can formulate SMOTEV as
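$$x_{\text{new}} = x_i + \beta\,(x_r - x_i) = (1 - \beta)\,x_i + \beta\,x_r$$
$$\beta \sim \mathrm{Beta}(1 + v_r,\ 1 + v_i)$$
$r$: the index of a randomly selected sample in the reference subset.
$x_r$: reference sample; $x_i$: target sample; $v_r$: validity of $x_r$; $v_i$: validity of $x_i$.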
The random variable β has its mode in $(0, 1)$. The skewed mode reflects the fact that more credibility is given to the reference sample. Note that SMOTEV, just like SMOTE, is not suited to nominal or ordinal scales (there is no technical obstacle, but the synthesized values may be meaningless); it applies only to interval or ratio scales.
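The sampling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work. The function name `smotev_sample`, its arguments, and in particular the parameterization β ~ Beta(1 + v_r, 1 + v_t) are assumptions made for the example; only the 80% and 20% validity thresholds and the interpolation between a reference and a target are taken from the text.

```python
import random


def smotev_sample(samples, validity, n_new, ref_min=0.8, target_min=0.2, rng=None):
    """Draw n_new synthetic samples in the spirit of SMOTEV (hypothetical helper).

    samples  : list of feature vectors (lists of floats) of one class
    validity : list of validity scores in [0, 1], one per sample
    """
    rng = rng or random.Random()
    # Reference set: samples with validity of 80% or higher.
    refs = [i for i, v in enumerate(validity) if v >= ref_min]
    # Target candidates: all samples except those with validity of 20% or lower.
    targets = [i for i, v in enumerate(validity) if v > target_min]
    if not refs or not targets:
        raise ValueError("need at least one reference and one target sample")

    synthetic = []
    for _ in range(n_new):
        r = rng.choice(refs)      # randomly selected reference index
        t = rng.choice(targets)   # randomly selected target index (may equal r)
        # ASSUMPTION: beta ~ Beta(1 + v_r, 1 + v_t); its mode v_r / (v_r + v_t)
        # moves toward the reference sample as the reference validity grows.
        beta = rng.betavariate(1.0 + validity[r], 1.0 + validity[t])
        x_r, x_t = samples[r], samples[t]
        # Interpolate: beta = 0 gives the target, beta = 1 gives the reference.
        synthetic.append([(1.0 - beta) * a + beta * b for a, b in zip(x_t, x_r)])
    return synthetic
```

No distance matrix is needed in this sketch, because the pair is drawn at random rather than from a nearest-neighbor search.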
The final solution to the imbalanced datasets in our experiment was to combine SMOTEV with a different cost for each class. Setting different costs saves time and performs similarly to the sampling methods. A comparison of five different rebalancing schemes is shown in Table 3.4.1. SVMs are inherently biased toward majority classes since they aim to minimize total error; therefore, training without rebalancing is definitely infeasible.
Aided by validity, SMOTEV conceptually synthesizes new samples that may be more reliable than those of the original SMOTE. The two synthetic methods have similar performance. However, all synthetic methods unavoidably increase training time. To reduce this burden, setting different costs seems to be the best way. Therefore, we synthesize only part of the data and set different costs afterwards.
Table 3.4.1

No rebalancing
       A      E      N      P      R   Recall   Precision
A    194     74    333      8      2   31.75%      32.61%
E     86    337   1069     13      3   22.35%      51.93%
N    252    215   4876     34      0   90.68%      70.60%
P      7      4    181     23      0   10.70%      22.77%
R     56     19    448     23      0    0.00%       0.00%
UR 31.10%   UP 35.58%   WR 65.76%   GR 0.00%   Training 613.86

Different Costs
       A      E      N      P      R   Recall   Precision
A    332    115     54     36     74   54.34%      22.07%
E    256    693    273    115    171   45.95%      34.32%
N    790   1124   2060    591    812   38.31%      80.94%
P     13     16     39     95     52   44.19%      10.00%
R    113     71    119    113    130   23.81%      10.49%
UR 41.32%   UP 31.57%   WR 40.09%   GR 39.86%   Training 901.92

SMOTE
       A      E      N      P      R   Recall   Precision
A    306    120     53     43     89   50.08%      22.82%
E    224    699    252    138    195   46.35%      34.78%
N    685   1104   2005    671    912   37.29%      81.08%
P     13     13     46     88     55   40.93%       8.26%
R    113     74    117    125    117   21.43%       8.55%
UR 39.22%   UP 31.10%   WR 38.94%   GR 37.68%   Training 4720.8

SMOTEV
       A      E      N      P      R   Recall   Precision
A    255     93    122     17    124   41.73%      25.81%
E    149    446    634     24    255   29.58%      38.02%
N    487    575   3108    155   1052   57.80%      74.64%
P     14     12     92     35     62   16.28%      13.01%
R     83     47    208     38    170   31.14%      10.22%
UR 35.31%   UP 32.34%   WR 48.61%   GR 32.48%   Training 2542.8

SMOTEV with different costs
       A      E      N      P      R   Recall   Precision
A    294     95     98     25     99   48.12%      22.92%
E    213    510    511     62    212   33.82%      38.93%
N    666    640   2780    342    949   51.70%      76.82%
P     11     13     59     73     59   33.95%      12.81%
R     99     52    171     68    156   28.57%      10.58%
UR 39.23%   UP 32.41%   WR 46.18%   GR 38.23%   Training 1252.5
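The summary figures in Table 3.4.1 can be recomputed from the confusion matrices. The sketch below is an illustration, not the evaluation script used in this work. It assumes that rows hold true labels and columns hold predictions, that UR and UP are plain averages of the per-class recalls and precisions, that WR is the overall accuracy, and that GR is the geometric mean of the per-class recalls. The name `recall_metrics` is hypothetical. Under these assumptions the "No rebalancing" matrix reproduces the reported UR, UP, WR and GR to two decimals.

```python
import math


def recall_metrics(cm):
    """Summary metrics from a confusion matrix.

    cm[i][j] = number of samples of true class i classified as class j.
    """
    n = len(cm)
    row_tot = [sum(row) for row in cm]                            # samples per true class
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # samples per predicted class
    recall = [cm[i][i] / row_tot[i] if row_tot[i] else 0.0 for i in range(n)]
    precision = [cm[j][j] / col_tot[j] if col_tot[j] else 0.0 for j in range(n)]
    return {
        "recall": recall,
        "precision": precision,
        "UR": sum(recall) / n,                                   # unweighted (mean) recall
        "UP": sum(precision) / n,                                # unweighted (mean) precision
        "WR": sum(cm[i][i] for i in range(n)) / sum(row_tot),    # overall accuracy
        "GR": math.prod(recall) ** (1.0 / n),                    # geometric mean (assumed)
    }


# "No rebalancing" block of Table 3.4.1, class order A, E, N, P, R.
no_rebalancing = [
    [194, 74, 333, 8, 2],
    [86, 337, 1069, 13, 3],
    [252, 215, 4876, 34, 0],
    [7, 4, 181, 23, 0],
    [56, 19, 448, 23, 0],
]
metrics = recall_metrics(no_rebalancing)
```

Because GR is a product of recalls, a single class with zero recall (class R here) drives it to zero, which is why it penalizes the unbalanced classifier much more strongly than WR does.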
Chapter 4
Experiment Results and Discussions
4.1 Experiment setup
For the AEC database, the training set consisted of 9959 clean speech chunks uttered by the children of the Ohm school, and the test set consisted of 8257 clean speech chunks (or synthetic noisy samples) uttered by the children of the Mont school. These are the same settings as those used in the INTERSPEECH 2009 Emotion Challenge. Speaker independence was assumed. For the BES database, 10-fold cross-validation (CV) was used due to the limited number of samples. Ten subsets were formed according to speakers to ensure speaker independence. Note that using independent speakers in training and testing causes a decrease of about 7% in UR compared to a stratified 10-fold CV scheme (in which the training set contains samples from every speaker and thus represents the whole dataset better).
Additive white Gaussian noise (AWGN) and additive babble noise (ABN) from the NOISEX-92 database (Varga and Steeneken, 1993) were employed in the following experiments. The SNR ranges from 15 dB down to 0 dB in steps of 5 dB.
To relate the results to the matched condition paradigm, the experiments were conducted under two conditions. In the first part, a slack mismatched condition is adopted, which allows researchers to use the statistics of the test data. In the second part, a strict mismatched condition is adopted, which allows no knowledge of the test data.
Note that only one classifier is trained for both parts. The difference between the slack and strict mismatched conditions lies in the testing stage. Under the strict condition, both training and testing samples were normalized using the same scaling factor (computed from the training set), whereas under the slack condition, training and testing samples were normalized using different scaling factors according to the noise type and SNR.
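The difference between the two conditions can be made concrete with a small sketch. It assumes min-max scaling to [0, 1], as used in the parameter selection tables; the helper names `fit_minmax`, `apply_minmax` and `normalize` are hypothetical and this is not the code used in the experiments.

```python
def fit_minmax(rows):
    """Per-feature minimum and maximum of a list of feature vectors."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]


def apply_minmax(rows, lo, hi):
    """Scale each feature with the given (lo, hi); constant features map to 0."""
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def normalize(train, test, condition):
    """Return (train_n, test_n) under the 'strict' or 'slack' condition."""
    lo, hi = fit_minmax(train)
    train_n = apply_minmax(train, lo, hi)
    if condition == "strict":
        # No knowledge of the test data: reuse the training scaling factors.
        test_n = apply_minmax(test, lo, hi)
    elif condition == "slack":
        # Statistics of the test data (one noise type and SNR) may be used.
        test_n = apply_minmax(test, *fit_minmax(test))
    else:
        raise ValueError("condition must be 'strict' or 'slack'")
    return train_n, test_n
```

For example, if the test set is the training set translated by a constant, the slack condition maps both sets onto the same normalized values, while the strict condition leaves the test samples outside [0, 1].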
4.2 Peripheral Materials
This section tackles several “minor issues” that facilitate or jeopardize the major objective. Some of them strongly affect features or classification results, or even represent a shift of paradigm; others are less important but still boost performance to some degree when properly handled.
4.2.1 Parameter Selection by Grid Search
Kernel and SVM parameters
Experiments showed that kernels other than the linear kernel do not further improve classification results in our case. This is probably because the data samples overlap heavily and the dimension is high. In non-separable cases, a curved decision boundary helps only when the distributions of samples from different classes are very different. In our case, the distributions of each RS feature are very similar between classes, so applying RBF kernels does not improve performance. (A reply to Yeh’s future work in Section 1.3)
SVM parameters affect the shape of the decision boundaries and thus influence robustness. Robustness is a question of generalization: if the boundaries are very twisted, the classifier generalizes poorly.
However, the robustness of features should not depend on the classifier’s ability to generalize. In our experiments, SVM parameters are selected to maximize performance on the training set, since our assumption of a strict mismatched condition allows us to use no information other than the training samples.
With RBF kernels, there are two parameters: the regularization cost C (or ν in the case of ν-SVM; in our case, ν-SVM has numerical problems in determining ν) and the parameter γ of the RBF kernel. Grid search is applied to find the best parameter combination (C, γ). There are other strategies for parameter selection (cf. Friedrichs and Igel, 2004), but a simple grid search serves our purpose.
The grid search procedure has two phases, coarse search and fine search. The coarse search finds the general trend of how the parameter combination influences performance, and the fine search zooms in on a specific area to find a better result (Chang and Lin, 2011). The results of grid search on different databases and feature sets are shown in the following tables. All tests are based on the strict mismatched condition and the speaker independent setting.
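The two-phase procedure can be sketched as follows. This is a minimal illustration rather than the script used for the tables below. The scoring callable `score(C, gamma)` stands for whatever performance measure is maximized (for instance UR), and the exponent ranges and step sizes are illustrative defaults chosen for the example, not the values used in this work.

```python
def frange(start, stop, step):
    """Inclusive list of evenly spaced values from start to stop."""
    n = int(round((stop - start) / step))
    return [start + k * step for k in range(n + 1)]


def grid_search(score, log2_c_values, log2_gamma_values):
    """Exhaustive search over exponents; returns (best_score, log2_C, log2_gamma)."""
    best = None
    for lc in log2_c_values:
        for lg in log2_gamma_values:
            s = score(2.0 ** lc, 2.0 ** lg)
            if best is None or s > best[0]:
                best = (s, lc, lg)
    return best


def coarse_to_fine(score, c_range=(-5, 15), g_range=(-15, 3),
                   coarse_step=2.0, fine_step=0.25):
    """Coarse search over the full ranges, then a fine search around the best cell."""
    # Phase 1: coarse grid to find the general trend.
    _, lc, lg = grid_search(score,
                            frange(c_range[0], c_range[1], coarse_step),
                            frange(g_range[0], g_range[1], coarse_step))
    # Phase 2: zoom in on the neighbourhood of the coarse optimum.
    fine_c = frange(lc - coarse_step, lc + coarse_step, fine_step)
    fine_g = frange(lg - coarse_step, lg + coarse_step, fine_step)
    return grid_search(score, fine_c, fine_g)
```

Searching over exponents of 2 keeps the coarse grid small while still covering several orders of magnitude. For a linear kernel the γ axis can be collapsed to a single value.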
Both SVM and ν-SVM have problems with parameter settings when the classes overlap heavily. If the regularization term C of a linear kernel SVM is too large, the algorithm converges very slowly; its counterpart in ν-SVM, the parameter ν, cannot be set too large. In our experiments, ν had to stay below 0.2152 for the Aibo Corpus and below 0.5 for the Berlin Database.
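An upper limit on ν is to be expected in unbalanced problems: a known feasibility condition for two-class ν-SVC is ν ≤ 2 min(n₊, n₋) / (n₊ + n₋), where n₊ and n₋ are the class sizes. The sketch below evaluates this bound. Taking the minimum over all class pairs assumes a one-vs-one multi-class scheme, and the class counts in the usage line are toy numbers, not those of either database; whether this bound fully accounts for the limits reported above has not been verified here.

```python
from itertools import combinations


def nu_max(n_a, n_b):
    """Largest feasible nu for a two-class nu-SVC with class sizes n_a and n_b."""
    return 2.0 * min(n_a, n_b) / (n_a + n_b)


def nu_max_one_vs_one(class_counts):
    """Smallest pairwise bound, assuming one-vs-one multi-class training."""
    return min(nu_max(a, b) for a, b in combinations(class_counts.values(), 2))


# Toy example: the most unbalanced pair (100 vs 900) limits nu to 0.2.
toy_bound = nu_max_one_vs_one({"small": 100, "large": 900, "medium": 500})
```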
Table 4.2.1: Parameter selection in AEC database using full RS feature set.
Database: Aibo Emotion Corpus Feature set: r180
Normalization to [0, 1]
ν-SVM with linear kernel
ν UR WR
0.1 11.0990 10.8998
0.2 23.7597 53.6030
0.21 25.9669 57.9145
0.215 26.5859 56.9820
0.2151 24.1896 59.0893
0.2152 n/a n/a
Table 4.2.2: Parameter selection in BES database using full RS feature set.
Database: Berlin Emotional Speech Database Feature set: r180
Normalization to [0, 1]
ν-SVM with linear kernel
ν UR WR GR
0.05 62.9833 65.0467 59.5497
0.1 64.2851 65.9813 61.1472
0.15 64.7768 66.1682 61.5982
0.2 63.9185 65.9813 60.6937
0.25 63.8075 65.9813 60.4542
0.3 63.5807 66.1682 59.8891
0.35 62.5823 65.4206 59.3010
0.4 61.0858 64.6729 57.1871
0.45 59.3344 63.1776 55.2406
0.5 n/a n/a n/a
In simpler (separable) cases, ν-SVM is known for its convenient parameter tuning; in our case, the original SVM is the superior option. All assessment metrics are higher for the original form of SVM; therefore, we carried on the experiments with it. When C grows larger, it takes much longer to compute the decision boundaries. This phenomenon only happens with the linear kernel because the linear kernel only allows linear separation. In this case, there are many misclassified samples, so the kernel becomes very dense and optimization becomes more difficult (a larger C indicates a smaller allowable distance from the decision boundary).
The impact of parameter C on performance was simply a trade-off between UR and WR when using i384 features. The best C was around $2^{-5}$ for i384 and $2^{1}$ for r180 features. This discordance became a problem later when we attempted to fuse the two feature sets. A larger value of the regularization term C did not change the classification in the Berlin Database, which is in accordance with a previous report (Keerthi and Lin, 2003).
Table 4.2.3:
Database: Aibo Emotion Corpus Feature set: r180
Normalization to [0, 1]
C-SVM with linear kernel
C UR (%) WR (%) Training Time (sec)
1 40.8539 38.0162 117.5781
4 40.4612 38.3311 139.2412
16 41.1159 38.2342 212.9472
64 40.9177 38.0162 530.5290
256 40.5074 38.2463 1896.6185
1024 39.6739 38.0162 >7200
Table 4.2.4:
C UR (%) WR (%) Training Time (sec)
0.0039 35.1718 36.2965 194.2642
0.0078 36.6687 36.9868 185.4706
0.0156 37.9080 37.6892 184.7538
0.0312 38.6105 37.8467 179.2178
0.0625 39.2341 38.1252 169.9313
0.125 39.6744 37.7377 148.4252
0.25 40.2018 37.5802 155.9607
0.5 41.0704 38.0041 179.4198
1 40.8539 38.0162 211.7685
2 41.2416 38.5370 216.3879
4 40.4612 38.3311 228.1803
6 40.522 38.3553 249.0802
8 40.8762 38.3069 252.9136
12 41.0623 38.2827 236.2706
16 41.1159 38.2342 266.8908
20 40.7435 38.1252 278.7337
24 41.2673 38.3674 285.9171
28 41.0141 38.1737 292.3508
32 41.0003 38.0041 299.1315
Table 4.2.5
Database: Aibo Emotion Corpus Feature set: i384
Normalization to [0, 1]
C-SVM with linear kernel
C UR (%) WR (%) Training Time (sec)
0.0039 39.8802 37.5076 243.5482
0.0078 40.4281 38.9851 246.4208
0.0156 40.9513 39.8571 296.5431
0.0312 41.0799 39.7239 311.3814
0.0625 40.8276 39.2879 300.0468
0.125 39.6519 38.6218 -
0.25 39.0249 38.6702 -
1 38.3846 39.0335 377.3136
2 37.9609 38.9851 -
3 37.6471 38.9367 -
4 37.7094 39.1183 456.2891
6 37.1659 39.3848 -
8 37.1238 39.4938 -
16 37.2128 39.5785 741.8974
64 36.9992 39.4695 1704.751
256 36.5811 39.6512 -
1024 - - >3600×5
Table 4.2.6:
Database: Berlin Emotional Speech Database Feature set: r180
Normalization to [0, 1]
C-SVM with linear kernel
C UR (%) WR (%)
0.0625 62.9155 65.0467
0.125 64.4259 66.1682
0.25 64.3638 65.7944
0.5 67.0633 68.0374
1 66.1917 67.2897
2 67.7281 68.7850
4 64.775 66.3551
8 64.5947 66.3551
16 63.6399 65.7944
Table 4.2.7
Database: Berlin Emotional Speech Database Feature set: i384
Normalization to [0, 1]
C-SVM with linear kernel
C UR (%) WR (%) GR(%)
0.0039 28.793 30.2804 0
0.0078 46.5434 47.8505 0
0.0156 53.9342 54.7664 38.5244
0.0312 64.393 64.8598 63.0003
0.0625 67.833 68.972 67.1822
0.125 69.2811 70.4673 68.7423
0.25 67.7995 69.1589 67.2293
0.5 66.4909 68.2243 65.6157
1 65.3944 67.1028 64.5396
2 65.5069 67.2897 64.627
4 65.5069 67.2897 64.627
8 65.5069 67.2897 64.627
256 65.5069 67.2897 64.627
Figure 4.2.1 Grid search for parameter selection in Berlin Database.
Figure 4.2.2: Grid search for parameter selection in Berlin Database (ν-SVM).
(a) shows the result of the coarse-scale search. (c) is a 3D version of (b). (d) is the result of the fine-scale search.
4.2.2 Cross-validation Issues in Berlin Database
In the Berlin Database, there are only 535 wav files belonging to 7 classes. The number of features is 180 (RS) or 384 (INTERSPEECH). As a rule of thumb, the number of training samples should be 5 or 10 times the number of features. If this is not the case, cross-validation (CV) is suggested for a better estimate of performance.
There are several schemes of cross-validation. Stratified 10-fold cross-validation is widely adopted to obtain a better (unbiased and minimal-variance) estimator. Nevertheless, in our experiments we intended to apply the same criteria to both the Aibo Corpus and the Berlin Database. Since a speaker independent training/testing configuration was set for the Aibo Corpus, we decided to apply 10-fold cross-validation to the Berlin Database according to its 10 speakers.
One merit of the speaker independent setting is that the performance estimate is fixed. In random 10-fold cross-validation, the estimator is a random variable, which makes comparison between feature sets harder. The downside, of course, is that the performance is lower than in a stratified cross-validation. The experimental results in Table 4.2.8 show a decrease of about 7% when adopting the speaker independent setting.
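A speaker independent split of this kind is deterministic because the folds are defined by the speaker labels alone. The following sketch shows one way to build such folds; the function name `speaker_folds` and the speaker identifiers in the usage line are hypothetical, and this is not the code used in the experiments.

```python
from collections import defaultdict


def speaker_folds(speaker_ids):
    """One fold per speaker: test on that speaker, train on all the others.

    speaker_ids : list with the speaker label of each sample, in sample order.
    Returns a list of (speaker, train_indices, test_indices) tuples.
    """
    by_speaker = defaultdict(list)
    for idx, spk in enumerate(speaker_ids):
        by_speaker[spk].append(idx)

    folds = []
    for spk in sorted(by_speaker):      # fixed order, so the result is reproducible
        test_idx = by_speaker[spk]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(speaker_ids)) if i not in held_out]
        folds.append((spk, train_idx, test_idx))
    return folds


# Toy usage with three hypothetical speakers.
folds = speaker_folds(["s1", "s2", "s1", "s3"])
```

With 10 speakers this yields exactly 10 folds, and no speaker ever appears in both the training and the testing part of a fold.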
Table 4.2.8: Comparison between speaker dependent and independent settings
The speaker dependent setting was carried out by random 10-fold CV; the results are shown in the upper chart. The speaker independent setting was carried out by 10-fold CV according to the 10 speakers; the results are shown in the lower chart. Both tests were conducted under the clean condition. Note that Anger is confused more often with Happy, and Neutral becomes less distinctive, under the speaker independent setting.
Speaker dependent setting (random 10-fold CV)
              H       A       D       F       N       B       S   Recall
H            33      23       2      10       3       0       0    46.5%
A             9     115       0       2       1       0       0    90.6%
D             1       2      25       8       5       3       2    54.3%
F            18       4       2      41       4       0       0    59.4%
N             3       0       4       1      65       5       1    82.3%
B             1       0       3       0       7      67       3    82.7%
S             0       0       0       0       1       1      60    96.8%
Precision 50.8%   79.9%   69.4%   66.1%   75.6%   88.2%   90.9%
UR 73.2%   WR 75.9%   GR 70.8%

Speaker independent setting (10-fold CV according to 10 speakers)
              H       A       D       F       N       B       S   Recall
H            30      27       3      11       0       0       0    42.3%
A            20     103       1       3       0       0       0    81.1%
D             3       3      20      11       4       3       2    43.5%
F            21       5       3      37       3       0       0    53.6%
N             3       0       4       6      55       9       2    69.6%
B             1       0       3       0       8      64       5    79.0%
S             0       0       1       0       2       0      59    95.2%
Precision 38.5%   74.6%   57.1%   54.4%   76.4%   84.2%   86.8%
UR 66.3%   WR 68.8%   GR 63.5%
4.3 Experiments on Robustness
4.3.1 Slack Mismatched Condition
Some abbreviations are listed in this passage. The proposed rate-scale features which contain spectro-temporal modulation information are called the RS features.
The RS features comprise two subsets: the RSmu set consists of temporal mean of RS and the RSsd set consists of temporal standard deviation (SD) of RS. The 384 INTERSPEECH features are denoted as i384, and the totality of RSmu and RSsd is denoted as r180.
The following experimental results are split into two parts based on the slack and strict mismatched conditions. All other settings are held the same, except for one thing: under the slack condition, the hybrid feature set is i384 and r180, but under the strict condition it is i384 and RSsd. The reason is that under the slack condition even RSmu is robust, so it can be added to the hybrid set.
The results from the slack mismatched condition are shown in Fig. 6 and Fig. 7 and Table 3 and Table 4. In both the BES and AEC databases, the unweighted recall rate of RSsd features stays almost the same except for the 0 dB condition. On the other hand, the performance of i384 shows a clear decreasing trend. Although i384 fares about 8% better than r180 in the clean condition, the advantage of RS features grows as the SNR decreases. Similar trends also appear under the matched condition.
One point deserves attention: a single rise in UR does not mean that performance is improved under that SNR condition. It is the trend, rather than a single point on the curve, that matters.
Figure 4.3.1: Curves for UR v. SNR in BES database under slack mismatched condition.
*i384: 384 baseline features used in INTERSPEECH 2009 Emotion Challenge.
*r180: proposed spectro-temporal modulation (rate-scale) features.
*r90: half of the proposed features with only temporal standard deviation (RSsd).
*IR: i384 combined with r180 features (564 features in total). w: white noise. b: babble noise.
*Training/Testing set: 10-fold cross-validation with speaker independence
∞dB 20dB 15dB 10dB 5dB 0dB
i384w 68.8119 60.3312 59.5984 61.0768 54.2032 47.9846
i384b 68.8119 57.6923 57.413 53.4125 50.7897 43.966
r180w 66.3769 66.1757 64.996 64.9484 63.1001 60.5604
r180b 66.3769 65.2422 65.6143 65.9263 61.8951 50.8454
IR_w 73.1573 66.0913 62.8653 64.9214 61.2421 55.389
IR_b 73.1573 66.3826 67.2531 62.6277 59.3636 55.4546
r90w 52.5284 52.9064 51.9664 52.2298 51.6153 44.9501
r90b 52.5284 52.4778 52.9681 54.1444 49.8457 38.9375
[Plot of the values above; vertical axis: unweighted recall rate, under slack mismatched condition.]
Table 4.3.1: The confusion matrices of r180 under ∞, 0dB white noise and 0dB babble noise condition.
∞ dB
H A D F N B S
H 29 24 5 13 0 0 0
A 25 95 3 3 1 0 0
D 2 3 27 9 1 2 2
F 20 3 7 37 2 0 0
N 2 0 8 6 52 9 2
B 1 0 4 0 8 60 8
S 0 0 1 0 1 0 60
UR 66.38%
WR 67.29%
GR 64.29%
0 dB (white noise) 0 dB (babble noise)
H A D F N B S H A D F N B S
H 26 29 5 10 1 0 0 24 20 5 16 5 1 0
A 20 103 2 1 0 0 1 34 80 5 8 0 0 0
D 2 3 26 9 2 3 1 6 2 23 6 6 1 2
F 20 4 10 31 2 0 2 18 4 12 24 6 0 5
N 2 0 17 6 45 6 3 2 0 15 6 41 6 9
B 1 0 7 1 9 57 6 0 0 4 0 20 43 14
S 0 0 7 0 7 0 48 0 0 4 2 6 7 43
UR 60.56% 50.85%
WR 62.80% 51.96%
GR 58.48% 49.30%
*The columns are classification results and the rows contain true labels.
Figure 4.3.2: The performance of four feature sets in two types of noise under slack mismatched condition.
*w: white noise. b: babble noise. r90: RSsd.
Table 4.3.2: Confusion matrices of classification result using r90 feature set.
∞dB 0dB (white noise) 0dB (babble noise)
A E N P R A E N P R A E N P R
A 284 141 83 29 74 164 151 126 72 98 172 125 148 64 102
E 183 712 409 66 138 236 564 409 115 184 194 481 536 96 201
N 546 1118 2721 338 654 611 993 2475 590 708 709 957 2528 515 668
P 10 20 87 54 44 21 33 82 51 28 23 35 87 40 30
R 91 84 184 72 115 72 84 194 89 107 77 102 191 82 94
UR 38.10% 30.72% 28.58%
WR 47.06% 40.70% 40.15%
GR 35.79% 29.27% 26.68%
∞dB 15dB 10dB 5dB 0dB
i384w 34.0805 31.7828 31.3663 29.6525 27.3104
i384b 34.0805 30.9148 29.7634 28.0848 27.1918
r180w 38.0958 36.8806 35.5228 33.025 30.7178
r180b 38.0958 36.3379 35.4951 32.7817 28.5766
IR_w 39.5223 36.034 35.2724 33.89 31.2807
IR_b 39.5223 35.8016 34.6279 33.1507 30.4382
r90w 38.9796 38.5473 37.9525 35.7288 32.4861
r90b 38.9796 38.5652 38.0537 36.744 34.4989
[Plot of the values above; vertical axis: unweighted recall; title: slack mismatched condition in AEC database.]
4.3.2 Strict Mismatched Condition
Under the strict mismatched condition, the performance of both r180 and i384 features decreases rapidly. In contrast, RSsd maintains fair performance when the noise is not too severe.
Figure 4.3.3: Curves for UR v. SNR in BES database under strict mismatched condition.
*Training/Testing set: 10-fold cross-validation with speaker independence
*IR: hybrid feature set of i384 and r90.
∞dB 20dB 15dB 10dB 5dB 0dB
i384w 65.4998 44.3692 40.7651 35.9994 30.0137 27.0519
i384b 65.4998 49.7023 45.2093 38.4405 30.3997 22.2993
r180w 66.5532 62.2958 52.967 39.377 29.652 22.5866
r180b 66.5532 64.8847 60.6835 50.6261 35.3618 25.8061
IR_w 68.6049 38.3585 34.3785 30.7386 26.8331 24.1201
IR_b 68.6049 46.1104 38.7074 28.3047 21.1159 17.0742
r90w 52.5284 52.9064 51.9664 52.2298 51.6153 44.9501
r90b 52.5284 52.4778 52.9681 54.1444 49.8457 38.9375
[Plot of the values above; vertical axis: unweighted recall rate, under strict mismatched condition.]
Table 4.3.3: Confusion matrices using the r90 (RSsd) feature set under ∞ dB, 0 dB white noise and 0 dB babble noise conditions in the Berlin Database.
∞ dB
H A D F N B S
H 19 25 6 19 2 0 0
A 33 86 3 4 1 0 0
D 7 3 20 7 7 1 1
F 22 4 9 22 10 0 2
N 2 0 7 9 43 8 10
B 0 0 2 1 13 60 5
S 0 0 1 2 9 7 43
UR 52.53%
WR 54.77%
GR 49.25%
0 dB (white noise) 0 dB (babble noise)
H A D F N B S H A D F N B S
H 11 11 8 31 10 0 0 12 19 1 21 14 1 3
A 33 46 8 34 5 1 0 41 68 1 11 4 2 0
D 4 0 13 12 10 1 6 5 2 1 7 20 1 10
F 3 1 8 39 12 0 6 7 2 0 16 20 2 22
N 0 0 4 5 42 0 28 0 0 0 2 31 3 43
B 0 0 2 1 23 32 23 0 0 0 0 9 33 39
S 0 0 0 1 7 1 53 0 0 0 0 2 0 60
UR 44.95% 38.94%
WR 44.11% 41.31%
GR 39.90% 25.52%
Figure 4.3.4: The performance of four feature sets in two types of noise under strict mismatched condition. The database is the Aibo Emotion Corpus.
Table 4.3.4: Confusion matrices of classification result using r90 feature set.
∞dB 0dB, white noise 0dB, babble noise
A E N P R A E N P R A E N P R
A 364 91 55 30 71 76 37 55 398 45 277 51 92 132 59
E 400 660 198 95 155 84 221 235 855 113 358 321 408 306 115
N 1356 985 1656 595 785 260 295 998 3399 425 1107 350 2028 1418 474
P 28 17 35 77 58 4 3 17 172 19 23 5 47 110 30
R 168 66 107 97 108 29 16 70 377 54 117 27 141 193 68
UR 37.95% 27.11% 33.59%
WR 34.70% 18.42% 33.96%
GR 35.56% 19.30% 29.72%
* The columns are classification results and the rows contain true labels.
∞dB 15dB 10dB 5dB 0dB
i384w 40.8836 28.484 27.7508 29.9478 23.403
i384b 40.8836 31.1979 28.5196 25.6792 23.2721
r180w 38.4873 30.9367 22.3641 22.2385 19.9756
r180b 38.4873 36.2747 29.9413 24.1416 21.3265
IR_w 39.2886 27.7961 27.7749 26.6509 21.1784
IR_b 39.2886 30.6028 27.06 25.5483 24.0099
r90w 37.9466 37.4443 37.0016 34.1584 27.1089
r90b 37.9466 37.6617 37.1286 36.2775 33.591
[Plot of the values above; vertical axis: UR (unweighted recall rate), under strict mismatched condition.]
4.3.3 Discussion on Emotion
According to the dimensional emotion theory, emotions at opposing extremes of arousal are easier to discriminate. This phenomenon is observable in Table 5 (H versus Others for valence and HADF v. NBS for arousal). The presence of noise inflicts more damage on the recognition of closer emotions (emotions in the same emotion family) than on that of more unrelated emotions. Again in Table 5, for example, the classifier is more likely to confuse Neutral with Boredom and less likely to confuse Sadness with Anger (cf. Schuller et al., 2006, for similar tests). The similarity between emotions described above also corresponds to the results of data visualization in Figure 4.3.7.
It is interesting to look at how noise influences RSsd and, in turn, the classification results. As Table 6 and Table 5 show, additive babble noise caused classification to skew toward the neutral emotion (N) in both databases. This result is not surprising, because babble noise contains numerous intelligible pieces of utterance. The superposition of those pieces produces a speech-like sound that shares traits with emotionally neutral speech. Therefore, speech samples with a high density of babble noise tend to be classified as Neutral.
White noise, on the other hand, skews classification toward the emotion category with more drastic changes of pitch-related attributes. In BES this is Fear, and in AEC it is Positive (including Joyful and Motherese). One might wonder why the result in BES was not skewed toward Happy. This is because in BES the acoustics of Fear samples are much more pronounced than those of Happy samples.
4.3.4 Discussion on Robustness
In this section, we discuss how noise affects RS features. Under the slack mismatched condition, the effect of noise is partly removed by the normalization and therefore the degradation is greatly reduced. Under the strict mismatched condition, however, as the noise pattern becomes more prominent, the structure of the temporal mean of the RS features (later denoted as RSmu) is gradually destroyed, causing rapid degradation of the UR. This indicates that when no knowledge is available to adjust the testing samples, i.e. when the slack mismatched condition cannot be applied, even RSmu is not robust.
On the other hand, the temporal standard deviation of the RS features (later denoted as RSsd), which has limited recognition ability, is fairly robust even under the strict mismatched condition. The two sets of RS features differ in robustness for the following reason. The RS features are derived from the spectrum. When additive noise comes in, the energy is elevated, resulting in an elevated RSmu. The elevation is not removed under the strict mismatched condition, so degradation in performance is inevitable. However, an addition in the spectrum has only a minor effect on the variance, which is why RSsd is robust.
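The argument can be checked on a toy example. The sketch below idealizes the noise as a constant offset added to every frame of one RS feature, which real additive noise only approximates; the function name `shift_effect` and the numbers in the usage line are hypothetical. Under this idealization the temporal mean moves by exactly the offset while the temporal standard deviation does not change.

```python
from statistics import mean, pstdev


def shift_effect(frames, offset):
    """Change in temporal mean and temporal SD when a constant offset is added.

    frames : values of one feature over time (one value per frame)
    offset : constant elevation, an idealized stand-in for stationary noise
    """
    noisy = [x + offset for x in frames]
    delta_mean = mean(noisy) - mean(frames)    # equals the offset
    delta_sd = pstdev(noisy) - pstdev(frames)  # zero up to rounding error
    return delta_mean, delta_sd


# Toy usage: four frames of a hypothetical feature, elevated by 0.3.
delta_mean, delta_sd = shift_effect([0.2, 0.5, 0.9, 0.4], 0.3)
```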
Noise of the same type usually has a similar RS pattern. Figure 4.3.5 shows typical patterns of AWGN and babble noise. Babble noise has a stronger response in the low rate region in both the positive and negative rate half-planes, while AWGN mostly affects the higher rate regions. From this point of view, the effect of noise on speech is merely a translation in the feature space. (Of course, additive noise does not result in a pure translation.) This is why the classifier (trained with RSmu and RSsd) appears to give all testing samples the same label at very low SNR under the strict mismatched condition.
Figure 4.3.5: Babble noise and white noise (panels (a) and (b)).
Only RSmu is shown here. The x-axis represents “rate” and the y-axis represents “scale”.
Hybrid Features
In Fig. 4.3.1 and Fig. 4.3.2, under the slack mismatched condition, combining i384 with r180 features is beneficial to robustness in both the AEC and BES databases.
Unfortunately, the hybrid feature set did not work well under the strict condition. The discrepancy is natural: under the slack condition the distribution of the testing samples can be regulated, whereas under the strict condition the normalization results in a biased distribution of the testing data. For example, if the testing samples are just a translation of the original training samples, the translation can be compensated under the slack condition but is left uncorrected under the strict condition. In short, under the matched or slack mismatched condition, a hybrid set of i384 and RS (either RSmu or RSsd) features helps the whole set gain robustness; under the strict condition, it is more appropriate to classify with robust feature sets only.
An alternative argument is that the normalization (under strict mismatched condition) worsened the robustness of i384 features. This argument is half right and it