
Chapter 4 Experiment Results and Discussions

4.3 Experiments on Robustness

4.3.4 Discussion on Robustness

In this section, we discuss how noise affects the RS features. Under the slack mismatched condition, the effect of noise is partly removed by normalization, so the degradation is greatly reduced. Under the strict mismatched condition, however, the structure of the temporal mean of the RS features (hereafter RSmu) is gradually destroyed as the noise pattern becomes more prominent, causing rapid degradation of the UR. This indicates that when no knowledge is available for adjusting the testing samples, i.e., when the slack mismatched condition cannot be applied, even the RSmu features are not robust.

On the other hand, the temporal standard deviation of the RS features (hereafter RSsd), which has limited recognition ability, is fairly robust even under the strict mismatched condition. The two sets of RS features differ in robustness for the following reason. The RS features are derived from the spectrum. When additive noise is present, the energy is elevated, which in turn elevates RSmu. This elevation is not removed under the strict mismatched condition, so degradation in performance is inevitable. An addition in the spectrum, however, has only a minor effect on the variance, which is why RSsd is robust.
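The argument above can be checked on a toy trajectory. The sketch below is only an illustration under a simplifying assumption that is not stated in the thesis: the noise contributes the same energy offset to every frame (a stationary noise floor). The values in `clean` and `noise_floor` are made up. Under that assumption the temporal mean shifts by the full offset while the temporal standard deviation is unchanged; real noise fluctuates over time, so in practice RSsd would change somewhat.

```python
import statistics

def temporal_stats(frames):
    """Return (temporal mean, temporal standard deviation) of one feature."""
    return statistics.fmean(frames), statistics.pstdev(frames)

# Hypothetical energy trajectory of a single RS feature over 8 frames.
clean = [1.0, 3.0, 2.0, 5.0, 4.0, 2.0, 1.0, 3.0]

# Assumption: stationary additive noise adds the same offset to every frame.
noise_floor = 10.0
noisy = [x + noise_floor for x in clean]

clean_mu, clean_sd = temporal_stats(clean)
noisy_mu, noisy_sd = temporal_stats(noisy)

mean_shift = noisy_mu - clean_mu      # how far the RSmu-like statistic moved
sd_change = abs(noisy_sd - clean_sd)  # how far the RSsd-like statistic moved

print(f"mean shift: {mean_shift:.3f}, sd change: {sd_change:.3f}")
```

The mean-based statistic carries the whole offset into the classifier, whereas the deviation-based statistic does not, which matches the qualitative behavior described for RSmu and RSsd.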

Noise of the same type usually has a similar RS pattern. Figure 4.3.5 shows typical patterns of babble noise and AWGN. Babble noise has a stronger response in the low-rate region in both the positive and negative rate half-planes, while AWGN has more effect on the higher-rate regions. From this point of view, the effect of noise on speech is approximately a translation in the feature space (although additive noise does not, of course, result in a pure translation). This is why the classifier (trained on RSmu and RSsd) appears to give all testing samples the same label at very low SNR under the strict mismatched condition.
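The translation view can be illustrated with a toy classifier. The sketch below is not the SVM used in the experiments; a nearest-centroid rule is swapped in to keep the example short, and the centroids, class names, samples, and `shift` are all hypothetical. It shows how a large common translation of the testing samples can push every sample to the same side of the decision boundary.

```python
import math

def nearest_centroid(sample, centroids):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

# Hypothetical class centroids learned from clean training data.
centroids = {"class_a": (0.0, 0.0), "class_b": (4.0, 4.0)}

# Hypothetical clean testing samples, two from each class.
test_samples = [(0.5, -0.2), (3.8, 4.1), (0.1, 0.3), (4.2, 3.7)]
clean_labels = [nearest_centroid(s, centroids) for s in test_samples]

# Assumption: strong noise acts as one common translation of all samples.
shift = (20.0, 20.0)
shifted = [(x + shift[0], y + shift[1]) for x, y in test_samples]
noisy_labels = [nearest_centroid(s, centroids) for s in shifted]

print(clean_labels)
print(noisy_labels)
```

With the clean samples the two classes are separated; after the translation every sample lies closest to the same centroid, so the labels collapse to a single class.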

Figure 4.3.5: RS patterns of (a) babble noise and (b) white noise. Only RSmu is shown here. The x-axis represents “rate” and the y-axis represents “scale”.

Hybrid Features

As shown in Fig. 4.3.1 and Fig. 4.3.2, under the slack mismatched condition, combining the i384 and r180 features improves robustness on both the AEC and BES databases.

Unfortunately, the hybrid feature set did not work well under the strict condition. The discrepancy is natural: under the slack condition the distribution of the testing samples can be regulated, whereas under the strict condition the normalization results in a biased distribution of the testing data. For example, if the testing samples are merely a translation of the original training samples, the translation can be compensated under the slack condition but is left uncorrected under the strict condition. In short, under the matched or slack mismatched condition, a hybrid set of i384 and RS (either RSmu or RSsd) features makes the overall system more robust; under the strict condition, it is more appropriate to use robust feature sets for classification.
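The translation example can be made concrete with a small normalization sketch. It rests on an assumption about the two conditions, whose exact definitions are not restated in this section: here "strict" is taken to mean that testing data are z-normalized with training-set statistics only, and "slack" that statistics of the testing data themselves are available. The data and `offset` are hypothetical.

```python
import statistics

def z_normalize(values, mu, sd):
    """Standardize values with the given mean and standard deviation."""
    return [(v - mu) / sd for v in values]

# Hypothetical one-dimensional training feature values.
train = [1.0, 2.0, 3.0, 4.0, 5.0]

# Assumption: testing samples are the training samples translated by noise.
offset = 6.0
test = [v + offset for v in train]

train_mu, train_sd = statistics.fmean(train), statistics.pstdev(train)
test_mu, test_sd = statistics.fmean(test), statistics.pstdev(test)

# "Strict" (assumed): only training statistics are used, the offset remains.
strict = z_normalize(test, train_mu, train_sd)
# "Slack" (assumed): testing statistics are used, the offset is compensated.
slack = z_normalize(test, test_mu, test_sd)

strict_bias = statistics.fmean(strict)
slack_bias = statistics.fmean(slack)
print(f"strict bias: {strict_bias:.3f}, slack bias: {slack_bias:.3f}")
```

Under these assumptions the slack-normalized testing data are centered like the training data, while the strict-normalized data keep a bias of `offset / train_sd`.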

An alternative argument is that the normalization (under the strict mismatched condition) worsened the robustness of the i384 features. This argument is only half right, and it also implies that the i384 features themselves are not robust.

Fusion Schemes

Instead of feature fusion, another approach is to fuse the decisions of the individual learning machines, as was done in the Emotion Challenge. In our preliminary experiments, three classifiers (SVMs) were built using r180 features, i384 features, and r180+i384 features, respectively. Their outputs then went through a majority-vote mechanism to give a final decision. The following chart shows a slight performance boost (a better result than that of any single participant in the 2009 Emotion Challenge). Two fusion schemes, committee and expert decision, were adopted, and both resulted in a better UR. Nonetheless, a fusion scheme that is robust against noise remains an open problem because it requires several robust (and discriminative) sets of features.
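A minimal sketch of decision-level fusion by majority vote is given below. It is an illustration only: the per-utterance predictions are made up, and the tie-breaking rule (fall back to one designated classifier when all three disagree) is an assumption, since the text does not say how such cases were resolved or how the committee and expert schemes differ in detail.

```python
from collections import Counter

def majority_vote(predictions, fallback_index=0):
    """Return the label chosen by more than half of the classifiers.

    If no label has a strict majority, return the prediction of the
    classifier at fallback_index (an assumed tie-breaking rule).
    """
    label, count = Counter(predictions).most_common(1)[0]
    if count > len(predictions) / 2:
        return label
    return predictions[fallback_index]

# Hypothetical per-utterance outputs of the three classifiers.
r180_out = ["Anger", "Neutral", "Joy"]
i384_out = ["Joy", "Neutral", "Sadness"]
hybrid_out = ["Anger", "Boredom", "Fear"]

# Fuse utterance by utterance; ties fall back to the hybrid classifier.
fused = [
    majority_vote(list(votes), fallback_index=2)
    for votes in zip(r180_out, i384_out, hybrid_out)
]
print(fused)
```

With three voters, a strict majority means at least two classifiers agree; the third utterance in the example has no majority and therefore takes the fallback decision.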

Feature Dimension

A brief inspection of feature dimension reduction is presented in the following figures. The results are visualized in two-dimensional plots. The sample distributions after applying linear discriminant analysis (LDA, which is supervised) and principal component analysis (PCA, which is unsupervised) show similar trends.

The LDA results on the training set look like a belt of grouped clusters, arranged in the order Anger, Joy, Fear, Disgust, Neutral, Boredom, and Sadness. The order and the constellation also indicate the internal similarity between each pair of classes.

High-activation emotions (Anger, Joy, Fear, and Disgust) are located next to each other, and low-activation emotions (Neutral, Boredom, and Sadness) are likewise adjacent. Despite LDA’s nearly perfect constellations in the training phase, its generalization ability appears mediocre, or at least no better than that of PCA.

Comparing the right panels of Fig. 4.3.6 and Fig. 4.3.7 (a), we observe similar trends, for example that Anger and Joy overlap heavily.

In the unsupervised feature reduction scheme (PCA), samples from the same emotion class form no apparent constellation. However, high-activation and low-activation emotions are still distributed at opposing locations.
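For readers who want to reproduce a two-dimensional PCA projection of this kind, a dependency-free sketch follows. It is not the procedure used for the figures, which the text does not describe: power iteration with deflation is swapped in to avoid external libraries, and the tiny 3-dimensional `features` matrix is a hypothetical stand-in for the 180-dimensional r180 vectors. The sketch assumes the covariance matrix has at least two nonzero eigenvalues.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the per-dimension mean from every sample."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(matrix, iterations=500):
    """Dominant eigenvector of a symmetric matrix by power iteration."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        vec = [x / norm for x in nxt]
    return vec

def pca_2d(rows):
    """Project samples onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    first = top_eigenvector(cov)
    eigval = dot(first, [dot(row, first) for row in cov])
    # Deflation: remove the first component, then find the next one.
    deflated = [[cov[i][j] - eigval * first[i] * first[j] for j in range(d)]
                for i in range(d)]
    second = top_eigenvector(deflated)
    points = [(dot(r, first), dot(r, second)) for r in centered]
    return points, (first, second)

# Hypothetical feature matrix: 6 samples, 3 dimensions.
features = [
    [2.0, 0.0, 1.0], [4.0, 1.0, 1.2], [6.0, 1.0, 0.9],
    [8.0, 3.0, 1.1], [10.0, 3.0, 1.0], [12.0, 5.0, 0.8],
]
points, (pc1, pc2) = pca_2d(features)
for x, y in points:
    print(f"{x:8.3f} {y:8.3f}")
```

Because PCA ignores class labels, plotting `points` colored by emotion would show whatever class structure happens to align with the directions of largest variance, which is consistent with the absence of tight per-class clusters noted above.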

Figure 4.3.6: Reducing feature dimension (r180) to two using LDA. (a) Leaving speaker 10 out. (b) Leaving speaker 5 out.

Figure 4.3.7: Visualization after reducing feature dimension (r180) to two. (a) Reduction by principal component analysis. (b) Reduction by t-distributed Stochastic Neighbor Embedding.

*Green circles = Happy; Red squares = Anger; Magenta pentagons = Disgust; Yellow hexagons = Fear; Black crosses = Neutral; Cyan plus signs = Boredom; Blue asterisks = Sadness.

Table 4.3.5: Comparison between combining and the original models r180