Synthetic Minority Oversampling Technique with Validity

Chapter 3 Methods and Materials

3.4 Rebalancing Imbalanced Dataset

3.4.1 Synthetic Minority Oversampling Technique with Validity

A simple over- or under-sampling method is not good enough. Simple over-sampling increases both training and testing time when a (sparse) kernel-based classifier is used, and it can be replaced by a simple cost-sensitive method. Simple under-sampling, on the other hand, loses too much information. The trouble caused by removing samples is two-fold (as described in Section 3.1.2) because most of the majority samples are reliable and most of the minority samples are unreliable. This analysis leads to a conclusion: we can adopt either a cost-sensitive method or a finer synthetic sampling method.

The Synthetic Minority Over-sampling Technique (SMOTE) only partially meets our needs, because it still synthesizes too many samples, which is costly for sparse kernel machines. We also noticed that SMOTE synthesizes new samples uniformly, since it does not attempt to change the probability distribution.

Nonetheless, in emotion recognition we have a crucial clue, validity, which in other applications is not necessarily given or obtainable. Validity is a measure of how much credibility can be placed in a sample. If a sample is unreliable, it does not deserve more derivatives (synthetic samples). Taking validity into consideration, we modified the original SMOTE and named the result SMOTEV (SMOTE with validity).

Formulation

The original SMOTE selects a target sample $x_i$ and its $k$ nearest neighbors $x_j$ (kNN), and synthesizes $k$ samples, one on the line segment between each pair $(x_i, x_j)$, by

$$x_{\text{new}} = x_i + \alpha\,(x_j - x_i), \quad \text{where } \alpha \sim U(0, 1).$$
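The interpolation step can be illustrated with a minimal Python sketch. It follows the description given here (one synthetic sample per neighbor of a single target) rather than any particular library implementation; the function names `k_nearest` and `smote_for_target`, the Euclidean distance, and the toy data are assumptions made for illustration only.

```python
import math
import random

def k_nearest(target, samples, k):
    """Return the k samples closest to `target` (Euclidean), excluding the target itself."""
    others = [s for s in samples if s is not target]
    return sorted(others, key=lambda s: math.dist(target, s))[:k]

def smote_for_target(target, samples, k, rng):
    """Synthesize one sample per neighbor: x_new = x_i + alpha * (x_j - x_i)."""
    synthetic = []
    for neighbor in k_nearest(target, samples, k):
        alpha = rng.random()  # alpha ~ U(0, 1)
        synthetic.append([xi + alpha * (xj - xi) for xi, xj in zip(target, neighbor)])
    return synthetic

# Hypothetical toy minority class with two features per sample.
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_points = smote_for_target(minority[0], minority, k=2, rng=rng)
```

Each synthetic point lies on the segment between the target and one of its two nearest neighbors. Finding those neighbors for every target is what requires the pairwise distances.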

The original SMOTE has two shortcomings. First, it needs to compute the distance matrix in order to find the k nearest neighbors. Second, it may synthesize many unreliable samples. Adding unreliable samples prolongs training and testing time, and it might make learning more biased (by increasing the amount of falsification).

To avoid the distance matrix, SMOTEV takes a different approach. It first selects the samples with 80% or higher validity to form a reference set. Next, all samples except those with 20% or lower validity become candidate target samples. New samples are synthesized at a random position along the line between a reference sample and a target. Taking validity into consideration, we can formulate SMOTEV as

$$x_{\text{new}} = x_i + \beta\,(x_r - x_i) = (1 - \beta)\,x_i + \beta\,x_r$$

$$\beta \sim \mathrm{Beta}(1 + v_r,\ 1 + v_i)$$

$r$: the index of a randomly selected sample in the reference subset.

$x_r$: reference sample
$x_i$: target sample
$v_r$: validity of $x_r$
$v_i$: validity of $x_i$

The random variable $\beta$ has its mode at $v_r / (v_r + v_i) \in [0, 1]$. The skewed mode reflects the fact that more credibility is given to the reference sample. Note that SMOTEV, just like SMOTE, cannot be applied to nominal or ordinal scales (there is no technical obstacle, but the result might be meaningless); it applies only to interval or ratio scales.
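The SMOTEV procedure described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: the function name `smotev`, the parameters `n_new`, `ref_min`, and `cand_min`, the uniform random choice of the target, and the reading of the Beta parameters as one plus the validity of the reference and of the target are all assumptions.

```python
import random

def smotev(samples, validity, n_new, rng, ref_min=0.8, cand_min=0.2):
    """Synthesize n_new samples from one minority class, weighting by validity.

    samples  : list of feature vectors (lists of floats)
    validity : list of validity scores in [0, 1], one per sample
    """
    # Reference set: samples with validity of 80% or higher.
    reference = [i for i, v in enumerate(validity) if v >= ref_min]
    # Candidate targets: everything except samples with validity of 20% or lower.
    candidates = [i for i, v in enumerate(validity) if v > cand_min]
    if not reference or not candidates:
        return []

    synthetic = []
    for _ in range(n_new):
        i = rng.choice(candidates)  # target index (uniform choice is an assumption)
        r = rng.choice(reference)   # randomly selected reference index
        # beta ~ Beta(1 + v_r, 1 + v_i): pulled toward the more valid sample.
        beta = rng.betavariate(1.0 + validity[r], 1.0 + validity[i])
        x_i, x_r = samples[i], samples[r]
        # x_new = (1 - beta) * x_i + beta * x_r
        synthetic.append([(1.0 - beta) * a + beta * b for a, b in zip(x_i, x_r)])
    return synthetic

# Hypothetical toy data: the last sample is too unreliable to be a target.
rng = random.Random(42)
X = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [9.0, 9.0]]
v = [0.9, 0.5, 0.85, 0.1]
new_X = smotev(X, v, n_new=5, rng=rng)
```

No distance computation appears anywhere in the sketch, which is the point of the reference set. A sample with high validity can be drawn as both target and reference, in which case the synthetic point duplicates it; whether such pairs should be excluded is not specified here, so the sketch leaves them in.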

The final solution to the imbalanced datasets in our experiment was to combine SMOTEV with a different cost for each class. Setting different costs saves time and performs similarly to the sampling methods. A comparison of five different rebalancing schemes is shown in Table 3.4.1. SVMs are inherently biased toward the majority classes because they aim to minimize the total error; therefore, training without rebalancing is not feasible.

Aided by validity, SMOTEV conceptually synthesizes new samples that may be more reliable than those of the original SMOTE. The two synthetic methods have similar performance. However, all synthetic methods unavoidably increase training time. To reduce this cost, setting different costs for the classes seems to be the best way. Therefore, we synthesize only part of the data and set different costs afterwards.
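For reference, class-dependent costs are commonly written into the soft-margin SVM objective as shown below. This is the standard textbook form for a binary problem with labels $y_i \in \{-1, +1\}$, offered only as a sketch; the symbols $C_{y_i}$ (the cost assigned to the class of sample $i$), $\xi_i$ (slack variables), and $\phi$ (the kernel feature map) are assumptions introduced here, and the exact costs used in the experiments are not stated in this section.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w \rVert^{2} + \sum_{i=1}^{n} C_{y_i}\,\xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

Choosing a larger cost for a minority class penalizes its margin violations more heavily without adding any training samples, which is why this route does not lengthen training the way synthetic sampling does.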

Table 3.4.1

No rebalancing