Machine Learning Fall 2009 Final Project Report


Final Project Report - LUNONONON

1. Introduction

We use three learning methods taught in class, plus KNN, to classify this data set. The data is multi-labeled and high-dimensional. According to the paper on the course website, each vector holds the normalized TF-IDF values of a document, and the vector for each sample is very sparse.

2. Formulation of Methods

Because multi-labeled data is not easily handled by the general learning algorithms we learned in class, we split the multi-labeled problem into single-labeled problems, one per label. The cost is that the computation grows with the total number of labels.

Since at least half of the data has no label, we think an additional filter can help classification in PLR and SVM. In these procedures we adopt different approaches and compare their evaluation results. We first decide whether a vector should be labeled at all, and only if it should do we classify its individual labels. In 2.2 and 2.3, we use several different methods to evaluate the performance. We expect fewer false positives in the second round.

2.1 K Nearest Neighbors

Given the character of the data, we expect heuristically that vectors of the same class lie close to each other in this high-dimensional space. We therefore choose KNN as our first method, to see how it performs and how efficient it is.

In this method we try two distance measures. One is Euclidean distance, which KNN generally uses, and the other is cosine distance, chosen because the value in each dimension is a TF-IDF weight. The two distances are defined as follows.

$$\text{Euclidean distance} = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} \tag{1}$$

$$\text{cosine distance} = 1 - \frac{A \cdot B}{|A|\,|B|} \tag{2}$$
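The two distance measures can be sketched in Python as below. This is only an illustration: it assumes dense lists of floats (the real vectors are sparse), and the handling of an all-zero vector is our own assumption, not something the report specifies.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Cosine distance ignores vector length, so two documents with the same word proportions but different sizes are treated as identical.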

We also try three different approaches to decide which of the labels returned by the K nearest neighbors to use.

(1) Set the threshold

For the K nearest neighbors we set a threshold T ≤ K. If a label appears at least T times among these K neighbors, then we include this label. The algorithm is as follows.

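for each label $\ell_j$, if $\sum_{i=1}^{K} C_i(\ell_j) \ge T$, include $\ell_j$, where

$$C_i(\ell_j) = \begin{cases} 1 & \text{if the } j\text{-th label appears in the } i\text{-th neighbor} \\ 0 & \text{otherwise} \end{cases}$$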

R98922004 Yun-Nung Chen R98922009 Che-An Lu


(2) Only choose one with most votes

For the K nearest neighbors, we include only the label that appears most often. The algorithm is as follows.

$$\text{label} = \arg\max_{\ell_j} \left( \sum_{i=1}^{K} C_i(\ell_j) \right)$$

(3) Set weights and threshold

The third approach is similar to the first one, except that we add a weight to each neighbor. The weight of a neighbor is higher if it is closer to the vector we are processing. We include the natural log term so that the weight decreases exponentially for farther points. After adding the weights, we think the vote values become more reliable. The algorithm is as follows.

for each label $\ell_j$, if $\sum_{i=1}^{K} C_i(\ell_j) \times W_i \ge T$, include $\ell_j$,

where $W_i = 1 + \ln K - i$
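The three labeling approaches above can be sketched in Python as follows. This is a minimal illustration, not our actual implementation: it assumes each neighbor is given as a set of integer label ids, nearest neighbor first, and the tie-breaking rule in `top_label` is our own assumption. The weights in the last call are arbitrary illustrative numbers, not the weights defined in this report.

```python
from collections import Counter

def count_votes(neighbor_labels, weights=None):
    """Sum the (optionally weighted) votes each label gets from the neighbors.

    neighbor_labels: list of label sets, nearest neighbor first.
    weights: one weight per neighbor; defaults to 1.0 each (plain counting).
    """
    if weights is None:
        weights = [1.0] * len(neighbor_labels)
    votes = Counter()
    for labels, weight in zip(neighbor_labels, weights):
        for label in labels:
            votes[label] += weight
    return votes

def threshold_labels(neighbor_labels, threshold, weights=None):
    """Approaches (1) and (3): keep every label whose vote total reaches T."""
    votes = count_votes(neighbor_labels, weights)
    return {label for label, total in votes.items() if total >= threshold}

def top_label(neighbor_labels):
    """Approach (2): keep only the label with the most votes, if any."""
    votes = count_votes(neighbor_labels)
    if not votes:
        return set()
    # Assumption: ties are broken in favor of the smallest label id.
    best = min(votes, key=lambda label: (-votes[label], label))
    return {best}

neighbors = [{3, 7}, {3}, {7, 9}, set(), {3}]  # K = 5, nearest first
print(threshold_labels(neighbors, threshold=2))
print(top_label(neighbors))
print(threshold_labels(neighbors, threshold=3.5,
                       weights=[2.0, 1.5, 1.0, 0.5, 0.5]))
```

Passing the weights in as a list keeps the voting code independent of the particular weighting formula.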

2.2 Perceptron Learning Rule

As mentioned before, we train a binary model for each label and include a label if the corresponding model predicts positive. A sample gets the null label if the predictions of all the models are negative. We call this the binary approach.

We also train a model that predicts whether a sample has any label at all, which we call the “null model”. We use PLR to train the null model. Using the null model to filter out some of the data reduces the computation, and we expect the performance to improve.
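The binary approach and the null-model filter can be sketched as a small prediction routine. This is only an illustration under our own assumptions: each trained model is represented by a hypothetical callable that returns True for a positive prediction, and the toy models below are made up for the example.

```python
def predict_labels(x, label_models, null_model=None):
    """Predict the label set of one sample.

    label_models: dict mapping a label id to a callable(x) -> bool.
    null_model: optional callable(x) -> bool that returns True when the
        sample is predicted to have at least one label.
    """
    if null_model is not None and not null_model(x):
        # Filtered out: the per-label models are never evaluated.
        return set()
    return {label for label, model in label_models.items() if model(x)}

# Hypothetical toy models, only for illustration.
label_models = {0: lambda x: x[0] > 0.5, 1: lambda x: x[1] > 0.5}
null_model = lambda x: sum(x) > 0.2

print(predict_labels([0.9, 0.1], label_models))
print(predict_labels([0.1, 0.05], label_models, null_model))
```

A sample rejected by the null model skips every per-label model, which is where the saving in computation comes from.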

We choose this method to see whether the data in this high-dimensional space is linearly separable.

Considering the time complexity, we choose PLR and see how the results turn out. We found that most of the labels are linearly separable (Ein = 0) and converge within 5000 iterations. Because some models are non-separable, we add the pocket algorithm to keep the hypothesis with the lowest Ein. There are two different procedures, as below.

(1) Binary approach

(2) Binary approach with null model
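PLR with the pocket algorithm, as described in this section, can be sketched as below. This is a minimal version under our own assumptions: labels are +1 or -1, a bias term is prepended to every sample, a score of exactly zero counts as negative, and a misclassified sample is picked at random for each update. None of these details is fixed by the report.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def predict(w, x):
    # Assumption: a score of exactly zero is classified as negative.
    return 1 if dot(w, x) > 0 else -1

def in_sample_error(w, X, y):
    """Fraction of training samples that w misclassifies (Ein)."""
    return sum(1 for x, t in zip(X, y) if predict(w, x) != t) / len(X)

def pocket_perceptron(X, y, max_iterations=5000, seed=0):
    """Return (weights, Ein) of the best hypothesis seen during training.

    The returned weights include the bias as their first entry.
    """
    rng = random.Random(seed)
    X = [[1.0] + list(x) for x in X]  # prepend the bias input
    w = [0.0] * len(X[0])
    best_w, best_error = list(w), in_sample_error(w, X, y)
    for _ in range(max_iterations):
        mistakes = [i for i in range(len(X)) if predict(w, X[i]) != y[i]]
        if not mistakes:
            break  # converged: the data is separated by w
        i = rng.choice(mistakes)
        w = [wi + y[i] * xi for wi, xi in zip(w, X[i])]  # PLR update
        error = in_sample_error(w, X, y)
        if error < best_error:  # pocket step: remember the best w so far
            best_w, best_error = list(w), error
    return best_w, best_error

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
weights, ein = pocket_perceptron(X, [-1, -1, -1, 1])
print(ein)
```

On separable data the loop stops as soon as no sample is misclassified. On non-separable data it runs for the full iteration budget and returns the pocketed hypothesis.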

2.3 Support Vector Machine

In the first step, we again use the binary approach mentioned above. We use liblinear and libsvm (tools from Prof. Chih-Jen Lin) to test the performance of these methods.

We first run liblinear (linear kernel) and libsvm (RBF kernel) with default values to see their performance.

The results suggest that the linear kernel is not only better than the non-linear kernel but also much faster. Interestingly, running the RBF kernel in the binary approach with default values gives the null hypothesis as the result. Considering both efficiency and performance, we decide to do further work (parameter selection) only with the linear kernel in the following methods.

We choose two control factors. One is the adaptation of parameters, and the other is the use of the null model. We try different parameters (s and c) for each model and keep the ones with the highest accuracy under 5-fold cross validation. Combining these two factors gives the four approaches as follows.


Some labels have only a few positive samples in the training data, and some have none, so the SVMs trained for these labels are unreliable. Hence we use a threshold T to filter out the predictions of models trained with too few positive samples, and choose the best T to get reliable predictions.

(1) Binary approach and default parameters

(2) Binary approach and adapted parameters

(3) Binary approach with null model and default parameters

(4) Binary approach with null model and adapted parameters
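Choosing parameters by 5-fold cross validation, as used for the adapted-parameter approaches, can be sketched as follows. This is a generic illustration, not our liblinear setup: `train_fn` is a hypothetical stand-in for a real trainer, the folds are contiguous (so the data is assumed to be shuffled beforehand), and the toy classifier and its `cut` parameter are made up for the example.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_accuracy(train_fn, params, X, y, k=5):
    """Accuracy over all held-out samples under k-fold cross validation."""
    correct = 0
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx],
                         [y[i] for i in train_idx], **params)
        correct += sum(1 for i in held_out if model(X[i]) == y[i])
    return correct / len(X)

def select_parameters(train_fn, grid, X, y, k=5):
    """Return the parameter dict from grid with the highest CV accuracy."""
    return max(grid, key=lambda params: cv_accuracy(train_fn, params, X, y, k))

def train_cut_classifier(X_train, y_train, cut):
    """Hypothetical trainer: ignores the data and thresholds x[0] at cut."""
    return lambda x: 1 if x[0] > cut else -1

X = [[float(i)] for i in range(10)]
y = [-1] * 5 + [1] * 5
grid = [{"cut": 1.5}, {"cut": 4.5}, {"cut": 7.5}]
print(select_parameters(train_cut_classifier, grid, X, y))
```

In the binary approach this selection would be repeated once per label, so each model can end up with its own parameters.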

2.4 Ensemble Learning

We use two simple approaches to implement ensemble learning and get higher performance. We select the four best results on the training data generated by the algorithms above: KNN approaches (1) and (3), PLR, and SVM.

Because the SVM result is the best, some methods keep the labels that SVM generates and add other labels to them. The five methods are as follows.

(1) Any label that appears in at least two of the four results

(2) SVM result plus any label that appears in at least two of the remaining results

(3) SVM result plus any label that appears in all remaining results

(4) SVM result plus any label that appears in both KNN results

(5) SVM result plus any label that appears in either KNN result
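The five ensemble rules can be sketched with simple set operations. This is an illustration only: each result is assumed to be a set of label ids for one sample, and the vote counts follow the column headers of Table.8 (2/4, 2/3, 3/3), which is our reading of the intended rules. The label sets below are made up.

```python
from collections import Counter

def vote(results, min_votes):
    """Labels that appear in at least min_votes of the given label sets."""
    counts = Counter(label for labels in results for label in labels)
    return {label for label, n in counts.items() if n >= min_votes}

def svm_plus(svm, others, min_votes):
    """Keep every SVM label and add labels backed by min_votes other results."""
    return set(svm) | vote(others, min_votes)

# Made-up predictions for one sample.
knn, weighted_knn, plr, svm = {1, 2}, {2, 3}, {3, 4}, {5}

method1 = vote([knn, weighted_knn, plr, svm], 2)     # 2/4
method2 = svm_plus(svm, [knn, weighted_knn, plr], 2)  # SVM + 2/3
method3 = svm_plus(svm, [knn, weighted_knn, plr], 3)  # SVM + 3/3
method4 = svm_plus(svm, [knn, weighted_knn], 2)       # SVM + (KN&KN)
method5 = svm_plus(svm, [knn, weighted_knn], 1)       # SVM + (KNorKN)
print(method1, method2, method3, method4, method5)
```

The stricter the vote requirement, the fewer extra labels are added on top of the SVM result.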

3. Experiments

3.1 K Nearest Neighbors

As mentioned above, we try two distance formulations combined with three approaches and use leave-one-out cross validation to measure their performance. We compute three values: the percentage of false positives, the percentage of false negatives, and the average Hamming distance.
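The error counts behind these measures can be sketched as below. We assume here that the average Hamming distance is the number of wrongly added plus wrongly missed labels per sample, without dividing by the number of labels; this reading is consistent with the counts in Table.7 but is not stated explicitly in the report. The label sets are made up.

```python
def label_errors(true_sets, predicted_sets):
    """Count false positive and false negative labels over all samples."""
    false_pos = sum(len(p - t) for t, p in zip(true_sets, predicted_sets))
    false_neg = sum(len(t - p) for t, p in zip(true_sets, predicted_sets))
    return false_pos, false_neg

def average_hamming_distance(true_sets, predicted_sets):
    """Mean number of label mismatches per sample."""
    false_pos, false_neg = label_errors(true_sets, predicted_sets)
    return (false_pos + false_neg) / len(true_sets)

true_sets = [{1, 2}, set(), {3}]
predicted_sets = [{1}, {4}, {3}]
print(label_errors(true_sets, predicted_sets))
print(average_hamming_distance(true_sets, predicted_sets))
```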

(1) Set the threshold

From Fig.1, we can see that there is a good threshold value T for each K. The optimal T values are usually smaller than half of K, and they all give similar average Hamming distances.

Comparing the two distance formulations, we find that cosine distance performs slightly better than Euclidean distance. This fits the character of the data, whose vectors consist of the normalized TF-IDF values of a document. A smaller cosine distance means a larger cosine similarity, which arises when two vectors have large feature values in the same dimensions. In that case the two documents share words with large TF-IDF values, so they are more likely to share some labels.

Based on this experiment we choose the values of K and the threshold. Our criterion is to choose a stable point in Fig.1, that is, one where the Hamming distance fluctuates little when the threshold changes slightly.

We choose three points for each distance and upload the predictions to the Competition Website; the Hamming distances are shown in Table.1. All six points perform about as well on the testing data as on the training data. As in Fig.1, the best points usually have a threshold of about half of K.


Fig.1 The with K and THRES in Method 1

(K, THRES)                 (3, 2)     (5, 3)     (8, 4)
Euclidean  Training (CV)   0.598750   0.578150   0.580150
Euclidean  Testing         0.599600   0.582578   0.581950
Cosine     Training (CV)   0.595200   0.571500   0.573500
Cosine     Testing         0.597450   0.576100   0.574050

Table.1 Average Hamming Distance in Method 1

(2) Only choose one with most votes

In Fig.2 we compare the results of the two distance formulations. Although both reach a similar average Hamming distance when K is large enough, cosine distance performs better than Euclidean distance when K is smaller than 5.

Both have similar false negatives, but when K is smaller than 10, cosine distance clearly gives fewer false positives, which suggests that the 5 or so nearest neighbors found by cosine distance carry more reliable labels.

In Table.2, we can see that the performance trend on the training data is similar to the one on the testing data.

In this method, the larger K is, the better the performance becomes. We get the lowest Hamming distance when K is 15 with cosine distance.

Fig.2 Three curves for Euclidean Distance and Cosine Distance in Method 2

K                          5          10         15
Euclidean  Testing         0.744750   0.706550   0.695800
Cosine     Training (CV)   0.702450   0.686450   0.684700
Cosine     Testing         0.716100   0.694025   0.689150

Table.2 Average Hamming Distance in Method 2


(3) Set weights and threshold

Fig.3 is similar to Fig.1, and again cosine distance gives the better evaluation. With this approach the Hamming distance is clearly lower than in Fig.1, which indicates that the weights help us get a more reliable value for deciding whether a label is important.

Fig.3 The weighted KNN with K and THRES in Method 3

(K, THRES)                 (4, 4)     (6, 6)     (8, 8)
Euclidean  Testing         0.593325   0.576125   0.574325
Cosine     Training (CV)   0.558050   0.555500   0.556650
Cosine     Testing         0.591025   0.569125   0.567975

Table.3 Average Hamming Distance

3.2 Perceptron Learning Rule

In this procedure, only 10 models fail to converge within 5000 iterations.

We can see the process of lowering Ein in Fig.4. We use the pocket algorithm to hold the lowest Ein (Ein_opt) and plot both to show that PLR works. In Table.4, although some data is non-separable, we can lower Ein to only about 0.065, which means the noise is small enough that PLR still works for the data. In Table.5, apart from the non-separable data, PLR converges within about 200 iterations, so the algorithm is fast and suits the data. From Table.6, we see that the two results are similar. Adding the null model to PLR brings no visible improvement, but it does reduce the computation. We think the performance is good, and so we believe the data is almost linearly separable, which is why we get good results.

Fig.4 The Process of decreasing Ein and Ein_opt


Model   Null      Labeled      Non-separable   All
Ein     0.06425   0.00000155   0.00648         0.000183

Table.4 Average Ein Value

Model                Null   Labeled    Separable   All
Iterations (times)   5000   326.7232   204.8116    339.8873

Table.5 Average Training Iterations

                               Baseline   Null Filter
Hamming Distance for Testing   0.563800   0.569500

Table.6 Average Hamming Distance for two approaches

3.3 Support Vector Machine

In Table.7, we find that the null model improves the performance. Adapting the parameters does not decrease the Hamming distance on the training data, but it does give better performance on the testing data. After analyzing the results, we think the reason could be that the difference between these performances is small and the parameters chosen by cross validation are not very reliable, so we are likely to get worse performance on the training data. Setting the threshold T helps us get better performance.

Approach           None         Parameter    Null Filter   Null Filter + Para
False Positive     98/11833     208/7095     243/12418     527/9878
False Negative     2177/13912   5025/13912   1737/13912    4561/13912
HamDist Training   0.113750     0.261650     0.09900       0.25440
HamDist Testing    0.538600     0.529500     0.53075       0.52645

Table.7 Comparison for Four Approaches for T = 2
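As a quick consistency check on Table.7, the training Hamming distance in each column appears to equal the number of false positives plus false negatives divided by the number of training samples. The sample count N below is inferred from the table entries and is not stated elsewhere in this report, so it should be read as an assumption.

$$\frac{98 + 2177}{0.113750} = \frac{208 + 5025}{0.261650} = \frac{243 + 1737}{0.09900} = \frac{527 + 4561}{0.25440} = 20000 = N$$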

3.4 Ensemble Learning

By ensembling the four best results, we can easily get better performance; the results are shown in Table.8. We think this method works well because we use the best results and strictly limit the extra labels, so that the labels we add are reliable and the Hamming distance is effectively lowered. The four results we use are KNN(cos_8_4), weighted KNN(cos_8_4), PLR(baseline), and SVM(null+para).

Approach       2/4        SVM + 2/3   SVM + 3/3   SVM + (KN&KN)   SVM + (KNorKN)
Hamming Dist   0.528125   0.521175    0.519425    0.523900        0.538425

Table.8 Comparison for Five Ensemble Techniques

4. Conclusion

Analyzing the character of the data and choosing proper methods to classify it are both important. Among the three main methods above, SVM is easy to use, but adapting its parameters is important for a better result. We also think that PLR suits the data, because these experiments show that it finds a hypothesis and classifies the data efficiently. Our main suggestion for separating the multi-labeled data is to use two easy methods, PLR and SVM, and ensemble their results. For still better performance, add the KNN results. Using these three methods is not too difficult, but it takes a lot of time to get good performance.

5. Work Distribution

Yun-Nung Chen – PLR coding, SVM, experiments recording, and report writing

Che-An Lu – data preprocessing, KNN coding, SVM, experiments recording, and report writing