
Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Yun-Nung Chen, Che-An Lu, Chao-Yu Huang

Abstract—Naïve Bayes spam email filters are a well-known and powerful family of filters. We construct filters using three types of classifiers: Naïve Bayes, SVM, and KNN. We compare the pros and cons of the three approaches and apply several improvements to each to obtain a better spam filter.

Index Terms—Spam filter, Naïve Bayes, SVM, KNN

——————————  ——————————

1 INTRODUCTION

Mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat not only to the Internet but also to society.

Over the past few years, many approaches to filtering spam have been proposed, and many of them are based on the Naïve Bayes method.

2 PROPOSED APPROACHES

2.1 Problem Definition

Construct a filter that can prevent spam from getting into the mailbox.

Input: a mail message.
Output: “Spam” or “Ham”.

2.2 Proposed Solution

Figure.1 System Flowchart

We use machine learning methods to train a model that decides whether an input message is ham or spam. We implement the spam filter with three methods: Naïve Bayes, SVM, and KNN.

We first compare these three methods. After that we modify each method independently to improve its accuracy and compare the versions of each method. Finally we draw a conclusion from all the improvements to the three methods.

The details of our methods follow.

Naïve Bayes

From the Bayesian theorem of total probability, given the vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ of a mail $d$, the probability that $d$ belongs to category $c$ is:

$$P(C = c \mid X = \mathbf{x}) = \frac{P(c)\,P(\mathbf{x} \mid c)}{\sum_{k} P(c_k)\,P(\mathbf{x} \mid c_k)} \quad (1)$$

where $k \in \{\text{spam}, \text{ham}\}$.

We assign the mail to the model with the higher probability. A unigram language model is used to compute the class probability.

In order to get better features (words carrying more information), we preprocess the messages to remove noise words and keep those that are more important. We remove words whose length exceeds 50 characters, and we also delete words that appear fewer than 20 times.
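The preprocessing above can be sketched as follows. This is a minimal illustration: the 50-character and 20-occurrence thresholds come from the text, while the function itself and the whitespace tokenization are our assumptions.

```python
from collections import Counter

def preprocess(messages, max_len=50, min_count=20):
    """Drop words longer than max_len characters and words that
    appear fewer than min_count times in the whole corpus."""
    counts = Counter(w for msg in messages for w in msg.split())
    keep = {w for w, c in counts.items()
            if len(w) <= max_len and c >= min_count}
    return [[w for w in msg.split() if w in keep] for msg in messages]
```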

A word that appears in many documents carries less information for classification.

We use the SRI Language Model Toolkit [1] to generate the unigram language model and compute the Bayesian probability according to the model. We can then decide the classification of a test message.
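The unigram decision rule can be sketched as below. We hand-roll counts with add-one smoothing purely for illustration; the actual system uses SRILM, so the function names, the smoothing choice, and the equal priors here are all assumptions.

```python
import math
from collections import Counter

def train_unigram(messages):
    """Word counts and total word count for one class's unigram model."""
    counts = Counter(w for m in messages for w in m.split())
    return counts, sum(counts.values())

def log_prob(message, model, vocab_size):
    counts, total = model
    # add-one smoothing keeps unseen words from zeroing the probability
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in message.split())

def classify(message, spam_model, ham_model, p_spam=0.5):
    """Pick the class with the higher posterior, as in Eq. (1)."""
    vocab_size = len(set(spam_model[0]) | set(ham_model[0]))
    s = math.log(p_spam) + log_prob(message, spam_model, vocab_size)
    h = math.log(1.0 - p_spam) + log_prob(message, ham_model, vocab_size)
    return "Spam" if s > h else "Ham"
```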

SVM

The support vector machine (SVM) is easy to use and powerful for data classification. When generating the filter model, we create a vector for each item in the training corpus, and SVM maps these vectors into a high-dimensional space. SVM then learns which kinds of vectors are close to which class. SVM is a good approach here because it precisely finds the best separating hyperplane, maximizing the margin, so that we can classify a new mail as spam or ham.

© 2009 AI TERMPROJECT

————————————————

• Yun-Nung Chen is with National Taiwan University.

E-mail: swallow29271223@gmail.com.

• Che-An Lu is with National Taiwan University.

E-mail: JAST.Lu@gmail.com.

• Chau-Yu Huang is with National Taiwan University.

E-mail: hamigwa@gmail.com


Figure.2 Version1

First, we select the top 1000 terms by document frequency, as we do in KNN. Second, we create a “TF-IDF” vector for each training datum. Third, we use libsvm (a tool from Chih-Jen Lin [2]) to train a model. For each unclassified mail we also create a “TF-IDF” vector, and finally predict its class with the SVM model we trained before. The formula of “TF-IDF” is described below:

$$\mathrm{tfidf}(t, d) = \left(0.5 + 0.5\,\frac{f_{t,d}}{\max_{t'} f_{t',d}}\right) \times \left(\log\frac{N}{df_t} + 1\right) \quad (2)$$

where $N$ is the number of documents in the corpus, $f_{t,d}$ is the frequency of term $t$ in document $d$, and $df_t$ is the data (document) frequency of term $t$.
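Equation (2) can be computed as in the following sketch. The helper is hypothetical; its names follow the symbols in the formula, and documents are assumed to be pre-tokenized term lists.

```python
import math

def tf_idf(term, doc_terms, corpus_docs):
    """Augmented term frequency times smoothed IDF, as in Eq. (2)."""
    tf = doc_terms.count(term)
    max_tf = max(doc_terms.count(t) for t in set(doc_terms))
    df = sum(1 for d in corpus_docs if term in d)  # data (document) frequency
    n = len(corpus_docs)
    return (0.5 + 0.5 * tf / max_tf) * (math.log(n / df) + 1)
```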

If a term appears in many documents, it does not carry much significance, so we use IDF to diminish the weight of such terms in order to select better features. We use data frequency for feature selection because of the following figure (a reference from the course “Web Mining”).

Figure.3 Comparison of feature selection methods

We can see that the performance of all the feature selection methods except mutual information and term strength differs little. For coding convenience, we choose to implement data frequency.

Version2

It is similar to version 1; we only turn uppercase into lowercase (case-insensitive), so that, for example, “free” is treated as the same term as “FREE” or “Free”.

Version3

It is similar to version 2; the only difference is that when training the SVM model, we set the parameters “cost” and “gamma” to their best values.
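The search for the best cost and gamma amounts to a grid search over exponentially spaced candidates. A minimal sketch follows; the `evaluate` callback (standing in for libsvm cross-validation accuracy) and the exact grid ranges are our assumptions.

```python
import itertools

def grid_search(evaluate, costs, gammas):
    """Return the (cost, gamma) pair with the best evaluation score,
    as libsvm's grid-search script does with cross-validation accuracy."""
    return max(itertools.product(costs, gammas),
               key=lambda cg: evaluate(*cg))

# exponentially spaced grids, as commonly used with libsvm
costs = [2.0 ** k for k in range(-5, 16, 2)]
gammas = [2.0 ** k for k in range(-15, 4, 2)]
```

The exhaustive search is what makes Version 3 time-consuming: every grid point requires a full cross-validation run.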

KNN

Version1

First, we create a “binary vector” for each training datum, recording whether each feature (term) exists (1 or 0). Second, for each unclassified mail we also create a binary vector, then use cosine similarity to find the top K closest training data in the corpus and determine which class the unclassified mail belongs to.

cosine similarity:

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}|\,|\mathbf{b}|} \quad (3)$$
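Version 1 can be sketched as follows. Binary vectors are represented as term sets, so the dot product in Eq. (3) reduces to the intersection size; the function names and the choice of k are our assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two binary vectors given as term sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def knn_classify(mail_terms, training, k=3):
    """training: list of (term_set, label). Majority vote over the top-k
    most similar training messages."""
    top = sorted(training, key=lambda tl: cosine(mail_terms, tl[0]),
                 reverse=True)[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0]
```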

Version2

It is similar to version 1; the one big difference is that in version 2 we use a “TF-IDF” vector instead of the “binary vector”.

Version3

It is similar to the previous version; we only turn uppercase into lowercase (case-insensitive), so that, for example, “free” is seen as the same term as “FREE” or “Free”.

3 CONTRIBUTIONS

3.1 Compare each method independently

We compare the three methods independently, and we also observe the differences across training-set sizes of 1200, 3000, 9000, and 21000.

Naïve Bayes

When we train the language model, the word probabilities are computed case-insensitively, and we also remove words with very small probability during message preprocessing.

When the training set is small (1200), the result still shows good performance: the accuracy is 93.6206%.

But when the training set becomes a little larger (3000), the result is not as good as with the smaller one. As the training set grows further, the result improves only slightly: relative to the growth in size, the accuracy gain is small, about 3 percentage points (93.6206% → 96.2669%).

We believe feature selection is important for Naïve Bayes, and better features could improve the result.

But we must spend much time on testing to see whether the result actually improves; time and accuracy are a trade-off.

SVM

When the training set is small (1200), the SVM model performs much worse than the others.

When we create the tf-idf vectors case-insensitively, the accuracy improves by nearly 20% in relative terms (60.8540% → 72.6285%), which means it is important to merge the uppercase and lowercase forms of a term (e.g. free, Free, FREE) into one concept. Another reason is that if we treat “free” and “Free” as the same term, the data frequency of free increases, so we do not throw away such an important feature (since we only select the top 1000 terms ranked by data frequency).


After we find the best gamma and cost values for each corpus, the performance again improves by about 20% in relative terms (72.6285% → 88.2511%). But the process of finding these parameters is very time-consuming, so it is a trade-off.

When using a large training set, the performance does not improve accordingly. We think this is because as the training set grows, the noise also increases, meaning there are more ham messages similar to spam.

KNN

The performance of KNN really exceeded our expectations. At first we thought KNN would not be better than SVM, but after our experiments, KNN seems to work well on spam classification.

Another interesting observation is that the “binary vector” and the “TF-IDF” vector perform almost the same (though TF-IDF is still slightly better than the binary vector). The improvement from the “TF-IDF” vector is not as significant as we expected. We think this is because in spam classification some important features of spam are quite different from ham, so the “weight” (TF-IDF) of such a feature matters less than whether the feature “exists” at all.

We also found that case sensitivity makes little difference in KNN, unlike in SVM. We think the main reason is that we did no feature selection in KNN; instead, we kept all the features. So we do not throw out important features (e.g. Money) the way SVM's feature selection does, and this may also be why KNN beats SVM.

3.2 Compare all these three methods

We think KNN is better than Naïve Bayes because the features we select when implementing Naïve Bayes are not good enough to train an excellent model.

We think KNN beats SVM because we throw away some information when doing feature selection in SVM. Another reason might be that spam classification is a binary decision problem (spam / ham), so KNN can easily come down on one side; if there were more classes, we think SVM would outperform KNN.

When the training corpus is small, KNN still performs well, unlike SVM with only 60% accuracy. We think this is because SVM is a machine learning method and cannot be expected to learn well from little training data, whereas KNN can still find the top K similar data in a small corpus.

We also think Naïve Bayes is a good method for filtering spam when the training set is small: because the vocabulary appearing in ham is not too large, we can compute word probabilities to decide the category using a small training set.

4 EXPERIMENTAL RESULTS

4.1 Corpus

We use the corpus provided by TREC 2006 (trec06). There are 37822 messages (12910 ham, 24912 spam). We separate the whole set into training data and testing data.

The testing data are 2910 ham and 4912 spam messages randomly selected from the corpus. The remaining data (10000 ham, 20000 spam) are used as training data. The two sets (testing and training) are disjoint.

In our experiments, we create four different sizes of training data, randomly selected from the training corpus. The spam-to-ham ratio in all four training sets is 2:1, and their sizes are 1200, 3000, 9000, and 21000. (Below we use 800:400, 2000:1000, 6000:3000, and 14000:7000 to denote the four training sets.)
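The sampling above can be sketched with the hypothetical helper below; the 2:1 spam:ham ratio is from the text, while the function name and fixed seed are our assumptions.

```python
import random

def sample_training_set(spam_pool, ham_pool, size, seed=0):
    """Draw a training set of `size` messages at a 2:1 spam:ham ratio,
    sampling without replacement from each pool."""
    rng = random.Random(seed)
    return (rng.sample(spam_pool, 2 * size // 3),
            rng.sample(ham_pool, size // 3))
```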

4.2 Result of Evaluation

Following are three different methods’ accuracy table with different sizes of training data.

Naïve Bayes

Size         Accuracy
800:400      93.6206 %
2000:1000    90.8719 %
6000:3000    95.2570 %
14000:7000   96.2669 %

Table. 1

SVM

Size         Version1    Version2    Version3
800:400      60.8540 %   72.6285 %   88.2511 %
2000:1000    87.4585 %   90.5316 %   93.8794 %
6000:3000    88.8520 %   93.5983 %   96.4477 %
14000:7000   88.2255 %   91.0299 %   91.0299 %

Table. 2

KNN

Size         Version1    Version2    Version3
800:400      94.0710 %   95.6172 %   95.6044 %
2000:1000    95.9877 %   97.2144 %   97.0611 %
6000:3000    97.5466 %   98.0833 %   97.7383 %
14000:7000   97.1122 %   97.8789 %   97.5978 %

Table. 3

Fig.4 is the accuracy plot of Naïve Bayes, the first version of SVM, and the first version of KNN with different sizes of training data.

Figure.4 Accuracy of Naïve Bayes, ver1 SVM, and ver1 KNN

We use Naïve Bayes as the baseline. In the first version of SVM, all the accuracies are lower than Naïve Bayes, while the first version of KNN already performs better than Naïve Bayes.

Fig.5 is the accuracy plot after we improve the SVM accuracy.

Figure.5 Accuracy of SVM

We can see a big improvement between version 1 and version 2; the difference between them is that version 2 is case-insensitive. Version 2 emphasizes important features such as “Free”, and we no longer throw away too much information in feature selection, as mentioned before. The improvement between version 2 and version 3 is also significant: by selecting the SVM parameters gamma and cost well, the performance improves considerably.

Fig.6 is the accuracy plot after we improve the KNN accuracy.

Figure.6 Accuracy of KNN

We can see a small improvement from version 1 to version 2; the difference between them is that version 2 uses a “TF-IDF” vector instead of a “binary” vector.

When we modified version 2 from case-sensitive to case-insensitive, the difference is not as significant as in SVM (no more than 0.4%). As mentioned before, this is because we did no feature selection in KNN.

Fig.7 is the accuracy plot of Naïve Bayes, and the best version of SVM and KNN with different size of training data.

Figure.7 Comparison of 3 methods

As we can see, after we improve SVM, the two middle-sized training sets perform better than Naïve Bayes, and the other two come much closer to Naïve Bayes than the first version did. After we improve KNN, the result is again better than Naïve Bayes.

5 CONCLUSION

After experimenting with the three methods, we found that KNN has higher accuracy than the other two approaches.

Because we think KNN is more suitable than SVM for classification with few categories, the accuracy of KNN is higher.

We think Naïve Bayes is a good method for spam filtering, and its training costs little time (about 1-2 seconds). Testing an input message with Naïve Bayes takes much time, but the result is good enough.

The training of KNN is very fast, but it takes a lot of time on testing. We think this is because we did not implement any indexing structure such as a KD-tree, R-tree, or quadtree for finding the nearest top K neighbors. In future work, we could implement these indexing methods to improve the efficiency of KNN.

The training time of SVM is much longer than that of Naïve Bayes and KNN, especially when we want to find the best gamma and cost parameters for the training process. But the testing time of SVM is much faster than the other two methods.

In future work, we could focus on different feature selection methods to improve the performance of Naïve Bayes and SVM; their results might then become better than KNN.

6 JOB RESPONSIBILITY

Yun-nung Chen (B94902032)

Naïve Bayes (Training and Testing), Report Writing.

Che-an Lu (B94902097)

SVM (Training and Testing), KNN (Training and Testing), Report Writing.

Chau-yu Huang (B94902052)

Message Preprocessing, Report Writing.

ACKNOWLEDGMENT

The report uses several toolkits, including MIME-tools, SRILM, and libsvm. We would like to thank their authors.

REFERENCES

[1] SRI Language Model Toolkit, http://www.speech.sri.com/projects/srilm/

[2] Chih-Jen Lin's home page (libsvm), http://www.csie.ntu.edu.tw/~cjlin/
