
5.4 Spam Detection for First Posts

5.4.2 Bag-of-words

After performing Chinese word segmentation with Jieba18 on the HTML-stripped, cleansed content of each post, we count the occurrences of each word in the training set and construct a ‘vocabulary’ from these words.
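A minimal sketch of this step could look as follows; the tag-stripping regex and the function name are illustrative, since the exact cleansing pipeline is not detailed here:

```python
import re

import jieba  # https://github.com/fxsjy/jieba

def tokenize_post(raw_html):
    """Strip HTML tags, then segment the remaining Chinese text with Jieba."""
    text = re.sub(r'<[^>]+>', ' ', raw_html)           # crude tag stripping
    return [w for w in jieba.lcut(text) if w.strip()]  # drop whitespace tokens
```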

Next, rare words with fewer than 5 occurrences are removed from the vocabulary, since these would yield sparse bag-of-words features and might cause overfitting. On the other hand, words that appear in over 30% of the posts are also removed, as these are likely to be stop words or the like. After the vocabulary is set up, we represent each post as a vector of the occurrences of each vocabulary word, where the occurrences are normalized by the length of the post.
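The pruning and normalization described above could be sketched as follows (function and variable names are illustrative, not from the original implementation):

```python
from collections import Counter

import numpy as np

def build_vocabulary(tokenized_posts, min_count=5, max_doc_ratio=0.3):
    """Keep words with at least min_count total occurrences that appear
    in at most max_doc_ratio of the posts."""
    total, doc_freq = Counter(), Counter()
    for tokens in tokenized_posts:
        total.update(tokens)          # total occurrences per word
        doc_freq.update(set(tokens))  # number of posts containing each word
    n_posts = len(tokenized_posts)
    kept = [w for w in total
            if total[w] >= min_count and doc_freq[w] / n_posts <= max_doc_ratio]
    return {w: i for i, w in enumerate(kept)}

def bow_vector(tokens, vocab):
    """Occurrence vector over the vocabulary, normalized by post length."""
    vec = np.zeros(len(vocab))
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec / max(len(tokens), 1)
```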

In bag-of-words, each word in the vocabulary corresponds to a feature. Since a high number of features (words) could slow down the training process significantly and may cause overfitting, we apply randomized PCA (Halko et al., 2011) to the #posts × #words bag-of-words matrix to reduce the word dimension. The number of dimensions to reduce to with PCA is tuned by looking at the average F-measure from 5-fold cross-validation on the training set, as plotted in the following figure.

Figure 10: average F-measure as the number of PCA components changes

18 https://github.com/fxsjy/jieba

The absolute performance shown in the plot might look unusually high. However, this is partially because we downsampled the non-spam posts in the training set, so the validation sets in 5-fold CV all have much higher ratios of spam posts than the test set. What we really care about is the relative performance. As shown in the plot, reducing to 50 components may cause too much information loss and thus deteriorate the average F-measure. On the other hand, too many components may cause some degree of overfitting, which also worsens the performance. The average F-measure is the highest when the bag-of-words is reduced to 150 components, so we adopt this setting to train our model on the whole training set and see how it performs on the test set.
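Using scikit-learn, the tuning procedure could be sketched as follows; the candidate grid and the default SVM hyperparameters are assumptions for illustration (the classifier is the SVM with RBF kernel mentioned later in this section):

```python
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# X: the #posts x #words bag-of-words matrix, y: spam labels (hypothetical names)
def tune_n_components(X, y, candidates=(50, 100, 150, 200, 300)):
    scores = {}
    for n in candidates:
        # randomized PCA in the sense of Halko et al. (2011)
        X_red = PCA(n_components=n, svd_solver='randomized').fit_transform(X)
        # average F-measure over 5-fold cross validation
        scores[n] = cross_val_score(SVC(kernel='rbf'), X_red, y,
                                    cv=5, scoring='f1').mean()
    return max(scores, key=scores.get)  # 150 in our experiments

# retrain on the whole training set with the chosen setting
pca = PCA(n_components=150, svd_solver='randomized').fit(X)
model = SVC(kernel='rbf').fit(pca.transform(X), y)
```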

            precision   recall    F-measure
test set      62.89%    48.08%     54.50%
test set*     50.00%    51.43%     50.70%

Table 14: content bag-of-words features only (150 components)

The performance is actually decent, even though our observations on the subtlety of the spam posts in section 4.119 gave us a hunch that the contents of first posts might not provide strong clues about whether a post is spam, since the contents of spam posts are well-disguised.

Such a result makes us curious about what is happening under the hood. To dive deeper, we would like to obtain the importance of each feature in order to observe what types of words are the decisive factors in the model’s predictions. However, for a non-linear model like an SVM with an RBF kernel, there is no simple way to compute the importance of each feature. Nevertheless, by ‘falling back’ to a linear kernel, the model suffers around a 10% performance loss in F-measure on the test set, but we are able to see the relative importance of each word by looking at the coefficients after inverse-transforming them with PCA.
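Assuming the names from the earlier sketches, recovering the per-word weights could look like this:

```python
import numpy as np
from sklearn.svm import SVC

# X_red = pca.transform(X); spam is assumed to be the positive class
linear_model = SVC(kernel='linear').fit(X_red, y)

# coef_ lives in the 150-dimensional PCA space; multiplying by
# pca.components_ maps the hyperplane back to one weight per word
word_weights = linear_model.coef_.ravel() @ pca.components_

idx_to_word = {i: w for w, i in vocab.items()}
strongest_spam = [idx_to_word[i] for i in np.argsort(word_weights)[::-1][:30]]
strongest_ham  = [idx_to_word[i] for i in np.argsort(word_weights)[:30]]
```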

The following figure is a word cloud containing the words with the highest coefficients (weights), that is, the strongest spam indicators; the font size of each word is positively correlated with its weight.

19 Notice that most of the examples listed in section 4.1 are replies, though.

Figure 11: words with the highest weights

The next word cloud is for words with the lowest weights, that is, the strongest non-spam indicators.

Figure 12: words with the lowest weights
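Word clouds like Figures 11 and 12 can be rendered from the recovered weights, for example with the wordcloud package; the font path is an assumption, since a CJK-capable font is needed to display Chinese words:

```python
from wordcloud import WordCloud

# word_weights and vocab come from the previous sketch (hypothetical names);
# for Figure 12, negate word_weights to surface the non-spam indicators
spam_freqs = {w: word_weights[i] for w, i in vocab.items() if word_weights[i] > 0}

cloud = WordCloud(font_path='NotoSansCJK-Regular.ttc',  # assumed CJK font
                  background_color='white', width=800, height=400)
cloud.generate_from_frequencies(spam_freqs).to_file('highest_weights.png')
```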

We can already observe a distinctive difference between the two word clouds at first glance. The first is mainly about Samsung’s top products (galaxy, nexus, note, sii) and user experiences (體驗 ‘experience’, 看到 ‘see’, 覺得 ‘feel’), with a focus on the multimedia aspect (照片 ‘photo’, 拍照 ‘take photos’, 影片 ‘video’). On the other hand, the second word cloud is more about seeking help (問題 ‘problem’, 解決 ‘solve’, 無法 ‘unable to’), and involves more polite words (謝謝 ‘thanks’, 大大 and 小弟, two polite forms of address common on forums) and technicalities (rom, 設定 ‘settings’).

The previous bag-of-words features were based only on the contents of the posts, but there is also much information in the titles of the threads, so we create another 50 dimension-reduced bag-of-words features based on the titles and combine these with the content features to yield 200 features in total. We prefer not to mix the two together before dimensionality reduction, because titles and contents may have distinct groups of ‘spam keywords’.
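Keeping the two feature groups separate until after dimensionality reduction could be sketched as follows (matrix names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# content_bow, title_bow: raw bag-of-words matrices built from the post
# contents and the thread titles, respectively (hypothetical names)
content_feats = PCA(n_components=150, svd_solver='randomized').fit_transform(content_bow)
title_feats = PCA(n_components=50, svd_solver='randomized').fit_transform(title_bow)

# reduce each matrix separately so titles and contents keep their own
# groups of 'spam keywords', then concatenate into the final 200 features
X_combined = np.hstack([content_feats, title_feats])
```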

            precision   recall    F-measure
test set      59.12%    51.44%     55.01%
test set*     56.16%    58.57%     57.34%

Table 15: content and title bag-of-words features

With the addition of the title bag-of-words features, a further improvement in F-measure can be seen.

The dimension-reduced bag-of-words features turned out to be surprisingly helpful. The model is able to achieve over 55% in F-measure even though the ratio of spam is only around 3% on the test sets for first posts. Compared to the random baseline, it boosts the F-measure by as much as 45%, which implies that the contents of posts actually give some strong clues about whether a first post is spam. Although on the surface each spam post looks rather unsuspicious on its own, collectively spam posts put more emphasis on certain topics than non-spam posts do, and our model trained with bag-of-words features was able to exploit this distinction.
