
Figure 19: measuring usefulness of the features from profiles

Spammers appear to be the more productive and reputable posters, judging by the number of threads they have started and by their scores (described in table 6). This echoes the observation in section 4.3 that some spammers are reputable writers hired to make a handful of spam posts. Furthermore, most web forums have many ‘lurkers’24 who regularly log in and read posts but barely participate in discussions, which may be another factor contributing to the difference.

Figure 20: number of threads of spammers and non-spammers

24 ‘潛水者’ in Chinese

Figure 21: ‘score’ of spammers and non-spammers

Using these profile attributes as features, the F-measure increased compared with the random baseline. Still, profile information alone is far from sufficient for our model to distinguish spammers from non-spammers.
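As a concrete illustration, the profile attributes can be flattened into a fixed-length feature vector before being fed to a classifier. A minimal sketch, where the field names (`n_threads`, `score`) are hypothetical stand-ins for the attributes described in table 6:

```python
# Sketch: turning user profiles into feature vectors.
# Field names are illustrative, not the actual attribute names.

def extract_profile_features(profile):
    """Turn one user profile (a dict) into a fixed-length feature vector."""
    return [
        profile.get("n_threads", 0),  # number of threads the user started
        profile.get("score", 0),      # forum 'score' of the user
    ]

profiles = [
    {"user_id": "u1", "n_threads": 12, "score": 340},
    {"user_id": "u2", "n_threads": 0, "score": 5},
]
X = [extract_profile_features(p) for p in profiles]
```

The resulting matrix `X` can then be handed to any standard classifier.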

precision  recall   F-measure
2.75%      22.62%   4.91%

Table 30: profile attributes as features

5.6.3 Maximum Spamicity of the First Posts of the User

Similar to section 5.5.5, we leverage the model for spam detection in first posts, building a feature from the spamicity estimates it produces for first posts.

A new feature max_spamicity_fps25 is computed by taking the maximum of the spamicity estimates of all first posts submitted by the user.

Because the adopted definition of a spammer is ‘whoever makes one or more spam posts’, taking the maximum is a more sensible choice than taking the mean or the median.
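A minimal sketch of the computation, assuming a hypothetical mapping from first posts to the spamicity estimates of the section 5.5 model:

```python
# Sketch: computing max_spamicity_fps per user.
# `first_posts` and `spamicity` are illustrative data structures.

from collections import defaultdict

def max_spamicity_fps(first_posts, spamicity):
    """first_posts: list of (user_id, post_id); spamicity: post_id -> estimate."""
    per_user = defaultdict(list)
    for user_id, post_id in first_posts:
        per_user[user_id].append(spamicity[post_id])
    # Take the maximum: one spammy-looking first post is enough to flag the user.
    return {u: max(vals) for u, vals in per_user.items()}

feats = max_spamicity_fps(
    [("u1", "p1"), ("u1", "p2"), ("u2", "p3")],
    {"p1": 0.1, "p2": 0.9, "p3": 0.2},
)
```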

The fact that our best model for first-post spam detection happens to have higher precision than recall is a plus here. If the model misses one spam first post by a spammer (high precision, low recall), there is still a chance that some other spam first post by the same spammer is detected. Conversely, if the model misidentifies a non-spam first post by a normal user as spam (low precision, high recall) and gives it a high spamicity estimate, then the value of max_spamicity_fps will be high for that user, who is thus likely to be misclassified as a spammer.

25 ‘maximum spamicity of the first posts made by the user’

One thing we should be careful about is that the model we leverage to compute the spamicity feature must not be trained on any spam first post by any user in the test set, since that would almost guarantee a high spamicity estimate for that post, which in turn yields a high max_spamicity_fps for that user without the model really knowing anything. Fortunately, with the data splitting method described in section 5.2, the training set of posts contains no post by any user in the test set of users, let alone a spam first post.
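The leakage check can be sketched as a simple filter, assuming illustrative data structures (a list of post dicts with an `author` field):

```python
# Sketch: before training the first-post model used for max_spamicity_fps,
# drop every post authored by a test-set user, so no spam first post of a
# test user can leak into training. Data structures are illustrative.

def training_posts(posts, test_users):
    """Keep only posts whose author is NOT in the test set of users."""
    test_users = set(test_users)
    return [p for p in posts if p["author"] not in test_users]

posts = [
    {"post_id": "p1", "author": "u1"},
    {"post_id": "p2", "author": "u9"},  # u9 is a test-set user
]
kept = training_posts(posts, ["u9"])
```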

Adding the max_spamicity_fps feature improves performance significantly, raising the F-measure to around 50%. Leveraging the model for first posts really pays off here.

precision  recall   F-measure
61.02%     42.86%   50.35%

Table 31: profiles and max_spamicity_fps

For what it’s worth, 54 of the 84 spammers in the test set (64.3%) have submitted a spam post that is the first post in a thread. In principle, this is the maximum number of spammers that could be identified with this feature.

5.6.4 Burstiness of Registration of Throwaway Accounts

In section 4.3, we observed that most throwaway spammer accounts were registered in bursts. To make use of this observation, we devise the feature burstiness_throwaway_reg26, which counts the number of throwaway spammer accounts registered within 20 days of the registration of the account in question. When the value of this feature is high, it indicates that the account sits in one of the bursts depicted in figure 3, so it is likely to be a (throwaway) spammer account.
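A sketch of this feature, assuming registration dates are available and the set of known throwaway spammer accounts comes from the training data (all names here are illustrative):

```python
# Sketch: burstiness_throwaway_reg for one account. Counts known throwaway
# spammer accounts registered within 20 days of this account's registration.

from datetime import date, timedelta

def burstiness_throwaway_reg(reg_date, throwaway_reg_dates, window_days=20):
    """reg_date: this account's registration date;
    throwaway_reg_dates: dates of known throwaway spammer registrations."""
    window = timedelta(days=window_days)
    return sum(1 for d in throwaway_reg_dates if abs(d - reg_date) <= window)

throwaways = [date(2015, 3, 1), date(2015, 3, 5), date(2015, 6, 1)]
n = burstiness_throwaway_reg(date(2015, 3, 10), throwaways)
```

Here the first two dates fall within the 20-day window and the third does not.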

Adding burstiness_throwaway_reg on top of the existing profile and max_spamicity_fps features, the F-measure increased slightly.

26 ‘burstiness of throwaway accounts’ registrations’

precision  recall   F-measure
67.31%     41.67%   51.47%

Table 32: profiles, max_spamicity_fps, and burstiness_throwaway_reg

5.6.5 Frequently Appearing Groups of Posters

In section 4.7, we discussed collusion between spammers, who would make posts in the same thread. To detect and exploit this collusion, we apply frequent itemset mining, a widely used technique in the field of data mining. In the popular ‘shopping in a supermarket’ example, it finds sets of items that are frequently put into the same basket and bought together. A support threshold specifies how frequent is frequent enough for an itemset to be selected.

Applying the shopping analogy to our scenario, the user id of each post is an ‘item’, and every 30 posts in a thread form a ‘basket’, so each frequent itemset is a group of users that frequently ‘appear together’ in threads.

In our experiments, frequent itemset mining is conducted on all threads in both the training set and the test set, with the help of the Orange (Demšar et al., 2013) library, and does not involve the use of ground truth.
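For illustration, the basket construction and support counting can be hand-rolled for the 3-element case (the actual experiments use Orange’s frequent itemset mining; the chunking of each thread into consecutive 30-post baskets is an assumption about the exact windowing):

```python
# Sketch: build 'baskets' of poster ids per 30 consecutive posts, then keep
# 3-element groups with support >= 3. A simplified stand-in for the Orange
# frequent itemset mining used in the thesis.

from collections import Counter
from itertools import combinations

def frequent_triples(threads, window=30, min_support=3):
    """threads: list of poster-id lists, one per thread, in posting order."""
    support = Counter()
    for posters in threads:
        # every `window` consecutive posts form one basket
        for i in range(0, len(posters), window):
            basket = set(posters[i:i + window])
            for triple in combinations(sorted(basket), 3):
                support[triple] += 1
    return {t for t, c in support.items() if c >= min_support}

groups = frequent_triples(
    [["a", "b", "c", "d"], ["a", "b", "c"], ["b", "a", "c"]]
)
```

Here users a, b, and c appear together in three baskets, meeting the support threshold.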

Rather than being incorporated as a feature in our model, the mined frequent itemsets fuel a smoothing process on the prediction outputs of the base model27. In this smoothing process, for each 3-element frequent itemset28, we add up the spamicity29 of the users in it. If the sum exceeds a threshold, we predict all users in that 3-element itemset to be spammers. The pseudocode of the whole procedure is given in algorithm 2.

With the described smoothing process, the F-measure reaches 64.60%, which doesn’t look bad at all considering that spammers make up only 0.98% of the test set.

27 The current model for spammer detection we’re improving upon; its performance was shown in table 32.

28 If there are bigger frequent itemsets, we simply take all 3-element subsets of them.

29 The probabilistic prediction output by our base model if the user is in the test set, or simply 0 or 1 according to the ground truth if the user is in the training set.

Algorithm 2 The whole process combined with the smoothing step

function FinalModel(users_train, labels_train, users_test, posts)
    features_train ← ExtractFeatures(users_train)
    features_test ← ExtractFeatures(users_test)
    model ← MLProcedure(features_train, labels_train)
    preds, probs ← MakePredictions(model, features_test)
    freq_groups ← FindFrequentGroups(posts)    ▷ all posts
    preds ← SmoothPredsFreqGrps(preds, probs, freq_groups, users_train, labels_train)
    return preds

function FindFrequentGroups(posts)
    threads ← index the posts with their thread ids
    baskets ← empty list    ▷ initialize the list of ‘baskets’ or itemsets
    for t ← threads do
        for β ← each 30 consecutive posters’ ids in thread t do
            append β to baskets
    freq_groups ← FrequentItemsetMining(baskets)    ▷ ‘Orange’
    remove the groups with support < 3 from freq_groups
    return freq_groups

function SmoothPredsFreqGrps(preds, probs, freq_groups, users_train, labels_train)
    for freq_group ← freq_groups do
        for freq_triple ← each 3-item subset of freq_group do
            sum ← 0    ▷ initialize the ‘spamicity’ of the triple
            for uid ← freq_triple do
                if uid ∈ users_train then
                    sum ← sum + labels_train[uid]    ▷ 1 if spammer
                else
                    sum ← sum + probs[uid]    ▷ probability of spammer
            if sum > 1.2 then    ▷ ‘spamicity’ higher than a threshold
                for uid ← freq_triple do
                    preds[uid] ← 1    ▷ predict all 3 as spammers
    return preds
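The smoothing step can be sketched in runnable Python as follows, with illustrative dict-based data structures; the guard that only test-set predictions are flipped is an assumption, since preds in the pseudocode covers only test users:

```python
# Sketch of SmoothPredsFreqGrps: sum the 'spamicity' of each frequent
# 3-element group of users and, above a threshold, flag the whole triple.

from itertools import combinations

def smooth_preds(preds, probs, freq_groups, labels_train, threshold=1.2):
    """preds/probs: test-user id -> prediction/probability;
    labels_train: training-user id -> ground-truth label (0 or 1)."""
    for group in freq_groups:
        for triple in combinations(sorted(group), 3):
            # ground-truth label for training users, probability otherwise
            s = sum(labels_train.get(u, probs.get(u, 0.0)) for u in triple)
            if s > threshold:
                for u in triple:
                    if u in preds:  # only flip test-set predictions
                        preds[u] = 1
    return preds

preds = smooth_preds(
    preds={"x": 0, "y": 0},
    probs={"x": 0.5, "y": 0.4},
    freq_groups=[{"x", "y", "z"}],
    labels_train={"z": 1},  # z is a known spammer in the training set
)
```

In this toy run the triple’s spamicity is 1 + 0.5 + 0.4 = 1.9 > 1.2, so both test users are predicted as spammers.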

precision  recall   F-measure
67.53%     61.90%   64.60%

Table 33: profiles, max_spamicity_fps, burstiness_throwaway_reg, with the smoothing process

Again, the fact that our best model for first-post spam detection happens to have higher precision than recall helps: it leads the base model trained with the max_spamicity_fps feature to have higher precision than recall as well.

Since this smoothing process is intuitively geared toward improving recall, the relatively low recall of the base model leaves ample room for the process to boost performance.
