
Figure 19: measuring usefulness of the features from profiles

Spammers appear to be the more productive and reputable posters, judging by the number of threads they have started and by their scores (described in table 6). This echoes the observation in section 4.3 that some spammers are reputable writers hired to make a handful of spam posts. Furthermore, most web forums have many ‘lurkers’24 who regularly log in and read posts but barely participate in discussions, which may be another factor contributing to the difference.

Figure 20: number of threads of spammers and non-spammers

24 ‘潛水者’ in Chinese

Figure 21: ‘score’ of spammers and non-spammers

Using these profile attributes as features, the F-measure increased compared with the random baseline. Still, profile information alone is far from sufficient for our model to distinguish spammers from non-spammers.
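As a concrete illustration, the profile attributes can be flattened into a fixed-length feature vector before being fed to a classifier. A minimal sketch, where the field names (`n_threads`, `score`) are hypothetical stand-ins for the attributes described in table 6:

```python
# Sketch: turning user profiles into feature vectors.
# Field names are illustrative, not the actual attribute names.

def extract_profile_features(profile):
    """Turn one user profile (a dict) into a fixed-length feature vector."""
    return [
        profile.get("n_threads", 0),  # number of threads the user started
        profile.get("score", 0),      # forum 'score' of the user
    ]

profiles = [
    {"user_id": "u1", "n_threads": 12, "score": 340},
    {"user_id": "u2", "n_threads": 0, "score": 5},
]
X = [extract_profile_features(p) for p in profiles]
```

The resulting matrix `X` can then be handed to any standard classifier.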

precision  recall   F-measure
2.75%      22.62%   4.91%

Table 30: profile attributes as features

5.6.3 Maximum Spamicity of the First Posts of the User

Similar to section 5.5.5, we leverage the model for spam detection in first posts, building a feature from the spamicity estimates it produces for first posts.

A new feature max_spamicity_fps25 is computed by taking the maximum of the spamicity estimates of all first posts submitted by the user.

Because the adopted definition of a spammer is ‘whoever makes one or more spam posts’, taking the maximum is a more sensible choice than taking the mean or the median.
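A minimal sketch of the computation, assuming a hypothetical mapping from first posts to the spamicity estimates of the section 5.5 model:

```python
# Sketch: computing max_spamicity_fps per user.
# `first_posts` and `spamicity` are illustrative data structures.

from collections import defaultdict

def max_spamicity_fps(first_posts, spamicity):
    """first_posts: list of (user_id, post_id); spamicity: post_id -> estimate."""
    per_user = defaultdict(list)
    for user_id, post_id in first_posts:
        per_user[user_id].append(spamicity[post_id])
    # Take the maximum: one spammy-looking first post is enough to flag the user.
    return {u: max(vals) for u, vals in per_user.items()}

feats = max_spamicity_fps(
    [("u1", "p1"), ("u1", "p2"), ("u2", "p3")],
    {"p1": 0.1, "p2": 0.9, "p3": 0.2},
)
```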

The fact that our best model for first-post spam detection happens to have higher precision than recall is a plus here. If the model misses one spam first post by a spammer (high precision, low recall), there is still a chance that some other spam first post by the same spammer is detected. Conversely, if the model misidentifies a non-spam first post by a normal user as spam (low precision, high recall) and gives it a high spamicity estimate, then the value of max_spamicity_fps will be high for that user, who is thus likely to be misclassified as a spammer.

25 ‘maximum spamicity of the first posts made by the user’

One thing we should be careful about is that the model we leverage to compute the spamicity feature must not be trained on any spam first post by any user in the test set, since that would almost guarantee a high spamicity estimate for that post, which in turn yields a high max_spamicity_fps for that user without the model really knowing anything. Fortunately, with the data splitting method described in section 5.2, the training set of posts contains no post by any user in the test set of users, let alone a spam first post.
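The leakage check can be sketched as a simple filter, assuming illustrative data structures (a list of post dicts with an `author` field):

```python
# Sketch: before training the first-post model used for max_spamicity_fps,
# drop every post authored by a test-set user, so no spam first post of a
# test user can leak into training. Data structures are illustrative.

def training_posts(posts, test_users):
    """Keep only posts whose author is NOT in the test set of users."""
    test_users = set(test_users)
    return [p for p in posts if p["author"] not in test_users]

posts = [
    {"post_id": "p1", "author": "u1"},
    {"post_id": "p2", "author": "u9"},  # u9 is a test-set user
]
kept = training_posts(posts, ["u9"])
```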

Adding the max_spamicity_fps feature improves performance significantly, raising the F-measure to around 50%. Leveraging the model for first posts really pays off here.

precision  recall   F-measure
61.02%     42.86%   50.35%

Table 31: profiles and max_spamicity_fps

For what it’s worth, 54 of the 84 spammers in the test set (64.3%) have submitted a spam post that is the first post in a thread. In principle, this is the maximum number of spammers that could be identified with this feature.

5.6.4 Burstiness of Registration of Throwaway Accounts

In section 4.3, we observed that most throwaway spammer accounts were registered in bursts. To make use of this observation, we devise the feature burstiness_throwaway_reg26, which counts the number of throwaway spammer accounts registered within 20 days of the registration of the account in question. When the value of this feature is high, it indicates that the account sits in one of the bursts depicted in figure 3, so it is likely to be a (throwaway) spammer account.
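A sketch of this feature, assuming registration dates are available and the set of known throwaway spammer accounts comes from the training data (all names here are illustrative):

```python
# Sketch: burstiness_throwaway_reg for one account. Counts known throwaway
# spammer accounts registered within 20 days of this account's registration.

from datetime import date, timedelta

def burstiness_throwaway_reg(reg_date, throwaway_reg_dates, window_days=20):
    """reg_date: this account's registration date;
    throwaway_reg_dates: dates of known throwaway spammer registrations."""
    window = timedelta(days=window_days)
    return sum(1 for d in throwaway_reg_dates if abs(d - reg_date) <= window)

throwaways = [date(2015, 3, 1), date(2015, 3, 5), date(2015, 6, 1)]
n = burstiness_throwaway_reg(date(2015, 3, 10), throwaways)
```

Here the first two dates fall within the 20-day window and the third does not.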

Adding burstiness_throwaway_reg on top of the existing profile and max_spamicity_fps features, the F-measure increased slightly.

26 ‘burstiness of throwaway accounts’ registrations’

precision  recall   F-measure
67.31%     41.67%   51.47%

Table 32: profiles, max_spamicity_fps, and burstiness_throwaway_reg

5.6.5 Frequently Appearing Groups of Posters

In section 4.7, we discussed collusion between spammers, who would make posts in the same thread. To detect and exploit this collusion, we apply frequent itemset mining, a widely used technique in the field of data mining. In the popular ‘shopping in a supermarket’ example, it finds sets of items that are frequently put into the same basket and bought together. A support threshold specifies how frequent is frequent enough for an itemset to be selected.

Applying the shopping analogy to our scenario, the user id of each post is an ‘item’, and every 30 posts in a thread form a ‘basket’, so each frequent itemset is a group of users that frequently ‘appear together’ in threads.

In our experiments, frequent itemset mining is conducted on all threads in both the training set and the test set, with the help of the Orange (Demšar et al., 2013) library, and does not involve the use of ground truth.
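For illustration, the basket construction and support counting can be hand-rolled for the 3-element case (the actual experiments use Orange’s frequent itemset mining; the chunking of each thread into consecutive 30-post baskets is an assumption about the exact windowing):

```python
# Sketch: build 'baskets' of poster ids per 30 consecutive posts, then keep
# 3-element groups with support >= 3. A simplified stand-in for the Orange
# frequent itemset mining used in the thesis.

from collections import Counter
from itertools import combinations

def frequent_triples(threads, window=30, min_support=3):
    """threads: list of poster-id lists, one per thread, in posting order."""
    support = Counter()
    for posters in threads:
        # every `window` consecutive posts form one basket
        for i in range(0, len(posters), window):
            basket = set(posters[i:i + window])
            for triple in combinations(sorted(basket), 3):
                support[triple] += 1
    return {t for t, c in support.items() if c >= min_support}

groups = frequent_triples(
    [["a", "b", "c", "d"], ["a", "b", "c"], ["b", "a", "c"]]
)
```

Here users a, b, and c appear together in three baskets, meeting the support threshold.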

Rather than being incorporated as a feature in our model, the mined frequent itemsets fuel a smoothing process on the prediction outputs of the base model27. In this smoothing process, for each 3-element frequent itemset28, we add up the spamicity29 of the users in it. If the sum exceeds a threshold, we predict all users in that 3-element itemset to be spammers. The pseudocode of the whole procedure is given in algorithm 2.

With the described smoothing process, the F-measure reaches 64.60%, which doesn’t look bad at all considering that spammers make up only 0.98% of the test set.

27 The current model for spammer detection we’re improving upon; its performance was shown in table 32.

28 If there are bigger frequent itemsets, we simply take all 3-element subsets of them.

29 The probabilistic prediction output by our base model if the user is in the test set, or simply 0 or 1 according to the ground truth if the user is in the training set.

Algorithm 2 The whole process combined with the smoothing step

function FinalModel(users_train, labels_train, users_test, posts)
    features_train ← ExtractFeatures(users_train)
    features_test ← ExtractFeatures(users_test)
    model ← MLProcedure(features_train, labels_train)
    preds, probs ← MakePredictions(model, features_test)
    freq_groups ← FindFrequentGroups(posts)    ▷ all posts
    preds ← SmoothPredsFreqGrps(preds, probs, freq_groups, users_train, labels_train)
    return preds

function FindFrequentGroups(posts)
    threads ← index the posts with their thread ids
    baskets ← empty list    ▷ initialize the list of ‘baskets’ or itemsets
    for t ← threads do
        for β ← each 30 consecutive posters’ ids in thread t do
            append β to baskets
    freq_groups ← FrequentItemsetMining(baskets)    ▷ ‘Orange’
    remove the groups with support < 3 from freq_groups
    return freq_groups

function SmoothPredsFreqGrps(preds, probs, freq_groups, users_train, labels_train)
    for freq_group ← freq_groups do
        for freq_triple ← each 3-item subset of freq_group do
            sum ← 0    ▷ initialize the ‘spamicity’ of the triple
            for uid ← freq_triple do
                if uid ∈ users_train then
                    sum ← sum + labels_train[uid]    ▷ 1 if spammer
                else
                    sum ← sum + probs[uid]    ▷ probability of spammer
            if sum > 1.2 then    ▷ ‘spamicity’ higher than a threshold
                for uid ← freq_triple do
                    preds[uid] ← 1    ▷ predict all 3 as spammers
    return preds
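The smoothing step can be sketched in runnable Python as follows, with illustrative dict-based data structures; the guard that only test-set predictions are flipped is an assumption, since preds in the pseudocode covers only test users:

```python
# Sketch of SmoothPredsFreqGrps: sum the 'spamicity' of each frequent
# 3-element group of users and, above a threshold, flag the whole triple.

from itertools import combinations

def smooth_preds(preds, probs, freq_groups, labels_train, threshold=1.2):
    """preds/probs: test-user id -> prediction/probability;
    labels_train: training-user id -> ground-truth label (0 or 1)."""
    for group in freq_groups:
        for triple in combinations(sorted(group), 3):
            # ground-truth label for training users, probability otherwise
            s = sum(labels_train.get(u, probs.get(u, 0.0)) for u in triple)
            if s > threshold:
                for u in triple:
                    if u in preds:  # only flip test-set predictions
                        preds[u] = 1
    return preds

preds = smooth_preds(
    preds={"x": 0, "y": 0},
    probs={"x": 0.5, "y": 0.4},
    freq_groups=[{"x", "y", "z"}],
    labels_train={"z": 1},  # z is a known spammer in the training set
)
```

In this toy run the triple’s spamicity is 1 + 0.5 + 0.4 = 1.9 > 1.2, so both test users are predicted as spammers.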

precision  recall   F-measure
67.53%     61.90%   64.60%

Table 33: profiles, max_spamicity_fps, burstiness_throwaway_reg, with the smoothing process

Again, the fact that our best model for first-post spam detection happens to have higher precision than recall helps: it leads the base model trained with the max_spamicity_fps feature to have higher precision than recall as well.

Since this smoothing process is intuitively geared toward improving recall, the relatively low recall of the base model leaves ample room for the process to boost performance.
