
5.4 Spam Detection for First Posts

5.4.5 Sentiment Scores Toward the Brands

The main objective of a covert marketing campaign is to promote a certain brand, and sometimes to denounce its competitors' brands, in order to give it an unfair edge. Hence, we expect spam posts to show a positive attitude toward Samsung, and possibly a negative attitude toward its competitors.

We devise a simple method to capture the sentiment toward brands in posts.

Specifically, we sum the polarities of the sentiment words from the NTU Sentiment Dictionary (NTUSD) (Ku et al., 2006) and of the emoticons that appear near a mention of a brand or a product. For precision, the pseudocode that produces the sentiment scores is given in Algorithm 1.

The following table shows the number of spam posts in the training set, broken down by the polarity of our estimated sentiment scores toward the brands. The result is not what we hoped for: many posts have negative polarity toward Samsung, and many have positive polarity toward HTC. Even worse, the #positive/#negative ratio for Samsung is actually lower than that for HTC.

brand     positive  negative  neutral  no mention
Samsung        504       312      379         688
HTC            110        62      111        1600

Table 19: Number of spam posts with different polarities
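The ratio comparison can be checked directly from the counts in Table 19. The following is a small sketch; the dictionary layout and the function name `pos_neg_ratio` are illustrative choices, not part of the original method.

```python
# Counts copied from Table 19 (spam posts in the training set).
counts = {
    "Samsung": {"positive": 504, "negative": 312, "neutral": 379, "no_mention": 688},
    "HTC": {"positive": 110, "negative": 62, "neutral": 111, "no_mention": 1600},
}


def pos_neg_ratio(row):
    """Return the #positive/#negative ratio for one brand."""
    return row["positive"] / row["negative"]


ratios = {brand: pos_neg_ratio(row) for brand, row in counts.items()}

for brand, ratio in ratios.items():
    print(f"{brand}: {ratio:.3f}")
# Samsung: 1.615
# HTC: 1.774
```

Both rows sum to the same 1883 posts, so the two ratios describe the same set of spam posts.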

Adding the sentiment scores toward Samsung and HTC as features did not improve the classifier; instead, the F-measure dropped slightly on both the test set and test set* (Table 20).

Algorithm 1 Compute Sentiment Score Toward the Brands

 1  function AllBrandsSentimentScores(content)
 2      for β ← [Samsung, HTC, ...] do
 3          scores[β] ← BrandSentimentScore(content, β)
 4      return scores

 5  function BrandSentimentScore(content, β)
 6      B ← list of aliases of β                      ▷ manually collected
 7      P ← list of aliases of β's products           ▷ described in section 3.3
 8      score ← 0
 9      for α ← B ⊕ P do                              ▷ longest aliases first
10          if α is in content then
11              S ← the sentence containing α plus the next one
12              score ← score + SegmentSentimentScore(S)
13      return score

14  function SegmentSentimentScore(S)
15      pw ← #(NTUSD positive words in S)             ▷ longest matches first
16      nw ← #(NTUSD negative words in S)             ▷ longest matches first
17      pe ← #(positive emoticons in S)
18      ne ← #(negative emoticons in S)
19      score ← pw − nw + pe − ne
20      return score
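Algorithm 1 could be sketched in Python roughly as follows. This is one reading under stated assumptions, not the implementation used in the experiments. The word lists, emoticon lists, and alias lists are tiny hypothetical stand-ins for NTUSD and for the manually collected aliases. The sketch also assumes that sentences end at 。！？!? or a newline, that only the first occurrence of each alias is scored, and that "longest first" means a longer match consumes the text so a shorter one cannot match inside it.

```python
import re

# Hypothetical stand-ins; the real NTUSD and alias lists are far larger.
POSITIVE_WORDS = ["不錯", "方便", "很方便"]
NEGATIVE_WORDS = ["怪", "貴"]
POSITIVE_EMOTICONS = ["|smile|"]
NEGATIVE_EMOTICONS = ["|orz|"]
BRAND_ALIASES = {"Samsung": ["三星", "samsung"], "HTC": ["htc"]}
PRODUCT_ALIASES = {"Samsung": ["note", "s2"], "HTC": ["xl"]}  # lowercase

# Assumed sentence model: a run of text plus its trailing terminators.
SENTENCE = re.compile(r"[^。！？!?\n]+[。！？!?\n]*")


def count_longest_first(text, terms):
    """Count occurrences of terms, letting longer terms consume text first."""
    total = 0
    for term in sorted(terms, key=len, reverse=True):
        total += text.count(term)
        text = text.replace(term, " " * len(term))  # mask what was counted
    return total


def segment_sentiment_score(segment):
    """Lines 14-20: positive minus negative words and emoticons."""
    pw = count_longest_first(segment, POSITIVE_WORDS)
    nw = count_longest_first(segment, NEGATIVE_WORDS)
    pe = count_longest_first(segment, POSITIVE_EMOTICONS)
    ne = count_longest_first(segment, NEGATIVE_EMOTICONS)
    return pw - nw + pe - ne


def brand_sentiment_score(content, brand):
    """Lines 5-13: score the two-sentence window around each alias."""
    text = content.lower()
    spans = [m.span() for m in SENTENCE.finditer(text)]
    aliases = BRAND_ALIASES[brand] + PRODUCT_ALIASES[brand]
    masked = text
    score = 0
    for alias in sorted(aliases, key=len, reverse=True):
        pos = masked.find(alias)
        if pos == -1:
            continue
        # Index of the sentence that contains the alias.
        idx = next(i for i, (s, e) in enumerate(spans) if s <= pos < e)
        end = spans[min(idx + 1, len(spans) - 1)][1]  # include next sentence
        score += segment_sentiment_score(text[spans[idx][0]:end])
        masked = masked.replace(alias, " " * len(alias))
    return score


def all_brands_sentiment_scores(content):
    """Lines 1-4: one score per brand."""
    return {brand: brand_sentiment_score(content, brand) for brand in BRAND_ALIASES}


print(all_brands_sentiment_scores("|orz|只能說S2真的是怪阿!!"))
```

With this toy lexicon the call prints `{'Samsung': -2, 'HTC': 0}`: one negative emoticon and one negative word fall in the window around the product alias. The lexicon was chosen for illustration and says nothing about which entries NTUSD actually contains.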

            precision  recall  F-measure
test set       70.97%  52.88%     60.61%
test set*      65.57%  57.14%     61.07%

Table 20: Bag-of-words, content characteristics, submission time, thread activeness, and sentiment scores toward the brands
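The table does not state which F-measure is reported. Assuming it is the balanced F1 (the harmonic mean of precision and recall), the figures in Table 20 can be re-derived as a sanity check. The function name `f_measure` is an illustrative choice.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall in percent, copied from Table 20.
table_20 = {
    "test set": (70.97, 52.88),
    "test set*": (65.57, 57.14),
}

for name, (p, r) in table_20.items():
    print(f"{name}: F = {f_measure(p, r):.2f}%")
```

Because the inputs are themselves rounded to two decimals, the recomputed values can differ from the reported ones in the last digit; they agree to within 0.01 percentage points here, which is consistent with the F1 assumption.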

We postulate that spammers might put more effort into promoting the latest products, because those are also the ones being promoted through legitimate advertising. Hence, we make a variation of the algorithm that only accounts for mentions of products whose release date is within one month of the submission time of the post. More precisely, line 6 of Algorithm 1 is skipped, and the right-hand side of line 7 becomes 'list of aliases of β's products which are released within one month from the submission time of content'.
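The alias filter for this variation might look like the sketch below. The catalogue, its product names, and its release dates are hypothetical placeholders. Two further assumptions are made because the text leaves them open: "one month" is taken as 30 days, and the window is applied in both directions around the submission date.

```python
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=30)  # assumption: "one month" read as 30 days

# Hypothetical catalogue: each product has lowercase aliases and a release date.
CATALOGUE = {
    "Samsung": [
        {"aliases": ["phone-a", "pa"], "released": date(2012, 1, 10)},
        {"aliases": ["phone-b"], "released": date(2011, 6, 1)},
    ],
}


def hot_product_aliases(brand, submitted_on, catalogue=CATALOGUE):
    """Aliases of the brand's products released within HOT_WINDOW of the post.

    This replaces P on line 7 of Algorithm 1; B (line 6) is left empty.
    """
    return [
        alias
        for product in catalogue.get(brand, [])
        if abs(product["released"] - submitted_on) <= HOT_WINDOW
        for alias in product["aliases"]
    ]


print(hot_product_aliases("Samsung", date(2012, 1, 25)))
```

A post submitted on the example date would only be scored around mentions of the first hypothetical product, since the second one was released more than 30 days earlier.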

Still, this variation shows no improvement over the existing features (Table 21).

            precision  recall  F-measure
test set       72.03%  49.52%     58.69%
test set*      64.41%  54.29%     58.91%

Table 21: Bag-of-words, content characteristics, submission time, thread activeness, and sentiment scores toward the hot products

There are several plausible explanations for why the polarity of our estimated sentiment score fails to reflect the true opinion polarity. First, as discussed in sections 2.1.4 and 4.1, the spam posts are carefully written to deliver their messages subtly, so they might to some degree avoid using sentiment words. Second, sarcasm is heavily used on Mobile01, and even human readers often cannot fully grasp it. Third, NTUSD is not specifically designed for Mobile01 and has been around for some years, while the community on Mobile01 may have given some words new meanings, and even invented new words in their subculture.

To investigate further, we list concrete examples²⁰ in which our algorithm failed to grasp the true sentiment. A lime green background indicates HTC brand/product mentions, and a blue background indicates Samsung brand/product mentions; positive words or emoticons near a mention are marked with a red background, and negative ones with a light blue background. Segments surrounded by '|' symbols represent emoticons on Mobile01.

²⁰ Since we do not repeat this discussion for spam detection on replies, the examples include both first posts and replies.

Samsung → −2, HTC → 0

|orz|只能說S2真的是怪阿!!

...

This post uses negative emoticons and words to compliment a Samsung product in a dramatic manner.

Samsung → +3, HTC → 0

我比較 Note耶

因為htc好像把XL當精品賣..哈

現在單核心的手機還敢賣那麼貴的..而且還很多人讚賞 真的只有HTC

XL真的不錯啦

不過考量到一支手機可能要使用個一兩年的時間我還是比較 NOTE

除了筆的功能很方便

到時要是出現5.0的系統..單核心不夠跑怎麼辦

The algorithm successfully detects the positivity toward Samsung based on the positive words near the two mentions of a Samsung product. However, recognizing the sarcastic mockery of HTC is out of reach for this simple algorithm.

Samsung → −3, HTC → 0

我也覺得吵這些要適可而止了

現在全世界有幾個國家像台灣這樣送一堆東西的

到時所有人把三星到不敢送東西吃虧的還不是我們自己???

真的把事情鬧大了..只有爽到現在.. 卻苦到以後買手機的人阿

Samsung → −2, HTC → 0

好久沒有三星了今天又來一篇

該領錢下班囉!

Samsung → −1, HTC → 0

印象中是之前三星手機就有的功能！不是新功能！

不過還蠻方便的，隨時掌握朋友的生日，也能增加彼此的話題嘛！

In these examples, some negative words appear around '三星 (Samsung)', but the posts express no actual negativity toward Samsung.

Samsung → +1, HTC → +2

手中的感機用了也快一年, 之前是因為很 HTC的 sence 感覺很質感

不過這次的Galaxy nexus 的介面整個很到我

而且我真的覺得 4.65 吋才是最適合的大小吧

我玩過XL也還 OK, 只是不愛那白色帶點紅的設計快點上市吧!! 等不及啦

In this example, the spammer actually praises an HTC product at the start of the post, but then claims that a Samsung product is even better. The sentiment polarity toward both brands is accurately identified (both positive), but recognizing the comparison is what matters here.

Sentiment or attitude toward the relevant brands is definitely an aspect that can be exploited to help detect spam posts. However, as demonstrated by these examples, a more advanced algorithm is needed.

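Source: Master's thesis, Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science, National Taiwan University (pp. 41-46)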