國立臺灣大學電機資訊學院資訊工程學系
碩士論文

Department of Computer Science and Information Engineering
College of Electrical Engineering and Computer Science
National Taiwan University
Master Thesis

垃圾評論的分析與偵測 - 用流出資訊作為標準答案
Opinion Spam Analysis and Detection:
Leaked Confidential Information as Ground Truth

陳譽仁 Yu-Ren Chen

指導教授:陳信希 博士
Advisor: Hsin-Hsi Chen, Ph.D.

中華民國 103 年 7 月
July 2014


Opinion Spam Analysis and Detection:
Leaked Confidential Information as Ground Truth

Yu-Ren Chen, Hsin-Hsi Chen

July 2014

Contents

Abstract 3

1 Introduction 3

2 Related Work 4

2.1 ‘Spam’ in General . . . 4

2.1.1 Email Spam . . . 5

2.1.2 Web Spam . . . 5

2.1.3 Social Network Spam . . . 5

2.1.4 Opinion Spam . . . 6

2.2 Target of Detection . . . 6

2.2.1 Spam Post Detection . . . 6

2.2.2 Spammer Detection . . . 7

2.3 Proposed Features . . . 7

2.3.1 Content-centric Features . . . 7

2.3.2 Non-content-centric Features . . . 8

2.4 Ground Truth Acquisition . . . 8

3 Dataset 9
3.1 Leaked Spreadsheets . . . 9

3.2 Mobile01 Corpus . . . 10

3.3 Product Information . . . 13

4 Data Exploration 14
4.1 Subtlety . . . 14

4.2 Low Spam Post Ratio of (Some) Spammers . . . 18

4.3 Different Types of Spammer Accounts . . . 19

4.4 First Post vs Replies in Threads . . . 21


4.5 Pattern in Submission Time of Posts . . . 22

4.6 Activeness of Threads . . . 23

4.7 Collusion between Spammers . . . 25

5 Detection 28
5.1 Evaluation Metric . . . 29

5.2 Data Splitting . . . 29

5.2.1 Posts (for Spam Detection) . . . 30

5.2.2 User Accounts (for Spammer Detection) . . . 30

5.3 Machine Learning . . . 31

5.4 Spam Detection for First Posts . . . 32

5.4.1 Random Baseline . . . 32

5.4.2 Bag-of-words . . . 33

5.4.3 Content Characteristics . . . 36

5.4.4 Submission Time and Thread Activeness . . . 39

5.4.5 Sentiment Scores Toward the Brands . . . 40

5.5 Spam Detection for Replies . . . 45

5.5.1 Random Baseline . . . 45

5.5.2 Bag-of-words . . . 45

5.5.3 Content Characteristics . . . 47

5.5.4 Submission Time, Thread Activeness and Position in Thread . . . 48

5.5.5 Spamicity of the First Post in the Thread . . . 49

5.6 Spammer Detection . . . 50

5.6.1 Random Baseline . . . 50

5.6.2 Profile Information . . . 50

5.6.3 Maximum Spamicity of the First Posts of the User . . 52

5.6.4 Burstiness of Registration of Throwaway Accounts . . 53

5.6.5 Frequently Appeared Groups of Posters . . . 54

5.7 Caveat . . . 56

6 Future Work 57
6.1 Sentiment/Attitude Shown in Posts . . . 57

6.2 Interaction between Forum Posters . . . 57

6.3 Integration of Spam and Spammer Detection . . . 58

7 Conclusions 58

References 58


Abstract

‘Opinion spamming’ usually refers to the illegal marketing practice of delivering commercially advantageous opinions while posing as regular users on review websites. In this research, based on a set of internal records of opinion spam leaked from a shady marketing campaign, we explore the characteristics of opinion spam and spammers to obtain insights, and then attempt to devise features that could be helpful in automatic detection. In the final experiments, we find that our detection model achieves decent performance with a set of rather basic features.

1 Introduction

In April 2013, on the Taiwan-based web forum Mobile01, a poster submitted a thread[1] disclosing several confidential documents of a covert marketing[2] campaign that had been conducted under the table. The campaign instructed hired writers and designated employees to post disingenuous comments on several web forums, including Mobile01. This revelation created a big stir at the time, since it was the first time such strong evidence came to light supporting what most people had dismissed as a ‘conspiracy theory’.

Mobile01, also known as 01 in Taiwan, is a web forum that mainly features discussion about mobile phones, hand-held devices, and other consumer electronics. The vast majority of the users on the site are from Taiwan, so the posts are mostly written in Traditional Chinese, the official script of Taiwan. The site was founded in 2000 and has become one of the best-known Taiwanese local websites. As reported by Alexa, Mobile01 ranked #10 in terms of website traffic in Taiwan as of this writing.

The confidential documents, along with relevant articles describing the campaign, are hosted on Taiwansamsungleaks[3], a website made by the hacker ‘0xb’.

According to the site, the covert marketing campaign was carried out by a consulting firm that was a subsidiary of one of the biggest IT companies in the world. In this campaign, hired posters were asked to promote a certain brand and denounce its rivals on web forums such as Mobile01, while disguised as normal consumers.

[1] http://www.mobile01.com/topicdetail.php?f=568&t=3284729

[2] Covert marketing is defined as a firm's marketing actions whereby consumers believe that the activities are not those of the firm. (Kaikati and Kaikati, 2004)

[3] http://taiwansamsungleaks.org


Among the disclosed documents, there are two spreadsheets[4] that appear to be internally kept records of the spam posts generated by the campaign from 2011 to 2012. Each row in these spreadsheets is a record of an incentivized forum post, consisting of the poster's username, the time of posting, the URL of the post, the product discussed in the post, and some other details.

[4] with file name extension .xlsx

Generally speaking, web forums provide platforms for people with similar interests to interact and share experiences with each other. Since people normally believe posts on legitimate forums to be based on genuine personal opinion and experience, it is considered unethical to use them to promote things for personal gain without disclosure, taking advantage of the inherent mutual trust between forum users. As a matter of fact, this marketing malpractice violated the fair trade law, and the company in charge of the campaign was fined by the Fair Trade Commission (FTC) in Taiwan after the investigation was completed.

In this research, we attempt to leverage these spreadsheets to generate ground truth for deceptive forum spam. After some exploration of the data, we devise automatic methods for spam detection and spammer detection, and then conduct experiments to see their performance under various conditions.

2 Related Work

We organize the related work into subsections focusing on different aspects: studies of spam in general (section 2.1), the type of target to detect (opinion spam or opinion spammer) (section 2.2), features proposed in the past (section 2.3), and the difficulty of acquiring ground truth data for opinion spam (section 2.4).

2.1 ‘Spam’ in General

Spam, whose various definitions usually center around the concept of unsolicited messages (Hayati et al., 2010), has been bothering Internet users since the rise of the Internet. Because the amount of spam is usually formidable, it would be too laborious to identify and remove it manually, one message at a time. Therefore, finding an automatic spam detection method has long been a popular research topic, both due to the strong demand and because it is an intriguing problem in itself.

2.1.1 Email Spam

Email spam is one of the most prevalent types of spam and dates back a long way. In our experience, when people mention ‘spam’ without a more specific context, they are usually referring to email spam. The topic of email spam detection has been extensively studied, and rich literature covering it in depth is available. For a thorough overview, Blanzieri and Bryl (2008) surveyed the then state-of-the-art machine learning applications for email spam filtering.

2.1.2 Web Spam

Another form of spam is web spam, whose objective is to game the ranking algorithms of search engines in order to obtain an undeservedly high ranking. It is usually applied as a Black Hat SEO[5] technique to pursue the lucrative profit that search engine traffic can bring. However, as the major search engine Google keeps refining its ranking algorithm, simple link spam is nowadays no longer able to cheat the search engine, and could instead incur a ranking penalty. Gyongyi and Garcia-Molina (2005) presented a comprehensive taxonomy of web spamming techniques. Gyöngyi et al. (2004) proposed a semi-automatic method to separate good and reputable pages from web spam.

2.1.3 Social Network Spam

Just as the rise of search engines led to web spam, as social media gained popularity in recent years, social network spam came along. Sometimes concisely called ‘social spam’, social network spam has many variants. One of them involves throwaway accounts created in batches to bait regular users into clicking certain links for personal gain. McCord and Chuah (2011) and Benevenuto et al. (2010) both discussed techniques for spammer detection on Twitter.

[5] use of aggressive SEO tactics without following the terms of service of search engines


2.1.4 Opinion Spam

The kind of spam we want to detect in this research is usually referred to as opinion spam or review spam. Opinion spam is related to, but still different from, other kinds of spam in various respects. One of the most prominent differences is that opinion spam is arguably the most ‘subtle’ kind of spam: when caught, it is not only completely ineffective, but also very harmful to the reputation of the brand (or store, restaurant, etc.) being promoted. Therefore, opinion spammers generally try their best to disguise their opinion spam as genuine opinion. Carefully written opinion spam poses a great challenge for manually identifying spam and annotating ground truth, which is in concert with the finding that humans are poor judges of deception (Vrij et al., 2008). Ott et al. (2011) reported a very low annotator agreement score when annotating opinion spam in a review corpus. In contrast, most email spam, web spam, or social network spam is fairly easy to spot for an experienced user of the respective platform.

One of the earliest studies on opinion spam is Jindal and Liu (2008), which attempted to detect fake product reviews on Amazon. Since then, this topic has been drawing increasing attention. Mukherjee et al. (2011), Lim et al. (2010), Jindal et al. (2010), Xie et al. (2012), and Wang et al. (2011) are some of the subsequent studies.

In the later parts of this paper, when we mention ‘spam’ or ‘spammer’ without specifying the type, we are referring to ‘opinion spam’ or ‘opinion spammer’, respectively.

2.2 Target of Detection

The task of opinion spam detection can be seen as a binary classification problem in which we want the detection model to decide whether a given instance is spammy or not. Naturally, each instance would be a post, with spam (post) and non-spam as the two classes. Alternatively, each instance could be a user account, with spammer and non-spammer as the classes, when we only care about which users are the black sheep. In our research, we construct models and conduct experiments for both types of targets.


2.2.1 Spam Post Detection

In spam (post) detection, the detection model's job is to identify whether a forum post (or a product review, a store review, etc.) is a spam post. Many previous studies on opinion spam aimed at detecting spam reviews, which can be seen as a type of post (Jindal and Liu, 2008; Harris, 2012; Jindal et al., 2010; Ott et al., 2011). Nonetheless, even when the target of detection is spam, we can still utilize features derived from information about the corresponding spammers, and vice versa.

2.2.2 Spammer Detection

Lim et al. (2010) and Wang et al. (2011) are two previous studies that focused on identifying spammers, while Mukherjee et al. (2011) created a variation by targeting groups of spammers who worked together to write fake reviews. In our research, we define ‘spammers’ as users who have ever submitted a spam post. Under this definition, in some sense, spammer detection is no harder than spam detection, as a spammer is identified as soon as any of his/her spam posts is identified.

2.3 Proposed Features

A good number of features have been proposed for use with commonly applied supervised learning models such as SVM (Support Vector Machine) (Cortes and Vapnik, 1995), or alternatively in ad-hoc models designed for a specific purpose. Most of these features fall into two categories: those derived from the textual contents of opinion spam, and those not directly related to the text.

2.3.1 Content-centric Features

In terms of features derived from contents, Jindal and Liu (2008) counted the percentage of opinion-bearing words, brand name mentions, numerals, capitals, etc. Mukherjee et al. (2011) computed content similarity between reviews to find duplicate or near-duplicate reviews, which are suspected of being spam reviews. Ott et al. (2011) used bag-of-n-grams and slightly improved the performance with psychologically meaningful dimensions in LIWC (Pennebaker et al., 2007). Harris (2012) took cues such as word diversity, proportion of first person pronouns, and mentions of brand names. Feng et al. (2012) went a step further by adopting deep syntactic features, derived from the production rules involved when parsing the contents with a PCFG, in addition to the basic bag-of-words.

2.3.2 Non-content-centric Features

As for features not directly related to content, Lim et al. (2010) and Feng et al. (2012) both made extensive use of various characteristics of user rating patterns on Amazon. Mukherjee et al. (2011) derived features from bursts in the amount of reviews, how early the reviews were posted, and rating deviation, with respect to either groups or individuals. Wang et al. (2011) iteratively computed the trustworthiness of reviewers, honesty of reviews, and reliability of stores based on a graph model utilizing non-content-centric features such as average rating and number of reviews. Since our dataset is obtained from a web forum, some information available on product review sites, such as user ratings, is unavailable here, which rules out some of these possibilities.

2.4 Ground Truth Acquisition

One of the major obstacles in studies of opinion spam is the difficulty of acquiring ground truth, since it is in spammers' best interest to keep it secret, and manual annotation is ineffective because of the subtle nature of opinion spam mentioned in section 2.1.4.

A lot of effort has been put into obtaining ground truth in studies of opinion spam. Jindal and Liu (2008) assumed near-duplicate reviews were likely to be spam and followed this heuristic to build an annotated dataset. More recently, collecting annotations via crowdsourcing platforms like Amazon Mechanical Turk has become a popular approach. Gokhman et al. (2012) discussed various techniques for obtaining ground truth in studies of deception, and argued that realistic deceptive contents can be generated from crowdsourcing if the context of deception in practice is replicated on the crowdsourcing platform. On the other hand, it is ineffective to annotate existing deceptive contents; in fact, one of the quality indicators of fabricated opinion spam is that it should not be recognizable by crowdsourced annotators. Ott et al. (2011) scraped truthful opinions from TripAdvisor and synthesized deceptive opinions with the help of Amazon Mechanical Turk.

Thus far, most previous research on opinion spam has adopted some sort of approximation of the actual ground truth, due to the difficulties stated in this section. In contrast, in our research, we extract ground truth from confidential records leaked directly from a covert advertising campaign, which assures its ‘trueness’.

3 Dataset

There are three major sources of our dataset:

1. The leaked spreadsheets disclosed by the anonymous hacker ‘0xb’ provide ground truth of which posts are spam. (section 3.1)

2. The actual contents and various meta information on Mobile01 compose the ‘body’ of our corpus. (section 3.2)

3. Product information is scraped from a phone review website named SOGI[6] to aid analyses requiring knowledge about the products. (section 3.3)

3.1 Leaked Spreadsheets

The leaked spreadsheets HHP-2011.xlsx and HHP-2012.xlsx keep the histories of the opinion spam posts made in 2011 and 2012, respectively. Several discussion platforms were spammed, but for simplicity, we consider only the opinion spam and the corresponding spammers on Mobile01, which make up the majority of the records contained in the spreadsheets.

Among the columns in the spreadsheets, the URLs of the spam posts and the usernames of the spammers[7] are extracted. Some typos and inconsistent ways of presenting the usernames (e.g., lowercase vs uppercase, confusion between similar-looking Unicode characters) are manually checked and fixed.

Furthermore, the recorded URLs linking to pages on Mobile01 might appear in different forms. To be able to reliably match the posts we scraped later, a 3-tuple (fid, thid, pnum) is extracted from each URL, where fid, thid, and pnum refer to forum id, thread id, and page number, respectively. These 3-tuples serve as unique identifiers of a page in a thread on Mobile01. For the example Mobile01 page URL below, the extracted 3-tuple identifier would be (566, 4009283, 2).

http://www.mobile01.com/topicdetail.php?f=566&t=4009283&p=2

[6] http://www.sogi.com.tw

[7] Whenever the word spammer or user is used hereafter without further details, we are talking about a spammer account or user account, respectively, since there is no way to find out who the actual human poster behind a user account is.
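To make this normalization concrete, here is a minimal sketch of the 3-tuple extraction; the regular expressions and the default of page 1 when the p parameter is absent are our own illustration, not taken from the thesis:

    import re

    def extract_page_id(url):
        """Extract the (fid, thid, pnum) identifier from a Mobile01 topicdetail URL."""
        fid = re.search(r"[?&]f=(\d+)", url)
        thid = re.search(r"[?&]t=(\d+)", url)
        pnum = re.search(r"[?&]p=(\d+)", url)
        if fid is None or thid is None:
            return None  # not a recognizable thread page URL
        return (int(fid.group(1)), int(thid.group(1)),
                int(pnum.group(1)) if pnum else 1)  # assume page 1 if p is missing

    # extract_page_id("http://www.mobile01.com/topicdetail.php?f=566&t=4009283&p=2")
    # -> (566, 4009283, 2)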


Since we regard any user who has ever posted a spam post as a spammer, any account contained in the spreadsheets is considered a spammer. Thereafter, we have a set of 2-tuples, each consisting of a spammer's username and a nested 3-tuple identifier leading to a page containing one or more spam posts by that spammer. An example snippet of the data extracted from the leaked spreadsheets is listed in table 1. As a matter of fact, the spreadsheets do not specify exactly which post on the linked page is spam, so if a linked page contains multiple posts by the poster with the recorded username, we simply consider all of them spam posts.

username      fid  thid     pnum
amberwangtw   568  2378318  1
nickliu623    568  2682497  1
jackR         14   1977960  1
popstyle      568  2661837  1
kk8928166     568  2636349  2
賈蘇林         568  2400890  1
CBR600RR2007  217  2399752  1
QQ_578        61   2605621  1

Table 1: data extracted from the leaked spreadsheets

3.2 Mobile01 Corpus

A large portion of previous related studies used datasets scraped from product or store review websites such as Amazon or TripAdvisor, whereas our corpus is scraped from a web forum. Another difference is that the contents on Mobile01 are mostly written in Traditional Chinese, with a little English scattered around, rather than predominantly written in English as in previously used corpora.

Mobile01 works just like a typical web forum, such as those based on phpBB or vBulletin, and here we assume the reader has a basic understanding of how web forums work.

Since more than 70% of the recorded spam posts were submitted to the Samsung (Android) board on Mobile01, we decide to focus our analysis on this board.

By SSH tunneling through Linux workstations maintained by the department, we were able to fetch, within a reasonably short period of time, all the threads (along with the contained posts) accessible to a regular member on the Samsung (Android) board in May 2014. In addition, the profiles of users who have ever posted on this board are also retrieved. To get the relevant information out of the retrieved web pages, we parse the HTML with the help of BeautifulSoup[8]. After the laborious task of wrangling the raw data, the cleaned data are stored in a SQLite database for ease of later access and possible modification. For instance, each post is stored as a record in the POSTS table, which has a column for each attribute of a post.
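As a minimal sketch of this parse-and-store step (the CSS selector and the data-* attributes below are hypothetical stand-ins, since the thesis does not document Mobile01's actual markup):

    import sqlite3
    from bs4 import BeautifulSoup

    def store_posts(html, thid, pnum, db_path="mobile01.db"):
        """Parse one scraped thread page and insert its posts into the POSTS table."""
        soup = BeautifulSoup(html, "html.parser")
        conn = sqlite3.connect(db_path)
        conn.execute("""CREATE TABLE IF NOT EXISTS posts
                        (thid INTEGER, time INTEGER, uid INTEGER, uname TEXT,
                         nfloor INTEGER, pnum INTEGER, content TEXT)""")
        # 'article.post' and the data-* attributes are assumptions for illustration
        for node in soup.select("article.post"):
            row = (thid,
                   int(node["data-time"]),    # submission time (Unix timestamp)
                   int(node["data-uid"]),     # poster's user id
                   node["data-uname"],        # poster's username
                   int(node["data-nfloor"]),  # position within the thread
                   pnum,
                   str(node))                 # keep the structured HTML as-is
            conn.execute("INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)", row)
        conn.commit()
        conn.close()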

Basic counts, scraped attributes, and randomly-selected snippets of the SQLite tables POSTS, PROFILES, and THREADS are shown in Tables 2 to 7.

It should be noted that the data we scraped from Mobile01 is the ‘May 2014 version’, while the spam activity we investigate happened during 2011 and 2012. Ideally, a snapshot from the end of 2012 would suit our needs best. By the time we collected the dataset, some posts could have been edited or removed, and profiles could have evolved over time had the users stayed active. In table 7, we can see that 4 out of 10 randomly picked profiles have the last login time ‘1399075200’, which represents 5 May 2014, the date when the profile data were scraped from Mobile01.

table     row count
posts     632234
threads   41759
profiles  58531

(a) Scraped data

target type  row count
spammers     300
spam posts   3116

(b) Labeled data

attribute  description
thid       id of the thread to which the post belongs
time       submission time of the post
uid        id of the poster who made the post
uname      username of the poster
nfloor     position relative to other posts in the thread
pnum       page number on which the post is
content    structured content in HTML

Table 2: attributes of the table POSTS

[8] http://www.crummy.com/software/BeautifulSoup


thid     time        uid      uname       nfloor  pnum  content
2753035  1337723880  2135371  ZSKOR       20      2     …
2007342  1297610580  151149   湯尼小       1       1     …
3060762  1353922620  2369024  pinckstraw  12      2     …
2550662  1331902200  1817096  iamfishfis  1074    108   …
1830978  1288783020  185736   wunit       26      3     …
3841427  1396355820  2253702  bluestaral  1       1     …
2741044  1337137440  1448858  wei700818   9       1     …
2467506  1322415300  160444   nella76327  4       1     …
1899252  1291632600  1426406  cloud2211   9       1     …
3227368  1362157920  2212133  jinshun000  187     19    …

Table 3: a snippet of the table POSTS

attribute  description
thid       id of the thread
fid        id of the forum (board) in which the thread is
title      title of the thread
pages      number of pages in the thread
clicks     number of clicks (views) on this thread
time       submission time of the thread (= first post's)

Table 4: attributes of the table THREADS

thid     fid  title                                          pages  clicks  time
3851232  568  請問沒有參加預購的人 4/11哪裡比較能買到s5?          1      880     1396908300
1710011  568  有住台中的神人大大能幫忙root I9000?                5      6941    1282155000
2301390  568  S2 5.1 聲道怎麼比不開還不太好聽!!                  1      1474    1313325000
2810711  568  S3嚴重收訊問題~有同樣問題的還說說吧                2      4972    1340631900
3029682  568  我的 note ii 32gb沒貼神腦或聯強的貼紙              1      319     1352040120
3015440  568  越南 NOTE2 開箱之尋寶圖???                        1      1518    1351294380
2582007  568  Samsung Galaxy mini s5570使用 Kies 程式的問題     1      154     1328747100
2848181  568  你們的 note會這樣嗎?                              1      2288    1342674000
2258807  568  再跟新9100的新版本的時候出現不可預期的錯            1      231     1310961000
3287201  568  遊戲的背景音樂破破的                              1      143     1364908860

Table 5: a snippet of the table THREADS


attribute    description
uid          id of the user
reg_time     time of registration on the site
login_time   last time the user logged in
n_threads    number of threads initiated by the user
n_eff_posts  number of effective posts
n_posts      number of all posts
n_replies    number of replies, equal to (n_posts - n_threads)
score        ‘karma’ given by other users to the threads the user makes
p_phone      proportion (%) of posts made in the smart phone section

Table 6: attributes of the table PROFILES

uid      reg_time    login_time  n_posts  n_eff_posts  n_threads  n_replies  score  p_phone
1873586  1295136000  1395878400  18       18           0          18         0      88
820575   1192406400  1292976000  5        5            0          5          0      60
2495668  1367107200  1387756800  16       16           0          16         0      100
2500546  1367798400  1397692800  15       15           2          13         0      6
941418   1204934400  1398816000  9        9            1          8          0      22
1850046  1292457600  1398643200  50       44           3          47         5      42
2678858  1397174400  1399075200  4        4            1          3          0      50
636450   1172620800  1399075200  71       66           1          70         0      18
814919   1191801600  1399075200  328      176          14         314        0      0
165425   1125964800  1399075200  561      540          13         548        4      74

Table 7: a snippet of the table PROFILES

3.3 Product Information

We scrape product information for all cell phones and tablets of the top brands listed on the front page of SOGI. The scraped attributes of each product are shown in the table below.


attribute    description
brand        brand name of the product
specs        product specifications
price        estimated price
release      release date of the product
description  product description

Table 8: attributes of products

People use a wide variety of aliases to refer to cell phone or tablet products on Mobile01, and very rarely call products by their full exact names. To be able to match as many product mentions as possible, we take every 1-word or 2-word fragment of the full name as an alias. If the name contains Roman numerals, we also convert them to the corresponding Arabic numbers to create more aliases. Aliases constructed in such a loose manner inevitably include many false ones, such as general terms like ‘3D’ or ‘pro’, so we remove these by sifting through the matching results; false aliases without any match most likely won't do any harm.
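A minimal sketch of this alias generation (the function name and the Roman numeral table are ours; the table is truncated for brevity):

    ROMAN = {"ii": "2", "iii": "3", "iv": "4", "v": "5"}  # extend as needed

    def make_aliases(full_name):
        """All 1- and 2-word fragments of a product name, plus variants with
        Roman numerals converted to Arabic numbers."""
        words = full_name.lower().split()
        frags = set(words)
        frags.update(" ".join(words[i:i + 2]) for i in range(len(words) - 1))
        for frag in list(frags):
            frags.add(" ".join(ROMAN.get(w, w) for w in frag.split()))
        return frags

    # make_aliases("Samsung Galaxy S II") -> {'samsung', 'galaxy', 's', 'ii', '2',
    #     'samsung galaxy', 'galaxy s', 's ii', 's 2'}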

Many products of the same brand share some aliases. For example, people often use Note to refer to a Samsung product, but a dozen Samsung products carry that alias. Since what we ultimately care about is the brand a product belongs to, we simply deem such a mention as referring to an arbitrary Samsung product of the Note series.

4 Data Exploration

In this section, we inspect the dataset to get a grasp of what went on in this organized covert marketing campaign. It should be noted that some of these characteristics might not be manifested by other similar marketing campaigns, since each campaign may have its own ‘game plan’ with a different objective in a different context.

4.1 Subtlety

One of the basic properties we observed is that most of the spam posts don’t really look suspicious, which echoes the discussion in section 2.1.4.

Spammers usually deliver their opinions about brands in a subtle way that blends them into the discussion, not to mention that a portion of the spam posts (mostly replies) don't even carry an opinion about any brand[9], but only serve to keep the discussion alive and to bump[10] the thread in order to attract more attention to topics that meet the goal of the campaign. Moreover, even before the whole story was revealed, it had been rumored for years that some of the posters on Mobile01 were part-time paid writers, which may have caused the spammers to be extra careful to avoid a backlash from the community.

[9] In this paper, we still deem these spam posts ‘opinion spam’, though a more appropriate name could probably be coined for this specific type of spam post.

[10] Replying to a thread ‘bumps’ it to the top of the board, since threads are ordered by the time of the last reply on most web forums, including Mobile01.

The following are some examples of ‘subtle’ spam posts selected from the dataset. Without the ground truth available, we may not be able to identify these posts as spam beforehand. The dark gray parts contain the title of the thread the post is in and the nfloor of the post (its position in the thread); the light gray parts are the contents of the posts.

Xperia mini pro 得入手嗎 #45

17000 你可以買更好的阿例如再加個1000 去買 HTC Sensation MINI 我記得訂價很便宜

搞不好你還有機會買兩隻咧

This post subtly mocked the higher price of an HTC product via a price comparison with the MINI, where MINI was not even referring to a Samsung product but a Sony one.

Galaxy note 退訂 +1 #42

印象中,之前三星的產品預購好像都沒有公佈售價啊?

其實一個願打一個願挨啦!真的沒辦法接受的話 就等上市再買囉

< 刪文 > #8

明明可以刁你卻無絛件讓你換機…

這樣的服務還會讓你特地在這裡發文表示心痛,


那也難怪會有人想把自己送修的事情鬧上新聞了……

S2 訊號旁的 3G 號不見了 = = #9 打去電信業者客服問問看

會不會是你家那邊收訊有問題?

The above three spam posts defended Samsung's service and products in a seemingly unbiased and rational tone. Judging solely by their content, it is by no means easy to tell that they are spam.

有用 I9003 的大大, 分享一下使用心得吧 #2

| lag |i9003 我沒用過 看有沒有其他大大可以跟你分享 不然爬文找找應該也有

| 請爬文 |

GALAXY S2 #43

可能不到記者會

也不能確定台灣上市的版本呀

These posts have little actual content, and seem to be there only to heat up the discussion about some Samsung products.

請問 位神人 NOTE 的白色版本預計 時會有? #10 聖誕節前…好像蠻應景的?

香港都上了,台灣應該也快了吧?

兩個禮拜本來很 怎麼遊戲應用 體越多

開 越來越 那個 號出現越來越久 = = #9 還是常常整理程式比較好啦

用不到的程式就刪掉吧

These two posts are similar to the previous ones, but they managed to provide somewhat helpful answers to fit into the conversation.

NDSL 的觸控筆和 GALAXY NOTE 可否共用?#1 我發現 NDSL 的觸控筆和 GALAXY NOTE的觸控筆很像,

他們可否共用?原理也類似嗎?

This is the first post of a thread. It was intended to initiate discussion about a Samsung product by asking a question about it. Like many of the previous posts, it doesn't seem to contain an opinion about any brand.

說 生就一定要 的粉可愛!我就 要買 GALAXY R#5 剛看了一下

Galaxy R好像沒有消除雜訊的這個功能選項?

幫 生最好的方式, 就是送他一隻 SII 再 他下 ! #42 你們的對話也太有趣

連爸爸都要請出來了

| 大笑 |

準備衝 Galaxy Nexus 的來簽到吧 #24 拖到月底也太久了吧

店家不清楚可能真的要打電話問一下了 我才不想白跑一趟買不到手機咧 等等就來問問

一起 過 i9100 的鏡 ,分享生活中的小確 吧!#24 一張大溪的全景照

一位 100% 的向日葵人

白色 SII+ 粉紅色 SGP protector glass 好看嗎? #7 樓主貼的保護貼好美啊!

不過價格……好可怕啊!價位真的是有點高,可以直接買保護殼了!


熱 的 Galaxy Nexus 到手啦 #41 newlu

我傾向不包膜耶

弧形螢幕另一個好處就是這個 而且包膜的話鏡面質感好像會下降 好好保護他就好拉

The above posts are all replies to threads initiated by spam first posts. Again, these posts only serve to heat up the discussion or just keep it alive. Since the intended message may already have been delivered by the first post or some other reply in the thread, the spammers avoid stating any strong opinion about the brands directly, making these posts even less suspicious.

Such carefully written spam posts may make automatic detection very challenging, because content-centric features could be ineffective. Nevertheless, in the later experiments, we find the contents still encompass some clues that help spam detection greatly.

4.2 Low Spam Post Ratio of (Some) Spammers

Spammers are the posters who submitted any spam post recorded in the leaked spreadsheets, as defined in section 3.1. Even when we inspect the posts from the training set, which only contains posts from the board with the highest ‘spam density’, only about 33% of the posts by spammers are recorded as spam. The distribution of the #spams / #posts ratio of each spammer in the training set is plotted in the following figure.


Figure 1: distribution of the spam post ratios among spammers

The figure demonstrates that a large fraction of the ‘spammers’ actually rarely spammed. The majority non-spam posts of these spammers could neutralize the spam signal extracted from their posts if we try to average them on a per-user basis.

4.3 Different Types of Spammer Accounts

In fact, there seem to be mainly two types of spammer accounts in this dataset: accounts of reputable posters who are paid once or a few times to write quality long posts promoting the brand, and throwaway accounts shared internally among the spammers to synthesize public opinion. The figure below is a scatter plot of the spam post ratio vs the total number of threads made by each spammer.


Figure 2: spam posts ratio vs number of threads

We can see that most of the spammers with a high #spams/#posts ratio have initiated very few threads, which could be a clue that these are throwaway accounts created for the sole purpose of spamming. On the other hand, accounts with a lower ratio show a bigger variance in #threads. Some of them are likely reputable posters, who are usually the ones making lots of threads.

Throwaway accounts are often created en masse within a short period of time, since it takes much more effort and patience to spread out the daunting task of registering them, especially when a large number are needed. To test whether this applies to our dataset, we adopt a simple heuristic that categorizes accounts initiating fewer than 35 threads as throwaway accounts, and the rest as reputable accounts. In the figure below, the number of spammer accounts in the training set registered within each two-week window after 2009 is plotted.


Figure 3: number of spammer accounts created within each 2 weeks

Indeed, there are three short periods in which particularly high numbers of throwaway spammer accounts were registered: January 2010, April 2011, and October 2011. In contrast, the registration times of reputable accounts are spread quite evenly. This observation could later help us identify throwaway spammer accounts.

4.4 First Post vs Replies in Threads

As on most web forums, the first post in a thread, also known as the original post, is written by the user submitting the thread. First posts are relatively richer in content, as they serve the critical role of initiating a discussion on a specific topic. Replies, on the other hand, are often quite concise and sometimes don't really carry any opinion, as manifested in some of the spam reply examples listed in section 4.1.

First posts and replies display different characteristics in many aspects. In the following figure, we can see that first posts tend to contain more characters. Moreover, at least one image is embedded in 19.2% of the first posts, but in only 4.1% of the replies.


Figure 4: #characters in first posts and replies

The table below shows the spam counts and proportions for first posts and replies in the training set. It's a bit surprising to see the ratio of spam in first posts is as high as 5%; in other words, for every 20 threads in the training set, one was created for covert marketing! In contrast, %spams is much lower for replies.

type         #posts  #spams  %spams
first posts  10951   546     4.99%
replies      148481  1337    0.90%
all posts    159432  1883    1.18%

Table 9: #spams and %spams in first posts vs replies in the training set

Considering all these differences between first posts and replies, we decide to train a separate detection model for each later on. The performance is expected to be much better for first posts, since they carry more information (richer content) and have a significantly higher spam density than replies. That high performance will be very helpful when we leverage the prediction results on first posts to assist spammer detection.

4.5 Pattern in Submission Time of Posts

Because making spam posts on Mobile01 is a job rather than a leisure activity for the spammers, we postulate that a higher percentage of spam posts would be submitted during work time, compared to non-spam posts.

To check this postulate, we plot the distributions of submission times of spam and non-spam posts. In figure 5, the submission time of non-spam posts is distributed pretty evenly over the days of the week, whereas the amount of spam posts drops drastically on Saturday and decreases moderately on Sunday. In figure 6, we can observe that more spam posts are submitted during work hours, especially between 10 a.m. and 11 a.m., while non-spam posts are more often made during spare hours. Hence, there is more or less a trend that spam posts are more often made during work time than leisure time, in comparison with non-spam posts.

Figure 5: proportion of spam submitted throughout a week

Figure 6: proportion of posts submitted throughout a day

4.6 Activeness of Threads

Threads started by spam first posts are expected to be more active, since they are written to draw attention and exposure, while non-spam threads may or may not be created with such intent in mind.


One intuitive way to measure the activeness of a thread is to count the total number of posts in it, which is equal to 1 (the first post) + #replies. In the figure below, the numbers of posts in spam and non-spam threads are plotted. Clearly, spam threads tend to attract more replies, which could be either spam replies or non-spam replies.

Figure 7: #posts in spam threads vs normal threads

Another way to measure the activeness of a thread is by the number of clicks. This is one of the primitive attributes of threads we scraped, as described in table 4.[11] As shown below, spam threads seem to get more clicks than normal threads.

Figure 8: #clicks in spam threads vs normal threads

[11] Although not visible on the current web interface, the number of clicks is still available somewhere in the HTML of the threads.


4.7 Collusion between Spammers

Looking into the leaked spreadsheets, we notice that a few threads contain multiple spam posts submitted by different accounts, an indication of collusion between multiple spammers. Usually, these spammers would express similar opinions in the same thread to reinforce credibility, or it could just be the result of multiple spammers bumping the same thread[12] in an attempt to attract more attention to it.

[12] which is often started by a spam first post

Sometimes it could just be the same person submitting posts in a thread with different spammer accounts, but on the surface it can still be seen as collusion between multiple spammer accounts.

Figure 9: number of threads containing specific number of spams

In the figure above, we can observe that a few threads contain 2 or more spam posts. In fact, as much as 67% of the spam posts are in a thread with at least 2 spam posts.

As a concrete example, we excerpt the first 10 posts of a thread in which 7 posts are actually spam, posted by 5 unique spammer accounts.

準備衝 Galaxy Nexus 的來簽到吧

#1 woosawowo spam 這禮拜 Galaxy Nexus就要上市了


等好久了早就準備好要趕快入手可惜他沒在資訊展開賣阿 不然應該可以有很多優惠大放送

看到好多體驗文超級生火的

有沒有已經準備好銀彈要衝Galaxy Nexus 的一起來簽到吧 話說除了台哥大有預購之外

有沒有其他地方可以直接買空機阿 比如說三星旗艦店之類的

好像沒什麼消息

#2 甘草仔 spam

去過 Galaxy Nexus高峰會之後,

我認真的考慮入手 ICS,我真的覺得好棒喔 但這隻手機好像是台哥大獨賣 哭哭

#3 square.chen ham

可以單機購買啊 在配合信用卡12 期免利息...

#4 小籠包 spam

甘草仔別急,我覺得你應該會贏得手機 所以就不用買啦

對了我要衝啦

趕快讓 nexus one 退休了 今天問台哥大的人

居然一問三不知 只叫我留資料等候聯絡 說她要再問問看 真是不合格

我已經打算NP 去台哥大了

#5 TTW2010 ham 我本來有在台哥大預購 6 但是打電話去問出貨時間

台哥大表示15 出貨 16號之後才會到


所以我就退掉了

然後跑到台哥大門市去問能不能預購 門市說不能而且不一定每個門市都會到貨

所以...

我星期四打算先跑離家最近的台哥大 沒有就跑第二近的三星生活館 再沒有就跑去內湖家樂福的三星 應該就能買到吧

因為老實說好像不是那麼熱門

不討厭 HTC也買了不少 HTC, 但是對 H 粉超級厭惡, 那容不下” 見的心態真的很可怕

#6 woosawowo spam 也許通訊行會有貨吧我猜 不過我跟小籠包意見一樣 我覺得甘草大很有可能得獎阿 我應該也是會去三星旗艦店問問看吧 如果真的沒有的話就去台哥大看看 希望 15號就能到手

#7 jiantz spam 我想我還是先觀望一下 等 woo 大開箱了 15 Galaxy Nexus 16 iPhone4S 到底最後會選哪一支 現在也還是個謎阿

#8 甘草仔 spam

哈摟 小籠包 原來妳坐我旁邊壓 妳好妳好 希望承您貴言搂 WOO大也感謝您的讚美

不用觀望了啦 兩支一起買,先買NEXUS XD


我覺得預購已經來不及了,我想直接去沒市把玩實機,

有現貨可以考慮直接殺了XD 我可以廣告我的文章嗎 XD

#9 danadanad spam 我真的覺得這支蠻屌的耶

以前看 NEXUS S 都沒什麼感覺

但是這次GALAXY NEXUS可以很明顯的感受到 GOOGLE的誠

#10 noisycat ham

Galaxy Nexus確定只有兩個地方有賣,

一個是三星旗艦店 (空機),一個是台哥大!!

These 7 posts basically all took the same stand and more or less conveyed positive opinions about Samsung. Leveraging such collusive activities between spammers improves the performance of our spammer detection model by a great margin, as demonstrated in section 5.6.5.

5 Detection

In this section, we discuss some aspects of devising the detection models, including selecting an evaluation metric (section 5.1), how we split the dataset into training set and test set for posts and for user accounts (section 5.2), and the machine learning procedure (section 5.3). Three detection models will be constructed:

1. Spam detection for first posts (section 5.4)
2. Spam detection for replies (section 5.5)
3. Spammer detection (section 5.6)

where we add each type of features iteratively to see how the performance of the models progressively improves.


5.1 Evaluation Metric

In our dataset, the ratios of both spam posts and spammers are quite low. Therefore, accuracy shouldn't be the main metric to look at, since it is dominated by the majority non-spam/non-spammer class, about which we don't really care.

Let's walk through the reasoning with spammer detection; the same arguments also apply to spam detection:

High precision on the spammer class is desired because we don't want to falsely incriminate an innocent forum user as a spammer; high recall on the spammer class is also desired because we'd like to find as many opinion spammers as possible. Depending on the application, precision could be more important than recall, or vice versa. For instance, when the detection system is used as an initial filtering stage that narrows down the set of suspicious users for a later stage of manual classification, high recall might be preferred over high precision: misclassifying a spammer as a normal user completely rules out the possibility of identifying that instance correctly, while misclassifying a normal user as a spammer can still be corrected in the later stage of the pipeline.

Because no particular application is aimed at, we don't have a prior preference for either precision or recall[13]. Therefore, our evaluation metric of choice is the harmonic mean of precision and recall, also known as F-measure, on the spam/spammer class.
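Concretely, with P and R denoting precision and recall on the positive (spam/spammer) class, the metric is:

$F = \dfrac{2 \cdot P \cdot R}{P + R}$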

5.2 Data Splitting

We split our data instances into a training set and a test set. The data exploration discussed in section 4 was conducted only on the training set; moreover, model selection and parameter tuning are performed based on the results of 5-fold cross-validation on the training set. Never touching the test set until the final evaluations makes the evaluation results on the test set better reflect the expected real-world performance.

[13] To leverage the model for first posts in spammer detection, we actually prefer it to have higher precision for ‘internal use’, but we choose not to take this into consideration when selecting the evaluation metric.


5.2.1 Posts (for Spam Detection)

For spam detection, each instance in our dataset[14] is a post. These posts are assigned to either the training set or the test set according to their temporal order. Posts submitted between Jan 2011 and Dec 2011 are placed in the training set, and those submitted between Jan 2012 and May 2012 in the test set. The reason we didn't utilize all the posts from 2012 is that the ratio of spam posts drops drastically after May 2012 (only 30 spam posts in total), so we simply excluded them to keep the ratio of spam posts in the test set close to that of the training set.

[14] Here, and throughout this detection section, ‘dataset’ refers to the set of instances for our particular detection task, rather than the whole dataset we collected in section 3.

However, there is a problem: for many of the posts in the test set, there also exist posts by the same user in the training set. Under this circumstance, even if the final trained model performs well on the test set, it doesn't necessarily imply that the model is a good opinion spam detector. The model might just capture writing habits of the spammers that may or may not have an intrinsic connection to the spam activity. For example, a spammer might use certain words all the time in spam posts purely out of personal preference, which would cause the model to recognize such words as ‘spam keywords’ and thus possibly gain performance on the test set without really capturing the essence of opinion spam. Although this issue can't be completely eliminated, considering that some spammer accounts are shared by the spammers[15], we still try to mitigate it by removing from the test set all posts by user accounts that have posts in the training set, and call the resulting set ‘test set*’. As a result, no posts are submitted by the same user account across the training set and test set*.

[15] Here, ‘spammers’ refers to the actual human posters, rather than the spammer accounts as in most uses of ‘spammer(s)’ in this paper.

              #spam posts  #all posts  spam ratio
training set  1883         159432      1.12%
test set      1233         92552       1.33%
test set*     414          32932       1.26%

Table 10: training and test sets of posts

5.2.2 User Accounts (for Spammer Detection)

Similar to the previous section, we want to assign user accounts to the training set and the test set according to temporal order. The time of the account's registration comes to mind. However, the rationale for splitting the dataset by temporal order is that we want to evaluate how well the model detects spammers appearing in the future given information about spammers caught in the past[16], which is a likely scenario in real-world applications. Registration time does not always signify a user's active period. In this regard, the submission times of the user's posts are a more sensible choice.

[16] By replacing ‘spammers’ with ‘spam posts’, this becomes the rationale for splitting posts by temporal order in section 5.2.1.

Following the thinking in the previous section, users who submitted a post during the first period (Jan 2011 to Dec 2011) but not the second (Jan 2012 to May 2012) should obviously be put into the training set, and users who submitted a post in the second period but not the first should be assigned to the test set. But which set should users who submitted posts in both periods be assigned to? Since we are going to use the detection model for first posts to assist spammer detection in section 5.6.3, this set of users shouldn't be in the test set, as some of the spam first posts by these users might have been ‘peeked at’ by the model for spam post detection. Hence, these users are assigned to the training set.

              #spammers  #all users  spammer ratio
training set  215        17216       1.25%
test set      84         8603        0.98%

Table 11: training and test sets of users

5.3 Machine Learning

The machine learning procedure is conducted mainly with the help of the Scikit-Learn library (Pedregosa et al., 2011). We tried various learning algorithms, such as Logistic Regression, SVM with linear kernel, and SVM with RBF kernel from Scikit-Learn, as well as SVMperf. Most of the time, SVM with RBF kernel wins out by a non-negligible margin. SVMperf claims to directly optimize the F-measure (Joachims, 2005), but the resulting F-measure in our experiments is no better than that of SVM with RBF kernel from Scikit-Learn, while SVMperf takes a much longer time to train a model.

Therefore, we decide to stick with SVM with RBF kernel from Scikit-Learn, which is actually a Python wrapper for the widely used LibSVM (Chang and Lin, 2011), for the rest of our experiments. As suggested in Hsu et al. (2003), we scale each feature to zero mean and unit variance before feeding it to the SVM.

There are two primary hyperparameters, C and γ, to be tuned for SVM with RBF kernel. For this purpose, whenever a model is to be learned, we first run 5-fold cross-validation multiple times on the training set to facilitate a grid search over C and γ with F-measure as the metric to optimize. The grid to search is:

$(C, \gamma) \in \{\,10^x \mid -3 \le x \le 3,\ x \in \mathbb{Z}\,\} \times \{\,10^y \mid -5 \le y \le 2,\ y \in \mathbb{Z}\,\}$
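This tuning procedure maps naturally onto a scaler-plus-SVM pipeline and a grid search. Below is a minimal sketch using the current Scikit-Learn API (module paths differ from the 2014 version used in the thesis), with X_train and y_train as hypothetical names for the feature matrix and 0/1 spam labels:

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV

    # scale each feature to zero mean and unit variance, then SVM with RBF kernel
    pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="rbf"))])
    param_grid = {
        "svm__C": [10.0 ** x for x in range(-3, 4)],      # 10^-3 .. 10^3
        "svm__gamma": [10.0 ** y for y in range(-5, 3)],  # 10^-5 .. 10^2
    }
    search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
    # search.fit(X_train, y_train)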

5.4 Spam Detection for First Posts

As discussed in section 4.4, we'd like to train a detection model specifically for first posts in threads, so we only use the first posts from the training set and test set of posts introduced in section 5.2. The counts and ratios of spam for first posts are listed below for future reference.

              #spam posts  #all posts  spam ratio
training set  546          10951       4.99%
test set      208          5870        3.54%
test set*     70           3035        2.30%

Table 12: only considering first posts in threads

Due to the low ratio of spam posts in the training set, in the following experiments on spam detection for first posts, we randomly (but deterministically) select 60% of the non-spam posts to remove from the training set[17] beforehand, so as to speed up the learning procedure.

[17] Notice we're not downsampling the non-spam instances from the test sets.

5.4.1 Random Baseline

As an absolute baseline, the model predicts whether a first post is spam based on flipping a fair coin. As expected, the precision is about equal to the ratio of spam, which is 3.54% and 2.30% on test set and test set*, respectively. The recall is around 50%, reflecting the fact that a fair coin flip identifies a spam post correctly half of the time.



           precision  recall  F-measure
test set   3.43%      49.04%  6.42%
test set*  2.52%      55.71%  4.82%

Table 13: random baseline for spam detection for first posts

5.4.2 Bag-of-words

After performing Chinese word segmentation with Jieba[18] on the HTML-stripped, cleansed content of each post, we count the occurrences of each word in the training set and construct a ‘vocabulary’ from these words. Next, rare words with fewer than 5 occurrences are removed from the vocabulary, since these would produce sparse bag-of-words features and might cause overfitting. On the other hand, words appearing in over 30% of the posts are also removed, as these are likely to be stop words or the like. After the vocabulary is set up, we represent each post as a vector of occurrences of each word in the vocabulary, where the occurrences are normalized by the length of the post.
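A minimal sketch of this featurization, assuming a hypothetical list post_contents of cleaned post strings: min_df/max_df mirror the vocabulary pruning above, l1 normalization divides each word count by the post's total word count, and TruncatedSVD with the randomized solver stands in for the randomized PCA introduced next:

    import jieba
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import Normalizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.pipeline import make_pipeline

    def tokenize(text):
        # segment (mostly Chinese) text into words with Jieba
        return [w for w in jieba.cut(text) if w.strip()]

    bow_pipeline = make_pipeline(
        CountVectorizer(tokenizer=tokenize, min_df=5, max_df=0.3),
        Normalizer(norm="l1"),
        TruncatedSVD(n_components=150, algorithm="randomized"),
    )
    # reduced = bow_pipeline.fit_transform(post_contents)  # shape: (#posts, 150)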

In bag-of-words, each word in the vocabulary corresponds to a feature. Since a large number of features (words) could slow down training significantly and may cause overfitting, we apply randomized PCA (Halko et al., 2011) to the #posts × #words bag-of-words matrix to reduce the word dimension. The number of components to reduce to is tuned by looking at the average F-measure from 5-fold cross-validation on the training set, as plotted in the following figure.

Figure 10: values of F-measure as the #components in PCA changes

[18] https://github.com/fxsjy/jieba


The absolute performance shown in the plot might look unusually high. However, this is partially because we downsampled the non-spam posts in the training set, so the validation sets in 5-fold CV all have much higher ratios of spam posts than the test set; what we really care about is the relative performance. As shown in the plot, reducing to 50 components may cause too much information loss and thus deteriorate the average F-measure. On the other hand, too many components may cause some degree of overfitting, which also worsens the performance. The average F-measure is highest when the bag-of-words is reduced to 150 components, so we adopt this setting to train our model on the whole training set and see how it performs on the test set.

           precision  recall  F-measure
test set   62.89%     48.08%  54.50%
test set*  50.00%     51.43%  50.70%

Table 14: content bag-of-words features only (150 components)

The performance is actually decent, even though our observation on the subtlety of the spam posts in section 4.1[19] gave us a hunch that the contents might not provide strong clues about whether a first post is spam, since the contents of spam posts are well disguised.

Such a result makes us curious about what's happening under the hood. To dig deeper, we'd like to get the importance of each feature in order to observe what types of words are the decisive factors in the model's predictions. However, for a non-linear model like SVM with RBF kernel, there's no simple way of computing the importance of each feature. Nevertheless, by ‘falling back’ to a linear kernel, the model suffers around a 10% performance loss in F-measure on the test set, but we are able to see the relative importance of each word by inverse-transforming the coefficients through the PCA.
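Because TruncatedSVD involves no mean-centering, mapping a linear model's weights back to word space is a plain matrix product. A sketch, assuming hypothetical handles for a fitted CountVectorizer (vectorizer), TruncatedSVD (svd), and linear-kernel SVM (clf) from the earlier pipeline:

    import numpy as np

    # x_reduced = x_bow @ svd.components_.T, so the model's weights map back to
    # the vocabulary space as w_words = w_reduced @ svd.components_
    w_words = clf.coef_.ravel() @ svd.components_
    vocab = np.array(vectorizer.get_feature_names_out())
    strongest_spam_words = vocab[np.argsort(w_words)[-30:]]  # highest weights
    strongest_ham_words = vocab[np.argsort(w_words)[:30]]    # lowest weights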

The following figure is a word cloud containing the words with the highest coefficients (weights), that is, words that are the strongest spam indicators, where the font size of each word is positively correlated with its weight.

[19] Notice most of the examples listed in section 4.1 are replies, though.


Figure 11: words with the highest weights

The next word cloud is for the words with the lowest weights, that is, the strongest non-spam indicators.

Figure 12: words with the lowest weights


We can observe a distinctive difference between the two word clouds at first glance. The first is mainly about Samsung's top products (galaxy, nexus, note, sii) and user experiences (體驗, 看到, 覺得), with a focus on the multimedia aspect (照片, 拍照, 影片). The second word cloud is more about seeking help (問題, 解決, 無法), and involves more polite words (謝謝, 大大, 小弟) and technicalities (rom, 設定).

The previous bag-of-words features were based only on the contents of the posts, but much information also lies in the titles of the threads, so we create another 50 dimension-reduced bag-of-words features based on the titles, and combine them with the content ones to yield 200 features. We prefer not to mix titles and contents together because they may have distinct groups of ‘spam keywords’.

           precision  recall  F-measure
test set   59.12%     51.44%  55.01%
test set*  56.16%     58.57%  57.34%

Table 15: content and title bag-of-words features

With the addition of the title bag-of-words, a further improvement in F-measure can be seen.

The dimension-reduced bag-of-words features turned out to be surprisingly helpful. The model accomplishes over 55% in F-measure while the ratio of spam is only around 3% on the test sets for first posts. Compared to the random baseline, it boosts the F-measure by as much as 45%, which implies that the contents of posts actually give strong clues about whether a first post is spam. Although on the surface each spam post looks rather unsuspicious on its own, collectively, spam posts put more emphasis on certain topics than non-spam posts do, and our model trained with bag-of-words features was able to exploit this distinction.

5.4.3 Content Characteristics

A set of features derived from basic characteristics of the contents of the post is introduced.


feature         description
n_all           number of characters used in the post
n_words         number of words in the post (segmented by Jieba)
n_lines         number of lines in the post
n_hyperlinks    number of hyperlinks in the post
n_img           number of images added to the post
n_emoticon      number of emoticons used in the post
n_quote         number of quotations from previous posts
p_digit         proportion of digits
p_english       proportion of English characters
p_punct         proportion of punctuation characters
p_special       proportion of non-alphanumeric characters
p_wspace        proportion of white space characters
p_immediacy     proportion of first person pronouns
p_ntusd_pos     proportion of positive words in NTUSD
p_ntusd_neg     proportion of negative words in NTUSD
p_emoticon_pos  proportion of positive emoticons
p_emoticon_neg  proportion of negative emoticons

Table 16: description of not-so-obvious feature names

Regarding the naming of these features, the n_ prefix means ‘number of’, while the p_ prefix means ‘proportion of’ (divided by the number of characters in the post). Most features should then be self-explanatory.

We compute the symmetric KL divergence to find out which features exhibit the most different distributions between spam and ham. The formula of the symmetric KL divergence is:

$D_{\mathrm{KL}}\big(P_{\mathrm{spam}}(f) \,\|\, Q_{\mathrm{ham}}(f)\big) + D_{\mathrm{KL}}\big(Q_{\mathrm{ham}}(f) \,\|\, P_{\mathrm{spam}}(f)\big)$

where $P_{\mathrm{spam}}$ and $Q_{\mathrm{ham}}$ are the distributions of feature f over all spam first posts and all non-spam first posts, respectively. The higher the symmetric KL divergence, the more different the two distributions are, which makes the feature more useful in discriminating between spam and non-spam first posts.
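This quantity can be estimated from the empirical feature values of the two classes; a minimal sketch, where the bin count and smoothing constant are our own choices rather than the thesis's:

    import numpy as np
    from scipy.stats import entropy

    def symmetric_kl(f_spam, f_ham, bins=50, eps=1e-9):
        """Estimate D_KL(P||Q) + D_KL(Q||P) for one feature from two samples."""
        # histogram both samples over a shared set of bins
        edges = np.histogram_bin_edges(np.concatenate([f_spam, f_ham]), bins=bins)
        p, _ = np.histogram(f_spam, bins=edges)
        q, _ = np.histogram(f_ham, bins=edges)
        # smooth away zero bins, then normalize into probability distributions
        p = (p + eps) / (p + eps).sum()
        q = (q + eps) / (q + eps).sum()
        return entropy(p, q) + entropy(q, p)  # scipy's entropy(p, q) = D_KL(p||q)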


Figure 13: content characteristics features

The top four features that distinguish spam from non-spam first posts best are n_all, n_imgs, n_words, and n_lines, which are all related to the quantity of content. This is not surprising, because many of the spam first posts are essentially advertisements in disguise (e.g., unboxing posts and ‘positive experience with a Samsung product’ posts) and generally use lots of words and pictures to showcase Samsung products in an attempt to impress people.

Figure 14: number of images in spam and ham


Figure 15: number of words in spam and ham

On top of the bag-of-words features, we add these 17 numerical features that characterize the contents of the first posts. The resulting performance is shown below. F-measure increases by about 3% on both test set and test set*, so these features do seem to provide extra information that helps detect spam first posts.

           precision  recall  F-measure
test set   73.05%     49.52%  58.79%
test set*  64.91%     52.86%  60.32%

Table 17: bag-of-words and content characteristics

5.4.4 Submission Time and Thread Activeness

We are done adding content-centric features, so it's time to incorporate some non-content-centric ones.

As discussed in section 4.5, spam posts tend to be submitted more often during work time. To make use of this observation, we add a binary indicator feature for each hour of the day and each day of the week, 24 + 7 = 31 new features in total. A feature's value is 1 when the post was submitted during the hour or on the day it corresponds to, and 0 otherwise.

Moreover, we use the number of posts in the thread started by the first post as another feature, which serves as a measure of the activeness of the thread, as discussed in section 4.6.
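A minimal sketch of the 31 indicator features, under the assumption that the Unix timestamps should be interpreted in Taiwan local time (UTC+8):

    import numpy as np
    from datetime import datetime, timezone, timedelta

    TAIPEI = timezone(timedelta(hours=8))  # assumption: UTC+8 local time

    def time_indicators(unix_time):
        """24 hour-of-day indicators + 7 day-of-week indicators = 31 features."""
        t = datetime.fromtimestamp(unix_time, tz=TAIPEI)
        hour = np.zeros(24)
        hour[t.hour] = 1.0
        day = np.zeros(7)
        day[t.weekday()] = 1.0  # Monday = 0 ... Sunday = 6
        return np.concatenate([hour, day])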


           precision  recall  F-measure
test set   72.37%     52.88%  61.11%
test set*  66.67%     57.14%  61.54%

Table 18: bag-of-words, content characteristics, submission time, and thread activeness

By incorporating these non-content-centric features, we see a further improvement in F-measure on both test set and test set*.

5.4.5 Sentiment Scores Toward the Brands

The main objective of the covert marketing campaign is to promote a certain brand and sometimes denounce its competitors' brands to give it an unfair edge. Hence, we expect spam posts to show a positive attitude toward Samsung, and possibly a negative attitude toward its competitors.

We devise a simple method to capture the sentiment toward brands in posts. Basically, we just add up the polarities of the sentiment words from the NTU sentiment dictionary (NTUSD) (Ku et al., 2006) and of the emoticons near mentions of a brand or its products. For preciseness, the pseudocode producing the sentiment scores is presented in Algorithm 1.

The following table shows the number of spam posts in the training set by the polarity of our estimated sentiment scores toward the brands. The result is not what we hoped for, since there are many posts with negative polarity toward Samsung and many with positive polarity toward HTC. Even worse, the #positive/#negative ratio for Samsung is actually lower than for HTC.

brand    positive  negative  neutral  no mention
Samsung  504       312       379      688
HTC      110       62        111      1600

Table 19: number of spam posts with different polarities

After adding the sentiment scores toward Samsung and HTC, instead of showing any improvement, the F-measure dropped a little on both test set and test set*.


Algorithm 1 Compute Sentiment Score Toward the Brands

function AllBrandsSentimentScores(content)
    for β ← [Samsung, HTC, . . .] do
        scores[β] ← BrandSentimentScore(content, β)
    return scores

function BrandSentimentScore(content, β)
    B ← list of aliases of β                  ▷ manually collected
    P ← list of aliases of β's products       ▷ described in section 3.3
    score ← 0
    for α ← B ∪ P do                          ▷ longest aliases first
        if α is in content then
            S ← the sentence containing α plus the next one
            score ← score + SegmentSentimentScore(S)
    return score

function SegmentSentimentScore(S)
    pw ← #(NTUSD positive words in S)         ▷ longest matches first
    nw ← #(NTUSD negative words in S)         ▷ longest matches first
    pe ← #(positive emoticons in S)
    ne ← #(negative emoticons in S)
    score ← pw − nw + pe − ne
    return score
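For reference, a direct Python transcription of Algorithm 1 might look like the sketch below; the alias lists and sentiment lexicons are hypothetical inputs, and sentence splitting and lexicon matching are simplified to regular-expression splits and substring counts:

    import re

    def split_sentences(text):
        # naive splitter on common Chinese/Western sentence terminators
        return [s for s in re.split(r"[。！？!?\n]", text) if s]

    def segment_sentiment_score(seg, ntusd_pos, ntusd_neg, emo_pos, emo_neg):
        pw = sum(seg.count(w) for w in ntusd_pos)
        nw = sum(seg.count(w) for w in ntusd_neg)
        pe = sum(seg.count(e) for e in emo_pos)
        ne = sum(seg.count(e) for e in emo_neg)
        return pw - nw + pe - ne

    def brand_sentiment_score(content, aliases, lexicons):
        sentences = split_sentences(content)
        score = 0
        for alias in sorted(aliases, key=len, reverse=True):  # longest first
            for i, sent in enumerate(sentences):
                if alias in sent:
                    # the sentence containing the alias plus the next one
                    seg = sent + (sentences[i + 1] if i + 1 < len(sentences) else "")
                    score += segment_sentiment_score(seg, *lexicons)
                    break  # one segment per matching alias, as in Algorithm 1
        return score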
