National Chiao Tung University
Institute of Computer Science and Engineering
Master's Thesis
Mining Interest Topics from Plurk
Student: Yi-Chien Lee
Advisor: Prof. Shi-Chun Tsai
A Thesis
Submitted to Institute of Computer Science and Engineering College of Computer Science
National Chiao Tung University in partial Fulfillment of the Requirements
for the Degree of Master
in
Computer Science
July 2012
Hsinchu, Taiwan, Republic of China
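Abstract (in Chinese)
In recent years, with the rapid growth of social networking services, more and more users make new friends through micro-blogging services. Doing so, however, runs into a problem. Suppose you see a micro-blog user's profile picture, find her to be the type you like, and want to get to know her. You would first have to read through most of her messages to get a rough idea of which topics interest her before attempting a conversation. Because most micro-blogging services do not provide a personal information page like Facebook's, strangers cannot learn a user's tastes by reading information the user has volunteered. Moreover, the volume of micro-blog posts is large, so reading all of someone's messages to infer his or her interests is difficult. Furthermore, if the user you want to meet has not made her messages public, getting acquainted is even harder, because you have no way to learn which topics she is interested in.
To solve these problems, we designed an interest mining system for the Plurk micro-blogging service. The system quickly summarizes the keywords a subject has posted and visualizes the subject's friendship network. If the subject has set his or her timeline to private, i.e., the messages are not public, we infer the topics and keywords of interest by aggregating the messages of the subject's friends. The derived keywords can also be used for applications such as personalization, advertising, and friend recommendation.
To collect data from Plurk quickly, we developed a distributed data collection framework based on ZeroMQ and deployed it on multiple machines to increase collection speed. In addition, because the performance of the Plurk API library for Python is unsatisfactory, we removed its bottlenecks and greatly increased collection efficiency by replacing the JSON library, improving HTTP connection management, and writing an OpenSSL extension to accelerate HMAC-SHA1 computation.
ABSTRACT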
In recent years, people have started to make friends through micro-blogging services. However, it is difficult to read every message posted by someone you are interested in but not familiar with in order to find out what he/she is interested in and start a conversation. Furthermore, unlike blogs or Facebook, most micro-blogging services do not provide a profile (self-description) page on which users can describe themselves and their interests.
To address this demand, we build an online Social Networking Service Discovery (SNSD) system for Plurk users (plurkers) that finds a plurker’s interest topics/keywords and relationships/connections. The results are presented graphically in a web browser. Applications of the derived interests and relationships/connections include friend recommendations and personalized advertisements.
To enhance crawling performance, we develop a distributed crawling system based on the ZeroMQ messaging protocol and deploy it on multiple machines to crawl data from Plurk. In addition, we patch the Plurk API library for Python to enhance throughput by replacing the standard JSON library with a high-performance one, optimizing HTTP connections, and writing Python C extensions to accelerate HMAC-SHA1 computation.
Acknowledgments
I would like to thank my parents for offering me an opportunity to accomplish this thesis. I am greatly indebted to Dr. Shi-Chun Tsai, my advisor, for his patience, guidance and encouragement. I also wish to thank Dr. Wen-Guey Tzeng, Dr. Shih-Kun Huang, Dr. Tyng-Ruey Chuang, Dr. Ying-ping Chen, Dr. Tzong-Han Tsai, Dr. Cheng-Zen Yang, Dr. Yi-Yu Liu, Mr. Jim Huang, Dr. Min-Zheng Shieh, Mr. Min-Chuan Yang, Mr. Chuan-Yu Tsai, Mr. Min-Cheng Chan, Mr. Huai-Sheng Huang, Mr. Chun-Yuan Cheng and all the other members in the CCIS research group and CSCC for sharing their wisdom with me.
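Acknowledgments (in Chinese)
First, I sincerely thank my advisor, Dr. Shi-Chun Tsai. His attentive teaching allowed me a glimpse into the depths of graph theory and algorithms and pointed me in the right direction, and I benefited greatly from it. His rigor in scholarship is a model for my generation.
I also thank Plurk founder 雲惟彬 and Jserv of 0xlab for their advice, and Prof. Tzong-Han Tsai of Yuan Ze University and Chun-Yuan Cheng for their support. With your help, this thesis became more complete and rigorous.
I thank my seniors Min-Chuan Yang, Min-Zheng Shieh, Min-Cheng Chan, and Huai-Sheng Huang for assisting my research and for always resolving my doubts when I was lost, and I thank my classmate Chuan-Yu Tsai for his help; congratulations to us on making it through these two years. Nor can I forget 方智誼 and 邱韜瑋 of the laboratory and my juniors 許伯羽, 游傑, and 林宏昱 of the department computer center; I will remember your assistance.
I thank my close friends 馬安妤, 李金樺, 黃暄仁, and 許健聖 for encouraging me when I was at a low point and helping me regain my passion for research.
Because there are too many people to thank, let me just thank Girls' Generation! Finally, I dedicate this thesis to my beloved parents.
Contents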
1 Introduction
1.1 Motivation
1.2 Related Work
1.3 Challenges
1.4 Approach
1.4.1 Community Detection
1.4.2 Data Collection
1.5 Results
1.6 Thesis Structure
2 System Architecture
2.1 Overview
2.1.1 Social Networking Service Discovery (SNSD) System
2.1.2 Distributed Crawling System
2.2 SNSD System Design and Architecture
2.3 Crawling System Design Considerations
2.3.1 Concurrent Programming
2.3.2 Messaging Protocol
2.3.3 Data Serialization Format
2.3.4 Datastore
2.3.5 Task Queuing for Crawling
2.3.6 Security
2.4 Distributed Crawling System Architecture
2.4.1 System Architecture
2.4.3 Task Queuing
3 Implementation Details
3.1 Data Collection
3.1.1 Overview
3.1.2 Plurk API and Library
3.1.3 Library Optimization
3.2 Preprocessing
3.2.1 A Plurk and Its Data
3.2.2 Elements of a Plurk
3.2.3 URL Filtering Mechanism
3.2.4 Tokenization
3.2.5 Plurks Preprocessing
3.3 Community Detection
3.3.1 Snowball Sampling
3.3.2 Modularity and Louvain Algorithm
3.3.3 Filtering
3.4 Interest Hierarchy Model
3.5 Datastore Architecture
3.6 Celery Task Queue
3.7 Celery Cluster Layout and Worker Configurations
3.8 Delta Cluster Deployment
4 Experiments
4.1 Environment
4.2 Performance Benchmarks
4.2.1 Python JSON Libraries
4.2.2 Python Serialization
4.2.3 HMAC-SHA1
4.2.4 Python Plurk API Library
4.2.5 Redis Connection
4.3 Interest Derivation
5 Conclusions and Future Works
Bibliography
Appendix A Diskless Linux Cluster Installation
A.1 Base System
A.2 Network Block Device (NBD) Server
A.3 DHCP and PXE Server
Appendix B MongoDB Cluster Installation
B.1 MongoDB Installation
B.2 Replica Sets
B.3 Sharding
List of Tables
3.1 Comparison of hierarchical data model design from Ref. [45]
4.1 Machine specifications and roles
List of Figures
1.1 Plurk, Weibo, and Twitter daily visitors count graph (Taiwan only)
1.2 Alexa traffic rank for Plurk
1.3 Visitors statistics by Alexa
1.4 Plurk timeline
1.5 A sample plurk
1.6 Plurk profile: extra information
1.7 A Plurk user with his plurks public
1.8 A Plurk user with his plurks private
1.9 Interests derivation from public or private plurkers
1.10 Google Analytics for Go!Plurk
1.11 Go!Plurk system flow
1.12 Sample news articles for training
1.13 Sample plurks for training
1.14 Interest pie chart generated by Go!Plurk
1.15 Architecture for two-tier parallel crawler
1.16 SNSD website overview
2.1 Components in the SNSD system and crawling system
2.2 Work flow for generation of interest keywords hierarchy
2.3 GIL Behavior
2.4 Computation bound
2.5 Thread
2.6 Event Loop
2.7 Coroutine
2.10 Transports of ZeroMQ
2.11 RPC over AMQP
2.12 RPC over ZeroMQ
2.13 Various message formats within the crawling system
2.14 RPC over AMQP on Crawling
2.15 RPC over ZeroMQ on Crawling
2.16 Queuing work flow
2.17 Distributed crawling architecture
2.18 Messaging patterns between components
2.19 Work flow of crawling system
2.20 States and Redis data type of the task queue
3.1 Plurk mobile view
3.2 HTTP persistent connection
3.3 Elements of a plurk
3.4 URL filtering for a file
3.5 URL filtering for an image
3.6 URL filtering for a Youtube link
3.7 URL filtering for a web page
3.8 Demonstration of the normalize function
3.9 Demonstration of the tokenize function
3.10 Demonstration of the preprocessing
3.11 Visualization of the steps of Louvain algorithm
3.12 Plurk profile: general information
3.13 A sample Closure Table
3.14 Table Relationships
3.15 MongoDB cluster architecture
3.16 MongoDB cluster configuration
3.17 Celery cluster architecture
3.18 OpenStack security group configurations
3.19 CS workstation cluster architecture
3.20 Servers and racks donated by Delta, Inc.
3.22 Delta server with VGA card
3.23 Servers installed in rack
3.24 Delta cluster architecture
4.1 Encoding performance
4.2 Decoding performance
4.3 Big data performance
4.4 Memory usage of JSON libraries
4.5 Serialization performance
4.6 Memory usage of serialization libraries
4.7 Encoded data size
4.8 HMAC-SHA1 performance
4.9 Original API library
4.10 Enhanced API library
4.11 Improvements
4.12 Redis binding modes
4.13 Redis remote connection types
4.14 Result of interest derivation
4.15 Interest keywords hierarchy
4.16 Interest tag cloud
4.17 View communities in pack layout
4.18 View communities in treemap layout
4.19 Focus community on pack layout
4.20 Focus community on treemap layout
List of Algorithms
1 URL filtering mechanism
2 Tokenization process
3 Louvain algorithm
Chapter 1
Introduction
1.1 Motivation
Social networking services on the Internet can be traced back to the mid-1990s, when providers such as Geocities launched generalized online communities that offered two forms of interpersonal interaction: chat rooms and personal web pages.
Rapid development of the Internet led to the next generation of social networking services in the late 1990s through the early 2000s. Two notable features of this generation are user profiles and blogs.
User profiles allow users to define lists of “friends” and search for other users with similar attributes [71]. Active providers in this period include SixDegrees.com (1997-2001), Friendster (2002), MySpace (2003), LinkedIn (2003) and Facebook (2004).
Blogs emerged in the late 1990s. A blog (a portmanteau of the term web log) [12] is a discussion or information site published on the Internet. Most blogs operate in an interactive manner, meaning that visitors are allowed to leave comments or even messages to each other. Bloggers not only produce content to post on their blogs but also build social relations with their readers and other bloggers [28]. In that sense, blogging can be regarded as a form of social networking.
Microblogging, “the SMS of the Internet” [24], is a broadcast medium in the form of blogging. A microblog differs from a traditional blog in that its content is much smaller in size. For example, Twitter and Plurk enable their users to write and read text-based messages of up to 140 characters. Most social networking websites offer their own microblogging feature via “status updates” [92]. Leading micro-blogging providers include Twitter (2006), Facebook, Sina Weibo (2009), and Plurk (2008).
The use of social networks for making friends and keeping in touch with them has recently become popular [16, 38, 27]. Micro-blogging services such as Plurk, Twitter, and Weibo are popular in Taiwan. According to Alexa [4], Plurk ranked 45th in Taiwan and 2,014th worldwide on June 1, 2012, as shown in Figures 1.2 and 1.3. These statistics indicate that Plurk is an active social networking service, especially in Taiwan.
Furthermore, according to Google Trends on June 1, 2012 (Figure 1.1), Plurk is among the most popular micro-blogging services in Taiwan, probably owing to its user-friendly “Timeline” interface, shown in Figure 1.4. This study is therefore based on the Plurk community. For example, freshman students exchange their Plurk accounts via bulletin board systems (BBS) to get familiar with each other, and open source developers share their Twitter and Plurk accounts in their presentation slides so that audiences can contact them with comments or questions about their projects.
Most micro-blogging services allow users to make two types of relationships: friend and follower. The friend relationship requires that both individuals confirm they are friends while the follower relationship can be established without confirmation. As such, followers may not be connected to the target individual in real life.
In Plurk, an individual’s profile information (Figure 1.6) is not publicly available. Users can set their conversations (plurks, as shown in Figure 1.5) to be public (Figure 1.7) or private (Figure 1.8). Private plurks can be viewed only by specific users or by friends of the poster; anonymous users and followers cannot view them.
Given these constraints, it is difficult to get to know someone through plurks even when all of his/her plurks are public, as shown in Figures 1.9(a) and 1.9(b). However, even when we know nothing about a person, affiliation and interest information can be derived from his/her friends in order to conjecture who he/she is or what he/she is interested in, as shown in Figure 1.9(c).
Based on this hypothesis, we build an online analysis system that, given a Plurk account name, finds out what the account owner might be interested in.
After our system generates the interest topics/keywords of someone you are interested in, you can use the information to introduce him/her to friends who share the same interests. Search engine providers and commercial companies can use our system to build user profiles for customized services and advertising.
Figure 1.1: Plurk, Weibo, and Twitter daily visitors count graph (Taiwan only)
Figure 1.2: Alexa traffic rank for Plurk
Figure 1.4: Plurk timeline
Figure 1.6: Plurk profile: extra information. Panels: (a) Additional information, (b) Interests, (c) Schools & Work
Figure 1.7: A Plurk user with his plurks public
1.2 Related Work
The Go!Plurk project [47], developed by Ken Lee, Bryan Cheng, and Sean Lee, is the first service to find users’ interest topics based on the content they posted on Plurk. In this thesis, we extend and enhance the preliminary work of the Go!Plurk project.
Go!Plurk was announced via Plurk on June 15, 2009. In the following week, at least 13,527 unique users visited our website and we analyzed more than 30,000 Plurk accounts (Figure 1.10).
The project was reported by United Daily News (UDN) [95] and by the well-known blogger Briian [94] in June 2009, and PChome magazine introduced it in August of the same year. We used 300 news articles from Yahoo! Taiwan and plurks from the top 100 active plurkers as training sources, which were classified into ten pre-defined categories: chitchat, delicacies, education, lifestyle, movies, music, drama, sports, technology, and travel. We used CKIP [58] from Academia Sinica as the Chinese tokenization engine and defined a list of reserved lexical categories for filtering the tokens returned by CKIP, as shown in Figure 1.12 and Figure 1.13. Figure 1.11 depicts an overview of the Go!Plurk work flow.
In a simple Go!Plurk test, we sample the 20 latest plurks from the tester, use CKIP to extract Chinese tokens, feed the filtered tokens into a Naïve Bayes classifier to calculate a score for each category, and finally render a pie chart to visualize the interest distribution, as shown in Figure 1.14.
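As a rough illustration of the scoring step, the sketch below trains a multinomial Naïve Bayes model with add-one smoothing on already tokenized samples and returns a log-score per category. The function names `train` and `score`, the toy categories and tokens, and the smoothing choice are assumptions made for illustration; they are not taken from the Go!Plurk implementation.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Build a model from (category, tokens) pairs (hypothetical helper)."""
    token_counts = defaultdict(Counter)  # category -> token frequencies
    doc_counts = Counter()               # category -> number of samples
    vocabulary = set()
    for category, tokens in samples:
        doc_counts[category] += 1
        token_counts[category].update(tokens)
        vocabulary.update(tokens)
    return token_counts, doc_counts, vocabulary


def score(tokens, model):
    """Return a log-probability score for every category."""
    token_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_tokens = sum(token_counts[category].values())
        log_p = math.log(doc_counts[category] / total_docs)  # class prior
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen tokens.
            count = token_counts[category][token] + 1
            log_p += math.log(count / (total_tokens + len(vocabulary)))
        scores[category] = log_p
    return scores


model = train([
    ("music", ["concert", "album"]),
    ("sports", ["match", "goal"]),
])
scores = score(["album", "concert"], model)
best_category = max(scores, key=scores.get)
```

Exponentiating and normalizing the scores would give per-category shares of the kind a pie chart displays.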
Although this service is popular with Plurk communities, it has several known issues and limitations. First, if the tester sets his/her plurks private, we cannot obtain what he/she said in order to analyze his/her interests. Second, the quality and quantity of the training articles are poor because of the short training period and limited labor hours. Third, the ten pre-defined categories are not general enough to represent interests, and the flat structure prevents users from seeing details. Lastly, to simplify the implementation, we sample only the 20 latest plurks from the tester via the RSS feed provided by Plurk, and no mechanism for handling plurk content containing URL links was implemented. As a result, we cannot obtain enough tokens to represent the tester.
Given these issues and limitations, in this thesis we try to increase the accuracy of the prediction results, even when the tester sets his/her plurks private, by collecting as many public plurks as we can to expand the training scale, applying an automatic training process, and deriving interest information from one’s friends.
Figure 1.10: Google Analytics for Go!Plurk
Figure 1.12: Sample news articles for training
Figure 1.14: Interest pie chart generated by Go!Plurk
1.3 Challenges
In order to enhance and improve the Go!Plurk system, we have to collect plurk data into a local datastore efficiently, and the datastore must provide good durability and excellent read performance for online retrieval. A traditional database management system (DBMS) is not suitable for managing data at this scale: there are more than one billion plurks to crawl. As such, we need a database solution for big data, which is crucial to efficient web crawling.
Moreover, in order to derive a plurker’s interest topics when his/her conversations are private, we have to compute the community partition for the plurker and extract the public plurks posted by the members of that partition to derive the associated interest topics and keywords. Since the community detection problem is known for its high computational complexity, we have to employ a proper algorithm and optimize its performance for online service.
1.4 Approach
1.4.1 Community Detection
Girvan and Newman [30] presented the Girvan-Newman algorithm, which detects communities using the graph-theoretic measure of betweenness. The algorithm returns results of reasonable quality but runs slowly: its worst-case time is $O(m^2 n)$ on a network of $n$ vertices and $m$ edges, or $O(n^3)$ on a sparse network. This computational complexity makes it impractical for detecting communities in large networks.
Newman [60] proposed an enhanced community detection algorithm that employs modularity [59, 61] as the objective function to maximize. Modularity is a metric that measures the quality of a particular division of a network into communities. For a weighted network $G$, the modularity is defined as

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(C(i), C(j)),$$

where $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of the edges attached to vertex $i$, $m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the total edge weight of $G$ (the number of edges when $G$ is unweighted), $C(i)$ is the community of vertex $i$, and $\delta(C(i), C(j))$ equals 1 if $C(i) = C(j)$, i.e., $i$ and $j$ are in the same community, and 0 otherwise.
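To make the definition concrete, the following sketch evaluates the modularity Q of a given partition of a small undirected weighted graph. The function name `modularity`, the edge-list input format, and the example graph (two triangles joined by one edge) are assumptions for illustration only; the sketch assumes there are no self-loops and groups the double sum by community rather than iterating over all vertex pairs.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition.

    edges     -- list of (u, v, weight) tuples, one per undirected edge
    community -- dict mapping each vertex to its community label
    """
    m = 0.0                              # total edge weight
    strength = defaultdict(float)        # k_i: weighted degree of vertex i
    internal = defaultdict(float)        # sum of A_ij inside each community
    for u, v, w in edges:
        m += w
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            # The double sum counts both (u, v) and (v, u).
            internal[community[u]] += 2 * w

    community_strength = defaultdict(float)  # sum of k_i per community
    for vertex, k in strength.items():
        community_strength[community[vertex]] += k

    two_m = 2 * m
    return sum(
        internal[c] / two_m - (community_strength[c] / two_m) ** 2
        for c in community_strength
    )


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 1),
         (3, 4, 1), (4, 5, 1), (3, 5, 1),
         (2, 3, 1)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "a" for v in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Splitting the graph along the bridging edge gives a higher modularity than placing every vertex in a single community, which is the behavior a modularity-maximizing algorithm exploits.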
Exact modularity maximization searches all possible divisions of a network for the one with the highest modularity value, which is considered intractable [14]. Newman [62] then proposed an approximate optimization algorithm, similar to his previous work, whose worst-case running time is $O((m + n)n)$, or $O(n^2)$ on a sparse network.
According to Fortunato [26], the computational complexity of the Louvain algorithm [85] is $O(m)$. The algorithm is extremely fast, and graphs with up to $10^9$ edges can be analyzed in a reasonable time on current computational resources. Therefore, we use the Louvain algorithm to detect community partitions in this thesis; its details are given in Section 3.3.2.
1.4.2 Data Collection
Chau [15] presented a framework which guarantees that no redundant crawling occurs while parallel crawlers are executed against online social networks. He also demonstrated how to employ parallel crawlers to improve crawling performance for online social networks, including LinkedIn and Friendster, through a centralized queue backed by a MySQL database, as shown in Figure 1.15. The crawler architecture is based on two-tier parallelism: the coordinator (scheduler) schedules tasks for multiple agents in parallel, and each agent in turn employs multiple threads for crawling. This architecture tolerates simultaneous failures of member crawlers. However, the protocol between crawler agents and the scheduler, the implementation of the crawler, and the datastore design for storing a large number of records are not revealed.
Figure 1.15: Architecture for two-tier parallel crawler
the basic usage with single threading. As the Twitter API has a query rate limitation, Kwak employed 20 machines with different IP addresses and self-regulated the collection rate at 10,000 requests per hour. However, the Twitter social network contains billions of tweets, millions of user profiles, and tens of billions of user relationship connections. We therefore have to employ more efficient ways to crawl data from social network service providers.
1.5 Results
We build an online social networking service discovery (SNSD) system for Plurk users (plurkers) to find out their interest topics/keywords and relationships. The results can be viewed on a website, as shown in Figure 1.16. In addition, we develop a new distributed crawling framework based on the ZeroMQ messaging protocol and deploy it on several machines to crawl data from Plurk. Finally, we patch the Plurk API library for Python to enhance throughput by replacing the standard JSON library with a high-performance one, optimizing HTTP connections, and writing Python C extensions to accelerate HMAC-SHA1 [48] computation.
1.6 Thesis Structure
The remainder of the thesis is organized as follows. Chapter 2 introduces the social networking service discovery (SNSD) system and the distributed crawling system, along with their high-level design. Chapter 3 describes the implementation details. Chapter 4 summarizes the results of the implemented enhancements and illustrates the website built for visualizing the output of the SNSD system. Chapter 5 discusses future work and concludes the thesis.
Chapter 2
System Architecture
2.1 Overview
In this chapter, we introduce two systems: (1) a social networking service discovery (SNSD) system for discovering users’ relationships and interests from Plurk and (2) a distributed crawling system for crawling data from the Internet efficiently. The architecture of these two systems is depicted in Figure 2.1.
2.1.1 Social Networking Service Discovery (SNSD) System
Recent studies [89, 49] indicate that micro-blogging services such as Twitter and Plurk are used as news aggregation services, whereas ties on Facebook are driven by personal contacts. That is, micro-blogging networks tend to be clustered by communities of interest, and geography is less significant; on Facebook, offline relationships drive friendship. As such, we can discover users’ interest topics via community detection, because connections between micro-blogging users are probably driven by shared interests rather than offline relationships. Based on this hypothesis, we propose a framework that discovers interest topics for a micro-blogging user from his/her conversations and, when the conversations are private, by aggregating interest information from the user’s communities. Figure 2.2 depicts the overall work flow for generating the interest keyword hierarchy.
2.1.2 Distributed Crawling System
Distributed crawling is a distributed computing technique that employs many computers to fetch data from the Internet. For example, Internet search engines such as Google and Yahoo! built geographically distributed server farms to fetch web pages and build indices of the Internet.
Figure 2.1: Components in the SNSD system and crawling system
In this thesis, we deploy several computers as crawlers that call the Plurk API to request Plurk users’ profiles, relationships, and plurks. In addition, we use these crawlers as load balancers to fetch the URLs (Uniform Resource Locators) contained in plurks, which extends the available content while avoiding being blocked by the service provider.
Although previous studies [15, 49] proposed architectures for parallel crawlers, they did not provide implementation details such as the protocol and the datastore. Our design and implementation are described later in this thesis.
2.2 SNSD System Design and Architecture
In this section, we describe the framework that analyzes interest information for micro-blogging users even when their conversations are private.
Firstly, we collect as much of the users’ conversations and relationship networks as possible. In general, micro-blogging service providers (MBSPs) provide application programming interfaces (APIs) for users to access data; however, most MBSPs limit the request rate per API key or IP address. We cover mechanisms that allow distributed crawlers to access data from MBSPs beyond the rate limitation in Chapter 3.
Secondly, we apply a community detection algorithm when the requested user’s conversations are private. According to the hypothesis above, micro-blogging networks are clustered by communities of interest. We use this idea to derive interest information from communities when conversation data for interest analysis is not available.
Thirdly, we tokenize the incoming conversation and response data and then apply a syntactic filter to remove stop words and uncommon tokens. In western languages, words in a sentence are separated by spaces, so tokenization only requires splitting the data on spaces and punctuation marks such as periods and commas. Chinese text, however, has no explicit word boundaries, and each character is a fundamental linguistic unit, so there is no simple way to tokenize it. Therefore, we have to apply Chinese tokenization algorithms.
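The whitespace-and-punctuation tokenization described for western languages can be sketched as follows. The stop-word list, the `tokenize` function name, and the minimum token length are illustrative assumptions; the sketch does not cover Chinese text, which needs a dedicated segmenter, nor the filtering of uncommon tokens.

```python
import re

# Illustrative stop-word list (an assumption, far smaller than a real one).
STOP_WORDS = frozenset({"a", "an", "the", "is", "to", "of", "and", "in"})


def tokenize(text, stop_words=STOP_WORDS, min_length=2):
    """Split space-delimited text into lowercase tokens and drop stop words."""
    # Runs of letters, digits, and apostrophes are treated as words, so
    # spaces and punctuation marks act as separators.
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [t for t in tokens
            if t not in stop_words and len(t) >= min_length]


tokens = tokenize("The crawler is fast, and the queue is reliable.")
```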
Lastly, we merge the interest tags, whether generated from the user’s conversations or derived from communities, and return them in a hierarchical structure according to the pre-defined interest hierarchy. The previous step yields various interest tags, but they are not suitable for direct visualization because, taken individually, they carry little meaning. Thus, we summarize distinct interest tags into a formatted hierarchical structure so that users can view the results easily. In addition, if the interest tags are derived from communities, the system renders an additional community graph to indicate where the tags come from.
Figure 2.3: GIL Behavior
2.3 Crawling System Design Considerations
2.3.1 Concurrent Programming
CPython, written in C, is the default Python interpreter. Because the interpreter is not fully thread-safe, it protects its internal state with the global interpreter lock (GIL), a mutex that allows only one thread to execute Python bytecode at any given moment, as shown in Figure 2.3. The GIL prevents multi-threaded CPython programs from fully utilizing all processors in a multi-processor system, so CPU-bound multi-threaded programs hit a computational bottleneck while processors sit idle, as shown in Figure 2.4(a).
The multiprocessing module, a process-based counterpart of the threading interface, has been available since CPython version 2.6 [40]; it side-steps the GIL by using subprocesses instead of threads. Processes, however, must use interprocess communication (IPC) to communicate with each other, which makes this a much heavier solution.
The GIL is released on blocking I/O: when a thread is forced to wait, another thread in the “ready” state is chosen to run and enters the “running” state, as shown in Figure 2.4(b). Therefore, I/O-bound Python programs are generally advised to use the threading module, while CPU-bound programs are better served by the multiprocessing module. Nevertheless, the threading solution does not scale to the C10K problem [46], which refers to handling ten thousand concurrent connections. Several I/O models that address this problem are described below; we choose Gevent for this thesis.
Figure 2.4: Computation bound. Panels: (a) CPU Bound Tasks, (b) I/O Bound Tasks
Figure 2.5: Thread
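The effect of releasing the GIL on blocking calls can be observed with a small experiment. In the sketch below, `time.sleep` stands in for a blocking network request (an assumption made so the example needs no network access), and the helper names `fake_io` and `run_threads` are hypothetical. Because sleeping threads do not hold the GIL, four 0.2-second waits overlap instead of running back to back.

```python
import threading
import time


def fake_io(delay, results, index):
    """Simulate a blocking I/O call; CPython releases the GIL while waiting."""
    time.sleep(delay)
    results[index] = delay


def run_threads(count=4, delay=0.2):
    """Run `count` simulated I/O calls in threads and time them."""
    results = [None] * count
    threads = [
        threading.Thread(target=fake_io, args=(delay, results, i))
        for i in range(count)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    return elapsed, results


elapsed, results = run_threads()
# Running the four waits one after another would take about 0.8 seconds;
# with threads the elapsed time is expected to stay close to a single delay.
```

Replacing the sleep with a pure-Python computation would remove this benefit, since the threads would then compete for the GIL.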
Blocking sockets with single thread
This model is the simplest implementation with one loop in one process, but it can only accept one connection at a time.
Blocking sockets with multi-thread
In order to accept multiple connections at the same time, this model creates a new thread for each connection request. Although it can deal with multiple connections, it is inefficient: when handling a massive number of concurrent connections, it spends most of its CPU time on context switching.
Figure 2.6: Event Loop
Non-blocking sockets by event-driven
In order to reduce the context-switching overhead, this approach creates a loop that waits for I/O events and executes the registered handler associated with each event, as shown in Figure 2.6. This approach is also called event-driven programming. For example, Twisted [102] is a Python networking framework that uses this approach to accomplish non-blocking asynchronous I/O. The main benefit of this approach is less context switching, but it complicates the program because multiple events might be raised simultaneously and the logic is scattered across handlers.
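A minimal event loop of this kind can be sketched with the standard `selectors` module. The example below is an illustration, not the mechanism used by Twisted: it registers one handler per socket and dispatches read events from a single thread. The function name `run_event_loop` is hypothetical, and local socket pairs stand in for real network connections.

```python
import selectors
import socket


def run_event_loop(messages):
    """Send each message over its own socket pair and collect them
    through a single-threaded, event-driven loop."""
    selector = selectors.DefaultSelector()
    received = []
    pairs = []

    def on_readable(connection):
        # Handler registered for read events on one connection.
        received.append(connection.recv(1024))
        selector.unregister(connection)

    for message in messages:
        reader, writer = socket.socketpair()
        reader.setblocking(False)
        selector.register(reader, selectors.EVENT_READ, on_readable)
        writer.sendall(message)
        pairs.append((reader, writer))

    # The loop blocks until at least one registered socket is ready,
    # then calls the handler stored with that socket.
    while len(received) < len(messages):
        for key, _events in selector.select(timeout=1):
            handler = key.data
            handler(key.fileobj)

    for reader, writer in pairs:
        reader.close()
        writer.close()
    selector.close()
    return received


received = run_event_loop([b"profile", b"friends", b"plurks"])
```

The order in which handlers fire depends on which sockets become ready first, which is one reason event-driven programs are harder to reason about than sequential ones.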
Non-blocking sockets by coroutine
Coroutine [11, 10, 50] is an alternative concurrency approach based on Python generator functions, available since CPython version 2.5 [32]. Unlike a normal function, a generator function produces a sequence of results instead of a single return value: it suspends at each yield, hands a value back to its caller, and can later be resumed with a value sent in. In contrast to threads, coroutines run within a single thread, so switching between them requires no operating-system context switch, as shown in Figure 2.7. For the same reason, coroutines do not contend for the GIL, and we can fully control how they are scheduled. Furthermore, creating a coroutine is much cheaper than creating a thread, so we can spawn a massive number of coroutines without significant overhead.
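The cooperative switching described above can be sketched with plain generators and a round-robin scheduler. The names `worker` and `run_round_robin` are illustrative assumptions; a library such as Gevent performs the switching implicitly around I/O, whereas here each `yield` marks the point where a coroutine hands control back.

```python
from collections import deque


def worker(name, steps, log):
    """A generator-based coroutine that records its progress."""
    for step in range(steps):
        log.append((name, step))
        yield  # hand control back to the scheduler


def run_round_robin(coroutines):
    """Resume each coroutine in turn until all of them have finished."""
    queue = deque(coroutines)
    while queue:
        coroutine = queue.popleft()
        try:
            next(coroutine)          # run until the next yield
        except StopIteration:
            continue                 # finished: do not reschedule
        queue.append(coroutine)


log = []
run_round_robin([worker("a", 2, log), worker("b", 2, log)])
# log now interleaves the two workers:
# [("a", 0), ("b", 0), ("a", 1), ("b", 1)]
```

All switching happens inside one thread, so no locking is needed around the shared `log` list.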
To improve the crawling performance, we choose Gevent [23], a coroutine-based networking library, as our crawling backend. Gevent uses Greenlet [70], a micro-thread (lightweight coroutine) library, to provide a synchronous API on top of the libevent [63] event loop.
In the Gevent programming model, every coroutine has a parent, i.e., the caller, and the top-level coroutine is the main thread or the current thread. Sub-coroutines yield execution to their parents when they start waiting for I/O operations to complete, as shown in Figure 2.8. The parent coroutine monitors the event loop to see which I/O operations have completed and yields execution back to the corresponding sub-coroutine, thereby achieving asynchronous non-blocking I/O, as shown in Figure 2.9.
Figure 2.7: Coroutine
Furthermore, Gevent provides a cooperative socket module which ensures that Greenlet coroutines can access sockets concurrently. This feature, along with urllib3 [5], is exploited to speed up connection handling.
2.3.2 Messaging Protocol
Advanced Message Queuing Protocol (AMQP) [65] is an application layer protocol for message-oriented middleware (MOM) and is an evolution of semantics taken from the Java Messaging Service (JMS). AMQP covers two main enterprise messaging patterns, (1) topic-based publish-subscribe distribution and (2) reliable request-reply with persistent queues, using three pre-defined resources: exchange, queue, and binding.
ZeroMQ (ZMQ) [37] is a messaging library that acts as an intelligent transport layer and is inspired by the Internet Protocol (IP) [55]. It is a redesign of messaging in pursuit of uniformity and scalability, i.e., it aims to solve the problem of connecting thousands of clients and delivering millions of messages per second in a large messaging system. ZeroMQ covers four main patterns: transient pub-sub, unreliable request-reply, pipeline, and peer-to-peer. In addition, it provides broker devices and message routing when necessary.
Figure 2.8: Coroutine with I/O
Figure 2.10: Transports of ZeroMQ
In general, AMQP is essentially centralized around a broker and provides reliable persistent queuing, whereas ZeroMQ is essentially distributed, has no pre-defined broker, and aims at handling massive numbers of messages concurrently. We choose ZeroMQ as the crawling messaging framework and AMQP as the backend for Celery [8], the task queuing system for the web worker mentioned in the previous section.
Remote Procedure Call (RPC) over AMQP requires two queues, one for task data and one for results, as shown in Figure 2.11. Even though this scenario guarantees reliability and security, its queuing overhead is high when it is employed in a crawling system, as shown in Figure 2.12.
In order to reduce queue usage, we employ ZeroMQ as the messaging protocol in the crawling system.
2.3.3
Data Serialization Format
In this section, we will introduce several data serialization formats employed in our systems: Pickle [33], JavaScript Object Notation (JSON) [43], Binary JSON (BSON) [96], and MessagePack (MsgPack) [76].
Figure 2.11: RPC over AMQP
Pickle
Pickle is a standard Python module for serializing Python object structures. It converts a Python object into a byte stream when serializing; a byte stream is converted back into a Python object on de-serializing. However, the Pickle module is not intended to be a secure format against erroneous or maliciously constructed data. We need to authenticate the pickled object before de-serializing it.
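One common way to authenticate a pickled object before de-serializing it is to attach a keyed hash to the payload. The following is a minimal sketch of that idea using only the Python standard library; the key SECRET and the function names dumps_signed and loads_verified are assumptions for illustration and are not taken from our system.

```python
import hashlib
import hmac
import pickle

SECRET = b"shared-secret"  # assumption: a key known only to sender and receiver

def dumps_signed(obj, key=SECRET):
    """Serialize obj and prepend an HMAC-SHA256 tag computed over the payload."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload

def loads_verified(blob, key=SECRET):
    """Check the tag first; unpickle only if the payload is authentic."""
    tag, payload = blob[:32], blob[32:]  # a SHA-256 tag is 32 bytes long
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("pickle payload failed authentication")
    return pickle.loads(payload)
```

A receiver that does not hold the key cannot forge a valid tag, so a tampered byte stream is rejected before pickle.loads is ever called.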
JavaScript Object Notation (JSON)
JSON is a lightweight human-readable open standard [21] for data serialization. It is derived from JavaScript for representing data types and data structures. JSON is widely deployed by Web APIs such as Twitter, Plurk, and Facebook Graph API, etc.
Binary JSON (BSON)
BSON is based on JSON and is adopted by MongoDB for data storage. It is designed to be efficient both in storage space and scan-performance. Unlike JSON, BSON uses binary form for representing data types and data structures. In addition, it extends JSON with the date, byte array, and regular expression types.
MessagePack (MsgPack)
MsgPack is based on JSON and aims to be as compact and simple as possible. It is very similar to BSON, except that it does not support the date and regular expression data types but is more space-efficient. The Protocol Buffers (PB) [31] format by Google Inc. also aims to be compact and is often compared with MsgPack. However, PB requires a schema describing the structure to be defined before an object can be serialized or de-serialized, whereas MsgPack and JSON can serialize arbitrary data structures without one.
Listing 2.1 demonstrates how to encode a dictionary object with the above four serialization formats in Python. The encoded data sizes for Pickle, JSON, BSON, and MsgPack are 218, 164, 151, and 116 bytes respectively.
Listing 2.1: Serialization
>>> import pickle, marshal, json, bson, msgpack
>>> data = {
...     "fans_count": 98,
...     "friends_count": 120,
...     "privacy": "only_friends",
...     "user_info": {
...         "display_name": "Ken",
...         "karma": 131.32,
...         "gender": 1,
...         "id": 3461880,
...         "avatar": 10
...     }
... }
>>> pickle.dumps(data)
"(dp0\nS'fans_count'\np1\nI98\nsS'user_info'\np2\n(dp3\nS'gender'\np4\nI1\nsS'display_name'\np5\nS'Ken'\np6\nsS'karma'\np7\nF131.32\nsS'avatar'\np8\nI10\nsS'id'\np9\nI3461880\nssS'friends_count'\np10\nI120\nsS'privacy'\np11\nS'only_friends'\np12\ns."
>>> json.dumps(data)
'{"fans_count": 98, "user_info": {"gender": 1, "display_name": "Ken", "karma": 131.32, "avatar": 10, "id": 3461880}, "friends_count": 120, "privacy": "only_friends"}'
>>> bson.BSON.encode(data)
'\x97\x00\x00\x00\x10fans_count\x00b\x00\x00\x00\x03user_info\x00J\x00\x00\x00\x10gender\x00\x01\x00\x00\x00\x02display_name\x00\x04\x00\x00\x00Ken\x00\x01karma\x00\n\xd7\xa3p=j`@\x10avatar\x00\n\x00\x00\x00\x10id\x00\xf8\xd24\x00\x00\x10friends_count\x00x\x00\x00\x00\x02privacy\x00\r\x00\x00\x00only_friends\x00\x00'
>>> msgpack.dumps(data)
'\x84\xaafans_countb\xa9user_info\x85\xa6gender\x01\xacdisplay_name\xa3Ken\xa5karma\xcb@`j=p\xa3\xd7\n\xa6avatar\n\xa2id\xce\x004\xd2\xf8\xadfriends_countx\xa7privacy\xaconly_friends'
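To see where the space saving of MsgPack comes from, the encoding rules can be written out by hand. The following is a minimal sketch in Python 3 that covers only the subset of the MessagePack format needed for the dictionary above (small maps, short strings, non-negative integers, and doubles); the function pack is hypothetical and is not the msgpack library.

```python
import json
import struct

def pack(obj):
    """Encode a small subset of MessagePack: maps, short strings, ints, floats."""
    if isinstance(obj, dict) and len(obj) <= 15:
        body = b"".join(pack(k) + pack(v) for k, v in obj.items())
        return bytes([0x80 | len(obj)]) + body            # fixmap: 1-byte header
    if isinstance(obj, str) and len(obj.encode("utf-8")) <= 31:
        raw = obj.encode("utf-8")
        return bytes([0xA0 | len(raw)]) + raw             # fixstr: 1-byte header
    if isinstance(obj, int) and not isinstance(obj, bool) and obj >= 0:
        if obj <= 0x7F:
            return bytes([obj])                           # positive fixint
        if obj <= 0xFF:
            return b"\xcc" + struct.pack(">B", obj)       # uint 8
        if obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)       # uint 16
        if obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)       # uint 32
    if isinstance(obj, float):
        return b"\xcb" + struct.pack(">d", obj)           # float 64, big-endian
    raise TypeError("type or size outside this sketch: %r" % (obj,))

data = {
    "fans_count": 98,
    "friends_count": 120,
    "privacy": "only_friends",
    "user_info": {"display_name": "Ken", "karma": 131.32,
                  "gender": 1, "id": 3461880, "avatar": 10},
}
print(len(json.dumps(data)), len(pack(data)))  # 164 116
```

Every key and short string costs one header byte plus its text, and small integers occupy a single byte, which is why the result is shorter than the JSON text with its quotes, colons, and separators.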
In the crawling system, we decode JSON data from the Plurk API and then store the results in MongoDB in BSON format through the MongoDB driver (PyMongo) [2]. In addition, the scheduler transmits control signals and the crawlers return crawled data to the handler in MsgPack format via ZMQ, as shown in Figure 2.13. Furthermore, in the SNSD system we return user profile and relationship data in JSON for AJAX HTTP requests and use Pickle as the format for the Celery web task queue via AMQP.
2.3.4
Datastore
In order to store as many conversations from Plurk as we can for the SNSD system, we have come up with the criteria for choosing proper data store: scalability, high availability (HA), performance and index support.
Scalability means we can easily scale the data store by adding resources to a single node (scale vertically) or by adding more nodes to the system (scale horizontally). High availability (HA) ensures the data store works properly even if a node in the system is down or out of service. Index support is required for improving performance and guaranteeing data uniqueness.
Figure 2.13: Various message formats within the crawling system
According to these criteria, we choose MongoDB, a document-oriented database system, as the data store for conversations from MBSPs; MySQL, a relational database management system, for the interest hierarchy; and Redis, an in-memory key-value data store with optional durability, for user relationships.
2.3.5
Task Queuing for Crawling
Traditional task queuing systems based on Remote Procedure Call (RPC) require two extra queues: one to store task requests, as shown in Figure 2.14, and one to store the result produced by workers for each request. Storing this extra data makes traditional RPC unsuitable for handling a large number of requests.
In order to improve performance and storage efficiency, we replace AMQP with ZeroMQ library as messaging protocol, as shown in Figure 2.15, and introduce a new mechanism: let the worker pull tasks from ventilator (dispatcher) instead of having ventilator push tasks to available worker. Besides, ventilator maintains a priority queue to store states and creation timestamp for to-do tasks.
When a worker connects to the ventilator and asks for a new task, the ventilator pops the oldest to-do task and checks whether the task is done and whether it has exceeded its time to live (TTL). If the task is not done and has exceeded its TTL, the ventilator returns it to the worker and sets it to the work-in-progress (WIP) state; otherwise, the ventilator returns a to-do task that it has generated. Figure 2.16 illustrates this process.
Figure 2.14: RPC over AMQP on Crawling
Figure 2.15: RPC over ZeroMQ on Crawling
With this mechanism, ventilator can control the number of to-do tasks and guarantee all tasks will be processed eventually.
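The pull-based dispatching described above can be sketched in memory as follows. This is a simplified illustration and not the ZeroMQ implementation: the class Ventilator, its method names, and the explicit now argument (used instead of the system clock so that the behaviour is easy to follow) are assumptions.

```python
from collections import deque

class Ventilator:
    """In-memory stand-in for the task dispatcher with to-do, WIP and done states."""

    def __init__(self, ttl=300):
        self.ttl = ttl
        self.todo = deque()  # task ids waiting to be handed out, oldest first
        self.wip = {}        # task id -> deadline (hand-out time + ttl)
        self.done = set()    # task ids that have been completed

    def add(self, task_id):
        """Queue a task unless it is already known in any state."""
        if task_id not in self.wip and task_id not in self.done \
                and task_id not in self.todo:
            self.todo.append(task_id)

    def pull(self, now):
        """A worker asks for work: reissue an expired task first, else the oldest to-do."""
        expired = [t for t, deadline in self.wip.items() if deadline <= now]
        if expired:
            task_id = min(expired, key=self.wip.get)  # longest overdue first
        elif self.todo:
            task_id = self.todo.popleft()
        else:
            return None
        self.wip[task_id] = now + self.ttl  # (re)enter the WIP state
        return task_id

    def finish(self, task_id):
        """A worker reports a result: move the task from WIP to done."""
        self.wip.pop(task_id, None)
        self.done.add(task_id)
```

Because a task leaves the WIP state only through finish, a task whose worker disappears is handed out again once its TTL expires, which is how every task is eventually processed.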
2.3.6
Security
We have to ensure that communications between nodes in the crawling system are encrypted, to prevent information leaks and compromised nodes. However, ZeroMQ does not provide encryption [97]; we therefore need to implement a key exchange protocol or use an SSH tunnel.
Our crawling system is designed to handle massive numbers of requests with good performance. We therefore cannot deploy a complicated cryptographic mechanism such as the RSA algorithm for per-message encryption/decryption, nor set up an SSH tunnel between the ventilator and each worker.
Figure 2.16: Queuing work flow
Therefore, we employ Advanced Encryption Standard (AES) algorithm with Intel Advanced Encryption Standard Instructions (AES-NI) hardware support as the default cryptography.
2.4
Distributed Crawling System Architecture
2.4.1
System Architecture
Similar to Chau’s crawling framework [15], our system is also based on two-tier architecture to allow for simultaneous failures of agents. Figure 2.17 depicts a high-level architecture of the crawling system for this thesis. The crawling system consists of seven components as explained below.
• Agent: Installed in every worker node as a daemon process to receive commands from scheduler to start, stop, or restart the worker process, update scripts and configuration files, and increase/decrease the number of worker processes.
• Ventilator: Serves as the task dispatcher to dispatch tasks to available workers.
• Proxy: Started with the worker process in worker nodes. It aims to reduce the number of TCP connections to the backend ventilators.
Figure 2.17: Distributed crawling architecture
• Broker: Similar to proxy, but it’s started on server side to receive TCP connections from worker nodes and forward messages to ventilators by service identities as router.
• Worker: Executes tasks assigned by ventilators.
• Registry: Keeps track of available workers, allocates service identities for ventilators, and balances the requests from ventilators.
• Commander: The administrator sends commands to control worker nodes via this role; it can also communicate with the registry to adjust the total number of workers automatically.
There are several ventilators for different purposes in this system; each ventilator has a unique service identity allocated by the registry. For example, suppose we want to crawl plurkers’ relationships for community detection and public plurks for deriving interest topics. Then two ventilators dispatch tasks to workers and store the crawlers’ results in the appropriate datastore, such as MongoDB or Redis in our scenario.
ZeroMQ covers several messaging patterns. We employ the request-reply pattern between ventilator-worker, ventilator-registry, commander-registry, and commander-agent, and the publish-subscribe pattern between commander-agent. Figure 2.18 depicts the messaging patterns employed in the crawling system.
Figure 2.18: Messaging patterns between components
2.4.2
Work Flow
Figure 2.19(a) illustrates the work flow within crawling system. The flow is explained below. First, when worker becomes available, it sends an INIT message to ventilator via proxy and broker. If there are no available ventilators, worker will start to resend the INIT message until a ventilator responds.
After a worker association is established, the ventilator updates the worker status in the registry, generates a task for the worker with a universally unique identifier (UUID) as the task ID, and sends the task assignment to the worker. When a worker finishes the task, it sends the results along with the task ID to the ventilator. The ventilator then processes the results and stores them in the datastore.
Figure 2.19(b) illustrates the work flow between commander and agents. Two messaging patterns, publish-subscribe and request-reply, are used between the commander and agents in different scenarios. Agents subscribe to a topic named after their own unique hostname, on which they wait for specific instructions assigned by the commander, and to generic topics used for broadcasts.
If we want to broadcast instructions, such as rebooting all agent machines or fetching the latest configuration files, or to assign a specific agent to execute a command, such as restarting the crawling process, we use the publish-subscribe channel with the corresponding topic: a generic topic for broadcasting and a specific topic for an assigned agent. Moreover, if we want to execute commands and get a response from the agent, we use the request-reply channel.
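ZeroMQ subscribers filter messages by matching each subscription as a prefix of the incoming message. The following sketch mimics that filtering in plain Python to show how a generic topic and a per-host topic can coexist; the topic name "all", the trailing-space delimiter, and the helper names are assumptions for illustration and no ZeroMQ socket is involved.

```python
GENERIC_TOPIC = "all"  # assumption: name of the broadcast topic

def topic_for(name):
    """Topic bytes with a trailing space, so that 'node1' does not match 'node10'."""
    return name.encode("utf-8") + b" "

def make_subscriptions(hostname):
    """Each agent listens on the generic topic and on its own hostname."""
    return [topic_for(GENERIC_TOPIC), topic_for(hostname)]

def publish(topic_name, command):
    """Build the frame the commander would publish: topic prefix, then command."""
    return topic_for(topic_name) + command.encode("utf-8")

def accepts(subscriptions, frame):
    """Mimic prefix-based subscription filtering on the published frame."""
    return any(frame.startswith(prefix) for prefix in subscriptions)

subs = make_subscriptions("node1")
print(accepts(subs, publish("all", "reboot")))               # broadcast
print(accepts(subs, publish("node1", "restart-crawler")))    # addressed to node1
print(accepts(subs, publish("node10", "restart-crawler")))   # other agent
```

The delimiter matters because of the prefix rule: without it, an agent named node1 would also receive every instruction addressed to node10.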
2.4.3
Task Queuing
As mentioned above, we introduce a new queuing mechanism and define three states to represent the queuing status of a task. This mechanism is based on the Redis datastore and relies on three of its data types: list, set, and sorted set, as shown in Figure 2.20.
(a) Work flow between ventilator and workers (b) Work flow between commander and agents
Figure 2.19: Work flow of crawling system
Listing 2.2 illustrates how our queuing mechanism works. The execution function, upon receipt of a target user_id, will use ZADD command to add the target to the WIP queue with 300 seconds of TTL then crawl data for the target and store results into datastore. After crawling and storing, we add the target to the DONE queue by SADD command and remove it from WIP queue by ZREM command. If there is any exception during crawling or storing data, we remove the target from WIP queue by ZREM command.
The fetch_targets function demonstrates how to fetch new targets. First, we check if any target is in WIP state and has executed over TTL by ZRANGEBYSCORE command, i.e. there is something wrong while crawling data for the target. Second, we generate nine target candidates by removing and getting the first element in the TODO queue via LPOP command then return these candidates.
The add_todo function depicts how to add a given target user_id to the TODO queue. We check if this target is in the WIP queue by ZSCORE command and whether it is already done or not by SISMEMBER command first. If the target is not in the WIP state and not done, then push the target to TODO queue by RPUSH command.
Figure 2.20: States and Redis data type of the task queue
import redis, time

r = redis.Redis()

def execute(user_id):
    try:
        # mark the target as work-in-progress with a 300-second TTL
        r.zadd('WIP', user_id, int(time.time() + 300))
        CRAWL_FOR_THE_TARGET_AND_STORE(user_id)
        r.sadd('DONE', user_id)
        r.zrem('WIP', user_id)
    except:
        r.zrem('WIP', user_id)

def fetch_targets():
    targets = []
    # targets still in WIP whose TTL has expired
    for _ in r.zrangebyscore('WIP', 0, int(time.time())):
        targets.append(_)
    # nine new candidates from the head of the TODO queue
    for _ in xrange(9):
        target = r.lpop('TODO')
        targets.append(target)
    return targets

def add_todo(user_id):
    # queue the target only if it is neither in progress nor done
    if r.zscore('WIP', user_id) is None and not r.sismember('DONE', user_id):
        r.rpush('TODO', user_id)
Chapter 3
Implementation Details
3.1
Data Collection
3.1.1
Overview
In this section we show how we crawl data from the Internet and how we store these data for interest derivation. There are three mechanisms for crawling data from the Internet and Plurk: (1) parsing HyperText Markup Language (HTML) source; (2) applying a stateful programmatic web browsing module; or (3) using an application programming interface (API) provided by the service provider.
Parsing HTML source is the basic mechanism for web crawling. It works by analyzing the HTML source code of static pages with regular expressions (Regex) or by creating a Document Object Model (DOM) for parsing. However, this mechanism is unable to process a page whose content is loaded with Asynchronous JavaScript and XML (AJAX). For example, we can apply this mechanism to crawl Plurk in mobile view (Figure 3.1), but it doesn’t work in the standard view. In order to deal with AJAX, we utilize the stateful programmatic web browsing module. Generally speaking, this mechanism relies on web browser engines such as WebKit and Gecko to interpret web pages as a real web browser does. Even though this mechanism can deal with most web pages, it is much slower than parsing HTML directly and is therefore not suitable for crawling a large number of web pages.
Most web service and social network service providers, such as Google, Twitter, and Plurk, provide APIs that let developers access data after registering their applications with the official registry. This mechanism is the most efficient way to crawl data from a specific service. However, it usually has a rate limit, i.e. only a limited number of requests is allowed in a given period of time. Besides, unlike anonymous page views, API requests cannot be routed through web proxy servers.
Figure 3.1: Plurk mobile view
We apply Plurk API for crawling Plurk data and use Spynner, a stateful programmatic web browsing module for Python, as the engine to parse keywords from Google real time trends service in this thesis.
3.1.2
Plurk API and Library
Plurk API [67] is currently available in version 2.0. Compared to version 1.0, version 2.0 is stateless (no login is required) and requests must be signed using the OAuth Core 1.0a standard [64]. Version 1.0 is session-based and authenticates with a user account and password instead of authorized keys. Both versions return data encoded in JSON format. Plurk officially recommends clsung’s plurk-oauth [17] API library to Python developers; it depends on the oauth2 [79] and httplib2 [83] libraries. Listing 3.1 depicts how to use the plurk-oauth library to get a Plurk profile. Even though plurk-oauth is fully functional and well tested, it suffers from poor performance and connection latency resulting from HTTP connection overhead and from performance bottlenecks in JSON decoding and HMAC-SHA1 signing.
The HMAC procedure for OAuth consists of two phases: (1) calculate the HMAC signature with the specified hash function and the given key and message, then (2) compute the Base64 encoding of the resulting binary signature. Listing 3.2 demonstrates the Python implementation of HMAC-SHA1. The HMAC-SHA1 signature can also be obtained with the following shell commands:
$ echo -n "message" | openssl dgst -sha1 -binary -hmac "key" | openssl enc -base64
IIjfdNXyFGtIFGyvSWU3fp0L46Q=
>>> from PlurkAPI import PlurkAPI
>>> plurk = PlurkAPI(CONSUMER_KEY, CONSUMER_SECRET)
>>> plurk.authorize()
>>> print plurk.callAPI('/APP/Profile/getOwnProfile')
...truncated data...
Listing 3.2: Compute HMAC-SHA1 by Python standard libraries
>>> import hashlib
>>> import binascii
>>> trans_5C = ''.join(chr(x ^ 0x5c) for x in xrange(256))
>>> trans_36 = ''.join(chr(x ^ 0x36) for x in xrange(256))
>>> digestmod = hashlib.sha1
>>> blocksize = digestmod().block_size
>>> def hmac(key, msg):
...     if len(key) > blocksize:
...         key = digestmod(key).digest()
...     key += chr(0) * (blocksize - len(key))
...     o_key_pad = key.translate(trans_5C)
...     i_key_pad = key.translate(trans_36)
...     return digestmod(o_key_pad + digestmod(i_key_pad + msg).digest())
...
>>> h = hmac('key', 'message')
>>> print h.hexdigest()
2088df74d5f2146b48146caf4965377e9d0be3a4
>>> print binascii.b2a_base64(h.digest())[:-1]
IIjfdNXyFGtIFGyvSWU3fp0L46Q=
>>>
>>> from hashlib import sha1
>>> import hmac
>>> h = hmac.new('key', 'message', sha1)
>>> print binascii.b2a_base64(h.digest())[:-1]
IIjfdNXyFGtIFGyvSWU3fp0L46Q=
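For readers on Python 3, where keys and messages must be bytes, the two phases (HMAC, then Base64) can be sketched as follows. The function name hmac_sha1_base64 is hypothetical; the calls themselves are from the standard hmac, hashlib, and base64 modules.

```python
import base64
import hashlib
import hmac

def hmac_sha1_base64(key, message):
    """Return the Base64-encoded HMAC-SHA1 signature of message as text."""
    # phase 1: binary HMAC-SHA1 digest of the message under the key
    digest = hmac.new(key.encode("utf-8"), message.encode("utf-8"),
                      hashlib.sha1).digest()
    # phase 2: Base64-encode the 20-byte digest
    return base64.b64encode(digest).decode("ascii")

print(hmac_sha1_base64("key", "message"))  # IIjfdNXyFGtIFGyvSWU3fp0L46Q=
```

The printed value matches the signature produced by the shell commands and by Listing 3.2 for the same key and message.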
3.1.3
Library Optimization
In order to address these performance bottlenecks, we develop an enhancement patch for the plurk-oauth library.
First, we replace httplib2 [83] with urllib3 [5] for connection pooling; instead of making connection for each request, connection pool works as a cache to make connections reused when required, as shown in Figure 3.2. This reduces connection latency and improves throughput.
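The idea behind connection pooling can be shown with a toy model. The sketch below does not use urllib3 and opens no real sockets; ToyConnection and ToyPool are hypothetical classes that only count how many connections would have been opened.

```python
class ToyConnection:
    """Stand-in for an HTTP connection to one host (no real network I/O)."""

    def __init__(self, host):
        self.host = host

    def request(self, path):
        return "GET %s%s" % (self.host, path)

class ToyPool:
    """Keep idle connections per host and hand them out again on demand."""

    def __init__(self):
        self.idle = {}    # host -> list of idle connections
        self.opened = 0   # how many connections were actually created

    def request(self, host, path):
        bucket = self.idle.setdefault(host, [])
        if bucket:
            conn = bucket.pop()          # reuse a cached connection
        else:
            conn = ToyConnection(host)   # cache miss: open a new one
            self.opened += 1
        try:
            return conn.request(path)
        finally:
            bucket.append(conn)          # return the connection for reuse

pool = ToyPool()
for _ in range(5):
    pool.request("www.plurk.com", "/APP/Profile/getOwnProfile")
print(pool.opened)  # 1
```

Five requests to the same host open a single connection, so the connection set-up cost is paid once instead of on every request.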
Second, Plurk API returns data in JSON format, and every response must be decoded into a dictionary in Python (or a hash in Ruby); this decoding is one of the performance bottlenecks. We benchmark and profile several Python JSON libraries (see Chapter 4), then replace the JSON decoder from the Python standard library with ujson [42], a high-performance C extension module for Python, in the enhanced library.
Figure 3.2: HTTP persistent connection
Third, the hmac module in the Python standard library is built on the hashlib module, which calls the native, optimized OpenSSL library directly. However, this approach performs poorly because the hmac module uses hashlib only to obtain hash values and processes the remaining steps of the HMAC calculation, such as character translation and Base64 encoding, separately. To address this issue, we write a custom OpenSSL wrapper for HMAC-SHA1 that returns the complete HMAC-SHA1 value directly.
OpenSSL is a highly optimized C library, with assembly code that uses hardware acceleration instructions, and it provides several cipher, hashing, and encoding functions. We use OpenSSL as the HMAC, SHA1, and Base64 engine and integrate these OpenSSL functions into a Python extension module through the Python C API. This customized extension is compiled to native code and runs 72 times faster than the version included in the standard library. Detailed experimental results are given in Chapter 4.
Listing 3.3: Compute HMAC-SHA1 by OpenSSL
#include <string.h>
#include <openssl/evp.h>
#include <openssl/hmac.h>
#include <openssl/sha.h>

#ifndef SHA_DIGEST_LENGTH
#define SHA_DIGEST_LENGTH 20
#endif
/* Base64 length of the digest plus the terminating NUL */
#define B64_LEN ((((SHA_DIGEST_LENGTH + 2) / 3) * 4) + 1)

unsigned char* hmac_sha1(unsigned char* key, unsigned char* data)
{
    unsigned char* digest;
    /* static so that the buffer outlives the call (not thread-safe) */
    static unsigned char ret[B64_LEN];
    // compute HMAC digest
    digest = HMAC(EVP_sha1(), key, strlen((char*)key), data, strlen((char*)data), NULL, NULL);
    // encode binary digest to Base64 format
    EVP_EncodeBlock(ret, digest, SHA_DIGEST_LENGTH);
    return ret;
}
3.2
Preprocessing
In this section, we will show the elements of Plurk content (plurk) and explain how we apply the URL filtering and tokenization to the preprocessing of plurks.
3.2.1
A Plurk and Its Data
We can invoke the Timeline series of API calls to fetch plurk data. For example, we invoke /APP/Timeline/getPlurk to get the data for a plurk by passing the unique plurk id, or invoke /APP/Timeline/getPublicPlurks to get the public plurks of a plurker by passing the plurker’s user_id or nickname. A plurk (Figure 1.5 on Page 4) is encoded as a JSON object and is returned as follows:
{
    "responses_seen": 0,
    "qualifier": "says",
    "replurkers": [],
    "plurk_id": 1003643246,
    "response_count": 0,
    "replurkers_count": 0,
    "replurkable": false,
    "limited_to": "|3461880|",
    "no_comments": 0,
    "favorite_count": 0,
    "is_unread": 0,
    "lang": "tr_ch",
    "favorers": [],
    "content_raw": "http://j.mp/JoIb2K\nhttp://youtu.be/C8HjWFPY78I\n少女時代太妍、蒂芬妮、徐玄所組成的子團體「太蒂徐」以首張專輯《Twinkle》登台已滿四周,每周都穿上不同風格的表演服的她們,26日以性感可愛的粉色系空",
    "user_id": 3461880,
    "plurk_type": 1,
    "qualifier_translated": "說",
    "replurked": false,
    "favorite": false,
    "content": ...truncated data...,
    "replurker_id": null,
    "posted": "Sun, 10 Jun 2012 15:50:08 GMT",
    "owner_id": 3461880
}
Plurk API defines twenty two attributes for a plurk. However, in order to reduce storage size, we only include the following eight essential attributes for further processing in this thesis. The definitions of these attributes are listed below:
{
    "_id": Number,
    "owner": Number,
    "qualifier": String,
    "content": String,
    "content_raw": String,
    "tags": Array,
    "posted_at": ISODate,
    "updated_at": ISODate
}
• _id: The unique plurk id, used for identification of the plurk.
• owner: The owner/poster of this plurk.
• qualifier: Qualifier is used to define the type of the plurk, which can be “says”, “asks”, “likes”, “shares”, etc.
• content: The formatted and filtered content, e.g. URLs are turned into text and emoticons are filtered out.
• content_raw: The raw content as entered by user.
• tags: The tagging result from the filtered content, which is listed in the interest hierarchy.
• posted_at: The date this plurk was posted in ISODate format.
• updated_at: The date this plurk was formatted and filtered in ISODate format.
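The reduction from the API object to the eight stored attributes can be sketched as below. The mapping of plurk_id to _id and of owner_id to owner is our reading of the attribute definitions above, and the function to_record with its content, tags, and now parameters is hypothetical; in particular, content and tags are produced by later preprocessing steps and are passed in here as plain arguments.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def to_record(plurk, content="", tags=None, now=None):
    """Reduce an API plurk object to the eight stored attributes."""
    return {
        "_id": plurk["plurk_id"],                # assumed mapping: plurk_id -> _id
        "owner": plurk["owner_id"],              # assumed mapping: owner_id -> owner
        "qualifier": plurk["qualifier"],
        "content": content,                      # filled in by URL filtering
        "content_raw": plurk["content_raw"],
        "tags": list(tags or []),                # filled in by tagging
        "posted_at": parsedate_to_datetime(plurk["posted"]),  # RFC 2822 style date
        "updated_at": now or datetime.now(timezone.utc),
    }

record = to_record({
    "plurk_id": 1003643246,
    "owner_id": 3461880,
    "qualifier": "says",
    "content_raw": "http://j.mp/JoIb2K",
    "posted": "Sun, 10 Jun 2012 15:50:08 GMT",
})
print(record["_id"], record["posted_at"].isoformat())
```

Parsing the posted string into a datetime object once, at storage time, lets the datastore index and compare dates natively instead of comparing strings.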
3.2.2
Elements of a Plurk
• Text: Text is the basic type of a plurk, which may contain Chinese, Japanese, English, or other language characters stored in Unicode.
• URL: URL may be in several types: @plurk_ID, web page, image, or file.
– @plurk_ID: @plurk_ID identifies a Plurk user (plurker). A @plurk_ID in a plurk will be stored in the content_raw column and transformed into http://www.plurk.com/plurk_ID in the content column. Moreover, it will show the plurker’s nickname instead of account name as the link name in the Plurk page.
– Web Page: The web page type is a hyperlink which refers to a web page. Plurk user can define the link name; if not defined, it will show the original link in the Plurk page.
– Image: The image type is a hyperlink which refers to an image in such format as PNG, JPG, GIF etc.
– File: The file type is similar to the image type. If the hyperlink does not refer to an image then a normal file is assumed.
• oPreview: oPreview is a special case of web page type. If the page has open graph protocol properties, the hyperlink will be transformed into a short “summary” instead of normal links. This type is convenient for plurkers to share a web page. Instead of typing URL and defining the link name, a plurker simply types URL and the page title will be displayed automatically.
Based on the characteristics of plurk elements, we design a URL filtering mechanism and a preprocessing procedure which will be described in the following sections.
3.2.3
URL Filtering Mechanism
We give a procedure for URL filtering in order to transform URL from raw link into text content or tags which represent the subject of the URL. The pseudocode is shown in Algorithm 1.
(a) Text (b) @plurk_ID source (c) @plurk_ID rendered (d) Web page URL source (e) Web page URL rendered (f) Image source (g) Image rendered (h) File URL source (i) oPreview source (j) oPreview rendered
Input: URL
Output: content
begin
    content = null
    if URL is shortened then
        URL = expand_shortened_URL(URL)
    end
    if URL is a web page then
        if Tag is available then
            content = keywords from predefined tags column
        else if Metadata is available then
            content = keywords or description from metadata
        else
            content = title of the page
        end
    else if URL is an image then
        content = keywords from Google image search
    else if URL is linked to Youtube then
        content = keywords from metadata
    end
    return content
end
Algorithm 1: URL filtering mechanism
First, we extract the original URL behind a short URL, if necessary, by detecting whether a URL redirect occurs while reading the URL. For example, the URL http://www.ettoday.net/news/20120527/50150.htm is shortened to http://j.mp/JoIb2K and posted in a plurk. In this case, we detect the redirect when opening the shortened URL and then open the redirected URL to read its content.
Second, we read content from the URL. If the URL refers to a file, we ignore it (Figure 3.4). If the URL refers to an image, we submit the URL as a query to an image search engine such as Google Images (Figure 3.5). If the URL refers to Youtube, we get the description and keywords values from the metadata in the <meta> tag (Figure 3.6). If the URL refers to a web page, we first check whether metadata exists and, if so, get the keywords, description, and title values. If metadata is not available but keywords are defined for the page, we get those keywords. Otherwise, we get the title value from the page, as shown in Figure 3.7.
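The web page branch can be sketched with the HTML parser from the Python 3 standard library. This is an illustration only: the class PageSummary, the function page_content, and the fallback order (keywords, then description, then title) are assumptions, and fetching the page over HTTP is left out.

```python
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collect <meta name=keywords|description> values and the <title> text."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if tag == "meta" and name in ("keywords", "description"):
            self.meta[name] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # accumulate text found inside <title>

def page_content(html):
    """Prefer meta keywords, then meta description, then the page title."""
    parser = PageSummary()
    parser.feed(html)
    return (parser.meta.get("keywords")
            or parser.meta.get("description")
            or parser.title.strip())

sample = ('<html><head><title> Demo page </title>'
          '<meta name="keywords" content="TaeTiSeo, Twinkle"></head></html>')
print(page_content(sample))  # TaeTiSeo, Twinkle
```

A page without any usable <meta> tag falls through to its title, which corresponds to the last case described above.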
Lastly, we write the filtered content from the URL to the content column in the datastore. For example, after the URL filtering process the URL http://j.mp/JoIb2K is transformed into several tags as follows:
{
    ...
    "content": "少女時代, 子團體, 徐玄, 太妍, 蒂芬妮, TaeTiSeo, Twinkle",
    "content_raw": "http://j.mp/JoIb2K",
    ...
}
Figure 3.4: URL filtering for a file
3.2.4
Tokenization
There is no straightforward method to tokenize a Chinese sentence: Chinese text has no explicit word boundaries, yet the word is a fundamental linguistic unit. Therefore, we develop a two-step, dictionary-based tokenization mechanism in this section.
Tsai [84] implemented a Chinese segmentation algorithm named MMSEG based on the maximum matching algorithm, and Ma [54] showed the procedures of the CKIP Chinese segmentation system, including the disambiguation algorithm for resolving segmentation ambiguities and identifying unknown words.
These two Chinese segmentation algorithms and implementations (MMSEG and CKIP) are popular among Traditional Mandarin Chinese users. However, we only care about the matching of keywords rather than the segmentation of a sentence. Therefore, instead of a Chinese segmentation system, we employ a maximum matching algorithm based on a corpus dictionary that is stored in a trie data structure.
Matching Algorithm with Recursively Implemented StorAge (MARISA) is a space-efficient trie data structure, while libmarisa [91] is a C++ library implementation of MARISA. We use the marisa-trie [57] Python package, a Python binding of libmarisa, as the trie implementation to store the interest keywords dictionary and to find the longest-prefix keyword. Listing 3.4 demonstrates how to use the marisa-trie library to build a trie and find all prefixes of a given key.
Listing 3.4: Find all prefixes of a given key by marisa trie
>>> import marisa_trie
>>> trie = marisa_trie.Trie([u'key1', u'key2', u'key12'])
>>> trie.prefixes(u'key12')
[u'key1', u'key12']
Figure 3.5: URL filtering for an image
Figure 3.7: URL filtering for a web page
We normalize sentences with the following five steps:
1. insert space behind CJK characters and before ASCII characters
2. insert space behind ASCII characters and before CJK characters
3. replace punctuation characters with whitespaces, replace continuous spaces with a single whitespace
4. strip the whitespaces found in the beginning or end of a sentence, and finally
5. convert case-based characters into lowercase.
First, according to the Unicode code charts, CJK characters (unified ideographs) are located in the range 4E00 to 9FFF, Katakana in the range 30A0 to 30FF, and Hiragana in the range 3040 to 309F. We employ the Python regex extension, an alternative regular expression module that replaces the re module from the Python standard library, and apply these ranges to define several regular expressions for normalization. This step is inspired by the project “為什麼你們就是不能加個空格呢?” (“Why can’t you just add a space?”) [86]. p_mixed_1 and p_mixed_2 define the patterns for CJK characters adjacent to ASCII characters. p_ws defines the whitespace pattern with the special sequence \s, which is equivalent to the set [ \t\n\r\f\v]. p_punctuation defines the pattern of punctuation characters with the property \p{P}, which is supported by the regex module. Listing 3.5 shows the normalization process and Figure 3.8 illustrates how it normalizes a sentence step by step.
Figure 3.8: Demonstration of the normalize function
Listing 3.5: Normalize sentences by regular expressions
import regex as re

p_mixed_1 = re.compile(ur'([\u4e00-\u9fff\u3040-\u30FF])([a-zA-Z0-9@&;=_\[\$\%\^\*\-\+\(\/])')
p_mixed_2 = re.compile(ur'([a-zA-Z0-9@&;=_\[\$\%\^\*\-\+\(\/])([\u4e00-\u9fff\u3040-\u30FF])')
p_ws = re.compile(r'[\s]+')
p_punctuation = re.compile(ur'\p{P}+')

def normalize(ctx):
    _ = p_mixed_1.sub(r'\1 \2', ctx)
    _ = p_mixed_2.sub(r'\1 \2', _)
    _ = p_punctuation.sub(' ', _)
    _ = p_ws.sub(' ', _)
    return _.strip().lower()
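The same five steps can be reproduced in Python 3 with the standard library alone. The sketch below is a simplified variant and not Listing 3.5 itself: the standard re module has no \p{P}, so punctuation is detected through unicodedata categories instead, and the ASCII side of the mixed patterns is reduced to letters and digits.

```python
import re
import unicodedata

CJK = r"\u4e00-\u9fff\u3040-\u30ff"  # unified ideographs, Hiragana, Katakana
P_CJK_THEN_ASCII = re.compile(r"([%s])([A-Za-z0-9])" % CJK)
P_ASCII_THEN_CJK = re.compile(r"([A-Za-z0-9])([%s])" % CJK)
P_WS = re.compile(r"\s+")

def normalize(text):
    # steps 1 and 2: put a space at every CJK/ASCII boundary
    text = P_CJK_THEN_ASCII.sub(r"\1 \2", text)
    text = P_ASCII_THEN_CJK.sub(r"\1 \2", text)
    # step 3: replace punctuation (Unicode categories P*) with spaces,
    # then collapse runs of whitespace
    text = "".join(" " if unicodedata.category(ch).startswith("P") else ch
                   for ch in text)
    text = P_WS.sub(" ", text)
    # steps 4 and 5: trim both ends and lowercase
    return text.strip().lower()

print(normalize("少女時代Twinkle,登台!"))  # 少女時代 twinkle 登台
```

After normalization every western term is delimited by spaces, which is what the second tokenization step relies on.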
Second, we generate a list of indexes of whitespace characters in the sentence, then use the indexes to retrieve terms. The purpose of this algorithm is to determine whether a term is a CJK term or a western term. The pseudocode is shown in Algorithm 2 and Figure 3.9 depicts the tokenization process step by step.
Input: context, a string
       trie, a marisa-trie
Output: terms, a set of matched keywords
begin
    index = 0
    terms = an empty set
    while index < length(context) do
        if context[index] is a white space then
            index += 1
        end
        match = trie.longest_prefix(context[index:])
        if match is not null then
            terms.add(match)
            index += length(match)
        else if context[index] in [a-zA-Z0-9] then
            index = next index of white spaces occurs in the context
        else
            index += 1
        end
    end
    return terms
end
Algorithm 2: Tokenization process
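Algorithm 2 can be sketched in Python 3 as follows. To keep the example self-contained, a plain set of keywords with a brute-force longest-prefix search stands in for the marisa-trie; the helper longest_prefix is therefore hypothetical, and the sketch skips to the next loop iteration after a whitespace character, which is our reading of the pseudocode.

```python
def longest_prefix(keywords, text, max_len):
    """Return the longest keyword that text starts with, or None."""
    for size in range(min(max_len, len(text)), 0, -1):
        if text[:size] in keywords:
            return text[:size]
    return None

def tokenize(context, keywords):
    """Collect the dictionary keywords found in a normalized sentence."""
    max_len = max(map(len, keywords), default=0)
    terms = set()
    index = 0
    while index < len(context):
        if context[index].isspace():
            index += 1
            continue
        match = longest_prefix(keywords, context[index:], max_len)
        if match is not None:
            terms.add(match)
            index += len(match)
        elif context[index].isascii() and context[index].isalnum():
            # unmatched western term: jump to the next space (or the end)
            space = context.find(" ", index)
            index = len(context) if space == -1 else space
        else:
            index += 1  # unmatched CJK character: advance by one
    return terms

keywords = {"少女時代", "少女", "twinkle"}
print(sorted(tokenize("少女時代 twinkle 登台 abc", keywords)))
```

Because the longest match wins, the sentence above yields the keyword 少女時代 rather than its shorter prefix 少女.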
3.2.5
Plurks Preprocessing
The preprocessing procedure is carried out in three phases (Figure 3.10) as follows:
• Apply the URL filtering process to transform URL links into text for extending content.
• Apply the tokenization mechanism to extract keywords from the plurk which are included in interest keywords hierarchy.
• Transform raw content into tags, and update the records to the datastore.
MongoDB is an open source document-oriented database management system and a part of the NoSQL family. It stores records in BSON format, which supports several data types such as String, Integer, Boolean, Double, Null, Array, and Object. As mentioned in subsection 3.2.1, we store the extended and filtered content from phase one into the content field in String type