Leveraging Microblogging Big Data with a Modified Density-based Clustering Approach for Event Awareness and Topic Ranking

Journal of Information Science

Chung-Hong Lee and Tzan-Feng Chien

Published online 1 March 2013

DOI: 10.1177/0165551513478738

The online version of this article can be found at: http://jis.sagepub.com/content/early/2013/02/28/0165551513478738

A more recent version of this article was published on Jul 18, 2013.

Published by: http://www.sagepublications.com

On behalf of: Chartered Institute of Library and Information Professionals

Journal of Information Science 1–21

© The Author(s) 2013 Reprints and permission: sagepub.co.uk/journalsPermissions.nav DOI: 10.1177/0165551513478738 jis.sagepub.com


Chung-Hong Lee

Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Taiwan

Tzan-Feng Chien

Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Taiwan

Abstract

Although diverse groups argue about the potential and true value of social-media big data, there is no doubt that the era of big data exploitation has begun, driving the development of novel data-centric applications. Big data is notable not only because of its size, but also because of the complexity caused by its relationality to other data. In the past, owing to the limited possibilities of accessing big data, few data sources were available to allow researchers to develop advanced data-driven applications, such as monitoring of emerging real-world events. Social media is now greatly impacting the growth of big data, and big data is providing enterprises with the data to help them understand how to better detect marketing demands. Microblogging is a social network service capable of aggregating messages to explore facts and unknown knowledge. Nowadays, people often attempt to search for trending news and hot topics in real time from microblogging messages to satisfy their information needs. Under such circumstances, there is a real demand for a way to allow users to organize a large number of microblogging messages into understandable events. In this work, we attempt to tackle such challenges by developing an online text-stream clustering approach using a modified density-based clustering model with collected microblogging big data. The system kernel combines three technical components: a dynamic term weighting scheme, a neighbourhood generation algorithm and an online density-based clustering technique. After the system has detected event topics, it recommends top-priority event information through the developed topic ranking algorithm, to help people organize emerging event data effectively.

Keywords

big data; microblog; online computation; social networks; text mining

Corresponding author:

Chung-Hong Lee, Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan. Email: [email protected]

1. Introduction

Big data denotes data sets that are too ‘big’ to be handled using existing database management tools and methods. Big data has existed for a long time, but few application systems have been able to access it until recent years. Social media is greatly impacting the growth of big data, and big data is providing enterprises with the information to help them better understand customers’ preferences and analyse new business opportunities. Ford Motors, PepsiCo and Southwest Airlines, for example, analyse consumer postings about them on social-media sites such as Twitter and Facebook to evaluate the immediate impact of their marketing strategies and to understand how customer sentiment about their brands is changing [1]. Another example, an experimental application investigated in this work, is the utilization of the contents of social media messages (e.g. Twitter data sets) to collect real-world event data to support understanding and status updates of emerging events. Because new events continually evolve, it is difficult to have an overview of the structure and dynamic development of emerging events. For example, some emerging events caused by severe natural disasters such as earthquakes and tsunami require new scientific methodologies for risk management and control. With the help of social-media big data such as Twitter data sets, people can carry out the task of event awareness by mining the large-scale event-related social content contributed by users around the world. Novel online event detection techniques, which incorporate streaming models with online clustering algorithms, provide feasible solutions for mining events from social streams in real time. However, although some stream mining approaches have been successfully developed to analyse social big data for event detection, little research effort has been spent on integrating microblog-based event monitoring subsystems, including online event detection, dynamic term weighting and online ranking of detected events, to satisfy the information needs of event awareness. For this reason, in this work we established a framework of event detection systems on Twitter data, and further developed algorithms for ranking detected events.

The term ‘event’, in this work, is defined as ‘something happened at a specific location during a specific period of time’, and ‘topic’ represents ‘a set of messages concentrating on some specific issues, which are continuously discussed by users who are concerned with the newest event development in the real world’. For real-time event discovery, however, it is not sufficient to provide users with the newest microblogging messages that happen to contain the requested keywords. It is normally expected that, if users want fresh information about important on-going events and hot topics, they will be able to hear it from responsible sources and reliable discussions, not just from anyone who happens to include the keywords (e.g. ‘Japan earthquake’) in microblogging messages. Still, we believe that more is needed to accomplish a challenging task: using live microblogging data to automatically identify emerging events and rank the topics in a near-real-time manner. When dealing with microblogging text streams, one important indication of evolution is the presence of bursts. For microblogs, a burst indicates that the occurrence of a certain message feature is unexpectedly frequent in a short period of time. Burst detection methods can be applied to the automatic detection of sudden events, providing useful insights into unusual events and in turn supporting situation awareness in real time. In this work, therefore, we use Twitter messages (i.e. tweets) to develop online event clustering and topic ranking methods for real-time event detection.

The rest of the paper is organized as follows. Section 2 surveys related work in the literature. The developed system architecture is described in Section 3. Section 4 reports on online text stream clustering methods for mining microblogging messages. We introduce our developed online clustering method, which includes three components: a dynamic term weighting scheme, a neighbourhood generation algorithm and an online density-based clustering approach. In Section 5, we describe our experiments on collected Twitter data sets, and discuss their evaluation results. Finally, Section 6 concludes the work by summarizing our research contributions.

2. Related work

Research on social networking applications has recently drawn increasing attention. For example, Otte and Rousseau [2] studied the influence of social network analysis in several application fields, such as publication citation and co-citation networks, collaboration structures and other forms of social interaction networks. Ross et al. explored the use of a Twitter-enabled backchannel to enhance the conference experience, collaboration and the co-construction of knowledge [3]. Microblogging, such as Twitter, is one of the popular social networking applications, representing a simple yet powerful way to allow enormous numbers of users every day to post messages related to everything ranging from mundane daily life routines to breaking news events. Over the past few years, data mining on microblog-based big data [4, 5] has rapidly emerged as a popular research topic in the areas of text mining, information retrieval and social media. In addition, much research work has focused on applying Twitter for sentiment analysis and opinion mining applications, such as Thelwall et al. [6] and Jansen et al. [7]. One of the key advantages of microblogs is that they enable people to acquire up-to-the-minute information ubiquitously. Many recommendation systems build on this characteristic to discover popular movies [8], RSS news topics [9], etc., in near real time.

In a study of the decision-making uses of microblogs, Cheong and Lee [10] presented an excellent view of trending topics on microblogs. They categorized the topics on Twitter into short-term, medium-term and long-term topics, and then utilized an off-line SOM approach to analyse trend patterns. Becker et al. [11–14] proposed a series of work regarding topic discovery on Twitter, which is perhaps the research most similar to our work. Naaman et al. [11] developed a trending topic detection algorithm for local areas, including a burst detection algorithm and local trending terms detection. In their later work [12, 13], they developed an online event identification framework that uses online clustering and filtering technologies to identify topics on Twitter. In particular, the goal of their research [14] is similar to that of our topic ranking algorithms: they developed a variety of event-based re-ranking techniques, and evaluated the ranking results of tweets in terms of quality, relevance and usefulness. Furthermore, they also developed a topic tracking system [15]. In that work, a planned event, a so-called scheduled event, means that the events are known to the machine, which makes them suitable for a topic tracking method. The differences between Becker’s work and our work are as follows: (1) we do not focus only on events at a specific location, but cluster messages that come from every corner of the world; and (2) the detected topics in our system are not limited to known events. The other microblogging topic detection system [15] introduced the idea of finding real-time local topics using microblogging data. TwitterMonitor [16] utilized trending word detection to recognize hot topics. Sankaranarayanan et al. [17] developed a system called TwitterStand, which captures tweets that correspond to breaking news by means of a clustering approach for mining tweets.

In most data stream applications, burst detection is an important task for discovering sharply increasing patterns, which may be trending ones or unusual activities in the non-stationary stream. A related idea, the state-based method, was proposed by Kleinberg [18]. That work considers the arrival rate of messages as the main factor in feature extraction, and the result yields a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. Following a different line of thought, the threshold-based method, Swan [19–21] assumed that each feature in a random time period always maintains a certain average rate; the mean of the average rate is the frequency of the feature in the time period divided by the total time interval. The burst detection algorithm in their work therefore uses the results of the chi-square test as its threshold; once the average rate of a feature exceeds the threshold, it is recognized as a significant feature. Alternatively, a trend-based method [22] treats a word as a falling or rising word by measuring absolute or relative change. In this work, we do not compare two event times (the current and the last record), but consider the long-term expectation value. In addition, once the weights of features in the dynamic stream have been computed, the next step is to group relevant text documents into thematic collections. This is called the task of data stream clustering [23–25].

STREAM [24] was the first data stream clustering technique, proposed by Guha. The k-median clustering algorithm was adopted with a simple algorithm based on divide-and-conquer to solve the space limitation problem. In addition, CluStream [26] is a stream clustering process that combines an online component that periodically stores detailed summary statistics with an offline component that uses only these summary statistics. A modification of the CluStream technique, HPstream [27], is a projected clustering method that reduces the dimensions of the data stream. Another work similar to CluStream is Birch [28]. Birch (Balanced Iterative Reducing and Clustering using Hierarchies) is a well-known incremental clustering method that summarizes online information into offline records and uses developed measurements to cope with the natural closeness of points, so as to avoid scanning all data points. Zhong [29, 30] combined an online spherical k-means algorithm with an existing scalable clustering strategy to achieve fast and adaptive clustering of text streams. Generally speaking, in our survey most text stream clustering works use online k-means-based techniques as their major text stream clustering algorithm. The main drawback of the k-means-based online clustering method is that the parameter k (i.e. the number of topics) must be fixed in advance, and it is thus unsuitable for some real-world applications in the problem domain, especially the topic detection task with dynamic topics. Such issues were discussed in related research [30, 31], and some solutions for avoiding empty cluster problems and for choosing k were addressed. To avoid the problems of k-means-based clustering methods, we used online density-based approaches for mining topics from the microblogging data collection.

Since the results of data stream clustering evolve with time, some authors have also addressed them as evolutionary clustering [32]. This allows new incoming data to be compared with current ones, whereas one-pass clustering systems attempt to forget data once they have passed. Chakrabarti et al. [32] extended two widely used clustering algorithms, namely k-means and agglomerative hierarchical clustering, into an evolutionary method. The authors claimed that evolutionary clustering is good at noise removal and at catching the topic shift over time. Similar to evolutionary clustering, an incremental extension of the DBSCAN clustering method called IncrementalDBSCAN [33, 34] also preserves the data in memory. The algorithm of IncrementalDBSCAN operates on the insertion and deletion of updates, and all situations that involve the density-connectivity of objects (e.g. absorption, merging) are event-driven. Owing to the advantages of density-based clustering, we modified and enhanced the original IncrementalDBSCAN clustering method for the implementation of event detection.

Streaming text means that its concept may change over time like a water flow; such a characteristic is known as concept drift [35, 36]. Concept drift is an intractable problem that violates the rule of training data. Intuitively, a priori knowledge is a useful justification that machine learning algorithms and data mining algorithms attempt to acquire through the training process. However, in streaming data, the patterns of a priori knowledge may cause errors because the concept of streaming data is rapidly changing. Hence a good text stream clustering method should not segment an evolving topic into separate topics [37]. To overcome this problem, Aggarwal et al. [26] proposed a micro-clustering approach that stores summaries in macro-clusters as snapshots. This two-phased approach enables the user to explore the nature of the evolution of the clusters over time. On the other hand, some literature has revealed that the sliding window process is a natural choice to handle this problem. Zhou et al. proposed SWClustering [38], which stores temporal information in EHCF (Exponential Histogram of Cluster Features) over a sliding window for handling the clusters’ evolution.


3. System architecture

The aim of this work was to provide a solution that allows users to quickly and accurately discover emerging events to satisfy their information needs. To achieve this goal, we focused on developing functions for detecting real-time topics in a timely updated manner. The system architecture developed for mining microblogging message streams is presented in Figure 1. In Figure 1, the online text-stream clustering we developed is based on a sliding window model and combines three technical components: a dynamic term weighting scheme, a neighbourhood generation algorithm and an online density-based clustering approach. After the system module has detected the event topics, a topic ranking algorithm recommends urgent event information with a ranking, to help people organize emerging event information effectively.

As the system starts to process the message streams, the language filter filters out incoming messages that contain non-ASCII characters (e.g. Chinese, Japanese, etc.), and transforms the message contents into bag-of-words features. In this work, we do not filter out messages that contain abbreviations and text message shorthand in the content. This is because we believe that every feature in message flows might carry implicit temporal information, even where the message contains only one word (e.g. ‘earthquake!!’). A detailed description of the subsystems is given in the following section.

4. The proposed methods for stream mining and clustering on microblogs

In this section, the kernel approach of this work, the developed online text-stream clustering approach for mining the microblogging message stream, is described. The developed online clustering method, which is based on a sliding window model, includes three components: a dynamic term weighting scheme, a neighbourhood generation algorithm and an online density-based clustering approach.

4.1. The dynamic term weighting technique and a sliding window approach

Figure 1. Illustration of the system framework.

To process texts in chronological order, a fundamental problem is how to find the significant features in text streams. Specifically, the trends of a concept are often not stable but change with time, which is known as concept drift. Under such circumstances, the weighting scheme for a microblogging message should be constantly updated. However, it is worth mentioning that almost all terms occur in each message only once owing to the length limitation of messages. Therefore, the computation overhead of term frequency, tf, is strongly affected by the length of messages. In addition, the document frequency, df, conflicts with the operation of topic mining, since a higher df value implies that the term occurs in many documents, which might lead to the problem of missing topic words in messages to some extent. Here we apply the developed term weighting scheme BursT, which was described in our previous work [39]. Our strategy in determining the BursT value is that a heavier weight is given to a higher burstiness, in which a word occurs frequently in the window. The formula of BursT is presented in equation 1 [40]:

$$BursT_{w,t} = BS_{w,t} \times TOP_{w,t} \qquad (1)$$

where the weight of the word w at time t will be constituted by two factors – BS (Burst Score) and TOP (Term Occurrence Probability). The basic idea of the BS factor is to calculate the arrival rate of words, and each word will be compared with its long-time average of arrival rates, so a word will be recognized as a rising feature when the current arrival rate surpasses its average value. On the other hand, the TOP (term occurrence probability) factor represents the probability of the word occurrence in the sliding window. This factor enables the weight of the word to grow with its occurrence frequency in messages, in order to identify trending topics. For the operation of mining hot news topics from messages, if a word occurs in more messages, it is more likely to be a trending topic. For example, the word ‘mac’ may be contributed from ‘Mac OS’ as a new Macintosh computer will be available shortly; meanwhile, some people may be talking about ‘Mac Taylor’ (i.e. a character in the well-known TV drama CSI: NY). In this case, the word ‘mac’ can be correctly detected as a topic word in the appropriate trending topic associated with a specific emerging event through applying the BursT weighting formula. A more detailed description of the weighting factors can be found in our previous work [40].
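The product form of equation 1 can be illustrated with a minimal Python sketch. The exact definitions of BS and TOP are given in the authors' previous work and are not reproduced here, so the two factors below are assumptions made only for illustration: BS is taken as the excess of the current arrival count over the long-term mean, normalized by that mean plus one, and TOP is taken as the fraction of window messages containing the word. The class name BurstWeigher is hypothetical.

```python
from collections import defaultdict


class BurstWeigher:
    """Toy BursT-style weighting: weight = burst score * term occurrence probability.

    Both factor definitions are illustrative assumptions, not the published ones.
    """

    def __init__(self):
        self.total_count = defaultdict(int)  # occurrences of each word over past steps
        self.steps_seen = 0                  # number of time steps processed so far

    def weights(self, window_messages):
        """window_messages: list of token lists currently in the sliding window."""
        past_steps = self.steps_seen
        self.steps_seen += 1
        n_messages = len(window_messages)

        # Count in how many window messages each word occurs.
        doc_count = defaultdict(int)
        for tokens in window_messages:
            for word in set(tokens):
                doc_count[word] += 1

        result = {}
        for word, count in doc_count.items():
            # Long-term average arrival rate over the steps seen before this one.
            mean_rate = self.total_count[word] / past_steps if past_steps else 0.0
            # Assumed BS: positive only when the current rate exceeds the mean.
            bs = max(count - mean_rate, 0.0) / (mean_rate + 1.0)
            # Assumed TOP: share of window messages that contain the word.
            top = count / n_messages
            result[word] = bs * top

        # Fold the current step into the long-term statistics.
        for word, count in doc_count.items():
            self.total_count[word] += count
        return result
```

Under these assumptions, a word whose current rate is at or below its long-term average receives a weight of zero, while a word that suddenly appears in most of the window receives a high weight.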

4.1.1. Incorporating a sliding window model. Prior to microblogging stream mining, the first challenging task is to process the never-ending data streams. Previous related work exploring the concept of temporal locality [41] has shown that we do not need to keep a vast volume of messages in memory, because the message distribution of a topic in the temporal domain is largely concentrated in a period of time. This implies that storing a subset of the entire data set for processing is feasible. If an event is very frequently discussed, relevant messages will consecutively appear and prolong the life of a topic. Also, it is almost impossible to store all messages at one time owing to memory limitations and the constant passage of time. As a result, in this work we adopted the sliding window model [42] to tackle the issue by means of moving time frames to process incoming messages. Briefly, the steps of the sliding window technique include: (1) the insertion operation, in which a new index entry is built when a message comes in; (2) reservation of the message until its lifetime exceeds the fixed length of the time window tw; and (3) the deletion operation, in which the message is removed from memory.
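The three steps above can be sketched as follows. This is a minimal illustration rather than the system's implementation; the class name SlidingWindow and its fields are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the messages whose age does not exceed the window length t_w."""

    def __init__(self, window_length):
        self.window_length = window_length  # t_w, in the same units as timestamps
        self.messages = deque()             # (timestamp, message_id, tokens), oldest first

    def insert(self, timestamp, message_id, tokens):
        """Step 1: store the incoming message, then expire anything too old."""
        self.messages.append((timestamp, message_id, tokens))
        return self.expire(timestamp)

    def expire(self, now):
        """Steps 2 and 3: keep messages while young, delete them once too old.

        Returns the removed messages so that dependent structures
        (index entries, clusters) can be updated by the caller.
        """
        removed = []
        while self.messages and now - self.messages[0][0] > self.window_length:
            removed.append(self.messages.popleft())
        return removed
```

Returning the expired messages from insert lets the caller trigger the corresponding deletion updates in the clustering step.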

4.2. Neighbourhood generation algorithm

In this work, we applied density-based clustering in an incremental manner. Before real-time topics can be detected by the clustering algorithm, the first step is to build neighbourhood relationships among messages. The most frequently used method to analyse relations between texts is a nearest-neighbour based approach. When a message comes into the system, neighbours should be picked to establish relations with it. The simplest method to select neighbours is to scan the entire corpus, but this is time consuming and inefficient. In order to reduce the computational cost, in this work we utilize an inverted table that records, for each word, the messages in the sliding window in which it appears. Thus, when a new message comes into the system, it is only compared with the messages that share at least one feature with it, namely its candidate neighbours. For these candidate neighbours, the distance that represents the extent of temporal text similarity between the messages is calculated. The similarity measure also considers timing effects, reducing the similarity of two messages according to the time distance between them [37]. Consequently, the exact neighbour set of the message will be established to further support text clustering.
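The inverted-table lookup described above can be sketched as follows. The class and method names are hypothetical, and messages are assumed to be already tokenized into bag-of-words features.

```python
from collections import defaultdict


class InvertedTable:
    """Maps each word to the ids of the window messages that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, message_id, tokens):
        """Index a message when it enters the sliding window."""
        for word in set(tokens):
            self.postings[word].add(message_id)

    def remove(self, message_id, tokens):
        """Drop a message from the index when it leaves the window."""
        for word in set(tokens):
            self.postings[word].discard(message_id)
            if not self.postings[word]:
                del self.postings[word]  # keep the table free of empty entries

    def candidates(self, tokens):
        """Return ids of messages sharing at least one word with `tokens`."""
        found = set()
        for word in set(tokens):
            found |= self.postings.get(word, set())
        return found
```

Only the returned candidates need a similarity computation, which avoids scanning every message in the window.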

If Sim(m_a, m_b) exceeds the given threshold radius r, m_b will be a neighbour of m_a, and m_a will also be a neighbour of m_b. The correlation function of m_a and m_b can be denoted as:


In equation (2), the standard cosine function is used for content-based similarity measurement. To take temporal information into account, the temporal penalty tp(m_a, m_b) is an exponential function that captures the reduction in similarity between two documents as the time distance between them grows [37]. Consequently, the exact neighbour set of the message m will be established to further support text clustering.
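A time-penalized similarity of this kind can be sketched in Python as follows. The sketch assumes that the similarity is the product of the cosine similarity and an exponential penalty exp(-|t_a - t_b| / decay); the product form, the decay parameter and all function names are assumptions made for illustration, not the published definition.

```python
import math


def cosine(weights_a, weights_b):
    """Standard cosine similarity between two sparse word-weight dictionaries."""
    shared = set(weights_a) & set(weights_b)
    dot = sum(weights_a[w] * weights_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in weights_a.values()))
    norm_b = math.sqrt(sum(v * v for v in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def temporal_penalty(time_a, time_b, decay):
    """Assumed exponential penalty: 1 for simultaneous messages, falling with the gap."""
    return math.exp(-abs(time_a - time_b) / decay)


def similarity(weights_a, time_a, weights_b, time_b, decay=3600.0):
    """Assumed combination: content similarity scaled by the temporal penalty."""
    return cosine(weights_a, weights_b) * temporal_penalty(time_a, time_b, decay)


def is_neighbour(sim, radius):
    """Two messages are neighbours when their similarity exceeds the radius r."""
    return sim > radius
```

With this form, two messages with identical content posted one decay period apart have a similarity of about 0.37, so whether they remain neighbours depends on the chosen radius r.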

4.3. The online density-based clustering approach

As the neighbour relations are updated, the next step is to incrementally gather the messages into appropriate thematic topics. In this part of our work, an online text-stream clustering approach is adopted. The goal of clustering a text stream is to detect a set of clusters in near real time from a never-ending text stream; these clusters are called topic clusters. To reach this goal, a modified version of IncrementalDBSCAN is developed in this work. The pros and cons for practical considerations are discussed in the following sections.

4.3.1. The IncrementalDBSCAN approach. To work with the sliding window, text stream clustering should also be designed around the functions of insertion and deletion. The operations of a text stream clustering algorithm must include a response to the situations that occur when a new object comes into the system and a strategy for deleting objects. In this work, we adopted a density-based clustering method called IncrementalDBSCAN to group relevant messages thematically into topic clusters (see workflow in Figure 2). In our preliminary system, DBHTE (Density-based Hot Topic Extraction) was tested with IncrementalDBSCAN, and the results show that it is a feasible solution for incrementally clustering microblogging messages for topic detection. The reasons for adopting density-based clustering are as follows. (1) Density-based clustering makes no assumption about the number of clusters, whether as a fixed or flexible parameter k (i.e. the number of topics). Several methods need the number of topics to be initialized and are thus unsuitable for some real-world applications in the problem domain, especially the topic detection task with dynamic topics around the world. (2) It has the ability to detect arbitrarily shaped clusters (see sketch in Figure 3). (3) Messages collected from microblogs normally contain much noise. When mining microblogging messages, the clustering algorithm should filter out as much noise as possible while processing the contents. Density-based clustering groups objects based on their density connectivity and treats noise as outliers that are not included in any cluster.

Figure 2. Two phases of text stream clustering in the IncrementalDBSCAN approach with the sliding window model.

The most interesting feature of IncrementalDBSCAN is that it has a ‘transitive’ characteristic. In Figure 4, suppose there exists a situation in which A is density-reachable to B, and B is density-reachable to C. Although there is no direct relation between A and C, there must exist a transitive relation between A and C through B. For example, when an earthquake event happens, the topic cluster may contain many groups about ‘earthquake’. After a while, some messages about ‘tsunami’ appear; in other words, the earthquake event starts drifting imperceptibly in its concept to a ‘tsunami’ event, and then moves on to the topic of ‘nuclear’. This is a typical pattern of topic evolution or concept evolution [43]. Transitive characteristics may help us to establish a hidden relation indicating that ‘earthquake’ and ‘nuclear’ are both on the same topic. This kind of transitive absorption helps the system adapt to concept drift and prevents segments of one topic from becoming independent topics.
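The transitive relation can be illustrated with a small Python sketch that collects every message reachable from a starting message through chains of core objects. It is a batch illustration of density-reachability, not the incremental algorithm itself; the function name is hypothetical, and a message is assumed to be a core object when it has at least min_pts neighbours, not counting itself.

```python
from collections import deque


def density_reachable_set(start, neighbours, min_pts):
    """Collect all messages density-reachable from `start`.

    neighbours: dict mapping a message id to the set of its neighbour ids.
    """
    cluster = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Only core objects (enough neighbours) extend the cluster further.
        if len(neighbours.get(current, ())) < min_pts:
            continue
        for other in neighbours[current]:
            if other not in cluster:
                cluster.add(other)
                queue.append(other)
    return cluster
```

For example, with neighbours = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}} and min_pts = 1, starting from A yields a set containing C, although A and C are not direct neighbours.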

Another transitive phenomenon also exists in the algorithm of IncrementalDBSCAN, namely transitive merging. Suppose there are two cluster topics α and β; at first, they do not share any bursty word and so are treated as two independent topics. After a moment, a number of γ messages appear frequently, and the γ messages have the bursty features of both α and β. This indicates that there is a causal relationship between topics α (cause) and β (effect), and transitive merging will be triggered to merge them into a single topic.

Figure 4 shows two topics, one discussing the Grammy Awards and another talking about the American singer Lady Gaga. At first there seems to be no relation between them, but with time, many messages mention both Grammy and Lady Gaga, which indicates that Lady Gaga was performing a song on the Grammy Award show. Thus all messages about these topics should be properly merged. However, to meet the updating requirement, some algorithms maintain clusters independently and assign an incoming message to the nearest cluster. Two separate topics will then not be merged, and additional checking is needed in these methods to periodically compare the distributions of the pair-wise topics in order to merge relevant topics. In contrast, IncrementalDBSCAN provides a merge operation to dynamically change the assignments of messages, and transitive merging has the ability to merge separate topics that are causally related, which makes our system more effective.

4.3.2. The modified IncrementalDBSCAN approach. Unfortunately, IncrementalDBSCAN is not a fully compatible algorithm for mining topics in user-generated content. Here we point out two problems in IncrementalDBSCAN and modify the algorithm to make our system more robust. First, in our observation, there exists minority noise in which a single message mentions a mixture of trending topics. This can cause the ‘transitive merging’ condition to be triggered. Imagine a scenario in which the topic detection system maintains three topic clusters that were originally independent. They discuss the topics of the Royal wedding, Sam Adams and Osama being killed, respectively. Now a message arrives whose content is ‘royal wedding on friday, sam adams on saturday, and osama killed on sunday . how can this weekend get any damn better!?’ At this time, it may trigger the merge condition, because the message establishes neighbourhood relationships with messages from all three topics. As a result, the separate topics will be merged into a single topic, as shown in Figure 5.

Transitive merging has two sides: positive and negative. The modification we are going to perform is to adjust the merge judgment by adding some constraints. They are listed as follows:

1. First, ensure that object γ has enough influence to trigger the merge condition. For instance, the incoming message should be a core object.

2. Subsequently, calculate the merge score for the pair-wise candidate clusters to make sure that they are extremely similar. For example, the merge score can be measured by bursty words, and it should exceed the given threshold (i.e. double Minpts). The design of the merge score calculation algorithm is shown in mergeScore().

3. Otherwise, the incoming message should be absorbed by the most relevant topic cluster (Scheme 1).
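The three constraints above can be sketched in Python as follows. The sketch assumes that the merge score is the number of bursty words shared by two clusters and that a core object is a message with at least min_pts neighbours; both are illustrative assumptions, and merge_score here is a hypothetical stand-in rather than the mergeScore() routine of Scheme 1.

```python
from itertools import combinations


def merge_score(bursty_a, bursty_b):
    """Assumed merge score: number of bursty words shared by two clusters."""
    return len(set(bursty_a) & set(bursty_b))


def decide(neighbour_count, clusters, relevance, min_pts):
    """Decide how an incoming message affects the clusters it touches.

    clusters:  dict cluster id -> set of bursty words of that cluster
    relevance: dict cluster id -> relevance of the message to that cluster
    Returns ("merge", [(id_a, id_b), ...]) or ("absorb", cluster_id).
    """
    # Constraint 1: only a core object may trigger a merge.
    if neighbour_count >= min_pts:
        # Constraint 2: a pair merges only if its score exceeds double MinPts.
        pairs = [
            (a, b)
            for a, b in combinations(sorted(clusters), 2)
            if merge_score(clusters[a], clusters[b]) > 2 * min_pts
        ]
        if pairs:
            return ("merge", pairs)
    # Constraint 3: otherwise absorb the message into the most relevant cluster.
    best = max(relevance, key=relevance.get)
    return ("absorb", best)
```

Under these assumptions, a message that touches the Royal wedding and Osama clusters, which share no bursty words, is simply absorbed into the more relevant one, whereas two clusters that share many bursty words are merged.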

Figure 4. (a) Transitive absorption benefits of adapting concept drift among topic evolution, and (b) transitive merging is able to merge separated topics when they have causality.


4.5. Topic ranking algorithm

The challenges of ranking microblogging posts mainly come from the brevity of user-generated content and its real-time characteristics. Developers should therefore design more efficient algorithms to rank posts, in particular avoiding any factor that must be recomputed over the entire collection each time. To satisfy this need, we developed an approach called topic energy [37] to evaluate the significance of each topic. To capture topic evolution over time, topic energy uses records to evaluate the significance of a given topic at each time point over a time period. After messages have been aggregated into topics thematically, users may want to know how many hot topics there are on microblogs and how popular they are.

4.5.1. Representation of a resulting cluster for a topic event and ranking of related messages. The intensity of a cluster can be recognized by means of the energy function, but it is also necessary to identify the central concept of a topic cluster. Some researchers have employed topic summarization, selecting representative sentences in each topic cluster. In an attempt to keep the content original, we instead propose a ranking function that sorts the messages in each topic cluster online.

Instead of re-calculating over the entire message collection for ranking, we rank messages after the clustering operation, based on the following considerations: (1) sorting messages within each topic is faster than directly sorting all messages; and (2) users can view ranked messages within a topical structure.

To score a message at time t, we define the message score MS_m (7) and the scoring function (8) as:

$$MS_m = \sum_{w_j \in m} \frac{weight_{w_j}}{|w_j \in m|} \times \frac{|w_j \in TC_t|}{|w \in TC_t|} \qquad (7)$$

$$score_m^t = MS_m \times e^{-(ct - t_m)/\tau} \qquad (8)$$

The message score MS_m is calculated when the message enters the system by accumulating the BursT weighting values of each word w_j, multiplied by the ratio of occurrences of w_j in the topic cluster TC_t at time point t. The scoring function then combines the message score with an aging factor. The aging factor penalizes older messages by comparing the current time ct with the timestamp t_m of the message. Here, τ is an exponential time constant; in this work we usually set it to 600. The score of a message may therefore change over time. Because of the sliding window model, the ranking order may also change over time: even a very high-scoring message in a topic may later be replaced by a message with a higher score, or removed once its lifetime exceeds the length of the sliding window.
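The scoring and aging steps described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the dictionary-based inputs and the choice of seconds as the time unit are assumptions, and `message_score` follows one reading of equation (7), in which each word's BursT weight is divided by the number of words in the message and scaled by that word's share of the topic cluster.

```python
import math


def message_score(message, weights, topic_counts):
    """One reading of equation (7).

    message:      list of words in the message
    weights:      word -> BursT weight (assumed to be given)
    topic_counts: word -> occurrences in the topic cluster at time t
    """
    total = sum(topic_counts.values())
    if not message or total == 0:
        return 0.0
    return sum(
        weights.get(w, 0.0) / len(message) * topic_counts.get(w, 0) / total
        for w in message
    )


def aged_score(ms, current_time, message_time, tau=600.0):
    """Equation (8): the message score decayed by the aging factor."""
    return ms * math.exp(-(current_time - message_time) / tau)


def rank(messages, now, window=7200.0, tau=600.0):
    """Drop messages older than the sliding window, then sort by aged score.

    messages: list of dicts with keys "ms" (message score) and "time".
    """
    live = [m for m in messages if now - m["time"] <= window]
    return sorted(
        live,
        key=lambda m: aged_score(m["ms"], now, m["time"], tau),
        reverse=True,
    )
```

Because only the exponential factor depends on the current time, MS_m needs to be computed once per message, which is consistent with the stated aim of not recomputing over the whole collection.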

5. Experiments and discussion

5.1. Preliminary experiment

First, we performed a small preliminary experiment to assess the feasibility and potential of the developed system. The experiment aimed to compare our detection results with events in the real world; therefore Google Trends (see Note 1), which reflects what keywords people are searching for on a daily basis, was selected as a baseline to evaluate our system performance. We applied the developed topic detection method (previously known as the DBHTE algorithm) with the parameter settings Eps = 0.4 and MinPts = 10, and the length of the time window W was set to 1 hour. We submitted the keyword ‘Microsoft’ to Google Trends, which returned the news item ‘Microsoft launches Kin phones’ reported by Times Live at 09:14 AM, 13 April 2010, as shown in Figure 7. For comparison, we examined the sample messages in our collection that contained the string ‘Microsoft’, as shown in Figure 8. Figure 8 shows that the maximum peak in our tweet collection was at 03:00 AM, 13 April 2010.

The results in Figures 7 and 8 indicate that, in this case, the result in the microblogging system was more sensitive and clearer than the one in Google Trends for event detection before the news was reported. There were 20 messages in the cluster related to ‘Microsoft launches Kin phones’ generated at 02:56:35, 13 April 2010.

5.2. Running time

To get an initial picture of how long our system takes to carry out a task across all stages, we also measured its real-time performance. The results give the timings of the processing steps along the information flow within our system. The running time for a task includes: (1) collecting the input data through the Twitter API, 0.04347 seconds; (2) decomposing each message into keywords and filtering out stop words, 0.0023 seconds; (3) the term-weighting process for each message, 0.00023 seconds; (4) the neighbourhood generation operation for each message, 0.00015 seconds; (5) the density-based clustering process, 0.000228 seconds; and (6) the ranking operation, 0.000002565 seconds. On average, the memory consumed in performing a task is about 1.2 GB, and the length of the time window (i.e. the size of the sliding window) is normally set to 1–2 hours. In our example, the longer window setting (i.e. 2 hours) consumed more memory than the shorter one (i.e. 1 hour). However, a longer window reduces the probability of missing messages within an event life cycle. A detailed discussion of more experiments is provided in the following sections.
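To put the stage timings in proportion, the short Python sketch below simply adds up the six reported values and identifies the slowest stage. The stage labels and the `summarize` helper are hypothetical names introduced for this illustration; only the numbers come from the measurements above.

```python
# Reported per-stage running times in seconds; the keys are illustrative labels.
STAGE_SECONDS = {
    "collect": 0.04347,
    "tokenize_and_filter": 0.0023,
    "term_weighting": 0.00023,
    "neighbourhood_generation": 0.00015,
    "clustering": 0.000228,
    "ranking": 0.000002565,
}


def summarize(stage_seconds):
    """Return (total time, slowest stage, slowest stage's share of the total)."""
    total = sum(stage_seconds.values())
    slowest = max(stage_seconds, key=stage_seconds.get)
    return total, slowest, stage_seconds[slowest] / total


total, slowest, share = summarize(STAGE_SECONDS)
print(f"total per task: {total:.6f} s")
print(f"slowest stage: {slowest} ({share:.1%} of total)")
```

The six stages sum to roughly 0.0464 seconds per task, of which data collection through the API accounts for by far the largest part; the five processing stages together take about 0.0029 seconds.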

5.3. Collecting data sets from microblogging big data

To detect topics in microblogs, a total of 170,427,512 microblogging posts were collected from Twitter over a period of 140 days (from 6 January to 26 May 2011). The test samples were collected through the Twitter Stream API (see Note 2) using the Twitter4J library (see Note 3), and timestamps are based on Taipei time (GMT + 8:00). After filtering out non-ASCII tweets, 88,682,820 available tweets were used as our data source. It is worth mentioning that all tweets were stored in a static corpus, and our algorithms then processed them in an online manner (i.e. in time-sequence order). Subsequently, we partitioned messages into unigrams and removed the substring ‘RT @username:’. In this work, the stopword list contains stopwords in several different languages (see Note 4), since the collected tweets contain multilingual texts. Finally, all capital letters in each tweet were converted into lowercase for our experiments.
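The preprocessing steps listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the tokenization rule (whitespace splitting with surrounding punctuation stripped), the regular expression for the retweet prefix and the tiny `STOPWORDS` set are placeholders, since the actual tokenizer and the multilingual stopword list are not reproduced here.

```python
import re

# Matches the retweet marker "RT @username:" wherever it occurs (assumed pattern).
RT_MARKER = re.compile(r"RT @\w+:\s*")

# Tiny stand-in for the multilingual stopword list used in the experiments.
STOPWORDS = {"the", "a", "of", "on", "in", "and", "to"}


def preprocess(tweet):
    """Return the list of unigrams for a tweet, or None if it is non-ASCII."""
    if not tweet.isascii():
        return None  # non-ASCII tweets are filtered out
    text = RT_MARKER.sub("", tweet).lower()
    tokens = (tok.strip(".,:;!?\"'()") for tok in text.split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]


print(preprocess("RT @stcom: BREAKING: Magnitude 7.9 earthquake hits Japan"))
```

The retweet marker is removed before lowercasing so that the case-sensitive pattern still matches; the order of the remaining steps does not affect the result in this sketch.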

Figure 7. Query result of a keyword ‘Microsoft’ in Google Trends in the preliminary experiment.


5.4. Evaluating trends of features with timelines

In order to examine how well the system reflects the trends of words, we selected the ‘President Obama delivers statement on death of Osama Bin Laden’ event as our case study. Our experiment plotted the trend of the string ‘Obama’ along the timeline. Figure 9 shows, in an enlarged view, the inter-arrival gaps of the feature word ‘Obama’. In the figure, the expected values for ‘Obama’ decreased when the inter-arrival gaps suddenly dropped owing to the event occurring. A burst period can be identified once the gap value falls below the line of expected values. It is worth mentioning that the historical case investigated here serves only our empirical study, without any political motivation.
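The burst criterion described above, a gap falling below its expected value, can be illustrated with the following Python sketch. How the expected value is computed is not restated in this section, so the sketch assumes an exponentially weighted running mean of past gaps with a hypothetical smoothing factor `alpha`; the function name is likewise illustrative.

```python
def burst_flags(arrival_times, alpha=0.2):
    """Flag each inter-arrival gap that falls below its expected value.

    arrival_times: sorted timestamps at which the feature word arrived.
    alpha:         smoothing factor of the running mean (an assumption).
    Returns one boolean per gap; the first gap only initialises the mean.
    """
    flags = []
    expected = None
    for prev, cur in zip(arrival_times, arrival_times[1:]):
        gap = cur - prev
        if expected is None:
            expected = gap
            flags.append(False)
            continue
        flags.append(gap < expected)
        # The expected value follows the gaps, so it also drops during a burst.
        expected = (1 - alpha) * expected + alpha * gap
    return flags


# A word arriving every 100 time units, then suddenly every 5.
print(burst_flags([0, 100, 200, 300, 305, 310, 315]))
```

With this choice of running mean, the expected value decreases once short gaps start to arrive, which matches the behaviour noted for ‘Obama’ in Figure 9.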

In the task of topic detection, the TFPDF term-weighting scheme is a well-known method for finding emerging topics: it gives a higher weight to a term that occurs frequently in many documents from newswire sources. The PDF weight is an exponential function of the number of documents containing the term relative to the total number of documents in the channel. The TFPDF scheme benefits topic detection tasks, particularly for finding significant words, but it over-emphasizes frequent features, which may give oral (colloquial) words heavier weights.
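To make the over-weighting effect concrete, the sketch below computes a TFPDF-style weight in Python. The formula is not given in this section; the code assumes the commonly cited form, in which a term's weight is the sum over channels of its normalized frequency multiplied by the exponential of the fraction of documents in the channel that contain it. The function name and the toy channel are illustrative.

```python
import math


def tfpdf(term, channels):
    """TFPDF-style weight of a term (assumed form, see the lead-in).

    channels: list of channels, each a list of documents (lists of tokens).
    """
    weight = 0.0
    for docs in channels:
        freq = {}
        for doc in docs:
            for tok in doc:
                freq[tok] = freq.get(tok, 0) + 1
        norm = math.sqrt(sum(f * f for f in freq.values()))
        if norm == 0:
            continue  # empty channel contributes nothing
        docs_with_term = sum(1 for doc in docs if term in doc)
        weight += (freq.get(term, 0) / norm) * math.exp(docs_with_term / len(docs))
    return weight


channel = [["lol", "earthquake"], ["lol"], ["lol", "japan"], ["lol"]]
print(tfpdf("lol", [channel]), tfpdf("earthquake", [channel]))
```

In this toy channel the oral word ‘lol’ appears in every document and therefore receives a much larger weight than ‘earthquake’, since both the frequency term and the document-frequency exponent grow with how common a word is.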

Subsequently, we compared the weighting values of the BursT and TFIDF methods, as shown in Figure 10, and found that incremental TFIDF cannot reflect the actual trends under the sliding window algorithm, whereas TFPDF and our approach performed well on topic words. In the analysis of oral words, however, we took the word ‘lol’ as an example: it has both a high density in the collection and a high arrival rate, yet it is clearly not suitable as a valid feature. It is worth mentioning that some popular oral words can easily be over-weighted in TFPDF because it places too much emphasis on document frequency. As shown in Figure 10D and E, the weight of ‘lol’ in TFPDF is still higher than in incremental TFIDF, even while the event is still happening.

5.5. Distributions of bursty features and oral features

Using the collected Twitter data sets, we evaluated our system using the weight distribution of BS and TOP in snapshots. In this section, we take the event ‘Japan earthquake on 11 March 2011’ as an example. The ‘Japan earthquake’ topic cluster was created by our system on 11 March 2011 at 14:03:28; in this first snapshot, no weights were heavier than 0.5 (Figure 11). We found that bursty features such as ‘earthquake’ and ‘scares’ are the major key terms in forming a new topic. In our weight design, a heavy weight is assigned to a word that has a higher burst rate than expected within a certain range of document frequency. In the oral features listed in Table 1, all samples have a higher TOP value than the bursty features, but they are not in a bursty state and so do not qualify as important features.

After a period of time, at 21:34:59, the concept of the topic started to change; many people were talking about the tsunami that might affect Hawaii. In Figure 11, some features now have a weight > 0.5, marking them as highly influential. Additionally, under the effect of the news media, the list of bursty features gradually includes the explicit words (Table 1); an interesting finding is that the word ‘jepang’, which is ‘japan’ in Bahasa Indonesia, is also a bursty feature. Also, although ‘hawaii’ has a lower TOP value than ‘japan’ at that time, the BS factor raised its priority, indicating that ‘hawaii’ was more important than ‘japan’. This confirms that the burst detection in our term weighting method is effective.

Figure 9. Inter-arrival gaps using the feature word ‘Obama’ in the period of the event of ‘President Obama delivers statement on death of Osama Bin Laden’.


work, we applied evaluation metrics to investigate the reliability of our algorithm with a 2 hour sliding window and to show how the radius affects the system performance. The approximate topic recall and topic precision values of 10 selected topic clusters are shown in Figure 12. Ideally, a lower similarity threshold yields a higher approximate topic recall, meaning that the system is more capable of retrieving relevant messages. On the other hand, it carries a higher probability of gathering noise and producing a low-quality cluster. However, for the fourth, sixth, eighth and tenth selected clusters with Eps = 0.3, the system has a lower approximate topic recall, since the ETRSet_k for each topic cluster k is independent. The ninth topic cluster, ‘Osama Bin Laden killed by U.S.’, has a low approximate topic recall at Eps = 0.4 and Eps = 0.5 owing to the low radius threshold. According to the experimental results, on average our system achieved high topic precision, except for the second topic cluster, ‘Super Bowl 2011’, with Eps = 0.3. That was due to numerous errors occurring in clustering the messages about ‘puppy bowl’ and the ‘Chelsea vs. Liverpool football game’. To sum up, the aim of this work is to help resolve the problem of information overload. Consequently, in the performance trade-off we are particularly concerned with topic precision (i.e. precision) rather than approximate topic recall (i.e. recall). The experimental results show that our system is effective in online text-stream clustering, specifically with respect to topic precision, as shown in Figure 13.
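The trade-off between the two measures can be illustrated with the Python sketch below. The exact definitions of approximate topic recall and topic precision are not restated in this part of the text, so the sketch assumes the usual set-based forms: precision is the fraction of a cluster's messages that are relevant, and recall is the fraction of a reference set (taken here to play the role of ETRSet_k) that the cluster retrieves. The function names and message identifiers are hypothetical.

```python
def topic_precision(cluster_ids, relevant_ids):
    """Fraction of the messages in the cluster that are relevant to the topic."""
    cluster = set(cluster_ids)
    if not cluster:
        return 0.0
    return len(cluster & set(relevant_ids)) / len(cluster)


def approximate_topic_recall(cluster_ids, reference_ids):
    """Fraction of the reference set for topic k that the cluster contains."""
    reference = set(reference_ids)
    if not reference:
        return 0.0
    return len(set(cluster_ids) & reference) / len(reference)


cluster = [1, 2, 3, 4]            # messages assigned to the cluster
relevant = [1, 2, 3]              # of those, the truly relevant ones
reference = [1, 2, 3, 5, 6, 7]    # reference set for the topic
print(topic_precision(cluster, relevant))
print(approximate_topic_recall(cluster, reference))
```

Under these assumed definitions, widening the radius tends to add messages to the cluster, which can raise recall while lowering precision, which is the trade-off discussed above.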

Figure 11. Term weight distribution with BS and TOP factors on 11 March 2011, 14:03:28 (the time at which the earthquake was detected by our system), 11 March 2011, 21:34:59 (the time at which the tsunami caused by the Japan earthquake headed towards Hawaii) and 12 March 2011, 18:14:06 (the time at which the Japanese government confirmed the radiation leak).


6. Conclusions

In this paper we have developed an online event clustering and topic ranking method for real-time event monitoring using big data collected from Twitter, based on burst detection and clustering algorithms. The conclusions of this work are as follows.

We implemented a set of technical solutions for event detection from microblogging message streams, which integrate an online text-stream clustering method combining a dynamic term weighting scheme, neighbourhood generation algorithm and text stream clustering methods, as well as a topic ranking approach.

The aim of our approach is to establish a comprehensive way to organize real-time event topics, allowing users to quickly find information related to emerging events across the world. Compared with most real-time information search services, our method does not require any form of user query for acquisition of information regarding event development.

Our approach is also designed to cope with the effects of concept drift in the tasks of online topic detection and tracking, especially the design of bursty features for topic shifting in microblogging streams.

Table 2. Top-five posts and bursty words

Topic energy | Top-five posts

11 March 2011, 14:03:28

0.063 36146830 | Fri Mar 11 14:02:42 CST 2011 | Singapore | RT @stcom: BREAKING: Magnitude 7.9 earthquake hits Japan, rattling buildings in Tokyo. Tsunami alert was issued.

Northern Japan. Tsunami alert has been issued. #Japan #Quake

0.056 36146241 | Fri Mar 11 14:01:08 CST 2011 | Tokyo | RT @BreakingNews: Japan update: Agency says earthquake magnitude 7.9

0.056 36146382 | Fri Mar 11 14:01:30 CST 2011 | Abu Dhabi | RT @ProducerMatthew: More information on the 7.9-magnitude earthquake that just hit Japan - http://t.co/tGLml6d |

Bursty words magnitude: 0.00346452637, tsunami: 0.02789596796, earthquake: 0.242462589, japan: 0.1775319869, tokyo: 0.078453986

11 March 2011, 21:34:59

0.453 36302792 | Fri Mar 11 21:26:01 CST 2011 | Eastern Time (US & Canada) | RT @peoplemag: Sending love and prayers to everyone affected by the earthquake and the tsunami disaster in Japan. Hawaii. Wst coas.

0.453 36306331 | Fri Mar 11 21:33:41 CST 2011 | Greenland | Thoughts n prayers with all affected my #earthquake n #tsunami. #japan #hawaii:(

0.421 36304157 | Fri Mar 11 21:29:04 CST 2011 | Central Time (US & Canada) | Earthquake hit Japan and Hawaii; Tsunami headed toward Hawaii. Pray for Japan and my home and al the people

#Tsunami hitting #Hawaii rigth now - #Tsunami golpea #Hawaii en este momento

Bursty words tsunami: 0.071463817, japan: 0.02848438, japon: 0.04315333, hawaii: 4.05886315, earthquake: 1.756118123, disaster: 0.62424273

12 March 2011, 18:14:06

yesterday and a nuclear plant explosion today in Japan. Please..enough.

0.102 36886979 | Sat Mar 12 18:05:43 CST 2011 | International Date Line West | RT @paul_steele: Video of the explosion at nuclear plant in #Japan: http://t.co/WinOKcw liv event page: http://t.co/g9Aoi5r rt @bbcnews

0.100 36887454 | Sat Mar 12 18:07:24 CST 2011 | Quito | RT @BreakingNews: Japan nuclear plant update: Hourly radiation leaking from Fukushima is equal to amount ermitted in one year, official.

0.093 36887747 | Sat Mar 12 18:08:26 CST 2011 | Eastern Time (US & Canada) | RT @Reuters: FLASH: #Japan chief cabinet secretary Edano: confirms radiation leak at Fukushima plant

Bursty words radiation: 2.303558 × 10^-4, nuclear: 0.04980415, japan: 3.13926528, japanese: 0.20153408, fukushima: 0.00554549, plant: 0.233838422


Figure 12. (a) ATR and (b) TP evaluations with parameter settings of Eps = 0.5, 0.4 and 0.3.

Figure 13. Overall performance of our system.


The short length of each message leads to a lack of semantic completeness in tweets, which makes it more difficult to design a workable ranking algorithm. In this work, we overcame this challenge. The preliminary results show that our algorithmic model has potential for event detection and topic ranking.

Given the limits on memory and other computational resources, the topic ranking computation avoids recalculating over the entire corpus and concentrates on handling dynamic text streams.

For business decision making, the techniques developed in this work can support companies in designing experiments with microblogging big data to obtain real-time information, explore new business opportunities and develop appropriate processes to extract business value from social-streaming big data.

Notes

1. http://www.google.com/trends

2. https://dev.twitter.com/docs/streaming-api

3. http://twitter4j.org/en/index.html

4. Including Bahasa Indonesia, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Portuguese, Spanish and Turkish.

References

[1] Bughin J, Chui M and Manyika J. Clouds, big data, and smart assets: Ten tech-enabled business trends to watch. McKinsey Quarterly 2010; 4: 26–43.

[2] Otte E and Rousseau R. Social network analysis: A powerful strategy, also for the information sciences. Journal of Information Science 2002; 28: 441–453.

[3] Ross C, Terras M, Warwick C et al. Enabled backchannel: Conference Twitter use by digital humanists. Journal of Documentation 2011; 67: 214–237.

[4] Banerjee N, Chakraborty D, Dasgupta K et al. User interests in social media sites: An exploration with micro-blogs. In: Proceedings of the 18th ACM conference on information and knowledge management, Hong Kong, 2009.

[5] Java A, Song X, Finin T et al. Why we Twitter: Understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on web mining and social network analysis, San Jose, CA, 2007.

[6] Thelwall M, Buckley K and Paltoglou G. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology 2011; 62: 406–418.

Figure 15. Topic precision experiment with parameter settings of Eps = 0.5, 0.4 and 0.3.


[7] Jansen BJ, Zhang M, Sobel K and Chowdury A. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology 2009; 60: 2169–2188.

[8] Esparza SG, O’Mahony MP and Smyth B. On the real-time web as a source of recommendation knowledge. In: Proceedings of the fourth ACM conference on recommender systems, Barcelona, 2010.

[9] Phelan O, McCarthy K and Smyth B. Using Twitter to recommend real-time topical news. In: Proceedings of the third ACM Conference on Recommender Systems, New York, 2009.

[10] Cheong M and Lee V. Integrating web-based intelligence retrieval and decision-making from the Twitter trends knowledge base. In: Proceedings of the 2nd ACM workshop on social web search and mining, Hong Kong, 2009.

[11] Naaman M, Becker H and Gravano L. Hip and trendy: Characterizing emerging trends on Twitter. Journal of the American Society for Information Science and Technology 2011; 62: 902–918.

[12] Becker H, Naaman M and Gravano L. Beyond trending topics: Real-world event identification on Twitter. In: Proceedings of the fifth international AAAI conference on weblogs and social media, Barcelona, 2011.

[13] Becker H, Naaman M and Gravano L. Selecting quality Twitter content for events. In: Proceedings of the fifth international AAAI conference on weblogs and social media, Barcelona, 2011.

[14] Becker H, Chen F, Iter D et al. Automatic identification and presentation of Twitter content for planned events. In: Proceedings of the fifth international AAAI conference on weblogs and social media, Barcelona, 2011.

[15] Han P, Xie X and Woo W. Context-based local hot topic detection for mobile user. In: Proceedings of the adjunct pervasive computing conference, Helsinki, 2010.

[16] Mathioudakis M and Koudas N. TwitterMonitor: Trend detection over the Twitter stream. In: Proceedings of the 2010 international conference on management of data, Indiana, 2010.

[17] Sankaranarayanan J, Samet H, Teitler BE et al. TwitterStand: News in Tweets. In: Proceedings of the 17th ACM SIGSPATIAL international conference on advances in geographic information systems, Seattle, WA, 2009.

[18] Kleinberg J. Bursty and hierarchical structure in streams. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, Alberta, 2002.

[19] Swan R and Allan J. Extracting significant time varying features from text. In: Proceedings of the eighth international conference on information and knowledge management, Missouri, 1999.

[20] Swan R and Allan J. Automatic generation of overview timelines. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval, Athens, 2000.

[21] Swan R and Jensen D. TimeMines: Constructing timelines with statistical models of word usage. In: Proceedings of the ACM KDD-2000 workshop on text mining, Massachusetts, 2000.

[22] Kleinberg J. Temporal dynamics of on-line information streams. In: Proceedings of data stream management: Processing high-speed data streams, 2005.

[23] Gaber MM, Zaslavsky A and Krishnaswamy S. Mining data streams: A Review. SIGMOD Record 2005; 34: 18–26.

[24] Guha S, Meyerson A, Mishra N et al. Clustering data streams: Theory and practice. IEEE Transaction on Knowledge and Data Engineering 2003; 15: 515–528.

[25] Khalilian M and Mustapha N. Data stream clustering: Challenges and issues. In: Proceedings of the international multiconference of engineers and computer scientists, Hong Kong, 2010.

[26] Aggarwal CC, Han J, Wang J et al. A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, Berlin, 2003.

[27] Aggarwal CC, Han J, Wang J et al. A framework for projected clustering of high dimensional data streams. In: Proceedings of the thirtieth international conference on very large data bases, Toronto, 2004.

[28] Zhang T, Ramakrishnan R and Livny M. BIRCH: An efficient data clustering method for very large databases. SIGMOD Record 1996; 25: 103–114.

[29] Zhong S. Efficient online spherical k-means clustering. In: Proceedings of the IEEE international joint conference on neural networks, Montreal, 2005.

[30] Zhong S. Efficient streaming text clustering. Neural Networks 2005; 18(5–6): 790–798.

[31] Roxy P and Toshniwal D. Clustering unstructured text documents using fading function. Journal of World Academy of Science, Engineering and Technology 2009; 52: 149–156.

[32] Chakrabarti D, Kumar R and Tomkins A. Evolutionary clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, 2006.

[33] Ester M, Kriegel HP, Sander J et al. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th international conference on very large databases, New York, 1998.

[34] Lee CH, Chien TF and Yang HC. DBHTE: A novel algorithm for extracting real-time microblogging topics. In: Proceedings of the 23rd international conference on computer applications in industry and engineering, Las Vegas, NV, 2010.

[35] Tsymbal A. The problem of concept drift: Definitions and related work. Technical report, Department of Computer Science, Trinity College, Ireland, 2004.

[36] Widmer G and Kubat M. Learning in the presence of concept drift and hidden contexts. Machine Learning 1996; 23: 69–101.

[37] Lee CH, Chien TF and Yang HC. An automatic topic ranking approach for event detection on microblogging messages. In:

[38] Zhou A, Cao F, Qian W et al. Tracking clusters in evolving data streams over sliding windows. Knowledge Information System 2008; 15: 181–214.

[39] Lee CH, Wu CH and Chien TF. BursT: A dynamic term weighting scheme for mining microblogging messages. In: Proceedings of the 8th international symposium on neural networks, Guilin, China, 2011.

[40] Lee CH. Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications 2012; 39: 9623–9641.

[41] Aggarwal CC and Yu PS. A framework for clustering massive text and categorical data streams. In: Proceedings of the ACM SIAM conference on data mining, Bethesda, MD, 2006.

[42] Bifet A. Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. Amsterdam: IOS Press, 2010.

[43] Masud MM, Chen Q, Khan L et al. Addressing concept-evolution in concept-drifting data streams. In: Proceedings of the 2010 IEEE international conference on data mining, Sydney, 2010.