Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams


Chung-Hong Lee

⇑

Dept. of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan

Article info

Keywords: Stream mining; Data mining; Event evaluation; Social networks

Abstract

Due to the explosive growth of social-media applications, enhancing event awareness through social mining has become extremely important. The contents of microblogs preserve valuable information associated with past disastrous events and stories. To learn from past events when tackling emerging real-world events, in this work we use social-media messages to characterize real-world events by mining their contents and extracting essential features for relatedness analysis. On the one hand, we established an online clustering approach on Twitter microblogs for detecting emerging events, and performed event relatedness evaluation using an unsupervised clustering approach. On the other hand, we developed a supervised learning model to create extensible measure metrics for offline evaluation of event relatedness. By means of supervised learning, the developed measure metrics can compute the relatedness of various historical events, allowing event impacts on specified domains to be quantitatively measured for event comparison. The experimental results showed that the framework combining the strengths of both methods is a sensible means of discovering more unknown knowledge about event impacts and enhancing event awareness.

1. Introduction

Event evaluation using social streams is a challenging area of research that attempts to evaluate evolving real-world events by utilizing continuously arriving message streams. The challenges normally come from the incremental clustering of an unpredictable volume of incoming event elements in a dynamic environment. In most cases, the internal structures of real-world news events are quite complicated. How and why things in the events are regarded as similar and related thus has deep-rooted consequences for how the model works. Given these conditions, we still require effective methods by which to compare current and past events and to learn from past experiences to cope with possible event evolution. Recently, due to the continuously growing presence of social-media applications, there have been numerous research efforts to develop solutions that employ the power of social media for awareness of real-world events. Among these applications, one of the most significant phenomena is that people who are located close to the place where an emerging event occurs have a higher probability of reporting the up-to-date situation of the most recent event development. Meanwhile, people living in other countries who are concerned about the same event may also contribute their insightful ideas regarding side effects of the event development through social networks. This provides useful knowledge for resolving problems once people suffer from similar events. While this pattern holds across a wide range of real-world cases and time periods, little attention has been paid to establishing effective methods for evaluating event relatedness through the use of social media. In fact, the contents of microblogs preserve valuable information associated with past disastrous events and stories. To learn from past microblogging messages when coping with emerging real-world events, and thus to make sensible decisions, techniques for event evaluation are required. Because emerging real-world events continually evolve, it is hard to figure out the structure and dynamic development of emerging events and to directly compare the data of an ongoing event with those of past events. Novel online event detection techniques, which incorporate streaming models with online clustering algorithms, provide feasible solutions for dealing with text streams (e.g. tweets) for event mining in real time. To estimate the impacts for event understanding, in this work we developed a framework for an event evaluation system on a Twitter dataset, and used the social-media messages to characterize the collected events for relatedness analysis.

Expert Systems with Applications, journal homepage: www.elsevier.com/locate/eswa. Contents lists available at SciVerse ScienceDirect.
0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.eswa.2012.05.068
⇑ Tel.: +886 7 3814526. E-mail address: [email protected]

This work is an attempt to describe the concept of event relatedness using social network datasets. We consider two aspects of relatedness computation that we believe an event relatedness model should carry out. First, it should take the relatedness among the considered dimensions into account. Second, the relatedness measures should cover online and offline evaluation of detected events. By analyzing the contents of the Twitter dataset, our work started with the formulation of event features. In our previous project, we developed an online event detection system for mining Twitter streams using a density-based clustering approach. In this work, we go on to evaluate event relatedness using the event clusters produced by the developed system platform. Some functions of the developed system framework have been reported in our previous work (Lee, 2012; Ester, Kriegel, Sander, Wimmer, & Xu, 1998; Lee, Wu, & Chien, 2011; Lee, Yang, Chien, & Wen, 2011).

The offline event-evaluation model emphasizes extensible capacity in the relatedness measure metrics. That is, the relatedness metrics should be self-contained and should also be combinable with an extensible number of specified event-topic elements for measurement. These two models provide a unified framework for the relatedness analysis of current and past events, event association, and event evolution.

We propose here a combined framework for establishing relatedness evaluation techniques that incorporates a model for online evaluation of emerging events and measure metrics for offline event evaluation. In this work, the results of the relatedness measures were mainly based on a quantitative assessment of relatedness among events, which can be used to support the analysis of implicit relationships among events and to provide insightful viewpoints for event awareness. The approach is novel in this field in that it validates the impact factors considered in event development, contributing to the relatedness evaluation and analysis of real-world events.

1.1. Problem statements

What makes a past event-story related to the current event? Presumably, two event-stories should be contextually or conceptually related to each other. From this perspective, similarity would be an important representation of relatedness. However, similarity is not a sufficient attribute for the problem at hand. In previous work, relatedness between two events is often represented by the similarity between these events. Although the two terms are sometimes used interchangeably, in this problem domain ‘relatedness’ is a more general concept than ‘similarity’. Similar events are obviously related by virtue of their similarity, but dissimilar events may also be implicitly related by some other hidden relationships. For applications of event analysis, evaluating relatedness is more helpful than evaluating similarity, since there are quite a lot of implicit and useful clues with dissimilar features among various events. Thus, in this work we established a novel combination of several techniques for evaluating the relatedness of events, rather than only computing their similarity.

Given that some past events are still better understood through past news documents, the goal of this work is, roughly speaking, to extract features from a tweet corpus for locating a list of related events that the user would like to study afterwards. Hence, through the developed system, users will be able to estimate the likelihood that the equivalence relation holds for a given collection of event datasets based on their selected features.

2. System framework

2.1. Problem characteristics

In this section, the system framework and architecture for evaluating the relatedness of detected events based upon Twitter data are described. First of all, we present below some problem characteristics and difficulties in system development for exploring microblogging contents associated with event relatedness.

Along with the online event detection functional module, the system framework also combines online and offline event evaluation subsystems. While these subsystems can be integrated together, the online and offline evaluation subsystems can also work stand-alone. This is because we expect to create an opportunity for people to make judgments regarding the analysis of event impacts by different implementation methods and empirical results through unsupervised and supervised models for the application domain.

For empirical purposes, an event representation should at least share some similar content with those of other related events. The notion of relevance in information retrieval, which measures to what extent the topic of a candidate document matches the topic of the query, can be regarded as a natural form of relatedness. A variety of retrieval models, such as the vector space model, have been well studied in the field of information retrieval to model relevance. Motivated by this, in this work we aim to build effective models to measure and analyze intrinsic relatedness among events for meeting users’ information needs.

Exchanging microblogging messages depends on the users being able to communicate, which almost always requires that they are competent in the same language or rely on a bilingual mediator. It is clear that language stands in a more complicated relationship to the formulation of discussed events than national boundaries. In reality, many people around the world currently use English in addition to their local and national languages. For instance, most of the Twitter users located in English-speaking countries follow users who are located in English-speaking countries. However, even for local and overseas communication of Twitter users located in non-English-speaking countries, the use of the same dominant language (i.e. English) in discussions regarding a specific event is still significant. This suggests that the effect of country-specific language might be weakened for event analysis by the wide use of English as a lingua franca. Therefore, in this work we focus only on English as the major language for the study of microblog-based event evaluation.

When dealing with Twitter streams, one important factor is the presence of message locations. In particular, when analyzing the distribution of Twitter messages for event awareness, it is important to consider that Twitter users are not distributed evenly around the world. However, the event evaluation method applied in our work is mainly concerned with automatic identification of bursts from posted Twitter messages, providing useful insights into local events and in turn facilitating timely event monitoring. This may benefit the study of first-hand information from the original locations of the event occurrence. For long-term analysis of event development, however, we still need to take this issue into account.

In reality, the concept transition related to evolving events is generally hard to estimate. It would be quite difficult to analyze transitions of unknown events. In particular, extracting sensible information from a clustering result is not an easy task. Under such circumstances, background knowledge regarding impacts gained from past events may be helpful. Thus, in this work we also look at several other dimensions that can either strengthen or impede the extent of relatedness between events. In addition to spatial and temporal analysis, we established extensible measure metrics covering several factors such as business, politics, sport, climate, commodity, finance, and entertainment to estimate the impacts of studied events. Part of the functions in the developed system provides a quantitative investigation of the effect of important factors for historical events to pursue a deep understanding of known events and their possible inter-relationships.

C.-H. Lee / Expert Systems with Applications 39 (2012) 13338–13356

To solve the aforementioned issues, we have proposed a framework for event evaluation by mining contents from Twitter messages. We established an online evaluation approach on Twitter microblogs for detecting large-scale events and performing relatedness evaluation based upon an unsupervised technique. Furthermore, we have studied the development of measure metrics for offline evaluation of event relatedness. By supervised learning, the developed measure metrics can compute the relatedness of historical events, allowing the event impacts on specific domains to be quantitatively analyzed and evaluated for event comparison.

2.2. System architecture

In this section, the system framework and algorithms for mining events and evaluating relatedness based upon Twitter data (i.e. tweets) are described. First, non-ASCII messages are bypassed so that only messages with ASCII-coded content are collected. Subsequently, in order to perform unsupervised event clustering, our system starts with the construction of a dynamic feature space, which maintains messages with a sliding window model to deal with the message streams. New incoming messages are kept in memory until they fall out of the window. Then we utilize a dynamic term weighting scheme (Lee, Wu, et al., 2011) to assign dynamic weights to each word. The neighborhood generation algorithm is performed to quickly establish relations among messages and to carry out text stream clustering. In this work, we utilized a density-based clustering approach as our online clustering algorithm. Therefore, the system constantly groups messages into topics, and the shape of the clusters changes over time. Finally, hot topic events on microblogs can be determined by analyzing the collected cluster records. In order to measure the relatedness among events, we extract feature patterns of each event by performing content mining for content analysis, spatial analysis, and temporal analysis, as shown in Fig. 1. The datasets of detected events are stored in the event repository (i.e. Event DB). This allows our online approach to compare the new event vector with other event vectors for dynamic evaluation of event relatedness. On the other hand, the datasets of detected events stored in the repository are used to perform offline relatedness measurement and analysis by the developed supervised event evaluation method. More detailed descriptions of our proposed approaches are given later.
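The first two steps of this pipeline (bypassing non-ASCII messages and keeping messages in memory until they leave the window) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the class name `SlidingWindow`, the time-based `span` parameter, and the use of seconds as the time unit are hypothetical and are not taken from the described system.

```python
from collections import deque


class SlidingWindow:
    """Keep only messages whose timestamp lies within the last `span` seconds.

    Hypothetical helper for illustration; the window policy of the real
    system is not specified in the text.
    """

    def __init__(self, span):
        self.span = span
        self.messages = deque()  # (timestamp, text) pairs, oldest first

    def add(self, timestamp, text):
        """Insert a message and return the messages that left the window."""
        if not text.isascii():
            # Non-ASCII messages are bypassed and never enter the window.
            return []
        self.messages.append((timestamp, text))
        expired = []
        # Drop messages that are older than the window span.
        while self.messages and self.messages[0][0] <= timestamp - self.span:
            expired.append(self.messages.popleft())
        return expired
```

The expired messages are returned so that a downstream clustering step could treat them as deletions, which is the kind of update an incremental clustering algorithm has to handle.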

3. Characterization of detected real-world events

To further describe the event formulation, an example of a detected event (i.e., “Japan earthquake on March 11, 2011”), using Google Maps to illustrate the geographical locations of the event, is shown in Fig. 2. Fig. 3 illustrates the event evolution representation based upon different factors, including time, geospatial keyword, and the logarithm of the number of messages. The event timeline is often utilized to report the tweet activity by volume. Fig. 4 illustrates the sample Twitter messages for the Japan earthquake (March 11, 2011).

Fig. 1. The system framework.

3.1. Content mining for extracting event’s content feature

Millions of short messages, each containing only a few words, are posted to the Twitter service every day. The importance of the keywords in these messages changes over time. In classic text retrieval systems, the most common method for feature extraction is to treat each document as a bag of words. Such an approach is not completely suitable for our dynamic system. The main technical issue in detecting events in text streams is to derive a set of features (words) to describe each message and a similarity measure between messages (Becker, Naaman, & Gravano, 2010). Thus, for mining Twitter message streams in this work, we utilized a dynamic term weighting scheme called BursT (Lee, Wu, et al., 2011) to update the weighting of the keywords in each message in a timely manner. Subsequently, each message is clustered by the IncrementalDBSCAN clustering algorithm (Ester et al., 1998), and our system then records the maximum burst weighting of each keyword of each cluster.

3.2. Content mining for extracting event’s temporal feature

In our event evaluation system, we assumed that each event topic has the characteristic of temporal locality. This means that a topic is discussed by tweets during a period of time. We mine topics rather than track keywords because topic mining can group relevant posts based on the similarity of messages, which avoids missing valuable messages. In this work, an “event” is regarded as a set of messages that are highly concentrated on some issues in a period of time. Such a phenomenon is also described as temporal locality among messages. The concept of temporal locality expresses that an event that is discussed at one point in time will be discussed again sometime in the near future. To process incoming texts in chronological order, a fundamental issue is how to find the significant features in text streams. Besides, it has been observed that, in microblogging text streams, some words are “born” when they appear for the first time, and their intensity then “grows” over a period of time until it reaches a peak. These words are called burst words. As time passes, once the topics are no longer discussed by people, the words “fade away” following a power law, and eventually the feature words “die” (disappear) or change to a normal state. Such a phenomenon is regarded as the lifecycle of the selected features associated with a particular event under investigation.

3.3. Content mining for extracting event’s spatial feature

When an event occurs in the real world, Twitter users post messages which may contain spatial information regarding the event situation. Thus, by extracting the geographical terms from the content of these messages, the spatial information about where the event originally occurred and diffused can be obtained. We utilized GeoName, a geographical dictionary, to extract geographical terms from each cluster, and utilized a term frequency weighting factor to weight each geographical term for representing each cluster. Also, the spatial features of the event for location estimation can be obtained by extracting time zone data or from the precise geographic coordinates of the location (i.e. latitude and longitude) in tweets. However, in preliminary statistics from our previous work on 270,852 sample tweets, we found that 66,565 (approximately 24%) tweets have no time-zone information in their user profile, and 268,831 (approximately 99%) tweets have no latitude and longitude information through the Twitter Stream API. As a result, the latitude and longitude information in tweets is not well suited for detecting geographical events in our work.

Fig. 2. The illustration of geographical locations for Japan earthquake event (March 11, 2011).

Fig. 3. The timeline of sample Twitter-messages for Japan earthquake (March 11, 2011).
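The dictionary-based extraction of geographical terms and their term frequency weighting can be illustrated with the sketch below. It rests on several assumptions that are not in the original text: the hand-written `GAZETTEER` set stands in for a real geographical dictionary lookup, tokenization is plain whitespace splitting, and the weights are normalized to relative frequencies.

```python
from collections import Counter

# Hypothetical stand-in for a geographical dictionary; a real system would
# query a gazetteer instead of this hand-written set.
GAZETTEER = {"japan", "tokyo", "sendai", "fukushima"}


def spatial_feature(cluster_messages, gazetteer=GAZETTEER):
    """Return {geographical term: relative term frequency} for one cluster."""
    counts = Counter(
        token
        for message in cluster_messages
        for token in message.lower().split()
        if token in gazetteer
    )
    total = sum(counts.values())
    # Normalizing by the total count is an assumption made for illustration.
    return {term: n / total for term, n in counts.items()} if total else {}
```

A cluster with no recognized place names yields an empty feature, which mirrors the observation above that many tweets carry little usable location information.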

4. The unsupervised method for online text stream clustering and event evaluation

In this section, the developed online text-stream clustering approach for event evaluation by mining microblogging message streams is described. The developed online clustering method is based on a real-time event-cluster generation model and includes three parts: a dynamic term weighting scheme, a sliding window model, and an online density-based clustering approach.

Our work starts with developing the system for detecting topics and tracking events about hot news topics and people’s preferences from the text information sources of Twitter microblogging services. In this work, an algorithm using a density-based method is developed for mining microblogging message streams. The purpose of our approach is to effectively detect and group emerging topics from user-generated content in real time or in a specified time slot. In addition, to tackle a key challenge in mining microblogging messages, we attempt to analyze the real-time distributed messages and extract their significant features in a dynamic environment. We propose a novel term weighting method, called BursT, using a sliding window technique for weighting message streams. This method was shown to be capable of dealing with the concept drift problem, detecting context changes without being explicitly informed about them. More details regarding the system implementation can be found in our previous publication (Lee, Wu, et al., 2011).

4.1. Online text-stream clustering by a density-based approach

As the temporally ordered messages stream into the system, the next step is to incrementally gather messages into thematic topics. For such an information gathering process, one of the main difficulties is figuring out the meaning and value of those fleeting bits of information when mining the text streams. The challenge goes beyond filtering out spam, though that is an important part of it. Microblogging messages may lose their value within minutes of being written. Therefore, the system should be able to quickly group them into clusters which evolve over time. Meanwhile, the continuous evolution of clusters makes it essential to be able to swiftly identify new clusters in the data. That is, the algorithm has to deal with many external dynamic changes, such as various updates and topic shift (i.e. concept drift) issues. In order to achieve this goal, we have to provide an effective solution in which the online clustering operation can be performed well when mining microblogging text streams.

4.1.1. Reasons for adopting a density-based approach on online event clustering

Adopting density-based clustering methods in this work is based upon the following reasons:

Density-based clustering techniques are capable of detecting arbitrary-shaped clusters.

In microblogging messages, the contents normally include a lot of noise. When mining these messages, the clustering algorithm should be able to filter out noise while processing the contents. Density-based clustering groups data based on their density connectivity and treats noise as outliers which are not assigned to any cluster.

Density-based clustering makes no assumption about the number of clusters. Methods that assume a fixed number of topics are unsuitable for some real-world applications in the problem domain, especially for the topic detection task with dynamic topics around the world.
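The properties listed above (no preset number of clusters, noise left outside every cluster) can be seen in a compact sketch of plain, non-incremental DBSCAN. It is not the IncrementalDBSCAN variant used in the system; the function name `dbscan` and the parameters `eps`, `min_pts`, and `dist` are illustrative choices, and the quadratic neighbour search is only acceptable for small examples.

```python
def dbscan(points, eps, min_pts, dist):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point and starts a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels
```

For text streams, `dist` would be a text dissimilarity between messages rather than the numeric distance used in a toy example such as `dbscan([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0], 0.15, 2, lambda a, b: abs(a - b))`, where the isolated last point is labeled as noise.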

Due to the dynamic nature mentioned above, it is highly desirable to perform data updates incrementally. Thus, in this work a density-based clustering method based on the IncrementalDBSCAN algorithm (Ester et al., 1998) was used for our system development. IncrementalDBSCAN is an efficient algorithm based on DBSCAN for mining data with density-based connectivity (Lee, 2012; Lee, Wu, et al., 2011; Lee, Yang, et al., 2011; Ester et al., 1998). According to the theory of the IncrementalDBSCAN clustering method, the shape of clusters will change over time when a message is inserted into, or a victim message is deleted from, the sliding window, according to its message density properties. A low-density area would not form a topic, because the distances between messages are long according to the calculations of temporal text similarity. Meanwhile, the text stream clustering algorithm will generate several clusters at each time, due to its natural dynamics.

Fig. 4. Sample Twitter-messages for Japan earthquake (March 11, 2011).

4.2. A dynamic term weighting approach

It is a critical issue to find the significant features in text streams with a chronological order. In general information systems, weighting schemes use information that is based upon processing keyword distributions across the entire corpus. However, in this problem domain the text-stream corpora tend to be dynamic, with new messages always being added and weighting values being updated. Thus, the significance of keywords in the text streams is never stable but changes with time. That is, the weighting values for microblogging messages should be constantly updated. In particular, a special consideration is that almost all terms occur in each message only once, due to the length limitation of microblogging messages. As a result, the computation of term frequency (tf) would be strongly affected by the limited length. Besides, the document frequency (df) conflicts with the operation of event-topic mining, since a higher df value of a word implies that the term occurs in many documents, which might lead to missing topic words in messages to some extent. As a result, we adopted a special term weighting scheme called BursT, which was proposed in our previous work (Lee, Wu, et al., 2011). The experimental results show that our approach performs better at weighting words of microblogging messages than many other weighting methods (Lee, Wu, et al., 2011).

In this work, we used our developed weighting method (Lee, Wu, et al., 2011), which considers the characteristics of microblogs and incorporates burst detection for adapting to a dynamic environment. The idea behind the BursT weighting method is that a word receives a heavier weight when it is burstier, that is, when it occurs frequently within the window. The formula of the BursT weighting scheme is shown in Eq. (1):

$$\mathrm{BursT}_{w,t} = \mathrm{BS}_{w,t} \times \mathrm{TOP}_{w,t} \qquad (1)$$

where the weight of the word $w$ at time $t$ is constituted by two factors: BS (Burst Score) and TOP (Term Occurrence Probability). For calculating BursT weights of single words, each word $w$ is recorded as a quartet $\langle w, at_{w,t-1}, n_{w,t}, E(ar_{w,t}) \rangle$, where $at_{w,t-1}$ represents the last time word $w$ arrived, $n_{w,t}$ counts the total number of times word $w$ has appeared in our system, and $E(ar_{w,t})$ is a long-time cumulative expectation of the arrival rate of the word $w$.

The second factor in the BursT weighting scheme is the TOP (term occurrence probability) factor, which is formulated as the proportion of the term in the sliding window. For mining hot news topics from messages, if a word occurs in more messages, it is more likely to be a valid topic. Thus, the term occurrence probability corresponding to the word $w$ at the $t$-th arrival is formulated as below:

$$\mathrm{TOP}_{w,t} = P(w_t \mid c_t) = \frac{|\{m : w_t \in c_t\}|}{|c_t|} \qquad (2)$$

where TOP represents the probability of the word occurring in the sliding window, and $c_t$ denotes the message collection in the corpus collected from time $t - t_w$ to the current time. The numerator counts the messages $m$ of the window in which the word occurs. This factor enables the weight of the word to grow with its occurrence frequency in messages, for identification of event topics (Lee, 2012).
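As a rough illustration of how the two factors combine, the sketch below computes TOP as the fraction of window messages containing a word and reads Eq. (1) as the product of the two factors. The burst score BS is not defined in this section, so it is passed in as a plain number; the function names and the representation of messages as token sets are assumptions made for the example.

```python
def term_occurrence_probability(word, window_messages):
    """TOP: fraction of messages in the sliding window that contain `word`.

    `window_messages` is assumed to be a list of token sets, one per message.
    """
    if not window_messages:
        return 0.0
    containing = sum(1 for tokens in window_messages if word in tokens)
    return containing / len(window_messages)


def burst_weight(word, window_messages, burst_score):
    """BursT weight read as BS * TOP; BS is supplied by the caller."""
    return burst_score * term_occurrence_probability(word, window_messages)
```

For a window of four messages of which three contain "quake", TOP is 0.75, so a burst score of 2.0 would give a BursT weight of 1.5.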

4.3. Online generation of event clusters and dynamic relatedness evaluation

In our work, each keyword extracted from a tweet is assigned a burst weighting value for real-time event detection. Since the burst weighting value of each keyword representing an ongoing event changes dynamically over time, the system keeps the maximum burst weighting value of each keyword of the event for establishing an event-vector representation.

Once emerging events are detected by our system, the event clusters and event vectors can be generated by formulating the clustered messages. Also, the relatedness measure metrics developed for computing event relatedness are activated for event evaluation. Several essential features of each detected event dataset are extracted for event formulation by performing content mining operations. This allows our approach to compare the new event vector with other event vectors for evaluation of event relatedness.

Subsequently, we perform online relatedness measurement between the ongoing event and the historical event vectors. For dynamic relatedness evaluation, we compose a new event vector by assigning the updated burst weighting values, and then employ the cosine similarity measure to calculate the vector relatedness between the ongoing event and historical events every ten minutes. The process of online event evaluation is illustrated in Fig. 5.
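The two operations described here, keeping the maximum burst weight per keyword and comparing event vectors by cosine similarity, can be sketched as follows. The sparse dictionary representation and the function names are assumptions for illustration; the ten-minute scheduling is left out.

```python
import math


def update_event_vector(event_vector, keyword_weights):
    """Keep, per keyword, the maximum burst weight seen so far for the event."""
    for keyword, weight in keyword_weights.items():
        if weight > event_vector.get(keyword, 0.0):
            event_vector[keyword] = weight
    return event_vector


def cosine_relatedness(a, b):
    """Cosine similarity between two sparse {keyword: weight} vectors."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # An empty vector has no direction, so its relatedness is taken as 0.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In use, `update_event_vector` would be called with the latest keyword weights of the ongoing event, and `cosine_relatedness` would then be evaluated against each stored historical event vector.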

5. The supervised method for offline evaluation of event relatedness

Relatedness represents how well a candidate event is related to other events. In order to model relatedness, we propose several algorithmic evaluation methods that characterize relatedness from multiple aspects. As mentioned previously, we have established an unsupervised online clustering approach that detects burstiness on Twitter microblogs for detecting real-time large-scale events, and performed online dynamic evaluation of event relatedness. Going further, we developed a measuring method for offline evaluation of event relatedness. Through our developed measure metrics for computing the relatedness of historical events, the essential aspects of the impacts of related events can be quantitatively evaluated and analyzed. The method can work as a stand-alone system for event evaluation, or cooperate with the developed online event evaluation system for understanding possible event development and evolution.

Fig. 5. The process of online event evaluation.

5.1. Techniques for offline measures of event relatedness

Our base model structure is developed for comparing relatedness among event datasets from various perspectives. For impact analysis of events, the learned model developed in this work attempts to measure the extent of relatedness among several events by comparing their datasets in terms of several essential dimensions using supervised classifier-based metrics. As mentioned previously, by formulating the collection of related social messages, an event-story can be modeled by a number of hidden topics and selected features, with each topic gathering a series of observed messages according to the topic-specific terms and sentences used in the content. Events can be compared through the quantitative values represented in each defined topic dimension of a classifier (topic)-based space in a supervised manner. This measures the model’s descriptive power, while requiring no submitted ongoing event data for comparison.

Thus, for offline event evaluation in this work we implemented measure metrics for acquiring event relatedness from Twitter messages by constructing a classifier-based system (i.e. Support Vector Machine classifiers). The Support Vector Machine (SVM) (Vapnik, 1999) is one of the major statistical learning models. It basically provides a way to categorize data by producing a decision surface that separates the training data samples into two classes. As such, the resulting classifiers are capable of discriminating similar/dissimilar (or related/unrelated) event data, and further computing the degree of relatedness among the event datasets by means of our developed algorithm. In this work, we utilized LIBSVM (Chang & Lin, 2001) with an RBF kernel to evaluate our offline event relatedness.
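For reference, the standard textbook form of a two-class SVM with an RBF kernel is given below. The notation is generic and is an assumption of this note rather than something taken from the original text: $\mathbf{x}_i$ are training samples with labels $y_i \in \{-1, +1\}$, $\alpha_i$ are the learned coefficients, $b$ is the bias, and $\gamma$ is the kernel width parameter.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \quad \gamma > 0

f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b, \qquad \hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
```

The sign of $f(\mathbf{x})$ gives the predicted class, while its magnitude indicates how far the sample lies from the decision surface, which is why a decision value can serve as a graded score rather than only a yes/no label.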

5.2. Vectorizing event-data in a SVM-classifier based vector space

A decision combination function must make use of useful representations of classifier decisions. Fig. 6 illustrates an SVM-based metrics system for measuring event relatedness. In this work, a new vector space was formulated by SVM-based classifiers as a measure metrics for events represented by collected Twitter messages, in which the input event is examined by each trained SVM classifier in the metrics to obtain its decision values. That is, event vectors in such a combination of classifiers are regarded as mappings of topic categories to points on a multi-dimensional grid, forming a category vector space. This also reflects the real-world situation that a single event may involve one or several categories of topics. The vector approach allows for a mathematical and physical representation of events for measuring their relatedness with respect to selected topics. Additional classifiers can be added to the model by adding another dimension to the geometric representation, and this pattern can be continued as many times as needed. If we needed to model n distinct classifiers, we would use such an extensible metrics (i.e., weight of classifier 1, weight of classifier 2, weight of classifier 3, . . ., weight of classifier n) for vector evaluation. The classifier vector space uses the notion of a space of topic categories, where each event is represented as a vector in a high-dimensional space. Because the position of an event in the vector space is determined by the degree of relevance based on the judgments of the topic classifiers, events with many topics in common end up close together, while events with few shared topics end up far apart. As a result, the relatedness between event vectors can be computed by means of several developed algorithms.
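As an illustration of how closeness in such a classifier vector space might be quantified, the sketch below compares event vectors using cosine similarity and Euclidean distance. These two measures are assumed here as common examples and are not necessarily the algorithms developed in this work. The three topic dimensions and all vector values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two event vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two event vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical event vectors; each coordinate is the weight (decision value)
# from one topic classifier, e.g. (disaster, politics, sports).
quake_a = [1.8, -0.6, -1.2]
quake_b = [1.5, -0.4, -1.0]
election = [-0.9, 2.1, -0.7]

sim_quakes = cosine_similarity(quake_a, quake_b)
sim_mixed = cosine_similarity(quake_a, election)
dist_quakes = euclidean_distance(quake_a, quake_b)
dist_mixed = euclidean_distance(quake_a, election)
```

With these toy values the two earthquake-like events score a higher similarity and a smaller distance than the earthquake and election pair, matching the intuition that events sharing topics end up close together.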

5.3. Metrics for measuring event relatedness using a multiple SVM-based classifier system

In this work we developed an approach that applies a classifier-based technique using Support Vector Machines (SVM) to support the measurement of event relatedness. In the first stage, we employed topic-specific messages to train SVM classifiers for constructing a measure metrics that identifies the topic categories of events. Subsequently, we combined the trained classifiers to form a metrics, and input unknown event data into the model to obtain the decision values from each classifier in the metrics. Finally, new vectors formed from the resulting decision values of the different SVM classifiers (see Fig. 6) can be generated, and these vectors allow us to compute the extent of relatedness among events quantitatively, based on several measuring algorithms.
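The stages above can be sketched as follows. This is a rough illustration under stated assumptions, not the implementation used in this work: each trained topic SVM is replaced by a hypothetical keyword-based stub that returns a signed score, and the per-message decision values are aggregated by their mean, which is one possible choice that the text does not specify.

```python
from statistics import mean


def make_stub_classifier(keywords):
    """Hypothetical stand-in for a trained topic SVM.

    Returns a function that maps one message to a signed score:
    positive when the message contains at least one topic keyword.
    """
    def decision(message):
        tokens = message.lower().split()
        hits = sum(1 for token in tokens if token in keywords)
        return hits - 0.5
    return decision


def event_vector(messages, classifiers):
    """Build one coordinate per classifier.

    Each coordinate is the mean decision value of that classifier over
    all messages of the event (the mean is an assumed aggregation).
    """
    return [mean(clf(m) for m in messages) for clf in classifiers]


# The metrics: an extensible list of topic classifiers. Adding a classifier
# adds one dimension to every event vector.
classifiers = [
    make_stub_classifier({"earthquake", "quake", "shaking"}),
    make_stub_classifier({"election", "vote"}),
]

# Made-up messages standing in for the tweets of one detected event.
event_messages = [
    "Earthquake felt in Virginia",
    "whole building shaking right now",
    "did anyone else feel that quake",
]

vector = event_vector(event_messages, classifiers)
```

The resulting vector has a positive first coordinate and a negative second one, so the event would sit on the earthquake side of the first topic dimension. Vectors built this way for different events can then be passed to any vector-based relatedness measure.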

6. Experimental results and discussion

6.1. Experimenting with online event evaluation

In this work we experimented with a vast amount of Twitter data to verify the validity of the framework, demonstrating selected cases taken from the events detected by the developed platform.

6.1.1. Case study (I): baseline event: "Virginia earthquake on August 24, 2011"

In the experiment, a total of 192,541,656 Twitter posts were collected, dating from January 1, 2011 to September 30, 2011. The test samples were collected through the Twitter Streaming API. After filtering out non-ASCII tweets, 102,709,809 tweets were used as our data source. We used the corpus collected from January 1, 2011 to May 31, 2011 as our training dataset, and the corpus dating from June 1, 2011 to September 30, 2011 as our test data. Subsequently, we partitioned messages into unigrams, and all capital letters in each tweet were converted into lowercase for our experiments.
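The preprocessing described above can be sketched in a few lines. This is a minimal illustration under one assumption: unigrams are obtained by splitting on whitespace, since the exact tokenization rule is not stated in the text. The function names and sample tweets are hypothetical.

```python
def is_ascii(text):
    """True if every character of the tweet is an ASCII character."""
    return all(ord(ch) < 128 for ch in text)


def preprocess(tweets):
    """Drop non-ASCII tweets, lowercase the rest, and split into unigrams.

    Whitespace splitting is an assumed tokenization rule.
    """
    return [tweet.lower().split() for tweet in tweets if is_ascii(tweet)]


# Made-up sample tweets; the second one is filtered out as non-ASCII.
sample = ["Earthquake in VIRGINIA", "地震だ", "Did you FEEL that"]
unigrams = preprocess(sample)
```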

The datasets (Case I: baseline event "Virginia earthquake on August 24, 2011").

In this experiment, we utilized the event "Virginia earthquake" as a baseline for validating our framework. The event happened at 01:51, and the first post appeared at 01:52:04. The event was detected by our system at 01:52:17. The result of relatedness ranking

Fig. 7. Ranking of event relatedness upon a comparison with baseline "Virginia earthquake" event at various time points (Event ID: #1173, August 24, 2011).

C.-H. Lee / Expert Systems with Applications 39 (2012) 13338–13356

Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. In ACM transactions on intelligent systems and technology (ACM TIST).

Chen, F., Farahat, A., & Brants, T. (2004). Multiple similarity measures and source-pair information in story link detection. In Proceedings of the international conference on human language technology conference of the North American Chapter of the Association for Computational Linguistics, Boston, Massachusetts, USA.

Choudhury, M. D., Sundaram, H., John, A., Seligmann, D. D., & Kelliher, A. (2010). Birds of a feather: Does user homophily impact information diffusion in social media? In Proceedings of the computing research repository.

Cunha, E., Magno, G., Comarela, G., Almeida, V., Gonçalves, M. A., & Benevenuto, F. (2011). Analyzing the dynamic evolution of Hashtags on Twitter: A language-based approach. In Proceedings of the workshop on languages in social media, Portland, Oregon.

Ester, M., Kriegel, H. P., Sander, J., Wimmer, M., & Xu, X. (1998). Incremental clustering for mining in a data warehousing environment. In Proceedings of the 24th international conference on very large databases, New York, USA.

Ferret, O. (2002). Using collocations for topic segmentation and link detection. In Proceedings of the 19th international conference on computational linguistics, Taipei, Taiwan (Vol. 1).

Gaber, M. M., Zaslavsky, A., & Krishnaswamy, S. (2005). Mining data streams: A review. In Proceedings of the SIGMOD Records (Vol. 34, pp. 18–26).

Guha, S. et al. (2003). Clustering data streams: Theory and practice. In Proceedings of the IEEE transactions on knowledge and data engineering (Vol. 15, pp. 515–528).

Ishikawa, Y., Chen, Y., & Kitagawa, H. (2001). An on-line document clustering method based on forgetting factors. In Proceedings of the 5th European conference on research and advanced technology for digital libraries (pp. 325–339).

Lavrenko, V. et al. (2002). Relevance models for topic detection and tracking. In Proceedings of the 2nd international conference on human language technology research, San Diego, California, USA.

Lee, C. H. (2012). Mining spatio-temporal information on microblogging streams using a density-based online clustering method. Expert Systems with Applications.

Lee, C. H., Wu, C. H., & Chien, T. F. (2011). BursT: A dynamic term weighting scheme for mining microblogging messages. In Proceedings of the 8th international symposium in neural networks. Lecture notes in computer science (Vol. 6677, pp. 548–557). Berlin/Heidelberg: Springer.

Lee, C. H., Yang, H. C., Chien, T. F., & Wen, W. S. (2011). A novel approach for event detection by mining spatio-temporal information on microblogs. In Proceedings of the IEEE international conference on advances in social network analysis and mining, Kaohsiung, Taiwan, July 25–27.

Leskovec, J. (2011). Social media analytics: Tracking, modeling and predicting the flow of information through networks. In Proceedings of the 20th ACM WWW international conference on World Wide Web, Hyderabad, India.

Lin, Y. R., Chi, Y., Zhu, S., Sundaram, H., & Tseng, B. L. (2008). Facetnet: A framework for analyzing communities and their evolutions in dynamic networks. In Proceedings of the 17th ACM WWW international conference on World Wide Web, Beijing, China.

Lin, C. X., Mei, Q., Jiang, Y., Han, J., & Qi, S. (2011). Inferring the diffusion and evolution of topics in social communities. In Proceedings of the 5th ACM SNAKDD international workshop on social network mining and analysis, San Diego, CA, USA.

Mei, Q., & Zhai, C. (2005). Discovering evolutionary theme patterns from text: An exploration of temporal text mining. In Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery in data mining, Chicago, Illinois, USA.

Naaman, M., Becker, H., & Gravano, L. (2011). Hip and trendy: Characterizing emerging trends on Twitter. Journal of the American Society for Information Science and Technology.

Nallapati, R. (2003). Semantic language models for topic detection and tracking. In Proceedings of the international conference on the North American Chapter of the Association for Computational Linguistics on Human Language Technology: HLT-NAACL 2003 student research workshop, Edmonton, Canada (Vol. 3).

Nallapati, R., & Allan, J. (2002). Capturing term dependencies using a language model based on sentence trees. In Proceedings of the 8th international conference on information and knowledge management, McLean, Virginia, USA.

Nomoto, T. (2010). Two-Tier similarity model for story link detection. In Proceedings of the 19th ACM international conference on information and knowledge management, Toronto, ON, Canada.

Roxy, P., & Toshniwal, D. (2009). Clustering unstructured text documents using fading function. In Proceedings of the World Academy of Science, Engineering and Technology (pp. 149–156).

Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley Longman Publishing Co., Inc.

Shah, C., Croft, W. B., & Jensen, D. (2006). Representing documents with named entities for story link detection (SLD). In Proceedings of the 15th ACM international conference on information and knowledge management, Arlington, Virginia, USA.

Spiliopoulou, M., Ntoutsi, I., Theodoridis, Y., & Schult, R. (2006). MONIC: Modeling and monitoring cluster transitions. In Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA, USA.

Štajner, T., & Grobelnik, M. (2009). Story link detection with entity resolution. In Proceedings of the 8th ACM WWW international conference on World Wide Web semantic search workshop, Madrid, Spain.

Tang, X., & Yang, C. C. (2011). Following the social media: Aspect evolution of online discussion. In Proceedings of the 4th international conference on social computing, behavioral-cultural modeling and prediction, College Park, USA.

Uejima, H., Miura, T., & Shioya, I. (2004). Giving temporal order to news corpus. In Proceedings of the 16th IEEE international conference on tools with artificial intelligence (pp. 208–215).

Vapnik, V. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.

Wang, L., & Li, F. (2011). Story link detection based on event words. In Proceedings of the 12th international conference on computational linguistics and intelligent text processing, Tokyo, Japan (Vol. Part II).

Yang, C. C., Shi, X., & Wei, C. P. (2009). Discovering event evolution graphs from news corpora. Proceedings of the IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 39, 850–863.

Zhai, C., Velivelli, A., & Yu, B. (2004). A cross-collection mixture model for comparative text mining. In Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, USA.

Zhang, X., Wang, T., & Chen, H. (2007). Story link detection based on dynamic information extending. In Proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic.

Zhang, X., Wang, T., & Chen, H. (2008). Story link detection based on event model with uneven SVM. In Proceedings of the 4th Asia information retrieval conference on information retrieval technology, Harbin, China.

Zhao, Q., Mitra, P., & Chen, B. (2007). Temporal and information flow based event detection from social text streams. In Proceedings of the 22nd international conference on artificial intelligence, Vancouver, Canada (Vol. 2).

Zhong, S. (2005a). Efficient streaming text clustering. Proceedings of the Neural Networks, 18, 790–798.

Zhong, S. (2005b). Efficient online spherical k-means clustering. In Proceedings of the IEEE international joint conference on neural networks, Montreal, Canada (pp. 3180–3185).