Social trend tracking by time series based social tagging clustering


Shihn-Yuarn Chen a, Tzu-Ting Tseng b, Hao-Ren Ke c,⇑, Chuen-Tsai Sun a

a Department of Computer Science, National Chiao Tung University, No. 1001 Ta Hsueh Road, Hsinchu 300, Taiwan

b Institute of Information Management, National Chiao Tung University, No. 1001 Ta Hsueh Road, Hsinchu 300, Taiwan

c Graduate Institute of Library & Information Studies, National Taiwan Normal University, No. 162, He-ping East Road, Section 1, Taipei 10610, Taiwan

Article info

Keywords: Web 2.0; Social tagging; Time series clustering; Event tracking

Abstract

Social tagging is widely practiced in the Web 2.0 era. Users can annotate useful or interesting Web resources with keywords for future reference. Social tagging also facilitates the sharing of Web resources. This study reviews the chronological variation of social tagging data and tracks social trends by clustering tag time series. The data corpus in this study is collected from Hemidemi.com. A tag is represented in time series form according to the Web pages it annotates. Time series clustering is then applied to group tag time series with similar patterns and trends in the same time period. Finally, the similarities between clusters in different time periods are calculated to determine which clusters have similar themes, and the trend variation of a specific tag in different time periods is also analyzed. The evaluation shows that the recommendation accuracy of the proposed approach is about 75%. In addition, the case discussion shows that the proposed approach can track social trends.

1. Introduction

Social tagging has recently become a widely used application on the Internet. This process involves bookmarking part or all of a website for future reference. Social tagging can be used at a variety of websites, such as online shopping systems like Amazon.com, photo sharing communities like Flickr.com, and bookmarking services like Delicious.com. When someone finds something interesting online, he/she can tag it with some keywords. Tagging is very similar to bookmarking the entire page, and is similarly accessible. Tagging also allows users to collaborate with other people online, including sharing collections and tag navigating. By sharing collections, a user can understand what other users bookmark and how others describe the same resource with various tags. Different resources tagged with the same word may refer to different subject matter, and this phenomenon can be found by navigating resources through one tag. For example, the tag “world-series” may highlight news reports regarding the 2009 World Series between the New York Yankees and the Philadelphia Phillies, but may also tag news reports about the 2008 World Series between the Philadelphia Phillies and the Tampa Bay Rays. Tags can also be used to track news events. For example, news about Barack Obama, from his career as a senator to his presidential campaign and inauguration, can be tagged simply “Obama.”

This study analyzes social tagging information on a timeline, and each tag is represented by the resources it tags. Time series clustering is then applied to group tags with similar themes and to find the trends of events. In our example, there are five tags: 奧運 (Olympic Games), 中國 (China), 北京 (Beijing), 政治 (Politics) and 台灣 (Taiwan). Table 1 lists the usage of these tags at five sequential time points: p1, p2, p3, p4 and p5. Ignoring the chronological factor, traditional clustering algorithms group 奧運 (Olympic Games) and 中國 (China) in the same cluster because of their similar usage counts. However, Fig. 1, which depicts the usage of the tags on the timeline, shows that 中國 (China), 政治 (Politics) and 台灣 (Taiwan) have similar polyline trends. Similar trends indicate that these three tags are closer in theme to one another than to 奧運 (Olympic Games) and 北京 (Beijing), so these three tags should be grouped in the same cluster.
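The contrast in this example can be checked numerically. The following is a minimal sketch that uses the counts of Table 1; the Pearson correlation is only an illustrative stand-in for "similar trend" and is an assumption of this sketch, not the similarity measure proposed later in the paper. The helper names are hypothetical.

```python
from math import sqrt

# Usage counts of the five example tags at time points p1..p5 (Table 1).
usage = {
    "奧運": [40, 20, 0, 2, 0],
    "中國": [8, 15, 22, 12, 10],
    "北京": [10, 11, 0, 6, 8],
    "政治": [5, 10, 20, 10, 8],
    "台灣": [6, 9, 19, 8, 10],
}

def pearson(a, b):
    """Pearson correlation of two equally long sequences (illustrative trend measure)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def closest_by_total(tag):
    """Tag whose total usage count is nearest to that of `tag` (time ignored)."""
    others = [t for t in usage if t != tag]
    return min(others, key=lambda t: abs(sum(usage[t]) - sum(usage[tag])))

def closest_by_trend(tag):
    """Tag whose usage curve correlates most strongly with that of `tag`."""
    others = [t for t in usage if t != tag]
    return max(others, key=lambda t: pearson(usage[t], usage[tag]))

print(closest_by_total("中國"))  # nearest neighbour when only totals are compared
print(closest_by_trend("中國"))  # nearest neighbour when the shape over time is compared
```

Under these assumptions the total-count view pairs 中國 with 奧運, while the trend view pairs it with 政治, which matches the argument made above.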

This study applies time series clustering to find tags with similar trends. Based on the clustering results, users can find related tags and documents in a particular time period. In addition, related documents from different time periods can be retrieved by calculating the similarities between clusters in different time periods.

The rest of this paper is organized as follows. Section 2 reviews previous studies on social tagging, time series analysis and clustering algorithms. Section 3 describes the proposed approach, covering data pre-processing, time series representation, time series clustering, and recommendation. Section 4 evaluates the proposed approach and compares it with a counterpart approach that does not take the chronological factor into account. Section 5 concludes with proposals for future work.

⇑ Corresponding author. Tel.: +886 2 77345203; fax: +886 3 5718925. E-mail addresses: [email protected] (S.-Y. Chen), [email protected] (T.-T. Tseng), [email protected], [email protected] (H.-R. Ke), ctsun@cs.nctu.edu.tw (C.-T. Sun).

Contents lists available at ScienceDirect

Expert Systems with Applications


2. Related works

2.1. Social tagging and folksonomy

“Folksonomy” is derived from the words “folks” and “taxonomy.” It means a classification created by ordinary people. Vander Wal defined the term folksonomy as “... the result of personal free tagging of information and objects for one’s own retrieval. Tagging is performed in a social environment (shared and open). Act of tagging is done by the person consuming the information.” (Vander Wal, 2005). Related terms include collaborative classification, collaborative tagging, free tagging, and tagsonomy. Folksonomy emphasizes the spirit of social classification, collaborative creation, and typically flat namespaces.

Folksonomy consists of three aspects: user, resource, and classification (Fig. 2) (Pu, 2007). The user aspect involves social and collaborative concepts; the resource aspect involves media information; the classification aspect defines the classification rules.

Social tagging is one type of folksonomy. Users can use tags, which are indicative keywords, to annotate, describe or classify useful information. Flickr and Delicious.com are examples of sites that promote social tagging. Flickr is a photo sharing website where pictures can be tagged, and Delicious.com is a bookmark service provider that allows users to tag bookmarked URLs. In these instances, users are both consumers and contributors of tags, and these tags can be used for classifying, indexing, searching and browsing content.

2.2. Clustering algorithm

There are various clustering algorithms, which can be divided into five categories (Han & Kamber, 2001): partitioning methods (e.g. k-means and fuzzy c-means), hierarchical methods (e.g. agglomerative and divisive hierarchical clustering), density-based methods (e.g. DBSCAN), grid-based methods (e.g. STING) and model-based methods (e.g. SOM). Clustering algorithms usually process only static data. Among the various clustering algorithms, the partitioning methods are most commonly used. A partitioning clustering method usually has to determine the number of clusters in advance, and then reduces the value of a goal function by iterative clustering computations. The halting condition of a partitioning clustering method is usually a threshold value of the goal function or a specific iteration count. For example, the k-means algorithm clusters data into k groups, and its goal function is the sum of squared errors between the centroid of a cluster and the data items in the cluster.

2.2.1. Hierarchical clustering

This study uses hierarchical clustering to group time series data, so this subsection introduces hierarchical clustering in greater detail. There are two types of hierarchical clustering: agglomerative (Voorhees, 1986) and divisive (Hastie, Tibshirani, & Friedman, 2009). Fig. 3 illustrates an example of hierarchical clustering. Agglomerative hierarchical clustering initially represents each data item as a cluster, and iteratively merges the two closest clusters until the halting constraint is satisfied. Divisive hierarchical clustering works in the opposite direction: it groups all data items in one cluster at the beginning, and iteratively splits a cluster into the two most distant clusters until the halting constraint is reached.

The criterion for merging or splitting clusters is the distance between clusters. The four ways to measure the distance between two clusters are single linkage, complete linkage, average linkage and Ward’s distance (Ward, 1963).

I. Single linkage: Fig. 4(a) illustrates single linkage distance measurement, which only considers the shortest distance between two clusters. The distance is $D(C_i, C_j) = \min d(a, b)$, where $a$ belongs to cluster $C_i$, and $b$ belongs to cluster $C_j$.

II. Complete linkage: Fig. 4(b) shows complete linkage distance, which considers the longest distance between two clusters. The distance is $D(C_i, C_j) = \max d(a, b)$, where $a$ belongs to cluster $C_i$, and $b$ belongs to cluster $C_j$.

III. Average linkage: Fig. 4(c) displays average linkage, which considers the average distance between all data item pairs across two clusters. The distance is $D(C_i, C_j) = \left(\sum d(a, b)\right) / \left(|C_i||C_j|\right)$, where $a$ belongs to cluster $C_i$, and $b$ belongs to cluster $C_j$.

IV. Ward’s distance: Fig. 4(d) depicts Ward’s distance; it first finds the centroid of the two clusters, and then calculates the sum of squared distances between all data items and the centroid. The distance is $D(C_i, C_j) = \sum \lVert a - m \rVert^2$, where $a$ belongs to $C_i \cup C_j$, and $m$ is the centroid of $C_i$ and $C_j$.

In addition to the distance measurement between clusters, hierarchical clustering also has to fix the halting constraint before execution. The halting constraint is usually the cluster count or the average distance between clusters.
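The four cluster distances above can be written down directly. The sketch below assumes that data items are points in a Euclidean space and that $d(a, b)$ is the Euclidean distance; the function names and the sample points are hypothetical and serve only as illustration.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def single_linkage(ci, cj):
    """Shortest distance between any item of ci and any item of cj."""
    return min(dist(a, b) for a in ci for b in cj)

def complete_linkage(ci, cj):
    """Longest distance between any item of ci and any item of cj."""
    return max(dist(a, b) for a in ci for b in cj)

def average_linkage(ci, cj):
    """Mean distance over all cross-cluster item pairs."""
    return sum(dist(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

def ward_distance(ci, cj):
    """Sum of squared distances of all items in ci and cj to their joint centroid."""
    merged = ci + cj
    dim = len(merged[0])
    centroid = [sum(p[k] for p in merged) / len(merged) for k in range(dim)]
    return sum(dist(a, centroid) ** 2 for a in merged)

# Two small hypothetical clusters in the plane.
ci = [(0.0, 0.0), (0.0, 1.0)]
cj = [(3.0, 0.0), (4.0, 0.0)]
for measure in (single_linkage, complete_linkage, average_linkage, ward_distance):
    print(measure.__name__, measure(ci, cj))
```

For any pair of clusters the single linkage value is never larger than the average linkage value, which in turn is never larger than the complete linkage value.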

2.3. Time series analysis

A time series is a sequence of successive data measured at uniform time intervals (Box & Jenkins, 1976). Time series data is a set of values of an item’s attribute in a particular time period, for example, the daily market price of a company’s stock in the first quarter of 2009, or the weekly rainfall records of Taipei City in 2009. Time series analysis extracts statistics and other details from time series data. These statistics and details are helpful in forecasting the trends of future events.

Fig. 1. Represent tags on time line.

Table 1
Tag usage example.

Tag     p1    p2    p3    p4    p5    Total
奧運    40    20     0     2     0     62
中國     8    15    22    12    10     67
北京    10    11     0     6     8     35
政治     5    10    20    10     8     53
台灣     6     9    19     8    10     52

This study is designed to cluster the time series data of social tags. However, time series data are chronological, and clustering algorithms are not suited to processing non-static data. Before clustering is executed, time series data should be transformed into a static form. The distance measurement between time series is also essential, and some measurement methods are introduced as follows.

2.3.1. Euclidean distance

Euclidean distance is the simplest measurement between two time series data items. This method treats a time series of length N (i.e. N measured values on the timeline) as a data point in an N-dimensional space. The similarity of two time series data items is the distance between their points in the N-dimensional space. However, Euclidean distance does not handle offset translation (Fig. 5(a)) or amplitude scaling (Fig. 5(b)) (Bollobás, Das, Gunopulos, & Mannila, 1997). Offset translation means that two time series are almost the same except for an amplitude offset. Amplitude scaling means that two time series have similar trends, but one is a scaled version of the other in certain time periods. Normalization reduces the influence of offset translation and amplitude scaling. For example, Agrawal et al.’s approach (Agrawal, Lin, Sawhney, & Shim, 1995) normalizes every time series to the range (−1, +1). After normalization, the Euclidean distance is calculated sequentially.
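The effect of normalization can be illustrated with a short sketch. It assumes a plain min-max rescaling onto [−1, +1], which is one simple way to obtain such a range and is not necessarily the exact procedure of Agrawal et al.; the series values are hypothetical.

```python
from math import dist

def rescale(series):
    """Linearly map a series onto [-1, +1] (assumed min-max normalization)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)  # a constant series carries no trend
    return [2 * (v - lo) / (hi - lo) - 1 for v in series]

base = [1.0, 3.0, 2.0, 5.0, 4.0]
shifted = [v + 10 for v in base]  # offset translation of `base`
scaled = [v * 3 for v in base]    # amplitude scaling of `base`

# Raw Euclidean distances are large although the shapes are identical.
raw_shift = dist(base, shifted)
raw_scale = dist(base, scaled)

# After rescaling, both variants coincide with the base series.
norm_shift = dist(rescale(base), rescale(shifted))
norm_scale = dist(rescale(base), rescale(scaled))
print(raw_shift, raw_scale, norm_shift, norm_scale)
```

The sketch does not address shifting along the time axis, which is the subject of the next subsection.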

2.3.2. Dynamic time warping

Another issue is time series shifting (Fig. 5(c)), in which two time series are similar but a time delay exists between them. Neither Euclidean distance nor Agrawal et al.’s approach can measure the similarity between time series with shifting. Dynamic time warping (DTW) was proposed to remedy this issue (Oates, Firoiu, & Cohen, 1999; Salvador & Chan, 2007). DTW allows a data point of one time series to be referenced several times in succession while the distance between two time series is calculated; Fig. 6 is an example.

Fig. 4. Charts of four distance measure methods for hierarchical clustering.

Fig. 2. Three aspects of folksonomy (Pu, 2007).

For example, there are two time series, $Q = q_1, q_2, q_3, \ldots, q_n$ and $R = r_1, r_2, r_3, \ldots, r_m$. In order to minimize the distance between $Q$ and $R$, DTW aligns $Q$ and $R$ by replicating certain data points. DTW generates an $n \times m$ matrix, $M_{DTW}$, to record the distances (e.g. Euclidean distance) between the data items $q_i$ and $r_j$. Each warping path, $W$, is

$$
\begin{aligned}
& W = w_1, w_2, w_3, \ldots, w_K, \quad \text{where } \min(m, n) \le K \le (m + n - 1), \\
& w_k = M_{DTW}(i, j), \quad w_1 = M_{DTW}(1, 1), \quad w_K = M_{DTW}(n, m).
\end{aligned}
\tag{1}
$$

The minimum length of $W$ is the minimum distance between $Q$ and $R$, $d_{DTW}$, which can be calculated by dynamic programming (Liao, 2005).

$$
d_{DTW} = \min\left(\frac{\sum_{k=1}^{K} w_k}{K}\right) = D(n, m), \tag{2}
$$

$$
D(i, j) = d(q_i, r_j) + \min\left\{ D(i-1, j-1),\; D(i-1, j),\; D(i, j-1) \right\}. \tag{3}
$$
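The recurrence of Eq. (3) can be sketched as a small dynamic program. The sketch assumes one-dimensional series with $d(q_i, r_j) = |q_i - r_j|$ and returns the cumulative cost $D(n, m)$ without the division by the path length $K$ that appears in Eq. (2); the function name and sample series are hypothetical.

```python
def dtw_distance(q, r):
    """Cumulative DTW cost D(n, m) of Eq. (3), with |q_i - r_j| as point distance."""
    n, m = len(q), len(r)
    inf = float("inf")
    # D[i][j] is the cheapest cumulative cost of aligning q[:i] with r[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - r[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

q = [0, 0, 1, 2, 1, 0]
r = [0, 1, 2, 1, 0, 0]  # same shape as q, one step earlier

pointwise = sum(abs(a - b) for a, b in zip(q, r))  # rigid, index-by-index comparison
print(pointwise, dtw_distance(q, r))
```

Because DTW may reuse a data point several times, the shifted copy is aligned at zero cost here, whereas the rigid index-by-index comparison reports a clear difference.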

2.3.3. Longest common subsequence

The longest common subsequence (LCS) method finds the longest common subsequence of the sequences, and the similarity of two time series is the ratio of the length of the longest common subsequence to that of the original time series. However, LCS does not accommodate amplitude scaling and offset translation. Agrawal et al. (1995) proposed an approach to address these issues. Fig. 7 is an example of their approach.

Agrawal et al.’s study details LCS time series analysis in three steps: atomic matching, window stitching and subsequence ordering. The main ideas of their approach are as follows. The first step is to define the gaps between the time series Q and R and remove them. Second, the time series are aligned to eliminate any shifting. Third, the time series are adjusted to eliminate amplitude scaling and offset translation. Finally, the longest common subsequence of both time series is identified.

3. Time series based social tagging clustering

3.1. Dataset and preprocess

This study focuses on social tagging in a traditional Chinese environment. The data is collected from Hemidemi.com, one of the largest traditional Chinese social bookmarking service providers. Hemidemi.com (Fig. 8) records the URL and title of a web page, when it was added (creation date), and which tags were assigned by an individual user. The collected data includes 3842 distinct URLs, which were saved on Hemidemi.com from 2008/1/1 to 2008/12/31. The data additionally contains the titles of these URLs, their creation dates, and 2707 distinct tags that annotate these URLs. In addition, the web page contents of these URLs were crawled, and most of them are in traditional Chinese. The corpus covers various domains, such as sports, movies, cuisine, traveling, and politics.

This study uses CKIP (Chinese Knowledge Information Processing)¹ to preprocess the contents of the crawled web pages. CKIP tokenizes traditional Chinese documents into phrases and labels each phrase with its part of speech; Fig. 9 shows an example of CKIP results. After CKIP processing, this study only keeps nouns and verbs as feature candidates, as shown in Fig. 10.

However, some nouns and verbs do not represent information effectively and may hurt the clustering accuracy. These “stop words” can be removed in various ways; one of the simplest is to refer to stop-word lists. This study uses the Oracle Text Reference Chinese stoplist² and the Word List with Accumulated Word Frequency in Sinica Corpus 3.0³ to remove these high-frequency, low-representativeness words and phrases.

Fig. 5. Examples of offset translation, amplitude scaling and shifting of time series.

Fig. 6. Examples of dynamic time warping (DTW) (Oates et al., 1999; Salvador & Chan, 2007).

Fig. 7. Examples of longest common subsequence (LCS) (Agrawal et al., 1995).

3.2. Feature selection

This study uses the vector space model (VSM) to represent web pages and tags. Each term produced by data preprocessing is a dimension in the vector space. However, the enormous number of dimensions increases computation time and may degrade the clustering accuracy. In order to reduce the computation time and increase the clustering accuracy, this study applies three rules to remove insignificant and unrepresentative features.

1. Remove terms that do not appear in more than three web pages.

2. Remove terms that appear in more than 5% of the web pages.

3. In each web page, remove a term that appears only once.
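The three rules can be sketched as a single filtering pass. The sketch assumes that each web page is already tokenized into a list of terms and that document frequencies are computed once, before any term is removed; the order of application is not specified in the text, so this is one possible reading. The function name and the toy corpus are hypothetical.

```python
from collections import Counter

def filter_terms(docs, min_pages=3, max_ratio=0.05):
    """Apply the three rules to tokenized pages; returns per-page term counts."""
    counts = [Counter(tokens) for tokens in docs]
    # Document frequency: number of pages in which each term occurs.
    df = Counter(term for c in counts for term in c)
    limit = max_ratio * len(docs)
    kept = []
    for c in counts:
        kept.append({
            term: n for term, n in c.items()
            if df[term] > min_pages   # rule 1: appears in more than three pages
            and df[term] <= limit     # rule 2: appears in at most 5% of the pages
            and n > 1                 # rule 3: appears more than once in this page
        })
    return kept

# Hypothetical toy corpus of 100 pages.
docs = [["the", "the"] for _ in range(100)]
for i in range(4):
    docs[i] += ["olympics", "olympics"]
docs[0] += ["rare", "rare"]
docs[4] += ["olympics"]

kept = filter_terms(docs)
print(kept[0], kept[4])
```

In this toy corpus "the" is dropped by rule 2, "rare" by rule 1, and the single occurrence of "olympics" in the fifth page by rule 3.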

Fig. 8. Screenshot of Hemidemi.com.

Fig. 9. Example of CKIP result.

Fig. 10. Keep nouns and verbs of CKIP result.

² http://download.oracle.com/docs/cd/B19306_01/text.102/b14218/astopsup.htm#sthref2545.

Once the three rules have been applied, the Log Likelihood Ratio (LLR) (Lehmann, 1986; Neyman & Pearson, 1967) is used to determine the features of a document. LLR is a statistical and probabilistic method that tests two hypotheses (the null and the alternative hypothesis) and determines which one is more likely. In this study, the null hypothesis ($H_1$) states that the distribution of a term ($term_i$) occurring in a web page ($d_x$) is the same as that of other terms in $d_x$. The alternative hypothesis ($H_2$) presumes that the distribution of $term_i$ in $d_x$ is different from that of other terms in $d_x$. The formulas for $H_1$ and $H_2$ are as follows, and the occurrence distribution of $term_i$ and $d_x$ is shown in Table 2.

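$$
H_1: P(term_i \mid d_x) = p = P(term_i \mid \bar{d}_x), \tag{4}
$$

$$
H_2: P(term_i \mid d_x) = p_1 \ne p_2 = P(term_i \mid \bar{d}_x), \tag{5}
$$

$$
p = P(term_i \mid d_x) = P(term_i \mid \bar{d}_x) = P(term_i), \quad
p_1 = \frac{P(term_i \cap d_x)}{P(d_x)}, \quad
p_2 = \frac{P(term_i \cap \bar{d}_x)}{P(\bar{d}_x)}. \tag{6}
$$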

$O_{11}$ is the frequency of $term_i$ appearing in $d_x$, $O_{12}$ is the frequency of $term_i$ appearing in web pages other than $d_x$, $O_{21}$ is the frequency of terms other than $term_i$ appearing in $d_x$, and $O_{22}$ is the frequency of terms other than $term_i$ appearing in web pages other than $d_x$. This study assumes that the probability distribution is binomial, as in Eq. (7).

$$
b(k; n, x) = \binom{n}{k} x^{k} (1 - x)^{(n - k)}. \tag{7}
$$

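Then, the likelihoods of $H_1$ and $H_2$ can be represented as Eq. (8).

$$
\begin{aligned}
L(H_1) &= b(O_{11}; O_{11} + O_{12}, p)\, b(O_{21}; O_{21} + O_{22}, p), \\
L(H_2) &= b(O_{11}; O_{11} + O_{12}, p_1)\, b(O_{21}; O_{21} + O_{22}, p_2).
\end{aligned}
\tag{8}
$$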

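The Log Likelihood Ratio value, $-2\log\lambda$, can be calculated by using Eq. (9).

$$
\begin{aligned}
-2\log\lambda &= -2\log\frac{L(H_1)}{L(H_2)} \\
&= -2\log\frac{b(O_{11}; O_{11} + O_{12}, p)\, b(O_{21}; O_{21} + O_{22}, p)}{b(O_{11}; O_{11} + O_{12}, p_1)\, b(O_{21}; O_{21} + O_{22}, p_2)} \\
&= -2\big((O_{11} + O_{21})\log p + (O_{12} + O_{22})\log(1 - p) \\
&\qquad - (O_{11}\log p_1 + O_{12}\log(1 - p_1) + O_{21}\log p_2 + O_{22}\log(1 - p_2))\big).
\end{aligned}
\tag{9}
$$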
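Eq. (9) can be evaluated directly from the four counts of Table 2. The sketch below assumes that $p$, $p_1$ and $p_2$ are estimated by the relative frequencies implied by the binomial terms of Eq. (8), that is $p_1 = O_{11}/(O_{11}+O_{12})$, $p_2 = O_{21}/(O_{21}+O_{22})$ and $p = (O_{11}+O_{21})/(O_{11}+O_{12}+O_{21}+O_{22})$; these estimators are an assumption of the sketch, as are the function name and the sample counts.

```python
from math import log

def llr(o11, o12, o21, o22):
    """-2 log(lambda) of Eq. (9) for one term/page contingency table."""
    # Assumed relative-frequency estimates of the binomial parameters.
    p1 = o11 / (o11 + o12)
    p2 = o21 / (o21 + o22)
    p = (o11 + o21) / (o11 + o12 + o21 + o22)

    def xlog(count, prob):
        # Treat 0 * log(0) as 0 so that empty cells do not raise an error.
        return count * log(prob) if count > 0 else 0.0

    # The binomial coefficients are identical under H1 and H2 and cancel out.
    log_l_h1 = xlog(o11 + o21, p) + xlog(o12 + o22, 1 - p)
    log_l_h2 = (xlog(o11, p1) + xlog(o12, 1 - p1)
                + xlog(o21, p2) + xlog(o22, 1 - p2))
    return -2 * (log_l_h1 - log_l_h2)

print(llr(10, 10, 100, 100))  # same proportions in both rows
print(llr(30, 5, 100, 1000))  # clearly different proportions
```

A larger value indicates stronger evidence against $H_1$, which is why the terms with the highest LLR are the natural feature candidates for a document.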

Koller et al. believed that, in hierarchical clustering, the appropriate number of features in a document ranges from 10 to 20 (Koller & Sahami, 1997). Furthermore, too many features may decrease the coherence between features and documents, and increase noise during clustering (Chang & Hsu, 2005). This study chooses at most 50 terms with the highest LLR in each document as features. After feature selection, the number of features in the corpus is reduced from 1,760,840 (123,830 distinct) to 402,319 (20,371 distinct).

3.3. Tag representation

In the vector space model, a document $d_x$ is represented as $d_x = \{w_{x1}, w_{x2}, w_{x3}, \ldots, w_{xn}\}$, where $w_{xi}$ is the weight of $term_i$ in $d_x$. This study uses TFIDF to calculate the weight of each term in a document. In social tagging, a tag is used to annotate one or more documents, so this study uses the annotated documents to represent a tag, $tag_j$. Suppose $tag_j$ annotates document $d_x$ on date $p$; then $tag_j$ can be represented as $tag_{j,p} = tag_{j,p,x} = d_x = \{w_{x1}, w_{x2}, w_{x3}, \ldots, w_{xn}\}$ on date $p$. If $tag_j$ annotates two documents ($d_x$ and $d_y$) on date $p$, then $tag_j$ can be represented as $tag_{j,p} = tag_{j,p,x} + tag_{j,p,y} = d_x + d_y = \{w_{x1} + w_{y1}, w_{x2} + w_{y2}, w_{x3} + w_{y3}, \ldots, w_{xn} + w_{yn}\}$. The formal representation of $tag_j$ on date $p$ is shown in Eq. (10), where $W_{pk} = \sum_{x=1}^{q} w_{xk}$ and $q$ is the number of documents annotated by $tag_j$ on date $p$.

$$
tag_{j,p} = tag_{j,p,1} + tag_{j,p,2} + \cdots + tag_{j,p,q} = \{W_{p1}, W_{p2}, \ldots, W_{pn}\}. \tag{10}
$$

3.4. Tag time series representation

This study first normalizes each tag vector in order to avoid offset translation and amplitude scaling. The normalization formula is shown in Eq. (11).

$$
tag_{j,p} = \left\{ \frac{W_{p1}}{\sqrt{\sum_{k=1}^{n} W_{pk}^{2}}}, \ldots, \frac{W_{pn}}{\sqrt{\sum_{k=1}^{n} W_{pk}^{2}}} \right\}, \tag{11}
$$

$$
v_{j,p} = tag_{j,p+1} - tag_{j,p} = \{W_{v_{jp},1}, W_{v_{jp},2}, \ldots, W_{v_{jp},n}\}. \tag{12}
$$

The time series of a tag, $tag_j$, is the union of consecutive time segments of the tag. Each time segment $v_{j,p}$ is the difference between $tag_{j,p+1}$ and $tag_{j,p}$ (Eq. (12)). According to Van Wijk and Van Selow (1999), time series data is a sequence of $N$ data pairs, $v_i = (y_i, t_i)$, where $i = 1, 2, 3, \ldots, N$, and $y_i$ is the value at time $t_i$. The timeline can be split into $M$ time periods. $V_{j,m}$ represents the time series of $tag_j$ in time period $m$, where $m = 1, 2, 3, \ldots, M$. Each $V_{j,m}$ contains $N$ consecutive data pairs, $v_p = (v_{j,p}, t_p)$, where $p = 1, 2, 3, \ldots, N$. This study splits the whole timeline (2008/1/1 to 2008/12/31) every two weeks, so that there are 26 time periods, with 14 consecutive data pairs in each time period.
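The chain from Eq. (10) to Eq. (12), together with the two-week split, can be sketched as follows. The sketch assumes that document vectors are plain lists of TFIDF weights, that a date without annotated documents yields a zero vector, and that periods are counted from 2008/1/1 with a 0-based index; all function names are hypothetical. Since 2008 has 366 days, two days remain after 26 full periods, and how the study treats them is not stated.

```python
from datetime import date
from math import sqrt

def tag_vector(doc_vectors, n):
    """Eq. (10): sum the vectors of all documents a tag annotates on one date."""
    return [sum(vec[k] for vec in doc_vectors) for k in range(n)]

def normalize(vec):
    """Eq. (11): scale a tag vector to unit length (zero vectors stay zero)."""
    norm = sqrt(sum(w * w for w in vec))
    return [w / norm for w in vec] if norm > 0 else list(vec)

def segments(daily_vectors):
    """Eq. (12): differences between consecutive normalized daily tag vectors."""
    unit = [normalize(v) for v in daily_vectors]
    return [[b - a for a, b in zip(unit[p], unit[p + 1])]
            for p in range(len(unit) - 1)]

def period_index(day, start=date(2008, 1, 1), length=14):
    """0-based index of the two-week period that contains `day`."""
    return (day - start).days // length

# Three hypothetical consecutive dates for one tag in a 2-term vector space.
daily = [
    tag_vector([[3.0, 4.0]], 2),              # one annotated document
    tag_vector([[1.0, 0.0], [0.0, 1.0]], 2),  # two annotated documents
    tag_vector([], 2),                        # no annotated document
]
segs = segments(daily)
print(segs)
print(period_index(date(2008, 5, 6)), period_index(date(2008, 7, 15)))
```

With this indexing, 2008/5/6, 2008/7/15 and 2008/7/29 each fall on the first day of a period, which agrees with the period boundaries used in the case discussion.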

3.5. Time series similarity

This study uses cosine similarity to compute the similarity between two tag time series in the same time period. Suppose $tag_i$ and $tag_j$ are on the timeline. The time series of $tag_i$ and $tag_j$ in period $m$ are $V_{i,m} = \{v_{i,1}, v_{i,2}, v_{i,3}, \ldots, v_{i,N}\}$ and $V_{j,m} = \{v_{j,1}, v_{j,2}, v_{j,3}, \ldots, v_{j,N}\}$, respectively. The similarity of $tag_i$ and $tag_j$ in time period $m$, $sim(tag_i, tag_j)$, is calculated by Eq. (13), in which each term is the cosine similarity of Eq. (14).

$$
sim(tag_i, tag_j) = \frac{similarity(v_{i,1}, v_{j,1}) + \cdots + similarity(v_{i,N}, v_{j,N})}{N}, \tag{13}
$$

$$
similarity(v_{i,p}, v_{j,p}) = \frac{\sum_{k=1}^{n} \left(W_{v_{ip},k} \times W_{v_{jp},k}\right)}{\sqrt{\sum_{k=1}^{n} W_{v_{ip},k}^{2}}\, \sqrt{\sum_{k=1}^{n} W_{v_{jp},k}^{2}}}. \tag{14}
$$

Sometimes two time series are shifted relative to each other. In Fig. 11, the solid and dashed time series are obviously similar, but they are shifted. To account for this, this study also calculates the similarity of two tag time series after moving one of them backward and forward by 1–4 days. The highest similarity value obtained is the shifting similarity, $sf$. The final similarity of two tag time series is the linear combination, with a weight parameter $w$ (0.5 in this study), of their similarity without shifting and $sf$ (Eq. (15)).

$$
sim''(tag_i, tag_j) = w \times sim'(tag_i, tag_j) + (1 - w) \times sf(tag_i, tag_j). \tag{15}
$$
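Eqs. (13)–(15) can be combined into a short sketch. It assumes that a time series is a list of segment vectors, that the cosine of a zero vector with anything is 0, and that a shifted series is padded with zero vectors at the exposed end; the padding rule is not given in the text and is an assumption, as are the function names and sample series.

```python
from math import sqrt

def cosine(u, v):
    """Eq. (14): cosine of two segment vectors; 0 if either one is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def series_similarity(vi, vj):
    """Eq. (13): mean segment-wise cosine over the N segments of a period."""
    return sum(cosine(a, b) for a, b in zip(vi, vj)) / len(vi)

def shifted(series, lag):
    """Move a series by `lag` steps, padding with zero vectors (assumption)."""
    zero = [0.0] * len(series[0])
    if lag > 0:
        return [zero] * lag + series[:-lag]
    if lag < 0:
        return series[-lag:] + [zero] * (-lag)
    return series

def combined_similarity(vi, vj, w=0.5, max_lag=4):
    """Eq. (15): weighted mix of the unshifted and the best shifted similarity."""
    plain = series_similarity(vi, vj)
    lags = [lag for lag in range(-max_lag, max_lag + 1) if lag != 0]
    sf = max(series_similarity(shifted(vi, lag), vj) for lag in lags)
    return w * plain + (1 - w) * sf

# Hypothetical series: vj repeats vi with a delay of one day.
vi = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
vj = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(series_similarity(vi, vj), combined_similarity(vi, vj))
```

Without shifting the two series appear unrelated, while the shifted comparison recovers part of their similarity, which is the behavior the weighted combination is meant to capture.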

The similarity between the two time series, $sim'(tag_i, tag_j)$, is between −1 and 1. A negative value means that the two tag time series ($tag_i$ and $tag_j$) have different trends in a period of time. For example, $tag_i$ is seldom used on date $p$, but $tag_j$ is used more often. There are two possible reasons. First, the documents annotated by $tag_j$ on date $p$ are not relevant to $tag_i$. Second, although the documents annotated by $tag_j$ on date $p$ are relevant to $tag_i$, users seldom use $tag_i$ to annotate these documents. Unfortunately, it takes time and effort to judge the actual reason, so this study only considers positive similarity and sets negative values to 0. In the collected corpus, there are 581,423 time series similarity values, and 420,925 of them are negative. For the 160,498 positive similarity pairs of time series, the average is 0.00233, and the distribution is listed in Table 3.

Table 2
Occurrence distribution of term ($term_i$) and document ($d_x$).

                  $d_x$       $\bar{d}_x$
$term_i$          $O_{11}$    $O_{12}$
other terms       $O_{21}$    $O_{22}$

3.6. Time series clustering

This study applies an agglomerative hierarchical clustering algorithm to cluster time series and uses average linkage (Fig. 4(c)) to calculate the distance between clusters. The detailed steps are as follows:

1. For each time period $m$ ($m = 1, 2, 3, \ldots, M$), every tag time series is treated as a cluster.

2. Calculate the average distance between cluster pairs (Eq. (16)), where $|C_i|$ is the size of cluster $C_i$ and $d(a, b)$ is calculated by Eq. (15).

$$
D_{avg}(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{a \in C_i,\, b \in C_j} d(a, b). \tag{16}
$$

3. Find the largest $D_{avg}(C_i, C_j)$, and merge $C_i$ and $C_j$. Because $d(a, b)$ is the similarity of Eq. (15), the largest value identifies the two most similar clusters.

4. Repeat steps 2 and 3 until the halting constraint is reached. The halting constraint is that the average inter-cluster distance is less than the average distance between all tag time series in period $m$.

5. Go back to step 1, and choose the next $m$.
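The steps above can be sketched for a single time period. The sketch assumes that the pairwise similarities of Eq. (15) are already available as a symmetric matrix, and it reads the halting constraint as "stop when the best available merge falls below the mean pairwise similarity of all tags"; this reading is one interpretation of step 4, and the function names and similarity values are hypothetical.

```python
from itertools import combinations

def average_link(ci, cj, sim):
    """Eq. (16): mean pairwise similarity between the members of two clusters."""
    return sum(sim[a][b] for a in ci for b in cj) / (len(ci) * len(cj))

def cluster_period(sim):
    """Average-linkage agglomerative clustering of tags 0..n-1 in one period."""
    n = len(sim)
    clusters = [[i] for i in range(n)]  # step 1: one cluster per tag
    if n < 2:
        return clusters
    pairs = list(combinations(range(n), 2))
    # Assumed halting threshold: mean similarity over all tag pairs.
    threshold = sum(sim[a][b] for a, b in pairs) / len(pairs)
    while len(clusters) > 1:
        # steps 2-3: find the pair of clusters with the largest average similarity
        best, i, j = max(
            (average_link(clusters[i], clusters[j], sim), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best < threshold:  # step 4: halting constraint
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]       # safe because i < j
    return clusters

# Hypothetical similarities for 中國, 政治, 台灣, 奧運, 北京 (indices 0..4).
sim = [
    [1.0, 0.9, 0.8, 0.0, 0.0],
    [0.9, 1.0, 0.85, 0.0, 0.0],
    [0.8, 0.85, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.0, 0.6, 1.0],
]
print(cluster_period(sim))
```

Step 5 then amounts to calling the function once per period with that period's similarity matrix.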

3.7. Recommendation

After time series clustering, the time series in the same cluster have similar concepts and similar trends. This study uses a mechanism to recommend relevant documents in the same time period and to recommend relevant clusters across different time periods.

I. Recommend relevant documents in the same time period.

(a) Recommending documents relevant to a cluster.

The clustering result can be used to suggest similar documents to users for further reading. However, many documents are annotated by the tags in the same cluster, so the reasonable approach is to recommend only the most relevant documents to users. Cosine similarity is used to calculate the similarity between the cluster centroid and each document, and the top n documents with the highest similarity are suggested to users. The cluster centroid $\bar{C}_i$ is calculated by Eq. (17), where $|C_i|$ is the cluster size.

$$
\bar{C}_i = \frac{\sum_{j=1}^{|C_i|} V_{j,m}}{|C_i|} \tag{17}
$$

(b) Recommending documents relevant to multiple tags in a cluster.

Sometimes $tag_i$ and $tag_j$ are clustered together in time period $m$, but there is no overlap between the documents annotated by $tag_i$ and those annotated by $tag_j$. This is due to users’ tagging behavior and does not indicate that $tag_i$ and $tag_j$ are irrelevant to each other. In order to recommend the documents that are most relevant to $tag_i$ and $tag_j$, the time series of $tag_i$ and $tag_j$ in time period $m$ are merged into $V_{ij,m}$ ($V_{ij,m} = V_{i,m} + V_{j,m}$). The similarity between $V_{ij,m}$ and each document annotated by either $tag_i$ or $tag_j$ is calculated by cosine similarity. The most similar documents are then suggested.

Table 3
Distribution of positive similarity pairs of time series in corpus.

Similarity interval         # of pairs
0.5 to 0.851                1,079
1.71E−02 to 0.5             17,888
3.41E−03 to 1.71E−02        19,713
2.33E−03 to 3.41E−03        9,221
6.81E−04 to 2.33E−03        43,557
1.36E−04 to 6.81E−04        49,478
2.72E−05 to 1.36E−04        15,461
0.0 to 2.72E−05             4,101

Fig. 11. Time series shifting example (axes: Time, Weight).

Table 4
Cmp, Sep and Ocq for different minimum tag counts in a cluster.

Minimum tag count    Hierarchical clustering                Time series clustering
in a cluster         (chronological factor not considered)
                     Cmp       Sep       Ocq                Cmp       Sep       Ocq
1                    0.2866    0.0044    0.1455             0.2730    0.0029    0.1380
3                    0.3704    0.0074    0.1889             0.3645    0.0058    0.1852
4                    0.4023    0.0077    0.2050             0.4055    0.0061    0.2058

Table 5
Similarity judgments of the two experts on the 250 cluster pairs.

                     Expert A
                     No             Yes             Total
Expert B    No       53 (21.2%)     22 (8.8%)       75 (30%)
            Yes      21 (8.4%)      154 (61.6%)     175 (70%)
            Total    74 (29.6%)     176 (70.4%)     250 (100%)

Table 6
Strength of agreement for Kappa values.

Kappa          Strength of agreement
0.00           Poor
0.01–0.20      Slight
0.21–0.40      Fair
0.41–0.60      Moderate
0.61–0.80      Substantial
0.81–1.00      Almost perfect

Table 7
Clustering result versus experts’ labeling for the cluster pairs on which both experts agree.

                          Experts’ labeling
                          Y       N
Clustering result    Y    130     37
                     N    13      27

II. Recommend relevant clusters across different time periods.

This study groups tag time series in the same time period together. There may be relevant clusters in different time periods. This study retrieves relevant clusters from different time periods according to the cosine similarity between two clusters. The degree of relevance is $sim(\bar{C}_i, \bar{C}_j)$, where $\bar{C}_i$ and $\bar{C}_j$ are the centroids of clusters $i$ and $j$, and clusters $i$ and $j$ belong to different time periods. If the similarity is larger than a threshold (0.07 in this study), the two clusters are relevant.
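The centroid-based recommendation can be sketched as follows. For brevity the sketch assumes that each tag time series of a period has been flattened into a single vector in the same term space as the documents; this flattening, the function names and the sample vectors are assumptions made for illustration, while the threshold 0.07 is the value stated above.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def centroid(vectors):
    """Eq. (17): component-wise mean of the member vectors of a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def top_documents(cluster_vectors, docs, n=3):
    """Names of the n documents most similar to the cluster centroid."""
    c = centroid(cluster_vectors)
    ranked = sorted(docs, key=lambda name: cosine(c, docs[name]), reverse=True)
    return ranked[:n]

def relevant(cluster_a, cluster_b, threshold=0.07):
    """Clusters of different periods are relevant if their centroids are similar enough."""
    return cosine(centroid(cluster_a), centroid(cluster_b)) > threshold

# Hypothetical cluster with two tag vectors and three candidate documents.
cluster = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
print(top_documents(cluster, docs, n=2))
print(relevant(cluster, [[0.0, 1.0, 0.0]]), relevant(cluster, [[0.0, 0.0, 1.0]]))
```

For case I(b), the same ranking could be applied to the merged series $V_{ij,m}$ in place of the centroid.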

4. Evaluation and case discussion

This section compares the proposed time series clustering approach, which produces 1225 clusters, with hierarchical clustering that does not consider the chronological factor, which produces 1161 clusters. In addition, some clustering results are discussed to show that the proposed approach can find the trends of events.

4.1. Quantification analysis

4.1.1. Clustering quality

Clustering is an unsupervised method to group similar data items together. The clustering quality depends on the in-cluster similarity and separation degree between clusters. The common ways to evaluate the quality of clustering are as follows: cluster compactness, cluster separation, and overall cluster quality (He et al., 2003).

I. Cluster compactness.

The cluster compactness, Cmp, is shown in Eq. (22), where $v(X)$ is the variance of all documents, and $v(c_i)$ is the variance of the documents in cluster $c_i$. The smaller the value of Cmp, the more compact the clusters.

$$
v(X) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d^{2}(x_i, \bar{x})}, \tag{18}
$$

$$
\bar{x} = \frac{1}{N} \sum_{i} x_i, \tag{19}
$$

$$
d(x_i, x_j) = 1 - \cos(x_i, x_j), \tag{20}
$$

$$
v(c_i) = \sqrt{\frac{1}{|c_i|} \sum_{j=1}^{|c_i|} d^{2}(c_{ij}, \bar{c}_i)}, \tag{21}
$$

$$
Cmp = \frac{1}{C} \sum_{i}^{C} \frac{v(c_i)}{v(X)}. \tag{22}
$$

II. Cluster separation.

The formula of cluster separation, Sep, is displayed in Eq. (23), where $\sigma$ is the Gaussian constant, $C$ is the number of clusters, and $d(x_{c_i}, x_{c_j})$ is the distance between clusters $c_i$ and $c_j$. Sep takes values between 0 and 1. The smaller the value of Sep, the better the clusters are separated.

$$
Sep = \frac{1}{C(C - 1)} \sum_{i=1}^{C} \sum_{j=1,\, j \ne i}^{C} \exp\left(-\frac{d^{2}(x_{c_i}, x_{c_j})}{2\sigma^{2}}\right). \tag{23}
$$

III. Overall cluster quality.

Overall cluster quality, Ocq, is the linear combination of cluster compactness and cluster separation with a parameter $\beta$ (0.5 in this study). The value of $\beta$ is between 0 and 1. The smaller the value of Ocq, the better the overall cluster quality.

$$
Ocq(\beta) = \beta \times Cmp + (1 - \beta) \times Sep. \tag{24}
$$

The number of tags in each cluster may affect the comparison of Cmp, Sep and Ocq, so different settings of the minimum tag count in a cluster are applied in this evaluation. Table 4 shows the values of Cmp, Sep and Ocq in the different settings. The Cmp and Ocq values indicate that both approaches have similar cluster compactness and overall cluster quality. However, the Sep values of the proposed time series clustering are significantly better (>10%) than those of traditional hierarchical clustering.
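Eqs. (18)–(24) can be sketched in a few functions. The sketch assumes that clusters are lists of document vectors, that $x_{c_i}$ in Eq. (23) is the mean vector of cluster $c_i$, and that $\sigma = 1.0$; the value of $\sigma$ used in the study is not stated, so this default is purely illustrative, as are the function names and the two sample clusterings.

```python
from math import exp, sqrt

def cosine_distance(u, v):
    """Eq. (20): 1 - cos(u, v); the cosine is taken as 0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 1 - (dot / (nu * nv) if nu > 0 and nv > 0 else 0.0)

def mean_vector(vectors):
    """Eq. (19): component-wise mean."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def spread(vectors):
    """Eqs. (18) and (21): root mean squared distance of vectors to their mean."""
    center = mean_vector(vectors)
    return sqrt(sum(cosine_distance(x, center) ** 2 for x in vectors) / len(vectors))

def compactness(clusters):
    """Eq. (22): mean ratio of within-cluster spread to overall spread."""
    total = spread([x for c in clusters for x in c])
    return sum(spread(c) / total for c in clusters) / len(clusters)

def separation(clusters, sigma=1.0):
    """Eq. (23): mean Gaussian kernel of distances between cluster centroids."""
    cents = [mean_vector(c) for c in clusters]
    count = len(cents)
    total = sum(exp(-cosine_distance(cents[i], cents[j]) ** 2 / (2 * sigma ** 2))
                for i in range(count) for j in range(count) if i != j)
    return total / (count * (count - 1))

def overall_quality(clusters, beta=0.5, sigma=1.0):
    """Eq. (24): weighted combination of compactness and separation."""
    return beta * compactness(clusters) + (1 - beta) * separation(clusters, sigma)

# The same four hypothetical documents, grouped sensibly and grouped badly.
tight = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
loose = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
print(overall_quality(tight), overall_quality(loose))
```

All three measures are lower for the sensible grouping, in line with the convention that smaller values indicate better clusters. As a check on Table 4, the reported Ocq values equal 0.5 × Cmp + 0.5 × Sep up to rounding, for example 0.5 × 0.2866 + 0.5 × 0.0044 = 0.1455.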

4.1.2. Quality of relevant cluster recommendation

Before evaluating the quality of recommendation, this study removes 505 clusters that contain fewer than three tags, leaving 720 clusters. Then 250 cluster pairs are randomly chosen, and two computer science experts are asked to evaluate whether each cluster pair is similar or not. Table 5 lists the results of the evaluation. “Yes” indicates that the expert judges the cluster pair to be similar, and “No” indicates dissimilar. The Kappa⁴ value of the evaluation is 0.589. According to Table 6, the strength of agreement is moderate.

Kappa = (observed agreement − chance agreement)/(1 − chance agreement),

observed agreement = (53 + 154)/250 = 0.828,

chance agreement = 0.296 × 0.3 + 0.704 × 0.7 = 0.5816,

Kappa = (0.828 − 0.5816)/(1 − 0.5816) = 0.589.
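The Kappa computation above can be reproduced from the counts of Table 5. The following is a minimal sketch; the function name is hypothetical, and the table is entered with Expert B as rows and Expert A as columns.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table of two raters."""
    total = sum(sum(row) for row in table)
    # Share of items on which both raters give the same label.
    observed = sum(table[i][i] for i in range(len(table))) / total
    row_share = [sum(row) / total for row in table]
    col_share = [sum(col) / total for col in zip(*table)]
    # Agreement expected if the raters labeled independently.
    chance = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - chance) / (1 - chance)

# Table 5: rows are Expert B (No, Yes), columns are Expert A (No, Yes).
table5 = [[53, 22], [21, 154]]
kappa = cohen_kappa(table5)
print(round(kappa, 3))
```

The result rounds to the reported value of 0.589.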

According to Table 5, the experts agree on 207 (154 + 53) cluster pairs. These 207 pairs are used to evaluate the clustering accuracy. Table 7 lists the result, and the sensitivity, specificity and accuracy of clustering (Han & Kamber, 2001) are as follows.

sensitivity = 130/143 = 0.909,

specificity = 27/64 = 0.422,

accuracy = 0.909 × 143/207 + 0.422 × 64/207 ≅ 0.758.

4.2. Case discussion

4.2.1. Case of event trend on time line

This subsection uses the tag 電影 (movie) to show that the proposed time series clustering approach can chart the trend of events on the timeline. From 2008/5/6 to 2008/5/20, 鋼鐵人 (Iron Man) is the tag most relevant to 電影 (movie), which coincides with the release of the movie in Taiwan (Fig. 12). From 2008/7/15 to 2008/7/29, the movie 海角七號 (Cape 7) was released, and the tags 海角七號 (Cape 7) and 電影 (movie) are clustered in the same group (Fig. 13). However, the movie did not cause an upsurge in the first few days after its release, and usage of the tag 海角七號 (Cape 7) did not increase either. Between 2008/7/29 and 2008/8/12, the tags most relevant to 電影 (movie) are 瓦力 (Wall-E) and 動畫 (animation) (Fig. 14). After a few weeks, the tag 海角七號 (Cape 7) increased in use and is clustered together with 電影 (movie) and 魏德盛 (the director of the movie), as shown in Fig. 15. The movie also sparked a travel boom in Taiwan, which causes 電影 (movie) and 海角七號 (Cape 7) to be grouped together with 旅遊 (traveling) and 墾丁 (Kenting, the main filming area in the movie), as shown in Fig. 16.

Fig. 12. Related tags of "電影" (movies) during 2008/5/6–2008/5/20.

Fig. 13. Related tags of "電影" (movies) during 2008/7/15–2008/7/29.

Fig. 14. Related tags of "電影" (movies) during 2008/7/29–2008/8/12.

This case shows that the proposed time series clustering approach can detect societal trends by dividing the timeline into periods and performing clustering in each period.
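The period boundaries quoted in this case (2008/5/6, 2008/7/15, 2008/7/29, 2008/8/12) are all whole multiples of 14 days after 2008/1/1, so the division of the timeline can be sketched as below. The anchor date, the half-open interval convention, the zero-based numbering, and all function names are assumptions made for illustration; the sample annotations are invented and do not come from the corpus.

```python
from datetime import date, timedelta

PERIOD_START = date(2008, 1, 1)  # assumed anchor, inferred from the quoted dates
PERIOD_DAYS = 14                 # each period is two weeks


def period_index(day):
    """Zero-based index of the two-week period that contains `day`."""
    return (day - PERIOD_START).days // PERIOD_DAYS


def period_bounds(index):
    """Start (inclusive) and end (exclusive) dates of a period."""
    start = PERIOD_START + timedelta(days=index * PERIOD_DAYS)
    return start, start + timedelta(days=PERIOD_DAYS)


def count_tags_by_period(annotations):
    """Count tag usage per period from (tag, date) pairs."""
    counts = {}
    for tag, day in annotations:
        per_period = counts.setdefault(period_index(day), {})
        per_period[tag] = per_period.get(tag, 0) + 1
    return counts


# Invented annotations, for illustration only.
sample = [
    ("電影", date(2008, 5, 7)),
    ("鋼鐵人", date(2008, 5, 10)),
    ("電影", date(2008, 7, 30)),
]
print(period_bounds(period_index(date(2008, 5, 7))))
print(count_tags_by_period(sample))
```

Clustering would then be run separately on the tags of each period, which is what allows the same tag to appear in different clusters over time.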

4.2.2. Clustering with and without the chronological factor

Subsection 4.2.1 describes the advantage of clustering data by time period. This subsection shows the advantage of taking the chronological factor into account during clustering.

In the corpus, tags including 中國 (China), 奧運 (Olympic Games), 北京奧運 (Beijing Olympic Games), BBC, 台灣 (Taiwan), 政治 (Politics) and 新聞自由 (News Freedom) are used between 2008/7/29 and 2008/8/12.

Table 8 lists the similarity values between 中國 (China) and other tags in this time period. If the chronological factor is not taken into account, 中國 (China), 奧運 (Olympic Games), BBC and 新聞自由 (News Freedom) are in the same cluster, 台灣 (Taiwan) and 政治 (Politics) are in another cluster, and 北京奧運 (Beijing Olympic Games) is in yet another (as illustrated in Fig. 17(a)). The proposed time series clustering approach clusters 中國 (China), 台灣 (Taiwan) and 政治 (Politics) in the same group, BBC, 新聞自由 (News Freedom) and 奧運 (Olympic Games) in another, and 北京奧運 (Beijing Olympic Games) in yet another (as shown in Fig. 17(b)).

Table 8
Similarity values between 中國 (China) and other tags during 2008/7/29 to 2008/8/12.

| Tag | Similarity (without the chronological factor) | Tag | Similarity (time series clustering) |
|---|---|---|---|
| 奧運 | 0.729 | 台灣 | 0.316 |
| bbc | 0.671 | 政治 | 0.270 |
| 新聞自由 | 0.563 | 奧運 | 0.265 |
| 台灣 | 0.547 | bbc | 0.205 |
| 政治 | 0.394 | 北京奧運 | 0.118 |
| 北京奧運 | 0.356 | 新聞自由 | 0.052 |
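To make the reordering in Table 8 easier to see, the following sketch ranks the tags by their similarity to 中國 (China) under each measure and reports how far each tag moves. The similarity values are copied from Table 8; the helper names are hypothetical, and the ranking is only a reading aid, not the clustering procedure itself.

```python
# Similarity to 中國 (China), copied from Table 8.
without_time = {"奧運": 0.729, "bbc": 0.671, "新聞自由": 0.563,
                "台灣": 0.547, "政治": 0.394, "北京奧運": 0.356}
with_time = {"台灣": 0.316, "政治": 0.270, "奧運": 0.265,
             "bbc": 0.205, "北京奧運": 0.118, "新聞自由": 0.052}


def ranked(similarities):
    """Tags ordered from most to least similar."""
    return sorted(similarities, key=similarities.get, reverse=True)


def rank_shift(before, after):
    """Positions gained (positive) or lost (negative) by each tag."""
    before_rank = {tag: i for i, tag in enumerate(ranked(before), start=1)}
    after_rank = {tag: i for i, tag in enumerate(ranked(after), start=1)}
    return {tag: before_rank[tag] - after_rank[tag] for tag in before}


print(ranked(without_time)[:2])  # nearest tags without the chronological factor
print(ranked(with_time)[:2])     # nearest tags with time series clustering
print(rank_shift(without_time, with_time))
```

Under the time series measure, 台灣 (Taiwan) and 政治 (Politics) each move up three places and become the two nearest tags, which matches the cluster shown in Fig. 17(b).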


If users search only within the cluster containing 台灣 (Taiwan) and 政治 (Politics) in Fig. 17(a), they may wrongly conclude that documents in this cluster relate only to political events in Taiwan. However, in this time period there are also many documents related to political issues in China and across the strait. Users can easily miss these articles if they browse only the clustering result shown in Fig. 17(a).

When the chronological factor is considered, as in Fig. 17(b), the proposed approach groups 中國 (China), 台灣 (Taiwan) and 政治 (Politics) together. This cluster contains documents related to political events in both Taiwan and China.

5. Conclusion and future work

This study collects data from Hemidemi.com and takes the chronological factor into account by representing each tag as a time series in the vector space model. The corpus covers data created in 2008, including 3842 distinct web pages and 2707 distinct tags. The timeline is divided into 26 periods of two weeks each. The proposed approach produces 720 clusters using agglomerative hierarchical clustering with the average linkage distance measurement. Cluster compactness, cluster separation and overall cluster quality are used to compare the proposed approach with traditional hierarchical clustering that does not consider the chronological factor. The evaluation results indicate that the proposed approach achieves similar cluster compactness and overall cluster quality, and improves cluster separation significantly (by more than 10%). Because the data is clustered period by period, societal trends can be tracked. When the chronological factor is considered, time series clustering identifies the events in a time period more precisely than traditional hierarchical clustering. The proposed approach can also recommend relevant documents and clusters to users. The accuracy of these recommendations is around 0.758.

Some issues still need improvement. First, a web page can contain irrelevant information. For example, a web page that introduces the movie "Iron Man" may contain information irrelevant to the movie, such as other movies released during the same week. The second issue is the consistency of tags. Social bookmarking is a folk creation: tags are chosen by users and are not subject to any strict set of rules or authority control. If an ontology can be created to identify different tags with similar concepts, such as "Web 2.0" and "Web2", clustering accuracy and quality would improve. The last issue is the classification of tags. Classifying tags would enable users to track the trends of a whole category from a broader view, instead of tracking a single tag.

References

Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in time-series databases. In Proceedings of the 21st international conference on very large data bases, Zurich, Switzerland (pp. 490–501).

Bollobás, B., Das, G., Gunopulos, D., & Mannila, H. (1997). Time-series similarity problems and well-separated geometric sets. In Proceedings of the 13th annual symposium on computational geometry (pp. 454–456).

Box, G., & Jenkins, G. (1976). Time series analysis: Forecasting and control. Oakland, California: Holden-Day.

Chang, H.-C., & Hsu, C.-C. (2005). Using topic keyword clusters for automatic document clustering. In Third international conference on information technology and applications (Vol. 1, pp. 419–424).

Han, J., & Kamber, M. (2001). Data mining: Concepts and techniques. San Francisco: Morgan Kaufmann. pp. 346–389.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). 14.3.12 Hierarchical clustering. The elements of statistical learning (2nd ed.). New York: Springer. pp. 520–528.

He, J., Tan, A.-H., Tan, C.-L., & Sung, S.-Y. (2003). On quantitative evaluation of clustering systems. Clustering and information retrieval. Kluwer Academic Publishers, 105–133.

Koller, D., & Sahami, M. (1997). Hierarchically classifying documents using very few words. Stanford InfoLab.

Lehmann, L. E. (1986). Testing statistical hypotheses. Wiley.

Liao, T. W. (2005). Clustering of time series data – A survey. Pattern Recognition, 38 (11), 1857–1874.

Neyman, J., & Pearson, E. S. (1967). Joint statistical papers. Hodder Arnold.

Oates, T., Firoiu, L., & Cohen, P. (1999). Clustering time series with hidden Markov models and dynamic time warping. In Proceedings of the IJCAI-99 workshop on neural, symbolic and reinforcement learning methods for sequence learning (pp. 17–21).

Pu, H.-T. (2007). The development and applications of folksonomy. <http://www.lib.ncku.edu.tw/journal/16/1.htm> Retrieved 21.06.08.

Salvador, S., & Chan, P. (2007). Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5), 561–580.

Vander Wal, T. (2005). Folksonomy coinage and definition. Online information conference 2005.

Voorhees, E. M. (1986). Implementing agglomerative hierarchical clustering algorithms for use in document retrieval. Information Processing & Management, 22(6), 465–476.

Van Wijk, J. J., & Van Selow, E. R. (1999). Cluster and calendar based visualization of time series data. In Proceedings of 1999 IEEE symposium on information visualization (pp. 4–9).

Ward, J. H. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.