Personalized recommendation of popular blog articles for mobile applications


Duen-Ren Liu ⇑, Pei-Yun Tsai, Po-Huan Chiu

Institute of Information Management, National Chiao Tung University, Hsinchu, Taiwan

Article info

Article history:
Received 5 October 2009
Received in revised form 22 October 2010
Accepted 1 January 2011
Available online 9 January 2011

Keywords: Mobile service; Blog recommenders; Time-sensitive topic; Collaborative filtering

Abstract

Weblogs have emerged as a new communication and publication medium on the Internet for diffusing the latest useful information. Providing value-added mobile services, such as blog articles, is increasingly important to attract mobile users to mobile commerce, in order to benefit from the proliferation and convenience of using mobile devices to receive information any time and anywhere. However, there are a tremendous number of blog articles, and mobile users generally have difficulty in browsing weblogs owing to the limitations of mobile devices. Accordingly, providing mobile users with blog articles that suit their particular interests is an important issue. Very little research, however, has focused on this issue.

In this work, we propose a novel Customized Content Service on a mobile device (m-CCS) to filter and push blog articles to mobile users. The m-CCS includes a novel forecasting approach to predict the latest popular blog topics based on the trend of time-sensitive popularity of weblogs. Mobile users may, however, have different interests regarding the latest popular blog topics. Thus, the m-CCS further analyzes the mobile users’ browsing logs to determine their interests, which are then combined with the latest popular blog topics to derive their preferred blog topics and articles. A novel hybrid approach is proposed to recommend blog articles by integrating personalized popularity of topic clusters, item-based collaborative filtering (CF) and attention degree (click times) of blog articles. The experimental results demonstrate that the m-CCS system can effectively recommend mobile users’ desired blog articles with respect to both popularity and personal interests.

1. Introduction

Weblogs have emerged as a new communication and publication medium on the Internet for diffusing the latest useful information. Blog articles represent the opinions of the populace and constitute a reaction to current events (e.g., news) on the Internet [13]. Accordingly, looking for the latest popular issues discussed by blogs and attracting readers’ attention is an interesting subject. Moreover, providing value-added mobile services, such as blog articles, is increasingly important to attract mobile users to mobile commerce, in order to benefit from the proliferation and convenience of using mobile devices to receive information anytime and anywhere. There are, however, a tremendous number of blog articles, and mobile users generally have difficulty in browsing weblogs owing to the inherent limitations of mobile devices, such as small screens, short usage time and poor input mechanisms. Accordingly, providing mobile users with blog articles that suit their interests is an important issue. Very little research, however, has focused on this issue.

⇑ Corresponding author. Tel.: +886 3 5131245; fax: +886 3 5723792. E-mail addresses: [email protected], [email protected] (D.-R. Liu).

Contents lists available at ScienceDirect

Information Sciences


There are three main types of research regarding blogs. The first type focuses on analyzing the link structure between blogs to form a community [19,20]. Through the hyperlinks between blogs, people can communicate across blogs by publishing content related to other blogs. Nakajima et al. [31] proposed a method to identify the important bloggers in conversations, based on their roles in preceding blog threads, and to identify "hot" conversations. The second type focuses on content analysis to derive the propagation of topics and trends in the blogosphere. Gruhl et al. [11,12] modeled the information propagation of topics among blogs based on blog text. By tracking topic and user drift, Hayes et al. [13] examined the relationship between blogs over time. Mei et al. [28] proposed a method to discover the distributions and evolution patterns of topics across time and space. Although existing studies have investigated the evolution of blog topics, they have not considered how to predict the degree of popularity of blog topics. The last type focuses on how to model bloggers and derive their interests in order to generate personal recommendations [38,40]. A variety of methods have been proposed to model bloggers’ interests and provide recommended content similar to their earlier experiences [15,24].

The majority of previous studies on blogs have ignored the hot topics and popular articles discussed by mass groups of readers, who engage in browsing actions related to the blog articles. Moreover, existing studies do not consider recommending blog articles to mobile readers in mobile environments. With more and more blog articles continually being published on the Internet, the scale and complexity of blog contents are growing rapidly, resulting in information overload for blog readers. Mobile readers can browse only a very limited number of blog articles because of the restrictions of mobile devices. Accordingly, traditional recommendation methods, such as the collaborative filtering approach [1,2,5,17,25,35], may suffer from the sparsity problem when finding similar users or items, because the historical records of blog articles browsed by mobile readers are insufficient. To address the sparsity issue and blog information overload, it is essential to design an appropriate mechanism for recommending blog articles in mobile environments. Blog readers are often interested in browsing emerging and popular blog topics, from which the popularity of blogs can be inferred according to the accumulated click times on blogs. Popularity based solely on click times, however, cannot truly reflect popularity trends. For example, a new event may trigger emerging discussions such that the number of related blog articles and browsing actions is small at the beginning and rapidly increases as time goes on. Thus, it is important to analyze the trend of time-sensitive popularity of blogs to predict emerging hot blog topics. In addition, blog readers may have different interests regarding the emerging popular blog topics. Nevertheless, existing research has not addressed how to predict the popularity trend of blog topics or how to derive personalized popular topics.

More specifically, several studies have proposed ways to model bloggers’ interests and provide personal recommendations [15,24,38,40]. Traditional recommender system approaches can also be adopted to recommend blog articles to mobile users. However, existing research has not addressed the issue of recommending personalized popular blog articles, which is especially important for mobile environments where mobile users cannot freely browse a tremendous number of blog articles on the Internet due to the restrictions of mobile devices, and therefore must rely on service providers’ recommendations to browse a small and feasible subset of blog articles. Many blog articles are new to the system, since they have not been viewed by any mobile user in the system due to the limitations of mobile devices. Traditional recommendation methods may suffer from the new item problem, in which there is no rating record on new items from which to derive the prediction [1]. This means that most new articles, which are popular on the Internet and to which the masses of Internet users pay attention, may be ignored by conventional recommendation methods. Accordingly, the recommended feasible set of blog articles should contain those articles which are new to the system but are popular with Internet users and also suit mobile users’ personal interests. Existing recommendation approaches have neither addressed such issues nor considered the popularity degree of blog articles.

In this work, we propose a novel Customized Content Service on a mobile device (m-CCS) to recommend personalized and popular blog articles to mobile users. Conventional recommender systems mainly employ the users’ behavior logs recorded in the systems to make recommendations. Differing from existing recommender systems, we use an additional data source collected from the Internet, i.e., the Internet users’ click times on blog articles, to identify the popularity degree of blog articles, which is integrated with the recommendation approach to improve recommendation quality in mobile recommender services.

First, we propose a novel approach to predict the trend of time-sensitive popularity of blog topics. We analyze blog contents retrieved by co-RSS to derive topic clusters, i.e., blog topics. We define a topic as a set of significant terms that are clustered together based on aspects of similarity. By examining the clusters, we can extract the salient features of topics. Moreover, we analyze the click times of Internet readers accessing articles. For each topic cluster, we modify a double exponential smoothing method [6,7] to predict the popularity degree of the topic according to the variation in trends of click times by Internet readers. Second, mobile users may have different interests regarding the latest popular blog topics. Thus, we further propose a novel approach to infer mobile users’ preferred (personalized) popular blog topics based on the predicted popularity degree of blog topics and mobile users’ personal interests, derived by analyzing their browsing logs. Third, a novel hybrid recommendation approach is proposed to recommend blog articles by integrating personalized popularity of topic clusters, item-based collaborative filtering (CF) and attention degree (click times) of blog articles. The major novel ideas are as follows. The hybrid prediction is derived according to the clarity of the personal preference derived from collaborative filtering, based on the historical behavior of the mobile user. With a clear preference, i.e., more browsing records of the mobile user, the hybrid prediction will be influenced more by the user preference prediction based on collaborative filtering. The hybrid prediction is, however, dominated by the Internet attention degree of articles for mobile users who have very few browsing records with which to infer their preferences. Moreover, the hybrid prediction considers the predictive personalized popularity degree of the topic cluster to which each article belongs; the more popular the topic of an article is, the more numerous the users who are interested in the article.

The filtered articles are sent to the individual’s mobile device via a WAP Push service. This allows the user to receive personalized and relevant articles, satisfying the demand for instant information. Finally, we conduct on-line experiments to compare different strategies: unified push of articles selected by experts and personalized push of articles selected by the m-CCS system’s novel recommendation service. The experimental results show that our proposed approach, which considers the customized predictive popularity degree, can increase the click rates of blog articles and thereby enhance the quality of recommendation. The proposed m-CCS system can effectively recommend desirable blog articles to mobile users based on popularity and personal interests.

The remainder of this paper is organized as follows. Section 2 introduces works related to blogs, forecasting and recommendations; a brief introduction to our system is given in Section 3; detailed descriptions of the processing modules of our system are presented in Sections 4 and 5; Section 6 illustrates how to integrate different modules of our system to develop recommendation methods; the system architecture is illustrated in Section 7; Section 8 presents the evaluation of the usefulness of m-CCS empirically and practically; and the conclusions and suggestions for future work are presented in Section 9.

2. Literature review

2.1. Discovering the trend of blog topics

Blog content represents the opinions of the populace and reactions to current events (e.g., news) on the Internet [13]. With Web 2.0, blogs have become such a powerful force that mainstream media cannot help but take notice [9]. Several studies focus on analyzing blog content to derive the propagation of topics and trends in the blogosphere. Gruhl et al. [11,12] modeled the information propagation of blog topics, based on blog texts. The patterns they proposed for topic propagation were useful for sales forecasting. In addition, more and more research has recently been paying attention to blog content. Blog text analysis focuses on eliciting useful information from blog entry collections, and determining certain trends in the blogosphere. A Natural Language Processing (NLP) algorithm has been used to determine the most important keywords within a definite time period; it can automatically discover trends across blogs [9]. Nevertheless, the above-mentioned studies emphasize assigning blog articles to only one topic, while blogs, in fact, contain many topics. Mei et al. [28] focused on a mixture of subtopics and recognized the spatiotemporal topic patterns within blog documents. They proposed a probabilistic method to model the most salient topics from a text collection, and discover the distributions and evolution patterns across time and space. To track topic and user drift, Hayes et al. [13] examined the relationship between blogs over time. Some studies have investigated the evolution of blog topics. However, most studies have not considered how to predict the popularity degree of blog topics. In addition, studies mainly analyze the content of blog articles to discover the evolution and trend of blog topics without considering the Internet readers’ perspective, i.e., the click times of Internet readers on blog articles. Differing from other studies, we identify blog topics by clustering similar blog articles into clusters (topics), and then use the accumulated click times of Internet readers on the blog articles in each topic cluster to predict the popularity degree of blog topics.

2.2. Recommending blog articles

Several studies have investigated user modeling and personal recommendation in blog space. A variety of methods [38,40] have been proposed to model bloggers’ interests, such as classifying articles into predefined categories to identify the author’s preference [24], and thereby automatically recommend the blog articles which suit their interests, by analyzing the contents to which bloggers have reacted. Huang et al. [15] proposed an approach to extract terms relevant to users from blog articles, and then recommend blog articles explored by Google’s search engine. While bloggers can receive recommended content which is similar to their earlier experiences, the method ignores the hot topics and popular articles discussed by the bulk of readers, which can attract mobile users’ interest. These studies mainly examined the interests of bloggers and identified which topics were widely discussed by the bloggers without considering the perspectives of Internet readers. They did not address the issue of how to predict the popularity trend of blog topics. Moreover, existing approaches to recommending blog articles did not investigate the recommendation of popular blog articles by considering the popularity degree of blog topics. Differing from existing studies, we recommend personalized and popular blog articles by considering Internet readers’ click times on blog articles and the predictive popularity degree of blog topics.

2.3. Forecasting

Forecasting methods mainly use historical data to infer future development trends. Time series prediction uses a set of observed values in time order to construct a suitable model to forecast future trends. Among the variety of methods, the exponential smoothing method [6] is easy to understand and highly reliable; this method can also use less data to make forecasts.

A standard exponential smoothing method [30] assigns exponentially decreasing weights to previous observations. In other words, recent observations are given relatively more weight in forecasting than older observations. The exponential smoothing method has been widely used in short-term or medium-term forecasting of economic development trends. In the simple exponential smoothing method, the current prediction value is derived from the prediction value and actual value of the preceding time period. Simple exponential smoothing is suitable for stationary time series which do not exhibit a trend effect. The double exponential smoothing approach is usually used to process time series data with a trend effect, and the prediction is made using Eq. (1) [7]. For the preceding time series, $x(t)$ is the actual value at time $t$, $\hat{x}(t)$ is the prediction value at time $t$, and $b(t)$ represents the trend effect at time $t$. To forecast the value for time $t+1$, $\hat{x}(t+1)$ is the weighted average of two terms, $x(t)$ and $[\hat{x}(t) + b(t)]$, weighted by the smoothing constant $\alpha$. The value of the smoothing constant therefore determines which term has greater influence on the prediction value. It follows from the formula that each prediction value is a weighted combination of the series values within the past period. The more recent the historical data, the greater its weight in the prediction:

$$\hat{x}(t+1) = \alpha\,x(t) + (1-\alpha)\,[\hat{x}(t) + b(t)], \tag{1}$$

$$b(t) = \beta\,[\hat{x}(t) - \hat{x}(t-1)] + (1-\beta)\,b(t-1). \tag{2}$$

The trend effect at time $t$, $b(t)$, is calculated by Eq. (2). The value $\beta$ is used to weight the difference between two prediction values, $\hat{x}(t)$ and $\hat{x}(t-1)$, belonging to adjacent days, against the preceding trend effect, $b(t-1)$. For the double exponential smoothing method, the values of $\hat{x}(t)$ and $b(1)$ have to be assigned in the initial stage. The simplest way is to assume $\hat{x}(2) = x(1)$ and $b(1) = 0$. Some research has also suggested that the selection of the initial value is not important for stationary series [7], since it does not have a significant effect on the prediction result. In this work, we modified a double exponential smoothing method to predict the popularity degree of the topic according to the variation in trends of click times by Internet readers.
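As a minimal sketch of the standard recursion in Eqs. (1) and (2), not of the modified method proposed in this work, the forecasts can be computed as follows. The function name is illustrative, and the choice $\hat{x}(1) = x(1)$ is an added assumption (the text only fixes $\hat{x}(2) = x(1)$ and $b(1) = 0$); with it, Eq. (1) at $t = 1$ reproduces $\hat{x}(2) = x(1)$.

```python
def double_exponential_forecast(x, alpha, beta):
    """One-step-ahead forecasts following Eqs. (1) and (2).

    x is the list of observations x(1), ..., x(T). The returned list holds
    xhat(1), ..., xhat(T+1), so its last element is the forecast for T+1.
    """
    if not x:
        raise ValueError("need at least one observation")
    # Index 0 is unused so that list positions match the 1-based time index.
    xhat = [None, x[0]]   # assumption (not in the text): xhat(1) = x(1)
    b = [None, 0.0]       # b(1) = 0, as suggested in the text
    for t in range(1, len(x) + 1):
        if t >= 2:
            # Eq. (2): trend from the last two predictions and the old trend.
            b.append(beta * (xhat[t] - xhat[t - 1]) + (1 - beta) * b[t - 1])
        # Eq. (1): blend the actual value with the trend-adjusted prediction.
        xhat.append(alpha * x[t - 1] + (1 - alpha) * (xhat[t] + b[t]))
    return xhat[1:]
```

For example, with $\alpha = \beta = 0.5$ and observations 10, 20, 30, the forecast for the fourth period is 23.75, while a constant series yields constant forecasts.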

2.4. Recommendation approaches

The recommender system is widely used to provide suitable personalized information to users according to their needs and preferences [1–3,17,18,22,29,35]. The recommender system has been applied in many different areas [36], such as products [8,23], movies [32], books [10] and music [37]; it not only offers a personalized recommendation service for each customer, but also benefits business marketing strategies. Generally, recommender systems mainly include content-based filtering and collaborative filtering.

The content-based filtering (CBF) approach analyzes customers’ preferences regarding the item’s attribute features to build up a personal feature profile, and then predicts which items the customer will like [14,41]. In other words, this approach recommends items whose attribute features are similar to the customer’s profile, according to their past preferences; it is more likely to be used for document, webpage and news article recommendations. However, this method still has some limitations; it is not easy to analyze the features of items, and users can only receive recommended items which are similar to past ones [21].

The collaborative filtering (CF) approach is one of the most popular recommendation approaches, and it has been successfully applied in many areas [4,32]. This method can solve some of the problems of the content-based method mentioned above. There is no need to analyze the contents of an item; the recommended items are identified for target users solely based on the similarities to the historical profiles of other users. Furthermore, it can deal with items whose content is dissimilar to those in the past.

Based on the relationship between items or users, the CF method can be classified into two types [35]: user-based CF and item-based CF. User-based CF calculates the similarity between users, and predicts the target user’s preference regarding different items; GroupLens is an example of such a system [32]. The CF approach involves two steps: neighborhood formation and prediction. The neighborhood of a target user is selected according to his/her similarity to other users, which is computed by the Pearson correlation coefficient or the cosine measure. Either the k-NN (nearest neighbor) approach or a threshold-based approach is used to choose the k users who are most similar to the target user.
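The neighborhood formation step can be illustrated with a short sketch, assuming ratings are stored as a dictionary from user to a dictionary of item ratings. The function names are hypothetical, and computing each user's mean over co-rated items only is one common variant of the Pearson coefficient, chosen here for brevity.

```python
from math import sqrt

def pearson(ra, rb):
    """Pearson correlation between two users over their co-rated items."""
    common = [i for i in ra if i in rb]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ra[i] for i in common) / len(common)
    mean_b = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - mean_a) * (rb[i] - mean_b) for i in common)
    den = sqrt(sum((ra[i] - mean_a) ** 2 for i in common)) * \
          sqrt(sum((rb[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def k_nearest_neighbors(ratings, target, k):
    """Return the k users most similar to the target user (k-NN selection)."""
    sims = [(pearson(ratings[target], r), u)
            for u, r in ratings.items() if u != target]
    sims.sort(key=lambda s: s[0], reverse=True)
    return [u for _, u in sims[:k]]
```

A threshold-based variant would instead keep every user whose similarity exceeds a fixed cut-off rather than a fixed number k.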

With the numbers of users and items exploding, determining how to quickly produce high-quality recommendations and search a large number of potential neighbors in real time are important issues, especially for commercial systems. The item-based CF method has been proposed to identify the relationships between different items that users have already rated and then rank recommended items that each user has not viewed before; this method has already been applied on the Amazon platform [10], achieving good performance.

The item-based collaborative filtering (ICF) algorithm [34] first analyzes the relationships between items (e.g., documents), rather than the relationships between users. Then, the item relationships are used to compute recommendations for users indirectly, by finding items that are similar to other items which the user has previously accessed. Thus, the prediction for item $j$ for user $u$ is calculated as the sum of the ratings given by the user for items similar to $j$, weighted by item similarity, as shown in Eq. (3):

$$p_{u,j} = \frac{\sum_{i=1}^{n} w(j,i)\, r_{u,i}}{\sum_{i=1}^{n} |w(j,i)|}, \tag{3}$$

where $p_{u,j}$ represents the predicted rating of item $j$ for user $u$; $w(j,i)$ is the similarity between two items $j$ and $i$; and $r_{u,i}$ denotes the rating of user $u$ for item $i$. A number of methods can be used to determine the similarity between items, e.g., cosine-based similarity, correlation-based similarity, and adjusted cosine similarity. Since the adjusted cosine similarity method performs better than the others [34], we used it as the similarity measure for the ICF method. The adjusted cosine similarity between two items $i$ and $j$ is given by Eq. (4):

$$\mathrm{AdjSim}(i,j) = \frac{\sum_{u \in U} (r_{u,i} - \bar{r}_u)(r_{u,j} - \bar{r}_u)}{\sqrt{\sum_{u \in U} (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{u \in U} (r_{u,j} - \bar{r}_u)^2}}, \tag{4}$$

where $r_{u,i}$ / $r_{u,j}$ is the rating of item $i$ / $j$ given by user $u$; and $\bar{r}_u$ is the average item rating of user $u$.
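Eqs. (3) and (4) can be sketched together as follows. This is an illustrative reading rather than the system's implementation: the function names and the dictionary layout (user to item ratings) are assumptions, $U$ is taken to be the set of users who rated both items, and the sum in Eq. (3) runs over the items the target user has rated.

```python
from math import sqrt

def adjusted_cosine(ratings, i, j):
    """Eq. (4): adjusted cosine similarity between items i and j.

    Assumption: U is the set of users who rated both i and j.
    """
    num = den_i = den_j = 0.0
    for r in ratings.values():
        if i in r and j in r:
            mean_u = sum(r.values()) / len(r)   # average item rating of user u
            di, dj = r[i] - mean_u, r[j] - mean_u
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / (sqrt(den_i) * sqrt(den_j))

def predict_rating(ratings, u, j):
    """Eq. (3): similarity-weighted sum over the items user u has rated."""
    num = den = 0.0
    for i, r_ui in ratings[u].items():
        if i == j:
            continue
        w = adjusted_cosine(ratings, j, i)
        num += w * r_ui
        den += abs(w)
    # No similar rated item means no prediction can be derived.
    return num / den if den else None
```

Returning `None` when no rated item has a non-zero similarity makes the new item problem visible: an item that no user has rated yet never receives a prediction from this formula.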

The CBF method is limited in being unable to provide serendipitous recommendations since the recommendation is based solely on the content features of items that the user has preferred. The success of collaborative filtering relies on the availability of a sufficiently large set of quality preference ratings provided by users. Accordingly, finding users with similar preferences is difficult if the user rating matrix is very sparse (few preference ratings), causing the sparsity problem for the CF method. In addition, the CF method may suffer from the new item problem, in which there is no rating record on new items from which to derive the prediction [1].

3. System process overview

We propose a novel value-added mobile service, namely Customized Content Service on mobiles (m-CCS), to provide customized blog articles for mobile users based on the time-sensitive popularity of topics and personal preference patterns, as shown in Fig. 1.

The first step of our system is to collect blog articles from the Internet. The RSS mechanism is a useful way to capture the latest articles automatically without visiting each site. RSS is an abbreviation for Really Simple Syndication, an XML document format for aggregating information from multiple web sources. Any mobile user can subscribe to RSS feeds. However, there may be a shortage of information because individuals subscribe to too few RSS feeds. Thus, we propose a co-RSS method to solve this problem. The co-RSS method gathers all RSS feeds from users such that RSS flocks, called crows-RSS, are formed to enrich information sources. After this preliminary procedure, the system can automatically collect desirable contents from diverse resources. Moreover, we use information retrieval technology (e.g., the tf-idf approach) [33] to pre-process articles which are trawled every day from blog websites according to the crows-RSS feeds. After extracting the features (term vectors) of blog articles, the time-sensitive popularity tracking (TPT) module groups articles into topic clusters and automatically predicts their trend of popularity. The details of the TPT module are presented in Section 4.

Since the viewable content on mobile device screens is limited, designing a personalized service for filtering articles is particularly desirable. The m-CCS can monitor the click rates of articles daily and log user viewing records to infer the implicit preferences of mobile users. Without requiring users to rate articles, the implicit interest of a user in an article is inferred by comparing the time spent on reading the article with the average time spent on articles of the same size. The browsing records of users are analyzed to discover their behavior patterns and then their personal preferences are deduced through a personal favorite analysis (PFA) module. Moreover, the m-CCS predicts a user’s preferred topics by deriving his/her customized popularity degree of topic clusters according to the predicted popularity of topic clusters and his/her preferences. Section 5 presents the details of the PFA module.

Finally, the system recommends blog articles based on the customized popularity degree of topic clusters and the preferences of mobile users. The recommended articles are then sent to the user’s mobile device via a WAP Push service. This allows users to instantly receive personalized and relevant blog articles. The proposed recommendation process of the m-CCS mainly integrates content analysis and collaborative filtering to address the shortcomings of pure collaborative filtering (CF), including the sparsity and cold start issues. It combines three aspects: (1) the prediction of popular topic clusters of concern to bloggers and readers on the Internet, (2) the prediction of users’ preference scores by item-based collaborative filtering, and (3) the attention degree (click times) of blog articles obtained from Internet users. The detailed descriptions of the recommendation process are presented in Section 6.

In general, the effectiveness of the CF recommendation approach mostly depends on the set of historical data. There are still potential limitations, such as the sparsity and cold start issues [2,39]. Low-quality recommendation results may be delivered due to the sparsity issue, namely when the system has only very few rating records of users with which to measure the similarity between users or items. For the cold start issue of new items or new users, the system will perform poorly in recommendation because of the lack of active records viewed by users.

In our research, we focus on mobile users and blog articles. We apply clustering techniques to first group the articles into topic clusters and then form neighborhoods of items from the topic clusters, which can reduce the sparsity problem and improve the scalability of recommender systems. Additionally, many blog articles have not been viewed by any mobile user in our system due to the limitations of mobile devices. This means that most articles, which are popular on the Internet and are attractive to the masses of Internet users, may be ignored in the process of recommendation. Thus, our proposed recommendation approach not only considers mobile users’ preferences concerning the articles which have been pushed to them on the mobile devices, but also considers the perspectives of Internet readers to identify the popularity of articles, in order to improve the quality of recommendation.

4. Time-sensitive popularity tracking

In this section, we present a novel approach to predict the trend of time-sensitive popularity of blog topics. We identify the blog topic clusters and their popularity according to the perspectives of writers and readers on the Internet, and then trace the trend of popularity temporally. In the following subsections, we illustrate the details of the tracking process shown in Fig. 2.

4.1. Forming topic clusters of blog articles

Articles in blogs are free-form and usually contain different opinions, so it is difficult to assign articles to the appropriate categories as defined by bloggers. That is to say, the existing categories in a blog website are insufficient to fully represent the blog. In our research, we use article features, i.e., term-weight vectors, derived from the pre-processing to deal with blog articles which are published within a given time window on the Internet. We collect blog articles from blog websites as the training corpus to construct the dictionary by applying one of the statistical methods, the log likelihood ratio, to extract meaningful phrases and terms. In addition, blog articles are trawled every day from blog websites according to the crows-RSS feeds. Note that the blog training data is periodically updated and retrained to update the dictionary. Significant terms/phrases are extracted from the content of an article according to the dictionary derived from the blog training data. In addition, each article is represented as a term vector by using the tf-idf approach [33] to calculate the weight of term $i$ in an article $j$, as defined in Eq. (5):

$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_{l}(\mathrm{freq}_{l,j})}, \tag{5}$$

where $N$ is the number of articles; $n_i$ is the number of articles that contain term $i$; $f_{i,j}$ is the normalized frequency of term $i$ in article $j$; $\mathrm{freq}_{i,j}$ is the frequency of term $i$ in article $j$; and $\max_{l}(\mathrm{freq}_{l,j})$ is the frequency of the term $l$ which has the maximum frequency in article $j$.
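A short sketch of the weighting in Eq. (5) is given below, assuming each article has already been reduced to a list of dictionary terms. The function name is illustrative, and the natural logarithm is an assumption since the base of the logarithm is not stated.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute the term weights w_{i,j} of Eq. (5).

    docs is a list of articles, each given as a list of terms. The result
    is one dict per article mapping term -> weight.
    """
    n_articles = len(docs)
    # n_i: number of articles that contain term i.
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))
    vectors = []
    for terms in docs:
        freq = Counter(terms)
        max_freq = max(freq.values()) if freq else 1
        vectors.append({
            # f_{i,j} * log(N / n_i); natural log is an assumption.
            t: (f / max_freq) * math.log(n_articles / doc_freq[t])
            for t, f in freq.items()
        })
    return vectors
```

A term that occurs in every article receives weight zero, so only terms that discriminate between articles contribute to the later similarity computations.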

The size of the time window is set as seven days. That is, all the articles posted in the past seven days will be categorized and recommended to individual users.

A hierarchical agglomerative algorithm with the group-average clustering approach [16] is applied to implement the clustering step. It treats each article as a cluster first and then successively merges the pair of clusters with the highest cluster similarity. The similarity between two articles can be calculated by means of the cosine similarity measure, as shown in Eq. (6):

$$sim(d_i, d_j) = \cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\|\vec{d}_i\|\,\|\vec{d}_j\|}. \tag{6}$$

The cluster similarity between two clusters is defined as the average pairwise similarity of all pairs of articles from different clusters. The cluster similarity between two clusters $r_i$ and $r_j$ is calculated by Eq. (7), where $d_i$ / $d_j$ is a blog article belonging to the set of blog articles $S_{r_i}$ / $S_{r_j}$ in Cluster $r_i$ / $r_j$; $|S_{r_i}|$ / $|S_{r_j}|$ is the number of blog articles in $S_{r_i}$ / $S_{r_j}$; and $sim(d_i, d_j)$ denotes the cosine similarity between the articles $d_i$ and $d_j$:

$$similarity(r_i, r_j) = \frac{\sum_{d_i \in S_{r_i}} \sum_{d_j \in S_{r_j}} sim(d_i, d_j)}{|S_{r_i}|\,|S_{r_j}|}. \tag{7}$$

We stop merging pairs of clusters when the highest cluster similarity falls below a threshold during the merge process. The number of clusters each day is not constant; it depends on the density of the discussed topics. If the density of the topics which people discuss is high, the diversity of the articles is low and the number of clusters decreases.
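The clustering step described above can be sketched as follows, under the assumption that each article is a sparse term-weight dictionary. The function names and the threshold value are hypothetical, and this naive version recomputes all pairwise cluster similarities at every merge, which is adequate only for small collections.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Eq. (6): cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_similarity(ca, cb, vectors):
    """Eq. (7): average pairwise similarity between two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in ca for j in cb)
    return total / (len(ca) * len(cb))

def group_average_clustering(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]   # one article per cluster
    while len(clusters) > 1:
        best_sim, (a, b) = max(
            (cluster_similarity(clusters[a], clusters[b], vectors), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break                                   # stopping rule from the text
        clusters[a] = clusters[a] + clusters[b]     # a < b, so index a stays valid
        del clusters[b]
    return clusters
```

Because the stopping rule is a similarity threshold rather than a target count, the number of clusters returned varies with the data, which matches the observation that the number of clusters differs from day to day.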

4.2. Constructing the trend path between clusters belonging to adjacent days

To reveal the trend path which is used to predict the popularity degree of current clusters, we measure the cluster similarity between the target Cluster r and all the Clusters pr belonging to the preceding period, and then link the target cluster to the preceding cluster with the maximum similarity.

As blog articles are usually composed of unstructured words, to obtain the similarity between two clusters belonging to two days, we average the cosine similarity between articles across the clusters. The similarity between two clusters (r, pr) in adjacent days is calculated by Eq. (7). After establishing the linkages, the trend of each current cluster can be derived from the preceding related cluster. As shown in Fig. 3, all of the clusters receive a trend path from a preceding cluster. The topic of Cluster1 in day 3 has evolved from Cluster1 in day 2, and so on, and we can use the relationship and similarity between them to calculate the popularity degree.
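The linking step can be sketched in a few lines of Python. The function name `link_to_preceding`, the cluster identifiers, and the similarity values in the usage example are hypothetical; `sim` stands for any callable that returns the cluster similarity of Eq. (7), and the preceding period is assumed to contain at least one cluster.

```python
def link_to_preceding(current, preceding, sim):
    """Map each current cluster to (most similar preceding cluster, similarity)."""
    links = {}
    for r in current:
        pr = max(preceding, key=lambda p: sim(r, p))   # maximum cluster similarity
        links[r] = (pr, sim(r, pr))
    return links

# Usage with a made-up similarity table between two adjacent days.
table = {("c1", "p1"): 0.23, ("c1", "p2"): 0.10,
         ("c2", "p1"): 0.05, ("c2", "p2"): 0.82}
links = link_to_preceding(["c1", "c2"], ["p1", "p2"], lambda r, p: table[(r, p)])
```

The similarity stored with each link is kept because it is later used to adjust the predicted popularity degree.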

[Fig. 3: diagram of the topic clusters (Cluster1, Cluster2, ...) on day 1 through day 4, with trend paths linking clusters of adjacent days.]

4.3. Acquisition of actual popularity degree for each preceding cluster

After clustering blog articles to form topic clusters (e.g., theme groups) and constructing the trend path, we mainly use reader attention, namely the click times of topic clusters, to derive the popularity degree of each cluster. To help predict the popularity degree of a current cluster, we consider the click times to be proportional to the reader attention that causes a topic to rise and flourish. The actual popularity degree of each preceding cluster can thus be acquired from the number of times its articles were clicked during a previous period. Let $S_{pr}$ denote the set of blog articles in Cluster pr. For each preceding Cluster pr, we obtain $CT_t(S_{pr})$, the total click times of the articles in $S_{pr}$ on the Internet within the preceding time period t, as defined in Eq. (8):

$$ CT_t(S_{pr}) = \sum_{d_i \in S_{pr}} ClickTimes_t(d_i), \tag{8} $$

where the actual click times for blog article $d_i$ in time t are represented by $ClickTimes_t(d_i)$. Then, the click times can be converted to the actual popularity degree, $APD_{pr}(t)$, which is a normalized value based on the maximum ClickTimes over all $S_k$ in the preceding period t, as defined in Eq. (9):

$$ APD_{pr}(t) = \frac{CT_t(S_{pr})}{Max\{ClickTimes_t(S_k)\}} \times 100\%. \tag{9} $$
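Eqs. (8) and (9) amount to a sum followed by a normalization, as in the sketch below. The function name and input layout are hypothetical, and the denominator of Eq. (9) is read here as the largest cluster total in the period, which is an interpretation of the notation rather than something the text states explicitly.

```python
def actual_popularity_degrees(cluster_clicks):
    """Return {cluster_id: APD in percent} for one time period.

    cluster_clicks: {cluster_id: [click counts of the cluster's articles in period t]}.
    """
    totals = {c: sum(clicks) for c, clicks in cluster_clicks.items()}   # Eq. (8)
    max_total = max(totals.values(), default=0)
    if max_total == 0:                       # no clicks at all in this period
        return {c: 0.0 for c in totals}
    return {c: 100.0 * total / max_total for c, total in totals.items()}  # Eq. (9)
```

Under this reading, the most clicked cluster of a period always has an APD of 100%.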

4.4. Predicting popularity degree of current cluster

We analyze the trend evolution of attention from Internet readers to predict the popularity degree of the current cluster. The time series of the popularity trend is a set of serial observation values in time order, as shown in Fig. 4. We modified the double exponential smoothing method described in Section 2.3 to forecast the degree of the popularity trend for each cluster of blog topics. We only give brief explanations of some equations of the double exponential smoothing method. Readers can refer to the references [6,7] for further details.

For each Cluster r, we use the weighted average method that combines the actual popularity degree (APD) and predicted popularity degree (PPD) of the preceding period to predict the popularity degree of current clusters on the assumption that the effect of popularity degree decays as days pass, as defined in Eq. (10):

$$ PPD'_r(t+1) = \alpha \times APD_{pr}(t) + (1-\alpha) \times [PPD_{pr}(t) + b_{pr}(t)], \tag{10} $$

where we use Cluster pr at preceding time t to predict the initial popularity degree of Cluster r at time t + 1, which is denoted by $PPD'_r(t+1)$. For the preceding Cluster pr at time t, $APD_{pr}(t)$ is the actual popularity degree as mentioned above; $PPD_{pr}(t)$ denotes the predictive popularity degree of Cluster pr at time t. The $b_{pr}(t)$ represents the trend effect for the previous period. Note that the value of the initial predictive popularity degree for the current cluster, $PPD'_r(t+1)$, is between zero and one. The parameter $\alpha$ is a smoothing constant between zero and one, which is used to determine the relative importance of the actual popularity degree and the predictive popularity degree with trend effect in the preceding period.

We combine the difference of the predictive popularity degrees at time t and at time t − 1, and the trend effect at time t − 1 to calculate the trend effect at time t, $b_{pr}(t)$, using the weighted average, as defined in Eq. (11):

$$ b_{pr}(t) = \delta \times [PPD_{pr}(t) - PPD_{ppr}(t-1)] + (1-\delta) \times b_{ppr}(t-1). \tag{11} $$

Note that the Cluster pr is the preceding cluster of r, while the Cluster ppr is the preceding cluster of pr. The $PPD_{ppr}(t-1)$ and $b_{ppr}(t-1)$ are the predictive popularity degree and trend effect of Cluster ppr at time t − 1, respectively. The parameter $\delta$ is a smoothing constant between zero and one, which is used to adjust the relative importance of the difference between the predictive popularity degrees at time t and at time t − 1, and the trend effect at time t − 1.

The values of $\alpha$ and $\delta$ in Eqs. (10) and (11), respectively, can be decided by experts or experimental analysis. The double exponential smoothing approach [7] is usually applied to analyze time series data; however, it does not consider the relation between topic clusters belonging to adjacent time periods. In our research, we concentrate on topic clusters in different time periods and construct the topic linkage from the preceding time to the current time as a topic trend path with a popularity degree. Therefore, to link topic clusters, the maximal similarity between adjacent clusters, i.e., current Cluster r and preceding Cluster pr, as described in Section 4.2, is selected to adjust the predictive popularity degree of Cluster r, as shown in Eq. (12). Notably, a smaller similarity implies a lower reliability of the prediction path between two clusters:

$$ PPD_r(t+1) = PPD'_r(t+1) \times similarity(r, pr). \tag{12} $$

In Fig. 5, we take one trend path spanning three time periods as an example and set both parameters, $\alpha$ and $\delta$, to 0.3. We use the popularity of Cluster11, which belongs to Time t, to predict the popularity degree of Cluster22 in Time t + 1. In the same way, Cluster01 is useful to infer Cluster22. In the initial stage, the actual popularity degree for Cluster01 is assumed to be 40%. It is reasonable to assume $PPD'_{pr}(t) = APD_{ppr}(t-1)$, $PPD_{ppr}(t-1) = 0$, and $b_{ppr}(t-1) = 0$ at the starting time 0. Likewise, we also assume that the predictive popularity degree $PPD_{Cluster01}(t-1)$ and the trend effect $b_{Cluster01}(t-1)$ for Cluster01 are both zero. Thus, the initial predictive popularity degree of Cluster11 can be derived, and the value is 40%. Then the similarity across adjacent clusters should be considered to calculate the predictive popularity degree.

Suppose that the value of similarity between Cluster01 and Cluster11 is 0.23; we can obtain the predictive popularity degree of Cluster11 after adjustment as 40% × 0.23 = 9.2%. Next, we use the values which were derived previously to predict the initial popularity degree of Cluster22 according to Eq. (10):

$$ PPD'_{Cluster22}(t+1) = 0.3 \times APD_{Cluster11}(t) + 0.7 \times [PPD_{Cluster11}(t) + b_{Cluster11}(t)]. $$

The value of the trend effect, $b_{Cluster11}(t)$, is derived using Eq. (11):

$$ b_{Cluster11}(t) = 0.3 \times [PPD_{Cluster11}(t) - PPD_{Cluster01}(t-1)] + 0.7 \times b_{Cluster01}(t-1) = 2.76\%. $$

Thus, $PPD'_{Cluster22}(t+1) = 0.3 \times 10\% + 0.7 \times [9.2\% + 2.76\%] = 11.37\%$. The value of similarity between Cluster11 and Cluster22 is 0.82. We obtain the final predictive popularity degree as follows:

$$ PPD_{Cluster22}(t+1) = PPD'_{Cluster22}(t+1) \times similarity(Cluster11, Cluster22) = 9.32\%. $$
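The worked example can be retraced with a short Python sketch of Eqs. (10) to (12). The function and variable names are chosen here for illustration, with `alpha` and `delta` standing for the two smoothing constants of Eqs. (10) and (11); all values are percentages taken from the example.

```python
def trend_effect(delta, ppd_pr, ppd_ppr, b_ppr):
    """Trend effect of the preceding cluster, Eq. (11)."""
    return delta * (ppd_pr - ppd_ppr) + (1 - delta) * b_ppr

def initial_ppd(alpha, apd_pr, ppd_pr, b_pr):
    """Initial predictive popularity degree of the current cluster, Eq. (10)."""
    return alpha * apd_pr + (1 - alpha) * (ppd_pr + b_pr)

def adjusted_ppd(ppd_initial, similarity):
    """Similarity-adjusted predictive popularity degree, Eq. (12)."""
    return ppd_initial * similarity

alpha = delta = 0.3
# Cluster11: initial PPD equals the APD of Cluster01 (40%), adjusted by similarity 0.23.
ppd_c11 = adjusted_ppd(40.0, 0.23)
# Trend effect of Cluster11, with PPD and trend of Cluster01 assumed to be zero.
b_c11 = trend_effect(delta, ppd_c11, 0.0, 0.0)
# Cluster22: APD of Cluster11 is 10%, similarity between Cluster11 and Cluster22 is 0.82.
ppd0_c22 = initial_ppd(alpha, 10.0, ppd_c11, b_c11)
ppd_c22 = adjusted_ppd(ppd0_c22, 0.82)
print(round(ppd_c11, 2), round(b_c11, 2), round(ppd0_c22, 2))
```

The intermediate values match the 9.2%, 2.76%, and 11.37% of the example, and the final value is approximately 9.3%, in line with the reported 9.32% once the intermediate result is rounded.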

5. Personal favorite analysis

In this section, we present a novel scheme that models the interests of users who browse blog articles on mobile devices. Our proposed methods are implemented to enhance an existing system running in a real mobile business environment. Because of the limited features of mobile devices, it is inconvenient for mobile users to give explicit relevance ratings of blog articles. Thus, the existing system does not provide the function of explicit rating of articles. Providing explicit feedback such as rating items may place an extra burden on users; because it would disturb the normal browsing process, it would usually be ignored by users [26]. Accordingly, we analyze the browsing patterns of mobile users as implicit feedback information to derive their preferences for blog articles.

5.1. Analysis of user browsing behavior

We model browsing patterns within session time by analyzing the log data of mobile users. A user's browsing pattern is derived by calculating his/her average reading time per word for browsing blog articles within session time. The system records the browsing time of blog articles requested by mobile users to derive the session interval and browsing time for each article. A timeout mechanism is used to terminate a session automatically when a user does not make any request within a time period. The time interval between user requests on articles within each session is used to estimate a user's browsing (stick) time on an article.

[Fig. 5: example trend path over Time t − 1, Time t, and Time t + 1. Cluster 01 (APD = 40%) is linked to Cluster 11 (APD = 10%) with similarity 0.23, and Cluster 11 is linked to Cluster 22 with similarity 0.82; Clusters 02, 12, 13, and 21 are also shown.]

In order to acquire the browsing pattern for the user u, we analyze the browsing speed, $H_{u,s}$, to get the average browsing time per word in this session s, as shown in Eq. (13):

$$ H_{u,s} = \frac{1}{|D_{u,s}|} \times \sum_{d_i \in D_{u,s}} \frac{Time_u(d_i)}{DocSize(d_i)}, \tag{13} $$

where $d_i$ is an article i that the user u had browsed within session s; $D_{u,s}$ is the set of articles browsed by user u in session s; $|D_{u,s}|$ denotes the number of articles in $D_{u,s}$; $DocSize(d_i)$ identifies the number of words of the article; and $Time_u(d_i)$ denotes the user u's browsing time on blog article $d_i$.

After obtaining a user's current browsing behavior, $H_{u,s}$, which is viewed as the user's recent pattern within one session, we use a weighted approach to predict a user's future browsing pattern incrementally, modifying the former browsing pattern with the user's current browsing behavior. The parameter $\beta$ can be adjusted in order to set one as more important than the other. We believe that recent browsing behavior has a greater effect upon the future behavior of the mobile user, so we set the parameter $\beta$ to give recent patterns more weight.

The predicted browsing pattern is calculated by using Eq. (14), where $H'_{u,s}$ denotes the former browsing pattern which has been accumulated till session s for mobile user u. Then we can use the new browsing pattern at session s, i.e., $H_{u,s}$, to predict the future behavior at new session s + 1:

$$ H'_{u,s+1} = \beta \times H_{u,s} + (1-\beta) \times H'_{u,s}. \tag{14} $$
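Eqs. (13) and (14) can be sketched in Python as below. The function names, the input layout, and the default weight of 0.7 are assumptions for illustration; the text only says that recent patterns receive more weight and does not report the value used.

```python
def session_speed(browsed):
    """Average browsing time per word in one session, Eq. (13).

    browsed: list of (browsing_time_seconds, word_count) pairs, one per article.
    """
    if not browsed:
        raise ValueError("session has no browsed articles")
    return sum(time / words for time, words in browsed) / len(browsed)

def update_pattern(previous, current, weight=0.7):
    """Blend the accumulated pattern with the current session's speed, Eq. (14).

    weight is the smoothing parameter of Eq. (14); 0.7 is a hypothetical value
    that favours the recent session.
    """
    return weight * current + (1 - weight) * previous
```

For example, a session with a 300-word article read in 60 seconds and a 100-word article read in 30 seconds gives a speed of 0.25 seconds per word.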

5.2. Inferring user preference for articles

In this step, we infer user preferences for articles based on their browsing behavior, which is considered as implicit feedback information. Previous studies [27] have also found that reading time is indicative of interest. By analyzing a user's browsing time on an article, we can infer how interested the user is in the article and its corresponding preference score. If the browsing time is longer than usual, we can estimate that the user has a high preference level for the article.

According to the user's usual browsing behavior, we employ the user's browsing pattern mentioned in Section 5.1 to estimate the browsing time for the article and calculate the Predict Browsing Time, $PBT_u(d_i)$, to compare with the Actual Browsing Time, $ABT_u(d_i)$, of the user. The predict browsing time $PBT_u(d_i)$ is equal to $DocSize(d_i) \times H'_{u,s+1}$, where $DocSize(d_i)$ is the size (number of words) of blog article $d_i$ and $H'_{u,s+1}$ denotes the average browsing time per word for user u as described in Section 5.1. Then, we calculate the preference score (PS) for target user u on blog article $d_i$ as follows:

$$ PS_u(d_i) = \frac{1}{1 + \dfrac{PBT_u(d_i)}{ABT_u(d_i)}}. \tag{15} $$

We can observe that the value of this function is in the range (0, 1); a higher preference score means that the user has more interest in the article.
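A minimal sketch of Eq. (15) follows. The function name and arguments are hypothetical, and returning zero for a non-positive actual browsing time is a convention chosen here, since the equation is undefined in that case.

```python
def preference_score(actual_time, word_count, time_per_word):
    """Preference score of a user for an article, Eq. (15).

    actual_time:   actual browsing time (ABT) on the article
    word_count:    DocSize of the article
    time_per_word: the user's predicted browsing time per word from Section 5.1
    """
    predicted_time = word_count * time_per_word      # PBT
    if actual_time <= 0:
        return 0.0                                   # assumption: no browsing, no interest
    return 1.0 / (1.0 + predicted_time / actual_time)
```

When the actual time equals the predicted time the score is 0.5, so scores above 0.5 indicate that the user spent longer than usual on the article.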

6. Hybrid recommendation

In this section, we propose a novel hybrid method that combines user preference prediction by collaborative ﬁltering, Internet attention degrees of articles, and customized popularity degree of topic cluster, in order to recommend personalized blog articles to mobile users.

The basic idea of this process is to integrate the different viewpoints of mobile users and Internet users. We use an item-based collaborative filtering approach to recommend the latent articles of interest according to the actual browsing behavior of mobile users. However, the CF approach suffers from the sparsity and cold start issues. Because of the limitations of the mobile device, the mobile user cannot easily surf blog articles and a lot of articles are never browsed by mobile users. This means that most popular articles on the Internet, attractive to the masses of Internet users, may be ignored in the process of recommendation. Thus, our proposed recommendation approach not only considers the mobile users' preferences concerning the articles which have been pushed to them on the mobile devices, but also considers the viewpoints of Internet readers to identify the attention degree of articles, in order to improve the quality of recommendation. We also consider the predictive popularity degree of the topic cluster to which each article belongs. The more popular the topic of an article is, the more users will be interested in the article.

6.1. Topic-based collaborative ﬁltering

Research has demonstrated that the item-based CF approach can efficiently produce high-quality recommendations. The item-based CF method usually computes item similarity based on the whole set of items. However, user preferences on items of different clusters may vary, since the items of different clusters have different characteristics. Mobile users with similar preferences on a topic cluster (e.g. movies) may have different interests in other topics. As mentioned previously, we apply clustering techniques to group the articles into topic clusters first and then form neighborhoods of items from the topic clusters; this can improve the scalability of recommender systems. For each topic cluster, we adopt the item-based CF method to predict mobile users' preferred articles, due to the efficiency concern for commercial systems.

We use the adjusted cosine [34] to measure the similarity between two articles, $d_i$ and $d_j$, which belong to Cluster r, as defined in Eq. (16). The set of users who co-rate both $d_i$ and $d_j$ is denoted by $U_{ij}$. $PS_u(d_i)$ is the preference score of user u on article $d_i$, and $\overline{PS}_u$ is the average preference score of user u:

$$ sim^{adj}_r(d_i, d_j) = \frac{\sum_{u \in U_{ij}} (PS_u(d_i) - \overline{PS}_u)(PS_u(d_j) - \overline{PS}_u)}{\sqrt{\sum_{u \in U_{ij}} (PS_u(d_i) - \overline{PS}_u)^2} \, \sqrt{\sum_{u \in U_{ij}} (PS_u(d_j) - \overline{PS}_u)^2}}. \tag{16} $$

To predict the preference score of target user u on article i within Cluster r, the next step is to select a set of articles most similar to the target article and generate a predicted preference for $d_i$ using a weighted sum, as shown in Eq. (17), which is adopted from Eq. (3). $PPS^{cf}_u(d_i)$ denotes the predicted preference score of target user u on article i based on the item-CF method; $d_j$ is a nearest neighbor of the target article $d_i$; and $\hat{I}$ is the set of top-N articles that are most similar to the target article and have been browsed by user u:

$$ PPS^{cf}_u(d_i) = \frac{\sum_{j \in \hat{I}} PS_u(d_j) \times sim^{adj}_r(d_i, d_j)}{\sum_{j \in \hat{I}} |sim^{adj}_r(d_i, d_j)|}. \tag{17} $$
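The two equations can be sketched in Python for the articles of a single topic cluster. The function names, the nested-dict layout of the preference scores, and the choice to average each user's scores over all articles that user has browsed are assumptions made for illustration.

```python
import math

def adjusted_cosine(scores, di, dj):
    """Adjusted cosine similarity between articles di and dj, Eq. (16).

    scores: {user: {article: preference score}} restricted to one topic cluster.
    """
    user_mean = {u: sum(s.values()) / len(s) for u, s in scores.items() if s}
    num = den_i = den_j = 0.0
    for u, s in scores.items():
        if di in s and dj in s:                 # users who scored both articles
            a = s[di] - user_mean[u]
            b = s[dj] - user_mean[u]
            num += a * b
            den_i += a * a
            den_j += b * b
    if den_i == 0.0 or den_j == 0.0:
        return 0.0
    return num / math.sqrt(den_i * den_j)

def predict_cf(scores, user, target, top_n=5):
    """Predicted preference of user for target as a weighted sum, Eq. (17)."""
    sims = [(adjusted_cosine(scores, target, dj), dj)
            for dj in scores[user] if dj != target]
    neighbours = sorted(sims, reverse=True)[:top_n]   # most similar browsed articles
    den = sum(abs(s) for s, _ in neighbours)
    if den == 0.0:
        return None                                   # no usable neighbour
    return sum(s * scores[user][dj] for s, dj in neighbours) / den
```

Returning `None` when no neighbour is available leaves it to the caller to fall back on another signal, such as the attention degree of the article.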

6.2. The degree of attention for blog article

Mobile users are usually interested in those articles to which the majority of Internet readers pay attention. Within a topic Cluster r, we obtain the attention degree of an article, $attention_r(d_i)$, which is derived from the accumulated click times and indicates how much attention the article receives from Internet readers, as defined in Eq. (18):

$$ attention_r(d_i) = \frac{e^{\,ACCT(d_i) / Max_{d_j \in D_r}\{ACCT(d_j)\}} - 1}{e - 1}. \tag{18} $$

The attention degree is derived from the click-through rate, which is calculated as $ACCT(d_i) / Max_{d_j \in D_r}\{ACCT(d_j)\}$. $ACCT(d_i)$ denotes the accumulated click times for article $d_i$; $D_r$ is the set of articles in Cluster r; and $Max_{d_j \in D_r}\{ACCT(d_j)\}$ means the maximum accumulated click times of articles in Cluster r. We assume that the attention degree has the property of network externality. The larger the click-through rate of an article, the more attractive the article is to the mobile users. Fig. 6 shows that the value of the attention degree, $attention_r(d_i)$, rises as the click-through rate increases, and it is between zero and one.
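Eq. (18) can be sketched as a small Python function. The function name and the dict input are hypothetical, and returning zero for every article when no article in the cluster has been clicked is a convention chosen here.

```python
import math

def attention_degrees(acct):
    """Attention degree of each article in one topic cluster, Eq. (18).

    acct: {article_id: accumulated click times}.
    """
    max_clicks = max(acct.values(), default=0)
    if max_clicks == 0:
        return {d: 0.0 for d in acct}           # assumption: no clicks, no attention
    return {d: (math.exp(clicks / max_clicks) - 1) / (math.e - 1)
            for d, clicks in acct.items()}
```

The most clicked article of a cluster receives an attention degree of one, an article without clicks receives zero, and the values in between grow faster than linearly with the click ratio.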

6.3. Customized predictive popularity degree

In the process of time-sensitive popularity tracking (TPT), we apply a modified exponential smoothing method to predict the general popularity degree of a topic cluster. In this section, we further consider each user's preference obtained from the personal favorite analysis (PFA) step to derive the customized (personalized) popularity degree of a topic cluster. An article can be included in different topic clusters belonging to successive time periods. Once a mobile user has read an article, his/her preference score is inferred from the browsing behavior. The customized popularity degree of a topic cluster can then be derived using a user's preference scores for the articles belonging to this cluster.

Two methods are designed to derive the customized predictive popularity degree (CPPD) of topic Cluster r for a specific user u. The first one is called the weighted customized predictive popularity degree (WCPPD) method and is presented in Eqs. (19) and (20):

$$ WCPPD_{u,r} = \omega_{u,r} \times PPD_r + (1 - \omega_{u,r}) \times \frac{\sum_{d_j \in D_{u,r}} PS_u(d_j)}{|D_{u,r}|}, \tag{19} $$

$$ \omega_{u,r} = \begin{cases} \dfrac{1}{|D_{u,r}|}, & \text{if } |D_{u,r}| > 1, \\ 1, & \text{otherwise.} \end{cases} \tag{20} $$

We use the average preference score (PS) of user u for those articles that have been read by u and are contained in the target Cluster r to adjust the predictive popularity degree, $PPD_r$, for user u. $D_{u,r}$ denotes the set of articles that user u has browsed in Cluster r. The parameter $\omega$ is used to adjust the relative importance of $PPD_r$ and the average preference score. The value of $\omega$ is smaller if user u has browsed a larger number of articles in Cluster r; thus more weight ($1 - \omega$) is assigned to the average preference score (PS) of user u.

The derivation of the personalized CPPD depends on the reliability of the user's personal preference derived from the historical records of the user. If user u has browsed more articles, the system is more capable of predicting the user's preference score (PS), and thus the user's personal preference is more reliable. On the contrary, if user u has browsed fewer articles, then the user's personal preference is less reliable since the system may not be able to predict the user's preference score (PS) based on insufficient browsing records. With more reliable personal preference, i.e. more browsing records, the CPPD is influenced more by the average preference score (PS) of user u. The CPPD is, however, dominated by the general predictive popularity degree, $PPD_r$, for the users who have very few browsing records to infer their preferences.

For those mobile users who have sufficient browsing records in our system, the popularity degree of topic clusters provided by m-CCS would be more customized. The system would assign more weight to personal characteristics of users who have sufficient historical behavior records, and give less weight to the general popularity degree of topic clusters. Conversely, if a user has very few behavior records to be analyzed, the degree of modification of topic clusters is smaller. That is, the shorter the browsing history of a user, the less personalized the ranking of clusters, and the system will recommend the more general and popular topic clusters.

The second approach is called the harmonic customized predictive popularity degree (HCPPD) method. The basic idea of this method is to apply the harmonic mean approach to combine the predictive popularity degree ($PPD_r$) and the adjusted average preference score $PS^{adjusted}_{u,r}$ for each topic cluster as in Eq. (21). We derive the adjusted average preference score according to Eq. (22). The weight value, $1 - \omega_{u,r}$, with $\omega_{u,r}$ defined in Eq. (20), is used to adjust the average preference scores. The adjusted average preference score would be high if a user browses more articles within the topic cluster and shows higher preference for those articles. Moreover, according to the characteristic of the harmonic mean, the customized predictive popularity degree of Cluster r for user u will be high if both the predictive popularity degree of Cluster r and the adjusted average preference score of user u are high:

$$ HCPPD_{u,r} = \frac{2 \times PPD_r \times PS^{adjusted}_{u,r}}{PPD_r + PS^{adjusted}_{u,r}}, \tag{21} $$

$$ PS^{adjusted}_{u,r} = (1 - \omega_{u,r}) \times \frac{\sum_{d_j \in D_{u,r}} PS_u(d_j)}{|D_{u,r}|}. \tag{22} $$

The adjusted average preference score is derived from the user's average preference score by using the weight value, $1 - \omega_{u,r}$, to conduct the adjustment. The weight value, $1 - \omega_{u,r}$, varies according to the number of articles that have been browsed by user u. The weight value $1 - \omega_{u,r}$ is larger if the user has browsed more articles.

Fig. 7 identifies the tendency of the weight value, $1 - \omega_{u,r}$, with regard to the number of browsed articles ranging from one to fifteen. From the plots in Fig. 7, we observe that the value of the weight, $1 - \omega_{u,r}$, increases as the number of browsed articles increases. With more articles browsed by a user, his/her personal preference from historical records would become more important in affecting the value of the customized predictive popularity degree of the topic cluster; thus, the average preference score is adjusted by multiplying it with a higher weight value to derive the adjusted average preference score. Moreover, the weight value, $1 - \omega_{u,r}$, increases rapidly for fewer browsed articles, while the curve tends to be flat for more numerous browsed articles. Generally, the adjusted average preference score would be high if a user browses more articles within the topic cluster and shows higher preference for those articles.
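The two customization methods of Eqs. (19) to (22) can be compared with a short Python sketch. The function names are hypothetical, the popularity degree and preference scores are both assumed to be expressed on a zero-to-one scale, and the handling of users without browsed articles follows the "otherwise" branch of Eq. (20).

```python
def omega(n_browsed):
    """Weight of the general popularity degree, Eq. (20)."""
    return 1.0 / n_browsed if n_browsed > 1 else 1.0

def wcppd(ppd, scores):
    """Weighted customized predictive popularity degree, Eq. (19).

    ppd:    predictive popularity degree of the cluster (assumed in [0, 1])
    scores: preference scores of the user for browsed articles in the cluster
    """
    n = len(scores)
    if n == 0:
        return ppd                       # omega = 1: only the general degree counts
    w = omega(n)
    return w * ppd + (1 - w) * sum(scores) / n

def hcppd(ppd, scores):
    """Harmonic customized predictive popularity degree, Eqs. (21) and (22)."""
    n = len(scores)
    adjusted = (1 - omega(n)) * sum(scores) / n if n else 0.0   # Eq. (22)
    if ppd + adjusted == 0:
        return 0.0
    return 2 * ppd * adjusted / (ppd + adjusted)                # Eq. (21)
```

One consequence visible in the sketch is that, for a user with at most one browsed article in a cluster, the adjusted average preference score is zero, so the harmonic variant evaluates to zero while the weighted variant falls back to the general popularity degree.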


6.4. Article selection and recommendation

In this section, we propose a hybrid model that integrates the previous processes to recommend articles to mobile users. We derive the predictive preference score of a mobile user u on article $d_i$, $PPS^{hybrid}_{u,r}(d_i)$, as a hybrid of $PPS^{cf}_{u,r}(d_i)$, the predictive preference score by collaborative filtering, and $attention_r(d_i)$, the Internet readers' attention degree on the article within the Cluster r. $PPS^{hybrid}_{u,r}(d_i)$ can be expressed as Eq. (23). The parameter $\tau_u$, which is used to adjust the relative importance of $PPS^{cf}_{u,r}(d_i)$ and $attention_r(d_i)$, is defined in Eq. (24). $D^{push}_{u,t}$ is the number of articles pushed to user u within time period t and $D^{browsed}_{u,t}$ denotes the number of articles that the user has browsed within time period t. The more articles a user has browsed, the more personal interest is emphasized, since the historical records of the mobile user are then sufficient to predict his/her preference (e.g. $PPS^{cf}_{u,r}(d_i)$). In contrast, the attention degree, which represents the opinion of the masses on the Internet, is more important in computing the prediction (for recommendation) when very few records of browsing articles exist and the system cannot effectively infer the mobile user's preference:

$$ PPS^{hybrid}_{u,r}(d_i) = \tau_u \times PPS^{cf}_{u,r}(d_i) + (1 - \tau_u) \times attention_r(d_i), \tag{23} $$

$$ \tau_u = \log_2\left(\frac{D^{browsed}_{u,t}}{D^{push}_{u,t}} + 1\right). \tag{24} $$

The computation of $PPS^{hybrid}_{u,r}(d_i)$ depends on the reliability of the personal preference derived from the historical behavior of the user. With more reliable personal preference, i.e. more browsing records, the system is more capable of inferring the user's preference based on sufficient browsing records; thus, $PPS^{hybrid}_{u,r}(d_i)$ is influenced more by $PPS^{cf}_{u,r}(d_i)$. $PPS^{hybrid}_{u,r}(d_i)$, however, is dominated by $attention_r(d_i)$ for the users who have very few browsing records to infer their preferences, since the personal preference derived from the historical behavior of the user may be unreliable due to insufficient browsing records for analysis.

Fig. 8. Weight values in different percentages of user-browsed articles.


The value of $\tau_u$ is between zero and one, and the plots at different percentages of browsed articles, calculated by $D^{browsed}_{u,t} / D^{push}_{u,t}$, are shown in Fig. 8. When the browsing records are insufficient, $\tau_u$ tends to zero; $PPS^{cf}_{u,r}(d_i)$ is ignored and the final preference is mainly decided by using $attention_r(d_i)$. In contrast, with $\tau_u$ approaching the maximum value one, $PPS^{hybrid}_{u,r}(d_i)$ is mainly derived from $PPS^{cf}_{u,r}(d_i)$. The upward curve is slightly convex. That is to say, the value of the weight increases rapidly for smaller percentages of browsed articles, while the curve tends to be flat for larger percentages of browsed articles. We consider that user preference appears significant at the beginning of browsing behavior.

So far, we have generated the predictive preference on articles within clusters. To select the recommended articles from different clusters, we have to consider the priority (ranking) of topic clusters according to the customized popularity degree. In Section 6.3, we derived the customized predictive popularity degree, $CPPD_{u,r}$ (WCPPD and HCPPD), to denote user u's personalized ranking of topic clusters. We derive the final predicted preference score of user u on article $d_i$, $PPS^{rec}_u(d_i)$, by further applying $CPPD_{u,r}$ to adjust user u's latent interest for articles across topic clusters, as shown in Eq. (25). The articles with the top-N predicted preference scores are selected for recommendation:

$$ PPS^{rec}_u(d_i) = PPS^{hybrid}_{u,r}(d_i) \times CPPD_{u,r}. \tag{25} $$
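Eqs. (23) to (25) and the final top-N selection can be sketched in Python as follows. The function names and the tuple layout of the candidates are hypothetical, and the number of browsed articles is assumed not to exceed the number of pushed articles, so that the weight of Eq. (24) stays between zero and one.

```python
import math

def tau(n_browsed, n_pushed):
    """Weight of the collaborative filtering score, Eq. (24)."""
    if n_pushed <= 0:
        return 0.0                       # assumption: nothing pushed, rely on attention
    return math.log2(n_browsed / n_pushed + 1)

def hybrid_score(pps_cf, attention, n_browsed, n_pushed):
    """Hybrid predictive preference score, Eq. (23)."""
    t = tau(n_browsed, n_pushed)
    return t * pps_cf + (1 - t) * attention

def recommend(candidates, n_browsed, n_pushed, top_n=10):
    """Rank articles by the final score of Eq. (25) and keep the top-N.

    candidates: list of (article_id, pps_cf, attention, cppd) tuples, where cppd
    is the customized popularity degree of the article's topic cluster.
    """
    scored = [(hybrid_score(cf, att, n_browsed, n_pushed) * cppd, art)
              for art, cf, att, cppd in candidates]
    scored.sort(reverse=True)
    return [art for _, art in scored[:top_n]]
```

The default of ten articles mirrors the limit on pushed titles mentioned in the text; for a user without browsing records the ranking reduces to attention degree multiplied by the customized popularity degree.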

After the processing above, the selected articles are transformed into XHTML format for mobile devices and then pushed to the handsets via WAP. The system will push no more than ten titles of articles (see Fig. 9), due to the limitation of mobile devices and the short user browsing time compared with that of PC users. Then, users can click the titles in which they are interested to view the full contents.

7. System architecture

This research was conducted in collaboration with CAMEO InfoTech Inc., provider of the WAP Push service for mobile phone users of Chunghwa Telecom (CHT), the biggest telecom company in Taiwan. We are implementing an m-CCS system of the proposed mobile service based on the CHT mobile customer database stored in a MySQL database; it was developed using the Python programming language. The operating web GUI uses the Django web framework (see Fig. 10).

The system adopts two IBM 1U servers to process the load-balanced computing and provide the browsing service for mobile phones. The system adopts the WAP (Wireless Application Protocol) push service, in which an SMS (Short Message Service) message contains a link to a WAP page. Users can access the WAP content upon receiving a WAP Push message on compatible mobile handsets. On the mobile carrier site, the WAP Gateway is built in the machine room of the system operator. With the WAP Gateway, the system can reduce the traffic of wireless transmissions by encoding the mobile WAP page which contains the message and URL. The system then transforms the WAP Push message to the SMS format of GSM and dispatches the message to the mobile phone through the SMSC, which is a device belonging to the system operator. Thus, the titles and URL links of articles can be shown on the mobile phone. The implementation of the m-CCS system is targeted at thousands of real users in practice. Therefore, the system must overcome the issues of efficiency and scalability.

[Fig. 10: system architecture. The diagram shows blog feeds in ATOM 1.0, RSS 1.0, and RSS 2.0 formats; a firewall; the m-CCS on PC servers with the Django web framework, the Python runtime, and the MySQL database; and the mobile carrier site with the WAP gateway and SMS center serving the mobile phones.]

We not only adopted the load-balancing architecture, but also carefully chose the algorithm and caching technology, in order to apply the system in a real business environment.

8. Applications and experimental evaluation

In this section, we evaluate the effectiveness of our proposed time-sensitive popularity tracking module and personalized recommendation service in Sections 8.1 and 8.2, respectively.

8.1. The evaluation of time-sensitive popularity tracking

In this section, we evaluate the performance of time-sensitive popularity tracking by comparing the difference between predicted popularity and the actual popularity of topic clusters.

8.1.1. Data sets and experimental design

In our research, we processed the latest data from the Internet every day. System robots automatically crawl the net for the newest blog articles according to the co-RSS feeds in real time. Since RSS is a well-structured format, it is easy to detect new posts. When there is a new post, the system will trigger the process of capturing articles. However, the RSS usually contains only partial information on the articles. In order to get the whole content, m-CCS needs to capture the primitive HTML through the URL of the blog. Furthermore, we need to parse the HTML to get the article title, content, and publish time from a variety of websites. Finally, the well-structured data are stored in the database.

The total number of newly published articles collected from co-RSS feeds is around two thousand daily. To conduct the popularity prediction, it is necessary to capture the daily click times of captured articles within the time window from the Internet. We chose the blog sites providing information about click times to conduct our evaluation. Accordingly, four popular blog sites in Taiwan, including Wretch (http://www.wretch.cc), Pixnet (http://www.pixnet.net), Mobile01 (http://www.mobile01.com), and Mypaper on PChome (http://mypaper.pchome.com.tw), were selected to conduct our experiments. There are around 150 new articles published daily on the blogs of these four blog sites and subscribed to by co-RSS.

The time window is set as seven days. Articles published within the time window were processed to predict the popularity degree of topic clusters. About one thousand articles were chosen for analysis. The data set with click times of articles was collected from blog websites during the two-week period starting from the 10th of May 2009. In the topic clustering phase, we set a threshold value of 0.002 as the condition to stop the grouping of articles.

To evaluate the prediction model, the mean absolute error (MAE) [5] is used as the evaluation metric. As shown in Eq. (26), the MAE is calculated as the average absolute deviation between the predicted result and the actual result at a particular time t, where $S_t$ is the topic cluster set which was derived at time t, and $|S_t|$ denotes the number of topic clusters. The larger the MAE, the greater the error in the prediction model; thus, a model which presents a lower MAE can be regarded as a better model:

$$ MAE_t = \frac{\sum_{r \in S_t} |PPD_r - APD_r|}{|S_t|}. \tag{26} $$

8.1.2. Evaluation result

The experiment was conducted for two weeks. During the period from the 10th of May to the 24th of May 2009, the $PPD_r$ and $APD_r$ of each topic cluster were derived every two days, and then we calculated the value of MAE to examine the quality of the prediction model. We expect that the error rate of prediction decreases, i.e., the predictive popularity degree of topic clusters is improved, as time evolves.
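The MAE of Eq. (26) used in this evaluation can be sketched as a small Python function. The function name and the dict inputs keyed by cluster identifier are assumptions made for illustration.

```python
def mean_absolute_error(predicted, actual):
    """MAE between predicted and actual popularity degrees at one time t, Eq. (26).

    predicted, actual: {cluster_id: popularity degree} over the same topic clusters.
    """
    if not predicted:
        raise ValueError("no topic clusters to evaluate")
    return sum(abs(predicted[r] - actual[r]) for r in predicted) / len(predicted)
```

For two clusters with predicted degrees of 10 and 30 and actual degrees of 20 and 25, the MAE is (10 + 5) / 2 = 7.5.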

This section presents the experimental result of prediction models based on different weight settings of parameters. As

mentioned in Section4, the parameter

a

used in Eq.(10)is used to determine the relative importance of the actual popularity

(16)

degree and the predictive popularity degree with the trend effect in the preceding period. Parameter d used in Eq.(11)is used to adjust the relative importance of the difference of the predictive popularity degrees at time t and t 1, and the trend effect at time t 1.

To determine the sensitivity of the weighting between the actual popularity degree and the variation trend in the preceding period, we varied the value of α from 0.0 to 1.0 in increments of 0.1, with d set to its default value of 0.5. Fig. 11 presents the average MAE over seven prediction periods under the various values of α. The prediction model has the lowest MAE at α = 0.7; this means that the popularity degree of a topic cluster is predicted more accurately when the system puts more weight on the preceding actual popularity degree.

To examine whether the value of d affects the MAE, we varied d under two fixed values of α: 0.5 and 0.7. The average MAEs are plotted in Fig. 12. The results show that d has no significant effect on the prediction error (MAE). In general, the best prediction accuracy is achieved at d = 0.8, which implies that the difference between successive predictive popularity degrees, derived at time t and time t − 1, respectively, has an important impact on deriving the trend effect at time t + 1. Moreover, the prediction accuracy at α = 0.7 is better than that at α = 0.5. Based on these findings, the best parameter settings for predicting the popularity degree of a topic cluster are α = 0.7 and d = 0.8, and these settings are used in the rest of our experiments.
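The two-step parameter selection described above can be sketched as follows. The helper takes a caller-supplied function that returns the average MAE for a given pair of weights, so no data or results are assumed; `best_setting`, `GRID`, `ALPHA_SWEEP` and `D_SWEEP` are hypothetical names, and the grid used for d is an assumption because the text does not state its step size.

```python
from typing import Callable, Iterable, Tuple

Setting = Tuple[float, float]  # (alpha, d)


def best_setting(average_mae: Callable[[float, float], float],
                 candidates: Iterable[Setting]) -> Setting:
    """Return the (alpha, d) pair with the lowest average MAE."""
    return min(candidates, key=lambda s: average_mae(s[0], s[1]))


# 0.0, 0.1, ..., 1.0; rounding avoids values such as 0.30000000000000004.
GRID = [round(0.1 * i, 1) for i in range(11)]

# Step 1: vary alpha with d fixed at its default value of 0.5.
ALPHA_SWEEP = [(alpha, 0.5) for alpha in GRID]

# Step 2: vary d under the two fixed values of alpha (grid for d is assumed).
D_SWEEP = [(alpha, d) for alpha in (0.5, 0.7) for d in GRID]
```

In practice `average_mae` would run the prediction model over the seven prediction periods and average the resulting MAE values.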

8.2. Evaluation of recommending blog articles

In this section, we conduct on-line experiments to evaluate our proposed approach in an online business environment. The experiments are conducted in collaboration with CAMEO InfoTech Inc., the provider of the WAP Push service for mobile phone users of Chunghwa Telecom in Taiwan.

8.2.1. Data sets

Mobile users can use the blog service provided by Chunghwa Telecom in a free 30-day trial; if they are satisfied, they can then become formal paid subscribers and continue to use the service. Prior to the implementation of our proposed m-CCS system, all blog articles were selected by human experts and then sent to all customers without considering users' personalized preferences. Currently, there are 18,136 users in the trial period of the system, and 4967 formal paid subscribers. We select only formal paid users for evaluation, since free-trial users may stop using the service when their trial period ends.

Within the last month, 4104 articles were published on the four Internet blog websites mentioned above. Each mobile user, on average, browsed only 27.93 articles, i.e., 0.68% of the articles published on the Internet. According to this observation, the number of articles browsed by mobile users is lower than that of Internet users because of the limitations of the mobile environment. Therefore, it is important to increase the click rates of mobile users by recommending the latest and most interesting articles to them.

Fig. 12. The average MAE under different values of d (α = 0.5; α = 0.7).

We randomly selected 300 formal paid customers whose historical records showed more than ten clicks within the latest month as testing users for the experiment. Among them, the highest click count was 257 in one month; the lowest, 12. Fig. 13 illustrates the distribution of click times collected from the historical records of the testing users. About 50% of the testing users browsed blog articles 10 to 20 times within one month, i.e., 3 to 5 times per week on average on a mobile phone. About 25% of the testing users browsed articles 20 to 30 times within one month.

8.2.2. Design of the experiments

The item-based CF method was adopted to predict the preference scores of articles based on the article similarity analyzed from the browsing logs of mobile users. However, as mentioned previously, most blog articles published on the Internet are infrequently read by mobile users owing to the limitations of the mobile environment. The lack of historical browsing data results in poor performance for traditional recommendation methods, especially collaborative filtering methods. Moreover, in mobile environments, it is important to recommend new articles that have not been read by any mobile user but are attractive to Internet users. The CF methods also suffer from the cold-start problem when recommending new items (articles). To solve this problem, we have proposed a time-sensitive popularity-tracking module to predict the emerging trend of topic popularity in which most mobile users will be interested. Moreover, a customized approach is further developed to predict the customized predictive popularity of topics, and it is integrated with item-based CF for personalized recommendation of blog articles. With this approach, mobile users can receive the latest hot-topic articles on their mobile devices in a timely manner, any time and anywhere.
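To make the cold-start issue concrete, the following is a minimal sketch of a generic item-based CF prediction using cosine similarity over a browsing log. It illustrates the general technique only and is not the paper's own scoring formula; the log layout (user mapped to per-article click counts) and all names are assumptions of the example.

```python
import math
from typing import Dict

Log = Dict[str, Dict[str, float]]  # user -> {article -> click count}


def cosine_similarity(log: Log, a: str, b: str) -> float:
    """Cosine similarity of two articles over the users' click vectors."""
    dot = sum(c.get(a, 0.0) * c.get(b, 0.0) for c in log.values())
    norm_a = math.sqrt(sum(c.get(a, 0.0) ** 2 for c in log.values()))
    norm_b = math.sqrt(sum(c.get(b, 0.0) ** 2 for c in log.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_score(log: Log, user: str, article: str) -> float:
    """Similarity-weighted average of the user's clicks on other articles."""
    seen = log.get(user, {})
    sims = {j: cosine_similarity(log, article, j) for j in seen if j != article}
    total = sum(sims.values())
    if total == 0.0:
        # Cold start: the article has no clicks in the log, so no
        # browsed article is similar to it and CF cannot score it.
        return 0.0
    return sum(sims[j] * seen[j] for j in sims) / total
```

An article that no mobile user has clicked always receives a score of 0.0 here, which is the gap that the popularity-based component is meant to fill.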

There are several factors which affect the quality of the recommendations. They include the personalized degree of the system, the predictive popularity degree of topic cluster, and the recommendation approach. Through the experiments, we will discuss the issues listed below.

Does the system method’s customized recommendation based on personal preferences of mobile users perform better than the expert method with human selection of articles?

Does the method with customized predictive popularity degree of topic cluster perform better than the non-customized one?

What is the effect of the different approaches to deriving the customized predictive popularity degree of topic cluster?

To experimentally verify the effectiveness of our proposed methods, we compared different recommendation approaches. The expert method relied on human experts to select articles and then pushed the identical articles to all customers without considering mobile users' personal preferences. The system methods analyzed user preferences and then pushed the customized articles to mobile users automatically. The system methods include the non-CPPD method, the weighted-CPPD method and the harmonic-CPPD method.

The non-CPPD method uses a formula similar to Eq. (23) to predict the preference score of an article by combining the predictive preference score derived from collaborative filtering and the attention degree of the article, without considering CPPD. The non-CPPD method can be regarded as an enhancement of the conventional CF method that considers the attention degrees of articles. As mentioned in Section 6.3, there are two approaches to derive CPPD: the weighted method and the harmonic mean method. Both the weighted-CPPD and harmonic-CPPD methods use Eq. (25) to derive the final predictive preference score by considering the preference score derived from collaborative filtering, the attention degree of the article and the customized predictive popularity degree of topic clusters.
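The exact definitions of the weighted and harmonic mean methods are given in Section 6.3 and are not reproduced here. The sketch below only illustrates the generic difference between a weighted average and a harmonic mean of two non-negative quantities, assuming that CPPD combines a user's interest in a topic cluster with the cluster's predictive popularity degree; the function names, the argument names and the weight `w` are hypothetical.

```python
def weighted_cppd(interest: float, popularity: float, w: float = 0.5) -> float:
    """Weighted average of topic interest and predicted topic popularity."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * interest + (1.0 - w) * popularity


def harmonic_cppd(interest: float, popularity: float) -> float:
    """Harmonic mean: high only when both inputs are high."""
    if interest + popularity == 0.0:
        return 0.0  # both inputs are zero; avoid division by zero
    return 2.0 * interest * popularity / (interest + popularity)
```

For a topic that a user likes strongly but that is currently unpopular, the weighted average stays moderate while the harmonic mean drops sharply, so the harmonic variant favors topics that are both popular and of personal interest.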

The impacts of those methods on recommendation effectiveness are investigated in this experiment. We compared different recommendation models and evaluated their recommendation quality in Section 8.2.3. Note that the parameters α and d were determined experimentally in the previous experiment and were set to α = 0.7 and d = 0.8.

Moreover, the m-CCS system not only recommends articles matching the personal interests of mobile users according to the analysis of their behavior logs but also recommends the latest new articles which have not been browsed by any mobile user. The new articles were selected based on the time-sensitive popularity prediction and the attention degree (click times) of Internet users. Section 8.2.4 presents the experimental results comparing the effect of the system methods on recommending new articles. We demonstrate that recommending new articles based on the customized predictive popularity degree (CPPD) of topic clusters is more effective than the method that does not consider CPPD.

8.2.3. Comparing different recommendation methods

We conducted on-line experiments by recommending blog articles to the 300 testing users selected as described in Section 8.2.1. The recommendations were pushed to the testing users in a real on-line business environment with the cooperation of CAMEO InfoTech Inc. To avoid disturbing customers, we could not send recommended articles to users every day; instead, recommendations were sent three times a week. In other words, the system pushed blog articles about once every two days on average, and only ten articles were pushed to a user each time because of the small screen of the mobile device. Moreover, the cooperating company cautiously agreed to a limited scope of on-line experiments to avoid disturbing and losing customers. Due to such limitations on conducting on-line experiments in a real mobile