Chapter 3 Methodology
3.3 Feature Generation
3.3.2 Heterogeneous-Topic Estimating User Behavior
Heterogeneous-Topic Estimating User Behavior (HTB) comprises two parts: using heterogeneous social network information to enhance user-behavior when data are missing, and estimating user-behavior in a novel topic from content information.
First, we assume that nodes sharing the same value of a heterogeneous node type have similar user-behavior, and that the group's aggregated information can supplement an individual's feature if the group behaves consistently. For a user u, the heterogeneous-enhanced user-behavior (HB) of u is defined as follows:
$$HB(u, k, TY) = (1 - \alpha_{TY(u),k}) \cdot B_{u,k} + \alpha_{TY(u),k} \cdot Bmean_{TY(u),k}$$
where TY(u) is the value of node type TY for user u, k is a topic with diffusion records in the training data, $B_{u,k}$ is the user-behavior of u in topic k, $Bmean_{TY(u),k}$ is the average user-behavior in topic k over nodes of type value TY(u), and $\alpha_{TY(u),k}$ is the confidence coefficient of the node type value TY(u) in topic k. The coefficient $\alpha_{TY(u),k}$ is defined as follows:
$$\alpha_{TY(u),k} = 1 - \frac{Bstd_{TY(u),k}}{\max_{ty \in TY}(Bstd_{ty,k})}$$
where $Bstd_{TY(u),k}$ is the user-behavior standard deviation of TY(u) type nodes in topic k. The intuition is that if the distribution of user-behavior in node type value TY(u) is more concentrated, the average user-behavior of TY(u) is more believable.
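As a minimal sketch of the two definitions above for a single topic k, the following Python computes the confidence coefficient and the heterogeneous-enhanced user-behavior. The function name, the dictionary inputs, the use of the population standard deviation, and the handling of the case where every group has zero spread are assumptions made for illustration; they are not specified in the text.

```python
from collections import defaultdict
from statistics import mean, pstdev


def heterogeneous_enhanced_behavior(behavior, node_type):
    """Compute HB(u, k, TY) for every user in one topic k.

    behavior:  {user: B_{u,k}}   user-behavior values in topic k
    node_type: {user: TY(u)}     node type value of each user (e.g. gender)
    """
    # Group the behavior values by node type value.
    groups = defaultdict(list)
    for user, value in behavior.items():
        groups[node_type[user]].append(value)

    group_mean = {ty: mean(values) for ty, values in groups.items()}
    # Assumption: population standard deviation (defined even for one member).
    group_std = {ty: pstdev(values) for ty, values in groups.items()}
    max_std = max(group_std.values())

    enhanced = {}
    for user, value in behavior.items():
        ty = node_type[user]
        if max_std > 0:
            alpha = 1.0 - group_std[ty] / max_std
        else:
            # Assumption: if no group has any spread, trust the group mean fully.
            alpha = 1.0
        enhanced[user] = (1.0 - alpha) * value + alpha * group_mean[ty]
    return enhanced
```

Under these assumptions, the group with the largest spread receives a coefficient of 0, so its members keep their own behavior values unchanged, while members of tighter groups are pulled toward the group mean.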
Next, to estimate user-behavior in a novel topic $k_{new}$, we relate the existing topics to the novel topic through topic similarity. The main idea is to place more trust in the records of similar existing topics and less in those of irrelevant topics. For instance, to estimate a user's behavior in the novel topic “iPhone 6”, the user's records for “iPhone 5” and “iPhone 4” are more believable than the records for “Tsunami”. To do so, we first obtain the topic-hidden (TH) distribution by Latent Dirichlet Allocation (LDA) [9]. Second, we calculate the similarity on TH between each existing topic and the novel topic, using cosine similarity as the similarity score. Finally, to estimate the user-behavior in the novel topic, we aggregate the heterogeneous-enhanced user-behavior over the existing topics, weighting each by its topic similarity. The whole process can be summarized by the following formula:
$$HTB(u, k_{new}, TY) = \sum_{k} tsim(k_{new}, k) \cdot HB(u, k, TY)$$
where $tsim(k_{new}, k)$ is the topic similarity between $k_{new}$ and k in the topic-hidden distribution.
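The similarity-weighted aggregation can be sketched as follows for one user. This is an illustration only: the function names and input shapes are hypothetical, the topic-hidden vectors are assumed to be already produced by LDA, and the sum is left unnormalized because the formula as written does not normalize it.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two topic-hidden vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def estimate_htb(hb_by_topic, topic_hidden, new_topic_hidden):
    """Estimate HTB(u, k_new, TY) for one user.

    hb_by_topic:      {topic k: HB(u, k, TY)} for the existing topics
    topic_hidden:     {topic k: topic-hidden vector of k}
    new_topic_hidden: topic-hidden vector of the novel topic k_new
    """
    return sum(
        cosine_similarity(new_topic_hidden, topic_hidden[k]) * hb
        for k, hb in hb_by_topic.items()
    )
```

With this weighting, an existing topic whose hidden distribution is orthogonal to that of the novel topic contributes nothing to the estimate, which matches the intuition of trusting “iPhone 5” records over “Tsunami” records when the novel topic is “iPhone 6”.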
Chapter 4 Experiment
This chapter describes the experiments used to evaluate the effectiveness of the proposed model and to compare it with baseline models. Our goal is not to defeat the state-of-the-art models outright; rather, the experiment design aims to show that the proposed features can enhance the state-of-the-art models, and that performance in the realistic scenario is best only when the two are combined.
4.1 Plurk Data
4.1.1 Data Preparation
First, the 100 most popular topics (e.g., tsunami) on the Plurk microblog site between 01/2011 and 05/2011 are identified, and the related user posts and responses are crawled. The 100 topics are then manually separated into 7 groups based on semantic meaning: disaster, URL sharing, entertainment, domestic politics, daily life, global politics, and sports.
4.1.2 Diffusion Records
The positive diffusion records are generated based on the post-response behavior.
That is, if a person x posts a message containing one of the selected topics t, and a person y later responds to this message, we consider that a diffusion of t has occurred from x to y (i.e., (x, y, t) is a positive instance).
The dataset contains a total of 699,985 positive instances across 100 distinct topics; the largest and smallest topics contain 117,201 and 1,102 diffusions, respectively.
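The post-response rule can be illustrated with a short sketch. The input structures (`posts` keyed by a post identifier, `responses` as pairs) are hypothetical, and two choices here are assumptions not stated in the text: self-responses are skipped, and repeated responses by the same person to the same author in the same topic are counted once.

```python
def extract_diffusions(posts, responses):
    """Build positive instances (x, y, t) from post-response behavior.

    posts:     {post_id: (author x, topic t)}
    responses: iterable of (post_id, responder y)
    """
    records = set()
    for post_id, responder in responses:
        author, topic = posts[post_id]
        # Assumption: a user responding to their own post is not a diffusion.
        if responder != author:
            records.add((author, responder, topic))
    return records
```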
4.1.3 Heterogeneous Social Network
The underlying social network is also created from the post-response behavior. We assume there is an acquaintance link between x and y if and only if x has responded to y (or vice versa) in at least one topic in the training data. There are about 161,971 nodes and 376,578 edges in the network.
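A minimal sketch of this construction, assuming the training diffusion records are available as (x, y, t) triples: each pair that appears in at least one record becomes one undirected acquaintance link. The function name and the exclusion of self-links are illustrative assumptions.

```python
def build_acquaintance_network(training_records):
    """Return (nodes, edges) from training diffusion records (x, y, t).

    An edge is stored as a frozenset so that (x, y) and (y, x) are the
    same acquaintance link, matching the "or vice versa" condition.
    """
    nodes, edges = set(), set()
    for x, y, _topic in training_records:
        if x == y:
            continue  # assumption: no self-links
        nodes.update((x, y))
        edges.add(frozenset((x, y)))
    return nodes, edges
```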
We collect publicly available Plurk user information as the node types for each user: gender, location, relationship, and default language. Note that these statuses are self-reported; users can choose not to provide them, and a few users' information is missing because their accounts have expired. For the missing data, we add an “unknown” value to each node type. The following table summarizes the four node types.
Node Type               Description                                       Size (including unknown)
Gender (G)              User's gender                                     3
Location (L)            User's current location                           210
Relationship (R)        User's current relationship status with others    11
Default Language (DL)   Default language the user is using                39
Table 4-1 Heterogeneous Node Type
4.2 Corpus Processing
Furthermore, a set of keywords for each topic is required to create the TW and UW matrices for latent topic analysis; we simply extract the content of the posts and responses for each topic to create both matrices. We set the number of hidden categories h = m = 7, which equals the number of topic groups.
We remove stop-words, use SCWS [10] for tokenization, and use MALLET [11] and GibbsLDA++ [12] for LDA.
4.3 Cross Validation & Negative Sampling
We use topic-wise 4-fold cross validation to evaluate our method, because only 100 topics are available. For each group, we select 3/4 of the topics for training and 1/4 for validation.
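One way to realize a topic-wise split that draws from every topic group is sketched below. The function names, the fixed random seed, and the round-robin assignment are assumptions for illustration; the text does not describe how topics were assigned to folds.

```python
import random


def topic_wise_folds(topics_by_group, n_folds=4, seed=0):
    """Assign the topics of each group to folds in round-robin order.

    topics_by_group: {group name: list of topics in that group}
    Returns a list of n_folds lists of topics.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for _group, topics in sorted(topics_by_group.items()):
        shuffled = list(topics)
        rng.shuffle(shuffled)
        for i, topic in enumerate(shuffled):
            folds[i % n_folds].append(topic)
    return folds


def train_validation_splits(folds):
    """Yield (training topics, validation topics), one pair per fold."""
    for i, held_out in enumerate(folds):
        training = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield training, list(held_out)
```

Because the split is by topic rather than by instance, every validation topic is unseen during training, which is what makes it behave like a novel topic in the evaluation.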
We sample as many negative instances as there are positive instances. To sample representative negative instances, we divide the positive instances into three types and propose a sampling method for each.
The positive instances (u,v) can be categorized into three types: “existing edges”, “unseen edges”, and “edges with silent users”. An existing edge is an edge with past diffusion records. If users u and v have not diffused information to each other, but both have posted or responded to others before, we call (u,v) an unseen edge. If one or both of the users were silent in the past, never having posted or responded, we call (u,v) an edge with silent users. The following figure shows the three types of edges.
Figure 2 Edge Type
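The three edge types can be told apart with a simple check, sketched here under stated assumptions: `past_edges` holds the user pairs with past diffusion records as unordered pairs, and `active_users` holds every user who posted or responded in the training data. Both names, and the decision to ignore edge direction, are illustrative choices rather than details given in the text.

```python
def edge_type(u, v, past_edges, active_users):
    """Classify a positive instance (u, v) as existing, unseen, or silent.

    past_edges:   set of frozenset({x, y}) pairs with past diffusion records
    active_users: set of users who posted or responded in the training data
    """
    if frozenset((u, v)) in past_edges:
        return "existing"
    if u in active_users and v in active_users:
        return "unseen"
    # At least one of the two users never posted or responded before.
    return "silent"
```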
The same number of negative instances as positive instances is sampled for each topic (699,985 in total) for binary classification. For each type of edge, so that the negative instances resemble the corresponding positive instances, we introduce a separate negative sampling method.
For existing edges, negative instances of a topic t are sampled randomly from the diffusion records of existing topics. For unseen edges, we sample two non-silent users who are not connected in the network. For edges with silent users, if both users are silent, we randomly sample two other silent users from the network as a negative instance; if one user is silent and the other is not, we sample a silent user and a user with diffusion records as a negative instance.
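The three sampling rules might be sketched as below. This is a simplified illustration, not the exact procedure used in the experiments: the function name and parameters are hypothetical, the sketch does not check that a sampled pair is absent from the positive instances of topic t, it does not exclude the users of the corresponding positive instance, and the ordering of the pair in the one-silent-user case is arbitrary.

```python
import random


def sample_negative(edge_kind, past_edges, active_users, silent_users,
                    rng, both_silent=True, max_tries=1000):
    """Sample one negative pair (u, v) matching the type of a positive instance.

    edge_kind:    "existing", "unseen", or "silent"
    past_edges:   list of (x, y) pairs with diffusion records in existing topics
    active_users: users with diffusion records (not silent)
    silent_users: users who never posted or responded
    rng:          a random.Random instance
    """
    active = sorted(active_users)
    silent = sorted(silent_users)
    if edge_kind == "existing":
        # A pair that has diffused some existing topic.
        return rng.choice(list(past_edges))
    if edge_kind == "unseen":
        connected = {frozenset(edge) for edge in past_edges}
        for _ in range(max_tries):
            u, v = rng.sample(active, 2)
            if frozenset((u, v)) not in connected:
                return (u, v)
        raise RuntimeError("no unconnected pair of active users found")
    if edge_kind == "silent":
        if both_silent:
            u, v = rng.sample(silent, 2)
            return (u, v)
        return (rng.choice(silent), rng.choice(active))
    raise ValueError(f"unknown edge kind: {edge_kind}")
```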
4.4 Baseline
In the following experiments, we compare User Behavior and Heterogeneous-Topic Estimating User Behavior with the data-driven models proposed in [2]: Heterogeneous-Topic Information, User Information, Topic-User Interaction, and Global Features.
[Illustration: Training Data versus Testing Instances (Existing Edge, Unseen Edge, Edge with Silent User)]
4.5 Evaluation Metric
We choose the area under the ROC curve (AUC) [13] as the evaluation metric: testing instances are ranked by their predicted likelihood of being positive, and the ranking is compared with the ground truth to compute the AUC.
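For reference, AUC can be computed directly from the ranking as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below uses this pairwise definition for clarity; it is quadratic in the number of instances, so a rank-based or library implementation would be preferable at the scale of this dataset. The function name and inputs are illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve from binary labels (1 or 0) and scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))
```

A scorer that assigns the same value to every instance yields an AUC of 0.5 under this definition, the same as random ranking.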
4.6 Experiment Result
4.6.1 Single Feature Comparison
Type                  Feature                                        AUC
State of the Art      TG (Topic Signature)                           50.00%
                      DF (Existing Diffusion)                        55.32%
                      NDT (Number of Distinct Topics)                55.97%
                      TS (Topic Similarity)                          57.54%
                      UG (User Signature)                            57.94%
                      UPLC (User Preference to Latent Categories)    59.13%
                      OD (Out-Degree)                                64.44%
                      ID (In-Degree)                                 67.00%
User Behavior         UM (User Message Behavior)                     67.17%
                      UR (User Response Behavior)                    68.31%
Heterogeneous-Topic
Table 4-2 Single Feature Comparison Result
HTB features are named as follows: in HTB_X_Y, X is the heterogeneous node type and Y is the user-behavior type. For example, HTB_G_R uses gender as the node type and the responding action as the user-behavior type.
To begin with, we compare the performance of individual features by evaluating each feature on novel-topic information diffusion prediction.
Both the user-behavior features and the heterogeneous-topic estimating user-behavior features outperform the state-of-the-art features. The best single feature is HTB_G_R: heterogeneous-topic estimating user-behavior with gender as the node type and the responding action as the user-behavior. Compared with the best single baseline feature (ID), HTB_G_R improves AUC by 3.63 percentage points, from 67.00% to 70.63%.
It was shown in [2] that features exploiting the diffusion records of user pairs, such as Existing Diffusion (DF), Topic Similarity (TS), and Number of Distinct Topics (NDT), perform better than the others. In particular, TS was previously the best single feature; however, those features do not perform well in our experiment scenario. In contrast, user-level features, such as HTB and ID, perform better. The reason is that the diffusion records of user pairs may be missing in a real scenario, which degrades the performance of pair-level features.
4.6.2 Feature Combination Comparison
Type                                               Feature Combination     AUC
State-of-the-art features                          Single ID               67.00%
State-of-the-art features                          TH+TS+ID+OD             69.43%
State-of-the-art features + User Behavior + HTB    TS+HTB_L_M+HTB_R_R      72.94%
Table 4-3 Combination Comparison Result
In addition, we compare the performance of the feature combinations shown above. Combining the state-of-the-art features, User Behavior, and HTB, the best combination reaches 72.94%, which outperforms the best state-of-the-art single feature by 5.94 percentage points and the best combination of state-of-the-art features by 3.51 percentage points. This shows that the HTB and User Behavior features can enhance the performance of the features proposed in [2].
Features AUC AUC - ALL
Table 4-4 Leave-One-Out Experiment Result
Moreover, we conduct a leave-one-out experiment on the best combination (TS+HTB_L_M+HTB_R_R), which comprises state-of-the-art features and new features. The result shows that the combination has no redundancy: overall performance decreases if any one of the features is removed. The result also shows the importance of the HTB features. If HTB_L_M or HTB_R_R is removed, performance decreases by 4.89 and 3.32 percentage points, respectively; if TS is removed, performance decreases by only 0.02 percentage points.
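The leave-one-out procedure itself is simple to express. In the sketch below, `evaluate` is a hypothetical callable that trains and scores a model on a given feature list and returns its AUC; the function reports, for each feature, the score without that feature minus the score with all features (the "AUC - ALL" quantity), so a negative value means the feature was contributing.

```python
def leave_one_out(features, evaluate):
    """Return {feature: evaluate(all but feature) - evaluate(all features)}.

    features: list of feature names in the combination
    evaluate: hypothetical callable mapping a feature list to a score (AUC)
    """
    full_score = evaluate(list(features))
    deltas = {}
    for removed in features:
        remaining = [f for f in features if f != removed]
        deltas[removed] = evaluate(remaining) - full_score
    return deltas
```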
Chapter 5 Conclusion
Predicting information diffusion in a novel topic is challenging. First, there are no diffusion records of novel topics for inference. Second, [2] has shown that the useful features for novel-topic diffusion prediction are usually based on past diffusion records; in the real world, however, the node pairs to be predicted may have no such records. Furthermore, some users may be silent in the collected data, and their limited data make their behavior hard to predict.
We propose User Behavior and Heterogeneous-Topic Estimating User Behavior (HTB) to address the above challenges. HTB is a set of user-level features that uses a topic model to address the lack of records for novel topics and uses heterogeneous node type information to estimate the tendency of silent users in existing topics. In addition, we shift the experiment to a more realistic scenario by removing an assumption, and the experiment results show that User Behavior and HTB help the state-of-the-art model improve by 3.51 percentage points in the new scenario.
Chapter 6 Future Work
6.1 Time-Sensitive Model
To deal with the lack of diffusion records for novel topics, we focus on using content information to connect existing topics with novel topics, and the experiment setting is topic-wise; however, the model could be more complete if time information were considered.
We suggest two aspects that may offer improvement: Topic Recency and Holiday Effect. Topic Recency means that the diffusion records of more recent topics are more believable than those of older ones. For instance, to estimate the diffusion path for the novel topic “HTC Butterfly”, a phone released in 2013, the diffusion records of “HTC J” are more believable than those of “HTC Aria”, since the former was released in 2012 and the latter in 2010. Holiday Effect means that some topics may resonate in specific time periods. For example, technology companies such as Apple and Amazon usually announce their latest products before Christmas, and news about products such as the iPad mini and Kindle Fire may stimulate each other at that time.
Chapter 7 Reference
[1] “Twitter Blog: Twitter turns six,” blog.twitter.com. [Online]. Available: http://blog.twitter.com/2012/03/twitter-turns-six.html. [Accessed: 30-Nov-2012].
[2] T.-T. Kuo, S.-C. Hung, W.-S. Lin, N. Peng, S.-D. Lin, and W.-F. Lin, “Exploiting latent information to predict diffusions of novel topics on social networks,” presented at the ACL '12: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, 2012, vol. 2.
[3] D. Kempe, J. Kleinberg, and É. Tardos, “Maximizing the spread of influence through a social network,” Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146, 2003.
[4] S. Petrovic, M. Osborne, and V. Lavrenko, “RT to Win! Predicting Message Propagation in Twitter,” ICWSM, vol. 13, no. 513435, pp. 586–589, 2011.
[5] J. Zhu, F. Xiong, D. Piao, Y. Liu, and Y. Zhang, “Statistically Modeling the Effectiveness of Disaster Information in Social Media,” presented at the 2011 IEEE Global Humanitarian Technology Conference (GHTC), 2011, pp. 431–436.
[6] W. Galuba, K. Aberer, D. Chakraborty, Z. Despotovic, and W. Kellerer, “Outtweeting the twitterers-predicting information cascades in microblogs,” presented at the Proceedings of the 3rd Workshop on Online Social Networks (WOSN 2010), 2010.
[7] H. Ma, H. Yang, M. R. Lyu, and I. King, “Mining social networks using heat diffusion processes for marketing candidates selection,” presented at the CIKM '08: Proceeding of the 17th ACM conference on Information and knowledge management, 2008.
[8] C. X. Lin, Q. Mei, Y. Jiang, J. Han, and S. Qi, “Inferring the Diffusion and Evolution of Topics in Social Communities,” mind, 2011.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[10] Hightman, “SCWS - 簡易中文分詞系統 - hightman.cn,” hightman.cn. [Online]. Available: http://www.hightman.cn/index.php?scws. [Accessed: 30-Nov-2012].
[11] A. K. McCallum, “MALLET: A Machine Learning for Language Toolkit,” mallet.cs.umass.edu, 30-Nov-2002. [Online]. Available: http://mallet.cs.umass.edu. [Accessed: 30-Nov-2012].
[12] X.-H. Phan and C.-T. Nguyen, “GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA),” gibbslda.sourceforge.net, 30-Nov-2007. [Online]. Available: http://gibbslda.sourceforge.net. [Accessed: 30-Nov-2012].
[13] J. Davis and M. Goadrich, “The relationship between Precision-Recall and ROC curves,” presented at the Proceedings of the 23rd international conference …, 2006.