Chapter 6 Community Evolution Prediction
6.2 Experiment on DBLP Dataset
In this section, we show the prediction results on the DBLP dataset. Applying the
detection algorithm of Chapter 5 to this dataset yields 83301 communities and 32038
evolution chains. As before, we remove the communities that survive no more than two
timeslots. Before building the prediction models, we sample the training and testing
sets evenly so as to get an unbiased model. We train SVM models on 80% of the data and
test them on the remaining 20%. We first show the result of predicting 6 evolution
types in Table 6-7.
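The sampling and splitting step described above can be sketched as follows. This is only a minimal illustration, not the procedure actually used in the experiments: it assumes that "sampling evenly" means undersampling every evolution type to the size of the rarest one, and the function name `balanced_split`, the fixed seed, and the toy label counts are hypothetical. Training the SVM itself is left out.

```python
import random
from collections import defaultdict


def balanced_split(samples, labels, train_ratio=0.8, seed=0):
    """Undersample every class to the rarest class size, then split train/test.

    Assumption: "sampling evenly" means equal class counts; the exact
    procedure is not spelled out in the text.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    per_class = min(len(items) for items in by_class.values())
    n_train = int(per_class * train_ratio)

    train, test = [], []
    for label, items in sorted(by_class.items()):
        chosen = rng.sample(items, per_class)  # undersample without replacement
        train += [(x, label) for x in chosen[:n_train]]
        test += [(x, label) for x in chosen[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data with made-up counts: 50 "Merge", 30 "Split", 20 "Death" communities.
labels = ["Merge"] * 50 + ["Split"] * 30 + ["Death"] * 20
samples = list(range(len(labels)))
train, test = balanced_split(samples, labels)
print(len(train), len(test))
```

Splitting inside each class keeps both the training and the testing set balanced, so a classifier cannot reach a high accuracy simply by always predicting the most frequent type.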
Table 6-7 Predicting 6 evolution types in DBLP
(Rows: Prediction; columns: Ground Truth, with types Continue, Death, Growth, Merge, Shrink, Split)
The prediction accuracy is 57.36%, which is worse than the result of predicting
6 evolution types on the Facebook dataset. The DBLP dataset contains more nodes and more
edges than the Facebook dataset, so it may be more complex, which makes the prediction
more difficult. However, 57.36% is still much better than random prediction, which would
be correct only about 16.7% of the time if the six types are evenly represented. We next
show the results of predicting live or not, grow or not, and merge or not in the
following tables.
Table 6-8 Predicting live or not in DBLP
Recall 95.04% 68.30%
Table 6-9 Predicting grow or not in DBLP
Table 6-10 Predicting merge or not in DBLP
The accuracy of predicting live or not in DBLP is 81.36%, which is slightly better
than the result on Facebook. The accuracies of predicting grow or not and merge or not
are 73.22% and 83.67%, both worse than the corresponding results on Facebook. This is
the same pattern as in the prediction of 6 evolution types. A likely reason is again
that the DBLP dataset, with more nodes and more edges, is more complex and therefore
more difficult to predict.
Note that predicting merge or not always gives the best result, on both Facebook and
DBLP. “Merge” and “Split” are distinctive types that are easy to recognize and to
detect: given a community evolution chain, we can easily tell whether a community’s
type is “Merge” or not. We believe this explains why predicting whether a community
will merge works so well.
Again, we list the training accuracy and testing accuracy of the predictions in DBLP to
check whether our models overfit. Table 6-11 shows that the training accuracy never
exceeds the testing accuracy by more than 3 percentage points, so there is no sign of
overfitting in our models.
Table 6-11 Testing accuracy and training accuracy of prediction in DBLP
                            Testing Accuracy   Training Accuracy
Predict 6 evolution types   57.36%             60.26%
Predict live or not         81.36%             74.91%
Predict grow or not         73.22%             74.95%
Predict merge or not        83.67%             85.97%
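The overfitting check amounts to comparing the two columns of Table 6-11: an overfitted model would show a training accuracy far above its testing accuracy. The short sketch below computes that gap for each task from the table values; the function name `train_test_gaps` is hypothetical, and the sketch is only an illustration of the comparison, not part of the original experiment code.

```python
# Accuracies in percent, copied from Table 6-11 as (testing, training).
table_6_11 = {
    "6 evolution types": (57.36, 60.26),
    "live or not": (81.36, 74.91),
    "grow or not": (73.22, 74.95),
    "merge or not": (83.67, 85.97),
}


def train_test_gaps(table):
    """Return training minus testing accuracy per task, in percentage points."""
    return {task: round(train - test, 2) for task, (test, train) in table.items()}


gaps = train_test_gaps(table_6_11)
for task, gap in gaps.items():
    print(f"{task}: {gap:+.2f}")
```

The largest positive gap is 2.90 points (6 evolution types). For live or not the gap is negative, that is, the testing accuracy is 6.45 points above the training accuracy.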
Next, we examine the importance of the features. From the Facebook dataset we know
that the features of evolution chains are very important in the models, and we want to
know whether the same holds for the DBLP dataset. Following the same process as in the
experiment on the Facebook dataset, we build a new model with random forest and plot
the importance of the features. Figure 6-5 shows the feature importance in predicting 6
evolution types in DBLP.
Figure 6-5 Features importance in predicting 6 evolution types in DBLP
The top 7 important features are feature 300, feature 306, feature 305, feature 303,
feature 301, feature 304, and feature 302. All of them are features of evolution
chains, which shows that the features of evolution chains do help the prediction. The
feature importance of the other prediction models is plotted in Figure 6-6, Figure 6-7,
and Figure 6-8.
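The check made above, that the highest-ranked features all belong to the evolution chains, can be sketched as follows. The importance scores here are made up so that the ranking reproduces the order reported for Figure 6-5; they are not the values from the random forest. The sketch also assumes that the chain features are exactly features 300 to 306, which the text suggests but does not state, and the helper `top_k_features` is hypothetical.

```python
def top_k_features(importances, k=7):
    """Return the indices of the k largest importance scores, best first."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return ranked[:k]


# Assumption: features 300..306 are the evolution-chain features.
CHAIN_FEATURES = set(range(300, 307))

# Made-up importance scores for 307 features, for illustration only.
importances = [0.001] * 307
for rank, idx in enumerate([300, 306, 305, 303, 301, 304, 302]):
    importances[idx] = 0.10 - 0.01 * rank

top7 = top_k_features(importances)
print(top7)
print(all(i in CHAIN_FEATURES for i in top7))
```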
Figure 6-6 Features importance in predicting live or not in DBLP
Figure 6-7 Features importance in predicting grow or not in DBLP
Figure 6-8 Features importance in predicting merge or not in DBLP
In Figure 6-6, Figure 6-7, and Figure 6-8, we can see that most of the top 7 important
features are still features of evolution chains. Interestingly, the two most important
features in predicting 6 evolution types are the same in DBLP and in Facebook: feature
300 and feature 306, in the same order. The same holds for predicting live or not in
DBLP and Facebook.
Feature 306 is the most important feature in predicting live or not, which matches
its definition: as mentioned earlier, feature 306 describes how long an evolution chain
has lived. The results in Figure 6-2 and Figure 6-6 indicate that how long an evolution
chain has lived is highly correlated with whether a community in this chain will still
be alive in the next timeslot. This observation makes sense: if an evolution chain has
already lived for many timeslots, we can guess that it will stay alive.
In predicting grow or not, feature 305 is the most important feature in Facebook and
the second most important feature in DBLP. However, in DBLP feature 305 and feature 303
have almost the same importance, so feature 305 can be regarded as tied for the most
important feature in DBLP as well. As a result, feature 305 is effectively the most
important feature in both the DBLP and Facebook datasets when predicting whether a
community will grow. These observations suggest that the prediction models are similar
in some respects regardless of the dataset. What makes a community live, grow, or
evolve in other ways may be partly independent of the dataset.
To further understand how the features of evolution chains help us predict community
evolution, we compare the original result on the DBLP dataset with the result of a
model trained without these features and the result of a model trained with only these
features. The comparison is shown in Table 6-12.
Table 6-12 Comparison of the prediction accuracy with or without chains’ features
                       With chains’ features   Without chains’ features   Only chains’ features
Predict merge or not   83.67%                  81.74%                     85.39%
The accuracy of the prediction models without chains’ features is worse than that of
the models with chains’ features in every case. The differences are not large because
the features described in Section 3.6.2 already contain some information about the
evolution chain, though not very much. The model with only evolution chains’ features
is only slightly worse than the original model, and in some cases it is even better. A
possible reason is that we did not put much effort into tuning the parameters of the
SVM. Nevertheless, the results confirm that the features of evolution chains do help us
predict community evolution.
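The three settings compared above differ only in which feature columns are given to the model. A minimal sketch of how such feature subsets could be built is shown below. It assumes that the chain features are columns 300 to 306 of a 307-column feature matrix; the helper names `select_columns` and `ablation_views` and the toy matrix are hypothetical, and the model training is omitted.

```python
# Assumption: columns 300..306 hold the evolution-chain features.
CHAIN_FEATURES = list(range(300, 307))


def select_columns(rows, keep):
    """Keep only the listed column indices of every feature row."""
    return [[row[i] for i in keep] for row in rows]


def ablation_views(rows, chain=CHAIN_FEATURES):
    """Build the three feature sets: all, without chain, only chain features."""
    chain_set = set(chain)
    others = [i for i in range(len(rows[0])) if i not in chain_set]
    return {
        "with chains' features": rows,
        "without chains' features": select_columns(rows, others),
        "only chains' features": select_columns(rows, list(chain)),
    }


# Toy feature matrix: 3 communities with 307 made-up feature values each.
rows = [[float(i + j) for i in range(307)] for j in range(3)]
views = ablation_views(rows)
for name, view in views.items():
    print(name, len(view[0]))
```

Training the same classifier on each of the three views and comparing the testing accuracies gives a comparison of the kind reported in Table 6-12.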
From all the results above, we conclude that the evolution of communities is
predictable, since all the predictions are much better than random on both the Facebook
and DBLP datasets. Additionally, the evolution chains detected by the Long-term
Evolution Method help the prediction, and we obtain good prediction performance. Next,
we carry out further predictions to show that our evolution detection algorithm and our
definition of evolution types lead to better prediction than another algorithm.
6.3 Compare Prediction Result with SGCI
We compared our evolution detection result with the SGCI algorithm in Chapter 5, so we
are also interested in evolution prediction with SGCI, and in whether the Long-term
Evolution Method and our defined types are better than SGCI. On the DBLP dataset, we
replace the evolution types we detected with the SGCI types and use them as the new
prediction target. We use the same features selected in Section 3.6 but eliminate the
evolution chains’ features: because the features of evolution chains may not fit the
SGCI algorithm, we keep only general features such as the size or density of a
community. To make a fair comparison, we compare the SGCI result with our prediction
result trained without evolution chains’ features. The other steps are the same as in
Section 5.1. The result is shown in Table 6-13.
Table 6-13 Predicting SGCI types in DBLP (rows: prediction; columns: ground truth)
              Addition   Change size   Constancy   Decay   Deletion   Merge   Split
Addition      71         23            7           19      0          32      18
Change size   5          21            12          7       0          17      10
The accuracy of predicting 7 SGCI types in DBLP is 49.33%, which is worse than our
result of predicting 6 types in DBLP. This alone does not tell which detection is
better, because predicting 7 types should be more difficult than predicting 6 types.
So, we go further and compare other predictions: predicting whether a community will
live and whether a community will merge. With the SGCI types, a community will live if
its evolution type is not “Decay”, and a community will merge if its evolution type is
“Merge”. The results are in Table 6-14 and Table 6-15.
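The two binary targets derived from the SGCI types follow directly from the rule stated above. The sketch below applies that rule literally; the list of seven type names is taken from the labels of Table 6-13, and the function names and the example sequence are hypothetical.

```python
# SGCI type names as they appear in Table 6-13.
SGCI_TYPES = ["Addition", "Change Size", "Constancy", "Decay",
              "Deletion", "Merge", "Split"]


def will_live(sgci_type):
    """Rule from the text: a community lives unless its type is "Decay"."""
    return sgci_type != "Decay"


def will_merge(sgci_type):
    """Rule from the text: a community merges only if its type is "Merge"."""
    return sgci_type == "Merge"


def binary_targets(sgci_types):
    """Map each SGCI type to a (live, merge) pair of boolean targets."""
    return [(will_live(t), will_merge(t)) for t in sgci_types]


# Made-up sequence of SGCI types, for illustration only.
example = ["Addition", "Merge", "Decay"]
targets = binary_targets(example)
print(targets)
```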
Table 6-14 Predicting live or not according to SGCI in DBLP
                            Ground Truth
                            False     True      Precision
Prediction Result   False   1388      719       65.88%
                    True    273       980       78.21%
Recall                      83.56%    57.68%
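The precision, recall, and accuracy values can be recomputed from the four counts in Table 6-14. The sketch below does this for a confusion matrix whose rows are predictions and whose columns are ground truth, as in the table; the function name `binary_metrics` is hypothetical and the code is only an illustration of how the figures relate to the counts.

```python
def binary_metrics(matrix):
    """Compute accuracy, per-class precision, and per-class recall.

    matrix[pred][truth] holds counts: rows are predictions and columns are
    ground truth, following the layout of Table 6-14.
    """
    labels = list(matrix)
    total = sum(matrix[p][t] for p in labels for t in labels)
    correct = sum(matrix[c][c] for c in labels)
    precision = {c: matrix[c][c] / sum(matrix[c][t] for t in labels)
                 for c in labels}
    recall = {c: matrix[c][c] / sum(matrix[p][c] for p in labels)
              for c in labels}
    return correct / total, precision, recall


# Counts copied from Table 6-14.
table_6_14 = {
    False: {False: 1388, True: 719},
    True: {False: 273, True: 980},
}

accuracy, precision, recall = binary_metrics(table_6_14)
print(f"accuracy  {accuracy:.2%}")
for label in (False, True):
    print(f"{label}: precision {precision[label]:.2%}, recall {recall[label]:.2%}")
```

The accuracy obtained this way is (1388 + 980) / 3360, which rounds to 70.48%.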
Table 6-15 Predicting merge or not according to SGCI in DBLP
The accuracies of predicting live or not and merge or not according to SGCI types are
70.48% and 61.72%. Table 6-16 compares our prediction results with the prediction
results of SGCI.
Table 6-16 Comparison of the prediction accuracy according to different types
                            According to our types   According to SGCI types
Predicting multiple types   53.03%                   49.33%
Predicting live or not      80.78%                   70.48%
Predicting grow or not      71.44%                   -
Predicting merge or not     81.74%                   61.72%
Note that we do not predict grow or not for SGCI, because a community that will grow
may have the SGCI evolution type “Change Size” or “Addition”, while a community with
type “Change Size” may be either “Growth” or “Shrink” under our type definition; we
therefore skip this prediction. Our accuracy is clearly higher, especially in
predicting merge or not. The worse multi-type result for SGCI may be attributed to the
difficulty of predicting more types than ours. However, the comparisons of live or not
and merge or not are fair, and we get much better accuracy in both. We can therefore
say that our evolution types, which are transferred by our method from the evolution
chains we detected, are more representative, since they give better performance in
evolution prediction.
Chapter 7 Conclusion
In this work, we use the algorithm proposed by Weixu Lin for community evolution
detection and community evolution prediction. We first build a synthetic data
generator based on the Stochastic Block Model to test Lin’s Long-term Evolution
Method. Our synthetic data generator can simulate the core structure of a community
and the noise in community evolution. To judge the algorithm, we suggest using
normalized mutual information as a measure of community evolution chain detection.
Based on the synthetic data and this measure, we verify that the Long-term Evolution
Method is correct, and we analyze the two parameters of the algorithm to learn how to
set the p-core and the threshold for better detection; the analysis indicates that the
choice of core nodes and a lower threshold can improve the detection. With this
knowledge of the Long-term Evolution Method, we run the algorithm on the DBLP and
Facebook datasets. To make the detection result more comprehensible and comparable
with other detection algorithms, we define our own evolution types and propose a
method to transfer the evolution chains detected by the Long-term Evolution Method to
these types. The method can be adjusted by two factors, which makes it flexible. We
provide a case study on the Facebook dataset according to our evolution types and show
that some events that happened to Facebook can be matched with changes in the
evolution types we defined. By analyzing the detected evolution types, we can get more
information out of the datasets. After the detection on the Facebook and DBLP
datasets, we move on to community evolution prediction on these datasets. We use
LIBSVM with R to build classification models. We select 4 kinds of features: features
about communities, features about the difference between previous communities,
features about the evolution chains, and features about the previous evolution types,
giving 307 input features in total. The prediction results are good. From them, we
show that community evolution is predictable and that the evolution chains detected by
the Long-term Evolution Method help a lot in prediction. We also compare the
prediction result with the SGCI algorithm by replacing our evolution types with SGCI
types. The prediction according to our types is much better than that according to
SGCI types, which means that our evolution types are more representative and therefore
more predictable than the SGCI types. The contributions of our work are that we
propose a synthetic data generator with evolution ground truth and core structure,
provide a measure for the accuracy of evolution chains, define our own evolution
types, give a method to transfer evolution chains to our types, provide the features
we selected for community evolution prediction, and finally show that the evolution is
predictable with the help of evolution chains’ features.
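As a reminder of the measure mentioned above, normalized mutual information between two labelings can be computed as in the sketch below. It uses the geometric-mean normalization of Strehl and Ghosh, NMI = I(A;B) / sqrt(H(A) H(B)); whether the evaluation in this work used exactly this normalization, and how communities were mapped to chain labels, are assumptions of the sketch. The function name `nmi` and the toy labelings are hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information with geometric-mean normalization."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
                 for (a, b), c in joint.items())
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())

    if h_a == 0 or h_b == 0:
        # Convention chosen here: two single-cluster labelings count as identical.
        return 1.0 if h_a == h_b else 0.0
    return mutual / math.sqrt(h_a * h_b)


# Toy example: ground-truth chain ids versus detected chain ids.
truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
print(nmi(truth, same_up_to_renaming))
print(nmi(truth, unrelated))
```

The measure does not depend on how the chains are numbered, so a detection that recovers the ground-truth chains under different ids still scores close to 1, while an unrelated assignment scores close to 0.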
Community evolution prediction is still a growing topic, so much work remains to be
done in the future. The main direction is finding ways to get better prediction
results. For example, a Recurrent Neural Network (RNN) may be a useful model: an RNN
can learn how features evolve over time, so it may help us predict evolution. More
generally, deep learning is a useful technique that is starting to be applied in many
areas. For example, Google trained a Go program with deep learning that won 4 of 5
games in its match against Lee Sedol. Deep learning may also help community evolution
prediction, and predicting evolution with deep learning would be a promising direction
for future work. In addition, we select many features about a community to predict the
communities’ types. However, there might be some features that are not as general as
features like the size or density of a community. We can call these features “latent
vectors”. The concept of latent vectors has been used in recommender systems: users
may have latent vectors that affect their ratings, so the latent vectors of users can
be used to predict their ratings. Communities may likewise have their own latent
vectors that affect their evolution. Finding communities’ latent vectors and using
them to predict evolution is interesting future work.
References
[1] Girvan, Michelle, and Mark EJ Newman. "Community structure in social and
biological networks." Proceedings of the national academy of sciences 99.12 (2002):
7821-7826.
[2] Fortunato, Santo. "Community detection in graphs." Physics reports 486.3 (2010):
75-174.
[3] Weiux Lin. "On the evolution of communities in large social networks," MS thesis,
National Taiwan University, 2015.
[4] Holland, Paul W., Kathryn Blackmond Laskey, and Samuel Leinhardt. "Stochastic
blockmodels: First steps." Social networks 5.2 (1983): 109-137.
[5] Karrer, Brian, and Mark EJ Newman. "Stochastic blockmodels and community
[6] Viswanath, B., Mislove, A., Cha, M., & Gummadi, K. P. (2009, August). On the
evolution of user interaction in facebook. In Proceedings of the 2nd ACM workshop on
Online social networks (pp. 37-42). ACM.
[7] DBLP dataset, http://dblp.uni-trier.de/
[8] Gliwa, B., Saganowski, S., Zygmunt, A., Bródka, P., Kazienko, P., & Kozak, J.
(2012, August). Identification of group changes in blogosphere. In Proceedings of the
2012 International Conference on Advances in Social Networks Analysis and Mining
(ASONAM 2012) (pp. 1201-1206). IEEE Computer Society.
[9] Chang, Chih-Chung, and Chih-Jen Lin. "LIBSVM: a library for support vector
machines." ACM Transactions on Intelligent Systems and Technology (TIST) 2.3
(2011): 27.
[10] Takaffoli, Mansoureh, Reihaneh Rabbany, and Osmar R. Zaiane. "Community
evolution prediction in dynamic social networks." Advances in Social Networks
Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on.
IEEE, 2014.
[11] Saganowski, S., Gliwa, B., Bródka, P., Zygmunt, A., Kazienko, P., & Koźlak, J.
(2015). Predicting community evolution in social networks. Entropy, 17(5), 3053-3096.
[12] Strehl, Alexander, and Joydeep Ghosh. "Cluster ensembles---a knowledge reuse
framework for combining multiple partitions." Journal of machine learning research
3.Dec (2002): 583-617.