
Chapter 6 Community Evolution Prediction

6.2 Experiment on DBLP Dataset

In this section, we present the prediction results on the DBLP dataset. Applying the detection algorithm of Chapter 5 to the DBLP dataset yields 83301 communities and 32038 evolution chains. As before, we remove the communities that survive no more than two timeslots. Before building the prediction models, we sample the training and testing sets evenly so as to obtain an unbiased model. We train SVM models on 80% of the data and test them on the remaining 20%. We first show the result of predicting the 6 evolution types in Table 6-7.

Table 6-7 Predicting 6 evolution types in DBLP

                                  Ground Truth
Prediction    Continue  Death  Growth  Merge  Shrink  Split
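The even sampling and the 80%/20% split described at the start of this section could look roughly like the following Python sketch. It assumes that sampling "evenly" means downsampling every class to the size of the rarest class; the function name balanced_split, the seed, and the toy class counts are illustrative and are not taken from our experiments. Training the SVM itself is left out.

```python
import random
from collections import defaultdict


def balanced_split(samples, labels, train_ratio=0.8, seed=0):
    """Downsample every class to the size of the rarest class, then split
    each class into training and testing parts with the given ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    per_class = min(len(items) for items in by_class.values())
    n_train = int(per_class * train_ratio)

    train, test = [], []
    for label, items in sorted(by_class.items()):
        chosen = rng.sample(items, per_class)  # even sampling across classes
        train += [(x, label) for x in chosen[:n_train]]
        test += [(x, label) for x in chosen[n_train:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy data with made-up class counts: 50 "Continue", 20 "Merge", 10 "Split".
labels = ["Continue"] * 50 + ["Merge"] * 20 + ["Split"] * 10
samples = list(range(len(labels)))  # stand-ins for feature vectors
train, test = balanced_split(samples, labels)
print(len(train), len(test))  # 24 6
```

Splitting inside each class keeps the class proportions identical in the training and testing sets, which is one simple way to avoid a model that favors the most frequent evolution type.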

The prediction accuracy is 57.36%, which is lower than the accuracy of predicting the 6 evolution types in the Facebook dataset. The DBLP dataset contains more nodes and more edges than the Facebook dataset, so it may be more complicated and therefore harder to predict. Nevertheless, 57.36% is still much better than random prediction. The following tables show the results of predicting live or not, grow or not, and merge or not.

Table 6-8 Predicting live or not in DBLP

              Ground Truth
Recall        95.04 %    68.30 %

Table 6-9 Predicting grow or not in DBLP

Ground Truth

Table 6-10 Predicting merge or not in DBLP

Ground Truth

The accuracy of predicting live or not in DBLP is 81.36%, which is slightly better than the result in Facebook. The accuracies of predicting grow or not and merge or not are 73.22% and 83.67%, respectively; both are worse than the corresponding results in Facebook. This mirrors the result of predicting the 6 evolution types. A likely reason is that the DBLP dataset has more nodes and more edges, which makes it more complicated and more difficult to predict.

Note that predicting merge or not always gives the best result, in both Facebook and DBLP. “Merge” and “Split” are evolution types that are distinctive and easy to recognize and detect: given a community evolution chain, we can easily tell whether a community's type is “Merge” or not. We believe this explains why predicting whether a community will merge works so well.

Again, we list the training accuracy and testing accuracy of the predictions in DBLP to check whether our models overfit. Table 6-11 shows that the training accuracy exceeds the testing accuracy by at most 2.90 percentage points (and is lower than the testing accuracy for predicting live or not), so there is no overfitting problem in our models.

Table 6-11 Testing accuracy and training accuracy of prediction in DBLP

                             Testing Accuracy    Training Accuracy
Predict 6 evolution types    57.36 %             60.26 %
Predict live or not          81.36 %             74.91 %
Predict growth or not        73.22 %             74.95 %
Predict merge or not         83.67 %             85.97 %
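The overfitting check can be made explicit by computing the gap between training and testing accuracy for each model. The sketch below only re-uses the numbers of Table 6-11; the helper name train_test_gap is ours, and no threshold for "too large a gap" is implied.

```python
# Accuracies (in %) copied from Table 6-11 as (testing, training).
accuracy = {
    "Predict 6 evolution types": (57.36, 60.26),
    "Predict live or not": (81.36, 74.91),
    "Predict growth or not": (73.22, 74.95),
    "Predict merge or not": (83.67, 85.97),
}


def train_test_gap(testing, training):
    """Training minus testing accuracy, in percentage points."""
    return round(training - testing, 2)


gaps = {task: train_test_gap(*pair) for task, pair in accuracy.items()}
for task, gap in gaps.items():
    print(f"{task}: {gap:+.2f}")

largest_gap = max(gaps.values())
print(largest_gap)  # 2.9
```

A large positive gap would indicate that a model fits the training set much better than unseen data; here the largest positive gap is 2.90 points, and the gap for predicting live or not is negative.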

Next, we examine the importance of the features. From the Facebook dataset we know that the features of evolution chains are very important in the models, and we want to know whether the same holds for the DBLP dataset. Following the same process as in the experiment on the Facebook dataset, we use random forest to build a new model and plot the importance of the features. Figure 6-5 shows the feature importance in predicting the 6 evolution types in DBLP.

Figure 6-5 Feature importance in predicting 6 evolution types in DBLP
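To illustrate what a feature importance ranking measures, the following self-contained sketch uses permutation importance with a nearest-centroid classifier. This is a deliberately simpler stand-in and not the random forest importance used for our figures: a feature counts as important if shuffling its column lowers the accuracy. The data are synthetic and all function names are illustrative.

```python
import random


def fit_centroids(X, y):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def accuracy(centroids, X, y):
    """Fraction of rows whose nearest centroid carries the true label."""
    def predict(row):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(centroids, X, y, seed=0):
    """Accuracy drop observed when one feature column is shuffled."""
    rng = random.Random(seed)
    base = accuracy(centroids, X, y)
    drops = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(base - accuracy(centroids, shuffled, y))
    return drops


# Synthetic data: feature 0 separates the two classes, feature 1 is noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[10 * label + rng.uniform(-1, 1), rng.uniform(-1, 1)] for label in y]

importance = permutation_importance(fit_centroids(X, y), X, y)
ranking = sorted(range(len(importance)), key=lambda j: -importance[j])
print(ranking)  # [0, 1]: the informative feature ranks first
```

Sorting the features by this score gives the same kind of ranking as the plots in this section, although the absolute values are not comparable with random forest importances.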

The top 7 most important features are feature 300, feature 306, feature 305, feature 303, feature 301, feature 304, and feature 302. All of them are features of evolution chains, which indicates that the features of evolution chains do help the prediction. The feature importance of the other prediction models is plotted in Figure 6-6, Figure 6-7, and Figure 6-8.

Figure 6-6 Feature importance in predicting live or not in DBLP

Figure 6-7 Feature importance in predicting grow or not in DBLP

Figure 6-8 Feature importance in predicting merge or not in DBLP

In Figure 6-6, Figure 6-7, and Figure 6-8, most of the top 7 most important features are still features of evolution chains. Interestingly, the two most important features in predicting the 6 evolution types are the same in DBLP and in Facebook: feature 300 and feature 306, in the same order. The same holds for predicting live or not in DBLP and Facebook.

Feature 306 is the most important feature in predicting live or not, which matches its definition: as mentioned, feature 306 describes how long an evolution chain has lived. The results in Figure 6-2 and Figure 6-6 show that how long an evolution chain has lived is highly correlated with whether a community in this chain will still be alive in the next timeslot. This observation makes sense: if an evolution chain has lived for many timeslots, we can expect it to stay alive.

In predicting grow or not, feature 305 is the most important feature in Facebook and the second most important feature in DBLP. However, in DBLP feature 305 and feature 303 have almost the same importance, so we can regard feature 305 as also being the most important feature in predicting grow or not in DBLP. Feature 305 is thus effectively the most important feature in both the DBLP and Facebook datasets when predicting whether a community will grow. These observations suggest that the prediction models are similar in some aspects regardless of the dataset. What makes a community live, grow, or undergo other kinds of evolution may be, to some extent, independent of the dataset.

To further understand how the features of evolution chains help us predict community evolution, we compare the original result in the DBLP dataset with the result of a model trained without these features and the result of a model trained with only these features. The comparison is shown in Table 6-12.

Table 6-12 Compare the accuracy of prediction with or without chains’ feature

                        With chains’ features    Without chains’ features    Only chains’ features
Predict merge or not    83.67 %                  81.74 %                     85.39 %
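The three feature sets behind this comparison can be derived from one feature vector as sketched below. The index layout is an assumption made only for this example: we take the 307 features to be numbered from 0 and treat indices 300 to 306 as the chain features. The function name ablation_views is illustrative.

```python
def ablation_views(row, chain_idx):
    """Return the three variants of one feature vector used in the comparison:
    all features, all but the chain features, and only the chain features."""
    chain = set(chain_idx)
    without = [v for j, v in enumerate(row) if j not in chain]
    only = [v for j, v in enumerate(row) if j in chain]
    return {"with": list(row), "without": without, "only": only}


# Hypothetical layout: 307 features, indices 300-306 taken as chain features.
row = list(range(307))
views = ablation_views(row, range(300, 307))
print({name: len(values) for name, values in views.items()})
# {'with': 307, 'without': 300, 'only': 7}
```

Each variant is then used to train and test a separate model with the same procedure, so that any difference in accuracy can be attributed to the presence or absence of the chain features.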

The accuracy of the model without chains’ features is worse than that of the model with chains’ features in every case. The differences are small, because the features described in Section 3.6.2 already contain some information about the evolution chain, though not very much. The model with only evolution chains’ features is only slightly worse than the original model, and in some cases it is even better. A possible reason is that we did not put much effort into tuning the parameters of the SVM. Nevertheless, the results confirm that the features of evolution chains do help us predict community evolution.

From all the results above, we conclude that the evolution of communities is predictable, since all the predictions are much better than random in both the Facebook dataset and the DBLP dataset. In addition, the evolution chains detected by Long-term Evolution Method help the prediction, and we obtain good prediction performance. In the next section, we carry out further predictions to show that our evolution detection algorithm and our definition of evolution types lead to better prediction than another algorithm.
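The claim that the predictions are much better than random can be quantified with a simple chance baseline. The sketch below assumes that, because the classes are sampled evenly, uniform random guessing over k classes is right with probability 1/k; the testing accuracies are those reported above for DBLP, and the helper name chance_accuracy is ours.

```python
def chance_accuracy(n_classes):
    """Expected accuracy (in %) of uniform random guessing on balanced classes."""
    return 100.0 / n_classes


# Testing accuracy (in %) and number of classes for each DBLP model.
results = {
    "6 evolution types": (57.36, 6),
    "live or not": (81.36, 2),
    "grow or not": (73.22, 2),
    "merge or not": (83.67, 2),
}

# Margin over chance, in percentage points.
margins = {
    task: round(acc - chance_accuracy(k), 2) for task, (acc, k) in results.items()
}
print(margins)
# {'6 evolution types': 40.69, 'live or not': 31.36,
#  'grow or not': 23.22, 'merge or not': 33.67}
```

Under this assumption every model beats chance by more than 20 percentage points, and the 6-type model, whose chance level is only about 16.67%, has the largest margin.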

6.3 Compare Prediction Result with SGCI

We compared our evolution detection result with the SGCI algorithm in Chapter 5, so we are also interested in evolution prediction with SGCI, and in whether Long-term Evolution Method and our defined types perform better than SGCI. On the DBLP dataset, we replace the evolution types we detected with the SGCI types and use them as the new prediction target. We use the same features selected in Section 3.6 but exclude the evolution chains’ features: because these features may not fit the SGCI algorithm, we keep only general features such as the size or density of a community. For a fair comparison, we compare the SGCI result with our prediction result trained without evolution chains’ features. The other steps are the same as those in Section 5.1. The result is shown in Table 6-13.

Table 6-13 Predicting SGCI types in DBLP

                                       Ground Truth
               Addition  Change size  Constancy  Decay  Deletion  Merge  Split
Addition       71        23           7          19     0         32     18
Change size    5         21           12         7      0         17     10

The accuracy of predicting the 7 SGCI types in DBLP is 49.33%, which is worse than our result of predicting 6 types in DBLP. This alone does not show which detection is better, because predicting 7 types should be more difficult than predicting 6 types. We therefore compare two further predictions: whether a community will live and whether a community will merge. With the SGCI types, a community will live if its evolution type is not “Decay”, and a community will merge if its evolution type is “Merge”. The results are shown in Table 6-14 and Table 6-15.
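The two binary targets defined above follow directly from the SGCI type of a community. A minimal sketch of this mapping is given below; the function name binary_targets is illustrative, and the list of type names is taken from the labels of Table 6-13.

```python
SGCI_TYPES = ("Addition", "Change size", "Constancy", "Decay",
              "Deletion", "Merge", "Split")


def binary_targets(sgci_type):
    """Map one SGCI evolution type to the pair (live, merge)."""
    if sgci_type not in SGCI_TYPES:
        raise ValueError(f"unknown SGCI type: {sgci_type!r}")
    live = sgci_type != "Decay"   # live unless the type is "Decay"
    merge = sgci_type == "Merge"  # merge only if the type is "Merge"
    return live, merge


for sgci_type in SGCI_TYPES:
    print(sgci_type, binary_targets(sgci_type))
```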

Table 6-14 Predicting live or not according to SGCI in DBLP

                                Ground Truth
                                False      True       Precision
Prediction Result    False      1388       719        65.88 %
                     True       273        980        78.21 %
Recall                          83.56 %    57.68 %
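The precision, recall, and overall accuracy of Table 6-14 can be recomputed from its four cell counts. The sketch below does so with plain Python; the variable names are ours, and the counts are copied from the table.

```python
def pct(numerator, denominator):
    """Ratio expressed in percent, rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)


# cells[predicted][actual], copied from Table 6-14.
cells = {
    False: {False: 1388, True: 719},
    True: {False: 273, True: 980},
}

# Precision: correct predictions of a class over all predictions of that class.
precision = {p: pct(cells[p][p], sum(cells[p].values())) for p in (False, True)}
# Recall: correct predictions of a class over all true members of that class.
recall = {a: pct(cells[a][a], cells[False][a] + cells[True][a])
          for a in (False, True)}

total = sum(sum(row.values()) for row in cells.values())
accuracy = pct(cells[False][False] + cells[True][True], total)

print(precision)  # {False: 65.88, True: 78.21}
print(recall)     # {False: 83.56, True: 57.68}
print(accuracy)   # 70.48
```

The accuracy is the sum of the two diagonal cells divided by the total number of test samples, 2368 out of 3360.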

Table 6-15 Predicting merge or not according to SGCI in DBLP

Ground Truth

The accuracies of predicting live or not and predicting merge or not according to the SGCI types are 70.48 % and 61.72 %, respectively. Table 6-16 compares our prediction results with the prediction results for the SGCI types.

Table 6-16 Compare the accuracy of prediction according to different types

                             According to our types    According to SGCI types
Predicting multiple types    53.03 %                   49.33 %
Predicting live or not       80.78 %                   70.48 %
Predicting grow or not       71.44 %                   -
Predicting merge or not      81.74 %                   61.72 %

Note that we do not predict whether a community will grow according to the SGCI types. If a community grows, its SGCI evolution type may be “Change Size” or “Addition”; however, a community with the SGCI type “Change Size” may be either “Growth” or “Shrink” under our type definition, so we skip this prediction. Our accuracy is clearly higher, especially in predicting merge or not. The lower accuracy of the multi-type prediction with the SGCI types may be attributed to the difficulty of predicting more types than ours. However, the comparisons of live or not and merge or not are fair, and we obtain much better accuracy in both. We can therefore say that our evolution types, which are derived from the detected evolution chains by our transfer method, are more representative, since they give better performance in evolution prediction.

Chapter 7 Conclusion

In this work, we use the algorithm proposed by Weixu Lin, Long-term Evolution Method, for community evolution detection and community evolution prediction. We first build a synthetic data generator based on the Stochastic Block Model to test Long-term Evolution Method. Our synthetic data generator can simulate the core structure of a community and the noise in community evolution. To judge the algorithm, we suggest using normalized mutual information as a measure of community evolution chain detection. Based on the synthetic data and this measure, we verify that Long-term Evolution Method is correct, and we analyze the two parameters of the algorithm to learn how to set the p-core and the threshold: better core nodes and a lower threshold can improve the detection.

With this understanding of Long-term Evolution Method, we apply the algorithm to the DBLP and Facebook datasets. To make the detection result more comprehensible and to compare it with other detection algorithms, we define our own evolution types and propose a method to transfer the evolution chains detected by Long-term Evolution Method to these types. The method can be adjusted by two factors, which makes it flexible. We provide a case study on the Facebook dataset according to our evolution types and show that some events that happened to Facebook can be matched with changes in the evolution types we defined. Analyzing the detected evolution types thus reveals more information about the datasets.

After the detection on the Facebook and DBLP datasets, we move on to community evolution prediction on these datasets. We use Libsvm with R to build classification models. We select 4 kinds of features: features of the communities, features of the differences between previous communities, features of the evolution chains, and features of the previous evolution types, giving 307 input features in total. The prediction results are good: they show that community evolution is predictable and that the evolution chains detected by Long-term Evolution Method help a lot in prediction. We also compare the prediction results with the SGCI algorithm by replacing our evolution types with the SGCI types. The prediction according to our types is much better than that according to the SGCI types, which suggests that our evolution types are more representative and therefore more predictable than those of SGCI.

The contributions of this work are as follows: we propose a synthetic data generator with evolution ground truth and core structure; provide a measure for the accuracy of evolution chain detection; define our own evolution types; give a method to transfer evolution chains to our types; provide the features we selected for community evolution prediction; and finally show that the evolution is predictable with the help of evolution chains’ features.

Community evolution prediction is still a growing topic, so much work remains for the future. The main direction is to find ways to obtain better prediction results. For example, a Recurrent Neural Network (RNN) may be a useful model: an RNN can learn how features evolve over time, so it may help us predict evolution. More generally, deep learning is a useful technique that is starting to be used in many areas. For example, Google trained a Go AI with deep learning that won 4 of 5 games in its match against Lee Sedol. Deep learning may also help community evolution prediction, and predicting evolution with deep learning would be valuable future work.

We select many features of a community to predict the communities’ types. However, there might be some features that are not as general as features like the size or density of a community. We can call these features “latent vectors”. The concept of latent vectors has been used in recommendation systems: users may have latent vectors that affect their ratings, so the latent vectors of users can be used to predict their ratings. Communities may likewise have their own latent vectors that affect their evolution. Finding communities’ latent vectors and using them to predict evolution is interesting future work.

