Chapter 4 Synthetic Dataset Experiment
4.2 Experiment Result with Cores and Noise
In this experiment, we investigate the impact of p-core and threshold. We add core
nodes to analyze the impact of p-core, and we also add noise to the synthetic
dataset. We start by analyzing the impact of p-core. We set 𝑝𝑖𝑛 to 0.5 and 𝑝𝑜𝑢𝑡 to
0.1, and vary the correct migration probability from 1.0 to 0.1. There are 8
communities in the new synthetic data, each containing 40 nodes, 4 of which are set
to be core nodes. We test Long-term Evolution Method with p-core from 1.0 to 0.1
and threshold equal to 0.05. We plot the normalized mutual information (NMI) in
Figure 4-1.
Figure 4-1 NMI versus p-core and correct migration probability
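The setup above can be sketched as a planted-partition generator. This is only a minimal illustration under stated assumptions: the function names, the choice of the first 4 nodes of each community as core nodes, and the reading of the migration probability (a non-core node keeps its community with that probability and otherwise moves to a random other community) are ours, not taken from the experiment code.

```python
import random
from itertools import combinations

def planted_partition(n_comm=8, comm_size=40, n_core=4,
                      p_in=0.5, p_out=0.1, seed=0):
    """Return (labels, core_nodes, edges) for one synthetic snapshot."""
    rng = random.Random(seed)
    n = n_comm * comm_size
    labels = [v // comm_size for v in range(n)]
    # Assumption: the first n_core nodes of every community are its core nodes.
    core_nodes = {v for v in range(n) if v % comm_size < n_core}
    edges = []
    for u, v in combinations(range(n), 2):
        # Intra-community pairs link with p_in, inter-community pairs with p_out.
        p = p_in if labels[u] == labels[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return labels, core_nodes, edges

def migrate(labels, core_nodes, p_correct, n_comm=8, seed=1):
    """Assumed noise model: core nodes never move; a non-core node keeps its
    community with probability p_correct, else jumps to another community."""
    rng = random.Random(seed)
    new_labels = []
    for v, c in enumerate(labels):
        if v in core_nodes or rng.random() < p_correct:
            new_labels.append(c)
        else:
            new_labels.append(rng.choice([k for k in range(n_comm) if k != c]))
    return new_labels

labels, core_nodes, edges = planted_partition()
noisy_labels = migrate(labels, core_nodes, p_correct=0.5)
```

With the default arguments the snapshot has 320 nodes, 32 of which are core nodes.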
In Figure 4-1, we can see that when the correct migration probability is large,
p-core has little impact on the normalized mutual information. However, as the
correct migration probability gets smaller, selecting a smaller p-core improves
the result. The reason is that nodes then have a larger probability of randomly
migrating to wrong communities. The better strategy for Long-term Evolution Method
is to calculate the similarity of communities by considering only the core nodes,
because core nodes do not migrate to wrong communities. When we choose a smaller
p-core, we have a greater chance of selecting the core nodes of each community and
therefore of detecting the evolution correctly. The figure shows that setting
p-core to 0.3 or 0.2 gives better results in most cases, so we set p-core to 0.2
when testing the real world datasets in Chapter 5. Selecting a smaller p-core also
means that the algorithm processes fewer nodes when calculating the similarity of
communities, which reduces the running time. Figure 4-2 shows the computing time
for different values of p-core. We can see that the computing time decreases as we
choose a lower p-core.
Figure 4-2 Execution time to the p-core
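The effect of restricting the similarity computation to core nodes can be illustrated with a small sketch. It rests on assumptions that this section does not confirm: that p-core keeps the top fraction of a community's nodes under some per-node ranking score, and that the similarity is the Jaccard index. The names `core_subset` and `core_similarity` are hypothetical.

```python
def core_subset(scores, p_core):
    """Keep the top p_core fraction of nodes by score (at least one node).
    Assumption: `scores` maps node -> ranking score (e.g. a centrality)."""
    k = max(1, round(p_core * len(scores)))
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return set(ranked[:k])

def core_similarity(scores_a, scores_b, p_core):
    """Jaccard similarity of the two core subsets (an assumed measure)."""
    a = core_subset(scores_a, p_core)
    b = core_subset(scores_b, p_core)
    return len(a & b) / len(a | b)

# Two snapshots of one community: the high-score nodes stay, the rest change.
before = {"a": 5, "b": 4, "c": 1, "d": 0}
after = {"a": 5, "b": 3, "x": 1, "y": 0}

sim_all = core_similarity(before, after, p_core=1.0)   # all nodes compared
sim_core = core_similarity(before, after, p_core=0.5)  # top half only
```

In this toy case the similarity over all nodes is 1/3, while the similarity over the top half is 1.0, and only half as many nodes are compared.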
Now we switch to analyzing the impact of threshold. We set p-core to 1.0 and the
correct migration probability to 0.5, vary threshold from 1.0 to 0.1, and keep the
other arguments the same as in the p-core analysis. Figure 4-3 below plots the
normalized mutual information against the threshold.
Figure 4-3 Normalized mutual information to the threshold
As Figure 4-3 shows, when the threshold is less than 0.5, the normalized mutual
information is stuck at 60%. As soon as we set the threshold larger than 0.5, the
normalized mutual information jumps to 100%. The threshold acts like a barrier.
The result shows that we should not set the threshold too large, so in the
following real world dataset experiment we set threshold to 0.05. The threshold
value 0.05 was also used in [3] when testing real world datasets.
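For reference, normalized mutual information between a ground-truth labeling and a detected labeling can be computed as in the sketch below. The normalization by the mean of the two entropies is an assumption (other variants divide by the geometric mean or the maximum); this section does not state which variant is used.

```python
from collections import Counter
from math import log

def nmi(truth, pred):
    """Normalized mutual information of two labelings of the same nodes.
    Assumption: normalized as 2 * I(T; P) / (H(T) + H(P))."""
    n = len(truth)
    count_t = Counter(truth)
    count_p = Counter(pred)
    joint = Counter(zip(truth, pred))
    h_t = -sum(c / n * log(c / n) for c in count_t.values())
    h_p = -sum(c / n * log(c / n) for c in count_p.values())
    mi = sum(c / n * log(c * n / (count_t[t] * count_p[p]))
             for (t, p), c in joint.items())
    if h_t + h_p == 0:
        return 1.0  # both labelings put every node in a single community
    return 2 * mi / (h_t + h_p)

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

A labeling that matches the ground truth up to renaming of the communities scores 1, and an independent labeling scores 0.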
Chapter 5 Real World Dataset Experiment
This chapter applies Long-term Evolution Method to real world datasets. In
Chapter 4, we verified that the algorithm is correct and learned how p-core and
threshold affect the detection result. In this chapter, we test the algorithm on
real world datasets, namely the Facebook dataset [6] and the DBLP dataset [7], to
see how well Long-term Evolution Method performs in real cases. Because real world
datasets have no ground truth, we evaluate the algorithm through case studies and
by comparing our result with another algorithm.
In this chapter, we use our method to change evolution chains into evolution types:
Continue, Merge, Split, Growth, Shrink, Death, and Birth. These types will be used to
5.1 Experiment on DBLP
We test Long-term Evolution Method on the DBLP dataset first. The DBLP dataset
is a co-author network. We clean it up by removing the authors who have published
fewer than five papers. After the cleanup, we split the dataset by year and
convert it into graphs. Each graph contains the data of two years, and consecutive
graphs overlap by one year, that is, by 50%. For example, the first graph contains
the data of the first and second years, and the second graph contains the data of
the second and third years. We get 11 graphs, which cover the data from 1995 to
2005. There are 148334 nodes and 1210663 edges in the graphs. We build the graphs
with overlap because it makes the evolution of communities much easier to catch.
This matters even more for the Facebook dataset, which is much smaller than the
DBLP dataset: each of its graphs has fewer nodes and communities than a DBLP
graph, which makes the evolution harder to detect.
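The overlapping split described above can be sketched as follows. The record layout `(year, author_a, author_b)` and the function names are assumptions made for illustration; the toy years only demonstrate the windowing and are not the DBLP data.

```python
def sliding_windows(first, last, width=2, step=1):
    """Inclusive (start, end) windows of `width` periods, shifted by `step`."""
    windows = []
    start = first
    while start + width - 1 <= last:
        windows.append((start, start + width - 1))
        start += step
    return windows

def build_graphs(records, windows):
    """records: iterable of (year, author_a, author_b) co-author pairs.
    Returns one undirected edge set per window."""
    graphs = [set() for _ in windows]
    for year, a, b in records:
        for i, (lo, hi) in enumerate(windows):
            if lo <= year <= hi:
                graphs[i].add((min(a, b), max(a, b)))
    return graphs

records = [(1995, "a", "b"), (1996, "b", "c"), (1997, "c", "d")]
windows = sliding_windows(1995, 1997)
graphs = build_graphs(records, windows)
```

Because consecutive windows share one year, the 1996 edge appears in both graphs, which is what lets a community be traced from one graph to the next.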
Based on what we learned in Chapter 4, we can set proper values of p-core and
threshold for Long-term Evolution Method: we set p-core to 0.2 and threshold to
0.05. Running the algorithm on the DBLP dataset takes 174999 seconds (excluding
the time for reading the graphs). The result contains 83301 communities and 32038
evolution chains. To interpret the result, we convert the evolution chains into
the evolution types we defined. Setting 𝛼 = 1 and 𝛽 = 1, we list the distribution
of evolution types in Table 5-1 and plot it in Figure 5-1. We exclude the
evolution types of communities in the last timeslot because they are always
“Death”: we have only 11 graphs, so every community in the 11th graph is labeled
“Death”.
Table 5-1 Number of each type in Long-term Evolution Method in DBLP
# of Event
Figure 5-1 Pie chart of types in Long-term Evolution Method in DBLP
As we can see, “Death” and “Continue” account for most of the evolution types,
and the other types are distributed quite evenly. Because there is no ground truth
for community detection or community evolution in the DBLP dataset, the only
evaluation available is case study. [3] presents many cases on the DBLP dataset,
the Enron dataset, and the U.S. Senate dataset, and shows that the algorithm
performs well on real world datasets, so we do not repeat the case study on the
DBLP dataset. Instead, we compare our result with SGCI, a well-known evolution
detection algorithm. The evolution types in SGCI are “Addition”, “Change Size”,
“Constancy”, “Decay”, “Deletion”, “Merge”, “Split”, and “Split_Merge”. The SGCI
types “Merge” and “Split” are the same as our definition of “Merge” and “Split”.
Our types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”,
and “Change Size” in SGCI. The type “Decay” is the same as the type “Death” in our
definition. The distribution of SGCI types is shown in Table 5-2 and Figure 5-2.
Table 5-2 Number of each type in SGCI in DBLP
# of Event in SGCI
Figure 5-2 Pie chart of types in SGCI in DBLP
As Figure 5-2 shows, the distribution of evolution types in SGCI is very uneven.
A possible reason is that SGCI has 8 types while we have only 6, and its type
definitions differ from ours. In Table 5-3 we compare the SGCI types with our
types.
Table 5-3 Comparison of our types and SGCI types in DBLP
Our Type: Continue, Death, Growth, Shrink, Merge, Split
SGCI Type: Addition, Change Size, Constancy, Decay, Deletion, Merge, Split, Split_Merge
The result shows that, except for “Decay” and “Constancy”, which almost match our
types “Death” and “Continue”, the SGCI types do not match ours very well, because
their definitions differ from ours. We cannot tell which algorithm is better,
since each algorithm has its own definition of evolution types. We discuss each
factor further in Chapter 6, where we predict evolution types.
5.2 Experiment on Facebook
After testing Long-term Evolution Method on the DBLP dataset, we repeat the
experiment on the Facebook dataset, which is a wall-post dataset: when a user
posts something on another user’s wall, there is a link between them. As in the
DBLP experiment, we split the Facebook dataset into 17 graphs, each covering four
months, starting from Feb. 2006. Consecutive graphs overlap by two months. There
are 45613 nodes and 1559167 edges in total in these graphs.
We run Long-term Evolution Method on the Facebook dataset and get 9985
communities and 5672 evolution chains in total. Again, we convert the evolution
chains into evolution types with 𝛼 = 1 and 𝛽 = 1, and exclude the evolution types
of communities in the last timeslot. Table 5-4 and Figure 5-3 show the
distribution of the evolution types in the Facebook dataset.
Table 5-4 Number of each type in Long-term Evolution Method in Facebook
Type      # of Event
Continue  3036
Death     4610
Merge     533
Split     655
Shrink    313
Growth    838
Figure 5-3 Pie chart of types in Long-term Evolution Method in Facebook
The two major evolution types in the result are still “Death” and “Continue”, and
“Death” accounts for almost half of all types. We also run SGCI on the Facebook
dataset. Table 5-5 and Figure 5-4 show the result of SGCI.
Table 5-5 Number of each type in SGCI in Facebook
Type         # of Event in SGCI
Addition     562
Merge        93
Split        36
Split_Merge  0
Figure 5-4 Pie chart of types in SGCI in Facebook
In Table 5-5 and Figure 5-4, we find that the distribution of SGCI types is more
uneven than ours, the same phenomenon we saw in the DBLP experiment. As in the
DBLP experiment, we compare the SGCI types with our types in Table 5-6.
Table 5-6 Comparison of our types and SGCI types in Facebook
                      Our Type
SGCI Type    Continue  Death  Growth  Shrink  Merge  Split
Constancy    3028      2      4       2       53     35
Decay        3         4101   4       2       310    488
The communities of type “Constancy” are almost the same as the communities of
type “Continue” in our definition, and the type “Decay” is mostly detected as
“Death” in our types. Our types “Growth” and “Shrink” are mostly detected as
“Change Size” in SGCI, which matches our earlier observation that “Growth” and
“Shrink” are similar to “Addition”, “Deletion”, and “Change Size” in SGCI.
However, the other types do not match.
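How strongly an SGCI type agrees with one of our types can be read off a cross-tabulation row as the share of that row falling into the matching column. The sketch below uses the two rows of Table 5-6 that are available here (“Constancy” and “Decay”); the helper name `row_match_rate` is our own.

```python
from collections import Counter

OUR_TYPES = ["Continue", "Death", "Growth", "Shrink", "Merge", "Split"]

# Rows of Table 5-6: SGCI type -> counts per our type, in OUR_TYPES order.
rows = {
    "Constancy": [3028, 2, 4, 2, 53, 35],
    "Decay": [3, 4101, 4, 2, 310, 488],
}

# Flatten into a (sgci_type, our_type) -> count table.
table = Counter()
for sgci_type, counts in rows.items():
    for our_type, count in zip(OUR_TYPES, counts):
        table[(sgci_type, our_type)] = count

def row_match_rate(table, sgci_type, our_type):
    """Share of communities with `sgci_type` that we label `our_type`."""
    row_total = sum(c for (s, _), c in table.items() if s == sgci_type)
    return table[(sgci_type, our_type)] / row_total if row_total else 0.0

constancy_as_continue = row_match_rate(table, "Constancy", "Continue")
decay_as_death = row_match_rate(table, "Decay", "Death")
```

With these counts, roughly 97% of “Constancy” communities are “Continue” in our types and roughly 84% of “Decay” communities are “Death”.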
5.3 Case Studying in Facebook Dataset
We conduct the case study on the Facebook dataset by analyzing the distribution
of evolution types in different timeslots and matching events in Facebook’s
history with the detection result. Although [3] contains many case studies
supporting Long-term Evolution Method, it did not apply the algorithm to the
Facebook dataset. In this section, we therefore analyze the evolution in the
Facebook dataset to support both the algorithm and our method of converting
evolution chains into evolution types. Facebook is a large company that is still
growing, and its history is easy to find on the Internet, so we may be able to
link the evolution we detected in the Facebook dataset to what actually happened
in Facebook’s history. To this end, we plot the distribution of evolution types
in each timeslot. Figure 5-5 shows the percentage of each type in each timeslot.
Figure 5-5 Percentages of evolution types in each timeslot in Facebook
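The per-timeslot shares behind such a plot can be computed as in the following sketch. The input layout, a list of `(timeslot, evolution_type)` pairs, is an assumption, and the events shown are toy values rather than the Facebook result.

```python
from collections import Counter, defaultdict

def type_shares(events):
    """events: iterable of (timeslot, evolution_type) pairs.
    Returns {timeslot: {evolution_type: fraction of that timeslot}}."""
    counts = defaultdict(Counter)
    for slot, etype in events:
        counts[slot][etype] += 1
    shares = {}
    for slot, counter in counts.items():
        total = sum(counter.values())
        shares[slot] = {etype: c / total for etype, c in counter.items()}
    return shares

# Toy events (illustrative only).
events = [(0, "Death"), (0, "Continue"),
          (1, "Split"), (1, "Split"), (1, "Death"), (1, "Continue")]
shares = type_shares(events)
```

Normalizing within each timeslot makes timeslots with different numbers of communities comparable.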
There are two interesting points in this plot. First, in the second timeslot
(time index 1), the percentage of type “Split” is much larger than in the previous
timeslot. Second, in the 14th timeslot (time index 13), the percentage of type
“Split” is again much larger than in the previous timeslot. The second timeslot
contains the data from 2006/4 to 2006/7. In 2006/9, Facebook was opened to
everyone at least 13 years old with a valid email address. A possible reason why
the share of “Split” grows is that the nodes in 2006/4 ~ 2006/7 break away from
their own communities because more friends are joining Facebook and they want to
form new communities with the newly joined friends. The 14th timeslot contains the
data from 2008/4 to 2008/7. In 2008/7, Facebook was fully opened to users in China
because of the 2008 Summer Olympic Games in Beijing. So, again, the structure of
Facebook communities changed a lot in this period for the same reason: new
friends, in this case Chinese users, were joining Facebook. Since China eventually
blocked Facebook in late 2008, we might see another big change in the community
evolution types if we had more data.
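The month ranges quoted for the two timeslots follow from the windowing in Section 5.2 (four-month graphs, two-month shift, starting Feb. 2006). A small sketch of the arithmetic, with a hypothetical helper name:

```python
def timeslot_window(index, start_year=2006, start_month=2, span=4, step=2):
    """Return ((year, month), (year, month)) for the first and last month
    covered by the graph with the given zero-based time index."""
    first = start_year * 12 + (start_month - 1) + index * step
    last = first + span - 1
    return (first // 12, first % 12 + 1), (last // 12, last % 12 + 1)

second_slot = timeslot_window(1)       # time index 1
fourteenth_slot = timeslot_window(13)  # time index 13
```

Time index 1 maps to 2006/4 through 2006/7 and time index 13 maps to 2008/4 through 2008/7, in agreement with the ranges given above.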
Chapter 6 Community Evolution Prediction
In this chapter, we use the detection result of Chapter 5 to build a prediction
model for community evolution. In Chapter 5, we tested Long-term Evolution Method
on real world datasets and gave a method to convert community evolution chains
into community evolution types. Long-term Evolution Method handles detection
well. In this chapter, we go one step further and predict the evolution types. We
want to show that evolution is predictable and that our method helps us reach
this goal.
6.1 Experiment on Facebook Dataset
We first predict the evolution of communities in the Facebook dataset. In
Chapter 5, we applied Long-term Evolution Method to the Facebook dataset and got
9985 communities and 5672 evolution chains. To build a more precise prediction
model, we remove the communities that survive for no more than 2 timeslots, since
such short-lived communities may be unpredictable. After removing them, we have
4235 communities. Our prediction target is the evolution type. Before training
the model, we should look at the distribution of the evolution types in the
training and testing sets, which Table 6-1 shows.
Table 6-1 Number of evolution types in training/testing set in Facebook
Type      # of Event
Continue  1388
Death     760
Merge     652
Split     828
Shrink    257
Growth    350
As we can see, the distribution of the 4235 communities is uneven. We therefore
first sample each evolution type evenly and randomly to get an unbiased
prediction model. We sample 257 communities for every type, pick 80% of the data
as the training set, and use the features we selected in Section 3.6 to train the
prediction model. Then we use the remaining 20% of the data as the testing set.
The prediction result is shown in Table 6-2.
Table 6-2 Predicting 6 evolution types in Facebook
Ground Truth
The accuracy of the prediction is 64.40%. The performance does not look very
good. However, compared with random prediction, whose accuracy over 6 balanced
types is only 1/6 ≈ 16.67%, it is still far better than random.
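The balanced sampling and the 80/20 split can be sketched as below, using the type counts from Table 6-1. Whether the original split was drawn per type or over the pooled sample is not stated; this sketch pools the 6 × 257 sampled communities, shuffles them, and cuts at 80%, which is an assumption, as are the function name and the seed.

```python
import random
from collections import Counter

def balanced_split(labels, per_class, train_frac=0.8, seed=0):
    """labels: one evolution type per community.
    Returns (train_indices, test_indices) over a class-balanced sample."""
    rng = random.Random(seed)
    by_type = {}
    for idx, label in enumerate(labels):
        by_type.setdefault(label, []).append(idx)
    pool = []
    for label in sorted(by_type):
        # Draw the same number of communities from every type.
        pool.extend(rng.sample(by_type[label], per_class))
    rng.shuffle(pool)
    cut = int(train_frac * len(pool))
    return pool[:cut], pool[cut:]

counts = {"Continue": 1388, "Death": 760, "Merge": 652,
          "Split": 828, "Shrink": 257, "Growth": 350}
labels = [t for t, c in counts.items() for _ in range(c)]

train_idx, test_idx = balanced_split(labels, per_class=min(counts.values()))
chance_accuracy = 1 / len(counts)  # random guessing over 6 balanced types
```

Sampling down to the rarest type (“Shrink”, 257 communities) is what makes 1/6 the right chance baseline for the 6-type prediction.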
The accuracy of predicting 6 evolution types is only 64.40%, possibly because
predicting 6 evolution types is too difficult. So we build prediction models for
different and much easier targets: whether a community will live, whether a
community will grow, and whether a community will merge. As with the 6-type
model, we first evenly sample the training/testing set, pick 80% of the data as
the training set, and use the features we selected in Section 3.6 to train the
prediction model. Table 6-3, Table 6-4, and Table 6-5 show the testing results of
the three prediction models.
Table 6-3 Predicting live or not in Facebook
Ground Truth
Table 6-4 Predicting grow or not in Facebook
Ground Truth
Recall 96.65 % 73.68 %
Table 6-5 Predicting merge or not in Facebook
Ground Truth
The accuracy of predicting live or not is 79.15%, which is quite good. When
predicting whether a community will grow, we only consider the communities that
survive into the next timeslot. As a result, the training and testing sets differ
from those used for predicting 6 evolution types and for predicting live or not.
The accuracy of predicting grow or not is 86.10%, which is better than that of
predicting live or not and much better than that of predicting 6 evolution types.
When predicting whether a community will merge, we again only consider the
communities that survive. The accuracy of predicting merge or not is 90.04%, so
only about 10 out of 100 predictions are incorrect. This suggests that these
questions are easier to predict than the 6 types. Additionally, the communities
that survive are more predictable, since the results of predicting grow or not
and merge or not are better than the result of predicting live or not.
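Accuracy, recall, and precision of the kind reported for these models can all be derived from a confusion matrix indexed by ground truth and prediction. The sketch below uses made-up toy counts, not the values of Tables 6-3 to 6-5, and the function name is our own.

```python
def confusion_metrics(confusion):
    """confusion[truth][predicted] = number of test communities.
    Returns (accuracy, recall per class, precision per class)."""
    classes = sorted(confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c].get(c, 0) for c in classes)
    recall, precision = {}, {}
    for c in classes:
        truth_total = sum(confusion[c].values())
        pred_total = sum(confusion[t].get(c, 0) for t in classes)
        hit = confusion[c].get(c, 0)
        recall[c] = hit / truth_total if truth_total else 0.0
        precision[c] = hit / pred_total if pred_total else 0.0
    return correct / total, recall, precision

# Toy counts (illustrative only).
toy = {"grow": {"grow": 8, "not": 2},
       "not": {"grow": 1, "not": 9}}
accuracy, recall, precision = confusion_metrics(toy)
```

Recall is computed along the ground-truth axis and precision along the predicted axis, so the two can differ considerably for the same class.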
To check for overfitting, we calculate the training accuracy of each prediction
model and list it in Table 6-6. The training accuracy is almost equal to the
testing accuracy for every model, which indicates that our models do not overfit.
Table 6-6 Testing accuracy and training accuracy of prediction in Facebook
Testing Accuracy Training Accuracy
Predict 6 evolution types 64.40 % 63.67 %
Predict live or not 79.15 % 79.55 %
Predict grow or not 86.10 % 86.58 %
Predict merge or not 90.04 % 90.22 %
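The comparison in Table 6-6 amounts to checking the gap between training and testing accuracy for each model. A short sketch of that arithmetic, using the table values (the variable names are ours):

```python
# (testing accuracy %, training accuracy %) from Table 6-6.
results = {
    "Predict 6 evolution types": (64.40, 63.67),
    "Predict live or not": (79.15, 79.55),
    "Predict grow or not": (86.10, 86.58),
    "Predict merge or not": (90.04, 90.22),
}

# Positive gap: the model does better on data it was trained on.
gaps = {name: round(train - test, 2) for name, (test, train) in results.items()}
largest_gap = max(abs(g) for g in gaps.values())
```

Every gap stays below one percentage point, and for the 6-type model the training accuracy is even slightly lower than the testing accuracy.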
It is interesting to know which features are important in our prediction model,
since we select more than 300 features. The support vector machine applies the
kernel method, mapping the input features into a high-dimensional feature space
with a kernel function, so it is difficult to tell which feature is the most
important. Consequently, we train a new model with random forest and show the
importance of each feature. Figure 6-1 shows the feature importance of the random
forest model for 6 evolution types in Facebook.
Figure 6-1 Features importance in predicting 6 evolution types in Facebook
As Figure 6-1 shows, the top 7 most important features are feature 300, feature
306, feature 303, feature 301, feature 305, feature 302, and feature 216. Recall
from Section 3.6.3 that features 300 to 306 are the features about evolution
chains. This means that the evolution chains detected by Long-term Evolution
Method help a lot in predicting 6 evolution types. We go on to plot the feature
importance of the other prediction models in Figure 6-2, Figure 6-3, and Figure
6-4.
Figure 6-2 Features importance in predicting live or not in Facebook
Figure 6-3 Features importance in predicting grow or not in Facebook
Figure 6-4 Features importance in predicting merge or not in Facebook
We only plot the top 7 most important features of each model. Although the order
of the most important features differs between the prediction models, most of the
top 7 features in every model are features about evolution chains. This shows
that our detection algorithm is useful: the evolution chain features detected by
Long-term Evolution Method can be used to build a good prediction model.
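The "top 7" reading of an importance vector, and the share of evolution chain features among them, can be sketched as follows. The importance scores in the toy dictionary are invented for illustration; only the feature ids 300 to 306 and the reported top 7 for the 6-type model come from the text, and the helper names are hypothetical.

```python
def top_features(importance, k=7):
    """importance: dict feature id -> importance score.
    Returns the k feature ids with the highest scores."""
    return sorted(importance, key=lambda f: (-importance[f], f))[:k]

def chain_share(feature_ids, chain_ids=range(300, 307)):
    """Fraction of the given features that are evolution chain features."""
    return sum(1 for f in feature_ids if f in chain_ids) / len(feature_ids)

# Top 7 reported for predicting 6 evolution types in Facebook.
reported_top7 = [300, 306, 303, 301, 305, 302, 216]
share = chain_share(reported_top7)

# Toy scores (illustrative only) to show the ranking helper.
toy_top2 = top_features({1: 0.1, 2: 0.5, 3: 0.3}, k=2)
```

For the 6-type model, 6 of the 7 reported features are evolution chain features.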
6.2 Experiment on DBLP Dataset
In this section, we show the prediction results on the DBLP dataset. Applying our
detection algorithm to the DBLP dataset in Chapter 5 gave 83301 communities and
32038 evolution chains. Similarly, we remove the communities which