Chapter 3 Our Method
3.6 Features Selection for Community Evolution Prediction
3.6.4 Features about the previous evolution type
The final group of features describes the previous evolution type. Every community
has two evolution types: one describes where the community came from, and the other
describes what will happen to it. If a community is the result of two communities
merging in the previous timeslot, its previous evolution type is “Merge”. The previous
evolution type is one of “Birth”, “Merge”, “Split”, “Growth”, “Shrink”, and
“Continue”. Note that “Death” cannot be a previous evolution type, as mentioned in
Section 3.3. We also add three Boolean features to this group. The first records
whether a community survived from the previous timeslot. The second records whether a
community’s membership increased compared to the previous timeslot. The third records
whether a community was merged in the previous timeslot. Each of these three features
takes the value “true” or “false”. In Chapter 6, we predict not only the evolution
types but also whether a community will live, whether it will grow, and whether it
will merge, so we add these three features as input features. This is discussed in
Section 6.1. Because a community cannot evolve from a “Death” community, the feature
recording whether a community survived from the previous timeslot is always true. We
therefore remove this feature from the prediction model, since it carries no
information. Before this removal, there are 4 features in this group, numbered 307
to 310.
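As an illustration, the four features in this group could be represented as in the
following minimal sketch. The names PrevType, PrevEvolutionFeatures, and build_features
are hypothetical, and deriving the growth flag from community sizes and the merge flag
from the previous type is an assumption made for this example, not the feature
extraction code used in this work.

```python
from dataclasses import dataclass
from enum import Enum


class PrevType(Enum):
    """Possible previous evolution types; "Death" is deliberately absent."""
    BIRTH = "Birth"
    MERGE = "Merge"
    SPLIT = "Split"
    GROWTH = "Growth"
    SHRINK = "Shrink"
    CONTINUE = "Continue"


@dataclass(frozen=True)
class PrevEvolutionFeatures:
    prev_type: PrevType  # categorical feature
    survived: bool       # always True, so it is dropped before training
    grew: bool           # membership increased since the previous timeslot
    merged: bool         # community resulted from a merge


def build_features(prev_type, prev_size, curr_size):
    """Assumed derivation of the three Boolean flags (illustrative only)."""
    return PrevEvolutionFeatures(
        prev_type=prev_type,
        survived=True,
        grew=curr_size > prev_size,
        merged=prev_type is PrevType.MERGE,
    )


example = build_features(PrevType.MERGE, prev_size=20, curr_size=35)
```

Because survived is constant, only prev_type, grew, and merged would remain as model
inputs in this sketch.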
Chapter 4
Synthetic Dataset Experiment
In this chapter, we use synthetic data that we generate to test the evolution detection
algorithm. In Chapter 2, we introduced the Long-term Evolution Method designed by
Weixu Lin, which we use as our main algorithm for further evolution analysis and
evolution prediction. However, this algorithm has not been fully verified. In this
chapter, we verify the long-term evolution algorithm (LM1) by testing it on datasets
that contain evolution ground truth. We are also interested in the two main parameters
of the algorithm, p-core and threshold, so we conduct experiments to show the impact
of these two parameters on the detection result.
4.1 Experiment Result
and the measurement for judging Long-term Evolution Method. In our synthetic data,
we set 4 communities, each containing 20 nodes. Nodes in the same community are
connected with probability 0.9, which is p_in in the Stochastic Block Model. Nodes in
different communities are connected with probability 0.1, which is p_out. We generate
6 graphs, which gives 5 evolution steps. We set the probability of correct migration
to 1.0, which means there is no migration noise. We do not add core nodes in this
experiment, because we would like to see how the Long-term Evolution Method performs
when communities contain no core nodes. For the arguments of the Long-term Evolution
Method, we set p-core to 0.2 and threshold to 0.05. The first step of the algorithm
applies community detection to each graph. Here we use the community ground truth
instead, because we want to evaluate the evolution detection algorithm itself: errors
in community detection would affect the result of evolution detection. By using the
community ground truth, we make the community detection step 100% correct. The result
is shown in Table 4-1.
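Before turning to the result, the graph setup described above can be sketched as
follows. This is a minimal stochastic block model generator written for illustration
only. The function name sbm_edges, the seed, and the node numbering are assumptions,
and the migration of nodes between timeslots is not modeled.

```python
import itertools
import random


def sbm_edges(n_communities=4, size=20, p_in=0.9, p_out=0.1, seed=0):
    """Return (membership, edges) for one stochastic block model graph.

    Nodes are numbered 0..n_communities*size-1 and node i belongs to
    community i // size (an arbitrary convention chosen for this sketch).
    """
    rng = random.Random(seed)
    membership = {node: node // size for node in range(n_communities * size)}
    edges = []
    for u, v in itertools.combinations(membership, 2):
        # Same community: connect with p_in; otherwise connect with p_out.
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.append((u, v))
    return membership, edges


membership, edges = sbm_edges()
print(len(membership), "nodes,", len(edges), "edges")
```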
Table 4-1 Evolution chains detected by Long-term Evolution Method
Chain Index Community Evolution Chain
0 0:0, 1:0, 4:1, 10:2
1 3:0, 7:1, 12:2
2 8:1
3 2:0, 5:1, 6:1, 11:2, 14:3
4 9:2, 13:3, 18:4, 19:4, 22:5
5 16:3, 17:3, 21:4, 25:5
6 15:3, 20:4, 23:5, 24:5
7 26:5
Table 4-1 shows the result of the Long-term Evolution Method. The algorithm detects
8 chains. Each chain contains one or more communities at each time index. The syntax
of each element in an evolution chain is <Community Index> : <Time Index>. For
example, in the first evolution chain, 1:0 denotes the community whose index is 1 and
whose time index is 0. The evolution ground truth is shown in Table 4-2.
Table 4-2 Ground truth of evolution chain in synthetic data
Chain Index Community Evolution Chain
0 0:0, 1:0, 4:1, 9:2, 10:2, 13:3, 18:4, 19:4, 22:5
1 2:0, 5:1, 6:1, 11:2, 14:3, 15:3, 20:4, 23:5, 24:5
2 3:0, 7:1, 8:1, 12:2, 16:3, 17:3, 21:4, 25:5, 26:5
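The <Community Index> : <Time Index> syntax used in these tables can be read
programmatically. The following sketch parses one chain and groups its communities by
time index. The helper names parse_chain and group_by_time are hypothetical.

```python
def parse_chain(text):
    """Parse "0:0, 1:0, 4:1" into [(community_index, time_index), ...]."""
    chain = []
    for element in text.split(","):
        community, time = element.strip().split(":")
        chain.append((int(community), int(time)))
    return chain


def group_by_time(chain):
    """Map each time index to the list of community indices at that time."""
    by_time = {}
    for community, time in chain:
        by_time.setdefault(time, []).append(community)
    return by_time


# Chain 0 of Table 4-1.
chain0 = parse_chain("0:0, 1:0, 4:1, 10:2")
print(group_by_time(chain0))
```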
As we can see, the evolution chains detected by the Long-term Evolution Method differ
from the ground truth. Nevertheless, the result is partially correct: every detected
chain is a subset of a single ground-truth chain.
Because the result is not satisfactory, we modify an argument of the algorithm. We
set p-core to 1.0, which means that no node is treated specially as a core node
(equivalently, every node is a core node), so the algorithm computes the similarity
of communities using all of their nodes. The result is shown in Table 4-3.
Table 4-3 Evolution chains detected by LM1 with new arguments
Chain Index Community Evolution Chain
0 0:0, 1:0, 4:1, 9:2, 10:2, 13:3, 18:4, 19:4, 22:5
1 2:0, 5:1, 6:1, 11:2, 14:3, 15:3, 20:4, 23:5, 24:5
2 3:0, 7:1, 8:1, 12:2, 16:3, 17:3, 21:4, 25:5, 26:5
The result is identical to the ground truth. The normalized mutual information (NMI)
of the result is 100%, which means that the detected evolution chains are completely
correct on this dataset and supports the correctness of the Long-term Evolution
Method. We obtain a perfect result with p-core set to 1.0 because the communities in
our synthetic data have no core structure, so setting p-core to 0.2 is not a good
choice for this data. In most real world datasets, however, communities contain some
core nodes, so we need a new synthetic dataset whose communities contain core
structure. Besides, this synthetic data has almost no noise. The only noise comes from
the Stochastic Block Model: because we set p_in to 0.9, two nodes in the same
community are unconnected with probability 0.1. In real world data, community
evolution is more complicated and much noisier. We therefore run a new experiment to
test a more complicated evolution and to analyze the impact of p-core.
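To make the comparison between detected chains and ground truth concrete, the
following sketch computes a normalized mutual information for the chains of Table 4-1
against the ground truth of Table 4-2. It assumes that each community is labeled with
the index of the chain containing it, and that mutual information is normalized by
the arithmetic mean of the two entropies. The normalization used in this work is not
stated here, so the value this sketch produces may differ from the reported figures.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """NMI = 2 * I(A; B) / (H(A) + H(B)), one common normalization."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mutual = sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2 * mutual / (h_a + h_b)


def chain_labels(chains, communities):
    """Label every community with the index of the chain that contains it."""
    label = {c: i for i, chain in enumerate(chains) for c in chain}
    return [label[c] for c in communities]


# Community indices per chain, copied from Table 4-2 and Table 4-1.
truth = [[0, 1, 4, 9, 10, 13, 18, 19, 22],
         [2, 5, 6, 11, 14, 15, 20, 23, 24],
         [3, 7, 8, 12, 16, 17, 21, 25, 26]]
detected = [[0, 1, 4, 10], [3, 7, 12], [8], [2, 5, 6, 11, 14],
            [9, 13, 18, 19, 22], [16, 17, 21, 25], [15, 20, 23, 24], [26]]

communities = list(range(27))
score = nmi(chain_labels(detected, communities), chain_labels(truth, communities))
print(f"NMI of Table 4-1 against Table 4-2: {score:.3f}")
```

With identical partitions, as in Table 4-3, this function returns 1.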
4.2 Experiment Result with Cores and Noise
In this experiment, we investigate the impact of p-core and threshold. We add core
nodes in order to analyze the impact of p-core, and we also add noise to the synthetic
dataset. We start by analyzing the impact of p-core. We set p_in to 0.5 and p_out to
0.1, and vary the correct migration probability from 1.0 to 0.1. There are 8
communities in the new synthetic data, each containing 40 nodes, 4 of which are set
to be core nodes. We test the Long-term Evolution Method with p-core from 1.0 to 0.1
and threshold equal to 0.05. We plot the normalized mutual information in Figure 4-1.
Figure 4-1 NMI to p-core and correctly migration probability
In Figure 4-1, we can see that when the correct migration probability is large,
p-core has little impact on the normalized mutual information. However, as the
correct migration probability gets smaller, selecting a smaller p-core improves the
result. The reason is that ordinary nodes then have a larger probability of migrating
randomly to the wrong communities. It is better for the Long-term Evolution Method to
calculate the similarity of communities using only the core nodes, because core nodes
do not migrate to the wrong communities. When we choose a smaller p-core, we have a
greater chance of selecting only the core nodes of a community and therefore of
detecting the evolution correctly. The figure shows that setting p-core to 0.3 or 0.2
gives a better result in most cases, so when testing the real world datasets in
Chapter 5 we set p-core to 0.2. A smaller p-core also means that the algorithm
processes fewer nodes when calculating the similarity of communities, which reduces
the running time. Figure 4-2 shows the computing time for different values of p-core.
We can see that the computing time decreases as we choose a lower p-core.
Figure 4-2 Execution time to the p-core
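The role of p-core and threshold can be illustrated with a simplified sketch. It
assumes that the core of a community is the top p-core fraction of its nodes ranked
by degree, that similarity is the share of those core nodes found in a community of
the next timeslot, and that two communities are linked when the similarity reaches
the threshold. These definitions and all function names are assumptions made for
illustration; the exact formulas of the Long-term Evolution Method are not reproduced
here.

```python
def core_nodes(community, degree, p_core):
    """Assumed core: the top p_core fraction of nodes, ranked by degree."""
    k = max(1, round(p_core * len(community)))
    ranked = sorted(community, key=lambda node: (-degree.get(node, 0), node))
    return set(ranked[:k])


def core_similarity(comm_prev, comm_next, degree, p_core):
    """Share of the earlier community's core nodes present in the later one."""
    core = core_nodes(comm_prev, degree, p_core)
    return len(core & set(comm_next)) / len(core)


def is_linked(comm_prev, comm_next, degree, p_core=0.2, threshold=0.05):
    """Link two communities when their core similarity reaches the threshold."""
    return core_similarity(comm_prev, comm_next, degree, p_core) >= threshold


# Toy data: nodes 1 and 2 are high-degree cores that stay together, while the
# remaining members of the community scatter elsewhere.
prev_comm = set(range(1, 11))
next_comm = {1, 2, 11, 12}
degree = {node: 1 for node in range(1, 13)}
degree[1], degree[2] = 9, 8

print(core_similarity(prev_comm, next_comm, degree, p_core=0.2))
print(core_similarity(prev_comm, next_comm, degree, p_core=1.0))
```

In this toy case, a small p-core compares only the two stable core nodes, whereas
p-core equal to 1.0 compares all ten nodes, which also illustrates why a smaller
p-core touches fewer nodes per comparison.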
Now we turn to analyzing the impact of threshold. We set p-core to 1.0 and the
correct migration probability to 0.5, vary threshold from 1.0 to 0.1, and keep the
other arguments the same as in the p-core analysis. Figure 4-3 plots the normalized
mutual information against the threshold.
Figure 4-3 Normalized mutual information to the threshold
As Figure 4-3 shows, when the threshold is less than 0.5, the normalized mutual
information is stuck at 60%. But as soon as we set the threshold larger than 0.5, the
normalized mutual information jumps to 100%. The threshold acts like a barrier. The
result shows that we should not set threshold too large, so in the following real
world dataset experiments we set threshold to 0.05. The threshold value 0.05 was also
used in [3] when testing real world datasets.
Chapter 5
Real World Dataset Experiment
This chapter applies the Long-term Evolution Method to real world datasets. In
Chapter 4, we verified the algorithm and learned how p-core and threshold affect the
detection result. In this chapter, we test the algorithm on real world datasets,
namely the Facebook dataset [6] and the DBLP dataset [7], to learn how well the
Long-term Evolution Method performs in real cases. Because real world datasets have
no ground truth, we evaluate the algorithm through case studies and by comparing our
result with another algorithm. In this chapter, we also use our method to convert
evolution chains into evolution types: Continue, Merge, Split, Growth, Shrink, Death,
and Birth. These types will be used to
5.1 Experiment on DBLP
We first test the Long-term Evolution Method on the DBLP dataset, which is a
co-author network. We clean the dataset by removing the nodes (authors) that have
published fewer than five papers. After cleaning, we separate the DBLP dataset by
year and convert it into graphs. Each graph contains two years of data, and
consecutive graphs overlap by 50%, that is, by one year. For example, the first graph
contains the data of the first and second years, and the second graph contains the
data of the second and third years. We get 11 graphs, which cover the data from 1995
to 2005. There are 148334 nodes and 1210663 edges in the graphs. We build the graphs
with overlap because it makes the evolution of communities much easier to capture.
This matters even more when we test the algorithm on the Facebook dataset, because
the Facebook dataset is much smaller than the DBLP dataset: each of its graphs
contains fewer nodes and communities, which makes the evolution harder to detect.
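The overlapping-window construction can be sketched as follows. The function names,
the (author_a, author_b, year) edge format, and the toy years are assumptions made
for illustration, not the preprocessing code used in this work.

```python
def overlapping_windows(first_year, last_year, width=2, step=1):
    """Return inclusive (start, end) year windows; width 2 and step 1
    give a 50% overlap between consecutive windows."""
    windows = []
    start = first_year
    while start + width - 1 <= last_year:
        windows.append((start, start + width - 1))
        start += step
    return windows


def split_edges(edges, windows):
    """Assign each (author_a, author_b, year) edge to every window covering
    its year, producing one undirected edge set per window."""
    graphs = [set() for _ in windows]
    for a, b, year in edges:
        for i, (lo, hi) in enumerate(windows):
            if lo <= year <= hi:
                graphs[i].add((min(a, b), max(a, b)))
    return graphs


# Toy co-authorship records (hypothetical authors and years).
toy_edges = [("a", "b", 2000), ("b", "c", 2001), ("a", "c", 2003)]
toy_windows = overlapping_windows(2000, 2003)
toy_graphs = split_edges(toy_edges, toy_windows)
```

An edge dated in a shared year appears in two consecutive graphs, which is what lets
a community be matched across timeslots.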
Based on what we learned in Chapter 4, we set p-core to 0.2 and threshold to 0.05 for
the Long-term Evolution Method. Running the algorithm on the DBLP dataset takes
174999 seconds (excluding the time for reading the graphs). The result contains 83301
communities and 32038 evolution chains. To interpret the result, we convert the
evolution chains into the evolution types we defined. Setting 𝛼 = 1 and 𝛽 = 1, we
list the distribution of evolution types in Table 5-1 and plot it in Figure 5-1. We
also remove the evolution types of communities in the last timeslot, because they are
always “Death”: we have only 11 graphs, so every community in the 11th graph is typed
“Death”.
Table 5-1 Number of each type in Long-term Evolution Method in DBLP
# of Event
Figure 5-1 Pie chart of types in Long-term Evolution Method in DBLP
As we can see, “Death” and “Continue” account for most of the evolution types, and
the other types are distributed fairly evenly. Because the DBLP dataset has no ground
truth for community detection or community evolution, the only way to evaluate the
result is through case studies. The work in [3] presents many case studies on the
DBLP dataset, the Enron dataset, and the U.S. Senate dataset, and shows that the
algorithm performs well on real world datasets, so we do not carry out further case
studies on the DBLP dataset. Instead, we compare our result with SGCI, a well-known
evolution detection algorithm. The evolution types in SGCI are “Addition”, “Change
Size”, “Constancy”, “Decay”, “Deletion”, “Merge”, “Split”, and “Split_Merge”. The
types “Merge” and “Split” in SGCI are the same as our definition of “Merge” and
“Split”. Our types “Growth” and “Shrink” are similar to the types “Addition”,
“Deletion”, and “Change Size” in SGCI. The type “Decay” is the same as the type
“Death” in our definition. The distribution of SGCI types is shown in Table 5-2 and
Figure 5-2.
Table 5-2 Number of each type in SGCI in DBLP
# of Event in SGCI
Figure 5-2 Pie chart of types in SGCI in DBLP
As Figure 5-2 shows, the distribution of evolution types in SGCI is very uneven. A
possible reason is that SGCI has 8 different types whereas we have only 6, and its
definitions of the types differ from ours. In Table 5-3 we compare the SGCI types
with our types.
Table 5-3 Comparison of our types and SGCI types in DBLP
Our Type: Continue, Death, Growth, Shrink, Merge, Split
SGCI Type: Addition, Change Size, Constancy, Decay, Deletion, Merge, Split, Split_Merge
The result shows that the SGCI types “Decay” and “Constancy” almost match our types
“Death” and “Continue”, but the other types do not match well, because they are
defined differently from ours. We cannot tell which algorithm is better, since each
algorithm has its own definition of evolution types. We discuss each factor further
in Chapter 6, where we predict evolution types.
5.2 Experiment on Facebook
After testing the Long-term Evolution Method on the DBLP dataset, we repeat the
experiment on the Facebook dataset, which is a wall-posting dataset: when a user
posts on another user’s wall, a link is created between them. As in the DBLP
experiment, we separate the Facebook dataset into 17 graphs, each covering four
months, starting from Feb. 2006, with a two-month overlap between consecutive graphs.
There are 45613 nodes and 1559167 edges in total in these graphs.
We run the Long-term Evolution Method on the Facebook dataset and get 9985
communities and 5672 evolution chains in total. Again, we convert the evolution
chains into evolution types with 𝛼 = 1 and 𝛽 = 1, and remove the evolution types of
communities in the last timeslot. Table 5-4 and Figure 5-3 show the distribution of
the evolution types in the Facebook dataset.
Table 5-4 Number of each types in Long-term Evolution Method in Facebook
# of Event
Continue 3036
Death 4610
Merge 533
Split 655
Shrink 313
Growth 838
Figure 5-3 Pie chart of types in Long-term Evolution Method in Facebook
The two major evolution types in the result are still “Death” and “Continue”, and
“Death” accounts for almost half of all types. We also run SGCI on the Facebook
dataset. Table 5-5 and Figure 5-4 show the result of SGCI.
Table 5-5 Number of each type in SGCI in Facebook
# of Event in SGCI
Addition 562
Merge 93
Split 36
Split_Merge 0
Figure 5-4 Pie chart of types in SGCI in Facebook
In Table 5-5 and Figure 5-4, we can see that the distribution of SGCI types is more
uneven than ours, which is the same phenomenon we saw in the DBLP experiment. As in
the DBLP experiment, we compare the SGCI types with our types in Table 5-6.
Table 5-6 Comparison of our types and SGCI types in Facebook
SGCI Type    Our Type
             Continue  Death  Growth  Shrink  Merge  Split
Constancy    3028      2      4       2       53     35
Decay        3         4101   4       2       310    488
The communities typed “Constancy” by SGCI are almost the same as the communities
typed “Continue” in our definition, and the type “Decay” is mostly detected as
“Death” in our types. The types “Growth” and “Shrink” in our result are mostly
detected as “Change Size” in SGCI. This matches our earlier observation that our
types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”, and
“Change Size” in SGCI. However, the other types do not match.
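A comparison table of this kind is a cross-tabulation of two labelings of the same
communities. The following sketch shows one way to build it. The function name and
the toy labels are hypothetical and do not reproduce the actual data.

```python
from collections import Counter


def cross_tabulate(our_types, sgci_types):
    """Count communities for every (SGCI type, our type) pair.

    Both lists must be aligned: position i holds the two labels that the
    two algorithms assign to the same community.
    """
    if len(our_types) != len(sgci_types):
        raise ValueError("label lists must have the same length")
    return Counter(zip(sgci_types, our_types))


# Toy labels for five communities (illustrative only).
our = ["Continue", "Death", "Death", "Growth", "Split"]
sgci = ["Constancy", "Decay", "Decay", "Change Size", "Decay"]

table = cross_tabulate(our, sgci)
for (sgci_type, our_type), count in sorted(table.items()):
    print(f"{sgci_type:12s} {our_type:10s} {count}")
```

Large counts concentrated in one cell per row, as for “Constancy” and “Decay” in
Table 5-6, indicate that the two type definitions largely agree.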
5.3 Case Studying in Facebook Dataset
We carry out a case study on the Facebook dataset by analyzing the distribution of
evolution types in different timeslots and matching events in Facebook’s history with
the detection result. Although [3] contains many case studies that support the
Long-term Evolution Method, it did not apply the algorithm to the Facebook dataset.
In this section, we therefore analyze the evolution in the Facebook dataset to
support both this algorithm and our method of converting evolution chains into
evolution types. Facebook is a large company that is still growing, and its history
is well documented on the Internet, so we may be able to link the evolution we
detected in the Facebook dataset to what actually happened in Facebook’s history. To
this end, we plot the distribution of evolution types in each timeslot. Figure 5-5
shows the percentage of each type in each timeslot.
Figure 5-5 Percentages of evolution types in each timeslot in Facebook
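The per-timeslot percentages plotted in this figure can be computed as in the
following sketch. The (time_index, evolution_type) input format, the function name,
and the toy events are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def type_shares(events):
    """Return {time_index: {evolution_type: fraction}} from an iterable of
    (time_index, evolution_type) pairs."""
    per_slot = defaultdict(Counter)
    for time_index, evolution_type in events:
        per_slot[time_index][evolution_type] += 1
    shares = {}
    for time_index, counts in per_slot.items():
        total = sum(counts.values())
        shares[time_index] = {t: n / total for t, n in counts.items()}
    return shares


# Toy events (illustrative only).
toy_events = [(0, "Split"), (0, "Death"),
              (1, "Split"), (1, "Split"), (1, "Death"), (1, "Continue")]
toy_shares = type_shares(toy_events)
```

Comparing the share of one type, such as “Split”, between consecutive time indices
is what reveals the jumps discussed in this section.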
There are two interesting points in this plot. First, in the second timeslot (time
index 1), the percentage of type “Split” is much larger than in the previous
timeslot. Second, in the 14th timeslot (time index 13), the percentage of type
“Split” is again much larger than in the previous timeslot. The second timeslot
contains the data from 2006/4 to 2006/7. In 2006/9, Facebook was opened to everyone
at least 13 years old with a valid email address. A possible reason why “Split”
becomes more frequent is that nodes active in 2006/4 ~ 2006/7 leave their own
communities as more friends join Facebook and they form new communities with the
newly joined friends. The 14th timeslot contains the data from 2008/4 to 2008/7. In
2008/7, Facebook was fully opened to users in China because of the 2008 Summer
Olympic Games in Beijing. So, again, the structure of Facebook communities changed
considerably in this period for the same reason: new friends, in this case Chinese
users, joined Facebook. As China later blocked Facebook, we might see another large
change in the community evolution types if we had more data.
Chapter 6
Community Evolution Prediction
In this chapter, we use the detection result from Chapter 5 to build a prediction
model for community evolution. In Chapter 5, we tested the Long-term Evolution Method
on real world datasets and gave a method to convert community evolution chains into
community evolution types. The Long-term Evolution Method achieved perfect detection
on our synthetic data in Chapter 4. In this chapter, we want to do more: predict the
evolution types. We want to show that