
Chapter 3 Our Method

3.6 Features Selection for Community Evolution Prediction

3.6.4 Features about the previous evolution type

The final group contains features about the previous evolution type. Every community has two evolution types: one describes where the community came from, and the other describes what will happen to it. If a community is the result of two communities merging in the previous timeslot, its previous evolution type is “Merge”. The previous evolution type is one of “Birth”, “Merge”, “Split”, “Growth”, “Shrink”, and “Continue”. Note that “Death” is not a possible previous evolution type, as mentioned in Section 3.3. We also add three Boolean features to this group. The first records whether the community survived from the previous timeslot, the second records whether its membership increased compared to the previous timeslot, and the third records whether it was formed by a merge in the previous timeslot. Each of the three takes the value “true” or “false”. In Chapter 6, we predict not only the evolution types but also whether a community will live, whether it will grow, and whether it will merge, so we add these three features as inputs; this is discussed in Section 6.1. Because a community cannot evolve from a “Death” community, the feature recording whether a community lived is always true. It carries no information, so we remove it from the prediction model. Before this removal, the group contains 4 features, feature 307 to feature 310.
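
The four features of this group can be pictured as one categorical value plus three Booleans. The following is a minimal sketch under our own assumptions: the names `PreviousEvolutionFeatures` and `encode` are hypothetical, and encoding the type as an integer index is only one possible choice that the text does not prescribe.

```python
from dataclasses import dataclass

# The six previous evolution types listed in the text ("Death" is excluded).
PREVIOUS_TYPES = ("Birth", "Merge", "Split", "Growth", "Shrink", "Continue")


@dataclass(frozen=True)
class PreviousEvolutionFeatures:
    previous_type: str  # one of PREVIOUS_TYPES
    survived: bool      # always True, so it is dropped before training
    grew: bool          # membership increased since the previous timeslot
    merged: bool        # community was formed by a merge


def encode(features, drop_survived=True):
    """Turn the feature group into a numeric row (assumed integer encoding)."""
    if features.previous_type not in PREVIOUS_TYPES:
        raise ValueError(f"unknown previous type: {features.previous_type}")
    row = [PREVIOUS_TYPES.index(features.previous_type)]
    if not drop_survived:
        row.append(int(features.survived))
    row.append(int(features.grew))
    row.append(int(features.merged))
    return row


example = PreviousEvolutionFeatures("Merge", survived=True, grew=True, merged=True)
print(encode(example))                       # 3 values after dropping `survived`
print(encode(example, drop_survived=False))  # the original 4 values
```

With `drop_survived=True` the row has three entries, which mirrors the removal of the always-true feature described above.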

Chapter 4

Synthetic Dataset Experiment

In this chapter, we use synthetic data that we generate to test the evolution detection algorithm. In Chapter 2, we introduced the Long-term Evolution Method designed by Weixu Lin, and we decided to use it as our main algorithm for further evolution analysis and evolution prediction. However, his algorithm has not been fully verified. In this chapter, we verify the long-term evolution algorithm (LM1) by testing it on datasets that contain evolution ground truth. We are also interested in the two main parameters of the algorithm, p-core and threshold, so we conduct experiments to show how these two parameters affect the detection result.

4.1 Experiment Result

and the measurement for judging Long-term Evolution Method. In our synthetic data, we set up 4 communities, each containing 20 nodes. Nodes in the same community connect to each other with probability 0.9, which is 𝑝𝑖𝑛 in the Stochastic Block Model, and nodes in different communities connect with probability 0.1, which is 𝑝𝑜𝑢𝑡. We generate 6 graphs, which gives 5 evolution steps. We set the probability of correct migration to 1.0, which means there is no migration noise. We do not add core nodes in this experiment, because we would like to see how Long-term Evolution Method performs when communities contain no core nodes. For the arguments of Long-term Evolution Method, we set p-core to 0.2 and threshold to 0.05. The first step of the algorithm is to apply community detection to each graph. Here we use the community ground truth instead, because we want to evaluate the evolution detection itself: errors in community detection would affect the evolution detection result. By using the community ground truth, we keep community detection 100% correct. The result is shown in Table 4-1.
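
To make the setup concrete, the following sketch samples one Stochastic Block Model snapshot with the parameters stated above (4 communities of 20 nodes, 𝑝𝑖𝑛 = 0.9, 𝑝𝑜𝑢𝑡 = 0.1). It is only an illustration: the function name `sbm_graph` and the seed are our own choices, it is not the generator used for the experiments, and it does not model the migration of nodes between timeslots.

```python
import random
from itertools import combinations


def sbm_graph(num_communities=4, community_size=20, p_in=0.9, p_out=0.1, seed=0):
    """Sample one undirected Stochastic Block Model graph.

    Returns (membership, edges): membership maps node -> community index,
    edges is a set of (u, v) pairs with u < v.
    """
    rng = random.Random(seed)
    membership = {
        node: node // community_size
        for node in range(num_communities * community_size)
    }
    edges = set()
    for u, v in combinations(sorted(membership), 2):
        # Same community: connect with p_in, otherwise with p_out.
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.add((u, v))
    return membership, edges


membership, edges = sbm_graph()
intra = sum(1 for u, v in edges if membership[u] == membership[v])
inter = len(edges) - intra
print(f"nodes={len(membership)} intra-community edges={intra} inter-community edges={inter}")
```

With these parameters there are 760 possible intra-community pairs and 2400 possible inter-community pairs, so roughly 684 and 240 edges are expected, respectively.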

Table 4-1 Evolution chains detected by Long-term Evolution Method

Chain Index   Community Evolution Chain
0             0:0, 1:0, 4:1, 10:2
1             3:0, 7:1, 12:2
2             8:1
3             2:0, 5:1, 6:1, 11:2, 14:3
4             9:2, 13:3, 18:4, 19:4, 22:5
5             16:3, 17:3, 21:4, 25:5
6             15:3, 20:4, 23:5, 24:5
7             26:5

Table 4-1 shows the result of Long-term Evolution Method. The algorithm detects 8 chains. A chain can contain several communities at the same time index. Each element of an evolution chain is written as <Community Index>:<Time Index>. For example, in the first evolution chain, 1:0 denotes the community with index 1 at time index 0. The evolution ground truth is shown in Table 4-2.
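
The chain syntax can be read mechanically. The sketch below parses chain 0 of Table 4-1 and groups its communities by time index; the helper names `parse_chain` and `group_by_time` are hypothetical and serve only this illustration.

```python
from collections import defaultdict


def parse_chain(text):
    """Parse '0:0, 1:0, 4:1' into a list of (community_index, time_index) pairs."""
    pairs = []
    for element in text.split(","):
        community, time = element.strip().split(":")
        pairs.append((int(community), int(time)))
    return pairs


def group_by_time(pairs):
    """Group community indices by their time index."""
    by_time = defaultdict(list)
    for community, time in pairs:
        by_time[time].append(community)
    return dict(by_time)


chain0 = parse_chain("0:0, 1:0, 4:1, 10:2")  # chain 0 of Table 4-1
print(group_by_time(chain0))  # {0: [0, 1], 1: [4], 2: [10]}
```

The output shows that communities 0 and 1 share time index 0, that is, the chain starts with two communities in the first graph.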

Table 4-2 Ground truth of evolution chain in synthetic data

Chain Index   Community Evolution Chain
0             0:0, 1:0, 4:1, 9:2, 10:2, 13:3, 18:4, 19:4, 22:5
1             2:0, 5:1, 6:1, 11:2, 14:3, 15:3, 20:4, 23:5, 24:5
2             3:0, 7:1, 8:1, 12:2, 16:3, 17:3, 21:4, 25:5, 26:5

As we can see, the evolution chains detected by Long-term Evolution Method differ from the ground truth. However, the result is still partly correct: every detected chain in Table 4-1 is a subset of a single ground-truth chain in Table 4-2, so the algorithm breaks the true chains into pieces but never mixes two of them.

Because the result is not good, we modify one argument of the algorithm. We set p-core to 1.0, which means that we do not single out core nodes (equivalently, every node is a core node), so the algorithm computes the similarity of communities using all of their nodes. The result is shown in Table 4-3.

Table 4-3 Evolution chains detected by LM1 with new arguments

Chain Index   Community Evolution Chain
0             0:0, 1:0, 4:1, 9:2, 10:2, 13:3, 18:4, 19:4, 22:5
1             2:0, 5:1, 6:1, 11:2, 14:3, 15:3, 20:4, 23:5, 24:5
2             3:0, 7:1, 8:1, 12:2, 16:3, 17:3, 21:4, 25:5, 26:5

The result is identical to the ground truth. The normalized mutual information (NMI) of the result is 100%, which shows that the detected evolution chains are entirely correct and verifies Long-term Evolution Method on this dataset. We get a perfect result with p-core set to 1.0 because our synthetic data has no core structure in its communities, so setting p-core to 0.2 is not a good choice for this data. In most real world datasets, however, communities contain some core nodes, so we need a new synthetic dataset whose communities contain core structure. Besides, this synthetic data has almost no noise. The only noise comes from the Stochastic Block Model: because we set 𝑝𝑖𝑛 to 0.9, two nodes in the same community have no connection with probability 0.1. In real world data, community evolution is more complicated and much noisier. We therefore run a new experiment to test a more complicated evolution and to analyze the impact of p-core.
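
As an illustration of how the chains can be scored, the sketch below treats each chain as a cluster of "community:time" elements and computes the NMI between the chains of Table 4-1 and the ground truth of Table 4-2. The text does not state which normalization is used, so dividing by the arithmetic mean of the two entropies is an assumption, and the helper names are hypothetical.

```python
import math
from collections import Counter


def labels_from_chains(chains):
    """Map each 'community:time' element to the index of the chain holding it."""
    labels = {}
    for chain_index, chain in enumerate(chains):
        for element in chain.split(","):
            labels[element.strip()] = chain_index
    return labels


def entropy(counts, total):
    return -sum(c / total * math.log(c / total) for c in counts)


def nmi(labels_a, labels_b):
    """NMI of two labelings of the same elements (mean-entropy normalization, assumed)."""
    elements = sorted(labels_a)
    if elements != sorted(labels_b):
        raise ValueError("both labelings must cover the same elements")
    n = len(elements)
    count_a = Counter(labels_a[e] for e in elements)
    count_b = Counter(labels_b[e] for e in elements)
    joint = Counter((labels_a[e], labels_b[e]) for e in elements)
    mutual_information = sum(
        c / n * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    mean_entropy = (entropy(count_a.values(), n) + entropy(count_b.values(), n)) / 2
    return 1.0 if mean_entropy == 0 else mutual_information / mean_entropy


ground_truth = labels_from_chains([
    "0:0, 1:0, 4:1, 9:2, 10:2, 13:3, 18:4, 19:4, 22:5",
    "2:0, 5:1, 6:1, 11:2, 14:3, 15:3, 20:4, 23:5, 24:5",
    "3:0, 7:1, 8:1, 12:2, 16:3, 17:3, 21:4, 25:5, 26:5",
])
detected = labels_from_chains([
    "0:0, 1:0, 4:1, 10:2", "3:0, 7:1, 12:2", "8:1", "2:0, 5:1, 6:1, 11:2, 14:3",
    "9:2, 13:3, 18:4, 19:4, 22:5", "16:3, 17:3, 21:4, 25:5",
    "15:3, 20:4, 23:5, 24:5", "26:5",
])
print(nmi(ground_truth, ground_truth))  # identical chains give an NMI of 1
print(nmi(detected, ground_truth))      # the fragmented chains of Table 4-1 score lower
```

Under this normalization, the fragmented chains of Table 4-1 score about 0.72, while chains identical to the ground truth, such as those in Table 4-3, score 1.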

4.2 Experiment Result with Cores and Noise

In this experiment, we investigate the impact of p-core and threshold. We add core nodes in order to analyze the impact of p-core, and we also add noise to the synthetic dataset. We start by analyzing the impact of p-core. We set 𝑝𝑖𝑛 to 0.5 and 𝑝𝑜𝑢𝑡 to 0.1, and vary the correct migration probability from 1.0 to 0.1. The new synthetic data has 8 communities with 40 nodes each, and 4 of the 40 nodes in each community are set to be core nodes. We test Long-term Evolution Method with p-core from 1.0 to 0.1 and threshold equal to 0.05. We plot the normalized mutual information in Figure 4-1.

Figure 4-1 NMI versus p-core and correct migration probability

In Figure 4-1, we can see that when the correct migration probability is large, p-core has little impact on the normalized mutual information. However, as the correct migration probability gets smaller, selecting a smaller p-core improves the result. The reason is that nodes then have a larger probability of randomly migrating to the wrong communities. In that case it is better for Long-term Evolution Method to compute the similarity of communities using only the core nodes, because core nodes do not migrate to wrong communities. With a smaller p-core, we have a greater chance of selecting exactly the core nodes of each community and thus of detecting the evolution correctly. The figure shows that setting p-core to 0.3 or 0.2 gives a better result in most cases, so we set p-core to 0.2 when testing the real world datasets in Chapter 5. A smaller p-core also means that the algorithm considers fewer nodes when calculating the similarity of communities, which reduces the running time. Figure 4-2 shows the computing time for different values of p-core. We can see that the computing time decreases as we choose a lower p-core.

Figure 4-2 Execution time versus p-core
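
The effect of p-core can be illustrated with a toy example. This chapter does not spell out how core nodes are ranked or how the similarity is computed, so the sketch below rests on two loud assumptions: core nodes are the top p-core fraction of members ranked by degree, and similarity is the Jaccard index of the two core sets. All names and the toy community are hypothetical.

```python
def core_nodes(members, degree, p_core):
    """Return the top p_core fraction of members, ranked by degree.

    Ranking by degree is an assumption made for this sketch.
    """
    keep = max(1, round(p_core * len(members)))
    ranked = sorted(members, key=lambda node: (-degree[node], node))
    return set(ranked[:keep])


def core_similarity(members_a, members_b, degree, p_core):
    """Jaccard similarity of the two core sets (an assumed similarity measure)."""
    core_a = core_nodes(members_a, degree, p_core)
    core_b = core_nodes(members_b, degree, p_core)
    return len(core_a & core_b) / len(core_a | core_b)


# Hypothetical community of 40 nodes whose 4 core nodes (0 to 3) have high degree.
before = set(range(40))
# One timeslot later: the cores stay, 18 ordinary members leave, 18 strangers join.
after = set(range(22)) | set(range(100, 118))
degree = {node: (30 if node < 4 else 10) for node in before | after}

print(core_similarity(before, after, degree, p_core=1.0))  # all 40 nodes compared
print(core_similarity(before, after, degree, p_core=0.1))  # only the 4 cores compared
```

In this toy case the all-node similarity is 22/58, while the core-only similarity is 1.0, and the core-only comparison touches 4 nodes per community instead of 40. This is consistent with the observation that a small p-core is both more robust to noisy migration and cheaper to compute.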

Now we turn to the impact of threshold. We set p-core to 1.0 and the correct migration probability to 0.5, vary threshold from 1.0 to 0.1, and keep the other arguments the same as in the p-core analysis. Figure 4-3 plots the normalized mutual information against threshold.

Figure 4-3 Normalized mutual information versus threshold

As Figure 4-3 shows, when threshold is less than 0.5, the normalized mutual information is stuck at 60%, but as soon as threshold is larger than 0.5, the normalized mutual information jumps to 100%. The threshold acts like a barrier. The result shows that we should not set threshold too large, so in the following real world dataset experiments we set threshold to 0.05. The threshold value 0.05 was also used in [3] for real world datasets.

Chapter 5

Real World Dataset Experiment

This chapter applies Long-term Evolution Method to real world datasets. In Chapter 4, we verified that the algorithm is correct and learned how p-core and threshold affect the detection result. In this chapter, we test the algorithm on real world datasets, namely the Facebook dataset [6] and the DBLP dataset [7], to see how well Long-term Evolution Method performs in real cases. Because real world datasets have no ground truth, we evaluate the algorithm through case studies and by comparing our result with another algorithm. In this chapter, we also use our method to convert evolution chains into evolution types: Continue, Merge, Split, Growth, Shrink, Death, and Birth. These types will be used to

5.1 Experiment on DBLP

We first test Long-term Evolution Method on the DBLP dataset, which is a co-author network. We clean the dataset by removing the nodes (authors) who have published fewer than five papers. After the cleanup, we split the DBLP dataset by year and convert it into graphs. Each graph contains two years of data, and consecutive graphs overlap by one year, that is, by 50%. For example, the first graph contains the data of the first and second years, and the second graph contains the data of the second and third years. We get 11 graphs, which cover the data from 1995 to 2005. There are 148334 nodes and 1210663 edges in the graphs. We build the graphs with overlap because it makes the evolution of communities much easier to catch. This matters even more for the Facebook dataset, which is much smaller than the DBLP dataset: each of its graphs has fewer nodes and communities, which makes evolution harder to detect.

Based on what we learned in Chapter 4, we can choose proper values of p-core and threshold for Long-term Evolution Method: we set p-core to 0.2 and threshold to 0.05. Running the algorithm on the DBLP dataset takes 174999 seconds (excluding the time to read the graphs). The result contains 83301 communities and 32038 evolution chains. To interpret the result, we convert the evolution chains into the evolution types we defined. Setting 𝛼 = 1 and 𝛽 = 1, we list the distribution of evolution types in Table 5-1 and plot it in Figure 5-1. We remove the evolution types of the communities in the last timeslot, because they are always “Death”: we have only 11 graphs, so every community in the 11th graph has type “Death”.

Table 5-1 Number of each type in Long-term Evolution Method in DBLP (column: # of Event)

Figure 5-1 Pie chart of types in Long-term Evolution Method in DBLP

As we can see, “Death” and “Continue” account for most of the evolution types, and the other types are distributed very evenly. Because the DBLP dataset has no ground truth for community detection or community evolution, the only option is to do case studies. [3] presents many case studies on the DBLP dataset, the Enron dataset, and the U.S. Senate dataset, and shows that the algorithm performs well on real world datasets, so we do not do further case studies on DBLP. Instead, we compare our result with SGCI, a well-known evolution detection algorithm. The evolution types in SGCI are “Addition”, “Change Size”, “Constancy”, “Decay”, “Deletion”, “Merge”, “Split”, and “Split_Merge”. The SGCI types “Merge” and “Split” are the same as our definition of “Merge” and “Split”. Our types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”, and “Change Size” in SGCI. The type “Decay” is the same as the type “Death” in our definition. The distribution of SGCI types is shown in Table 5-2 and Figure 5-2.

Table 5-2 Number of each type in SGCI in DBLP (column: # of Event in SGCI)

Figure 5-2 Pie chart of types in SGCI in DBLP

As Figure 5-2 shows, the distribution of evolution types in SGCI is very uneven. A possible reason is that SGCI has 8 different types while we have only 6, and its type definitions differ from ours. In Table 5-3, we compare the types in SGCI with the types in our definition.

Table 5-3 Comparison of our types and SGCI types in DBLP (our types: Continue, Death, Growth, Shrink, Merge, Split; SGCI types: Addition, Change Size, Constancy, Decay, Deletion, Merge, Split, Split_Merge)

The result shows that, except for the SGCI types “Decay” and “Constancy”, which almost match our types “Death” and “Continue”, the types do not match very well, because the other types are defined differently from ours. We cannot tell which algorithm is better, since each algorithm has its own definition of evolution types. We discuss each factor further in Chapter 6, where we predict evolution types.

5.2 Experiment on Facebook

After testing Long-term Evolution Method on the DBLP dataset, we repeat the experiment on the Facebook dataset, which is a wall-posting dataset: when a user posts on another user’s wall, there is a link between them. As in the DBLP experiment, we split the Facebook dataset into graphs: 17 graphs of four months each, starting from Feb. 2006, with a two-month overlap between consecutive graphs. In total, there are 45613 nodes and 1559167 edges in these graphs.
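
The overlapping timeslots can be enumerated with a small helper. The sketch below assumes that each graph covers four whole calendar months and that consecutive graphs start two months apart; the function name `month_windows` and its parameters are our own illustration rather than the code used in the experiment.

```python
def month_windows(start_year, start_month, length, step, count):
    """Return `count` windows of `length` months whose starts are `step` months apart.

    Each window is ((first_year, first_month), (last_year, last_month)), inclusive.
    """
    def to_year_month(index):
        return index // 12, index % 12 + 1

    first = start_year * 12 + (start_month - 1)
    windows = []
    for k in range(count):
        begin = first + k * step
        windows.append((to_year_month(begin), to_year_month(begin + length - 1)))
    return windows


facebook = month_windows(2006, 2, length=4, step=2, count=17)
print(facebook[0])   # ((2006, 2), (2006, 5))
print(facebook[1])   # ((2006, 4), (2006, 7))
print(facebook[13])  # ((2008, 4), (2008, 7))
```

Under these assumptions, time index 1 covers 2006/4 to 2006/7 and time index 13 covers 2008/4 to 2008/7, which agrees with the timeslots discussed in Section 5.3.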

We run Long-term Evolution Method on the Facebook dataset and get 9985 communities and 5672 evolution chains in total. Again, we convert the evolution chains into evolution types with 𝛼 = 1 and 𝛽 = 1, and remove the evolution types of the communities in the last timeslot. Table 5-4 and Figure 5-3 show the distribution of the evolution types in the Facebook dataset.

Table 5-4 Number of each type in Long-term Evolution Method in Facebook

Type        # of Event
Continue    3036
Death       4610
Merge       533
Split       655
Shrink      313
Growth      838

Figure 5-3 Pie chart of types in Long-term Evolution Method in Facebook
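
The counts in Table 5-4 translate directly into the shares that the pie chart displays. The short sketch below does this arithmetic; the function name `type_shares` is hypothetical.

```python
def type_shares(counts):
    """Convert event counts per type into fractions of the total."""
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Counts copied from Table 5-4.
table_5_4 = {
    "Continue": 3036,
    "Death": 4610,
    "Merge": 533,
    "Split": 655,
    "Shrink": 313,
    "Growth": 838,
}

shares = type_shares(table_5_4)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<9}{table_5_4[name]:>6}{share:>8.1%}")
```

The counts sum to 9985, and “Death” makes up about 46% of them, followed by “Continue” at about 30%.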

The two major evolution types in the result are still “Death” and “Continue”, and “Death” accounts for almost half of all types. We also run SGCI on the Facebook dataset. Table 5-5 and Figure 5-4 show the result of SGCI.

Table 5-5 Number of each type in SGCI in Facebook (column: # of Event in SGCI; recoverable entries: Addition 562, Merge 93, Split 36, Split_Merge 0)

Figure 5-4 Pie chart of types in SGCI in Facebook

In Table 5-5 and Figure 5-4, we can see that the distribution of SGCI types is more uneven than ours, the same phenomenon that we saw in the DBLP experiment. As in the DBLP experiment, we compare the types in SGCI with the types we defined in Table 5-6.

Table 5-6 Comparison of our types and SGCI types in Facebook

             Our Type
SGCI Type    Continue  Death  Growth  Shrink  Merge  Split
Constancy    3028      2      4       2       53     35
Decay        3         4101   4       2       310    488

The communities with type “Constancy” are almost the same as the communities with type “Continue” in our definition, and the type “Decay” is mostly detected as “Death” in our types. The types “Growth” and “Shrink” in our result are mostly detected as “Change Size” in SGCI. This matches our expectation, since our types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”, and “Change Size” in SGCI. However, the other types do not match.
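
How closely an SGCI type matches one of our types can be quantified by the largest cell of its row in Table 5-6. The sketch below applies this to the “Constancy” and “Decay” rows; the helper name `best_match` is hypothetical.

```python
OUR_TYPES = ("Continue", "Death", "Growth", "Shrink", "Merge", "Split")

# The "Constancy" and "Decay" rows of Table 5-6, in the column order of OUR_TYPES.
table_5_6 = {
    "Constancy": (3028, 2, 4, 2, 53, 35),
    "Decay": (3, 4101, 4, 2, 310, 488),
}


def best_match(row):
    """Return (our_type, fraction) for the largest cell of one SGCI row."""
    total = sum(row)
    count, our_type = max(zip(row, OUR_TYPES))
    return our_type, count / total


for sgci_type, row in table_5_6.items():
    our_type, fraction = best_match(row)
    print(f"{sgci_type:<10} -> {our_type:<9} {fraction:.1%}")
```

About 97% of the “Constancy” communities carry our type “Continue”, and about 84% of the “Decay” communities carry our type “Death”; most of the remaining “Decay” communities fall under our “Merge” and “Split”.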

5.3 Case Study on the Facebook Dataset

We do a case study on the Facebook dataset by analyzing the distribution of evolution types in different timeslots and matching events in Facebook’s history with the detection result. Although [3] contains many case studies that support Long-term Evolution Method, it did not apply the algorithm to the Facebook dataset. In this section, we therefore analyze the evolution in the Facebook dataset to support both the algorithm and our method of converting evolution chains into evolution types. Facebook is a large company that is still growing, and its history can be found on the Internet, so we may be able to link the evolution we detected in the Facebook dataset to what actually happened in Facebook’s history. To understand the evolution further, we plot the distribution of evolution types in each timeslot. Figure 5-5 shows the percentage of each type in each timeslot.

Figure 5-5 Percentages of evolution types in each timeslot in Facebook
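
The per-timeslot percentages behind such a plot can be computed as follows. This is a minimal sketch: the input format, a list of (time index, evolution type) records, and the toy records themselves are assumptions made for illustration and are not data from the experiment.

```python
from collections import Counter, defaultdict


def type_percentages(records):
    """Return {time_index: {evolution_type: fraction}} from (time_index, type) records."""
    per_slot = defaultdict(Counter)
    for time_index, evolution_type in records:
        per_slot[time_index][evolution_type] += 1
    return {
        time_index: {
            name: count / sum(counter.values()) for name, count in counter.items()
        }
        for time_index, counter in sorted(per_slot.items())
    }


# Hypothetical records, one per community.
toy_records = [
    (0, "Continue"), (0, "Death"),
    (1, "Split"), (1, "Split"), (1, "Death"), (1, "Continue"),
]
for time_index, fractions in type_percentages(toy_records).items():
    print(time_index, fractions)
```

Comparing the fraction of one type across consecutive time indices is how a jump such as the one in “Split” would be spotted.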


There are two interesting points in this plot. First, in the second timeslot (time index 1), the percentage of type “Split” is much larger than in the previous timeslot. Second, in the 14th timeslot (time index 13), the percentage of type “Split” is again much larger than in the previous timeslot. The second timeslot contains the data from 2006/4 to 2006/7. In 2006/9, Facebook was opened to everyone at least 13 years old with a valid email address. A possible reason why the share of “Split” grows is that the users active in 2006/4 to 2006/7 leave their own communities as more of their friends join Facebook, because they want to form new communities with the newly joined friends. The 14th timeslot contains the data from 2008/4 to 2008/7. In 2008/7, Facebook was fully opened to users in China because of the 2008 Summer Olympic Games in Beijing. So, again, the structure of Facebook communities changed a lot in this period for the same reason: new friends, in this case Chinese users, joined Facebook. Since China eventually blocked Facebook in 2009, we might see another big change in the community evolution types in Facebook if we had more data.

Chapter 6

Community Evolution Prediction

In this chapter, we use the detection result from Chapter 5 to build a prediction model for community evolution. In Chapter 5, we tested Long-term Evolution Method on real world datasets and gave a method to convert community evolution chains into community evolution types. Long-term Evolution Method can achieve perfect detection, as the synthetic experiment in Chapter 4 showed. In this chapter, we want to do more: predict the evolution types. We want to show that