Chapter 4 Synthetic Dataset Experiment

4.2 Experiment Result with Cores and Noise

In this experiment, we investigate the impact of p-core and the threshold. We add core nodes to the synthetic dataset in order to analyze the impact of p-core, and we also add noise. We start by analyzing the impact of p-core. We set 𝑝_𝑖𝑛 to 0.5 and 𝑝_𝑜𝑢𝑡 to 0.1, and vary the correct migration probability from 1.0 to 0.1. There are 8 communities in the new synthetic data, each containing 40 nodes, 4 of which are set to be core nodes. We test Long-term Evolution Method with p-core from 1.0 to 0.1 and the threshold equal to 0.05. We plot the normalized mutual information (NMI) in Figure 4-1.

Figure 4-1 NMI versus p-core and correct migration probability
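The synthetic setup described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the generator used in the experiment: the function names `planted_partition` and `migrate`, the choice of the first 4 nodes of each community as its core nodes, and the noise model (a non-core node moves to its intended community with the correct migration probability and otherwise to a uniformly random wrong community) are all assumptions made for illustration.

```python
import random
from itertools import combinations


def planted_partition(num_communities=8, community_size=40,
                      p_in=0.5, p_out=0.1, num_core=4, seed=0):
    """Generate one synthetic snapshot: node labels, core nodes, and edges."""
    rng = random.Random(seed)
    n = num_communities * community_size
    # Node i belongs to community i // community_size.
    label = [i // community_size for i in range(n)]
    # Assumption: the first `num_core` nodes of each community are its cores.
    cores = {i for i in range(n) if i % community_size < num_core}
    edges = set()
    for u, v in combinations(range(n), 2):
        # Intra-community pairs are linked with p_in, other pairs with p_out.
        p = p_in if label[u] == label[v] else p_out
        if rng.random() < p:
            edges.add((u, v))
    return label, cores, edges


def migrate(target, cores, p_correct, num_communities=8, seed=0):
    """Assumed noise model: core nodes always reach their target community;
    other nodes reach it with probability p_correct, else a wrong one."""
    rng = random.Random(seed)
    new_label = []
    for node, goal in enumerate(target):
        if node in cores or rng.random() < p_correct:
            new_label.append(goal)
        else:
            wrong = [c for c in range(num_communities) if c != goal]
            new_label.append(rng.choice(wrong))
    return new_label
```

With `p_correct=1.0` every node follows its target community, and with `p_correct=0.0` only the core nodes do, which mirrors the two ends of the range explored in the experiment.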

In Figure 4-1, we can see that when the correct migration probability is large, p-core has little impact on the normalized mutual information. However, as the correct migration probability gets smaller, selecting a smaller p-core improves the result. The reason is that nodes then have a larger probability of randomly migrating to wrong communities. The better way for Long-term Evolution Method is to calculate the similarity of communities by considering only the core nodes, because core nodes do not migrate to wrong communities. When we choose a smaller p-core, we have a greater chance of selecting only the core nodes in communities, and thus of detecting the evolution correctly. The figure shows that setting p-core to 0.3 or 0.2 gives better results in most cases, so we set p-core to 0.2 when testing the real world datasets in Chapter 5. Selecting a smaller value of p-core also means that the algorithm considers fewer nodes when calculating the similarity of communities, which reduces the running time.

Figure 4-2 shows the computing time for different values of p-core. We can see that the computing time decreases as we choose a lower p-core.

Figure 4-2 Execution time versus p-core
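As a rough illustration of why a smaller p-core means fewer nodes to compare, the sketch below assumes that p-core is the fraction of top-ranked members kept as the core of a community, that members are ranked by some per-node score, and that two communities are compared by the Jaccard similarity of their cores against the threshold. The ranking score, the Jaccard measure, and the names `core_nodes`, `core_similarity`, and `is_same_community` are assumptions for illustration; the actual definitions are those of Long-term Evolution Method in [3].

```python
import math


def core_nodes(members, score, p_core):
    """Keep the top ceil(p_core * |members|) members ranked by score."""
    k = max(1, math.ceil(p_core * len(members)))
    # Sort by descending score; break ties by node id for determinism.
    ranked = sorted(members, key=lambda v: (-score[v], v))
    return set(ranked[:k])


def core_similarity(a, b, score_a, score_b, p_core):
    """Jaccard similarity computed on the core nodes only (assumed measure)."""
    ca = core_nodes(a, score_a, p_core)
    cb = core_nodes(b, score_b, p_core)
    return len(ca & cb) / len(ca | cb)


def is_same_community(a, b, score_a, score_b, p_core, threshold):
    """Link two communities when their core similarity reaches the threshold."""
    return core_similarity(a, b, score_a, score_b, p_core) >= threshold
```

With `p_core=1.0` every member takes part in the comparison, while a small `p_core` restricts it to a few highly ranked nodes, so both the noise from migrating nodes and the amount of work shrink.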

Now we turn to analyzing the impact of the threshold. We set p-core to 1.0 and the correct migration probability to 0.5, vary the threshold from 1.0 to 0.1, and keep the other arguments the same as in the p-core analysis. Figure 4-3 plots the normalized mutual information against the threshold.

Figure 4-3 Normalized mutual information versus the threshold

As Figure 4-3 shows, when the threshold is less than 0.5, the normalized mutual information is stuck at 60%. However, as soon as we set the threshold larger than 0.5, the normalized mutual information jumps to 100%. The threshold acts like a barrier. The result shows that we should not set the threshold too large, so in the following real world dataset experiments we set the threshold to 0.05. The threshold value 0.05 was also used in [3] when testing real world datasets.
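For reference, the normalized mutual information between a detected partition and the ground truth can be computed as in the sketch below. The normalization by the arithmetic mean of the two entropies, NMI = 2 I(A;B) / (H(A) + H(B)), is an assumption, since several normalizations are in use and the one used in the experiment is not restated here; the function name `nmi` is also illustrative.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """NMI of two labelings of the same nodes, normalized by mean entropy."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Entropy of each labeling.
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    # Mutual information from the joint label counts.
    mi = sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    if h_a + h_b == 0:
        # Both labelings put every node in a single group.
        return 1.0
    return 2 * mi / (h_a + h_b)
```

The score does not depend on how the communities are numbered, so a detected partition that matches the ground truth up to relabeling reaches the maximum value.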

Chapter 5 Real World Dataset Experiment

This chapter applies Long-term Evolution Method to real world datasets. In Chapter 4, we verified that the algorithm is correct and learned how p-core and the threshold affect the detection result. In this chapter, we test the algorithm on real world datasets, namely the Facebook dataset [6] and the DBLP dataset [7], to see how well Long-term Evolution Method performs in real cases. Because there is no ground truth for real world datasets, we evaluate the algorithm through case studies and by comparing our result with another algorithm.

In this chapter, we use our method to change evolution chains into evolution types:

Continue, Merge, Split, Growth, Shrink, Death, and Birth. These types will be used to

5.1 Experiment on DBLP

We test Long-term Evolution Method with the DBLP dataset first. The DBLP dataset is a co-author network. We clean up the dataset by removing the nodes (authors) who have published fewer than five papers. After the cleanup, we separate the DBLP dataset by year and convert it to graphs. Each graph contains the data of two years, and consecutive graphs overlap by 50%, that is, by one year. For example, the first graph contains the data of the first and second years, and the second graph contains the data of the second and third years. We get 11 graphs, which cover the data from 1995 to 2005. There are 148334 nodes and 1210663 edges in these graphs. We build the graphs with some overlap because it makes the evolution of communities much easier to catch. This matters even more when we test the algorithm with the Facebook dataset, because the Facebook dataset is much smaller than the DBLP dataset. Each graph in the Facebook dataset has fewer nodes and communities than a graph in the DBLP dataset, which makes the evolution in the Facebook dataset more difficult to detect.
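The preprocessing described above (dropping authors with fewer than five papers, then building overlapping two-year graphs) can be sketched as follows. The input format, a list of `(year, [authors])` records, and the function names `filter_authors` and `windowed_graphs` are assumptions for illustration, not the format of the actual DBLP dump.

```python
from collections import Counter
from itertools import combinations


def filter_authors(papers, min_papers=5):
    """Drop authors who published fewer than `min_papers` papers."""
    counts = Counter(a for _, authors in papers for a in authors)
    keep = {a for a, c in counts.items() if c >= min_papers}
    return [(year, [a for a in authors if a in keep])
            for year, authors in papers]


def windowed_graphs(papers, first_year, last_year, width=2, step=1):
    """Build one co-author edge set per sliding window of `width` years."""
    graphs = []
    start = first_year
    while start + width - 1 <= last_year:
        years = range(start, start + width)
        edges = set()
        for year, authors in papers:
            if year in years:
                # Every pair of co-authors of a paper is linked.
                edges.update(combinations(sorted(set(authors)), 2))
        graphs.append(((start, start + width - 1), edges))
        start += step
    return graphs
```

With `width=2` and `step=1`, consecutive windows share one year, which is the 50% overlap used in this experiment.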

Based on what we learned in Chapter 4, we can set proper values of p-core and the threshold for Long-term Evolution Method: we set p-core to 0.2 and the threshold to 0.05. We run the algorithm on the DBLP dataset, and it takes 174999 seconds (excluding the time for reading the graphs) to finish the computation. The result contains 83301 communities and 32038 evolution chains. To interpret the result, we convert the evolution chains to the evolution types we defined. Setting 𝛼 = 1 and 𝛽 = 1, we list the distribution of the evolution types in Table 5-1 and plot it in Figure 5-1. We remove the evolution types of the communities in the last timeslot, because they are always “Death”: we have only 11 graphs, so the evolution types of the communities in the 11th graph are all “Death”.

Table 5-1 Number of events of each type in Long-term Evolution Method in DBLP

# of Event

Figure 5-1 Pie chart of types in Long-term Evolution Method in DBLP

As we can see, “Death” and “Continue” account for most of the evolution types, and the other types are distributed quite evenly. Because there is no ground truth for community detection and community evolution in the DBLP dataset, the only way to evaluate the result is through case studies. The author of [3] presents many cases on the DBLP dataset, the Enron dataset, and the U.S. Senate dataset, and shows that the algorithm performs well on real world datasets, so we do not carry out further case studies on the DBLP dataset. Instead, we compare our result with SGCI, a well-known evolution detection algorithm. The evolution types in SGCI are “Addition”, “Change Size”, “Constancy”, “Decay”, “Deletion”, “Merge”, “Split”, and “Split_Merge”. The types “Merge” and “Split” in SGCI are the same as our definition of “Merge” and “Split”. Our types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”, and “Change Size” in SGCI. The type “Decay” is the same as the type “Death” in our definition. The distribution of the SGCI types is shown in Table 5-2 and Figure 5-2.

Table 5-2 Number of events of each type in SGCI in DBLP

# of Event in SGCI

Figure 5-2 Pie chart of types in SGCI in DBLP

As Figure 5-2 shows, the distribution of evolution types in SGCI is very uneven. A possible reason is that SGCI has 8 different types whereas we have only 6, and its definitions of the types differ from ours. In Table 5-3 we compare the types in SGCI with the types in our definition.

Table 5-3 Comparison of our types and SGCI types in DBLP

Our Type: Continue, Death, Growth, Shrink, Merge, Split

SGCI Type: Addition, Change Size, Constancy, Decay, Deletion, Merge, Split, Split_Merge

The result shows that, except for “Decay” and “Constancy”, which almost match our types “Death” and “Continue”, the SGCI types do not match ours very well. This is because the other types are defined differently from ours. We cannot tell which algorithm is better, since each algorithm has its own definition of evolution types. We discuss each factor further in Chapter 6 when predicting evolution types.

5.2 Experiment on Facebook

After testing Long-term Evolution Method with the DBLP dataset, we repeat the experiment with the Facebook dataset, which is a dataset of wall posts. When a user posts something on another user’s wall, there is a link between the two users. As in the experiment on the DBLP dataset, we separate the Facebook dataset into graphs: 17 graphs of four months each, starting from Feb. 2006, with a two-month overlap between consecutive graphs. There are 45613 nodes and 1559167 edges in total in these graphs.
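The mapping from a time index to the months it covers follows directly from this setup (four-month graphs, two-month step, starting in Feb. 2006). The helper below is a small sketch of that arithmetic; the function name `timeslot_months` and its parameters are illustrative.

```python
def timeslot_months(index, start_year=2006, start_month=2, width=4, step=2):
    """Return ((year, month) of the first month, (year, month) of the last
    month) covered by the graph with the given 0-based time index."""
    def add_months(year, month, delta):
        # Work in months since year 0 so that year boundaries roll over.
        total = year * 12 + (month - 1) + delta
        return total // 12, total % 12 + 1

    first = add_months(start_year, start_month, index * step)
    last = add_months(first[0], first[1], width - 1)
    return first, last
```

For example, time index 1 covers 2006/4 to 2006/7 and time index 13 covers 2008/4 to 2008/7.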

We run Long-term Evolution Method on the Facebook dataset and get 9985 communities and 5672 evolution chains in total. Again, we convert the evolution chains to evolution types with 𝛼 = 1 and 𝛽 = 1, and remove the evolution types of the communities in the last timeslot. Table 5-4 and Figure 5-3 show the distribution of the evolution types in the Facebook dataset.

Table 5-4 Number of events of each type in Long-term Evolution Method in Facebook

Type        # of Event
Continue    3036
Death       4610
Merge       533
Split       655
Shrink      313
Growth      838

Figure 5-3 Pie chart of types in Long-term Evolution Method in Facebook

The two major evolution types in the result are still “Death” and “Continue”, and “Death” accounts for almost half of all events. We also run SGCI on the Facebook dataset. Table 5-5 and Figure 5-4 show the result of SGCI.

Table 5-5 Number of events of each type in SGCI in Facebook

# of Event in SGCI

Figure 5-4 Pie chart of types in SGCI in Facebook

In Table 5-5 and Figure 5-4, we find that the distribution of the SGCI types is more uneven than ours, the same phenomenon we saw in the experiment on the DBLP dataset. As we did for the DBLP dataset, we compare the types in SGCI with the types we defined in Table 5-6.

Table 5-6 Comparison of our types and SGCI types in Facebook

                 Our Type
SGCI Type    Continue  Death  Growth  Shrink  Merge  Split
Constancy        3028      2       4       2     53     35
Decay               3   4101       4       2    310    488

The communities of type “Constancy” are almost the same as the communities of type “Continue” in our definition, and the type “Decay” is mostly detected as “Death” in our types. The types “Growth” and “Shrink” in our result are mostly detected as “Change Size” in SGCI. This matches our expectation, since, as mentioned, our types “Growth” and “Shrink” are similar to the types “Addition”, “Deletion”, and “Change Size” in SGCI. However, the other types do not match.
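To quantify how closely an SGCI type matches one of our types, one can normalize each row of the comparison table. The sketch below does this for the “Constancy” and “Decay” rows of Table 5-6; the helper names `row_shares` and `dominant` are illustrative.

```python
OUR_TYPES = ["Continue", "Death", "Growth", "Shrink", "Merge", "Split"]

# Rows of Table 5-6 (SGCI type -> counts per type in our definition).
table_5_6 = {
    "Constancy": [3028, 2, 4, 2, 53, 35],
    "Decay": [3, 4101, 4, 2, 310, 488],
}


def row_shares(row):
    """Fraction of an SGCI type's events that fall into each of our types."""
    total = sum(row)
    return {t: c / total for t, c in zip(OUR_TYPES, row)}


def dominant(row):
    """Our type that receives the largest share of the row, with its share."""
    shares = row_shares(row)
    best = max(shares, key=shares.get)
    return best, shares[best]


for sgci_type, row in table_5_6.items():
    our_type, share = dominant(row)
    print(f"{sgci_type} -> {our_type}: {share:.1%}")
```

About 97% of the “Constancy” events are “Continue” in our types, and about 84% of the “Decay” events are “Death”, with most of the remaining “Decay” events falling into “Merge” and “Split”.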

5.3 Case Study on Facebook Dataset

We carry out the case study on the Facebook dataset by analyzing the distribution of evolution types in different timeslots and matching events in the history of Facebook with the detection result. Although [3] contains many case studies supporting Long-term Evolution Method, it did not apply the algorithm to the Facebook dataset. In this section, therefore, we analyze the evolution in the Facebook dataset to support both the algorithm and our method of converting evolution chains to evolution types. Facebook is a large company that is still growing, and its history can be found on the Internet, so we may be able to link the evolution we detected in the Facebook dataset to what actually happened to Facebook. To further understand the evolution in the Facebook dataset, we plot the distribution of evolution types in each timeslot. Figure 5-5 shows the percentage of each type in each timeslot.

Figure 5-5 Percentages of evolution types in each timeslot in Facebook
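The per-timeslot percentages behind such a plot can be computed as in the sketch below. The input format, an iterable of `(time_index, evolution_type)` pairs, and the function name `type_percentages` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def type_percentages(events):
    """Percentage of each evolution type within every timeslot."""
    per_slot = defaultdict(Counter)
    for time_index, evolution_type in events:
        per_slot[time_index][evolution_type] += 1
    result = {}
    for time_index, counts in sorted(per_slot.items()):
        total = sum(counts.values())
        result[time_index] = {t: 100.0 * c / total for t, c in counts.items()}
    return result
```

Normalizing within each timeslot makes timeslots with different numbers of communities comparable, so a jump in the share of one type stands out even when the total number of communities changes.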


There are two interesting points in this plot. First, in the second timeslot, whose time index is 1, the percentage of type “Split” is much larger than in the previous timeslot. Second, in the 14th timeslot, whose time index is 13, the percentage of type “Split” is again much larger than in the previous timeslot. The second timeslot contains the data from 2006/4 to 2006/7. In 2006/9, Facebook was opened to everyone at least 13 years old with a valid email address. A possible reason why the share of “Split” becomes larger is that the nodes in 2006/4 ~ 2006/7 break away from their own communities because more friends are joining Facebook and they want to form new communities with the newly joined friends. The 14th timeslot contains the data from 2008/4 to 2008/7. In 2008/7, Facebook was fully opened to users in China because of the 2008 Summer Olympic Games in Beijing. So, again, the structure of the Facebook communities changed a lot in this period for the same reason: new friends, in this case Chinese users, were joining Facebook. Since China blocked Facebook in late 2008, we might see another big change in the community evolution types in Facebook if we had more data.

Chapter 6 Community Evolution Prediction

In this chapter, we use the detection result of Chapter 5 to build a prediction model for community evolution. In Chapter 5, we tested Long-term Evolution Method with real world datasets and gave a method to convert community evolution chains to community evolution types. Long-term Evolution Method detects the evolution well. In this chapter, we want to go one step further and predict the evolution types. We want to show that evolution is predictable and that our method helps us reach this goal.

6.1 Experiment on Facebook Dataset

We first predict the evolution of communities in the Facebook dataset. In Chapter 5, we applied Long-term Evolution Method to the Facebook dataset and got 9985 communities and 5672 evolution chains. To build a more precise prediction model, we remove the communities that survive for no more than 2 timeslots, since communities that survive for only a very short time may be unpredictable. After removing these short-lived communities, we have 4235 communities. Our prediction target is the evolution type. Before training the model, we should look at the distribution of the evolution types in the training and testing set, which is given in Table 6-1.

Table 6-1 Number of evolution types in training/testing set in Facebook

Type        # of Event
Continue    1388
Death       760
Merge       652
Split       828
Shrink      257
Growth      350

As we can see, the distribution of the 4235 communities is uneven. Therefore, we first sample each evolution type evenly and randomly to get an unbiased prediction model. We sample 257 data points for every type, pick 80% of the data as the training set, and use the features we selected in Section 3.6 to train the prediction model. Then, we use the remaining 20% of the data as the testing set to test our model. The prediction result is shown in Table 6-2.
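The balanced sampling and the 80/20 split can be sketched as follows. This is a minimal illustration, assuming that the sample size per type equals the size of the smallest type (257, the number of “Shrink” events in Table 6-1) and that the split is made on the pooled, shuffled sample; the function name `balanced_split` and the input format are hypothetical.

```python
import random


def balanced_split(samples_by_type, train_fraction=0.8, seed=0):
    """Sample the same number of items per type, then split train/test."""
    rng = random.Random(seed)
    # Every type is down-sampled to the size of the rarest type.
    per_type = min(len(items) for items in samples_by_type.values())
    pool = []
    for evolution_type, items in samples_by_type.items():
        for item in rng.sample(items, per_type):
            pool.append((item, evolution_type))
    rng.shuffle(pool)
    cut = int(train_fraction * len(pool))
    return pool[:cut], pool[cut:]


# Usage with the counts of Table 6-1; the items are placeholder ids.
counts = {"Continue": 1388, "Death": 760, "Merge": 652,
          "Split": 828, "Shrink": 257, "Growth": 350}
samples = {t: [f"{t}-{i}" for i in range(n)] for t, n in counts.items()}
train, test = balanced_split(samples)
```

With six types of 257 samples each, the pool holds 1542 samples, of which 1233 go to the training set and 309 to the testing set under this rounding.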

Table 6-2 Predicting 6 evolution types in Facebook

Ground Truth

The accuracy of the prediction is 64.40%. The performance does not seem very good. However, compared with the accuracy of random prediction, which is only 16.67% (one in six types), it is still far better than random.

One possible reason for the accuracy of only 64.40% is that predicting 6 evolution types is too difficult. So, we build prediction models for different but much easier targets: whether a community will live, whether a community will grow, and whether a community will merge. As with the model for the 6 evolution types, we first sample the training/testing set evenly, pick 80% of the data as the training set, and use the features we selected in Section 3.6 to train the prediction model. Table 6-3, Table 6-4, and Table 6-5 show the testing results of the three prediction models.

Table 6-3 Predicting live or not in Facebook

Ground Truth

Table 6-4 Predicting grow or not in Facebook

Ground Truth

Recall 96.65 % 73.68 %

Table 6-5 Predicting merge or not in Facebook

Ground Truth

The accuracy of predicting whether a community will live is 79.15%, which is fairly good. When predicting whether a community will grow, we consider only the communities that survive in the next timeslot. As a result, the training set and testing set differ from those used in predicting 6 evolution types and in predicting live or not. The accuracy of predicting whether a community will grow is 86.10%, which is better than the accuracy of predicting live or not and much better than that of predicting 6 evolution types. When predicting whether a community will merge, we again consider only the communities that survive. The accuracy of predicting merge or not is 90.04%; only about 10 out of 100 predictions are incorrect. This suggests that these questions are easier to predict than the 6 types. Additionally, the communities that survive are more predictable, since the results of predicting grow or not and merge or not are better than the result of predicting live or not.
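Accuracy and per-class recall, the measures reported for these models, are derived from a confusion matrix as in the sketch below. The matrix orientation (`cm[truth][predicted]`) and the numbers in the usage example are made up for illustration and are not the values of Tables 6-3 to 6-5.

```python
def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def recall_per_class(cm):
    """For each ground-truth class, the fraction predicted correctly.
    Assumes cm[truth][predicted], so each row is one ground-truth class."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


# Hypothetical binary example: rows are ground truth (yes, no).
example = [[8, 2],
           [1, 9]]
print(accuracy(example))          # share of correct predictions
print(recall_per_class(example))  # recall for "yes" and for "no"
```

Because the classes are sampled evenly before training, accuracy is not dominated by a majority class, and the per-class recall shows which answer the model finds harder.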

To check whether our models overfit, we calculate the training accuracy of each prediction model and show it in Table 6-6. The training accuracy is almost equal to the testing accuracy in every case, which indicates that our models do not overfit.

Table 6-6 Testing accuracy and training accuracy of prediction in Facebook

Prediction                   Testing Accuracy   Training Accuracy
Predict 6 evolution types    64.40 %            63.67 %
Predict live or not          79.15 %            79.55 %
Predict grow or not          86.10 %            86.58 %
Predict merge or not         90.04 %            90.22 %
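The gap between testing and training accuracy in Table 6-6 can be checked with a few lines of arithmetic. The sketch below only restates the table; the variable names are illustrative.

```python
# (testing accuracy, training accuracy) in percent, from Table 6-6.
table_6_6 = {
    "Predict 6 evolution types": (64.40, 63.67),
    "Predict live or not": (79.15, 79.55),
    "Predict grow or not": (86.10, 86.58),
    "Predict merge or not": (90.04, 90.22),
}

# Positive gap: testing accuracy is higher than training accuracy.
gaps = {name: round(test - train, 2)
        for name, (test, train) in table_6_6.items()}
max_gap = max(abs(g) for g in gaps.values())

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} percentage points")
```

The largest absolute gap is 0.73 percentage points, and in no model does the training accuracy exceed the testing accuracy by more than 0.48 points, which is consistent with the absence of overfitting.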

Since we select more than 300 features, it is interesting to know which of them are important in our prediction model. The support vector machine applies the kernel method, mapping the input features into a high-dimensional feature space by a kernel function, so it is difficult for us to know which feature is the most important. Consequently, we train a new model with random forest and show the importance of each feature. Figure 6-1 shows the feature importance of the random forest model for the 6 evolution types in Facebook.

Figure 6-1 Feature importance in predicting 6 evolution types in Facebook

As Figure 6-1 shows, the top 7 most important features are feature 300, feature 306, feature 303, feature 301, feature 305, feature 302, and feature 216. Recall from Section 3.6.3 that features 300 to 306 are the features about evolution chains. This means that the evolution chains detected by Long-term Evolution Method help a lot in the model for predicting 6 evolution types. We go on to plot the feature importance of the other prediction models in Figure 6-2, Figure 6-3, and Figure 6-4.
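Given a vector of feature importances, such as the one a random forest provides, the top features and the share of evolution chain features among them can be obtained as sketched below. The function name `top_features` is illustrative, and the feature numbers are treated as plain identifiers taken from the text above.

```python
def top_features(importances, k=7):
    """Indices of the k largest importances, most important first."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: (-importances[i], i))
    return ranked[:k]


# Features 300 to 306 are the evolution chain features (Section 3.6.3).
CHAIN_FEATURES = set(range(300, 307))

# Top 7 features reported for the model of 6 evolution types.
top7 = [300, 306, 303, 301, 305, 302, 216]
chain_count = sum(f in CHAIN_FEATURES for f in top7)
print(f"{chain_count} of {len(top7)} top features are chain features")
```

For the model of 6 evolution types, 6 of the top 7 features are evolution chain features; only feature 216 comes from another group.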

Figure 6-2 Feature importance in predicting live or not in Facebook

Figure 6-3 Feature importance in predicting grow or not in Facebook

Figure 6-4 Feature importance in predicting merge or not in Facebook

We plot only the top 7 most important features for each model. Although the order of the most important features differs from model to model, most of the top 7 features in every model are features about evolution chains. This shows that our detection algorithm is useful: we can use the evolution chain features detected by Long-term Evolution Method to build a good prediction model.

6.2 Experiment on DBLP Dataset

In this section, we show the prediction result on the DBLP dataset. We got 83301 communities and 32038 evolution chains when we applied our detection algorithm to the DBLP dataset in Chapter 5. Similarly, we remove the communities which
