Chapter 2 Related Work

2.3 SGCI Algorithm

The SGCI algorithm is a community evolution detection method proposed by B. Gliwa, S. Saganowski, A. Zygmunt, P. Bródka, P. Kazienko, and J. Kozlak. SGCI stands for Stable Group Change Identification. The algorithm can be split into the following steps:

Step 1. Identification of fugitive groups in the separate time periods.

Step 2. Identification of group continuation – assigning transitions between groups in adjacent time steps.

Step 3. Separation of the stable groups (lasting for at least three subsequent time steps).

Step 4. Identification of types of group changes: assigning events from the list describing the change of the state of the group to the transitions.

In Step 2, SGCI uses a measure called modified Jaccard to decide whether there is a transition between two groups. The modified Jaccard measure is defined as

$$MJ(A, B) = \max\left(\frac{|A \cap B|}{|A|}, \frac{|A \cap B|}{|B|}\right),$$

where $A$ and $B$ are two groups. Because this measure is simple to compute, the computation time of the SGCI algorithm is very short compared to other algorithms. In Step 4, SGCI defines 8 events: “Split”, “Deletion”, “Merge”, “Addition”, “Split_Merge”, “Decay”, “Constancy”, and “Change Size”.
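The modified Jaccard measure can be sketched in a few lines of Python. This is only an illustration of the formula above, not the SGCI authors' code; the function name and the convention of returning 0 for an empty group are assumptions.

```python
def modified_jaccard(group_a, group_b):
    """Return max(|A∩B|/|A|, |A∩B|/|B|) for two groups of node ids."""
    a, b = set(group_a), set(group_b)
    if not a or not b:
        # Assumption: an empty group shares nothing with any other group.
        return 0.0
    common = len(a & b)
    return max(common / len(a), common / len(b))
```

For instance, a small group that is fully contained in a larger one scores 1, which is why the measure can pick up merges and splits that the plain Jaccard index would score low.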

Chapter 3 Our Method

In this chapter, we describe how we perform community evolution detection and prediction, including our definition of evolution types. We also introduce how we transfer evolution chains to evolution types and how we predict community evolution types. Before running the detection, we examine our evolution detection algorithm on our own synthetic dataset. We therefore also describe in this chapter how we generate the synthetic data and how we measure the correctness of the Long-term Evolution Method.

3.1 Synthetic Data Generator

We use our own synthetic data to test Weiux’s algorithm, so this section describes how we generate it. To verify that Weiux’s algorithm is robust and detects evolution correctly, we need datasets with evolution ground truth against which to judge performance. To the best of our knowledge, no real-world dataset contains the ground truth of community evolution, so we have to generate a synthetic dataset ourselves and assign the community evolution manually. If the Long-term Evolution Method can precisely detect the community evolution in the synthetic dataset, this indicates that the algorithm is useful and correct.

Our synthetic data generator is based on the Stochastic Block Model [4], a widely used model for generating synthetic datasets with community ground truth. In a simple Stochastic Block Model, each node is assigned to one of $K$ communities. Assume node $i$ is in community $C_i$ and node $j$ is in community $C_j$. An edge is placed between node $i$ and node $j$ with probability $\psi_{C_i C_j}$, so a $K \times K$ matrix $\psi$ determines all the edges [5]. Here we define $\psi_{C_i C_j} = p_{in}$ if $C_i = C_j$, and $\psi_{C_i C_j} = p_{out}$ otherwise.

To build simple but representative synthetic data, we consider only two kinds of community evolution, “merge” and “split”. We first generate a graph with communities by the Stochastic Block Model. We then assign which communities should merge and which communities should split. Finally, we generate a new graph that follows our evolution assignment. The details of the synthetic data generator are as follows.

Step 1. Generate one graph by the Stochastic Block Model. In this step, we generate a graph by the Stochastic Block Model with two probabilities, $p_{in}$ and $p_{out}$. If two nodes are in the same community, an edge connects them with probability $p_{in}$. If two nodes are in different communities, an edge connects them with probability $p_{out}$. To generate graphs with community structure, $p_{in}$ must be larger than $p_{out}$, because a community is by definition a set of nodes that are more densely connected to each other than to nodes in other communities. In this step, we also set the number of communities and the number of nodes in each community. With $p_{in}$ and $p_{out}$, we can then generate a graph with community ground truth based on the Stochastic Block Model.
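A minimal sketch of Step 1 in Python is given below, assuming an undirected graph stored as a set of node pairs. The function name `generate_sbm`, the `sizes` argument, and the `seed` parameter are illustrative choices and do not come from the original generator.

```python
import random
from itertools import combinations


def generate_sbm(sizes, p_in, p_out, seed=None):
    """Generate one undirected graph from a simple Stochastic Block Model.

    sizes[k] is the number of nodes in community k. Returns a dict mapping
    each node to its community and a set of edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    membership = {}
    node = 0
    for community, size in enumerate(sizes):
        for _ in range(size):
            membership[node] = community
            node += 1
    edges = set()
    for i, j in combinations(range(node), 2):
        # Same community: probability p_in; different communities: p_out.
        p = p_in if membership[i] == membership[j] else p_out
        if rng.random() < p:
            edges.add((i, j))
    return membership, edges
```

Calling `generate_sbm([30, 30, 40], p_in=0.3, p_out=0.02, seed=1)` would give a 100-node graph with three planted communities; the sizes and probabilities here are arbitrary example values.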

Step 2. Assign community evolution. In this step, we decide which communities should merge or split in the next timeslot. We first separate the communities into two groups. The communities in the first group will merge with each other in the next timeslot, and each community in the second group will split into two communities. For example, we can assign community 1.1 and community 1.2 to merge into community 2.1, and assign community 1.3 to split into community 2.2 and community 2.3, where community a.b means the b-th community in timeslot a. In this example, community 1.1, community 1.2, and community 2.1 are in the same evolution chain. For ease of computation and analysis, only two communities merge into one, or one community splits into two. Three or more communities never merge into one, and one community never splits into three or more communities. In this step, we also decide the evolution for the timeslot after the next one. If two communities merge into one community in the next timeslot, the merged community will split into two communities in the timeslot after that. For example, if community 1.1 and community 1.2 merge into community 2.1, then community 2.1 will, by design, split into community 3.1 and community 3.2. We specify that a merged community splits in the following timeslot to keep the synthetic data simple and to reduce the time for generating community evolution.
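The assignment for a single timeslot can be sketched as follows. This is an assumed implementation for illustration only: the function name, the random choice of which communities merge, and the numbering of the new communities are not specified in the text, and the rule that a merged community splits in the following timeslot is left out for brevity.

```python
import random


def assign_evolution(timeslot, n_communities, n_merge_pairs, seed=None):
    """Map each community label 'a.b' to its target labels in timeslot a+1.

    2 * n_merge_pairs communities are paired up to merge (one target each);
    every remaining community splits into two targets.
    """
    if 2 * n_merge_pairs > n_communities:
        raise ValueError("not enough communities to form the merge pairs")
    rng = random.Random(seed)
    current = [f"{timeslot}.{b}" for b in range(1, n_communities + 1)]
    rng.shuffle(current)
    merging = current[:2 * n_merge_pairs]
    splitting = current[2 * n_merge_pairs:]
    plan = {}
    next_index = 1
    for k in range(0, len(merging), 2):
        target = f"{timeslot + 1}.{next_index}"
        next_index += 1
        plan[merging[k]] = [target]
        plan[merging[k + 1]] = [target]
    for label in splitting:
        plan[label] = [f"{timeslot + 1}.{next_index}",
                       f"{timeslot + 1}.{next_index + 1}"]
        next_index += 2
    return plan
```

With three communities and one merge pair, the plan has the same shape as the example in the text: two communities share the target 2.1 and the third splits into 2.2 and 2.3.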

Step 3. Move nodes to new communities and reconstruct the graph. After we assign the community evolution, we move nodes from their original communities to the new communities one by one and reconstruct a new graph. Nodes move from the communities in the present timeslot to the communities we assigned for the next timeslot. Every node object has a variable recording which community the node is in. By changing the value of this variable, we move the node to a new community. When a node is moved to a new community, some of its edges are reconstructed with the same probabilities used to construct the Stochastic Block Model in Step 1. Which edges are reconstructed depends on the evolution of the community, which is either merge or split.

For two communities that will merge into a new community, each node breaks its edges to the nodes in its own community and in the community it is going to merge with. The edges inside the new community are then generated with probability $p_{in}$. For example, if a node is originally in community 1.1, and community 1.1 is assigned to merge with community 1.2 into community 2.1, the node breaks all its edges to nodes in community 1.1 and community 1.2 but keeps its edges to other communities. The edges in community 2.1 are reconstructed with probability $p_{in}$.

For a community that will split into two communities, the nodes in the community break their edges with each other, and the edges between the two new communities are generated with probability $p_{out}$. For example, if a node is originally in community 1.3 and community 1.3 is going to split into community 2.2 and community 2.3, the node breaks all its edges to nodes in community 1.3. The edges between nodes in community 2.2 and nodes in community 2.3 are generated with probability $p_{out}$.

Note that edges remain unchanged if they connect two nodes from different communities that do not merge together. For example, if community 1.1 merges with community 1.2, the edges between nodes in community 1.1 and nodes in communities other than community 1.2 remain unchanged. After all nodes are moved to their new communities and all edges are reconstructed with the probabilities set in Step 1, we obtain a new graph with new communities, new edges, and community evolution ground truth. With this procedure, the probability of a connection between two nodes in the same community and the probability of a connection between two nodes in different communities remain the same as the probabilities used in Step 1. In addition, the connections between communities that do not evolve together remain the same, which is more reasonable and more realistic for community evolution.
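The merge case of Step 3 can be sketched as follows, assuming the graph is stored as a node-to-community dict plus a set of edges (i, j) with i < j. The function name and data layout are illustrative assumptions. The split case would be analogous: drop the edges inside the old community and redraw the edges between the two new communities with $p_{out}$.

```python
import random
from itertools import combinations


def merge_communities(membership, edges, a, b, new_label, p_in, seed=None):
    """Merge communities a and b into new_label and rebuild their edges.

    Edges with both endpoints in a or b are dropped and redrawn with
    probability p_in; every other edge is kept unchanged.
    """
    rng = random.Random(seed)
    moved = {n for n, c in membership.items() if c in (a, b)}
    # Keep edges that touch at least one node outside the merging pair.
    new_edges = {e for e in edges if not (e[0] in moved and e[1] in moved)}
    new_membership = {n: (new_label if n in moved else c)
                      for n, c in membership.items()}
    # Redraw all edges inside the merged community.
    for i, j in combinations(sorted(moved), 2):
        if rng.random() < p_in:
            new_edges.add((i, j))
    return new_membership, new_edges
```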

Step 4. Add core nodes and noise. In this step, we modify the synthetic data by setting some nodes as core nodes in communities and adding noise when moving nodes to new communities. We add noise to make the synthetic data more similar to real-world cases: nodes may randomly move to communities other than the ones we assigned. We use a probability called the “correctly migration probability” to control the noise. The larger the correctly migration probability is, the less noise there is and the more stable the evolution is. Core nodes simulate the core structure of communities. There are two key points about the core nodes in the new synthetic data: (1) core nodes connect to all the nodes in the same community, and (2) core nodes always go to the communities we assigned in the evolution. Because core nodes always move to their assigned communities, the noise we add to the new synthetic data does not affect them.
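The noisy migration can be sketched as follows. The function and parameter names are hypothetical, and choosing the wrong community uniformly at random is an assumption, since the text only says that nodes may move to communities other than the assigned one. The extra edges from core nodes to every node in their community are not shown.

```python
import random


def migrate(assigned, core_nodes, communities, p_correct, seed=None):
    """Return the community each node actually moves to.

    assigned maps node -> assigned target community. Core nodes always
    follow the assignment; other nodes follow it with probability
    p_correct (the "correctly migration probability") and otherwise move
    to a different community chosen uniformly at random (assumption).
    """
    rng = random.Random(seed)
    actual = {}
    for node, target in assigned.items():
        if node in core_nodes or rng.random() < p_correct:
            actual[node] = target
        else:
            others = [c for c in communities if c != target]
            actual[node] = rng.choice(others) if others else target
    return actual
```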

Step 5. Repeat Step 2 to Step 4 to get the required number of graphs. After Step 4, we have a new graph for the 1st evolution, and the 2nd evolution has also been assigned in Step 2. Repeating Step 2 to Step 4 gives us one more graph and the pattern of the 3rd evolution.

3.2 Measure Correctness of Evolution Detection

To judge how well Weiux’s algorithm performs, we use normalized mutual information to evaluate the community evolution detection result. Because the algorithm we used (LM1) generates community evolution chains instead of community evolution types, we cannot simply compare the evolution types of the detection result with the ground truth. We therefore need a method to measure how accurate the detected evolution chains are. The algorithms LM1, LM2, and LM3 share the same final step: they apply community detection to the bipartite or multipartite graph to obtain the community evolution chains, so the evolution chains we detect are actually the communities in these graphs. Finding communities in graphs is a well-known problem in online social network analysis. Many works propose community detection algorithms and need a method to evaluate them. Normalized mutual information [12] is one of the commonly used measures for comparing a community detection result with the community ground truth. As noted above, the evolution chains are the communities detected from the bipartite or multipartite graphs. We can therefore use normalized mutual information to measure our result.
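A self-contained sketch of normalized mutual information between two labelings is given below. The normalization $2I(X;Y)/(H(X)+H(Y))$ is one common choice and is an assumption here, since the text only cites [12] without stating the variant. In this setting each item would be a community and each label the evolution chain it belongs to.

```python
from collections import Counter
from math import log


def normalized_mutual_information(truth, detected):
    """NMI between two label sequences over the same items.

    Uses 2 * I(X;Y) / (H(X) + H(Y)); this normalization is an assumption.
    """
    n = len(truth)
    if n == 0 or n != len(detected):
        raise ValueError("label sequences must be non-empty and equal length")
    count_t = Counter(truth)
    count_d = Counter(detected)
    joint = Counter(zip(truth, detected))
    mutual = sum((c / n) * log(c * n / (count_t[t] * count_d[d]))
                 for (t, d), c in joint.items())
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    h_d = -sum((c / n) * log(c / n) for c in count_d.values())
    if h_t + h_d == 0:
        # Both labelings put every item in one group: treat as identical.
        return 1.0
    return 2 * mutual / (h_t + h_d)
```

The score does not depend on how the groups are named, so a detected result that recovers the ground-truth chains under different chain ids still scores 1.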

3.3 Definition of Evolution Types

We propose our own definition of evolution types in this work. Most community evolution detection algorithms try to detect the evolution type, which is very different from the approach of Weiux’s algorithm. To compare the result with other algorithms, one way is to generate evolution types rather than evolution chains. Evolution types are also much easier to read and analyze than evolution chains. Because the output of the Long-term Evolution Method is an evolution chain, we propose a method to transform an evolution chain into evolution types. First, we define 7 evolution types: “Birth”, “Merge”, “Split”, “Growth”, “Shrink”, “Continue”, and “Death”. The concept of each evolution type is as follows.

Birth. “Birth” means that the community has just appeared in the present timeslot; no community is in its history. When we look at what happened to a community before, this evolution type is one of the possible cases. However, if we focus on what happens to a community in the next timeslot, this evolution type is never the answer, because that would contradict the concept that no community is in its history.

Merge. This evolution type indicates that a community was merged from other communities or will merge with other communities, depending on which timeslot we are talking about. If we are talking about what happened to a community before, “Merge” means that this community was merged from communities in the previous timeslot. If we are talking about what will happen to a community, “Merge” means that this community is going to merge with other communities in the present timeslot and become new communities in the next timeslot.

Split. This evolution type means that a community was split from communities in the previous timeslot or will split into multiple communities in the next timeslot. Again, it depends on which timeslot we focus on.

Growth. The type “Growth” is quite different from “Merge” and “Split”. “Merge” and “Split” involve a change in the number of communities, either from the previous timeslot to the present timeslot or from the present timeslot to the next timeslot. “Growth” means that the number of nodes in the communities increased when the communities evolved to the present timeslot, or will increase when the communities evolve to the next timeslot.

Shrink. This evolution type is the opposite of “Growth”. The number of communities does not change, but the number of nodes in the communities decreased when the communities evolved to the present timeslot, or will decrease when the communities evolve to the next timeslot.

Continue. This evolution type means that the number of communities does not change, as with “Growth” and “Shrink”, but the number of nodes in the communities stays the same or shows no significant change when the communities evolved to the present timeslot or when they evolve to the next timeslot. We discuss how to decide whether a change is significant in Section 3.4.

Death. This evolution type is the opposite of “Birth”. A community is given the evolution type “Death” if it will disappear in the next timeslot, so that its evolution ends in the present timeslot. No community in the next timeslot evolves from this community. Unlike “Birth”, this evolution type can occur only when we talk about a community’s future. If we focus on what happened to a community before, “Death” is never a possible type for it, because no community evolves from a “Death” community.

3.4 Method to Transfer Evolution Chain to Evolution Types

In this section, we give a method to transfer a detected evolution chain into the evolution types defined in Section 3.3. Because most works focus on what happens next to a community, we focus on the future of communities as well. As a result, “Birth” does not appear in this method. We first decide whether communities are “Death” or not. If they are not “Death”, we next check whether they are “Merge” or “Split”. If they are neither “Merge” nor “Split”, we calculate the total number of nodes in the communities to decide whether they belong to “Growth”, “Shrink”, or “Continue”. The details of the procedure are as follows.

Step 1. Decide whether the evolution type is “Death”. A community evolution chain contains many communities. If a community is in the last timeslot of the chain, it is given the type “Death”, since the chain has no community in the next timeslot. For example, consider the evolution chain “12:3, 22:4, 35:4”, where each entry gives a community index followed by its time index. Community 22 and community 35 are “Death” because no community in this chain has time index 5. Community 12 is not “Death”, because the chain contains two communities (community 22 and community 35) in the following timeslot (time index 4). If a community is not “Death”, we go to the next step.
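The “Death” rule can be sketched as follows, assuming a chain is given as a string of `community:time` entries as in the example above. Both function names are illustrative.

```python
def parse_chain(chain):
    """Parse '12:3, 22:4, 35:4' into {community index: time index}."""
    result = {}
    for item in chain.split(","):
        community, time_index = item.strip().split(":")
        result[int(community)] = int(time_index)
    return result


def death_communities(chain):
    """Return the communities in the last timeslot of the chain."""
    times = parse_chain(chain)
    last = max(times.values())
    return {c for c, t in times.items() if t == last}
```

For the chain in the example, `death_communities("12:3, 22:4, 35:4")` returns communities 22 and 35, and community 12 moves on to the next step.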

Step 2. Decide whether the evolution type is “Merge” or “Split”. In this step, we count the communities in each timeslot of an evolution chain and compare the counts of consecutive timeslots. If the number of communities in a given timeslot is less than the number in the next timeslot, the communities in the given timeslot are given the type “Split”. If it is greater, they are given the type “Merge”. For example, in the evolution chain “12:3, 22:4, 35:4”, one community has time index 3 and two communities have time index 4. The number of communities in time index 3 is less than that in time index 4, so the community in time index 3 is given the type “Split”. “Merge” and “Split” are sometimes defined differently. Some consider that only multiple communities merging into one community is a “Merge”, while others consider that a larger number of communities merging into a smaller number of communities can also be called “Merge”. To make our transferring method more flexible, we use a factor $\alpha$ to control the rule for deciding “Merge” and “Split”. Given that $x$ is the number of communities in the present timeslot and $y$ is the number of communities in the next timeslot, the possible evolution types of the communities in the present timeslot are listed in Table 3-1.

Table 3-1 Rule of deciding “Merge” or “Split”

Compare $x$ and $y$            Evolution Type
$x > \alpha y$                 Merge
$x < \frac{1}{\alpha} y$       Split
otherwise                      Growth, Shrink, or Continue

If $\frac{1}{\alpha} y \le x \le \alpha y$, we go to the next step to decide whether the evolution type is “Growth”, “Shrink”, or “Continue”.
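The rule in Table 3-1 can be sketched as a small function. The function name is illustrative, and the requirement $\alpha \ge 1$ is an assumption made here so that the “Merge” and “Split” conditions cannot both hold; the text does not state the allowed range of $\alpha$.

```python
def merge_or_split(x, y, alpha=1.0):
    """Apply the rule of Table 3-1.

    x: number of communities in the present timeslot.
    y: number of communities in the next timeslot.
    Returns "Merge", "Split", or None when the type must be decided by
    node counts in the next step (Growth, Shrink, or Continue).
    """
    if alpha < 1:
        # Assumption: alpha >= 1 keeps the two conditions disjoint.
        raise ValueError("alpha is assumed to be at least 1")
    if x > alpha * y:
        return "Merge"
    if x < y / alpha:
        return "Split"
    return None
```

With $\alpha = 1$, any change in the number of communities counts as a merge or a split. A larger $\alpha$ requires a larger change, for example 3 communities becoming 2 is neither “Merge” nor “Split” when $\alpha = 2$.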

Step 3. Decide whether the evolution type is “Growth”, “Shrink”, or “Continue”. In this step, we compare the number of nodes in the communities. We calculate the total number of nodes in the communities of each timeslot and compare the totals of consecutive timeslots. If the total number of nodes in a timeslot is larger than the total in its next timeslot, the communities in this timeslot are given the type “Shrink”. In contrast, if the total number of nodes in a timeslot is smaller than the total in its next timeslot, the communities in this timeslot are given the type “Growth”. All other cases are given the type “Continue”. For example, consider the evolution chain “35:3, 36:3, 50:4, 51:4”. The number of communities in time index 3 is two and the number of communities in time index 4 is also two, so the evolution type of the communities in time index 3 is neither “Merge” nor “Split”, and we have to decide it by node counts. Assume communities 35, 36, 50, and 51 have 20, 20, 10, and 10 nodes, respectively. The total number of nodes in time index 3 is 20 + 20 = 40, and the total number of nodes in time index 4 is 10 + 10 = 20. Because the total in time index 3 is larger than that in time index 4 (40 > 20), the evolution type of community 35 and community 36 is “Shrink”. There is a problem that one may think the
