
Chapter 6 The Community Pedigree model

6.2 The State of Communities

A dynamic community can change over time, so we define five community states: Birth, Death, Alive, Child and Division.

Definition 13 (Birth)

A new community is born at time t iff there is no similarity between it and any community of the community partition at the previous time point, i.e.:

Definition 14 (Death)

An old community is dead at time point t iff there is no similarity between it and any community of the community partition at time t, i.e.:

Definition 15 (Alive)

A current community is alive iff there is similarity between exactly one community of the community partition at time t-1 and exactly one community of the community partition at time t, i.e.:

Definition 16 (Child)

A current community is a child of { iff there exist more than one

Figure 6-3 Example of evolution of communities

(1) Birth: A purple circle indicates that a community is born at the current time point t; an example is community J in Fig 6-3(b).
(2) Death: A triangle indicates that a community will be dead at the next time point; an example is community E in Fig 6-3(b).
(3) Alive: A red circle indicates that a community is Alive from exactly one community at time t-1 to exactly one community at time t; an example is community I at time t, which is Alive from community D at time t-1 in Fig 6-3(b).
(4) Child: A green circle indicates that a community is a child of some communities at time t-1. An example is community F, which is the child of A


After we determine the similarities between communities and the states of communities, the states express the evolution of all communities, as in Fig 6-4(a). We call the evolution of a single community its pedigree, and we use the same community states to express it. Fig 6-4(a) shows the similarities between communities and the states of communities from t=1 to t=4; this is called the evolution net [3]. Fig 6-4(b) shows the pedigrees of communities A, B and D. In the pedigree of a specific community, a circle indicates a community that has a blood relationship with the specific community. A square indicates a community that is a non-blood-relationship spouse of a blood-relationship community.

In the pedigree of A, community B is a non-blood-relationship spouse of A. Their child F has a spouse G at time t=2, and their child is L. The community L is dead at time 4 so the

circle; community M is a fission of G. We can see that the pedigree of community B ends at time t=4. In the pedigree of community D, community I is Alive from D, K is Alive from I, and P is Alive from K. The pedigree of D continues to develop and does not end.

Through the illustration, we can observe the lifetime of a community. A community can stay alive, split and merge over time. The proposed community pedigree mapping expresses the evolution of communities and solves the “one to one mapping problem”.

6.4 Proposed Algorithm: “relationship Extraction and community Pedigree dynamic Community miner” (EPC)

Given a dynamic network G={G1,G2,…,Gt,…}, where Gt is the interaction graph at time t, the observation eyeshot wr, the selected normalized weight function N(t,tc) and the minimum community similarity threshold, the algorithm EPC proceeds as follows. For each time point tc, we determine the observation window [tc-wr, tc+wr]. EPC is then divided into three steps:

(1) Construct the relationship graph RG_tc:

For each edge within the interaction graphs Gt whose time point t belongs to the observation window, we calculate the relationship strength between individuals (u,v) using Eq (3), and then we construct the relationship graph RG_tc.

(2) Use the clustering method SHRINK [12] to discover the community partition CP_tc based on the relationship graph RG_tc.

(3) Determine the evolution net (EN_tc) for every two consecutive timestamps tc-1 and tc:

Based on the predefined minimum community similarity threshold and the community partition results at time tc-1 and tc, we calculate the similarity between the communities of tc-1 and the communities of tc using Eq (13). We then determine the state of each community. The detailed algorithm is shown in Fig 6-5.
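As an illustration of step (1), the following minimal sketch aggregates the interaction edges inside the observation window into a weighted relationship graph. Eq (3) is not reproduced in this section, so the weighted sum used here, the linear-decay weight lin_weights and all function names are assumptions made for illustration and not the exact definitions of EPC.

```python
from collections import defaultdict

def lin_weights(tc, wr, times):
    """Hypothetical normalized linear-decay weights N(t, tc) over the window."""
    raw = {t: wr + 1 - abs(t - tc) for t in times}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}

def relationship_graph(graphs, tc, wr, weight_fn=lin_weights):
    """Build RG_tc from the interaction graphs in the window [tc-wr, tc+wr].

    graphs: dict mapping time point t -> set of undirected edges (u, v).
    Returns a dict mapping frozenset({u, v}) -> relationship strength.
    The weighted sum is an assumed stand-in for Eq (3).
    """
    # Keep only the time points of the window for which a graph exists.
    times = [t for t in range(tc - wr, tc + wr + 1) if t in graphs]
    weights = weight_fn(tc, wr, times)
    strength = defaultdict(float)
    for t in times:
        for u, v in graphs[t]:
            strength[frozenset((u, v))] += weights[t]
    return dict(strength)

# Toy input: three time points, window centered at tc=2 with wr=1.
graphs = {1: {("a", "b")}, 2: {("a", "b"), ("b", "c")}, 3: {("b", "c")}}
rg = relationship_graph(graphs, tc=2, wr=1)
```

Passing a different weight_fn would swap in another weight function such as EQL, EXP or WAVE, under the same assumption about how the weights enter the relationship strength.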

Algorithm 2 : EPC

Input: Dynamic network G={G1,G2,…,Gt,…} where Gt is the interaction graph of time t, observation eyeshot (wr), selected normalized weight function (N(t,tc)), minimum community similarity threshold
Output: Community partition of each time point, the evolution net of all communities
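The evolution net part of the output can be sketched as follows. This is a minimal illustration only: Eq (13) is not reproduced here, so the Jaccard similarity is a hypothetical stand-in, and because Definitions 16 and 17 are incomplete in this text, the rules used for Child (several similar predecessors) and Division (a predecessor with several similar successors) are assumptions.

```python
def jaccard(a, b):
    """Hypothetical stand-in for the community similarity of Eq (13)."""
    return len(a & b) / len(a | b)

def evolution_net(prev, curr, min_sim, sim=jaccard):
    """Link communities of tc-1 and tc whose similarity reaches min_sim.

    prev, curr: dicts mapping community name -> set of member nodes
    (names are assumed to be distinct across the two time points).
    Returns (edges, states): edges is a list of (prev_name, curr_name)
    pairs and states maps community names to a state label.
    """
    edges = [(p, c) for p, pm in prev.items() for c, cm in curr.items()
             if sim(pm, cm) >= min_sim]
    parents = {c: [p for p, c2 in edges if c2 == c] for c in curr}
    children = {p: [c for p2, c in edges if p2 == p] for p in prev}
    states = {}
    for p in prev:
        if not children[p]:
            states[p] = "Death"      # no similar community at tc
    for c in curr:
        if not parents[c]:
            states[c] = "Birth"      # no similar community at tc-1
        elif len(parents[c]) > 1:
            states[c] = "Child"      # assumed: several similar predecessors
        elif len(children[parents[c][0]]) > 1:
            states[c] = "Division"   # assumed: the predecessor split
        else:
            states[c] = "Alive"      # one-to-one continuation
    return edges, states

# Toy partitions loosely following the community names of Fig 6-3.
prev = {"A": {1, 2, 3}, "B": {4, 5, 6}, "D": {7, 8, 9}, "E": {10, 11}}
curr = {"F": {1, 2, 3, 4, 5, 6}, "I": {7, 8, 9}, "J": {20, 21}}
edges, states = evolution_net(prev, curr, min_sim=0.3)
```

In this toy input F is linked to both A and B, I continues D, E has no successor and J has no predecessor. The member sets and the threshold 0.3 are made up for the example.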

Chapter 7 Experimental results and Performance study

In this chapter, the accuracy and efficiency of EPC are examined. The experiments run on an AMD Athlon(tm) II X2 240 CPU at 2.8 GHz with 2 GB of main memory, running Windows XP. The proposed EPC is implemented in C++. We compare EPC with FacetNet [3] and PD-Greedy [5] using two synthetic dataset generators, SYN-FIX and SYN-VAR.

SYN-FIX generates the dynamic network of a fixed number of communities and fixed number of nodes over time. SYN-VAR generates the dynamic network of a variable number of communities and variable number of nodes over time. For accuracy comparison, we use the mutual information to evaluate the performance.

Section 7.1 describes the details of synthetic dataset generator and quality measurement using mutual information. Section 7.2 presents the discussion of all parameters of all comparison algorithms. Section 7.3 presents the accuracy of synthetic data experiment.

Section 7.4 presents the smoothness quality of the synthetic data experiment. Section 7.5 presents the scalability of the synthetic data experiment, and Section 7.6 presents the results of the real data experiment.

7.1 Synthetic Data generation

7.1.1 SYN-FIX

Parameter       Description                                              Default
n               Initial number of vertices.                              128
s_c             Initial size of community.                               32
n_c             Initial number of communities.                           4
Avg_v_deg       Average vertex degree.                                   16
Avg_v_out_deg   Average vertex degree out of original community.         3 ~ 5
Ran_sel         Random select some vertices out of original community.   3

Table 7-1 Parameters of SYN-FIX

The data generator SYN-FIX has been released in [26]; the original idea was proposed in [18], and the same idea is also discussed in [5]. SYN-FIX produces an environment with a fixed number of nodes and communities over time. It generates a network of 128 nodes, with four communities of 32 nodes each and an average vertex degree (Avg_v_deg) of 16. Table 7-1 describes the parameters of SYN-FIX.

In SYN-FIX, the parameter Avg_v_out_deg is the average number of edges from a node to nodes outside its own community. Avg_v_out_deg controls the number of inter-edges placed between different communities; since the average vertex degree is fixed, it also determines the number of intra-edges placed inside a community: the number of intra-edges increases as Avg_v_out_deg decreases. We generate two SYN-FIX datasets (SYN-FIX-VOD_3 and SYN-FIX-VOD_5), each producing such networks for twenty consecutive timestamps. SYN-FIX-VOD_3 uses Avg_v_out_deg = 3 and SYN-FIX-VOD_5 uses Avg_v_out_deg = 5. At each timestamp, 3 randomly selected nodes leave their original community and randomly join the other three communities.
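The generation process can be sketched as follows. This is not the released generator of [26]: the edge probabilities p_in and p_out are an assumed construction that matches the expected degrees of Table 7-1 at the first timestamp, and selecting Ran_sel nodes from each community is one reading of the description above. All function names are hypothetical.

```python
import random

def syn_fix_snapshot(membership, avg_deg=16, avg_out_deg=3, rng=random):
    """Sample one interaction graph for a given community assignment.

    membership: dict node -> community id. Edge probabilities are set so
    that, for equally sized communities, the expected degree is avg_deg
    and the expected number of inter-edges per node is avg_out_deg.
    """
    nodes = sorted(membership)
    n = len(nodes)
    s_c = n // len(set(membership.values()))      # nominal community size
    p_in = (avg_deg - avg_out_deg) / (s_c - 1)    # intra-community pairs
    p_out = avg_out_deg / (n - s_c)               # inter-community pairs
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            p = p_in if membership[u] == membership[v] else p_out
            if rng.random() < p:
                edges.add((u, v))
    return edges

def move_nodes(membership, ran_sel=3, rng=random):
    """Move ran_sel random nodes of each community to another community."""
    new = dict(membership)
    comms = sorted(set(membership.values()))
    for c in comms:
        members = [v for v in sorted(membership) if membership[v] == c]
        for v in rng.sample(members, ran_sel):
            new[v] = rng.choice([d for d in comms if d != c])
    return new

def syn_fix(n=128, n_c=4, steps=20, avg_out_deg=3, seed=0):
    """Return a list of (ground-truth membership, edge set) per timestamp."""
    rng = random.Random(seed)
    membership = {v: v * n_c // n for v in range(n)}
    snapshots = []
    for _ in range(steps):
        edges = syn_fix_snapshot(membership, avg_out_deg=avg_out_deg, rng=rng)
        snapshots.append((dict(membership), edges))
        membership = move_nodes(membership, rng=rng)
    return snapshots

snapshots = syn_fix(avg_out_deg=3, seed=1)
```

Calling syn_fix(avg_out_deg=5) would correspond to the noisier SYN-FIX-VOD_5 setting. Community sizes drift slightly after nodes move, so the expected degrees are only approximate at later timestamps.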

7.1.2 SYN-VAR

Parameter           Description                                                Default
n                   Initial number of vertices.                                256
s_c                 Initial size of community.                                 [32, 64]
n_c                 Initial number of communities.                             [4, 8]
Avg_c_e-ratio       Average community edge ratio.                              [0.2, 0.8]
Avg_c_out_e-ratio   Average community edge ratio out of original community.   [0.3, 0.5]
Ran_sel             Random select some nodes out of original community.        [8, 20]
Add_v_at            Add some new vertices at each time point.                  16
Add_new_c_at        Add new community at some time points.
Remove_min_c_at     Remove min-community at some time points.

Table 7-2 Parameters of SYN-VAR

SYN-VAR is first discussed in [5]. SYN-VAR produces an environment with a variable number of nodes and communities over time. The average community edge ratio (Avg_c_e-ratio) is the ratio of the number of intra-edges of a community to the maximum number of intra-edges possible for its number of nodes. The average community edge ratio out of original community (Avg_c_out_e-ratio) is the ratio of inter-edges to total edges of the community. For example, suppose we generate a network containing 256 nodes in 4 communities of 64 nodes each, and set Avg_c_e-ratio to 0.5 and Avg_c_out_e-ratio to 0.3. The number of intra-edges of a single community is 0.5*(64*63/2) = 1008. Because inter-edges make up 0.3 of the community's total edges, the total number of edges is 1008/0.7 = 1440 and the number of inter-edges is 1440*0.3 = 432.

All datasets produced by SYN-FIX and SYN-VAR are shown in Table 7-3.

Type      Synthetic Dataset     Description

SYN-FIX   SYN-FIX-VOD_3         Fixed node number, fixed community number and average vertex out degree = 3.

SYN-FIX   SYN-FIX-VOD_5         Fixed node number, fixed community number and average vertex out degree = 5.

SYN-VAR   SYN-VAR-COE_0_3_REG   Dynamically changing node number, dynamically changing community number, average community out edge ratio = 0.3, creating new communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and deleting the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}.

SYN-VAR   SYN-VAR-COE_0_5_REG   Dynamically changing node number, dynamically changing community number, average community out edge ratio = 0.5, creating new communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and deleting the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}.

SYN-VAR   SYN-VAR-COE_0_3_RAN   Dynamically changing node number, dynamically changing community number, average community out edge ratio = 0.3, randomly creating 7 new communities at time points {2, 3, 4, 6, 9, 10, 14} and deleting the smallest community at time points {7, 9, 14, 15, 18, 19, 20}.

SYN-VAR   SYN-VAR-COE_0_5_RAN   Dynamically changing node number, dynamically changing community number, average community out edge ratio = 0.5, creating new communities at time points {1, 2, 3, 6, 8, 11, 12} and deleting the smallest community at time points {7, 8, 10, 11, 14, 16, 17}.

Table 7-3 Synthetic datasets generated by SYN-FIX and SYN-VAR

We generate four SYN-VAR datasets, each producing such networks for twenty consecutive timestamps. At each timestamp, SYN-VAR randomly selects some nodes to be removed and randomly selects some nodes to be added to the network. We simulate the dynamic property of social networks using three parameters: “the average community out edge ratio” (Avg_c_out_e-ratio), “add new community at some time points” (Add_new_c_at) and “remove min-community at some time points” (Remove_min_c_at). New communities are created, and the smallest community is removed, at regular (REG) or randomly selected (RAN) time points. New communities are constructed from parts of the larger communities at the previous time point. For example, the dataset SYN-VAR-COE_0_3_REG uses an average community out edge ratio of 0.3, regularly creates communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and regularly deletes the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}. The other datasets are described in Table 7-3.

The dataset generators, SYN-FIX and SYN-VAR, provide the ground truth of the communities along with the interaction graphs of the dynamic networks, so the accuracy of EPC can be measured against this ground truth. We use the normalized mutual information (NMI) as the accuracy measure, since NMI is widely used to evaluate the quality of clusters produced by a clustering algorithm [27]. It is defined as:

(14)

where MI(X, Y) is the Mutual information of two random variables X and Y. MI(X,Y) measures the mutual dependency of X and Y and is defined as:

MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p_1(x)\, p_2(y)}    (15)

where p(x, y) is the joint probability distribution function of X and Y, and p₁(x) and p₂(y) are the marginal probability distribution functions of X and Y respectively.

H(X) is the entropy of X and H(Y) is the entropy of Y. Entropy is a measure of the uncertainty associated with a random variable. The NMI score is normalized to the range 0.0~1.0, and a higher score indicates higher accuracy.
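To make the measure concrete, the following minimal sketch computes NMI between a ground-truth labelling and a detected labelling of the same nodes. The normalization by sqrt(H(X) H(Y)) is an assumption, since Eq (14) is not reproduced in this text and other normalizations exist; the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy H of a labelling, using natural logarithms."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """MI(X, Y) from the joint and marginal label frequencies."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def nmi(x, y):
    """NMI normalized by sqrt(H(X) * H(Y)), an assumed choice of Eq (14)."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0 or hy == 0:
        # Degenerate case (a single cluster): convention chosen here.
        return 1.0 if hx == hy else 0.0
    return mutual_information(x, y) / math.sqrt(hx * hy)

truth = [0, 0, 1, 1]
same_partition = nmi(truth, [1, 1, 0, 0])   # same grouping, renamed labels
unrelated = nmi(truth, [0, 1, 0, 1])        # grouping independent of truth
```

The score depends only on how nodes are grouped, not on the label names, which is why renaming the clusters leaves it unchanged.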

7.2 Synthetic Data Experiment

We compare the EPC algorithm with two algorithms, FacetNet [4] and PD-Greedy [5]. We first examine and discuss the effect of the parameters of EPC-SHRINK, EPC-GSCAN, FacetNet and PD-Greedy.

7.2.1 Parameter of FacetNet: α

FacetNet is based on the concept of temporal smoothness and uses the parameter α to control the community partitioning. Because α could affect the resulting partition, we run FacetNet with different values of α and observe how the community partitioning changes.

As shown in Fig 7-1, FacetNet is tested on the SYN-FIX and SYN-VAR datasets. The vertical axis is the average NMI and the horizontal axis is the parameter α from 0.1 to 0.9.

Each line in Fig 7-1 shows the performance on a different dataset. The accuracy of FacetNet is stable regardless of the value of α; that is, FacetNet is not influenced by the value of α.

Figure 7-1 Parameter of FacetNet: α

7.2.2 Parameters of PD-Greedy[5]: α and μ

PD-Greedy has two parameters: α, the parameter of temporal smoothness, and μ, the constraint on community size.

Figure 7-2 Parameters of PD-Greedy: α , μ on dataset SYN-FIX-VOD_3

Fig 7-2 shows the experimental result of PD-Greedy for α and μ. The vertical axis is the average NMI and the horizontal axis is the parameter α from 0.1 to 0.9. Each line in Fig 7-2 shows the performance for a different value of μ on dataset SYN-FIX-VOD_3. The correlation between Avg-NMI and μ is low, and a good setting of μ depends on the dataset. When α is below 0.5, Avg-NMI grows with α. When α is above 0.5, Avg-NMI shows no clear trend.

Figure 7-3 Parameters of PD-Greedy: α , μ on dataset SYN-FIX-VOD_5

Fig 7-3 shows the result of PD-Greedy on dataset SYN-FIX-VOD_5. Each line has a low Avg-NMI because the noise level of SYN-FIX-VOD_5 is higher than that of SYN-FIX-VOD_3. Avg-NMI shows no clear correlation with α.

Figure 7-4 Parameters of PD-Greedy: α , μ on dataset SYN-VAR-COE_0_3_REG

Figure 7-5 Parameters of PD-Greedy: α , μ on dataset SYN-VAR-COE_0_5_REG

Fig 7-4 and Fig 7-5 show the experimental results of PD-Greedy on the SYN-VAR datasets. They show the same phenomenon as Fig 7-2: Avg-NMI grows with α when α is below 0.5, and shows no clear trend when α is above 0.5. When the noise level is higher, Fig 7-5 shows that Avg-NMI is higher with a lower μ. We do not show the experimental results for SYN-VAR-COE_0_3_RAN and SYN-VAR-COE_0_5_RAN because they have the same property as shown in Fig 7-4 and Fig 7-5.

In summary, the parameter α should be set within the range 0.6~0.9 and the parameter μ within the range 1%~3% to obtain a better Avg-NMI.

7.2.3 Parameters of EPC-SHRINK: observation eyeshot, decay weight functions

EPC-SHRINK uses the clustering algorithm SHRINK [12] and has two parameters: the observation eyeshot wr and the weight function. We consider four weight functions: the equal weight function (EQL) and three normalized decay weight functions, linear decay (LIN), exponential decay (EXP) and wave decay (WAVE). As the observation eyeshot wr increases, the observation windows of neighboring time points share more interaction data, so the relationship strengths in the relationship graph change less and the community partition changes less between time points. However, increasing wr is not always better, and we use Fig 7-6 to illustrate this.

Figure 7-6 The EXP weight function under different observation eyeshot wr

Fig 7-6 shows the curves of the EXP weight function for different wr. The curve becomes smoother as wr increases, which shows that the decay property of EXP is averaged out as wr grows. The range of wr should therefore be limited, and the setting of wr depends on the properties of the network.

Figure 7-7 Parameters of EPC-SHRINK: observation eyeshot, decay weight functions on dataset SYN-FIX-VOD_3

Fig 7-7 shows the experimental result of EPC-SHRINK on dataset SYN-FIX-VOD_3. The vertical axis is the average NMI and the horizontal axis is the value of the observation eyeshot (wr). The results reveal that the equal weight function (EQL) is significantly inferior to the three normalized decay weight functions, LIN, EXP and WAVE. LIN, EXP and WAVE behave alike and have stable accuracy. In addition, the observation eyeshot is moderately negatively correlated with Avg-NMI.

Figure 7-8 Parameters of EPC-SHRINK: observation eyeshot, decay weight functions on dataset SYN-FIX-VOD_5

Fig 7-8 shows the experimental result of EPC-SHRINK on dataset SYN-FIX-VOD_5. The vertical and horizontal axes are the same as in Fig 7-7. In contrast to Fig 7-7, the observation eyeshot and Avg-NMI are positively correlated. Taken together, the experiments indicate no consistent relationship between Avg-NMI and the observation eyeshot. The results on the other datasets show the same properties as Fig 7-7 and Fig 7-8, so we do not display them here.

In summary, a good setting of the observation eyeshot depends on the specific dataset. As for the choice of decay weight function, LIN, EXP and WAVE have roughly the same Avg-NMI on all the datasets we generated.

7.2.4 Parameters of EPC-GSCAN: observation eyeshot, decay function and μ

EPC-GSCAN uses the clustering algorithm GSCAN [5]. We try to clarify the correlation between Avg-NMI and the observation eyeshot (wr). Because the experimental results on most datasets are approximately the same, we use the datasets SYN-VAR-COE_0_3_RAN and SYN-VAR-COE_0_5_RAN and the linear decay weight function in this experiment.

Figure 7-9 Parameters of EPC-GSCAN: observation eyeshot, μ on dataset SYN-VAR-COE_0_3_RAN

Fig 7-9 shows the experimental result of EPC-GSCAN-LIN on dataset SYN-VAR-COE_0_3_RAN. The vertical axis is the average NMI and the horizontal axis is the value of the observation eyeshot (wr). The 5 curves in Fig 7-9 use different values of μ. The result suggests that a small μ and a small observation eyeshot (wr) give a higher Avg-NMI.

Figure 7-10 Parameters of EPC-GSCAN: observation eyeshot, μ on dataset SYN-VAR-COE_0_5_RAN

Fig 7-10 shows the experiment of EPC-GSCAN-LIN on dataset SYN-VAR-COE_0_5_RAN. The vertical and horizontal axes are the same as in Fig 7-9. Here the result contradicts the assumption that a small μ and a small observation eyeshot (wr) give a better Avg-NMI. The results show no correlation between the observation eyeshot (wr) and Avg-NMI, so the parameter setting is data dependent.

Figure 7-11 Parameters of EPC-GSCAN: decay weight functions on dataset SYN-VAR-COE_0_3_RAN

We compare different decay functions on dataset SYN-VAR-COE_0_3_RAN in Fig 7-11. The vertical axis is the NMI score and the horizontal axis is the timestamp. Each decay function uses its best-performing parameters, μ and observation eyeshot wr, which are shown on the right of Fig 7-11. The result shows that the equal weight function is worse than the decay weight functions.

Figure 7-12 Parameters of EPC-GSCAN: decay weight functions on dataset SYN-VAR-COE_0_5_RAN

In Fig 7-12, the result shows the same property as Fig 7-11. As the noise level increases, the EXP and WAVE functions achieve a better Avg-NMI than EQL and LIN.

In summary, good settings of the observation eyeshot and the parameter μ depend on the specific dataset. As for the choice of decay weight function, EXP and WAVE have a better Avg-NMI than the other functions.

7.3 Accuracy Comparison

EPC has two versions, EPC-SHRINK and EPC-GSCAN. We compare EPC with two algorithms, FacetNet [4] and PD-Greedy [5]. Each of these algorithms requires its parameters to be preset, as discussed above. In the figures of this section, “a” is the parameter of temporal smoothness, μ is the constraint on community size and wr is the observation eyeshot. For each algorithm, we use the parameter setting with the highest Avg-NMI.
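Selecting the setting with the highest Avg-NMI can be sketched as below. The function name and the input layout are assumptions for illustration, and the scores in the example are made-up numbers, not results from these experiments.

```python
def best_setting(nmi_by_setting):
    """Return (setting, Avg-NMI) with the highest mean NMI over timestamps.

    nmi_by_setting: dict mapping a parameter setting (any hashable value,
    e.g. a tuple of weight function and wr) to the list of per-timestamp
    NMI scores measured with that setting.
    """
    averages = {s: sum(v) / len(v) for s, v in nmi_by_setting.items()}
    best = max(averages, key=averages.get)
    return best, averages[best]

# Made-up per-timestamp scores for two hypothetical settings.
scores = {("LIN", 3): [0.8, 0.9], ("EQL", 3): [0.5, 0.7]}
setting, avg_nmi = best_setting(scores)
```

Averaging over all timestamps before comparing means a setting is judged by its overall accuracy across the twenty timestamps, not by its best single snapshot.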

[Figure: NMI (vertical axis, 0 to 1) versus Timestamp (horizontal axis, 1 to 20) on SYN-VAR-COE_0_5_RAN; legend: EPC-GSCAN-EQL u=4%, wr=3; EPC-GSCAN-LIN u=4%, wr=4; EPC-GSCAN-EXP u=1%, wr=3; EPC-GSCAN-WAVE u=2%, wr=4]

Figure 7-13 Accuracy comparison on dataset SYN-FIX-VOD_3

Fig 7-13 shows the accuracy comparison on dataset SYN-FIX-VOD_3. The vertical axis indicates the NMI score and the horizontal axis indicates the timestamp. The solid curves of EPC-SHRINK and EPC-GSCAN show higher accuracy than the dotted curves of FacetNet and PD-Greedy.

Figure 7-14 Accuracy comparison on dataset SYN-FIX-VOD_5

In Fig 7-14, the vertical and horizontal axes are the same as in Fig 7-13. The solid curves of EPC-SHRINK and EPC-GSCAN show higher accuracy than both FacetNet and PD-Greedy, even though the noise level is higher than in Fig 7-13.

Figure 7-15(a) Accuracy comparison on dataset SYN-VAR-COE_0_3_REG

Figure 7-15(b) Accuracy comparison on dataset SYN-VAR-COE_0_5_REG

Figure 7-15(c) Accuracy comparison on dataset SYN-VAR-COE_0_3_RAN

Figure 7-15(d) Accuracy comparison on dataset SYN-VAR-COE_0_5_RAN

Fig 7-15 shows the accuracy on the SYN-VAR datasets. The accuracies of EPC-SHRINK and EPC-GSCAN are more stable and higher than those of FacetNet and PD-Greedy, whether the noise level of the SYN-VAR dataset is high or low.
