
Chapter 7 Experimental results and Performance study

7.1 Synthetic data generation…

7.1.1 SYN-FIX

Parameter      Description                                             Default
n              Initial number of vertices.                             128
s_c            Initial size of community.                              32
n_c            Initial number of communities.                          4
Avg_v_deg      Average vertex degree.                                  16
Avg_v_out_deg  Average vertex degree out of original community.        3 ~ 5
Ran_sel        Random select some vertices out of original community.  3

Table 7-1 Parameters of SYN-FIX

The data generator SYN-FIX was released in [26]; the original idea was proposed in [18] and is also discussed in [5]. SYN-FIX produces an environment with a fixed number of nodes and communities over time. It generates a network of 128 nodes, divided into four communities of 32 nodes each, with an average vertex degree (Avg_v_deg) of 16. Table 7-1 describes the parameters of SYN-FIX.

In SYN-FIX, the parameter Avg_v_out_deg is the average number of edges that connect a node to nodes outside its own community. Avg_v_out_deg controls the number of inter-edges placed between different communities; because the average vertex degree is fixed, it also determines the number of intra-edges placed within each community.

The number of intra-edges increases as Avg_v_out_deg decreases. We generate two SYN-FIX datasets (SYN-FIX-VOD_3 and SYN-FIX-VOD_5) and produce such networks for twenty consecutive timestamps. SYN-FIX-VOD_3 uses Avg_v_out_deg = 3 and SYN-FIX-VOD_5 uses Avg_v_out_deg = 5. At each timestamp, 3 randomly selected nodes leave their original community and each joins one of the other three communities at random.
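The SYN-FIX construction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the generator released in [26]: it assumes edges are drawn independently, with probabilities chosen so that the expected degrees match Avg_v_deg and Avg_v_out_deg. The function names initial_membership, snapshot_edges and migrate are hypothetical.

```python
import random
from itertools import combinations


def initial_membership(n=128, n_c=4):
    """Assign n nodes to n_c equally sized communities (node -> label)."""
    s_c = n // n_c  # community size, 32 with the defaults
    return {v: v // s_c for v in range(n)}


def snapshot_edges(membership, avg_deg=16.0, avg_out_deg=3.0, rng=None):
    """Draw one undirected edge set for the given community assignment."""
    if rng is None:
        rng = random.Random()
    n = len(membership)
    s_c = n // len(set(membership.values()))  # nominal community size
    # Assumption: independent edges whose probabilities give the
    # requested expected intra- and inter-community degrees.
    p_in = (avg_deg - avg_out_deg) / (s_c - 1)  # same-community pairs
    p_out = avg_out_deg / (n - s_c)             # cross-community pairs
    edges = set()
    for u, v in combinations(range(n), 2):
        p = p_in if membership[u] == membership[v] else p_out
        if rng.random() < p:
            edges.add((u, v))
    return edges


def migrate(membership, ran_sel=3, rng=None):
    """Return a copy in which ran_sel random nodes switch community."""
    if rng is None:
        rng = random.Random()
    labels = sorted(set(membership.values()))
    updated = dict(membership)
    for v in rng.sample(sorted(updated), ran_sel):
        updated[v] = rng.choice([c for c in labels if c != updated[v]])
    return updated


# Twenty consecutive timestamps with Avg_v_out_deg = 3.
rng = random.Random(7)
membership = initial_membership()
snapshots = []
for _ in range(20):
    snapshots.append((membership, snapshot_edges(membership, rng=rng)))
    membership = migrate(membership, rng=rng)
```

Each entry of snapshots pairs the ground-truth assignment with the edge set drawn for that timestamp, so lowering avg_out_deg shifts edges from between communities to within them.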

7.1.2 SYN-VAR

Parameter          Description                                              Default
n                  Initial number of vertices.                              256
s_c                Initial size of community.                               [32, 64]
n_c                Initial number of communities.                           [4, 8]
Avg_c_e-ratio      Average community edge ratio.                            [0.2, 0.8]
Avg_c_out_e-ratio  Average community edge ratio out of original community.  [0.3, 0.5]
Ran_sel            Random select some nodes out of original community.      [8, 20]
Add_v_at           Add some new vertices at each time point.                16
Add_new_c_at       Add new community at some time points.
Remove_min_c_at    Remove min-community at some time points.

Table 7-2 Parameters of SYN-VAR

SYN-VAR is first discussed in [5]. SYN-VAR produces an environment in which the number of nodes and the number of communities vary over time. The average community edge ratio (Avg_c_e-ratio) is the ratio of the number of edges of a community to the maximum number of edges possible among its nodes. The average community edge ratio out of original community (Avg_c_out_e-ratio) is the ratio of inter-edges to the total edges of the community. For example, suppose we generate a network containing 256 nodes in 4 communities of 64 nodes each, and set Avg_c_e-ratio to 0.5 and Avg_c_out_e-ratio to 0.3. The total number of edges of a single community is 0.5 × (64 × 63 / 2) = 1008. The number of inter-edges is 1008 × 0.3 ≈ 302 and the number of intra-edges is 1008 − 302 = 706.
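The edge counts of a single community can be checked with a short calculation. The sketch below is only an illustration: it assumes Avg_c_e-ratio is applied to the number of edges of a complete graph on the community's nodes, and Avg_c_out_e-ratio to the resulting total. The function name community_edge_budget is hypothetical.

```python
def community_edge_budget(size, edge_ratio, out_edge_ratio):
    """Return (total, inter, intra) edge counts for one community."""
    max_edges = size * (size - 1) // 2   # complete graph on `size` nodes
    total = edge_ratio * max_edges       # Avg_c_e-ratio share of the maximum
    inter = out_edge_ratio * total       # Avg_c_out_e-ratio share of the total
    intra = total - inter                # remaining edges stay inside
    return total, inter, intra


# Community of 64 nodes, Avg_c_e-ratio = 0.5, Avg_c_out_e-ratio = 0.3.
total, inter, intra = community_edge_budget(64, 0.5, 0.3)
print(round(total), round(inter), round(intra))
```

Rounding is left to the caller, since the ratios generally do not produce whole numbers of edges.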

All datasets produced by SYN-FIX and SYN-VAR are listed in Table 7-3.

Type     Synthetic dataset    Description
SYN-FIX  SYN-FIX-VOD_3        Fixed number of nodes, fixed number of communities, average vertex out-degree = 3.
SYN-FIX  SYN-FIX-VOD_5        Fixed number of nodes, fixed number of communities, average vertex out-degree = 5.
SYN-VAR  SYN-VAR-COE_0_3_REG  Dynamically changing number of nodes and communities, average community out edge ratio = 0.3, creating new communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and deleting the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}.
SYN-VAR  SYN-VAR-COE_0_5_REG  Dynamically changing number of nodes and communities, average community out edge ratio = 0.5, creating new communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and deleting the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}.
SYN-VAR  SYN-VAR-COE_0_3_RAN  Dynamically changing number of nodes and communities, average community out edge ratio = 0.3, randomly creating 7 new communities at time points {2, 3, 4, 6, 9, 10, 14} and deleting the smallest community at time points {7, 9, 14, 15, 18, 19, 20}.
SYN-VAR  SYN-VAR-COE_0_5_RAN  Dynamically changing number of nodes and communities, average community out edge ratio = 0.5, creating new communities at time points {1, 2, 3, 6, 8, 11, 12} and deleting the smallest community at time points {7, 8, 10, 11, 14, 16, 17}.

Table 7-3 Synthetic datasets generated by SYN-FIX and SYN-VAR

We generate four SYN-VAR datasets and produce such networks for twenty consecutive timestamps. At each timestamp, SYN-VAR randomly selects some nodes to be removed from the network and some nodes to be added to it. We simulate the dynamic property of social networks using three parameters: “the average community out edge ratio” (Avg_c_out_e-ratio), “add new community at some time points” (Add_new_c_at) and “remove min-community at some time points” (Remove_min_c_at). New communities are created, and the smallest community is removed, at either regular (REG) or randomly selected (RAN) time points. New communities are formed from parts of the larger communities of the previous time point. For example, the dataset SYN-VAR-COE_0_3_REG uses an average community out edge ratio of 0.3, regularly creates communities at time points {3, 5, 7, 9, 11, 13, 15, 17, 19} and regularly deletes the smallest community at time points {4, 6, 8, 10, 12, 14, 16, 18, 20}. The remaining datasets are described in Table 7-3.
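The effect of the creation and deletion schedules on the number of communities can be traced with simple bookkeeping. The sketch below is illustrative only: it assumes 4 initial communities (the lower end of the n_c range) and that a creation is applied before a deletion when both fall on the same time point. The function name community_counts is hypothetical.

```python
def community_counts(initial, add_at, remove_at, timestamps=20):
    """Number of communities after each time point 1..timestamps."""
    counts = []
    current = initial
    for t in range(1, timestamps + 1):
        if t in add_at:        # Add_new_c_at: one new community
            current += 1
        if t in remove_at:     # Remove_min_c_at: drop the smallest one
            current -= 1
        counts.append(current)
    return counts


# Regular schedule (REG): create at odd, delete at even time points.
reg = community_counts(4, set(range(3, 20, 2)), set(range(4, 21, 2)))

# Random schedule of SYN-VAR-COE_0_3_RAN, as listed in Table 7-3.
ran = community_counts(4, {2, 3, 4, 6, 9, 10, 14}, {7, 9, 14, 15, 18, 19, 20})
```

Under these assumptions the regular schedule alternates between 4 and 5 communities, while the random schedule lets the count drift further from its starting value before returning to it.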

The dataset generators SYN-FIX and SYN-VAR provide the ground-truth communities together with the interaction graphs of the dynamic networks they produce, so the accuracy of EPC can be measured against this ground truth. We use normalized mutual information (NMI) as the accuracy measure, since NMI is widely used to evaluate the quality of the clusters produced by a clustering algorithm [27]. It is defined as:

(14)

where MI(X, Y) is the mutual information of two random variables X and Y. MI(X, Y) measures the mutual dependence of X and Y and is defined as:

$$MI(X, Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p_1(x)\, p_2(y)} \qquad (15)$$

where p(x, y) is the joint probability distribution function of X and Y, and p₁(x) and p₂(y) are the marginal probability distribution functions of X and Y respectively.

H(X) is the entropy of X and H(Y) is the entropy of Y. Entropy is a measure of the uncertainty associated with a random variable. The NMI score is normalized to the range from 0.0 to 1.0; a higher score indicates higher accuracy.
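As an illustration of how NMI compares a detected clustering with the ground truth, the following sketch computes entropy, mutual information and NMI from two label lists. It is a sketch under an assumption: the normalization used here is the geometric mean sqrt(H(X) H(Y)), one common choice; other normalizations (for example the arithmetic mean of the entropies) are also in use. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy of a label list (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information of two label lists of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def nmi(x, y):
    """NMI, assuming normalization by sqrt(H(X) * H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx == 0.0 or hy == 0.0:
        # Convention for degenerate single-cluster inputs.
        return 1.0 if hx == hy else 0.0
    return mutual_information(x, y) / math.sqrt(hx * hy)


truth = [0, 0, 0, 1, 1, 1]
found = [1, 1, 1, 0, 0, 0]   # same partition, different label names
score = nmi(truth, found)
```

Because NMI depends only on how the nodes are grouped, relabelled but otherwise identical partitions receive the maximum score, while statistically independent partitions score 0.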

A Study on Applying Relation Extraction Strategies in Documents to Dynamic Community Mining (pp. 56-60)