

Chapter 4 Relationship Extraction Strategy

4.4 Discussion of Relationship Extraction Strategy

In the real world, the relationship between individuals can decay over time, and the interaction at each time point should be considered an energy that increases the relationship strength between individuals. We therefore presume each interaction has the same lifetime, equal to the size of the observation window: the lifetime of each interaction is extended from 1 to the size of the observation window. This property overcomes the weakness that no relationship exists between individuals when no interactions occur.

Fig 4-6. Relationship strength curve between u and v extracted from the interaction data of Fig 4-2 based on the Normalized Equal Weight Function (wr = 2)

Fig 4-6 shows the evolution of the relationship strength between individuals u and v in the interaction data of Fig 4-2 using the normalized equal weight function. The dotted lines indicate the interactions occurring at time points 2, 6 and 7. The interactions at all time points have the same lifetime, equal to the size of the observation window. The solid line sums up the curves of all interactions and represents the relationship strength between u and v over time.
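The curve in Fig 4-6 can be reproduced with a short sketch. The exact form of the normalized equal weight function is defined earlier in the thesis; here we assume N(t, tc) = 1/(2·wr + 1) inside the observation window, which reproduces the shape described above.

```python
# Sketch of the equal-weight relationship strength curve. Assumption
# (not taken verbatim from the thesis): the normalized equal weight is
# N(t, tc) = 1 / (2*wr + 1) whenever |t - tc| <= wr, and 0 otherwise.

def equal_weight(t, tc, wr):
    """Constant weight inside the observation window, zero outside."""
    return 1.0 / (2 * wr + 1) if abs(t - tc) <= wr else 0.0

def strength_curve(interaction_times, time_points, wr, weight_fn):
    """Relationship strength: sum of the weight curves of all interactions."""
    return [sum(weight_fn(t, tc, wr) for t in interaction_times)
            for tc in time_points]

# Interactions of (u, v) at time points 2, 6 and 7, window radius wr = 2.
curve = strength_curve([2, 6, 7], range(1, 8), 2, equal_weight)
```

Under this assumption each interaction contributes 0.2 to every time point in its window, so the curve stays flat at 0.4 over t = 4..7, matching the flat high segment visible in Fig 4-6.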


However, the solid line in Fig 4-6 shows that the relationship strength curve stays equally high at time points t = 4, 5, 6 and 7. The relationship strength curve therefore does not match the time points at which interactions actually occurred, and it fails to exhibit the dynamic property.

Fig 4-7. Relationship strength curve of the interaction data of Fig 4-2 based on the Normalized Linear Decay Weight Function (wr = 2)

We replace the equal weight function with the linear decay weight function; the resulting relationship strength curve is shown in Fig 4-7. The solid curve in Fig 4-7 indicates the relationship strength between individuals; it is high at t = 2, 5, 6 and 7 and matches the timestamps at which interactions occurred. The difference between Fig 4-6 and Fig 4-7 is that the relationship strength curve in Fig 4-7 demonstrates a more dynamic property, so the normalized linear decay weight function is more realistic than the equal weight function.
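The linear decay case can be sketched the same way. We assume one plausible normalization, N(t, tc) = (wr + 1 − |t − tc|)/(wr + 1)², under which the weights inside one observation window sum to 1; the thesis's exact normalization may differ.

```python
# Sketch of the linear-decay relationship strength curve. Assumption
# (not taken verbatim from the thesis): the normalized linear decay weight
# is N(t, tc) = (wr + 1 - |t - tc|) / (wr + 1)**2 inside the window, so
# the weights over one observation window sum to 1.

def linear_decay_weight(t, tc, wr):
    """Weight decays linearly with the distance from the interaction time."""
    d = abs(t - tc)
    return (wr + 1 - d) / float((wr + 1) ** 2) if d <= wr else 0.0

def strength_curve(interaction_times, time_points, wr, weight_fn):
    """Relationship strength: sum of the weight curves of all interactions."""
    return [sum(weight_fn(t, tc, wr) for t in interaction_times)
            for tc in time_points]

curve = strength_curve([2, 6, 7], range(1, 8), 2, linear_decay_weight)
```

Under this assumption the curve peaks at t = 2 and climbs again over t = 5..7, i.e. it is high exactly where interactions occur, which is the dynamic behaviour Fig 4-7 illustrates.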


Chapter 5

Current Static Community Detection Methods

After generating the relationship graph at each time point, we choose a static community detection method for discovering the community partition at each timestamp. Although many studies have focused on community detection in static networks, not every method is suitable for relationship graphs. Two issues need to be considered: (1) discovering communities using a weighted graph (the relationship graph); (2) detecting noisy vertices whose relationship strength is too low to belong to any community. The SHRINK algorithm [12] overcomes the problem of parameter pre-definition, such as the minimum similarity threshold (ε) and minimum core size (μ) in density-based clustering algorithms and the predefined number of clusters in partitioning-based clustering algorithms. According to its authors' experiments, SHRINK is an efficient, parameter-free and highly accurate algorithm compared with other algorithms [18, 19], so we choose SHRINK as our clustering algorithm. For comparison, we also use the greedy density-based clustering method (GSCAN) used in PD-Greedy [5, 21].

Section 5.1 presents the detailed definition of GSCAN and Section 5.2 presents the definition of SHRINK. Section 5.3 illustrates the quality measurement of community partitions. Section 5.4 describes the SHRINK algorithm in detail, and the GSCAN algorithm is presented in Section 5.5.

5.1 GSCAN

In this section we present the definition of GSCAN and related notation. Let G = (V, E, W) be a weighted undirected network, where W is the weight set of the edge set E. GSCAN uses structural similarity as its similarity measure, and the related definitions are as follows:

Definition 7 [5]. (Neighborhood)

Given G = (V, E, W), for a node u ∈ V, the adjacent nodes of u together with u itself are the neighbors of u, i.e.: Γ(u) = {v ∈ V | (u, v) ∈ E} ∪ {u}.

Definition 8 [5]. (Structural Similarity)


Let G = (V, E, W) be a weighted undirected network. The structural similarity between two adjacent nodes u and v is defined as:

σ(u, v) = ( Σ_{x∈Γ(u)∩Γ(v)} w(u, x)·w(v, x) ) / ( √(Σ_{x∈Γ(u)} w(u, x)²) · √(Σ_{x∈Γ(v)} w(v, x)²) )   (8)

where w(u, x) indicates the weight of the edge (u, x).
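Eq (8) can be transcribed directly. One detail is left open by the extracted text: the weight of the self term w(u, u); the sketch below assumes it is 1, which may differ from the thesis's exact convention.

```python
import math

def neighborhood(adj, u):
    """Neighbors of u plus u itself (Definition 7)."""
    return set(adj[u]) | {u}

def structural_similarity(adj, w, u, v):
    """Weighted cosine structural similarity of Eq (8).

    adj maps each node to its adjacent nodes; w maps an edge (a, b) to
    its weight. The self-weight w(u, u) is assumed to be 1 here.
    """
    def weight(a, b):
        return 1.0 if a == b else w.get((a, b), w.get((b, a), 0.0))

    common = neighborhood(adj, u) & neighborhood(adj, v)
    num = sum(weight(u, x) * weight(v, x) for x in common)
    den = (math.sqrt(sum(weight(u, x) ** 2 for x in neighborhood(adj, u)))
           * math.sqrt(sum(weight(v, x) ** 2 for x in neighborhood(adj, v))))
    return num / den if den else 0.0
```

For two nodes of a triangle with unit edge weights, the closed neighborhoods coincide and the similarity is exactly 1.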

GSCAN applies a minimum similarity threshold ε to the computed structural similarity when assigning cluster membership, as formalized in the following ε-neighborhood definition:

Definition 9 [5]. (ε-Neighborhood)

For a node u ∈ V, the ε-neighborhood of u is defined by N_ε(u) = {v ∈ Γ(u) | σ(u, v) ≥ ε}.

When a vertex shares sufficient structural similarity with enough neighbors, it becomes a seed for a cluster. Such a vertex is called a core node. Core nodes are a special class of vertices that have a minimum number of neighbors whose structural similarity exceeds the threshold ε [21].

Definition 10 [5]. (Core node)

A node u ∈ V is called a core node w.r.t. ε and μ iff |N_ε(u)| ≥ μ.

Definition 11 [5]. (Directly reachable)

A node x ∈ V is directly reachable from a node u ∈ V w.r.t. ε and μ iff (1) u is a core node and (2) x ∈ N_ε(u).

Definition 12 [5]. (Reachable)

A node v ∈ V is reachable from a node u ∈ V w.r.t. ε and μ iff there is a chain of nodes v₁, …, vₙ with v₁ = u and vₙ = v such that v_{i+1} is directly reachable from v_i w.r.t. ε and μ.

Definition 13 [5]. (Connected)

A node v ∈ V is connected to a node u ∈ V w.r.t. ε and μ iff there is a node x ∈ V such that both v and u are reachable from x w.r.t. ε and μ.

Definition 14 [5]. (Connected cluster)

A non-empty subset C ⊆ V is called a connected cluster w.r.t. ε and μ iff all nodes in C are pairwise connected w.r.t. ε and μ and C is maximal w.r.t. reachability. A connected cluster is uniquely determined by any of its cores.

Figure 5-1 (a) Sample network G (b) Connected clusters of sample network G

We use Fig 5-1 to illustrate the related definitions. In Fig 5-1, a node indicates an individual and an edge weight indicates the structural similarity between individuals. For the given threshold ε, the core nodes are node 13, node 15 and node 6. Node 10 is directly reachable from node 13 because node 13 is a core node and node 10 ∈ N_ε(node 13). Node 4 is reachable from node 6 through the directly reachable chain (node 6, node 15, node 4). Based on the definition of a connected cluster, there are two clusters, {3, 10, 11, 13} and {1, 4, 5, 6, 9, 14, 15}.
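Definitions 9-14 translate into a small SCAN-style routine. The sketch below uses hypothetical names and takes the pairwise similarities precomputed in a dict sigma keyed by node pairs, so it stands independently of any particular similarity implementation.

```python
def eps_neighborhood(sigma, neighbors, u, eps):
    """N_eps(u): neighbors of u (incl. u) with similarity >= eps (Def. 9)."""
    return {v for v in neighbors[u] | {u}
            if sigma.get(frozenset((u, v)), 1.0 if u == v else 0.0) >= eps}

def is_core(sigma, neighbors, u, eps, mu):
    """Core node (Def. 10): at least mu eps-neighbors."""
    return len(eps_neighborhood(sigma, neighbors, u, eps)) >= mu

def expand_clusters(sigma, neighbors, eps, mu):
    """Grow connected clusters from cores via direct reachability
    (Defs. 11-14): a breadth-first expansion from each unvisited core."""
    label, clusters = {}, []
    for seed in neighbors:
        if seed in label or not is_core(sigma, neighbors, seed, eps, mu):
            continue
        cluster, queue = set(), [seed]
        while queue:
            u = queue.pop()
            if u in cluster:
                continue
            cluster.add(u)
            label[u] = len(clusters)
            if is_core(sigma, neighbors, u, eps, mu):
                queue.extend(eps_neighborhood(sigma, neighbors, u, eps) - cluster)
        clusters.append(cluster)
    return clusters
```

On a triangle of mutually similar nodes plus one isolated node, the routine returns the triangle as the single cluster and leaves the isolated node unclassified.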

5.2 SHRINK

In this section we introduce SHRINK and related notation. For its structural similarity measure, SHRINK uses the same cosine similarity as GSCAN, and the related definitions are as follows:

Definition 16 [12]. (Micro-community)

Given a node a in network G, MC(a) is a local micro-community if and only if it satisfies conditions (1), (2) and (3) of [12], which are stated in terms of the largest similarity between nodes and their adjacent neighbor nodes.

We use Fig 5-1(a) to illustrate dense pairs and micro-communities. All the dense pairs within Fig 5-1(a) are shown in Fig 5-2. The dotted-line nodes and dotted-line edges indicate that these nodes form dense pairs and can be grouped into a micro-community.

Figure 5-2 All dense pairs within the sample network G in Fig 5-1(a)

Definition 17 [12]. (Super-network)


Let {MC₁, MC₂, …} be the micro-communities in G. Shrinking each micro-community into a super-node yields a new network G′, which is called a super-network of G.

Notably, the algorithm discovers not only all the communities but also the hubs and outliers in the network. A hub is a node shared by several overlapping communities and plays a special role in many real networks, such as search engines in a web-page network or communication centers in a protein network. An outlier belongs to no community because the similarities between it and other nodes are too small. The algorithm does not partition all the nodes into communities, and this property fits our requirement exactly.

5.3 Measurement of Partitioning Quality

Although several well-known quality measures such as the normalized cut [24] and modularity [18] have been proposed, modularity is by far the most popular measure.

GSCAN and SHRINK both use the same similarity-based modularity function Qs [25].

Qs = Σ_{i=1}^{k} [ IS_{Ci} / TS − ( DS_{Ci} / TS )² ]   (9)

Assume the community partition has k communities {C₁, C₂, …, Ck}. IS_{Ci} is the total similarity of the nodes within cluster Ci, DS_{Ci} is the total similarity between the nodes in cluster Ci and any nodes in the network, and TS is the total similarity between any two nodes in the network.
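Eq (9) can be computed directly from an edge-similarity map. In the sketch below (hypothetical names), each pair is counted in both directions, so IS, DS and TS are all doubled consistently.

```python
def similarity_modularity(edges, clusters):
    """Similarity-based modularity Qs = sum_i [IS_i/TS - (DS_i/TS)^2].

    edges: {(u, v): similarity}. Pairs are counted in both directions,
    so IS_i, DS_i and TS are doubled consistently below.
    """
    TS = 2.0 * sum(edges.values())
    member = {u: i for i, c in enumerate(clusters) for u in c}
    IS = [0.0] * len(clusters)
    DS = [0.0] * len(clusters)
    for (u, v), s in edges.items():
        cu, cv = member.get(u), member.get(v)
        if cu is not None:
            DS[cu] += s                # similarity incident to cluster cu
        if cv is not None:
            DS[cv] += s                # similarity incident to cluster cv
        if cu is not None and cu == cv:
            IS[cu] += 2.0 * s          # intra-cluster similarity, both directions
    return sum(IS[i] / TS - (DS[i] / TS) ** 2 for i in range(len(clusters)))
```

For two disjoint unit-weight triangles clustered separately, each cluster contributes 6/12 − (6/12)² = 0.25, giving Qs = 0.5.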

SHRINK is based on this quality function Qs and incrementally calculates its gain. Given two adjacent local communities Ci and Cj, the modularity gain of merging them can be computed by

ΔQs(Ci, Cj) = 2 ( S_{ij} / TS − DS_{Ci} · DS_{Cj} / TS² )   (10)

where S_{ij} is the total similarity of the edges between the two communities Ci and Cj.

Based on Eq (10), assume that a micro-community MC is composed of h clusters, i.e.: MC = {C₁, C₂, …, C_h}; the modularity gain for merging the micro-community into a super-node can then be computed as

ΔQs(MC) = Σ_{p<q} ΔQs(Cp, Cq) = 2 Σ_{p<q} ( S_{pq} / TS − DS_{Cp} · DS_{Cq} / TS² )   (11)

SHRINK uses the modularity gain to control the shrinkage of the micro-communities: only when the modularity gain ΔQs(MC) is positive are the communities within a micro-community MC merged into a super-node.

5.4 Algorithm of SHRINK

We use Fig 5-1 ~ Fig 5-5 to illustrate the key points of SHRINK. Given the sample network G shown in Fig 5-1, nodes indicate individuals and the weight of an edge indicates the structural similarity between individuals.

Each round of SHRINK has two phases. (1) Each node u is initially considered a micro-community MC(u). For each node x within MC(u), SHRINK checks whether any neighbor v of x forms a dense pair with x; if so, v is pushed into the micro-community MC(u). Examples are shown in Fig 5-3(a) and Fig 5-4(a), where the dotted lines indicate all dense pairs found in the network G. (2) SHRINK computes ΔQs(MCi) for all micro-communities {MC₁, MC₂, …, MCk}, and only when ΔQs(MCi) > 0 are all the nodes of the micro-community MCi merged into a super-node, which may contain more than one node in the next round. Fig 5-3(b) and 5-4(b) show this second phase. Fig 5-5(a) shows the result of the third round of the SHRINK process and Fig 5-5(b) shows the fourth round. The process terminates in Fig 5-5(b) because no micro-community yields a positive modularity gain. The network shrunk from G is called a super-network, as shown in Fig 5-3(a) and Fig 5-4(a). Fig 5-5(b) shows the final result of SHRINK: there is a hub (node 2) and an outlier (node 16).
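Phase (1), collecting dense pairs, can be sketched as a mutual-best-neighbor test; this is a simplification of the formal dense-pair conditions in [12], using hypothetical names.

```python
def dense_pairs(sim, neighbors):
    """Mutual-best pairs: u and v form a dense pair when each is the
    most similar adjacent neighbor of the other (a sketch of phase 1
    of SHRINK; the formal conditions are in [12])."""
    def best(u):
        return max(neighbors[u], key=lambda x: sim[frozenset((u, x))],
                   default=None)
    return {frozenset((u, v))
            for u in neighbors for v in neighbors[u]
            if best(u) == v and best(v) == u}
```

On the path 1-2-3 with similarities 0.9 and 0.4, only (1, 2) is a dense pair: node 3's best neighbor is 2, but node 2's best neighbor is 1.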


5.5 Algorithm of GSCAN

In this section, we describe the algorithm GSCAN. GSCAN extends SCAN [21] and combines it with a greedy heuristic setting of ε. SCAN performs a one-pass scan over each node of a network and finds all structure-connected clusters for a given parameter setting. The pseudo code of SCAN is presented in Fig 5-7. Given a weighted undirected graph G = (V, E), all nodes are initially labeled as unclassified. For each node that is not yet classified, SCAN checks whether the node is a core node. If it is, a new cluster is expanded from it; otherwise, the vertex is labeled as a non-member.

GSCAN uses the greedy heuristic setting of ε to optimize the modularity score Q of the clustering result. GSCAN adjusts ε according to the change of the modularity score Q, decreasing or increasing ε until Q reaches a local maximum [5].


25: // 1.2 determine the modularity score of CP
26: Q = Qs(CP);
27: return CP, Q;

Figure 5-7 Algorithm of SCAN

GSCAN chooses the median of the similarity values of sample nodes picked from V as the seed threshold ε_seed; the sampling rate is only about 5~10%. GSCAN then increases or decreases ε by a unit Δ = 0.01 or 0.02 and maintains two kinds of heaps: (1) a max-heap Hmax for edges whose similarity is below ε_seed, and (2) a min-heap Hmin for edges whose similarity is above ε_seed. Hmax and Hmin are built during the initial clustering. After finding the initial clusters CP and calculating their modularity Qmid, GSCAN calculates two additional modularity values, Qhigh and Qlow. Here, Qhigh is calculated from CP together with the edges whose similarity lies in the range [ε_seed, ε_seed + Δ] in Hmin, and Qlow is calculated from CP without the edges whose similarity lies in the range [ε_seed − Δ, ε_seed] in Hmax. If Qhigh is the highest among Qhigh, Qmid and Qlow, GSCAN increases the density threshold by Δ; if Qlow is the highest, GSCAN decreases it by Δ; otherwise, ε_seed is already the best density parameter. The initial clustering CP is continuously modified by adding edges from Hmax to CP or by deleting edges of Hmin from CP. The detailed algorithm of GSCAN is in Fig 5-8 [5]:

Algorithm 2: GSCAN [5]

Input: weighted network G = (V, E), ε, μ
Output: set of clusters CP = {C1, C2, …, Ck}, modularity score Q

1: CPmid ← ∅; CPhigh ← ∅; CPlow ← ∅;
2: Qmid ← 0; Qhigh ← 0; Qlow ← 0;
3: (CPmid, Qmid) ← SCAN(G, ε, μ);
4: (CPhigh, Qhigh) ← SCAN(G, ε + Δ, μ);
5: (CPlow, Qlow) ← SCAN(G, ε − Δ, μ);
6: while (Qmid ≠ max(Qmid, Qhigh, Qlow)) do
7:   if Qhigh = max(Qmid, Qhigh, Qlow)
8:     ε ← ε + Δ;
9:     (CPlow, Qlow) ← (CPmid, Qmid);
10:    (CPmid, Qmid) ← (CPhigh, Qhigh);
11:    (CPhigh, Qhigh) ← SCAN(G, ε + Δ, μ);
12:  else if Qlow = max(Qmid, Qhigh, Qlow)
13:    ε ← ε − Δ;
14:    (CPhigh, Qhigh) ← (CPmid, Qmid);
15:    (CPmid, Qmid) ← (CPlow, Qlow);
16:    (CPlow, Qlow) ← SCAN(G, ε − Δ, μ);
17:  end
18: end
19: return CPmid, Qmid;

Figure 5-8 Algorithm of GSCAN
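The greedy search in Fig 5-8 generalizes to any base clustering routine. Below is a minimal sketch, assuming scan is a callable with the hypothetical signature scan(eps) -> (clusters, Q), standing in for SCAN(G, ε, μ).

```python
def greedy_eps_search(scan, eps, delta=0.01):
    """Greedy search for the eps that locally maximizes modularity,
    mirroring Fig 5-8. scan(eps) -> (clusters, Q) is a stand-in for the
    base clustering routine; delta is the unit step for adjusting eps."""
    cp_mid, q_mid = scan(eps)
    cp_high, q_high = scan(eps + delta)
    cp_low, q_low = scan(eps - delta)
    while q_mid < max(q_high, q_low):
        if q_high >= q_low:          # moving eps up improves modularity
            eps += delta
            cp_low, q_low = cp_mid, q_mid
            cp_mid, q_mid = cp_high, q_high
            cp_high, q_high = scan(eps + delta)
        else:                        # moving eps down improves modularity
            eps -= delta
            cp_high, q_high = cp_mid, q_mid
            cp_mid, q_mid = cp_low, q_low
            cp_low, q_low = scan(eps - delta)
    return cp_mid, q_mid, eps
```

With a stub whose quality peaks at eps = 0.35, starting from eps = 0.30 the loop walks upward in steps of delta and stops at the local maximum.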


Chapter 6

The Community Pedigree Mapping

Current methods [3, 5, 6, 13] have focused on the problem of mapping one community at the previous timestamp to one community at the current timestamp (the 1-1 mapping problem). We argue these methods are not suitable for real community evolution, as discussed in Section 2.3. We therefore propose Community Pedigree Mapping to express the evolution of communities. In this chapter we illustrate the evolution of communities between two consecutive time points. Section 6.1 describes the similarity between communities and Section 6.2 presents the states of a community during its lifetime. Section 6.3 introduces the details of Community Pedigree Mapping and Section 6.4 presents the “relationship Extraction and community Pedigree dynamic Community miner” (EPC).

6.1 The Similarity between Communities

Social networks are dynamic, and a different number of individuals is alive at each time point. If an individual is absent at either time point t−1 or t, we treat it as a negative one. On the other hand, the positive individuals we care about are alive at both consecutive time points t−1 and t; we further define them as Influence individuals.

Definition 11 (Influence individuals)

Given the relationship graphs RG_{t−1}(V_{t−1}, E_{t−1}) and RG_t(V_t, E_t), where V_{t−1} is the node set of RG_{t−1} and V_t is the node set of RG_t, we define the Influence individuals at time points t−1 and t as:

II(t−1, t) = V_{t−1} ∩ V_t   (12)

Our Community Pedigree Mapping relies on the Influence individuals to determine the similarity between communities.

Definition 12 (The similarity between the Communities at different timestamps)


Given the i-th community C_i^{t−1} of community partition CP_{t−1} at time t−1 and the j-th community C_j^t of CP_t at time t, the similarity between C_i^{t−1} and C_j^t is based on the number of individuals who are alive in both communities and is defined below:

sim(C_i^{t−1}, C_j^t) = |C_i^{t−1} ∩ C_j^t| / max(|C_i^{t−1}|, |C_j^t|)   (13)

Figure 6-1 Example of community similarity calculation

Take Fig 6-1 as an example to illustrate Eq (13). There are two communities A and B at time t−1 and two communities C and D at time t, and 8 members overlap between community A and community C. The similarity between A and C is 8/12, since the community size of A is larger than that of C (12 > 10).
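Eq (13) can be written directly. The theta parameter below anticipates the minimum community similarity threshold discussed later in this section; adding it as a default argument is our own convenience, not part of Eq (13).

```python
def community_similarity(c_prev, c_cur, theta=0.0):
    """Eq (13): shared members divided by the size of the larger community.

    theta is the minimum community similarity threshold; similarities
    below it are forced down to 0.
    """
    a, b = set(c_prev), set(c_cur)
    sim = len(a & b) / float(max(len(a), len(b)))
    return sim if sim >= theta else 0.0
```

Reproducing the A/C example: 12 and 10 members with 8 shared gives 8/12, and a threshold of 0.7 would suppress that similarity to 0.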

Figure 6-2 (a) Merge example (b) Split example


A serious problem arises when we determine the similarity between communities: splitting from or merging with a small community can make the real Alive state disappear. We use Fig 6-2 to illustrate the situation. Consider the two communities A and B at time t−1 and C at time t in Fig 6-2(a): without any constraint on similarity generation, C is treated as the combination of A and B even though the members of community C are almost entirely the members of A. In Fig 6-2(b), communities E and F are both treated as splits of D even though F contains few members of community D. This phenomenon is not reasonable, so we predefine a minimum community similarity threshold to avoid it: we set the similarity to 0 whenever it falls below the minimum community similarity threshold.

6.2 The State of Communities

Dynamic communities can change over time, and we further define five community states: Birth, Death, Alive, Child and Division.

Definition 13 (Birth)

A new community C_j^t is born at time t iff there is no similarity between C_j^t and any community of CP_{t−1}, i.e.: sim(C_i^{t−1}, C_j^t) = 0 for all C_i^{t−1} ∈ CP_{t−1}.

Definition 14 (Death)

An old community C_i^{t−1} is dead at time point t iff there is no similarity between C_i^{t−1} and any community of community partition CP_t, i.e.: sim(C_i^{t−1}, C_j^t) = 0 for all C_j^t ∈ CP_t.

Definition 15 (Alive)

A current community C_j^t is alive iff there is positive similarity between exactly one community of community partition CP_{t−1} and C_j^t, i.e.: sim(C_i^{t−1}, C_j^t) > 0 for exactly one C_i^{t−1} ∈ CP_{t−1}.

Definition 16 (Child)

A current community C_j^t is a child of {C_{i1}^{t−1}, C_{i2}^{t−1}, …} iff there exists more than one community at time t−1 with positive similarity to C_j^t.

Figure 6-3 Example of evolution of communities

(1) Birth: a purple circle indicates that a community is born at the current time point t; an example is community J in Fig 6-3(b). (2) Death: a triangle indicates that a community will be dead at the next time point; an example is community E in Fig 6-3(b). (3) Alive: a red circle indicates that a community is Alive, mapping from exactly one community at time t−1 to exactly one community at time t; an example is community I at time t, which is Alive from community D at time t−1 in Fig 6-3(b). (4) Child: a green circle indicates that a community is a child of several communities at time t−1; an example is community F, which is the child of A and B in Fig 6-3(b).


After we determine the similarities between communities and their states, these states express the evolution of all communities, as shown in Fig 6-4(a). We call the evolution of a single community its pedigree, and we use the same community states to express it. Fig 6-4(a) shows the similarities between communities and the states of communities from t = 1 to t = 4 and is called an evolution net [3]. Fig 6-4(b) shows the pedigrees of communities A, B and D. In the pedigree of a specific community, a circle indicates that a community has a blood relationship with the specific community, while a square indicates that the community is a non-blood-relationship spouse of a blood-relationship community.

In the pedigree of A, community B is a non-blood-relationship spouse of A. Their child F has a spouse G at time t = 2, and their child is L. Community L is dead at time 4, and community M is a fission of G; we can see that the pedigree of community B ends at time t = 4. In the pedigree of community D, community I is Alive from D, K is Alive from I, and P is Alive from K; the pedigree of D continues and does not finish.

Through this illustration, we can observe the lifetime of a community. A community can stay alive, split and merge over time. The proposed Community Pedigree Mapping expresses the evolution of communities and avoids the one-to-one mapping problem.
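The state assignment of Definitions 13-16 can be sketched from a similarity matrix between two consecutive partitions. The fifth state, Division (for a t−1 community that splits), is omitted here for brevity, and the names are our own.

```python
def community_states(sim):
    """Classify community states from a similarity matrix between the
    partitions at t-1 (rows) and t (columns); a sketch of Defs. 13-16.

    A column with no positive entry -> Birth; a row with no positive
    entry -> Death; one positive parent -> Alive; several -> Child.
    """
    n_prev, n_cur = len(sim), len(sim[0]) if sim else 0
    states = {}
    for j in range(n_cur):
        parents = [i for i in range(n_prev) if sim[i][j] > 0]
        if not parents:
            states[('t', j)] = 'Birth'
        elif len(parents) == 1:
            states[('t', j)] = 'Alive'
        else:
            states[('t', j)] = 'Child'
    for i in range(n_prev):
        if not any(sim[i][j] > 0 for j in range(n_cur)):
            states[('t-1', i)] = 'Death'
    return states
```

For instance, a current community with two positive parents is a Child, while a column of zeros marks a Birth and a row of zeros marks a Death.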

6.4 Proposed Algorithm: “relationship Extraction and community Pedigree dynamic Community miner” (EPC)

Given a dynamic network G = {G1, G2, …, Gt, …} where Gt is the interaction graph at time t, the observation eyeshot wr, the selected normalized weight function N(t, tc) and the minimum community similarity threshold, the algorithm EPC proceeds as follows. For each time point tc, we determine the observation window [tc − wr, tc + wr], and EPC is divided into three steps:

(1) Construct the relationship graph RGtc: for each edge within the interaction graphs Gt whose time point t belongs to the observation window, we calculate the relationship strength between individuals (u, v) using Eq (3) and then construct the relationship graph RGtc. (2) Use the clustering method SHRINK [12] to discover the community partition CPtc based on the relationship graph RGtc.

(3) Determine the evolution net ENtc for every two consecutive timestamps tc−1 and tc: based on the predefined minimum community similarity threshold and the community partition results at times tc−1 and tc, we calculate the similarity between the communities of tc−1 and the communities of tc using Eq (13), and for each community we determine its state. The detailed algorithm is shown in Fig 6-5.
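The three steps above can be sketched as a single pipeline. The sketch assumes interaction graphs given as edge-to-count dicts, and takes the clustering and mapping steps as callables (cluster_fn standing in for SHRINK, map_fn for the evolution-net construction), so it is a structural outline rather than the thesis's implementation.

```python
def epc(interaction_graphs, wr, weight_fn, theta, cluster_fn, map_fn):
    """Sketch of the EPC pipeline: relationship extraction, static
    clustering, then mapping consecutive partitions into an evolution net.

    interaction_graphs: list of {(u, v): interaction count} per time point.
    """
    partitions, evolution = [], []
    for tc in range(len(interaction_graphs)):
        window = range(max(0, tc - wr), min(len(interaction_graphs), tc + wr + 1))
        # Step 1: accumulate weighted interactions into a relationship graph.
        rg = {}
        for t in window:
            for edge, count in interaction_graphs[t].items():
                rg[edge] = rg.get(edge, 0.0) + count * weight_fn(t, tc, wr)
        # Step 2: discover the community partition of this snapshot.
        partitions.append(cluster_fn(rg))
        # Step 3: connect consecutive partitions into the evolution net.
        if tc > 0:
            evolution.append(map_fn(partitions[tc - 1], partitions[tc], theta))
    return partitions, evolution
```

A toy run with trivial stand-ins for the clustering and mapping steps shows the window-by-window flow.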


Algorithm 3: EPC

Input: dynamic network G = {G1, G2, …} where Gt is the interaction graph at time t, observation eyeshot wr, selected normalized weight function N(t, tc), minimum community similarity threshold
Output: community partition of each time point, the evolution net of all communities


Chapter 7

Experimental Results and Performance Study

In this chapter, the accuracy and efficiency of EPC are examined. The experimental environment is an AMD Athlon(tm) II X2 240 CPU at 2.8 GHz with 2 GB of main memory, running Windows XP. The proposed EPC is implemented in C++. We compare EPC with FacetNet [3] and PD-Greedy [5] using two synthetic datasets, SYN-FIX and SYN-VAR. SYN-FIX generates a dynamic network with a fixed number of communities and a fixed number of nodes over time; SYN-VAR generates a dynamic network with a variable number of communities and a variable number of nodes over time. For accuracy comparison, we use mutual information to evaluate performance.
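Mutual information between a detected partition and the ground truth is usually normalized; the sketch below normalizes by the average of the two entropies, which is one common convention and may differ from the exact variant used in the experiments.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two flat partitions given as label lists of equal
    length; normalized here by the average of the two entropies."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * math.log((c * n) / (pa[a] * pb[b]))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * math.log(c / n) for c in pa.values())
    hb = -sum((c / n) * math.log(c / n) for c in pb.values())
    return 2 * mi / (ha + hb) if (ha + hb) else 1.0
```

Identical partitions (even under relabeling) score 1, and statistically independent partitions score 0.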

Section 7.1 describes the synthetic dataset generators and the quality measurement using mutual information. Section 7.2 discusses the parameters of all compared algorithms. Section 7.3 presents the accuracy results on synthetic data, Section 7.4 the smoothness quality, and Section 7.5 the scalability; Section 7.6 presents the results on real data.

7.1 Synthetic Data Generation

7.1.1 SYN-FIX

Parameter      Description                                                        Default
n              Initial number of vertices.                                        128
s_c            Initial size of a community.                                       32
n_c            Initial number of communities.                                     4
Avg_v_deg      Average vertex degree.                                             16
Avg_v_out_deg  Average vertex degree out of the original community.               3 ~ 5
Ran_sel        Randomly selected vertices moved out of their original community.  3

Table 7-1 Parameters of SYN-FIX

The data generator SYN-FIX was released in [26] and its original idea was proposed in [18]; the same idea is also discussed in [5]. SYN-FIX produces an environment with a fixed number of nodes and communities over time. It generates a network that has 128 nodes, four communities of 32 nodes each, and an average vertex degree (Avg_v_deg) of 16. Table 7-1 describes the parameters of SYN-FIX.

In SYN-FIX, the parameter Avg_v_out_deg is the average out-degree of all nodes in

