
Chapter 3 Notation and Problem Definition

3.2 Problem Statement

Definition 6 (Dynamic Community Identification)

Given a dynamic social network G = {G1, G2, …, Gt, …}, how do we produce the community partition CPt of each timestamp? Once the community partitions of each timestamp have been discovered, how do we determine the evolution between the communities of every two consecutive timestamps?


Chapter 4

Relationship Extraction Strategy

In this chapter, we first illustrate the framework of EPC, "relationship Extraction and community Pedigree dynamic Community miner", and then introduce how the Relationship graph is constructed. The Relationship Extraction strategy extracts the relationship strength from interaction data and combines it with a normalized decay weight function to simulate the change of relationship strength within a fixed observation window.

4.1 Proposed Framework

Fig 4-1 Flowchart of EPC, “relationship Extraction and community Pedigree dynamic Community miner”

The flowchart of EPC, "relationship Extraction and community Pedigree dynamic Community miner", is shown in Fig 4-1 and consists of three phases. (1) Construct the relationship graph RGt using a set of interaction graphs within the observation window.

(2) Use a static community detection method to discover the community partition CPt based on the relationship graph produced in the first step. (3) Determine the evolution of communities using the community partitions, as sketched below.
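The three phases can be summarized as the following high-level sketch. The function names, signatures, and data layout are illustrative placeholders rather than the actual implementation; the phase-specific routines are assumed to be supplied by the caller.

```python
from typing import Any, Callable, Dict

def epc(interaction_graphs: Dict[int, Any],
        w_r: int,
        build_rg: Callable[[Dict[int, Any], int, int], Any],
        detect: Callable[[Any], Any],
        map_evolution: Callable[[Any, Any], Any]):
    """Sketch of the three EPC phases; the callables implement each phase."""
    partitions, evolutions = {}, {}
    for t in sorted(interaction_graphs):
        rg_t = build_rg(interaction_graphs, t, w_r)   # Phase 1: relationship graph RG_t
        partitions[t] = detect(rg_t)                  # Phase 2: community partition CP_t
        if t - 1 in partitions:                       # Phase 3: evolution EN_t between CP_{t-1} and CP_t
            evolutions[t] = map_evolution(partitions[t - 1], partitions[t])
    return partitions, evolutions
```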


Fig 4-2 Framework of EPC, “relationship Extraction and community Pedigree dynamic Community miner”

Our framework of EPC is shown in Fig 4-2. Assuming the observation eyeshot wr = 1 and tc = t, the relationship graph RGt is constructed using the interaction graphs Gt-1, Gt and Gt+1. Then we use a static community detection method to generate the community partition CPt based on the relationship graph RGt. Once the community partitions of each timestamp have been generated, we determine the relationships between communities at every two consecutive time points using the Community Pedigree Mapping.

The graph at the top of Fig 4-2 is the pedigree of community A, where a node indicates a community and an edge indicates the similarity strength between communities. The "pedigree of community A" shows all the communities that have a relationship with community A. A square shape indicates a spouse community of A, and a triangle shape indicates that the community dies at the next timestamp. There are five communities spread across the timestamps {t-1, t, t+1}. Community F is similar to the previous communities A and B, so F is the child of A and B. Community L is the child of G and F. Community B is the spouse of A, and G is the spouse of F. When we want to monitor some communities to figure out whether these communities are involved with each other, the pedigree of community is a good way to illustrate it.

We present the Relationship Extraction strategy for constructing the relationship graph RGt in chapter 4. We present the current static clustering method, SHRINK [12], in chapter 5, and we propose the Community Pedigree Mapping to solve the problem of the evolution of communities in chapter 6.

4.2 Generating the Relationship Graph

The relationship graph RGt is constructed from the interaction graphs that are closest to the current time point tc. We use the observation eyeshot wr to control the observation window. For example, assuming wr = 2 and tc = 3, the relationship graph RG3 is constructed using the interaction graph G3 together with the interaction graphs 2 time units before (G1, G2) and after (G4, G5) the current time tc. Then we determine the relationship strength Wtc(u, v) between individuals u and v using the predefined normalized weight function.

Here we propose a naïve normalized weight function, the normalized Equal weight function (EQL), as follows:

(2)    $NE(t, t_c) = \dfrac{1}{2w_r + 1}$ for $t_c - w_r \le t \le t_c + w_r$, and $NE(t, t_c) = 0$ otherwise.

EQL gives the interaction graph of each time point within the observation window the same weight and makes sure the weights sum to 1. Using a normalized weight function, the relationship strength of each pair of individuals is determined by the following function:

(3)    $W_{t_c}(u, v) = \sum_{t = t_c - w_r}^{t_c + w_r} e_t(u, v) \cdot NW(t, t_c)$

where $e_t(u, v)$ is 1 if an interaction between u and v occurs at time t and 0 otherwise, and $NW(t, t_c)$ is the predefined normalized weight function.


Figure 4-3 Example of determining the relationship strength

We use Fig 4-3 to illustrate how the relationship strength is determined. Assume the observation eyeshot wr equals 2 and the interactions between individuals u and v occur at times 2, 6 and 7. The relationship strength between u and v at tc = 3 is W3(u, v) = (0*0.2) + (1*0.2) + (0*0.2) + (0*0.2) + (0*0.2) = 0.2. Using the same process, we calculate W4(u, v) = 0.4 and W5(u, v) = 0.4.
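The calculation above can be written as a short sketch following Eqs. (2) and (3) as reconstructed; the representation of the interaction data (a set of time points per pair of individuals) is illustrative, not the data structure used in the implementation.

```python
def eql_weight(t: int, tc: int, wr: int) -> float:
    """Normalized Equal weight (EQL): 1/(2*wr+1) inside the window, 0 outside."""
    return 1.0 / (2 * wr + 1) if abs(t - tc) <= wr else 0.0

def relationship_strength(interaction_times, tc: int, wr: int, weight=eql_weight) -> float:
    """Eq. (3): sum the weights of the time points at which u and v interact."""
    return sum(weight(t, tc, wr) for t in range(tc - wr, tc + wr + 1)
               if t in interaction_times)

# Interactions between u and v at times 2, 6 and 7 (the example of Fig 4-3, wr = 2).
times = {2, 6, 7}
print(relationship_strength(times, tc=3, wr=2))  # 0.2 (only time 2 falls in the window)
print(relationship_strength(times, tc=4, wr=2))  # 0.4 (times 2 and 6 fall in the window)
print(relationship_strength(times, tc=5, wr=2))  # 0.4 (times 6 and 7 fall in the window)
```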

Using Eq. (3), we determine the relationship graph RG2 of Fig 2-1; RG2 is shown at the bottom of Fig 4-4. A node indicates an individual and the edge weight W2(u, v) indicates the relationship strength between individuals u and v.


Figure 4-4 Using the EQL weight function to construct the Relationship graph RG2 of the sample interaction data of Fig 2-1

4.3 Normalized Decay Weight Function

Assume the observation eyeshot wr and the observation window are predefined. We propose three Normalized Decay weight functions:

Linear Decay weight function (LIN):


Fig 4-5 Normalized Decay Weight Function (wr=2)

The LIN function is based on linear decay, and its weight distribution is shown as the LIN curve in Fig 4-5. Using LIN to calculate the relationship strength follows the same process as the example in section 4.2: we multiply the weight NL(t, tc) by the interactions occurring within the observation window and sum all the values.

The WAVE function is based on the sine function, and its weight distribution is shown as the WAVE curve in Fig 4-5.

The EXP function is based on an exponential decay function. If the weight decreases at a rate proportional to its value, the decay is called exponential decay [11]. The process can be modeled by the following differential equation:

(7)    $\dfrac{dN(t)}{dt} = -\lambda N(t)$

The decay constant $\lambda$ controls the decay rate of the exponential decay, and we use it in our work. The weight distribution is shown as the EXP curve in Fig 4-5.

For each normalized decay weight distribution, if an interaction's time point falls outside the observation window, its weight is assigned zero. Note that for the exponential decay weight distribution, the weight of a time point outside the observation window is actually non-zero, but we simply assume it to be zero.
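The exact closed forms of LIN, WAVE and EXP were not preserved in this text, so the sketch below only follows their qualitative descriptions: each weight is symmetric around tc, decays linearly, as a half sine wave, or exponentially with |t - tc|, is truncated to zero outside the window, and is normalized so the in-window weights sum to 1. The specific formulas and the decay constant value are assumptions, not the author's exact equations.

```python
import math

def lin_weight(t, tc, wr):
    """Plausible Linear Decay (LIN): decreases linearly with |t - tc|, normalized over the window."""
    if abs(t - tc) > wr:
        return 0.0
    raw = wr + 1 - abs(t - tc)
    norm = sum(wr + 1 - abs(k) for k in range(-wr, wr + 1))
    return raw / norm

def wave_weight(t, tc, wr):
    """Plausible sine-based (WAVE): a half sine wave over the window, normalized."""
    if abs(t - tc) > wr:
        return 0.0
    raw = math.sin(math.pi * (t - tc + wr + 1) / (2 * wr + 2))
    norm = sum(math.sin(math.pi * (k + wr + 1) / (2 * wr + 2)) for k in range(-wr, wr + 1))
    return raw / norm

def exp_weight(t, tc, wr, lam=1.0):
    """Plausible exponential decay (EXP), truncated to the window and normalized.
    lam plays the role of the decay constant in Eq. (7); its value here is an assumption."""
    if abs(t - tc) > wr:
        return 0.0
    raw = math.exp(-lam * abs(t - tc))
    norm = sum(math.exp(-lam * abs(k)) for k in range(-wr, wr + 1))
    return raw / norm

# Weight distributions over the window for wr = 2 (each row sums to 1).
for name, fn in [("LIN", lin_weight), ("WAVE", wave_weight), ("EXP", exp_weight)]:
    print(name, [round(fn(t, 0, 2), 3) for t in range(-2, 3)])
```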

4.4 Discussion of Relationship Extraction Strategy

In the real world, the relationship between individuals can decay over time, and the interaction at each time point within the observation window should be considered as energy that increases the relationship strength between individuals. Therefore, we presume each interaction has the same lifetime, equal to the size of the observation window; the lifetime of each interaction is thus extended from 1 to the size of the observation window. This property overcomes the weakness that no relationship exists between individuals while no interactions occur.

Fig 4-6. Relationship strength curve between u and v by extraction from the interaction data of Fig 4-2 based on Normalized Equal Weight Function (wr=2)

Fig 4-6 shows the evolution of the relationship strength of individuals u and v in the interaction data of Fig 4-3 using the normalized equal weight function. The dotted lines indicate the interactions occurring at time points 2, 6 and 7. The interactions at all time points have the same lifetime, equal to the size of the observation window. The solid line sums up the curves of all interactions and represents the relationship strength of individuals u and v over time.


However, the solid line in Fig 4-6, which shows the relationship strength curve, is high at times t = 4, 5, 6 and 7. The relationship strength curve does not match the time points at which interactions occurred, so this curve does not capture the dynamics well.

Fig 4-7. The Relationship strength curve of the interaction data of Fig 4-2 based on Normalized Linear Decay Weight function (wr=2)

We change the equal weight function to the linear decay weight function, and the resulting relationship strength curve is shown in Fig 4-7. The solid curve in Fig 4-7 indicates the relationship strength of the individuals; the curve is high at t = 2, 5, 6 and 7 and matches the timestamps at which interactions occurred. The difference between Fig 4-6 and Fig 4-7 is that the relationship strength curve in Fig 4-7 demonstrates a more dynamic property than that in Fig 4-6, so the normalized decay weight function is more realistic than the equal weight function.


Chapter 5

Current Static Community Detection Methods

After generating relationship graphs at each time point, we choose a static community detection method to discover the community partition at each timestamp. Although many studies have focused on community detection in static networks, not every method is suitable for the relationship graph. Two issues need to be considered: (1) discovering communities using the weighted graph (relationship graph), and (2) detecting the noisy vertices whose relationship strength is too low to belong to any community. The SHRINK algorithm [12] overcomes the problem of parameter pre-definition, such as the minimum similarity threshold (ε) and the minimum core size (μ) in density-based clustering algorithms, and the predefined number of clusters in partitioning-based clustering algorithms. According to their experiments, SHRINK is an efficient, parameter-free and highly accurate algorithm compared with other algorithms [18, 19]. Therefore, we choose SHRINK as our clustering algorithm. For comparison, we also use the greedy density-based clustering method (GSCAN) used in PD-Greedy [5, 21].

In section 5.1 we present the detailed definition of GSCAN, and section 5.2 presents the definition of SHRINK. Section 5.3 illustrates the quality measurement of a community partition. Section 5.4 describes the detailed algorithm of SHRINK, and the algorithm of GSCAN is presented in section 5.5.

5.1 GSCAN

In this section we present the definition of GSCAN and the related notation. Let G = (V, E, W) be a weighted undirected network, where W is the weight set of the edge set E. GSCAN uses the structural similarity as its similarity measure, and the related definitions are as follows:

Definition 7 [5]. (Neighborhood)

Given G = (V, E, W), for a node u ∈ V, the adjacent nodes of u together with u itself are the neighbors of u, denoted Γ(u), i.e., $\Gamma(u) = \{v \in V \mid (u, v) \in E\} \cup \{u\}$.

Definition 8 [5]. (Structural Similarity)


Let G = (V, E, W) be a weighted undirected network. The structural similarity between two adjacent nodes u and v is defined as below:

(8)    $\sigma(u, v) = \dfrac{\sum_{x \in \Gamma(u) \cap \Gamma(v)} w(u, x) \cdot w(v, x)}{\sqrt{\sum_{x \in \Gamma(u)} w(u, x)^2} \cdot \sqrt{\sum_{x \in \Gamma(v)} w(v, x)^2}}$

where w(u, x) indicates the weight of the edge (u, x).
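The weighted cosine form of Eq. (8), as reconstructed above, can be computed with the following sketch. The adjacency representation (a dict of weighted neighbor maps) and the convention of treating a node's weight to itself as 1 are assumptions of the sketch, not details taken from [5].

```python
import math

def neighborhood(graph, u):
    """Gamma(u): the adjacent nodes of u together with u itself (Definition 7)."""
    return set(graph[u]) | {u}

def structural_similarity(graph, u, v):
    """Weighted structural (cosine) similarity of Eq. (8).
    graph: dict {node: {neighbor: weight}}; a node's weight to itself is taken as 1 (assumption)."""
    wu = dict(graph[u]); wu[u] = 1.0
    wv = dict(graph[v]); wv[v] = 1.0
    common = neighborhood(graph, u) & neighborhood(graph, v)
    num = sum(wu[x] * wv[x] for x in common)
    den = math.sqrt(sum(w * w for w in wu.values()) * sum(w * w for w in wv.values()))
    return num / den if den else 0.0
```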

GSCAN applies a minimum similarity threshold ε to the computed structural similarity when assigning cluster membership, as formalized in the following ε-Neighborhood definition:

Definition 9 [5]. (ε-Neighborhood)

For a node u ∈ V, the ε-Neighborhood of u is defined by $N_\varepsilon(u) = \{v \in \Gamma(u) \mid \sigma(u, v) \ge \varepsilon\}$.

When a vertex shares sufficient structural similarity with enough neighbors, it becomes a seed for a cluster. Such a vertex is called a core node. Core nodes are a special class of vertices that have at least a minimum number of neighbors whose structural similarity exceeds the threshold ε [21].

Definition 10 [5]. (Core node)

A node u ∈ V is called a core node w.r.t. ε and μ if $|N_\varepsilon(u)| \ge \mu$.

Definition 11 [5]. (Directly reachable)

A node x ∈ V is directly reachable from a node u ∈ V w.r.t. ε and μ if (1) u is a core node and (2) $x \in N_\varepsilon(u)$.

Definition 12 [5]. (Reachable)

A node v ∈ V is reachable from a node u ∈ V w.r.t. ε and μ if there is a chain of nodes $u = x_1, x_2, \ldots, x_n = v$ such that $x_{i+1}$ is directly reachable from $x_i$ w.r.t. ε and μ.

Definition 13 [5]. (Connected)

A node v ∈ V is connected to a node u ∈ V w.r.t. ε and μ if there is a node x ∈ V such that both v and u are reachable from x w.r.t. ε and μ. A connected cluster is uniquely determined by any of the cores of that cluster.

Figure 5-1 (a) Sample network G (b) Connected clusters of the sample network G

We use Fig 5-1 to illustrate the related definitions. In Fig 5-1, a node indicates an individual and an edge indicates the structural similarity between individuals. With the chosen ε and μ, the core nodes are node 13, node 15 and node 6. Node 10 is directly reachable from node 13 because node 13 is a core node and node 10 lies in the ε-Neighborhood of node 13. Node 4 is reachable from node 6 through the directly-reachable chain (node 6, node 15, node 4). Based on the definition of connected cluster, there are two clusters, {3, 10, 11, 13} and {1, 4, 5, 6, 9, 14, 15}.
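Definitions 9 to 13 can be turned into a small sketch of the core-seeded cluster expansion they describe. The adjacency representation, the similarity callable and the parameter handling are illustrative; this is not the exact SCAN/GSCAN implementation of [5, 21].

```python
from collections import deque

def eps_neighborhood(graph, sim, u, eps):
    """N_eps(u): u plus its adjacent nodes v with sim(u, v) >= eps (Definition 9)."""
    return {u} | {v for v in graph[u] if sim(u, v) >= eps}

def scan_like_clusters(graph, sim, eps, mu):
    """Expand a cluster from each unvisited core node (|N_eps(u)| >= mu, Definition 10)
    by following direct reachability (Definition 11); nodes never reached stay unclustered."""
    unclassified = set(graph)
    clusters = []
    for seed in list(graph):
        if seed not in unclassified:
            continue
        if len(eps_neighborhood(graph, sim, seed, eps)) < mu:
            continue  # not a core node; it may still join a cluster later as a border node
        cluster, queue = set(), deque([seed])
        while queue:
            x = queue.popleft()
            if x not in unclassified:
                continue
            unclassified.discard(x)
            cluster.add(x)
            neigh = eps_neighborhood(graph, sim, x, eps)
            if len(neigh) >= mu:          # only core nodes expand the cluster further
                queue.extend(neigh - cluster)
        clusters.append(cluster)
    return clusters
```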

5.2 SHRINK

In this section we introduce SHRINK and the related notation. For the structural similarity measure, SHRINK uses the same cosine similarity as GSCAN, and the related definitions are as follows:

For a node a in network G, MC(a) is a local Micro-community if and only if conditions (1)-(3) of [12] hold, where the similarity value referred to in those conditions is the largest similarity between a node and its adjacent neighbor nodes.

We use Fig 5-1(a) to illustrate the Dense Pair and the Micro-Community. All the Dense Pairs within Fig 5-1(a) are shown in Fig 5-2. The dotted-line nodes and dotted-line edges indicate that these nodes form Dense Pairs and can be grouped into a Micro-Community.

Figure 5-2 All Dense Pairs within the sample network G of Fig 5-1(a)

Definition 17 [12]. (Super-network)

Let {MC1, MC2, ..., MCm} be the Micro-communities in G, and collapse each Micro-community into a single super-node. Define V' as the set of super-nodes together with the remaining nodes of G, and E' as the set of edges among them; then G' = (V', E') is called a Super-network of G.

In particular, the algorithm discovers not only all the communities but also the hubs and outliers in the network. A hub can be regarded as an overlapping community and plays a special role in many real networks, such as a search engine in a web-page network or a communication center among proteins. An outlier does not belong to any community because the similarities between it and other nodes are too small. This algorithm does not force every node into a community, and this property fits our requirement perfectly.

5.3 Measurement of Partitioning Quality

Although several well-known quality measures such as normalized cut [24] and modularity [18] have been proposed, modularity is by far the most popular measure.

GSCAN and SHRINK both use the same similarity-based modularity function Qs [25].

(9)    $Q_s = \sum_{i=1}^{k} \left[ \dfrac{IS_i}{TS} - \left( \dfrac{DS_i}{TS} \right)^2 \right]$

Assume the community partition has k communities {C1, C2, ..., Ck}. $IS_i$ is the total similarity of the nodes within cluster $C_i$, $DS_i$ is the total similarity between the nodes in cluster $C_i$ and any nodes in the network, and TS is the total similarity between any two nodes in the network.

SHRINK is based on this quality function Qs and incrementally calculates the increment of the modularity quality. Given two adjacent local communities $C_a$ and $C_b$, the modularity gain can be computed by

(10)    $\Delta Q_s(C_a, C_b) = \dfrac{S_{a,b}}{TS} - \dfrac{2 \cdot DS_a \cdot DS_b}{TS^2}$

where $S_{a,b}$ is the summation of the similarities of all edges between the two communities $C_a$ and $C_b$.

Based on Eq. (10), assume that the micro-community MC is constructed from h clusters, i.e., MC = {$C_{p_1}$, $C_{p_2}$, ..., $C_{p_h}$}; the modularity gain for merging a micro-community into a super-node can then be computed as

(11)    $\Delta Q_s(MC) = \sum_{1 \le i < j \le h} \left[ \dfrac{S_{p_i, p_j}}{TS} - \dfrac{2 \cdot DS_{p_i} \cdot DS_{p_j}}{TS^2} \right]$

SHRINK uses the modularity gain to control the shrinkage of the micro-communities. Only when the modularity gain is positive ($\Delta Q_s > 0$) are the communities within a micro-community (MC) merged into a super-node.
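Because the closed forms of Eqs. (10) and (11) above are reconstructions, the sketch below does not rely on them; it simply recomputes Qs before and after a candidate merge, which is equivalent to the gain by definition. The cluster representation (sets of nodes) and the caller-supplied pairwise similarity function are assumptions of the sketch.

```python
def similarity_modularity(clusters, nodes, sim):
    """Similarity-based modularity Qs of Eq. (9): for each cluster, the internal similarity
    over TS minus the square of its total similarity to the whole network over TS."""
    ts = sum(sim(u, v) for u in nodes for v in nodes if u != v)
    qs = 0.0
    for c in clusters:
        is_c = sum(sim(u, v) for u in c for v in c if u != v)
        ds_c = sum(sim(u, v) for u in c for v in nodes if u != v)
        qs += is_c / ts - (ds_c / ts) ** 2
    return qs

def merge_gain(clusters, to_merge, nodes, sim):
    """Modularity gain of collapsing the clusters in `to_merge` into one super-node;
    SHRINK performs the merge only when this gain is positive."""
    merged = [c for c in clusters if c not in to_merge] + [set().union(*to_merge)]
    return similarity_modularity(merged, nodes, sim) - similarity_modularity(clusters, nodes, sim)
```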

5.4 Algorithm of SHRINK

We use Fig 5-1 ~ Fig 5-5 to illustrate the key points of SHRINK. Given the sample network G shown in Fig 5-1, nodes indicate individuals and the weight of an edge indicates the structural similarity between individuals.

Each round of the SHRINK process has two phases. (1) For each node u, we consider u as a micro-community MC(u) and determine, for each node x within MC(u), whether any of its neighbors forms a Dense pair with it; if a neighbor v of x forms a Dense pair with x, then v is pushed into the micro-community MC(u). Examples are shown in Fig 5-3(a) and 5-4(a); the dotted lines indicate all the Dense pairs found in the network G. (2) SHRINK determines the modularity gain for all micro-communities {MC1, MC2, ..., MCk}, and only when the gain is positive are all the nodes of the micro-community MCi merged into a super-node, which contains more than one node in the next round. Fig 5-3(b) and 5-4(b) show the second phase of SHRINK. Fig 5-5(a) shows the result of the third round of the SHRINK process and Fig 5-5(b) shows the fourth round. Fig 5-5(b) indicates that the SHRINK process terminates when no positive modularity gain remains. The network shrunk from G is called a Super-network, as shown in Fig 5-3(a) and Fig 5-4(a). Fig 5-5(b) shows the final result of SHRINK; there is a hub (node 2) and an outlier (node 16).
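If a Dense Pair is read as a pair of adjacent nodes whose similarity equals the largest similarity each of them has with any neighbor (the reading suggested by the definition fragment in section 5.2), phase (1) of a SHRINK round can be sketched as follows. The exact Dense Pair condition and the grouping step are assumptions of this sketch, not the paper's verbatim definitions.

```python
def dense_pairs(graph, sim):
    """Adjacent pairs whose similarity is the maximum similarity of both endpoints to any
    of their neighbors (an assumed reading of the Dense Pair definition)."""
    best = {u: max(sim(u, x) for x in graph[u]) for u in graph if graph[u]}
    pairs = set()
    for u in graph:
        for v in graph[u]:
            if sim(u, v) == best.get(u) and sim(u, v) == best.get(v):
                pairs.add(frozenset((u, v)))
    return pairs

def micro_communities(graph, sim):
    """Phase (1) of a SHRINK round: group nodes linked by Dense Pairs into micro-communities."""
    parent = {u: u for u in graph}
    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for pair in dense_pairs(graph, sim):
        u, v = tuple(pair)
        parent[find(u)] = find(v)
    groups = {}
    for u in graph:
        groups.setdefault(find(u), set()).add(u)
    return [g for g in groups.values() if len(g) > 1]
```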


5.5 Algorithm of GSCAN

In this section, we describe the GSCAN algorithm. GSCAN is extended from SCAN [21] and combines it with a greedy heuristic setting of ε. SCAN performs one pass over each node of a network and finds all structure-connected clusters for a given parameter setting. The pseudo code of the SCAN algorithm is presented in Fig 5-7. Given a weighted undirected graph G = (V, E), at the beginning all nodes are labeled as unclassified. For each node that is not yet classified, SCAN checks whether the node is a core node. If the node is a core, a new cluster is expanded from this node. Otherwise, the vertex is labeled as a non-member.

GSCAN uses the greedy heuristic setting of ε to optimize the modularity score Q of the clustering result. GSCAN adjusts ε according to the change of the modularity score Q, decreasing or increasing ε until Q reaches a local maximum [5].


25: // 1.2 determine the modularity score of CP;

26: Q = Qs(CP);

27: return CP, Q

Figure 5-7 Algorithm of SCAN

GSCAN chooses the median of the similarity values of sample nodes picked from V as the seed threshold ε_seed, with a sampling rate of only about 5~10%. GSCAN increases or decreases ε by a unit Δ of 0.01 or 0.02 and maintains two kinds of heaps: (1) a max heap Hmax for edges whose similarity is below ε_seed, and (2) a min heap Hmin for edges whose similarity is above ε_seed. Hmax and Hmin are built during the initial clustering. After finding the initial clusters CP and calculating their modularity Qmid, GSCAN calculates two additional modularity values, Qhigh and Qlow. Here, Qhigh is calculated from CP without the edges in Hmin whose similarity falls in the range [ε_seed, ε_seed + Δ], and Qlow is calculated from CP together with the edges in Hmax whose similarity falls in the range [ε_seed - Δ, ε_seed]. If Qhigh is the highest among Qhigh, Qmid and Qlow, GSCAN increases the density threshold by Δ. If Qlow is the highest value, GSCAN decreases the density threshold by Δ. Otherwise, ε_seed is taken as the best density parameter. The initial clustering CP is continuously modified by adding edges from Hmax to CP or by deleting edges of Hmin from CP. The detailed algorithm of GSCAN is shown in Fig 5-8 [5]:

Algorithm 2: GSCAN [5]

Input: weighted network G = (V, E), ε, μ

Output: set of clusters CP = {C1, C2, ..., Ck}, modularity score Q.

1: CPmid ← ∅ ; CPhigh ← ∅ ; CPlow ← ∅ ;
2: Qmid ← 0 ; Qhigh ← 0 ; Qlow ← 0 ;
3: (CPmid, Qmid) ← SCAN(G, ε, μ) ;
4: (CPhigh, Qhigh) ← SCAN(G, ε + Δ, μ) ;
5: (CPlow, Qlow) ← SCAN(G, ε - Δ, μ) ;
6: while (Qmid ≠ max(Qmid, Qhigh, Qlow)) do
7:   if Qhigh = max(Qmid, Qhigh, Qlow)
8:     ε ← ε + Δ ;
9:     (CPlow, Qlow) ← (CPmid, Qmid) ;
10:    (CPmid, Qmid) ← (CPhigh, Qhigh) ;
11:    (CPhigh, Qhigh) ← SCAN(G, ε + Δ, μ) ;
12:  else if Qlow = max(Qmid, Qhigh, Qlow)
13:    ε ← ε - Δ ;
14:    (CPhigh, Qhigh) ← (CPmid, Qmid) ;
15:    (CPmid, Qmid) ← (CPlow, Qlow) ;
16:    (CPlow, Qlow) ← SCAN(G, ε - Δ, μ) ;
17:  end
18: end
19: return CPmid, Qmid ;

Figure 5-8 Algorithm of GSCAN


Chapter 6

The Community Pedigree Mapping

Current methods [6, 13, 3, 5] focus on the problem of mapping one community of the previous timestamp to one community of the current timestamp (the 1-1 mapping problem). We argue that these methods are not suitable for real community evolution, as discussed in section 2.3.

Therefore, we propose the Community Pedigree Mapping to express the evolution of communities. In this chapter we illustrate the evolution of communities between two consecutive time points.

Section 6.1 provides the description of community similarity, and section 6.2 presents the states of a community during its lifetime. Section 6.3 introduces the details of the Community Pedigree Mapping, and section 6.4 presents the "relationship Extraction and community Pedigree dynamic Community miner" (EPC).

6.1 The Similarity between Communities

Social networks are dynamic, and different numbers of individuals are alive at each time point. If some individuals disappear at either time point t-1 or t, we regard them as negative ones. On the other hand, the positive individuals, whom we care about, are alive at both consecutive time points t-1 and t. We further define those individuals as Influence individuals.

Definition 11 (Influence individuals)

Given the relationship graphs $RG_{t-1} = (V_{t-1}, E_{t-1})$ and $RG_t = (V_t, E_t)$, where $V_{t-1}$ is the node set of $RG_{t-1}$ and $V_t$ is the node set of $RG_t$, we define the Influence individuals at time points t-1 and t as:

(12)    $I_{t-1,t} = V_{t-1} \cap V_{t}$

Our Community Pedigree Mapping is based on the Influence individuals to determine the similarity between communities.

Definition 12 (The similarity between the Communities at different timestamps)


Given the i-th community $C_i^{t-1}$ of the community partition $CP_{t-1}$ at time t-1 and the j-th community $C_j^{t}$ of $CP_t$ at time t, the similarity between $C_i^{t-1}$ and $C_j^{t}$ is based on the number of individuals who are alive at both time points t-1 and t, and is defined below:

(13)    $sim(C_i^{t-1}, C_j^{t}) = \dfrac{|C_i^{t-1} \cap C_j^{t}|}{\max(|C_i^{t-1}|, |C_j^{t}|)}$

Figure 6-1 Example of community similarity calculation

Take Fig 6-1 as an example to illustrate Eq. (13). There are two communities A and B at time t-1 and two communities C and D at time t. There are 8 members overlapping between community A and community C. The similarity between A and C is 8/12, since the community size of A (12) is larger than that of C (10).
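Using the intersection-over-larger-community reading implied by this example (8 shared members, |A| = 12 > |C| = 10, similarity 8/12), the pairwise similarities between two consecutive partitions can be computed as follows. The normalization by the larger community size is an assumption about the exact form of Eq. (13).

```python
def influence_individuals(v_prev, v_curr):
    """Eq. (12): individuals alive at both consecutive time points."""
    return set(v_prev) & set(v_curr)

def community_similarity(c_prev, c_curr):
    """Assumed form of Eq. (13): shared members divided by the size of the larger community."""
    shared = len(set(c_prev) & set(c_curr))
    return shared / max(len(c_prev), len(c_curr))

def similarity_matrix(cp_prev, cp_curr):
    """Similarity of every community at t-1 with every community at t."""
    return {(i, j): community_similarity(ci, cj)
            for i, ci in enumerate(cp_prev) for j, cj in enumerate(cp_curr)}

# Fig 6-1: |A| = 12, |C| = 10, 8 members shared -> similarity 8/12.
A = set(range(12)); C = set(range(4, 14))
print(round(community_similarity(A, C), 3))  # 0.667
```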

Figure 6-2 (a) Merge example (b) Split example


There is a serious problem when we determine the similarity between the communities: the splitting or merging of a small community can make the real alive state disappear. We use Fig 6-2 to illustrate the situation. Considering the two communities A and B at time t-1 and community C at time t in Fig 6-2(a), C is still treated as the combination of A and B even though the members of community C are almost entirely the members of A, because there is no constraint on similarity generation. In Fig 6-2(b), communities E and F are both treated as splits of D even though F contains only a small portion of the members of D.

