Discovering Multiple Roles in a Social Network

Chapter 3 The Proposed Framework

3.3 Discovering Multiple Roles in a Social Network

[Figure: flowchart of the proposed framework. Boxes: "Partition the data stream into several periods."; Phase 1: "For each period of data, use the fuzzy c-means algorithm to group users into c clusters.", "Analyze the roles played by the users in each cluster."; Phase 2: "Transform the roles played by each user into a roleset sequence."; decision "More periods?" (Yes); "Use the PrefixSpan algorithm to find frequent role change patterns."; "Analyze the patterns found."]

We use two distance measures, cosine and Euclidean, to compute the distance between two feature vectors. The cosine distance is defined by Eq. (2) and the Euclidean distance by Eq. (3), where V1 = (v11, v12, …, v1k) and V2 = (v21, v22, …, v2k) are feature vectors. The cosine distance measures the distance between two vectors based on the angle between them. Its major limitation is that it is not well suited to sparse data and cannot effectively handle outliers that are co-aligned with normal vectors. Thus, we also use the Euclidean distance to measure the distance between two vectors.

$$CD(V_1, V_2) = 1 - \frac{V_1 \cdot V_2}{\|V_1\|\,\|V_2\|} \tag{2}$$

$$ED(V_1, V_2) = \sqrt{\sum_{i=1}^{k} (v_{1i} - v_{2i})^2} \tag{3}$$
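The two measures of Eq. (2) and Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work; the function names `cosine_distance` and `euclidean_distance` are chosen here, and raising an error for a zero vector is an assumption, since that case is not defined in the text.

```python
import math

def cosine_distance(v1, v2):
    """Eq. (2): one minus the cosine of the angle between v1 and v2."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        # Assumption: the text does not define CD for a zero vector.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm1 * norm2)

def euclidean_distance(v1, v2):
    """Eq. (3): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Two vectors with the same direction but different magnitudes:
# the cosine distance is numerically zero while the Euclidean distance is large.
a, b = (500, 10, 1000), (50, 1, 100)
print(abs(round(cosine_distance(a, b), 6)), round(euclidean_distance(a, b), 1))
```

The pair of vectors in the usage lines shows why both measures are kept: the angle alone cannot separate users whose activity differs only in volume.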

3.3.2 User properties

We use five types of attributes to describe the properties of users.

(1) Personality (PE)

The personality feature is the basic information of a user in a social network. We choose features that may reflect the characteristics of users: the number of friends, the number of posts on the user's own wall, and the privacy setting (0: public, 1: private, 2: secret). Thus, the personality feature vector is denoted as PE = (number of friends, number of posts, privacy). We compute the distance between two personality feature vectors by the cosine distance. For example, the distance between two personality feature vectors (200, 300, 1) and (1000, 500, 2) is

$$1 - \frac{200 \cdot 1000 + 300 \cdot 500 + 1 \cdot 2}{\sqrt{200^2 + 300^2 + 1^2} \cdot \sqrt{1000^2 + 500^2 + 2^2}} = 0.132.$$

(2) Behavior (BE)

We extract data that record the user's actions, including posts, comments, and likes. The behavior feature vector for a user is denoted as BE = (number of posts, number of comments, number of likes). To distinguish both the behavior distribution and the behavior frequency, we take both the cosine and the Euclidean distances into consideration. For example, the cosine distance between two behavior vectors (500, 10, 1000) and (50, 1, 100) is 0 because their behavior distributions are identical; however, their behavior frequencies are obviously distinct. Therefore, we define the distance between two behavior feature vectors, BE1 and BE2, as CD(BE1, BE2)⋅ED(BE1, BE2). Then, the distance is normalized into [0,1]. For example, the distance between two behavior feature vectors (30, 3, 0) and (10, 5, 30) is

$$\left(1 - \frac{30 \cdot 10 + 3 \cdot 5 + 0 \cdot 30}{\sqrt{30^2 + 3^2 + 0^2} \cdot \sqrt{10^2 + 5^2 + 30^2}}\right) \cdot \frac{\sqrt{(30-10)^2 + (3-5)^2 + (0-30)^2}}{50} = 0.487,$$

where 50 is the maximum Euclidean distance among all behavior feature vectors and is used to normalize the distance into [0,1].
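The behavior distance can be checked numerically with a short sketch. The function name `behavior_distance` and the `max_ed` parameter are hypothetical; the normalization follows the worked example, where the Euclidean part is divided by the maximum Euclidean distance among all behavior vectors (50 in the example).

```python
import math

def behavior_distance(be1, be2, max_ed):
    """CD(be1, be2) * ED(be1, be2), with ED scaled by the maximum ED."""
    dot = sum(a * b for a, b in zip(be1, be2))
    norm1 = math.sqrt(sum(a * a for a in be1))
    norm2 = math.sqrt(sum(b * b for b in be2))
    cd = 1.0 - dot / (norm1 * norm2)          # cosine distance
    ed = math.sqrt(sum((a - b) ** 2 for a, b in zip(be1, be2)))  # Euclidean
    return cd * ed / max_ed

# Worked example from the text: max_ed = 50.
print(round(behavior_distance((30, 3, 0), (10, 5, 30), max_ed=50), 3))  # 0.487
```

Because behavior counts are non-negative, the cosine part already lies in [0,1], so scaling the Euclidean part by its maximum is enough to keep the product in [0,1].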

(3) Action sequence (AS)

The actions taken by a user may reflect the user's posting mode in a group. For example, a user who frequently posts statuses to express her feelings differs from a user who usually shares photos with others. We record a user's actions as an action sequence, denoted as AS = {a1, a2, …, ak}, where ai is a type of post, i = 1, 2, …, k. The types of posts are s (status), l (link), p (photo), and v (video). For example, {s, l, l, v} is an action sequence. We define the distance between two action sequences by Lev(AS1, AS2), which is then normalized into [0,1], where Lev(AS1, AS2) is the Levenshtein distance [21], also known as the edit distance, between AS1 and AS2. For example, the Levenshtein distance between AS1 = {s, l, l, v} and AS2 = {s, l, l, p, s} is 2 because two operations are required to transform AS1 into AS2: replacing v with p and appending s.
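A minimal sketch of the action sequence distance follows. The Levenshtein part is the standard dynamic-programming algorithm. The normalization is an assumption: the text only says the distance is normalized into [0,1], and dividing by the length of the longer sequence is used here because it reproduces the values 0.75, 0.67 and 0.8 quoted in the FCM example of Section 3.3.4. The function names are chosen for the example.

```python
def levenshtein(seq1, seq2):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, start=1):
        curr = [i]
        for j, b in enumerate(seq2, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def action_sequence_distance(seq1, seq2):
    """Levenshtein distance divided by the longer length (assumed scaling)."""
    longest = max(len(seq1), len(seq2))
    return levenshtein(seq1, seq2) / longest if longest else 0.0

print(levenshtein("sllv", "sllps"))                       # 2
print(round(action_sequence_distance("s", "sllv"), 2))    # 0.75
```

Sequences are written as strings of the one-letter post types, so `"sllv"` stands for {s, l, l, v}.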

(4) Affectivity (AF)

Analyzing the emotion in the articles generated by a user can reveal the user's implicit attitude. For example, a politics group generally has supporters and opponents. By analyzing the emotion in articles, we can distinguish users who behave similarly but hold opposing positions. The Affective Norms for English Words (ANEW) [8] were developed to provide a set of normative emotional ratings for a large number of English words. Nielsen [27] evaluated 2477 English words used in microblogs and rated each of them with a score between -5 and 5. We use these words to calculate a user's affectivity. The calculation steps are as follows.

a. Sum up all positive and negative scores in a post.

b. Divide total positive and negative scores by the length of the post.

c. Normalize the scores by multiplying each score by the average length of posts.

Thus, an affectivity vector is denoted as AF = (positive score, negative score), which represents the average positive and negative affective scores in a post for each user. Next, we use the cosine distance to compute the distance between two affectivity vectors.
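Steps a to c can be sketched as below, under several labeled assumptions: the four-word `LEXICON` is a hypothetical stand-in for Nielsen's word list [27], the length of a post is taken to be its number of whitespace-separated words, the average length is computed over the posts passed in, and the per-user vector is the mean over that user's posts. None of these details is fixed by the text.

```python
# Hypothetical mini-lexicon standing in for Nielsen's 2477-word list [27].
LEXICON = {"love": 3, "good": 3, "bad": -3, "hate": -3}

def score_post(post):
    """Step a: return (positive sum, negative sum, length in words)."""
    words = post.lower().split()
    scores = [LEXICON.get(w, 0) for w in words]
    positive = sum(s for s in scores if s > 0)
    negative = sum(s for s in scores if s < 0)
    return positive, negative, len(words)

def affectivity_vector(posts):
    """Steps b and c, then the mean per post: AF = (positive, negative)."""
    scored = [score_post(p) for p in posts]
    scored = [s for s in scored if s[2] > 0]   # skip empty posts
    if not scored:
        return (0.0, 0.0)
    avg_len = sum(n for _, _, n in scored) / len(scored)
    # Step b divides by the post length; step c rescales by the average length.
    pos = sum(p / n * avg_len for p, _, n in scored) / len(scored)
    neg = sum(q / n * avg_len for _, q, n in scored) / len(scored)
    return (pos, neg)

print(affectivity_vector(["I love this good phone", "bad"]))
```

Dividing by the post length and multiplying back by the average length keeps long posts from dominating while leaving the scores on a per-post scale.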

(5) Recognition (RE)

We also take recognition into account. In sociology, recognition is the public acknowledgement of a person's status or merits. By analyzing the recognition that users accumulate in a social group, we can find the influential users whose posts are more respected by, or more attractive to, other users. We take three features into consideration. The first is the number of comments obtained from other users, which shows the topicality or attraction of the user's posts. The second is the number of posts shared by other users. The third is the number of likes obtained from other users, which implies the acceptance or usefulness of the user's posts. This may reveal the value and influence of the user. Thus, the recognition feature vector is denoted as RE = (number of comments from other users / number of posts, number of posts shared by other users / number of posts, number of likes obtained from other users / number of posts). We define the distance between two recognition feature vectors, RE1 and RE2, by CD(RE1, RE2)⋅ED(RE1, RE2). Then, the distance is normalized into [0,1].

Therefore, a content-based behavioral feature vector of each user is formed by concatenating the personality, behavior, action sequence, affectivity, and recognition feature vectors. The distance between two content-based behavioral feature vectors (or two users) is computed by

$$\alpha D_{PE} + \beta D_{BE} + \gamma D_{AS} + \delta D_{AF} + \eta D_{RE},$$

where $D_{PE}$, $D_{BE}$, $D_{AS}$, $D_{AF}$, and $D_{RE}$ respectively denote the distances between the personality feature vectors, the behavior feature vectors, the action sequences, the affectivity feature vectors, and the recognition feature vectors of the two users, and $\alpha + \beta + \gamma + \delta + \eta = 1$. Similarly, the distance is normalized into [0,1].
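The weighted combination can be illustrated as follows. The function name `combined_distance`, the dictionary keys and the equal weights of 0.2 are assumptions made for the example; the text only requires that the five weights sum to 1. The PE and BE values are the ones worked out above, while the AS, AF and RE values are placeholders.

```python
def combined_distance(distances, weights):
    """Weighted sum of the five per-feature distances (weights sum to 1)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * distances[name] for name in weights)

# Equal weights are an illustrative choice, not a setting from the text.
weights = {"PE": 0.2, "BE": 0.2, "AS": 0.2, "AF": 0.2, "RE": 0.2}
# PE and BE come from the worked examples; AS, AF and RE are placeholders.
distances = {"PE": 0.132, "BE": 0.487, "AS": 0.5, "AF": 0.1, "RE": 0.3}
print(round(combined_distance(distances, weights), 4))  # 0.3038
```

Since every component lies in [0,1] and the weights sum to 1, the combined value also stays in [0,1].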

3.3.3 Exponential decay

The behavior, affectivity, and recognition features obtained in previous periods can be accumulated into those of the current period; however, their influence may decay with time. We use an exponential decay function to adjust these features, as shown in Eq. (4), where $F_t$ is the adjusted feature vector in period t, $f_t$ is the feature vector observed in period t (each can be a behavior, affectivity, or recognition feature vector), and $\omega$ is a decay parameter. In contrast, the personality feature is static and does not decay with time; that is, the number of friends, the number of posts, and the privacy setting do not decay. Similarly, the action sequence feature does not accumulate over time.

$$F_t = \omega \cdot F_{t-1} + f_t \tag{4}$$
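The recurrence of Eq. (4) can be sketched as a simple running accumulation. The function name `accumulate_with_decay`, the value ω = 0.5 and the choice to start from the first period's raw vector (that is, treating the vector before the first period as zero) are assumptions for illustration.

```python
def accumulate_with_decay(period_features, omega):
    """Apply F_t = omega * F_{t-1} + f_t over consecutive periods."""
    adjusted = []
    prev = None
    for f in period_features:
        if prev is None:
            current = tuple(f)   # assumed start: nothing accumulated yet
        else:
            current = tuple(omega * p + x for p, x in zip(prev, f))
        adjusted.append(current)
        prev = current
    return adjusted

# Three periods of a behavior vector (posts, comments, likes).
history = [(10, 0, 4), (2, 6, 0), (0, 0, 8)]
print(accumulate_with_decay(history, omega=0.5))
# [(10, 0, 4), (7.0, 6.0, 2.0), (3.5, 3.0, 9.0)]
```

With ω between 0 and 1, older periods contribute geometrically less; ω = 0 would ignore the past entirely.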

3.3.4 Fuzzy c-means clustering

Users in a social group may play more than one role. For example, in an Android fan group, users can post related information (news links, videos, photos, etc.), make comments to discuss with other users, and click the "Like" button to follow the leader's posts. Thus, we employ the fuzzy c-means clustering algorithm (FCM) [5] to cluster together users with similar features, where each user is represented by a content-based behavioral feature vector.

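FCM classifies the feature vectors $X = \{x_1, x_2, \ldots, x_n\}$ into c clusters $C = \{C_1, C_2, \ldots, C_c\}$ by minimizing the objective function shown in Eq. (5), where $\mu_{ij}$ is the membership of feature vector $x_j$ to cluster $C_i$, n is the number of feature vectors, $m \in [1, \infty)$ is a weight controlling the degree of fuzziness, $c_i$ is the centroid of cluster $C_i$, and $\sum_{i=1}^{c} \mu_{ij} = 1$, with i = 1, 2, …, c and j = 1, 2, …, n.

$$J_m = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{m}\, D(x_j, c_i)^2 \tag{5}$$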

The steps of FCM are shown as follows, where c is the number of clusters and m is the degree of fuzziness.
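As a rough illustration of how an FCM iteration of this kind can be organised, the following sketch implements the textbook algorithm. It is not the exact procedure used in this work: it swaps in plain Euclidean distance on numeric tuples for the weighted composite distance, uses the textbook objective (membership to the power m times squared distance), and requires m > 1. The function names `memberships`, `update_centroids`, `objective` and `fcm` are made up for the example. It does mirror the worked example in seeding the centroids with chosen data points and stopping once J_old − J_new ≤ ε.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(points, centroids, m):
    """mu[i][j]: membership of point j in cluster i; each column sums to 1."""
    mu = [[0.0] * len(points) for _ in centroids]
    for j, x in enumerate(points):
        d = [euclidean(x, c) for c in centroids]
        zero = [i for i, di in enumerate(d) if di == 0.0]
        if zero:
            # The point coincides with a centroid: full membership there.
            for i in zero:
                mu[i][j] = 1.0 / len(zero)
            continue
        for i, di in enumerate(d):
            mu[i][j] = 1.0 / sum((di / dk) ** (2.0 / (m - 1)) for dk in d)
    return mu

def update_centroids(points, mu, m):
    """Each centroid is the mean of the points weighted by mu ** m."""
    dim = len(points[0])
    centroids = []
    for row in mu:
        w = [u ** m for u in row]
        total = sum(w)
        centroids.append(tuple(
            sum(wj * x[k] for wj, x in zip(w, points)) / total
            for k in range(dim)))
    return centroids

def objective(points, centroids, mu, m):
    return sum(mu[i][j] ** m * euclidean(x, c) ** 2
               for i, c in enumerate(centroids)
               for j, x in enumerate(points))

def fcm(points, init_centroids, m=2.0, eps=1e-3, max_rounds=100):
    centroids = list(init_centroids)
    mu = memberships(points, centroids, m)
    j_old = objective(points, centroids, mu, m)
    for _ in range(max_rounds):
        centroids = update_centroids(points, mu, m)
        mu = memberships(points, centroids, m)
        j_new = objective(points, centroids, mu, m)
        if j_old - j_new <= eps:   # improvement no larger than epsilon
            break
        j_old = j_new
    return centroids, mu

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, mu = fcm(points, [points[0], points[2]])
print([round(u, 3) for u in mu[0]])
```

Each column of `mu` plays the role of one user's membership vector: the entries sum to 1 and express how strongly that point belongs to each cluster.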

After grouping users into clusters, we compute the membership of each user to each centroid. Thus, each user is represented by a membership vector, denoted as MV = [e1, e2, …, ec], where $\sum_{k=1}^{c} e_k = 1$. Note that the centroid is represented by the personality, behavior, affectivity, and recognition features. Since the action sequence is a sequence of symbolic actions, it cannot be used to compute the centroid. Therefore, we estimate the distance between the action sequence $AS_i$ of user $u_i$ and each centroid $c_k$ by

$$\frac{\sum_{j=1,\, j \neq i}^{n} (\mu_{kj})^{m} \cdot D_{AS}(AS_i, AS_j)}{\sum_{j=1,\, j \neq i}^{n} (\mu_{kj})^{m}}.$$

Similarly, the distance is normalized into [0,1].

Table 1. An example database.

User  Feature ([PE, BE, AS, AF, RE])
u1    [(200,300,1), (30,3,0), {s,l,l,v}, (5.4,-1.3), (5.2,0.2,1.1)]
u2    [(1000,500,2), (10,5,30), {s,p,s}, (7.7,-3.4), (0.8,3.4,1.0)]
u3    [(100,30,0), (1,3,20), {s}, (2.4,-3.0), (0.2,0.9,1.1)]
u4    [(100,500,1), (40,5,1), {s,s,s,l,s}, (4.7,-0.4), (4.8,1.4,1.5)]

We use the example database in Table 1 to demonstrate how FCM works. Assume c = 2, m = 2, and ε = 0.001. We first select u1 and u2 as the two centroids (c1 and c2). The distance between each pair of action sequences in the first round is shown in Table 2.

Table 2. The distance between each pair of action sequences.

The centroids computed in the first round are c1 = [(163.3,345.1,1.0), (31.3,3.6,2.1), (5,-1.1), (4.8,0.6,1.2)] and c2 = [(673.2,356.9,1.6), (9,4.4,25.3), (5.9,-3.1), (0.9,2.5,1.1)].

The old and new values of $J_m$ calculated by Eq. (5) are $J_{old}$ = 0.0306 and $J_{new}$ = 0.0294. That is, $J_{old} - J_{new}$ = 0.0012 > ε = 0.001. Thus, we repeat steps 2 and 3 until the condition is satisfied. In the following rounds, we use

$$\frac{\sum_{j=1,\, j \neq i}^{n} (\mu_{kj})^{m} \cdot D_{AS}(AS_i, AS_j)}{\sum_{j=1,\, j \neq i}^{n} (\mu_{kj})^{m}}$$

to estimate the distance between each action sequence $AS_i$ of user $u_i$ and each centroid $c_k$. For example, the distance between AS3 and c1 is [(1)²⋅0.75 + (0)²⋅0.67 + (0.728)²⋅0.8] / [(1)² + (0)² + (0.728)²] = 0.767, where the distance is 0.75 between AS3 and AS1, 0.67 between AS3 and AS2, and 0.8 between AS3 and AS4.
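The worked value 0.767 can be reproduced with a short sketch of the membership-weighted average. The function name `sequence_to_centroid_distance` is hypothetical; the memberships (1, 0, 0.728) of u1, u2 and u4 in c1 and the three sequence distances are the numbers quoted in the example, with m = 2.

```python
def sequence_to_centroid_distance(memberships, seq_distances, m=2):
    """Average of the sequence distances, weighted by membership ** m.

    memberships[j] is the membership of another user j in the centroid's
    cluster; seq_distances[j] is the normalized action sequence distance
    between the user of interest and that user j.
    """
    weights = [mu ** m for mu in memberships]
    total = sum(weights)
    if total == 0:
        raise ValueError("all memberships are zero")
    return sum(w * d for w, d in zip(weights, seq_distances)) / total

# AS3 versus c1: memberships of u1, u2, u4 in c1 and their distances to AS3.
value = sequence_to_centroid_distance([1.0, 0.0, 0.728], [0.75, 0.67, 0.8])
print(round(value, 3))  # 0.767
```

Users who belong strongly to the cluster dominate the estimate, so a symbolic sequence gets a usable distance to a centroid that has no sequence of its own.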

FCM stops after three rounds of clustering. The centroids and membership vectors for each round are shown in Table 3.

Table 3. The centroids and membership vectors for each round.

R   centroid                                                               uj   MV [μ1j, μ2j]   Jold − Jnew
1   c1 = [(163.3,345.1,1), (31.3,3.6,2.1), (5,-1.1), (4.8,0.6,1.2)]        u1
    c2 = [(673.2,356.9,1.6), (9,4.4,25.3), (5.9,-3.1), (0.9,2.5,1.1)]
2   c1 = [(162.6,355,1), (32.1,3.7,1.8), (5,-1.1), (4.9,0.6,1.2)]
    c2 = [(654.9,341.2,1.6), (8.3,4.3,25.3), (5.8,-3.1), (0.8,2.5,1.1)]
3   c1 = [(154.6,389.9,1), (34.3,3.9,1.3), (5,-0.9), (5,0.8,1.3)]
    c2 = [(580.8,285.9,1.5), (6.2,4.1,25.1), (5.3,-3.2), (0.6,2.2,1.1)]
