
4.3 Implicit Features

4.3.2 Extract the Implicit Features

After grouping the communities, every inactive node in the second layer links to some community members in the first layer. For example, in Figure 4.4 the inactive node p_ia has cross-edges to members of communities C₁, C₂, and C₃. We extract the implicit features from the characteristics of these community members. Note that we do not consider communities that have no cross-edges with p_ia (e.g., C₄). Table 4.5 lists the implicit features we extract from the communities. In the next subsection, we discuss the effectiveness of each feature and describe how potential nodes and non-potential nodes differ.
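The extraction step can be illustrated with a minimal Python sketch. The exact feature definitions are given in Table 4.5, which is not reproduced here, so the definitions below are assumptions inferred from the feature names: `total_com_members` counts the linked community members, `dst_com` counts the distinct communities touched, and the size features are taken over the touched communities only. The function name, argument names, and toy data are hypothetical.

```python
from collections import Counter

def implicit_features(cross_neighbors, community_of):
    """Community-based features for one inactive node (assumed definitions).

    cross_neighbors: first-layer nodes joined to the inactive node by cross-edges.
    community_of: dict mapping each first-layer node to its community id.
    """
    # Linked members per community; communities without any cross-edge
    # (such as C4 in the example) never appear in this counter.
    members_per_com = Counter(
        community_of[n] for n in cross_neighbors if n in community_of
    )
    if not members_per_com:
        return None
    # Size of a community = number of first-layer nodes assigned to it.
    com_sizes = Counter(community_of.values())
    sizes = [com_sizes[c] for c in members_per_com]
    return {
        "total_com_members": sum(members_per_com.values()),
        "dst_com": len(members_per_com),
        "max_com_size": max(sizes),
        "min_com_size": min(sizes),
        "avg_com_size": sum(sizes) / len(sizes),
    }

# Toy first layer: C1 has 2 members, C2 has 1, C3 has 3, C4 has 1.
community_of = {"a": "C1", "b": "C1", "c": "C2",
                "d": "C3", "e": "C3", "f": "C3", "g": "C4"}
# The inactive node has cross-edges to a (C1), c (C2), d and e (C3).
features = implicit_features(["a", "c", "d", "e"], community_of)
print(features)
```

In this toy case C4 contributes nothing because the inactive node has no cross-edge into it, which mirrors the exclusion rule stated above.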

Feature Name          SI      SHRINK   DBSCAN
total com. members    3.433   2.779    0.865
dst. com.             6.544   5.69     0.268
max. com. size        7.57    6.27     7.226
min. com. size        7.735   6.212    6.96
avg. com. size        3.513   2.612    2.9

Table 4.6: Information gain of implicit features (×10⁻²)

4.3.3 Analysis

Table 4.6 shows the information gain of each feature under the different clustering algorithms. The max.com.size and min.com.size features have the highest information gain under all three algorithms; dst.com. comes close only under SHRINK (5.69, versus 6.27 and 6.212).
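For reference, information gain for a numeric feature can be computed as in the sketch below. The thesis does not state how the continuous features are discretized, so the single-threshold binary split used here is an assumption, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """IG of splitting the samples into value <= threshold and value > threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Entropy remaining after the split, weighted by partition size.
    conditional = sum(len(part) / n * entropy(part)
                      for part in (left, right) if part)
    return entropy(labels) - conditional

# Toy data: label 1 = potential node, 0 = non-potential node.
sizes = [1, 2, 3, 4]
labels = [1, 1, 0, 0]
ig_perfect = information_gain(sizes, labels, threshold=2)  # split separates the classes
ig_useless = information_gain(sizes, labels, threshold=4)  # split separates nothing
print(ig_perfect, ig_useless)
```

A feature whose split separates the two classes completely recovers the full label entropy, while a split that leaves all samples on one side yields zero gain.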

Figure 4.3 shows that the maximum and minimum community size have the largest CDF gap under both the SI and DBSCAN clustering algorithms. Although dst.com. has the largest gap under SHRINK when the feature value is smaller than 0.3, the maximum and minimum community size features become more discriminative beyond that point. From these observations, we conclude that max.com.size and min.com.size are powerful predictors regardless of the clustering algorithm. To examine these predictors more closely, we focus on the maximum and minimum community size and plot the CDF for each method.
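The notion of a CDF gap can be made concrete with a short sketch. The thesis does not define the gap formally; the code below assumes it is the largest vertical distance between the empirical CDFs of the two node groups (the same quantity as the two-sample Kolmogorov-Smirnov statistic). The function names and sample values are hypothetical.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def max_cdf_gap(potential, non_potential):
    """Largest vertical distance between the two empirical CDFs,
    together with a feature value at which it occurs."""
    f_pot, f_non = ecdf(potential), ecdf(non_potential)
    # The gap can only change at observed values, so checking those suffices.
    points = sorted(set(potential) | set(non_potential))
    return max((abs(f_pot(x) - f_non(x)), x) for x in points)

# Toy normalized feature values for the two groups.
potential = [0.1, 0.2, 0.2, 0.3]
non_potential = [0.3, 0.5, 0.6, 0.8]
gap, at_value = max_cdf_gap(potential, non_potential)
print(gap, at_value)
```

A large gap at small feature values means that potential nodes concentrate at the low end of the feature, which is the pattern described for the community size features.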

Figure 4.5: CDF of SI implicit features

Figure 4.6: CDF of SHRINK implicit features

Figure 4.7: CDF of DBSCAN implicit features

Figures 4.5 to 4.7 show the CDF of the maximum and minimum community size under the different clustering methods. Most potential nodes have lower values for both features than non-potential nodes. This implies that the nodes in the induced subnetworks of potential nodes belong to smaller communities than those in the induced subnetworks of non-potential nodes. We can summarize the result as "the smaller the communities that a person p's friends belong to, the more likely p is to join the service". The result makes sense: if the people connected with p are all in a large community, it is probably a public community such as an advertisement community. In other words, people tend to join a service if their contacts in the service are in private communities, such as a family or colleague community, rather than in an advertisement community.

Our conclusion from the above observations is that people tend to be attracted by small communities, whichever clustering method is applied. That is, if a person p's friends tend to be in small communities, p has a higher probability of joining the service. Based on this observation, we choose the maximum and minimum community size as our powerful predictors among the implicit features.

So far, we have extracted explicit and implicit features to distinguish the potential nodes from the inactive nodes, and we have chosen total sending freq. and dst. outgoing edges as the powerful explicit predictors and max.com.size and min.com.size as the powerful implicit predictors. In Section 5, we use an SVM classifier to show the effectiveness of the features we extract.

4.4 Sparseness

Figure 4.8: IG for effective features. Panels: (a) dst. outgoing edges, (b) total sending freq.

In this section, we study the sparseness issue. The graph density of the first layer in our dataset is about 1.08 × 10⁻⁶, and each active node provides 0.0874 interaction edges on average.
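These two figures can be cross-checked against the counts reported in Table 5.1. The sketch below assumes the directed-graph density formula, edges / (n(n − 1)); the thesis does not state which formula it uses, but this one reproduces the reported value. Variable names are hypothetical.

```python
# Counts taken from Table 5.1.
active_nodes = 80_764
inactive_nodes = 436_257
interaction_edges = 7_064   # edges inside the first layer
cross_edges = 507_338

# Assumed formula: density of a directed graph without self-loops.
density = interaction_edges / (active_nodes * (active_nodes - 1))

# Average edge counts per node.
interaction_per_active = interaction_edges / active_nodes
cross_per_inactive = cross_edges / inactive_nodes
cross_per_active = cross_edges / active_nodes

print(f"density                = {density:.3e}")
print(f"interaction per active = {interaction_per_active:.4f}")
print(f"cross per inactive     = {cross_per_inactive:.3f}")
print(f"cross per active       = {cross_per_active:.3f}")
```

Under this assumption the computed density is about 1.08 × 10⁻⁶, and the per-node averages agree with the table to within rounding in the last digit.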

Clearly, the data we use is very sparse. To show that the features we extract are also effective on sparse datasets, we generate simulated data sets by randomly sampling the edges of the original data, keeping 20%, 40%, 60%, or 80% of the edges.
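The sampling procedure can be sketched as follows. The thesis does not specify the sampling scheme, so uniform sampling without replacement is an assumption, as are the function name, the fixed seed, and the toy edge list.

```python
import random

def sample_edges(edges, keep_ratio, seed=0):
    """Keep a uniformly random subset of the edges (without replacement)."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    keep = round(len(edges) * keep_ratio)
    return rng.sample(edges, keep)

# Toy edge list standing in for the original data.
edges = [(i, i + 1) for i in range(100)]
samples = {ratio: sample_edges(edges, ratio) for ratio in (0.2, 0.4, 0.6, 0.8)}
for ratio, kept in samples.items():
    print(ratio, len(kept))
```

Because each sample is drawn from the original edge list, every simulated data set is a subgraph of the original graph with proportionally lower density.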

Figure 4.8 shows the information gain of the effective features identified in the previous section. Figures 4.8(a) and 4.8(b) show the information gain of the explicit features, while Figures 4.8(c) and 4.8(d) show that of the implicit features under the different grouping algorithms.

As we can see, the higher the graph density, the higher the IG. The figures imply that sparse data makes the predictors less powerful. The reason is simple: as the graph gets sparser, we lose more of the information needed to distinguish the different types of nodes.

Although the data we use is very sparse, the accuracy still reaches almost 70%. This suggests that the result could improve if the graph were denser.

To verify this conclusion, we apply our method to the real dataset and compare the accuracy across data of different sparseness (density) in Section 5.

Chapter 5 Experimental Results

Name                                      Value
Total active nodes                        80,764
Total inactive nodes                      436,257
Total potential nodes                     1,330
Total interaction edges in 1st layer      7,064
Total cross edges                         507,338
Avg. cross edges per inactive node        1.16
Avg. interaction edges per active node    0.0874
Avg. cross edges per active node          6.281
Density (sparseness) of first layer       1.08 × 10⁻⁶

Table 5.1: Statistics of data

Feature type                Time (s)
Explicit features           61.57
SI implicit features        11342.17
SHRINK implicit features    11998.65
DBSCAN implicit features    11347.59

Table 5.2: Time complexity for extracting features

5.1 Dataset

We conduct the experiments on a real dataset. The data consists of call detail records (CDRs) provided by Chunghwa Telecom, which include the caller, callee, and calling time for each instance.

The data also includes the join time of each active node. From this information we can obtain the approximate ground truth, as mentioned in Section 3.3. Before starting the experiments, we perform some data preprocessing, which is discussed in the next subsection.
