4.3 Implicit Features
4.3.2 Extract the Implicit Features
After the communities are grouped, every inactive node in the second layer is linked to some of the community members in the first layer. In Figure 4.4, for example, the inactive node pia has cross-edges to members of communities C1, C2, and C3. We extract the implicit features from the characteristics of these community members. Note that we ignore any community that has no cross-edges with pia (e.g., C4). Table 4.5 lists the implicit features we extract for the communities. In the next subsection, we discuss the effectiveness of each feature and show how it differs between potential nodes and non-potential nodes.
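The extraction step can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function name, the dictionary layout, and the toy data are hypothetical, and it assumes that "total com. members" counts the linked community members and that "dst. com." counts the distinct communities reached by cross-edges.

```python
from collections import Counter

def implicit_features(neighbors, community_of, community_size):
    """Community-based features for one inactive node (hypothetical helper).

    neighbors      -- first-layer nodes joined to the inactive node by cross-edges
    community_of   -- dict: first-layer node -> community id
    community_size -- dict: community id -> number of members
    """
    # Count linked members per community; communities without any
    # cross-edge to this node never appear here and are thus ignored.
    touched = Counter(community_of[n] for n in neighbors if n in community_of)
    if not touched:
        return None
    sizes = [community_size[c] for c in touched]
    return {
        "total com. members": sum(touched.values()),  # assumed meaning
        "dst. com.": len(touched),                    # assumed meaning
        "max. com. size": max(sizes),
        "min. com. size": min(sizes),
        "avg. com. size": sum(sizes) / len(sizes),
    }

# Toy data mirroring the Figure 4.4 situation: the node touches C1, C2, C3 but not C4.
community_of = {"a": "C1", "b": "C1", "c": "C2", "d": "C3", "e": "C4"}
community_size = {"C1": 5, "C2": 3, "C3": 10, "C4": 7}
features = implicit_features(["a", "b", "c", "d"], community_of, community_size)
```

In this toy case the community C4 does not contribute to any feature, because none of its members shares a cross-edge with the node.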
Feature Name          SI       SHRINK   DBSCAN
total com. members    3.433    2.779    0.865
dst. com.             6.544    5.69     0.268
max. com. size        7.57     6.27     7.226
min. com. size        7.735    6.212    6.96
avg. com. size        3.513    2.612    2.9
Table 4.6: Information gain of implicit features (×10^-2)
4.3.3 Analysis
Table 4.6 shows the information gain of each feature under the different clustering algorithms. As we can see, max. com. size and min. com. size consistently have high information gain regardless of the underlying algorithm, except that dst. com. has the highest under the SHRINK algorithm.
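For reference, the information gain of a single numeric feature can be computed along the following lines. This is only a sketch: the equal-width binning, the number of bins, and the toy values are assumptions made for illustration, since the discretization behind Table 4.6 is not specified here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total) for k in Counter(labels).values())

def information_gain(values, labels, bins=10):
    """Entropy of the labels minus the entropy left after binning the feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to one bin for a constant feature
    groups = {}
    for v, y in zip(values, labels):
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        groups.setdefault(idx, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Illustrative values only: small community sizes for nodes that joined (label 1).
sizes = [2, 3, 4, 40, 55, 60]
joined = [1, 1, 1, 0, 0, 0]
ig = information_gain(sizes, joined, bins=2)
```

A feature that separates the two classes perfectly, as in the toy data, attains the full label entropy; a constant feature yields an information gain of zero.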
Figure 4.3 shows that the maximum and minimum community sizes always have the largest CDF gap under both the SI and DBSCAN clustering algorithms. Although dst. com. has the largest gap under SHRINK when the feature value is below 0.3, the maximum and minimum community size features become more discriminative beyond that point. From these observations, we conclude that max. com. size and min. com. size are powerful predictors regardless of the clustering algorithm. To examine these predictors more closely, we focus on the maximum and minimum community sizes and plot their CDFs for each method.
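One way to quantify the CDF gap discussed above is the largest vertical distance between the empirical CDFs of the two groups, which is the two-sample Kolmogorov-Smirnov statistic. The sketch below assumes this reading of "gap"; the function names and the sample values are illustrative only.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n

def max_cdf_gap(potential, non_potential):
    """Largest absolute difference between the two empirical CDFs."""
    f_pot, f_non = ecdf(potential), ecdf(non_potential)
    # Both CDFs are step functions, so the maximum is attained at a sample point.
    points = sorted(set(potential) | set(non_potential))
    return max(abs(f_pot(x) - f_non(x)) for x in points)

# Illustrative normalized feature values for the two groups of nodes.
potential = [0.1, 0.2, 0.3, 0.6]
non_potential = [0.2, 0.5, 0.7, 0.8]
gap = max_cdf_gap(potential, non_potential)
```

A larger gap means that the feature separates potential nodes from non-potential nodes more clearly at some threshold.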
Figure 4.5: CDF of SI implicit features
Figure 4.6: CDF of SHRINK implicit features
Figure 4.7: CDF of DBSCAN implicit features
Figures 4.5 to 4.7 show the CDFs of the maximum and minimum community sizes under the different clustering methods. As we can see, most potential nodes have lower values for both features than non-potential nodes. This implies that the nodes in a potential node's induced subnetwork belong to smaller communities than those in a non-potential node's induced subnetwork. We can summarize the result as "the smaller the communities that a person p's friends belong to, the more likely p is to join the service". This makes sense: if the people connected to p are all in a large community, that community is probably a public one, such as an advertisement community. In other words, people tend to join a service when their contacts in the service belong to private communities, such as family or colleague communities, rather than to an advertisement community.
Our conclusion from the above observations is that people tend to be attracted by small communities, whichever clustering method is applied. That is, if a person p's friends tend to be in small communities, p has a higher probability of joining the service. Based on this observation, we choose the maximum and minimum community sizes as our powerful predictors among the implicit features.
So far, we have extracted explicit and implicit features to distinguish the potential nodes from the inactive nodes. We have chosen total sending freq. and dst. outgoing edges as the powerful explicit predictors, and max. com. size and min. com. size as the powerful implicit predictors. In Section 5, we will use an SVM classifier to show the effectiveness of the extracted features.
4.4 Sparseness
(a) dst. outgoing edges (b) total sending freq.
(c) max. com. size (d) min. com. size
Figure 4.8: IG for effective features
In this section, we study the sparseness issue. The graph density of the first layer in our dataset is about 1.08 × 10^-6, and each active node contributes 0.0874 interaction edges on average. Clearly, the data we use is very sparse. To show that the features we extract remain effective on sparse data, we generate simulated data sets by randomly sampling the edges of the original data and keeping 20%, 40%, 60%, or 80% of them.
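The edge sampling can be sketched as below. This is a hedged illustration rather than the procedure actually used: it assumes uniform sampling without replacement over an edge list, and the function name, the fixed seed, and the toy ring graph are hypothetical.

```python
import random

def sample_edges(edges, keep_ratio, seed=0):
    """Keep a uniformly random fraction of the edges (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    k = round(len(edges) * keep_ratio)
    return rng.sample(edges, k)        # sampling without replacement

# Toy directed ring with 50 edges standing in for the original interaction edges.
edges = [(i, (i + 1) % 50) for i in range(50)]
subsets = {ratio: sample_edges(edges, ratio) for ratio in (0.2, 0.4, 0.6, 0.8)}
```

Because the node set is unchanged, keeping a fraction r of the edges scales the graph density by the same factor r.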
Figure 4.8 shows the information gain of the effective features identified in the previous section. Figures 4.8(a) and 4.8(b) show the information gain of the explicit features, while Figures 4.8(c) and 4.8(d) show that of the implicit features under the different grouping algorithms. As we can see, the higher the graph density, the higher the IG. The figures imply that sparse data makes the predictors less powerful. The reason is simple: as the graph becomes sparser, we lose more of the information needed to distinguish the different types of nodes.
Although the data we use is very sparse, the accuracy can still reach almost 70%. This suggests that the results could improve if the graph were denser. To verify this conclusion, we will apply our method to the real dataset and compare the accuracy across data of different sparseness (density) in Section 5.
Chapter 5
Experimental Results
Name                                      Value
Total active nodes                        80,764
Total inactive nodes                      436,257
Total potential nodes                     1,330
Total interaction edges in 1st layer      7,064
Total cross edges                         507,338
Avg. cross edges per inactive node        1.16
Avg. interaction edges per active node    0.0874
Avg. cross edges per active node          6.281
Density (sparseness) of first layer       1.08 × 10^-6
Table 5.1: Statistics of data
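The derived rows of Table 5.1 follow from the raw counts, as the short check below illustrates. It assumes that the first layer is treated as a directed graph without self-loops, so that the density is |E| / (|V| (|V| - 1)); under this assumption the computed value matches the reported one. The variable names are illustrative.

```python
# Raw counts taken from Table 5.1.
active_nodes = 80_764
inactive_nodes = 436_257
interaction_edges = 7_064
cross_edges = 507_338

# Derived averages (Table 5.1 reports 1.16, 0.0874, and 6.281).
avg_cross_per_inactive = cross_edges / inactive_nodes
avg_interaction_per_active = interaction_edges / active_nodes
avg_cross_per_active = cross_edges / active_nodes

# Directed-graph density, assumed formula: edges / (n * (n - 1)).
# Table 5.1 reports about 1.08e-6.
density = interaction_edges / (active_nodes * (active_nodes - 1))
```

The averages in the table appear to be truncated rather than rounded, so the computed values agree with them only up to the last printed digit.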
Feature type                  Time (s)
Explicit features             61.57
SI implicit features          11342.17
SHRINK implicit features      11998.65
DBSCAN implicit features      11347.59
Table 5.2: Time complexity for extracting features
5.1 Dataset
We conduct the experiments on a real dataset. The data consists of call detail records (CDRs) provided by Chunghwa Telecom, each of which includes the caller, the callee, and the calling time. The data also provides the join time of each active node. From this information we can obtain the approximate ground truth, as described in Section 3.3. Before starting the experiments, we perform some data preprocessing, which is discussed in the next subsection.