
Before introducing the details of the DEDS framework, we provide in Section 2.3.1 preliminaries on the proximity measures that are commonly used in link prediction tasks.

We also introduce in Section 2.3.2 the method for leveraging these proximity measures under a supervised link prediction framework. In Section 2.3.3, we describe the datasets and the evaluation metrics used throughout this chapter.

2.3.1 Proximity Measures in Link Prediction

Most existing link prediction methods involve calculating proximity measures for a non-neighboring vertex pair, vi and vj, where the higher the measures are, the more likely vi and vj are to be connected by an edge in the near future. According to [29], most basic link prediction methods generate proximity measures based on the neighborhood information of vi and vj, or on the path information between vi and vj. The methods that rely on neighborhood information include common neighbors, Jaccard's coefficient [46], Adamic/Adar [1], and preferential attachment [5]. The common neighbors method counts the neighbors that vi and vj have in common, while Jaccard's coefficient normalizes this count by the total number of neighbors that vi and vj have. The Adamic/Adar method modifies common neighbors by giving more weight to rarer common neighbors (i.e., those connected to fewer other vertices). In the preferential attachment method, the proximity measure is the product of the numbers of neighbors of vi and vj.
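The four neighborhood-based measures above can be computed directly from adjacency sets. The following is a minimal sketch, assuming the network is stored as a dictionary adj that maps each vertex to the set of its neighbors; the function names are illustrative and not taken from DEDS.

import math

def common_neighbors(adj, vi, vj):
    # number of neighbors shared by vi and vj
    return len(adj[vi] & adj[vj])

def jaccard_coefficient(adj, vi, vj):
    # common neighbors normalized by the size of the neighbor union
    union = adj[vi] | adj[vj]
    return len(adj[vi] & adj[vj]) / len(union) if union else 0.0

def adamic_adar(adj, vi, vj):
    # rarer common neighbors (fewer connections) contribute more weight
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[vi] & adj[vj] if len(adj[z]) > 1)

def preferential_attachment(adj, vi, vj):
    # product of the two vertices' degrees
    return len(adj[vi]) * len(adj[vj])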

[Figure 2.2 (three panels along a time axis): 1st snapshot at t1, 2nd snapshot at t2, 3rd snapshot at t3.]
Figure 2.2: An illustrative example of the usage of the three snapshots in the DEDS framework, where CN, JC, and PA stand for the proximity measures common neighbors, Jaccard's coefficient, and preferential attachment, respectively. The thick lines indicate newly generated edges which do not exist in the preceding snapshot.

The methods that rely on path information include Katz [22], hitting time [29], variants of PageRank [7] (e.g., rooted PageRank [29]), SimRank [19], and PropFlow [31]. A description and comparison of the methods listed above (with the exception of the recently proposed PropFlow) can be found in [29].
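As an example of a path-based measure, the Katz score sums the numbers of paths of every length between two vertices, damped exponentially by path length, and has the closed form (I - beta*A)^(-1) - I when the damping factor beta is smaller than the reciprocal of the largest eigenvalue of A. A minimal sketch, assuming a dense NumPy adjacency matrix and an illustrative beta:

import numpy as np

def katz_scores(A, beta=0.005):
    # closed form of sum_{l>=1} beta^l * A^l; requires beta < 1/lambda_max(A)
    n = A.shape[0]
    # entry [i, j] of the result is the Katz proximity between vertices i and j
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)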

2.3.2 Supervised Framework for Link Prediction

In the proposed DEDS framework, we adopt the common proximity measures mentioned in the previous subsection within a supervised framework. As shown in [31], using scores generated from the same proximity measure, a supervised method can outperform an unsupervised one. A supervised method is also better able to handle the dynamics and capture the interdependency of topological properties in the network. Therefore, recent link prediction works have begun to adopt supervised frameworks.

A supervised framework like DEDS works as illustrated in Figure 2.2. The first snapshot of the original network is taken to generate feature values (i.e., the scores generated from the proximity measures). Before computing the proximity measures of vertex pairs in the first network snapshot, we sparsify the snapshot using the proposed sparsification methods, which will be described later in Section 2.4.1. Since the network size is significantly reduced, the computational cost of calculating the proximity measures drops considerably. Note that before sparsifying the network snapshot, the non-neighboring vertex pairs are marked in main memory or on disk.2 This prevents unnecessary predictions for vertex pairs that are originally connected but have their connecting edges removed during sparsification. For example, even though the edge between the vertex pair (v1, v5) is removed during sparsification, we can still avoid making a redundant prediction for this pair.
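The bookkeeping described above can be as simple as recording the original edge set in a hashed or ordered index and consulting it while candidate pairs are enumerated. The sketch below assumes the edges fit in main memory as a Python set; a disk-backed ordered index such as a B-tree could stand in for it on larger networks.

def record_original_edges(edge_list):
    # remember every edge of the *unsparsified* snapshot
    return {tuple(sorted(e)) for e in edge_list}

def candidate_pairs(vertices, original_edges):
    # enumerate only pairs that were non-neighboring before sparsification,
    # so edges removed by sparsification are never predicted redundantly
    vs = sorted(vertices)
    for i, u in enumerate(vs):
        for v in vs[i + 1:]:
            if (u, v) not in original_edges:
                yield (u, v)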

After a given period, a second snapshot of the network is taken.3 Since the network is dynamic, this second snapshot can be used to verify whether an originally non-neighboring pair vi and vj in the first snapshot is now connected or not, and class labels are thereby generated. Taking the example illustrated in Figure 2.2, the vertex pair (v1, v2) is newly connected in the second snapshot, and hence this pair is a positive instance. While the first snapshot is sparsified in order to generate proximity measures efficiently, the second snapshot is fully preserved. This ensures that the class labels generated from the second snapshot are all correct.

Besides the first and the second network snapshots, there is a third snapshot used to capture the most recent state of the network. We use this last snapshot to calculate the latest proximity measures and make predictions for users. Since the goal of drastic sparsification is to generate predictions much faster, the third snapshot is sparsified before the proximity measures are calculated. One might think that, because the training process using the first and second snapshots can be conducted offline without time pressure, the first snapshot does not need to be sparsified. However, sparsifying the first snapshot is actually necessary and even vital to prediction accuracy. If the first snapshot is not sparsified, most proximity measures, such as common neighbors and Katz, tend to have much higher values. As a consequence, a classifier trained on these overestimated values cannot make accurate predictions on the third network snapshot, which is sparsified and therefore tends to yield lower values.
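The three-snapshot workflow can be summarized as a short pipeline. The sketch below is only an outline: sparsify(), proximity_features(), and new_edges() are hypothetical placeholders for the sparsification methods of Section 2.4.1, the proximity measures of Section 2.3.1, and the label extraction from the second snapshot, and the random forest is just one possible choice of classifier.

from sklearn.ensemble import RandomForestClassifier  # one possible classifier choice

def train_and_predict(snapshot1, snapshot2, snapshot3, train_pairs, latest_pairs):
    sparse1 = sparsify(snapshot1)                       # features from the sparsified 1st snapshot
    X_train = [proximity_features(sparse1, p) for p in train_pairs]
    appeared = new_edges(snapshot1, snapshot2)          # labels from the fully preserved 2nd snapshot
    y_train = [p in appeared for p in train_pairs]

    clf = RandomForestClassifier().fit(X_train, y_train)

    sparse3 = sparsify(snapshot3)                       # the 3rd snapshot is also sparsified
    X_test = [proximity_features(sparse3, p) for p in latest_pairs]
    return clf.predict_proba(X_test)[:, 1]              # prediction scores for the latest pairs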

2With the help of a data structure such as a B-tree, we can quickly decide whether a vertex pair is linked in the original network, even if this information is stored on disk.

3The timing for taking these snapshots of the network (i.e., t1 and t2) is discussed in [31].

2.3.3 Data and Evaluation Metrics

To evaluate the improvement in prediction accuracy introduced by the DEDS framework, we use two real datasets provided in the LPmade package [30]: condmat (110.5K edges) and disease (15.6K edges). Besides the real datasets, we also carry out the efficiency evaluation on a larger synthetic dataset, rmat (10M edges), produced with the R-MAT [8] graph generation algorithm. The R-MAT algorithm is able to quickly generate large networks that match patterns observed in real-world networks. The parameters a, b, c, and d used in network generation are set to 0.6, 0.15, 0.15, and 0.1, respectively, following the settings in the previous study [32].
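For reference, R-MAT draws each edge by recursively choosing one of the four quadrants of the adjacency matrix with probabilities a, b, c, and d. A compact sketch with the parameter values above, assuming 2^scale vertices and ignoring refinements such as probability noise or duplicate-edge removal:

import random

def rmat_edge(scale, a=0.6, b=0.15, c=0.15, d=0.1):
    # pick one quadrant per recursion level; scale = log2(number of vertices)
    src = dst = 0
    for _ in range(scale):
        r = random.random()
        src <<= 1
        dst <<= 1
        if r < a:                 # top-left quadrant
            pass
        elif r < a + b:           # top-right quadrant
            dst |= 1
        elif r < a + b + c:       # bottom-left quadrant
            src |= 1
        else:                     # bottom-right quadrant
            src |= 1
            dst |= 1
    return src, dst

edges = [rmat_edge(scale=20) for _ in range(10000)]  # e.g., a graph on 2^20 vertices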

In terms of evaluation metrics, the receiver operating characteristic (ROC) curve is an effective way to illustrate the performance of a binary classifier, since it shows the classifier's behavior over its entire operating range. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), and each point on the curve is produced by varying the decision threshold on the proximity measures.

Since it would require too much space to present ROC curves for all of the classifiers under the various sparsification ratios, the following evaluation is based on the area under the ROC curve (AUC). The AUC is a scalar summary of the ROC curve and thus of overall performance. Besides using the AUC to evaluate performance over all decision thresholds, we also adopt the commonly used mean squared error (MSE) to measure performance at the specific decision threshold selected during the training phase.
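Both metrics can be computed directly from the classifier scores. A minimal sketch, assuming y_true holds the labels obtained from the verification snapshot, y_score the classifier scores, and threshold the decision threshold selected during training:

import numpy as np
from sklearn.metrics import roc_auc_score, mean_squared_error

def evaluate(y_true, y_score, threshold):
    auc = roc_auc_score(y_true, y_score)                      # summary over all decision thresholds
    y_pred = (np.asarray(y_score) >= threshold).astype(int)   # hard decisions at the chosen threshold
    mse = mean_squared_error(y_true, y_pred)
    return auc, mse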