

2.4 Diverse Ensemble of Drastic Sparsification

2.4.1 Diverse Sparsification Methods

In the DEDS framework, the proximity measures described in Section 2.3 are adopted as features for building the classifier for link prediction. Since these proximity measures have different properties, a single type of sparsified network is unable to sufficiently preserve them all. Therefore, we generate various types of sparsified networks, and in each network, certain proximity measures can benefit from the edges preserved. As a result, these proximity measures can be more discriminative. Furthermore, as previously mentioned, diversity is an important factor in obtaining an effective ensemble. By generating different sparsified networks with various sparsification methods, we increase the variety of classifiers trained from these sparsified networks. In the following paragraphs, we introduce the four sparsification methods used in the DEDS framework, namely, degree-based sparsification, random-walk-based sparsification, short-path-based sparsification, and random sparsification.

• Degree-based sparsification. In degree-based sparsification, each edge is first given a score in proportion to the sum of the degrees at its two ends. We then repeatedly select the edge with the highest score, until the percentage of selected edges meets the sparsification ratio. As mentioned, the proximity measures adopted in the DEDS framework can be roughly divided into two categories: one based on neighborhood information, and the other based on path information. Most proximity measures based on neighborhood information involve counting the number of common neighbors. To illustrate this, assume there is a connected vertex pair v_x and v_y. During sparsification, if the degree of either v_x or v_y is large, removing the edge connecting the pair may affect many common-neighbor counts. Let the degree of v_x be d_x, which means v_x has d_x − 1 neighbors other than v_y. When we remove the edge connecting v_x and v_y, the number of common neighbors between v_y and each of the d_x − 1 neighbors of v_x decreases by one, since v_x is no longer a neighbor of v_y after sparsification. Symmetrically, for the d_y − 1 neighbors of v_y, the common neighbor v_y disappears between v_x and each of these d_y − 1 vertices. In total, removing the edge connecting v_x and v_y may cause as many as d_x + d_y − 2 non-neighboring vertex pairs to lose a common neighbor.4 Therefore, we remove the edges with the smaller sum of endpoint degrees, in order to reduce the loss of common neighbors.
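The degree-based method above can be sketched in a few lines of pure Python. This is a minimal illustration, not the DEDS implementation; the function name and the edge-list representation are assumptions for the sake of the example.

```python
def degree_based_sparsify(edges, ratio):
    """Keep the top `ratio` fraction of edges, scored by the sum of
    the degrees at the edge's two ends (higher sum = kept first)."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Score each edge by d_u + d_v; sorting is stable, so ties keep input order.
    ranked = sorted(edges, key=lambda e: degree[e[0]] + degree[e[1]], reverse=True)
    keep = max(1, int(len(edges) * ratio))
    return ranked[:keep]

edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (4, 5)]
kept = degree_based_sparsify(edges, 0.5)
```

Here vertex 0 has degree 3, so the three edges incident to it score highest and survive a 50% sparsification, while the low-degree edge (4, 5) is dropped, consistent with the d_x + d_y − 2 argument above.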

• Random-walk-based sparsification. Certain proximity measures based on path information (e.g., hitting time, rooted PageRank, and PropFlow) involve random walks between non-neighboring vertices. Therefore, in this sparsification method, we aim to preserve the edges that are most frequently traversed by random walks.

We first conduct random walk rehearsals on randomly selected vertex pairs in the original network, and then calculate the score of each edge in proportion to its total visit count during the random walk rehearsals. Thereafter, we repeatedly select the edge with the highest score, until the percentage of selected edges meets the sparsification ratio.

4 There are d_x + d_y − 2 vertex pairs that lose common neighbors, but only the non-neighboring vertex pairs will affect prediction accuracy. Therefore, d_x + d_y − 2 is actually a worst-case analysis.
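The rehearsal-based edge scoring can be sketched as follows. As a simplification, this sketch starts each rehearsal walk from a randomly chosen vertex rather than between sampled vertex pairs; the function name, parameter defaults, and adjacency-list representation are assumptions for illustration.

```python
import random

def random_walk_scores(adj, num_rehearsals=1000, walk_length=10, seed=0):
    """Score each undirected edge by how often random walk rehearsals
    traverse it; `adj` maps each vertex to its list of neighbors."""
    rng = random.Random(seed)
    counts = {}
    nodes = list(adj)
    for _ in range(num_rehearsals):
        u = rng.choice(nodes)          # start of one rehearsal walk
        for _ in range(walk_length):
            if not adj[u]:             # dead end: stop this walk
                break
            v = rng.choice(adj[u])     # step to a random neighbor
            key = (min(u, v), max(u, v))
            counts[key] = counts.get(key, 0) + 1
            u = v
    return counts
```

Edges would then be ranked by these counts and the top fraction kept, exactly as in the degree-based method.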

• Short-path-based sparsification. In addition to random walks, certain proximity measures based on path information (e.g., Katz) involve short paths between non-neighboring vertices. Therefore, in this sparsification method, we aim to preserve the edges that appear frequently in short paths, as these edges are likely to be shortcuts or important bridges. We first compute the shortest paths that do not exceed a length threshold L, and the score of each edge is calculated in proportion to the frequency with which the edge appears in these paths. We then repeatedly select the edge with the highest score, until the percentage of selected edges meets the sparsification ratio. In most previous studies of link prediction, the path-oriented proximity measures do not involve long paths, since long paths are computationally expensive; using the maximum length of the paths used in the proximity measures (e.g., 5 in [31]) as L will suffice. When the original network is too large to be stored in main memory, the sketch-based index structure proposed in [13] can help to compute the paths much faster, while keeping the estimation error below 1% on average.
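A simplified version of the short-path scoring can be sketched with breadth-first search. Note the simplification: this counts each edge on one BFS shortest path per vertex pair (the BFS-tree path), rather than over all shortest paths; the function name and adjacency-list format are assumed for illustration.

```python
from collections import deque

def short_path_edge_scores(adj, max_len=5):
    """Score each edge by how often it lies on a shortest path of
    length <= max_len (one shortest path per vertex pair, via BFS)."""
    counts = {}
    for src in adj:
        parent = {src: None}           # BFS tree rooted at src
        queue = deque([(src, 0)])
        while queue:
            u, d = queue.popleft()
            if d == max_len:
                continue               # do not expand beyond the threshold L
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append((v, d + 1))
        for dst in parent:
            u = dst                    # walk the BFS tree back to src
            while parent[u] is not None:
                p = parent[u]
                key = (min(u, p), max(u, p))
                counts[key] = counts.get(key, 0) + 1
                u = p
    return counts
```

On a path graph 0–1–2, both edges are scored equally, since each lies on the same number of short paths; in a network with bridges, the bridge edges accumulate the highest counts and are preserved first.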

• Random sparsification. The random sparsification method is the most straightforward among the four methods, whereby edges are randomly selected from the original network until the percentage of selected edges meets the sparsification ratio. This sparsification prevents the DEDS framework from being over-fitted for any specific type of proximity measure, and also allows the DEDS framework to be more generalized to accommodate potential new proximity measures in the future.
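For completeness, the random method reduces to uniform sampling of edges. A minimal sketch (function name and seeding convention are assumptions; a fixed seed is used here only to make the sampling reproducible):

```python
import random

def random_sparsify(edges, ratio, seed=0):
    """Uniformly sample a `ratio` fraction of edges from the original network."""
    rng = random.Random(seed)
    keep = max(1, int(len(edges) * ratio))
    return rng.sample(edges, keep)
```

Unlike the three score-based methods, no edge is favored, which is what keeps the resulting classifier from specializing to any one proximity measure.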

Since the original network is comparatively large, the sparsification process that deals with the original network may take longer. However, the sparsification can be done offline, and then predictions can be generated online in a timely fashion using the smaller sparsified network.


Figure 2.3: (a) An example network, and (b)–(e) its four sparsified networks obtained from degree-based sparsification, random-walk-based sparsification, short-path-based sparsification, and random sparsification, respectively.

Table 2.1: Statistics of network characteristics for different sparsified networks.

Characteristic              |      Mean                 |       SD                  |      RSD
                            | random_spar  diverse_spar | random_spar  diverse_spar | random_spar  diverse_spar
Assortativity Coef.         |  -0.244       -0.337      |   0.018        0.292      |    7%           86%
Average Clustering Coef.    |   0.033        0.097      |   0.006        0.067      |   17%           69%
Median Degree               |   2            0.5        |   0            1          |    0%          200%
Max Degree                  |  23.25        49.75       |   0.96        23.82       |    4%           48%
Number of SCCs              | 116          258          |   7.26        90.41       |    6%           35%
Largest SCC                 | 283.25       129          |   7.68       104.09       |    3%           81%
Largest SCC Diameter        |   7.5          5.5        |   1.29         2.38       |   17%           43%