
2.6 Analysis of Accuracy and Efficiency

As previously mentioned, the major challenge in drastic sparsification for link prediction is to reduce the network size considerably while keeping the prediction accuracy high. In this section, we present a more detailed analysis of (1) how a diverse ensemble helps to preserve the prediction accuracy, and (2) how sparsification helps to reduce the processing time of prediction.

2.6.1 Prediction Accuracy

Here we first analyze the AUC at different sparsification ratios. The results for larger sparsification ratios (e.g., 60% or 30%) are provided in Figure 2.7(a), and we then examine smaller sparsification ratios (e.g., 5% or 1%) in Figure 2.7(b). From Figure 2.7(a), it can first be seen that, regardless of the sparsification ratio, the AUC of the ensemble classifier is always much higher than that of the individual classifiers. The AUC is 0.73 on average for individual classifiers when the sparsification ratio is 30%; in contrast, with various types of models combined, the AUC of the ensemble classifier reaches 0.90, a 23% improvement. When the sparsification ratio is not too small (specifically, no smaller than 40%), the ensemble classifier can even outperform the classifier trained on the original network. As intuition suggests, the prediction accuracy drops as the sparsification ratio goes down. In Figure 2.7(b), the AUC of an individual classifier trained on the random-walk-based sparsified network falls to 0.52 when the sparsification ratio is 0.1%, which is barely better than flipping a coin. In contrast, the DEDS framework effectively mitigates this loss of prediction accuracy: the AUC of the ensemble classifier remains at 0.70, a 35% improvement.
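The mechanism behind this gain can be illustrated with a minimal, self-contained sketch: classifiers that share a predictive signal but carry independent noise are combined by score averaging, which cancels part of the noise and lifts the AUC. The synthetic scores and noise parameters below are illustrative assumptions, not the actual DEDS classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)

def auc(scores, labels):
    # Rank-based AUC: probability that a positive example outranks a negative one.
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return (pos[:, None] > neg[None, :]).mean()

n = 2000
labels = rng.integers(0, 2, n)
signal = labels + rng.normal(0.0, 1.0, n)  # predictive signal shared by all classifiers

# Four "individual classifiers": same signal, but independent noise,
# mimicking classifiers trained on differently sparsified networks.
individual = [signal + rng.normal(0.0, 1.5, n) for _ in range(4)]
ensemble = np.mean(individual, axis=0)  # combine by score averaging

ind_aucs = [auc(s, labels) for s in individual]
print("mean individual AUC:", round(float(np.mean(ind_aucs)), 3))
print("ensemble AUC:", round(float(auc(ensemble, labels)), 3))
```

Because the noise terms are independent, averaging four scores shrinks the noise standard deviation while the shared signal is preserved, so the ensemble AUC reliably exceeds the average individual AUC, the same qualitative effect reported in Figure 2.7.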

In addition to low prediction accuracy, another problem with using only one type of sparsified network is that the performance of the individual classifiers is unstable, and their ranking varies across sparsification ratios. For example, random sparsification is the most effective method when the sparsification ratio is larger than 70%, but short-path-based sparsification comes out on top when the sparsification ratio is 60%. The ranking changes again at the 9% sparsification ratio, where degree-based sparsification outperforms the other three methods. It is therefore impossible to choose a single individual classifier that performs best at all sparsification ratios. This difficulty, once again, shows the importance of incorporating the ensemble methodology, which enables the DEDS framework to have a steadily increasing AUC value and to outperform every individual classifier at all sparsification ratios.

Figure 2.7: Performance of four different individual classifiers, their average (denoted as "Average"), and the ensemble classifier at different sparsification ratios. (a) disease, ensemble size = 5; (b) disease, ensemble size = 30.

The ensemble size in Figure 2.7(b) is 30, which may appear large. However, when the sparsification ratio is 0.1%, 30 such sparsified networks together contain only 3% of the edges in the original network. Moreover, as shown in Figure 2.7(b), the ensemble classifier combining 30 individual classifiers trained on various 0.1% sparsified networks can outperform any type of individual classifier trained on a 3% sparsified network.

Remember that all of the individual classifiers in the DEDS framework are independent of one another. That is, with enough CPUs or cores, these 30 classifiers can be trained and used to generate predictions at the same time. With only the running time of a 0.1% sparsified network, the DEDS framework can provide prediction accuracy better than that of using a 3% sparsified network. These results suggest that the DEDS framework can also serve as a way to parallelize link prediction tasks and fully utilize the available computational resources.
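Since the classifiers are independent, their training and prediction can be dispatched concurrently. The following is a rough sketch of this scheme; the `train_and_predict` stub and the toy edge lists are placeholders, not the actual DEDS pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def train_and_predict(sparsified_edges):
    # Placeholder for training a classifier on one sparsified network
    # and scoring a candidate node pair; returns a dummy score in [0.5, 1.0].
    return len(sparsified_edges) % 2 / 2 + 0.5

# Toy stand-ins for 30 differently sparsified networks (lists of edges).
sparsified_networks = [[(i, i + 1) for i in range(k)] for k in range(1, 31)]

# The 30 classifiers are independent, so they can run concurrently;
# wall-clock cost is then roughly that of a single 0.1% sparsified network.
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(train_and_predict, sparsified_networks))

ensemble_score = sum(scores) / len(scores)  # combine by averaging
```

In practice, CPU-bound training would use a `ProcessPoolExecutor` (or separate machines) rather than threads, but the dispatch pattern is the same.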

2.6.2 Computational Efficiency

Figure 2.8(a) shows that link prediction tasks take more than 55 hours on the large rmat network. In contrast, the DEDS framework requires only 4% of the prediction cost when the network is sparsified down to 1% of its original size, which means the prediction time can be substantially shortened, from two days to two hours. Note that when the sparsification ratio is small, the sparsified network may fit into main memory. Assume that the network used in Figure 2.8(a) fits into main memory when the sparsification ratio is 1%. According to the experimental results in Figure 2.8(b), the prediction cost is then drastically lowered, from two days to only 17 seconds, which means that DEDS saves more than 99.94% of the running time. The reason DEDS can provide such great savings is that it not only reduces the number of edges that need to be processed, but also relieves the burden of disk access.
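The reported saving follows directly from the two figures quoted above (roughly 55 hours on disk versus 17 seconds in memory):

```python
# Figures reported above: ~55 hours with the network on disk versus
# ~17 seconds with a 1% sparsified network held in main memory.
disk_seconds = 55 * 3600        # 198,000 s
memory_seconds = 17
saving = 1 - memory_seconds / disk_seconds
print(f"running-time saving: {saving:.4%}")  # → running-time saving: 99.9914%
```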

If the original network already fits into main memory, the proposed DEDS framework can still lower the prediction cost considerably by reducing the number of edges that need to be processed. Here we use the smaller condmat network and keep it in main memory from the start. The experimental results in Figure 2.8(c) show that when the network is sparsified to 10% of the original one, the link prediction process requires less than 64% of the running time. If the sparsification ratio further decreases to 1% or even 0.1%, the link prediction process requires only 30% or 19% of the running time, respectively. That is, each time the number of edges is sparsified to 10% of what it previously was, the prediction cost drops by roughly 35-55%.

Figure 2.8: Running time under different sparsification ratios: (a) the network is stored on disk, (b) the original network is stored on disk but fits into main memory after being sparsified, and (c) the network is cached in main memory.
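As a quick sanity check, the per-decade reductions implied by the remaining running times reported for Figure 2.8(c) (100%, 64%, 30%, and 19% of the original at ratios 100%, 10%, 1%, and 0.1%) can be computed directly:

```python
# Remaining running time (% of the original) at each sparsification ratio,
# taken from the Figure 2.8(c) discussion above.
remaining = [100, 64, 30, 19]

# Reduction achieved by each further 10x sparsification step.
reductions = [1 - b / a for a, b in zip(remaining, remaining[1:])]
print([f"{r:.0%}" for r in reductions])  # → ['36%', '53%', '37%']
```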