Issues of Existing Methods - Fast Training Algorithms for Large-Scale Deep Graph Convolutional Networks

In this section, we discuss issues with some existing methods and analyze their time and memory complexity in detail.

In the original GCN paper [10], full gradient descent is used to train GCN, but it suffers from high computational and memory cost. In terms of memory, computing the full gradient of (2.2) by back-propagation requires storing all the embedding matrices {Z^(l)}_{l=1}^{L}, which needs O(NFL) space. In terms of convergence speed, since the model is updated only once per epoch, training requires more epochs to converge.

Recent works [6, 1, 2] have shown that mini-batch SG can improve the training speed and memory requirement of GCN. Instead of computing the full gradient, SG only needs to calculate the gradient based on a mini-batch for each update. In this thesis, we use B ⊆ [N] with size b = |B| to denote a batch of node indices, and each SG step computes the gradient estimate

$$\frac{1}{|B|}\sum_{i\in B}\nabla \xi(y_i, z_i^{(L)}) \tag{2.3}$$

to perform an update. Despite faster convergence in terms of epochs, SG introduces another computational overhead in GCN training (as explained in the previous section), which makes its per-epoch time much slower than that of full gradient descent.

Why does vanilla mini-batch SG have slow per-epoch time? We consider the computation of the gradient associated with one node i: ∇ξ(y_i, z^(L)_i). Clearly, this requires the embedding of node i, which depends on its neighbors' embeddings in the previous layer. To obtain the embedding of each neighbor of node i, we in turn need to aggregate the embeddings of that neighbor's own neighbors. Suppose a GCN has L + 1 layers and each node has an average degree of d. To get the gradient for node i, we need to aggregate features from O(d^L) nodes in the graph. That is, we need to fetch information from a node's hop-k (k = 1, ..., L) neighbors to perform one update. Computing each embedding requires O(F²) time due to the multiplication with W^(l), so on average computing the gradient associated with one node requires O(d^L F²) time.
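The growth of the hop-k neighborhood can be checked on a toy graph. The following is a minimal sketch, not part of the original experiments: `receptive_field` and `complete_tree` are hypothetical helper names, and a complete tree with 3 children per node stands in for a graph with average degree d = 3.

```python
from collections import deque

def receptive_field(adj, node, num_hops):
    """Return the set of nodes within `num_hops` hops of `node` (node itself included)."""
    seen = {node}
    frontier = deque([(node, 0)])
    while frontier:
        u, depth = frontier.popleft()
        if depth == num_hops:
            continue  # do not expand beyond the requested number of hops
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                frontier.append((v, depth + 1))
    return seen

def complete_tree(degree, depth):
    """Hypothetical toy graph: rooted tree in which every internal node has `degree` children."""
    adj = {0: []}
    level, next_id = [0], 1
    for _ in range(depth):
        new_level = []
        for u in level:
            for _ in range(degree):
                adj[u].append(next_id)
                adj[next_id] = [u]
                new_level.append(next_id)
                next_id += 1
        level = new_level
    return adj

tree = complete_tree(degree=3, depth=4)
# Number of nodes whose information is needed for the root, for 0..4 hops.
sizes = [len(receptive_field(tree, 0, hops)) for hops in range(5)]
print(sizes)
```

On this tree the receptive field of the root grows geometrically with the number of hops, which is the behavior the O(d^L) bound describes.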

Embedding utilization can reflect computational efficiency. If a batch has more than one node, the time complexity is less straightforward, since different nodes can have overlapping hop-k neighbors and the number of embedding computations can be less than the worst case O(bd^L). To characterize the computational efficiency of mini-batch SG, we define the concept of "embedding utilization".

During the algorithm, if node i's embedding at the l-th layer, z^(l)_i, is computed and reused u times for the embedding computations at layer l + 1, then we say the embedding utilization of z^(l)_i is u. For mini-batch SG with random sampling, u is very small since the graph is usually large and sparse. Assuming u is a small constant (almost no overlap between hop-k neighbors), mini-batch SG needs to compute O(bd^L) embeddings per batch, which leads to O(bd^L F²) time per update and O(Nd^L F²) time per epoch.
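To make the effect of overlap concrete, the sketch below counts how many embeddings one batch has to compute when every neighborhood is fully expanded. It is an illustration under simplifying assumptions: `embeddings_per_batch` is a hypothetical helper, the graph is a 12-node ring (degree 2), and input features at layer 0 are not counted as computed embeddings.

```python
def embeddings_per_batch(adj, batch, num_layers):
    """Count embeddings computed for one batch with full neighborhood expansion.

    The output layer needs the batch nodes; each lower layer needs those nodes
    plus all of their neighbors. Layer-0 inputs (raw features) are not counted.
    """
    needed = set(batch)
    total = 0
    for _ in range(num_layers):
        total += len(needed)       # embeddings computed at this layer
        expanded = set(needed)
        for u in needed:
            expanded.update(adj[u])
        needed = expanded          # nodes required one layer below
    return total

n = 12
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}  # toy graph, degree 2

single = embeddings_per_batch(ring, [0], num_layers=2)
adjacent = embeddings_per_batch(ring, [0, 1], num_layers=2)
distant = embeddings_per_batch(ring, [0, 6], num_layers=2)
full = embeddings_per_batch(ring, range(n), num_layers=2)
print(single, adjacent, distant, full)
```

Two adjacent nodes share part of their neighborhoods and need fewer embeddings than two distant nodes, and the full batch needs exactly N·L embeddings, whereas an epoch of single-node batches needs n × `single`, which is twice as many on this toy graph.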

We illustrate the neighborhood expansion problem in the left panel of Fig. 2.4. In contrast, full-batch gradient descent has the maximal embedding utilization: each embedding is reused d (average degree) times in the upper layer. As a consequence, the original full gradient descent [10] only needs to compute O(NL) embeddings per epoch, which means that on average only O(L) embedding computations are needed to acquire the gradient of one node.

To make mini-batch SG work, previous approaches try to restrict the neighborhood expansion size, which, however, does not improve embedding utilization. GraphSAGE [6] uniformly samples a fixed-size set of neighbors instead of using the full neighborhood. We denote the sample size as r. This leads to O(r^L) embedding computations for each loss term but also makes the gradient estimate less accurate. FastGCN [1] proposed an importance sampling strategy to improve the gradient estimate. VR-GCN [2] proposed storing the previously computed embeddings for all N nodes and L layers and reusing them for unsampled neighbors. Despite the high memory usage for storing all NL embeddings, we find this strategy very useful: in practice, even a small r (e.g., 2) can lead to good convergence.
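The idea of restricting the expansion by uniform neighbor sampling can be sketched as follows. This is only in the spirit of the sampling described for GraphSAGE, not its actual implementation: `sample_neighborhood` is a hypothetical helper, and a 50-node complete graph is assumed as a toy input so that full expansion would reach every node after one hop.

```python
import random

def sample_neighborhood(adj, batch, num_layers, r, rng):
    """Expand a batch layer by layer, keeping at most r uniformly sampled neighbors per node.

    Returns a list of node sets, from the output layer down to the input layer.
    """
    layers = [set(batch)]
    for _ in range(num_layers):
        prev = layers[-1]
        nxt = set(prev)                      # a node always needs its own embedding
        for u in prev:
            nbrs = adj[u]
            picked = nbrs if len(nbrs) <= r else rng.sample(nbrs, r)
            nxt.update(picked)
        layers.append(nxt)
    return layers

n = 50
complete = {i: [j for j in range(n) if j != i] for i in range(n)}  # toy graph
layers = sample_neighborhood(complete, [0], num_layers=2, r=2, rng=random.Random(0))
print([len(s) for s in layers])
```

Each layer can grow by at most a factor of 1 + r, so the expansion is bounded by a power of r rather than of the degree, at the price of aggregating over only a subset of the neighbors.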

We summarize the time and space complexity in Table 2.1. Clearly, all the SG-based algorithms suffer from exponential complexity with respect to the number of layers, and VR-GCN, even though its r can be small, incurs a huge space complexity that could go beyond a GPU's memory capacity. In Chapter 3, we introduce our Cluster-GCN algorithm, which achieves the best of both worlds: the same per-epoch time complexity as full gradient descent and the same memory complexity as vanilla SG.

Table 2.1: Time and space complexity of GCN training algorithms. L is the number of layers, N is the number of nodes, ∥A∥0 is the number of nonzeros in the adjacency matrix, and F is the number of features. For simplicity we assume the number of features is fixed for all layers. For SG-based approaches, b is the batch size and r is the number of sampled neighbors per node. Note that due to the variance reduction technique, VR-GCN can work with a smaller r than GraphSAGE and FastGCN. For memory complexity, LF² is for storing {W^(l)}_{l=1}^{L} and the other term is for storing embeddings. For simplicity we omit the memory for storing the graph (GCN) or sub-graphs (other approaches) since they are fixed and usually not the main bottleneck.

Method          Time Complexity                 Memory Complexity
GCN [10]        O(L∥A∥0F + LNF²)                O(LNF + LF²)
Vanilla SG      O(d^L NF²)                      O(bd^L F + LF²)
GraphSAGE [6]   O(r^L NF²)                      O(br^L F + LF²)
FastGCN [1]     O(rLNF²)                        O(brLF + LF²)
VR-GCN [2]      O(L∥A∥0F + LNF² + r^L NF²)      O(LNF + LF²)
Cluster-GCN     O(L∥A∥0F + LNF²)                O(bLF + LF²)
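To get a feeling for the memory column of Table 2.1, the sketch below evaluates the embedding-storage term of each row for one set of values. The values (N = 200,000, F = 128, L = 3, b = 512, d = 10, r = 2) are illustrative assumptions rather than measurements, `embedding_memory_terms` is a hypothetical helper, and constants hidden by the O-notation as well as the shared LF² weight term are ignored.

```python
def embedding_memory_terms(N, F, L, b, d, r):
    """Embedding-storage term of each row of Table 2.1, without constants or the LF^2 term."""
    return {
        "GCN": L * N * F,
        "Vanilla SG": b * d**L * F,
        "GraphSAGE": b * r**L * F,
        "FastGCN": b * r * L * F,
        "VR-GCN": L * N * F,
        "Cluster-GCN": b * L * F,
    }

# Illustrative values only; they are not taken from the experiments.
terms = embedding_memory_terms(N=200_000, F=128, L=3, b=512, d=10, r=2)
for method, count in sorted(terms.items(), key=lambda kv: kv[1]):
    print(f"{method:12s} {count:>12,d}")
```

Under these assumed values the Cluster-GCN term is the smallest, and the two methods that store embeddings for all N nodes (GCN and VR-GCN) are the largest.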

Figure 2.4: The neighborhood expansion difference between traditional graph convolution and our proposed cluster approach in Chapter 3. The red node is the starting node for neighborhood nodes expansion. Traditional graph convolution suffers from exponential neighborhood expansion, while our method can avoid expensive neighborhood expansion.

Chapter 3 Proposed Method

3.1 Vanilla Cluster-GCN

Our Cluster-GCN technique is motivated by the following question: in mini-batch SG updates, can we design a batch and the corresponding computation subgraph to maximize the embedding utilization? We answer this in the affirmative by connecting the concept of embedding utilization to a clustering objective.

Consider the case where in each batch we compute the embeddings for a set of nodes B from layer 1 to L. Since the same subgraph A_{B,B} (links within B) is used for each layer of computation, the embedding utilization is the number of edges within this batch, ∥A_{B,B}∥0. Therefore, to maximize embedding utilization, we should design a batch B that maximizes the within-batch edges, which connects the efficiency of SG updates with graph clustering algorithms.
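The quantity ∥A_{B,B}∥0 is easy to compute for a given batch. The sketch below compares a batch that follows the community structure with one that does not; `within_batch_edges` and `two_cliques` are hypothetical helpers, and the toy graph (two 4-node cliques joined by a single edge) is an assumption chosen for illustration.

```python
def within_batch_edges(adj, batch):
    """Number of nonzeros of A_{B,B}: ordered pairs (u, v) with an edge and both nodes in the batch."""
    members = set(batch)
    return sum(1 for u in members for v in adj[u] if v in members)

def two_cliques(k):
    """Hypothetical toy graph: two k-node cliques connected by one bridge edge."""
    adj = {i: [] for i in range(2 * k)}
    for base in (0, k):
        for i in range(base, base + k):
            for j in range(base, base + k):
                if i != j:
                    adj[i].append(j)
    adj[k - 1].append(k)   # bridge between the two cliques
    adj[k].append(k - 1)
    return adj

graph = two_cliques(4)
clustered = within_batch_edges(graph, [0, 1, 2, 3])   # one whole clique
mixed = within_batch_edges(graph, [0, 1, 4, 5])       # two nodes from each clique
print(clustered, mixed)
```

Both batches have four nodes, but the batch aligned with a clique keeps three times as many nonzeros, so each computed embedding is reused more often.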

Now we formally introduce Cluster-GCN. For a graph G, we partition its nodes into c groups: V = [V_1, ..., V_c], where V_t consists of the nodes in the t-th partition. Thus we have c subgraphs as

$$\bar{G} = [G_1, \cdots, G_c] = [\{V_1, E_1\}, \cdots, \{V_c, E_c\}],$$

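where each E_t only consists of the links between nodes in V_t. After reorganizing nodes, the adjacency matrix is partitioned into c² submatrices as

$$A = \bar{A} + \Delta = \begin{bmatrix} A_{11} & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & A_{cc} \end{bmatrix}, \quad \bar{A} = \begin{bmatrix} A_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & A_{cc} \end{bmatrix}, \quad \Delta = \begin{bmatrix} 0 & \cdots & A_{1c} \\ \vdots & \ddots & \vdots \\ A_{c1} & \cdots & 0 \end{bmatrix},$$

where each diagonal block A_tt is a |V_t| × |V_t| adjacency matrix containing the links within G_t; Ā is the adjacency matrix of the graph Ḡ; A_st contains the links between two partitions V_s and V_t; and Δ is the matrix consisting of all off-diagonal blocks of A. Similarly, we can partition the feature matrix X and training labels Y according to the partition [V_1, ..., V_c] as [X_1, ..., X_c] and [Y_1, ..., Y_c], where X_t and Y_t consist of the features and labels for the nodes in V_t, respectively.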
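The split of A into its within-partition part Ā and its between-partition part Δ can be sketched directly on a dense matrix. This is a small illustration under assumptions: `split_block_diagonal` is a hypothetical helper, the graph is a 4-node path, and `assignment[i]` gives the partition index of node i.

```python
def split_block_diagonal(A, assignment):
    """Split A into A_bar (entries within a partition) and Delta (entries between partitions)."""
    n = len(A)
    A_bar = [[A[i][j] if assignment[i] == assignment[j] else 0 for j in range(n)]
             for i in range(n)]
    Delta = [[A[i][j] - A_bar[i][j] for j in range(n)] for i in range(n)]
    return A_bar, Delta

# Toy path graph 0 - 1 - 2 - 3, partitioned into {0, 1} and {2, 3}.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
A_bar, Delta = split_block_diagonal(A, assignment=[0, 0, 1, 1])

between_links = sum(x != 0 for row in Delta for x in row)
print(A_bar, Delta, between_links)
```

Because the nodes are already ordered by partition, Ā is block diagonal and Δ holds only the single link between nodes 1 and 2, stored once in each direction.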

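The benefit of this block-diagonal approximation Ḡ is that the objective function of GCN becomes decomposable into different batches (clusters). Let Ā′ denote the normalized version of Ā. The final embedding matrix becomes

$$Z^{(L)} = \bar{A}'\sigma(\bar{A}'\sigma(\cdots\sigma(\bar{A}'XW^{(0)})W^{(1)})\cdots)W^{(L-1)} = \begin{bmatrix} \bar{A}'_{11}\sigma(\bar{A}'_{11}\sigma(\cdots\sigma(\bar{A}'_{11}X_1W^{(0)})W^{(1)})\cdots)W^{(L-1)} \\ \vdots \\ \bar{A}'_{cc}\sigma(\bar{A}'_{cc}\sigma(\cdots\sigma(\bar{A}'_{cc}X_cW^{(0)})W^{(1)})\cdots)W^{(L-1)} \end{bmatrix} \tag{3.3}$$

due to the block-diagonal form of Ā (note that Ā′_tt is the corresponding diagonal block of Ā′). The loss function can also be decomposed into

$$\mathcal{L}_{\bar{A}'} = \sum_{t}\frac{|V_t|}{N}\mathcal{L}_{\bar{A}'_{tt}} \quad\text{and}\quad \mathcal{L}_{\bar{A}'_{tt}} = \frac{1}{|V_t|}\sum_{i\in V_t}\xi(y_i, z_i^{(L)}). \tag{3.4}$$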

Cluster-GCN is then based on the decomposition in (3.3) and (3.4). At each step, we sample a cluster V_t and then conduct SG to update based on the gradient of L_{Ā′_tt}, which only requires the sub-graph A_tt, the features X_t and labels Y_t of the current batch, and the model {W^(l)}_{l=1}^{L}. The implementation only requires forward and backward propagation of matrix products (one block of (3.3)), which is much easier to implement than the neighborhood search procedure used in previous SG-based training methods.
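The claim that one batch only needs A_tt and X_t can be checked numerically: the forward pass (3.3) on the block-diagonal graph, restricted to the rows of one cluster, should match the forward pass computed from that cluster alone. The sketch below assumes row normalization with self-loops as the normalization of Ā and ReLU as σ; both are choices made for this illustration, and all helper names and numeric values are hypothetical.

```python
def matmul(P, Q):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def relu(M):
    return [[max(0.0, x) for x in row] for row in M]

def row_normalize_with_self_loops(A):
    """Assumed normalization for this sketch: add the identity, then divide each row by its sum."""
    out = []
    for i, row in enumerate(A):
        with_loop = [x + (1.0 if i == j else 0.0) for j, x in enumerate(row)]
        total = sum(with_loop)
        out.append([x / total for x in with_loop])
    return out

def gcn_forward(A_norm, X, weights):
    """Forward pass of (3.3): aggregate, multiply by W, apply ReLU on all but the last layer."""
    H = X
    for l, W in enumerate(weights):
        H = matmul(matmul(A_norm, H), W)
        if l < len(weights) - 1:
            H = relu(H)
    return H

# Block-diagonal toy graph with clusters {0, 1} and {2, 3}; values are arbitrary.
A_bar = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
X = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0], [-2.0, 0.5]]
weights = [[[0.5, -1.0], [1.0, 0.25]], [[1.0], [-0.5]]]

Z_full = gcn_forward(row_normalize_with_self_loops(A_bar), X, weights)
A_tt, X_t = [[0, 1], [1, 0]], X[2:4]            # second cluster only
Z_block = gcn_forward(row_normalize_with_self_loops(A_tt), X_t, weights)
max_diff = max(abs(a - b) for ra, rb in zip(Z_full[2:4], Z_block) for a, b in zip(ra, rb))
print(max_diff)
```

The two results agree up to floating-point error, so a training step on cluster t never has to touch rows of X or A outside that cluster.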

We use graph clustering algorithms to partition the graph. Graph clustering methods such as METIS [9] and Graclus [5] aim to construct partitions over the vertices of the graph such that within-cluster links are much more numerous than between-cluster links, to better capture the clustering and community structure of the graph. This is exactly what we need because: 1) As mentioned before, the embedding utilization is equivalent to the number of within-cluster links for each batch. Intuitively, each node and its neighbors are usually located in the same cluster, so after a few hops the neighborhood nodes are still in the same cluster with high probability. 2) Since we replace A by its block-diagonal approximation Ā and the error is proportional to the between-cluster links Δ, we need to find a partition that minimizes the number of between-cluster links.

In Figure 2.4, we illustrate the neighborhood expansion with the full graph G and with the graph under the clustering partition Ḡ. We can see that Cluster-GCN avoids heavy neighborhood search and focuses on the neighbors within each cluster. In Table 3.1, we compare two node partition strategies: random partition versus clustering partition. We partition the graph into 10 parts using random partition and METIS, and then use one partition as a batch to perform an SG update. We can see that with the same number of epochs, using the clustering partition achieves higher accuracy. This shows that using graph clustering is important and partitions should not be formed randomly.

Table 3.1: Random partition versus clustering partition of the graph (trained with mini-batch SG). Clustering partition leads to better performance (in terms of test F1 score) since it removes fewer between-partition links. These three datasets are all public GCN datasets. We will explain the PPI data in the experiment part. Cora has 2,708 nodes and 13,264 edges, and Pubmed has 19,717 nodes and 108,365 edges.

Dataset   random partition   clustering partition
Cora      78.4               82.5
Pubmed    78.9               79.9
PPI       68.1               92.9

Time and space complexity. Since each node in V_t only links to nodes inside V_t, no node needs to perform neighborhood search outside A_tt. The computation for each batch consists purely of matrix products Ā′_tt X_t^(l) W^(l) and some element-wise operations, so the overall time complexity per batch is O(∥A_tt∥0 F + bF²). Thus the overall time complexity per epoch becomes O(∥A∥0 F + NF²). On average, each batch only requires computing O(bL) embeddings, which is linear instead of exponential in L. In terms of space complexity, in each batch we only need to load b samples and store their embeddings on each layer, resulting in O(bLF) memory for storing embeddings. Therefore our algorithm is also more memory efficient than all the previous algorithms. Moreover, our algorithm only requires loading a subgraph into GPU memory instead of the full graph (though the graph is usually not the memory bottleneck). The detailed time and memory complexity are summarized in Table 2.1.
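The step from the per-batch cost to the per-epoch cost can be spelled out as follows. One epoch visits every cluster once, the cluster sizes sum to N, and the diagonal blocks together contain no more nonzeros than A itself. Reading the per-batch bound as a per-layer cost is our interpretation here:

```latex
\sum_{t=1}^{c}\left(\|A_{tt}\|_0 F + |V_t| F^2\right)
  = \|\bar{A}\|_0 F + N F^2
  \le \|A\|_0 F + N F^2,
\qquad\text{since}\quad
\sum_{t=1}^{c}|V_t| = N
\quad\text{and}\quad
\|\bar{A}\|_0 = \sum_{t=1}^{c}\|A_{tt}\|_0 \le \|A\|_0 .
```

Under this reading, repeating the computation for each of the L layers multiplies the bound by L, which is consistent with the O(L∥A∥0F + LNF²) entry for Cluster-GCN in Table 2.1.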
