
Each component is decoupled from the main program and from the other components, so we can modify or implement an optimization without digging into the rest of the codebase or affecting other optimizations.

For example, we can replace the round-robin placement with the heuristic approach mentioned in Chapter 2 simply by supplying a new decision function. We can also apply quantization methods in the encoding and decoding functions, or enable the adaptive synchronization method by checking its criteria at selected call sites of the consistency controller.
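As a concrete illustration of the encoding/decoding hook, the sketch below shows a simple 8-bit linear quantizer that could be dropped into such functions; the function names and signatures here are illustrative rather than the library's actual interface.

    import numpy as np

    # Illustrative 8-bit linear quantization; candidate drop-in bodies for the
    # modular encode/decode functions (the real hook signatures may differ).
    def encode(values):
        """Quantize a float32 tensor to uint8 codes plus the range needed to decode."""
        lo, hi = float(values.min()), float(values.max())
        scale = (hi - lo) / 255.0 or 1.0   # avoid division by zero for constant tensors
        codes = np.round((values - lo) / scale).astype(np.uint8)
        return codes, lo, scale

    def decode(codes, lo, scale):
        """Recover an approximate float32 tensor from the quantized codes."""
        return codes.astype(np.float32) * scale + lo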

4.3 Distributed Training System

In order to compare with an existing distributed training framework, we implement a system in this work using Tensorflow [17] as the deep learning framework mentioned in Figure 4.1. Tensorflow represents a computation as a directed graph. Each node in a graph is an instantiation of an Operation, which describes an abstract computation, such as addition or matrix multiplication, and has inputs and outputs called Tensors. User programs interact with Tensorflow by using a Session.

The session can augment the currently managed graph and supports an interface named Run, which executes the computations required to produce the outputs of specific nodes. In distributed Tensorflow, nodes are mapped onto a set of devices, and each cross-device edge is replaced by a pair of send/receive nodes that handle cross-device communication. With this mechanism, the original Tensorflow can implement different distributed training configurations, including the parameter server architecture.
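For reference, this is roughly how a parameter server configuration is expressed in stock distributed TensorFlow 1.x, which we use as the baseline; the host addresses below are placeholders.

    import tensorflow as tf

    # Placeholder addresses for one parameter server and two workers.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0:2222"],
        "worker": ["worker0:2222", "worker1:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # Variables are placed on the ps job and computation on this worker;
    # TensorFlow inserts a send/receive node pair for every cross-device edge.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        w = tf.get_variable("w", shape=[1024, 1024])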

To use the modularized parameter server, the system launches an instance of the modified Tensorflow per node, and each instance links to the library. Two kinds of operations in Tensorflow are most relevant to our work. A variable op is a special stateful operation that maintains persistent state across iterations. Layers in deep learning are usually represented by trainable variables, and each variable manages a tensor that stores its values. The other kind is the state ops, which take references to variables as inputs and can modify the underlying tensors. A subset of state ops called training ops is responsible for updating variables according to some optimization algorithm. We modify the variable op to check whether the variable is initialized. If not, a corresponding table with related information, including size, data type, and storage type, is first created in the parameter server. Then we check the synchronization condition before passing the variable to other operations. We also create a new type of operation for updating data in the parameter server and insert an updating operation after each training op node. Figure 4.2 shows the schematic flowchart of a session run.
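A minimal sketch of the last step, written at graph-construction level: ps_update stands in for the new update operation (which in our system is implemented inside the modified Tensorflow runtime), so the names here are only illustrative.

    import tensorflow as tf

    def append_ps_updates(train_op, variables, ps_update):
        """After the local training op has run, push every trainable variable
        to the parameter server via the (hypothetical) ps_update op."""
        with tf.control_dependencies([train_op]):
            push_ops = [ps_update(v) for v in variables]
        return tf.group(*push_ops)

    # Usage sketch: train = append_ps_updates(train, tf.trainable_variables(), ps_update)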

Figure 4.2: Parameter Server in TensorFlow (flowchart of a session run: variable ops check initialization, create a table if needed, synchronize, then state ops update the parameter server)

In addition to modifying Tensorflow, we also expose the necessary APIs through Tensorflow's Python bindings. At the beginning of the user program, the parameter server is initialized with a user-provided configuration, and it is started after the creation and initialization of variables are done. An iteration of deep learning training usually corresponds to a session run, and the user needs to call the clock function at the end of each iteration. Code 4.5 shows an example of a user program.

Code 4.5: An example of user program

    import tensorflow as tf

    tf.ps_initialize_from_file("config.pbtext")

    # ... graph definition ...

    sess = tf.Session()
    sess.run(init)
    tf.ps_start()

    for i in range(MAX_ITERATION):
        sess.run(train)
        tf.ps_clock()

    # ... evaluation ...

Chapter 5

Evaluation

In this chapter, we analyze the performance gains from different optimization techniques and evaluate the scalability of our distributed training system by training three deep learning models for image classification: GoogLeNet [4], ResNet [5], and Inception-v4 [6]. GoogLeNet, developed by Google, uses an architecture called Inception to increase the depth and width of the network and won the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

ResNet, developed by Microsoft, proposes residual learning to cope with the difficulty of training very deep neural networks and won the 2015 ILSVRC. Inception-v4 presents several new architectures and combines them with the idea of residual learning to train the model efficiently.

Two datasets are used in this evaluation. CIFAR10 [45], used for ResNet, contains 60,000 32x32 images in 10 classes, with 6,000 images per class; 50,000 images are used for training and the remaining 10,000 for validation. ILSVRC-2012, used for GoogLeNet and Inception-v4, consists of 1.2 million images categorized into 1,000 classes, and a subset of 50,000 images is selected as the validation set.

Our experiment environment is a commodity GPU cluster with four machines. Each machine is equipped with an Nvidia GeForce GTX 1080 graphics card, a 3.70 GHz four-core Intel CPU, and 64 GB of RAM. The machines are connected via 1 Gbps Ethernet.

Tensorflow 1.4.0 is used to develop our system, and vanilla Tensorflow running in the parameter server configuration serves as the baseline. We train the models with canonical SGD, a learning rate of 0.1, and 20,000 iterations on each node. The performance metric is the speedup of each optimization, measured as the iterations-per-second ratio of distributed training to single-machine training.

In the first set of experiments, we test three different placement strategies. Uniform split is a parameter-based placement method that evenly splits each layer across all of the servers. Round-robin, the default strategy in Tensorflow, places layers on each server in turn. The heuristic method described in Algorithm 2 is implemented as well. Figure 5.1 shows the speedup of these three placement strategies on the different models. We observe that each strategy performs differently on different models. The main reason is that each model has a different layer size distribution, as shown in Figure 5.2, and that the placement strategies produce different variations of the per-server size, as shown in Figure 5.3. Despite having no variation, uniform split may incur a substantial increase in network traffic if the model has many small layers. Taking ResNet as an example, the number of partitions produced by uniform split is four times the number of layers, and many partitions are even smaller than the message headers and metadata. The network utilization for useful content is therefore very low, and the performance drops significantly. However, if a model has many large layers, as Inception-v4 does, the standard deviations of the other strategies become large. In this case, uniform split achieves the best performance among the three strategies by eliminating the variation.

Layer-based methods avoid the problem mentioned above, but the load balance of round-robin suffers from a high coefficient of variation because it is sensitive to the order of variable creation, especially in GoogLeNet. The heuristic method effectively reduces the coefficient of variation and achieves the best and most stable performance on average, so we use it as the placement strategy in the following experiments.
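Algorithm 2 is not reproduced here; as a rough approximation, the sketch below uses greedy least-loaded assignment with layers sorted by size, a common load-balancing heuristic of this kind.

    import heapq

    # Greedy, size-sorted placement sketch (an approximation, not Algorithm 2 itself):
    # each layer, largest first, goes to the currently least-loaded server.
    def heuristic_placement(layer_sizes, num_servers):
        """layer_sizes maps layer name -> size in bytes; returns layer name -> server index."""
        heap = [(0, s) for s in range(num_servers)]   # (assigned bytes, server index)
        heapq.heapify(heap)
        placement = {}
        for name, size in sorted(layer_sizes.items(), key=lambda kv: -kv[1]):
            load, server = heapq.heappop(heap)
            placement[name] = server
            heapq.heappush(heap, (load + size, server))
        return placement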

Figure 5.1: Speedup of Placement Strategies

Figure 5.4 illustrates the speedup for various combinations of consistency control and compression schemes. We can derive several properties of the optimization techniques by analyzing these experimental results. First, comparing traditional backpropagation (dark blue bar) with synchronous distributed wait-free backpropagation (yellow bar), we find that transmission control can have a big impact: ResNet and Inception-v4 achieve 1.54x and 1.29x acceleration, respectively. Comparing the original distributed Tensorflow (red bar) with the yellow bar also shows that our architecture does not introduce much overhead.

Figure 5.2: Layer Size Distribution of Models (layer size vs. layer count for ResNet, GoogLeNet, and Inception-v4)

Figure 5.3: Variation of Placement Strategies (coefficient of variation in % and standard deviation in MB)

Compression by top-1% sparsification (orange bar) and relaxed synchronization by stale synchronous parallelism with a staleness threshold of 5 (blue bar) also improve the performance significantly. For ResNet, the consistency control achieves a higher speedup than compression. On the other hand, for Inception-v4, which is much bigger than ResNet, we gain more performance from compression. This again shows that models with different characteristics (such as size or layer distribution) may benefit from different optimization methods and hyper-parameter settings. Finally, combining all of the above optimizations (green bar) achieves a near-theoretical speedup of 3.91x for ResNet and 3.78x for Inception-v4.
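For reference, top-1% sparsification in its simplest form looks like the sketch below (without error feedback or other refinements the actual compression scheme may apply): only the largest-magnitude 1% of gradient entries are shipped as index/value pairs.

    import numpy as np

    def sparsify_top_k(grad, ratio=0.01):
        """Keep the top 1% of gradient entries by magnitude; returns indices, values, shape."""
        flat = grad.ravel()
        k = max(1, int(flat.size * ratio))
        idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
        return idx, flat[idx], grad.shape

    def densify(idx, values, shape):
        """Rebuild a dense gradient from the transmitted sparse representation."""
        flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
        flat[idx] = values
        return flat.reshape(shape)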

In Figure 5.5, we also show that our system is much closer to the theoretical performance than the original Tensorflow, achieving near-linear scalability.

Figure 5.4: Speedup for Different Optimizations

Next, we take ResNet as an example to study how the different optimization techniques change system resource usage. Analyzing the GPU and network utilization shown in Figure 5.6 gives us some clues for improving performance. The key to higher speedup is increasing GPU utilization and decreasing the time spent waiting for network transfers. Distributed wait-free backpropagation overlaps computation and transmission, raising both network and GPU utilization. Compression and the relaxed synchronization scheme both reduce the demand for network bandwidth, thereby decreasing network utilization and increasing GPU utilization. The reason the synchronization scheme can achieve better GPU utilization and higher network utilization at the same time is that it reduces performance fluctuations by mitigating the impact of stragglers while also reducing the amount of transmission. Combining these methods further reduces network usage and the impact of performance fluctuations, resulting in increased GPU utilization.
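To make the overlap concrete, the sketch below expresses the idea of distributed wait-free backpropagation at graph-construction level; ps_push is a hypothetical per-tensor transfer op, so this illustrates the scheduling idea rather than our actual implementation.

    import tensorflow as tf

    def wait_free_backprop(optimizer, loss, ps_push):
        """Send each layer's gradient as soon as it is produced, instead of
        waiting for the whole backward pass to finish."""
        apply_ops = []
        for grad, var in optimizer.compute_gradients(loss):
            if grad is None:
                continue
            with tf.control_dependencies([grad]):      # gradient ready -> send now
                send = ps_push(var.op.name, grad)
            with tf.control_dependencies([send]):      # apply after the send is issued
                apply_ops.append(optimizer.apply_gradients([(grad, var)]))
        return tf.group(*apply_ops)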
