
Figure 5.8: Training Time to Accuracy of ResNet (legend: Tensorflow, Single, Our System)

The time to converge to a specific accuracy is determined by both the processing speed (iterations per second) and the convergence rate per iteration. Figure 5.7 shows that staleness in asynchronous parallelism and information loss in compression do incur accuracy loss. That is, for the same number of iterations, Tensorflow achieves slightly higher accuracy than our method. However, the processing speed of our method is much higher than that of Tensorflow, so despite the slower convergence caused by staleness and compression, we can still reach the same accuracy within 43%

of the training time of the original Tensorflow. For example, Figure 5.8 shows that our method trains ResNet to a given accuracy much faster than Tensorflow. This indicates that the benefit of the network optimizations, which reduce the communication time, outweighs the extra computation required to remedy the quality loss introduced by these optimizations.
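To make this trade-off concrete, the following back-of-the-envelope sketch (in Python, with purely illustrative numbers rather than our measured values) expresses time-to-accuracy as the number of iterations needed to reach the target divided by the iteration throughput. A system that needs somewhat more iterations because of staleness and compression can still finish much earlier if its throughput is sufficiently higher.

    # Illustrative only: the iteration counts and throughputs below are made up
    # to show the shape of the trade-off, not measured from our experiments.
    def time_to_accuracy(iters_needed, iters_per_sec):
        """Wall-clock seconds to reach a target accuracy."""
        return iters_needed / iters_per_sec

    baseline  = time_to_accuracy(iters_needed=100_000, iters_per_sec=5.0)   # slower iterations, fewer needed
    optimized = time_to_accuracy(iters_needed=115_000, iters_per_sec=13.0)  # more iterations, each much cheaper

    print(f"optimized finishes in {optimized / baseline:.0%} of the baseline time")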

Chapter 6

Conclusion

In this paper, we summarize various communication optimizations for deep learning and propose a modularized parameter server that adopts these techniques. Instead of using ad-hoc approaches, our architecture uses overridable, decoupled components to implement the functions required by these optimizations. By doing so, we can apply most of the optimization techniques easily, examine different combinations, and analyze their performance.
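As a rough illustration of this design, the sketch below (in Python; the class and method names are illustrative and do not correspond to the actual interfaces of our system) models each optimization as an overridable hook. A gradient codec and a synchronization policy can each be replaced independently, so a top-k sparsifying codec or a bounded-staleness policy is a small subclass rather than a change to the training or transport code.

    import numpy as np

    class Codec:
        """Gradient compression hook; the default is a no-op (full precision)."""
        def encode(self, grad):
            return grad
        def decode(self, payload):
            return payload

    class SyncPolicy:
        """Synchronization hook; decides whether a worker may start its next iteration."""
        def may_proceed(self, worker_clock, global_min_clock):
            return True  # fully asynchronous by default

    class TopKCodec(Codec):
        """Override example: keep only the k largest-magnitude gradient entries."""
        def __init__(self, k):
            self.k = k
        def encode(self, grad):
            flat = grad.ravel()
            idx = np.argpartition(np.abs(flat), -self.k)[-self.k:]
            return grad.shape, idx, flat[idx]
        def decode(self, payload):
            shape, idx, values = payload
            flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
            flat[idx] = values
            return flat.reshape(shape)

    class BoundedStaleness(SyncPolicy):
        """Override example: stale synchronous parallel with a fixed staleness bound."""
        def __init__(self, bound):
            self.bound = bound
        def may_proceed(self, worker_clock, global_min_clock):
            return worker_clock - global_min_clock <= self.bound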

We evaluate the proposed architecture by implementing a distributed training system based on Tensorflow and training three popular deep learning models: ResNet, GoogLeNet, and Inception-v4. The experimental results demonstrate that our system achieves near-linear speedup in computation and converges to the target accuracy while reducing the communication overhead, halving the training time of ResNet compared with the original Tensorflow. The experiments also show that different models may benefit from different optimization techniques and hyper-parameter settings.

This is because model characteristics, such as size, computation cost, and layer distribution, affect the transmission behavior and hence the effectiveness of each optimization.

Looking into the performance gains, we can find some clues for further enhancements. Both compression and asynchronous communication reduce the amount of data in flight and thus ease network congestion, and the latter also mitigates performance fluctuations. The per-iteration quality loss can be reduced by better compression algorithms that decrease the information loss without sacrificing the compression ratio, and by better transmission control that uses the network bandwidth effectively to decrease staleness without violating the synchronization requirements. Based on these investigations, we plan to research and develop new optimization techniques to further improve our system.
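One well-known technique in this direction is error feedback (residual accumulation), used by methods such as 1-bit SGD and deep gradient compression: the gradient information dropped by a lossy codec in one iteration is kept and added back before compressing the next one, so it is delayed rather than discarded. The sketch below is a minimal illustration of the idea, not our implementation; it assumes any codec object that exposes encode and decode, as in the earlier component sketch.

    import numpy as np

    class ErrorFeedbackCompressor:
        """Wraps a lossy codec and re-injects its compression error on the next step."""
        def __init__(self, codec):
            self.codec = codec
            self.residual = None  # accumulated information lost so far

        def compress(self, grad):
            if self.residual is None:
                self.residual = np.zeros_like(grad)
            corrected = grad + self.residual                         # add back past error
            payload = self.codec.encode(corrected)                   # lossy encoding
            self.residual = corrected - self.codec.decode(payload)   # remember new error
            return payload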

