Conclusion and Future Work - Dynamic Binary Code Vectorization Based on Enhanced HQEMU

SIMD processing capability is increasingly important in microprocessor architectures. New microprocessors often ship with enhanced and more powerful SIMD capabilities. However, legacy application binaries may not benefit from this newly delivered performance. Dynamic binary translation or optimization can transform scalar loops in legacy application binaries into vector (SIMD) form. Prior work has succeeded in widening short SIMD code to longer SIMD. However, vectorizing scalar loops in legacy binaries has been much less successful. One of the main reasons is that information critical to vectorization is lost through register spilling. One way to recover this information is to promote spilled variables back to registers, which enables effective vectorization.
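To make the idea of promoting spilled variables concrete, the following is a minimal sketch over a toy three-address IR. The instruction format, the function name `promote_spill_slots`, and the `v_sp<offset>` naming of virtual registers are all assumptions made for illustration; they are not the IR or the API of HQEMU, QEMU, or LLVM. The rewrite is only valid when no other access in the loop can alias the promoted slot, which is exactly what a runtime check has to establish.

```python
from typing import List, Set, Tuple

# Toy IR, purely illustrative (hypothetical, not a real translator IR):
#   ("load",  dst, base, offset)   dst = mem[base + offset]
#   ("store", src, base, offset)   mem[base + offset] = src
#   ("add",   dst, a, b)           dst = a + b
#   ("mov",   dst, src)            dst = src
Instr = Tuple


def promote_spill_slots(body: List[Instr], spill_offsets: Set[int],
                        stack_reg: str = "sp") -> List[Instr]:
    """Replace loads/stores to known spill slots with moves on virtual registers.

    Assumes (without checking) that nothing else in `body` aliases the slots.
    """
    out = []
    for ins in body:
        op = ins[0]
        is_spill = (op in ("load", "store")
                    and ins[2] == stack_reg
                    and ins[3] in spill_offsets)
        if not is_spill:
            out.append(ins)            # ordinary instruction: keep unchanged
            continue
        vreg = f"v_sp{ins[3]}"         # one virtual register per spill slot
        if op == "load":
            out.append(("mov", ins[1], vreg))   # reload becomes a register move
        else:
            out.append(("mov", vreg, ins[1]))   # spill becomes a register move
    return out


# Loop body for `sum += a[i]` where `sum` was spilled to the slot at sp+8.
loop_body = [
    ("load", "r1", "sp", 8),      # reload the spilled accumulator
    ("load", "r2", "r0", 0),      # a[i]
    ("add", "r1", "r1", "r2"),
    ("store", "r1", "sp", 8),     # spill the accumulator back
]
promoted = promote_spill_slots(loop_body, {8})
for ins in promoted:
    print(ins)
```

After the rewrite, the only remaining memory access in the loop body is the array load, so the accumulator appears to a vectorizer as an ordinary register-carried reduction rather than a store and reload through the stack.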

This is challenging because the binary translator must deal with possible memory aliasing to the stack, such as accesses to local scalars and arrays. Although runtime checks can ensure the correctness of virtual register promotion, previously proposed checks often incur excessive overhead and diminish the benefit of vectorization.

This work proposes using virtual register promotion to recover critical information for scalar loop vectorization. It also introduces a cost-effective aliasing check for stack variables and spilled variables, so that runtime checking does not stand in the way of efficient SIMD execution. Our approach is evaluated on HQEMU, with our DBT translating ARMv7 application binaries to run on an ARMv8 Cortex-A53 processor. The set of benchmark kernels shows a 1.42x speedup over native runs of the ARMv7 binaries on ARMv8. We have successfully demonstrated that scalar legacy code can be effectively vectorized to exploit new SIMD capabilities available on the host machine.
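As an illustration of what a low-cost aliasing check can look like, the sketch below tests once, before loop entry, whether any byte range touched by the loop overlaps the stack region holding the promoted spill slots. This is one plausible form of such a check under stated assumptions (unit-stride accesses with a known trip count), not necessarily the check used in this work; `ByteRange`, `array_range`, and `promotion_is_safe` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ByteRange:
    start: int  # inclusive byte address
    end: int    # exclusive byte address

    def overlaps(self, other: "ByteRange") -> bool:
        # Two half-open intervals overlap iff each starts before the other ends.
        return self.start < other.end and other.start < self.end


def array_range(base: int, trip_count: int, elem_size: int) -> ByteRange:
    """Bytes touched by a unit-stride access a[0..trip_count) starting at base."""
    return ByteRange(base, base + trip_count * elem_size)


def promotion_is_safe(spill_region: ByteRange,
                      accesses: Iterable[ByteRange]) -> bool:
    """True if no loop access can touch the promoted spill slots."""
    return not any(spill_region.overlaps(a) for a in accesses)


sp = 0x7FFF0000                              # example stack pointer value
spill_region = ByteRange(sp + 8, sp + 16)    # slots being promoted

heap_array = array_range(0x10000, trip_count=100, elem_size=4)
local_array = array_range(sp, trip_count=4, elem_size=4)   # covers sp .. sp+16

print(promotion_is_safe(spill_region, [heap_array]))    # True
print(promotion_is_safe(spill_region, [local_array]))   # False
```

Because the comparison runs once per loop entry rather than once per memory access, its cost does not grow with the trip count. When the check fails, a translator would fall back to the unpromoted scalar code.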

Currently, we only consider innermost loops with a single exit point. In the future, we would like to extend the vectorization scope to outer loops and unrolled loops, where problems caused by register spilling are even more likely to prevent scalar loops in legacy application binaries from being vectorized via DBT.

